Chapter 17 Aggregation

Sometimes we want to combine multiple objects in a dataset into a single object. For example, we often want to combine objects according to some categorical variable. We are often interested in the number of objects that go into each aggregate as well as various summary statistics; e.g.,

  • mean
  • median
  • mode
  • standard deviation
  • variance
  • minimum value
  • maximum value
  • range

If you are not familiar with any of these summary statistics, look them up!
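Two of these statistics are easy to get wrong in base R, so here is a small sketch on a made-up vector. The vector `x` and the helper `stat_mode` are hypothetical names used only for illustration; base R has no built-in function for the statistical mode, and this helper is one possible way to write it.

```r
# Made-up example vector; the value 4 appears most often.
x <- c(2, 4, 4, 5, 7, 9)

# mode() reports the storage type of an object, not its most frequent value.
mode(x)   # "numeric"

# Hypothetical helper: return the most frequent value
# (the first one encountered wins if there is a tie).
stat_mode <- function(v) {
  uniq <- unique(v)
  uniq[which.max(tabulate(match(v, uniq)))]
}
stat_mode(x)

# range() returns the minimum and maximum as a length-two vector;
# diff() turns that pair into a single number.
range(x)
diff(range(x))
```

Whether "range" means the pair of extremes or the distance between them depends on context, so it is worth stating which one you report.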

17.1 Dependencies and setup

We’ll be using the dplyr package, which is loaded as part of the tidyverse collection of packages.

library(tidyverse)

17.2 Exercise

Given the following vector, y:

y <- runif(100, min=-1000, max=1000)
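Because runif draws random numbers, y will differ each time the line above is run. If you want your answers to be reproducible, one option is to fix the random seed first. The sketch below assumes a seed of 42, which is an arbitrary choice and not part of the original exercise.

```r
# The seed value 42 is arbitrary; any fixed integer gives repeatable draws.
set.seed(42)
y <- runif(100, min = -1000, max = 1000)

# Sanity checks: 100 values, all inside the requested interval.
length(y)
all(y >= -1000 & y <= 1000)
```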

Write R code to calculate each of the following:

  • mean
  • median
  • mode
  • standard deviation
  • variance
  • minimum value
  • maximum value
  • range

17.3 Using dplyr to aggregate objects in a dataset according to a category

In many cases, you will be able to use a combination of the group_by and summarise functions (from dplyr) to aggregate categories of data.
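Before turning to a real dataset, the pattern can be sketched on a tiny made-up table. The tibble `toy` and its columns `group` and `value` are hypothetical and exist only for this illustration: group_by marks which rows belong together, and summarise collapses each group into a single row.

```r
library(dplyr)

# Hypothetical toy data: three rows in group "a" and two rows in group "b".
toy <- tibble(
  group = c("a", "a", "a", "b", "b"),
  value = c(1, 2, 3, 10, 20)
)

# One output row per group, with one column per summary statistic.
toy_summary <- toy %>%
  group_by(group) %>%
  summarise(
    value_mean = mean(value),
    n_rows = n()
  )
print(toy_summary)
```

The result has as many rows as there are distinct values of the grouping variable, here two.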

data <- starwars
data
## # A tibble: 87 × 14
##    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
##  1 Luke Skywa…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
##  2 C-3PO          167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
##  3 R2-D2           96    32 <NA>    white,… red        33   none  mascu… Naboo  
##  4 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
##  5 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
##  6 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
##  7 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
##  8 R5-D4           97    32 <NA>    white,… red        NA   none  mascu… Tatooi…
##  9 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
## 10 Obi-Wan Ke…    182    77 auburn… fair    blue-g…    57   male  mascu… Stewjon
## # … with 77 more rows, 4 more variables: species <chr>, films <list>,
## #   vehicles <list>, starships <list>, and abbreviated variable names
## #   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld

Let’s aggregate characters in the Star Wars dataset according to their species. In this particular example, we’ll focus on summarizing the mass values within each species.

# Convert species to a factor, then filter out all characters with NA for mass or species.
data$species <- as.factor(data$species)
data <- data %>%
  filter(
    !(is.na(species) | is.na(mass))
  )

agg_data <- data %>%
  group_by(
    species
  ) %>%
  summarise(
    mass_median=median(mass),
    mass_mean=mean(mass),
    mass_sd=sd(mass),
    mass_min=min(mass),
    mass_max=max(mass),
    mass_total=sum(mass),
    num_characters=n()
  )
print(agg_data)
## # A tibble: 31 × 8
##    species   mass_median mass_mean mass_sd mass_min mass_max mass_total num_ch…¹
##    <fct>           <dbl>     <dbl>   <dbl>    <dbl>    <dbl>      <dbl>    <int>
##  1 Aleena           15        15      NA         15       15        15         1
##  2 Besalisk        102       102      NA        102      102       102         1
##  3 Cerean           82        82      NA         82       82        82         1
##  4 Clawdite         55        55      NA         55       55        55         1
##  5 Droid            53.5      69.8    51.0       32      140       279         4
##  6 Dug              40        40      NA         40       40        40         1
##  7 Ewok             20        20      NA         20       20        20         1
##  8 Geonosian        80        80      NA         80       80        80         1
##  9 Gungan           74        74      11.3       66       82       148         2
## 10 Human            79        82.8    19.4       45      136      1821.       22
## # … with 21 more rows, and abbreviated variable name ¹​num_characters
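The NA values in mass_sd appear because sd returns NA when it is given a single value, and many species have only one character with a recorded mass. One way to handle this, sketched below, is to keep only the species with at least two characters. The threshold of 2 and the name `multi` are assumptions made for this illustration.

```r
library(dplyr)

# sd() of a single value is NA, which explains the NA entries above.
sd(5)

# Summarise as before, then keep only species with two or more characters.
multi <- starwars %>%
  filter(!is.na(species), !is.na(mass)) %>%
  group_by(species) %>%
  summarise(
    mass_mean = mean(mass),
    mass_sd = sd(mass),
    num_characters = n()
  ) %>%
  filter(num_characters >= 2)
print(multi)
```

Filtering after summarise works because num_characters is an ordinary column of the aggregated table.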

17.4 Exercises

  • Try loading some data that you have used in a previous homework assignment (or any dataset of your choice), and aggregate the objects in your dataset using a combination of group_by and summarise.
  • Identify any lines of code that you don’t understand. Use the documentation to figure out what those lines of code are doing.