Chapter 17 Aggregation
Sometimes we want to combine multiple objects in a dataset into a single object. A common case is combining objects according to some categorical variable. When we aggregate, we usually want to know how many objects went into each aggregate, along with summary statistics such as:
- mean
- median
- mode
- standard deviation
- variance
- minimum value
- maximum value
- range
If you are not familiar with any of those summary statistics, look them up!
17.1 Dependencies and setup
We’ll be using the dplyr package, which comes loaded in the tidyverse collection of packages.
library(tidyverse)
17.2 Exercise
Given the following vector,
y <- runif(100, min=-1000, max=1000)
y
Write R code to calculate each of the following:
- mean
- median
- mode
- standard deviation
- variance
- minimum value
- maximum value
- range
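Two items on this list are easy to trip over in R, so here is a hedged hint rather than a full solution. Base R's mode() reports how an object is stored, not the most frequent value, and range() returns the minimum and maximum rather than their difference. The sketch below uses a small made-up vector x (not the exercise's y) and a hypothetical helper, stat_mode, which is an illustration and not a built-in function.

```r
# Small made-up vector with one repeated value (not the exercise's y).
x <- c(2, 4, 4, 7, 9)

# mode() describes the storage type; it is not the statistical mode.
storage <- mode(x)

# Hypothetical helper: return the most frequent value(s) using a frequency table.
# Assumption: values survive a round trip through character names, which is
# fine for this toy vector but may lose precision for arbitrary doubles.
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}
most_common <- stat_mode(x)

# range() returns c(min, max); diff() reduces that pair to a single width.
bounds <- range(x)
width <- diff(bounds)

print(storage)
print(most_common)
print(bounds)
print(width)
```

For a vector of continuous random draws like y, every value is likely to be unique, so a helper of this kind would return every value as a tie. That is worth keeping in mind when interpreting "mode" for this exercise.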
17.3 Using dplyr to aggregate objects in a dataset according to a category
In many cases, you will be able to use a combination of the group_by and summarise functions (from dplyr) to aggregate categories of data.
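As a minimal sketch of how the two functions interact, the example below runs the same summarise call with and without group_by. The scores tibble and its column names are made up for illustration.

```r
library(dplyr)

# Made-up data for illustration only.
scores <- tibble(
  team  = c("red", "red", "blue", "blue", "blue"),
  score = c(1, 3, 10, 20, 30)
)

# Without grouping, summarise() collapses the whole table to a single row.
overall <- scores %>%
  summarise(score_mean = mean(score), n = n())

# With group_by(), the same summarise() call yields one row per team.
by_team <- scores %>%
  group_by(team) %>%
  summarise(score_mean = mean(score), n = n())

print(overall)
print(by_team)
```

The grouping variable decides how many rows come out: one per distinct value of team here, and one per species in the Star Wars example.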
data <- starwars
data
## # A tibble: 87 × 14
## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 Luke Skywa… 172 77 blond fair blue 19 male mascu… Tatooi…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi…
## 3 R2-D2 96 32 <NA> white,… red 33 none mascu… Naboo
## 4 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooi…
## 5 Leia Organa 150 49 brown light brown 19 fema… femin… Aldera…
## 6 Owen Lars 178 120 brown,… light blue 52 male mascu… Tatooi…
## 7 Beru White… 165 75 brown light blue 47 fema… femin… Tatooi…
## 8 R5-D4 97 32 <NA> white,… red NA none mascu… Tatooi…
## 9 Biggs Dark… 183 84 black light brown 24 male mascu… Tatooi…
## 10 Obi-Wan Ke… 182 77 auburn… fair blue-g… 57 male mascu… Stewjon
## # … with 77 more rows, 4 more variables: species <chr>, films <list>,
## # vehicles <list>, starships <list>, and abbreviated variable names
## # ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
Let’s aggregate characters in the Star Wars dataset according to their species. In this particular example, we’ll focus on summarizing the mass values within each species.
# First, let's filter out all characters with NA for mass or species.
data$species <- as.factor(data$species)
data <- data %>%
  filter(
    !(is.na(species) | is.na(mass))
  )
# Group the remaining characters by species, then summarize mass within each group.
agg_data <- data %>%
  group_by(
    species
  ) %>%
  summarise(
    mass_median=median(mass),
    mass_mean=mean(mass),
    mass_sd=sd(mass),
    mass_min=min(mass),
    mass_max=max(mass),
    mass_total=sum(mass),
    num_characters=n()
  )
print(agg_data)
## # A tibble: 31 × 8
## species mass_median mass_mean mass_sd mass_min mass_max mass_total num_ch…¹
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 Aleena 15 15 NA 15 15 15 1
## 2 Besalisk 102 102 NA 102 102 102 1
## 3 Cerean 82 82 NA 82 82 82 1
## 4 Clawdite 55 55 NA 55 55 55 1
## 5 Droid 53.5 69.8 51.0 32 140 279 4
## 6 Dug 40 40 NA 40 40 40 1
## 7 Ewok 20 20 NA 20 20 20 1
## 8 Geonosian 80 80 NA 80 80 80 1
## 9 Gungan 74 74 11.3 66 82 148 2
## 10 Human 79 82.8 19.4 45 136 1821. 22
## # … with 21 more rows, and abbreviated variable name ¹num_characters
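The NA values in the mass_sd column appear for species with a single character: sd() computes the sample standard deviation, which is undefined for one observation. One possible follow-up, sketched here on a small made-up tibble (toy, not starwars), is to keep only groups with at least two members and sort the result by group size. The cutoff of two is an assumption chosen for illustration.

```r
library(dplyr)

# Toy data, made up for illustration (not taken from starwars).
toy <- tibble(
  species = c("A", "A", "A", "B", "C", "C"),
  mass    = c(10, 20, 30, 50, 5, 15)
)

agg_toy <- toy %>%
  group_by(species) %>%
  summarise(
    mass_mean = mean(mass),
    mass_sd = sd(mass),            # NA when a group has only one row
    num_characters = n()
  ) %>%
  filter(num_characters >= 2) %>%  # drop single-member groups
  arrange(desc(num_characters))    # largest groups first

print(agg_toy)
```

Whether single-member groups should be dropped or kept with an NA spread depends on the question being asked. Filtering simply makes that choice explicit.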
17.4 Exercises
- Try loading in some data that you have used in a previous homework assignment (or any dataset of your choice), and combine objects in your dataset using a combination of group_by and summarize.
- Identify any lines of code that you don’t understand. Use the documentation to figure out what those lines of code are doing.