Chapter 8 Using dplyr to manipulate data
You can download the R Markdown file used to generate this document (or webpage) here. I would encourage you to download that Rmd file, open it own your own computer in RStudio, and run the code chunks on your own (and modify things as you wish!).
Goals:
- Learn about solving common data manipulation challenges with dplyr
- Learn to use the pipe operator (
%>%
)
Sources:
8.1 Setup
You should have already installed the tidyverse
collection of packages, which includes the dplyr package.
But, if you haven’t, go ahead and run:
install.packages("tidyverse")
With the tidyverse packages installed, you’ll need to load the dplyr library to use it.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
8.2 Overview
This document will cover key dplyr functions that solve the most common data manipulation challenges. The examples and explanations here non-exhaustive (dplyr has a lot of functionality and each of these functions are very flexible). Instead, I’d like you to start building a sense of the broad category of data manipulations that are possible with each of the dplyr functions below. As you encounter challenges with manipulating data, you should have a solid idea of which functions you will need to do what you want to do. Then, you can use those functions’ documentation to work out the specific details.
dplyr’s functions for basic data manipulation can be divided into three categories based on the component of the dataset that they work with (descriptions pulled from dplyr documentation):
- Rows
filter()
- choose rows in your data based on their column valuesslice()
- choose rows in your data based on their location (index) in the datasetarrange()
- change the ordering of rows in your data
- Columns
select()
- changes whether or not a column is includedrename()
- changes the names of columnsmutate()
- changes the values of columns and can create new columns that are functions of existing columnsrelocate()
- changes the order of columns
- Groups of rows
summarise()
- collapses a group of rows into a single row
These functions can be used in combination with group_by()
, which allows you to perform these functions on particular groups of rows, instead of the entire dataset.
Note that the above list of functions is non-exhaustive. I refer you to the dplyr documentation for an exhaustive list. Additionally, I’d encourage you to look over (and save) the cheat sheet linked at the end of this document.
8.3 The pipe operator
Most dplyr functions take a dataframe (or tibble) as their first argument.
dplyr provides the pipe operator (%>%
), which lets you conveniently compose multiple dplyr functions in a single line (instead of needing to save intermediate states of your dataset).
The pipe operator “pipes” the results of the left side of the pipe into the first argument of the right side of the pipe.
Here’s an example:
<- function(a, b) {
f return(a + b)
}<- function(a, b) {
g return(a * b)
}
<- 2
x <- 100
y
%>% f(y) # Is the same as f(x,y) x
## [1] 102
%>% f(y) %>% g(y) # Is the same as g(f(x,y), y) x
## [1] 10200
8.4 Data
We’ll use the starwars
dataset to demonstrate the basic data manipulation functions provided by dplyr.
You don’t need to download anything, the starwars
dataset is loaded when you load the dplyr package.
dim(starwars)
## [1] 87 14
This dataset has 87 rows and 14 columns where each row represents a character from Star Wars, and each column is a particular attribute.
starwars
## # A tibble: 87 × 14
## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 Luke Skywa… 172 77 blond fair blue 19 male mascu… Tatooi…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi…
## 3 R2-D2 96 32 <NA> white,… red 33 none mascu… Naboo
## 4 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooi…
## 5 Leia Organa 150 49 brown light brown 19 fema… femin… Aldera…
## 6 Owen Lars 178 120 brown,… light blue 52 male mascu… Tatooi…
## 7 Beru White… 165 75 brown light blue 47 fema… femin… Tatooi…
## 8 R5-D4 97 32 <NA> white,… red NA none mascu… Tatooi…
## 9 Biggs Dark… 183 84 black light brown 24 male mascu… Tatooi…
## 10 Obi-Wan Ke… 182 77 auburn… fair blue-g… 57 male mascu… Stewjon
## # … with 77 more rows, 4 more variables: species <chr>, films <list>,
## # vehicles <list>, starships <list>, and abbreviated variable names
## # ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
If you look closely at the above output, you’ll notice that the starwars
dataset is represented as a tibble.
A tibble is a more modern version of the dataframe type, and for the most part, anywhere you could use a dataframe, you can use a tibble.
You can convert a data frame to tibbles with the as_tible()
function.
Read more about tibbles here: https://tibble.tidyverse.org/.
8.5 Filter rows with filter()
filter()
allows you to choose rows in a data frame based on their column values.
For example, if we wanted to create a new data frame that only contained droids, we could:
%>% filter(species=="Droid") starwars
## # A tibble: 6 × 14
## name height mass hair_color skin_color eye_c…¹ birth…² sex gender homew…³
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi…
## 2 R2-D2 96 32 <NA> white, bl… red 33 none mascu… Naboo
## 3 R5-D4 97 32 <NA> white, red red NA none mascu… Tatooi…
## 4 IG-88 200 140 none metal red 15 none mascu… <NA>
## 5 R4-P17 96 NA none silver, r… red, b… NA none femin… <NA>
## 6 BB8 NA NA none none black NA none mascu… <NA>
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## # starships <list>, and abbreviated variable names ¹eye_color, ²birth_year,
## # ³homeworld
We can also filter rows based on more complex conditions:
%>% filter(species == "Droid" & height > 100) starwars
## # A tibble: 2 × 14
## name height mass hair_color skin_color eye_co…¹ birth…² sex gender homew…³
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi…
## 2 IG-88 200 140 none metal red 15 none mascu… <NA>
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## # starships <list>, and abbreviated variable names ¹eye_color, ²birth_year,
## # ³homeworld
What if we wanted all droids, Wookiees, and Hutts?
%>% filter(species == "Droid" | species == "Wookiee" | species == "Hutt") starwars
## # A tibble: 9 × 14
## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi…
## 2 R2-D2 96 32 <NA> white,… red 33 none mascu… Naboo
## 3 R5-D4 97 32 <NA> white,… red NA none mascu… Tatooi…
## 4 Chewbacca 228 112 brown unknown blue 200 male mascu… Kashyy…
## 5 Jabba Desil… 175 1358 <NA> green-… orange 600 herm… mascu… Nal Hu…
## 6 IG-88 200 140 none metal red 15 none mascu… <NA>
## 7 R4-P17 96 NA none silver… red, b… NA none femin… <NA>
## 8 Tarfful 234 136 brown brown blue NA male mascu… Kashyy…
## 9 BB8 NA NA none none black NA none mascu… <NA>
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## # starships <list>, and abbreviated variable names ¹hair_color, ²skin_color,
## # ³eye_color, ⁴birth_year, ⁵homeworld
Even more simply with the %in%
operator,
%>% filter(species %in% c("Droid", "Wookiee", "Hutt")) starwars
## # A tibble: 9 × 14
## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi…
## 2 R2-D2 96 32 <NA> white,… red 33 none mascu… Naboo
## 3 R5-D4 97 32 <NA> white,… red NA none mascu… Tatooi…
## 4 Chewbacca 228 112 brown unknown blue 200 male mascu… Kashyy…
## 5 Jabba Desil… 175 1358 <NA> green-… orange 600 herm… mascu… Nal Hu…
## 6 IG-88 200 140 none metal red 15 none mascu… <NA>
## 7 R4-P17 96 NA none silver… red, b… NA none femin… <NA>
## 8 Tarfful 234 136 brown brown blue NA male mascu… Kashyy…
## 9 BB8 NA NA none none black NA none mascu… <NA>
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## # starships <list>, and abbreviated variable names ¹hair_color, ²skin_color,
## # ³eye_color, ⁴birth_year, ⁵homeworld
Notice that the original starwars
dataset is never modified by any of the filter functions.
None of the dplyr functions modify the dataset passed to them with the data
argument.
Instead, they return a new data frame with the appropriate manipulations.
8.6 Arrange rows with arrange()
arrange()
reorders rows in the given data frame.
It takes a data frame and a set of columns names to reorder by.
If you give more than one column name, additional column names are used to break ties in the values of proceeding column names.
%>% arrange(mass, height) starwars
## # A tibble: 87 × 14
## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 Ratts Tyer… 79 15 none grey, … unknown NA male mascu… Aleen …
## 2 Yoda 66 17 white green brown 896 male mascu… <NA>
## 3 Wicket Sys… 88 20 brown brown brown 8 male mascu… Endor
## 4 R2-D2 96 32 <NA> white,… red 33 none mascu… Naboo
## 5 R5-D4 97 32 <NA> white,… red NA none mascu… Tatooi…
## 6 Sebulba 112 40 none grey, … orange NA male mascu… Malast…
## 7 Dud Bolt 94 45 none blue, … yellow NA male mascu… Vulpter
## 8 Padmé Amid… 165 45 brown light brown 46 fema… femin… Naboo
## 9 Sly Moore 178 48 none pale white NA <NA> <NA> Umbara
## 10 Wat Tambor 193 48 none green,… unknown NA male mascu… Skako
## # … with 77 more rows, 4 more variables: species <chr>, films <list>,
## # vehicles <list>, starships <list>, and abbreviated variable names
## # ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
You can use desc()
to order in descending order.
%>% arrange(desc(mass)) starwars
## # A tibble: 87 × 14
## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 Jabba Desi… 175 1358 <NA> green-… orange 600 herm… mascu… Nal Hu…
## 2 Grievous 216 159 none brown,… green,… NA male mascu… Kalee
## 3 IG-88 200 140 none metal red 15 none mascu… <NA>
## 4 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooi…
## 5 Tarfful 234 136 brown brown blue NA male mascu… Kashyy…
## 6 Owen Lars 178 120 brown,… light blue 52 male mascu… Tatooi…
## 7 Bossk 190 113 none green red 53 male mascu… Trando…
## 8 Chewbacca 228 112 brown unknown blue 200 male mascu… Kashyy…
## 9 Jek Tono P… 180 110 brown fair blue NA male mascu… Bestin…
## 10 Dexter Jet… 198 102 none brown yellow NA male mascu… Ojom
## # … with 77 more rows, 4 more variables: species <chr>, films <list>,
## # vehicles <list>, starships <list>, and abbreviated variable names
## # ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
8.7 Choose rows using their position with slice()
Use slice to choose rows by their position.
For example, to choose the character in the second row of the data frame:
%>% slice(2) starwars
## # A tibble: 1 × 14
## name height mass hair_color skin_color eye_co…¹ birth…² sex gender homew…³
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi…
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## # starships <list>, and abbreviated variable names ¹eye_color, ²birth_year,
## # ³homeworld
The first, tenth, and twentieth rows of the data frame,
%>% slice(c(1,10,20)) starwars
## # A tibble: 3 × 14
## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 Luke Skywal… 172 77 blond fair blue 19 male mascu… Tatooi…
## 2 Obi-Wan Ken… 182 77 auburn… fair blue-g… 57 male mascu… Stewjon
## 3 Palpatine 170 75 grey pale yellow 82 male mascu… Naboo
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## # starships <list>, and abbreviated variable names ¹hair_color, ²skin_color,
## # ³eye_color, ⁴birth_year, ⁵homeworld
The odd rows of the data frame,
%>% slice(seq(1,nrow(starwars), 2)) starwars
## # A tibble: 44 × 14
## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 Luke Skywa… 172 77 blond fair blue 19 male mascu… Tatooi…
## 2 R2-D2 96 32 <NA> white,… red 33 none mascu… Naboo
## 3 Leia Organa 150 49 brown light brown 19 fema… femin… Aldera…
## 4 Beru White… 165 75 brown light blue 47 fema… femin… Tatooi…
## 5 Biggs Dark… 183 84 black light brown 24 male mascu… Tatooi…
## 6 Anakin Sky… 188 84 blond fair blue 41.9 male mascu… Tatooi…
## 7 Chewbacca 228 112 brown unknown blue 200 male mascu… Kashyy…
## 8 Greedo 173 74 <NA> green black 44 male mascu… Rodia
## 9 Wedge Anti… 170 77 brown fair hazel 21 male mascu… Corell…
## 10 Yoda 66 17 white green brown 896 male mascu… <NA>
## # … with 34 more rows, 4 more variables: species <chr>, films <list>,
## # vehicles <list>, starships <list>, and abbreviated variable names
## # ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
Combine with arrange()
to quickly get the character with the largest height,
%>% arrange(desc(height)) %>% slice(1) starwars
## # A tibble: 1 × 14
## name height mass hair_c…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 Yarael Poof 264 NA none white yellow NA male mascu… Quermia
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## # starships <list>, and abbreviated variable names ¹hair_color, ²skin_color,
## # ³eye_color, ⁴birth_year, ⁵homeworld
There are a few convenient helper versions of the slice function:
slice_head
slice_tail
slice_sample
slice_min
slice_max
I’ll leave it to you to pull up the help page for those functions.
8.8 Select columns with select()
When you’re working with large datasets, sometimes you want to drop all of the columns you’re not using or interested in.
For example, if you only wanted the species column:
%>% select(species) starwars
## # A tibble: 87 × 1
## species
## <chr>
## 1 Human
## 2 Droid
## 3 Droid
## 4 Human
## 5 Human
## 6 Human
## 7 Human
## 8 Droid
## 9 Human
## 10 Human
## # … with 77 more rows
Or, if you wanted name, species, and homeworld:
%>% select(name, species, homeworld) starwars
## # A tibble: 87 × 3
## name species homeworld
## <chr> <chr> <chr>
## 1 Luke Skywalker Human Tatooine
## 2 C-3PO Droid Tatooine
## 3 R2-D2 Droid Naboo
## 4 Darth Vader Human Tatooine
## 5 Leia Organa Human Alderaan
## 6 Owen Lars Human Tatooine
## 7 Beru Whitesun lars Human Tatooine
## 8 R5-D4 Droid Tatooine
## 9 Biggs Darklighter Human Tatooine
## 10 Obi-Wan Kenobi Human Stewjon
## # … with 77 more rows
What if you wanted all columns with the letter m in the column name?
%>% select(contains("m")) starwars
## # A tibble: 87 × 4
## name mass homeworld films
## <chr> <dbl> <chr> <list>
## 1 Luke Skywalker 77 Tatooine <chr [5]>
## 2 C-3PO 75 Tatooine <chr [6]>
## 3 R2-D2 32 Naboo <chr [7]>
## 4 Darth Vader 136 Tatooine <chr [4]>
## 5 Leia Organa 49 Alderaan <chr [5]>
## 6 Owen Lars 120 Tatooine <chr [3]>
## 7 Beru Whitesun lars 75 Tatooine <chr [3]>
## 8 R5-D4 32 Tatooine <chr [1]>
## 9 Biggs Darklighter 84 Tatooine <chr [1]>
## 10 Obi-Wan Kenobi 77 Stewjon <chr [6]>
## # … with 77 more rows
See the tidyselect documentation for more selection helpers (like contains).
8.9 Rename columns with rename()
rename()
lets you rename columns.
For example, if you wanted to rename the column “homeworld” to “home_world”:
<- starwars %>% rename(home_world=homeworld)
renamed_starwars colnames(starwars)
## [1] "name" "height" "mass" "hair_color" "skin_color"
## [6] "eye_color" "birth_year" "sex" "gender" "homeworld"
## [11] "species" "films" "vehicles" "starships"
colnames(renamed_starwars)
## [1] "name" "height" "mass" "hair_color" "skin_color"
## [6] "eye_color" "birth_year" "sex" "gender" "home_world"
## [11] "species" "films" "vehicles" "starships"
8.10 Add new columns with mutate()
Often, it is useful to add new columns that are functions of existing columns.
For example, we could add a new column to each row that is the product of that row’s height and mass values.
<- starwars %>% mutate(heightXmass = height * mass)
starwars_new_col colnames(starwars_new_col)
## [1] "name" "height" "mass" "hair_color" "skin_color"
## [6] "eye_color" "birth_year" "sex" "gender" "homeworld"
## [11] "species" "films" "vehicles" "starships" "heightXmass"
8.11 Change column order with relocate()
Relocate let’s you move columns around in your data set.
For example, to move the name column after the height column, you can:
%>% relocate(name, .after=height) starwars
## # A tibble: 87 × 14
## height name mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
## <int> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 172 Luke Skywa… 77 blond fair blue 19 male mascu… Tatooi…
## 2 167 C-3PO 75 <NA> gold yellow 112 none mascu… Tatooi…
## 3 96 R2-D2 32 <NA> white,… red 33 none mascu… Naboo
## 4 202 Darth Vader 136 none white yellow 41.9 male mascu… Tatooi…
## 5 150 Leia Organa 49 brown light brown 19 fema… femin… Aldera…
## 6 178 Owen Lars 120 brown,… light blue 52 male mascu… Tatooi…
## 7 165 Beru White… 75 brown light blue 47 fema… femin… Tatooi…
## 8 97 R5-D4 32 <NA> white,… red NA none mascu… Tatooi…
## 9 183 Biggs Dark… 84 black light brown 24 male mascu… Tatooi…
## 10 182 Obi-Wan Ke… 77 auburn… fair blue-g… 57 male mascu… Stewjon
## # … with 77 more rows, 4 more variables: species <chr>, films <list>,
## # vehicles <list>, starships <list>, and abbreviated variable names
## # ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
8.12 Summarize groups of rows using summarise()
summarise()
collapses a dataset into a single row.
%>% summarise(mass = mean(mass, na.rm=TRUE)) starwars
## # A tibble: 1 × 1
## mass
## <dbl>
## 1 97.3
summarise()
is particularly useful in combination with the group_by
function.
For example, to get the average mass of characters in each species,
%>% group_by(species) %>% summarise(mass=mean(mass, na.rm=TRUE)) starwars
## # A tibble: 38 × 2
## species mass
## <chr> <dbl>
## 1 Aleena 15
## 2 Besalisk 102
## 3 Cerean 82
## 4 Chagrian NaN
## 5 Clawdite 55
## 6 Droid 69.8
## 7 Dug 40
## 8 Ewok 20
## 9 Geonosian 80
## 10 Gungan 74
## # … with 28 more rows
Or, if we wanted to count the number of characters of each species,
%>% group_by(species) %>% summarise(count=n()) starwars
## # A tibble: 38 × 2
## species count
## <chr> <int>
## 1 Aleena 1
## 2 Besalisk 1
## 3 Cerean 1
## 4 Chagrian 1
## 5 Clawdite 1
## 6 Droid 6
## 7 Dug 1
## 8 Ewok 1
## 9 Geonosian 1
## 10 Gungan 3
## # … with 28 more rows
8.13 dplyr cheat sheet
Download pdf here: https://github.com/rstudio/cheatsheets/blob/main/data-transformation.pdf