Chapter 8 Using dplyr to manipulate data

You can download the R Markdown file used to generate this document (or webpage) here. I would encourage you to download that Rmd file, open it own your own computer in RStudio, and run the code chunks on your own (and modify things as you wish!).

Goals:

  • Learn about solving common data manipulation challenges with dplyr
  • Learn to use the pipe operator (%>%)

Sources:

8.1 Setup

You should have already installed the tidyverse collection of packages, which includes the dplyr package. But, if you haven’t, go ahead and run:

install.packages("tidyverse")

With the tidyverse packages installed, you’ll need to load the dplyr library to use it.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

8.2 Overview

This document will cover key dplyr functions that solve the most common data manipulation challenges. The examples and explanations here non-exhaustive (dplyr has a lot of functionality and each of these functions are very flexible). Instead, I’d like you to start building a sense of the broad category of data manipulations that are possible with each of the dplyr functions below. As you encounter challenges with manipulating data, you should have a solid idea of which functions you will need to do what you want to do. Then, you can use those functions’ documentation to work out the specific details.

dplyr’s functions for basic data manipulation can be divided into three categories based on the component of the dataset that they work with (descriptions pulled from dplyr documentation):

  • Rows
    • filter() - choose rows in your data based on their column values
    • slice() - choose rows in your data based on their location (index) in the dataset
    • arrange() - change the ordering of rows in your data
  • Columns
    • select() - changes whether or not a column is included
    • rename() - changes the names of columns
    • mutate() - changes the values of columns and can create new columns that are functions of existing columns
    • relocate() - changes the order of columns
  • Groups of rows
    • summarise() - collapses a group of rows into a single row

These functions can be used in combination with group_by(), which allows you to perform these functions on particular groups of rows, instead of the entire dataset.

Note that the above list of functions is non-exhaustive. I refer you to the dplyr documentation for an exhaustive list. Additionally, I’d encourage you to look over (and save) the cheat sheet linked at the end of this document.

8.3 The pipe operator

Most dplyr functions take a dataframe (or tibble) as their first argument. dplyr provides the pipe operator (%>%), which lets you conveniently compose multiple dplyr functions in a single line (instead of needing to save intermediate states of your dataset). The pipe operator “pipes” the results of the left side of the pipe into the first argument of the right side of the pipe.

Here’s an example:

f <- function(a, b) {
  return(a + b)
}
g <- function(a, b) {
  return(a * b)
}

x <- 2
y <- 100

x %>% f(y) # Is the same as f(x,y)
## [1] 102
x %>% f(y) %>% g(y) # Is the same as g(f(x,y), y)
## [1] 10200

8.4 Data

We’ll use the starwars dataset to demonstrate the basic data manipulation functions provided by dplyr. You don’t need to download anything, the starwars dataset is loaded when you load the dplyr package.

dim(starwars)
## [1] 87 14

This dataset has 87 rows and 14 columns where each row represents a character from Star Wars, and each column is a particular attribute.

starwars
## # A tibble: 87 × 14
##    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
##  1 Luke Skywa…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
##  2 C-3PO          167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
##  3 R2-D2           96    32 <NA>    white,… red        33   none  mascu… Naboo  
##  4 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
##  5 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
##  6 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
##  7 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
##  8 R5-D4           97    32 <NA>    white,… red        NA   none  mascu… Tatooi…
##  9 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
## 10 Obi-Wan Ke…    182    77 auburn… fair    blue-g…    57   male  mascu… Stewjon
## # … with 77 more rows, 4 more variables: species <chr>, films <list>,
## #   vehicles <list>, starships <list>, and abbreviated variable names
## #   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld

If you look closely at the above output, you’ll notice that the starwars dataset is represented as a tibble. A tibble is a more modern version of the dataframe type, and for the most part, anywhere you could use a dataframe, you can use a tibble. You can convert a data frame to tibbles with the as_tible() function. Read more about tibbles here: https://tibble.tidyverse.org/.

8.5 Filter rows with filter()

filter() allows you to choose rows in a data frame based on their column values.

For example, if we wanted to create a new data frame that only contained droids, we could:

starwars %>% filter(species=="Droid")
## # A tibble: 6 × 14
##   name   height  mass hair_color skin_color eye_c…¹ birth…² sex   gender homew…³
##   <chr>   <int> <dbl> <chr>      <chr>      <chr>     <dbl> <chr> <chr>  <chr>  
## 1 C-3PO     167    75 <NA>       gold       yellow      112 none  mascu… Tatooi…
## 2 R2-D2      96    32 <NA>       white, bl… red          33 none  mascu… Naboo  
## 3 R5-D4      97    32 <NA>       white, red red          NA none  mascu… Tatooi…
## 4 IG-88     200   140 none       metal      red          15 none  mascu… <NA>   
## 5 R4-P17     96    NA none       silver, r… red, b…      NA none  femin… <NA>   
## 6 BB8        NA    NA none       none       black        NA none  mascu… <NA>   
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>, and abbreviated variable names ¹​eye_color, ²​birth_year,
## #   ³​homeworld

We can also filter rows based on more complex conditions:

starwars %>% filter(species == "Droid" & height > 100)
## # A tibble: 2 × 14
##   name  height  mass hair_color skin_color eye_co…¹ birth…² sex   gender homew…³
##   <chr>  <int> <dbl> <chr>      <chr>      <chr>      <dbl> <chr> <chr>  <chr>  
## 1 C-3PO    167    75 <NA>       gold       yellow       112 none  mascu… Tatooi…
## 2 IG-88    200   140 none       metal      red           15 none  mascu… <NA>   
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>, and abbreviated variable names ¹​eye_color, ²​birth_year,
## #   ³​homeworld

What if we wanted all droids, Wookiees, and Hutts?

starwars %>% filter(species == "Droid" | species == "Wookiee" | species == "Hutt")
## # A tibble: 9 × 14
##   name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##   <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
## 1 C-3PO           167    75 <NA>    gold    yellow      112 none  mascu… Tatooi…
## 2 R2-D2            96    32 <NA>    white,… red          33 none  mascu… Naboo  
## 3 R5-D4            97    32 <NA>    white,… red          NA none  mascu… Tatooi…
## 4 Chewbacca       228   112 brown   unknown blue        200 male  mascu… Kashyy…
## 5 Jabba Desil…    175  1358 <NA>    green-… orange      600 herm… mascu… Nal Hu…
## 6 IG-88           200   140 none    metal   red          15 none  mascu… <NA>   
## 7 R4-P17           96    NA none    silver… red, b…      NA none  femin… <NA>   
## 8 Tarfful         234   136 brown   brown   blue         NA male  mascu… Kashyy…
## 9 BB8              NA    NA none    none    black        NA none  mascu… <NA>   
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color,
## #   ³​eye_color, ⁴​birth_year, ⁵​homeworld

Even more simply with the %in% operator,

starwars %>% filter(species %in% c("Droid", "Wookiee", "Hutt"))
## # A tibble: 9 × 14
##   name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##   <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
## 1 C-3PO           167    75 <NA>    gold    yellow      112 none  mascu… Tatooi…
## 2 R2-D2            96    32 <NA>    white,… red          33 none  mascu… Naboo  
## 3 R5-D4            97    32 <NA>    white,… red          NA none  mascu… Tatooi…
## 4 Chewbacca       228   112 brown   unknown blue        200 male  mascu… Kashyy…
## 5 Jabba Desil…    175  1358 <NA>    green-… orange      600 herm… mascu… Nal Hu…
## 6 IG-88           200   140 none    metal   red          15 none  mascu… <NA>   
## 7 R4-P17           96    NA none    silver… red, b…      NA none  femin… <NA>   
## 8 Tarfful         234   136 brown   brown   blue         NA male  mascu… Kashyy…
## 9 BB8              NA    NA none    none    black        NA none  mascu… <NA>   
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color,
## #   ³​eye_color, ⁴​birth_year, ⁵​homeworld

Notice that the original starwars dataset is never modified by any of the filter functions. None of the dplyr functions modify the dataset passed to them with the data argument. Instead, they return a new data frame with the appropriate manipulations.

8.6 Arrange rows with arrange()

arrange() reorders rows in the given data frame. It takes a data frame and a set of columns names to reorder by. If you give more than one column name, additional column names are used to break ties in the values of proceeding column names.

starwars %>% arrange(mass, height)
## # A tibble: 87 × 14
##    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
##  1 Ratts Tyer…     79    15 none    grey, … unknown      NA male  mascu… Aleen …
##  2 Yoda            66    17 white   green   brown       896 male  mascu… <NA>   
##  3 Wicket Sys…     88    20 brown   brown   brown         8 male  mascu… Endor  
##  4 R2-D2           96    32 <NA>    white,… red          33 none  mascu… Naboo  
##  5 R5-D4           97    32 <NA>    white,… red          NA none  mascu… Tatooi…
##  6 Sebulba        112    40 none    grey, … orange       NA male  mascu… Malast…
##  7 Dud Bolt        94    45 none    blue, … yellow       NA male  mascu… Vulpter
##  8 Padmé Amid…    165    45 brown   light   brown        46 fema… femin… Naboo  
##  9 Sly Moore      178    48 none    pale    white        NA <NA>  <NA>   Umbara 
## 10 Wat Tambor     193    48 none    green,… unknown      NA male  mascu… Skako  
## # … with 77 more rows, 4 more variables: species <chr>, films <list>,
## #   vehicles <list>, starships <list>, and abbreviated variable names
## #   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld

You can use desc() to order in descending order.

starwars %>% arrange(desc(mass))
## # A tibble: 87 × 14
##    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
##  1 Jabba Desi…    175  1358 <NA>    green-… orange    600   herm… mascu… Nal Hu…
##  2 Grievous       216   159 none    brown,… green,…    NA   male  mascu… Kalee  
##  3 IG-88          200   140 none    metal   red        15   none  mascu… <NA>   
##  4 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
##  5 Tarfful        234   136 brown   brown   blue       NA   male  mascu… Kashyy…
##  6 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
##  7 Bossk          190   113 none    green   red        53   male  mascu… Trando…
##  8 Chewbacca      228   112 brown   unknown blue      200   male  mascu… Kashyy…
##  9 Jek Tono P…    180   110 brown   fair    blue       NA   male  mascu… Bestin…
## 10 Dexter Jet…    198   102 none    brown   yellow     NA   male  mascu… Ojom   
## # … with 77 more rows, 4 more variables: species <chr>, films <list>,
## #   vehicles <list>, starships <list>, and abbreviated variable names
## #   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld

8.7 Choose rows using their position with slice()

Use slice to choose rows by their position.

For example, to choose the character in the second row of the data frame:

starwars %>% slice(2)
## # A tibble: 1 × 14
##   name  height  mass hair_color skin_color eye_co…¹ birth…² sex   gender homew…³
##   <chr>  <int> <dbl> <chr>      <chr>      <chr>      <dbl> <chr> <chr>  <chr>  
## 1 C-3PO    167    75 <NA>       gold       yellow       112 none  mascu… Tatooi…
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>, and abbreviated variable names ¹​eye_color, ²​birth_year,
## #   ³​homeworld

The first, tenth, and twentieth rows of the data frame,

starwars %>% slice(c(1,10,20))
## # A tibble: 3 × 14
##   name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##   <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
## 1 Luke Skywal…    172    77 blond   fair    blue         19 male  mascu… Tatooi…
## 2 Obi-Wan Ken…    182    77 auburn… fair    blue-g…      57 male  mascu… Stewjon
## 3 Palpatine       170    75 grey    pale    yellow       82 male  mascu… Naboo  
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color,
## #   ³​eye_color, ⁴​birth_year, ⁵​homeworld

The odd rows of the data frame,

starwars %>% slice(seq(1,nrow(starwars), 2))
## # A tibble: 44 × 14
##    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
##  1 Luke Skywa…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
##  2 R2-D2           96    32 <NA>    white,… red        33   none  mascu… Naboo  
##  3 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
##  4 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
##  5 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
##  6 Anakin Sky…    188    84 blond   fair    blue       41.9 male  mascu… Tatooi…
##  7 Chewbacca      228   112 brown   unknown blue      200   male  mascu… Kashyy…
##  8 Greedo         173    74 <NA>    green   black      44   male  mascu… Rodia  
##  9 Wedge Anti…    170    77 brown   fair    hazel      21   male  mascu… Corell…
## 10 Yoda            66    17 white   green   brown     896   male  mascu… <NA>   
## # … with 34 more rows, 4 more variables: species <chr>, films <list>,
## #   vehicles <list>, starships <list>, and abbreviated variable names
## #   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld

Combine with arrange() to quickly get the character with the largest height,

starwars %>% arrange(desc(height)) %>% slice(1)
## # A tibble: 1 × 14
##   name        height  mass hair_c…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##   <chr>        <int> <dbl> <chr>    <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
## 1 Yarael Poof    264    NA none     white   yellow       NA male  mascu… Quermia
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color,
## #   ³​eye_color, ⁴​birth_year, ⁵​homeworld

There are a few convenient helper versions of the slice function:

  • slice_head
  • slice_tail
  • slice_sample
  • slice_min
  • slice_max

I’ll leave it to you to pull up the help page for those functions.

8.8 Select columns with select()

When you’re working with large datasets, sometimes you want to drop all of the columns you’re not using or interested in.

For example, if you only wanted the species column:

starwars %>% select(species)
## # A tibble: 87 × 1
##    species
##    <chr>  
##  1 Human  
##  2 Droid  
##  3 Droid  
##  4 Human  
##  5 Human  
##  6 Human  
##  7 Human  
##  8 Droid  
##  9 Human  
## 10 Human  
## # … with 77 more rows

Or, if you wanted name, species, and homeworld:

starwars %>% select(name, species, homeworld)
## # A tibble: 87 × 3
##    name               species homeworld
##    <chr>              <chr>   <chr>    
##  1 Luke Skywalker     Human   Tatooine 
##  2 C-3PO              Droid   Tatooine 
##  3 R2-D2              Droid   Naboo    
##  4 Darth Vader        Human   Tatooine 
##  5 Leia Organa        Human   Alderaan 
##  6 Owen Lars          Human   Tatooine 
##  7 Beru Whitesun lars Human   Tatooine 
##  8 R5-D4              Droid   Tatooine 
##  9 Biggs Darklighter  Human   Tatooine 
## 10 Obi-Wan Kenobi     Human   Stewjon  
## # … with 77 more rows

What if you wanted all columns with the letter m in the column name?

starwars %>% select(contains("m"))
## # A tibble: 87 × 4
##    name                mass homeworld films    
##    <chr>              <dbl> <chr>     <list>   
##  1 Luke Skywalker        77 Tatooine  <chr [5]>
##  2 C-3PO                 75 Tatooine  <chr [6]>
##  3 R2-D2                 32 Naboo     <chr [7]>
##  4 Darth Vader          136 Tatooine  <chr [4]>
##  5 Leia Organa           49 Alderaan  <chr [5]>
##  6 Owen Lars            120 Tatooine  <chr [3]>
##  7 Beru Whitesun lars    75 Tatooine  <chr [3]>
##  8 R5-D4                 32 Tatooine  <chr [1]>
##  9 Biggs Darklighter     84 Tatooine  <chr [1]>
## 10 Obi-Wan Kenobi        77 Stewjon   <chr [6]>
## # … with 77 more rows

See the tidyselect documentation for more selection helpers (like contains).

8.9 Rename columns with rename()

rename() lets you rename columns.

For example, if you wanted to rename the column “homeworld” to “home_world”:

renamed_starwars <- starwars %>% rename(home_world=homeworld)
colnames(starwars)
##  [1] "name"       "height"     "mass"       "hair_color" "skin_color"
##  [6] "eye_color"  "birth_year" "sex"        "gender"     "homeworld" 
## [11] "species"    "films"      "vehicles"   "starships"
colnames(renamed_starwars)
##  [1] "name"       "height"     "mass"       "hair_color" "skin_color"
##  [6] "eye_color"  "birth_year" "sex"        "gender"     "home_world"
## [11] "species"    "films"      "vehicles"   "starships"

8.10 Add new columns with mutate()

Often, it is useful to add new columns that are functions of existing columns.

For example, we could add a new column to each row that is the product of that row’s height and mass values.

starwars_new_col <- starwars %>% mutate(heightXmass = height * mass)
colnames(starwars_new_col)
##  [1] "name"        "height"      "mass"        "hair_color"  "skin_color" 
##  [6] "eye_color"   "birth_year"  "sex"         "gender"      "homeworld"  
## [11] "species"     "films"       "vehicles"    "starships"   "heightXmass"

8.11 Change column order with relocate()

Relocate let’s you move columns around in your data set.

For example, to move the name column after the height column, you can:

starwars %>% relocate(name, .after=height)
## # A tibble: 87 × 14
##    height name         mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##     <int> <chr>       <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
##  1    172 Luke Skywa…    77 blond   fair    blue       19   male  mascu… Tatooi…
##  2    167 C-3PO          75 <NA>    gold    yellow    112   none  mascu… Tatooi…
##  3     96 R2-D2          32 <NA>    white,… red        33   none  mascu… Naboo  
##  4    202 Darth Vader   136 none    white   yellow     41.9 male  mascu… Tatooi…
##  5    150 Leia Organa    49 brown   light   brown      19   fema… femin… Aldera…
##  6    178 Owen Lars     120 brown,… light   blue       52   male  mascu… Tatooi…
##  7    165 Beru White…    75 brown   light   blue       47   fema… femin… Tatooi…
##  8     97 R5-D4          32 <NA>    white,… red        NA   none  mascu… Tatooi…
##  9    183 Biggs Dark…    84 black   light   brown      24   male  mascu… Tatooi…
## 10    182 Obi-Wan Ke…    77 auburn… fair    blue-g…    57   male  mascu… Stewjon
## # … with 77 more rows, 4 more variables: species <chr>, films <list>,
## #   vehicles <list>, starships <list>, and abbreviated variable names
## #   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld

8.12 Summarize groups of rows using summarise()

summarise() collapses a dataset into a single row.

starwars %>% summarise(mass = mean(mass, na.rm=TRUE))
## # A tibble: 1 × 1
##    mass
##   <dbl>
## 1  97.3

summarise() is particularly useful in combination with the group_by function.

For example, to get the average mass of characters in each species,

starwars %>% group_by(species) %>% summarise(mass=mean(mass, na.rm=TRUE))
## # A tibble: 38 × 2
##    species    mass
##    <chr>     <dbl>
##  1 Aleena     15  
##  2 Besalisk  102  
##  3 Cerean     82  
##  4 Chagrian  NaN  
##  5 Clawdite   55  
##  6 Droid      69.8
##  7 Dug        40  
##  8 Ewok       20  
##  9 Geonosian  80  
## 10 Gungan     74  
## # … with 28 more rows

Or, if we wanted to count the number of characters of each species,

starwars %>% group_by(species) %>% summarise(count=n())
## # A tibble: 38 × 2
##    species   count
##    <chr>     <int>
##  1 Aleena        1
##  2 Besalisk      1
##  3 Cerean        1
##  4 Chagrian      1
##  5 Clawdite      1
##  6 Droid         6
##  7 Dug           1
##  8 Ewok          1
##  9 Geonosian     1
## 10 Gungan        3
## # … with 28 more rows