Chapter 9 Using tidyr to create tidy data
You can download the R Markdown file used to generate this document (or webpage) here. I would encourage you to download that Rmd file, open it own your own computer in RStudio, and run the code chunks on your own (and modify things as you wish!).
Goals:
- Work through examples of using tidyr (and dplyr) functions to clean up messy (un-tidy) data
Sources:
9.1 Setup
You should have already installed the tidyverse
collection of packages, which includes the tidyr package.
But, if you haven’t, go ahead and run:
install.packages("tidyverse")
In this document, we’ll be using the tidyr and dplyr packages.
library(tidyr)
library(dplyr)
9.2 Overview
If you have not already, I suggest that you read the following article for an introduction to tidy data: https://tidyr.tidyverse.org/articles/tidy-data.html. Additionally, I suggest that you work through the dplyr R Markdown material before working through this document.
The tidyr package provides functions for making messy data “tidy”. For a complete list of the included functions, see https://tidyr.tidyverse.org/reference/index.html.
Understanding tidy data and how to reformat your data into a tidy format is very important, as many of the functions/packages that we will use this semester are written to work with tidy-format data. For example, ggplot (the R graphics package that we will be using) assumes tidy data. Taking the time now to get comfortable with what it means to have tidy data will save you substantial time later on.
9.3 What is tidy data?
You can represent the same underlying data many ways. For example, let’s say we have the following data about students’ grades:
<- read.csv("lecture-material/week-02/data/messy_grades_1.csv", na.strings="")
messy_grade_data1 messy_grade_data1
## name quiz1 quiz2 test1
## 1 Billy <NA> D C
## 2 Suzy F <NA> <NA>
## 3 Lionel B C B
## 4 Jenny A A B
Or, the same exact data could be structured as follows:
<- read.csv("lecture-material/week-02/data/messy_grades_2.csv", na.strings="")
messy_grade_data2 messy_grade_data2
## assessment Billy Suzy Lionel Jenny
## 1 quiz1 <NA> FALSE B A
## 2 quiz2 D NA C A
## 3 test1 C NA B B
Neither of these formats are particularly convenient to work with.
For example, in messy_grade_data2
format, it would be difficult to query the full set of student names in the dataset (because each student has their own column).
Likewise, in messy_grade_data1
, it would be diffult to query the set of assessments.
From https://tidyr.tidyverse.org/articles/tidy-data.html,
A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organised in two ways. Every value belongs to a variable and an observation. A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.
A tidy version of the grade data would look something like this:
<- messy_grade_data1 %>%
tidy_grade_data pivot_longer(c(quiz1,quiz2,test1), names_to="assessment", values_to="grade")
tidy_grade_data
## # A tibble: 12 × 3
## name assessment grade
## <chr> <chr> <chr>
## 1 Billy quiz1 <NA>
## 2 Billy quiz2 D
## 3 Billy test1 C
## 4 Suzy quiz1 F
## 5 Suzy quiz2 <NA>
## 6 Suzy test1 <NA>
## 7 Lionel quiz1 B
## 8 Lionel quiz2 C
## 9 Lionel test1 B
## 10 Jenny quiz1 A
## 11 Jenny quiz2 A
## 12 Jenny test1 B
This arrangement makes the values, variables, and observations clearer. What are the values, variables, and observations in the grade data? (the answer is here).
A dataset is tidy if:
- Each variable has its own column
- Each observation has its own row
- Each value has its own cell
In the grade data, we have the following variables: student name, assessment, and grade.
Each observation is a particular student’s grade on a particular assessment (e.g., Jenny’s grade on test1).
Each cell in the dataset specifies a value for a particular variable (e.g., in row 1 of tidy_grade_data
, Billy is the value of the name variable).
9.4 Making data tidy
Two common ways that data is messy:
- Column headers are values, not variable names
- Multiple variables are stored in one column
For more common ways that data are often messy, see: https://tidyr.tidyverse.org/articles/tidy-data.html#tidying
Below, are datasets that exemplify each problem.
9.4.1 Column headers are values, not variables
Observations over time are often recorded in a “wide” format. That is, each timepoint that an observation is made is entered as a column. This is convienient for data entry. For example, imagine you measure the temperature of a lake at several different depths over a 10 week period (one measurement for each depth every week). It’s convenient to enter that data in a “wide” format where each week is a column. However, this is not tidy, as each row is no longer a single observation.
The billboard
dataset that is loaded when you load the tidyr
package is exemplifies this problem.
billboard
## # A tibble: 317 × 79
## artist track date.ent…¹ wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8 wk9
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA NA
## 2 2Ge+h… The … 2000-09-02 91 87 92 NA NA NA NA NA NA
## 3 3 Doo… Kryp… 2000-04-08 81 70 68 67 66 57 54 53 51
## 4 3 Doo… Loser 2000-10-21 76 76 72 69 67 65 55 59 62
## 5 504 B… Wobb… 2000-04-15 57 34 25 17 17 31 36 49 53
## 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2 3
## 7 A*Tee… Danc… 2000-07-08 97 97 96 95 100 NA NA NA NA
## 8 Aaliy… I Do… 2000-01-29 84 62 51 41 38 35 35 38 38
## 9 Aaliy… Try … 2000-03-18 59 53 38 28 21 18 16 14 12
## 10 Adams… Open… 2000-08-26 76 76 74 69 68 67 61 58 57
## # … with 307 more rows, 67 more variables: wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## # wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## # wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## # wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## # wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## # wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
## # wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …
The billboard
dataset records the date a song enters the billboard top 100.
First on your own, use the pivot_longer
function to create a tidy version of the billboard
dataset.
The solution can be found here.
9.4.2 Multiple variables are stored in one column
Sometimes, the values in a column describe multiple variables. For example, in the following messy dataset, the month and day of each temperature recording are combined under a single column.
<- read.csv("lecture-material/week-02/data/multi_variables_single_col.csv")
messy_temperature messy_temperature
## date year temperature
## 1 jan1 2022 15
## 2 jan14 2022 130
## 3 feb1 2022 19
## 4 feb14 2022 85
## 5 mar1 2022 95
## 6 mar14 2022 130
## 7 apr1 2022 29
## 8 apr14 2022 42
## 9 may1 2022 138
## 10 may14 2022 14
## 11 jun1 2022 133
## 12 jun14 2022 53
## 13 jul1 2022 120
## 14 jul14 2022 108
## 15 aug1 2022 23
## 16 aug14 2022 148
## 17 sep1 2022 103
## 18 sep14 2022 113
## 19 nov1 2022 67
## 20 nov14 2022 1
## 21 oct1 2022 113
## 22 oct14 2022 56
## 23 dec1 2022 107
## 24 dec14 2022 30
The first three characters of the “date” column give the month, and the following numeric value gives the day of the month of the temperature reading.
We can use the separate
function to properly split the date column into multiple columns.
<- messy_temperature %>% separate(col=date, into=c("month","day"), sep=3)
tidy_temperature tidy_temperature
## month day year temperature
## 1 jan 1 2022 15
## 2 jan 14 2022 130
## 3 feb 1 2022 19
## 4 feb 14 2022 85
## 5 mar 1 2022 95
## 6 mar 14 2022 130
## 7 apr 1 2022 29
## 8 apr 14 2022 42
## 9 may 1 2022 138
## 10 may 14 2022 14
## 11 jun 1 2022 133
## 12 jun 14 2022 53
## 13 jul 1 2022 120
## 14 jul 14 2022 108
## 15 aug 1 2022 23
## 16 aug 14 2022 148
## 17 sep 1 2022 103
## 18 sep 14 2022 113
## 19 nov 1 2022 67
## 20 nov 14 2022 1
## 21 oct 1 2022 113
## 22 oct 14 2022 56
## 23 dec 1 2022 107
## 24 dec 14 2022 30
Run ?separate
to see how the separate function works.
Adding extra columns doesn’t “untidy” your data.
Sometimes it is useful to add additional aggregate labels to an observation (especially when working with dates).
For example, it might be useful to add a proper date column (with the format YEAR-MONTH-DAY) to our tidied temperature data.
I’ll leave that as an exercise for you to do on your own.
You’ll need to map each month string to its numeric representation (e.g. jan = 1, feb = 2).
Then you can use mutate
to create a new column from the relevant columns.