Chapter 9 Using tidyr to create tidy data

You can download the R Markdown file used to generate this document (or webpage) here. I would encourage you to download that Rmd file, open it own your own computer in RStudio, and run the code chunks on your own (and modify things as you wish!).

Goals:

  • Work through examples of using tidyr (and dplyr) functions to clean up messy (un-tidy) data

Sources:

9.1 Setup

You should have already installed the tidyverse collection of packages, which includes the tidyr package. But, if you haven’t, go ahead and run:

install.packages("tidyverse")

In this document, we’ll be using the tidyr and dplyr packages.

library(tidyr)
library(dplyr)

9.2 Overview

If you have not already, I suggest that you read the following article for an introduction to tidy data: https://tidyr.tidyverse.org/articles/tidy-data.html. Additionally, I suggest that you work through the dplyr R Markdown material before working through this document.

The tidyr package provides functions for making messy data “tidy”. For a complete list of the included functions, see https://tidyr.tidyverse.org/reference/index.html.

Understanding tidy data and how to reformat your data into a tidy format is very important, as many of the functions/packages that we will use this semester are written to work with tidy-format data. For example, ggplot (the R graphics package that we will be using) assumes tidy data. Taking the time now to get comfortable with what it means to have tidy data will save you substantial time later on.

9.3 What is tidy data?

You can represent the same underlying data many ways. For example, let’s say we have the following data about students’ grades:

messy_grade_data1 <- read.csv("lecture-material/week-02/data/messy_grades_1.csv", na.strings="")
messy_grade_data1
##     name quiz1 quiz2 test1
## 1  Billy  <NA>     D     C
## 2   Suzy     F  <NA>  <NA>
## 3 Lionel     B     C     B
## 4  Jenny     A     A     B

Or, the same exact data could be structured as follows:

messy_grade_data2 <- read.csv("lecture-material/week-02/data/messy_grades_2.csv", na.strings="")
messy_grade_data2
##   assessment Billy  Suzy Lionel Jenny
## 1      quiz1  <NA> FALSE      B     A
## 2      quiz2     D    NA      C     A
## 3      test1     C    NA      B     B

Neither of these formats are particularly convenient to work with. For example, in messy_grade_data2 format, it would be difficult to query the full set of student names in the dataset (because each student has their own column). Likewise, in messy_grade_data1, it would be diffult to query the set of assessments.

From https://tidyr.tidyverse.org/articles/tidy-data.html,

A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organised in two ways. Every value belongs to a variable and an observation. A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.

A tidy version of the grade data would look something like this:

tidy_grade_data <- messy_grade_data1 %>%
  pivot_longer(c(quiz1,quiz2,test1), names_to="assessment", values_to="grade")
tidy_grade_data
## # A tibble: 12 × 3
##    name   assessment grade
##    <chr>  <chr>      <chr>
##  1 Billy  quiz1      <NA> 
##  2 Billy  quiz2      D    
##  3 Billy  test1      C    
##  4 Suzy   quiz1      F    
##  5 Suzy   quiz2      <NA> 
##  6 Suzy   test1      <NA> 
##  7 Lionel quiz1      B    
##  8 Lionel quiz2      C    
##  9 Lionel test1      B    
## 10 Jenny  quiz1      A    
## 11 Jenny  quiz2      A    
## 12 Jenny  test1      B

This arrangement makes the values, variables, and observations clearer. What are the values, variables, and observations in the grade data? (the answer is here).

A dataset is tidy if:

  1. Each variable has its own column
  2. Each observation has its own row
  3. Each value has its own cell

In the grade data, we have the following variables: student name, assessment, and grade. Each observation is a particular student’s grade on a particular assessment (e.g., Jenny’s grade on test1). Each cell in the dataset specifies a value for a particular variable (e.g., in row 1 of tidy_grade_data, Billy is the value of the name variable).

9.4 Making data tidy

Two common ways that data is messy:

  1. Column headers are values, not variable names
  2. Multiple variables are stored in one column

For more common ways that data are often messy, see: https://tidyr.tidyverse.org/articles/tidy-data.html#tidying

Below, are datasets that exemplify each problem.

9.4.1 Column headers are values, not variables

Observations over time are often recorded in a “wide” format. That is, each timepoint that an observation is made is entered as a column. This is convienient for data entry. For example, imagine you measure the temperature of a lake at several different depths over a 10 week period (one measurement for each depth every week). It’s convenient to enter that data in a “wide” format where each week is a column. However, this is not tidy, as each row is no longer a single observation.

The billboard dataset that is loaded when you load the tidyr package is exemplifies this problem.

billboard
## # A tibble: 317 × 79
##    artist track date.ent…¹   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8   wk9
##    <chr>  <chr> <date>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 2 Pac  Baby… 2000-02-26    87    82    72    77    87    94    99    NA    NA
##  2 2Ge+h… The … 2000-09-02    91    87    92    NA    NA    NA    NA    NA    NA
##  3 3 Doo… Kryp… 2000-04-08    81    70    68    67    66    57    54    53    51
##  4 3 Doo… Loser 2000-10-21    76    76    72    69    67    65    55    59    62
##  5 504 B… Wobb… 2000-04-15    57    34    25    17    17    31    36    49    53
##  6 98^0   Give… 2000-08-19    51    39    34    26    26    19     2     2     3
##  7 A*Tee… Danc… 2000-07-08    97    97    96    95   100    NA    NA    NA    NA
##  8 Aaliy… I Do… 2000-01-29    84    62    51    41    38    35    35    38    38
##  9 Aaliy… Try … 2000-03-18    59    53    38    28    21    18    16    14    12
## 10 Adams… Open… 2000-08-26    76    76    74    69    68    67    61    58    57
## # … with 307 more rows, 67 more variables: wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## #   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## #   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## #   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## #   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## #   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
## #   wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …

The billboard dataset records the date a song enters the billboard top 100. First on your own, use the pivot_longer function to create a tidy version of the billboard dataset. The solution can be found here.

9.4.2 Multiple variables are stored in one column

Sometimes, the values in a column describe multiple variables. For example, in the following messy dataset, the month and day of each temperature recording are combined under a single column.

messy_temperature <- read.csv("lecture-material/week-02/data/multi_variables_single_col.csv")
messy_temperature
##     date year temperature
## 1   jan1 2022          15
## 2  jan14 2022         130
## 3   feb1 2022          19
## 4  feb14 2022          85
## 5   mar1 2022          95
## 6  mar14 2022         130
## 7   apr1 2022          29
## 8  apr14 2022          42
## 9   may1 2022         138
## 10 may14 2022          14
## 11  jun1 2022         133
## 12 jun14 2022          53
## 13  jul1 2022         120
## 14 jul14 2022         108
## 15  aug1 2022          23
## 16 aug14 2022         148
## 17  sep1 2022         103
## 18 sep14 2022         113
## 19  nov1 2022          67
## 20 nov14 2022           1
## 21  oct1 2022         113
## 22 oct14 2022          56
## 23  dec1 2022         107
## 24 dec14 2022          30

The first three characters of the “date” column give the month, and the following numeric value gives the day of the month of the temperature reading.

We can use the separate function to properly split the date column into multiple columns.

tidy_temperature <- messy_temperature %>% separate(col=date, into=c("month","day"), sep=3)
tidy_temperature
##    month day year temperature
## 1    jan   1 2022          15
## 2    jan  14 2022         130
## 3    feb   1 2022          19
## 4    feb  14 2022          85
## 5    mar   1 2022          95
## 6    mar  14 2022         130
## 7    apr   1 2022          29
## 8    apr  14 2022          42
## 9    may   1 2022         138
## 10   may  14 2022          14
## 11   jun   1 2022         133
## 12   jun  14 2022          53
## 13   jul   1 2022         120
## 14   jul  14 2022         108
## 15   aug   1 2022          23
## 16   aug  14 2022         148
## 17   sep   1 2022         103
## 18   sep  14 2022         113
## 19   nov   1 2022          67
## 20   nov  14 2022           1
## 21   oct   1 2022         113
## 22   oct  14 2022          56
## 23   dec   1 2022         107
## 24   dec  14 2022          30

Run ?separate to see how the separate function works.

Adding extra columns doesn’t “untidy” your data. Sometimes it is useful to add additional aggregate labels to an observation (especially when working with dates). For example, it might be useful to add a proper date column (with the format YEAR-MONTH-DAY) to our tidied temperature data. I’ll leave that as an exercise for you to do on your own. You’ll need to map each month string to its numeric representation (e.g. jan = 1, feb = 2). Then you can use mutate to create a new column from the relevant columns.