Chapter 15 Sampling objects in a dataset
In this lab activity, we will demonstrate two methods of sampling objects in a dataset:
- Naive random sampling
- Stratified random sampling
Sampling is a common method of dealing with a huge number of objects/observations in a dataset.
15.1 Dependencies and setup
In this activity, we’ll be using the following R packages:
library(tidyverse)
We’ll sample from our handy-dandy Star Wars data set:
head(starwars)
## # A tibble: 6 × 14
## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 Luke Skywal… 172 77 blond fair blue 19 male mascu… Tatooi…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi…
## 3 R2-D2 96 32 <NA> white,… red 33 none mascu… Naboo
## 4 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooi…
## 5 Leia Organa 150 49 brown light brown 19 fema… femin… Aldera…
## 6 Owen Lars 178 120 brown,… light blue 52 male mascu… Tatooi…
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## # starships <list>, and abbreviated variable names ¹hair_color, ²skin_color,
## # ³eye_color, ⁴birth_year, ⁵homeworld
15.2 Random sampling
In naive random sampling, we simply sample rows from our dataset at random. In this example, we’ll sample without replacement: we don’t want to include a character more than once in our down-sampled dataset.
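To make the distinction concrete, here is a small sketch using base R's `sample()` on a toy vector. The vector `characters` and the two result names are made up for illustration. Without replacement, no element can be drawn twice; with replacement, repeats are possible.

```r
# Toy vector, for illustration only
characters <- c("Luke", "Leia", "Han", "Chewbacca", "Yoda")

# Without replacement: each name appears at most once
without_replacement <- sample(characters, size = 3, replace = FALSE)
anyDuplicated(without_replacement)  # always 0 (no duplicates)

# With replacement: the same name may be drawn more than once.
# Drawing 10 times from only 5 names guarantees at least one repeat.
with_replacement <- sample(characters, size = 10, replace = TRUE)
anyDuplicated(with_replacement) > 0  # always TRUE in this setup
```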
sample_proportion <- 0.2 # Down-sample to just this proportion of our full data set
sample_size <- ceiling(sample_proportion * nrow(starwars)) # Sample size, rounded up
We can use slice_sample from the dplyr package to randomly sample Star Wars characters.
naive_sampled_data <- starwars %>%
  slice_sample(
    n=sample_size,
    replace=FALSE
  )
naive_sampled_data
## # A tibble: 18 × 14
## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 Taun We 213 NA none grey black NA fema… femin… Kamino
## 2 Biggs Dark… 183 84 black light brown 24 male mascu… Tatooi…
## 3 Wat Tambor 193 48 none green,… unknown NA male mascu… Skako
## 4 San Hill 191 NA none grey gold NA male mascu… Muunil…
## 5 Barriss Of… 166 50 black yellow blue 40 fema… femin… Mirial
## 6 Poe Dameron NA NA brown light brown NA male mascu… <NA>
## 7 Tarfful 234 136 brown brown blue NA male mascu… Kashyy…
## 8 Eeth Koth 171 NA black brown brown NA male mascu… Iridon…
## 9 Ayla Secura 178 55 none blue hazel 48 fema… femin… Ryloth
## 10 Wicket Sys… 88 20 brown brown brown 8 male mascu… Endor
## 11 Tion Medon 206 80 none grey black NA male mascu… Utapau
## 12 Arvel Cryn… NA NA brown fair brown NA male mascu… <NA>
## 13 Roos Tarpa… 224 82 none grey orange NA male mascu… Naboo
## 14 Rugor Nass 206 NA none green orange NA male mascu… Naboo
## 15 Saesee Tiin 188 NA none pale orange NA male mascu… Iktotch
## 16 Yarael Poof 264 NA none white yellow NA male mascu… Quermia
## 17 Wedge Anti… 170 77 brown fair hazel 21 male mascu… Corell…
## 18 Cliegg Lars 183 NA brown fair blue 82 male mascu… Tatooi…
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## # starships <list>, and abbreviated variable names ¹hair_color, ²skin_color,
## # ³eye_color, ⁴birth_year, ⁵homeworld
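Because rows are drawn at random, rerunning the code gives a different sample each time, so your output will likely not match the table shown here. A minimal sketch of one way to make the sample reproducible, assuming you fix the random seed before sampling. The seed value 42 is arbitrary, and the names `sample_a` and `sample_b` are introduced only for this example.

```r
library(dplyr)

sample_size <- ceiling(0.2 * nrow(starwars))

set.seed(42)  # arbitrary seed; any fixed integer works
sample_a <- starwars %>% slice_sample(n = sample_size, replace = FALSE)

set.seed(42)  # resetting the same seed reproduces the same draw
sample_b <- starwars %>% slice_sample(n = sample_size, replace = FALSE)

# The two samples contain the same characters in the same order
identical(sample_a$name, sample_b$name)
```

Setting the seed once at the top of a script is usually enough; it is reset twice here only to show that the same seed yields the same rows.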
Take a look at naive_sampled_data.
What are some ways that our random sample might not be representative of our larger data set (starwars)?
For example, we can look at the distribution of homeworlds in our original data.
# Get the distribution of homeworlds in full dataset.
orig_homeworlds <- starwars %>%
  group_by(
    homeworld
  ) %>%
  summarise(
    n=n()
  ) %>%
  mutate(
    freq = n / sum(n)
  ) %>%
  arrange(
    desc(freq)
  )
# Get the distribution of homeworlds in the sampled dataset.
sampled_homeworlds <- naive_sampled_data %>%
  group_by(
    homeworld
  ) %>%
  summarise(
    n=n()
  ) %>%
  mutate(
    freq = n / sum(n)
  ) %>%
  arrange(
    desc(freq)
  )
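The two pipelines above are identical apart from their input, so the repeated steps could be wrapped in a function. A minimal sketch, assuming a helper named `homeworld_distribution` (a hypothetical name, not part of dplyr) and using `count()` as shorthand for `group_by()` followed by `summarise(n=n())`:

```r
library(dplyr)

# Hypothetical helper: frequency table of homeworlds for any data frame
# that has a `homeworld` column.
homeworld_distribution <- function(data) {
  data %>%
    count(homeworld, name = "n") %>%  # one row per homeworld with its count
    mutate(freq = n / sum(n)) %>%     # convert counts to proportions
    arrange(desc(freq))               # most common homeworlds first
}

orig_homeworlds_alt <- homeworld_distribution(starwars)
head(orig_homeworlds_alt)
```

The same function can then be applied to any sample, which keeps the original and sampled tables computed in exactly the same way.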
Let’s take a look at the distribution of homeworlds from the original dataset:
ggplot(
  orig_homeworlds,
  aes(
    x=reorder(homeworld, -freq),
    y=freq
  )
) +
  geom_bar(stat="identity") +
  labs(
    x="Homeworld",
    y="Frequency",
    title="Distribution of homeworlds for original dataset"
  ) +
  theme(
    axis.text.x=element_text(angle=45, hjust=1)
  )
And now the sampled dataset:
ggplot(
  sampled_homeworlds,
  aes(
    x=reorder(homeworld, -freq),
    y=freq
  )
) +
  geom_bar(stat="identity") +
  labs(
    x="Homeworld",
    y="Frequency",
    title="Distribution of homeworlds for sampled dataset"
  ) +
  theme(
    axis.text.x=element_text(angle=45, hjust=1)
  )
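Comparing two separate plots by eye can be hard. One option, sketched here under the assumption that both frequency tables have the columns `homeworld` and `freq`, is to stack them and draw both panels in a single figure with a shared x axis. The `dataset` column, the helper `homeworld_freq`, and the other object names are introduced only for this example, and the tables are recomputed so that the block runs on its own.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: frequency of each homeworld in a data frame
homeworld_freq <- function(data) {
  data %>%
    count(homeworld, name = "n") %>%
    mutate(freq = n / sum(n))
}

full_freq <- homeworld_freq(starwars)
sample_freq <- homeworld_freq(
  slice_sample(starwars, n = ceiling(0.2 * nrow(starwars)), replace = FALSE)
)

# Stack the tables, labelling each row with the dataset it came from
combined <- bind_rows(
  original = full_freq,
  sampled = sample_freq,
  .id = "dataset"
)

comparison_plot <- ggplot(
  combined,
  aes(x = reorder(homeworld, -freq), y = freq)
) +
  geom_col() +                      # equivalent to geom_bar(stat = "identity")
  facet_wrap(~dataset, ncol = 1) +  # one panel per dataset
  labs(x = "Homeworld", y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
comparison_plot
```

Homeworlds that are missing from the sample show up as gaps in the sampled panel, which makes under-represented groups easier to spot.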
15.2.1 Exercises
- What differences do you notice between the distribution of homeworlds in the original dataset and the sampled dataset?
- How could you modify slice_sample so that, instead of weighting all rows equally, each row’s likelihood of being included in the sample is weighted according to the height attribute? Hint: ?slice_sample
- Adjust the sample proportion parameter and resample. What happens to the distribution of species as you increase or decrease the sample size?
- Read the R code. Identify any lines of code that you don’t understand. Use the documentation to figure out what those lines of code are doing.
15.3 Stratified random sampling
Stratified sampling can be used in cases where there are predefined groupings of objects. In stratified sampling, you enforce the level of representation for each group.
For example, we might want to ensure that each species among Star Wars characters is represented in our sample. There are many ways to implement this, and the particular approach you take will depend on your objective. In this example, I’ll randomly sample each species independently, guaranteeing that at least one character of each species is represented in our sample.
# Grab the set of species represented in our original dataset
species <- levels(as.factor(starwars$species))
# Create an empty data frame in which we will store our sample
stratified_sampled_data <- data.frame(
  matrix(
    ncol=ncol(starwars),
    nrow=0,
    dimnames=list(NULL,colnames(starwars))
  )
)
for (s in species) {
  # First, filter down to the species
  species_sample <- starwars %>%
    filter(
      species==s
    )
  # Calculate sample size, guarantee >= 1
  sample_size <- max(
    ceiling(nrow(species_sample)*sample_proportion),
    1
  )
  # Sample from characters of the current species
  species_sample <- species_sample %>%
    slice_sample(
      n=sample_size,
      replace=FALSE
    )
  # Add sample of this species to larger sample
  stratified_sampled_data <- rbind(
    stratified_sampled_data,
    species_sample
  )
}
stratified_sampled_data
## # A tibble: 44 × 14
## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 Ratts Tyer… 79 15 none grey, … unknown NA male mascu… Aleen …
## 2 Dexter Jet… 198 102 none brown yellow NA male mascu… Ojom
## 3 Ki-Adi-Mun… 198 82 white pale yellow 92 male mascu… Cerea
## 4 Mas Amedda 196 NA none blue blue NA male mascu… Champa…
## 5 Zam Wesell 168 55 blonde fair, … yellow NA fema… femin… Zolan
## 6 R5-D4 97 32 <NA> white,… red NA none mascu… Tatooi…
## 7 R2-D2 96 32 <NA> white,… red 33 none mascu… Naboo
## 8 Sebulba 112 40 none grey, … orange NA male mascu… Malast…
## 9 Wicket Sys… 88 20 brown brown brown 8 male mascu… Endor
## 10 Poggle the… 183 80 none green yellow NA male mascu… Geonos…
## # … with 34 more rows, 4 more variables: species <chr>, films <list>,
## # vehicles <list>, starships <list>, and abbreviated variable names
## # ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
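The loop above is only one of many ways to implement this. As a sketch of an alternative, assuming the same rule (sample 20% of each species, rounded up, with a minimum of one character), the per-species sampling can be expressed with `group_by()` and `group_modify()`. The name `stratified_alt` is introduced here only for illustration. Like the loop, which iterates over the non-missing levels of `species`, this version leaves out characters whose species is NA.

```r
library(dplyr)

sample_proportion <- 0.2

stratified_alt <- starwars %>%
  filter(!is.na(species)) %>%  # match the loop, which skips missing species
  group_by(species) %>%
  group_modify(
    # .x holds the rows of one species; sample at least one of them
    ~ slice_sample(
      .x,
      n = max(ceiling(nrow(.x) * sample_proportion), 1),
      replace = FALSE
    )
  ) %>%
  ungroup()

# Check that every species appears at least once in the sample
n_distinct(stratified_alt$species) == n_distinct(starwars$species, na.rm = TRUE)
```

Note that grouping and then calling `slice_sample(prop = 0.2)` would probably not give the same result: as I read the dplyr documentation, `prop` is rounded toward zero, so a species with fewer than five characters would contribute no rows at all.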
15.3.1 Exercises
- Graph the distribution of species in the original starwars dataset and in the stratified sample (just like we did for homeworlds in the naive random sample). What differences do you notice? What might be causing those differences?
- We specified a sample proportion of 0.2. Why might the size of our stratified sample differ from the size of our random sample?
- Read the R code. Identify any lines of code that you don’t understand. Use the documentation to figure out what those lines of code are doing.