Tutorial 1 - R Programming
Yuqi Sun
2024-09-19
knitr::opts_chunk$set(echo = TRUE, message=FALSE, warning=FALSE)
# Load the necessary libraries
library(tidyverse) # or you can just load library(dplyr)
library(babynames)
We usually load all the necessary libraries in the first chunk, which we name
setup here.
1. Basic Markdown Syntax
Headings
Headings in Markdown are created by using the ‘#’ symbol before the text.
There are six levels of headings, represented by one to six ‘#’ symbols.
# Heading 1
## Heading 2
### Heading 3
#### Heading 4
##### Heading 5
###### Heading 6
Emphasis
1. Bold text
You can make text bold by using two asterisks (**) or underscores (__) before
and after the text.
2. Italic text
You can make text italic by using one asterisk (*) or underscore (_) before and
after the text.
Lists
You can create unordered lists by starting a line with ‘*’, ‘+’, or ‘-’.
▶ Item 1
▶ Item 2
▶ Item 3
Links and Images
1. You can create a link by enclosing the link text in brackets [] and the URL
in parentheses ().
[Visit Google](https://www.google.com)
Visit Google
Figure 1: R Logo
2. Basic Data Exploration (Review on Dplyr)
In the following tutorials we focus on dplyr (loaded with library(tidyverse))
rather than base R.
▶ These can all be used in conjunction with group_by() which changes the
scope of each function from operating on the entire dataset to operating
on it group-by-group.
▶ These six functions provide the verbs for a language of data manipulation
connected using %>%.
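To make the role of %>% concrete, here is a minimal sketch that writes the same grouped summary twice, once as nested calls and once as a pipeline. The object names nested and piped are illustrative only and do not come from the slides.

```r
library(dplyr)

# Nested form: has to be read from the inside out
nested <- summarise(group_by(mtcars, cyl), avg_mpg = mean(mpg))

# Piped form: each verb receives the result of the previous line
piped <- mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg))

# Both forms should describe the same three-row table
all.equal(nested, piped)
```

The piped form reads top to bottom in the order the steps are carried out, which is why the rest of the tutorial uses it.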
Grouped Summaries with summarise()
## avg_mpg
## 1 20.09062
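The chunk that produced the output above is not shown on the slide. As a sketch, assuming it was an ungrouped call to summarise() on mtcars (consistent with the grouped calls that follow), it would look like this. The name overall is hypothetical.

```r
library(dplyr)

# Without group_by(), summarise() collapses the whole data set to one row
overall <- summarise(mtcars, avg_mpg = mean(mpg, na.rm = TRUE))
overall
```

Without grouping, the summary is a single overall average; adding group_by() turns it into one row per group.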
by_cyl <- group_by(mtcars, cyl)
by_cyl
## # A tibble: 32 x 11
## # Groups: cyl [3]
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
## 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
## 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
## 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
## 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
## # i 22 more rows
summarise(by_cyl, avg_mpg = mean(mpg, na.rm = TRUE))
## # A tibble: 3 x 2
## cyl avg_mpg
## <dbl> <dbl>
## 1 4 26.7
## 2 6 19.7
## 3 8 15.1
Group by multiple variables
by_cyl2 <- group_by(mtcars, cyl, vs)
by_cyl2
## # A tibble: 32 x 11
## # Groups: cyl, vs [5]
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
## 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
## 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
## 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
## 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
## # i 22 more rows
summarise(by_cyl2, avg_mpg = mean(mpg, na.rm = TRUE))
## # A tibble: 5 x 3
## # Groups: cyl [3]
## cyl vs avg_mpg
## <dbl> <dbl> <dbl>
## 1 4 0 26
## 2 4 1 26.7
## 3 6 0 20.6
## 4 6 1 19.1
## 5 8 0 15.1
Group by multiple variables
by_cyl3 <- group_by(mtcars, cyl, vs, am)
by_cyl3
## # A tibble: 32 x 11
## # Groups: cyl, vs, am [7]
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
## 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
## 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
## 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
## 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
## # i 22 more rows
summarise(by_cyl3, avg_mpg = mean(mpg, na.rm = TRUE))
## # A tibble: 7 x 4
## # Groups: cyl, vs [5]
## cyl vs am avg_mpg
## <dbl> <dbl> <dbl> <dbl>
## 1 4 0 1 26
## 2 4 1 0 22.9
## 3 4 1 1 28.4
## 4 6 0 1 20.6
## 5 6 1 0 19.1
## 6 8 0 0 15.0
## 7 8 0 1 15.4
Together, group_by() and summarise() provide one of the tools that you’ll use
most commonly when working with dplyr: grouped summaries.
Missing Values
▶ You may have wondered about the na.rm argument we used above.
▶ What happens if we don’t set it?
▶ We get a missing value for every group that contains one.
▶ That’s because aggregation functions obey the usual rule of missing values:
if there’s any missing value in the input, the output will be a missing value.
▶ Fortunately, aggregation functions such as mean() and sum() have an na.rm
argument which removes the missing values prior to computation:
mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))
## # A tibble: 3 x 2
## cyl avg_mpg
## <dbl> <dbl>
## 1 4 26.7
## 2 6 19.7
## 3 8 15.1
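mtcars itself has no missing values, so the effect of na.rm is easier to see on a small vector. This is a minimal sketch; the vector x and the two result names are made up for illustration.

```r
# A hypothetical vector with one missing value
x <- c(21, NA, 22.8)

with_na    <- mean(x)                # NA: the missing value propagates
without_na <- mean(x, na.rm = TRUE)  # 21.9: NA is dropped before averaging

with_na
without_na
```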
%in%
# sequence of numbers 1:
a <- seq(1, 5)
# sequence of numbers 2:
b <- seq(3, 12)
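As a sketch of how the two sequences can be compared, x %in% y returns one logical value per element of x, telling whether that element occurs anywhere in y. The result names a_in_b and b_in_a are illustrative.

```r
a <- seq(1, 5)
b <- seq(3, 12)

# One TRUE/FALSE per element of a: is it found in b?
a_in_b <- a %in% b   # FALSE FALSE TRUE TRUE TRUE

# The operator is not symmetric: this has one value per element of b
b_in_a <- b %in% a

a_in_b
b_in_a
```

The result always has the length of the left-hand vector, so swapping the two sides changes the answer.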
2. Compare two Vectors Containing Letters or Factors
h %in% g
## [1] FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
which(h %in% g)
## [1] 2 4 5 6 9 10 11
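The definitions of h and g are not shown on the slide, so the sketch below uses two hypothetical stand-ins, query and pool, to show the same pattern on character and factor data.

```r
# Hypothetical vectors; these are NOT the h and g used on the slide
pool  <- c("a", "c", "e")
query <- c("a", "b", "c", "d")

hits      <- query %in% pool          # TRUE FALSE TRUE FALSE
positions <- which(query %in% pool)   # 1 3

# Factors are matched by their labels, just like character vectors
factor_hits <- factor(c("a", "z")) %in% pool   # TRUE FALSE
```

which() converts the logical vector into the positions of the matching elements.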
3. Example: Use of %in% in filter
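A minimal sketch of %in% inside filter(), using mtcars rather than whatever data the original slide used: it keeps only the cars whose cyl value is in a given set. The name small_engines is illustrative.

```r
library(dplyr)

# Keep the rows whose cyl is either 4 or 6
small_engines <- mtcars %>%
  filter(cyl %in% c(4, 6))

nrow(small_engines)        # 18 cars (11 with 4 cylinders, 7 with 6)
unique(small_engines$cyl)
```

This is shorter and less error-prone than writing cyl == 4 | cyl == 6, especially when the set of allowed values is long.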
mtcars %>%
  group_by(cyl) %>%
  summarise(
    avg_mpg = mean(mpg, na.rm = TRUE),
    count = n()
  )
## # A tibble: 3 x 3
## cyl avg_mpg count
## <dbl> <dbl> <int>
## 1 4 26.7 11
## 2 6 19.7 7
## 3 8 15.1 14
What if we don’t use summarize(), but instead use mutate()?
mtcars %>%
  group_by(cyl) %>%
  mutate(
    avg_mpg = mean(mpg, na.rm = TRUE),
    count = n()
  ) %>%
  select(mpg, cyl, avg_mpg, count)
## # A tibble: 32 x 4
## # Groups: cyl [3]
## mpg cyl avg_mpg count
## <dbl> <dbl> <dbl> <int>
## 1 21 6 19.7 7
## 2 21 6 19.7 7
## 3 22.8 4 26.7 11
## 4 21.4 6 19.7 7
## 5 18.7 8 15.1 14
## 6 18.1 6 19.7 7
## 7 14.3 8 15.1 14
## 8 24.4 4 26.7 11
## 9 22.8 4 26.7 11
## 10 19.2 6 19.7 7
## # i 22 more rows
Group by multiple variables:
mtcars %>%
  group_by(cyl, vs, am) %>%
  summarise(
    avg_mpg = mean(mpg, na.rm = TRUE),
    count = n()
  )
## # A tibble: 7 x 5
## # Groups: cyl, vs [5]
## cyl vs am avg_mpg count
## <dbl> <dbl> <dbl> <dbl> <int>
## 1 4 0 1 26 1
## 2 4 1 0 22.9 3
## 3 4 1 1 28.4 7
## 4 6 0 1 20.6 3
## 5 6 1 0 19.1 4
## 6 8 0 0 15.0 12
## 7 8 0 1 15.4 2
mtcars %>%
  group_by(cyl) %>%
  summarise(
    engine_num = n_distinct(vs)
  ) %>%
  arrange(desc(engine_num))
## # A tibble: 3 x 2
## cyl engine_num
## <dbl> <int>
## 1 4 2
## 2 6 2
## 3 8 1
2. Sum and Average
sum(x), mean(x), median(x)
It’s sometimes useful to combine aggregation with logical subsetting. For
example:
mtcars %>%
  group_by(cyl) %>%
  summarise(
    avg_mpg = mean(mpg),
    avg_positive_mpg = mean(mpg[mpg > 20]) # the average mpg if mpg > 20
  )
## # A tibble: 3 x 3
## cyl avg_mpg avg_positive_mpg
## <dbl> <dbl> <dbl>
## 1 4 26.7 26.7
## 2 6 19.7 21.1
## 3 8 15.1 NaN
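The NaN in the last row comes from averaging an empty vector: no 8-cylinder car in mtcars has mpg above 20, so the subset mpg[mpg > 20] is empty for that group. A short base R sketch (the name mpg_8cyl is illustrative):

```r
# mpg values of the 8-cylinder cars only
mpg_8cyl <- mtcars$mpg[mtcars$cyl == 8]

any_above_20 <- any(mpg_8cyl > 20)         # FALSE: the subset below is empty
empty_mean   <- mean(mpg_8cyl[mpg_8cyl > 20])  # mean of zero values is NaN

any_above_20
empty_mean
```

Note that na.rm = TRUE does not help here, because the problem is an empty input rather than a missing value.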
3. Variation
Measures of spread:
Standard deviation sd(x), Variance var(x), The interquartile range IQR(x),
Median absolute deviation mad(x).
mtcars %>%
  group_by(cyl) %>%
  summarise(sd_mpg = sd(mpg)) %>%
  arrange(desc(sd_mpg))
## # A tibble: 3 x 2
## cyl sd_mpg
## <dbl> <dbl>
## 1 4 4.51
## 2 8 2.56
## 3 6 1.45
4. Rank
Measures of rank: min(x), quantile(x, 0.25), quantile(x, 0.75), max(x).
mtcars %>%
  group_by(cyl) %>%
  summarise(
    min_hp = min(hp, na.rm = TRUE),
    max_hp = max(hp, na.rm = TRUE),
    percentile_25 = quantile(hp, 0.25, na.rm = TRUE),
    percentile_75 = quantile(hp, 0.75, na.rm = TRUE)
  )
## # A tibble: 3 x 5
## cyl min_hp max_hp percentile_25 percentile_75
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 52 113 65.5 96
## 2 6 105 175 110 123
## 3 8 150 335 176. 241.
3. Babynames Data
bn <- babynames::babynames
nms <- c('Mary', 'John', 'Emily', 'Lawrence')
bn %>%
  filter(name %in% nms) %>%
  group_by(name) %>%
  summarize('Number of Years' = n(), 'Number of babies' = sum(n))
## # A tibble: 4 x 3
## name `Number of Years` `Number of babies`
## <chr> <int> <int>
## 1 Emily 212 843235
## 2 John 276 5137142
## 3 Lawrence 237 458898
## 4 Mary 268 4138360
length(unique(bn$year))
## [1] 138
bn %>%
  filter(name %in% nms) %>%
  group_by(name, sex) %>%
  summarize(num.years.by.sex = n(),
            num.babies.by.sex = sum(n)) %>%
  mutate(total.num.babies = sum(num.babies.by.sex),
         proportion = num.babies.by.sex / total.num.babies)
## # A tibble: 8 x 6
## # Groups: name [4]
## name sex num.years.by.sex num.babies.by.sex total.num.babies
## <chr> <chr> <int> <int> <int>
## 1 Emily F 138 841491 843235
## 2 Emily M 74 1744 843235
## 3 John F 138 21676 5137142
## 4 John M 138 5115466 5137142
## 5 Lawrence F 99 2125 458898
## 6 Lawrence M 138 456773 458898
## 7 Mary F 138 4123200 4138360
## 8 Mary M 130 15160 4138360
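The total.num.babies column is a per-name total because the result of summarize() is still grouped by name (see the Groups: name line in the output), so the sum() inside mutate() is computed within each name. The sketch below reproduces this on a small made-up table; toy, by_sex, and shares are hypothetical names, and .groups = "drop_last" only spells out the default behaviour.

```r
library(dplyr)

# Hypothetical miniature of the babynames layout
toy <- tibble(
  name = c("A", "A", "A", "B", "B"),
  sex  = c("F", "F", "M", "F", "M"),
  n    = c(10, 20, 5, 8, 2)
)

by_sex <- toy %>%
  group_by(name, sex) %>%
  summarise(babies = sum(n), .groups = "drop_last")

group_vars(by_sex)   # "name": only the last grouping variable was dropped

shares <- by_sex %>%
  mutate(total = sum(babies),          # summed within each name
         proportion = babies / total)
shares
```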
Example 2: Names that Persistently Appear in Each Year
## [1] 922
bn %>%
  group_by(name, year) %>% ## M/F doesn't matter
  summarize()
## # A tibble: 1,756,284 x 2
## # Groups: name [97,310]
## name year
## <chr> <dbl>
## 1 Aaban 2007
## 2 Aaban 2009
## 3 Aaban 2010
## 4 Aaban 2011
## 5 Aaban 2012
## 6 Aaban 2013
## 7 Aaban 2014
## 8 Aaban 2015
## 9 Aaban 2016
## 10 Aaban 2017
## # i 1,756,274 more rows
bn %>%
  group_by(name, year) %>% ## M/F doesn't matter
  summarize() %>%
  group_by(name)
## # A tibble: 1,756,284 x 2
## # Groups: name [97,310]
## name year
## <chr> <dbl>
## 1 Aaban 2007
## 2 Aaban 2009
## 3 Aaban 2010
## 4 Aaban 2011
## 5 Aaban 2012
## 6 Aaban 2013
## 7 Aaban 2014
## 8 Aaban 2015
## 9 Aaban 2016
## 10 Aaban 2017
## # i 1,756,274 more rows
bn %>%
  group_by(name, year) %>% ## M/F doesn't matter
  summarize() %>%
  group_by(name) %>%
  summarize(Total = n())
## # A tibble: 97,310 x 2
## name Total
## <chr> <int>
## 1 Aaban 10
## 2 Aabha 5
## 3 Aabid 2
## 4 Aabir 1
## 5 Aabriella 5
## 6 Aada 1
## 7 Aadam 26
## 8 Aadan 11
## 9 Aadarsh 17
## 10 Aaden 17
## # i 97,300 more rows
numyears <- length(unique(bn$year)) # number of distinct years in the data (138)
bn %>%
  group_by(name, year) %>% ## M/F doesn't matter
  summarize() %>%
  group_by(name) %>%
  summarize(Total = n()) %>%
  filter(Total == numyears)
## # A tibble: 922 x 2
## name Total
## <chr> <int>
## 1 Aaron 138
## 2 Abbie 138
## 3 Abe 138
## 4 Abel 138
## 5 Abigail 138
## 6 Abner 138
## 7 Abraham 138
## 8 Abram 138
## 9 Ada 138
## 10 Adam 138
## # i 912 more rows
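As an alternative sketch (not the approach used on the slides), the two-step "summarize, then count" pattern can be replaced by counting distinct years per name with n_distinct(). The example runs on a small hypothetical table, toy, instead of the full babynames data.

```r
library(dplyr)

# Hypothetical data: Ann appears in all three years, Bob only in 2000
toy <- tibble(
  name = c("Ann", "Ann", "Ann", "Bob", "Bob", "Ann"),
  year = c(2000, 2000, 2001, 2000, 2000, 2002),
  sex  = c("F", "M", "F", "M", "F", "F")
)

numyears <- n_distinct(toy$year)   # 3 distinct years in the toy data

every_year <- toy %>%
  group_by(name) %>%
  summarise(Total = n_distinct(year)) %>%  # distinct years per name, sex ignored
  filter(Total == numyears)

every_year
```

Because n_distinct(year) ignores repeated years, a name recorded for both sexes in the same year is still counted once.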
What if we delete: group_by(name) ?
bn %>%
  group_by(name, year) %>% ## M/F doesn't matter
  summarize() %>%
  ##group_by(name) %>%
  summarize(Total = n()) %>%
  filter(Total == numyears)
## # A tibble: 922 x 2
## name Total
## <chr> <int>
## 1 Aaron 138
## 2 Abbie 138
## 3 Abe 138
## 4 Abel 138
## 5 Abigail 138
## 6 Abner 138
## 7 Abraham 138
## 8 Abram 138
## 9 Ada 138
## 10 Adam 138
## # i 912 more rows
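The result is unchanged because summarize() removes the last grouping variable by default: after grouping by name and year, the first summarize() already leaves the data grouped by name, so the explicit group_by(name) is redundant here. A minimal sketch on hypothetical data (toy, step1, and step2 are illustrative names):

```r
library(dplyr)

toy <- tibble(
  name = c("Ann", "Ann", "Bob"),
  year = c(2000, 2001, 2000)
)

# First summarise(): one row per (name, year); the 'year' grouping is dropped
step1 <- toy %>%
  group_by(name, year) %>%
  summarise()
group_vars(step1)   # "name"

# Second summarise(): counts rows per name, with no group_by(name) needed
step2 <- step1 %>%
  summarise(Total = n())
step2
```

Relying on this default is fragile, so keeping the explicit group_by(name) is arguably the clearer choice.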
Example 3: Names that Persistently Appear for Both Genders in Each
Year
## [1] 15
bn %>%
  group_by(name, year, sex) %>% ## M/F does matter
  summarize()
## # A tibble: 1,924,665 x 3
## # Groups: name, year [1,756,284]
## name year sex
## <chr> <dbl> <chr>
## 1 Aaban 2007 M
## 2 Aaban 2009 M
## 3 Aaban 2010 M
## 4 Aaban 2011 M
## 5 Aaban 2012 M
## 6 Aaban 2013 M
## 7 Aaban 2014 M
## 8 Aaban 2015 M
## 9 Aaban 2016 M
## 10 Aaban 2017 M
## # i 1,924,655 more rows
bn %>%
  group_by(name, year, sex) %>% ## M/F does matter
  summarize() %>%
  group_by(name) %>%
  summarize(Total = n())
## # A tibble: 97,310 x 2
## name Total
## <chr> <int>
## 1 Aaban 10
## 2 Aabha 5
## 3 Aabid 2
## 4 Aabir 1
## 5 Aabriella 5
## 6 Aada 1
## 7 Aadam 26
## 8 Aadan 11
## 9 Aadarsh 17
## 10 Aaden 18
## # i 97,300 more rows
bn %>%
  group_by(name, year, sex) %>% ## M/F does matter
  summarize() %>%
  group_by(name) %>%
  summarize(Total = n()) %>%
  filter(Total == 2 * numyears)
## # A tibble: 15 x 2
## name Total
## <chr> <int>
## 1 Francis 276
## 2 James 276
## 3 Jean 276
## 4 Jesse 276
## 5 Jessie 276
## 6 John 276
## 7 Johnnie 276
## 8 Joseph 276
## 9 Lee 276
## 10 Leslie 276
## 11 Marion 276
## 12 Ollie 276
## 13 Sidney 276
## 14 Tommie 276
What if we delete: group_by(name) ?
bn %>%
  group_by(name, year, sex) %>% ## M/F does matter
  summarize() %>%
  ##group_by(name) %>%
  summarize(Total = n()) %>%
  filter(Total == 2 * numyears)
## # A tibble: 0 x 3
## # Groups: name [0]
## # i 3 variables: name <chr>, year <dbl>, Total <int>
bn %>%
  group_by(name, year, sex) %>% ## M/F does matter
  summarize() %>%
  ##group_by(name) %>%
  summarize(Total = n())
## # A tibble: 1,756,284 x 3
## # Groups: name [97,310]
## name year Total
## <chr> <dbl> <int>
## 1 Aaban 2007 1
## 2 Aaban 2009 1
## 3 Aaban 2010 1
## 4 Aaban 2011 1
## 5 Aaban 2012 1
## 6 Aaban 2013 1
## 7 Aaban 2014 1
## 8 Aaban 2015 1
## 9 Aaban 2016 1
## 10 Aaban 2017 1
## # i 1,756,274 more rows
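With three grouping variables the first summarize() drops only sex, so the data stay grouped by name and year. The second summarize() then counts sexes per name and year (at most 2), which can never equal 2 * numyears. One way to avoid depending on this default, sketched here on hypothetical data, is the .groups argument of summarise() (available in dplyr 1.0.0 and later) followed by an explicit group_by().

```r
library(dplyr)

toy <- tibble(
  name = c("Lee", "Lee", "Lee", "Kim"),
  year = c(2000, 2000, 2001, 2000),
  sex  = c("F", "M", "F", "F")
)

# Default behaviour: only the last grouping variable (sex) is dropped
implicit <- toy %>%
  group_by(name, year, sex) %>%
  summarise()
group_vars(implicit)   # "name" "year"

# Explicit version: drop all grouping, then regroup by name on purpose
totals <- toy %>%
  group_by(name, year, sex) %>%
  summarise(.groups = "drop") %>%
  group_by(name) %>%
  summarise(Total = n())
totals
```

Stating the grouping at each step makes it clear what n() is counting.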