ACCT3112 tutorial 1


Tutorial 1: R Programming

Yuqi Sun

2024-09-19

knitr::opts_chunk$set(echo = TRUE, message=FALSE, warning=FALSE)
# Load the necessary libraries
library(tidyverse) # or you can just load library(dplyr)
library(babynames)

Usually, we load all necessary libraries in the first chunk, which is named
setup here.

▶ knitr::opts_chunk$set(...): This line sets global options for all
code chunks in the R Markdown document using the knitr package.
▶ echo = TRUE: means that the R code will be displayed in the final
output, allowing readers to see the code along with its results.
▶ message = FALSE, warning = FALSE: mean that messages and warnings
produced by the code are left out of the final output.
▶ More details on R code chunks: R code chunks

To insert an R code chunk:

▶ use the keyboard shortcut Ctrl + Alt + I (Cmd + Option + I on macOS), or
▶ click the Add Chunk command in the editor toolbar.

1. Basic Markdown Syntax

Headings

Headings in Markdown are created by using the ‘#’ symbol before the text.
There are six levels of headings, represented by one to six ‘#’ symbols.

# Heading 1
## Heading 2
### Heading 3
#### Heading 4
##### Heading 5
###### Heading 6

Emphasis

1. Bold text
You can make text bold by using two asterisks (**) or underscores (__) before
and after the text.

2. Italic text
You can make text italic by using one asterisk (*) or underscore (_) before and
after the text.

Lists

You can create unordered lists by starting a line with ‘*’, ‘+’ or ‘-’.
▶ Item 1
▶ Item 2
▶ Item 3

You can create ordered lists by starting a line with a number.


1. Item 1
2. Item 2
3. Item 3

Links and Images
1. You can create a link by enclosing the link text in brackets [] and the URL
in parentheses ().
[Visit Google](https://www.google.com)
Visit Google

2. You can display an image by starting with an exclamation mark (!),
followed by alt text in brackets [], and the path to the image in
parentheses ().
![R Logo](C:\Users\FBERPG\Desktop\logo_r.png){width=30%}

Figure 1: R Logo

2. Basic Data Exploration (Review on Dplyr)
We will only focus on using dplyr (library(tidyverse)) instead of base R in the
following tutorials.

Five Important Verbs

1. Pick observations by their values (filter()).


2. Reorder the rows (arrange()).
3. Pick variables by their names (select()).
4. Create new variables with functions of existing variables (mutate()).
5. Collapse many values down to a single summary (summarise()).

▶ These can all be used in conjunction with group_by() which changes the
scope of each function from operating on the entire dataset to operating
on it group-by-group.
▶ These six functions (the five verbs plus group_by()) provide the verbs for
a language of data manipulation, chained together with the pipe operator %>%.
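As a minimal sketch of how the verbs chain together, the pipeline below applies each of them once to mtcars. The object name verbs_demo, the derived column hp_per_wt and the particular steps are illustrative choices, not part of the tutorial.

```r
library(dplyr)

verbs_demo <- mtcars %>%
  filter(cyl %in% c(4, 6)) %>%        # filter(): keep 4- and 6-cylinder cars
  select(mpg, cyl, hp, wt) %>%        # select(): keep four columns
  mutate(hp_per_wt = hp / wt) %>%     # mutate(): illustrative derived column
  arrange(desc(mpg)) %>%              # arrange(): sort by mpg, highest first
  group_by(cyl) %>%                   # group_by(): switch to per-cylinder scope
  summarise(
    avg_mpg       = mean(mpg),        # summarise(): one row per group
    max_hp_per_wt = max(hp_per_wt)
  )

verbs_demo
```

Each step receives the result of the previous one as its first argument, so the pipeline reads top to bottom in the order the operations are applied.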

Grouped Summaries with summarise()

summarise() collapses a data frame to a single row:

summarise(mtcars, avg_mpg = mean(mpg, na.rm = TRUE))

## avg_mpg
## 1 20.09062

▶ summarise() is not terribly useful unless we pair it with group_by().


▶ This changes the unit of analysis from the complete dataset to individual
groups.
▶ Then, when you use the dplyr verbs on a grouped data frame, they’ll be
automatically applied “by group”.

by_cyl <- group_by(mtcars, cyl)
by_cyl

## # A tibble: 32 x 11
## # Groups: cyl [3]
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
## 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
## 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
## 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
## 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
## # i 22 more rows

summarise(by_cyl, avg_mpg = mean(mpg, na.rm = TRUE))

## # A tibble: 3 x 2
## cyl avg_mpg
## <dbl> <dbl>
## 1 4 26.7
## 2 6 19.7
## 3 8 15.1

Group by multiple variables

by_cyl2 <- group_by(mtcars, cyl, vs)


by_cyl2

## # A tibble: 32 x 11
## # Groups: cyl, vs [5]
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
## 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
## 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
## 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
## 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
## # i 22 more rows

summarise(by_cyl2, avg_mpg = mean(mpg, na.rm = TRUE))

## # A tibble: 5 x 3
## # Groups: cyl [3]
## cyl vs avg_mpg
## <dbl> <dbl> <dbl>
## 1 4 0 26
## 2 4 1 26.7
## 3 6 0 20.6
## 4 6 1 19.1
## 5 8 0 15.1

Group by multiple variables

by_cyl3 <- group_by(mtcars, cyl, vs, am)


by_cyl3

## # A tibble: 32 x 11
## # Groups: cyl, vs, am [7]
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
## 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
## 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
## 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
## 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
## # i 22 more rows

summarise(by_cyl3, avg_mpg = mean(mpg, na.rm = TRUE))

## # A tibble: 7 x 4
## # Groups: cyl, vs [5]
## cyl vs am avg_mpg
## <dbl> <dbl> <dbl> <dbl>
## 1 4 0 1 26
## 2 4 1 0 22.9
## 3 4 1 1 28.4
## 4 6 0 1 20.6
## 5 6 1 0 19.1
## 6 8 0 0 15.0
## 7 8 0 1 15.4

Together, group_by() and summarise() provide one of the tools that you’ll use
most commonly when working with dplyr: grouped summaries.

Missing Values
▶ You may have wondered about the na.rm argument we used above.
▶ What happens if we don’t set it?
▶ If the data contain missing values, the result will be a missing value (NA)!
▶ That’s because aggregation functions obey the usual rule of missing values:
if there’s any missing value in the input, the output will be a missing value.
▶ Fortunately, aggregation functions such as mean(), sum() and sd() have an
na.rm argument which removes the missing values prior to computation
(mtcars itself has no missing values, so the result below is unchanged):

mtcars %>%
group_by(cyl) %>%
summarise(avg_mpg = mean(mpg, na.rm = TRUE))

## # A tibble: 3 x 2
## cyl avg_mpg
## <dbl> <dbl>
## 1 4 26.7
## 2 6 19.7
## 3 8 15.1
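Because mtcars contains no missing values, the effect of na.rm is easier to see on a small vector. The following is a minimal sketch; the vector x is hypothetical and not taken from the tutorial.

```r
# Hypothetical values with one missing entry
x <- c(21.0, NA, 22.8)

mean(x)                  # NA: the missing value propagates to the result
mean(x, na.rm = TRUE)    # 21.9: computed from the two observed values
sum(!is.na(x))           # 2: number of non-missing values
anyNA(mtcars$mpg)        # FALSE: mpg in mtcars has no missing values
```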

%in%

▶ The %in% operator in R checks, for each element on its left-hand side
(e.g., a number), whether it appears in the vector on its right-hand side.
▶ For example, it can be used to see if the number 1 is in the sequence of
numbers 1 to 10.
1. Compare two Sequences of Numbers (vectors)
▶ In this example, we will use %in% to check if two vectors contain
overlapping numbers.
▶ Specifically, for each element of the shorter vector we get a logical
value indicating whether it is also present in the longer vector.

# sequence of numbers 1:
a <- seq(1, 5)
# sequence of numbers 2:
b <- seq(3, 12)

# using the %in% operator to check matching values in the vectors


a %in% b

## [1] FALSE FALSE TRUE TRUE TRUE

2. Compare two Vectors Containing Letters or Factors

g <- c("C", "D", "E")


h <- c("A", "E", "B", "C", "D", "E", "A", "B", "C", "D", "E")

h %in% g

## [1] FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE

which(h %in% g)

## [1] 2 4 5 6 9 10 11
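The logical vector returned by %in% can also be used directly for subsetting. A short sketch, reusing the vectors a and b defined earlier (redefined here so the block runs on its own):

```r
a <- seq(1, 5)
b <- seq(3, 12)

a[a %in% b]                      # elements of a that also occur in b: 3 4 5
intersect(a, b)                  # the same values via base R's set function
length(a %in% b) == length(a)    # TRUE: one logical per element of the left operand
length(b %in% a)                 # 10: swapping the operands tests each element of b
```

The result always has the length of the left-hand operand, so a %in% b and b %in% a answer different questions.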

3. Example: Use of %in% in filter

mtcars %>% filter(cyl %in% c(4,6))

## mpg cyl disp hp drat wt qsec vs am gear carb


## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
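As a sketch of why %in% is convenient here, the block below compares it with the equivalent chain of == and |, and with a common mistake, cyl == c(4, 6), which recycles the right-hand vector element by element rather than testing membership. The object names are illustrative.

```r
library(dplyr)

with_in  <- mtcars %>% filter(cyl %in% c(4, 6))
with_or  <- mtcars %>% filter(cyl == 4 | cyl == 6)
recycled <- mtcars %>% filter(cyl == c(4, 6))   # pitfall: compares row 1 to 4, row 2 to 6, ...

nrow(with_in)                          # 18 rows, as in the output above
isTRUE(all.equal(with_in, with_or))    # TRUE: both filters keep the same rows
nrow(recycled) < nrow(with_in)         # TRUE: the recycled comparison misses rows
```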
Useful Functions for Summary Statistics
1. Sample Size
Whenever you do any aggregation, it’s always a good idea to include either a
count (n()), or a count of non-missing values (sum(!is.na(x))).
▶ n() returns the size of the current group.
▶ To count the number of non-missing values, use sum(!is.na(x)).
▶ To count the number of distinct (unique) values, use n_distinct(x).

mtcars %>%
group_by(cyl) %>%
summarise(
avg_mpg = mean(mpg, na.rm = TRUE),
count = n()
)

## # A tibble: 3 x 3
## cyl avg_mpg count
## <dbl> <dbl> <int>
## 1 4 26.7 11
## 2 6 19.7 7
## 3 8 15.1 14

What if we don’t use summarize(), but instead use mutate()?
mtcars %>%
group_by(cyl) %>%
mutate(
avg_mpg = mean(mpg, na.rm = TRUE),
count = n()
) %>%
select(mpg, cyl, avg_mpg, count)

## # A tibble: 32 x 4
## # Groups: cyl [3]
## mpg cyl avg_mpg count
## <dbl> <dbl> <dbl> <int>
## 1 21 6 19.7 7
## 2 21 6 19.7 7
## 3 22.8 4 26.7 11
## 4 21.4 6 19.7 7
## 5 18.7 8 15.1 14
## 6 18.1 6 19.7 7
## 7 14.3 8 15.1 14
## 8 24.4 4 26.7 11
## 9 22.8 4 26.7 11
## 10 19.2 6 19.7 7
## # i 22 more rows
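Because mutate() keeps every row and attaches the group-level value to each of them, a row can be compared with its own group's statistic. The sketch below is one possible use, not part of the tutorial: it keeps the cars whose mpg is above the average of their cylinder group (the name above_avg is illustrative).

```r
library(dplyr)

above_avg <- mtcars %>%
  group_by(cyl) %>%
  mutate(avg_mpg = mean(mpg, na.rm = TRUE)) %>%  # group mean repeated on every row
  filter(mpg > avg_mpg) %>%                      # row-level comparison with the group mean
  ungroup() %>%
  select(mpg, cyl, avg_mpg)

above_avg
```

With summarise() this comparison would not be possible in one step, since the individual rows are collapsed away.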
Group by multiple variables:

mtcars %>%
group_by(cyl, vs, am) %>%
summarise(
avg_mpg = mean(mpg, na.rm = TRUE),
count = n()
)

## # A tibble: 7 x 5
## # Groups: cyl, vs [5]
## cyl vs am avg_mpg count
## <dbl> <dbl> <dbl> <dbl> <int>
## 1 4 0 1 26 1
## 2 4 1 0 22.9 3
## 3 4 1 1 28.4 7
## 4 6 0 1 20.6 3
## 5 6 1 0 19.1 4
## 6 8 0 0 15.0 12
## 7 8 0 1 15.4 2

mtcars %>%
group_by(cyl) %>%
summarise(
engine_num = n_distinct(vs)
) %>%
arrange(desc(engine_num))

## # A tibble: 3 x 2
## cyl engine_num
## <dbl> <int>
## 1 4 2
## 2 6 2
## 3 8 1
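The three counting helpers only differ when the data contain missing or repeated values, which mtcars does not show well. A minimal sketch on hypothetical toy data (the data frame scores and its columns are invented for illustration):

```r
library(dplyr)

# Hypothetical toy data: team A has one missing and one repeated score
scores <- data.frame(
  team  = c("A", "A", "A", "B", "B"),
  score = c(10, NA, 10, 7, 9)
)

counts <- scores %>%
  group_by(team) %>%
  summarise(
    rows        = n(),                            # size of the group
    non_missing = sum(!is.na(score)),             # rows with an observed score
    distinct    = n_distinct(score, na.rm = TRUE) # unique observed scores
  )

counts
# team A: rows = 3, non_missing = 2, distinct = 1
# team B: rows = 2, non_missing = 2, distinct = 2
```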

2. Sum and Average
sum(x), mean(x), median(x)
It’s sometimes useful to combine aggregation with logical subsetting. For
example:

mtcars %>%
group_by(cyl) %>%
summarise(
avg_mpg = mean(mpg),
avg_positive_mpg = mean(mpg[mpg > 20]) # the average mpg if mpg > 20
)

## # A tibble: 3 x 3
## cyl avg_mpg avg_positive_mpg
## <dbl> <dbl> <dbl>
## 1 4 26.7 26.7
## 2 6 19.7 21.1
## 3 8 15.1 NaN
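The NaN for the 8-cylinder group arises because no 8-cylinder car has mpg above 20, so the subset is empty and the mean of an empty numeric vector is NaN. A short base R check (the name mpg8 is illustrative):

```r
mpg8 <- mtcars$mpg[mtcars$cyl == 8]   # mpg values of the 8-cylinder cars

max(mpg8)                  # 19.2, so no value exceeds 20
length(mpg8[mpg8 > 20])    # 0: the logical subset is empty
mean(mpg8[mpg8 > 20])      # NaN
mean(numeric(0))           # NaN: mean of an empty numeric vector
```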

3. Variation
Measures of spread:
Standard deviation sd(x), Variance var(x), The interquartile range IQR(x),
Median absolute deviation mad(x).

mtcars %>%
group_by(cyl) %>%
summarise(sd_mpg = sd(mpg)) %>%
arrange(desc(sd_mpg))

## # A tibble: 3 x 2
## cyl sd_mpg
## <dbl> <dbl>
## 1 4 4.51
## 2 8 2.56
## 3 6 1.45
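The other measures of spread listed above can be computed in the same summarise() call. A minimal sketch (the column names and the object spread are illustrative):

```r
library(dplyr)

spread <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    sd_mpg  = sd(mpg),    # standard deviation
    var_mpg = var(mpg),   # variance, equal to sd_mpg^2
    iqr_mpg = IQR(mpg),   # interquartile range
    mad_mpg = mad(mpg)    # median absolute deviation
  )

spread
```

IQR() and mad() are less sensitive to outliers than sd() and var(), which can matter for small groups such as these.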

4. Rank
Measures of rank: min(x), quantile(x, 0.25), quantile(x, 0.75), max(x).

mtcars %>%
group_by(cyl) %>%
summarise(
min_hp = min(hp, na.rm = TRUE),
max_hp = max(hp, na.rm = TRUE),
percentile_25 = quantile(hp, 0.25, na.rm = TRUE),
percentile_75 = quantile(hp, 0.75, na.rm = TRUE)
)

## # A tibble: 3 x 5
## cyl min_hp max_hp percentile_25 percentile_75
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 52 113 65.5 96
## 2 6 105 175 110 123
## 3 8 150 335 176. 241.

3. Babynames Data

Example 1: Statistics for Certain Names

bn <- babynames::babynames
nms <- c('Mary', 'John', 'Emily', 'Lawrence')
bn %>%
filter(name %in% nms) %>%
group_by(name) %>%
summarize('Number of Years' = n(), 'Number of babies' = sum(n))

## # A tibble: 4 x 3
## name `Number of Years` `Number of babies`
## <chr> <int> <int>
## 1 Emily 212 843235
## 2 John 276 5137142
## 3 Lawrence 237 458898
## 4 Mary 268 4138360

length(unique(bn$year))

## [1] 138

We have 138 years in total.

Why could the Number of Years variable be even larger than 138? Because n()
counts rows, and a name can have up to two rows per year, one for each sex.

bn %>%
filter(name %in% nms) %>%
group_by(name, sex) %>%
summarize(num.years.by.sex = n(),
num.babies.by.sex =sum(n)) %>%
mutate(total.num.babies = sum(num.babies.by.sex),
proportion = num.babies.by.sex/total.num.babies)

## # A tibble: 8 x 6
## # Groups: name [4]
## name sex num.years.by.sex num.babies.by.sex total.num.babies
## <chr> <chr> <int> <int> <int>
## 1 Emily F 138 841491 843235
## 2 Emily M 74 1744 843235
## 3 John F 138 21676 5137142
## 4 John M 138 5115466 5137142
## 5 Lawrence F 99 2125 458898
## 6 Lawrence M 138 456773 458898
## 7 Mary F 138 4123200 4138360
## 8 Mary M 130 15160 4138360

Example 2: Names that Persistently Appear in Each Year

numyears <- length(unique(bn$year))


persistentnames <- bn %>%
group_by(name, year) %>% ## M/F doesn't matter
summarize() %>%
group_by(name) %>%
summarize(Total = n()) %>%
filter(Total==numyears)
nrow(persistentnames)

## [1] 922

bn %>%
group_by(name, year) %>% ## M/F doesn't matter
summarize()

## # A tibble: 1,756,284 x 2
## # Groups: name [97,310]
## name year
## <chr> <dbl>
## 1 Aaban 2007
## 2 Aaban 2009
## 3 Aaban 2010
## 4 Aaban 2011
## 5 Aaban 2012
## 6 Aaban 2013
## 7 Aaban 2014
## 8 Aaban 2015
## 9 Aaban 2016
## 10 Aaban 2017
## # i 1,756,274 more rows

bn %>%
group_by(name, year) %>% ## M/F doesn't matter
summarize() %>%
group_by(name)

## # A tibble: 1,756,284 x 2
## # Groups: name [97,310]
## name year
## <chr> <dbl>
## 1 Aaban 2007
## 2 Aaban 2009
## 3 Aaban 2010
## 4 Aaban 2011
## 5 Aaban 2012
## 6 Aaban 2013
## 7 Aaban 2014
## 8 Aaban 2015
## 9 Aaban 2016
## 10 Aaban 2017
## # i 1,756,274 more rows

bn %>%
group_by(name, year) %>% ## M/F doesn't matter
summarize() %>%
group_by(name) %>%
summarize(Total = n())

## # A tibble: 97,310 x 2
## name Total
## <chr> <int>
## 1 Aaban 10
## 2 Aabha 5
## 3 Aabid 2
## 4 Aabir 1
## 5 Aabriella 5
## 6 Aada 1
## 7 Aadam 26
## 8 Aadan 11
## 9 Aadarsh 17
## 10 Aaden 17
## # i 97,300 more rows

bn %>%
group_by(name, year) %>% ## M/F doesn't matter
summarize() %>%
group_by(name) %>%
summarize(Total = n()) %>%
filter(Total==numyears)

## # A tibble: 922 x 2
## name Total
## <chr> <int>
## 1 Aaron 138
## 2 Abbie 138
## 3 Abe 138
## 4 Abel 138
## 5 Abigail 138
## 6 Abner 138
## 7 Abraham 138
## 8 Abram 138
## 9 Ada 138
## 10 Adam 138
## # i 912 more rows
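The same count of years per name can also be obtained in one grouped step with n_distinct(year). This is an alternative sketch, not the tutorial's method, shown on hypothetical toy data in the same shape as babynames (the data frame toy and all object names are invented for illustration):

```r
library(dplyr)

# Hypothetical toy data: columns mirror babynames (name, sex, year)
toy <- data.frame(
  name = c("Ann", "Ann", "Ann", "Bob"),
  sex  = c("F", "M", "F", "M"),
  year = c(2000, 2000, 2001, 2000)
)
numyears_toy <- n_distinct(toy$year)   # 2 years in the toy data

# Two-step version, following the structure of the pipeline above
two_step <- toy %>%
  group_by(name, year) %>%
  summarise(.groups = "drop") %>%      # one row per (name, year)
  group_by(name) %>%
  summarise(Total = n()) %>%
  filter(Total == numyears_toy)

# One-step alternative: count distinct years directly
one_step <- toy %>%
  group_by(name) %>%
  summarise(Total = n_distinct(year)) %>%
  filter(Total == numyears_toy)

two_step$name   # "Ann"
one_step$name   # "Ann"
```

Both versions ignore sex: Ann appears twice in 2000 (F and M) but that year is counted once.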

What if we delete group_by(name)?

bn %>%
group_by(name, year) %>% ## M/F doesn't matter
summarize() %>%
##group_by(name) %>%
summarize(Total = n()) %>%
filter(Total==numyears)

## # A tibble: 922 x 2
## name Total
## <chr> <int>
## 1 Aaron 138
## 2 Abbie 138
## 3 Abe 138
## 4 Abel 138
## 5 Abigail 138
## 6 Abner 138
## 7 Abraham 138
## 8 Abram 138
## 9 Ada 138
## 10 Adam 138
## # i 912 more rows
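The result is unchanged because summarise() removes the last grouping variable from its output: after grouping by name and year, the summarised data is still grouped by name, so the explicit group_by(name) is redundant here. A minimal sketch on hypothetical toy data (toy and step1 are invented names):

```r
library(dplyr)

# Hypothetical toy data: one row per (name, year)
toy <- data.frame(
  name = c("Ann", "Ann", "Bob"),
  year = c(2000, 2001, 2000)
)

step1 <- toy %>%
  group_by(name, year) %>%
  summarise(.groups = "drop_last")   # explicit form of the peeling seen in the output above

group_vars(step1)                    # "name": year has been peeled off

totals <- step1 %>% summarise(Total = n())
totals$Total                         # 2 1: years per name, without calling group_by(name)
```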
Example 3: Names that Persistently Appear for Both Genders in Each
Year

persistentnames2 <- bn %>%


group_by(name, year, sex) %>% ## M/F indeed matter
summarize() %>%
group_by(name) %>%
summarize(Total = n()) %>%
filter(Total == 2*numyears)
nrow(persistentnames2)

## [1] 15

bn %>%
group_by(name, year, sex) %>% ## M/F indeed matter
summarize()

## # A tibble: 1,924,665 x 3
## # Groups: name, year [1,756,284]
## name year sex
## <chr> <dbl> <chr>
## 1 Aaban 2007 M
## 2 Aaban 2009 M
## 3 Aaban 2010 M
## 4 Aaban 2011 M
## 5 Aaban 2012 M
## 6 Aaban 2013 M
## 7 Aaban 2014 M
## 8 Aaban 2015 M
## 9 Aaban 2016 M
## 10 Aaban 2017 M
## # i 1,924,655 more rows

bn %>%
group_by(name, year, sex) %>% ## M/F indeed matter
summarize() %>%
group_by(name) %>%
summarize(Total = n())

## # A tibble: 97,310 x 2
## name Total
## <chr> <int>
## 1 Aaban 10
## 2 Aabha 5
## 3 Aabid 2
## 4 Aabir 1
## 5 Aabriella 5
## 6 Aada 1
## 7 Aadam 26
## 8 Aadan 11
## 9 Aadarsh 17
## 10 Aaden 18
## # i 97,300 more rows

bn %>%
group_by(name, year, sex) %>% ## M/F indeed matter
summarize() %>%
group_by(name) %>%
summarize(Total = n()) %>%
filter(Total == 2*numyears)

## # A tibble: 15 x 2
## name Total
## <chr> <int>
## 1 Francis 276
## 2 James 276
## 3 Jean 276
## 4 Jesse 276
## 5 Jessie 276
## 6 John 276
## 7 Johnnie 276
## 8 Joseph 276
## 9 Lee 276
## 10 Leslie 276
## 11 Marion 276
## 12 Ollie 276
## 13 Sidney 276
## 14 Tommie 276
What if we delete group_by(name)?

bn %>%
group_by(name, year, sex) %>% ## M/F indeed matter
summarize() %>%
##group_by(name) %>%
summarize(Total = n())%>%
filter(Total == 2*numyears)

## # A tibble: 0 x 3
## # Groups: name [0]
## # i 3 variables: name <chr>, year <dbl>, Total <int>

bn %>%
group_by(name, year, sex) %>% ## M/F indeed matter
summarize() %>%
##group_by(name) %>%
summarize(Total = n())

## # A tibble: 1,756,284 x 3
## # Groups: name [97,310]
## name year Total
## <chr> <dbl> <int>
## 1 Aaban 2007 1
## 2 Aaban 2009 1
## 3 Aaban 2010 1
## 4 Aaban 2011 1
## 5 Aaban 2012 1
## 6 Aaban 2013 1
## 7 Aaban 2014 1
## 8 Aaban 2015 1
## 9 Aaban 2016 1
## 10 Aaban 2017 1
## # i 1,756,274 more rows
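Here deleting group_by(name) does change the result. With three grouping variables, the first summarize() only peels off sex, so the data stays grouped by name and year, and n() then counts the sexes within each name-year pair (1 or 2), which can never equal 2*numyears. A minimal sketch on hypothetical toy data (all object names are invented for illustration):

```r
library(dplyr)

# Hypothetical toy data: columns mirror babynames (name, sex, year)
toy <- data.frame(
  name = c("Ann", "Ann", "Ann", "Bob"),
  sex  = c("F", "M", "F", "M"),
  year = c(2000, 2000, 2001, 2000)
)

by_name_year_sex <- toy %>%
  group_by(name, year, sex) %>%
  summarise(.groups = "drop_last")   # peels off sex only

group_vars(by_name_year_sex)         # "name" "year"

# Without group_by(name): one count per (name, year), at most 2
per_name_year <- by_name_year_sex %>%
  summarise(Total = n(), .groups = "drop")
per_name_year$Total                  # 2 1 1

# With group_by(name): one count per name across all years and sexes
per_name <- by_name_year_sex %>%
  group_by(name) %>%
  summarise(Total = n())
per_name$Total                       # 3 1
```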
