Week3 Slides
Yuting Huang
AY24/25
The teaching team
Instructor:
▶ Dr. Huang Yuting ([email protected])
▶ Office: S16 04-01
▶ Office hour: In-person and by appointment
Tutorials in Week 3
Importing data into R
Recap
1. Where am I?
2. Where are my data?
Working directory
getwd()
File path
The second question implies that data are not necessarily stored at
the location of our current working directory.
▶ Relative path: the address of a file relative to our current
working directory.
▶ Access files directly in the current working path.
▶ Use two dots .. to denote “one level up in the directory
hierarchy”.
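A minimal sketch of how a relative path with `..` resolves. The folder layout (`demo_project/code` and `demo_project/data`) and the file `demo.csv` are made up for illustration and are created inside a temporary directory.

```r
# Build a throwaway project: <root>/data/demo.csv and an empty <root>/code/
root <- file.path(tempdir(), "demo_project")   # hypothetical project folder
dir.create(file.path(root, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "code"), showWarnings = FALSE)
writeLines(c("x,y", "1,2"), file.path(root, "data", "demo.csv"))

# Pretend we are working inside code/; setwd() returns the previous directory
old_wd <- setwd(file.path(root, "code"))

in_wd  <- file.exists("demo.csv")            # the file is not in code/
one_up <- file.exists("../data/demo.csv")    # up one level, then into data/
full   <- normalizePath("../data/demo.csv")  # the matching absolute path

setwd(old_wd)  # restore the original working directory
```

The same file has one absolute path but a different relative path from every working directory, which is why "Where am I?" has to be answered first.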
File path (Important!)
Memory requirements for R objects
Comma separated values
What does a CSV file look like?
A .csv file, opened in a text editor.
▶ This is the raw form of the data.
What does a CSV file look like?
Here is the same file opened in Microsoft Excel.
▶ Excel treats the file as a spreadsheet and puts each element in its
own cell.
Read a CSV file into R
Example: Education, Height, and Income
## [1] 1192 6
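An output of the form `[1] 1192 6` is what `dim()` prints: the number of rows, then the number of columns. A small sketch of the read-then-check pattern with base R's `read.csv()`, using a made-up three-row file rather than the course data (the column names `earn`, `sex`, and `race` are borrowed from these slides; the values are invented):

```r
# Write a tiny CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "earn,sex,race",
  "50000,male,white",
  "60000,female,other",
  "30000,male,hispanic"
), csv_path)

toy <- read.csv(csv_path)  # the first line is used as column names

dim(toy)  # number of rows, then number of columns
str(toy)  # type of each column and its first few values
```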
Data checks
str(heights)
Data checks
2. race is a categorical variable.
What are the different races that have been read in?
table(heights$race)
##
##    black hispanic    other    white
##      112       66       25      989
Data checks
sum(is.na(heights))
## [1] 0
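`sum(is.na(...))` gives a single total. When that total is not zero, a per-column count shows where the gaps are. A sketch on a made-up data frame (not the course data):

```r
# Toy data with two missing values, one in each column
toy <- data.frame(
  earn = c(50000, NA, 30000),
  race = c("white", "other", NA)
)

total_missing  <- sum(is.na(toy))      # total count of NA cells
missing_by_col <- colSums(is.na(toy))  # NA count for each column

total_missing
missing_by_col
```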
Summary statistics
▶ We can compute summary statistics for earn:
summary(heights$earn)
##      sex  earn
## 1 female 15000
## 2   male 25000
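The call that produced the by-sex table is not shown here. One base R way to get a table of that shape, assuming it reports one statistic per group (the median is used below purely as an assumption), is `aggregate()`. The data are made up:

```r
# Toy data: earnings for two groups (invented values)
toy <- data.frame(
  sex  = c("female", "female", "female", "male", "male"),
  earn = c(10000, 20000, 30000, 25000, 35000)
)

overall <- summary(toy$earn)  # min, quartiles, median, mean, max

# One row per sex, with the median of earn in each group
by_sex <- aggregate(earn ~ sex, data = toy, FUN = median)

overall
by_sex
```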
Histogram
hist(heights$earn, freq = FALSE, col = "maroon",
main = "Histogram of Earnings", xlab = "Earnings")
[Figure: histogram titled "Histogram of Earnings"; x-axis: Earnings; y-axis: Density, on the order of 1e-05.]
Histogram (revised code)
Our presentation of the histogram can be improved:
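A sketch of two possible improvements; these are assumptions, not necessarily the revisions made in the lecture. Expressing earnings in thousands removes the scientific notation from the axes, and asking for more bins shows the shape of the distribution in more detail. The data are simulated so that the example runs on its own.

```r
set.seed(1)
# Simulated right-skewed "earnings" (not the course data)
earn <- rlnorm(500, meanlog = 10, sdlog = 0.7)

earn_k <- earn / 1000  # express earnings in thousands

# breaks = 30 is only a suggestion; R picks "pretty" cut points near it
h <- hist(earn_k, freq = FALSE, breaks = 30, col = "maroon",
          main = "Histogram of Earnings",
          xlab = "Earnings (thousands)")
```

With `freq = FALSE` the bar areas sum to one, so rescaling the x-axis by 1000 also rescales the density axis by the same factor.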
Histogram (revised code)
[Figure: revised histogram titled "Histogram of Earnings"; y-axis: Density, from 0.000 to 0.030.]
The income distribution
Who are the high-earning individuals, those who earn more than
100,000 a year?
# install.packages("tidyverse")
library(tidyverse)
filter(heights, earn > 100000)
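For comparison, the same row selection can be written in base R without loading any package. A sketch on made-up data:

```r
# Toy data, including one missing earnings value
toy <- data.frame(
  earn = c(50000, 120000, NA, 175000),
  sex  = c("male", "female", "female", "male")
)

# subset() drops rows where the condition is NA, as filter() does
high_subset <- subset(toy, earn > 100000)

# Plain logical indexing would keep an all-NA row for the missing value;
# wrapping the condition in which() avoids that
high_base <- toy[which(toy$earn > 100000), ]

high_subset
```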
Recap
Remember to inspect your data both before and after you read them in.
▶ Think of as many ways as you can in which the import could have gone
wrong, and check each one.
Flat file
# install.packages("readr")
library(readr)
heights <- read_csv("../data/heights.csv")
▶ We can also use this function to read data directly from a URL
(more on this later).
Other file types
Excel spreadsheets
To read data from xls and xlsx spreadsheets, we need the readxl
package.
# install.packages("readxl")
library(readxl)
Excel example
read_excel("../data/read_excel_01.xlsx")
## # A tibble: 7 x 5
##   `Table 1` ...2  ...3   ...4 ...5
##   <lgl>     <lgl> <chr> <dbl> <chr>
## 1 NA        NA    <NA>     NA <NA>
## 2 NA        NA    <NA>     NA <NA>
## 3 NA        NA    <NA>     NA <NA>
## 4 NA        NA    <NA>     NA <NA>
## 5 NA        NA    a         1 m
## 6 NA        NA    b         2 m
## 7 NA        NA    c         3 m
Excel example
read_excel("../data/read_excel_01.xlsx", skip = 5)
## # A tibble: 2 x 3
##   a       `1` m
##   <chr> <dbl> <chr>
## 1 b         2 m
## 2 c         3 m
Excel example
Another way is to specify the data range precisely.
▶ We can also supply a set of column names in col_names.
read_excel("../data/read_excel_01.xlsx",
range = "C6:E8", col_names = c("var1", "var2", "var3"))
## # A tibble: 3 x 3
##   var1   var2 var3
##   <chr> <dbl> <chr>
## 1 a         1 m
## 2 b         2 m
## 3 c         3 m
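To try `range` and `col_names` without the course file, a sketch using the example workbook that ships with readxl (`readxl_example("datasets.xlsx")`, whose first sheet starts in cell A1). The column names `var1` to `var3` are placeholders, and the code only runs if readxl is installed.

```r
if (requireNamespace("readxl", quietly = TRUE)) {
  # Path to a workbook bundled with the package (not the course file)
  path <- readxl::readxl_example("datasets.xlsx")

  # A1:C4 = one header row plus three data rows
  with_header <- readxl::read_excel(path, range = "A1:C4")

  # A2:C4 = data rows only, so we supply our own column names
  renamed <- readxl::read_excel(path, range = "A2:C4",
                                col_names = c("var1", "var2", "var3"))
}
```

When `col_names` is a character vector, every row in `range` is treated as data. That is why the second call starts one row lower than the first.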
Example: Workplace injuries
The Excel file Workplace_injuries.xlsx contains data on selected
workplace injuries from 2019 to 2022.
▶ Originally from the Ministry of Manpower (MOM).
## # A tibble: 6 x 5
##   Type                                               `2019` `2020` `2
##   <chr>                                               <dbl>  <dbl>  <
## 1 Crushing, fractures and dislocations                 3107   2577
## 2 Cuts and Bruises                                     4500   3895
## 3 Sprains & Strains                                    1982   1791
## 4 Others                                               2418   1675
## 5 <NA>                                                   NA     NA
## 6 Notes: Workplace injury numbers include injuries ~     NA     NA
To read in only the data rows, without the empty row and the notes
underneath, we should specify an appropriate range.
## # A tibble: 4 x 5
##   Type                                 `2019` `2020` `2021` `2022`
##   <chr>                                 <dbl>  <dbl>  <dbl>  <dbl>
## 1 Crushing, fractures and dislocations   3107   2577   2950   2759
## 2 Cuts and Bruises                       4500   3895   4263   4333
## 3 Sprains & Strains                      1982   1791   1829   1778
## 4 Others                                 2418   1675   2100   2022
Common errors
When we first start importing data into R, it’s common to see some
frustrating error messages.
▶ The most common error is:
▶ This indicates that R cannot find the file you are trying to import.
▶ Check your file path! Perhaps also the spelling of the filename.
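A sketch of a pre-import check that turns the advice above into code. `check_path()` is a hypothetical helper written for this example, not part of any package: it reports the working directory and lists what is actually in the target folder.

```r
# Hypothetical helper: returns TRUE if the file exists, otherwise prints hints
check_path <- function(path) {
  if (file.exists(path)) {
    return(TRUE)
  }
  message("Cannot find '", path, "'. Working directory: ", getwd())
  folder <- dirname(path)
  if (dir.exists(folder)) {
    # Listing the folder helps to spot a misspelled filename
    message("Files in '", folder, "': ",
            paste(list.files(folder), collapse = ", "))
  } else {
    message("The folder '", folder, "' does not exist either.")
  }
  FALSE
}

# Demo with a file that does not exist
missing_ok <- check_path(file.path(tempdir(), "no_such_file.csv"))
```

A typical use is `if (check_path("../data/heights.csv")) heights <- read.csv("../data/heights.csv")`.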
Summary