Week3 Slides

This document outlines the Week 3 tutorial for the DSA2101 course on Essential Data Analytics Tools, focusing on importing data into R. It covers various file formats, including CSV, Excel, and JSON, and provides instructions on how to read these files into R, along with best practices for managing data and memory. Additionally, it emphasizes the importance of data checks and introduces the readr and readxl packages for efficient data handling.

Uploaded by

Tùng Nguyễn

DSA2101

Essential Data Analytics Tools: Data Visualization

Yuting Huang

AY24/25

Week 3: Importing Data I

The teaching team

Instructor:
▶ Dr. Huang Yuting
▶ Office: S16 04-01
▶ Office hour: In-person and by appointment

Teaching assistants (TAs): In-person/online and by appointment
▶ Yeo Jaye Lin
▶ Quek Chui Qing
▶ Agrawal Naman
▶ Loo Wen Wen
▶ Zhang Mingyuan

Tutorials in Week 3

Tutorials begin this week.

Due to the CNY public holidays, we will reschedule the session online.
▶ Your TA will be in touch with you to share the time and
meeting link.
▶ All sessions will be recorded and made available on Canvas by
the end of this week.

Importing data into R

Week 3:
1. CSV files
2. Flat files
3. Excel files

Week 4:
4. R data files
5. JSON files
6. Files from the web
7. APIs

Recap

An important prerequisite to loading data into R is that we are able
to point to the location at which the data files are stored.

1. Where am I?
2. Where are my data?

Working directory

The first question addresses the notion of our current working
directory.
▶ Typically, it is the location of our current R script.
▶ The function getwd() returns the absolute path of our current
working directory.

getwd()

File path

The second question implies that data are not necessarily stored at
the location of our current working directory.
▶ Relative path: the address of a file relative to our current
working directory.
▶ Access files directly in the current working path.
▶ Use two dots .. to denote “one level up in the directory
hierarchy”.

Use relative paths in all the code you write.
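
As a small illustration of the two dots, the sketch below builds a relative path one level up from the working directory. The folder name data and the file name heights.csv are the ones used elsewhere in these slides; whether the file actually exists depends on your own setup, so the last line may well print FALSE.

```r
# Build "../data/heights.csv" in a platform-independent way.
# ".." means "one level up from the current working directory".
data_path <- file.path("..", "data", "heights.csv")
print(data_path)

# The path is resolved relative to getwd(), so print both when debugging.
print(getwd())
print(file.exists(data_path))  # TRUE only if the file is really there
```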

File path (Important!)

We will strictly adhere to the following practice:


▶ Store all course materials in a folder named DSA2101.
▶ Within DSA2101, create a sub-folder named src to store all R
scripts and Rmd files.
▶ Within DSA2101, create another sub-folder called data to store
all data sets.
▶ The src and data folders should be positioned at the same
hierarchical level within DSA2101.
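
The layout above can be sketched in code. This example creates the folders inside a temporary directory (an assumption made so that nothing on your machine is overwritten) and then checks that, from inside src, the data folder is reachable as ../data.

```r
# Create DSA2101/src and DSA2101/data under a temporary directory.
root <- file.path(tempdir(), "DSA2101")
dir.create(file.path(root, "src"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "data"), recursive = TRUE, showWarnings = FALSE)
print(list.files(root))

# Pretend we are working from src, as an R script stored there would.
old_wd <- setwd(file.path(root, "src"))
found <- dir.exists(file.path("..", "data"))  # one level up, then into data
setwd(old_wd)                                 # restore the original directory
print(found)
```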

Memory requirements for R objects

Remember that R stores all its objects using physical memory.
▶ It is important to be aware of how much memory is being used in
your workspace.
▶ Especially when we are reading in or creating a new (large) data
set in R.

Other programs running on our computer take up RAM, and so do the
other R objects that exist in the workspace.

Memory requirements for R objects

If you do not have enough RAM, your computer (or at least
your R session) will freeze up.
▶ Usually an unpleasant experience that requires you to kill the R
session (the best scenario), or
▶ ... reboot your computer.

So make sure you understand the memory requirements before
reading in or creating large data sets!
Read more about this on Posit.
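
One way to get a feel for memory use is object.size() from base R. The sketch below measures a vector of one million numbers and then makes a back-of-the-envelope estimate for a table. The 8-bytes-per-value rule applies to numeric (double) columns only, so treat the estimate as a rough lower bound rather than an exact figure.

```r
# A numeric vector stores each value as an 8-byte double.
x <- numeric(1e6)
size_bytes <- object.size(x)
print(size_bytes, units = "Mb")

# Rough estimate for an all-numeric table: rows * columns * 8 bytes.
# 1192 rows and 6 columns are the dimensions of heights.csv in these slides.
n_rows <- 1192
n_cols <- 6
est_bytes <- n_rows * n_cols * 8
print(est_bytes)
```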

Comma separated values

We first consider the simplest file format: comma separated values
(CSV).

Alice, 98, 92, 94
Brown, 85, 89, 91
Carly, 81, 96, 97

These files are in fact just text files, with
▶ An optional header, listing the column names.
▶ One observation per row, with the values separated by commas.
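
To see that a CSV file really is plain text, the sketch below writes the three rows above to a temporary file and reads them back line by line. The header row (name, test1, test2, test3) is a made-up addition for illustration; the slides do not name these columns.

```r
# Write a tiny CSV file; the header names are hypothetical.
csv_lines <- c(
  "name,test1,test2,test3",
  "Alice,98,92,94",
  "Brown,85,89,91",
  "Carly,81,96,97"
)
tmp <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp)

# readLines() returns the raw text, one string per row of the file.
raw_text <- readLines(tmp)
print(raw_text)

# Splitting a row on commas recovers the individual values.
fields <- strsplit(raw_text[2], ",")[[1]]
print(fields)
```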

What does a CSV file look like?
A .csv file, opened in a text editor.
▶ This is the raw form of the data.

What does a CSV file look like?
Here is the same file opened in Microsoft Excel.
▶ Excel assumes that it is a spreadsheet and puts each element in
its own cell.

Read a CSV file into R

The base R command to read a CSV file is read.csv().

The main arguments to this function are:
▶ file: The file name.
▶ header: Absence/presence of a header row. The default is TRUE.
▶ col.names: The names to identify columns in the table.
▶ stringsAsFactors: Whether to convert character vectors to
factors.
▶ na.strings: Specify strings to be interpreted as NA values.
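
The arguments above can be tried out without a file on disk by passing the data through the text argument of read.csv(). In this sketch the column names, the missing score, and the "-" marker are all invented for illustration.

```r
# Three rows, no header; "-" and "NA" both stand for a missing score.
csv_text <- "Alice,98,92,NA\nBrown,85,-,91\nCarly,81,96,97"

scores <- read.csv(
  text = csv_text,
  header = FALSE,                                   # first row is data
  col.names = c("name", "test1", "test2", "test3"), # hypothetical names
  stringsAsFactors = FALSE,                         # keep names as character
  na.strings = c("NA", "-")                         # treat these as missing
)

str(scores)
print(sum(is.na(scores)))
```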

Example: Education, Height, and Income

The file heights.csv contains information on 1192 individuals.
▶ Contains 6 columns. There’s also a column header.
▶ Hence, we read in the data in the following way:

heights <- read.csv("../data/heights.csv", header = TRUE)
dim(heights)

## [1] 1192 6

▶ The function dim() (stands for dimensions) tells us that the
data frame has 1192 rows and 6 columns.

Data checks

1. What type has each column been read in as?

str(heights)

## 'data.frame': 1192 obs. of 6 variables:
## $ earn : num 50000 60000 30000 50000 51000 9000 29000 32000 2000 2
## $ height: num 74.4 65.5 63.6 63.1 63.4 ...
## $ sex : chr "male" "female" "female" "female" ...
## $ ed : int 16 16 16 16 17 15 12 17 15 12 ...
## $ age : int 45 58 29 91 39 26 49 46 21 26 ...
## $ race : chr "white" "white" "white" "other" ...

▶ The function str() (stands for structure) reveals information
about the columns, giving the names of the columns and a peek
into the contents of each.

Data checks
2. race is a categorical variable.
What are the different races that have been read in?

heights$race <- factor(heights$race)
levels(heights$race)

## [1] "black" "hispanic" "other" "white"

▶ A contingency table of the counts of each factor level:

table(heights$race)

##
## black hispanic other white
## 112 66 25 989
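
The same two steps can be tried on a small character vector. The values below are made up; the point is that factor() sorts the levels alphabetically by default and table() counts the entries in each level.

```r
# A made-up categorical variable stored as character strings.
race <- c("white", "black", "white", "other", "hispanic", "white")

race_f <- factor(race)
print(levels(race_f))   # levels are sorted alphabetically by default

counts <- table(race_f)
print(counts)
```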

Data checks

3. Are there any missing values in the data?

sum(is.na(heights))

## [1] 0

▶ Use is.na() to check missing entries in the entire data set.
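
When the total is not zero, it helps to know where the missing values are. The sketch below uses a small made-up data frame (the column names mirror heights, the values do not) to count missing entries per column and to list the incomplete rows.

```r
# Toy data with two deliberately missing entries.
toy <- data.frame(
  earn   = c(50000, NA, 30000),
  height = c(74.4, 65.5, NA),
  sex    = c("male", "female", "female"),
  stringsAsFactors = FALSE
)

total_missing <- sum(is.na(toy))        # whole data frame
missing_by_col <- colSums(is.na(toy))   # one count per column
print(total_missing)
print(missing_by_col)

# Rows that contain at least one missing value.
incomplete <- toy[!complete.cases(toy), ]
print(incomplete)
```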

Summary statistics
▶ We can compute summary statistics for earn:

summary(heights$earn)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 200 10000 20000 23155 30000 200000

▶ Summary statistics by group with aggregate():

aggregate(earn ~ sex, data = heights, FUN = median)

## sex earn
## 1 female 15000
## 2 male 25000

Histogram

Let us use a histogram to visualize the distribution of income.


▶ A histogram, hist(), divides the range of numeric values into
bins, then counts the number of observations that fall into each
bin.
▶ By default, the height of each bar represents frequencies.
▶ freq = FALSE alters a histogram such that the height represents
the probability densities (that is, the histogram has a total area
of one).
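
The difference between frequencies and densities can be checked numerically. This sketch uses simulated right-skewed data (an assumption, since heights.csv is not bundled here) and the object that hist() returns when plotting is switched off.

```r
# Simulated earnings-like data; the seed only makes the run reproducible.
set.seed(2101)
x <- rexp(500, rate = 1 / 20000)

# plot = FALSE returns the bin information without drawing anything.
h <- hist(x, plot = FALSE)

# Frequencies: the counts add up to the number of observations.
total_count <- sum(h$counts)

# Densities: bar height times bin width, summed over bins, gives area 1.
area <- sum(h$density * diff(h$breaks))
print(total_count)
print(area)
```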

hist(heights$earn, freq = FALSE, col = "maroon",
main = "Histogram of Earnings", xlab = "Earnings")

[Figure: histogram titled "Histogram of Earnings"; x-axis: Earnings (0 to 200000); y-axis: Density]

▶ The distribution of income is right-skewed, as expected.

Histogram (revised code)
Our presentation of the histogram can be improved:

1. The bins correspond to intervals of width 20,000. We would like
bins of width 10,000 instead.
2. Transform the x-axis to display earnings in thousands of dollars
for better readability.

hist(heights$earn/1000, freq = FALSE, col = "maroon",
     breaks = seq(0, 200, by = 10),
     main = "Histogram of Earnings",
     xlab = "Earnings (in thousands)")

▶ heights$earn/1000 divides earnings by a thousand. Now the
earnings values range from 0.2 to 200.
▶ breaks = seq(0, 200, by = 10) sets the bin edges to run from 0
to 200 in steps of 10, so each bin has width 10.
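
To see what the breaks argument does, the sketch below applies the same break points to a handful of made-up earnings and inspects the bins instead of plotting them.

```r
# Made-up earnings in dollars, spanning the same range as the real data.
earn <- c(200, 9000, 15000, 25000, 60000, 110000, 200000)

# 21 break points from 0 to 200 define 20 bins of width 10.
brks <- seq(0, 200, by = 10)
h <- hist(earn / 1000, breaks = brks, plot = FALSE)

print(length(brks))      # number of bin edges
print(length(h$counts))  # number of bins
print(h$counts[1])       # observations in the first bin, 0 to 10 thousand
```

Note that hist() stops with an error if any value falls outside the range of the supplied breaks, so the break points must cover the minimum and maximum of the data.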

Histogram (revised code)

[Figure: histogram titled "Histogram of Earnings"; x-axis: Earnings (in thousands), 0 to 200; y-axis: Density]

The income distribution
Who are the high-earning individuals, those who earn more than
100,000 a year?

# install.packages("tidyverse")
library(tidyverse)
filter(heights, earn > 100000)

## earn height sex ed age race
## 1 125000 74.34062 male 18 45 white
## 2 170000 71.01003 male 18 45 white
## 3 175000 70.58955 male 16 48 white
## 4 148000 66.74020 male 18 38 white
## 5 110000 65.96504 male 18 37 white
## 6 105000 74.58005 male 12 49 white
## 7 123000 61.42908 female 14 58 white
## 8 200000 69.66276 male 18 34 white
## 9 110000 66.31203 female 18 48 other

The income distribution

library(tidyverse)
filter(heights, earn > 100000)

The code uses the dplyr syntax.
▶ It is a great tool for data cleaning and manipulation.
▶ We shall learn about it soon.
▶ For now, we only need to understand that it filters out irrelevant
rows from the heights data frame, keeping only those who earned
more than 100,000 per year.
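
For comparison, the same row selection can be written in base R without loading any package. This is an alternative to the dplyr call, not what the slides use, and the data frame below is made up.

```r
# Toy data; only the earn column matters for the filter.
toy <- data.frame(
  earn = c(50000, 125000, 30000, 110000),
  sex  = c("male", "male", "female", "female"),
  stringsAsFactors = FALSE
)

# Logical indexing: keep the rows where the condition is TRUE.
high <- toy[toy$earn > 100000, ]

# subset() expresses the same condition with less typing.
high2 <- subset(toy, earn > 100000)

print(high)
```

One difference to keep in mind: if earn contains NA, logical indexing returns a row of NAs for it, whereas subset() and dplyr's filter() drop such rows.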

Recap

Remember that you should inspect your data before and after you
read them in.
▶ Try to think of as many ways in which it could have gone wrong
and check.

As we covered here, you should at least consider the following:


▶ Correct number of rows and columns.
▶ Column variables read in with the correct class type.
▶ Missing values.
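
The three checks can be bundled into one helper. The function check_import() below is a hypothetical name written for this sketch, not part of any package, and the toy data frame only mimics a few columns of heights.

```r
# Hypothetical helper: report dimensions, column classes, and missing values.
check_import <- function(df, n_rows, n_cols) {
  list(
    dims_ok   = identical(dim(df), c(as.integer(n_rows), as.integer(n_cols))),
    classes   = vapply(df, function(col) class(col)[1], character(1)),
    n_missing = colSums(is.na(df))
  )
}

# Made-up data with one missing earning.
toy <- data.frame(
  earn = c(50000, 60000, NA),
  ed   = c(16L, 16L, 12L),
  race = c("white", "other", "white"),
  stringsAsFactors = FALSE
)

res <- check_import(toy, n_rows = 3, n_cols = 3)
print(res)
```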

Flat file

The readr package is developed to deal with reading in large flat
files quickly.
▶ Faster than base R analogues.
▶ The function for CSV files is read_csv().

# install.packages("readr")
library(readr)
heights <- read_csv("../data/heights.csv")

▶ We can also use this function to read data directly from a URL
(more on this later).

Other file types

readr provides other functions to read in data:


▶ read_csv2() reads semicolon-separated files.
▶ read_tsv() reads tab-delimited files.
▶ read_delim() reads in files with any delimiter, attempting to
automatically guess the delimiter if you do not specify it.
▶ ...

Useful documentation and cheatsheet on data import.
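
As a sketch of read_delim() with an explicit delimiter, the example below writes a small pipe-separated file and reads it back. The file contents are made up, and the call is wrapped in a requireNamespace() check (an assumption about your setup) so that the code does nothing, rather than failing, when readr is not installed.

```r
# A made-up file that uses "|" instead of a comma as the delimiter.
tmp <- tempfile(fileext = ".txt")
writeLines(c("name|score", "Alice|98", "Brown|85"), tmp)

if (requireNamespace("readr", quietly = TRUE)) {
  # col_types = cols() lets readr guess the column types quietly.
  scores <- readr::read_delim(tmp, delim = "|", col_types = readr::cols())
  print(scores)
}
```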

Excel spreadsheets

To read data from xls and xlsx spreadsheets, we need the readxl
package.

# install.packages("readxl")
library(readxl)

▶ The read_excel() function automatically detects the rectangular
region that contains non-empty cells in the Excel spreadsheet.
▶ Nonetheless, ensure that you open up your file in Excel first, to
see what it contains and how you can provide further contextual
information for the function to use.

Excel example

read_excel("../data/read_excel_01.xlsx")

## # A tibble: 7 x 5
## `Table 1` ...2 ...3 ...4 ...5
## <lgl> <lgl> <chr> <dbl> <chr>
## 1 NA NA <NA> NA <NA>
## 2 NA NA <NA> NA <NA>
## 3 NA NA <NA> NA <NA>
## 4 NA NA <NA> NA <NA>
## 5 NA NA a 1 m
## 6 NA NA b 2 m
## 7 NA NA c 3 m

▶ In this example, read_excel() needs a little help as the data
seems to be “floating” in the center of the worksheet.

Excel example

read_excel("../data/read_excel_01.xlsx", skip = 5)

## # A tibble: 2 x 3
## a `1` m
## <chr> <dbl> <chr>
## 1 b 2 m
## 2 c 3 m

▶ The skip argument tells read_excel() to skip a given number of
rows before it starts reading.
▶ By default, the function reads the first remaining row as the
header, which is why the row a, 1, m became the column names
here. We can disable this with col_names = FALSE.
▶ Notice that read_excel() uses a col_names argument, instead
of header.

Excel example
Another way is to specify the data range precisely.
▶ We can also supply a set of column names in col_names.

read_excel("../data/read_excel_01.xlsx",
range = "C6:E8", col_names = c("var1", "var2", "var3"))

## # A tibble: 3 x 3
## var1 var2 var3
## <chr> <dbl> <chr>
## 1 a 1 m
## 2 b 2 m
## 3 c 3 m

▶ In case you were wondering, a tibble is an improved version of a
data frame. We shall learn more about it soon.

Example: Workplace injuries
The Excel file Workplace_injuries.xlsx contains data on selected
workplace injuries from 2019 to 2022.
▶ Originally from the Ministry of Manpower (MOM).

injuries <- read_excel("../data/Workplace_injuries.xlsx")
injuries

## # A tibble: 6 x 5
## Type `2019` `2020` `2
## <chr> <dbl> <dbl> <
## 1 Crushing, fractures and dislocations 3107 2577
## 2 Cuts and Bruises 4500 3895
## 3 Sprains & Strains 1982 1791
## 4 Others 2418 1675
## 5 <NA> NA NA
## 6 Notes: Workplace injury numbers include injuries ~ NA NA

To read in the correct range of data, we should specify an appropriate
range.

injuries <- read_excel("../data/Workplace_injuries.xlsx",
                       range = "A1:E5")
injuries

## # A tibble: 4 x 5
## Type `2019` `2020` `2021` `2022`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Crushing, fractures and dislocations 3107 2577 2950 2759
## 2 Cuts and Bruises 4500 3895 4263 4333
## 3 Sprains & Strains 1982 1791 1829 1778
## 4 Others 2418 1675 2100 2022

Common errors

When we first start importing data into R, it’s common to see some
frustrating error messages.
▶ The most common error is:

Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
  cannot open file 'some_file.csv': No such file or directory

▶ This indicates that R cannot find the file you are trying to import.
▶ Check your file path! Perhaps also the spelling of the filename.
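
One way to make this error easier to diagnose is to check the path before reading. The wrapper read_csv_checked() below is a hypothetical helper written for this sketch; it reports both the path and the working directory when the file is missing. The file name no_such_file.csv is chosen so that the check fails on purpose.

```r
# Hypothetical helper: fail early with a more informative message.
read_csv_checked <- function(path, ...) {
  if (!file.exists(path)) {
    stop("File not found: ", path,
         "\n  Working directory: ", getwd(), call. = FALSE)
  }
  read.csv(path, ...)
}

# A path that should not exist, to demonstrate the message.
missing_path <- file.path(tempdir(), "no_such_file.csv")
result <- tryCatch(
  read_csv_checked(missing_path),
  error = function(e) conditionMessage(e)
)
cat(result, "\n")
```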

Summary

We learned how to import data from different formats and sources:

1. CSV files, using read.csv().
2. Flat files, using functions from the readr package.
3. Excel files, using read_excel() from the readxl package.

We also covered a few more ways to clean and visualize data.
