Lec 13

The document discusses preparing tools for data analytics in R, including variables, functions, and packages. It then covers working with data frames, creating and manipulating them, as well as reading in external data from Excel, CSV, and RData files.

Uploaded by

nineeratima
Agenda

1. Preparing the tools for Data Analytics


- Variables, Functions, and Packages

2. To the World of Data Frames


1) Creating variables and data frames

2) Using external data


Preparing the tools for Data Analytics

: Variables, Functions, and Packages



1. Understanding ‘Variable’: Numbers that Vary


Variable

• An attribute that can take different values.

• Variables are the subject of analysis.



Creating Variables

a <- 1
a

## [1] 1

b <- 2
b

## [1] 2

c <- 3
c

## [1] 3

d <- 3.5
d

## [1] 3.5
a+b

## [1] 3

a+b+c

## [1] 6

4/b

## [1] 2

5*b

## [1] 10

Creating Variables Composed of Multiple Values

c()
var1 <- c(1, 2, 5, 7, 8) # creating var1 with 5 values
var1

## [1] 1 2 5 7 8

var2 <- c(1:5) # creating var2 with a sequence from 1 to 5


var2

## [1] 1 2 3 4 5

seq()
var3 <- seq(1, 5) # creating var3 with a sequence from 1 to 5
var3

## [1] 1 2 3 4 5

var4 <- seq(1, 10, by = 2) # creating var4 with a sequence from 1 to 10 and with an interval of 2
var4

## [1] 1 3 5 7 9

var5 <- seq(1, 10, by = 3) # creating var5 with a sequence from 1 to 10 and with an interval of 3
var5

## [1] 1 4 7 10
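
seq() accepts other parameters besides by. As a small sketch (length.out and a negative by are standard seq() options, although the lecture does not cover them; the variable names are only illustrative):

```r
# Ask for a fixed number of evenly spaced values instead of a fixed interval
var_len <- seq(1, 10, length.out = 4)
var_len   # 1 4 7 10

# A negative interval counts downwards
var_down <- seq(10, 1, by = -3)
var_down  # 10 7 4 1
```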

Calculations with Variables that Hold Multiple Values


var1

## [1] 1 2 5 7 8

var1+2

## [1] 3 4 7 9 10

var1

## [1] 1 2 5 7 8

var2

## [1] 1 2 3 4 5

var1+var2

## [1] 2 4 8 11 13
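
The calculations above are applied element by element. A minimal sketch, reusing var1 and var2 from the slides (manual_sum and scaled are illustrative names), compares the vectorized result with an explicit loop:

```r
var1 <- c(1, 2, 5, 7, 8)
var2 <- c(1:5)

# Vectorized addition: each element of var1 is added to the matching element of var2
vector_sum <- var1 + var2

# The same calculation written as an explicit loop, for comparison
manual_sum <- numeric(length(var1))
for (i in seq_along(var1)) {
  manual_sum[i] <- var1[i] + var2[i]
}

identical(vector_sum, manual_sum)  # TRUE

# A single number is applied to every element
scaled <- var1 * 2
scaled  # 2 4 10 14 16
```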

Creating a Character Value


str1 <- "a"
str1

## [1] "a"

str2 <- "text"


str2

## [1] "text"

str3 <- "Hello World!"


str3

## [1] "Hello World!"



Creating Character Variables with Multiple Values


str4 <- c("a", "b", "c")
str4

## [1] "a" "b" "c"

str5 <- c("Hello!", "World", "is", "good!")


str5

## [1] "Hello!" "World" "is" "good!"



Arithmetic on Character Variables is Not Possible


str1+2

## Error in str1 + 2: non-numeric argument to binary operator
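
The error occurs because str1 holds text, not a number. As a hedged sketch (num_text is a made-up example, not from the lecture), a variable's type can be checked before calculating, and text that consists of digits can be converted first:

```r
str1 <- "a"
num_text <- "3"   # hypothetical example: a number stored as text

class(str1)        # "character"
is.numeric(str1)   # FALSE

# Text made of digits can be converted before doing arithmetic
converted_sum <- as.numeric(num_text) + 2
converted_sum      # 5

# Text that is not a number becomes NA (R normally prints a warning as well)
not_a_number <- suppressWarnings(as.numeric(str1))
not_a_number       # NA
```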



2. Understanding the Magic Box: Function


Function
• A function takes an input, applies a particular operation, and returns an output.

Using Functions on Numbers

# Creating a variable
x <- c(1, 2, 3)
x

## [1] 1 2 3

# Applying function
mean(x)

## [1] 2

max(x)

## [1] 3

min(x)

## [1] 1

Using Functions on Characters


str5

## [1] "Hello!" "World" "is" "good!"

paste(str5, collapse = ",") # Merging the values of str5, with a comma as the separator

## [1] "Hello!,World,is,good!"

Setting the Options of a Function - Parameters


paste(str5, collapse = " ")

## [1] "Hello! World is good!"
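
Parameters are set by name inside the parentheses. The sketch below uses only base R functions; round() and its digits parameter are not part of the lecture and are included only as a second illustration:

```r
str5 <- c("Hello!", "World", "is", "good!")

# The collapse parameter decides what is placed between the merged values
dashed <- paste(str5, collapse = "-")
dashed     # "Hello!-World-is-good!"

joined <- paste(str5, collapse = "")
joined     # "Hello!Worldisgood!"

# Another function with a parameter: digits controls the number of decimal places
rounded <- round(3.14159, digits = 2)
rounded    # 3.14
```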

Creating a New Variable with the Outputs from the Function


x_mean <- mean(x)
x_mean

## [1] 2

str5_paste <- paste(str5, collapse = " ")


str5_paste

## [1] "Hello! World is good!"



3. Understanding Packages: A Bundle of Functions


Packages
• A package is a bundle of functions.

• One package contains many functions.

• To use a function from a package, you must first install the package and then load it.


Installing and Loading ggplot2 package
install.packages("ggplot2") # Installing ggplot2
library(ggplot2) # Loading ggplot2
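
Installation is needed only once per computer, while loading is needed in every new R session. One possible way to combine the two steps is sketched below; load_package is a hypothetical helper name, not something the lecture or R itself defines:

```r
# Hypothetical helper: install a package only when it is missing, then load it
load_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # needs an internet connection and a CRAN mirror
  }
  library(pkg, character.only = TRUE)
}

# "stats" ships with R, so this call loads it without installing anything
load_package("stats")
"package:stats" %in% search()   # TRUE
```

Calling load_package("ggplot2") would install ggplot2 on first use and simply load it afterwards.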

# Creating a variable with multiple values


x <- c("a", "a", "b", "c")
x

## [1] "a" "a" "b" "c"

# Print frequency graph


qplot(x)

Creating a Graph Using mpg Data of ggplot2


# Use mpg as the data and hwy as the x-axis
qplot(data = mpg, x = hwy)

Changing Parameters of qplot()


# cty as x-axis
qplot(data = mpg, x = cty)

# drv as x-axis, hwy as y-axis


qplot(data = mpg, x = drv, y = hwy)

# A line graph with drv as x-axis, hwy as y-axis


qplot(data = mpg, x = drv, y = hwy, geom = "line")

# A boxplot with drv as x-axis, hwy as y-axis


qplot(data = mpg, x = drv, y = hwy, geom = "boxplot")

# boxplot with drv as x-axis, hwy as y-axis, color by drv


qplot(data = mpg, x = drv, y = hwy, geom = "boxplot", colour = drv)
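
In ggplot2 3.4.0 and later, qplot() is marked as deprecated, so the calls above may print a warning. As a sketch under that assumption, the last boxplot can also be written with ggplot(), aes(), and geom_boxplot(), which are standard ggplot2 functions; the object name p_box is only illustrative:

```r
# Runs only if ggplot2 is installed, so the block is safe to paste as-is
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Same plot as the last qplot() call: drv on the x-axis, hwy on the y-axis,
  # with the boxes coloured by drv
  p_box <- ggplot(data = mpg, aes(x = drv, y = hwy, colour = drv)) +
    geom_boxplot()

  print(p_box)   # draws the plot
}
```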

Use the help function to find out what a function does


?qplot
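
Besides the ? shortcut, base R offers other ways to inspect a function. A brief sketch using paste() from the earlier slides (param_names is an illustrative variable name):

```r
# help("paste") opens the same help page as ?paste

# args() prints a function's parameters and their default values
args(paste)

# The parameter names can also be collected as a character vector
param_names <- names(formals(paste))
"collapse" %in% param_names   # TRUE
```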

Individual Exercise

Q1. Creating a test score variable


Five students took an exam. Create and print a variable that contains the exam scores of the five students.
The students' scores were as follows:
80, 60, 70, 50, 90

Q2. Calculate the mean


Using the variable created above, compute the mean of the test scores.

Q3. Create and print a variable that contains the mean value


Create and print a variable that contains the mean test score. Reuse the code from the previous questions.


To the World of Data Frames!

1. Understanding Data Frames - What does the Data Look Like?


A Data Frame

Data Frame

• A ‘column’ is an attribute (a variable)
• A ‘row’ holds the information for one person (one observation)

Large data = Many rows or many columns
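
To make the row and column idea concrete, here is a minimal sketch with a made-up data frame (df_people and its values are invented for illustration). nrow(), ncol(), dim(), and str() are base R functions that report a data frame's size and structure:

```r
# A small made-up data frame: one row per person, one column per attribute
df_people <- data.frame(name  = c("Kim", "Lee", "Park"),
                        age   = c(23, 31, 27),
                        score = c(90, 80, 70))

nrow(df_people)   # 3 rows (people)
ncol(df_people)   # 3 columns (attributes)
dim(df_people)    # 3 3
str(df_people)    # compact overview of each column's type and first values
```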



2. Creating Data Frames – Let’s Create a Test Score Data!

Creating a data frame through data entry


english <- c(90, 80, 60, 70) # Creating English score variable
english

## [1] 90 80 60 70

math <- c(50, 60, 100, 20) # Creating math score variable
math

## [1] 50 60 100 20

# Combining english and math into a data frame and assigning it to df_midterm
df_midterm <- data.frame(english, math)
df_midterm

##   english math
## 1      90   50
## 2      80   60
## 3      60  100
## 4      70   20

class <- c(1, 1, 2, 2)
class

## [1] 1 1 2 2

df_midterm <- data.frame(english, math, class)


df_midterm

##   english math class
## 1      90   50     1
## 2      80   60     1
## 3      60  100     2
## 4      70   20     2

mean(df_midterm$english) # calculating the mean of ‘english’ from df_midterm

## [1] 75

mean(df_midterm$math) # calculating the mean of ‘math’ from df_midterm

## [1] 57.5
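
The $ notation picks one column at a time. As a sketch of two further options (colMeans() and tapply() are base R functions that the lecture does not cover; score_means and english_by_class are illustrative names), several means can be computed in one call, or split by class:

```r
df_midterm <- data.frame(english = c(90, 80, 60, 70),
                         math    = c(50, 60, 100, 20),
                         class   = c(1, 1, 2, 2))

# Means of several columns at once
score_means <- colMeans(df_midterm[, c("english", "math")])
score_means        # english 75.0, math 57.5

# Mean English score within each class
english_by_class <- tapply(df_midterm$english, df_midterm$class, mean)
english_by_class   # class 1: 85, class 2: 65
```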

Creating a data frame in a single step


df_midterm <- data.frame(english = c(90, 80, 60, 70),
                         math = c(50, 60, 100, 20),
                         class = c(1, 1, 2, 2))
df_midterm

##   english math class
## 1      90   50     1
## 2      80   60     1
## 3      60  100     2
## 4      70   20     2

Individual Exercise

Q1. Using c() and data.frame() together, create and print the following data frame.

fruit       price  volume
Apple       1800   24
Strawberry  1500   38
Watermelon  3000   13

Q2. Using the data frame created above, compute the means of the price and volume.


3. Using External Data – Read-in Large Test Score Data!

Loading an Excel file


# Install readxl package
install.packages("readxl")

# Load readxl package


library(readxl)
df_exam <- read_excel("excel_exam.xlsx") # Read the Excel file and assign it to df_exam
df_exam # Print

## # A tibble: 20 x 5
##       id class  math english science
##    <dbl> <dbl> <dbl>   <dbl>   <dbl>
##  1     1     1    50      98      50
##  2     2     1    60      97      60
##  3     3     1    45      86      78
##  4     4     1    30      98      58
##  5     5     2    25      80      65
##  6     6     2    50      89      98
##  7     7     2    80      90      45
##  8     8     2    90      78      25
##  9     9     3    20      98      15
## 10    10     3    50      98      45
## 11    11     3    65      65      65
## 12    12     3    45      85      32
## 13    13     4    46      98      65
## 14    14     4    48      87      12
## 15    15     4    75      56      78
## 16    16     4    58      98      65
## 17    17     5    65      68      98
## 18    18     5    80      78      90
## 19    19     5    89      68      87
## 20    20     5    78      83      58

mean(df_exam$english)

## [1] 84.9

mean(df_exam$science)

## [1] 59.45

Manually setting the file path


df_exam <- read_excel("d:/easy_r/excel_exam.xlsx")

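[Note] When only a file name is given, the file must be in the working directory. A full path such as the one above works regardless of the working directory.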
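
A short sketch of how to check where R is looking for files. getwd(), file.path(), and file.exists() are base R functions; the folder d:/easy_r is taken from the slide and will differ on other machines, and excel_path is an illustrative name:

```r
getwd()   # prints the current working directory

# file.path() joins folder and file names with "/" as the separator
excel_path <- file.path("d:/easy_r", "excel_exam.xlsx")
excel_path   # "d:/easy_r/excel_exam.xlsx"

# Check that the file is really there before trying to read it
if (file.exists(excel_path)) {
  message("Found: ", excel_path)
} else {
  message("Not found: ", excel_path, " (check the path or the working directory)")
}
```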



What if the first row does not contain variable names?


df_exam_novar <- read_excel("excel_exam_novar.xlsx", col_names = FALSE)
df_exam_novar

What if the Excel file contains multiple sheets?


df_exam_sheet <- read_excel("excel_exam_sheet.xlsx", sheet = 3)
df_exam_sheet

Loading a CSV file


• CSV: a widely supported, general-purpose data format.

• Values are separated by commas.

• Files are small and can be opened by many software programs.


df_csv_exam <- read.csv("csv_exam.csv")
df_csv_exam

##    id class math english science
## 1   1     1   50      98      50
## 2   2     1   60      97      60
## 3   3     1   45      86      78
## 4   4     1   30      98      58
## 5   5     2   25      80      65
## 6   6     2   50      89      98
## 7   7     2   80      90      45
## 8   8     2   90      78      25
## 9   9     3   20      98      15
## 10 10     3   50      98      45
## 11 11     3   65      65      65
## 12 12     3   45      85      32
## 13 13     4   46      98      65
## 14 14     4   48      87      12
## 15 15     4   75      56      78
## 16 16     4   58      98      65
## 17 17     5   65      68      98
## 18 18     5   80      78      90
## 19 19     5   89      68      87
## 20 20     5   78      83      58

When loading files that contain text, set stringsAsFactors = FALSE (this is the default since R 4.0.0)


df_csv_exam <- read.csv("csv_exam.csv", stringsAsFactors = FALSE)
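
The effect of this option can be seen with a tiny example. The sketch below writes a made-up two-row CSV to a temporary file (the file contents and variable names are invented for illustration) and reads it back both ways:

```r
# Write a tiny made-up CSV with a text column to a temporary file
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("name,score", "Kim,90", "Lee,80"), tmp_csv)

as_text   <- read.csv(tmp_csv, stringsAsFactors = FALSE)
as_factor <- read.csv(tmp_csv, stringsAsFactors = TRUE)

class(as_text$name)     # "character"
class(as_factor$name)   # "factor"

unlink(tmp_csv)   # remove the temporary file
```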

Saving a data frame as a CSV file


df_midterm <- data.frame(english = c(90, 80, 60, 70),
                         math = c(50, 60, 100, 20),
                         class = c(1, 1, 2, 2))
df_midterm

##   english math class
## 1      90   50     1
## 2      80   60     1
## 3      60  100     2
## 4      70   20     2

write.csv(df_midterm, file = "df_midterm.csv")
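
By default, write.csv() also writes the row numbers as an extra first column. A hedged sketch of the difference, using a temporary file instead of the project folder (row.names is a standard write.csv() parameter that the lecture does not mention; the variable names are illustrative):

```r
df_midterm <- data.frame(english = c(90, 80, 60, 70),
                         math    = c(50, 60, 100, 20),
                         class   = c(1, 1, 2, 2))

tmp_csv <- tempfile(fileext = ".csv")

# Default: row numbers are written as an unnamed first column,
# which read.csv() then names "X"
write.csv(df_midterm, file = tmp_csv)
cols_default <- names(read.csv(tmp_csv))
cols_default   # "X" "english" "math" "class"

# row.names = FALSE writes only the actual columns
write.csv(df_midterm, file = tmp_csv, row.names = FALSE)
cols_clean <- names(read.csv(tmp_csv))
cols_clean     # "english" "math" "class"

unlink(tmp_csv)   # remove the temporary file
```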



Using RData files


• A data file format specific to R

• Small in size and quick to process

Saving a data frame as RData


save(df_midterm, file = "df_midterm.rda")

Loading RData

rm(df_midterm) # Remove df_midterm to show that load() restores it
df_midterm

## Error in eval(expr, envir, enclos): object 'df_midterm' not found

load("df_midterm.rda")
df_midterm

##   english math class
## 1      90   50     1
## 2      80   60     1
## 3      60  100     2
## 4      70   20     2

Differences between loading the file types


• Excel and CSV files must be assigned to a variable when they are loaded.

• Loading an Rda file restores the saved data frame under its original name, so no assignment is needed.

# Loading the excel file and assigning it to df_exam


df_exam <- read_excel("excel_exam.xlsx")

# Loading the csv file and assigning it to df_csv_exam


df_csv_exam <- read.csv("csv_exam.csv")

# Loading Rda file


load("df_midterm.rda")
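
The reason no assignment is needed can be checked directly: load() recreates each saved object under its original name and returns those names. A minimal sketch using a temporary file (tmp_rda and restored are illustrative names):

```r
df_midterm <- data.frame(english = c(90, 80, 60, 70),
                         math    = c(50, 60, 100, 20),
                         class   = c(1, 1, 2, 2))

tmp_rda <- tempfile(fileext = ".rda")
save(df_midterm, file = tmp_rda)
rm(df_midterm)

# load() recreates the object under its saved name and returns that name
restored <- load(tmp_rda)
restored               # "df_midterm"
exists("df_midterm")   # TRUE

unlink(tmp_rda)   # remove the temporary file
```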

Wrap-up

# 1. Creating variables and data frames
english <- c(90, 80, 60, 70) # Create the ‘english’ variable
math <- c(50, 60, 100, 20) # Create the ‘math’ variable
df_midterm <- data.frame(english, math) # Create a data frame and assign it to df_midterm

# 2. Using external data

# Excel file
library(readxl) # Load the readxl package
df_exam <- read_excel("excel_exam.xlsx") # Load an Excel file

# CSV file
df_csv_exam <- read.csv("csv_exam.csv") # Load a CSV file
write.csv(df_midterm, file = "df_midterm.csv") # Save as a CSV file

# Rda file
save(df_midterm, file = "df_midterm.rda") # Save as an Rda file
load("df_midterm.rda") # Load an Rda file
