Data Handling and Manipulation in R
Outline
• Reading Data
• Know your Data
• Cleaning Data
• Introduction to dplyr package
• Key functions for data manipulation
Reading Data from Flat Files
• # Using read.table()
loan <- read.table("loans data.csv",header = TRUE,
sep = ",")
str(loan)
• # Using read.csv()
loan <- read.csv("loans data.csv")
str(loan)
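The two calls above can be compared with a small self-contained sketch. The contents of "loans data.csv" are not shown in these slides, so the data frame, its column names (amount, term) and the temporary file below are all hypothetical stand-ins.

```r
# Hypothetical data standing in for the loans file
demo <- data.frame(amount = c(1000, 2500, 4000),
                   term   = c("36 months", "60 months", "36 months"))

# Write it to a temporary CSV so the example does not need an external file
path <- tempfile(fileext = ".csv")
write.csv(demo, path, row.names = FALSE)

# read.table() needs the header and the separator spelled out
loan_a <- read.table(path, header = TRUE, sep = ",")

# read.csv() wraps read.table() with header = TRUE and sep = "," as defaults
loan_b <- read.csv(path)

str(loan_a)
all.equal(loan_a, loan_b)
```

For a plain comma-separated file like this one, both calls should return the same data frame; read.csv() is simply the shorter way to write it.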
Know your data
• #dimension of the data – no. of rows and columns
dim(loan)
• #structure of the data – dimensions, columns and their data types
str(loan)
• #cleaning NA values
• #omitting rows that contain NA values – either line below keeps the same rows
loan_clean <- na.omit(loan)
loan_clean <- loan[complete.cases(loan),]
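A minimal sketch of inspecting and removing missing values, using a made-up data frame (the columns id, income and grade are hypothetical, not taken from the loans file).

```r
# Hypothetical data frame with two incomplete rows
df <- data.frame(id     = 1:4,
                 income = c(50, NA, 70, 80),
                 grade  = c("A", "B", NA, "A"))

dim(df)                # rows and columns before cleaning
colSums(is.na(df))     # number of NA values in each column

# Two ways to keep only the complete rows
clean_a <- na.omit(df)
clean_b <- df[complete.cases(df), ]

dim(clean_a)           # rows and columns after cleaning
```

Both approaches keep the same rows. na.omit() additionally records which rows were dropped in an na.action attribute, so the two results are not strictly identical objects.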
Introduction to dplyr package
• dplyr is an R package for transforming and
summarizing tabular data (rows and columns)
• It provides five basic functions for
common data manipulation tasks:
• Filtering rows: filter()
• Selecting columns: select()
• Arranging rows in order: arrange()
• Creating new columns: mutate()
• Summarizing data: summarize()
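As a rough sketch of how the five functions fit together, the pipeline below chains all of them. It assumes dplyr is installed and uses the mtcars data set that ships with R as a stand-in; the column name hp_per_wt is invented for illustration.

```r
library(dplyr)

result <- mtcars %>%
  filter(hp > 90) %>%                      # filtering rows
  select(mpg, cyl, hp, wt) %>%             # selecting columns
  mutate(hp_per_wt = hp / wt) %>%          # creating a new column
  arrange(desc(hp_per_wt)) %>%             # arranging in order
  group_by(cyl) %>%
  summarize(mean_ratio = mean(hp_per_wt))  # summarizing data

result
```

Each function takes a data frame as its first argument and returns a data frame, which is what makes chaining with %>% possible.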
Subset of Rows
filter()
• Used to filter rows satisfying condition(s)
• Example
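• # cars is assumed to hold the mtcars columns (e.g. cars <- mtcars);
• # the built-in cars data set has only speed and dist
• # filtering based on a single condition
filter(cars, mpg>25)
• # filtering based on multiple conditions (all must hold)
filter(cars, mpg>25,hp >90)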
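The filter() examples can be extended with a short sketch of other ways to combine conditions. It assumes that cars is a copy of mtcars, which the slides do not state explicitly.

```r
library(dplyr)

cars <- mtcars  # assumption: the slides' `cars` has the mtcars columns

# Comma-separated conditions are combined with AND
both <- filter(cars, mpg > 25, hp > 90)

# Use | for OR
either <- filter(cars, mpg > 25 | hp > 200)

# Use %in% to match any value from a set
four_or_six <- filter(cars, cyl %in% c(4, 6))

nrow(both)
nrow(either)
nrow(four_or_six)
```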
Sampling
• # Randomly select a fraction of rows
sample_frac(cars,0.2)
• # Randomly select a fixed number of rows
slice_sample(cars,n=5)
• # Randomly select rows within each group
cars %>%
  group_by(cyl) %>%
  slice_sample(n=2)
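A small sketch of reproducible sampling, again assuming cars is a copy of mtcars. In current dplyr releases sample_frac() is superseded by slice_sample() with the prop argument, so that form is used here.

```r
library(dplyr)

cars <- mtcars  # assumption: the slides' `cars` has the mtcars columns
set.seed(42)    # fix the seed so the same rows are drawn on every run

# 20% of the rows, the slice_sample() equivalent of sample_frac(cars, 0.2)
frac <- slice_sample(cars, prop = 0.2)

# Two random rows from each cyl group; the data comes in through the pipe,
# so it is not passed again inside slice_sample()
per_group <- cars %>%
  group_by(cyl) %>%
  slice_sample(n = 2)

nrow(frac)
nrow(per_group)
```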
Subset of Columns
select()
• Used to select subset of columns
• Example
• # selecting single column
select(cars,mpg)
• # selecting multiple columns
select(cars,mpg,cyl)
select(cars,c("mpg","cyl","gear"))
• # combining filter and select using the pipe operator
cars %>%
  filter(mpg>18) %>%
  select(mpg,cyl)
• #selecting columns starting with 'd'
select(cars,starts_with('d'))
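Besides starts_with(), select() accepts several other helpers. The sketch below shows a few of them, assuming cars is a copy of mtcars.

```r
library(dplyr)

cars <- mtcars  # assumption: the slides' `cars` has the mtcars columns

d_cols   <- select(cars, starts_with("d"))  # disp, drat
t_cols   <- select(cars, ends_with("t"))    # drat, wt
ar_cols  <- select(cars, contains("ar"))    # gear, carb
no_mpg   <- select(cars, -mpg)              # every column except mpg
range_cs <- select(cars, mpg:hp)            # adjacent columns mpg through hp

names(range_cs)
```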
Ordering Rows
arrange()
• Used to order rows by the values of one or more columns
• Example
• # arrange the data in ascending order of mpg
arrange(cars,mpg)
• #arrange the data in descending order of mpg
arrange(cars,desc(mpg))
• #arrange the data in order based on more than one column
arrange(cars, mpg,disp)
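Ascending and descending keys can be mixed in one call. A minimal sketch, assuming cars is a copy of mtcars:

```r
library(dplyr)

cars <- mtcars  # assumption: the slides' `cars` has the mtcars columns

# Sort by cyl ascending, then by mpg descending within each cyl value
sorted <- arrange(cars, cyl, desc(mpg))

head(select(sorted, cyl, mpg), 5)
```

Later columns only break ties in the earlier ones, so the order of the arguments matters.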
Making new variables
mutate()
• Used to create a new column
• Example
• # creating a new column
mutate(cars,newvar=disp+hp )
• # combining functions
• # create a new variable that sums disp and hp, keep only
• # the rows where mpg>25 and disp>90,
• # and select only mpg, disp, hp, newvar
cars %>%
  mutate(newvar = disp+hp) %>%
  filter(mpg>25,disp>90) %>%
  select(mpg,disp,hp,newvar)
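mutate() can also create several columns in one call, and a later column may use one created earlier in the same call. The sketch below assumes cars is a copy of mtcars; the names ratio and economy are invented for illustration.

```r
library(dplyr)

cars <- mtcars  # assumption: the slides' `cars` has the mtcars columns

cars2 <- cars %>%
  mutate(
    newvar  = disp + hp,                          # same sum as in the slide
    ratio   = newvar / wt,                        # uses newvar defined just above
    economy = if_else(mpg > 25, "high", "other")  # hypothetical label column
  )

head(select(cars2, mpg, newvar, ratio, economy))
```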
Summarizing data
summarize()
• Used to get summary statistics of the data such as count,
mean, median, standard deviation
• Example
• #summarizing data
• #group_by() is usually combined with summarize() to get one
• #summary row per group; group on a categorical column
cars %>%
  group_by(cyl) %>%
  summarize(count=n()) #number of rows for each cyl value
• #computing max, min and standard deviation of mpg for each cyl value
cars %>%
  group_by(cyl) %>%
  summarize(
    mx_mpg=max(mpg),mi_mpg=min(mpg),std_mpg=sd(mpg))
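To show what grouping changes, the sketch below runs summarize() with and without group_by(). It assumes cars is a copy of mtcars.

```r
library(dplyr)

cars <- mtcars  # assumption: the slides' `cars` has the mtcars columns

# Without group_by(), summarize() collapses the whole table into one row
overall <- cars %>%
  summarize(count = n(), mean_mpg = mean(mpg), median_mpg = median(mpg))

# With group_by(), there is one row per cyl value
by_cyl <- cars %>%
  group_by(cyl) %>%
  summarize(count = n(), mean_mpg = mean(mpg), .groups = "drop")

overall
by_cyl
```

Setting .groups = "drop" returns an ungrouped result, which avoids surprises if more functions are chained afterwards.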