
Data Handling and Manipulation

This document provides an overview of reading, cleaning, and manipulating data in R. It introduces the dplyr package for transforming and summarizing tabular data. Key functions covered include filter() for subsetting rows, select() for subsetting columns, arrange() for ordering data, mutate() for creating new variables, and summarize() for obtaining summary statistics. Examples are given to illustrate the use of each function on sample data.

DATA HANDLING AND DATA MANIPULATION IN R
Outline
• Reading Data
• Know your Data
• Cleaning Data
• Introduction to dplyr package
• Key functions for data manipulation
Reading Data from Flat Files
• # Using read.table()
loan <- read.table("loans data.csv",header = TRUE,
sep = ",")
str(loan)

• # Using read.csv()
loan <- read.csv("loans data.csv")
str(loan)
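
As a self-contained sketch of the two calls above, the following writes a tiny CSV to a temporary file and reads it back both ways. The column names and values (`amount`, `term`) are made up for illustration and are not taken from the loans data.

```r
# Hypothetical sample rows; the real "loans data.csv" is not reproduced here.
sample_rows <- data.frame(
  amount = c(5000, 12000, 7500),
  term   = c(36, 60, 36)
)

# Write the rows to a temporary CSV file so the example can run anywhere.
csv_path <- tempfile(fileext = ".csv")
write.csv(sample_rows, csv_path, row.names = FALSE)

# read.table() needs the header and separator spelled out ...
loan_a <- read.table(csv_path, header = TRUE, sep = ",")

# ... while read.csv() already defaults to header = TRUE and sep = ",".
loan_b <- read.csv(csv_path)

str(loan_a)
all.equal(loan_a, loan_b)  # compare the two results
```

For a plain comma-separated file, read.csv() is simply the shorter way to write the read.table() call.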
Know your data
• # dimensions of the data: number of rows and columns
dim(loan)

• # structure of the data: dimensions, column names, and data types
str(loan)

• # view the data in a table
View(loan)

• # fetch the first 6 rows
head(loan)

• # fetch the last 6 rows
tail(loan)

• # summary of the data
summary(loan)
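
The same inspection calls can be tried without any data file. This sketch uses the built-in mtcars data frame as a stand-in for the loans data (an assumption made only so the code runs as-is). View() is left out because it opens an interactive viewer.

```r
# mtcars ships with base R, so no file is needed.
dims <- dim(mtcars)      # c(number of rows, number of columns)
dims

str(mtcars)              # column names and their data types

head(mtcars, 3)          # first 3 rows instead of the default 6
tail(mtcars, 3)          # last 3 rows

summary(mtcars$mpg)      # summary() also works on a single column
```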
Cleaning Data – Omitting NA values
• # check whether the data contains any missing values
any(is.na(loan))

• # count the total number of missing values in the data
sum(is.na(loan))

• # remove rows containing NA values (two ways that keep the same rows)
loan_clean <- na.omit(loan)
loan_clean <- loan[complete.cases(loan), ]
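
To see the NA checks and both removal approaches side by side, here is a minimal sketch on a small hypothetical data frame (the `income` and `grade` columns are invented for illustration):

```r
# Hypothetical data frame with one missing value in each column.
df <- data.frame(
  income = c(52000, NA, 61000, 45000),
  grade  = c("A", "B", NA, "C")
)

any(is.na(df))        # is there at least one NA?
sum(is.na(df))        # how many cells are NA in total?
colSums(is.na(df))    # NA count per column

clean_a <- na.omit(df)
clean_b <- df[complete.cases(df), ]

# Both approaches keep the same rows (here rows 1 and 4).
identical(rownames(clean_a), rownames(clean_b))
```

One difference worth knowing: na.omit() also records which rows were dropped in an "na.action" attribute, while the complete.cases() version returns a plain subset.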
Introduction to dplyr package
• dplyr is an R package for transforming and summarizing tabular data (data organized in rows and columns)
• It provides five basic functions for common data manipulation tasks:
• Filtering rows: filter()
• Selecting columns: select()
• Arranging rows in order: arrange()
• Creating new columns: mutate()
• Summarizing data: summarize()
Subset of Rows
filter()
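• Used to keep only the rows that satisfy the given condition(s)

• Example
• # cars is assumed here to hold the mtcars data (e.g. cars <- mtcars);
• # the built-in cars dataset has no mpg or hp column
• # filtering based on a single condition
filter(cars, mpg > 25)
• # filtering based on multiple conditions (all must hold)
filter(cars, mpg > 25, hp > 90)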
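
The following sketch shows how multiple conditions combine. It assumes the `cars` object in these slides is a copy of the built-in mtcars data, since the column names mpg and hp match that dataset.

```r
library(dplyr)

# Assumption: the slides' `cars` is a copy of mtcars.
cars <- mtcars

# Comma-separated conditions are combined with AND ...
both <- filter(cars, mpg > 25, hp > 90)

# ... which is the same as writing & explicitly.
both_amp <- filter(cars, mpg > 25 & hp > 90)

# Use | when either condition is enough.
either <- filter(cars, mpg > 25 | hp > 90)

nrow(both)
nrow(either)
```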
Sampling
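• # Randomly select a fraction of rows
sample_frac(cars, 0.2)

• # Randomly select a number of rows
sample_n(cars, 5)

• # slice_sample() does the same job and supersedes sample_n() and sample_frac()
slice_sample(cars, n = 5)

• # Randomly select 2 rows from each cyl group
cars %>%
  group_by(cyl) %>%
  slice_sample(n = 2)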
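
A minimal sketch of the sampling calls using slice_sample() only, again assuming `cars` is a copy of mtcars. The seed value is arbitrary and is there only to make the random draws repeatable.

```r
library(dplyr)

# Assumption: the slides' `cars` is a copy of mtcars.
cars <- mtcars
set.seed(42)  # arbitrary seed so repeated runs draw the same rows

# A fraction of the rows (prop) or a fixed number of rows (n).
frac_rows <- slice_sample(cars, prop = 0.2)
n_rows    <- slice_sample(cars, n = 5)

# In a pipe the data frame arrives as the first argument,
# so it is not passed to slice_sample() again.
per_group <- cars %>%
  group_by(cyl) %>%
  slice_sample(n = 2)

count(per_group, cyl)  # number of sampled rows in each cyl group
```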
Subset of Columns
select()
• Used to select a subset of columns

• Example
• # selecting a single column
select(cars, mpg)
• # selecting multiple columns
select(cars, mpg, cyl)
select(cars, c("mpg", "cyl", "gear"))
• # combining filter and select using the pipe operator
cars %>%
  filter(mpg > 18) %>%
  select(mpg, cyl)
select()
• # selecting columns starting with 'd'
select(cars, starts_with('d'))

• # selecting columns ending with 't'
select(cars, ends_with('t'))

• # selecting columns containing 'g'
select(cars, contains('g'))

• # selecting columns matching a regular expression
select(cars, matches('..a.'))
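
To make the helper functions concrete, this sketch lists which column names each one picks, assuming `cars` is a copy of mtcars (columns mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb).

```r
library(dplyr)

# Assumption: the slides' `cars` is a copy of mtcars.
cars <- mtcars

d_cols <- names(select(cars, starts_with("d")))   # "disp" "drat"
t_cols <- names(select(cars, ends_with("t")))     # "drat" "wt"
g_cols <- names(select(cars, contains("g")))      # "mpg"  "gear"

# '..a.' means: any two characters, then 'a', then any one character.
a_cols <- names(select(cars, matches("..a.")))    # "drat" "gear"

d_cols; t_cols; g_cols; a_cols
```

Selected columns come back in their original column order, not in the order the pattern happens to match.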
Arranging Data
arrange()
• Used to sort the rows in ascending or descending order of one or more columns

• Example
• # arrange the data in ascending order of mpg
arrange(cars, mpg)
• # arrange the data in descending order of mpg
arrange(cars, desc(mpg))
• # arrange the data by more than one column (ties in mpg are ordered by disp)
arrange(cars, mpg, disp)
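
The sketch below shows multi-column ordering and mixing ascending with descending order, assuming `cars` is a copy of mtcars.

```r
library(dplyr)

# Assumption: the slides' `cars` is a copy of mtcars.
cars <- mtcars

# Sort by mpg; rows with equal mpg are then sorted by disp.
by_mpg <- arrange(cars, mpg, disp)

# Directions can be mixed: cyl ascending, mpg descending within each cyl.
mixed <- arrange(cars, cyl, desc(mpg))

head(select(by_mpg, mpg, disp), 4)
head(select(mixed, cyl, mpg), 4)
```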
Making new variables
mutate()
• Used to create a new column
• Example
• # creating a new column
mutate(cars, newvar = disp + hp)
• # combining functions
• # create a new variable that sums disp and hp, keep only
• # the rows where mpg > 25 and disp > 90,
• # and select only mpg, disp, hp, newvar
cars %>%
  mutate(newvar = disp + hp) %>%
  filter(mpg > 25, disp > 90) %>%
  select(mpg, disp, hp, newvar)
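
A runnable version of the pipeline above, assuming `cars` is a copy of mtcars. It also checks that mutate() returns a new data frame and leaves `cars` unchanged unless the result is assigned back.

```r
library(dplyr)

# Assumption: the slides' `cars` is a copy of mtcars.
cars <- mtcars

res <- cars %>%
  mutate(newvar = disp + hp) %>%      # add the new column
  filter(mpg > 25, disp > 90) %>%     # keep rows meeting both conditions
  select(mpg, disp, hp, newvar)       # keep only these four columns

res

# The original data frame has no newvar column; only `res` does.
"newvar" %in% names(cars)
"newvar" %in% names(res)
```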
Summarizing data
summarize()
• Used to get summary statistics of the data, such as count, mean, median, and standard deviation
• Example
• # summarizing data
• # group_by() is commonly used together with summarize() to get one row of
• # statistics per group; the grouping column is usually categorical
cars %>%
  group_by(cyl) %>%
  summarize(count = n()) # number of rows for each cyl value
• # computing max, min, and standard deviation of mpg for each cyl group
cars %>%
  group_by(cyl) %>%
  summarize(mx_mpg = max(mpg), mi_mpg = min(mpg), std_mpg = sd(mpg))
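
As a sketch of how grouping changes the result, the following compares summarize() with and without group_by(), assuming `cars` is a copy of mtcars. The output column names (`mean_mpg`, `median_mpg`, `sd_mpg`) are chosen here for illustration.

```r
library(dplyr)

# Assumption: the slides' `cars` is a copy of mtcars.
cars <- mtcars

# Without group_by(): a single row describing the whole data frame.
overall <- summarize(cars, count = n(), mean_mpg = mean(mpg))

# With group_by(): one row per distinct value of cyl.
by_cyl <- cars %>%
  group_by(cyl) %>%
  summarize(
    count      = n(),
    mean_mpg   = mean(mpg),
    median_mpg = median(mpg),
    sd_mpg     = sd(mpg)
  )

overall
by_cyl
```

The group counts add up to the overall count, which is a quick sanity check on any grouped summary.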
