04 Data Cleaning in R

This document discusses steps for data cleansing in R, including checking for missing values and outliers, handling NA values, imputing missing data, removing duplicates, and transforming variables. It also covers splitting data into training and test sets, and sending output and graphical results to files. The key steps are checking for issues, exploring and handling NA values using functions like is.na() and na.omit(), imputing missing data with mean or median, removing outliers using IQR, and transforming variables. The data is then split into train and test sets before results are output to text and image files.

4- Data cleansing in R

Dr Akhter Raza
Review 1

Difference between parametric and non-parametric statistics?

Business Statistics: Data Cleaning 2


Review 2

Difference between descriptive and inferential statistics?


Review 3

Difference between Parameter and statistic?


Data Munging

Data munging, sometimes referred to as data wrangling, is the process of transforming and mapping raw data into a more appropriate and valuable form for the purpose of analytics.


Steps in data preparation
• Check for sensitive data
• Check for missing columns
• Check variable names
• Check missing observations
• Check variable classification
• Check misspellings/extra spaces
• Check numeric data distribution
• Check duplicate rows
• Check statistical assumptions

Missing cases
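Three special values cause most of the trouble in raw data:
i) NA
ii) NaN
iii) Inf

NA marks a missing case.
NaN ("not a number") is the result of an undefined operation such as 0/0.
Inf is the result of dividing a non-zero number by zero.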
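
The three special values can be told apart with base R predicates. The following is a minimal sketch on a made-up vector x (an assumption for illustration, not part of the airquality data).

```r
# A small vector holding an ordinary number and all three special values
x <- c(1, NA, 0/0, 1/0, -1/0)   # 0/0 gives NaN, 1/0 gives Inf, -1/0 gives -Inf

is.na(x)        # TRUE for NA and also for NaN
is.nan(x)       # TRUE only for NaN
is.infinite(x)  # TRUE for Inf and -Inf
is.finite(x)    # TRUE only for ordinary numbers
```

Note that is.na() also flags NaN, so a count of NAs made with is.na() includes any NaN values.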


Functions to be used in cleansing
head()
tail()
is.na()
any(is.na())
colSums(is.na())
na.omit()
complete.cases()

Exploring and handling NA’s
The airquality data set is used for this purpose. It ships with base R in the datasets package.
df <- airquality
str(df)
The data contain 153 observations of 6 variables.
is.na(df)

Exploring and handling NA’s
Now we deliberately create NAs in the data by adding a new column and a new row full of NAs.
df[, 7] <- NA
df[154, ] <- NA
any(is.na(df))
is.na(df)

Exploring and handling NA’s
Remove column 7 because it is full of NAs.
df <- df[, -7]
str(df)
Remove the last row.
df <- df[-154, ]
str(df)

Exploring and handling NA’s
any(is.na(df))
How many NAs are there in total?
sum(is.na(df))
Now we check a single column for NAs.
sum(is.na(df$Solar.R))

Exploring and handling NA’s
Instead of checking the columns one by one for NAs, we can use the colSums function.
colSums(is.na(df))
This shows that most of the NAs are in the first column (37), and there are 7 missing cases in column 2. The rest of the columns are complete and have no NAs.

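The per-column check can be extended to show the share of missing values as well as the count. This is a small sketch on the original airquality data; the names na_counts and na_share are chosen here only for illustration.

```r
df <- airquality

na_counts <- colSums(is.na(df))    # number of NAs in each column
na_share  <- colMeans(is.na(df))   # fraction of NAs in each column

na_counts[na_counts > 0]           # only the columns that have NAs
round(100 * na_share, 1)           # percentage of NAs per column
```
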
Exploring and handling NA’s
The na.omit function removes every row that contains a missing case.
df.clean <- na.omit(df)
Most NAs (37) are in the first column. If this column does not play an important role in the data analysis, we can omit the column instead.

Exploring and handling NA’s
Dropping the first column before removing the NA rows keeps more rows, which increases our sample size.
df.clean2 <- na.omit(df[, -1])
nrow(df.clean2)

df.clean contains 111 rows.
df.clean2 contains 146 rows.

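complete.cases() from the function list gives the same rows as na.omit() but returns a logical vector, which is handy when the incomplete rows need to be inspected first. A minimal sketch on the original airquality data; keep and df.cc are illustrative names.

```r
df <- airquality

keep <- complete.cases(df)   # TRUE for rows without any NA
sum(!keep)                   # number of rows that would be dropped
head(df[!keep, ])            # look at some incomplete rows before removing them

df.cc <- df[keep, ]          # same rows that na.omit(df) keeps
nrow(df.cc)
```
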
Exploring and handling NA’s
We can implement a rule of keeping only those columns that have fewer than 10 NAs.

df.clean3 <- df[, colSums(is.na(df)) < 10]

nrow(df.clean3)
This rule drops columns, not rows, so the number of rows does not change.

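The column rule can be written with a named threshold so that the dropped columns are easy to report. This is a sketch on the original airquality data; max_na and keep_cols are illustrative names, and 10 is the threshold from the rule above.

```r
df <- airquality

max_na <- 10                                # threshold from the rule: fewer than 10 NAs
keep_cols <- colSums(is.na(df)) < max_na    # TRUE for columns to keep

names(df)[!keep_cols]                       # columns that are dropped
df.clean3 <- df[, keep_cols, drop = FALSE]  # drop = FALSE keeps a data frame even for one column
dim(df.clean3)                              # rows are unchanged, only columns are removed
```
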
Exploring and handling NA’s
mean, median and standard deviation return NA if the variable contains NAs.
mean(airquality$Solar.R)
median(airquality$Solar.R)
sd(airquality$Solar.R)
All three results are NA.

Exploring and handling NA’s
To find the mean and sd of the remaining values, we pass na.rm = TRUE.

mean(airquality$Solar.R, na.rm = TRUE)
sd(airquality$Solar.R, na.rm = TRUE)

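It is worth seeing why the logical vector !is.na(x) cannot be passed to mean() for this purpose. The sketch below contrasts three expressions on Solar.R; x is just a short alias introduced here.

```r
x <- airquality$Solar.R

mean(!is.na(x))         # share of non-missing values (a number between 0 and 1)
mean(x, na.rm = TRUE)   # mean of the observed values
mean(x[!is.na(x)])      # same mean, via explicit subsetting
```

The first expression averages TRUE/FALSE values, so it measures completeness of the column, not its level.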
Imputing NA’s
Instead of deleting rows with missing values, we can impute them with the mean or the median.

df.meanImputed <- df
df.medianImputed <- df

Imputing NA’s
All NAs are replaced by the mean of the rest of the data.

df.meanImputed$Solar.R[is.na(df.meanImputed$Solar.R)] <- mean(df.meanImputed$Solar.R, na.rm = TRUE)


Imputing NA’s
All NAs are replaced by the median of the rest of the data.

df.medianImputed$Solar.R[is.na(df.medianImputed$Solar.R)] <- median(df.medianImputed$Solar.R, na.rm = TRUE)

Imputing NA’s
Now we check whether any NAs remain in Solar.R in the two data frames.

any(is.na(df.meanImputed$Solar.R))
any(is.na(df.medianImputed$Solar.R))

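The two imputations differ only in the summary function, so they can share one helper. impute_na() below is a hypothetical helper written for this sketch, not a base R function; it assumes a numeric vector and a summary function that accepts na.rm.

```r
# Hypothetical helper: replace NAs in x with fun() of the observed values
impute_na <- function(x, fun = mean) {
  x[is.na(x)] <- fun(x, na.rm = TRUE)
  x
}

df <- airquality
df.meanImputed <- df
df.medianImputed <- df

df.meanImputed$Solar.R   <- impute_na(df$Solar.R, mean)
df.medianImputed$Solar.R <- impute_na(df$Solar.R, median)

anyNA(df.meanImputed$Solar.R)     # anyNA(x) is a shortcut for any(is.na(x))
anyNA(df.medianImputed$Solar.R)
```

Mean imputation leaves the column mean unchanged but reduces its spread, which is one reason to compare it with the median version.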
Removing outliers
str(df.clean2)
boxplot(df.clean2$Temp)
The Temp variable has no outliers.
boxplot(df.clean2$Wind)
There are three outliers in the Wind variable.
summary(df.clean2$Wind)

Removing outliers
The three outliers in Wind are located using the quartiles and the interquartile range (IQR).
Q1 <- quantile(df.clean2$Wind, 0.25)
Q3 <- quantile(df.clean2$Wind, 0.75)
IQR_wind <- Q3 - Q1

Removing outliers
# IQR(variablename) computes the interquartile range directly

upFenceWind <- Q3 + 1.5 * IQR_wind

df.clean4 <- subset(df.clean2, Wind <= upFenceWind)

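The slides filter on the upper fence only. A general version of the 1.5 * IQR rule checks both fences; the sketch below assumes the same df.clean2 construction as earlier and uses illustrative names (lowFence, upFence, is_outlier, df.noOut).

```r
df.clean2 <- na.omit(airquality[, -1])        # same data as df.clean2 in the slides

q   <- quantile(df.clean2$Wind, c(0.25, 0.75))
iqr <- IQR(df.clean2$Wind)                    # direct function, same as Q3 - Q1

lowFence <- q[[1]] - 1.5 * iqr                # lower fence
upFence  <- q[[2]] + 1.5 * iqr                # upper fence

is_outlier <- df.clean2$Wind < lowFence | df.clean2$Wind > upFence
sum(is_outlier)                               # number of flagged rows
df.noOut <- df.clean2[!is_outlier, ]          # keep rows inside both fences
```
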
Removing outliers
Now we can check the box plots of the variables in df.clean4.
boxplot(df.clean4)
The box plot of df.clean4 shows no outliers in any of the variables.
boxplot(df.clean4$Wind)

Checking for duplicates
str(df.clean4)
str(unique(df.clean4))
We deliberately duplicate row 130 by appending a copy of it as a new last row.
df.clean4 <- rbind(df.clean4, df.clean4[130, ])
str(df.clean4)
df.clean4[c(130, nrow(df.clean4)), ]

Checking for duplicates
Now we eliminate this duplicate row using the unique function.
df.clean4Distinct <- unique(df.clean4)
str(df.clean4Distinct)
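
Before removing duplicates it often helps to count and inspect them. duplicated() returns TRUE for every row that repeats an earlier row. A minimal sketch on a rebuilt copy of the data; df.dup and df.distinct are illustrative names.

```r
df <- na.omit(airquality[, -1])
df.dup <- rbind(df, df[130, ])                 # append a copy of row 130

sum(duplicated(df.dup))                        # how many rows repeat an earlier row
df.dup[duplicated(df.dup), ]                   # inspect the repeated rows

df.distinct <- df.dup[!duplicated(df.dup), ]   # same rows that unique(df.dup) keeps
nrow(df.distinct)
```
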
hist(df.meanImputed$Wind)
hist(df.meanImputed$Temp)
Transformations
The histogram is slightly left skewed.
We can use a transformation to bring the distribution closer to normal.
Take log(), sin(), 1/x or sqrt() of the original data and regenerate the histogram.

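One way to compare candidate transformations is to redraw the histograms side by side and look at a rough skewness number for each. The sketch below uses Wind, which has no NAs and only positive values; skew() is a hypothetical helper giving an approximate moment skewness, and the choice of transformations is an assumption for illustration.

```r
x <- airquality$Wind                       # complete column, all values positive

# Hypothetical helper: rough moment-based skewness (0 means symmetric)
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

candidates <- list(raw = x, log = log(x), sqrt = sqrt(x), inverse = 1 / x)
skews <- sapply(candidates, skew)
skews                                      # values closer to 0 are more symmetric

op <- par(mfrow = c(2, 2))                 # 2 x 2 grid of histograms
for (nm in names(candidates)) hist(candidates[[nm]], main = nm, xlab = nm)
par(op)                                    # restore the previous plot layout
```

log() and sqrt() pull in a long right tail, so for a left-skewed variable they are likely to make matters worse; checking the histogram after each transformation is the safe habit.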
Training and Testing set
Split the data into 80% training and 20% testing data.
sample_data <- sample(2, nrow(df.clean4), replace = TRUE, prob = c(0.8, 0.2))

train_data <- df.clean4[sample_data == 1, ]
test_data <- df.clean4[sample_data == 2, ]

Training and Testing set
head(test_data)
head(train_data)
str(test_data)
str(train_data)

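Drawing group labels with probabilities gives a split that is only approximately 80/20. When an exact split is wanted, row indices can be sampled without replacement. A sketch under stated assumptions: the seed value 123 is arbitrary, and the data frame is rebuilt here so the example runs on its own.

```r
set.seed(123)                               # arbitrary seed, only for reproducibility

df <- na.omit(airquality[, -1])
n_train <- floor(0.8 * nrow(df))            # exact number of training rows

train_idx  <- sample(nrow(df), n_train)     # row positions, drawn without replacement
train_data <- df[train_idx, ]
test_data  <- df[-train_idx, ]              # every row not chosen for training

nrow(train_data)
nrow(test_data)
```
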
Sending output to file
# sink("myfile", append = FALSE, split = FALSE) sends output to a file
# call sink() again with no arguments to stop sending output to the file
sink("myfile", append = FALSE, split = FALSE)
str(test_data) # output to myfile
str(train_data) # output to myfile
sink() # return output to screen

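To confirm that the output really went to the file, it can be read back after sink() is closed. This sketch writes to a temporary file instead of "myfile" so it leaves nothing behind; out_file and captured are illustrative names.

```r
out_file <- tempfile(fileext = ".txt")   # temporary file path for this session

sink(out_file)                           # start sending console output to the file
str(airquality)
print(summary(airquality$Wind))
sink()                                   # return output to the screen

captured <- readLines(out_file)          # read the saved output back in
head(captured)
```
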
Sending graphical outputs
# graphical output can be sent to a separate file
# pdf("mygraph.pdf") pdf file
# png("mygraph.png") png file
# jpeg("mygraph.jpg") jpeg file
# bmp("mygraph.bmp") bmp file
# postscript("mygraph.ps") postscript file
# to close the output file, use the dev.off() function

Sending graphical outputs
# Saving output to pdf

pdf("myplot.pdf")
hist(df.meanImputed$Wind)
hist(df.meanImputed$Temp)
dev.off()
Questions?
