Data Cleansing Using R
Data Cleansing
Data cleansing, also known as data cleaning or data scrubbing, is the process of
detecting and correcting corrupt or inaccurate records in a data set.
This involves exploring raw data, tidying messy data and preparing data for
analysis
In the data preprocessing phase, cleaning the data often takes 50-80% of the time
before the data can actually be mined for insights.
Data Quality
Business decisions often revolve around
identifying prospects
understanding customers to stay connected
knowing about competitors and partners
being current and relevant with marketing campaigns
Data quality is an important factor that impacts the outcome of data analysis and
hence the accuracy of decision making. Reliable predictions cannot be made from
low-quality data.
But dirty data is inevitable in any system, for a variety of reasons. It is
therefore essential to keep cleaning your data; this is an ongoing exercise that
organizations must follow.
Dirty data refers to data with erroneous information. Following are considered as
dirty data.
Misleading data
Duplicate data
Inaccurate data
Non-integrated data
Data that violates business rules
Data without a generalized formatting
Incorrectly punctuated or spelled data
*source - Techopedia
Missing values
Inaccurate values
Duplicate values
Outliers like typographic / measurement errors
Noisy values
Data timeliness (age of data)
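Several of these issues can be spotted programmatically. A minimal sketch in base R, on a small hypothetical data frame (the column names and values here are made up for illustration):

```r
# Hypothetical records containing missing, duplicate and implausible values
df <- data.frame(
  id    = c(1, 2, 2, 3, 4),
  age   = c(25, NA, NA, 130, 31),  # NA = missing, 130 = implausible
  email = c("a@x.com", "b@x.com", "b@x.com", "c@x.com", "d@x.com")
)

print(sum(is.na(df$age)))                  # count missing values
print(which(duplicated(df)))               # rows that are exact duplicates
print(df[df$age > 120 & !is.na(df$age), ]) # implausible (inaccurate) values
```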
Binning Method – First sort the data and partition them into equi-depth bins. Then,
smooth the data by bin means, bin median, bin boundaries etc.
Clustering – Group the data into clusters, then identify and remove outliers
Regression – Using regression functions to smooth the data
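A minimal sketch of the binning method, smoothing by bin means on a small made-up vector (the data and the choice of 4 bins are illustrative assumptions):

```r
# Sort the data, partition into equi-depth bins, then replace each value
# with the mean of its bin
x <- c(4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34)

bins <- split(sort(x), rep(1:4, each = 3))  # 4 equi-depth bins of 3 values each
smoothed <- unlist(lapply(bins, function(b) rep(mean(b), length(b))))
print(smoothed)
```

To smooth by bin boundaries instead, each value would be replaced by whichever of `min(b)` or `max(b)` is closer.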
Explore dataset
Identify and tidy messy datasets
Perform manipulations and prepare the dataset for analysis
Handle missing values, inconsistent and noisy data
In this video we will learn about the typical steps involved like exploring raw
data, tidying data, and preparing data for analysis.
In these three datasets, the information displayed is the same, but in different
layouts. However, only ONE of them will be much easier to work with in R than the
others: this is called a tidy dataset.
To make initial data cleaning easier, data has to be standardized. The tidy data
standard has been designed to facilitate initial exploration and analysis
of the data. Let us understand more about tidy data.
Understanding datasets
Intro to TidyR
This video covers the various functions available in the tidyr package
TidyR functions
gather() collapses multiple columns into two key-value columns; spread() does the
reverse.
separate() and unite() help you split and combine cells to place a single, complete
value in each cell.
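separate() and unite() are not demonstrated in the example that follows, so here is a minimal sketch on a hypothetical data frame (the fixture strings are made up for illustration):

```r
library(tidyr)

# Hypothetical data: one column holding two values joined by an underscore
matches <- data.frame(fixture = c("IND_AUS", "IND_ENG"),
                      result  = c("won", "lost"))

sep <- separate(matches, fixture, into = c("team", "opponent"), sep = "_")
print(sep)                                             # one value per cell now
print(unite(sep, fixture, team, opponent, sep = "_"))  # unite() reverses the split
```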
player <- c("Sachin Tendulkar", "Sourav Ganguly", "VVS Laxman", "Rahul Dravid")
Y2010 <- c(48.8, 40.22, 51.02, 53.34)
Y2011 <- c(53.7, 41.9, 50.8, 59.44)
Y2012 <- c(60.0, 52.39, 61.2, 61.44)
paverage.df <- data.frame(player,Y2010,Y2011,Y2012)
print(paverage.df)
Now recreate this dataset after tidying the data.
####
player <- c("Sachin Tendulkar", "Sourav Ganguly", "VVS Laxman", "Rahul Dravid")
Y2010 <- c(48.8, 40.22, 51.02, 53.34)
Y2011 <- c(53.7, 41.9, 50.8, 59.44)
Y2012 <- c(60.0, 52.39, 61.2, 61.44)
paverage.df<-data.frame(player, Y2010, Y2011, Y2012)
library(tidyr)
pavg_gather <- gather(paverage.df, year, pavg, Y2010:Y2012)
print(pavg_gather)
print(spread(pavg_gather, year, pavg))
###
dplyr R package
This video covers the basic functions of dplyr package -
* filter
* select
* arrange
* mutate
* rename
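A brief sketch of these verbs on the built-in mtcars dataset (the derived column and new names below are illustrative choices, not part of the exercise):

```r
library(dplyr)

small <- select(mtcars, mpg, cyl, wt)          # select(): keep only some columns
fast  <- filter(small, mpg > 25)               # filter(): keep rows matching a condition
bywt  <- arrange(small, desc(wt))              # arrange(): sort rows, heaviest first
small <- mutate(small, wt_kg = wt * 453.6)     # mutate(): add a column (wt is in 1000 lb)
small <- rename(small, miles_per_gallon = mpg) # rename(): change a column name
print(head(small))
```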
From the mtcars1 dataset, identify the cars having mpg > 20 and cyl == 6, and
return all the columns along with the column cars, and print it.
Hint: Use filter()
Ans:
library(dplyr)
mtcars1 <- select(mtcars, mpg:wt)
mtcars1 <- mutate(mtcars1, cars = rownames(mtcars))  # keep the car names as a column
print(filter(mtcars1, mpg > 20, cyl == 6))
Add a new column disp2 (disp squared) to mtcars1 and print the result.
Hint: Use mutate()
Ans:
mt_newcols <- mutate(mtcars1, disp2 = disp * disp)
print(mt_newcols)
String Manipulation
Stringr Package
Perform the following operations using the stringr package.
1. Perform the following operations:
Assign a string value "R" to a variable x
Use str_c and concatenate x with another string "Tutorial", separated by a blank
space and print it
Ans:
library(stringr)
x<-"R"
print(str_c(x, "Tutorial", sep = " "))
2. Create a vector X with the values 'hop a little', 'jump a little',
'eat a little', 'drive a little'. Count the occurrences of the pattern little in
each element using str_count and print it.
Ans:
X<-c('hop a little', 'jump a little', 'eat a little', 'drive a little')
print(str_count(X, 'little'))
3. Perform the following operations:
Create a variable with a value hop a little, jump a little. Find out the positions
of the matching patterns little.
Try out using str_locate and str_locate_all and print the output of both
separately.
Ans:
V<-c('hop a little', 'jump a little')
print(str_locate(V, 'little'))
print(str_locate_all(V, 'little'))
4. Check whether the pattern z is present in each element of V using str_detect
and print the result.
Ans:
print(str_detect(V, 'z'))
5. Create a vector Z with the values 'TRUE', 'NA', 'TRUE', 'NA', 'NA', 'NA',
'FALSE'. Extract the matches of the pattern NA using str_extract and
str_extract_all and print both.
Ans:
Z<-c('TRUE', 'NA', 'TRUE', 'NA', 'NA', 'NA', 'FALSE')
print(str_extract(Z, 'NA'))
print(str_extract_all(Z, 'NA'))
6. Find the length of each element of Z using str_length and print it.
Ans:
print(str_length(Z))
7. Convert Z to upper case and to lower case using str_to_upper and
str_to_lower, and print both.
Ans:
print(str_to_upper(Z))
print(str_to_lower(Z))
8. Create a vector y with the values 'alpha', 'gama', 'duo', 'uno', 'beta'. Get
the sort order of the elements using str_order and print it.
Ans:
y<-c('alpha', 'gama', 'duo', 'uno', 'beta')
print(str_order(y))
9. Pad the string alpha to a width of 13 with % on both sides using str_pad and
print it.
Ex: %%%%alpha%%%%
Ans:
y<-'alpha'
print(str_pad(y, 13, 'both', pad='%'))
10. Create a vector z whose elements have leading whitespace. Trim the
whitespace using str_trim and print it.
Ans:
z<-c(' A', ' B' , ' C')
print(str_trim(z))
Convert the string "01/05/1965" to a Date object using as.Date with the
appropriate format string, and print it.
Ans:
as.Date("01/05/1965", format= "%d/%m/%Y") ->strDates
print(strDates)
Special values in R
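Base R distinguishes several special values (NA, NaN, Inf, -Inf, NULL). A quick sketch of how the is.* predicates treat them:

```r
x <- c(1, NA, NaN, Inf, -Inf, NULL)  # NULL contributes nothing when concatenated

print(length(x))      # 5 -- NULL was dropped
print(is.na(x))       # TRUE for both NA and NaN
print(is.nan(x))      # TRUE only for NaN
print(is.finite(x))   # FALSE for NA, NaN, Inf and -Inf
```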
Outlier Analysis
This video discusses some of the possible ways to deal with outliers.
Outlierset <- c(19, 13, 29, 17, 5, 16, 18, 20, 55, 22,33,14,25, 10,29, 56)
Copy Outlierset to a new dataset, Outlierset1.
Replace the outliers with 36, which is the 3rd quartile + minimum.
Compare boxplots of Outlierset and Outlierset1; you should see no outliers in
the new dataset.
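One possible solution, capping values at the 36 threshold stated above:

```r
Outlierset <- c(19, 13, 29, 17, 5, 16, 18, 20, 55, 22, 33, 14, 25, 10, 29, 56)

Outlierset1 <- Outlierset
Outlierset1[Outlierset1 > 36] <- 36    # replace outliers with the 36 cap

par(mfrow = c(1, 2))                   # draw the two boxplots side by side
boxplot(Outlierset, main = "Original")
boxplot(Outlierset1, main = "Outliers capped")
```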
Obvious errors
We have so far seen how to handle missing values, special values & outliers.
Sometimes, we might come across obvious errors that cannot be caught by the
techniques learnt so far. Errors such as an age field holding a negative value,
or a height field being zero or implausibly small, would still need manual
checks and corrections.
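Such rule-based sanity checks can still be expressed in code to flag the offending rows for review. A minimal sketch on hypothetical data (the names, values and the 120-year age limit are illustrative assumptions):

```r
# Hypothetical records containing obviously impossible values
people <- data.frame(name      = c("Asha", "Bala", "Chitra"),
                     age       = c(34, -2, 130),
                     height_cm = c(172, 0, 165))

bad <- people$age < 0 | people$age > 120 | people$height_cm <= 0
print(people[bad, ])               # rows that need manual review

people$age[people$age < 0] <- NA   # one option: mark as missing for later treatment
```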
This module covered the tidyr, dplyr and stringr packages. Further, the different
techniques to treat missing values, special values and outliers, and the
preparation of tidy data for analysis, were discussed.
Final Hands-on
1. Perform the following operations
-Create a vector x<-c(19,13,NA,17,5,16,NA,20,55,22,33,14,25,NA,29,56).
-Treat the missing values by replacing them with mean once, and then with the
median of vector and print out the output
of the both separately.
2. Perform the following operations
-Create the dataset Outlierset<-c(19,13,29,17,5,16,18,20,55,22,33,14,25,10,29,56)
-Make a summary of the dataset and print it.
-Create a new dataset called Cleanset and assign data from Outlierset, discarding
values above 36 (which is the 3rd quartile + min), and print it.
Ans:
x<-c(19,13,NA,17,5,16,NA,20,55,22,33,14,25,NA,29,56)
x[is.na(x)]<- mean(x[!is.na(x)])
print(x)
x<-c(19,13,NA,17,5,16,NA,20,55,22,33,14,25,NA,29,56)
x[is.na(x)]<-median(x[!is.na(x)])
print(x)
Outlierset<-c(19,13,29,17,5,16,18,20,55,22,33,14,25,10,29,56)
print(summary(Outlierset))
Cleanset<-Outlierset
print(Cleanset<-Cleanset[Cleanset < 36])