Data Cleansing Using R

Course Introduction

Data cleansing is a crucial step in the data pre-processing phase of data mining. In
this course, you will learn:

Why is data cleansing important?
What is dirty data in a data set?
How to detect dirty data?
How to perform data cleansing operations?

Data Cleansing
Data cleansing, also known as data cleaning or data scrubbing, is the process of
detecting and correcting corrupt or inaccurate records in a data set.
This involves exploring raw data, tidying messy data and preparing data for
analysis.
In the data preprocessing phase, cleaning data often takes 50-80% of the time before
the data is actually mined for insights.

Data Quality
Business decisions often revolve around

identifying prospects
understanding customers to stay connected
knowing about competitors and partners
being current and relevant with marketing campaigns
Data quality is an important factor that impacts the outcome of data analysis and
hence the accuracy of decision making. Reliable predictions cannot be made from
data of low or no quality.

Dirty data is, however, inevitable in any system for various reasons. Hence it is
essential to clean your data at all times. This is an ongoing exercise that
organizations have to follow.

Dirty data refers to data with erroneous information. The following are considered
dirty data:

Misleading data
Duplicate data
Inaccurate data
Non-integrated data
Data that violates business rules
Data without a generalized formatting
Incorrectly punctuated or spelled data
*source - Techopedia

Why Data Cleansing is needed


While integrating data, there might be quality issues, some of which are listed
below.

Inconsistent values. Ex - 'TWO' and '2' for the same field
Additional fields or missing fields.
JSON files having different structures.
Out-of-range numbers. Ex - 'Age' being negative
Outliers to standard distributions
Variation in date formats
The data cleansing process helps to handle:

Missing values
Inaccurate values
Duplicate values
Outliers like typographic / measurement errors
Noisy values
Data timeliness (age of data)

How to manage Missing Values


Ignoring or removing missing values is not the right approach, as they may be too
important to ignore. Similarly, filling in missing values manually may be tedious
and not feasible.

Other options to consider for filling missing values could be:

Use a global constant e.g., “NA”


Use the attribute mean
Use the most probable value. Ex: inference-based methods such as regression, the
Bayesian formula or decision trees
You will see more about managing missing values in the subsequent topics.
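As a quick sketch of the attribute-mean option above (using a small hypothetical vector; the variable name x is illustrative only):

```r
# A small hypothetical vector with missing values
x <- c(12, NA, 7, 9, NA, 11)

# Replace each NA with the mean of the observed values
x[is.na(x)] <- mean(x, na.rm = TRUE)
print(x)  # the two NAs become 9.75
```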

How to manage Noisy Data


Noisy data is random error or variance in a measured variable, i.e. the outliers of
a dataset. It can be managed in the following ways.

Binning Method – First sort the data and partition them into equi-depth bins. Then,
smooth the data by bin means, bin median, bin boundaries etc.
Clustering – Group the data into clusters, then identify and remove outliers
Regression – Using regression functions to smooth the data
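The binning method above can be sketched in R with a small hypothetical vector, partitioned into three equi-depth bins and smoothed by bin means:

```r
# Hypothetical sorted data, partitioned into three equi-depth bins
v <- sort(c(4, 8, 15, 21, 21, 24, 25, 28, 34))
bins <- split(v, rep(1:3, each = 3))

# Smooth each bin by replacing its values with the bin mean
smoothed <- unlist(lapply(bins, function(b) rep(mean(b), length(b))))
print(smoothed)  # 9 9 9 22 22 22 29 29 29
```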

How to Manage Inconsistent Data?


Inconsistent data can be removed by:

Manual correction using external references
Semi-automatic methods using various tools:
i. to detect violation of known functional dependencies and data constraints
ii. to correct redundant data

In this course we will focus on R to

Explore dataset
Identify and tidy messy datasets
Perform manipulations and prepare the dataset for analysis
Handle missing values, inconsistent and noisy data

Cleaning Data with R


Dirty data is everywhere. In fact, most real-world datasets start off dirty in one
way or another and need to be cleaned and prepared for analysis.

In this video we will learn about the typical steps involved like exploring raw
data, tidying data, and preparing data for analysis.

What is Tidy Data?

One of the important components of data cleaning is data tidying. Before
understanding what a tidy dataset is, let us first understand datasets.

[The accompanying slides show the same information presented as three different
data sets: data set 1, data set 2 and data set 3.]

In these 3 different datasets, you can see that the information displayed is the
same, but in different layouts. However, ONLY one dataset will be much easier to
work with in R than the others. This is called a Tidy data set.

To make initial data cleaning easier, data has to be standardized. The tidy data
standard has been designed to facilitate initial exploration and analysis
of the data. Let us understand more about tidy data.

Let us understand Variables, Observations and Values in a dataset.

Understanding datasets

A value belongs to a variable and an observation.

A variable contains all values corresponding to an attribute across units. In the
example, the 'Population' variable contains values across all units (Country).
An observation contains all values measured on the same unit across attributes.
Here, for a 'Country', the values across Year / Cases / Population form one
observation.

Principles of Tidy data


R follows a set of conventions that makes one layout of tabular data much easier to
work with than others. Any dataset that follows the following three rules is said
to be Tidy data.

Each variable forms a column


Each observation forms a row
Each type of observational unit forms a table
Messy data is any other arrangement of data.

Messy data features


Column headers are values, not variable names
Multiple variables are stored in one column
Variables are stored in both rows and columns
Multiple types of observational units are stored in the same table
A single observational unit is stored in multiple tables

Column headers are values, not variable names


In this example, the column headers <$10k, $10-20k, $20-30k etc. are themselves
values and not variable names.

Multiple variables are stored in one column


In this example, the variables are:

Name - stores firstname, lastname
Address - stores city, state

Variables are stored in both rows and columns


This dataset shows the variables across rows and columns, i.e.,

Months on columns and
maxtemp, mintemp elements on rows.

Multiple types of observational units are stored in the same table


In this example, the table stores both rank and song information, resulting in
redundancy of data.

A single observational unit is stored in multiple tables


In this example, Maxtemp for the same city is stored in different tables based on
Year.

Intro to tidyr
This video covers the various functions available in the tidyr package.
tidyr functions
gather() collapses multiple columns into two columns:

A key column that contains the former column names.
A value column that contains the former column cells.

spread() generates multiple columns from two columns:

Each unique value in the key column becomes a column name.
Each value in the value column becomes a cell in the new column.

spread() and gather() help you reshape the layout of your data to place variables
in columns and observations in rows.

separate() and unite() help you split and combine cells to place a single, complete
value in each cell.

Data for gather() and spread()


Let us create a new dataframe called "paverage.df" that stores the average scores
by player across 3 different years. We will use this dataframe to examine gather()
and spread() behaviour.

Steps for new data frame

player <- c("Sachin Tendulkar", "Sourav Ganguly", "VVS Laxman", "Rahul Dravid")
Y2010 <- c(48.8, 40.22, 51.02, 53.34)
Y2011 <- c(53.7, 41.9, 50.8, 59.44)
Y2012 <- c(60.0, 52.39, 61.2, 61.44)
paverage.df <- data.frame(player,Y2010,Y2011,Y2012)
print(paverage.df)

Data for separate() and unite()


Let us create a small dataframe that we can refer to before trying your hand at
separate().

fname <- c("Martina", "Monica", "Stan", "Oscar")


lname <- c("Welch", "Sobers", "Griffith", "Williams")
DoB <- c("1-Oct-1980", "2-Nov-1982", "13-Dec-1979", "27-Jan-1988")
first.df <- data.frame(fname,lname,DoB)
print(first.df)

Column header having values, not variables


When column headers are values and not variables, use gather() function to tidy the
data.

Now try to recreate this dataset again after tidying the data.

Multiple variables are stored in one column


For datasets that are messy because a single column holds multiple variables,
the separate() function can be used.

We have seen this example on separate().

Datasets with variables in both rows & columns


When the variables are in both rows and columns, use the gather() function followed
by the spread() function.

####

player <- c("Sachin Tendulkar", "Sourav Ganguly", "VVS Laxman", "Rahul Dravid")
Y2010 <- c(48.8, 40.22, 51.02, 53.34)
Y2011 <- c(53.7, 41.9, 50.8, 59.44)
Y2012 <- c(60.0, 52.39, 61.2, 61.44)
paverage.df<-data.frame(player, Y2010, Y2011, Y2012)

library(tidyr)
pavg_gather<-gather(paverage.df, year,pavg, Y2010:Y2012)
print(pavg_gather)
print(spread(pavg_gather, year, pavg))

fname <- c("Martina", "Monica", "Stan", "Oscar")


lname <- c("Welch", "Sobers", "Griffith", "Williams")
DoB <- c("1-Oct-1980", "2-Nov-1982", "13-Dec-1979", "27-Jan-1988")
first.df<-data.frame(fname, lname, DoB)
print(first.df)
print(separate(first.df, DoB, c("date", "month", "year"), sep = "-"))
print(unite(first.df, "Name", c(fname, lname), sep = " "))

religion <- c("Agnostic", "Atheist", "Buddhist", "Catholic")


usd10k <- c(27,12,27,41)
usd20to30k <- c(60, 37, 30, 732)
usd30to40k <- c(81, 52, 34, 670)
mydf1.df <- data.frame(religion, usd10k, usd20to30k, usd30to40k)

print(gather(mydf1.df, usd_range, usd, usd10k:usd30to40k))

City <- c("Chennai", "Chennai","Hyderabad", "Hyderabad")


Year <- c(2010, 2010, 2010, 2010)
Element <- c("MaxTemp", "MinTemp","MaxTemp", "MinTemp")
Jan <- c(36,24,32,22)
Feb <- c(37,25,34,23)
Mar <- c(37.5,27,36,25)
mydf2.df <- data.frame(City,Year,Element,Jan,Feb,Mar)
print(mydf2.df)

mydf2.df_gather<-gather(mydf2.df, month, Elements, Jan:Mar)

print(spread(mydf2.df_gather, month, Elements))

###

dplyr R package
This video covers the basic functions of the dplyr package:

* arrange
* filter
* select
* mutate
* rename

1. Perform the following task:


Copy the mtcars dataset to a new dataset called mtcars1 with columns 1 to 6 and an
additional
column called cars that will have the names of the car models.

From the mtcars1 dataset, identify the cars having mpg>20 and cyl=6 and return all
the columns along
with the column cars and print it.
Hint: Use filter()

Ans:
library(dplyr)
mtcars1 <- select(mtcars, mpg:wt)
mtcars1 <- mutate(mtcars1, cars = rownames(mtcars))
print(filter(mtcars1, mpg > 20 & cyl == 6))

2. Perform the following task:


Using the mtcars1 dataset from the previous step, sort the dataset by cyl in
ascending order, and mpg
in descending order and print it.
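Ans (a possible solution, assuming mtcars1 was created as in task 1):

```r
library(dplyr)
# Sort by cyl ascending, then by mpg descending
print(arrange(mtcars1, cyl, desc(mpg)))
```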

3. Perform the following task:


Using the mtcars dataset, create a new dataset called mt_select having columns mpg
and hp only and print
the mt_select dataset.

Ans: mt_select<-(select(mtcars, mpg , hp))


print(mt_select)

4. Perform the following task:


Create a new dataset mt_newcols by using dataset mtcars1 with 2 additional columns:
disp2=disp*disp
Finally print mt_newcols dataset.

Ans:
mt_newcols <- mutate(mtcars1, disp2 = disp * disp)
print(mt_newcols)

5.Perform the following task:


Using the mtcars dataset, find out:
the mean of mpg and print it
the max of mpg and print it
the quantile of mpg with probability 25% and print it.
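Ans (a possible solution, using the built-in mtcars dataset):

```r
# Mean, max and 25% quantile of mpg
print(mean(mtcars$mpg))
print(max(mtcars$mpg))
print(quantile(mtcars$mpg, probs = 0.25))
```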

String Manipulation

Stringr Package
Perform the following operations in the function stringr_operations
1. Perform the following operations:
Assign a string value "R" to a variable x
Use str_c and concatenate x with another string "Tutorial", separated by a blank
space and print it

Ans:
library(stringr)
x<-"R"
print(str_c(x, "Tutorial", sep = " "))

2. Perform the following operations:


Create a variable X with value hop a little, jump a little, eat a little, drive a
little.
Find the frequency of little in the string and print it.

Ans:
X<-c('hop a little', 'jump a little', 'eat a little', 'drive a little')
print(str_count(X, 'little'))
3. Perform the following operations:
Create a variable with a value hop a little, jump a little. Find out the positions
of the matching patterns little.
Try out using str_locate and str_locate_all and print the output of both
separately.

Ans:
V<-c('hop a little', 'jump a little')
print(str_locate(V, 'little'))
print(str_locate_all(V, 'little'))

4.Perform the following operations:


Use str_detect to identify the existence of a pattern in a string
Detect if there is a character z in the string hop a little, jump a little and
print it.

Ans:
print(str_detect(V, 'z'))

5. Perform the following operations:


Assign a string with say value TRUE NA TRUE NA NA NA FALSE.
Using str_extract and str_extract_all, find and extract a matching pattern, say, NA
and print the output of both
separately.

Ans:
Z<-c('TRUE', 'NA', 'TRUE', 'NA', 'NA', 'NA', 'FALSE')
print(str_extract (Z, 'NA'))
print(str_extract_all (Z, 'NA'))

6.Perform the following operation:


Use str_length and find out the length of the string TRUE NA TRUE NA NA NA FALSE
and print it.

Ans:
print(str_length(Z))

7.Perform the following operation:


Using str_to_upper and str_to_lower, change the case of TRUE NA TRUE NA NA NA
FALSE to lower, and then to upper again,
and print the output of both separately.

Ans:
print(str_to_upper(Z))
print(str_to_lower(Z))

8.Perform the following operations:


Create a vector y with values alpha, gama, duo, uno, beta
Using str_order, sort this vector and print it.

Ans:
y<-c('alpha', 'gama', 'duo', 'uno', 'beta')
print(str_order(y))

9.Perform the following operations:


Assign the value alpha for the vector y and try to pad the vector value with % on
the left and right of the string where the
total length of the string becomes 13 and print it.

Ex-%%%%alpha%%%%

Ans:
y<-'alpha'
print(str_pad(y, 13, 'both', pad='%'))

10.Perform the following operation:


Use str_trim to remove white spaces.
Create a vector z with values ("A", "B", "C") having leading white spaces.
Trim the white spaces and print it.

Ans:
z<-c(' A', ' B' , ' C')
print(str_trim(z))


Data Type conversions


Data type conversions of scalars, vectors (logical, character, numeric), matrices
and dataframes are possible in R. Converting a variable from one type to another is
called coercion.

character > numeric > integer > logical

is.numeric(), is.character(), is.vector(), is.matrix(), is.data.frame() return
TRUE or FALSE;
as.numeric(), as.character(), as.vector(), as.matrix(), as.data.frame() convert
one data type to another.
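A quick illustration of these functions (the value "3.14" is hypothetical):

```r
x <- "3.14"
print(is.numeric(x))      # FALSE: x is a character string
y <- as.numeric(x)        # coerce character to numeric
print(is.numeric(y))      # TRUE
print(as.character(42))   # "42"
```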

Ans:
strDates <- as.Date("01/05/1965", format = "%d/%m/%Y")
print(strDates)

Special values in R

Missing value treatment using mean / median


This video covers how to do the mean and median imputations for some of the missing
values in the data set

Outlier Analysis
This video discusses some of the possible ways to deal with outliers.

Practice - Winsorizing technique


Consider the same data set called 'Outlierset' from the earlier example.

Outlierset <- c(19, 13, 29, 17, 5, 16, 18, 20, 55, 22,33,14,25, 10,29, 56)
Copy Outlierset to a new dataset Outlierset1.
Replace the outliers with 36 which is 3rd quartile + minimum.
Compare boxplot on both Outlierset and Outlierset1. You should see no outliers in
the new dataset.
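One possible sketch of the steps above:

```r
# Same data set as the earlier example
Outlierset <- c(19, 13, 29, 17, 5, 16, 18, 20, 55, 22, 33, 14, 25, 10, 29, 56)
Outlierset1 <- Outlierset

# Winsorize: cap values above 36 (3rd quartile + minimum)
Outlierset1[Outlierset1 > 36] <- 36

# Compare the two distributions side by side
boxplot(Outlierset, Outlierset1, names = c("Original", "Winsorized"))
```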

Outlier treatment by Variable Transformation

Obvious errors
We have so far seen how to handle missing values, special values & outliers.

Sometimes, we might come across obvious errors which cannot be caught by the
techniques learnt previously.

Errors such as the age field having a negative value, or the height field being,
say, 0 or an implausibly small number. Such erroneous data still needs manual
checks and corrections.

Data Cleansing using R - Course Summary


In this course, you have learnt the rules around what makes data tidy or messy. You
have also seen 3 new packages: tidyr, dplyr and stringr. Further, different
techniques to treat missing values, special values and outliers, and the
preparation of tidy data for data analysis, were discussed.

Final Hands-on
1. Perform the following operations
-Create a vector x<-c(19,13,NA,17,5,16,NA,20,55,22,33,14,25,NA,29,56).
-Treat the missing values by replacing them with mean once, and then with the
median of vector and print out the output
of the both separately.
2. Perform the following operations
-Create the dataset Outlierset<-c(19,13,29,17,5,16,18,20,55,22,33,14,25,10,29,56)
-Make a summary of the dataset and print it.
-Create a new dataset called Cleanset and assign data from Outlierset, discarding
values above 36 (which is the 3rd Quartile + min) and print it.

Ans:
x<-c(19,13,NA,17,5,16,NA,20,55,22,33,14,25,NA,29,56)
x[is.na(x)]<- mean(x[!is.na(x)])
print(x)
x<-c(19,13,NA,17,5,16,NA,20,55,22,33,14,25,NA,29,56)
x[is.na(x)]<-median(x[!is.na(x)])
print(x)

Outlierset<-c(19,13,29,17,5,16,18,20,55,22,33,14,25,10,29,56)
print(summary(Outlierset))
Cleanset<-Outlierset
print(Cleanset<-Cleanset[Cleanset < 36])

