Data Cleansing Using R

Course Introduction

Data cleansing is a crucial step in the data pre-processing phase of data mining. In
this course, you will learn:

Why is data cleansing important?
What is dirty data in a data set?
How to detect dirty data?
How to perform data cleansing operations?

Data Cleansing
Data cleansing, also known as data cleaning or data scrubbing, is the process of
detecting and correcting corrupt or inaccurate records in a data set.
This involves exploring raw data, tidying messy data and preparing data for
analysis.
In the data preprocessing phase, cleaning data often takes 50-80% of the time before
the data is actually mined for insights.

Data Quality
Business decisions often revolve around

identifying prospects
understanding customers to stay connected
knowing about competitors and partners
being current and relevant with marketing campaigns
Data quality is an important factor that impacts the outcome of data analysis and
hence the accuracy of decision making. Reliable predictions cannot be made from
data of low or no quality.

Dirty data is, however, inevitable in any system for various reasons. Hence it is
essential to clean your data at all times. This is an ongoing exercise that
organizations have to follow.

Dirty data refers to data with erroneous information. The following are considered
dirty data:

Misleading data
Duplicate data
Inaccurate data
Non-integrated data
Data that violates business rules
Data without a generalized formatting
Incorrectly punctuated or spelled data
*source - Techopedia

Why Data Cleansing is needed


While integrating data, there might be quality issues, some of which are listed
below.

Inconsistent values. Ex - 'TWO' and '2' for the same field
Additional fields or missing fields.
JSON files having different structures.
Out-of-range numbers. Ex - 'Age' being negative
Outliers to standard distributions
Variation in date formats
The data cleansing process helps to handle:

Missing values
Inaccurate values
Duplicate values
Outliers like typographic / measurement errors
Noisy values
Data timeliness (age of data)

How to manage Missing Values


Ignoring or removing missing values is not the right approach, as they may be too
important to ignore. Similarly, filling in missing values manually may be tedious
and not feasible.

Other options to consider for filling missing values could be:

Use a global constant e.g., “NA”


Use the attribute mean
Use the most probable value. Ex: inference-based methods such as regression, the
Bayesian formula or decision trees
You will see more about managing missing values in the subsequent topics.
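As a quick sketch of the attribute-mean option above (using a small hypothetical vector; the variable name x is illustrative only):

```r
# A small hypothetical vector with missing values
x <- c(12, NA, 7, 9, NA, 11)

# Replace each NA with the mean of the observed values
x[is.na(x)] <- mean(x, na.rm = TRUE)
print(x)  # the two NAs become 9.75
```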

How to manage Noisy Data


Noisy data is random error or variance in a measured variable, i.e. the outliers of
a dataset. It can be managed in the following ways.

Binning Method – First sort the data and partition them into equi-depth bins. Then,
smooth the data by bin means, bin median, bin boundaries etc.
Clustering – Group the data into clusters, then identify and remove outliers
Regression – Using regression functions to smooth the data
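The binning method above can be sketched in R with a small hypothetical vector, partitioned into three equi-depth bins and smoothed by bin means:

```r
# Hypothetical sorted data, partitioned into three equi-depth bins
v <- sort(c(4, 8, 15, 21, 21, 24, 25, 28, 34))
bins <- split(v, rep(1:3, each = 3))

# Smooth each bin by replacing its values with the bin mean
smoothed <- unlist(lapply(bins, function(b) rep(mean(b), length(b))))
print(smoothed)  # 9 9 9 22 22 22 29 29 29
```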

How to Manage Inconsistent Data?


Inconsistent data can be removed by:

Manual correction using external references
Semi-automatic methods using various tools:
i. to detect violation of known functional dependencies and data constraints
ii. to correct redundant data

In this course we will focus on R to

Explore dataset
Identify and tidy messy datasets
Perform manipulations and prepare the dataset for analysis
Handle missing values, inconsistent and noisy data

Cleaning Data with R


Dirty data is everywhere. In fact, most real-world datasets start off dirty in one
way or another and need to be cleaned and prepared for analysis.

In this video we will learn about the typical steps involved like exploring raw
data, tidying data, and preparing data for analysis.

What is Tidy Data?

One of the important components of data cleaning is data tidying. Before
understanding what a tidy dataset is, let us first understand datasets.

[The accompanying slides show the same information presented as three different
data sets: data set 1, data set 2 and data set 3.]

In these 3 different datasets, you can see that the information displayed is the
same, but in different layouts. However, ONLY one dataset will be much easier to
work with in R than the others. This is called a Tidy data set.

To make initial data cleaning easier, data has to be standardized. The tidy data
standard has been designed to facilitate initial exploration and analysis
of the data. Let us understand more about tidy data.

Let us understand Variables, Observations and Values in a dataset.

Understanding datasets

A value belongs to a variable and an observation.

A variable contains all values corresponding to an attribute across units. In the
example, the 'Population' variable contains values across all units (Country).
An observation contains all values measured on the same unit across attributes.
Here, for a 'Country', the values across Year / Cases / Population form one
observation.

Principles of Tidy data


R follows a set of conventions that makes one layout of tabular data much easier to
work with than others. Any dataset that follows the following three rules is said
to be Tidy data.

Each variable forms a column


Each observation forms a row
Each type of observational unit forms a table
Messy data is any other arrangement of data.

Messy data features


Column headers are values, not variable names
Multiple variables are stored in one column
Variables are stored in both rows and columns
Multiple types of observational units are stored in the same table
A single observational unit is stored in multiple tables

Column headers are values, not variable names


In this example, the column headers <$10k, $10-20k, $20-30k etc. are themselves
values and not variable names.

Multiple variables are stored in one column


In this example, the variables are:

Name - stores firstname, lastname
Address - stores city, state

Variables are stored in both rows and columns


This dataset shows the variables across rows and columns, i.e.,

Months on columns and
maxtemp, mintemp elements on rows.

Multiple types of observational units are stored in the same table


In this example, the table stores both rank and song information, resulting in
redundancy of data.

A single observational unit is stored in multiple tables


In this example, Maxtemp for the same city is stored in different tables based on
Year.

Intro to tidyr
This video covers the various functions available in the tidyr package.
tidyr functions
gather() collapses multiple columns into two columns:

A key column that contains the former column names.
A value column that contains the former column cells.

spread() generates multiple columns from two columns:

Each unique value in the key column becomes a column name.
Each value in the value column becomes a cell in the new column.

spread() and gather() help you reshape the layout of your data to place variables
in columns and observations in rows.

separate() and unite() help you split and combine cells to place a single, complete
value in each cell.

Data for gather() and spread()


Let us create a new dataframe called "paverage.df" that stores the average scores
by player across 3 different years. We will use this dataframe to examine gather()
and spread() behaviour.

Steps for new data frame

player <- c("Sachin Tendulkar", "Sourav Ganguly", "VVS Laxman", "Rahul Dravid")
Y2010 <- c(48.8, 40.22, 51.02, 53.34)
Y2011 <- c(53.7, 41.9, 50.8, 59.44)
Y2012 <- c(60.0, 52.39, 61.2, 61.44)
paverage.df <- data.frame(player,Y2010,Y2011,Y2012)
print(paverage.df)

Data for separate() and unite()


Let us create a small dataframe that we can refer to before trying your hand at
separate().

fname <- c("Martina", "Monica", "Stan", "Oscar")


lname <- c("Welch", "Sobers", "Griffith", "Williams")
DoB <- c("1-Oct-1980", "2-Nov-1982", "13-Dec-1979", "27-Jan-1988")
first.df <- data.frame(fname,lname,DoB)
print(first.df)

Column header having values, not variables


When column headers are values and not variables, use gather() function to tidy the
data.

Now try to recreate this dataset again after tidying the data.

Multiple variables are stored in one column


For datasets that are messy because a single column holds multiple variables,
the separate() function can be used.

We have seen this example on separate().

Datasets with variables in both rows & columns


When the variables are in both rows and columns, use the gather() function followed
by the spread() function.

####

player <- c("Sachin Tendulkar", "Sourav Ganguly", "VVS Laxman", "Rahul Dravid")
Y2010 <- c(48.8, 40.22, 51.02, 53.34)
Y2011 <- c(53.7, 41.9, 50.8, 59.44)
Y2012 <- c(60.0, 52.39, 61.2, 61.44)
paverage.df<-data.frame(player, Y2010, Y2011, Y2012)

library(tidyr)
pavg_gather<-gather(paverage.df, year,pavg, Y2010:Y2012)
print(pavg_gather)
print(spread(pavg_gather, year, pavg))

fname <- c("Martina", "Monica", "Stan", "Oscar")


lname <- c("Welch", "Sobers", "Griffith", "Williams")
DoB <- c("1-Oct-1980", "2-Nov-1982", "13-Dec-1979", "27-Jan-1988")
first.df<-data.frame(fname, lname, DoB)
print(first.df)
print(separate(first.df, DoB, c("date", "month", "year"), sep = "-"))
print(unite(first.df, "Name", c(fname, lname), sep = " "))

religion <- c("Agnostic", "Atheist", "Buddhist", "Catholic")


usd10k <- c(27,12,27,41)
usd20to30k <- c(60, 37, 30, 732)
usd30to40k <- c(81, 52, 34, 670)
mydf1.df <- data.frame(religion, usd10k, usd20to30k, usd30to40k)

print(gather(mydf1.df, usd_range, usd, usd10k:usd30to40k))

City <- c("Chennai", "Chennai","Hyderabad", "Hyderabad")


Year <- c(2010, 2010, 2010, 2010)
Element <- c("MaxTemp", "MinTemp","MaxTemp", "MinTemp")
Jan <- c(36,24,32,22)
Feb <- c(37,25,34,23)
Mar <- c(37.5,27,36,25)
mydf2.df <- data.frame(City,Year,Element,Jan,Feb,Mar)
print(mydf2.df)

mydf2.df_gather<-gather(mydf2.df, month, Elements, Jan:Mar)

print(spread(mydf2.df_gather, month, Elements))

###

dplyr R package
This video covers the basic functions of the dplyr package:

* arrange
* filter
* select
* mutate
* rename

1. Perform the following task:


Copy the mtcars dataset to a new dataset called mtcars1 with columns 1 to 6 and an
additional
column called cars that will have the names of the car models.

From the mtcars1 dataset, identify the cars having mpg>20 and cyl=6 and return all
the columns along
with the column cars and print it.
Hint: Use filter()

Ans:
library(dplyr)
mtcars1 <- select(mtcars, mpg:wt)
mtcars1 <- mutate(mtcars1, cars = rownames(mtcars))
print(filter(mtcars1, mpg > 20 & cyl == 6))

2. Perform the following task:


Using the mtcars1 dataset from the previous step, sort the dataset by cyl in
ascending order, and mpg
in descending order and print it.
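Ans (a possible solution, assuming mtcars1 was created as in task 1):

```r
library(dplyr)
# Sort by cyl ascending, then by mpg descending
print(arrange(mtcars1, cyl, desc(mpg)))
```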

3. Perform the following task:


Using the mtcars dataset, create a new dataset called mt_select having columns mpg
and hp only and print
the mt_select dataset.

Ans: mt_select<-(select(mtcars, mpg , hp))


print(mt_select)

4. Perform the following task:


Create a new dataset mt_newcols by using dataset mtcars1 with 2 additional columns:
disp2=disp*disp
Finally print mt_newcols dataset.

Ans:
mt_newcols <- mutate(mtcars1, disp2 = disp * disp)
print(mt_newcols)

5.Perform the following task:


Using the mtcars dataset, find out:
the mean of mpg and print it
the max of mpg and print it
the quantile of mpg with probability 25% and print it.
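Ans (a possible solution, using the built-in mtcars dataset):

```r
# Mean, max and 25% quantile of mpg
print(mean(mtcars$mpg))
print(max(mtcars$mpg))
print(quantile(mtcars$mpg, probs = 0.25))
```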

String Manipulation

Stringr Package
Perform the following operations in the function stringr_operations
1. Perform the following operations:
Assign a string value "R" to a variable x
Use str_c and concatenate x with another string "Tutorial", separated by a blank
space and print it

Ans:
library(stringr)
x<-"R"
print(str_c(x, "Tutorial", sep = " "))

2. Perform the following operations:


Create a variable X with value hop a little, jump a little, eat a little, drive a
little.
Find the frequency of little in the string and print it.

Ans:
X<-c('hop a little', 'jump a little', 'eat a little', 'drive a little')
print(str_count(X, 'little'))
3. Perform the following operations:
Create a variable with a value hop a little, jump a little. Find out the positions
of the matching patterns little.
Try out using str_locate and str_locate_all and print the output of both
separately.

Ans:
V<-c('hop a little', 'jump a little')
print(str_locate(V, 'little'))
print(str_locate_all(V, 'little'))

4.Perform the following operations:


Use str_detect to identify the existence of a pattern in a string
Detect if there is a character z in the string hop a little, jump a little and
print it.

Ans:
print(str_detect(V, 'z'))

5. Perform the following operations:


Assign a string with say value TRUE NA TRUE NA NA NA FALSE.
Using str_extract and str_extract_all, find and extract a matching pattern, say, NA
and print the output of both
separately.

Ans:
Z<-c('TRUE', 'NA', 'TRUE', 'NA', 'NA', 'NA', 'FALSE')
print(str_extract (Z, 'NA'))
print(str_extract_all (Z, 'NA'))

6.Perform the following operation:


Use str_length and find out the length of the string TRUE NA TRUE NA NA NA FALSE
and print it.

Ans:
print(str_length(Z))

7.Perform the following operation:


Using str_to_upper and str_to_lower, change the case of TRUE NA TRUE NA NA NA
FALSE to lower, and then to upper again,
and print the output of both separately.

Ans:
print(str_to_upper(Z))
print(str_to_lower(Z))

8.Perform the following operations:


Create a vector y with values alpha, gama, duo, uno, beta
Using str_order, sort this vector and print it.

Ans:
y<-c('alpha', 'gama', 'duo', 'uno', 'beta')
print(str_order(y))

9.Perform the following operations:


Assign the value alpha for the vector y and try to pad the vector value with % on
the left and right of the string where the
total length of the string becomes 13 and print it.

Ex-%%%%alpha%%%%

Ans:
y<-'alpha'
print(str_pad(y, 13, 'both', pad='%'))

10.Perform the following operation:


Use str_trim to remove white spaces.
Create a vector z with values ("A", "B", "C") having leading white spaces.
Trim the white spaces and print it.

Ans:
z<-c(' A', ' B' , ' C')
print(str_trim(z))


Data Type conversions


Data type conversions of scalars, vectors (logical, character, numeric), matrices
and dataframes are possible in R. Converting a variable from one type to another is
called coercion.

character > numeric > integer > logical

is.numeric(), is.character(), is.vector(), is.matrix(), is.data.frame() return
TRUE or FALSE;
as.numeric(), as.character(), as.vector(), as.matrix(), as.data.frame() convert
one data type to another.
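A quick illustration of these functions (the value "3.14" is hypothetical):

```r
x <- "3.14"
print(is.numeric(x))      # FALSE: x is a character string
y <- as.numeric(x)        # coerce character to numeric
print(is.numeric(y))      # TRUE
print(as.character(42))   # "42"
```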

Ans:
strDates <- as.Date("01/05/1965", format = "%d/%m/%Y")
print(strDates)

Special values in R

Missing value treatment using mean / median


This video covers how to do the mean and median imputations for some of the missing
values in the data set

Outlier Analysis
This video discusses some of the possible ways to deal with outliers.

Practice - Winsorizing technique


Consider the same data set called 'Outlierset' from the earlier example.

Outlierset <- c(19, 13, 29, 17, 5, 16, 18, 20, 55, 22,33,14,25, 10,29, 56)
Copy Outlierset to a new dataset Outlierset1.
Replace the outliers with 36 which is 3rd quartile + minimum.
Compare boxplot on both Outlierset and Outlierset1. You should see no outliers in
the new dataset.
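One possible sketch of the steps above:

```r
# Same data set as the earlier example
Outlierset <- c(19, 13, 29, 17, 5, 16, 18, 20, 55, 22, 33, 14, 25, 10, 29, 56)
Outlierset1 <- Outlierset

# Winsorize: cap values above 36 (3rd quartile + minimum)
Outlierset1[Outlierset1 > 36] <- 36

# Compare the two distributions side by side
boxplot(Outlierset, Outlierset1, names = c("Original", "Winsorized"))
```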

Outlier treatment by Variable Transformation

Obvious errors
We have so far seen how to handle missing values, special values & outliers.

Sometimes, we might come across obvious errors which cannot be caught by the
techniques learnt previously.

Errors such as the age field having a negative value, or the height field being,
say, 0 or an implausibly small number. Such erroneous data still needs manual
checks and corrections.

Data Cleansing using R - Course Summary


In this course, you have learnt the rules around what makes data tidy or messy. You
have also seen 3 new packages: tidyr, dplyr and stringr. Further, different
techniques to treat missing values, special values and outliers, and the
preparation of tidy data for data analysis, were discussed.

Final Hands-on
1. Perform the following operations
-Create a vector x<-c(19,13,NA,17,5,16,NA,20,55,22,33,14,25,NA,29,56).
-Treat the missing values by replacing them with mean once, and then with the
median of vector and print out the output
of the both separately.
2. Perform the following operations
-Create the dataset Outlierset<-c(19,13,29,17,5,16,18,20,55,22,33,14,25,10,29,56)
-Make a summary of the dataset and print it.
-Create a new dataset called Cleanset and assign data from Outlierset, discarding
values above 36 (which is the 3rd Quartile + min) and print it.

Ans:
x<-c(19,13,NA,17,5,16,NA,20,55,22,33,14,25,NA,29,56)
x[is.na(x)]<- mean(x[!is.na(x)])
print(x)
x<-c(19,13,NA,17,5,16,NA,20,55,22,33,14,25,NA,29,56)
x[is.na(x)]<-median(x[!is.na(x)])
print(x)

Outlierset<-c(19,13,29,17,5,16,18,20,55,22,33,14,25,10,29,56)
print(summary(Outlierset))
Cleanset<-Outlierset
print(Cleanset<-Cleanset[Cleanset < 36])

