14 Clean The Mess


Warm-Up

Today, we have a script (14-clean-the-mess.R), a presentation, and a new dataset on Canvas.
Once you have the class materials:
1. Move the dataset into your R project for this class.
2. Read through the data dictionary and make a list of variables that may need to be cleaned before analysis or visualization is possible.
az_data
• arid: the reviewer. There should only be one review for a given reviewer for a given product. However, there may be reviews for multiple products by the same reviewer.
• arin: the review's ID. Review ID should be unique across all reviews (no duplications).
• asin: alphanumeric product ID.
• price: price of product in US$.
Clean the Mess

2022-10-12 | HGEN611
Roadmap
1. Data handling keys

2. How BPA can guide cleaning

3. Basic recoding

4. ifelse()

5. case_when()

6. coalesce()
Data Handling Rules
Before we touch our
data….
Rule 1
Raw data is READ ONLY

You must always maintain a pristine copy of the original, unmodified data.

All data cleaning and processing needs to be saved as a different object, preferably in a different folder.

By modifying only copies of the original

data, you guarantee that you have a
way back to the start of the analysis if
something goes horribly wrong.
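
A minimal sketch of Rule 1 in practice, assuming a hypothetical project layout with `data/raw` and `data/processed` folders (the folder names, the file name `reviews.csv`, and the toy prices are all made up for illustration). The raw file is only ever read; the cleaned copy is written somewhere else, and a checksum confirms the raw file did not change.

```r
# Hypothetical layout: data/raw holds the untouched file,
# data/processed holds cleaned copies. A temp folder stands in for the project.
project_dir   <- file.path(tempdir(), "demo-project")
raw_dir       <- file.path(project_dir, "data", "raw")
processed_dir <- file.path(project_dir, "data", "processed")
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(processed_dir, recursive = TRUE, showWarnings = FALSE)

# Stand-in for a raw file you have just received.
raw_path <- file.path(raw_dir, "reviews.csv")
write.csv(data.frame(price = c("$5.00", "$7.50")), raw_path, row.names = FALSE)
raw_checksum <- tools::md5sum(raw_path)

# Read the raw file, clean a copy in memory, and save the copy elsewhere.
raw <- read.csv(raw_path, stringsAsFactors = FALSE)
cleaned <- raw
cleaned$price_r <- as.numeric(sub("$", "", cleaned$price, fixed = TRUE))
write.csv(cleaned, file.path(processed_dir, "reviews_clean.csv"), row.names = FALSE)

# TRUE if the raw file is unchanged after cleaning.
identical(unname(tools::md5sum(raw_path)), unname(raw_checksum))
```

Keeping the original column (`price`) next to the recoded one (`price_r`) also makes it easy to check the cleaning step later.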
Rule 2
Raw data is saved in more than one location

You need to pick at least two places to

store precious data to give your future self
the absolute best chance of always having
an accessible copy of the raw data.
Rule 3
Create Analysis-Friendly Data
Tidy, well-documented data will be much less stressful to work with.
• More confident in correctness
• Easier to make progress
Rule 4
Record all data processing steps

No part of data handling should be done through manual manipulation of files (e.g., deleting comments out of Excel files).

Does it happen occasionally? Yes. Should we try to minimize it? Absolutely!
Rule 5
Use R Projects to organize analyses

R Projects make life easier for you, future you, and anyone else who ever tries to access your data, results, scripts, or other files.
R Projects
• Make organizing input, output, and metadata files easy
• Provide an efficient, uniform way to establish directory paths, even if you move the project to different places on your computer or to another computer altogether.

Every file gets a home as soon as you receive it.

Do not leave files in your Downloads folder or on your Desktop with the plan of organizing them “later”. Later rarely arrives kindly.
Basic Pattern Analysis
bpa
• Provides a digestible overview that may be otherwise impossible with table() or count()
• The insight from bpa may direct how you approach recoding.
bpa() template
<DATA> %>%
  pull(<VARIABLE>) %>%   # extract the variable's values as a vector
  get_pattern %>%        # convert values to a standardized format
                         # (all numbers to 9's, white space to "w", characters to "A/a")
  table %>%              # count each unique value
  as.data.frame          # convert the table() output to a data.frame for a nicer view
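
To make the pattern step concrete, here is a rough base-R approximation of the conversion described above. It is a sketch only: `pattern_sketch()` is a hypothetical helper written for this illustration, not the bpa package's implementation, and the toy dates are made up.

```r
# Rough approximation of the standardizing step; NOT the bpa package's code.
pattern_sketch <- function(x) {
  x <- as.character(x)
  x <- gsub("[[:upper:]]", "A", x)  # uppercase letters -> "A"
  x <- gsub("[[:lower:]]", "a", x)  # lowercase letters -> "a"
  x <- gsub("[[:digit:]]", "9", x)  # digits -> "9"
  gsub("[[:space:]]", "w", x)       # white space -> "w" (last, so the "w" is not re-coded)
}

dates <- c("August 31, 2017", "July 20, 2017", "2017-08-31", "08/31/2017")
patterns <- pattern_sketch(dates)
patterns

# Count each unique pattern and view the result as a data.frame.
as.data.frame(table(patterns))
```

Values that look different ("August 31, 2017" vs. "July 20, 2017") still produce different patterns because the month names differ in length, which is why a single date format can show up as several rows.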
bpa() example
messy %>%
  pull(Date) %>%
  get_pattern %>%
  table %>%
  as.data.frame

   Patterns            Freq
1  99/99/9999           259
2  9999-99-99           262
3  99Aaa9999            241
4  Aaaaaaaaaw99w9999     19
5  Aaaaaaaaw99w9999      56
6  Aaaaaaaw99w9999       45
7  Aaaaaaw99w9999        24
8  Aaaaaw99w9999         36
9  Aaaaw99w9999          42
10 Aaaw99w9999           16
Your Turn 1
Use str()/head() and bpa() to
characterize the date variable in
az_data.
How many date formats are present?
Your Turn 1
az_data %>%
  pull(date) %>%
  get_pattern %>%
  table %>%
  as.data.frame

   Pattern              Freq
1  Aaaaaaaaaw9,w9999   14283
2  Aaaaaaaaaw99,w9999  28516
3  Aaaaaaaaw9,w9999    34706
4  Aaaaaaaaw99,w9999   79174
5  Aaaaaaaw9,w9999     23696
6  Aaaaaaaw99,w9999    57597
7  Aaaaaaw9,w9999      15023
8  Aaaaaaw99,w9999     36825
9  Aaaaaw9,w9999       26011
10 Aaaaaw99,w9999      60508
11 Aaaaw9,w9999        26236
12 Aaaaw99,w9999       69445
13 Aaaw9,w9999         12235
14 Aaaw99,w9999        28391
Your Turn 1

az_data %>%
pull(date) %>%
head

head(az_data$date)

[1] NA NA
[3] "August 31, 2017" "July 20, 2017"
[5] "November 14, 2016" "August 31, 2016"
Verifying Correctness
Your Turn 2
How could you test whether the code below correctly converts the date variable from a character to a Date object? What are some challenges?

az_data %>%
  mutate(date = lubridate::mdy(date))
Create a new variable
instead of replacing one.

Creates a new variable

az_data %>%
mutate(date_r = lubridate::mdy(date))

Replaces the variable

az_data %>%
mutate(date = lubridate::mdy(date))
Use select() to perform a
visual check.

az_data %>%
mutate(date_r = lubridate::mdy(date)) %>%
select(date, date_r)
Use bpa() to perform a
global check.
az_data %>%
mutate(date_r = lubridate::mdy(date)) %>%
pull(date_r) %>%
get_pattern %>%
table %>%
as.data.frame

Pattern Freq
9999-99-99 512646
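
One more check that may be worth adding: a value that was present before conversion but is NA afterwards failed to parse. The sketch below uses a small made-up vector in place of az_data$date (the values, including the deliberately unparseable "unknown", are assumptions for illustration).

```r
library(lubridate)

# Toy stand-in for the date column; the last value is deliberately unparseable.
date_raw <- c(NA, "August 31, 2017", "July 20, 2017", "unknown")

# mdy() warns about values it cannot parse; the warning is silenced here
# because the failures are counted explicitly below.
date_r <- suppressWarnings(mdy(date_raw))

# Present before conversion, missing afterwards: these failed to parse.
failed <- is.na(date_r) & !is.na(date_raw)
sum(failed)
date_raw[failed]
```

Comparing missingness before and after the recode separates values that were already NA from values the conversion turned into NA.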
Simple Recoding
ifelse
Recodes values based on the outcome of a logical expression

ifelse(<logical test>,
<outcome if TRUE>,
<outcome if FALSE>)
x <- 1:10
ifelse(x < 5, "fizz", "bang")

("fizz", "fizz", "fizz", "fizz", "bang", "bang", "bang", "bang", "bang", "bang")
ifelse(messy$Gender == "M",
       "Male",
       messy$Gender)
ifelse
Verify recoding worked with count() or bpa()
messy %>%
mutate(Gender_r = ifelse(Gender == "M",
"Male",
Gender)) %>%
count(Gender, Gender_r)
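
A small self-contained sketch of the same recode, using a made-up vector rather than the messy dataset. It also shows what happens to missing values: the logical test returns NA for them, so ifelse() returns NA as well.

```r
# Toy values for illustration only.
gender <- c("M", "F", "male", NA)

# Recode "M" to "Male"; every other value is kept as-is.
gender_r <- ifelse(gender == "M", "Male", gender)
gender_r

# Cross-tabulate old and new values, keeping NA visible.
table(gender, gender_r, useNA = "ifany")
```

The cross-tabulation plays the same role as count(Gender, Gender_r): each original value should map to exactly one recoded value.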
Could we keep using ifelse
to recode all Gender values?
Yep!
messy %>%
mutate(Gender_r = ifelse(Gender == "M", "Male", Gender),
Gender_r = ifelse(Gender_r == "F",
"Female", Gender_r)) %>%
count(Gender, Gender_r)

Notice that the second ifelse() has to test Gender_r rather than Gender to account for our previous recoding.
messy %>%
mutate(Gender_r = ifelse(Gender == "M", "Male", Gender),
Gender_r = ifelse(Gender_r == "F",
"Female", Gender_r),
Gender_r = ifelse(Gender_r == "female",
"Female", Gender_r),
Gender_r = ifelse(Gender_r == "male",
"Male", Gender_r)) %>%
count(Gender, Gender_r)
But there’s a more efficient option
case_when
Simple way to complete multiple ifelse statements

case_when(
<logical test> ~ <outcome>,
<logical test> ~ <outcome>,
)
Your Turn 3
Complete the code so that all values in Gender_r
are recoded to either "Male" or "Female".

messy %>%
mutate(Gender_r =
case_when(Gender == "M" ~ "Male",
Gender == "F" ~ "Female")) %>%
count(Gender, Gender_r)
Your Turn 3
messy %>%
mutate(Gender_r =
case_when(Gender == "M" ~ "Male",
Gender == "F" ~ "Female",
Gender == "male" ~ "Male",
Gender == "female" ~ "Female"
)
) %>%
count(Gender, Gender_r)
case_when
Can evaluate complex recodings involving multiple
variables at once.
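
As an illustration of a recode that depends on more than one variable, here is a sketch with a hypothetical `products` data frame (the columns, cutoffs, and labels are invented for this example and are not part of az_data).

```r
library(dplyr)

# Hypothetical toy data; these columns are illustrative only.
products <- data.frame(price  = c(5, 50, 5, 50),
                       rating = c(5, 5, 1, 1))

products <- products %>%
  mutate(segment = case_when(
    price <  10 & rating >= 4 ~ "cheap and liked",
    price >= 10 & rating >= 4 ~ "pricey and liked",
    rating < 4                ~ "disliked"
  ))

# Verify the recode against both input variables.
products %>% count(price, rating, segment)
```

Conditions are evaluated in order, and each row takes the outcome of the first condition that is TRUE.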
case_when
To create a category equivalent to an “else” statement, use TRUE as the last condition in the case_when() call.

All observations that do not meet one of the earlier criteria will get the value assigned to TRUE, for example “other”.
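
A short sketch of the catch-all condition, again on a made-up vector. Note that missing values also fall through to the final TRUE branch here, which may or may not be what you want.

```r
library(dplyr)

# Toy values for illustration only.
gender <- c("M", "F", "male", "female", "unknown", NA)

gender_r <- case_when(
  gender %in% c("M", "male")   ~ "Male",
  gender %in% c("F", "female") ~ "Female",
  TRUE                         ~ "other"   # everything not matched above
)
gender_r
```

If NA should stay NA, one option is to add a condition such as `is.na(gender) ~ NA_character_` before the final TRUE line.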
Putting it all together
Your Turn 4
Does this code correctly convert the price variable
to a numeric value? Can it be improved?

az_data %>%
filter(!duplicated(asin)) %>%
filter(!is.na(asin), !is.na(price)) %>%
mutate(price_r = as.numeric(price))
Could inspect visually
az_data %>%
filter(!duplicated(asin)) %>%
filter(!is.na(asin), !is.na(price)) %>%
mutate(price_r = as.numeric(price)) %>%
filter(is.na(price_r)) %>%
select(price, price_r)
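
To see why this inspection matters, here is a sketch with a few made-up price strings (hypothetical values, not taken from az_data). as.numeric() only converts strings that are already plain numbers; anything with a currency symbol, a thousands separator, or a range becomes NA.

```r
# Hypothetical price strings for illustration only.
price <- c("12.99", "$12.99", "1,299.00", "$5.00 - $10.00")

# as.numeric() warns when it introduces NAs; silenced here because
# the failures are listed explicitly below.
price_r <- suppressWarnings(as.numeric(price))
price_r

# Values that were present but could not be converted.
price[is.na(price_r) & !is.na(price)]
```

Looking at the original strings behind the new NA values shows which kinds of formatting the conversion cannot handle.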
Maybe we do an online search for
help and learn about parse_number

az_data %>%
filter(!duplicated(asin)) %>%
filter(!is.na(asin), !is.na(price)) %>%
mutate(price_r =
readr::parse_number(price))
No NA values!
That’s great, right?
Let’s double check what
bpa() recommends
az_data %>%
pull(price) %>%
get_pattern %>%
table %>%
as.data.frame
Whoa! We had no idea that there were price ranges in the data. The data dictionary never mentioned that information. How can we figure out how parse_number recoded the prices with ranges?
Your Turn 5
Figure out how parse_number recoded the
prices with ranges. Should we keep these
values?

Complete this task as a take-home challenge.

coalesce()
Function for collapsing missingness.
water_bottle_readings %>%
mutate(readings_combined = coalesce(bottle_1, bottle_2, bottle_3))

Checks the variables for values in the order that they are listed.
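
A tiny worked example may help here. The three vectors below are made up so that the result is easy to follow by eye: for each position, coalesce() returns the first non-missing value, reading the arguments from left to right.

```r
library(dplyr)

# Toy readings chosen so the source of each result is obvious.
bottle_1 <- c(1, NA, NA, NA)
bottle_2 <- c(9,  2, NA, NA)
bottle_3 <- c(9,  9,  3, NA)

readings_combined <- coalesce(bottle_1, bottle_2, bottle_3)
readings_combined
```

The first position comes from bottle_1, the second from bottle_2, the third from bottle_3, and the fourth stays NA because every bottle is missing there.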
How to create the
water_bottle_readings toy dataset

set.seed(123)
water_bottle_readings <- data.frame(
  bottle_1 = sample(size = 10, x = c(1:5, NA, NA, NA), replace = TRUE),
  bottle_2 = sample(size = 10, x = c(1:5, NA, NA, NA), replace = TRUE),
  bottle_3 = sample(size = 10, x = c(1:5, NA, NA, NA), replace = TRUE)
)
Test Yourself
Use the coalesce() + mutate() functions to provide more information in the specific_category variable. Specifically, create a new variable called specific_category_filled that gets the value of specific_category if specific_category is not NA, but if specific_category is NA, then replace that missingness with values from the quaternary_category, tertiary_category, or secondary_category in that order.

Then use filter() and count() to examine the extent of the new values you have added to specific_category_filled.
Reminders
Assignments
dplyr Practice Problems (Oct. 17th)
Practice Problems
Use the example code below to identify rows in
airquality with any missingness. Then, use
your knowledge of dplyr to calculate the total
amount of missingness present per month and
remove all the observations from months that
have any missingness in more than 20 days.
How many rows are left?

Example code
penguins %>%
mutate(total_NA = rowSums(is.na(penguins))) %>%
filter(total_NA < 3)
airquality %>%
  mutate(total_NA = rowSums(is.na(airquality))) %>%
  group_by(Month) %>%
  mutate(month_NA = sum(total_NA > 0)) %>%   # days with any missingness, per month
  filter(month_NA <= 20)                     # drop months with more than 20 such days

June has the largest number of missing Ozone measurements (21). The code above will remove all values from any month with more than 20 days with missingness.
123 rows are left.
Hint: Test Yourself
az_data %>%
  mutate(
    specific_category_filled = coalesce(specific_category,
                                        quaternary_category)) %>%
  filter(is.na(specific_category)) %>%
  count(specific_category_filled, sort = TRUE)
