14 Clean The Mess

Uploaded by

nayarh1997

Warm-Up

Today, we have a script (14-clean-the-mess.R), a presentation, and a new dataset on Canvas.
Once you have the class materials:
1. Move the dataset into your R project for this class.
2. Read through the data dictionary and make a list of variables that may need to be cleaned before analysis or visualization is possible.
az_data
• arid: the reviewer. There should only be one review for a
given reviewer for a given product. However, there may
be reviews for multiple products by the same reviewer.

• arin: the review's ID. Review ID should be unique across all reviews (no duplications).

• asin: alphanumeric product ID.

• price: price of product in US$.


Clean the Mess

2022-10-12 | HGEN611
Roadmap
1. Data handling keys

2. How BPA can guide cleaning

3. Basic recoding

4. ifelse()

5. case_when()

6. coalesce()
Data Handling Rules
Before we touch our data…
Rule 1
Raw data is READ ONLY

You must always maintain a pristine copy of the original, unmodified data.

All data cleaning and processing needs to be saved as a different object (preferably) in a different folder.

By modifying only copies of the original data, you guarantee that you have a way back to the start of the analysis if something goes horribly wrong.
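A minimal sketch of Rule 1 in practice, assuming a hypothetical project layout with data/raw and data/clean folders (the folder names, file names, and values below are made up for illustration, and a temporary directory stands in for the R project):

```r
# Hypothetical layout: data/raw holds untouched files,
# data/clean holds anything we produce ourselves.
project_dir <- file.path(tempdir(), "rule-1-demo")
raw_dir     <- file.path(project_dir, "data", "raw")
clean_dir   <- file.path(project_dir, "data", "clean")
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(clean_dir, recursive = TRUE, showWarnings = FALSE)

# Pretend this file just arrived from a collaborator (made-up values).
raw_path <- file.path(raw_dir, "reviews.csv")
write.csv(data.frame(asin = c("A1", "A2"), price = c("$5.00", "$7.50")),
          raw_path, row.names = FALSE)

# Read the raw file, then do all cleaning on a separate object.
reviews_raw   <- read.csv(raw_path, stringsAsFactors = FALSE)
reviews_clean <- reviews_raw
reviews_clean$price_r <- as.numeric(sub("$", "", reviews_clean$price, fixed = TRUE))

# Save the cleaned copy to a different folder; the raw file is never overwritten.
clean_path <- file.path(clean_dir, "reviews_clean.csv")
write.csv(reviews_clean, clean_path, row.names = FALSE)
```

If the cleaning step turns out to be wrong, reviews_raw and the file in data/raw are still exactly as received.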
Rule 2
Raw data is saved in
more than one location

You need to pick at least two places to store precious data to give your future self the absolute best chance of always having an accessible copy of the raw data.
Rule 3
Create Analysis-Friendly
Data
Tidy, well-documented data will be much
less stressful to work with.

• More confident in correctness

• Easier to make progress
Rule 4
Record all data
processing steps

No part of data handling should be done through manual manipulation of files (e.g., deleting comments out of Excel files).

Does it happen occasionally? Yes. Should we try to minimize it? Absolutely!
Rule 5
Use R Projects to
organize analyses

R Projects make life easier for you, future you, and anyone else who ever tries to access your data, results, scripts, or other files.
R Projects
• Make organizing input, output, and metadata files easy

• Provide an efficient, uniform way to establish directory paths, even if you move the project to different places on your computer or to another computer altogether.
Every file gets a home as soon as you receive it.

Do not leave files in your Downloads folder or on your Desktop with the plan of organizing them “later”. Later rarely arrives kindly.
Basic Pattern Analysis
bpa
• Provides a digestible overview that may be otherwise impossible with table() or count()

• The insight from bpa may direct how you approach recoding.
bpa() template
<DATA> %>%
pull(<VARIABLE>) %>%
get_pattern %>%
table %>%
as.data.frame
bpa() template, step by step
• pull(<VARIABLE>): extracts the variable’s values as a vector.
• get_pattern: converts values to a standardized format (all numbers to 9’s, white space to “w”, characters to “A/a”).
• table: counts each unique value.
• as.data.frame: provides a nice output for us to view by converting the table() output to a data.frame.
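To make the standardization step concrete, here is a rough base-R stand-in for what get_pattern does. The function sketch_pattern() and the sample dates are hypothetical and written only for illustration; this is not the bpa package's own code, just an approximation of the rules described above.

```r
# Rough base-R approximation of the pattern step; not the bpa source.
sketch_pattern <- function(x) {
  x <- gsub("[0-9]", "9", x)        # every digit becomes 9
  x <- gsub("[[:upper:]]", "A", x)  # uppercase letters become A
  x <- gsub("[[:lower:]]", "a", x)  # lowercase letters become a
  gsub("[[:space:]]", "w", x)       # white space becomes w (done last)
}

# Made-up dates in four different formats.
dates <- c("08/31/2017", "2017-08-31", "August 31 2017", "July 4 2016")

# Same final steps as the template: count each pattern, view as a data.frame.
pattern_counts <- as.data.frame(table(sketch_pattern(dates)))
pattern_counts
```

Each distinct format collapses to its own pattern, so the number of rows in the result is the number of formats present.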
bpa() example
messy %>%
  pull(Date) %>%
  get_pattern %>%
  table %>%
  as.data.frame

   Patterns           Freq
1  99/99/9999         259
2  9999-99-99         262
3  99Aaa9999          241
4  Aaaaaaaaaw99w9999  19
5  Aaaaaaaaw99w9999   56
6  Aaaaaaaw99w9999    45
7  Aaaaaaw99w9999     24
8  Aaaaaw99w9999      36
9  Aaaaw99w9999       42
10 Aaaw99w9999        16
Your Turn 1
Use str()/head() and bpa() to
characterize the date variable in
az_data.
How many date formats are present?
Your Turn 1
az_data %>%
  pull(date) %>%
  get_pattern %>%
  table %>%
  as.data.frame

   Pattern             Freq
1  Aaaaaaaaaw9,w9999   14283
2  Aaaaaaaaaw99,w9999  28516
3  Aaaaaaaaw9,w9999    34706
4  Aaaaaaaaw99,w9999   79174
5  Aaaaaaaw9,w9999     23696
6  Aaaaaaaw99,w9999    57597
7  Aaaaaaw9,w9999      15023
8  Aaaaaaw99,w9999     36825
9  Aaaaaw9,w9999       26011
10 Aaaaaw99,w9999      60508
11 Aaaaw9,w9999        26236
12 Aaaaw99,w9999       69445
13 Aaaw9,w9999         12235
14 Aaaw99,w9999        28391
Your Turn 1

az_data %>%
pull(date) %>%
head

head(az_data$date)

[1] NA NA
[3] "August 31, 2017" "July 20, 2017"
[5] "November 14, 2016" "August 31, 2016"
Verifying Correctness
Your Turn 2
How could you test whether the code below correctly converts the date variable from a character to a Date object? What are some challenges?

az_data %>%
  mutate(date = lubridate::mdy(date))
Create a new variable
instead of replacing one.

Creates a new variable


az_data %>%
mutate(date_r = lubridate::mdy(date))

Replaces the variable


az_data %>%
mutate(date = lubridate::mdy(date))
Use select() to perform a
visual check.

az_data %>%
mutate(date_r = lubridate::mdy(date)) %>%
select(date, date_r)
Use bpa() to perform a
global check.
az_data %>%
mutate(date_r = lubridate::mdy(date)) %>%
pull(date_r) %>%
get_pattern %>%
table %>%
as.data.frame

Pattern Freq
9999-99-99 512646
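Beyond the visual and global checks, one more possible check is to look for values that had text before conversion but are NA afterwards. The sketch below uses a made-up data frame called toy in place of az_data; the "unknown" entry is a hypothetical bad value, not something observed in the real data.

```r
library(dplyr)
library(lubridate)

# Made-up stand-in for az_data: two valid dates, one missing, one unparseable.
toy <- data.frame(
  date = c("August 31, 2017", "July 20, 2017", NA, "unknown"),
  stringsAsFactors = FALSE
)

# Keep the original column and create a new one, as recommended above.
checked <- toy %>%
  mutate(date_r = mdy(date, quiet = TRUE))

# Rows that had text going in but are NA coming out failed to parse.
failed <- checked %>%
  filter(!is.na(date), is.na(date_r))

# If every non-missing date parsed, these two counts would be equal.
n_missing_before <- sum(is.na(checked$date))
n_missing_after  <- sum(is.na(checked$date_r))
```

Here failed has one row and the missing count grows by one, which flags the conversion for a closer look.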
Simple Recoding
ifelse
Recodes values based on the
outcome of a logical expression

ifelse(<logical test>,
<outcome if TRUE>,
<outcome if FALSE>)

x <- 1:10
ifelse(x < 5, "fizz", "bang")

("fizz", "fizz", "fizz", "fizz", "bang", "bang", "bang", "bang", "bang", "bang")

ifelse(messy$Gender == "M",
       "Male",
       messy$Gender)
ifelse
Verify recoding worked with count() or bpa()
messy %>%
mutate(Gender_r = ifelse(Gender == "M",
"Male",
Gender)) %>%
count(Gender, Gender_r)
Could we keep using ifelse
to recode all Gender values?
Yep!
messy %>%
mutate(Gender_r = ifelse(Gender == "M", "Male", Gender),
Gender_r = ifelse(Gender_r == "F",
"Female", Gender_r)) %>%
count(Gender, Gender_r)

Notice that the second ifelse() has to test Gender_r rather than Gender to account for our previous recoding.
messy %>%
mutate(Gender_r = ifelse(Gender == "M", "Male", Gender),
Gender_r = ifelse(Gender_r == "F",
"Female", Gender_r),
Gender_r = ifelse(Gender_r == "female",
"Female", Gender_r),
Gender_r = ifelse(Gender_r == "male",
"Male", Gender_r)) %>%
count(Gender, Gender_r)
But there’s a more efficient option
case_when
Simple way to complete multiple ifelse statements

case_when(
<logical test> ~ <outcome>,
<logical test> ~ <outcome>,
)
Your Turn 3
Complete the code so that all values in Gender_r are recoded to either "Male" or "Female".

messy %>%
mutate(Gender_r =
case_when(Gender == "M" ~ "Male",
Gender == "F" ~ "Female")) %>%
count(Gender, Gender_r)
Your Turn 3
messy %>%
mutate(Gender_r =
case_when(Gender == "M" ~ "Male",
Gender == "F" ~ "Female",
Gender == "male" ~ "Male",
Gender == "female" ~ "Female"
)
) %>%
count(Gender, Gender_r)
case_when
Can evaluate complex recodings involving multiple
variables at once.
case_when
To create a category equivalent to an “else” statement, use TRUE as the final logical test at the end of the code chunk.

For example, with TRUE ~ "other" as the last line, all observations that do not meet one of the criteria above it will get the value of “other”.
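A small sketch of both ideas together: tests that involve more than one variable, and a final TRUE catch-all. The visits data frame, its columns, and the category labels are all hypothetical and exist only to illustrate the pattern.

```r
library(dplyr)

# Made-up data; column names and categories are hypothetical.
visits <- data.frame(
  age    = c(34, 17, 70, NA),
  smoker = c("yes", "no", "no", "yes"),
  stringsAsFactors = FALSE
)

# Tests are checked top to bottom; the first one that is TRUE wins.
visits_r <- visits %>%
  mutate(group = case_when(
    age >= 65                   ~ "older adult",
    age >= 18 & smoker == "yes" ~ "adult smoker",
    age >= 18                   ~ "adult non-smoker",
    TRUE                        ~ "other"  # catch-all for everything unmatched
  ))

# Verify the recoding the same way as before.
visits_r %>%
  count(age, smoker, group)
```

The row with a missing age falls through every test (a comparison with NA is never TRUE) and lands in "other", so it is worth checking whether the catch-all is hiding missing values.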
Putting it all together
Your Turn 4
Does this code correctly convert the price variable
to a numeric value? Can it be improved?

az_data %>%
filter(!duplicated(asin)) %>%
filter(!is.na(asin), !is.na(price)) %>%
mutate(price_r = as.numeric(price))
Could inspect visually
az_data %>%
filter(!duplicated(asin)) %>%
filter(!is.na(asin), !is.na(price)) %>%
mutate(price_r = as.numeric(price)) %>%
filter(is.na(price_r)) %>%
select(price, price_r)
Maybe we do an online search for
help and learn about parse_number

az_data %>%
filter(!duplicated(asin)) %>%
filter(!is.na(asin), !is.na(price)) %>%
mutate(price_r =
readr::parse_number(price))
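To see why the two functions behave differently, it can help to try them on a few hand-written strings first. The price values below are made up for illustration and are not taken from az_data.

```r
library(readr)

# Made-up price strings in the style of a currency column.
price <- c("$12.99", "$1,299.00", "7.50")

# as.numeric() returns NA for anything that is not purely numeric text
# (the coercion warning is silenced here to keep the output clean).
price_as_numeric <- suppressWarnings(as.numeric(price))

# parse_number() drops the currency symbol and the grouping comma.
price_parsed <- parse_number(price)

data.frame(price, price_as_numeric, price_parsed)
```

Only the plain "7.50" survives as.numeric(), while parse_number() returns a number for all three strings.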
No NA values!
That’s great, right?
Let’s double check what
bpa() recommends
az_data %>%
pull(price) %>%
get_pattern %>%
table %>%
as.data.frame
Whoa! We had no idea that there were price ranges in the data. The data dictionary never mentioned that information. How can we figure out how parse_number recoded the prices with ranges?
Your Turn 5
Figure out how parse_number recoded the
prices with ranges. Should we keep these
values?

Complete this task as a take-home challenge.


coalesce()
Function for collapsing missingness.
water_bottle_readings %>%
mutate(readings_combined = coalesce(bottle_1, bottle_2, bottle_3))

Checks the variables for values in the order that they are listed.
How to create the
water_bottle_readings toy dataset

set.seed(123)
water_bottle_readings <- data.frame(
  bottle_1 = sample(size = 10, x = c(1:5, NA, NA, NA), replace = TRUE),
  bottle_2 = sample(size = 10, x = c(1:5, NA, NA, NA), replace = TRUE),
  bottle_3 = sample(size = 10, x = c(1:5, NA, NA, NA), replace = TRUE))
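Because the toy dataset is random, the order-of-checking behavior can be easier to see with hand-written values. The toy_readings data frame below is a hypothetical, smaller version of water_bottle_readings made up for this sketch.

```r
library(dplyr)

# Hand-written readings so the result is easy to verify by eye.
toy_readings <- data.frame(
  bottle_1 = c(1, NA, NA, NA),
  bottle_2 = c(2, 5,  NA, NA),
  bottle_3 = c(3, 6,  9,  NA)
)

# For each row, take the first non-missing value, checking left to right.
toy_combined <- toy_readings %>%
  mutate(readings_combined = coalesce(bottle_1, bottle_2, bottle_3))

toy_combined
```

Row 1 keeps the bottle_1 value even though the other bottles have readings, and the last row stays NA because every bottle is missing.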
Test Yourself
Use the coalesce() + mutate() functions to provide more information in the specific_category variable. Specifically, create a new variable called specific_category_filled that gets the value of specific_category if specific_category is not NA, but if specific_category is NA, then replace that missingness with values from the quaternary_category, tertiary_category, or secondary_category in that order.

Then use filter() and count() to examine the extent of the new values you have added to specific_category_filled.
Reminders
Assignments
dplyr Practice Problems (Oct. 17th)
Practice Problems
Use the example code below to identify rows in
airquality with any missingness. Then, use
your knowledge of dplyr to calculate the total
amount of missingness present per month and
remove all the observations from months that
have any missingness in more than 20 days.
How many rows are left?

Example code
penguins %>%
mutate(total_NA = rowSums(is.na(penguins))) %>%
filter(total_NA < 3)
airquality %>%
  mutate(total_NA = rowSums(is.na(airquality))) %>%
  group_by(Month) %>%
  mutate(month_NA = sum(total_NA > 0)) %>%
  filter(month_NA <= 20)

June has the largest number of missing Ozone measurements (21). The code above will remove all values from any month with more than 20 days with missingness.
123 rows are left.
Hint: Test Yourself
az_data %>%
  mutate(
    specific_category_filled = coalesce(specific_category,
                                        quaternary_category)) %>%
  filter(is.na(specific_category)) %>%
  count(specific_category_filled, sort = TRUE)
