14 Clean The Mess
2022-10-12 | HGEN611
Roadmap
1. Data handling keys
3. Basic recoding
4. ifelse()
5. case_when()
6. coalesce()
Data Handling Rules
Before we touch our data…
Rule 1
Raw data is READ ONLY
Rule 5
Use R Projects to organize analyses
# tidyverse style
az_data %>%
  pull(date) %>%
  head()
# base R equivalent
head(az_data$date)
[1] NA NA
[3] "August 31, 2017" "July 20, 2017"
[5] "November 14, 2016" "August 31, 2016"
Verifying Correctness
Your Turn 2
How could you test whether the code below
correctly converts the date variable from
character to a Date object? What are some
challenges?
az_data %>%
  mutate(date = lubridate::mdy(date))
Create a new variable instead of replacing one.
az_data %>%
  mutate(date_r = lubridate::mdy(date)) %>%
  select(date, date_r)
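Keeping both columns makes the conversion testable. A minimal sketch, assuming a small made-up tibble (toy_dates is hypothetical, not the real az_data): any row where the original has a value but the converted column is NA is a failed parse.

```r
library(dplyr)

# Hypothetical stand-in for az_data$date (values chosen for illustration)
toy_dates <- tibble(
  date = c(NA, "August 31, 2017", "July 20, 2017", "not a date")
)

checked <- toy_dates %>%
  # quiet = TRUE silences the "failed to parse" warning
  mutate(date_r = lubridate::mdy(date, quiet = TRUE))

# Rows where the original had a value but the conversion returned NA
failed <- checked %>%
  filter(!is.na(date), is.na(date_r))

# If nothing failed, the NA counts before and after would be equal
na_before <- sum(is.na(checked$date))
na_after  <- sum(is.na(checked$date_r))
```

Comparing na_before with na_after is a quick global check; inspecting failed shows which raw values caused the difference.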
Use bpa() to perform a global check.
az_data %>%
  mutate(date_r = lubridate::mdy(date)) %>%
  pull(date_r) %>%
  get_pattern() %>%
  table() %>%
  as.data.frame()
     Pattern   Freq
  9999-99-99 512646
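To see what a pattern check is doing, here is a rough base R imitation. It is a sketch only: to_pattern() is a hypothetical helper, not part of the bpa package. The digit-to-9 rule matches the output shown above; mapping letters to "A" and "a" is an assumption made for illustration.

```r
# Hypothetical helper: collapse each value to its character "shape"
to_pattern <- function(x) {
  x <- gsub("[[:digit:]]", "9", x)  # every digit becomes 9
  x <- gsub("[[:upper:]]", "A", x)  # assumed convention for uppercase
  x <- gsub("[[:lower:]]", "a", x)  # assumed convention for lowercase
  x
}

raw_pattern   <- to_pattern("August 31, 2017")
clean_pattern <- to_pattern("2017-08-31")

# Tabulating patterns shows how many distinct shapes a column contains
pattern_counts <- table(to_pattern(c("2017-08-31", "2016-11-14", "Aug 2016")))
```

A column that converted cleanly should collapse to a single pattern; extra rows in the table point to values worth a closer look.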
Simple Recoding
ifelse
Recodes values based on the outcome of a logical expression
ifelse(<logical test>,
       <outcome if TRUE>,
       <outcome if FALSE>)
x <- 1:10
ifelse(x < 5, "fizz", "bang")
ifelse(messy$Gender == "M",
       "Male",
       messy$Gender)
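A small worked example may help here. The vector below is made up for illustration (it is not the course's messy data). Note that a missing value in the logical test stays missing in the result.

```r
# Hypothetical Gender values, including a missing one
gender <- c("M", "F", "male", NA)

# Recode "M" to "Male"; everything else keeps its original value
recoded <- ifelse(gender == "M", "Male", gender)

# NA == "M" evaluates to NA, so ifelse() returns NA for that element
still_missing <- is.na(recoded)
```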
ifelse
Verify recoding worked with count() or bpa()
messy %>%
  mutate(Gender_r = ifelse(Gender == "M",
                           "Male",
                           Gender)) %>%
  count(Gender, Gender_r)
Could we keep using ifelse to recode all Gender values?
Yep!
messy %>%
  mutate(Gender_r = ifelse(Gender == "M", "Male", Gender),
         Gender_r = ifelse(Gender_r == "F",
                           "Female", Gender_r)) %>%
  count(Gender, Gender_r)
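The same chain can be run end to end on a toy table. This is a sketch with invented values (toy_messy is hypothetical). The count() cross-tab at the end pairs each original value with its recoded value, which makes leftovers easy to spot.

```r
library(dplyr)

# Hypothetical stand-in for the messy data
toy_messy <- tibble(Gender = c("M", "F", "male", "female", "M", NA))

recode_check <- toy_messy %>%
  mutate(Gender_r = ifelse(Gender == "M", "Male", Gender),
         Gender_r = ifelse(Gender_r == "F", "Female", Gender_r)) %>%
  # One row per (original, recoded) pair, with its frequency in n
  count(Gender, Gender_r)

# Pairs where the recoded value is still not "Male" or "Female"
leftovers <- recode_check %>%
  filter(!Gender_r %in% c("Male", "Female"))
```

Here leftovers still contains "male", "female", and NA, which is the motivation for a tool that handles many conditions at once.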
case_when(
  <logical test> ~ <outcome>,
  <logical test> ~ <outcome>,
)
Your Turn 3
Complete the code so that all values in Gender_r
are recoded to either "Male" or "Female".
messy %>%
  mutate(Gender_r =
           case_when(Gender == "M" ~ "Male",
                     Gender == "F" ~ "Female")) %>%
  count(Gender, Gender_r)
Your Turn 3
messy %>%
  mutate(Gender_r =
           case_when(Gender == "M" ~ "Male",
                     Gender == "F" ~ "Female",
                     Gender == "male" ~ "Male",
                     Gender == "female" ~ "Female"
           )
  ) %>%
  count(Gender, Gender_r)
case_when
Can evaluate complex recodings involving multiple
variables at once.
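As an illustration of a recoding that uses more than one variable, here is a sketch on invented data. The age column and the group labels are assumptions made for this example only. Tests are evaluated top to bottom, so the more specific conditions come first.

```r
library(dplyr)

# Hypothetical data with two variables feeding one recoded column
toy_people <- tibble(
  Gender = c("M", "F", "M", "F"),
  age    = c(12, 40, 35, 9)
)

grouped <- toy_people %>%
  mutate(group = case_when(
    Gender == "M" & age < 18 ~ "Boy",    # checked first
    Gender == "M"            ~ "Man",    # remaining "M" rows
    Gender == "F" & age < 18 ~ "Girl",
    Gender == "F"            ~ "Woman"
  ))
```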
case_when
To create a catch-all category equivalent to an "else"
statement, use TRUE as the final logical test inside
case_when().
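A minimal sketch of the catch-all on made-up values (toy_gender and the "Other/missing" label are hypothetical). Because TRUE is always true, it picks up every row that no earlier test matched, including NA.

```r
library(dplyr)

# Hypothetical values, including an unexpected level and a missing one
toy_gender <- tibble(Gender = c("M", "female", "unknown", NA))

with_else <- toy_gender %>%
  mutate(Gender_r = case_when(
    Gender %in% c("M", "male")   ~ "Male",
    Gender %in% c("F", "female") ~ "Female",
    TRUE                         ~ "Other/missing"  # the "else" branch
  ))
```

Without the final TRUE line, unmatched rows would be NA in Gender_r.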
az_data %>%
  filter(!duplicated(asin)) %>%
  filter(!is.na(asin), !is.na(price)) %>%
  mutate(price_r = as.numeric(price))
Could inspect visually
az_data %>%
  filter(!duplicated(asin)) %>%
  filter(!is.na(asin), !is.na(price)) %>%
  mutate(price_r = as.numeric(price)) %>%
  filter(is.na(price_r)) %>%
  select(price, price_r)
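The effect of as.numeric() on text prices can be seen on a few invented strings (these values are assumptions, not rows from az_data). Anything that is not a plain number becomes NA.

```r
# Hypothetical price strings
price <- c("12.99", "$5.00", "1,299.99", "$10.00 - $15.00")

# as.numeric() warns and returns NA for strings it cannot read
price_r <- suppressWarnings(as.numeric(price))

# The original strings behind each new NA
problem_values <- price[is.na(price_r)]
```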
Maybe we do an online search for help and learn about parse_number
az_data %>%
  filter(!duplicated(asin)) %>%
  filter(!is.na(asin), !is.na(price)) %>%
  mutate(price_r =
           readr::parse_number(price))
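A quick sketch of readr::parse_number() on invented strings, assuming readr's default locale (where "," is the grouping mark). It drops the currency symbol and the thousands separator that as.numeric() cannot handle.

```r
# Hypothetical price strings
price <- c("12.99", "$5.00", "1,299.99")

# parse_number() strips the non-numeric characters around the number
price_r <- readr::parse_number(price)

n_missing <- sum(is.na(price_r))
```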
No NA values!
That's great, right?
Let's double-check with bpa()
az_data %>%
  pull(price) %>%
  get_pattern() %>%
  table() %>%
  as.data.frame()
Whoa! We had no idea that there were price ranges
in the data. The data dictionary never mentioned
that information. How can we figure out how
parse_number recoded the prices with ranges?
Your Turn 5
Figure out how parse_number recoded the
prices with ranges. Should we keep these
values?
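One possible way to start, sketched on invented values (toy_prices is hypothetical, and treating any price that contains "-" as a range is an assumption): keep the original column next to the parsed one and look only at the range rows.

```r
library(dplyr)

# Hypothetical prices, two of them ranges
toy_prices <- tibble(
  price = c("$5.00", "$10.00 - $15.00", "$7.25 - $9.00")
)

ranges <- toy_prices %>%
  mutate(price_r = readr::parse_number(price)) %>%
  # Assumption: a literal "-" marks a price range
  filter(grepl("-", price, fixed = TRUE)) %>%
  select(price, price_r)
```

Comparing price with price_r in ranges shows which part of each range survived parsing, which is the information needed to decide whether those values are usable.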
coalesce() checks the variables for values in the order that they are listed
and returns the first non-missing value for each row.
water_bottle_readings %>%
  mutate(readings_combined = coalesce(bottle_1, bottle_2, bottle_3))
How to create the water_bottle_readings toy dataset
set.seed(123)
water_bottle_readings <- data.frame(
  bottle_1 = sample(size = 10, x = c(1:5, NA, NA, NA), replace = TRUE),
  bottle_2 = sample(size = 10, x = c(1:5, NA, NA, NA), replace = TRUE),
  bottle_3 = sample(size = 10, x = c(1:5, NA, NA, NA), replace = TRUE))
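A smaller hand-written table makes the ordering rule easy to check by eye. This is a sketch with invented readings (toy_readings is hypothetical, separate from the random dataset above).

```r
library(dplyr)

# Hypothetical readings: each row is missing a different set of bottles
toy_readings <- tibble(
  bottle_1 = c(1,  NA, NA, NA),
  bottle_2 = c(2,  3,  NA, NA),
  bottle_3 = c(5,  4,  4,  NA)
)

combined <- toy_readings %>%
  # Take bottle_1 if present, else bottle_2, else bottle_3
  mutate(readings_combined = coalesce(bottle_1, bottle_2, bottle_3))
```

The result is NA only in the last row, where all three bottles are missing.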
Test Yourself
Use the coalesce() + mutate() functions to provide more
information in the specific_category variable. Specifically,
create a new variable called specific_category_filled that
gets the value of specific_category if specific_category is
not NA, but if specific_category is NA, then replace that
missingness with values from the quaternary_category,
tertiary_category, or secondary_category in that order.
Then use filter() and count() to examine the extent of the new
values you have added to specific_category_filled.
Reminders
Assignments
dplyr Practice Problems (Oct. 17th)
Practice Problems
Use the example code below to identify rows in
airquality with any missingness. Then, use
your knowledge of dplyr to calculate the total
amount of missingness present per month and
remove all the observations from months that
have any missingness in more than 20 days.
How many rows are left?
Example code
penguins %>%
  mutate(total_NA = rowSums(is.na(penguins))) %>%
  filter(total_NA < 3)
airquality %>%
  mutate(total_NA = rowSums(is.na(airquality))) %>%
  group_by(Month) %>%
  mutate(month_NA = sum(total_NA > 0)) %>%
  filter(month_NA <= 20)
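A base R cross-check of the per-month counts can help confirm the dplyr result. This is a sketch of one possible check, not part of the assignment; has_na and per_month are names chosen here for illustration.

```r
# Flag rows of the built-in airquality data with at least one missing value
has_na <- !complete.cases(airquality)

# Number of flagged rows (days) within each month
per_month <- tapply(has_na, airquality$Month, sum)

# Months whose count exceeds the 20-day threshold from the problem
months_over <- names(per_month)[per_month > 20]
```

The months listed in months_over should be exactly the ones that the dplyr pipeline filters out.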
Hint: Test Yourself
az_data %>%
  mutate(
    specific_category_filled = coalesce(specific_category,
                                        quaternary_category)) %>%
  filter(is.na(specific_category)) %>%
  count(specific_category_filled, sort = TRUE)