Cleaning Data3
CLEANING DATA IN R
Maggie Matsui
Content Developer @ DataCamp
Uniformity
Different units or formats
Temperature: °C vs. °F
Where do uniformity issues come from?
Finding uniformity issues
head(nyc_temps)
        date temp
1 2019-04-01  4.2
2 2019-04-02  7.5
3 2019-04-03 12.2
4 2019-04-04 11.1
5 2019-04-05 41.5
6 2019-04-06 11.9
Finding uniformity issues
library(ggplot2)
ggplot(nyc_temps, aes(x = date, y = temp)) +
  geom_point()
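A numeric check can back up what the scatter plot shows. The sketch below uses a small made-up data frame (`toy_temps` and its values are hypothetical, not the course data) and flags readings that sit far from the column median. The cutoff of 20 degrees is an arbitrary assumption and would need tuning for real data.

```r
# Hypothetical toy data: mostly °C readings with one °F reading mixed in
toy_temps <- data.frame(
  date = as.Date("2019-04-01") + 0:5,
  temp = c(4.2, 7.5, 12.2, 11.1, 58.5, 11.9)
)

# Distance of each reading from the median of the column
deviation <- abs(toy_temps$temp - median(toy_temps$temp))

# Flag readings more than 20 degrees from the median (arbitrary cutoff)
suspect <- toy_temps[deviation > 20, ]
print(suspect)
```

The median is used instead of the mean because a few readings in the wrong unit pull the mean toward themselves.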
What to do?
There's no one best option. It depends on your dataset!
Do your research to understand where your data comes from
Data from Apr 7, 16, and 23 is from an external source that measured temps in °F
Unit conversion
$C = (F - 32) \times \frac{5}{9}$
ifelse(condition, value_if_true, value_if_false)
library(dplyr)
# Readings above 50 are treated as °F and converted to °C
nyc_temps %>%
  mutate(temp_c = ifelse(temp > 50, (temp - 32) * 5 / 9, temp))
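The same conversion can be tried on a plain vector. This is a minimal sketch with made-up readings; the helper name `f_to_c` is hypothetical, and the threshold of 50 is the one used in the slide code.

```r
# Hypothetical readings: 57.0 and 68.0 are assumed to be in °F
temp <- c(4.2, 7.5, 57.0, 11.1, 68.0)

# Convert a Fahrenheit value to Celsius: C = (F - 32) * 5 / 9
f_to_c <- function(f) (f - 32) * 5 / 9

# Readings above 50 are treated as °F and converted; the rest are kept as-is
temp_c <- ifelse(temp > 50, f_to_c(temp), temp)
print(round(temp_c, 1))
```

`ifelse()` is vectorized, so the condition is checked element by element and no loop is needed. A fixed threshold only works when the two units do not overlap in the data, which is why knowing the source of each reading matters.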
Unit conversion
nyc_temps %>%
  mutate(temp_c = ifelse(temp > 50, (temp - 32) * 5 / 9, temp)) %>%
  ggplot(aes(x = date, y = temp_c)) +
  geom_point()
Date uniformity
nyc_temps
Parsing multiple formats
library(lubridate)
parse_date_time(nyc_temps$date,
                orders = c("%Y-%m-%d", "%m/%d/%y", "%B %d, %Y"))
NA
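To see how the `orders` argument behaves, here is a small sketch on three made-up strings, one per format. The vector `raw_dates` is hypothetical, and the month name is assumed to be parsed under an English locale.

```r
library(lubridate)

# Hypothetical dates written in three different formats
raw_dates <- c("2019-04-01", "04/02/19", "April 3, 2019")

# Each string is matched against the candidate orders
parsed <- parse_date_time(
  raw_dates,
  orders = c("%Y-%m-%d", "%m/%d/%y", "%B %d, %Y")
)

# parse_date_time() returns date-times (POSIXct); as.Date() drops the time part
print(as.Date(parsed))
```

Strings that match none of the listed orders come back as `NA` with a warning, so counting `NA` values after parsing is a quick way to find formats that were missed.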
Ambiguous dates
Is 02/04/2019 in February or April?
Options include:
Treat as missing
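The "treat as missing" option could be sketched as follows. The input strings are made up, and the rule used here (both leading fields are at most 12 and differ from each other) is one possible definition of "ambiguous", not the only one.

```r
# Hypothetical slash-separated dates with unknown day/month order
raw_dates <- c("02/04/2019", "25/04/2019", "04/13/2019")

# Split each string into its numeric pieces
parts  <- strsplit(raw_dates, "/", fixed = TRUE)
first  <- as.integer(sapply(parts, `[`, 1))
second <- as.integer(sapply(parts, `[`, 2))

# Ambiguous: both leading fields could be a month (1-12) and they differ
ambiguous <- first <= 12 & second <= 12 & first != second

# Treat ambiguous entries as missing rather than guessing
cleaned <- ifelse(ambiguous, NA, raw_dates)
print(cleaned)
```

Only the first string is set to `NA`: in the other two, one field is larger than 12, so it can only be the day.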
Let's practice!
Cross field validation
What is cross field validation?
Cross field validation = a sanity check
Does this value make sense based on other values?
1 https://www.buzzfeednews.com/article/katienotopoulos/graphs-that-lied-to-us
Credit card data
head(credit_cards)
Validating numbers
credit_cards %>%
  select(dining_cb:total_cb)
Validating numbers
credit_cards %>%
  mutate(theoretical_total = dining_cb + groceries_cb + gas_cb) %>%
  filter(theoretical_total != total_cb) %>%
  select(dining_cb:theoretical_total)
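A self-contained version of this check, on made-up numbers, might look like the sketch below. The data frame `toy_cards` is hypothetical. The comparison uses a small tolerance instead of `!=`, which is an assumption added here because sums of decimal values can differ from a stored total by floating-point rounding.

```r
library(dplyr)

# Hypothetical cashback data; the second row has a total that does not add up
toy_cards <- tibble(
  dining_cb    = c(10.50, 20.00, 5.25),
  groceries_cb = c(4.25, 8.00, 3.10),
  gas_cb       = c(2.00, 1.50, 0.65),
  total_cb     = c(16.75, 35.00, 9.00)
)

# Recompute the total and keep rows where it disagrees with the stored one
inconsistent <- toy_cards %>%
  mutate(theoretical_total = dining_cb + groceries_cb + gas_cb) %>%
  filter(abs(theoretical_total - total_cb) > 0.005)

print(inconsistent)
```

The tolerance of 0.005 corresponds to half a cent; a different scale would call for a different value.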
Validating date and age
credit_cards %>%
  select(date_opened, acct_age)
  date_opened acct_age
1  2018-07-05        1
2  2016-01-23        4
3  2016-03-25        4
4  2018-06-20        1
5  2017-02-08        3
6  2014-11-18        5
Calculating age
library(lubridate)
date_difference <- as.Date("2015-09-04") %--% today()
date_difference
as.numeric(date_difference, "years")
4.511978
floor(as.numeric(date_difference, "years"))
Validating age
credit_cards %>%
  mutate(theor_age = floor(as.numeric(date_opened %--% today(), "years"))) %>%
  filter(theor_age != acct_age)
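Because `today()` changes from day to day, the result of this check changes too. The sketch below pins a reference date so the output is reproducible. The date `2020-03-10`, the data frame `toy_accounts`, and the use of `time_length()` with `interval()` in place of `as.numeric()` with `%--%` are all assumptions made for illustration.

```r
library(dplyr)
library(lubridate)

# Fixed reference date (an assumption; the slide code uses today())
ref_date <- ymd("2020-03-10")

# Hypothetical accounts; the second recorded age is one year too high
toy_accounts <- tibble(
  date_opened = ymd(c("2018-07-05", "2016-03-25", "2014-11-18")),
  acct_age    = c(1, 4, 5)
)

# Whole years between opening date and reference date, compared to acct_age
mismatched <- toy_accounts %>%
  mutate(theor_age = floor(time_length(interval(date_opened, ref_date), "years"))) %>%
  filter(theor_age != acct_age)

print(mismatched)
```

In a real pipeline the reference date would usually be the date the data was extracted, since an age stored in the data was correct as of that day rather than as of today.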
What next?
Let's practice!
Completeness
What is missing data?
Air quality
head(airquality)
Finding missing values
is.na(airquality)
Counting missing values
# Count missing values in the entire dataset
sum(is.na(airquality))
44
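The overall count does not say where the gaps are. A per-column breakdown, sketched here with base R only, shows which variables are affected. `airquality` is the built-in dataset used above; the names `na_per_column` and `na_share` are introduced here for illustration.

```r
# Number of missing values in each column of the built-in airquality data
na_per_column <- colSums(is.na(airquality))
print(na_per_column)

# Share of missing values per column, as a percentage
na_share <- round(100 * colMeans(is.na(airquality)), 1)
print(na_share)
```

`is.na()` returns a logical matrix for a data frame, so `colSums()` counts the `TRUE` entries per column and `colMeans()` gives their proportion.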
Visualizing missing values
library(visdat)
vis_miss(airquality)
Investigating missingness
airquality %>%
  mutate(miss_ozone = is.na(Ozone)) %>%
  group_by(miss_ozone) %>%
  summarize(across(everything(), ~ median(.x, na.rm = TRUE)))
Investigating missingness
airquality %>%
  arrange(Temp) %>%
  vis_miss()
Types of missing data
Types of missingness
Dealing with missing data
Simple approaches:
1. Drop missing data
2. Impute (fill in) with statistical measures (mean, median, mode, etc.) or domain knowledge
Dropping missing values
airquality %>%
  filter(!is.na(Ozone), !is.na(Solar.R))
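Before dropping rows it helps to know how many will be lost. The sketch below does the same filtering with base R and reports the difference. The names `keep`, `airquality_complete`, and `n_dropped` are introduced here for illustration.

```r
# TRUE for rows where both Ozone and Solar.R are present
keep <- complete.cases(airquality[, c("Ozone", "Solar.R")])
airquality_complete <- airquality[keep, ]

# How many rows were dropped?
n_dropped <- nrow(airquality) - nrow(airquality_complete)
print(n_dropped)
```

If the dropped share is large, or if missingness is related to other columns, dropping can bias later results and imputation may be the better choice.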
Replacing missing values
airquality %>%
  mutate(ozone_filled = ifelse(is.na(Ozone), mean(Ozone, na.rm = TRUE), Ozone))
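A base R sketch of the same imputation is shown below, with one addition that is not in the slides: a logical flag (`was_imputed`, a name introduced here) recording which values were filled in, so they can be told apart from observed values later.

```r
# Mean of the observed Ozone values
ozone_mean <- mean(airquality$Ozone, na.rm = TRUE)

# Remember which entries are missing before filling them
was_imputed <- is.na(airquality$Ozone)

# Replace missing entries with the mean; keep observed values unchanged
ozone_filled <- ifelse(was_imputed, ozone_mean, airquality$Ozone)

print(sum(was_imputed))
print(anyNA(ozone_filled))
```

Mean imputation leaves the column mean unchanged but reduces its variance, because every filled value sits exactly at the center of the distribution.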
Let's practice!