R Module 8 - Data Cleaning
Andrew Jaffe
January 6, 2016
Data
Each missing data type has a function that returns TRUE if the data
is missing:
- NA: is.na
- NaN: is.nan
- Inf and -Inf: is.infinite
- is.finite returns FALSE for all missing data and TRUE for
  non-missing data
- complete.cases on a data.frame/matrix returns TRUE if
  all values in that row of the object are not missing.
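As a minimal sketch of how these functions behave, the example below uses a made-up vector `v` and a made-up data frame `df` (both are assumptions for illustration, not objects from the course data). Note that is.na is also TRUE for NaN, while is.nan is TRUE only for NaN.

```r
# Hypothetical vector mixing an ordinary value with each kind of missing value
v <- c(1, NA, NaN, Inf, -Inf)

is.na(v)        # FALSE  TRUE  TRUE FALSE FALSE (NaN also counts as NA)
is.nan(v)       # FALSE FALSE  TRUE FALSE FALSE
is.infinite(v)  # FALSE FALSE FALSE  TRUE  TRUE
is.finite(v)    #  TRUE FALSE FALSE FALSE FALSE

# Hypothetical data frame: a row is "complete" only if no value in it is missing
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
cc <- complete.cases(df)
cc              # TRUE FALSE FALSE
df[cc, ]        # keep only the complete rows
```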
Missing Data with Logicals
x = c(0, NA, 2, 3, 4)
x > 2
x != NA
[1] NA NA NA NA NA
(x == 0 | x == 2) # has NA
(x == 0 | x == 2) & !is.na(x) # No NA
x + 2
[1] 2 NA 4 5 6
x * 2
[1] 0 NA 4 6 8
Creating One-way Tables
Here we will use table to make tabulations of the data. Look at
?table to see options for missing data.
table(x)
x
0 2 3 4
1 1 1 1
table(x, useNA = "ifany")
x
0 2 3 4 <NA>
1 1 1 1 1
Creating One-way Tables
0 1 2 3 <NA>
1 1 4 4 0
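The useNA argument of table controls whether missing values get their own column. The sketch below uses a made-up vector `v` (an assumption for illustration) to compare the default with useNA = "ifany" and useNA = "always".

```r
# Hypothetical vector with one missing value
v <- c(0, NA, 2, 3, 4)

t_default <- table(v)                   # NA is silently dropped
t_ifany   <- table(v, useNA = "ifany")  # NA column appears only if NAs exist
# "always" adds an NA column even when the count is 0
t_always  <- table(c(0, 1, 2), useNA = "always")

t_default
t_ifany
t_always
```

Comparing sum(t_default) with length(v) is a quick way to notice that values were dropped.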
Creating Two-way Tables
margin.table(tab, 2)
0 1 2 3 4 <NA>
1 1 2 4 2 0
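The definition of tab is not shown here. As a sketch, the made-up vectors `rows` and `cols` below (assumptions, chosen so that the column totals agree with the output above) show how a two-way table is built and how margin.table sums over rows or columns.

```r
# Hypothetical paired vectors; table() cross-tabulates them
rows <- c(0, 1, 2, 3, 2, 3, 3, 2, 2, 3)
cols <- c(0, 1, 2, 3, 2, 3, 3, 4, 4, 3)
tab2 <- table(rows, cols, useNA = "always")

tab2
margin.table(tab2, 1)  # row totals:    1 1 4 4 0
margin.table(tab2, 2)  # column totals: 1 1 2 4 2 0
```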
Proportion Tables
prop.table finds the marginal proportions of the table. Think of it
as dividing the table by its respective marginal totals. If margin is
not set, it divides by the overall total.
prop.table(tab)
         0   1   2   3   4 <NA>
0      0.1 0.0 0.0 0.0 0.0  0.0
1      0.0 0.1 0.0 0.0 0.0  0.0
2      0.0 0.0 0.2 0.0 0.2  0.0
3      0.0 0.0 0.0 0.4 0.0  0.0
<NA>   0.0 0.0 0.0 0.0 0.0  0.0
prop.table(tab,1)
0 1 2 3 4 <NA>
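A short sketch of the margin argument, using made-up vectors `rows` and `cols` (assumptions for illustration): margin = 1 makes each row sum to 1, margin = 2 makes each column sum to 1, and no margin divides by the grand total. The NA level is left out here because a row or column with a total of 0 would give NaN proportions.

```r
# Hypothetical paired vectors for a small two-way table
rows <- c(0, 1, 2, 3, 2, 3, 3, 2, 2, 3)
cols <- c(0, 1, 2, 3, 2, 3, 3, 4, 4, 3)
tab2 <- table(rows, cols)

p_all <- prop.table(tab2)     # all cells together sum to 1
p_row <- prop.table(tab2, 1)  # each row sums to 1
p_col <- prop.table(tab2, 2)  # each column sums to 1

rowSums(p_row)
colSums(p_col)
```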
Download Salary FY2014 Data
From https://data.baltimorecity.gov/City-Government/Baltimore-City-Employee-Salaries-FY2014/2j28-xzd7
http://www.aejaffe.com/winterR_2016/data/Baltimore_City_Employee_Salaries_FY2014.csv
Read the CSV into R as Sal:
Sal = read.csv("http://www.aejaffe.com/winterR_2016/data/Ba
as.is = TRUE)
Checking for logical conditions
- any() checks if there are any TRUEs
- all() checks if ALL are TRUE
head(Sal,2)
[1] FALSE
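As a small sketch of any() and all() combined with missing-data checks, the example below uses a made-up vector `v` (an assumption, not a column of Sal). It also shows how an NA in the input can make the result NA unless na.rm = TRUE is set.

```r
# Hypothetical vector with one missing value
v <- c(0, NA, 2, 3, 4)

any(is.na(v))              # TRUE: at least one value is missing
all(is.na(v))              # FALSE: not every value is missing
any(v > 3)                 # TRUE: one TRUE is enough, the NA does not matter
all(v >= 0)                # NA: the missing value might be negative
all(v >= 0, na.rm = TRUE)  # TRUE once the NA is ignored
```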
Example of Recoding: base R
data$gender[data$gender %in%
c("Male", "M", "m")] <- "Male"
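The same pattern can be tried end to end on a small made-up data frame. Here `dat` and its labels are assumptions for illustration. Because %in% never returns NA, rows with a missing gender are simply left alone.

```r
# Hypothetical data frame with inconsistent gender labels
dat <- data.frame(gender = c("Male", "M", "m", "F", "Female", NA),
                  stringsAsFactors = FALSE)

# Recode every variant to one canonical label
dat$gender[dat$gender %in% c("Male", "M", "m")] <- "Male"
dat$gender[dat$gender %in% c("Female", "F", "f")] <- "Female"

# Check the result, keeping the NA visible
table(dat$gender, useNA = "ifany")
```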
Example of Recoding with recode: car package
library(plyr)
-----------------------------------------------------------
You have loaded plyr after dplyr - this is likely to cause
If you need functions from both plyr and dplyr, please load
library(plyr); library(dplyr)
-----------------------------------------------------------
count
table(gender)
gender
F FeMAle FEMALE Fm M Ma mAle Male M
75 82 74 89 89 79 87 89
Man Woman
73 80
Pasting strings with paste and paste0
Paste can be very useful for joining vectors together:
paste(1:5)
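A few more calls sketch the difference between paste and paste0 and between the sep and collapse arguments. The "Visit" label is a made-up example string.

```r
paste(1:5)                      # "1" "2" "3" "4" "5" (numbers become strings)
paste(1:5, collapse = " ")      # "1 2 3 4 5" (one string)
paste("Visit", 1:3, sep = "_")  # "Visit_1" "Visit_2" "Visit_3"
paste0("Visit", 1:3)            # "Visit1" "Visit2" "Visit3" (no separator)
```

sep goes between the arguments of each element, while collapse joins the resulting elements into a single string.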
- R can do much more than find exact matches for a whole string
- Like Perl and other languages, it can use regular expressions.
- What are regular expressions?
  - Ways to search for specific strings
  - Can be very complicated or simple
  - Highly useful: think “Find” on steroids
A bit on Regular Expressions
- http://www.regular-expressions.info/reference.html
- They can be used to match a large number of strings in one
  statement
- . matches any single character
- * means repeat the last character zero or more times
- ? makes the last thing optional
- ^ matches the start of a string: ^a means starts with “a”
- $ matches the end of a string: b$ means ends with “b”
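The metacharacters above can be tried out with grepl. This is a minimal sketch on a made-up vector `words` (an assumption for illustration).

```r
# Hypothetical strings to test patterns against
words <- c("apple", "banana", "cab", "b", "color", "colour")

grepl("^a", words)       # starts with "a":          only "apple"
grepl("b$", words)       # ends with "b":            "cab" and "b"
grepl("colou?r", words)  # the "u" is optional:      "color" and "colour"
grepl("^b.*a$", words)   # "b", anything, then "a":  only "banana"
grepl("^.$", words)      # exactly one character:    only "b"
```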
Substringing
Very similar:
Base R
stringr
[[1]]
[1] "I" "really"
[[2]]
[1] "like" "writing"
[[3]]
[1] "R" "code" "programs"
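The call that produced this list is not shown. A sketch that gives the same structure, assuming `x` is the three-element character vector implied by the output, uses base R strsplit:

```r
# Assumed input: three strings made of space-separated words
x <- c("I really", "like writing", "R code programs")

y <- strsplit(x, split = " ")  # returns a list, one element per input string
y
lengths(y)                     # 2 2 3 words per string
sapply(y, `[`, 1)              # first word of each: "I" "like" "R"
```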
Splitting String: stringr
stringr::str_split does the same thing:
library(stringr)
y2 <- str_split(x, " ") # returns a list
y2
[[1]]
[1] "I" "really"
[[2]]
[1] "like" "writing"
[[3]]
[1] "R" "code" "programs"
Using a fixed expression
str_split("I.like.strings", ".")
[[1]]
[1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
str_split("I.like.strings", fixed("."))
[[1]]
[1] "I" "like" "strings"
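The same pitfall exists in base R, because "." as a regular expression matches any character. As a sketch, strsplit can treat the pattern literally with fixed = TRUE, or the dot can be escaped.

```r
s <- "I.like.strings"

strsplit(s, ".")                # regex ".": every character is a separator
strsplit(s, ".", fixed = TRUE)  # literal ".": "I" "like" "strings"
strsplit(s, "\\.")              # escaped ".": same result as fixed = TRUE
```

The number of empty strings returned by the first call can differ between base R and stringr, but in both cases no real text survives the split.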
Let’s extract from y
suppressPackageStartupMessages(library(dplyr)) # must be lo
y[[2]]
?modifiers
Base R:
stringr
grep("Rawlings",Sal$Name)
which(grepl("Rawlings", Sal$Name))
which(str_detect(Sal$Name, "Rawlings"))
head(grepl("Rawlings",Sal$Name))
head(str_detect(Sal$Name, "Rawlings"))
Sal[grep("Rawlings",Sal$Name),]
ss = str_extract(Sal$Name, "Rawling")
head(ss)
[1] NA NA NA NA NA NA
ss[ !is.na(ss)]
head(str_extract(Sal$AgencyID, "\\d"))
head(str_extract_all(Sal$AgencyID, "\\d"), 2)
[[1]]
[1] "0" "2" "2" "0" "0"
[[2]]
[1] "0" "3" "0" "3" "1"
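To see how these functions relate without loading Sal, here is a base R sketch on made-up data. The vectors `nms` and `ids` are assumptions for illustration. regmatches is used as a rough base R counterpart to str_extract and str_extract_all; unlike str_extract, it drops non-matching elements instead of returning NA.

```r
# Hypothetical names in "Last,First" form
nms <- c("Rawlings,Ann", "Smith,John A", "Jones,Rawlings")

idx <- grep("Rawlings", nms)         # positions of matches: 1 3
lgl <- grepl("Rawlings", nms)        # one logical per element: TRUE FALSE TRUE
identical(idx, which(lgl))           # TRUE: grep() is which(grepl())
grep("Rawlings", nms, value = TRUE)  # the matching strings themselves

# Hypothetical agency IDs
ids <- c("A02200", "A03031")
regmatches(ids, regexpr("[0-9]", ids))   # first digit of each: "0" "0"
regmatches(ids, gregexpr("[0-9]", ids))  # all digits, returned as a list
```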
Using Regular Expressions
- Look for any name that starts with:
  - Payne at the beginning,
  - Leonard and then an S
  - Spence then capital C
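One possible reading of these three searches, sketched on a made-up vector `nms` (the names and the exact patterns are assumptions for illustration, not the course's own solution):

```r
# Hypothetical names in "Last,First" form
nms <- c("Payne,James R", "Smith,Payne", "Leonard,Shantel",
         "Leonard Jr,Sam", "Spencer,Clarence W", "Spencer,Michael")

# "Payne" anchored at the start of the string
grep("^Payne", nms, value = TRUE)      # "Payne,James R"

# "Leonard", at most one character (such as the comma), then "S"
grep("Leonard.?S", nms, value = TRUE)  # "Leonard,Shantel"

# "Spence", any run of characters, then a capital "C"
grep("Spence.*C", nms, value = TRUE)   # "Spencer,Clarence W"
```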
class(Sal$AnnualSalary)
[1] "character"
[1] 1 3 2
Replace
So we must change the annual pay into a numeric:
head(Sal$AnnualSalary, 4)
head(as.numeric(Sal$AnnualSalary), 4)
[1] NA NA NA NA
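The NA values arise because a leading "$" makes the strings unparseable as numbers. A base R sketch of the fix, using a made-up vector `sal` (an assumption standing in for Sal$AnnualSalary), removes the dollar sign before converting:

```r
# Hypothetical salary strings with a leading dollar sign
sal <- c("$55314.00", "$74000.00", "$64500.00")

# Direct conversion fails: every value becomes NA (with a warning)
suppressWarnings(as.numeric(sal))

# fixed = TRUE treats "$" as a literal character, not the regex end anchor
clean <- as.numeric(gsub("$", "", sal, fixed = TRUE))
clean  # 55314 74000 64500
```

Without fixed = TRUE, the pattern would need to be escaped as "\\$".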
dplyr_sal = Sal
dplyr_sal = dplyr_sal %>% mutate(
    AnnualSalary = AnnualSalary %>%
      str_replace(fixed("$"), "") %>%
      as.numeric) %>%
  arrange(desc(AnnualSalary))
check_Sal = Sal
rownames(check_Sal) = NULL
all.equal(check_Sal, dplyr_sal)
[1] TRUE