Preprocessing - Preprocessing Your Data With R
mean(example)
## [1] NA
example
## [1]  2  1  6  7 NA  4
is.na(example)
## [1] FALSE FALSE FALSE FALSE  TRUE FALSE
is.na(example2)
## [1] FALSE FALSE FALSE FALSE  TRUE FALSE
mean(example2)
## [1] NA
length(example2)
## [1] 6
mean(example2,na.rm=TRUE)
## [1] 4
median(example2,na.rm=TRUE)
## [1] 4
Replacing just the missing values
example3 <- example2
example3
## [1] 2 1 6 7 NA 4
example3[is.na(example3)] <- 0
example3
## [1] 2 1 6 7 0 4
length(example3)
## [1] 6
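Replacing NA with 0 changes the summary statistics (the mean drops from 4 to about 3.33). A common alternative, sketched below, is to fill in the median of the observed values instead:

```r
example4 <- example2                      # example2 is c(2, 1, 6, 7, NA, 4)
example4[is.na(example4)] <- median(example4, na.rm = TRUE)
example4
## [1] 2 1 6 7 4 4
mean(example4)
## [1] 4
```

With the median filled in, the mean of the completed vector matches the na.rm mean computed earlier.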
Missing values can behave strangely
NA == NA
## [1] NA
NA+8
## [1] NA
NA^0
## [1] 1
1/NA
## [1] NA
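A practical consequence: because NA == NA is itself NA, an equality comparison can never locate missing entries. This is why is.na() is the right tool:

```r
x <- c(2, 1, 6, 7, NA, 4)
x == NA            # comparison with NA is always NA -- useless for searching
## [1] NA NA NA NA NA NA
which(is.na(x))    # position of the missing entry
## [1] 5
```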
Other strange values…
1/0
## [1] Inf
1/0-1/0
## [1] NaN
Data imputation
library("mice")
data(mammalsleep)
?mammalsleep
dim(mammalsleep)
## [1] 62 11
nic(mammalsleep)
## [1] 20
md.pattern(mammalsleep)
?mice
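A minimal imputation sketch with mice (the number of imputations m = 5, the seed, and dropping the species column are illustrative choices, not taken from the session):

```r
library(mice)
data(mammalsleep)
# species is an identifier, not a predictor, so leave it out of the imputation
imp <- mice(mammalsleep[, -1], m = 5, seed = 1, printFlag = FALSE)
completed <- complete(imp)  # extract the first completed data set
sum(is.na(completed))       # no missing values should remain
```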
Outlier detection
summary(mammalsleep)
## species bw brw
## African elephant : 1 Min. : 0.005 Min. : 0.14
## African giant pouched rat: 1 1st Qu.: 0.600 1st Qu.: 4.25
## Arctic Fox : 1 Median : 3.342 Median : 17.25
## Arctic ground squirrel : 1 Mean : 198.790 Mean : 283.13
## Asian elephant : 1 3rd Qu.: 48.203 3rd Qu.: 166.00
## Baboon : 1 Max. :6654.000 Max. :5712.00
## (Other) :56
## sws ps ts mls
## Min. : 2.100 Min. :0.000 Min. : 2.60 Min. : 2.000
## 1st Qu.: 6.250 1st Qu.:0.900 1st Qu.: 8.05 1st Qu.: 6.625
## Median : 8.350 Median :1.800 Median :10.45 Median : 15.100
## Mean : 8.673 Mean :1.972 Mean :10.53 Mean : 19.878
## 3rd Qu.:11.000 3rd Qu.:2.550 3rd Qu.:13.20 3rd Qu.: 27.750
## Max. :17.900 Max. :6.600 Max. :19.90 Max. :100.000
## NA's :14 NA's :12 NA's :4 NA's :4
## gt pi sei odi
## Min. : 12.00 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.: 35.75 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000
## Median : 79.00 Median :3.000 Median :2.000 Median :2.000
## Mean :142.35 Mean :2.871 Mean :2.419 Mean :2.613
## 3rd Qu.:207.50 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :645.00 Max. :5.000 Max. :5.000 Max. :5.000
## NA's :4
which.max(mammalsleep$bw)
## [1] 1
mammalsleep[which.max(mammalsleep$bw),]
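which.max() only finds the single largest value. A simple, commonly used numeric screen is the boxplot's 1.5 × IQR rule; here is a sketch applied to body weight (bw):

```r
bw <- mammalsleep$bw
q <- quantile(bw, c(0.25, 0.75))
upper_fence <- q[2] + 1.5 * (q[2] - q[1])
mammalsleep$species[bw > upper_fence]  # species flagged as high outliers
```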
Document them, find out why they occurred, and only then remove them.
Make the data easier to look at interactively
View(pressure)
View(iris)
Grouping Data
load('births.RData')
head(birthn)
### Subsetting
Sat <- birthn[birthn$day_of_week == 6, ]
Sat[1:5, ]
library(dplyr)
##
## Attaching package: 'dplyr'
Another way of looking at data is to convert it into what is called a tibble (tbl).
Tibbles have the advantage of always printing compactly in the console.
tbl_df() gives similar information to the str() function we have been using.
tbl_df(Sat)
## # A tibble: 783 × 5
## year month date_of_month day_of_week births
## <int> <int> <int> <int> <int>
## 1 2000 1 1 6 9083
## 2 2000 1 8 6 8934
## 3 2000 1 15 6 8525
## 4 2000 1 22 6 8855
## 5 2000 1 29 6 8805
## 6 2000 2 5 6 8624
## 7 2000 2 12 6 8836
## 8 2000 2 19 6 8861
## 9 2000 2 26 6 9026
## 10 2000 3 4 6 9054
## # ... with 773 more rows
str(Sat)
Here is an example: the mean number of births for each day of the week, sorted in increasing order.
## # A tibble: 7 × 2
## day_of_week `mean(births)`
## <int> <dbl>
## 1 7 7518.377
## 2 6 8562.573
## 3 1 11897.830
## 4 5 12596.162
## 5 4 12845.826
## 6 3 12910.766
## 7 2 13122.444
str(SortedBirths)
birthn %>%
group_by(day_of_week) %>%
summarise(avg = mean(births)) %>%
arrange(avg)
## # A tibble: 7 × 2
## day_of_week avg
## <int> <dbl>
## 1 7 7518.377
## 2 6 8562.573
## 3 1 11897.830
## 4 5 12596.162
## 5 4 12845.826
## 6 3 12910.766
## 7 2 13122.444
#### More succinctly
birthn %>%
group_by(day_of_week) %>%
summarise(mean(births)) %>%
arrange()
## # A tibble: 7 × 2
## day_of_week `mean(births)`
## <int> <dbl>
## 1 1 11897.830
## 2 2 13122.444
## 3 3 12910.766
## 4 4 12845.826
## 5 5 12596.162
## 6 6 8562.573
## 7 7 7518.377
x %>% f(y) %>% g(z) %>% h(m) is just h(g(f(x, y), z), m): each %>% passes the result on its left as the first argument of the function on its right.
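For instance, the grouped summary above can be written either as nested calls or as a pipeline; both return the same tibble:

```r
library(dplyr)

# nested form: read inside-out
arrange(summarise(group_by(birthn, day_of_week), avg = mean(births)), avg)

# piped form: read top-to-bottom
birthn %>%
  group_by(day_of_week) %>%
  summarise(avg = mean(births)) %>%
  arrange(avg)
```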
birthn %>%
filter(day_of_week == 5) %>%
filter(date_of_month == 13) %>%
summarise(mean(births))
## mean(births)
## 1 11949.96
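The two filter() steps can also be combined in a single call, since filter() treats multiple conditions as an implicit AND:

```r
birthn %>%
  filter(day_of_week == 5, date_of_month == 13) %>%
  summarise(mean(births))
```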
birthn %>%
filter(day_of_week < 5) %>%
filter(date_of_month != 13) %>%
summarise(mean(births))
## mean(births)
## 1 12700.61
Bad Drivers Data
FiveThirtyEight article
head(drivers)
## State
## 1 Alabama
## 2 Alaska
## 3 Arizona
## 4 Arkansas
## 5 California
## 6 Colorado
## Number.of.drivers.involved.in.fatal.collisions.per.billion.miles
## 1 18.8
## 2 18.1
## 3 18.6
## 4 22.4
## 5 12.0
## 6 13.6
## Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Speeding
## 1 39
## 2 41
## 3 35
## 4 18
## 5 35
## 6 37
## Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Alcohol.Impaired
## 1 30
## 2 25
## 3 28
## 4 26
## 5 28
## 6 28
## Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Not.Distracted
## 1 96
## 2 90
## 3 84
## 4 94
## 5 91
## 6 79
## Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Had.Not.Been.Involved.In.Any.Previous.Accidents
## 1 80
## 2 94
## 3 96
## 4 95
## 5 89
## 6 95
## Car.Insurance.Premiums....
## 1 784.55
## 2 1053.48
## 3 899.47
## 4 827.34
## 5 878.41
## 6 835.50
## Losses.incurred.by.insurance.companies.for.collisions.per.insured.driver....
## 1 145.08
## 2 133.93
## 3 110.35
## 4 142.39
## 5 165.63
## 6 139.91
tbl_df(drivers)
## # A tibble: 51 × 8
## State
## <fctr>
## 1 Alabama
## 2 Alaska
## 3 Arizona
## 4 Arkansas
## 5 California
## 6 Colorado
## 7 Connecticut
## 8 Delaware
## 9 District of Columbia
## 10 Florida
## # ... with 41 more rows, and 7 more variables:
## # Number.of.drivers.involved.in.fatal.collisions.per.billion.miles <dbl>,
## # Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Speeding <int>,
## # Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Alcohol.Impaired <int>,
## # Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Not.Distracted <int>,
## # Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Had.Not.Been.Involved.In.Any.Previous.Accident
## # Car.Insurance.Premiums.... <dbl>,
## # Losses.incurred.by.insurance.companies.for.collisions.per.insured.driver.... <dbl>
glimpse(drivers)
## Observations: 51
## Variables: 8
## $ State <
## $ Number.of.drivers.involved.in.fatal.collisions.per.billion.miles <
## $ Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Speeding <
## $ Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Alcohol.Impaired <
## $ Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Not.Distracted <
## $ Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Had.Not.Been.Involved.In.Any.Previous.Accidents <
## $ Car.Insurance.Premiums.... <
## $ Losses.incurred.by.insurance.companies.for.collisions.per.insured.driver.... <
summary(drivers)
## State
## Alabama : 1
## Alaska : 1
## Arizona : 1
## Arkansas : 1
## California: 1
## Colorado : 1
## (Other) :45
## Number.of.drivers.involved.in.fatal.collisions.per.billion.miles
## Min. : 5.90
## 1st Qu.:12.75
## Median :15.60
## Mean :15.79
## 3rd Qu.:18.50
## Max. :23.90
##
## Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Speeding
## Min. :13.00
## 1st Qu.:23.00
## Median :34.00
## Mean :31.73
## 3rd Qu.:38.00
## Max. :54.00
##
## Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Alcohol.Impaired
## Min. :16.00
## 1st Qu.:28.00
## Median :30.00
## Mean :30.69
## 3rd Qu.:33.00
## Max. :44.00
##
## Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Not.Distracted
## Min. : 10.00
## 1st Qu.: 83.00
## Median : 88.00
## Mean : 85.92
## 3rd Qu.: 95.00
## Max. :100.00
##
## Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Had.Not.Been.Involved.In.Any.Previous.Accidents
## Min. : 76.00
## 1st Qu.: 83.50
## Median : 88.00
## Mean : 88.73
## 3rd Qu.: 95.00
## Max. :100.00
##
## Car.Insurance.Premiums....
## Min. : 642.0
## 1st Qu.: 768.4
## Median : 859.0
## Mean : 887.0
## 3rd Qu.:1007.9
## Max. :1301.5
##
## Losses.incurred.by.insurance.companies.for.collisions.per.insured.driver....
## Min. : 82.75
## 1st Qu.:114.64
## Median :136.05
## Mean :134.49
## 3rd Qu.:151.87
## Max. :194.78
##
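The colnames() assignment below renames all eight columns at once; dplyr's rename() is a pipe-friendly alternative for renaming just one or two columns (a sketch, shown for a single column):

```r
library(dplyr)
# rename(data, new_name = old_name); the remaining columns keep their names
drivers2 <- rename(drivers,
  NperB = Number.of.drivers.involved.in.fatal.collisions.per.billion.miles)
```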
colnames(drivers) <-
  c("State","NperB","PrcSpeed","PrcAlco","PrcNotDist","PrcNoPrev","Premium","Loss")
sort(drivers[,2])
## [1] 5.9 8.2 9.6 10.6 10.8 11.1 11.2 11.3 11.6 12.0 12.3 12.5 12.7 12.8
## [15] 12.8 13.6 13.6 13.8 14.1 14.1 14.5 14.7 14.9 15.1 15.3 15.6 15.7 16.1
## [29] 16.2 16.8 17.4 17.5 17.6 17.8 17.9 18.1 18.2 18.4 18.6 18.8 19.4 19.4
## [43] 19.5 19.9 20.5 21.4 21.4 22.4 23.8 23.9 23.9
drivers[1:10,1:3]
drivers[order(drivers[,2]),1:3]
arrange(drivers,NperB)
arrange(drivers,desc(PrcSpeed))
driversp <- mutate(drivers, prem_c = Loss/Premium)
select(arrange(driversp,prem_c),State,prem_c)
## State prem_c
## 1 Montana 0.1043236
## 2 District of Columbia 0.1067989
## 3 New York 0.1215335
## 4 Arizona 0.1226834
## 5 New Jersey 0.1228179
## 6 Florida 0.1242792
## 7 Washington 0.1254115
## 8 Alaska 0.1271310
## 9 Idaho 0.1289021
## 10 Rhode Island 0.1293136
## 11 Oregon 0.1299971
## 12 Delaware 0.1331259
## 13 Massachusetts 0.1341357
## 14 Nevada 0.1346869
## 15 Utah 0.1352640
## 16 South Carolina 0.1353831
## 17 Michigan 0.1370958
## 18 New Mexico 0.1388170
## 19 Hawaii 0.1404120
## 20 South Dakota 0.1447311
## 21 Maine 0.1459026
## 22 Louisiana 0.1519878
## 23 Vermont 0.1530438
## 24 Indiana 0.1533091
## 25 West Virginia 0.1536958
## 26 Wyoming 0.1542584
## 27 Texas 0.1560886
## 28 Connecticut 0.1562789
## 29 Georgia 0.1563818
## 30 Nebraska 0.1567979
## 31 Kentucky 0.1571673
## 32 Wisconsin 0.1590607
## 33 North Dakota 0.1593031
## 34 New Hampshire 0.1610229
## 35 Colorado 0.1674566
## 36 Pennsylvania 0.1698253
## 37 Kansas 0.1714396
## 38 Minnesota 0.1715819
## 39 Arkansas 0.1721058
## 40 Illinois 0.1732639
## 41 Mississippi 0.1738369
## 42 Iowa 0.1763627
## 43 North Carolina 0.1804755
## 44 Missouri 0.1827741
## 45 Maryland 0.1837373
## 46 Alabama 0.1849213
## 47 California 0.1885566
## 48 Ohio 0.1913634
## 49 Virginia 0.1999090
## 50 Tennessee 0.2025888
## 51 Oklahoma 0.2029018
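The mutate/arrange/select steps above can be chained into a single pipeline:

```r
drivers %>%
  mutate(prem_c = Loss / Premium) %>%
  arrange(prem_c) %>%
  select(State, prem_c)
```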
Document all the changes you make using a script.
The best way to make a report is to put everything into an .Rmd document and then knit it into an HTML file using the knitr package.
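A minimal .Rmd skeleton (the title and the data file name are illustrative, not from the session):

````
---
title: "Bad Drivers Report"
output: html_document
---

```{r}
drivers <- read.csv("bad-drivers.csv")  # hypothetical file name
summary(drivers)
```
````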
Summary of this Session:
Careful data preprocessing is necessary at the beginning of any data exploration
exercise.
Missing data may be imputed if there are only a few missing values in a column or row and if their occurrence pattern is random.
We saw how the dplyr package lets us chain a sequence of operations on data using the %>% operator.
Activity: Re-analyze the drivers data and make your own Rmd and html reports.