Preprocessing - Preprocessing Your Data With R
mean(example)
## [1] NA
example
## [1]  2  1  6  7 NA  4
is.na(example)
## [1] FALSE FALSE FALSE FALSE  TRUE FALSE
is.na(example2)
## [1] FALSE FALSE FALSE FALSE  TRUE FALSE
mean(example2)
## [1] NA
length(example2)
## [1] 6
mean(example2,na.rm=TRUE)
## [1] 4
median(example2,na.rm=TRUE)
## [1] 4
Replacing just the missing values
example3 <- example2
example3
## [1] 2 1 6 7 NA 4
example3[is.na(example3)] <- 0
example3
## [1] 2 1 6 7 0 4
length(example3)
## [1] 6
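Replacing NA with 0 changes the summary statistics (the mean drops from 4 to about 3.33). A common alternative, sketched below, is to fill in the median of the observed values instead:

```r
example4 <- example2                      # example2 is c(2, 1, 6, 7, NA, 4)
example4[is.na(example4)] <- median(example4, na.rm = TRUE)
example4
## [1] 2 1 6 7 4 4
mean(example4)
## [1] 4
```

With the median filled in, the mean of the completed vector matches the na.rm mean computed earlier.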
Missing values can behave strangely
NA == NA
## [1] NA
NA+8
## [1] NA
NA^0
## [1] 1
1/NA
## [1] NA
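A practical consequence: because NA == NA is itself NA, an equality comparison can never locate missing entries. This is why is.na() is the right tool:

```r
x <- c(2, 1, 6, 7, NA, 4)
x == NA            # comparison with NA is always NA -- useless for searching
## [1] NA NA NA NA NA NA
which(is.na(x))    # position of the missing entry
## [1] 5
```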
Other strange values…
1/0
## [1] Inf
1/0-1/0
## [1] NaN
Data imputation
library("mice")
data(mammalsleep)
?mammalsleep
dim(mammalsleep)
## [1] 62 11
nic(mammalsleep)
## [1] 20
md.pattern(mammalsleep)
?mice
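A minimal imputation sketch with mice (the number of imputations m = 5, the seed, and dropping the species column are illustrative choices, not taken from the session):

```r
library(mice)
data(mammalsleep)
# species is an identifier, not a predictor, so leave it out of the imputation
imp <- mice(mammalsleep[, -1], m = 5, seed = 1, printFlag = FALSE)
completed <- complete(imp)  # extract the first completed data set
sum(is.na(completed))       # no missing values should remain
```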
Outlier detection
summary(mammalsleep)
## species bw brw
## African elephant : 1 Min. : 0.005 Min. : 0.14
## African giant pouched rat: 1 1st Qu.: 0.600 1st Qu.: 4.25
## Arctic Fox : 1 Median : 3.342 Median : 17.25
## Arctic ground squirrel : 1 Mean : 198.790 Mean : 283.13
## Asian elephant : 1 3rd Qu.: 48.203 3rd Qu.: 166.00
## Baboon : 1 Max. :6654.000 Max. :5712.00
## (Other) :56
## sws ps ts mls
## Min. : 2.100 Min. :0.000 Min. : 2.60 Min. : 2.000
## 1st Qu.: 6.250 1st Qu.:0.900 1st Qu.: 8.05 1st Qu.: 6.625
## Median : 8.350 Median :1.800 Median :10.45 Median : 15.100
## Mean : 8.673 Mean :1.972 Mean :10.53 Mean : 19.878
## 3rd Qu.:11.000 3rd Qu.:2.550 3rd Qu.:13.20 3rd Qu.: 27.750
## Max. :17.900 Max. :6.600 Max. :19.90 Max. :100.000
## NA's :14 NA's :12 NA's :4 NA's :4
## gt pi sei odi
## Min. : 12.00 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.: 35.75 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000
## Median : 79.00 Median :3.000 Median :2.000 Median :2.000
## Mean :142.35 Mean :2.871 Mean :2.419 Mean :2.613
## 3rd Qu.:207.50 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :645.00 Max. :5.000 Max. :5.000 Max. :5.000
## NA's :4
which.max(mammalsleep$bw)
## [1] 1
mammalsleep[which.max(mammalsleep$bw),]
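which.max() only finds the single largest value. A simple, commonly used numeric screen is the boxplot's 1.5 × IQR rule; here is a sketch applied to body weight (bw):

```r
bw <- mammalsleep$bw
q <- quantile(bw, c(0.25, 0.75))
upper_fence <- q[2] + 1.5 * (q[2] - q[1])
mammalsleep$species[bw > upper_fence]  # species flagged as high outliers
```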
Document them, find out why they occurred, and only then remove them.
Make the data easier to look at interactively
View(pressure)
View(iris)
Grouping Data
load('births.RData')
head(birthn)
### Subsetting
Sat <- birthn[birthn$day_of_week == 6, ]
Sat[1:5, ]
library(dplyr)
##
## Attaching package: 'dplyr'
Another way of looking at data is to convert it into what is called a tibble (tbl).
Tibbles have the advantage of always printing compactly in the console.
tbl_df() gives similar information to the str() function we have been using.
tbl_df(Sat)
## # A tibble: 783 × 5
## year month date_of_month day_of_week births
## <int> <int> <int> <int> <int>
## 1 2000 1 1 6 9083
## 2 2000 1 8 6 8934
## 3 2000 1 15 6 8525
## 4 2000 1 22 6 8855
## 5 2000 1 29 6 8805
## 6 2000 2 5 6 8624
## 7 2000 2 12 6 8836
## 8 2000 2 19 6 8861
## 9 2000 2 26 6 9026
## 10 2000 3 4 6 9054
## # ... with 773 more rows
str(Sat)
Here is an example: the mean number of births for each day of the week, sorted in increasing order.
## # A tibble: 7 × 2
## day_of_week `mean(births)`
## <int> <dbl>
## 1 7 7518.377
## 2 6 8562.573
## 3 1 11897.830
## 4 5 12596.162
## 5 4 12845.826
## 6 3 12910.766
## 7 2 13122.444
str(SortedBirths)
birthn %>%
group_by(day_of_week) %>%
summarise(avg = mean(births)) %>%
arrange(avg)
## # A tibble: 7 × 2
## day_of_week avg
## <int> <dbl>
## 1 7 7518.377
## 2 6 8562.573
## 3 1 11897.830
## 4 5 12596.162
## 5 4 12845.826
## 6 3 12910.766
## 7 2 13122.444
#### More succinctly
birthn %>%
group_by(day_of_week) %>%
summarise(mean(births)) %>%
arrange()
## # A tibble: 7 × 2
## day_of_week `mean(births)`
## <int> <dbl>
## 1 1 11897.830
## 2 2 13122.444
## 3 3 12910.766
## 4 4 12845.826
## 5 5 12596.162
## 6 6 8562.573
## 7 7 7518.377
x %>% f(y) %>% g(z) %>% h(m) is just h(g(f(x, y), z), m): each %>% passes the result on its left as the first argument of the function on its right.
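For instance, the grouped summary above can be written either as nested calls or as a pipeline; both return the same tibble:

```r
library(dplyr)

# nested form: read inside-out
arrange(summarise(group_by(birthn, day_of_week), avg = mean(births)), avg)

# piped form: read top-to-bottom
birthn %>%
  group_by(day_of_week) %>%
  summarise(avg = mean(births)) %>%
  arrange(avg)
```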
birthn %>%
filter(day_of_week == 5) %>%
filter(date_of_month == 13) %>%
summarise(mean(births))
## mean(births)
## 1 11949.96
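The two filter() steps can also be combined in a single call, since filter() treats multiple conditions as an implicit AND:

```r
birthn %>%
  filter(day_of_week == 5, date_of_month == 13) %>%
  summarise(mean(births))
```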
birthn %>%
filter(day_of_week < 5) %>%
filter(date_of_month != 13) %>%
summarise(mean(births))
## mean(births)
## 1 12700.61
Bad Drivers Data
FiveThirtyEight article
head(drivers)
## State
## 1 Alabama
## 2 Alaska
## 3 Arizona
## 4 Arkansas
## 5 California
## 6 Colorado
## Number.of.drivers.involved.in.fatal.collisions.per.billion.miles
## 1 18.8
## 2 18.1
## 3 18.6
## 4 22.4
## 5 12.0
## 6 13.6
## Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Speeding
## 1 39
## 2 41
## 3 35
## 4 18
## 5 35
## 6 37
## Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Alcohol.Impaired
## 1 30
## 2 25
## 3 28
## 4 26
## 5 28
## 6 28
## Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Not.Distracted
## 1 96
## 2 90
## 3 84
## 4 94
## 5 91
## 6 79
## Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Had.Not.Been.Involved.In.Any.Previous.Accidents
## 1 80
## 2 94
## 3 96
## 4 95
## 5 89
## 6 95
## Car.Insurance.Premiums....
## 1 784.55
## 2 1053.48
## 3 899.47
## 4 827.34
## 5 878.41
## 6 835.50
## Losses.incurred.by.insurance.companies.for.collisions.per.insured.driver....
## 1 145.08
## 2 133.93
## 3 110.35
## 4 142.39
## 5 165.63
## 6 139.91
tbl_df(drivers)
## # A tibble: 51 × 8
## State
## <fctr>
## 1 Alabama
## 2 Alaska
## 3 Arizona
## 4 Arkansas
## 5 California
## 6 Colorado
## 7 Connecticut
## 8 Delaware
## 9 District of Columbia
## 10 Florida
## # ... with 41 more rows, and 7 more variables:
## # Number.of.drivers.involved.in.fatal.collisions.per.billion.miles <dbl>,
## # Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Speeding <int>,
## # Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Alcohol.Impaired <int>,
## # Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Not.Distracted <int>,
## # Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Had.Not.Been.Involved.In.Any.Previous.Accident
## # Car.Insurance.Premiums.... <dbl>,
## # Losses.incurred.by.insurance.companies.for.collisions.per.insured.driver.... <dbl>
glimpse(drivers)
## Observations: 51
## Variables: 8
## $ State <
## $ Number.of.drivers.involved.in.fatal.collisions.per.billion.miles <
## $ Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Speeding <
## $ Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Alcohol.Impaired <
## $ Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Not.Distracted <
## $ Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Had.Not.Been.Involved.In.Any.Previous.Accidents <
## $ Car.Insurance.Premiums.... <
## $ Losses.incurred.by.insurance.companies.for.collisions.per.insured.driver.... <
summary(drivers)
## State
## Alabama : 1
## Alaska : 1
## Arizona : 1
## Arkansas : 1
## California: 1
## Colorado : 1
## (Other) :45
## Number.of.drivers.involved.in.fatal.collisions.per.billion.miles
## Min. : 5.90
## 1st Qu.:12.75
## Median :15.60
## Mean :15.79
## 3rd Qu.:18.50
## Max. :23.90
##
## Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Speeding
## Min. :13.00
## 1st Qu.:23.00
## Median :34.00
## Mean :31.73
## 3rd Qu.:38.00
## Max. :54.00
##
## Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Alcohol.Impaired
## Min. :16.00
## 1st Qu.:28.00
## Median :30.00
## Mean :30.69
## 3rd Qu.:33.00
## Max. :44.00
##
## Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Not.Distracted
## Min. : 10.00
## 1st Qu.: 83.00
## Median : 88.00
## Mean : 85.92
## 3rd Qu.: 95.00
## Max. :100.00
##
## Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Had.Not.Been.Involved.In.Any.Previous.Accidents
## Min. : 76.00
## 1st Qu.: 83.50
## Median : 88.00
## Mean : 88.73
## 3rd Qu.: 95.00
## Max. :100.00
##
## Car.Insurance.Premiums....
## Min. : 642.0
## 1st Qu.: 768.4
## Median : 859.0
## Mean : 887.0
## 3rd Qu.:1007.9
## Max. :1301.5
##
## Losses.incurred.by.insurance.companies.for.collisions.per.insured.driver....
## Min. : 82.75
## 1st Qu.:114.64
## Median :136.05
## Mean :134.49
## 3rd Qu.:151.87
## Max. :194.78
##
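The colnames() assignment below renames all eight columns at once; dplyr's rename() is a pipe-friendly alternative for renaming just one or two columns (a sketch, shown for a single column):

```r
library(dplyr)
# rename(data, new_name = old_name); the remaining columns keep their names
drivers2 <- rename(drivers,
  NperB = Number.of.drivers.involved.in.fatal.collisions.per.billion.miles)
```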
colnames(drivers) <-
  c("State","NperB","PrcSpeed","PrcAlco","PrcNotDist","PrcNoPrev","Premium","Loss")
sort(drivers[,2])
## [1] 5.9 8.2 9.6 10.6 10.8 11.1 11.2 11.3 11.6 12.0 12.3 12.5 12.7 12.8
## [15] 12.8 13.6 13.6 13.8 14.1 14.1 14.5 14.7 14.9 15.1 15.3 15.6 15.7 16.1
## [29] 16.2 16.8 17.4 17.5 17.6 17.8 17.9 18.1 18.2 18.4 18.6 18.8 19.4 19.4
## [43] 19.5 19.9 20.5 21.4 21.4 22.4 23.8 23.9 23.9
drivers[1:10,1:3]
drivers[order(drivers[,2]),1:3]
arrange(drivers,NperB)
arrange(drivers,desc(PrcSpeed))
driversp <- mutate(drivers, prem_c = Loss/Premium)
select(arrange(driversp,prem_c),State,prem_c)
## State prem_c
## 1 Montana 0.1043236
## 2 District of Columbia 0.1067989
## 3 New York 0.1215335
## 4 Arizona 0.1226834
## 5 New Jersey 0.1228179
## 6 Florida 0.1242792
## 7 Washington 0.1254115
## 8 Alaska 0.1271310
## 9 Idaho 0.1289021
## 10 Rhode Island 0.1293136
## 11 Oregon 0.1299971
## 12 Delaware 0.1331259
## 13 Massachusetts 0.1341357
## 14 Nevada 0.1346869
## 15 Utah 0.1352640
## 16 South Carolina 0.1353831
## 17 Michigan 0.1370958
## 18 New Mexico 0.1388170
## 19 Hawaii 0.1404120
## 20 South Dakota 0.1447311
## 21 Maine 0.1459026
## 22 Louisiana 0.1519878
## 23 Vermont 0.1530438
## 24 Indiana 0.1533091
## 25 West Virginia 0.1536958
## 26 Wyoming 0.1542584
## 27 Texas 0.1560886
## 28 Connecticut 0.1562789
## 29 Georgia 0.1563818
## 30 Nebraska 0.1567979
## 31 Kentucky 0.1571673
## 32 Wisconsin 0.1590607
## 33 North Dakota 0.1593031
## 34 New Hampshire 0.1610229
## 35 Colorado 0.1674566
## 36 Pennsylvania 0.1698253
## 37 Kansas 0.1714396
## 38 Minnesota 0.1715819
## 39 Arkansas 0.1721058
## 40 Illinois 0.1732639
## 41 Mississippi 0.1738369
## 42 Iowa 0.1763627
## 43 North Carolina 0.1804755
## 44 Missouri 0.1827741
## 45 Maryland 0.1837373
## 46 Alabama 0.1849213
## 47 California 0.1885566
## 48 Ohio 0.1913634
## 49 Virginia 0.1999090
## 50 Tennessee 0.2025888
## 51 Oklahoma 0.2029018
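The mutate/arrange/select steps above can be chained into a single pipeline:

```r
drivers %>%
  mutate(prem_c = Loss / Premium) %>%
  arrange(prem_c) %>%
  select(State, prem_c)
```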
Document all the changes you make using a script.
The best way to make a report is to put everything into an .Rmd document and then knit it into an HTML file using the knitr package.
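A minimal .Rmd skeleton (the title and the data file name are illustrative, not from the session):

````
---
title: "Bad Drivers Report"
output: html_document
---

```{r}
drivers <- read.csv("bad-drivers.csv")  # hypothetical file name
summary(drivers)
```
````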
Summary of this Session:
Careful data preprocessing is necessary at the beginning of any data exploration
exercise.
Missing data may be imputed if there are only a few missing values in a column or row and if their occurrence pattern is random.
We saw how the dplyr package lets us chain a sequence of operations on data using the %>% operator.
Activity: Re-analyze the drivers data and make your own Rmd and html reports.