Data Preprocessing
Data Mining and Predictive Analytics, Second Edition, by Daniel Larose and Chantal Larose, John Wiley and Sons, Inc., 2015.
Outline:
– Perform transformations on certain variables (data preparation work that precedes the Modeling and Evaluation phases)
Why Do We Preprocess Data?
• Raw data may contain:
– Redundant fields
– Missing values
– Outliers
– Data in a form not suitable for data mining
– Erroneous values
Why Do We Preprocess Data? (cont’d)
• For data mining purposes, database values must undergo data cleaning and data transformation
• Data often come from legacy databases where values have:
– Not been looked at in years
– Expired
– Become no longer relevant
– Gone missing
• Minimize GIGO (Garbage In, Garbage Out)
– If the garbage going in is minimized, then the garbage in the results is minimized
• Data preparation accounts for roughly 60% of the effort in the data mining process (Pyle)
Data Cleaning
• The slides step through a table of example data errors (table not reproduced here)
Handling Missing Data
Handling Missing Data (cont’d)
• Examine the cars_preprocessing dataset, containing records for 261 automobiles manufactured in the 1970s and 1980s
cars = read.csv("cars_preprocessing.csv", stringsAsFactors = FALSE, na.strings = "")
# Create a working copy named cars.cleaned
cars.cleaned = cars
head(cars.cleaned)
Handling Missing Data (cont’d)
• Delete Records Containing Missing Values?
– Not necessarily the best approach
– The pattern of missing values may be systematic
– Deleting such records then creates a biased subset
– Valuable information in the other fields is lost
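For reference, listwise deletion is a one-liner in base R; this sketch (assuming the cars.cleaned data frame from the previous slides) shows how to measure the cost of deletion before deciding against it:

```r
# Listwise (complete-case) deletion: na.omit() drops every record
# containing at least one missing value (assumes cars.cleaned exists)
cars.complete = na.omit(cars.cleaned)
# How many records would be discarded
nrow(cars.cleaned) - nrow(cars.complete)
```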
Handling Missing Data (cont’d)
• (1) Replace Missing Values with a User-Defined Constant
– Missing numeric values replaced with 0.0; missing categorical values replaced with a label such as "Missing"
cars.cleaned$hp[c(4,5)] = 0                  # set specific records by index, or ...
cars.cleaned$hp[is.na(cars.cleaned$hp)] = 0  # ... replace every missing value at once
cars.cleaned$brand[is.na(cars.cleaned$brand)] = "Missing"
Handling Missing Data (cont’d)
• (2) Replace Missing Values with the Mode or Mean
– The mode of the categorical field cylinders is 4
– Missing values of cylinders are replaced with this value; for numeric fields, the field mean may be used instead
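A minimal sketch of mode replacement on a toy vector (values invented for illustration, not the actual cars data):

```r
# Toy field with missing values; the mode of the observed values is 4
cylinders = c(4, 4, 6, 8, 4, NA, 6, NA)
freq = table(cylinders)                                # tabulates non-missing values
mode.value = as.numeric(names(freq)[which.max(freq)])  # most frequent value: 4
cylinders[is.na(cylinders)] = mode.value               # fill every NA with the mode
cylinders
```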
Handling Missing Data (cont’d)
• (3) Replace Missing Values with Random Values
– Values are drawn at random from the variable's observed distribution
– Superior to mean substitution: measures of location and spread remain closer to those of the original data
# Draw one random replacement per missing value from the observed values
n.missing = sum(is.na(cars.cleaned$cubicinches))
obs_cubicinches = sample(na.omit(cars.cleaned$cubicinches), n.missing, replace = TRUE)
cars.cleaned$cubicinches[is.na(cars.cleaned$cubicinches)] = obs_cubicinches
Identifying Misclassifications
• A frequency table of a categorical field exposes misclassified or inconsistently coded labels
table(cars.cleaned$brand)
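As a hypothetical illustration (these labels are invented, not the actual cars values), a frequency table exposes a misclassified level, which can then be recoded:

```r
# Hypothetical brand labels where "USA" is a misclassified variant of "US"
brand = c("US", "US", "Europe", "Japan", "USA", "Japan")
table(brand)                  # the rogue "USA" level stands out with a count of 1
brand[brand == "USA"] = "US"  # recode to the intended label
table(brand)                  # only the three valid levels remain
```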
Graphical Methods for Identifying Outliers
# Create a histogram of vehicle weights
hist(cars$weightlbs, col = "blue", border = "black", xlab = "Weight",
     ylab = "Counts", main = "Histogram of Car Weights")
Graphical Methods for Identifying Outliers (cont’d)
– Two-dimensional scatter plots help identify outliers between pairs of variables
– A scatter plot of mpg against weightlbs shows two possible outliers
– Most data points cluster together along the x-axis
– However, one car weighs only 192.5 pounds and another gets over 500 miles per gallon; both values are suspect
# Create a scatterplot of MPG against weight
plot(cars$weightlbs, cars$mpg, xlim = c(0, 5000), ylim = c(0, 600),
     xlab = "Weight", ylab = "MPG", main = "Scatterplot of MPG by Weight",
     type = "p", pch = 20, col = "blue")
# Add open black circles around each point
points(cars$weightlbs, cars$mpg, type = "p", col = "black")
Measures of Center And Spread
• Numerical measures of center estimate where the center of a particular variable lies:
– Mean
– Median
– Mode
• Mean: the average of the valid values taken by the variable
‒ For extremely skewed data sets, the mean becomes less representative of the
variable center
‒ Also, the mean is sensitive to the presence of outliers
• Median: defined as the field value in the middle when the field
values are sorted into ascending order
‒ The median is resistant to the presence of outliers
• Mode: represents the field value occurring with the greatest
frequency
‒ The mode may be used with either numerical or categorical data, but is not always
associated with the variable center
Measures of Center And Spread (cont’d)
# Descriptive statistics
mean(cars.cleaned$weightlbs)     # mean
median(cars.cleaned$weightlbs)   # median
length(cars.cleaned$weightlbs)   # number of observations
sd(cars.cleaned$weightlbs)       # standard deviation
summary(cars.cleaned$weightlbs)  # min, Q1, median, mean, Q3, max
Data Transformation
• Variables tend to have ranges different from each other
• In baseball, two fields may have ranges:
– Batting average: [ 0.0, 0.400 ]
– Number of home runs: [ 0, 70 ]
Min-Max Normalization
• Min-max normalization measures how much greater the field value is than the minimum value min(X), and scales this difference by the range: X* = (X − min(X)) / (max(X) − min(X))
• For example, for an ultra-light vehicle weighing only 1613 pounds (the field minimum), the min-max normalization is (1613 − 1613) / (max(X) − min(X)) = 0, the smallest possible normalized value
Z-score Standardization
• Z-score standardization subtracts the field mean from the value and divides by the standard deviation: Z = (X − mean(X)) / SD(X)
• For example, for a vehicle weighing only 1613 pounds, the Z-score standardization yields (1613 − mean(weight)) / SD(weight), a negative value, since this weight lies well below the mean
Z-score Standardization
# Transformations
# Min-max normalization
mi = min(cars.cleaned$weightlbs)
ma = max(cars.cleaned$weightlbs)
minmax.weight = (cars.cleaned$weightlbs - mi) / (ma - mi)
minmax.weight
# Z-score standardization
m = mean(cars.cleaned$weightlbs)
s = sd(cars.cleaned$weightlbs)
z.weight = (cars.cleaned$weightlbs - m) / s
z.weight
Transformations To Achieve Normality
• Figure below shows the normal distribution that has mean 𝜇 = 0 and
SD 𝜎 = 1, known as the standard normal distribution Z
Transformations To Achieve Normality (cont’d)
• Z-standardized data will have mean = 0 and standard deviation = 1, but this does not mean that they are normally distributed
• [Figure on slide: side-by-side histograms of the original and standardized weights, showing identical shapes]
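A quick numerical check of this claim, using simulated right-skewed data (a sketch, not from the slides):

```r
set.seed(1)
x = rexp(1000)              # strongly right-skewed sample
z = (x - mean(x)) / sd(x)   # Z-score standardization
mean(z)                     # approximately 0
sd(z)                       # exactly 1
# Standardization only shifts and rescales; the shape is unchanged,
# so z is still right-skewed (positive skewness statistic)
(3 * (mean(z) - median(z))) / sd(z)
```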
Skewness
• Skewness measures the asymmetry of a distribution
• A rule-of-thumb statistic: skewness = 3 × (mean − median) / standard deviation
• Right-skewed data have mean greater than median (positive skewness); left-skewed data have mean less than median (negative skewness)
Skewness (cont’d)
# Skewness of the original weights
(3*(mean(cars$weightlbs) - median(cars$weightlbs))) / sd(cars$weightlbs)
# Skewness of the Z-standardized weights (the same value: a linear
# transformation does not change the shape of the distribution)
(3*(mean(z.weight) - median(z.weight))) / sd(z.weight)
Transformations To Achieve Normality
• Common transformations for reducing right skewness include the square root, the natural log, and the inverse square root
Transformations To Achieve Normality (cont’d)
# Square root transformation
sqrt.weight = sqrt(cars.cleaned$weightlbs)
sqrt.weight_skew = (3*(mean(sqrt.weight) - median(sqrt.weight))) / sd(sqrt.weight)
# Natural log transformation (ln() is from the SciViews package; base R's log() is equivalent)
library(SciViews)
ln.weight = ln(cars.cleaned$weightlbs)
ln.weight_skew = (3*(mean(ln.weight) - median(ln.weight))) / sd(ln.weight)
# Inverse square root transformation
invsqrt.weight = 1 / sqrt(cars.cleaned$weightlbs)
invsqrt.weight_skew = (3*(mean(invsqrt.weight) - median(invsqrt.weight))) / sd(invsqrt.weight)
Transformations To Achieve Normality (cont’d)
• Each of the three transformations may bring the distribution closer to normality than the others; compare their normal probability (Q-Q) plots to judge which works best
# Normal probability plot of the inverse square root transformed weights
qqnorm(invsqrt.weight, col = "red")
qqline(invsqrt.weight)
Transformations To Achieve Normality (cont’d)
# Side-by-side histograms
par(mfrow = c(1, 2))
# First of the two histograms: the original weights (the slide pairs it
# with a histogram of the transformed weights)
hist(cars$weightlbs, breaks = 20, xlim = c(1000, 5000),
     main = "Histogram of Weight", xlab = "Weight", ylab = "Counts")
Histogram with Normal Distribution Overlay
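The slide's figure is not reproduced here; such an overlay is typically drawn in base R roughly as follows (a sketch assuming the cars.cleaned$weightlbs field from earlier slides):

```r
w = cars.cleaned$weightlbs
# Plot the histogram on the density scale so a density curve can be overlaid
hist(w, prob = TRUE, breaks = 20,
     main = "Histogram of Weight with Normal Overlay", xlab = "Weight")
# Overlay the normal density with the sample mean and standard deviation
curve(dnorm(x, mean = mean(w), sd = sd(w)), col = "red", lwd = 2, add = TRUE)
```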
Numerical Methods For Identifying Outliers
• A rough rule of thumb based on the Z-score flags a data value as an outlier if its Z-score is less than -3 or greater than 3
IQR (Interquartile Range)
• Unfortunately, the mean and SD, which both appear in the formula for the Z-score standardization, are rather sensitive to the presence of outliers
• A more robust method uses the quartiles and the interquartile range (IQR)
IQR
• The quartiles of a data set divide it into four parts, each containing 25% of the data:
– The first quartile (Q1) is the 25th percentile.
– The second quartile (Q2) is the 50th percentile, that is, the median.
– The third quartile (Q3) is the 75th percentile.
• The interquartile range IQR = Q3 − Q1 is a robust measure of spread.
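A common robust rule of thumb flags a value as an outlier when it lies more than 1.5 × IQR below Q1 or above Q3; a sketch on a toy vector:

```r
x = c(5, 7, 8, 9, 10, 11, 12, 13, 95)   # toy data with one extreme value
q1 = quantile(x, 0.25)                  # first quartile
q3 = quantile(x, 0.75)                  # third quartile
iqr = q3 - q1                           # interquartile range
# Flag values more than 1.5 * IQR beyond the quartiles
outliers = x[x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr]
outliers                                # 95
```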
Dummy Variables
Dummy Variables (cont’d)
• A categorical predictor region with k = 4 possible categories, {north, east, south, west}, can be represented by the following k − 1 = 3 dummy variables:
– north_dummy: if region = north then north_dummy = 1; otherwise north_dummy = 0.
– east_dummy: if region = east then east_dummy = 1; otherwise east_dummy = 0.
– south_dummy: if region = south then south_dummy = 1; otherwise south_dummy = 0.
• A dummy variable for west is not needed, as region = west is already uniquely identified by zero values for each of the three existing flag variables
# Create dummy variables for the categorical fields
library(fastDummies)
cars.cleaned.new = dummy_cols(cars.cleaned, select_columns = c("brand", "cylinders"))
# Note: dummy_cols creates k dummies per field by default;
# set remove_first_dummy = TRUE to obtain the k - 1 coding described above
Transforming Categorical Variables Into Numerical Variables
Removing Variables That Are Not Useful
• Example
– Suppose that 99.95% of the players in a field hockey league are female, with the remaining 0.05% male
– Such a near-constant field carries almost no information for distinguishing records and is a candidate for removal
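One way to screen for such near-constant fields (an illustrative sketch, not code from the slides):

```r
# Hypothetical field: 99.95% "female", 0.05% "male"
gender = c(rep("female", 1999), "male")
# Proportion of observations taking the single most frequent value
dominant = max(table(gender)) / length(gender)
dominant          # 0.9995
# A field whose dominant proportion is this close to 1 carries almost no
# information for modeling and is a candidate for removal
dominant > 0.99
```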
The slides are derived from the following publisher instructor material. This work is protected by United States copyright laws and is provided solely for the use of instructors in teaching their courses and assessing student learning. Dissemination or sale of any part of this work will destroy the integrity of the work and is not permitted. All recipients of this work are expected to abide by these restrictions.
Data Mining and Predictive Analytics, Second Edition, by Daniel Larose and Chantal Larose, John Wiley and Sons, Inc., 2015.