
The slides are derived from the following publisher's instructor material. This work is protected by United States copyright laws and is provided solely for the use of instructors in teaching their courses and assessing student learning. Dissemination or sale of any part of this work will destroy the integrity of the work and is not permitted. All recipients of this work are expected to abide by these restrictions.

Data Mining and Predictive Analytics, Second Edition, by Daniel Larose and Chantal Larose, John Wiley and Sons, Inc., 2015.
Data Preprocessing
Outline:

This chapter shows how to:
– Evaluate the quality of the data
– Clean the raw data
– Deal with missing data
– Perform transformations on certain variables

[Figure: the CRISP-DM standard process, with the Business/Research Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment phases]
3
Why Do We Preprocess Data?

• Raw data often incomplete, noisy

• May contain:
– Redundant fields
– Missing values
– Outliers
– Data in a form not suitable for data mining
– Erroneous values

4
Why Do We Preprocess Data? (cont’d)
• For data mining purposes, database values must undergo data
cleaning and data transformation
• Data often from legacy databases where values:
– Not looked at in years
– Expired
– No longer relevant
– Missing
• Minimize GIGO (Garbage In, Garbage Out)
– If garbage going into the analysis is minimized, then garbage in the results is minimized
• Data preparation is 60% of the effort for the data mining process (Pyle)

5
Data Cleaning
Data errors such as the following may appear in a customer table:

• Five-numeral U.S. zip code?
– Not all countries use the same zip code format, e.g., 90210 (U.S.) vs. J2S7K7 (Canada)
– Should expect unusual values in some fields

• Four-digit zip code?
– Leading zero truncated, e.g., 6269 vs. 06269 (New England states)
– Database field was numeric and chopped off the leading zero (a padding fix is sketched below)
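A minimal sketch of repairing truncated leading zeros; the customers data frame and its zip field are hypothetical illustrations, not part of the cars data used later:

# Hypothetical example: restore leading zeros lost when zip codes were stored as numbers
customers = data.frame(zip = c(6269, 90210, 2134))
customers$zip = formatC(customers$zip, width = 5, format = "d", flag = "0")
customers$zip   # "06269" "90210" "02134"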

6
Data Cleaning (cont’d)

• Income Field Contains $10,000,000?


– Assumed to measure gross annual income
– Possibly valid
– Still considered outlier (extreme data value)
– Some statistical and data mining methods affected by outliers
• Income Field Contains -$40,000?
– Income less than $0?
– Value beyond bounds for expected income, therefore an error
– Caused by data entry error?
– Discuss anomaly with database administrator

7
Data Cleaning (cont’d)

• Income Field Contains $99,999?


– Other values appear rounded to nearest $5,000
– Value may be completely valid
– Value represents database code used to denote missing value?
– Confirm values are in the expected unit of measure, such as U.S. dollars
– Which unit of measure is used for income?
– Is the income of the customer with zip code J2S7K7 reported in Canadian dollars?
– Discuss anomaly with database administrator

8
Data Cleaning (cont’d)

• Age Field Contains “C”?


– Other records have numeric values for field
– Record categorized into group labeled “C”
– Value must be resolved
– Data mining software expects numeric values for field
• Age Field Contains 0?
– Zero value used to indicate a missing/unknown value?
– Customer refused to provide their age? (a recoding sketch follows below)
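A minimal sketch of recoding a sentinel value to a proper missing value; the customers data frame and its age field are hypothetical, but the idea (convert "0 means unknown" to NA so later missing-value handling applies) is the one discussed above:

# Hypothetical example: treat age == 0 as missing rather than a real age
customers = data.frame(age = c(34, 0, 51, 0, 27))
customers$age[customers$age == 0] = NA
summary(customers$age)   # NAs are now counted as missing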

9
Data Cleaning (cont’d)

• Marital Status Field Contains “S”?


– What does this symbol mean?
– Does “S” imply single or separated?
– Discuss anomaly with database administrator

10
Handling Missing Data

• Missing values pose problems to data analysis methods

• More common in databases containing large number of fields

• Absence of information rarely beneficial to task of analysis

• In contrast, having more data almost always better

• Careful analysis is required to handle the issue; a quick count of missing values per field (sketched below) is a useful first step
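A minimal sketch of taking stock of missing values before choosing a strategy, assuming the cars_preprocessing.csv file that is loaded on the next slide:

# Count missing values in each field of the cars data
cars = read.csv("cars_preprocessing.csv", stringsAsFactors = FALSE, na.strings = "")
colSums(is.na(cars))          # NAs per column
sum(!complete.cases(cars))    # number of records with at least one NA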

11
Handling Missing Data (cont’d)
• Examine the cars_preprocessing dataset containing records for 261 automobiles manufactured in the 1970s and 1980s

cars = read.csv("cars_preprocessing.csv", stringsAsFactors = FALSE, na.strings = "")
# Create a working copy of the dataset as cars.cleaned
cars.cleaned = cars
head(cars.cleaned)

12
Handling Missing Data (cont’d)
• Delete Records Containing Missing Values?
– Not necessarily best approach
– Pattern of missing values may be systematic
– Deleting records creates biased subset
– Valuable information in other fields lost

• Four Alternative Methods Are Available
1. Replace missing values with a user-defined constant
2. Replace missing values with the mode (for categorical variables) or the mean (for numeric variables)
3. Replace missing values with a value generated at random from the observed distribution of the variable
4. Replace missing values with imputed values based on the other characteristics of the record

13
Handling Missing Data (cont’d)
• (1) Replace Missing Values with User-defined Constant
– Missing numeric values replaced with 0.0

cars.cleaned$hp[c(4,5)] = 0

Or use the following code to replace all NAs with 0

cars.cleaned$hp[is.na(cars.cleaned$hp)] = 0

– Missing categorical values replaced with “Missing”

cars.cleaned$brand[is.na(cars.cleaned$brand)] = "Missing"

14
Handling Missing Data (cont’d)
• (2) Replace Missing Values with Mode or Mean
– Mode of categorical field cylinders = 4
– Missing values replaced with this value

# Replace values with mean and mode
our_table = table(cars.cleaned$cylinders)
pos_mode = which.max(our_table)
our_mode = names(pos_mode)
cars.cleaned$cylinders[is.na(cars.cleaned$cylinders)] = our_mode

– Mean for non-missing values in numeric field mpg = 23.12171
– Missing values replaced with 23.12171

cars.cleaned$mpg[is.na(cars.cleaned$mpg)] = mean(na.omit(cars.cleaned$mpg))

15
Handling Missing Data (cont’d)
• (3) Replace Missing Values with Random Values
– Values randomly taken from underlying distribution
– Method superior compared to mean substitution
– Measures of location and spread remain closer to original
# Generate one random observation for each missing value,
# drawn from the observed (non-missing) distribution
n_missing = sum(is.na(cars.cleaned$cubicinches))
obs_cubicinches = sample(na.omit(cars.cleaned$cubicinches), n_missing, replace = TRUE)
cars.cleaned$cubicinches[is.na(cars.cleaned$cubicinches)] = obs_cubicinches

– No guarantee that the resulting records make sense
– Suppose the randomly-generated values are cylinders = 8 and cubicinches = 82
– What is the likely value, given the record's other attribute values? (imputation)
– For example, an American car with 300 cubic inches and 150 horsepower
– A Japanese car with 100 cubic inches and 90 horsepower
– The American car is expected to have more cylinders
16
Handling Missing Data (cont’d)
• (4) Imputation
– In data imputation, we need to answer “What would be the
most likely value for this missing value, given all the other
attributes for a particular record?”
• An American car with 300 cubic inches and 150 horsepower would
probably be expected to have more cylinders than a Japanese car with 100
cubic inches and 90 horsepower.

– Imputation of missing data requires tools such as multiple regression or classification and regression trees (future topics); a simple regression-based sketch appears below
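A minimal sketch of regression-based imputation on the cars data; the specific predictors chosen here are only illustrative, and the records being imputed are assumed to have non-missing values for those predictors:

# Fit a model for cubicinches on complete records, then predict the missing values
complete_rows = !is.na(cars$cubicinches)
fit = lm(cubicinches ~ weightlbs + hp, data = cars[complete_rows, ])
missing_rows = is.na(cars$cubicinches)
cars.cleaned$cubicinches[missing_rows] = predict(fit, newdata = cars[missing_rows, ])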

17
Identifying Misclassifications

– Verify values valid and consistent

table(cars.cleaned$brand)

– Frequency distribution shows five classes: USA, France, US, Europe, and Japan
– Count for USA = 1 and France = 1?
– Two records classified inconsistently with respect to the origin of the manufacturer
– Maintain consistency by relabeling USA → US and France → Europe (see the sketch below)
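A minimal sketch of the relabeling, continuing with cars.cleaned; the class names are those reported by table() above:

# Relabel inconsistent brand values so each origin appears under one class
cars.cleaned$brand[cars.cleaned$brand == "USA"] = "US"
cars.cleaned$brand[cars.cleaned$brand == "France"] = "Europe"
table(cars.cleaned$brand)   # should now show only US, Europe, and Japan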
18
Graphical Methods for Identifying Outliers
• Outliers are values that lie near extreme limits of data range
• Outliers may represent errors in data entry
• Certain statistical methods very sensitive to outliers and may produce
unstable results
• A histogram examines values of numeric fields

# Create a histogram
hist(cars$weightlbs, col = "blue", border = "black", xlab = "Weight",
     ylab = "Counts", main = "Histogram of Car Weights")

# Draw a box around the plot
box(which = "plot", lty = "solid", col = "black")

19
Graphical Methods for Identifying Outliers (cont’d)

– A histogram examines the values of numeric fields
– This histogram shows vehicle weights for the cars data set
– The extreme left tail contains one outlier weighing only 192.5 pounds
– Perhaps the value of 192.5 is an error
– Should it be 1925?
– Cannot know for sure; requires further investigation (a sketch for locating the record follows below)
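A minimal sketch of locating the suspicious record so it can be investigated; the 1000-pound cutoff is only an illustrative assumption:

# Inspect the record(s) with implausibly low weight
cars[which(cars$weightlbs < 1000), ]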

20
Graphical Methods for Identifying Outliers (cont’d)
– Two-dimensional scatter plots help identify outliers between variable pairs
– A scatter plot of mpg against weightlbs shows two possible outliers
– Most data points cluster together along the x-axis
– However, one car weighs 192.5 pounds and another gets over 500 miles per gallon?

# Create a scatterplot
plot(cars$weightlbs, cars$mpg, xlim = c(0, 5000), ylim = c(0, 600),
     xlab = "Weight", ylab = "MPG", main = "Scatterplot of MPG by Weight",
     type = "p", pch = 20, col = "blue")
# Add open black circles
points(cars$weightlbs, cars$mpg, type = "p", col = "black")

21
Graphical Methods for Identifying Outliers (cont’d)

– Most data points cluster together along the x-axis
– However, one car weighs 192.5 pounds and another gets over 500 miles per gallon?

22
Measures of Center And Spread
• The numerical measures of center estimate where the center of a particular variable lies
– Mean
– Median
– Mode
• Mean: the average of the valid values taken by the variable
‒ For extremely skewed data sets, the mean becomes less representative of the
variable center
‒ Also, the mean is sensitive to the presence of outliers
• Median: defined as the field value in the middle when the field
values are sorted into ascending order
‒ The median is resistant to the presence of outliers
• Mode: represents the field value occurring with the greatest
frequency
‒ The mode may be used with either numerical or categorical data, but is not always
associated with the variable center
23
Measures of Center And Spread (cont’d)

• Measures of spread (variability) include the range (maximum − minimum), the standard deviation, the mean absolute deviation, and the interquartile range

• The sample standard deviation is perhaps the most widespread measure of variability and is defined by

  s = sqrt( Σ(x − x̄)² / (n − 1) )

• The standard deviation can be interpreted as the "typical" distance between a field value and the mean; most field values lie within two standard deviations of the mean (a quick check on the cars data is sketched below)
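A minimal sketch checking the two-standard-deviation claim on the cars data, reusing the cleaned weightlbs field from earlier slides:

# Proportion of weight values within two standard deviations of the mean
w = cars.cleaned$weightlbs
mean(abs(w - mean(w)) <= 2 * sd(w))   # typically a large proportion for mound-shaped data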

24
Measures of Center And Spread (cont’d)

# Descriptive statistics
mean(cars.cleaned$weightlbs)     # Mean
median(cars.cleaned$weightlbs)   # Median
length(cars.cleaned$weightlbs)   # Number of observations
sd(cars.cleaned$weightlbs)       # Standard deviation
summary(cars.cleaned$weightlbs)  # Min, Q1, Median, Mean, Q3, Max

25
Data Transformation
• Variables tend to have ranges different from each other
• In baseball, two fields may have ranges:
– Batting average: [ 0.0, 0.400 ]
– Number of home runs: [ 0, 70 ]

• Some data mining algorithms are adversely affected by differences in variable ranges
• Variables with greater ranges tend to have a larger influence on the data model's results

• Therefore, numeric field values should be normalized

26
Min-Max Normalization

• Min-max normalization works by seeing how much greater the field value is than the minimum value min(X), and scaling this difference by the range:

  X* = (X − min(X)) / (max(X) − min(X)) = (X − min(X)) / range(X)

• For example, for an ultra-light vehicle weighing only 1613 pounds (the field minimum), the min-max normalization is (1613 − 1613) / range(X) = 0

• The heaviest vehicle (the field maximum) has a min-max normalization value of range(X) / range(X) = 1
27
Z-score Standardization

• Z-score standardization works by taking the difference between the field value and the field mean value, and scaling this difference by the standard deviation of the field values:

  Z = (X − mean(X)) / SD(X)

• For example, for a vehicle weighing only 1613 pounds, the Z-score standardization is (1613 − mean(weightlbs)) / SD(weightlbs), a negative value, since this weight lies below the mean

• For the heaviest car, the Z-score standardization is positive, since its weight lies above the mean
28
Z-score Standardization

# Transformations
# Min-max normalization
mi = min(cars.cleaned$weightlbs)
ma = max(cars.cleaned$weightlbs)
minmax.weight = (cars.cleaned$weightlbs - mi) / (ma - mi)
minmax.weight

# Z-score standardization
m = mean(cars.cleaned$weightlbs)
s = sd(cars.cleaned$weightlbs)
z.weight = (cars.cleaned$weightlbs - m) / s
z.weight
29
Transformations To Achieve Normality

• Normal distribution is a continuous probability distribution (bell curve)

• Centered at mean 𝜇 and its spread determined by SD 𝜎 (sigma)

• Figure below shows the normal distribution that has mean 𝜇 = 0 and
SD 𝜎 = 1, known as the standard normal distribution Z

• A common misconception is that variables that have had the Z-score standardization applied to them follow the standard normal Z distribution

• This is not correct!

30
Transformations To Achieve Normality (cont’d)

• Z-standardized data will have mean = 0 and standard deviation = 1, but that does not mean the data are normally distributed

[Figure: histograms of the original data and of the standardized data, showing the same shape]

31
Skewness

• Measuring the skewness of a distribution tells us about its symmetry (or lack of it)
• A common approximation, used in the code on the next slide, is skewness ≈ 3 × (mean − median) / standard deviation
32
Skewness (cont’d)

# Skewness of the original weight and of the Z-standardized weight
(3 * (mean(cars$weightlbs) - median(cars$weightlbs))) / sd(cars$weightlbs)
(3 * (mean(z.weight) - median(z.weight))) / sd(z.weight)

33
Transformations To Achieve Normality

• To make our data "more normally distributed," we must first make it symmetric
• To eliminate skewness, we apply a transformation to the data
• Common transformations are:
– Natural log transformation: ln(weight)
– Square root transformation: sqrt(weight)
– Inverse square root transformation: 1 / sqrt(weight)

34
Transformations To Achieve Normality (cont’d)

• Natural log transformation: ln(weight)

• Square root transformation: sqrt(weight)

• Inverse square root transformation: 1 / sqrt(weight)
– (best of the three, but still not really normal)
35
Transformations To Achieve Normality (cont’d)
# Transformations for normality
# Square root
sqrt.weight = sqrt(cars.cleaned$weightlbs)
sqrt.weight_skew = (3 * (mean(sqrt.weight) - median(sqrt.weight))) / sd(sqrt.weight)

# Natural log (ln() is provided by the SciViews package)
library(SciViews)
ln.weight = ln(cars.cleaned$weightlbs)
ln.weight_skew = (3 * (mean(ln.weight) - median(ln.weight))) / sd(ln.weight)

# Inverse square root
invsqrt.weight = 1 / sqrt(cars.cleaned$weightlbs)
invsqrt.weight_skew = (3 * (mean(invsqrt.weight) - median(invsqrt.weight))) / sd(invsqrt.weight)

36
Transformations To Achieve Normality (cont’d)
• One of the three transformations may produce a distribution closer to normal than the others; compare their skewness values

• To check for normality, construct a normal probability (Q-Q) plot

• If the distribution is normal, the bulk of the points in the plot should fall on a straight line

[Figure: normal probability plots of a non-normal and a normal variable, labeled "Not Normal" and "Normal"]

• When the algorithm is done with its analysis, don't forget to "de-transform" the data
37
Transformations To Achieve Normality (cont’d)

# Normal Q-Q plot of the inverse-square-root-transformed weight
qqnorm(invsqrt.weight, col = "red")
qqline(invsqrt.weight, col = "blue")

38
Transformations To Achieve Normality (cont’d)
# Side-by-side histograms
par(mfrow = c(1, 2))
# Create two histograms
hist(cars$weightlbs, breaks = 20, xlim = c(1000, 5000),
     main = "Histogram of Weight", xlab = "Weight", ylab = "Counts")
box(which = "plot", lty = "solid", col = "black")

hist(z.weight, breaks = 20, xlim = c(-2, 3),
     main = "Histogram of Z-score of Weight", xlab = "Z-score of Weight", ylab = "Counts")
box(which = "plot", lty = "solid", col = "black")

39
Histogram with Normal Distribution Overlay

# Histogram with normal distribution overlay
par(mfrow = c(1, 1))

# Simulate a large normal sample with the same mean and SD as the transformed data
x = rnorm(1000000, mean = mean(invsqrt.weight), sd = sd(invsqrt.weight))

# prob = TRUE plots densities rather than counts, so the normal curve can be overlaid
hist(invsqrt.weight, breaks = 30, xlim = c(0.0125, 0.0275), col = "lightblue",
     prob = TRUE, border = "black", xlab = "Inverse Square Root of Weight",
     ylab = "Density", main = "Histogram of Inverse Square Root of Weight")
box(which = "plot", lty = "solid", col = "black")

# Overlay with the normal density
lines(density(x), col = "red")

40
Numerical Methods For Identifying Outliers

• The Z-score method for identifying outliers states:
– A data value is an outlier if it has a Z-score that is either less than −3 or greater than 3
– Variable values with Z-scores much beyond this range may bear further investigation
– However, one should not automatically omit outliers from analysis (a flagging sketch follows below)
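A minimal sketch of flagging Z-score outliers in the weight field, reusing z.weight from the standardization code earlier:

# Flag records whose standardized weight lies beyond +/- 3
outlier_rows = which(abs(z.weight) > 3)
cars.cleaned[outlier_rows, ]   # inspect these records rather than deleting them automatically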

41
IQR (Interquartile range)
• Unfortunately, the mean and SD, which are both part of the formula for the Z-score standardization, are rather sensitive to the presence of outliers

• Therefore, data analysts have developed more robust statistical methods for outlier detection

• One elementary robust method is to use the IQR

42
IQR
• The quartiles of a data set divide the data set into the following four parts, each containing 25% of the data:
– The first quartile (Q1) is the 25th percentile.
– The second quartile (Q2) is the 50th percentile, that is, the median.
– The third quartile (Q3) is the 75th percentile.

• IQR is calculated as IQR = Q3 − Q1, and may be interpreted to represent the spread of the middle 50% of the data

• A data value is an outlier if:
– a. it is located 1.5(IQR) or more below Q1, or
– b. it is located 1.5(IQR) or more above Q3.

43
IQR (cont’d)

• Set of numbers: {1, 6, 3, 14, 5, 2, 7, 8, 4}
– 25th percentile: Q1 = 2.5
– 75th percentile: Q3 = 7.5
• Interquartile range, or the difference between these quartiles:
– IQR = 7.5 − 2.5 = 5
• A number would be identified as an outlier if:
– a. it is lower than Q1 − 1.5(IQR) = 2.5 − 1.5(5) = −5, or
– b. it is higher than Q3 + 1.5(IQR) = 7.5 + 1.5(5) = 12.5
• Here, 14 > 12.5, so 14 is flagged as an outlier (see the sketch below)
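A minimal sketch of the same calculation in R; note that quantile() with type = 6 matches the quartile convention used above (R's default type = 7 gives slightly different quartiles for small samples):

x = c(1, 6, 3, 14, 5, 2, 7, 8, 4)
q = quantile(x, probs = c(0.25, 0.75), type = 6)   # Q1 = 2.5, Q3 = 7.5
iqr = q[2] - q[1]                                  # 5
lower = q[1] - 1.5 * iqr                           # -5
upper = q[2] + 1.5 * iqr                           # 12.5
x[x < lower | x > upper]                           # 14 is flagged as an outlier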

44
Dummy Variables

• Some analytical methods, such as regression, require predictors to be numeric

• A dummy variable is a categorical variable taking only two values, 0 and 1

• When a categorical predictor takes k ≥ 3 possible values, define k − 1 dummy variables

45
Dummy Variables (cont’d)
• If the categorical predictor region has k = 4 possible categories, {north, east, south, west}, then the analyst could define the following k − 1 = 3 dummy variables:
– north_dummy: If region = north then north_dummy = 1; otherwise north_dummy = 0.
– east_dummy: If region = east then east_dummy = 1; otherwise east_dummy = 0.
– south_dummy: If region = south then south_dummy = 1; otherwise south_dummy = 0.
• The dummy variable for west is not needed, as region = west is already uniquely identified by zero values for each of the three existing flag variables

library(fastDummies)

# By default dummy_cols creates one dummy column per level;
# adding remove_first_dummy = TRUE would give the k - 1 coding described above
cars.cleaned.new = dummy_cols(cars.cleaned, select_columns = c("brand", "cylinders"))

46
Transforming Categorical Variables Into Numerical Variables

• In most instances, the data analyst should avoid transforming categorical variables to numerical variables, since doing so assumes an ordering of the categories

• The exception is for categorical variables that are clearly ordered, such as the variable survey response, taking the values always, usually, sometimes, never (a mapping sketch follows below)
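A minimal sketch of one possible mapping for an ordered categorical variable; the survey_response values come from the example above, but the 1-4 scale is an illustrative assumption:

# Map an ordered categorical variable to a numeric scale
survey_response = c("always", "usually", "sometimes", "never", "usually")
scale_map = c(never = 1, sometimes = 2, usually = 3, always = 4)
survey_numeric = scale_map[survey_response]
survey_numeric   # 4 3 2 1 3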

47
Removing Variables That Are Not Useful

• Duplicate records lead to an overweighting of the data values

• We wish to remove variables that will not help the analysis, regardless of the proposed data mining task or algorithm:
– Unary variables, which take on only a single value, so a unary variable is not so much a variable as a constant
– Variables which are very nearly unary

• Example
– Suppose that 99.95% of the players in a field hockey league are female, with the remaining 0.05% male; the gender variable is nearly unary and unlikely to help the analysis
(A sketch for finding duplicates and unary variables follows below)
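A minimal sketch of checking for duplicate records and (nearly) unary variables in the cars data; the 99.9% threshold for "nearly unary" is an illustrative assumption:

# Remove exact duplicate records
cars.cleaned = cars.cleaned[!duplicated(cars.cleaned), ]

# Identify unary or nearly unary variables (dominant value covers >= 99.9% of records)
dominant_share = sapply(cars.cleaned, function(col) max(table(col)) / length(col))
names(dominant_share)[dominant_share >= 0.999]   # candidates to drop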

48
The slides are derived from the following publisher's instructor material. This work is protected by United States copyright laws and is provided solely for the use of instructors in teaching their courses and assessing student learning. Dissemination or sale of any part of this work will destroy the integrity of the work and is not permitted. All recipients of this work are expected to abide by these restrictions.

Data Mining and Predictive Analytics, Second Edition, by Daniel Larose and Chantal Larose, John Wiley and Sons, Inc., 2015.
