Lecture 6a: Data Preprocessing
DATA PREPROCESSING
Big Data Science (Master in Statistical Data Analysis)
THE IMPORTANCE OF DATA PREPROCESSING
̶ Garbage in → garbage out!
̶ Quality decisions must be based on quality data
‒ e.g., duplicate or missing data may cause incorrect or even misleading statistics.
DATA CLEANING
̶ Incomplete: missing attribute values, lack of certain attributes of interest, or containing only aggregate data
̶ e.g., occupation=“”
̶ Noisy: containing errors or outliers
̶ e.g., Salary=“-10”
̶ Inconsistent: containing discrepancies in codes or names
̶ e.g., Age=“42” Birthday=“03/07/1997”
̶ e.g., Was rating “1,2,3”, now rating “A, B, C”
̶ e.g., discrepancy between duplicate records
MISSING DATA
REASONS FOR MISSING DATA
̶ Equipment malfunction
̶ Inconsistent with other recorded data and thus deleted
̶ Data not entered due to misunderstanding
̶ Certain data may not be considered important at the time of entry
MISSING DATA MECHANISMS
̶ Missing Completely at Random (MCAR)
̶ The probability of a missing value depends neither on x (the other variables) nor on y (the variable itself)
̶ Missing at Random (MAR)
̶ The probability of a missing value depends on x, but not on y
̶ Example: Respondents in service occupations are less likely to report income
̶ Missing Not at Random (NMAR)
̶ The probability of a missing value depends on the variable that is missing itself
̶ Example: Respondents with high income are less likely to report income
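A minimal Python sketch (not from the slides) illustrating the three mechanisms on a hypothetical occupation/income dataset; the variable names and missingness probabilities are arbitrary assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
# Hypothetical data: occupation type (x) and income (y)
occupation = rng.integers(0, 2, n)                       # 0 = service, 1 = other
income = 30000 + 20000 * occupation + rng.normal(0, 5000, n)
df = pd.DataFrame({"occupation": occupation, "income": income})

# MCAR: every income value is missing with the same probability
mcar = df.copy()
mcar.loc[rng.random(n) < 0.2, "income"] = np.nan

# MAR: missingness depends on the observed variable (occupation), not on income
mar = df.copy()
p = np.where(mar["occupation"] == 0, 0.4, 0.1)
mar.loc[rng.random(n) < p, "income"] = np.nan

# NMAR: missingness depends on the (unobserved) income value itself
nmar = df.copy()
p = np.where(df["income"] > df["income"].quantile(0.8), 0.5, 0.05)
nmar.loc[rng.random(n) < p, "income"] = np.nan
```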
DEALING WITH MISSING DATA
̶ Use what you know about
̶ Why data is missing
̶ Distribution of missing data
̶ Decide on the best analysis strategy to yield the least biased estimates
̶ Deletion Methods
‒ Listwise deletion
‒ Pairwise deletion
̶ Single Imputation Methods
‒ Mean/mode substitution
‒ Dummy variable method
‒ Single regression
̶ Model-based Methods
‒ Maximum Likelihood
‒ Multiple imputation
LISTWISE DELETION (COMPLETE CASE ANALYSIS)
Only analyze cases with available data on each variable
̶ Pros:
‒ Simplicity
‒ Comparability across analyses
̶ Cons:
‒ Reduces statistical power (because it lowers n)
‒ Doesn’t use all available information
‒ Estimates may be biased if data are not MCAR
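In pandas, listwise deletion is a one-liner; a minimal sketch, assuming a DataFrame `df` with some missing entries:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 35],
                   "income": [30000, 45000, np.nan, 52000]})

# Keep only the complete cases (rows without any missing value)
complete_cases = df.dropna()
print(complete_cases)
```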
PAIRWISE DELETION (AVAILABLE CASE ANALYSIS)
Analysis with all cases in which the variables of interest are present
̶ Pros:
‒ Keeps as many cases as possible for each analysis
‒ Uses all information available for each analysis
̶ Cons:
‒ Analyses can’t be compared because the sample differs each time
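Pairwise deletion is, for example, what pandas' `DataFrame.corr()` does by default: each pairwise correlation is computed on the rows where both variables are observed. A small sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0, 5.0],
                   "y": [2.0, np.nan, 3.0, 5.0, 6.0],
                   "z": [1.0, 1.5, 2.0, np.nan, 3.0]})

# Each entry of the correlation matrix is computed on the rows
# where that particular pair of columns is non-missing
print(df.corr())
```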
SINGLE IMPUTATION METHODS
̶ Mean/Mode Substitution
̶ Replace missing value with sample mean or mode
̶ Run analyses as if all cases were complete
̶ Pros:
‒ Can use complete case analysis methods
̶ Cons:
‒ Reduces variability
‒ Weakens covariance and correlation estimates in the data (because it ignores the relationships between variables)
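A minimal sketch of mean substitution using scikit-learn's `SimpleImputer` (pandas `fillna` would work just as well); the data are made up:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[25, 30000],
              [np.nan, 45000],
              [40, np.nan],
              [35, 52000]])

# Replace each missing entry by its column mean
# (strategy="most_frequent" would give mode substitution)
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```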
SINGLE IMPUTATION METHODS
̶ Regression imputation
̶ Replaces missing values with the predicted score from a regression equation
̶ Pros:
‒ Uses information from observed data
̶ Cons:
‒ Overestimates model fit and correlation estimates
‒ Underestimates variance
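A sketch of single regression imputation on made-up two-column data: a linear model is fit on the complete cases and its predictions fill the missing values. The column names are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"experience": [1, 3, 5, 7, 9, 11],
                   "salary": [30, 38, np.nan, 55, np.nan, 70]})

observed = df["salary"].notna()
reg = LinearRegression().fit(df.loc[observed, ["experience"]],
                             df.loc[observed, "salary"])

# Fill the missing salaries with their regression predictions
df.loc[~observed, "salary"] = reg.predict(df.loc[~observed, ["experience"]])
print(df)
```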
MULTIPLE IMPUTATION
1. Impute: the data is “filled in” with imputed values using a specified regression model
̶ This step is repeated m times, resulting in a separate dataset each time.
2. Analyze: analyses are performed within each dataset
3. Pool: the results are pooled into one estimate
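A sketch of the impute–analyze–pool loop using scikit-learn's `IterativeImputer` with posterior sampling; the number of imputations m, the simulated data, and the pooled statistic (a simple mean) are illustrative assumptions:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[rng.random(X.shape) < 0.2] = np.nan        # introduce ~20% missing values

m = 5
estimates = []
for i in range(m):
    # 1. Impute: a different stochastic imputation for each of the m datasets
    imputer = IterativeImputer(sample_posterior=True, random_state=i)
    X_imp = imputer.fit_transform(X)
    # 2. Analyze: here the "analysis" is simply the mean of the first column
    estimates.append(X_imp[:, 0].mean())

# 3. Pool: combine the m estimates (Rubin's rules would also pool the variances)
print(np.mean(estimates))
```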
MULTIPLE IMPUTATION
̶ Pros:
̶ Variability is estimated more accurately with multiple imputations for each missing value
̶ Considers variability due to sampling AND variability due to imputation
̶ Cons:
̶ Cumbersome coding
̶ Room for error when specifying models
NOISY DATA
DATA NORMALIZATION
̶ min-max normalization (rescaling to a new range $[\min(v'), \max(v')]$)
$$v' = \frac{v - \min(v)}{\max(v) - \min(v)} \left(\max(v') - \min(v')\right) + \min(v')$$
̶ z-score normalization
$$v' = \frac{v - \text{mean}_A}{\text{stand\_dev}_A}$$
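A minimal numpy sketch of both normalizations (a new range of [0, 1] is assumed for min-max):

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# min-max normalization to the new range [new_min, new_max]
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# z-score normalization
v_zscore = (v - v.mean()) / v.std()

print(v_minmax)
print(v_zscore)
```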
NOISY DATA
̶ Noise: random error or variance in a measured variable, resulting in modification of the original values
̶ Incorrect attribute values may be due to
̶ faulty data collection instruments
̶ data entry problems
̶ data transmission problems
HOW TO HANDLE NOISY DATA?
̶ Binning method:
̶ first sort data and partition into (equi-depth) bins
̶ then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.
̶ Clustering
̶ Detect similar data points
̶ Average out over similar data points to construct “denoised” data points
BINNING METHODS FOR DATA SMOOTHING
̶ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
̶ Partition into (equi-depth) bins:
̶ Bin 1: 4, 8, 9, 15
̶ Bin 2: 21, 21, 24, 25
̶ Bin 3: 26, 28, 29, 34
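A sketch that reproduces the equi-depth bins above and applies two of the smoothing options (the data are the sorted prices from the slide):

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Partition the sorted data into equi-depth bins of 4 values each
bins = prices.reshape(3, 4)

# Smoothing by bin means: every value is replaced by its bin's mean
smoothed_means = np.repeat(bins.mean(axis=1), 4).reshape(3, 4)
print(smoothed_means)

# Smoothing by bin boundaries: every value is replaced by the closest bin boundary
lo, hi = bins[:, [0]], bins[:, [-1]]
smoothed_bounds = np.where(bins - lo < hi - bins, lo, hi)
print(smoothed_bounds)
```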
OUTLIER REMOVAL
̶ Data points inconsistent with the majority of data
̶ Different outliers
̶ Valid: a CEO’s salary
̶ Noisy: One’s age = 200, widely deviated points
̶ Removal methods
̶ Clustering
̶ Curve-fitting
̶ Hypothesis-testing with a given model
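A sketch of clustering-based outlier removal using DBSCAN (points labelled -1 belong to no cluster and can be treated as outliers); the simulated data, `eps`, and `min_samples` are arbitrary assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),       # dense cluster
               np.array([[8.0, 8.0], [10.0, -9.0]])]) # two isolated points

labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

outliers = X[labels == -1]
cleaned = X[labels != -1]
print(f"Removed {len(outliers)} outliers, kept {len(cleaned)} points")
```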
DATA/DIMENSIONALITY REDUCTION
HIGH-DIMENSIONAL VERSUS LARGE DATA
̶ High dimensional datasets (# variables is high)
̶ More and more high-dimensional data sets are emerging (e.g. due to technological advances in data-capturing instruments)
̶ Relevant features are often not known in advance
̶ In order not to lose potentially interesting information: add as many features as possible
̶ Large datasets (# instances is high)
DATA REDUCTION METHODS
̶ Data is too big to work with
̶ Data reduction
̶ Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
̶ Methods
̶ Sampling
̶ Binning, prototype selection
SAMPLING METHODS
̶ Choose a representative subset of the data
̶ Simple random sampling may have poor performance in the presence of skew.
̶ Develop adaptive sampling methods
̶ Stratified sampling:
‒ Approximate the percentage of each class (or subpopulation of interest) in the overall database
‒ Used in conjunction with skewed data
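A sketch of stratified sampling in pandas: sampling the same fraction from each class keeps the class proportions of the (skewed) full dataset. The column names and sampling fraction are assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical skewed dataset: 95% class "a", 5% class "b"
df = pd.DataFrame({"x": rng.normal(size=10_000),
                   "label": rng.choice(["a", "b"], size=10_000, p=[0.95, 0.05])})

# Draw 10% from each stratum so that the class proportions are preserved
sample = df.groupby("label").sample(frac=0.10, random_state=0)

print(df["label"].value_counts(normalize=True))
print(sample["label"].value_counts(normalize=True))
```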
EXAMPLE: PROTOTYPE SELECTION
EXAMPLE: BINNING BY HISTOGRAMS
EFFECT OF SAMPLE SIZE WHEN REDUCING
CURSE OF DIMENSIONALITY
[Bellman, 1961]
“Sample size needed to estimate a function of several variables to a given degree of accuracy grows exponentially with the number of variables”
Example:
̶ sampling a unit interval with no more than 0.01 distance between points requires 100 samples
̶ equivalent sampling of a 10-dimensional unit hypercube requires $100^{10} = 10^{20}$ samples
“SMALL N LARGE P PROBLEMS”
̶ Refers to a class of modeling problems where the number of features (p) largely outnumbers the number of samples (n)
̶ Also sometimes referred to as “wide data”
̶ Typically we want n >> p to perform accurate parameter estimation
̶ Dimensionality reduction is key to avoiding overfitting in small n large p problems
SIMILARITY FADES AWAY IN HIGH DIMENSIONS
̶ When dimensionality increases, data becomes increasingly sparse in the space that it occupies
̶ Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful
̶ Things look more similar on average when more features are used to describe them
EXAMPLE: CURSE OF DIMENSIONALITY
THE EMPTY SPACE PHENOMENON
̶ Empty space phenomenon: “High-dimensional spaces are inherently sparse”
̶ Counter-intuitively, corners (tails) are much more important than centers (see the sketch below)
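A small sketch (not from the slides) that quantifies this: the fraction of a d-dimensional unit hypercube's volume lying within 0.05 of its boundary is $1 - 0.9^d$, which rapidly approaches 1 as d grows:

```python
# Fraction of the unit hypercube [0, 1]^d lying within 0.05 of the boundary:
# the "interior" cube has side 0.9, so the boundary shell has volume 1 - 0.9**d
for d in (1, 2, 10, 50, 100):
    print(d, 1 - 0.9 ** d)
# d=1: 0.10, d=10: ~0.65, d=100: ~0.99997 -> almost all volume sits near the boundary
```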
DIMENSIONALITY REDUCTION
̶ Not all measured variables (features) are important for understanding the underlying phenomena
̶ Many methods are not designed to deal with high-dimensional datasets (with many irrelevant features)
̶ Eliminate irrelevant features
̶ Eliminate redundant features
̶ Motivation for many domains where the sample-per-feature ratio is low
ADVANTAGES OF DIMENSIONALITY REDUCTION
̶ Improve model performance
̶ Classification: improve classification performance (maximize accuracy, AUC)
̶ Clustering: improve cluster detection (AIC, BIC, sum of squares, various indices)
̶ Regression: improve fit (sum of squares error)
ADVANTAGES OF DIMENSIONALITY REDUCTION
̶ Faster and more cost-effective models
̶ Improve generalization performance (avoiding overfitting)
̶ Gain deeper insight into the processes that generated the data (esp. important in Bioinformatics)
TYPES OF FEATURE SELECTION TECHNIQUES
̶ Linear vs Non-linear
̶ Supervised vs Unsupervised
FORWARD / BACKWARD METHODS
̶ Greedy methods
̶ Suppose the original set of features is 𝑃
̶ Recursive Feature Elimination (RFE), a backward method (a sketch follows below):
1. Start with all features selected: 𝑆 = 𝑃
2. While the performance improves:
a. For each feature 𝑖 ∈ 𝑆:
‒ train a model on the features 𝑆 \ {𝑖}
‒ compute its performance 𝑚𝑖
b. Remove the feature 𝑖 that maximizes 𝑚𝑖 : 𝑆 = 𝑆 \ {𝑖}
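A sketch of this greedy backward elimination using cross-validated accuracy as the performance measure 𝑚𝑖; the dataset, model, and scoring are illustrative assumptions (scikit-learn's `RFE` class implements a related variant that ranks features by model coefficients rather than by retraining performance):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
X = X[:, :10]                               # keep 10 features so the sketch runs quickly
model = LogisticRegression(max_iter=5000)

def cv_perf(features):
    """Cross-validated accuracy of the model trained on the given feature subset."""
    return cross_val_score(model, X[:, features], y, cv=5).mean()

selected = list(range(X.shape[1]))          # S = P: start from all features
best = cv_perf(selected)
improved = True
while improved and len(selected) > 1:
    # Performance m_i of the model trained on S \ {i}, for every i in S
    scores = {i: cv_perf([j for j in selected if j != i]) for i in selected}
    i_best, m_best = max(scores.items(), key=lambda kv: kv[1])
    improved = m_best >= best               # keep removing while performance does not drop
    if improved:
        selected.remove(i_best)
        best = m_best

print(f"Kept features {selected} with CV accuracy {best:.3f}")
```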
PRINCIPAL COMPONENT ANALYSIS (PCA)
̶ Assuming the data are centered ($\bar{x} = 0$) for simplicity, the principal components can be obtained from the singular value decomposition of the matrix $\mathbf{X}$:
$$\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T$$
̶ $\mathbf{U}$: $N \times p$ orthogonal matrix of left singular vectors
̶ $\mathbf{D}$: $p \times p$ diagonal matrix of singular values
̶ $\mathbf{V}$: $p \times p$ orthogonal matrix of right singular vectors
̶ In our context:
̶ $\mathbf{V}_q$ corresponds to the first $q$ columns of $\mathbf{V}$, for each $q \leq p$
̶ The columns of $\mathbf{U}\mathbf{D}$ are the principal components of $\mathbf{X}$
̶ The variance of the first principal component $\mathbf{z}_1 = \mathbf{X}v_1 = \mathbf{u}_1 d_1$ is $d_1^2/N$, therefore:
‒ $\mathbf{X}v_1$ has the highest variance among all linear combinations of the features
‒ $\mathbf{X}v_2$ has the highest variance among all linear combinations of the features, satisfying $v_2$ being orthogonal to $v_1$
‒ ...
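A numpy sketch of these relations on made-up data (the columns are centered first, matching the $\bar{x} = 0$ assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, q = 200, 5, 2
X = rng.normal(size=(N, p)) @ rng.normal(size=(p, p))  # made-up correlated features
X = X - X.mean(axis=0)                                  # center the columns (x-bar = 0)

# Singular value decomposition X = U D V^T
U, d, Vt = np.linalg.svd(X, full_matrices=False)

Z = U * d            # the columns of U D are the principal components of X
Vq = Vt.T[:, :q]     # first q right singular vectors (V_q)

# The variance of the first principal component equals d_1^2 / N
print(Z[:, 0].var(), d[0] ** 2 / N)

# Reduced N x q representation of the data
X_reduced = X @ Vq
```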