
DEPARTMENT OF APPLIED MATHEMATICS, COMPUTER SCIENCE AND STATISTICS

DATA PREPROCESSING
Big Data Science (Master in Statistical Data Analysis)
THE IMPORTANCE OF DATA PREPROCESSING
̶ Garbage in → garbage out!
̶ Quality decisions must be based on quality
data
‒ e.g., duplicate or missing data may cause
incorrect or even misleading statistics.

̶ Data preparation, cleaning, and transformation comprise most of the work in a data mining application (80-90%)

̶ Get a first “feel” for the data

2
DATA CLEANING
̶ Incomplete: missing attribute values, lack of certain
attributes of interest, or containing only aggregate data
̶ e.g., occupation=“”
̶ Noisy: containing errors or outliers
̶ e.g., Salary=“-10”
̶ Inconsistent: containing discrepancies in codes or names
̶ e.g., Age=“42” Birthday=“03/07/1997”
̶ e.g., Was rating “1,2,3”, now rating “A, B, C”
̶ e.g., discrepancy between duplicate records

3
MISSING DATA
4
MISSING DATA

5
REASONS FOR MISSING DATA
̶ Equipment malfunction
̶ Inconsistent with other recorded data and thus deleted
̶ Data not entered due to misunderstanding
̶ Certain data may not be considered important at the time of entry

6
MISSING DATA MECHANISMS
̶ Missing Completely at Random (MCAR)
̶ The probability of a value being missing depends neither on x (the observed variables) nor on y (the missing value itself)
̶ Missing at Random (MAR)
̶ The probability of a value being missing depends on x, but not on y
̶ Example: Respondents in service occupations less likely to report
income
̶ Missing Not at Random (MNAR)
̶ The probability of a value being missing depends on the value that is
missing
̶ Example: Respondents with high income less likely to report income
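A small simulation sketch (assuming NumPy and pandas; the variables occupation and income are hypothetical) that generates missingness under each of the three mechanisms:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000
# Hypothetical survey data: occupation (always observed) and income.
df = pd.DataFrame({
    "occupation": rng.choice(["service", "office", "manual"], size=n),
    "income": rng.lognormal(mean=10, sigma=0.5, size=n),
})

# MCAR: every income value has the same 20% chance of being missing.
mcar = df["income"].mask(rng.random(n) < 0.2)

# MAR: missingness depends on the observed variable (occupation), not on income.
p_mar = np.where(df["occupation"] == "service", 0.4, 0.1)
mar = df["income"].mask(rng.random(n) < p_mar)

# MNAR: missingness depends on the (unobserved) income value itself.
p_mnar = np.where(df["income"] > df["income"].quantile(0.8), 0.5, 0.1)
mnar = df["income"].mask(rng.random(n) < p_mnar)

print(mcar.isna().mean(), mar.isna().mean(), mnar.isna().mean())
```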

7
DEALING WITH MISSING DATA
̶ Use what you know about
̶ Why data is missing
̶ Distribution of missing data
̶ Decide on the best analysis strategy to yield the least biased estimates
̶ Deletion Methods
‒ Listwise deletion
‒ Pairwise deletion
̶ Single Imputation Methods
‒ Mean/mode substitution
‒ Dummy variable method
‒ Single regression
̶ Model-based Methods
‒ Maximum Likelihood
‒ Multiple imputation

8
LISTWISE DELETION (COMPLETE CASE ANALYSIS)
Only analyze cases with available data on
each variable

̶ Pros:
‒ Simplicity
‒ Comparability across analyses
̶ Cons:
‒ Reduces statistical power (because it lowers n)
‒ Doesn’t use all information
‒ Estimates may be biased if the data are not
MCAR
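A minimal pandas sketch of listwise deletion on a made-up DataFrame with NaN for missing values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 31, np.nan, 42, 58],
    "income": [32_000, np.nan, 41_000, 50_000, np.nan],
    "score":  [7, 6, 9, np.nan, 8],
})

# Listwise deletion: keep only rows that are complete on every analysis variable.
complete = df.dropna()          # n drops from 5 to 1 in this toy example
print(complete)
print(complete.mean())          # estimates use only the complete cases
```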

9
PAIRWISE DELETION (AVAILABLE CASE ANALYSIS)
Analysis with all cases in which the
variables of interest are present

̶ Pros:
‒ Keeps as many cases as possible
for each analysis
‒ Uses all information possible with
each analysis
̶ Cons:
‒ Can’t compare analyses because the
sample differs each time
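A self-contained pandas sketch of pairwise deletion: pandas computes correlations over pairwise-complete observations by default, so the effective n differs from pair to pair.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 31, np.nan, 42, 58],
    "income": [32_000, np.nan, 41_000, 50_000, np.nan],
    "score":  [7, 6, 9, np.nan, 8],
})

# Pairwise deletion: each statistic uses all rows where BOTH variables are observed.
print(df.corr())      # pairwise-complete correlations (pandas default)
print(df.count())     # available n per variable
```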

10
SINGLE IMPUTATION METHODS
̶ Mean/Mode Substitution
̶ Replace missing value with sample mean or mode
̶ Run analyses as if the dataset were complete
̶ Pros:
‒ Can use complete case analysis methods
̶ Cons:
‒ Reduces variability
‒ Weakens covariance and correlation estimates in the
data (because it ignores the relationships between variables)
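A minimal pandas sketch (scikit-learn's SimpleImputer offers the same strategies); the column names and values are made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "salary":     [30_000, np.nan, 45_000, 52_000, np.nan],
    "occupation": ["service", "office", None, "office", "office"],
})

imputed = df.copy()
# Numeric column: substitute the sample mean.
imputed["salary"] = df["salary"].fillna(df["salary"].mean())
# Categorical column: substitute the mode (most frequent category).
imputed["occupation"] = df["occupation"].fillna(df["occupation"].mode()[0])

print(imputed)
print(df["salary"].std(), imputed["salary"].std())  # variability shrinks
```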

11
SINGLE IMPUTATION METHODS
̶ Regression imputation
̶ Replaces missing values with the score predicted by a
regression equation fitted on the observed data
̶ Pros:
‒ Uses information from observed data
̶ Cons:
‒ Overestimates model fit and correlation estimates
‒ Underestimates variance (imputed values fall exactly on the regression line)
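A minimal sketch of single regression imputation with scikit-learn, using a hypothetical predictor age to impute salary:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "age":    [23, 30, 35, 41, 47, 52, 58],
    "salary": [28_000, 34_000, np.nan, 45_000, np.nan, 55_000, 60_000],
})

observed = df["salary"].notna()
reg = LinearRegression().fit(df.loc[observed, ["age"]], df.loc[observed, "salary"])

# Replace each missing salary by its predicted score on the regression line.
df.loc[~observed, "salary"] = reg.predict(df.loc[~observed, ["age"]])
print(df)
```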

12
MULTIPLE IMPUTATION
1. Impute: the data are “filled in” with imputed values using a
specified regression (imputation) model
̶ This step is repeated m times, resulting in a
separate completed dataset each time.
2. Analyze: the analysis of interest is performed within each imputed dataset
3. Pool: the m results are pooled into one estimate (e.g., using Rubin’s rules; see the sketch below)
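A rough sketch of the impute, analyze, pool loop using scikit-learn's IterativeImputer with sample_posterior=True (one possible tool; dedicated implementations such as R's mice or statsmodels' MICE handle pooling of variances more completely). The data and the analysis model are synthetic placeholders:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "y"])
X.loc[rng.random(200) < 0.3, "y"] = np.nan          # introduce missing values

m = 5
estimates = []
for i in range(m):
    # 1. Impute: draw one completed dataset (sample_posterior adds imputation noise).
    imp = IterativeImputer(sample_posterior=True, random_state=i)
    completed = pd.DataFrame(imp.fit_transform(X), columns=X.columns)
    # 2. Analyze: fit the substantive model on this completed dataset.
    fit = LinearRegression().fit(completed[["x1", "x2"]], completed["y"])
    estimates.append(fit.coef_)

# 3. Pool: combine the m estimates (here only the point estimate is pooled).
print(np.mean(estimates, axis=0))
```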

13
MULTIPLE IMPUTATION

14
MULTIPLE IMPUTATION
̶ Pros:
̶ Variability is estimated more accurately because each missing value is
imputed multiple times
̶ Considers variability due to sampling AND variability due
to imputation

̶ Cons:
̶ Cumbersome coding
̶ Room for error when specifying models

15
NOISY DATA
16
DATA NORMALIZATION
̶ min-max normalization (rescaling)

$v' = \frac{v - \min v}{\max v - \min v}\,(\max v' - \min v') + \min v'$

where $[\min v', \max v']$ is the new (target) range.

̶ z-score normalization
$v' = \frac{v - \mathrm{mean}_A}{\mathrm{stdev}_A}$
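Both normalizations in a few lines of NumPy (scikit-learn's MinMaxScaler and StandardScaler provide the same transformations per column); the sample values are made up:

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# min-max normalization to a new range [new_min, new_max]
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# z-score normalization
v_z = (v - v.mean()) / v.std(ddof=1)

print(v_minmax)
print(v_z)
```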

17
NOISY DATA
̶ Noise: random error or variance in a measured variable, resulting
in modification of the original values
̶ Incorrect attribute values may be due to
̶ faulty data collection instruments
̶ data entry problems
̶ data transmission problems

18
EXAMPLE

[Figure: two sine waves, and the same two sine waves with added noise]

19
HOW TO HANDLE NOISY DATA?
̶ Binning method:
̶ first sort data and partition into (equi-depth) bins
̶ then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
̶ Clustering
̶ Detect similar data points
̶ Average out over similar data points to construct “denoised” data points

20
BINNING METHODS FOR DATA SMOOTHING
̶ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
̶ Partition into (equi-depth) bins:
̶ Bin 1: 4, 8, 9, 15
̶ Bin 2: 21, 21, 24, 25
̶ Bin 3: 26, 28, 29, 34

̶ Smoothing by bin means:


̶ Bin 1: 9, 9, 9, 9
̶ Bin 2: 23, 23, 23, 23
̶ Bin 3: 29, 29, 29, 29
̶ Smoothing by bin boundaries:
̶ Bin 1: 4, 4, 4, 15
̶ Bin 2: 21, 21, 25, 25
̶ Bin 3: 26, 26, 26, 34
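A NumPy sketch reproducing the smoothing above (bin means rounded to whole dollars, as on the slide):

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = prices.reshape(3, 4)            # equi-depth bins of 4 sorted values each

# Smooth by bin means: every value in a bin is replaced by the (rounded) bin mean.
by_means = np.repeat(np.rint(bins.mean(axis=1)), 4).reshape(3, 4)

# Smooth by bin boundaries: each value snaps to the nearest bin boundary.
lo, hi = bins[:, :1], bins[:, -1:]
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

print(by_means)
print(by_bounds)
```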

21
OUTLIER REMOVAL
̶ Data points inconsistent with the majority of data
̶ Different outliers
̶ Valid: CEO’s salary
̶ Noisy: One’s age = 200, widely deviated points
̶ Removal methods
̶ Clustering
̶ Curve-fitting
̶ Hypothesis-testing with a given model
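As one illustration of the clustering-based route listed above, a DBSCAN sketch in which points that belong to no dense cluster are flagged as outliers; the eps and min_samples values are illustrative and would need tuning on real data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X = np.vstack([X, [[8.0, 8.0], [-9.0, 7.5]]])   # two injected outliers

# Points that do not belong to any dense cluster are labelled -1 (noise).
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
cleaned = X[labels != -1]

print(f"removed {len(X) - len(cleaned)} suspected outliers")
```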

22
DATA/DIMENSIONALITY REDUCTION
23
HIGH-DIMENSIONAL VERSUS LARGE DATA
̶ High dimensional datasets (# variables is high)
̶ More and more high-dimensional data sets are
emerging (e.g. due to technological advances in
data-capturing instruments)
̶ Relevant features are often not known in advance
̶ In order not to lose potentially interesting
information: add as many features as possible
̶ Large datasets (# instances is high)

24
DATA REDUCTION METHODS
̶ Data is too big to work with
̶ Data reduction
̶ Obtain a reduced representation of the data set that
is much smaller in volume yet produces the
same (or almost the same) analytical results
̶ Methods
̶ Sampling
̶ Binning, prototype selection

25
SAMPLING METHODS
̶ Choose a representative subset of the data
̶ Simple random sampling may have poor
performance in the presence of skew.
̶ Develop adaptive sampling methods
̶ Stratified sampling:
‒ Approximate the percentage of each class (or
subpopulation of interest) in the overall database
‒ Used in conjunction with skewed data
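A pandas sketch of stratified sampling on a skewed, made-up class label, drawing 10% from each stratum so that class proportions are preserved:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Skewed class distribution: 95% negatives, 5% positives (hypothetical data).
df = pd.DataFrame({
    "x":     rng.normal(size=10_000),
    "label": rng.choice(["neg", "pos"], size=10_000, p=[0.95, 0.05]),
})

# Stratified 10% sample: the class proportions are (approximately) preserved.
sample = df.groupby("label", group_keys=False).sample(frac=0.10, random_state=0)

print(df["label"].value_counts(normalize=True))
print(sample["label"].value_counts(normalize=True))
```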

26
EXAMPLE: PROTOTYPE SELECTION

27
EXAMPLE: BINNING BY HISTOGRAMS

• A popular data reduction technique


• Divide data into buckets and store the average for each bucket
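A NumPy sketch of the idea: 100,000 synthetic price values reduced to 20 buckets, keeping only the bucket edges, counts, and means:

```python
import numpy as np

rng = np.random.default_rng(0)
prices = rng.gamma(shape=2.0, scale=15.0, size=100_000)

# Reduce 100,000 values to 20 buckets: store only bucket edges, counts and means.
counts, edges = np.histogram(prices, bins=20)
sums, _ = np.histogram(prices, bins=edges, weights=prices)
bucket_means = sums / np.maximum(counts, 1)

print(edges)
print(counts)
print(bucket_means)
```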

28
EFFECT OF SAMPLE SIZE WHEN REDUCING

29
CURSE OF DIMENSIONALITY
[Bellman, 1961]
“Sample size needed to estimate a function of several
variables to a given degree of accuracy grows
exponentially with the number of variables”

Example:
̶ sample a unit interval with no more than 0.01 distance between points: 100 samples
̶ equivalent sampling of a 10-dimensional unit hypercube: 100^10 = 10^20 samples
“SMALL N LARGE P PROBLEMS”
̶ Refers to a class of modeling problems where the
number of features (p) far exceeds the number
of samples (n)
̶ Also sometimes referred to as “wide data”
̶ Typically we want n >> p to perform accurate
parameter estimation
̶ Dimensionality reduction is key to avoiding overfitting in
small n large p problems

31
SIMILARITY FADES AWAY IN HIGH DIMENSIONS
̶ When dimensionality increases, data becomes
increasingly sparse in the space that it occupies
̶ Definitions of density and distance between points,
which are critical for clustering and outlier detection,
become less meaningful
̶ Things look more similar on average when more
features are used to describe them

32
EXAMPLE: CURSE OF DIMENSIONALITY

• Randomly generate 500 points


• Compute the difference between the maximum and minimum
distance between any pair of points
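A sketch of this experiment (500 points drawn uniformly from the unit hypercube); one common way to report it is the relative contrast (max - min)/min of the pairwise distances, which shrinks as the dimensionality grows:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (2, 10, 50, 200):
    X = rng.random((500, d))            # 500 random points in the unit hypercube
    dists = pdist(X)                    # all pairwise Euclidean distances
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:4d}  (max-min)/min = {contrast:.2f}")
```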

33
THE EMPTY SPACE PHENOMENON
̶ Empty space phenomenon: “High dimensional spaces are
inherently sparse”
̶ Counter-intuitively, corners (tails) are much more important than centers

34
DIMENSIONALITY REDUCTION
̶ Not all measured variables (features) are important for
understanding the underlying phenomena
̶ Many methods are not designed to deal with high-dimensional
datasets (with many irrelevant features)
̶ Eliminate irrelevant features
̶ Eliminate redundant features
̶ Motivation for many domains where the sample per feature ratio is
low

35
ADVANTAGES OF DIMENSIONALITY REDUCTION
̶ Improve model performance
̶ Classification: improve classification performance
(maximize accuracy, AUC)
̶ Clustering: improve cluster detection (AIC, BIC, sum
of squares, various indices)
̶ Regression: improve fit (sum of squares error)

36
ADVANTAGES OF DIMENSIONALITY REDUCTION
̶ Faster and more cost-effective models
̶ Improve generalization performance (avoiding overfitting)
̶ Gain deeper insight into the processes that generated the data
(esp. important in Bioinformatics)

37
TYPES OF FEATURE SELECTION TECHNIQUES

̶ Filter: compute a score for each feature, and discard features with low scores

̶ Wrapper: for different subsets of features, train a model and keep the features whose model gives the best results

̶ Embedded: train a model with some degree of interpretability, which allows us to see which features are relevant
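A filter-style sketch with scikit-learn, scoring each feature by mutual information on a synthetic classification dataset and keeping the five best:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           random_state=0)

# Filter approach: score every feature independently, keep the 5 best.
selector = SelectKBest(score_func=mutual_info_classif, k=5).fit(X, y)
print(np.argsort(selector.scores_)[::-1][:5])   # indices of the top-scoring features
X_reduced = selector.transform(X)
print(X_reduced.shape)                          # (500, 5)
```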

Daniel Peralta <[email protected]> 38


TYPES OF FEATURE EXTRACTION TECHNIQUES

̶ Linear vs Non-linear
̶ Supervised vs Non-supervised

Daniel Peralta <[email protected]> 39


OVERVIEW OF DR TECHNIQUES

Dimensionality reduction

̶ Feature selection
  ‒ Filter: supervised (IG, Correlation, Mutual Information, …); unsupervised (Laplacian score, …)
  ‒ Wrapper: supervised (Forward, Backward, …); unsupervised (Category utility, EM, …)
  ‒ Embedded: supervised (Decision tree, SVM weight, …); unsupervised (Q-α, biclustering, …)

̶ Feature transformation/projection/extraction
  ‒ Linear: supervised (LDA); unsupervised (PCA, LSA, …)
  ‒ Non-linear: supervised (Kernel DA, Local FDA, …); unsupervised (MDS, LLE, Isomap, tSNE, UMAP, …)
40
FORWARD / BACKWARD METHODS
̶ Greedy methods
̶ Suppose the original set of features is 𝑃
̶ Recursive Feature Elimination (RFE)
1. Selected features: 𝑆 = 𝑃
2. While the performance is improved:
   1. For each feature 𝑖 ∈ 𝑆:
      1. Train a model with features 𝑆 \ {𝑖}
      2. Compute the performance 𝑚𝑖
   2. Remove the feature 𝑖 that maximizes 𝑚𝑖: 𝑆 = 𝑆 \ {𝑖}

̶ Sequential Feature Selector

1. Selected features: 𝑆 = ∅
2. While the performance is improved:
   1. For each feature 𝑖 ∈ 𝑃 \ 𝑆:
      1. Train a model with features 𝑆 ∪ {𝑖}
      2. Compute the performance 𝑚𝑖
   2. Select the feature 𝑖 that maximizes 𝑚𝑖: 𝑆 = 𝑆 ∪ {𝑖}

̶ Typically, the performance 𝑚𝑖 is measured with a validation set!
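A hedged sketch with scikit-learn on a synthetic dataset: RFE implements backward elimination (note that it ranks features by the model's weights rather than retraining on every candidate subset, so it is a cheaper variant of the pseudocode above), while SequentialFeatureSelector implements the forward procedure and scores candidates by cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=20, n_informative=4,
                           random_state=0)
model = LogisticRegression(max_iter=1000)

# Backward elimination (RFE): start from all features, drop the weakest.
rfe = RFE(model, n_features_to_select=4).fit(X, y)

# Forward selection: start from the empty set, add the best feature each round,
# scoring candidates by cross-validation (the "validation set" of the slide).
sfs = SequentialFeatureSelector(model, n_features_to_select=4,
                                direction="forward", cv=5).fit(X, y)

print(rfe.support_)
print(sfs.get_support())
```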

Daniel Peralta <[email protected]> 41


SURPRISING FACTS ON VARIABLE COMBINATIONS
A variable useless by itself can be useful together with others

One variable has completely overlapping class conditional densities. Still, using it jointly with the other variable improves class separability compared to using the other variable alone.
42
SURPRISING FACTS ON VARIABLE COMBINATIONS
Two variables useless on their own can be useful when combined

XOR-like or chessboard-like problems. The classes consist of disjoint clumps such that in projection on the axes the class conditional densities overlap perfectly. Therefore, individual variables have no separation power. Still, taken together, the variables provide good class separability.
43
SURPRISING FACTS ON VARIABLE COMBINATIONS
Perfectly correlated variables are truly redundant in the sense that no
additional information is gained by adding them.

The class conditional distributions have a high covariance in the direction of the line of the two class centers. There is no significant gain in separation by using two variables instead of just one.
44
SURPRISING FACTS ON VARIABLE COMBINATIONS
Very high variable correlation (or anti-correlation) does not mean absence of variable
complementarity.

The class conditional distributions have a high covariance in the direction perpendicular to the line of the two class centers. An important separation gain is obtained by using two variables instead of one.
45
SURPRISING FACTS ON VARIABLE COMBINATIONS
Noise reduction and consequently better class separation may be obtained
by adding variables that are presumably redundant.

(Left) A two-class problem with independently and identically distributed (i.i.d.) variables. Each class has a Gaussian distribution with no covariance. (Right) The same example after a 45 degree rotation, showing that a combination of the two variables yields a separation improvement. I.i.d. variables are not truly redundant.
46
PRINCIPAL COMPONENT ANALYSIS
47
PRINCIPAL COMPONENTS
̶ Principal components of a dataset 𝐗 ∈ ℝ^{𝑁×𝑝}
̶ Directions along which the data are highly variable
̶ Lines and subspaces as close as possible to the data cloud
̶ A sequence of best linear approximations to the data, of all ranks 𝑞 ≤ 𝑝

Daniel Peralta <[email protected]> 48


PRINCIPAL COMPONENTS
̶ Linear model:
  $f(\lambda) = \mu + \mathbf{V}_q \lambda$
  ̶ 𝜇: location vector in ℝ^𝑝
  ̶ 𝐕𝑞 ∈ ℝ^{𝑝×𝑞}: matrix with 𝑞 orthogonal unit vectors as columns
  ̶ 𝜆 ∈ ℝ^𝑞: vector of parameters

̶ Goal: minimize the reconstruction error:

  $\min_{\mu,\ \{\lambda_i\},\ \mathbf{V}_q}\ \sum_{i=1}^{N} \lVert x_i - \mu - \mathbf{V}_q \lambda_i \rVert^2$

Daniel Peralta <[email protected]> 49


PRINCIPAL COMPONENTS
̶ Optimizing 𝜇 and 𝜆𝑖 is easy:

  $\hat{\mu} = \bar{x}, \qquad \hat{\lambda}_i = \mathbf{V}_q^{T}(x_i - \bar{x})$

̶ Now we only need the orthogonal matrix:

  $\min_{\mathbf{V}_q}\ \sum_{i=1}^{N} \lVert (x_i - \bar{x}) - \mathbf{V}_q \mathbf{V}_q^{T}(x_i - \bar{x}) \rVert^2$

̶ Assuming $\bar{x} = 0$ for simplicity, this can be solved using the singular value decomposition of the matrix 𝐗:

  $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^{T}$

  ̶ 𝐔: 𝑁 × 𝑝 orthogonal matrix of left singular vectors
  ̶ 𝐃: 𝑝 × 𝑝 diagonal matrix of singular values
  ̶ 𝐕: 𝑝 × 𝑝 orthogonal matrix of right singular vectors

̶ In our context:
  ̶ 𝐕𝑞 corresponds to the first 𝑞 columns of 𝐕, for each 𝑞 ≤ 𝑝
  ̶ The columns of 𝐔𝐃 are the principal components of 𝐗
  ̶ The variance of the first principal component $z_1 = \mathbf{X} v_1 = \mathbf{u}_1 d_1$ is $d_1^2 / N$, therefore:
    ‒ $\mathbf{X} v_1$ has the highest variance among all linear combinations of the features
    ‒ $\mathbf{X} v_2$ has the highest variance among all linear combinations of the features, subject to $v_2$ being orthogonal to $v_1$
    ‒ ...
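A NumPy sketch of these formulas on centered toy data; it checks that X v_j equals u_j d_j and that the component variances are d_j^2 / N:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # correlated toy data
Xc = X - X.mean(axis=0)                                    # center the data (x-bar = 0)

U, d, Vt = np.linalg.svd(Xc, full_matrices=False)          # X = U D V^T
q = 2
Vq = Vt[:q].T               # first q right singular vectors (columns of V)
Z = Xc @ Vq                 # principal components = first q columns of U D
X_hat = Xc @ Vq @ Vq.T      # rank-q reconstruction minimizing the error above

print(np.allclose(Z, U[:, :q] * d[:q]))                    # X v_j = u_j d_j
print(np.allclose(Z.var(axis=0), d[:q] ** 2 / len(Xc)))    # variance d_j^2 / N
print(np.linalg.norm(Xc - X_hat))                          # reconstruction error
```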

Daniel Peralta <[email protected]> 50


PRINCIPAL COMPONENTS

Daniel Peralta <[email protected]> 51


PRINCIPAL COMPONENTS

Daniel Peralta <[email protected]> 52
