Lecture 6a: Data Preprocessing
DATA PREPROCESSING
Big Data Science (Master in Statistical Data Analysis)
THE IMPORTANCE OF DATA PREPROCESSING
̶ Garbage in → garbage out!
̶ Quality decisions must be based on quality data
‒ e.g., duplicate or missing data may cause incorrect or even misleading statistics.
DATA CLEANING
̶ Incomplete: missing attribute values, lack of certain attributes of interest, or containing only aggregate data
̶ e.g., occupation=“”
̶ Noisy: containing errors or outliers
̶ e.g., Salary=“-10”
̶ Inconsistent: containing discrepancies in codes or names
̶ e.g., Age=“42” Birthday=“03/07/1997”
̶ e.g., Was rating “1,2,3”, now rating “A, B, C”
̶ e.g., discrepancy between duplicate records
MISSING DATA
REASONS FOR MISSING DATA
̶ Equipment malfunction
̶ Inconsistent with other recorded data and thus deleted
̶ Data not entered due to misunderstanding
̶ Certain data may not be considered important at the time of entry
MISSING DATA MECHANISMS
̶ Missing Completely at Random (MCAR)
̶ The probability of a missing value depends neither on x (the other variables) nor on y (the variable itself)
̶ Missing at Random (MAR)
̶ The probability of a missing value depends on x, but not on y
̶ Example: Respondents in service occupations are less likely to report income
̶ Missing Not at Random (NMAR)
̶ The probability of a missing value depends on the variable that is missing itself
̶ Example: Respondents with high income are less likely to report income
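A minimal Python sketch (not from the slides) illustrating the three mechanisms on a hypothetical occupation/income dataset; the variable names and missingness probabilities are arbitrary assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
# Hypothetical data: occupation type (x) and income (y)
occupation = rng.integers(0, 2, n)                       # 0 = service, 1 = other
income = 30000 + 20000 * occupation + rng.normal(0, 5000, n)
df = pd.DataFrame({"occupation": occupation, "income": income})

# MCAR: every income value is missing with the same probability
mcar = df.copy()
mcar.loc[rng.random(n) < 0.2, "income"] = np.nan

# MAR: missingness depends on the observed variable (occupation), not on income
mar = df.copy()
p = np.where(mar["occupation"] == 0, 0.4, 0.1)
mar.loc[rng.random(n) < p, "income"] = np.nan

# NMAR: missingness depends on the (unobserved) income value itself
nmar = df.copy()
p = np.where(df["income"] > df["income"].quantile(0.8), 0.5, 0.05)
nmar.loc[rng.random(n) < p, "income"] = np.nan
```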
DEALING WITH MISSING DATA
̶ Use what you know about
̶ Why data is missing
̶ Distribution of missing data
̶ Decide on the best analysis strategy to yield the least biased estimates
̶ Deletion Methods
‒ Listwise deletion
‒ Pairwise deletion
̶ Single Imputation Methods
‒ Mean/mode substitution
‒ Dummy variable method
‒ Single regression
̶ Model-based Methods
‒ Maximum Likelihood
‒ Multiple imputation
LISTWISE DELETION (COMPLETE CASE ANALYSIS)
Only analyze cases with available data on each variable
̶ Pros:
‒ Simplicity
‒ Comparability across analyses
̶ Cons:
‒ Reduces statistical power (because it lowers n)
‒ Doesn’t use all available information
‒ Estimates may be biased if data are not MCAR
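In pandas, listwise deletion is a one-liner; a minimal sketch, assuming a DataFrame `df` with some missing entries:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 35],
                   "income": [30000, 45000, np.nan, 52000]})

# Keep only the complete cases (rows without any missing value)
complete_cases = df.dropna()
print(complete_cases)
```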
PAIRWISE DELETION (AVAILABLE CASE ANALYSIS)
Analysis with all cases in which the variables of interest are present
̶ Pros:
‒ Keeps as many cases as possible for each analysis
‒ Uses all information available for each analysis
̶ Cons:
‒ Analyses can’t be compared because the sample differs each time
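Pairwise deletion is, for example, what pandas' `DataFrame.corr()` does by default: each pairwise correlation is computed on the rows where both variables are observed. A small sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0, 5.0],
                   "y": [2.0, np.nan, 3.0, 5.0, 6.0],
                   "z": [1.0, 1.5, 2.0, np.nan, 3.0]})

# Each entry of the correlation matrix is computed on the rows
# where that particular pair of columns is non-missing
print(df.corr())
```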
SINGLE IMPUTATION METHODS
̶ Mean/Mode Substitution
̶ Replace missing value with sample mean or mode
̶ Run analyses as if all cases were complete
̶ Pros:
‒ Can use complete case analysis methods
̶ Cons:
‒ Reduces variability
‒ Weakens covariance and correlation estimates in the data (because it ignores the relationships between variables)
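A minimal sketch of mean substitution using scikit-learn's `SimpleImputer` (pandas `fillna` would work just as well); the data are made up:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[25, 30000],
              [np.nan, 45000],
              [40, np.nan],
              [35, 52000]])

# Replace each missing entry by its column mean
# (strategy="most_frequent" would give mode substitution)
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```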
SINGLE IMPUTATION METHODS
̶ Regression imputation
̶ Replaces missing values with the predicted score from a regression equation
̶ Pros:
‒ Uses information from observed data
̶ Cons:
‒ Overestimates model fit and correlation estimates
‒ Underestimates variance
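A sketch of single regression imputation on made-up two-column data: a linear model is fit on the complete cases and its predictions fill the missing values. The column names are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"experience": [1, 3, 5, 7, 9, 11],
                   "salary": [30, 38, np.nan, 55, np.nan, 70]})

observed = df["salary"].notna()
reg = LinearRegression().fit(df.loc[observed, ["experience"]],
                             df.loc[observed, "salary"])

# Fill the missing salaries with their regression predictions
df.loc[~observed, "salary"] = reg.predict(df.loc[~observed, ["experience"]])
print(df)
```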
MULTIPLE IMPUTATION
1. Impute: the data is “filled in” with imputed values using a specified regression model
̶ This step is repeated m times, resulting in a separate dataset each time.
2. Analyze: analyses are performed within each dataset
3. Pool: the results are pooled into one estimate
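A sketch of the impute–analyze–pool loop using scikit-learn's `IterativeImputer` with posterior sampling; the number of imputations m, the simulated data, and the pooled statistic (a simple mean) are illustrative assumptions:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[rng.random(X.shape) < 0.2] = np.nan        # introduce ~20% missing values

m = 5
estimates = []
for i in range(m):
    # 1. Impute: a different stochastic imputation for each of the m datasets
    imputer = IterativeImputer(sample_posterior=True, random_state=i)
    X_imp = imputer.fit_transform(X)
    # 2. Analyze: here the "analysis" is simply the mean of the first column
    estimates.append(X_imp[:, 0].mean())

# 3. Pool: combine the m estimates (Rubin's rules would also pool the variances)
print(np.mean(estimates))
```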
MULTIPLE IMPUTATION
̶ Pros:
̶ Variability is estimated more accurately with multiple imputations for each missing value
̶ Considers variability due to sampling AND variability due to imputation
̶ Cons:
̶ Cumbersome coding
̶ Room for error when specifying models
NOISY DATA
DATA NORMALIZATION
̶ min-max normalization (rescaling to a new range $[\min(v'), \max(v')]$)
$$v' = \frac{v - \min(v)}{\max(v) - \min(v)} \left(\max(v') - \min(v')\right) + \min(v')$$
̶ z-score normalization
$$v' = \frac{v - \text{mean}_A}{\text{stand\_dev}_A}$$
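A minimal numpy sketch of both normalizations (a new range of [0, 1] is assumed for min-max):

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# min-max normalization to the new range [new_min, new_max]
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# z-score normalization
v_zscore = (v - v.mean()) / v.std()

print(v_minmax)
print(v_zscore)
```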
NOISY DATA
̶ Noise: random error or variance in a measured variable, resulting in modification of the original values
̶ Incorrect attribute values may be due to
̶ faulty data collection instruments
̶ data entry problems
̶ data transmission problems
HOW TO HANDLE NOISY DATA?
̶ Binning method:
̶ first sort data and partition into (equi-depth) bins
̶ then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.
̶ Clustering
̶ Detect similar data points
̶ Average out over similar data points to construct “denoised” data points
BINNING METHODS FOR DATA SMOOTHING
̶ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
̶ Partition into (equi-depth) bins:
̶ Bin 1: 4, 8, 9, 15
̶ Bin 2: 21, 21, 24, 25
̶ Bin 3: 26, 28, 29, 34
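A sketch that reproduces the equi-depth bins above and applies two of the smoothing options (the data are the sorted prices from the slide):

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Partition the sorted data into equi-depth bins of 4 values each
bins = prices.reshape(3, 4)

# Smoothing by bin means: every value is replaced by its bin's mean
smoothed_means = np.repeat(bins.mean(axis=1), 4).reshape(3, 4)
print(smoothed_means)

# Smoothing by bin boundaries: every value is replaced by the closest bin boundary
lo, hi = bins[:, [0]], bins[:, [-1]]
smoothed_bounds = np.where(bins - lo < hi - bins, lo, hi)
print(smoothed_bounds)
```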
OUTLIER REMOVAL
̶ Data points inconsistent with the majority of data
̶ Different outliers
̶ Valid: a CEO’s salary
̶ Noisy: One’s age = 200, widely deviated points
̶ Removal methods
̶ Clustering
̶ Curve-fitting
̶ Hypothesis-testing with a given model
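A sketch of clustering-based outlier removal using DBSCAN (points labelled -1 belong to no cluster and can be treated as outliers); the simulated data, `eps`, and `min_samples` are arbitrary assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),       # dense cluster
               np.array([[8.0, 8.0], [10.0, -9.0]])]) # two isolated points

labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

outliers = X[labels == -1]
cleaned = X[labels != -1]
print(f"Removed {len(outliers)} outliers, kept {len(cleaned)} points")
```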
DATA/DIMENSIONALITY REDUCTION
HIGH-DIMENSIONAL VERSUS LARGE DATA
̶ High dimensional datasets (# variables is high)
̶ More and more high-dimensional data sets are emerging (e.g. due to technological advances in data-capturing instruments)
̶ Relevant features are often not known in advance
̶ In order not to lose potentially interesting information: add as many features as possible
̶ Large datasets (# instances is high)
DATA REDUCTION METHODS
̶ Data is too big to work with
̶ Data reduction
̶ Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
̶ Methods
̶ Sampling
̶ Binning, prototype selection
SAMPLING METHODS
̶ Choose a representative subset of the data
̶ Simple random sampling may have poor performance in the presence of skew.
̶ Develop adaptive sampling methods
̶ Stratified sampling:
‒ Approximate the percentage of each class (or subpopulation of interest) in the overall database
‒ Used in conjunction with skewed data
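A sketch of stratified sampling in pandas: sampling the same fraction from each class keeps the class proportions of the (skewed) full dataset. The column names and sampling fraction are assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical skewed dataset: 95% class "a", 5% class "b"
df = pd.DataFrame({"x": rng.normal(size=10_000),
                   "label": rng.choice(["a", "b"], size=10_000, p=[0.95, 0.05])})

# Draw 10% from each stratum so that the class proportions are preserved
sample = df.groupby("label").sample(frac=0.10, random_state=0)

print(df["label"].value_counts(normalize=True))
print(sample["label"].value_counts(normalize=True))
```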
EXAMPLE: PROTOTYPE SELECTION
EXAMPLE: BINNING BY HISTOGRAMS
EFFECT OF SAMPLE SIZE WHEN REDUCING
CURSE OF DIMENSIONALITY
[Bellman, 1961]
“Sample size needed to estimate a function of several variables to a given degree of accuracy grows exponentially with the number of variables”
Example:
̶ sampling a unit interval with no more than 0.01 distance between points requires 100 samples
̶ equivalent sampling of a 10-dimensional unit hypercube requires $100^{10} = 10^{20}$ samples
“SMALL N LARGE P PROBLEMS”
̶ Refers to a class of modeling problems where the number of features (p) largely outnumbers the number of samples (n)
̶ Also sometimes referred to as “wide data”
̶ Typically we want n >> p to perform accurate parameter estimation
̶ Dimensionality reduction is key to avoiding overfitting in small n large p problems
SIMILARITY FADES AWAY IN HIGH DIMENSIONS
̶ When dimensionality increases, data becomes increasingly sparse in the space that it occupies
̶ Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful
̶ Things look more similar on average when more features are used to describe them
EXAMPLE: CURSE OF DIMENSIONALITY
THE EMPTY SPACE PHENOMENON
̶ Empty space phenomenon: “High-dimensional spaces are inherently sparse”
̶ Counter-intuitively, corners (tails) are much more important than centers (see the sketch below)
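A small sketch (not from the slides) that quantifies this: the fraction of a d-dimensional unit hypercube's volume lying within 0.05 of its boundary is $1 - 0.9^d$, which rapidly approaches 1 as d grows:

```python
# Fraction of the unit hypercube [0, 1]^d lying within 0.05 of the boundary:
# the "interior" cube has side 0.9, so the boundary shell has volume 1 - 0.9**d
for d in (1, 2, 10, 50, 100):
    print(d, 1 - 0.9 ** d)
# d=1: 0.10, d=10: ~0.65, d=100: ~0.99997 -> almost all volume sits near the boundary
```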
DIMENSIONALITY REDUCTION
̶ Not all measured variables (features) are important for understanding the underlying phenomena
̶ Many methods are not designed to deal with high-dimensional datasets (with many irrelevant features)
̶ Eliminate irrelevant features
̶ Eliminate redundant features
̶ Motivation for many domains where the sample-per-feature ratio is low
ADVANTAGES OF DIMENSIONALITY REDUCTION
̶ Improve model performance
̶ Classification: improve classification performance (maximize accuracy, AUC)
̶ Clustering: improve cluster detection (AIC, BIC, sum of squares, various indices)
̶ Regression: improve fit (sum of squares error)
ADVANTAGES OF DIMENSIONALITY REDUCTION
̶ Faster and more cost-effective models
̶ Improve generalization performance (avoiding overfitting)
̶ Gain deeper insight into the processes that generated the data (esp. important in Bioinformatics)
TYPES OF FEATURE SELECTION TECHNIQUES
̶ Linear vs Non-linear
̶ Supervised vs Unsupervised
FORWARD / BACKWARD METHODS
̶ Greedy methods
̶ Suppose the original set of features is 𝑃
̶ Recursive Feature Elimination (RFE), a backward method (a sketch follows below):
1. Start with all features selected: 𝑆 = 𝑃
2. While the performance improves:
a. For each feature 𝑖 ∈ 𝑆:
‒ train a model on the features 𝑆 \ {𝑖}
‒ compute its performance 𝑚𝑖
b. Remove the feature 𝑖 that maximizes 𝑚𝑖 : 𝑆 = 𝑆 \ {𝑖}
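A sketch of this greedy backward elimination using cross-validated accuracy as the performance measure 𝑚𝑖; the dataset, model, and scoring are illustrative assumptions (scikit-learn's `RFE` class implements a related variant that ranks features by model coefficients rather than by retraining performance):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
X = X[:, :10]                               # keep 10 features so the sketch runs quickly
model = LogisticRegression(max_iter=5000)

def cv_perf(features):
    """Cross-validated accuracy of the model trained on the given feature subset."""
    return cross_val_score(model, X[:, features], y, cv=5).mean()

selected = list(range(X.shape[1]))          # S = P: start from all features
best = cv_perf(selected)
improved = True
while improved and len(selected) > 1:
    # Performance m_i of the model trained on S \ {i}, for every i in S
    scores = {i: cv_perf([j for j in selected if j != i]) for i in selected}
    i_best, m_best = max(scores.items(), key=lambda kv: kv[1])
    improved = m_best >= best               # keep removing while performance does not drop
    if improved:
        selected.remove(i_best)
        best = m_best

print(f"Kept features {selected} with CV accuracy {best:.3f}")
```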
PRINCIPAL COMPONENT ANALYSIS (PCA)
̶ Assuming the data are centered ($\bar{x} = 0$) for simplicity, the principal components can be obtained from the singular value decomposition of the matrix $\mathbf{X}$:
$$\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T$$
̶ $\mathbf{U}$: $N \times p$ orthogonal matrix of left singular vectors
̶ $\mathbf{D}$: $p \times p$ diagonal matrix of singular values
̶ $\mathbf{V}$: $p \times p$ orthogonal matrix of right singular vectors
̶ In our context:
̶ $\mathbf{V}_q$ corresponds to the first $q$ columns of $\mathbf{V}$, for each $q \leq p$
̶ The columns of $\mathbf{U}\mathbf{D}$ are the principal components of $\mathbf{X}$
̶ The variance of the first principal component $\mathbf{z}_1 = \mathbf{X}v_1 = \mathbf{u}_1 d_1$ is $d_1^2/N$, therefore:
‒ $\mathbf{X}v_1$ has the highest variance among all linear combinations of the features
‒ $\mathbf{X}v_2$ has the highest variance among all linear combinations of the features, satisfying $v_2$ being orthogonal to $v_1$
‒ ...
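A numpy sketch of these relations on made-up data (the columns are centered first, matching the $\bar{x} = 0$ assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, q = 200, 5, 2
X = rng.normal(size=(N, p)) @ rng.normal(size=(p, p))  # made-up correlated features
X = X - X.mean(axis=0)                                  # center the columns (x-bar = 0)

# Singular value decomposition X = U D V^T
U, d, Vt = np.linalg.svd(X, full_matrices=False)

Z = U * d            # the columns of U D are the principal components of X
Vq = Vt.T[:, :q]     # first q right singular vectors (V_q)

# The variance of the first principal component equals d_1^2 / N
print(Z[:, 0].var(), d[0] ** 2 / N)

# Reduced N x q representation of the data
X_reduced = X @ Vq
```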