
DIMENSIONALITY REDUCTION
Dimensionality of input
2

 Number of Observables (e.g. age and income)


 If the number of observables is increased
 More time to compute
 More memory to store inputs and intermediate results
 More complicated explanations (knowledge from
learning)
 Regression from 100 vs. 2 parameters
 No simple visualization
 2D vs. 10D graph
 Need much more data (curse of dimensionality)
 1M of 1-d inputs is not equal to 1 input of dimension 1M
Dimensionality reduction
3

 Some features (dimensions) bear little or no useful information (e.g. hair color when selecting a car)
 Can drop some features
 Have to estimate which features can be dropped from
data

 Several features can be combined without loss, or even with a gain, of information (e.g. total income of all family members for a loan application)
 Some features can be combined together
 Have to estimate which features to combine from data
Feature Selection vs Extraction
4
 Feature selection: Choosing k<d important
features, ignoring the remaining d – k
 Subset selection algorithms
 Feature extraction: Project the original xi , i
=1,...,d dimensions to new k<d dimensions,
zj , j =1,...,k
 Principal Components Analysis (PCA)
 Linear Discriminant Analysis (LDA)
 Factor Analysis (FA)
 Independent Component Analysis(ICA)
Usage
5

 Have data of dimension d


 Reduce dimensionality to k<d
 Discard unimportant features
 Combine several features in one
 Use resulting k-dimensional data set for
 Learning for a classification problem (e.g. parameters of the probabilities P(x|C))
 Learning for a regression problem (e.g. parameters of the model y = g(x|θ))
Subset selection
6

 Have initial set of features of size d


 There are 2^d possible subsets
 Need a criterion to decide which subset is the best
 A way to search over the possible
subsets
 Can’t go over all 2^d possibilities
 Need some heuristics
“Goodness” of feature set
7

 Supervised
 Train using selected subset
 Estimate error on validation data set

 Unsupervised
 Look at the inputs only (e.g. age, income and savings)
 Select the subset of 2 that bears most of the information about the person
Mutual Information
8

 Have 3 random variables (features) X, Y, Z and have to select the 2 which give the most information

 If X and Y are “correlated” then much of the information about Y is already in X

 It makes sense to select features which are “uncorrelated”

 Mutual Information (based on the Kullback–Leibler divergence) is a more general measure of such dependence than correlation

 Can be extended to n variables (the information variables x1,..., xn have about variable xn+1)
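A minimal NumPy sketch of this idea (not from the lecture; the histogram-based estimator and all names are illustrative) estimates I(X;Y) from samples and shows that a near-copy of X carries much more mutual information than an independent variable:

import numpy as np

def mutual_information(x, y, bins=10):
    # Plug-in estimate of I(X;Y) in nats from a 2-D histogram of the samples
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                 # joint probability table
    px = pxy.sum(axis=1, keepdims=True)       # marginal of X
    py = pxy.sum(axis=0, keepdims=True)       # marginal of Y
    nz = pxy > 0                              # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
X = rng.normal(size=1000)
Y = X + 0.1 * rng.normal(size=1000)           # nearly a copy of X ("correlated")
Z = rng.normal(size=1000)                     # independent of X ("uncorrelated")
print(mutual_information(X, Y), mutual_information(X, Z))   # first value is much larger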
Subset-selection
9

 Forward search
 Start from empty set of features
 Try each of the remaining features
 Estimate the classification/regression error when adding each specific feature
 Select the feature that gives the maximum improvement in validation error
 Stop when there is no significant improvement (a sketch follows below)

 Backward search
 Start with the original set of size d
 Drop, one at a time, the feature with the smallest impact on the error
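A sketch of the greedy forward search above, assuming a scikit-learn-style estimator factory make_model() with fit/predict (all names here are illustrative, not the lecture's code):

import numpy as np

def forward_select(make_model, X_tr, y_tr, X_val, y_val, tol=1e-3):
    # Greedy forward search: repeatedly add the feature that most reduces
    # the validation error; stop when the improvement is not significant.
    d = X_tr.shape[1]
    selected, best_err = [], np.inf
    while len(selected) < d:
        errs = {}
        for j in range(d):
            if j in selected:
                continue
            cols = selected + [j]
            model = make_model().fit(X_tr[:, cols], y_tr)
            errs[j] = np.mean(model.predict(X_val[:, cols]) != y_val)
        j_best = min(errs, key=errs.get)
        if best_err - errs[j_best] < tol:      # no significant improvement: stop
            break
        selected.append(j_best)
        best_err = errs[j_best]
    return selected, best_err

Backward search is the mirror image: start from all d features and repeatedly drop the one whose removal hurts the validation error least.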
Floating Search
10

 Forward and backward search are “greedy” algorithms
 They select the best option at each single step
 They do not always achieve the optimum value

 Floating search
 Two types of steps: Add k, remove l
 More computations
Feature Extraction
11

 Face recognition problem


 Training data input: pairs of Image +
Label(name)
 Classifier input: Image
 Classifier output: Label(Name)
 Image: matrix of 256×256 = 65,536 values in the range 0..255
 Each pixel bears little information, so we can’t simply select the 100 best pixels
 The average of pixels around specific positions may, for example, give an indication of eye color.
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Projection
12

 Find a projection matrix W from d-dimensional to k-dimensional vectors that keeps the error low

Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
PCA: Motivation
13

 Assume that the d observables can be summarized by k<d new variables, each a linear combination of the observables
 zi = wi1x1 + … + widxd , i = 1,...,k
 We would like to work with this new basis as it has a lower dimension and keeps all (or almost all) of the required information
 What we expect from such a basis:
 Uncorrelated components, otherwise the basis can be reduced further
 Large variance along each direction, otherwise the direction bears little information
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
PCA: Motivation
14

Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
PCA: Motivation
15

 Choose directions such that the total variance of the data is maximal
 Maximize the total variance

 Choose directions that are orthogonal
 Minimize correlation

 Choose the k<d orthogonal directions which maximize the total variance
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
PCA
16

 Choosing the directions:
 Maximize the variance subject to a unit-length constraint, using Lagrange multipliers
 Taking derivatives leads to an eigenvector equation
 Since we want to maximize the variance, we should choose the eigenvector with the largest eigenvalue (a sketch of the derivation follows below)
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
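A plain-text sketch of the standard derivation for the first direction w1, writing Σ for the covariance matrix of x:

maximize   Var(z1) = w1^T Σ w1   subject to   w1^T w1 = 1
L(w1, α) = w1^T Σ w1 − α (w1^T w1 − 1)
∂L/∂w1 = 2 Σ w1 − 2 α w1 = 0   ⇒   Σ w1 = α w1

So w1 must be an eigenvector of Σ with eigenvalue α, and since the variance along w1 equals w1^T Σ w1 = α, the maximum is attained by the eigenvector with the largest eigenvalue.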
PCA
17

 d-dimensional feature space
 d × d symmetric covariance matrix estimated from the samples
 Select the k largest eigenvalues of the covariance matrix and the associated k eigenvectors
 The first eigenvector is the direction with the largest variance

Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
What PCA does
18

z = W^T(x – m)
where the columns of W are the eigenvectors of Σ, and m is the sample mean.
This centers the data at the origin and rotates the axes.

Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
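A minimal NumPy sketch of this procedure (illustrative, not the lecture's code): estimate the covariance, take the top-k eigenvectors, and project with z = W^T(x − m).

import numpy as np

def pca(X, k):
    # Rows of X are samples; returns projection matrix W (d x k), mean m, projections z
    m = X.mean(axis=0)                        # sample mean
    S = np.cov(X, rowvar=False)               # d x d covariance estimate
    eigvals, eigvecs = np.linalg.eigh(S)      # eigh: symmetric matrix, ascending order
    order = np.argsort(eigvals)[::-1]         # sort directions by decreasing variance
    W = eigvecs[:, order[:k]]                 # columns = top-k eigenvectors
    z = (X - m) @ W                           # z = W^T (x - m) for every sample
    return W, m, z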
How to choose k ?
19

 Proportion of Variance (PoV) explained

1   2     k
1   2     k     d

when λi are sorted in descending order


 Typically, stop at PoV>0.9
 A scree graph plots PoV (or the eigenvalues) vs. k; stop at the “elbow”
Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
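A small sketch of the PoV rule (assuming the eigenvalues come from the covariance matrix as in the PCA sketch above; the 0.9 threshold follows the slide):

import numpy as np

def choose_k(eigvals, threshold=0.9):
    # Smallest k whose proportion of variance explained exceeds the threshold
    lam = np.sort(eigvals)[::-1]              # sort eigenvalues in descending order
    pov = np.cumsum(lam) / np.sum(lam)        # PoV for k = 1, ..., d
    return int(np.argmax(pov >= threshold)) + 1

# e.g. k = choose_k(np.linalg.eigvalsh(np.cov(X, rowvar=False)))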
20

Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press
(V1.1)
PCA
21

 PCA is unsupervised (does not take into


account class information)

 Can take classes into account: Karhunen–Loève expansion
 Estimate the covariance per class
 Take the average weighted by the priors

 Common Principal Components
 Assume all classes have the same eigenvectors (directions) but different variances

Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
PCA
22

 PCA does not try to explain noise
 Large noise can become a new dimension / the largest PC

 We are interested in the resulting uncorrelated variables, which explain a large portion of the total sample variance

 Sometimes we are instead interested in the shared variance (common factors) that affects the data
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Factor Analysis
23

 Assume a set of unobservable (“latent”) variables
 Goal: characterize the dependency among the observables using the latent variables
 Suppose a group of variables has large correlation among themselves and small correlation with all other variables
 Such a group may be explained by a single factor

Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Factor Analysis
24

 Assume k latent (unobservable) input factors generating the d observables

 Assume all variation in the observable variables is due to the latent factors or to noise (with unknown variance)

 Find the transformation from the unobservable factors to the observables which explains the data
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Factor Analysis
25
 Find a small number of factors z, which
when combined generate x :
xi – µi = vi1z1 + vi2z2 + ... + vikzk + εi
where zj, j =1,...,k are the latent factors with
E[zj] = 0, Var(zj) = 1, Cov(zi, zj) = 0 for i ≠ j,
εi are the noise sources with
E[εi] = 0, Var(εi) = ψi, Cov(εi, εj) = 0 for i ≠ j, Cov(εi, zj) = 0,
and vij are the factor loadings
Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Factor Analysis
26
 Find V such that S ≈ V V^T + Ψ, where S is the estimate of the covariance matrix, V holds the factor loadings (the explanation by the latent variables), and Ψ is the diagonal matrix of noise variances ψi

 V is a d × k matrix (k < d)

 Solved using the eigenvalues and eigenvectors of S
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
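One way to fit this model in practice is scikit-learn's FactorAnalysis; a hedged sketch (the data X and the choice k = 2 are placeholders):

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                 # placeholder data: 500 samples, d = 6 observables

fa = FactorAnalysis(n_components=2)           # k = 2 latent factors
z = fa.fit_transform(X)                       # factor scores, shape (500, 2)
V = fa.components_.T                          # d x k loading matrix V
psi = fa.noise_variance_                      # per-observable noise variances (diagonal of Psi)
S_model = V @ V.T + np.diag(psi)              # model covariance V V^T + Psi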
Factor Analysis
27

 In FA, factors zj are stretched, rotated


and translated to generate x

Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
FA Usage
28

 Speech is a function of the positions of a small number of articulators (lungs, lips, tongue)

 Factor analysis: go from the signal space (4000 points for 500 ms) to the articulation space (20 points)

 Classify speech (assign a text label) using the 20 points

 Speech compression: send only the 20 values
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Linear Discriminant Analysis
29
 Find a low-dimensional space such that
when x is projected, classes are well-
separated

Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Means and Scatter after projection
30

Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Good Projection
31

 Means should be as far away as possible
 Scatter should be as small as possible
 Fisher Linear Discriminant criterion:

J(w) = (m1 – m2)² / (s1² + s2²)

Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
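A two-class NumPy sketch of this criterion (names are illustrative): the maximizing direction is proportional to S_W^-1 (m1 − m2), where S_W is the within-class scatter.

import numpy as np

def fisher_direction(X1, X2):
    # X1, X2: sample matrices (rows are samples) for class 1 and class 2
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)              # within-class scatter of class 1
    S2 = (X2 - m2).T @ (X2 - m2)              # within-class scatter of class 2
    w = np.linalg.solve(S1 + S2, m1 - m2)     # w proportional to S_W^-1 (m1 - m2)
    return w / np.linalg.norm(w)

# Projecting z = X @ w gives 1-d values on which the two classes separate well.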
Expectation-Maximization Algorithm
32
 The Expectation-Maximization (EM) algorithm can also be used for latent variables (variables that are not directly observable and are instead inferred from the values of other observed variables) in order to predict their values, provided that the general form of the probability distribution governing those latent variables is known to us. This algorithm is the basis of many unsupervised clustering algorithms in the field of machine learning.
Algorithm:
33

 Given a set of incomplete data, consider a


set of starting parameters.
 Expectation step (E-step): using the observed available data of the dataset, estimate (guess) the values of the missing data.
 Maximization step (M-step): the complete data generated after the E-step is used to update the parameters.
 Repeat steps 2 and 3 until convergence (a sketch for a Gaussian mixture follows below).
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
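A compact sketch of these steps for one concrete case, a 1-D mixture of k Gaussians (illustrative code; a fixed number of iterations stands in for a convergence test):

import numpy as np

def em_gmm_1d(x, k, iters=100):
    rng = np.random.default_rng(0)
    pi = np.full(k, 1.0 / k)                       # starting mixing proportions
    mu = rng.choice(x, size=k, replace=False)      # starting means
    var = np.full(k, np.var(x))                    # starting variances
    for _ in range(iters):
        # E-step: soft membership of each point in each component (the "missing" data)
        p = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        r = pi * p
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update the parameters from the completed (soft-labelled) data
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var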
34

Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
35

 The essence of the Expectation-Maximization algorithm is to use the available observed data of the dataset to estimate the missing data, and then to use that completed data to update the values of the parameters. Let us understand the EM algorithm in detail.

Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
36

 Initially, a set of initial values for the parameters is considered. A set of incomplete observed data is given to the system, with the assumption that the observed data comes from a specific model.
 The next step is known as “Expectation” –
step or E-step. In this step, we use the
observed data in order to estimate or guess
the values of the missing or incomplete
data. It is basically used to update the
variables.
37

 The next step is known as “Maximization”-


step or M-step. In this step, we use the
complete data generated in the preceding
“Expectation” – step in order to update the
values of the parameters. It is basically used
to update the hypothesis.
 Now, in the fourth step, it is checked whether the values are converging or not; if yes, we stop, otherwise we repeat step 2 and step 3, i.e. the “Expectation” and “Maximization” steps, until convergence occurs.
38
Usage of EM algorithm
39

 It can be used to fill the missing data in a


sample.
 It can be used as the basis of unsupervised
learning of clusters.
 It can be used for the purpose of estimating the
parameters of Hidden Markov Model (HMM).
 It can be used for discovering the values of
latent variables.
40

 Advantages of EM algorithm –
 It is always guaranteed that likelihood
will increase with each iteration.
 The E-step and M-step are often pretty
easy for many problems in terms of
implementation.
 Solutions to the M-steps often exist in
the closed form.
41

 Disadvantages of EM algorithm –
 It has slow convergence.
 It converges only to a local optimum.
 It requires both the forward and backward probabilities (numerical optimization requires only the forward probability).
K-Nearest Neighbor

 Features
 All instances correspond to points in an n-
dimensional Euclidean space
 Classification is delayed till a new instance
arrives
 Classification done by comparing feature
vectors of the different points
 Target function may be discrete or real-
valued
1-Nearest Neighbor
3-Nearest Neighbor
K-Nearest Neighbor

 An arbitrary instance is represented by


(a1(x), a2(x), a3(x),.., an(x))
 ai(x) denotes features
 Euclidean distance between two instances:
d(xi, xj) = sqrt( Σr=1..n (ar(xi) – ar(xj))² )
 Continuous valued target function
 mean value of the k nearest training examples
KNN Algorithm
46

 Step-1: Select the number K of the neighbors


 Step-2: Calculate the Euclidean distance of K
number of neighbors
 Step-3: Take the K nearest neighbors as per
the calculated Euclidean distance.
 Step-4: Among these k neighbors, count the number of data points in each category.
 Step-5: Assign the new data point to the category for which the number of neighbors is maximum (see the sketch below).
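A short sketch of steps 1–5 (illustrative, not the lecture's code):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))   # step 2: Euclidean distances
    nearest = np.argsort(dists)[:k]                          # step 3: k nearest neighbors
    votes = Counter(y_train[i] for i in nearest)             # step 4: count per category
    return votes.most_common(1)[0][0]                        # step 5: majority category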
47
48
KNN
49

 Firstly, we will choose the number of neighbors; here we choose k=5.
 Next, we will calculate the Euclidean
distance between the data points.
 The Euclidean distance is the distance
between two points.
 It can be calculated as:
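For two points p = (p1, …, pn) and q = (q1, …, qn):

d(p, q) = sqrt( (p1 − q1)² + (p2 − q2)² + … + (pn − qn)² )

and in two dimensions simply d = sqrt( (x2 − x1)² + (y2 − y1)² ).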
50
How to select the value of K in the K-NN Algorithm?
51

 There is no particular way to determine the best value for “K”, so we need to try some values to find the best among them. The most commonly preferred value for K is 5.
 A very low value for K, such as K=1 or K=2, can be noisy and make the model sensitive to outliers.
 Larger values for K are more robust to noise, but the model may then have difficulty capturing fine class boundaries.
Advantages of KNN Algorithm
52
 It is simple to implement.
 It is robust to the noisy training data
 It can be more effective if the training
data is large.
Disadvantages of KNN Algorithm
53
 The value of K always needs to be determined, which may be complex at times.
 The computation cost is high because the distance to all the training samples must be calculated.
New customer named 'Monica' has height
161cm and weight 61kg.
54
Height (in cms) Weight (in kgs) T Shirt Size
158 58 M
158 59 M
158 63 M
160 59 M
160 60 M
163 60 M
163 61 M
160 64 L
163 64 L
165 61 L
165 62 L
165 65 L
168 62 L
168 63 L
168 66 L
170 63 L
170 64 L
170 68 L
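A self-contained sketch applying K-NN with k = 5 to this table (code is illustrative; the data are exactly the rows above):

import numpy as np
from collections import Counter

X_train = np.array([[158,58],[158,59],[158,63],[160,59],[160,60],[163,60],
                    [163,61],[160,64],[163,64],[165,61],[165,62],[165,65],
                    [168,62],[168,63],[168,66],[170,63],[170,64],[170,68]])
y_train = np.array(["M","M","M","M","M","M","M","L","L","L","L","L",
                    "L","L","L","L","L","L"])
monica = np.array([161, 61])

dists = np.sqrt(((X_train - monica) ** 2).sum(axis=1))   # Euclidean distances
nearest = np.argsort(dists)[:5]                           # 5 nearest rows
print(Counter(y_train[nearest]).most_common(1)[0][0])     # majority vote

The five nearest neighbours are (160,60) M, (163,61) M, (160,59) M, (163,60) M and (160,64) L, so the vote is 4 to 1 and Monica is predicted size M.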
K-NN Problem
55

Acid durability Strength Classification


7 7 Bad
7 4 Bad
3 4 Good
1 4 Good

New sample: Acid durability = 3, Strength = 7. Classify it (a worked sketch follows below).
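A worked sketch for this problem, assuming k = 3 (the slide does not fix k):

import numpy as np
from collections import Counter

X = np.array([[7, 7], [7, 4], [3, 4], [1, 4]])          # acid durability, strength
y = np.array(["Bad", "Bad", "Good", "Good"])
query = np.array([3, 7])

d = np.sqrt(((X - query) ** 2).sum(axis=1))             # distances: 4.00, 5.00, 3.00, 3.61
print(Counter(y[np.argsort(d)[:3]]).most_common(1))     # k = 3 -> 2 "Good", 1 "Bad"

The three nearest neighbours are (3,4) Good, (1,4) Good and (7,7) Bad, so the new sample is classified as Good.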
Summary
56

 Feature selection
 Supervised: drop features which don’t introduce
large errors (validation set)
 Unsupervised: keep only uncorrelated features
(drop features that don’t add much information)
 Feature extraction
 Linearly combine features into a smaller set of features
 Unsupervised
 PCA: explain most of the total variability
 FA: explain most of the common variability
 Supervised
 LDA: best separate class instances
 Missing data
57
 Remove the data (column/row)
 Replace it with the average value if the data is numeric
 Replace it with a default value
 Linear regression / Bayesian regression
 Clustering techniques
 Based on observed/historical data (a sketch of the simple options follows below)
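A brief pandas sketch of the simple options above (column names and values are made up for illustration):

import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 31],
                   "city": ["Pune", None, "Delhi", "Pune"]})

dropped = df.dropna()                                        # remove rows with missing data
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())   # numeric: replace with average
filled["city"] = filled["city"].fillna("Unknown")            # categorical: replace with a default
# Regression- or clustering-based imputation would instead predict each missing
# entry from the observed columns / historical data.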
