Machine Learning (CSO851) - Lecture 03
(Figure: with k bins per feature, the total number of bins grows exponentially with the number of features (e.g., 3³ bins for three features); too many bins relative to the available data leads to worse performance.)
Dimensionality Reduction
• What is the objective?
− Choose an optimum set of features of lower dimensionality to improve classification accuracy.
Dimensionality Reduction (cont’d)
$y = Tx \in \mathbb{R}^K$, where $K \ll N$

$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix} \;\xrightarrow{\;y \,=\, f(x) \,=\, Tx\;}\; y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_K \end{bmatrix}$$

This is a projection from the N-dimensional space to a K-dimensional space.
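As a quick illustration of this mapping, here is a minimal NumPy sketch (the matrix T and vector x below are arbitrary stand-ins, not values from the lecture):

```python
import numpy as np

N, K = 5, 2                       # original and reduced dimensionality (K << N in practice)
rng = np.random.default_rng(0)

T = rng.standard_normal((K, N))   # a K x N projection matrix (random here, purely for illustration)
x = rng.standard_normal(N)        # an N-dimensional feature vector

y = T @ x                         # the K-dimensional projection y = Tx
print(y.shape)                    # (2,)
```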
Feature Extraction (cont’d)
− The goal is to find a mapping y = f(x) that optimizes a chosen objective criterion.
Feature Extraction (cont’d)
• Popular linear feature extraction methods:
− Principal Components Analysis (PCA): Seeks a projection that
preserves as much information in the data as possible.
− Linear Discriminant Analysis (LDA): Seeks a projection that best
discriminates the data.
Vector Representation
• A vector $x \in \mathbb{R}^N$ can be represented by its $N$ components:
$x = [x_1, x_2, \dots, x_N]^T$
• Assuming the standard basis $\langle v_1, v_2, \dots, v_N \rangle$ (i.e., unit vectors in each dimension), $x_i$ can be obtained by projecting $x$ along the direction of $v_i$:
$x_i = \dfrac{x^T v_i}{v_i^T v_i} = x^T v_i$
• $x$ can be "reconstructed" from its projections as follows:
$x = \sum_{i=1}^{N} x_i v_i = x_1 v_1 + x_2 v_2 + \dots + x_N v_N$
• Since the basis vectors are the same for all $x \in \mathbb{R}^N$ (standard basis), we typically represent $x$ simply by its N-component coefficient vector.
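A small NumPy check of these two identities (projection onto the standard basis and reconstruction from the projections); the example vector x is arbitrary:

```python
import numpy as np

N = 4
x = np.array([2.0, -1.0, 0.5, 3.0])             # an example vector in R^N
V = np.eye(N)                                    # standard basis: v_i is the i-th column (unit vector)

# Projection: x_i = (x^T v_i) / (v_i^T v_i) = x^T v_i for unit-length v_i
coeffs = np.array([x @ V[:, i] for i in range(N)])

# Reconstruction: x = sum_i x_i v_i
x_rec = sum(coeffs[i] * V[:, i] for i in range(N))
print(np.allclose(x, x_rec))                     # True
```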
Principal Component Analysis (PCA)
• The main objective of PCA is to identify patterns in the data and reduce the dimensionality of the dataset with minimal loss of information.
• PCA projects the feature space onto a smaller subspace that still represents the data well.
• In PCA, we are interested in finding the directions (components) that maximize the variance in the dataset, i.e., the directions along which the data is most spread out.
• The principal components are the eigenvectors of the data covariance matrix, and each eigenvector has an associated eigenvalue.
• An eigenvalue indicates how much of the variance is captured along its eigenvector (its "importance").
• If all eigenvalues have similar magnitudes, the data already lies in a "good" subspace: no direction is clearly more informative, and little reduction is possible. If some eigenvalues are much larger than the rest, the corresponding eigenvectors capture most of the information and the remaining directions can be dropped.
Principal Component Analysis (PCA)
Suppose $x_1, x_2, \dots, x_M$ are the $N \times 1$ (training) data vectors.

Step 1: compute the mean:
$\bar{x} = \frac{1}{M}\sum_{i=1}^{M} x_i$

Step 2: subtract the mean:
$\Phi_i = x_i - \bar{x}$

Step 3: compute the sample covariance matrix $\Sigma_x$:
$\Sigma_x = \frac{1}{M}\sum_{i=1}^{M}(x_i - \bar{x})(x_i - \bar{x})^T = \frac{1}{M}\sum_{i=1}^{M}\Phi_i \Phi_i^T = \frac{1}{M} A A^T$
where $A = [\Phi_1\ \Phi_2\ \dots\ \Phi_M]$ is an $N \times M$ matrix, i.e., the columns of $A$ are the $\Phi_i$.
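A minimal NumPy sketch of Steps 1-3, assuming the training vectors are stored as the columns of an N x M array X (the variable names and random data are mine, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 50                          # feature dimensionality and number of training samples
X = rng.standard_normal((N, M))        # columns x_1, ..., x_M are the training vectors

x_bar = X.mean(axis=1, keepdims=True)  # Step 1: mean vector (N x 1)
A = X - x_bar                          # Step 2: Phi_i = x_i - x_bar; the Phi_i are the columns of A
Sigma_x = (A @ A.T) / M                # Step 3: sample covariance (1/M) A A^T  (N x N)
print(Sigma_x.shape)                   # (10, 10)
```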
PCA - Steps
Step 4: compute the eigenvalues/eigenvectors of $\Sigma_x$:
$\Sigma_x u_i = \lambda_i u_i$, where we assume $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_N$.
(Note: most software packages return the eigenvalues, and corresponding eigenvectors, in decreasing order; if not, you can explicitly put them in this order.)

Step 5: keep only the first K eigenvectors (dimensionality reduction) and represent each centered sample by its projection coefficients:
$(\hat{x} - \bar{x}) = U \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_K \end{bmatrix}$, where $U = [u_1\ u_2\ \dots\ u_K]$ is an $N \times K$ matrix, i.e., the columns of $U$ are the first K eigenvectors of $\Sigma_x$.

Equivalently,
$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_K \end{bmatrix} = U^T (\hat{x} - \bar{x}) = T (\hat{x} - \bar{x})$, where $T = U^T$ is a $K \times N$ matrix, i.e., the rows of $T$ are the first K eigenvectors of $\Sigma_x$.
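Continuing the same kind of sketch, Steps 4-5 (eigendecomposition of the covariance, then projection onto the first K eigenvectors); the data setup mirrors the snippet above and is again illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 10, 50, 3
X = rng.standard_normal((N, M))                # columns are the training vectors
x_bar = X.mean(axis=1, keepdims=True)
A = X - x_bar
Sigma_x = (A @ A.T) / M

# Step 4: eigenvalues/eigenvectors of Sigma_x, sorted in decreasing order
eigvals, eigvecs = np.linalg.eigh(Sigma_x)     # eigh, since Sigma_x is symmetric
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 5: keep the first K eigenvectors and project the centered data
U = eigvecs[:, :K]                             # N x K, columns = first K eigenvectors
Y = U.T @ A                                    # y_i = U^T (x_i - x_bar), one column per sample

# Reconstruct the first sample from its K coefficients
x_hat = x_bar[:, 0] + U @ Y[:, 0]
```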
What is the form of Σy ?
$\Sigma_x = \frac{1}{M}\sum_{i=1}^{M}(x_i - \bar{x})(x_i - \bar{x})^T = \frac{1}{M}\sum_{i=1}^{M}\Phi_i \Phi_i^T$

Using diagonalization: $\Sigma_x = P \Lambda P^T$, where the columns of $P$ are the eigenvectors of $\Sigma_x$ and the diagonal elements of $\Lambda$ are the eigenvalues of $\Sigma_x$ (i.e., the variances).

Taking $K = N$ (so that $U = P$): $y_i = U^T(x_i - \bar{x}) = P^T \Phi_i$

$\Sigma_y = \frac{1}{M}\sum_{i=1}^{M}(y_i - \bar{y})(y_i - \bar{y})^T = \frac{1}{M}\sum_{i=1}^{M} y_i\, y_i^T$ (since $\bar{y} = 0$)
$\;= \frac{1}{M}\sum_{i=1}^{M}(P^T \Phi_i)(P^T \Phi_i)^T = P^T \left(\frac{1}{M}\sum_{i=1}^{M}\Phi_i \Phi_i^T\right) P = P^T \Sigma_x P = P^T (P \Lambda P^T) P = \Lambda$

So $\Sigma_y = \Lambda$ is diagonal: the components of the projected data are uncorrelated, i.e., PCA decorrelates the data.
Example
• Compute the PCA of the following dataset:
(1,2),(3,3),(3,5),(5,4),(5,6),(6,5),(8,7),(9,8)
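A sketch of how the PCA of this dataset could be computed numerically, following the covariance and eigendecomposition steps above (NumPy is assumed; no results are quoted here):

```python
import numpy as np

# The eight 2-D points from the example, one per row
X = np.array([(1, 2), (3, 3), (3, 5), (5, 4), (5, 6), (6, 5), (8, 7), (9, 8)], dtype=float)

x_bar = X.mean(axis=0)                 # mean of the dataset
A = (X - x_bar).T                      # 2 x 8 matrix whose columns are the centered points
Sigma_x = (A @ A.T) / X.shape[0]       # sample covariance (2 x 2)

eigvals, eigvecs = np.linalg.eigh(Sigma_x)
order = np.argsort(eigvals)[::-1]      # sort in decreasing order
print("eigenvalues:", eigvals[order])
print("eigenvectors (columns):\n", eigvecs[:, order])
```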
Example (cont’d)
• The eigenvectors are the solutions of the system:
$\Sigma_x u_i = \lambda_i u_i$
• To choose K, select the smallest K that satisfies the following inequality:
$\dfrac{\sum_{i=1}^{K}\lambda_i}{\sum_{i=1}^{N}\lambda_i} > T$, where $T$ is a threshold (e.g., 0.9).
Approximation Error
• The approximation (reconstruction) error due to keeping only the first K eigenvectors is given by:
$\|x - \hat{x}\| = \frac{1}{2}\sum_{i=K+1}^{N}\lambda_i$
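A small sketch combining both ideas above: choosing K from the eigenvalue spectrum with a threshold T, and evaluating the resulting truncation error (the example eigenvalues are made up and assumed sorted in decreasing order; the 1/2 factor follows the error formula above):

```python
import numpy as np

def choose_k(eigvals, T=0.9):
    """Smallest K such that sum(lambda_1..K) / sum(lambda_1..N) exceeds the threshold T."""
    ratios = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.searchsorted(ratios, T) + 1)

def truncation_error(eigvals, K):
    """Approximation error (1/2) * sum of the discarded eigenvalues lambda_{K+1}, ..., lambda_N."""
    return 0.5 * np.sum(eigvals[K:])

eigvals = np.array([4.0, 2.0, 0.5, 0.3, 0.2])   # example spectrum, sorted decreasingly
K = choose_k(eigvals, T=0.9)
print(K, truncation_error(eigvals, K))
```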
Data Normalization
Application to Images (cont’d)
Each face image (N x N pixels) is vectorized into an N²-dimensional column vector; the M centered training images form the columns of the matrix A (an N² x M matrix).
Application to Images (cont’d)
Step 3 (modified): compute $A^T A$ (an M x M matrix) instead of $A A^T$ (an N² x N² matrix), which is much cheaper since typically M ≪ N².
Step 4a: compute the eigenvalues $\mu_i$ and eigenvectors $v_i$ of $A^T A$: $A^T A\, v_i = \mu_i v_i$.
Step 4b: compute the $\lambda_i$, $u_i$ of $A A^T$ using $\lambda_i = \mu_i$ and $u_i = A v_i$, then normalize $u_i$ to unit length.
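A NumPy sketch of this trick: eigendecompose the small M x M matrix A^T A, then map its eigenvectors back with u_i = A v_i and normalize (A below is a made-up centered data matrix with N² >> M):

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, M = 32 * 32, 20                 # N^2 pixels per image, M training images
A = rng.standard_normal((n_pixels, M))    # columns are the centered image vectors Phi_i

# Step 3 (modified) / Step 4a: eigenvalues mu_i and eigenvectors v_i of the small matrix A^T A
mu, V = np.linalg.eigh(A.T @ A)
order = np.argsort(mu)[::-1]
mu, V = mu[order], V[:, order]

# Step 4b: u_i = A v_i with lambda_i = mu_i, then normalize u_i to unit length
U = A @ V
U /= np.linalg.norm(U, axis=0)

# Check: u_1 is an eigenvector of A A^T with eigenvalue mu_1
print(np.allclose((A @ A.T) @ U[:, 0], mu[0] * U[:, 0]))   # True
```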
Example
Dataset: a set of training face images (shown as a figure).
Example (cont’d)
Top eigenvectors $u_1, \dots, u_K$, visualized as images ("eigenfaces"): $u_1$, $u_2$, $u_3$ shown, together with the mean face $\bar{x}$.
Application to Images (cont’d)
• Interpretation: represent a face in terms of eigenfaces
$\hat{x} = \sum_{i=1}^{K} y_i u_i + \bar{x} = y_1 u_1 + y_2 u_2 + \dots + y_K u_K + \bar{x}$

where the coefficients $[y_1, y_2, \dots, y_K]^T$ are the coordinates of $\hat{x} - \bar{x}$ in the eigenface basis $u_1, u_2, u_3, \dots$ (in the figure, a face is shown as $y_1 u_1 + y_2 u_2 + y_3 u_3 + \bar{x}$).
Case Study: Eigenfaces for Face Detection/Recognition
• Face Recognition
− The simplest approach is to think of it as a template matching problem.
Face Recognition Using Eigenfaces
• Process the image database (i.e., set of images with
labels) – typically referred to as “training” phase:
Each training face $i$ is represented by the vector of its projection coefficients:
$\Omega_i = [y_1, y_2, \dots, y_K]^T$
Face Recognition Using Eigenfaces
Given an unknown face $x$, follow these steps:

Step 1: subtract the mean face $\bar{x}$ (computed from the training data): $\Phi = x - \bar{x}$

Step 2: project the unknown face onto the eigenspace:
$\hat{\Phi} = \sum_{i=1}^{K} y_i u_i$, where $y_i = \Phi^T u_i$, giving the coefficient vector $\Omega = [y_1, y_2, \dots, y_K]^T$.

Step 3: find the closest match $\Omega_i$ from the training set using:
$e_r = \min_i \|\Omega - \Omega_i\|^2 = \min_i \sum_{j=1}^{K}(y_j - y_j^i)^2$ (Euclidean distance), or
$e_r = \min_i \sum_{j=1}^{K}\frac{1}{\lambda_j}(y_j - y_j^i)^2$ (Mahalanobis distance).
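A sketch of the matching step (Step 3), given a gallery of training coefficient vectors Omega_i and the unknown face's coefficients Omega; all names and the toy data are mine:

```python
import numpy as np

def closest_match(omega, gallery, eigvals=None):
    """Return the index of the closest training face.

    omega:   (K,) coefficients of the unknown face
    gallery: (n_faces, K) rows are the training coefficient vectors Omega_i
    eigvals: (K,) eigenvalues lambda_j; if given, use the Mahalanobis-style distance
    """
    diff = gallery - omega                              # (n_faces, K)
    if eigvals is None:
        dists = np.sum(diff ** 2, axis=1)               # squared Euclidean distance
    else:
        dists = np.sum(diff ** 2 / eigvals, axis=1)     # sum_j (1/lambda_j) (y_j - y_j^i)^2
    return int(np.argmin(dists))

# Toy usage
rng = np.random.default_rng(1)
gallery = rng.standard_normal((5, 3))                   # 5 training faces, K = 3 coefficients each
omega = gallery[2] + 0.01 * rng.standard_normal(3)      # unknown face close to training face 2
print(closest_match(omega, gallery))                    # 2
```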
Face Detection Using Eigenfaces
Given an unknown image x, follow these steps:
Step 1: subtract the mean face $\bar{x}$ (computed from the training data): $\Phi = x - \bar{x}$
Eigenfaces
(Figure: input images and their eigenface reconstructions.) The reconstructed image looks like a face again!
Reconstruction from partial information
• Robust to partial face occlusion.
Limitations
• Background changes cause problems
− De-emphasize the outside of the face (e.g., by multiplying the input
image by a 2D Gaussian window centered on the face).
• Light changes degrade performance
− Light normalization might help but this is a challenging issue.
• Performance decreases quickly with changes to face size
− Scale input image to multiple sizes.
− Multi-scale eigenspaces.
• Performance decreases with changes to face orientation
(but not as fast as with scale changes)
− Out-of-plane rotations are more difficult to handle.
− Multi-orientation eigenspaces.
Limitations (cont’d)
• Not robust to misalignment.
Limitations (cont’d)
• PCA is not always an optimal dimensionality-reduction
technique for classification purposes.
Linear Discriminant Analysis (LDA)
• Linear discriminant analysis, also known as normal discriminant
analysis (NDA) or discriminant function analysis (DFA), follows
a generative model framework.
• This means LDA algorithms model the data distribution for each
class and use Bayes' theorem to classify new data points.
Linear Discriminant Analysis (LDA)
• LDA works by identifying a linear combination of features that
separates or characterizes two or more classes of objects or
events.
• LDA does this by projecting the data onto a lower-dimensional space (a single dimension in the two-class case) so that it can be more easily classified.
• The technique is therefore sometimes referred to as a dimensionality-reduction technique.
• LDA handles multi-class classification problems directly, unlike logistic regression, which is limited to binary classification.
• LDA is thus often applied to enhance the operation of other classification algorithms such as decision trees, random forests, or support vector machines (SVMs).
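As a concrete illustration, scikit-learn's LinearDiscriminantAnalysis can be used both as a classifier and as a supervised dimensionality-reduction (transform) step; the toy data below is synthetic:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic 3-class data in 4 dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(50, 4)) for c in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], 50)

lda = LinearDiscriminantAnalysis(n_components=2)   # project onto at most C - 1 = 2 directions
X_proj = lda.fit_transform(X, y)                   # supervised dimensionality reduction
print(X_proj.shape)                                # (150, 2)
print(lda.predict(X[:5]))                          # class predictions for the first five samples
```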
Linear Discriminant Analysis (LDA)
(Figure: comparison of candidate projection directions.)
• Let $\mu_i$ be the mean of the i-th class, $i = 1, 2, \dots, C$, and $\mu$ be the mean of the whole dataset:
$\mu = \frac{1}{C}\sum_{i=1}^{C}\mu_i$
• Between-class scatter matrix:
$S_b = \sum_{i=1}^{C}(\mu_i - \mu)(\mu_i - \mu)^T$
• Within-class scatter matrix (with $\omega_i$ denoting the set of samples of class $i$):
$S_w = \sum_{i=1}^{C}\sum_{x \in \omega_i}(x - \mu_i)(x - \mu_i)^T$
Linear Discriminant Analysis (LDA) (cont’d)
• Suppose the desired projection transformation is:
$y = U^T x$
• Suppose the scatter matrices of the projected data $y$ are $\tilde{S}_b$ and $\tilde{S}_w$; LDA seeks the projection $U$ that maximizes the ratio of between-class to within-class scatter:
$\max \dfrac{|\tilde{S}_b|}{|\tilde{S}_w|}$, or equivalently $\max \dfrac{|U^T S_b U|}{|U^T S_w U|}$
Linear Discriminant Analysis (LDA) (cont’d)
• The optimal projection directions $u_k$ are the solutions of the generalized eigenvalue problem:
$S_b u_k = \lambda_k S_w u_k$
or, equivalently (when $S_w$ is invertible),
$S_w^{-1} S_b u_k = \lambda_k u_k$
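A NumPy/SciPy sketch that builds S_b and S_w from labeled data (using the definitions above) and solves the generalized eigenvalue problem; the data and names are illustrative:

```python
import numpy as np
from scipy.linalg import eigh

def lda_directions(X, y):
    """X: (n_samples, n_features); y: integer class labels. Returns (eigenvalues, directions u_k)."""
    classes = np.unique(y)
    n_features = X.shape[1]
    class_means = np.array([X[y == c].mean(axis=0) for c in classes])
    mu = class_means.mean(axis=0)                    # mean of the class means, as defined above
    S_b = np.zeros((n_features, n_features))
    S_w = np.zeros((n_features, n_features))
    for c, mu_c in zip(classes, class_means):
        d = (mu_c - mu)[:, None]
        S_b += d @ d.T                               # between-class scatter
        Xc = X[y == c] - mu_c
        S_w += Xc.T @ Xc                             # within-class scatter
    # Generalized eigenproblem S_b u = lambda * S_w u (S_w assumed non-singular here)
    eigvals, eigvecs = eigh(S_b, S_w)
    order = np.argsort(eigvals)[::-1]                # largest scatter ratio first
    return eigvals[order], eigvecs[:, order]

# Toy usage: two Gaussian blobs in 3-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (30, 3)), rng.normal(3.0, 1.0, (30, 3))])
y = np.repeat([0, 1], 30)
vals, U = lda_directions(X, y)
print(vals[:2])                                      # leading generalized eigenvalues
```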
Linear Discriminant Analysis (LDA) (cont’d)
• To alleviate this problem (in practice, $S_w$ may be singular or ill-conditioned, e.g., when there are fewer training samples than dimensions), PCA could be applied first:
$x = [x_1, x_2, \dots, x_N]^T \;\xrightarrow{\text{PCA}}\; y = [y_1, y_2, \dots, y_M]^T$ (with $M < N$),
and LDA is then applied in the reduced M-dimensional space.
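A sketch of the PCA-then-LDA idea using scikit-learn's Pipeline; the synthetic data and the choice of n_components are illustrative only:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# High-dimensional synthetic data: few samples per class relative to the
# number of features, the regime in which S_w tends to become singular
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 1.0, size=(20, 100)) for c in (0.0, 1.5, 3.0)])
y = np.repeat([0, 1, 2], 20)

model = make_pipeline(PCA(n_components=30),           # reduce N -> M dimensions first
                      LinearDiscriminantAnalysis())   # then discriminate in the reduced space
model.fit(X, y)
print(model.score(X, y))                              # training accuracy
```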
Case Study I (cont’d)
• Assumptions
− Well-framed images are required as input for training and query-by-
example test probes.
− Only a small variation in the size, position, and orientation of the
objects in the images is allowed.
Case Study I (cont’d)
• Terminology
− Most Expressive Features (MEF): features obtained using PCA.
− Most Discriminating Features (MDF): features obtained using LDA.
• Numerical instabilities
− Computing the eigenvalues/eigenvectors of $S_w^{-1} S_B u_k = \lambda_k u_k$ could lead to unstable computations since $S_w^{-1} S_B$ is not always symmetric.
− Check the paper for more details about how to deal with this issue.
Case Study I (cont’d)
• Comparing projection directions between MEF and MDF:
− PCA eigenvectors show the tendency of PCA to capture major
variations in the training set such as lighting direction.
− LDA eigenvectors discount those factors unrelated to classification.
Case Study I (cont’d)
• Clustering effect
Case Study I (cont’d)
• Methodology
Case Study I (cont’d)
• Experiments and results (face images)
− A set of face images was used with 2 expressions and 3 lighting conditions.
− Testing was performed using a disjoint set of images.
Case Study I (cont’d)
− Examples of correct search probes
Case Study II
• AR database
4th Quiz