
Dimensionality Reduction

(Principal Component Analysis)


Pre-requisites

 Basic Statistics

 Elementary Linear Algebra

 Lagrange Optimization
A Start-up Story
The Bottleneck!!

 Too few customers, too many KYC & transaction features

 i.e., few observations & many explanatory variables.

 That's causing a problem while predicting defaults & segmenting/clustering customers.

 But why? Why is it a problem?


Curse of Dimensionality
 Supervised Learning (Prediction): Increasing the number of features will not always improve prediction accuracy.

 Low bias, high variance.

 The model captures the noise / idiosyncrasies of the training data.

 Overfitting on training data, poor predictive performance on test data.

 Unsupervised Learning: Clustering algorithms fail in high dimensions.
Dimensionality Reduction
 What is the objective?
 Choose an optimum set of features of lower dimensionality to improve classification accuracy or efficient clustering.

 Different methods can be used to reduce dimensionality:

 Feature extraction
 Feature selection

Dimensionality Reduction – Possible Strategies
 Feature extraction: finds a set of new features through some mapping f(·) from the existing features; the mapping f(·) could be linear or non-linear.

x = (x1, x2, …, xN)ᵀ  →  y = f(x) = (y1, y2, …, yK)ᵀ,  where K << N

 Feature selection: chooses a subset of the original features.

x = (x1, x2, …, xN)ᵀ  →  (xi1, xi2, …, xiK)ᵀ,  where K << N
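To make the two strategies concrete, here is a minimal NumPy sketch (the toy data matrix, the selected column indices and the matrix T below are all arbitrary, for illustration only): selection keeps a subset of the original columns, while extraction builds each new feature as a linear combination of all of them.

```python
import numpy as np

# Toy data: 5 observations (rows) x 4 original features (columns).
X = np.array([
    [1.0, 2.0, 0.5, 3.0],
    [2.0, 1.0, 0.0, 4.0],
    [3.0, 3.0, 1.5, 2.0],
    [4.0, 0.5, 2.0, 1.0],
    [5.0, 2.5, 1.0, 0.0],
])

# Feature selection: keep a subset of the original columns (here columns 0 and 3).
selected = X[:, [0, 3]]            # shape (5, 2)

# Feature extraction: build K new features as linear combinations of all N originals,
# i.e. y = Tx with a K x N matrix T (values chosen arbitrarily for illustration).
T = np.array([
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.3],
])
extracted = X @ T.T                # shape (5, 2)

print(selected.shape, extracted.shape)   # (5, 2) (5, 2)
```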
Feature Extraction
 Linear combinations / linear maps are particularly attractive because they are simple to compute and analytically tractable.

 Every yj is a linear function of all the xi's (i.e., the original features).

 Given x ∈ R^N, find a K x N matrix T such that

y = Tx ∈ R^K, where K << N

 This is a projection from the N-dimensional space to a K-dimensional space.
Linear Transformation: Example
 In a bank, for any customer c, the original features are

x^(c) = (Spending, Adv. Payment, Payment Delay, Current Balance, Credit Limit, Min Pay Amount, Maximum Single Spend)ᵀ

 Take the 2 x 7 matrix

T = [0.5  0  0  0.21  0.34  0  0.07 ; 0.2  0.1  0  0.3  0.14  0.1  0]

 Then y^(c) = T x^(c) gives two new features:

y1^(c) = 0.5*Spending + 0*Adv. Payment + 0*Payment Delay + 0.21*Current Balance + 0.34*Credit Limit + 0*Min Pay Amount + 0.07*Max Single Spend

y2^(c) = 0.2*Spending + 0.1*Adv. Payment + 0*Payment Delay + 0.3*Current Balance + 0.14*Credit Limit + 0.1*Min Pay Amount + 0*Max Single Spend
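The same computation can be reproduced in NumPy; the matrix T is the one from the example above, while the customer's feature values below are made up purely for illustration.

```python
import numpy as np

# The 2 x 7 matrix T from the example above.
T = np.array([
    [0.5, 0.0, 0.0, 0.21, 0.34, 0.0, 0.07],
    [0.2, 0.1, 0.0, 0.30, 0.14, 0.1, 0.00],
])

# Hypothetical feature vector for one customer c:
# (spending, adv. payment, payment delay, current balance,
#  credit limit, min pay amount, maximum single spend)
x_c = np.array([1200.0, 0.0, 2.0, 450.0, 5000.0, 50.0, 300.0])

# y^(c) = T x^(c): two new features, each a linear combination of all seven originals.
y_c = T @ x_c
print(y_c)   # array of length 2
```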
Feature Extraction: Objective
 Find an optimum linear mapping y = f(x) that minimizes information loss.

 Criterion of minimizing information loss: represent the data as accurately as possible in the lower-dimensional space.

 Principal Component Analysis uses exactly this criterion to optimally reduce the dimensions of the data.
Let's Think 2-D!!
Linear Feature Extraction Geometry: New Axes

[Figure: scatter of the data in the original variables x1 and x2, with the new rotated axes PC 1 and PC 2 overlaid.]

 Projections along PC1 explain the data most along any one axis.
 New feature creation ≡ rotation of axes.
New Axes Transformation (2-D)
 Rotating the axes by an angle θ relates the original coordinates to the new ones:

(x1, x2)ᵀ = R(θ) (z1, z2)ᵀ, where R(θ) = [cos θ  -sin θ ; sin θ  cos θ]

 R(θ) is orthogonal:

[cos θ  -sin θ ; sin θ  cos θ] [cos θ  sin θ ; -sin θ  cos θ] = [1  0 ; 0  1]

 Inverting the rotation therefore gives the new coordinates from the old:

(z1, z2)ᵀ = A (x1, x2)ᵀ ;  Z = AX with A = [cos θ  sin θ ; -sin θ  cos θ]

 z1 = a1ᵀX where a1ᵀ = (cos θ, sin θ), & thus a1ᵀa1 = 1

 z2 = a2ᵀX where a2ᵀ = (-sin θ, cos θ), & thus a2ᵀa2 = 1

 Rotation of axes yields new features which are linear combinations of the original features, & the combining weights form a unit vector.
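These facts are easy to verify numerically; a small NumPy sketch (with an arbitrarily chosen angle θ) checks that A is orthogonal, that its rows a1 and a2 are unit vectors, and that rotating back recovers the original coordinates.

```python
import numpy as np

theta = np.deg2rad(30.0)          # arbitrary rotation angle for illustration

# A maps original coordinates X to rotated coordinates Z:  Z = A X
A = np.array([
    [ np.cos(theta), np.sin(theta)],
    [-np.sin(theta), np.cos(theta)],
])

# Orthogonality: A @ A.T is the identity, so the inverse rotation is just A.T.
print(np.allclose(A @ A.T, np.eye(2)))        # True

# Each row of A (the combining weights a1, a2) is a unit vector.
print(np.linalg.norm(A, axis=1))              # [1. 1.]

# Rotating a point and rotating back recovers the original coordinates.
x = np.array([2.0, 1.0])
z = A @ x
print(np.allclose(A.T @ z, x))                # True
```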
Next Obvious Question!
 What rotation(s) is (are) optimal?

 Which directions to choose? How to choose the "Principal Components"?
The Principal Components Rationale
 The 1st principal component points in the direction of the largest variance.

 The 2nd principal component also points in the direction of the largest variance, conditional on being orthogonal to the 1st principal component.

 Each subsequent principal component
   • is orthogonal to the previous ones, and
   • points in the direction of the largest variance of the residual subspace.
[Figure: a 2D Gaussian dataset shown with its 1st and 2nd PCA axes overlaid.]
PCA Algorithm – Formalized
 Step 1: Choose PC1 ≡ z1 = a1ᵀX such that Var(z1) is maximum,
   i.e. max_{a1} Var(z1) subject to a1ᵀa1 = 1.

 Step 2: Choose PC2 ≡ z2 = a2ᵀX such that Var(z2) is maximum & z2 ⊥ z1,
   i.e. max_{a2} Var(z2) subject to a2ᵀa2 = 1 & z2 ⊥ z1.

 Step 3: Choose PC3 ≡ z3 = a3ᵀX such that Var(z3) is maximum & z3 ⊥ z1, z2,
   i.e. max_{a3} Var(z3) subject to a3ᵀa3 = 1 & z3 ⊥ z1, z2.

Optimization Problem
 PCj ≡ zj = ajᵀX

 Var(zj) = E[ajᵀX (ajᵀX)ᵀ] = ajᵀ E(XXᵀ) aj = ajᵀ Var(X) aj = ajᵀ Σ aj
   (taking X to be mean-centred, so that Var(X) = E(XXᵀ) = Σ)

 Optimization problem:
   max_{aj} ajᵀ Σ aj subject to ajᵀaj = 1 ∀ j

 Optimization problem redefined (replacing Σ by its sample estimate Σ̂):
   max_{aj} ajᵀ Σ̂ aj subject to ajᵀaj = 1 ∀ j
Missing Orthogonality Constraints!!
 In the optimization problem described above, how come I haven't incorporated the fact that I desire Cov(zi, zj) = 0 ∀ i ≠ j?

 Cov(zi, zj) = Cov(aiᵀX, ajᵀX) = E[aiᵀX (ajᵀX)ᵀ] = aiᵀ Σ aj ≈ aiᵀ Σ̂ aj

 So shouldn't we impose the additional restrictions aiᵀ Σ̂ aj = 0 ∀ i ≠ j while solving the optimization problem?
Optimization Solution
 The Lagrangian: L = ajᵀ Σ̂ aj + βj (1 − ajᵀaj)

 ∂L/∂aj = 0  =>  2 Σ̂ aj − 2 βj aj = 0

 (Σ̂ − βj I) aj = 0

 Therefore the βj are the eigenvalues and the aj are the corresponding unit eigenvectors of Σ̂.

 Aah!! So we have effectively found all the PCs, since the aj's are the combining weights of the PCs.
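In code, this result says the PC weights come out of an eigen-decomposition of the estimated covariance matrix. A minimal NumPy sketch on synthetic data (observations in rows) might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # 200 observations, 5 features (synthetic data)

# Estimated covariance matrix of the original features.
Sigma_hat = np.cov(X, rowvar=False)

# Eigen-decomposition: eigh is appropriate because Sigma_hat is symmetric.
# Columns of eigvecs are the unit eigenvectors a_j; eigvals are the beta_j.
eigvals, eigvecs = np.linalg.eigh(Sigma_hat)

# Sort in decreasing order of eigenvalue so the first vector is PC1's weights.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals)                  # variances of the principal components
print(eigvecs[:, 0])            # combining weights a_1 of the first PC
```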
Symmetric Matrix – Our Saviour!!!
 Theorem: Eigenvectors corresponding to different eigenvalues of a symmetric matrix are orthogonal.

 Proof: Let A be a symmetric matrix, and let AX1 = β1 X1 & AX2 = β2 X2 with β1 ≠ β2.

   X2ᵀ A X1 = β1 X2ᵀ X1 ………………… (i)
   X1ᵀ A X2 = β2 X1ᵀ X2 ………………… (ii)

   X2ᵀ A X1 is a 1x1 scalar, so X2ᵀ A X1 = (X2ᵀ A X1)ᵀ = X1ᵀ Aᵀ X2 = X1ᵀ A X2 (since A is symmetric).

   ∴ β1 X2ᵀ X1 = β2 X1ᵀ X2  =>  (β1 − β2) X1ᵀ X2 = 0

   ∴ β1 ≠ β2  =>  X1ᵀ X2 = 0
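A quick numerical illustration of the theorem, using an arbitrary symmetric matrix:

```python
import numpy as np

# Any symmetric matrix will do for the illustration.
A = np.array([
    [4.0, 1.0, 2.0],
    [1.0, 3.0, 0.5],
    [2.0, 0.5, 5.0],
])

eigvals, eigvecs = np.linalg.eigh(A)     # columns of eigvecs are eigenvectors

# Dot products of eigenvectors belonging to different eigenvalues are ~0,
# i.e. eigvecs.T @ eigvecs is (numerically) the identity matrix.
print(np.allclose(eigvecs.T @ eigvecs, np.eye(3)))   # True
```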
Glitch Solved!
 Σ̂ is the (estimated) variance-covariance matrix & hence symmetric.

 Thus the eigenvectors of Σ̂ are orthogonal.

 Thus, in the optimization problem solution, aiᵀaj = 0 ∀ i ≠ j.

 Cov(zi, zj) = Cov(aiᵀX, ajᵀX) = E[aiᵀX (ajᵀX)ᵀ] = aiᵀ Σ aj ≈ aiᵀ Σ̂ aj

 aiᵀ (Σ̂ aj) = aiᵀ (βj aj) = βj aiᵀaj = 0

 Var(zj) = ajᵀ Σ aj ≈ ajᵀ Σ̂ aj = ajᵀ (Σ̂ aj) = ajᵀ (βj aj) = βj ajᵀaj = βj

 So the desired orthogonality (zero-covariance) restrictions are satisfied automatically, and the variance of each principal component equals its eigenvalue βj.
PCA – Computational Steps
 Given the data set, compute the estimated covariance matrix of the original features, Σ̂.

 Compute the eigenvalues βj & the corresponding eigenvectors aj of Σ̂.

 The j-th principal component: zj = ājᵀX, where āj = aj / ‖aj‖ (the unit eigenvector).

 Var(zj) = βj
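Putting the steps together, one possible NumPy sketch of the full computation (synthetic data, observations in rows; the function and variable names are ours, not from the slides):

```python
import numpy as np

def pca(X, K):
    """Return the top-K principal component scores, weights and eigenvalues."""
    X_centred = X - X.mean(axis=0)                 # work with mean-centred data
    Sigma_hat = np.cov(X_centred, rowvar=False)    # estimated covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma_hat)   # eigh: Sigma_hat is symmetric

    order = np.argsort(eigvals)[::-1]              # sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    A = eigvecs[:, :K]                             # combining weights a_1 .. a_K
    Z = X_centred @ A                              # PC scores z_1 .. z_K
    return Z, A, eigvals

# Example with synthetic data: 100 observations, 6 original features, keep K = 2.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 6))
Z, A, eigvals = pca(X, K=2)
print(Z.shape, A.shape)            # (100, 2) (6, 2)
print(eigvals[:2])                 # variances of PC1 and PC2
```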
How do we choose K?
 K is typically chosen based on how much information (variance) we want to preserve.

 We choose the minimum K such that (β1 + β2 + … + βK) / (β1 + β2 + … + βN) > T, where T is a threshold (e.g. 0.9).

 If T = 0.9, for example, we "preserve" 90% of the information (variance) in the data.

 If K = N, then we "preserve" 100% of the information in the data (i.e., it is just a change of basis). But that'll be useless.
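A sketch of this rule as a small helper function (the eigenvalues passed in at the end are made up for illustration):

```python
import numpy as np

def choose_k(eigvals, threshold=0.9):
    """Smallest K whose top-K eigenvalues explain at least `threshold` of total variance."""
    eigvals = np.sort(eigvals)[::-1]                     # decreasing order
    explained = np.cumsum(eigvals) / eigvals.sum()       # cumulative variance ratio
    return int(np.searchsorted(explained, threshold) + 1)

# Example with arbitrary eigenvalues: the first 4 components are needed to pass 90%.
print(choose_k(np.array([4.0, 2.0, 1.0, 0.5, 0.5])))     # 4
```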
Computation – Example
 A toy data set with 8 observations on two variables X and Y:

    X   Y
    1   2
    3   3
    3   5
    5   4
    5   6
    6   5
    8   7
    9   8
The Scatter Plot

[Figure: scatter plot of the 8 (X, Y) points, with X on the horizontal axis (0–10) and Y on the vertical axis (0–9).]
Covariance Matrix
 Var(X) = (1/n) Σ (Xj − X̄)² = 6.25

 Var(Y) = (1/n) Σ (Yj − Ȳ)² = 3.5

 Cov(X, Y) = (1/n) Σ (Xj − X̄)(Yj − Ȳ) = 4.25

 Σ̂ = [6.25  4.25 ; 4.25  3.5]
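These numbers can be checked directly with NumPy; note bias=True, which gives the 1/n (population) normalisation used above:

```python
import numpy as np

X = np.array([1, 3, 3, 5, 5, 6, 8, 9], dtype=float)
Y = np.array([2, 3, 5, 4, 6, 5, 7, 8], dtype=float)

# Population covariance matrix (divide by n, matching the 1/n formula above).
Sigma_hat = np.cov(X, Y, bias=True)
print(Sigma_hat)
# [[6.25 4.25]
#  [4.25 3.5 ]]
```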
Eigenvalues
 Σ̂ a = β a  =>  (Σ̂ − βI) a = 0

 For a ≠ 0, det(Σ̂ − βI) = 0  =>  det([6.25 − β  4.25 ; 4.25  3.5 − β]) = 0

 β1 = 9.34 & β2 = 0.41

 Explained variance of PC1 = 9.34 / (9.34 + 0.41) = 96%
 Explained variance of PC2 = 0.41 / (9.34 + 0.41) = 4%
Eigenvectors
 a1 = (0.81, 0.59)ᵀ,  a2 = (−0.59, 0.81)ᵀ,  and a1ᵀa2 = 0

 Normalising, ā1 = a1 / ‖a1‖ = (0.8066, 0.587)ᵀ

 PC1 = ā1ᵀ (X, Y)ᵀ = 0.8066 X + 0.587 Y
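Finally, the eigen-decomposition itself can be verified numerically, continuing from the covariance matrix above (the signs of the eigenvectors may come out flipped, which does not change the component):

```python
import numpy as np

Sigma_hat = np.array([[6.25, 4.25],
                      [4.25, 3.50]])

eigvals, eigvecs = np.linalg.eigh(Sigma_hat)        # returned in ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # largest first

print(eigvals)                     # approx [9.34, 0.41]
print(eigvals[0] / eigvals.sum())  # approx 0.96 -> PC1 explains ~96% of the variance
print(eigvecs[:, 0])               # unit eigenvector, approx +/-(0.81, 0.59)

# PC1 scores for the 8 observations (mean-centred data times the PC1 weights).
X = np.array([1, 3, 3, 5, 5, 6, 8, 9], dtype=float)
Y = np.array([2, 3, 5, 4, 6, 5, 7, 8], dtype=float)
data = np.column_stack([X, Y])
pc1 = (data - data.mean(axis=0)) @ eigvecs[:, 0]
print(pc1)                         # one PC1 score per observation
```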
The Principal Component
