Dimensionality Reduction - PCA LDA
What is Eigendecomposition?
Think of a matrix as a machine that performs transformations on vectors. It can stretch, shrink,
rotate, or otherwise alter a vector.
● Eigenvectors: Special vectors that, when fed into this matrix-machine, only change in
length (get scaled), but their direction remains fundamentally the same.
● Eigenvalues: The factor by which each eigenvector gets scaled is its corresponding
eigenvalue.
Eigendecomposition is the process of breaking down a matrix into a set of eigenvectors and
their corresponding eigenvalues.
Simplifying Operations: Many matrix operations (like powers and inverses) become easier
with eigendecomposition because you're essentially just scaling the eigenvectors.
Example
A = [[2, 1],
[1, 2]]
1. Find Eigenvalues: Solve the equation det(A - λI) = 0, where 'λ' represents the
eigenvalues and 'I' is the identity matrix. For our matrix, det(A - λI) = (2 - λ)^2 - 1 = 0,
which gives the eigenvalues λ1 = 3 and λ2 = 1.
2. Find Eigenvectors: For each eigenvalue, solve the equation (A - λI) v = 0, where
'v' is the eigenvector. Here this gives v1 = [1, 1] for λ1 = 3 and v2 = [-1, 1] for λ2 = 1.
Representation
A = V * D * V^-1
Where:
● V is the matrix whose columns are the eigenvectors of A,
● D is the diagonal matrix with the corresponding eigenvalues on its diagonal,
● V^-1 is the inverse of V.
Remember
● Eigendecomposition isn't always possible (some matrices don't diagonalize this way).
● A matrix represents a transformation; eigenvectors are the directions of this
transformation, and eigenvalues are how much scaling happens along those directions.
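Both the example and the representation above can be checked numerically. This is a small sketch using NumPy's np.linalg.eig:
import numpy as np

A = np.array([[2, 1],
              [1, 2]])

# Eigendecomposition: w holds the eigenvalues, the columns of V are the eigenvectors
w, V = np.linalg.eig(A)
print(w)   # [3. 1.]  -> lambda1 = 3, lambda2 = 1
print(V)   # columns are normalized multiples of [1, 1] and [-1, 1]

# Reconstruct A from its eigenvectors and eigenvalues: A = V * D * V^-1
D = np.diag(w)
print(V @ D @ np.linalg.inv(V))   # recovers [[2, 1], [1, 2]]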
1. Goal of PCA: PCA aims to reduce the dimensionality of data while retaining the most
important directions of variation.
2. How it works: PCA computes the covariance matrix of the (centered) data and performs
eigendecomposition on it; each eigenvalue measures how much of the data's variance lies
along the corresponding eigenvector.
3. Key Idea: PCA projects the data onto the eigenvectors with the largest eigenvalues.
These eigenvectors become the new "principal components" of the data. Since
eigenvalues represent variance, this keeps directions of high variation and discards
those with less variation (a short sketch of this idea follows the list).
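As a quick illustration of this idea, the eigenvectors of the data's covariance matrix are the directions scikit-learn's PCA reports as components, and the eigenvalues match the explained variance. The data and variable names below are chosen for illustration and are not from the original notes:
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic 2-D data with one dominant direction of variation
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
X = X - X.mean(axis=0)

# Eigendecomposition of the covariance matrix
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order

pca = PCA(n_components=2).fit(X)
print(eigvals[::-1])                # largest eigenvalue first ...
print(pca.explained_variance_)      # ... matches PCA's explained variance
print(eigvecs[:, ::-1].T)           # eigenvectors (as rows) ...
print(pca.components_)              # ... match PCA's components, up to sign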
A code snippet that visualizes the original data along with the principal components
(eigenvectors) using Matplotlib:
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
Explanation (a sketch of the complete snippet follows this list):
1. We define scaling factors (scale1 and scale2) to visually represent the eigenvectors
with a reasonable length. You might need to adjust these values depending on your data
spread.
2. We use plt.arrow to draw arrows starting from the origin (0, 0) with scaled lengths
based on the eigenvectors and colored differently for better distinction.
3. We add labels ('PC1' and 'PC2') to the plotted arrows for clarity.
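Since only the imports are shown above, here is a minimal sketch of what the rest of the snippet might look like. The dataset (Iris restricted to its first two features), the scaling values, and the colors are assumptions, not taken from the original Colab:
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Assumed dataset: first two Iris features, centered so data and arrows share an origin
X = datasets.load_iris().data[:, :2]
X = X - X.mean(axis=0)

pca = PCA(n_components=2)
pca.fit(X)

# Scaling factors so the eigenvector arrows have a visible length
# (adjust these depending on your data spread)
scale1 = 3 * np.sqrt(pca.explained_variance_[0])
scale2 = 3 * np.sqrt(pca.explained_variance_[1])

plt.scatter(X[:, 0], X[:, 1], alpha=0.5)
# Arrows from the origin (0, 0) along each principal component (eigenvector)
plt.arrow(0, 0, scale1 * pca.components_[0, 0], scale1 * pca.components_[0, 1],
          color='red', width=0.02)
plt.arrow(0, 0, scale2 * pca.components_[1, 0], scale2 * pca.components_[1, 1],
          color='green', width=0.02)
# Labels for the plotted arrows
plt.text(scale1 * pca.components_[0, 0], scale1 * pca.components_[0, 1], 'PC1', color='red')
plt.text(scale2 * pca.components_[1, 0], scale2 * pca.components_[1, 1], 'PC2', color='green')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Original Data with Principal Components')
plt.show()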
Hand Span - Height Ratio - Another example with two features that have a high degree of
covariance, allowing for reduction to a single feature using PCA.
Scenario: Imagine a dataset where one feature represents the height of a person and the other
represents their hand span. These features are likely to be highly correlated, meaning taller
people generally have wider hand spans and vice versa.
import numpy as np
# Define the mean and standard deviation for height and hand span
mean_height = 170
std_height = 10
mean_span = 18
std_span = 8
covariance = 0.8  # creates a strong linear relationship between the two features
# Generate heights, then derive correlated hand spans with added noise
num_people = 100
height = np.random.normal(mean_height, std_height, num_people)
noise = np.random.normal(0, 1, num_people)
hand_span = covariance * height + mean_span + std_span * noise
Explanation:
1. We define the mean and standard deviation for both height and hand span.
2. We set a high covariance value (0.8) to create a strong linear relationship between the
two features.
3. We use np.random.normal to generate data with specified means and standard
deviations for both height and noise.
4. The hand span is calculated using a linear equation with the defined covariance, mean,
standard deviation, and added noise to introduce some variation.
Visualization :
plt.scatter(height, hand_span)
plt.xlabel('Height')
plt.ylabel('Hand Span')
plt.title('Original Data (Height vs. Hand Span)')
plt.show()
This scatter plot will visually demonstrate the strong positive correlation between height and
hand span.
Applying PCA:
Using PCA on this data with only two features will essentially capture the single direction of
variation (the linear relationship between height and hand span). The first principal
component will represent this dominant direction, and the second component will
contain minimal variance and can be discarded with minimal information loss.
Remember
● The discarded component (second principal component) will primarily capture the noise
added to the data.
Create a sample dataset of heights and hand spans for 100 people, then apply PCA for
dimensionality reduction and visualize the result. Also show the original feature matrix and
what the feature matrix looks like after PCA - Check on Colab
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
# Define parameters
num_people = 100
mean_height = 170
std_height = 10
mean_span = 18
std_span = 8
covariance = 0.8
# Generate correlated data
np.random.seed(10)
height = np.random.normal(mean_height, std_height, num_people)
noise = np.random.normal(0, 1, num_people)
hand_span = covariance * height + mean_span + std_span * noise
# Apply PCA
pca = PCA(n_components=1) # Reduce to 1 dimension
pca.fit(np.c_[height, hand_span])
# Transform data
data_reduced = pca.transform(np.c_[height, hand_span])
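The snippet above only computes the reduced data. Below is a minimal sketch of how the original and reduced feature matrices could be printed and the one-dimensional result plotted; the exact plotting code from the Colab sheet is not reproduced here:
# Inspect the feature matrices before and after PCA (sketch, continuing the code above)
print("Original feature matrix (height, hand span), first 5 rows:")
print(np.c_[height, hand_span][:5])
print("Feature matrix after PCA (Principal Component 1), first 5 rows:")
print(data_reduced[:5])

# One-dimensional view of the data after PCA
plt.scatter(data_reduced, np.zeros_like(data_reduced), alpha=0.5)
plt.xlabel('Principal Component 1')
plt.title('Data After PCA (1 Dimension)')
plt.show()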
This code, together with the plotting on your Colab sheets, produces output like the tables
below; notice how the two dimensions are represented with a single dimension:
Principal Component 1
15.941
-4.421
-26.499
1.409
5.433
-12.921
0.04
5.236
3.05
-8.529
5.816
10.656
1.56
6.794
-0.526
1.953
-18.366
1.146
18.824
-25.034
-29.71
-31.458
9.63
43.782
21.261
21.631
6.209
17.079
5.55
10.411
-6.202
-15.35
-3.77
-5.409
7.251
-0.376
11.954
-5.458
26.686
-1.515
5.022
-14.309
-13.458
-10.543
-9.111
6.016
-12.685
0.043
7.713
3.833
1.446
4.777
0.242
10.507
7.612
1.476
24.062
13.453
3.068
-24.584
3.294
-8.045
15.556
-5.997
-16.793
-15.515
9.558
-24.092
-4.237
-6.168
-8.138
8.161
2.761
14.458
-1.952
27.764
-16.833
10.099
-8.181
-13.233
14.481
16.121
-21.269
0.446
11.67
-27.316
-0.505
2.379
-1.249
-21.444
14.621
0.333
-8.951
-11.61
5.022
17.895
-1.522
-6.7
16.204
-29.413
Original Data (Height vs. Hand Span)
Height    Hand Span
183.316 165.592
177.153 144.463
154.546 134.254
169.916 157.691
176.213 157.816
162.799 145.038
172.655 153.756
171.085 161.654
170.043 159.689
168.254 146.305
174.33 159.808
182.03 159.858
160.349 165.514
180.283 156.311
172.286 153.326
174.451 154.77
158.634 141.395
171.351 156.21
184.845 168.061
159.202 132.413
150.223 133.593
152.566 129.489
172.661 166.018
193.85 192.802
181.237 174.055
186.726 170.151
170.991 162.973
183.98 166.518
167.288 165.084
176.132 164.249
167.327 150.022
164.507 140.569
171.327 149.942
165.239 152.702
183.085 154.661
171.95 153.786
174.002 167.921
166.624 151.533
182.565 179.935
162.68 159.722
176.602 156.981
166.491 140.319
160.606 146.1
165.107 146.239
161.954 150.585
167.873 165.214
166.609 142.302
173.122 153.388
175.652 161.18
168.526 161.9
169.741 157.878
172.891 159.627
164.601 160.437
177.082 163.613
178.422 158.842
172.036 156.087
193.947 167.501
179.175 165.713
168.877 160.642
166.378 127.266
167.678 161.888
164.983 149.534
181.288 166.717
163.022 153.717
169.189 134.99
164.707 140.198
180.462 159.705
155.814 136.32
166.375 153.293
168.781 148.905
173.194 142.866
174.609 162.586
167.842 161.075
179.891 166.427
173.148 150.816
194.677 171.654
154.917 146.321
176.206 163.79
159.549 153.694
162.02 145.261
189.851 158.514
187.448 162.527
151.438 143.422
167.772 158.17
169.342 171.274
148.683 137.884
169.512 155.566
173.933 155.728
172.173 152.492
150.056 144.299
181.077 165.689
172.445 154.298
169.381 144.867
162.461 146.984
177.12 156.568
179.183 171.388
165.179 157.72
170.896 146.538
178.27 169.953
150.455 133.788
Bonus: an example matrix, some possible eigenvectors, and step-by-step calculation of the
eigenvectors.
A = [[2, 1],
[1, 2]]
This matrix represents a linear transformation in two dimensions. Eigenvectors of this matrix will
tell us the directions (along which lines) this transformation stretches or shrinks vectors, and the
corresponding eigenvalues will tell us by how much it stretches or shrinks along those
directions.
Possible Eigenvectors:
● v1 = [1, 1]
● v2 = [-1, 1]
Calculating Eigenvectors:
1. For v1 = [1, 1]:
○ Substitute v1 into (A - λI) v = 0.
○ Solve the system of equations:
■ (2 - λ) * 1 + 1 * 1 = 0
■ 1 * 1 + (2 - λ) * 1 = 0
○ Solving, we get λ = 3.
2. For v2 = [-1, 1]:
○ Substitute v2 into (A - λI) v = 0.
○ Solve the system of equations:
■ (2 - λ) * (-1) + 1 * 1 = 0
■ 1 * (-1) + (2 - λ) * 1 = 0
○ Solving, we get λ = 1.
Therefore:
● λ1 = 3 with eigenvector v1 = [1, 1]
● λ2 = 1 with eigenvector v2 = [-1, 1]
Key Idea
The relationship between a matrix, its eigenvalues, and eigenvectors is represented by the
following equation:
A*v=λ*v
where:
● A is the matrix
● v is an eigenvector
● λ is the corresponding eigenvalue
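A quick numerical check of this relationship for the matrix and eigenvectors worked out above (a small NumPy sketch):
import numpy as np

A = np.array([[2, 1], [1, 2]])
v1 = np.array([1, 1])     # eigenvector for lambda1 = 3
v2 = np.array([-1, 1])    # eigenvector for lambda2 = 1

print(A @ v1, 3 * v1)     # [3 3] [3 3]     -> A*v1 = 3*v1
print(A @ v2, 1 * v2)     # [-1 1] [-1 1]   -> A*v2 = 1*v2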
Constructing a matrix from given eigenvalues and eigenvectors
A = P * D * P^-1
where P holds the eigenvectors as columns, D holds the corresponding eigenvalues on its
diagonal, and P^-1 is the inverse of P.
Given:
● Eigenvalues:
○ λ1 = 2
○ λ2 = 5
● Corresponding Eigenvectors:
○ v1 = [1, 1]
○ v2 = [2, -1]
1. Create the eigenvector matrix (P): place the eigenvectors as columns of this matrix:
P = [[1, 2],
[1, -1]]
2. Create the diagonal eigenvalue matrix (D): place the eigenvalues on the diagonal, and set the
other entries to zero:
D = [[2, 0],
[0, 5]]
3. Compute the inverse of the eigenvector matrix:
P^-1 = [[1/3, 2/3],
[1/3, -1/3]]
4. Multiply the three matrices, A = P * D * P^-1:
A = [[4, -2],
[-1, 3]]
Result:
The constructed matrix A is:
[[4, -2],
[-1, 3]]
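The construction above can be checked numerically (a small NumPy sketch):
import numpy as np

P = np.array([[1, 2], [1, -1]])   # eigenvectors as columns
D = np.diag([2, 5])               # eigenvalues on the diagonal
A = P @ D @ np.linalg.inv(P)
print(A)                          # [[ 4. -2.]  [-1.  3.]]
print(np.linalg.eigvals(A))       # recovers the eigenvalues 2 and 5 (in some order)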
The matrices P and D formed from eigenvectors and eigenvalues are not always unique.
Uniqueness Conditions: the decomposition A = P * D * P^-1 is unique only up to the ordering of
the eigenvalues (together with the matching columns of P) and the scaling of the eigenvectors;
when an eigenvalue is repeated, there is extra freedom in choosing its eigenvectors.
Example:
A = [[1, 0],
[0, 1]]
This matrix represents the identity transformation (it leaves vectors unchanged). It has an
eigenvalue of 1 with two possible eigenvectors:
● v1 = [1, 0]
● v2 = [0, 1]
However, we can also use any non-zero scalar multiples of these eigenvectors, like:
● 2v1 = [2, 0]
● 3v2 = [0, 3]
If we construct the matrices using different combinations of eigenvectors (or their scalar
multiples), we'll end up with different-looking matrices even though they all represent the identity
transformation:
Using v1 and v2:
P = [[1, 0],
[0, 1]]
D = [[1, 0],
[0, 1]]
P^-1 = [[1, 0],
[0, 1]]
A = P * D * P^-1 = [[1, 0],
[0, 1]]
Using 2v1 and 3v2:
P = [[2, 0],
[0, 3]]
D = [[1, 0],
[0, 1]]
P^-1 = [[1/2, 0],
[0, 1/3]]
A = P * D * P^-1 = [[1, 0],
[0, 1]]
Both resulting matrices, A in both cases, are equal and represent the same identity
transformation, but they look different due to the different choices of eigenvectors (and their
scalings) used in the construction process.
While the core transformation captured by the eigenvalues and eigenvectors is unique, the
specific representation of the matrix using these components can have variations in ordering
and scaling.
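This can also be verified directly (a small NumPy sketch using the two eigenvector choices above):
import numpy as np

D = np.eye(2)                      # both eigenvalues are 1
P_a = np.array([[1, 0], [0, 1]])   # using v1 and v2
P_b = np.array([[2, 0], [0, 3]])   # using 2*v1 and 3*v2

A_a = P_a @ D @ np.linalg.inv(P_a)
A_b = P_b @ D @ np.linalg.inv(P_b)
print(np.allclose(A_a, A_b))       # True: both reconstruct the identity matrix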
A linear discriminant is a line (or a hyperplane in higher dimensions) that aims to separate data
points belonging to different classes. The main purpose of a linear discriminant is to make
classification decisions easier. For example, if a new data point falls on one side of the line, we
classify it as belonging to one class; if it falls on the other side, we classify it as belonging to the
other class.
How to Choose a Good Linear Discriminant
A good linear discriminant should maximize the separation between classes while minimizing
the spread within each class. Here's a general approach, frequently seen in techniques such as
Linear Discriminant Analysis (LDA):
1. Calculate Scatter:
○ Within-Class Scatter (Sw): Measures how spread out the data points are within
each class. We want to minimize this.
○ Between-Class Scatter (Sb): Measures the distance between the mean vectors
of each class. We want to maximize this.
2. Find Projection Direction: Find a direction (line) onto which we project the data that
maximizes the ratio of the between-class scatter to the within-class scatter (Sb / Sw).
This direction will be our linear discriminant.
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
class1 = np.random.randn(50, 2) + [2, 2]    # illustrative two-class data (assumed;
class2 = np.random.randn(50, 2) + [-2, -2]  # the original Colab data is not shown)
plt.scatter(class1[:, 0], class1[:, 1], label='Class 1')
plt.scatter(class2[:, 0], class2[:, 1], label='Class 2')
plt.plot([-5, 5], [5, -5], 'k--', label='Linear discriminant')  # dashed separating line
plt.legend()
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Data with Linear Discriminant')
plt.show()
The dashed line in the visualization represents the linear discriminant: it separates the two
classes, and projecting the data onto the direction perpendicular to it gives the one-dimensional
view in which the two classes are best separated.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
1. Import the necessary libraries for data loading, dimensionality reduction, and plotting
(a sketch of the complete snippet follows this list).
2. Load the Iris dataset using sklearn.datasets.
3. Define the features (X) and target labels (y).
4. Create an LDA instance (LinearDiscriminantAnalysis) with the desired number of
components (2 for visualization).
5. Use fit_transform on the LDA object to transform the data onto the LDA subspace
(X_lda).
6. We can also perform PCA using PCA for comparison.
7. Use matplotlib to create scatter plots of the transformed data points, colored based
on their class labels.
8. The first plot shows the data projected onto the two LDA components (LD1 and LD2).
9. The optional second plot shows the data projected onto the first two principal
components using PCA (PC1 and PC2).
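Since only the imports are shown above, here is a minimal sketch of the rest of the snippet described by the steps above; the plot styling and figure layout are assumptions:
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Load the Iris dataset: features X and class labels y
iris = load_iris()
X, y = iris.data, iris.target

# LDA projection onto two components (supervised: uses the labels y)
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

# PCA projection onto two components for comparison (unsupervised)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Data projected onto the two LDA components, colored by class
plt.figure()
plt.scatter(X_lda[:, 0], X_lda[:, 1], c=y)
plt.xlabel('LD1')
plt.ylabel('LD2')
plt.title('Iris projected onto LDA components')

# Optional: data projected onto the first two principal components
plt.figure()
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Iris projected onto principal components')
plt.show()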
● This is a basic example using two dimensions for visualization. In practice, the number of
components chosen for LDA would depend on the specific data and analysis goals.
● LDA assumes that the data follows a Gaussian distribution within each class. If this
assumption is not met, alternative methods like Support Vector Machines (SVM) might
be more suitable.
1. Scatter Matrices:
● Within-Class Scatter (Sw): This matrix captures the variance within each class. It
measures how "spread out" the data points belonging to the same class are. We want to
minimize this when choosing a good projection.
● Between-Class Scatter (Sb): This matrix captures the variance between the means
of different classes. It measures how "separated" the class means are in the original
space. We want to maximize this when choosing a good projection.
2. Maximizing the Fisher Ratio:
LDA aims to find a linear transformation (projection) that projects the data onto a
lower-dimensional space while maximizing the ratio of the between-class scatter (Sb) to the
within-class scatter (Sw). This ratio is often referred to as the Fisher Ratio. Mathematically,
for a projection direction w, this can be represented as:
J(w) = (w^T * Sb * w) / (w^T * Sw * w)
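A minimal sketch of these quantities for a two-class, two-feature dataset; the data and variable names here are illustrative assumptions:
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic classes in two dimensions
X1 = rng.normal(loc=[2.0, 2.0], size=(50, 2))
X2 = rng.normal(loc=[-2.0, -2.0], size=(50, 2))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class scatter Sw: scatter of each class around its own mean, summed
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Between-class scatter Sb: scatter of the class means relative to each other
diff = (m1 - m2).reshape(-1, 1)
Sb = diff @ diff.T

# For two classes, the direction maximizing J(w) is proportional to Sw^-1 (m1 - m2)
w = np.linalg.solve(Sw, m1 - m2)
w = w / np.linalg.norm(w)
print("Projection direction:", w)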
3. Dimensionality Reduction:
Once the optimal projection direction is found, we can use it to project the data points onto this
new, lower-dimensional space. This reduces the number of features while still maintaining as
much information as possible relevant to class separation.
Visualization:
Imagine a two-dimensional dataset with two classes. The original data points might be scattered
in a way that makes it difficult to separate the classes using a simple decision boundary (e.g., a
line). LDA finds a direction (projection) in the original space that stretches the data along this
direction, effectively separating the classes in the projected space. This might allow for a simpler
classification decision boundary in the lower-dimensional space.
● LDA is primarily used for supervised learning tasks like classification, where class
labels are available.
● It assumes an underlying linear relationship between the features.
● The number of dimensions to which the data is reduced is at most the number of
classes minus one (or the number of features, if that is smaller), because the
between-class scatter only spans that many independent directions.
While both LDA and PCA are dimensionality reduction techniques, they have different
objectives:
● PCA: Aims to capture the maximum overall variance in the data, regardless of class
labels.
● LDA: Specifically targets maximizing class separation, focusing on the variance that
helps distinguish between classes.
Choosing between LDA and PCA depends on the specific problem. If class labels are available
and class separation is the primary concern, LDA can be a good choice. On the other hand, if
class labels are not available or if overall variance and data exploration are more important,
PCA might be a better option.