Wk01 Machine Learning
Sherry Thomas
21f3001449
Contents

Introduction to Machine Learning
    Broad Paradigms of Machine Learning
Representation Learning
    Potential Algorithm
Acknowledgments
Abstract
This week provides an introduction to Machine Learning and subsequently delves into the syllabus with a focus on unsupervised learning. The two primary areas of study covered are representation learning and Principal Component Analysis (PCA).
Introduction to Machine Learning
• What: Machine Learning departs from traditional procedural approaches; instead, it is driven by data analysis. Rather than memorizing specific examples, it seeks to generalize patterns in the data. Machine Learning is not based on magic; it relies on mathematical principles and algorithms.
Representation Learning
Representation learning is a fundamental sub-field of machine learning that is
concerned with acquiring meaningful and compact representations of intricate
data, facilitating various tasks such as dimensionality reduction, clustering, and
classification.
Let us consider a dataset {x1 , x2 , … , x𝑛 }, where each x𝑖 ∈ ℝ𝑑 . The objective is
to find a representation that minimizes the reconstruction error.
We can start by seeking the best linear representation of the dataset, denoted
by w, subject to the constraint ||w|| = 1.
The representation is given by,
$$\frac{(\mathbf{x}_i^T\mathbf{w})}{\mathbf{w}^T\mathbf{w}}\,\mathbf{w}$$

However, since $\|\mathbf{w}\| = 1$,

$$\therefore \text{Projection} = (\mathbf{x}_i^T\mathbf{w})\,\mathbf{w}$$
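As a quick illustration, here is a minimal NumPy sketch of this projection; the vectors x_i and w below are placeholder values, not from the lecture:

```python
import numpy as np

# Placeholder point x_i in R^3 and a direction w (any nonzero vector works)
x_i = np.array([3.0, 1.0, 2.0])
w = np.array([1.0, 1.0, 0.0])
w = w / np.linalg.norm(w)        # enforce the constraint ||w|| = 1

# Projection of x_i onto the line spanned by w: (x_i^T w) w
projection = (x_i @ w) * w
residue = x_i - projection       # the part of x_i that w does not explain
print(projection, residue)
```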
The reconstruction error is

$$f(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n} \|\mathbf{x}_i - (\mathbf{x}_i^T\mathbf{w})\mathbf{w}\|^2$$

Expanding the squared norm, each term equals $\|\mathbf{x}_i\|^2 - (\mathbf{x}_i^T\mathbf{w})^2$, and the first part does not depend on $\mathbf{w}$. Hence minimizing the reconstruction error is equivalent to

$$\min_{\mathbf{w}: \|\mathbf{w}\|=1} \frac{1}{n}\sum_{i=1}^{n} -(\mathbf{x}_i^T\mathbf{w})^2$$

$$\therefore \max_{\mathbf{w}: \|\mathbf{w}\|=1} \frac{1}{n}\sum_{i=1}^{n} (\mathbf{x}_i^T\mathbf{w})^2 = \max_{\mathbf{w}: \|\mathbf{w}\|=1} \mathbf{w}^T\left(\frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i\mathbf{x}_i^T\right)\mathbf{w} = \max_{\mathbf{w}: \|\mathbf{w}\|=1} \mathbf{w}^T\mathbf{C}\mathbf{w}$$

where $\mathbf{C} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i\mathbf{x}_i^T$ is the Covariance Matrix, with $\mathbf{C} \in \mathbb{R}^{d \times d}$.
Notably, the eigenvector w corresponding to the largest eigenvalue 𝜆 of C becomes the sought-after solution for the representation. This w is often referred to as the First Principal Component of the dataset.
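To make this concrete, the following is a small NumPy sketch (not from the lecture) that computes the first principal component as the top eigenvector of the covariance matrix; the dataset X is a randomly generated placeholder:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # placeholder dataset: n = 200 points in R^5
X = X - X.mean(axis=0)                 # center the data

C = (X.T @ X) / X.shape[0]             # covariance matrix, C in R^{d x d}

eigvals, eigvecs = np.linalg.eigh(C)   # eigh: suited to symmetric matrices
w1 = eigvecs[:, np.argmax(eigvals)]    # eigenvector of the largest eigenvalue
print(w1, eigvals.max())               # first principal component and its variance
```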
Potential Algorithm
Based on the above concepts, we can outline the following algorithm for representation learning:
Given a dataset {x1 , x2 , … , x𝑛 } where x𝑖 ∈ ℝ𝑑 ,
1. Center the dataset:

$$\boldsymbol{\mu} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i$$

$$\mathbf{x}_i = \mathbf{x}_i - \boldsymbol{\mu} \quad \forall i$$

2. Find the best representation $\mathbf{w}$ (with $\|\mathbf{w}\| = 1$) and replace each point by its residue:

$$\mathbf{x}_i = \mathbf{x}_i - (\mathbf{x}_i^T\mathbf{w})\mathbf{w} \quad \forall i$$

3. Repeat the previous step on the residues until they become zero, collecting the directions $\{\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_d\}$ along the way.
From the above equations, we observe that we can represent the data using the constants $\{\mathbf{x}_i^T\mathbf{w}_1, \mathbf{x}_i^T\mathbf{w}_2, \ldots, \mathbf{x}_i^T\mathbf{w}_d\}$ along with the vectors $\{\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_d\}$.
Thus, a dataset initially represented as 𝑑 × 𝑛 can now be compressed to 𝑑(𝑑 + 𝑛)
elements, which might seem suboptimal at first glance.
However, if the data resides in a lower-dimensional subspace, the residues can be
reduced to zero without requiring all 𝑑 principal components. Suppose the data
can be adequately represented using only 𝑘 principal components, where 𝑘 ≪ 𝑑.
In that case, the data can be efficiently compressed from 𝑑 × 𝑛 to 𝑘(𝑑 + 𝑛).
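A minimal NumPy sketch of this potential algorithm is shown below; the function name compress and the choice to pass the number of components k as a parameter are illustrative assumptions, not part of the lecture notes:

```python
import numpy as np

def compress(X, k):
    """Greedy sketch of the algorithm above: center the data, then repeatedly
    take the best remaining direction and keep only the residue. X is (n, d)."""
    X = X - X.mean(axis=0)                             # step 1: center the dataset
    residue = X.copy()
    coeffs, components = [], []
    for _ in range(k):
        C = (residue.T @ residue) / residue.shape[0]   # covariance of the residues
        eigvals, eigvecs = np.linalg.eigh(C)
        w = eigvecs[:, -1]                             # best remaining direction
        proj = residue @ w                             # constants x_i^T w
        residue = residue - np.outer(proj, w)          # step 2: keep only the residue
        coeffs.append(proj)
        components.append(w)
    # k * n coefficients plus k * d vectors: k(d + n) numbers instead of d * n
    return np.array(coeffs), np.array(components)
```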
Approximate Representation
The question arises: If the data can be approximately represented by a lower-dimensional subspace, would it suffice to use only those 𝑘 projections? Additionally, how much variance should be covered?
Let us consider a centered dataset {x1, x2, …, x𝑛} where x𝑖 ∈ ℝ𝑑. Let C represent its covariance matrix, and {𝜆1, 𝜆2, …, 𝜆𝑑} be the corresponding eigenvalues, which are non-negative due to the positive semi-definiteness of the covariance matrix. These eigenvalues are arranged in descending order, with {w1, w2, …, w𝑑} as their corresponding eigenvectors of unit length.
The eigen equation for the covariance matrix can be expressed as follows:
$$\mathbf{C}\mathbf{w} = \lambda\mathbf{w}$$

$$\mathbf{w}^T\mathbf{C}\mathbf{w} = \mathbf{w}^T\lambda\mathbf{w}$$

$$\therefore \lambda = \mathbf{w}^T\mathbf{C}\mathbf{w} \qquad \{\mathbf{w}^T\mathbf{w} = 1\}$$

$$\lambda = \frac{1}{n}\sum_{i=1}^{n} (\mathbf{x}_i^T\mathbf{w})^2$$
Hence, since the mean of the dataset is zero, 𝜆 represents the variance captured by the eigenvector w.
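This claim can be checked numerically; the sketch below (with a placeholder random dataset) verifies that the mean squared projection onto an eigenvector equals its eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))  # placeholder data in R^4
X = X - X.mean(axis=0)                                   # centered dataset

C = (X.T @ X) / X.shape[0]
eigvals, eigvecs = np.linalg.eigh(C)
w = eigvecs[:, -1]                         # unit eigenvector of the largest eigenvalue

variance_along_w = np.mean((X @ w) ** 2)   # (1/n) * sum_i (x_i^T w)^2
print(np.isclose(variance_along_w, eigvals[-1]))   # True: lambda is the captured variance
```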
A commonly accepted heuristic suggests that PCA should capture at least 95%
of the variance. If the first 𝑘 eigenvectors capture the desired variance, it can
be stated as:
$$\frac{\displaystyle\sum_{j=1}^{k} \lambda_j}{\displaystyle\sum_{i=1}^{d} \lambda_i} \geq 0.95$$
Thus, the higher the variance captured, the lower the error incurred.
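A small helper (hypothetical, not from the lecture) that picks the smallest k meeting such a variance threshold could look like this:

```python
import numpy as np

def choose_k(eigvals, threshold=0.95):
    """Smallest k such that the top-k eigenvalues capture at least `threshold`
    of the total variance. `eigvals` must be sorted in descending order."""
    ratios = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.searchsorted(ratios, threshold) + 1)

# Example with made-up eigenvalues (descending); total variance = 8.0
eigvals = np.array([5.0, 2.5, 0.3, 0.15, 0.05])
print(choose_k(eigvals))   # -> 3, since the top two capture only 93.75% of the variance
```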
P.C.A. Algorithm
The Principal Component Analysis algorithm can be summarized as follows for a
centered dataset {x1 , x2 , … , x𝑛 } where x𝑖 ∈ ℝ𝑑 , and C represents its covariance
matrix:
• Step 1: Find the eigenvalues and eigenvectors of C. Let {𝜆1 , 𝜆2 , … , 𝜆𝑑 }
be the eigenvalues arranged in descending order, and {w1 , w2 , … , w𝑑 } be
their corresponding eigenvectors of unit length.
• Step 2: Calculate 𝑘, the number of top eigenvalues and eigenvectors
required, based on the desired variance to be covered.
• Step 3: Project the data onto the eigenvectors and obtain the desired representation as a linear combination of these projections (a code sketch of the full procedure follows below).
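Putting the three steps together, a minimal NumPy sketch of the algorithm might look as follows; it assumes the data matrix X is already centered, and the function name pca is illustrative:

```python
import numpy as np

def pca(X, variance_threshold=0.95):
    """Sketch of the three steps above. X has shape (n, d) and is assumed to be
    already centered. Returns the top-k components, projections and reconstruction."""
    n = X.shape[0]
    C = (X.T @ X) / n                              # covariance matrix

    # Step 1: eigenvalues/eigenvectors of C, sorted in descending order
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Step 2: smallest k capturing the desired fraction of the variance
    ratios = np.cumsum(eigvals) / np.sum(eigvals)
    k = int(np.searchsorted(ratios, variance_threshold) + 1)

    # Step 3: project onto the top-k eigenvectors and reconstruct
    W = eigvecs[:, :k]                             # (d, k) principal components
    scores = X @ W                                 # (n, k) projections x_i^T w_j
    X_hat = scores @ W.T                           # linear combination of the projections
    return W, scores, X_hat
```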
Figure 1: The dataset depicted in the diagram has two principal components:
the green vector represents the first PC, whereas the red vector corresponds to
the second PC.
Acknowledgments
Professor Arun Rajkumar: The content, including the concepts and notations presented in this document, has been sourced from his slides and lectures.