
Week-1: Dimensionality Reduction with PCA

Sherry Thomas
21f3001449

Contents
Introduction to Machine Learning
    Broad Paradigms of Machine Learning
Representation Learning
    Potential Algorithm
Principal Component Analysis
    Approximate Representation
    P.C.A. Algorithm
Acknowledgments

Abstract
This week provides an introduction to Machine Learning and subsequently delves into the syllabus with a focus on unsupervised learning. The two primary areas of study covered are representation learning and Principal Component Analysis (PCA).

Introduction to Machine Learning


Machine Learning is a sub-field of artificial intelligence concerned with the design of algorithms and statistical models that allow computers to learn from and make predictions or decisions based on data, without being explicitly programmed. It utilizes mathematical optimization, algorithms, and computational models to analyze and understand patterns in data and make predictions about future outcomes.
It can be further explained as follows:
• Why: Machine Learning is used to automate tasks that would otherwise require human intelligence, to process vast amounts of data, and to make predictions or decisions with greater accuracy than traditional approaches. Its popularity has also surged in recent years.
• Where: Machine Learning is applied in various fields such as computer vision, natural language processing, finance, and healthcare, among others.

• What: Machine Learning departs from traditional procedural approaches; instead, it is driven by data analysis. Rather than memorizing specific examples, it seeks to generalize patterns in the data. Machine Learning is not based on magic; rather, it relies on mathematical principles and algorithms.

Broad Paradigms of Machine Learning


1. Supervised Learning: Supervised Machine Learning is a type of machine
learning where the algorithm is trained on a labeled dataset, meaning that
the data includes both inputs and their corresponding outputs. The goal
of supervised learning is to build a model that can accurately predict the
output for new, unseen input data. A few examples:
• Linear regression for predicting a continuous output
• Logistic regression for binary classification problems
• Decision trees for non-linear classification and regression problems
• Support Vector Machines for binary and multi-class classification problems
• Neural Networks for complex non-linear problems in various domains such
as computer vision, natural language processing, and speech recognition
2. Unsupervised Learning: Unsupervised Machine Learning is a type of
machine learning where the algorithm is trained on an unlabeled dataset,
meaning that only the inputs are provided and no corresponding outputs.
The goal of unsupervised learning is to uncover patterns or relationships
within the data without any prior knowledge or guidance. A few examples:
• Clustering algorithms such as K-means, hierarchical clustering, and
density-based clustering, used to group similar data points together into
clusters
• Dimensionality reduction techniques such as Principal Component Anal-
ysis (PCA), used to reduce the number of features in a dataset while
preserving the maximum amount of information
• Anomaly detection algorithms used to identify unusual data points that
deviate from the normal patterns in the data
3. Sequential Learning: Sequential Machine Learning (also known as time-
series prediction) is a type of machine learning that is focused on making
predictions based on sequences of data. It involves training the model on
a sequence of inputs, such that the predictions for each time step depend
on the previous time steps. A few examples:
• Time series forecasting, used to predict future values based on past trends
and patterns in data such as stock prices, weather patterns, and energy
consumption
• Speech recognition, used to transcribe speech into text by recognizing
patterns in audio signals
• Natural language processing, used to analyze and make predictions about
sequences of text data

Representation Learning
Representation learning is a fundamental sub-field of machine learning that is
concerned with acquiring meaningful and compact representations of intricate
data, facilitating various tasks such as dimensionality reduction, clustering, and
classification.
Let us consider a dataset {x1 , x2 , … , x𝑛 }, where each x𝑖 ∈ ℝ𝑑 . The objective is
to find a representation that minimizes the reconstruction error.
We can start by seeking the best linear representation of the dataset, denoted
by w, subject to the constraint ||w|| = 1.
The representation is given by,

$$\left(\frac{\mathbf{x}_i^T \mathbf{w}}{\mathbf{w}^T \mathbf{w}}\right)\mathbf{w}$$

However, since $\|\mathbf{w}\| = 1$,

$$\therefore \text{Projection} = (\mathbf{x}_i^T \mathbf{w})\,\mathbf{w}$$

The reconstruction error is computed as follows,

$$\text{Reconstruction Error } f(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n} \left\|\mathbf{x}_i - (\mathbf{x}_i^T \mathbf{w})\mathbf{w}\right\|^2$$

where $\mathbf{x}_i - (\mathbf{x}_i^T \mathbf{w})\mathbf{w}$ is termed the residue and can be represented as $\mathbf{x}_i'$.


The primary aim is to minimize the reconstruction error. Expanding the squared norm and dropping the constant term $\frac{1}{n}\sum_{i=1}^{n}\|\mathbf{x}_i\|^2$, this leads to the following optimization formulation:

$$\min_{\mathbf{w}:\,\|\mathbf{w}\|=1} f(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n} -(\mathbf{x}_i^T\mathbf{w})^2$$

$$\therefore \max_{\mathbf{w}:\,\|\mathbf{w}\|=1} f(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n} (\mathbf{x}_i^T\mathbf{w})^2 = \mathbf{w}^T\left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i^T\right)\mathbf{w}$$

$$\max_{\mathbf{w}:\,\|\mathbf{w}\|=1} f(\mathbf{w}) = \mathbf{w}^T \mathbf{C}\,\mathbf{w}$$

where $\mathbf{C} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i^T$ represents the Covariance Matrix, and $\mathbf{C} \in \mathbb{R}^{d\times d}$.
Notably, the eigenvector $\mathbf{w}$ corresponding to the largest eigenvalue $\lambda$ of $\mathbf{C}$ becomes the sought-after solution for the representation. This $\mathbf{w}$ is often referred to as the First Principal Component of the dataset.
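To make this concrete, here is a minimal NumPy sketch (an illustration, not part of the original notes) that recovers the first principal component as the top eigenvector of the covariance matrix, assuming the data matrix X of shape n × d has already been centered:

```python
import numpy as np

def first_principal_component(X):
    """Return the unit-norm direction w that maximizes w^T C w.

    Assumes X is an (n, d) NumPy array whose columns have zero mean.
    """
    n = X.shape[0]
    C = X.T @ X / n                       # covariance matrix C = (1/n) * sum_i x_i x_i^T
    eigvals, eigvecs = np.linalg.eigh(C)  # eigh: C is symmetric positive semi-definite
    return eigvecs[:, -1]                 # eigenvector of the largest eigenvalue

# Illustrative usage with synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)                    # center the dataset
w1 = first_principal_component(X)
print(np.linalg.norm(w1))                 # ~1.0, since eigenvectors have unit length
```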

Potential Algorithm
Based on the above concepts, we can outline the following algorithm for representation learning:
Given a dataset {x1 , x2 , … , x𝑛 } where x𝑖 ∈ ℝ𝑑 ,
1. Center the dataset:
$$\boldsymbol{\mu} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i$$
$$\mathbf{x}_i \leftarrow \mathbf{x}_i - \boldsymbol{\mu} \quad \forall i$$

2. Find the best representation w ∈ ℝ𝑑 with ||w|| = 1.


3. Update the dataset with the representation:

$$\mathbf{x}_i \leftarrow \mathbf{x}_i - (\mathbf{x}_i^T\mathbf{w})\mathbf{w} \quad \forall i$$

4. Repeat steps 2 and 3 until the residues become zero, resulting in $\mathbf{w}_2, \mathbf{w}_3, \ldots, \mathbf{w}_d$.
The question arises: Is this the most effective approach, and how many w do
we need to achieve optimal compression?
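As a rough sketch of this procedure (assuming, as in the previous section, that the best unit-norm direction at each step is the top eigenvector of the residues' covariance matrix), the potential algorithm might look like this:

```python
import numpy as np

def potential_algorithm(X, tol=1e-10):
    """Iteratively extract directions w_1, w_2, ... until the residues vanish.

    X: (n, d) data matrix; returns a list of unit-norm direction vectors.
    """
    residue = X - X.mean(axis=0)                # step 1: center the dataset
    directions = []
    for _ in range(residue.shape[1]):           # at most d directions are needed
        if np.allclose(residue, 0, atol=tol):   # step 4: stop once residues are zero
            break
        C = residue.T @ residue / residue.shape[0]
        _, eigvecs = np.linalg.eigh(C)
        w = eigvecs[:, -1]                      # step 2: best unit-norm representation
        directions.append(w)
        residue -= np.outer(residue @ w, w)     # step 3: subtract the projections
    return directions
```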

Principal Component Analysis


Principal Component Analysis (PCA) is a powerful technique employed to reduce the dimensionality of a dataset by identifying its most important features,
known as principal components, which explain the maximum variance present
in the data. PCA achieves this by transforming the original dataset into a
new set of uncorrelated variables, ordered by their significance in explaining
the variance. This process is valuable for visualizing high-dimensional data and
preprocessing it before conducting machine learning tasks.
Following the potential algorithm mentioned earlier and utilizing the set of
eigenvectors {w1 , w2 , … , w𝑑 }, we can express each data point x𝑖 as a linear
combination of the projections on these eigenvectors:

$$\forall i \quad \mathbf{x}_i - \left((\mathbf{x}_i^T\mathbf{w}_1)\mathbf{w}_1 + (\mathbf{x}_i^T\mathbf{w}_2)\mathbf{w}_2 + \dots + (\mathbf{x}_i^T\mathbf{w}_d)\mathbf{w}_d\right) = 0$$

$$\therefore \mathbf{x}_i = (\mathbf{x}_i^T\mathbf{w}_1)\mathbf{w}_1 + (\mathbf{x}_i^T\mathbf{w}_2)\mathbf{w}_2 + \dots + (\mathbf{x}_i^T\mathbf{w}_d)\mathbf{w}_d$$

From the above equation, we observe that we can represent the data using
constants {x𝑇𝑖 w1 , x𝑇𝑖 w2 , … , x𝑇𝑖 w𝑑 } along with vectors {w1 , w2 , … , w𝑑 }.
Thus, a dataset initially represented as 𝑑 × 𝑛 can now be compressed to 𝑑(𝑑 + 𝑛)
elements, which might seem suboptimal at first glance.
However, if the data resides in a lower-dimensional subspace, the residues can be
reduced to zero without requiring all 𝑑 principal components. Suppose the data
can be adequately represented using only 𝑘 principal components, where 𝑘 ≪ 𝑑.
In that case, the data can be efficiently compressed from 𝑑 × 𝑛 to 𝑘(𝑑 + 𝑛).
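The compression argument can be sketched in a few lines of NumPy (the helper names here are hypothetical, and W_k is assumed to hold the first k unit-norm eigenvectors as its columns): storing only the k scores per point plus the k basis vectors costs k(d + n) numbers instead of the original d × n.

```python
import numpy as np

def compress(X, W_k):
    """Keep only k(d + n) numbers: the scores (n, k) and the basis W_k (d, k)."""
    return X @ W_k, W_k                  # scores[i, j] = x_i^T w_j

def reconstruct(scores, W_k):
    """Rebuild each x_i as the sum of its projections onto w_1, ..., w_k."""
    return scores @ W_k.T                # exact if the data lies in span(W_k)
```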

Approximate Representation
The question arises: If the data can be approximately represented by a lower-
dimensional subspace, would it suffice to use only those 𝑘 projections? Additionally, how much variance should be covered?
Let us consider a centered dataset {x1 , x2 , … , x𝑛 } where x𝑖 ∈ ℝ𝑑 . Let C represent its covariance matrix, and {𝜆1 , 𝜆2 , … , 𝜆𝑑 } be the corresponding eigenvalues, which are non-negative due to the positive semi-definiteness of the covariance matrix. These eigenvalues are arranged in descending order, with {w1 , w2 , … , w𝑑 } as their corresponding eigenvectors of unit length.
The eigen equation for the covariance matrix can be expressed as follows:

$$\mathbf{C}\mathbf{w} = \lambda\mathbf{w}$$
$$\mathbf{w}^T\mathbf{C}\mathbf{w} = \mathbf{w}^T\lambda\mathbf{w}$$
$$\therefore \lambda = \mathbf{w}^T\mathbf{C}\mathbf{w} \qquad \{\because \mathbf{w}^T\mathbf{w} = 1\}$$
$$\lambda = \frac{1}{n}\sum_{i=1}^{n}(\mathbf{x}_i^T\mathbf{w})^2$$

Hence, since the mean of the dataset is zero, 𝜆 represents the variance captured by the eigenvector w.
A commonly accepted heuristic suggests that PCA should capture at least 95%
of the variance. If the first 𝑘 eigenvectors capture the desired variance, it can
be stated as:
$$\frac{\displaystyle\sum_{j=1}^{k}\lambda_j}{\displaystyle\sum_{i=1}^{d}\lambda_i} \geq 0.95$$

Thus, the higher the variance captured, the lower the error incurred.
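A small sketch of this heuristic (illustrative only): given the eigenvalues sorted in descending order, pick the smallest k whose cumulative share of the total variance reaches 95%.

```python
import numpy as np

def choose_k(eigvals, threshold=0.95):
    """Smallest k such that the top-k eigenvalues cover `threshold` of the variance.

    eigvals: eigenvalues of the covariance matrix, sorted in descending order.
    """
    ratios = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.searchsorted(ratios, threshold) + 1)

# Example: eigenvalues [4.0, 2.5, 0.3, 0.2]; the first two cover 6.5/7.0 ≈ 93%,
# so k = 3 is needed to meet the 95% heuristic.
print(choose_k(np.array([4.0, 2.5, 0.3, 0.2])))   # prints 3
```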

P.C.A. Algorithm
The Principal Component Analysis algorithm can be summarized as follows for a
centered dataset {x1 , x2 , … , x𝑛 } where x𝑖 ∈ ℝ𝑑 , and C represents its covariance
matrix:
• Step 1: Find the eigenvalues and eigenvectors of C. Let {𝜆1 , 𝜆2 , … , 𝜆𝑑 }
be the eigenvalues arranged in descending order, and {w1 , w2 , … , w𝑑 } be
their corresponding eigenvectors of unit length.
• Step 2: Calculate 𝑘, the number of top eigenvalues and eigenvectors
required, based on the desired variance to be covered.
• Step 3: Project the data onto the eigenvectors and obtain the desired
representation as a linear combination of these projections.
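Putting the three steps together, a minimal end-to-end sketch (assuming a centered data matrix X and reusing the variance heuristic above) might look as follows:

```python
import numpy as np

def pca(X, variance=0.95):
    """PCA on a centered (n, d) dataset X.

    Returns the top-k eigenvectors W_k and the projected data X @ W_k.
    """
    n = X.shape[0]
    C = X.T @ X / n                                  # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)             # step 1: eigen-decomposition
    order = np.argsort(eigvals)[::-1]                # sort eigenvalues in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratios = np.cumsum(eigvals) / np.sum(eigvals)
    k = int(np.searchsorted(ratios, variance) + 1)   # step 2: pick k by variance covered
    W_k = eigvecs[:, :k]
    return W_k, X @ W_k                              # step 3: project onto the eigenvectors
```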

Figure 1: The dataset depicted in the diagram has two principal components:
the green vector represents the first PC, whereas the red vector corresponds to
the second PC.

In essence, PCA is a dimensionality reduction technique that identifies feature combinations that are decorrelated (linearly uncorrelated with each other).

Acknowledgments
Professor Arun Rajkumar: The content, including the concepts and notations presented in this document, has been sourced from his slides and lectures.
