
Unit-IV

Unsupervised Learning Techniques

Unsupervised learning tasks and algorithms:

Clustering: The goal is to group similar instances together into clusters.


Clustering is a great tool for data analysis, customer segmentation,
recommender systems, search engines, image segmentation,

semi-supervised learning, dimensionality reduction, and more.

Anomaly detection: The objective is to learn what “normal” data looks like, and
then use that to detect abnormal instances, such as defective items on a
production line or a new trend in a time series.

Density estimation: This is the task of estimating the probability density
function (PDF) of the random process that generated the dataset. Density
estimation is commonly used for anomaly detection: instances located in very
low-density regions are likely to be anomalies. It is also useful for data analysis
and visualization.

K-Means Algorithm
The K-means algorithm is an unsupervised machine-learning algorithm for
clustering. We use K-Means to group similar data items based on their
similarities. It has many applications, including anomaly detection, customer
segmentation, and image segmentation.
The K-means algorithm involves the following steps:
1. Choose k initial centroids (for example, at random).
2. Assign each data item to its nearest centroid.
3. Update each centroid to the mean of the data items assigned to it.
4. Repeat steps 2 and 3 until the assignments stop changing.
In doing so, it seeks to reduce the distances between each data item and its
assigned centroid. This is an iterative process, starting with random
assignments and gradually improving the clusters by adjusting the
centroids based on the data points assigned to each cluster.

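As a concrete illustration, here is a minimal sketch using scikit-learn's KMeans class (the data is synthetic, generated with make_blobs, and the cluster count is an illustrative choice):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data: 300 two-dimensional points drawn from 3 blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)      # cluster index assigned to each instance

print(kmeans.cluster_centers_)      # final centroid positions
print(labels[:10])                  # assignments of the first 10 points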
Drawbacks of the K-Means Algorithm

Sensitivity to Initial Conditions


K-means is sensitive to initial conditions. The algorithm randomly initializes
the cluster centroids at the beginning, and the final clustering results can
vary depending on these initial positions.
Different initializations can lead to different local optima, resulting in different
clustering outcomes. This makes the K-means algorithm less reliable and
less reproducible.

Difficulty in Determining K


One of the drawbacks of the K-means algorithm is that we have to set the
number of clusters (k) in advance. Choosing an incorrect number of
clusters can lead to inaccurate results.
Various methods are available to estimate the optimal k, such as
silhouette analysis or the elbow method, but they may not always provide a
clear-cut answer.
If we use too small a k, we get clusters that are too broad. Conversely, an
overly large k results in clusters that are too specific.

It may require multiple runs to find the most suitable value of k, which can
be time- and resource-consuming.
Inability to Handle Categorical Data
Another drawback of the K-means algorithm is its inability to
handle categorical data. The algorithm works with numerical data, where
distances between data points can be calculated. However, categorical
data doesn’t have a natural notion of distance or similarity.
When categorical data is used with the K-means algorithm, it requires
converting the categories into numerical values, such as using one-hot
encoding. One shortcoming of using one-hot encoding is that it treats each
feature independently and can degrade performance since it can
significantly increase data dimensionality.

K-Means++
Instead of randomly selecting the initial centroids, K-means++ uses a probabilistic
approach that biases the initial centroid selection toward points far apart. This helps
to ensure that the centroids are well distributed across the dataset and reduces the
likelihood of converging to suboptimal solutions.
This can result in faster convergence and improved clustering accuracy
compared to the original K-means algorithm.

Centroid Initialization Methods

k-means clustering aims to converge on an optimal set of cluster centers
(centroids) and cluster memberships based on distance from these centroids via
successive iterations. The more optimal the positioning of the initial centroids,
the fewer iterations of the k-means algorithm are required for convergence.
For this reason, some strategic consideration should be given to the initialization
of these initial centroids.
 Random data points: In this approach, described in the "traditional" case
above, k random data points are selected from the dataset and used as the
initial centroids. This approach is highly volatile and can leave the selected
centroids poorly positioned throughout the data space.
 k-means++: As spreading out the initial centroids is thought to be a worthy
goal, k-means++ pursues this by assigning the first centroid to the location of
a randomly selected data point, and then choosing the subsequent centroids
from the remaining data points based on a probability proportional to the
squared distance from a given point's nearest existing centroid. The effect is
an attempt to push the centroids as far from one another as possible,
covering as much of the occupied data space as they can from initialization.

 Randomly select the first centroid from the data points.


 For each data point compute its distance from the nearest, previously
chosen centroid.
 Select the next centroid from the data points such that the probability of
choosing a point as centroid is directly proportional to its distance from
the nearest, previously chosen centroid. (i.e. the point having maximum
distance from the nearest centroid is most likely to be selected next as a
centroid)
 Repeat steps 2 and 3 until k centroids have been sampled

This probability distribution ensures that instances farther away from already
chosen centroids are much more likely to be selected as centroids.
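In scikit-learn, this seeding strategy is exposed through the init parameter of KMeans (it is also the default). A brief sketch, with illustrative data generated by make_blobs:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data: 500 points drawn from 5 blobs
X, _ = make_blobs(n_samples=500, centers=5, random_state=42)

# init="k-means++" applies the seeding strategy described above;
# n_init controls how many independent seedings are tried before keeping the best run.
kmeans_pp = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=42)
kmeans_pp.fit(X)
print(kmeans_pp.inertia_)   # within-cluster sum of squared distances of the best run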

Mini Batch K-Means Algorithm

With the increasing size of the datasets being analyzed, the computation time
of K-means increases because of its constraint of needing the whole dataset in
main memory. For this reason, several methods have been proposed to reduce
the temporal and spatial cost of the algorithm. A different approach is the Mini
batch K-means algorithm.
The main idea of the Mini Batch K-means algorithm is to use small random batches
of data of a fixed size, so that they can be stored in memory. At each iteration, a new
random batch is drawn from the dataset and used to update the clusters; this is
repeated until convergence.

from sklearn.cluster import MiniBatchKMeans

minibatch_kmeans = MiniBatchKMeans(n_clusters=5)

minibatch_kmeans.fit(X)
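If the dataset does not fit in memory, MiniBatchKMeans also offers a partial_fit method, so the model can be updated one batch at a time. A sketch under the assumption that the data arrives as NumPy chunks (the random batches here stand in for batches loaded from disk):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

minibatch_kmeans = MiniBatchKMeans(n_clusters=5, random_state=42)

# Illustrative stream: 10 random batches of 1,000 two-dimensional points each
rng = np.random.default_rng(0)
for _ in range(10):
    X_batch = rng.normal(size=(1000, 2))    # replace with a batch loaded from disk
    minibatch_kmeans.partial_fit(X_batch)   # update the centroids with this batch only

print(minibatch_kmeans.cluster_centers_)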

Finding the optimal number of clusters

So far, we have set the number of clusters k to 5 because it was obvious by


looking at the data that this was the correct number of clusters. But in general, it
will not be so easy to know how to set k, and the result might be quite bad if
you set it to the wrong value.

A more precise approach for choosing the best value for the number of
clusters is to use the silhouette score, which is the mean silhouette coefficient
over all the instances.

An instance’s silhouette coefficient is equal to (b – a) / max(a, b)

where a is the mean distance to the other instances in the same cluster (i.e., the
mean intra-cluster distance)

b is the mean nearest-cluster distance (i.e., the mean distance to the instances of
the next closest cluster).

The silhouette coefficient can vary between –1 and +1.

A coefficient close to +1 means that the instance is well inside its own cluster
and far from other clusters,

while a coefficient close to 0 means that it is close to a cluster boundary,

and finally a coefficient close to –1 means that the instance may have been
assigned to the wrong cluster.
To compute the silhouette score, you can use Scikit-Learn’s silhouette_score()
function, giving it all the instances in the dataset and the labels they were
assigned:

>>> from sklearn.metrics import silhouette_score

>>> silhouette_score(X, kmeans.labels_)

0.655517642572828

Silhouette Analysis

The silhouette coefficient (silhouette score) measures how similar a data point

is to its own cluster (cohesion) compared to other clusters (separation).

The silhouette score can be easily calculated in Python using the metrics

module of the scikit-learn (sklearn) library.

 Select a range of values of k (say 2 to 10; the silhouette coefficient is only defined for k ≥ 2).

 Plot the silhouette coefficient for each value of k.

The equation for calculating the silhouette coefficient for a particular

data point i is:

S(i) = (b(i) − a(i)) / max(a(i), b(i))

where:

 S(i) is the silhouette coefficient of the data point i.

 a(i) is the average distance between i and all the other data points in

the cluster to which i belongs.

 b(i) is the smallest average distance from i to the data points of any other

cluster (i.e., the nearest-cluster distance).

We then calculate the average silhouette score for every value of k, and plot it against k, as sketched below.
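A short sketch of this procedure with scikit-learn (the data is illustrative, generated with make_blobs; note again that the silhouette score is only defined for k ≥ 2):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=5, random_state=42)   # illustrative data

k_values = range(2, 11)             # the silhouette score needs at least 2 clusters
avg_silhouettes = []
for k in k_values:
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    avg_silhouettes.append(silhouette_score(X, labels))

plt.plot(k_values, avg_silhouettes, "o-")
plt.xlabel("k (number of clusters)")
plt.ylabel("average silhouette score")
plt.show()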

Points to Remember While Calculating Silhouette Coefficient:

 The value of the silhouette coefficient is between [-1, 1].

 A score of 1 denotes the best, meaning that the data point i is very

compact within the cluster to which it belongs and far away from the

other clusters.

 The worst value is -1. Values near 0 denote overlapping clusters.


Limits of K-Means

Choosing k manually.

One option is to use a "loss vs. number of clusters" plot (the elbow method) to find a
reasonable value of k.

Being dependent on initial values.

For a low k, you can mitigate this dependence by running k-means several times
with different initial values and picking the best result. As k increases, you need
advanced versions of k-means to pick better values for the initial centroids
(called k-means seeding). For a full discussion of k-means seeding, see "A
Comparative Study of Efficient Initialization Methods for the K-Means
Clustering Algorithm" by M. Emre Celebi, Hassan A. Kingravi, and Patricio A. Vela.

Clustering data of varying sizes and density.

k-means has trouble clustering data where clusters are of varying sizes and
density.

Clustering outliers.

Centroids can be dragged by outliers, or outliers might get their own cluster
instead of being ignored. Consider removing or clipping outliers before
clustering.

Scaling with number of dimensions.

As the number of dimensions increases, a distance-based similarity measure


converges to a constant value between any given examples.

Image Segmentation Using Clustering


Image segmentation is the classification of an image into different groups.
Image segmentation is an important step in image processing, and it is
needed whenever we want to analyze what is inside an image. For
example, if we want to find out whether there is a chair or a person inside an
indoor image, we may need image segmentation to separate the objects and
analyze each object individually to check what it is. Image segmentation usually
serves as a pre-processing step before pattern recognition, feature extraction,
and compression of the image.


The goal of Image segmentation is to change the representation of an


image into something that is more meaningful and easier to analyze.

Image segmentation is the process of partitioning a digital image into
multiple distinct regions, each containing pixels (sets of pixels, also known as
superpixels) with similar attributes.

Image segmentation is typically used to locate objects


and boundaries(lines, curves, etc.) in images. More precisely, Image
Segmentation is the process of assigning a label to every pixel in an image
such that pixels with the same label share certain characteristics.

There are different methods, and one of the most popular is the K-Means
clustering algorithm (a sketch appears at the end of this section).

Object detection and Image Classification by an Autonomous Vehicle


Other examples come from the healthcare industry. In the case of cancer,
even in today's age of technological advancements, the disease can be fatal if
we don't identify it at an early stage. Detecting cancerous cells as quickly
as possible can potentially save millions of lives. The shape of the
cancerous cells plays a vital role in determining the severity of the cancer,
and it can be identified using image classification algorithms.

Breast cancer cells

Image Segmentation involves converting an image into a collection of


regions of pixels that are represented by a mask or a labeled image. By
dividing an image into segments, you can process only the important
segments of the image instead of processing the entire image.
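A brief sketch of colour-based segmentation with K-Means, using one of the sample images bundled with scikit-learn (the image choice and the number of clusters are illustrative):

from sklearn.cluster import KMeans
from sklearn.datasets import load_sample_image

# Load an example RGB image bundled with scikit-learn and flatten it to (n_pixels, 3)
image = load_sample_image("china.jpg") / 255.0
X = image.reshape(-1, 3)

# Cluster the pixel colours into 8 groups and replace each pixel by its centroid colour
kmeans = KMeans(n_clusters=8, n_init=10, random_state=42).fit(X)
segmented = kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape)
print(segmented.shape)   # same height x width x 3 as the original image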
Using Clustering for Preprocessing

Clustering can be an efficient approach to dimensionality reduction, in particular as a preprocessing
step before a supervised learning algorithm. For example, consider the digits dataset, a simple
MNIST-like dataset containing 1,797 grayscale 8 × 8 images representing the digits 0 to 9.

First, load the dataset:

from sklearn.datasets import load_digits

X_digits, y_digits = load_digits(return_X_y=True)

Now, split it into a training set and a test set:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_digits, y_digits)

Next, fit a Logistic Regression model:

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()

log_reg.fit(X_train, y_train)

Let’s evaluate its accuracy on the test set:

>>> log_reg.score(X_test, y_test)
0.9688888888888889

we can do better by using K-Means as a preprocessing step

We will create a pipeline that will first cluster the training set into 50 clusters and replace the images
with their distances to these 50 clusters, then apply a Logistic Regression model

from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("kmeans", KMeans(n_clusters=50)),
    ("log_reg", LogisticRegression()),
])

pipeline.fit(X_train, y_train)

Now let's evaluate this classification pipeline:

>>> pipeline.score(X_test, y_test)
0.9777777777777777

We reduced the error rate by almost 30% (from about 3.1% to about 2.2%)

Using Clustering for Semi-Supervised Learning

Refer textbook

DBSCAN
Density-based spatial clustering of applications with noise (DBSCAN)

Clusters are dense regions in the data space, separated by regions of lower
point density. The DBSCAN algorithm is based on this intuitive
notion of "clusters" and "noise". The key idea is that for each point of a
cluster, the neighborhood of a given radius has to contain at least a minimum
number of points.

Partitioning methods (K-means, PAM clustering) and hierarchical clustering


work for finding spherical-shaped clusters or convex clusters. In other words,
they are suitable only for compact and well-separated clusters. Moreover,
they are also severely affected by the presence of noise and outliers in the
data.
Real life data may contain irregularities, like:
1. Clusters can be of arbitrary shape such as those shown in the figure
below.
2. Data may contain noise.
The figure below shows a data set containing nonconvex clusters and
outliers/noises. Given such data, k-means algorithm has difficulties in
identifying these clusters with arbitrary shapes.
The DBSCAN algorithm requires two parameters:
1. eps: It defines the neighborhood around a data point, i.e., if the distance
between two points is less than or equal to eps, they are considered
neighbors. If the eps value is chosen too small, a large part of the data
will be considered outliers. If it is chosen very large, the clusters
will merge and the majority of the data points will end up in the same cluster.
One way to find the eps value is based on the k-distance graph (see the
sketch after this list).
2. MinPts: The minimum number of neighbors (data points) within the eps radius.
The larger the dataset, the larger the value of MinPts that should be chosen. As a
general rule, the minimum MinPts can be derived from the number of
dimensions D in the dataset as MinPts >= D + 1. MinPts should be at least 3.
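A sketch of the k-distance idea mentioned in the list above, using scikit-learn's NearestNeighbors (the dataset is illustrative, and k is usually set to MinPts):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)   # illustrative data

min_pts = 5
# Distance from every point to its min_pts-th nearest neighbour
neighbors = NearestNeighbors(n_neighbors=min_pts).fit(X)
distances, _ = neighbors.kneighbors(X)
k_distances = np.sort(distances[:, -1])

# Plot the sorted k-distances; a pronounced "elbow" in this curve suggests a reasonable eps
plt.plot(k_distances)
plt.xlabel("points sorted by k-distance")
plt.ylabel("distance to %d-th nearest neighbour" % min_pts)
plt.show()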

In this algorithm, we have 3 types of data points.


Core Point: A point is a core point if it has more than MinPts points within
eps.
Border Point: A point which has fewer than MinPts within eps but it is in the
neighborhood of a core point.
Noise or outlier: A point which is not a core point or border point.
1. Find all the neighboring points within eps of every point, and identify the
core points, i.e., the points with more than MinPts neighbors.
2. For each core point if it is not already assigned to a cluster, create a new
cluster.
3. Find recursively all its density connected points and assign them to the
same cluster as the core point.
Two points a and b are said to be density-connected if there exists a
point c that has a sufficient number of points in its neighborhood and both
a and b are within eps distance of it. This is a chaining process.
4. So, if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in
turn is a neighbor of a, this implies that b is density-connected to a.
5. Iterate through the remaining unvisited points in the dataset. Those points
that do not belong to any cluster are noise.

import numpy as np

from sklearn.cluster import DBSCAN

from sklearn import metrics

from sklearn.datasets import make_blobs

from sklearn.preprocessing import StandardScaler

# Generate sample data

centers = [[1, 1], [-1, -1], [1, -1]]

X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,

random_state=0)
X = StandardScaler().fit_transform(X)

# Compute DBSCAN

db = DBSCAN(eps=0.3, min_samples=10).fit(X)

core_samples_mask = np.zeros_like(db.labels_, dtype=bool)

core_samples_mask[db.core_sample_indices_] = True

labels = db.labels_

# Number of clusters in labels, ignoring noise if present.

n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

n_noise_ = list(labels).count(-1)

print('Estimated number of clusters: %d' % n_clusters_)

print('Estimated number of noise points: %d' % n_noise_)

print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))

print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))

print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))

print("Adjusted Rand Index: %0.3f"

% metrics.adjusted_rand_score(labels_true, labels))

print("Adjusted Mutual Information: %0.3f"

% metrics.adjusted_mutual_info_score(labels_true, labels))

print("Silhouette Coefficient: %0.3f"

% metrics.silhouette_score(X, labels))

# Plot result

import matplotlib.pyplot as plt

%matplotlib inline

# Black removed and is used for noise instead.

unique_labels = set(labels)

colors = [plt.cm.Spectral(each)

for each in np.linspace(0, 1, len(unique_labels))]


for k, col in zip(unique_labels, colors):

    if k == -1:

        # Black used for noise.

        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]

    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),

             markeredgecolor='k', markersize=14)

    xy = X[class_member_mask & ~core_samples_mask]

    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),

             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)

plt.show()
DBSCAN to cluster spherical data

Gaussian Mixtures

Gaussian mixture models (GMMs) are a type of machine


learning algorithm. They are used to classify data into
different categories based on probability distributions.
Gaussian mixture models can be used in many different
areas, including finance, marketing, and many more.
A Gaussian mixture model can be used for clustering, which is the task of
grouping a set of data points into clusters.

GMMs can be used to find clusters in data sets where the clusters may
not be clearly defined.
Additionally, GMMs can be used to estimate the probability that a new
data point belongs to each cluster.

Gaussian mixture models are also relatively robust to outliers, meaning


that they can still yield accurate results even if there are some data
points that do not fit neatly into any of the clusters. This makes GMMs a
flexible and powerful tool for clustering data

Gaussian distributions are assumed for each group and they have means
and covariances which define their parameters.

GMM consists of two parts –


mean vectors (μ) & covariance matrices (Σ).
A Gaussian distribution is defined as a continuous probability distribution
that takes on a bell-shaped curve

Applications of GMM

GMM has many applications, such as density estimation, clustering, and


image segmentation.
For density estimation, GMM can be used to estimate the probability
density function of a set of data points.

For clustering, GMM can be used to group together data points that
come from the same Gaussian distribution.

For image segmentation, GMM can be used to partition an image into


different regions

A Gaussian Mixture is a function that is comprised of


several Gaussians, each identified by k ∈ {1,…, K},
where K is the number of clusters of our dataset. Each
Gaussian k in the mixture is comprised of the following
parameters:
 A mean μ that defines its centre.

 A covariance Σ that defines its width.

 A mixing probability π that defines how big or small the


Gaussian function will be.

A Gaussian mixture model is a probabilistic model that assumes all the data
points are generated from a mixture of a finite number of Gaussian
distributions with unknown parameters.

In a Gaussian mixture model, each cluster is associated


with a multivariate Gaussian distribution, and the
mixture model is a weighted sum of these distributions.
The weights indicate the probability that a data point
belongs to a particular cluster, and the Gaussian
distributions describe the distribution of the data within
each cluster.

In order to fit a Gaussian Mixture Model to a dataset,


the model parameters (i.e., the weights, means, and
covariances of the components) must be estimated from
the data. This is typically done using an iterative
optimization algorithm such as the expectation-
maximization (EM) algorithm.
Once a GMM has been fit to a dataset, it can be used for
a variety of tasks such as density estimation, clustering,
and anomaly detection, which are used in real-life
examples.

To fit a Gaussian Mixture Model (GMM) to a

dataset using Python and the scikit-learn library:
 First, import the GaussianMixture class:
from sklearn.mixture import GaussianMixture
import numpy as np
 Then load the dataset, here stored in a file named 'data.txt':
X = np.loadtxt('data.txt')
 Now, create the Gaussian mixture model:
gmm = GaussianMixture(n_components=3)
 Fit the model to the data:
gmm.fit(X)
 Predict the cluster labels for each data point:
labels = gmm.predict(X)
 Get the model's parameters, i.e., the means and
covariances of the components:
means = gmm.means_
covariances = gmm.covariances_
Here, X is an (n x d) array of n observations with d
dimensions. The GaussianMixture class is used to
fit a GMM with n_components components to the
data. The fit method estimates the model
parameters (i.e., the means and covariances of
the components) using the expectation-
maximization (EM) algorithm. The predict method
can then be used to assign each data point to
one of the n_components clusters. The model
parameters (means and covariances) can be
accessed using the means_ and covariances_
attributes of the GaussianMixture object.

In general, the multivariate Gaussian density function is given by:

N(x | μ, Σ) = 1 / ((2π)^(D/2) · |Σ|^(1/2)) · exp( −(1/2) · (x − μ)ᵀ Σ⁻¹ (x − μ) )

where

x represents our data points,

D is the number of dimensions of each data point, and

μ and Σ are the mean and covariance.

If we have a dataset comprised of N = 1000 three-


dimensional points (D = 3), then x will be a 1000 × 3
matrix. μ will be a 1 × 3 vector, and Σ will be a 3 × 3
matrix

If we differentiate this equation with respect to the mean


and covariance and then equate it to zero, then we will be
able to find the optimal values for these parameters
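Setting these derivatives to zero yields the standard maximum-likelihood estimates for a single Gaussian, stated here for reference in the same notation as above:

μ = (x₁ + x₂ + … + x_N) / N

Σ = (1/N) · [ (x₁ − μ)(x₁ − μ)ᵀ + … + (x_N − μ)(x_N − μ)ᵀ ]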

Covariance Matrix

Covariance matrix is a square matrix that displays the variance exhibited by


elements of datasets and the covariance between a pair of datasets.
Variance is a measure of dispersion and can be defined as the spread of
data from the mean of the given dataset.
Covariance is calculated between two variables and is used to measure
how the two variables vary together.
The variance-covariance matrix is defined as a square matrix where the diagonal
elements represent the variances and the off-diagonal elements represent the
covariances.
The covariance between two variables can be positive, negative, and zero. A
positive covariance indicates that the two variables have a positive relationship
whereas negative covariance shows that they have a negative relationship. If two
elements do not vary together then they will display a zero covariance.
To determine the covariance matrix, the formulas for variance and
covariance are required. Depending upon the type of data available, the
variance and covariance can be found for both sample data and
population data. These formulas are given below.

Covariance Matrix 2 × 2

A 2 × 2 matrix is one which has 2 rows and 2 columns. The formula for a 2
× 2 covariance matrix is given as follows:

[var(x) cov(x,y)

cov(x,y) var(y)]

Covariance Matrix 3 × 3

If there are 3 datasets, x, y, and z, then the formula to find the 3 × 3


covariance matrix is given below:

[var(x) cov(x,y) cov(x,z)

cov(x,y) var(y) cov(y,z)

cov(x,z) cov(y,z) var(z)]
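A quick way to check these formulas is NumPy's np.cov function; the three variables below are illustrative:

import numpy as np

# Three illustrative variables x, y and z, with five observations each
data = np.array([
    [2.1, 2.5, 3.6, 4.0, 4.8],      # x
    [8.0, 10.0, 12.0, 14.0, 16.0],  # y
    [1.0, 0.8, 0.9, 1.2, 1.1],      # z
])

# np.cov treats each row as a variable and returns the 3 x 3 covariance matrix:
# diagonal entries are the variances, off-diagonal entries the pairwise covariances.
# (By default np.cov uses the sample formula with n - 1 in the denominator.)
print(np.cov(data))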

Dimensionality Reduction

Dimensionality
The number of input features, variables, or columns present in a given
dataset is known as dimensionality, and the process to reduce these
features is called dimensionality reduction.

Definition

It is a way of converting a higher-dimensional dataset into a lower-dimensional

dataset while ensuring that it still provides similar information.

Need of Dimensionality Reduction

In many cases a dataset contains a huge number of input features,

which makes the predictive modeling task more complicated. Because it is
very difficult to visualize or make predictions for a training dataset with
a large number of features, dimensionality reduction techniques are
required in such cases.

Application Areas

It is commonly used in the fields that deal with high-dimensional data,


such as speech recognition, signal processing, bioinformatics, etc.
It can also be used for data visualization, noise reduction, cluster
analysis, etc.

The Curse of Dimensionality


Handling high-dimensional data is very difficult in practice, a problem commonly
known as the curse of dimensionality. If the dimensionality of the input
dataset increases, any machine learning algorithm or model becomes
more complex. As the number of features increases, the number of
samples required to cover the feature space also increases, and so does the
chance of overfitting. If a machine learning model is trained on high-
dimensional data, it becomes overfitted and results in poor performance.

Advantages

o By reducing the dimensions of the features, the space required to


store the dataset also gets reduced.
o Less Computation training time is required for reduced dimensions
of features.
o Reduced dimensions of features of the dataset help in visualizing
the data quickly.
o It removes the redundant features (if present) by taking care of
multicollinearity.

Common techniques of Dimensionality Reduction


a. Principal Component Analysis

b. Backward Elimination

c. Forward Selection

d. Score comparison

e. Missing Value Ratio

f. Low Variance Filter

g. High Correlation Filter

h. Random Forest

i. Factor Analysis

j. Auto-Encoder
Principal Component Analysis

Principal Component Analysis is a statistical process that converts the


observations of correlated features into a set of linearly uncorrelated features
with the help of orthogonal transformation. These new transformed features are
called the Principal Components. It is one of the popular tools that is used for
exploratory data analysis and predictive modeling.

The Principal Component Analysis is a popular unsupervised learning technique for reducing
the dimensionality of data. It increases interpretability yet, at the same time, it minimizes
information loss.

It helps to find the most significant features in a dataset and makes the data easy for plotting
in 2D and 3D. PCA helps in finding a sequence of linear combinations of variables.

Principal Components

The Principal Components are a straight line that captures most of the variance of
the data. They have a direction and magnitude. Principal components are orthogonal
projections (perpendicular) of data onto lower-dimensional space.

Variance
The term variance refers to a statistical measurement of the spread
between numbers in a data set. More specifically, variance measures how
far each number in the set is from the mean (average), and thus from
every other number in the set. Variance is denoted by the symbol σ². It is
used by both analysts and traders to determine volatility and market
security

it measures the degree of dispersion of data around the sample's mean.

Scatter plot for the above data


After applying PCA
PCA allows us to go a step further and represent the data as linear
combinations of principal components. Getting principal components is
equivalent to a linear transformation of data from the feature1 x feature2
axis to a PCA1 x PCA2 axis

Each successive principal component explains the variance that is left


after its preceding component, so picking just a few of the first components
sufficiently approximates the original dataset without the need for additional
features

Principal Component analysis fits data in an n-dimensional


ellipsoid so that each axis of the ellipsoid represents a
principal component. The larger the principal component axis
the larger the variability in data it represents.

PCA calculation
There are multiple ways to calculate PCA:

1. Eigendecomposition of the covariance matrix


2. Singular value decomposition of the data matrix
3. Eigenvalue approximation via power iterative
computation
4. Non-linear iterative partial least squares (NIPALS)
computation.
Steps for Calculating PCA
1. Feature standardization. We standardize each feature
to have a mean of 0 and a variance of 1.
2. Compute the covariance matrix. The covariance matrix is a
square matrix of d × d dimensions, where d stands for "dimension" (or feature,
or column, if our data is tabular). It contains the pairwise covariances
between the features.
3. Calculate the eigendecomposition of the covariance
matrix. We calculate the eigenvectors (unit vectors) and
their associated eigenvalues (the scalars by which we
multiply the eigenvectors) of the covariance matrix.
4. Sort the eigenvectors from the highest eigenvalue to
the lowest. The eigenvector with the highest eigenvalue
is the first principal component. Higher eigenvalues
correspond to greater amounts of variance explained.
5. Select the number of principal components. Select the
top N eigenvectors (based on their eigenvalues) to
become the N principal components. The optimal
number of principal components is both subjective and
problem-dependent.
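The steps above can be traced directly in NumPy. A compact sketch, using illustrative random data in place of a real dataset:

import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))          # illustrative data: 100 samples, 4 features

# 1. Standardize each feature to zero mean and unit variance
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# 2. Covariance matrix (d x d); rowvar=False because features are in columns
cov = np.cov(X_std, rowvar=False)

# 3. Eigendecomposition (eigh is suitable for symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort eigenvectors by decreasing eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 5. Keep the top-2 components and project the data onto them
X_pca = X_std @ eigenvectors[:, :2]
print(X_pca.shape)                          # (100, 2)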

Example:

Let us consider the same scenario that we took as an example previously, and
assume four features F1, F2, F3, and F4.
Calculate the mean and standard deviation for each feature and then
tabulate them as follows.

Then, after the Standardization of each variable, the results are tabulated
below.
This is the Standardized data set.

STEP 2: COVARIANCE MATRIX COMPUTATION


 In this step, you will get to know how the variables of the given data are
varying with the mean value calculated.
 Any interrelated variables can also be sorted out at the end of this step.
 To segregate the highly interrelated variables, you calculate the covariance
matrix with the help of the given formula.
Note: a covariance matrix is an N × N symmetric matrix that contains
the covariances of all possible pairs of variables.

The covariance matrix of two-dimensional data is given as follows:

[ COV(x, x)  COV(x, y)
  COV(y, x)  COV(y, y) ]

Make a note that the covariance of a variable with itself is its variance
(COV(X, X) = Var(X)), so the top-left and bottom-right entries hold the
variances of the two variables.

Likewise, the matrix is symmetric about the main diagonal because covariance
is commutative (COV(X, Y) = COV(Y, X)).

If a covariance entry (an off-diagonal value) is positive, it indicates that
the two variables are positively correlated (if X increases, Y also increases
and vice versa).

If a covariance entry is negative, it indicates that the variables are
inversely correlated (if X increases, Y decreases and vice versa).

As a result, at the end of this step, you will know which pairs of
variables are correlated with each other, so that you can categorize
them much more easily.

Example:

So, continuing with the same example,

The formula to calculate the covariance matrix of the given example will
be:

Since you have already standardized the features, you can consider Mean
= 0 and Standard Deviation=1 for each feature.

VAR(F1) = ((−1.0695 − 0)² + (0.5347 − 0)² + (−1.0695 − 0)²
+ (0.5347 − 0)² + (1.0695 − 0)²) / 5

On solving the equation, you get VAR(F1) = 0.78.

COV(F1, F2) = ((−1.0695 − 0)(0.8196 − 0) + (0.5347 − 0)(−1.6393 − 0) + (−1.0695 − 0)(0.0000 − 0)
+ (0.5347 − 0)(0.0000 − 0) + (1.0695 − 0)(0.8196 − 0)) / 5

On solving the equation, you get COV(F1, F2) = −0.8586.

Similarly, solving for all pairs of features, the covariance matrix will be:

STEP 4: FEATURE VECTOR


1. To determine the principal components of the variables, you have to compute
the eigenvalues and eigenvectors.
Let A be any square matrix. A non-zero vector v is an eigenvector of A if

Av = λv

for some number λ, called the corresponding eigenvalue.

2. Once you have computed the eigenvectors, arrange the eigenvalues
in descending order (for all variables); this gives you the list of
principal components.

3. The eigenvectors represent the directions of the principal components, and the
corresponding eigenvalues indicate how much of the data's variance each
component explains.

4. If a component's direction captures a large variance, then many data points
lie close to that line, and the line therefore carries more information.

5. Finally, these principal components form a new set of axes on which the data
can be evaluated more easily, and the differences between the observations can
also be monitored more easily.
Example:

Let ν be a non-zero vector and λ a scalar.

As per the rule Aν = λν, λ is called the eigenvalue associated with the eigenvector ν of A.

Upon substituting the values in det(A − λI) = 0, you will get the characteristic equation.

When you solve this equation (with 0 on the right-hand side), you obtain the eigenvalues:

λ = 2.11691, 0.855413, 0.481689, 0.334007

Then, substitute each eigen value in (A-λI)ν=0 equation and solve the
same for different eigen vectors v1, v2, v3 and v4.

For instance,

For λ = 2.11691, solving the above equation using Cramer's rule, the
values for the v vector are
v1 = 0.515514
v2 = -0.616625
v3 = 0.399314
v4 = 0.441098

Follow the same process for the other eigenvalues, and you will form a matrix
whose columns are the eigenvectors calculated as instructed.
Now arrange the eigenvectors in descending order of their eigenvalues and pick
the topmost ones. These are your principal components.

STEP 5: RECAST THE DATA ALONG THE


PRINCIPAL COMPONENTS AXES
 Until now, apart from standardization, you haven't made any changes to the
original data. You have just selected the principal components and formed a
feature vector, yet the initial data remains on its original axes.
 This step aims at reorienting the data from the original axes to the ones
you have calculated from the principal components.
This can be done by the following formula.

Final Data Set= Standardized Original Data Set * FeatureVector

Example:

So, in our guide, the final data set becomes

Standardized Original Data Set =



FeatureVector =

By solving the above equations, you will get the transformed data as
follows.

Did you notice something? Your large dataset is now compressed into a
smaller dataset with very little loss of information. This is the significance of
Principal Component Analysis.

PCA using scikit learn


 To implement PCA in scikit-learn, it is essential to standardize/normalize
the data before applying PCA.
 PCA is imported from sklearn.decomposition. We need to select the
required number of principal components.
 Usually, n_components is chosen to be 2 for better visualization, but it
depends on the data.
 The attributes are passed through the fit and transform methods.
 The values of the principal components can be checked using components_,
while the variance explained by each principal component can be
obtained using explained_variance_ratio_.

1. Import all the libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

2. Loading Data
Load the breast_cancer dataset from sklearn.datasets. The dataset has 569
data items with 30 input attributes. There are two output classes: benign and
malignant. With 30 input features, it is impossible to visualize this data
directly.
from sklearn.datasets import load_breast_cancer
data=load_breast_cancer()
data.keys()

# Check the output classes


print(data['target_names'])

# Check the input attributes


print(data['feature_names'])

Output:
3. Apply PCA
 Standardize the dataset prior to PCA.
 Import PCA from sklearn.decomposition.
 Choose the number of principal components.

Set it to 3. After executing this code, we see that the dimensions
of x are (569, 3), while the dimension of the actual data is (569, 30). Thus, it is
clear that with PCA, the number of dimensions has been reduced from 30 to 3. If
we choose n_components=2, the dimensions would be reduced to 2.

# construct a dataframe using pandas


df1=pd.DataFrame(data['data'],columns=data['feature_names'])

# Scale data before applying PCA


scaling=StandardScaler()

# Use fit and transform method


scaling.fit(df1)
Scaled_data=scaling.transform(df1)

# Set the n_components=3


principal=PCA(n_components=3)
principal.fit(Scaled_data)
x=principal.transform(Scaled_data)

# Check the dimensions of data after PCA


print(x.shape)

Output:
(569,3)

4. Check Components
The principal.components_ attribute provides an array in which the number of rows
equals the number of principal components, while the number of columns
equals the number of features in the actual data. We can easily see that there
are three rows, as n_components was chosen to be 3. However, each row
has 30 columns, as in the actual data.

5. Plot the components (Visualization)


Plot the principal components for better data visualization. Although we
chose n_components = 3, we plot both a 2D graph (using the first two principal
components) and a 3D graph (using all three principal components). The colors
show the two output classes of the original dataset, benign and malignant. It is
clear that the principal components give a clear separation between the two
output classes.

plt.figure(figsize=(10,10))
plt.scatter(x[:,0],x[:,1],c=data['target'],cmap='plasma')
plt.xlabel('pc1')
plt.ylabel('pc2')
Output:
For three principal components, we need to plot a 3d graph. x[:,0] signifies
the first principal component. Similarly, x[:,1] and x[:,2] represent the second
and the third principal component.
# import relevant libraries for 3d graph
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(10,10))

# choose projection 3d for creating a 3d graph


axis = fig.add_subplot(111, projection='3d')

# x[:,0]is pc1,x[:,1] is pc2 while x[:,2] is pc3


axis.scatter(x[:,0],x[:,1],x[:,2], c=data['target'],cmap='plasma')
axis.set_xlabel("PC1", fontsize=10)
axis.set_ylabel("PC2", fontsize=10)
axis.set_zlabel("PC3", fontsize=10)
6. Calculate variance ratio
explained_variance_ratio_ gives an idea of how much variation is
explained by each principal component.
# check how much variance is explained by each principal component
print(principal.explained_variance_ratio_)
Output:

array([0.44272026, 0.18971182, 0.09393163])

Randomized PCA:

PCA is mostly used for very large datasets with many variables in order to
make them clearer and easier to interpret. This can require a lot of
computing power and lead to long waiting times. Randomized PCA can be used to
reduce the calculation time.

Randomized PCA is a variation of Principal Component Analysis (PCA) that


is designed to approximate the first k principal components of a large dataset
efficiently. Instead of computing the eigenvectors of the covariance matrix of
the data, as is done in traditional PCA, randomized PCA uses a random
projection matrix to map the data to a lower-dimensional subspace. The first
k principal components of the data can then be approximated by computing
the eigenvectors of the covariance matrix of the projected data.
We can thus approximate the first k principal components more quickly than
with classical PCA.

This is an extension of PCA which uses an approximated Singular Value

Decomposition (SVD) of the data. Conventional PCA works in O(n·p²) + O(p³),
where n is the number of data points and p is the number of features,
whereas the randomized version works in O(n·d²) + O(d³), where d is the
number of principal components.

sklearn provides a method randomized_svd in sklearn.utils.extmath which


can be used to do randomized PCA.
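A brief sketch of calling randomized_svd directly (the data matrix and the number of components are illustrative; PCA requires the data to be centered first):

import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 50))          # illustrative data matrix
X_centered = X_demo - X_demo.mean(axis=0)    # PCA works on centered data

# Approximate the top-3 singular triplets of the centered data matrix
U, S, Vt = randomized_svd(X_centered, n_components=3, random_state=0)

X_reduced = X_centered @ Vt.T                # project onto the first 3 components
print(X_reduced.shape)                       # (500, 3)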

The Singular Value Decomposition (SVD) of a matrix is a factorization of that


matrix into three matrices

The SVD of an m × n matrix A is given by the formula:

A = U W Vᵀ
Randomized PCA has several advantages over traditional PCA:
1. Scalability: Randomized PCA can handle large datasets that are not
possible to fit into memory using traditional PCA.
2. Speed: Randomized PCA is much faster than traditional PCA for large
datasets, making it more suitable for real-time applications.
3. Sparsity: Randomized PCA is able to handle sparse datasets, which
traditional PCA is not able to handle well.

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rpca = PCA(n_components=2, svd_solver='randomized')

X_rpca = rpca.fit_transform(X)   # X is the feature matrix used earlier

plt.scatter(X_rpca[:, 0], X_rpca[:, 1], c=y)   # y holds the class labels used for coloring
plt.show()

Kernel PCA

Kernel Principal Component Analysis (KPCA) is a technique used in machine


learning for nonlinear dimensionality reduction. It is an extension of the
classical Principal Component Analysis (PCA) algorithm, which is a linear
method that identifies the most significant features or components of a
dataset. KPCA applies a nonlinear mapping function to the data before
applying PCA, allowing it to capture more complex and nonlinear
relationships between the data points.
Kernel PCA is an extension of PCA that allows for the separability of
nonlinear data by making use of kernels. The basic idea behind it is
to project the linearly inseparable data onto a higher dimensional
space where it becomes linearly separable.

PCA is a linear method; that is, it can only capture linear structure in the data.
It does an excellent job for datasets that are linearly separable, but if we apply
it to non-linear datasets, we might get a result that is not the optimal
dimensionality reduction. Kernel PCA uses a kernel function to project the
dataset into a higher-dimensional feature space, where it becomes linearly
separable. This is similar to the idea behind Support Vector Machines. There are
various kernels, such as linear, polynomial, and Gaussian (RBF).

from sklearn.decomposition import KernelPCA

rbf_pca = KernelPCA(n_components = 2, kernel="rbf", gamma=0.04)

X_reduced = rbf_pca.fit_transform(X)
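To see the effect on data that is not linearly separable, here is a short illustrative example using scikit-learn's make_moons dataset (the gamma value is an assumption chosen for this dataset):

import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA

# Two interleaving half-circles: a classic non-linearly-separable dataset
X_moons, y_moons = make_moons(n_samples=200, noise=0.05, random_state=42)

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=15)
X_reduced = rbf_pca.fit_transform(X_moons)

# After the RBF kernel mapping, the two moons become much easier to separate
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y_moons, cmap="viridis")
plt.xlabel("component 1")
plt.ylabel("component 2")
plt.show()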
