
Unit-IV

Unsupervised Learning Techniques

Unsupervised learning tasks and algorithms:

Clustering: The goal is to group similar instances together into clusters.


Clustering is a great tool for data analysis, customer segmentation,
recommender systems, search engines, image segmentation,

semi-supervised learning, dimensionality reduction, and more.

Anomaly detection: The objective is to learn what “normal” data looks like, and
then use that to detect abnormal instances, such as defective items on a
production line or a new trend in a time series.

Density estimation: This is the task of estimating the probability density
function (PDF) of the random process that generated the dataset. Density
estimation is commonly used for anomaly detection: instances located in very
low-density regions are likely to be anomalies. It is also useful for data analysis
and visualization.

K-Means Algorithm
The K-means algorithm is an unsupervised machine-learning algorithm for
clustering. We use K-Means to group similar data items based on their
similarities. It has many applications, including anomaly detection, customer
segmentation, and image segmentation.
The K-means algorithm involves the following steps:
1. Choose k initial centroids (for example, at random).
2. Assign each data item to its nearest centroid.
3. Update each centroid to the mean of the data items assigned to it.
4. Repeat steps 2 and 3 until the assignments stop changing.
In doing so, it seeks to reduce the distances between each data item and its
assigned centroid. This is an iterative process, starting with random
assignments and gradually improving the clusters by adjusting the
centroids based on the data points assigned to each cluster.

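As a concrete illustration, here is a minimal sketch using scikit-learn's KMeans class (the data is synthetic, generated with make_blobs, and the cluster count is an illustrative choice):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data: 300 two-dimensional points drawn from 3 blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)      # cluster index assigned to each instance

print(kmeans.cluster_centers_)      # final centroid positions
print(labels[:10])                  # assignments of the first 10 points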
Drawbacks of the K-Means Algorithm

Sensitivity to Initial Conditions


K-means is sensitive to initial conditions. The algorithm randomly initializes
the cluster centroids at the beginning, and the final clustering results can
vary depending on these initial positions.
Different initializations can lead to different local optima, resulting in different
clustering outcomes. This makes the K-means algorithm less reliable and
less reproducible.

Difficulty in Determining K


One of the drawbacks of the K-means algorithm is that we have to set the
number of clusters (k) in advance. Choosing an incorrect number of
clusters can lead to inaccurate results.
Various methods are available to estimate the optimal k, such as
silhouette analysis or the elbow method, but they may not always provide a
clear-cut answer.
If we use too small a k, we get clusters that are too broad. Conversely, an
overly large k results in clusters that are too specific.

It may require multiple runs to find the most suitable value of k, which can
be time- and resource-consuming.
Inability to Handle Categorical Data
Another drawback of the K-means algorithm is its inability to
handle categorical data. The algorithm works with numerical data, where
distances between data points can be calculated. However, categorical
data doesn’t have a natural notion of distance or similarity.
When categorical data is used with the K-means algorithm, it requires
converting the categories into numerical values, such as using one-hot
encoding. One shortcoming of using one-hot encoding is that it treats each
feature independently and can degrade performance since it can
significantly increase data dimensionality.

K-Means++
Instead of randomly selecting the initial centroids, K-means++ uses a probabilistic
approach that biases the initial centroid selection toward points far apart. This helps
to ensure that the centroids are well distributed across the dataset and reduces the
likelihood of converging to suboptimal solutions.
This can result in faster convergence and improved clustering accuracy
compared to the original K-means algorithm.

Centroid Initialization Methods

k-means clustering aims to converge on an optimal set of cluster centers
(centroids) and cluster memberships based on distance from these centroids via
successive iterations. The more optimal the positioning of the initial centroids,
the fewer iterations of the k-means algorithm are required for convergence.
For this reason, some strategic consideration should be given to the initialization
of these initial centroids.
 Random data points: In this approach, described in the "traditional" case
above, k random data points are selected from the dataset and used as the
initial centroids. This approach is highly volatile and can leave the selected
centroids poorly positioned throughout the data space.
 k-means++: As spreading out the initial centroids is thought to be a worthy
goal, k-means++ pursues this by assigning the first centroid to the location of
a randomly selected data point, and then choosing the subsequent centroids
from the remaining data points based on a probability proportional to the
squared distance from a given point's nearest existing centroid. The effect is
an attempt to push the centroids as far from one another as possible,
covering as much of the occupied data space as they can from initialization.

 Randomly select the first centroid from the data points.


 For each data point compute its distance from the nearest, previously
chosen centroid.
 Select the next centroid from the data points such that the probability of
choosing a point as centroid is directly proportional to its distance from
the nearest, previously chosen centroid. (i.e. the point having maximum
distance from the nearest centroid is most likely to be selected next as a
centroid)
 Repeat steps 2 and 3 until k centroids have been sampled

This probability distribution ensures that instances farther away from already
chosen centroids are much more likely to be selected as centroids.
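In scikit-learn, this seeding strategy is exposed through the init parameter of KMeans (it is also the default). A brief sketch, with illustrative data generated by make_blobs:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data: 500 points drawn from 5 blobs
X, _ = make_blobs(n_samples=500, centers=5, random_state=42)

# init="k-means++" applies the seeding strategy described above;
# n_init controls how many independent seedings are tried before keeping the best run.
kmeans_pp = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=42)
kmeans_pp.fit(X)
print(kmeans_pp.inertia_)   # within-cluster sum of squared distances of the best run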

Mini Batch K-Means Algorithm

With the increasing size of the datasets being analyzed, the computation time
of K-means increases because of its constraint of needing the whole dataset in
main memory. For this reason, several methods have been proposed to reduce
the temporal and spatial cost of the algorithm. A different approach is the Mini
batch K-means algorithm.
The main idea of the Mini Batch K-means algorithm is to use small random batches
of data of a fixed size, so that they can be stored in memory. At each iteration, a new
random batch is drawn from the dataset and used to update the clusters; this is
repeated until convergence.

from sklearn.cluster import MiniBatchKMeans

minibatch_kmeans = MiniBatchKMeans(n_clusters=5)

minibatch_kmeans.fit(X)
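If the dataset does not fit in memory, MiniBatchKMeans also offers a partial_fit method, so the model can be updated one batch at a time. A sketch under the assumption that the data arrives as NumPy chunks (the random batches here stand in for batches loaded from disk):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

minibatch_kmeans = MiniBatchKMeans(n_clusters=5, random_state=42)

# Illustrative stream: 10 random batches of 1,000 two-dimensional points each
rng = np.random.default_rng(0)
for _ in range(10):
    X_batch = rng.normal(size=(1000, 2))    # replace with a batch loaded from disk
    minibatch_kmeans.partial_fit(X_batch)   # update the centroids with this batch only

print(minibatch_kmeans.cluster_centers_)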

Finding the optimal number of clusters

So far, we have set the number of clusters k to 5 because it was obvious by


looking at the data that this was the correct number of clusters. But in general, it
will not be so easy to know how to set k, and the result might be quite bad if
you set it to the wrong value.

A more precise approach for choosing the best value for the number of
clusters is to use the silhouette score, which is the mean silhouette coefficient
over all the instances.

An instance’s silhouette coefficient is equal to (b – a) / max(a, b)

where a is the mean distance to the other instances in the same cluster (i.e., the
mean intra-cluster distance)

b is the mean nearest-cluster distance (i.e., the mean distance to the instances of
the next closest cluster).

The silhouette coefficient can vary between –1 and +1.

A coefficient close to +1 means that the instance is well inside its own cluster
and far from other clusters,

while a coefficient close to 0 means that it is close to a cluster boundary,

and finally a coefficient close to –1 means that the instance may have been
assigned to the wrong cluster.
To compute the silhouette score, you can use Scikit-Learn’s silhouette_score()
function, giving it all the instances in the dataset and the labels they were
assigned:

>>> from sklearn.metrics import silhouette_score

>>> silhouette_score(X, kmeans.labels_)

0.655517642572828

Silhouette Analysis

The silhouette coefficient (silhouette score) measures how similar a data point

is to its own cluster (cohesion) compared to other clusters (separation).

The silhouette score can be easily calculated in Python using the metrics

module of the scikit-learn (sklearn) library.

 Select a range of values of k (say 2 to 10; the silhouette coefficient is only defined for k ≥ 2).

 Plot the silhouette coefficient for each value of k.

The equation for calculating the silhouette coefficient for a particular

data point i is:

S(i) = (b(i) − a(i)) / max(a(i), b(i))

where:

 S(i) is the silhouette coefficient of the data point i.

 a(i) is the average distance between i and all the other data points in

the cluster to which i belongs.

 b(i) is the smallest average distance from i to the data points of any other

cluster (i.e., the nearest-cluster distance).

We then calculate the average silhouette score for every value of k, and plot it against k, as sketched below.
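A short sketch of this procedure with scikit-learn (the data is illustrative, generated with make_blobs; note again that the silhouette score is only defined for k ≥ 2):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=5, random_state=42)   # illustrative data

k_values = range(2, 11)             # the silhouette score needs at least 2 clusters
avg_silhouettes = []
for k in k_values:
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    avg_silhouettes.append(silhouette_score(X, labels))

plt.plot(k_values, avg_silhouettes, "o-")
plt.xlabel("k (number of clusters)")
plt.ylabel("average silhouette score")
plt.show()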

Points to Remember While Calculating Silhouette Coefficient:

 The value of the silhouette coefficient is between [-1, 1].

 A score of 1 denotes the best, meaning that the data point i is very

compact within the cluster to which it belongs and far away from the

other clusters.

 The worst value is -1. Values near 0 denote overlapping clusters.


Limits of K-Means

Choosing k manually.

One option is to use a "loss vs. number of clusters" plot (the elbow method) to find a
reasonable value of k.

Being dependent on initial values.

For a low k, you can mitigate this dependence by running k-means several times
with different initial values and picking the best result. As k increases, you need
advanced versions of k-means to pick better values for the initial centroids
(called k-means seeding). For a full discussion of k-means seeding, see "A
Comparative Study of Efficient Initialization Methods for the K-Means
Clustering Algorithm" by M. Emre Celebi, Hassan A. Kingravi, and Patricio A. Vela.

Clustering data of varying sizes and density.

k-means has trouble clustering data where clusters are of varying sizes and
density.

Clustering outliers.

Centroids can be dragged by outliers, or outliers might get their own cluster
instead of being ignored. Consider removing or clipping outliers before
clustering.

Scaling with number of dimensions.

As the number of dimensions increases, a distance-based similarity measure


converges to a constant value between any given examples.

Image Segmentation Using Clustering


Image segmentation is the classification of an image into different groups.
Image segmentation is an important step in image processing, and it is
needed whenever we want to analyze what is inside an image. For
example, if we want to find out whether there is a chair or a person inside an
indoor image, we may need image segmentation to separate the objects and
analyze each object individually to check what it is. Image segmentation usually
serves as a pre-processing step before pattern recognition, feature extraction,
and compression of the image.


The goal of Image segmentation is to change the representation of an


image into something that is more meaningful and easier to analyze.

Image segmentation is the process of partitioning a digital image into
multiple distinct regions, each containing pixels (sets of pixels, also known as
superpixels) with similar attributes.

Image segmentation is typically used to locate objects


and boundaries(lines, curves, etc.) in images. More precisely, Image
Segmentation is the process of assigning a label to every pixel in an image
such that pixels with the same label share certain characteristics.

There are different methods, and one of the most popular is the K-Means
clustering algorithm (a sketch appears at the end of this section).

Object detection and Image Classification by an Autonomous Vehicle


Other examples come from the healthcare industry. In the case of cancer,
even in today's age of technological advancements, the disease can be fatal if
we don't identify it at an early stage. Detecting cancerous cells as quickly
as possible can potentially save millions of lives. The shape of the
cancerous cells plays a vital role in determining the severity of the cancer,
and it can be identified using image classification algorithms.

Breast cancer cells

Image Segmentation involves converting an image into a collection of


regions of pixels that are represented by a mask or a labeled image. By
dividing an image into segments, you can process only the important
segments of the image instead of processing the entire image.
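A brief sketch of colour-based segmentation with K-Means, using one of the sample images bundled with scikit-learn (the image choice and the number of clusters are illustrative):

from sklearn.cluster import KMeans
from sklearn.datasets import load_sample_image

# Load an example RGB image bundled with scikit-learn and flatten it to (n_pixels, 3)
image = load_sample_image("china.jpg") / 255.0
X = image.reshape(-1, 3)

# Cluster the pixel colours into 8 groups and replace each pixel by its centroid colour
kmeans = KMeans(n_clusters=8, n_init=10, random_state=42).fit(X)
segmented = kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape)
print(segmented.shape)   # same height x width x 3 as the original image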
Using Clustering for Preprocessing

Clustering can be an efficient approach to dimensionality reduction, in particular as a preprocessing
step before a supervised learning algorithm. For example, consider the digits dataset, a simple
MNIST-like dataset containing 1,797 grayscale 8 × 8 images representing the digits 0 to 9.

First, load the dataset:

from sklearn.datasets import load_digits

X_digits, y_digits = load_digits(return_X_y=True)

Now, split it into a training set and a test set:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_digits, y_digits)

Next, fit a Logistic Regression model:

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()

log_reg.fit(X_train, y_train)

Let’s evaluate its accuracy on the test set:

>>> log_reg.score(X_test, y_test)
0.9688888888888889

we can do better by using K-Means as a preprocessing step

We will create a pipeline that will first cluster the training set into 50 clusters and replace the images
with their distances to these 50 clusters, then apply a Logistic Regression model

from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("kmeans", KMeans(n_clusters=50)),
    ("log_reg", LogisticRegression()),
])

pipeline.fit(X_train, y_train)

Now let's evaluate this classification pipeline:

>>> pipeline.score(X_test, y_test)
0.9777777777777777

We reduced the error rate by almost 30% (from about 3.1% to about 2.2%)

Using Clustering for Semi-Supervised Learning

Refer textbook

DBSCAN
Density-based spatial clustering of applications with noise (DBSCAN)

Clusters are dense regions in the data space, separated by regions of lower
point density. The DBSCAN algorithm is based on this intuitive
notion of "clusters" and "noise". The key idea is that for each point of a
cluster, the neighborhood of a given radius has to contain at least a minimum
number of points.

Partitioning methods (K-means, PAM clustering) and hierarchical clustering


work for finding spherical-shaped clusters or convex clusters. In other words,
they are suitable only for compact and well-separated clusters. Moreover,
they are also severely affected by the presence of noise and outliers in the
data.
Real life data may contain irregularities, like:
1. Clusters can be of arbitrary shape such as those shown in the figure
below.
2. Data may contain noise.
The figure below shows a data set containing nonconvex clusters and
outliers/noises. Given such data, k-means algorithm has difficulties in
identifying these clusters with arbitrary shapes.
The DBSCAN algorithm requires two parameters:
1. eps: It defines the neighborhood around a data point, i.e., if the distance
between two points is less than or equal to eps, they are considered
neighbors. If the eps value is chosen too small, a large part of the data
will be considered outliers. If it is chosen very large, the clusters
will merge and the majority of the data points will end up in the same cluster.
One way to find the eps value is based on the k-distance graph (see the
sketch after this list).
2. MinPts: The minimum number of neighbors (data points) within the eps radius.
The larger the dataset, the larger the value of MinPts that should be chosen. As a
general rule, the minimum MinPts can be derived from the number of
dimensions D in the dataset as MinPts >= D + 1. MinPts should be at least 3.
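A sketch of the k-distance idea mentioned in the list above, using scikit-learn's NearestNeighbors (the dataset is illustrative, and k is usually set to MinPts):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)   # illustrative data

min_pts = 5
# Distance from every point to its min_pts-th nearest neighbour
neighbors = NearestNeighbors(n_neighbors=min_pts).fit(X)
distances, _ = neighbors.kneighbors(X)
k_distances = np.sort(distances[:, -1])

# Plot the sorted k-distances; a pronounced "elbow" in this curve suggests a reasonable eps
plt.plot(k_distances)
plt.xlabel("points sorted by k-distance")
plt.ylabel("distance to %d-th nearest neighbour" % min_pts)
plt.show()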

In this algorithm, we have 3 types of data points.


Core Point: A point is a core point if it has more than MinPts points within
eps.
Border Point: A point which has fewer than MinPts within eps but it is in the
neighborhood of a core point.
Noise or outlier: A point which is not a core point or border point.
1. Find all the neighboring points within eps of every point, and identify the
core points, i.e., the points with more than MinPts neighbors.
2. For each core point if it is not already assigned to a cluster, create a new
cluster.
3. Find recursively all its density connected points and assign them to the
same cluster as the core point.
Two points a and b are said to be density-connected if there exists a
point c that has a sufficient number of points in its neighborhood and both
a and b are within eps distance of it. This is a chaining process.
4. So, if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in
turn is a neighbor of a, this implies that b is density-connected to a.
5. Iterate through the remaining unvisited points in the dataset. Those points
that do not belong to any cluster are noise.

import numpy as np

from sklearn.cluster import DBSCAN

from sklearn import metrics

from sklearn.datasets import make_blobs

from sklearn.preprocessing import StandardScaler

# Generate sample data

centers = [[1, 1], [-1, -1], [1, -1]]

X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,

random_state=0)
X = StandardScaler().fit_transform(X)

# Compute DBSCAN

db = DBSCAN(eps=0.3, min_samples=10).fit(X)

core_samples_mask = np.zeros_like(db.labels_, dtype=bool)

core_samples_mask[db.core_sample_indices_] = True

labels = db.labels_

# Number of clusters in labels, ignoring noise if present.

n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

n_noise_ = list(labels).count(-1)

print('Estimated number of clusters: %d' % n_clusters_)

print('Estimated number of noise points: %d' % n_noise_)

print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))

print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))

print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))

print("Adjusted Rand Index: %0.3f"

% metrics.adjusted_rand_score(labels_true, labels))

print("Adjusted Mutual Information: %0.3f"

% metrics.adjusted_mutual_info_score(labels_true, labels))

print("Silhouette Coefficient: %0.3f"

% metrics.silhouette_score(X, labels))

# Plot result

import matplotlib.pyplot as plt

%matplotlib inline

# Black removed and is used for noise instead.

unique_labels = set(labels)

colors = [plt.cm.Spectral(each)

for each in np.linspace(0, 1, len(unique_labels))]


for k, col in zip(unique_labels, colors):

    if k == -1:

        # Black used for noise.

        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]

    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),

             markeredgecolor='k', markersize=14)

    xy = X[class_member_mask & ~core_samples_mask]

    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),

             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)

plt.show()
DBSCAN to cluster spherical data

Gaussian Mixtures

Gaussian mixture models (GMMs) are a type of machine


learning algorithm. They are used to classify data into
different categories based on probability distributions.
Gaussian mixture models can be used in many different
areas, including finance, marketing, and many more.
A Gaussian mixture model can be used for clustering, which is the task of
grouping a set of data points into clusters.

GMMs can be used to find clusters in data sets where the clusters may
not be clearly defined.
Additionally, GMMs can be used to estimate the probability that a new
data point belongs to each cluster.

Gaussian mixture models are also relatively robust to outliers, meaning


that they can still yield accurate results even if there are some data
points that do not fit neatly into any of the clusters. This makes GMMs a
flexible and powerful tool for clustering data

Gaussian distributions are assumed for each group and they have means
and covariances which define their parameters.

GMM consists of two parts –


mean vectors (μ) & covariance matrices (Σ).
A Gaussian distribution is defined as a continuous probability distribution
that takes on a bell-shaped curve

Applications of GMM

GMM has many applications, such as density estimation, clustering, and


image segmentation.
For density estimation, GMM can be used to estimate the probability
density function of a set of data points.

For clustering, GMM can be used to group together data points that
come from the same Gaussian distribution.

For image segmentation, GMM can be used to partition an image into


different regions

A Gaussian Mixture is a function that is comprised of


several Gaussians, each identified by k ∈ {1,…, K},
where K is the number of clusters of our dataset. Each
Gaussian k in the mixture is comprised of the following
parameters:
 A mean μ that defines its centre.

 A covariance Σ that defines its width.

 A mixing probability π that defines how big or small the


Gaussian function will be.

A Gaussian mixture model is a probabilistic model that assumes all the data
points are generated from a mixture of a finite number of Gaussian
distributions with unknown parameters.

In a Gaussian mixture model, each cluster is associated


with a multivariate Gaussian distribution, and the
mixture model is a weighted sum of these distributions.
The weights indicate the probability that a data point
belongs to a particular cluster, and the Gaussian
distributions describe the distribution of the data within
each cluster.

In order to fit a Gaussian Mixture Model to a dataset,


the model parameters (i.e., the weights, means, and
covariances of the components) must be estimated from
the data. This is typically done using an iterative
optimization algorithm such as the expectation-
maximization (EM) algorithm.
Once a GMM has been fit to a dataset, it can be used for
a variety of tasks such as density estimation, clustering,
and anomaly detection, which are used in real-life
examples.

To fit a Gaussian Mixture Model (GMM) to a

dataset using Python and the scikit-learn library:
 First, import the GaussianMixture class:
from sklearn.mixture import GaussianMixture
import numpy as np
 Then load the dataset, here stored in a file named 'data.txt':
X = np.loadtxt('data.txt')
 Now, create the Gaussian mixture model:
gmm = GaussianMixture(n_components=3)
 Fit the model to the data:
gmm.fit(X)
 Predict the cluster labels for each data point:
labels = gmm.predict(X)
 Get the model's parameters, i.e., the means and
covariances of the components:
means = gmm.means_
covariances = gmm.covariances_
Here, X is an (n x d) array of n observations with d
dimensions. The GaussianMixture class is used to
fit a GMM with n_components components to the
data. The fit method estimates the model
parameters (i.e., the means and covariances of
the components) using the expectation-
maximization (EM) algorithm. The predict method
can then be used to assign each data point to
one of the n_components clusters. The model
parameters (means and covariances) can be
accessed using the means_ and covariances_
attributes of the GaussianMixture object.

In general, the multivariate Gaussian density function is given by:

N(x | μ, Σ) = 1 / ((2π)^(D/2) · |Σ|^(1/2)) · exp( −(1/2) · (x − μ)ᵀ Σ⁻¹ (x − μ) )

where

x represents our data points,

D is the number of dimensions of each data point, and

μ and Σ are the mean and covariance.

If we have a dataset comprised of N = 1000 three-


dimensional points (D = 3), then x will be a 1000 × 3
matrix. μ will be a 1 × 3 vector, and Σ will be a 3 × 3
matrix

If we differentiate this equation with respect to the mean


and covariance and then equate it to zero, then we will be
able to find the optimal values for these parameters
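Setting these derivatives to zero yields the standard maximum-likelihood estimates for a single Gaussian, stated here for reference in the same notation as above:

μ = (x₁ + x₂ + … + x_N) / N

Σ = (1/N) · [ (x₁ − μ)(x₁ − μ)ᵀ + … + (x_N − μ)(x_N − μ)ᵀ ]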

Covariance Matrix

Covariance matrix is a square matrix that displays the variance exhibited by


elements of datasets and the covariance between a pair of datasets.
Variance is a measure of dispersion and can be defined as the spread of
data from the mean of the given dataset.
Covariance is calculated between two variables and is used to measure
how the two variables vary together.
The variance-covariance matrix is defined as a square matrix where the diagonal
elements represent the variances and the off-diagonal elements represent the
covariances.
The covariance between two variables can be positive, negative, and zero. A
positive covariance indicates that the two variables have a positive relationship
whereas negative covariance shows that they have a negative relationship. If two
elements do not vary together then they will display a zero covariance.
To determine the covariance matrix, the formulas for variance and
covariance are required. Depending upon the type of data available, the
variance and covariance can be found for both sample data and
population data. These formulas are given below.

Covariance Matrix 2 × 2

A 2 × 2 matrix is one which has 2 rows and 2 columns. The formula for a 2
× 2 covariance matrix is given as follows:

[var(x) cov(x,y)

cov(x,y) var(y)]

Covariance Matrix 3 × 3

If there are 3 datasets, x, y, and z, then the formula to find the 3 × 3


covariance matrix is given below:

[var(x) cov(x,y) cov(x,z)

cov(x,y) var(y) cov(y,z)

cov(x,z) cov(y,z) var(z)]
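A quick way to check these formulas is NumPy's np.cov function; the three variables below are illustrative:

import numpy as np

# Three illustrative variables x, y and z, with five observations each
data = np.array([
    [2.1, 2.5, 3.6, 4.0, 4.8],      # x
    [8.0, 10.0, 12.0, 14.0, 16.0],  # y
    [1.0, 0.8, 0.9, 1.2, 1.1],      # z
])

# np.cov treats each row as a variable and returns the 3 x 3 covariance matrix:
# diagonal entries are the variances, off-diagonal entries the pairwise covariances.
# (By default np.cov uses the sample formula with n - 1 in the denominator.)
print(np.cov(data))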

Dimensionality Reduction

Dimensionality
The number of input features, variables, or columns present in a given
dataset is known as dimensionality, and the process to reduce these
features is called dimensionality reduction.

Definition

It is a way of converting a higher-dimensional dataset into a lower-dimensional

dataset while ensuring that it still provides similar information.

Need of Dimensionality Reduction

In many cases a dataset contains a huge number of input features,

which makes the predictive modeling task more complicated. Because it is
very difficult to visualize or make predictions for a training dataset with
a large number of features, dimensionality reduction techniques are
required in such cases.

Application Areas

It is commonly used in the fields that deal with high-dimensional data,


such as speech recognition, signal processing, bioinformatics, etc.
It can also be used for data visualization, noise reduction, cluster
analysis, etc.

The Curse of Dimensionality


Handling high-dimensional data is very difficult in practice, a problem commonly
known as the curse of dimensionality. If the dimensionality of the input
dataset increases, any machine learning algorithm or model becomes
more complex. As the number of features increases, the number of
samples required to cover the feature space also increases, and so does the
chance of overfitting. If a machine learning model is trained on high-
dimensional data, it becomes overfitted and results in poor performance.

Advantages

o By reducing the dimensions of the features, the space required to


store the dataset also gets reduced.
o Less Computation training time is required for reduced dimensions
of features.
o Reduced dimensions of features of the dataset help in visualizing
the data quickly.
o It removes the redundant features (if present) by taking care of
multicollinearity.

Common techniques of Dimensionality Reduction


a. Principal Component Analysis

b. Backward Elimination

c. Forward Selection

d. Score comparison

e. Missing Value Ratio

f. Low Variance Filter

g. High Correlation Filter

h. Random Forest

i. Factor Analysis

j. Auto-Encoder
Principal Component Analysis

Principal Component Analysis is a statistical process that converts the


observations of correlated features into a set of linearly uncorrelated features
with the help of orthogonal transformation. These new transformed features are
called the Principal Components. It is one of the popular tools that is used for
exploratory data analysis and predictive modeling.

The Principal Component Analysis is a popular unsupervised learning technique for reducing
the dimensionality of data. It increases interpretability yet, at the same time, it minimizes
information loss.

It helps to find the most significant features in a dataset and makes the data easy for plotting
in 2D and 3D. PCA helps in finding a sequence of linear combinations of variables.

Principal Components

The Principal Components are a straight line that captures most of the variance of
the data. They have a direction and magnitude. Principal components are orthogonal
projections (perpendicular) of data onto lower-dimensional space.

Variance
The term variance refers to a statistical measurement of the spread
between numbers in a data set. More specifically, variance measures how
far each number in the set is from the mean (average), and thus from
every other number in the set. Variance is denoted by the symbol σ². It is
used by both analysts and traders to determine volatility and market
security

it measures the degree of dispersion of data around the sample's mean.

Scatter plot for the above data


After applying PCA
PCA allows us to go a step further and represent the data as linear
combinations of principal components. Getting principal components is
equivalent to a linear transformation of data from the feature1 x feature2
axis to a PCA1 x PCA2 axis

Each successive principal component explains the variance that is left


after its preceding component, so picking just a few of the first components
sufficiently approximates the original dataset without the need for additional
features

Principal Component analysis fits data in an n-dimensional


ellipsoid so that each axis of the ellipsoid represents a
principal component. The larger the principal component axis
the larger the variability in data it represents.

PCA calculation
There are multiple ways to calculate PCA:

1. Eigendecomposition of the covariance matrix


2. Singular value decomposition of the data matrix
3. Eigenvalue approximation via power iterative
computation
4. Non-linear iterative partial least squares (NIPALS)
computation.
Steps for Calculating PCA
1. Feature standardization. We standardize each feature
to have a mean of 0 and a variance of 1.
2. Compute the covariance matrix. The covariance matrix is a
square matrix of d × d dimensions, where d stands for "dimension" (or feature,
or column, if our data is tabular). It contains the pairwise covariances
between the features.
3. Calculate the eigendecomposition of the covariance
matrix. We calculate the eigenvectors (unit vectors) and
their associated eigenvalues (the scalars by which we
multiply the eigenvectors) of the covariance matrix.
4. Sort the eigenvectors from the highest eigenvalue to
the lowest. The eigenvector with the highest eigenvalue
is the first principal component. Higher eigenvalues
correspond to greater amounts of variance explained.
5. Select the number of principal components. Select the
top N eigenvectors (based on their eigenvalues) to
become the N principal components. The optimal
number of principal components is both subjective and
problem-dependent.
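The steps above can be traced directly in NumPy. A compact sketch, using illustrative random data in place of a real dataset:

import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))          # illustrative data: 100 samples, 4 features

# 1. Standardize each feature to zero mean and unit variance
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# 2. Covariance matrix (d x d); rowvar=False because features are in columns
cov = np.cov(X_std, rowvar=False)

# 3. Eigendecomposition (eigh is suitable for symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort eigenvectors by decreasing eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 5. Keep the top-2 components and project the data onto them
X_pca = X_std @ eigenvectors[:, :2]
print(X_pca.shape)                          # (100, 2)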

Example:

Let us consider the same scenario that we took as an example previously, and
assume four features F1, F2, F3, and F4.
Calculate the mean and standard deviation for each feature and then
tabulate them as follows.

Then, after the Standardization of each variable, the results are tabulated
below.
This is the Standardized data set.

STEP 2: COVARIANCE MATRIX COMPUTATION


 In this step, you will get to know how the variables of the given data are
varying with the mean value calculated.
 Any interrelated variables can also be sorted out at the end of this step.
 To segregate the highly interrelated variables, you calculate the covariance
matrix with the help of the given formula.
Note: a covariance matrix is an N × N symmetric matrix that contains
the covariances of all possible pairs of variables.

The covariance matrix of two-dimensional data is given as follows:

[ COV(x, x)  COV(x, y)
  COV(y, x)  COV(y, y) ]

Make a note that the covariance of a variable with itself is its variance
(COV(X, X) = Var(X)), so the top-left and bottom-right entries hold the
variances of the two variables.

Likewise, the matrix is symmetric about the main diagonal because covariance
is commutative (COV(X, Y) = COV(Y, X)).

If a covariance entry (an off-diagonal value) is positive, it indicates that
the two variables are positively correlated (if X increases, Y also increases
and vice versa).

If a covariance entry is negative, it indicates that the variables are
inversely correlated (if X increases, Y decreases and vice versa).

As a result, at the end of this step, you will know which pairs of
variables are correlated with each other, so that you can categorize
them much more easily.

Example:

So, continuing with the same example,

The formula to calculate the covariance matrix of the given example will
be:

Since you have already standardized the features, you can consider Mean
= 0 and Standard Deviation=1 for each feature.

VAR(F1) = ((−1.0695 − 0)² + (0.5347 − 0)² + (−1.0695 − 0)²
+ (0.5347 − 0)² + (1.0695 − 0)²) / 5

On solving the equation, you get VAR(F1) = 0.78.

COV(F1, F2) = ((−1.0695 − 0)(0.8196 − 0) + (0.5347 − 0)(−1.6393 − 0) + (−1.0695 − 0)(0.0000 − 0)
+ (0.5347 − 0)(0.0000 − 0) + (1.0695 − 0)(0.8196 − 0)) / 5

On solving the equation, you get COV(F1, F2) = −0.8586.

Similarly, solving for all pairs of features, the covariance matrix will be:

STEP 4: FEATURE VECTOR


1. To determine the principal components of the variables, you have to compute
the eigenvalues and eigenvectors.
Let A be any square matrix. A non-zero vector v is an eigenvector of A if

Av = λv

for some number λ, called the corresponding eigenvalue.

2. Once you have computed the eigenvectors, arrange the eigenvalues
in descending order (for all variables); this gives you the list of
principal components.

3. The eigenvectors represent the directions of the principal components, and the
corresponding eigenvalues indicate how much of the data's variance each
component explains.

4. If a component's direction captures a large variance, then many data points
lie close to that line, and the line therefore carries more information.

5. Finally, these principal components form a new set of axes on which the data
can be evaluated more easily, and the differences between the observations can
also be monitored more easily.
Example:

Let ν be a non-zero vector and λ a scalar.

As per the rule Aν = λν, λ is called the eigenvalue associated with the eigenvector ν of A.

Upon substituting the values in det(A − λI) = 0, you will get the characteristic equation.

When you solve this equation (with 0 on the right-hand side), you obtain the eigenvalues:

λ = 2.11691, 0.855413, 0.481689, 0.334007

Then, substitute each eigen value in (A-λI)ν=0 equation and solve the
same for different eigen vectors v1, v2, v3 and v4.

For instance,

For λ = 2.11691, solving the above equation using Cramer's rule, the
values for the v vector are
v1 = 0.515514
v2 = -0.616625
v3 = 0.399314
v4 = 0.441098

Follow the same process for the other eigenvalues, and you will form a matrix
whose columns are the eigenvectors calculated as instructed.
Now arrange the eigenvectors in descending order of their eigenvalues and pick
the topmost ones. These are your principal components.

STEP 5: RECAST THE DATA ALONG THE


PRINCIPAL COMPONENTS AXES
 Until now, apart from standardization, you haven't made any changes to the
original data. You have just selected the principal components and formed a
feature vector, yet the initial data remains on its original axes.
 This step aims at reorienting the data from the original axes to the ones
you have calculated from the principal components.
This can be done by the following formula.

Final Data Set= Standardized Original Data Set * FeatureVector

Example:

So, in our guide, the final data set becomes

Standardized Original Data Set =



FeatureVector =

By solving the above equations, you will get the transformed data as
follows.

Did you notice something? Your large dataset is now compressed into a
smaller dataset with very little loss of information. This is the significance of
Principal Component Analysis.

PCA using scikit learn


 To implement PCA in scikit-learn, it is essential to standardize/normalize
the data before applying PCA.
 PCA is imported from sklearn.decomposition. We need to select the
required number of principal components.
 Usually, n_components is chosen to be 2 for better visualization, but it
depends on the data.
 The attributes are passed through the fit and transform methods.
 The values of the principal components can be checked using components_,
while the variance explained by each principal component can be
obtained using explained_variance_ratio_.

1. Import all the libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

2. Loading Data
Load the breast_cancer dataset from sklearn.datasets. The dataset has 569
data items with 30 input attributes. There are two output classes: benign and
malignant. With 30 input features, it is impossible to visualize this data
directly.
from sklearn.datasets import load_breast_cancer
data=load_breast_cancer()
data.keys()

# Check the output classes


print(data['target_names'])

# Check the input attributes


print(data['feature_names'])

Output:
3. Apply PCA
 Standardize the dataset prior to PCA.
 Import PCA from sklearn.decomposition.
 Choose the number of principal components.

Set it to 3. After executing this code, we see that the dimensions
of x are (569, 3), while the dimension of the actual data is (569, 30). Thus, it is
clear that with PCA, the number of dimensions has been reduced from 30 to 3. If
we choose n_components=2, the dimensions would be reduced to 2.

# construct a dataframe using pandas


df1=pd.DataFrame(data['data'],columns=data['feature_names'])

# Scale data before applying PCA


scaling=StandardScaler()

# Use fit and transform method


scaling.fit(df1)
Scaled_data=scaling.transform(df1)

# Set the n_components=3


principal=PCA(n_components=3)
principal.fit(Scaled_data)
x=principal.transform(Scaled_data)

# Check the dimensions of data after PCA


print(x.shape)

Output:
(569,3)

4. Check Components
The principal.components_ attribute provides an array in which the number of rows
equals the number of principal components, while the number of columns
equals the number of features in the actual data. We can easily see that there
are three rows, as n_components was chosen to be 3. However, each row
has 30 columns, as in the actual data.

5. Plot the components (Visualization)


Plot the principal components for better data visualization. Although we
chose n_components = 3, we plot both a 2D graph (using the first two principal
components) and a 3D graph (using all three principal components). The colors
show the two output classes of the original dataset, benign and malignant. It is
clear that the principal components give a clear separation between the two
output classes.

plt.figure(figsize=(10,10))
plt.scatter(x[:,0],x[:,1],c=data['target'],cmap='plasma')
plt.xlabel('pc1')
plt.ylabel('pc2')
Output:
For three principal components, we need to plot a 3d graph. x[:,0] signifies
the first principal component. Similarly, x[:,1] and x[:,2] represent the second
and the third principal component.
# import relevant libraries for 3d graph
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(10,10))

# choose projection 3d for creating a 3d graph


axis = fig.add_subplot(111, projection='3d')

# x[:,0]is pc1,x[:,1] is pc2 while x[:,2] is pc3


axis.scatter(x[:,0],x[:,1],x[:,2], c=data['target'],cmap='plasma')
axis.set_xlabel("PC1", fontsize=10)
axis.set_ylabel("PC2", fontsize=10)
axis.set_zlabel("PC3", fontsize=10)
6. Calculate variance ratio
explained_variance_ratio_ gives an idea of how much variation is
explained by each principal component.
# check how much variance is explained by each principal component
print(principal.explained_variance_ratio_)
Output:

array([0.44272026, 0.18971182, 0.09393163])

Randomized PCA:

PCA is mostly used for very large datasets with many variables in order to
make them clearer and easier to interpret. This can require a lot of
computing power and lead to long waiting times. Randomized PCA can be used to
reduce the calculation time.

Randomized PCA is a variation of Principal Component Analysis (PCA) that


is designed to approximate the first k principal components of a large dataset
efficiently. Instead of computing the eigenvectors of the covariance matrix of
the data, as is done in traditional PCA, randomized PCA uses a random
projection matrix to map the data to a lower-dimensional subspace. The first
k principal components of the data can then be approximated by computing
the eigenvectors of the covariance matrix of the projected data.
We can thus approximate the first k principal components more quickly than
with classical PCA.

This is an extension of PCA which uses an approximated Singular Value

Decomposition (SVD) of the data. Conventional PCA works in O(n·p²) + O(p³),
where n is the number of data points and p is the number of features,
whereas the randomized version works in O(n·d²) + O(d³), where d is the
number of principal components.

sklearn provides a method randomized_svd in sklearn.utils.extmath which


can be used to do randomized PCA.
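A brief sketch of calling randomized_svd directly (the data matrix and the number of components are illustrative; PCA requires the data to be centered first):

import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 50))          # illustrative data matrix
X_centered = X_demo - X_demo.mean(axis=0)    # PCA works on centered data

# Approximate the top-3 singular triplets of the centered data matrix
U, S, Vt = randomized_svd(X_centered, n_components=3, random_state=0)

X_reduced = X_centered @ Vt.T                # project onto the first 3 components
print(X_reduced.shape)                       # (500, 3)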

The Singular Value Decomposition (SVD) of a matrix is a factorization of that


matrix into three matrices

The SVD of an m × n matrix A is given by the formula:

A = U W Vᵀ
Randomized PCA has several advantages over traditional PCA:
1. Scalability: Randomized PCA can handle large datasets that are not
possible to fit into memory using traditional PCA.
2. Speed: Randomized PCA is much faster than traditional PCA for large
datasets, making it more suitable for real-time applications.
3. Sparsity: Randomized PCA is able to handle sparse datasets, which
traditional PCA is not able to handle well.

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rpca = PCA(n_components=2, svd_solver='randomized')

X_rpca = rpca.fit_transform(X)   # X is the feature matrix used earlier

plt.scatter(X_rpca[:, 0], X_rpca[:, 1], c=y)   # y holds the class labels used for coloring
plt.show()

Kernel PCA

Kernel Principal Component Analysis (KPCA) is a technique used in machine


learning for nonlinear dimensionality reduction. It is an extension of the
classical Principal Component Analysis (PCA) algorithm, which is a linear
method that identifies the most significant features or components of a
dataset. KPCA applies a nonlinear mapping function to the data before
applying PCA, allowing it to capture more complex and nonlinear
relationships between the data points.
Kernel PCA is an extension of PCA that allows for the separability of
nonlinear data by making use of kernels. The basic idea behind it is
to project the linearly inseparable data onto a higher dimensional
space where it becomes linearly separable.

PCA is a linear method; that is, it can only capture linear structure in the data.
It does an excellent job for datasets that are linearly separable, but if we apply
it to non-linear datasets, we might get a result that is not the optimal
dimensionality reduction. Kernel PCA uses a kernel function to project the
dataset into a higher-dimensional feature space, where it becomes linearly
separable. This is similar to the idea behind Support Vector Machines. There are
various kernels, such as linear, polynomial, and Gaussian (RBF).

from sklearn.decomposition import KernelPCA

rbf_pca = KernelPCA(n_components = 2, kernel="rbf", gamma=0.04)

X_reduced = rbf_pca.fit_transform(X)
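To see the effect on data that is not linearly separable, here is a short illustrative example using scikit-learn's make_moons dataset (the gamma value is an assumption chosen for this dataset):

import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA

# Two interleaving half-circles: a classic non-linearly-separable dataset
X_moons, y_moons = make_moons(n_samples=200, noise=0.05, random_state=42)

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=15)
X_reduced = rbf_pca.fit_transform(X_moons)

# After the RBF kernel mapping, the two moons become much easier to separate
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y_moons, cmap="viridis")
plt.xlabel("component 1")
plt.ylabel("component 2")
plt.show()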
