Unit-4
Anomaly detection: The objective is to learn what “normal” data looks like, and
then use that to detect abnormal instances, such as defective items on a
production line or a new trend in a time series.
K-Means Algorithm
The K-means algorithm is an unsupervised machine-learning algorithm for clustering. We use K-Means to group similar data items based on their similarities and differences. It has many applications, including anomaly detection, customer segmentation, and image segmentation.
The K-means algorithm seeks to minimize the distance between each data item and its assigned centroid. It is an iterative process that involves the following steps:
1. Choose the number of clusters k and initialize k centroids (randomly, in the basic algorithm).
2. Assign each data point to its nearest centroid.
3. Recompute each centroid as the mean of the data points assigned to its cluster.
4. Repeat steps 2 and 3 until the assignments no longer change (convergence).
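As a minimal sketch of how this looks with scikit-learn (assuming X is a numerical feature matrix; k = 5 is only an illustrative choice):
from sklearn.cluster import KMeans

k = 5                                  # illustrative number of clusters
kmeans = KMeans(n_clusters=k, random_state=42)
y_pred = kmeans.fit_predict(X)         # cluster label assigned to each instance
centroids = kmeans.cluster_centers_    # final centroid positions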
Choosing the Value of k
It may require multiple runs to find the most suitable value of k, which can be time-consuming and resource-intensive.
Inability to Handle Categorical Data
Another drawback of the K-means algorithm is its inability to
handle categorical data. The algorithm works with numerical data, where
distances between data points can be calculated. However, categorical
data doesn’t have a natural notion of distance or similarity.
When categorical data is used with the K-means algorithm, it requires
converting the categories into numerical values, such as using one-hot
encoding. One shortcoming of using one-hot encoding is that it treats each
feature independently and can degrade performance since it can
significantly increase data dimensionality.
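A small sketch of this conversion with scikit-learn (the color column and its values here are purely illustrative):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder

colors = np.array([["red"], ["green"], ["blue"], ["green"], ["red"], ["blue"]])
X_encoded = OneHotEncoder().fit_transform(colors).toarray()   # one binary column per category
kmeans = KMeans(n_clusters=2, random_state=42).fit(X_encoded)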
K-Means++
Instead of randomly selecting the initial centroids, K-means++ uses a probabilistic
approach that biases the initial centroid selection toward points far apart. This helps
to ensure that the centroids are well distributed across the dataset and reduces the
likelihood of converging to suboptimal solutions.
This can result in faster convergence and improved clustering accuracy
compared to the original K-means algorithm.
This probability distribution ensures that instances farther away from already chosen centroids are much more likely to be selected as centroids.
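In scikit-learn, this seeding strategy is selected with the init parameter of KMeans (it is also the default); a sketch, reusing the feature matrix X from above:
from sklearn.cluster import KMeans

kmeans_pp = KMeans(n_clusters=5, init="k-means++", n_init=10)
kmeans_pp.fit(X)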
Mini-Batch K-Means
With the increasing size of the datasets being analyzed, the computation time of K-means increases because of its constraint of needing the whole dataset in main memory. For this reason, several methods have been proposed to reduce the temporal and spatial cost of the algorithm. A different approach is the Mini-Batch K-means algorithm.
The main idea of the Mini-Batch K-means algorithm is to use small random batches of data of a fixed size, so that they can be stored in memory. In each iteration, a new random sample is drawn from the dataset and used to update the clusters, and this is repeated until convergence.
from sklearn.cluster import MiniBatchKMeans

minibatch_kmeans = MiniBatchKMeans(n_clusters=5)
minibatch_kmeans.fit(X)
A more precise approach for choosing the best value for the number of clusters is to use the silhouette score, which is the mean silhouette coefficient over all the instances. An instance's silhouette coefficient is equal to
(b − a) / max(a, b)
where a is the mean distance to the other instances in the same cluster (i.e., the mean intra-cluster distance), and
b is the mean nearest-cluster distance (i.e., the mean distance to the instances of the next closest cluster).
A coefficient close to +1 means that the instance is well inside its own cluster and far from other clusters, a coefficient close to 0 means that it is close to a cluster boundary, and finally a coefficient close to −1 means that the instance may have been assigned to the wrong cluster.
To compute the silhouette score, you can use Scikit-Learn’s silhouette_score()
function, giving it all the instances in the dataset and the labels they were
assigned:
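(A sketch, assuming kmeans is a K-Means model already fitted on X.)
from sklearn.metrics import silhouette_score

silhouette_score(X, kmeans.labels_)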
0.655517642572828
Silhouette Analysis
For each data point i:
a(i) is the average distance between i and all the other data points in the cluster to which i belongs.
b(i) is the average distance from i to the data points of the nearest cluster to which i does not belong.
A score of 1 denotes the best, meaning that the data point i is very
compact within the cluster to which it belongs and far away from the
other clusters.
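To inspect these per-point coefficients in scikit-learn, you can use silhouette_samples(); a small sketch, again assuming the feature matrix X and a fitted kmeans model:
from sklearn.metrics import silhouette_samples

sample_coefficients = silhouette_samples(X, kmeans.labels_)   # one value in [-1, 1] per data point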
Choosing k manually.
Use the “Loss vs. Clusters” plot to find the optimal k, as discussed in Interpret Results.
For a low k, you can mitigate this dependence by running k-means several times
with different initial values and picking the best result. As k increases, you need
advanced versions of k-means to pick better values of the initial centroids
(called k-means seeding). For a full discussion of k- means seeding see, A
Comparative Study of Efficient Initialization Methods for the K-Means
Clustering Algorithm by M. Emre Celebi, Hassan A. Kingravi, Patricio A. Vela.
k-means has trouble clustering data where clusters have varying sizes and densities.
Clustering outliers.
Centroids can be dragged by outliers, or outliers might get their own cluster
instead of being ignored. Consider removing or clipping outliers before
clustering.
Using Clustering as a Preprocessing Step
Clustering can also be used as a preprocessing step before supervised learning. There are different clustering methods, and one of the most popular is the K-Means clustering algorithm.
First, a baseline Logistic Regression model is trained on the raw training data:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
We will create a pipeline that will first cluster the training set into 50 clusters and replace the images with their distances to these 50 clusters, and then apply a Logistic Regression model.
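A sketch of such a pipeline (assuming X_train, y_train, X_test and y_test are the digits train/test split used in the textbook example; inside the pipeline, KMeans acts as a transformer that replaces each image by its distances to the 50 centroids):
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ("kmeans", KMeans(n_clusters=50, random_state=42)),
    ("log_reg", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)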
We reduced the error rate by almost 30% (from about 3.1% to about 2.2%)
Refer to the textbook for the complete example.
DBSCAN
Density-based spatial clustering of applications with noise (DBSCAN)
Clusters are dense regions in the data space, separated by regions of lower point density. The DBSCAN algorithm is based on this intuitive
notion of “clusters” and “noise”. The key idea is that for each point of a
cluster, the neighborhood of a given radius has to contain at least a minimum
number of points.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Generate and standardize sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)

# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise (label -1) if present
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print("Estimated number of clusters: %d" % n_clusters_)
print("Estimated number of noise points: %d" % n_noise_)
print("Adjusted Rand Index: %0.3f" % metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f" % metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, labels))

# Plot result: core samples are drawn large, non-core samples small, noise in black
unique_labels = set(labels)
colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        col = [0, 0, 0, 1]
    class_member_mask = (labels == k)
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=14)
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=6)
plt.show()
DBSCAN to cluster spherical data
Gaussian Mixtures
GMMs can be used to find clusters in data sets where the clusters may
not be clearly defined.
Additionally, GMMs can be used to estimate the probability that a new
data point belongs to each cluster.
Each group is assumed to follow a Gaussian distribution, whose parameters are its mean and covariance.
Applications of GMM
For clustering, GMM can be used to group together data points that
come from the same Gaussian distribution.
A Gaussian mixture model is a probabilistic model that assumes all the data
points are generated from a mixture of a finite number of Gaussian
distributions with unknown parameters.
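A minimal scikit-learn sketch (X is assumed to be a numerical feature matrix, and 3 components is only an illustrative choice):
from sklearn.mixture import GaussianMixture

gm = GaussianMixture(n_components=3, n_init=10, random_state=42)
gm.fit(X)
hard_labels = gm.predict(X)        # most likely cluster for each point
soft_probs = gm.predict_proba(X)   # probability that each point belongs to each cluster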
Covariance Matrix
Covariance Matrix 2 × 2
A 2 × 2 matrix is one which has 2 rows and 2 columns. For two variables X and Y, the 2 × 2 covariance matrix is given as follows:
[ Var(X)     Cov(X, Y) ]
[ Cov(X, Y)  Var(Y)    ]
Covariance Matrix 3 × 3
Similarly, for three variables X, Y and Z, the 3 × 3 covariance matrix is:
[ Var(X)     Cov(X, Y)  Cov(X, Z) ]
[ Cov(X, Y)  Var(Y)     Cov(Y, Z) ]
[ Cov(X, Z)  Cov(Y, Z)  Var(Z)    ]
Dimensionality Reduction
Dimensionality
The number of input features, variables, or columns present in a given
dataset is known as dimensionality, and the process to reduce these
features is called dimensionality reduction.
Definition
Application Areas
Advantages
Common techniques of dimensionality reduction include:
b. Backward Elimination
c. Forward Selection
d. Score comparison
h. Random Forest
i. Factor Analysis
j. Auto-Encoder
Principal Component Analysis
Principal Component Analysis (PCA) is a popular unsupervised learning technique for reducing the dimensionality of data. It increases interpretability while at the same time minimizing information loss.
It helps to find the most significant features in a dataset and makes the data easy for plotting
in 2D and 3D. PCA helps in finding a sequence of linear combinations of variables.
Principal Components
A principal component is a straight line (direction) that captures as much of the variance of the data as possible. Principal components have a direction and a magnitude, and they are orthogonal (perpendicular) projections of the data onto a lower-dimensional space.
Variance
The term variance refers to a statistical measurement of the spread
between numbers in a data set. More specifically, variance measures how
far each number in the set is from the mean (average), and thus from
every other number in the set. Variance is denoted by the symbol σ². It is used by both analysts and traders to determine volatility and market security.
PCA calculation
There are multiple ways to calculate PCA. The most common approach uses the eigen decomposition of the covariance matrix and involves the following steps:
1. Standardize the data.
2. Compute the covariance matrix of the standardized features.
3. Compute the eigen values and eigen vectors of the covariance matrix.
4. Sort the eigen values in descending order and select the top principal components.
5. Project the data onto the selected components.
Example:
Each variable is standardized as z = (x − μ) / σ, where μ is the mean and σ is the standard deviation of that variable. Then, after the standardization of each variable, the results are tabulated below.
This is the standardized data set.
4. Make a note that the covariance of a variable with itself is its variance (COV(X, X) = Var(X)), so the entries on the main diagonal (top left and bottom right) are the variances of the variables themselves.
5. Likewise, the covariance matrix is symmetric about its main diagonal, because covariance is commutative (COV(X, Y) = COV(Y, X)).
6A. If a value in the covariance matrix is positive, it indicates that the two variables are directly correlated (if X increases, Y also increases, and vice versa).
6B. If a value in the covariance matrix is negative, it indicates that the two variables are inversely correlated (if X increases, Y decreases, and vice versa).
7. As a result, at the end of this step, you will know which pairs of variables are correlated with each other, so that you can categorize them more easily.
Example:
The entries of the covariance matrix for the given example are calculated with the sample covariance formula Cov(X, Y) = Σ (Xi − X̄)(Yi − Ȳ) / (n − 1), where n is the number of observations.
Since you have already standardized the features, you can consider Mean
= 0 and Standard Deviation=1 for each feature.
Similarly solving all the features, the covariance matrix will be,
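In code, the covariance matrix can be computed with NumPy (a sketch assuming X_std is the standardized feature matrix, with one row per observation and one column per feature):
import numpy as np

cov_matrix = np.cov(X_std, rowvar=False)   # rowvar=False: columns are treated as variables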
1. Compute the eigen values and eigen vectors of the covariance matrix A by solving the eigen equation Av = λv, where λ is an eigen value and v is the corresponding eigen vector.
2. Once you have computed the eigen vectors, sort the eigen values in descending order (for all variables); the corresponding eigen vectors give you the list of principal components.
3. The eigen vectors represent the directions of the principal components, and the eigen values represent the amount of variance captured along each direction.
4. This means that if a direction (line) has a large variance, the data points are spread widely along that line, and so the line carries more information about the data.
5. Finally, these principal components form the new axes, which make the data easier to evaluate and make the differences between the observations easier to monitor.
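These steps can be sketched with NumPy as follows (continuing with the hypothetical cov_matrix and X_std from the previous sketch; keeping the top 2 components is only an example):
import numpy as np

eigen_values, eigen_vectors = np.linalg.eigh(cov_matrix)   # eigh: for symmetric matrices
order = np.argsort(eigen_values)[::-1]                     # descending order of eigen values
eigen_values = eigen_values[order]
eigen_vectors = eigen_vectors[:, order]                    # columns are the principal directions
W = eigen_vectors[:, :2]                                   # keep the top-2 principal components
X_projected = X_std @ W                                    # project the data onto the new axes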
Example:
Upon substituting the values into det(A − λI) = 0 and solving the resulting equation, you obtain the eigen values.
Then, substitute each eigen value into the equation (A − λI)v = 0 and solve it to obtain the corresponding eigen vectors v1, v2, v3 and v4.
For instance,
For λ = 2.11691, solving the above equation using Cramer's rule, the
values for the v vector are
v1 = 0.515514
v2 = -0.616625
v3 = 0.399314
v4 = 0.441098
Follow the same process and you will form the following matrix by using
the eigen vectors calculated as instructed.
Now, calculate the sum of each Eigen column, arrange them in
descending order and pick up the topmost Eigen values. These are your
Principal components.
Example:
By solving the above equations, you will get the transformed data as
follows.
Did you notice something? Your large dataset is now compressed into a smaller dataset with very little loss of information. This is the significance of Principal Component Analysis.
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
2. Loading Data
Load the breast_cancer dataset from sklearn.datasets. The dataset has 569 data items with 30 input attributes. There are two output classes: benign and malignant. Due to the 30 input features, it is impossible to visualize this data directly.
from sklearn.datasets import load_breast_cancer
data=load_breast_cancer()
data.keys()
Output:
3. Apply PCA
Standardize the dataset prior to PCA.
Import PCA from sklearn.decomposition.
Choose the number of principal components and set n_components to 3 (the code sketch below follows these steps). After executing this code, we see that the shape of x is (569, 3), while the shape of the actual data is (569, 30). Thus, it is clear that with PCA, the number of dimensions has been reduced from 30 to 3. If we chose n_components=2, the dimensions would be reduced to 2.
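A sketch of these steps, using the imports from step 1 (the PCA object is named principal and the transformed array x to match the rest of this section):
scaler = StandardScaler()
X_std = scaler.fit_transform(data['data'])   # standardize the 30 input features

principal = PCA(n_components=3)
x = principal.fit_transform(X_std)
print(x.shape)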
Output:
(569,3)
4. Check Components
The principal.components_ attribute provides an array in which the number of rows is the number of principal components, while the number of columns equals the number of features in the actual data. We can easily see that there are three rows, as n_components was chosen to be 3; however, each row has 30 columns, as in the actual data.
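For instance (continuing with the fitted principal object from the sketch above):
print(principal.components_.shape)   # (3, 30): 3 principal components, 30 original features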
plt.figure(figsize=(10,10))
plt.scatter(x[:,0],x[:,1],c=data['target'],cmap='plasma')
plt.xlabel('pc1')
plt.ylabel('pc2')
Output:
For three principal components, we need to plot a 3d graph. x[:,0] signifies
the first principal component. Similarly, x[:,1] and x[:,2] represent the second
and the third principal component.
# import relevant libraries for 3d graph
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(10,10))
axis = fig.add_subplot(111, projection='3d')
axis.scatter(x[:,0], x[:,1], x[:,2], c=data['target'], cmap='plasma')
axis.set_xlabel('pc1'); axis.set_ylabel('pc2'); axis.set_zlabel('pc3')
Randomized PCA:
PCA is mostly used for very large data sets with many variables in order to make them clearer and easier to interpret. This can require a lot of computing power and lead to long waiting times. Randomized PCA can be used to reduce the calculation time.
Randomized PCA is based on the singular value decomposition A = U W Vᵀ, which it approximates with a fast randomized algorithm instead of computing it exactly.
Randomized PCA has several advantages over traditional PCA:
1. Scalability: Randomized PCA can handle large datasets that are not
possible to fit into memory using traditional PCA.
2. Speed: Randomized PCA is much faster than traditional PCA for large
datasets, making it more suitable for real-time applications.
3. Sparsity: Randomized PCA is able to handle sparse datasets, which
traditional PCA is not able to handle well.
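In scikit-learn, randomized PCA is selected through the svd_solver parameter of the PCA class imported earlier; a sketch, assuming X_std is the standardized data used above and that scatter_plot below is a user-defined plotting helper:
rnd_pca = PCA(n_components=3, svd_solver="randomized", random_state=42)
X_rpca = rnd_pca.fit_transform(X_std)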
scatter_plot(X_rpca, y)   # scatter_plot: user-defined helper; y holds the class labels
Kernel PCA
PCA is a linear method; that is, it works well only on datasets that are linearly separable. It does an excellent job on such datasets, but if we apply it to non-linear datasets, the result may not be the optimal dimensionality reduction. Kernel PCA uses a kernel function to implicitly project the dataset into a higher-dimensional feature space, where it becomes linearly separable. This is similar to the idea behind Support Vector Machines. There are various kernels, such as linear, polynomial, and Gaussian (RBF).
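A sketch of setting up an RBF kernel PCA with scikit-learn (the gamma value here is only illustrative, and X is assumed to be the dataset being reduced):
from sklearn.decomposition import KernelPCA

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)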
X_reduced = rbf_pca.fit_transform(X)