Module 3
Unsupervised Learning process: Types of Unsupervised Learning, Challenges in Unsupervised Learning - Preprocessing and Scaling, Finding the Value of K, Dimensionality Reduction, Feature Extraction, Clustering - K-means Clustering - Agglomerative Clustering - DBSCAN - Comparing and Evaluating Clustering Algorithms - Hierarchical Clustering
No labeled data: The model works with data where the target variable is unknown, and it tries to infer
patterns or relationships.
Pattern discovery: The primary goal is to discover hidden patterns, structures, or groupings in the data (e.g.,
clustering, dimensionality reduction).
Data grouping: Unsupervised learning models can group similar data points together into clusters or map
high-dimensional data into fewer dimensions while retaining key information (e.g., Principal Component
Analysis).
1. Data Collection: Collect raw data that doesn't have predefined labels.
2. Data Preprocessing: Clean the data by handling missing values, scaling features, and reducing noise.
3. Model Selection: Choose an appropriate unsupervised learning algorithm (e.g., clustering algorithms like K-
means, hierarchical clustering, or dimensionality reduction methods like PCA).
4. Model Training: Train the selected model on the data, where the model identifies patterns or structures
within the data.
5. Pattern Discovery: The model analyzes the data and finds relationships or groups within the data based on
similarity, proximity, or other measures.
6. Model Evaluation: Evaluate the quality of the clusters or patterns discovered using evaluation metrics like
silhouette score (for clustering) or explained variance (for dimensionality reduction).
7. Interpret Results: Interpret and use the identified patterns or groups for further analysis or decision-making.
1. Unsupervised Transformations of a Dataset: These algorithms aim to create new representations of the data
that are often easier for humans or other machine learning algorithms to interpret. Common types include:
o Dimensionality Reduction: This process reduces the number of features in the dataset while
preserving its essential characteristics. A common application is reducing high-dimensional data to two
dimensions for visualization purposes. Techniques like Principal Component Analysis (PCA) are used
for this.
o Finding Components: This approach attempts to discover the underlying components that make up
the data. A good example is topic extraction from text documents, where the task is to identify
unknown topics discussed in the documents. This helps in organizing large collections of text and
understanding themes (e.g., elections, social issues, celebrities).
2. Clustering: Clustering algorithms divide data into distinct groups of similar items without predefined labels.
Each group represents items that are more similar to each other than to those in other groups. An example is
the automatic grouping of images that contain faces into clusters, where each cluster corresponds to images
of the same person. The key here is that the algorithm doesn't know who the people are but groups them
based on similarity, such as facial features.
Lack of Ground Truth: Since there are no predefined labels, it's difficult to measure whether the model has
learned something useful. Unlike supervised learning, where we can directly compare predictions to known
labels, unsupervised learning often requires manual inspection of the results to determine if the algorithm is
grouping or transforming the data in a meaningful way.
Exploratory Use: Due to the evaluation difficulty, unsupervised learning is often used in an exploratory setting,
where the goal is to understand the data better, rather than make automated predictions.
Preprocessing for Supervised Algorithms: Unsupervised learning can also be used as a preprocessing step for
supervised learning. For example, dimensionality reduction or finding hidden components might improve the
performance of a supervised model by reducing complexity or improving feature representation.
StandardScaler
Definition: StandardScaler standardizes features by removing the mean and scaling to unit variance,
transforming the data so that each feature has a mean of 0 and variance of 1.
Example: If a feature has values ranging from 1 to 10, after applying StandardScaler, the values are centered
around 0, and the variance becomes 1, but the exact values are not constrained to any specific range.
RobustScaler
Definition: Unlike StandardScaler, RobustScaler uses the median and the interquartile range (IQR) instead of
the mean and variance, making it more robust to outliers.
Example: If there are extreme outliers in the data (like erroneous entries), this method will minimize their
influence, making it ideal for data with many outliers.
MinMaxScaler
Definition: MinMaxScaler transforms the data such that the feature values are scaled between a specified
range, usually between 0 and 1.
Example: If a feature's values are between 10 and 20, applying MinMaxScaler will scale them between 0 and
1, where 10 becomes 0, and 20 becomes 1.
Normalizer
Definition: The Normalizer scales each data point to have a unit norm (i.e., the length of each data point vector
becomes 1). It works well when the direction of the data is more important than the magnitude.
Example: In text analysis, where each document is represented by a vector of word counts, Normalizer adjusts
each vector to have a length of 1, making comparisons between documents more meaningful.
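The four scalers above are available in scikit-learn. The following is a minimal sketch, assuming scikit-learn and NumPy are installed, that applies each scaler to a small made-up matrix so their different behaviours can be compared:
```python
# Compare the four scalers on a small synthetic matrix (values are made up
# for illustration; the 100.0 acts as an outlier in the first feature).
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler, Normalizer

X = np.array([[1.0, 10.0],
              [2.0, 15.0],
              [3.0, 20.0],
              [100.0, 12.0]])

for scaler in (StandardScaler(), RobustScaler(), MinMaxScaler(), Normalizer()):
    X_scaled = scaler.fit_transform(X)   # learn the statistics, then transform
    print(scaler.__class__.__name__)
    print(X_scaled.round(2))
```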
1. Elbow Method
The Elbow Method involves running the K-means clustering algorithm for a range of cluster numbers (usually from 1 to 10) and observing the Sum of Squared Errors (SSE) or the percentage of variance explained for each k. The idea is to look for an "elbow" point on the graph, which indicates the optimal number of clusters. Here's a detailed explanation:
Run K-means Clustering: Perform clustering for different values of k (e.g., 1 to 10).
Compute SSE or Variance Explained: For each k, calculate the sum of squared errors (SSE) or the percentage of variance explained.
Plot Results: Plot the number of clusters k against SSE or variance explained. The optimal k is often the point where the curve bends or levels off, which is called the "elbow."
Interpretation: The elbow represents the point where increasing the number of clusters no longer significantly reduces the SSE, thus suggesting the ideal k.
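A minimal sketch of the Elbow Method, assuming scikit-learn and matplotlib are available; the blob data and the range of k from 1 to 10 are illustrative choices:
```python
# Run K-means for k = 1..10 and record the SSE (exposed by scikit-learn as
# the `inertia_` attribute), then plot it to look for the elbow.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # toy data

sse = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)          # sum of squared distances to closest centroid

plt.plot(k_values, sse, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("SSE (inertia)")
plt.show()                           # look for the 'elbow' in the curve
```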
2. Average Silhouette Method
The Average Silhouette Method helps assess the quality of clusters by measuring how well-separated they are. The silhouette score indicates how similar an object is to its own cluster compared to other clusters. The silhouette value ranges from -1 to 1, where a value near 1 means the point is well matched to its own cluster, a value near 0 means it lies between two clusters, and a negative value suggests it may have been assigned to the wrong cluster.
Compute Silhouette Score: For each k, calculate the average silhouette score.
Plot Results: Plot the silhouette score for each k. The optimal k corresponds to the highest silhouette score.
Interpretation: The value of k that maximizes the average silhouette score is considered the best for clustering.
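A matching sketch for the Average Silhouette Method, reusing the toy data `X` from the elbow sketch above; silhouette_score requires at least two clusters, so k starts at 2:
```python
# Compute the average silhouette score for k = 2..10 and pick the best k.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # mean silhouette over all samples

best_k = max(scores, key=scores.get)          # k with the highest average score
print(scores, "->", best_k)
```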
1. Dimensionality Reduction
Dimensionality reduction refers to the process of reducing the number of input variables (features) in a dataset while
retaining as much information as possible. High-dimensional data (data with many features) can lead to several
problems, such as increased computation time, overfitting, and difficulty in visualizing the data. Dimensionality
reduction aims to address these issues by projecting data into a lower-dimensional space.
Noise Reduction: High-dimensional spaces tend to have more noise, which dimensionality reduction can help
eliminate.
Improved Performance: Reducing the number of features can improve the performance of machine learning
algorithms by eliminating irrelevant or redundant features.
Principal Component Analysis (PCA): PCA is one of the most widely used techniques for dimensionality
reduction. It finds a set of orthogonal (uncorrelated) axes, called principal components, that explain the most
variance in the data. These components are linear combinations of the original features and are ordered by
the amount of variance they explain. By keeping the first few principal components, we can reduce the data
dimensions without losing much information.
Steps in PCA:
1. Center the Data: Subtract the mean of each feature from the dataset.
2. Compute the Covariance Matrix: This matrix describes the relationships between different features.
3. Compute Eigenvalues and Eigenvectors: Eigenvectors represent the directions of maximum variance,
and eigenvalues represent the magnitude of variance along those directions.
4. Select Principal Components: Choose the top k eigenvectors (with the largest eigenvalues) to reduce
the dataset to k dimensions.
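A short sketch of PCA with scikit-learn following the steps above; the Iris dataset and the choice of two components are illustrative:
```python
# Standardize the features, then project onto the two principal components
# that explain the most variance.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

X = StandardScaler().fit_transform(load_iris().data)  # center and scale first

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)         # project onto the top 2 components
print(pca.explained_variance_ratio_)     # fraction of variance each one explains
```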
Linear Discriminant Analysis (LDA): LDA is used for supervised dimensionality reduction, often when we have
labeled data. It aims to find a lower-dimensional space that maximizes the separation between different classes
while minimizing the variance within each class.
2. Feature Extraction
Feature extraction is the process of transforming raw data into a set of features that can be more effectively used in
machine learning models. Instead of using the original raw data, feature extraction involves deriving new, more
informative attributes (features) from the data.
Improved Model Performance: Extracting relevant features can lead to better model accuracy and
generalization.
Data Compression: Feature extraction can reduce the data size by creating a more compact representation,
which is especially useful in situations with limited storage or processing power.
Noise Reduction: By focusing on important features, irrelevant or noisy data can be eliminated.
Non-Negative Matrix Factorization (NMF): NMF is a matrix factorization technique that decomposes a non-
negative matrix into two lower-dimensional non-negative matrices. It is often used for feature extraction in
text data (e.g., in topic modeling or document clustering) or image data (e.g., extracting parts of an image).
The main advantage of NMF is that it produces a part-based representation, which can be more interpretable.
Independent Component Analysis (ICA): ICA is a technique similar to PCA, but it aims to find components that
are statistically independent, rather than uncorrelated. It is particularly useful in applications like separating
mixed signals (e.g., blind source separation).
Wavelet Transform: In signal processing, the wavelet transform is used to extract features from time-series
data. It represents data at multiple scales and resolutions, which is useful for detecting patterns at different
levels of granularity.
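As an illustration of the first technique listed above, here is a minimal NMF sketch; the four example documents and the choice of two topics are made up for demonstration:
```python
# Decompose a small non-negative document-term count matrix into
# document-topic weights (W) and topic-term weights (H).
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the election results were announced",
        "voters turned out for the election",
        "the celebrity appeared in a new film",
        "the film festival honoured the celebrity"]

counts = CountVectorizer().fit_transform(docs)   # non-negative count matrix
nmf = NMF(n_components=2, random_state=0)
W = nmf.fit_transform(counts)    # document-topic weights
H = nmf.components_              # topic-term weights (parts-based representation)
print(W.round(2))
```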
3. Manifold Learning
Manifold learning is a form of non-linear dimensionality reduction that aims to uncover the intrinsic structure of data,
assuming that the data lies on a lower-dimensional manifold embedded in a higher-dimensional space. Unlike linear
dimensionality reduction techniques like PCA, manifold learning can handle non-linear relationships in the data.
Understanding the Intrinsic Structure: It helps in finding the hidden patterns and relationships in high-
dimensional data.
Non-Linear Relationships: While techniques like PCA assume linearity, manifold learning techniques can
capture more complex non-linear structures in data.
Isomap: Isomap is a non-linear dimensionality reduction technique that preserves the geodesic distances
between data points. It first constructs a neighborhood graph based on the Euclidean distances between points
and then computes the shortest path distances on this graph to preserve the manifold structure.
Locally Linear Embedding (LLE): LLE is a method that focuses on preserving local linear relationships between
data points. It seeks to map the data points into a lower-dimensional space while maintaining their local
neighborhood relationships.
Laplacian Eigenmaps: This is another technique based on graph theory that seeks to preserve local
neighborhood information. It constructs a graph with edges between points that are nearby in the high-
dimensional space and uses eigenvectors to find a lower-dimensional representation.
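A brief sketch of two of these manifold learning methods in scikit-learn, using the S-curve toy dataset (a 2-D surface embedded in 3-D); the neighbourhood size of 10 is an illustrative choice:
```python
# Unroll a non-linear 3-D manifold into 2-D with Isomap and LLE.
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap, LocallyLinearEmbedding

X, _ = make_s_curve(n_samples=1000, random_state=0)

X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
X_lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2,
                               random_state=0).fit_transform(X)
print(X_iso.shape, X_lle.shape)   # both map the 3-D curve into 2 dimensions
```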
Clustering
Clustering is a type of unsupervised learning where the task is to partition a dataset into groups, called clusters, based on the similarity of data points. Points within a cluster are similar to one another, while points in different clusters are dissimilar.
K-means clustering
K-means clustering is an unsupervised machine learning algorithm used to partition a dataset into a set of distinct
clusters. Each cluster contains data points that are similar to each other based on a certain distance metric (typically
Euclidean distance). The goal of K-means is to group similar data points together while ensuring that data points in
different clusters are as dissimilar as possible.
Key Concepts:
1. Centroid: The central point (mean) of a cluster, representing the "average" location of all the data points in the cluster.
2. K (Number of Clusters): A user-defined value that specifies how many clusters the algorithm should divide the data into. It is an important hyperparameter that needs to be chosen beforehand.
1. Initialization: Choose K initial centroids randomly from the dataset or use a more sophisticated initialization
technique like K-means++ to select centroids that are more spread out.
2. Assignment Step: Assign each data point to the nearest centroid. This forms K clusters based on the proximity
of each point to the centroid. The proximity is typically calculated using the Euclidean distance between the
data points and the centroids.
3. Update Step: After all points are assigned to the clusters, update the centroids. The new centroid of each
cluster is the mean (average) of all the points assigned to that cluster.
4. Repeat: Steps 2 and 3 are repeated until the centroids stop changing (i.e., the algorithm converges). This means
that the clusters are stable and further iterations will not change the assignments of data points.
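A minimal K-means sketch with scikit-learn reflecting the steps above; the blob data and k = 3 are illustrative choices:
```python
# K-means with k-means++ initialization; the assignment/update iterations
# are handled internally until the centroids converge.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
labels = km.fit_predict(X)        # cluster index for every data point
print(km.cluster_centers_)        # final centroids (mean of each cluster)
print(km.n_iter_)                 # iterations until convergence
```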
Advantages of K-means:
Simple and fast: The algorithm is easy to understand and can work with large datasets.
Scalable: It performs well when there are a large number of data points.
Disadvantages of K-means:
Choosing K: Selecting the right value for K can be challenging and requires domain knowledge or methods like
the Elbow Method.
Sensitivity to Initialization: K-means is sensitive to the initial placement of centroids. Different initializations
can lead to different results.
Assumes Spherical Clusters: It assumes that clusters are spherical in shape and equally sized. K-means may
not perform well if clusters have different shapes, sizes, or densities.
Outliers: K-means is sensitive to outliers because they can drastically change the position of the centroid.
Applications:
Image compression
Anomaly detection
https://medium.com/@karna.sujan52/k-means-algorithm-solved-numerical-3c94d25076e8
https://youtu.be/KzJORp8bgqs?si=GHHrjM0h8pb3zEdd
Agglomerative clustering
Agglomerative clustering is a type of hierarchical clustering that builds clusters by iteratively merging the closest ones.
Unlike K-means, which requires specifying the number of clusters ahead of time, agglomerative clustering merges
clusters based on similarity until a stopping criterion is met, usually defined by the number of clusters you want to
have.
1. Initial state: Each data point starts as its own cluster (called a "singleton" cluster).
2. Merge clusters: In each iteration, the two closest clusters (based on a defined distance metric) are merged.
3. Stop condition: The merging process continues until the desired number of clusters is reached, or another
stopping criterion is met.
Linkage Criteria:
The method used to define the "closeness" between clusters is called linkage. Different linkage criteria can be used to decide which clusters to merge. In scikit-learn, there are three main linkage options:
1. Ward linkage:
o Merges the two clusters that result in the least increase in the variance of the merged cluster. This method tends to result in clusters of similar size.
2. Average linkage:
o Merges the two clusters that have the smallest average distance between all points in the clusters. This can work better when clusters vary in size.
3. Complete linkage:
o Merges the two clusters that have the smallest maximum distance between their points. This tends to create more compact clusters and can be useful when the clusters have different shapes or densities.
Visual Explanation:
Agglomerative clustering progressively merges the two most similar clusters. At first, each data point is its own
cluster, and gradually, these clusters are combined:
Steps 2-4: The closest pairs of clusters are merged, producing two-point clusters.
Steps 5-9: Clusters grow in size as they merge, and by step 9 only the desired number of clusters remains (e.g., 3 clusters in the example).
While agglomerative clustering requires the user to specify the number of clusters (n_clusters), there are various
methods for determining the optimal number of clusters, such as the Elbow Method or Silhouette Score, which can
be used to evaluate the quality of clustering and guide the choice of n_clusters.
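A short sketch of agglomerative clustering in scikit-learn, looping over the three linkage options discussed above; the blob data and n_clusters = 3 are illustrative:
```python
# Agglomerative clustering: singleton clusters are merged repeatedly until
# only n_clusters remain, using the chosen linkage criterion.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=1)

for linkage in ("ward", "average", "complete"):
    agg = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    labels = agg.fit_predict(X)   # cluster label for every data point
    print(linkage, labels[:10])
```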
Pros:
o Can handle clusters of different shapes and sizes.
o Does not require specifying the number of clusters upfront (but can if desired); can be more flexible than methods like K-means.
Cons:
o Computationally more expensive than K-means, especially for large datasets.
o It can be sensitive to noisy data and outliers.
https://youtu.be/YH0r47m0kFM?si=xrEG3GWLwF6fY59p ,
https://youtu.be/d1qAwe8hthM?si=dyxfD2_hLWOG6V7m
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular unsupervised clustering algorithm
that groups together data points that are closely packed together while marking points that lie alone in low-density
regions as outliers or noise.
Dense Regions: DBSCAN assumes that clusters are formed by areas of high density separated by areas of
low density.
Core Points: A point is considered a core point if there are at least min_samples points (including the
point itself) within a radius eps (epsilon) around it.
Border Points: Points that are within the eps radius of a core point but do not have enough neighbors to
be core points themselves.
Noise Points: Points that are not within the eps neighborhood of any core point.
2. Algorithm Steps:
Step 1: Select an Unvisited Point: Pick an arbitrary point from the dataset.
Step 2: Retrieve Neighbors: Identify all points within distance eps of the selected point.
Step 3: Check the Core Point Condition:
o If the number of neighbors is less than min_samples, label the point as noise (temporarily).
o If the number is equal to or greater than min_samples, mark it as a core point and start forming a new cluster.
Step 4: Expand the Cluster:
o If any neighbor is also a core point, merge its neighbors into the current cluster.
o Continue until no more new core points can be added to the cluster.
Step 5: Repeat: Continue this process with unvisited points until all points have been classified as core points, border points, or noise.
3. Cluster Formation:
Clusters: Each cluster consists of core points together with the border points that fall within the eps radius of those core points.
Noise: Points that never meet the density criteria remain labeled as noise (commonly represented with a label of -1).
Key Parameters:
eps (ε): The maximum distance between two points for one to be considered as in the neighborhood of the
other. This parameter defines the radius of the neighborhood.
min_samples: The minimum number of points required within the eps radius for a point to be considered a
core point. It essentially controls the minimum cluster size.
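A minimal DBSCAN sketch on the two-moons toy dataset; the values eps = 0.2 and min_samples = 5 are illustrative choices for this data scale, not recommended defaults:
```python
# DBSCAN finds the two non-convex moon shapes by density; points that do not
# belong to any dense region get the label -1 (noise).
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)            # -1 marks noise points
print(set(labels))                    # cluster labels found, plus -1 if any noise
```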
Benefits of DBSCAN:
No Need to Specify the Number of Clusters: Unlike k-means, DBSCAN automatically discovers the number of
clusters based on the data density.
Detects Arbitrarily Shaped Clusters: DBSCAN can find clusters of various shapes and sizes because it relies on
density rather than geometric assumptions.
Noise Identification: It effectively identifies and separates out noise or outlier points from the clusters.
Drawbacks:
Parameter Sensitivity: The performance of DBSCAN is heavily dependent on the choice of eps and
min_samples. Poorly chosen values can lead to suboptimal clustering.
Varying Density: It may struggle with datasets containing clusters of differing densities, as a single eps value
may not be suitable for all clusters.
Computational Complexity: Although it scales reasonably well, DBSCAN can be slower on very large datasets
compared to simpler methods like k-means.
https://medium.com/@karna.sujan52/density-based-dbscan-numerical-f4e00b9cce68
https://youtu.be/-p354tQsKrs?si=XmNASbKi1QyK3dOe
COMPARING AND EVALUATING CLUSTERING ALGORITHMS
| Criteria | K-Means | Agglomerative | DBSCAN | Hierarchical (General) |
|---|---|---|---|---|
| Cluster Shape | Spherical or convex | Can detect clusters of various shapes | Arbitrary shapes (based on density) | Can detect clusters of various shapes |
| Scalability | Very fast for large datasets, O(n) | Computationally expensive, O(n^2) | O(n log n) for spatial data | Depends on the method: agglomerative can be slow (O(n^2)), divisive can be O(n^3) |
| Handling Noise | Does not handle noise well (outliers may be assigned to clusters) | Sensitive to noise (less so than K-Means) | Effectively identifies and separates noise (outliers) | Noise handling depends on implementation |
| Parameter Tuning | Needs predefined k value | No predefined number of clusters, but requires a distance threshold | Needs eps and min_samples values to be set | Requires distance threshold, but no k value |
| Cluster Size | Assumes roughly equal-sized clusters | Can handle clusters of different sizes | Can handle clusters of different sizes and densities | Can handle clusters of different sizes |
| Centroid-based | Yes (cluster centers are calculated) | No (clusters are formed hierarchically) | No (clusters are formed based on density) | No (clusters are formed based on distance hierarchy) |
| Flexibility | Less flexible due to the need for predefined k | More flexible; can work with any number of clusters and shapes | Highly flexible; does not assume a fixed number of clusters | Very flexible (varies based on specific hierarchical method) |
| Outliers | Often ignored or grouped with nearest cluster | Can be affected by outliers (but less so than K-Means) | Explicitly detects and handles outliers (noise points) | Can handle outliers (depending on method) |
| Example Use Case | When the number of clusters is known and clusters are compact and spherical | When the data has hierarchical relationships or varying cluster shapes | When data has noise, varying density, or irregular shapes | When hierarchical relationships exist between clusters |
| Visualization | Easy to visualize in 2D/3D (but struggles with more complex datasets) | Easy to visualize, especially for small datasets | Works well for visualizing density-based clusters | Visualizations can show dendrograms for agglomerative or divisive clustering |
Hierarchical Clustering
What Is Hierarchical Clustering
Hierarchical clustering, or hierarchical clustering analysis, is a cluster analysis technique that creates a hierarchy of
clusters from points in a dataset.
With clustering, data points are put into groups — known as clusters — based on similarities like color, shape or other
features. In hierarchical clustering, each cluster is placed within a nested tree-like hierarchy, where clusters are grouped
and break down further into smaller clusters depending on similarities. Here, the closer clusters are together in the
hierarchy, the more similar they are to each other.
While clustering analyses like k-means can visualize data points as distinct and linear groups, hierarchical clustering
visualizes data groups in relation to one another with multiple levels of similarity.
Hierarchical clustering is used to help find patterns and related occurrences within datasets, especially those that are
complex or multifaceted.
The hierarchical clustering process involves finding the two data points closest to each other and combining the two
most similar ones. After repeating this process until all data points are grouped into clusters, the end result is a
hierarchical tree of related groups known as a dendrogram.
Hierarchical clustering is based on the core idea that similar objects lie near each other in the data space while dissimilar objects lie far apart. It uses distance functions to find nearby data points and group them together as clusters.
There are different types of clustering algorithms, including centroid-based clustering algorithms, connectivity-based
clustering algorithms (hierarchical clustering), distribution-based clustering algorithms and density-based clustering
algorithms. The two main types of hierarchical clustering include agglomerative clustering and divisive clustering.
Agglomerative clustering: Divide the data points into different clusters and then aggregate them as the
distance decreases.
Divisive clustering: Combine all the data points as a single cluster and divide them as the distance between
them increases.
1. Agglomerative Clustering
Agglomerative clustering is a bottom-up approach. It starts clustering by treating the individual data points as a single
cluster, then it is merged continuously based on similarity until it forms one big cluster containing all objects. It is good
at identifying small clusters.
1. Compute the proximity (distance) matrix between all the individual data points.
2. Use a linkage function to group objects into a hierarchical cluster tree based on the computed distance matrix from the above step.
3. Data points in close proximity are merged together to form a cluster.
After calculating the proximity matrix, the most similar points are merged: points 2, 3 and points 4, 5 form the first clusters.
The proximity matrix is then recomputed, and the cluster containing points 4, 5 is merged with point 6.
The proximity matrix is computed again, and the clusters containing points 4, 5, 6 and points 2, 3 are merged into one cluster.
As a final step, the remaining clusters are merged together to form a single cluster.
The above proximity matrix covers n points x1, ..., xn, where d(xi, xj) represents the distance between points xi and xj. In order to group the data points into clusters, a linkage function is used: it takes the values in the proximity matrix and groups the data points based on similarity. The newly formed clusters are linked to each other until they form a single cluster containing all the data points.
Single linkage: The minimum of all pairwise distance between elements in each pair of clusters is used to
measure the distance between two clusters.
Average linkage: The average of all pairwise distances between elements in each pair of clusters is used to
measure the distance between two clusters.
Centroid linkage: Before merging, the distance between the two clusters’ centroids are considered.
Ward’s Method: It uses squared error to compute the similarity of the two clusters for merging.
Dendrogram Charts
From the above chart we can visualize the hierarchical technique. So how do we find the optimal number of clusters from the chart?
To find it, draw a horizontal line where there is no overlap in the vertical lines of the bars. The number of bars below the line that do not overlap is the optimal number of clusters. Refer to the figure below for a clear illustration.
From the above figure, we have three bars below the horizontal line, so the optimal number of clusters is three. Also,
if you recall, the Iris dataset has three classes and we got the same number from the above chart.
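A brief sketch of producing a dendrogram for the Iris data with SciPy and cutting it with a horizontal line, as described above; the cut height shown is illustrative, not a computed optimum:
```python
# Build the agglomerative merge history (linkage matrix) with Ward's method
# and draw the dendrogram; the dashed line marks an example horizontal cut.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.datasets import load_iris

X = load_iris().data
Z = linkage(X, method="ward")     # pairwise merges and their distances

dendrogram(Z)
plt.axhline(y=7, linestyle="--")  # illustrative cut height (assumption)
plt.show()
```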
2. Divisive Clustering
Divisive clustering works in just the opposite way to agglomerative clustering. It starts by considering all the data points as one big cluster and then repeatedly splits them into smaller clusters until every data point is in its own cluster. It is therefore good at identifying large clusters. It follows a top-down approach and can be more efficient than agglomerative clustering when a complete hierarchy is not needed, but because it is more complex to implement, it has no predefined implementation in the major machine learning frameworks.
After calculating the proximity matrix, the points are split into separate clusters based on their dissimilarity.
The proximity matrix is recomputed at each step until every point is assigned to an individual cluster.
The proximity matrix and linkage function follow the same procedure as in agglomerative clustering. Because divisive clustering is rarely used in practice, there is no predefined class/function for it in any major Python library.
Hierarchical clustering can be used for several applications, ranging from customer segmentation to object recognition.
Market Segmentation
Companies can better understand their markets by identifying target groups based on certain traits like demographics,
personal interests or behaviors. Organizing consumers according to these characteristics enables organizations to see
what consumers care about and whose backgrounds or interests may align with their products and services. Businesses
can then tailor products, marketing ads and other materials according to the preferences of target audiences.
Geo-Spatial Analysis
Besides grouping individual consumers or customers based on specific traits, hierarchical clustering can also group
individuals based on their geographic location. Organizations can then view where their customer bases are located,
predict product demand in certain areas and adjust their marketing and business strategies accordingly.
Image Segmentation
When dealing with images, hierarchical clustering can distinguish between separate visual elements. For example, the
technique can discern between different facial features, aiding facial recognition technology. It can also discriminate
between cars and other objects like buildings and animals, powering image recognition technology.
Anomaly Detection
Hierarchical clustering is also effective at detecting anomalies. By clustering data points into groups, hierarchical
clustering can isolate outliers that don’t belong to any cluster. Researchers can apply this method to root out errors at
different stages of the data collection process, preventing anomalies from impacting the accuracy of data sets.
Hierarchical clustering isn’t a fix-all; it does have some limits. Among them:
It has high time and space computational complexity. Computing the proximity matrix takes O(N^2) time, and since the merge step is repeated roughly N times, the total time complexity is O(N^3).
Clustering helps with the analysis of an unlabelled dataset by grouping the data points based on their similarity. In terms of business needs, clustering helps quickly segment customers and supports insightful decision-making.
https://youtu.be/YH0r47m0kFM?si=xrEG3GWLwF6fY59p ,
https://youtu.be/d1qAwe8hthM?si=dyxfD2_hLWOG6V7m