Module 3
Unsupervised Learning process: Types of Unsupervised Learning, Challenges in Unsupervised Learning - Preprocessing and Scaling, Finding the Value of K, Dimensionality Reduction, Feature Extraction, Clustering - K-means Clustering - Agglomerative Clustering - DBSCAN - Comparing and Evaluating Clustering Algorithms - Hierarchical Clustering
No labeled data: The model works with data where the target variable is unknown, and it tries to infer
patterns or relationships.
Pattern discovery: The primary goal is to discover hidden patterns, structures, or groupings in the data (e.g.,
clustering, dimensionality reduction).
Data grouping: Unsupervised learning models can group similar data points together into clusters or map
high-dimensional data into fewer dimensions while retaining key information (e.g., Principal Component
Analysis).
1. Data Collection: Collect raw data that doesn't have predefined labels.
2. Data Preprocessing: Clean the data by handling missing values, scaling features, and reducing noise.
3. Model Selection: Choose an appropriate unsupervised learning algorithm (e.g., clustering algorithms like K-
means, hierarchical clustering, or dimensionality reduction methods like PCA).
4. Model Training: Train the selected model on the data, where the model identifies patterns or structures
within the data.
5. Pattern Discovery: The model analyzes the data and finds relationships or groups within the data based on
similarity, proximity, or other measures.
6. Model Evaluation: Evaluate the quality of the clusters or patterns discovered using evaluation metrics like
silhouette score (for clustering) or explained variance (for dimensionality reduction).
7. Interpret Results: Interpret and use the identified patterns or groups for further analysis or decision-making.
1. Unsupervised Transformations of a Dataset: These algorithms aim to create new representations of the data
that are often easier for humans or other machine learning algorithms to interpret. Common types include:
o Dimensionality Reduction: This process reduces the number of features in the dataset while
preserving its essential characteristics. A common application is reducing high-dimensional data to two
dimensions for visualization purposes. Techniques like Principal Component Analysis (PCA) are used
for this.
o Finding Components: This approach attempts to discover the underlying components that make up
the data. A good example is topic extraction from text documents, where the task is to identify
unknown topics discussed in the documents. This helps in organizing large collections of text and
understanding themes (e.g., elections, social issues, celebrities).
2. Clustering: Clustering algorithms divide data into distinct groups of similar items without predefined labels.
Each group represents items that are more similar to each other than to those in other groups. An example is
the automatic grouping of images that contain faces into clusters, where each cluster corresponds to images
of the same person. The key here is that the algorithm doesn't know who the people are but groups them
based on similarity, such as facial features.
Lack of Ground Truth: Since there are no predefined labels, it's difficult to measure whether the model has
learned something useful. Unlike supervised learning, where we can directly compare predictions to known
labels, unsupervised learning often requires manual inspection of the results to determine if the algorithm is
grouping or transforming the data in a meaningful way.
Exploratory Use: Due to the evaluation difficulty, unsupervised learning is often used in an exploratory setting,
where the goal is to understand the data better, rather than make automated predictions.
Preprocessing for Supervised Algorithms: Unsupervised learning can also be used as a preprocessing step for
supervised learning. For example, dimensionality reduction or finding hidden components might improve the
performance of a supervised model by reducing complexity or improving feature representation.
StandardScaler
Definition: StandardScaler standardizes features by removing the mean and scaling to unit variance,
transforming the data so that each feature has a mean of 0 and variance of 1.
Example: If a feature has values ranging from 1 to 10, after applying StandardScaler, the values are centered
around 0, and the variance becomes 1, but the exact values are not constrained to any specific range.
RobustScaler
Definition: Unlike StandardScaler, RobustScaler uses the median and the interquartile range (IQR) instead of
the mean and variance, making it more robust to outliers.
Example: If there are extreme outliers in the data (like erroneous entries), this method will minimize their
influence, making it ideal for data with many outliers.
MinMaxScaler
Definition: MinMaxScaler transforms the data such that the feature values are scaled between a specified
range, usually between 0 and 1.
Example: If a feature's values are between 10 and 20, applying MinMaxScaler will scale them between 0 and
1, where 10 becomes 0, and 20 becomes 1.
Normalizer
Definition: The Normalizer scales each data point to have a unit norm (i.e., the length of each data point vector
becomes 1). It works well when the direction of the data is more important than the magnitude.
Example: In text analysis, where each document is represented by a vector of word counts, Normalizer adjusts
each vector to have a length of 1, making comparisons between documents more meaningful.
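The four scalers above are available in scikit-learn. The following is a minimal sketch, assuming scikit-learn and NumPy are installed, that applies each scaler to a small made-up matrix so their different behaviours can be compared:
```python
# Compare the four scalers on a small synthetic matrix (values are made up
# for illustration; the 100.0 acts as an outlier in the first feature).
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler, Normalizer

X = np.array([[1.0, 10.0],
              [2.0, 15.0],
              [3.0, 20.0],
              [100.0, 12.0]])

for scaler in (StandardScaler(), RobustScaler(), MinMaxScaler(), Normalizer()):
    X_scaled = scaler.fit_transform(X)   # learn the statistics, then transform
    print(scaler.__class__.__name__)
    print(X_scaled.round(2))
```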
1. Elbow Method
The Elbow Method involves running the K-means clustering algorithm for a range of cluster numbers (usually from 1 to 10) and observing the Sum of Squared Errors (SSE) or the percentage of variance explained for each k. The idea is to look for an "elbow" point on the graph, which indicates the optimal number of clusters. Here's a detailed explanation:
Run K-means Clustering: Perform clustering for different values of k (e.g., 1 to 10).
Compute SSE or Variance Explained: For each k, calculate the sum of squared errors (SSE) or the percentage of variance explained.
Plot Results: Plot the number of clusters k against SSE or variance explained. The optimal k is often the point where the curve bends or levels off, which is called the "elbow."
Interpretation: The elbow represents the point where increasing the number of clusters no longer significantly reduces the SSE, thus suggesting the ideal k.
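A minimal sketch of the Elbow Method, assuming scikit-learn and matplotlib are available; the blob data and the range of k from 1 to 10 are illustrative choices:
```python
# Run K-means for k = 1..10 and record the SSE (exposed by scikit-learn as
# the `inertia_` attribute), then plot it to look for the elbow.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # toy data

sse = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)          # sum of squared distances to closest centroid

plt.plot(k_values, sse, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("SSE (inertia)")
plt.show()                           # look for the 'elbow' in the curve
```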
2. Average Silhouette Method
The Average Silhouette Method helps assess the quality of clusters by measuring how well-separated they are. The silhouette score indicates how similar an object is to its own cluster compared to other clusters. The silhouette value ranges from -1 to 1, where a value near 1 means the point is well matched to its own cluster, a value near 0 means it lies between two clusters, and a negative value suggests it may have been assigned to the wrong cluster.
Compute Silhouette Score: For each k, calculate the average silhouette score.
Plot Results: Plot the silhouette score for each k. The optimal k corresponds to the highest silhouette score.
Interpretation: The value of k that maximizes the average silhouette score is considered the best for clustering.
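A matching sketch for the Average Silhouette Method, reusing the toy data `X` from the elbow sketch above; silhouette_score requires at least two clusters, so k starts at 2:
```python
# Compute the average silhouette score for k = 2..10 and pick the best k.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # mean silhouette over all samples

best_k = max(scores, key=scores.get)          # k with the highest average score
print(scores, "->", best_k)
```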
1. Dimensionality Reduction
Dimensionality reduction refers to the process of reducing the number of input variables (features) in a dataset while
retaining as much information as possible. High-dimensional data (data with many features) can lead to several
problems, such as increased computation time, overfitting, and difficulty in visualizing the data. Dimensionality
reduction aims to address these issues by projecting data into a lower-dimensional space.
Noise Reduction: High-dimensional spaces tend to have more noise, which dimensionality reduction can help
eliminate.
Improved Performance: Reducing the number of features can improve the performance of machine learning
algorithms by eliminating irrelevant or redundant features.
Principal Component Analysis (PCA): PCA is one of the most widely used techniques for dimensionality
reduction. It finds a set of orthogonal (uncorrelated) axes, called principal components, that explain the most
variance in the data. These components are linear combinations of the original features and are ordered by
the amount of variance they explain. By keeping the first few principal components, we can reduce the data
dimensions without losing much information.
Steps in PCA:
1. Center the Data: Subtract the mean of each feature from the dataset.
2. Compute the Covariance Matrix: This matrix describes the relationships between different features.
3. Compute Eigenvalues and Eigenvectors: Eigenvectors represent the directions of maximum variance,
and eigenvalues represent the magnitude of variance along those directions.
4. Select Principal Components: Choose the top k eigenvectors (with the largest eigenvalues) to reduce
the dataset to k dimensions.
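A short sketch of PCA with scikit-learn following the steps above; the Iris dataset and the choice of two components are illustrative:
```python
# Standardize the features, then project onto the two principal components
# that explain the most variance.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

X = StandardScaler().fit_transform(load_iris().data)  # center and scale first

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)         # project onto the top 2 components
print(pca.explained_variance_ratio_)     # fraction of variance each one explains
```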
Linear Discriminant Analysis (LDA): LDA is used for supervised dimensionality reduction, often when we have
labeled data. It aims to find a lower-dimensional space that maximizes the separation between different classes
while minimizing the variance within each class.
2. Feature Extraction
Feature extraction is the process of transforming raw data into a set of features that can be more effectively used in
machine learning models. Instead of using the original raw data, feature extraction involves deriving new, more
informative attributes (features) from the data.
Improved Model Performance: Extracting relevant features can lead to better model accuracy and
generalization.
Data Compression: Feature extraction can reduce the data size by creating a more compact representation,
which is especially useful in situations with limited storage or processing power.
Noise Reduction: By focusing on important features, irrelevant or noisy data can be eliminated.
Non-Negative Matrix Factorization (NMF): NMF is a matrix factorization technique that decomposes a non-
negative matrix into two lower-dimensional non-negative matrices. It is often used for feature extraction in
text data (e.g., in topic modeling or document clustering) or image data (e.g., extracting parts of an image).
The main advantage of NMF is that it produces a part-based representation, which can be more interpretable.
Independent Component Analysis (ICA): ICA is a technique similar to PCA, but it aims to find components that
are statistically independent, rather than uncorrelated. It is particularly useful in applications like separating
mixed signals (e.g., blind source separation).
Wavelet Transform: In signal processing, the wavelet transform is used to extract features from time-series
data. It represents data at multiple scales and resolutions, which is useful for detecting patterns at different
levels of granularity.
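As an illustration of the first technique listed above, here is a minimal NMF sketch; the four example documents and the choice of two topics are made up for demonstration:
```python
# Decompose a small non-negative document-term count matrix into
# document-topic weights (W) and topic-term weights (H).
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the election results were announced",
        "voters turned out for the election",
        "the celebrity appeared in a new film",
        "the film festival honoured the celebrity"]

counts = CountVectorizer().fit_transform(docs)   # non-negative count matrix
nmf = NMF(n_components=2, random_state=0)
W = nmf.fit_transform(counts)    # document-topic weights
H = nmf.components_              # topic-term weights (parts-based representation)
print(W.round(2))
```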
3. Manifold Learning
Manifold learning is a form of non-linear dimensionality reduction that aims to uncover the intrinsic structure of data,
assuming that the data lies on a lower-dimensional manifold embedded in a higher-dimensional space. Unlike linear
dimensionality reduction techniques like PCA, manifold learning can handle non-linear relationships in the data.
Understanding the Intrinsic Structure: It helps in finding the hidden patterns and relationships in high-
dimensional data.
Non-Linear Relationships: While techniques like PCA assume linearity, manifold learning techniques can
capture more complex non-linear structures in data.
Isomap: Isomap is a non-linear dimensionality reduction technique that preserves the geodesic distances
between data points. It first constructs a neighborhood graph based on the Euclidean distances between points
and then computes the shortest path distances on this graph to preserve the manifold structure.
Locally Linear Embedding (LLE): LLE is a method that focuses on preserving local linear relationships between
data points. It seeks to map the data points into a lower-dimensional space while maintaining their local
neighborhood relationships.
Laplacian Eigenmaps: This is another technique based on graph theory that seeks to preserve local
neighborhood information. It constructs a graph with edges between points that are nearby in the high-
dimensional space and uses eigenvectors to find a lower-dimensional representation.
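A brief sketch of two of these manifold learning methods in scikit-learn, using the S-curve toy dataset (a 2-D surface embedded in 3-D); the neighbourhood size of 10 is an illustrative choice:
```python
# Unroll a non-linear 3-D manifold into 2-D with Isomap and LLE.
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap, LocallyLinearEmbedding

X, _ = make_s_curve(n_samples=1000, random_state=0)

X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
X_lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2,
                               random_state=0).fit_transform(X)
print(X_iso.shape, X_lle.shape)   # both map the 3-D curve into 2 dimensions
```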
Clustering
Clustering is a type of unsupervised learning where the task is to partition a dataset into groups, called clusters, based on the similarity of data points. Points within a cluster are similar to one another, while points in different clusters are dissimilar.
K-means clustering
K-means clustering is an unsupervised machine learning algorithm used to partition a dataset into a set of distinct
clusters. Each cluster contains data points that are similar to each other based on a certain distance metric (typically
Euclidean distance). The goal of K-means is to group similar data points together while ensuring that data points in
different clusters are as dissimilar as possible.
Key Concepts:
1. Centroid: The central point (mean) of a cluster, representing the "average" location of all the data points in the cluster.
2. K (Number of Clusters): A user-defined value that specifies how many clusters the algorithm should divide the data into. It is an important hyperparameter that needs to be chosen beforehand.
1. Initialization: Choose K initial centroids randomly from the dataset or use a more sophisticated initialization
technique like K-means++ to select centroids that are more spread out.
2. Assignment Step: Assign each data point to the nearest centroid. This forms K clusters based on the proximity
of each point to the centroid. The proximity is typically calculated using the Euclidean distance between the
data points and the centroids.
3. Update Step: After all points are assigned to the clusters, update the centroids. The new centroid of each
cluster is the mean (average) of all the points assigned to that cluster.
4. Repeat: Steps 2 and 3 are repeated until the centroids stop changing (i.e., the algorithm converges). This means
that the clusters are stable and further iterations will not change the assignments of data points.
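A minimal K-means sketch with scikit-learn reflecting the steps above; the blob data and k = 3 are illustrative choices:
```python
# K-means with k-means++ initialization; the assignment/update iterations
# are handled internally until the centroids converge.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
labels = km.fit_predict(X)        # cluster index for every data point
print(km.cluster_centers_)        # final centroids (mean of each cluster)
print(km.n_iter_)                 # iterations until convergence
```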
Advantages of K-means:
Simple and fast: The algorithm is easy to understand and can work with large datasets.
Scalable: It performs well when there are a large number of data points.
Disadvantages of K-means:
Choosing K: Selecting the right value for K can be challenging and requires domain knowledge or methods like
the Elbow Method.
Sensitivity to Initialization: K-means is sensitive to the initial placement of centroids. Different initializations
can lead to different results.
Assumes Spherical Clusters: It assumes that clusters are spherical in shape and equally sized. K-means may
not perform well if clusters have different shapes, sizes, or densities.
Outliers: K-means is sensitive to outliers because they can drastically change the position of the centroid.
Applications:
Image compression
Anomaly detection
https://medium.com/@karna.sujan52/k-means-algorithm-solved-numerical-3c94d25076e8
https://youtu.be/KzJORp8bgqs?si=GHHrjM0h8pb3zEdd
Agglomerative clustering
Agglomerative clustering is a type of hierarchical clustering that builds clusters by iteratively merging the closest ones.
Unlike K-means, which requires specifying the number of clusters ahead of time, agglomerative clustering merges
clusters based on similarity until a stopping criterion is met, usually defined by the number of clusters you want to
have.
1. Initial state: Each data point starts as its own cluster (called a "singleton" cluster).
2. Merge clusters: In each iteration, the two closest clusters (based on a defined distance metric) are merged.
3. Stop condition: The merging process continues until the desired number of clusters is reached, or another
stopping criterion is met.
Linkage Criteria:
The method used to define the "closeness" between clusters is called linkage. Different linkage criteria can be used to decide which clusters to merge. In scikit-learn, there are three main linkage options:
1. Ward linkage:
o Merges the two clusters that result in the least increase in the variance of the merged cluster. This method tends to result in clusters of similar size.
2. Average linkage:
o Merges the two clusters that have the smallest average distance between all points in the clusters. This can work better when clusters vary in size.
3. Complete linkage:
o Merges the two clusters that have the smallest maximum distance between their points. This tends to create more compact clusters and can be useful when the clusters have different shapes or densities.
Visual Explanation:
Agglomerative clustering progressively merges the two most similar clusters. At first, each data point is its own
cluster, and gradually, these clusters are combined:
Steps 2-4: The closest pairs of clusters are merged, producing two-point clusters.
Steps 5-9: Clusters grow in size as they merge, and by step 9 only the desired number of clusters remains (e.g., 3 clusters in the example).
While agglomerative clustering requires the user to specify the number of clusters (n_clusters), there are various
methods for determining the optimal number of clusters, such as the Elbow Method or Silhouette Score, which can
be used to evaluate the quality of clustering and guide the choice of n_clusters.
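A short sketch of agglomerative clustering in scikit-learn, looping over the three linkage options discussed above; the blob data and n_clusters = 3 are illustrative:
```python
# Agglomerative clustering: singleton clusters are merged repeatedly until
# only n_clusters remain, using the chosen linkage criterion.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=1)

for linkage in ("ward", "average", "complete"):
    agg = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    labels = agg.fit_predict(X)   # cluster label for every data point
    print(linkage, labels[:10])
```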
Pros:
o Can handle clusters of different shapes and sizes.
o Does not require specifying the number of clusters upfront (but can if desired); can be more flexible than methods like K-means.
Cons:
o Computationally more expensive than K-means, especially for large datasets.
o It can be sensitive to noisy data and outliers.
https://youtu.be/YH0r47m0kFM?si=xrEG3GWLwF6fY59p ,
https://youtu.be/d1qAwe8hthM?si=dyxfD2_hLWOG6V7m
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular unsupervised clustering algorithm
that groups together data points that are closely packed together while marking points that lie alone in low-density
regions as outliers or noise.
Dense Regions: DBSCAN assumes that clusters are formed by areas of high density separated by areas of
low density.
Core Points: A point is considered a core point if there are at least min_samples points (including the
point itself) within a radius eps (epsilon) around it.
Border Points: Points that are within the eps radius of a core point but do not have enough neighbors to
be core points themselves.
Noise Points: Points that are not within the eps neighborhood of any core point.
2. Algorithm Steps:
Step 1: Select an Unvisited Point: Pick an arbitrary point from the dataset.
Step 2: Retrieve Neighbors: Identify all points within distance eps of the selected point.
Step 3: Check the Core Point Condition:
o If the number of neighbors is less than min_samples, label the point as noise (temporarily).
o If the number is equal to or greater than min_samples, mark it as a core point and start forming a new cluster.
Step 4: Expand the Cluster:
o If any neighbor is also a core point, merge its neighbors into the current cluster.
o Continue until no more new core points can be added to the cluster.
Step 5: Repeat: Continue this process with unvisited points until all points have been classified as core points, border points, or noise.
3. Cluster Formation:
Clusters: Each cluster consists of core points together with the border points that fall within the eps radius of those core points.
Noise: Points that never meet the density criteria remain labeled as noise (commonly represented with a label of -1).
Key Parameters:
eps (ε): The maximum distance between two points for one to be considered as in the neighborhood of the
other. This parameter defines the radius of the neighborhood.
min_samples: The minimum number of points required within the eps radius for a point to be considered a
core point. It essentially controls the minimum cluster size.
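A minimal DBSCAN sketch on the two-moons toy dataset; the values eps = 0.2 and min_samples = 5 are illustrative choices for this data scale, not recommended defaults:
```python
# DBSCAN finds the two non-convex moon shapes by density; points that do not
# belong to any dense region get the label -1 (noise).
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)            # -1 marks noise points
print(set(labels))                    # cluster labels found, plus -1 if any noise
```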
Benefits of DBSCAN:
No Need to Specify the Number of Clusters: Unlike k-means, DBSCAN automatically discovers the number of
clusters based on the data density.
Detects Arbitrarily Shaped Clusters: DBSCAN can find clusters of various shapes and sizes because it relies on
density rather than geometric assumptions.
Noise Identification: It effectively identifies and separates out noise or outlier points from the clusters.
Drawbacks:
Parameter Sensitivity: The performance of DBSCAN is heavily dependent on the choice of eps and
min_samples. Poorly chosen values can lead to suboptimal clustering.
Varying Density: It may struggle with datasets containing clusters of differing densities, as a single eps value
may not be suitable for all clusters.
Computational Complexity: Although it scales reasonably well, DBSCAN can be slower on very large datasets
compared to simpler methods like k-means.
https://medium.com/@karna.sujan52/density-based-dbscan-numerical-f4e00b9cce68
https://youtu.be/-p354tQsKrs?si=XmNASbKi1QyK3dOe
COMPARING AND EVALUATING CLUSTERING ALGORITHMS
| Criteria | K-Means | Agglomerative | DBSCAN | Hierarchical (General) |
|---|---|---|---|---|
| Cluster Shape | Spherical or convex | Can detect clusters of various shapes | Arbitrary shapes (based on density) | Can detect clusters of various shapes |
| Scalability | Very fast for large datasets, O(n) | Computationally expensive, O(n^2) | O(n log n) for spatial data | Depends on the method: agglomerative can be slow (O(n^2)), divisive can be O(n^3) |
| Handling Noise | Does not handle noise well (outliers may be assigned to clusters) | Sensitive to noise (less so than K-Means) | Effectively identifies and separates noise (outliers) | Noise handling depends on implementation |
| Parameter Tuning | Needs predefined k value | No predefined number of clusters, but requires a distance threshold | Needs eps and min_samples values to be set | Requires distance threshold, but no k value |
| Cluster Size | Assumes roughly equal-sized clusters | Can handle clusters of different sizes | Can handle clusters of different sizes and densities | Can handle clusters of different sizes |
| Centroid-based | Yes (cluster centers are calculated) | No (clusters are formed hierarchically) | No (clusters are formed based on density) | No (clusters are formed based on distance hierarchy) |
| Flexibility | Less flexible due to the need for predefined k | More flexible; can work with any number of clusters and shapes | Highly flexible; does not assume a fixed number of clusters | Very flexible (varies based on specific hierarchical method) |
| Outliers | Often ignored or grouped with nearest cluster | Can be affected by outliers (but less so than K-Means) | Explicitly detects and handles outliers (noise points) | Can handle outliers (depending on method) |
| Example Use Case | When the number of clusters is known and clusters are compact and spherical | When the data has hierarchical relationships or varying cluster shapes | When data has noise, varying density, or irregular shapes | When hierarchical relationships exist between clusters |
| Visualization | Easy to visualize in 2D/3D (but struggles with more complex datasets) | Easy to visualize, especially for small datasets | Works well for visualizing density-based clusters | Visualizations can show dendrograms for agglomerative or divisive clustering |
Hierarchical Clustering
What Is Hierarchical Clustering
Hierarchical clustering, or hierarchical clustering analysis, is a cluster analysis technique that creates a hierarchy of
clusters from points in a dataset.
With clustering, data points are put into groups — known as clusters — based on similarities like color, shape or other
features. In hierarchical clustering, each cluster is placed within a nested tree-like hierarchy, where clusters are grouped
and break down further into smaller clusters depending on similarities. Here, the closer clusters are together in the
hierarchy, the more similar they are to each other.
While clustering analyses like k-means can visualize data points as distinct and linear groups, hierarchical clustering
visualizes data groups in relation to one another with multiple levels of similarity.
Hierarchical clustering is used to help find patterns and related occurrences within datasets, especially those that are
complex or multifaceted.
The hierarchical clustering process involves finding the two data points closest to each other and combining the two
most similar ones. After repeating this process until all data points are grouped into clusters, the end result is a
hierarchical tree of related groups known as a dendrogram.
Hierarchical clustering is based on the core idea that similar objects lie near each other in the data space while dissimilar objects lie far apart. It uses distance functions to find nearby data points and group them together as clusters.
There are different types of clustering algorithms, including centroid-based clustering algorithms, connectivity-based
clustering algorithms (hierarchical clustering), distribution-based clustering algorithms and density-based clustering
algorithms. The two main types of hierarchical clustering include agglomerative clustering and divisive clustering.
Agglomerative clustering: Divide the data points into different clusters and then aggregate them as the
distance decreases.
Divisive clustering: Combine all the data points as a single cluster and divide them as the distance between
them increases.
1. Agglomerative Clustering
Agglomerative clustering is a bottom-up approach. It starts clustering by treating the individual data points as a single
cluster, then it is merged continuously based on similarity until it forms one big cluster containing all objects. It is good
at identifying small clusters.
1. Compute the proximity (distance) matrix between all the individual data points.
2. Use a linkage function to group objects into a hierarchical cluster tree based on the computed distance matrix from the above step.
3. Data points in close proximity are merged together to form a cluster.
After calculating the proximity matrix, the most similar points are merged: points 2, 3 and points 4, 5 form the first clusters.
The proximity matrix is then recomputed, and the cluster containing points 4, 5 is merged with point 6.
The proximity matrix is computed again, and the clusters containing points 4, 5, 6 and points 2, 3 are merged into one cluster.
As a final step, the remaining clusters are merged together to form a single cluster.
The above proximity matrix covers n points x1, ..., xn, where d(xi, xj) represents the distance between points xi and xj. In order to group the data points into clusters, a linkage function is used: it takes the values in the proximity matrix and groups the data points based on similarity. The newly formed clusters are linked to each other until they form a single cluster containing all the data points.
Single linkage: The minimum of all pairwise distance between elements in each pair of clusters is used to
measure the distance between two clusters.
Average linkage: The average of all pairwise distances between elements in each pair of clusters is used to
measure the distance between two clusters.
Centroid linkage: Before merging, the distance between the two clusters’ centroids are considered.
Ward’s Method: It uses squared error to compute the similarity of the two clusters for merging.
Dendrogram Charts
From the above chart we can visualize the hierarchical technique. So how do we find the optimal number of clusters from the chart?
To find it, draw a horizontal line where there is no overlap in the vertical lines of the bars. The number of bars below the line that do not overlap is the optimal number of clusters. Refer to the figure below for a clear illustration.
From the above figure, we have three bars below the horizontal line, so the optimal number of clusters is three. Also,
if you recall, the Iris dataset has three classes and we got the same number from the above chart.
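A brief sketch of producing a dendrogram for the Iris data with SciPy and cutting it with a horizontal line, as described above; the cut height shown is illustrative, not a computed optimum:
```python
# Build the agglomerative merge history (linkage matrix) with Ward's method
# and draw the dendrogram; the dashed line marks an example horizontal cut.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.datasets import load_iris

X = load_iris().data
Z = linkage(X, method="ward")     # pairwise merges and their distances

dendrogram(Z)
plt.axhline(y=7, linestyle="--")  # illustrative cut height (assumption)
plt.show()
```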
2. Divisive Clustering
Divisive clustering works in just the opposite way to agglomerative clustering. It starts by considering all the data points as one big cluster and then repeatedly splits them into smaller clusters until every data point is in its own cluster. It is therefore good at identifying large clusters. It follows a top-down approach and can be more efficient than agglomerative clustering when a complete hierarchy is not needed, but because it is more complex to implement, it has no predefined implementation in the major machine learning frameworks.
After calculating the proximity matrix, the points are split into separate clusters based on their dissimilarity.
The proximity matrix is recomputed at each step until every point is assigned to an individual cluster.
The proximity matrix and linkage function follow the same procedure as in agglomerative clustering. Because divisive clustering is rarely used in practice, there is no predefined class/function for it in any major Python library.
Hierarchical clustering can be used for several applications, ranging from customer segmentation to object recognition.
Market Segmentation
Companies can better understand their markets by identifying target groups based on certain traits like demographics,
personal interests or behaviors. Organizing consumers according to these characteristics enables organizations to see
what consumers care about and whose backgrounds or interests may align with their products and services. Businesses
can then tailor products, marketing ads and other materials according to the preferences of target audiences.
Geo-Spatial Analysis
Besides grouping individual consumers or customers based on specific traits, hierarchical clustering can also group
individuals based on their geographic location. Organizations can then view where their customer bases are located,
predict product demand in certain areas and adjust their marketing and business strategies accordingly.
Image Segmentation
When dealing with images, hierarchical clustering can distinguish between separate visual elements. For example, the
technique can discern between different facial features, aiding facial recognition technology. It can also discriminate
between cars and other objects like buildings and animals, powering image recognition technology.
Anomaly Detection
Hierarchical clustering is also effective at detecting anomalies. By clustering data points into groups, hierarchical
clustering can isolate outliers that don’t belong to any cluster. Researchers can apply this method to root out errors at
different stages of the data collection process, preventing anomalies from impacting the accuracy of data sets.
Hierarchical clustering isn’t a fix-all; it does have some limits. Among them:
It has high time and space computational complexity. Computing the proximity matrix takes O(N^2) time, and since the merge step is repeated roughly N times, the total time complexity is O(N^3).
Clustering helps with the analysis of an unlabelled dataset by grouping the data points based on their similarity. In terms of business needs, clustering helps quickly segment customers and supports insightful decision-making.
https://youtu.be/YH0r47m0kFM?si=xrEG3GWLwF6fY59p ,
https://youtu.be/d1qAwe8hthM?si=dyxfD2_hLWOG6V7m