Clustering Algorithm Explained

What is Clustering?

A cluster refers to a collection of data points aggregated together because of certain similarities.

Grouping unlabelled data is called clustering.

For example, say you want to organize your music. One approach might be to look for meaningful groups or collections. You might organize music by genre, while your friend might organize music by decade. How you choose to group items helps you to understand more about them as individual pieces of music.

You might find that you have an affinity for rock and further break down the genre into different approaches or music from different locations.

On the other hand, your friend might look at music from the 1980s and be able to understand how the music across genres at that time was influenced by the socio-political climate.

In both cases, you and your friend have learned something interesting about music,
even though you took different approaches.

In machine learning too, we often group examples as a first step to understand a subject (data set) in a machine learning system.

Grouping unlabelled examples is called clustering.

As the examples are unlabelled, clustering relies on unsupervised machine learning. If the examples are labelled, the problem becomes classification.

Unlabelled examples grouped into three clusters based on feature similarity.


Similarity Measure:
A numerical value that quantifies the similarity between two data points is called a similarity measure.

For instance, consider a shoe data set with only one feature: shoe size. You can
quantify how similar two shoes are by calculating the difference between their sizes.
The smaller the numerical difference between sizes, the greater the similarity
between shoes. This is called a manual similarity measure.

Suppose the model has two features: shoe size and shoe price. Since both features are numeric, you can combine them into a single number representing similarity as follows.

Size (s): Shoe size probably forms a Gaussian distribution. Confirm this. Then
normalize the data.

Price (p): The data is probably a Poisson distribution. Confirm this. If you have
enough data, convert the data to quantiles and scale to [0,1].

Combine the two scaled differences using the root mean squared error (RMSE):

RMSE = √((s² + p²) / 2)

where s and p are the scaled differences in size and price. The similarity is then 1 − RMSE.

Let's calculate the similarity for two shoes with US sizes 8 and 11, and prices 120 and 150 (a code sketch of this calculation follows the steps below). Since we don't have enough data to understand the distribution, we'll simply scale the data without normalizing or using quantiles.

1. Scale the size: Assume a maximum possible shoe size of 20. Divide 8 and 11 by
the maximum size 20 to get 0.4 and 0.55

2. Scale the price: Divide 120 and 150 by the maximum price 150 to get 0.8 and 1

3. Find the difference in size: 0.55 − 0.4 = 0.15

4. Find the difference in price: 1 − 0.8 = 0.2

5. Find the RMSE: √((0.2² + 0.15²) / 2) ≈ 0.17

6. Similarity = 1 − 0.17 = 0.83
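Below is a minimal Python sketch of this manual similarity calculation. The maximum size (20) and maximum price (150) are the same illustrative assumptions used in the steps above.

import math

def shoe_similarity(size_a, size_b, price_a, price_b,
                    max_size=20.0, max_price=150.0):
    # Scale each feature to [0, 1] using an assumed maximum value.
    s = abs(size_a - size_b) / max_size      # scaled size difference
    p = abs(price_a - price_b) / max_price   # scaled price difference
    # Combine the scaled differences with RMSE, then convert to similarity.
    rmse = math.sqrt((s ** 2 + p ** 2) / 2)
    return 1 - rmse

print(shoe_similarity(8, 11, 120, 150))  # ~0.83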

What if you wanted to find similarities between shoes by using both size and color?
Color is categorical data, and is harder to combine with the numerical size data.

In such cases the similarity cannot easily be calculated manually. That's when you switch to a supervised similarity measure, where a deep neural network calculates the similarity.
Loss functions for supervised similarity measure calculation (a minimal sketch follows):

Mean squared error (MSE) for numerical outputs.

Log loss / softmax cross-entropy loss for categorical outputs.
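The following is a minimal NumPy sketch of these two losses, not tied to any particular deep learning framework; the example values are illustrative only.

import numpy as np

def mse(y_true, y_pred):
    # Mean squared error for numerical outputs.
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def softmax_cross_entropy(logits, label_index):
    # Softmax cross-entropy for a single categorical example.
    logits = np.asarray(logits, dtype=float)
    logits = logits - logits.max()              # for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[label_index])

print(mse([1.0, 2.0], [1.1, 1.8]))              # ~0.025
print(softmax_cross_entropy([2.0, 0.5, 0.1], 0))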

What are the Uses of Clustering?


Clustering has a myriad of uses in a variety of industries. Some common
applications for clustering include the following:

 market segmentation
 social network analysis
 search result grouping
 medical imaging
 image segmentation
 anomaly detection
 generalization
 data compression
 privacy preservation.

After clustering, each cluster is assigned a number called a cluster ID.

Now, you can condense the entire feature set for an example into its cluster ID.

Representing a complex example by a simple cluster ID makes clustering powerful.

Clustering can simplify large datasets and make them easier to manage.

For example, you can group items by different features as follows:

Group documents by topic.

Group stars by brightness.

Group books by category


Machine learning systems can then use cluster IDs to simplify the processing of
large datasets. Thus, clustering’s output serves as feature data for downstream ML
systems.
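As an illustration, here is a minimal sketch (assuming scikit-learn is available) of feeding cluster IDs to a downstream system as an extra feature column; the toy data and parameter values are illustrative only.

import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(100, 4)    # toy feature matrix

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_ids = kmeans.labels_                 # one cluster ID per example

# Append the cluster ID as an extra feature column for a downstream model.
X_augmented = np.column_stack([X, cluster_ids])
print(X_augmented.shape)                     # (100, 5)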

Clustering:
Grouping related examples, particularly during unsupervised learning. Once all the
examples are grouped, a human can optionally supply meaning to each cluster.

Many clustering algorithms exist. For example, the k-means algorithm clusters examples based on their proximity to a centroid. A human researcher could then review the clusters and, for example, label cluster 1 as "dwarf trees" and cluster 2 as "full-size trees."

As another example, consider a clustering algorithm based on an example's distance from a center point.

Types of Clustering:
Each approach is best suited to a particular data distribution. Below is a short
discussion of four common approaches, focusing on centroid-based clustering using
k-means.
Centroid-based Clustering:
Centroid-based clustering organizes the data into non-hierarchical clusters, in
contrast to hierarchical clustering defined below.

K-means is the most widely used centroid-based clustering algorithm; it is an efficient, effective, and simple clustering algorithm.

Centroid-based algorithms are efficient but sensitive to outliers.

Example of centroid-based clustering.

Density-based Clustering:
Density-based clustering connects areas of high example density into clusters. This
allows for arbitrary-shaped distributions as long as dense areas can be connected.
These algorithms have difficulty with data of varying densities and high dimensions.
Further, by design, these algorithms do not assign outliers to clusters.

Example of density-based clustering.

Distribution-based Clustering:
This clustering approach assumes data is composed of distributions, such
as Gaussian distributions.
In the example below, the distribution-based algorithm clusters the data into three Gaussian distributions. As the distance from a distribution's centre increases, the probability that a point belongs to that distribution decreases. The bands show that decrease in probability. When you do not know the type of distribution in your data, you should use a different algorithm.

Example of distribution-based clustering.

Hierarchical Clustering:
Hierarchical clustering creates a tree of clusters. It is well suited to hierarchical data. Another advantage is that any number of clusters can be chosen by cutting the tree at the right level.

Example of a hierarchical tree clustering animals.


Objective of Cluster Analysis:
Intra-cluster distance is the sum of distances between objects in the same cluster.

This distance should always be minimized.

Inter-cluster distance is the distance between objects in different clusters.

This distance should always be maximized.
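The following is a minimal NumPy sketch of these two quantities for two toy clusters; the points and the choice of Euclidean distance are illustrative assumptions.

import numpy as np
from itertools import combinations

def intra_cluster_distance(cluster):
    # Sum of pairwise distances between objects in the same cluster.
    return sum(np.linalg.norm(a - b) for a, b in combinations(cluster, 2))

def inter_cluster_distance(cluster_a, cluster_b):
    # Sum of distances between objects in different clusters.
    return sum(np.linalg.norm(a - b) for a in cluster_a for b in cluster_b)

c1 = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1]])
c2 = np.array([[3.0, 3.0], [3.2, 2.9]])
print(intra_cluster_distance(c1))       # small: points are close together
print(inter_cluster_distance(c1, c2))   # large: clusters are far apart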

Data preparation:
In clustering, calculate the similarity between two examples by combining all the
feature data for those examples into a numeric value. Combining feature data
requires that the data have the same scale.

Normalizing: min-max scaling or standardization (z-score).

Transforming: log transformation.

Quantile bucketing: Distributing a feature's values into buckets so that each bucket
contains the same (or almost the same) number of examples. For example, the
following figure divides 44 points into 4 buckets, each of which contains 11 points. In
order for each bucket in the figure to contain the same number of points, some
buckets span a different width of x-values.
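Here is a minimal sketch of these three preparation steps with NumPy and pandas; the lognormal toy data mirror the 44-points / 4-buckets figure described above and are otherwise illustrative assumptions.

import numpy as np
import pandas as pd

values = np.random.RandomState(0).lognormal(mean=3.0, sigma=1.0, size=44)

# Normalizing: min-max scaling to [0, 1].
min_max = (values - values.min()) / (values.max() - values.min())

# Transforming: log transform for a long-tailed distribution.
logged = np.log(values)

# Quantile bucketing: 4 buckets with (almost) the same number of examples.
buckets = pd.qcut(values, q=4, labels=False)
print(np.bincount(buckets))   # roughly 11 examples per bucket
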
K-means clustering in Machine Learning:
K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.

The objective of K-means is simple: group similar data points together and discover
underlying patterns.

To achieve this objective, K-means looks for a fixed number (k) of clusters in a
dataset.

The K-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters) where each data point belongs to only one group. It tries to make the intra-cluster data points as similar as possible while also keeping the clusters as different (far apart) as possible. It assigns data points to a cluster such that the sum of the squared distances between the data points and the cluster's centroid (the arithmetic mean of all the data points that belong to that cluster) is at a minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.

The K-means algorithm works as follows (a minimal code sketch follows the list):

 Specify the number of clusters K.
 Initialize centroids by first shuffling the dataset and then randomly selecting K data points for the centroids without replacement.
 Keep iterating until there is no change to the centroids, i.e. the assignment of data points to clusters isn't changing.
   o Compute the sum of the squared distance between data points and all centroids.
   o Assign each data point to the closest cluster (centroid).
   o Compute the centroids for the clusters by taking the average of all data points that belong to each cluster.
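The sketch below implements this loop with NumPy only; the toy data, the seed, and the maximum number of iterations are illustrative assumptions, and empty clusters are not handled.

import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking K data points without replacement.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Compute squared distances from every point to every centroid.
        distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        # Assign each point to the closest centroid.
        labels = distances.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break                       # centroids stopped changing
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0, 5, 10)])
labels, centroids = kmeans(X, k=3)
print(centroids)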

Determining the Optimal Number of Clusters:

Elbow Method:
The elbow method gives us an idea of what a good number of clusters k would be, based on the sum of squared distances (SSE) between data points and their assigned clusters' centroids. We pick k at the spot where the SSE curve starts to flatten out, forming an elbow.
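A minimal sketch of the elbow method using scikit-learn's KMeans is shown below; inertia_ is the SSE between points and their assigned centroids, and the toy data and range of k values are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0, 5, 10)])

sse = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)

for k, s in zip(range(1, 10), sse):
    print(k, round(s, 1))   # look for the k where the SSE starts to flatten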
Silhouette Analysis:
Silhouette analysis can be used to determine the degree of separation between
clusters.

It computes the average silhouette of observations for different values of k.


• It measures the quality of a clustering, i.e. it determines how well each object lies within its cluster.
• A high average silhouette coefficient indicates a good clustering.

• The optimal number of clusters k is the one that maximizes the average silhouette over a range of possible values for k.

For each sample:

 Compute the average distance to all other data points in the same cluster (ai).
 Compute the average distance to all data points in the closest neighboring cluster (bi).
 Compute the silhouette coefficient: s(i) = (bi − ai) / max(ai, bi).
The coefficient can take values in the interval [-1, 1].

 If it is 0 –> the sample is very close to the neighboring clusters.
 If it is 1 –> the sample is far away from the neighboring clusters.
 If it is -1 –> the sample is assigned to the wrong cluster.

Therefore, we want the coefficients to be as large as possible and close to 1 to have good clusters.
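Below is a minimal sketch of silhouette analysis using scikit-learn, picking the k with the largest average silhouette coefficient; the toy data and range of k values are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0, 5, 10)])

for k in range(2, 8):                       # silhouette requires k >= 2
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))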

Gap Statistic Method:


The gap statistic compares the total within intra-cluster variation for different values of k with its expected value under a null reference distribution of the data. The estimate of the optimal number of clusters is the value of k that maximizes the gap statistic (i.e., the value that yields the largest gap statistic). This means that the clustering structure is far away from the random uniform distribution of points.

Bootstrapping is used to generate B copies of the reference dataset and to compute the average log(Wk) over them. The gap statistic measures the deviation of the observed Wk value from its expected value under the null hypothesis. The estimate of the optimal number of clusters is the value of k that maximizes Gap(k); this means that the clustering structure is far away from the uniform distribution of points.

That is, for each variable (xi) in the data set we compute its range [min(xi), max(xi)] and generate values for the n points uniformly from this interval.

The algorithm works as follows:

Cluster the observed data, varying the number of clusters from k = 1, …, kmax, and
compute the corresponding total within intra-cluster variation Wk.

Generate B bootstrapped reference data sets with a random uniform distribution. Cluster each of these reference data sets with a varying number of clusters k = 1, …, kmax, and compute the corresponding total within intra-cluster variation Wkb.

For the observed data and the reference data, the total intra-cluster variation is
computed using different values of k. The gap statistic for a given k is defined as
follows. Compute the estimated gap statistic as the deviation of the
observed Wk value from its expected value Wkb under the null hypothesis:
Gap(k) = (1/B) ∑b=1..B log(W*kb) − log(Wk)

The estimate of the optimal clusters will be the value that maximizes the gap
statistic.

This means that the clustering structure is far away from the random uniform
distribution of points.
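Here is a rough sketch of the procedure above, assuming scikit-learn is available; Wk is approximated by the K-means inertia (the within-cluster sum of squares), and B, kmax, and the toy data are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

def within_dispersion(X, k):
    # Total within-cluster sum of squares for a K-means clustering of X.
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_

def gap_statistic(X, k_max=8, B=10, seed=0):
    rng = np.random.default_rng(seed)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    gaps = []
    for k in range(1, k_max + 1):
        log_wk = np.log(within_dispersion(X, k))
        # B reference data sets drawn uniformly from each feature's range.
        log_wkb = [np.log(within_dispersion(
            rng.uniform(mins, maxs, size=X.shape), k)) for _ in range(B)]
        gaps.append(np.mean(log_wkb) - log_wk)
    return gaps

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0, 5, 10)])
for k, gap in enumerate(gap_statistic(X), start=1):
    print(k, round(gap, 3))   # pick the k that maximizes the gap statistic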

Drawbacks:
The K-means algorithm is good at capturing the structure of the data if the clusters have a spherical-like shape. It always tries to construct a nice spherical shape around the centroid. That means that the minute the clusters have complicated geometric shapes, K-means does a poor job of clustering the data. We'll illustrate three cases where K-means will not perform well.

First, the K-means algorithm doesn't let data points that are far away from each other share the same cluster, even though they obviously belong to the same cluster. An example is data points lying on two different horizontal lines: K-means tries to group half of the data points of each horizontal line together.

Second, suppose we have 3 groups of data where each group was generated from a different multivariate normal distribution (different mean/standard deviation), and one group has far more data points than the other two combined. Next, run K-means on the data with K=3 and see whether it is able to cluster the data correctly.

Third, consider data that have complicated geometric shapes, such as moons and circles within each other, and test K-means on both of these datasets.

However, we can help K-means cluster these kinds of datasets if we use kernel methods. The idea is to transform the data to a higher-dimensional representation that makes it linearly separable (the same idea that we use in SVMs). Other algorithms, such as spectral clustering, also work very well in such scenarios.
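As a minimal illustration (assuming scikit-learn is available), the sketch below contrasts K-means with spectral clustering on the "two moons" dataset; the sample size, noise level, and affinity choice are illustrative assumptions.

from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, SpectralClustering

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
spectral_labels = SpectralClustering(
    n_clusters=2, affinity="nearest_neighbors", random_state=0).fit_predict(X)

# K-means splits each moon roughly in half; spectral clustering, which works
# on a nearest-neighbour similarity graph, recovers the two moons.
print(kmeans_labels[:10], spectral_labels[:10])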
Hierarchical clustering Technique:
Hierarchical clustering is one of the most popular and easy-to-understand clustering techniques. This clustering technique is divided into two types:

Agglomerative

Divisive

1. Agglomerative Hierarchical clustering Technique:

In this technique, initially each data point is considered an individual cluster. At each iteration, similar clusters merge with other clusters until one cluster or K clusters are formed.

2. Divisive Hierarchical clustering Technique:

Divisive Hierarchical clustering is exactly the opposite of Agglomerative Hierarchical clustering. In Divisive Hierarchical clustering, we consider all the data points as a single cluster, and in each iteration we separate from the cluster the data points which are not similar. Each data point which is separated is considered an individual cluster. In the end, we are left with n clusters.
Calculating the Similarity Between Two Clusters:
 MIN
 MAX
 Group Average
 Distance Between Centroids
 Ward’s Method

MIN:
Also known as the single-linkage algorithm: the similarity of two clusters C1 and C2 is the minimum of the similarity between points Pi and Pj such that Pi belongs to C1 and Pj belongs to C2.

Mathematically this can be written as,

Sim(C1,C2) = Min Sim(Pi,Pj) such that Pi ∈ C1 & Pj ∈ C2

In simple words, pick the two closest points such that one point lies in cluster one and the other lies in cluster two, and take their similarity as the similarity between the two clusters.

Pros of MIN:
This approach can separate non-elliptical shapes as long as the gap between the
two clusters is not small.
Original data vs Clustered data using MIN approach

Cons of MIN:
MIN approach cannot separate clusters properly if there is noise between clusters.

Original data vs Clustered data using MIN approach

MAX:
Also known as the complete-linkage algorithm, this is exactly the opposite of the MIN approach. The similarity of two clusters C1 and C2 is the maximum of the similarity between points Pi and Pj such that Pi belongs to C1 and Pj belongs to C2.

Mathematically this can be written as,

Sim(C1,C2) = Max Sim(Pi,Pj) such that Pi ∈ C1 & Pj ∈ C2

In simple words, pick the two farthest points such that one point lies in cluster one and the other lies in cluster two, and take their similarity as the similarity between the two clusters.
Pros of MAX:
MAX approach does well in separating clusters if there is noise between clusters.

Original data vs Clustered data using MAX approach

Cons of Max:
Max approach is biased towards globular clusters.

Max approach tends to break large clusters.

Original data vs Clustered data using MAX approach

Group Average:
Take all pairs of points with one point in each cluster, compute their similarities, and calculate the average of the similarities.

Mathematically this can be written as,

sim(C1,C2) = ∑ sim(Pi, Pj) / (|C1| * |C2|), where Pi ∈ C1 & Pj ∈ C2

The group average approach does well in separating clusters if there is noise between clusters.

Distance between centroids:


Compute the centroids of two clusters C1 & C2 and take the similarity between the
two centroids as the similarity between two clusters. This is a less popular technique
in the real world.

Ward’s Method:
This approach to calculating the similarity between two clusters is exactly the same as Group Average, except that Ward's method calculates the sum of the squares of the distances between Pi and Pj.

Mathematically this can be written as,

sim(C1,C2) = ∑ (dist(Pi, Pj))² / (|C1| * |C2|)

Pros of Ward's method:

Ward's method also does well in separating clusters if there is noise between clusters.

Cons of Ward's method:

Ward's method is also biased towards globular clusters.
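As a minimal illustration of these linkage criteria (assuming SciPy is available), the sketch below runs agglomerative clustering with single (MIN), complete (MAX), average, centroid, and Ward linkage on toy data; the data and the cut into two clusters are illustrative assumptions.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc=c, size=(30, 2)) for c in (0, 5)])

for method in ("single", "complete", "average", "centroid", "ward"):
    Z = linkage(X, method=method)                    # build the cluster tree
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes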

Space and Time Complexity of the Hierarchical clustering Technique

Space complexity: The space required for hierarchical clustering is very high when the number of data points is large, because we need to store the similarity matrix in RAM; for n points this is on the order of n² entries.

Time complexity: Since we have to perform n iterations, and in each iteration we need to update and search the similarity matrix, the time complexity is also very high (roughly cubic in n for a naive implementation).

Because of this high space and time complexity, hierarchical clustering cannot be used when we have huge amounts of data.
