Clustering

Machine learning: Supervised vs Unsupervised
Supervised learning - discover patterns in the data that relate data
attributes with a target (class) attribute.
• There must be a training data set in which the solution is already
known.
Unsupervised learning - the outcomes are unknown, or the data have no
target attribute.
• Cluster the data to reveal meaningful partitions and hierarchies.
• We have to explore the data to find some intrinsic structures in
them.
INTRODUCTION
What is clustering?
• Clustering is the classification of objects into
different groups, or more precisely, the
partitioning of a data set into subsets (clusters),
so that the data in each subset (ideally) share
some common trait - often according to some
defined distance measure
Examples
• Let us see some real-life examples
• Example 1: group people of similar sizes
together to make “small”, “medium” and
“large” T-Shirts.
– Tailor-made for each person: too expensive
– One-size-fits-all: does not fit all.
• Example 2: In marketing, segment customers
according to their similarities
– To do targeted marketing.
Current Applications
• Document Classification: Cluster documents in
multiple categories based on tags, topics, and the
content of the document. This is a very standard
classification problem and k-means is a highly
suitable algorithm for this purpose.
• Identifying Crime Localities: With data related
to crimes available in specific localities in a city,
the category of crime, the area of the crime, and
the association between the two can give quality
insight into crime-prone areas within a city
Current Applications
• Customer Segmentation: Clustering helps
marketers improve their customer base, work on
target areas, and segment customers based
on purchase history, interests, or activity
monitoring.
• Insurance Fraud Detection: Utilizing historical
data on fraudulent claims, it is possible to isolate
new claims based on their proximity to clusters
that indicate fraudulent patterns.
Current Applications
• Rideshare Data Analysis: The publicly
available Uber ride information dataset
provides a large amount of valuable data
around traffic, transit time, peak pickup
localities, and more. Analyzing this data is
useful not just in the context of Uber but also
in providing insight into urban traffic patterns
and helping us plan for the cities of the future.
Current Applications
• Social network analysis - Facebook
"smartlists"
• Organizing computer clusters and data
centers for network layout and location
• Astronomical data analysis - Understanding
galaxy formation
Illustration
• The data set has three natural groups of data
points, i.e., 3 natural clusters.
Aspects of clustering
• A distance (similarity, or dissimilarity)
function
• Clustering quality
– Inter-cluster distance: maximized
– Intra-cluster distance: minimized
• The quality of a clustering result depends on
the algorithm, the distance function, and the
application.
Types of clustering
• Hierarchical algorithms: these find
successive clusters
1. Agglomerative ("bottom-up"): Agglomerative
algorithms begin with each element as a
separate cluster and merge them into
successively larger clusters.
2. Divisive ("top-down"): Divisive algorithms begin
with the whole set and proceed to divide it into
successively smaller clusters.
Types of clustering
• Partitional clustering: Partitional algorithms
determine all clusters at once. They include
k-means and its derivatives.
• The k-means algorithm is an algorithm to cluster n
objects based on attributes into k partitions, where k
< n.
• It assumes that the object attributes form a vector
space.
Other Approaches
• Density-based
• Mixture model
• Spectral methods
K-means clustering
• k-means clustering is an algorithm to classify or to
group the objects, based on attributes/features, into K
groups.
• K is a positive integer.
• The grouping is done by minimizing the sum
of squares of distances between data and the
corresponding cluster centroid.
How does it work?
Algorithm
• Step 1: Begin with a decision on the value of k =
number of clusters.
• Step 2: Put any initial partition that classifies the
data into k clusters. You may assign the
training samples randomly, or systematically
as follows:
1. Take the first k training samples as single-
element clusters.
2. Assign each of the remaining (N-k) training
samples to the cluster with the nearest
centroid. After each assignment, recompute
the centroid of the gaining cluster.
Algorithm
• Step 3: Take each sample in sequence and
compute its distance from the centroid of each of
the clusters. If a sample is not currently in the
cluster with the closest centroid, switch this
sample to that cluster and update the centroid of
the cluster gaining the new sample and the cluster
losing the sample.
• Step 4: Repeat step 3 until convergence is
achieved, that is, until a pass through the
training samples causes no new assignments.
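A minimal NumPy sketch may help make the iteration concrete. This is a batch (Lloyd-style) version rather than the one-sample-at-a-time update of Step 3, and the function name, random initialization and iteration cap are illustrative choices, not part of the original slides.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: X is an (n, d) array of points, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: choose k and take k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign every point to the cluster with the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its members
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when a full pass changes nothing
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```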
Simple example
• Take some random values.
• Choose the number of clusters and the centroids randomly.
• 3, 8, 24, 91, 53, 75, 31, 9, 6, 44, 62, 15
• Two clusters? Consider the mid points as 24 and 62.
• With 24: 3, 8, 31, 9, 6, 15
• With 62: 44, 53, 91, 75
• What about three clusters with 15, 44 and 75 as mid
points?
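As a quick check, the kmeans() sketch above can be run on this 1-D list (reshaped to a column vector). It typically recovers the same split as the hand partition around 24 and 62, although the exact centroid values depend on the random initialization.

```python
import numpy as np

data = np.array([3, 8, 24, 91, 53, 75, 31, 9, 6, 44, 62, 15], dtype=float)
labels, centroids = kmeans(data.reshape(-1, 1), k=2)   # kmeans() from the sketch above
for j, c in enumerate(centroids):
    print(f"cluster around {c[0]:.1f}:", sorted(data[labels == j]))
```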
A Simple example showing the implementation
of k-means algorithm
(using K=2)
Step 1:
Initialization: Randomly we choose following two centroids
(k=2) for two clusters.
In this case the 2 centroids are: m1 = (1.0, 1.0) and
m2 = (5.0, 7.0).
Step 2:
• Thus, we obtain two clusters containing:
{1,2,3} and {4,5,6,7}.
• Their new centroids are recomputed from the members of each cluster.
Step 3:
• Now using these centroids we compute the Euclidean
distance of each object, as shown in the table.
• Therefore, the new clusters are: {1,2} and {3,4,5,6,7}
• Next centroids are:
m1 = (1.25, 1.5) and
m2 = (3.9, 5.1)
• Step 4:
The clusters obtained are:
{1,2} and {3,4,5,6,7}
• Therefore, there is
no change in the clusters.
• Thus, the algorithm
comes to a halt here and the
final result consists of
2 clusters: {1,2} and
{3,4,5,6,7}.
Plot: the resulting clusters; the slides also illustrate a second run of the
algorithm with K = 3 (Steps 1 and 2).
Limitations
• K-means is extremely
sensitive to cluster
center initializations
• Bad initialization can lead to poor convergence speed
• Bad initialization can lead to bad overall clustering
Choosing the ‘k’ value
• Elbow method
• Within-Group Sum of Squares (WGSS)
• The value of k at which WGSS stops improving appreciably
(the point of convergence, or “elbow”) will be chosen;
see the sketch below.
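One way to make the elbow method concrete is to compute WGSS for several values of k and look for the bend in the resulting curve. The sketch below reuses the kmeans() function defined earlier; the helper name wgss and the use of the 1-D example data are illustrative.

```python
import numpy as np

def wgss(X, labels, centroids):
    """Within-Group Sum of Squares for one clustering."""
    return sum(((X[labels == j] - c) ** 2).sum() for j, c in enumerate(centroids))

data_1d = np.array([3, 8, 24, 91, 53, 75, 31, 9, 6, 44, 62, 15],
                   dtype=float).reshape(-1, 1)
for k in range(1, 6):
    labels, centroids = kmeans(data_1d, k)   # kmeans() from the earlier sketch
    print(k, round(wgss(data_1d, labels, centroids), 1))
# The k after which WGSS stops dropping sharply (the "elbow") is chosen.
```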
Practise
• We have 4 medicines as our training data points
object and each medicine has 2 attributes. Each
attribute represents coordinate of the object. We have
to determine which medicines belong to cluster 1 and
which medicines belong to the other cluster.
Object        Attribute 1 (X): weight index    Attribute 2 (Y): pH
Medicine A    1                                1
Medicine B    2                                1
Medicine C    4                                3
Medicine D    5                                4
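One possible way to check the exercise is to reuse the kmeans() sketch from earlier on these four points; the small pairwise distances suggest A and B end up in one cluster and C and D in the other. The variable names and printed comments are illustrative.

```python
import numpy as np

medicines = np.array([[1, 1],    # Medicine A: (weight index, pH)
                      [2, 1],    # Medicine B
                      [4, 3],    # Medicine C
                      [5, 4]],   # Medicine D
                     dtype=float)
labels, centroids = kmeans(medicines, k=2)   # kmeans() from the earlier sketch
print(labels)      # expected grouping: A and B together, C and D together
print(centroids)   # e.g. (1.5, 1.0) and (4.5, 3.5)
```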
Hierarchical Clustering
• Use distance matrix as clustering criteria. This method does
not require the number of clusters k as an input, but needs a
termination condition
Figure: agglomerative clustering (AGNES) runs from Step 0 to Step 4, merging
a, b, c, d, e into ab, de, cde and finally abcde; divisive clustering (DIANA)
runs the same steps in reverse, from Step 4 back to Step 0.
AGNES (Agglomerative Nesting)
• Introduced in Kaufmann and Rousseeuw (1990)
• Implemented in statistical packages, e.g., Splus
• Use the single-link method and the dissimilarity matrix
• Merge nodes that have the least dissimilarity
• Go on in a non-descending fashion
• Eventually all nodes belong to the same cluster
Figure: three scatter plots (axes 0-10) showing nearby points being merged
into progressively larger clusters.
Dendrogram: Shows How Clusters are Merged

Decompose data objects into several levels of nested partitioning (a tree
of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at
the desired level; each connected component then forms a cluster.
DIANA (Divisive Analysis)

• Introduced in Kaufmann and Rousseeuw (1990)


• Implemented in statistical analysis packages, e.g., Splus
• Inverse order of AGNES
• Eventually each node forms a cluster on its own

Figure: scatter plots (axes 0-10) showing the whole data set being split
into progressively smaller clusters.
Example of converting data points into a distance matrix
• Clustering analysis with the agglomerative algorithm: starting from the
data matrix of point coordinates, pairwise Euclidean distances are computed
to form the distance matrix, as in the following example.
Example
Point   X      Y
A       0.40   0.53
B       0.22   0.38
C       0.35   0.32
D       0.26   0.19
E       0.08   0.41
F       0.45   0.30
Example
     A      B      C      D      E      F
A    0
B    0.23   0
C    0.22   0.15   0
D    0.37   0.20   0.15   0
E    0.34   0.14   0.28   0.29   0
F    0.23   0.25   0.11   0.22   0.39   0
Example
       A      B      C,F    D      E
A      0
B      0.23   0
C,F    0.22   0.15   0
D      0.37   0.20   0.15   0
E      0.34   0.14   0.28   0.29   0
Example
       A      B,E    C,F    D
A      0
B,E    0.23   0
C,F    0.22   0.15   0
D      0.37   0.20   0.15   0
Example
               A      (B,E),(C,F)   D
A              0
(B,E),(C,F)    0.22   0
D              0.37   0.15          0
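For reference, the same single-link (MIN) merging can be reproduced with SciPy. The coordinates below are the A-F points from the example table; linkage, pdist and dendrogram are standard SciPy calls.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

points = np.array([[0.40, 0.53],   # A
                   [0.22, 0.38],   # B
                   [0.35, 0.32],   # C
                   [0.26, 0.19],   # D
                   [0.08, 0.41],   # E
                   [0.45, 0.30]])  # F

# Condensed pairwise Euclidean distances, then single-link (MIN) merging
Z = linkage(pdist(points), method="single")
print(Z)            # each row: the two clusters merged and the merge distance
# dendrogram(Z)     # with matplotlib available, draws the merge tree
```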
Practise
     1    2    3    4    5
1    0
2    9    0
3    3    7    0
4    6    5    9    0
5    11   10   2    8    0
MIN or Single Link
Inter-cluster distance
• The distance between two clusters is
represented by the distance of the closest pair
of data objects belonging to different clusters.
• Determined by one pair of points, i.e., by one
link in the proximity graph
MAX or Complete Link
Inter-cluster distance
• The distance between two clusters is
represented by the distance of the farthest pair
of data objects belonging to different clusters
Distance between Clusters
• Single link: smallest distance between an element in one cluster and an
element in the other, i.e., dist(Ki, Kj) = min{ dist(t_ip, t_jq) }
• Complete link: largest distance between an element in one cluster and an
element in the other, i.e., dist(Ki, Kj) = max{ dist(t_ip, t_jq) }
• Average: average distance between an element in one cluster and an element
in the other, i.e., dist(Ki, Kj) = avg{ dist(t_ip, t_jq) }
• Centroid: distance between the centroids of two clusters, i.e.,
dist(Ki, Kj) = dist(Ci, Cj)
• Medoid: distance between the medoids of two clusters, i.e.,
dist(Ki, Kj) = dist(Mi, Mj)
– Medoid: a chosen, centrally located object in the cluster
Centroid, Radius and Diameter of a Cluster
(for numerical data sets)
• Centroid: the “middle” of a cluster
  $C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$
• Radius: square root of the average distance from any point of the cluster
to its centroid
  $R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$
• Diameter: square root of the average mean squared distance between all
pairs of points in the cluster
  $D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$
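A small NumPy sketch of these three quantities for a single cluster; the three points used here are purely illustrative.

```python
import numpy as np

cluster = np.array([[1.0, 1.0], [2.0, 1.0], [1.5, 2.0]])   # illustrative points
N = len(cluster)

centroid = cluster.mean(axis=0)
radius = np.sqrt(((cluster - centroid) ** 2).sum() / N)
# All pairwise squared distances (the i == j terms contribute 0)
pair_sq = ((cluster[:, None, :] - cluster[None, :, :]) ** 2).sum(axis=2)
diameter = np.sqrt(pair_sq.sum() / (N * (N - 1)))
print(centroid, radius, diameter)
```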
Parametric vs Non Parametric
Estimation
Learning a Function

• Machine learning can be summarized as


learning a function (f) that maps input
variables (X) to output variables (Y).
Y = f(X)
• An algorithm learns this target mapping
function from training data
• Different algorithms make different
assumptions or biases about the form of the
function and how it can be learned.
Parametric Machine Learning
Algorithms
• Assumptions can greatly simplify the learning
process, but can also limit what can be learned.
Algorithms that simplify the function to a known
form are called parametric machine learning
algorithms.
• A learning model that summarizes data with a set
of parameters of fixed size (independent of the
number of training examples) is called a
parametric model.
• No matter how much data you throw at a
parametric model, it won’t change its mind about
how many parameters it needs.
The algorithms involve two steps:
1. Select a form for the function.
2. Learn the coefficients for the function from
the training data.
• An easy to understand functional form for the
mapping function is a line, as is used in linear
regression: b0 + b1*x1 + b2*x2 = 0
• Where b0, b1 and b2 are the coefficients of the
line that control the intercept and slope, and x1
and x2 are two input variables.
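As a sketch of these two steps, the following fixes a straight-line form and learns its coefficients by least squares. The data values are made up for illustration and the design-matrix approach is just one common way to do the fit.

```python
import numpy as np

# Step 1: fix the form y = b0 + b1*x.  Step 2: learn b0, b1 from data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # illustrative inputs
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])     # illustrative outputs
A = np.column_stack([np.ones_like(x), x])   # design matrix [1, x]
(b0, b1), *_ = np.linalg.lstsq(A, y, rcond=None)
print(b0, b1)   # the fixed-size set of parameters that summarizes the data
```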
Parametric Estimation
• Assuming the functional form of a line greatly
simplifies the learning process. Now, all we need to
do is estimate the coefficients of the line equation and
we have a predictive model for the problem.
• Some more examples of parametric machine learning
algorithms include
– Logistic Regression
– Linear Discriminant Analysis
– Perceptron
– Naive Bayes
– Simple Neural Networks
Benefits of Parametric Machine
Learning Algorithms:
• Simpler: These methods are easier to
understand, and their results are easy to interpret.
• Speed: Parametric models are very fast to
learn from data.
• Less Data: They do not require as much
training data and can work well even if the fit
to the data is not perfect.
Limitations of Parametric Machine
Learning Algorithms:
• Constrained: By choosing a functional form
these methods are highly constrained to the
specified form.
• Limited Complexity: The methods are more
suited to simpler problems.
• Poor Fit: In practice the methods are unlikely
to match the underlying mapping function.
Nonparametric Machine Learning
Algorithms
• Algorithms that do not make strong assumptions
about the form of the mapping function are called
nonparametric machine learning algorithms. By
not making assumptions, they are free to learn any
functional form from the training data.
• Nonparametric methods are good when you have
a lot of data and no prior knowledge, and when
you don’t want to worry too much about choosing
just the right features.
Nonparametric Estimation
• Nonparametric methods seek to best fit the training data
in constructing the mapping function, whilst
maintaining some ability to generalize to unseen data.
As such, they are able to fit a large number of
functional forms.
• An easy to understand nonparametric model is the k-
nearest neighbors algorithm that makes predictions
based on the k most similar training patterns for a new
data instance. The method does not assume anything
about the form of the mapping function other than that
patterns which are close are likely to have a similar
output variable.
Nonparametric Estimation
• Some more examples of popular non
parametric machine learning algorithms are:
• k-Nearest Neighbours
• Decision Trees like CART and C4.5
• Support Vector Machines
Benefits of Nonparametric Machine
Learning Algorithms:
• Flexibility: Capable of fitting a large number
of functional forms.
• Power: No assumptions (or weak
assumptions) about the underlying function.
• Performance: Can result in higher
performance models for prediction.
Limitations of Nonparametric Machine
Learning Algorithms:
• More data: Require a lot more training data to
estimate the mapping function.
• Slower: A lot slower to train as they often have
far more parameters to train.
• Overfitting: More of a risk to overfit the
training data and it is harder to explain why
specific predictions are made.
K Nearest Neighbour Classification
• It classifies new points based on a similarity (distance) measure.
• It uses data points that are already separated into several classes to
predict the classification of a new sample point.
K Nearest Neighbor Classification
• Step 1: Initialize ‘k’.
• Step 2: For each sample in the training data,
– Calculate the distance between the query point and the current point.
– Add the distance and the index of the example to an ordered collection.
• Step 3: Sort the ordered collection of distances and indexes from small
to large.
• Step 4: Pick the first ‘k’ entries from the list.
• Step 5: Get the labels of the selected ‘k’ entries and predict the most
common (majority) label among them (a code sketch follows the example
data below).
K Nearest Neighbor Classification
Height (cm)   Weight (kg)   T-Shirt Size
158           58            M
158           59            M
158           63            M
160           59            M
160           60            M
163           60            M
163           61            M
160           64            L
163           64            L
165           61            L
165           62            L
165           65            L
168           62            L
168           63            L
168           66            L
170           63            L
170           64            L
170           68            L
For K = 5, predict the T-shirt size for an input height of 161 cm and a
weight of 61 kg.
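A hedged sketch of the k-NN steps applied to this table: the arrays simply transcribe the 18 rows above, and Counter-based majority voting is one common way to pick the label.

```python
import numpy as np
from collections import Counter

# Training data transcribed from the table above: (height, weight) -> size
X = np.array([[158, 58], [158, 59], [158, 63], [160, 59], [160, 60],
              [163, 60], [163, 61], [160, 64], [163, 64], [165, 61],
              [165, 62], [165, 65], [168, 62], [168, 63], [168, 66],
              [170, 63], [170, 64], [170, 68]], dtype=float)
y = np.array(["M"] * 7 + ["L"] * 11)

query = np.array([161, 61], dtype=float)
dists = np.linalg.norm(X - query, axis=1)        # Euclidean distance to each sample
nearest = np.argsort(dists)[:5]                  # indices of the 5 closest samples
print(Counter(y[nearest]).most_common(1)[0][0])  # majority label among them ('M' here)
```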
K Nearest Neighbor Classification-
Visualization
KNN vs. K-means
• K-means is an unsupervised learning technique (no
dependent variable), whereas KNN is a supervised
learning algorithm (a dependent variable exists).
• K-means is a clustering technique which tries to
split data points into K clusters such that the
points in each cluster tend to be near each other,
whereas K-nearest neighbors tries to determine the
classification of a point by combining the
classifications of the K nearest points.
K Nearest Neighbor Classification -
Practise
Perform the KNN classification algorithm on the
following dataset and predict the class for P1 = 3
and P2 = 7. Consider k = 3.
P1   P2   Class
7    7    False
7    5    False
5    6    False
3    4    True
2    3    True
4    3    True
Voronoi Diagram
Nonparametric Regression:
Smoothing Models
Regression
• In regression, given the training set X = {x^t, r^t}
where r^t ∈ R, we assume
r^t = g(x^t) + ε,
where ε is random noise.
• In parametric regression, we assume a
polynomial of a certain order and compute its
coefficients that minimize the sum of squared
error on the training set.
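For instance, a parametric fit with an assumed second-order polynomial might look like the sketch below; the data are synthetic and np.polyfit computes the least-squares coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
r = 2.0 * x**2 - x + 0.5 + 0.05 * rng.normal(size=x.size)   # synthetic data

coeffs = np.polyfit(x, r, deg=2)   # coefficients minimizing the sum of squared error
g_hat = np.poly1d(coeffs)          # the fitted parametric g(x)
print(coeffs, g_hat(0.5))
```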
Nonparametric regression
• Nonparametric regression is used when no
such polynomial can be assumed;
• we only assume that close x have close g(x)
values.
• As in nonparametric density estimation, given
x, our approach is to find the neighborhood of
x and average the r values in the neighborhood
to calculate ĝ(x).
Nonparametric regression
• The nonparametric regression estimator is also
called a smoother and the estimate is called a
smooth
Regressogram
• a commonly used simple non-parametric
method
Regressogram
• This is an analysis of astronomy data. On the
X-axis is the galaxy distance to some
cosmological structure and on the Y-axis is the
correlation for some features of this galaxy. We
bin the data according to galaxy distance,
take the mean within each bin as a
landmark (or summary), and show how this
landmark changes along galaxy distance.
Regressogram
• Note that the range of Y in the raw scatter plot is
(0, 1), while in the regressogram the range is
(0.7, 0.8). If you want to visualize the data, the raw
scatter plot will not be helpful. The regressogram,
however, is a simple approach to visualize hidden
structure within this complicated data.
• Here are the steps for constructing a regressogram.
First we bin the data according to the X-axis
(shown by red lines):
• Then we compute the mean within each bin
(shown by the blue points):
• We can show only the blue points (and the blue
curves, which just connect the points) so
that the result looks much more concise:
• However, since the range of the Y-axis is too large, this
does not show the trend. So we zoom in and compute
the error for estimating the mean within each bin.
• The advantage of the regressogram is its simplicity.
Since we are summarizing the whole data by
points representing the mean within each bin, the
interpretation is very straightforward.
• Also, it shows the trend (and error bars) for the
data, so that we have a rough idea of what is going on.
Moreover, no matter how complicated the original
plot is, the regressogram uses only a few
statistics (the mean within each bin) to summarize
the whole data. Notice that we do not make any
assumption about the distribution of the data (such as
being normally distributed); thus, the regressogram is a
non-parametric method.
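A minimal sketch of the binning-and-averaging idea on synthetic data; the bin count, data and print formatting are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)                      # synthetic inputs
r = np.sin(x) + 0.3 * rng.normal(size=x.size)    # synthetic responses

n_bins = 10
edges = np.linspace(x.min(), x.max(), n_bins + 1)
bin_of = np.digitize(x, edges[1:-1])             # bin index (0 .. n_bins-1) per sample
for b in range(n_bins):
    center = 0.5 * (edges[b] + edges[b + 1])
    print(f"bin center {center:4.1f}: mean r = {r[bin_of == b].mean():+.2f}")
```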
Kernel smoother
• Kernel density estimation (KDE):
  $\hat{f}(x) = \frac{1}{Nh} \sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)$
• K – kernel function (non-negative)
• h – smoothing parameter (bandwidth)
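A small sketch of the KDE formula above with a Gaussian kernel; the sample values and bandwidth are illustrative.

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(x, sample, h):
    """Density estimate at x: average of kernels centred on the sample points."""
    return gaussian_kernel((x - sample) / h).sum() / (len(sample) * h)

sample = np.array([1.0, 1.2, 1.9, 2.1, 5.0, 5.2])   # illustrative observations
for x in (1.5, 3.5, 5.1):
    print(x, round(kde(x, sample, h=0.5), 3))
```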
