ML4 Unsupervised Learning
K-Means Clustering
Hierarchical Clustering
Density-based Clustering
Supervised learning vs. unsupervised learning
Supervised learning: discover patterns in the data that relate data attributes to a target (class) attribute (labelled data).
These patterns are then used to predict the values of the target attribute in future data instances.
Unsupervised learning: the data have no target attribute (unlabelled data).
We want to explore the data to find intrinsic structures in them.
Introduction
Data clustering concerns how to group a set of objects based on the similarity of their attributes and/or their proximity in the vector space.
A good clustering method will produce high-quality clusters with
high intra-class similarity: cohesive within clusters
low inter-class similarity: distinctive between clusters
Clustering is often called an unsupervised learning task and is often considered synonymous with unsupervised learning. In fact, association rule mining is also unsupervised.
An illustration
The data set has three natural groups of data points,
i.e., 3 natural clusters.
Stages in clustering
Aspects of clustering
A clustering algorithm
Partitional clustering
Hierarchical clustering
…
A distance (similarity, or dissimilarity) function
Clustering quality
Inter-cluster distance maximized
Intra-cluster distance minimized
The quality of a clustering result depends on the
algorithm, the distance function, and the
application.
Similarity and Distance Measurement
Key to clustering: "similarity" and "dissimilarity" are also commonly used terms.
There are numerous distance functions for different types of data:
Numeric data
Nominal data
Distance functions for numeric attributes
Pearson Correlation
Trend Similarity (Pearson Correlation)
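The formula itself appears only as a figure in the original slides; for reference, the standard Pearson correlation between two r-dimensional vectors x and y is

$$ r_{xy} = \frac{\sum_{h=1}^{r} (x_h - \bar{x})(y_h - \bar{y})}{\sqrt{\sum_{h=1}^{r} (x_h - \bar{x})^2}\,\sqrt{\sum_{h=1}^{r} (y_h - \bar{y})^2}} $$

Values close to 1 mean the two vectors follow the same trend even if their magnitudes differ, which is why it is used as a trend-similarity measure.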
Similarity Measurements
Euclidean distance and Manhattan distance
Euclidean distance:
$$ \mathrm{dist}(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \dots + (x_{ir} - x_{jr})^2} $$
Manhattan distance:
$$ \mathrm{dist}(\mathbf{x}_i, \mathbf{x}_j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ir} - x_{jr}| $$
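A minimal sketch of these two distance functions in Python (NumPy is an assumption; the slides do not prescribe any library):

import numpy as np

def euclidean(xi, xj):
    # Square root of the sum of squared per-attribute differences.
    return np.sqrt(np.sum((np.asarray(xi) - np.asarray(xj)) ** 2))

def manhattan(xi, xj):
    # Sum of absolute per-attribute differences.
    return np.sum(np.abs(np.asarray(xi) - np.asarray(xj)))

xi, xj = [0.0, 3.0], [4.0, 0.0]
print(euclidean(xi, xj))  # 5.0
print(manhattan(xi, xj))  # 7.0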
Squared distance and Chebychev distance
Squared Euclidean distance: to place
progressively greater weight on data points
that are further apart.
$$ \mathrm{dist}(\mathbf{x}_i, \mathbf{x}_j) = (x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \dots + (x_{ir} - x_{jr})^2 $$
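The slide title also names the Chebychev distance, whose formula does not survive in the extracted text; its standard definition is the largest per-attribute difference:

$$ \mathrm{dist}(\mathbf{x}_i, \mathbf{x}_j) = \max_{h = 1, \dots, r} |x_{ih} - x_{jh}| $$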
Distance function for text documents
A text document consists of a sequence of
sentences and each sentence consists of a
sequence of words.
To simplify: a document is usually considered a
“bag” of words in document clustering.
Sequence and position of words are ignored.
A document is represented with a vector just like a
normal data point.
It is common to use similarity to compare two
documents rather than distance.
The most commonly used similarity function is the cosine
similarity. We will study this later.
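For reference (the slides defer the details), the cosine similarity between two document vectors d1 and d2 is

$$ \cos(\mathbf{d}_1, \mathbf{d}_2) = \frac{\mathbf{d}_1 \cdot \mathbf{d}_2}{\lVert \mathbf{d}_1 \rVert \, \lVert \mathbf{d}_2 \rVert} $$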
Data standardization
In the Euclidean space, standardization of attributes
is recommended so that all attributes can have
equal impact on the computation of distances.
Consider the following pair of data points: xi: (0.1, 20) and xj: (0.9, 720). Without standardization, the distance between them is dominated almost entirely by the second attribute, simply because its range is much larger than that of the first; rescaling the attributes fixes this, as sketched below.
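A minimal sketch of min-max standardization (rescaling each attribute to [0, 1]); NumPy is an assumption, and the choice of min-max over, say, z-score standardization is illustrative rather than something the slides prescribe:

import numpy as np

X = np.array([[0.1, 20.0],
              [0.9, 720.0]])

# Rescale each attribute (column) to the [0, 1] range.
mins, maxs = X.min(axis=0), X.max(axis=0)
X_scaled = (X - mins) / (maxs - mins)

print(X_scaled)
# [[0. 0.]
#  [1. 1.]]
# After scaling, both attributes contribute equally to the distance.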
K-means clustering
K-means is a partitional clustering algorithm
Let the set of data points (or instances) D be {x1, x2, …, xn}, where xi = (xi1, xi2, …, xir) is a vector in a real-valued space X ⊆ R^r, and r is the number of attributes (dimensions) in the data.
The k-means algorithm partitions the given data into k clusters.
Each cluster has a cluster center, called the centroid.
k is specified by the user.
K-means algorithm
Given k, the k-means algorithm works as follows (a minimal code sketch follows the steps):
1) Randomly choose k data points (seeds) to be the initial centroids (cluster centers).
2) Assign each data point to the closest centroid.
3) Re-compute the centroids using the current cluster memberships.
4) If a convergence criterion is not met, go to 2).
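A minimal sketch of these four steps, assuming NumPy and Euclidean distance (the slides do not fix an implementation; empty clusters are not handled here):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1) Randomly choose k data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2) Assign each data point to the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) Re-compute the centroids from the current memberships.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4) Stop when the centroids no longer move (convergence criterion).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

On the three-cluster data illustrated earlier, kmeans(X, k=3) would return a cluster label for each point together with the three centroids.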
An example
An example (cont …)
Weaknesses of k-means: problems with outliers
Weaknesses of k-means: dealing with outliers
One method is to remove some data points in the
clustering process that are much further away from
the centroids than other data points.
To be safe, we may want to monitor these possible outliers
over a few iterations and then decide to remove them.
Another method is to perform random sampling.
Since in sampling we only choose a small subset of
the data points, the chance of selecting an outlier is
very small.
We then assign the rest of the data points to the clusters by distance or similarity comparison, or by classification.
Weaknesses of k-means (cont …)
The algorithm is sensitive to initial seeds.
Weaknesses of k-means (cont …)
If we use different seeds, we can get good results. There are methods to help choose good seeds, for example running the algorithm several times with different random seeds and keeping the best result, or using a seeding scheme such as k-means++.
Using frequent values to represent a cluster
This method is mainly used for clustering categorical data (e.g., k-modes clustering).
It is also the main method used in text clustering, where a small set of frequent words in each cluster is selected to represent the cluster.
Clusters of arbitrary shapes
Hyper-elliptical and hyper-spherical clusters are usually easy to represent, using their centroid together with spreads.
Irregularly shaped clusters are hard to represent. They may not be useful in some applications.
Using centroids is not suitable for them in general (upper figure).
K-means clusters may be more useful (lower figure), e.g., for making T-shirts in two sizes.
Hierarchical Clustering
Produces a nested sequence of clusters, i.e., a tree (dendrogram).
Types of hierarchical clustering
Agglomerative (bottom-up) clustering: builds the tree from the bottom level; it
merges the most similar (or nearest) pair of clusters, and
stops when all the data points are merged into a single cluster (i.e., the root cluster).
Divisive (top-down) clustering: starts with all data points in one cluster, the root; it
splits the root into a set of child clusters, each of which is recursively divided further, and
stops when only singleton clusters of individual data points remain, i.e., each cluster contains only a single point.
Agglomerative clustering
Agglomerative clustering algorithm
An example: working of the algorithm
Measuring the distance of two clusters
There are a few ways to measure the distance between two clusters; each results in a different variation of the algorithm (a code sketch using these options follows the list):
Single link
Complete link
Average link
Centroids
…
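A minimal sketch of agglomerative clustering with these linkage options, assuming SciPy is available (the slides do not name a library):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],   # one natural group
              [5.0, 5.0], [5.2, 5.1], [4.9, 5.3]])  # another natural group

# 'single', 'complete', 'average', and 'centroid' correspond to the
# cluster-distance definitions listed above.
for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method)                     # the nested merge tree
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, labels)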
Single link method
The distance between two clusters is the distance between the two closest data points in the two clusters, one data point from each cluster.
It can find arbitrarily shaped clusters, but
it may cause the undesirable "chain effect" due to noisy points.
(Figure: two natural clusters are split into two.)
Complete link method
The distance between two clusters is the distance between the two furthest data points in the two clusters, one from each cluster.
It is sensitive to outliers because they are far away from the rest of the data.
Average link and centroid methods
Average link: a compromise between
the sensitivity of complete-link clustering to outliers, and
the tendency of single-link clustering to form long chains that do not correspond to the intuitive notion of clusters as compact, spherical objects.
In this method, the distance between two clusters is the average of all pairwise distances between the data points in the two clusters.
Centroid method: in this method, the distance between two clusters is the distance between their centroids.
Density-based Approaches
(Figure: results of a k-medoid algorithm for k = 4.)
Density Based Clustering: Basic Concept
Intuition for the formalization of the basic idea:
For any point in a cluster, the local point density around that point has to exceed some threshold.
The set of points from one cluster is spatially connected.
The local point density at a point p is defined by two parameters:
ε: the radius of the neighborhood of point p, giving the ε-neighborhood Nε(p)
MinPts: the minimum number of objects required within that neighborhood
ε-Neighborhood
ε-Neighborhood: the objects within a radius of ε from an object:
$$ N_{\varepsilon}(p) := \{\, q \mid d(p, q) \le \varepsilon \,\} $$
"High density": the ε-neighborhood of an object contains at least MinPts objects.
(Figure: the ε-neighborhood of p contains at least MinPts = 4 objects, so the density of p is "high"; the ε-neighborhood of q contains fewer than MinPts = 4, so the density of q is "low".)
Core, Border & Outlier
Given ε and MinPts, categorize the objects into three exclusive groups (a small code sketch follows):
A point is a core point if it has more than a specified number of points (MinPts) within ε. These are points in the interior of a cluster.
A border point has fewer than MinPts within ε, but is in the neighborhood of a core point.
An outlier (noise) point is any point that is neither a core point nor a border point.
(Figure: ε = 1 unit, MinPts = 5.)
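A minimal sketch of this core/border/noise labelling, assuming NumPy and Euclidean distance (the function and parameter names are illustrative):

import numpy as np

def label_points(X, eps, min_pts):
    # Pairwise Euclidean distances between all points.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Neighborhood counts (here including the point itself; conventions vary).
    neighbor_counts = (d <= eps).sum(axis=1)
    core = neighbor_counts >= min_pts
    # A border point is not core but has a core point within eps.
    border = ~core & ((d <= eps) & core[None, :]).any(axis=1)
    noise = ~core & ~border
    return core, border, noise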
DBSCAN: The Algorithm
Process each point in turn, as in the pseudocode below, and continue until all of the points have been processed (a Python sketch follows the example).
DBSCAN Algorithm: Example
Parameters: ε = 2 cm, MinPts = 3
for each o ∈ D do
    if o is not yet classified then
        if o is a core-object then
            collect all objects density-reachable from o
            and assign them to a new cluster
        else
            assign o to NOISE
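A minimal sketch of this loop in Python, reusing the core-point test above; NumPy is an assumption, and the queue-based cluster expansion is an illustration rather than a reference implementation:

import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.where(d[i] <= eps)[0] for i in range(n)]
    core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels = np.full(n, -2)              # -2 = not yet classified, -1 = NOISE
    cluster_id = 0
    for o in range(n):                   # for each o in D
        if labels[o] != -2:
            continue                     # already classified
        if not core[o]:
            labels[o] = -1               # NOISE (may later become a border point)
            continue
        # Collect all objects density-reachable from o into a new cluster.
        labels[o] = cluster_id
        queue = list(neighbors[o])
        while queue:
            q = queue.pop()
            if labels[q] in (-2, -1):    # unclassified or provisional noise
                labels[q] = cluster_id
                if core[q]:              # expand only through core objects
                    queue.extend(neighbors[q])
        cluster_id += 1
    return labels

For the example above, dbscan(X, eps=2.0, min_pts=3) returns one cluster id per point, with -1 marking noise.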
Example
• Resistant to Noise
• Can handle clusters of different shapes and sizes
When DBSCAN Does NOT Work Well
(Figure: original points; MinPts = 4, Eps = 9.92.)
Advantages
Clusters can have arbitrary shape and size
Number of clusters is determined automatically
Can separate clusters from surrounding noise
Can be supported by spatial index structures
Disadvantages
Input parameters may be difficult to determine
In some situations very sensitive to input
parameter setting
Denclue: Technical Essence
Gaussian influence function:
$$ f_{\text{Gaussian}}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}} $$
Density function:
$$ f^{D}_{\text{Gaussian}}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}} $$
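A minimal sketch of evaluating this density function at a point x, assuming NumPy (the data and sigma value are illustrative):

import numpy as np

def gaussian_density(x, data, sigma):
    # Sum of the Gaussian influences of all data points x_i at location x.
    sq_dists = np.sum((data - x) ** 2, axis=1)    # d(x, x_i)^2
    return np.sum(np.exp(-sq_dists / (2 * sigma ** 2)))

data = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0]])
print(gaussian_density(np.array([0.0, 0.1]), data, sigma=0.5))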