Clustering
What is Cluster Analysis?
● Cluster: A collection of data objects
○ similar (or related) to one another within the same group
○ dissimilar (or unrelated) to the objects in other groups
● Cluster analysis (or clustering, data segmentation, …)
○ Finding similarities between data according to the
characteristics found in the data and grouping similar data
objects into clusters
What is Cluster Analysis?
● Unsupervised learning: no predefined classes (i.e., learning by
observations)
● Typical applications
○ As a stand-alone tool to get insight into data distribution
○ As a preprocessing step for other algorithms
Clustering for Data Understanding and Applications
● Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
● City-planning: Identifying groups of houses according to their house type,
value, and geographical location
● Biology
● Information retrieval: document clustering
● Land use: Identification of areas of similar land use in an earth observation
database
● Climate: understanding Earth's climate by finding patterns in atmospheric and ocean data
Clustering as a Preprocessing Tool (Utility)
● Summarization & Compression
● Finding K-nearest Neighbors
● Outlier detection
○ Outliers are often viewed as those “far away” from any cluster
Quality: What Is Good Clustering?
● A good clustering method produces high-quality clusters: objects within a cluster are highly similar to one another (high intra-cluster similarity), while objects in different clusters are dissimilar (low inter-cluster similarity)
Major Clustering Approaches (I)
● Partitioning approach:
○ Construct various partitions and then evaluate them by some criterion,
e.g., minimizing the sum of square errors
○ Typical methods: k-means, k-medoids, CLARANS
● Hierarchical approach:
○ Create a hierarchical decomposition of the set of data (or objects)
using some criterion
○ Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
● Density-based approach:
○ Based on connectivity and density functions
○ Typical methods: DBSCAN, OPTICS, DenClue
Cluster Analysis: Basic Concepts and Methods
● Cluster Analysis: Basic Concepts
● Partitioning Methods
● Hierarchical Methods
● Density-Based Methods
Partitioning Algorithms: Basic Concept
● Given k, find a partition of k clusters that optimizes the chosen partitioning criterion, e.g., minimizing the sum of squared errors:

E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - c_i)^2

where c_i is the representative (e.g., the centroid) of cluster C_i
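To make the criterion concrete, here is a minimal sketch (not from the slides; the names sse, points, labels, and centroids are my own) that evaluates E for a fixed assignment of 2-D objects to clusters:

```python
import numpy as np

def sse(points, labels, centroids):
    """Sum of squared errors: E = sum_i sum_{p in C_i} ||p - c_i||^2."""
    total = 0.0
    for i, c in enumerate(centroids):
        members = points[labels == i]          # objects assigned to cluster C_i
        total += np.sum((members - c) ** 2)    # squared distances to the representative c_i
    return total

# Toy example: four 2-D objects split into two clusters.
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([points[labels == i].mean(axis=0) for i in range(2)])
print(sse(points, labels, centroids))
```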
The K-Means Clustering Method
An Example of K-Means Clustering
(Figure: iterative clustering with K = 2; objects are assigned to the nearest centroid based on dissimilarity calculations, the centroids are updated, and the steps repeat until the assignment stabilizes.)
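The iteration behind this example can be sketched as follows. This is an illustrative implementation written for this summary, not the slides' own code; it assumes Euclidean distance as the dissimilarity measure and uses NumPy, with K = 2 as above.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal k-means: assign each object to the nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]  # arbitrary initial centroids
    for _ in range(n_iter):
        # Assignment step: dissimilarity (Euclidean distance) to every centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of the objects assigned to it.
        new_centroids = np.array([points[labels == i].mean(axis=0) if np.any(labels == i)
                                  else centroids[i] for i in range(k)])
        if np.allclose(new_centroids, centroids):   # assignments have stabilized
            break
        centroids = new_centroids
    return labels, centroids

points = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
                   [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```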
What Is the Problem of the K-Means Method?
● The k-means algorithm is sensitive to outliers!
○ An object with an extremely large value can substantially distort the distribution of the data and pull a cluster mean toward it
● K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, i.e., the most centrally located object in the cluster
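A quick numeric illustration of this sensitivity (a toy example of my own, not from the slides): a single extreme value drags the mean far away from the bulk of the objects, while the medoid remains a representative object.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # 100.0 is an outlier

mean = values.mean()                              # pulled toward the outlier
# Medoid: the actual object that minimizes total distance to all other objects.
medoid = values[np.argmin([np.abs(values - v).sum() for v in values])]

print(mean)    # 22.0 -> far from the four "normal" objects
print(medoid)  # 3.0  -> still a central, representative object
```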
The K-Medoid Clustering Method
● K-Medoids Clustering: Find representative objects (medoids) in clusters
○ Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if the swap improves the total distance of the resulting clustering
○ PAM (Partitioning Around Medoids) works effectively for small data sets, but does not scale well to large data sets (due to its computational complexity)
PAM: A Typical K-Medoids Algorithm
(Figure, total cost = 20: arbitrarily choose k objects as initial medoids; assign each remaining object to its nearest medoid; then, in a loop, randomly select a non-medoid object O_random, compute the total cost of swapping it with a current medoid, and perform the swap if it improves the quality; repeat until no change.)
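The swap-based procedure above can be sketched in a few lines of Python. This is a simplified PAM written for this summary (function and variable names are my own); the cost is the total distance from each object to its nearest medoid.

```python
import numpy as np

def total_cost(points, medoid_idx):
    """Total distance from each object to its nearest medoid."""
    dists = np.linalg.norm(points[:, None, :] - points[medoid_idx][None, :, :], axis=2)
    return dists.min(axis=1).sum()

def pam(points, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(points), size=k, replace=False))  # arbitrary initial medoids
    best = total_cost(points, medoids)
    improved = True
    while improved:                                   # "until no change"
        improved = False
        for m in range(k):                            # each current medoid
            for o in range(len(points)):              # each non-medoid object
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[m] = o                      # tentative swap
                cost = total_cost(points, candidate)
                if cost < best:                       # keep the swap only if quality improves
                    medoids, best, improved = candidate, cost, True
    return medoids, best

points = np.array([[2.0, 6.0], [3.0, 4.0], [3.0, 8.0], [4.0, 7.0],
                   [6.0, 2.0], [6.0, 4.0], [7.0, 3.0], [7.0, 4.0], [8.0, 5.0]])
print(pam(points, k=2))
```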
Cluster Analysis: Basic Concepts and Methods
● Cluster Analysis: Basic Concepts
● Partitioning Methods
● Hierarchical Methods
● Density-Based Methods
Hierarchical Clustering
● This method does not require the number of clusters k as an input, but
needs a termination condition
● AGNES (agglomerative): bottom-up strategy, starting with each object in its own cluster and merging clusters step by step
● DIANA (divisive): top-down strategy, starting with all objects in one cluster and splitting it step by step
(Figure: objects a, b, c, d, e; over steps 0 to 4, AGNES merges them into ab, de, cde, and finally abcde, while DIANA performs the same steps in reverse.)
AGNES
● Introduced in Kaufman and Rousseeuw (1990)
AGNES (Agglomerative Nesting)
● Use the single-link method and the dissimilarity matrix
● Merge nodes that have the least dissimilarity
● Eventually all nodes belong to the same cluster
(Figure: three scatter plots showing the objects being progressively merged into larger clusters.)
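In practice this behavior can be reproduced with SciPy's hierarchical clustering routines; the following sketch (my own, not from the slides) builds the single-link hierarchy and then cuts it into two flat clusters with fcluster:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D objects (hypothetical data).
X = np.array([[1.0, 1.0], [1.5, 1.5], [5.0, 5.0],
              [3.0, 4.0], [4.0, 4.0], [3.0, 3.5]])

# Single-link (minimum distance) agglomerative clustering, as in AGNES.
Z = linkage(X, method='single', metric='euclidean')

# Merging continues until all objects are in one cluster; here we cut at 2 clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```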
Dendrogram: Shows How Clusters are Merged
● A dendrogram is commonly used to represent the process of hierarchical clustering: it shows how objects are grouped together step by step
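Continuing the SciPy sketch above (again my own illustration, not from the slides), the linkage matrix can be rendered as a dendrogram with Matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1.0, 1.0], [1.5, 1.5], [5.0, 5.0],
              [3.0, 4.0], [4.0, 4.0], [3.0, 3.5]])
Z = linkage(X, method='single')          # same single-link hierarchy as before

# Each U-shaped link in the plot is one merge; its height is the merge distance.
dendrogram(Z, labels=['a', 'b', 'c', 'd', 'e', 'f'])
plt.xlabel('objects')
plt.ylabel('merge distance')
plt.show()
```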
Distance between Clusters
● Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(C_i, C_j) = min { d(p, q) : p ∈ C_i, q ∈ C_j }
● Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(C_i, C_j) = max { d(p, q) : p ∈ C_i, q ∈ C_j }
● Average link: average distance between an element in one cluster and an element in the other, i.e., dist(C_i, C_j) = avg { d(p, q) : p ∈ C_i, q ∈ C_j }
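These three inter-cluster distances can be written out directly; the sketch below (my own helper names, Euclidean distance assumed) compares them on two small clusters:

```python
import numpy as np

def pairwise(A, B):
    """All distances d(p, q) for p in cluster A and q in cluster B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def single_link(A, B):    return pairwise(A, B).min()    # smallest pairwise distance
def complete_link(A, B):  return pairwise(A, B).max()    # largest pairwise distance
def average_link(A, B):   return pairwise(A, B).mean()   # average pairwise distance

A = np.array([[1.0, 1.0], [2.0, 1.0]])
B = np.array([[5.0, 5.0], [6.0, 7.0]])
print(single_link(A, B), complete_link(A, B), average_link(A, B))
```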
Extensions to Hierarchical Clustering
● Major weakness of agglomerative clustering methods
○ Do not scale well: time complexity of at least O(n²), where n is the total number of objects