Lecture 6

What is Cluster Analysis?

• Cluster: a collection of data objects
  – Similar to one another within the same cluster
  – Dissimilar to the objects in other clusters
• Cluster analysis
  – Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters
• Unsupervised learning: no predefined classes
• Typical applications
  – As a stand-alone tool to gain insight into the data distribution
  – As a preprocessing step for other algorithms

Clustering: Rich Applications and Multidisciplinary Efforts

• Pattern Recognition
• Spatial Data Analysis
  – Create thematic maps in GIS by clustering feature spaces
  – Detect spatial clusters or support other spatial mining tasks
• Image Processing
• Economic Science (especially market research)
• WWW
  – Document classification
  – Cluster Weblog data to discover groups of similar access patterns

Examples of Clustering Applications

• Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• Land use: identification of areas of similar land use in an earth observation database
• Insurance: identifying groups of motor insurance policy holders with a high average claim cost
• City planning: identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies: observed earthquake epicenters should be clustered along continental faults

Quality: What Is Good Clustering?

• A good clustering method will produce high-quality clusters with
  – high intra-class similarity
  – low inter-class similarity
• The quality of a clustering result depends on both the similarity measure used by the method and its implementation
• The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns

Measure the Quality of Clustering

• Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j)
• There is a separate "quality" function that measures the "goodness" of a cluster
• The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
• Weights should be associated with different variables based on the application and data semantics
• It is hard to define "similar enough" or "good enough"
  – the answer is typically highly subjective
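
To make the role of the distance function and per-variable weights concrete, here is a minimal sketch of a weighted Euclidean distance in Python; the function name and the example weights are illustrative assumptions, not part of the lecture.

    import math

    def weighted_euclidean(x, y, w):
        # d(i, j) with per-variable weights: a larger weight makes that
        # variable count more toward the dissimilarity.
        return math.sqrt(sum(wk * (xk - yk) ** 2
                             for xk, yk, wk in zip(x, y, w)))

    # Example: weight the second variable twice as heavily as the first.
    print(weighted_euclidean((35.0, 50.0), (40.0, 52.0), (1.0, 2.0)))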

Requirements of Clustering in Data Mining

• Scalability
• Ability to deal with different types of attributes
• Ability to handle dynamic data
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input parameters
• Ability to deal with noise and outliers
• Insensitivity to the order of input records
• Ability to handle high dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability

Typical Alternatives to Calculate the Distance between Clusters

• Single link: smallest distance between an element in one cluster and an element in the other, i.e., $dis(K_i, K_j) = \min\, dis(t_{ip}, t_{jq})$
• Complete link: largest distance between an element in one cluster and an element in the other, i.e., $dis(K_i, K_j) = \max\, dis(t_{ip}, t_{jq})$
• Average: average distance between an element in one cluster and an element in the other, i.e., $dis(K_i, K_j) = \mathrm{avg}\, dis(t_{ip}, t_{jq})$
• Centroid: distance between the centroids of two clusters, i.e., $dis(K_i, K_j) = dis(C_i, C_j)$
• Medoid: distance between the medoids of two clusters, i.e., $dis(K_i, K_j) = dis(M_i, M_j)$
  – Medoid: one chosen, centrally located object in the cluster
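
A minimal Python sketch of the first four of these inter-cluster distances, assuming Euclidean distance between points (the helper names are illustrative):

    import math
    from itertools import product

    def dist(p, q):
        return math.dist(p, q)  # Euclidean distance between two points

    def single_link(Ki, Kj):
        return min(dist(p, q) for p, q in product(Ki, Kj))

    def complete_link(Ki, Kj):
        return max(dist(p, q) for p, q in product(Ki, Kj))

    def average_link(Ki, Kj):
        return sum(dist(p, q) for p, q in product(Ki, Kj)) / (len(Ki) * len(Kj))

    def centroid_link(Ki, Kj):
        ci = [sum(c) / len(Ki) for c in zip(*Ki)]  # centroid of Ki
        cj = [sum(c) / len(Kj) for c in zip(*Kj)]  # centroid of Kj
        return dist(ci, cj)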

Centroid, Radius and Diameter of a Cluster (for numerical data sets)

• Centroid: the "middle" of a cluster

    $C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$

• Radius: square root of the average distance from any point of the cluster to its centroid

    $R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$

• Diameter: square root of the average mean squared distance between all pairs of points in the cluster

    $D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$

Partitioning Algorithms: Basic Concept

• Partitioning method: construct a partition of a database D of n objects into a set of k clusters such that the sum of squared distances is minimized (a small evaluation sketch follows this list):

    $E = \sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2$

• Given k, find a partition into k clusters that optimizes the chosen partitioning criterion
  – Global optimum: exhaustively enumerate all partitions
  – Heuristic methods: the k-means and k-medoids algorithms
  – k-means (MacQueen'67): each cluster is represented by the center of the cluster
  – k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster
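
The criterion E is easy to evaluate for a candidate partition; a small sketch, assuming each cluster is given as an (n_m, d) NumPy array:

    import numpy as np

    def sse(clusters):
        # E = sum over clusters of the squared distances of each
        # object to its cluster centroid C_m.
        total = 0.0
        for K in clusters:
            C = K.mean(axis=0)
            total += ((K - C) ** 2).sum()
        return total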

The K-Means Clustering Method

• Given k, the k-means algorithm is implemented in four steps:
  – Partition the objects into k nonempty subsets
  – Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
  – Assign each object to the cluster with the nearest seed point
  – Go back to Step 2; stop when the assignment no longer changes
• Example (K = 2): arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; then reassign objects and update the means again, repeating until no object changes cluster. A minimal sketch follows.

[Figure: a sequence of scatter plots (axes 0 to 10) showing the assign / update-the-means / reassign iterations for K = 2.]
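
A minimal NumPy sketch of these steps (random initial centers chosen from the objects; the function name and signature are illustrative, not from the lecture):

    import numpy as np

    def k_means(X, k, max_iter=100, seed=0):
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(seed)
        # Arbitrarily choose k objects as the initial cluster centers.
        centers = X[rng.choice(len(X), size=k, replace=False)].copy()
        assign = None
        for _ in range(max_iter):
            # Assign each object to the cluster with the nearest center.
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            new_assign = d2.argmin(axis=1)
            if assign is not None and (new_assign == assign).all():
                break                       # no reassignment: converged
            assign = new_assign
            # Update each center to the mean of its cluster.
            for m in range(k):
                members = X[assign == m]
                if len(members) > 0:        # keep an empty cluster's old center
                    centers[m] = members.mean(axis=0)
        return centers, assign

    # Usage: centers, labels = k_means(points, k=2)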

Comments on the K-Means Method

• Strength: relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations; normally, k, t << n
• Comparison: PAM: O(k(n−k)²); CLARA: O(ks² + k(n−k))
• Comment: often terminates at a local optimum; the global optimum may be found using techniques such as deterministic annealing and genetic algorithms
• Weaknesses
  – Applicable only when a mean is defined; what about categorical data?
  – Need to specify k, the number of clusters, in advance
  – Unable to handle noisy data and outliers
  – Not suitable for discovering clusters with non-convex shapes

Variations of the K-Means Method

• A few variants of k-means differ in
  – Selection of the initial k means
  – Dissimilarity calculations
  – Strategies for calculating cluster means
• Handling categorical data: k-modes (Huang'98)
  – Replacing means of clusters with modes
  – Using new dissimilarity measures to deal with categorical objects
  – Using a frequency-based method to update modes of clusters
  – A mixture of categorical and numerical data: the k-prototype method
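
A sketch of two of the k-modes ingredients named above, the simple matching dissimilarity and the frequency-based mode update (illustrative only, not Huang's full algorithm):

    from collections import Counter

    def mismatch(x, y):
        # Dissimilarity for categorical objects: count of differing attributes.
        return sum(a != b for a, b in zip(x, y))

    def mode_of(cluster):
        # Frequency-based update: per attribute, take the most frequent value.
        return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

    cluster = [("red", "suv"), ("red", "sedan"), ("blue", "suv")]
    print(mode_of(cluster))                           # ('red', 'suv')
    print(mismatch(("red", "suv"), ("blue", "suv")))  # 1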


Summary

• Cluster analysis groups objects based on their similarity and has wide applications
• Measures of similarity can be computed for various types of data
• Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
• Outlier detection and analysis are very useful for fraud detection, etc., and can be performed by statistical, distance-based, or deviation-based approaches
• There are still many open research issues in cluster analysis
