Data Mining-Unit 3-Part1
Clustering Example
Clustering Houses
– Size Based
– Geographic Distance Based
Clustering vs. Classification
No prior knowledge
– Number of clusters
– Meaning of clusters
Unsupervised learning
Clustering Issues
Outlier handling
Dynamic data
Interpreting results
Evaluating results
Number of clusters
Data to be used
Scalability
Impact of Outliers on Clustering
Clustering Problem
Given a database D = {t1, t2, …, tn} of
tuples and an integer value k, the
Clustering Problem is to define a
mapping f : D → {1, …, k} where each ti is
assigned to one cluster Kj, 1 ≤ j ≤ k.
A cluster Kj contains precisely those
tuples mapped to it.
Unlike the classification problem, clusters
are not known a priori.
Types of Clustering
Hierarchical – Nested set of clusters
created.
Partitional – One set of clusters
created.
Incremental – Each element handled
one at a time.
Simultaneous – All elements handled
together.
Overlapping/Non-overlapping
Clustering Approaches
(Taxonomy diagram omitted.)
Cluster Parameters
Distance Between Clusters
Single Link: smallest distance between
points
Complete Link: largest distance between
points
Average Link: average distance between
points
Centroid: distance between centroids
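The four measures above can be sketched in Python. This is a minimal 1-D illustration; the function names and the sample clusters are our own, not from the slides.

```python
# Four inter-cluster distance measures, sketched for 1-D points.

def dist(a, b):
    """Distance between two 1-D points."""
    return abs(a - b)

def single_link(K1, K2):
    # Smallest distance between any pair of points, one from each cluster.
    return min(dist(a, b) for a in K1 for b in K2)

def complete_link(K1, K2):
    # Largest distance between any pair of points.
    return max(dist(a, b) for a in K1 for b in K2)

def average_link(K1, K2):
    # Average distance over all cross-cluster pairs.
    return sum(dist(a, b) for a in K1 for b in K2) / (len(K1) * len(K2))

def centroid_dist(K1, K2):
    # Distance between the cluster centroids (means).
    return dist(sum(K1) / len(K1), sum(K2) / len(K2))

K1, K2 = [1, 2], [4, 8]
print(single_link(K1, K2))    # 2
print(complete_link(K1, K2))  # 7
print(average_link(K1, K2))   # 4.5
print(centroid_dist(K1, K2))  # 4.5
```

Note that single link and complete link need only the pairwise distances, while the centroid measure assumes a mean can be computed.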
Hierarchical Clustering
Clusters are created in levels, producing a
set of clusters at each level.
Agglomerative
– Initially each item in its own cluster
– Iteratively clusters are merged together
– Bottom Up
Divisive
– Initially all items in one cluster
– Large clusters are successively divided
– Top Down
Hierarchical Algorithms
Single Link
MST Single Link
Complete Link
Average Link
Dendrogram
Dendrogram: a tree data
structure which illustrates
hierarchical clustering
techniques.
Each level shows clusters
for that level.
– Leaf – individual clusters
– Root – one cluster
A cluster at level i is the
union of its children clusters
at level i+1.
Levels of Clustering
Agglomerative Example
   A  B  C  D  E
A  0  1  2  2  3
B  1  0  2  4  3
C  2  2  0  1  5
D  2  4  1  0  3
E  3  3  5  3  0
(Graph and dendrogram omitted: leaves A, B, C, D, E; threshold axis 1–5.)
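Running a small single-link agglomerative pass over this distance matrix (a sketch, assuming Python; the variable names are our own) recovers the levels at which merges happen:

```python
# Single-link agglomerative clustering over the example distance matrix.

D = {  # symmetric pairwise distances from the example
    ('A', 'B'): 1, ('A', 'C'): 2, ('A', 'D'): 2, ('A', 'E'): 3,
    ('B', 'C'): 2, ('B', 'D'): 4, ('B', 'E'): 3,
    ('C', 'D'): 1, ('C', 'E'): 5,
    ('D', 'E'): 3,
}

def d(x, y):
    return 0 if x == y else D.get((x, y), D.get((y, x)))

def single_link(c1, c2):
    # Single link: smallest distance between points of the two clusters.
    return min(d(a, b) for a in c1 for b in c2)

clusters = [{'A'}, {'B'}, {'C'}, {'D'}, {'E'}]
merge_heights = []
while len(clusters) > 1:
    # Find the closest pair of clusters under the single-link distance.
    i, j = min(((i, j) for i in range(len(clusters))
                       for j in range(i + 1, len(clusters))),
               key=lambda p: single_link(clusters[p[0]], clusters[p[1]]))
    merge_heights.append(single_link(clusters[i], clusters[j]))
    clusters[i] |= clusters[j]
    del clusters[j]

print(merge_heights)  # [1, 1, 2, 3]
```

A–B and C–D merge at distance 1, the two pairs merge at 2, and E joins last at 3, matching the dendrogram levels.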
MST Example
   A  B  C  D  E
A  0  1  2  2  3
B  1  0  2  4  3
C  2  2  0  1  5
D  2  4  1  0  3
E  3  3  5  3  0
(Graph omitted: vertices A–E with the distances above as edge weights.)
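A minimum spanning tree of this graph can be built with Prim's algorithm, sketched below (assuming Python; the names are our own). MST single link then forms clusters by cutting the longest MST edges.

```python
# Prim's algorithm over the example distance matrix.

nodes = ['A', 'B', 'C', 'D', 'E']
row = {
    'A': {'A': 0, 'B': 1, 'C': 2, 'D': 2, 'E': 3},
    'B': {'A': 1, 'B': 0, 'C': 2, 'D': 4, 'E': 3},
    'C': {'A': 2, 'B': 2, 'C': 0, 'D': 1, 'E': 5},
    'D': {'A': 2, 'B': 4, 'C': 1, 'D': 0, 'E': 3},
    'E': {'A': 3, 'B': 3, 'C': 5, 'D': 3, 'E': 0},
}

def prim_mst(nodes, row):
    in_tree = {nodes[0]}
    edges = []
    while len(in_tree) < len(nodes):
        # Cheapest edge crossing the cut between tree and non-tree nodes.
        u, v = min(((u, v) for u in in_tree
                           for v in nodes if v not in in_tree),
                   key=lambda e: row[e[0]][e[1]])
        edges.append((u, v, row[u][v]))
        in_tree.add(v)
    return edges

mst = prim_mst(nodes, row)
print(sum(w for _, _, w in mst))  # total MST weight: 7
```

For this matrix the MST uses four edges of weights 1, 1, 2, and 3 (total 7); ties at weight 2 can be broken either way without changing the total.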
Agglomerative Algorithm
Single Link
View all items with links (distances)
between them.
Finds maximal connected components
in this graph.
Two clusters are merged if there is at
least one edge which connects them.
Uses threshold distances at each level.
Could be agglomerative or divisive.
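The graph view above can be sketched directly: at each threshold, keep only the edges no longer than the threshold and report the connected components. This is an illustrative sketch over the earlier example data; the edge list and function names are our own.

```python
# Single link as connected components of a thresholded distance graph.

edges = [('A', 'B', 1), ('C', 'D', 1), ('A', 'C', 2), ('A', 'D', 2),
         ('B', 'C', 2), ('A', 'E', 3), ('B', 'E', 3), ('D', 'E', 3),
         ('B', 'D', 4), ('C', 'E', 5)]

def components_at(threshold):
    parent = {n: n for n in 'ABCDE'}       # union-find forest
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for u, v, w in edges:
        if w <= threshold:
            parent[find(u)] = find(v)      # union the two components
    groups = {}
    for n in parent:
        groups.setdefault(find(n), set()).add(n)
    return sorted(map(sorted, groups.values()))

print(components_at(1))  # [['A', 'B'], ['C', 'D'], ['E']]
print(components_at(2))  # [['A', 'B', 'C', 'D'], ['E']]
print(components_at(3))  # [['A', 'B', 'C', 'D', 'E']]
```

Raising the threshold level by level yields the nested clusterings of the dendrogram.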
MST Single Link Algorithm
Single Link Clustering
Partitional Clustering
Nonhierarchical
Creates clusters in one step as opposed
to several steps.
Since only one set of clusters is output,
the user normally has to input the
desired number of clusters, k.
Usually deals with static sets.
Partitional Algorithms
MST
Squared Error
K-Means
Nearest Neighbor
PAM
BEA
GA
MST Algorithm
Squared Error
Minimizes the squared error: the sum,
over all clusters, of the squared
distances between each item and its
cluster mean.
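A minimal 1-D sketch of this criterion (assuming Python; the sample data is our own):

```python
# Squared error of a clustering: for each cluster, sum the squared
# distances of its items to the cluster mean.

def squared_error(clusters):
    total = 0.0
    for K in clusters:
        m = sum(K) / len(K)                    # cluster mean
        total += sum((t - m) ** 2 for t in K)
    return total

# The tighter clustering of the same items has the smaller squared error.
print(squared_error([[2, 3, 4], [10, 12, 11]]))  # 4.0
print(squared_error([[2, 3, 10], [4, 12, 11]]))  # 76.0
```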
Squared Error Algorithm
K-Means
Initial set of clusters randomly chosen.
Iteratively, items are moved among sets
of clusters until the desired set is
reached.
High degree of similarity among
elements in a cluster is obtained.
Given a cluster Ki={ti1,ti2,…,tim}, the
cluster mean is mi = (1/m)(ti1 + … + tim)
K-Means Example
Given: {2,4,10,12,3,20,30,11,25}, k=2
Randomly assign means: m1=3,m2=4
K1={2,3}, K2={4,10,12,20,30,11,25},
m1=2.5,m2=16
K1={2,3,4},K2={10,12,20,30,11,25},
m1=3,m2=18
K1={2,3,4,10},K2={12,20,30,11,25},
m1=4.75,m2=19.6
K1={2,3,4,10,11,12},K2={20,30,25},
m1=7,m2=25
Stop as the clusters with these means
are the same.
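The iterations above can be replayed with a short sketch (assuming Python; `kmeans_1d` is our own name, not from the slides):

```python
# Replaying the slide's 1-D K-Means example with k=2 and initial means 3, 4.

def kmeans_1d(items, means):
    while True:
        # Assign each item to the cluster with the nearest mean.
        clusters = [[] for _ in means]
        for t in items:
            i = min(range(len(means)), key=lambda i: abs(t - means[i]))
            clusters[i].append(t)
        new_means = [sum(K) / len(K) for K in clusters]
        if new_means == means:        # stop when the means no longer change
            return clusters, means
        means = new_means

items = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means = kmeans_1d(items, [3.0, 4.0])
print(sorted(clusters[0]), sorted(clusters[1]))
# [2, 3, 4, 10, 11, 12] [20, 25, 30]
print(means)  # [7.0, 25.0]
```

The run converges to K1 = {2,3,4,10,11,12} with m1 = 7 and K2 = {20,25,30} with m2 = 25, as in the example.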
K-Means Algorithm
Nearest Neighbor
Items are iteratively merged into the
existing clusters that are closest.
Incremental
Threshold, t, used to determine if items
are added to existing clusters or a new
cluster is created.
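A minimal incremental sketch of this scheme (assuming Python; the item order, the threshold t = 5, and the function name are our own illustration):

```python
# Incremental nearest neighbor clustering with threshold t.

def nearest_neighbor(items, t):
    clusters = [[items[0]]]                  # first item starts a cluster
    for x in items[1:]:
        # Distance from x to a cluster = distance to its nearest member.
        dists = [min(abs(x - y) for y in K) for K in clusters]
        i = min(range(len(clusters)), key=lambda j: dists[j])
        if dists[i] <= t:
            clusters[i].append(x)            # close enough: join it
        else:
            clusters.append([x])             # otherwise start a new cluster
    return clusters

print(nearest_neighbor([2, 4, 10, 12, 3, 20, 30, 11, 25], t=5))
# [[2, 4, 3], [10, 12, 11], [20, 25], [30]]
```

Because each item is placed once and never moved, the result depends on the input order, unlike K-Means.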
Nearest Neighbor Algorithm
PAM
Partitioning Around Medoids (PAM)
(K-Medoids)
Handles outliers well.
Ordering of input does not impact results.
Does not scale well.
Each cluster represented by one item,
called the medoid.
Initial set of k medoids randomly chosen.
PAM
PAM Cost Calculation
At each step in algorithm, medoids are
changed if the overall cost is improved.
Cjih – cost change for an item tj associated
with swapping medoid ti with non-medoid th.
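One pass of these swap tests can be sketched on 1-D toy data (assuming Python; the data, the initial medoids, and the names are our own). The total cost of a medoid set is the sum of each item's distance to its nearest medoid, and a swap is kept only when it lowers that cost.

```python
# One pass of PAM swap tests on toy 1-D data.

def total_cost(items, medoids):
    return sum(min(abs(t - m) for m in medoids) for t in items)

items = [2, 3, 4, 10, 11, 12, 20, 25, 30]
medoids = [3, 20]                       # initial medoids, chosen arbitrarily
best = total_cost(items, medoids)

for i in range(len(medoids)):           # each medoid t_i ...
    for t_h in items:                   # ... against each non-medoid t_h
        if t_h in medoids:
            continue
        trial = medoids[:i] + [t_h] + medoids[i + 1:]
        cost = total_cost(items, trial)
        if cost < best:                 # keep the swap only if cost improves
            best, medoids = cost, trial

print(medoids, best)  # [4, 25] 34
```

A full PAM run would repeat such passes until no swap improves the cost; because medoids must be actual items, the method tolerates outliers better than means do.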
PAM Algorithm
BEA
Bond Energy Algorithm
Database design (physical and logical)
Vertical fragmentation
Determine affinity (bond) between attributes
based on common usage.
Algorithm outline:
1. Create affinity matrix
2. Convert to BOND matrix
3. Create regions of close bonding
BEA
Genetic Algorithm Example
{A,B,C,D,E,F,G,H}
Randomly choose initial solution:
{A,C,E} {B,F} {D,G,H} or
10101000, 01000100, 00010011
Suppose crossover at point four and
choose 1st and 3rd individuals:
10100011, 01000100, 00011000
What should termination criteria be?
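The encoding and crossover step above can be sketched as follows (assuming Python; the helper names are our own):

```python
# Bit-string cluster encoding and one-point crossover from the example.

items = 'ABCDEFGH'

def encode(cluster):
    # One bit per item: 1 if the item belongs to the cluster.
    return ''.join('1' if c in cluster else '0' for c in items)

def crossover(p1, p2, point):
    # Swap the tails of the two parents after the crossover point.
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

p1 = encode({'A', 'C', 'E'})      # 10101000
p3 = encode({'D', 'G', 'H'})      # 00010011
c1, c3 = crossover(p1, p3, 4)
print(c1, c3)  # 10100011 00011000
```

A common termination choice is a fixed number of generations or no fitness improvement for several generations; note also that crossover can leave some items in no cluster (or in two), so a repair step or fitness penalty is usually needed.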
GA Algorithm