
Data Mining

Unit 3: Clustering and Association


Syllabus:
 Clustering: Introduction – Similarity and Distance Measures – Outliers – Hierarchical Algorithms – Partitional Algorithms.
 Association Rules: Introduction – Large Item Sets – Basic Algorithms – Parallel & Distributed Algorithms – Comparing Approaches – Incremental Rules – Advanced Association Rule Techniques – Measuring the Quality of Rules.

Clustering Examples
 Segment customer database based on
similar buying patterns.
 Group houses in a town into
neighborhoods based on similar
features.
 Identify new plant species.
 Identify similar Web usage patterns.

Clustering Example

Clustering Houses

(figure: the same set of houses clustered two ways, size based and geographic distance based)

Clustering vs. Classification
 No prior knowledge
– Number of clusters
– Meaning of clusters
 Unsupervised learning

Clustering Issues
 Outlier handling
 Dynamic data
 Interpreting results
 Evaluating results
 Number of clusters
 Data to be used
 Scalability

Impact of Outliers on Clustering

Clustering Problem
 Given a database D = {t1, t2, …, tn} of tuples and an integer value k, the clustering problem is to define a mapping f : D → {1, …, k} where each ti is assigned to one cluster Kj, 1 <= j <= k.
 A cluster Kj contains precisely those tuples mapped to it.
 Unlike the classification problem, the clusters are not known a priori.

Types of Clustering
 Hierarchical – Nested set of clusters
created.
 Partitional – One set of clusters
created.
 Incremental – Each element handled
one at a time.
 Simultaneous – All elements handled
together.
 Overlapping/Non-overlapping
Clustering Approaches

 Hierarchical: Agglomerative, Divisive
 Partitional
 Categorical
 Large DB: Sampling, Compression

Cluster Parameters

Distance Between Clusters
 Single Link: smallest distance between points.
 Complete Link: largest distance between points.
 Average Link: average distance between points.
 Centroid: distance between centroids.
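
For concreteness, here is a minimal Python sketch of these four measures for clusters of numeric vectors. The function names are illustrative (not from any particular library), and Euclidean distance is assumed.

```python
# Sketch of the four inter-cluster distance measures; assumes the
# clusters are non-empty lists of equal-length coordinate tuples.
from itertools import product
from math import dist  # Euclidean distance (Python 3.8+)

def single_link(K1, K2):
    # Smallest distance between any cross-cluster pair of points.
    return min(dist(a, b) for a, b in product(K1, K2))

def complete_link(K1, K2):
    # Largest distance between any cross-cluster pair of points.
    return max(dist(a, b) for a, b in product(K1, K2))

def average_link(K1, K2):
    # Average distance over all cross-cluster pairs.
    return sum(dist(a, b) for a, b in product(K1, K2)) / (len(K1) * len(K2))

def centroid_link(K1, K2):
    # Distance between the two cluster centroids.
    c1 = [sum(xs) / len(K1) for xs in zip(*K1)]
    c2 = [sum(xs) / len(K2) for xs in zip(*K2)]
    return dist(c1, c2)
```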

Hierarchical Clustering
 Clusters are created in levels, producing a set of clusters at each level.
 Agglomerative
– Initially each item in its own cluster
– Iteratively clusters are merged together
– Bottom Up
 Divisive
– Initially all items in one cluster
– Large clusters are successively divided
– Top Down

Hierarchical Algorithms
 Single Link
 MST Single Link
 Complete Link
 Average Link

Dendrogram
 Dendrogram: a tree data
structure which illustrates
hierarchical clustering
techniques.
 Each level shows clusters
for that level.
– Leaf – individual clusters
– Root – one cluster
 A cluster at level i is the
union of its children clusters
at level i+1.

Levels of Clustering

Agglomerative Example
Distance matrix for the five items A–E (the slide also draws them as a graph):

     A  B  C  D  E
  A  0  1  2  2  3
  B  1  0  2  4  3
  C  2  2  0  1  5
  D  2  4  1  0  3
  E  3  3  5  3  0

(figure: the dendrogram built from this matrix, with merge thresholds 1 through 5 and leaves A, B, C, D, E)
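
With SciPy available, the same single-link merges can be reproduced directly; this is a sketch using standard scipy.cluster.hierarchy calls.

```python
# Single-link agglomerative clustering on the distance matrix above.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

D = np.array([[0, 1, 2, 2, 3],
              [1, 0, 2, 4, 3],
              [2, 2, 0, 1, 5],
              [2, 4, 1, 0, 3],
              [3, 3, 5, 3, 0]])

Z = linkage(squareform(D), method='single')   # linkage expects condensed form
for t in (1, 2, 3):
    print(t, fcluster(Z, t=t, criterion='distance'))
# t=1: {A,B}, {C,D}, {E}   t=2: {A,B,C,D}, {E}   t=3: one cluster
```
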
MST Example

The same distance matrix as before:

     A  B  C  D  E
  A  0  1  2  2  3
  B  1  0  2  4  3
  C  2  2  0  1  5
  D  2  4  1  0  3
  E  3  3  5  3  0

(figure: minimum spanning tree drawn over the five items)
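
The MST itself can be computed with SciPy's csgraph module (a sketch; cutting MST edges longer than a threshold yields exactly the single-link clusters for that threshold).

```python
# Minimum spanning tree of the complete graph given by the matrix above.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

D = np.array([[0, 1, 2, 2, 3],
              [1, 0, 2, 4, 3],
              [2, 2, 0, 1, 5],
              [2, 4, 1, 0, 3],
              [3, 3, 5, 3, 0]])

mst = minimum_spanning_tree(D).toarray()
print(mst)  # nonzero entries are MST edges: A-B (1), C-D (1), plus one
            # edge of weight 2 and one of weight 3 joining the pieces
```
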
Agglomerative Algorithm

Single Link
 View all items with links (distances)
between them.
 Finds maximal connected components
in this graph.
 Two clusters are merged if there is at
least one edge which connects them.
 Uses threshold distances at each level.
 Could be agglomerative or divisive.

MST Single Link Algorithm

Single Link Clustering

Partitional Clustering
 Nonhierarchical
 Creates clusters in one step as opposed
to several steps.
 Since only one set of clusters is output,
the user normally has to input the
desired number of clusters, k.
 Usually deals with static sets.

Partitional Algorithms
 MST
 Squared Error
 K-Means
 Nearest Neighbor
 PAM
 BEA
 GA

MST Algorithm

Squared Error
 Minimizes the squared error: given clusters K1, …, Kk with centers C1, …, Ck, the squared error is se = Σj Σt∈Kj dist(Cj, t)².

Squared Error Algorithm

K-Means
 Initial set of clusters randomly chosen.
 Iteratively, items are moved among sets
of clusters until the desired set is
reached.
 High degree of similarity among
elements in a cluster is obtained.
 Given a cluster Ki={ti1,ti2,…,tim}, the
cluster mean is mi = (1/m)(ti1 + … + tim)

K-Means Example
 Given: {2, 4, 10, 12, 3, 20, 30, 11, 25}, k=2
 Randomly assign means: m1=3, m2=4
 K1={2,3}, K2={4,10,12,20,30,11,25}, m1=2.5, m2=16
 K1={2,3,4}, K2={10,12,20,30,11,25}, m1=3, m2=18
 K1={2,3,4,10}, K2={12,20,30,11,25}, m1=4.75, m2=19.6
 K1={2,3,4,10,11,12}, K2={20,30,25}, m1=7, m2=25
 Stop, as the clusters with these means are the same.
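
This trace can be reproduced in a few lines of Python. A minimal sketch for one-dimensional data (our own function names; it assumes no cluster ever becomes empty):

```python
# Minimal 1-D k-means; reproduces the worked example above.
def kmeans_1d(points, means):
    while True:
        # Assign each point to the cluster with the nearest mean.
        clusters = [[] for _ in means]
        for p in points:
            i = min(range(len(means)), key=lambda j: abs(p - means[j]))
            clusters[i].append(p)
        # Recompute the means; stop when they no longer change.
        new_means = [sum(c) / len(c) for c in clusters]
        if new_means == means:
            return clusters, means
        means = new_means

clusters, means = kmeans_1d([2, 4, 10, 12, 3, 20, 30, 11, 25], [3, 4])
print(clusters)  # [[2, 4, 10, 12, 3, 11], [20, 30, 25]]
print(means)     # [7.0, 25.0]
```
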
K-Means Algorithm

Nearest Neighbor
 Items are iteratively merged into the
existing clusters that are closest.
 Incremental
 Threshold, t, used to determine if items
are added to existing clusters or a new
cluster is created.
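
A minimal sketch of this scheme (illustrative names; the caller supplies the threshold t and a distance function d):

```python
# Incremental nearest-neighbour clustering with distance threshold t.
def nearest_neighbor_clustering(items, t, d):
    clusters = [[items[0]]]          # the first item starts the first cluster
    for item in items[1:]:
        # Distance to a cluster = distance to its nearest member.
        nearest = min(clusters, key=lambda c: min(d(item, m) for m in c))
        if min(d(item, m) for m in nearest) <= t:
            nearest.append(item)     # close enough: join the nearest cluster
        else:
            clusters.append([item])  # too far: start a new cluster
    return clusters

# 1-D example with absolute difference as the distance:
print(nearest_neighbor_clustering([2, 4, 10, 12, 3], t=2,
                                  d=lambda a, b: abs(a - b)))
# [[2, 4, 3], [10, 12]]
```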

Nearest Neighbor Algorithm

PAM
 Partitioning Around Medoids (PAM)
(K-Medoids)
 Handles outliers well.
 Ordering of input does not impact results.
 Does not scale well.
 Each cluster represented by one item,
called the medoid.
 Initial set of k medoids randomly chosen.

PAM Cost Calculation
 At each step of the algorithm, medoids are changed if the overall cost is improved.
 Cjih is the cost change for item tj associated with swapping medoid ti with non-medoid th; the swap is made only if the total change TCih = Σj Cjih is negative.
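
A brute-force sketch of this swap step (our own names, 1-D data for brevity; real PAM implementations compute the Cjih terms incrementally instead of re-costing every candidate from scratch):

```python
# PAM / k-medoids: keep a swap (medoid ti -> non-medoid th) only when it
# lowers the total cost, i.e. when TCih (the summed Cjih) is negative.
def total_cost(points, medoids):
    return sum(min(abs(p - m) for m in medoids) for p in points)

def pam(points, k):
    medoids = list(points[:k])  # initial medoids (the slides pick them randomly)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for h in points:
                if h in medoids:
                    continue
                candidate = medoids[:i] + [h] + medoids[i + 1:]
                if total_cost(points, candidate) < total_cost(points, medoids):
                    medoids, improved = candidate, True
    return medoids

print(pam([2, 4, 10, 12, 3, 20, 30, 11, 25], k=2))
# prints a locally optimal pair of medoids for this data
```
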
PAM Algorithm

BEA
 Bond Energy Algorithm
 Used in database design (physical and logical), e.g. vertical fragmentation.
 Determines the affinity (bond) between attributes based on common usage.
 Algorithm outline (a sketch of step 2 follows below):
1. Create the attribute affinity matrix.
2. Convert it to the BOND matrix by reordering attributes to maximize bond energy.
3. Create regions of close bonding.
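
As an illustration of step 2, the sketch below brute-forces the attribute ordering of a small affinity matrix to maximize bond energy. The neighbour-product measure used is the standard one from the vertical-fragmentation literature, but the matrix values here are invented for illustration.

```python
# Brute-force BOND step: try every column ordering of a small affinity
# matrix and keep the one with maximum bond energy. (Rows are left in
# place; since the matrix is symmetric, the same ordering applies to rows.)
from itertools import permutations

def bond_energy(aff, order):
    n = len(order)
    total = 0
    for i in range(n):
        for pos, j in enumerate(order):
            left = aff[i][order[pos - 1]] if pos > 0 else 0
            right = aff[i][order[pos + 1]] if pos < n - 1 else 0
            total += aff[i][j] * (left + right)
    return total

# Invented 4-attribute affinity matrix: attributes 0,2 and 1,3 are
# frequently used together.
aff = [[45,  0, 45,  0],
       [ 0, 80,  5, 75],
       [45,  5, 53,  3],
       [ 0, 75,  3, 78]]

best = max(permutations(range(len(aff))), key=lambda o: bond_energy(aff, o))
print(best)  # an ordering that places strongly bonded attributes together
```
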
BEA

(figure: BOND matrix example, modified from [OV99])

Genetic Algorithm Example

 {A,B,C,D,E,F,G,H}
 Randomly choose initial solution:
{A,C,E} {B,F} {D,G,H} or
10101000, 01000100, 00010011
 Suppose crossover at point four and
choose 1st and 3rd individuals:
10100011, 01000100, 00011000
 What should termination criteria be?
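
The crossover step above can be checked with a tiny sketch (bitstrings as Python strings; the function name is ours):

```python
# Single-point crossover: swap the tails of two bitstrings at `point`.
def crossover(p1, p2, point):
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

print(crossover("10101000", "00010011", 4))
# ('10100011', '00011000') -- the offspring shown above
```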

GA Algorithm

