
Data Mining

Unit 3: Clustering and Association


Syllabus:
• Clustering: Introduction – Similarity and Distance Measures – Outliers – Hierarchical Algorithms – Partitional Algorithms.
• Association rules: Introduction – large itemsets – basic algorithms – parallel & distributed algorithms – comparing approaches – incremental rules – advanced association rule techniques – measuring the quality of rules.
Clustering Examples
• Segment a customer database based on similar buying patterns.
• Group houses in a town into neighborhoods based on similar features.
• Identify new plant species.
• Identify similar Web usage patterns.
Clustering Example

[Figure: clustering example]
Clustering Houses

[Figure: the same houses clustered two ways – geographic distance based vs. size based]
Clustering vs. Classification
• No prior knowledge
  – Number of clusters
  – Meaning of clusters
• Unsupervised learning
Clustering Issues
• Outlier handling
• Dynamic data
• Interpreting results
• Evaluating results
• Number of clusters
• Data to be used
• Scalability
Impact of Outliers on Clustering

[Figure: impact of outliers on clustering]
Clustering Problem
• Given a database D = {t1, t2, …, tn} of tuples and an integer value k, the clustering problem is to define a mapping f: D → {1, …, k} where each ti is assigned to one cluster Kj, 1 ≤ j ≤ k.
• A cluster Kj contains precisely those tuples mapped to it.
• Unlike the classification problem, clusters are not known a priori.
Types of Clustering
• Hierarchical – nested set of clusters created.
• Partitional – one set of clusters created.
• Incremental – each element handled one at a time.
• Simultaneous – all elements handled together.
• Overlapping / non-overlapping
Clustering Approaches

• Hierarchical
  – Agglomerative
  – Divisive
• Partitional
• Categorical
• Large DB
  – Sampling
  – Compression
Cluster Parameters

[Figure: definitions of cluster parameters (e.g. centroid, radius, diameter)]
Distance Between Clusters
• Single link: smallest distance between points
• Complete link: largest distance between points
• Average link: average distance between points
• Centroid: distance between centroids
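To make these measures concrete, here is a minimal NumPy sketch; the function name, array layout, and method keywords are my own, not from the slides:

```python
import numpy as np

def cluster_distance(A, B, method="single"):
    """Distance between clusters A and B, given as (n, d) point arrays."""
    # All pairwise Euclidean distances between points of A and points of B.
    pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    if method == "single":       # smallest distance between points
        return pairwise.min()
    if method == "complete":     # largest distance between points
        return pairwise.max()
    if method == "average":      # average distance between points
        return pairwise.mean()
    if method == "centroid":     # distance between the two centroids
        return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
    raise ValueError(f"unknown method: {method}")
```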
Hierarchical Clustering
• Clusters are created in levels, producing a set of clusters at each level.
• Agglomerative
  – Initially each item is in its own cluster
  – Clusters are iteratively merged together
  – Bottom up
• Divisive
  – Initially all items are in one cluster
  – Large clusters are successively divided
  – Top down
Hierarchical Algorithms
• Single Link
• MST Single Link
• Complete Link
• Average Link
Dendrogram
• Dendrogram: a tree data structure which illustrates hierarchical clustering techniques.
• Each level shows the clusters for that level.
  – Leaf – individual clusters
  – Root – one cluster
• A cluster at level i is the union of its children clusters at level i+1.
Levels of Clustering

[Figure: levels of clustering]
Agglomerative Example

Distance matrix:

      A  B  C  D  E
   A  0  1  2  2  3
   B  1  0  2  4  3
   C  2  2  0  1  5
   D  2  4  1  0  3
   E  3  3  5  3  0

[Figure: the item graph and the dendrogram built over thresholds 1, 2, 3, 4, 5 with leaves A, B, C, D, E]
MST Example

Distance matrix:

      A  B  C  D  E
   A  0  1  2  2  3
   B  1  0  2  4  3
   C  2  2  0  1  5
   D  2  4  1  0  3
   E  3  3  5  3  0

[Figure: a minimum spanning tree of the item graph]
Agglomerative Algorithm

[Algorithm pseudocode figure]
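The slides give the algorithm only as a figure; as a stand-in, here is a hedged Python sketch of threshold-based agglomerative clustering with single link (the names and structure are my own):

```python
def agglomerative(D, threshold):
    """Merge clusters while any two are within `threshold` of each other
    under single link (smallest distance between their members)."""
    clusters = [{i} for i in range(len(D))]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-link distance between clusters i and j
                if min(D[a][b] for a in clusters[i] for b in clusters[j]) <= threshold:
                    clusters[i] |= clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters

# Distance matrix from the example (indices 0..4 stand for A..E).
D = [[0, 1, 2, 2, 3],
     [1, 0, 2, 4, 3],
     [2, 2, 0, 1, 5],
     [2, 4, 1, 0, 3],
     [3, 3, 5, 3, 0]]
for t in range(1, 4):
    print(t, agglomerative(D, t))
# t=1: {0,1} {2,3} {4} (i.e. A,B | C,D | E); t=2: {0,1,2,3} {4}; t=3: one cluster
```

This matches the dendrogram in the example: A–B and C–D merge at threshold 1, the two pairs merge at 2, and E joins at 3.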
Single Link
• View all items as a graph with links (distances) between them.
• Finds maximal connected components in this graph.
• Two clusters are merged if there is at least one edge which connects them.
• Uses threshold distances at each level.
• Could be agglomerative or divisive.
MST Single Link Algorithm

[Algorithm pseudocode figure]
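One reading of the MST variant: compute the minimum spanning tree once, then cut its edges by threshold. A small Prim's-algorithm sketch (my own code, not the slide's pseudocode):

```python
def mst_edges(D):
    """Prim's algorithm: return the MST edges (i, j, weight) of the
    complete graph whose weights are given by distance matrix D."""
    n = len(D)
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        # cheapest edge crossing from the tree to a new vertex
        i, j = min(((a, b) for a in in_tree for b in range(n) if b not in in_tree),
                   key=lambda e: D[e[0]][e[1]])
        edges.append((i, j, D[i][j]))
        in_tree.add(j)
    return edges
```

Cutting every MST edge with weight above a threshold t leaves exactly the single-link clusters at level t, so the tree only has to be built once for all levels.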
Single Link Clustering

[Figure: single link clustering example]
Partitional Clustering
• Nonhierarchical
• Creates the clusters in one step as opposed to several steps.
• Since only one set of clusters is output, the user normally has to input the desired number of clusters, k.
• Usually deals with static sets.
Partitional Algorithms
• MST
• Squared Error
• K-Means
• Nearest Neighbor
• PAM
• BEA
• GA
MST Algorithm

[Algorithm pseudocode figure]
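The slide's pseudocode is in the missing figure; a common partitional-MST scheme is to build the MST and delete its k−1 largest edges, taking the remaining connected components as the k clusters. A sketch of my own, assuming the mst_edges function from the Prim sketch above is in scope:

```python
def mst_partition(D, k):
    """Cut the k-1 largest MST edges; the connected components that
    remain form the k clusters."""
    keep = sorted(mst_edges(D), key=lambda e: e[2])[:len(D) - k]
    # Union-find over the surviving edges to recover the components.
    parent = list(range(len(D)))
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for i, j, _ in keep:
        parent[find(i)] = find(j)
    groups = {}
    for v in range(len(D)):
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())
```

On the A–E distance matrix with k = 2, this cuts the longest MST edge and yields {A, B, C, D} and {E}.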
Squared Error
• Squared-error clustering seeks the set of clusters that minimizes the total squared error: for clusters K1, …, Kk with means m1, …, mk, minimize Σj Σt∈Kj ||t − mj||².
Squared Error Algorithm

[Algorithm pseudocode figure]
K-Means
• An initial set of clusters is randomly chosen.
• Iteratively, items are moved among the sets of clusters until the desired set is reached.
• A high degree of similarity among elements in a cluster is obtained.
• Given a cluster Ki = {ti1, ti2, …, tim}, the cluster mean is mi = (1/m)(ti1 + … + tim).
K-Means Example
• Given: {2, 4, 10, 12, 3, 20, 30, 11, 25}, k = 2
• Randomly assign means: m1 = 3, m2 = 4
• K1 = {2,3}, K2 = {4,10,12,20,30,11,25}; m1 = 2.5, m2 = 16
• K1 = {2,3,4}, K2 = {10,12,20,30,11,25}; m1 = 3, m2 = 18
• K1 = {2,3,4,10}, K2 = {12,20,30,11,25}; m1 = 4.75, m2 = 19.6
• K1 = {2,3,4,10,11,12}, K2 = {20,30,25}; m1 = 7, m2 = 25
• Stop, as the clusters with these means are the same.
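A minimal sketch that reproduces this trace on scalar data (the function name and loop structure are my own):

```python
def kmeans_1d(items, means):
    """Plain k-means on numbers: assign each item to its nearest mean,
    recompute the means, and repeat until the clusters stop changing."""
    clusters = None
    while True:
        assigned = [[] for _ in means]
        for x in items:
            # index of the nearest current mean
            nearest = min(range(len(means)), key=lambda i: abs(x - means[i]))
            assigned[nearest].append(x)
        if assigned == clusters:
            return clusters, means
        clusters = assigned
        means = [sum(c) / len(c) for c in clusters]

print(kmeans_1d([2, 4, 10, 12, 3, 20, 30, 11, 25], [3, 4]))
# converges to clusters {2,3,4,10,11,12} and {20,25,30} with means 7 and 25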
K-Means Algorithm

[Algorithm pseudocode figure]
Nearest Neighbor
• Items are iteratively merged into the existing clusters that are closest.
• Incremental
• A threshold, t, is used to determine if items are added to existing clusters or a new cluster is created.
Nearest Neighbor Algorithm

[Algorithm pseudocode figure]
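A hedged sketch of the idea under an assumed interface (items arrive one at a time; dist is any distance function; names are my own):

```python
def nearest_neighbor(items, t, dist):
    """Incremental clustering: each new item joins the cluster of its
    closest already-clustered item, unless that distance exceeds the
    threshold t, in which case it starts a new cluster."""
    clusters = [[items[0]]]
    for x in items[1:]:
        # find the closest previously clustered item
        cluster, nearest = min(((c, y) for c in clusters for y in c),
                               key=lambda pair: dist(x, pair[1]))
        if dist(x, nearest) <= t:
            cluster.append(x)
        else:
            clusters.append([x])
    return clusters

print(nearest_neighbor([2, 4, 10, 12, 3, 20, 30, 11, 25], 3,
                       lambda a, b: abs(a - b)))
# [[2, 4, 3], [10, 12, 11], [20], [30], [25]]
```

Note how the result depends on the input order, unlike PAM below.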
PAM
• Partitioning Around Medoids (PAM), also called K-Medoids.
• Handles outliers well.
• Ordering of input does not impact results.
• Does not scale well.
• Each cluster is represented by one item, called the medoid.
• The initial set of k medoids is randomly chosen.
34
PAM

[Figure: PAM example]
PAM Cost Calculation
• At each step in the algorithm, medoids are changed if the overall cost is improved.
• Cjih – the change in cost for an item tj associated with swapping medoid ti with non-medoid th; the total cost of the swap is TCih = Σj Cjih.
PAM Algorithm

[Algorithm pseudocode figure]
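In place of the pseudocode figure, a compact k-medoids sketch (a greedy-swap version; all names are my own, and the total cost is recomputed from scratch rather than via the Cjih bookkeeping above):

```python
import random

def pam(items, k, dist, seed=0):
    """PAM sketch: keep swapping a medoid with a non-medoid whenever the
    swap lowers the total distance of items to their nearest medoid."""
    random.seed(seed)
    medoids = random.sample(items, k)

    def total_cost(meds):
        return sum(min(dist(x, m) for m in meds) for x in items)

    improved = True
    while improved:
        improved = False
        for i in range(k):
            for h in items:
                if h in medoids:
                    continue
                trial = medoids[:i] + [h] + medoids[i + 1:]
                if total_cost(trial) < total_cost(medoids):
                    medoids, improved = trial, True
    return medoids

print(pam([2, 4, 10, 12, 3, 20, 30, 11, 25], 2, lambda a, b: abs(a - b)))
```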
BEA
• Bond Energy Algorithm
• Database design (physical and logical)
• Vertical fragmentation
• Determine affinity (bond) between attributes based on common usage.
• Algorithm outline:
  1. Create affinity matrix
  2. Convert to BOND matrix
  3. Create regions of close bonding
BEA

[Figure: BEA example, modified from [OV99]]
Genetic Algorithm Example
• Items: {A, B, C, D, E, F, G, H}
• Randomly choose an initial solution:
  {A,C,E} {B,F} {D,G,H}, encoded as
  10101000, 01000100, 00010011
• Suppose we cross over at point four and choose the 1st and 3rd individuals:
  10100011, 01000100, 00011000
• What should the termination criteria be?
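A one-function sketch of the single-point crossover used above, treating the cluster-membership bitmaps as Python strings:

```python
def crossover(a, b, point):
    """Single-point crossover: swap the tails of two bit strings."""
    return a[:point] + b[point:], b[:point] + a[point:]

# Crossing the 1st and 3rd individuals at point four, as in the example:
print(crossover("10101000", "00010011", 4))
# ('10100011', '00011000')
```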
GA Algorithm

[Algorithm pseudocode figure]
