Data Mining-Unit 3-Part1
Clustering Example
Clustering Houses
– Size Based
– Geographic Distance Based
Clustering vs. Classification
No prior knowledge
– Number of clusters
– Meaning of clusters
Unsupervised learning
Clustering Issues
Outlier handling
Dynamic data
Interpreting results
Evaluating results
Number of clusters
Data to be used
Scalability
Impact of Outliers on Clustering
Clustering Problem
Given a database D = {t1, t2, …, tn} of
tuples and an integer value k, the
Clustering Problem is to define a
mapping f : D → {1, …, k} where each ti is
assigned to one cluster Kj, 1 ≤ j ≤ k.
A cluster Kj contains precisely those
tuples mapped to it.
Unlike the classification problem, clusters
are not known a priori.
Types of Clustering
Hierarchical – Nested set of clusters
created.
Partitional – One set of clusters
created.
Incremental – Each element handled
one at a time.
Simultaneous – All elements handled
together.
Overlapping/Non-overlapping
Clustering Approaches
(Taxonomy diagram omitted.)
Cluster Parameters
Distance Between Clusters
Single Link: smallest distance between
points
Complete Link: largest distance between
points
Average Link: average distance between
points
Centroid: distance between centroids
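The four measures above can be sketched in Python. This is a minimal 1-D illustration; the function names and the sample clusters are our own, not from the slides.

```python
# Four inter-cluster distance measures, sketched for 1-D points.

def dist(a, b):
    """Distance between two 1-D points."""
    return abs(a - b)

def single_link(K1, K2):
    # Smallest distance between any pair of points, one from each cluster.
    return min(dist(a, b) for a in K1 for b in K2)

def complete_link(K1, K2):
    # Largest distance between any pair of points.
    return max(dist(a, b) for a in K1 for b in K2)

def average_link(K1, K2):
    # Average distance over all cross-cluster pairs.
    return sum(dist(a, b) for a in K1 for b in K2) / (len(K1) * len(K2))

def centroid_dist(K1, K2):
    # Distance between the cluster centroids (means).
    return dist(sum(K1) / len(K1), sum(K2) / len(K2))

K1, K2 = [1, 2], [4, 8]
print(single_link(K1, K2))    # 2
print(complete_link(K1, K2))  # 7
print(average_link(K1, K2))   # 4.5
print(centroid_dist(K1, K2))  # 4.5
```

Note that single link and complete link need only the pairwise distances, while the centroid measure assumes a mean can be computed.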
Hierarchical Clustering
Clusters are created in levels, producing a
set of clusters at each level.
Agglomerative
– Initially each item in its own cluster
– Iteratively clusters are merged together
– Bottom Up
Divisive
– Initially all items in one cluster
– Large clusters are successively divided
– Top Down
Hierarchical Algorithms
Single Link
MST Single Link
Complete Link
Average Link
Dendrogram
Dendrogram: a tree data
structure which illustrates
hierarchical clustering
techniques.
Each level shows clusters
for that level.
– Leaf – individual clusters
– Root – one cluster
A cluster at level i is the
union of its children clusters
at level i+1.
Levels of Clustering
Agglomerative Example
   A  B  C  D  E
A  0  1  2  2  3
B  1  0  2  4  3
C  2  2  0  1  5
D  2  4  1  0  3
E  3  3  5  3  0
(Graph and dendrogram omitted: leaves A, B, C, D, E; threshold axis 1–5.)
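Running a small single-link agglomerative pass over this distance matrix (a sketch, assuming Python; the variable names are our own) recovers the levels at which merges happen:

```python
# Single-link agglomerative clustering over the example distance matrix.

D = {  # symmetric pairwise distances from the example
    ('A', 'B'): 1, ('A', 'C'): 2, ('A', 'D'): 2, ('A', 'E'): 3,
    ('B', 'C'): 2, ('B', 'D'): 4, ('B', 'E'): 3,
    ('C', 'D'): 1, ('C', 'E'): 5,
    ('D', 'E'): 3,
}

def d(x, y):
    return 0 if x == y else D.get((x, y), D.get((y, x)))

def single_link(c1, c2):
    # Single link: smallest distance between points of the two clusters.
    return min(d(a, b) for a in c1 for b in c2)

clusters = [{'A'}, {'B'}, {'C'}, {'D'}, {'E'}]
merge_heights = []
while len(clusters) > 1:
    # Find the closest pair of clusters under the single-link distance.
    i, j = min(((i, j) for i in range(len(clusters))
                       for j in range(i + 1, len(clusters))),
               key=lambda p: single_link(clusters[p[0]], clusters[p[1]]))
    merge_heights.append(single_link(clusters[i], clusters[j]))
    clusters[i] |= clusters[j]
    del clusters[j]

print(merge_heights)  # [1, 1, 2, 3]
```

A–B and C–D merge at distance 1, the two pairs merge at 2, and E joins last at 3, matching the dendrogram levels.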
MST Example
   A  B  C  D  E
A  0  1  2  2  3
B  1  0  2  4  3
C  2  2  0  1  5
D  2  4  1  0  3
E  3  3  5  3  0
(Graph omitted: vertices A–E with the distances above as edge weights.)
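A minimum spanning tree of this graph can be built with Prim's algorithm, sketched below (assuming Python; the names are our own). MST single link then forms clusters by cutting the longest MST edges.

```python
# Prim's algorithm over the example distance matrix.

nodes = ['A', 'B', 'C', 'D', 'E']
row = {
    'A': {'A': 0, 'B': 1, 'C': 2, 'D': 2, 'E': 3},
    'B': {'A': 1, 'B': 0, 'C': 2, 'D': 4, 'E': 3},
    'C': {'A': 2, 'B': 2, 'C': 0, 'D': 1, 'E': 5},
    'D': {'A': 2, 'B': 4, 'C': 1, 'D': 0, 'E': 3},
    'E': {'A': 3, 'B': 3, 'C': 5, 'D': 3, 'E': 0},
}

def prim_mst(nodes, row):
    in_tree = {nodes[0]}
    edges = []
    while len(in_tree) < len(nodes):
        # Cheapest edge crossing the cut between tree and non-tree nodes.
        u, v = min(((u, v) for u in in_tree
                           for v in nodes if v not in in_tree),
                   key=lambda e: row[e[0]][e[1]])
        edges.append((u, v, row[u][v]))
        in_tree.add(v)
    return edges

mst = prim_mst(nodes, row)
print(sum(w for _, _, w in mst))  # total MST weight: 7
```

For this matrix the MST uses four edges of weights 1, 1, 2, and 3 (total 7); ties at weight 2 can be broken either way without changing the total.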
Agglomerative Algorithm
Single Link
View all items with links (distances)
between them.
Finds maximal connected components
in this graph.
Two clusters are merged if there is at
least one edge which connects them.
Uses threshold distances at each level.
Could be agglomerative or divisive.
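The graph view above can be sketched directly: at each threshold, keep only the edges no longer than the threshold and report the connected components. This is an illustrative sketch over the earlier example data; the edge list and function names are our own.

```python
# Single link as connected components of a thresholded distance graph.

edges = [('A', 'B', 1), ('C', 'D', 1), ('A', 'C', 2), ('A', 'D', 2),
         ('B', 'C', 2), ('A', 'E', 3), ('B', 'E', 3), ('D', 'E', 3),
         ('B', 'D', 4), ('C', 'E', 5)]

def components_at(threshold):
    parent = {n: n for n in 'ABCDE'}       # union-find forest
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for u, v, w in edges:
        if w <= threshold:
            parent[find(u)] = find(v)      # union the two components
    groups = {}
    for n in parent:
        groups.setdefault(find(n), set()).add(n)
    return sorted(map(sorted, groups.values()))

print(components_at(1))  # [['A', 'B'], ['C', 'D'], ['E']]
print(components_at(2))  # [['A', 'B', 'C', 'D'], ['E']]
print(components_at(3))  # [['A', 'B', 'C', 'D', 'E']]
```

Raising the threshold level by level yields the nested clusterings of the dendrogram.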
MST Single Link Algorithm
Single Link Clustering
Partitional Clustering
Nonhierarchical
Creates clusters in one step as opposed
to several steps.
Since only one set of clusters is output,
the user normally has to input the
desired number of clusters, k.
Usually deals with static sets.
Partitional Algorithms
MST
Squared Error
K-Means
Nearest Neighbor
PAM
BEA
GA
MST Algorithm
Squared Error
Minimizes the squared error: the sum,
over all clusters, of the squared
distances between each item and its
cluster mean.
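A minimal 1-D sketch of this criterion (assuming Python; the sample data is our own):

```python
# Squared error of a clustering: for each cluster, sum the squared
# distances of its items to the cluster mean.

def squared_error(clusters):
    total = 0.0
    for K in clusters:
        m = sum(K) / len(K)                    # cluster mean
        total += sum((t - m) ** 2 for t in K)
    return total

# The tighter clustering of the same items has the smaller squared error.
print(squared_error([[2, 3, 4], [10, 12, 11]]))  # 4.0
print(squared_error([[2, 3, 10], [4, 12, 11]]))  # 76.0
```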
Squared Error Algorithm
K-Means
Initial set of clusters randomly chosen.
Iteratively, items are moved among sets
of clusters until the desired set is
reached.
High degree of similarity among
elements in a cluster is obtained.
Given a cluster Ki={ti1,ti2,…,tim}, the
cluster mean is mi = (1/m)(ti1 + … + tim)
K-Means Example
Given: {2,4,10,12,3,20,30,11,25}, k=2
Randomly assign means: m1=3,m2=4
K1={2,3}, K2={4,10,12,20,30,11,25},
m1=2.5,m2=16
K1={2,3,4},K2={10,12,20,30,11,25},
m1=3,m2=18
K1={2,3,4,10},K2={12,20,30,11,25},
m1=4.75,m2=19.6
K1={2,3,4,10,11,12},K2={20,30,25},
m1=7,m2=25
Stop as the clusters with these means
are the same.
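The iterations above can be replayed with a short sketch (assuming Python; `kmeans_1d` is our own name, not from the slides):

```python
# Replaying the slide's 1-D K-Means example with k=2 and initial means 3, 4.

def kmeans_1d(items, means):
    while True:
        # Assign each item to the cluster with the nearest mean.
        clusters = [[] for _ in means]
        for t in items:
            i = min(range(len(means)), key=lambda i: abs(t - means[i]))
            clusters[i].append(t)
        new_means = [sum(K) / len(K) for K in clusters]
        if new_means == means:        # stop when the means no longer change
            return clusters, means
        means = new_means

items = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means = kmeans_1d(items, [3.0, 4.0])
print(sorted(clusters[0]), sorted(clusters[1]))
# [2, 3, 4, 10, 11, 12] [20, 25, 30]
print(means)  # [7.0, 25.0]
```

The run converges to K1 = {2,3,4,10,11,12} with m1 = 7 and K2 = {20,25,30} with m2 = 25, as in the example.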
K-Means Algorithm
Nearest Neighbor
Items are iteratively merged into the
existing clusters that are closest.
Incremental
Threshold, t, used to determine if items
are added to existing clusters or a new
cluster is created.
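A minimal incremental sketch of this scheme (assuming Python; the item order, the threshold t = 5, and the function name are our own illustration):

```python
# Incremental nearest neighbor clustering with threshold t.

def nearest_neighbor(items, t):
    clusters = [[items[0]]]                  # first item starts a cluster
    for x in items[1:]:
        # Distance from x to a cluster = distance to its nearest member.
        dists = [min(abs(x - y) for y in K) for K in clusters]
        i = min(range(len(clusters)), key=lambda j: dists[j])
        if dists[i] <= t:
            clusters[i].append(x)            # close enough: join it
        else:
            clusters.append([x])             # otherwise start a new cluster
    return clusters

print(nearest_neighbor([2, 4, 10, 12, 3, 20, 30, 11, 25], t=5))
# [[2, 4, 3], [10, 12, 11], [20, 25], [30]]
```

Because each item is placed once and never moved, the result depends on the input order, unlike K-Means.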
Nearest Neighbor Algorithm
PAM
Partitioning Around Medoids (PAM)
(K-Medoids)
Handles outliers well.
Ordering of input does not impact results.
Does not scale well.
Each cluster represented by one item,
called the medoid.
Initial set of k medoids randomly chosen.
PAM
PAM Cost Calculation
At each step in algorithm, medoids are
changed if the overall cost is improved.
Cjih – cost change for an item tj associated
with swapping medoid ti with non-medoid th.
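One pass of these swap tests can be sketched on 1-D toy data (assuming Python; the data, the initial medoids, and the names are our own). The total cost of a medoid set is the sum of each item's distance to its nearest medoid, and a swap is kept only when it lowers that cost.

```python
# One pass of PAM swap tests on toy 1-D data.

def total_cost(items, medoids):
    return sum(min(abs(t - m) for m in medoids) for t in items)

items = [2, 3, 4, 10, 11, 12, 20, 25, 30]
medoids = [3, 20]                       # initial medoids, chosen arbitrarily
best = total_cost(items, medoids)

for i in range(len(medoids)):           # each medoid t_i ...
    for t_h in items:                   # ... against each non-medoid t_h
        if t_h in medoids:
            continue
        trial = medoids[:i] + [t_h] + medoids[i + 1:]
        cost = total_cost(items, trial)
        if cost < best:                 # keep the swap only if cost improves
            best, medoids = cost, trial

print(medoids, best)  # [4, 25] 34
```

A full PAM run would repeat such passes until no swap improves the cost; because medoids must be actual items, the method tolerates outliers better than means do.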
PAM Algorithm
BEA
Bond Energy Algorithm
Database design (physical and logical)
Vertical fragmentation
Determine affinity (bond) between attributes
based on common usage.
Algorithm outline:
1. Create affinity matrix
2. Convert to BOND matrix
3. Create regions of close bonding
BEA
Genetic Algorithm Example
{A,B,C,D,E,F,G,H}
Randomly choose initial solution:
{A,C,E} {B,F} {D,G,H} or
10101000, 01000100, 00010011
Suppose crossover at point four and
choose 1st and 3rd individuals:
10100011, 01000100, 00011000
What should termination criteria be?
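The encoding and crossover step above can be sketched as follows (assuming Python; the helper names are our own):

```python
# Bit-string cluster encoding and one-point crossover from the example.

items = 'ABCDEFGH'

def encode(cluster):
    # One bit per item: 1 if the item belongs to the cluster.
    return ''.join('1' if c in cluster else '0' for c in items)

def crossover(p1, p2, point):
    # Swap the tails of the two parents after the crossover point.
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

p1 = encode({'A', 'C', 'E'})      # 10101000
p3 = encode({'D', 'G', 'H'})      # 00010011
c1, c3 = crossover(p1, p3, 4)
print(c1, c3)  # 10100011 00011000
```

A common termination choice is a fixed number of generations or no fitness improvement for several generations; note also that crossover can leave some items in no cluster (or in two), so a repair step or fitness penalty is usually needed.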
GA Algorithm