Lecture 13
Jianyong Wang
Database Lab, Institute of Software
Department of Computer Science and Technology
Tsinghua University
[email protected]
Clustering Feature: $CF = (N, LS, SS)$, where $N$ is the number of data points, $LS = \sum_{i=1}^{N} X_i$ is their linear sum, and $SS = \sum_{i=1}^{N} X_i^2$ is their square sum.
[Figure: a sample sub-cluster containing the points (2,6), (4,5), (4,7), (3,8) plotted on 0-10 axes]
Clustering feature:
- Summary of the statistics for a given subcluster: the 0th, 1st, and 2nd moments of the subcluster from the statistical point of view
- Registers the crucial measurements for computing clusters and utilizes storage efficiently
A CF tree is a height-balanced tree that stores the clustering
features for a hierarchical clustering
- A nonleaf node in a tree has descendants or “children”
- The nonleaf nodes store sums of the CFs of their children
A CF tree has two parameters
- Branching factor: specifies the maximum number of children
- Threshold: maximum diameter of the sub-clusters stored at the leaf nodes
[Figure: CF tree with branching factor B = 7 and leaf capacity L = 6; the root holds entries CF1, CF2, CF3, …, CF6 with pointers child1 … child6, and each non-leaf node holds entries CF1, CF2, CF3, …, CF5 with pointers child1 … child5]
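To make the CF vector concrete, here is a minimal Python sketch (not from the lecture; function names are illustrative) that computes CF = (N, LS, SS) for the sample sub-cluster above and merges two CFs, using the additivity that lets nonleaf nodes store the sums of their children's CFs:

```python
import numpy as np

def compute_cf(points):
    """Compute the clustering feature CF = (N, LS, SS) of a sub-cluster.

    N  = number of points (0th moment)
    LS = linear sum of the points (1st moment)
    SS = square sum of the points (2nd moment)
    """
    pts = np.asarray(points, dtype=float)
    return len(pts), pts.sum(axis=0), float((pts ** 2).sum())

def merge_cf(cf1, cf2):
    """CF additivity: the CF of a merged sub-cluster is the
    component-wise sum, which is how nonleaf nodes summarize children."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return n1 + n2, ls1 + ls2, ss1 + ss2

# Example sub-cluster from the figure above.
cf = compute_cf([(2, 6), (4, 5), (4, 7), (3, 8)])
print(cf)  # N = 4, LS = [13, 26], SS = 219
```

Additivity is what makes CF trees incremental: inserting a point only requires adding its CF to the entries along one root-to-leaf path.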
CHAMELEON overall framework: Data Set → Construct a Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters
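As a hedged illustration of the first phase only (the partition and merge phases typically rely on an external graph partitioner such as hMETIS and are omitted), this sketch builds the k-nearest-neighbor sparse graph; all names are my own:

```python
import numpy as np

def knn_sparse_graph(points, k=3):
    """Build a sparse k-NN graph: each point is connected to its k
    nearest neighbors, with edge weight = similarity (inverse distance)."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    # Pairwise Euclidean distances.
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    graph = {i: [] for i in range(n)}
    for i in range(n):
        # Exclude the point itself, then take the k closest neighbors.
        for j in np.argsort(dists[i])[1:k + 1]:
            graph[i].append((int(j), 1.0 / (1.0 + dists[i, j])))
    return graph
```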
Two parameters:
- Eps: Maximum radius of the neighbourhood
- MinPts: Minimum number of points in an Eps-neighbourhood of that point
$N_{Eps}(p) = \{q \in D \mid dist(p, q) \le Eps\}$
Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
- p belongs to $N_{Eps}(q)$
- Core point condition: $|N_{Eps}(q)| \ge MinPts$
[Figure: p directly density-reachable from q, with MinPts = 5 and Eps = 1 cm]
Density-reachable:
- A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points $p_1, \ldots, p_n$ with $p_1 = q$ and $p_n = p$ such that $p_{i+1}$ is directly density-reachable from $p_i$
[Figure: a chain of points connecting q to p]
Density-connected:
- A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
[Figure: p and q both density-reachable from o]
[Figure: core, border, and outlier points under Eps = 1 cm, MinPts = 5]
DBSCAN: Arbitrarily select a point p and retrieve all points density-reachable from p w.r.t. Eps and MinPts. If p is a core point, a cluster is formed; if p is a border point, no points are density-reachable from p, and DBSCAN visits the next point. Continue the process until all of the points have been processed, as in the sketch below.
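Here is a minimal sketch of that procedure (illustrative, not the lecture's code): it assigns each point a cluster id, using -1 for outliers, and expands clusters from core points through their Eps-neighborhoods:

```python
import numpy as np

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: expand clusters from core points via
    the Eps-neighborhood N_Eps(p) = {q in D | dist(p, q) <= Eps}."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    labels = [None] * n          # None = unvisited, -1 = outlier
    cluster_id = 0

    def region_query(i):
        d = np.linalg.norm(pts - pts[i], axis=1)
        return list(np.where(d <= eps)[0])

    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = region_query(i)
        if len(neighbors) < min_pts:      # not a core point (for now)
            labels[i] = -1
            continue
        labels[i] = cluster_id            # start a new cluster at core point i
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:           # border point previously marked outlier
                labels[j] = cluster_id
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:   # j is a core point: keep expanding
                seeds.extend(j_neighbors)
        cluster_id += 1
    return labels
```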
Core-distance of an object p
- The smallest ε′ value that makes p a core object (undefined if p is not a core object within ε)
Reachability-distance of p w.r.t. o
- max(core-distance(o), d(o, p))
Example (MinPts = 5, ε = 3 cm): with core-distance(o) = 3 cm, d(p1, o) = 2.8 cm, and d(p2, o) = 4 cm, we get r(p1, o) = max(3, 2.8) = 3 cm and r(p2, o) = max(3, 4) = 4 cm.
OPTICS creates an ordering of the objects in a database,
additionally storing the core-distance and a suitable
reachability-distance for each object.
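A small sketch of the two distances defined above, assuming Euclidean distance (helper names are illustrative; None stands for "undefined"):

```python
import numpy as np

def core_distance(pts, i, eps, min_pts):
    """Smallest radius that makes point i a core object: the distance to
    its MinPts-th nearest neighbor (counting the point itself, which sits
    at sorted index 0), or None ('undefined') if that exceeds eps."""
    d = np.sort(np.linalg.norm(pts - pts[i], axis=1))
    cd = d[min_pts - 1]
    return cd if cd <= eps else None

def reachability_distance(pts, p, o, eps, min_pts):
    """r(p, o) = max(core-distance(o), d(o, p)); undefined if o is not core."""
    cd = core_distance(pts, o, eps, min_pts)
    if cd is None:
        return None
    return max(cd, float(np.linalg.norm(pts[p] - pts[o])))
```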
[Figure: OPTICS reachability plot — reachability-distance (with values ε, ε′, and "undefined") plotted against the cluster order of the objects]
DENCLUE Gaussian density function:
$$f^{D}_{Gaussian}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$$
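For illustration, the density function can be evaluated directly (σ and the data are arbitrary inputs here; the function name is my own):

```python
import numpy as np

def gaussian_density(x, data, sigma):
    """DENCLUE density at x: the sum of Gaussian influences of all data
    points, f(x) = sum_i exp(-d(x, x_i)^2 / (2 * sigma^2))."""
    d2 = np.sum((np.asarray(data, dtype=float) - np.asarray(x, dtype=float)) ** 2, axis=1)
    return float(np.sum(np.exp(-d2 / (2 * sigma ** 2))))
```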
Major features
- Solid mathematical foundation
- Good for data sets with large amounts of noise
- Allows a compact mathematical description of arbitrarily shaped clusters in high-
dimensional data sets
- Significantly faster than existing algorithms (e.g., DBSCAN)
- But needs a large number of parameters
Ideas
- Use multi-resolution grid data structures
- Use dense grid cells to form clusters
Several interesting methods
- STING (a STatistical INformation Grid approach) by Wang, Yang, and Muntz (VLDB'97)
- WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB'98): a multi-resolution clustering approach using the wavelet method
- CLIQUE by Agrawal et al. (SIGMOD'98): targets high-dimensional data (and is thus covered in the section on clustering high-dimensional data)
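To illustrate the dense-grid-cell idea in its simplest 2-D form (a generic sketch, not STING, WaveCluster, or CLIQUE; cell size and density threshold are arbitrary parameters):

```python
from collections import defaultdict

def grid_clusters(points, cell_size, density_threshold):
    """Bin points into grid cells, keep cells with enough points,
    then connect side-adjacent dense cells into clusters (2-D case)."""
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    dense = {c for c, pts in cells.items() if len(pts) >= density_threshold}

    clusters, seen = [], set()
    for cell in dense:
        if cell in seen:
            continue
        # Flood-fill over neighboring dense cells.
        stack, component = [cell], []
        while stack:
            c = stack.pop()
            if c in seen or c not in dense:
                continue
            seen.add(c)
            component.append(c)
            cx, cy = c
            stack += [(cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)]
        clusters.append(component)
    return clusters
```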
STING: A Statistical Information Grid Approach
Advantages:
- Query-independent, easy to parallelize, incremental update
- O(K), where K is the number of grid cells at the lowest level
Disadvantages:
- All the cluster boundaries are either horizontal or vertical, and no
diagonal boundary is detected
EM algorithm
- Select an initial set of model parameters
- Repeat
Estimation Step: For each object, calculate the probability that it belongs to each distribution, i.e., calculate prob(j | xi, θ).
Maximization Step: Given the probabilities from the estimation step, find the
new estimates of the parameters that maximize the expected likelihood.
- Until the parameters do not change
Gaussian density of distribution j:
$$p(x \mid j, \Theta) = \frac{1}{\sqrt{(2\pi)^{k} |\Sigma_j|}} \exp\!\left(-\frac{1}{2}(x - \mu_j)^{T} \Sigma_j^{-1} (x - \mu_j)\right)$$
Probability that point x was generated by distribution j:
$$P(j \mid x, \Theta) = \frac{p_j \, p(x \mid j, \Theta)}{\sum_{i=1}^{k} p_i \, p(x \mid i, \Theta)}$$
Probability of the j-th distribution:
$$p_j = \frac{1}{N} \sum_{i=1}^{N} P(j \mid x_i, \Theta)$$
Estimate for the mean of distribution j:
$$\mu_j = \frac{\sum_{i=1}^{N} x_i \, P(j \mid x_i, \Theta)}{\sum_{i=1}^{N} P(j \mid x_i, \Theta)}$$
Estimate for the covariance matrix of distribution j:
$$\Sigma_j = \frac{\sum_{i=1}^{N} (x_i - \mu_j)(x_i - \mu_j)^{T} \, P(j \mid x_i, \Theta)}{\sum_{i=1}^{N} P(j \mid x_i, \Theta)}$$
Gaussian Distribution based EM
EM algorithm
- Specify a likelihood threshold ε, the number of clusters k, and a maximum iteration number N.
- Select an initial set of model parameters, prob(j | x, θ), i.e., the probability that each point belongs to each distribution.
- Repeat
  Estimation Step: For each object, calculate its Gaussian probability density, prob(x | j, θ), and the probability that it belongs to each distribution, i.e., prob(j | x, θ).
  Maximization Step: Given the probabilities from the estimation step, find the new estimates of the parameters that maximize the expected likelihood: $p_j$, $\mu_j$, $\Sigma_j$.
- Until the difference between the likelihoods of the latest two iterations is smaller than ε, or the number of iterations exceeds N
Likelihood equation:
$$likelihood(\Theta) = \prod_{i=1}^{N} \sum_{j=1}^{k} p_j \, p(x_i \mid j)$$
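A compact sketch of this EM loop for a Gaussian mixture, implementing the update formulas above (a minimal illustration, not a production implementation: initialization and numerical safeguards are simplified, scipy's multivariate_normal supplies the Gaussian density, and the log-likelihood replaces the raw product for numerical stability):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, eps=1e-6, max_iter=100, seed=0):
    """EM for a k-component Gaussian mixture, following the slide's updates."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialization: uniform weights, random means, identity covariances.
    p = np.full(k, 1.0 / k)
    mu = X[rng.choice(n, k, replace=False)]
    cov = np.array([np.eye(d) for _ in range(k)])
    prev_ll = -np.inf

    for _ in range(max_iter):
        # E-step: responsibilities P(j | x_i, theta).
        dens = np.column_stack([
            p[j] * multivariate_normal.pdf(X, mu[j], cov[j]) for j in range(k)
        ])                                    # shape (n, k)
        resp = dens / dens.sum(axis=1, keepdims=True)

        # M-step: p_j, mu_j, Sigma_j from the responsibility-weighted sums.
        nj = resp.sum(axis=0)                 # effective counts per component
        p = nj / n
        mu = (resp.T @ X) / nj[:, None]
        for j in range(k):
            diff = X - mu[j]
            cov[j] = (resp[:, j, None] * diff).T @ diff / nj[j]

        # Stop when the log-likelihood improvement falls below eps.
        ll = np.log(dens.sum(axis=1)).sum()
        if ll - prev_ll < eps:
            break
        prev_ll = ll
    return p, mu, cov
```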