BCA Semester VI Data Mining Module 4 (Presentation Kind of N
Scalability
Ability to deal with different types of attributes
Ability to handle dynamic data
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine
input parameters
Ability to deal with noise and outliers
Insensitive to order of input records
High dimensionality
Incorporation of user-specified constraints
Interpretability and usability
Data Structures
Data matrix
(two modes)
Dissimilarity matrix
(one mode)
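As an illustration of the two structures, the sketch below builds a dissimilarity matrix (n objects by n objects) from a data matrix (n objects by p variables). The function name `dissimilarity_matrix`, the sample data, and the choice of Euclidean distance are assumptions made for this example, not part of the slides.

```python
import math

def dissimilarity_matrix(data):
    """Build an n-by-n dissimilarity matrix from an n-by-p data matrix."""
    n = len(data)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            # Euclidean distance is an assumed choice of dissimilarity.
            dist = math.dist(data[i], data[j])
            d[i][j] = d[j][i] = dist  # symmetric, zero on the diagonal
    return d

# Hypothetical data matrix: 3 objects (rows), 2 variables (columns).
data = [(1.0, 2.0), (4.0, 6.0), (1.0, 6.0)]
D = dissimilarity_matrix(data)
for row in D:
    print(row)
```

The data matrix has different entities on rows and columns (objects and variables), while the dissimilarity matrix has objects on both, which is why only its lower triangle needs to be computed.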
Types of data in cluster analysis
Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types
Interval-valued variables
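Standardize data
Calculate the mean absolute deviation:
$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right)$$
where
$$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \dots + x_{nf}\right)$$
Calculate the standardized measurement (z-score):
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$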
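The standardization steps above can be sketched as follows for a single variable f. The function name `standardize` and the sample values are assumptions for illustration. The sketch uses the mean absolute deviation rather than the standard deviation, as the slide describes.

```python
def standardize(values):
    """Return z-scores of one variable using the mean absolute deviation."""
    n = len(values)
    m = sum(values) / n                       # mean of the variable
    s = sum(abs(x - m) for x in values) / n   # mean absolute deviation
    # Note: s is 0 when all values are equal; this sketch does not handle that.
    return [(x - m) / s for x in values]

# Hypothetical measurements of one interval-scaled variable.
z = standardize([2.0, 4.0, 6.0, 8.0])
print(z)
```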
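Similarity and Dissimilarity Between Objects
Minkowski distance:
$$d(i,j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q\right)^{1/q}$$
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional
data objects, and q is a positive integer
If q = 1, d is Manhattan distance:
$$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$$
If q = 2, d is Euclidean distance:
$$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$$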
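A minimal sketch of the distance for general q, showing that q = 1 and q = 2 give the Manhattan and Euclidean distances. The function name `minkowski` is an assumption. The two points are borrowed from the k-means example later in these notes.

```python
def minkowski(i, j, q):
    """Distance between two p-dimensional objects for a positive integer q."""
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1.0 / q)

a, b = (2, 10), (5, 8)
manhattan = minkowski(a, b, 1)   # |2-5| + |10-8|
euclidean = minkowski(a, b, 2)   # sqrt(3^2 + 2^2)
print(manhattan, euclidean)
```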
Properties
d(i,j) ≥ 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) ≤ d(i,k) + d(k,j)
Also, one can use weighted distance, parametric Pearson
product moment correlation, or other dissimilarity measures
Binary Variables
Object j
high).
Ordinal Variables
f is binary or nominal:
Step 3: Assign each object to the cluster with the nearest seed point
Example
[Figure: k-means with K=2 on a 0-10 by 0-10 scatter plot. Panels: arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means; reassign.]
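The assign/update/reassign loop can be sketched as below. This is a minimal illustration, not a reference implementation: the function name `kmeans`, the sample points, the use of Euclidean distance, and the stopping rule (stop when assignments no longer change) are all assumptions.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Alternate assignment and mean update until assignments stop changing."""
    assignment = None
    for _ in range(max_iter):
        # Assign each object to the nearest (most similar) center.
        new_assignment = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no object changed cluster
        assignment = new_assignment
        # Update each cluster mean from its current members.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old center
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centers

# Hypothetical data with K=2; the first two objects serve as initial centers.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centers = kmeans(points, centers=[points[0], points[1]])
print(labels, centers)
```

Even with both initial centers taken from the same group, the loop separates the two groups after a couple of updates, although k-means in general can settle in a local optimum depending on the initial choice.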
Example
Distance from point A1 to each initial mean, using p(a, b) = |x2 - x1| + |y2 - y1|:
Point (x1, y1) = (2, 10), Mean 1 (x2, y2) = (2, 10):
p(a, b) = |2 - 2| + |10 - 10| = 0 + 0 = 0
Point (x1, y1) = (2, 10), Mean 2 (x2, y2) = (5, 8):
p(a, b) = |5 - 2| + |8 - 10| = 3 + 2 = 5
Point (x1, y1) = (2, 10), Mean 3 (x2, y2) = (1, 2):
p(a, b) = |1 - 2| + |2 - 10| = 1 + 8 = 9
Initial means: Mean 1 = (2,10), Mean 2 = (5,8), Mean 3 = (1,2)
Point       Distance to   Distance to   Distance to   Cluster
            Mean 1        Mean 2        Mean 3
A1 (2,10)   0             5             9             1
A2 (2,5)    5             6             4             3
A3 (8,4)    12            7             9             2
A4 (5,8)    5             0             10            2
A5 (7,5)    10            5             9             2
A6 (6,4)    10            5             7             2
A7 (1,2)    9             10            0             3
A8 (4,9)    3             2             10            2
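The table above can be reproduced with a few lines of code. The sketch below recomputes the distances and cluster assignments from the eight points and three initial means. The final loop, which computes the updated means for the next iteration, is an addition for illustration and its results are not stated in the slides.

```python
points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
    "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9),
}
means = [(2, 10), (5, 8), (1, 2)]

def manhattan(a, b):
    """|x2 - x1| + |y2 - y1|, the distance used in the worked example."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

clusters = {}
for name, p in points.items():
    dists = [manhattan(p, m) for m in means]
    clusters[name] = dists.index(min(dists)) + 1  # clusters numbered from 1
    print(name, p, dists, clusters[name])

# Assumed next step: update each mean from the objects assigned to it.
new_means = []
for k in range(1, len(means) + 1):
    members = [points[n] for n, c in clusters.items() if c == k]
    new_means.append(tuple(sum(v) / len(members) for v in zip(*members)))
print(new_means)
```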
PAM works effectively for small data sets, but does not
scale well for large data sets
A Typical K-Medoids Algorithm (PAM)
[Figure: PAM on a 0-10 by 0-10 scatter plot, Total Cost = 20. Panels: arbitrarily choose k objects as initial medoids; assign each remaining object to the nearest medoid; compute the total cost of swapping; swap O if quality is improved; do loop.]
The initial representative objects (or seeds) are chosen arbitrarily.
[Figure 1: Oi, Oj, Orandom, and P. Reassigned to Oi]
Case 2: p currently belongs to representative object, oj. If
oj is replaced by orandom as a representative object and p is
closest to orandom then p is reassigned to orandom.
[Figure 2: Oi, Oj, and Orandom. Reassigned to Orandom]
Case 3: p currently belongs to representative object oi, i ≠ j.
If oj is replaced by orandom as a representative object and p
is still closest to oi, then the assignment does not change.
[Figure 3: Oi, Oj, Orandom, and P. No change]
Case 4: p currently belongs to representative object oi, i ≠ j.
If oj is replaced by orandom as a representative object and p
is closest to orandom, then p is reassigned to orandom.
[Figure 4: Oi, Oj, Orandom, and P. Reassigned to Orandom]
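The reassignment cases determine how the total cost changes when a medoid oj is replaced by orandom. A minimal sketch of evaluating one such swap is shown below. It recomputes the total cost from scratch rather than case by case, which gives the same decision. The names `total_cost` and `try_swap`, the sample points, and the use of Manhattan distance are assumptions for illustration, and this covers a single swap, not the full PAM loop.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def total_cost(points, medoids, dist):
    """Sum over all objects of the distance to the nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def try_swap(points, medoids, o_j, o_random, dist):
    """Replace medoid o_j by o_random only if the total cost decreases."""
    candidate = [o_random if m == o_j else m for m in medoids]
    old = total_cost(points, medoids, dist)
    new = total_cost(points, candidate, dist)
    return (candidate, new) if new < old else (medoids, old)

# Hypothetical data: two groups, with both initial medoids in the first group.
points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
medoids = [(1, 1), (2, 1)]
before = total_cost(points, medoids, manhattan)
medoids, after = try_swap(points, medoids, (2, 1), (8, 8), manhattan)
print(before, after, medoids)
```

Recomputing the full cost for every candidate swap is what makes this approach expensive on large data sets.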
CLARA (Clustering LARge Applications)
It draws multiple samples of the data set, applies PAM on
each sample, and gives the best clustering as the output
Weakness:
[Figure: dendrogram over objects 1, 3, 4, 5, 6, 2, with levels i=1 at 2.236, i=2 at 2.828, and i=3 at 3.600]
DIANA (DIvisive ANAlysis)
Major features:
Handle noise
One scan
Density-Based Clustering: Background
Two parameters:
Eps: Maximum radius of the neighbourhood
MinPts: Minimum number of points in an Eps-neighbourhood
of that point
NEps(p): {q belongs to D | dist(p,q) <= Eps}
Directly density-reachable: A point p is directly density-reachable
from a point q wrt. Eps, MinPts if
1) p belongs to NEps(q)
2) core point condition:
|NEps(q)| >= MinPts
[Figure: points p and q, with MinPts = 5 and Eps = 1 cm]
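The definitions above translate almost directly into code. The sketch below is an illustration under assumptions: the function names, the sample points, and the use of Euclidean distance are not from the slides. Note that the Eps-neighbourhood of a point includes the point itself, since dist(p, p) = 0.

```python
import math

def eps_neighbourhood(points, p, eps):
    """NEps(p): all points q in the data set with dist(p, q) <= Eps."""
    return [q for q in points if math.dist(p, q) <= eps]

def is_core(points, q, eps, min_pts):
    """Core point condition: |NEps(q)| >= MinPts."""
    return len(eps_neighbourhood(points, q, eps)) >= min_pts

def directly_density_reachable(points, p, q, eps, min_pts):
    """True if p is in NEps(q) and q satisfies the core point condition."""
    return p in eps_neighbourhood(points, q, eps) and is_core(points, q, eps, min_pts)

# Hypothetical points: a dense group around the origin plus one point at its edge.
pts = [(0, 0), (0.5, 0), (0, 0.5), (-0.5, 0), (0, -0.5), (1.4, 0)]
edge, inner = (1.4, 0), (0.5, 0)
print(directly_density_reachable(pts, edge, inner, eps=1, min_pts=5))
print(directly_density_reachable(pts, inner, edge, eps=1, min_pts=5))
```

The two calls give different answers because the relation is not symmetric: the edge point lies in the neighbourhood of a core point, but it is not a core point itself.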
Density-Based Clustering: Background (II)
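Density-reachable:
A point p is density-reachable from a point q wrt. Eps, MinPts
if there is a chain of points p1, …, pn, with p1 = q and pn = p,
such that pi+1 is directly density-reachable from pi.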
Density-connected:
A point p is density-connected to a point q
wrt. Eps, MinPts if there is a point o such
that both p and q are density-reachable
from o wrt. Eps and MinPts.
[Figure: points p and q, both density-reachable from o]
DBSCAN: Density Based Spatial Clustering of Applications
with Noise
Relies on a density-based notion of cluster: A cluster is defined
as a maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases with
noise
[Figure: core, border, and outlier points for Eps = 1 cm and MinPts = 5. The border point is density-reachable from a core point; the outlier is not density-reachable from a core point.]
DBSCAN: The Algorithm
Arbitrarily select a point p
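The remaining steps of the algorithm are cut off in these notes. The sketch below follows the procedure as it is commonly described (start from an unvisited point, grow a cluster from it if it is a core point, otherwise mark it as noise), so treat it as an assumption rather than the slide's own wording. The function name `dbscan`, the label values, and the sample points are hypothetical.

```python
import math

NOISE = -1  # assumed label for points that belong to no cluster

def dbscan(points, eps, min_pts):
    """Label each point with a cluster index, or NOISE."""
    labels = [None] * len(points)

    def neighbours(i):
        # Indices of all points within Eps of point i (including i itself).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):          # select the next unvisited point p
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE             # may later turn out to be a border point
            continue
        labels[i] = cluster               # p is a core point: start a cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster       # border point reached from a core point
            if labels[j] is not None:
                continue                  # already processed
            labels[j] = cluster
            nj = neighbours(j)
            if len(nj) >= min_pts:        # j is also a core point: keep expanding
                queue.extend(nj)
        cluster += 1
    return labels

# Hypothetical data: two dense squares and one isolated point.
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11), (5, 5)]
labels = dbscan(pts, eps=1.5, min_pts=3)
print(labels)
```

This version scans all points for every neighbourhood query, which is quadratic in the number of points. That is acceptable for a small illustration but not for large spatial databases.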