DM - Topic Four - Part III
Clustering: Application 1
Market Segmentation:
Goal: subdivide a market into distinct subsets of customers
where any subset may conceivably be selected as a market
target to be reached with a distinct marketing mix.
Approach:
Collect different attributes of customers based on their
geographic and lifestyle-related information.
Find clusters of similar customers.
Measure the clustering quality by observing the buying
patterns of customers in the same cluster vs. those from
different clusters.
Clustering: Application 2
Document Clustering:
Goal: To find groups of documents that are similar to
each other based on the important terms appearing in
them.
Approach: Identify frequently occurring terms in each document, form
a similarity measure based on the frequencies of different terms, and
use it to cluster.
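As a rough sketch of this approach in plain Python (names are illustrative; a real system would typically use TF-IDF weights and the cosine measure defined later), the snippet below builds term-frequency vectors and scores two documents by their shared terms:

from collections import Counter

def term_frequencies(text):
    # term-frequency vector of a document (very naive tokenization)
    return Counter(text.lower().split())

def overlap_similarity(tf_a, tf_b):
    # similarity from frequencies of shared terms (cosine is another common choice)
    shared = set(tf_a) & set(tf_b)
    return sum(tf_a[t] * tf_b[t] for t in shared)

d1 = term_frequencies("data mining finds clusters in data")
d2 = term_frequencies("clustering groups similar data objects")
print(overlap_similarity(d1, d2))   # documents sharing more frequent terms score higher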
Clustering: Application 3
Outlier Detection:
Clustering can also be used for outlier detection, where outliers
(values that are “far away” from any cluster) may be more interesting
than common cases.
Applications of outlier detection include the detection of credit
card fraud.
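A minimal sketch of this idea, assuming scikit-learn is available: cluster the data, then flag points that lie unusually far from their nearest cluster centre. The threshold rule and the data are illustrative.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal((0, 0), 0.3, (40, 2)),   # one dense cluster
               rng.normal((4, 4), 0.3, (40, 2)),   # another dense cluster
               [[10.0, -5.0]]])                    # an isolated point (potential outlier)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# distance of each point to its assigned cluster centre
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
threshold = dist.mean() + 3 * dist.std()           # simple rule of thumb
print(X[dist > threshold])                         # flags the far-away point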
What Is Good Clustering?
Quality
• A good clustering method will produce high-quality clusters with
  high intra-class similarity
  low inter-class similarity
• The quality of a clustering result depends on both the similarity
  measure used by the method and its implementation.
• Key requirement of clustering: a good measure of similarity between
  instances, covering
  Interval-scaled variables
  Binary variables
  Nominal variables
  Mixed types
Interval-valued variables and distance functions
These are variables of an object whose values are continuous
measurements, such as height, weight, and age.
d(i, j) = ( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q )^{1/q}

where i = (x_{i1}, x_{i2}, ..., x_{ip}) and j = (x_{j1}, x_{j2}, ..., x_{jp}) are two
p-dimensional data objects, and q is a positive integer (the Minkowski distance).

If q = 1, d is the Manhattan distance:

d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|
Similarity and Dissimilarity Between Objects (Cont.)
If q = 2, d is Euclidean distance:
d(i, j) = \sqrt{ |x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2 }
Basic Properties
d(i,j) ≥ 0
d(i,i) =0
d(i,j) = d(j,i)
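The sketch below is a minimal Python illustration of these distance functions; the function names and sample points are illustrative, not from the lecture.

def minkowski(x, y, q):
    # Minkowski distance between two p-dimensional points x and y
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def manhattan(x, y):
    return minkowski(x, y, 1)   # q = 1

def euclidean(x, y):
    return minkowski(x, y, 2)   # q = 2

i = (1.0, 2.0, 3.0)
j = (4.0, 6.0, 3.0)
print(manhattan(i, j))   # |1-4| + |2-6| + |3-3| = 7.0
print(euclidean(i, j))   # sqrt(9 + 16 + 0) = 5.0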
Cosine similarity
Measures the similarity between objects (say a document d_j and a
query q, or two documents) as the cosine of the angle between their
term-weight vectors:

sim(d_j, q) = \frac{d_j \cdot q}{|d_j| \, |q|}
            = \frac{\sum_{i=1}^{n} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2} \; \sqrt{\sum_{i=1}^{n} w_{i,q}^2}}

where the length of d_j is |d_j| = \sqrt{\sum_{i=1}^{n} w_{i,j}^2}.
Example : Computing Cosine Similarity
• Say we have a query vector Q = (0.4, 0.8) and a document vector D1 = (0.2, 0.7).
• Compute their similarity using the cosine measure.
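A short Python sketch of this computation (function name is illustrative):

import math

def cosine_similarity(x, y):
    # cosine of the angle between two term-weight vectors
    dot = sum(a * b for a, b in zip(x, y))
    length_x = math.sqrt(sum(a * a for a in x))
    length_y = math.sqrt(sum(b * b for b in y))
    return dot / (length_x * length_y)

Q = (0.4, 0.8)
D1 = (0.2, 0.7)
# dot product = 0.4*0.2 + 0.8*0.7 = 0.64; |Q| = sqrt(0.80); |D1| = sqrt(0.53)
print(round(cosine_similarity(Q, D1), 2))   # about 0.98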
Binary Variables
If all the binary-valued attributes have the same weight, we can
construct a 2-by-2 contingency table for any two objects i and j as
shown below.

A contingency table for binary data:

                      Object j
                      1        0        sum
Object i   1          a        b        a + b
           0          c        d        c + d
           sum        a + c    b + d    p

where
a is the number of attributes with value 1 in both objects,
b is the number of attributes with value 1 in object i and 0 in object j,
c is the number of attributes with value 0 in object i and 1 in object j,
d is the number of attributes with value 0 in both objects.

Jaccard coefficient: used when the two values are not equally important
(asymmetric), e.g., smoker = yes (coded 1) is more informative than
smoker = no (coded 0):

d(i, j) = \frac{b + c}{a + b + c}
Dissimilarity between Binary Variables
Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
Jack vs. Mary:
                   Mary
                   1      0      sum
Jack      1        3      1      4
          0        1      2      3
          sum      4      3      7

Jack vs. Jim:
                   Jim
                   1      0      sum
Jack      1        3      1      4
          0        1      2      3
          sum      4      3      7

Mary vs. Jim:
                   Jim
                   1      0      sum
Mary      1        2      2      4
          0        2      1      3
          sum      4      3      7

Applying d(i, j) = \frac{b + c}{a + b + c} to each pair of contingency tables:

d(jack, mary) = \frac{1 + 1}{3 + 1 + 1} = 0.4

d(jack, jim) = \frac{1 + 1}{3 + 1 + 1} = 0.4

d(jim, mary) = \frac{2 + 2}{2 + 2 + 2} \approx 0.67
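As a quick check, a minimal Python sketch that plugs the (a, b, c) counts from the tables above into the formula (the function name is illustrative):

def jaccard_dissimilarity(a, b, c):
    # asymmetric binary (Jaccard-based) dissimilarity d = (b + c) / (a + b + c)
    return (b + c) / (a + b + c)

# (a, b, c) counts taken from the contingency tables above
print(jaccard_dissimilarity(3, 1, 1))  # Jack vs. Mary -> 0.4
print(jaccard_dissimilarity(3, 1, 1))  # Jack vs. Jim  -> 0.4
print(jaccard_dissimilarity(2, 2, 2))  # Jim vs. Mary  -> 0.666...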
Nominal Variables
A generalization of the binary variable in that it can take
more than 2 states, e.g., red, yellow, blue, green
Method 1: Simple matching
m: # of matches, p: total # of variables
d(i, j) = \frac{p - m}{p}
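A minimal Python sketch of simple matching, with illustrative attribute values:

def simple_matching_dissimilarity(x, y):
    # d(i, j) = (p - m) / p, where m = number of matching attributes
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)
    return (p - m) / p

obj_i = ("red", "small", "round")
obj_j = ("red", "large", "round")
print(simple_matching_dissimilarity(obj_i, obj_j))  # (3 - 2) / 3 = 0.33...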
DIANA (Divisive Analysis)
Divisive: it is a top-down clustering technique.
Start with all sample units in a single cluster of size n.
Then, at each step of the algorithm, a cluster is partitioned into a
pair of daughter clusters, selected to maximize the distance between
the two daughters.
The algorithm stops when the sample units are partitioned into n
clusters of size 1.
Introduced in Kaufman and Rousseeuw (1990).
Thus it proceeds in the inverse order of AGNES (Agglomerative Nesting).
[Figure: three scatter plots (axes 0 to 10) showing DIANA splitting the data into progressively more clusters]
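DIANA's actual splitting rule is based on average dissimilarities within a cluster; as a simplified illustration of the top-down idea only (not the DIANA rule itself), the sketch below, assuming scikit-learn, repeatedly bisects the largest remaining cluster with 2-means:

import numpy as np
from sklearn.cluster import KMeans

def divisive_bisection(X, n_clusters):
    # top-down sketch: keep splitting the largest remaining cluster with 2-means
    clusters = [np.arange(len(X))]                 # start with one cluster of all points
    while len(clusters) < n_clusters:
        splittable = [k for k in range(len(clusters)) if len(clusters[k]) > 1]
        if not splittable:
            break
        idx = max(splittable, key=lambda k: len(clusters[k]))
        members = clusters.pop(idx)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters

X = np.array([[1, 1], [1.2, 0.9], [5, 5], [5.1, 4.8], [9, 1], [8.8, 1.2]])
for c in divisive_bisection(X, 3):
    print(X[c])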
Comparing Clusterings Using SSE
o Given two clusterings, we can choose the one with the smallest error (SSE)
o One easy way to reduce SSE is to increase K, the number of clusters
o But do not forget that a good clustering with a smaller K can have a lower SSE than a poor
clustering with a higher K
Internal Measures: SSE
o Internal Index: Used to measure the goodness of a clustering structure
without respect to external information
o SSE (sum of squared errors): SSE = \sum_{k} \sum_{x \in C_k} dist(x, m_k)^2, where m_k is the
centroid (representative point) of cluster C_k
o SSE is good for comparing two clusterings or two clusters (average SSE).
o Can also be used to estimate the number of clusters
[Figure: SSE plotted against the number of clusters K, for K from 2 to 30]
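A short sketch of using SSE to estimate the number of clusters, assuming scikit-learn is available; KMeans reports the SSE of a fitted clustering as its inertia_ attribute, and the data here is synthetic:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# synthetic data: three well-separated blobs
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in ((0, 0), (5, 5), (9, 0))])

# sweep K and record the SSE (KMeans.inertia_) of each clustering
for k in range(2, 8):
    sse = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(sse, 1))
# SSE drops sharply up to K = 3 (the true number of blobs) and only slowly after,
# which is the "elbow" used to estimate the number of clusters.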
Review questions
o What makes clustering challenging?
o What is good clustering?
o Explain SSE
o Describe the basic agglomerative clustering algorithm
o Explain the key concepts in clustering
Review questions
o What is the key issue in clustering, and what makes it challenging?
o How do you know that a given clustering is good?
o How does SSE work?
o What does unsupervised learning mean?
o Describe the basic agglomerative clustering algorithm
o Explain the data formats used in clustering
Thank you