Classification and Clustering
2
What Is Classification?
The goal of data classification is to organize and categorize data into distinct classes
A model is first created based on the data distribution
The model is then used to classify new data
Given the model, a class can be predicted for new data
3
Prediction, Clustering, Classification
What is Prediction?
The goal of prediction is to forecast or deduce the value of an
attribute based on values of other attributes
A model is first created based on the data distribution
The model is then used to predict future or unknown values
4
Classification: 3 Step Process
1. Model construction (Learning):
Each record (instance) is assumed to belong to a predefined class, as
determined by one of the attributes, called the class label
The set of all records used for construction of the model is called training
set
The model is usually represented in the form of classification rules (IF-THEN statements) or decision trees
5
Model Construction
6
Model Evaluation
7
Model Use: Classification
8
Classification Methods
Decision Tree Induction
Neural Networks
Bayesian Classification
Association-Based Classification
K-Nearest Neighbor
Case-Based Reasoning
Genetic Algorithms
Fuzzy Sets
Many More
9
Decision Trees
A decision tree is a flow-chart-like tree structure
Internal node denotes a test on an attribute (feature)
Branch represents an outcome of the test
All records in a branch have the same value for the tested
attribute
Leaf node represents class label or class label distribution
[Figure: example decision tree with root node "outlook", internal nodes "humidity" and "windy", and leaves labeled P or N]
10
Decision Trees
Example: “is it a good day to play golf?”
A set of attributes and their possible values:
outlook: sunny, overcast, rain
temperature: cool, mild, hot
humidity: high, normal
windy: true, false
A particular instance in the training set might be:
<overcast, hot, normal, false>: play
11
Using Decision Trees for Classification
Examples can be classified as follows
1. look at the example's value for the feature specified at the current node
2. move along the edge labeled with this value
3. if you reach a leaf, return the label of the leaf
4. otherwise, repeat from step 1
Example (a decision tree to decide whether to go on a picnic):
[Figure: decision tree with root "outlook", branches sunny, overcast, and rain, and leaves labeled N or P]
So a new instance:
<rainy, hot, normal, true>: ?
12
Decision Trees and Decision Rules
[Figure: the golf decision tree with root "outlook" and yes/no leaves]
Rule 1: If (outlook=“sunny”) AND (humidity<=0.75) Then (play=“yes”)
Rule 2: If (outlook=“rainy”) AND (wind>20) Then (play=“no”)
Rule 3: If (outlook=“overcast”) Then (play=“yes”)
...
13
Top-Down Decision Tree Generation
The basic approach usually consists of two phases:
Tree construction
At the start, all the training examples are at the root
Examples are partitioned recursively based on selected attributes
Tree pruning
remove tree branches that may reflect noise in the training data
and lead to errors when classifying test data
improve classification accuracy
Basic Steps in Decision Tree Construction
Tree starts as a single node representing all the data
If the samples are all of the same class, the node becomes a leaf labeled with that class label
Otherwise, select the feature that best separates the samples into individual classes.
Recursion stops when:
Samples in node belong to the same class (majority)
There are no remaining attributes on which to split
14
Trees Construction Algorithm (ID3)
Decision Tree Learning Method (ID3)
Input: a set of examples S, a set of features F, and a target set T (target class T
represents the type of instance we want to classify, e.g., whether “to play golf”)
1. If every element of S is already in T, return “yes”; if no element of S is in T
return “no”
2. Otherwise, choose the best feature f from F (if there are no features
remaining, then return failure);
3. Extend tree from f by adding a new branch for each attribute value
4. Distribute training examples to leaf nodes (so each leaf node S is now the set
of examples at that node, and F is the remaining set of features not yet selected)
5. Repeat steps 1-5 for each leaf node
Main Question:
how do we choose the best feature at each step?
Note: the ID3 algorithm only deals with categorical attributes, but can be extended (as in C4.5) to handle continuous attributes
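A minimal Python sketch of this recursive procedure (illustrative only; it assumes categorical attributes, examples stored as dictionaries, and a choose_best_feature selector such as one built from the information-gain formula on the next slide):

from collections import Counter

def id3(examples, features, target, choose_best_feature):
    # Illustrative ID3-style tree builder; not the original implementation.
    labels = [ex[target] for ex in examples]
    # Step 1: if all examples share one class, return a leaf with that class
    if len(set(labels)) == 1:
        return labels[0]
    # Step 2: if no features remain, fall back to the majority class
    if not features:
        return Counter(labels).most_common(1)[0][0]
    best = choose_best_feature(examples, features, target)
    tree = {best: {}}
    # Steps 3-5: add a branch for each value of the chosen feature and recurse
    for value in set(ex[best] for ex in examples):
        subset = [ex for ex in examples if ex[best] == value]
        remaining = [f for f in features if f != best]
        tree[best][value] = id3(subset, remaining, target, choose_best_feature)
    return tree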
15
Choosing the “Best” Feature
Using Information Gain to find the “best” (most discriminating) feature
Entropy E(I) of a set of instances I, containing p positive and n negative examples:
$$E(I) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$
Gain(A, I) is the expected reduction in entropy due to feature (attribute) A:
$$Gain(A, I) = E(I) - \sum_{\text{descendant } j}\frac{p_j + n_j}{p+n}\,E(I_j)$$
Example: S: [9+,5-], candidate split on Outlook? (branches sunny, overcast, rainy)
E(S) = -(9/14)·log2(9/14) - (5/14)·log2(5/14) = 0.940
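The two formulas above translate directly into code. A small sketch (assumed representation: each example is a dictionary of attribute values, with a parallel list of class labels):

import math
from collections import Counter

def entropy(labels):
    # E(I) = -sum over classes of p_c * log2(p_c)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(examples, labels, feature):
    # Gain(A, I) = E(I) - sum_j (|I_j| / |I|) * E(I_j), splitting on `feature`
    base = entropy(labels)
    total = len(labels)
    remainder = 0.0
    for value in set(ex[feature] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[feature] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return base - remainder

# entropy(['+'] * 9 + ['-'] * 5) evaluates to about 0.940, matching the example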
16
Decision Tree Learning - Example
Day  outlook   temp  humidity  wind    play
D1   sunny     hot   high      weak    No
D2   sunny     hot   high      strong  No
D3   overcast  hot   high      weak    Yes
D4   rain      mild  high      weak    Yes
D5   rain      cool  normal    weak    Yes
D6   rain      cool  normal    strong  No
D7   overcast  cool  normal    strong  Yes
D8   sunny     mild  high      weak    No
D9   sunny     cool  normal    weak    Yes
D10  rain      mild  normal    weak    Yes
D11  sunny     mild  normal    strong  Yes
D12  overcast  mild  high      strong  Yes
D13  overcast  hot   normal    weak    Yes
D14  rain      mild  high      strong  No

Splitting S: [9+,5-] (E = 0.940) on humidity?
high: [3+,4-] (E = 0.985); normal: [6+,1-] (E = 0.592)
Gain(S, humidity) = .940 - (7/14)*.985 - (7/14)*.592 = .151

Splitting S: [9+,5-] (E = 0.940) on wind?
weak: [6+,2-] (E = 0.811); strong: [3+,3-] (E = 1.00)
Gain(S, wind) = .940 - (8/14)*.811 - (6/14)*1.0 = .048

So, classifying examples by humidity provides more information gain than by wind. In this case, however, you can verify that outlook has the largest information gain, so it'll be selected as the root.
17
Decision Tree Learning - Example
Partially learned decision tree
[Figure: root Outlook over S: [9+,5-] = {D1, D2, …, D14};
sunny branch: [2+,3-] {D1, D2, D8, D9, D11} → ?
overcast branch: [4+,0-] {D3, D7, D12, D13} → yes
rain branch: [3+,2-] {D4, D5, D6, D10, D14} → ?]
18
Dealing With Continuous Variables
Partition continuous attribute into a discrete set of intervals
sort the examples according to the continuous attribute A
identify adjacent examples that differ in their target classification
generate a set of candidate thresholds midway between the values of these adjacent examples
problem: may generate too many intervals
Another Solution:
take a minimum threshold M of the examples of the majority class in each
adjacent partition; then merge adjacent partitions with the same majority class
Example: M = 3
Temperature: 64  65  68  69  70  71  72  72  75  75  80  81  83  85
Play?:       yes no  yes yes yes no  no  yes yes yes no  yes yes no
Partition boundaries: 70.5 and 77.5
Final mapping: temperature <= 77.5 ==> “yes”; temperature > 77.5 ==> “no”
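A rough sketch of the first idea above (candidate thresholds midway between adjacent examples whose class changes); the function and variable names are illustrative:

def candidate_thresholds(values, labels):
    # Sort by the continuous attribute and propose a midpoint threshold
    # wherever two adjacent examples disagree on the target class.
    pairs = sorted(zip(values, labels))
    thresholds = []
    for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
        if c1 != c2 and v1 != v2:
            thresholds.append((v1 + v2) / 2.0)
    return thresholds

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play  = ["yes", "no", "yes", "yes", "yes", "no", "no", "yes",
         "yes", "yes", "no", "yes", "yes", "no"]
print(candidate_thresholds(temps, play))  # includes 70.5 and 77.5 from the example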
19
Improving on Information Gain
Info. Gain tends to favor attributes with a large number of values
larger distribution ==> lower entropy ==> larger Gain
Quinlan suggests using Gain Ratio
penalize for large number of values
$$SplitInfo(A, S) = -\sum_i \frac{|S_i|}{|S|}\log_2\frac{|S_i|}{|S|} \qquad GainRatio(A, S) = \frac{Gain(A, S)}{SplitInfo(A, S)}$$
Example: “outlook” splits S: [9+,5-] into S1: [4+,0-], S2: [2+,3-], S3: [3+,2-]
SplitInfo(outlook, S) = -(4/14)·log(4/14) - (5/14)·log(5/14) - (5/14)·log(5/14) = 1.577
GainRatio(outlook, S) = 0.246 / 1.577 = 0.156
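A short sketch of the two definitions (the subset sizes 4, 5, 5 and the gain 0.246 for outlook are taken from the example above):

import math

def split_info(subset_sizes):
    # SplitInfo(A, S) = -sum_i (|S_i| / |S|) * log2(|S_i| / |S|)
    total = sum(subset_sizes)
    return -sum((s / total) * math.log2(s / total) for s in subset_sizes if s > 0)

def gain_ratio(gain, subset_sizes):
    # GainRatio(A, S) = Gain(A, S) / SplitInfo(A, S)
    return gain / split_info(subset_sizes)

print(round(split_info([4, 5, 5]), 3))         # about 1.577
print(round(gain_ratio(0.246, [4, 5, 5]), 3))  # about 0.156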
20
Over-fitting in Classification
A tree generated may over-fit the training examples
due to noise or too small a set of training data
Two approaches to avoid over-fitting:
(Stop earlier): Stop growing the tree earlier
(Post-prune): Allow over-fit and then post-prune the tree
Approaches to determine the correct final tree size:
Separate training and testing sets or use cross-validation
Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node is likely to improve performance over the entire distribution
Use Minimum Description Length (MDL) principle: halting growth of
the tree when the encoding is minimized.
Rule post-pruning (C4.5): converting to rules before
pruning
21
Pruning the Decision Tree
A decision tree constructed using the training
data may need to be pruned
over-fitting may result in branches or leaves based on too few
examples
pruning is the process of removing branches and subtrees that
are generated due to noise; this improves classification
accuracy
Subtree Replacement: merge a subtree into a
leaf node
Using a set of data different from the training data
At a tree node, if the accuracy without splitting is higher than the accuracy with splitting, replace the subtree with a leaf node; label it using the majority class.
[Figure: a subtree splitting on “color” into red and blue branches with “yes”/“no” leaves]
Example: suppose with the test set we find 3 red “no” examples and 1 blue “yes” example. We can replace the tree with a single “no” node. After replacement there will be only 2 errors instead of 5.
22
Bayesian Classification
It is a statistical classifier based on Bayes' theorem
It uses probabilistic learning by calculating explicit probabilities for hypotheses
A naïve Bayesian classifier, that assumes total independence between
attributes, is commonly used and performs well with large data sets
The model is incremental in the sense that each training example can
incrementally increase or decrease the probability that a hypothesis is
correct. Prior knowledge can be combined with observed data
Given a data sample X with an unknown class label, H
is the hypothesis that X belongs to a specific class C
The conditional probability of hypothesis H given X, Pr(H|X), follows Bayes' theorem:
$$\Pr(H \mid X) = \frac{\Pr(X \mid H)\,\Pr(H)}{\Pr(X)}$$
Practical difficulty: requires initial knowledge of
many probabilities, significant computational cost
23
Naïve Bayesian Classifier
Suppose we have n classes C1 , C2 ,…,Cn. Given an
unknown sample X, the classifier will predict that
X=(x1 ,x2 ,…,xn) belongs to the class with the highest
conditional probability:
$$X \in C_i \quad \text{if} \quad \Pr(C_i \mid X) > \Pr(C_j \mid X) \ \ \text{for all } 1 \le j \le n,\ j \ne i$$
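A minimal counting-based sketch of this decision rule for categorical attributes (no smoothing; names are illustrative and not tied to any particular library):

from collections import Counter, defaultdict

def train_naive_bayes(examples, labels):
    # Estimate Pr(C) and Pr(x_k | C) by simple counting over the training set.
    class_counts = Counter(labels)
    cond_counts = defaultdict(Counter)   # (class, attribute index) -> value counts
    for x, c in zip(examples, labels):
        for k, value in enumerate(x):
            cond_counts[(c, k)][value] += 1
    return class_counts, cond_counts

def predict(x, class_counts, cond_counts):
    # Choose the class C_i maximizing Pr(C_i) * prod_k Pr(x_k | C_i).
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, count in class_counts.items():
        score = count / total
        for k, value in enumerate(x):
            score *= cond_counts[(c, k)][value] / count
        if score > best_score:
            best_class, best_score = c, score
    return best_class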
24
Naïve Bayesian Classifier - Example
Given a training set, we can compute the
probabilities
26
Data Preparation
Several steps to prepare the data for Weka and for See5
open the training data in Excel, remove the “id” column, and save the result as a comma-delimited file (e.g., “bank.csv”); these steps can also be scripted (see the sketch at the end of this slide)
do the same with new customer data, but also add a new column called “pep”
as the last column; the value of this column for each record should be “?”
Weka
must convert the data to ARFF format
attribute specification and data are in the same file
the data portion is just the comma delimited data file without the label row
See5/C5
create a “name” file and a “data” file
“name” file contains attribute specification; “data” file is same as above
first line of “name” file must be the name(s) of the target class(es) - in this case
“pep”
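The spreadsheet steps above can also be scripted. A small sketch using Python's csv module (the file names are placeholders; for the new-customer file the two helpers would be applied in sequence):

import csv

def drop_id_column(src, dest):
    # Remove the "id" column from a comma-delimited file.
    with open(src, newline="") as fin, open(dest, "w", newline="") as fout:
        reader, writer = csv.reader(fin), csv.writer(fout)
        header = next(reader)
        idx = header.index("id")
        writer.writerow([h for i, h in enumerate(header) if i != idx])
        for row in reader:
            writer.writerow([v for i, v in enumerate(row) if i != idx])

def add_unknown_pep(src, dest):
    # Append a "pep" column whose value is "?" for every new-customer record.
    with open(src, newline="") as fin, open(dest, "w", newline="") as fout:
        reader, writer = csv.reader(fin), csv.writer(fout)
        writer.writerow(next(reader) + ["pep"])
        for row in reader:
            writer.writerow(row + ["?"])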
27
Data File Format for Weka
Training Data:

@relation 'train-bank-data'
@attribute 'age' real
@attribute 'sex' {'MALE','FEMALE'}
@attribute 'region' {'INNER_CITY','RURAL','TOWN','SUBURBAN'}
@attribute 'income' real
@attribute 'married' {'YES','NO'}
@attribute 'children' real
@attribute 'car' {'YES','NO'}
@attribute 'save_act' {'YES','NO'}
@attribute 'current_act' {'YES','NO'}
@attribute 'mortgage' {'YES','NO'}
@attribute 'pep' {'YES','NO'}
@data
48,FEMALE,INNER_CITY,17546,NO,1,NO,NO,NO,NO,YES
40,MALE,TOWN,30085.1,YES,3,YES,NO,YES,YES,NO
...

New Cases:

@relation 'new-bank-data'
@attribute 'age' real
@attribute 'region' {'INNER_CITY','RURAL','TOWN','SUBURBAN'}
...
@attribute 'pep' {'YES','NO'}
@data
23,MALE,INNER_CITY,18766.9,YES,0,YES,YES,NO,YES,?
30,MALE,RURAL,9915.67,NO,1,NO,YES,NO,YES,?
28
Data File Format for See5/C5
Name file for Training Data:

pep.
age: continuous.
sex: MALE,FEMALE.
region: INNER_CITY,RURAL,TOWN,SUBURBAN.
income: continuous.
married: YES,NO.
children: continuous.
car: YES,NO.
save_act: YES,NO.
current_act: YES,NO.
mortgage: YES,NO.
pep: YES,NO.

Data file for Training Data:

48,FEMALE,INNER_CITY,17546,NO,1,NO,NO,NO,NO,YES
40,MALE,TOWN,30085.1,YES,3,YES,NO,YES,YES,NO
51,FEMALE,INNER_CITY,16575.4,YES,0,YES,YES,YES,NO,NO
...

Note: no “name” file is necessary for new cases, but the file name must use the same stem, and a suffix “.cases”:

23,MALE,INNER_CITY,18766.9,YES,0,YES,YES,NO,YES,?
30,MALE,RURAL,9915.67,NO,1,NO,YES,NO,YES,?
45,FEMALE,RURAL,21881.6,NO,0,YES,YES,YES,NO,?
...
29
C4.5 Implementation in Weka
To build a model (decision tree) using the classifiers.j48.J48 class, from the command line or using the simple java-based command line interface.

Command for building the model (additional parameters can be specified for pruning, cross validation, etc.)

Decision Tree Output (pruned):

children <= 2
|   children <= 0
|   |   married = YES
|   |   |   mortgage = YES
|   |   |   |   save_act = YES: NO (16.0/2.0)
|   |   |   |   save_act = NO: YES (9.0/1.0)
|   |   |   mortgage = NO: NO (59.0/6.0)
|   |   married = NO
|   |   |   mortgage = YES
|   |   |   |   save_act = YES: NO (12.0)
|   |   |   |   save_act = NO: YES (3.0)
|   |   |   mortgage = NO: YES (29.0/2.0)
|   children > 0
|   |   income <= 29622
|   |   |   children <= 1
|   |   |   |   income <= 12640.3: NO (5.0)
|   |   |   |   income > 12640.3
|   |   |   |   |   current_act = YES: YES (28.0/1.0)
|   |   |   |   |   current_act = NO
|   |   |   |   |   |   income <= 17390.1: NO (3.0)
|   |   |   |   |   |   income > 17390.1: YES (6.0)
|   |   |   children > 1: NO (47.0/3.0)
|   |   income > 29622: YES (48.0/2.0)
children > 2
|   income <= 43228.2: NO (30.0/2.0)
|   income > 43228.2: YES (5.0)
30
C4.5 Implementation in Weka
The rest of the output contains statistical information about the model, including confusion matrix, error rates, etc.

=== Error on training data ===
Correctly Classified Instances         281       93.6667 %
Incorrectly Classified Instances        19        6.3333 %
Mean absolute error                      0.1163
Root mean squared error                  0.2412
Relative absolute error                 23.496  %
Root relative squared error             48.4742 %
Total Number of Instances              300

=== Confusion Matrix ===
   a   b   <-- classified as
 122  13 |   a = YES
   6 159 |   b = NO

=== Stratified cross-validation ===
Correctly Classified Instances         274       91.3333 %
Incorrectly Classified Instances        26        8.6667 %
Mean absolute error                      0.1434
Root mean squared error                  0.291
Relative absolute error                 28.9615 %
Root relative squared error             58.4922 %
Total Number of Instances              300

=== Confusion Matrix ===
   a   b   <-- classified as
 118  17 |   a = YES
   9 156 |   b = NO

The model is now contained in the (binary) file <file-path-name>.model
31
C4.5 Implementation in Weka
Applying the model to new cases:
The output gives the predicted class for each new instance along with its predicted accuracy. Since we removed the “id” field, we now need to map these predictions to the original “new case” records (e.g., using Excel).

0    NO   0.875                ?
1    NO   1.0                  ?
2    YES  0.9310344827586207   ?
3    YES  0.9583333333333334   ?
4    NO   0.8983050847457628   ?
5    YES  0.9642857142857143   ?
6    NO   0.875                ?
7    YES  1.0                  ?
8    NO   0.9333333333333333   ?
9    YES  0.9642857142857143   ?
10   NO   0.875                ?
...
195  YES  0.9583333333333334   ?
196  NO   0.9361702127659575   ?
197  YES  1.0                  ?
198  NO   0.8983050847457628   ?
199  NO   0.9361702127659575   ?
32
Classification Using See5/C5
33
Classification Using See5/C5
Decision tree:

Class specified by attribute `pep'

** This demonstration version cannot process **
** more than 200 training or test cases.     **

Read 200 cases (11 attributes) from bank-train.data

income > 30085.1:
:...children > 0: YES (43/5)
:   children <= 0:
:   :...married = YES: NO (19/2)
:       married = NO:
:       :...mortgage = YES: NO (3)
:           mortgage = NO: YES (5)
income <= 30085.1:
:...children > 1: NO (50/4)
    children <= 1:
    :...children <= 0:
        :...save_act = YES: NO (27/5)
        :   save_act = NO:
        :   :...married = NO: YES (6)
        :       married = YES:
        :       :...mortgage = YES: YES (6
        :           mortgage = NO: NO (12/
        children > 0:
        :...income <= 12681.9: NO (5)
            income > 12681.9:
            :...current_act = YES: YES (19
                current_act = NO:
                :...car = YES: NO (2)
                    car = NO: YES (3)

Evaluation on training data (200 cases):

        Decision Tree
      ----------------
      Size      Errors
        13   19( 9.5%)   <<

       (a)   (b)    <-classified as
      ----  ----
        76    13    (a): class YES
         6   105    (b): class NO

Time: 0.1 secs
34
See5/C5: Applying Model to New Cases
Need to use the executable file “sample.exe” (note that source code is available to allow you to build the classifier into your applications).

Output classification file for new cases (along with the predicted accuracy for each new case), from the command line:

Case   Given    Predicted
 No    Class    Class
  1      ?      NO  [0.79]
  2      ?      NO  [0.86]
  3      ?      NO  [0.79]
  4      ?      YES [0.87]
  5      ?      NO  [0.79]
  ...
197      ?      NO  [0.90]
198      ?      YES [0.88]
199      ?      NO  [0.79]
200      ?      NO  [0.90]
35
Classification Using See5/C5
Building the model based on decision rules:
Rule 1: (31, lift 2.2)
    income > 12681.9
    children > 0
    children <= 1
    current_act = YES
    -> class YES [0.970]

Rule 2: (20, lift 2.1)
    income > 12681.9
    children > 0
    children <= 1
    car = NO
    -> class YES [0.955]

Rule 3: (17, lift 2.1)
    income > 30085.1
    married = NO
    mortgage = NO
    -> class YES [0.947]

Rule 4: (7, lift 2.0)
    married = NO
    children <= 0
    save_act = NO
    -> class YES [0.889]

Rule 5: (43/5, lift 1.9)
    income > 30085.1
    children > 0
    -> class YES [0.867]

Rule 6: (7/1, lift 1.7)
    children <= 0
    save_act = NO
    mortgage = YES
    -> class YES [0.778]

...
36
What is Clustering in Data Mining?
Clustering is a process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters
Helps users understand the natural grouping or structure in a data set
Cluster:
a collection of data objects that
are “similar” to one another and
thus can be treated collectively
as one group
but as a collection, they are
sufficiently different from other
groups
Clustering
unsupervised classification
no predefined classes
37
Requirements of Clustering Methods
Scalability
Dealing with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Able to deal with noise and outliers
Insensitive to order of input records
The curse of dimensionality
Interpretability and usability
38
Applications of Clustering
Clustering has wide applications in Pattern
Recognition
Spatial Data Analysis:
create thematic maps in GIS by clustering feature spaces
detect spatial clusters and explain them in spatial data mining
Image Processing
Market Research
Information Retrieval
Document or term categorization
Information visualization and IR interfaces
Web Mining
Cluster Web usage data to discover groups of similar access patterns
Web Personalization
39
Clustering Methodologies
Two general methodologies
Partitioning Based Algorithms
Hierarchical Algorithms
Partitioning Based
divide a set of N items into K clusters (top-down)
Hierarchical
agglomerative: pairs of items or clusters are successively linked to produce
larger clusters
divisive: start with the whole set as a cluster and successively divide sets into
smaller partitions
40
Distance or Similarity Measures
Measuring Distance
In order to group similar items, we need a way to measure the distance
between objects (e.g., records)
Note: distance = inverse of similarity
Often based on the representation of objects as “feature vectors”
Cosine similarity:
$$sim(X, Y) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}} \qquad dist(X, Y) = 1 - sim(X, Y)$$
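The same formula as a small Python sketch over plain feature lists (illustrative only):

import math

def cosine_sim(x, y):
    # sim(X, Y) = sum_i x_i*y_i / (sqrt(sum_i x_i^2) * sqrt(sum_i y_i^2))
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm

def cosine_dist(x, y):
    # dist(X, Y) = 1 - sim(X, Y)
    return 1.0 - cosine_sim(x, y)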
42
Distance or Similarity Measures
Weighting Attributes
in some cases we want some attributes to count more than others
associate a weight with each of the attributes in calculating distance, e.g.,
$$dist(X, Y) = \sqrt{w_1 (x_1 - y_1)^2 + \cdots + w_n (x_n - y_n)^2}$$
Normalization:
we often want values to fall between 0 and 1:
$$x'_i = \frac{x_i - \min x_i}{\max x_i - \min x_i}$$
other variations possible
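A sketch of the weighted distance and the min-max normalization above (names are illustrative):

import math

def weighted_euclidean(x, y, weights):
    # dist(X, Y) = sqrt(w_1*(x_1 - y_1)^2 + ... + w_n*(x_n - y_n)^2)
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))

def min_max_normalize(values):
    # Map each x_i to (x_i - min) / (max - min) so the values fall in [0, 1].
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]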
43
Distance or Similarity Measures
Example
$$x'_i = \frac{x_i - \min x_i}{\max x_i - \min x_i}$$
max distance for income: 100000 - 19000 = 79000
max distance for age: 52 - 27 = 25
44
Domain Specific Distance Functions
For some data sets, we may need to use specialized functions
we may want a single or a selected group of attributes to be used in the
computation of distance - same problem as “feature selection”
may want to use special properties of one or more attribute in the data
Example: Zip Codes
dist_zip(A, B) = 0, if zip codes are identical
dist_zip(A, B) = 0.1, if first 3 digits are identical
dist_zip(A, B) = 0.5, if first digits are identical
dist_zip(A, B) = 1, if first digits are different

Example: Customer Solicitation
dist_solicit(A, B) = 0, if both A and B responded
dist_solicit(A, B) = 0.1, if both A and B were chosen but did not respond
dist_solicit(A, B) = 0.5, if both A and B were chosen, but only one responded
dist_solicit(A, B) = 1, if one was chosen, but the other was not
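The zip-code example could be coded as below (a sketch that reads “first digits” as the leading digit; zip codes are assumed to be strings):

def dist_zip(a, b):
    # Tiered, domain-specific distance between two zip-code strings.
    if a == b:
        return 0.0
    if a[:3] == b[:3]:
        return 0.1
    if a[:1] == b[:1]:
        return 0.5
    return 1.0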
45
Distance (Similarity) Matrix
Similarity (Distance) Matrix
based on the distance or similarity measure we can construct a symmetric
matrix of distance (or similarity values)
(i, j) entry in the matrix is the distance (similarity) between items i and j
[Matrix: an n × n array with rows and columns indexed by items I1 … In and entries d_ij]

Note that d_ij = d_ji (i.e., the matrix is symmetric), so we only need the lower triangle part of the matrix. The diagonal is all 1's (similarity) or all 0's (distance).

$$sim(T_i, T_j) = \sum_{k=1}^{N} (w_{ik} \cdot w_{jk})$$

Term-Term Similarity Matrix:
     T1  T2  T3  T4  T5  T6  T7
T2    7
T3   16   8
T4   15  12  18
T5   14   3   6   6
T6   14  18  16  18   6
T7    9   6   0   6   9   2
T8    7  17   8   9   3  16   3
47
Similarity (Distance) Thresholds
A similarity (distance) threshold may be used to mark pairs that are
“sufficiently” similar
Original similarity matrix:
     T1  T2  T3  T4  T5  T6  T7
T2    7
T3   16   8
T4   15  12  18
T5   14   3   6   6
T6   14  18  16  18   6
T7    9   6   0   6   9   2
T8    7  17   8   9   3  16   3

Using a threshold value of 10 in the previous example:
     T1  T2  T3  T4  T5  T6  T7
T2    0
T3    1   0
T4    1   1   1
T5    1   0   0   0
T6    1   1   1   1   0
T7    0   0   0   0   0   0
T8    0   1   0   0   0   1   0
48
Graph Representation
The similarity matrix can be visualized as an undirected graph
each item is represented by a node, and edges represent the fact that two items
are similar (a one in the similarity threshold matrix)
     T1  T2  T3  T4  T5  T6  T7
T2    0
T3    1   0
T4    1   1   1
T5    1   0   0   0
T6    1   1   1   1   0
T7    0   0   0   0   0   0
T8    0   1   0   0   0   1   0

[Figure: undirected graph over nodes T1–T8 with an edge for each pair marked 1 in the matrix]

If no threshold is used, then the matrix can be represented as a weighted graph
49
Simple Clustering Algorithms
If we are interested only in threshold (and not the degree of
similarity or distance), we can use the graph directly for clustering
Clique Method (complete link)
all items within a cluster must be within the similarity threshold of all other
items in that cluster
clusters may overlap
generally produces small but very tight clusters
Single Link Method
any item in a cluster must be within the similarity threshold of at least one
other item in that cluster
produces larger but weaker clusters
Other methods
star method - start with an item and place all related items in that cluster
string method - start with an item; place one related item in that cluster; then place another item related to the last item entered, and so on
50
Simple Clustering Algorithms
Clique Method
a clique is a completely connected subgraph of a graph
in the clique method, each maximal clique in the graph becomes a cluster
Maximal cliques (and therefore the clusters) in the previous example are:
{T1, T3, T4, T6}
{T2, T4, T6}
{T2, T6, T8}
{T1, T5}
{T7}
Note that, for example, {T1, T3, T4} is also a clique, but is not maximal.
[Figure: the threshold graph over T1–T8 from the previous slide]
51
Simple Clustering Algorithms
Single Link Method
1. select an item not in a cluster and place it in a new cluster
2. place all other items similar to it in that cluster
3. repeat step 2 for each item in the cluster until nothing more can be added
4. repeat steps 1-3 for each item that remains unclustered
[Figure: single-link clusters formed over the threshold graph from the previous example]
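A sketch of the single link procedure above, given a similar(a, b) predicate derived from the threshold matrix (in effect it finds the connected components of the threshold graph):

def single_link_clusters(items, similar):
    # Grow each cluster by repeatedly pulling in any unclustered item that is
    # similar (above threshold) to something already in the cluster.
    unclustered = set(items)
    clusters = []
    while unclustered:
        seed = unclustered.pop()          # step 1: start a new cluster
        cluster = {seed}
        frontier = [seed]
        while frontier:                   # steps 2-3: expand until stable
            current = frontier.pop()
            for other in list(unclustered):
                if similar(current, other):
                    unclustered.remove(other)
                    cluster.add(other)
                    frontier.append(other)
        clusters.append(cluster)          # step 4: continue with the rest
    return clusters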
52
Clustering with Existing Clusters
The notion of comparing item similarities can be extended to clusters
themselves, by focusing on a representative vector for each cluster
cluster representatives can be actual items in the cluster or other “virtual”
representatives such as the centroid
this methodology reduces the number of similarity computations in clustering
clusters are revised successively until a stopping condition is satisfied, or until no
more changes to clusters can be made
Partitioning Methods
reallocation method - start with an initial assignment of items to clusters and then
move items from cluster to cluster to obtain an improved partitioning
Single pass method - simple and efficient, but produces large clusters, and
depends on order in which items are processed
Hierarchical Agglomerative Methods
starts with individual items and combines into clusters
then successively combine smaller clusters to form larger ones
grouping of individual items can be based on any of the methods discussed earlier
53
K-Means Algorithm
The basic algorithm (based on reallocation method):
1. select K data points as the initial representatives
2. for i = 1 to N, assign item xi to the most similar centroid (this gives K clusters)
3. for j = 1 to K, recalculate the cluster centroid Cj
4. repeat steps 2 and 3 until there is (little or) no change in clusters
T1 T2 T3 T4 T5 T6 T7 T8 C1 C2 C3
Doc1 0 4 0 0 0 2 1 3 4/2 0/2 2/2
Doc2 3 1 4 3 1 2 0 1 4/2 7/2 3/2
Doc3 3 0 0 0 3 0 3 0 3/2 0/2 3/2
Doc4 0 1 0 3 0 0 2 0 1/2 3/2 0/2
Doc5 2 2 2 3 1 4 0 2 4/2 5/2 5/2
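A compact sketch of the reallocation loop described above; the distance and mean functions are supplied by the caller (the worked example uses a similarity measure instead, which only changes the min to a max):

import random

def kmeans(points, k, distance, mean, max_iters=100):
    # 1. pick k initial representatives; 2-3. assign points to the nearest
    # centroid and recompute the centroids; 4. stop when assignments settle.
    centroids = random.sample(points, k)
    assignment = [None] * len(points)
    for _ in range(max_iters):
        new_assignment = [min(range(k), key=lambda j: distance(p, centroids[j]))
                          for p in points]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = mean(members)
    return assignment, centroids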
54
Example: K-Means
Example (continued)
Now using simple similarity measure, compute the new cluster-term similarity matrix
T1 T2 T3 T4 T5 T6 T7 T8
Class1 29/2 29/2 24/2 27/2 17/2 32/2 15/2 24/2
Class2 31/2 20/2 38/2 45/2 12/2 34/2 6/2 17/2
Class3 28/2 21/2 22/2 24/2 17/2 30/2 11/2 19/2
Assign to Class2 Class1 Class2 Class2 Class3 Class2 Class1 Class1
Now compute new cluster centroids using the original document-term matrix
T1 T2 T3 T4 T5 T6 T7 T8 C1 C2 C3
Doc1 0 4 0 0 0 2 1 3 8/3 2/4 0/1
Doc2 3 1 4 3 1 2 0 1 2/3 12/4 1/1
Doc3 3 0 0 0 3 0 3 0 3/3 3/4 3/1
Doc4 0 1 0 3 0 0 2 0 3/3 3/4 0/1
Doc5 2 2 2 3 1 4 0 2 4/3 11/4 1/1
The process is repeated until no further changes are made to the clusters
55
K-Means Algorithm
Strength of the k-means:
Relatively efficient: O(tkn), where n is # of objects, k is #
of clusters, and t is # of iterations. Normally, k, t << n
Often terminates at a local optimum
56
Hierarchical Algorithms
Use distance matrix as clustering criteria
does not require the number of clusters k as an input,
but needs a termination condition
57
Hierarchical Agglomerative Clustering
HAC starts with unclustered data and performs successive pairwise
joins among items (or previous clusters) to form larger ones
this results in a hierarchy of clusters which can be viewed as a dendrogram
useful in pruning search in a clustered item set, or in browsing clustering results
Some commonly used HACM methods
Single Link: at each step join most similar pairs of objects that are not yet
in the same cluster
Complete Link: use least similar pair between each cluster pair to
determine inter-cluster similarity - all items within one cluster are linked
to each other within a similarity threshold
Ward’s method: at each step join cluster pair whose merger minimizes the
increase in total within-group error sum of squares (based on distance
between centroids) - also called the minimum variance method
Group Average (Mean): use average value of pairwise links within a
cluster to determine inter-cluster similarity (i.e., all objects contribute to
inter-cluster similarity)
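For reference, SciPy's hierarchical clustering exposes these same linkage criteria; a small sketch (assuming scipy and matplotlib are installed; the random 9 x 2 data matrix is purely illustrative):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.rand(9, 2)                   # nine items described by two features
Z = linkage(X, method="ward")              # also: "single", "complete", "average"
dendrogram(Z, labels=list("ABCDEFGHI"))    # view the hierarchy as a dendrogram
plt.show()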
58
Hierarchical Agglomerative Clustering
Dendrogram for a hierarchy of clusters
[Figure: dendrogram over items A B C D E F G H I showing the successive merges into larger clusters]
59