
Data Mining Techniques:

Classification and Clustering


Today
 Classification
 Basic Concepts in Classification
 Decision Trees
 ID3/C4.5 Algorithm
 Bayesian Classification
 Tools and Software for Classification
 Clustering
 Distance Measures
 Graph-based Techniques
 K-Means Clustering
 Tools and Software for Clustering

2
What Is Classification?
 The goal of data classification is to organize and
categorize data in distinct classes
 A model is first created based on the data distribution
 The model is then used to classify new data
 Given the model, a class can be predicted for new data

 Classification = prediction for discrete and nominal values
 With classification, I can predict in which bucket to put the ball, but I can't predict the weight of the ball

3
Prediction, Clustering, Classification
 What is Prediction?
 The goal of prediction is to forecast or deduce the value of an
attribute based on values of other attributes
 A model is first created based on the data distribution
 The model is then used to predict future or unknown values

 Supervised vs. Unsupervised Classification


 Supervised Classification = Classification
 We know the class labels and the number of classes

 Unsupervised Classification = Clustering


 We do not know the class labels and may not know the
number of classes

4
Classification: 3 Step Process
 1. Model construction (Learning):
 Each record (instance) is assumed to belong to a predefined class, as
determined by one of the attributes, called the class label
 The set of all records used for construction of the model is called training
set
 The model is usually represented in the form of classification rules, (IF-THEN
statements) or decision trees

 2. Model Evaluation (Accuracy):


 Estimate accuracy rate of the model based on a test set
 The known label of test sample is compared with the classified result from
model
 Accuracy rate: percentage of test set samples correctly classified by the
model
 The test set is independent of the training set; otherwise over-fitting will occur

 3. Model Use (Classification):


 The model is used to classify unseen instances (assigning class labels)
 Predict the value of an actual attribute

5
Model Construction

6
Model Evaluation

7
Model Use: Classification

8
Classification Methods
Decision Tree Induction
 Neural Networks
 Bayesian Classification
 Association-Based Classification
 K-Nearest Neighbor
 Case-Based Reasoning
 Genetic Algorithms
 Fuzzy Sets
 Many More

9
Decision Trees
 A decision tree is a flow-chart-like tree structure
 Internal node denotes a test on an attribute (feature)
 Branch represents an outcome of the test
 All records in a branch have the same value for the tested
attribute
 Leaf node represents class label or class label distribution
Example tree: outlook -> sunny: test humidity (high: N, normal: P); overcast: P; rain: test windy (true: N, false: P)

10
Decision Trees
 Example: "is it a good day to play golf?"
 a set of attributes and their possible values:
outlook: sunny, overcast, rain
temperature: cool, mild, hot
humidity: high, normal
windy: true, false
 A particular instance in the training set might be:
<overcast, hot, normal, false>: play
 In this case, the target class is a binary attribute, so each instance represents a positive or a negative example.

11
Using Decision Trees for Classification
 Examples can be classified as follows
 1. look at the example's value for the feature specified
 2. move along the edge labeled with this value
 3. if you reach a leaf, return the label of the leaf
 4. otherwise, repeat from step 1
 Example (a decision tree to decide whether to go on a picnic):

outlook -> sunny: test humidity (high: N, normal: P); overcast: P; rain: test windy (true: N, false: P)

So a new instance <rainy, hot, normal, true> will be classified as "no play": outlook = rain -> test windy; windy = true -> N.

12
Decision Trees and Decision Rules
 If attributes are continuous, internal nodes may test against a threshold.

Example tree: outlook -> sunny: test humidity (> 75%: no, <= 75%: yes); overcast: yes; rain: test windy (> 20: no, <= 20: yes)

Each path in the tree represents a decision rule:

Rule1: If (outlook="sunny") AND (humidity<=0.75) Then (play="yes")
Rule2: If (outlook="rainy") AND (wind>20) Then (play="no")
Rule3: If (outlook="overcast") Then (play="yes")
...

13
Top-Down Decision Tree
Generation
 The basic approach usually consists of two phases:
 Tree construction
 At the start, all the training examples are at the root
 Examples are partitioned recursively based on selected attributes
 Tree pruning
 remove tree branches that may reflect noise in the training data
and lead to errors when classifying test data
 improve classification accuracy
 Basic Steps in Decision Tree Construction
 Tree starts as a single node representing all the data
 If the samples are all of the same class, the node becomes a leaf labeled with that class label
 Otherwise, select the feature that best separates the samples into individual classes.
 Recursion stops when:
 Samples in node belong to the same class (majority)
 There are no remaining attributes on which to split

14
Trees Construction Algorithm (ID3)
 Decision Tree Learning Method (ID3)
 Input: a set of examples S, a set of features F, and a target set T (target class T
represents the type of instance we want to classify, e.g., whether “to play golf”)
 1. If every element of S is already in T, return “yes”; if no element of S is in T
return “no”
 2. Otherwise, choose the best feature f from F (if there are no features
remaining, then return failure);
 3. Extend tree from f by adding a new branch for each attribute value
 4. Distribute training examples to leaf nodes (so each leaf node S is now the set
of examples at that node, and F is the remaining set of features not yet selected)
 5. Repeat steps 1-5 for each leaf node

 Main Question:
 how do we choose the best feature at each step?

Note: the ID3 algorithm only deals with categorical attributes, but can be extended (as in C4.5) to handle continuous attributes.
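
As a concrete illustration of the recursion above, here is a minimal ID3-style sketch in Python (an illustration, not code from the slides; the choose_best_feature argument stands in for the gain-based selection defined on the next slides, and leaves return the majority class label rather than a plain yes/no):

from collections import Counter

def id3(examples, features, target, choose_best_feature):
    """Recursive ID3-style sketch. `examples` is a list of dicts, `target` is the
    class attribute name, and `choose_best_feature` ranks candidate features
    (e.g., by information gain, defined on the next slides)."""
    labels = [ex[target] for ex in examples]
    # 1. all examples share one class -> return a leaf with that class
    if len(set(labels)) == 1:
        return labels[0]
    # 2. no features remaining -> leaf labeled with the majority class
    if not features:
        return Counter(labels).most_common(1)[0][0]
    best = choose_best_feature(examples, features, target)
    tree = {best: {}}
    # 3.-5. add a branch for each value of the chosen feature, distribute the
    # examples to the branches, and recurse on each branch
    for value in set(ex[best] for ex in examples):
        subset = [ex for ex in examples if ex[best] == value]
        remaining = [f for f in features if f != best]
        tree[best][value] = id3(subset, remaining, target, choose_best_feature)
    return tree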
15
Choosing the “Best” Feature
 Using Information Gain to find the “best” (most discriminating) feature
 Entropy, E(I) of a set of instance I, containing p positive and n negative examples

E(I) = -(p/(p+n))*log2(p/(p+n)) - (n/(p+n))*log2(n/(p+n))
 Gain(A, I) is the expected reduction in entropy due to feature (attribute) A

Gain(A, I) = E(I) - Σj [(pj + nj)/(p + n)] * E(Ij)

 the jth descendant Ij is the set of instances with value vj for A

Example: S: [9+,5-], so E(S) = -(9/14)*log2(9/14) - (5/14)*log2(5/14) = 0.940
Splitting on Outlook: sunny [2+,3-], overcast [4+,0-] ("yes", since all positive examples), rainy [3+,2-]

16
Decision Tree Learning - Example
Day  outlook   temp  humidity  wind    play
D1   sunny     hot   high      weak    No
D2   sunny     hot   high      strong  No
D3   overcast  hot   high      weak    Yes
D4   rain      mild  high      weak    Yes
D5   rain      cool  normal    weak    Yes
D6   rain      cool  normal    strong  No
D7   overcast  cool  normal    strong  Yes
D8   sunny     mild  high      weak    No
D9   sunny     cool  normal    weak    Yes
D10  rain      mild  normal    weak    Yes
D11  sunny     mild  normal    strong  Yes
D12  overcast  mild  high      strong  Yes
D13  overcast  hot   normal    weak    Yes
D14  rain      mild  high      strong  No

S: [9+,5-] (E = 0.940)

Splitting on humidity: high [3+,4-] (E = 0.985), normal [6+,1-] (E = 0.592)
Gain(S, humidity) = .940 - (7/14)*.985 - (7/14)*.592 = .151

Splitting on wind: weak [6+,2-] (E = 0.811), strong [3+,3-] (E = 1.00)
Gain(S, wind) = .940 - (8/14)*.811 - (6/14)*1.0 = .048

So, classifying examples by humidity provides more information gain than by wind. In this case, however, you can verify that outlook has the largest information gain, so it will be selected as the root.
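
The gains above can be reproduced with a short Python sketch (an illustration, not part of the original slides; the 14-example table is re-entered by hand):

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(examples, feature, target):
    labels = [ex[target] for ex in examples]
    gain = entropy(labels)
    for value in set(ex[feature] for ex in examples):
        subset = [ex[target] for ex in examples if ex[feature] == value]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain

# The 14-day play-golf training set from the table above.
rows = [("sunny","hot","high","weak","No"), ("sunny","hot","high","strong","No"),
        ("overcast","hot","high","weak","Yes"), ("rain","mild","high","weak","Yes"),
        ("rain","cool","normal","weak","Yes"), ("rain","cool","normal","strong","No"),
        ("overcast","cool","normal","strong","Yes"), ("sunny","mild","high","weak","No"),
        ("sunny","cool","normal","weak","Yes"), ("rain","mild","normal","weak","Yes"),
        ("sunny","mild","normal","strong","Yes"), ("overcast","mild","high","strong","Yes"),
        ("overcast","hot","normal","weak","Yes"), ("rain","mild","high","strong","No")]
cols = ["outlook", "temp", "humidity", "wind", "play"]
data = [dict(zip(cols, r)) for r in rows]

for f in ["outlook", "temp", "humidity", "wind"]:
    print(f, round(info_gain(data, f, "play"), 3))
# humidity and wind come out near the 0.151 and 0.048 above; outlook has the largest gain
# choose_best_feature for the ID3 sketch could be:
#   max(features, key=lambda f: info_gain(examples, f, target))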

17
Decision Tree Learning - Example
 Partially learned decision tree
S: [9+,5-]
Outlook {D1, D2, …, D14}

sunny overcast rainy

? yes ?
[2+,3-] [4+,0-] [3+,2-]
{D1, D2, D8, D9, D11} {D3, D7, D12, D13} {D4, D5, D6, D10, D14}

 which attribute should be tested here?


Ssunny = {D1, D2, D8, D9, D11}
Gain(Ssunny, humidity) = .970 - (3/5)*0.0 - (2/5)*0.0 = .970
Gain(Ssunny, temp) = .970 - (2/5)*0.0 - (2/5)*1.0 - (1/5)*0.0 = .570
Gain(Ssunny, wind) = .970 - (2/5)*1.0 - (3/5)*.918 = .019

18
Dealing With Continuous Variables
 Partition continuous attribute into a discrete set of intervals
 sort the examples according to the continuous attribute A
 identify adjacent examples that differ in their target classification
 generate a set of candidate thresholds midway
 problem: may generate too many intervals
 Another Solution:
 take a minimum threshold M of the examples of the majority class in each
adjacent partition; then merge adjacent partitions with the same majority class

Example: M = 3

Temperature  64  65  68  69  70 | 71  72  72  75  75 | 80  81  83  85
Play?        yes no  yes yes yes| no  no  yes yes yes| no  yes yes no
                       (threshold 70.5)      (threshold 77.5)

The first two partitions have the same majority class ("yes"), so they are merged.

Final mapping: temperature <= 77.5 ==> "yes"; temperature > 77.5 ==> "no"
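
A minimal sketch of the first approach, generating candidate thresholds midway between adjacent examples whose classes differ (illustration only; the merging step with the minimum threshold M is not shown):

def candidate_thresholds(values, labels):
    """Sort by the continuous attribute and propose a threshold midway
    between adjacent examples whose class labels differ."""
    pairs = sorted(zip(values, labels))
    thresholds = []
    for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
        if c1 != c2 and v1 != v2:
            thresholds.append((v1 + v2) / 2)
    return thresholds

temperature = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play = ["yes","no","yes","yes","yes","no","no","yes","yes","yes","no","yes","yes","no"]
print(candidate_thresholds(temperature, play))
# [64.5, 66.5, 70.5, 77.5, 80.5, 84.0] -- "too many intervals" before merging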

19
Improving on Information Gain
 Info. Gain tends to favor attributes with a large number of values
 larger distribution ==> lower entropy ==> larger Gain
 Quinlan suggests using Gain Ratio
 penalize for large number of values

SplitInfo(A, S) = -Σi (|Si| / |S|) * log2(|Si| / |S|)

GainRatio(A, S) = Gain(A, S) / SplitInfo(A, S)

 Example: "outlook" splits S: [9+,5-] into S1: [4+,0-] (overcast), S2: [2+,3-] (sunny), and S3: [3+,2-] (rainy)

SplitInfo(outlook, S) = -(4/14)*log2(4/14) - (5/14)*log2(5/14) - (5/14)*log2(5/14) = 1.577
GainRatio(outlook, S) = 0.246 / 1.577 = 0.156

20
Over-fitting in Classification
 A tree generated may over-fit the training examples
due to noise or too small a set of training data
 Two approaches to avoid over-fitting:
 (Stop earlier): Stop growing the tree earlier
 (Post-prune): Allow over-fit and then post-prune the tree
 Approaches to determine the correct final tree size:
 Separate training and testing sets or use cross-validation
 Use all the data for training, but apply a statistical test (e.g., chi-
square) to estimate whether expanding or pruning a node may
improve over entire distribution
 Use Minimum Description Length (MDL) principle: halting growth of
the tree when the encoding is minimized.
 Rule post-pruning (C4.5): converting to rules before
pruning

21
Pruning the Decision Tree
 A decision tree constructed using the training
data may need to be pruned
 over-fitting may result in branches or leaves based on too few
examples
 pruning is the process of removing branches and subtrees that
are generated due to noise; this improves classification
accuracy
 Subtree Replacement: merge a subtree into a
leaf node
 Using a set of data different from the training data
 At a tree node, if the accuracy without splitting is higher than the accuracy with splitting, replace the subtree with a leaf node; label it using the majority class.

(Figure: a subtree splitting on "color" with red and blue branches.) Suppose with the test set we find 3 red "no" examples and 1 blue "yes" example. We can replace the subtree with a single "no" node. After replacement there will be only 2 errors instead of 5.

22
Bayesian Classification
 It is a statistical classifier based on Bayes theorem
 It uses probabilistic learning by calculating explicit probabilities for
hypothesis
 A naïve Bayesian classifier, that assumes total independence between
attributes, is commonly used and performs well with large data sets
 The model is incremental in the sense that each training example can
incrementally increase or decrease the probability that a hypothesis is
correct. Prior knowledge can be combined with observed data
 Given a data sample X with an unknown class label, H
is the hypothesis that X belongs to a specific class C
 The conditional probability of hypothesis H given X, Pr(H|X), follows
the Bayes theorem:
Pr(H | X) = Pr(X | H) * Pr(H) / Pr(X)
 Practical difficulty: requires initial knowledge of
many probabilities, significant computational cost

23
Naïve Bayesian Classifier
 Suppose we have n classes C1 , C2 ,…,Cn. Given an
unknown sample X, the classifier will predict that
X=(x1 ,x2 ,…,xn) belongs to the class with the highest
conditional probability:
X is assigned to Ci if Pr(Ci | X) > Pr(Cj | X) for all 1 <= j <= n, j ≠ i

 Maximizing Pr(Ci | X) = Pr(X | Ci)*Pr(Ci) / Pr(X) is equivalent to maximizing Pr(X | Ci)*Pr(Ci), since Pr(X) is the same for all classes

 Note: Pr(Ci) = si / s, and Pr(X | Ci) = Πk=1..n Pr(xk | Ci), where Pr(xk | Ci) = sik / si

 Greatly reduces the computation cost, only count the


class distribution
 Naïve: class conditional independence

24
Naïve Bayesian Classifier -
Example
 Given a training set, we can compute the
probabilities

X = <sunny, mild, high, true>


Pr(X | “no”).Pr(“no”) = (3/5 . 2/5 . 4/5 . 3/5) . 5/14 = 0.04
Pr(X | “yes”).Pr(“yes”) = (2/9 . 4/9 . 3/9 . 3/9) . 9/14 = 0.007
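
Since 0.04 > 0.007, X is classified as "no". The calculation can be reproduced with a small sketch (an illustration, not from the slides; the conditional probabilities are read off the 14-example training set, and "windy = true" corresponds to "wind = strong" in the earlier table):

# Class priors and per-attribute conditional probabilities from the training set
priors = {"yes": 9/14, "no": 5/14}
cond = {
    "yes": {"outlook=sunny": 2/9, "temp=mild": 4/9, "humidity=high": 3/9, "windy=true": 3/9},
    "no":  {"outlook=sunny": 3/5, "temp=mild": 2/5, "humidity=high": 4/5, "windy=true": 3/5},
}

def nb_score(cls, x):
    """Pr(X | cls) * Pr(cls) under the class-conditional independence assumption."""
    score = priors[cls]
    for attr_value in x:
        score *= cond[cls][attr_value]
    return score

x = ["outlook=sunny", "temp=mild", "humidity=high", "windy=true"]
for cls in ("no", "yes"):
    print(cls, round(nb_score(cls, x), 4))
# no -> 0.0411 (~0.04), yes -> 0.0071 (~0.007); 0.04 > 0.007, so X is classified as "no"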
25
Classification Example - Bank Data
 Want to determine likely responders to a direct mail campaign
 a new product, a "Personal Equity Plan" (PEP)
 training data include records kept about how previous customers responded
and bought the product
 in this case the target class is “pep” with binary value
 want to build a model and apply it to new data (a customer list) in which the
value of the class attribute is not available
id age sex region income married children car save_act current_act mortgage pep
ID12101 48 FEMALE INNER_CITY 17546 NO 1 NO NO NO NO YES
ID12102 40 MALE TOWN 30085.1 YES 3 YES NO YES YES NO
ID12103 51 FEMALE INNER_CITY 16575.4 YES 0 YES YES YES NO NO
ID12104 23 FEMALE TOWN 20375.4 YES 3 NO NO YES NO NO
ID12105 57 FEMALE RURAL 50576.3 YES 0 NO YES NO NO NO
ID12106 57 FEMALE TOWN 37869.6 YES 2 NO YES YES NO YES
ID12107 22 MALE RURAL 8877.07 NO 0 NO NO YES NO YES
ID12108 58 MALE TOWN 24946.6 YES 0 YES YES YES NO NO
ID12109 37 FEMALE SUBURBAN 25304.3 YES 2 YES NO NO NO NO
ID12110 54 MALE TOWN 24212.1 YES 2 YES YES YES NO NO
ID12111 66 FEMALE TOWN 59803.9 YES 0 NO YES YES NO NO
… … … … … … … … … … … …

26
Data Preparation
 Several steps to prepare the data for Weka and for See5
 open the training data in Excel, remove the "id" column, and save the results as a comma-delimited file (e.g., "bank.csv")
 do the same with new customer data, but also add a new column called “pep”
as the last column; the value of this column for each record should be “?”

 Weka
 must convert the data to ARFF format
 attribute specification and data are in the same file
 the data portion is just the comma delimited data file without the label row

 See5/C5
 create a “name” file and a “data” file
 “name” file contains attribute specification; “data” file is same as above
 first line of “name” file must be the name(s) of the target class(es) - in this case
“pep”

27
Data File Format for Weka
Training Data:

@relation 'train-bank-data'
@attribute 'age' real
@attribute 'sex' {'MALE','FEMALE'}
@attribute 'region' {'INNER_CITY','RURAL','TOWN','SUBURBAN'}
@attribute 'income' real
@attribute 'married' {'YES','NO'}
@attribute 'children' real
@attribute 'car' {'YES','NO'}
@attribute 'save_act' {'YES','NO'}
@attribute 'current_act' {'YES','NO'}
@attribute 'mortgage' {'YES','NO'}
@attribute 'pep' {'YES','NO'}
@data
48,FEMALE,INNER_CITY,17546,NO,1,NO,NO,NO,NO,YES
40,MALE,TOWN,30085.1,YES,3,YES,NO,YES,YES,NO
. . .

New Cases:

@relation 'new-bank-data'
@attribute 'age' real
@attribute 'region' {'INNER_CITY','RURAL','TOWN','SUBURBAN'}
. . .
@attribute 'pep' {'YES','NO'}
@data
23,MALE,INNER_CITY,18766.9,YES,0,YES,YES,NO,YES,?
30,MALE,RURAL,9915.67,NO,1,NO,YES,NO,YES,?

28
Data File Format for See5/C5
Name file for Training Data:

pep.
age: continuous.
sex: MALE,FEMALE.
region: INNER_CITY,RURAL,TOWN,SUBURBAN.
income: continuous.
married: YES,NO.
children: continuous.
car: YES,NO.
save_act: YES,NO.
current_act: YES,NO.
mortgage: YES,NO.
pep: YES,NO.

Data file for Training Data:

48,FEMALE,INNER_CITY,17546,NO,1,NO,NO,NO,NO,YES
40,MALE,TOWN,30085.1,YES,3,YES,NO,YES,YES,NO
51,FEMALE,INNER_CITY,16575.4,YES,0,YES,YES,YES,NO,NO
. . .

New cases (note: no "name" file is necessary for new cases, but the file name must use the same stem and the suffix ".cases"):

23,MALE,INNER_CITY,18766.9,YES,0,YES,YES,NO,YES,?
30,MALE,RURAL,9915.67,NO,1,NO,YES,NO,YES,?
45,FEMALE,RURAL,21881.6,NO,0,YES,YES,YES,NO,?
...

29
C4.5 Implementation in Weka
 To build a model (decision tree) using the classifiers.j48.J48 class, from the command line or using the simple java-based command line interface

Command for building the model (additional parameters can be specified for pruning, cross validation, etc.):

> java weka.classifiers.j48.J48 -t <file-path-name>.arff -d <file-path-name>.model

Decision Tree Output (pruned):

children <= 2
| children <= 0
| | married = YES
| | | mortgage = YES
| | | | save_act = YES: NO (16.0/2.0)
| | | | save_act = NO: YES (9.0/1.0)
| | | mortgage = NO: NO (59.0/6.0)
| | married = NO
| | | mortgage = YES
| | | | save_act = YES: NO (12.0)
| | | | save_act = NO: YES (3.0)
| | | mortgage = NO: YES (29.0/2.0)
| children > 0
| | income <= 29622
| | | children <= 1
| | | | income <= 12640.3: NO (5.0)
| | | | income > 12640.3
| | | | | current_act = YES: YES (28.0/1.0)
| | | | | current_act = NO
| | | | | | income <= 17390.1: NO (3.0)
| | | | | | income > 17390.1: YES (6.0)
| | | children > 1: NO (47.0/3.0)
| | income > 29622: YES (48.0/2.0)
children > 2
| income <= 43228.2: NO (30.0/2.0)
| income > 43228.2: YES (5.0)

30
C4.5 Implementation in Weka
The rest of the output contains statistical information about the model, including the confusion matrix, error rates, etc. The model is now contained in the (binary) file <file-path-name>.model

=== Error on training data ===

Correctly Classified Instances       281    93.6667 %
Incorrectly Classified Instances      19     6.3333 %
Mean absolute error                    0.1163
Root mean squared error                0.2412
Relative absolute error               23.496  %
Root relative squared error           48.4742 %
Total Number of Instances            300

=== Confusion Matrix ===

   a   b   <-- classified as
 122  13 |  a = YES
   6 159 |  b = NO

=== Stratified cross-validation ===

Correctly Classified Instances       274    91.3333 %
Incorrectly Classified Instances      26     8.6667 %
Mean absolute error                    0.1434
Root mean squared error                0.291
Relative absolute error               28.9615 %
Root relative squared error           58.4922 %
Total Number of Instances            300

=== Confusion Matrix ===

   a   b   <-- classified as
 118  17 |  a = YES
   9 156 |  b = NO

31
C4.5 Implementation in Weka
 Applying the model to new cases:

java weka.classifiers.j48.J48 -p -l <file-path-name>.model -T <newdatafile-path-name>.arff

The output gives the predicted class for each new instance along with its predicted accuracy. Since we removed the "id" field, we now need to map these predictions to the original "new case" records (e.g. using Excel).

0    NO   0.875               ?
1    NO   1.0                 ?
2    YES  0.9310344827586207  ?
3    YES  0.9583333333333334  ?
4    NO   0.8983050847457628  ?
5    YES  0.9642857142857143  ?
6    NO   0.875               ?
7    YES  1.0                 ?
8    NO   0.9333333333333333  ?
9    YES  0.9642857142857143  ?
10   NO   0.875               ?
. . .
195  YES  0.9583333333333334  ?
196  NO   0.9361702127659575  ?
197  YES  1.0                 ?
198  NO   0.8983050847457628  ?
199  NO   0.9361702127659575  ?

32
Classification Using See5/C5
 Note: See5/C5 also has options for creating the model as a set of decision rules (as opposed to a decision tree)
 Either model can be used to classify new unclassified cases.

33
Classification Using See5/C5
Class specified by attribute `pep'

** This demonstration version cannot process **
** more than 200 training or test cases.     **

Read 200 cases (11 attributes) from bank-train.data

Decision tree:

income > 30085.1:
:...children > 0: YES (43/5)
:   children <= 0:
:   :...married = YES: NO (19/2)
:       married = NO:
:       :...mortgage = YES: NO (3)
:           mortgage = NO: YES (5)
income <= 30085.1:
:...children > 1: NO (50/4)
    children <= 1:
    :...children <= 0:
        :...save_act = YES: NO (27/5)
        :   save_act = NO:
        :   :...married = NO: YES (6)
        :       married = YES:
        :       :...mortgage = YES: YES (6
        :           mortgage = NO: NO (12/
        children > 0:
        :...income <= 12681.9: NO (5)
            income > 12681.9:
            :...current_act = YES: YES (19
                current_act = NO:
                :...car = YES: NO (2)
                    car = NO: YES (3)

Evaluation on training data (200 cases):

Decision Tree
----------------
Size      Errors
  13    19( 9.5%)   <<

 (a)   (b)    <-classified as
----  ----
  76    13    (a): class YES
   6   105    (b): class NO

Time: 0.1 secs
34
See5/C5: Applying Model to New Cases
 Need to use the executable file "sample.exe" (note that source code is available to allow you to build the classifier into your own applications)
 From the Command Line:

prompt> sample -f bank-train > bank-new.out

Output classification file for new cases (along with the predicted accuracy for each new case):

Case   Given   Predicted
No     Class   Class
1      ?       NO  [0.79]
2      ?       NO  [0.86]
3      ?       NO  [0.79]
4      ?       YES [0.87]
5      ?       NO  [0.79]
. . .
197    ?       NO  [0.90]
198    ?       YES [0.88]
199    ?       NO  [0.79]
200    ?       NO  [0.90]

35
Classification Using See5/C5
 Building the model based on decision rules:

Rule 1: (31, lift 2.2)
    income > 12681.9
    children > 0
    children <= 1
    current_act = YES
    -> class YES [0.970]

Rule 2: (20, lift 2.1)
    income > 12681.9
    children > 0
    children <= 1
    car = NO
    -> class YES [0.955]

Rule 3: (17, lift 2.1)
    income > 30085.1
    married = NO
    mortgage = NO
    -> class YES [0.947]

Rule 4: (7, lift 2.0)
    married = NO
    children <= 0
    save_act = NO
    -> class YES [0.889]

Rule 5: (43/5, lift 1.9)
    income > 30085.1
    children > 0
    -> class YES [0.867]

Rule 6: (7/1, lift 1.7)
    children <= 0
    save_act = NO
    mortgage = YES
    -> class YES [0.778]

. . .

36
What is Clustering in Data Mining?

Clustering is a process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters.

Helps users understand the natural grouping or structure in a data set.
 Cluster:
 a collection of data objects that
are “similar” to one another and
thus can be treated collectively
as one group
 but as a collection, they are
sufficiently different from other
groups

 Clustering
 unsupervised classification
 no predefined classes

37
Requirements of Clustering Methods
 Scalability
 Dealing with different types of attributes
 Discovery of clusters with arbitrary shape
 Minimal requirements for domain knowledge to determine input parameters
 Able to deal with noise and outliers
 Insensitive to order of input records
 The curse of dimensionality
 Interpretability and usability

38
Applications of Clustering
 Clustering has wide applications in Pattern
Recognition
 Spatial Data Analysis:
 create thematic maps in GIS by clustering feature spaces
 detect spatial clusters and explain them in spatial data mining

 Image Processing
 Market Research
 Information Retrieval
 Document or term categorization
 Information visualization and IR interfaces

 Web Mining
 Cluster Web usage data to discover groups of similar access patterns
 Web Personalization

39
Clustering Methodologies
 Two general methodologies
 Partitioning Based Algorithms
 Hierarchical Algorithms

 Partitioning Based
 divide a set of N items into K clusters (top-down)

 Hierarchical
 agglomerative: pairs of items or clusters are successively linked to produce
larger clusters
 divisive: start with the whole set as a cluster and successively divide sets into
smaller partitions

40
Distance or Similarity Measures
 Measuring Distance
 In order to group similar items, we need a way to measure the distance
between objects (e.g., records)
 Note: distance = inverse of similarity
 Often based on the representation of objects as “feature vectors”

An Employee DB:

ID  Gender  Age  Salary
1   F       27   19,000
2   M       51   64,000
3   M       52   100,000
4   F       33   55,000
5   M       45   45,000

Term Frequencies for Documents:

       T1  T2  T3  T4  T5  T6
Doc1    0   4   0   0   0   2
Doc2    3   1   4   3   1   2
Doc3    3   0   0   0   3   0
Doc4    0   1   0   3   0   0
Doc5    2   2   2   3   1   4

Which objects are more similar?
41
Distance or Similarity Measures
 Properties of Distance Measures:
 for all objects A and B, dist(A, B) >= 0, and dist(A, B) = dist(B, A)
 for any object A, dist(A, A) = 0
 dist(A, C) <= dist(A, B) + dist(B, C)
 Common Distance Measures: let X = <x1, x2, ..., xn> and Y = <y1, y2, ..., yn>
 Manhattan distance: dist(X, Y) = |x1 - y1| + |x2 - y2| + ... + |xn - yn|
 Euclidean distance: dist(X, Y) = sqrt( (x1 - y1)^2 + ... + (xn - yn)^2 )
(Both can be normalized to make values fall between 0 and 1.)
 Cosine similarity: sim(X, Y) = Σi (xi*yi) / sqrt( Σi xi^2 * Σi yi^2 ), and dist(X, Y) = 1 - sim(X, Y)
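
A minimal Python sketch of the three measures applied to two of the document vectors from the previous slide (illustration only):

import math

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_sim(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm

# Term-frequency vectors for Doc1 and Doc2 from the table on the previous slide
doc1 = [0, 4, 0, 0, 0, 2]
doc2 = [3, 1, 4, 3, 1, 2]
print(manhattan(doc1, doc2), round(euclidean(doc1, doc2), 2), round(1 - cosine_sim(doc1, doc2), 2))
# -> 14  6.63  0.72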

42
Distance or Similarity Measures
 Weighting Attributes
 in some cases we want some attributes to count more than others
 associate a weight with each of the attributes in calculating distance, e.g.,

dist(X, Y) = sqrt( w1*(x1 - y1)^2 + ... + wn*(xn - yn)^2 )

 Nominal (categorical) Attributes


 can use simple matching: distance = 0 if the values match, 1 otherwise
 or convert each nominal attribute to a set of binary attribute, then use the usual
distance measure
 if all attributes are nominal, we can normalize by dividing the number of
matches by the total number of attributes

 Normalization:
 want values to fall between 0 and 1: x'i = (xi - min xi) / (max xi - min xi)
 other variations possible

43
Distance or Similarity Measures
 Example
 max distance for salary: 100,000 - 19,000 = 81,000
 max distance for age: 52 - 27 = 25
 normalized values: x'i = (xi - min xi) / (max xi - min xi)

Original:                        Normalized:
ID  Gender  Age  Salary          ID  Gender  Age   Salary
1   F       27   19,000          1   1       0.00  0.00
2   M       51   64,000          2   0       0.96  0.56
3   M       52   100,000         3   0       1.00  1.00
4   F       33   55,000          4   1       0.24  0.44
5   M       45   45,000          5   0       0.72  0.32

 dist(ID2, ID3) = SQRT( 0 + (0.04)^2 + (0.44)^2 ) = 0.44
 dist(ID2, ID4) = SQRT( 1 + (0.72)^2 + (0.12)^2 ) = 1.24
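
These two distances can be checked with a short sketch (illustration only; gender is treated as a binary match and age/salary are min-max normalized as above):

import math

employees = {1: ("F", 27, 19000), 2: ("M", 51, 64000), 3: ("M", 52, 100000),
             4: ("F", 33, 55000), 5: ("M", 45, 45000)}

ages = [e[1] for e in employees.values()]
salaries = [e[2] for e in employees.values()]

def normalize(rec):
    gender, age, salary = rec
    return (1.0 if gender == "F" else 0.0,
            (age - min(ages)) / (max(ages) - min(ages)),
            (salary - min(salaries)) / (max(salaries) - min(salaries)))

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(normalize(a), normalize(b))))

print(round(dist(employees[2], employees[3]), 2))   # 0.45 (the slide gets 0.44 from values rounded to two decimals)
print(round(dist(employees[2], employees[4]), 2))   # 1.24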

44
Domain Specific Distance Functions
 For some data sets, we may need to use specialized functions
 we may want a single or a selected group of attributes to be used in the
computation of distance - same problem as “feature selection”
 may want to use special properties of one or more attribute in the data

Example: Zip Codes
dist_zip(A, B) = 0, if zip codes are identical
dist_zip(A, B) = 0.1, if the first 3 digits are identical
dist_zip(A, B) = 0.5, if the first digits are identical
dist_zip(A, B) = 1, if the first digits are different

 natural distance functions may exist in the data

Example: Customer Solicitation
dist_solicit(A, B) = 0, if both A and B responded
dist_solicit(A, B) = 0.1, if both A and B were chosen but did not respond
dist_solicit(A, B) = 0.5, if both A and B were chosen, but only one responded
dist_solicit(A, B) = 1, if one was chosen, but the other was not

45
Distance (Similarity) Matrix
 Similarity (Distance) Matrix
 based on the distance or similarity measure we can construct a symmetric
matrix of distance (or similarity values)
 (i, j) entry in the matrix is the distance (similarity) between items i and j

      I1    I2    ...   In
I1    -     d12   ...   d1n
I2    d21   -     ...   d2n
...
In    dn1   dn2   ...   -

dij = similarity (or distance) of Di to Dj

Note that dij = dji (i.e., the matrix is symmetric), so we only need the lower-triangle part of the matrix. The diagonal is all 1's (similarity) or all 0's (distance).


46
Example: Term Similarities in Documents
T1 T2 T3 T4 T5 T6 T7 T8
Doc1 0 4 0 0 0 2 1 3
Doc2 3 1 4 3 1 2 0 1
Doc3 3 0 0 0 3 0 3 0
Doc4 0 1 0 3 0 0 2 0
Doc5 2 2 2 3 1 4 0 2

sim(Ti, Tj) = Σk=1..N (wik * wjk)

Term-Term Similarity Matrix:

     T1   T2   T3   T4   T5   T6   T7
T2    7
T3   16    8
T4   15   12   18
T5   14    3    6    6
T6   14   18   16   18    6
T7    9    6    0    6    9    2
T8    7   17    8    9    3   16    3

47
Similarity (Distance) Thresholds
 A similarity (distance) threshold may be used to mark pairs that are
“sufficiently” similar

     T1   T2   T3   T4   T5   T6   T7
T2    7
T3   16    8
T4   15   12   18
T5   14    3    6    6
T6   14   18   16   18    6
T7    9    6    0    6    9    2
T8    7   17    8    9    3   16    3

Using a threshold value of 10 in the previous example:

     T1   T2   T3   T4   T5   T6   T7
T2    0
T3    1    0
T4    1    1    1
T5    1    0    0    0
T6    1    1    1    1    0
T7    0    0    0    0    0    0
T8    0    1    0    0    0    1    0
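
Both matrices can be reproduced with a short sketch over the document-term matrix from the earlier slide (illustration only):

# Document-term matrix from the earlier slide (rows: Doc1..Doc5, columns: T1..T8)
D = [[0, 4, 0, 0, 0, 2, 1, 3],
     [3, 1, 4, 3, 1, 2, 0, 1],
     [3, 0, 0, 0, 3, 0, 3, 0],
     [0, 1, 0, 3, 0, 0, 2, 0],
     [2, 2, 2, 3, 1, 4, 0, 2]]

n_terms = len(D[0])

def sim(i, j):
    """sim(Ti, Tj) = sum over documents of w_ik * w_jk."""
    return sum(row[i] * row[j] for row in D)

threshold = 10
for i in range(1, n_terms):            # lower triangle only (the matrix is symmetric)
    row = [1 if sim(i, j) >= threshold else 0 for j in range(i)]
    print("T%d" % (i + 1), row)
# e.g. the T8 row comes out as [0, 1, 0, 0, 0, 1, 0], matching the table above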

48
Graph Representation
 The similarity matrix can be visualized as an undirected graph
 each item is represented by a node, and edges represent the fact that two items
are similar (a one in the similarity threshold matrix)

     T1   T2   T3   T4   T5   T6   T7
T2    0
T3    1    0
T4    1    1    1
T5    1    0    0    0
T6    1    1    1    1    0
T7    0    0    0    0    0    0
T8    0    1    0    0    0    1    0

(Figure: the corresponding undirected graph over nodes T1-T8.)

If no threshold is used, then the matrix can be represented as a weighted graph.

49
Simple Clustering Algorithms
 If we are interested only in threshold (and not the degree of
similarity or distance), we can use the graph directly for clustering
 Clique Method (complete link)
 all items within a cluster must be within the similarity threshold of all other
items in that cluster
 clusters may overlap
 generally produces small but very tight clusters
 Single Link Method
 any item in a cluster must be within the similarity threshold of at least one
other item in that cluster
 produces larger but weaker clusters
 Other methods
 star method - start with an item and place all related items in that cluster
 string method - start with an item; place one related item in that cluster; then place another item related to the last item entered, and so on

50
Simple Clustering Algorithms
 Clique Method
 a clique is a completely connected subgraph of a graph
 in the clique method, each maximal clique in the graph becomes a cluster

Maximal cliques (and therefore the clusters) in the previous example are:
{T1, T3, T4, T6}
{T2, T4, T6}
{T2, T6, T8}
{T1, T5}
{T7}

Note that, for example, {T1, T3, T4} is also a clique, but is not maximal.

51
Simple Clustering Algorithms
 Single Link Method
 select an item not in a cluster and place it in a new cluster
 place all other similar items in that cluster
 repeat step 2 for each item in the cluster until nothing more can be added
 repeat steps 1-3 for each item that remains unclustered

In this case the single link method produces only two clusters:
{T1, T3, T4, T5, T6, T2, T8}
{T7}

Note that the single link method does not allow overlapping clusters, thus partitioning the set of items.
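
At a fixed threshold, single-link clusters are exactly the connected components of the threshold graph, so the two clusters above can be recovered with a short sketch (illustration only; the edge list is read off the threshold matrix):

# Edges of the threshold graph (pairs with similarity >= 10 from the earlier matrix)
edges = [("T1","T3"), ("T1","T4"), ("T1","T5"), ("T1","T6"), ("T2","T4"),
         ("T2","T6"), ("T2","T8"), ("T3","T4"), ("T3","T6"), ("T4","T6"), ("T6","T8")]
items = ["T1","T2","T3","T4","T5","T6","T7","T8"]

adj = {t: set() for t in items}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

clusters, seen = [], set()
for start in items:                       # grow one cluster per unvisited item
    if start in seen:
        continue
    cluster, stack = set(), [start]
    while stack:                          # add anything similar to something already in the cluster
        t = stack.pop()
        if t not in cluster:
            cluster.add(t)
            stack.extend(adj[t] - cluster)
    seen |= cluster
    clusters.append(cluster)

print(clusters)   # two clusters: {T1, T2, T3, T4, T5, T6, T8} and {T7}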

52
Clustering with Existing Clusters
 The notion of comparing item similarities can be extended to clusters
themselves, by focusing on a representative vector for each cluster
 cluster representatives can be actual items in the cluster or other “virtual”
representatives such as the centroid
 this methodology reduces the number of similarity computations in clustering
 clusters are revised successively until a stopping condition is satisfied, or until no
more changes to clusters can be made
 Partitioning Methods
 reallocation method - start with an initial assignment of items to clusters and then
move items from cluster to cluster to obtain an improved partitioning
 Single pass method - simple and efficient, but produces large clusters, and
depends on order in which items are processed
 Hierarchical Agglomerative Methods
 starts with individual items and combines into clusters
 then successively combine smaller clusters to form larger ones
 grouping of individual items can be based on any of the methods discussed earlier

53
K-Means Algorithm
 The basic algorithm (based on reallocation method):
1. select K data points as the initial representatives
2. for i = 1 to N, assign item xi to the most similar centroid (this gives K clusters)
3. for j = 1 to K, recalculate the cluster centroid Cj
4. repeat steps 2 and 3 until there is (little or) no change in clusters

 Example: Clustering Terms

Initial (arbitrary) assignment:


C1 = {T1,T2}, C2 = {T3,T4}, C3 = {T5,T6} (cluster centroids are shown in the C1, C2, C3 columns below)

T1 T2 T3 T4 T5 T6 T7 T8 C1 C2 C3
Doc1 0 4 0 0 0 2 1 3 4/2 0/2 2/2
Doc2 3 1 4 3 1 2 0 1 4/2 7/2 3/2
Doc3 3 0 0 0 3 0 3 0 3/2 0/2 3/2
Doc4 0 1 0 3 0 0 2 0 1/2 3/2 0/2
Doc5 2 2 2 3 1 4 0 2 4/2 5/2 5/2

54
Example: K-Means
 Example (continued)
Now using simple similarity measure, compute the new cluster-term similarity matrix

T1 T2 T3 T4 T5 T6 T7 T8
Class1 29/2 29/2 24/2 27/2 17/2 32/2 15/2 24/2
Class2 31/2 20/2 38/2 45/2 12/2 34/2 6/2 17/2
Class3 28/2 21/2 22/2 24/2 17/2 30/2 11/2 19/2
Assign to Class2 Class1 Class2 Class2 Class3 Class2 Class1 Class1

Now compute new cluster centroids using the original document-term matrix

T1 T2 T3 T4 T5 T6 T7 T8 C1 C2 C3
Doc1 0 4 0 0 0 2 1 3 8/3 2/4 0/1
Doc2 3 1 4 3 1 2 0 1 2/3 12/4 1/1
Doc3 3 0 0 0 3 0 3 0 3/3 3/4 3/1
Doc4 0 1 0 3 0 0 2 0 3/3 3/4 0/1
Doc5 2 2 2 3 1 4 0 2 4/3 11/4 1/1

The process is repeated until no further changes are made to the clusters
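
For completeness, a generic k-means sketch over numeric vectors in Python (illustration only; it uses Euclidean distance and random initial centroids rather than the dot-product similarity of the term example above):

import math, random

def kmeans(points, k, iters=100):
    centroids = random.sample(points, k)               # 1. pick K initial representatives
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                # 2. assign each item to the nearest centroid
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        new_centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)            # 3. recompute each cluster centroid
        ]
        if new_centroids == centroids:                  # 4. stop when nothing changes
            break
        centroids = new_centroids
    return clusters, centroids

points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8), (9.0, 1.0)]
clusters, centroids = kmeans(points, k=2)
print(centroids)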

55
K-Means Algorithm
 Strength of the k-means:
 Relatively efficient: O(tkn), where n is # of objects, k is #
of clusters, and t is # of iterations. Normally, k, t << n
 Often terminates at a local optimum

 Weakness of the k-means:


 Applicable only when mean is defined; what about
categorical data?
 Need to specify k, the number of clusters, in advance
 Unable to handle noisy data and outliers

 Variations of K-Means usually differ in:


 Selection of the initial k means
 Dissimilarity calculations
 Strategies to calculate cluster means

56
Hierarchical Algorithms
 Use distance matrix as clustering criteria
 does not require the number of clusters k as an input,
but needs a termination condition

57
Hierarchical Agglomerative Clustering
 HAC starts with unclustered data and performs successive pairwise
joins among items (or previous clusters) to form larger ones
 this results in a hierarchy of clusters which can be viewed as a dendrogram
 useful in pruning search in a clustered item set, or in browsing clustering results
 Some commonly used HACM methods
 Single Link: at each step join most similar pairs of objects that are not yet
in the same cluster
 Complete Link: use least similar pair between each cluster pair to
determine inter-cluster similarity - all items within one cluster are linked
to each other within a similarity threshold
 Ward’s method: at each step join cluster pair whose merger minimizes the
increase in total within-group error sum of squares (based on distance
between centroids) - also called the minimum variance method
 Group Average (Mean): use average value of pairwise links within a
cluster to determine inter-cluster similarity (i.e., all objects contribute to
inter-cluster similarity)
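
In practice these HACM variants are usually taken from a library rather than hand-coded; a small sketch using SciPy (an external library, not mentioned in the slides) that builds and cuts a single-link hierarchy:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Toy 2-D items; any feature vectors (or a precomputed distance matrix) can be used
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

Z = linkage(X, method="single")                    # also: "complete", "average", "ward"
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the dendrogram into 2 clusters
print(labels)
# dendrogram(Z) would draw the hierarchy (requires matplotlib)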
58
Hierarchical Agglomerative Clustering
Dendrogram for a hierarchy of clusters (figure: leaves are items A through I)

59
