
DLWSS551: Data Mining

Chapter 3: Algorithms – Basic Methods


Part I

Harris Papadopoulos
Algorithms: The Basic Methods
• Inferring rudimentary rules
• Statistical modeling
• Constructing decision trees
• Constructing rules
• Association rule learning
• Linear models
• Instance-based learning

We will cover the first four in Part I and the remaining three in Part II.
Simplicity First
• Simple algorithms often work very well!
• There are many kinds of simple structures, e.g.:
• One attribute does all the work
• All attributes contribute equally & independently
• A weighted linear combination might do
• Instance-based: use a few prototypes
• Use simple logical rules
• Success of method depends on the domain

3
Inferring Rudimentary Rules
• 1R: learns a 1-level decision tree
• I.e., rules that all test one particular attribute
• Basic version
• One branch for each value
• Each branch assigns most frequent class
• Error rate: proportion of instances that don’t belong to
the majority class of their corresponding branch
• Choose attribute with lowest error rate
(assumes nominal attributes)

4
Pseudo-code for 1R

For each attribute,
    For each value of the attribute, make a rule as follows:
        count how often each class appears
        find the most frequent class
        make the rule assign that class to this attribute-value
    Calculate the error rate of the rules
Choose the rules with the smallest error rate

• Note: “missing” is treated as a separate attribute value

5
Evaluating the Weather Attributes

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

Attribute  Rules            Errors  Total errors
Outlook    Sunny → No       2/5     4/14
           Overcast → Yes   0/4
           Rainy → Yes      2/5
Temp       Hot → No*        2/4     5/14
           Mild → Yes       2/6
           Cool → Yes       1/4
Humidity   High → No        3/7     4/14
           Normal → Yes     1/7
Windy      False → Yes      2/8     5/14
           True → No*       3/6

* indicates a tie
6

Given the data in the weather data table above, 1R goes through the attributes one by one and
assigns a class to each attribute value:
For Outlook: Outlook = Sunny in 5 instances and out of these 5, Play = No in the majority
(3 out of 5). Outlook = Overcast in 4 instances and Play = Yes in all 4. Outlook = Rainy in 5
instances and out of these 5, Play = Yes in the majority (3 out of 5).
- So the rule set for Outlook is: if Outlook = Sunny then Play = No, if Outlook = Overcast
then Play = Yes, if Outlook = Rainy then Play = Yes. This rule set makes 4 errors in total
over the data – 2 errors by the 1st rule and 2 by the 3rd rule.
For Temp: Temp = Hot in 4 instances, for 2 of which Play = Yes and for the other 2 Play =
No. Temp = Mild in 6 instances and out of these 6, Play = Yes in the majority (4 out of 6).
Temp = Cool in 4 instances and out of these 4, Play = Yes in the majority (3 out of 4).
- So by randomly breaking the tie for Temp = Hot, the rule set for Temp is: if Temp = Hot
then Play = No, if Temp = Mild then Play = Yes, if Temp = Cool then Play = Yes. This rule
set makes 5 errors in total over the data – 2 errors by the 1st rule, 2 by the 2nd rule and 1 by
the 3rd.
The same process gives the rule sets for the two remaining attributes. Then 1R chooses the
rule set with the smallest number of errors – that is, the 1st and 3rd rule sets. Since there is a
tie, 1R will choose randomly one of these 2 rule sets.
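
To make this concrete, here is a minimal Python sketch of 1R (an illustration with my own function and variable names, not the course's reference implementation). Run on the weather data it reproduces the totals above: 4/14 errors for Outlook and Humidity, 5/14 for Temp and Windy.

from collections import Counter

# Weather data: (Outlook, Temp, Humidity, Windy, Play)
rows = """Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rainy Mild High True No"""
data = [tuple(line.split()) for line in rows.splitlines()]
attributes = ["Outlook", "Temp", "Humidity", "Windy"]

def one_r(data, attributes):
    best = None
    for i, name in enumerate(attributes):
        # Count how often each class appears for each value of this attribute.
        counts = {}
        for row in data:
            counts.setdefault(row[i], Counter())[row[-1]] += 1
        # Rule set: each value predicts its most frequent class (ties broken arbitrarily).
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        print(f"{name}: {rules}  errors = {errors}/{len(data)}")
        if best is None or errors < best[2]:
            best = (name, rules, errors)
    return best

print("1R chooses:", one_r(data, attributes))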
Dealing with Numeric Attributes
• Discretize numeric attributes
• Divide each attribute’s range into intervals
• Sort instances according to attribute’s values
• Place breakpoints where class changes (majority class)
• This minimizes the total error
• Example: temperature from weather data
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No

Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     75           80        False  Yes
…         …            …         …      …
7
The problem of Overfitting
• This procedure is very sensitive to noise
• One instance with an incorrect class label will probably
produce a separate interval
• Also: time stamp attribute will have zero errors!
• Simple solution:
enforce minimum number of instances in majority class
per interval
• Example (with min = 3):
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No

64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No

8
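
The partitioning with a minimum bucket size can be sketched roughly as follows; this is a simplified illustration (helper names are my own), not the exact procedure, which has a few additional refinements not modelled here. On the temperature example it yields one breakpoint at 77.5, matching the rule set on the next slide.

from collections import Counter

def discretize_1r(values, classes, min_bucket=3):
    """Simplified sketch of 1R discretization with a minimum bucket size."""
    pairs = sorted(zip(values, classes))
    intervals = []                      # (majority_class, last_value_in_interval)
    counts = Counter()
    for i, (value, cls) in enumerate(pairs):
        counts[cls] += 1
        majority, majority_count = counts.most_common(1)[0]
        # Close the interval once the majority class has min_bucket members
        # and the next instance belongs to a different class.
        next_differs = i + 1 < len(pairs) and pairs[i + 1][1] != majority
        if majority_count >= min_bucket and next_differs:
            intervals.append((majority, value))
            counts = Counter()
    if counts:                          # leftover instances form the last interval
        intervals.append((counts.most_common(1)[0][0], pairs[-1][0]))
    # Merge adjacent intervals that predict the same class.
    merged = []
    for cls, last in intervals:
        if merged and merged[-1][0] == cls:
            merged[-1] = (cls, last)
        else:
            merged.append((cls, last))
    # Breakpoints halfway between the end of one interval and the start of the next.
    breakpoints = []
    for k in range(len(merged) - 1):
        upper = merged[k][1]
        lower_next = min(v for v, _ in pairs if v > upper)
        breakpoints.append((upper + lower_next) / 2)
    return merged, breakpoints

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play  = ["Yes", "No", "Yes", "Yes", "Yes", "No", "No", "Yes",
         "Yes", "Yes", "No", "Yes", "Yes", "No"]
print(discretize_1r(temps, play))   # [('Yes', 75), ('No', 85)], breakpoints [77.5]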
With Overfitting Avoidance
• Resulting rule set:
Attribute    Rules                   Errors  Total errors
Outlook      Sunny → No              2/5     4/14
             Overcast → Yes          0/4
             Rainy → Yes             2/5
Temperature  ≤ 77.5 → Yes            3/10    5/14
             > 77.5 → No*            2/4
Humidity     ≤ 82.5 → Yes            1/7     3/14
             > 82.5 and ≤ 95.5 → No  2/6
             > 95.5 → Yes            0/1
Windy        False → Yes             2/8     5/14
             True → No*              3/6
9
Discussion of 1R
• 1R was described in a paper by Holte (1993)
• Contains an experimental evaluation on 16 datasets (using
cross-validation so that results were representative of
performance on future data)
• Minimum number of instances was set to 6 after some
experimentation
• 1R’s simple rules performed not much worse than much
more complex decision trees
• Simplicity first pays off!
Very Simple Classification Rules Perform Well on Most Commonly
Used Datasets
Robert C. Holte, Computer Science Department, University of Ottawa
10
Discussion of 1R: Hyperpipes
• Another simple technique: build one rule for each class
• Each rule is a conjunction of tests, one for each attribute
• For numeric attributes: test checks whether instance's
value is inside an interval
• Interval given by minimum and maximum observed in
training data
• For nominal attributes: test checks whether value is one of
a subset of attribute values
• Subset given by all possible values observed in training data
• Class with most matching tests is predicted

11
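
A rough sketch of this idea in Python (function names are my own; a simplification rather than any particular Hyperpipes implementation):

from collections import defaultdict

def train_hyperpipes(data, attributes, numeric=()):
    """For each class, record the observed value set (nominal attributes) or
    the min/max interval (numeric attributes) of every attribute."""
    bounds = defaultdict(dict)
    for row in data:
        cls = row[-1]
        for i, attr in enumerate(attributes):
            value = row[i]
            if attr in numeric:
                lo, hi = bounds[cls].get(attr, (value, value))
                bounds[cls][attr] = (min(lo, value), max(hi, value))
            else:
                bounds[cls].setdefault(attr, set()).add(value)
    return bounds

def classify_hyperpipes(instance, bounds, attributes, numeric=()):
    """Predict the class whose conjunction of tests matches the most attributes."""
    def score(cls):
        matched = 0
        for attr, value in zip(attributes, instance):
            test = bounds[cls].get(attr)
            if test is None:
                continue
            if attr in numeric:
                lo, hi = test
                matched += lo <= value <= hi
            else:
                matched += value in test
        return matched
    return max(bounds, key=score)

# Usage (with the data and attributes lists defined in the 1R sketch above):
#   model = train_hyperpipes(data, attributes)
#   classify_hyperpipes(("Sunny", "Cool", "High", "True"), model, attributes)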
Statistical Modeling
• “Opposite” of 1R: use all the attributes
• Two assumptions: Attributes are
• equally important
• statistically independent (given the class value)
• I.e., knowing the value of one attribute says nothing
about the value of another (if the class is known)
• Independence assumption is never correct!
• But … this scheme works well in practice

12
Probabilities for Weather Data

          Outlook          Temperature         Humidity            Windy             Play
          Yes  No          Yes  No             Yes  No             Yes  No      Yes   No
Sunny     2    3    Hot    2    2    High      3    4    False     6    2       9     5
Overcast  4    0    Mild   4    2    Normal    6    1    True      3    3
Rainy     3    2    Cool   3    1
Sunny     2/9  3/5  Hot    2/9  2/5  High      3/9  4/5  False     6/9  2/5     9/14  5/14
Overcast  4/9  0/5  Mild   4/9  2/5  Normal    6/9  1/5  True      3/9  3/5
Rainy     3/9  2/5  Cool   3/9  1/5

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

13

Given the data in the weather data table above, we calculate the frequency of each attribute
value for each of the classes.
So, for Outlook, out of the 9 instances with Play = Yes: 2 have Sunny, 4 have Overcast and
3 have Rainy; out of the 5 instances with Play = No: 3 have Sunny and 2 Rainy. This gives
the first column of the top table. We do the same for each attribute. The last column is just
the portion of Yes and No over all instances.
This calculated table in effect gives us the empirical probability of each attribute value
given that the instance belongs to each of the two classes.
i.e. The probability of Outlook = Sunny given Play = Yes, the probability of Outlook =
Sunny given Play = No etc.
Probabilities for Weather Data

          Outlook          Temperature         Humidity            Windy             Play
          Yes  No          Yes  No             Yes  No             Yes  No      Yes   No
Sunny     2    3    Hot    2    2    High      3    4    False     6    2       9     5
Overcast  4    0    Mild   4    2    Normal    6    1    True      3    3
Rainy     3    2    Cool   3    1
Sunny     2/9  3/5  Hot    2/9  2/5  High      3/9  4/5  False     6/9  2/5     9/14  5/14
Overcast  4/9  0/5  Mild   4/9  2/5  Normal    6/9  1/5  True      3/9  3/5
Rainy     3/9  2/5  Cool   3/9  1/5

• A new day:
  Outlook  Temp.  Humidity  Windy  Play
  Sunny    Cool   High      True   ?

• Likelihood of the two classes:
  For “yes” = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
  For “no”  = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
• Conversion into a probability by normalization:
  P(“yes”) = 0.0053 / (0.0053 + 0.0206) = 0.205
  P(“no”)  = 0.0206 / (0.0053 + 0.0206) = 0.795

14

Given these probabilities we can now calculate the likelihood of each class for a new day by
multiplying the corresponding probabilities given Play = Yes and Play = No.
Note that the product includes the probabilities of each class over all instances as well (last
column of the table).
The calculated likelihoods can then be divided by their sum to convert them to probabilities
(make them sum to 1).
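
The whole calculation fits in a short Python sketch (illustrative only; names are my own). With the weather data it reproduces the likelihoods 0.0053 and 0.0206 and the normalized probabilities 0.205 and 0.795.

from collections import Counter, defaultdict

def train_naive_bayes(data, attributes):
    """Count classes and attribute values per class (data as in the 1R sketch above)."""
    class_counts = Counter(row[-1] for row in data)
    value_counts = {a: defaultdict(Counter) for a in attributes}   # value_counts[attr][class][value]
    for row in data:
        for i, attr in enumerate(attributes):
            value_counts[attr][row[-1]][row[i]] += 1
    return class_counts, value_counts

def classify_nb(instance, class_counts, value_counts, attributes):
    n = sum(class_counts.values())
    likelihood = {}
    for cls, count in class_counts.items():
        like = count / n                                    # prior, e.g. 9/14
        for attr, value in zip(attributes, instance):
            like *= value_counts[attr][cls][value] / count  # e.g. 2/9 for Sunny given yes
        likelihood[cls] = like
    total = sum(likelihood.values())                        # plays the role of Pr[E]
    return {cls: like / total for cls, like in likelihood.items()}

# With data and attributes from the 1R sketch above:
#   model = train_naive_bayes(data, attributes)
#   classify_nb(("Sunny", "Cool", "High", "True"), *model, attributes)
#   -> roughly {'Yes': 0.205, 'No': 0.795}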
Bayes’s Rule
• Probability of event H given evidence E:
  Pr[H | E] = Pr[E | H] × Pr[H] / Pr[E]
• A priori probability of H: Pr[H ]
• Probability of event before evidence is seen
• A posteriori probability of H: Pr[ H | E ]
• Probability of event after evidence is seen

Thomas Bayes
Born: 1702 in London, England
Died: 1761 in Tunbridge Wells, Kent, England
15

What we just did is based on the Bayes Rule.


Naïve Bayes for Classification
• Classification learning: what’s the probability of the
class given an instance?
• Evidence E = instance
• Event H = class value for instance
• Naïve assumption: evidence splits into parts (i.e.
attributes) that are independent

  Pr[H | E] = Pr[E1 | H] × Pr[E2 | H] × … × Pr[En | H] × Pr[H] / Pr[E]

16

In effect this is the formula we used.


Weather Data Example

              Outlook  Temp.  Humidity  Windy  Play
Evidence E:   Sunny    Cool   High      True   ?

Probability of class “yes”:
  Pr[yes | E] = Pr[Outlook = Sunny | yes]
                × Pr[Temperature = Cool | yes]
                × Pr[Humidity = High | yes]
                × Pr[Windy = True | yes]
                × Pr[yes] / Pr[E]
              = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / Pr[E]
17

The only thing missing is the probability of the evidence, Pr[E]. However, since the
probabilities of Yes and No given E must add up to 1, Pr[E] is simply the sum of the two
numerators, i.e. the unnormalized likelihoods computed for Yes and for No.
Again, this assumes independence of the attributes!
The “Zero-frequency Problem”
• What if an attribute value doesn’t occur with every
class value?
(e.g. “Humidity = high” for class “yes”)
• Probability will be zero! Pr[ Humidity = High | yes] = 0
• A posteriori probability will also be zero! Pr[ yes | E ] = 0
(No matter how likely the other values are!)
• Remedy: add 1 to the count for every attribute
value-class combination (Laplace estimator)
• Result: probabilities will never be zero!
(also: stabilizes probability estimates)

18
Modified Probability Estimates
• In some cases adding a constant different from 1
might be more appropriate
• Example: attribute outlook for class yes
  (2 + μ/3) / (9 + μ)    (4 + μ/3) / (9 + μ)    (3 + μ/3) / (9 + μ)
        Sunny                  Overcast                Rainy

• Weights don’t need to be equal
  (but they must sum to 1)

  (2 + μp1) / (9 + μ)    (4 + μp2) / (9 + μ)    (3 + μp3) / (9 + μ)
19
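
As a quick numeric check (an illustrative helper of my own): with the Laplace estimator (μ = 3 and equal weights of 1/3) the Sunny count of 2 out of 9 “yes” instances becomes (2 + 1)/(9 + 3) = 0.25, and a count of zero can no longer produce a zero probability.

def smoothed(count, class_total, mu=3.0, weight=1/3):
    """Modified estimate (count + mu*weight) / (class_total + mu); with mu equal to the
    number of attribute values and equal weights this is the Laplace estimator."""
    return (count + mu * weight) / (class_total + mu)

print(smoothed(2, 9), smoothed(4, 9), smoothed(3, 9))  # Sunny, Overcast, Rainy given yes: 0.25, ~0.417, ~0.333
print(smoothed(0, 9))                                  # a zero count still gets probability ~0.083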
Missing Values
• Training: instance is not included in frequency count for
attribute value-class combination
• Classification: attribute will be omitted from calculation
• Example:
Outlook Temp. Humidity Windy Play
? Cool High True ?

Likelihood of “yes” = 3/9 × 3/9 × 3/9 × 9/14 = 0.0238
Likelihood of “no” = 1/5 × 4/5 × 3/5 × 5/14 = 0.0343
P(“yes”) = 0.0238 / (0.0238 + 0.0343) = 41%
P(“no”) = 0.0343 / (0.0238 + 0.0343) = 59%

20
Numeric Attributes
• Usual assumption: attributes have a normal or Gaussian
probability distribution (given the class)
• The probability density function for the normal
  distribution is defined by two parameters:
  • Sample mean:          μ = (1/n) · Σ x_i
  • Standard deviation:   σ = sqrt( (1/(n−1)) · Σ (x_i − μ)² )

• Then the density function f(x) is

  f(x) = 1 / (√(2π) · σ) · e^( −(x − μ)² / (2σ²) )

21
Statistics for Weather Data

          Outlook     Temperature           Humidity               Windy            Play
          Yes  No     Yes       No          Yes       No           Yes  No     Yes   No
Sunny     2    3      64, 68,   65, 71,     65, 70,   70, 85,  False  6   2    9     5
Overcast  4    0      69, 70,   72, 80,     70, 75,   90, 91,  True   3   3
Rainy     3    2      72, …     85, …       80, …     95, …
Sunny     2/9  3/5    μ = 73    μ = 75      μ = 79    μ = 86   False  6/9 2/5  9/14  5/14
Overcast  4/9  0/5    σ = 6.2   σ = 7.9     σ = 10.2  σ = 9.7  True   3/9 3/5
Rainy     3/9  2/5

• Example density value:

  f(temperature = 66 | yes) = 1 / (√(2π) · 6.2) · e^( −(66 − 73)² / (2 · 6.2²) ) = 0.0340

22
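
A quick check of the density value above, as a small sketch (the function name is mine):

import math

def gaussian_density(x, mu, sigma):
    """Normal probability density f(x) for mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

print(gaussian_density(66, 73, 6.2))   # temperature = 66 given yes: ~0.034
print(gaussian_density(90, 86, 9.7))   # humidity = 90 given no: ~0.038 (cf. the 0.0381 used on the next slide)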
Classifying a New Day
• A new day: Outlook Temp. Humidity Windy Play
Sunny 66 90 true ?

Likelihood of “yes” = 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 = 0.000036
Likelihood of “no” = 3/5 × 0.0221 × 0.0381 × 3/5 × 5/14 = 0.000108
P(“yes”) = 0.000036 / (0.000036 + 0.000108) = 25%
P(“no”) = 0.000108 / (0.000036 + 0.000108) = 75%

• Missing values during training are not included in
  calculation of mean and standard deviation

23
Probability Densities
• Relationship between probability and density:

  Pr[c − ε/2 ≤ x ≤ c + ε/2] ≈ ε · f(c)
• But: this doesn’t change calculation of a posteriori
probabilities because ε cancels out
• Exact relationship:

  Pr[a ≤ x ≤ b] = ∫ from a to b of f(t) dt

24
Multinomial Naïve Bayes I
• Version of naïve Bayes used for document classification
using bag of words model
• n1,n2, ..., nk: number of times word i occurs in document
• P1,P2, ..., Pk: probability of obtaining word i when
sampling from documents in class H
• Probability of observing document E given class H
  (based on multinomial distribution):

  Pr[E | H] ≈ N! × ∏(i = 1..k) P_i^(n_i) / n_i!

• Ignores probability of generating a document of the
  right length (prob. assumed constant for each class)
25

Bag of words represents each document as the number of times each word in the vocabulary
(words found in all documents) appears in it.
N is the number of words in the document.
Multinomial Naïve Bayes II
• Suppose dictionary has two words, yellow and blue
• Suppose Pr[yellow|H] = 75% and Pr[blue|H] = 25%
• Suppose E is the document “blue yellow blue”
• Probability of observing this document:
  Pr[{blue yellow blue} | H] ≈ 3! × (0.75^1 / 1!) × (0.25^2 / 2!) = 9/64 ≈ 0.14
• Suppose there is another class H' that has
  Pr[yellow | H'] = 10% and Pr[blue | H'] = 90%:
  Pr[{blue yellow blue} | H'] ≈ 3! × (0.1^1 / 1!) × (0.9^2 / 2!) ≈ 0.24
• Need to take prior probability of class into account to make final
classification
• Factorials don't actually need to be computed
• Underflows can be prevented by using logarithms
26

Note that the factorials are the same for each class so they don’t actually need to be
computed.
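
The example can be verified with a short sketch that works in log space, as suggested above (the function name is my own):

import math
from collections import Counter

def log_multinomial_likelihood(words, word_probs):
    """log Pr[E | H] under the multinomial model. The log N! and log n_i! terms are
    included for completeness, but they are identical for every class and could be dropped."""
    counts = Counter(words)
    n = sum(counts.values())
    log_like = math.lgamma(n + 1)                    # log N!
    for word, ni in counts.items():
        log_like += ni * math.log(word_probs[word]) - math.lgamma(ni + 1)
    return log_like

doc = ["blue", "yellow", "blue"]
print(math.exp(log_multinomial_likelihood(doc, {"yellow": 0.75, "blue": 0.25})))  # ~0.141 (9/64)
print(math.exp(log_multinomial_likelihood(doc, {"yellow": 0.10, "blue": 0.90})))  # ~0.243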
Naïve Bayes: Discussion
• Naïve Bayes works surprisingly well (even if the
independence assumption is clearly violated)
• Why? Because classification doesn’t require accurate
probability estimates as long as maximum
probability is assigned to correct class
• However: adding too many redundant attributes will
cause problems (e.g. identical attributes)
• Note also: many numeric attributes are not normally
distributed (→ kernel density estimators)

27
Constructing Decision Trees
• Strategy: top down
Recursive divide-and-conquer fashion
• First: select attribute for root node
Create branch for each possible attribute value
• Then: split instances into subsets
One for each branch extending from the node
• Finally: repeat recursively for each branch, using only
instances that reach the branch
• Stop if all instances have the same class

28
Which Attribute to Select?

29

Think about which you think would be best to select and why before going to the next slide.
Which Attribute to Select?

30

In effect we want to select the attribute that gives us the most information about the class.
In other words, knowing that attribute's value should make it as likely as possible that we
identify the correct class of the instance.
Criterion for Attribute Selection
• Which is the best attribute?
• Want to get the smallest tree
• Heuristic: choose the attribute that produces the
“purest” nodes
• Popular impurity criterion: information gain
• Information gain increases with the average purity of
the subsets
• Strategy: choose attribute that gives greatest
information gain

31
Computing Information
• Measure information in bits
• Given a probability distribution, the info required
to predict an event is the distribution’s entropy
• Entropy gives the information required in bits
(can involve fractions of bits!)
• Formula for computing the entropy (logarithms are base 2, so entropy is measured in bits):
  entropy(p1, p2, …, pn) = −p1·log(p1) − p2·log(p2) − … − pn·log(pn)

• Minus signs because the logarithms of the fractions


are negative
32

Where p1 is the portion of instances belonging to class 1, p2 is the portion of instances
belonging to class 2, …, pn is the portion of instances belonging to class n (n is the total
number of classes).
Example: Attribute Outlook

• Outlook = Sunny:
  info([2,3]) = entropy(2/5, 3/5) = −(2/5)·log(2/5) − (3/5)·log(3/5) = 0.971 bits

• Outlook = Overcast:
  info([4,0]) = entropy(1, 0) = −1·log(1) − 0·log(0) = 0 bits
  (Note: 0·log(0) is normally undefined; here it is taken to be 0.)

• Outlook = Rainy:
  info([3,2]) = entropy(3/5, 2/5) = −(3/5)·log(3/5) − (2/5)·log(2/5) = 0.971 bits

• Expected information for attribute:
  info([2,3], [4,0], [3,2]) = (5/14)·0.971 + (4/14)·0 + (5/14)·0.971 = 0.693 bits

33

Out of the 5 instances that have Outlook = Sunny, 2 have Play = Yes and 3 have Play = No
(see slide 30). So we calculate info([2,3]).
We do the same for the other 2 values of the attribute.
And finally, we combine the 3 by calculating their weighted sum, where the weights are the
portion of instances out of the whole set that have each value.
i.e. 5 out of 14 have Outlook = Sunny, 4 out of 14 have Outlook = Overcast and 5 out of 14
have Outlook = Rainy.
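
The same numbers fall out of a few lines of Python (illustrative helpers with my own names; counts of zero are skipped so that 0·log(0) is treated as 0):

import math

def entropy(counts):
    """Entropy in bits of a class distribution given as counts, e.g. [2, 3]."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def expected_info(splits):
    """Weighted average entropy of the subsets produced by a split."""
    total = sum(sum(subset) for subset in splits)
    return sum(sum(subset) / total * entropy(subset) for subset in splits)

print(entropy([2, 3]))                           # info([2,3])             ~0.971 bits
print(expected_info([[2, 3], [4, 0], [3, 2]]))   # info([2,3],[4,0],[3,2]) ~0.693 bits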
Computing Information Gain
• Information gain:
information before splitting – information after splitting
gain(Outlook ) = info([9,5]) – info([2,3],[4,0],[3,2])
= 0.940 – 0.693
= 0.247 bits

• Information gain for attributes from weather data:


gain(Outlook) = 0.247 bits
gain(Temperature) = 0.029 bits
gain(Humidity) = 0.152 bits
gain(Windy) = 0.048 bits
34

After calculating the information for each attribute, we calculate its information gain.
We select the attribute with the highest information gain. In this case Outlook.
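
Continuing the sketch above, the gains for all four attributes follow directly (the class counts per branch are read off the counts table on slide 13):

# Uses entropy() and expected_info() from the previous sketch.
def gain(class_counts_before, splits):
    return entropy(class_counts_before) - expected_info(splits)

print(gain([9, 5], [[2, 3], [4, 0], [3, 2]]))   # Outlook:     ~0.247 bits
print(gain([9, 5], [[2, 2], [4, 2], [3, 1]]))   # Temperature: ~0.029 bits
print(gain([9, 5], [[3, 4], [6, 1]]))           # Humidity:    ~0.152 bits
print(gain([9, 5], [[6, 2], [3, 3]]))           # Windy:       ~0.048 bits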
Continuing to Split

gain(Temperature) = 0.571 bits


gain(Humidity) = 0.971 bits
gain(Windy) = 0.020 bits
35

We then repeat the same process for each node that is not totally pure (i.e. contains more than
one class), but only with the subset of the dataset that belongs to that node, as sketched below.
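
The recursive divide-and-conquer procedure can be sketched as follows (a simplified ID3-style illustration with my own names, not the exact algorithm from the course; it reuses entropy() and expected_info() from the sketches above and the data/attributes lists from the 1R sketch):

from collections import Counter

def best_attribute(data, attr_indices):
    """Index of the attribute with the highest information gain on this subset."""
    before = entropy(list(Counter(row[-1] for row in data).values()))
    def gain_of(i):
        subsets = {}
        for row in data:
            subsets.setdefault(row[i], []).append(row[-1])
        return before - expected_info([list(Counter(s).values()) for s in subsets.values()])
    return max(attr_indices, key=gain_of)

def build_tree(data, attr_indices):
    classes = Counter(row[-1] for row in data)
    if len(classes) == 1 or not attr_indices:
        return classes.most_common(1)[0][0]          # leaf: (majority) class
    i = best_attribute(data, attr_indices)
    remaining = [j for j in attr_indices if j != i]
    branches = {value: build_tree([row for row in data if row[i] == value], remaining)
                for value in {row[i] for row in data}}
    return (i, branches)                             # internal node: attribute index + branches

# build_tree(data, list(range(len(attributes)))) splits on Outlook at the root,
# then on Humidity under Sunny and on Windy under Rainy.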
Final Decision Tree

• Note: not all leaves need to be pure; sometimes
  identical instances have different classes
⇒ Splitting stops when data can’t be split any further
36
Properties of Entropy
• When node is pure, entropy is zero
• When impurity is maximal (i.e. all classes equally likely),
entropy is maximal

• Simplification of computation:
  info([2,3,4]) = −(2/9)·log(2/9) − (3/9)·log(3/9) − (4/9)·log(4/9)
                = [−2·log(2) − 3·log(3) − 4·log(4) + 9·log(9)] / 9

• Note: instead of maximizing info gain we could just


minimize information
37

Maximal entropy is 1 bit for a two-class problem (log2(n) bits for n classes).
Highly-branching Attributes
• Problematic: attributes with a large number of values
(extreme case: ID code)
• Subsets are more likely to be pure if there is a large
  number of values
⇒ Information gain is biased towards choosing attributes
  with a large number of values
⇒ This may result in overfitting (selection of an attribute
  that is non-optimal for prediction)
• Another problem: fragmentation

38

Fragmentation: As the tree grows, the decision of which attribute to split at the bottom
nodes is made based on less and less data – choices made with a small amount of data are
more likely to be non-optimal.
Weather Data with ID Code
ID code Outlook Temp. Humidity Windy Play
A Sunny Hot High False No
B Sunny Hot High True No
C Overcast Hot High False Yes
D Rainy Mild High False Yes
E Rainy Cool Normal False Yes
F Rainy Cool Normal True No
G Overcast Cool Normal True Yes
H Sunny Mild High False No
I Sunny Cool Normal False Yes
J Rainy Mild Normal False Yes
K Sunny Mild Normal True Yes
L Overcast Mild High True Yes
M Overcast Hot Normal False Yes
N Rainy Mild High True No
39

As an extreme example, let’s consider the ID code attribute – a unique ID for each instance.
This attribute obviously is not useful for our classification problem.
Tree Stump for ID Code Attribute

• Entropy of split:
  info(ID Code) = info([0,1]) + info([0,1]) + … + info([0,1]) = 0 bits
⇒ Information gain is maximal for ID code (namely 0.940 bits)

40

But ID code has maximal information gain!


Gain Ratio
• Gain ratio: a modification of the information gain that
reduces its bias
• Gain ratio takes number and size of branches into
account when choosing an attribute
• It corrects the information gain by taking the intrinsic
information of a split into account
• Intrinsic information: entropy of distribution of
instances into branches (i.e. how much info do we
need to tell which branch an instance belongs to)

41
Computing the Gain Ratio
• Example: intrinsic information for ID code
  info([1,1,…,1]) = 14 × ( −(1/14) · log(1/14) ) = 3.807 bits

• Value of attribute decreases as intrinsic information
  gets larger
• Definition of gain ratio:
  gain_ratio(attribute) = gain(attribute) / intrinsic_info(attribute)
• Example:
  gain_ratio(ID Code) = 0.940 bits / 3.807 bits = 0.246
42

Intrinsic Information = entropy of the split. In other words, info of the number of instances
that go into each branch.
In the case of ID code, it divides instances into 14 branches with 1 instance in each branch.
Hence info([1,1,…,1]).
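
Since intrinsic information is just the entropy of the branch sizes, the entropy() helper from the earlier sketch gives the numbers directly (illustrative only):

# Intrinsic information = entropy of the split (instance counts per branch).
def gain_ratio(gain_value, branch_sizes):
    return gain_value / entropy(branch_sizes)

print(entropy([1] * 14))              # intrinsic info of ID code: ~3.807 bits
print(gain_ratio(0.940, [1] * 14))    # gain ratio of ID code:     ~0.246
print(gain_ratio(0.247, [5, 4, 5]))   # gain ratio of Outlook:     ~0.157 (split info ~1.577)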
Gain Ratios for Weather Data

Outlook
  Info:                       0.693
  Gain: 0.940 − 0.693       = 0.247
  Split info: info([5,4,5]) = 1.577
  Gain ratio: 0.247/1.577   = 0.157

Temperature
  Info:                       0.911
  Gain: 0.940 − 0.911       = 0.029
  Split info: info([4,6,4]) = 1.557
  Gain ratio: 0.029/1.557   = 0.019

Humidity
  Info:                       0.788
  Gain: 0.940 − 0.788       = 0.152
  Split info: info([7,7])   = 1.000
  Gain ratio: 0.152/1.000   = 0.152

Windy
  Info:                       0.892
  Gain: 0.940 − 0.892       = 0.048
  Split info: info([8,6])   = 0.985
  Gain ratio: 0.048/0.985   = 0.049

43

See slide 29 for finding the Split info.


More on the Gain Ratio
• “Outlook” still comes out top
• However: “ID Code” has greater gain ratio
• Standard fix: ad hoc test to prevent splitting on that
type of attribute
• Problem with gain ratio: it may overcompensate
• May choose an attribute just because its intrinsic
information is very low
• Standard fix: only consider attributes with greater than
average information gain

44
Discussion
• Top-down induction of decision trees: ID3 – algorithm
developed by Ross Quinlan
• Gain ratio just one modification of this basic algorithm
⇒ C4.5: deals with numeric attributes, missing values,
noisy data
• Similar approach: CART
• There are many other attribute selection criteria!
(But little difference in accuracy of result)

45
Covering Algorithms
• Convert decision tree into a rule set
• Straightforward, but rule set overly complex
• More effective conversions are not trivial
• Instead, can generate rule set directly
• for each class in turn find rule set that covers all
instances in it
(excluding instances not in the class)
• Called a covering approach:
• at each stage a rule is identified that “covers” some of
the instances
46
Example: Generating a Rule

If true                If x > 1.2             If x > 1.2 and y > 2.6
then class = a    →    then class = a    →    then class = a

• Possible rule set for class “b”:
  If x ≤ 1.2 then class = b
  If x > 1.2 and y ≤ 2.6 then class = b

• Could add more rules, get “perfect” rule set


47

For each class, start with a rule covering all instances (i.e. IF true THEN class = a) and
refine by adding attribute tests one by one until the rule covers only instances of that class.
Rules vs. Trees

Corresponding decision tree:


(produces exactly the same
predictions)

• But: rule sets can be more perspicuous when decision


trees suffer from replicated subtrees
• Also: in multiclass situations, covering algorithm
concentrates on one class at a time whereas decision tree
learner takes all classes into account
48
Simple Covering Algorithm
• Generates a rule by adding tests that maximize rule’s
accuracy
• Similar to situation in decision trees: problem of
selecting an attribute to split on
• But: decision tree inducer maximizes overall purity
• Each new test reduces
rule’s coverage:

49
Selecting a Test
• Goal: maximize accuracy
• t total number of instances covered by rule
• p positive examples of the class covered by rule
• t – p number of errors made by rule
⇒ Select test that maximizes the ratio p/t
• We are finished when p/t = 1 or the set of instances
can’t be split any further

50
Example: Contact Lens Data
• Rule we seek: If ?
then recommendation = hard

• Possible tests:
Age = Young 2/8
Age = Pre-presbyopic 1/8
Age = Presbyopic 1/8
Spectacle prescription = Myope 3/12
Spectacle prescription = Hypermetrope 1/12
Astigmatism = no 0/12
Astigmatism = yes 4/12
Tear production rate = Reduced 0/12
Tear production rate = Normal 4/12

51

Start with class ‘hard’.


We evaluate each possible test, i.e. each attribute-value combination. See the complete
dataset from Topic 1 to calculate the p/t ratio shown here.
Select the test with the highest ratio. In this case there are 2 so we randomly choose one of
the 2: Astigmatism = yes.
Modified Rule and Resulting Data
• Rule with best test added:
If astigmatism = yes
then recommendation = hard

• Instances covered by modified rule:


Age Spectacle prescription Astigmatism Tear production Recommended
rate lenses
Young Myope Yes Reduced None
Young Myope Yes Normal Hard
Young Hypermetrope Yes Reduced None
Young Hypermetrope Yes Normal hard
Pre-presbyopic Myope Yes Reduced None
Pre-presbyopic Myope Yes Normal Hard
Pre-presbyopic Hypermetrope Yes Reduced None
Pre-presbyopic Hypermetrope Yes Normal None
Presbyopic Myope Yes Reduced None
Presbyopic Myope Yes Normal Hard
Presbyopic Hypermetrope Yes Reduced None
Presbyopic Hypermetrope Yes Normal None

52

These are the 12 instances covered by the rule (i.e. the ones that have Astigmatism = yes).
Further Refinement
• Current state: If astigmatism = yes
and ?
then recommendation = hard

• Possible tests:

Age = Young 2/4


Age = Pre-presbyopic 1/4
Age = Presbyopic 1/4
Spectacle prescription = Myope 3/6
Spectacle prescription = Hypermetrope 1/6
Tear production rate = Reduced 0/6
Tear production rate = Normal 4/6

53

We now perform the same process (evaluate all possible tests) for only these 12 instances
covered by the rule. Note that now we cannot test astigmatism again.
Again, we select the test with the highest ratio: Tear production rate = Normal.
Modified Rule and Resulting Data
• Rule with best test added:
If astigmatism = yes
and tear production rate = normal
then recommendation = hard

• Instances covered by modified rule:


Age Spectacle prescription Astigmatism Tear production Recommended
rate lenses
Young Myope Yes Normal Hard
Young Hypermetrope Yes Normal hard
Pre-presbyopic Myope Yes Normal Hard
Pre-presbyopic Hypermetrope Yes Normal None
Presbyopic Myope Yes Normal Hard
Presbyopic Hypermetrope Yes Normal None

54
Further Refinement
• Current state: If astigmatism = yes
and tear production rate = normal
and ?
then recommendation = hard

• Possible tests:
Age = Young 2/2
Age = Pre-presbyopic 1/2
Age = Presbyopic 1/2
Spectacle prescription = Myope 3/3
Spectacle prescription = Hypermetrope 1/3

• Tie between the first and the fourth test


• We choose the one with greater coverage
55

If there is a tie between rules with different coverage, we select the one with the greatest
coverage.
The Result
• Final rule: If astigmatism = yes
and tear production rate = normal
and spectacle prescription = myope
then recommendation = hard

• Second rule for recommending “hard lenses”:


(built from instances not covered by first rule)
If age = young and astigmatism = yes
and tear production rate = normal
then recommendation = hard

• These two rules cover all “hard lenses”:


• Process is repeated with other two classes

56

After finishing with a rule (p/t ratio = 1), we remove the instances covered by this rule (its
final version) and if the remaining instances still contain the class in question, we repeat the
process with only them.
We repeat this process until we cover all instances of the particular class. Each time with
only the instances not covered by the previous rules!
Then we start again for the next class – with all instances now.
Again: the first rule for each class is derived with all instances, then each time we remove
the instances covered by the previous rules for that class.
Pseudo-code for PRISM
For each class C
  Initialize E to the instance set
  While E contains instances in class C
    Create a rule R with an empty left-hand side that predicts class C
    Until R is perfect (or there are no more attributes to use) do
      For each attribute A not mentioned in R, and each value v of A,
        consider adding the condition A = v to the left-hand side of R
      Select A and v to maximize the accuracy p/t
        (break ties by choosing the condition with the largest p)
      Add A = v to R
    Remove the instances covered by R from E

57
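
A rough Python sketch of this pseudo-code (illustrative; names are my own and ties and edge cases are handled simplistically). Here data is a list of tuples whose last element is the class, as in the weather data sketch earlier.

def prism(data, attributes):
    """Return a list of (class, conditions) rules, where conditions is a list of
    (attribute, value) tests that are ANDed together."""
    rules = []
    for cls in sorted({row[-1] for row in data}):
        instances = list(data)                               # E: instances still to cover
        while any(row[-1] == cls for row in instances):
            covered, conditions, used = list(instances), [], set()
            # Grow the rule until it is perfect or no attributes are left.
            while any(row[-1] != cls for row in covered) and len(used) < len(attributes):
                best = None                                  # (p, t, attribute, index, value)
                for i, attr in enumerate(attributes):
                    if attr in used:
                        continue
                    for value in {row[i] for row in covered}:
                        subset = [row for row in covered if row[i] == value]
                        p, t = sum(row[-1] == cls for row in subset), len(subset)
                        # Maximize p/t, breaking ties by the larger p.
                        if best is None or (p / t, p) > (best[0] / best[1], best[0]):
                            best = (p, t, attr, i, value)
                p, t, attr, i, value = best
                conditions.append((attr, value))
                used.add(attr)
                covered = [row for row in covered if row[i] == value]
            rules.append((cls, conditions))
            # Remove the instances covered by the finished rule from E.
            def matches(row):
                return all(row[attributes.index(a)] == v for a, v in conditions)
            instances = [row for row in instances if not matches(row)]
    return rules

# e.g. prism(data, attributes) with the weather data from the earlier sketches.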
Rules vs. Decision Lists
• PRISM with outer loop removed generates a decision
list for one class
• Subsequent rules are designed for instances that are not
  covered by previous rules
• But: order doesn’t matter because all rules predict the
same class
• Outer loop considers all classes separately
• No order dependence implied
• Problems: overlapping rules, default rule required

58
Separate and Conquer
• Methods like PRISM (for dealing with one class) are
separate-and-conquer algorithms:
• First, identify a useful rule
• Then, separate out all the instances it covers
• Finally, “conquer” the remaining instances
• Difference to divide-and-conquer methods:
• Subset covered by rule doesn’t need to be explored
any further

59
