DLWSS551 - Algorithms Part I
Harris Papadopoulos
Algorithms: The Basic Methods
• Inferring rudimentary rules
• Statistical modeling
• Constructing decision trees
• Constructing rules
• Association rule learning
• Linear models
• Instance-based learning
We will cover the first four in Part I and the remaining three in Part II.
Simplicity First
• Simple algorithms often work very well!
• There are many kinds of simple structures, e.g.:
• One attribute does all the work
• All attributes contribute equally & independently
• A weighted linear combination might do
• Instance-based: use a few prototypes
• Use simple logical rules
• Success of method depends on the domain
3
Inferring Rudimentary Rules
• 1R: learns a 1-level decision tree
• I.e., rules that all test one particular attribute
• Basic version
• One branch for each value
• Each branch assigns most frequent class
• Error rate: proportion of instances that don’t belong to
the majority class of their corresponding branch
• Choose attribute with lowest error rate
(assumes nominal attributes)
4
Pseudo-code for 1R
5
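As an illustration of the procedure described above, here is a minimal Python sketch of basic 1R for nominal attributes (the dataset representation, a list of dictionaries, and all names are placeholders, not part of the slides):

```python
from collections import Counter, defaultdict

def one_r(dataset, attributes, target):
    """Basic 1R: choose the attribute whose one-level rules make the fewest errors."""
    best = None
    for attr in attributes:
        # For every value of the attribute, count how often each class appears.
        counts = defaultdict(Counter)
        for row in dataset:
            counts[row[attr]][row[target]] += 1
        # Each value predicts its most frequent class (ties broken arbitrarily here);
        # the errors are the instances that do not belong to that majority class.
        rules = {value: cls.most_common(1)[0][0] for value, cls in counts.items()}
        errors = sum(sum(cls.values()) - cls.most_common(1)[0][1]
                     for cls in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best   # (chosen attribute, {value: predicted class}, total errors)
```

On the weather data used on the following slides this would return either Outlook or Humidity, both of which make 4 errors (here the tie simply goes to whichever attribute is examined first, rather than being broken randomly).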
Evaluating the Weather Attributes
[Weather data table: attributes Outlook, Temp, Humidity, Windy and class Play; shown in full on slide 13]
Given the weather data in the table, 1R goes through the attributes one by one and assigns a class to each attribute value:
For Outlook: Outlook = Sunny in 5 instances and out of these 5, Play = No in the majority
(3 out of 5). Outlook = Overcast in 4 instances and Play = Yes in all 4. Outlook = Rainy in 5
instances and out of these 5, Play = Yes in the majority (3 out of 5).
- So the rule set for Outlook is: if Outlook = Sunny then Play = No, if Outlook = Overcast
then Play = Yes, if Outlook = Rainy then Play = Yes. This rule set makes 4 errors in total
over the data – 2 errors by the 1st rule and 2 by the 3rd rule.
For Temp: Temp = Hot in 4 instances, for 2 of which Play = Yes and for the other 2 Play =
No. Temp = Mild in 6 instances and out of these 6, Play = Yes in the majority (4 out of 6).
Temp = Cool in 4 instances and out of these 4, Play = Yes in the majority (3 out of 4).
- So by randomly breaking the tie for Temp = Hot, the rule set for Temp is: if Temp = Hot
then Play = No, if Temp = Mild then Play = Yes, if Temp = Cool then Play = Yes. This rule
set makes 5 errors in total over the data – 2 errors by the 1st rule, 2 by the 2nd rule and 1 by
the 3rd.
The same process gives the rule sets for the two remaining attributes. Then 1R chooses the
rule set with the smallest number of errors – that is, the 1st and 3rd rule sets. Since there is a
tie, 1R will choose randomly one of these 2 rule sets.
Dealing with Numeric Attributes
• Discretize numeric attributes
• Divide each attribute’s range into intervals
• Sort instances according to attribute’s values
• Place breakpoints where class changes (majority class)
• This minimizes the total error
• Example: temperature from the weather data. Sorting the values and placing a breakpoint wherever the class changes (instances with equal values are kept together) gives:
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No
• With overfitting avoidance (each interval is required to contain a minimum number of majority-class instances), the partition becomes:
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No
8
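As a rough illustration (not from the slides), the basic breakpoint placement can be sketched as follows: sort the instances by the attribute's value and put a breakpoint midway between adjacent instances of different classes, shifting a breakpoint upwards when it would fall between two identical values (such instances must stay in the same interval).

```python
def breakpoints(values, classes):
    """Candidate breakpoints for discretizing one numeric attribute."""
    pairs = sorted(zip(values, classes))
    cuts = []
    pending = False                       # a class change is waiting for a usable position
    for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
        if c1 != c2:
            pending = True                # the class changes here
        if pending and v1 != v2:
            cuts.append((v1 + v2) / 2)    # place the breakpoint midway between the values
            pending = False
    return cuts

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play  = ['Yes', 'No', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes',
         'Yes', 'Yes', 'No', 'Yes', 'Yes', 'No']
print(breakpoints(temps, play))   # [64.5, 66.5, 70.5, 73.5, 77.5, 80.5, 84.0]
```

The seven breakpoints correspond exactly to the first partition shown above; the coarser second partition additionally enforces the minimum-instances requirement.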
With Overfitting Avoidance
• Resulting rule set:
Attribute Rules Errors Total errors
Outlook Sunny → No 2/5 4/14
Overcast → Yes 0/4
Rainy → Yes 2/5
Temperature ≤ 77.5 → Yes 3/10 5/14
> 77.5 → No* 2/4
Humidity ≤ 82.5 → Yes 1/7 3/14
> 82.5 and ≤ 95.5 → No 2/6
> 95.5 → Yes 0/1
Windy False → Yes 2/8 5/14
True → No* 3/6
(* class chosen by breaking a tie arbitrarily)
9
Discussion of 1R
• 1R was described in a paper by Holte (1993)
• Contains an experimental evaluation on 16 datasets (using
cross-validation so that results were representative of
performance on future data)
• Minimum number of instances was set to 6 after some
experimentation
• 1R’s simple rules performed not much worse than much
more complex decision trees
• Simplicity first pays off!
Very Simple Classification Rules Perform Well on Most Commonly
Used Datasets
Robert C. Holte, Computer Science Department, University of Ottawa
10
Discussion of 1R: Hyperpipes
• Another simple technique: build one rule for each class
• Each rule is a conjunction of tests, one for each attribute
• For numeric attributes: test checks whether instance's
value is inside an interval
• Interval given by minimum and maximum observed in
training data
• For nominal attributes: test checks whether value is one of
a subset of attribute values
• Subset given by all possible values observed in training data
• Class with most matching tests is predicted
11
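A rough Python sketch of this idea (illustrative only; the representation of instances as lists of attribute values is an assumption, not from the slides):

```python
def fit_hyperpipes(X, y):
    """One 'pipe' per class: for every attribute, remember the set of values
    observed for that class in the training data."""
    pipes = {}
    for xs, cls in zip(X, y):
        pipe = pipes.setdefault(cls, [set() for _ in xs])
        for seen, value in zip(pipe, xs):
            seen.add(value)
    return pipes

def predict_hyperpipes(pipes, xs):
    """Predict the class whose pipe has the most matching per-attribute tests:
    numeric values must fall inside the observed [min, max] interval,
    nominal values must belong to the observed subset."""
    def matches(pipe):
        return sum((min(seen) <= v <= max(seen)) if isinstance(v, (int, float))
                   else (v in seen)
                   for seen, v in zip(pipe, xs))
    return max(pipes, key=lambda cls: matches(pipes[cls]))
```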
Statistical Modeling
• “Opposite” of 1R: use all the attributes
• Two assumptions: Attributes are
• equally important
• statistically independent (given the class value)
• I.e., knowing the value of one attribute says nothing
about the value of another (if the class is known)
• Independence assumption is never correct!
• But … this scheme works well in practice
12
Probabilities for Weather Data
Outlook Temperature Humidity Windy Play
Yes No Yes No Yes No Yes No Yes No
Sunny 2 3 Hot 2 2 High 3 4 False 6 2 9 5
Overcast 4 0 Mild 4 2 Normal 6 1 True 3 3
Rainy 3 2 Cool 3 1
Sunny 2/9 3/5 Hot 2/9 2/5 High 3/9 4/5 False 6/9 2/5 9/14 5/14
Overcast 4/9 0/5 Mild 4/9 2/5 Normal 6/9 1/5 True 3/9 3/5
Rainy 3/9 2/5 Cool 3/9 1/5
Outlook Temp Humidity Windy Play
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rainy Mild High True No
13
Given the data on the bottom right table, we calculate the frequency of each attribute value
for each of the classes.
So, for Outlook, out of the 9 instances with Play = Yes: 2 have Sunny, 4 have Overcast and
3 have Rainy; out of the 5 instances with Play = No: 3 have Sunny and 2 have Rainy. This gives
the first column of the top table. We do the same for each attribute. The last column is simply
the proportion of Yes and No instances over the whole dataset.
This calculated table in effect gives us the empirical probability of each attribute value
given that the instance belongs to each of the two classes.
i.e. The probability of Outlook = Sunny given Play = Yes, the probability of Outlook =
Sunny given Play = No etc.
Probabilities for Weather Data
Given these probabilities we can now calculate the likelihood of each class for a new day by
multiplying the corresponding probabilities given Play = Yes and Play = No.
Note that the product includes the probabilities of each class over all instances as well (last
column of the table).
The calculated likelihoods can then be divided by their sum to convert them to probabilities
(make them sum to 1).
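As an illustration (the particular new day is a made-up example, not taken from the slides), consider a day with Outlook = Sunny, Temperature = Cool, Humidity = High, Windy = True. Reading the fractions off the table:

```python
# Unnormalized likelihoods: product of the per-attribute fractions and the class prior.
likelihood_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ~0.0053
likelihood_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # ~0.0206

total = likelihood_yes + likelihood_no
print(likelihood_yes / total)   # ~0.205 -> about 20.5% probability of Play = Yes
print(likelihood_no / total)    # ~0.795 -> about 79.5% probability of Play = No
```

Dividing by the sum of the two likelihoods is exactly the normalization step described above.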
Bayes’s Rule
• Probability of event H given evidence E:
Pr[H | E] = Pr[E | H] × Pr[H] / Pr[E]
• A priori probability of H: Pr[H ]
• Probability of event before evidence is seen
• A posteriori probability of H: Pr[ H | E ]
• Probability of event after evidence is seen
Thomas Bayes
Born: 1702 in London, England
Died: 1761 in Tunbridge Wells, Kent, England
15
16
The only thing missing is the prior probability of the evidence, Pr[E]. However, since the probabilities of Yes and No must add up to 1, Pr[E] is simply the sum of the two computed likelihoods, Pr[E | Yes]·Pr[Yes] and Pr[E | No]·Pr[No], so dividing by it normalizes them.
Again, this assumes independence of the attributes!
The “Zero-frequency Problem”
• What if an attribute value doesn’t occur with every
class value?
(e.g. suppose “Humidity = High” had never occurred together with class “yes”)
• The estimated probability would then be zero: Pr[Humidity = High | yes] = 0
• And the a posteriori probability would also be zero: Pr[yes | E] = 0
(no matter how likely the other values are!)
• Remedy: add 1 to the count for every attribute
value-class combination (Laplace estimator)
• Result: probabilities will never be zero!
(also: stabilizes probability estimates)
18
Modified Probability Estimates
• In some cases, adding a constant μ different from 1 to the counts may be more appropriate
• Example: attribute Outlook for class yes, with μ split equally among the three values:
Sunny: (2 + μ/3) / (9 + μ)   Overcast: (4 + μ/3) / (9 + μ)   Rainy: (3 + μ/3) / (9 + μ)
• The weights don’t need to be equal (but they must sum to 1):
Sunny: (2 + μp1) / (9 + μ)   Overcast: (4 + μp2) / (9 + μ)   Rainy: (3 + μp3) / (9 + μ)
19
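A small Python sketch of these estimators (illustrative; all names are placeholders). With μ equal to the number of attribute values and equal weights, it reduces to the simple “add one” Laplace estimator of the previous slide:

```python
def smoothed_prob(count, class_total, mu, n_values=None, weight=None):
    """m-estimate of Pr[attribute = value | class].

    count       -- occurrences of this attribute value within the class
    class_total -- number of training instances of the class
    mu          -- total pseudo-count added to the class
    weight      -- prior weight of this value (weights must sum to 1 over all
                   values); defaults to the uniform 1 / n_values
    """
    if weight is None:
        weight = 1.0 / n_values
    return (count + mu * weight) / (class_total + mu)

# Pr[Outlook = Sunny | yes] with mu = 3 and equal weights: (2 + 1) / (9 + 3) = 0.25
print(smoothed_prob(2, 9, mu=3, n_values=3))
```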
Missing Values
• Training: instance is not included in frequency count for
attribute value-class combination
• Classification: attribute will be omitted from calculation
• Example:
Outlook Temp. Humidity Windy Play
? Cool High True ?
20
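Working the numbers out for this instance (an illustrative calculation, not shown on the slide): the Outlook factor is simply dropped, so the likelihood of yes is 3/9 × 3/9 × 3/9 × 9/14 ≈ 0.0238 and the likelihood of no is 1/5 × 4/5 × 3/5 × 5/14 ≈ 0.0343, which normalize to roughly 41% for yes and 59% for no.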
Numeric Attributes
• Usual assumption: attributes have a normal or Gaussian
probability distribution (given the class)
• The probability density function for the normal distribution is defined by two parameters:
• Sample mean: μ = (1/n) Σ_i x_i
• Standard deviation: σ = sqrt( (1/(n−1)) Σ_i (x_i − μ)² )
• The density function is then: f(x) = 1/(σ sqrt(2π)) · exp( −(x − μ)² / (2σ²) )
21
Statistics for Weather Data
Outlook Temperature Humidity Windy Play
Yes No Yes No Yes No Yes No Yes No
Sunny 2 3 64, 68, 65,71, 65, 70, 70, 85, False 6 2 9 5
Overcast 4 0 69, 70, 72,80, 70, 75, 90, 91, True 3 3
Rainy 3 2 72, … 85, … 80, … 95, …
Sunny 2/9 3/5 μ=73 μ=75 μ=79 μ=86 False 6/9 2/5 9/14 5/14
Overcast 4/9 0/5 σ=6.2 σ=7.9 σ=10.2 σ=9.7 True 3/9 3/5
Rainy 3/9 2/5
23
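For example (an illustrative calculation, not on the slide), the density value for Temperature = 66 on a yes day uses μ = 73 and σ = 6.2 from the table:

```python
import math

def normal_density(x, mu, sigma):
    """Gaussian probability density f(x) with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

print(normal_density(66, 73, 6.2))   # f(Temperature = 66 | yes) ~ 0.034
```

This density value then takes the place of the nominal-attribute fraction when the per-attribute factors are multiplied together.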
Probability Densities
• Relationship between probability and density:
Pr[c − ε/2 ≤ x ≤ c + ε/2] ≈ ε · f(c)
• But: this doesn’t change calculation of a posteriori
probabilities because ε cancels out
• Exact relationship:
Pr[a ≤ x ≤ b] = ∫_a^b f(t) dt
24
Multinomial Naïve Bayes I
• Version of naïve Bayes used for document classification
using bag of words model
• n1,n2, ..., nk: number of times word i occurs in document
• P1,P2, ..., Pk: probability of obtaining word i when
sampling from documents in class H
• Probability of observing document E given class H
(based on multinomial distribution):
Pr[E | H] ≈ N! × Π_{i=1..k} ( P_i^{n_i} / n_i! )
Bag of words represents each document as the number of times each word in the vocabulary
(words found in all documents) appears in it.
N is the number of words in the document.
Multinomial Naïve Bayes II
• Suppose dictionary has two words, yellow and blue
• Suppose Pr[yellow|H] = 75% and Pr[blue|H] = 25%
• Suppose E is the document “blue yellow blue”
• Probability of observing this document:
Pr[{blue yellow blue} | H] = 3! × (0.75^1 / 1!) × (0.25^2 / 2!) = 9/64 ≈ 0.14
• Suppose there is another class H' with Pr[yellow | H'] = 10% and Pr[blue | H'] = 90%:
Pr[{blue yellow blue} | H'] = 3! × (0.1^1 / 1!) × (0.9^2 / 2!) ≈ 0.24
• Need to take prior probability of class into account to make final
classification
• Factorials don't actually need to be computed
• Underflows can be prevented by using logarithms
26
Note that the factorials are the same for each class so they don’t actually need to be
computed.
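A minimal sketch of the log-space computation (illustrative only; the constant N! and n_i! terms are dropped, as noted above, and the dictionaries are assumed inputs):

```python
import math

def log_score(counts, word_probs, log_prior):
    """Unnormalized log-probability of a document for one class.

    counts     -- {word: number of occurrences in the document}
    word_probs -- {word: Pr[word | class]}
    log_prior  -- logarithm of the class prior probability
    """
    return log_prior + sum(n * math.log(word_probs[w]) for w, n in counts.items())

# The 'blue yellow blue' example, assuming equal priors for H and H':
score_h  = log_score({'yellow': 1, 'blue': 2}, {'yellow': 0.75, 'blue': 0.25}, math.log(0.5))
score_h2 = log_score({'yellow': 1, 'blue': 2}, {'yellow': 0.10, 'blue': 0.90}, math.log(0.5))
print(score_h < score_h2)   # True: H' scores higher, matching the 0.14 vs 0.24 comparison above
```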
Naïve Bayes: Discussion
• Naïve Bayes works surprisingly well (even if the
independence assumption is clearly violated)
• Why? Because classification doesn’t require accurate
probability estimates as long as maximum
probability is assigned to correct class
• However: adding too many redundant attributes will
cause problems (e.g. identical attributes)
• Note also: many numeric attributes are not normally
distributed (→ kernel density estimators)
27
Constructing Decision Trees
• Strategy: top down
Recursive divide-and-conquer fashion
• First: select attribute for root node
Create branch for each possible attribute value
• Then: split instances into subsets
One for each branch extending from the node
• Finally: repeat recursively for each branch, using only
instances that reach the branch
• Stop if all instances have the same class
28
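A rough Python sketch of this recursive strategy (illustrative only, for nominal attributes, using the information-gain criterion introduced on the following slides):

```python
from collections import Counter
import math

def entropy(labels):
    """Entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def build_tree(rows, labels, attributes):
    """Top-down induction: rows is a list of {attribute: value} dicts."""
    if len(set(labels)) == 1 or not attributes:
        # Pure node, or nothing left to split on: return the (majority) class.
        return Counter(labels).most_common(1)[0][0]

    def gain(attr):
        # Entropy before the split minus the weighted entropy of the subsets.
        by_value = {}
        for row, lab in zip(rows, labels):
            by_value.setdefault(row[attr], []).append(lab)
        after = sum(len(ls) / len(labels) * entropy(ls) for ls in by_value.values())
        return entropy(labels) - after

    best = max(attributes, key=gain)
    branches = {}
    for value in set(row[best] for row in rows):
        # One branch per observed value, built from the instances that reach it.
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = build_tree([rows[i] for i in idx],
                                     [labels[i] for i in idx],
                                     [a for a in attributes if a != best])
    return (best, branches)
```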
Which Attribute to Select?
29
Think about which you think would be best to select and why before going to the next slide.
Which Attribute to Select?
30
In effect, we want to select the attribute that gives us the most information about the class. In other words, knowing that attribute's value should make it as likely as possible that we identify the correct class of the instance.
Criterion for Attribute Selection
• Which is the best attribute?
• Want to get the smallest tree
• Heuristic: choose the attribute that produces the
“purest” nodes
• Popular impurity criterion: information gain
• Information gain increases with the average purity of
the subsets
• Strategy: choose attribute that gives greatest
information gain
31
Computing Information
• Measure information in bits
• Given a probability distribution, the info required
to predict an event is the distribution’s entropy
• Entropy gives the information required in bits
(can involve fractions of bits!)
• Formula for computing the entropy (logarithms to base 2, so that the result is in bits):
entropy(p1, p2, ..., pn) = − p1 log(p1) − p2 log(p2) − ... − pn log(pn)
Out of the 5 instances that have Outlook = Sunny, 2 have Play = Yes and 3 have Play = No
(see slide 30). So we calculate info([2,3]).
We do the same for the other 2 values of the attribute.
And finally, we combine the 3 by calculating their weighted sum, where the weights are the
portion of instances out of the whole set that have each value.
i.e. 5 out of 14 have Outlook = Sunny, 4 out of 14 have Outlook = Overcast and 5 out of 14
have Outlook = Rainy.
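To make these numbers concrete, a short (illustrative) computation:

```python
import math

def info(counts):
    """Entropy, in bits, of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

print(info([2, 3]))   # Sunny branch: ~0.971 bits
print(info([4, 0]))   # Overcast branch: 0 bits (pure)
print(info([3, 2]))   # Rainy branch: ~0.971 bits

# Weighted sum over the 14 instances: 5/14, 4/14 and 5/14 of them reach each branch.
print(5/14 * info([2, 3]) + 4/14 * info([4, 0]) + 5/14 * info([3, 2]))   # ~0.693
```

The final figure, 0.693 bits, is exactly the info([2,3],[4,0],[3,2]) used in the gain calculation on the next slide.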
Computing Information Gain
• Information gain:
information before splitting – information after splitting
gain(Outlook ) = info([9,5]) – info([2,3],[4,0],[3,2])
= 0.940 – 0.693
= 0.247 bits
After calculating the information for each attribute, we calculate its information gain.
We select the attribute with the highest information gain. In this case Outlook.
Continuing to Split
We then repeat the same process for each node that is not totally pure (i.e. contains more than one class), but using only the subset of the dataset that belongs to that node.
Final Decision Tree
• Simplification of computation:
info([2,3,4]) = − (2/9) log(2/9) − (3/9) log(3/9) − (4/9) log(4/9)
The maximal entropy of a two-class distribution is 1 bit (reached when the two classes are equally likely).
Highly-branching Attributes
• Problematic: attributes with a large number of values
(extreme case: ID code)
• Subsets are more likely to be pure if there is a large
number of values
Information gain is biased towards choosing attributes
with a large number of values
This may result in overfitting (selection of an attribute
that is non-optimal for prediction)
• Another problem: fragmentation
38
Fragmentation: As the tree grows, the decision of which attribute to split at the bottom
nodes is made based on less and less data – choices made with a small amount of data are
more likely to be non-optimal.
Weather Data with ID Code
ID code Outlook Temp. Humidity Windy Play
A Sunny Hot High False No
B Sunny Hot High True No
C Overcast Hot High False Yes
D Rainy Mild High False Yes
E Rainy Cool Normal False Yes
F Rainy Cool Normal True No
G Overcast Cool Normal True Yes
H Sunny Mild High False No
I Sunny Cool Normal False Yes
J Rainy Mild Normal False Yes
K Sunny Mild Normal True Yes
L Overcast Mild High True Yes
M Overcast Hot Normal False Yes
N Rainy Mild High True No
39
As an extreme example, let’s consider the ID code attribute – a unique ID for each instance.
This attribute obviously is not useful for our classification problem.
Tree Stump for ID Code Attribute
• Entropy of split:
info(ID Code) = info([0,1]) + info([0,1]) + ... + info([0,1]) = 0 bits
Information gain is maximal for ID code (namely 0.940 bits)
40
41
Computing the Gain Ratio
• Example: intrinsic information for ID code
info([1,1,...,1]) = 14 × ( −(1/14) × log(1/14) ) = 3.807 bits
Intrinsic Information = entropy of the split. In other words, info of the number of instances
that go into each branch.
In the case of ID code, it divides instances into 14 branches with 1 instance in each branch.
Hence info([1,1,…,1]).
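The gain ratio is the information gain divided by this intrinsic information. Working it out for ID code (not shown on the slide): gain ratio = 0.940 / 3.807 ≈ 0.247. Comparing with the next slide, this is still slightly higher than the best real attribute (Outlook, 0.157), so the bias towards highly-branching attributes is reduced but not completely removed.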
Gain Ratios for Weather Data
Outlook: Info: 0.693; Gain: 0.940 − 0.693 = 0.247; Split info: info([5,4,5]) = 1.577; Gain ratio: 0.247/1.577 = 0.157
Temperature: Info: 0.911; Gain: 0.940 − 0.911 = 0.029; Split info: info([4,6,4]) = 1.557; Gain ratio: 0.029/1.557 = 0.019
Humidity: Info: 0.788; Gain: 0.940 − 0.788 = 0.152; Split info: info([7,7]) = 1.000; Gain ratio: 0.152/1.000 = 0.152
Windy: Info: 0.892; Gain: 0.940 − 0.892 = 0.048; Split info: info([8,6]) = 0.985; Gain ratio: 0.048/0.985 = 0.049
43
44
Discussion
• Top-down induction of decision trees: ID3 – algorithm
developed by Ross Quinlan
• Gain ratio just one modification of this basic algorithm
• C4.5: deals with numeric attributes, missing values,
noisy data
• Similar approach: CART
• There are many other attribute selection criteria!
(But little difference in accuracy of result)
45
Covering Algorithms
• Convert decision tree into a rule set
• Straightforward, but rule set overly complex
• More effective conversions are not trivial
• Instead, can generate rule set directly
• for each class in turn find rule set that covers all
instances in it
(excluding instances not in the class)
• Called a covering approach:
• at each stage a rule is identified that “covers” some of
the instances
46
Example: Generating a Rule
For each class, start with a rule covering all instances (i.e. IF true THEN class = a) and
refine by adding attribute tests one by one until the rule covers only instances of that class.
Rules vs. Trees
49
Selecting a Test
• Goal: maximize accuracy
• t total number of instances covered by rule
• p positive examples of the class covered by rule
• t – p number of errors made by rule
Select test that maximizes the ratio p/t
• We are finished when p/t = 1 or the set of instances
can’t be split any further
50
Example: Contact Lens Data
• Rule we seek: If ?
then recommendation = hard
• Possible tests:
Age = Young 2/8
Age = Pre-presbyopic 1/8
Age = Presbyopic 1/8
Spectacle prescription = Myope 3/12
Spectacle prescription = Hypermetrope 1/12
Astigmatism = no 0/12
Astigmatism = yes 4/12
Tear production rate = Reduced 0/12
Tear production rate = Normal 4/12
51
52
These are the 12 instances covered by the rule (i.e. the ones that have Astigmatism = yes).
Further Refinement
• Current state: If astigmatism = yes
and ?
then recommendation = hard
• Possible tests:
53
We now perform the same process (evaluate all possible tests) for only these 12 instances
covered by the rule. Note that now we cannot test astigmatism again.
Again, we select the test with the highest ratio: Tear production rate = Normal.
Modified Rule and Resulting Data
• Rule with best test added:
If astigmatism = yes
and tear production rate = normal
then recommendation = hard
54
Further Refinement
• Current state: If astigmatism = yes
and tear production rate = normal
and ?
then recommendation = hard
• Possible tests:
Age = Young 2/2
Age = Pre-presbyopic 1/2
Age = Presbyopic 1/2
Spectacle prescription = Myope 3/3
Spectacle prescription = Hypermetrope 1/3
If there is a tie between rules with different coverage, we select the one with the greatest
coverage.
The Result
• Final rule: If astigmatism = yes
and tear production rate = normal
and spectacle prescription = myope
then recommendation = hard
56
After finishing with a rule (p/t ratio = 1), we remove the instances covered by this rule (its
final version) and if the remaining instances still contain the class in question, we repeat the
process with only them.
We repeat this process until we cover all instances of the particular class. Each time with
only the instances not covered by the previous rules!
Then we start again for the next class – with all instances now.
Again: the first rule for each class is derived with all instances, then each time we remove
the instances covered by the previous rules for that class.
Pseudo-code for PRISM
For each class C
Initialize E to the instance set
While E contains instances in class C
Create a rule R with an empty left-hand side that predicts class C
Until R is perfect (or there are no more attributes to use) do
For each attribute A not mentioned in R, and each value v of A,
Consider adding the condition A = v to the left-hand side of R
Select A and v to maximize the accuracy p/t
(break ties by choosing the condition with the largest p)
Add A = v to R
Remove the instances covered by R from E
57
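A compact Python sketch of this pseudo-code (illustrative only; instances are assumed to be {attribute: value} dictionaries with a separate list of class labels):

```python
def prism(rows, labels, attributes):
    """Covering rules: returns a list of (class, {attribute: value}) pairs."""
    rules = []
    for cls in set(labels):
        data = list(zip(rows, labels))                  # E: start from the full instance set
        while any(lab == cls for _, lab in data):       # E still contains instances of class C
            conditions = {}                             # R: empty left-hand side
            covered = data                              # instances currently covered by R
            while True:
                p = sum(1 for _, lab in covered if lab == cls)
                if p == len(covered) or len(conditions) == len(attributes):
                    break                               # R is perfect, or no attributes left
                best = None
                for attr in attributes:
                    if attr in conditions:              # attribute already mentioned in R
                        continue
                    for value in set(row[attr] for row, _ in covered):
                        subset = [(row, lab) for row, lab in covered if row[attr] == value]
                        p_v = sum(1 for _, lab in subset if lab == cls)
                        key = (p_v / len(subset), p_v)  # accuracy p/t, ties broken by largest p
                        if best is None or key > best[0]:
                            best = (key, attr, value, subset)
                _, attr, value, subset = best
                conditions[attr] = value                # add A = v to R
                covered = subset
            rules.append((cls, dict(conditions)))
            # Remove the instances covered by the finished rule R from E.
            data = [(row, lab) for row, lab in data
                    if not all(row[a] == v for a, v in conditions.items())]
    return rules
```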
Rules vs. Decision Lists
• PRISM with outer loop removed generates a decision
list for one class
• Subsequent rules are designed for instances that are not
covered by previous rules
• But: order doesn’t matter because all rules predict the
same class
• Outer loop considers all classes separately
• No order dependence implied
• Problems: overlapping rules, default rule required
58
Separate and Conquer
• Methods like PRISM (for dealing with one class) are
separate-and-conquer algorithms:
• First, identify a useful rule
• Then, separate out all the instances it covers
• Finally, “conquer” the remaining instances
• Difference from divide-and-conquer methods:
• Subset covered by rule doesn’t need to be explored
any further
59