Simple Learning Algorithms: Jiming Peng, AdvOL, CAS, McMaster
1R learning;
Bayes Model;
Decision Tree;
Covering algorithm;
Mining for Association Rules
Linear models for numeric prediction;
Instance-based learning.
Reading Materials: Chapter 4 of the textbook
by Witten et al.; Sections 6.1, 6.2, 7.1-7.4, and 7.8
of the textbook by Han.
The weather data:

Outlook    Temper.  Humidity  Windy  Play
Sunny      Hot      High      False  No
Sunny      Hot      High      True   No
Overcast   Hot      High      False  Yes
Rainy      Mild     High      False  Yes
Rainy      Cool     Normal    False  Yes
Rainy      Cool     Normal    True   No
Overcast   Cool     Normal    True   Yes
Sunny      Mild     High      False  No
Sunny      Cool     Normal    False  Yes
Rainy      Mild     Normal    False  Yes
Sunny      Mild     Normal    True   Yes
Overcast   Mild     High      True   Yes
Overcast   Hot      Normal    False  Yes
Rainy      Mild     High      True   No
In total, there are 6 instances whose temperature is mild; four of them have final decision
Yes and two have No. The corresponding rule is
If Temper. = mild then Play = Yes
Error rate: 2/6.
1R algorithm

Attribute  Rules              Errors  Total err.
Outlook    sunny -> no        2/5     4/14
           overcast -> yes    0/4
           rainy -> yes       2/5
Temper.    hot -> no          2/4     5/14
           mild -> yes        2/6
           cool -> yes        1/4
Humidity   high -> no         3/7     4/14
           normal -> yes      1/7
Windy      false -> yes       2/8     5/14
           true -> no         3/6
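A minimal Python sketch of the 1R procedure on the weather data above (the dataset literal, function and variable names are mine, not code from the textbook):

from collections import Counter, defaultdict

weather = [  # (outlook, temper., humidity, windy, play), as in the table above
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
attributes = ["outlook", "temper.", "humidity", "windy"]

def one_r(instances, n_attrs):
    """Return (attribute index, rules, total errors) of the best single-attribute rule set."""
    best = None
    for a in range(n_attrs):
        counts = defaultdict(Counter)          # attribute value -> class frequencies
        for row in instances:
            counts[row[a]][row[-1]] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}   # majority class per value
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:
            best = (a, rules, errors)
    return best

a, rules, errors = one_r(weather, len(attributes))
print(attributes[a], rules, "errors:", errors, "/", len(weather))
# outlook {'sunny': 'no', 'overcast': 'yes', 'rainy': 'yes'} errors: 4 / 14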
Discretization

Sort the values of temperature and write the corresponding Play class underneath. Placing a breakpoint wherever the class changes gives the partition

64 | 65 | 68 69 70 | 71 72 | 72 75 75 | 80 | 81 83 | 85
Y  | No | Y  Y  Y  | No No | ?Y Y  Y  | No | Y  Y  | No

(the "?" marks the value 72, which occurs once with Yes and once with No).

How about the whole sorted sequence?

64 65 68 69 70 71 72 72 75 75 80 81 83 85
Y  No Y  Y  Y  No No Y  Y  Y  No Y  Y  No
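The breakpoint idea can be sketched in a few lines of Python (the helper name is mine; the closing remark about a minimum bucket size follows the 1R discretization described in the textbook):

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
plays = ["Y", "No", "Y", "Y", "Y", "No", "No", "Y", "Y", "Y", "No", "Y", "Y", "No"]

def naive_breakpoints(values, classes):
    """Split points halfway between adjacent values whose classes differ."""
    points = []
    for i in range(1, len(values)):
        # A class change between two equal values (the two 72s) cannot be used
        # as a split point, which is why that 72 is marked with "?" above.
        if classes[i] != classes[i - 1] and values[i] != values[i - 1]:
            points.append((values[i - 1] + values[i]) / 2)
    return points

print(naive_breakpoints(temps, plays))   # [64.5, 66.5, 70.5, 77.5, 80.5, 84.0]
# Many small intervals result; to avoid overfitting, 1R additionally requires a
# minimum number of majority-class instances per interval before accepting a split.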
Statistical Modelling
Basic assumptions: attributes are equally
important and statistically independent.
These idealized assumptions are never met in practice, but the scheme works well!
The weather data with probabilities

             Counts           Fractions
             yes   no         yes    no
Outlook
  Sunny       2     3         2/9    3/5
  Overcast    4     0         4/9    0/5
  Rainy       3     2         3/9    2/5
Temper.
  Hot         2     2         2/9    2/5
  Mild        4     2         4/9    2/5
  Cool        3     1         3/9    1/5
Humidity
  High        3     4         3/9    4/5
  Normal      6     1         6/9    1/5
Windy
  False       6     2         6/9    2/5
  True        3     3         3/9    3/5
Play          9     5         9/14   5/14
Consider the weather problem with the new instance (sunny, cool, high, true, ?).

Pr[yes|E] Pr[E] = Pr[sunny|yes] Pr[cool|yes] Pr[high|yes] Pr[true|yes] Pr[yes]
               = (2/9)(3/9)(3/9)(3/9)(9/14) = 0.0053.

Similarly, Pr[no|E] Pr[E] = (3/5)(1/5)(4/5)(3/5)(5/14) = 0.0206, so after normalization
P(yes) = 0.0053 / (0.0053 + 0.0206) = 20.5% and P(no) = 79.5%.

Note that Pr[overcast|no] = 0/5 in the table, so any instance with outlook = overcast would get
likelihood zero for class no. Does it make sense to claim that the likelihood is zero? If not,
how should we deal with this issue?
Remedy: add 1 to the count for every attribute value-class combination (the Laplace estimator).
In some cases adding a constant different from 1 might be more appropriate. For the attribute
outlook and class yes, the fractions 2/9, 4/9, 3/9 become

(2 + a)/(9 + a + b + g),  (4 + b)/(9 + a + b + g),  (3 + g)/(9 + a + b + g),

with weights satisfying a + b + g = 1, a > 0, b > 0, g > 0.
Extra merit: missing values are simply not counted,
both in training and in prediction!
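A minimal sketch of naive Bayes on the nominal weather data with an add-one (Laplace) estimator; the function and variable names are mine:

from collections import Counter

data = [  # (outlook, temper., humidity, windy, play), the weather data above
    ("sunny","hot","high","false","no"), ("sunny","hot","high","true","no"),
    ("overcast","hot","high","false","yes"), ("rainy","mild","high","false","yes"),
    ("rainy","cool","normal","false","yes"), ("rainy","cool","normal","true","no"),
    ("overcast","cool","normal","true","yes"), ("sunny","mild","high","false","no"),
    ("sunny","cool","normal","false","yes"), ("rainy","mild","normal","false","yes"),
    ("sunny","mild","normal","true","yes"), ("overcast","mild","high","true","yes"),
    ("overcast","hot","normal","false","yes"), ("rainy","mild","high","true","no"),
]

class_counts = Counter(row[-1] for row in data)
# value_counts[(attribute index, value, class)] = count
value_counts = Counter((a, row[a], row[-1]) for row in data for a in range(4))
# number of distinct values per attribute (needed for the Laplace denominator)
n_values = [len({row[a] for row in data}) for a in range(4)]

def likelihood(instance, cls, laplace=1.0):
    """Pr[class] * product over attributes of Pr[value | class], with add-one smoothing."""
    p = class_counts[cls] / len(data)
    for a, v in enumerate(instance):
        p *= (value_counts[(a, v, cls)] + laplace) / (class_counts[cls] + laplace * n_values[a])
    return p

instance = ("sunny", "cool", "high", "true")
scores = {c: likelihood(instance, c) for c in class_counts}
total = sum(scores.values())
for c, s in scores.items():
    print(c, round(s / total, 3))
# yes ~ 0.28, no ~ 0.72 with smoothing (cf. 20.5% / 79.5% without it)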
Numeric attributes are handled by assuming a normal (Gaussian) distribution and estimating
its mean and standard deviation from the training data:

mu = (1/n) Σ_{i=1}^{n} x_i,        sigma = sqrt( Σ_{i=1}^{n} (x_i − mu)² / (n − 1) ).

For the weather problem, if the attribute temperature has a mean of 73 and a standard
deviation of 6.2 for class yes, then the density function gives

f(temperature = 66 | yes) = 1 / (sqrt(2π) · 6.2) · exp( −(66 − 73)² / (2 · 6.2²) ) = 0.0340,
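A quick check of this value, assuming the normal density with the quoted mean and standard deviation:

import math

def normal_density(x, mu, sigma):
    """Gaussian probability density at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

print(round(normal_density(66, 73, 6.2), 4))   # 0.034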
The weather data with numeric temperature and humidity:

            Outlook        Temper.        Humidity            Windy         Play
            yes   no       yes    no      yes    no           yes  no       yes   no
Sunny        2     3        83     85      86     85      F    6    2        9     5
Overcast     4     0        70     80      96     90      T    3    3
Rainy        3     2        68     65      80     70
                            ...    ...     ...    ...
mean                        73     74.6    79.1   86.2
std. dev.                   6.2    7.9     10.2   9.7

Sunny       2/9   3/5                                     F   6/9  2/5      9/14  5/14
Overcast    4/9   0/5                                     T   3/9  3/5
Rainy       3/9   2/5
Probability densities
Relationship between probability and density:
Pr[a ≤ x ≤ b] = ∫_a^b f(t) dt.
Decision trees
Normal procedure: top down in recursive
divide-and-conquer fashion
1. Attribute is selected for root node and
branch is created for each possible attribute value
2. The instances are split into subsets (one
for each branch extending from the node)
3. Procedure is repeated recursively for
each branch, using only instances that
reach the branch
4. The process stops when all instances in a subset have the same class.
Issue: how to select the splitting attribute?
Criterion: the best attribute is the one leading to the smallest tree.
Trick: choose the attribute that produces the purest nodes.
How do we measure purity? With information gain!
Computing information
Information gain increases with the average
purity of the subsets that an attribute produces
Strategy: choose the attribute that results
in the greatest information gain.
Information is measured in bits:
1. Given a probability distribution, the information required to predict an event is the
distribution's entropy.
2. Entropy gives the required information in bits (it can involve fractions of a bit!).
Formula for computing the entropy:
entropy(p_1, p_2, ..., p_n) = − Σ_{i=1}^{n} p_i log2(p_i).
More Examples
Outlook = Overcast:
info([4, 0]) = entropy(1, 0) = −1 · log2(1) = 0 bits;
Outlook = Rainy:
info([3, 2]) = entropy(3/5, 2/5) = −(3/5) log2(3/5) − (2/5) log2(2/5) = 0.971 bits;
(Outlook = Sunny gives info([2, 3]) = 0.971 bits as well.)
Expected information for the attribute:
info([3, 2], [4, 0], [3, 2]) = (10/14) · 0.971 + (4/14) · 0 = 0.693 bits.
Information gain: the gap between the information before and after the split.
Information gain for the weather problem:
gain(outlook) = info([9, 5]) − info([3, 2], [4, 0], [3, 2]) = 0.940 − 0.693 = 0.247 bits;
gain(temper.) = 0.029 bits;
gain(humidity) = 0.152 bits;
gain(windy) = 0.048 bits.
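A short sketch (helper names are mine) that reproduces these numbers:

import math

def entropy(*probs):
    """Entropy in bits of a probability distribution (0 log 0 is taken as 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def info(*class_counts):
    """Expected information: subset-size-weighted average entropy of the subsets."""
    total = sum(sum(c) for c in class_counts)
    return sum(sum(c) / total * entropy(*(x / sum(c) for x in c)) for c in class_counts)

before = info([9, 5])                  # information of the full data set
after  = info([2, 3], [4, 0], [3, 2])  # after splitting on outlook (sunny/overcast/rainy)
print(round(before, 3), round(after, 3), round(before - after, 3))
# 0.94 0.694 0.247  (the slides round the middle value to 0.693)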
Avoiding overfitting
Trouble: attributes with a large number of values (extreme case: an ID code) produce subsets
that are more likely to be pure. Information gain is therefore biased towards choosing
attributes with a large number of values, which leads to so-called overfitting (selection
of an attribute that is useless for prediction).
Remedy: use the gain ratio,
gain ratio(a) = gain(a) / split info(a),
where split info(a) is the information of the split itself, e.g. info([5, 4, 5]) for outlook.
Attribute   Info.   Info Gain   Split Info        Gain Ratio
outlook     0.693   0.247       info([5,4,5])     0.156
temper.     0.911   0.029       1.362             0.021
humidity    0.788   0.152       info([7,7]) = 1   0.152
windy       0.892   0.048       0.985             0.049
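A quick check of the split info and gain ratio for outlook (illustrative code, not from the textbook):

import math

sizes = [5, 4, 5]          # sizes of the sunny / overcast / rainy subsets
total = sum(sizes)
split_info = -sum(s / total * math.log2(s / total) for s in sizes)   # info([5,4,5])
print(round(split_info, 3), round(0.247 / split_info, 3))
# 1.577 0.157  (the table shows 0.156)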
Covering algorithm
A decision tree can be converted into a rule set by a straightforward conversion, but this
usually leads to a very complex rule set; efficient conversions are useful but not easy to find.
The covering approach generates a rule set directly, excluding instances in other classes.
Key idea: find a rule set that covers all the instances in one class.
Consider the problem of classifying a set of points in the plane belonging to two classes,
denoted by circles and boxes. We can start with
If ? then the point belongs to the circle class.
It covers all the instances in the circle class, but it is too general.
Adding a pre-condition (x > 1), we get:
If x > 1, then class is circle.
The rule now covers only some of the circle instances (and possibly some box instances), so we
need further conditions and more rules to cover the remaining circle instances and exclude the
box instances.
Figure: Covering. The plane is split by the lines x = 1 and y = 1.5.
Age             Spect. prescr.   Astig.  Tear prod.  Rec. lenses
Pre-presbyopic  Hypermetrope     Yes     Reduced     None
Pre-presbyopic  Hypermetrope     Yes     Normal      None
Presbyopic      Myope            Yes     Reduced     None
Presbyopic      Myope            Yes     Normal      Hard
Presbyopic      Hypermetrope     Yes     Reduced     None
Presbyopic      Hypermetrope     Yes     Normal      None
Pre-presbyopic  Myope            Yes     Normal      Hard
Pre-presbyopic  Myope            Yes     Reduced     None
Young           Hypermetrope     Yes     Normal      Hard
Young           Hypermetrope     Yes     Reduced     None
Young           Myope            Yes     Normal      Hard
Young           Myope            Yes     Reduced     None
Further refinement
Rule we seek: If astigmatism = yes and ? then recommendation = hard
Possible tests:
Age = Young                              2/4
Age = Pre-presbyopic                     1/4
Age = Presbyopic                         1/4
Spectacle prescription = Myope           3/6
Spectacle prescription = Hypermetrope    1/6
Tear production rate = Reduced           0/6
Tear production rate = Normal            4/6
Age             Spect. prescr.   Astig.  Tear prod.  Rec. lenses
Pre-presbyopic  Hypermetrope     Yes     Normal      None
Presbyopic      Myope            Yes     Normal      Hard
Presbyopic      Hypermetrope     Yes     Normal      None
Pre-presbyopic  Myope            Yes     Normal      Hard
Young           Hypermetrope     Yes     Normal      Hard
Young           Myope            Yes     Normal      Hard
Further refinement
Rule we seek: If astigmatism = yes and tear production rate = normal and ? then recommendation = hard
Possible tests:
Age = Young                              2/2
Age = Pre-presbyopic                     1/2
Age = Presbyopic                         1/2
Spectacle prescription = Myope           3/3
Spectacle prescription = Hypermetrope    1/1
The first and the fourth test are both 100% accurate; between them we select the one with the
larger coverage, Spectacle prescription = Myope.
Resulting Table
Age             Spect. prescr.   Astig.  Tear prod.  Rec. lenses
Presbyopic      Myope            Yes     Normal      Hard
Pre-presbyopic  Myope            Yes     Normal      Hard
Young           Myope            Yes     Normal      Hard
The Final rule: If astigmatism = yes and tear production rate = normal and spectacle prescription =
myope then recommendation = hard
Second rule for recommending hard lenses: If age =
young and astigmatism = yes and tear production rate
= normal then recommendation = hard
PRISM
Pseudo-code for PRISM:
For each class C
  Initialize E to the instance set
  While E contains instances in class C
    1. Create a rule R with an empty left-hand side that predicts class C
    2. Until R is perfect (or there are no more attributes to use) do
         For each attribute A not mentioned in R, and each value v,
           consider adding the condition A = v to the left-hand side of R
         Select A and v to maximize the accuracy p/t
           (break ties by choosing the condition with the largest p)
         Add A = v to R
    Remove the instances covered by R from E
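A minimal Python rendering of this pseudo-code; instances are assumed to be dicts whose "class" key holds the class label, and all names are mine:

def prism(instances, attributes, target_class):
    """Return a list of rules (attribute -> value dicts) that together cover target_class."""
    rules, E = [], list(instances)
    while any(x["class"] == target_class for x in E):
        rule, covered = {}, list(E)
        # Grow the rule until it is perfect or no attributes are left.
        while any(x["class"] != target_class for x in covered) and len(rule) < len(attributes):
            best = None
            for a in [att for att in attributes if att not in rule]:
                for v in {x[a] for x in covered}:
                    t = [x for x in covered if x[a] == v]
                    p = sum(x["class"] == target_class for x in t)
                    key = (p / len(t), p)          # accuracy p/t, ties broken by larger p
                    if best is None or key > best[0]:
                        best = (key, a, v)
            _, a, v = best
            rule[a] = v
            covered = [x for x in covered if x[a] == v]
        rules.append(rule)
        # Remove the instances covered by the finished rule from E.
        E = [x for x in E if not all(x[a] == v for a, v in rule.items())]
    return rules

Applied to the contact-lens data, this grows rules like the one derived above; note that ties in p/t (e.g. astigmatism = yes versus tear production rate = normal in the very first step) are broken by whichever condition happens to be examined first.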
Item sets
Examples of item sets for the weather data (with counts):
One-item sets:   outlook=sunny (5);  temperature=cool (4)
Two-item sets:   outlook=sunny temperature=mild (2);  outlook=sunny humidity=high (3)
Three-item sets: outlook=sunny temperature=hot humidity=high (2);  outlook=sunny humidity=high windy=false (2)
Association rules
Rules for the weather data with support > 1:
In total there are 3 rules with support four, 5 with support three, and 50 with support two.
Example rules from the same item set
Temperature = Cool, Humidity = Normal, Windy = False, Play = Yes (2):
Temperature = Cool, Windy = False  ==>  Humidity = Normal, Play = Yes
Temperature = Cool, Windy = False, Humidity = Normal  ==>  Play = Yes
Temperature = Cool, Windy = False, Play = Yes  ==>  Humidity = Normal
Apriori Algorithm
We are looking for all high-confidence rules:
The support of the antecedent is obtained from the hash table.
(c+1)-consequent rules are built from c-consequent ones.
Observation: a (c+1)-consequent rule can only hold if all corresponding c-consequent rules also hold.
This works just like the procedure for large item sets.
Key steps from k-item sets to (k+1)-item sets (see the sketch after the tables below):
Create a table of potential candidate (k+1)-item sets from the hash table of k-item sets by
joining pairs of k-item sets, using the Apriori property of frequent item sets and the order
in the hash table to improve efficiency.
Remove non-promising candidates from the table by consulting the hash table of k-item sets.
Scan the whole data set to remove the candidates that do not satisfy the minimum support
requirement and obtain the frequent (k+1)-item sets.
TID   Items
T1    I1, I2, I5
T2    I2, I4
T3    I2, I3
T4    I1, I2, I4
T5    I1, I3
T6    I2, I3
T7    I1, I3
T8    I1, I2, I3, I5
T9    I1, I2, I3

C: candidate set;  L: frequent item set (minimum support 2).

C1 = L1:
Itemset   Support
I1        6
I2        7
I3        6
I4        2
I5        2

C2 (candidate 2-item sets):
Itemset    Support
{I1,I2}    4
{I1,I3}    4
{I1,I4}    1
{I1,I5}    2
{I2,I3}    4
{I2,I4}    2
{I2,I5}    2
{I3,I4}    0
{I3,I5}    1
{I4,I5}    0

L2 (frequent 2-item sets):
Itemset    Support
{I1,I2}    4
{I1,I3}    4
{I1,I5}    2
{I2,I3}    4
{I2,I4}    2
{I2,I5}    2

C3 = L3:
Itemset       Support
{I1,I2,I3}    2
{I1,I2,I5}    2
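A compact sketch of the level-wise join / prune / scan procedure on the transactions T1-T9 above, with minimum support 2 (variable names are mine):

from itertools import combinations

transactions = [
    {"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"}, {"I1","I3"},
    {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"}, {"I1","I2","I3"},
]
min_support = 2

def support(itemset):
    return sum(itemset <= t for t in transactions)

# L1: frequent 1-item sets
L = [frozenset([i]) for i in sorted(set().union(*transactions))
     if support(frozenset([i])) >= min_support]
k = 1
while L:
    print(k, {tuple(sorted(s)): support(s) for s in L})   # Lk with supports
    # Join step: combine frequent k-item sets into (k+1)-item candidates.
    candidates = {a | b for a in L for b in L if len(a | b) == k + 1}
    # Prune step (Apriori property): every k-subset of a candidate must be frequent.
    candidates = [c for c in candidates
                  if all(frozenset(s) in set(L) for s in combinations(c, k))]
    # Scan the data to keep only candidates that meet the minimum support.
    L = [c for c in candidates if support(c) >= min_support]
    k += 1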
Linear models
Work naturally with numeric attributes
Standard technique for numeric prediction: linear regression
Output is a linear combination of attributes
y = a_0 + a_1 x_1 + a_2 x_2 + ... + a_k x_k.
Weights are calculated from the training
data
Predicted value for the first training instance x^(1):

a_0 x_0^(1) + a_1 x_1^(1) + a_2 x_2^(1) + ... + a_k x_k^(1),   with x_0^(1) = 1.

All k + 1 coefficients are chosen so that the squared error on the training data,

Σ_{i=1}^{n} ( y^(i) − Σ_{j=0}^{k} a_j x_j^(i) )²,

is minimized. Setting the derivatives with respect to the coefficients to zero yields a system
of linear equations (the normal equations) that determines a_0, ..., a_k.
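A minimal numpy sketch of choosing the coefficients by least squares; the small data set here is made up purely for illustration:

import numpy as np

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0]])   # training instances (x1, x2)
y = np.array([3.1, 2.9, 4.8, 7.2])                               # numeric targets

X1 = np.hstack([np.ones((len(X), 1)), X])        # prepend x0 = 1 for the intercept a0
a, *_ = np.linalg.lstsq(X1, y, rcond=None)       # minimises the sum of squared errors
print(a)                                         # coefficients a0, a1, a2
print(X1 @ a)                                    # predicted values on the training data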
Multiple-class classification
For a three-class problem with two-dimensional instances (x, y) belonging to the sets S1, S2, S3,
encode the classes by the indicator vectors (1, 0, 0)^T, (0, 1, 0)^T, (0, 0, 1)^T and choose the
linear functions a_i x + b_i y + c_i to minimize

f =   Σ_{(x,y)∈S1} ‖ (1, 0, 0)^T − (a_1 x + b_1 y + c_1,  a_2 x + b_2 y + c_2,  a_3 x + b_3 y + c_3)^T ‖²
    + Σ_{(x,y)∈S2} ‖ (0, 1, 0)^T − (a_1 x + b_1 y + c_1,  a_2 x + b_2 y + c_2,  a_3 x + b_3 y + c_3)^T ‖²
    + Σ_{(x,y)∈S3} ‖ (0, 0, 1)^T − (a_1 x + b_1 y + c_1,  a_2 x + b_2 y + c_2,  a_3 x + b_3 y + c_3)^T ‖².
Two-class classification
For a two-class classification problem, we change the label of each class to (1, 0)^T and
(0, 1)^T, respectively.
Another way to handle binary classification is to perform a regression for each class first
(membership value 1 for instances of that class, 0 for the others), giving two linear functions

f_1(a) = x_0^1 + a_1 x_1^1 + a_2 x_2^1 + ... + a_k x_k^1,
f_2(a) = x_0^2 + a_1 x_1^2 + a_2 x_2^2 + ... + a_k x_k^2,

where a = (a_1, ..., a_k) is the instance and x^1, x^2 are the coefficient vectors from the two
regressions; a new instance is assigned to the class whose function gives the larger value.
Figure: the two regression lines (e.g. y = a_2 x + b_2).
Pairwise regression
Another regression model for classification:
A regression is performed for each pair of classes, using only the instances from these two classes.
An output of +1 is assigned to one member of the pair and −1 to the other.
The class receiving the most votes is predicted.
What do we do if there is no agreement?
Pairwise regression may be more accurate, but it is expensive.
Logistic regression:
Designed for classification problems.
It tries to estimate the class probabilities directly, using the following linear model for the log-odds:

log( Pr(G = i | X = x) / Pr(G = K | X = x) ) = β_{i0} + β_i^T x,   i = 1, 2, ..., K − 1.

Equivalently,

Pr(G = i | X = x) = exp(β_{i0} + β_i^T x) / ( 1 + Σ_{l=1}^{K−1} exp(β_{l0} + β_l^T x) ),   i = 1, 2, ..., K − 1,
Pr(G = K | X = x) = 1 / ( 1 + Σ_{l=1}^{K−1} exp(β_{l0} + β_l^T x) ).

The parameters β are estimated by maximizing the conditional log-likelihood Σ_{i=1}^{N} log Pr(G = g_i | X = x_i).
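A small sketch that evaluates these class probabilities for given coefficients (the beta values below are made up; fitting them by maximizing the log-likelihood is not shown):

import numpy as np

def class_probs(x, betas):
    """betas: list of K-1 pairs (intercept, weight vector); returns Pr(G=i|X=x) for i = 1..K."""
    scores = np.array([b0 + w @ x for b0, w in betas])          # beta_i0 + beta_i^T x
    denom = 1.0 + np.exp(scores).sum()
    probs = list(np.exp(scores) / denom)                        # classes 1 .. K-1
    probs.append(1.0 / denom)                                   # reference class K
    return probs

x = np.array([0.5, -1.2])
betas = [(0.3, np.array([1.0, -0.5])), (-0.2, np.array([0.4, 0.8]))]   # illustrative values
p = class_probs(x, betas)
print(p, sum(p))   # probabilities for the K = 3 classes; they sum to 1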
Instance-based learning
The distance function defines what is learned.
The most popular distance function is the Euclidean distance between two instances a^(1) and a^(2):

d(a^(1), a^(2)) = sqrt( (a_1^(1) − a_1^(2))² + (a_2^(1) − a_2^(2))² + ... + (a_n^(1) − a_n^(2))² ).
Outlook    Temper.  Humidity  Windy  Play
Sunny      Hot      High      False  No
Sunny      Hot      High      True   Yes
Overcast   Hot      High      False  Yes
Rainy      Mild     High      False  Yes
Rainy      Cool     Normal    False  Yes
Rainy      Cool     Normal    True   No
Overcast   Cool     Normal    True   Yes
Sunny      Mild     High      False  No
Sunny      Cool     Normal    False  Yes
Rainy      Mild     Normal    False  No
Sunny      Mild     Normal    True   Yes
Overcast   Mild     High      True   Yes
New instances:
(sunny, hot, high, true): easy, it matches a training instance exactly!
(sunny, cool, high, false): no agreement among the nearest neighbours! What to do? Go with the
majority: no.
(rainy, hot, normal, false): no agreement! There is a tie between (rainy, mild, normal, false)
and (rainy, cool, normal, false). Maybe go from 1-NN to 2-NN, ..., k-NN.
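A minimal k-NN sketch on the table above, using the number of mismatching attribute values as the distance (a common choice for nominal attributes; the Euclidean distance given earlier applies to numeric ones). All names are mine:

from collections import Counter

train = [  # (outlook, temper., humidity, windy) -> play, exactly as in the table above
    (("sunny","hot","high","false"), "no"),      (("sunny","hot","high","true"), "yes"),
    (("overcast","hot","high","false"), "yes"),  (("rainy","mild","high","false"), "yes"),
    (("rainy","cool","normal","false"), "yes"),  (("rainy","cool","normal","true"), "no"),
    (("overcast","cool","normal","true"), "yes"),(("sunny","mild","high","false"), "no"),
    (("sunny","cool","normal","false"), "yes"),  (("rainy","mild","normal","false"), "no"),
    (("sunny","mild","normal","true"), "yes"),   (("overcast","mild","high","true"), "yes"),
]

def mismatches(a, b):
    """Number of attribute positions where the two instances differ."""
    return sum(x != y for x, y in zip(a, b))

def knn(instance, k=1):
    neighbours = sorted(train, key=lambda xy: mismatches(instance, xy[0]))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

print(knn(("sunny", "hot", "high", "true")))       # "yes": exact match in the table
print(knn(("sunny", "cool", "high", "false"), 3))  # "no": majority among three neighbours at distance 1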
Figure: Classification by a linear model.
Figure: Decision boundary by k-NN.