CS3491-AI ML-Chapter 2
Machine Learning
CHAPTER 2: Supervised Learning
Learning a Class from Examples
Class C of a “family car”
Prediction: Is car x a family car?
Knowledge extraction: What do people expect from a family car?
Output:
Positive (+) and negative (–) examples
Input representation:
x1: price, x2: engine power
Training set X
X = \{ x^t, r^t \}_{t=1}^{N}

r = \begin{cases} 1 & \text{if } x \text{ is positive} \\ 0 & \text{if } x \text{ is negative} \end{cases}

x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
Class C
(p_1 \le \text{price} \le p_2) AND (e_1 \le \text{engine power} \le e_2)
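A hypothesis from this rectangle family is just a conjunction of two interval checks; a minimal Python sketch, where the corner values p1, p2, e1, e2 are hypothetical placeholders, not values from the lecture:

# Minimal sketch of one rectangle hypothesis h for the family-car class.
# The corner values p1, p2, e1, e2 below are hypothetical, not from the lecture.

def h(price, engine_power, p1=10_000, p2=20_000, e1=100, e2=200):
    """Return 1 (positive) if the car falls inside the rectangle, else 0."""
    return int(p1 <= price <= p2 and e1 <= engine_power <= e2)

print(h(15_000, 150))  # 1 -> predicted family car
print(h(40_000, 300))  # 0 -> predicted not a family car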
Hypothesis class H
h(x) = \begin{cases} 1 & \text{if } h \text{ classifies } x \text{ as positive} \\ 0 & \text{if } h \text{ classifies } x \text{ as negative} \end{cases}
Error of h on the training set X:

E(h \mid X) = \sum_{t=1}^{N} 1\big( h(x^t) \neq r^t \big)
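This empirical error is simply a count of misclassified training examples; a small self-contained sketch, where the data and the rectangle are made up for illustration:

# Empirical error of a hypothesis on a training set X = {(x^t, r^t)}:
# the number of misclassified examples. Data and rectangle are made up.

X = [(15_000, 150), (12_000, 120), (40_000, 300), (8_000, 90)]  # (price, engine power)
r = [1, 1, 0, 0]                                                # labels r^t

def h(price, power):
    # one rectangle hypothesis (hypothetical corner values)
    return int(10_000 <= price <= 20_000 and 100 <= power <= 200)

def error(h, X, r):
    return sum(int(h(*x) != rt) for x, rt in zip(X, r))

print(error(h, X, r))  # 0: this rectangle classifies all four examples correctly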
S, G, and the Version Space
most specific hypothesis, S
most general hypothesis, G
Any h ∈ H between S and G is consistent with the training set; together these hypotheses make up the version space (Mitchell, 1997)
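For the rectangle class, S can be pictured as the tightest rectangle that still covers every positive example; a small sketch under that reading, with made-up data:

# Most specific hypothesis S for the rectangle class: the tightest
# axis-aligned rectangle covering all positive examples. Toy data, made up.

positives = [(15_000, 150), (12_000, 120), (18_000, 180)]  # positive (price, power) pairs

p1 = min(price for price, _ in positives)
p2 = max(price for price, _ in positives)
e1 = min(power for _, power in positives)
e2 = max(power for _, power in positives)

print(f"S: {p1} <= price <= {p2} AND {e1} <= engine power <= {e2}")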
VC Dimension
N points can be labeled in 2^N ways as +/–
H shatters the N points if, for every one of these labelings, there exists an h ∈ H consistent with it; the largest such N is the VC dimension, VC(H)
For example, axis-aligned rectangles can shatter a set of 4 points but no set of 5, so their VC dimension is 4
Noise and Model Complexity
Use the simpler one because:
Simpler to use (lower computational complexity)
Easier to train (lower space complexity)
Easier to explain (more interpretable)
Generalizes better (lower variance - Occam's razor)
Multiple Classes, C_i, i = 1, ..., K

X = \{ x^t, r^t \}_{t=1}^{N}

r_i^t = \begin{cases} 1 & \text{if } x^t \in C_i \\ 0 & \text{if } x^t \in C_j,\ j \neq i \end{cases}

Train K hypotheses h_i(x), i = 1, ..., K:

h_i(x^t) = \begin{cases} 1 & \text{if } x^t \in C_i \\ 0 & \text{if } x^t \in C_j,\ j \neq i \end{cases}
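The targets r_i^t amount to a one-vs-rest encoding of the class labels, one 0/1 vector per hypothesis h_i; a minimal sketch with made-up labels and K = 3:

# Build the per-class targets r_i^t from integer class labels (one-vs-rest).
# Labels and K are made up for illustration.

labels = [0, 2, 1, 0, 2]   # class index of each training example (C_0, C_1, C_2)
K = 3

r = [[1 if label == i else 0 for label in labels] for i in range(K)]
for i, r_i in enumerate(r):
    print(f"targets for h_{i}:", r_i)   # h_i is trained as C_i vs. the rest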
Regression
X = \{ x^t, r^t \}_{t=1}^{N}, \quad r^t \in \mathbb{R}

r^t = f(x^t) + \varepsilon

Linear model: g(x) = w_1 x + w_0
Quadratic model: g(x) = w_2 x^2 + w_1 x + w_0

E(g \mid X) = \frac{1}{N} \sum_{t=1}^{N} \big[ r^t - g(x^t) \big]^2

E(w_1, w_0 \mid X) = \frac{1}{N} \sum_{t=1}^{N} \big[ r^t - (w_1 x^t + w_0) \big]^2
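Minimizing E(w1, w0 | X) has the standard least-squares solution; a small numpy sketch on synthetic data (the data-generating f and noise level are made up):

import numpy as np

# Fit the linear model g(x) = w1*x + w0 by minimizing E(w1, w0 | X).
# The data is synthetic: r^t = f(x^t) + noise with f(x) = 2x + 1.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
r = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=50)

# Closed-form least-squares solution for w1 and w0
x_bar, r_bar = x.mean(), r.mean()
w1 = np.sum((x - x_bar) * (r - r_bar)) / np.sum((x - x_bar) ** 2)
w0 = r_bar - w1 * x_bar

E = np.mean((r - (w1 * x + w0)) ** 2)   # empirical squared error
print(f"w1 = {w1:.2f}, w0 = {w0:.2f}, E = {E:.3f}")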
Model Selection & Generalization
Learning is an ill-posed problem; data is not sufficient to find a unique solution
The need for inductive bias, assumptions about H
Generalization: How well a model performs on new data
Overfitting: H more complex than C or f
Underfitting: H less complex than C or f
Triple Trade-Off
There is a trade-off between three factors (Dietterich, 2003):
1. Complexity of H, c (H),
2. Training set size, N,
3. Generalization error, E, on new data
As N increases, E decreases
As c(H) increases, E first decreases and then increases
Cross-Validation
To estimate generalization error, we need data unseen during training. We split the data as
Training set (50%)
Validation set (25%)
Test (publication) set (25%)
Resampling when there is little data
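A minimal sketch of the 50/25/25 split on a shuffled index (the dataset size is made up):

import numpy as np

# Split N examples into training (50%), validation (25%), and test (25%) sets.
N = 100                                   # made-up dataset size
rng = np.random.default_rng(0)
idx = rng.permutation(N)                  # shuffle indices before splitting

train_idx = idx[: N // 2]
val_idx   = idx[N // 2 : 3 * N // 4]
test_idx  = idx[3 * N // 4 :]

print(len(train_idx), len(val_idx), len(test_idx))   # 50 25 25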
Dimensions of a Supervised Learner
1. Model: g(x \mid \theta)
2. Loss function: E(\theta \mid X) = \sum_t L\big( r^t, g(x^t \mid \theta) \big)
3. Optimization procedure: \theta^* = \arg\min_\theta E(\theta \mid X)
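These three dimensions map directly onto code: a parametric model g(x | θ), a loss summed over the training set, and a numerical optimizer for θ*. A sketch under squared loss, with synthetic data and scipy's general-purpose minimizer standing in for the optimization procedure:

import numpy as np
from scipy.optimize import minimize

# 1. Model g(x | theta): here a straight line with theta = (w1, w0)
def g(x, theta):
    w1, w0 = theta
    return w1 * x + w0

# 2. Loss function E(theta | X) = sum_t L(r^t, g(x^t | theta)), squared loss here
def E(theta, x, r):
    return np.sum((r - g(x, theta)) ** 2)

# 3. Optimization procedure: theta* = argmin_theta E(theta | X)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=30)                      # made-up data
r = 3.0 * x - 2.0 + rng.normal(0, 0.5, size=30)
theta_star = minimize(E, x0=np.zeros(2), args=(x, r)).x
print(theta_star)                                    # roughly [3.0, -2.0]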