
CS3491-AI ML-Chapter 2

CS3491-ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING

Uploaded by

Steephen Raj

INTRODUCTION TO MACHINE LEARNING

CHAPTER 2: SUPERVISED LEARNING

Learning a Class from Examples
 Class C of a “family car”
 Prediction: Is car x a family car?
 Knowledge extraction: What do people expect from a family car?
 Output: positive (+) and negative (–) examples
 Input representation: x1: price, x2: engine power
Training set X

X = { x^t, r^t }, t = 1, ..., N

r = 1 if x is positive
r = 0 if x is negative

x = [x1, x2]^T
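As a concrete sketch of this representation, the training set can be stored as a list of (x^t, r^t) pairs, where each x^t is a (price, engine power) vector. All numbers below are made up for illustration.

```python
# Toy training set for the family-car example.
# Each instance x^t is (price in $1000s, engine power in hp);
# r^t is 1 for a positive example, 0 for a negative one.
training_set = [
    ((16, 120), 1),
    ((24, 150), 1),
    ((9,  200), 0),   # cheap but overpowered: labeled negative
    ((45, 300), 0),   # expensive and too powerful: labeled negative
]

X = [x for x, r in training_set]   # inputs x^t = (x1, x2)
R = [r for x, r in training_set]   # labels r^t
```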
Class C

(p1 ≤ price ≤ p2) AND (e1 ≤ engine power ≤ e2)
Hypothesis class H

h(x) = 1 if h classifies x as positive
h(x) = 0 if h classifies x as negative

Error of h on X:

E(h | X) = Σ_{t=1}^{N} 1( h(x^t) ≠ r^t )
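The rectangle hypothesis and its empirical error can be sketched as follows; the bounds p1, p2, e1, e2 and the data are hypothetical values chosen for illustration.

```python
def h(x, p1, p2, e1, e2):
    """Axis-aligned rectangle hypothesis: 1 iff p1 <= price <= p2
    and e1 <= engine power <= e2."""
    price, power = x
    return 1 if (p1 <= price <= p2) and (e1 <= power <= e2) else 0

def empirical_error(hyp, X, R):
    """E(h | X): number of training instances that hyp misclassifies."""
    return sum(1 for x, r in zip(X, R) if hyp(x) != r)

# Hypothetical data and rectangle bounds.
X = [(16, 120), (24, 150), (9, 200), (45, 300)]
R = [1, 1, 0, 0]
rect = lambda x: h(x, p1=10, p2=30, e1=100, e2=180)
print(empirical_error(rect, X, R))  # -> 0: this rectangle is consistent
```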
S, G, and the Version Space

 most specific hypothesis, S
 most general hypothesis, G
 Every h ∈ H between S and G is consistent with the training set, and together these hypotheses make up the version space (Mitchell, 1997)
VC Dimension

 N points can be labeled in 2^N ways as +/–
 H shatters N points if, for every one of these labelings, there exists an h ∈ H consistent with it; VC(H) is the largest such N
 An axis-aligned rectangle can shatter at most 4 points, so its VC dimension is 4
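The shattering test can be checked by brute force: for each labeling, the tightest rectangle enclosing the positive points is consistent iff any consistent rectangle exists. This is a minimal sketch; the diamond point configuration and function names are chosen for illustration.

```python
from itertools import product

def rectangle_consistent(points, labels):
    """Can some axis-aligned rectangle label exactly the positives?
    The tightest rectangle around the positives is the best candidate."""
    pos = [p for p, y in zip(points, labels) if y == 1]
    if not pos:
        return True  # an empty rectangle labels everything negative
    x_lo, x_hi = min(p[0] for p in pos), max(p[0] for p in pos)
    y_lo, y_hi = min(p[1] for p in pos), max(p[1] for p in pos)
    return all(not (x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi)
               for p, y in zip(points, labels) if y == 0)

def shattered(points):
    """True iff rectangles realize all 2^N labelings of the points."""
    return all(rectangle_consistent(points, labels)
               for labels in product([0, 1], repeat=len(points)))

diamond = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # 4 points in a diamond
print(shattered(diamond))                      # -> True
print(shattered(diamond + [(0, 0)]))           # -> False: center is trapped
```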
Probably Approximately Correct (PAC) Learning

 How many training examples N should we have, so that with probability at least 1 − δ, h has error at most ε? (Blumer et al., 1989)
 The region where the tightest rectangle can err consists of 4 strips; we want each strip to have probability mass at most ε/4
 Pr that one random instance misses a strip: ≤ 1 − ε/4
 Pr that all N instances miss a strip: ≤ (1 − ε/4)^N
 Pr that the N instances miss any of the 4 strips: ≤ 4(1 − ε/4)^N
 Require 4(1 − ε/4)^N ≤ δ; since (1 − x) ≤ exp(−x),
 4 exp(−εN/4) ≤ δ suffices, giving N ≥ (4/ε) log(4/δ)
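The final bound is easy to evaluate numerically; a minimal sketch, with the function name and example values chosen for illustration:

```python
import math

def pac_sample_bound(eps, delta):
    """Smallest integer N satisfying N >= (4/eps) * ln(4/delta),
    the PAC bound derived above for the rectangle learner."""
    return math.ceil((4 / eps) * math.log(4 / delta))

# Error at most 0.1 with probability at least 0.95:
print(pac_sample_bound(eps=0.1, delta=0.05))  # -> 176
```

Note how the bound grows only logarithmically in 1/δ but linearly in 1/ε: demanding higher confidence is cheap, demanding lower error is not.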
Noise and Model Complexity

Use the simpler model because it is
 simpler to use (lower computational complexity)
 easier to train (lower space complexity)
 easier to explain (more interpretable)
 better at generalizing (lower variance; Occam’s razor)
Multiple Classes, C_i, i = 1, ..., K

X = { x^t, r^t }, t = 1, ..., N

r_i^t = 1 if x^t ∈ C_i
r_i^t = 0 if x^t ∈ C_j, j ≠ i

Train K hypotheses h_i(x), i = 1, ..., K:

h_i(x^t) = 1 if x^t ∈ C_i
h_i(x^t) = 0 if x^t ∈ C_j, j ≠ i
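This one-vs-rest label encoding can be sketched directly; the function name and the class indices below are hypothetical.

```python
def one_vs_rest_labels(class_indices, K):
    """Turn per-instance class indices (0..K-1) into K binary label
    vectors: r_i^t = 1 iff instance x^t belongs to class C_i."""
    return [[1 if c == i else 0 for c in class_indices]
            for i in range(K)]

# Three classes, five instances:
labels = one_vs_rest_labels([0, 2, 1, 0, 2], K=3)
print(labels)
# -> [[1, 0, 0, 1, 0], [0, 0, 1, 0, 0], [0, 1, 0, 0, 1]]
```

Each row i is then the binary training signal for hypothesis h_i.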
Regression

X = { x^t, r^t }, t = 1, ..., N, with r^t ∈ ℝ

r^t = f(x^t) + ε

Linear model: g(x) = w1 x + w0
Quadratic model: g(x) = w2 x^2 + w1 x + w0

E(g | X) = (1/N) Σ_{t=1}^{N} [ r^t − g(x^t) ]^2

E(w1, w0 | X) = (1/N) Σ_{t=1}^{N} [ r^t − (w1 x^t + w0) ]^2
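For the linear model, E(w1, w0 | X) has a closed-form minimizer (standard least squares). A minimal sketch, with a noise-free toy dataset chosen so the fit is exact:

```python
def fit_line(xs, rs):
    """Minimize E(w1, w0 | X) = (1/N) * sum (r^t - (w1*x^t + w0))^2
    via the closed-form least-squares solution."""
    N = len(xs)
    mx = sum(xs) / N                      # mean of inputs
    mr = sum(rs) / N                      # mean of targets
    w1 = (sum((x - mx) * (r - mr) for x, r in zip(xs, rs))
          / sum((x - mx) ** 2 for x in xs))
    w0 = mr - w1 * mx
    return w1, w0

# Noise-free example: r^t = 2*x^t + 1, so the fit recovers the line.
w1, w0 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(w1, w0)  # -> 2.0 1.0
```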
Model Selection & Generalization

 Learning is an ill-posed problem; the data alone are not sufficient to find a unique solution
 Hence the need for an inductive bias: assumptions about H
 Generalization: how well a model performs on new data
 Overfitting: H more complex than C or f
 Underfitting: H less complex than C or f
Triple Trade-Off

 There is a trade-off between three factors (Dietterich, 2003):
1. complexity of H, c(H)
2. training set size, N
3. generalization error, E, on new data
 As N increases, E decreases
 As c(H) increases, E first decreases and then increases
Cross-Validation

 To estimate generalization error, we need data unseen during training. We split the data into
 a training set (50%)
 a validation set (25%)
 a test (publication) set (25%)
 Resampling when data is scarce
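The 50/25/25 split above can be sketched with a shuffle-then-slice helper; the function name and fixed seed are illustrative choices, not part of the slides.

```python
import random

def split_50_25_25(data, seed=0):
    """Shuffle the data, then split it into training (50%),
    validation (25%), and test (25%) sets."""
    data = list(data)
    random.Random(seed).shuffle(data)   # fixed seed for reproducibility
    n = len(data)
    a, b = n // 2, n // 2 + n // 4
    return data[:a], data[a:b], data[b:]

train, val, test = split_50_25_25(range(100))
print(len(train), len(val), len(test))  # -> 50 25 25
```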
Dimensions of a Supervised Learner

1. Model: g(x | θ)

2. Loss function: E(θ | X) = Σ_t L( r^t, g(x^t | θ) )

3. Optimization procedure: θ* = arg min_θ E(θ | X)
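The three dimensions map directly onto three functions. In this sketch θ = (w1, w0), L is squared loss, and a crude grid search stands in for a real optimizer; all names and grid ranges are illustrative.

```python
def g(x, theta):
    """Model: g(x | theta), here a line with theta = (w1, w0)."""
    w1, w0 = theta
    return w1 * x + w0

def loss(theta, X, R):
    """E(theta | X) = sum over t of L(r^t, g(x^t | theta)),
    with L taken to be squared error."""
    return sum((r - g(x, theta)) ** 2 for x, r in zip(X, R))

def optimize(X, R):
    """Optimization procedure: theta* = argmin_theta E(theta | X),
    approximated by searching a coarse grid of (w1, w0) values."""
    grid = [w / 10 for w in range(-50, 51)]
    return min(((w1, w0) for w1 in grid for w0 in grid),
               key=lambda theta: loss(theta, X, R))

X, R = [0, 1, 2, 3], [1, 3, 5, 7]   # generated from r^t = 2*x^t + 1
print(optimize(X, R))               # -> (2.0, 1.0)
```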
