
Unit -2

Machine Learning

Chapter – 4
Artificial Neural Networks

1
Overview

• Introduction
• Perceptrons
• Multilayer networks and Backpropagation Algorithm
• Remarks on the Backpropagation Algorithm
• An Illustrative Example: Face Recognition
• Advanced Topics in Artificial Neural Networks

2
4.1 Introduction

• Human brain
– a densely interconnected network of about 10^11 neurons, each connected to about 10^4 others (neuron switching time: approx. 10^-3 sec)
• ANN (Artificial Neural Network)
– modeled after this kind of highly parallel processing

3
Introduction (cont.)

• Properties of ANNs
– Many neuron-like threshold switching units
– Many weighted interconnections among units
– Highly parallel, distributed processing

4
Introduction (cont.)

• 4.2 Neural network representations


-ALVINN drives 70 mph on highways

5
4.3 Appropriate problems for neural network
learning
• Appropriate problems for neural network learning
– Input is high-dimensional discrete or real-valued
(e.g. raw sensor input)
– Output is discrete or real valued
– Output is a vector of values
– Possibly noisy data
– Long training times accepted
– Fast evaluation of the learned function required.
– Not important for humans to understand the weights

• Examples
– Speech phoneme recognition
– Image classification
– Financial prediction

6
4.4 Perceptrons
• 4.4.1 Perceptron

– Input values → Linear weighted sum → Threshold

– Given real-valued inputs x1 through xn, the output o(x1,…,xn) computed by the
perceptron is

o(x1, …, xn) =  1   if w0 + w1x1 + … + wnxn > 0
             = -1   otherwise

where each wi is a real-valued constant, or weight.
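A minimal sketch of this thresholded output in Python (the function name and array layout are illustrative assumptions, not from the text):

```python
import numpy as np

def perceptron_output(x, w):
    """o(x1, ..., xn): threshold the weighted sum w0 + w1*x1 + ... + wn*xn."""
    s = w[0] + np.dot(w[1:], x)   # w[0] plays the role of the bias weight w0
    return 1 if s > 0 else -1
```

With w = [-0.8, 0.5, 0.5], for instance, this reproduces the AND behaviour shown on the next slide.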

7
Perceptrons (cont.)

• Decision surface of a perceptron:


- Linearly separable case, as in (a):
  the examples can be separated by a hyperplane
- Linearly inseparable case, as in (b):
  no single hyperplane can separate them

8
Perceptrons (cont.)

• Examples of Boolean functions


- AND :
  can be classified with w0 = -0.8, w1 = w2 = 0.5

<Training examples>
x1  x2  output
0   0   -1
0   1   -1
1   0   -1
1   1    1

Decision hyperplane: w0 + w1 x1 + w2 x2 = 0
                     -0.8 + 0.5 x1 + 0.5 x2 = 0

<Test results>
x1  x2  Σ wi xi  output
0   0   -0.8     -1
0   1   -0.3     -1
1   0   -0.3     -1
1   1    0.2      1
9
Perceptrons (cont.)
• Examples of Boolean functions
- OR :
  can be classified with w0 = -0.3, w1 = w2 = 0.5

<Training examples>
x1  x2  output
0   0   -1
0   1    1
1   0    1
1   1    1

Decision hyperplane: w0 + w1 x1 + w2 x2 = 0
                     -0.3 + 0.5 x1 + 0.5 x2 = 0

<Test results>
x1  x2  Σ wi xi  output
0   0   -0.3     -1
0   1    0.2      1
1   0    0.2      1
1   1    0.7      1
10
Perceptrons (cont.)

• Examples of Boolean functions


- XOR :
  impossible for a single perceptron to classify, because the examples are not linearly separable

<Training examples>
x1  x2  output
0   0   -1
0   1    1
1   0    1
1   1   -1

cf) A two-layer network of perceptrons can represent XOR,
e.g., x1 XOR x2 = (x1 AND NOT x2) OR (NOT x1 AND x2).

11
4.4.2 Perceptron training rule

• Perceptron training rule

  wi ← wi + Δwi,   where Δwi = η (t − o) xi

Where:
– t = c(x) is the target value
– o is the perceptron output
– η is a small positive constant (e.g., 0.1) called the learning rate
Convergence can be proven
– if the training data is linearly separable
– and η is sufficiently small
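A sketch of this training rule applied example by example (the function name, defaults, and array layout are illustrative assumptions):

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=50):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i, per example."""
    X = np.asarray(X, dtype=float)
    w = np.zeros(X.shape[1] + 1)                           # w[0] is the bias weight w0 (x0 = 1)
    for _ in range(epochs):
        for x, target in zip(X, t):
            o = 1 if w[0] + np.dot(w[1:], x) > 0 else -1   # thresholded perceptron output
            w[0] += eta * (target - o)
            w[1:] += eta * (target - o) * x
    return w
```

For the linearly separable AND or OR examples above (targets in {-1, +1}), this converges to some separating weight vector, though not necessarily the one shown on the earlier slides.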

12
4.4.3 Gradient Descent and Delta rule
• Delta rule :

– Unthresholded: use a linear unit (differentiable)

  o = w0 + w1 x1 + ··· + wn xn

– Training error: learn the wi that minimize the squared error

  E(w) = (1/2) Σ_{d∈D} (t_d − o_d)^2

– where D is the set of training examples, t_d is the target output for example d, and o_d is the linear unit's output for d

13
4.4.3 Gradient Descent and Delta rule
• Hypothesis space

- The (w0, w1) plane represents the entire hypothesis space for a two-weight linear unit.

- For linear units, the error surface is parabolic with a single global minimum; we seek the hypothesis at that minimum.

14
4.4.3 Gradient Descent and Delta rule

• Gradient (steepest) descent rule

- Error (over all training examples): E(w) = (1/2) Σ_{d∈D} (t_d − o_d)^2

- Gradient of E (the vector of partial derivatives):
  ∇E(w) = [ ∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn ]
  - its direction is the direction of steepest increase in E

- Thus the training rule is
  Δw = −η ∇E(w),   i.e.   Δwi = −η ∂E/∂wi
  (the negative sign moves the weights in the direction that decreases E)

15
4.4.3 Gradient Descent and Delta rule
• Derivation of gradient descent

  ∂E/∂wi = Σ_{d∈D} (t_d − o_d)(−x_id),   so   Δwi = η Σ_{d∈D} (t_d − o_d) x_id

  where x_id denotes the single input component x_i of training example d.

- Because the error surface contains only a single global minimum, this algorithm will converge to a weight vector with minimum error, regardless of whether the training examples are linearly separable, provided a sufficiently small η is used.
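A sketch of batch gradient descent for a linear unit using these update rules (names and defaults are illustrative assumptions):

```python
import numpy as np

def train_linear_unit(X, t, eta=0.05, epochs=100):
    """Batch gradient descent (delta rule) for a linear unit o = w0 + w . x."""
    X = np.asarray(X, dtype=float)
    t = np.asarray(t, dtype=float)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend x0 = 1 for the bias w0
    w = np.zeros(Xb.shape[1])                       # initial weights
    for _ in range(epochs):
        o = Xb @ w                                  # linear outputs for all examples
        grad = -(Xb.T @ (t - o))                    # dE/dw for E = 1/2 * sum (t - o)^2
        w -= eta * grad                             # step opposite the gradient
    return w
```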
16
4.4.3 Gradient Descent and Delta rule
• Gradient descent and delta rule
– Search through the space of possible network weights, iteratively
reducing the error E between the training example target values and the
network outputs

17
4.4.3 Gradient Descent and Delta rule

• Stochastic approximation to gradient descent

Stochastic gradient descent (i.e., incremental mode) can sometimes avoid falling into local minima because it follows the gradient of the per-example error E_d, which varies from example to example, rather than the gradient of the overall error E.
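A corresponding sketch of the incremental (stochastic) version, which applies the delta rule after each individual example (illustrative names and defaults):

```python
import numpy as np

def train_linear_unit_sgd(X, t, eta=0.05, epochs=100):
    """Incremental (stochastic) delta rule: update the weights after every example."""
    X = np.asarray(X, dtype=float)
    t = np.asarray(t, dtype=float)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # x0 = 1 for the bias weight
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xd, td in zip(Xb, t):
            od = xd @ w                      # output for this single example
            w += eta * (td - od) * xd        # per-example gradient step on E_d
    return w
```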
18
Perceptrons (cont.)

• Summary
– Perceptron training rule
• finds a hypothesis that perfectly classifies the training data
• converges, provided the training examples are linearly separable

– Delta rule using gradient descent

• converges asymptotically to the minimum-error hypothesis
• converges regardless of whether the training data are linearly separable

19
4.5 Multilayer Networks and Backpropagation
Algorithm
• Speech recognition example of multilayer networks learned
by the backpropagation algorithm
• Highly nonlinear decision surfaces

20
Multilayer Networks and Backpropagation
Algorithm (cont.)
• Sigmoid threshold unit
– What type of unit should be the basis for multilayer networks?
• Perceptron: not differentiable → can't use gradient descent
• Linear unit: multiple layers of linear units still produce only linear functions
• Sigmoid unit: a differentiable threshold function

21
Multilayer Networks and Backpropagation
Algorithm (cont.)
• Sigmoid threshold unit

- σ(y) = 1 / (1 + e^(−y))

- Interesting property: dσ(y)/dy = σ(y) (1 − σ(y))

- Output ranges between 0 and 1

- We can derive gradient descent rules to train one sigmoid unit

- Multilayer networks of sigmoid units → Backpropagation
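A small sketch of the sigmoid squashing function and the derivative property exploited by backpropagation:

```python
import numpy as np

def sigmoid(y):
    """Sigmoid (logistic) function; output ranges between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-y))

def sigmoid_prime(y):
    """The useful property: d sigma(y) / dy = sigma(y) * (1 - sigma(y))."""
    s = sigmoid(y)
    return s * (1.0 - s)
```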

22
Multilayer Networks and Backpropagation
Algorithm (cont.)
• Backpropagation algorithm
– Two layered feedforward networks
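A hedged sketch of stochastic-gradient backpropagation for such a two-layer (one hidden layer) feedforward network of sigmoid units; the function name, array shapes, and defaults such as n_hidden=3 and eta=0.3 are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def backprop_train(X, T, n_hidden=3, eta=0.3, epochs=5000, seed=0):
    """Stochastic-gradient backpropagation for a two-layer sigmoid network.

    X: (m, n_in) inputs; T: (m, n_out) targets in [0, 1].
    Returns (Wh, Wo), each with its bias weight in column 0.
    """
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], T.shape[1]
    Wh = rng.uniform(-0.05, 0.05, size=(n_hidden, n_in + 1))    # small random initial weights
    Wo = rng.uniform(-0.05, 0.05, size=(n_out, n_hidden + 1))
    for _ in range(epochs):
        for x, t in zip(X, T):
            xb = np.concatenate(([1.0], x))      # add the constant input x0 = 1
            h = sigmoid(Wh @ xb)                 # hidden-unit outputs
            hb = np.concatenate(([1.0], h))
            o = sigmoid(Wo @ hb)                 # network outputs
            delta_o = o * (1 - o) * (t - o)                      # output-unit error terms
            delta_h = h * (1 - h) * (Wo[:, 1:].T @ delta_o)      # hidden-unit error terms
            Wo += eta * np.outer(delta_o, hb)    # gradient-descent weight updates
            Wh += eta * np.outer(delta_h, xb)
    return Wh, Wo
```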

23
Multilayer Networks and Backpropagation
Algorithm (cont.)

• Adding momentum
- Another weight-update rule is possible:

  Δw_ji(n) = η δ_j x_ji + α Δw_ji(n − 1)

- the n-th iteration update depends on the (n − 1)-th iteration update
- α : a constant between 0 and 1, called the momentum
- Role of momentum term :
• Keep the ball rolling through small local minima in the error surface.
• Gradually increase the step size of the search in regions where
the gradient is unchanging, thereby speeding convergence
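A minimal sketch of the momentum-augmented update (the function name and arguments are illustrative assumptions):

```python
def momentum_update(W, grad_step, dW_prev, alpha=0.9):
    """One weight update with momentum: Delta w(n) = grad_step + alpha * Delta w(n-1).

    grad_step is the plain gradient-descent update (eta * delta * x) for this iteration;
    dW_prev is the update that was applied on the previous iteration.
    """
    dW = grad_step + alpha * dW_prev
    return W + dW, dW   # new weights, and the update to remember for the next iteration
```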

24
Multilayer Networks and Backpropagation
Algorithm (cont.)
• ALVINN (again..)
- Uses backpropagation algorithm
- Two layered feedforward network
• 960 neural network inputs
• 4 hidden units
• 30 output units

25
4.6 Remarks on Backpropagation Algorithm

• Convergence and local minima


– Gradient descent to some local minimum
• Perhaps not global minimum...
– Add momentum
– Stochastic gradient descent
– Train multiple nets with different initial weights

26
Remarks on Backpropagation Algorithm (cont.)

• Expressive capabilities of ANNs


– Boolean functions:
• Every boolean function can be represented by a network with two layers of units, although the number of hidden units required may grow exponentially with the number of inputs.

– Continuous functions:
• Every bounded continuous function can be approximated with arbitrarily small error by a network with two layers of units [Cybenko 1989; Hornik et al. 1989].

– Arbitrary functions:
• Any function can be approximated to arbitrary accuracy by a network with three layers of units [Cybenko 1988].

27
Remarks on Backpropagation Algorithm (cont.)

• Hypothesis space search and Inductive bias


— Hypothesis space search
• Every possible assignment of network weights represents a
syntactically distinct hypothesis.
• This hypothesis space is continuous, in contrast to the discrete hypothesis space of decision tree learning.

— Inductive bias
• One can roughly characterize it as smooth interpolation
between data points. (Consider a speech recognition
example!)

28
Remarks on Backpropagation Algorithm (cont.)

• Hidden layer representations


- This 8x3x8 network was trained to learn the identity function.
- 8 training examples are used.
- After 5000 training iterations, the three hidden unit values encode
the eight distinct inputs using the encoding shown on the right.

29
Remarks on Backpropagation Algorithm (cont.)

• Learning the 8x3x8 network


- Most of the interesting weight
changes occurred during the
first 2500 iterations.

30
Remarks on Backpropagation Algorithm (cont.)
• Generalization, overfitting, and stopping criterion
– Termination condition
• Training until the error E falls below some predetermined threshold is a poor strategy (overfitting problem)

– Techniques to address the overfitting problem

• Weight decay: decrease each weight by some small factor during each iteration
• Cross-validation
• k-fold cross-validation (when the training set is small)

31
Remarks on Backpropagation Algorithm (cont.)
• Overfitting in ANNs

32
Remarks on Backpropagation Algorithm (cont.)
• Overfitting in ANNs

33
4.7 An Illustrative Example: Face Recognition
• Neural nets for face recognition
– Training images: 20 different persons, with 32 images per person
– (each 120x128 image is coarsely resolved to a 30x32 pixel image)
– After training on 260 images, the network achieves an accuracy of 90% over a separate test set
– Algorithm parameters: η = 0.3, α = 0.3

34
An Illustrative Example: Face Recognition (cont.)
• Learned hidden unit weights

http://www.cs.cmu.edu/tom/faces.html

35
4.8 Advanced Topics in Artificial Neural Networks

• 4.8.1 Alternative error functions


– Penalize large weights (weight decay): reduces the risk of overfitting

– Train on target slopes as well as values

– Minimize the cross entropy: learns a probabilistic output function (Chapter 6)

  E = − Σ_{d∈D} [ t_d log o_d + (1 − t_d) log(1 − o_d) ]

where the target value t_d ∈ {0, 1} and o_d is the probabilistic output of the learning system for training example d = <x_d, t_d> ∈ D, approximating the target function f'(x_d) = p( f(x_d) = 1 ).
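A small sketch of this cross-entropy error computation (the clipping guard against log(0) is an added assumption):

```python
import numpy as np

def cross_entropy(t, o, eps=1e-12):
    """Cross-entropy error for probabilistic outputs o in (0, 1) and targets t in {0, 1}."""
    o = np.clip(np.asarray(o, dtype=float), eps, 1.0 - eps)   # avoid log(0)
    t = np.asarray(t, dtype=float)
    return -np.sum(t * np.log(o) + (1.0 - t) * np.log(1.0 - o))
```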
36
Advanced Topics in Artificial Neural Networks
(cont.)
• 4.8.2 Alternative error minimization procedures
– Weight-update method: two choices
• Direction: choose a direction in which to alter the current weight vector (e.g., the negation of the gradient in Backpropagation)
• Distance: choose a distance to move (e.g., the learning rate η)

– Examples: line search method, conjugate gradient method

37
Advanced Topics in Artificial Neural Networks
(cont.)
• 4.8.3 Recurrent networks

(a) (b) (c)

(a) Feedforward network


(b) Recurrent network
(c) Recurrent network unfolded
in time
38
Advanced Topics in Artificial Neural Networks
(cont.)
• 4.8.4 Dynamically modifying network structure
– To improve generalization accuracy and training efficiency
– Cascade-Correlation algorithm (Fahlman and Lebiere 1990)
• Start with the simplest possible network and add complexity
– Optimal brain damage (LeCun et al. 1990)
• Start with a complex network and prune it as certain connections are found to be inessential.

39
Unit – 2
Machine Learning

Chapter – 5
Evaluating Hypotheses

40
Overview

• Motivation
• Estimating Hypothesis Accuracy
• Basics of Sampling Theory
• A General Approach for Deriving Confidence Intervals
• Difference in Error of Two Hypotheses
• Comparing Learning Algorithms

41
5.1 Motivation

• The importance of evaluating hypotheses


– To understand whether to use the hypothesis
– An integral component of many learning algorithms (e.g., post-pruning in decision trees)
• Two difficulties
– Bias in the estimate
→ choose test examples independently of the training examples
– Variance in the estimate
→ use a larger set of test examples
• Subjects in this chapter
– Evaluating learned hypotheses
– Comparing the accuracy of two hypotheses
– Comparing the accuracy of two learning algorithms

42
5.2 Estimating Hypothesis Accuracy
• Notations
X : the space of all possible instances
D : a probability distribution over X
f : the target function
h : a hypothesis
S : a sample of n instances drawn from D
• 5.2.1 Sample error and true error
– Sample error: the fraction of S that h misclassifies

  error_S(h) = (1/n) Σ_{x∈S} δ( f(x), h(x) ),   where δ( f(x), h(x) ) = 1 if f(x) ≠ h(x), and 0 otherwise

– True error: the probability that h misclassifies an instance drawn at random from D

  error_D(h) = Pr_{x∈D}[ f(x) ≠ h(x) ]

"While we want to know error_D(h), we can measure only error_S(h)."
→ "How good an estimate of error_D(h) is provided by error_S(h)?"
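A minimal sketch of computing error_S(h); the callable interface for h and f is an illustrative assumption:

```python
def sample_error(h, f, S):
    """error_S(h): the fraction of the sample S that hypothesis h misclassifies."""
    S = list(S)
    return sum(1 for x in S if h(x) != f(x)) / len(S)
```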

43
Estimating Hypothesis Accuracy (cont.)

• 5.2.2 Confidence interval for a discrete-valued hypothesis

If S contains n examples (n ≥ 30) drawn independently of h and of each other, and error_S(h) = r/n, then with approximately N% probability error_D(h) lies in the interval

  error_S(h) ± z_N sqrt( error_S(h)(1 − error_S(h)) / n )

Requirements
- discrete-valued hypothesis
- S drawn randomly from D
- the data independent of the hypothesis
Recommendation
- n ≥ 30 and error_S(h) not too close to 0 or 1, or
- n · error_S(h)(1 − error_S(h)) ≥ 5
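A small sketch of this interval computation; the function and dictionary names are illustrative, and the z_N values are those of Table 5.1 later in this chapter:

```python
import math

# Two-sided z_N constants from Table 5.1
Z_N = {0.50: 0.67, 0.68: 1.00, 0.80: 1.28, 0.90: 1.64,
       0.95: 1.96, 0.98: 2.33, 0.99: 2.58}

def error_confidence_interval(error_s, n, confidence=0.95):
    """Approximate N% confidence interval for error_D(h), given error_S(h) over n examples."""
    half_width = Z_N[confidence] * math.sqrt(error_s * (1 - error_s) / n)
    return error_s - half_width, error_s + half_width

# e.g. error_confidence_interval(0.3, 40) -> roughly (0.16, 0.44), as in the later example
```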
44
5.3 Basics of Sampling Theory

• 5.3.1 Error estimation and estimating binomial proportions

– Measuring the probability that h misclassifies: repeating the experiment with different randomly drawn samples S_i gives a random variable error_{S_i}(h)
– error_{S_i}(h) follows a Binomial distribution

Coin toss analogy: "the probability of observing r heads" ↔ "the probability that h misclassifies"

• Toss a worn and bent coin n times; the probability of heads is p
• Heads turn up r times
• Correspondence: p ↔ error_D(h),  p̂ = r/n = error_S(h)

45
Basics of Sampling Theory (cont.)

5.3.2 Binomial distribution

• Y : a random variable that can take on two values, e.g., 0 or 1
• p : the probability that Y = 1 on any single trial
• Y1, Y2, ⋯, Yn : a sequence of i.i.d. random variables distributed as Y
• R = Σ_{i=1}^{n} Yi : the number of trials among the n independent experiments for which Yi = 1

The probability that R takes on a specific value r:

  Pr(R = r) = [ n! / ( r! (n − r)! ) ] p^r (1 − p)^(n−r)
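A one-line sketch of this Binomial probability in Python:

```python
from math import comb

def binomial_pmf(r, n, p):
    """Pr(R = r): the probability of exactly r successes (e.g., misclassifications) in n trials."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)
```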

46
Basics of Sampling Theory (cont.)
• 5.3.3 Mean and variance

Expected value (mean):   E[Y] = Σ_i y_i Pr(Y = y_i)

Variance:   Var[Y] = E[ (Y − E[Y])^2 ]
  "how far the random variable is expected to vary from its mean value"

Standard deviation:   σ_Y = sqrt( Var[Y] ) = sqrt( E[ (Y − E[Y])^2 ] )

※ In the case of a Binomial distribution

Expected value (mean):   E[Y] = np
Variance:   Var[Y] = np(1 − p)
Standard deviation:   σ_Y = sqrt( np(1 − p) )

47
Basics of Sampling Theory (cont.)

• 5.3.4 Estimators, bias, and variance

– An estimator is a random variable used to estimate a true value we do not know
  [Example] error_S(h) estimates the true error error_D(h)
– Estimation bias = E[Y] − p
  : the difference between the expected value of the estimator and the true value
– Unbiased estimator: a Y for which E[Y] − p = 0
  • error_S(h) is an unbiased estimator of error_D(h)
– Variance: the smaller, the better

  σ_{error_S(h)} = σ_r / n = sqrt( p(1 − p) / n ) ≈ sqrt( error_S(h)(1 − error_S(h)) / n )

Two quick remarks
- S and h must be chosen independently
- Don't confuse "estimation bias" with "inductive bias"

48
Basics of Sampling Theory (cont.)

• Example
– 12 errors on a sample of 40 randomly drawn test examples

  p̂ = error_S(h) = r/n = 12/40 = 0.3

  σ_r^2 = np(1 − p) ≈ n p̂(1 − p̂) = 40 · 0.3 · (1 − 0.3) = 8.4,   so σ_r ≈ 2.9

  σ_{error_S(h)} = σ_r / n ≈ 2.9 / 40 ≈ 0.07

49
Basics of Sampling Theory (cont.)

• Example
– 300 errors on a sample of 1000 randomly drawn test examples

  p̂ = error_S(h) = r/n = 300/1000 = 0.3

  σ_r^2 = np(1 − p) ≈ n p̂(1 − p̂) = 1000 · 0.3 · (1 − 0.3) = 210,   so σ_r ≈ 14.5

  σ_{error_S(h)} = σ_r / n ≈ 14.5 / 1000 ≈ 0.0145

– As σ_{error_S(h)} gets smaller, the confidence interval gets narrower at the same confidence level

50
Basics of Sampling Theory (cont.)

• Normal distribution
– A bell-shaped distribution specified by its mean μ and standard deviation σ
– Central limit theorem (see Section 5.4.1)

  "A Binomial distribution can be approximated by a Normal distribution"

Normal distribution:   X ~ N(μ, σ^2)

Probability density function:   p(x) = ( 1 / sqrt(2πσ^2) ) exp( −(1/2) ((x − μ)/σ)^2 )

Cumulative distribution:   Pr(a ≤ X ≤ b) = ∫_a^b p(x) dx

Expected value, variance, and standard deviation:   E[X] = μ,  Var[X] = σ^2,  σ_X = σ

51
Basics of Sampling Theory (cont.)

• Normal distribution
– Table 5.1 for the standard Normal distribution (μ = 0, σ = 1)
  : the constant z_N scales the interval about the mean that contains N% of the probability

Table 5.1
Confidence level N%   50%   68%   80%   90%   95%   98%   99%
z_N                   0.67  1.00  1.28  1.64  1.96  2.33  2.58

52
Basics of Sampling Theory (cont.)

• 5.3.5 Confidence intervals

– N% confidence interval
  : an interval that is expected with probability N% to contain p
– Confidence interval for the mean μ of a Normal distribution, estimated by y:   y ± z_N σ

Obtaining confidence intervals for error_D(h)

- error_S(h) follows a Binomial distribution with mean μ = error_D(h) and standard deviation σ ≈ sqrt( error_S(h)(1 − error_S(h)) / n )
- For large n, this Binomial distribution is closely approximated by a Normal distribution
- Find the N% confidence interval for estimating the mean of that Normal distribution:

  error_S(h) ± z_N sqrt( error_S(h)(1 − error_S(h)) / n )     (valid when n ≥ 30, or np(1 − p) ≥ 5)

– Two approximations involved

• error_D(h) approximated by error_S(h) when computing σ
• the Binomial distribution approximated by a Normal distribution
53
Basics of Sampling Theory (cont.)

• 5.3.6 Two-sided and one-sided bounds


– A two-sided bound specifies both a lower and an upper bound
– A one-sided bound specifies only one of them

"What is the probability that error_D(h) is at most U?" → a one-sided bound

A 100(1 − α)% two-sided confidence interval : α is the probability that the correct value lies outside the interval, with α/2 below the lower bound and α/2 above the upper bound
→ the same upper bound therefore gives a one-sided 100(1 − α/2)% confidence interval

54
Basics of Sampling Theory (cont.)

• Example
– 12 errors on a sample of 40 randomly drawn test examples

  error_S(h) = 0.3,   σ_{error_S(h)} ≈ 0.07

(Two-sided) 95% confidence interval (α = 0.05):

  error_S(h) ± z_N sqrt( error_S(h)(1 − error_S(h)) / n ) = 0.3 ± 1.96 · 0.07 = 0.3 ± 0.14

(One-sided) 97.5% confidence interval (α/2 = 0.025):

  error_D(h) is at most 0.3 + 0.14 = 0.44

→ No assertion about the lower bound!

55
Next…

• A general approach for deriving confidence intervals


— Central Limit Theorem
• Difference in errors of two hypotheses
— Hypothesis testing
• Comparing learning algorithms
— Paired t-tests
— Practical considerations

56
Estimation and Hypothesis Testing

Central Limit Theorem

[Diagram: a population parameter (e.g., the mean μ, or a hypothesis such as H0: μ = μ0) is inferred from a sample statistic (e.g., the sample mean X̄) — by an estimator in estimation, or by a test statistic in hypothesis testing.]
57
5.4 A General Approach for
Deriving Confidence Intervals
• General process for estimating a parameter

(1) Identify the underlying population parameter p to be estimated   (e.g., error_D(h))
(2) Define the estimator Y   (e.g., error_S(h))
  : a minimum-variance, unbiased estimator is desirable
(3) Determine the probability distribution D_Y that governs Y
  : its mean μ and variance σ^2
(4) Determine the N% confidence interval from D_Y
  : lower bound and upper bound

  μ ± z_N σ ;   for a discrete-valued hypothesis:   error_S(h) ± z_N sqrt( error_S(h)(1 − error_S(h)) / n )
58
A General Approach for
Deriving Confidence Intervals (cont.)
• 5.4.1 Central limit theorem

Consider a set of i.i.d. random variables Y1 ⋯ Yn governed by an arbitrary distribution with mean μ and finite variance σ^2. Define the sample mean

  Ȳn = (1/n) Σ_{i=1}^{n} Yi

Then as n → ∞, the distribution governing

  Z_n = ( Ȳn − μ ) / ( σ / sqrt(n) )

approaches a Normal distribution with zero mean and standard deviation equal to 1, i.e., Z_n ~ N(0, 1). Equivalently, for large n,

  Ȳn ≈ N( μ, σ^2 / n ),   even though the individual Y1 ⋯ Yn follow an unknown distribution with mean μ and variance σ^2.
59
A General Approach for
Deriving Confidence Intervals (cont.)
• Why is the central limit theorem useful?
– We know the (approximately Normal) distribution of the sample mean Ȳ even when we do not know the distribution of the individual Yi
– We can determine the mean and variance of Ȳ from the mean μ and variance σ^2 of the Yi

[Illustration: drawing K repeated samples, each of n values Y_k1 ⋯ Y_kn from the unknown distribution with mean μ and variance σ^2, gives sample means Ȳ_1 ⋯ Ȳ_K that are approximately Normally distributed with mean(Ȳ_k) ≈ μ and variance(Ȳ_k) ≈ σ^2 / n.]

→ Then we can compute a confidence interval:   μ ± z_N σ_Ȳ
60
5.5 A Difference in Error of Two Hypotheses
• Parameter to be estimated
: the difference between the true errors of two hypotheses, h1 and h2
: parameter   d = error_D(h1) − error_D(h2)

• CASE 1 : Tested on independent test samples

– Hypothesis h1 : sample S1 containing n1 examples
– Hypothesis h2 : sample S2 containing n2 examples

: estimator   d̂ = error_S1(h1) − error_S2(h2)

– d̂ gives an unbiased estimate of d, i.e., E[d̂] = d:

  E[d̂] − d = E[ error_S1(h1) − error_S2(h2) ] − ( error_D(h1) − error_D(h2) )
           = [ E[error_S1(h1)] − error_D(h1) ] − [ E[error_S2(h2)] − error_D(h2) ]
           = [ error_D(h1) − error_D(h1) ] − [ error_D(h2) − error_D(h2) ]
           = 0

61
A Difference in Error of Two Hypotheses (cont.)

• CASE 1 : Tested on independent test samples (continued)

– For large n1, n2 (≥ 30), the distribution of d̂ is approximately Normal,
  since error_S1(h1) and error_S2(h2) are each approximately Normal, and the difference of two Normal random variables is also Normal
– Mean of d̂:

  E[d̂] = E[ error_S1(h1) − error_S2(h2) ] = μ1 − μ2 = d

  (recall: E[aX + bY] = aE[X] + bE[Y])

– Variance of d̂:

  σ_d̂^2 ≈ error_S1(h1)(1 − error_S1(h1)) / n1 + error_S2(h2)(1 − error_S2(h2)) / n2

  (recall: Var[aX + bY] = a^2 Var[X] + b^2 Var[Y] if X and Y are independent random variables)

– Confidence interval for d (when n1 and n2 are large enough):

  d̂ ± z_N sqrt( error_S1(h1)(1 − error_S1(h1)) / n1 + error_S2(h2)(1 − error_S2(h2)) / n2 )

62
A Difference in Error of Two Hypotheses (cont.)

• CASE 2 : Tested on a single test sample


: Hypothesis h1 & Hypothesis h2 are tested on a single test sample S.

: estimator   d̂ = error_S(h1) − error_S(h2)

– Confidence interval for d:

  d̂ ± z_N sqrt( [ error_S(h1)(1 − error_S(h1)) + error_S(h2)(1 − error_S(h2)) ] / n )

– Usually a smaller variance than in CASE 1:

  using a single sample S eliminates the variance due to random differences in the compositions of S1 and S2.

63
A Difference in Error of Two Hypotheses (cont.)

• 5.5.1 Hypothesis testing


: testing a specific conjecture about a parameter, rather than computing a confidence interval for it

– Situation
• independent samples S1 and S2, with |S1| = |S2| = 100
• error_S1(h1) = 0.30
• error_S2(h2) = 0.20
• d̂ = 0.10

"What is the probability that error_D(h1) > error_D(h2), given that d̂ = 0.10?"
= "What is the probability that d > 0, given that d̂ = 0.10?"

• This is the probability that d̂ falls into the one-sided interval d̂ < μ_d̂ + 0.10, i.e., d̂ < μ_d̂ + z_N σ_d̂ with z_N σ_d̂ = 0.10

  σ_d̂ ≈ sqrt( 0.3(1 − 0.3)/100 + 0.2(1 − 0.2)/100 ) ≈ 0.061,   so z_N = 0.10 / 0.061 ≈ 1.64

• 1.64 is the two-sided z_N for a 90% confidence interval, which corresponds to a one-sided bound of 95%

– Test result
  The probability that error_D(h1) > error_D(h2) is approximately 95%:
  • accept the conjecture with 95% confidence
  • equivalently, reject H0 at the 5% significance level

64
5.6 Comparing Learning Algorithms

Which of LA and LB is the better learning method on average for learning


some particular target function f ?
• Comparing the performance of two algorithms (LA, LB)
: the expected value of the difference in their errors, where LA(S) is the hypothesis output by learning method LA on the training sample S:

  E_{S⊂D}[ error_D(LA(S)) − error_D(LB(S)) ]

(S : training data sampled from the underlying distribution D)

• Practical ways of algorithm comparison given limited sample, D0, of data

(1) Partitioning the data set into a training set and a test set
: the limited sample D0 is divided into a training set S0 and a test set T0, and the two methods are compared by

  error_T0(LA(S0)) − error_T0(LB(S0))

65
Comparing Learning Algorithms (cont.)

(2) Repeated partitioning and averaging : the k-fold method

: D0 is repeatedly divided into disjoint training and test sets, and the mean δ̄ of the test-set error differences from these different experiments is calculated.

[Partition diagram: D0 is split into k disjoint test sets T1, ⋯, Tk of size at least 30; on fold i the training set is Si = D0 − Ti.]

  δ_i = error_Ti(LA(Si)) − error_Ti(LB(Si)),   δ̄ = (1/k) Σ_{i=1}^{k} δ_i

δ̄ returned from the above is an estimate of

  E_{S⊂D0}[ error_D(LA(S)) − error_D(LB(S)) ]

which is again an approximation of

  E_{S⊂D}[ error_D(LA(S)) − error_D(LB(S)) ]

66
Comparing Learning Algorithms (cont.)

(2) Repeated partitioning and averaging : the k-fold method (continued)

• The approximate N% confidence interval for estimating the expected difference using δ̄ is

  δ̄ ± t_{N,k−1} s_δ̄,   where   s_δ̄ = sqrt( (1 / (k(k − 1))) Σ_{i=1}^{k} (δ_i − δ̄)^2 )

- N : confidence level
- k − 1 : degrees of freedom, the number of independent random events producing the values of the random variable δ̄
- As k → ∞, t_{N,k−1} approaches the constant z_N

Paired tests : tests in which the hypotheses are evaluated over identical samples.
Paired tests generally produce tighter confidence intervals than tests on separate data samples, because they eliminate differences due to the random makeup of the samples.
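A sketch of this paired-t confidence interval; using scipy to obtain t_{N,k-1} is an assumption (any t table would serve):

```python
import math
from scipy import stats

def paired_t_interval(deltas, confidence=0.95):
    """delta_bar +/- t_{N,k-1} * s_delta_bar for k paired error differences delta_i."""
    k = len(deltas)
    d_bar = sum(deltas) / k
    s = math.sqrt(sum((d - d_bar) ** 2 for d in deltas) / (k * (k - 1)))
    t = stats.t.ppf((1 + confidence) / 2, df=k - 1)   # two-sided t_{N,k-1} constant
    return d_bar - t * s, d_bar + t * s
```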
67
Comparing Learning Algorithms (cont.)

• 5.6.1 Paired t-test


: statistical justification of the comparison procedure above

– Estimation procedure
(1) Given i.i.d. random variables Y1, …, Yk
(2) Estimate the mean μ of the distribution governing the Yi
(3) Estimator:   Ȳ = (1/k) Σ_{i=1}^{k} Yi

68
Comparing Learning Algorithms (cont.)

- The t-test applies to the special case of this estimation procedure in which each Yi follows a Normal distribution. It provides the interval

  μ = E[Yi] ∈ Ȳ ± t_{N,k−1} s_Ȳ,   where   s_Ȳ = sqrt( (1 / (k(k − 1))) Σ_{i=1}^{k} (Yi − Ȳ)^2 )

  and t_{N,k−1} is a constant characterizing the t distribution, just as z_N characterizes a Normal distribution.

- In the preceding comparison of learning algorithms, if on each iteration a new random training set Si and a new random test set Ti were drawn from the underlying instance distribution (instead of from the fixed sample D0), then each

  δ_i = error_Ti(h_A) − error_Ti(h_B),   with |Ti| ≥ 30,

  would follow an approximately Normal distribution, and the t-test result would thus give

  E_{S⊂D}[ error_D(LA(S)) − error_D(LB(S)) ] ∈ δ̄ ± t_{N,k−1} s_δ̄

69
Comparing Learning Algorithms (cont.)

• 5.6.2 Practical considerations

The paired t-test does not strictly justify the confidence interval discussed above, because δ̄ is evaluated on the limited data D0 using a partitioning method rather than on independent samples. Nevertheless, this confidence interval provides a good basis for experimental comparisons of learning methods when data is limited.

(1) k-fold method
• k is limited by the size of D0
• test sets are drawn independently of one another (each example is tested exactly once)

(2) Randomized method
: randomly choose a test set of at least 30 examples from D0 and use the remaining examples for training
• the procedure can be repeated indefinitely (k can be arbitrarily large → narrower confidence interval)
• however, the test sets are not independent
70
