
Unit -2

Machine Learning

Chapter – 4
Artificial Neural Networks

1
Overview

• Introduction
• Perceptrons
• Multilayer networks and Backpropagation Algorithm
• Remarks on the Backpropagation Algorithm
• An Illustrative Example: Face Recognition
• Advanced Topics in Artificial Neural Networks

2
4.1 Introduction

• Human brain
– a densely interconnected network of about 10^11 neurons, each connected to about 10^4 others (neuron switching time: approx. 10^-3 sec)
• ANN (Artificial Neural Network)
– modeled after this kind of highly parallel processing

3
Introduction (cont.)

• Properties of ANNs
– Many neuron-like threshold switching units
– Many weighted interconnections among units
– Highly parallel, distributed processing

4
Introduction (cont.)

• 4.2 Neural network representations


-ALVINN drives 70 mph on highways

5
4.3 Appropriate problems for neural network
learning
• Appropriate problems for neural network learning
– Input is high-dimensional discrete or real-valued
(e.g. raw sensor input)
– Output is discrete or real valued
– Output is a vector of values
– Possibly noisy data
– Long training times accepted
– Fast evaluation of the learned function required.
– Not important for humans to understand the weights

• Examples
– Speech phoneme recognition
– Image classification
– Financial prediction

6
4.4 Perceptrons
• 4.4.1 Perceptron

– Input values → Linear weighted sum → Threshold

– Given real-valued inputs x1 through xn, the output o(x1,…,xn) computed by the
perceptron is

o(x1, …, xn) =  1   if w0 + w1x1 + … + wnxn > 0
             = -1   otherwise

where each wi is a real-valued constant, or weight.
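A minimal sketch of this thresholded output in Python (the function name and array layout are illustrative assumptions, not from the text):

```python
import numpy as np

def perceptron_output(x, w):
    """o(x1, ..., xn): threshold the weighted sum w0 + w1*x1 + ... + wn*xn."""
    s = w[0] + np.dot(w[1:], x)   # w[0] plays the role of the bias weight w0
    return 1 if s > 0 else -1
```

With w = [-0.8, 0.5, 0.5], for instance, this reproduces the AND behaviour shown on the next slide.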

7
Perceptrons (cont.)

• Decision surface of a perceptron:


- Linearly separable case, as in (a):
  the examples can be separated by a hyperplane
- Linearly inseparable case, as in (b):
  no single hyperplane can separate them

8
Perceptrons (cont.)

• Examples of Boolean functions


- AND :
  can be classified with w0 = -0.8, w1 = w2 = 0.5

<Training examples>
x1  x2  output
0   0   -1
0   1   -1
1   0   -1
1   1    1

Decision hyperplane: w0 + w1 x1 + w2 x2 = 0
                     -0.8 + 0.5 x1 + 0.5 x2 = 0

<Test results>
x1  x2  Σ wi xi  output
0   0   -0.8     -1
0   1   -0.3     -1
1   0   -0.3     -1
1   1    0.2      1
9
Perceptrons (cont.)
• Examples of Boolean functions
- OR :
  can be classified with w0 = -0.3, w1 = w2 = 0.5

<Training examples>
x1  x2  output
0   0   -1
0   1    1
1   0    1
1   1    1

Decision hyperplane: w0 + w1 x1 + w2 x2 = 0
                     -0.3 + 0.5 x1 + 0.5 x2 = 0

<Test results>
x1  x2  Σ wi xi  output
0   0   -0.3     -1
0   1    0.2      1
1   0    0.2      1
1   1    0.7      1
10
Perceptrons (cont.)

• Examples of Boolean functions


- XOR :
  impossible for a single perceptron to classify, because the examples are not linearly separable

<Training examples>
x1  x2  output
0   0   -1
0   1    1
1   0    1
1   1   -1

cf) A two-layer network of perceptrons can represent XOR,
e.g., x1 XOR x2 = (x1 AND NOT x2) OR (NOT x1 AND x2).

11
4.4.2 Perceptron training rule

• Perceptron training rule

  wi ← wi + Δwi,   where Δwi = η (t − o) xi

Where:
– t = c(x) is the target value
– o is the perceptron output
– η is a small positive constant (e.g., 0.1) called the learning rate
Convergence can be proven
– if the training data is linearly separable
– and η is sufficiently small
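A sketch of this training rule applied example by example (the function name, defaults, and array layout are illustrative assumptions):

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=50):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i, per example."""
    X = np.asarray(X, dtype=float)
    w = np.zeros(X.shape[1] + 1)                           # w[0] is the bias weight w0 (x0 = 1)
    for _ in range(epochs):
        for x, target in zip(X, t):
            o = 1 if w[0] + np.dot(w[1:], x) > 0 else -1   # thresholded perceptron output
            w[0] += eta * (target - o)
            w[1:] += eta * (target - o) * x
    return w
```

For the linearly separable AND or OR examples above (targets in {-1, +1}), this converges to some separating weight vector, though not necessarily the one shown on the earlier slides.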

12
4.4.3 Gradient Descent and Delta rule
• Delta rule :

– Unthresholded: use a linear unit (differentiable)

  o = w0 + w1 x1 + ··· + wn xn

– Training error: learn the wi that minimize the squared error

  E(w) = (1/2) Σ_{d∈D} (t_d − o_d)^2

– where D is the set of training examples, t_d is the target output for example d, and o_d is the linear unit's output for d

13
4.4.3 Gradient Descent and Delta rule
• Hypothesis space

- The (w0, w1) plane represents the entire hypothesis space for a two-weight linear unit.

- For linear units, the error surface is parabolic with a single global minimum; we seek the hypothesis at that minimum.

14
4.4.3 Gradient Descent and Delta rule

• Gradient (steepest) descent rule

- Error (over all training examples): E(w) = (1/2) Σ_{d∈D} (t_d − o_d)^2

- Gradient of E (the vector of partial derivatives):
  ∇E(w) = [ ∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn ]
  - its direction is the direction of steepest increase in E

- Thus the training rule is
  Δw = −η ∇E(w),   i.e.   Δwi = −η ∂E/∂wi
  (the negative sign moves the weights in the direction that decreases E)

15
4.4.3 Gradient Descent and Delta rule
• Derivation of gradient descent

  ∂E/∂wi = Σ_{d∈D} (t_d − o_d)(−x_id),   so   Δwi = η Σ_{d∈D} (t_d − o_d) x_id

  where x_id denotes the single input component x_i of training example d.

- Because the error surface contains only a single global minimum, this algorithm will converge to a weight vector with minimum error, regardless of whether the training examples are linearly separable, provided a sufficiently small η is used.
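A sketch of batch gradient descent for a linear unit using these update rules (names and defaults are illustrative assumptions):

```python
import numpy as np

def train_linear_unit(X, t, eta=0.05, epochs=100):
    """Batch gradient descent (delta rule) for a linear unit o = w0 + w . x."""
    X = np.asarray(X, dtype=float)
    t = np.asarray(t, dtype=float)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend x0 = 1 for the bias w0
    w = np.zeros(Xb.shape[1])                       # initial weights
    for _ in range(epochs):
        o = Xb @ w                                  # linear outputs for all examples
        grad = -(Xb.T @ (t - o))                    # dE/dw for E = 1/2 * sum (t - o)^2
        w -= eta * grad                             # step opposite the gradient
    return w
```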
16
4.4.3 Gradient Descent and Delta rule
• Gradient descent and delta rule
– Search through the space of possible network weights, iteratively
reducing the error E between the training example target values and the
network outputs

17
4.4.3 Gradient Descent and Delta rule

• Stochastic approximation to gradient descent

Stochastic gradient descent (i.e., incremental mode) can sometimes avoid falling into local minima because it follows the gradient of the per-example error E_d, which varies from example to example, rather than the gradient of the overall error E.
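A corresponding sketch of the incremental (stochastic) version, which applies the delta rule after each individual example (illustrative names and defaults):

```python
import numpy as np

def train_linear_unit_sgd(X, t, eta=0.05, epochs=100):
    """Incremental (stochastic) delta rule: update the weights after every example."""
    X = np.asarray(X, dtype=float)
    t = np.asarray(t, dtype=float)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # x0 = 1 for the bias weight
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xd, td in zip(Xb, t):
            od = xd @ w                      # output for this single example
            w += eta * (td - od) * xd        # per-example gradient step on E_d
    return w
```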
18
Perceptrons (cont.)

• Summary
– Perceptron training rule
• finds a hypothesis that perfectly classifies the training data
• converges, provided the training examples are linearly separable

– Delta rule using gradient descent

• converges asymptotically to the minimum-error hypothesis
• converges regardless of whether the training data are linearly separable

19
4.5 Multilayer Networks and Backpropagation
Algorithm
• Speech recognition example of multilayer networks learned
by the backpropagation algorithm
• Highly nonlinear decision surfaces

20
Multilayer Networks and Backpropagation
Algorithm (cont.)
• Sigmoid threshold unit
– What type of unit should be the basis for multilayer networks?
• Perceptron: not differentiable → can't use gradient descent
• Linear unit: multiple layers of linear units still produce only linear functions
• Sigmoid unit: a differentiable threshold function

21
Multilayer Networks and Backpropagation
Algorithm (cont.)
• Sigmoid threshold unit

- σ(y) = 1 / (1 + e^(−y))

- Interesting property: dσ(y)/dy = σ(y) (1 − σ(y))

- Output ranges between 0 and 1

- We can derive gradient descent rules to train one sigmoid unit

- Multilayer networks of sigmoid units → Backpropagation
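A small sketch of the sigmoid squashing function and the derivative property exploited by backpropagation:

```python
import numpy as np

def sigmoid(y):
    """Sigmoid (logistic) function; output ranges between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-y))

def sigmoid_prime(y):
    """The useful property: d sigma(y) / dy = sigma(y) * (1 - sigma(y))."""
    s = sigmoid(y)
    return s * (1.0 - s)
```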

22
Multilayer Networks and Backpropagation
Algorithm (cont.)
• Backpropagation algorithm
– Two layered feedforward networks
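A hedged sketch of stochastic-gradient backpropagation for such a two-layer (one hidden layer) feedforward network of sigmoid units; the function name, array shapes, and defaults such as n_hidden=3 and eta=0.3 are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def backprop_train(X, T, n_hidden=3, eta=0.3, epochs=5000, seed=0):
    """Stochastic-gradient backpropagation for a two-layer sigmoid network.

    X: (m, n_in) inputs; T: (m, n_out) targets in [0, 1].
    Returns (Wh, Wo), each with its bias weight in column 0.
    """
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], T.shape[1]
    Wh = rng.uniform(-0.05, 0.05, size=(n_hidden, n_in + 1))    # small random initial weights
    Wo = rng.uniform(-0.05, 0.05, size=(n_out, n_hidden + 1))
    for _ in range(epochs):
        for x, t in zip(X, T):
            xb = np.concatenate(([1.0], x))      # add the constant input x0 = 1
            h = sigmoid(Wh @ xb)                 # hidden-unit outputs
            hb = np.concatenate(([1.0], h))
            o = sigmoid(Wo @ hb)                 # network outputs
            delta_o = o * (1 - o) * (t - o)                      # output-unit error terms
            delta_h = h * (1 - h) * (Wo[:, 1:].T @ delta_o)      # hidden-unit error terms
            Wo += eta * np.outer(delta_o, hb)    # gradient-descent weight updates
            Wh += eta * np.outer(delta_h, xb)
    return Wh, Wo
```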

23
Multilayer Networks and Backpropagation
Algorithm (cont.)

• Adding momentum
- Another weight-update rule is possible:

  Δw_ji(n) = η δ_j x_ji + α Δw_ji(n − 1)

- the n-th iteration update depends on the (n − 1)-th iteration update
- α : a constant between 0 and 1, called the momentum
- Role of momentum term :
• Keep the ball rolling through small local minima in the error surface.
• Gradually increase the step size of the search in regions where
the gradient is unchanging, thereby speeding convergence
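A minimal sketch of the momentum-augmented update (the function name and arguments are illustrative assumptions):

```python
def momentum_update(W, grad_step, dW_prev, alpha=0.9):
    """One weight update with momentum: Delta w(n) = grad_step + alpha * Delta w(n-1).

    grad_step is the plain gradient-descent update (eta * delta * x) for this iteration;
    dW_prev is the update that was applied on the previous iteration.
    """
    dW = grad_step + alpha * dW_prev
    return W + dW, dW   # new weights, and the update to remember for the next iteration
```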

24
Multilayer Networks and Backpropagation
Algorithm (cont.)
• ALVINN (again..)
- Uses backpropagation algorithm
- Two layered feedforward network
• 960 neural network inputs
• 4 hidden units
• 30 output units

25
4.6 Remarks on Backpropagation Algorithm

• Convergence and local minima


– Gradient descent to some local minimum
• Perhaps not global minimum...
– Add momentum
– Stochastic gradient descent
– Train multiple nets with different initial weights

26
Remarks on Backpropagation Algorithm (cont.)

• Expressive capabilities of ANNs


– Boolean functions:
• Every boolean function can be represented by a network with two layers of units, although the number of hidden units required may grow exponentially with the number of inputs.

– Continuous functions:
• Every bounded continuous function can be approximated with arbitrarily small error by a network with two layers of units [Cybenko 1989; Hornik et al. 1989].

– Arbitrary functions:
• Any function can be approximated to arbitrary accuracy by a network with three layers of units [Cybenko 1988].

27
Remarks on Backpropagation Algorithm (cont.)

• Hypothesis space search and Inductive bias


— Hypothesis space search
• Every possible assignment of network weights represents a
syntactically distinct hypothesis.
• This hypothesis space is continuous, in contrast to the discrete hypothesis space of decision tree learning.

— Inductive bias
• One can roughly characterize it as smooth interpolation
between data points. (Consider a speech recognition
example!)

28
Remarks on Backpropagation Algorithm (cont.)

• Hidden layer representations


- This 8x3x8 network was trained to learn the identity function.
- 8 training examples are used.
- After 5000 training iterations, the three hidden unit values encode
the eight distinct inputs using the encoding shown on the right.

29
Remarks on Backpropagation Algorithm (cont.)

• Learning the 8x3x8 network


- Most of the interesting weight
changes occurred during the
first 2500 iterations.

30
Remarks on Backpropagation Algorithm (cont.)
• Generalization, overfitting, and stopping criterion
– Termination condition
• Training until the error E falls below some predetermined threshold is a poor strategy (overfitting problem)

– Techniques to address the overfitting problem

• Weight decay: decrease each weight by some small factor during each iteration
• Cross-validation
• k-fold cross-validation (when the training set is small)

31
Remarks on Backpropagation Algorithm (cont.)
• Overfitting in ANNs

32
Remarks on Backpropagation Algorithm (cont.)
• Overfitting in ANNs

33
4.7 An Illustrative Example: Face Recognition
• Neural nets for face recognition
– Training images: 20 different persons, with 32 images per person
– (each 120x128 image is coarsely resolved to a 30x32 pixel image)
– After training on 260 images, the network achieves an accuracy of 90% over a separate test set
– Algorithm parameters: η = 0.3, α = 0.3

34
An Illustrative Example: Face Recognition (cont.)
• Learned hidden unit weights

http://www.cs.cmu.edu/tom/faces.html

35
4.8 Advanced Topics in Artificial Neural Networks

• 4.8.1 Alternative error functions


– Penalize large weights (weight decay): reduces the risk of overfitting

– Train on target slopes as well as values

– Minimize the cross entropy: learns a probabilistic output function (Chapter 6)

  E = − Σ_{d∈D} [ t_d log o_d + (1 − t_d) log(1 − o_d) ]

where the target value t_d ∈ {0, 1} and o_d is the probabilistic output of the learning system for training example d = <x_d, t_d> ∈ D, approximating the target function f'(x_d) = p( f(x_d) = 1 ).
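A small sketch of this cross-entropy error computation (the clipping guard against log(0) is an added assumption):

```python
import numpy as np

def cross_entropy(t, o, eps=1e-12):
    """Cross-entropy error for probabilistic outputs o in (0, 1) and targets t in {0, 1}."""
    o = np.clip(np.asarray(o, dtype=float), eps, 1.0 - eps)   # avoid log(0)
    t = np.asarray(t, dtype=float)
    return -np.sum(t * np.log(o) + (1.0 - t) * np.log(1.0 - o))
```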
36
Advanced Topics in Artificial Neural Networks
(cont.)
• 4.8.2 Alternative error minimization procedures
– Weight-update method: two choices
• Direction: choose a direction in which to alter the current weight vector (e.g., the negation of the gradient in Backpropagation)
• Distance: choose a distance to move (e.g., the learning rate η)

– Examples: line search method, conjugate gradient method

37
Advanced Topics in Artificial Neural Networks
(cont.)
• 4.8.3 Recurrent networks

(a) (b) (c)

(a) Feedforward network


(b) Recurrent network
(c) Recurrent network unfolded
in time
38
Advanced Topics in Artificial Neural Networks
(cont.)
• 4.8.4 Dynamically modifying network structure
– To improve generalization accuracy and training efficiency
– Cascade-Correlation algorithm (Fahlman and Lebiere 1990)
• Start with the simplest possible network and add complexity
– Optimal brain damage (LeCun et al. 1990)
• Start with a complex network and prune it as certain connections are found to be inessential.

39
Unit – 2
Machine Learning

Chapter – 5
Evaluating Hypotheses

40
Overview

• Motivation
• Estimating Hypothesis Accuracy
• Basics of Sampling Theory
• A General Approach for Deriving Confidence Intervals
• Difference in Error of Two Hypotheses
• Comparing Learning Algorithms

41
5.1 Motivation

• The importance of evaluating hypotheses


– To understand whether to use the hypothesis
– An integral component of many learning algorithms (e.g., post-pruning in decision trees)
• Two difficulties
– Bias in the estimate
→ choose test examples independently of the training examples
– Variance in the estimate
→ use a larger set of test examples
• Subjects in this chapter
– Evaluating learned hypotheses
– Comparing the accuracy of two hypotheses
– Comparing the accuracy of two learning algorithms

42
5.2 Estimating Hypothesis Accuracy
• Notations
X : the space of all possible instances
D : a probability distribution over X
f : the target function
h : a hypothesis
S : a sample of n instances drawn from D
• 5.2.1 Sample error and true error
– Sample error: the fraction of S that h misclassifies

  error_S(h) = (1/n) Σ_{x∈S} δ( f(x), h(x) ),   where δ( f(x), h(x) ) = 1 if f(x) ≠ h(x), and 0 otherwise

– True error: the probability that h misclassifies an instance drawn at random from D

  error_D(h) = Pr_{x∈D}[ f(x) ≠ h(x) ]

"While we want to know error_D(h), we can measure only error_S(h)."
→ "How good an estimate of error_D(h) is provided by error_S(h)?"
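A minimal sketch of computing error_S(h); the callable interface for h and f is an illustrative assumption:

```python
def sample_error(h, f, S):
    """error_S(h): the fraction of the sample S that hypothesis h misclassifies."""
    S = list(S)
    return sum(1 for x in S if h(x) != f(x)) / len(S)
```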

43
Estimating Hypothesis Accuracy (cont.)

• 5.2.2 Confidence interval for a discrete-valued hypothesis

If S contains n examples (n ≥ 30) drawn independently of h and of each other, and error_S(h) = r/n, then with approximately N% probability error_D(h) lies in the interval

  error_S(h) ± z_N sqrt( error_S(h)(1 − error_S(h)) / n )

Requirements
- discrete-valued hypothesis
- S drawn randomly from D
- the data independent of the hypothesis
Recommendation
- n ≥ 30 and error_S(h) not too close to 0 or 1, or
- n · error_S(h)(1 − error_S(h)) ≥ 5
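A small sketch of this interval computation; the function and dictionary names are illustrative, and the z_N values are those of Table 5.1 later in this chapter:

```python
import math

# Two-sided z_N constants from Table 5.1
Z_N = {0.50: 0.67, 0.68: 1.00, 0.80: 1.28, 0.90: 1.64,
       0.95: 1.96, 0.98: 2.33, 0.99: 2.58}

def error_confidence_interval(error_s, n, confidence=0.95):
    """Approximate N% confidence interval for error_D(h), given error_S(h) over n examples."""
    half_width = Z_N[confidence] * math.sqrt(error_s * (1 - error_s) / n)
    return error_s - half_width, error_s + half_width

# e.g. error_confidence_interval(0.3, 40) -> roughly (0.16, 0.44), as in the later example
```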
44
5.3 Basics of Sampling Theory

• 5.3.1 Error estimation and estimating binomial proportions

– Measuring the probability that h misclassifies: repeating the experiment with different randomly drawn samples S_i gives a random variable error_{S_i}(h)
– error_{S_i}(h) follows a Binomial distribution

Coin toss analogy: "the probability of observing r heads" ↔ "the probability that h misclassifies"

• Toss a worn and bent coin n times; the probability of heads is p
• Heads turn up r times
• Correspondence: p ↔ error_D(h),  p̂ = r/n = error_S(h)

45
Basics of Sampling Theory (cont.)

5.3.2 Binomial distribution

• Y : a random variable that can take on two values, e.g., 0 or 1
• p : the probability that Y = 1 on any single trial
• Y1, Y2, ⋯, Yn : a sequence of i.i.d. random variables distributed as Y
• R = Σ_{i=1}^{n} Yi : the number of trials among the n independent experiments for which Yi = 1

The probability that R takes on a specific value r:

  Pr(R = r) = [ n! / ( r! (n − r)! ) ] p^r (1 − p)^(n−r)
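A one-line sketch of this Binomial probability in Python:

```python
from math import comb

def binomial_pmf(r, n, p):
    """Pr(R = r): the probability of exactly r successes (e.g., misclassifications) in n trials."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)
```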

46
Basics of Sampling Theory (cont.)
• 5.3.3 Mean and variance

Expected value (mean):   E[Y] = Σ_i y_i Pr(Y = y_i)

Variance:   Var[Y] = E[ (Y − E[Y])^2 ]
  "how far the random variable is expected to vary from its mean value"

Standard deviation:   σ_Y = sqrt( Var[Y] ) = sqrt( E[ (Y − E[Y])^2 ] )

※ In the case of a Binomial distribution

Expected value (mean):   E[Y] = np
Variance:   Var[Y] = np(1 − p)
Standard deviation:   σ_Y = sqrt( np(1 − p) )

47
Basics of Sampling Theory (cont.)

• 5.3.4 Estimators, bias, and variance

– An estimator is a random variable used to estimate a true value we do not know
  [Example] error_S(h) estimates the true error error_D(h)
– Estimation bias = E[Y] − p
  : the difference between the expected value of the estimator and the true value
– Unbiased estimator: a Y for which E[Y] − p = 0
  • error_S(h) is an unbiased estimator of error_D(h)
– Variance: the smaller, the better

  σ_{error_S(h)} = σ_r / n = sqrt( p(1 − p) / n ) ≈ sqrt( error_S(h)(1 − error_S(h)) / n )

Two quick remarks
- S and h must be chosen independently
- Don't confuse "estimation bias" with "inductive bias"

48
Basics of Sampling Theory (cont.)

• Example
– 12 errors on a sample of 40 randomly drawn test examples

  p̂ = error_S(h) = r/n = 12/40 = 0.3

  σ_r^2 = np(1 − p) ≈ n p̂(1 − p̂) = 40 · 0.3 · (1 − 0.3) = 8.4,   so σ_r ≈ 2.9

  σ_{error_S(h)} = σ_r / n ≈ 2.9 / 40 ≈ 0.07

49
Basics of Sampling Theory (cont.)

• Example
– 300 errors on a sample of 1000 randomly drawn test examples

  p̂ = error_S(h) = r/n = 300/1000 = 0.3

  σ_r^2 = np(1 − p) ≈ n p̂(1 − p̂) = 1000 · 0.3 · (1 − 0.3) = 210,   so σ_r ≈ 14.5

  σ_{error_S(h)} = σ_r / n ≈ 14.5 / 1000 ≈ 0.0145

– As σ_{error_S(h)} gets smaller, the confidence interval gets narrower at the same confidence level

50
Basics of Sampling Theory (cont.)

• Normal distribution
– A bell-shaped distribution specified by its mean μ and standard deviation σ
– Central limit theorem (see Section 5.4.1)

  "A Binomial distribution can be approximated by a Normal distribution"

Normal distribution:   X ~ N(μ, σ^2)

Probability density function:   p(x) = ( 1 / sqrt(2πσ^2) ) exp( −(1/2) ((x − μ)/σ)^2 )

Cumulative distribution:   Pr(a ≤ X ≤ b) = ∫_a^b p(x) dx

Expected value, variance, and standard deviation:   E[X] = μ,  Var[X] = σ^2,  σ_X = σ

51
Basics of Sampling Theory (cont.)

• Normal distribution
– Table 5.1 for the standard Normal distribution (μ = 0, σ = 1)
  : the constant z_N scales the interval about the mean that contains N% of the probability

Table 5.1
Confidence level N%   50%   68%   80%   90%   95%   98%   99%
z_N                   0.67  1.00  1.28  1.64  1.96  2.33  2.58

52
Basics of Sampling Theory (cont.)

• 5.3.5 Confidence intervals

– N% confidence interval
  : an interval that is expected with probability N% to contain p
– Confidence interval for the mean μ of a Normal distribution, estimated by y:   y ± z_N σ

Obtaining confidence intervals for error_D(h)

- error_S(h) follows a Binomial distribution with mean μ = error_D(h) and standard deviation σ ≈ sqrt( error_S(h)(1 − error_S(h)) / n )
- For large n, this Binomial distribution is closely approximated by a Normal distribution
- Find the N% confidence interval for estimating the mean of that Normal distribution:

  error_S(h) ± z_N sqrt( error_S(h)(1 − error_S(h)) / n )     (valid when n ≥ 30, or np(1 − p) ≥ 5)

– Two approximations involved

• error_D(h) approximated by error_S(h) when computing σ
• the Binomial distribution approximated by a Normal distribution
53
Basics of Sampling Theory (cont.)

• 5.3.6 Two-sided and one-sided bounds


– A two-sided bound specifies both a lower and an upper bound
– A one-sided bound specifies only one of them

"What is the probability that error_D(h) is at most U?" → a one-sided bound

A 100(1 − α)% two-sided confidence interval : α is the probability that the correct value lies outside the interval, with α/2 below the lower bound and α/2 above the upper bound
→ the same upper bound therefore gives a one-sided 100(1 − α/2)% confidence interval

54
Basics of Sampling Theory (cont.)

• Example
– 12 errors on a sample of 40 randomly drawn test examples

  error_S(h) = 0.3,   σ_{error_S(h)} ≈ 0.07

(Two-sided) 95% confidence interval (α = 0.05):

  error_S(h) ± z_N sqrt( error_S(h)(1 − error_S(h)) / n ) = 0.3 ± 1.96 · 0.07 = 0.3 ± 0.14

(One-sided) 97.5% confidence interval (α/2 = 0.025):

  error_D(h) is at most 0.3 + 0.14 = 0.44

→ No assertion about the lower bound!

55
Next…

• A general approach for deriving confidence intervals


— Central Limit Theorem
• Difference in errors of two hypotheses
— Hypothesis testing
• Comparing learning algorithms
— Paired t-tests
— Practical considerations

56
Estimation and Hypothesis Testing

Central Limit Theorem

[Diagram: a population parameter (e.g., the mean μ, or a hypothesis such as H0: μ = μ0) is inferred from a sample statistic (e.g., the sample mean X̄) — by an estimator in estimation, or by a test statistic in hypothesis testing.]
57
5.4 A General Approach for
Deriving Confidence Intervals
• General process for estimating a parameter

(1) Identify the underlying population parameter p to be estimated   (e.g., error_D(h))
(2) Define the estimator Y   (e.g., error_S(h))
  : a minimum-variance, unbiased estimator is desirable
(3) Determine the probability distribution D_Y that governs Y
  : its mean μ and variance σ^2
(4) Determine the N% confidence interval from D_Y
  : lower bound and upper bound

  μ ± z_N σ ;   for a discrete-valued hypothesis:   error_S(h) ± z_N sqrt( error_S(h)(1 − error_S(h)) / n )
58
A General Approach for
Deriving Confidence Intervals (cont.)
• 5.4.1 Central limit theorem

Consider a set of i.i.d. random variables Y1 ⋯ Yn governed by an arbitrary distribution with mean μ and finite variance σ^2. Define the sample mean

  Ȳn = (1/n) Σ_{i=1}^{n} Yi

Then as n → ∞, the distribution governing

  Z_n = ( Ȳn − μ ) / ( σ / sqrt(n) )

approaches a Normal distribution with zero mean and standard deviation equal to 1, i.e., Z_n ~ N(0, 1). Equivalently, for large n,

  Ȳn ≈ N( μ, σ^2 / n ),   even though the individual Y1 ⋯ Yn follow an unknown distribution with mean μ and variance σ^2.
59
A General Approach for
Deriving Confidence Intervals (cont.)
• Why is the central limit theorem useful?
– We know the (approximately Normal) distribution of the sample mean Ȳ even when we do not know the distribution of the individual Yi
– We can determine the mean and variance of Ȳ from the mean μ and variance σ^2 of the Yi

[Illustration: drawing K repeated samples, each of n values Y_k1 ⋯ Y_kn from the unknown distribution with mean μ and variance σ^2, gives sample means Ȳ_1 ⋯ Ȳ_K that are approximately Normally distributed with mean(Ȳ_k) ≈ μ and variance(Ȳ_k) ≈ σ^2 / n.]

→ Then we can compute a confidence interval:   μ ± z_N σ_Ȳ
60
5.5 A Difference in Error of Two Hypotheses
• Parameter to be estimated
: the difference between the true errors of two hypotheses, h1 and h2
: parameter   d = error_D(h1) − error_D(h2)

• CASE 1 : Tested on independent test samples

– Hypothesis h1 : sample S1 containing n1 examples
– Hypothesis h2 : sample S2 containing n2 examples

: estimator   d̂ = error_S1(h1) − error_S2(h2)

– d̂ gives an unbiased estimate of d, i.e., E[d̂] = d:

  E[d̂] − d = E[ error_S1(h1) − error_S2(h2) ] − ( error_D(h1) − error_D(h2) )
           = [ E[error_S1(h1)] − error_D(h1) ] − [ E[error_S2(h2)] − error_D(h2) ]
           = [ error_D(h1) − error_D(h1) ] − [ error_D(h2) − error_D(h2) ]
           = 0

61
A Difference in Error of Two Hypotheses (cont.)

• CASE 1 : Tested on independent test samples (continued)

– For large n1, n2 (≥ 30), the distribution of d̂ is approximately Normal,
  since error_S1(h1) and error_S2(h2) are each approximately Normal, and the difference of two Normal random variables is also Normal
– Mean of d̂:

  E[d̂] = E[ error_S1(h1) − error_S2(h2) ] = μ1 − μ2 = d

  (recall: E[aX + bY] = aE[X] + bE[Y])

– Variance of d̂:

  σ_d̂^2 ≈ error_S1(h1)(1 − error_S1(h1)) / n1 + error_S2(h2)(1 − error_S2(h2)) / n2

  (recall: Var[aX + bY] = a^2 Var[X] + b^2 Var[Y] if X and Y are independent random variables)

– Confidence interval for d (when n1 and n2 are large enough):

  d̂ ± z_N sqrt( error_S1(h1)(1 − error_S1(h1)) / n1 + error_S2(h2)(1 − error_S2(h2)) / n2 )

62
A Difference in Error of Two Hypotheses (cont.)

• CASE 2 : Tested on a single test sample


: Hypothesis h1 & Hypothesis h2 are tested on a single test sample S.

: estimator   d̂ = error_S(h1) − error_S(h2)

– Confidence interval for d:

  d̂ ± z_N sqrt( [ error_S(h1)(1 − error_S(h1)) + error_S(h2)(1 − error_S(h2)) ] / n )

– Usually a smaller variance than in CASE 1:

  using a single sample S eliminates the variance due to random differences in the compositions of S1 and S2.

63
A Difference in Error of Two Hypotheses (cont.)

• 5.5.1 Hypothesis testing


: testing a specific conjecture about a parameter, rather than computing a confidence interval for it

– Situation
• independent samples S1 and S2, with |S1| = |S2| = 100
• error_S1(h1) = 0.30
• error_S2(h2) = 0.20
• d̂ = 0.10

"What is the probability that error_D(h1) > error_D(h2), given that d̂ = 0.10?"
= "What is the probability that d > 0, given that d̂ = 0.10?"

• This is the probability that d̂ falls into the one-sided interval d̂ < μ_d̂ + 0.10, i.e., d̂ < μ_d̂ + z_N σ_d̂ with z_N σ_d̂ = 0.10

  σ_d̂ ≈ sqrt( 0.3(1 − 0.3)/100 + 0.2(1 − 0.2)/100 ) ≈ 0.061,   so z_N = 0.10 / 0.061 ≈ 1.64

• 1.64 is the two-sided z_N for a 90% confidence interval, which corresponds to a one-sided bound of 95%

– Test result
  The probability that error_D(h1) > error_D(h2) is approximately 95%:
  • accept the conjecture with 95% confidence
  • equivalently, reject H0 at the 5% significance level

64
5.6 Comparing Learning Algorithms

Which of LA and LB is the better learning method on average for learning


some particular target function f ?
• Comparing the performance of two algorithms (LA, LB)
: the expected value of the difference in their errors, where LA(S) is the hypothesis output by learning method LA on the training sample S:

  E_{S⊂D}[ error_D(LA(S)) − error_D(LB(S)) ]

(S : training data sampled from the underlying distribution D)

• Practical ways of algorithm comparison given limited sample, D0, of data

(1) Partitioning the data set into a training set and a test set
: the limited sample D0 is divided into a training set S0 and a test set T0, and the two methods are compared by

  error_T0(LA(S0)) − error_T0(LB(S0))

65
Comparing Learning Algorithms (cont.)

(2) Repeated partitioning and averaging : the k-fold method

: D0 is repeatedly divided into disjoint training and test sets, and the mean δ̄ of the test-set error differences from these different experiments is calculated.

[Partition diagram: D0 is split into k disjoint test sets T1, ⋯, Tk of size at least 30; on fold i the training set is Si = D0 − Ti.]

  δ_i = error_Ti(LA(Si)) − error_Ti(LB(Si)),   δ̄ = (1/k) Σ_{i=1}^{k} δ_i

δ̄ returned from the above is an estimate of

  E_{S⊂D0}[ error_D(LA(S)) − error_D(LB(S)) ]

which is again an approximation of

  E_{S⊂D}[ error_D(LA(S)) − error_D(LB(S)) ]

66
Comparing Learning Algorithms (cont.)

(2) Repeated partitioning and averaging : the k-fold method (continued)

• The approximate N% confidence interval for estimating the expected difference using δ̄ is

  δ̄ ± t_{N,k−1} s_δ̄,   where   s_δ̄ = sqrt( (1 / (k(k − 1))) Σ_{i=1}^{k} (δ_i − δ̄)^2 )

- N : confidence level
- k − 1 : degrees of freedom, the number of independent random events producing the values of the random variable δ̄
- As k → ∞, t_{N,k−1} approaches the constant z_N

Paired tests : tests in which the hypotheses are evaluated over identical samples.
Paired tests generally produce tighter confidence intervals than tests on separate data samples, because they eliminate differences due to the random makeup of the samples.
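A sketch of this paired-t confidence interval; using scipy to obtain t_{N,k-1} is an assumption (any t table would serve):

```python
import math
from scipy import stats

def paired_t_interval(deltas, confidence=0.95):
    """delta_bar +/- t_{N,k-1} * s_delta_bar for k paired error differences delta_i."""
    k = len(deltas)
    d_bar = sum(deltas) / k
    s = math.sqrt(sum((d - d_bar) ** 2 for d in deltas) / (k * (k - 1)))
    t = stats.t.ppf((1 + confidence) / 2, df=k - 1)   # two-sided t_{N,k-1} constant
    return d_bar - t * s, d_bar + t * s
```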
67
Comparing Learning Algorithms (cont.)

• 5.6.1 Paired t-test


: statistical justification of the comparison procedure above

– Estimation procedure
(1) Given i.i.d. random variables Y1, …, Yk
(2) Estimate the mean μ of the distribution governing the Yi
(3) Estimator:   Ȳ = (1/k) Σ_{i=1}^{k} Yi

68
Comparing Learning Algorithms (cont.)

- The t-test applies to the special case of this estimation procedure in which each Yi follows a Normal distribution. It provides the interval

  μ = E[Yi] ∈ Ȳ ± t_{N,k−1} s_Ȳ,   where   s_Ȳ = sqrt( (1 / (k(k − 1))) Σ_{i=1}^{k} (Yi − Ȳ)^2 )

  and t_{N,k−1} is a constant characterizing the t distribution, just as z_N characterizes a Normal distribution.

- In the preceding comparison of learning algorithms, if on each iteration a new random training set Si and a new random test set Ti were drawn from the underlying instance distribution (instead of from the fixed sample D0), then each

  δ_i = error_Ti(h_A) − error_Ti(h_B),   with |Ti| ≥ 30,

  would follow an approximately Normal distribution, and the t-test result would thus give

  E_{S⊂D}[ error_D(LA(S)) − error_D(LB(S)) ] ∈ δ̄ ± t_{N,k−1} s_δ̄

69
Comparing Learning Algorithms (cont.)

• 5.6.2 Practical considerations

The paired t-test does not strictly justify the confidence interval discussed above, because δ̄ is evaluated on the limited data D0 using a partitioning method rather than on independent samples. Nevertheless, this confidence interval provides a good basis for experimental comparisons of learning methods when data is limited.

(1) k-fold method
• k is limited by the size of D0
• test sets are drawn independently of one another (each example is tested exactly once)

(2) Randomized method
: randomly choose a test set of at least 30 examples from D0 and use the remaining examples for training
• the procedure can be repeated indefinitely (k can be arbitrarily large → narrower confidence interval)
• however, the test sets are not independent
70
