ML Unit - 2
Machine Learning
Chapter – 4
Artificial Neural Networks
Overview
• Introduction
• Perceptrons
• Multilayer networks and Backpropagation Algorithm
• Remarks on the Backpropagation Algorithm
• An Illustrative Example: Face Recognition
• Advanced Topics in Artificial Neural Networks
4.1 Introduction
• Human brain
– densely interconnected network of approximately 10^11 neurons, each connected to about 10^4 others (neuron switching time: approx. 10^-3 sec.)
• ANN (Artificial Neural Network)
– motivated by this kind of highly parallel processing
Introduction (cont.)
• Properties of ANNs
– Many neuron-like threshold switching units
– Many weighted interconnections among units
– Highly parallel, distributed processing
4.3 Appropriate problems for neural network learning
• Appropriate problems for neural network learning
– Input is high-dimensional discrete or real-valued (e.g. raw sensor input)
– Output is discrete or real-valued
– Output is a vector of values
– Possibly noisy data
– Long training times are acceptable
– Fast evaluation of the learned function is required
– It is not important for humans to understand the learned weights
• Examples
– Speech phoneme recognition
– Image classification
– Financial prediction
4.4 Perceptrons
• 4.4.1 Perceptron
– Given real-valued inputs x1 through xn, the output o(x1,…,xn) computed by the perceptron is
  o(x1,…,xn) = 1 if w0 + w1x1 + … + wnxn > 0, and −1 otherwise
  where each wi is a real-valued weight and w0 is the threshold term
Perceptrons (cont.)
<Test Results> (weights w0 = −0.3, w1 = 0.5, w2 = 0.5)
x1  x2  Σ wi xi  output
0   0   −0.3     −1
0   1    0.2      1
1   0    0.2      1
1   1    0.7      1
Decision surface: −0.3 + 0.5 x1 + 0.5 x2 = 0 (the perceptron represents the OR function)
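The thresholded-output rule can be sketched in a few lines of code; the function name and layout are illustrative (not from the slides), using the weights from the test results (w0 = −0.3, w1 = w2 = 0.5):

```python
# Minimal sketch of a perceptron's output computation:
# o(x1,...,xn) = 1 if w0 + w1*x1 + ... + wn*xn > 0, else -1.
def perceptron(x, w):
    # w[0] is the threshold weight w0; w[1:] pair with the inputs
    s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if s > 0 else -1

w = [-0.3, 0.5, 0.5]
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(x, w))
```

With these weights only (0, 0) falls on the negative side of the decision surface, matching the OR function.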
Perceptrons (cont.)
<Training Examples> (the XOR function, which is not linearly separable)
x1  x2  output
0   0   −1
0   1    1
1   0    1
1   1   −1
4.4.2 Perceptron training rule
wi ← wi + Δwi
where Δwi = η (t − o) xi
Where:
– t = c(x) is the target value
– o is the perceptron output
– η is a small constant (e.g., 0.1) called the learning rate
Can prove it will converge
– If the training data is linearly separable
– and η is sufficiently small
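A hedged sketch of applying the training rule wi ← wi + η(t − o)xi repeatedly over the examples; the function name and the OR training set are illustrative choices, not from the slides:

```python
# Sketch of the perceptron training rule: w_i <- w_i + eta*(t - o)*x_i,
# cycling through the (linearly separable) OR examples until convergence.
def train_perceptron(examples, eta=0.1, epochs=50):
    n = len(examples[0][0])
    w = [0.0] * (n + 1)                      # w[0] is the threshold weight
    for _ in range(epochs):
        for x, t in examples:
            s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            o = 1 if s > 0 else -1
            w[0] += eta * (t - o)            # the threshold input x0 is 1
            for i, xi in enumerate(x, start=1):
                w[i] += eta * (t - o) * xi
    return w

or_data = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = train_perceptron(or_data)
```

Because OR is linearly separable and η is small, the rule settles on weights that classify all four examples correctly, as the convergence claim on the slide states.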
4.4.3 Gradient Descent and Delta rule
• Delta rule: Δwi = η Σd (td − od) xid
– derived by gradient descent on the training error E(w) = ½ Σd∈D (td − od)², where od is the linear unit output w · xd
4.4.3 Gradient Descent and Delta rule
• Hypothesis space
4.4.3 Gradient Descent and Delta rule
• Derivation of gradient descent
– Because the error surface contains only a single global minimum, this algorithm will converge to a weight vector with minimum error, regardless of whether the training examples are linearly separable, given that a sufficiently small η is used.
4.4.3 Gradient Descent and Delta rule
• Gradient descent and delta rule
– Search through the space of possible network weights, iteratively
reducing the error E between the training example target values and the
network outputs
4.4.3 Gradient Descent and Delta rule
Stochastic gradient descent (i.e. incremental mode) can sometimes avoid falling into local minima because it follows the gradient of the per-example error Ed rather than the gradient of the overall error E.
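A minimal sketch of the incremental (stochastic) delta rule for a single linear unit, assuming a constant input x0 = 1 for the threshold weight; the toy data, drawn from t = 1 + 2x, is an illustrative assumption:

```python
# Incremental delta rule for a linear unit: after each example,
# w_i <- w_i + eta*(t - o)*x_i where o = w . x (no thresholding).
def stochastic_delta_rule(examples, eta=0.05, epochs=200):
    n = len(examples[0][0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        for x, t in examples:
            xs = (1.0,) + tuple(x)           # prepend constant input x0 = 1
            o = sum(wi * xi for wi, xi in zip(w, xs))
            for i, xi in enumerate(xs):
                w[i] += eta * (t - o) * xi
    return w

# illustrative samples of the target t = 1 + 2*x
data = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0), ((3.0,), 7.0)]
w = stochastic_delta_rule(data)
```

Since the data are exactly linear, the weights approach w0 ≈ 1, w1 ≈ 2, illustrating convergence toward the minimum-error weight vector.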
Perceptrons (cont.)
• Summary
– Perceptron training rule
• Perfectly classifies the training data
• Converges, provided the training examples are linearly separable
4.5 Multilayer Networks and Backpropagation
Algorithm
• Speech recognition example of multilayer networks learned
by the backpropagation algorithm
• Highly nonlinear decision surfaces
Multilayer Networks and Backpropagation
Algorithm (cont.)
• Sigmoid threshold unit
– What type of unit should serve as the basis for multilayer networks?
• Perceptron : not differentiable -> can't use gradient descent
• Linear unit : multiple layers of linear units -> still produce only linear functions
• Sigmoid unit : a differentiable threshold function
Multilayer Networks and Backpropagation
Algorithm (cont.)
• Sigmoid threshold unit
– σ(y) = 1 / (1 + e^−y)
– Interesting property: dσ(y)/dy = σ(y) (1 − σ(y))
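The derivative property dσ/dy = σ(y)(1 − σ(y)) can be checked numerically; this small sketch compares a central-difference estimate against the closed form at an arbitrary point:

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

y, h = 0.5, 1e-6
numeric = (sigmoid(y + h) - sigmoid(y - h)) / (2 * h)   # central difference
analytic = sigmoid(y) * (1.0 - sigmoid(y))              # the slide's property
print(abs(numeric - analytic))                          # close to zero
```

This differentiability is exactly what lets gradient descent be applied to multilayer networks of sigmoid units.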
Multilayer Networks and Backpropagation
Algorithm (cont.)
• Backpropagation algorithm
– Two layered feedforward networks
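A compact, illustrative sketch of stochastic-gradient backpropagation for a two-layer (one hidden layer) feedforward network of sigmoid units. The layer sizes, learning rate, and OR training set are assumptions for the example, not values from the slides:

```python
import math
import random

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def train(data, n_hidden=3, eta=0.5, epochs=3000, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    # hidden and output weight vectors; index 0 of each is the bias weight
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
           for _ in range(n_hidden)]
    w_o = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    for _ in range(epochs):
        for x, t in data:
            xs = (1.0,) + tuple(x)
            h = [sigmoid(sum(w * v for w, v in zip(wh, xs))) for wh in w_h]
            hs = (1.0,) + tuple(h)
            o = sigmoid(sum(w * v for w, v in zip(w_o, hs)))
            delta_o = o * (1 - o) * (t - o)               # output error term
            delta_h = [h[j] * (1 - h[j]) * w_o[j + 1] * delta_o
                       for j in range(n_hidden)]          # hidden error terms
            for j in range(n_hidden + 1):                 # update output weights
                w_o[j] += eta * delta_o * hs[j]
            for j in range(n_hidden):                     # update hidden weights
                for i in range(n_in + 1):
                    w_h[j][i] += eta * delta_h[j] * xs[i]
    return w_h, w_o

def predict(x, w_h, w_o):
    xs = (1.0,) + tuple(x)
    h = [sigmoid(sum(w * v for w, v in zip(wh, xs))) for wh in w_h]
    hs = (1.0,) + tuple(h)
    return sigmoid(sum(w * v for w, v in zip(w_o, hs)))

# toy training set: the OR function with 0/1 targets
or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w_h, w_o = train(or_data)
```

Errors are first propagated backward (the δ terms), then every weight is moved in the direction that reduces the squared output error, one training example at a time.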
Multilayer Networks and Backpropagation
Algorithm (cont.)
• Adding momentum
– Another weight-update rule is possible: make the update on the nth iteration depend partially on the update during the (n−1)th iteration,
  Δwji(n) = η δj xji + α Δwji(n−1), where 0 ≤ α < 1 is the momentum
Multilayer Networks and Backpropagation
Algorithm (cont.)
• ALVINN (again..)
- Uses backpropagation algorithm
- Two layered feedforward network
• 960 neural network inputs
• 4 hidden units
• 30 output units
Remarks on Backpropagation Algorithm (cont.)
– Continuous functions:
• Every bounded continuous function can be approximated with arbitrarily small error by a network with two layers of units [Cybenko 1989; Hornik et al. 1989]
– Arbitrary functions:
• Any function can be approximated to arbitrary accuracy by a network with three layers of units [Cybenko 1988]
Remarks on Backpropagation Algorithm (cont.)
— Inductive bias
• One can roughly characterize it as smooth interpolation
between data points. (Consider a speech recognition
example!)
Remarks on Backpropagation Algorithm (cont.)
• Generalization, overfitting, and stopping criterion
– Termination condition
• Training until the error E falls below some predetermined threshold risks the overfitting problem
Remarks on Backpropagation Algorithm (cont.)
• Overfitting in ANNs
4.7 An Illustrative Example: Face Recognition
• Neural nets for face recognition
– Training images : 20 different persons with 32 images per person.
– (120x128 resolution → 30x32 pixel image)
– After 260 training images, the network achieves an accuracy of 90% over a separate test set.
– Algorithm parameters : η=0.3, α=0.3
An Illustrative Example: Face Recognition (cont.)
• Learned hidden unit weights
http://www.cs.cmu.edu/tom/faces.html
4.8 Advanced Topics in Artificial Neural Networks
• Cross-entropy error
  E ≡ − Σd∈D [ td log od + (1 − td) log(1 − od) ]
where the target value td ∈ {0,1} and od is the probabilistic output from the learning system, approximating the target function f′(xd) = p( f(xd) = 1 ), where d = ⟨xd, td⟩, d ∈ D
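A short sketch of computing this cross-entropy error; the target/output values below are made-up illustrative numbers:

```python
import math

# Cross entropy: E = -sum over d of [t*log(o) + (1-t)*log(1-o)],
# for targets t in {0,1} and probabilistic outputs o in (0,1).
def cross_entropy(targets, outputs):
    return -sum(t * math.log(o) + (1 - t) * math.log(1 - o)
                for t, o in zip(targets, outputs))

err = cross_entropy([1, 0, 1], [0.9, 0.2, 0.8])
```

The error shrinks as each output od moves toward its target td, which is why minimizing E pushes the network toward outputting the target probabilities.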
Advanced Topics in Artificial Neural Networks
(cont.)
• 4.8.2 Alternative error minimization procedures
– Weight-update method
• Direction : choosing a direction in which to alter the current weight
vector (ex: the negation of the gradient in Backpropagation)
• Distance : choosing a distance to move
(ex: the learning rate η)
Advanced Topics in Artificial Neural Networks
(cont.)
• 4.8.3 Recurrent networks
Unit – 2
Machine Learning
Chapter – 5
Evaluating Hypotheses
Overview
• Motivation
• Estimating Hypothesis Accuracy
• Basics of Sampling Theory
• A General Approach for Deriving Confidence Intervals
• Difference in Error of Two Hypotheses
• Comparing Learning Algorithms
5.1 Motivation
5.2 Estimating Hypothesis Accuracy
• Notations
– X : space of all possible instances
– D : probability distribution over X
– f : target function
– h : hypothesis
– S : sample of n instances drawn from D
• 5.2.1 Sample error and true error
– Sample error : the fraction of S that h misclassifies
  errorS(h) ≡ (1/n) Σx∈S δ( f(x) ≠ h(x) )
– True error : the probability that h misclassifies an instance drawn at random from D
  errorD(h) ≡ Prx∈D [ f(x) ≠ h(x) ]
"While we want to know errorD(h), we can measure only errorS(h)."
"How good an estimate of errorD(h) is provided by errorS(h)?"
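errorS(h) can be computed directly as the misclassification fraction; the threshold classifier and labels below are hypothetical, for illustration only:

```python
# Sketch of errorS(h): the fraction of sample S that h misclassifies.
def sample_error(h, sample, f_values):
    n = len(sample)
    return sum(1 for x, fx in zip(sample, f_values) if h(x) != fx) / n

h = lambda x: 1 if x >= 0 else -1      # hypothetical threshold classifier
S = [-2, -1, 0.5, 1, 2]                # sample instances
f = [-1, -1, -1, 1, 1]                 # true labels f(x); h errs on x = 0.5
print(sample_error(h, S, f))           # 1 error out of 5 -> 0.2
```

The true error errorD(h), by contrast, cannot be computed this way because D and f over all of X are unknown; it can only be estimated from such samples.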
Estimating Hypothesis Accuracy (cont.)
• Requirements
– Discrete-valued hypothesis h
– Sample S of n examples drawn randomly from D, independently of h
• Recommendation: with n ≥ 30, an approximately 95% confidence interval for errorD(h) is
  errorS(h) ± 1.96 √( errorS(h)(1 − errorS(h)) / n )
Basics of Sampling Theory (cont.)
• 5.3.3 Mean and variance
– Expected value (or mean) : E[Y] ≡ Σi=1..n yi Pr(Y = yi)
– Variance : Var(Y) ≡ E[ (Y − E[Y])² ]
– Standard deviation : σY ≡ √Var(Y)
Basics of Sampling Theory (cont.)
• Example
– 12 errors on a sample of 40 randomly drawn test examples
  p̂ = errorS(h) = r / n = 12 / 40 = 0.3
  σr² = n p(1 − p) ≈ n p̂(1 − p̂) = 40 × 0.3 × (1 − 0.3) = 8.4
  σr ≈ 2.9
  σerrorS(h) = σr / n ≈ 2.9 / 40 ≈ 0.07
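The arithmetic of this example can be reproduced directly; a small sketch with illustrative variable names:

```python
import math

# Example values: r = 12 errors out of n = 40 test examples.
r, n = 12, 40
p_hat = r / n                                   # errorS(h) = 0.3
sigma_r = math.sqrt(n * p_hat * (1 - p_hat))    # sqrt(8.4), about 2.9
sigma_error = sigma_r / n                       # about 0.07
```

The same three lines with r = 300, n = 1000 reproduce the next example, where the larger sample makes the standard deviation of the estimate much smaller.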
Basics of Sampling Theory (cont.)
• Example
– 300 errors on a sample of 1000 randomly drawn test examples
  p̂ = errorS(h) = r / n = 300 / 1000 = 0.3
  σr² = n p(1 − p) ≈ n p̂(1 − p̂) = 1000 × 0.3 × (1 − 0.3) = 210
  σr ≈ 14.5
  σerrorS(h) = σr / n ≈ 14.5 / 1000 ≈ 0.0145
– As σerrorS(h) gets smaller, the confidence interval gets narrower at the same probability level
Basics of Sampling Theory (cont.)
• Normal distribution
– A bell-shaped distribution specified by its mean μ and standard deviation σ
– Central limit theorem (See Section 5.4.1)
• Normal distribution of a random variable X:
– Probability density function : p(x) = (1 / √(2πσ²)) e^( −(1/2)((x−μ)/σ)² )
– Cumulative distribution : Pr(a ≤ X ≤ b) = ∫ab p(x) dx
– Expected value E[X] = μ, variance Var(X) = σ², standard deviation σX = σ
Basics of Sampling Theory (cont.)
• Normal distribution
– Table for the Standard Normal distribution (μ = 0, σ = 1) : Table 5.1
– zN : the size of the interval about the mean that contains N% of the probability
Table 5.1
Confidence level N% :  50%   68%   80%   90%   95%   98%   99%
zN :                   0.67  1.00  1.28  1.64  1.96  2.33  2.58
Basics of Sampling Theory (cont.)
• Example
– 12 errors on a sample of 40 randomly drawn test examples
  errorS(h) = 0.3,  σerrorS(h) ≈ 0.07
– (Two-sided) 95% confidence interval:
  errorS(h) ± zN √( errorS(h)(1 − errorS(h)) / n ) = 0.3 ± 1.96 × 0.07 = 0.3 ± 0.14
– (One-sided) interval with the same upper bound has confidence 97.5%
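A sketch of computing the two-sided interval errorS(h) ± zN √(errorS(h)(1 − errorS(h))/n), applied to the example values (z95 = 1.96); the function name is illustrative:

```python
import math

# Two-sided N% confidence interval for errorD(h) given the sample error.
def confidence_interval(error_s, n, z_n=1.96):
    half_width = z_n * math.sqrt(error_s * (1 - error_s) / n)
    return error_s - half_width, error_s + half_width

lo, hi = confidence_interval(0.3, 40)   # the example: 0.3 +/- about 0.14
```

Passing a different zN from Table 5.1 (e.g. 2.58 for 99%) widens the interval, trading confidence against precision.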
Estimation and Hypothesis Testing
• Estimation : a population parameter (e.g., μ) is estimated by an estimator (e.g., X̄) computed from sample statistics
• Hypothesis testing : a hypothesis about the parameter (e.g., H0 : μ = μ0) is tested using a statistic computed from the sample
5.4 A General Approach for
Deriving Confidence Intervals
• General process for estimating a parameter
(1) Identify the underlying population parameter p to be estimated (e.g., p : errorD(h))
(2) Define the estimator Y (e.g., Y : errorS(h)); a minimum-variance, unbiased estimator is desirable
(3) Determine the probability distribution that governs the estimator Y
• By the central limit theorem, for independent, identically distributed Y1, …, Yn with mean μ and variance σ²:
  Ȳn ≡ (1/n) Σi=1..n Yi ~ approximately N(μ, σ²/n)
  Zn ≡ (Ȳn − μ) / (σ/√n) → N(0, 1) as n → ∞
A General Approach for
Deriving Confidence Intervals (cont.)
• Why is the central limit theorem useful?
– We can know the distribution of the sample mean Ȳ even when we do not know the distribution of the individual Yi
– If we repeatedly drew samples k = 1, …, K of size n, Yk1, …, Ykn, the sample means Ȳk would be distributed approximately as N(μ, σ²/n)
• Estimator for the difference in error of two hypotheses:
  d̂ ≡ errorS1(h1) − errorS2(h2), estimating d ≡ errorD(h1) − errorD(h2)
– d̂ gives an unbiased estimate of d, i.e. E[d̂] = d :
  E[d̂] − d = E{errorS1(h1) − errorS2(h2)} − {errorD(h1) − errorD(h2)}
            = [E{errorS1(h1)} − errorD(h1)] − [E{errorS2(h2)} − errorD(h2)]
            = [errorD(h1) − errorD(h1)] − [errorD(h2) − errorD(h2)]
            = 0
A Difference in Error of Two Hypotheses (cont.)
– Situation
• Independent samples S1 & S2 ( |S1| = |S2| = 100 )
• errorS1(h1) = 0.30
• errorS2(h2) = 0.20
• d̂ = 0.10
"What is the probability that errorD(h1) > errorD(h2), given d̂ = 0.10?"
equivalently, "What is the probability that d > 0, given d̂ = 0.10?"
• This holds unless d̂ overestimates d by more than 0.10, i.e. unless d̂ falls outside the one-sided interval d < d̂ + 0.10; here 0.10 ≈ 1.64 σd̂, and a one-sided bound of 1.64 σd̂ corresponds to 95% confidence
– Test result
• Therefore, the probability that errorD(h1) > errorD(h2) is approximately 95%
• Accept H0 with 95% confidence
• Reject H0 at the 5% significance level
5.6 Comparing Learning Algorithms
• We wish to estimate ES⊂D [ errorD(LA(S)) − errorD(LB(S)) ]
  (S : training data sampled from the underlying distribution D)
(1) Partitioning the data set into a training set & a test set
: a limited sample D0 is divided into a training set S0 and a test set T0, and we measure
  errorT0(LA(S0)) − errorT0(LB(S0))
  D0 → S0, T0
Comparing Learning Algorithms (cont.)
D0 is partitioned into k disjoint test sets T1, …, Tk; on iteration i, the algorithms are trained on Si = D0 − Ti and tested on Ti:
(1) S1, T1
(2) S2, T2
(3) S3, T3
…..
(k) Sk, Tk
The mean δ̄ of the k differences δi ≡ errorTi(LA(Si)) − errorTi(LB(Si)) returned by this procedure is an estimate of
  ES⊂D0 [ errorD(LA(S)) − errorD(LB(S)) ],
which is in turn an approximation of
  ES⊂D [ errorD(LA(S)) − errorD(LB(S)) ]
Comparing Learning Algorithms (cont.)
The approximate N% confidence interval for this estimate is δ̄ ± tN,k−1 sδ̄ , where
  sδ̄ ≡ √( (1 / (k(k−1))) Σi=1..k (δi − δ̄)² )
and tN,k−1 is a constant characterizing the t distribution, as zN characterizes a Normal distribution.
Comparing Learning Algorithms (cont.)
• 5.6.2 Practical considerations
(1) k-fold method
• k is limited by the size of D0
• Test sets are drawn independently (each example is tested exactly once)
(2) Randomized method
• Randomly choose a test set of at least 30 examples from D0, use the remaining examples for training, and repeat
• The paired t-test does not strictly justify the confidence interval discussed above, because it is evaluated on the limited data D0 with a partitioning method. Nevertheless, this confidence interval provides a good basis for experimental comparisons of learning methods.
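The k-fold paired estimate δ̄ ± tN,k−1 sδ̄ can be sketched as follows; the per-fold differences are made-up illustrative numbers, and the tN,k−1 value would come from a t-distribution table:

```python
import math

# Given per-fold differences delta_i = errorTi(LA(Si)) - errorTi(LB(Si)),
# compute their mean and the standard error
# s = sqrt( (1/(k(k-1))) * sum (delta_i - mean)^2 ).
def paired_estimate(deltas):
    k = len(deltas)
    mean = sum(deltas) / k
    s = math.sqrt(sum((d - mean) ** 2 for d in deltas) / (k * (k - 1)))
    return mean, s

deltas = [0.05, 0.02, 0.04, 0.00, 0.03]   # hypothetical per-fold differences
mean, s = paired_estimate(deltas)
# N% confidence interval: mean +/- t_{N,k-1} * s
```

If the resulting interval excludes zero, the observed difference between LA and LB is unlikely to be an artifact of the particular folds.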