Foundations of Machine Learning: Module 6: Neural Network
Sudeshna Sarkar
IIT Kharagpur
Introduction
• Inspired by the human brain.
• Some NNs are models of biological neural networks
• The human brain contains a massively interconnected
net of about $10^{10}$ to $10^{11}$ neurons (cortical cells)
– Massive parallelism – large number of simple
processing units
– Connectionism – highly interconnected
– Associative distributed memory
• Pattern and strength of synaptic connections
Neuron
Neural Unit
ANNs
• ANNs incorporate the two fundamental components of
biological neural nets:
1. Nodes - Neurons
2. Weights - Synapses
Perceptrons
• Basic unit in a neural network: Linear separator
– N inputs, x1 ... xn
– Weights for each input, w1 ... wn
– A bias input x0 (constant) and associated weight w0
– Weighted sum of inputs: $y = \sum_{i=0}^{n} w_i x_i$
– A threshold function: output 1 if $y > 0$, $-1$ if $y \le 0$
[Figure: perceptron unit. Inputs $x_0, x_1, \dots, x_n$ with weights $w_0, w_1, \dots, w_n$ feed a summation unit computing $y = \sum_i w_i x_i$, followed by the threshold $\varphi = 1$ if $y > 0$, $-1$ otherwise.]
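A minimal sketch of this threshold unit in Python/NumPy; the function name and the hand-picked weights are illustrative, not from the slides:

```python
import numpy as np

def perceptron_output(w, x):
    """Threshold unit: +1 if w0 + w1*x1 + ... + wn*xn > 0, else -1.
    w includes the bias weight w0; the bias input x0 = 1 is implicit."""
    y = w[0] + np.dot(w[1:], x)
    return 1 if y > 0 else -1

# Hand-picked weights, purely illustrative.
w = np.array([-0.5, 1.0, 1.0])
print(perceptron_output(w, np.array([1.0, 0.0])))   # -> 1
```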
Perceptron training rule
Update the perceptron weights for each training example as follows:
$w_i \leftarrow w_i + \Delta w_i$, where $\Delta w_i = \eta\,(y - \hat{y})\,x_i$
• If the data is linearly separable and 𝜂 is sufficiently small, it will
converge to a hypothesis that classifies all training data correctly in a
finite number of iterations
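A sketch of this training rule; the toy OR data, learning rate, and epoch count are illustrative assumptions:

```python
import numpy as np

def perceptron_train(X, y, eta=0.1, epochs=100):
    """Perceptron training rule: w_i <- w_i + eta * (y - y_hat) * x_i.
    X: (m, n) inputs; y: (m,) targets in {-1, +1}. A bias input x0 = 1 is prepended."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            y_hat = 1 if np.dot(w, xi) > 0 else -1
            w += eta * (yi - y_hat) * xi        # non-zero only when xi is misclassified
    return w

# Linearly separable toy data (Boolean OR with -1/+1 labels), illustrative only.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, 1])
print(perceptron_train(X, y))
```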
Gradient Descent
• Perceptron training rule may not converge if points are not
linearly separable
• Gradient descent changes the weights in proportion to the gradient of the
total error over all training points.
– If the data is not linearly separable, it converges to the best-fit
(minimum-error) weights.
Linear neurons
• The neuron has a real-valued output which is a weighted sum of its inputs:
$\hat{y} = \sum_i w_i x_i = \mathbf{w}^T \mathbf{x}$
• Define the error as the squared residuals summed over all training cases:
$E = \frac{1}{2} \sum_j (y_j - \hat{y}_j)^2$
• Differentiate to get error derivatives for weights:
$\dfrac{\partial E}{\partial w_i} = \frac{1}{2}\sum_{j=1..m} \dfrac{\partial \hat{y}_j}{\partial w_i}\,\dfrac{\partial E_j}{\partial \hat{y}_j} = -\sum_{j=1..m} x_{i,j}\,(y_j - \hat{y}_j)$
• The batch delta rule changes the weights in proportion to
their error derivatives summed over all training cases:
$\Delta w_i = -\eta\, \dfrac{\partial E}{\partial w_i}$
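A small sketch of the batch delta rule for a linear neuron; the toy regression data, learning rate, and epoch count are illustrative assumptions:

```python
import numpy as np

def delta_rule_batch(X, y, eta=0.01, epochs=500):
    """Batch delta rule for a linear neuron:
    E = 1/2 * sum_j (y_j - w.x_j)^2, dE/dw_i = -sum_j x_{i,j} (y_j - y_hat_j),
    and dw_i = -eta * dE/dw_i."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        y_hat = X @ w                 # predictions for all training cases
        grad = -X.T @ (y - y_hat)     # error derivatives summed over all cases
        w -= eta * grad
    return w

# Toy data generated from a known linear rule, illustrative only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0])
print(delta_rule_batch(X, y))         # should approach [2, -1]
```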
Error Surface
• The error surface lies in a space with a horizontal axis for each
weight and one vertical axis for the error.
– For a linear neuron, it is a quadratic bowl.
– Vertical cross-sections are parabolas.
– Horizontal cross-sections are ellipses.
Batch and Stochastic Learning
Batch Learning
• Steepest descent on the error surface, using the error summed over all training cases:
$E = \frac{1}{2}\sum_d (y_d - \hat{y}_d)^2$
Stochastic / Online Learning
• For each example $d$, compute the gradient of the single-case error and update immediately:
$\dfrac{\partial E_d}{\partial w_i} = \dfrac{\partial \hat{y}_d}{\partial w_i}\,\dfrac{\partial E_d}{\partial \hat{y}_d} = -x_{i,d}\,(y_d - \hat{y}_d)$
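For contrast with the batch sketch above, a sketch of the stochastic/online update, which applies the single-example gradient immediately; the learning rate and epoch count are illustrative:

```python
import numpy as np

def delta_rule_online(X, y, eta=0.05, epochs=100):
    """Stochastic / online version: after each example d, apply the single-case
    gradient dE_d/dw_i = -x_{i,d} (y_d - y_hat_d) immediately."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xd, yd in zip(X, y):
            y_hat = np.dot(w, xd)
            w += eta * (yd - y_hat) * xd
    return w
```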
Computation at Units
• Compute a 0-1 or a graded function of the
weighted sum of the inputs
• $\varphi(\cdot)$ is the activation function
[Figure: a unit with inputs $x_1, \dots, x_n$ and weights $w_1, \dots, w_n$ computing $\mathbf{w}\cdot\mathbf{x} = \sum_i w_i x_i$ and output $\varphi(\mathbf{w}\cdot\mathbf{x})$.]
Neuron Model: Logistic Unit
• Sigmoid (logistic) activation:
$\varphi(z) = \dfrac{1}{1 + e^{-z}}$, with $z = \mathbf{w}\cdot\mathbf{x}$, so $\hat{y} = \dfrac{1}{1 + e^{-\mathbf{w}\cdot\mathbf{x}}}$
• Its derivative: $\varphi'(z) = \varphi(z)\,\big(1 - \varphi(z)\big)$
• Squared error over training examples $d$:
$E = \frac{1}{2}\sum_d (y_d - \hat{y}_d)^2 = \frac{1}{2}\sum_d \big(y_d - \varphi(\mathbf{w}\cdot\mathbf{x}_d)\big)^2$
• Error derivative:
$\dfrac{\partial E}{\partial w_i} = \sum_d \dfrac{\partial E_d}{\partial \hat{y}_d}\,\dfrac{\partial \hat{y}_d}{\partial w_i} = -\sum_d (y_d - \hat{y}_d)\,\varphi'(\mathbf{w}\cdot\mathbf{x}_d)\,x_{i,d} = -\sum_d (y_d - \hat{y}_d)\,\hat{y}_d\,(1 - \hat{y}_d)\,x_{i,d}$
• Training rule: $\Delta w_i = \eta \sum_d (y_d - \hat{y}_d)\,\hat{y}_d\,(1 - \hat{y}_d)\,x_{i,d}$
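A sketch of this training rule for a single logistic unit; the OR-style toy data, learning rate, and epoch count are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_unit_train(X, y, eta=1.0, epochs=5000):
    """Gradient descent on squared error for one sigmoid unit:
    dw_i = eta * sum_d (y_d - y_hat_d) * y_hat_d * (1 - y_hat_d) * x_{i,d}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        y_hat = sigmoid(X @ w)
        w += eta * X.T @ ((y - y_hat) * y_hat * (1 - y_hat))
    return w

# Illustrative data: Boolean OR with 0/1 targets; first column is the bias input x0 = 1.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 1.0])
w = logistic_unit_train(X, y)
print(np.round(sigmoid(X @ w), 2))   # should approach [0, 1, 1, 1]
```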
Thank You
Foundations of Machine Learning
Module 6: Neural Network
Part B: Multi-layer Neural
Network
Sudeshna Sarkar
IIT Kharagpur
Limitations of Perceptrons
• Perceptrons have a monotonicity property:
If a link has positive weight, activation can only increase as the
corresponding input value increases (irrespective of other
input values)
• Can’t represent functions where input interactions can cancel
one another’s effect (e.g. XOR)
• Can represent only linearly separable functions
A solution: multiple layers
[Figure: a two-layer network with input layer ($x_1, x_2$), hidden layer ($z_1, z_2$), and output layer ($y$), shown next to the decision regions it produces in the ($x_1, x_2$) plane.]
Power/Expressiveness of Multilayer
Networks
• Can represent interactions among inputs
• Two-layer networks can represent any Boolean
function, and any continuous function (to within a
tolerance), as long as the number of hidden units is
sufficient and appropriate activation functions are used
• Learning algorithms exist, but weaker guarantees
than perceptron learning algorithms
Multilayer Network
[Figure: a multilayer network with an input layer, a first hidden layer, a second hidden layer, and an output layer; inputs enter on the left and outputs leave on the right.]
Two-layer back-propagation neural network
[Figure: two-layer back-propagation network. Input signals $x_1, \dots, x_n$ flow forward through hidden units $j$ (input-to-hidden weights $w_{ij}$) to output units $k$ (hidden-to-output weights $w_{jk}$), producing outputs $y_1, \dots, y_{n_2}$; error signals propagate backward from the output layer.]
The back-propagation training algorithm
• Step 1: Initialisation
Set all the weights and threshold levels of the network to
random numbers uniformly distributed inside a small range
[Figure: example network with inputs $x_1, x_2$, hidden units $z_1, z_2$ (weights $v_{ij}$, biases $v_{01}, v_{02}$), and output $y_1$ (weights $w_{11}, w_{21}$, bias $w_{01}$); layers labeled x (input), z (hidden), y (output).]
Backprop
• Initialization
– Set all the weights and threshold levels of the network to
random numbers uniformly distributed inside a small
range
• Forward computing:
– Apply an input vector x to input units
– Compute activation/output vector z on hidden layer
$z_j = \varphi\!\left(\sum_i v_{ij} x_i\right)$
– Compute the output vector y on output layer
$y_k = \varphi\!\left(\sum_j w_{jk} z_j\right)$
y is the result of the computation.
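A sketch of this forward computation for one hidden layer, assuming sigmoid activations; the layer sizes and random weights are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, V, W):
    """z_j = phi(sum_i v_ij * x_i) on the hidden layer,
    y_k = phi(sum_j w_jk * z_j) on the output layer."""
    z = sigmoid(V.T @ x)    # V[i, j] = v_ij: weight from input i to hidden unit j
    y = sigmoid(W.T @ z)    # W[j, k] = w_jk: weight from hidden unit j to output k
    return z, y

# Illustrative sizes and random weights: 3 inputs, 4 hidden units, 2 outputs.
rng = np.random.default_rng(0)
V = rng.uniform(-0.5, 0.5, size=(3, 4))
W = rng.uniform(-0.5, 0.5, size=(4, 2))
z, y = forward(np.array([1.0, 0.0, 1.0]), V, W)
print(z, y)
```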
Learning for BP Nets
• Update of weights in W (between output and hidden layers):
– delta rule
• Not applicable to updating V (between input and hidden)
– we don’t know the target values for the hidden units $z_1, z_2, \dots, z_p$
• Solution: Propagate errors at output units to hidden units to
drive the update of weights in V (again by delta rule)
(error BACKPROPAGATION learning)
• Error backpropagation can be continued downward if the net
has more than one hidden layer.
• How to compute errors on hidden units?
Derivation
• For one output neuron, the error function is
$E = \frac{1}{2}(y - \hat{y})^2$
• For each unit $j$, the output $o_j$ is defined as
$o_j = \varphi(net_j) = \varphi\!\left(\sum_{k=1}^{n} w_{kj} o_k\right)$
The input $net_j$ to a neuron is the weighted sum of outputs $o_k$
of the previous $n$ neurons.
• Finding the derivative of the error:
$\dfrac{\partial E}{\partial w_{ij}} = \dfrac{\partial E}{\partial o_j}\,\dfrac{\partial o_j}{\partial net_j}\,\dfrac{\partial net_j}{\partial w_{ij}}$
Derivation
• Finding the derivative of the error:
$\dfrac{\partial E}{\partial w_{ij}} = \dfrac{\partial E}{\partial o_j}\,\dfrac{\partial o_j}{\partial net_j}\,\dfrac{\partial net_j}{\partial w_{ij}}$
$\dfrac{\partial net_j}{\partial w_{ij}} = \dfrac{\partial}{\partial w_{ij}}\left(\sum_{k=1}^{n} w_{kj} o_k\right) = o_i$
$\dfrac{\partial o_j}{\partial net_j} = \dfrac{\partial}{\partial net_j}\varphi(net_j) = \varphi(net_j)\,\big(1 - \varphi(net_j)\big)$
Consider $E$ as a function of the inputs of all neurons $Z = \{z_1, z_2, \dots\}$
receiving input from neuron $j$:
$\dfrac{\partial E(o_j)}{\partial o_j} = \dfrac{\partial E(net_{z_1}, net_{z_2}, \dots)}{\partial o_j}$
Taking the total derivative with respect to $o_j$, a recursive expression for
the derivative is obtained:
$\dfrac{\partial E}{\partial o_j} = \sum_{l} \dfrac{\partial E}{\partial net_{z_l}}\,\dfrac{\partial net_{z_l}}{\partial o_j} = \sum_{l} \dfrac{\partial E}{\partial o_l}\,\dfrac{\partial o_l}{\partial net_{z_l}}\,w_{j z_l}$
• Therefore, the derivative with respect to 𝑜𝑗 can be calculated if all the derivatives
with respect to the outputs 𝑜𝑧𝑙 of the next layer – the one closer to the output
neuron – are known.
• Putting it all together:
$\dfrac{\partial E}{\partial w_{ij}} = \delta_j\, o_i$
with
$\delta_j = \dfrac{\partial E}{\partial o_j}\,\dfrac{\partial o_j}{\partial net_j} = \begin{cases} (o_j - t_j)\, o_j\, (1 - o_j) & \text{if } j \text{ is an output neuron} \\ \left(\sum_{z_l \in Z} \delta_{z_l}\, w_{j z_l}\right) o_j\, (1 - o_j) & \text{if } j \text{ is an inner neuron} \end{cases}$
To update the weight $w_{ij}$ using gradient descent, one must choose a learning rate $\eta$:
$\Delta w_{ij} = -\eta\, \dfrac{\partial E}{\partial w_{ij}}$
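A small sketch of the two cases of $\delta_j$ above, written for NumPy arrays; the function names are illustrative:

```python
import numpy as np

def delta_output(o, t):
    """delta_j for output neurons: (o_j - t_j) * o_j * (1 - o_j), element-wise."""
    return (o - t) * o * (1 - o)

def delta_hidden(o_j, deltas_next, w_j_next):
    """delta_j for an inner neuron: (sum_l delta_{z_l} * w_{j z_l}) * o_j * (1 - o_j)."""
    return np.dot(deltas_next, w_j_next) * o_j * (1 - o_j)

# The gradient for weight w_ij is then dE/dw_ij = delta_j * o_i,
# and the gradient-descent update is dw_ij = -eta * delta_j * o_i.
```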
Backpropagation Algorithm
Thank You
Foundations of Machine Learning
Module 6: Neural Network
Part C: Neural Network and
Backpropagation Algorithm
Sudeshna Sarkar
IIT Kharagpur
Single layer Perceptron
• Single-layer perceptrons learn linear decision boundaries
[Figure: two scatter plots in the $(x_1, x_2)$ plane, with x = class I ($y = 1$) and o = class II ($y = -1$): one configuration that a single line can separate, and one XOR-style configuration that it cannot.]
Boolean OR (truth table):
  x1  x2  output
  0   0   0
  0   1   1
  1   0   1
  1   1   1
[Figure: OR is linearly separable in the $(x_1, x_2)$ plane; a single threshold unit with bias weight $w_0 = -0.5$ and weights $w_1 = w_2 = 1$ computes it.]
Boolean AND (truth table):
  x1  x2  output
  0   0   0
  0   1   0
  1   0   0
  1   1   1
[Figure: AND is linearly separable; a single threshold unit with bias weight $w_0 = -1.5$ and weights $w_1 = w_2 = 1$ computes it.]
Boolean XOR (truth table):
  x1  x2  output
  0   0   0
  0   1   1
  1   0   1
  1   1   0
[Figure: the four XOR points in the $(x_1, x_2)$ plane cannot be separated by a single line, so no single-layer perceptron computes XOR.]
Boolean XOR with a hidden layer
[Figure: a two-layer network computing the XOR truth table above. Hidden unit $h_1$ is the OR unit (bias $-0.5$, weights $1, 1$) and hidden unit $h_2$ is the AND unit (bias $-1.5$, weights $1, 1$); the output unit has bias $-0.5$ and weights $+1$ from $h_1$ and $-1$ from $h_2$, so it fires exactly when OR is true and AND is false.]
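A sketch of this XOR network using the threshold units and weights shown above:

```python
def step(y):
    return 1 if y > 0 else 0

def xor_net(x1, x2):
    """XOR built from the OR and AND threshold units above:
    h1 = OR(x1, x2), h2 = AND(x1, x2), output fires when h1 and not h2."""
    h1 = step(-0.5 + x1 + x2)        # OR unit:  bias -0.5, weights 1, 1
    h2 = step(-1.5 + x1 + x2)        # AND unit: bias -1.5, weights 1, 1
    return step(-0.5 + h1 - h2)      # output:   bias -0.5, weights +1, -1

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))   # prints the XOR truth table
```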
Representation Capability of NNs
• Single-layer nets have limited representation power (the linear
separability problem). Multi-layer nets (or nets with non-linear
hidden units) can overcome the linear inseparability
problem.
• Every Boolean function can be represented by a network with
a single hidden layer.
• Every bounded continuous function can be approximated with
arbitrarily small error by a network with one hidden layer
• Any function can be approximated to arbitrary accuracy by a
network with two hidden layers.
Multilayer Network
[Figure: a multilayer network with an input layer, a first hidden layer, a second hidden layer, and an output layer; inputs enter on the left and outputs leave on the right.]
Two-layer back-propagation neural network
[Figure: two-layer back-propagation network. Input signals $x_1, \dots, x_n$ flow forward through hidden units $j$ (input-to-hidden weights $w_{ij}$) to output units $k$ (hidden-to-output weights $w_{jk}$), producing outputs $y_1, \dots, y_{n_2}$; error signals propagate backward from the output layer.]
Derivation
• For one output neuron, the error function is
$E = \frac{1}{2}(y - o)^2$
• For each unit $j$, the output $o_j$ is defined as
$o_j = \varphi(net_j) = \varphi\!\left(\sum_{k=1}^{n} w_{kj} o_k\right)$
The input $net_j$ to a neuron is the weighted sum of outputs $o_k$
of the previous $n$ neurons.
• Finding the derivative of the error:
$\dfrac{\partial E}{\partial w_{ij}} = \dfrac{\partial E}{\partial o_j}\,\dfrac{\partial o_j}{\partial net_j}\,\dfrac{\partial net_j}{\partial w_{ij}}$
For one output neuron, the error function is $E = \frac{1}{2}(y - o)^2$.
For each unit $j$, the output $o_j$ is defined as
$o_j = \varphi(net_j) = \varphi\!\left(\sum_{k=1}^{n} w_{kj} o_k\right)$
$\dfrac{\partial E}{\partial w_{ij}} = \dfrac{\partial E}{\partial o_j}\,\dfrac{\partial o_j}{\partial net_j}\,\dfrac{\partial net_j}{\partial w_{ij}} = \left(\sum_{l} \dfrac{\partial E}{\partial o_l}\,\dfrac{\partial o_l}{\partial net_{z_l}}\,w_{j z_l}\right)\varphi(net_j)\,\big(1 - \varphi(net_j)\big)\, o_i$
$\dfrac{\partial E}{\partial w_{ij}} = \delta_j\, o_i$
with
$\delta_j = \dfrac{\partial E}{\partial o_j}\,\dfrac{\partial o_j}{\partial net_j} = \begin{cases} (o_j - y_j)\, o_j\, (1 - o_j) & \text{if } j \text{ is an output neuron} \\ \left(\sum_{z_l \in Z} \delta_{z_l}\, w_{j z_l}\right) o_j\, (1 - o_j) & \text{if } j \text{ is an inner neuron} \end{cases}$
To update the weight $w_{ij}$ using gradient descent, one must choose a learning rate $\eta$:
$\Delta w_{ij} = -\eta\, \dfrac{\partial E}{\partial w_{ij}}$
Backpropagation Algorithm
Initialize all weights to small random numbers.
Until satisfied, do
– For each training example, do
• Input the training example to the network and compute the network
outputs
• For each output unit $k$:
$\delta_k \leftarrow o_k (1 - o_k)(y_k - o_k)$
• For each hidden unit $h$:
$\delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{hk}\, \delta_k$
• Update each network weight $w_{ij}$:
$w_{ij} \leftarrow w_{ij} + \eta\, \delta_j\, x_{ij}$, where $x_{ij}$ is the input from unit $i$ into unit $j$
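A sketch of the complete algorithm for one hidden layer, assuming sigmoid units, stochastic updates, and explicit bias inputs; the XOR demo, layer size, learning rate, and epoch count are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_backprop(X, Y, n_hidden=3, eta=0.5, epochs=10000, seed=0):
    """Stochastic backpropagation for one hidden layer, following the algorithm above.
    A constant bias input of 1 is appended to the input and hidden layers."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((len(X), 1))])                     # bias input
    V = rng.uniform(-0.5, 0.5, size=(Xb.shape[1], n_hidden))      # input -> hidden
    W = rng.uniform(-0.5, 0.5, size=(n_hidden + 1, Y.shape[1]))   # hidden(+bias) -> output
    for _ in range(epochs):
        for x, y in zip(Xb, Y):
            z = np.append(sigmoid(x @ V), 1.0)     # hidden activations + bias unit
            o = sigmoid(z @ W)                     # network outputs
            delta_o = o * (1 - o) * (y - o)        # output-unit errors
            delta_h = z[:-1] * (1 - z[:-1]) * (W[:-1] @ delta_o)   # hidden-unit errors
            W += eta * np.outer(z, delta_o)        # w_jk <- w_jk + eta * delta_k * z_j
            V += eta * np.outer(x, delta_h)        # v_ij <- v_ij + eta * delta_j * x_i
    return V, W

# Illustrative use on XOR.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)
V, W = train_backprop(X, Y)
Xb = np.hstack([X, np.ones((4, 1))])
Z = np.hstack([sigmoid(Xb @ V), np.ones((4, 1))])
# Should approach [0, 1, 1, 0]; squared-error training can occasionally stall
# in a local minimum, in which case a different random seed helps.
print(np.round(sigmoid(Z @ W), 2))
```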
Stopping
1. Fixed maximum number of epochs: most naïve
2. Keep track of the training and validation error
curves; stop when the validation error starts to rise (early stopping).
Overfitting in ANNs
Local Minima
Sudeshna Sarkar
IIT Kharagpur
Deep Learning
• Breakthrough results in
– Image classification
– Speech Recognition
– Machine Translation
– Multi-modal learning
Deep Neural Network
• Problem: training networks with many hidden layers
doesn’t work very well
• Local minima; very slow training if initialized with zero
weights.
• Diffusion of gradients (the error signal becomes very weak by the time it reaches the early layers).
Hierarchical Representation
• Hierarchical Representation help represent complex
functions.
• NLP: character -> word -> chunk -> clause -> sentence
• Image: pixel -> edge -> texton -> motif -> part -> object
• Deep Learning: learning a hierarchy of internal
representations
• Learned internal representation at the hidden layers
(trainable feature extractor)
• Feature learning
[Figure: Input -> Trainable Feature Extractor -> Trainable Feature Extractor -> ... -> Trainable Classifier -> Output]
Unsupervised Pre-training
• We will use greedy, layer-wise pre-training (a conceptual sketch follows below)
• Train one layer at a time
• Fix the parameters of the previous hidden layers
• Previous layers are viewed as feature extraction
• Find hidden-unit features that are more common in training
inputs than in random inputs
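A conceptual sketch of the greedy, layer-wise procedure; `train_autoencoder_layer` and `encode` are hypothetical placeholders for whatever single-layer unsupervised trainer is used (autoencoder, RBM, etc.) and are not defined in these slides:

```python
def greedy_layerwise_pretrain(X, layer_sizes, train_autoencoder_layer, encode):
    """Train one hidden layer at a time; earlier layers are frozen and act as
    fixed feature extractors producing the inputs for the next layer."""
    layers, data = [], X
    for size in layer_sizes:
        params = train_autoencoder_layer(data, n_hidden=size)  # unsupervised fit
        layers.append(params)                                  # freeze this layer
        data = encode(params, data)       # its hidden features feed the next layer
    return layers
```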
Tuning the Classifier
• After pre-training of the layers
– Add output layer
– Train the whole network using
supervised learning (Back propagation)
Deep neural network
• Feed forward NN
• Stacked Autoencoders (multilayer neural net
with target output = input)
• Stacked restricted Boltzmann machine
• Convolutional Neural Network
A Deep Architecture: Multi-Layer Perceptron
[Figure: a multi-layer perceptron stack. The output layer $y$ predicts a supervised target; hidden layers $h^3$, $h^2$, $h^1$ learn more abstract representations as you head up; the input layer $x$ holds the raw sensory inputs.]
A Neural Network
• Training : Back
Propagation of Error
– Calculate total error at
the top
– Calculate contributions
to error at each step
going backwards
[Figure: a network with an input layer, a hidden layer, and an output layer.]
Training of neural networks
• Forward Propagation :
– Sum inputs, produce
activation
– feed-forward
• $\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
• $\mathrm{sigmoid}(x) = \dfrac{1}{1 + e^{-x}}$
• Rectified linear
relu(x) = max(0,x)
- Simplifies backprop
- Makes learning faster
- Makes features sparse
→ Preferred option
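The three activation functions, written out directly from the formulas above; the test values are illustrative:

```python
import numpy as np

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)       # rectified linear: max(0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(tanh(x), sigmoid(x), relu(x), sep="\n")
```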
Autoencoder
• Unlabeled training set $\{x^{(1)}, x^{(2)}, x^{(3)}, \dots\}$, $x^{(i)} \in \mathbb{R}^n$
• Set the target values to be equal to the inputs: $y^{(i)} = x^{(i)}$
• The network is trained to output the input (learn the identity function): $h_{W,b}(x) \approx x$
• The solution may be trivial!
[Figure: an autoencoder network whose hidden units $a_1, a_2, a_3$ form the compressed code between the input and its reconstruction.]
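A minimal sketch of such an autoencoder, reusing the backprop updates from earlier with the target set to the input itself; the 4-3-4 "encoder" demo, layer size, learning rate, and epoch count are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden=3, eta=0.5, epochs=10000, seed=0):
    """Autoencoder with one sigmoid hidden layer plus bias units: same backprop
    updates as before, but the target is the input itself (y^(i) = x^(i))."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((len(X), 1))])                     # bias input
    V = rng.uniform(-0.5, 0.5, size=(Xb.shape[1], n_hidden))      # encoder
    W = rng.uniform(-0.5, 0.5, size=(n_hidden + 1, X.shape[1]))   # decoder
    for _ in range(epochs):
        for x, t in zip(Xb, X):                    # target t = original input
            z = np.append(sigmoid(x @ V), 1.0)     # code + bias unit
            o = sigmoid(z @ W)                     # reconstruction
            delta_o = o * (1 - o) * (t - o)
            delta_h = z[:-1] * (1 - z[:-1]) * (W[:-1] @ delta_o)
            W += eta * np.outer(z, delta_o)
            V += eta * np.outer(x, delta_h)
    return V, W

# Illustrative 4-3-4 "encoder" problem: one-hot inputs squeezed through 3 hidden units.
X = np.eye(4)
V, W = train_autoencoder(X)
Z = np.append(sigmoid(np.hstack([X, np.ones((4, 1))]) @ V), np.ones((4, 1)), axis=1)
print(np.round(sigmoid(Z @ W), 2))   # reconstructions should be close to the inputs
```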
Autoencoders and sparsity
1. Place constraints on the network, such as limiting the
number of hidden units, to discover interesting structure
in the data.
2. Impose a sparsity constraint:
a neuron is “active” if its output value is close to 1,
and “inactive” if its output value is close to 0;
constrain the neurons to be inactive most of the time.
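One common way to implement such a sparsity constraint is a KL-divergence penalty on the average activation of each hidden unit, as in sparse-autoencoder formulations; the target sparsity `rho`, the weight `beta`, and the KL form below are standard choices, not taken from these slides:

```python
import numpy as np

def sparsity_penalty(hidden_activations, rho=0.05, beta=3.0):
    """Penalize the average activation rho_hat_j of each hidden unit for deviating
    from a small target rho; the result is added to the reconstruction error."""
    rho_hat = hidden_activations.mean(axis=0)       # average activation per hidden unit
    kl = rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))
    return beta * np.sum(kl)

# Illustrative: activations of 3 hidden units over 5 training examples.
A = np.array([[0.9, 0.1, 0.05],
              [0.8, 0.2, 0.04],
              [0.7, 0.1, 0.06],
              [0.9, 0.3, 0.05],
              [0.8, 0.2, 0.05]])
print(sparsity_penalty(A))    # large penalty for the persistently active first unit
```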
Auto-Encoders
Stacked Auto-Encoders
• Do supervised training on the last layer using final
features
• Then do supervised training on the entire network
to fine-tune all weights
Softmax output layer: $y_i = \dfrac{e^{z_i}}{\sum_j e^{z_j}}$
Convolutional Neural Networks
• A CNN consists of a number of convolutional and
subsampling layers.
• The input to a convolutional layer is an m x m x r image,
where m x m is the height and width of the image
and r is the number of channels, e.g. an RGB image
has r = 3
• A convolutional layer has k filters (or kernels)
of size n x n x q, where
• n is smaller than the dimension of the image, and
• q can either be the same as the number of
channels r or smaller, and may vary for each kernel (a small sketch follows below)
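A sketch of a single convolutional layer with these dimensions; the loop-based "valid" convolution and the example sizes are illustrative (a real implementation would also add a bias and a nonlinearity per feature map):

```python
import numpy as np

def conv_layer(image, kernels):
    """Valid 2-D convolution of an m x m x r image with k kernels of size n x n x q,
    with q == r here for simplicity; returns k feature maps of size (m-n+1) x (m-n+1)."""
    m, _, r = image.shape
    k, n, _, q = kernels.shape
    assert q == r, "kernel depth must match image channels in this sketch"
    out = np.zeros((k, m - n + 1, m - n + 1))
    for f in range(k):                       # one feature map per kernel
        for i in range(m - n + 1):
            for j in range(m - n + 1):
                patch = image[i:i + n, j:j + n, :]
                out[f, i, j] = np.sum(patch * kernels[f])
    return out

# Illustrative sizes: a 6x6 RGB image (r = 3) and k = 2 kernels of size 3x3x3.
rng = np.random.default_rng(0)
image = rng.normal(size=(6, 6, 3))
kernels = rng.normal(size=(2, 3, 3, 3))
print(conv_layer(image, kernels).shape)      # -> (2, 4, 4)
```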
Convolutional Neural Networks