EE769 - 7: Introduction to Neural Networks
Increasing nonlinearity in models
[Figure: spectrum of models, from linear models and support vector machines (fixed features) to neural networks and deep neural networks (trainable features …)]
• Network of neurons
• Somewhat like the brain
[Figure: a single neuron: inputs x1, x2, x3 are weighted by w1, w2, w3, summed (Σ) with a bias b, and passed through an activation σ]
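A minimal sketch in NumPy of the computation in the neuron diagram above; the input values, weights, and bias are made-up numbers for illustration, and the activation σ is assumed to be the sigmoid.

```python
import numpy as np

# Single artificial neuron: weighted sum of the inputs plus a bias,
# passed through an activation function sigma (here: sigmoid).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])      # inputs x1, x2, x3 (illustrative values)
w = np.array([0.1, 0.4, -0.3])      # weights w1, w2, w3 (illustrative values)
b = 0.2                             # bias

y = sigmoid(np.dot(w, x) + b)       # sigma(w1*x1 + w2*x2 + w3*x3 + b)
print(y)
```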
Activation function is the secret sauce of neural networks
• Neural network training is all about tuning weights and biases
[Figure: inputs x1, x2, x3, weighted by w1, w2, w3, feed an activation f(.) that produces the output y1]
Logistic regression can be trained using gradient descent
[Figure: inputs x1, x2, x3, weighted by w1, w2, w3, feed f(.) to produce the output y1; a loss compares y1 with the desired output (supervision)]
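A minimal sketch of this idea in NumPy: logistic regression fit by batch gradient descent on the cross-entropy loss. The synthetic data, learning rate, and iteration count are all illustrative assumptions.

```python
import numpy as np

# Logistic regression trained with batch gradient descent on a toy dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # 100 samples, 3 inputs
t = (X @ np.array([1.0, -2.0, 0.5]) > 0) * 1.0   # synthetic 0/1 targets

w, b, lr = np.zeros(3), 0.0, 0.1
for _ in range(500):
    y = 1.0 / (1.0 + np.exp(-(X @ w + b)))       # predicted probabilities
    grad_w = X.T @ (y - t) / len(t)              # gradient of cross-entropy loss w.r.t. w
    grad_b = np.mean(y - t)                      # gradient w.r.t. b
    w -= lr * grad_w                             # gradient-descent updates
    b -= lr * grad_b

print("training accuracy:", np.mean((y > 0.5) == t))
```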
We can have multiple outputs as well
[Figure: inputs x1, x2, x3 feed two units f(.) that produce the outputs y1 and y2]
Layered structure of mammalian visual cortex
Introducing a hidden layer in neural networks
[Figure: inputs x1, x2, x3 feed hidden units h11, h12, h13 through activations g(.); the hidden units feed output units f(.) that produce y1 and y2]
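A minimal sketch of one forward pass through a network like the one above, with a single hidden layer. The shapes, random weights, and the choice of sigmoid for both g and f are illustrative assumptions.

```python
import numpy as np

# One forward pass: hidden layer h = g(W1 x + b1), output y = f(W2 h + b2).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0, 2.0])         # inputs x1, x2, x3
W1 = rng.normal(size=(3, 3)) * 0.5     # weights into hidden units h11, h12, h13
b1 = np.zeros(3)
W2 = rng.normal(size=(2, 3)) * 0.5     # weights into outputs y1, y2
b2 = np.zeros(2)

h = sigmoid(W1 @ x + b1)               # hidden-layer activations
y = sigmoid(W2 @ h + b2)               # network outputs
print(y)
```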
Importance of hidden layers
• First hidden layer extracts features
• Second hidden layer extracts features of hidden features
• …
• Output layer gives the desired output
[Figure: toy 2-D datasets of + and − points; one panel shows the decision boundary of a single sigmoid, another the boundary of a network with sigmoid hidden layers and a sigmoid output]
Visualizing what hidden layers are doing
Universal approximation theorem
• W2 σ(W1 x) can approximate, to within any error ε on a compact interval, any smooth function f(x), provided that
– the size of the W's is arbitrary (i.e., the hidden layer can be as wide as needed)
– σ is also smooth but not a polynomial
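A minimal sketch of the W2 σ(W1 x) form in NumPy: approximating sin(x) on a compact interval with one hidden tanh layer. Using random hidden weights (plus a bias) and fitting only W2 by least squares is an illustrative shortcut, not part of the theorem's statement.

```python
import numpy as np

# Approximate a smooth function (sin) on [-3, 3] with W2 * sigma(W1 x + b1).
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200).reshape(-1, 1)
f = np.sin(x)

H = 50                                        # number of hidden units
W1 = rng.normal(size=(1, H)) * 3.0            # random hidden weights
b1 = rng.normal(size=H) * 3.0                 # random hidden biases
Phi = np.tanh(x @ W1 + b1)                    # sigma(W1 x + b1), shape (200, H)

W2, *_ = np.linalg.lstsq(Phi, f, rcond=None)  # fit output weights by least squares
error = np.max(np.abs(Phi @ W2 - f))          # worst-case error on the interval
print("max approximation error:", error)
```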
Step function divides the input space into two halves, 0 and 1
• In a single neuron, the step function is a linear binary classifier
• The weights and biases determine where the step will be in n dimensions
• But, as we shall see later, it gives little information about how to change the weights if we make a mistake
• So, we need a smoother version of a step function
• Enter: the sigmoid function
Types of activation functions
• Step: original concept behind classification and region bifurcation. Not used anymore
• Sigmoid and tanh: trainable approximations of the step function
• ReLU: currently preferred due to fast convergence
• Softmax: currently preferred for the output of a classification net. Generalized sigmoid
• Linear: good for modeling a range in the output of a regression net
Formulas for activation functions
• Step: (sign(x) + 1) / 2
• Sigmoid: 1 / (1 + e^(−x))
• Tanh: (e^x − e^(−x)) / (e^x + e^(−x))
• ReLU: max(0, x)
• Softmax: e^(x_j) / Σ_i e^(x_i)
• Linear: x
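A minimal NumPy translation of the formulas above (a sketch: x is assumed to be a 1-D array, and softmax subtracts the max for numerical stability).

```python
import numpy as np

def step(x):    return (np.sign(x) + 1) / 2        # (sign(x) + 1) / 2
def sigmoid(x): return 1 / (1 + np.exp(-x))        # 1 / (1 + e^-x)
def tanh(x):    return np.tanh(x)                  # (e^x - e^-x) / (e^x + e^-x)
def relu(x):    return np.maximum(0, x)            # max(0, x)
def linear(x):  return x                           # identity

def softmax(x):
    e = np.exp(x - np.max(x))                      # shift by max for stability
    return e / np.sum(e)                           # e^x_j / sum_i e^x_i

x = np.array([-2.0, 0.0, 3.0])
print(step(x), sigmoid(x), relu(x), softmax(x))
```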
The sigmoid function is a smoother step function
[Figure: network with two hidden layers: inputs x1, x2, x3 feed hidden units h11, h12, h13 through g(.); these feed hidden units h21, h22, h23 through g(.); these feed f(.) units producing outputs y1 and y2]
E.g.
• The derivative d f(x)/dx is the rate of change of f(x) with x
• It is zero when the function is flat (horizontal), such as at a minimum or maximum of f(x)
• It is positive when f(x) is sloping up, and negative when f(x) is sloping down
• To move towards the maximum, take a small step in the direction of the derivative
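A minimal numeric sketch of these points, using an assumed example function f(x) = −(x − 1)² (maximum at x = 1) and a central-difference approximation to the derivative.

```python
# Numerical derivative of f, and small steps in its direction to climb to the maximum.
f = lambda x: -(x - 1.0) ** 2
df = lambda x, h=1e-5: (f(x + h) - f(x - h)) / (2 * h)   # central difference

x = -0.5
for _ in range(100):
    x += 0.1 * df(x)          # step in the direction of the derivative
print(x, df(x))               # x approaches 1, where the derivative is ~0
```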
Gradient of a function of a vector
• Derivative with respect to each dimension, holding the other dimensions constant: ∇f = [∂f/∂x1, ∂f/∂x2]
• At a minimum or a maximum the gradient is the zero vector: the function is flat in every direction
[Figure: surface plot of f(x1, x2); original image source unknown]
Gradient of a function of a vector
• The gradient gives a direction for moving towards the minimum
• Then take a small step against the gradient; for example: x ← x − η ∇f(x), for a small step size η
[Figure: surface plot of f(x1, x2)]
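A minimal sketch of that update in NumPy, on an assumed example function f(x1, x2) = (x1 − 1)² + 2(x2 + 2)², with an assumed step size η = 0.1.

```python
import numpy as np

# Gradient descent: repeatedly step against the gradient of f.
def grad_f(x):
    return np.array([2 * (x[0] - 1), 4 * (x[1] + 2)])   # gradient of the example f

x = np.array([5.0, 5.0])       # starting point
eta = 0.1                      # step size (learning rate)
for _ in range(200):
    x = x - eta * grad_f(x)    # x <- x - eta * grad f(x)
print(x)                       # approaches the minimum at (1, -2)
```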
Backpropagation makes use of the chain rule of derivatives
• Variable names: Z_k is the output of the k-th linear op; A_k is the output of the k-th nonlinearity
• Forward pass: A2 = g(W2 f(W1 x + b1) + b2), and the loss l compares A2 with the target
• Chain rule: ∂l/∂W1 = (∂l/∂A2)(∂A2/∂Z2)(∂Z2/∂A1)(∂A1/∂Z1)(∂Z1/∂W1)
• The term ∂Z2/∂A1 is W2, and ∂A1/∂Z1 is the local derivative of the activation function, etc.
[Figure: computation graph: x_i multiplied by W1 and added to b1 gives Z1; f(Z1) gives A1; A1 multiplied by W2 and added to b2 gives Z2; g(Z2) gives A2; A2 and the target t_i give the loss l]
1. Make a forward pass and store partial derivatives
2. During the backward pass, multiply partial derivatives
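A minimal sketch of those two steps for the graph above, in NumPy. The activation choices (tanh for f, sigmoid for g), the squared-error loss, and all sizes and values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x  = rng.normal(size=3);  t  = np.array([1.0])       # input x_i and target t_i
W1 = rng.normal(size=(4, 3));  b1 = np.zeros(4)
W2 = rng.normal(size=(1, 4));  b2 = np.zeros(1)

# 1. Forward pass: store the intermediate values needed for local derivatives
Z1 = W1 @ x + b1
A1 = np.tanh(Z1)                       # f
Z2 = W2 @ A1 + b2
A2 = 1 / (1 + np.exp(-Z2))             # g
loss = 0.5 * np.sum((A2 - t) ** 2)     # l

# 2. Backward pass: multiply partial derivatives along the graph (chain rule)
dl_dA2 = A2 - t
dl_dZ2 = dl_dA2 * A2 * (1 - A2)        # local derivative of sigmoid g
dl_dW2 = np.outer(dl_dZ2, A1);  dl_db2 = dl_dZ2
dl_dA1 = W2.T @ dl_dZ2                 # dZ2/dA1 = W2
dl_dZ1 = dl_dA1 * (1 - A1 ** 2)        # local derivative of tanh f
dl_dW1 = np.outer(dl_dZ1, x);   dl_db1 = dl_dZ1
print(loss, dl_dW1.shape, dl_dW2.shape)
```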
Vector-valued functions and Jacobians
• We often deal with functions that give multiple outputs
• Let F(x): ℝⁿ → ℝᵐ; the Jacobian is the m×n matrix of partial derivatives J_ij = ∂F_i/∂x_j
• Thinking in terms of a vector of functions can make the representation less cumbersome and the computations more efficient
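A minimal sketch with an assumed example F(x1, x2) = [x1·x2, sin(x1)]: its analytic Jacobian checked against a finite-difference approximation.

```python
import numpy as np

def F(x):
    return np.array([x[0] * x[1], np.sin(x[0])])      # two outputs, two inputs

def numerical_jacobian(F, x, h=1e-6):
    m, n = len(F(x)), len(x)
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n); e[j] = h
        J[:, j] = (F(x + e) - F(x - e)) / (2 * h)      # column j holds dF/dx_j
    return J

x = np.array([1.0, 2.0])
print(numerical_jacobian(F, x))
print(np.array([[x[1], x[0]], [np.cos(x[0]), 0.0]]))  # analytic Jacobian
```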
E.g.
• Let …
• Assuming …
• The minimum is at: …
• For any x, the perfect step would be: …
Hessian matrix: H_ij = ∂²f / (∂x_i ∂x_j)
[Figure: surface plot of f(x1, x2)]
• If all eigenvalues of a Hessian matrix are positive, then the function is convex
• Then …
• And, …
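A minimal sketch of the convexity check, using an assumed quadratic f(x1, x2) = x1² + 3 x2² + x1 x2, whose Hessian is the constant matrix below.

```python
import numpy as np

# Hessian of f(x1, x2) = x1^2 + 3*x2^2 + x1*x2: H[i, j] = d^2 f / (dx_i dx_j)
H = np.array([[2.0, 1.0],
              [1.0, 6.0]])
print(np.linalg.eigvalsh(H))   # both eigenvalues are positive => f is convex
```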
Saddle points, Hessian and long local furrows
• Some variables may have reached a local minimum while others have not
• Some weights may have almost zero gradient
• At least some eigenvalues may not be negative
[Figure: surface with a saddle point, a local minimum, and a local maximum labelled]
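A minimal sketch with an assumed example f(x1, x2) = x1² − x2²: at the origin the gradient is zero, but the Hessian eigenvalues have mixed signs, so it is a saddle point rather than a minimum or maximum.

```python
import numpy as np

# f(x1, x2) = x1^2 - x2^2 at the origin: zero gradient, mixed-sign Hessian eigenvalues.
x = np.array([0.0, 0.0])
grad = np.array([2 * x[0], -2 * x[1]])       # gradient is the zero vector here
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])                  # Hessian of f
print(grad, np.linalg.eigvalsh(H))           # eigenvalues: -2 and +2 -> saddle point
```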