
EE769 Intro to ML

Introduction to neural networks


Amit Sethi
Faculty member, IIT Bombay
Learning objectives

• Explain how adding a hidden layer increases the modeling power of neural networks over linear models
• Write the mathematical expression of a neural network
Increasing nonlinearity in models
• Linear models: input x → linear classifier or regressor f(x) → loss; L(σ(W x), t)
• Support vector machines: input → fixed features → linear classifier or regressor → loss; L(Σ_i t_i a_i k(x, x_i), t)
• Neural networks: input → trainable features → linear classifier or regressor → loss; L(σ(W2 σ(W1 x)), t)
• Deep neural networks: input → trainable features in multiple layers → linear classifier or regressor → loss; L(σ(W_L … σ(W1 x)), t)

Layered functional representation of NN
• Representation:
– Inputs and outputs are vectors
– Weights are matrices
• No model is: f(x) = x
• Linear model is: f(x) = σ(W x)
• Adding one more layer: f(x) = σ(W2 σ(W1 x))
• Adding multiple layers: f(x) = σ(W_L … σ(W1 x)) (see the sketch below)
• Loss is also a layer: L(f(x), t)
• All trained using the chain rule to compute derivatives for gradient descent
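As a concrete illustration of the layered expression above, here is a minimal NumPy sketch of a two-layer forward pass f(x) = σ(W2 σ(W1 x + b1) + b2); the layer sizes and random weights are arbitrary assumptions made only for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary sizes chosen for illustration: 3 inputs, 4 hidden units, 2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

def forward(x):
    # Hidden layer: trainable features sigma(W1 x + b1)
    h = sigmoid(W1 @ x + b1)
    # Output layer: linear map followed by the output nonlinearity
    return sigmoid(W2 @ h + b2)

print(forward(np.array([0.5, -1.0, 2.0])))
```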
Structure of biological neurons
Artificial neurons are inspired by biological neurons
• Network of neurons
• Somewhat like the brain
[Diagram: inputs x1, x2, x3 with weights w1, w2, w3 and a bias b (on a constant input 1), feeding a summation Σ followed by an activation σ]
Activation function is the secret sauce of neural networks
• Neural network training is all about tuning weights and biases
• If there were no activation function f, the output of the entire neural network would be a linear function of the inputs (see the check below)
• In a perceptron, it was a step function
[Diagram: inputs x1, x2, x3 with weights w1, w2, w3 and a bias b, feeding a summation Σ followed by an activation σ]
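A quick NumPy check of the claim above: composing two linear layers without an activation between them is itself just one linear layer. The matrices and sizes below are arbitrary assumptions made for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x = rng.normal(size=3)

# Two "layers" with no activation in between ...
two_layer = W2 @ (W1 @ x)
# ... equal a single linear layer with weight matrix W2 @ W1
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layer, one_layer))  # True: no extra modeling power
```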
Single neuron can model logistic regression
[Diagram: inputs x1, x2, x3 with weights w1, w2, w3 feeding a single unit f(.) that produces the output y1]
Logistic regression can be trained using gradient descent
[Diagram: inputs x1, x2, x3 with weights w1, w2, w3 feeding f(.) to produce the output y1, which is compared against the desired output (supervision) to compute a loss]
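A minimal NumPy sketch of this idea, assuming a sigmoid activation, a cross-entropy loss, and a small synthetic dataset; these choices are assumptions made for the example, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))                   # 100 examples, 3 features
t = (X @ np.array([1.0, -2.0, 0.5]) > 0) * 1.0  # synthetic binary targets

w, b, lr = np.zeros(3), 0.0, 0.1
for _ in range(500):
    y = 1.0 / (1.0 + np.exp(-(X @ w + b)))      # sigmoid output of the single neuron
    grad_w = X.T @ (y - t) / len(t)             # gradient of mean cross-entropy loss
    grad_b = np.mean(y - t)
    w -= lr * grad_w                            # step against the gradient
    b -= lr * grad_b

print(np.mean((y > 0.5) == t))                  # training accuracy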
We can have multiple outputs as well
[Diagram: inputs x1, x2, x3 fully connected to two units f(.) producing outputs y1 and y2]
Layered structure of mammalian visual cortex
Introducing a hidden layer in neural networks
[Diagram: inputs x1, x2, x3 feeding hidden units g(.) with outputs h11, h12, h13, which in turn feed output units f(.) producing y1 and y2]
Importance of hidden layers
• First hidden layer extracts features
• Second hidden layer extracts features of hidden features
• …
• Output layer gives the desired output
[Figure: a 2-D arrangement of + and − points that a single sigmoid cannot separate, but a network with sigmoid hidden layers and a sigmoid output can]
Visualizing what hidden layers are doing
Universal approximation theorem
• W2 σ(W1 x) can approximate any smooth function f(x) to within error ε on a compact interval (see the sketch below), provided
– the size of the W's is arbitrary, and
– σ is also smooth, but not a polynomial
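As a toy illustration of the theorem (not a proof), the sketch below fits a one-hidden-layer model W2 σ(W1 x) to sin(x) on [−3, 3]. Using a random W1 and least squares for W2 is a simplification assumed here just to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-3, 3, 200)[:, None]
target = np.sin(x).ravel()

# Hidden layer: 50 units with random weights/biases and tanh activation
W1, b1 = rng.normal(size=(50, 1)), rng.normal(size=50)
H = np.tanh(x @ W1.T + b1)            # shape (200, 50)

# Output weights W2 fit by least squares instead of gradient descent
W2, *_ = np.linalg.lstsq(H, target, rcond=None)

approx_error = np.max(np.abs(H @ W2 - target))
print(approx_error)                   # a small epsilon over the compact interval
```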
Step function divides the input space into two halves → 0 and 1
• In a single neuron, the step function is a linear binary classifier
• The weights and biases determine where the step will be in n dimensions
• But, as we shall see later, it gives little information about how to change the weights if we make a mistake
• So, we need a smoother version of a step function
• Enter: the sigmoid function
Types of activation functions
• Step: original concept behind classification and region bifurcation; not used anymore
• Sigmoid and tanh: trainable approximations of the step function
• ReLU: currently preferred due to fast convergence
• Softmax: currently preferred for the output of a classification net; a generalized sigmoid
• Linear: good for modeling a range in the output of a regression net
Formulas for activation functions
• Step: (sign(x) + 1) / 2
• Sigmoid: 1 / (1 + e^(−x))
• Tanh: (e^x − e^(−x)) / (e^x + e^(−x))
• ReLU: max(0, x)
• Softmax: e^(x_j) / Σ_i e^(x_i)
• Linear: x
(See the NumPy sketch below.)
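The same formulas written as a small NumPy sketch; the numerically stabilized softmax is a common convention assumed here, not something stated on the slide.

```python
import numpy as np

def step(x):     return (np.sign(x) + 1) / 2
def sigmoid(x):  return 1 / (1 + np.exp(-x))
def tanh(x):     return np.tanh(x)
def relu(x):     return np.maximum(0, x)
def linear(x):   return x

def softmax(x):
    # Subtracting the max does not change the result but avoids overflow
    e = np.exp(x - np.max(x))
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))
```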
The sigmoid function is a smoother step function
• Smoothness ensures that there is more information about the direction in which to change the weights if there are errors
• The sigmoid function is also mathematically linked to logistic regression, which is a theoretically well-backed linear classifier
The problem with sigmoid is (near) zero gradient on both extremes
• For both large positive and large negative input values, the sigmoid doesn't change much with a change of the input
• ReLU has a constant gradient for almost half of the inputs
Output activation functions can only be of the following kinds
• Sigmoid gives a binary classification output
• Tanh can also do that, provided the desired output is in {−1, +1}
• Softmax generalizes sigmoid to n-ary classification
Multiple hidden layers
[Diagram: inputs x1, x2, x3 feeding a first hidden layer g(.) with outputs h11, h12, h13, a second hidden layer g(.) with outputs h21, h22, h23, and output units f(.) producing y1 and y2]
Challenges: (1) too many parameters, (2) gradient dilution.
Basic structure of a neural network
• It is feed-forward
– Connections go from inputs towards outputs
– No connection comes backwards
• It consists of layers
– The current layer's input is the previous layer's output
– No lateral (intra-layer) connections
• That's it!
[Diagram: input nodes x1, x2, …, xd; hidden nodes h11, h12, …, h1n with a bias node 1; output nodes y1, y2, …, yn]
Basic structure of a neural network
• Output layer
– Represents the output of the neural network
– For a two-class problem or regression with a 1-d output, we need only one output node
• Hidden layer(s)
– Represent the intermediary nodes that divide the input space into regions with (soft) boundaries
– These usually form a hidden layer
– Usually, there is only one such layer
– Given enough hidden nodes, we can model an arbitrary input-output relation
• Input layer
– Represents the dimensions of the input vector (one node for each dimension)
– These usually form an input layer
– Usually there is only one such layer
[Diagram: input nodes x1, x2, …, xd; hidden nodes h11, h12, …, h1n with a bias node 1; output nodes y1, y2, …, yn]
Gradient ascent
• If you didn’t know the shape of a mountain
• But at every step you knew the slope
• Can you reach the top of the mountain?
Gradient descent minimizes the loss function
• At every point, compute
– the loss (scalar): L(w) = Σ_{i=1}^{N} l(f(x_i), t_i)
– the gradient of the loss with respect to the weights (vector): ∇_w L
• Take a step towards the negative gradient: w ← w − η ∇_w L (see the sketch below)
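A minimal sketch of this update rule in NumPy, assuming a simple quadratic loss just to show the mechanics; the loss, learning rate, and starting point are assumptions made for the example.

```python
import numpy as np

def loss(w): return np.sum((w - 3.0) ** 2)   # toy loss with minimum at w = [3, 3]
def grad(w): return 2.0 * (w - 3.0)          # its gradient

w = np.array([0.0, 10.0])
eta = 0.1                                    # learning rate
for _ in range(100):
    w = w - eta * grad(w)                    # step towards the negative gradient

print(w, loss(w))                            # w approaches [3, 3], loss approaches 0
```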
Derivative of a function of a scalar
• The derivative d f(x)/dx is the rate of change of f(x) with respect to x
• It is zero when the function is flat (horizontal), such as at the minimum or maximum of f(x)
• It is positive when f(x) is sloping up, and negative when f(x) is sloping down
• To move towards the maximum, take a small step in the direction of the derivative
Gradient of a function of a vector
• The gradient is the derivative with respect to each dimension, holding the other dimensions constant
• At a minimum or a maximum the gradient is a zero vector: the function is flat in every direction
[Surface plot of f(x1, x2) over x1 and x2; original image source unknown]
Gradient of a function of a vector
• The gradient gives a direction for moving towards the minimum
• Take a small step towards the negative of the gradient
[Surface plot of f(x1, x2) over x1 and x2; original image source unknown]
Example of gradient
• Let f be a function of x1 and x2
• Then its gradient is the vector of partial derivatives (∂f/∂x1, ∂f/∂x2)
• At any location, a step in the gradient direction will lead to maximal increase in the function (see the sketch below)
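The specific function used on this slide is not recoverable here, so the sketch below assumes f(x1, x2) = x1² + 2·x2² purely for illustration, and checks the analytic gradient against central finite differences.

```python
import numpy as np

def f(x):      return x[0] ** 2 + 2 * x[1] ** 2   # assumed example function
def grad_f(x): return np.array([2 * x[0], 4 * x[1]])

x = np.array([1.0, -2.0])
eps = 1e-6
numeric = np.array([
    (f(x + np.array([eps, 0])) - f(x - np.array([eps, 0]))) / (2 * eps),
    (f(x + np.array([0, eps])) - f(x - np.array([0, eps]))) / (2 * eps),
])
print(grad_f(x), numeric)   # both are approximately [2, -8]
```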
This story is unfolding in multiple dimensions
[Figure; original image source unknown]
Backpropagation
• Backpropagation is an efficient method to do gradient descent
• It saves the gradient w.r.t. the upper layer's output to compute the gradient w.r.t. the weights immediately below
[Diagram: feed-forward network with inputs x1, x2, …, xd; hidden nodes h11, h12, …, h1n with a bias node 1; outputs y1, y2, …, yn]
Chain rule of differentiation
• Very handy for complicated functions
– Especially functions of functions
– E.g. NN outputs are functions of previous layers
– For example: let y = f(u) and u = g(x)
– Then dy/dx = (dy/du)(du/dx) = f′(g(x)) g′(x)
Backpropagation makes use of chain rule of derivatives
• Variable names: Z_k is the output of the k-th linear op.; A_k is the output of the k-th nonlinearity
• Loss for one example: l = Loss(g(W2 f(W1 x_i + b1) + b2), t_i), with Z1 = W1 x_i + b1, A1 = f(Z1), Z2 = W2 A1 + b2, A2 = g(Z2)
• Chain rule: ∂l/∂W1 = (∂l/∂A2)(∂A2/∂Z2)(∂Z2/∂A1)(∂A1/∂Z1)(∂Z1/∂W1)
• The term ∂Z2/∂A1 is W2, and ∂A1/∂Z1 is the local derivative of the activation function, etc.
[Diagram: forward graph x_i → ×W1, +b1 → Z1 → f → A1 → ×W2, +b2 → Z2 → g → A2 → Loss l (compared with t_i)]
1. Make a forward pass and store partial derivatives
2. During the backward pass, multiply partial derivatives (see the sketch below)
[Diagram: the same forward graph x_i → ×W1, +b1 → Z1 → f → A1 → ×W2, +b2 → Z2 → g → A2 → Loss l (compared with t_i)]
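A compact NumPy sketch of these two passes for the two-layer network above, assuming a sigmoid hidden activation f, a linear output g, and a squared-error loss; these choices are assumptions made to keep the example concrete.

```python
import numpy as np

def sigmoid(z): return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(4)
x, t = rng.normal(size=3), np.array([1.0, 0.0])   # one example, 2-d target
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

# Forward pass: store the intermediate values needed for the backward pass
Z1 = W1 @ x + b1
A1 = sigmoid(Z1)
Z2 = W2 @ A1 + b2
A2 = Z2                                           # linear output g
loss = 0.5 * np.sum((A2 - t) ** 2)

# Backward pass: multiply partial derivatives layer by layer
dA2 = A2 - t                                      # dl/dA2 for squared error
dZ2 = dA2                                         # g is linear, so dA2/dZ2 = 1
dW2, db2 = np.outer(dZ2, A1), dZ2                 # dl/dW2, dl/db2
dA1 = W2.T @ dZ2                                  # dZ2/dA1 is W2
dZ1 = dA1 * A1 * (1 - A1)                         # local derivative of the sigmoid
dW1, db1 = np.outer(dZ1, x), dZ1                  # dl/dW1, dl/db1

print(loss, dW1.shape, dW2.shape)
```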
Vector valued functions and Jacobians
• We often deal with functions that give multiple outputs
• Let y = f(x) map an n-dimensional input x to an m-dimensional output y
• Thinking in terms of a vector of functions can make the representation less cumbersome and computations more efficient
• Then the Jacobian is the m × n matrix of partial derivatives, J_ij = ∂y_i/∂x_j
• It lets us compute the derivatives of a higher layer's output with respect to those of the lower layer, i.e. the Jacobian of y with respect to x (see the sketch below)
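As a small illustration, the sketch below assumes a layer y = σ(Wx) (the layer and its sizes are assumptions for the example). Its Jacobian with respect to the input is diag(σ′(Wx)) · W, which a finite-difference check confirms.

```python
import numpy as np

def sigmoid(z): return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(5)
W, x = rng.normal(size=(2, 3)), rng.normal(size=3)

z = W @ x
s = sigmoid(z)
J = (s * (1 - s))[:, None] * W          # analytic Jacobian dy/dx, shape (2, 3)

eps = 1e-6
J_num = np.empty_like(J)
for j in range(3):
    dx = np.zeros(3); dx[j] = eps
    J_num[:, j] = (sigmoid(W @ (x + dx)) - sigmoid(W @ (x - dx))) / (2 * eps)

print(np.allclose(J, J_num))            # True
```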
Some questions
• What if we scale all the weights and biases by a
factor?

• What happens to gradients in deep neural networks?


Vanilla gradient descent
Role of step size and learning rate
• Tale of two loss functions
– Same value, and
– Same gradient (first derivative), but
– Different Hessian (second derivative)
– Different step sizes needed
The perfect step size is impossible to guess
• Goldilocks finds the perfect balance only in a fairy tale
• The step size is decided by the learning rate and the gradient
Issues with GD
• Need to find good step-size

• Lots of computation before each update

• Can get stuck in local minima


Stochastic gradient descent to escape local minima and other critical points
Batch gradient descent for speed up
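A sketch of the update loop in NumPy using stochastic mini-batch updates; the least-squares problem, batch size, and learning rate are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(1000, 5))
t = X @ np.array([1.0, -1.0, 2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=1000)

w, lr, batch = np.zeros(5), 0.05, 32
for epoch in range(20):
    order = rng.permutation(len(t))             # shuffle, then update on small batches
    for start in range(0, len(t), batch):
        idx = order[start:start + batch]
        grad = X[idx].T @ (X[idx] @ w - t[idx]) / len(idx)  # gradient on this batch only
        w -= lr * grad                          # cheap, noisy step; the noise can help
                                                # escape saddle points and local minima
print(w)
```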
Double derivative
• The double derivative is the derivative of the derivative of f(x)
• The double derivative is positive for convex functions (which have a single minimum), and negative for concave functions (which have a single maximum)
Double derivative
• The double derivative tells how far the minimum might be from a given point
• From a given point, the minimum is closer for the red dashed curve than for the blue solid curve, because the former has a larger second derivative (its slope reverses faster)
Perfect step size for a paraboloid
• Let f(x) = a x² + b x + c
• Assuming a > 0
• The minimum is at: x* = −b / (2a)
• For any x, the perfect step would be: x* − x = −f′(x) / f″(x)
• So, the perfect learning rate is: 1 / f″(x) (see the check below)
• In multiple dimensions, the corresponding step uses the inverse of the Hessian
• Practically, we do not want to compute the inverse of a Hessian matrix, so we approximate the Hessian inverse
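A quick numeric check of this claim, assuming f(x) = a x² + b x + c with a = 2, b = −4: a single gradient step with learning rate 1/f″(x) lands exactly at the minimum.

```python
a, b, c = 2.0, -4.0, 1.0           # assumed paraboloid f(x) = a x^2 + b x + c
f_prime  = lambda x: 2 * a * x + b
f_second = 2 * a                   # constant second derivative

x = 5.0                            # arbitrary starting point
x_new = x - (1.0 / f_second) * f_prime(x)   # perfect learning rate = 1 / f''(x)

print(x_new, -b / (2 * a))         # both are 1.0: the minimum is reached in one step
```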
Hessian of a function of a vector
• The double derivative with respect to each pair of dimensions forms the Hessian matrix: H_ij = ∂²f / (∂x_i ∂x_j)
• If all eigenvalues of the Hessian matrix are positive, then the function is convex
[Surface plot of f(x1, x2) over x1 and x2; original image source unknown]
Example of Hessian
• Let f(x1, x2) be a function of two variables
• Then its gradient is the vector of first partial derivatives
• And its Hessian is the matrix of second partial derivatives (see the sketch below)
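The slide's specific example is not recoverable here, so this sketch assumes f(x1, x2) = x1² + 2·x2² and computes its (constant) Hessian and eigenvalues to check convexity.

```python
import numpy as np

# Assumed example: f(x1, x2) = x1^2 + 2 x2^2
# Gradient: [2 x1, 4 x2];  Hessian: the constant matrix below
H = np.array([[2.0, 0.0],
              [0.0, 4.0]])

eigenvalues = np.linalg.eigvalsh(H)
print(eigenvalues, np.all(eigenvalues > 0))   # [2. 4.] True -> f is convex
```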
Saddle points, Hessian and long local furrows
• Some variables may have reached a local minimum while others have not
• Some weights may have almost zero gradient
• At least some eigenvalues may not be negative
Image source: Wikipedia


[Figure: a loss landscape with a labeled global minima(?), saddle point, local minima, and local maxima. Image source: https://www.cs.umd.edu/~tomg/projects/landscapes/]


Adding momentum in lieu of second derivative
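A minimal sketch of a momentum update; the coefficient β = 0.9 and the elongated toy quadratic loss are assumptions, and this is the classic heavy-ball form rather than any specific variant from the slides.

```python
import numpy as np

def grad(w):                      # gradient of an assumed elongated quadratic loss
    return np.array([0.2 * w[0], 20.0 * w[1]])

w, v = np.array([10.0, 1.0]), np.zeros(2)
lr, beta = 0.05, 0.9              # learning rate and momentum coefficient
for _ in range(200):
    v = beta * v + grad(w)        # accumulate a running average of past gradients
    w = w - lr * v                # step using the accumulated velocity

print(w)                          # approaches [0, 0] faster than plain GD along the flat direction
```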
