
EE769 Intro to ML

Introduction to neural networks


Amit Sethi
Faculty member, IIT Bombay
Learning objectives

• Explain how adding a hidden layer increases the modeling power of neural networks over linear models
• Write the mathematical expression of a neural network
Increasing nonlinearity in models
• Linear models: input x → linear classifier or regressor f(x) → loss; L(σ(W x), t)
• Support vector machines: input → fixed features → linear classifier or regressor → loss; L(Σ_i t_i a_i k(x, x_i), t)
• Neural networks: input → trainable features → linear classifier or regressor → loss; L(σ(W2 σ(W1 x)), t)
• Deep neural networks: input → trainable features in multiple layers → linear classifier or regressor → loss; L(σ(W_L … σ(W1 x)), t)

Layered functional representation of NN
• Representation:
– Inputs and outputs are vectors
– Weights are matrices
• No model is: f(x) = x
• Linear model is: f(x) = σ(W x)
• Adding one more layer: f(x) = σ(W2 σ(W1 x))
• Adding multiple layers: f(x) = σ(W_L … σ(W1 x)) (see the sketch below)
• Loss is also a layer: L(f(x), t)
• All trained using the chain rule to compute derivatives for gradient descent
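As a concrete illustration of the layered expression above, here is a minimal NumPy sketch of a two-layer forward pass f(x) = σ(W2 σ(W1 x + b1) + b2); the layer sizes and random weights are arbitrary assumptions made only for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary sizes chosen for illustration: 3 inputs, 4 hidden units, 2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

def forward(x):
    # Hidden layer: trainable features sigma(W1 x + b1)
    h = sigmoid(W1 @ x + b1)
    # Output layer: linear map followed by the output nonlinearity
    return sigmoid(W2 @ h + b2)

print(forward(np.array([0.5, -1.0, 2.0])))
```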
Structure of biological neurons
Artificial neurons are inspired by biological neurons
• Network of neurons
• Somewhat like the brain
[Diagram: inputs x1, x2, x3 with weights w1, w2, w3 and a bias b (on a constant input 1), feeding a summation Σ followed by an activation σ]
Activation function is the secret sauce of neural networks
• Neural network training is all about tuning weights and biases
• If there were no activation function f, the output of the entire neural network would be a linear function of the inputs (see the check below)
• In a perceptron, it was a step function
[Diagram: inputs x1, x2, x3 with weights w1, w2, w3 and a bias b, feeding a summation Σ followed by an activation σ]
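A quick NumPy check of the claim above: composing two linear layers without an activation between them is itself just one linear layer. The matrices and sizes below are arbitrary assumptions made for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x = rng.normal(size=3)

# Two "layers" with no activation in between ...
two_layer = W2 @ (W1 @ x)
# ... equal a single linear layer with weight matrix W2 @ W1
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layer, one_layer))  # True: no extra modeling power
```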
Single neuron can model logistic regression
[Diagram: inputs x1, x2, x3 with weights w1, w2, w3 feeding a single unit f(.) that produces the output y1]
Logistic regression can be trained using gradient descent
[Diagram: inputs x1, x2, x3 with weights w1, w2, w3 feeding f(.) to produce the output y1, which is compared against the desired output (supervision) to compute a loss]
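A minimal NumPy sketch of this idea, assuming a sigmoid activation, a cross-entropy loss, and a small synthetic dataset; these choices are assumptions made for the example, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))                   # 100 examples, 3 features
t = (X @ np.array([1.0, -2.0, 0.5]) > 0) * 1.0  # synthetic binary targets

w, b, lr = np.zeros(3), 0.0, 0.1
for _ in range(500):
    y = 1.0 / (1.0 + np.exp(-(X @ w + b)))      # sigmoid output of the single neuron
    grad_w = X.T @ (y - t) / len(t)             # gradient of mean cross-entropy loss
    grad_b = np.mean(y - t)
    w -= lr * grad_w                            # step against the gradient
    b -= lr * grad_b

print(np.mean((y > 0.5) == t))                  # training accuracy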
We can have multiple outputs as well
[Diagram: inputs x1, x2, x3 fully connected to two units f(.) producing outputs y1 and y2]
Layered structure of mammalian visual cortex
Introducing a hidden layer in neural networks
[Diagram: inputs x1, x2, x3 feeding hidden units g(.) with outputs h11, h12, h13, which in turn feed output units f(.) producing y1 and y2]
Importance of hidden layers
• First hidden layer extracts features
• Second hidden layer extracts features of hidden features
• …
• Output layer gives the desired output
[Figure: a 2-D arrangement of + and − points that a single sigmoid cannot separate, but a network with sigmoid hidden layers and a sigmoid output can]
Visualizing what hidden layers are doing
Universal approximation theorem
• W2 σ(W1 x) can approximate any smooth function f(x) to within error ε on a compact interval (see the sketch below), provided
– the size of the W's is arbitrary, and
– σ is also smooth, but not a polynomial
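As a toy illustration of the theorem (not a proof), the sketch below fits a one-hidden-layer model W2 σ(W1 x) to sin(x) on [−3, 3]. Using a random W1 and least squares for W2 is a simplification assumed here just to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-3, 3, 200)[:, None]
target = np.sin(x).ravel()

# Hidden layer: 50 units with random weights/biases and tanh activation
W1, b1 = rng.normal(size=(50, 1)), rng.normal(size=50)
H = np.tanh(x @ W1.T + b1)            # shape (200, 50)

# Output weights W2 fit by least squares instead of gradient descent
W2, *_ = np.linalg.lstsq(H, target, rcond=None)

approx_error = np.max(np.abs(H @ W2 - target))
print(approx_error)                   # a small epsilon over the compact interval
```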
Step function divides the input space into two halves → 0 and 1
• In a single neuron, the step function is a linear binary classifier
• The weights and biases determine where the step will be in n dimensions
• But, as we shall see later, it gives little information about how to change the weights if we make a mistake
• So, we need a smoother version of a step function
• Enter: the sigmoid function
Types of activation functions
• Step: original concept behind classification and region bifurcation; not used anymore
• Sigmoid and tanh: trainable approximations of the step function
• ReLU: currently preferred due to fast convergence
• Softmax: currently preferred for the output of a classification net; a generalized sigmoid
• Linear: good for modeling a range in the output of a regression net
Formulas for activation functions
• Step: (sign(x) + 1) / 2
• Sigmoid: 1 / (1 + e^(−x))
• Tanh: (e^x − e^(−x)) / (e^x + e^(−x))
• ReLU: max(0, x)
• Softmax: e^(x_j) / Σ_i e^(x_i)
• Linear: x
(See the NumPy sketch below.)
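The same formulas written as a small NumPy sketch; the numerically stabilized softmax is a common convention assumed here, not something stated on the slide.

```python
import numpy as np

def step(x):     return (np.sign(x) + 1) / 2
def sigmoid(x):  return 1 / (1 + np.exp(-x))
def tanh(x):     return np.tanh(x)
def relu(x):     return np.maximum(0, x)
def linear(x):   return x

def softmax(x):
    # Subtracting the max does not change the result but avoids overflow
    e = np.exp(x - np.max(x))
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))
```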
The sigmoid function is a smoother step function
• Smoothness ensures that there is more information about the direction in which to change the weights if there are errors
• The sigmoid function is also mathematically linked to logistic regression, which is a theoretically well-backed linear classifier
The problem with sigmoid is (near) zero gradient on both extremes
• For both large positive and large negative input values, the sigmoid doesn't change much with a change of the input
• ReLU has a constant gradient for almost half of the inputs
Output activation functions can only be of the following kinds
• Sigmoid gives a binary classification output
• Tanh can also do that, provided the desired output is in {−1, +1}
• Softmax generalizes sigmoid to n-ary classification
Multiple hidden layers
[Diagram: inputs x1, x2, x3 feeding a first hidden layer g(.) with outputs h11, h12, h13, a second hidden layer g(.) with outputs h21, h22, h23, and output units f(.) producing y1 and y2]
Challenges: (1) too many parameters, (2) gradient dilution.
Basic structure of a neural network
• It is feed-forward
– Connections go from inputs towards outputs
– No connection comes backwards
• It consists of layers
– The current layer's input is the previous layer's output
– No lateral (intra-layer) connections
• That's it!
[Diagram: input nodes x1, x2, …, xd; hidden nodes h11, h12, …, h1n with a bias node 1; output nodes y1, y2, …, yn]
Basic structure of a neural network
• Output layer
– Represents the output of the neural network
– For a two-class problem or regression with a 1-d output, we need only one output node
• Hidden layer(s)
– Represent the intermediary nodes that divide the input space into regions with (soft) boundaries
– These usually form a hidden layer
– Usually, there is only one such layer
– Given enough hidden nodes, we can model an arbitrary input-output relation
• Input layer
– Represents the dimensions of the input vector (one node for each dimension)
– These usually form an input layer
– Usually there is only one such layer
[Diagram: input nodes x1, x2, …, xd; hidden nodes h11, h12, …, h1n with a bias node 1; output nodes y1, y2, …, yn]
Gradient ascent
• If you didn’t know the shape of a mountain
• But at every step you knew the slope
• Can you reach the top of the mountain?
Gradient descent minimizes the loss function
• At every point, compute
– the loss (scalar): L(w) = Σ_{i=1}^{N} l(f(x_i), t_i)
– the gradient of the loss with respect to the weights (vector): ∇_w L
• Take a step towards the negative gradient: w ← w − η ∇_w L (see the sketch below)
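A minimal sketch of this update rule in NumPy, assuming a simple quadratic loss just to show the mechanics; the loss, learning rate, and starting point are assumptions made for the example.

```python
import numpy as np

def loss(w): return np.sum((w - 3.0) ** 2)   # toy loss with minimum at w = [3, 3]
def grad(w): return 2.0 * (w - 3.0)          # its gradient

w = np.array([0.0, 10.0])
eta = 0.1                                    # learning rate
for _ in range(100):
    w = w - eta * grad(w)                    # step towards the negative gradient

print(w, loss(w))                            # w approaches [3, 3], loss approaches 0
```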
Derivative of a function of a scalar
• The derivative d f(x)/dx is the rate of change of f(x) with respect to x
• It is zero when the function is flat (horizontal), such as at the minimum or maximum of f(x)
• It is positive when f(x) is sloping up, and negative when f(x) is sloping down
• To move towards the maximum, take a small step in the direction of the derivative
Gradient of a function of a vector
• The gradient is the derivative with respect to each dimension, holding the other dimensions constant
• At a minimum or a maximum the gradient is a zero vector: the function is flat in every direction
[Surface plot of f(x1, x2) over x1 and x2; original image source unknown]
Gradient of a function of a vector
• The gradient gives a direction for moving towards the minimum
• Take a small step towards the negative of the gradient
[Surface plot of f(x1, x2) over x1 and x2; original image source unknown]
Example of gradient
• Let f be a function of x1 and x2
• Then its gradient is the vector of partial derivatives (∂f/∂x1, ∂f/∂x2)
• At any location, a step in the gradient direction will lead to maximal increase in the function (see the sketch below)
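The specific function used on this slide is not recoverable here, so the sketch below assumes f(x1, x2) = x1² + 2·x2² purely for illustration, and checks the analytic gradient against central finite differences.

```python
import numpy as np

def f(x):      return x[0] ** 2 + 2 * x[1] ** 2   # assumed example function
def grad_f(x): return np.array([2 * x[0], 4 * x[1]])

x = np.array([1.0, -2.0])
eps = 1e-6
numeric = np.array([
    (f(x + np.array([eps, 0])) - f(x - np.array([eps, 0]))) / (2 * eps),
    (f(x + np.array([0, eps])) - f(x - np.array([0, eps]))) / (2 * eps),
])
print(grad_f(x), numeric)   # both are approximately [2, -8]
```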
This story is unfolding in multiple dimensions
[Figure; original image source unknown]
Backpropagation
• Backpropagation is an efficient method to do gradient descent
• It saves the gradient w.r.t. the upper layer's output to compute the gradient w.r.t. the weights immediately below
[Diagram: feed-forward network with inputs x1, x2, …, xd; hidden nodes h11, h12, …, h1n with a bias node 1; outputs y1, y2, …, yn]
Chain rule of differentiation
• Very handy for complicated functions
– Especially functions of functions
– E.g. NN outputs are functions of previous layers
– For example: let y = f(u) and u = g(x)
– Then dy/dx = (dy/du)(du/dx) = f′(g(x)) g′(x)
Backpropagation makes use of chain rule of derivatives
• Variable names: Z_k is the output of the k-th linear op.; A_k is the output of the k-th nonlinearity
• Loss for one example: l = Loss(g(W2 f(W1 x_i + b1) + b2), t_i), with Z1 = W1 x_i + b1, A1 = f(Z1), Z2 = W2 A1 + b2, A2 = g(Z2)
• Chain rule: ∂l/∂W1 = (∂l/∂A2)(∂A2/∂Z2)(∂Z2/∂A1)(∂A1/∂Z1)(∂Z1/∂W1)
• The term ∂Z2/∂A1 is W2, and ∂A1/∂Z1 is the local derivative of the activation function, etc.
[Diagram: forward graph x_i → ×W1, +b1 → Z1 → f → A1 → ×W2, +b2 → Z2 → g → A2 → Loss l (compared with t_i)]
1. Make a forward pass and store partial derivatives
2. During the backward pass, multiply partial derivatives (see the sketch below)
[Diagram: the same forward graph x_i → ×W1, +b1 → Z1 → f → A1 → ×W2, +b2 → Z2 → g → A2 → Loss l (compared with t_i)]
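A compact NumPy sketch of these two passes for the two-layer network above, assuming a sigmoid hidden activation f, a linear output g, and a squared-error loss; these choices are assumptions made to keep the example concrete.

```python
import numpy as np

def sigmoid(z): return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(4)
x, t = rng.normal(size=3), np.array([1.0, 0.0])   # one example, 2-d target
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

# Forward pass: store the intermediate values needed for the backward pass
Z1 = W1 @ x + b1
A1 = sigmoid(Z1)
Z2 = W2 @ A1 + b2
A2 = Z2                                           # linear output g
loss = 0.5 * np.sum((A2 - t) ** 2)

# Backward pass: multiply partial derivatives layer by layer
dA2 = A2 - t                                      # dl/dA2 for squared error
dZ2 = dA2                                         # g is linear, so dA2/dZ2 = 1
dW2, db2 = np.outer(dZ2, A1), dZ2                 # dl/dW2, dl/db2
dA1 = W2.T @ dZ2                                  # dZ2/dA1 is W2
dZ1 = dA1 * A1 * (1 - A1)                         # local derivative of the sigmoid
dW1, db1 = np.outer(dZ1, x), dZ1                  # dl/dW1, dl/db1

print(loss, dW1.shape, dW2.shape)
```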
Vector valued functions and Jacobians
• We often deal with functions that give multiple outputs
• Let y = f(x) map an n-dimensional input x to an m-dimensional output y
• Thinking in terms of a vector of functions can make the representation less cumbersome and computations more efficient
• Then the Jacobian is the m × n matrix of partial derivatives, J_ij = ∂y_i/∂x_j
• It lets us compute the derivatives of a higher layer's output with respect to those of the lower layer, i.e. the Jacobian of y with respect to x (see the sketch below)
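As a small illustration, the sketch below assumes a layer y = σ(Wx) (the layer and its sizes are assumptions for the example). Its Jacobian with respect to the input is diag(σ′(Wx)) · W, which a finite-difference check confirms.

```python
import numpy as np

def sigmoid(z): return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(5)
W, x = rng.normal(size=(2, 3)), rng.normal(size=3)

z = W @ x
s = sigmoid(z)
J = (s * (1 - s))[:, None] * W          # analytic Jacobian dy/dx, shape (2, 3)

eps = 1e-6
J_num = np.empty_like(J)
for j in range(3):
    dx = np.zeros(3); dx[j] = eps
    J_num[:, j] = (sigmoid(W @ (x + dx)) - sigmoid(W @ (x - dx))) / (2 * eps)

print(np.allclose(J, J_num))            # True
```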
Some questions
• What if we scale all the weights and biases by a
factor?

• What happens to gradients in deep neural networks?


Vanilla gradient descent
Role of step size and learning rate
• Tale of two loss functions
– Same value, and
– Same gradient (first derivative), but
– Different Hessian (second derivative)
– Different step sizes needed
The perfect step size is impossible to guess
• Goldilocks finds the perfect balance only in a fairy tale
• The step size is decided by the learning rate and the gradient
Issues with GD
• Need to find good step-size

• Lots of computation before each update

• Can get stuck in local minima


Stochastic gradient descent to escape local minima and other critical points
Batch gradient descent for speed up
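A sketch of the update loop in NumPy using stochastic mini-batch updates; the least-squares problem, batch size, and learning rate are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(1000, 5))
t = X @ np.array([1.0, -1.0, 2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=1000)

w, lr, batch = np.zeros(5), 0.05, 32
for epoch in range(20):
    order = rng.permutation(len(t))             # shuffle, then update on small batches
    for start in range(0, len(t), batch):
        idx = order[start:start + batch]
        grad = X[idx].T @ (X[idx] @ w - t[idx]) / len(idx)  # gradient on this batch only
        w -= lr * grad                          # cheap, noisy step; the noise can help
                                                # escape saddle points and local minima
print(w)
```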
Double derivative
• The double derivative is the derivative of the derivative of f(x)
• The double derivative is positive for convex functions (which have a single minimum), and negative for concave functions (which have a single maximum)
Double derivative
• The double derivative tells how far the minimum might be from a given point
• From a given point, the minimum is closer for the red dashed curve than for the blue solid curve, because the former has a larger second derivative (its slope reverses faster)
Perfect step size for a paraboloid
• Let f(x) = a x² + b x + c
• Assuming a > 0
• The minimum is at: x* = −b / (2a)
• For any x, the perfect step would be: x* − x = −f′(x) / f″(x)
• So, the perfect learning rate is: 1 / f″(x) (see the check below)
• In multiple dimensions, the corresponding step uses the inverse of the Hessian
• Practically, we do not want to compute the inverse of a Hessian matrix, so we approximate the Hessian inverse
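A quick numeric check of this claim, assuming f(x) = a x² + b x + c with a = 2, b = −4: a single gradient step with learning rate 1/f″(x) lands exactly at the minimum.

```python
a, b, c = 2.0, -4.0, 1.0           # assumed paraboloid f(x) = a x^2 + b x + c
f_prime  = lambda x: 2 * a * x + b
f_second = 2 * a                   # constant second derivative

x = 5.0                            # arbitrary starting point
x_new = x - (1.0 / f_second) * f_prime(x)   # perfect learning rate = 1 / f''(x)

print(x_new, -b / (2 * a))         # both are 1.0: the minimum is reached in one step
```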
Hessian of a function of a vector
• The double derivative with respect to each pair of dimensions forms the Hessian matrix: H_ij = ∂²f / (∂x_i ∂x_j)
• If all eigenvalues of the Hessian matrix are positive, then the function is convex
[Surface plot of f(x1, x2) over x1 and x2; original image source unknown]
Example of Hessian
• Let f(x1, x2) be a function of two variables
• Then its gradient is the vector of first partial derivatives
• And its Hessian is the matrix of second partial derivatives (see the sketch below)
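The slide's specific example is not recoverable here, so this sketch assumes f(x1, x2) = x1² + 2·x2² and computes its (constant) Hessian and eigenvalues to check convexity.

```python
import numpy as np

# Assumed example: f(x1, x2) = x1^2 + 2 x2^2
# Gradient: [2 x1, 4 x2];  Hessian: the constant matrix below
H = np.array([[2.0, 0.0],
              [0.0, 4.0]])

eigenvalues = np.linalg.eigvalsh(H)
print(eigenvalues, np.all(eigenvalues > 0))   # [2. 4.] True -> f is convex
```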
Saddle points, Hessian and long local furrows
• Some variables may have reached a local minimum while others have not
• Some weights may have almost zero gradient
• At least some eigenvalues may not be negative
Image source: Wikipedia


[Figure: a loss landscape with a labeled global minima(?), saddle point, local minima, and local maxima. Image source: https://www.cs.umd.edu/~tomg/projects/landscapes/]


Adding momentum in lieu of second derivative
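A minimal sketch of a momentum update; the coefficient β = 0.9 and the elongated toy quadratic loss are assumptions, and this is the classic heavy-ball form rather than any specific variant from the slides.

```python
import numpy as np

def grad(w):                      # gradient of an assumed elongated quadratic loss
    return np.array([0.2 * w[0], 20.0 * w[1]])

w, v = np.array([10.0, 1.0]), np.zeros(2)
lr, beta = 0.05, 0.9              # learning rate and momentum coefficient
for _ in range(200):
    v = beta * v + grad(w)        # accumulate a running average of past gradients
    w = w - lr * v                # step using the accumulated velocity

print(w)                          # approaches [0, 0] faster than plain GD along the flat direction
```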
