
Neural networks and Deep learning CSE(AI&ML)

UNIT-III

Introduction to Deep Learning, Historical Trends in Deep Learning, Deep Feedforward Networks, Gradient-Based Learning, Hidden Units, Architecture Design, Back-Propagation and Other Differentiation Algorithms

INTRODUCTION TO DEEP LEARNING:

Deep Learning is a subset of Machine Learning that uses mathematical functions to map the
input to the output. These functions can extract non-redundant information or patterns from
the data, which enables them to form a relationship between the input and the output. This is
known as learning, and the process of learning is called training.

In traditional computer programming, input and a set of rules are combined to produce the desired output. In machine learning and deep learning, input and output are used to derive the rules. These rules, when applied to new input, yield the desired results.

Modern deep learning models use artificial neural networks or simply neural networks to
extract information.

These neural networks are made up of simple mathematical functions that can be stacked on top of each other and arranged in layers, giving them a sense of depth, hence the term Deep Learning.

Deep learning can also be thought of as an approach to Artificial Intelligence, a smart combination of hardware and software to solve tasks requiring human intelligence.


Importance of Deep Learning


Deep learning algorithms play a crucial role in determining the features and can handle the large number of processes required for data that may be structured or unstructured. Deep learning algorithms can be overkill for some tasks, because they need access to huge amounts of data in order to function effectively. For example, ImageNet, the dataset behind many popular image-recognition models, contains about 14 million images and has defined a next-level benchmark for deep learning tools that work on image data.

Deep learning algorithms learn progressively about an image by passing it through each neural network layer. The early layers are sensitive to low-level features of the image, like edges and pixels, and subsequent layers combine this information into holistic representations by comparing it with previous data. For example, a middle layer might be tuned to detect parts of an object in the photograph, while deeper layers are tuned to detect whole objects like dogs, trees, utensils, etc.

However, for simple tasks that involve less complexity and limited data, deep learning algorithms fail to generalize well. This is one of the main reasons deep learning is not considered as effective as linear or boosted tree models, which are suited to churning out custom data, tracking fraudulent transactions, and dealing with less complex datasets with fewer features. There are also cases, such as multiclass classification on smaller but more structured datasets, where deep learning can be effective but is usually not preferred.


Historical Trends in Deep Learning


Deep learning has developed in three waves. The first wave started with cybernetics in the 1940s-1960s, with the development of theories of biological learning and implementations of the first models such as the perceptron, allowing the training of a single neuron. The second wave started with the connectionist approach of the 1980-1995 period, with back-propagation used to train a neural network with one or two hidden layers. The current, third wave, deep learning, started around 2006.


Deep Learning History Timeline


McCulloch Pitts Neuron - Beginning 1943

 Walter Pitts and Warren McCulloch propose a model of the biological neuron. This McCulloch-Pitts neuron could perform logical inferences based on simple thresholding functions.

1957 Frank Rosenblatt creates Perceptron

 Rosenblatt develops the perceptron, the first neural network machine, funded by the US government. It is based on the idea of binary classifiers and learning through trial and error, and is one of the earliest algorithms for supervised learning of binary classifiers.

The first Backpropagation Model 1960

 Henry J. Kelley shows the first derivation of a continuous backpropagation method, although it is not yet described in neural network terms. It is the foundation of the backpropagation algorithm that is widely used in ANNs in later years.

1962 Backpropagation with Chain Rule

 Stuart Dreyfus shows a simpler derivation of continuous backpropagation based only on the chain rule; feedforward neural models would later make use of it.

Birth of Multilayer Neural Network 1965

 Alexey Grigoryevich Ivakhnenko and Valentin Grigor'evich Lapa develop the first general, working learning algorithm for supervised deep feedforward multilayer perceptrons (MLPs), published in "Cybernetic Predicting Devices" in 1965.
 Ivakhnenko's networks are trained layer by layer with the Group Method of Data Handling (GMDH) algorithm.

1969 The Fall of Perceptron

 Marvin Minsky and Seymour Papert publish the book "Perceptrons", which proves that a single-layer perceptron cannot solve simple nonlinear problems such as the XOR problem. This leads to a decline in interest in neural network research and the first AI winter.

Neural Network goes Deep 1971

Alexey Grigoryevich Ivakhnenko continues his research in neural networks. He creates an 8-layer deep neural network using the Group Method of Data Handling (GMDH).


1980 Neocognitron - First CNN Architecture

 Kunihiko Fukushima introduces the neocognitron, the first convolutional architecture, which helps to create robust models for recognizing handwritten characters.

Hopfield Network - Early RNN 1982

 John Hopfield creates the Hopfield network, a recurrent neural network that can serve as an associative memory system with binary threshold nodes. This lays a foundation stone for recurrent networks and deep learning.

Boltzmann Machine 1985

 David H. Ackley, Geoffrey Hinton, and Terrence Sejnowski create the Boltzmann machine, a stochastic neural network with hidden layers and binary units.

1986 NetTalk - ANN Learns Speech

 Terrence Sejnowski creates NetTalk, a neural network which learns to pronounce English text by being trained on a corpus of text with corresponding phonetic transcriptions.

1989 CNN using Backpropagation

 Yann LeCun uses backpropagation to train convolutional neural networks (CNNs) to recognize handwritten digits. This is a breakthrough for deep learning and forms the foundation of modern computer vision.

The Milestone of LSTM 1997

Sepp Hochreiter and Jürgen Schmidhuber invent the LSTM (Long Short-Term Memory), a refinement of the recurrent neural network architecture. It will become a common architecture for deep learning in the years to come.


2012 AlexNet Starts Deep Learning Boom

AlexNet, a GPU-implemented CNN model designed by Alex Krizhevsky, wins ImageNet's image classification contest with an accuracy of 84%. It is a huge jump over the 75% accuracy that earlier models had achieved. This win triggers a new deep learning boom globally.

2016 AlphaGo beats human: DeepMind's deep reinforcement learning model AlphaGo defeats one of the best human players at the game of Go. The game is much more complex than chess, so this feat captures the imagination of everyone, proves the power of deep learning in complex board games, and brings deep learning to the forefront of AI research globally.


Deep Feedforward Networks

Example: Learning XOR


The XOR function ("exclusive or") is an operation on two binary values, x1 and x2. When exactly one of these binary values is equal to 1, the XOR function returns 1. Otherwise, it returns 0. The XOR function provides the target function y = f*(x) that we want to learn. Our model provides a function y = f(x; θ) and our learning algorithm will adapt the parameters θ to make f as similar as possible to f*.

We want our network to perform correctly on the four points X = {[0, 0], [0, 1],[1, 0], and [1, 1]}.
We will train the network on all four of these points. The only challenge is to fit the training set.

We can treat this problem as a regression problem and use a mean squared error loss
function. In practical applications, MSE is usually not an appropriate cost function for
modeling binary data.

Evaluated on our whole training set, the MSE loss function is

J(θ) = (1/4) Σ_{x∈X} (f*(x) − f(x; θ))²
Suppose that we choose a linear model, with θ consisting of w and b. Our model is defined to be

f(x; w, b) = xᵀw + b

We can minimize J(θ) in closed form with respect to w and b using the normal equations.

After solving the normal equations, we obtain w = 0 and b = 1/2. The linear model simply outputs 0.5 everywhere. Why does this happen? A linear model is not able to represent the XOR function. One way to solve this problem is to use a model that learns a different feature space in which a linear model is able to represent the solution.

Specifically, we will introduce a very simple feedforward network with one hidden layer containing two hidden units.


This feedforward network has a vector of hidden units h that are computed by a function f⁽¹⁾(x; W, c). The values of these hidden units are then used as the input for a second layer. The second layer is the output layer of the network. The output layer is still just a linear regression model, but now it is applied to h rather than to x. The network now contains two functions chained together:

h = f⁽¹⁾(x; W, c) and y = f⁽²⁾(h; w, b),

with the complete model being f(x; W, c, w, b) = f⁽²⁾(f⁽¹⁾(x)).


What function should f⁽¹⁾ compute? Linear models have served us well so far, and it may be tempting to make f⁽¹⁾ linear as well. Unfortunately, if f⁽¹⁾ were linear, then the feedforward network as a whole would remain a linear function of its input. Instead, we must use a nonlinear function to describe the features. Most neural networks do so using an affine transformation controlled by learned parameters, followed by a fixed nonlinear function called an activation function. We use that strategy here, by defining h = g(Wᵀx + c), where W provides the weights of a linear transformation and c the biases.

We describe an affine transformation from a vector x to a vector h, so an entire vector of bias parameters is needed. The activation function g is typically chosen to be a function that is applied element-wise, with hᵢ = g(xᵀW:,ᵢ + cᵢ). In modern neural networks, the default recommendation is to use the rectified linear unit or ReLU, defined by the activation function g(z) = max{0, z}.

We can now specify our complete network as

f(x; W, c, w, b) = wᵀ max{0, Wᵀx + c} + b

We can now specify a solution to the XOR problem. Let

W =
[1 1]
[1 1]

c =
[ 0]
[−1]

w =
[ 1]
[−2]

and b = 0.


We can now walk through the way that the model processes a batch of inputs. Let X be the design matrix containing all four points in the binary input space, with one example per row:

X =
[0 0]
[0 1]
[1 0]
[1 1]

The first step in the neural network is to multiply the input matrix by the first layer's weight matrix:

XW =
[0 0]
[1 1]
[1 1]
[2 2]

Next, we add the bias vector c (to each row), to obtain

[0 −1]
[1  0]
[1  0]
[2  1]

In this space, all of the examples lie along a line with slope 1. As we move along this line, the output needs to begin at 0, then rise to 1, then drop back down to 0. A linear model cannot implement such a function. To finish computing the value of h for each example, we apply the rectified linear transformation:

h =
[0 0]
[1 0]
[1 0]
[2 1]

This transformation has changed the relationship between the examples. They no longer lie on a single line. They now lie in a space where a linear model can solve the problem. We finish by multiplying by the weight vector w:

hw =
[0]
[1]
[1]
[0]

The neural network has obtained the correct answer for every example in the batch.
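The walk-through above can be verified numerically. Below is a minimal NumPy sketch (not part of the original text) that hard-codes the parameters W, c, w, and b given above and reproduces the batch computation:

```python
import numpy as np

# Design matrix: all four points in the binary input space, one per row.
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])

# Hand-specified solution from the text.
W = np.array([[1, 1],
              [1, 1]])
c = np.array([0, -1])
w = np.array([1, -2])
b = 0

XW = X @ W                    # multiply by the first layer's weights
h = np.maximum(0, XW + c)     # add biases, apply the rectified linear transformation
y_hat = h @ w + b             # multiply by the output weight vector

print(y_hat)                  # [0 1 1 0] -- XOR of each input row
```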


In this example, we simply specified the solution, then showed that it obtained zero error. In a
real situation, there might be billions of model parameters and billions of training examples,
so one cannot simply guess the solution as we did here. Instead, a gradient-based
optimization algorithm can find parameters that produce very little error.

Gradient-Based Learning

As with other machine learning models, to apply gradient-based learning we must choose a cost function, and we must choose how to represent the output of the model. The largest difference between simple ML models and neural networks is that the nonlinearity of a neural network causes most interesting loss functions to become non-convex. This means that neural networks are usually trained by using iterative, gradient-based optimizers that merely drive the cost function to a very low value, rather than the exact linear equation solvers used to train linear regression models or the convex optimization algorithms used for logistic regression or SVMs.

Cost Functions:

A cost function determines how well a machine learning model performs for a given dataset. It calculates the difference between the expected value and the predicted value and represents it as a single real number; the regression variants are sketched in code after the list below.

Types of Cost Function

1. Regression Cost Functions
o Mean Error
o Mean Squared Error
o Mean Absolute Error
2. Binary Classification Cost Functions
3. Multi-class Classification Cost Functions
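The regression cost functions above can be written in a few lines of NumPy. This is an illustrative sketch (the function names are our own, not from the original text):

```python
import numpy as np

def mean_error(y_true, y_pred):
    # Mean (signed) error: positive and negative errors can cancel out.
    return np.mean(y_true - y_pred)

def mean_squared_error(y_true, y_pred):
    # MSE: penalizes large errors heavily; a single real number.
    return np.mean((y_true - y_pred) ** 2)

def mean_absolute_error(y_true, y_pred):
    # MAE: average magnitude of the errors; more robust to outliers.
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.8, 3.5])
print(mean_squared_error(y_true, y_pred))   # 0.1
print(mean_absolute_error(y_true, y_pred))  # ~0.267
```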
In most cases, our parametric model defines a distribution p(y | x; θ) and we simply use the principle of maximum likelihood. This means we use the cross-entropy between the training data and the model's predictions as the cost function.

Sometimes, rather than predicting a complete probability distribution over y, we merely predict some statistic of y conditioned on x. Specialized loss functions allow us to train a predictor of these estimates.

The total cost function used to train a neural network will often combine one of the primary cost functions described here with a regularization term.


Learning Conditional Distributions with Maximum Likelihood


Most modern neural networks are trained using maximum likelihood. This means that the cost function is simply the negative log-likelihood, equivalently described as the cross-entropy between the training data and the model distribution. This cost function is given by:

J(θ) = −E_{x,y∼p̂_data} log p_model(y | x)

The specific form of the cost function changes from model to model, depending on the specific form of log p_model. An advantage of this approach of deriving the cost function from maximum likelihood is that it removes the burden of designing cost functions for each model. Specifying a model p(y | x) automatically determines a cost function log p(y | x).
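As a concrete sketch of this cost (illustrative, assuming a classifier that already outputs class probabilities p_model(y | x)), the negative log-likelihood averages −log of the probability the model assigns to each observed label:

```python
import numpy as np

def negative_log_likelihood(probs, targets):
    # probs:   (n_examples, n_classes) model probabilities p_model(y | x)
    # targets: (n_examples,) integer class labels from the training data
    picked = probs[np.arange(len(targets)), targets]  # p_model(y_i | x_i)
    return -np.mean(np.log(picked))                   # cross-entropy cost J(theta)

probs = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
targets = np.array([0, 1])
print(negative_log_likelihood(probs, targets))  # ~0.164: high-probability truths give low cost
```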
Learning Conditional Statistics:

In many cases, instead of learning a full probability distribution p(y | x; θ), we focus on learning a specific conditional statistic of y given x, such as predicting the mean using a predictor f(x; θ). A powerful neural network can represent a wide class of functions, constrained by properties like continuity and boundedness rather than by a specific parametric form.

Here, the cost function can be viewed as a functional (a mapping from functions to real numbers), and learning becomes a process of choosing a function, not just a set of parameters. We can design the cost functional so that its minimum corresponds to the function that maps x to the expected value of y given x. Solving such optimization problems involves the calculus of variations. While understanding this tool is not necessary here, it helps derive certain key results.

The first result derived using calculus of variations shows that solving the optimization problem

f* = argmin_f E_{x,y∼p_data} ||y − f(x)||²    .....(1)

leads to

f*(x) = E_{y∼p_data(y|x)}[y]    .....(2)

This means that minimizing the mean squared error (MSE) cost function yields a function that predicts the mean of y for each x, assuming we could train on infinite data from the true distribution. Different cost functions yield different statistics. A second result derived from the calculus of variations is that minimizing the mean absolute error (MAE) cost function

f* = argmin_f E_{x,y∼p_data} ||y − f(x)||₁    .....(3)

yields a function that predicts the median of y for each x.

However, both MSE and MAE can lead to poor results with gradient-based optimization, as certain output units may saturate and produce very small gradients. This is why the cross-entropy cost function is often preferred, even when estimating the full distribution p(y | x) is not necessary.
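The two results above can be checked empirically: for a constant prediction c, the MSE is minimized by the sample mean and the MAE by the sample median. A small illustrative sketch:

```python
import numpy as np

y = np.array([1.0, 2.0, 2.0, 3.0, 10.0])    # samples of y (note the outlier)
c = np.linspace(0.0, 10.0, 10001)           # candidate constant predictions

mse = np.mean((y[:, None] - c) ** 2, axis=0)
mae = np.mean(np.abs(y[:, None] - c), axis=0)

print(c[np.argmin(mse)], np.mean(y))    # 3.6 3.6 -> MSE minimizer is the mean
print(c[np.argmin(mae)], np.median(y))  # 2.0 2.0 -> MAE minimizer is the median
```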

Output Units
The choice of cost function is tightly coupled with the choice of output unit. Most of
the time, we simply use the cross-entropy between the data distribution and the model
distribution. The choice of how to represent the output then determines the form of the cross-
entropy function.
The feedforward network provides a set of hidden features defined by h = f(x; θ). The role of the output layer is then to provide some additional transformation from the features to complete the task that the network must perform.

Linear Units for Gaussian Output Distributions


One simple kind of output unit is based on an affine transformation with no nonlinearity. These are often just called linear units. Given features h, a layer of linear output units produces a vector ŷ = Wᵀh + b.

Linear output layers are often used to produce the mean of a conditional Gaussian distribution:

p(y | x) = N(y; ŷ, I)    .....(4)

Maximizing the log-likelihood is then equivalent to minimizing the mean squared error. The maximum likelihood framework makes it straightforward to learn the covariance of the Gaussian too, or to make the covariance of the Gaussian be a function of the input. However, the covariance must be constrained to be a positive definite matrix for all inputs. It is difficult to satisfy such constraints with a linear output layer, so typically other output units are used to parametrize the covariance. Because linear units do not saturate, they pose little difficulty for gradient-based optimization algorithms and may be used with a wide variety of optimization algorithms.

Sigmoid Units for Bernoulli Output Distributions


Many tasks require predicting the value of a binary variable y. Classification problems with two classes can be cast in this form. The maximum-likelihood approach is to define a Bernoulli distribution over y conditioned on x. A Bernoulli distribution is defined by just a single number. The neural net needs to predict only P(y = 1 | x). For this number to be a valid probability, it must lie in the interval [0, 1].

Satisfying this constraint requires some careful design effort. Suppose we were to use a linear unit, and threshold its value to obtain a valid probability:

P(y = 1 | x) = max{0, min{1, wᵀh + b}}
We omit the dependence on x for the moment to discuss how to define a probability distribution over y using the value z. The sigmoid can be motivated by constructing an unnormalized probability distribution P̃(y), which does not sum to 1. We can then divide by an appropriate constant to obtain a valid probability distribution. If we begin with the assumption that the unnormalized log probabilities are linear in y and z, we can exponentiate to obtain the unnormalized probabilities. We then normalize to see that this yields a Bernoulli distribution controlled by a sigmoidal transformation of z:

log P̃(y) = yz
P̃(y) = exp(yz)
P(y) = exp(yz) / Σ_{y′=0}^{1} exp(y′z)
P(y) = σ((2y − 1)z)
When we use other loss functions, such as mean squared error, the loss can saturate
anytime σ(z) saturates. The sigmoid activation function saturates to 0 when z becomes very
negative and saturates to 1 when z becomes very positive. The gradient can shrink too small
to be useful for learning whenever this happens, whether the model has the correct answer or
the incorrect answer. For this reason, maximum likelihood is almost always the preferred
approach to training sigmoid output units.
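A minimal sketch of this in code (illustrative; the softplus form of the loss is one standard numerically stable way to implement maximum likelihood for a sigmoid output):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) saturates to 0 for very negative z and to 1 for very positive z.
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli_nll(z, y):
    # Negative log-likelihood -log P(y | z) with P(y = 1) = sigma(z).
    # Written as softplus((1 - 2y) z): it saturates only when the model
    # already has the right answer, so learning gradients stay useful.
    return np.logaddexp(0.0, (1 - 2 * y) * z)

print(sigmoid(0.0))             # 0.5
print(bernoulli_nll(5.0, 1))    # ~0.0067: confident and correct -> tiny loss
print(bernoulli_nll(5.0, 0))    # ~5.0067: confident and wrong  -> large loss
```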

Softmax Units for Multinoulli Output Distributions


Any time we wish to represent a probability distribution over a discrete variable with n possible values, we may use the softmax function. Softmax functions are most often used as the output of a classifier, to represent the probability distribution over n different classes.

In the case of binary variables, we wished to produce a single number P(y = 1 | x). Because this number needed to lie between 0 and 1, and because we wanted the logarithm of the number to be well-behaved for gradient-based optimization of the log-likelihood, we chose instead to predict a number z = log P̃(y = 1 | x). To generalize to the case of a discrete variable with n values, we now need to produce a vector ŷ, with ŷᵢ = P(y = i | x). We require not only that each element ŷᵢ be between 0 and 1, but also that the entire vector sums to 1 so that it represents a valid probability distribution.

A linear layer predicts unnormalized log probabilities:

z = Wᵀh + b

where zᵢ = log P̃(y = i | x). The softmax function can then exponentiate and normalize z to obtain the desired ŷ. Formally, the softmax function is given by

softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)

As with the logistic sigmoid, the use of the exp function works very well when training the
softmax to output a target value y using maximum log-likelihood.
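A direct implementation of the softmax formula can overflow when some zᵢ is large; subtracting max(z) first leaves the result unchanged and is the standard stable form. A minimal illustrative sketch:

```python
import numpy as np

def softmax(z):
    # softmax(z)_i = exp(z_i) / sum_j exp(z_j).
    # Subtracting max(z) changes nothing mathematically but prevents overflow.
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

z = np.array([2.0, 1.0, 0.1])   # unnormalized log probabilities z = W^T h + b
p = softmax(z)
print(p)          # approximately [0.659 0.242 0.099]
print(p.sum())    # 1.0 -- a valid probability distribution over n = 3 classes
```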


Hidden Units

The design of hidden units is an extremely active area of research and does not yet have many definitive guiding theoretical principles. Rectified linear units are an excellent default choice of hidden unit. It is usually impossible to predict in advance which type will work best. The design process consists of trial and error: intuiting that a kind of hidden unit may work well, and then evaluating its performance on a validation set.

Some hidden units are not differentiable at all input points. For example, the rectified linear function g(z) = max{0, z} is not differentiable at z = 0. This may seem like it invalidates g for use with a gradient-based learning algorithm. In practice, gradient descent still performs well enough for these models to be used for machine learning tasks.

Most hidden units can be described as accepting a vector of inputs x, computing an affine transformation z = Wᵀx + b, and then applying an element-wise nonlinear function g(z). Most hidden units are distinguished from each other only by the choice of the form of the activation function g(z).

Rectified Linear Units and Their Generalizations

Rectified linear units use the activation function g(z) = max{0, z}.

Rectified linear units are easy to optimize due to their similarity to linear units:

 The only difference from linear units is that they output 0 across half their domain.

 The derivative is 1 everywhere that the unit is active.

 Thus the gradient direction is far more useful for learning than it is with activation functions that introduce second-order effects.

Rectified linear units are typically used on top of an affine transformation: h = g(Wᵀx + b). It is good practice to set all elements of b to a small value such as 0.1. This makes it likely that the ReLU will be initially active for most training samples and allow derivatives to pass through.

ReLU vs other activations:


 Sigmoid and tanh activation functions cannot be used in networks with many layers due to the vanishing gradient problem.

 ReLU overcomes the vanishing gradient problem, allowing models to learn faster and perform better.

 ReLU is the default activation function for MLPs and CNNs.

One drawback of rectified linear units is that they cannot learn via gradient-based methods on examples for which their activation is zero.

Three generalizations of rectified linear units are based on using a nonzero slope αᵢ when zᵢ < 0:

hᵢ = g(z, α)ᵢ = max(0, zᵢ) + αᵢ min(0, zᵢ)

1. Absolute value rectification fixes αᵢ = −1 to obtain g(z) = |z|. It is used for object recognition from images.

2. A leaky ReLU fixes αᵢ to a small value like 0.01.

3. A parametric ReLU treats αᵢ as a learnable parameter.

These variants are compared in the sketch below.
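A minimal illustrative sketch of the shared formula, with α selecting the variant:

```python
import numpy as np

def generalized_relu(z, alpha):
    # h_i = max(0, z_i) + alpha_i * min(0, z_i): slope alpha for negative inputs.
    return np.maximum(0, z) + alpha * np.minimum(0, z)

z = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])

print(generalized_relu(z, 0.0))    # plain ReLU:                 [ 0.  0.  0. 1. 3.]
print(generalized_relu(z, -1.0))   # absolute value, g(z) = |z|: [ 2.  0.5 0. 1. 3.]
print(generalized_relu(z, 0.01))   # leaky ReLU:                 [-0.02 -0.005 0. 1. 3.]
```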

Logistic Sigmoid and Hyperbolic Tangent

Most neural networks used the logistic sigmoid activation function prior to rectified linear units:

g(z) = σ(z)

or the hyperbolic tangent activation function:

g(z) = tanh(z)

These activation functions are closely related because tanh(z) = 2σ(2z) − 1.
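A quick numerical check of this identity (illustrative):

```python
import numpy as np

z = np.linspace(-3.0, 3.0, 7)
sigma = lambda t: 1.0 / (1.0 + np.exp(-t))

# tanh(z) and 2*sigma(2z) - 1 agree to floating-point precision.
print(np.allclose(np.tanh(z), 2 * sigma(2 * z) - 1))   # True
```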

We have already seen sigmoid units as output units, used to predict the probability that a
binary variable is 1.

Sigmoidal units saturate across most of their domain:

 They saturate to 1 when z is very positive and to 0 when z is very negative.

 They are strongly sensitive to their input only when z is near 0.

 Saturation makes gradient-based learning difficult.


Hyperbolic tangent typically performs better than the logistic sigmoid because it resembles the identity function more closely. Because tanh is similar to the identity function near 0, training a deep neural network resembles training a linear model, so long as the activations of the network can be kept small.

Architecture Design

The word architecture refers to the overall structure of the network: how many units it should
have and how these units should be connected to each other.

Most neural networks are organized into groups of units called layers. Most neural network architectures arrange these layers in a chain structure, with each layer being a function of the layer that preceded it. In this structure, the first layer is given by

h⁽¹⁾ = g⁽¹⁾(W⁽¹⁾ᵀx + b⁽¹⁾)

the second layer is given by

h⁽²⁾ = g⁽²⁾(W⁽²⁾ᵀh⁽¹⁾ + b⁽²⁾)

and so on. In these chain-based architectures, the main architectural considerations are to choose the depth of the network and the width of each layer.
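The chain structure is just function composition. Below is a minimal illustrative sketch (the layer widths are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(h_prev, W, b, g):
    # One chain link: affine transformation followed by activation g.
    return g(W.T @ h_prev + b)

relu = lambda z: np.maximum(0, z)
identity = lambda z: z

x = rng.standard_normal(4)                          # input of width 4
W1, b1 = rng.standard_normal((4, 8)), np.zeros(8)   # first layer, width 8
W2, b2 = rng.standard_normal((8, 8)), np.zeros(8)   # second layer, width 8
W3, b3 = rng.standard_normal((8, 1)), np.zeros(1)   # output layer, width 1

h1 = layer(x, W1, b1, relu)          # h(1) = g(1)(W(1)^T x + b(1))
h2 = layer(h1, W2, b2, relu)         # h(2) = g(2)(W(2)^T h(1) + b(2))
y_hat = layer(h2, W3, b3, identity)  # linear output layer

print(y_hat.shape)   # (1,) -- depth 3, widths 8, 8, 1
```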


Universal Approximation Properties and Depth

The universal approximation theorem (Hornik et al., 1989; Cybenko, 1989) states that a feedforward network with a linear output layer and at least one hidden layer with any "squashing" activation function (such as the logistic sigmoid activation function) can approximate any Borel measurable function from one finite-dimensional space to another with any desired nonzero amount of error, provided that the network is given enough hidden units.

A feed-forward network with a single hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of ℝⁿ.

Simple neural networks can represent a wide variety of interesting functions when given
appropriate parameters

 However, it does not touch upon the algorithmic learnability of those parameters


The universal approximation theorem means that regardless of what function we are trying to learn, we know that a large multilayer perceptron (MLP) will be able to represent this function. However, we are not guaranteed that the training algorithm will be able to learn that function. Even if the MLP is able to represent the function, learning can fail for two different reasons.

1. The optimization algorithm may not be able to find the value of the parameters that corresponds to the desired function.

2. The training algorithm might choose the wrong function due to overfitting.

The universal approximation theorem says that there exists a network large enough to achieve any degree of accuracy we desire, but it does not say how large this network will be. Barron (1993) provides some bounds on the size of a single-layer network needed to approximate a broad class of functions. Unfortunately, in the worst case, an exponential number of hidden units may be required. This is easiest to see in the binary case: the number of possible binary functions on vectors v ∈ {0, 1}ⁿ is 2^(2ⁿ), and selecting one such function requires 2ⁿ bits, which will in general require O(2ⁿ) degrees of freedom.


A feedforward network with a single layer is sufficient to represent any function, but the layer may be infeasibly large and may fail to generalize correctly. Using deeper models can reduce the number of units required and reduce generalization error.

 The following observations summarize a graph (not reproduced here) showing the effect of the number of parameters:


 Deeper models perform better than wider models with more parameters because they
can capture more complex hierarchical functions.
 Shallow models with many parameters tend to overfit quickly (around 20 million
parameters), while deeper models benefit from larger parameter counts (over 60
million).
 Deep models learn representations by composing simpler functions, such as detecting
edges, corners, and then objects.
 Depth expresses a preference for learning complex functions as compositions of
simpler ones, improving generalization over shallow models.

This explains why adding depth is more effective than merely increasing the number of
parameters in neural networks.

Back-Propagation and Other Differentiation Algorithms

 Forward Propagation: In a feedforward neural network, input x flows through the layers to produce output ŷ. This process is called forward propagation. During training, forward propagation computes the cost J(θ).

 Backpropagation: Introduced by Rumelhart et al. (1986), backpropagation computes the gradient of the cost function, ∇_θ J(θ). It sends information backward through the network to update weights.

 Backprop Misconceptions:

 It is not the entire learning algorithm but just a method for computing gradients.
 It is not specific to multi-layer networks but can compute derivatives of any function.

 Gradient Calculation: Backprop computes the gradient ∇ₓ f(x, y) for arbitrary functions, often for optimizing the cost function with respect to model parameters. It can also compute the Jacobian for functions with multiple outputs.

 Computational Graphs:

 A computational graph formalizes the operations in a network. Each node represents a variable (scalar, vector, matrix, etc.).
 Operations are functions that act on variables, combining them to form more complex functions. The graph's edges represent these operations.

 General Use: Backpropagation extends beyond neural networks, being useful for
computing various derivatives in machine learning tasks.


 Backpropagation: Backprop computes derivatives by multiplying Jacobians by gradients for each operation in a computational graph. This is efficient for deep networks.

 Tensors and Backpropagation: Backprop can be generalized to tensors (multi-dimensional arrays), where the process remains the same as for vectors. Each tensor's gradient is computed by multiplying Jacobians and gradients, and the result is reshaped back into tensor form.

 Chain Rule with Tensors: For tensor-valued functions, if Y = g(X) and z = f(Y), the chain rule becomes:

∇_X z = Σⱼ (∇_X Yⱼ) ∂z/∂Yⱼ

This computes the gradient of z with respect to each element of tensor X.
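To tie the section together, here is a minimal illustrative sketch of forward propagation followed by backpropagation through the small ReLU network used for XOR earlier in this unit, with every gradient obtained by the chain rule (a hand-written example, not a general computational-graph implementation):

```python
import numpy as np

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 0.])

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 2))   # first-layer weights
c = np.full(2, 0.1)               # small positive biases keep ReLUs initially active
w = rng.standard_normal(2)        # output weights
b = 0.0

for step in range(5000):
    # Forward propagation: compute the cost J(theta).
    z = X @ W + c                   # pre-activations
    h = np.maximum(0, z)            # ReLU hidden units
    y_hat = h @ w + b               # linear output
    J = np.mean((y_hat - y) ** 2)   # MSE cost

    # Back-propagation: multiply each local Jacobian by the upstream gradient.
    dy = 2 * (y_hat - y) / len(y)   # dJ/dy_hat
    dw = h.T @ dy                   # dJ/dw
    db = dy.sum()                   # dJ/db
    dh = np.outer(dy, w)            # dJ/dh
    dz = dh * (z > 0)               # through ReLU: derivative is 1 where active
    dW = X.T @ dz                   # dJ/dW
    dc = dz.sum(axis=0)             # dJ/dc

    # Gradient descent step.
    lr = 0.1
    W -= lr * dW; c -= lr * dc; w -= lr * dw; b -= lr * db

print(np.round(y_hat, 2))   # typically close to [0 1 1 0]; tiny ReLU nets can occasionally get stuck
```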

