
Neural networks and Deep learning CSE(AI&ML)

UNIT-III

Introduction to Deep Learning, Historical Trends in Deep Learning, Deep Feedforward Networks, Gradient-Based Learning, Hidden Units, Architecture Design, Back-Propagation and Other Differentiation Algorithms

INTRODUCTION TO DEEP LEARNING:

Deep Learning is a subset of Machine Learning that uses mathematical functions to map the
input to the output. These functions can extract non-redundant information or patterns from
the data, which enables them to form a relationship between the input and the output. This is
known as learning, and the process of learning is called training.

In traditional computer programming, input and a set of rules are combined to produce the desired output. In machine learning and deep learning, input and output are used to derive the rules. These rules, when applied to new input, yield the desired results.

Modern deep learning models use artificial neural networks or simply neural networks to
extract information.

These neural networks are made up of simple mathematical functions that can be stacked on top of each other and arranged in layers, giving them a sense of depth, hence the term Deep Learning.

Deep learning can also be thought of as an approach to Artificial Intelligence, a smart combination of hardware and software to solve tasks requiring human intelligence.


Importance of Deep Learning


Deep learning algorithms play a crucial role in determining the features and can handle the large number of processes required for data that may be structured or unstructured. Deep learning algorithms can be overkill for some tasks, because they need access to huge amounts of data in order to function effectively. For example, ImageNet, the dataset behind many popular image-recognition models, contains about 14 million images and has defined a next-level benchmark for deep learning tools that work on image data.

Deep learning algorithms learn progressively about an image by passing it through each neural network layer. The early layers are sensitive to low-level features of the image, like edges and pixels, and subsequent layers combine this information into holistic representations by comparing it with previous data. For example, a middle layer might be tuned to detect parts of an object in the photograph, while deeper layers are tuned to detect whole objects like dogs, trees, utensils, etc.

However, for simple tasks that involve less complexity and limited data, deep learning algorithms fail to generalize well. This is one of the main reasons deep learning is not considered as effective as linear or boosted tree models, which are suited to churning out custom data, tracking fraudulent transactions, and dealing with less complex datasets with fewer features. There are also cases, such as multiclass classification on smaller but more structured datasets, where deep learning can be effective but is usually not preferred.


Historical Trends in Deep Learning


Deep learning has developed in three waves. The first wave started with cybernetics in the 1940s-1960s, with the development of theories of biological learning and implementations of the first models such as the perceptron, allowing the training of a single neuron. The second wave started with the connectionist approach of the 1980-1995 period, with back-propagation used to train a neural network with one or two hidden layers. The current, third wave, deep learning, started around 2006.


Deep Learning History Timeline


McCulloch Pitts Neuron - Beginning 1943

 Walter Pitts and Warren McCulloch propose a model of the biological neuron. This McCulloch-Pitts neuron could perform logical inferences based on simple thresholding functions.

1957 Frank Rosenblatt creates Perceptron

 Rosenblatt develops the perceptron, the first neural network machine, funded by the US government. It is based on the idea of binary classifiers and learning through trial and error, and is one of the earliest algorithms for supervised learning of binary classifiers.

The first Backpropagation Model 1960

 Henry J. Kelley shows the first derivation of a continuous backpropagation method, although it is not yet described in neural network terms. It is the foundation of the backpropagation algorithm that is widely used in ANNs in later years.

1962 Backpropagation with Chain Rule

 Stuart Dreyfus shows a simpler derivation of continuous backpropagation based only on the chain rule; feedforward neural models would later make use of it.

Birth of Multilayer Neural Network 1965

 Alexey Grigoryevich Ivakhnenko and Valentin Grigor'evich Lapa develop the first general, working learning algorithm for supervised deep feedforward multilayer perceptrons (MLPs), published in "Cybernetic Predicting Devices" in 1965.
 Ivakhnenko's networks are trained layer by layer with the Group Method of Data Handling (GMDH) algorithm.

1969 The Fall of Perceptron

 Marvin Minsky and Seymour Papert publish the book "Perceptrons", which proves that a single-layer perceptron cannot solve simple nonlinear problems such as the XOR problem. This leads to a decline in interest in neural network research and the first AI winter.

Neural Network goes Deep 1971

Alexey Grigoryevich Ivakhnenko continues his research in neural networks. He creates an 8-layer deep neural network using the Group Method of Data Handling (GMDH).


1980 Neocognitron - First CNN Architecture

 Kunihiko Fukushima introduces the neocognitron, the first convolutional architecture, which helps to create robust models for recognizing handwritten characters.

Hopfield Network - Early RNN 1982

 John Hopfield creates the Hopfield network, a recurrent neural network that can serve as an associative memory system with binary threshold nodes. This lays a foundation stone for recurrent networks and deep learning.

Boltzmann Machine 1985

 David H. Ackley, Geoffrey Hinton, and Terrence Sejnowski create the Boltzmann machine, a stochastic neural network with hidden layers and binary units.

1986 NetTalk - ANN Learns Speech

 Terrence Sejnowski creates NetTalk, a neural network which learns to pronounce English text by being trained on a corpus of text with corresponding phonetic transcriptions.

1989 CNN using Backpropagation

 Yann LeCun uses backpropagation to train convolutional neural networks (CNNs) to recognize handwritten digits. This is a breakthrough for deep learning and forms the foundation of modern computer vision.

The Milestone of LSTM 1997

Sepp Hochreiter and Jürgen Schmidhuber invent the LSTM (Long Short-Term Memory), a refinement of the recurrent neural network architecture. It will become a common architecture for deep learning in the years to come.


2012 AlexNet Starts Deep Learning Boom

AlexNet, a GPU-implemented CNN model designed by Alex Krizhevsky, wins ImageNet's image classification contest with an accuracy of 84%. It is a huge jump over the 75% accuracy that earlier models had achieved. This win triggers a new deep learning boom globally.

2016 AlphaGo beats human: DeepMind's deep reinforcement learning model AlphaGo defeats one of the best human players at the game of Go. The game is much more complex than chess, so this feat captures the imagination of everyone, proves the power of deep learning in complex board games, and brings deep learning to the forefront of AI research globally.


Deep Feedforward Networks

Example: Learning XOR


The XOR function ("exclusive or") is an operation on two binary values, x1 and x2. When exactly one of these binary values is equal to 1, the XOR function returns 1. Otherwise, it returns 0. The XOR function provides the target function y = f*(x) that we want to learn. Our model provides a function y = f(x; θ) and our learning algorithm will adapt the parameters θ to make f as similar as possible to f*.

We want our network to perform correctly on the four points X = {[0, 0], [0, 1],[1, 0], and [1, 1]}.
We will train the network on all four of these points. The only challenge is to fit the training set.

We can treat this problem as a regression problem and use a mean squared error loss
function. In practical applications, MSE is usually not an appropriate cost function for
modeling binary data.

Evaluated on our whole training set, the MSE loss function is

J(θ) = (1/4) Σ_{x∈X} (f*(x) − f(x; θ))²
Suppose that we choose a linear model, with θ consisting of w and b. Our model is defined to be

f(x; w, b) = xᵀw + b

We can minimize J(θ) in closed form with respect to w and b using the normal equations.

After solving the normal equations, we obtain w = 0 and b = 1/2. The linear model simply outputs 0.5 everywhere. Why does this happen? A linear model is not able to represent the XOR function. One way to solve this problem is to use a model that learns a different feature space in which a linear model is able to represent the solution.

Specifically, we will introduce a very simple feedforward network with one hidden layer containing two hidden units.


This feedforward network has a vector of hidden units h that are computed by a function f⁽¹⁾(x; W, c). The values of these hidden units are then used as the input for a second layer. The second layer is the output layer of the network. The output layer is still just a linear regression model, but now it is applied to h rather than to x. The network now contains two functions chained together:

h = f⁽¹⁾(x; W, c) and y = f⁽²⁾(h; w, b),

with the complete model being f(x; W, c, w, b) = f⁽²⁾(f⁽¹⁾(x)).


What function should f⁽¹⁾ compute? Linear models have served us well so far, and it may be tempting to make f⁽¹⁾ linear as well. Unfortunately, if f⁽¹⁾ were linear, then the feedforward network as a whole would remain a linear function of its input. Instead, we must use a nonlinear function to describe the features. Most neural networks do so using an affine transformation controlled by learned parameters, followed by a fixed nonlinear function called an activation function. We use that strategy here, by defining h = g(Wᵀx + c), where W provides the weights of a linear transformation and c the biases.

We describe an affine transformation from a vector x to a vector h, so an entire vector of bias parameters is needed. The activation function g is typically chosen to be a function that is applied element-wise, with hᵢ = g(xᵀW:,ᵢ + cᵢ). In modern neural networks, the default recommendation is to use the rectified linear unit or ReLU, defined by the activation function g(z) = max{0, z}.

We can now specify our complete network as

f(x; W, c, w, b) = wᵀ max{0, Wᵀx + c} + b

We can now specify a solution to the XOR problem. Let

W =
[1 1]
[1 1]

c =
[ 0]
[−1]

w =
[ 1]
[−2]

and b = 0.


We can now walk through the way that the model processes a batch of inputs. Let X be the design matrix containing all four points in the binary input space, with one example per row:

X =
[0 0]
[0 1]
[1 0]
[1 1]

The first step in the neural network is to multiply the input matrix by the first layer's weight matrix:

XW =
[0 0]
[1 1]
[1 1]
[2 2]

Next, we add the bias vector c (to each row), to obtain

[0 −1]
[1  0]
[1  0]
[2  1]

In this space, all of the examples lie along a line with slope 1. As we move along this line, the output needs to begin at 0, then rise to 1, then drop back down to 0. A linear model cannot implement such a function. To finish computing the value of h for each example, we apply the rectified linear transformation:

h =
[0 0]
[1 0]
[1 0]
[2 1]

This transformation has changed the relationship between the examples. They no longer lie on a single line. They now lie in a space where a linear model can solve the problem. We finish by multiplying by the weight vector w:

hw =
[0]
[1]
[1]
[0]

The neural network has obtained the correct answer for every example in the batch.
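The walk-through above can be verified numerically. Below is a minimal NumPy sketch (not part of the original text) that hard-codes the parameters W, c, w, and b given above and reproduces the batch computation:

```python
import numpy as np

# Design matrix: all four points in the binary input space, one per row.
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])

# Hand-specified solution from the text.
W = np.array([[1, 1],
              [1, 1]])
c = np.array([0, -1])
w = np.array([1, -2])
b = 0

XW = X @ W                    # multiply by the first layer's weights
h = np.maximum(0, XW + c)     # add biases, apply the rectified linear transformation
y_hat = h @ w + b             # multiply by the output weight vector

print(y_hat)                  # [0 1 1 0] -- XOR of each input row
```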


In this example, we simply specified the solution, then showed that it obtained zero error. In a
real situation, there might be billions of model parameters and billions of training examples,
so one cannot simply guess the solution as we did here. Instead, a gradient-based
optimization algorithm can find parameters that produce very little error.

Gradient-Based Learning

As with other machine learning models, to apply gradient-based learning we must choose a cost function, and we must choose how to represent the output of the model. The largest difference between simple ML models and neural networks is that the nonlinearity of a neural network causes most interesting loss functions to become non-convex. This means that neural networks are usually trained by using iterative, gradient-based optimizers that merely drive the cost function to a very low value, rather than the exact linear equation solvers used to train linear regression models or the convex optimization algorithms used for logistic regression or SVMs.

Cost Functions:

A cost function determines how well a machine learning model performs for a given dataset. It calculates the difference between the expected value and the predicted value and represents it as a single real number; the regression variants are sketched in code after the list below.

Types of Cost Function

1. Regression Cost Functions
o Mean Error
o Mean Squared Error
o Mean Absolute Error
2. Binary Classification Cost Functions
3. Multi-class Classification Cost Functions
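The regression cost functions above can be written in a few lines of NumPy. This is an illustrative sketch (the function names are our own, not from the original text):

```python
import numpy as np

def mean_error(y_true, y_pred):
    # Mean (signed) error: positive and negative errors can cancel out.
    return np.mean(y_true - y_pred)

def mean_squared_error(y_true, y_pred):
    # MSE: penalizes large errors heavily; a single real number.
    return np.mean((y_true - y_pred) ** 2)

def mean_absolute_error(y_true, y_pred):
    # MAE: average magnitude of the errors; more robust to outliers.
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.8, 3.5])
print(mean_squared_error(y_true, y_pred))   # 0.1
print(mean_absolute_error(y_true, y_pred))  # ~0.267
```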
In most cases, our parametric model defines a distribution p(y | x; θ) and we simply use the principle of maximum likelihood. This means we use the cross-entropy between the training data and the model's predictions as the cost function.

Sometimes, rather than predicting a complete probability distribution over y, we merely predict some statistic of y conditioned on x. Specialized loss functions allow us to train a predictor of these estimates.

The total cost function used to train a neural network will often combine one of the primary cost functions described here with a regularization term.


Learning Conditional Distributions with Maximum Likelihood


Most modern neural networks are trained using maximum likelihood. This means that the cost function is simply the negative log-likelihood, equivalently described as the cross-entropy between the training data and the model distribution. This cost function is given by:

J(θ) = −E_{x,y∼p̂_data} log p_model(y | x)

The specific form of the cost function changes from model to model, depending on the specific form of log p_model. An advantage of this approach of deriving the cost function from maximum likelihood is that it removes the burden of designing cost functions for each model. Specifying a model p(y | x) automatically determines a cost function log p(y | x).
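As a concrete sketch of this cost (illustrative, assuming a classifier that already outputs class probabilities p_model(y | x)), the negative log-likelihood averages −log of the probability the model assigns to each observed label:

```python
import numpy as np

def negative_log_likelihood(probs, targets):
    # probs:   (n_examples, n_classes) model probabilities p_model(y | x)
    # targets: (n_examples,) integer class labels from the training data
    picked = probs[np.arange(len(targets)), targets]  # p_model(y_i | x_i)
    return -np.mean(np.log(picked))                   # cross-entropy cost J(theta)

probs = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
targets = np.array([0, 1])
print(negative_log_likelihood(probs, targets))  # ~0.164: high-probability truths give low cost
```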
Learning Conditional Statistics:

In many cases, instead of learning a full probability distribution p(y | x; θ), we focus on learning a specific conditional statistic of y given x, such as predicting the mean using a predictor f(x; θ). A powerful neural network can represent a wide class of functions, constrained by properties like continuity and boundedness rather than by a specific parametric form.

Here, the cost function can be viewed as a functional (a mapping from functions to real numbers), and learning becomes a process of choosing a function, not just a set of parameters. We can design the cost functional so that its minimum corresponds to the function that maps x to the expected value of y given x. Solving such optimization problems involves the calculus of variations. While understanding this tool is not necessary here, it helps derive certain key results.

The first result derived using calculus of variations shows that solving the optimization problem

f* = argmin_f E_{x,y∼p_data} ||y − f(x)||²    .....(1)

leads to

f*(x) = E_{y∼p_data(y|x)}[y]    .....(2)

This means that minimizing the mean squared error (MSE) cost function yields a function that predicts the mean of y for each x, assuming we could train on infinite data from the true distribution. Different cost functions yield different statistics. A second result derived from the calculus of variations is that minimizing the mean absolute error (MAE) cost function

f* = argmin_f E_{x,y∼p_data} ||y − f(x)||₁    .....(3)

yields a function that predicts the median of y for each x.

However, both MSE and MAE can lead to poor results with gradient-based optimization, as certain output units may saturate and produce very small gradients. This is why the cross-entropy cost function is often preferred, even when estimating the full distribution p(y | x) is not necessary.
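The two results above can be checked empirically: for a constant prediction c, the MSE is minimized by the sample mean and the MAE by the sample median. A small illustrative sketch:

```python
import numpy as np

y = np.array([1.0, 2.0, 2.0, 3.0, 10.0])    # samples of y (note the outlier)
c = np.linspace(0.0, 10.0, 10001)           # candidate constant predictions

mse = np.mean((y[:, None] - c) ** 2, axis=0)
mae = np.mean(np.abs(y[:, None] - c), axis=0)

print(c[np.argmin(mse)], np.mean(y))    # 3.6 3.6 -> MSE minimizer is the mean
print(c[np.argmin(mae)], np.median(y))  # 2.0 2.0 -> MAE minimizer is the median
```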

Output Units
The choice of cost function is tightly coupled with the choice of output unit. Most of
the time, we simply use the cross-entropy between the data distribution and the model
distribution. The choice of how to represent the output then determines the form of the cross-
entropy function.
The feedforward network provides a set of hidden features defined by h = f(x; θ). The role of the output layer is then to provide some additional transformation from the features to complete the task that the network must perform.

Linear Units for Gaussian Output Distributions


One simple kind of output unit is based on an affine transformation with no nonlinearity. These are often just called linear units. Given features h, a layer of linear output units produces a vector ŷ = Wᵀh + b.

Linear output layers are often used to produce the mean of a conditional Gaussian distribution:

p(y | x) = N(y; ŷ, I)    .....(4)

Maximizing the log-likelihood is then equivalent to minimizing the mean squared error. The maximum likelihood framework makes it straightforward to learn the covariance of the Gaussian too, or to make the covariance of the Gaussian be a function of the input. However, the covariance must be constrained to be a positive definite matrix for all inputs. It is difficult to satisfy such constraints with a linear output layer, so typically other output units are used to parametrize the covariance. Because linear units do not saturate, they pose little difficulty for gradient-based optimization algorithms and may be used with a wide variety of optimization algorithms.

Sigmoid Units for Bernoulli Output Distributions


Many tasks require predicting the value of a binary variable y. Classification problems with two classes can be cast in this form. The maximum-likelihood approach is to define a Bernoulli distribution over y conditioned on x. A Bernoulli distribution is defined by just a single number. The neural net needs to predict only P(y = 1 | x). For this number to be a valid probability, it must lie in the interval [0, 1].

Satisfying this constraint requires some careful design effort. Suppose we were to use a linear unit, and threshold its value to obtain a valid probability:

P(y = 1 | x) = max{0, min{1, wᵀh + b}}
We omit the dependence on x for the moment to discuss how to define a probability distribution over y using the value z. The sigmoid can be motivated by constructing an unnormalized probability distribution P̃(y), which does not sum to 1. We can then divide by an appropriate constant to obtain a valid probability distribution. If we begin with the assumption that the unnormalized log probabilities are linear in y and z, we can exponentiate to obtain the unnormalized probabilities. We then normalize to see that this yields a Bernoulli distribution controlled by a sigmoidal transformation of z:

log P̃(y) = yz
P̃(y) = exp(yz)
P(y) = exp(yz) / Σ_{y′=0}^{1} exp(y′z)
P(y) = σ((2y − 1)z)
When we use other loss functions, such as mean squared error, the loss can saturate
anytime σ(z) saturates. The sigmoid activation function saturates to 0 when z becomes very
negative and saturates to 1 when z becomes very positive. The gradient can shrink too small
to be useful for learning whenever this happens, whether the model has the correct answer or
the incorrect answer. For this reason, maximum likelihood is almost always the preferred
approach to training sigmoid output units.
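A minimal sketch of this in code (illustrative; the softplus form of the loss is one standard numerically stable way to implement maximum likelihood for a sigmoid output):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) saturates to 0 for very negative z and to 1 for very positive z.
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli_nll(z, y):
    # Negative log-likelihood -log P(y | z) with P(y = 1) = sigma(z).
    # Written as softplus((1 - 2y) z): it saturates only when the model
    # already has the right answer, so learning gradients stay useful.
    return np.logaddexp(0.0, (1 - 2 * y) * z)

print(sigmoid(0.0))             # 0.5
print(bernoulli_nll(5.0, 1))    # ~0.0067: confident and correct -> tiny loss
print(bernoulli_nll(5.0, 0))    # ~5.0067: confident and wrong  -> large loss
```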

Softmax Units for Multinoulli Output Distributions


Any time we wish to represent a probability distribution over a discrete variable with n possible values, we may use the softmax function. Softmax functions are most often used as the output of a classifier, to represent the probability distribution over n different classes.

In the case of binary variables, we wished to produce a single number P(y = 1 | x). Because this number needed to lie between 0 and 1, and because we wanted the logarithm of the number to be well-behaved for gradient-based optimization of the log-likelihood, we chose instead to predict a number z = log P̃(y = 1 | x). To generalize to the case of a discrete variable with n values, we now need to produce a vector ŷ, with ŷᵢ = P(y = i | x). We require not only that each element ŷᵢ be between 0 and 1, but also that the entire vector sums to 1 so that it represents a valid probability distribution.

A linear layer predicts unnormalized log probabilities:

z = Wᵀh + b

where zᵢ = log P̃(y = i | x). The softmax function can then exponentiate and normalize z to obtain the desired ŷ. Formally, the softmax function is given by

softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)

As with the logistic sigmoid, the use of the exp function works very well when training the
softmax to output a target value y using maximum log-likelihood.
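A direct implementation of the softmax formula can overflow when some zᵢ is large; subtracting max(z) first leaves the result unchanged and is the standard stable form. A minimal illustrative sketch:

```python
import numpy as np

def softmax(z):
    # softmax(z)_i = exp(z_i) / sum_j exp(z_j).
    # Subtracting max(z) changes nothing mathematically but prevents overflow.
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

z = np.array([2.0, 1.0, 0.1])   # unnormalized log probabilities z = W^T h + b
p = softmax(z)
print(p)          # approximately [0.659 0.242 0.099]
print(p.sum())    # 1.0 -- a valid probability distribution over n = 3 classes
```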


Hidden Units

The design of hidden units is an extremely active area of research and does not yet have many definitive guiding theoretical principles. Rectified linear units are an excellent default choice of hidden unit. It is usually impossible to predict in advance which type will work best. The design process consists of trial and error: intuiting that a kind of hidden unit may work well, and then evaluating its performance on a validation set.

Some hidden units are not differentiable at all input points. For example, the rectified linear function g(z) = max{0, z} is not differentiable at z = 0. This may seem like it invalidates g for use with a gradient-based learning algorithm. In practice, gradient descent still performs well enough for these models to be used for machine learning tasks.

Most hidden units can be described as accepting a vector of inputs x, computing an affine transformation z = Wᵀx + b, and then applying an element-wise nonlinear function g(z). Most hidden units are distinguished from each other only by the choice of the form of the activation function g(z).

Rectified Linear Units and Their Generalizations

Rectified linear units use the activation function g(z) = max{0, z}.

Rectified linear units are easy to optimize due to their similarity to linear units:

 The only difference from linear units is that they output 0 across half their domain.

 The derivative is 1 everywhere that the unit is active.

 Thus the gradient direction is far more useful for learning than it is with activation functions that introduce second-order effects.

Rectified linear units are typically used on top of an affine transformation: h = g(Wᵀx + b). It is good practice to set all elements of b to a small value such as 0.1. This makes it likely that the ReLU will be initially active for most training samples and allow derivatives to pass through.

ReLU vs other activations:


 Sigmoid and tanh activation functions cannot be used in networks with many layers due to the vanishing gradient problem.

 ReLU overcomes the vanishing gradient problem, allowing models to learn faster and perform better.

 ReLU is the default activation function for MLPs and CNNs.

One drawback of rectified linear units is that they cannot learn via gradient-based methods on examples for which their activation is zero.

Three generalizations of rectified linear units are based on using a nonzero slope αᵢ when zᵢ < 0:

hᵢ = g(z, α)ᵢ = max(0, zᵢ) + αᵢ min(0, zᵢ)

1. Absolute value rectification fixes αᵢ = −1 to obtain g(z) = |z|. It is used for object recognition from images.

2. A leaky ReLU fixes αᵢ to a small value like 0.01.

3. A parametric ReLU treats αᵢ as a learnable parameter.

These variants are compared in the sketch below.
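A minimal illustrative sketch of the shared formula, with α selecting the variant:

```python
import numpy as np

def generalized_relu(z, alpha):
    # h_i = max(0, z_i) + alpha_i * min(0, z_i): slope alpha for negative inputs.
    return np.maximum(0, z) + alpha * np.minimum(0, z)

z = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])

print(generalized_relu(z, 0.0))    # plain ReLU:                 [ 0.  0.  0. 1. 3.]
print(generalized_relu(z, -1.0))   # absolute value, g(z) = |z|: [ 2.  0.5 0. 1. 3.]
print(generalized_relu(z, 0.01))   # leaky ReLU:                 [-0.02 -0.005 0. 1. 3.]
```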

Logistic Sigmoid and Hyperbolic Tangent

Most neural networks used the logistic sigmoid activation function prior to rectified linear units:

g(z) = σ(z)

or the hyperbolic tangent activation function:

g(z) = tanh(z)

These activation functions are closely related because tanh(z) = 2σ(2z) − 1.
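A quick numerical check of this identity (illustrative):

```python
import numpy as np

z = np.linspace(-3.0, 3.0, 7)
sigma = lambda t: 1.0 / (1.0 + np.exp(-t))

# tanh(z) and 2*sigma(2z) - 1 agree to floating-point precision.
print(np.allclose(np.tanh(z), 2 * sigma(2 * z) - 1))   # True
```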

We have already seen sigmoid units as output units, used to predict the probability that a
binary variable is 1.

Sigmoidal units saturate across most of their domain:

 They saturate to 1 when z is very positive and to 0 when z is very negative.

 They are strongly sensitive to their input only when z is near 0.

 Saturation makes gradient-based learning difficult.


Hyperbolic tangent typically performs better than the logistic sigmoid because it resembles the identity function more closely. Because tanh is similar to the identity function near 0, training a deep neural network resembles training a linear model, so long as the activations of the network can be kept small.

Architecture Design

The word architecture refers to the overall structure of the network: how many units it should
have and how these units should be connected to each other.

Most neural networks are organized into groups of units called layers. Most neural network architectures arrange these layers in a chain structure, with each layer being a function of the layer that preceded it. In this structure, the first layer is given by

h⁽¹⁾ = g⁽¹⁾(W⁽¹⁾ᵀx + b⁽¹⁾)

the second layer is given by

h⁽²⁾ = g⁽²⁾(W⁽²⁾ᵀh⁽¹⁾ + b⁽²⁾)

and so on. In these chain-based architectures, the main architectural considerations are to choose the depth of the network and the width of each layer.
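The chain structure is just function composition. Below is a minimal illustrative sketch (the layer widths are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(h_prev, W, b, g):
    # One chain link: affine transformation followed by activation g.
    return g(W.T @ h_prev + b)

relu = lambda z: np.maximum(0, z)
identity = lambda z: z

x = rng.standard_normal(4)                          # input of width 4
W1, b1 = rng.standard_normal((4, 8)), np.zeros(8)   # first layer, width 8
W2, b2 = rng.standard_normal((8, 8)), np.zeros(8)   # second layer, width 8
W3, b3 = rng.standard_normal((8, 1)), np.zeros(1)   # output layer, width 1

h1 = layer(x, W1, b1, relu)          # h(1) = g(1)(W(1)^T x + b(1))
h2 = layer(h1, W2, b2, relu)         # h(2) = g(2)(W(2)^T h(1) + b(2))
y_hat = layer(h2, W3, b3, identity)  # linear output layer

print(y_hat.shape)   # (1,) -- depth 3, widths 8, 8, 1
```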


Universal Approximation Properties and Depth

The universal approximation theorem (Hornik et al., 1989; Cybenko, 1989) states that a feedforward network with a linear output layer and at least one hidden layer with any "squashing" activation function (such as the logistic sigmoid activation function) can approximate any Borel measurable function from one finite-dimensional space to another with any desired nonzero amount of error, provided that the network is given enough hidden units.

A feed-forward network with a single hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of ℝⁿ.

Simple neural networks can represent a wide variety of interesting functions when given
appropriate parameters

 However, it does not touch upon the algorithmic learnability of those parameters


The universal approximation theorem means that regardless of what function we are trying to learn, we know that a large multilayer perceptron (MLP) will be able to represent this function. However, we are not guaranteed that the training algorithm will be able to learn that function. Even if the MLP is able to represent the function, learning can fail for two different reasons.

1. The optimization algorithm may not be able to find the value of the parameters that corresponds to the desired function.

2. The training algorithm might choose the wrong function due to overfitting.

The universal approximation theorem says that there exists a network large enough to achieve any degree of accuracy we desire, but it does not say how large this network will be. Barron (1993) provides some bounds on the size of a single-layer network needed to approximate a broad class of functions. Unfortunately, in the worst case, an exponential number of hidden units may be required. This is easiest to see in the binary case: the number of possible binary functions on vectors v ∈ {0, 1}ⁿ is 2^(2ⁿ), and selecting one such function requires 2ⁿ bits, which will in general require O(2ⁿ) degrees of freedom.


A feedforward network with a single layer is sufficient to represent any function, but the layer may be infeasibly large and may fail to generalize correctly. Using deeper models can reduce the number of units required and reduce generalization error.

 The following observations summarize a graph (not reproduced here) showing the effect of the number of parameters:


 Deeper models perform better than wider models with more parameters because they
can capture more complex hierarchical functions.
 Shallow models with many parameters tend to overfit quickly (around 20 million
parameters), while deeper models benefit from larger parameter counts (over 60
million).
 Deep models learn representations by composing simpler functions, such as detecting
edges, corners, and then objects.
 Depth expresses a preference for learning complex functions as compositions of
simpler ones, improving generalization over shallow models.

This explains why adding depth is more effective than merely increasing the number of
parameters in neural networks.

Back-Propagation and Other Differentiation Algorithms

 Forward Propagation: In a feedforward neural network, input x flows through the layers to produce output ŷ. This process is called forward propagation. During training, forward propagation computes the cost J(θ).

 Backpropagation: Introduced by Rumelhart et al. (1986), backpropagation computes the gradient of the cost function, ∇_θ J(θ). It sends information backward through the network to update weights.

 Backprop Misconceptions:

 It is not the entire learning algorithm but just a method for computing gradients.
 It is not specific to multi-layer networks but can compute derivatives of any function.

 Gradient Calculation: Backprop computes the gradient ∇ₓ f(x, y) for arbitrary functions, often for optimizing the cost function with respect to model parameters. It can also compute the Jacobian for functions with multiple outputs.

 Computational Graphs:

 A computational graph formalizes the operations in a network. Each node represents a variable (scalar, vector, matrix, etc.).
 Operations are functions that act on variables, combining them to form more complex functions. The graph's edges represent these operations.

 General Use: Backpropagation extends beyond neural networks, being useful for
computing various derivatives in machine learning tasks.


 Backpropagation: Backprop computes derivatives by multiplying Jacobians by gradients for each operation in a computational graph. This is efficient for deep networks.

 Tensors and Backpropagation: Backprop can be generalized to tensors (multi-dimensional arrays), where the process remains the same as for vectors. Each tensor's gradient is computed by multiplying Jacobians and gradients, and the result is reshaped back into tensor form.

 Chain Rule with Tensors: For tensor-valued functions, if Y = g(X) and z = f(Y), the chain rule becomes:

∇_X z = Σⱼ (∇_X Yⱼ) ∂z/∂Yⱼ

This computes the gradient of z with respect to each element of tensor X.
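To tie the section together, here is a minimal illustrative sketch of forward propagation followed by backpropagation through the small ReLU network used for XOR earlier in this unit, with every gradient obtained by the chain rule (a hand-written example, not a general computational-graph implementation):

```python
import numpy as np

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 0.])

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 2))   # first-layer weights
c = np.full(2, 0.1)               # small positive biases keep ReLUs initially active
w = rng.standard_normal(2)        # output weights
b = 0.0

for step in range(5000):
    # Forward propagation: compute the cost J(theta).
    z = X @ W + c                   # pre-activations
    h = np.maximum(0, z)            # ReLU hidden units
    y_hat = h @ w + b               # linear output
    J = np.mean((y_hat - y) ** 2)   # MSE cost

    # Back-propagation: multiply each local Jacobian by the upstream gradient.
    dy = 2 * (y_hat - y) / len(y)   # dJ/dy_hat
    dw = h.T @ dy                   # dJ/dw
    db = dy.sum()                   # dJ/db
    dh = np.outer(dy, w)            # dJ/dh
    dz = dh * (z > 0)               # through ReLU: derivative is 1 where active
    dW = X.T @ dz                   # dJ/dW
    dc = dz.sum(axis=0)             # dJ/dc

    # Gradient descent step.
    lr = 0.1
    W -= lr * dW; c -= lr * dc; w -= lr * dw; b -= lr * db

print(np.round(y_hat, 2))   # typically close to [0 1 1 0]; tiny ReLU nets can occasionally get stuck
```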

