3rd Unit DL Final Class Notes

Uploaded by

drmadancse

Subject: Introduction to Deep Learning

(UNIT-3 Class Notes)

Prepared
by
Dr K Madan Mohan
Asst. Professor
Department of CSE (AI&ML)
Sreyas Institute of Engineering and Technology,
Nagole, Bandlaguda, Hyderabad

Sreyas Institute of Engineering and Technology – Autonomous


9-39, Sy No 107 Tattiannaram, GSI Rd, beside INDU ARANYA HARITHA,
Bandlaguda, Nagole, Hyderabad, Telangana 500068
R18 B.Tech. CSE (AIML) III & IV Year JNTU Hyderabad

NEURAL NETWORKS AND DEEP LEARNING

B.Tech. IV Year I Sem.    L T P C
                          3 0 0 3
Course Objectives:
• To introduce the foundations of Artificial Neural Networks
• To acquire knowledge of Deep Learning concepts
• To learn various types of Artificial Neural Networks
• To gain knowledge to apply optimization strategies

Course Outcomes:
• Ability to understand the concepts of Neural Networks
• Ability to select the Learning Networks in modeling real world systems
• Ability to use an efficient algorithm for Deep Models
• Ability to apply optimization strategies for large scale applications

UNIT-I
Artificial Neural Networks Introduction, Basic models of ANN, important terminologies, Supervised
Learning Networks, Perceptron Networks, Adaptive Linear Neuron, Back-propagation Network.
Associative Memory Networks. Training Algorithms for pattern association, BAM and Hopfield
Networks.

UNIT-II
Unsupervised Learning Network- Introduction, Fixed Weight Competitive Nets, Maxnet, Hamming
Network, Kohonen Self-Organizing Feature Maps, Learning Vector Quantization, Counter Propagation
Networks, Adaptive Resonance Theory Networks. Special Networks-Introduction to various networks.

UNIT - III
Introduction to Deep Learning, Historical Trends in Deep learning, Deep Feed - forward networks,
Gradient-Based learning, Hidden Units, Architecture Design, Back-Propagation and Other
Differentiation Algorithms

UNIT - IV
Regularization for Deep Learning: Parameter norm Penalties, Norm Penalties as Constrained
Optimization, Regularization and Under-Constrained Problems, Dataset Augmentation, Noise
Robustness, Semi-Supervised learning, Multi-task learning, Early Stopping, Parameter Tying and
Parameter Sharing, Sparse Representations, Bagging and other Ensemble Methods, Dropout,
Adversarial Training, Tangent Distance, Tangent Prop, and Manifold Tangent Classifier

UNIT - V
Optimization for Training Deep Models: Challenges in Neural Network Optimization, Basic Algorithms,
Parameter Initialization Strategies, Algorithms with Adaptive Learning Rates, Approximate Second-
Order Methods, Optimization Strategies and Meta-Algorithms
Applications: Large-Scale Deep Learning, Computer Vision, Speech Recognition, Natural Language
Processing

TEXT BOOKS:
1. Deep Learning: An MIT Press Book By Ian Goodfellow and Yoshua Bengio and Aaron Courville
2. Neural Networks and Learning Machines, Simon Haykin, 3rd Edition, Pearson Prentice Hall.
III-UNIT

Introduction to Deep Learning

1.1 Introduction to Deep Learning [1st Topic]

1. Deep learning is a subset of machine learning that focuses on training artificial
neural networks to learn and make predictions.
2. It is inspired by the structure and function of the human brain, where artificial neural
networks attempt to mimic the behavior of neurons.
3. Neural networks are composed of interconnected layers of nodes called neurons,
which process and transmit information.
4. Deep learning models are called "deep" because they typically have multiple hidden
layers between the input and output layers.
5. Training a deep learning model involves feeding it a large amount of labeled data
and adjusting the weights and biases of the neurons to minimize the difference
between predicted and actual outputs.
6. Deep learning has gained popularity because it can automatically learn features from
raw data, reducing the need for manual feature engineering.
7. It has been successful in various applications, such as computer vision, natural
language processing, speech recognition, and recommendation systems.
8. Some popular deep learning architectures include Convolutional Neural Networks
(CNNs) for image processing, Recurrent Neural Networks (RNNs) for sequence
data, and Generative Adversarial Networks (GANs) for generating new content.
9. Deep learning has achieved remarkable results in tasks like image classification,
object detection, machine translation, and even beating human players in complex
games like Go and Chess.
10. However, deep learning models require large amounts of data for training and can
be computationally intensive, often requiring specialized hardware like GPUs or
TPUs.
11. Despite its successes, deep learning still faces challenges, such as interpretability
(understanding how and why the model makes predictions) and robustness to
adversarial attacks (where small perturbations can fool the model).
Researchers and engineers continue to explore and enhance deep learning
techniques to overcome these challenges and unlock more applications.
1.2 What is Deep Learning?
1. Deep learning is a subfield of machine learning that focuses on training artificial
neural networks to learn and make predictions.
2. It is based on the concept of artificial neural networks, which are inspired by the
structure and function of the human brain.
3. Deep learning models are called "deep" because they typically have multiple
layers of interconnected nodes, known as neurons.
4. These neurons process and transmit information, allowing the network to capture
complex patterns and relationships in the data.
5. Deep learning models are trained by feeding them large amounts of labeled data
and adjusting the weights and biases of the neurons to minimize prediction
errors.
6. One of the key advantages of deep learning is its ability to automatically learn
features from raw data, reducing the need for manual feature engineering.
7. Deep learning has been successful in various applications, including computer
vision, natural language processing, speech recognition, and recommendation
systems.
8. Some popular deep learning architectures include Convolutional Neural
Networks (CNNs) for image processing, Recurrent Neural Networks (RNNs) for
sequence data, and Generative Adversarial Networks (GANs) for generating new
content.
9. Deep learning has achieved impressive results, matching or surpassing human
performance on specific benchmarks in tasks like image classification, object
detection, and game playing.
10. However, deep learning models require a significant amount of data for training
and can be computationally intensive, often requiring specialized hardware.
11. There are still challenges in deep learning, such as interpretability
(understanding how and why the model makes predictions) and robustness to
adversarial attacks.
Researchers and practitioners are actively working on improving deep learning
techniques to address these challenges and explore new applications.
1.3. Advantages of Deep Learning:
1. Automatic feature extraction: Deep learning models can automatically learn
relevant features from raw data, reducing the need for manual feature engineering.
2. High Accuracy: Deep learning models have achieved impressive results in various
tasks, matching or surpassing human performance on some benchmarks in areas
like image recognition and natural language processing.
3. Handling complex data: Deep learning models can effectively handle large and
complex datasets with high-dimensional inputs, such as images, audio, and text.
4. Scalability: Deep learning models can scale well with large amounts of data,
allowing for improved performance as more data becomes available.
5. Versatility: Deep learning can be applied to a wide range of tasks, including image
and speech recognition, language translation, recommendation systems, and more.

1.4 Disadvantages of Deep Learning:

1. Data Requirements: Deep learning models typically require large amounts of
labeled data for training, which can be challenging to obtain in certain domains.
2. Computational requirements: Training deep learning models can be
computationally intensive and may require specialized hardware like GPUs or
TPUs.
3. Lack of Interpretability: Deep learning models can be considered black boxes,
making it difficult to understand and interpret how and why they make certain
predictions.
4. Overfitting: Deep learning models are prone to overfitting, meaning they can
perform well on the training data but struggle to generalize to new, unseen data.
5. Vulnerability to adversarial attacks: Deep learning models can be susceptible to
small, intentional perturbations in the input data that can cause the model to make
incorrect predictions.
It's important to note that while deep learning has many advantages, it may not
always be the best approach for every problem. It's crucial to consider the specific
requirements and limitations of the problem at hand before deciding to use deep
learning.
1.5 Applications of Deep Learning:

a. Computer vision: Deep learning has revolutionized computer vision tasks such as
image classification, object detection, and image segmentation. It enables machines
to understand and interpret visual data, powering technologies like autonomous
vehicles, facial recognition, and augmented reality.

b. Natural language processing: Deep learning is extensively used in natural
language processing tasks, including sentiment analysis, language translation,
chatbots, and speech recognition. It enables machines to understand, generate, and
interact with human language.

c. Recommender systems: Deep learning plays a significant role in recommendation
systems, helping platforms personalize content and make tailored suggestions to
users. It powers recommendation algorithms used by popular platforms like Netflix,
Amazon, and Spotify.

d. Healthcare: Deep learning has shown promise in medical imaging analysis, disease
diagnosis, and drug discovery. It aids in identifying patterns in medical images,
predicting disease outcomes, and developing new treatments.

e. Finance: Deep learning is used in finance for tasks like fraud detection, algorithmic
trading, and credit scoring. It helps identify fraudulent transactions, analyze market
trends, and make data-driven investment decisions.

1.6 Challenges of Deep Learning:

a. Data availability: Deep learning models require large amounts of labeled data for
training, which can be challenging to obtain in certain domains. Limited or biased
data can affect model performance and generalization.

b. Computational requirements: Training deep learning models can be
computationally intensive and time-consuming, often requiring powerful hardware
resources like GPUs or TPUs. This can limit the accessibility and scalability of deep
learning approaches.

c. Interpretability: Deep learning models are often considered black boxes, making
it difficult to understand the rationale behind their predictions. The lack of
interpretability raises concerns in critical applications like healthcare and finance.

d. Overfitting: Deep learning models are prone to overfitting, where they memorize
the training data instead of learning generalizable patterns. Overfitting can lead to
poor performance on unseen data.

e. Adversarial Attacks: Deep learning models can be vulnerable to adversarial
attacks, where small, intentional perturbations in the input data can cause the model
to make incorrect predictions. Ensuring robustness against such attacks is a crucial
challenge.

Addressing these challenges requires ongoing research and development in the field of
deep learning to improve data collection, model interpretability, regularization
techniques, and security measures.
2.1 Historical Trends in Deep Learning [2nd Topic]
1. Deep learning has a history that spans a long time and has been known by different
names, reflecting different perspectives and trends in the field.
2. The usefulness of deep learning has increased as the availability of training data has
grown. More data allows deep learning models to learn more effectively and make
better predictions.
3. Deep learning models have become larger over time due to advancements in
computer infrastructure. This includes improvements in both hardware (such as
GPUs) and software (such as optimized algorithms and frameworks) specifically
designed for deep learning.
4. As deep learning models have evolved, they have been able to tackle increasingly
complex applications with higher accuracy. This means that deep learning has been
successful in solving more challenging tasks and producing more reliable results.
2.1.1 The Many Names and Changing Fortunes of Neural Networks
1. Deep learning has a long history dating back to the 1940s, but it has recently gained
popularity and is often referred to as a new technology.
2. Deep learning has gone through various name changes over time, reflecting different
researchers and perspectives in the field.
3. There have been three waves of development in deep learning: cybernetics in the
1940s-1960s, connectionism in the 1980s-1990s, and the current resurgence known
as deep learning since 2006.
4. Deep learning models are sometimes called artificial neural networks (ANNs)
because they are inspired by the functioning of the biological brain.
5. While neural networks have been used to understand brain function, they are not
necessarily realistic models of how the brain works.
6. Deep learning is motivated by the idea of reverse engineering the brain's
computational principles to build intelligent systems and understand human
intelligence.
7. Deep learning also focuses on learning multiple levels of composition, which can be
applied in machine learning frameworks that are not necessarily based on neural
inspiration.

The figure shows two of the three historical waves of artificial neural network research,
as measured by how often the corresponding terms appear in Google Books. The first
wave, cybernetics (1940s-1960s), focused on theories of biological learning and on the
first models, such as the perceptron, that made it possible to train a single neuron. The
second wave, connectionism (1980-1995), used back-propagation to train neural
networks with one or two hidden layers. The current third wave, deep learning, began
around 2006 and only started to appear in book form around 2016. The other two waves
likewise appeared in books much later than the research itself took place.

Early Neural Networks (Cybernetics)

MCCULLOCH-PITTS NEURON:

• Formula: f(x, w) = x1w1 + ... + xnwn
• Function: Recognizes two categories of inputs based on whether f(x, w) is
positive or negative
• Limitations: The weights must be set correctly by a human operator rather than
learned from data; cannot represent complex functions
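
As a minimal sketch of the unit described above, the code below computes f(x, w) and reports the category from the sign of the result. The function names and the example weights are illustrative assumptions, not part of the original notes; the weights are fixed by hand, which is exactly the limitation listed above.

```python
def linear_score(x, w):
    """Compute f(x, w) = x1*w1 + ... + xn*wn."""
    return sum(xi * wi for xi, wi in zip(x, w))


def categorize(x, w):
    """Return +1 when f(x, w) is positive, otherwise -1."""
    return 1 if linear_score(x, w) > 0 else -1


# Hand-chosen weights (an assumption for illustration):
# the unit reports +1 when the first input exceeds the second.
weights = [1.0, -1.0]

print(categorize([3.0, 1.0], weights))  # f = 2.0, positive category
print(categorize([1.0, 3.0], weights))  # f = -2.0, negative category
```

Nothing in this sketch is learned: changing the behaviour means editing `weights` by hand.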

Perceptron and ADALINE

PERCEPTRON:

• Formula: f(x, w) = x1w1 + ... + xnwn + b
• Function: Learns weights to recognize two categories of inputs
• Limitations: Cannot learn complex functions, cannot handle non-linear
relationships
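
The sketch below illustrates how a perceptron can learn its weights from labeled examples, using the classic perceptron update rule. The function names, the 0/1 label encoding, the learning rate, and the epoch limit are assumptions made for the example. It learns the AND function, which is linearly separable, and then shows that the same procedure cannot reproduce XOR.

```python
def predict(x, w, b):
    """Threshold f(x, w) = x1*w1 + ... + xn*wn + b at zero."""
    score = sum(xi * wi for xi, wi in zip(x, w)) + b
    return 1 if score > 0 else 0


def train_perceptron(samples, epochs=100, lr=1.0):
    """Perceptron rule: nudge w and b after every misclassified example."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in samples:
            error = y - predict(x, w, b)
            if error != 0:
                mistakes += 1
                w = [wi + lr * error * xi for wi, xi in zip(w, x)]
                b += lr * error
        if mistakes == 0:  # every example is classified correctly
            break
    return w, b


inputs = [[0, 0], [0, 1], [1, 0], [1, 1]]
and_targets = [0, 0, 0, 1]
xor_targets = [0, 1, 1, 0]

w_and, b_and = train_perceptron(list(zip(inputs, and_targets)))
and_predictions = [predict(x, w_and, b_and) for x in inputs]

w_xor, b_xor = train_perceptron(list(zip(inputs, xor_targets)))
xor_predictions = [predict(x, w_xor, b_xor) for x in inputs]

print(and_predictions == and_targets)  # True: AND is linearly separable
print(xor_predictions == xor_targets)  # False: no line separates XOR
```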
ADALINE:

• Formula: f(x, w) = x1w1 + ... + xnwn
• Function: Learns weights to predict real-valued numbers
• Limitations: Cannot learn complex functions, cannot handle non-linear
relationships
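
A rough sketch of how ADALINE can learn to predict real values: after each example, the weights move a small step in the direction that reduces the squared error, which is a simple form of stochastic gradient descent. The function names, the learning rate, the number of epochs, and the toy data (generated from y = 2*x1 - x2) are assumptions for illustration only.

```python
def adaline_output(x, w):
    """Compute f(x, w) = x1*w1 + ... + xn*wn (a real-valued prediction)."""
    return sum(xi * wi for xi, wi in zip(x, w))


def train_adaline(samples, lr=0.1, epochs=500):
    """Update the weights after every example to reduce the squared error."""
    w = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        for x, y in samples:
            error = y - adaline_output(x, w)
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]
    return w


# Toy data generated from y = 2*x1 - x2 (assumed for the example).
samples = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
learned = train_adaline(samples)
print([round(v, 4) for v in learned])  # close to [2.0, -1.0]
```

Because the model is linear, it can recover this linear rule but would fail on a target such as XOR.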

Linear Models

• Training Algorithm: Stochastic gradient descent
• Applications: Widely used in machine learning
• Limitations: Cannot learn complex functions, cannot handle non-linear
relationships
• Impact: Critics who observed these limitations caused a backlash against
biologically inspired learning
NEUROSCIENCE AND DEEP LEARNING

• Neuroscience: Still a source of inspiration, but not the predominant guide
• Reason: Lack of information about the brain
• Future: Deep understanding of brain algorithms requires monitoring thousands of
neurons simultaneously

• Deep learning was inspired by neuroscience but is not a direct simulation of the
brain.
• Early neural networks were simple linear models and could only learn to recognize
two categories of inputs.
• The perceptron was the first model that could learn the weights defining the two
categories from examples of inputs in each category.
• Linear models have limitations and cannot learn certain functions, such as the XOR
function.
• Neuroscience is still an important source of inspiration for deep learning, but it is
not the predominant guide for the field.
• We do not have enough information about the brain to use it as a complete guide for
deep learning research.
• Deep learning researchers are more likely to cite the brain as an influence than
researchers working in other machine learning fields.
• Deep learning and computational neuroscience are separate fields of study with
different goals.
• Deep learning is focused on building AI systems, while computational neuroscience
is focused on building accurate models of the brain.
• Connectionism is a movement in cognitive science that studies models of cognition
based on neural implementations.
• Distributed representation is a key concept in connectionism: each input should be
represented by many features, and each feature should be involved in the
representation of many possible inputs.
• Back-propagation: A popular algorithm for training deep neural networks.
• LSTM: A type of recurrent neural network that is well-suited for modeling sequences.
• Decline of neural networks: In the 1990s, neural networks lost popularity due to
unrealistic expectations and advances in other machine learning fields.
• CIFAR NCAP research initiative: The Neural Computation and Adaptive Perception
program of the Canadian Institute for Advanced Research, which helped to keep
neural networks research alive during the decline.
• Deep Networks: Were once thought to be very difficult to train, but this is no longer
the case.
• Geoffrey Hinton: Developed a new technique for training deep neural
networks called greedy layer-wise pre-training.
• Deep belief networks: A type of neural network that can be efficiently trained
using greedy layer-wise pre-training.
• Deep learning: A term used to emphasize the ability to train deeper neural
networks.
• Third wave of neural networks research: Began in 2006 and is still ongoing.
• Focus of deep learning research: Has shifted from unsupervised learning to
supervised learning on large labeled datasets.
2.1.2 Important Points and Conclusions about Deep Feedforward Networks
1. Deep feedforward networks, also known as multilayer perceptrons (MLPs), are a
type of artificial neural network that approximate a function f* by defining a
mapping y = f(x; θ) and learning the parameters θ. These networks are called
feedforward because information flows in a single direction from the input x to the
output y, without feedback connections. When extended with feedback connections,
they become recurrent neural networks.
2. Feedforward networks consist of multiple layers of functions, where each layer is
connected to the next in a chain. The first layer is called the input layer, and the final
layer is called the output layer. The layers in between are called hidden layers
because their behavior is not directly specified by the training data.
3. During training, the network is presented with labeled examples (x, y) to learn the
desired output y for each input x. The learning algorithm determines how to use the
hidden layers to best approximate f*.
4. The width of the network is determined by the dimensionality of the hidden layers,
and the depth is determined by the number of layers. The choice of functions used
to compute the hidden layer values is inspired by neuroscience, but the goal of these
networks is not to perfectly model the brain.
5. To overcome the limitations of linear models, such as logistic regression and linear
regression, which can only represent linear functions, we can apply a nonlinear
transformation φ(x) to the input x to obtain a set of features describing x. This is
equivalent to using a kernel function in kernel machines.
6. In deep learning, we learn the function φ(x; θ) and map it to the desired output using
parameters w. This approach allows us to capture the benefits of both highly generic
feature mappings and manually engineered feature mappings, while avoiding the
limitations of either.
7. To train a feedforward network, we choose an optimizer, cost function, and output
units, which are similar to those used for linear models. We also choose the
activation functions used to compute the hidden layer values and design the
architecture of the network, including the number of layers, connections between
layers, and number of units in each layer.
8. Computing the gradients of complicated functions in deep neural networks requires
the back-propagation algorithm and its modern generalizations, which can
efficiently compute these gradients.
9. Deep feedforward networks are a type of artificial neural network that approximate
a function by defining a mapping y = f(x; θ) and learning the parameters θ. They
consist of multiple layers of functions, where each layer is connected to the next in
a chain, and can capture the benefits of both highly generic and manually engineered
feature mappings.
10. Training and optimization techniques, such as choosing an optimizer, cost function,
output units, activation functions, and designing the network architecture, are
required to effectively train these networks. The back-propagation algorithm is used
to efficiently compute the gradients required for learning.
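
The chain structure described in the points above can be sketched in a few lines. This is only an illustration under stated assumptions: the hidden layers use the rectified linear activation, the output layer is linear, and the weights are arbitrary numbers chosen so the arithmetic is easy to follow. The names `affine`, `relu`, and `forward` are hypothetical helpers, not taken from the notes.

```python
def affine(weights, biases, x):
    """One linear transformation: each row of `weights` feeds one unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Nonlinear activation applied element-wise to a hidden layer."""
    return [max(0.0, v) for v in values]


def forward(layers, x):
    """Compute y = f(x; theta) by passing x through the layers in a chain."""
    h = x
    for index, (weights, biases) in enumerate(layers):
        h = affine(weights, biases, h)
        if index < len(layers) - 1:  # hidden layers only; output stays linear
            h = relu(h)
    return h


# theta: one hidden layer of width 3, then a single linear output unit.
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]], [0.0, 0.0, 0.0]),
    ([[1.0, 2.0, 1.0]], [0.5]),
]

print(forward(layers, [2.0, 1.0]))  # [4.5]
```

Here the depth is the number of entries in `layers` and the width is the number of rows in the hidden layer's weight matrix. Information only flows forward through the loop, with no feedback connections.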

2.1.3 A Feedforward Neural Network’s Layers

The following are the components of a feedforward neural network:

Input layer

It contains the neurons that receive the input and pass it on to the next layer. The
number of neurons in the input layer equals the number of input variables (features)
in the dataset.

Hidden layer

This is the intermediate layer between the input and output layers. Its neurons apply
transformations to the inputs and pass the results on to the output layer.

Output layer

It is the last layer, and its form depends on the task the model is built for. It produces
the prediction for the target variable, whose desired value is known for every training
example.

Neuron weights

Weights describe the strength of a connection between neurons. A weight is a real
number; it can be positive or negative and is not restricted to the range 0 to 1.

Cost Function in Feedforward Neural Network

The cost function is an important component of a feedforward neural network. A minor
adjustment to the weights and biases usually has little or no effect on the number of
correctly classified data points, so classification accuracy alone gives little guidance on
how to adjust them. A smooth cost function is used instead, because it responds to
minor adjustments of the weights and biases and therefore shows how to change them to
improve performance.

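The mean square error cost function is defined as follows:

$$C(w, b) = \frac{1}{2n} \sum_{x} \lVert y(x) - a \rVert^{2}$$

Where,

w = weights collected in the network

b = biases

n = number of training inputs

a = output vector of the network for input x

x = input

y(x) = desired output for input x

‖v‖ = usual length (Euclidean norm) of vector v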
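
A minimal sketch of how this cost could be computed, assuming the quadratic form C(w, b) = 1/(2n) Σ ‖y(x) − a‖², which matches the symbols defined above. The function name and the sample numbers are assumptions for illustration.

```python
def mse_cost(desired_outputs, network_outputs):
    """Quadratic cost: sum of squared vector lengths, divided by 2n."""
    n = len(desired_outputs)
    total = 0.0
    for y, a in zip(desired_outputs, network_outputs):
        # Squared length of the difference vector y - a.
        total += sum((yi - ai) ** 2 for yi, ai in zip(y, a))
    return total / (2 * n)


desired = [[1.0, 0.0], [0.0, 1.0]]
outputs = [[1.0, 0.0], [0.0, 0.0]]  # second example is wrong in one component

print(mse_cost(desired, outputs))  # 0.25
print(mse_cost(desired, desired))  # 0.0 when every output matches
```

The cost shrinks smoothly toward zero as the outputs approach the desired values, which is the property that makes small weight adjustments informative.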

Loss Function in Feedforward Neural Network


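A neural network's loss function measures how far the predictions are from the targets
and is used to determine how the learning process needs to adjust the parameters.

For classification, the output layer has as many neurons as there are classes, and the
cross-entropy loss measures the difference between the predicted and actual
probability distributions.

The cross-entropy loss for binary classification is as follows.

$$L = -\big[\, y \log p + (1 - y) \log (1 - p) \,\big]$$

Here y is the true label (0 or 1) and p is the predicted probability of class 1.

The cross-entropy loss associated with multi-class categorization is as follows:

$$L = -\sum_{c=1}^{M} y_{c} \log p_{c}$$

Here M is the number of classes, y_c is 1 for the true class and 0 otherwise, and p_c is
the predicted probability of class c.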
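
As a hedged illustration, the two losses can be written as short functions, assuming the standard forms L = −[y log p + (1 − y) log(1 − p)] for the binary case and L = −Σ y_c log p_c for the multi-class case. The function names and the small clipping constant `eps`, added to avoid log(0), are assumptions of this sketch.

```python
import math


def binary_cross_entropy(y, p, eps=1e-12):
    """Loss for one example with true label y in {0, 1} and predicted p."""
    p = min(max(p, eps), 1.0 - eps)  # keep p away from 0 and 1
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def categorical_cross_entropy(true_dist, predicted_dist, eps=1e-12):
    """Loss for one example with a one-hot target and predicted probabilities."""
    return -sum(t * math.log(max(p, eps))
                for t, p in zip(true_dist, predicted_dist))


# A prediction of 0.5 for the true class costs log(2) in both formulations.
print(binary_cross_entropy(1, 0.5))
print(categorical_cross_entropy([0, 1, 0], [0.25, 0.5, 0.25]))

# A confident correct prediction costs less than a hesitant one.
print(binary_cross_entropy(1, 0.9) < binary_cross_entropy(1, 0.6))  # True
```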

Figure-Deep Feed Forward Architectural Diagram


2.2 Feed Forward Networks
1. Linear models are not able to solve the XOR problem because they cannot represent non-
linear relationships between inputs. A feedforward network with a hidden layer can be used
to solve this problem. The hidden layer learns a new feature space in which the XOR
function can be represented by a linear model. The function that computes the hidden
layer outputs should be non-linear.
2. The linear model fitted to the XOR data has w = 0 and b = 1/2, which means it predicts
a constant value of 0.5 everywhere. This happens because the linear model cannot
represent some functions, like the XOR function. To solve this, we can use a feedforward
network with a hidden layer containing two hidden units and a nonlinear activation function.
3. The hidden units' values are computed using a function f^(1) with learned parameters W
and c, and the output is computed using a linear regression applied to the hidden units'
values. This network allows us to learn a different feature space where a linear model can
represent the solution.
4. If f^(1) were linear, the network would still be a linear function of its input, so we need a
nonlinear function to describe the features. Most neural networks use a nonlinear
activation function after a linear transformation with learned weights and biases.
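
The behaviour of the linear model on XOR can be checked numerically. The sketch below fits f(x; w, b) = w1*x1 + w2*x2 + b to the four XOR points by gradient descent on the mean squared error. The learning rate, the number of steps, and the function name are assumptions chosen for the example.

```python
def fit_linear_model(inputs, targets, lr=0.1, steps=2000):
    """Minimise the mean squared error of a linear model by gradient descent."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(inputs)
    for _ in range(steps):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        for (x1, x2), y in zip(inputs, targets):
            error = w[0] * x1 + w[1] * x2 + b - y
            grad_w[0] += 2.0 * error * x1 / n
            grad_w[1] += 2.0 * error * x2 / n
            grad_b += 2.0 * error / n
        w = [w[0] - lr * grad_w[0], w[1] - lr * grad_w[1]]
        b -= lr * grad_b
    return w, b


xor_inputs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
xor_targets = [0.0, 1.0, 1.0, 0.0]

w, b = fit_linear_model(xor_inputs, xor_targets)
predictions = [w[0] * x1 + w[1] * x2 + b for x1, x2 in xor_inputs]
print([round(p, 3) for p in predictions])  # every prediction is about 0.5
```

The best linear fit ends with both weights near zero and a bias near 0.5, so the model outputs roughly 0.5 for all four inputs and cannot separate the two classes.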

Figure-1: Solving the XOR problem by learning a representation


The XOR problem is difficult to solve with a linear model because it requires the output to
behave differently based on the values of two inputs. A linear model can't do this because it
applies a fixed coefficient to one input. However, by learning a nonlinear representation of the
inputs through a neural network, the problem can be solved with a linear model in the new
feature space.
This is because the neural network collapses certain input points into the same point in the
new space, allowing the linear model to describe the function using the new features. This
technique can increase the model's capacity to fit the training data and also improve its ability
to generalize to new inputs.

Figure-2: An example of a feedforward network, drawn in two different styles


The feedforward network shown here has a single hidden layer with 2 units, used to solve the
XOR problem. In the left diagram, each unit is represented as a separate node, making it clear
and unambiguous but taking up a lot of space for larger networks. In the right diagram, a node
represents an entire layer's activation vector, making it more compact. The edges are labeled
with parameter names, such as W for the mapping from input to hidden layer and w for the
mapping from hidden to output layer, without including intercept parameters. This style of
diagram is more concise but may require additional context to fully understand.
Figure-3 The rectified linear activation function.
The rectified linear activation function (ReLU), g(z) = max{0, z}, is a popular choice for
deep learning models because it is simple, efficient, and effective. It acts like a gate
that passes the input through unchanged when it is positive and outputs zero when it is
negative. This lets the model represent nonlinear patterns while each unit remains
simple to compute and optimize.
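
A minimal sketch of the function, assuming the usual definition g(z) = max(0, z); the helper name `relu_layer` is hypothetical.

```python
def relu(z):
    """Rectified linear unit: pass positive inputs through, output 0 otherwise."""
    return max(0.0, z)


def relu_layer(values):
    """Apply the activation element-wise, as done for a layer of hidden units."""
    return [relu(v) for v in values]


print(relu_layer([-2.0, -0.5, 0.0, 1.5, 3.0]))  # [0.0, 0.0, 0.0, 1.5, 3.0]
```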
• The rectified linear activation function is a commonly used function in neural
networks. It is applied to the output of a linear transformation and makes the result
nonlinear, but the function itself remains very close to linear.
• This function is recommended for most feedforward neural networks because it
preserves many of the properties that make linear models easy to optimize and
that help them generalize well.
• It is a piecewise linear function with two linear pieces. Complex function
approximators can be built from this simple component, much as a Turing machine's
memory only needs to store 0 or 1 states.
• With the parameters specified by hand for the XOR example, the neural network gives
the correct answer for every example in the batch. In a real situation with many
parameters and examples, we can't just guess the solution like we did here.
• Instead, we use a gradient-based optimization algorithm to find parameters that
result in low error. The algorithm can converge to a point with very little error, but
the solution may not be as simple as the one we presented for the XOR problem.
• The solution found by the algorithm depends on its initial values, and in practice, it
may not be as clean and easy to understand as the one we showed.
$$w = \begin{bmatrix} 1 \\ -2 \end{bmatrix}, \tag{6.6}$$

and b = 0.

We can now walk through the way that the model processes a batch of inputs.
Let X be the design matrix containing all four points in the binary input space,
with one example per row:

$$X = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix}. \tag{6.7}$$

The first step in the neural network is to multiply the input matrix by the first
layer's weight matrix:

$$XW = \begin{bmatrix} 0 & 0 \\ 1 & 1 \\ 1 & 1 \\ 2 & 2 \end{bmatrix}. \tag{6.8}$$

Next, we add the bias vector c, to obtain

$$\begin{bmatrix} 0 & -1 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}. \tag{6.9}$$

In this space, all of the examples lie along a line with slope 1. As we move along
this line, the output needs to begin at 0, then rise to 1, then drop back down to 0.
A linear model cannot implement such a function. To finish computing the value
of h for each example, we apply the rectified linear transformation:

$$\begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}. \tag{6.10}$$

This transformation has changed the relationship between the examples. They no
longer lie on a single line. As shown in figure 6.1, they now lie in a space where a
linear model can solve the problem.
We finish by multiplying by the weight vector w:

$$\begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}. \tag{6.11}$$
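
The walkthrough above can be reproduced in a few lines of plain Python. The values of W and c are not printed in this extract; the ones used below are an assumption, chosen because they are the only values consistent with equations (6.7), (6.8), and (6.9). The helper name `matmul` is hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


X = [[0, 0], [0, 1], [1, 0], [1, 1]]  # design matrix, equation (6.7)
W = [[1, 1], [1, 1]]                  # assumed: implied by (6.7) and (6.8)
c = [0, -1]                           # assumed: implied by (6.8) and (6.9)
w = [1, -2]                           # equation (6.6)
b = 0

XW = matmul(X, W)                                             # (6.8)
shifted = [[v + cj for v, cj in zip(row, c)] for row in XW]   # (6.9)
H = [[max(0, v) for v in row] for row in shifted]             # (6.10), ReLU
y = [sum(h * wj for h, wj in zip(row, w)) + b for row in H]   # (6.11)

print(XW)       # [[0, 0], [1, 1], [1, 1], [2, 2]]
print(shifted)  # [[0, -1], [1, 0], [1, 0], [2, 1]]
print(H)        # [[0, 0], [1, 0], [1, 0], [2, 1]]
print(y)        # [0, 1, 1, 0]
```

The final list matches the XOR targets for the four inputs, so these hand-specified parameters solve the problem exactly.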
(Scanned handwritten notes follow at this point; the handwriting is not legible in this
copy. The headings and formulas that can still be made out are listed below, in order.)

Deep feedforward networks as a chain of functions: f(x) = f^(3)(f^(2)(f^(1)(x)))
Example: Learning XOR, starting from a linear model f(x; w, b) = x^T w + b
Gradient-Based Learning
Cost functions: learning conditional distributions with maximum likelihood
Learning conditional statistics
Output units
Linear units for Gaussian output distributions: ŷ = W^T h + b, p(y | x) = N(y; ŷ, I)
Sigmoid units for Bernoulli output distributions
,!
#r1-di*c b "fr"ul
* nS TcL

A
+L
tr
er,'t,
q- fu' ffrt 'brw*it stT* d i,r*cL'st
,rolruae
rq-

,Jh, Sa,rrmzcl oq^* un'nl 'a* h*""ff &,oo

Cooryne^/*;
b
,1,

ceryute
TL azap/5 or h,,negL
,l\
-r- I

tt', -TI b
T = Irl
a,cbrv*fr,tm hu*c*ion
@ T{ u-bu)th" sry*"M
Z tn b pfuobrb&tr
nff a)
b cpnveut
ava!*e 'f in/a
d_ ( 'loa
a

Te?be'6n bl* a)
d-)

,t-6 CD=Yz
U) =exPCVO
CD=
ex-P et r)
,
ex-Pcz'4
D

PCD bulitn
fie n vow"bh
0w6
L^fi* ,J5 O-- Nal) 6v)

- btlihu,a $rL
r5

Jlr" us,e of h2 * Spare Prubabtkt,Pr, Nt ftt,


rnr&7okn sr7
Tnaxrrrutn/L !:t eh/woa fuaert'E
erlq b$ry be*&'t
wlrun LiafT'fltr.fl) a" B*znouJff fauan ,h
{c{toa44},b
b,S, a- 5r6"to
t& 7r)

J(e)= -'l'7 P
V
U t4
Cntr-D n
0
\__ ,(-o

,
)D
(t-a'l)7
,, uoVlh rnnrtrl[r'n^-
^f
Jlte Co bt- * unchr>Q
w$

.!;t<*/"ud-
Ln - lu? Peylil' -riln geP"*
-f.L f4"lJi*u
il 3o/;hnan6wh: Ur,fi,$,
a

chnaPe
b
NE N&{' /h" mu'["(- 1srYte (n
J,+ oP tianf JrL
offie ,$ v'L d;++ezent
Va,u-abfe
hr*' vondblr* I lrrlu
I*th" -tD
c*e o4
wduct- Sr & nunbed
uJi /r'h^e-&
4 (y =t -)

U=
16

Conclusion Points about Gradient-Based Learning:

Gradient-based learning is a technique used in machine learning to train models to make predictions or perform tasks. It involves finding the optimal values for the model parameters by iteratively adjusting them based on the computed gradients.

Explanation of the process:


1. First, we initialize the model's parameters with some initial values.
2. We then feed training data into the model and make predictions.
3. By comparing the predictions with the actual targets, we calculate a measure of how well the model is performing, called the loss or cost.
4. To improve the model's performance, we need to minimize the loss. This is where gradient-based learning comes in.
5. We compute the gradients, which are like the slopes of the loss function with respect to each parameter. They tell us how the loss changes as we adjust each parameter.
6. Using these gradients, we update the parameters in the direction that reduces the loss. This process is called gradient descent.
7. We repeat steps 2-6 for multiple iterations, adjusting the parameters gradually to minimize the loss.
8. Eventually, the model converges to a set of parameter values that minimize the loss and make accurate predictions.
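
The eight steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, assuming a one-input linear model y_hat = w*x + b, a mean squared error loss, and made-up training data; the names `loss_and_grads` and `train`, the learning rate, and the step count are choices made for this sketch, not part of the notes.

```python
# Hypothetical toy data: the targets follow y = 2x + 1 exactly.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]


def loss_and_grads(w, b):
    """Return the mean squared error of y_hat = w*x + b and its gradients."""
    n = len(xs)
    # Steps 2-3: make predictions and compare them with the targets.
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    loss = sum(e * e for e in errors) / n
    # Step 5: slopes of the loss with respect to each parameter.
    grad_w = sum(2.0 * e * x for e, x in zip(errors, xs)) / n
    grad_b = sum(2.0 * e for e in errors) / n
    return loss, grad_w, grad_b


def train(steps=2000, learning_rate=0.05):
    w, b = 0.0, 0.0                      # Step 1: initial parameter values.
    history = []
    for _ in range(steps):               # Step 7: repeat.
        loss, grad_w, grad_b = loss_and_grads(w, b)
        history.append(loss)
        # Step 6: move each parameter against its gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, history


w, b, history = train()
print(w, b, history[0], history[-1])
```

On this toy problem the loss is convex, so the parameters approach w = 2 and b = 1; for a neural network the same loop applies but convergence is not guaranteed.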

[Figure: cost plotted against weight, marking the initial weight, the gradient, the incremental steps, the derivative of the cost, and the minimum cost.]

Gradient-based learning is a way to teach a model by adjusting its parameters to reduce the difference between its predictions and the correct answers.

It's like adjusting the knobs on a machine until it produces the desired results.
HjAdnl Unfr{*
nf

T6* 4* of ht dden u YL,L * {a- iee d {ouoaM,


n e MqI- ne tuo o rldb In & b 0 LL { cle cl &oT rh
{*, on er" ff., Pe LL), Srg,mor/)
L; #* {za^xfoern's
tr,y"U lo ea"cL l*X,eX
unr/* ,in neu*t
cltoot;nfi f,fre u;gh { @"* Lrddrru
'te-*p&c'lu aeea. buL,o&'
nehrlotkn ln &rL lognrrW
a/Le deJaaW chatce'
,l,f"*n,t, und{* C EILUQ %ud

*uwn-[.' rleh'towt<x '


-T"nl, per{em n"Jl ln
Llt>aLn '
) Tnn h Un;L'u ,l o.-
no'tntfril4 et a"d'
ere /fi" *f"* drb* tn
Lu*ku ;nlfru wn{* -16L.
e,ot

lh, or^fi"t tn eryoeate'{


k be
uori'{,b ql-e Lt'b e!*0 ln
v) f-lur@,r^t Unah -N)-non "L anrL
T,8
t,o lwr"e k*k^a
k"ry'rffrff
+a*'5 ,j irfrt ffi1ff$ t'k*r 6'frte'Y
z"a$'b
tr"*CL ec- uorde* Sollmex
Sornedtr-LwW,b: @ jI*d- ALtenkw
CtSan:a;ngt rllyn @
t-r'y llw# feurrl
@ ell'
nermsrvS
€P\Kflr1fl)
r8
* &ooCP*t clur*t u 1 b-
slt/-1, ooorkt ooel/ "rung'| I
scgnf&cartt{-
'r"r/r*",ffi" c,e[-#wH*
.+g Arupuekcq t'h" non-
"l;+!.r*-#*br'{'itY
{#"
co' be soi nI?
l>rr{rl EW um{ o,ckvut;'trt $et-r,rcffidw,

Ar\X, &'
t nga,4-
@'

$;o't /h*L aTe eah b*Pkmt?o


a,cLrva-Ltcn' iurtc 4"4- e
anC,h^ve kry!* ,{rL d*f l'eaanitlq
ffrs,n su,ttoble
karnng vna]<lnVt
[6 er't €-t*"h L"#drL
Untt* *nd,
R", t+fid, l--r"es* iu*c!;n
L!/5€- 'tt rrc/;vot'on
Ppcli:*ei- bi nls,g- U"ib5
c, ,]
ff(o TYLAry-

n'*o-d Y'
xfucLf+id Ltpe-a* ww{n a'l'e \n'Ary
h:P ,! aL"L a.{hne Liu*S&ne[tdn:
Tx+ D
t, n
(-/ Qa

wiffr sn'L"ll )
'rectf fte I
A,C hu*{*
heFn
g
1"8

* GL^rk*ri c[e*r9y{ S$rt], ooorks wel/ tnu*,,/,rf


-I i

stgrtt#fca't{-{"
y"Cu** /t* *u'/ $u*c*ta'
'+* r* Pu*b.eu -ft* no r) - &'+!t6t",tfuA'h +'+, *:
sn{nL?
hrdrtrew u,ni[ a'ckvatttt ]wntffidnb ca'obe
d;*wg'"q-dPd' '

l,rr di"* unf


{fr , Pez$os-m
x JYI-yI,+ Soywtn L"jce* b,y
e{,Lt (kX"
^{naff,
*mrl t't*n Uhe Ar
lrt* 'rn*ttffi
Jurctt-m- b ,nnd;fL*)

n'**L u*
xp,".Li+rd Lipea,q. ww{n a'4'€' fuP'*!fr!
L"'^A SurY"*$;m :
fuP o{ at'L a.$ &i,'te
Tx+ D
t, U
q
Qa
toiffr swm/l ,

0.C kvo{* vecLrf r.L


11
*"*3
.rec L; {r.rL'bnLa'L
x fF" at;+f e-renb fuP*'t
e elm blrr{,
{.[-ax"D eV<.rg tolru't, )
e
trr^L^-&" L,oitL fiera dcf*a*;m*
lea,rilnX; elL
"*"^/rh 'lnnv" ft*r*
R"rh#ed- Lnrwa,l, Ur**[p
frb*",{ule Va-t-ue P"c{t{lc,thn
L.*y R"{-u
Pu-r.nrnetq-i" EuL u CPR.L(,
atrd.
w(, lT\ orrir@)
a]tanP a. srru-ll frkP* (b,tt'" b r*nA rn
lntlr;oL
ahieh caru I e-(Ltt"L Li [p
tl
slnPe
C LU R"
\*
i* Lt ye.*Pu o{ K valw*'
xlhrynd u,itf d,avrde r {h* ryila'xrrnlarn
ft,"* "*lf n&
Ea,c[, vYlaanut& wnt L
epmrnL ,+ on<.a? tkffi ff*fol
rymad-
gc4n = " Ti
,^Ci') d
leq
r-\.L
U lte ae'{ '; fnd*cen "erb thu @*ftY
* lr^J fie're (]l

J"t ?1*? i, L Cn-) K+5


P A/
f ruT JA;h Pwvicfe
-,-Ur!*_
t
osLhe {.umc{icn,
tury+ A.

A ,r>axauf un;L CAru learn carnf lux .eb


convex #uuzc

utff?t af b K-pi &@tr ) and-


ca, ba u*el
!-rnea'S'
a ckva.{tcv, !-unn{*n b br*Ae''1"'L te cfr{red'
&unr{lon* -

!n|,a*oul unib hnu* €- Nkryfi t vechrn CK)


fl
at1ru '2"e4*d'Ae 'Yn06e T*8 ul.aW,aa*;na ffi*"
ne-a-?- U r>t*Sl, u rl<** Jfru traidny
seL ii
ane ihe piece-fifhe c{ivi,.gncw a.,{.e b),nt{rd-. '
BO

*r l"laxeuL uwfus earu t*"lote- #r. n"'tb ed oj fM*"t"r

Cz) = r( A,

@R)
tl,
L*V* a,rfrrru{t* {uw{;an '

fh* fuP''tbol'"

G.Sf- clnnr\,,
*afft*e o'ie Livu"bart {ur* frmh
?l) f ( a
14 7')
L
Cz;)
becau*e, L"mh {oL [,;aa*
nat tecomrfl *rtnd
* SiYnnid un;lx aue n.{a tg/<'b du-e b h**
wtn{V ir> #*u& t
s'o^Id L-nrtb utfff>
ah
b,r-t caru be uh"A b"r."rL
b sahrzo**, p

agpzoPia & c-sxt *utnfuo+


t--LYr-
a-vl
Itaorfirt$.
t

q\,
fiiJ

x TE" lrAr* bolrc La,nffent C{*t',) a.cl;vahm irs


ics ochen
Prt! e6ved, ovet ffr* 1r(N
r{rnoid.

u*tL7/ 1tffrnndl ac{rvolion*


a&-e c o*mo"Ay^
{
* Srgrrnila'l' ar$rvo-L'ort "uwbf,S,
Te ctta{en{
n .tutor n* @t)
oauj, in 5e t

L
t

(j.. r,
tn i ,...

X 504iwtax uv>t I TePTeffn { O) prtnbab r,{Z drkbr&l-


kon anrL a&e u-bed a-b a'" 5N i bcA ;n ed auclrt

ffrr-b inlt/ve rn zrvtctl'tJl -mar,*yulal-tttrt


tecSutu>
{A*lr,"l, b^*la J*e{fun 06 P'9F t:
co'nZ

I a
_\_ X-rf
-p

T ll
#
e lrtl :
)
,
"(.)
"ll (

'fffrn T)oTe a'C{Ne ah%


{Prneb ol
x- {uncUtYL bu
a-
b N,, i .
8e cau*g .Ll-,

.ulb
4"9 ,no,*{ X ) rI con be
olw"fuY bo
'li++i

rLry bound'd,
I)q
7 Ca) = TYiQffi e 1 ) vn7n

a'c[ive 'L*fu 'f


TL,NLa;ltb a)/>
{Jrddrn uniLd'n6" ';;*"'/';dde'ru unr{ W
tWN,d" a-*d- T"La{'Y
vuYaain b bo
t

!l

frrchi{ecfwte DowArL ry

.,t1,""!- nelwflj<'r M'e eafferu e"c- inb


"{x 1vloxt r)
untln calfeC W ab.
Y&Ph'( 1't* '"^ Vartl
ulfib\ 'ln ewcl>
TEu nurnb'u o{
'ltu ta'$/1 {c
'/h, cr*Pl'ofli of fre& in
6t/L
ca,l,btrt oW*
&r.e.fuP'
x f/e u'\'L NeLuoo-yKh

)-ogern '
Jfl{4'uh ) lrrdC en
\- jadud-e 1l7tu*
X Cyrnr{lon
,(""ff*h uvrt{n ct
ancL yu'bP l,r* tt.*
,hc s Pec)fie'Y
b
,rlh,,L
J he a,Ychikcf*r* ee"cL'
a,4e @ry'ft-4 rgtl
ne uzlnzb oJ) 6 JuudJr*N
onr YY) -t'* WNr\
Jl'u AZT' "nrt
a*cl"'L'-chne'
}h
(frtnrc ["',t
i"P
"[
@
t h" $"tt,ut Ta+b Cr)
Ar)
L Cr) Cr) l^l
{, -d
-A

tht i^/U>
I h" Sece ud'
fc1
Ld ) c?) ) Irt,
T ct)
+ b
C e)
(" =tr
_!_*
anL 50 en. a-rf' u0
cqn*ideq*Liwb ea'ctr,
J he,,rna,trt
a*c['t&cbtt""{-
;;,;; tfr"dr4rh yrfr' ne fr,o&qK o"'L- ^"d/li'+
t

Ott
x+
(i)
Depffi o+ ffre,Nr{quyE, J /rsh mea l'tb
aw t it -go* rEW-L
ftn* *"ny lnyaan Ww r ,l- {to,a
,. flfi,u;:, (LbCtbLr-
(ii1 lnlrd-rt $ "Fwo t" boYeL unrtP hov* itt
man/41 node*@o) pl,ocuewi ry ff'*
,ubowo,L.
PtuNd'bh
,tt
Un;veunal
,
De{tfr c

TePu€fr
,nodefn CAlL o,^/Ll,
*'Lmen* lYh^te en $e
L,,le
ffr.Y'+
b -15") n b
lfl
L/ ) 1,1h,
nf,n lrnec^L
J&uf Lwrr>fi {ccrzb kc;t'h lodtl'o
ulte De ,,PL
ne of
X t,tle cen a, aoide 6on8e'
Le
Xitrtat
{5 oP Pu' ,
TTr"sUm t

ftP I^J?Jft
Wr{sn ",L Layu.
)K {n^v* ^*
l,lo
l*fr N)e l*rLcl''n(s fffrnofd'
aa

{
Ceaba*n acLZval*rn
#,;ad,ru u,r;/x
5rnd' Ei'LCI 'fheozenn b"*trilIV
oftrCI nrrYabdn htscX/,h, Ay' I t {\V
lr'o
* UrYr vru*^,(-
ne ,X^l ne "uo;d'e
g&rf $
ft*t LPD {'*v* a'
tuurtt C f"t

r@ri0 rt$
# hdclen
,

wu L
-;: r'
: D/L" T

{,- {hru -rA 7f?G Lat'ie}fr b


fnq-P
5e e ,fr, Ttr" b \-- La/)\v "
d

t
t

fra
a.) nbrulnk Volue f?rckhc^rtan: lfr;n /pa. AW" g
'
o.r*va#an- $urtchnt rhol €ltph rl,e6a*ve "nurr)b,uurb
b po*f{i,rc onffi, C,;rA.*tnY rntTvor rkwge* 't f#e
wvl5f"u'f' Ju'tctnn

r h-
rccfutcfltaYt "

a'od e44ec {*u"


K T{ altc L^,tfr 4"L rnql'e e$*iae*'b
do&-,p"!*
prt-ecrsrvt"f e{ Aatplw
fr* seveo ,t b"'"li/* '
Fc tr{twy fryu* 5 Pefe cern/euIY
oedac(afti th"
nc!-rrd^t'trg
SrYtpTcvinc|t
lfr,
leaortngt o,l,Vxr{fr"'rt, lf,u
o] fifi" )cI 'ocductn-gl
rrtn7o p&aQeVt, tu
urrk- '{ rtok(rt L
b e7
,
6r>
X T6" k r"*.f unc{t6}1,b oS bu^?

e*? b [.ra;"n beca-uge ^ffiaruy /rr"y unc{;a<$


.{e.r"<nLt io Co1'lveoe eP -ETrfit ryfrr,rz
b/err* urh"r1
PL6
EPP!,te&- ta .l-rneoo mrd"Lu. tlA -

uoi ffr /i"Llnu' LaUe ah


XF Mnert.aooJa
C
XirrLa-fu'trl
tt n ,)
JErctmeeo*"K' '
pdwido a' uwwryna't' ePPto
tt asr* flrov* oL/"''
,( ;gm'iveThal ufP"ou?rnafrrr>
cla*z "4 a,c[rvohcn
beu* F m\/e& #':'- a' uord<a

)$-

L
x

dafu' co,o*air* tAh, a,",',


"fu>CIot-*,
-f,'-^t-rt/-
T{
Ne fi'ou/<''s',Ztffi;" *fifu'
x Fe<d, .CI n e c {eA Ftn
cx/$ W'#;,;;*^
hat n-'{":A- f l}'6<'n 'rnr}t'*;rt'
{nave ef;*''L
c
ne[,oakn
JIru**
"&"l;"rfr:;. f;al*W I

i
t

Ac{rwt^?rl Func*'Vrut ,EwcL .unf# -Eot'h )1

l*r/u
lntclrt"* U u,be,b avL !' a,c&un*faYt

b fiLtoDcltrc eno
a,c*vat:r,t $** clzrm- r* fi"
On. Cfr{nrrlolrl

lnyi,vhc srgmef
I ) w&t"rd- ,[w'^* un#eP"Lu)
LnlroJrrce ffr" obrl'q' &,
,1hr*, Su*r{fm,b in, {fr.
eX- ) l')o{) knen L pa{lernh
coTD Luw

V"fute> { ill soo&


Inl[;c l* olt a.4,el*r* "*d-
vep 6epen
jr> ou* {eh,
*
c$ Pa
#ern* *f
.rq
a.
.ll) urr"e 4*w'o''ry P
A%nt,5rET &{t
-N'''lh
L w
>
percnp{a*,
cfr1/L -rePTelf,n
(ruili
G*p/ex tVeu*".(-
O*r! krnrl a! ,fu*ciimt ^
[x!"{r,csLk
ella ,-r/" ne"esL n,e/WoqX ca* be ur),"e,
'L,
A l.wtre
tt al-
d-o tt. 2
,,cf t't' &,- zeclxa'* I

8,"* Uyu Fdx /'

D/4ji "*l*X, a
Jfr. me'ff*'l uce u*e La
(i)
a

D
duurrtnt Loatrsng
{h" nefr,mrif
sett{nX*
,*/#
"

j**
"4*t W)e &** ff,, ,*rt'u
@ eYetflr
tk,W( ) '**]o
'ifr^* i{ dw.snt Noz"K
tu lrsell
L lh, {ru;ntrtY d-r,fu"
0"v'> ne,r-C ^

coTretil{ p.L m e Jhr,{,


fh^t
,JffrlknLf
t'h,"w rA nrt
Z:jf.LL
*
on€- P€
eV"uY P/4n
blrm. * bfg
LOCTKh b,/rL$'Z Can- rnake'

(_
,
&,t I

lt
$ur''.cLio
A,'
Ll{

W{ r{ 'il*'J tr@ lrr) DeeP Neuml" Net-NsaKroi


Slr* 6
toflt rec*J-nt AE hva*an
1 "A d""p rteurol rte fruoztK
#e
,T nL
Te/9euuYl'x lunciltwt
co.t/L
4**$io,h
n SLol'/o* nelcoor1 Can
,rt.
7e FTe.hen 4"J- WlrX! piu& -
e. ilw,e kuw&;ffib
co*b'
Lt{Yr,b ,bbe tec A^ f-,"'#
Latbe lf neat* a,chvo#on- Ju*"
Ctr) mne-out tt'u'?+g' Ctul lr^u,

t
tu '--.,-()

l.x and I
^fd

Ca o al
n

th e nunnbet o{ hr?ea)- cL-;nF,r#, d -.{' )


ne traot HN
d"u{) uech&"a kddrn ) if,
aod- r>* urrl[/\
n
J" cl -D &
YT
01,
L.
i,
.j
C'L
rf
-rt, . Jftt'he Catr
}n t'fi" drp {
i.e ewPon "*ti".l * ftt'tfu'o1' PLd unz|" '/#"
r)e {,,coa|z usfft, l<
&*x*b rf,W .a-D o ^ Fa

nwnbu os lrne*
cj,-)o& @
K
[sl;evr-+t*
oce
Cl,>toXf*V Cl/ 4-, vn.drl rneanh tian
"p ,ifa Gmfis;
)eaT{L ur[ur*
$rnclrart- t^le ci)a'n t b
(r)

'x sWb Wr€-l b o'ffo""i|€


a$ 5t,^/, ?unc{uvr* Coo)
U

-l t,D r(l klru"l-


al, p@cLb"L"?
e@rgL
h*dh {- rWU*
t-!-
b b"t{"" 0
Gn*{un dry# {dkb ) arS Slxwo b,gt

(YL 4"t ,,orde Vtu;e{y 4


kefr a,

rrtpiu; *l zwbe o-' uAe +,^,{,


e

U/"{r.3 d""f *4 ,4
bchLrLb e/-{'
juactnon* Jf*
gfjr"rL))
l'Y1 &"1 kow''*
D,q/. p5- cvej- *fru s'P'e tr

of?* AzcfruLe'#""q'l'
hJb,t" n eU,qnL neStoovt{*
ett
lfua"
a.b C[4J0" tq'
have
aLC[";bcb''zu-U


Figure: 1.1
Deeper neural networks perform better and generalize better than shallower ones when transcribing multi-digit numbers from photographs of addresses.

The connections between layers in a neural network are important to consider in architecture design.

1. In a standard neural network, every input unit is connected to every output unit.
2. Specialized networks may have fewer connections, reducing the number of parameters and the computation needed.
3. Different applications may require different connection strategies.
4. Convolutional networks in computer vision use sparse connections that work well for that type of problem.
5. Specific advice for architecture design may vary depending on the application.
6. Future chapters will explore more architectural strategies for different application domains.

[Figure: plot against the number of parameters (x 10^8), with curves labeled "3, convolutional", "3, fully connected", and "11, convolutional".]

Figure: 1.2 Deeper neural networks tend to perform better than shallow ones because they express a preference for composing many simpler functions together, allowing more complex representations and sequential processes to be learned.
Back-Propagation and Other Differentiation Algorithms
1. Back-propagation is a method for computing the gradient in a neural network.
2. It allows information to flow backwards through the network to compute the gradient.
3. Back-propagation is used in conjunction with a learning algorithm such as stochastic gradient descent.
4. Back-propagation can be used to compute derivatives for any function, not just neural networks.
5. We will describe how to compute the gradient $\nabla_x f(x, y)$ for an arbitrary function $f$, where $x$ is a set of variables whose derivatives are desired, and $y$ is an additional set of variables.
6. The gradient we most often require in learning algorithms is the gradient of the cost function with respect to the parameters, $\nabla_\theta J(\theta)$.
7. Back-propagation can also be applied to computing other derivatives and to analyzing learned models.
8. The idea of computing derivatives by propagating information through a network is very general and can be used for multiple outputs.
Computational Graphs
1. Using computational graphs helps us describe neural networks more precisely.
2. Each node in the graph represents a variable.
3. An operation is a function of one or more variables, and our graph language is accompanied by a set of allowable operations.
4. The output of an operation is a single variable or a vector.
5. We draw a directed edge from the input variable to the output variable when an operation is applied to the input to compute the output.


\Jq)

pk6c/k5 +

.L
-r

Chain Rule of Calculus

The chain rule of calculus is used to compute the derivatives of functions formed by composing other functions whose derivatives are known.

Back-propagation is an algorithm that computes the chain rule, with a specific order of operations that is highly efficient.

Let $x$ be a real number, and let $f$ and $g$ be functions mapping from a real number to a real number. Suppose that $y = g(x)$ and $z = f(g(x)) = f(y)$. Then the chain rule states that

$\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}$

We can generalize this beyond the scalar case. Suppose that $x \in \mathbb{R}^m$, $y \in \mathbb{R}^n$, $g$ maps from $\mathbb{R}^m$ to $\mathbb{R}^n$, and $f$ maps from $\mathbb{R}^n$ to $\mathbb{R}$. If $y = g(x)$ and $z = f(y)$, then

$\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j} \frac{\partial y_j}{\partial x_i}$

In vector notation this may be equivalently written as

$\nabla_x z = \left( \frac{\partial y}{\partial x} \right)^\top \nabla_y z$

where $\frac{\partial y}{\partial x}$ is the $n \times m$ Jacobian matrix of $g$.

From this we see that the gradient of a variable $x$ can be obtained by multiplying a Jacobian matrix $\frac{\partial y}{\partial x}$ by a gradient $\nabla_y z$.
Q/9"OLru e"t
ll' 7, -JFe
fart n*t f'; ft, skPe "
L
3
A. 0 tr ,/
S+e+Dosb e+
.(\' $ttr>chon
"Lt
Dx
Eon 'P[a ,+ C*,p,A"qn"'tGq^t*
uhty-t'hu xe? eW*rcrt- {"
@) the YyL
C*rffu T =z-Ll,
4
U
.J

(b) fl,, ?L.PL


'{*rfr*'togu'ti" *4"7*u M pl*&'"t'"" ^

+
"

tr;o-"CxT
* rD
-r,q

e:Xpue'l'?tfft ll
trll>,,1'
+ a(
f z"rhh.il ];n o* u-rrtL'
,rrrtLrix- Co
H BrveYt "^dun'Zn
ga- YYLfru bnicL o{
t"f& X

x
L Surn

,LC,

L
L +o

! lre {,rd*t '{ w va'fue witk ,'-YP"y' b o

1",,,**X, hle hlnte xfr t ju,t a,b bf X * NL6e a'


Veciat . -&'fl'
Te-Pwfre'* rff"
a*P/"t"
A Srngle r/a'ua ble ; k
o{ fndrcen.
index {"ptrn i} X,
)L6iveu*
P-6v all Po**ible

uzib 1fr, ,clrnnrt vu,b ob 2t


hJe CA{L
U*ry ffu* n@efitun ,
lren h, QYljtot.h,
"fiP
*rA Z { Cfi,,If*
T{Y tr(x)

Z v, a7,
(

x [, X
a Y;
J
ftPfWrW fke cfroto e"hb oAbn
Rcouo*rw'Qt
/+!

th* lwt CE o,oou,Ldbe


+ Cc *p"-h ry 5 a w-t e etb e,X{d"}rXf tr^t
eould, be a,
3oflte Caffb ) bn t "Lr> rrt*u
@beh
ura*'hw lr>
,"fil6' ,J aP
YY* corb tr
NNI,
va,Ud- b zedfir^ rnernoNr cptr8t-t,

L
Ol.k.NLadanwd,"aft 4a

\.__

? C!,) fi)
t a, ,,vL
h-z
t\,

@ Jhe alSor; /frm s*6f;en


Ao, b c""2cah-te k"wd'
6eP*eb'nLtd' a'b a
) lnil;cl, ca,,tbe fryA
Ctlled, Q. o* ex{n
'
a{':an 2 lr"l, neecl
@ To perlnwt b*ckPn-'t'*g ar neN 5ub Cr//rd' B

sat '{ nnden\; b ,


G C'aeafifrty
mocl'e in 6'
o,nd-,U
i:

wde j-o, ereL


b 8 'frru 0ne 'ifi*
witk,
ab%€cIafud'
drouolr,0 aodeo a{ Ln)
UC wrye
/t, Ow
WrrnT '#* deutvnhte
i

8 C"'{'&t' Dru
,Edct' nPde iru
l,l, )
(,{,
nade

a Lf"
(n) Dr* ci)
Cn) j)
e U,
(i )
e L.[,"'
Y
i,j€P{u
e
ci)
a
ti+
pofuff1t Tfr,
1 fr Pruc'duve t't''t 'cLeD
€,
fll rnPub go *Cn;)

{"L i=lr."s.,flid,
U,C
i) ,/_ X;
end- +oL
dn
M
D
)= fi; *L, *r-t)
n

ci)
A ,"[irlj € P" C*'o)
C-o) ci)
ff't) { C,q

enA {"L
Cn)
tetutrt U-
*i"' i

Algou ;lh-rn- Ex Plrunlta+ 1


!\
dn a' set oj
computa/tr'*
J/ran a!6oa;fh'm PeefozmZ
O
nPu"t vo'!rue* ' eacL r*'/'
tc/,wre
@ tL u'b,'b a, Corn faft
.L-
Ar-
TeP{wen? ly "PPWry
DLLrnefftcal
vebfih
co'n'P'k
@W nodrrg
set '{
tu SurrrcLion t" Ar
JL ,] evf oub 7*r/nZ
PT
-c," a,\e, Jne Wbnnb rcnL seb oj
ffrr*e aL{u,rnentp b"ls ,y=b t'fi" P"
f nler
art{
L wfff" a- -!,, Ur*
nw nde
lhe 4, tYt s"{ '4
a'4'e seL trtb'lfre
5 76. ,nP"* voluo*
,ur{rh, a+ { tl,, lrt { wde a

,rpd.
vabe W rLi ;"P"b
@ Tfr" ot''$ttd b*oO
Ib du!'ntd'
algccfrfhm
@ Jhe
b"* ortP"t ,J
ed, oW b"*rj'r-l 0r> lfr*
cad-rt'
A/g-e oubput Valua
@76. *''rrdeh 1 0nrL fh" &o&
L vfoa
Valuen
aj Poe
e

o{- {t*
a'
lh* wylt
vlt
@ 3"[ *p o ,tacsp ?tr"{ N;tt ;b?}"fu
lh" e,s cS,

@
@

@
(@

TE
L*u*og'
o GfiArten{- 808"d
rrlolen
cnA, $lt rL Di{{er"n{io*furt
@ B*ck PnryYLrm
nl,gru;ttw' l\bt05
I
t

t. J

6.2 Gradient-Based Learning

• The largest difference between linear models and neural networks: the nonlinearity of a neural network causes most interesting loss functions to become non-convex.
• Because of this, neural networks are usually trained with iterative, gradient-based optimizers (rather than the linear equation solvers or convex optimization methods used for models such as SVMs).
• Convex optimization converges starting from any initial parameters (in theory; in practice it is robust but can encounter numerical problems).
• In contrast, stochastic gradient descent applied to non-convex loss functions has no such convergence guarantee and is sensitive to the values of the initial parameters.
• For feedforward neural networks, it is important to initialize all weights to small random values.
• Biases may be initialized to zero or to small positive values.
• The training algorithm is almost always based on using the gradient to descend the cost function in one way or another.
• The specific algorithms are usually improvements of the stochastic gradient descent algorithm.
• Gradient descent can also be used to train simpler models such as linear regression and support vector machines (common when the training set is extremely large).
• The gradient can be obtained efficiently for a neural network using the back-propagation algorithm and its modern generalizations.
• To apply gradient-based learning to neural networks, we need to choose a cost function and an output representation.

6.2.1 Cost Functions

• Cost functions for neural networks are more or less the same as those for other parametric models, e.g. linear models.
• In most cases, our parametric model defines a distribution $p(y \mid x; \theta)$.
• We then use the principle of maximum likelihood, resulting in the cross-entropy between the training data and the model predictions as the cost function.
• Sometimes we take a simpler approach: rather than predicting a complete probability distribution over $y$, we merely predict some statistic of $y$ conditioned on $x$.
• In that case we use a specialized loss function to train a predictor of these estimates.
• The total cost function often combines a primary cost function with a regularization term.

6.2.1.1 Learning Conditional Distributions with Maximum Likelihood

• Most modern neural networks are trained using maximum likelihood.
• The cost function is then the negative log-likelihood, equivalently described as the cross-entropy between the training data and the model distribution:

$J(\theta) = -\mathbb{E}_{x, y \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(y \mid x)$

'.'.'fi :':il:il$[ilH]{il}d:',".x:ff##Jfr i;i*ff :lkml:"HiliT:l::.'.i,::f t1


• Advantage of deriving the cost function from maximum likelihood: it removes the burden of designing a cost function for each model. Specifying a model $p(y \mid x)$ automatically determines a cost function $\log p(y \mid x)$.
• Recurring theme throughout neural network design: the gradient of the cost function must be large and predictable enough to serve as a good guide for the learning algorithm.
• Functions that saturate (become very flat) undermine this objective because they make the gradient become very small.
• In many cases this happens because the activation functions used to produce the output of hidden units or output units saturate.
• Negative log-likelihood helps avoid this problem for many models:
  ◦ many output units involve an exp function that can saturate when its argument is very negative
  ◦ the log function in negative log-likelihood undoes the exp of some output units
• An unusual property of the cross-entropy cost used for maximum likelihood estimation: it usually does not have a minimum value when applied to the models commonly used in practice.
  ◦ For discrete output variables, most models are parameterized in such a way that they cannot represent a probability of zero or one, but can come arbitrarily close to doing so.
  ◦ For real-valued output variables, if the model can control the density of the output distribution (e.g. by learning the variance of a Gaussian), then it becomes possible to assign extremely high density to the correct training set outputs, resulting in cross-entropy approaching negative infinity.
• Regularization is needed to avoid overfitting like this.
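
The cost described above can be computed directly. The sketch below is only an illustration with made-up numbers: `average_nll` averages $-\log p_{\text{model}}(y \mid x)$ over examples, and `gaussian_nll` shows how the cost has no lower bound once a model is allowed to shrink the variance of a Gaussian placed on the correct target. Both function names and all values are assumptions made for this example.

```python
import math


def average_nll(model_probs):
    """Cross-entropy cost: mean of -log p_model(y | x) over the examples.

    model_probs[i] is the probability the model assigns to the target
    that was actually observed for example i.
    """
    return -sum(math.log(p) for p in model_probs) / len(model_probs)


def gaussian_nll(y, mean, sigma):
    """-log N(y; mean, sigma^2) for a scalar target y."""
    return (0.5 * math.log(2.0 * math.pi * sigma ** 2)
            + (y - mean) ** 2 / (2.0 * sigma ** 2))


# A model that puts more probability on the observed targets has lower cost.
confident = average_nll([0.9, 0.8, 0.95])
unsure = average_nll([0.5, 0.4, 0.6])

# With the mean exactly on the target, shrinking sigma keeps lowering the
# cost, so the cost is unbounded below without regularization.
shrinking = [gaussian_nll(1.0, 1.0, s) for s in (1.0, 0.1, 0.001)]

print(confident, unsure, shrinking)
```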

6.2.1.2 Learning Conditional Statistics

• Instead of learning a full probability distribution $p(y \mid x; \theta)$, we often want to learn just one conditional statistic of $y$ given $x$.
  ◦ e.g., we may have a predictor $f(x; \theta)$ that we wish to predict the mean of $y$
• Using a sufficiently powerful neural network, we can think of the network as being able to represent any function $f$ from a wide class of functions, with this class being limited only by features such as continuity and boundedness (rather than by having a specific parametric form).
• From this point of view, we can view the cost function as being a functional rather than just a function.
  ◦ functional: a mapping from functions to real numbers
• Thus, we can think of learning as choosing a function rather than merely choosing a set of parameters.
• We can design our cost functional to have its minimum occur at some specific function we desire.
  ◦ e.g., design it to have its minimum lie on the function that maps $x$ to $\mathbb{E}[y \mid x]$
• Solving an optimization problem with respect to a function requires calculus of variations.
• Two results derived from calculus of variations:

1. Solving the mean squared error optimization problem

$f^* = \arg\min_f \mathbb{E}_{x, y \sim p_{\text{data}}} \lVert y - f(x) \rVert^2$

yields

$f^*(x) = \mathbb{E}_{y \sim p_{\text{data}}(y \mid x)} [y]$

i.e., if we could train on infinitely many samples from the true data-generating distribution, minimizing the mean squared error cost function gives a function that predicts the mean of $y$ for each value of $x$.

2. Solving the mean absolute error optimization problem

$f^* = \arg\min_f \mathbb{E}_{x, y \sim p_{\text{data}}} \lVert y - f(x) \rVert_1$

yields a function that predicts the median value of $y$ for each $x$, so long as such a function may be described by the family of functions we optimize over.

• Mean squared error and mean absolute error often lead to poor results when used with gradient-based optimization.
  ◦ Some output units that saturate produce very small gradients when combined with these cost functions.
• This is one reason that the cross-entropy cost function is more popular, even when it is not necessary to estimate an entire distribution $p(y \mid x)$.
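
The two results above can be checked numerically for a single fixed $x$. The sketch below is a brute-force illustration, not a derivation: it assumes a handful of made-up targets `ys` observed for that $x$, scans a grid of constant predictions, and reports which one minimizes each cost. All names and values are chosen for this example only.

```python
import statistics

# Hypothetical targets observed for one fixed input x (note the outlier).
ys = [1.0, 2.0, 2.0, 3.0, 10.0]


def mean_squared_error(c):
    """Mean squared error of the constant prediction c."""
    return sum((y - c) ** 2 for y in ys) / len(ys)


def mean_absolute_error(c):
    """Mean absolute error of the constant prediction c."""
    return sum(abs(y - c) for y in ys) / len(ys)


# Candidate predictions 0.00, 0.01, ..., 12.00.
candidates = [i / 100.0 for i in range(0, 1201)]

best_for_mse = min(candidates, key=mean_squared_error)
best_for_mae = min(candidates, key=mean_absolute_error)

print(best_for_mse, statistics.mean(ys))     # minimizer of MSE vs. the mean
print(best_for_mae, statistics.median(ys))   # minimizer of MAE vs. the median
```

The outlier pulls the mean squared error minimizer toward it, while the mean absolute error minimizer stays at the median.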

6.2.2 Output Units

• The choice of cost function is tightly coupled with the choice of output unit.
• Most of the time, we use the cross-entropy loss between the data distribution and the model distribution.
• The choice of how to represent the output then determines the form of the cross-entropy function.
• Any kind of neural network unit that may be used as an output can also be used as a hidden unit.
• Suppose that the feedforward network provides a set of hidden features defined by $h = f(x; \theta)$.
• The role of the output layer is then to provide some additional transformation from the features to complete the task that the network must perform.

6.2.2.1 Linear Units for Gaussian Output Distributions

• Linear unit: an output unit based on an affine transformation with no nonlinearity.
• Given features $h$, a layer of linear output units produces a vector $\hat{y} = W^\top h + b$.
• Often used to produce the mean of a conditional Gaussian distribution:

$p(y \mid x) = \mathcal{N}(y; \hat{y}, I)$

• Maximizing the log-likelihood is then equivalent to minimizing the mean squared error.
• Maximum likelihood makes it straightforward to learn the covariance of the Gaussian, or to make the covariance of the Gaussian be a function of the input.
  ◦ However, the covariance must be constrained to be a positive definite matrix for all inputs.
  ◦ It is difficult to satisfy such constraints with a linear output layer, so typically other output units are used to parameterize the covariance.
• Linear units do not saturate, so they pose little difficulty for gradient-based optimization algorithms.
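
A small numeric sketch of why maximizing the Gaussian log-likelihood matches minimizing squared error: with identity covariance, the negative log-likelihood is half the squared error plus a constant that does not depend on the parameters. The weights, features, and targets below are made up, and `linear_output` and `gaussian_nll_identity_cov` are names chosen for this illustration.

```python
import math


def linear_output(W, h, b):
    """Compute y_hat = W^T h + b.

    W is stored as a list of rows, one row per entry of h, so that
    W[i][k] connects feature i to output k.
    """
    return [sum(W[i][k] * h[i] for i in range(len(h))) + b[k]
            for k in range(len(b))]


def squared_error(y, y_hat):
    return sum((a - c) ** 2 for a, c in zip(y, y_hat))


def gaussian_nll_identity_cov(y, y_hat):
    """-log N(y; y_hat, I) for a vector target y."""
    n = len(y)
    return 0.5 * squared_error(y, y_hat) + 0.5 * n * math.log(2.0 * math.pi)


# Hypothetical features, weights, biases, and target.
h = [1.0, 2.0]
W = [[0.5, -1.0],
     [0.25, 0.0]]
b = [0.1, 0.2]
y = [1.0, -1.0]

y_hat = linear_output(W, h, b)
nll = gaussian_nll_identity_cov(y, y_hat)
constant = nll - 0.5 * squared_error(y, y_hat)   # equals log(2*pi) for n = 2

print(y_hat, nll, constant)
```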

6.2.2.2 Sigmoid Units for Bernoulli Output Distributions

• Many tasks require predicting the value of a binary variable $y$, esp. two-class classification.
• The maximum likelihood approach: define a Bernoulli distribution over $y$ conditioned on $x$.
• The neural net only needs to predict $P(y = 1 \mid x) \in [0, 1]$.
• This constraint could be satisfied with a thresholded linear unit, but this couldn't be trained effectively with gradient descent (the gradient is zero outside the unit interval).
• A better approach, which ensures that there is always a strong gradient whenever the model has the wrong answer, is based on sigmoid output units combined with maximum likelihood.
• Sigmoid output unit:

$\hat{y} = \sigma(w^\top h + b)$

• A sigmoid output unit has two components: first, it uses a linear layer to compute $z = w^\top h + b$; then it uses the sigmoid activation function to convert $z$ to a probability.
• How to define a probability distribution over $y$ using the value $z$:
  ◦ The sigmoid can be motivated by constructing an unnormalized probability distribution $\tilde{P}(y)$, then dividing by an appropriate constant to obtain a valid probability distribution.
  ◦ Begin with the assumption that the unnormalized log probabilities are linear in $y$ and $z$:

$\log \tilde{P}(y) = yz$
$\tilde{P}(y) = \exp(yz)$
$P(y) = \frac{\exp(yz)}{\sum_{y'=0}^{1} \exp(y'z)}$
$P(y) = \sigma((2y - 1)z)$

  ◦ The last step relies on $y \in \{0, 1\}$.
• The $z$ variable defining such a distribution over binary variables is the logit.
• This approach to predicting the probabilities in log-space is natural to use with maximum likelihood learning.
• Because the cost function used with maximum likelihood is $-\log P(y \mid x)$, the log in the cost function undoes the exp of the sigmoid.
• This keeps the saturation of the sigmoid from preventing gradient-based learning from making progress.
• The loss function for maximum likelihood learning of a Bernoulli parameterized by a sigmoid:

$J(\theta) = -\log P(y \mid x) = -\log \sigma((2y - 1)z) = \zeta((1 - 2y)z)$

where $\zeta(a) = \log(1 + \exp(a))$ is the softplus function.
6.5 Back-Propagation and Other Differentiation Algorithms

• Back-propagation can compute derivatives for any function, i.e. compute $\nabla_x f(x, y)$ for an arbitrary function $f$
  ◦ $x$: a set of variables whose derivatives are desired
  ◦ $y$: an additional set of variables that are inputs to the function (but whose derivatives are not required)
• Most often, we want to calculate the gradient of the cost function w.r.t. the parameters: $\nabla_\theta J(\theta)$

6.5.1 Computational Graphs

• To discuss backprop, it's useful to first develop a computational graph language.
• Let each node in the graph indicate a variable (a scalar, vector, matrix, tensor, or other).
• Introduce the idea of an operation: a simple function of one or more variables.
• The graph language is accompanied by a set of allowable operations.
• Functions more complicated than these operations may be described by composing many operations together.
• WLOG, define an operation to return only a single output variable.
  ◦ The output variable could have multiple entries, e.g. a vector.
• If a variable $y$ is computed by applying an operation to a variable $x$, then we draw a directed edge from $x$ to $y$.
• We sometimes annotate the output node with the name of the operation applied, and other times omit the label when the operation is clear from context.

6.5.2 Chain Rule ol Calculus

rule of probability)
o (not to be confused with the chain are known
other functions whose derivatives
r to compute the ouriuutira. ol functions formed by composing highly etficient
used of operations that is
the chain ,rru, *iti u ,p.lific order
. backprop is an algorithm tnat computes
o let:
o r€iR
. f,gtR+R
o y: g(r)

o then,
. #:#H
r this gentiitizeo beyond the scalar case:
caiie
o suPPose that:
r x€lR-,Y€lR'
o g ;lR- *r lR"
\- o /:Rnr+lR
. . y:9(x)
. z: lU)
.- then,
"-' a- : Dz dlti
'r ind;vectorLinotation:
6i6i
r V*(z): (H) V,(,)
x can be obtained bv murtiprvins
a Jacobian matrix ff bv
: ffi,ffi ;l:Jj;J,'i^.Tiin[tH:flr.
a gradient Vr(z) !---Li^^ in the
^rarrianr nrnrtrrnt for each ooeration
lthebackpropalgorithmconsistsolperlormingsuchaJacobian.gradientproductforeachopt
graph , -_:--_r:r,
of arbitrary dimensionality
o usually, backprop is not applied to vectors' but tensors

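o conceptually, this is exactly the same as backprop with vectors; the only difference is how the numbers are arranged in a grid to form a tensor
o could flatten each tensor into a vector before running backprop, computing a vector-valued gradient, and then reshaping the gradient back into a tensor. In this view, backprop is still just multiplying Jacobians by gradients
o denote the gradient of a value z w.r.t. a tensor X by $\nabla_X z$
o the indices into X have multiple coordinates
  o abstract this away by using a single variable i to represent the complete tuple of indices
  o for all possible index tuples i, $(\nabla_X z)_i$ gives $\frac{\partial z}{\partial X_i}$
o the chain rule as it applies to tensors: if $Y = g(X)$ and $z = f(Y)$, then

$$\nabla_X z = \sum_j \left(\nabla_X Y_j\right) \frac{\partial z}{\partial Y_j}$$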
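The vector form of the chain rule in this section can be checked on a small example. The sketch below is only illustrative: the functions g and f, the hand-written Jacobian, and the test point are assumptions chosen for this example, not part of the notes.

```python
# Assumed example: g maps R^2 -> R^3, f maps R^3 -> R.
def g(x):
    x0, x1 = x
    return [x0 * x1, x0 + x1, x0 ** 2]

def jacobian_g(x):
    # Rows index outputs y_j, columns index inputs x_i (a 3 x 2 matrix).
    x0, x1 = x
    return [[x1, x0], [1.0, 1.0], [2 * x0, 0.0]]

def grad_f(y):
    # z = f(y) = y0 * y1 + y2, so grad_y z = (y1, y0, 1).
    return [y[1], y[0], 1.0]

def grad_x(x):
    # grad_x z = J^T grad_y z: one Jacobian-gradient product.
    y = g(x)
    J = jacobian_g(x)
    gy = grad_f(y)
    return [sum(J[j][i] * gy[j] for j in range(len(y))) for i in range(len(x))]

grad = grad_x([2.0, 3.0])
print(grad)
```

Expanding z by hand gives z = x0^2 x1 + x0 x1^2 + x0^2, whose partial derivatives at (2, 3) agree with the product computed above.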

6.5.3 Recursively Applying the Chain Rule to Obtain Backprop
o using the chain rule, it is straightforward to write down an algebraic expression for the gradient of a scalar w.r.t. any node in the computational graph that produced that scalar
o actually evaluating the expression introduces extra considerations
  o many subexpressions may be repeated several times within the overall expression for the gradient
  o for complicated graphs, computing these expressions multiple times can make a naive implementation of the chain rule infeasible
o first, consider a computational graph describing how to compute a single scalar $u^{(n)}$ (e.g., the loss on a training example)
o want to obtain the gradient w.r.t. the $n_i$ input nodes $u^{(1)}$ to $u^{(n_i)}$
o in the application of back-propagation to computing gradients for gradient descent over parameters:
  o $u^{(n)}$ will be the cost associated with an example/minibatch
  o $u^{(1)}$ to $u^{(n_i)}$ correspond to the parameters of the model
o assume the nodes are ordered such that we can compute their output one after the other, starting at $u^{(n_i+1)}$ and going up to $u^{(n)}$
o each node $u^{(i)}$ is associated with an operation $f^{(i)}$ and is computed by evaluating the function

$$u^{(i)} = f^{(i)}(\mathbb{A}^{(i)})$$

o here, $\mathbb{A}^{(i)}$ is the set of all nodes that are parents of $u^{(i)}$
o Algorithm 6.1: forward computation
  o for $i = 1, \ldots, n_i$ do
    o $u^{(i)} \leftarrow x_i$
  o end for
  o for $i = n_i + 1, \ldots, n$ do
    o $\mathbb{A}^{(i)} \leftarrow \{u^{(j)} \mid j \in Pa(u^{(i)})\}$
    o $u^{(i)} \leftarrow f^{(i)}(\mathbb{A}^{(i)})$
  o end for
  o return $u^{(n)}$
o that algorithm specifies the forward propagation computation, which could be put in a graph $\mathcal{G}$
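The forward computation above can be sketched in a few lines. This is a minimal illustration under stated assumptions: nodes are numbered so that ascending order is a valid evaluation order (as the notes assume), and the `ops` dictionary and the three example operations are hypothetical names invented for this sketch.

```python
import math

def forward(inputs, ops):
    """Evaluate a scalar graph. Nodes 1..n_i are the inputs; ops maps
    node index -> (function, list of parent indices)."""
    u = {i: x for i, x in enumerate(inputs, start=1)}   # u(i) <- x_i
    for i in sorted(ops):                               # ascending = evaluation order
        f, parents = ops[i]
        u[i] = f(*[u[j] for j in parents])              # u(i) <- f(i)(A(i))
    return u

# Assumed example graph: u3 = u1 * u2, u4 = exp(u3), u5 = u4 + u1.
ops = {
    3: (lambda a, b: a * b, [1, 2]),
    4: (math.exp, [3]),
    5: (lambda a, b: a + b, [4, 1]),
}
u = forward([0.5, 2.0], ops)
print(u[5])
```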
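o to perform backprop, we construct a computational graph that depends on $\mathcal{G}$ and adds to it an extra set of nodes
  o these form a subgraph $\mathcal{B}$ with one node per node of $\mathcal{G}$
  o computation in $\mathcal{B}$ proceeds in exactly the reverse of the order of computation in $\mathcal{G}$
  o each node of $\mathcal{B}$ computes the derivative $\frac{\partial u^{(n)}}{\partial u^{(i)}}$ associated with the forward graph node $u^{(i)}$:

$$\frac{\partial u^{(n)}}{\partial u^{(j)}} = \sum_{i:\, j \in Pa(u^{(i)})} \frac{\partial u^{(n)}}{\partial u^{(i)}}\,\frac{\partial u^{(i)}}{\partial u^{(j)}}$$

o the subgraph $\mathcal{B}$ contains exactly one edge for each edge from node $u^{(j)}$ to node $u^{(i)}$ of $\mathcal{G}$
  o the edge from $u^{(j)}$ to $u^{(i)}$ is associated with the computation of $\frac{\partial u^{(i)}}{\partial u^{(j)}}$
o for each node, a dot product is performed between:
  o the gradient already computed w.r.t. the nodes $u^{(i)}$ that are children of $u^{(j)}$
  o the vector containing the partial derivatives $\frac{\partial u^{(i)}}{\partial u^{(j)}}$ for the same children nodes $u^{(i)}$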

o the amount of computation required for performing backprop scales linearly with the number of edges in $\mathcal{G}$
o computation for each edge corresponds to computing a partial derivative (of one node w.r.t. one of its parents) as well as performing one multiplication and one addition
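o Algorithm 6.2: simplified version of backprop (for computing the derivatives of $u^{(n)}$ w.r.t. the variables in the graph)
  o simplifications: all variables are scalars, and we compute derivatives of all nodes in the graph
  o run forward propagation (Algorithm 6.1) to obtain network activations
  o initialize gradtable: a data structure that will store the computed derivatives
    o gradtable[$u^{(i)}$] will store the computed value of $\frac{\partial u^{(n)}}{\partial u^{(i)}}$
  o gradtable[$u^{(n)}$] $\leftarrow 1$
  o for $j = n - 1$ down to 1 do
    o gradtable[$u^{(j)}$] $\leftarrow \sum_{i:\, j \in Pa(u^{(i)})}$ gradtable[$u^{(i)}$] $\frac{\partial u^{(i)}}{\partial u^{(j)}}$
    o this computes $\frac{\partial u^{(n)}}{\partial u^{(j)}} = \sum_{i:\, j \in Pa(u^{(i)})} \frac{\partial u^{(n)}}{\partial u^{(i)}}\,\frac{\partial u^{(i)}}{\partial u^{(j)}}$
  o end for
  o return {gradtable[$u^{(i)}$] | $i = 1, \ldots, n_i$}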
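A minimal sketch of Algorithm 6.2 for scalar nodes follows. Assumptions, none of which come from the notes: the `OPS` table layout, the three example operations, and the choice to accumulate gradients by pushing each node's contribution to its parents. Pushing along each edge once is equivalent to the sum over children written in the algorithm, because nodes are visited in reverse order.

```python
import math

# node index -> (function, partial derivatives w.r.t. each parent, parent indices)
OPS = {
    3: (lambda a, b: a * b, [lambda a, b: b, lambda a, b: a], [1, 2]),
    4: (math.exp, [math.exp], [3]),
    5: (lambda a, b: a + b, [lambda a, b: 1.0, lambda a, b: 1.0], [4, 1]),
}

def backprop(inputs, ops):
    # Forward propagation (Algorithm 6.1) to obtain the activations.
    u = {i: x for i, x in enumerate(inputs, start=1)}
    for i in sorted(ops):
        f, _, parents = ops[i]
        u[i] = f(*[u[j] for j in parents])
    n = max(u)
    gradtable = {i: 0.0 for i in u}
    gradtable[n] = 1.0                       # d u(n) / d u(n) = 1
    # Reverse order: when node i is visited, gradtable[i] is already complete.
    for i in sorted(ops, reverse=True):
        _, partials, parents = ops[i]
        args = [u[j] for j in parents]
        for dfn, j in zip(partials, parents):
            # One multiplication and one addition per edge j -> i.
            gradtable[j] += gradtable[i] * dfn(*args)
    return u, gradtable

u, gradtable = backprop([0.5, 2.0], OPS)
print(gradtable[1], gradtable[2])
```

For this graph, u5 = exp(u1 * u2) + u1, so the derivatives are 1 + u2 * exp(u1 * u2) and u1 * exp(u1 * u2).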


o backprop is designed to reduce the number of common subexpressions without regard to memory
  o it performs on the order of one Jacobian product per node in the graph, thus avoiding an exponential explosion in repeated subexpressions
  o this can be seen from the fact that backprop visits each edge from node $u^{(j)}$ to node $u^{(i)}$ of the graph exactly once in order to obtain the associated partial derivative $\frac{\partial u^{(i)}}{\partial u^{(j)}}$
o other algorithms may be able to avoid more subexpressions by performing simplifications on the computational graph, or may conserve memory by recomputing rather than storing some subexpressions

6.5.4 Back-Propagation Computation in Fully-Connected MLP

o consider the specific graph associated with a fully-connected multi-layer MLP


o Algorithm 6.3: forward propagation (computation of the cost J for a single training example x)
  o maps parameters to the supervised loss $L(\hat{y}, y)$ associated with a single training example (x, y)
  o $\hat{y}$ is the output of the neural network when x is provided as input
  o require: $l$, the network depth
  o require: $W^{(i)}$, $i \in \{1, \ldots, l\}$, the weight matrices of the model
  o require: $b^{(i)}$, $i \in \{1, \ldots, l\}$, the bias parameters of the model
  o require: x, the input to process
  o require: y, the target output
  o $h^{(0)} = x$
  o for $k = 1, \ldots, l$ do
    o $a^{(k)} = b^{(k)} + W^{(k)} h^{(k-1)}$
    o $h^{(k)} = f(a^{(k)})$
  o end for
  o $\hat{y} = h^{(l)}$
  o $J = L(\hat{y}, y) + \lambda \Omega(\theta)$
o Algorithm 6.4: backward computation on the same network as Algorithm 6.3 (computes the gradients on the activations $a^{(k)}$ for each layer k, starting from the output layer and going backwards to the first hidden layer)
  o after the forward computation, compute the gradient on the output layer:
    o $g \leftarrow \nabla_{\hat{y}} J = \nabla_{\hat{y}} L(\hat{y}, y)$
  o for $k = l, l - 1, \ldots, 1$ do
    o convert the gradient on the layer's output into a gradient on the pre-nonlinearity activation (element-wise multiplication if f is element-wise):
      o $g \leftarrow \nabla_{a^{(k)}} J = g \odot f'(a^{(k)})$
    o compute gradients on weights and biases (including the regularization term, where needed):
      o $\nabla_{b^{(k)}} J = g + \lambda \nabla_{b^{(k)}} \Omega(\theta)$
      o $\nabla_{W^{(k)}} J = g\, h^{(k-1)\top} + \lambda \nabla_{W^{(k)}} \Omega(\theta)$
    o propagate the gradients w.r.t. the next lower-level hidden layer's activations:
      o $g \leftarrow \nabla_{h^{(k-1)}} J = W^{(k)\top} g$
  o end for
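The forward and backward passes for a fully-connected MLP can be sketched as below. This is a sketch under assumptions that are not in the notes: f is tanh, the loss L is half the squared error, the regularizer is omitted (lambda = 0), and the helper names `matvec`, `forward`, and `backward` are hypothetical.

```python
import math

def matvec(W, h):
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def forward(x, Ws, bs, y):
    """Forward pass: returns the cost J, the activations h(k), and the pre-activations a(k)."""
    hs, pre = [x], []
    for W, b in zip(Ws, bs):
        a = [bi + zi for bi, zi in zip(b, matvec(W, hs[-1]))]   # a = b + W h
        pre.append(a)
        hs.append([math.tanh(v) for v in a])                    # h = f(a)
    J = 0.5 * sum((p - t) ** 2 for p, t in zip(hs[-1], y))
    return J, hs, pre

def backward(Ws, hs, pre, y):
    """Backward pass: returns per-layer gradients on weights and biases."""
    g = [p - t for p, t in zip(hs[-1], y)]          # gradient on the output layer
    grads_W, grads_b = [], []
    for k in reversed(range(len(Ws))):
        # g <- g * f'(a), using tanh'(a) = 1 - tanh(a)^2
        g = [gi * (1 - math.tanh(a) ** 2) for gi, a in zip(g, pre[k])]
        grads_b.insert(0, g)
        grads_W.insert(0, [[gi * hj for hj in hs[k]] for gi in g])    # g h^T
        # g <- W^T g: gradient on the previous layer's activations
        g = [sum(Ws[k][i][j] * g[i] for i in range(len(g)))
             for j in range(len(hs[k]))]
    return grads_W, grads_b

Ws = [[[0.1, -0.2], [0.4, 0.3]], [[0.5, -0.6]]]
bs = [[0.0, 0.1], [0.2]]
x, y = [1.0, 2.0], [0.3]
J, hs, pre = forward(x, Ws, bs, y)
gW, gb = backward(Ws, hs, pre, y)
```

A central finite difference on any single weight is a convenient way to sanity-check the returned gradients.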
o these algorithms are simple and specialized
o modern software implementations are based on a generalized form of backprop that can accommodate any computational graph by explicitly manipulating a data structure for representing symbolic computation

6.5.5 Symbol-to-Symbol Derivatives
o algebraic expressions and computational graphs both operate on symbols, variables that do not have specific values
  o these algebraic and graph-based representations are called symbolic representations
o when using or training a neural network, symbolic inputs are replaced with a specific numeric value
o some approaches to backprop use a "symbol-to-number" approach to differentiation:
  o take a computational graph and a set of numerical values for the inputs to the graph
  o return a set of numerical values describing the gradient at those input values
  o e.g. Torch, Caffe
o another is the "symbol-to-symbol" approach to differentiation:
  o take a computational graph and add additional nodes to the graph that provide a symbolic description of the desired derivatives
  o e.g. Theano, TensorFlow
  o advantage: derivatives are described in the same language as the original expression
  o because the derivatives are just another computational graph, it is possible to run backprop again, differentiating the derivatives to obtain higher derivatives
  o allows us to avoid specifying exactly when each operation should be computed; a generic graph evaluation engine can evaluate every node as soon as its parents' values are available
o the symbol-to-number approach performs exactly the same computations as are done in the graph built by the symbol-to-symbol approach
o key difference: symbol-to-number does not expose the graph
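The point that symbolic derivatives live in the same language as the original expression, and so can be differentiated again, can be illustrated with a toy example. The tuple-based expression format and the functions `diff` and `evaluate` are assumptions made up for this sketch; it applies the sum and product rules directly to an expression tree and is not how Theano or TensorFlow are implemented.

```python
def diff(e, name):
    """Return a new expression for the derivative of e w.r.t. variable `name`."""
    kind = e[0]
    if kind == "const":
        return ("const", 0.0)
    if kind == "var":
        return ("const", 1.0 if e[1] == name else 0.0)
    if kind == "add":
        return ("add", diff(e[1], name), diff(e[2], name))
    if kind == "mul":   # product rule
        return ("add", ("mul", diff(e[1], name), e[2]),
                       ("mul", e[1], diff(e[2], name)))
    raise ValueError(kind)

def evaluate(e, env):
    """Plug numeric values into a symbolic expression."""
    kind = e[0]
    if kind == "const":
        return e[1]
    if kind == "var":
        return env[e[1]]
    a, b = evaluate(e[1], env), evaluate(e[2], env)
    return a + b if kind == "add" else a * b

x = ("var", "x")
cube = ("mul", x, ("mul", x, x))   # x^3
d1 = diff(cube, "x")               # symbolic 3x^2
d2 = diff(d1, "x")                 # symbolic 6x, obtained by differentiating d1
first = evaluate(d1, {"x": 2.0})
second = evaluate(d2, {"x": 2.0})
```

Numbers only appear in the final `evaluate` calls; until then both derivatives are ordinary expressions.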

6.5.6 General Back-Propagation
o to compute the gradient of some scalar z w.r.t. one of its ancestors x in the graph:
  o begin by observing that the gradient w.r.t. z is given by $\frac{dz}{dz} = 1$
  o we can then compute the gradient w.r.t. each parent of z in the graph by multiplying the current gradient by the Jacobian of the operation that produced z
  o continue multiplying by Jacobians, traveling backwards through the graph in this way until we reach x
  o for any node that may be reached by going backwards from z through two or more paths, simply sum the gradients arriving from different paths at that node
o more formally:
  o each node in the graph $\mathcal{G}$ corresponds to a variable
  o describe each variable as a tensor V; tensors can in general have any number of dimensions (subsuming scalars, vectors, matrices)
o assume each variable V is associated with the following subroutines:
  o getoperation(V)
    o returns the operation that computes V, represented by the edges coming into V in the computational graph
  o getconsumers(V, $\mathcal{G}$)
    o returns the list of variables that are children of V in the computational graph $\mathcal{G}$
  o getinputs(V, $\mathcal{G}$)
    o returns the list of variables that are parents of V in the computational graph $\mathcal{G}$
o each operation op is also associated with a bprop operation
o bprop computes a Jacobian-vector product, i.e. the chain rule $\nabla_X z = \sum_j \left(\nabla_X Y_j\right) \frac{\partial z}{\partial Y_j}$
o e.g. consider a matrix multiplication operation creating a variable $C = AB$
  o let the gradient of a scalar z w.r.t. C be given by G
  o the matrix multiplication operation is responsible for defining two backprop rules, one for each of its input arguments:
    o if we call bprop to request the gradient w.r.t. A given that the gradient on the output is G, bprop must state that the gradient w.r.t. A is given by $G B^{\top}$
    o if we call bprop to request the gradient w.r.t. B, bprop must state that the gradient is $A^{\top} G$
o the backprop algorithm does not need to know any differentiation rules
  o it only needs to call each operation's bprop rules with the right arguments
o formally, op.bprop(inputs, X, G) must return

$$\sum_i \left(\nabla_X\, \text{op.f(inputs)}_i\right) G_i$$

  o inputs: list of inputs supplied to the operation
  o op.f: the mathematical function that the operation implements
  o X: the input whose gradient we wish to compute
  o G: the gradient on the output of the operation
o the op.bprop method should always treat all of its inputs as distinct from each other, even if they are not
  o e.g. if two copies of x are input to a multiplication to compute $x^2$, the derivative w.r.t. each input should still be x
o software implementations of backprop usually provide both the operations and their bprop methods
o if building a new implementation of backprop or adding a custom operation to an existing library, we usually need to derive the op.bprop method for the new operation
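The two bprop rules for matrix multiplication can be sketched as follows. The function name `matmul_bprop`, its integer `wrt` argument, and the example matrices are hypothetical; the sketch uses nested lists in place of a tensor library. Taking z = trace(AB) makes the output gradient G the identity matrix, so the expected results are easy to read off.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul_bprop(inputs, wrt, G):
    """Assumed bprop for C = A B; wrt = 0 requests the gradient on A, 1 on B."""
    A, B = inputs
    if wrt == 0:
        return matmul(G, transpose(B))    # G B^T
    return matmul(transpose(A), G)        # A^T G

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.5, -1.0], [2.0, 0.0]]
G = [[1.0, 0.0], [0.0, 1.0]]              # dz/dC for z = trace(A B)
grad_A = matmul_bprop([A, B], 0, G)
grad_B = matmul_bprop([A, B], 1, G)
```

For z = trace(AB) the gradients are B^T and A^T respectively, which is what the two rules return here.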
o Algorithm 6.5: the outermost skeleton of backprop
  o this is the outermost skeleton, for simple setup and cleanup
  o most of the important work happens in the buildgrad subroutine of Algorithm 6.6
  o require: $\mathbb{T}$, the target set of variables whose gradients must be computed
  o require: $\mathcal{G}$, the computational graph
  o require: z, the variable to be differentiated
  o let $\mathcal{G}'$ be $\mathcal{G}$ pruned to contain only nodes that are ancestors of z and descendants of nodes in $\mathbb{T}$
  o initialize gradtable, a data structure associating tensors to their gradients
  o gradtable[z] $\leftarrow 1$
  o for V in $\mathbb{T}$ do
    o buildgrad(V, $\mathcal{G}$, $\mathcal{G}'$, gradtable)
  o end for
  o return gradtable restricted to $\mathbb{T}$
o Algorithm 6.6: the inner-loop subroutine buildgrad(V, $\mathcal{G}$, $\mathcal{G}'$, gradtable)
  o require: V, the variable whose gradient should be added to $\mathcal{G}$ and gradtable
  o require: $\mathcal{G}$, the graph to modify
  o require: $\mathcal{G}'$, the restriction of $\mathcal{G}$ to nodes that participate in the gradient
  o require: gradtable, a data structure mapping nodes to their gradients
  o if V is in gradtable then
    o return gradtable[V]
  o end if
  o $i \leftarrow 1$
  o for C in getconsumers(V, $\mathcal{G}'$) do
    o op $\leftarrow$ getoperation(C)
    o D $\leftarrow$ buildgrad(C, $\mathcal{G}$, $\mathcal{G}'$, gradtable)
    o $G^{(i)} \leftarrow$ op.bprop(getinputs(C, $\mathcal{G}'$), V, D)
    o $i \leftarrow i + 1$
  o end for
  o $G \leftarrow \sum_i G^{(i)}$
  o gradtable[V] = G
  o insert G and the operations creating it into $\mathcal{G}$
  o return G
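The recursive structure of the buildgrad subroutine can be sketched for scalar variables. Several simplifications here are assumptions of this sketch rather than part of the algorithms above: it returns numbers instead of inserting new nodes into the graph (so it is symbol-to-number), it skips the pruning step, it assumes no variable is passed twice to the same operation, and the classes `Var`, `Mul`, `Exp`, `Add` are hypothetical.

```python
import math

class Var:
    """A scalar node; op is None for inputs. Consumers are recorded on creation."""
    def __init__(self, value, op=None, inputs=()):
        self.value, self.op, self.inputs = value, op, list(inputs)
        self.consumers = []
        for parent in self.inputs:
            parent.consumers.append(self)

class Mul:
    @staticmethod
    def bprop(inputs, wrt, grad_out):
        a, b = inputs
        return grad_out * (b.value if wrt is a else a.value)

class Exp:
    @staticmethod
    def bprop(inputs, wrt, grad_out):
        return grad_out * math.exp(inputs[0].value)

class Add:
    @staticmethod
    def bprop(inputs, wrt, grad_out):
        return grad_out

def mul(a, b): return Var(a.value * b.value, Mul, (a, b))
def exp(a): return Var(math.exp(a.value), Exp, (a,))
def add(a, b): return Var(a.value + b.value, Add, (a, b))

def buildgrad(v, gradtable):
    if v in gradtable:                      # already computed: reuse the table entry
        return gradtable[v]
    total = 0.0
    for c in v.consumers:                   # children of v
        d = buildgrad(c, gradtable)         # gradient on the consumer's output
        total += c.op.bprop(c.inputs, v, d)
    gradtable[v] = total
    return total

def backprop(targets, z):
    gradtable = {z: 1.0}                    # dz/dz = 1
    return [buildgrad(v, gradtable) for v in targets]

x1, x2 = Var(0.5), Var(2.0)
z = add(exp(mul(x1, x2)), x1)               # z = exp(x1 * x2) + x1
grads = backprop([x1, x2], z)
```

Because each node's gradient is stored in gradtable the first time it is computed, shared ancestors such as the product node are visited only once.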
o computational cost
  o with the algorithm specified, we can examine its computational cost
  o assume that each operation evaluation has roughly the same cost, and then analyze the computational cost in terms of the number of operations executed
    o note: we refer to an operation as the fundamental unit of the computational graph, though each could consist of several arithmetic operations (e.g. a matrix multiplication is one operation but can be many arithmetic operations)
  o then, computing a gradient with n nodes will never execute more than $O(n^2)$ operations or store the output of more than $O(n^2)$ operations
    o the backprop algorithm adds one Jacobian-vector product, expressed with $O(1)$ nodes, per edge in the original graph
    o since the computational graph is a directed acyclic graph, it has at most $O(n^2)$ edges
  o for the graphs used in practice, the situation is better:
    o most neural networks are roughly chain-structured, causing backprop to have $O(n)$ cost
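o the potentially exponential cost can be seen by expanding and rewriting the recursive chain rule non-recursively:

$$\frac{\partial u^{(n)}}{\partial u^{(j)}} = \sum_{i:\, j \in Pa(u^{(i)})} \frac{\partial u^{(n)}}{\partial u^{(i)}}\,\frac{\partial u^{(i)}}{\partial u^{(j)}}$$

$$\frac{\partial u^{(n)}}{\partial u^{(j)}} = \sum_{\substack{\text{path } (u^{(\pi_1)}, u^{(\pi_2)}, \ldots, u^{(\pi_t)}) \\ \text{from } \pi_1 = j \text{ to } \pi_t = n}} \; \prod_{k=2}^{t} \frac{\partial u^{(\pi_k)}}{\partial u^{(\pi_{k-1})}}$$

  o i.e. sum over all paths from $u^{(j)}$ to $u^{(n)}$, multiplying all derivatives along each path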
o the number of paths from node j to node n can grow exponentially in the length of these paths
o therefore, the number of terms in the sum above (which is the number of paths) can grow exponentially with the depth of the forward propagation graph
o the large computational cost is incurred when $\frac{\partial u^{(i)}}{\partial u^{(j)}}$ is recalculated many times
o backprop is a dynamic programming strategy to avoid these recomputations
  o it can be thought of as a table-filling algorithm that takes advantage of storing the intermediate results $\frac{\partial u^{(n)}}{\partial u^{(i)}}$
  o each node in the graph has a corresponding slot in a table to store the gradient for that node
  o by filling in these table entries in order, backprop avoids repeatedly evaluating common subexpressions
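The gap between the number of paths and the number of edges is easy to see on a small layered graph. The graph shape (one input, several hidden layers of width 2, one output) and the helper names are assumptions chosen for this illustration.

```python
def layered_graph(depth, width=2):
    """children[node] lists the nodes that node feeds into."""
    layers = [["in"]]
    layers += [[f"h{d}_{w}" for w in range(width)] for d in range(depth)]
    layers += [["out"]]
    children = {"out": []}
    for cur, nxt in zip(layers, layers[1:]):
        for node in cur:
            children[node] = list(nxt)     # fully connect consecutive layers
    return children

def count_paths(children, src, dst):
    # Naive enumeration: one term of the non-recursive sum per path.
    if src == dst:
        return 1
    return sum(count_paths(children, c, dst) for c in children[src])

children = layered_graph(depth=10)
n_paths = count_paths(children, "in", "out")
n_edges = sum(len(c) for c in children.values())
print(n_paths, n_edges)
```

With ten hidden layers the path-based sum has 2^10 terms, while a table-filling pass only needs one step per edge.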

6.5.7 Backprop for MLP

o consider a simple multilayer perceptron with a single hidden layer
o train with minibatch stochastic gradient descent
o backprop is used to compute the gradient of the cost on a single minibatch
  o X: a design matrix representing a minibatch of examples from the training set
  o y: vector of associated class labels
o $H = \max\{0, X W^{(1)}\}$
  o simplify by assuming no biases in the model
  o assume the existence of a relu operation that computes $\max\{0, Z\}$ elementwise
o predictions of the unnormalized log probabilities over classes are given by $H W^{(2)}$
o assume the existence of a crossentropy operation that computes the cross-entropy between the targets y and the probability distribution defined by these unnormalized log probabilities
  o the resulting cross-entropy defines the cost $J_{MLE}$
  o minimizing the cross-entropy = maximum likelihood estimation of the classifier
o also include a regularization term:

$$J = J_{MLE} + \lambda \left( \sum_{i,j} \left(W^{(1)}_{i,j}\right)^2 + \sum_{i,j} \left(W^{(2)}_{i,j}\right)^2 \right)$$
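The cost defined above can be computed directly for a tiny minibatch. This sketch makes assumptions the notes do not state: the cross-entropy is averaged over the minibatch, labels are integer class indices, and the helper names `matmul`, `relu`, `crossentropy`, and `cost` are hypothetical.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def relu(Z):
    return [[max(0.0, z) for z in row] for row in Z]

def crossentropy(logits, labels):
    """Mean negative log-probability of the true class under softmax(logits)."""
    total = 0.0
    for row, label in zip(logits, labels):
        m = max(row)                                          # for numerical stability
        log_norm = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_norm - row[label]
    return total / len(labels)

def cost(X, y, W1, W2, lam):
    H = relu(matmul(X, W1))                                   # H = max{0, X W1}
    J_mle = crossentropy(matmul(H, W2), y)                    # logits = H W2
    penalty = sum(w * w for W in (W1, W2) for row in W for w in row)
    return J_mle + lam * penalty

X = [[1.0, 2.0], [0.5, -1.0]]          # two examples, two features
y = [0, 2]                             # class indices
W1 = [[0.0] * 3 for _ in range(2)]     # 2 features -> 3 hidden units
W2 = [[0.0] * 3 for _ in range(3)]     # 3 hidden units -> 3 classes
J0 = cost(X, y, W1, W2, lam=0.1)
```

With all weights at zero the logits are equal, so the cost reduces to the log of the number of classes and the penalty vanishes.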
