
CS229 Lecture Notes

Tengyu Ma, Anand Avati, Kian Katanforoosh, and Andrew Ng

Deep Learning
We now begin our study of deep learning. In this set of notes, we give an
overview of neural networks, discuss vectorization and discuss training neural
networks with backpropagation.

1 Supervised Learning with Non-linear Models
In the supervised learning setting (predicting y from the input x), suppose
our model/hypothesis is hθ(x). In the past lectures, we have considered the
cases where hθ(x) = θ⊤x (in linear regression or logistic regression) or hθ(x) =
θ⊤φ(x) (where φ(x) is the feature map). A commonality of these two models
is that they are linear in the parameters θ. Next we will consider learning a
general family of models that are non-linear in both the parameters θ
and the inputs x. The most common non-linear models are neural networks,
which we will define starting from the next section. For this section, it suffices
to think of hθ(x) as an abstract non-linear model.^1
Suppose {(x^(i), y^(i))}_{i=1}^n are the training examples. For simplicity, we start
with the case where y^(i) ∈ R and hθ(x) ∈ R.

Cost/loss function. We define the least-squares cost function for the i-th
example (x^(i), y^(i)) as

J^(i)(θ) = (1/2)(hθ(x^(i)) − y^(i))²    (1.1)

^1 If a concrete example is helpful, perhaps think about the model hθ(x) = θ1²x1² + θ2²x2² +
· · · + θd²xd² in this subsection, even though it's not a neural network.

and define the mean-square cost function for the dataset as

J(θ) = (1/n) Σ_{i=1}^n J^(i)(θ)    (1.2)

which is the same as in linear regression except that we introduce a constant
1/n in front of the cost function to be consistent with the convention. Note
that multiplying the cost function by a scalar will not change the local
minima or global minima of the cost function. Also note that the underlying
parameterization for hθ(x) is different from the case of linear regression,
even though the form of the cost function is the same mean-squared loss.
Throughout the notes, we use the words "loss" and "cost" interchangeably.

Optimizers (SGD). Commonly, people use gradient descent (GD), stochastic
gradient descent (SGD), or their variants to optimize the loss function J(θ). GD's
update rule can be written as^2

θ := θ − α ∇θ J(θ)    (1.3)

where α > 0 is often referred to as the learning rate or step size. Next, we
introduce a version of SGD (Algorithm 1), which is slightly different from
that in the first lecture notes.

Algorithm 1 Stochastic Gradient Descent
1: Hyperparameter: learning rate α, number of total iterations n_iter.
2: Initialize θ randomly.
3: for i = 1 to n_iter do
4:   Sample j uniformly from {1, . . . , n}, and update θ by

       θ := θ − α ∇θ J^(j)(θ)    (1.4)

Oftentimes computing the gradients of B examples simultaneously for the
parameter θ can be faster than computing B gradients separately, due to
hardware parallelization. Therefore, a mini-batch version of SGD is most
commonly used in deep learning, as shown in Algorithm 2. There are also
other variants of SGD or mini-batch SGD with slightly different sampling
schemes.

^2 Recall that, as defined in the previous lecture notes, we use the notation "a := b" to
denote an operation (in a computer program) in which we set the value of a variable a
to be equal to the value of b. In other words, this operation overwrites a with the value
of b. In contrast, we will write "a = b" when we are asserting a statement of fact, that
the value of a is equal to the value of b.

Algorithm 2 Mini-batch Stochastic Gradient Descent
1: Hyperparameters: learning rate α, batch size B, # iterations n_iter.
2: Initialize θ randomly.
3: for i = 1 to n_iter do
4:   Sample B examples j1, . . . , jB (without replacement) uniformly from
     {1, . . . , n}, and update θ by

       θ := θ − (α/B) Σ_{k=1}^B ∇θ J^(j_k)(θ)    (1.5)
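To make Algorithm 2 concrete, here is a minimal numpy sketch. The function name minibatch_sgd, the helper grad_J (assumed to return ∇θ J^(j)(θ) for a single example j), and the toy least-squares problem are illustrative assumptions rather than anything specified in these notes.

import numpy as np

def minibatch_sgd(theta0, grad_J, n, alpha=0.01, B=32, n_iter=1000, seed=0):
    # Mini-batch SGD (Algorithm 2). grad_J(theta, j) is assumed to return
    # the gradient of the j-th example's loss J^(j) at theta.
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    for _ in range(n_iter):
        batch = rng.choice(n, size=min(B, n), replace=False)   # sample j_1, ..., j_B
        grad = np.mean([grad_J(theta, j) for j in batch], axis=0)
        theta -= alpha * grad                                   # update (1.5)
    return theta

# Toy usage: linear regression with J^(j)(theta) = 0.5 * (theta^T x_j - y_j)^2
rng = np.random.default_rng(1)
X, w_true = rng.normal(size=(500, 3)), np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=500)
grad_J = lambda theta, j: (X[j] @ theta - y[j]) * X[j]
theta_hat = minibatch_sgd(np.zeros(3), grad_J, n=500)
print(theta_hat)   # should end up close to w_true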

With these generic algorithms, a typical deep learning model is learned
with the following steps:
1. Define a neural network parametrization hθ(x), which we will introduce in Section 2;
2. write the backpropagation algorithm to compute the gradient of the loss function J^(j)(θ) efficiently, which will be covered in Section 3; and
3. run SGD or mini-batch SGD (or other gradient-based optimizers) with the loss function J(θ).

2 Neural Networks
Neural networks refer to a broad type of non-linear models/parametrizations
hθ (x) that involve combinations of matrix multiplications and other entry-
wise non-linear operations. We will start small and slowly build up a neural
network, step by step.

A Neural Network with a Single Neuron. Recall the housing price


prediction problem from before: given the size of the house, we want to
predict the price. We will use it as a running example in this subsection.
Previously, we fit a straight line to the graph of size vs. housing price.
Now, instead of fitting a straight line, we wish to prevent negative housing
prices by setting the absolute minimum price as zero. This produces a “kink”
in the graph as shown in Figure 1. How do we represent such a function with
a single kink as hθ(x) with unknown parameters? (After doing so, we can
invoke the machinery in Section 1.)
We define a parameterized function hθ(x) with input x, parameterized by
θ, which outputs the price of the house y. Formally, hθ : x → y. Perhaps
one of the simplest parametrizations would be

hθ(x) = max(wx + b, 0),  where θ = (w, b) ∈ R²    (2.1)

Here hθ(x) returns a single value: (wx + b) or zero, whichever is greater. In
the context of neural networks, the function max{t, 0} is called a ReLU (pronounced
"ray-lu"), or rectified linear unit, and is often denoted by ReLU(t) ≜ max{t, 0}.
Generally, a one-dimensional non-linear function that maps R to R, such as
ReLU, is often referred to as an activation function. The model hθ(x) is said
to have a single neuron partly because it has a single non-linear activation
function. (We will discuss more about why a non-linear activation is called a
neuron.)
When the input x ∈ R^d has multiple dimensions, a neural network with
a single neuron can be written as

hθ(x) = ReLU(w⊤x + b),  where w ∈ R^d, b ∈ R, and θ = (w, b)    (2.2)

The term b is often referred to as the "bias", and the vector w is referred
to as the weight vector. Such a neural network has 1 layer. (We will define
what multiple layers mean in the sequel.)

Stacking Neurons. A more complex neural network may take the single
neuron described above and “stack” them together such that one neuron
passes its output as input into the next neuron, resulting in a more complex
function.
Let us now deepen the housing prediction example. In addition to the size
of the house, suppose that you know the number of bedrooms, the zip code
and the wealth of the neighborhood. Building neural networks is analogous
to Lego bricks: you take individual bricks and stack them together to build
complex structures. The same applies to neural networks: we take individual
neurons and stack them together to create complex neural networks.
Given these features (size, number of bedrooms, zip code, and wealth),
we might then decide that the price of the house depends on the maximum
family size it can accommodate. Suppose the family size is a function of
the size of the house and number of bedrooms (see Figure 2). The zip code
may provide additional information such as how walkable the neighborhood
is (i.e., can you walk to the grocery store or do you need to drive everywhere).
Combining the zip code with the wealth of the neighborhood may predict
the quality of the local elementary school. Given these three derived features
(family size, walkable, school quality), we may conclude that the price of the
home ultimately depends on these three features.
Figure 1: Housing prices with a "kink" in the graph (price in $1000 vs. square feet).

Figure 2: Diagram of a small neural network for predicting housing prices: the inputs
Size, # Bedrooms, Zip Code, and Wealth feed the intermediate nodes Family Size,
Walkable, and School Quality, which feed the output Price y.

Formally, the input to a neural network is a set of input features
x1, x2, x3, x4. We denote the intermediate variables for "family size", "walkable",
and "school quality" by a1, a2, a3 (these ai's are often referred to as
"hidden units" or "hidden neurons"). We represent each of the ai's as a
neural network with a single neuron with a subset of x1, . . . , x4 as inputs. Then,
as in Figure 2, we will have the parameterization:

a1 = ReLU(θ1 x1 + θ2 x2 + θ3 )
a2 = ReLU(θ4 x3 + θ5 )
a3 = ReLU(θ6 x3 + θ7 x4 + θ8 )

where (θ1, · · · , θ8) are parameters. Now we represent the final output hθ(x)
as another linear function with a1, a2, a3 as inputs, and we get^3

hθ(x) = θ9 a1 + θ10 a2 + θ11 a3 + θ12    (2.3)


^3 Typically, for a multi-layer neural network, at the end, near the output, we don't apply
ReLU, especially when the output is not necessarily a positive number.

where θ contains all the parameters (θ1, · · · , θ12).
Now we have represented the output as a fairly complex function of x with
parameters θ. You can then use this parametrization hθ with the machinery of
Section 1 to learn the parameters θ.
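To make this concrete, here is a small numpy sketch of the forward computation in the parameterization above and equation (2.3). The particular parameter values and feature vector are made-up numbers for illustration only.

import numpy as np

def relu(t):
    return np.maximum(t, 0)

def h_theta(x, theta):
    # x = (x1, x2, x3, x4); theta = (theta_1, ..., theta_12)
    t = theta
    a1 = relu(t[0] * x[0] + t[1] * x[1] + t[2])    # "family size"
    a2 = relu(t[3] * x[2] + t[4])                  # "walkable"
    a3 = relu(t[5] * x[2] + t[6] * x[3] + t[7])    # "school quality"
    return t[8] * a1 + t[9] * a2 + t[10] * a3 + t[11]   # eq. (2.3): linear output, no ReLU

x = np.array([2000.0, 3.0, 94305.0, 2.5])    # size, # bedrooms, zip code, wealth (made up)
theta = 0.01 * np.ones(12)                   # arbitrary parameter values
print(h_theta(x, theta))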

Inspiration from Biological Neural Networks. As the name suggests,


artificial neural networks were inspired by biological neural networks. The
hidden units a1 , . . . , am correspond to the neurons in a biological neural net-
work, and the parameters θi ’s correspond to the synapses. However, it’s
unclear how similar modern deep artificial neural networks are to the biological
ones. For example, perhaps not many neuroscientists think biological
neural networks could have 1000 layers, while some modern artificial neural
networks do (we will elaborate more on the notion of layers). Moreover, it's
an open question whether human brains update their neural networks in a
way similar to the way that computer scientists learn artificial neural networks
(using backpropagation, which we will introduce in the next section).

Two-layer Fully-Connected Neural Networks. We constructed the


neural network in equation (2.3) using a significant amount of prior knowl-
edge/belief about how the “family size”, “walkable”, and “school quality” are
determined by the inputs. We implicitly assumed that we know the family
size is an important quantity to look at and that it can be determined by
only the "size" and "# bedrooms". Such prior knowledge might not be
available for other applications. It would be more flexible and general to have
a generic parameterization. A simple way would be to write the intermediate
variable a1 as a function of all x1 , . . . , x4 :

a1 = ReLU(w1⊤x + b1),  where w1 ∈ R^4 and b1 ∈ R    (2.4)
a2 = ReLU(w2⊤x + b2),  where w2 ∈ R^4 and b2 ∈ R
a3 = ReLU(w3⊤x + b3),  where w3 ∈ R^4 and b3 ∈ R

We still define hθ(x) using equation (2.3) with a1, a2, a3 being defined
as above. Thus we have a so-called fully-connected neural network, as
visualized in the dependency graph in Figure 3, because all the intermediate
variables ai depend on all the inputs xi.
For full generality, a two-layer fully-connected neural network with m
hidden units and d-dimensional input x ∈ R^d is defined as

∀ j ∈ [1, ..., m],  zj = (wj[1])⊤x + bj[1],  where wj[1] ∈ R^d, bj[1] ∈ R    (2.5)

Figure 3: Diagram of a two-layer fully connected neural network. Each edge
from node xi to node aj indicates that aj depends on xi. The edge from xi
to aj is associated with the weight (wj[1])i, which denotes the i-th coordinate
of the vector wj[1]. The activation aj can be computed by taking the ReLU of
the weighted sum of the xi's, with the weights being those associated with
the incoming edges; that is, aj = ReLU(Σ_{i=1}^d (wj[1])i xi).

aj = ReLU(zj),
a = [a1, . . . , am]⊤ ∈ R^m
hθ(x) = w[2]⊤a + b[2],  where w[2] ∈ R^m, b[2] ∈ R    (2.6)

Note that by default the vectors in R^d are viewed as column vectors, and
in particular a is a column vector with components a1, a2, ..., am. The indices
[1] and [2] are used to distinguish two sets of parameters: the wj[1]'s (each of
which is a vector in R^d) and w[2] (which is a vector in R^m). We will have
more of these later.
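For concreteness, here is a naive (for-loop) numpy sketch of equations (2.5)–(2.6). The sizes d = 4, m = 3 and the random initialization are illustrative assumptions.

import numpy as np

def relu(t):
    return np.maximum(t, 0)

def two_layer_forward_loop(x, W1, b1, w2, b2):
    # Naive forward pass, one hidden unit at a time.
    # W1[j] plays the role of wj[1] and b1[j] the role of bj[1].
    m = W1.shape[0]
    a = np.zeros(m)
    for j in range(m):
        z_j = W1[j] @ x + b1[j]     # eq. (2.5)
        a[j] = relu(z_j)
    return w2 @ a + b2              # eq. (2.6)

d, m = 4, 3
rng = np.random.default_rng(0)
x = rng.normal(size=d)
W1, b1 = rng.normal(size=(m, d)), np.zeros(m)
w2, b2 = rng.normal(size=m), 0.0
print(two_layer_forward_loop(x, W1, b1, w2, b2))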

Vectorization. Before we introduce neural networks with more layers and


more complex structures, we will simplify the expressions for neural networks
with more matrix and vector notation. Another important motivation for
vectorization is speed in the implementation. In order to
implement a neural network efficiently, one must be careful when using for
loops. The most natural way to implement equation (2.5) in code is perhaps
to use a for loop. In practice, the dimensionalities of the inputs and hidden
units are high, so code will run very slowly if you use for loops.
Leveraging the parallelism in GPUs is/was crucial for the progress of deep
learning.
This gave rise to vectorization. Instead of using for loops, vectorization
takes advantage of matrix algebra and highly optimized numerical linear
algebra packages (e.g., BLAS) to make neural network computations run
quickly. Before the deep learning era, a for loop may have been sufficient
on smaller datasets, but modern deep networks and state-of-the-art datasets
are infeasible to run with for loops.
We vectorize the two-layer fully-connected neural network as below. We
define a weight matrix W[1] in R^{m×d} as the concatenation of all the vectors
wj[1] in the following way:

          [ — (w1[1])⊤ — ]
W[1]  =   [ — (w2[1])⊤ — ]   ∈ R^{m×d}    (2.7)
          [      ...      ]
          [ — (wm[1])⊤ — ]

Now by the definition of matrix-vector multiplication, we can write z =
[z1, . . . , zm]⊤ ∈ R^m as

[ z1 ]   [ — (w1[1])⊤ — ] [ x1 ]   [ b1[1] ]
[ .. ] = [ — (w2[1])⊤ — ] [ x2 ] + [ b2[1] ]    (2.8)
[ .. ]   [      ...      ] [ .. ]   [  ...  ]
[ zm ]   [ — (wm[1])⊤ — ] [ xd ]   [ bm[1] ]

  z ∈ R^{m×1}    W[1] ∈ R^{m×d}     x ∈ R^{d×1}    b[1] ∈ R^{m×1}

Or succinctly,

z = W [1] x + b[1] (2.9)

We remark again that a vector in R^d in these notes, following the conventions
previously established, is automatically viewed as a column vector, and can
also be viewed as a d × 1 dimensional matrix. (Note that this is different
from numpy, where a vector is viewed as a row vector in broadcasting.)
Computing the activations a ∈ R^m from z ∈ R^m involves an element-wise
non-linear application of the ReLU function, which can be computed in
parallel efficiently. Overloading ReLU for element-wise application (meaning,
for a vector t ∈ R^d, ReLU(t) is a vector such that ReLU(t)i = ReLU(ti)),
we have

a = ReLU(z)    (2.10)
Define W[2] = [w[2]⊤] ∈ R^{1×m} similarly. Then, the model in equation (2.6)
can be summarized as

a = ReLU(W[1]x + b[1])
hθ(x) = W[2]a + b[2]    (2.11)

Here θ consists of W [1] , W [2] (often referred to as the weight matrices) and
b[1] , b[2] (referred to as the biases). The collection of W [1] , b[1] is referred to as
the first layer, and W [2] , b[2] the second layer. The activation a is referred to as
the hidden layer. A two-layer neural network is also called a one-hidden-layer
neural network.
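Here is a vectorized numpy sketch of equation (2.11), matching the loop version sketched earlier but using matrix operations; the shapes follow the column-vector convention of these notes, and the random initialization is only for illustration.

import numpy as np

def relu(t):
    return np.maximum(t, 0)

def two_layer_forward(x, W1, b1, W2, b2):
    # Eq. (2.11): a = ReLU(W1 x + b1), h = W2 a + b2.
    # Shapes: x (d,1), W1 (m,d), b1 (m,1), W2 (1,m), b2 (1,1).
    a = relu(W1 @ x + b1)
    return W2 @ a + b2

d, m = 4, 3
rng = np.random.default_rng(0)
x = rng.normal(size=(d, 1))
W1, b1 = rng.normal(size=(m, d)), np.zeros((m, 1))
W2, b2 = rng.normal(size=(1, m)), np.zeros((1, 1))
print(two_layer_forward(x, W1, b1, W2, b2))    # a 1x1 output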

Multi-layer fully-connected neural networks. With this succinct
notation, we can stack more layers to get a deeper fully-connected neu-

tations, we can stack more layers to get a deeper fully-connected neu-
ral network. Let r be the number of layers (weight matrices). Let
W [1] , . . . , W [r] , b[1] , . . . , b[r] be the weight matrices and biases of all the layers.
Then a multi-layer neural network can be written as

a[1] = ReLU(W [1] x + b[1] )


a[2] = ReLU(W [2] a[1] + b[2] )
···
a[r−1] = ReLU(W [r−1] a[r−2] + b[r−1] )
hθ (x) = W [r] a[r−1] + b[r] (2.12)

We note that the weight matrices and biases need to have compatible
dimensions for the equations above to make sense. If a[k] has dimension mk ,
then the weight matrix W [k] should be of dimension mk × mk−1 , and the bias
b[k] ∈ Rmk . Moreover, W [1] ∈ Rm1 ×d and W [r] ∈ R1×mr−1 .
The total number of neurons in the network is m1 + · · · + mr , and the
total number of parameters in this network is (d + 1)m1 + (m1 + 1)m2 + · · · +
(mr−1 + 1)mr .
Sometimes for notational consistency we also write a[0] = x and a[r] =
hθ(x). Then we have the simple recursion

a[k] = ReLU(W[k]a[k−1] + b[k]),  ∀ k = 1, . . . , r − 1    (2.13)

Note that this would also have been true for k = r if there were an additional
ReLU in equation (2.12), but often people like to make the last layer linear
(i.e., without a ReLU) so that negative outputs are possible and it's easier
to interpret the last layer as a linear model. (More on the interpretability in
the "Connection to the Kernel Method" paragraph of this section.)

Other activation functions. The activation function ReLU can be replaced
by many other non-linear functions σ(·) that map R to R, such as

σ(z) = 1/(1 + e^{−z})    (sigmoid)    (2.14)

σ(z) = (e^z − e^{−z})/(e^z + e^{−z})    (tanh)    (2.15)

Why do we not use the identity function for σ(z)? That is, why
not use σ(z) = z? Assume for the sake of argument that b[1] and b[2] are zero.
Suppose σ(z) = z; then for a two-layer neural network, we have that

hθ(x) = W[2]a[1]                                    (2.16)
      = W[2]σ(z[1])      (by definition)            (2.17)
      = W[2]z[1]         (since σ(z) = z)           (2.18)
      = W[2]W[1]x        (from Equation (2.8))      (2.19)
      = W̃x               (where W̃ = W[2]W[1])       (2.20)

Notice how W[2]W[1] collapsed into W̃.


This is because applying a linear function to another linear function will
result in a linear function over the original input (i.e., you can construct a W̃
such that W̃ x = W [2] W [1] x). This loses much of the representational power
of the neural network, as oftentimes the output we are trying to predict
has a non-linear relationship with the inputs. Without non-linear activation
functions, the neural network will simply perform linear regression.
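A quick numerical illustration of this collapse, using random matrices purely as a sanity check:

import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(1, 3))
x = rng.normal(size=(4, 1))

two_layer = W2 @ (W1 @ x)      # W[2] sigma(W[1] x) with sigma(z) = z
W_tilde = W2 @ W1              # the collapsed single matrix
print(np.allclose(two_layer, W_tilde @ x))   # True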

Connection to the Kernel Method. In the previous lectures, we covered


the concept of feature maps. Recall that the main motivation for feature
maps is to represent functions that are non-linear in the input x by θ⊤φ(x),
where θ are the parameters and φ(x), the feature map, is a handcrafted
function that is non-linear in the raw input x. The performance of the learning
algorithm can depend significantly on the choice of the feature map φ(x).
Oftentimes people use domain knowledge to design a feature map φ(x) that
suits the particular application. The process of choosing the feature map
is often referred to as feature engineering.
We can view deep learning as a way to automatically learn the right
feature map (sometimes also referred to as "the representation") as follows.
Suppose we denote by β the collection of the parameters in a fully-connected
neural network (equation (2.12)) except those in the last layer. Then we
can view a[r−1] as a function of the input x and the parameters in
β: a[r−1] = φβ(x). Now we can write the model as

hθ(x) = W[r]φβ(x) + b[r]    (2.21)

When β is fixed, φβ(·) can be viewed as a feature map, and therefore hθ(x)
is just a linear model over the features φβ(x). However, when we train the
neural network, both the parameters in β and the parameters W[r], b[r] are
optimized; therefore we are not only learning a linear model in the feature
space, but also learning a good feature map φβ(·) itself, so that it's possible
to predict accurately with a linear model on top of the feature map.
Therefore, deep learning tends to depend less on the domain knowledge of
the particular application and often requires less feature engineering. The
penultimate layer a[r−1] is often (informally) referred to as the learned features
or representations in the context of deep learning.
In the example of house price prediction, a fully-connected neural network
does not need us to specify intermediate quantities such as "family size", and
may automatically discover some useful features in the penultimate layer
(the activation a[r−1]), and use them to linearly predict the housing price.
Often the feature map / representation obtained from one dataset (that is,
the function φβ(·)) can also be useful for other datasets, which indicates that it
contains essential information about the data. However, oftentimes, the neural
network will discover complex features which are very useful for predicting
the output but may be difficult for a human to understand or interpret. This
is why some people refer to neural networks as a black box, as it can be
difficult to understand the features they have discovered.

3 Backpropagation
In this section, we introduce backpropagation or auto-differentiation, which
computes the gradient of the loss ∇J^(j)(θ) efficiently. We will start with an
informal theorem that states that as long as a real-valued function f can be
efficiently computed/evaluated by a differentiable network or circuit, then its

gradient can be efficiently computed in a similar time. We will then show


how to do this concretely for fully-connected neural networks.
Because the formality of the general theorem is not the main focus here,
we will introduce the terms with informal definitions. By a differentiable
circuit or a differentiable network, we mean a composition of a sequence of
differentiable arithmetic operations (addition, subtraction, multiplication,
division, etc.) and elementary differentiable functions (ReLU, exp, log, sin,
cos, etc.). Let the size of the circuit be the total number of such operations
and elementary functions. We assume that each of the operations and functions,
and their derivatives or partial derivatives, can be computed in O(1)
time on a computer.

Theorem 3.1: [backpropagation or auto-differentiation, informally stated]
Suppose a differentiable circuit of size N computes a real-valued function
f : R^ℓ → R. Then, the gradient ∇f can be computed in time O(N), by a
circuit of size O(N).^4

We note that the loss function J^(j)(θ) for the j-th example can indeed be
computed by a sequence of operations and functions involving additions,
subtractions, multiplications, and non-linear activations. Thus the theorem
suggests that we should be able to compute ∇J^(j)(θ) in a similar time
to that for computing J^(j)(θ) itself. This applies not only to the fully-connected
neural network introduced in Section 2, but also to many other
types of neural networks.
In the rest of the section, we will showcase how to compute the gradient of
the loss efficiently for fully-connected neural networks using backpropagation.
Even though auto-differentiation or backpropagation is implemented in all
the deep learning packages such as tensorflow and pytorch, understanding it
is very helpful for gaining insight into the workings of deep learning.

3.1 Preliminary: chain rule


We first recall the chain rule in calculus. Suppose the variable J depends on
the variables θ1, . . . , θp via the intermediate variables g1, . . . , gk:

gj = gj (θ1 , . . . , θp ), ∀j ∈ {1, · · · , k} (3.1)


^4 We note that if the output of the function f does not depend on some of the input coordinates,
then we set by default the gradient w.r.t. that coordinate to zero. Setting it to zero does
not count towards the total runtime here in our accounting scheme. This is why when
N ≤ ℓ, we can compute the gradient in O(N) time, which might potentially be even less
than ℓ.

J = J(g1, . . . , gk)    (3.2)

Here we overload the meaning of the gj's: they denote both the intermediate
variables and the functions used to compute the intermediate variables.
Then, by the chain rule, we have that for all i,

∂J/∂θi = Σ_{j=1}^k (∂J/∂gj) · (∂gj/∂θi)    (3.3)

For ease of invoking the chain rule in the following subsections in various
ways, we will call J the output variable, g1, . . . , gk the intermediate variables,
and θ1, . . . , θp the input variables in the chain rule.

3.2 One-neuron neural networks


Simplifying notation: In the rest of the section, we will consider a
generic input x and compute the gradient of hθ(x) w.r.t. θ. For simplicity,
we use o as a shorthand for hθ(x) (o stands for output). With a
slight abuse of notation, we use J = (1/2)(y − o)² to denote the loss function.
(Note that this overrides the definition of J as the total loss in Section 1.)
Our goal is to compute the derivative of J w.r.t. the parameter θ.
We first consider the neural network with one neuron defined in equa-
tion (2.2). Recall that we compute the loss function via the following se-
quential steps:

z = w⊤x + b    (3.4)
o = ReLU(z)    (3.5)
J = (1/2)(y − o)²    (3.6)
By the chain rule with J as the output variable, o as the intermediate variable,
and wi as the input variable, we have that

∂J/∂wi = (∂J/∂o) · (∂o/∂wi)    (3.7)

Invoking the chain rule with o as the output variable, z as the intermediate
variable, and wi as the input variable, we have that

∂o/∂wi = (∂o/∂z) · (∂z/∂wi)

Combining the equation above with equation (3.7), we have

∂J/∂wi = (∂J/∂o) · (∂o/∂z) · (∂z/∂wi) = (o − y) · 1{z ≥ 0} · xi

(because ∂J/∂o = (o − y), ∂o/∂z = 1{z ≥ 0}, and ∂z/∂wi = xi).

Here, the key is that we reduce the computation of ∂J/∂wi to the computation
of three simpler, more "local" objects ∂J/∂o, ∂o/∂z, and ∂z/∂wi, which are much
simpler to compute because J directly depends on o via equation (3.6), o
directly depends on z via equation (3.5), and z directly depends on wi via
equation (3.4). Note that in vectorized form, we can also write

∇w J = (o − y) · 1{z ≥ 0} · x

Similarly, we compute the gradient w.r.t. b by

∂J/∂b = (∂J/∂o) · (∂o/∂z) · (∂z/∂b) = (o − y) · 1{z ≥ 0}

(because ∂J/∂o = (o − y), ∂o/∂z = 1{z ≥ 0}, and ∂z/∂b = 1).
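The following numpy sketch evaluates the one-neuron model (3.4)–(3.6) together with the gradients derived above, and compares one coordinate of ∇w J against a finite-difference estimate; the random data and the step size eps are arbitrary choices made for the check.

import numpy as np

def loss_and_grads(w, b, x, y):
    # Forward pass (3.4)-(3.6) and the gradients derived above.
    z = w @ x + b
    o = max(z, 0.0)                        # ReLU
    J = 0.5 * (y - o) ** 2
    dJ_dw = (o - y) * float(z >= 0) * x    # (o - y) * 1{z >= 0} * x
    dJ_db = (o - y) * float(z >= 0)        # (o - y) * 1{z >= 0}
    return J, dJ_dw, dJ_db

rng = np.random.default_rng(0)
d = 4
w, b = rng.normal(size=d), 0.3
x, y = rng.normal(size=d), 1.5
J, dJ_dw, dJ_db = loss_and_grads(w, b, x, y)

eps = 1e-6                                  # finite-difference check of dJ/dw_0
w_plus = w.copy()
w_plus[0] += eps
J_plus, _, _ = loss_and_grads(w_plus, b, x, y)
print(dJ_dw[0], (J_plus - J) / eps)         # the two numbers should be close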

3.3 Two-layer neural networks: a low-level unpacked computation
Note: this subsection derives the derivatives with low-level notations to
help you build up intuition on backpropagation. If you are looking for a
clean formula, or you are familiar with matrix derivatives, then feel free to
jump to the next subsection directly.
Now we consider the two-layer neural network defined in equation (2.6).
We compute the loss J by the following sequence of operations:

∀ j ∈ [1, ..., m],  zj = (wj[1])⊤x + bj[1],  where wj[1] ∈ R^d, bj[1] ∈ R
aj = ReLU(zj),
a = [a1, . . . , am]⊤ ∈ R^m
o = w[2]⊤a + b[2],  where w[2] ∈ R^m, b[2] ∈ R
J = (1/2)(y − o)²    (3.8)

We will use (w[2])ℓ to denote the ℓ-th coordinate of w[2], and (wj[1])ℓ to denote
the ℓ-th coordinate of wj[1]. (We will avoid using these cumbersome notations
once we figure out how to write everything in matrix and vector forms.)

By invoking the chain rule with J as the output variable, o as the intermediate
variable, and (w[2])ℓ as the input variable, we have

∂J/∂(w[2])ℓ = (∂J/∂o) · (∂o/∂(w[2])ℓ)
            = (o − y) · ∂o/∂(w[2])ℓ
            = (o − y) aℓ

It's more challenging to compute ∂J/∂(wj[1])ℓ. Towards computing it, we first
invoke the chain rule with J as the output variable, zj as the intermediate
variable, and (wj[1])ℓ as the input variable:

∂J/∂(wj[1])ℓ = (∂J/∂zj) · (∂zj/∂(wj[1])ℓ)
             = (∂J/∂zj) · xℓ        (because ∂zj/∂(wj[1])ℓ = xℓ)

Thus, it suffices to compute ∂J/∂zj. We invoke the chain rule with J as the
output variable, aj as the intermediate variable, and zj as the input variable:

∂J/∂zj = (∂J/∂aj) · (∂aj/∂zj)
       = (∂J/∂aj) · 1{zj ≥ 0}

Now it suffices to compute ∂J/∂aj, and we invoke the chain rule with J as the
output variable, o as the intermediate variable, and aj as the input variable:

∂J/∂aj = (∂J/∂o) · (∂o/∂aj)
       = (o − y) · (w[2])j

Now combining the equations above, we obtain

∂J/∂(wj[1])ℓ = (o − y) · (w[2])j · 1{zj ≥ 0} · xℓ

Next we gauge the runtime of computing these partial derivatives. Let p
denote the total number of parameters in the network. We note that p ≥ md,
where m is the number of hidden units and d is the input dimension. For
every j and ℓ, to compute ∂J/∂(wj[1])ℓ, apparently we need to compute at least
the output o, which takes at least p ≥ md operations. Therefore at first
glance computing a single partial derivative takes at least md time, and the total time
to compute the derivatives w.r.t. all the parameters is at least (md)², which
is inefficient.
However, the key to backpropagation is that for different choices of ℓ,
the formulas above for computing ∂J/∂(wj[1])ℓ share many terms, such as (o − y),
(w[2])j, and 1{zj ≥ 0}. This suggests that we can re-organize the computation
to leverage the shared terms.
It turns out the crucial shared quantities in these formulas are ∂J/∂o,
∂J/∂z1, . . . , ∂J/∂zm. We now use them to compute the gradients
efficiently in Algorithm 3.

Algorithm 3 Backpropagation for two-layer neural networks
1: Compute the values of z1, . . . , zm, a1, . . . , am, and o as in the definition of the
   neural network (equation (3.8)).
2: Compute ∂J/∂o = (o − y).
3: Compute ∂J/∂zj for j = 1, . . . , m by

   ∂J/∂zj = (∂J/∂o) · (∂o/∂aj) · (∂aj/∂zj) = (∂J/∂o) · (w[2])j · 1{zj ≥ 0}    (3.9)

4: Compute ∂J/∂(wj[1])ℓ, ∂J/∂bj[1], ∂J/∂(w[2])j, and ∂J/∂b[2] by

   ∂J/∂(wj[1])ℓ = (∂J/∂zj) · (∂zj/∂(wj[1])ℓ) = (∂J/∂zj) · xℓ
   ∂J/∂bj[1]    = (∂J/∂zj) · (∂zj/∂bj[1])    = ∂J/∂zj
   ∂J/∂(w[2])j  = (∂J/∂o) · (∂o/∂(w[2])j)    = (∂J/∂o) · aj
   ∂J/∂b[2]     = (∂J/∂o) · (∂o/∂b[2])       = ∂J/∂o

3.4 Two-layer neural network with vector notation

As we have done before in the definition of neural networks, the equations for
backpropagation become much cleaner with proper matrix notation. Here
we state the algorithm first and then provide a cleaner proof via matrix calculus.
Let

δ[2] ≜ ∂J/∂o ∈ R
δ[1] ≜ ∂J/∂z ∈ R^m    (3.10)

Here we note that when A is a real-valued variable^5 and B is a vector or
matrix variable, then ∂A/∂B denotes the collection of the partial derivatives with
the same shape as B.^6 In other words, if B is a matrix of dimension m × d,
then ∂A/∂B is a matrix in R^{m×d} with (∂A/∂B)ij = ∂A/∂Bij as the ij-th entry. Let v ⊙ w denote
the entry-wise product of two vectors v and w of the same dimension. Now
we are ready to describe backpropagation in Algorithm 4.

Algorithm 4 Backpropagation for two-layer neural networks in vectorized notation
1: Compute the values of z ∈ R^m, a ∈ R^m, and o.
2: Compute δ[2] = (o − y) ∈ R.
3: Compute δ[1] = (o − y) · W[2]⊤ ⊙ 1{z ≥ 0} ∈ R^{m×1}.
4: Compute

   ∂J/∂W[2] = δ[2]a⊤ ∈ R^{1×m}
   ∂J/∂b[2] = δ[2] ∈ R
   ∂J/∂W[1] = δ[1]x⊤ ∈ R^{m×d}
   ∂J/∂b[1] = δ[1] ∈ R^m

^5 We will avoid using the notation ∂A/∂B for A that is not a real-valued variable.
^6 If you are familiar with the notion of total derivatives, we note that the dimensionality
here is different from that for total derivatives.
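Here is a numpy sketch of Algorithm 4 (vectorized backpropagation for the two-layer network). The function name, the dictionary of gradients, and the random shapes are illustrative choices.

import numpy as np

def two_layer_backprop(x, y, W1, b1, W2, b2):
    # Shapes: x (d,1), W1 (m,d), b1 (m,1), W2 (1,m), b2 (1,1).
    # Step 1: forward pass
    z = W1 @ x + b1                 # (m,1)
    a = np.maximum(z, 0)            # ReLU
    o = (W2 @ a + b2).item()        # scalar output
    # Steps 2-3: the delta terms
    delta2 = o - y                              # dJ/do
    delta1 = delta2 * W2.T * (z >= 0)           # dJ/dz, shape (m,1)
    # Step 4: gradients
    return o, {
        "W2": delta2 * a.T,     # (1,m)
        "b2": delta2,           # scalar
        "W1": delta1 @ x.T,     # (m,d)
        "b1": delta1,           # (m,1)
    }

d, m = 4, 3
rng = np.random.default_rng(0)
x, y = rng.normal(size=(d, 1)), 2.0
W1, b1 = rng.normal(size=(m, d)), np.zeros((m, 1))
W2, b2 = rng.normal(size=(1, m)), np.zeros((1, 1))
o, grads = two_layer_backprop(x, y, W1, b1, W2, b2)
print(o, grads["W1"].shape, grads["W2"].shape)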

Derivation using the chain rule for matrix multiplication. To


have a succinct derivation of the backpropagation algorithm in Algorithm 4
without working with the complex indices, we state the extensions of the
chain rule in vectorized notations. It requires more knowledge of matrix
calculus to state the most general result, and therefore we will introduce
a few special cases that are most relevant for deep learning. Suppose J
is a real-valued output variable, z ∈ R^m is the intermediate variable, and
W ∈ R^{m×d}, b ∈ R^m, u ∈ R^d are the input variables. Suppose they satisfy:

z = W u + b, where W ∈ R^{m×d}
J = J(z)    (3.11)

Then we can compute ∂J/∂u, ∂J/∂W, and ∂J/∂b by:

∂J/∂u = W⊤(∂J/∂z)    (3.12)
∂J/∂W = (∂J/∂z) · u⊤    (3.13)
∂J/∂b = ∂J/∂z    (3.14)

We can verify that the dimensionality is indeed compatible: ∂J/∂z ∈ R^m,
W⊤ ∈ R^{d×m}, ∂J/∂u ∈ R^d, ∂J/∂W ∈ R^{m×d}, and u⊤ ∈ R^{1×d}.
Here the chain rule in equation (3.12) only works for the special case
where z = W u + b. Another useful case is the following:

a = σ(z), where σ is an element-wise activation, z, a ∈ R^d
J = J(a)

Then, we have that

∂J/∂z = (∂J/∂a) ⊙ σ'(z)    (3.15)

where σ'(·) is the element-wise derivative of the activation function σ, and ⊙
is the element-wise product of two vectors of the same dimensionality.
Using equations (3.12), (3.13), and (3.15), we can verify the correctness of
Algorithm 4. Indeed, using the notation of the two-layer neural network,

∂J/∂z = (∂J/∂a) ⊙ ReLU'(z)
        (by invoking equation (3.15) with J ← J, a ← a, z ← z, σ ← ReLU)
      = (o − y)W[2]⊤ ⊙ ReLU'(z)
        (by invoking equation (3.12) with J ← J, z ← o, W ← W[2], u ← a, b ← b[2])

Therefore, δ[1] = ∂J/∂z, and we verify the correctness of Line 3 in Algorithm 4.
Similarly, let's verify the third equation in Line 4:

∂J/∂W[1] = (∂J/∂z) · x⊤
           (by invoking equation (3.13) with J ← J, z ← z, W ← W[1], u ← x, b ← b[1])
         = δ[1]x⊤    (because we have proved δ[1] = ∂J/∂z)

3.5 Multi-layer neural networks


In this section, we will derive the backpropagation algorithm for the model
defined in (2.12). Recall that we have

a[1] = ReLU(W[1]x + b[1])
a[2] = ReLU(W[2]a[1] + b[2])
· · ·
a[r−1] = ReLU(W[r−1]a[r−2] + b[r−1])
a[r] = z[r] = W[r]a[r−1] + b[r]
J = (1/2)(a[r] − y)²

Here we define both a[r] and z[r] as hθ(x) for notational simplicity, and, as before,
z[k] = W[k]a[k−1] + b[k] denotes the pre-activation of layer k (so a[k] = ReLU(z[k]) for k < r).
Define

δ[k] = ∂J/∂z[k]    (3.16)

The backpropagation algorithm computes the δ[k]'s from k = r down to 1, and
computes ∂J/∂W[k] from δ[k], as described in Algorithm 5.

4 Vectorization Over Training Examples


As we discussed in Section 1, in the implementation of neural networks, we
will leverage the parallelism across the multiple examples. This means that
we will need to write the forward pass (the evaluation of the outputs) of
the neural network and the backward pass (backpropagation) for multiple
training examples in matrix notation.
Algorithm 5 Backpropagation for multi-layer neural networks
1: Compute and store the values of the a[k]'s and z[k]'s for k = 1, . . . , r − 1, and
   J. (This is often called the "forward pass".)
2: Compute δ[r] = ∂J/∂z[r] = (z[r] − y).
3: for k = r − 1 to 1 do
4:   Compute

       δ[k] = ∂J/∂z[k] = (W[k+1]⊤δ[k+1]) ⊙ ReLU'(z[k])

5:   Compute

       ∂J/∂W[k+1] = δ[k+1](a[k])⊤
       ∂J/∂b[k+1] = δ[k+1]
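Putting the forward pass of (2.12) and Algorithm 5 together, here is a numpy sketch for an r-layer network. It extends the loop one extra step (using the convention a[0] = x) so that ∂J/∂W[1] = δ[1]x⊤ is also produced; the layer sizes and initialization are illustrative.

import numpy as np

def forward_backward(x, y, Ws, bs):
    # Ws[k], bs[k] hold W[k+1], b[k+1]; x has shape (d,1).
    r = len(Ws)
    a_list, z_list = [x], []                   # a_list[k] = a[k], with a[0] = x
    for k in range(r):                         # forward pass
        z = Ws[k] @ a_list[-1] + bs[k]
        z_list.append(z)
        a_list.append(np.maximum(z, 0) if k < r - 1 else z)   # last layer is linear
    J = 0.5 * ((a_list[-1] - y) ** 2).item()
    grads_W, grads_b = [None] * r, [None] * r
    delta = z_list[-1] - y                     # delta[r] = z[r] - y
    for k in range(r - 1, -1, -1):             # backward pass
        grads_W[k] = delta @ a_list[k].T       # dJ/dW[k+1] = delta[k+1] (a[k])^T
        grads_b[k] = delta                     # dJ/db[k+1] = delta[k+1]
        if k > 0:
            delta = (Ws[k].T @ delta) * (z_list[k - 1] >= 0)   # delta[k], Algorithm 5 line 4
    return J, grads_W, grads_b

# Example: d = 4 inputs, hidden sizes 5 and 3, scalar output (r = 3 layers)
rng = np.random.default_rng(0)
sizes = [4, 5, 3, 1]
Ws = [0.1 * rng.normal(size=(sizes[k + 1], sizes[k])) for k in range(3)]
bs = [np.zeros((sizes[k + 1], 1)) for k in range(3)]
x, y = rng.normal(size=(4, 1)), 1.0
J, gW, gb = forward_backward(x, y, Ws, bs)
print(J, [g.shape for g in gW])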

The basic idea. The basic idea is simple. Suppose you have a training
set with three examples x(1) , x(2) , x(3) . The first-layer activations for each
example are as follows:

z [1](1) = W [1] x(1) + b[1]


z [1](2) = W [1] x(2) + b[1]
z [1](3) = W [1] x(3) + b[1]

Note the difference between square brackets [·], which refer to the layer number,
and parentheses (·), which refer to the training example number. Intuitively,
one would implement this using a for loop. It turns out we can
vectorize these operations as well. First, define:

X = [ x^(1)  x^(2)  x^(3) ] ∈ R^{d×3}    (4.1)

Note that we are stacking training examples in columns and not rows. We
can then combine this into a single unified formulation:

Z[1] = [ z[1](1)  z[1](2)  z[1](3) ] = W[1]X + b[1]    (4.2)

You may notice that we are attempting to add b[1] ∈ R^{4×1} to W[1]X ∈
R^{4×3}. Strictly following the rules of linear algebra, this is not allowed. In
practice, however, this addition is performed using broadcasting. We create
an intermediate b̃[1] ∈ R^{4×3}:

b̃[1] = [ b[1]  b[1]  b[1] ]    (4.3)

We can then perform the computation Z[1] = W[1]X + b̃[1]. Oftentimes, it
is not necessary to explicitly construct b̃[1]. By inspecting the dimensions in
(4.2), you can assume b[1] ∈ R^{4×1} is correctly broadcast to W[1]X ∈ R^{4×3}.
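In numpy, this broadcasting happens automatically when a column vector of shape (m, 1) is added to an (m, n) matrix, so b̃[1] never needs to be built explicitly. A small sketch (the sizes simply mirror the three-example setup above):

import numpy as np

rng = np.random.default_rng(0)
d, m, n = 5, 4, 3                       # input dim, hidden units, # examples
X = rng.normal(size=(d, n))             # examples stacked in columns, eq. (4.1)
W1 = rng.normal(size=(m, d))
b1 = rng.normal(size=(m, 1))            # column vector

Z1 = W1 @ X + b1                        # eq. (4.2); b1 is broadcast across the n columns
b1_tilde = np.tile(b1, (1, n))          # the explicit b~[1] of eq. (4.3)
print(Z1.shape)                         # (4, 3)
print(np.allclose(Z1, W1 @ X + b1_tilde))   # True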
The matricization approach as above can easily generalize to multiple
layers, with one subtlety though, as discussed below.

Complications/Subtlety in the Implementation. All the deep learning
packages or implementations put the data points in the rows of a data
matrix. (If the data point itself is a matrix or tensor, then the data are
concatenated along the zeroth dimension.) However, most of the deep learning
papers use a notation similar to these notes, where the data points are treated
as column vectors.^7 There is a simple conversion to deal with the mismatch:
in the implementation, all the column vectors become row vectors, row vectors
become column vectors, all the matrices are transposed, and the orders of the
matrix multiplications are flipped. In the example above, using the row-major
convention, the data matrix is X ∈ R^{3×d}, the first-layer weight matrix
has dimensionality d × m (instead of m × d as in the two-layer neural net
section), and the bias vector b[1] ∈ R^{1×m}. The computation for the hidden
activation becomes

Z[1] = XW[1] + b[1] ∈ R^{3×m}    (4.4)

^7 The instructor suspects that this is mostly because in mathematics we naturally multiply
a matrix to a vector on the left-hand side.
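For comparison, the same computation in the row-major convention of equation (4.4), as an implementation would typically write it (the shapes mirror the example above and are otherwise arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n, d, m = 3, 5, 4
X_rows = rng.normal(size=(n, d))        # data points in rows
W1 = rng.normal(size=(d, m))            # d x m: transposed relative to the notes
b1 = rng.normal(size=(1, m))            # row vector, broadcast over the n rows

Z1 = X_rows @ W1 + b1                   # eq. (4.4), shape (n, m) = (3, 4)
A1 = np.maximum(Z1, 0)                  # hidden activations for all examples at once
print(Z1.shape, A1.shape)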
