CS229 Notes: Deep Learning
We now begin our study of deep learning. In this set of notes, we give an
overview of neural networks, discuss vectorization, and discuss how to train
neural networks with backpropagation.
Cost/loss function. We define the least-square cost function for the $i$-th
example $(x^{(i)}, y^{(i)})$ as
$$J^{(i)}(\theta) = \frac{1}{2}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 \qquad (1.1)$$
1 If a concrete example is helpful, perhaps think about the model $h_\theta(x) = \theta_1^2 x_1^2 + \theta_2^2 x_2^2 + \cdots + \theta_d^2 x_d^2$ in this subsection, even though it's not a neural network.
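As a quick illustration, here is a minimal Python/NumPy sketch (our own, not from the notes; the helper names such as `per_example_cost` are made up) that evaluates the per-example cost (1.1) for the quadratic model mentioned in the footnote.

```python
import numpy as np

def h_theta(theta, x):
    # Quadratic model from the footnote: sum_k theta_k^2 * x_k^2.
    return np.sum(theta**2 * x**2)

def per_example_cost(theta, x_i, y_i):
    # Equation (1.1): J^(i)(theta) = 1/2 * (h_theta(x^(i)) - y^(i))^2.
    return 0.5 * (h_theta(theta, x_i) - y_i) ** 2

theta = np.array([0.5, -1.0, 2.0])
x_i, y_i = np.array([1.0, 2.0, 0.5]), 3.0
print(per_example_cost(theta, x_i, y_i))
```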
2 Neural Networks
Neural networks refer to a broad type of non-linear models/parametrizations
hθ (x) that involve combinations of matrix multiplications and other entry-
wise non-linear operations. We will start small and slowly build up a neural
network, step by step.
Stacking Neurons. A more complex neural network may take the single
neuron described above and “stack” copies of it together, such that one neuron
passes its output as input into the next neuron, resulting in a more complex
function.
Let us now deepen the housing prediction example. In addition to the size
of the house, suppose that you know the number of bedrooms, the zip code
and the wealth of the neighborhood. Building neural networks is analogous
to building with Lego bricks: you take individual bricks and stack them together
to build complex structures. The same applies to neural networks: we take
individual neurons and stack them together to create complex neural networks.
Given these features (size, number of bedrooms, zip code, and wealth),
we might then decide that the price of the house depends on the maximum
family size it can accommodate. Suppose the family size is a function of
the size of the house and number of bedrooms (see Figure 2). The zip code
may provide additional information such as how walkable the neighborhood
is (i.e., can you walk to the grocery store or do you need to drive everywhere).
Combining the zip code with the wealth of the neighborhood may predict
the quality of the local elementary school. Given these three derived features
(family size, walkable, school quality), we may conclude that the price of the
home ultimately depends on these three features.
Formally, the input to a neural network is a set of input features
$x_1, x_2, x_3, x_4$. We denote the intermediate variables for “family size”, “walk-
able”, and “school quality” by $a_1, a_2, a_3$ (these $a_i$'s are often referred to as
“hidden units” or “hidden neurons”).
[Figure: housing prices; price (in $1000) versus square feet.]
Concretely, we can compute the hidden units as
$$a_1 = \mathrm{ReLU}(\theta_1 x_1 + \theta_2 x_2 + \theta_3)$$
$$a_2 = \mathrm{ReLU}(\theta_4 x_3 + \theta_5)$$
$$a_3 = \mathrm{ReLU}(\theta_6 x_3 + \theta_7 x_4 + \theta_8)$$
where $(\theta_1, \cdots, \theta_8)$ are parameters. Now we represent the final output $h_\theta(x)$
as another linear function with $a_1, a_2, a_3$ as inputs. More generally, for a
network with $m$ hidden units, each hidden unit computes $z_j = w_j^{[1]\top} x + b_j^{[1]}$ and
$$a_j = \mathrm{ReLU}(z_j), \qquad a = [a_1, \ldots, a_m]^\top \in \mathbb{R}^m,$$
and the output is
$$h_\theta(x) = w^{[2]\top} a + b^{[2]}, \quad \text{where } w^{[2]} \in \mathbb{R}^m,\ b^{[2]} \in \mathbb{R}. \qquad (2.6)$$
Note that by default the vectors in $\mathbb{R}^d$ are viewed as column vectors, and
in particular $a$ is a column vector with components $a_1, a_2, \ldots, a_m$. The indices
$^{[1]}$ and $^{[2]}$ are used to distinguish two sets of parameters: the $w_j^{[1]}$'s (each of
which is a vector in $\mathbb{R}^d$) and $w^{[2]}$ (which is a vector in $\mathbb{R}^m$). We will have
more of these later.
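To make this concrete, here is a minimal NumPy sketch (our own illustration, not code from the notes; the shapes and seed are arbitrary) that computes the two-layer model of equation (2.6) one hidden unit at a time, exactly as written above.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

d, m = 4, 3                                       # input dimension, number of hidden units
rng = np.random.default_rng(0)
w1 = [rng.standard_normal(d) for _ in range(m)]   # w_j^[1] in R^d, one per hidden unit
b1 = rng.standard_normal(m)                       # b_j^[1]
w2 = rng.standard_normal(m)                       # w^[2] in R^m
b2 = 0.1                                          # b^[2] in R

x = rng.standard_normal(d)

# Hidden units computed one at a time (a for loop over j).
a = np.array([relu(w1[j] @ x + b1[j]) for j in range(m)])

# Output: h_theta(x) = w^[2]^T a + b^[2], as in equation (2.6).
h = w2 @ a + b2
print(h)
```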
Leveraging the parallelism in GPUs has been crucial for the progress of deep
learning.
This gave rise to vectorization. Instead of using for loops, vectorization
takes advantage of matrix algebra and highly optimized numerical linear
algebra packages (e.g., BLAS) to make neural network computations run
quickly. Before the deep learning era, a for loop may have been sufficient
on smaller datasets, but running modern deep networks on state-of-the-art
datasets with for loops would be infeasible.
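To illustrate the point, here is a small timing sketch (our own; the sizes and seed are arbitrary) that computes the same matrix-vector product once with an explicit Python loop over rows and once with a single vectorized call that dispatches to optimized linear algebra routines.

```python
import time
import numpy as np

m, d = 512, 1024
rng = np.random.default_rng(0)
W = rng.standard_normal((m, d))
x = rng.standard_normal(d)

# Explicit Python loop: one dot product per row (per "neuron").
t0 = time.perf_counter()
z_loop = np.array([W[j] @ x for j in range(m)])
t1 = time.perf_counter()

# Single vectorized matrix-vector product.
z_vec = W @ x
t2 = time.perf_counter()

assert np.allclose(z_loop, z_vec)
print(f"loop: {t1 - t0:.5f}s, vectorized: {t2 - t1:.6f}s")
```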
We vectorize the two-layer fully-connected neural network as below. We
define a weight matrix $W^{[1]}$ in $\mathbb{R}^{m\times d}$ as the concatenation of all the vectors
$w_j^{[1]}$'s in the following way:
$$W^{[1]} = \begin{bmatrix} \text{---}\; w_1^{[1]\top} \;\text{---} \\ \text{---}\; w_2^{[1]\top} \;\text{---} \\ \vdots \\ \text{---}\; w_m^{[1]\top} \;\text{---} \end{bmatrix} \in \mathbb{R}^{m\times d} \qquad (2.7)$$

$$\underbrace{\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_m \end{bmatrix}}_{z\,\in\,\mathbb{R}^{m\times 1}} = \underbrace{\begin{bmatrix} \text{---}\; w_1^{[1]\top} \;\text{---} \\ \text{---}\; w_2^{[1]\top} \;\text{---} \\ \vdots \\ \text{---}\; w_m^{[1]\top} \;\text{---} \end{bmatrix}}_{W^{[1]}\,\in\,\mathbb{R}^{m\times d}} \underbrace{\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix}}_{x\,\in\,\mathbb{R}^{d\times 1}} + \underbrace{\begin{bmatrix} b_1^{[1]} \\ b_2^{[1]} \\ \vdots \\ b_m^{[1]} \end{bmatrix}}_{b^{[1]}\,\in\,\mathbb{R}^{m\times 1}} \qquad (2.8)$$
Or succinctly,
$$z = W^{[1]} x + b^{[1]} \qquad (2.9)$$
$$a = \mathrm{ReLU}(z) \qquad (2.10)$$
where ReLU is applied entry-wise to $z$.
Define $W^{[2]} = [w^{[2]}]^\top \in \mathbb{R}^{1\times m}$ similarly. Then, the model in equation (2.6)
can be summarized as
$$h_\theta(x) = W^{[2]} a + b^{[2]} \qquad (2.11)$$
Here θ consists of W [1] , W [2] (often referred to as the weight matrices) and
b[1] , b[2] (referred to as the biases). The collection of W [1] , b[1] is referred to as
the first layer, and W [2] , b[2] the second layer. The activation a is referred to as
the hidden layer. A two-layer neural network is also called a one-hidden-layer
neural network.
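Putting equations (2.7) through (2.11) together, here is a minimal NumPy sketch of the vectorized two-layer forward pass (our own illustration; shapes and seed are made up). The two matrix products replace the per-neuron for loop shown earlier.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

d, m = 4, 3
rng = np.random.default_rng(0)
W1 = rng.standard_normal((m, d))   # W^[1] in R^{m x d}, rows are w_j^[1]^T
b1 = rng.standard_normal(m)        # b^[1] in R^m
W2 = rng.standard_normal((1, m))   # W^[2] in R^{1 x m}
b2 = rng.standard_normal(1)        # b^[2] in R

x = rng.standard_normal(d)

z = W1 @ x + b1                    # equation (2.9)
a = relu(z)                        # equation (2.10), applied entry-wise
h = W2 @ a + b2                    # equation (2.11)
print(h)                           # shape (1,): the prediction
```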
We note that the weight matrices and biases need to have compatible
dimensions for the equations above to make sense. If $a^{[k]}$ has dimension $m_k$,
then the weight matrix $W^{[k]}$ should be of dimension $m_k \times m_{k-1}$, and the bias
$b^{[k]} \in \mathbb{R}^{m_k}$. Moreover, $W^{[1]} \in \mathbb{R}^{m_1\times d}$ and $W^{[r]} \in \mathbb{R}^{1\times m_{r-1}}$.
The total number of neurons in the network is $m_1 + \cdots + m_r$, and the
total number of parameters in this network is $(d+1)m_1 + (m_1+1)m_2 + \cdots + (m_{r-1}+1)m_r$.
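As a sanity check on the parameter-count formula, here is a small sketch (the layer sizes are made up for illustration) that counts $(m_{k-1}+1)m_k$ parameters per layer, with $m_0 = d$.

```python
def num_parameters(d, layer_sizes):
    # layer_sizes = [m_1, ..., m_r]; layer k has an m_k x m_{k-1}
    # weight matrix plus an m_k-dimensional bias, i.e. (m_{k-1} + 1) * m_k.
    sizes = [d] + list(layer_sizes)
    return sum((sizes[k - 1] + 1) * sizes[k] for k in range(1, len(sizes)))

# Example: d = 4 inputs, hidden layers of 3 and 2 units, scalar output.
print(num_parameters(4, [3, 2, 1]))   # (4+1)*3 + (3+1)*2 + (2+1)*1 = 26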
Sometimes for notational consistency we also write $a^{[0]} = x$ and $a^{[r]} = h_\theta(x)$. Then we have the simple recursion
$$a^{[k]} = \mathrm{ReLU}(W^{[k]} a^{[k-1]} + b^{[k]}), \qquad \forall k = 1, \ldots, r-1, \qquad (2.12)$$
$$a^{[r]} = W^{[r]} a^{[r-1]} + b^{[r]}. \qquad (2.13)$$
Note that the recursion (2.12) would also have been true for $k = r$ if there were an
additional ReLU in the last layer, but often people like to make the last layer linear
(i.e., without a ReLU) so that negative outputs are possible and it's easier
to interpret the last layer as a linear model. (More on the interpretability at
the “connection to kernel method” paragraph of this section.)
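The recursion (2.12)-(2.13) translates directly into a short loop. Below is a hedged sketch (our own; layer sizes and seed are arbitrary) that keeps the last layer linear, as discussed above.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def forward(x, Ws, bs):
    # Ws = [W^[1], ..., W^[r]], bs = [b^[1], ..., b^[r]].
    a = x                                    # a^[0] = x
    for k in range(len(Ws) - 1):
        a = relu(Ws[k] @ a + bs[k])          # a^[k] = ReLU(W^[k] a^[k-1] + b^[k])
    return Ws[-1] @ a + bs[-1]               # last layer linear: a^[r] = W^[r] a^[r-1] + b^[r]

rng = np.random.default_rng(0)
sizes = [4, 3, 2, 1]                         # d, m_1, m_2, m_3 (= output)
Ws = [rng.standard_normal((sizes[k + 1], sizes[k])) for k in range(len(sizes) - 1)]
bs = [rng.standard_normal(sizes[k + 1]) for k in range(len(sizes) - 1)]
print(forward(rng.standard_normal(4), Ws, bs))
```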
Why do we not use the identity function for σ(z)? That is, why
not use σ(z) = z? Assume for the sake of argument that $b^{[1]}$ and $b^{[2]}$ are zeros.
Suppose σ(z) = z; then for the two-layer neural network, we have that
$$h_\theta(x) = W^{[2]} a = W^{[2]} W^{[1]} x,$$
which is itself a linear function of $x$ (with weight matrix $W^{[2]}W^{[1]}$). In other
words, composing linear layers without a non-linearity gives no more expressive
power than a single linear model.
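A quick numeric check of this point (our own illustration, with arbitrary shapes): with $\sigma(z) = z$ and zero biases, the two-layer model coincides with a single linear model whose weight matrix is $W^{[2]}W^{[1]}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 3
W1 = rng.standard_normal((m, d))
W2 = rng.standard_normal((1, m))
x = rng.standard_normal(d)

two_layer_identity = W2 @ (W1 @ x)      # sigma(z) = z, zero biases
single_linear = (W2 @ W1) @ x           # one linear model with weights W2 W1
print(np.allclose(two_layer_identity, single_linear))   # True
```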
In classical machine learning pipelines (e.g., kernel methods), one typically
has to hand-design a feature map φ(x) that suits the particular application.
The process of choosing the feature map is often referred to as feature engineering.
We can view deep learning as a way to automatically learn the right
feature map (sometimes also referred to as “the representation”) as follows.
Suppose we denote by β the collection of the parameters in a fully-connected
neural network (equation (2.12)) except those in the last layer. Then we
can write $a^{[r-1]}$ as a function of the input $x$ and the parameters in
β: $a^{[r-1]} = \phi_\beta(x)$. Now we can write the model as
$$h_\theta(x) = W^{[r]} \phi_\beta(x) + b^{[r]}.$$
When β is fixed, $\phi_\beta(\cdot)$ can be viewed as a feature map, and therefore $h_\theta(x)$
is just a linear model over the features $\phi_\beta(x)$. However, when we train the
neural network, both the parameters in β and the parameters $W^{[r]}, b^{[r]}$ are
optimized, and therefore we are not only learning a linear model in the feature
space, but also learning a good feature map $\phi_\beta(\cdot)$ itself, so that it's possible
to predict accurately with a linear model on top of the feature map.
Therefore, deep learning tends to depend less on the domain knowledge of
the particular application and often requires less feature engineering. The
penultimate layer activation $a^{[r-1]}$ is often (informally) referred to as the learned
features or representation in the context of deep learning.
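To illustrate the "linear model on top of a learned feature map" view, here is a hedged sketch (our own; names, shapes, and random targets are made up). The body parameters β are frozen (here just random, purely for illustration), so the penultimate activations $\phi_\beta(x)$ act as fixed features and the last layer reduces to an ordinary least-squares fit on those features.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

rng = np.random.default_rng(0)
d, m, n = 5, 8, 100
W1, b1 = rng.standard_normal((m, d)), rng.standard_normal(m)   # beta, frozen here

def phi_beta(X):
    # Feature map given by the body of the network: a^[r-1] for each example.
    return relu(X @ W1.T + b1)          # shape (n, m)

X = rng.standard_normal((n, d))
y = rng.standard_normal(n)              # made-up targets, just for illustration

features = phi_beta(X)
# Fit only the last (linear) layer by least squares on [features, 1].
A = np.hstack([features, np.ones((n, 1))])
w_last, *_ = np.linalg.lstsq(A, y, rcond=None)
predictions = A @ w_last                # h(x) = W^[r] phi_beta(x) + b^[r]
print(predictions[:3])
```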
In the example of house price prediction, a fully-connected neural network
does not need us to specify intermediate quantities such as “family size”, and
may automatically discover some useful features in the penultimate layer
(the activation $a^{[r-1]}$), and use them to linearly predict the housing price.
Often the feature map / representation obtained from one dataset (that is,
the function $\phi_\beta(\cdot)$) can also be useful for other datasets, which indicates that they
contain essential information about the data. However, oftentimes, the neural
network will discover complex features which are very useful for predicting
the output but may be difficult for a human to understand or interpret. This
is why some people refer to neural networks as a black box, as it can be
difficult to understand the features it has discovered.
3 Backpropagation
In this section, we introduce backpropagation (or auto-differentiation), which
computes the gradient of the loss $\nabla J^{(j)}(\theta)$ efficiently. We will start with an
informal theorem stating that as long as a real-valued function f can be
efficiently computed/evaluated by a differentiable network or circuit, then its
gradient can be computed in time comparable to the time needed to compute f itself.
We note that the loss function $J^{(j)}(\theta)$ for the $j$-th example can indeed be
computed by a sequence of operations and functions involving additions,
subtractions, multiplications, and non-linear activations. Thus the theorem
suggests that we should be able to compute $\nabla J^{(j)}(\theta)$ in a similar time
to that for computing $J^{(j)}(\theta)$ itself. This applies not only to the fully-
connected neural network introduced in Section 2, but also to many other
types of neural networks.
In the rest of the section, we will showcase how to compute the gradient of
the loss efficiently for fully-connected neural networks using backpropagation.
Even though auto-differentiation or backpropagation is implemented in all
the deep learning packages such as TensorFlow and PyTorch, understanding it
is very helpful for gaining insight into the workings of deep learning.
$$J = J(g_1, \ldots, g_k) \qquad (3.2)$$
Here we overload the meaning of the $g_j$'s: they denote both the intermediate
variables and the functions used to compute the intermediate variables.
Then, by the chain rule, we have that, for all $i$,
$$\frac{\partial J}{\partial \theta_i} = \sum_{j=1}^{k} \frac{\partial J}{\partial g_j}\frac{\partial g_j}{\partial \theta_i} \qquad (3.3)$$
For ease of invoking the chain rule in the following subsections in various
ways, we will call $J$ the output variable, $g_1, \ldots, g_k$ the intermediate variables,
and $\theta_1, \ldots, \theta_p$ the input variables in the chain rule.
$$z = w^\top x + b \qquad (3.4)$$
$$o = \mathrm{ReLU}(z) \qquad (3.5)$$
$$J = \frac{1}{2}(y - o)^2 \qquad (3.6)$$
By the chain rule with J as the output variable, o as the intermediate variable,
and wi the input variable, we have that
$$\frac{\partial J}{\partial w_i} = \frac{\partial J}{\partial o} \cdot \frac{\partial o}{\partial w_i} \qquad (3.7)$$
Invoking the chain rule with o as the output variable, z as the intermediate
variable, and wi the input variable, we have that
$$\frac{\partial o}{\partial w_i} = \frac{\partial o}{\partial z} \cdot \frac{\partial z}{\partial w_i}$$
Combining the two equations above with $\frac{\partial J}{\partial o} = (o - y)$, $\frac{\partial o}{\partial z} = 1\{z \ge 0\}$, and $\frac{\partial z}{\partial w_i} = x_i$, we obtain
$$\nabla_w J = (o - y) \cdot 1\{z \ge 0\} \cdot x$$
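This formula is easy to sanity-check numerically. Below is a short sketch (our own, with arbitrary values) that compares it against central finite differences; the check holds as long as $z$ is not exactly at the ReLU kink.

```python
import numpy as np

def loss(w, b, x, y):
    z = w @ x + b
    o = max(z, 0.0)                     # ReLU
    return 0.5 * (y - o) ** 2

rng = np.random.default_rng(0)
d = 4
w, b = rng.standard_normal(d), 0.3
x, y = rng.standard_normal(d), 1.5

z = w @ x + b
o = max(z, 0.0)
grad_analytic = (o - y) * float(z >= 0) * x     # (o - y) * 1{z >= 0} * x

eps = 1e-6
grad_numeric = np.zeros(d)
for i in range(d):
    e = np.zeros(d)
    e[i] = eps
    grad_numeric[i] = (loss(w + e, b, x, y) - loss(w - e, b, x, y)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))   # True away from z = 0
```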
Next, consider the two-layer network and the gradient with respect to $(w_j^{[1]})_\ell$. By the chain rule,
$$\frac{\partial J}{\partial (w_j^{[1]})_\ell} = \frac{\partial J}{\partial z_j}\cdot\frac{\partial z_j}{\partial (w_j^{[1]})_\ell} = \frac{\partial J}{\partial z_j}\cdot x_\ell \qquad \left(\text{because } \frac{\partial z_j}{\partial (w_j^{[1]})_\ell} = x_\ell\right)$$
Thus, it suffices to compute $\frac{\partial J}{\partial z_j}$. We invoke the chain rule with $J$ as the
output variable, $a_j$ as the intermediate variable, and $z_j$ as the input variable,
$$\frac{\partial J}{\partial z_j} = \frac{\partial J}{\partial a_j}\cdot\frac{\partial a_j}{\partial z_j} = \frac{\partial J}{\partial a_j}\cdot 1\{z_j \ge 0\}$$
Now it suffices to compute $\frac{\partial J}{\partial a_j}$, and we invoke the chain rule with $J$ as the
output variable, $o$ as the intermediate variable, and $a_j$ as the input variable,
$$\frac{\partial J}{\partial a_j} = \frac{\partial J}{\partial o}\cdot\frac{\partial o}{\partial a_j} = (o - y)\cdot (w^{[2]})_j$$
Recall that $m$ is the number of hidden units and $d$ is the input dimension. For
every $j$ and $\ell$, to compute $\frac{\partial J}{\partial (w_j^{[1]})_\ell}$, apparently we need to compute at least
the output $o$, which takes at least $p \ge md$ operations. Therefore, at first
glance, computing a single gradient takes at least $md$ time, and the total time
to compute the derivatives w.r.t. all the parameters is at least $(md)^2$, which
is inefficient.
However, the key idea of backpropagation is that, for different choices of $\ell$,
the formulas above for computing $\frac{\partial J}{\partial (w_j^{[1]})_\ell}$ share many terms, such as $(o - y)$,
$(w^{[2]})_j$ and $1\{z_j \ge 0\}$. This suggests that we can re-organize the computation
to leverage the shared terms.
It turns out the crucial shared quantities in these formulas are $\frac{\partial J}{\partial o}$,
$\frac{\partial J}{\partial z_1}, \ldots, \frac{\partial J}{\partial z_m}$. We now write the following formulas to compute the gradients
efficiently in Algorithm 3.
$$\frac{\partial J}{\partial z_j} = \frac{\partial J}{\partial o}\cdot\frac{\partial o}{\partial a_j}\cdot\frac{\partial a_j}{\partial z_j} = \frac{\partial J}{\partial o}\cdot (w^{[2]})_j\cdot 1\{z_j \ge 0\} \qquad (3.9)$$
4: Compute $\frac{\partial J}{\partial (w_j^{[1]})_\ell}$, $\frac{\partial J}{\partial b_j^{[1]}}$, $\frac{\partial J}{\partial (w^{[2]})_j}$, and $\frac{\partial J}{\partial b^{[2]}}$ by
$$\frac{\partial J}{\partial (w_j^{[1]})_\ell} = \frac{\partial J}{\partial z_j}\cdot\frac{\partial z_j}{\partial (w_j^{[1]})_\ell} = \frac{\partial J}{\partial z_j}\cdot x_\ell$$
$$\frac{\partial J}{\partial b_j^{[1]}} = \frac{\partial J}{\partial z_j}\cdot\frac{\partial z_j}{\partial b_j^{[1]}} = \frac{\partial J}{\partial z_j}$$
$$\frac{\partial J}{\partial (w^{[2]})_j} = \frac{\partial J}{\partial o}\cdot\frac{\partial o}{\partial (w^{[2]})_j} = \frac{\partial J}{\partial o}\cdot a_j$$
$$\frac{\partial J}{\partial b^{[2]}} = \frac{\partial J}{\partial o}\cdot\frac{\partial o}{\partial b^{[2]}} = \frac{\partial J}{\partial o}$$
In vectorized notation, writing $\delta^{[2]} \triangleq \frac{\partial J}{\partial o}$ and $\delta^{[1]} \triangleq \frac{\partial J}{\partial z}$, these gradients become
$$\frac{\partial J}{\partial W^{[2]}} = \delta^{[2]} a^\top \in \mathbb{R}^{1\times m}$$
$$\frac{\partial J}{\partial b^{[2]}} = \delta^{[2]} \in \mathbb{R}$$
$$\frac{\partial J}{\partial W^{[1]}} = \delta^{[1]} x^\top \in \mathbb{R}^{m\times d}$$
$$\frac{\partial J}{\partial b^{[1]}} = \delta^{[1]} \in \mathbb{R}^m$$
5 We will avoid using the notation $\frac{\partial A}{\partial B}$ for $A$ that is not a real-valued variable.
6 If you are familiar with the notion of total derivatives, we note that the dimensionality here is different from that for total derivatives.
$$z = Wu + b, \quad \text{where } W \in \mathbb{R}^{m\times d}$$
$$J = J(z) \qquad (3.11)$$
Then we can compute $\frac{\partial J}{\partial u}$ and $\frac{\partial J}{\partial W}$ by:
$$\frac{\partial J}{\partial u} = W^\top \frac{\partial J}{\partial z} \qquad (3.12)$$
$$\frac{\partial J}{\partial W} = \frac{\partial J}{\partial z}\cdot u^\top \qquad (3.13)$$
$$\frac{\partial J}{\partial b} = \frac{\partial J}{\partial z} \qquad (3.14)$$
We can verify that the dimensionality is indeed compatible because $\frac{\partial J}{\partial z}\in\mathbb{R}^m$,
$W^\top\in\mathbb{R}^{d\times m}$, $\frac{\partial J}{\partial u}\in\mathbb{R}^d$, $\frac{\partial J}{\partial W}\in\mathbb{R}^{m\times d}$, and $u^\top\in\mathbb{R}^{1\times d}$.
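As a quick check of (3.12)-(3.14), here is a short sketch (our own; it picks the simple made-up choice $J(z) = c^\top z$ so that $\frac{\partial J}{\partial z} = c$ is known in closed form) and compares one entry of (3.13) against finite differences.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 3, 4
W, b = rng.standard_normal((m, d)), rng.standard_normal(m)
u = rng.standard_normal(d)
c = rng.standard_normal(m)                 # fixed vector defining J(z) = c^T z

def J(W, u, b):
    return c @ (W @ u + b)

dJ_dz = c                                  # for J(z) = c^T z, dJ/dz = c
dJ_du = W.T @ dJ_dz                        # equation (3.12)
dJ_dW = np.outer(dJ_dz, u)                 # equation (3.13)
dJ_db = dJ_dz                              # equation (3.14)

# Spot-check (3.13) on one entry with central finite differences.
eps = 1e-6
E = np.zeros_like(W)
E[2, 1] = eps
numeric = (J(W + E, u, b) - J(W - E, u, b)) / (2 * eps)
print(np.isclose(dJ_dW[2, 1], numeric))    # True
```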
Here the chain rule in equation (3.12) only works for the special cases
where $z = Wu$. Another useful case is when $a = \sigma(z)$ with $\sigma$ applied entry-wise, for which
$$\frac{\partial J}{\partial z} = \frac{\partial J}{\partial a} \odot \sigma'(z), \qquad (3.15)$$
where $\sigma'$ denotes the entry-wise derivative of $\sigma$ and $\odot$ the entry-wise product.
Using these two rules, we can verify Line 3 of Algorithm 4:
$$\frac{\partial J}{\partial z} = \frac{\partial J}{\partial a} \odot \mathrm{ReLU}'(z) \qquad \text{(by invoking equation (3.15) with } J \leftarrow J,\ a \leftarrow a,\ z \leftarrow z,\ \sigma \leftarrow \mathrm{ReLU})$$
$$= (o - y)\, W^{[2]\top} \odot \mathrm{ReLU}'(z) \qquad \text{(by invoking equation (3.12) with } J \leftarrow J,\ z \leftarrow o,\ W \leftarrow W^{[2]},\ u \leftarrow a,\ b \leftarrow b^{[2]})$$
Therefore, $\delta^{[1]} = \frac{\partial J}{\partial z}$, and we verify the correctness of Line 3 in Algorithm 4.
Similarly, let's verify the third equation in Line 4,
$$\frac{\partial J}{\partial W^{[1]}} = \frac{\partial J}{\partial z}\cdot x^\top \qquad \text{(by invoking equation (3.13) with } J \leftarrow J,\ z \leftarrow z,\ W \leftarrow W^{[1]},\ u \leftarrow x,\ b \leftarrow b^{[1]})$$
$$= \delta^{[1]} x^\top \qquad \text{(because we have proved } \delta^{[1]} = \tfrac{\partial J}{\partial z})$$
4 Vectorization over training examples
The basic idea. The basic idea is simple. Suppose you have a training
set with three examples $x^{(1)}, x^{(2)}, x^{(3)}$. The first-layer activations for each
example are as follows:
$$z^{[1](1)} = W^{[1]} x^{(1)} + b^{[1]}$$
$$z^{[1](2)} = W^{[1]} x^{(2)} + b^{[1]}$$
$$z^{[1](3)} = W^{[1]} x^{(3)} + b^{[1]}$$
Note the difference between square brackets [·], which refer to the layer number,
and parentheses (·), which refer to the training example number. Intuitively,
one would implement this using a for loop. It turns out we can vectorize these
operations as well. First, define:
$$X = \begin{bmatrix} | & | & | \\ x^{(1)} & x^{(2)} & x^{(3)} \\ | & | & | \end{bmatrix} \in \mathbb{R}^{d\times 3} \qquad (4.1)$$
Note that we are stacking training examples in columns and not rows. We
can then combine this into a single unified formulation:
$$Z^{[1]} = \begin{bmatrix} | & | & | \\ z^{[1](1)} & z^{[1](2)} & z^{[1](3)} \\ | & | & | \end{bmatrix} = W^{[1]} X + b^{[1]} \qquad (4.2)$$
You may notice that we are attempting to add $b^{[1]} \in \mathbb{R}^{4\times 1}$ to $W^{[1]}X \in \mathbb{R}^{4\times 3}$.
Strictly following the rules of linear algebra, this is not allowed. In
practice however, this addition is performed using broadcasting. We create
an intermediate $\tilde{b}^{[1]} \in \mathbb{R}^{4\times 3}$:
$$\tilde{b}^{[1]} = \begin{bmatrix} | & | & | \\ b^{[1]} & b^{[1]} & b^{[1]} \\ | & | & | \end{bmatrix} \qquad (4.3)$$
We can then perform the computation $Z^{[1]} = W^{[1]} X + \tilde{b}^{[1]}$. Oftentimes, it
is not necessary to explicitly construct $\tilde{b}^{[1]}$. By inspecting the dimensions in
(4.2), you can assume $b^{[1]} \in \mathbb{R}^{4\times 1}$ is correctly broadcast to $W^{[1]}X \in \mathbb{R}^{4\times 3}$.
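Here is a tiny NumPy sketch of this broadcasting behavior (our own; the 4x3 shapes mirror the example above): adding a (4, 1) bias to a (4, 3) matrix implicitly tiles the bias across columns, so $\tilde{b}^{[1]}$ never needs to be built explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 5, 4, 3                        # input dim, hidden units, number of examples
W1 = rng.standard_normal((m, d))         # W^[1] in R^{4 x 5}
b1 = rng.standard_normal((m, 1))         # b^[1] in R^{4 x 1}
X = rng.standard_normal((d, n))          # examples stacked as columns, R^{5 x 3}

Z1 = W1 @ X + b1                         # b1 is broadcast across the 3 columns
Z1_explicit = W1 @ X + np.tile(b1, (1, n))   # explicitly building b-tilde gives the same result
print(Z1.shape, np.allclose(Z1, Z1_explicit))  # (4, 3) True
```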
The matricization approach above easily generalizes to multiple layers,
with one subtlety, as discussed below.
7 The instructor suspects that this is mostly because in mathematics we naturally multiply a matrix to a vector on the left-hand side.