7 NN Apr 28 2021
Neural unit

[Figure: schematic of a single neural unit. Input layer x1, x2, x3 and a +1 node feed weights w1, w2, w3 and bias b into a weighted sum ∑ producing z; z passes through a non-linear transform σ to give the activation a, the output value y.]
At its heart, a neural unit is taking a weighted sum of its inputs, with one additional term in the sum called a bias term. Given a set of inputs x1 ... xn, a unit has a set of corresponding weights w1 ... wn and a bias b, so the weighted sum z can be expressed as:

    z = b + Σi wi xi

Often it's more convenient to express this weighted sum using vector notation; recall from linear algebra that a vector is, at heart, just a list or array of numbers. Thus we'll talk about z in terms of a weight vector w, a scalar bias b, and an input vector x, and we'll replace the sum with the convenient dot product:

    z = w·x + b                                         (7.2)

As defined in Eq. 7.2, z is just a real valued number.
Instead of using z, a linear function of x, as the output, neural units apply a non-linear activation function f to z. We will refer to the output of this function as the activation value for the unit, a. Since we are just modeling a single unit, the activation for the node is in fact the final output of the network, which we'll generally call y. So the value y is defined as:

    y = a = f(z)

Non-Linear Activation Functions
We've already seen the sigmoid for logistic regression. We'll discuss three popular non-linear functions f() below (the sigmoid, the tanh, and the rectified linear unit or ReLU), but it's pedagogically convenient to start with the sigmoid, since we saw it in Chapter 5:

Sigmoid

    y = σ(z) = 1 / (1 + e^(-z))                         (7.3)

The sigmoid (shown in Fig. 7.1) has a number of advantages; it maps the output into the range (0, 1), which is useful in squashing outliers toward 0 or 1. And it's differentiable, which as we saw in Section ?? will be handy for learning.
Final function the unit is computing

Substituting Eq. 7.2 into Eq. 7.3 gives us the output of a neural unit:

    y = σ(w·x + b) = 1 / (1 + exp(-(w·x + b)))

Fig. 7.2 shows a final schematic of a basic neural unit. In this example the unit takes 3 input values x1, x2, and x3, and computes a weighted sum, multiplying each value by a weight (w1, w2, and w3, respectively), adds them to a bias term b, and then passes the resulting sum through a sigmoid function to result in a number between 0 and 1.
Final unit again

[Figure: the same neural unit schematic: input layer x1, x2, x3 and +1, weights w1, w2, w3 and bias b, producing activation a and output value y.]
An example, just to get an intuition. Suppose a unit has the following weight vector and bias:

    w = [0.2, 0.3, 0.9]
    b = 0.5

What would this unit do with the following input vector?

    x = [0.5, 0.6, 0.1]

The resulting output y would be:

    y = σ(w·x + b) = 1 / (1 + e^(-(w·x + b))) = 1 / (1 + e^(-(.5*.2 + .6*.3 + .1*.9 + .5))) = 1 / (1 + e^(-0.87)) = .70
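The arithmetic above is easy to check in code. Here is a minimal NumPy sketch; the values are the ones from the example and nothing else is assumed:

    import numpy as np

    def sigmoid(z):
        # σ(z) = 1 / (1 + e^(-z)), Eq. 7.3
        return 1.0 / (1.0 + np.exp(-z))

    w = np.array([0.2, 0.3, 0.9])   # weights from the example
    b = 0.5                         # bias from the example
    x = np.array([0.5, 0.6, 0.1])   # input vector from the example

    z = w.dot(x) + b                # weighted sum: 0.87
    y = sigmoid(z)                  # output: about 0.70
    print(z, y)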
In practice, the sigmoid is not commonly used as an activation function. A function that is very similar but almost always better is the tanh function shown in Fig. 7.3a; tanh is a variant of the sigmoid that ranges from -1 to +1:

    y = (e^z - e^(-z)) / (e^z + e^(-z))                 (7.5)

The simplest activation function, and perhaps the most commonly used, is the rectified linear unit, also called the ReLU, shown in Fig. 7.3b. It's just the same as z when z is positive, and 0 otherwise:

    y = max(z, 0)                                       (7.6)

[Figure 7.3: (a) the tanh function; (b) the ReLU (rectified linear unit).]
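For reference, the three activation functions above are one-liners in NumPy. A minimal sketch, not tied to any particular deep-learning library:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))                              # Eq. 7.3: output in (0, 1)

    def tanh(z):
        return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))   # Eq. 7.5: output in (-1, 1)

    def relu(z):
        return np.maximum(z, 0.0)                                    # Eq. 7.6: z if positive, else 0

    z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(sigmoid(z))   # squashed toward 0 or 1
    print(tanh(z))      # same as np.tanh(z)
    print(relu(z))      # [0.  0.  0.  0.5 2. ]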
Units in Neural Networks

The XOR problem
Can neural units compute simple functions of input?

One of the most clever demonstrations of the need for multi-layer networks was the proof by Minsky and Papert (1969) that a single neural unit cannot compute some very simple functions of its input. Consider the task of computing elementary logical functions of two inputs, like AND, OR, and XOR. As a reminder, here are the truth tables for those functions:

    AND           OR            XOR
    x1 x2 y       x1 x2 y       x1 x2 y
    0  0  0       0  0  0       0  0  0
    0  1  0       0  1  1       0  1  1
    1  0  0       1  0  1       1  0  1
    1  1  1       1  1  1       1  1  0

This example was first shown for the perceptron, which is a very simple neural unit that has a binary output and does not have a non-linear activation function.
Perceptrons

A very simple neural unit:
• Binary output (0 or 1)
• No non-linear activation function

The output of a perceptron is 0 or 1, and is computed as follows (using the same weight w, input x, and bias b as in Eq. 7.2):

    y = 0, if w·x + b ≤ 0
    y = 1, if w·x + b > 0                               (7.7)
It is very easy to build a perceptron that can compute the logical AND and OR functions of its binary inputs; Fig. 7.4 shows the necessary weights.

[Figure 7.4: The weights w and bias b for perceptrons computing logical functions. The inputs are shown as x1 and x2, and the bias as a special node with value +1 which is multiplied with the bias weight b. (a) Logical AND, with weights w1 = 1 and w2 = 1 and bias weight b = -1. (b) Logical OR, with weights w1 = 1 and w2 = 1 and bias weight b = 0.]

Early in the history of neural networks it was realized that the power of neural networks, as with the real neurons that inspired them, comes from combining these units into larger networks.
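The Fig. 7.4 weights can be verified directly. A minimal sketch of a perceptron (Eq. 7.7) run over all four binary inputs with the AND and OR weights shown above:

    import numpy as np

    def perceptron(w, b, x):
        # Eq. 7.7: output 1 if w·x + b > 0, else 0 (no non-linear activation)
        return 1 if np.dot(w, x) + b > 0 else 0

    w_and, b_and = np.array([1, 1]), -1   # Fig. 7.4a: logical AND
    w_or,  b_or  = np.array([1, 1]),  0   # Fig. 7.4b: logical OR

    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        x = np.array([x1, x2])
        print(x1, x2, perceptron(w_and, b_and, x), perceptron(w_or, b_or, x))
    # reproduces the AND and OR columns of the truth tables above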
Not possible to capture XOR with perceptrons

[Figure: the input space for a) x1 AND x2, b) x1 OR x2, c) x1 XOR x2, plotted on axes x1 and x2; the XOR panel is marked with a "?" because no single perceptron decision boundary separates its positive and negative cases.]

[Figure: a feedforward network with input layer x1, x2, ..., xn0 plus a +1 bias node, hidden layer h1, h2, h3, ..., hn1 (weights W, bias b), and output layer y1, y2, ..., yn2.]
Binary Logistic Regression as a 1-layer Network
(we don't count the input layer in counting layers!)

[Figure: input layer x1 ... xn plus +1 (vector x), weights w1 ... wn (vector w) and bias b (scalar) feeding a single sigmoid output node.]
Multinomial Logistic Regression as a 1-layer Network

Fully connected single layer network:

    y = softmax(Wx + b)

[Figure: input layer of scalars x1 ... xn plus +1, fully connected through W (a matrix) and b (a vector) to an output layer of softmax nodes y1 ... yn; y is a vector.]
Reminder: softmax, a generalization of the sigmoid

The multinomial logistic classifier uses a generalization of the sigmoid, called the softmax function, to compute the probability p(y = c|x) of y being in each potential class c ∈ C. The softmax function takes a vector z = [z1, z2, ..., zk] of k arbitrary values and maps them to a probability distribution, with each value in the range (0, 1), and all the values summing to 1. Like the sigmoid, it is an exponential function. For a vector z of dimensionality k, the softmax is defined as:

    softmax(zi) = exp(zi) / Σ(j=1..k) exp(zj)        1 ≤ i ≤ k                    (5.30)

The softmax of an input vector z = [z1, z2, ..., zk] is thus a vector itself:

    softmax(z) = [ exp(z1) / Σ(i=1..k) exp(zi),  exp(z2) / Σ(i=1..k) exp(zi),  ...,  exp(zk) / Σ(i=1..k) exp(zi) ]

The denominator Σ(i=1..k) exp(zi) is used to normalize all the values into probabilities.

Example: given a vector

    z = [0.6, 1.1, -1.5, 1.2, 3.2, -1.1]

the resulting (rounded) softmax(z) is

    [0.055, 0.090, 0.006, 0.099, 0.74, 0.010]
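The softmax example is easy to reproduce. A minimal NumPy sketch (subtracting the max before exponentiating is a common numerical-stability habit, not part of the definition):

    import numpy as np

    def softmax(z):
        # Eq. 5.30: exponentiate each value and normalize so the results sum to 1
        exp_z = np.exp(z - np.max(z))
        return exp_z / exp_z.sum()

    z = np.array([0.6, 1.1, -1.5, 1.2, 3.2, -1.1])
    print(np.round(softmax(z), 3))   # matches the rounded values above up to rounding
    print(softmax(z).sum())          # 1.0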
Two-Layer Network with scalar output

Output layer (σ node):  y = σ(z),  z = Uh     (y is a scalar)
Hidden units (σ nodes; could be ReLU or tanh), connected to the output layer by U
Input layer x1 ... xn, +1 (vector), connected to the hidden layer by weights W and bias b
Two-Layer Network with scalar output (indexing the weights)

Output layer (σ node):  y = σ(z),  z = Uh     (y is a scalar)
Hidden unit j is connected to input unit i by the weight Wji; b is the bias vector
Input layer x1 ... xi ... xn, +1 (vector)
Two-Layer Network with softmax output

Output layer (softmax nodes):  y = softmax(z),  z = Uh     (y is a vector)
Hidden units (σ nodes; could be ReLU or tanh), connected to the output layer by U
Input layer x1 ... xn, +1 (vector), connected to the hidden layer by weights W and bias b
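A minimal sketch of the forward pass for these two-layer networks. All sizes and values below are arbitrary toy assumptions; the hidden activation is written as a sigmoid, but ReLU or tanh could be substituted as noted on the slides:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):
        exp_z = np.exp(z - np.max(z))
        return exp_z / exp_z.sum()

    rng = np.random.default_rng(0)
    n_in, n_hidden, n_out = 3, 4, 2             # toy sizes (assumptions)
    x = rng.normal(size=n_in)                   # input vector
    W = rng.normal(size=(n_hidden, n_in))       # input-to-hidden weights
    b = rng.normal(size=n_hidden)               # hidden-layer bias

    h = sigmoid(W @ x + b)                      # hidden layer

    u = rng.normal(size=n_hidden)               # hidden-to-output weights, scalar-output case
    y_scalar = sigmoid(u @ h)                   # y = σ(Uh): a scalar

    U = rng.normal(size=(n_out, n_hidden))      # hidden-to-output weights, softmax case
    y_vector = softmax(U @ h)                   # y = softmax(Uh): a probability vector
    print(y_scalar, y_vector)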
Multi-layer Notation

    y = a[2]
    a[2] = g[2](z[2])        g[2]: sigmoid or softmax
    a[1] = g[1](z[1])        g[1]: ReLU
    a[0]:  the input layer x1 ... xi ... xn, +1

[Figure: the layers stacked from the input a[0] through hidden units (indexed by j) computing a[1] to the output a[2] = y, with i indexing the input units.]
Replacing the bias unit

Let's switch to a notation without the bias unit. It's just a notational change:
1. Add a dummy node a0 = 1 to each layer
2. Its weight w0 will be the bias
3. So input layer a[0]0 = 1
   ◦ And a[1]0 = 1, a[2]0 = 1, ...

Instead of:  x = x1, x2, ..., xn0          We'll do this:  x = x0, x1, x2, ..., xn0
Instead of:  h = σ(Wx + b)                 We'll do this:  h = σ(Wx)

In other words, we add a dummy node a0 to each layer whose value will always be 1. Thus layer 0, the input layer, will have a dummy node a[0]0 = 1, layer 1 will have a[1]0 = 1, and so on. This dummy node still has an associated weight, and that weight represents the bias value b. For example, instead of an equation like

    h = σ(Wx + b)                                       (7.12)

we'll instead use:

    h = σ(Wx)                                           (7.13)

But now instead of our vector x having n0 values, x = x1, ..., xn0, it will have n0 + 1 values, with a new 0th dummy value x0 = 1: x = x0, ..., xn0. And instead of computing each hj as

    hj = σ( Σ(i=1..n0) Wji xi + bj )

we'll instead use:

    hj = σ( Σ(i=0..n0) Wji xi )                         (7.14)
Replacing the bias unit

[Figure: Instead of: a network with inputs x1, x2, ..., xn0 plus a separate +1 node, weights W and bias b into hidden units h1, h2, h3, ..., hn1, then U into outputs y1, y2, ..., yn2. We'll do this: the same network with inputs x0 = 1, x1, x2, ..., xn0 and weights W only (no separate bias).]
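A quick numerical check that folding the bias in is purely notational. The augmented matrix below (here called W_aug) is a hypothetical name for W with the bias prepended as column 0; all values are arbitrary:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(1)
    W = rng.normal(size=(3, 4))            # weights, n1 x n0
    b = rng.normal(size=3)                 # separate bias vector
    x = rng.normal(size=4)                 # input x = x1 ... xn0

    W_aug = np.hstack([b[:, None], W])     # the dummy node's weight column is the bias
    x_aug = np.concatenate([[1.0], x])     # x = x0, x1, ..., xn0 with x0 = 1

    h_with_bias = sigmoid(W @ x + b)       # h = σ(Wx + b), Eq. 7.12
    h_folded    = sigmoid(W_aug @ x_aug)   # h = σ(Wx),     Eq. 7.13
    print(np.allclose(h_with_bias, h_folded))   # True: same function, different notation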
Feedforward Neural Networks

Applying feedforward networks to NLP tasks

Use cases for feedforward networks

Let's consider 2 (simplified) sample tasks:
1. Text classification
2. Language modeling
Classification: Sentiment Analysis

We could do exactly what we did with logistic regression:
Input layer: binary features as before
Output layer: 0 or 1 (a single σ node), reached through a hidden layer (weights W into the hidden layer, U into the output)

[Figure: sentiment features f1, f2, ..., fn used as the input layer x1 ... xn, like the features x1 ... x6 representing a review document in Chapter 5's sample mini test document (Fig. 5.2).]

Just adding a hidden layer to logistic regression
Neural Net Classification with embeddings as input features!

[Figure: computing p(positive sentiment | "The dessert is ..."). An embedding matrix E maps each context word (embedding for word 534, embedding for word 23864, embedding for word 7) to its embedding; the concatenated embeddings form the projection layer (a 3d⨉1 vector), which feeds a hidden layer through W (dh⨉3d) and then the output layer (sigmoid node ŷ) through U.]
Neural Language Models (LMs)
Language Modeling: Calculating the probability of the
next word in a sequence given some history.
• We've seen N-gram based LMs
• But neural network LMs far outperform n-gram
language models
State-of-the-art neural LMs are based on more
powerful neural network technology like Transformers
But simple feedforward LMs can do almost as well!
Simple feedforward Neural Language Models
Neural Language Model
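The figure for this slide did not survive extraction, but the architecture it refers to (look up the embeddings of the previous words, concatenate them into a projection layer, pass that through a hidden layer, and take a softmax over the vocabulary) can be sketched as follows. The vocabulary size, dimensions, window of 3 context words, and word indices are illustrative assumptions only:

    import numpy as np

    def softmax(z):
        exp_z = np.exp(z - np.max(z))
        return exp_z / exp_z.sum()

    rng = np.random.default_rng(2)
    V, d, dh = 30000, 50, 100                   # vocab size, embedding dim, hidden dim (assumed)
    E = rng.normal(size=(V, d)) * 0.01          # embedding matrix: one row per vocabulary word
    W = rng.normal(size=(dh, 3 * d)) * 0.01     # projection-to-hidden weights (dh x 3d)
    b = np.zeros(dh)                            # hidden-layer bias
    U = rng.normal(size=(V, dh)) * 0.01         # hidden-to-output weights (|V| x dh)

    context = [534, 23864, 7]                   # indices of the 3 previous words (toy values)
    e = np.concatenate([E[i] for i in context]) # projection layer: concatenated embeddings (3d)
    h = np.maximum(W @ e + b, 0.0)              # hidden layer (ReLU here; sigmoid/tanh also fine)
    y = softmax(U @ h)                          # probability distribution over the next word
    print(y.shape, y.sum())                     # (30000,) 1.0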
Why Neural LMs work better than N-gram LMs
Training data:
We've seen: I have to make sure that the cat gets fed.
Never seen: dog gets fed
Test data:
I forgot to make sure that the dog gets ___
N-gram LM can't predict "fed"!
Neural LM can use similarity of "cat" and "dog"
embeddings to generalize and predict “fed” after dog
Training Neural Nets: Overview
Intuition: training a 2-layer Network

[Figure: a training instance x1 ... xn flows through weights W and U in a forward pass to produce the system output ŷ; the loss function L(ŷ, y) compares it to the actual answer y, and a backward pass sends the error information back through U and W.]
Intuition: Training a 2-layer network

For every training tuple (x, y):
◦ Run forward computation to find our estimate ŷ
◦ Run backward computation to update weights:
  ◦ For every output node
    ◦ Compute loss L between true y and the estimated ŷ
    ◦ For every weight w from hidden layer to the output layer
      ◦ Update the weight
  ◦ For every hidden node
    ◦ Assess how much blame it deserves for the current answer
    ◦ For every weight w from input layer to the hidden layer
      ◦ Update the weight
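Here is a minimal end-to-end sketch of that procedure for a 2-layer network with a sigmoid hidden layer, a sigmoid output, and cross-entropy loss, using hand-derived gradients on toy data. Everything below (data, sizes, learning rate) is an assumption for illustration; a real implementation would normally rely on a library's automatic differentiation:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(3)
    X = rng.normal(size=(20, 4))                  # 20 toy training instances
    Y = (X[:, 0] + X[:, 1] > 0).astype(float)     # toy binary labels
    n_hidden, lr = 8, 0.5

    W = rng.normal(size=(n_hidden, 4)) * 0.1      # input-to-hidden weights
    b = np.zeros(n_hidden)                        # hidden bias
    u = rng.normal(size=n_hidden) * 0.1           # hidden-to-output weights

    for epoch in range(200):
        for x, y in zip(X, Y):
            # forward pass: compute the estimate ŷ
            h = sigmoid(W @ x + b)                # hidden layer
            y_hat = sigmoid(u @ h)                # system output

            # backward pass: gradients of the cross-entropy loss
            dz2 = y_hat - y                       # error at the output node
            du = dz2 * h                          # hidden-to-output weight gradients
            dh = dz2 * u                          # blame assigned to each hidden node
            dz1 = dh * h * (1 - h)                # back through the hidden sigmoid
            dW = np.outer(dz1, x)                 # input-to-hidden weight gradients
            db = dz1

            # update every weight in the opposite direction of its gradient
            u -= lr * du
            W -= lr * dW
            b -= lr * db

    preds = np.array([sigmoid(u @ sigmoid(W @ x + b)) for x in X])
    print(np.mean((preds > 0.5) == (Y == 1)))     # typically near 1.0 on this toy data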
Reminder: Loss Function for binary logistic regression

A measure of how far off the current answer is from the right answer: the cross-entropy loss for logistic regression. Starting from the log of the probability:

    log p(y|x) = log [ ŷ^y (1 - ŷ)^(1-y) ]
               = y log ŷ + (1 - y) log(1 - ŷ)                                     (5.10)

Eq. 5.10 describes a log likelihood that should be maximized. In order to turn this into a loss function (something that we need to minimize), we'll just flip the sign on Eq. 5.10. The result is the cross-entropy loss LCE:

    LCE(ŷ, y) = -log p(y|x) = -[y log ŷ + (1 - y) log(1 - ŷ)]                     (5.11)

Finally, we can plug in the definition of ŷ = σ(w·x + b):

    LCE(ŷ, y) = -[y log σ(w·x + b) + (1 - y) log(1 - σ(w·x + b))]                 (5.12)

Let's see if this loss function does the right thing for our example from Fig. 5.2. We want the loss to be smaller if the model's estimate is close to correct, and bigger if the model is confused.
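Eq. 5.12 is a one-liner in code; a small sketch showing that the loss is small when the estimate leans the right way and large when the model is confidently wrong (reusing the example unit from earlier in this section):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cross_entropy_loss(y_hat, y):
        # Eq. 5.11: L_CE(ŷ, y) = -[y log ŷ + (1 - y) log(1 - ŷ)]
        return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

    w, b = np.array([0.2, 0.3, 0.9]), 0.5      # the example unit from earlier
    x = np.array([0.5, 0.6, 0.1])
    y_hat = sigmoid(w @ x + b)                 # about 0.70

    print(cross_entropy_loss(y_hat, 1))        # ~0.35: small, the estimate leans toward the true label
    print(cross_entropy_loss(y_hat, 0))        # ~1.22: bigger, the model is confused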
Reminder: gradient descent for weight updates

Use the derivative of the loss function with respect to the weights, (∂/∂w) L(f(x; w), y), to tell us how to adjust the weights for each training item:
◦ Move them in the opposite direction of the gradient:

    w(t+1) = w(t) - η (d/dw) L(f(x; w), y)

◦ For logistic regression, where the cross-entropy loss function is

    LCE(ŷ, y) = -[y log σ(w·x + b) + (1 - y) log(1 - σ(w·x + b))]                 (5.17)

it turns out that the derivative of this function for one observation vector x is (the interested reader can see Section 5.8 for the derivation of this equation):

    ∂LCE(ŷ, y) / ∂wj = [σ(w·x + b) - y] xj                                        (5.18)

Where did that derivative come from? Recall the chain rule: the derivative of a composite function f(x) = u(v(x)) is the derivative of u(x) with respect to v(x) times the derivative of v(x) with respect to x. Computation graphs, next, give us a systematic way to apply it.
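Eq. 5.18 plugs directly into the update rule on this slide. A minimal sketch of repeated stochastic gradient descent steps for logistic regression on a single toy training item:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sgd_step(w, b, x, y, eta=0.1):
        # Eq. 5.18: ∂L/∂wj = (σ(w·x + b) - y) xj; the bias gets the same scalar error term
        err = sigmoid(w @ x + b) - y
        return w - eta * err * x, b - eta * err   # move opposite the gradient

    w, b = np.zeros(3), 0.0
    x, y = np.array([0.5, 0.6, 0.1]), 1.0         # one toy training item
    for _ in range(100):
        w, b = sgd_step(w, b, x, y)
    print(sigmoid(w @ x + b))                     # moves toward the true label 1.0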
Computation Graphs

Example computations:

    d = 2b
    e = a + d
    L = c·e

[Figure: the computation graph, with input nodes a, b, c (e.g. a = 3), intermediate nodes d and e, and output L.]

Backwards differentiation in computation graphs

    L = c·e :    ∂L/∂e = c,   ∂L/∂c = e
    e = a + d :  ∂e/∂a = 1,   ∂e/∂d = 1
    d = 2b :     ∂d/∂b = 2

    ∂L/∂a = (∂L/∂e)(∂e/∂a)        ∂L/∂b = (∂L/∂e)(∂e/∂d)(∂d/∂b)                  (7.26)

Eq. 7.26 thus requires five intermediate derivatives: ∂L/∂e, ∂L/∂c, ∂e/∂a, ∂e/∂d, and ∂d/∂b.
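The chain-rule products in Eq. 7.26 can be checked numerically. A short sketch using the graph's computations d = 2b, e = a + d, L = c·e (a = 3 as in the figure; b and c are arbitrary choices here):

    # forward pass through the computation graph
    a, b, c = 3.0, 1.0, -2.0        # a = 3 from the figure; b and c are arbitrary
    d = 2 * b
    e = a + d
    L = c * e

    # local derivatives, combined with the chain rule (Eq. 7.26)
    dL_de, dL_dc = c, e             # from L = c*e
    de_da, de_dd = 1.0, 1.0         # from e = a + d
    dd_db = 2.0                     # from d = 2b

    dL_da = dL_de * de_da           # = c
    dL_db = dL_de * de_dd * dd_db   # = 2c

    # check by finite differences
    eps = 1e-6
    print(dL_da, (c * ((a + eps) + 2 * b) - L) / eps)   # both about -2.0
    print(dL_db, (c * (a + 2 * (b + eps)) - L) / eps)   # both about -4.0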
Backward differentiation on a two layer network

Fig. 7.12 shows a sample computation graph for a 2-layer neural network with n0 = 2, n1 = 2, and n2 = 1, assuming binary classification and hence using a sigmoid output unit for simplicity. The function that the computation graph is computing is:

    z[1] = W[1] x + b[1]
    a[1] = ReLU(z[1])
    z[2] = W[2] a[1] + b[2]
    a[2] = σ(z[2])
    ŷ = a[2]                                            (7.27)

[Figure: the computation graph for this network, with inputs x1 and x2, parameters W[1], b[1] (ReLU activation) and W[2], b[2] (sigmoid activation), and output ŷ.]

The weights that need updating (those for which we need to know the partial derivative of the loss function) are W[1], b[1], W[2], and b[2]. We'll also need the derivatives of the other activation functions. The derivative of tanh is:

    d tanh(z)/dz = 1 - tanh^2(z)

The derivative of the ReLU is:

    d ReLU(z)/dz = 0 for z < 0;  1 for z ≥ 0
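A minimal sketch of Eq. 7.27's forward pass together with the two activation-function derivatives above, plus a finite-difference spot check (all parameter values are arbitrary toy numbers):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def relu(z):
        return np.maximum(z, 0.0)

    def d_tanh(z):
        return 1.0 - np.tanh(z) ** 2        # d tanh(z)/dz = 1 - tanh^2(z)

    def d_relu(z):
        return np.where(z < 0, 0.0, 1.0)    # 0 for z < 0, 1 for z >= 0

    rng = np.random.default_rng(4)
    x = rng.normal(size=2)                                   # n0 = 2 inputs
    W1, b1 = rng.normal(size=(2, 2)), rng.normal(size=2)     # n1 = 2 hidden units
    W2, b2 = rng.normal(size=(1, 2)), rng.normal(size=1)     # n2 = 1 output unit

    z1 = W1 @ x + b1                        # Eq. 7.27, line by line
    a1 = relu(z1)
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)
    y_hat = a2
    print(y_hat)

    # spot-check the derivative formulas by finite differences at z = 0.7
    z, eps = 0.7, 1e-6
    print(d_tanh(z), (np.tanh(z + eps) - np.tanh(z)) / eps)   # both about 0.635
    print(d_relu(z), (relu(z + eps) - relu(z)) / eps)         # both 1.0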
Computation Graphs and Backward Differentiation