Lecture 4. Neural Networks and Neural Language Models

Simple Neural
Networks and Units in Neural Networks

Neural
Language
Models
This is in your brain
By BruceBlaus - Own work, CC BY 3.0,

https://commons.wikimedia.org/w/index.php?curid=28761830 2
Neural Network Unit
This is not in your brain
Output value y
a
Non-linear transform σ
z
Weighted sum ∑
Weights w1 w2 w3 b
bias
Input layer x1 x2 x3 +1
3
dilding
as: blockthe
l replace of sum
a neural
withnetwork dot product:
is a single computational
the convenient unit. A unit ta
X
f real valued numbers asz input,
= b + performs
wi xi some computation on them,
es an output.
z = wi · x + b
its heart, a neural unit is taking a weighted sum of its inputs, with one a
erm in
more theNeural
convenient tounit
sum called a biasthis
express term. Given sum
weighted a set using
of inputs x1 ...x
vector n , a unit
notation;
frin Eq. 7.2,
algebra thatz is
corresponding just a w
aweights
vector real atvalued
is,1 ...w nheart,
number.
and ajust
biasab,list
so or
thearray
weighted sum z ca
of numbers.
,about
instead
nted as: of
Take using
weightedz, a
sum linear
z in terms of a weight vector of function
inputs, plus a of
biasx, as the output, n
Xw, a scalar bias b, and an input v
lln-linear
replace function
the sum with f tothez.
z= We
b + willwirefer
convenient xdot
i
to the output of this (f
product:
ion value for the unit, a. Since i we are just modeling a singl
or more
t’s the node is in fact
convenient thezfinal
to express w ·output
=this x + b ofsum
weighted the using
network,
vectorwhich we’l
notation; re
he
nearvalue y isthat
algebra defined as:is, at heart, just a list or array of numbers. T
a vector
Instead of just using z, we'll apply a nonlinear activation
in about
alk Eq. 7.2,inzterms
is just
zfunction f:ofaareal valued
weight number.
vector w, a scalar bias b, and an input ve
we’ll replace
, instead of the sumz,with
using the = f (z)dot
y =convenient
a linearafunction of product:
x, as the output, neural
n-linear function f to z. We will refer to the output of this functi
tion value for the unit, a. zSince
= w · xwe
+ bare just modeling a single uni(
uss three popular non-linear functions f () below (the sigmoid
for the node
ctified is in
linear fact thebut
ReLU) finalit’s
output of the network,
pedagogically which we’ll
convenient to gen
sta
ned in Eq. 7.2, z is just a real valued number.
he value
nction
ally, y isofdefined
since
instead we sawas:
using z,ita in Chapter
linear 5: of x, as the output, neural u
function
z is just a real valued number.
using z, a linear function of x, as the output, neural units
ction f to z. We will refer to the output of this function as
or the unit, a. Since we are just modeling a single unit, the
is in fact the final output of the network, which we’ll generally
s defined as: Non-Linear Activation Functions
y = a = f (z)
We're already seen the sigmoid for logistic regression:
pular non-linear functions f () below (the sigmoid, the tanh,
r ReLU) but it’s pedagogically convenient to start with the
e we saw it in Chapter 5:
Sigmoid
1
y = s (z) = z
(7.3)
1+e
wn in Fig. 7.1) has a number of advantages; it maps the output
which is useful in squashing outliers toward 0 or 1. And it’s
s we saw in Section ?? will be handy for learning.
5
Final function the unit is computing
7.1 • U NITS
Substituting Eq. 7.2 into Eq. 7.3 gives us the output of a neural unit:
1
y = s (w · x + b) = (7
1 + exp( (w · x + b))
Fig. 7.2 shows a final schematic of a basic neural unit. In this example the u
s 3 input values x1 , x2 , and x3 , and computes a weighted sum, multiplying ea
e by a weight (w1 , w2 , and w3 , respectively), adds them to a bias term b, and th
es the resulting sum through a sigmoid function to result in a number betwee
1.
y
Final unit again
Output value y
a
Non-linear activation function σ

z
Weighted sum ∑
Weights w1 w2 w3 b
bias
Input layer x1 x2 x3 +1
7
Let’s walk through an example just to get an intuition. Let’s su
unit with the following weight vector and bias:
An example w = [0.2, 0.3, 0.9]

b = 0.5
Suppose a unit has:
What
w would this unit do with the following input vector:
= [0.2,0.3,0.9]
b = 0.5 x = [0.5, 0.6, 0.1]
What happens with input x:
The resulting output y would be:
x = [0.5,0.6,0.1]
1 1
y = s (w · x + b) = = =
1+e (w·x+b) 1+e (.5⇤.2+.6⇤.3+.1⇤.9+.5) 1+
In practice, the sigmoid is not commonly used as an activation func
tanh that is very similar but almost always better is the tanh function sh
tanh is a variant of the sigmoid that ranges from -1 to +1:
ez e z
Let’s
Let’swalk
walkthrough
throughan
anexample
examplejust
justto
toget
getan
anintuition.
intuition. Let’s
Let’ssu
su
unitwith
unit withthe
thefollowing
followingweight
weightvector
vectorand
andbias:
bias:
An example ww =
= 0.3,0.9]
[0.2,0.3,
[0.2, 0.9]
Suppose a unit has: bb =
= 0.5
0.5
What w =this
Whatwould
would [0.2,0.3,0.9]
this unitdo
unit dowith
withthe
thefollowing
followinginput
inputvector:
vector:
b = 0.5
xx = 0.6,0.1]
[0.5,0.6,
= [0.5, 0.1]
What happens with the following input x?
Theresulting
The resulting outputyywould
x = output wouldbe:
[0.5,0.6,0.1] be:
=ss(w 11 11
yy= (w··xx+
+b)b)== == ==
11++ee (w·x+b) 11++ee (.5⇤.2+.6⇤.3+.1⇤.9+.5) 11+
(w·x+b) (.5⇤.2+.6⇤.3+.1⇤.9+.5)
Inpractice,
In practice,thethesigmoid
sigmoidisisnot
notcommonly
commonlyused usedas asan
anactivation
activationfunc
fun
tanh
tanh thatisisvery
that verysimilar
similarbut
butalmost
almostalways
alwaysbetter
betterisisthe tanhfunction
thetanh functionshsh
tanhisisaavariant
tanh variantofofthe
thesigmoid
sigmoidthat
thatranges
rangesfrom
from-1 -1to
to+1:
+1:
z z
e ee
e
z z
mped at +1) and producing an output y. We include some convenient
the output of the summation, z, and the output of the sigmoid, a. In
he unit y is theLet’s
same walk
Let’s walk
as through
through
a, but an
in deeper example
annetworks
example just
just
we’ll to
toget
reserveget an
y toanintuition.
intuition. Let’s
Let’ssu
su
unit
the entireunit withleaving
with
network, thefollowing
the following weight
weight
a as the activation vector
ofvector andbias:
and
an individual bias:
node.
An
h an example justexample ww == [0.2,
to get an intuition. Let’s suppose [0.2,
we 0.3,
0.3,
have 0.9]
a 0.9]
g weight vector and bias: bb = 0.5
= 0.5
Suppose a unit has:
w = would
What [0.2, 0.3, 0.9]
this unitdo
dowith
withthe
thefollowing
followinginput
inputvector:
vector:
What wouldw = this unit
[0.2,0.3,0.9]
b = 0.5
b = 0.5 xx = 0.6,0.1]
[0.5,0.6,
= [0.5, 0.1]
do with the following input vector:
What happens with input x:
Theresulting
The resultingoutput
outputyywould
wouldbe:
be:
x = [0.5,
x =0.6,[0.5,0.6,0.1]
0.1]
11 11
yy=
y would be: =ss(w(w··xx++b)b)= = == =
=
11+
+ee (w·x+b) 11+
+ee (.5⇤.2+.6⇤.3+.1⇤.9+.5) 11+
(w·x+b) (.5⇤.2+.6⇤.3+.1⇤.9+.5)
1 1 1
+e In=practice,(.5⇤.2+.6⇤.3+.1⇤.9+.5)
In practice,
(w·x+b) the sigmoid is not
= commonly
1 + e the sigmoid is not commonly1+e 0.87
= used
.70 as an activation fun
used as an activation func
tanh
tanh thatisisvery
that verysimilar
similarbut
butalmost
almostalways
alwaysbetter
thetanh functionsh
sh
id is not commonly
tanh is a used as of
variant an the
activation
sigmoidfunction.
that A function
ranges from-1 -1to
to+1:
+1:
tanh is a variant of the sigmoid that ranges from
t almost always better is the tanh function shown in Fig. 7.3a;
e sigmoid that ranges from -1 to +1: z z
e ee
e
z z
he unit
e unit
mped yisisand
at y+1) theproducing
the sameasasa,ana,but
same butinindeeper
output y.deeper networks
networks
We include we’ll
somewe’ll reserve
reserve
convenient y to
y to
theentire
he
the entirenetwork,
output ofnetwork, leavingz,
leaving
the summation, a aasasthe
and theactivation
the activation
output ofof
of the anan individual
individual
sigmoid, node.
a. In node.
he unit y is theLet’s
same walk
Let’s walk
as through
a, butthrough
in deeper an example
annetworks
example just
just
we’ll to
toget
reserve get an
y toanintuition.
intuition. Let’s
Let’ssu
su
unit
the entireunit with
with
network, thefollowing
the
leaving following weight
weight
a as the activation vector
ofvector andbias:
and
an individual bias:
node.
han
anexample
examplejust
justtotoget
getananintuition.
intuition.Let’s
Let’ssuppose
supposewewe have
have a a
ghgweight
weight
an vector
vector
An
example and
and
to getbias:
bias:
justexample ww = = we
an intuition. Let’s suppose [0.2,
[0.2, 0.3,
0.3,
have 0.9]
a 0.9]
g weight vector and bias: bb = 0.5
= 0.5
wwSuppose
== [0.2,a0.3,
[0.2, unit has:
0.3,0.9]
0.9]
wbb=
What = [0.2,
0.5
w 0.5
would =0.3, 0.9]
[0.2,0.3,0.9]
this unitdo
dowith
withthe
thefollowing
followinginput
inputvector:
vector:
What would this unit
=
b = 0.5
b = 0.5
do dowith
withthethefollowing
followinginput inputvector:
vector: xx = 0.6,0.1]
[0.5,0.6,
= [0.5, 0.1]
What happens
do with the following with input x:
input vector:
The
The =resulting
xx resulting
= [0.5,
x [0.5,
= output
0.6,
0.6,
output0.1]
0.1]
yywould
[0.5,0.6,0.1] wouldbe: be:
x = [0.5, 0.6, 0.1]
yywould
would be:
be: 11 1 1
would be:yy==ss(w (w··xx+ +b) b)= = == ==
11++ee (w·x+b) 11+ +ee (.5⇤.2+.6⇤.3+.1⇤.9+.5) 11+
(w·x+b) (.5⇤.2+.6⇤.3+.1⇤.9+.5)
111 1 11 1 11
In =
=practice,
= the sigmoid is not == 0.87 =0.87
= commonly .70= =
used .70
as an
.70 activation fun
+++eee In
(w·x+b)
(w·x+b)
(w·x+b) practice,
11+1++ee e the sigmoid is
(.5⇤.2+.6⇤.3+.1⇤.9+.5)
(.5⇤.2+.6⇤.3+.1⇤.9+.5) not
(.5⇤.2+.6⇤.3+.1⇤.9+.5) commonly
1 + e1 +1+ ee used
0.87 as an activation func
tanh
tanh thatisisvery
that verysimilar
similarbut butalmost
almostalways
alwaysbetter
thetanh functionshsh
idisis
did not
isnot commonly
commonly
tanh
nottanh is
commonly a used
used
variant
usedasasan
of
as activation
an
the
an sigmoidfunction.
activation
activation that A function
function.
ranges
function. Afrom
function
from
A -1to
to+1:
function +1:
talmost
almost always is a
always bettervariant of the
tanh
isisthethe sigmoid
function that
shown ranges
in Fig. 7.3a; -1
te sigmoid
almost always better
better is the tanh
tanh function
function shown
shown in
z
Fig.
in 7.3a;
Fig.
z 7.3a;
that ranges from -1 to +1: ez ee z
sigmoid that ranges from
e sigmoid that ranges from -1 to +1: -1 to +1: e
would be:
ez e z
1 1 1 y= z z
= = e +e
= .70
+e (w·x+b) 1+e (.5⇤.2+.6⇤.3+.1⇤.9+.5) 1+e 0.87
The simplest activation function, and perhaps the most commonl
ReLU
Non-Linear
d is not commonly
tified linear Activation
used as an
unit, also calledFunctions
activation function.
the besides
A function
ReLU, shown sigmoid
in Fig. 7.3b. It’s
almost always better is the tanh function shown in Fig. 7.3a;
when
sigmoid that z isfrom
ranges positive, and 0 otherwise:
-1 to +1: Most Common:
ez e z
y = max(z,
y= z z
(7.5) 0)
e +e
n function, and perhaps the most commonly used, is the rec-
called the ReLU, shown in Fig. 7.3b. It’s just the same as x
d 0 otherwise:
y = max(x, 0) (7.6)
tanh ReLU
Rectified Linear Unit
12
Simple Neural
Networks and Units in Neural Networks
Neural
Language
Models
Sal -> i
xesigmoidy
- 7
-> *B ↓
exp
linearly separable.
A
expr 0?
8
0 O
%
x
y
=
00
t ++
+
it ++t
+
Salary
exclusive Or.
Simple Neural
Networks and The XOR problem
Neural
Language
Models
Early in the history of neural networks it was realized that the power of neural net-
works, as with the real neurons that inspired them, comes from combining these
units into larger networks.
XORy= GNANDyAMDGORG
The XOR problem
One of the most clever demonstrations of the need for multi-layer networks was
the proof by Minsky and Papert (1969) that aMinsky
single neural unit cannot compute
-
some very simple functions of its input. Consider
and Papert (1969)
the task of computing elementary
Can neural
logical functions units compute
of two inputs, simple
like AND, OR, and XOR.functions of input?
As a reminder, here are
the truth tables for those functions:
-
AND NAND OR XOR
x1 x2 y x1 x2 y x1 x2 y ⑳ ⑧
0 0
0 0
0 1
0
0
①-
0
0
0
1
0
1
-
0 0
0 1
0
1
-
-
1 0 0 1 0 1
-
1 0 1
-
1 1 1 1 1 1 1 1 0
12
-
8 ⑧ >
1
This-8.
example was ⑧first shown for the⑧ perceptron, ⑧which is a very simple
Xor neural
unit that has a binary output and does not have a non-linear activation function. The
.. ⑧ S
Anna
1
x1 ⑧
Or
.
>,
Perceptrons 7.2 • T HE XOR
A very simple neural unit
eptron is• 0Binary
or 1,output
and is computed
(0 or 1) as follows (using th
as b as in Eq.non-linear
• No 7.2): activation function
⇢
0, if w · x + b  0
y=
1, if w · x + b > 0
to build a perceptron that can compute the logical
nary inputs; Fig. 7.4 shows the necessary weights.
⇢⇢
0,0,ififww· x· x++b b0 0
.2 The XOR
7.2y7.2yproblem
==The XOR problem
1,HE
• T ififww
1,XOR · x· x++b b>5>0 0
PROBLEM
(7
ytron
ryeasy
is 0 orto
easy 1,tobuild
andbuild a athe
perceptron
is computed
Early in perceptron that
as follows (using
history of Earlythat can
can
the same
neural compute
compute
weight
networks it was the
thelogical
realized logical
that the ANDAND
power and
of and
neur
in the history of neural networks it was realized
b as in Eq. 7.2):
fofitsitsbinary
binary⇢
Easy
inputs;
works, to
inputs; build
Fig.
as withFig. 7.4
the AND
7.4shows
real showsor
neurons
works, as
OR
the
the
that
with
with
necessary
necessary
inspired
the real
perceptrons
weights.
them,weights.
neuronscomes from combining
that inspired them, c
units
0, if winto
· x +larger
b  0 networks.
y= (7.7)
1, ifOne
w·x+ b 0
of the most clever demonstrations of the need for multi-layer network
>
One of the most clever demonstrations of the need
x the that
proof
can by Minsky and Papert (1969)
x x thatand
a single neural unit cannot co
x
o build a perceptron
ary inputs; Fig.1 7.4
1some very
shows
compute
the simple
the logical ANDby
the proof and OR
Minsky
weights. of its input. 1Consider
necessaryfunctions 1
Papert
the
(1969)
task
that a
of computing
single
elemta
some very simple
1 1functions of two inputs, like AND, OR, functions
xx
1 11
of its input. Consider the
logical and XOR.
logical functions of two inputs, As a reminder,
like AND, OR, and Xhe
x1 x2x2the truth
1 1 tables x1 for those thefunctions:
truth tables for xx2 1
x2xthose 1
1 1 2 functions:
x2 1 -1-1 x2 1 AND OR AND00 XOR OR
5
-1 +1+1 0 x1 x2 y +1
x1 +1x1 x2 y x1 x2x1y x2 y
x2 y x
1 +1 0 0 0 0 0 0 00 0 0 0 0 00 0 0
(a) (a)(a) (b)0 1 0 0 1 0 1 1(b) (b)
0 0 1 0 11 1 0
The
andThe
x weights
andweights
the bias as
AND
ights w and bias b for perceptrons
w
a wandand
special
for computing
bias
bias
node b
with bforfor
value +1
0 0functions. 1The 0 1 1OR
1 logical
perceptrons
perceptrons
which is for
for
multiplied
0 0
computing
1 0 1 10 1
computinglogical functions.
logical 1
functions
1 2 1 1 1 1 1 1 11 1 1 1 1 01 1 1
own
shown asasxAND,
. (a) logical
x
1 and
1 and x x
showing
2 and
2 and the
weights w1bias
the bias
= 1 andasw2a=special
as a special node
node
1 and bias weight with
with value
value +1
+1 which
which isismultip
multi
OR, showing weights w1 = 1 and w2 = 1 and bias weight b = 0. These
as perceptron
t weight
from anb.
weight b.(a)(a) This
logicalexample
logical AND,
AND, wasofshowing
first
setsshowing
shown for thewas
weights
biases thatw
weights perceptron,
w 1 and
1 andwhich
w w is 1
1 aperceptron,
very
and
and simple
bias
biasweww
one infinite number ofperceptron
possible This
weights example
and 1=1first
= shown 2 =
for
2 the
=
unit that has a binary output andhas
does not have a non-linear activation functio
b)functions.
logical
logicalOR, OR,showing
showingweights
weightsw unit
w11= =1
that 1and a
and w
binary
w
22= =1 and
output
1 and bias
and does
bias not
weight
weight b
haveb
= a=0.0.ThT
non-li
es are
ases arejust
justone
onefrom fromananinfinite
infinitenumber
numberofofpossible
possiblesets setsofofweights
weightsand andbiases
biase
⇢⇢
0,0,ififww· x· x++b b0 0
.2 The XOR
7.2y7.2yproblem
==The XOR problem
1,HE
• T ififww
1,XOR · x· x++b b>5>0 0
PROBLEM
(7
ytron
ryeasy
is 0 orto
easy 1,tobuild
andbuild a athe
perceptron
is computed
as follows (using
can
the same
neural compute
compute
weight
networks it was the
thelogical
realized logical
that the ANDAND
power and
of and
neur
b as in Eq. 7.2):
fofitsitsbinary
binary⇢
Easy
inputs;
works, to
inputs; build
Fig.
as withFig. 7.4
the AND
7.4shows
real showsor
neurons
works, as
OR
the
the
that
with
with
necessary
necessary
inspired
the real
perceptrons
weights.
them,weights.
neurons
·
comes from combining
units
0, if winto
· x +larger
b  0 networks. NAND
y= units into larger networks.
(7.7)
1, ifOne
w·x+ b a
> 0
of the most clever demonstrations
(1) of the need for multi-layer network
1.(1)One1
+
of the most clever demonstrations of the need

-
x the that
proof
can by Minsky and Papert (1969)x xthatand
x
1some
shows very
compute
the simple
the logical
the
1.1
+weights.
necessaryfunctions
ANDby
proof
=
-
of its
and
1
OR
Minsky
input. 1Consider
1
Papert
the
(1969)
task
that a
of computing
single
elemta
some very simple
1 1functions of two inputs, like AND, OR,1 and functions of its input. Consider
1 XOR. As a reminder, he the
logical
-
logical functions of two inputs, like AND, OR, and X

x1 x2x2the truth
-
1 1 tables x1 for those thefunctions:

truth tables for x2xthose 1functions:
1
1 1 2
x2 1 -1-1
I x2 1 AND NANDOR 0
AND0 XOR OR
-1 +1+1 0 x1 x2 y +1
x1 +1 x2x1y x2 y x1 x2x1y x2 y x
1 +1 0 0 0 10 0 0
0 0 0 0 0 0 00 0 0
(a)(a) 0 1 0 1 1(b) (b)
- -
I
(a) -
(b)0 1 0 0 0 1 0 11 1 0
The
andThe
x and
Not
ights w and bias
weights
weights
the bias as
AND
b for perceptrons
w
a w and
and
special bias
bias
node -
for computing
b
with bforfor
value
1 logical
perceptrons
+1perceptrons
which is for
for
multiplied
0
computing 0
computing
1 0 1 10 1
logical
logicalfunctions.
functions 1
1 2 1 1 1 1 1 1 11 1 1 1 1 01 1 1
own
shown asasxAND,
. (a) logical
x
1 and
1 and x x
showing
2 and
2 and the
weights w1bias
the bias
as a special node
node
with value
value +1
+1 which
which isismultip
multi
as perceptron
t weight
from anb.
logicalexample
logical AND,
AND, wasofshowing
first
setsshowing
shown for thewas
weights
biases thatw
weights perceptron,
w 1 and
1 andwhich
w w is 1
1 aperceptron,
very
and
and simple
bias
biasweww
possible This
weights example
and 1=1first
= shown 2 =
for
2 the
=
b)functions.
logical
showingweights
weightsw unit
w11= =1
that 1and a
and w
binary
w
22= =1 and
output
1 and bias
and does
bias not
weight
weight b
haveb
= a=0.0.ThT
non-li
es are
ases arejust
justone
infinitenumber
numberofofpossible
possiblesets
-
setsofofweights
biase
⇢⇢
0,0,ififww· x· x++b b0 0
.2 The XOR
7.2y7.2yproblem
==The XOR problem
1,HE
• T ififww
1,XOR · x· x++b b>5>0 0
PROBLEM
(7
ytron
ryeasy
is 0 orto
easy 1,tobuild
andbuild a athe
perceptron
is computed
as follows (using
can
the same
neural compute
compute
weight
networks it was the
thelogical
realized logical
that the ANDAND
power and
of and
neur
b as in Eq. 7.2):
fofitsitsbinary
binary⇢
Easy
inputs;
works, to
inputs; build
Fig.
as withFig. 7.4
the AND
7.4shows
real showsor
neurons
works, as
OR
the
the
that
with
with
necessary
necessary
inspired
the real
perceptrons
weights.
them,weights.
neuronscomes from combining
units
0, if winto
· x +larger
b  0 networks.
y= (7.7)
1, ifOne
w·x+ b 0
of the most clever demonstrations of the need for multi-layer network
>
One of the most clever demonstrations of the need
I
x the that
proof
can by Minsky and Papert (1969)
x x thatand
x
1some very
shows
compute
the simple
the logical ANDby
the proof and OR
Minsky
weights. of its input. 1Consider
necessaryfunctions 1
Papert
the
(1969)
task
that a
of computing
single
elemta
some very simple functions
1 1functions of two inputs, like AND, OR,1 and of its input. Consider
1 XOR. As a reminder, he the
8
logical logical functions of two inputs, like AND, OR, and X
x1 x2x2the truth
1 1 tables x1 for those thefunctions: x2xthose
truth tables for 1functions:
1
1 1 2
x2 1 -1-1 x2 1 AND OR AND00 XOR OR
-1 +1+1 0 x1 x2 y +1
x1 +1x2x1y x2 y x1 x2x1y x2 y x
1 +1 0 0 0 0 0 0 00 0 0 0 0 00 0 0
(a) (a)(a) (b)0 1 0 0 1 0 1 1(b) (b)
0 0 1 0 11 1 0
The
andThe
x weights
andweights
the bias as
AND
ights w and bias b for perceptrons
w
a wandand
special
for computing
bias
bias
node b
with bforfor
value +1
1 logical
perceptrons
perceptrons
which is for
for
multiplied
0 0
computing
1 0 1 10 1
computinglogical
logicalfunctions.1
functions
1 2 1 1 1 1 1 1 11 1 1 1 1 01 1 1
own
shown asasxAND,
. (a) logical
x
1 and
1 and x x
showing
2 and
2 and the
weights w1bias
the bias
as a special node
node
with value
value +1
+1 which
which isismultip
multi
as perceptron
t weight
from anb.
logicalexample
logical AND,
AND, wasofshowing
first
setsshowing
shown for thewas
weights
biases thatw
weights perceptron,
w 1 and
1 andwhich
w w is 1
1 aperceptron,
very
and
and simple
bias
biasweww
possible This
weights example
and 1=1first
= shown 2 =
for
2 the
=
b)functions.
logical
showingweights
weightsw unit
w11= =1
that 1and a
and w
binary
w
22= =1 and
output
1 and bias
and does
bias not
weight
weight b
haveb
= a=0.0.ThT
non-li
es are
ases arejust
justone
infinitenumber
numberofofpossible
possiblesets setsofofweights
biase
Not possible to capture XOR with perceptrons
Pause the lecture and try for yourself!

Why? Perceptrons are linear classifiers
Perceptron equation given x1 and x2, is the equation of a line
w1x1 + w2x2 + b = 0
(in standard linear format: x2 = (−w1/w2)x1 + (−b/w2) )
This line acts as a decision boundary

• 0 if input is on one side of the line
d.
• 1 if on the other side of the line
a xORb (a =
NAYDb) AND (a OR b).
a XOR b,
I
e
-
Decision boundaries 2 d I
-
Yo
a
6 A.
x2 x2 x2
1 1 1
?
0 x1 0 x1 0 x1
0 1 0 1 0 1
a) x1 AND x2 b) x1 OR x2 c) x1 XOR x2
XOR is not a linearly separable function!

was realized that the power of neural net-
red them, comes from combining these
Solution to the XOR problem
of the need for multi-layer networks was
hat a single neural unit cannot compute
nsider the task
XORof computing
can't be elementary
calculated by a single perceptron
OR, and XOR. As a reminder, here are
XOR can be calculated by a layered network of units.
a XOR b,
*
NAND ReLU
OR. y1
XOR
d. 1 -2
M
0
·
2 y x1 x2 y C
-
0 0 0 0 I
ReLU
O
h1 h2 +1 /
1 0 1 1 1 I d I
1 1 0 1 1 1 1
1 1 1 0 ! 1
x1
1 1
x2
0 -1
1
d +1
erceptron, which
⑧ is a very⑧simple neural
ave a non-linear activation function. The
-
-
was realized that the power of neural net-
red them, comes from combining these
Solution to the XOR problem
of the need for multi-layer networks was
hat a single neural unit cannot compute
nsider the task
XORof computing
can't be elementary
calculated by a single perceptron
OR, and XOR. As a reminder, here are
XOR can be calculated by a layered network of units.
XOR
y1 output
2 y x1 x2 y In hz 1 -2 0
0 0 0 0
-
0
1
h1 h2 and +1 hidden layer.
1 0 1 1
I
1 1 0 1 1 1 1 1 0 -1
imputlayer
1
1 1
ha
1 0 x1 x2 +1
1
↑
erceptron, which is a very simple neural
ave a non-linear activation function. The dammy neuron.
⑧
Chr
The hidden representation h
y1
1 -2 0
->
1986 - h1
1 1 1 1
h2
0 -1
+1
x1 x2 +1
x2 h2
Ruenemerhart.
1 1 Himtom.
-
0 x1 0
h1
0 1 0 1 2
-
a) The original x space b) The new (linearly separable) h space
(With learning: hidden layers will learn to form useful representations)
Pytorch.
Simple Neural
Networks and The XOR problem
Neural
Language
Models
Simple Neural
Networks and Feedforward Neural Networks
Neural
Language -MLP S
Models CNN fully
commected
NN).
Feedforward Neural Networks
Can also be called multi-layer perceptrons (or
8
MLPs) for historical reasons
C HAPTER 7 • N EURAL N ETWORKS AND N EURAL L ANGUAGE M ODELS
y1 y2 … yn
2
h1 h2 h3 … hn
1
W b
x1 x2 … xn
0 +1
Figure 7.8 A simple 2-layer feedforward network, with one hidden layer, one output layer,
and one input layer (the input layer is usually not counted when enumerating layers).
Recall that a single hidden unit has parameters w (the weight vector) and b (the
bias scalar). We represent the parameters for the entire hidden layer by combining
the weight vector wi and bias bi for each unit i into a single weight matrix W and
Binary Logistic Regression as a 1-layer Network
(we don't count the input layer in counting layers!)
Output layer σ ! = #(% & ' + ))

(σ node) sigmoid. (y is a scalar)
w w1 wn b (scalar)
(vector)
Input layer x1 xn +1
vector x
29
Multinomial Logistic Regression as a 1-layer Network
Fully connected single layer network
y1 yn A
Output layer s s s ! = softmax(2' + ))

-
(softmax nodes) y is a vector

W b
W is a b is a vector
matrix
Input layer x1 xn +1
scalars -
-
30
m
dimensional vector *
ax(E
-sofa
-
mx 1
R -I
y
=
yi Eea;
distribution,
gistic withcases
In such eachwe
value
use in the range (0,1),
multinomial andregression,
logistic all the values
alsosumming to 1.
called softmax r
phrase
ssion (choosing from tagsxx
Like gression
the sigmoid, like1
person, location, organization).
888d;
(or, it is an xtexponential function.
i
X
historically, 0cy:
the maxent classifier). In multinomial logistic regressi
-
ch casesForwea use multinomial logistic regression, alsoiscalled softmax re-

X
0
08
1
vectory zisofa dimensionality k, the softmax defined as:
↑x x
the target variable that ranges over more than two classes; we want to kno
=
~
xx
xx
0 Or
+
(or, historically, the maxent classifier).
S
xx In multinomial logistic regression
the probability of y being X t
4! in each
It potential class
c cMR2
2 C,
* p(y
n2 = c|x).
y is a variable that ranges over more than two classes; we want to know
=
+ +
The multinomial logistic1 t +

classifier
+
+ +
exp uses
(z ) a generalization of the sigmoid, call
m= 3
bility ofthe
y being
softmax in each potential
softmax(z
function, class
to compute c 2the C, probability
p(y = c|x).1 p(y
i
 i= k
 c|x). The softmax(5.30)
functi
tmax i) = P
multinomial Reminder:
takeslogistic
a vector classifier softmax:
z = [z1 , uses
a generalization
k
z2 , ...,azgeneralization
k ] of k j=1
exp (zofvalues
arbitrary j )the sigmoid, of
and maps
sigmoid
called
them to a probabili
ax function, to compute
distribution, withthe each probability
value in p(y the = c|x).(0,1),
range The softmax
and all the function
values summing to
ector z =The[z1softmax
, zthe
2 , ..., zkof ankinput
] of vector
arbitrary z = [zand
values 1 , z2maps
, ..., zkthem
] is thus
to aaprobability
vector itself:
on, with For avalue
vector z ofrange
dimensionality k, the softmax is:to 1.
Like sigmoid, it is an exponential function.
each in the (0,1), and all
For a vector z of dimensionality k, the softmax is defined as: the values summing
sigmoid,
C HAPTERit5is an L OGISTIC "R
• exponential function.
EGRESSION #
vector z of dimensionality k, the exp (z1 ) is defined
softmax exp (z 2)
as: exp (zk )
softmax(z) = PP k
, Pk , ..., Pk (5.31)
ER 5 • L OGISTIC R EGRESSION
The denominator i=1
k exp (zi )
exp (z ) is exp
used
i=1 exp (z(z
to )i )
inormalize all
i=1 exp
the (z i)
values into proba
softmax(z
i=1 i ) = i P
k
1  i  k (5.3
ities. Thus for exampleexp given
(z ) a vector: j=1 exp (z j )
Pk i
softmax(z
The denominator = i=1Pexp
i ) 1ik
k (zi ) is used to normalize all the values into probabil-
(5.30)
The exp= (z
z vector j ) z1.1,
= [z1 1.5, 1.2,zk3.2, 1.1]a vector itself:
ities. Thus forsoftmax
exampleofgiven an input
j=1
a vector: [0.6, , z2 , ..., ] is thus
Example:
oftmax of the
an input vector
resulting z=
(rounded)
z= z2 , ...,
1.1,zk ] 1.5,
,"softmax(z)
[z1[0.6, isisthus1.2,a3.2,
vector1.1]
itself: #
exp (z1 ) exp (z2 ) exp (zk )
"softmax(z) = [0.055, P 0.090,I+t 0.006,
, Pk0.099, 0.74, 0.010]
,#..., P (5.3
the resulting (rounded) softmax(z) is exp k + (z )
i exp (zi )
1
+
I
k
exp (zi )
exp (z1 ) i=1
exp (z ) xoo i=1
exp (z ) 0 i=1
Pksoftmax willImputi
+2 k x +
softmax(z) =Again Pklike the sigmoid,

, Pk the input to the
+ + 7 , ..., X
x
be (5.31)
the dot product betw
0
0.090, 0.006, 0.099, 0.74, 0.010]

+
x x
x 0
[0.055,
i=1 expw(zand
i ) ani=1 t exp (zi ) 000 i=1 exp (zi ) But now IR100
+ 0
xxx
a weight vector inputt vector x (plus a bias). itwe’ll need sepa
>
+
x
7 x =
++t 0000
weight 000
like vectors (and bias) for+each theofsoftmax
the K classes. 0
Again the sigmoid, the input to 000o will be the dot product between
5x3
Hidden;2000
I 3 -
Output: 1000.
1
33 >i
*
7
parameters.
M -
↑
+
255. -
1
-
Imple t
Layer 1. 2000
+
1000 x 2000
Two-Layer Network with scalar output Layer 22000 No
- + 1000.
Output layer % = &(() y is a scalar

(σ node) z = #ℎ
U
*
hidden units
(σ node) Could be ReLU
Or tanh
W b
Input layer
x1 xn +1
(vector)
Two-Layer Network with scalar output
(σ node) z = #ℎ
U
hidden units j
Wji
(σ node)
W b vector
Input layer
x1 i xn +1
(vector)
Two-Layer Network with scalar output IR -
IR.
->
>
(σ node) z = #ℎ
U
hidden units
(σ node) Could be ReLU
Or tanh
W b
Input layer
x1 xn +1
(vector)
vector
Two-Layer Network with softmax output- IR*- IR.
-
Output layer % = softmax(()
(σ node) z = #ℎ
U y is a vector
h hun has
hidden units -2 -3
(σ node) te
Could be ReLU
Or tanh
W b n 0(t)=
h olt_)
Input layer
=
x1 xn +1 his 0(ts)
=
(vector) h o(t).
=> =
3-class classification.
Multi-layer Notation
* = $[%]
sigmoid or softmax
$[%] = ' % (! % )
! [%] = # [%] $["] + &[%]
W[2 b[2]
] act func
j activation
a
$ = ' (!
-
["] " "
) ReLU
W[1
- ! ["] = # ["] $[$] + &["]
b[1]
↳assiee
] Pre activation.
x1 i xn +1 $[$] activation.
y
Multi Layer Notation σ

a
z
∑
w1 w2 w3 b
x1 x2 x3 +1
-> layers.
I1
=
preact -
act->
=
=
37
Replacing the bias unit
Let's switch to a notation without the bias unit
Just a notational change
1. Add a dummy node I a0=1 to each layer
2. Its weight w0 will be the bias
=>
>
3. So input layer a[0]0=1,

◦ And a[1]0=1 , a[2]0=1,…
h = s (W x)
might be plicit bias nodefunction
the activation b. Instead,
g(·) we addinternal
at the a dummy node a0 to each
layers.
always h=
be 1. sThus
(W x + b) 0, the input layer, will have a(7.12)
layer dummy
But now instead
nit In describing networks, of our
we vector
will x
often having
use a n values:
slightly sim-x = x1 , .
[1]
1
willvalues,
have a with= a
1, new
and 0th
so dummy
on. This value
dummy x
epresents exactly the same function without referring to an ex-
0 node
0 = 1: x
still x
has
= 0an
, . .
that weight represents
computing each hbias
j as the bias value b. For example instead o
Replacing
tead, we add a dummy the
node a follows:
tounit
each layer whose value will
0
h = s (W x)
yer 0, the input layer, will have a dummy node a
[0]
= 1, layer !1(7.13)
h = sX n 00 x + b)
(W
Instead
d so on. This of: node still has an
dummy hj = We'll do
s
associated 1 W
this:
ji xi + band
weight, j ,
of our we’ll
vectoruse:x having n values: x = x1 , . . . , 11xn , it will have n +
s the bias value
x= xb.
, For
x , …, example
x instead of anx= equation
i=1
x , x , x like
new 0th dummy value x0 = 1: x = x0 , . . . , xn00 . 1 And
1 2 n0 2, …, xn0
instead of
h j as follows:
we’ll
h = instead
s (W x +use: b) h = s (W x) (7.12)
! n
!
n 0 X 0
X of our vector x having n values: x = x1 , .
But now instead
h j = s W ji x i + b j , s W ji xi , (7.14)
1 values, with a new 0th dummy value x0 = 1: x = x0 , . .
i=1 i=0
h = s (W
computing each x) h j as follows: (7.13)
: n
!
our vector x having n values: x = x1 , . . . , xn , X it0 will have n +
!
w 0th dummy value X x00 = 1: x = x0h, .j .=
n . ,s
xn0 . And W jiinstead
xi + b j of,
follows: s W ji xi , i=1 (7.15)
Replacing the bias unit
Instead of: We'll do this:
y1 y2 … yn
2
y1 y2 … yn
2
U U
h1 h2 h3 … hn
1
h1 h2 h3 … hn
1
W b W
x1 x2 … xn
0 +1 x0=1
-
x1 x2 … xn
0
NN:1: RNo-IR*
+
x
y.
Simple Neural
Networks and Feedforward Neural Networks
Neural
Language
Models
Simple Neural
Networks and
Applying feedforward networks
Neural to NLP tasks
Language
Models
Use cases for feedforward networks
Let's consider 2 (simplified) sample tasks:
1. Text classification
2. Language modeling
State of the art systems use more powerful neural

architectures, but simple models are useful to
consider!
43
Classification: Sentiment Analysis
We could do exactly what we did with logistic

regression
Input layer are binary features as before
32 10
Output layer is 0 or 1 Edu
128
σ
512 U
⑧
-Enterpolip 102 1
- W
0 0 0000 000
x1 xn
⑧ ooooo ?
O
5.1.1 Example: sentiment classification
Let’s have an example. Suppose we are doing binary sentiment cl
movie review text, and we would like to know whether to assign the s
+ or to a review document doc. We’ll represent each input observ
INDB
features x1 ...x6 of the input shown in the following table; Fig. 5.2 show
Sentiment Features
in a sample mini test document.
Var Definition Value in Fi

x1 count(positive lexicon) 2 doc) 3
x2 count(negative
⇢ lexicon) 2 doc) 2
1 if “no” 2 doc
x3 1
0 otherwise
x4 count(1st
⇢ and 2nd pronouns 2 doc) 3
1 if “!” 2 doc
x5 0
0 otherwise
x6 log(word count of doc) ln(66) = 4
45
Let’s assume for the moment that we’ve already learned a real-valu
each of these features, and that the 6 weights corresponding to the
[2.5, 5.0, 1.2, 0.5, 2.0, 0.7], while b = 0.1. (We’ll discuss in the ne
Feedforward nets for simple classification
σ ales
↳
σ U
"
2-layer
Logistic
Regression W feedforward
network
W
x1 xn
f1 f2 fn x1 xn
f1 f2 fn
Just adding a hidden layer to logistic regression
46
46
• allows the network to use non-linear interactions between features

• which may (or may not) improve performance.
Even better: representation learning deep
I
learning.
σ
The real power of deep learning comes U
from the ability to learn features from alls, is 0.
the data W
Instead of using hand-built human-
engineered features for classification x1 xn
e1 e2 en
Use learned representations like
embeddings!
47
Neural Net Classification with embeddings as input features!
p(positive sentiment|The dessert is…)
Output layer ^y
sigmoid
()
U a |V|⨉dh
Hidden layer h1 h2 h3 … hdh dh⨉1
wt-1 dh⨉3d
W
Projection layer 3d⨉1
embeddings
E embedding for
word 534
embedding for
word 23864
embedding for
word 7
... The dessert is …

w1 w2 w3 48
p(positive sentiment|The dessert is…)
Output layer ^y
sigmoid
U |V|⨉dh
Hidden layer h1 h2 h3 … hdh dh⨉1
Issue: texts come in different

W sizes
Projection layer
wt-1 dh⨉3d
3d⨉1
embeddings
This assumes a fixed size length (3)! E embedding for

word 534
embedding for
word 23864
embedding for
word 7
Kind of unrealistic. ...The dessert is …

w1 w2 w3
Some simple solutions (more sophisticated solutions later)

1. Make the input the length of the longest review
• If shorter then pad with zero embeddings Doc 2 Vec.
• Truncate if you get longer reviews at test time
2.
Paragraph
Create a single "sentence embedding" (the same
2 Vec
dimensionality as a word) to represent all the words

• Take the mean of all the word embeddings
• Take the element-wise max of all the word embeddings
• For each dimension, pick the max value from all words 49
Tomas Mikolox
Quoc Le
Reminder: Multiclass Outputs
What if you have more than two output classes?
◦ Add more output units (one for each class)
◦ And use a “softmax layer”
W
x1 xn
50
Neural Language Models (LMs)
Language Modeling: Calculating the probability of the
next word in a sequence given some history.
• We've seen N-gram based LMs
• But neural network LMs far outperform n-gram
language models
State-of-the-art neural LMs are based on more
powerful neural network technology like Transformers
But simple feedforward LMs can do almost as well!
51
Simple feedforward Neural Language Models
Task: predict next word wt

given prior words wt-1, wt-2, wt-3, …
Problem: Now we’re dealing with sequences of
arbitrary length.
Solution: Sliding windows (of fixed length)
p(xx)x,xyz).
p(xq(x,xx)).
N = 3,
52
1 1 1 1 I
↳(*)
Neural Language Model cross
entropy
loss.
↑ 100000
->
the
embedding alb
for "for"
dogs
0 53
Why Neural LMs work better than N-gram LMs
Training data:
We've seen: I have to make sure that the cat gets fed.
Never seen: dog gets fed
Test data:
I forgot to make sure that the dog gets ___
N-gram LM can't predict "fed"!
Neural LM can use similarity of "cat" and "dog"
embeddings to generalize and predict “fed” after dog
Simple Neural
Networks and Applying feedforward networks
Neural to NLP tasks
Language
Models
Simple Neural
Networks and Training Neural Nets: Overview
Neural
Language
Models
Intuition: training a 2-layer Network
andot
I
->
I (N
Actual answer "
Loss function L(",
! ")
System output "!
->
al
x z2)
U
I
"
Backward pass
a
Forward pass z()

W x
(1)
Training instance x1 xn (0)

C
Auto Diff 57
X
STE, Pyare
Backward.
Forward
(JAX). JAX, -f
Intuition: Training a 2-layer network
For every training tuple (", $)
◦ Run forward computation to find our estimate !3
◦ Run backward computation to update weights:
◦ For every output node
◦ Compute loss ! between true " and the estimated "#
◦ For every weight $ from hidden layer to the output layer
◦ Update the weight
◦ For every hidden node
◦ Assess how much blame it deserves for the current answer
◦ For every weight $ from input layer to the hidden layer
◦ Update the weight
58
Now we take the log of both sides.=
p(y|x) Thisy will ŷ)
- ŷ (1
turn
1 yout to be handy mathematically,
(5.9)
I
1R* IRm
and doesn’t hurt us; whatevery: values maximize a probabilityJij will also maximize the
=
e
x
Nowlogwe takeprobability:
of the the log of both sides. This will turn out to be handy mathematically,
m e
5
=
i
and doesn’t hurt us; whatever
- values maximize
⇥ y a probability ⇤ will also maximize the
log of the probability: log p(y|x) = log ŷ (1 ŷ)1 y
Jacobian del
Reminder: Loss = y⇥logy ŷ +for
Function y)
y log(1 logistic
(1 1binary
⇤ ŷ) regression (5.10)
log p(y|x) = log ŷ (1 ŷ)
Eq. 5.10 describes a log likelihood = y log
thatŷ + (1 y)
should belog(1 ŷ)
maximized. In order to turn
(5.10)this
A measure
into loss function for how
(something farneed
that we off the currentwe’ll
to minimize), answer is to
just flip the sign on
Eq.Eq.
5.10 describes a log likelihood that should be maximized. In order to turn this
=Hints in
5.10. The result
the right isanswer
the cross-entropy loss L CE :
into loss function (something that we need to minimize), we’ll just flip the sign on
Eq. 5.10. TheCross
LCE (entropy
resultŷ,isy)the loss
=cross-entropy
log for =
p(y|x) logistic
loss L[y : regression:
CElog ŷ + (1 y) log(1 ŷ)] (5.11)
Finally, weLCE
can(ŷ,plug
y) =in the
logdefinition
p(y|x) =of ŷ [y
=log ŷ +
s (w · x(1 y) log(1
+ b): ŷ)] (5.11)
Finally, we L
can (plug
CE [y log s (wof· xŷ+=b)s+
ŷ, y) in=the definition (w(1· x +y)b):
log (1 s (w · x + b))] (5.12)
Chain
LCE (ŷ, y) = [yRule:
log s (w · x + b) + (1 y) log (1 s (w · x9: IR
+ b))] 1R*
g(h(e(h(x)))).
(5.12)
Let’s see if this loss function does the right thing for our example from Fig. 5.2. We
y does
=
want
Let’s seethe loss loss
if this to befunction
smaller if thethemodel’s estimate
right thing is close
for our to correct,
example from Fig.and5.2.
bigger
We if
=In i
thethe
want model
lossistoconfused.
be smallerSoiffirst
the let’s
model’s suppose the correct
estimate is closegold label for
to correct, thebigger
and sentiment
59
if
theexample
model isinconfused.
Fig. 5.2 isSo positive, i.e.,suppose
first let’s y = 1. In
thethis case gold
correct our model is doing
label for well, since
the sentiment
from Eq.
example 5.7 it5.2
in Fig. indeed gave the
is positive, i.e.,example
y = 1. Ina this
higher
caseprobability
our modelof is being
doing positive
well, since(.69)
thanEq.
from negative (.31). gave
5.7 it indeed If wetheplug s (w · xa+higher
example b) = .69 and y = 1ofinto
probability Eq.positive
being 5.12, the right
(.69)
than negative (.31). If we plug s (w · x + b) = .69 and y = 1 into Eq. 5.12, the right
2 1y =
ande
Jacobiams
Reminder: gradient descent for weight updates
5.4 • G RADIENT D ESCENT 11
GISTIC RUse
EGRESSION
the derivative of the loss function with respect to
5.4.1 Theweights % for Logistic Regression
Gradient &(' "; ) , $)
%&
n order to update q , we need a definition for the gradient —L( f (x; q ), y). Recall that
To tell us how to adjust weights for each training item
or logistic regression, the cross-entropy loss function is:
◦ Move them in the opposite direction of the gradient
LCE (ŷ, y) = [y log s (w · x + b)d
+ (1 y) log (1 s (w · x + b))] (5.17)
t+1 t
w = w h function
It turns out that the derivative of this L( f (x;
for w),
one y)
observation vector x is
dw
◦ For logistic
Eq. 5.18 (the interested readerregression
can see Section 5.8 for the derivation of this equation):
et’s extend the intuition ∂ LCE (ŷ,from
y) a function of one scalar variable w to
= [s (w · x + b) y]x j (5.18)
because we don’t just want ∂ w j to move left or right, we want to know
imensional space
Note in Eq. 5.18 that(of the N parameters
the gradient with respect to that make
a single w jqrepresents
weightup ) we shoulda
very
entintuitive
is justvalue:
suchthea difference
vector; itbetween the truethe
expresses y and ŷ = s (w ·
our estimatedcomponents
directional
x + b) for that observation, multiplied by the corresponding input value x j .
ope along each of those N dimensions. If we’re just imagining two
Figure 7.10 Computation graph for the function L(a, b, c) = c(a+
nodes a = 3, b = 1, c = 2, showing the forward pass computation
of f (x) is the derivative of u(x) with respect to v(x) times the

Where didtothat
respect x: derivative come from?
Using the chain rule! f (x) = u(v(x)) d f = du · dv

dx dv dx
Intuition (see the text for details)
y The chain rule extends toDerivative
more thanof the two functions.
weighted sum If compu
a composite function Derivative
f (x) = u(v(w(x))),
of the Activation the derivative of f (x
σ Derivative of the Loss
z df du dv dw
= · ·
∑ !" !" !$
dx !% dv dw dx
w1 w2 w3
=
!#! the!$ !% !#! we need. Since in
x1 x2
Let’s
b
x3 +1
now compute 3 derivatives
L = ce, we can directly compute the derivative ∂∂ Lc :
61
∂L
=e
∂c
How can I find that gradient for every weight in
the network?
These derivatives on the prior slide only give the
updates for one weight layer: the last one!
What about deeper networks?
• Lots of layers, different activation functions?
Solution in the next lecture:
• Even more use of the chain rule!!
• Computation graphs and backward differentiation!
62
Simple Neural
Networks and Training Neural Nets: Overview
Neural
Language
Models
Simple Neural
Networks and Computation Graphs and
Neural Backward Differentiation
Language
Models AutoDif1 F Approximate
Diff
symbolic.
I
Diff.
2
=21
y
y
=As. Symbolic
Diff
Why Computation Graphs
For training, we need the derivative of the loss with
respect to each weight in every layer of the network
• But the loss is computed only at the very end of the
network!
Solution: error backpropagation (Rumelhart,
- Hinton, Williams, 1986)
• Backprop is a special case of backward differentiation

• Which relies on computation graphs.
65
Computation Graphs
A computation graph represents the process of

computing a mathematical expression
66
loss garage
Example: =>
e
-
=C
Computations:
de de
I
↳
Eg:2
C
a
-8 e=a+d ↓
·
b d = 2b L=ce
c = c
67
e 1
=
Example:
Computations:
68
Backwards differentiation in computation graphs
The importance of the computation graph

comes from the backward pass
This is used to compute the derivatives that we’ll
need for the weight update.
) = 10. In the forward pass of a computation graph, we app
to right, passing the outputs of each computation as the inpu
Example
kward differentiation on computation graphs
ce of the computation graph comes from the backward pass,
mpute the derivatives that we’ll need for the weight update.
oal is to compute the derivative of the output function L with
input variables,
We want: i.e., ∂∂ La , ∂∂ Lb , and ∂∂ Lc . The derivative ∂∂ La , tells
change in a affects L.
s differentiation
The derivative !#use
makes !"
, tellsof chain
the much
us how rulechange
a small in calculus.
in a Supp
the derivative
affectsofL. a composite function f (x) = u(v(x)). The de
70
c
-2
Figure 7.10 Computation graph for the function L(a, b, c) = c(a+2b), with value
c
nodes a = 3, b = 1, c = 2, showing the forward pass computation of L.
Figure 7.10 Computation graph for the function L(a, b, c) = c(a+2b), with values fo
nodes
of fa(x)
= 3,isbthe
= 1, c = 2, showing
derivative of u(x) the forward
with pass
respect to computation of L.
v(x) times the derivative of
The chain rule
respect to x:
of f (x) is the derivative of u(x) with respect to v(x) times the derivative of v(x
respect to x:Computing the derivatived of f a composite
du dv function:
= ·
dx dv dx
f (x) = u(v(x)) df du dv
The chain rule extends to more than=two functions.
· If computing the deriv
dx dv dx
composite function f (x) = u(v(w(x))), the derivative of f (x) is:
The chain rule extends to more than two functions. If computing the derivati
composite ffunction f d f thedu
u(v(w(x))), dv dw
derivative
(x) =
(x) = u(v(w(x))) = · · of f (x) is:
dx dv dw dx
df du dv dw
Let’s now compute the 3 derivatives
= · we need.
· Since in the computat
dx dv dw dx
L = ce, we can directly compute the derivative ∂∂ Lc :
Let’s now compute the 3 derivatives we need. Since in the computation
∂L
L = ce, we can directly compute the derivative ∂L
= e ∂c :
∂c
posite function
mposite function f f(x)
(x)==u(v(w(x))),
u(v(w(x))),the thederivative
derivativeof of f f(x)
(x)is:is:
ain ruleextends
n rule extendstotomore
morethanthantwo twofunctions.
functions.IfIfcomputing
computingthe thederivati
deriva
itefunction
functionf f(x) ddf f the
u(v(w(x))), thedu
du dvdv dw
derivative dw
te (x)==u(v(w(x))), == · · · · ofoff f(x)
derivative (x)is: is:
dx
dx dv
dv dw dw dx dx
ddf f du dv
du dv dw dw
Let’s nowExample
Let’snow compute
computethe the33derivatives
==
derivatives · ·we
weneed.
·
·need. Since
Sinceininthe thecomputatio
computati
dx
dx the dv dv dw
dw dx dx
∂∂LL
= ce, we can directly compute the derivative ∂∂cc: :
ce, we can directly compute derivative
snow
nowcompute
computethe the33derivatives
derivativeswe weneed.
need. SinceSinceininthe thecomputation
computatio
wewecancandirectly
directlycompute
computethe ∂∂LL
thederivative
derivative ∂∂LL :
==ee∂∂c c:
∂∂cc
∂∂LL
For
Forthe
theother
othertwo,
two,we’ll
we’llneed
needtotouse ==the
use e chain
ethe chainrule:
rule:
∂∂cc
∂∂LL ∂∂LL∂∂ee
theother
he othertwo,
two,we’ll
we’llneed
needtotouseusethethe
= chainrule:
= chain rule:
∂∂aa ∂∂ee∂∂aa
∂∂LL∂∂LL ∂∂LL∂∂L∂L ee∂∂ee∂∂dd
== ==
∂∂aa∂∂bb ∂∂ee∂∂e∂aea∂∂dd∂∂bb 72
∂∂LL ∂∂LL∂∂ee∂∂dd ∂∂LL ∂∂LL ∂∂ee ∂∂ee

Eq. 7.26 thus requires
Eq. 7.26 thus requires fivefive intermediate
= =
intermediate derivatives:
derivatives: ,, ∂ c ,, ∂ a ,, ∂ d ,,
∂∂bb ∂∂ee∂∂dd∂∂bb ∂ e
∂e ∂c ∂a ∂d
ch
ichare
areas
asfollows
follows(making
(makinguse useof ofthe
thefact
factthat
thatthethederivative
derivativeof ofaasumsumisis
∂L
=e
∂c
For the other two, we’ll need to use the chain rule:
Example ∂L
=
∂L ∂e
∂a ∂e ∂a
∂L ∂L ∂e ∂d
=
∂b ∂e ∂d ∂b
Eq. 7.26 thus requires five intermediate derivatives: ∂∂ Le , ∂∂ Lc , ∂∂ ae ,

which are as follows (making use of the fact that the derivative of a sum
of the derivatives):
∂L ∂L
L = ce : = c, =e
∂e ∂c
∂e ∂e
e = a+d : = 1, =1
∂a ∂d
∂d 73
d = 2b : =2
∂b
In the backward pass, we compute each of these partials along each edge
from right to left, multiplying the necessary partials to result in the fina
Let’s now compute the 3 derivatives we need. Since in the computation graph
L = ce, we can directly compute the derivative ∂∂ Lc : ∂L ∂L ∂e ∂d
=
∂b ∂e ∂d ∂b
∂L
e 7.26 thus requires five intermediate
=Eq. derivatives: ∂∂ Le , ∂∂ Lc , ∂∂ ae , ∂∂
(7.25)
∂ cwhich are as follows (making use of the fact that the derivative of a sum
For the other two, we’ll need to useofthe
thechain
derivatives):
rule:
∂L ∂L ∂e ∂L ∂L
Example = L = ce : = c, =e
∂a ∂e ∂a ∂e ∂c
∂L ∂L ∂e ∂d ∂e ∂e
e = a+d : = 1, =1
= ∂(7.26)
a ∂d
a=3 ∂b ∂e ∂d ∂b
∂d
d = 2b : =2
a thus requires five intermediate derivatives: ∂∂ Le , ∂∂ Lc , ∂∂ ae , ∂∂ de , and
Eq. 7.26 ∂ b∂d ,
∂b
which are as follows (making use of the
In fact that the derivative
the backward of a sum
pass, we compute is of
each thethese
sumpartials along each edge o
of the derivatives): from right to left, multiplying the necessary partials to result in the fina
e=5
we need. Thus we begin by annotating the final node with ∂∂ LL = 1. Mo
∂L ∂L
b=1 L = ce : left, e=d+ac,
we=then = e ∂∂ Lc and ∂∂ Le L=-10
compute , and so on, until we have annotated th
d=2 ∂ e ∂ c
the way to the input variables. The forward pass conveniently already
b ∂e ∂e L=ce
de =
= 2b
a + d : computed = 1,the values
= 1 of the forward intermediate variables we need (lik
∂a ∂d
∂d
d = 2b : =2
∂b
In the backward
c=-2 pass, we compute each of these partials along each edge of the graph
from right to left, multiplying the necessary partials to result in the final derivative
c backward pass
we need. Thus we begin by annotating the final node with ∂∂ LL = 1. Moving to the
left, we then compute ∂∂ Lc and ∂∂ Le , and so on, until we have annotated the graph all
the way to the input variables. The forward pass conveniently already will have
74
computed the values of the forward intermediate variables we need (like d and e)
Example
75
Backward differentiation for a neural network
Of course computation graphs for real neural networks
Fig. 7.12 shows a sample computation graph for a 2-layer
2, n1 = 2,
Backward and n2 = 1, assuming
differentiation on a twobinary classification a
layer network
output unit for simplicity. The function that the computat
y
Sigmoid activation σ z[1] = W [1] x + b[1]

a[1] = ReLU(z[1] )
W[2] b[2]
1
z[2] = W [2] a[1] + b[2]
ReLU activation
a[2] = s (z[2] )
W[1] b[1] ŷ = a[2]
1
The weights that need updating (those for which we
x1 x2
derivative of the loss function) are shown in orange. In
76
pass, we’ll need to know the derivatives of all the function

saw in Section ?? the derivative of the sigmoid s :
iation for a neural network ds (z)
= s (z)(1 s (z))
on graphs for real neural networks dz
are much more complex.
mple computation graphWe’ll for
also aneed
2-layer neural of
the derivatives network with
each of the n0activation
other = func
derivative of tanh is:
1, assumingBackward
binary differentiation
classification andon a two
hence layer
using network
a sigmoid
icity. The function that the computation d tanh(z)
graph=is1 computing is:
tanh2 (z)
dz
z[1] = WThe
[1] derivative
x + b[1] of the ReLU is
⇢
[1] [1] d ReLU(z)
a = ReLU(z ) =
0 f or z < 0
dz 1 f or z 0
z[2] = W [2] a[1] + b[2]
a[2] = s (z[2] )
ŷ = a[2] (7.27)
need updating (those for which we need to know the partial 77
s function) are shown in orange. In order to do the backward

now the derivatives of all the functions in the graph. We already
e derivative of the sigmoid s :
Of coursepass,
computation
we’ll needgraphs forthe
to know real neural networks
derivatives of all theare much more
functions in the complex.
graph. We already
Fig. 7.12 saw
showsin aSection
sample??
computation graph
the derivative for sigmoid
of the a 2-layersneural
: network with n0 =
2, n1 = 2, and n2 = 1, assuming binary classification and hence using a sigmoid
output unit for simplicity. The function dsthat
(z) the computation graph is computing is:
= s (z)(1 s (z)) (7.28)
dz
z[1] = W [1] x + b[1]
We’ll also need the[1] derivatives of[1]each of the other activation functions. The
derivative of tanh is: a = ReLU(z )
z[2] = W [2] a[1] + b[2]
Backward differentiation on a 2-layer network
d tanh(z)
a[2] = s (z [2] = 1 tanh2 (z)
dz )
(7.29)
ŷ = a[2] (7.27)
The derivative of the ReLU is
The weights that need updating ⇢
(those for which
d ReLU(z) 0 fweor need
z < 0to know the partial
derivative of the loss function) are shown = (7.30)
dz in orange. 1 fInor order
z 0to do the backward
pass, we’ll need to know the derivatives of all the functions in the graph. We already
ds (z)
= s (z)(1 s (z)) (7.28)
dz
We’ll also need the derivatives of each of the other activation functions. The
derivative of tanh is:
d tanh(z)
=1 tanh2 (z) (7.29)
dz
The derivative of the ReLU is
⇢
d ReLU(z) 0 f or z < 0
= (7.30)
dz 1 f or z 0
Backward differentiation for a neural network
Of course computation graphs for real neural networks are
Fig. 7.12 shows a sample computation graph for a 2-layer ne
2, n1 = 2, and n2 = 1, assuming binary classification and
output unit for simplicity. The function that the computation
!" z[1] = W [1] x + b[1]

Starting off the backward pass: a[1] = ReLU(z[1] )
!#
(I'll write ! for ![%] and $ for $[%] ) z[2] = W [2] a[1] + b[2]
a[2] = s (z[2] )
! #,
! # = − y log(#)
! + 1 − # log(1 − #)
! ŷ = a[2]
The weights that need updating (those for which we ne

! /, # = − y log / + 1 − # log(1 −derivative
/) of the loss function) are shown in orange. In ord
!" !" !# pass, we’ll need to know the derivatives of all the functions in
!$
= !# !$ saw in Section ?? the derivative of the sigmoid s :
ds (z)
= s (z)(1
s (z))
%& % log ! % log 1 − ! dz
=− ) + (1 − y)
%! %! %! We’ll also need the derivatives of each of the other act
1 1 ) of tanh
derivative ) −is:1
=− ) + 1−y −1 = − + d tanh(z)
! 1−! ! 1−! = 1 tanh2 (z)
dz
0/ 0! # #−
The1
derivative of the ReLU is
= /(1 − /) =− + / 1 − /d ReLU(z)= a −⇢y0
01 01 / 1−/ dz
=
1
f or x < 0
f or x 0
Summary
For training, we need the derivative of the loss with respect to
weights in early layers of the network
• But loss is computed only at the very end of the network!
Solution: backward differentiation
Given a computation graph and the derivatives of all the
functions in it we can automatically compute the derivative of
the loss with respect to these early weights.
80
Simple Neural
Networks and Computation Graphs and
Neural Backward Differentiation
Language
Models

Lecture 4. Neural Networks and Neural Language Models

Uploaded by

Copyright:

Available Formats

Lecture 4. Neural Networks and Neural Language Models

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Lecture 4. Neural Networks and Neural Language Models

Uploaded by

Copyright:

Available Formats

Simple Neural

Networks and Units in Neural Networks

By BruceBlaus - Own work, CC BY 3.0,

Non-linear activation function σ

An example w = [0.2, 0.3, 0.9]

of the most clever demonstrations of the need

logical functions of two inputs, like AND, OR, and X

1 1 tables x1 for those thefunctions:

Pause the lecture and try for yourself!

(in standard linear format: x2 = (−w1/w2)x1 + (−b/w2) )

This line acts as a decision boundary

XOR is not a linearly separable function!

a) The original x space b) The new (linearly separable) h space

(With learning: hidden layers will learn to form useful representations)

Output layer σ ! = #(% & ' + ))

Output layer s s s ! = softmax(2' + ))

(softmax nodes) y is a vector

ch casesForwea use multinomial logistic regression, alsoiscalled softmax re-

The multinomial logistic1 t +

softmax(z) =Again Pklike the sigmoid,

0.090, 0.006, 0.099, 0.74, 0.010]

Two-Layer Network with scalar output Layer 22000 No

Output layer % = &(() y is a scalar

Multi Layer Notation σ

3. So input layer a[0]0=1,

State of the art systems use more powerful neural

We could do exactly what we did with logistic

Var Definition Value in Fi

• allows the network to use non-linear interactions between features

Hidden layer h1 h2 h3 … hdh dh⨉1

... The dessert is …

Hidden layer h1 h2 h3 … hdh dh⨉1

Issue: texts come in different

This assumes a fixed size length (3)! E embedding for

Kind of unrealistic. ...The dessert is …

Some simple solutions (more sophisticated solutions later)

dimensionality as a word) to represent all the words

Task: predict next word wt

Forward pass z()

Training instance x1 xn (0)

of f (x) is the derivative of u(x) with respect to v(x) times the

Using the chain rule! f (x) = u(v(x)) d f = du · dv

• Backprop is a special case of backward differentiation

A computation graph represents the process of

The importance of the computation graph

∂∂LL ∂∂LL∂∂ee∂∂dd ∂∂LL ∂∂LL ∂∂ee ∂∂ee

Eq. 7.26 thus requires five intermediate derivatives: ∂∂ Le , ∂∂ Lc , ∂∂ ae ,

Sigmoid activation σ z[1] = W [1] x + b[1]

pass, we’ll need to know the derivatives of all the function

need updating (those for which we need to know the partial 77

s function) are shown in orange. In order to do the backward

!" z[1] = W [1] x + b[1]

The weights that need updating (those for which we ne

You might also like