10 RNN
Pawan Goyal
CS60010
Predicting the next word given its context:
P(w_4 | w_1, w_2, w_3)
Assigning a probability to an entire sequence:
P(w_1, w_2, w_3, w_4)
A model that computes either of these is called a language model
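The two views are linked by the chain rule: the probability of a sequence is the product of next-word probabilities. A minimal sketch, with hand-picked toy probabilities (assumed, not from the slides):

```python
# Toy sketch: scoring a sequence with the chain rule.
# p_next(history, word) would normally come from a trained language model.
def p_next(history, word):
    # Hypothetical hand-set probabilities, for illustration only.
    table = {
        (("the",), "students"): 0.2,
        (("the", "students"), "opened"): 0.1,
        (("the", "students", "opened"), "their"): 0.4,
    }
    return table.get((tuple(history), word), 1e-6)

def sequence_prob(words):
    """P(w_2, ..., w_n | w_1) = prod_t P(w_t | w_1, ..., w_{t-1})."""
    prob = 1.0
    for t in range(1, len(words)):
        prob *= p_next(words[:t], words[t])
    return prob

print(sequence_prob(["the", "students", "opened", "their"]))  # 0.2 * 0.1 * 0.4
```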
Language Modeling
Predictive typing
Speech recognition
Machine translation
Chatbots
Handwriting recognition
Spelling/grammar correction
n-gram language models
[Example: predicting the next word in "the students opened their ___", e.g. P(books | students opened their).]
n-gram language models
A 4-gram model estimates
P(w | students opened their) = count(students opened their w) / count(students opened their)
[Handwritten worked example with corpus counts, i.e. a ratio count(4-gram) / count(trigram) giving a probability such as 0.02.]
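A minimal count-based sketch of this estimate, using an assumed toy corpus (the sentences and the resulting counts below are invented for illustration):

```python
from collections import Counter

# Toy corpus, assumed for illustration only.
corpus = ("the students opened their books . "
          "the students opened their minds . "
          "the students opened their books again").split()

# Count 4-grams and their 3-gram prefixes.
four_grams = Counter(tuple(corpus[i:i + 4]) for i in range(len(corpus) - 3))
tri_grams = Counter(tuple(corpus[i:i + 3]) for i in range(len(corpus) - 2))

def p_4gram(w, context):
    """P(w | context) = count(context + w) / count(context); context is a 3-tuple."""
    return four_grams[context + (w,)] / tri_grams[context]

ctx = ("students", "opened", "their")
print(p_4gram("books", ctx))  # 2/3 in this toy corpus
```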
Storage Problems with n-gram Language Model
Need to store counts for all n-grams observed in the corpus; increasing n or the corpus size increases the model size.
One-hot representation
Words are represented as one-hot vectors: one 1, the rest 0s
w_i ∈ R^{|V|}: the vector dimension equals the vocabulary size |V|
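A tiny sketch of one-hot vectors over an assumed 5-word vocabulary (the same toy vocabulary reused later in the Skip-gram exercise):

```python
import numpy as np

# Assumed toy vocabulary; the vector dimension equals |V|.
vocab = ["passed", "through", "relu", "activation", "function"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))  # all zeros ...
    v[index[word]] = 1.0      # ... except a single 1 at the word's index
    return v

print(one_hot("relu"))  # [0. 0. 1. 0. 0.]
```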
Word vectors: instead, we learn a dense vector for each word; such a vector represents the ‘meaning’ of the word in some abstract way
Self-supervision: the training data is just data with labels, but not human-generated labels; the labels are generated from the data itself.
Learning Word Vectors: Overview
Basic Idea: Use self-supervision
We have a large corpus of text
Every word in a fixed vocabulary is represented by a vector
Go through each position t in the text, which has a center word c and
context (“outside”) words o
Use the similarity of the word vectors for c and o to calculate the
probability of o given c (or vice versa)
Keep adjusting the word vectors to maximize this probability
The word vectors are the parameters to be learned: each word has two vectors (a center vector v_w and an outside vector u_w), i.e., 2|V| vectors in total.
P(o|c) = exp(u_o^T v_c) / Σ_{w∈V} exp(u_w^T v_c)
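A minimal sketch of this softmax, with small random center vectors v and outside vectors u (sizes assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3                   # vocabulary size and embedding dimension (assumed)
U = rng.normal(size=(V, d))   # one outside/context vector u_w per word
Vc = rng.normal(size=(V, d))  # one center vector v_w per word

def p_outside_given_center(o, c):
    """P(o|c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)."""
    scores = U @ Vc[c]        # u_w . v_c for every word w
    scores -= scores.max()    # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

print(p_outside_given_center(o=1, c=2))
```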
[Figure: the Skip-gram network. A one-hot center word (e.g. "banking") selects its d-dimensional center vector v_c in the hidden layer; a softmax over the outside vectors, exp(u_o^T v_c) / Σ_{w=1}^{|V|} exp(u_w^T v_c), gives P(o|c) for every candidate context word (e.g. "crises").]
Try this problem
Skip-gram
Suppose you are computing the word vectors using the Skip-gram architecture.
You have 5 words in your vocabulary, {passed, through, relu, activation, function}, in that order, and suppose you have the window 'through relu activation' in your corpus. You use this window with 'relu' as the center word and one word before and after the center word as your context.
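A hedged sketch of the setup for this exercise, with random placeholder weights (not an answer key): the one-hot center word selects v_relu as the hidden layer, and a softmax over the outside vectors scores the two context words.

```python
import numpy as np

vocab = ["passed", "through", "relu", "activation", "function"]
idx = {w: i for i, w in enumerate(vocab)}

rng = np.random.default_rng(1)
d = 3                                  # embedding size, chosen arbitrarily here
Vc = rng.normal(size=(len(vocab), d))  # center-word vectors (input->hidden weights)
U = rng.normal(size=(len(vocab), d))   # outside-word vectors (hidden->output weights)

center = "relu"
contexts = ["through", "activation"]   # one word before and after the center

v_c = Vc[idx[center]]                  # hidden layer = the row selected by the one-hot input
scores = U @ v_c
probs = np.exp(scores - scores.max())
probs /= probs.sum()                   # softmax over the 5 vocabulary words

for o in contexts:
    print(f"P({o} | {center}) = {probs[idx[o]]:.3f}")
```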
[Figure: a fixed-window neural language model.]
Recurrent Neural Networks
Core Idea
Apply the same weights repeatedly!
[Figure: an unrolled RNN, with the same weight matrices applied to the input x_t and the previous hidden state h_{t-1} at every time step.]
RNN as a feed-forward network
h_t = f(U h_{t-1} + W x_t + b), with U: d_h × d_h, W: d_h × d_in, V: d_out × d_h
Forward Propagation
h_t = g(U h_{t-1} + W x_t)
y_t = softmax(V h_t)
Let the dimensions of the input, hidden and output be d_in, d_h and d_out, respectively
The three parameter matrices: W: d_h × d_in, U: d_h × d_h, V: d_out × d_h
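A minimal NumPy sketch of this forward pass, with d_in, d_h and d_out chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out = 4, 3, 2            # assumed sizes
W = rng.normal(size=(d_h, d_in))      # input -> hidden
U = rng.normal(size=(d_h, d_h))       # hidden -> hidden (reused at every step)
V = rng.normal(size=(d_out, d_h))     # hidden -> output

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def rnn_forward(xs):
    """xs: list of d_in vectors; returns the output distribution y_t at each step."""
    h = np.zeros(d_h)
    ys = []
    for x in xs:
        h = np.tanh(U @ h + W @ x)    # h_t = g(U h_{t-1} + W x_t)
        ys.append(softmax(V @ h))     # y_t = softmax(V h_t)
    return ys

print(rnn_forward([rng.normal(size=d_in) for _ in range(3)]))
```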
Training an RNN language model
[Figure: training an RNN language model. At each time step the model outputs a probability distribution over the next word, which is compared with the true next word to compute the loss.]
Generating text with an RNN Language Model
RNN-based language models can be used for language generation (and
hence, for machine translation, dialog, etc.)
A language model can incrementally generate text by repeatedly sampling the next word conditioned on the previous choices, also known as autoregressive generation.
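A minimal sketch of autoregressive sampling; a dummy uniform next-word distribution stands in for a trained RNN LM:

```python
import numpy as np

vocab = ["<s>", "the", "students", "opened", "their", "books", "</s>"]
rng = np.random.default_rng(0)

def next_word_probs(history):
    # Placeholder for a trained RNN LM: here, simply uniform over the vocabulary.
    return np.full(len(vocab), 1.0 / len(vocab))

def generate(max_len=10):
    words = ["<s>"]
    while len(words) < max_len and words[-1] != "</s>":
        probs = next_word_probs(words)            # condition on previous choices
        words.append(rng.choice(vocab, p=probs))  # sample the next word
    return words

print(generate())
```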
Sequence labeling: Named Entity Recognition, Parts-of-Speech Tagging
Task
Assign a label chosen from a small fixed set of labels to each element of the
sequence
Inputs: Word embeddings
Outputs: Tag probabilities generated by the softmax layer
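A minimal sketch of the labeling step: one tag distribution per token, computed by a softmax layer on top of (here, random) hidden states, with an assumed tag-set size:

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, n_tags = 4, 3                  # hidden size and tag-set size (assumed)
V = rng.normal(size=(n_tags, d_h))  # hidden state -> tag scores

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

# One hidden state per token (these would come from the RNN).
hidden_states = [rng.normal(size=d_h) for _ in range(5)]
tag_probs = [softmax(V @ h) for h in hidden_states]   # one distribution per token
predictions = [int(np.argmax(p)) for p in tag_probs]  # most likely tag per token
print(predictions)
```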
Task
Classify the entire sequence rather than the individual tokens within it
Pass the text to be classified a word at a time, generating new hidden
states at each time step
The hidden state of the last token can be thought of as a compressed
representation of the entire sequence
This last hidden state is passed through a feed-forward network that
chooses a class via softmax
There are other options for combining information from all the hidden states
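A minimal sketch of classifying with the last hidden state (random weights, assumed class count):

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, n_classes = 4, 2
W_ff = rng.normal(size=(n_classes, d_h))  # feed-forward layer on top of h_n

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

hidden_states = [rng.normal(size=d_h) for _ in range(6)]  # h_1 ... h_n from the RNN
h_last = hidden_states[-1]                # compressed representation of the sequence
class_probs = softmax(W_ff @ h_last)      # class distribution via softmax
print(class_probs, int(np.argmax(class_probs)))
```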
Bidirectional RNNs
An RNN makes use of information from the left (prior) context to predict at time t
In many applications, the entire sequence is available; so it makes sense
to also make use of the right context to predict at time t
Bidirectional RNNs combine two independent RNNs, one where the input is processed from left to right (forward RNN), and another where it is processed from the end to the start (backward RNN).
h_t^f = RNN_forward(x_1, ..., x_t)
h_t^b = RNN_backward(x_n, ..., x_t)
h_t = [h_t^f ; h_t^b]
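A minimal sketch of the bidirectional combination, with the two RNNs stubbed out as simple tanh recurrences over random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 3, 2
Wf, Uf = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))  # forward RNN
Wb, Ub = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))  # backward RNN

def run_rnn(xs, W, U):
    h, hs = np.zeros(d_h), []
    for x in xs:
        h = np.tanh(U @ h + W @ x)
        hs.append(h)
    return hs

xs = [rng.normal(size=d_in) for _ in range(4)]
h_fwd = run_rnn(xs, Wf, Uf)              # h^f_t: left-to-right
h_bwd = run_rnn(xs[::-1], Wb, Ub)[::-1]  # h^b_t: right-to-left, re-aligned to positions
h = [np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)]  # h_t = [h^f_t ; h^b_t]
print(h[0].shape)  # (2 * d_h,)
```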
[Handwritten worked example: using a bidirectional RNN for POS tagging, with 300-dimensional word embeddings, 50-dimensional hidden states in each direction, and a softmax over the tag set at each position.]
Using Bidirectional RNNs for Sequence Classification
[Figure: a bidirectional RNN for sequence classification. The final hidden states of the forward and backward passes are combined and fed to the classifier.]
The main problem is that it is too difficult for the RNN to learn to preserve
information over many timesteps.
In a vanilla RNN, the hidden state is constantly being rewritten
h_t = tanh(U h_{t-1} + W x_t)
Gates: each gate is a vector computed by a (neural) sigmoid layer, e.g. g_t = σ(U_g h_{t-1} + W_g x_t), so its entries lie between 0 and 1; gates add more learnable parameters.
Using Gates for better RNN units
In addition to the hidden state h_t, the unit maintains a context (cell) state c_t.
g_t = tanh(U_g h_{t-1} + W_g x_t)   (new cell content, computed as in a vanilla RNN)
f_t = σ(U_f h_{t-1} + W_f x_t)
i_t = σ(U_i h_{t-1} + W_i x_t)
o_t = σ(U_o h_{t-1} + W_o x_t)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)
LSTM: More Details
Forget Gate
Controls what parts of the previous context are kept vs. forgotten
f_t = σ(U_f h_{t-1} + W_f x_t)
Input Gate
Controls what parts of the new cell content are written to the context
i_t = σ(U_i h_{t-1} + W_i x_t)
Output Gate
Controls what parts of the context are output to the hidden state
o_t = σ(U_o h_{t-1} + W_o x_t)
The LSTM architecture makes it easier for the RNN to preserve information
over many timesteps
e.g., if the forget gate is set to remember everything on every timestep,
then the info in the cell is preserved indefinitely
By contrast, it is harder for a vanilla RNN to learn a recurrent weight matrix U that preserves information in the hidden state
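A minimal sketch of one LSTM step using these gates (random weights; not the exact parameterization of any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 3, 2
def mats(): return rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_in))
Uf, Wf = mats(); Ui, Wi = mats(); Uo, Wo = mats(); Ug, Wg = mats()

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev):
    f = sigmoid(Uf @ h_prev + Wf @ x)   # forget gate
    i = sigmoid(Ui @ h_prev + Wi @ x)   # input gate
    o = sigmoid(Uo @ h_prev + Wo @ x)   # output gate
    g = np.tanh(Ug @ h_prev + Wg @ x)   # new cell content (like a vanilla RNN)
    c = f * c_prev + i * g              # context/cell state update
    h = o * np.tanh(c)                  # hidden state, read out through the output gate
    return h, c

h, c = np.zeros(d_h), np.zeros(d_h)
for x in [rng.normal(size=d_in) for _ in range(3)]:
    h, c = lstm_step(x, h, c)
print(h, c)
```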
Common RNN NLP Architectures
[Figure: the encoder-decoder (sequence-to-sequence) architecture, e.g. for machine translation: an encoder RNN reads the input sequence and a decoder RNN generates the output sequence.]
The context vector c, a function of the encoder states h_{1:n}, conveys the essence of the input to the decoder.
End-to-end training
For MT, the training data typically consists of a set of sentences and their translations
The network is given a source sentence followed by a separator token, and is trained auto-regressively to predict the next word
Teacher forcing is used during training, i.e., the system is forced to use the gold target token from the training data as the next input x_{t+1}, rather than its own (possibly erroneous) prediction at time t; teacher forcing is not used at inference
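A minimal sketch of one teacher-forced training step for a toy encoder-decoder RNN (plain NumPy forward pass with an assumed cross-entropy loss and no weight update; token ids and sizes are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4                                # vocab size and hidden size (assumed)
E = rng.normal(size=(V, d))                # word embeddings
U, W = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # one toy RNN cell reused for
Out = rng.normal(size=(V, d))              # encoder and decoder; hidden -> vocab scores

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def step(h, token_id):
    return np.tanh(U @ h + W @ E[token_id])

source, target = [1, 2, 3], [4, 5, 0]      # toy token ids; 0 stands in for end-of-sequence
h = np.zeros(d)
for tok in source:                         # encoder: read the source sentence
    h = step(h, tok)

loss, prev = 0.0, 3                        # 3 stands in for the separator token
for gold in target:                        # decoder with teacher forcing:
    h = step(h, prev)                      #   the *gold* previous token is fed in,
    probs = softmax(Out @ h)               #   not the model's own prediction
    loss += -np.log(probs[gold])           # cross-entropy for the gold next token
    prev = gold
print(loss / len(target))
```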
Training the Encoder-decoder model
The context vector h_n is the hidden state of the last time step of the source text
It acts as a bottleneck, as it has to represent absolutely everything about the meaning of the source text; this is the only thing the decoder knows about the source text
Encoder-decoder with attention
[Figure: encoder-decoder with attention. At each decoder step an attention distribution over the encoder hidden states is computed and used to weight the encoder states.]
Attention: In Equations
Computing c_i
Compute how much to focus on each encoder state, by seeing how relevant it is to the decoder state captured in h^d_{i-1}, and give it a score
The simplest scoring mechanism is dot-product attention: score(h^d_{i-1}, h^e_j) = h^d_{i-1} · h^e_j
α_{ij} = softmax_j( score(h^d_{i-1}, h^e_j) )
c_i = Σ_j α_{ij} h^e_j
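A minimal sketch of dot-product attention following these equations, with random encoder and decoder states:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 3                      # number of encoder states and hidden size (assumed)
H_enc = rng.normal(size=(n, d))  # h^e_1 ... h^e_n
h_dec = rng.normal(size=d)       # h^d_{i-1}, the previous decoder state

scores = H_enc @ h_dec           # score(h^d_{i-1}, h^e_j) = dot product
alphas = np.exp(scores - scores.max())
alphas /= alphas.sum()           # alpha_ij = softmax_j(scores)
c_i = alphas @ H_enc             # c_i = sum_j alpha_ij h^e_j
print(alphas, c_i)
```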
Summary
[Handwritten recap: the attention weights α_{ij} = softmax_j(h^d_{i-1} · h^e_j) and the context c_i = Σ_j α_{ij} h^e_j, illustrated on an English-to-Hindi translation example.]