Neural Language Model, RNNs

Pawan Goyal

CSE, IIT Kharagpur

CS60010


Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 1 / 30


Language Modeling
Language Modeling is the task of predicting what word comes next.

Goal: Compute the probability of a sentence or sequence of words:

P(W) = P(w_1, w_2, w_3, ..., w_n)



Related Task: probability of an upcoming word:

P(w_4 | w_1, w_2, w_3)
A model that computes either of these is called a language model
Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 2 / 30
Language Modeling

You can also think of a language model as a system that assigns a probability to a piece of text.
For example, if we have some text x^(1), ..., x^(T), then the probability of this text (according to the Language Model) is:

P(x^(1), ..., x^(T)) = P(x^(1)) × P(x^(2) | x^(1)) × ... × P(x^(T) | x^(T-1), ..., x^(1)) = ∏_{t=1}^{T} P(x^(t) | x^(t-1), ..., x^(1))
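To make the chain-rule decomposition concrete, here is a minimal sketch (mine, not from the slides) that scores a sentence under a toy bigram table; for simplicity it conditions on only the previous word, and the probability values are made up purely for illustration.

```python
import numpy as np

# Toy bigram probabilities P(w_t | w_{t-1}); values are made up for illustration.
bigram_prob = {
    ("<s>", "the"): 0.4, ("the", "students"): 0.1,
    ("students", "opened"): 0.05, ("opened", "their"): 0.3,
    ("their", "books"): 0.2,
}

def sentence_log_prob(words):
    """log P(w_1, ..., w_n) = sum_t log P(w_t | w_{t-1}) under a bigram model."""
    logp = 0.0
    for prev, curr in zip(["<s>"] + words[:-1], words):
        logp += np.log(bigram_prob.get((prev, curr), 1e-8))  # back off to a tiny probability
    return logp

print(sentence_log_prob(["the", "students", "opened", "their", "books"]))
```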

Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 3 / 30


You use language models every day!

Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 4 / 30


You use language models every day!

Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 5 / 30


Why should we care about language modeling?

Language Modeling is a benchmark task that helps us measure our progress on understanding language.
Language Modeling is fundamental to many NLP tasks, especially those
involving generating text or estimating the probability of text:

Predictive typing
Speech recognition
Machine translation
Chatbots
Handwriting recognition
Spelling/grammar correction

Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 6 / 30
n-gram language models
[Slide: estimating probabilities such as P(books | students opened their) from n-gram counts]
Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 7 / 30
n-gram language models

[Slide: a 4-gram language model example]
Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 8 / 30


n-gram language models: Example

[Worked example: estimating a conditional probability from n-gram counts in a corpus, e.g., a context count such as 1000 yielding an estimate such as 0.02]
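Since the worked numbers on this slide did not survive extraction, here is a hedged sketch of how such an estimate is computed: count the n-gram and its prefix in a corpus and take their ratio (maximum-likelihood estimation). The toy corpus below is illustrative only.

```python
from collections import Counter

corpus = "the students opened their books the students opened their minds".split()

def ngram_counts(tokens, n):
    """Count all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

four_grams = ngram_counts(corpus, 4)
tri_grams = ngram_counts(corpus, 3)

# MLE estimate: P(books | students opened their)
#   = count(students opened their books) / count(students opened their)
context = ("students", "opened", "their")
p_books = four_grams[context + ("books",)] / tri_grams[context]
print(p_books)  # 0.5 with this toy corpus
```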
Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 9 / 30
Storage Problems with n-gram Language Model


Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 10 / 30


A fixed-window neural language model
[Slide: a fixed-window neural LM over a context such as "students opened their": the one-hot words are mapped to d-dimensional embeddings, concatenated into a 3d-dimensional input, passed through a hidden layer, and fed to a softmax over the vocabulary]
Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 11 / 30


A fixed-window neural language model

Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 12 / 30


How do we obtain word representations?

In traditional NLP / IR, words are treated as discrete symbols.

One-hot representation
Words are represented as one-hot vectors: one 1, the rest 0s

What is the problem?


Vector dimension = number of words in vocabulary (e.g., 500,000)
The vectors are orthogonal, and there is no natural notion of similarity between one-hot vectors!
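A quick sketch (not from the slides) that illustrates the problem: the one-hot vectors of any two distinct words have zero dot product, so "hotel" looks no more similar to "motel" than to "banana".

```python
import numpy as np

vocab = ["hotel", "motel", "banana"]

def one_hot(word, vocab):
    """One-hot vector: dimension = |V|, a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

print(np.dot(one_hot("hotel", vocab), one_hot("motel", vocab)))   # 0.0
print(np.dot(one_hot("hotel", vocab), one_hot("banana", vocab)))  # 0.0: no notion of similarity
```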



Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 13 / 30


Word2Vec – A distributed representation
Distributional representation (word embedding)
Any word w_i in the corpus is given a distributional representation by an embedding w_i ∈ R^d, i.e., a d-dimensional vector, which is mostly learnt!


Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 14 / 30


Distributional Representation: Illustration
If we label the dimensions in a hypothetical word vector (there are no such
pre-assigned labels in the algorithm of course), it might look a bit like this:

[Figure: a hypothetical word vector with labelled dimensions]
Such a vector represents the ‘meaning’ of a word in some abstract way

Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 15 / 30


Learning Word Vectors: Overview

Supervised learning uses training data with human-generated labels; unsupervised learning uses just the data, with no labels; self-supervised learning generates the labels from the data itself.
Basic Idea: Use self-supervision
We have a large corpus of text
Every word in a fixed vocabulary is represented by a vector
Go through each position t in the text, which has a center word c and
context (“outside”) words o
Use the similarity of the word vectors for c and o to calculate the
probability of o given c (or vice versa)
Keep adjusting the word vectors to maximize this probability

(The word vectors themselves are the parameters being learnt.)
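To make the center-word / outside-word loop concrete, here is a small illustrative sketch (window size and corpus are my own choices) that walks over each position t and yields the (center, context) pairs used as self-supervised training examples.

```python
corpus = "we will use the similarity of the word vectors".split()
window = 2  # context words on each side; illustrative choice

def skipgram_pairs(tokens, window):
    """Yield (center, context) pairs for every position t in the text."""
    for t, center in enumerate(tokens):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                yield center, tokens[t + j]

for c, o in list(skipgram_pairs(corpus, window))[:6]:
    print(f"P({o} | {c})")
```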

Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 16 / 30


Word2Vec (Skip-gram) Overview

Example windows and process for computing P(w_{t+j} | w_t)

Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 17 / 30


Word2Vec Overview

Example windows and process for computing P(w_{t+j} | w_t)

Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 18 / 30


Word2Vec: objective function

We want to minimize the loss function:

J(θ) = -(1/T) Σ_{t=1}^{T} Σ_{-m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t; θ)

How to calculate P(w_{t+j} | w_t; θ)?

We will use two vectors per word w:

v_w when w is a center word
u_w when w is a context word

Then, for a center word c and a context word o:

P(o | c) = exp(u_o^T v_c) / Σ_{w ∈ V} exp(u_w^T v_c)
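A minimal numerical sketch of this formula, assuming a tiny vocabulary and random vectors: the dot products u_w^T v_c are exponentiated and normalised over the vocabulary, i.e., a softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3                      # toy vocabulary size and embedding dimension
U = rng.normal(size=(V, d))      # context ("outside") vectors u_w, one row per word
v_c = rng.normal(size=d)         # center word vector v_c

scores = U @ v_c                              # u_w^T v_c for every word w
p = np.exp(scores) / np.exp(scores).sum()     # P(o | c) for every candidate o
print(p, p.sum())                             # a probability distribution over V words
```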

Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 19 / 30


Understanding P(o|c) further

P(o | c) = exp(u_o^T v_c) / Σ_{w ∈ V} exp(u_w^T v_c)
The dot product u_o^T v_c measures the similarity of the context word o and the center word c; exponentiation makes every term positive; and dividing by the sum over the whole vocabulary normalises the result into a probability distribution. This is exactly the softmax function.

Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 20 / 30


[Board sketch: skip-gram viewed as a network. The one-hot center word selects its d-dimensional vector v_c from the |V| × d matrix of center-word vectors (the hidden layer); v_c is scored against every context vector u_w and passed through a softmax, giving exp(u_o^T v_c) / Σ_{w ∈ V} exp(u_w^T v_c)]
Try this problem

Skip-gram
Suppose you are computing the word vectors using Skip-gram architecture.
You have 5 words in your vocabulary,
{passed, through, relu, activation, function} in that order and suppose you
have the window, ‘through relu activation’ in your corpora. You use this window
with ‘relu’ as the center word and one word before and after the center word as
your context.

Compute the loss


Also, suppose that for each word, you have 2-dim in and out vectors, which
have the same value at this point given by [1,-1],[1,1],[-2,1],[0,1],[1,0] for the 5
words, respectively. As per the Skip-gram architecture, the loss corresponding
to the target word “activation” would be log(x). What is the value of x?
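A hedged sketch for checking the arithmetic numerically; it assumes, as in the slides' objective, that the loss for one target word is -log P(o | c), and since the loss equals log(x), it prints x = exp(loss).

```python
import numpy as np

# Shared in/out vectors for {passed, through, relu, activation, function}, from the exercise.
vecs = np.array([[1, -1], [1, 1], [-2, 1], [0, 1], [1, 0]], dtype=float)
words = ["passed", "through", "relu", "activation", "function"]

v_c = vecs[words.index("relu")]            # center word vector
scores = vecs @ v_c                        # u_w^T v_c for every word (in = out vectors here)
probs = np.exp(scores) / np.exp(scores).sum()

p_target = probs[words.index("activation")]
loss = -np.log(p_target)                   # skip-gram loss for the target "activation"
print(loss, np.exp(loss))                  # loss = log(x), so x = exp(loss)
```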

Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 21 / 30


Homework

Compute the partial derivative of the loss with respect to v_c

Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 22 / 30


A fixed-window neural language model: Pros and Cons

[Slide: pros and cons of the fixed-window neural language model (the window size is fixed)]

Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 23 / 30
Recurrent Neural Networks

Core Idea
Apply the same weights repeatedly!

Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 24 / 30


Recurrent Neural Networks

We can process a sequence of vectors x by applying a recurrence formula at each step:

h_t = f_W(h_{t-1}, x_t)

Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 25 / 30


RNN as a feed-forward network

[Board sketch: the RNN drawn as a feed-forward network, with weight shapes W: d_h × d_in, U: d_h × d_h, V: d_out × d_h]

h_t = f(U h_{t-1} + W x_t + b)
Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 26 / 30
Forward Propagation

h_t = g(U h_{t-1} + W x_t)

y_t = softmax(V h_t)

Let the dimensions of the input, hidden and output be d_in, d_h and d_out, respectively
The three parameter matrices: W: d_h × d_in, U: d_h × d_h, V: d_out × d_h
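A minimal numpy sketch of these two equations, with illustrative dimensions and random parameter values, just to make the shapes concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out = 4, 3, 5          # illustrative dimensions

W = rng.normal(size=(d_h, d_in))    # input-to-hidden
U = rng.normal(size=(d_h, d_h))     # hidden-to-hidden (recurrent)
V = rng.normal(size=(d_out, d_h))   # hidden-to-output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(h_prev, x_t):
    """h_t = g(U h_{t-1} + W x_t); y_t = softmax(V h_t)."""
    h_t = np.tanh(U @ h_prev + W @ x_t)
    y_t = softmax(V @ h_t)
    return h_t, y_t

h = np.zeros(d_h)
for x in rng.normal(size=(6, d_in)):    # a sequence of 6 input vectors
    h, y = rnn_step(h, x)
print(y.shape, y.sum())                 # (5,) and 1.0
```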

Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 27 / 30


RNN Unrolled in Time
Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 28 / 30


Training an RNN language model

To train an RNN LM, we use self-supervision (or self-training)


We take a corpus of text as training material
At each time step t, we ask the model to predict the next word

Why is it called self-supervision?


We do not add any gold data; the natural sequence of words is its own supervision!
We simply train the model to minimize the error in predicting the true next
word in the training sequence
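A hedged sketch of this objective (tiny vocabulary and random parameters, purely illustrative): at each time step the model reads the current word and pays a cross-entropy penalty on the true next word.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<s>", "the", "students", "opened", "their", "books"]
V, d = len(vocab), 8

E = rng.normal(size=(V, d)) * 0.1   # word embeddings
W = rng.normal(size=(d, d)) * 0.1   # input-to-hidden
U = rng.normal(size=(d, d)) * 0.1   # hidden-to-hidden (recurrent)
O = rng.normal(size=(V, d)) * 0.1   # hidden-to-vocabulary

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

sentence = ["<s>", "the", "students", "opened", "their", "books"]
ids = [vocab.index(w) for w in sentence]

h, loss = np.zeros(d), 0.0
for t in range(len(ids) - 1):
    h = np.tanh(U @ h + W @ E[ids[t]])   # read word t
    p = softmax(O @ h)                   # predict a distribution over the next word
    loss += -np.log(p[ids[t + 1]])       # cross-entropy against the true next word
print(loss / (len(ids) - 1))             # average per-step loss
```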

Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 29 / 30


Training an RNN language model

[Diagram: the RNN LM unrolled in time during training, with a loss computed at each time step]
Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 30 / 30


Generating text with an RNN Language Model

RNN-based language models can be used for language generation (and
hence, for machine translation, dialog, etc.)
A language model can incrementally generate words by repeatedly
sampling the words conditioned on the previous choices – also known as
autoregressive generation.

Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 31 / 41


Autoregressive Generation with RNNs

All your parameters have already been trained.


Start with a special begin-of-sentence token <s> as input
Through forward propagation, obtain the probability distribution at the
output, and sample a word
Feed the word as input at the next time-step (its word vector)
Continue generating until the end-of-sentence token is sampled, or a fixed sentence length has been reached.
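A hedged sketch of this sampling loop with a tiny illustrative vocabulary and random parameters; the names and the stopping condition follow the steps above.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<s>", "</s>", "the", "students", "opened", "their", "books"]
V, d = len(vocab), 8
E = rng.normal(size=(V, d)) * 0.1   # word embeddings
W = rng.normal(size=(d, d)) * 0.1   # input-to-hidden
U = rng.normal(size=(d, d)) * 0.1   # hidden-to-hidden (recurrent)
O = rng.normal(size=(V, d)) * 0.1   # hidden-to-vocabulary

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def generate(max_len=10):
    """Start from <s>, sample a word, feed it back in, stop at </s> or max_len."""
    h, w, out = np.zeros(d), vocab.index("<s>"), []
    for _ in range(max_len):
        h = np.tanh(U @ h + W @ E[w])
        w = rng.choice(V, p=softmax(O @ h))   # sample from the output distribution
        if vocab[w] == "</s>":
            break
        out.append(vocab[w])
    return " ".join(out)

print(generate())
```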

Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 32 / 41


Autoregressive Generation with RNNs

Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 33 / 41


RNNs can be used for various other applications

Sequence labeling: Named Entity Recognition, Parts-of-Speech Tagging

Text Classification: Sentiment Analysis, Spam Detection
Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 34 / 41


RNNs for Sequence Labeling

Task
Assign a label chosen from a small fixed set of labels to each element of the
sequence
Inputs: Word embeddings
Outputs: Tag probabilities generated by the softmax layer

Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 35 / 41


RNNs for Sequence Labeling

Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 36 / 41


RNNs for Sequence Classification

Task
Classify the entire sequence rather than the tokens within it
Pass the text to be classified a word at a time, generating new hidden
states at each time step
The hidden state of the last token can be thought of as a compressed
representation of the entire sequence
This last hidden state is passed through a feed-forward network that
chooses a class via softmax
There are other options of combining information from all the hidden
states
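A minimal sketch of the "last hidden state as sequence representation" idea, with illustrative dimensions (e.g., 300-dimensional word vectors, a 50-dimensional hidden state) and a small softmax classifier on top; weights are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, n_classes = 300, 50, 2              # e.g., 300-d word vectors, 2 sentiment classes

W = rng.normal(size=(d_h, d_in)) * 0.01
U = rng.normal(size=(d_h, d_h)) * 0.01
C = rng.normal(size=(n_classes, d_h)) * 0.01   # feed-forward classifier weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(x_seq):
    """Run the RNN over the sequence and classify from the final hidden state."""
    h = np.zeros(d_h)
    for x_t in x_seq:
        h = np.tanh(U @ h + W @ x_t)
    return softmax(C @ h)                 # class probabilities for the whole sequence

print(classify(rng.normal(size=(7, d_in))))   # a 7-word "sentence" of random embeddings
```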

Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 37 / 41


RNNs for Sequence Classification

Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 38 / 41


Other Variations: Stacked RNNs

Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 39 / 41


Other Variations: Bidirectional RNNs

An RNN makes use of information from the left (prior) context to predict at time t
In many applications, the entire sequence is available; so it makes sense
to also make use of the right context to predict at time t
Bidirectional RNNs combine two independent RNNs, one where the input
is processed from left to right (forward RNN), and another from the end to the start (backward RNN).
h^f_t = RNN_forward(x_1, ..., x_t)
h^b_t = RNN_backward(x_n, ..., x_t)
h_t = [h^f_t; h^b_t]
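A short sketch of the concatenation above (assumed shapes, random weights): run one RNN left to right, another right to left, and stack the two hidden states at each position.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 3

def run_rnn(xs, W, U):
    """Return the hidden state at every position for one direction."""
    h, hs = np.zeros(d_h), []
    for x in xs:
        h = np.tanh(U @ h + W @ x)
        hs.append(h)
    return hs

Wf, Uf = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))   # forward RNN weights
Wb, Ub = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))   # backward RNN weights

xs = list(rng.normal(size=(5, d_in)))
h_fwd = run_rnn(xs, Wf, Uf)                    # forward RNN over x_1 ... x_n
h_bwd = run_rnn(xs[::-1], Wb, Ub)[::-1]        # backward RNN, re-aligned to positions 1..n

h = [np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)]   # h_t = [h^f_t ; h^b_t]
print(h[0].shape)                              # (6,) = 2 * d_h
```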
Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 40 / 41


Other Variations: Bidirectional RNNs

Pawan Goyal (IIT Kharagpur) Neural Language Model, RNNs CS60010 41 / 41


RNNs: Other Applications, LSTMs

Pawan Goyal

CSE, IIT Kharagpur

CS60010

Pawan Goyal (IIT Kharagpur) RNNs: Other Applications, LSTMs CS60010 1 / 25


" Froxo 50x
Y >
-

U
50
: :

>
- - -
> he Va he
°
-

ho
~

>
-

#
↑ 300
E
E
:

:
50x50

300450
3
s da
-
-

4.
[in]
=

Ex50
:
,
of classes
5-6 weeks

/Train
-

-Mi-
zi ne
.

-Endsch

& Assignment Project hum

&
I

fe35mini
K
terliest
George moni
*
quig

quigals edo
After a
%

20 -
L
>
-

-
Using Bidirectional RNNs for Sequence Classification


Pawan Goyal (IIT Kharagpur) RNNs: Other Applications, LSTMs CS60010 2 / 25


Need for better units: Vanishing Gradient


Pawan Goyal (IIT Kharagpur) RNNs: Other Applications, LSTMs CS60010 3 / 25


Effect of vanishing gradient on RNN LM

Pawan Goyal (IIT Kharagpur) RNNs: Other Applications, LSTMs CS60010 4 / 25


Effect of vanishing gradient on RNN LM
Pawan Goyal (IIT Kharagpur) RNNs: Other Applications, LSTMs CS60010 5 / 25


How to fix the vanishing gradient problem?

The main problem is that it is too difficult for the RNN to learn to preserve
information over many timesteps.
In a vanilla RNN, the hidden state is constantly being rewritten

h^(t) = tanh(U h^(t-1) + W x^(t))
How about better RNN units?


Pawan Goyal (IIT Kharagpur) RNNs: Other Applications, LSTMs CS60010 6 / 25


Using Gates for better RNN units

The gates are also vectors


On each timestep, each element of the gates can be open (1), closed (0), or somewhere in between.
The gates are dynamic: their value is computed based on the current
context.

Two famous architectures


GRUs, LSTMs

Pawan Goyal (IIT Kharagpur) RNNs: Other Applications, LSTMs CS60010 7 / 25


Long Short Term Memory (LSTM)
[Diagram: an LSTM cell, with hidden state h_t and context (cell) vector c_t carried across time steps]
Pawan Goyal (IIT Kharagpur) RNNs: Other Applications, LSTMs CS60010 8 / 25
LSTM: More Details

For context management, an explicit context layer is added to the


architecture
It makes use of specialized neural units (gates) to control the flow of
information
The gates share a common design: the sigmoid pushes its output towards 0 or 1, so it acts as a (soft) binary mask.

Pawan Goyal (IIT Kharagpur) RNNs: Other Applications, LSTMs CS60010 9 / 25


LSTM: In Equations
Forget Gate
Controls what is kept vs forgotten from the context

f_t = σ(U_f h_{t-1} + W_f x_t)

Input Gate
Controls what parts of new cell content are written to the context

i_t = σ(U_i h_{t-1} + W_i x_t)

Output Gate
Controls what part of context are output to hidden state

o_t = σ(U_o h_{t-1} + W_o x_t)

New Cell content: g_t = tanh(U_g h_{t-1} + W_g x_t)

New Context Vector: c_t = i_t ⊙ g_t + f_t ⊙ c_{t-1}
New Hidden State: h_t = o_t ⊙ tanh(c_t)
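A direct numpy transcription of the six equations above, as a sketch with illustrative dimensions and random parameters rather than a tuned implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 3
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# One (U, W) pair per gate / cell candidate: forget, input, output, g
params = {k: (rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_in))) for k in "fiog"}

def lstm_step(h_prev, c_prev, x_t):
    f = sigmoid(params["f"][0] @ h_prev + params["f"][1] @ x_t)   # forget gate
    i = sigmoid(params["i"][0] @ h_prev + params["i"][1] @ x_t)   # input gate
    o = sigmoid(params["o"][0] @ h_prev + params["o"][1] @ x_t)   # output gate
    g = np.tanh(params["g"][0] @ h_prev + params["g"][1] @ x_t)   # new cell content
    c = i * g + f * c_prev                                        # new context vector
    h = o * np.tanh(c)                                            # new hidden state
    return h, c

h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(6, d_in)):
    h, c = lstm_step(h, c, x)
print(h, c)
```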
Pawan Goyal (IIT Kharagpur) RNNs: Other Applications, LSTMs CS60010 10 / 25
How does LSTM solve vanishing gradients?


The LSTM architecture makes it easier for the RNN to preserve information
over many timesteps
e.g., if the forget gate is set to remember everything on every timestep,
then the info in the cell is preserved indefinitely
By contrast, it is harder for a vanilla RNN to learn a recurrent weight matrix U that preserves info in the hidden state
Pawan Goyal (IIT Kharagpur) RNNs: Other Applications, LSTMs CS60010 11 / 25
Common RNN NLP Architectures

[Diagram: common RNN architectures, e.g., an encoder-decoder used for translation]
Pawan Goyal (IIT Kharagpur) RNNs: Other Applications, LSTMs CS60010 12 / 25


Encoder-decoder networks

Also known as sequence-to-sequence networks; they are capable of generating contextually appropriate, arbitrary-length output sequences given the input sequence.

Three conceptual components

An encoder that accepts an input sequence x_{1:n} and generates a corresponding sequence of contextualized representations h_{1:n}

A context vector, c, which is a function of h_{1:n} and conveys the essence of the input to the decoder

A decoder which accepts c as input and generates an arbitrary-length sequence of hidden states h_{1:m}, from which the corresponding output states y_{1:m} can be obtained
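A compact sketch of the three components with assumed shapes and random weights, just to make the data flow concrete; the simplifications are noted in the comments.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
Ue, We = rng.normal(size=(d, d)), rng.normal(size=(d, d))   # encoder RNN weights
Ud, Wd = rng.normal(size=(d, d)), rng.normal(size=(d, d))   # decoder RNN weights

def encode(xs):
    """Encoder: produce contextualized states h_1 ... h_n for the input sequence."""
    h, hs = np.zeros(d), []
    for x in xs:
        h = np.tanh(Ue @ h + We @ x)
        hs.append(h)
    return hs

def decode(c, m):
    """Decoder: start from the context vector c and unroll m hidden states."""
    h, ys = c, []
    for _ in range(m):
        h = np.tanh(Ud @ h + Wd @ h)   # simplified: previous state fed back as the input
        ys.append(h)
    return ys

h_enc = encode(rng.normal(size=(5, d)))
c = h_enc[-1]                          # simplest choice: context = last encoder state
print(len(decode(c, 3)))
```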

Pawan Goyal (IIT Kharagpur) RNNs: Other Applications, LSTMs CS60010 13 / 25


Encoder-decoder networks

[Diagram: encoder-decoder network; the input is a text sequence, the output is a text sequence, linked by the context vector c]
Pawan Goyal (IIT Kharagpur) RNNs: Other Applications, LSTMs CS60010 14 / 25


Encoder-decoder networks for translation
[Diagram: encoder-decoder for translation, with encoder states h_1 ... h_n and decoder states s_i]
Pawan Goyal (IIT Kharagpur) RNNs: Other Applications, LSTMs CS60010 15 / 25


Training the Encoder-decoder model

End-to-end training
For MT, the training data typically consists of set of sentences and their
translations
The network is given a source sentence and then a separator token, it is
trained auto-regressively to predict the next word
Teacher forcing is used during training, i.e., the system is forced to use the gold target token from training as the next input x_{t+1}, rather than relying on the last decoder output ŷ_t


(Teacher forcing is used only during training; at inference time the decoder must rely on its own previous output.)
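A hedged, toy-sized sketch of teacher forcing in the decoder loop (the vocabulary, shapes and helper names are my own illustrative choices): at training time the gold token is fed in as the next input regardless of what the model predicted.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<sep>", "</s>", "ich", "bin", "hier"]   # illustrative target vocabulary
V, d = len(vocab), 6
E = rng.normal(size=(V, d)) * 0.1   # target-word embeddings
U = rng.normal(size=(d, d)) * 0.1   # decoder recurrent weights
W = rng.normal(size=(d, d)) * 0.1   # input-to-hidden weights
O = rng.normal(size=(V, d)) * 0.1   # hidden-to-vocabulary weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decoder_training_loss(c, gold):
    """Teacher forcing: the input at step t+1 is the gold token, not the model's own sample."""
    h, loss, prev = c, 0.0, vocab.index("<sep>")
    for target in gold:
        h = np.tanh(U @ h + W @ E[prev])
        p = softmax(O @ h)
        loss += -np.log(p[vocab.index(target)])   # penalise the gold next token
        prev = vocab.index(target)                # feed the gold token back in (teacher forcing)
    return loss

context = rng.normal(size=d)                      # stands in for the encoder's context vector c
print(decoder_training_loss(context, ["ich", "bin", "hier", "</s>"]))
```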
Pawan Goyal (IIT Kharagpur) RNNs: Other Applications, LSTMs CS60010 16 / 25
Training the Encoder-decoder model


Pawan Goyal (IIT Kharagpur) RNNs: Other Applications, LSTMs CS60010 17 / 25


Encoder-decoder: Bottleneck

The context vector, h_n, is the hidden state of the last time step of the source text
It acts as a bottleneck, as it has to represent absolutely everything about the meaning of the source text; this is the only thing the decoder knows about the source text
Pawan Goyal (IIT Kharagpur) RNNs: Other Applications, LSTMs CS60010 18 / 25
Encoder-decoder with attention

[Diagram: encoder-decoder with attention; at each decoding step an attention distribution is computed over the encoder states]
Pawan Goyal (IIT Kharagpur) RNNs: Other Applications, LSTMs CS60010 19 / 25


Attention: In Equations

The context vector ci is generated anew with each decoding step i


h^d_i = g(ŷ_{i-1}, h^d_{i-1}, c_i)
Pawan Goyal (IIT Kharagpur) RNNs: Other Applications, LSTMs CS60010 20 / 25
Attention: In Equations

Computing c_i
Compute how much to focus on each encoder state, by seeing how relevant it is to the decoder state captured in h^d_{i-1} – give it a score

Simplest scoring mechanism is dot-product attention

score(h^d_{i-1}, h^e_j) = h^d_{i-1} · h^e_j

Normalize these scores using softmax to create a vector of weights

α_ij = softmax(score(h^d_{i-1}, h^e_j))

A fixed-length context vector is created for the current decoder state

c_i = Σ_j α_ij h^e_j
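A numpy sketch of these three equations with illustrative shapes: score each encoder state against the previous decoder state, softmax the scores into weights, and take the weighted sum.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 4
h_enc = rng.normal(size=(n, d))        # encoder states h^e_1 ... h^e_n
h_dec_prev = rng.normal(size=d)        # previous decoder state h^d_{i-1}

scores = h_enc @ h_dec_prev                          # dot-product attention scores
alpha = np.exp(scores) / np.exp(scores).sum()        # softmax -> attention weights alpha_ij
c_i = alpha @ h_enc                                  # fixed-length context vector c_i
print(alpha, c_i.shape)
```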

Pawan Goyal (IIT Kharagpur) RNNs: Other Applications, LSTMs CS60010 21 / 25


Attention is quite helpful

Attention improves NMT performance


It is useful to allow decoder to focus on certain parts of the source

Attention helps with the long-term dependency problem


Provides shortcut to faraway states

Attention provides some interpretability


By inspecting attention distribution, we can see what the decoder was
focusing on
We get alignment for free even if we never explicitly trained an alignment
system

Pawan Goyal (IIT Kharagpur) RNNs: Other Applications, LSTMs CS60010 22 / 25


Example: Machine Translation

Neural Machine Translation by Jointly Learning to Align and Translate, ICLR 2015

Pawan Goyal (IIT Kharagpur) RNNs: Other Applications, LSTMs CS60010 23 / 25


Example: Text Summarization

A Neural Attention Model for Sentence Summarization, EMNLP 2015


Pawan Goyal (IIT Kharagpur) RNNs: Other Applications, LSTMs CS60010 24 / 25
Summary

Attention has proved to be a very impactful idea in NLP


Lots of new models are based on self-attention, e.g., the Transformer and BERT

Pawan Goyal (IIT Kharagpur) RNNs: Other Applications, LSTMs CS60010 25 / 25

