
Deep Learning for NLP:

Neural Models

Ashish Anand
Professor, Dept. of CSE, IIT Guwahati
Associated Faculty, Mehta Family School of Data Science and AI, IIT Guwahati
Outline: Introduction to NLP

• Neural Language Models


• Vector Semantics
• CNN Models for Classification
• RNN Models for NLP Tasks
NEURAL LANGUAGE MODEL
Pre-Transformer Era
Feed-Forward Neural Language Model

Bengio et al, JMLR 03
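Added note (not part of the original slides): a minimal NumPy sketch of the forward pass of a Bengio-style feed-forward language model, assuming illustrative sizes and randomly initialized weights. The context embeddings are concatenated, passed through a tanh hidden layer, and mapped to a softmax over the vocabulary.

import numpy as np

rng = np.random.default_rng(0)
V, d, context, hidden = 10_000, 64, 3, 128       # illustrative sizes (assumptions)

C = rng.normal(size=(V, d))                      # word embedding table
H = rng.normal(size=(hidden, context * d))       # input-to-hidden weights
U = rng.normal(size=(V, hidden))                 # hidden-to-output weights
b1, b2 = np.zeros(hidden), np.zeros(V)

def next_word_probs(context_ids):
    """P(w_t | previous `context` words) for an n-gram-style neural LM."""
    x = C[context_ids].reshape(-1)               # concatenated context embeddings
    h = np.tanh(H @ x + b1)                      # tanh hidden representation
    logits = U @ h + b2
    e = np.exp(logits - logits.max())            # numerically stable softmax
    return e / e.sum()

probs = next_word_probs([12, 45, 7])             # hypothetical ids for a 3-word context
print(probs.shape, probs.sum())                  # (10000,) 1.0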


Advantages over statistical n-gram models
• Better flexibility in considering a larger context
Advantages over statistical n-gram models
• Better generalizability
• Can generalize to contexts not seen during training
• Example: need to estimate P(reading | Ram is)
• Suppose "Ram is reading" does not occur in the training data, but "John is reading" does, along with sentences such as "Ram is writing" and "John is writing"
• The word representations learned by the model for "Ram" and "John" are then likely to be similar, so the model will assign a probability to P(reading | Ram is) similar to P(reading | John is)
Major drawbacks

• Inefficient
• Unable to exploit sequential nature of text
• Limited Context
• Unidirectional
HANDLING THE DRAWBACKS
Inefficiency: Hierarchical Softmax

Neural Network Lectures by Hugo Larochelle
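Added illustration (not from the lecture; a rough sketch of the general idea): with hierarchical softmax, each word is a leaf of a binary tree and its probability is a product of sigmoid decisions along the root-to-leaf path, so scoring one word costs O(log |V|) instead of the O(|V|) of a flat softmax. The node vectors, paths, and sizes below are made up.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hierarchical_word_prob(h, path_nodes, path_signs, node_vecs):
    """P(word | context) as a product of binary decisions along the word's tree path.

    h          : context/hidden vector, shape (d,)
    path_nodes : indices of the internal nodes on the root-to-leaf path
    path_signs : +1 for "go left", -1 for "go right" at each node (a common convention)
    node_vecs  : vectors of all internal nodes, shape (num_internal_nodes, d)
    """
    p = 1.0
    for node, sign in zip(path_nodes, path_signs):
        p *= sigmoid(sign * node_vecs[node] @ h)
    return p

rng = np.random.default_rng(0)
h = rng.normal(size=8)                      # toy context vector
node_vecs = rng.normal(size=(15, 8))        # toy tree with 15 internal nodes
print(hierarchical_word_prob(h, path_nodes=[0, 1, 4], path_signs=[+1, -1, +1]))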


Vector Semantics

• Study of vector representation of words


• A model to represent the meaning of words

• Question: what are the different aspects of meaning?

• Answer: lexical semantics, the linguistic study of word meaning
DESIDERATA OF WORD MEANING
Polysemy: Multiple senses of words

• Examples
• Class : Teaching group / Economic Group / Rank
• Right: Correct / A direction
• Mouse: Animal / a specific computer peripheral
• Each of these multiple meanings is called a word sense
Synonymy: Similar Sense

• Relation: Synonyms/Antonyms
• Synonym: one word has a sense whose meaning is identical to a
sense of another word, or nearly identical
• Examples: couch/sofa; vomit/throw up
• Antonym: meanings are opposite
• Examples: long/short; big/little; fast/slow; rise/fall
Word Similarity: Relations beyond synonyms/antonyms

• Word Similarity
• Words with similar meanings but not synonyms
• Cat / Dog

SimLex-999 dataset (Hill et al., 2015):

word1     word2        similarity
vanish    disappear    9.8
behave    obey         7.3
belief    impression   5.95
muscle    bone         3.65
modest    flexible     0.98
hole      agreement    0.3
Adapted from Jurafsky and Martin’s slide
Word Relatedness / Word Association

• Words are related by semantic frame or field


• Cat, Dog : similar
• Student, Teacher: related but not similar
• Relatedness: co-participation in a shared event

Adapted from Jurafsky and Martin’s slide


Semantic Field

• Set of words covering a particular semantic domain, and


• Have structured relations among them
• Example
• University
• Teacher, student, study, class, lecture, assignment, project
• House
• Room, door, furniture, bedroom
• Topic Modeling: example of semantic field
Semantic Frame

• Semantic Frames and Roles


• Set of words denoting perspectives or participants in a particular
event type
• Different agents playing distinct roles in a single event
• Teaching/Learning
• Doctor/Patient
• Buyer/Seller
Taxonomic Relations
One sense is a subordinate/hyponym of another if the first sense is
more specific, denoting a subclass of the other
• car is a subordinate of vehicle
• mango is a subordinate of fruit
Conversely, superordinate / hypernym
• vehicle is a superordinate of car
• fruit is a superordinate of mango

Superordinate:   vehicle   fruit   furniture
Subordinate:     car       mango   chair
Adapted from Jurafsky and Martin’s slide
Connotation: Affective Meaning

• Aspects of a word's meaning that are related to a writer's or reader's emotions, opinions, or evaluations
• Happy: Positive connotations vs Sad: Negative connotations
• Great: Positive evaluation vs Terrible: negative evaluation
Connotation: Three-dimensional vector representation
• Three important dimensions of affective meaning (Osgood et al., 1957)
• Valence: pleasantness of the stimulus (happy vs annoyed)
• Arousal: intensity of emotion provoked by the stimulus (excited vs calm)
• Dominance: degree of control exerted by the stimulus (controlling vs awed)
In Summary

• Words
• Have multiple senses, leading to complex relations between words
• Synonymy / Antonymy
• Similarity
• Relatedness
• Taxonomic Relations: Hypernym/Hyponym
• Connotation

• The challenge is how to obtain an appropriate representation


Distributional hypothesis: radically
different approach
• Ludwig Wittgenstein
• Linguist / Philosopher of language
• Meaning of a word is its use in language
• Joos, Harris and Firth
• Define a word by the distribution it occurs in language use
Context determines meaning of
words
• Harris (1954)
• "Oculist and eye-doctor … occur in almost the same environments"
• Generalize it: "If A and B have almost identical environments … we
say that they are synonyms"

• Firth (1957)
• "You shall know a word by the company it keeps!"
Context determines meaning of words

A bottle of tesguino is on the table.
Everybody likes tesguino.
Tesguino makes you drunk.
We make tesguino out of corn.

From these contexts alone, a reader can infer that tesguino is an alcoholic beverage made from corn: words that occur in similar contexts tend to have similar meanings.
Broad categories of vector space
models
• Long and Sparse vector representation
• Co-occurrence matrix based methods (term-doc, term-term matrices based on
MI, tf-idf etc.)

• Short and Dense vector representation


• Dimensionality reduction techniques such as Singular value decomposition
(Latent Semantic Analysis) on co-occurrence matrix
• Neural language inspired models (skip-grams, CBOW)
• GloVe

• Other Methods
• Clustering methods: Brown Clusters [Collins lecture]
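Added illustration (not from the slides): a toy contrast between the two families above. A term-term co-occurrence matrix gives long, sparse vectors; a truncated SVD of that matrix gives short, dense vectors in the spirit of Latent Semantic Analysis. The corpus and window size are invented for the example.

import numpy as np

corpus = [
    "we make tesguino out of corn",
    "everybody likes tesguino",
    "a bottle of tesguino is on the table",
]
window = 2                                   # symmetric context window (assumption)

# Long and sparse: term-term co-occurrence counts within +/- `window` words.
vocab = sorted({w for sent in corpus for w in sent.split()})
idx = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                X[idx[w], idx[words[j]]] += 1

# Short and dense: keep the top-k singular directions (LSA-style embedding).
U, S, Vt = np.linalg.svd(X)
k = 3
dense = U[:, :k] * S[:k]                     # one k-dimensional vector per word
print(dense[idx["tesguino"]])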
GLOBAL VECTOR
GloVe

• Main Idea: Use global co-occurrence statistics and linear relationships

• Co-occurrence matrix from Term-Context matrix


GloVe: Notation
GloVe: Intuition

• The distributional relationship between two words is examined with the help of a probe/context word

Source: Pennington et al. (GloVe: Global Vectors for Word Representation)


GloVe
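Added note (the figure for this slide is not reproduced): for reference, the weighted least-squares objective from Pennington et al. (2014) that GloVe minimizes is

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

where X_{ij} is the co-occurrence count of words i and j, w_i and \tilde{w}_j are word and context vectors with biases b_i and \tilde{b}_j, and f is a weighting function that down-weights rare and very frequent co-occurrences (in the paper, f(x) = (x / x_{max})^{\alpha} for x < x_{max} and 1 otherwise).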
Outline

• Convolutional Neural Network (CNN/ConvNet)


• Motivation
• 1d and 2d Convolution
• CNN for Text (Sequence Data)
• More Terminologies
• An Example

• RNN
• Vanilla RNN
Convolution: Motivation

• Two Questions –

• What filters or feature extractors were used to extract image features,


specifically for image classification task?

• What were the common set of handcrafted features in NLP domain?


Convolution: Motivation in Image
Classification Task

• LeCun and Bengio, 1995

• Object Detectors

Source:
1. https://www.cs.columbia.edu/education/courses/course/COMSW4995-7/26050/
2. https://towardsdatascience.com/a-beginners-guide-to-convolutional-neural-networks-cnns-14649dbddce8
Convolution: Motivation

• NLP: Text Classification


• n-grams as features
• Example: “The movie was based on a true story” -> “the movie was”, “movie
was based” and so on
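Added illustration (not from the slides): a tiny Python sketch of extracting the word n-grams that such filters operate over.

def ngrams(sentence, n=3):
    """Return all contiguous word n-grams of a sentence."""
    words = sentence.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("The movie was based on a true story", n=3))
# ['The movie was', 'movie was based', 'was based on', 'based on a',
#  'on a true', 'a true story']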
Convolution: Motivation

• Capture local predictor/features

• Avoid dense or fully connected networks

• Translation-invariant (important specifically in image)


Convolution: 2d convolution in image

• Grid data: 2d convolution

Input Image (4 x 4):
1 0 1 1
0 1 1 0
1 0 1 0
0 1 0 1

Filter (Kernel) (3 x 3):
1 0 1
0 0 0
0 1 0

Output Image (2 x 2):
2 2
2 1
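Added illustration (not part of the original slides): a minimal NumPy sketch that reproduces the output above by sliding the 3 x 3 kernel over the 4 x 4 image (convolution as used in CNNs, i.e. cross-correlation without flipping the kernel).

import numpy as np

image = np.array([[1, 0, 1, 1],
                  [0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1]])
kernel = np.array([[1, 0, 1],
                   [0, 0, 0],
                   [0, 1, 0]])

out_h = image.shape[0] - kernel.shape[0] + 1   # 2
out_w = image.shape[1] - kernel.shape[1] + 1   # 2
output = np.zeros((out_h, out_w), dtype=int)
for i in range(out_h):
    for j in range(out_w):
        # element-wise product of the kernel with the 3x3 window, then sum
        output[i, j] = np.sum(image[i:i + 3, j:j + 3] * kernel)

print(output)   # [[2 2]
                #  [2 1]]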


1d Convolution: Convolution for Text (Sequence)

the     0.3   0.2   0.1  -0.4
movie   0.5   0.4  -0.7  -0.1
is      0.4   0.2  -0.3  -0.2
based  -0.2  -0.1  -0.1   0.4
on     -0.6   0.5   0.1   0.1
a       0.7   0.3  -0.1   0.1
true    0.8   0.4   0.5   0.6
story   0.01  0.02  0.01 -0.4

Input sentence and embedded representation

Adapted from Stanford CS224n-2019 Lecture slides (Lecture 11)


1d Convolution: Convolution for Text (Sequence)

The embedded input above is convolved with one filter of width/size 3:

Filter (3 x 4):
2  1  -1   3
1  1   2  -1
1  2   2   3

Convolved feature:
the movie is      -1.3
movie is based     2.6
is based on        0.7
based on a         2.2
on a true          4.6
a true story       2.57

Adapted from Stanford CS224n-2019 Lecture slides (Lecture 11)


1d Convolution: Convolution for Text (Sequence)

Same input, filter, and convolved feature as on the previous slide, with the dimensions made explicit: sentence length n = 8, embedding dimension d = 4, filter width/size k = 3, so the convolved feature is a vector of size n - k + 1 = 6.

Adapted from Stanford CS224n-2019 Lecture slides (Lecture 11)
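Added illustration (not from the slides): a NumPy sketch of this 1d convolution, dotting the width-3 filter with every window of 3 consecutive word embeddings. It reproduces the numbers on the slides, e.g. -1.3 for "the movie is".

import numpy as np

# Word embeddings from the slides: one row per word, d = 4.
E = np.array([
    [ 0.3,  0.2,  0.1, -0.4],   # the
    [ 0.5,  0.4, -0.7, -0.1],   # movie
    [ 0.4,  0.2, -0.3, -0.2],   # is
    [-0.2, -0.1, -0.1,  0.4],   # based
    [-0.6,  0.5,  0.1,  0.1],   # on
    [ 0.7,  0.3, -0.1,  0.1],   # a
    [ 0.8,  0.4,  0.5,  0.6],   # true
    [ 0.01, 0.02, 0.01, -0.4],  # story
])
W = np.array([[2, 1, -1,  3],
              [1, 1,  2, -1],
              [1, 2,  2,  3]])  # one filter of width k = 3

k = W.shape[0]
n = E.shape[0]
feature = np.array([np.sum(E[i:i + k] * W) for i in range(n - k + 1)])
print(np.round(feature, 2))     # [-1.3  2.6  0.7  2.2  4.6  2.57]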


1d Convolution for Text with padding

The input is padded with a zero-embedding token ◊ at each end; the same width-3 filter (stride 1) now produces n - k + 1 + 2 = 8 values:

Convolved feature:
◊ the movie       0.7
the movie is     -1.3
movie is based    2.6
is based on       0.7
based on a        2.2
on a true         4.6
a true story      2.57
true story ◊      3.75

Input: the same sentence and embedded representation as above.

Adapted from Stanford CS224n-2019 Lecture slides (Lecture 11)


1d Convolution for Text with padding, stride = 2

Same padded input and width-3 filter, but the filter now moves two positions at a time:

Convolved feature:
◊ the movie       0.7
movie is based    2.6
based on a        2.2
a true story      2.57
story ◊ ◊        -1.17

Input: the same sentence and embedded representation as above.

Adapted from Stanford CS224n-2019 Lecture slides (Lecture 11)


Multi-channel 1d Convolution with padding, stride = 1

Three filters of width/size 3 are applied to the padded input, giving a convolved feature with one column (channel) per filter:

Filter 1:   2  1 -1  3   Filter 2:   3  1 -1  3   Filter 3:   2 -1  1  4
            1  1  2 -1               2 -1 -2 -1               1  1  1 -1
            1  2  2  3               1  2  2  1               1  2  1  3

Convolved feature:
◊ the movie       0.7   0.4   1.3
the movie is     -1.3   1.9  -0.9
movie is based    2.6   3.5   0.7
is based on       0.7   1.3  -0.5
based on a        2.2  -0.2   2.6
on a true         4.6   3.3   3.5
a true story      2.57  2.07  1.36
true story ◊      3.75  4.48  4.54

Input: the same sentence and embedded representation as above.

Adapted from Stanford CS224n-2019 Lecture slides (Lecture 11)
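Added illustration (not from the slides): the same computation with zero-padding rows for ◊ and the three filters stacked, producing the 8 x 3 convolved feature matrix shown above.

import numpy as np

# Embeddings from the slides (8 words x d = 4), padded with a zero row (◊) at each end.
E = np.array([[0.3, 0.2, 0.1, -0.4], [0.5, 0.4, -0.7, -0.1],
              [0.4, 0.2, -0.3, -0.2], [-0.2, -0.1, -0.1, 0.4],
              [-0.6, 0.5, 0.1, 0.1], [0.7, 0.3, -0.1, 0.1],
              [0.8, 0.4, 0.5, 0.6], [0.01, 0.02, 0.01, -0.4]])
E_pad = np.vstack([np.zeros((1, 4)), E, np.zeros((1, 4))])

# The three width-3 filters shown above, stacked into shape (3 filters, k=3, d=4).
filters = np.array([
    [[2, 1, -1, 3], [1, 1, 2, -1], [1, 2, 2, 3]],
    [[3, 1, -1, 3], [2, -1, -2, -1], [1, 2, 2, 1]],
    [[2, -1, 1, 4], [1, 1, 1, -1], [1, 2, 1, 3]],
])

k = 3
windows = np.stack([E_pad[i:i + k] for i in range(len(E_pad) - k + 1)])  # (8, 3, 4)
features = np.einsum('wkd,fkd->wf', windows, filters)                    # (8, 3)
print(np.round(features, 2))   # first row: [0.7, 0.4, 1.3], last: [3.75, 4.48, 4.54]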


Pooling over time: Obtaining a fixed-size vector

Global max pooling takes the maximum of each channel (column) of the 8 x 3 convolved feature from the previous slide, over all positions:

Max pooling:  s = [4.6, 4.48, 4.54]

Adapted from Stanford CS224n-2019 Lecture slides (Lecture 11)


Pooling over time: Obtaining a fixed-size vector

Global pooling over the same 8 x 3 convolved feature:

Max pooling:  s = [4.6,  4.48, 4.54]
Avg pooling:  s = [1.98, 2.09, 1.58]

Adapted from Stanford CS224n-2019 Lecture slides (Lecture 11)


Pooling over time: Obtaining a fixed-size vector

Global pooling over the same 8 x 3 convolved feature:

Max pooling:           s = [4.6,  4.48, 4.54]
Avg pooling:           s = [1.98, 2.09, 1.58]
k-max pooling (k = 2): s = [4.6,  3.5,  3.5 ]
                           [3.75, 4.48, 4.54]
(the k largest values per channel, kept in their original order)

Adapted from Stanford CS224n-2019 Lecture slides (Lecture 11)
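Added illustration (not from the slides): NumPy versions of the three global pooling operations applied to the 8 x 3 convolved feature matrix above.

import numpy as np

# The 8 x 3 convolved feature matrix from the slides (rows: windows, columns: filters).
F = np.array([[0.7, 0.4, 1.3], [-1.3, 1.9, -0.9], [2.6, 3.5, 0.7],
              [0.7, 1.3, -0.5], [2.2, -0.2, 2.6], [4.6, 3.3, 3.5],
              [2.57, 2.07, 1.36], [3.75, 4.48, 4.54]])

max_pool = F.max(axis=0)                 # [4.6, 4.48, 4.54]
avg_pool = F.mean(axis=0)                # ~[1.98, 2.09, 1.58]

# k-max pooling: the k largest values per channel, kept in their original order.
k = 2
keep = np.sort(np.argsort(F, axis=0)[-k:], axis=0)    # row indices of top-k, reordered
kmax_pool = np.take_along_axis(F, keep, axis=0)       # [[4.6, 3.5, 3.5], [3.75, 4.48, 4.54]]

print(max_pool, avg_pool.round(2), kmax_pool, sep="\n")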


Local Pooling, stride = 2

Max pooling is applied locally over windows of two consecutive rows of the convolved feature (stride 2), giving one pooled row per pair:

◊ the movie / the movie is         0.7   1.9   1.3
movie is based / is based on       2.6   3.5   0.7
based on a / on a true             4.6   3.3   3.5
a true story / true story ◊        3.75  4.48  4.54

Adapted from Stanford CS224n-2019 Lecture slides (Lecture 11)


1d convolution with dilation: an efficient way to get wider context

The filters are now 3 x 3 (the input has 3 channels, one per filter of the previous layer) and are applied with dilation 2, i.e. to every other row of the convolved feature matrix, so each output covers a much wider span of the original sentence.

Filters (each 3 x 3):
Filter 1:  2 1 -1 / 1 1 2 / 1 2 2
Filter 2:  2 -1 1 / 1 1 1 / 1 2 1

Output windows (for filter 1, the first value shown is 15):
◊ the movie is based
the movie is based on a true
movie is based on a true story
is based on a true story ◊

Adapted from Stanford CS224n-2019 Lecture slides (Lecture 11)


An example of 1-layer CNN for
sentence classification

Source: Zhang and Wallace:


https://arxiv.org/pdf/1510.03820.pdf
CONVOLUTIONAL TO RECURRENT NEURAL NETWORK
Natural language has sequence and order
• Natural Language
• Sequence of characters: word
• Sequence of words: sentence
• Sequence of sentences: document
Natural language has sequence and order
• Representation
• Feed-forward network: concatenation of vectors or vector addition
• Concatenation: size varies as input size varies
• Vector addition: fixed size, at the expense of ignoring order in the sequence
• CNN
• Respects order, but captures mostly local patterns
RNN naturally handles sequence and order

[Figure: a recurrent cell with parameters Θ, input wt, and hidden state h(t) feeding back into itself.]

• The loop allows information to be passed from one step of the network to the next.
• A recursive function is applied to the input at time step t and the previous hidden state h(t-1).
Unrolling a RNN

[Figure: the RNN unrolled over an input sequence w1 ... w5. Each word wt passes through the embedding layer (weights Ww) into the hidden layer; the hidden state h(t) is computed from h(t-1) via the recurrent weights Wh, and the output layer produces a prediction ŷ(t) at every step.]
RNN is about sequence and order

At each time step t of the unrolled network:

\boldsymbol{h}^{(t)} = \sigma\left( \boldsymbol{W}_h \boldsymbol{h}^{(t-1)} + \boldsymbol{W}_w \boldsymbol{w}^{(t)} + \boldsymbol{b}_1 \right)

\hat{\boldsymbol{y}}^{(t)} = \mathrm{softmax}\left( \boldsymbol{U} \boldsymbol{h}^{(t)} + \boldsymbol{b}_2 \right)

where w(t) is the embedding of the t-th input word, Ww (embedding-to-hidden) and Wh (hidden-to-hidden) are shared across all time steps, and U maps the hidden state to the output layer.
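Added illustration (not from the slides): a compact NumPy sketch of this forward recursion over a 5-step input; all sizes and the random initialization are assumptions.

import numpy as np

rng = np.random.default_rng(0)
d, hdim, vocab, T = 4, 6, 10, 5        # embedding, hidden, output sizes; 5 time steps

Ww = rng.normal(scale=0.1, size=(hdim, d))      # embedding-to-hidden weights
Wh = rng.normal(scale=0.1, size=(hdim, hdim))   # hidden-to-hidden (recurrent) weights
U  = rng.normal(scale=0.1, size=(vocab, hdim))  # hidden-to-output weights
b1, b2 = np.zeros(hdim), np.zeros(vocab)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

w = rng.normal(size=(T, d))            # embeddings of w1 ... w5 (made up)
h = np.zeros(hdim)                     # h(0)
for t in range(T):
    h = np.tanh(Wh @ h + Ww @ w[t] + b1)    # h(t) = sigma(Wh h(t-1) + Ww w(t) + b1)
    y_hat = softmax(U @ h + b2)             # y_hat(t) = softmax(U h(t) + b2)
    print(t + 1, y_hat.argmax(), round(float(y_hat.max()), 3))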
RNN in different forms: Acceptor

[Figure: the unrolled RNN reads w1 ... w5; only the final output ŷ(5) is used to predict and calculate the loss.]
RNN in different forms: Encoder

[Figure: the unrolled RNN encodes the input sequence; the final hidden state serves as the encoded representation, and the loss depends on other features or another network that consumes it.]
RNN in different forms: Transducer

[Figure: the unrolled RNN predicts an output ŷ(t) and calculates a loss at every time step; the per-step losses are combined into a global loss.]
Advantages

• Respects order
• Can process inputs of any length
• Model complexity does not grow with input length
• In theory, information from previous time steps remains
Disadvantages

• Slow computation
• Not parallelizable
• In practice, forgets information from many steps back
• Primarily due to the vanishing gradient problem
• Training may also suffer from the exploding gradient problem
NEURAL LANGUAGE MODEL
Pre-Transformer Era
Major drawbacks

• Inefficient
• Unable to exploit sequential nature of text
• Limited Context
• Unidirectional
Limited Context and Sequential Nature: RNN-LM

Jurafsky and Martin, Speech and Language Processing, 3rd ed. draft, Jan 2022
Unidirectional: ELMo

Source: Devlin et al. NAACL 2019



Issues with RNN-based LMs

• Limited bi-directionality
• Difficult to parallelize
References

• Jurafsky and Martin, Speech and Language Processing, 3rd Ed. Draft
[Available at https://web.stanford.edu/~jurafsky/slp3/]
Thanks!
Question and Comments!

[email protected]  https://www.iitg.ac.in/
anand.ashish
