
Neural Models for NLP

Lecture 15
COMS 4705, Spring 2020
Yassine Benajiba
Columbia University
One more thing about embeddings
No learning
If you don’t want to use any additional learning, you could just stop here and take the average of the
embeddings as the representation of the sentence. What are the pros and cons of such an approach?

If you wanted to weight the words without any additional learning, what could you use?
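A minimal sketch (not from the slides) of this no-learning baseline: average the word embeddings, with an optional IDF weighting as one common answer (an assumption, not stated on the slide) to the weighting question. The `embeddings` and `idf` dictionaries are assumed to come from elsewhere, e.g. pretrained word2vec vectors and document-frequency counts.

```python
# Sentence representation with no additional learning: a (optionally IDF-weighted)
# average of word embeddings. `embeddings`: word -> numpy vector, `idf`: word -> weight.
import numpy as np

def sentence_embedding(tokens, embeddings, idf=None, dim=300):
    vectors, weights = [], []
    for tok in tokens:
        if tok in embeddings:
            vectors.append(embeddings[tok])
            weights.append(idf.get(tok, 1.0) if idf is not None else 1.0)
    if not vectors:                          # no known words: back off to zeros
        return np.zeros(dim)
    vectors = np.stack(vectors)
    weights = np.asarray(weights)[:, None]
    return (weights * vectors).sum(axis=0) / weights.sum()
```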
doc2vec
Doc2vec/paragraph2vec

https://medium.com/district-data-labs/forward-propagation-building-a-skip-gram-net-from-the-ground-up-9578814b221

At the word level, skip-gram takes a word as input and predicts its context. With a few changes, we can get a document representation.
Doc2vec/paragraph2vec

[Figure: the skip-gram network. The input word's one-hot encoding is multiplied by p, a matrix that reduces the number of dimensions, with one row per word; p is what we save at the end as the word embedding matrix. A second matrix p' increases the number of dimensions again to predict the one-hot encoded context words; p' is what we throw away at the end.]
Doc2vec/paragraph2vec

[Figure: the same network, with the word input replaced by a document. The input is now a one-hot encoding of the document, so we should call it d(t) instead of w(t). It is multiplied by p, a matrix that reduces the number of dimensions, with one row per document; p is what we save at the end as the doc embedding matrix, and the row for this document is its vector vd. The matrix p' increases the number of dimensions again to predict the one-hot encoded context words occurring in the document; p' is what we throw away at the end.]
Doc2vec/paragraph2vec

[Figure: the same doc2vec network as on the previous slide.]

What happens for documents that were not in the training data?

Add a new column to d(t) and a new parameter row to p, freeze all the old parameters, and train for a few iterations.
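As a concrete illustration (an assumption, not the lecture's code), here is how the same idea looks with gensim's Doc2Vec. `infer_vector()` mirrors the trick above for unseen documents: the trained parameters stay frozen and only the new document's vector is optimized for a few iterations.

```python
# Rough doc2vec sketch with gensim. dm=0 selects the PV-DBOW variant, which is
# the closest to the skip-gram-style picture on the slides.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    TaggedDocument(words=["neural", "models", "for", "nlp"], tags=["doc0"]),
    TaggedDocument(words=["skip", "gram", "predicts", "context", "words"], tags=["doc1"]),
]
model = Doc2Vec(corpus, dm=0, vector_size=50, window=2, min_count=1, epochs=40)

train_vec = model.dv["doc0"]                 # vector of a document seen in training
new_vec = model.infer_vector(["a", "document", "not", "seen", "in", "training"])
```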
Doc2vec/paragraph2vec

[Figure: the same doc2vec network as on the previous slides.]

This approach doesn't allow for composition, i.e. it does not build the document representation from the words.

Nice for document similarity, but not for applications like machine translation, question answering, sentiment analysis, etc.

Let's take a look at the approaches that attempt to do so.
dense nets
Deep Averaging Networks (DANs)

https://cs.umd.edu/~miyyer/pubs/2015_acl_dan.pdf
dense nets
Deep Averaging Networks (DANs)
RAND -> randomly initialized embeddings

NBOW -> avg + softmax

ROOT -> sentiment at the sentence level


dense nets
Deep Averaging Networks (DANs)
Simple

Fast

But what are we missing here?

Primarily word order. It could beat RecNNs but not CNNs!
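A minimal sketch (an assumption, not the authors' implementation from the linked paper) of the DAN idea in PyTorch: average the word embeddings, then pass the average through a stack of dense layers and a softmax classifier. All sizes are illustrative.

```python
# Deep Averaging Network: embed, average over the sequence, then feed-forward layers.
import torch
import torch.nn as nn

class DAN(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden=300, num_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.ff = nn.Sequential(
            nn.Linear(emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),    # softmax is applied inside the loss
        )

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        avg = self.emb(token_ids).mean(dim=1)  # average the word embeddings
        return self.ff(avg)                    # logits: (batch, num_classes)

logits = DAN(vocab_size=10000)(torch.randint(0, 10000, (4, 12)))
```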
convolutional nets
CNNs
intuition
CNNs
feature maps
• A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification.
Zhang Y. and Wallace B.

https://arxiv.org/pdf/1510.03820v2.pdf
CNNs
the convolution operation

[Figure from Wikipedia: the convolution of two functions.]

CNNs
the convolution operation

[Figures: a 1D convolution (from Wikipedia) and a 2D convolution, input * kernel = output.]
CNNs
the convolution operation

[Figure: A (input) * B (kernel) = C (output).]

C[m, n] = (A * B)[m, n] = \sum_j \sum_k B[j, k] \cdot A[m + j, n + k]
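A minimal sketch implementing the formula above directly in numpy. As is standard in deep learning libraries, this is really cross-correlation: the kernel is not flipped.

```python
# 2D "convolution" computed straight from C[m, n] = sum_j sum_k B[j, k] * A[m + j, n + k].
import numpy as np

def conv2d(A, B):
    out_h = A.shape[0] - B.shape[0] + 1
    out_w = A.shape[1] - B.shape[1] + 1
    C = np.zeros((out_h, out_w))
    for m in range(out_h):
        for n in range(out_w):
            C[m, n] = np.sum(B * A[m:m + B.shape[0], n:n + B.shape[1]])
    return C

A = np.arange(16, dtype=float).reshape(4, 4)   # input
B = np.array([[1.0, 0.0], [0.0, -1.0]])        # kernel
print(conv2d(A, B))                            # 3x3 output
```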
CNNs
- Shared parameters
- Sparse connectivity
CNNs
Max pooling: retain only the max values.

Why does this make sense?

https://arxiv.org/pdf/1510.03820v2.pdf
Convolutional neural networks

https://github.com/bwallace/CNN-for-text-classification/blob/master/CNN_text.py
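The linked CNN_text.py is the reference implementation; below is only a rough, independent PyTorch sketch of the architecture in the Zhang & Wallace figure: 1D convolutions with several filter widths over the word-embedding sequence, max pooling over time ("retain only the max values"), then a dense softmax layer. Sizes and hyperparameters are assumptions.

```python
# CNN for sentence classification: multiple filter widths, max-over-time pooling.
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, n_filters=100,
                 widths=(3, 4, 5), num_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, kernel_size=w) for w in widths)
        self.out = nn.Linear(n_filters * len(widths), num_classes)

    def forward(self, token_ids):                    # (batch, seq_len)
        x = self.emb(token_ids).transpose(1, 2)      # (batch, emb_dim, seq_len)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.out(torch.cat(pooled, dim=1))    # logits

logits = TextCNN(vocab_size=10000)(torch.randint(0, 10000, (4, 20)))
```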
CNNs
Now we have something that can take word order into consideration, but we still have two problems:

- Length of the sentence
- Long-distance dependencies

... Let's talk about recurrent neural networks.

https://arxiv.org/pdf/1510.03820v2.pdf
recurrent nets
RNNs
Vanilla
The Unreasonable Effectiveness of Recurrent Neural Networks

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
RNNs
Vanilla
The Unreasonable Effectiveness of Recurrent Neural Networks

How can we do this with words instead of chars?

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
RNNs
Vanilla
The Unreasonable Effectiveness of Recurrent Neural Networks

sequential

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
RNNs
Vanilla
The Unreasonable Effectiveness of Recurrent Neural Networks

Shared params

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
RNNs
Vanilla
The Unreasonable Effectiveness of Recurrent Neural Networks

generative

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
RNNs
Vanilla
The Unreasonable Effectiveness of Recurrent Neural Networks

Last hidden state is the sequence representation.

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
RNNs
Vanilla
The Unreasonable Effectiveness of Recurrent Neural Networks

Can be used to classify either each element of the sequence or the whole sequence.

LMs too.

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
RNNs
Vanilla
The Unreasonable Effectiveness of Recurrent Neural Networks

Training:
Algorithm: BPTT (backpropagation through time)
Issues: vanishing gradients

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
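A minimal sketch (not Karpathy's char-rnn) of a vanilla RNN forward pass in numpy, showing the shared parameters applied at every step and the last hidden state used as the sequence representation. Shapes and random inputs are assumptions; training would add BPTT on top of this.

```python
# Vanilla RNN: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b), unrolled over a sequence.
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b):
    """xs: list of input vectors; returns the list of hidden states."""
    h = np.zeros(W_hh.shape[0])
    hs = []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h + b)   # the same parameters at every step
        hs.append(h)
    return hs

rng = np.random.default_rng(0)
d_in, d_h = 8, 16
xs = [rng.normal(size=d_in) for _ in range(5)]
hs = rnn_forward(xs, rng.normal(size=(d_h, d_in)) * 0.1,
                 rng.normal(size=(d_h, d_h)) * 0.1, np.zeros(d_h))
sequence_rep = hs[-1]    # last hidden state as the sequence representation
```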
Gated RNNs: LSTMs
[Figure from Colah's blog: the LSTM cell, with cell state C(t-1) → C(t), hidden state h(t-1) → h(t), and input x(t).]
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Gated RNNs: LSTMs

[Figure: the LSTM cell.]

Given an NLP task, the LSTM will pick what information to remember and what information is irrelevant in order to build a sentence-level representation h(n), where n is the length of the sentence.
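A minimal sketch (an assumption, not the lecture's code) of using an LSTM's final hidden state as the sentence-level representation h(n) in PyTorch. Sizes are illustrative.

```python
# LSTM over a batch of token sequences; h_n is the slide's h(n).
import torch
import torch.nn as nn

emb = nn.Embedding(10000, 128)
lstm = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)

token_ids = torch.randint(0, 10000, (4, 15))      # (batch, seq_len)
outputs, (h_n, c_n) = lstm(emb(token_ids))        # outputs: (batch, seq_len, 256)
sentence_rep = h_n[-1]                            # (batch, 256): last hidden state
```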
Gated RNNs: LSTMs

[Figure: the LSTM cell.]

Works much better...

What problems does it solve?
- Vanishing gradients
- Longer memory
Gated RNNs: LSTMs

BiLSTM means bidirectional LSTM: basically two LSTMs, one starting from the left and the other from the right. At each word we concatenate the contents of the two hidden layers.
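A minimal sketch (an assumption) of the BiLSTM in PyTorch: with bidirectional=True the forward and backward hidden states are concatenated at each word, exactly as described above.

```python
# BiLSTM: per-word representations are the concatenation of both directions.
import torch
import torch.nn as nn

emb = nn.Embedding(10000, 128)
bilstm = nn.LSTM(input_size=128, hidden_size=256,
                 batch_first=True, bidirectional=True)

token_ids = torch.randint(0, 10000, (4, 15))      # (batch, seq_len)
outputs, _ = bilstm(emb(token_ids))
# outputs: (batch, seq_len, 2 * 256): forward and backward states concatenated
per_word_reps = outputs                            # one 512-dim vector per word
```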
Gated RNNs: LSTMs
Let's discuss some limitations of LSTMs:
- Speed
- Accuracy of representation

Where do you think the state of the art is today?
Gated RNNs: LSTMs

http://nlpprogress.com/
Other architectures
ADANs
one-shot learning
Back to the intro
