
Neural Models for NLP

Lecture 15
COMS 4705, Spring 2020
Yassine Benajiba
Columbia University
One more thing about embeddings
No learning
If you don’t want to use any additional learning, you could just stop here and take the average of the
embeddings as the representation of the sentence. What are the pros and cons of such an approach?

If you wanted to weight the words without any additional learning, what could you use?
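A minimal sketch (not from the slides) of this no-learning baseline: average the word embeddings, with an optional IDF weighting as one common answer (an assumption, not stated on the slide) to the weighting question. The `embeddings` and `idf` dictionaries are assumed to come from elsewhere, e.g. pretrained word2vec vectors and document-frequency counts.

```python
# Sentence representation with no additional learning: a (optionally IDF-weighted)
# average of word embeddings. `embeddings`: word -> numpy vector, `idf`: word -> weight.
import numpy as np

def sentence_embedding(tokens, embeddings, idf=None, dim=300):
    vectors, weights = [], []
    for tok in tokens:
        if tok in embeddings:
            vectors.append(embeddings[tok])
            weights.append(idf.get(tok, 1.0) if idf is not None else 1.0)
    if not vectors:                          # no known words: back off to zeros
        return np.zeros(dim)
    vectors = np.stack(vectors)
    weights = np.asarray(weights)[:, None]
    return (weights * vectors).sum(axis=0) / weights.sum()
```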
doc2vec
Doc2vec/paragraph2vec

https://medium.com/district-data-labs/forward-propagation-building-a-skip-gram-net-from-the-ground-up-9578814b221

At the word level, skip-gram takes a word as input and predicts its context. With a few changes, we can get a document representation.
Doc2vec/paragraph2vec

[Figure: the skip-gram network. The input word's one-hot encoding is multiplied by p, a matrix that reduces the number of dimensions, with one row per word; p is what we save at the end as the word embedding matrix. A second matrix p' increases the number of dimensions again to predict the one-hot encoded context words; p' is what we throw away at the end.]
Doc2vec/paragraph2vec

[Figure: the same network, with the word input replaced by a document. The input is now a one-hot encoding of the document, so we should call it d(t) instead of w(t). It is multiplied by p, a matrix that reduces the number of dimensions, with one row per document; p is what we save at the end as the doc embedding matrix, and the row for this document is its vector vd. The matrix p' increases the number of dimensions again to predict the one-hot encoded context words occurring in the document; p' is what we throw away at the end.]
Doc2vec/paragraph2vec

[Figure: the same doc2vec network as on the previous slide.]

What happens for documents that were not in the training data?

Add a new column to d(t) and a new parameter row to p, freeze all the old parameters, and train for a few iterations.
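As a concrete illustration (an assumption, not the lecture's code), here is how the same idea looks with gensim's Doc2Vec. `infer_vector()` mirrors the trick above for unseen documents: the trained parameters stay frozen and only the new document's vector is optimized for a few iterations.

```python
# Rough doc2vec sketch with gensim. dm=0 selects the PV-DBOW variant, which is
# the closest to the skip-gram-style picture on the slides.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    TaggedDocument(words=["neural", "models", "for", "nlp"], tags=["doc0"]),
    TaggedDocument(words=["skip", "gram", "predicts", "context", "words"], tags=["doc1"]),
]
model = Doc2Vec(corpus, dm=0, vector_size=50, window=2, min_count=1, epochs=40)

train_vec = model.dv["doc0"]                 # vector of a document seen in training
new_vec = model.infer_vector(["a", "document", "not", "seen", "in", "training"])
```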
Doc2vec/paragraph2vec

[Figure: the same doc2vec network as on the previous slides.]

This approach doesn't allow for composition, i.e. it does not build the document representation from the words.

Nice for document similarity, but not for applications like machine translation, question answering, sentiment analysis, etc.

Let's take a look at the approaches that attempt to do so.
dense nets
Deep Averaging Networks (DANs)

https://cs.umd.edu/~miyyer/pubs/2015_acl_dan.pdf
dense nets
Deep Averaging Networks (DANs)
RAND -> randomly initialized embeddings

NBOW -> avg + softmax

ROOT -> sentiment at the sentence level


dense nets
Deep Averaging Networks (DANs)
Simple

Fast

But what are we missing here?

Primarily word order. It could beat RecNNs but not CNNs!
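A minimal sketch (an assumption, not the authors' implementation from the linked paper) of the DAN idea in PyTorch: average the word embeddings, then pass the average through a stack of dense layers and a softmax classifier. All sizes are illustrative.

```python
# Deep Averaging Network: embed, average over the sequence, then feed-forward layers.
import torch
import torch.nn as nn

class DAN(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden=300, num_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.ff = nn.Sequential(
            nn.Linear(emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),    # softmax is applied inside the loss
        )

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        avg = self.emb(token_ids).mean(dim=1)  # average the word embeddings
        return self.ff(avg)                    # logits: (batch, num_classes)

logits = DAN(vocab_size=10000)(torch.randint(0, 10000, (4, 12)))
```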
convolutional nets
CNNs
intuition
CNNs
feature maps
• A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification.
Zhang Y. and Wallace B.

https://arxiv.org/pdf/1510.03820v2.pdf
CNNs
the convolution operation

[Figure from Wikipedia: the convolution of two functions.]

CNNs
the convolution operation

[Figures: a 1D convolution (from Wikipedia) and a 2D convolution, input * kernel = output.]
CNNs
the convolution operation

[Figure: A (input) * B (kernel) = C (output).]

C[m, n] = (A * B)[m, n] = \sum_j \sum_k B[j, k] \cdot A[m + j, n + k]
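A minimal sketch implementing the formula above directly in numpy. As is standard in deep learning libraries, this is really cross-correlation: the kernel is not flipped.

```python
# 2D "convolution" computed straight from C[m, n] = sum_j sum_k B[j, k] * A[m + j, n + k].
import numpy as np

def conv2d(A, B):
    out_h = A.shape[0] - B.shape[0] + 1
    out_w = A.shape[1] - B.shape[1] + 1
    C = np.zeros((out_h, out_w))
    for m in range(out_h):
        for n in range(out_w):
            C[m, n] = np.sum(B * A[m:m + B.shape[0], n:n + B.shape[1]])
    return C

A = np.arange(16, dtype=float).reshape(4, 4)   # input
B = np.array([[1.0, 0.0], [0.0, -1.0]])        # kernel
print(conv2d(A, B))                            # 3x3 output
```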
CNNs
- Shared parameters
- Sparse connectivity
CNNs
Max pooling: retain only the max values.

Why does this make sense?

https://arxiv.org/pdf/1510.03820v2.pdf
Convolutional neural networks

https://github.com/bwallace/CNN-for-text-classification/blob/master/CNN_text.py
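The linked CNN_text.py is the reference implementation; below is only a rough, independent PyTorch sketch of the architecture in the Zhang & Wallace figure: 1D convolutions with several filter widths over the word-embedding sequence, max pooling over time ("retain only the max values"), then a dense softmax layer. Sizes and hyperparameters are assumptions.

```python
# CNN for sentence classification: multiple filter widths, max-over-time pooling.
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, n_filters=100,
                 widths=(3, 4, 5), num_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, kernel_size=w) for w in widths)
        self.out = nn.Linear(n_filters * len(widths), num_classes)

    def forward(self, token_ids):                    # (batch, seq_len)
        x = self.emb(token_ids).transpose(1, 2)      # (batch, emb_dim, seq_len)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.out(torch.cat(pooled, dim=1))    # logits

logits = TextCNN(vocab_size=10000)(torch.randint(0, 10000, (4, 20)))
```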
CNNs
Now we have something that can take word order into consideration, but we still have two problems:

- Length of the sentence
- Long-distance dependencies

... Let's talk about recurrent neural networks.

https://arxiv.org/pdf/1510.03820v2.pdf
recurrent nets
RNNs
Vanilla
The Unreasonable Effectiveness of Recurrent Neural Networks

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
RNNs
Vanilla
The Unreasonable Effectiveness of Recurrent Neural Networks

How can we do this with words instead of chars?

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
RNNs
Vanilla
The Unreasonable Effectiveness of Recurrent Neural Networks

sequential

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
RNNs
Vanilla
The Unreasonable Effectiveness of Recurrent Neural Networks

Shared params

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
RNNs
Vanilla
The Unreasonable Effectiveness of Recurrent Neural Networks

generative

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
RNNs
Vanilla
The Unreasonable Effectiveness of Recurrent Neural Networks

Last hidden state is the sequence representation.

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
RNNs
Vanilla
The Unreasonable Effectiveness of Recurrent Neural Networks

Can be used to classify either each element of the sequence or the whole sequence.

LMs too.

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
RNNs
Vanilla
The Unreasonable Effectiveness of Recurrent Neural Networks

Training:
Algorithm: BPTT (backpropagation through time)
Issues: vanishing gradients

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
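A minimal sketch (not Karpathy's char-rnn) of a vanilla RNN forward pass in numpy, showing the shared parameters applied at every step and the last hidden state used as the sequence representation. Shapes and random inputs are assumptions; training would add BPTT on top of this.

```python
# Vanilla RNN: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b), unrolled over a sequence.
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b):
    """xs: list of input vectors; returns the list of hidden states."""
    h = np.zeros(W_hh.shape[0])
    hs = []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h + b)   # the same parameters at every step
        hs.append(h)
    return hs

rng = np.random.default_rng(0)
d_in, d_h = 8, 16
xs = [rng.normal(size=d_in) for _ in range(5)]
hs = rnn_forward(xs, rng.normal(size=(d_h, d_in)) * 0.1,
                 rng.normal(size=(d_h, d_h)) * 0.1, np.zeros(d_h))
sequence_rep = hs[-1]    # last hidden state as the sequence representation
```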
Gated RNNs: LSTMs
[Figure from Colah's blog: the LSTM cell, with cell state C(t-1) → C(t), hidden state h(t-1) → h(t), and input x(t).]
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Gated RNNs: LSTMs

[Figure: the LSTM cell.]

Given an NLP task, the LSTM will pick what information to remember and what information is irrelevant in order to build a sentence-level representation h(n), where n is the length of the sentence.
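A minimal sketch (an assumption, not the lecture's code) of using an LSTM's final hidden state as the sentence-level representation h(n) in PyTorch. Sizes are illustrative.

```python
# LSTM over a batch of token sequences; h_n is the slide's h(n).
import torch
import torch.nn as nn

emb = nn.Embedding(10000, 128)
lstm = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)

token_ids = torch.randint(0, 10000, (4, 15))      # (batch, seq_len)
outputs, (h_n, c_n) = lstm(emb(token_ids))        # outputs: (batch, seq_len, 256)
sentence_rep = h_n[-1]                            # (batch, 256): last hidden state
```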
Gated RNNs: LSTMs

[Figure: the LSTM cell.]

Works much better...

What problems does it solve?
- Vanishing gradients
- Longer memory
Gated RNNs: LSTMs

BiLSTM means bidirectional LSTM: basically two LSTMs, one starting from the left and the other from the right. At each word we concatenate the contents of the two hidden layers.
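A minimal sketch (an assumption) of the BiLSTM in PyTorch: with bidirectional=True the forward and backward hidden states are concatenated at each word, exactly as described above.

```python
# BiLSTM: per-word representations are the concatenation of both directions.
import torch
import torch.nn as nn

emb = nn.Embedding(10000, 128)
bilstm = nn.LSTM(input_size=128, hidden_size=256,
                 batch_first=True, bidirectional=True)

token_ids = torch.randint(0, 10000, (4, 15))      # (batch, seq_len)
outputs, _ = bilstm(emb(token_ids))
# outputs: (batch, seq_len, 2 * 256): forward and backward states concatenated
per_word_reps = outputs                            # one 512-dim vector per word
```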
Gated RNNs: LSTMs
Let's discuss some limitations of LSTMs:
- Speed
- Accuracy of representation

Where do you think the state of the art is today?
Gated RNNs: LSTMs

http://nlpprogress.com/
Other architectures
ADANs
one-shot learning
Back to the intro
