Lecture 15 - Neural Models for NLP
COMS 4705, Spring 2020
Yassine Benajiba
Columbia University
One more thing about embeddings
No learning
If you don’t want to use any additional learning, you could just stop here and take the average of the embeddings as the representation of the sentence. What are the pros and cons of such an approach?
If you wanted to weight the words without any additional learning, what could you use?
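A rough sketch (not from the slides): averaging pre-trained embeddings needs no extra learning, and IDF weights are one common answer to the weighting question. The `embeddings` and `idf` dictionaries below are assumed inputs.

```python
import numpy as np

# Sketch: sentence representation as the (optionally IDF-weighted) average of
# pre-trained word embeddings. `embeddings` maps word -> np.ndarray of size `dim`,
# `idf` optionally maps word -> inverse document frequency.
def sentence_vector(tokens, embeddings, idf=None, dim=300):
    vecs, weights = [], []
    for w in tokens:
        if w in embeddings:
            vecs.append(embeddings[w])
            weights.append(idf.get(w, 1.0) if idf else 1.0)
    if not vecs:
        return np.zeros(dim)              # every token was out of vocabulary
    vecs = np.stack(vecs)
    weights = np.array(weights)[:, None]
    return (weights * vecs).sum(axis=0) / weights.sum()
```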
doc2vec
Doc2vec/paragraph2vec
https://medium.com/district-data-labs/forward-propagation-building-a-skip-gram-net-from-the-ground-up-9578814b221
At the word level, skip-gram takes a word as input and predicts its context.
We can make a few changes and get a document representation.
[Diagram: document one-hot vector d(t), its dense document embedding v_d, and p', a matrix to increase the # of dimensions.]
[Diagram label: one-hot encoded context words occurring in the document.]
What happens for documents that were not in the training data? Add a new column to d(t) and a new parameter row to p', freeze all old parameters, and train for a few iterations.
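A hedged illustration rather than the lecture’s own code: gensim’s Doc2Vec exposes exactly this “freeze everything, fit only a new document vector” procedure as infer_vector.

```python
# Sketch with gensim's Doc2Vec: infer_vector freezes the trained parameters and
# fits only a fresh document vector for a few iterations, as described above.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    TaggedDocument(words=["neural", "models", "for", "nlp"], tags=["d0"]),
    TaggedDocument(words=["skip", "gram", "predicts", "context"], tags=["d1"]),
]
model = Doc2Vec(corpus, vector_size=50, window=2, min_count=1, epochs=40)

# Unseen document: only a new document vector is trained; old parameters stay fixed.
new_vec = model.infer_vector(["document", "level", "representation"])
print(new_vec.shape)  # (50,)
```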
https://cs.umd.edu/~miyyer/pubs/2015_acl_dan.pdf
dense nets
Deep Averaging Networks (DANs)
RAND -> randomly initialized embeddings
Fast
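A minimal PyTorch sketch of a DAN (the sizes, depth, and classifier head are illustrative assumptions, not the paper’s exact configuration): average the word embeddings, then pass the average through a few dense layers.

```python
# DAN sketch: deep *averaging* network -- mean of word embeddings followed by
# a stack of dense layers and a linear classifier.
import torch
import torch.nn as nn

class DAN(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden=300, num_classes=2, depth=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)   # RAND: randomly initialized
        layers, in_dim = [], emb_dim
        for _ in range(depth):
            layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
            in_dim = hidden
        self.feedforward = nn.Sequential(*layers)
        self.out = nn.Linear(in_dim, num_classes)

    def forward(self, token_ids):                      # token_ids: (batch, seq_len)
        avg = self.emb(token_ids).mean(dim=1)          # average over the words
        return self.out(self.feedforward(avg))         # class logits

logits = DAN(vocab_size=10000)(torch.randint(0, 10000, (4, 12)))
print(logits.shape)  # torch.Size([4, 2])
```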
https://arxiv.org/pdf/1510.03820v2.pdf
CNNs
the convolution operation
[Figures (Wikipedia): 1D and 2D convolution examples; an input convolved with a kernel gives the output (input * kernel = output).]
A (input) * B (kernel) = C (output)
$C[m,n] = (A \ast B)[m,n] = \sum_j \sum_k B[j,k]\, A[m+j,\, n+k]$
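A small numpy check of the formula above (illustrative, not lecture code):

```python
# Valid cross-correlation of a 2D input A with a kernel B, computed exactly as
# C[m, n] = sum_j sum_k B[j, k] * A[m + j, n + k].
import numpy as np

def conv2d(A, B):
    out_h = A.shape[0] - B.shape[0] + 1
    out_w = A.shape[1] - B.shape[1] + 1
    C = np.zeros((out_h, out_w))
    for m in range(out_h):
        for n in range(out_w):
            C[m, n] = np.sum(B * A[m:m + B.shape[0], n:n + B.shape[1]])
    return C

A = np.arange(16).reshape(4, 4)
B = np.array([[1, 0], [0, -1]])
print(conv2d(A, B))   # 3x3 output
```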
CNNs
- Shared parameters
- Sparse connectivity (each output depends only on a small window of inputs)
CNNs
Max pooling: retain only the max values.
https://arxiv.org/pdf/1510.03820v2.pdf
Convolutional neural networks
https://github.com/bwallace/CNN-for-text-classification/blob/master/CNN_text.py
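The linked repo is one implementation of this idea; as a hedged PyTorch sketch of the same recipe (filter widths and sizes are assumptions): convolve filters of several widths over the word-embedding matrix, 1-max pool each feature map, and classify.

```python
# CNN-for-text sketch in the Kim (2014) / Zhang & Wallace style: convolutions
# of several widths over the embedding matrix, 1-max pooling over time, then a
# linear classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, num_filters=100,
                 widths=(3, 4, 5), num_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, num_filters, kernel_size=w) for w in widths
        )
        self.out = nn.Linear(num_filters * len(widths), num_classes)

    def forward(self, token_ids):                        # (batch, seq_len)
        x = self.emb(token_ids).transpose(1, 2)          # (batch, emb_dim, seq_len)
        pooled = [F.relu(conv(x)).max(dim=2).values      # 1-max pool over time
                  for conv in self.convs]
        return self.out(torch.cat(pooled, dim=1))        # class logits

logits = TextCNN(vocab_size=5000)(torch.randint(0, 5000, (8, 20)))
print(logits.shape)  # torch.Size([8, 2])
```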
CNNs
Now we have something that takes word order into consideration, but we still have two problems:
- Sentence length
- Long-distance dependencies
https://arxiv.org/pdf/1510.03820v2.pdf
recurrent nets
RNNs
Vanilla
The Unreasonable Effectiveness of Recurrent Neural Networks
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
sequential
Shared params
generative
LMs too
Training:
Algorithm: BPTT (backpropagation through time)
Issues: vanishing gradients
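A numpy sketch of the vanilla recurrence (illustrative, in the spirit of Karpathy’s post; the dimensions are arbitrary):

```python
# Vanilla RNN step: the same parameters W_xh, W_hh, W_hy are shared across
# every time step, and the hidden state is updated sequentially.
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, d_y = 8, 16, 5
W_xh = rng.normal(scale=0.1, size=(d_h, d_x))
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
W_hy = rng.normal(scale=0.1, size=(d_y, d_h))

def rnn_forward(xs):
    h = np.zeros(d_h)
    ys = []
    for x in xs:                               # sequential: one step per token
        h = np.tanh(W_xh @ x + W_hh @ h)       # shared parameters at every step
        ys.append(W_hy @ h)                    # e.g. next-token logits (LM use)
    return ys, h

ys, h_final = rnn_forward([rng.normal(size=d_x) for _ in range(10)])
# During BPTT the gradient flowing back through time is a product of Jacobians
# involving W_hh; when their norms are below 1 this product shrinks
# exponentially -- the vanishing-gradient problem noted above.
```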
Gated RNNs: LSTMs
[Figure: LSTM cell with cell state C_{t-1} → C_t, hidden state h_{t-1} → h_t, and input x_t.]
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Given an NLP task, the LSTM learns what information to remember and what is irrelevant in order to build a sentence-level representation h(n), where n is the length of the sentence.
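For reference, the standard LSTM cell update, as presented in the colah post linked above ($\sigma$ is the logistic sigmoid, $\odot$ is elementwise multiplication):

$$
\begin{aligned}
f_t &= \sigma(W_f\,[h_{t-1}, x_t] + b_f) && \text{forget gate}\\
i_t &= \sigma(W_i\,[h_{t-1}, x_t] + b_i) && \text{input gate}\\
\tilde{C}_t &= \tanh(W_C\,[h_{t-1}, x_t] + b_C) && \text{candidate cell state}\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{new cell state}\\
o_t &= \sigma(W_o\,[h_{t-1}, x_t] + b_o) && \text{output gate}\\
h_t &= o_t \odot \tanh(C_t) && \text{new hidden state}
\end{aligned}
$$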
Works much better.
http://nlpprogress.com/
Other architectures
ADANs
one-shot learning
Back to the intro