
An Intro to Deep Learning for NLP

Mausam
Disclaimer: this is an outsider’s understanding. Some details may be inaccurate

(several slides by Yoav Goldberg & Graham Neubig)


NLP before DL #1
Assumptions
- doc: bag/sequence/tree of words
- model: bag of features (linear)
- feature: symbolic (different weight for each)

Pipeline: supervised training data → features → model (NB, SVM, CRF); optimize a function (LL, sqd error, margin, …) to learn the feature weights.
NLP before DL #2
Assumptions
- doc/query/word is a vector of numbers (z1 z2 …)
- dot product can compute similarity (via the distributional hypothesis)

Pipeline: unsupervised co-occurrence data → model (MF, LSA, IR); optimize a function (LL, sqd error, margin, …) to learn the vectors.
NLP with DL
Assumptions
- doc/query/word is a vector of numbers (z1 z2 …)
- doc: bag/sequence/tree of words
- feature: neural (weights are shared)
- model: bag/seq of features (non-linear)

Pipeline: supervised training data → neural features (z1 z2 …) → model, where NN = (NB, SVM, CRF, +++ feature discovery); optimize a function (LL, sqd error, margin, …) to learn the feature weights + vectors.
Meta-thoughts
Features
• Learned
• in a task-specific, end2end way
• not limited by human creativity
Everything is a “Point”
• Word embedding
• Phrase embedding
• Sentence embedding
• Word embedding in context of sentence
• Etc.

• Also known as dense/distributed representations

Points are good → reduce sparsity by weight sharing;
a single (complex) model can handle all points
Universal Representations
• Non-linearities
– Allow complex functions

• Put anything computable in the loss function
– Any additional insight about data/external knowledge
Make symbolic operations continuous
• Symbolic → continuous
– Yes/No → a number between 0 and 1
– Good/bad → a number between -1 and 1
– Either remember or forget → partially remember
– Select from n things → weighted avg over n things
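
A minimal sketch of the last relaxation, using PyTorch (the deck does not prescribe a framework; names are illustrative): a hard "select one of n things" becomes a softmax-weighted average over n vectors, which is differentiable.

```python
import torch

# n candidate vectors to "select" from, each of dimension d
n, d = 4, 8
candidates = torch.randn(n, d)

# unnormalized preference scores for each candidate (e.g. produced by a network)
scores = torch.randn(n)

# hard, symbolic selection: pick exactly one candidate
hard_choice = candidates[scores.argmax()]

# continuous relaxation: softmax turns scores into weights in (0, 1) summing to 1,
# and "selection" becomes a weighted average over all n candidates
weights = torch.softmax(scores, dim=0)
soft_choice = weights @ candidates          # shape (d,), differentiable w.r.t. scores

print(hard_choice.shape, soft_choice.shape)
```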
Encoder-Decoder

Symbolic input (word) → Encoder → neural features (z1 …) → Decoder → symbolic output (class, sentence, …)

Different assumptions on data create different architectures
Building Blocks
• sum: x + y
• concatenation: [x ; y]
• dot product: x · y
• matrix-mult, gate, non-linearity
• Can also try dimension-wise max (later: weighted sum)
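
A sketch of these building blocks in PyTorch (framework choice is mine, not the deck's): sum, concatenation, dot product, dimension-wise max, and a matrix multiplication followed by a non-linearity.

```python
import torch

d = 5
x, y = torch.randn(d), torch.randn(d)

s = x + y                       # sum: same dimension as x and y
c = torch.cat([x, y])           # concatenation [x ; y]: dimension 2d
dot = x @ y                     # dot product: a single similarity score
m = torch.maximum(x, y)         # dimension-wise max

# matrix multiplication + non-linearity, the core g(Ax + b) block
A = torch.randn(3, d)
b = torch.randn(3)
h = torch.tanh(A @ x + b)       # non-linearity g = tanh here

print(s.shape, c.shape, dot.item(), m.shape, h.shape)
```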
Concat vs. Sum
• Concatenating feature vectors: the "role" of each vector is retained.

[prev word ; current word ; next word]

• Different features can have vectors of different dimensions.

• Fixed number of features in each example (need to feed into a fixed-dim layer).
Concat vs. Sum
• Summing feature vectors: a "bag of features"

word + word + word

• Different feature vectors should have the same dimension.

• Can encode a bag of an arbitrary number of features.
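A small illustration of the two slides above (PyTorch, illustrative names): concatenating the previous/current/next word vectors keeps their roles and yields a fixed-size input, while summing gives an order-free bag that works for any number of vectors.

```python
import torch

d = 4
prev_w, cur_w, next_w = torch.randn(d), torch.randn(d), torch.randn(d)

# concat: the role of each position is preserved, input size fixed at 3*d
window = torch.cat([prev_w, cur_w, next_w])        # shape (12,)

# sum: a "bag of features", dimension stays d no matter how many vectors we add
bag = torch.stack([prev_w, cur_w, next_w]).sum(0)  # shape (4,)

print(window.shape, bag.shape)
```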


x.y
• degree of closeness
• alignment

• Uses
– question aligns with answer //QA
– sentence aligns with sentence //paraphrase
– word aligns with (~important for) sentence //attention
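
One concrete use of the dot product from this slide is attention: score how much each word vector aligns with a sentence/query vector, then use those scores as weights. A hedged PyTorch sketch; the names are illustrative.

```python
import torch

d, n_words = 8, 6
word_vecs = torch.randn(n_words, d)   # one vector per word in the sentence
query = torch.randn(d)                # e.g. a question or sentence representation

scores = word_vecs @ query            # dot product = degree of alignment, shape (n_words,)
attn = torch.softmax(scores, dim=0)   # normalize to weights
context = attn @ word_vecs            # weighted average of word vectors, shape (d,)
```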
g(Ax+b)
• 1-layer MLP
• Take x
– project it into a different space //relevant to the task
– add a bias b (only shifts it up or down)
– apply the non-linearity g to convert it into the required output

• 2-layer MLP: g2(A2 g1(A1x + b1) + b2)
– Common way to convert input to output
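
The g(Ax+b) block as code: a 1-layer projection and the common 2-layer MLP. A PyTorch sketch; the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

d_in, d_hidden, d_out = 10, 32, 3

# 1-layer MLP: project x, shift by bias b, squash with non-linearity g
one_layer = nn.Sequential(nn.Linear(d_in, d_out), nn.Tanh())

# 2-layer MLP: the common input -> hidden -> output converter
two_layer = nn.Sequential(
    nn.Linear(d_in, d_hidden),   # A1 x + b1
    nn.Tanh(),                   # g1
    nn.Linear(d_hidden, d_out),  # A2 h + b2
)

x = torch.randn(d_in)
print(one_layer(x).shape, two_layer(x).shape)
```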
Loss Functions

• Cross Entropy
• Binary Cross Entropy
• Max Margin
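
A quick look at the first two losses in PyTorch (the deck itself is framework-agnostic); max margin is sketched after the "Common Loss Functions" slide below.

```python
import torch
import torch.nn as nn

# cross entropy: multi-class classification over 5 classes
logits = torch.randn(2, 5)               # batch of 2, unnormalized scores
gold = torch.tensor([3, 0])              # gold class indices y*
ce = nn.CrossEntropyLoss()(logits, gold)

# binary cross entropy: a single yes/no decision per example
bin_logits = torch.randn(2)
bin_gold = torch.tensor([1.0, 0.0])
bce = nn.BCEWithLogitsLoss()(bin_logits, bin_gold)

print(ce.item(), bce.item())
```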
Encoder-Decoder

Symbolic input (word) → Encoder → neural features (z1 …) → Decoder → symbolic output (class, sentence, …)
LOSS: compares the predicted P(y) with the gold output y*
Common Loss Functions
• Max Margin
Loss = max(0, 1 − (score(y*) − score(ybest)))

• Ranking loss (max margin: x ranked over x’)
Loss = max(0, 1 − (score(x) − score(x’)))
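
The two hinge losses above written out directly (PyTorch sketch; score() stands for whatever scoring network the model uses).

```python
import torch

def max_margin(score_gold, score_best_wrong):
    # Loss = max(0, 1 - (score(y*) - score(y_best)))
    return torch.clamp(1 - (score_gold - score_best_wrong), min=0)

def ranking_loss(score_x, score_x_neg):
    # Loss = max(0, 1 - (score(x) - score(x')))  -- x should outrank x'
    return torch.clamp(1 - (score_x - score_x_neg), min=0)

# toy scores
print(max_margin(torch.tensor(2.3), torch.tensor(1.9)))    # margin violated -> positive loss
print(ranking_loss(torch.tensor(3.0), torch.tensor(0.5)))  # margin satisfied -> 0
```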
Regularization
• L1
• L2
• Elastic Net
• DropOut
• Batch Normalization
• Layer Normalization
• Problem-specific regularizations
• Early Stopping
• https://towardsdatascience.com/different-normalization-layers-in-deep-learning-1a7214ff71d6
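A sketch of how two of these show up in PyTorch code (illustrative values): dropout as a layer, and L2 regularization applied as the optimizer's weight decay.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(100, 50),
    nn.ReLU(),
    nn.Dropout(p=0.5),      # dropout: randomly zero units during training
    nn.Linear(50, 2),
)

# L2-style regularization is commonly applied as weight decay in the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```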
Some Practical Advice
Optimization
• Stochastic Gradient Descent
• Mini-Batch Gradient Descent
• AdaGrad
• AdaDelta
• RMSProp
• Adam

Learning rate schedules

https://ruder.io/optimizing-gradient-descent/
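
A minimal training-loop skeleton showing one optimizer from the list plus a learning-rate schedule (PyTorch sketch; the model, data, and hyperparameters are placeholders).

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# learning rate schedule: decay the LR by 10x every 5 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

for epoch in range(10):
    # one toy mini-batch; in practice iterate over a DataLoader
    x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()
```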
Glorot/Xavier Initialization (tanh)
• Initializing a W matrix of dimensionality d_in × d_out

He Initialization (ReLU)
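The slide's formulas did not survive extraction; for reference, the standard Glorot/Xavier and He initializations, as implemented in PyTorch, look like this.

```python
import torch
import torch.nn as nn

d_in, d_out = 300, 100
W = torch.empty(d_out, d_in)

# Glorot/Xavier (suited to tanh): uniform in [-sqrt(6/(d_in+d_out)), +sqrt(6/(d_in+d_out))]
nn.init.xavier_uniform_(W)

# He/Kaiming (suited to ReLU): normal with std sqrt(2/d_in)
nn.init.kaiming_normal_(W, nonlinearity='relu')
```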


Batching
• Padding
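Batching variable-length sentences requires padding them to a common length; a PyTorch sketch using pad_sequence (the token IDs are made up).

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# three "sentences" of different lengths, as token-id tensors
sents = [torch.tensor([4, 7, 2]), torch.tensor([9, 1]), torch.tensor([5, 3, 8, 6])]

# pad with 0 so they form one (batch, max_len) tensor
batch = pad_sequence(sents, batch_first=True, padding_value=0)
print(batch.shape)   # torch.Size([3, 4])
```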
Vanishing and Exploding Gradients
• Clipping
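Gradient clipping in one line (PyTorch sketch): rescale the gradients before the optimizer step so their total norm never exceeds a threshold.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(8, 10)).sum()
loss.backward()

# clip the global gradient norm to at most 1.0 to limit exploding gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```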
