07 DL Intro
Mausam
Disclaimer: this is an outsider’s understanding. Some details may be inaccurate
[Figure: Francesconi, 2022]
The Localist vs. Distributed Debate
Distributed vs. Localist Representations
Localist: “..one computing element for each entity”
Distributed:
• “Each entity is represented by a pattern of activity distributed over many computing elements”
• “each computing element is involved in representing many different entities”
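A minimal numpy sketch (illustrative, not from the slides) contrasting a localist one-hot code with a distributed dense code; the vocabulary and dimensions are made up:

```python
import numpy as np

# Localist: one computing element (dimension) per entity.
vocab = ["cat", "dog", "car"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}
# one_hot["cat"] -> [1., 0., 0.]   (exactly one unit active per entity)

# Distributed: each entity is a pattern over many elements,
# and each element takes part in representing many entities.
rng = np.random.default_rng(0)
dense = {w: rng.normal(size=4) for w in vocab}   # illustrative 4-d vectors
```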
Local Representation [Thorpe 1989]
Distributed Representation [Thorpe 1989]
Semi-Distributed Representation [Thorpe 1989]
Distributed Representations: Pros and Cons
Distributed representations:
• Efficient 😀
• Continuous 😀
• Degrade gracefully 😀
• Less interpretable 😢
Localist representations:
• Easier to work with(?) 😀
• More interpretable 😀
[Pate 2002]
So, who won?
[Diagram sequence: from feature engineering to feature learning]
• Hand-crafted Features → Model (NB, SVM, CRF)
• Model (MF, LSA, IR)
• Learned features z1, z2, … → Neural Model
• NN = (NB, SVM, CRF) + feature discovery
• Supervised Training Data → optimize a function (LL, sqd error, margin, …) → learn feature weights + vectors
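A minimal numpy sketch of that last step, assuming a toy dataset and a log-likelihood (logistic) loss; all names and numbers are illustrative:

```python
import numpy as np

# Toy supervised data: 4 examples, 3 features each, binary labels.
X = np.array([[1., 0., 1.], [0., 1., 1.], [1., 1., 0.], [0., 0., 1.]])
y = np.array([1., 1., 0., 0.])

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=3)    # feature weights to be learned
b = 0.0
lr = 0.5

for step in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))                     # P(y=1 | x)
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))   # negative log-likelihood
    g = (p - y) / len(y)                                        # gradient w.r.t. the logits
    w -= lr * (X.T @ g)                                         # gradient descent on weights
    b -= lr * g.sum()
```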
NLP with DL
Assumptions:
• doc/query/word is a vector of numbers z1, z2, …
• doc: bag/sequence/tree of words
• feature: neural (weights are shared)
• model: bag/seq of features (non-linear)
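A minimal numpy sketch of these assumptions: a doc as a bag of word vectors drawn from a shared embedding table (vocabulary, dimensions, and values are made up):

```python
import numpy as np

# Shared word-vector table: the same weights are reused for every
# occurrence of a word in every document (weight sharing).
rng = np.random.default_rng(0)
vocab = {"the": 0, "movie": 1, "was": 2, "great": 3}
E = rng.normal(size=(len(vocab), 5))                 # 5-d word vectors

doc = ["the", "movie", "was", "great"]
word_vecs = np.stack([E[vocab[w]] for w in doc])     # doc as a sequence of vectors

doc_vec = word_vecs.mean(axis=0)                     # bag-of-words view: one vector per doc
```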
Meta-thoughts
Features
• Learned
• in a task-specific, end-to-end way
• not limited by human creativity
Everything is a “Point”
• Word embedding
• Phrase embedding
• Sentence embedding
• Word embedding in context of sentence
• Etc
[Diagram: Symbolic Input (word) → Encoder → features z1 → Neural Model → Decoder → Symbolic Output (class, sentence, …)]
Vector operations: + (addition), ; (concatenation), . (dot product)
• Uses
– question aligns with answer //QA
– sentence aligns with sentence //paraphrase
– word aligns with (~important for) sentence //attention
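A minimal numpy sketch of dot-product alignment used this way (an attention-style weighting; all vectors are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
word_vecs = rng.normal(size=(4, 5))   # word vectors of a sentence
query_vec = rng.normal(size=5)        # e.g. a question / context vector

scores = word_vecs @ query_vec        # dot product: how well each word aligns with the query
weights = np.exp(scores) / np.exp(scores).sum()   # softmax -> attention weights
summary = weights @ word_vecs         # weighted sum of the words "important for" the query
```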
g(Ax+b)
• 1-layer MLP
• Take x
– project it into a different space (Ax) //relevant to task
– add a bias b (only shifts it up/down)
– apply the non-linearity g to convert it into the required output
• 2-layer MLP
– Common way to convert input to output
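A minimal numpy sketch of these two forms, with made-up dimensions and tanh as g:

```python
import numpy as np

def g(z):
    return np.tanh(z)             # the non-linearity

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 5, 8, 3
x = rng.normal(size=d_in)

# 1-layer MLP: g(Ax + b)
A, b = rng.normal(size=(d_hid, d_in)), np.zeros(d_hid)
h = g(A @ x + b)                  # project, shift, squash

# 2-layer MLP: a common way to map an input vector to output scores
W, c = rng.normal(size=(d_out, d_hid)), np.zeros(d_out)
scores = W @ h + c                # e.g. fed to a softmax over classes
```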
Loss Functions
Cross Entropy
Binary Cross Entropy
Max Margin
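For reference, the standard forms these three losses usually denote (with gold label y* and predicted probability \hat{p}; the max-margin form matches the one given later in these slides):

\[
L_{\text{CE}} = -\log P(y^{*}\mid x),\qquad
L_{\text{BCE}} = -\bigl[\,y\log\hat{p} + (1-y)\log(1-\hat{p})\,\bigr],\qquad
L_{\text{MM}} = \max\bigl(0,\; 1 - (\mathrm{score}(y^{*}) - \mathrm{score}(\hat{y}))\bigr)
\]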
Encoder-Decoder
[Diagram: Symbolic Input (word) → Encoder → features z1 → Neural Model → Decoder → Symbolic Output (class, sentence, …); the predicted P(y) is scored against the gold y* by the LOSS]
Common Loss Functions
• Max Margin
Loss = max(0, 1 - (score(y*) - score(y_best)))
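A minimal numpy sketch of this hinge-style loss, reading y_best as the highest-scoring competing output (toy scores, purely illustrative):

```python
import numpy as np

scores = np.array([2.0, 2.5, 0.3])   # model scores for each candidate output
gold = 0                              # index of the correct output y*

best_wrong = np.delete(scores, gold).max()           # score of the best competing output
loss = max(0.0, 1.0 - (scores[gold] - best_wrong))   # max(0, 1 - (score(y*) - score(y_best)))
# here: max(0, 1 - (2.0 - 2.5)) = 1.5
```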
https://ruder.io/optimizing-gradient-descent/
Glorot/Xavier Initialization (tanh)
• Initializing a W matrix of dimensionality d_in × d_out
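A minimal numpy sketch of the standard Glorot/Xavier uniform initialization (Glorot & Bengio, 2010), the variant commonly paired with tanh; the function name and dimensions are illustrative:

```python
import numpy as np

def glorot_uniform(d_in, d_out, rng=None):
    # Sample W uniformly in [-limit, +limit] with limit = sqrt(6 / (d_in + d_out)),
    # chosen so activation/gradient variance stays roughly constant across layers.
    if rng is None:
        rng = np.random.default_rng(0)
    limit = np.sqrt(6.0 / (d_in + d_out))
    return rng.uniform(-limit, limit, size=(d_in, d_out))

W = glorot_uniform(128, 64)   # a d_in x d_out weight matrix
```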