Basics of NLP
CATEGORIZATION OF LEARNING
Disclaimer:
This categorization is rather coarse
The list of paradigms is extendable
Not everything is unambiguous, there might be overlap
Connection to tasks/data:
Given the task, some paradigms are more suitable
Given the amount of data, a specific paradigm might be preferable
Presence/Absence of labels makes certain paradigms (in)feasible
Distinction between:
Embedding texts
Pre-training & fine-tuning a model
Prompting
Interaction & Generation
Agents
Pre-training:
Using unlabeled corpora with self-supervised objectives is referred to as Pre-Training
Pre-training examples require no annotation, the inherent structure of the text is exploited
Construction of different self-supervised objectives, which are assumed to:
- cover different phenomena better than others
- be more efficient for learning
Example 1: predict the next word in a sentence
Example 2: predict a masked word
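A toy sketch of how such self-supervised training examples can be built from raw text without any annotation; the tokenization and the mask token are illustrative choices, not a specific library's conventions.
```python
# Build self-supervised training examples from raw, unlabeled text.
tokens = "the cat sat on the mat".split()

# Example 1 (causal LM): predict the next word from the preceding context.
next_word_examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
# e.g. (['the', 'cat'], 'sat')

# Example 2 (masked LM): predict a masked word from its left and right context.
masked_examples = [(tokens[:i] + ["[MASK]"] + tokens[i + 1:], tokens[i])
                   for i in range(len(tokens))]
# e.g. (['the', 'cat', '[MASK]', 'on', 'the', 'mat'], 'sat')

print(next_word_examples[2], masked_examples[2])
```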
Fine-tuning:
The second phase of transfer learning, i.e. adapting the pre-trained model to a labeled data set for a specific
downstream task is referred to as Fine-Tuning
Far less labeled data required compared to a scenario w/o pre-training
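A hedged sketch of fine-tuning with the Hugging Face transformers API; the model name, the binary sentiment labels and the single training step are illustrative only, not the course's setup.
```python
# Fine-tuning sketch: adapt a pre-trained encoder to a small labeled downstream task.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["great movie!", "utterly boring"]          # small labeled data set
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)             # pre-trained weights + new classification head
outputs.loss.backward()                             # adapt the weights to the downstream task
optimizer.step()
```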
PROMPTING
Accessing pre-trained models:
Fine-tune them
Also possible: No fine-tuning, but ...
Zero-Shot Transfer w/o ANY labeled data (only describe the task)
Few-Shot Transfer w/ FEW labeled data points (describe the task, and show examples as context)
this is called in-context learning (example prompts are sketched after the definitions below)
In both of the latter cases, good pre-training becomes even more important
Definition(s):
GPT-3 paper: "Task Description" (accompanied by samples + labels)
Prompt: Describes the task the model should perform
Prompt Engineering: Finding the best prompt(s) for one (or across multiple) task(s)
Prompt Tuning: Add trainable weights ("soft prompt") to inputs and fine-tune
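Illustrative zero-shot and few-shot (in-context learning) prompts for a sentiment task; the wording and the examples are made up, and no model-specific API is assumed.
```python
# Zero-shot: only the task description, no labeled examples.
zero_shot_prompt = (
    "Classify the sentiment of the following review as positive or negative.\n"
    "Review: The plot was predictable and the acting was flat.\n"
    "Sentiment:"
)

# Few-shot: task description plus a few labeled examples shown as context;
# no model weights are updated in either case.
few_shot_prompt = (
    "Classify the sentiment of the following reviews as positive or negative.\n"
    "Review: An absolute delight from start to finish. Sentiment: positive\n"
    "Review: I want those two hours of my life back. Sentiment: negative\n"
    "Review: The plot was predictable and the acting was flat. Sentiment:"
)
```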
CHATTING / GENERATION
Interacting with the model
Larger model sizes, reduced latency and improved training regimes enable conversations with the models
Enables the user to:
- have multi-turn conversations, with the model "remembering" previous inputs (see the sketch below)
- refine the prompt in case of unsatisfactory output
- use increased context sizes for the prompts
Still: Static, pre-trained model with "knowledge"
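A rough sketch of how multi-turn chat is typically represented: previous turns are simply kept in the (growing) context that is fed back to the static, pre-trained model. The role names and the flattening scheme are illustrative assumptions, not a specific API.
```python
# Multi-turn conversation as a list of turns; the model "remembers" earlier inputs
# only because they are included in the prompt again at every step.
conversation = [
    {"role": "user", "content": "Summarise the plot of Hamlet in one sentence."},
    {"role": "assistant", "content": "A Danish prince seeks revenge for his father's murder."},
    {"role": "user", "content": "Now make it rhyme."},   # refining the prompt
]

def to_prompt(turns):
    # Flatten all previous turns into one context string for the next prediction.
    return "\n".join(f"{t['role']}: {t['content']}" for t in turns) + "\nassistant:"

print(to_prompt(conversation))
```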
Interacting with the model: Persona-Chat Benchmark
OUTLOOK: Agents
NLP TASKS
Learning goals
Understand the different types of tasks (low- vs. high-level)
Purely Linguistic tasks vs. more general classification tasks
LANGUAGE MODELING
LOW-LEVEL: SEMANTICS
SEQUENCE-LEVEL CLASSIFICATION
Output can also be non-binary, i.e. multi-class/-label
SEQ2SEQ MODELING
The model reads an input sentence "ABC" and produces "WXYZ" as the output sentence. The model stops making
predictions after outputting the end-of-sentence token. Note that the LSTM reads the input sentence in reverse,
because doing so introduces many short-term dependencies in the data that make the optimization problem much
easier. (Figure caption from Sutskever et al., 2014, "Sequence to Sequence Learning with Neural Networks")
Notes:
In the meantime: Transformers replaced LSTMs
Overall architecture (Encoder-Decoder) still used
Used for:
(Neural) Machine Translation
Summarization
Question answering
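A minimal sketch of the encoder-decoder idea described above, here with LSTMs (as in the original seq2seq work, before Transformers took over); vocabulary size, dimensions and the toy batch are illustrative assumptions.
```python
# Minimal encoder-decoder (seq2seq) sketch with LSTMs and teacher forcing.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)     # predicts the next target token

    def forward(self, src_ids, tgt_ids):
        _, (h, c) = self.encoder(self.embed(src_ids))    # encode source, keep final state
        dec_out, _ = self.decoder(self.embed(tgt_ids), (h, c))  # decode conditioned on it
        return self.out(dec_out)                         # (batch, tgt_len, vocab_size)

model = Seq2Seq(vocab_size=1000)
src = torch.randint(0, 1000, (2, 5))                     # toy source batch ("ABC")
tgt = torch.randint(0, 1000, (2, 4))                     # toy target prefixes ("WXYZ")
logits = model(src, tgt)                                 # train with cross-entropy vs. shifted targets
```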
TRADITIONAL BENCHMARKING: NLU
GLUE: Nine sentence- or sentence-pair language understanding tasks
Public leaderboard, (still) very popular benchmark collection
WinoGrande: Test whether the model can identify the correct reference
HellaSwag: Pick the best ending to the context
PIQA (Physical Interaction: Question Answering): Test whether the model can identify the most plausible
continuation (see also LAMBADA, HellaSwag)
Neural Probabilistic Language Model
Learning goals
grasp importance of the "look-up table" a.k.a. embedding layer
understand computational implications of language modeling
Architecture (figure): look-up table (embedding layer), followed by a non-linearity, e.g. tanh or ReLU
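A minimal sketch of a Bengio-style neural probabilistic language model: look-up table (embedding layer), non-linearity (tanh), then scores over the vocabulary. Vocabulary size, context length and dimensions are illustrative assumptions.
```python
# Neural probabilistic LM sketch: predict the next word from a fixed-size context.
import torch
import torch.nn as nn

class NeuralProbabilisticLM(nn.Module):
    def __init__(self, vocab_size: int, context_size: int = 3,
                 emb_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        self.lookup = nn.Embedding(vocab_size, emb_dim)      # the "look-up table"
        self.hidden = nn.Linear(context_size * emb_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)      # one score per vocabulary word

    def forward(self, context_ids):                          # (batch, context_size)
        emb = self.lookup(context_ids).flatten(start_dim=1)  # concatenate context embeddings
        hidden = torch.tanh(self.hidden(emb))                # non-linearity
        return self.output(hidden)                           # logits; softmax / cross-entropy on top

model = NeuralProbabilisticLM(vocab_size=10_000)
logits = model(torch.randint(0, 10_000, (4, 3)))             # next-word scores for 4 contexts
```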
Word Embeddings
Learning goals
Understand what word embeddings are
Learn the main methods for creating them
MOTIVATION
How to represent words/tokens in a neural network?
Possible solution: one-hot encoded indicator vectors of length |V|.
- Better: dense word vectors of typical dimensionality d << |V| (e.g. a few hundred dimensions)
- Embedding matrix: E in R^(|V| x d), one row (word vector) per vocabulary entry
Question: Advantages of using word vectors?
- We can express similarities between words, e.g., with cosine similarity
- Since the embedding operation is a lookup operation, we only need to update the vectors that occur in a
given training batch
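A small sketch of the embedding look-up and the cosine similarity mentioned above; the toy vocabulary, the dimensionality and the random vectors are made up for illustration.
```python
# Embedding look-up and cosine similarity between word vectors.
import numpy as np

vocab = {"cat": 0, "dog": 1, "car": 2}
d = 4                                               # embedding dimensionality, d << |V|
E = np.random.randn(len(vocab), d)                  # embedding matrix, one row per word

def embed(word: str) -> np.ndarray:
    return E[vocab[word]]                           # embedding = row look-up

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embed("cat"), embed("dog")))           # similarity between two word vectors
```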
Supervised training?
Training embeddings from scratch:
- Initialize randomly and learn it during training phase
- Words that play similar roles w.r.t. task get similar embeddings.
- Example: Sentiment classification: we might expect words with similar sentiment (e.g. "great", "awesome") to receive similar embeddings
- We typically have more unlabeled than labeled data. Can we learn embeddings from the unlabeled data?
Question: What could be a problem at test time? Answer: If the training set is small, many words are unseen during
training and therefore have random vectors
Distributional hypothesis: “A word is characterized by the company it keeps“ (J.R. Firth, 1957)
Idea:
Learn similar vectors for words that occur in similar contexts
Three different (milestone) methods:
o Word2Vec
o GloVe (not covered)
o FastText
Question: How do we obtain a probability distribution over the vocabulary?
Answer: Softmax
Question: Problem with training the softmax? (it is slow)
Answer: It needs to compute dot products with the whole vocabulary in the denominator for every single prediction
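As a sketch in standard skip-gram notation (an assumed convention: centre-word vectors $v$, context-word vectors $v'$), the softmax probability of a context word $c$ given a centre word $w$ is

$$p(c \mid w) = \frac{\exp\left(v_c'^{\top} v_w\right)}{\sum_{c' \in V} \exp\left(v_{c'}'^{\top} v_w\right)}$$

The denominator sums over the entire vocabulary $V$, i.e. one dot product per vocabulary word for every single prediction.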
SPEEDING UP TRAINING: NEGATIVE SAMPLING
SKIP-GRAM (WORD2VEC)
Create a fake task:
Training objective: Given a word, predict the neighbouring words
Generation of samples: Sliding fixed-size window over the text
Negative-sampling objective for a positive pair of centre word $w$ and observed context word $c$:
$$\log \sigma\left(v_c'^{\top} v_w\right) + \sum_{i=1}^{M} \log \sigma\left(-v_{n_i}'^{\top} v_w\right)$$
where $v_{n_i}'$ is the word vector of a random (negative) word and $M$ is the number of negatives per positive sample
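A rough sketch of the sample generation described above (sliding window plus M random negatives per positive pair); the toy corpus, the window size and the uniform negative sampling are simplifying assumptions (Word2Vec draws negatives from a smoothed unigram distribution).
```python
# Skip-gram training pairs with negative sampling from a toy corpus.
import random

corpus = "the cat sat on the mat".split()
vocab = sorted(set(corpus))
window, M = 2, 3                                       # context window size; M negatives per positive

samples = []                                           # (centre, candidate context, label)
for i, centre in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j == i:
            continue
        samples.append((centre, corpus[j], 1))         # observed (positive) pair
        for _ in range(M):
            samples.append((centre, random.choice(vocab), 0))   # random (negative) pair

print(samples[:4])
```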
FASTTEXT
Accomplishments:
Words can be represented as dense, low-dimensional vectors
Easy to capture similarity between words
Additive Compositionality of word vectors
Open issues:
Even if we train Word2Vec on a very large corpus, we will still encounter unknown words at test time
What about rare words?
Orthography can often help us:
w(remuneration) should be similar to:
- w(remunerate): same stem
- w(iteration), w(consideration), ...: same suffix ~ same part of speech
Assume we want to represent the word example:
Character n-grams (n = 3): <ex, exa, xam, amp, mpl, ple, le>
In practice, we don't use a single fixed n but rather all character n-grams for a range of n (typically 3 to 6)
Note that the 4-gram exam (occurring inside example) is different from the word <exam>.
Representation of a known word: Average of the word’s embedding and char-n-gram embeddings
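A small sketch of the character n-gram extraction and the averaging described above; the boundary symbols "<" and ">", the n range and the random placeholder vectors are illustrative assumptions (in fastText the n-gram vectors are learned during training).
```python
# fastText-style subword representation: extract character n-grams and average their vectors.
import numpy as np

d = 8                                                  # illustrative embedding dimensionality
ngram_vectors: dict[str, np.ndarray] = {}              # learned in real fastText, random here

def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> list[str]:
    wrapped = f"<{word}>"                              # boundary symbols mark word start/end
    grams = [wrapped[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(wrapped) - n + 1)]
    return grams + [wrapped]                           # n-grams plus the full word itself

def vec(gram: str) -> np.ndarray:
    return ngram_vectors.setdefault(gram, np.random.randn(d))

def word_vector(word: str) -> np.ndarray:
    # Average of the word's own embedding (<word>) and its char-n-gram embeddings.
    return np.mean([vec(g) for g in char_ngrams(word)], axis=0)

print(char_ngrams("example", 3, 3))                    # ['<ex', 'exa', ..., 'le>', '<example>']
v = word_vector("remuneration")                        # also works for rare or unseen words
```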
FASTTEXT TRAINING
The set of n-grams typically contains all character 3- to 6-grams of the word
Replace the word vector in the skip-gram objective with its new definition (the combination of the word's n-gram embeddings). During backpropagation, the gradient of the loss is distributed to all of the word's n-gram vectors.
SUMMARY
Word2Vec as a bigram Language Model
Negative Sampling
Skipgram: Predict words in window given word in the middle
fastText: N-gram embeddings generalize to unseen words