Basics of NLP

Learning Paradigm

3 settings we explore for in-context learning


1) Zero-shot: the model predicts the answer given only a natural language description of
the task. No gradient updates are performed.
2) One-shot: in addition to task description, the model sees a single example of the task.
No gradient updates are performed.
3) Few-shot: In addition to the task description, the model sees a few examples of the
task. No gradient updates.
Traditional Fine-tuning (not used for GPT-3):
- Fine tuning:
1) The model is trained via repeated gradient updates using a large corpus of example
tasks.
Example (English→French translation, see the prompt-format sketch below):
sea otter -> loutre de mer (gradient update)
peppermint -> menthe poivrée (gradient update)
plush giraffe -> girafe peluche (gradient update)
cheese -> … (prompt)
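A minimal sketch (not from the source) of how the three prompt formats differ; the build_prompt helper and the example strings are illustrative only, mirroring the translation example above. No gradient updates are involved in any of the three settings.

```python
# Illustrative sketch: zero-, one- and few-shot prompts differ only in how many
# solved examples are placed in the context; the model parameters stay fixed.

def build_prompt(task_description, examples, query):
    """Assemble an in-context learning prompt from a task description,
    zero or more solved examples, and the query to be completed."""
    lines = [task_description]
    for source, target in examples:          # empty list -> zero-shot
        lines.append(f"{source} -> {target}")
    lines.append(f"{query} ->")              # the model completes this line
    return "\n".join(lines)

task = "Translate English to French:"
demos = [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")]

zero_shot = build_prompt(task, [], "cheese")
one_shot  = build_prompt(task, demos[:1], "cheese")
few_shot  = build_prompt(task, demos, "cheese")
print(few_shot)
```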

CATEGORIZATION OF LEARNING
Disclaimer:
This categorization is rather coarse
The list of paradigms is extendable
Not everything is unambiguous, there might be overlap
Connection to tasks/data:
Given the task, some paradigms are more suitable
Given the amount of data, a specific paradigm might be preferable
Presence/Absence of labels makes certain paradigms (in)feasible
Distinction between:
Embedding texts
Pre-training & fine-tuning a model
Prompting
Interaction & Generation
Agents

WORD VECTORS: ONE-HOT ENCODING


Problem statement
Words are discrete units
We can represent them as (high-dimensional) one-hot vectors
This makes it difficult/impossible to e.g. capture similarity between synonyms
Documents can be represented as a vector of word occurrences (bag-of-words)
Example of one-hot: w(football) = [0,0,0,0,1,0,0,0,0,0,0,0], w(basketball) = [0,1,0,0,0,0,0,0,0,0,0,0],

Problems of one-hot embeddings


high dimensionality
not possible to measure similarity
Alternative: Dense embeddings

WORD VECTORS: EMBEDDING


Measuring similarity now possible:

Not only possible for words, but for whole documents:


Use Case: Document retrieval
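To make the contrast concrete, here is a small numpy sketch with invented vector values (not from the source): one-hot vectors of distinct words are always orthogonal, while dense embeddings allow meaningful cosine similarities that can also be used to rank documents for retrieval.

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# One-hot: every pair of distinct words is orthogonal, so similarity is always 0.
w_football   = np.array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
w_basketball = np.array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
print(cosine(w_football, w_basketball))            # 0.0 -- no notion of relatedness

# Dense embeddings (values invented for illustration): similarity becomes meaningful.
emb = {
    "football":   np.array([0.9, 0.8, 0.1]),
    "basketball": np.array([0.8, 0.9, 0.2]),
    "cheese":     np.array([0.1, 0.0, 0.9]),
}
print(cosine(emb["football"], emb["basketball"]))  # high
print(cosine(emb["football"], emb["cheese"]))      # low

# Document retrieval: rank documents by cosine similarity of their vectors
# (e.g. averaged word embeddings) to a query vector.
def retrieve(query_vec, doc_vecs):
    return sorted(doc_vecs, key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
                  reverse=True)
```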
PRE-TRAIN/FINE-TUNE
Problem statement
The larger the models, the more data is needed to train them
(Labeled) Data is scarce and expensive!
Many languages are underrepresented in terms of resources: number of speakers (of a language) ≠ amount of available text
Unlabeled (English) text data is ubiquitous
[Figure: standard machine learning setup vs. transfer learning setup]

Pre-training:
Using unlabeled corpora with self-supervised objectives is referred to as Pre-Training
Pre-training examples require no annotation, the inherent structure of the text is exploited
Construction of different self-supervised objectives, which are assumed to:
- cover different phenomena better than the others
- work more efficiently for learning
Example 1: predict the next word in a sentence
Example 2: predict a masked word
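A minimal sketch of how such self-supervised training examples can be derived from raw text without any annotation; the sentence and the single-token masking are simplifications (real pre-training recipes such as BERT's masking scheme differ).

```python
import random

text = "unlabeled text data is ubiquitous".split()

# Example 1: next-word prediction -- (context, target) pairs from raw text.
next_word_examples = [(text[:i], text[i]) for i in range(1, len(text))]
# e.g. (['unlabeled', 'text'], 'data')

# Example 2: masked-word prediction -- hide one token, predict it from both sides.
i = random.randrange(len(text))
masked_input = text[:i] + ["[MASK]"] + text[i + 1:]
masked_example = (masked_input, text[i])
```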

Fine-tuning:
The second phase of transfer learning, i.e. adapting the pre-trained model to a labeled data set for a specific
downstream task is referred to as Fine-Tuning
Far less labeled data required compared to a scenario w/o pre-training

PROMPTING
Accessing pre-trained models:
Fine-tune them
Also possible: No fine-tuning, but ..
 Zero-Shot Transfer w/o ANY labeled data (only describe the task)
 Few-Shot Transfer w/ FEW labeled data points (describe the task, and show examples as context)
this is called in-context learning
In both of the latter cases, good pre-training becomes even more important
Definition(s):
GPT-3 paper: "Task Description" (accompanied by samples + labels)
Prompt: Describes the task the model should perform
Prompt Engineering: Finding the best prompt(s) for one (or across multiple) task(s)
Prompt Tuning: Add trainable weights ("soft prompt") to inputs and fine-tune
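As an illustration of the soft-prompt idea, a minimal PyTorch-style sketch: a small set of trainable vectors is prepended to the (frozen) pre-trained model's input embeddings, and only these vectors are updated during fine-tuning. Class name and sizes are assumptions for illustration, not a specific library API.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Prepend n_prompt trainable vectors to the input embeddings.
    Only these vectors receive gradient updates; the model stays frozen."""
    def __init__(self, n_prompt: int, d_model: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_prompt, d_model) * 0.02)

    def forward(self, input_embeds):              # (batch, seq_len, d_model)
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

soft = SoftPrompt(n_prompt=20, d_model=768)
dummy_inputs = torch.randn(2, 10, 768)            # stand-in for token embeddings
extended = soft(dummy_inputs)                     # shape: (2, 30, 768)
```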

CHATTING / GENERATION
Interacting with the model
Larger model sizes, reduced latency and improved training regimes enable conversations with the models
Enables the user to:
- have multi-turn conversations, with the model "remembering" previous inputs
- refine the prompt in case of unsatisfactory output
- use increased context sizes for the prompts
Still: Static, pre-trained model with "knowledge"
Interacting with the model: Persona-Chat Benchmark
OUTLOOK: Agents

NLP TASKS
Learning goals
Understand the different types of tasks (low- vs. high-level)
Purely Linguistic tasks vs. more general classification tasks

CATEGORIZATION OF NLP TASKS


Distinction between:
Language modeling
Token-level classification
Sequence-level classification
Similarity / Retrieval
Text generation
Connection to learning paradigms:
Given the task, some learning paradigms are more suitable
Tasks can be formulated differently to fit a given learning paradigm
Amount of available (labeled) data might depend on task
Presence/Absence of labels important to consider

LANGUAGE MODELING

Predict the next token:


S = "Where are we …" (word being predicted, should be "going")
P(S) = P(Where) x P(are|Where) x P(we|Where are) x P(going|Where are we)
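A toy sketch of this factorization with invented conditional probabilities, only to make the chain rule concrete.

```python
import math

# Invented conditional probabilities, only to make the factorization concrete.
cond_probs = {
    ("Where",):                      0.05,   # P(Where)
    ("Where", "are"):                0.30,   # P(are | Where)
    ("Where", "are", "we"):          0.20,   # P(we | Where are)
    ("Where", "are", "we", "going"): 0.40,   # P(going | Where are we)
}

log_p = sum(math.log(p) for p in cond_probs.values())
print(math.exp(log_p))   # P("Where are we going") under this toy model
```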

CATEGORIZATION OF NLP TASKS


"Low-Level" tasks:
Token-level Classification: Problems on a word/token level
Modeling relationships between words/tokens
"High-Level" tasks:
Sequence-level Classification: Problems on a sequence level
Retrieval: Assess (semantic) similarity on document-level
Producing sequences of text based on an input sequence, known as seq2seq tasks
Note: The latter one is also an instance of a generation task
LOW-LEVEL: SEQUENCE TAGGING
POS-tagging (part of speech):
Examples: Time flies like an arrow. // Fruit flies like a banana.
IN = preposition or subordinating conjunction (conjunction here); VBZ = verb, 3rd person singular present; DT = determiner; NN = singular noun
LOW-LEVEL: STRUCTURE PREDICTION- Chunking/Parsing

LOW-LEVEL: SEMANTICS

Word sense disambiguation

NAMED ENTITY RECOGNITION (NER)


BIO-Tagging
B – beginning of an entity (B-PER for person, B-LOC for location), I – inside an entity (e.g. I-PER, I-LOC), O – outside any entity
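A small sketch (illustrative, not from the source) that decodes a BIO tag sequence back into entity spans, using the tag scheme described above.

```python
def bio_to_entities(tokens, tags):
    """Collect (entity_type, token_span) pairs from a BIO-tagged sequence."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):                 # a new entity begins
            if current:
                entities.append((current_type, current))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current:   # entity continues
            current.append(token)
        else:                                    # "O": outside any entity
            if current:
                entities.append((current_type, current))
            current, current_type = [], None
    if current:
        entities.append((current_type, current))
    return entities

tokens = ["Angela", "Merkel", "visited", "Munich"]
tags   = ["B-PER", "I-PER", "O", "B-LOC"]
print(bio_to_entities(tokens, tags))   # [('PER', ['Angela', 'Merkel']), ('LOC', ['Munich'])]
```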
NER AS TOKEN-LEVEL CLASSIFICATION
Pre-train/fine-tune:

HIGH-LEVEL NLP TASKS


 Information Extraction: search, event detection, textual entailment
 Writing Assistance: spell checking, grammar checking, auto-completion
 Text Classification: spam, sentiment, author, plagiarism
 Natural language understanding: metaphor analysis, argumentation mining, question-answering
 Natural language generation: summarization, tutoring systems, chat bots
 Multilinguality: machine translation, cross-lingual information retrieval

SEQUENCE-LEVEL CLASSIFICATION
Output can also be non-binary, i.e. multi-class/-label

Reformulation as generative task:

RETRIEVAL: Document retrieval


GENERATION: MACHINE TRANSLATION
A brief History of Machine Translation
Rule-Based Machine Translation (50s – 80s): Dictionaries + Grammatical Rules
Example-Based Machine Translation (80s – 90s): First suggested by Makoto Nagao (1984), Based on bilingual text
corpora
Statistical Machine Translation (90s – 10s): Mostly driven by IBM research
Neural Machine Translation (10s – now): Based on neural networks (LSTMs, Transformers)

SEQ2SEQ MODELING

The model reads an input sentence “ABC” and produces “WXYZ” as the output sentence. The model stops making
predictions after outputting the end-of-sentence token. Note that the LSTM reads the input sentence in reverse,
because doing so introduces many short-term dependencies in the data that make the optimization problem much
easier.
Notes:
In the meantime: Transformers replaced LSTMs
Overall architecture (Encoder-Decoder) still used
Used for:
(Neural) Machine Translation
Summarization
Question answering
TRADITIONAL BENCHMARKING: NLU
Nine sentence- or sentence-pair language understanding tasks
Public leaderboard, (still) very popular benchmark collection

WinoGrande: Test whether the model can identify the correct reference

HellaSwag
Pick the best ending to the context.

PIQA (Physical Interaction: Question Answering): Test whether the model can identify the most plausible continuation (see also LAMBADA, HellaSwag)
Neural Probabilistic Language Model
Learning goals
grasp importance of the “look-up table“ a.k.a. embedding layer
understand computational implications of language modeling

WHAT IS A LANGUAGE MODEL?


Wikipedia says:
"A statistical language model is a probability distribution over sequences of words"
This means (a) assigning a probability to a sequence of words, e.g.
P("we are all interested in NLP")
and (b) assigning a probability to the likelihood of a word given a sequence of words, e.g.
P("NLP"|"we are all interested in")

MAKING USE OF THE MARKOV-ASSUMPTION


The Markov-Assumption
"The future is independent of the past given the present"
In NLP context:
- Next word only depends on the k previous words
- k-th order Markov assumption, with k to be chosen manually
"Traditional" count-based models
Good baselines, but severe shortcomings
Lacking the ability to generalize
WHAT ARE POTENTIAL PROBLEMS?
Curse of dimensionality
Linear increase in context size leads to an exponential increase in the number of parameters
Considering a vocabulary of size |V| = 100,000: already for bi-grams there are |V|^2 = 10^10 possible combinations
Sparsity
Again considering |V| = 1,000,000 and bi-grams as context:
It is unlikely to observe all of the bi-gram combinations (a) ever, or (b) often enough
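A tiny count-based bigram model on a toy corpus illustrates the sparsity problem: only a handful of the |V|^2 possible bigrams are ever observed, and any unseen bigram receives probability zero.

```python
from collections import Counter

corpus = "we are all interested in nlp and we are all interested in data".split()

unigrams = Counter(corpus)
bigrams  = Counter(zip(corpus, corpus[1:]))

def p_bigram(w2, w1):
    """Maximum-likelihood estimate P(w2 | w1) -- zero for any unseen bigram."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

print(p_bigram("are", "we"))    # seen in the corpus: > 0
print(p_bigram("nlp", "we"))    # never observed together: 0.0, even though plausible

vocab_size = len(unigrams)
print(len(bigrams), "observed of", vocab_size ** 2, "possible bigrams")
```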

A NEURAL PROBABILISTIC LANGUAGE MODEL


Idea
Using a neural network induces non-linearity and overcomes the shortcomings of traditional models
(a) Linear increase in #parameters with increasing context size
(b) Better generalization

Input: Context of (n - 1) words


In between:

 Look-up table
 Non-linearity, e.g. tanh, ReLU

Output: Probability distribution over the next word
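A minimal numpy sketch of this architecture (look-up table, concatenation of the n-1 context embeddings, tanh non-linearity, softmax over the vocabulary); all sizes and initializations are illustrative assumptions. Note how the final softmax runs over the whole vocabulary, which is exactly the cost issue discussed next.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h, context = 10_000, 64, 128, 3          # vocab, embedding dim, hidden dim, n-1

C = rng.normal(scale=0.01, size=(V, d))        # look-up table (embedding matrix)
H = rng.normal(scale=0.01, size=(context * d, h))
U = rng.normal(scale=0.01, size=(h, V))        # output layer: one score per word

def predict_next(context_ids):
    x = C[context_ids].reshape(-1)             # look up and concatenate embeddings
    hidden = np.tanh(x @ H)                    # non-linearity
    scores = hidden @ U
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()                 # softmax over the whole vocabulary

probs = predict_next([12, 7, 431])             # P(next word | 3 context words)
print(probs.shape, probs.sum())                # (10000,) 1.0
```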

WHAT COULD BE PROBLEMATIC?


Computational cost
Vanilla softmax is expensive
Proposed solution(s):
1) Hierarchical softmax
2) Sampling Approaches
Still relying on the Markov assumption: Context window has to be specified
manually

Word Embeddings
Learning goals
Understand what word embeddings are
Learn the main methods for creating them

MOTIVATION
How to represent words/tokens in a neural network?
Possible solution: one-hot encoded indicator vectors of length |V|.

Question: Why is this a bad idea?


- Parameter explosion (|V| might be > 1M)
- All word vectors are orthogonal to each other - no notion of word similarity
- Learn one word vector (“word embedding”) per word i

- Typical dimensionality: low (much smaller than |V|)
- Embedding matrix: stacks one embedding vector per vocabulary word (|V| rows, one per word)
Question: Advantages of using word vectors?
- We can express similarities between words, e.g., with cosine similarity

- Since the embedding operation is a lookup operation, we only need to update the vectors that occur in a
given training batch
Supervised training?
Training embeddings from scratch:
- Initialize randomly and learn it during training phase
- Words that play similar roles w.r.t. task get similar embeddings.
- Example: Sentiment Classification: we might expect words expressing similar sentiment (e.g. "good" and "great") to end up with similar embeddings
- We typically have more unlabeled than labeled data. Can we learn embeddings from the unlabeled data?
Question: What could be a problem at test time? If training set is small, many words are unseen during
training and therefore have random vectors

Distributional hypothesis: “A word is characterized by the company it keeps“ (J.R. Firth, 1957)
Idea:
 Learn similar vectors for words that occur in similar contexts
 Three different (milestone) methods:
o Word2Vec
o GloVe (not covered)
o FastText

WORD2VEC AS A BIGRAM LANGUAGE MODEL


Model architecture:
Words in our vocabulary are represented as two sets of vectors:
- one set used when a word is to be predicted (output vectors)
- one set used when a word is conditioned on as context (input vectors)
Predict word “i” given previous word “j”:
Question: What is a possible function f(·)?

Answer: Softmax 
Question: Problem with training softmax? (IT IS SLOW)
Answer: Needs to compute dot products with the whole vocabulary in the denominator for every single prediction
SPEEDING UP TRAINING: NEGATIVE SAMPLING

One option: Hierarchical Softmax (not covered) reduces complexity from O(|V|) to O(log |V|)

Another trick: Negative Sampling (a variant of noise contrastive estimation): Changes the objective function; the
resulting model is not a language model anymore!
IDEA: Instead of predicting the probability distribution over the whole vocabulary, make binary decisions for a small
number of words.
- “Positive“ samples: Bigrams seen in the corpus.
- “Negative“ samples: Random bigrams (not seen in corpus)
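A minimal numpy sketch of this idea under the definitions above: for a bigram seen in the corpus push σ(dot product) towards 1, and for M randomly drawn "negative" bigrams push it towards 0, so only M+1 dot products are needed per update instead of |V|. Sizes and initialization are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, M = 5_000, 50, 5                            # vocab size, embedding dim, negatives per positive

W_context = rng.normal(scale=0.01, size=(V, d))   # vectors for words conditioned on as context
W_target  = rng.normal(scale=0.01, size=(V, d))   # vectors for words being predicted

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(context_id, target_id):
    """Binary objective: the observed bigram vs. M randomly drawn 'negative' bigrams."""
    v = W_context[context_id]
    pos = sigmoid(W_target[target_id] @ v)        # push towards 1 for seen bigrams
    neg_ids = rng.integers(0, V, size=M)          # random words serve as negatives
    neg = sigmoid(-(W_target[neg_ids] @ v))       # push sigma(dot) towards 0 for negatives
    return -np.log(pos) - np.log(neg).sum()       # only M+1 dot products, not |V|

print(negative_sampling_loss(context_id=42, target_id=7))
```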
NEGATIVE SAMPLING: LIKELIHOOD
Given: positive training set pos(O), negative training set neg(O)

Question: Why not just maximize the likelihood?


WORD2VEC WITH NEGATIVE SAMPLING

Maximize likelihood of training data: L(θ) = Π_i p(y(i) | x(i); θ)

↔ minimize negative log likelihood: -Σ_i log p(y(i) | x(i); θ)


Question: What do these components stand for in Word2Vec with negative sampling?
x(i) Word pair, from corpus OR randomly created
y(i) Label:
 1 = word pair is from positive training set,
 0 = word pair is from negative training set

Parameters θ of the model: the two sets of word vectors (the vectors for predicted words and the vectors for context words)

SPEEDING UP TRAINING: NEGATIVE SAMPLING


Constructing a good negative training set can be difficult
Often it is some random perturbation of the training data (e.g. replacing the second word of each bigram by a
random word).
The number of negative samples is often a multiple (1x to 20x) of the number of positive samples
Negative sets are often constructed per batch
Question: How many dot products do we need to calculate for a given word pair? How does this compare to the
naive and hierarchical softmax?

SKIP-GRAM (WORD2VEC)
Create a fake task:
Training objective: Given a word, predict the neighbouring words
Generation of samples: Sliding fixed-size window over the text

Idea: Learn many bigram language models at the same time.


Given word w[t], predict words inside a window around w[t]:
One position before the target word: p(w[t-1] | w[t])
One position after the target word: p(w[t+1] | w[t])
Two positions before the target word: p(w[t-2] | w[t])
... up to a specified window size c.
Models share all parameters!
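A small sketch of how the (center, context) training pairs are generated with a sliding window of size c; the sentence is invented for illustration, and all pairs are trained with the same shared embedding parameters.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs for all positions within the window."""
    for t, center in enumerate(tokens):
        for offset in range(-window, window + 1):
            j = t + offset
            if offset != 0 and 0 <= j < len(tokens):
                yield center, tokens[j]

sentence = "we are all interested in nlp".split()
print(list(skipgram_pairs(sentence, window=2))[:6])
# [('we', 'are'), ('we', 'all'), ('are', 'we'), ('are', 'all'), ('are', 'interested'), ...]
```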
SKIP-GRAM: OBJECTIVE
Optimize the joint likelihood of the 2c language models:
Π over positions t and offsets j with 1 ≤ |j| ≤ c of p(w[t+j] | w[t])

Negative log-likelihood for the whole corpus (of size N):
NLL = -Σ_{t=1..N} Σ_{1 ≤ |j| ≤ c} log p(w[t+j] | w[t])

Using negative sampling as approximation, each term log p(w[t+j] | w[t]) becomes
log σ(u(w[t+j]) · v(w[t])) + Σ_{m=1..M} log σ(-u(neg_m) · v(w[t]))

where u(neg_m) is the word vector of a random word, u/v are the two vector sets from above, and M is the number of negatives per positive sample

FASTTEXT
Accomplishments:
Words can be represented as dense, low-dimensional vectors
Easy to capture similarity between words
Additive Compositionality of word vectors
Open issues:
Even if we train Word2Vec on a very large corpus, we will still encounter unknown words at test time
What about rare words?
Orthography can often help us:
w(remuneration) should be similar to:
- w(remunerate): same stem
- w(iteration), w(consideration): same suffix, ~ same part of speech
Assume we want to represent the word example:
Character n-grams (n = 3): <ex, exa, xam, amp, mpl, ple, le>
In practice, we don't fix a single n but use a range of character n-gram lengths (typically 3 to 6, see FASTTEXT TRAINING below).
Note that the 4-gram exam (occurring inside <example>) is different from the word <exam>.

Representation of a known word: Average of the word’s embedding and char-n-gram embeddings

Representation of an unknown word: Average of char-n-gram embeddings
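A sketch of the n-gram extraction and averaging described above; the lazy random initialization of n-gram vectors stands in for trained embeddings and is an assumption for illustration only.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of '<word>' including boundary symbols."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

print(char_ngrams("example", 3, 3))   # ['<ex', 'exa', 'xam', 'amp', 'mpl', 'ple', 'le>']

d = 50
rng = np.random.default_rng(0)
word_vecs  = {"example": rng.normal(size=d)}   # full-word embeddings (known words only)
ngram_vecs = {}                                # n-gram embeddings (lazily initialized here)

def represent(word):
    """Average of the word's own vector (if known) and its char-n-gram vectors."""
    vecs = [ngram_vecs.setdefault(g, rng.normal(size=d)) for g in char_ngrams(word)]
    if word in word_vecs:                      # known word: include the word vector
        vecs.append(word_vecs[word])
    return np.mean(vecs, axis=0)

known   = represent("example")        # word vector + n-gram vectors
unknown = represent("exemplary")      # unseen word: n-gram vectors only
```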

FASTTEXT TRAINING
ngrams typically contains character 3- to 6-grams

Replace the word vector in the skip-gram objective with its new definition (the average over the word's embedding and its char-n-gram embeddings). During backpropagation, the loss gradient is distributed to the word vector and the associated n-gram vectors.

SUMMARY
Word2Vec as a bigram Language Model
Negative Sampling
Skipgram: Predict words in window given word in the middle
fastText: N-gram embeddings generalize to unseen words

USING PRETRAINED EMBEDDINGS


Knowledge transfer from unlabelled corpus
Design choice: Fine-tune embeddings on task or freeze them?
- Pro: Can learn/strengthen features that are important for task
- Contra: Training vocabulary is a small subset of the entire vocabulary → we might overfit and mess up the topology w.r.t. unseen words
Resources:
https://fasttext.cc/docs/en/crawl-vectors.html
https://nlp.stanford.edu/projects/glove/
ANALOGY MINING

w(a) - w(b) + w(c) ≈ w(d?)

d = argmax over w(d') in W of cos(w(a) - w(b) + w(c), w(d'))
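A sketch of this analogy computation over a tiny set of hand-made vectors, invented purely to show the mechanics; real analogies require embeddings trained on a large corpus, and the three query words are usually excluded from the candidates.

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c, embeddings):
    """Solve 'a - b + c ≈ d?' by ranking all candidate words with cosine similarity."""
    query = embeddings[a] - embeddings[b] + embeddings[c]
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(query, embeddings[w]))

# Toy, hand-made vectors only to show the mechanics (dims: royalty, gender, food).
emb = {
    "king":   np.array([0.9,  0.9, 0.0]),
    "man":    np.array([0.1,  0.9, 0.0]),
    "woman":  np.array([0.1, -0.9, 0.0]),
    "queen":  np.array([0.9, -0.9, 0.0]),
    "cheese": np.array([0.0,  0.0, 1.0]),
}
print(analogy("king", "man", "woman", emb))   # 'queen'
```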
SUMMARY
Applications of Word Embeddings
- Word vector initialization in neural networks for NLP tasks, e.g. sentiment classification of reviews, topical classification of news
- Analogy mining
- Information retrieval: semantic search, query expansion
- Simple and fast aggregations of sentence representations
