Evolution of Language Models and its Applications on HPC
2023-07-28
Ashish Anand
Professor, Dept of CSE, IIT Guwahati
Associate Faculty, Mehta Family School of DS and AI, IIT Guwahati
What is this talk about?
• Spelling correction: "Study was conducted by students" vs "Study was conducted be students"
Application III: Words, Meaning and Representation
• Similar Words
• Synonyms
• Target Sentence:
Many more applications …..
DEFINING LANGUAGE MODEL
Let’s look at some examples
• Speech Recognition
  • "I saw a van" vs "eyes awe an"
Example continued
• Spelling correction
  • "Study was conducted by students" vs "Study was conducted be students"
  • "Their are two exams for this course" vs "There are two exams for this course"
• Machine Translation
  • Source: I have asked him to do homework
  • मैंने उससे पूछा कि होमवर्क करने के लिए (gloss: "I questioned him that for doing homework"; awkward)
  • मैंने उसे होमवर्क करने के लिए कहा (gloss: "I asked him to do the homework")
In each of the examples, the objective is either:
• To find the next probable word
• To find which sentence is more probable (see the sketch below)
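As a concrete illustration of the second objective, here is a minimal sketch (toy corpus, add-one-smoothed bigram model; all names and data are illustrative, not from the talk) that scores two candidate sentences and prefers the more probable one:

from collections import Counter
import math

# Toy corpus; in practice the counts come from a large training corpus.
corpus = [
    "study was conducted by students",
    "there are two exams for this course",
    "the exams were conducted by the department",
]
tokens = [["<s>"] + s.split() + ["</s>"] for s in corpus]
unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter(pair for sent in tokens for pair in zip(sent, sent[1:]))
vocab_size = len(unigrams)

def sentence_logprob(sentence):
    """Sum of add-one-smoothed bigram log-probabilities of the sentence."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(words, words[1:])
    )

# The grammatical variant should come out more probable than the typo.
print(sentence_logprob("study was conducted by students") >
      sentence_logprob("study was conducted be students"))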
Estimating Probability of a sequence
• Chain Rule:
  P(w1, w2, …, wn) = P(w1) P(w2 | w1) P(w3 | w1, w2) … P(wn | w1, …, wn-1)

Estimating P(w1, w2, …, wn)
• Too many possible sentences
• Data sparseness
• Poor generalizability
Markov Assumption
• Approximate the full history by the last k words:
  P(wi | w1, …, wi-1) ≈ P(wi | wi-k, …, wi-1)
• Bigram case: P(wi | w1, …, wi-1) ≈ P(wi | wi-1)
MLE of N-gram models
• Unigram (simplest model)
• Sparsity issues (see the MLE estimates below):
  • OOV words: can be handled with an <UNK> category
  • Words that are present in the corpus, but whose relevant n-gram counts are zero
  • Their probabilities are underestimated (zero under pure MLE)
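For reference, a sketch of the standard MLE estimates behind these counts (assumed notation: N is the total number of training tokens, c(·) a corpus count):

% MLE estimates for unigram and bigram models (assumed notation: N = total token count)
\[
  p_{\mathrm{MLE}}(w_i) = \frac{c(w_i)}{N},
  \qquad
  p_{\mathrm{MLE}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}
\]
% An n-gram seen zero times gets probability zero under pure MLE, hence the sparsity issue.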
N-gram Model: Issue
• Long-distance dependencies, e.g.:
  "The computer which I had just put into the lab on the fifth floor crashed"
SMOOTHING TECHNIQUES
Simplest Approach: Additive Smoothing
• Add-1 smoothing (trigram case):
  p(wi | wi-2, wi-1) = ( c(wi-2, wi-1, wi) + 1 ) / ( c(wi-2, wi-1) + |V| )
• Generalized version, add-δ (see the sketch below):
  p(wi | wi-2, wi-1) = ( c(wi-2, wi-1, wi) + δ ) / ( c(wi-2, wi-1) + δ|V| )
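A minimal sketch of the add-δ formula above, assuming trigram and bigram Counters built from a training corpus (all names and numbers are illustrative):

from collections import Counter

def add_delta_prob(w, context, trigram_counts, bigram_counts, vocab_size, delta=1.0):
    """p(w | context) with add-delta smoothing; context = (w_{i-2}, w_{i-1}).
    delta = 1 recovers add-1 (Laplace) smoothing."""
    numerator = trigram_counts[context + (w,)] + delta         # c(w_{i-2}, w_{i-1}, w_i) + delta
    denominator = bigram_counts[context] + delta * vocab_size  # c(w_{i-2}, w_{i-1}) + delta|V|
    return numerator / denominator

# Tiny demo: an unseen trigram still gets a non-zero probability.
tri = Counter({("conducted", "by", "students"): 3})
bi = Counter({("conducted", "by"): 5})
print(add_delta_prob("students", ("conducted", "by"), tri, bi, vocab_size=10, delta=0.5))
print(add_delta_prob("teachers", ("conducted", "by"), tri, bi, vocab_size=10, delta=0.5))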
Take the help of lower-order models
• Bigram example: c(w1, w2) = 0 = c(w1, w2′), i.e. both bigrams are unseen, but the unigram counts of w2 and w2′ can still tell them apart
• Discounting models (see the interpolation sketch below)
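One standard way to "take the help of lower-order models" is linear interpolation, a relative of the discounting and back-off models named above; a sketch, with illustrative λ weights that would normally be tuned on held-out data:

def interpolated_prob(w, w1, w2, uni, bi, tri, total_tokens, lambdas=(0.6, 0.3, 0.1)):
    """p(w | w1, w2) as a weighted mix of trigram, bigram and unigram MLEs.
    uni/bi/tri are count tables (e.g. collections.Counter); lambdas sum to 1."""
    l3, l2, l1 = lambdas
    p3 = tri[(w1, w2, w)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    p2 = bi[(w2, w)] / uni[w2] if uni[w2] else 0.0
    p1 = uni[w] / total_tokens
    return l3 * p3 + l2 * p2 + l1 * p1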
NEURAL LANGUAGE MODEL
Pre-Transformer Era
Feed-Forward Neural Language Model
• Inefficient
• Unable to exploit sequential nature of text
• Limited Context
• Unidirectional
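To make these drawbacks concrete, a minimal Bengio-style feed-forward LM sketch in PyTorch (layer sizes are illustrative): it sees only a fixed window of previous words (limited context), reads them left to right only (unidirectional), and ends in a full-vocabulary softmax layer (inefficient):

import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    def __init__(self, vocab_size, context_size=3, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)  # full-vocabulary softmax layer

    def forward(self, context_ids):                  # context_ids: (batch, context_size)
        e = self.embed(context_ids)                  # (batch, context_size, embed_dim)
        h = torch.tanh(self.hidden(e.flatten(1)))    # concatenate the window, one hidden layer
        return self.out(h)                           # logits over the next word

# Usage: logits = FeedForwardLM(vocab_size=10000)(torch.randint(0, 10000, (2, 3)))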
HANDLING THE DRAWBACKS
Inefficiency: Hierarchical Softmax
Jurafsky and Martin, Speech and Language Processing, 3rd ed. draft (Jan 2022)
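The figure itself is not reproduced here; as a rough sketch of the idea, below is a simplified two-level (class-based) softmax rather than the full tree-structured hierarchical softmax: only the class distribution and the words inside the target class are scored, instead of all |V| words (sizes and the class assignment are assumptions):

import torch
import torch.nn as nn

class TwoLevelSoftmax(nn.Module):
    """Simplified stand-in for hierarchical softmax: p(w|h) = p(class|h) * p(w|class,h)."""
    def __init__(self, hidden_dim, num_classes, words_per_class):
        super().__init__()
        self.class_out = nn.Linear(hidden_dim, num_classes)
        self.word_weight = nn.Parameter(0.01 * torch.randn(num_classes, words_per_class, hidden_dim))
        self.word_bias = nn.Parameter(torch.zeros(num_classes, words_per_class))

    def log_prob(self, h, class_id, word_in_class):
        # Per-example cost: num_classes + words_per_class scores, instead of |V|.
        log_p_class = torch.log_softmax(self.class_out(h), dim=-1)
        w = self.word_weight[class_id]                      # (batch, words_per_class, hidden)
        scores = torch.einsum("bh,bwh->bw", h, w) + self.word_bias[class_id]
        log_p_word = torch.log_softmax(scores, dim=-1)
        batch = torch.arange(h.size(0))
        return log_p_class[batch, class_id] + log_p_word[batch, word_in_class]

# Usage: m = TwoLevelSoftmax(64, 100, 100)
#        m.log_prob(torch.randn(2, 64), torch.tensor([3, 7]), torch.tensor([5, 0]))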
Unidirectional: ELMo
• Limited bi-directionality
• Difficult to parallelize
TRANSFORMER-BASED LM
Era of Large-scale pre-trained neural language models
Vanilla Transformer Language Model
Jurafsky and Martin, Speech and Language Processing, 3rd ed. draft (Jan 2022)
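A minimal decoder-style Transformer LM sketch in PyTorch (layer sizes are illustrative; this is not the exact model in the textbook figure): token plus position embeddings, self-attention blocks restricted by a causal mask, and a projection back to the vocabulary:

import torch
import torch.nn as nn

class TinyTransformerLM(nn.Module):
    def __init__(self, vocab_size, max_len=128, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, ids):                               # ids: (batch, seq_len)
        seq_len = ids.size(1)
        x = self.tok(ids) + self.pos(torch.arange(seq_len, device=ids.device))
        # Causal mask: each position may attend only to itself and earlier positions.
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf"),
                                       device=ids.device), diagonal=1)
        h = self.blocks(x, mask=causal)
        return self.out(h)                                # next-token logits at every position

# Usage: logits = TinyTransformerLM(vocab_size=5000)(torch.randint(0, 5000, (2, 16)))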
Three Paradigms
• Decoder-only Models (unidirectional)
  • Example: GPT
• Encoder-only Models
  • Example: BERT
• Encoder-decoder Models
  • Input: corrupted sequence; Output: reconstructed original sequence
  • Very convenient to fine-tune for seq2seq tasks (e.g. MT, summarization)
Source: Min et al., arXiv:2111.01243v1 (ACM Computing Surveys)
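An illustrative mapping of the three paradigms onto the Hugging Face transformers API (the checkpoints below are common public examples, not necessarily the ones used in the talk):

from transformers import AutoModelForCausalLM, AutoModelForMaskedLM, AutoModelForSeq2SeqLM

decoder_only = AutoModelForCausalLM.from_pretrained("gpt2")                  # GPT-style, unidirectional LM
encoder_only = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")     # BERT-style, masked LM
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")          # corrupted input -> reconstructed output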
Important aspects of pre-training corpora
• Size
• Quality (source data characteristics) and diversity
• Domain of intended downstream tasks
Pre-training corpora of a few LMs
• For in-context or in-domain learning, fine-tuning of PLMs may not be required, which reduces computational requirements (see the prompt sketch below)
• Allows better alignment of the new task with the pre-training objective
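A sketch of what such in-context (few-shot) use looks like: the task is specified entirely in the prompt, so no PLM parameters are updated (the template and review texts below are illustrative):

few_shot_prompt = """Classify the sentiment of each review as positive or negative.

Review: The lectures were clear and well paced.
Sentiment: positive

Review: The lab machines kept crashing during the assignment.
Sentiment: negative

Review: The new HPC cluster made training much faster.
Sentiment:"""

# The string is passed unchanged to a pre-trained LM's text-completion interface;
# the model is expected to continue with "positive". No gradient updates are involved.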
Main Prompt-based approaches
[Closing figure: evolution of language models]
• Move LMs from rule/grammar-based models to data-driven models
• Statistical Models
• Probabilistic Neural Models: word embeddings, contextual embeddings
• Transformer PLMs: increased context, more generalized, multi-modality, multi-task