
LANGUAGE MODELS AND THEIR APPLICATIONS ON HPC

Faculty Development Programme


CDAC Centre in North East

Ashish Anand
Professor, Dept of CSE, IIT Guwahati
Associate Faculty, Mehta Family School of DS and AI, IIT Guwahati
What is this talk about?

• Defining Language Model

• Major Language Model Paradigms

• Implicit: Role of HPC


QUICK WARM-UP TO NLP
Application I: Automatic Text Completion
Application II: Spelling Correction

• Spelling correction: "Study was conducted by students" vs "Study was conducted be students"
Application III: Words, Meaning and Representation
• Similar Words

• Synonyms

• Word Sense Disambiguation


• I went to a bank to deposit money
• I went to a bank to see calm water currents
Application IV: Sentiment Classification

I like this laptop

My new laptop is not good for computationally intensive tasks

Watching a lecture on my new laptop
Application V: Named Entity Recognition (NER)
Application VI: Machine Translation
• Source Sentence: I have asked him to do homework

• Target Sentence:
Many more applications …..
DEFINING LANGUAGE MODEL
Let’s look at some examples

• Predicting next word

• I am planning ……..

• Speech Recognition
• I saw a van vs eyes awe an
Example continued

• Spelling correction
• "Study was conducted by students" vs "Study was conducted be students"
• "Their are two exams for this course" vs "There are two exams for this course"

• Machine Translation
• I have asked him to do homework
• मैंने उससे पूछा कि होमवर्क करने के लिए (a disfluent, word-for-word rendering)
• मैंने उसे होमवर्क करने के लिए कहा ("I asked him to do the homework", the fluent translation)
In each of these examples, the objective is either
• To find the next probable word
• To find which sentence is more likely

Translating this into a problem formulation:

• Finding the probability of a word given a context
• Finding the probability of a sentence given a context
Language Models (LM)

• Models assigning probabilities to a sequence of words

• P(I saw a van) > P(eyes awe an)

• P(मैंने उससे पूछा कि होमवर्क करने के लिए) < P(मैंने उसे होमवर्क करने के लिए कहा)
Statistical LM
Estimating Probability of a Sequence
• Our task is to compute
P(I, am, fascinated, with, recent, advances, in, AI)

• Chain Rule
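Written out, the chain rule decomposes the joint probability into a product of conditional probabilities:

    P(w_1, w_2, \ldots, w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1}) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})

For the example: P(I, am, fascinated, …) = P(I) · P(am | I) · P(fascinated | I, am) · …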
Estimating Probability of a Sequence
Estimating P(w1, w2, …, wn)

Estimating P(w1, w2, …, wn)
• Too many possible sentences
• Data sparseness
• Poor generalizability
Markov Assumption


Markov Assumption
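Written out, a k-th order Markov assumption approximates each conditional probability using only the previous k words:

    P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-k}, \ldots, w_{i-1})

For example, a bigram model (k = 1) gives P(w_1, \ldots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1}).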


MLE of N-gram models
• Unigram (Simplest Model)

• Bigram (1st order Markov Model)

• Trigram (2nd order Markov Model)
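In standard notation, the maximum-likelihood estimates are ratios of corpus counts, where c(·) is the count in the training corpus and N the total number of tokens:

    \hat{P}(w_i) = \frac{c(w_i)}{N}
    \hat{P}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}
    \hat{P}(w_i \mid w_{i-2}, w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i)}{c(w_{i-2}, w_{i-1})}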


Trigram Model in Summary
Problem with MLE

• Works well if the test corpus is very similar to the training corpus, which is generally not the case

• Sparsity issues
• OOV words: can be handled by an <UNK> category
• Words are present in the corpus but the relevant n-gram counts are zero
• Such probabilities are underestimated
N-gram Model: Issue

• Long-distance dependencies

"The computer which I had just put into the lab on the fifth floor crashed"
SMOOTHING TECHNIQUES
Simplest Approach: Additive Smoothing
• Add-1 Smoothing (trigram case)

    p_{\text{add-1}}(w_i \mid w_{i-2}, w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i) + 1}{c(w_{i-2}, w_{i-1}) + |\mathcal{V}|}

• Generalized version (add-δ)

    p_{\text{add-}\delta}(w_i \mid w_{i-2}, w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i) + \delta}{c(w_{i-2}, w_{i-1}) + \delta\,|\mathcal{V}|}

where |\mathcal{V}| is the vocabulary size.
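A minimal sketch of add-δ smoothing in plain Python, shown for a bigram model for brevity; the toy corpus and δ value are illustrative assumptions, not from the slides:

    from collections import Counter

    corpus = "study was conducted by students . study was completed by students .".split()
    V = set(corpus)                                   # vocabulary observed in the toy corpus
    unigram = Counter(corpus)
    bigram = Counter(zip(corpus, corpus[1:]))

    def p_add_delta(w, prev, delta=1.0):
        # (c(prev, w) + delta) / (c(prev) + delta * |V|)
        return (bigram[(prev, w)] + delta) / (unigram[prev] + delta * len(V))

    print(p_add_delta("by", "conducted"))        # seen bigram: relatively high probability
    print(p_add_delta("students", "conducted"))  # unseen bigram: small but non-zero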
Take the help of lower order models
• Bigram example
• c(w1, w2) = 0 = c(w1, w2′)
• Then p_add(w2 | w1) = p_add(w2′ | w1)
• Let's assume p(w2′) < p(w2)
• We should expect p_add(w2 | w1) > p_add(w2′ | w1), which additive smoothing alone cannot deliver
Take the help of lower order models
• Linear Interpolation Models (see the formula sketched below)

• Discounting Models
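In standard notation, linear interpolation mixes the trigram, bigram and unigram MLE estimates with non-negative weights that sum to one, typically tuned on held-out data:

    p_{\text{interp}}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1\, \hat{P}(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2\, \hat{P}(w_i \mid w_{i-1}) + \lambda_3\, \hat{P}(w_i), \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1,\ \lambda_j \ge 0

Discounting methods instead subtract probability mass from seen n-grams and redistribute it to unseen ones (e.g. absolute discounting, Kneser-Ney).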
NEURAL LANGUAGE MODEL
Pre-Transformer Era
Feed-Forward Neural Language Model

Bengio et al., JMLR 2003
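A minimal sketch of a Bengio-style feed-forward LM, assuming PyTorch; the class name, dimensions, and vocabulary size are illustrative placeholders, not from the slides:

    import torch
    import torch.nn as nn

    class FFNNLM(nn.Module):
        def __init__(self, vocab_size, emb_dim=64, context=3, hidden=128):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)        # shared word representations
            self.fc1 = nn.Linear(context * emb_dim, hidden)     # concatenated context -> hidden
            self.fc2 = nn.Linear(hidden, vocab_size)            # hidden -> scores over vocabulary

        def forward(self, context_ids):                         # context_ids: (batch, context)
            x = self.emb(context_ids).flatten(start_dim=1)      # (batch, context * emb_dim)
            h = torch.tanh(self.fc1(x))
            return self.fc2(h)                                  # logits; softmax is inside the loss

    model = FFNNLM(vocab_size=10000)
    logits = model(torch.randint(0, 10000, (8, 3)))             # 8 contexts of 3 previous words
    loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 10000, (8,)))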


Advantages over statistical n-gram models
• Better flexibility in considering larger context
Advantages over statistical n-gram models
• Better generalizability
• Can generalize to contexts not seen during training
• Example: we need to estimate P(reading | Ram is)
• Assume "Ram is reading" is not in the training data, but "John is reading" is, along with sentences like "Ram is writing" and "John is writing"
• The word representations the model learns for "Ram" and "John" are then likely to be similar, so the model will assign a similar probability to P(reading | Ram is) as to P(reading | John is)
Major drawbacks

• Inefficient
• Unable to exploit sequential nature of text
• Limited Context
• Unidirectional
HANDLING THE DRAWBACKS
Inefficiency: Hierarchical Softmax

Source: Neural Network lectures by Hugo Larochelle
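In one common formulation, the vocabulary is placed at the leaves of a binary tree and the word probability becomes a product of binary decisions along the root-to-leaf path, reducing the normalization cost from O(|V|) to O(log |V|):

    p(w \mid \mathbf{h}) = \prod_{j=1}^{L(w)} \sigma\!\left( s_{w,j}\; \mathbf{v}_{n(w,j)}^{\top} \mathbf{h} \right)

where n(w,1), …, n(w,L(w)) are the internal nodes on the path from the root to the leaf for w, s_{w,j} ∈ {+1, −1} encodes whether the path branches left or right at node j, σ is the logistic sigmoid, and h is the network's hidden representation of the context.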


Limited Context and Sequential Nature: RNN-LM

Source: Jurafsky and Martin, Speech and Language Processing, 3rd Ed. Draft (Jan 2022)
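A minimal RNN-LM sketch along the same lines, again assuming PyTorch with illustrative hyperparameters; the recurrent hidden state carries the entire left context, removing the fixed n-gram window:

    import torch
    import torch.nn as nn

    class RNNLM(nn.Module):
        def __init__(self, vocab_size, emb_dim=64, hidden=128):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.rnn = nn.LSTM(emb_dim, hidden, batch_first=True)  # reads the sequence left to right
            self.out = nn.Linear(hidden, vocab_size)

        def forward(self, token_ids):                  # token_ids: (batch, seq_len)
            h, _ = self.rnn(self.emb(token_ids))       # (batch, seq_len, hidden)
            return self.out(h)                         # next-token logits at every position

    model = RNNLM(vocab_size=10000)
    logits = model(torch.randint(0, 10000, (4, 20)))   # predict token t+1 from tokens 1..t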
Unidirectional: ELMo

Source: Devlin et al. NAACL 2019


Unidirectional: ELMo

Source: Devlin et al. NAACL 2019


Issues with RNN-based LMs

• Limited bi-directionality
• Difficult to parallelize
TRANSFORMER-BASED LM
Era of Large-scale pre-trained neural language models
Vanilla Transformer Language Model

Source: Jurafsky and Martin, Speech and Language Processing, 3rd Ed. Draft (Jan 2022)
Three Paradigms

• Pre-train then fine-tune


• Prompt-based learning
• NLP as text generation
PARADIGM
Pre-train then fine-tune
Three classes of Pre-trained LMs (PLMs)

• Decoder-only Models
• Example: GPT

• Encoder-only Models
• Example: BERT

• Encoder-Decoder Language Models
• Example: BART, T5
Three Types of Training Objectives

• Autoregressive (AR): predict the next word given the previous words
• Masking: predict a masked/hidden word, given the words on both sides of it
• Denoising: correct the perturbation/corruption applied to the input word sequence
• Perturbation examples: sentence permutation, span deletion, etc.
Autoregressive Language Models

Source: Min et al. arxiv:2111.01243v1 (ACM Computing Surveys)


Autoregressive Language Models

Source: Min et al. arxiv:2111.01243v1 (ACM Computing Surveys)


Autoregressive Language Models

• Unidirectional

• Taking the sequential nature of text into account

• Useful for natural language generation downstream tasks

Source: Min et al. arxiv:2111.01243v1 (ACM Computing Surveys)
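As a concrete illustration, an autoregressive PLM can be queried for text completion with the Hugging Face transformers library (assuming it and the public gpt2 checkpoint are available):

    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")
    # The model repeatedly predicts the next token given everything generated so far.
    print(generator("I am planning", max_new_tokens=20))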


Masked Language Model

Source: Min et al. arxiv:2111.01243v1 (ACM Computing Surveys)


Masked Language Model

Source: Min et al. arxiv:2111.01243v1


Masked Language Model
• Inherently bi-directional

• Non-autoregressive nature allows parallelized computation during inference

• Makes an independence assumption among the masked positions

• Pretrain-finetune discrepancy, since fine-tuning inputs contain no [MASK] corruption

• Useful for classification tasks

Source: Min et al. arxiv:2111.01243v1 (ACM Computing Surveys)
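A corresponding masked-prediction illustration, again assuming transformers and the public bert-base-uncased checkpoint:

    from transformers import pipeline

    unmasker = pipeline("fill-mask", model="bert-base-uncased")
    # BERT sees the words on both sides of the mask, unlike an autoregressive model.
    print(unmasker("I went to a [MASK] to deposit money."))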
Encoder-Decoder Models

Source: Min et al. arxiv:2111.01243v1 (ACM Computing Surveys)


Encoder-Decoder Models

Source: Min et al. arxiv:2111.01243v1 (ACM Computing Surveys)


Encoder-Decoder Model

• Bi-directional and sequence-to-sequence learning

• Combines the advantages of bi-directional and autoregressive models

• Very convenient to fine-tune for seq2seq tasks (e.g. MT, summarization)

[Figure: corrupted sequence as input, reconstructed original sequence as output]

Source: Min et al. arxiv:2111.01243v1 (ACM Computing Surveys)
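An encoder-decoder illustration, assuming transformers and the public facebook/bart-large-cnn checkpoint (a BART model fine-tuned for summarization); the input text is a placeholder:

    from transformers import pipeline

    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    text = ("Language models assign probabilities to word sequences. Statistical n-gram models "
            "were followed by neural models, and today large pre-trained Transformers dominate NLP.")
    # The encoder reads the full input bi-directionally; the decoder generates the summary autoregressively.
    print(summarizer(text, max_length=30, min_length=10))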
Important aspects of pre-training corpora
• Size
• Quality (source data characteristics) and diversity
• Domain of intended downstream tasks
Pre-training corpora of a few LMs

Source: Min et al. arxiv:2111.01243v1 (ACM Computing Surveys)


Pre-trained LMs on specific domain

Source: Min et al. arxiv:2111.01243v1 (ACM Computing Surveys)


Pre-trained LMs on different languages

Source: Min et al. arxiv:2111.01243v1 (ACM Computing Surveys)


Fine-Tuning

Source: Min et al. arxiv:2111.01243v1 (ACM Computing Surveys)
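A minimal fine-tuning sketch for a sentence-classification task, assuming transformers and PyTorch; the two-example "dataset", label scheme, and learning rate are placeholders, not the procedure used for any particular model on the slide:

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

    texts = ["I like this laptop", "My new laptop is not good for computationally intensive tasks"]
    labels = torch.tensor([1, 0])                        # 1 = positive, 0 = negative
    batch = tok(texts, padding=True, return_tensors="pt")

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    outputs = model(**batch, labels=labels)              # classification head on top of the PLM
    outputs.loss.backward()                              # one gradient step; a real run loops over a dataset
    optimizer.step()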


PARADIGM
Prompt Based Learning
What is prompting?

• "Adding natural language text or continuous vectors to the input or output to encourage PLMs to perform specific tasks"
Advantages

• For in-context or in-domain learning, it may not be necessary to fine-tune the PLM, which reduces computational requirements
• Allows better alignment of the new task with the pre-training objective
Main Prompt-based approaches

Source: Min et al. arxiv:2111.01243v1 (ACM Computing Surveys)


Instruction-based prompting

Source: Min et al. arxiv:2111.01243v1 (ACM Computing Surveys)
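A small instruction-prompting illustration, assuming transformers and the public google/flan-t5-small checkpoint (an instruction-tuned model); note that no parameters are updated:

    from transformers import pipeline

    pipe = pipeline("text2text-generation", model="google/flan-t5-small")
    # The task is described in natural language inside the prompt itself.
    print(pipe("Classify the sentiment of this review as positive or negative: I like this laptop."))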


Template-based prompting

Source: Min et al. arxiv:2111.01243v1 (ACM Computing Surveys)
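A template (cloze-style) prompting illustration, assuming transformers and bert-base-uncased: a hand-written template recasts sentiment classification as the pre-training fill-mask task, and the scores of verbalizer words such as "great" / "terrible" serve as class scores:

    from transformers import pipeline

    unmasker = pipeline("fill-mask", model="bert-base-uncased")
    review = "I like this laptop."
    prompt = review + " Overall, it was a [MASK] product."
    # Compare the fill-in scores of the label words to decide the class.
    print(unmasker(prompt, targets=["great", "terrible"]))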


PARADIGM
NLP as text generation
Source: Min et al. arxiv:2111.01243v1 (ACM Computing Surveys)
https://levelup.gitconnected.com/the-brief-history-of-large-language-models-a-journey-from-eliza-to-gpt-4-and-google-bard-167c614af5af
Summary of Paradigm Shifts

• Statistical Models: moved LMs from rule/grammar-based models to data-driven models
• Probabilistic Neural Models: increased context, word embeddings
• Transformer PLMs: contextual embeddings, more generalized, multi-modality, multi-task

Source: Min et al. arxiv:2111.01243v1


References

• Jurafsky and Martin, Speech and Language Processing, 3rd Ed. Draft. [Available at https://web.stanford.edu/~jurafsky/slp3/]
• Min et al., Recent advances in natural language processing via large pre-trained language models: A survey. ACM Computing Surveys.
Thanks!
Questions and Comments!
