
ADVANCED NLP

CHAPTER 01: LANGUAGE MODELLING

COURSE OUTLINE

1. Probabilistic LM
2. Markov hypothesis
3. N-gram Models
4. Evaluation and complexity



PROBABILISTIC LANGUAGE MODELS

◾ Objective: assign a probability to a sentence. Why?

o Automatic translation:
   ▪ P(high winds tonight) > P(large winds tonite)

o Spellchecking:
   ▪ P(fifteen minutes from) > P(fiften minuetes from)

o Speech recognition:
   ▪ P(I saw a van) > P(eyes awe of an)

o Plus automatic summarization, optical character recognition, document classification,
question answering, etc.



PROBABILISTIC LANGUAGE MODELS

◾ Objective: determine the probability of a phrase or word combination by
computing the following probability: P(W) = P(w1,w2,w3,w4,w5…wn)

◾ Related task: the probability of the next word: P(w5|w1,w2,w3,w4)

◾ A model that can compute either quantity,
P(W) or P(wn|w1,w2…wn-1), is called a Language Model.



PROBABILISTIC LANGUAGE MODELS

◾ Examples:
o Let’s meet instead in…
o The company's stock lost two...
o If you are able to visit …



PROBABILISTIC LANGUAGE MODELS
[Figure omitted] Data from "Natural Language Corpus Data: Beautiful Data", derived from the Google Web Trillion Word Corpus.





CALCULATION OF THE PROBABILITY P(W)

◾ How to calculate joint probabilities?

◾P(this, water, is, so, clear, that)

◾ Intuition: Taking conditional probabilities into account



RECALL: CONDITIONAL PROBABILITIES

P(B|A) = P(A,B)/P(A)  or  P(A,B) = P(A)P(B|A)

◾ More variables:
P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)

◾ General rule (chain rule):
P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)
LANGUAGE-BASED CONDITIONAL PROBABILITIES

P(w1 w2 … wn) = ∏i P(wi | w1 w2 … wi-1)

P(“This water is so clear”) =


P(this) × P(water|this) × P(is|this water)
× P(so|this water is) × P(clear|this water is so)



HOW TO ESTIMATE THESE PROBABILITIES

◾ Can we just count and divide?

P(the | this water is so clear that) =
   count(this water is so clear that the) / count(this water is so clear that)

◾ Wrong! There are too many possible phrases (infinitely many sentences!).

◾ A very large corpus would be required to estimate this probability.



MARKOV HYPOTHESIS

◾ Simplifying assumption (due to Andrei Markov):

P(the | its water is so transparent that) ≈ P(the | that)

◾ Or:

P(the | its water is so transparent that) ≈ P(the | transparent that)



MARKOV HYPOTHESIS

P(w1 w2 … wn) ≈ ∏i P(wi | wi-k … wi-1)

◾ In other words, we approximate each component of the product:

P(wi | w1 w2 … wi-1) ≈ P(wi | wi-k … wi-1)



MARKOV HYPOTHESIS

o Under the Markov hypothesis, computing the probability of a word only requires
knowing a limited number of the words that precede it.
o The hypothesis therefore lets us restrict ourselves to a limited history.

N-gram models (illustrated in the sketch below):
o Unigram: P(the) (no context)
o Bigram: P(the | that)
o Trigram: P(the | transparent that)
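To make these contexts concrete, here is a minimal sketch (our own illustration, not part of the course material; the helper name ngrams is ours) that extracts unigrams, bigrams, and trigrams from a tokenized sentence:

```python
def ngrams(tokens, n):
    """Return all n-gram tuples found in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "its water is so transparent that the".split()
print(ngrams(tokens, 1))  # unigrams: [('its',), ('water',), ...]
print(ngrams(tokens, 2))  # bigrams:  [('its', 'water'), ('water', 'is'), ...]
print(ngrams(tokens, 3))  # trigrams: [('its', 'water', 'is'), ...]
```

Each n-gram keeps an (n-1)-word history, which is exactly the conditioning context used above.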



N-GRAM MODELS

◾ We can thus extend to trigram, 4-gram, and 5-gram models.

Small N-grams → information loss.
A small-N model can be an insufficient language model, because there are long-
distance dependencies:

"The computer I used yesterday for the Advanced NLP class session,
crashed."
N-GRAM MODELS
o If we use large N-grams → we obtain a model with high complexity.
   ▪ We then need a larger corpus.
   ▪ Representation of the N-grams: V^N entries, where V is the vocabulary size and N the number of grams.
o Issue: N-grams absent from the training corpus.
   ▪ P(<s> he likes swimming </s>) = P(he|<s>) P(likes|he) P(swimming|likes) P(</s>|swimming)
     = (3/4) × (2/3) × (0/1) × (1/1) = 0
   ▪ The idea is to give absent N-grams a probability by taking a small percentage of the
     probabilities of the observed N-grams (smoothing).

But the N-gram models often remain an interesting solution.



LANGUAGE MODELLING

N-GRAM PROBABILITY ESTIMATION


DATA PREPROCESSING

o How to convert text to sequences w1 … wn?


   ▪ This depends on the application.
o The following criteria need to be determined:
   ▪ How to delimit sequences (sentences, paragraphs, or documents)?
   ▪ How to delimit words?
   ▪ How to normalize words?
   ▪ Which words are in the model's vocabulary?



DATA PREPROCESSING

o Sequences are typically delimited at the sentence level, because the sentence is the form of
sequence handled most frequently (e.g. in automatic translation).

o Reserved tokens, such as <s> for the beginning and </s> for the end of a sentence in
English, are frequently used to indicate sentence boundaries:

<s> The weather is nice today </s>



DATA PREPROCESSING

o For words: a preliminary processing step converts a document into the list of
words that appear in it (tokens).
   This step is called "tokenization".
   ▪ Each lexical unit (token) then corresponds to a word.
o In English, French, and similar languages, the steps can be as follows (see the sketch below):

   ▪ Dividing the sentences using spaces and punctuation.

   ▪ Deciding whether or not the punctuation is part of the sequence.
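As a rough illustration of these two steps, the sketch below (our own, based only on Python's standard re module; real tokenizers are considerably more careful) splits on whitespace and optionally keeps punctuation marks as separate tokens:

```python
import re

def tokenize(sentence, keep_punctuation=True):
    """Split a sentence on whitespace and treat punctuation marks as separate tokens."""
    # \w+ matches runs of letters/digits; [^\w\s] matches a single punctuation character.
    pattern = r"\w+|[^\w\s]" if keep_punctuation else r"\w+"
    return re.findall(pattern, sentence)

print(tokenize("The weather is nice today, isn't it?"))
# ['The', 'weather', 'is', 'nice', 'today', ',', 'isn', "'", 't', 'it', '?']
```

Note how the clitic in "isn't" gets broken apart; cases like this are exactly the exceptions discussed in the next slides.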



DATA PREPROCESSING

o It is best to disregard the punctuation most of the time.


   ▪ The punctuation is omitted when it cannot be captured, as in speech recognition,
for instance.
o In other cases, not even the spaces are sufficient to separate the words.



DATA PREPROCESSING

o The problem can be partially handled by treating each punctuation
mark as a separate token.
o However, there will always be exceptions:
o Ph.D.
o google.com
o 555,500.50
o 555 500,50



DATA PREPROCESSING

o We occasionally have to deal with clitics and restore them to their full forms:
o j’aime = je and aime
o he’s = he and is
o At times, we would like a certain token to be regarded as a collocation, which
is a grouping of several words:
o New York
o rock ‘n’ roll



DATA PREPROCESSING

o The selection of tokenization rules varies depending on the


application.
o Additionally, certain languages lack word separators, as is the case for
Japanese and Chinese.



DATA PREPROCESSING

o How do we normalize words?

o Do we take capitalization into account?
   ▪ No in speech recognition; yes in automatic translation.
o Should we reduce words to a lemma?
   ▪ Yes in document classification; no in many other NLP tasks.
o Do we need any other conversions?
   ▪ Numbers: <number>, dates: <date>, etc.
o Each of these decisions depends on the case study at hand (see the sketch below).
o It is recommended to run multiple experiments and refine the choices in response to errors.
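A toy sketch of such normalization choices (our own example; the regular expression and the <number> placeholder are illustrative, not the course's):

```python
import re

def normalize(token, lowercase=True):
    """Lowercase a token and map numeric tokens to a <number> placeholder."""
    if re.fullmatch(r"\d+([.,]\d+)*", token):
        return "<number>"
    return token.lower() if lowercase else token

print([normalize(t) for t in ["The", "meeting", "starts", "at", "14", "today"]])
# ['the', 'meeting', 'starts', 'at', '<number>', 'today']
```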



BIGRAM PROBABILITY ESTIMATION

◾ Maximum Likelihood Estimation (MLE) allows us to compute this probability from
sequence frequencies:

P(wi | wi-1) = count(wi-1, wi) / count(wi-1) = c(wi-1, wi) / c(wi-1)

o count(wi-1, wi): the frequency of the bigram (wi-1, wi)

o count(wi-1): the frequency of the unigram wi-1
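A minimal sketch of this MLE estimate (our own code, not the course's; the function name train_bigram_mle is hypothetical), assuming sentences are already tokenized and wrapped in <s> … </s>:

```python
from collections import Counter

def train_bigram_mle(sentences):
    """Estimate P(wi | wi-1) = count(wi-1, wi) / count(wi-1) from tokenized sentences."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for tokens in sentences:
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))
    # Relative frequency of each observed bigram given its first word.
    return {(prev, w): c / unigram_counts[prev]
            for (prev, w), c in bigram_counts.items()}
```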



EXAMPLE

P(wi | wi-1) = c(wi-1, wi) / c(wi-1)

Toy corpus:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
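Hand-checking a few of these estimates on the toy corpus (a quick verification of the MLE formula, not part of the original slide):

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>".split(),
    "<s> Sam I am </s>".split(),
    "<s> I do not like green eggs and ham </s>".split(),
]
unigrams, bigrams = Counter(), Counter()
for tokens in corpus:
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

print(bigrams[("<s>", "I")] / unigrams["<s>"])     # 2/3: "I" follows <s> in 2 of 3 sentences
print(bigrams[("I", "am")] / unigrams["I"])        # 2/3
print(bigrams[("am", "Sam")] / unigrams["am"])     # 1/2
print(bigrams[("Sam", "</s>")] / unigrams["Sam"])  # 1/2
```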



EXAMPLE 2: BERKELEY RESTAURANT PROJECT

▪ https://github.com/wooters/berp-trans
▪ A database of questions about restaurants in Berkeley (9,332 sentences):
o can you tell me about any good cantonese restaurants close by
o mid priced thai food is what i’m looking for
o tell me about chez panisse
o can you give me a listing of the kinds of food that are available
o i’m looking for a good place to eat breakfast
o when is caffe venezia open during the day
BIGRAMS’ FREQUENCIES
◾ Bigram counts: [table omitted]

◾ Unigram counts (tokenization information): [table omitted]



BIGRAMS’ PROBABILITIES

◾ Bigram probability results: [table omitted]



BIGRAMS’ ESTIMATION

P(<s> I want english food </s>)
   = P(I|<s>) × P(want|I) × P(english|want) × P(food|english) × P(</s>|food)



WHAT KIND OF KNOWLEDGE?



ISSUE

o We work in logarithmic space to prevent numerical underflow.

o Adding is also faster than multiplying.

log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
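A tiny illustration of the idea (ours, using only Python's standard math module; the probability values are hypothetical):

```python
import math

probs = [0.25, 0.33, 0.0011, 0.5, 0.68]  # hypothetical conditional probabilities of one sentence

product = 1.0
for p in probs:
    product *= p  # repeated multiplication drifts toward underflow for long sentences

log_sum = sum(math.log(p) for p in probs)  # summing logs stays in a safe numeric range

print(product)            # ~3.1e-05
print(math.exp(log_sum))  # same value, recovered from the log-space sum
```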


GOOGLE N-GRAMS

o https://books.google.com/ngrams
o Pre-computed N-gram counts from books
o Free download:
https://storage.googleapis.com/books/ngrams/books/datasetsv3.html



ESTIMATION OF BIGRAMS

P(<s> I want english food </s>)
   = P(I|<s>) × P(want|I) × P(english|want) × P(food|english) × P(</s>|food)



LANGUAGE MODELLING

EVALUATION AND COMPLEXITY

PERFORMANCE EVALUATION

◾ Does our language model prefer good sentences to bad ones?
◾ Does it assign a higher probability to "real" or "frequently observed" sentences than to
"ungrammatical" or "rarely observed" ones?
◾ The model's parameters are trained on a training set.
◾ We then evaluate the model's performance on data it did not observe.
◾ A test set is unseen data, entirely unused during training and distinct from our training set.
◾ An evaluation measure tells us how well our model performs on the test set.



TRAINING ON THE TEST SET

In terms of ethics:
o Test sentences must never be used during training.
o If they were, we would assign them an artificially high probability when evaluating on the
test set.
o Training on the test set violates the honour code and distorts the results.



IN-VIVO EVALUATION

◾ The best way to rate two models A and B:

◾ Put each model into a specific task
◾ spelling correction, speech recognition, machine translation systems

◾ Run the task and measure the accuracy of A and B
◾ How many errors were correctly fixed?
◾ How many words were translated correctly?

◾ Compare the accuracy of model A to that of model B.



ISSUES
◾ In-vivo evaluation is:
◾ long (time-consuming): it can take days or weeks...
So:
◾ we also perform an internal (in-vitro, intrinsic) evaluation,
◾ with the measure: perplexity.
◾ Perplexity is only a rough approximation of in-vivo performance,
◾ unless the test collection is similar to the training collection.
◾ It is typically useful for benchmarking experiments.
INTUITION CONCERNING PERPLEXITY

◾ The Shannon Game:
◾ How well can we predict the next word?

   o I always order pizza with cheese and ____
      (mushrooms 0.1, pepperoni 0.1, anchovies 0.01, …, fried rice 0.0001, …, and 1e-100)
   o The 33rd President of the US was ____
   o I saw a ____

◾ Unigrams are not a good solution for this game (why?).
◾ A more suitable model for the text
◾ is one that assigns a high probability to the words that actually appear.



PERPLEXITY

The best language model is the one that best predicts an unseen test set →
it gives the highest P(sentence).

The perplexity is the inverse probability of the test set, normalized by the
number of words:

PP(W) = P(w1 w2 … wN)^(-1/N)
      = (1 / P(w1 w2 … wN))^(1/N)

Expanding with the chain rule ("conditional perplexities"):

PP(W) = ( ∏i 1 / P(wi | w1 w2 … wi-1) )^(1/N)

For bigrams:

PP(W) = ( ∏i 1 / P(wi | wi-1) )^(1/N)
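A minimal sketch of this bigram perplexity (our own code; it assumes a dictionary of bigram probabilities such as the MLE estimates sketched earlier, and that every test bigram was actually observed):

```python
import math

def bigram_perplexity(tokens, bigram_probs):
    """PP(W) = exp( -(1/N) * sum_i log P(wi | wi-1) ), computed in log space."""
    log_prob, n = 0.0, 0
    for prev, w in zip(tokens, tokens[1:]):
        log_prob += math.log(bigram_probs[(prev, w)])  # a missing or zero bigram would need smoothing
        n += 1
    return math.exp(-log_prob / n)

# Hypothetical bigram probabilities for a toy test sentence:
probs = {("<s>", "i"): 0.25, ("i", "want"): 0.33,
         ("want", "food"): 0.1, ("food", "</s>"): 0.5}
print(bigram_perplexity("<s> i want food </s>".split(), probs))  # ~3.9
```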



PERPLEXITY

o Perplexity can be interpreted as the weighted average branching factor
involved in predicting the next word:
   ▪ the average number of hesitations (equally likely choices) the model faces
during a prediction.
o Perplexity typically lies between 40 and 400.

Reducing perplexity (confusion) equates to increasing the probability of the test set.
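For example, a model that predicts each next word uniformly from 10 equally likely words gives P(w1 … wN) = (1/10)^N, so PP(W) = ((1/10)^N)^(-1/N) = 10: the model hesitates between exactly 10 choices at every step.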



LOW PERPLEXITY = SUPERIOR MODEL

◾ The following are results of N-gram models
◾ trained on 38 million words and tested on 1.5 million
words (Wall Street Journal):

N-gram order:   Unigram   Bigram   Trigram
Perplexity:     962       170      109
LOW PERPLEXITY = SUPERIOR MODEL

o The trigram model is the best (it has the lowest perplexity).
o Interpreting these results: the unigram model is by far the most uncertain, hesitating
between many more candidate words on average.
o Perplexity has the advantage of being a measure that does not depend on a specific
application.
o However, it is difficult to predict whether a gain in perplexity will translate into a
real-world gain (for example, fewer translation errors).
o The goal therefore remains to assess the system's performance with an extrinsic (in-vivo)
evaluation.



Additional links

o Probabilistic models in NLP



THANK YOU!
Keep in touch: [email protected]

