
ADVANCED NLP

CHAPTER 01: LANGUAGE MODELLING

COURSE OUTLINE

1. Probabilistic LM
2. Markov hypothesis
3. N-gram Models
4. Evaluation and complexity



PROBABILISTIC LANGUAGE MODELS

◾ Objective: assign a probability to a sentence. Why?

o Automatic translation:
   ▪ P(high winds tonight) > P(large winds tonite)

o Spellchecking:
   ▪ P(fifteen minutes from) > P(fiften minuetes from)

o Speech recognition:
   ▪ P(I saw a van) > P(eyes awe of an)

o Plus automatic summarization, optical character recognition, document classification,
question answering, etc.



PROBABILISTIC LANGUAGE MODELS

◾ Objective: determine the probability of a phrase or word combination by
computing the following probability: P(W) = P(w1,w2,w3,w4,w5…wn)

◾ Related task: the probability of the next word: P(w5|w1,w2,w3,w4)

◾ A model that can compute either quantity,
P(W) or P(wn|w1,w2…wn-1), is called a Language Model.



PROBABILISTIC LANGUAGE MODELS

◾ Examples:
o Let’s meet instead in…
o The company's stock lost two...
o If you are able to visit …



PROBABILISTIC LANGUAGE MODELS
[Figure omitted] Data from "Natural Language Corpus Data: Beautiful Data", derived from the Google Web Trillion Word Corpus.





CALCULATION OF THE PROBABILITY P(W)

◾ How to calculate joint probabilities?

◾P(this, water, is, so, clear, that)

◾ Intuition: Taking conditional probabilities into account



RECALL: CONDITIONAL PROBABILITIES

P(B|A) = P(A,B)/P(A)  or  P(A,B) = P(A)P(B|A)

◾ More variables:
P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)

◾ General rule (chain rule):
P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)
LANGUAGE-BASED CONDITIONAL PROBABILITIES

P(w1 w2 … wn) = ∏i P(wi | w1 w2 … wi-1)

P(“This water is so clear”) =


P(this) × P(water|this) × P(is|this water)
× P(so|this water is) × P(clear|this water is so)



HOW TO ESTIMATE THESE PROBABILITIES

◾ Can we just count and divide?

P(the | this water is so clear that) =
   count(this water is so clear that the) / count(this water is so clear that)

◾ Wrong! There are too many possible phrases (infinitely many sentences!).

◾ A very large corpus would be required to estimate this probability.



MARKOV HYPOTHESIS

◾ Simplifying assumption (due to Andrei Markov):

P(the | its water is so transparent that) ≈ P(the | that)

◾ Or:

P(the | its water is so transparent that) ≈ P(the | transparent that)



MARKOV HYPOTHESIS

P(w1 w2 … wn) ≈ ∏i P(wi | wi-k … wi-1)

◾ In other words, we approximate each component of the product:

P(wi | w1 w2 … wi-1) ≈ P(wi | wi-k … wi-1)



MARKOV HYPOTHESIS

o Under the Markov hypothesis, computing the probability of a word only requires
knowing a limited number of the words that precede it.
o The hypothesis therefore lets us restrict ourselves to a limited history.

N-gram models (illustrated in the sketch below):
o Unigram: P(the) (no context)
o Bigram: P(the | that)
o Trigram: P(the | transparent that)
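To make these contexts concrete, here is a minimal sketch (our own illustration, not part of the course material; the helper name ngrams is ours) that extracts unigrams, bigrams, and trigrams from a tokenized sentence:

```python
def ngrams(tokens, n):
    """Return all n-gram tuples found in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "its water is so transparent that the".split()
print(ngrams(tokens, 1))  # unigrams: [('its',), ('water',), ...]
print(ngrams(tokens, 2))  # bigrams:  [('its', 'water'), ('water', 'is'), ...]
print(ngrams(tokens, 3))  # trigrams: [('its', 'water', 'is'), ...]
```

Each n-gram keeps an (n-1)-word history, which is exactly the conditioning context used above.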



N-GRAM MODELS

◾ We can thus extend to trigram, 4-gram, and 5-gram models.

Small N-grams → information loss.
A small-N model can be an insufficient language model, because there are long-
distance dependencies:

"The computer I used yesterday for the Advanced NLP class session,
crashed."
N-GRAM MODELS
o If we use large N-grams → we obtain a model with high complexity.
   ▪ We then need a larger corpus.
   ▪ Representation of the N-grams: V^N entries, where V is the vocabulary size and N the number of grams.
o Issue: N-grams absent from the training corpus.
   ▪ P(<s> he likes swimming </s>) = P(he|<s>) P(likes|he) P(swimming|likes) P(</s>|swimming)
     = (3/4) × (2/3) × (0/1) × (1/1) = 0
   ▪ The idea is to give absent N-grams a probability by taking a small percentage of the
     probabilities of the observed N-grams (smoothing).

But the N-gram models often remain an interesting solution.



LANGUAGE MODELLING

N-GRAM PROBABILITY ESTIMATION


DATA PREPROCESSING

o How to convert text to sequences w1 … wn?


   ▪ This depends on the application.
o The following criteria need to be determined:
   ▪ How to delimit sequences (sentences, paragraphs, or documents)?
   ▪ How to delimit words?
   ▪ How to normalize words?
   ▪ Which words are in the model's vocabulary?



DATA PREPROCESSING

o Sequences are typically delimited at the sentence level, because the sentence is the form of
sequence handled most frequently (e.g. in automatic translation).

o Reserved tokens, such as <s> for the beginning and </s> for the end of a sentence in
English, are frequently used to indicate sentence boundaries:

<s> The weather is nice today </s>



DATA PREPROCESSING

o For words: a preliminary processing step converts a document into the list of
words that appear in it (tokens).
   This step is called "tokenization".
   ▪ Each lexical unit (token) then corresponds to a word.
o In English, French, and similar languages, the steps can be as follows (see the sketch below):

   ▪ Dividing the sentences using spaces and punctuation.

   ▪ Deciding whether or not the punctuation is part of the sequence.
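As a rough illustration of these two steps, the sketch below (our own, based only on Python's standard re module; real tokenizers are considerably more careful) splits on whitespace and optionally keeps punctuation marks as separate tokens:

```python
import re

def tokenize(sentence, keep_punctuation=True):
    """Split a sentence on whitespace and treat punctuation marks as separate tokens."""
    # \w+ matches runs of letters/digits; [^\w\s] matches a single punctuation character.
    pattern = r"\w+|[^\w\s]" if keep_punctuation else r"\w+"
    return re.findall(pattern, sentence)

print(tokenize("The weather is nice today, isn't it?"))
# ['The', 'weather', 'is', 'nice', 'today', ',', 'isn', "'", 't', 'it', '?']
```

Note how the clitic in "isn't" gets broken apart; cases like this are exactly the exceptions discussed in the next slides.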



DATA PREPROCESSING

o It is best to disregard the punctuation most of the time.


   ▪ The punctuation is omitted when it cannot be captured, as in speech recognition,
for instance.
o In other cases, not even the spaces are sufficient to separate the words.



DATA PREPROCESSING

o The problem can be partially handled by treating each punctuation
mark as a separate token.
o However, there will always be exceptions:
o Ph.D.
o google.com
o 555,500.50
o 555 500,50



DATA PREPROCESSING

o We occasionally have to deal with clitics and restore them to their full forms:
o j’aime = je and aime
o he’s = he and is
o At times, we would like a certain token to be regarded as a collocation, which
is a grouping of several words:
o New York
o rock ‘n’ roll



DATA PREPROCESSING

o The selection of tokenization rules varies depending on the


application.
o Additionally, certain languages lack word separators, as is the case for
Japanese and Chinese.



DATA PREPROCESSING

o How do we normalize words?

o Do we take capitalization into account?
   ▪ No in speech recognition; yes in automatic translation.
o Should we reduce words to a lemma?
   ▪ Yes in document classification; no in many other NLP tasks.
o Do we need any other conversions?
   ▪ Numbers: <number>, dates: <date>, etc.
o Each of these decisions depends on the case study at hand (see the sketch below).
o It is recommended to run multiple experiments and refine the choices in response to errors.
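A toy sketch of such normalization choices (our own example; the regular expression and the <number> placeholder are illustrative, not the course's):

```python
import re

def normalize(token, lowercase=True):
    """Lowercase a token and map numeric tokens to a <number> placeholder."""
    if re.fullmatch(r"\d+([.,]\d+)*", token):
        return "<number>"
    return token.lower() if lowercase else token

print([normalize(t) for t in ["The", "meeting", "starts", "at", "14", "today"]])
# ['the', 'meeting', 'starts', 'at', '<number>', 'today']
```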



BIGRAM PROBABILITY ESTIMATION

◾ Maximum Likelihood Estimation (MLE) allows us to compute this probability from
sequence frequencies:

P(wi | wi-1) = count(wi-1, wi) / count(wi-1) = c(wi-1, wi) / c(wi-1)

o count(wi-1, wi): the frequency of the bigram (wi-1, wi)

o count(wi-1): the frequency of the unigram wi-1
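A minimal sketch of this MLE estimate (our own code, not the course's; the function name train_bigram_mle is hypothetical), assuming sentences are already tokenized and wrapped in <s> … </s>:

```python
from collections import Counter

def train_bigram_mle(sentences):
    """Estimate P(wi | wi-1) = count(wi-1, wi) / count(wi-1) from tokenized sentences."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for tokens in sentences:
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))
    # Relative frequency of each observed bigram given its first word.
    return {(prev, w): c / unigram_counts[prev]
            for (prev, w), c in bigram_counts.items()}
```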



EXAMPLE

P(wi | wi-1) = c(wi-1, wi) / c(wi-1)

Toy corpus:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
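Hand-checking a few of these estimates on the toy corpus (a quick verification of the MLE formula, not part of the original slide):

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>".split(),
    "<s> Sam I am </s>".split(),
    "<s> I do not like green eggs and ham </s>".split(),
]
unigrams, bigrams = Counter(), Counter()
for tokens in corpus:
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

print(bigrams[("<s>", "I")] / unigrams["<s>"])     # 2/3: "I" follows <s> in 2 of 3 sentences
print(bigrams[("I", "am")] / unigrams["I"])        # 2/3
print(bigrams[("am", "Sam")] / unigrams["am"])     # 1/2
print(bigrams[("Sam", "</s>")] / unigrams["Sam"])  # 1/2
```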



EXAMPLE 2: BERKELEY RESTAURANT PROJECT

▪ https://github.com/wooters/berp-trans
▪ A database of questions about restaurants in Berkeley (9,332 sentences):
o can you tell me about any good cantonese restaurants close by
o mid priced thai food is what i’m looking for
o tell me about chez panisse
o can you give me a listing of the kinds of food that are available
o i’m looking for a good place to eat breakfast
o when is caffe venezia open during the day
BIGRAMS’ FREQUENCIES
◾ Bigram counts: [table omitted]

◾ Unigram counts (tokenization information): [table omitted]



BIGRAMS’ PROBABILITIES

◾ Bigram probability results: [table omitted]



BIGRAMS’ ESTIMATION

P(<s> I want english food </s>)
   = P(I|<s>) × P(want|I) × P(english|want) × P(food|english) × P(</s>|food)



WHAT KIND OF KNOWLEDGE?



ISSUE

o We work in logarithmic space to prevent numerical underflow.

o Adding is also faster than multiplying.

log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
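A tiny illustration of the idea (ours, using only Python's standard math module; the probability values are hypothetical):

```python
import math

probs = [0.25, 0.33, 0.0011, 0.5, 0.68]  # hypothetical conditional probabilities of one sentence

product = 1.0
for p in probs:
    product *= p  # repeated multiplication drifts toward underflow for long sentences

log_sum = sum(math.log(p) for p in probs)  # summing logs stays in a safe numeric range

print(product)            # ~3.1e-05
print(math.exp(log_sum))  # same value, recovered from the log-space sum
```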


GOOGLE N-GRAMS

o https://books.google.com/ngrams
o Pre-computed N-gram counts from books
o Free download:
https://storage.googleapis.com/books/ngrams/books/datasetsv3.html



ESTIMATION OF BIGRAMS

P(<s> I want english food </s>)
   = P(I|<s>) × P(want|I) × P(english|want) × P(food|english) × P(</s>|food)



LANGUAGE MODELLING

EVALUATION AND COMPLEXITY

PERFORMANCE EVALUATION

◾ Does our language model prefer good sentences to bad ones?
◾ Does it assign a higher probability to "real" or "frequently observed" sentences than to
"ungrammatical" or "rarely observed" ones?
◾ The model's parameters are trained on a training set.
◾ We then evaluate the model's performance on data it did not observe.
◾ A test set is unseen data, entirely unused during training and distinct from our training set.
◾ An evaluation measure tells us how well our model performs on the test set.



TRAINING ON THE TEST SET

In terms of ethics:
o Test sentences must never be used during training.
o If they were, we would assign them an artificially high probability when evaluating on the
test set.
o Training on the test set violates the honour code and distorts the results.



IN-VIVO EVALUATION

◾ The best way to rate two models A and B:

◾ Put each model into a specific task
◾ spelling correction, speech recognition, machine translation systems

◾ Run the task and measure the accuracy of A and B
◾ How many errors were correctly fixed?
◾ How many words were translated correctly?

◾ Compare the accuracy of model A to that of model B.



ISSUES
◾ In-vivo evaluation is:
◾ long (time-consuming): it can take days or weeks...
So:
◾ we also perform an internal (in-vitro, intrinsic) evaluation,
◾ with the measure: perplexity.
◾ Perplexity is only a rough approximation of in-vivo performance,
◾ unless the test collection is similar to the training collection.
◾ It is typically useful for benchmarking experiments.
INTUITION CONCERNING PERPLEXITY

◾ The Shannon Game:
◾ How well can we predict the next word?

   o I always order pizza with cheese and ____
      (mushrooms 0.1, pepperoni 0.1, anchovies 0.01, …, fried rice 0.0001, …, and 1e-100)
   o The 33rd President of the US was ____
   o I saw a ____

◾ Unigrams are not a good solution for this game (why?).
◾ A more suitable model for the text
◾ is one that assigns a high probability to the words that actually appear.



PERPLEXITY

The best language model is the one that best predicts an unseen test set →
it gives the highest P(sentence).

The perplexity is the inverse probability of the test set, normalized by the
number of words:

PP(W) = P(w1 w2 … wN)^(-1/N)
      = (1 / P(w1 w2 … wN))^(1/N)

Expanding with the chain rule ("conditional perplexities"):

PP(W) = ( ∏i 1 / P(wi | w1 w2 … wi-1) )^(1/N)

For bigrams:

PP(W) = ( ∏i 1 / P(wi | wi-1) )^(1/N)
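A minimal sketch of this bigram perplexity (our own code; it assumes a dictionary of bigram probabilities such as the MLE estimates sketched earlier, and that every test bigram was actually observed):

```python
import math

def bigram_perplexity(tokens, bigram_probs):
    """PP(W) = exp( -(1/N) * sum_i log P(wi | wi-1) ), computed in log space."""
    log_prob, n = 0.0, 0
    for prev, w in zip(tokens, tokens[1:]):
        log_prob += math.log(bigram_probs[(prev, w)])  # a missing or zero bigram would need smoothing
        n += 1
    return math.exp(-log_prob / n)

# Hypothetical bigram probabilities for a toy test sentence:
probs = {("<s>", "i"): 0.25, ("i", "want"): 0.33,
         ("want", "food"): 0.1, ("food", "</s>"): 0.5}
print(bigram_perplexity("<s> i want food </s>".split(), probs))  # ~3.9
```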



PERPLEXITY

o Perplexity can be interpreted as the weighted average branching factor
involved in predicting the next word:
   ▪ the average number of hesitations (equally likely choices) the model faces
during a prediction.
o Perplexity typically lies between 40 and 400.

Reducing perplexity (confusion) equates to increasing the probability of the test set.
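For example, a model that predicts each next word uniformly from 10 equally likely words gives P(w1 … wN) = (1/10)^N, so PP(W) = ((1/10)^N)^(-1/N) = 10: the model hesitates between exactly 10 choices at every step.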



LOW PERPLEXITY = SUPERIOR MODEL

◾ The following are results of N-gram models
◾ trained on 38 million words and tested on 1.5 million
words (Wall Street Journal):

N-gram order:   Unigram   Bigram   Trigram
Perplexity:     962       170      109
LOW PERPLEXITY = SUPERIOR MODEL

o The trigram model is the best (it has the lowest perplexity).
o Interpreting these results: the unigram model is by far the most uncertain, hesitating
between many more candidate words on average.
o Perplexity has the advantage of being a measure that does not depend on a specific
application.
o However, it is difficult to predict whether a gain in perplexity will translate into a
real-world gain (for example, fewer translation errors).
o The goal therefore remains to assess the system's performance with an extrinsic (in-vivo)
evaluation.



Additional links

o Probabilistic models in NLP



THANK YOU!
Keep in touch: [email protected]

