Chapter 01
NLP
COURSE OUTLINE
o Spellchecking
P(fifteen minutes from) > P(fiften minuetes from)
o Speech recognition
P(I saw a van) > P(eyes awe of an)
o + Automatic summarization, optical character recognition, document classification, question answering, etc.
◾ Examples:
o Let’s meet instead in…
o The company's stock lost two...
o If you are able to visit …
◾ Many variables:
P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)
◾ General rule:
P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)
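For example, applying this rule to the sentence from the speech-recognition example above:
P(I saw a van) = P(I) · P(saw | I) · P(a | I, saw) · P(van | I, saw, a)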
LANGUAGE-BASED CONDITIONAL PROBABILITIES
◾ Or
P(the | its water is so transparent that) ≈ P(the | transparent that)
P(w1 w2 … wn) ≈ ∏i P(wi | wi−k … wi−1)
◾ In other words, we approximate each component of the product
Example of a long-distance dependency that a short N-gram context cannot capture:
“The computer I used yesterday for the Advanced NLP class session crashed.”
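As a minimal sketch (illustrative only, not part of the course material), here is the simplest case of this approximation, the bigram model, where each word is conditioned only on the previous word; the probability values below are invented for the example:

    # Bigram approximation: P(w1 ... wn) ≈ product of P(wi | wi-1)
    # The probabilities below are made-up numbers, purely for illustration.
    bigram_prob = {
        ("<s>", "i"): 0.25,
        ("i", "saw"): 0.10,
        ("saw", "a"): 0.30,
        ("a", "van"): 0.05,
        ("van", "</s>"): 0.40,
    }

    def sentence_prob(words):
        p = 1.0
        for prev, word in zip(words, words[1:]):
            p *= bigram_prob.get((prev, word), 0.0)   # unseen bigram -> probability 0
        return p

    print(sentence_prob(["<s>", "i", "saw", "a", "van", "</s>"]))   # 0.25*0.10*0.30*0.05*0.40 = 0.00015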
N-GRAM MODELS
o If we use large N-grams we obtain a model with high complexity
We then need a larger corpus
There are V^N possible N-grams, where V is the vocabulary size and N the N-gram order (see the example below).
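For instance, with a hypothetical vocabulary of V = 20,000 words (a made-up figure for illustration), there are 20,000^2 = 4 × 10^8 possible bigrams and 20,000^3 = 8 × 10^12 possible trigrams, far more than any training corpus can cover.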
o Issue with absent N-grams in the training corpus
P(<s> he likes swimming </s>) = P(he | <s>) · P(likes | he) · P(swimming | likes) · P(</s> | swimming)
= (3/4) · (2/3) · (0/1) · (1/1) = 0
The idea (smoothing) is to give unseen N-grams a non-zero probability by taking away a small fraction of the probability mass of the N-grams observed in the corpus; one concrete version is sketched below.
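One simple instance of this idea is add-one (Laplace) smoothing. The course does not say which smoothing method is actually used, so the following is only an illustrative sketch, with hypothetical counts and vocabulary size:

    # Add-one (Laplace) smoothed bigram estimate.
    # bigram_count = C(prev word), prev_count = C(prev), V = vocabulary size (hypothetical values below).
    def laplace_bigram(bigram_count, prev_count, V):
        # P(word | prev) = (C(prev word) + 1) / (C(prev) + V)
        return (bigram_count + 1) / (prev_count + V)

    # The unseen bigram "likes swimming" (count 0, with C(likes) = 1) now gets a small non-zero probability:
    print(laplace_bigram(0, 1, 11))   # (0 + 1) / (1 + 11) ≈ 0.083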
o Sequences are usually delimited at the sentence level, since the sentence is the unit that applications handle most often (e.g. machine translation).
o Reserved tokens, such as <s> for the beginning and </s> for the end of a sentence, are commonly used to mark sentence boundaries.
o We sometimes have to deal with clitics and split them back into their underlying words:
o j’aime = je and aime
o he’s = he and is
o Conversely, we sometimes want a group of several words (a collocation) to be treated as a single token:
o New York
o rock ‘n’ roll
◾ Maximum Likelihood Estimation (MLE) lets us compute these probabilities from the frequencies of the sequences in a corpus:
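The formula itself is not reproduced above; for bigrams, the usual MLE estimate (stated here as a reminder) divides the bigram count by the count of its first word:
P(wi | wi−1) = C(wi−1 wi) / C(wi−1)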
https://github.com/wooters/berp-trans
A corpus of queries about restaurants in Berkeley (9332 sentences)
o can you tell me about any good cantonese restaurants close by
o mid priced thai food is what i’m looking for
o tell me about chez panisse
o can you give me a listing of the kinds of food that are available
o i’m looking for a good place to eat breakfast
o when is caffe venezia open during the day
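As an illustration only (not part of the original material), a minimal Python sketch that counts bigrams over a few sentences like the ones above and computes the MLE estimates; the toy sentences and names are invented for the example:

    from collections import Counter

    # Toy corpus in the spirit of the restaurant queries above (illustrative only).
    sentences = [
        "<s> tell me about chez panisse </s>",
        "<s> i want to eat breakfast </s>",
        "<s> i want thai food </s>",
    ]

    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        unigram_counts.update(words)                 # C(w)
        bigram_counts.update(zip(words, words[1:]))  # C(w_prev w)

    def mle_bigram(prev, word):
        # P(word | prev) = C(prev word) / C(prev)
        if unigram_counts[prev] == 0:
            return 0.0
        return bigram_counts[(prev, word)] / unigram_counts[prev]

    print(mle_bigram("i", "want"))   # 2/2 = 1.0 in this toy corpus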
BIGRAMS’ FREQUENCIES
◾ Counting bigrams
o https://books.google.com/ngrams
o N-gram counts precomputed from the Google Books corpus
o free download:
https://storage.googleapis.com/books/ngrams/books/datasetsv3.html
EVALUATION AND COMPLEXITY
PERFORMANCE EVALUATION
◾ Does our language model prefer good sentences to bad ones?
◾ Does it assign a higher probability to "real" or "frequently observed" sentences than to "ungrammatical" or "rarely observed" ones?
◾ The model's parameters are trained on a training set.
◾ We evaluate the model's performance on data that we did not observe.
◾ A test set is a completely unseen set of data, distinct from our training set and never used during training.
◾ An evaluation measure shows us how well our model performs on the test set.
In terms of ethics:
o Test sentences must not be used during training.
o Otherwise we would assign them an artificially high probability when they appear in the test set.
o Training on the test set violates the code of honour and distorts the results.
◾ The Shannon Game: how well can we predict the next word?
(candidate continuations with probabilities, e.g. mushrooms 0.1, pepperoni 0.1)
The best language model is the one that best predicts an unseen test set, i.e. gives the highest P(sentence).
Perplexity is the inverse probability of the test set, normalized by the number of words:
PP(W) = P(w1 w2 … wN)^(−1/N) = (1 / P(w1 w2 … wN))^(1/N)
For bigrams:
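The bigram formula is not reproduced above; the standard form (stated here as a reminder) is:
PP(W) = ( ∏i 1 / P(wi | wi−1) )^(1/N)

As a small illustrative sketch of how this is computed in practice (the bigram_prob function is an assumed smoothed estimator, e.g. the Laplace sketch earlier, so that no factor is zero):

    import math

    # Illustrative per-word perplexity of a test sentence under a bigram model.
    # bigram_prob(prev, w) is assumed to return a smoothed, non-zero probability.
    def perplexity(words, bigram_prob):
        N = len(words) - 1   # number of predicted tokens (everything after <s>)
        log_prob = sum(math.log(bigram_prob(prev, w)) for prev, w in zip(words, words[1:]))
        return math.exp(-log_prob / N)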
o The trigram model is the best (it has the lowest perplexity).
o To interpret these results: the unigram model makes the strongest independence assumptions, so it predicts the test data worst.
o Perplexity has the advantage of being a measure that is not dependent on a specific
application.
o However, it is difficult to predict whether a gain in perplexity would translate into a
real-world gain (for example, the number of translation errors).
o The ultimate goal remains extrinsic (in-vivo) evaluation: measuring the language model's effect on the performance of the end application.