KEN2570 4 LanguageModel

The document discusses language models and probabilistic language models. It covers n-gram language models, the chain rule, Markov assumption, unigram models, bigram models, estimating n-gram probabilities from text corpora, and using n-gram probabilities to estimate the probability of sentences.

KEN2570 Natural Language Processing
Language Models
Jerry Spanakis
http://dke.maastrichtuniversity.nl/jerry.spanakis
@gerasimoss

Agenda
• Probabilistic Models
• N-gram language models

What is a language model?
• Predict the next word: The student is watching ___

Probabilistic Language Models
• Goal: assign a probability to a sentence
  - Model fluent/correct English
• Applications
  - Machine Translation:
    P(high winds tonight) > P(large winds tonight)
  - Spell Correction:
    “The office is about fifteen minuets from my house”
    P(about fifteen minutes from) > P(about fifteen minuets from)
  - Speech Recognition:
    P(I saw a van) >> P(eyes awe of an)
  - Summarization
Language models in daily life
(two slides of examples shown as images, not extracted)

Language Modeling
• Formally:
  - A function that, for an English (or other language) sentence, gives us the probability that this sentence is written in “good” English (or the other language)
• Approaches:
  - Deterministic: e.g. finite state grammar
  - Statistical: compute the probability of a sentence or sequence of words:
    P(W) = P(w1,w2,w3,w4,w5…wn)

Statistical Language Models
• Advantage:
  - Trainable on large text databases
  - ‘Soft’ predictions (probabilities)
  - Can be combined with a translation model (or other task)
• Problem:
  - Need a large text database for each domain
Statistical Language Models (cont.)
• Simple approach: look at a large text database
  - Count(“Good morning”) = 7
  - P(“Good morning”) = 7/196884 = 3.55 × 10⁻⁵
• What might be the problem here?
• Sparse data
  - many perfectly good sentences will be assigned a probability of zero, because they have never been seen before!

The Chain Rule
• Break the probability of a sentence up into the prediction of one word at a time
• Recall the definition of conditional probabilities:
  P(B|A) = P(A,B)/P(A)
  Rewriting: P(A,B) = P(A)P(B|A)
• More variables:
  P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)
• The Chain Rule in general:
  P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn−1)

Sentence Probability
  P(w1 w2 … wn) = ∏_i P(wi | w1 … wi−1)

  P(“its water is so transparent”) =
    P(its) × P(water | its) × P(is | its water) × P(so | its water is) × P(transparent | its water is so)

• What might be the problem here?

Markov Assumption
• Simplifying assumption (Andrei Markov):
  P(the | its water is so transparent that) ≈ P(the | that)
• Or maybe
  P(the | its water is so transparent that) ≈ P(the | transparent that)
Markov Assumption
  P(w1 w2 … wn) ≈ ∏_i P(wi | wi−k … wi−1)
• In other words, we approximate each component in the product:
  P(wi | w1 … wi−1) ≈ P(wi | wi−k … wi−1)

Simplest case: Unigram model
  P(w1 w2 … wn) ≈ ∏_i P(wi)
• Some automatically generated sentences from a unigram model
  (examples shown as an image, not extracted)

Bigram model
  P(w1 w2 … wn) ≈ ∏_i P(wi | wi−1)
• Some automatically generated sentences from a bigram model
  (examples shown as an image, not extracted)

N-gram models
• We can extend to trigrams, 4-grams, 5-grams
• In general, this is an insufficient model of language
  - because language has long-distance dependencies:
    “The computer which I had just put into the machine room on the fifth floor crashed.”
• But we can often get away with N-gram models
Training Language Models
• The Maximum Likelihood Estimate (MLE)
  - Train parameters so that they maximize the probability of the training data
  - Parameters: the N-gram probabilities
    P(wi | wi−1) = count(wi−1, wi) / count(wi−1)

An example
  <s> I am Sam </s>
  <s> Sam I am </s>
  <s> I do not like green eggs and ham </s>
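A minimal Python sketch of these MLE bigram estimates on the toy corpus above (the function and variable names are illustrative, not from the lecture):

```python
from collections import Counter

# Toy corpus from the example slide.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_mle(w, prev):
    """MLE bigram estimate: count(prev, w) / count(prev)."""
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_mle("I", "<s>"))   # 2/3
print(p_mle("am", "I"))    # 2/3
print(p_mle("Sam", "am"))  # 1/2
```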


Estimate N-gram Probabilities
• Example (Trigram model):
  - Counts from the EPPS (Europarl) corpus

More examples: Berkeley Restaurant Project sentences
• can you tell me about any good cantonese restaurants close by
• mid priced thai food is what i’m looking for
• tell me about chez panisse
• can you give me a listing of the kinds of food that are available
• i’m looking for a good place to eat breakfast
• when is caffe venezia open during the day
Raw bigram counts
• Out of 9222 sentences
  (bigram count table not shown)

Raw bigram probabilities
• Normalize by the unigram counts (i.e. calculate the probabilities)
• Result:
  (bigram probability table not shown)

Bigram estimates of sentence probabilities
P(<s> I want dutch food </s>) =
  P(I|<s>)
  × P(want|I)
  × P(dutch|want)
  × P(food|dutch)
  × P(</s>|food)
  = .000031

What kinds of knowledge?
• P(dutch|want) = .0011
• P(chinese|want) = .0065
• P(to|want) = .66
• P(eat|to) = .28
• P(food|to) = 0
• P(want|spend) = 0
• P(i|<s>) = .25
Practical Issues
• We do everything in log space
  - Avoid underflow
  - (also adding is faster than multiplying)
  log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4

Evaluation: How good is our model?
• Does our language model prefer good sentences to bad ones?
  - Assign higher probability to “real” or “frequently observed” sentences
  - than to “ungrammatical” or “rarely observed” sentences?
• We train the parameters of our model on a training set.
• We test the model’s performance on data we haven’t seen.
  - A test set is an unseen dataset that is different from our training set, totally unused.
  - An evaluation metric tells us how well our model does on the test set.
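A quick illustration of the underflow point from the Practical Issues slide above (the probability values are made up):

```python
import math

# Multiplying many small word probabilities underflows to 0.0 in double precision,
# while summing their logs stays perfectly usable for comparing models.
probs = [1e-5] * 100      # pretend per-word probabilities of a 100-word sentence

product = 1.0
for p in probs:
    product *= p
print(product)            # 0.0 -- underflow

log_sum = sum(math.log(p) for p in probs)
print(log_sum)            # about -1151.3
```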

Extrinsic evaluation of N-gram models
• Also called “down-stream” evaluation
• Best evaluation for comparing models A and B
  - Put each model in a task
    - spelling corrector, speech recognizer, MT system
  - Run the task, get an accuracy for A and for B
    - How many misspelled words corrected properly
    - How many words translated correctly
  - Compare accuracy for A and B
• Challenges:
  - Time-consuming: can take days or weeks
  - Is there a way to use intrinsic evaluation? Yes! By directly measuring the quality of a LM!

Intuition of Perplexity
• The Shannon Game:
  - How well can we predict the next word?
    I always order pizza with cheese and ____   (mushrooms 0.1, pepperoni 0.1, anchovies 0.01, …, fried rice 0.0001, …, and 1e-100)
    The 33rd President of the US was ____
    I saw a ____
  - Unigrams are terrible at this game. (Why?)
• A better model of a text
  - is one which assigns a higher probability to the word that actually occurs
Intuition of Perplexity (cont.)
• Intuitively, perplexity can be understood as a measure of uncertainty
• What’s the level of uncertainty to predict the next word?
  o The current president of China is _______ ?
  o ChatGPT is built on OpenAI's GPT-3 family of large language _____ ?
• Uncertainty level
  o Unigram: highest
  o Bigram: high
  o 5-gram model: lower

Perplexity
The best language model is one that best predicts an unseen test set
• Gives the highest P(sentence)
Perplexity is the inverse probability of the test set, normalized by the number of words:

  PP(W) = P(w1 w2 … wN)^(−1/N) = (1 / P(w1 w2 … wN))^(1/N)

Chain rule:

  PP(W) = ( ∏_{i=1..N} 1 / P(wi | w1 … wi−1) )^(1/N)

For bigrams:

  PP(W) = ( ∏_{i=1..N} 1 / P(wi | wi−1) )^(1/N)

Minimizing perplexity is the same as maximizing probability

Another way to compute perplexity
• The probability of a test sentence is:
  P(w1 w2 … wM) = ∏_{i=1..M} P(wi | w1 … wi−1)
• Usually, we work with log-probabilities:
  log P(w1 w2 … wM) = Σ_{i=1..M} log P(wi | w1 … wi−1)
• Let’s divide by the number of words M in the test sentences.
• So, the average log-probability of test words is:
  ℓ = (1/M) Σ_{i=1..M} log P(wi | w1 … wi−1)
• And then the perplexity is the inverse of the (exponentiated) average log-probability:
  PP = exp(−ℓ)   (or 2^(−ℓ) if the logs are base 2)

Lower perplexity = better model
• Training on 38 million words, testing on 1.5 million words of WSJ:

  N-gram Order | Unigram | Bigram | Trigram
  Perplexity   |     962 |     170 |     109
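A small sketch of the two equivalent ways of computing perplexity described above, using made-up per-word probabilities:

```python
import math

# Hypothetical conditional probabilities a model assigns to each word of a test text.
word_probs = [0.2, 0.1, 0.05, 0.3, 0.25]
M = len(word_probs)

# Via the average log-probability ...
avg_log_prob = sum(math.log(p) for p in word_probs) / M
pp_from_logs = math.exp(-avg_log_prob)

# ... or directly as the inverse probability, normalized by length.
pp_direct = math.prod(word_probs) ** (-1 / M)

print(pp_from_logs, pp_direct)   # both are about 6.68
```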
Generating (sampling) sentences from an N-gram model
We choose a random value between 0 and 1 and print the word whose interval includes this chosen value. We continue choosing random numbers and generating words until we randomly generate the sentence-final token </s>. We can use the same technique to generate bigrams by first generating a random bigram that starts with <s> (according to its bigram probability), then choosing a random bigram to follow (again, according to its bigram probability), and so on.
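A minimal sketch of this sampling procedure for a bigram model, reusing the toy corpus from the earlier MLE example (purely illustrative):

```python
import random
from collections import Counter, defaultdict

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

# For each history word, count which words follow it.
followers = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, w in zip(tokens, tokens[1:]):
        followers[prev][w] += 1

def generate():
    """Sample words from the bigram distribution until </s> is generated."""
    word, sentence = "<s>", []
    while word != "</s>":
        nxt = followers[word]
        word = random.choices(list(nxt), weights=list(nxt.values()))[0]
        if word != "</s>":
            sentence.append(word)
    return " ".join(sentence)

print(generate())   # e.g. "I am Sam" or "Sam I do not like green eggs and ham"
```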
Approximating Shakespeare
To give an intuition for the increasing power of higher-order N-grams, Fig. 4.3 shows random sentences generated from unigram, bigram, trigram, and 4-gram models trained on Shakespeare’s works.

Figure 4.3 Eight sentences randomly generated from four N-gram models computed from Shakespeare’s works. All characters were mapped to lower-case and punctuation marks were treated as words. Output is hand-corrected for capitalization to improve readability.
1gram: –To him swallowed confess hear both. Which. Of save on trail for are ay device and rote life have
       –Hill he late speaks; or! a more to leg less first you enter
2gram: –Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live king. Follow.
       –What means, sir. I confess she? then all sorts, he is trim, captain.
3gram: –Fly, and will rid me these news of price. Therefore the sadness of parting, as they say, ’tis done.
       –This shall forbid it should be branded, if renown made it empty.
4gram: –King Henry. What! I will go seek the traitor Gloucester. Exeunt some of the watch. A great banquet serv’d in;
       –It cannot be but so.

The longer the context on which we train the model, the more coherent the sentences. In the unigram sentences, there is no coherent relation between words or any sentence-final punctuation. The bigram sentences have some local word-to-word coherence (especially if we consider that punctuation counts as a word). The trigram and 4-gram sentences are beginning to look a lot like Shakespeare. Indeed, a careful investigation of the 4-gram sentences shows that they look a little too much like Shakespeare. The words “It cannot be but so” are directly from King John.

Shakespeare as corpus
• N = 884,647 tokens, V = 29,066
• Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams.
  - So 99.96% of the possible bigrams were never seen (have zero entries in the table)
  - For 4-grams: what’s coming out looks like Shakespeare because it is Shakespeare
This is because, not to put the knock on Shakespeare, his oeuvre is not very large as corpora go (N = 884,647, V = 29,066), and our N-gram probability matrices are ridiculously sparse. There are V² = 844,000,000 possible bigrams alone, and the number of possible 4-grams is V⁴ = 7 × 10¹⁷. Thus, once the generator has chosen the first 4-gram (It cannot be but), there are only five possible continuations (that, I, he, thou, and so); indeed, for many 4-grams, there is only one continuation.

WSJ examples
To get an idea of the dependence of a grammar on its training set, let’s look at an N-gram grammar trained on a completely different corpus: the Wall Street Journal (WSJ) newspaper. Shakespeare and the Wall Street Journal are both English, so we might expect some overlap between our N-grams for the two genres. Fig. 4.4 shows sentences generated by unigram, bigram, and trigram grammars trained on 40 million words from WSJ.

Figure 4.4 Three sentences randomly generated from three N-gram models computed from 40 million words of the Wall Street Journal, lower-casing all characters and treating punctuation as words. Output was then hand-corrected for capitalization to improve readability.
1gram: Months the my and issue of year foreign new exchange’s september were recession exchange new endorsed a acquire to six executives
2gram: Last December through the way to preserve the Hudson corporation N. B. E. C. Taylor would seem to complete the major central planners one point five percent of U. S. E. has already old M. X. corporation of living on information such as more frequently fishing to keep her
3gram: They also point to ninety nine point six billion dollars from two hundred four oh six three percent of the rates of interest stores as Mexico and Brazil on market conditions

Compare these examples to the pseudo-Shakespeare in Fig. 4.3. While superficially they both seem to model “English-like sentences”, there is obviously no overlap whatsoever in possible sentences, and little if any overlap even in small phrases. This stark difference tells us that statistical models are likely to be pretty useless as predictors if the training sets and the test sets are as different as Shakespeare and WSJ.

Lesson 1: The perils of overfitting
• N-grams only work well for word prediction if the test corpus looks like the training corpus
  - In real life, it often doesn’t
• We need to train robust models that generalize!
  - How can I account for things that I don’t see in the training set?

How should we deal with this problem when we build N-gram models? One way is to be sure to use a training corpus that has a similar genre to whatever task we are trying to accomplish. To build a language model for translating legal documents, we need a training corpus of legal documents. To build a language model for a question-answering system, we need a training corpus of questions.

Matching genres is still not sufficient. Our models may still be subject to the problem of sparsity. For any N-gram that occurred a sufficient number of times, we might have a good estimate of its probability. But because any corpus is limited, some perfectly acceptable English word sequences are bound to be missing from it. That is, we’ll have many cases of putative “zero probability N-grams” that should really have some non-zero probability. Consider the words that follow the bigram “denied the” in the WSJ Treebank3 corpus, together with their counts:
  denied the allegations: 5
  denied the speculation: 2
  denied the rumors: 1
  denied the report: 1
But suppose our test set has phrases like:
  denied the offer
  denied the loan

Lesson 2: Zeros
• Training set:
  … denied the allegations
  … denied the reports
  … denied the claims
  … denied the request
• Test set:
  … denied the offer
  … denied the loan
  P(“offer” | denied the) = 0
• Bigrams with zero probability
  - mean that we will assign 0 probability to the test set!
• And hence we cannot compute perplexity (can’t divide by 0)!

Count Smoothing
• Problem: an N-gram has not been seen in training
  - Maximum likelihood estimation -> probability is 0
  - Quite harsh
  - Not very useful
  - OOV word in a sentence -> all hypotheses have probability 0
• Solution: assign positive probabilities to unseen n-grams
  - Higher order n-grams -> even more important
• Empirical counts:
  - Counts seen in the training data
• Expected counts:
  - Counts in previously unseen text

The intuition of smoothing (from Dan Klein)
• When we have sparse statistics:
  P(w | denied the):
    3 allegations
    2 reports
    1 claims
    1 request
    7 total
• Steal probability mass to generalize better:
  P(w | denied the):
    2.5 allegations
    1.5 reports
    0.5 claims
    0.5 request
    2 other (unseen words such as outcome, attack, man, …)
    7 total

Add-one estimation
• Also called Laplace smoothing
• Pretend we saw each word one more time than we did
• Just add one to all the counts!
• MLE estimate:
  P_MLE(wi | wi−1) = c(wi−1, wi) / c(wi−1)
• Add-1 estimate:
  P_Add−1(wi | wi−1) = (c(wi−1, wi) + 1) / (c(wi−1) + V)
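A minimal sketch of the add-1 estimate, assuming bigram and unigram counts stored in Counter objects as in the earlier MLE sketch:

```python
from collections import Counter

def p_add1(w: str, prev: str, bigrams: Counter, unigrams: Counter) -> float:
    """Add-1 (Laplace) smoothed bigram estimate: (c(prev, w) + 1) / (c(prev) + V)."""
    V = len(unigrams)                     # vocabulary size
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

# Unseen bigrams now get a small positive probability instead of zero,
# e.g. p_add1("offer", "the", bigrams, unigrams) > 0 even if never observed.
```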
Add-one estimation on Berkeley restaurants
• MLE estimate (bigram probability table not shown)
• Add-1 estimate (smoothed bigram probability table not shown)

Reconstituted (Adjusted) counts
• Original counts (table not shown)
• Reconstituted counts (table not shown)

Add-one estimation (issues)
• Problem:
  - Too much weight given to unseen examples
• But add-1 is used to smooth other NLP models
  - For text classification
  - In domains where the number of zeros isn’t so huge

One solution: Add-α Smoothing
• Add α < 1 instead:
  p = c/n  →  p = (c + α) / (n + αV)
• How to find α:
  - Optimize on perplexity
  - Match between adjusted counts and test counts

Add-α Smoothing (Example)
(worked numerical example not shown)

Add-k smoothing (SLP 4.4.2)
One alternative to add-one smoothing is to move a bit less of the probability mass from the seen to the unseen events. Instead of adding 1 to each count, we add a fractional count k (.5? .05? .01?). This algorithm is therefore called add-k smoothing.

  P_Add-k(wn | wn−1) = (C(wn−1 wn) + k) / (C(wn−1) + kV)     (4.23)

Add-k smoothing requires that we have a method for choosing k; this can be done, for example, by optimizing on a devset. Although add-k is useful for some tasks (including text classification), it turns out that it still doesn’t work well for language modeling, generating counts with poor variances and often inappropriate discounts (Gale and Church, 1994).

Interpolation and Backoff
• General:
  - Longer context -> better language models
• Problem:
  - Limited data -> more n-grams not observed
  - With “absolute discounting”: we treat all (non-observed) n-grams equally
• Example:
  - Scottish beer drinkers
  - Scottish beer eaters
• Big idea: use the trigram if you have good evidence, otherwise the bigram, otherwise the unigram

Backoff and Interpolation (SLP 4.4.3)
The discounting we have been discussing so far can help solve the problem of zero-frequency N-grams. But there is an additional source of knowledge we can draw on. If we are trying to compute P(wn | wn−2 wn−1) but we have no examples of a particular trigram wn−2 wn−1 wn, we can instead estimate its probability by using the bigram probability P(wn | wn−1). Similarly, if we don’t have counts to compute P(wn | wn−1), we can look to the unigram P(wn).

In other words, sometimes using less context is a good thing, helping to generalize more for contexts that the model hasn’t learned much about. There are two ways to use this N-gram “hierarchy”. In backoff, we use the trigram if the evidence is sufficient, otherwise we use the bigram, otherwise the unigram. In other words, we only “back off” to a lower-order N-gram if we have zero evidence for a higher-order N-gram. By contrast, in interpolation, we always mix the probability estimates from all the N-gram estimators, weighing and combining the trigram, bigram, and unigram counts.

Linear Interpolation
• Simple interpolation
In simple linear interpolation, we combine different order N-grams by linearly interpolating all the models. Thus, we estimate the trigram probability P(wn | wn−2 wn−1) by mixing together the unigram, bigram, and trigram probabilities, each weighted by a λ:

  P̂(wn | wn−2 wn−1) = λ1 P(wn | wn−2 wn−1) + λ2 P(wn | wn−1) + λ3 P(wn)     (4.24)

such that the λs sum to 1:

  Σ_i λi = 1     (4.25)

• Lambdas conditional on context:
  - More on this with Kneser-Ney
In a slightly more sophisticated version of linear interpolation, each λ weight is computed in a more sophisticated way, by conditioning on the context. This way, if we have particularly accurate counts for a particular bigram, we assume that the counts of the trigrams based on this bigram will be more trustworthy, so we can make the λs for those trigrams higher and thus give that trigram more weight in the interpolation.

Backoff
• Trust the highest-order n-gram with counts, otherwise “fall back” (or “back off”) to lower orders
• Adjust weights and discounting
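A minimal sketch of simple linear interpolation (equation 4.24 above); the component probability functions and λ values are illustrative placeholders:

```python
def p_interpolated(w, w_prev2, w_prev1, p_uni, p_bi, p_tri,
                   lambdas=(0.1, 0.3, 0.6)):
    """Mix unigram, bigram, and trigram estimates with weights that sum to 1."""
    l1, l2, l3 = lambdas
    return (l3 * p_tri(w, w_prev2, w_prev1)
            + l2 * p_bi(w, w_prev1)
            + l1 * p_uni(w))
```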
How to set (“learn”) the lambdas?
• Use a held-out (or validation) corpus:
  Training Data | Held-Out Data | Test Data
• Choose λs to maximize the probability of the held-out data:
  - Fix the N-gram probabilities (on the training data)
  - Then search for the λs that give the largest probability to the held-out set:
    log P(w1...wn | M(λ1...λk)) = Σ_i log P_{M(λ1...λk)}(wi | wi−1)

Unknown words: open vs. closed vocabulary tasks
• If we know all the words in advance
  - Vocabulary V is fixed
  - Closed vocabulary task
• Often, we don’t know this
  - Out Of Vocabulary = OOV words
  - Open vocabulary task
• Instead: create an unknown word token <UNK>
  - Training of <UNK> probabilities
    - Create a fixed lexicon L of size V
    - At the text normalization phase, any training word not in L is changed to <UNK>
    - Now we train its probabilities like a normal word
  - At decoding time
    - If text input: use the <UNK> probabilities for any word not seen in training
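A sketch of the <UNK> normalization step; the frequency threshold used to build the lexicon is an illustrative choice, not the lecture’s:

```python
from collections import Counter

def replace_rare_words(tokenized_sentences, min_count=2, unk="<UNK>"):
    """Map training words outside a fixed lexicon L to <UNK> so the model
    has probabilities to use for unknown words at decoding time."""
    counts = Counter(w for sent in tokenized_sentences for w in sent)
    lexicon = {w for w, c in counts.items() if c >= min_count}
    return [[w if w in lexicon else unk for w in sent]
            for sent in tokenized_sentences]
```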

Huge web-scale n-grams
• How to deal with, e.g., the Google N-gram corpus
  - 1 trillion tokens, 24GB of files
• Pruning
  - Only store N-grams with count > threshold
    - Remove singletons of higher-order n-grams
  - Entropy-based pruning
• Efficiency
  - Efficient data structures like tries
  - Store words as indexes, not strings
    - Use Huffman coding to fit large numbers of words into two bytes
  - Quantize probabilities (4-8 bits instead of an 8-byte float)

Smoothing for Web-scale N-grams
• “Stupid backoff” (Brants et al. 2007)
• No discounting, just use relative frequencies
• No probabilities anymore (scores instead):

  S(wi | w_{i−k+1..i−1}) = count(w_{i−k+1..i}) / count(w_{i−k+1..i−1})   if count(w_{i−k+1..i}) > 0
                         = 0.4 · S(wi | w_{i−k+2..i−1})                  otherwise

  S(wi) = count(wi) / N
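A minimal recursive sketch of stupid backoff; here `counts` is assumed to map word tuples of any order to corpus counts, N is the total token count, and 0.4 is the fixed backoff weight from the slide:

```python
def stupid_backoff(word, context, counts, N, alpha=0.4):
    """Relative-frequency score S(word | context) with fixed-weight backoff."""
    if not context:                               # base case: unigram score
        return counts.get((word,), 0) / N
    ngram = tuple(context) + (word,)
    ctx_count = counts.get(tuple(context), 0)
    if counts.get(ngram, 0) > 0 and ctx_count > 0:
        return counts[ngram] / ctx_count
    return alpha * stupid_backoff(word, context[1:], counts, N, alpha)
```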
Modified Kneser-Ney Smoothing
• Most commonly used smoothing technique
• Different strategies for the highest order and other n-grams:
  - Highest order n-grams: absolute discounting
  - Lower order n-grams: discounting of the histories
• Backoff and interpolated versions

Absolute discounting: just subtract a little from each count
• Suppose we wanted to subtract a little from a count of e.g. 4 to save probability mass for the zeros
• How much to subtract?
• Church and Gale (1991)’s clever idea
  - Divide up 22 million words of AP Newswire into a training and a held-out set
  - For each bigram in the training set, see the actual count in the held-out set!
• It sure looks like c* = (c − .75)

  Bigram count in training | Bigram count in held-out set
  0 | .0000270
  1 | 0.448
  2 | 1.25
  3 | 2.24
  4 | 3.23
  5 | 4.21
  6 | 5.23
  7 | 6.21
  8 | 7.21
  9 | 8.26

Absolute Discounting Interpolation
• Save ourselves some time and just subtract 0.75 (or some d)!

  P_AbsoluteDiscounting(wi | wi−1) = (c(wi−1, wi) − d) / c(wi−1) + λ(wi−1) P(w)
                                     [discounted bigram]          [interpolation weight × unigram]

  (Maybe keeping a couple of extra values of d for counts 1 and 2)
• But should we really just use the regular unigram P(w)?

Kneser-Ney Smoothing I
• Better estimate for probabilities of lower-order unigrams!
  - Shannon game: I can’t see without my reading ___________? (“glasses” vs. “Kong”)
  - “Kong” turns out to be more common than “glasses”
  - … but “Kong” (almost) always follows “Hong”
• The unigram is useful exactly when we haven’t seen this bigram!
• Instead of P(w): “How likely is w?”
• P_CONTINUATION(w): “How likely is w to appear as a novel continuation?”
  - For each word, count the number of bigram types it completes
  - Every bigram type is a novel continuation the first time we see it
  P_CONTINUATION(w) ∝ |{wi−1 : c(wi−1, w) > 0}|
Kneser-Ney Smoothing II
• How many times does w appear as a novel continuation:
  P_CONTINUATION(w) ∝ |{wi−1 : c(wi−1, w) > 0}|
• Normalized by the total number of word bigram types:
  P_CONTINUATION(w) = |{wi−1 : c(wi−1, w) > 0}| / |{(wj−1, wj) : c(wj−1, wj) > 0}|
• A frequent word (Kong) occurring in only one context (Hong) will have a low continuation probability

Kneser-Ney Smoothing III
  P_KN(wi | wi−1) = max(c(wi−1, wi) − d, 0) / c(wi−1) + λ(wi−1) P_CONTINUATION(wi)

λ is a normalizing constant; the probability mass we’ve discounted:
  λ(wi−1) = (d / c(wi−1)) · |{w : c(wi−1, w) > 0}|
  where d / c(wi−1) is the normalized discount, and |{w : c(wi−1, w) > 0}| is the number of word types that can follow wi−1
  (= # of word types we discounted = # of times we applied the normalized discount)
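A sketch of interpolated Kneser-Ney restricted to bigrams, with a fixed discount d = 0.75 (the full modified version uses several discounts and recurses over orders):

```python
from collections import Counter, defaultdict

def train_kn_bigram(sentences, d=0.75):
    bigrams = Counter()
    for sent in sentences:
        toks = sent.split()
        bigrams.update(zip(toks, toks[1:]))

    history_totals = Counter()        # c(w_{i-1})
    followers = defaultdict(set)      # word types that follow each history
    histories = defaultdict(set)      # histories each word continues
    for (prev, w), c in bigrams.items():
        history_totals[prev] += c
        followers[prev].add(w)
        histories[w].add(prev)
    n_bigram_types = len(bigrams)

    def p_kn(w, prev):
        p_cont = len(histories[w]) / n_bigram_types            # continuation prob.
        lam = d * len(followers[prev]) / history_totals[prev]  # discounted mass
        return (max(bigrams[(prev, w)] - d, 0) / history_totals[prev]
                + lam * p_cont)

    return p_kn
```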

Kneser-Ney Smoothing: Recursive formulation
  P_KN(wi | w_{i−n+1..i−1}) = max(c_KN(w_{i−n+1..i}) − d, 0) / c_KN(w_{i−n+1..i−1}) + λ(w_{i−n+1..i−1}) P_KN(wi | w_{i−n+2..i−1})

  c_KN(•) = count(•) for the highest order
          = continuation count(•) for lower orders

  Continuation count = number of unique single-word contexts for •

N-gram Smoothing Summary
• Add-1 smoothing:
  - OK for text categorization, not for language modeling
• The most commonly used method:
  - Modified Interpolated Kneser-Ney
• For very large N-grams like the Web:
  - Stupid backoff
Advanced Language Modeling
• Discriminative models:
  - choose n-gram weights to improve a task, not to fit the training set
• Parsing-based models
• Caching models
  - Recently used words are more likely to appear again:
    P_CACHE(w | history) = λ P(wi | wi−2 wi−1) + (1 − λ) c(w ∈ history) / |history|

More Advanced Language Modeling
• (very old) idea: use a neural network for n-gram language modeling:
  - Masami Nakamura, Katsuteru Maruyama, Takeshi Kawabata, and Kiyohiro Shikano. 1990. Neural network approach to word category prediction for English texts. In Proceedings of the 13th Conference on Computational Linguistics, Volume 3 (COLING ’90). Association for Computational Linguistics, USA, 213–218. DOI: https://doi.org/10.3115/991146.9911
  - Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A Neural Probabilistic Language Model. Journal of Machine Learning Research 3, 1137–1155.
    (The slide shows the first page of this paper; its abstract proposes fighting the curse of dimensionality by learning a distributed representation for each word jointly with the probability function for word sequences, so that a word sequence never seen in training still gets high probability if it is made of words similar, in the sense of having nearby representations, to those of an already seen sentence.)

Why Neural LMs work better than N-gram LMs
• Training data: “I have to make sure that the cat gets fed.”
  - Never seen: “dog gets fed”
• Test data: “I forgot to make sure that the dog gets ___”
• An N-gram LM can’t predict “fed”!
• A neural LM can use the similarity of the “cat” and “dog” embeddings (remember word vectors?) to generalize and predict “fed” after “dog”

Language model integration
• How do we use a language model in a task?
• We’ve seen:
  - Noisy channel model
• Applications:
  - Speech recognition
  - Spelling correction
  - Machine translation
Noisy Channel Intuition
• IBM: Mays, Eric, Fred J. Damerau and Robert L. Mercer. 1991. Context based spelling correction. Information Processing and Management, 23(5), 517–522.
• AT&T Bell Labs: Kernighan, Mark D., Kenneth W. Church, and William A. Gale. 1990. A spelling correction program based on a noisy channel model. Proceedings of COLING 1990, 205–210.

Noisy Channel for the spelling problem
• We see an observation x of a misspelled word
• Find the correct word w:
  ŵ = argmax_{w∈V} P(w | x)
    = argmax_{w∈V} P(x | w) P(w) / P(x)
    = argmax_{w∈V} P(x | w) P(w)
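A one-line sketch of the noisy-channel decision rule above; the candidate set, channel model P(x|w), and language model P(w) are placeholders to be supplied:

```python
def correct(x, candidates, p_channel, p_lm):
    """Return the candidate w maximizing P(x | w) * P(w)."""
    return max(candidates, key=lambda w: p_channel(x, w) * p_lm(w))
```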

Summary
• N-Gram Language Model
• Maximum Likelihood Training
• Smoothing

References
• SLP chapter 3 (3.8 is optional)
• LM toolkits
  - SRILM http://www.speech.sri.com/projects/srilm/
  - KenLM https://kheafield.com/code/kenlm/
  - Google N-grams https://ai.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html
  - Google Books N-grams http://ngrams.googlelabs.com/
