KEN2570 — 4: Language Models
Language models in daily life
Statistical Language Models (cont.)
• Simple approach: look at a large text database
- Count("Good morning") = 7
- P("Good morning") = 7/196,884 = 3.55×10⁻⁵
• What might be the problem here?
• Sparse data
- many perfectly good sentences will be assigned a probability of zero, because they have never been seen before!

The Chain Rule
• Break up into the prediction of one word at a time
• Recall the definition of conditional probabilities:
P(B|A) = P(A,B) / P(A)
Rewriting: P(A,B) = P(A) P(B|A)
• More variables:
P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
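To make the counting idea concrete, here is a minimal Python sketch (the tiny corpus and its counts are invented for illustration; the 7/196,884 figures from the slide are not reproduced). It estimates a sentence probability by counting whole-sentence occurrences, which is exactly where the sparse-data problem appears: any sentence absent from the corpus gets probability zero — and the chain rule is the way out, because it lets us score a sentence from much smaller pieces.

```python
from collections import Counter

# Toy "database" of observed sentences (invented for illustration).
corpus = [
    "good morning",
    "good morning",
    "good evening",
    "how are you",
]

sentence_counts = Counter(corpus)
total = sum(sentence_counts.values())

def whole_sentence_prob(sentence: str) -> float:
    """Naive estimate: relative frequency of the entire sentence."""
    return sentence_counts[sentence] / total

print(whole_sentence_prob("good morning"))    # 0.5
print(whole_sentence_prob("good afternoon"))  # 0.0 -> the sparse-data problem
```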
The Chain Rule applied to a sentence
P(w1 w2 … wn) = ∏i P(wi | w1 … wi−1)

P("its water is so transparent") =
P(its) × P(water|its) × P(is|its water) × P(so|its water is) × P(transparent|its water is so)
• What might be the problem here?

Markov Assumption (Andrei Markov)
P(the | its water is so transparent that) ≈ P(the | that)
• Or maybe
P(the | its water is so transparent that) ≈ P(the | transparent that)
• i.e., approximate the full history w1 … wi−1 by only its last word(s), e.g. wi−2 wi−1
Markov Assumption
• P(w1 w2 … wn) ≈ ∏i P(wi | wi−(k−1) … wi−1)
• In other words, we approximate each component in the product:
- P(wi | w1 … wi−1) ≈ P(wi | wi−(k−1) … wi−1)

Simplest case: Unigram model
• P(w1 w2 … wn) ≈ ∏i P(wi)
• Some automatically generated sentences from a unigram model
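As a small illustration of the Markov truncation (the probability tables below are invented placeholders; in practice they come from corpus counts, as on the following slides), this sketch scores a sentence with a bigram context (k = 2) and with the unigram special case:

```python
# Hypothetical conditional probabilities with a bigram context
# (all values are invented for illustration only).
bigram_prob = {
    ("<s>", "its"): 0.2, ("its", "water"): 0.3, ("water", "is"): 0.4,
    ("is", "so"): 0.2, ("so", "transparent"): 0.1,
}
unigram_prob = {"its": 0.01, "water": 0.005, "is": 0.05,
                "so": 0.02, "transparent": 0.001}

def bigram_sentence_prob(words):
    """Markov assumption with k = 2: condition each word on its predecessor only."""
    p, prev = 1.0, "<s>"
    for w in words:
        p *= bigram_prob.get((prev, w), 0.0)
        prev = w
    return p

def unigram_sentence_prob(words):
    """Simplest case: every word is scored independently."""
    p = 1.0
    for w in words:
        p *= unigram_prob.get(w, 0.0)
    return p

sent = "its water is so transparent".split()
print(bigram_sentence_prob(sent))
print(unigram_sentence_prob(sent))
```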
Training Language Models
• The Maximum Likelihood Estimate (MLE)
- Train parameters so that they maximize the probability of the training data
- Parameters: the N-gram probabilities, e.g. for a bigram model
P(wi | wi−1) = count(wi−1, wi) / count(wi−1) = c(wi−1, wi) / c(wi−1)

An example
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
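A minimal sketch of MLE bigram estimation on exactly the three example sentences above; the printed values follow directly from the counts in that tiny corpus:

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_mle(word, prev):
    """MLE bigram probability: count(prev, word) / count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_mle("I", "<s>"))    # 2/3
print(p_mle("Sam", "<s>"))  # 1/3
print(p_mle("am", "I"))     # 2/3
```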
Estimate N-gram Probabilities
• Example (Trigram model):
- Counts from EPPS (Europarl) corpus

More examples: Berkeley Restaurant Project sentences
• can you tell me about any good cantonese restaurants close by
• mid priced thai food is what i’m looking for
• tell me about chez panisse
• can you give me a listing of the kinds of food that are available
• i’m looking for a good place to eat breakfast
• when is caffe venezia open during the day
Raw bigram counts
• Out of 9222 sentences

Raw bigram probabilities
• Normalize by unigrams (aka calculate the probabilities)
• Result:
Practical Issues
• We do everything in log space
- Avoid underflow
- (also adding is faster than multiplying)
log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4

Evaluation: How good is our model?
• Does our language model prefer good sentences to bad ones?
- Assign higher probability to “real” or “frequently observed” sentences
- Than to “ungrammatical” or “rarely observed” sentences?
• We train the parameters of our model on a training set.
• We test the model’s performance on data we haven’t seen.
- A test set is an unseen dataset that is different from our training set, totally unused.
- An evaluation metric tells us how well our model does on the test set.
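Returning to the log-space point on the Practical Issues slide, a minimal sketch (the per-word probabilities are invented for illustration) showing why summing log-probabilities is preferable to multiplying many small probabilities:

```python
import math

# Invented per-word probabilities for a long sentence.
probs = [1e-5] * 200

product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 -> underflows in double-precision floating point

log_prob = sum(math.log(p) for p in probs)
print(log_prob)  # about -2302.6, perfectly representable
```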
25 26
29 30
Perplexity
• Perplexity is the inverse probability of the test set, normalized by the number of words:
PP(W) = P(w1 w2 … wN)^(−1/N)
• Equivalently, via the log-probability:
PP(W) = 2^(−(1/N) Σi log2 P(wi | w1 … wi−1))
• Lower perplexity = better model. N-gram models evaluated on the Wall Street Journal:

Order       Unigram   Bigram   Trigram
Perplexity  962       170      109
[…] value between 0 and 1 and print the word whose interval includes this chosen value. We continue choosing random numbers and generating words until we randomly generate the sentence-final token </s>. We can use the same technique to generate […]

4-gram:
–King Henry. What! I will go seek the traitor Gloucester. Exeunt some of the watch. A great banquet serv’d in;
–It cannot be but so.
Figure 4.3 Eight sentences randomly generated from four N-grams computed from Shakespeare’s works. All characters were mapped to lower-case and punctuation marks were treated as words. Output is hand-corrected for capitalization to improve readability.

The longer the context on which we train the model, the more coherent the sentences. In the unigram sentences, there is no coherent relation between words or any sentence-final punctuation. The bigram sentences have some local word-to-word coherence (especially if we consider that punctuation counts as a word). The trigram and 4-gram sentences are beginning to look a lot like Shakespeare. Indeed, a careful investigation of the 4-gram sentences shows that they look a little too much like Shakespeare. The words It cannot be but so are directly from King John. This is because, not to put the knock on Shakespeare, his oeuvre is not very large as corpora go (N = 884,647, V = 29,066), and our N-gram probability matrices are ridiculously sparse. There are V² = 844,000,000 possible bigrams alone, and the number of possible 4-grams is V⁴ = 7 × 10¹⁷. Thus, once the generator has chosen the first 4-gram (It cannot be but), there are only five possible continuations (that, I, he, thou, and so); indeed, for many 4-grams, there is only one continuation.

Shakespeare as corpus
• N = 884,647 tokens, V = 29,066
• Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams

To get an idea of the dependence of a grammar on its training set, let’s look at an N-gram grammar trained on a completely different corpus: the Wall Street Journal (WSJ) newspaper. Shakespeare and the Wall Street Journal are both English, so we might expect some overlap between our N-grams for the two genres. Fig. 4.4 shows sentences generated by unigram, bigram, and trigram grammars trained on 40 million words of the Wall Street Journal, lower-casing all characters and treating punctuation as words. Output was then hand-corrected for capitalization to improve readability.

3-gram:
They also point to ninety nine point six billion dollars from two hundred four oh six three percent of the rates of interest stores as Mexico and Brazil on market conditions
Figure 4.4 Three sentences randomly generated from three N-gram models computed from 40 million words of the Wall Street Journal, lower-casing all characters and treating punctuation as words. Output was then hand-corrected for capitalization to improve readability.

[…] no overlap whatsoever in possible sentences, and little if any overlap even in small phrases. This stark difference tells us that statistical models are likely to be pretty useless as predictors if the training sets and the test sets are as different as Shakespeare and WSJ.

How should we deal with this problem when we build N-gram models? One way is to be sure to use a training corpus that has a similar genre to whatever task we are trying to accomplish. To build a language model for translating legal documents, we need a training corpus of legal documents. To build a language model for a question-answering system, we need a training corpus of questions.

Matching genres is still not sufficient. Our models may still be subject to the problem of sparsity. For any N-gram that occurred a sufficient number of times, we might have a good estimate of its probability. But because any corpus is limited, […]

Lesson 1: The perils of overfitting
• N-grams only work well for word prediction if the test corpus looks like the training corpus
- In real life, it often doesn’t
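To mirror the generation procedure described in the excerpt above (pick a random value in [0, 1) and emit the word whose probability interval contains it, repeating until </s>), here is a minimal bigram sampler; the toy model is trained on the three <s> … </s> example sentences from earlier, not on Shakespeare or the WSJ:

```python
import random
from collections import Counter, defaultdict

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

# MLE bigram distributions P(w | prev).
followers = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, w in zip(tokens, tokens[1:]):
        followers[prev][w] += 1

def sample_next(prev):
    """Pick a random point in [0, 1) and return the word whose interval contains it."""
    r = random.random()
    total = sum(followers[prev].values())
    cumulative = 0.0
    for word, count in followers[prev].items():
        cumulative += count / total
        if r < cumulative:
            return word
    return "</s>"  # numerical safety net

def generate():
    word, sentence = "<s>", []
    while True:
        word = sample_next(word)
        if word == "</s>":
            return " ".join(sentence)
        sentence.append(word)

print(generate())
```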
[Figure: the intuition of smoothing — the observed bigram counts for one context (e.g. 2 reports, 1 claims, 1 request; 7 total, over continuations such as allegations, reports, claims, request, man, outcome, attack) have some probability mass redistributed to unseen words (e.g. 0.5 claims, 0.5 request, 2 other; still 7 total); with add-one smoothing the bigram denominator becomes c(wi−1) + V.]
Add-one estimation on Berkeley restaurants

Reconstituted (adjusted) counts
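A minimal sketch of add-one (Laplace) smoothing for bigrams, using the (c(wi−1, wi) + 1) / (c(wi−1) + V) estimate; the tiny corpus is the Sam example reused for illustration, not the Berkeley restaurant data:

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

V = len(unigram_counts)  # vocabulary size (includes <s> and </s> here)

def p_add1(word, prev):
    """Add-one smoothed bigram: (c(prev, word) + 1) / (c(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(p_add1("Sam", "am"))  # seen bigram, slightly discounted
print(p_add1("ham", "am"))  # unseen bigram, now greater than zero
```

The "reconstituted" (adjusted) counts on the slide are simply these smoothed probabilities multiplied back by c(wi−1), which makes the effect of the discounting visible on the original count scale.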
How to set (“learn”) the lambdas?
• Use a held-out (or validation) corpus:
Training Data | Held-Out Data | Test Data
• Choose λs to maximize the probability of held-out data:
- Fix the N-gram probabilities (on the training data)
- Then search for λs that give the largest probability to the held-out set (see the sketch below):
log P(w1 … wn | M(λ1 … λk)) = Σi log P M(λ1…λk)(wi | wi−1)

Unknown words: Open vs. closed vocabulary tasks
• If we know all the words in advance
- Vocabulary V is fixed
- Closed vocabulary task
• Often, we don’t know this
- Out Of Vocabulary = OOV words
- Open vocabulary task
• Instead: create an unknown word token <UNK>
- Training of <UNK> probabilities
- Create a fixed lexicon L of size V
- At text normalization phase, any training word not in L is changed to <UNK>
- Now we train its probabilities like a normal word
- At decoding time
- If text input: use <UNK> probabilities for any word not in training
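A minimal sketch of linear interpolation and a coarse grid search for λ on a held-out set (the probability tables are invented placeholders standing in for MLE estimates from training data; in practice EM or a finer search is typically used):

```python
import math

# Placeholder unigram/bigram probabilities, assumed to come from MLE
# training (all values invented for illustration).
p_uni = {"i": 0.2, "am": 0.1, "sam": 0.05}
p_bi = {("i", "am"): 0.6, ("am", "sam"): 0.4}

held_out = [("i", "am"), ("am", "sam")]  # (previous word, word) pairs

def p_interp(word, prev, lam):
    """Interpolated bigram: lam * P(w|prev) + (1 - lam) * P(w)."""
    return lam * p_bi.get((prev, word), 0.0) + (1 - lam) * p_uni.get(word, 0.0)

def held_out_log_prob(lam):
    return sum(math.log(p_interp(w, prev, lam)) for prev, w in held_out)

# Coarse grid search for the lambda that maximizes held-out log-probability.
best_lam = max((l / 10 for l in range(11)), key=held_out_log_prob)
print(best_lam, held_out_log_prob(best_lam))
```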
Kneser-Ney Smoothing II
• How many times does w appear as a novel continuation:
P_CONTINUATION(w) ∝ |{wi−1 : c(wi−1, w) > 0}|

Kneser-Ney Smoothing III
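A minimal sketch of the continuation count used by Kneser-Ney — for each word, the number of distinct preceding word types; the tiny corpus is reused from the earlier example, and the full interpolated Kneser-Ney estimate is not shown:

```python
from collections import defaultdict

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

# Distinct left contexts per word: |{w_{i-1} : c(w_{i-1}, w) > 0}|
contexts = defaultdict(set)
for sentence in corpus:
    tokens = sentence.split()
    for prev, w in zip(tokens, tokens[1:]):
        contexts[w].add(prev)

continuation_count = {w: len(prevs) for w, prevs in contexts.items()}
total = sum(continuation_count.values())
p_continuation = {w: c / total for w, c in continuation_count.items()}

print(continuation_count["am"])  # 1: "am" only ever follows "I"
print(p_continuation["am"])
```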
Advanced Language Modeling
• Discriminative models:
- choose n-gram weights to improve a task, not to fit the training set
• Caching Models
P_CACHE(w | history) = λ P(wi | wi−2 wi−1) + (1 − λ) c(w ∈ history) / |history|

More Advanced Language Modeling
• (very old) idea: Use a neural network for n-gram language modeling:
- (1990). Neural network approach to word category prediction for English texts. In Proceedings of the 13th Conference on Computational Linguistics, USA, 213–218.
- Bengio, Ducharme, Vincent and Jauvin (2003). A Neural Probabilistic Language Model. Abstract:
“A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set. We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. The model learns simultaneously (1) a distributed representation for each word along with (2) the probability function for word sequences, expressed in terms of these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence. Training such large models (with millions of parameters) within a reasonable time is itself a significant challenge. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach significantly improves on state-of-the-art n-gram models, and that the proposed approach allows to take advantage of longer contexts.”
Keywords: Statistical language modeling, artificial neural networks, distributed representation, […]

• Training data: […]
• Never seen: dog gets fed
• Applications: […]
Noisy Channel Intuition
• IBM: Mays, Eric, Fred J. Damerau and Robert L. Mercer. 1991. Context based spelling correction. Information Processing and Management, 23(5), 517–522.
• AT&T Bell Labs: Kernighan, Mark D., Kenneth W. Church, and William A. Gale. 1990. A spelling correction program based on a noisy channel model. Proceedings of COLING 1990, 205–210.

Noisy Channel for spelling problem
• We see an observation x of a misspelled word
• Find the correct word w:
ŵ = argmax(w∈V) P(w | x)
 = argmax(w∈V) P(x | w) P(w) / P(x)
 = argmax(w∈V) P(x | w) P(w)
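A minimal sketch of the noisy-channel decision rule above, with an invented toy candidate set, a unigram language model P(w), and a hand-specified channel model P(x | w); every number here is a placeholder for illustration, not from the cited papers:

```python
# Toy language model P(w) and channel model P(x | w); all values invented.
p_word = {"actress": 0.0003, "across": 0.0008, "access": 0.0005}
p_typo_given_word = {           # probability of observing the typo x given w
    "actress": 0.000117,
    "across": 0.000093,
    "access": 0.000035,
}

x = "acress"  # observed misspelling

def score(w):
    """Noisy channel: P(x | w) * P(w); P(x) is constant and can be ignored."""
    return p_typo_given_word[w] * p_word[w]

w_hat = max(p_word, key=score)
print(w_hat, score(w_hat))
```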
Summary
• Maximum Likelihood Training
• Smoothing

References
• SLP chapter 3 (3.8 is optional)
• LM toolkits
- SRILM http://www.speech.sri.com/projects/srilm/
- KenLM https://kheafield.com/code/kenlm/
- Google N-grams https://ai.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html
- Google Books N-grams http://ngrams.googlelabs.com/