N-Grams - Text Representation

Zipf's law states that word frequencies in a text collection follow a power-law distribution: a few words appear very frequently and many words appear very infrequently. This distribution has been observed across many types of data, including words in texts, incomes, web page requests, and more. N-gram language models estimate the probability of a word from the previous one, two, or more words, and they address the data sparsity implied by Zipf's law through techniques like smoothing and combining multiple probability estimates. Zipf's law and n-gram models are important for applications like speech recognition, machine translation, and predicting future words.


Text Representation

Corpora, Types, and Tokens


We now have available large corpora of machine-readable texts in many languages. We can analyze a corpus into a set of word tokens (instances of words) and word types, or terms (distinct words). So "The boys went to the park" contains 6 tokens and 5 types.
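As a quick illustration (a sketch added here, not from the original slides), the token/type count for the example sentence can be checked in a few lines of Python:

```python
# Count word tokens (instances) and word types (distinct words),
# treating "The" and "the" as the same type.
sentence = "The boys went to the park"
tokens = sentence.split()
types = set(t.lower() for t in tokens)
print(len(tokens), len(types))  # 6 tokens, 5 types
```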

Zipf's Law

George Kingsley Zipf (1902-1950)

The frequency of occurrence of a word is inversely proportional to its rank when words are ordered by decreasing frequency of occurrence. When both are plotted on a log scale, the graph is approximately a straight line.

Zipf Distribution
The Important Points:
A few elements occur very frequently.
A medium number of elements have medium frequency.
Many elements occur very infrequently.

Zipf Distribution
The product of the frequency of words (f) and their rank (r) is approximately constant
Rank = a word's position when words are ordered by decreasing frequency of occurrence

f ≈ C / r, where C ≈ N/10 (N = number of word tokens in the collection)
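A rough way to check this relation on a corpus (a sketch, assuming simple whitespace tokenization and a reasonably large text supplied by the reader) is to print f·r for the top-ranked words and compare it with N/10:

```python
from collections import Counter

def zipf_check(text, top=20):
    """Print f*r for the most frequent words; under Zipf's law this
    product should be roughly constant, and roughly N/10 for English."""
    tokens = text.lower().split()
    n = len(tokens)
    ranked = Counter(tokens).most_common()
    for rank, (word, freq) in enumerate(ranked[:top], start=1):
        print(f"{word:>15}  rank={rank:<3} freq={freq:<7} f*r={freq * rank}")
    print("N/10 =", n / 10)
```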

Zipf Distribution (Same curve on linear and log scale)

Illustration by Jacob Nielsen

What Kinds of Data Exhibit a Zipf Distribution?


Words in a text collection
Virtually any language usage

Income distribution amongst individuals
Library book checkout patterns
Web page requests
Page links on the Web

Zipf and Web Requests

Data from AOL users' web requests for one day in December 1997

Zipf and Web Requests

Applying Zipf's Law to Language


Applying Zipf's law to word frequencies in a large enough corpus: for each term t, r(t) ≈ c * f(t)^-1 for some constant c. In English texts, c is about N/10, where N is the number of words in the collection. English data: http://web.archive.org/web/20000818062828/http://hobart.cs.umass.edu/~allan/cs646-f97/char_of_text.html

Statistics from the TIME collection, a 1.6 MB collection of 423 short TIME magazine articles (245,412 term occurrences). Top 50 terms are:

Statistics from the WSJ87 collection, a 131.6 MB collection of 46,449 newspaper articles (19 million term occurrences). Top 50 terms are:

Visualizing Zipf's Law

Word frequencies in the Brown corpus


From Judith A. Molka-Danielsen

Letter Frequencies in English


[Figure: letter frequencies in English plotted against rank (frequency on the y-axis, rank on the x-axis)]
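Letter frequencies such as those in the plot can be computed with a short sketch like the following (plain ASCII text is assumed):

```python
from collections import Counter
import string

def letter_frequencies(text):
    """Relative frequency of each letter, ranked most to least frequent."""
    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    counts = Counter(letters)
    total = sum(counts.values())
    return [(ch, cnt / total) for ch, cnt in counts.most_common()]
```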

Links

Etaoin Shrdlu and letter frequencies in the dictionary: http://rinkworks.com/words/letterfreq.shtml
Simon Singh's applet for computing letter frequencies: http://www.simonsingh.net/The_Black_Chamber/frequencyanalysis.html

Redundancy in Text - Letters


Her visit-r, she saw as -he opened t-e door, was s-ated in the rmchair be-ore the fir-, dozing it w-uld seem, wi-h his banda-ed head dro-ping on one -ide. The onl- light in th- room was th- red glow fr-m the firew-ich lit his -yes like ad-erse railw-y signals, b-t left his d-wncast fac- in darknes---and the sca-ty vestige- of the day t-at came in t-rough the o-en door. Eveything was -uddy, shado-y, and indis-inct to her, -he more so snce she had -ust been li-hting the b-r lamp, and h-r eyes were azzled.

Redundancy in Text - Letters


Aft-r Mr-. Hall -ad l-ft t-e ro-m, he ema-ned tan-ing -n frnt o- the -ire, -lar-ng, s- Mr. H-nfr-y pu-s it, -t th- clo-k-medin-. Mr. H-nfr-y no- onl- too- off -he h-nds -f th- clo-k, anthe -ace, -ut e-tra-ted -he w-rks; -nd h- tri-d to -ork -n as -low -nd q-iet -nd u-ass-min- a ma-ner -s po-sibl-. He w-rke- with he l-mp c-ose -o hi-, and -he g-een had- thr-w a b-ill-ant ight -pon -is h-nds, -nd u-on t-e fr-me a-d wh-els, -nd l-ft t-e re-t of -he r-om s-ado-y. Wh-n he ook-d up, -olo-red atc-es s-am -n hi- eye-.

Order Doesn't Seem to Matter


Aoccdrnig to rscheearch at an Elingsh uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, olny taht the frist and lsat ltteres are at the rghit pcleas. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by ilstef, but the wrod as a wlohe.

http://joi.ito.com/archives/2003/09/14/ordering_of_letters_dont_matter.html

Chatbots Exploit Redundancy


Let's look at some data on the inputs to ALICE: http://www.alicebot.org/

Chatbots Exploit Redundancy


Human language is not random. Considering the vast size of the set of things people could possibly say that are grammatically correct or semantically meaningful, the number of things people actually do say is surprisingly small. 1,800 words cover 95% of all the first words input to ALICE. The number of choices for the second word is only about two. The average branching factor decreases with each successive word.
8024 YES
5184 NO
2268 OK
2006 WHY
1145 BYE
1101 HOW OLD ARE YOU
946 HI
934 HOW ARE YOU
846 WHAT
840 HELLO
663 GOOD
645 WHY NOT
584 OH
553 REALLY
544 YOU
531 WHAT IS YOUR NAME
525 COOL
516 I DO NOT KNOW
488 FUCK YOU
486 THANK YOU
416 SO
414 ME TOO
403 LOL
403 THANKS
381 NICE TO MEET YOU TOO
375 SORRY
374 ALICE
368 HI ALICE
366 OKAY
353 WELL
352 WHAT IS MY NAME
349 WHERE DO YOU LIVE
340 NOTHING

Why Do We Want to Predict Words?


Speech recognition
Handwriting recognition/OCR
Spelling correction
Statistical machine translation

Predicting a Word Sequence


The probability of "The cat is on the mat" is
P(the cat is on the mat) = P(the | <s>) P(cat | <s> the) P(is | <s> the cat) P(on | <s> the cat is) P(the | <s> the cat is on) P(mat | <s> the cat is on the) P(</s> | <s> the cat is on the mat) where the tags <s> and </s> indicate beginning and end of the sentence.

But that is not a practical solution. Instead, conditioning on only the two previous tokens:
P(the cat is on the mat) = P(the | <s>) P(cat | <s> the) P(is | the cat) P(on | cat is) P(the | is on) P(mat | on the) P(</s> | the mat)
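A sketch of how such a factored probability could be computed, assuming a hypothetical cond_prob(word, context) function that returns smoothed, nonzero conditional probabilities (not part of the original slides); logs are used to avoid numerical underflow:

```python
import math

def sentence_logprob(words, cond_prob, order=3):
    """Log-probability of a sentence under an n-gram model of the given order.
    cond_prob(word, context) is assumed to return P(word | context), where
    context is a tuple of the (order - 1) previous tokens. The sentence is
    padded with start symbols <s> and terminated with </s>."""
    padded = ["<s>"] * (order - 1) + list(words) + ["</s>"]
    logp = 0.0
    for i in range(order - 1, len(padded)):
        context = tuple(padded[i - order + 1:i])
        logp += math.log(cond_prob(padded[i], context))
    return logp
```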

N-grams
Approximating reality: let V be the number of words in the lexicon and T be the number of tokens in a training corpus
P(wk = W) = 1/V
P(wk = W) = count(W) / T (word frequencies)
P(wk = W1 | wk-1 = W0) = c(W0 W1) / c(W0) (bigrams)

Abbreviating P(wk = W1 | wk-1 = W0) to P(W1|W0). For example P(rabbit | the).


P(Wn | Wn-2 Wn-1) = c(Wn-2 Wn-1 Wn) / c(Wn-2 Wn-1) (trigrams)
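A minimal sketch of computing these maximum-likelihood bigram estimates from a token list (whitespace-tokenized text is assumed; trigrams are analogous):

```python
from collections import Counter

def mle_bigram_probs(tokens):
    """MLE bigram estimates: P(w1 | w0) = c(w0 w1) / c(w0)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {(w0, w1): c / unigrams[w0] for (w0, w1), c in bigrams.items()}

# Example: mle_bigram_probs("the cat is on the mat".split())
# gives P(cat | the) = 0.5, P(mat | the) = 0.5, P(is | cat) = 1.0, ...
```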

Building n-gram Models


1. Data preparation:
Decide on a training corpus
Remove punctuation
Sentence breaks: keep them or throw them away?

2. Create equivalence classes and get counts on training data falling into each class.
3. Find statistical estimators.

Problems with n-grams


Sue swallowed the large green ______ . Pill, frog, car, mountain, tree?
Knowing that "Sue swallowed" helps narrow down the possibilities. How far back do we look?

For a vocabulary of 20,000 words, the number of bigrams = 400 million, the number of trigrams = 8 trillion, and the number of four-grams = 1.6 × 10^17! Data sparseness is the result.
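A quick arithmetic check of these figures:

```python
V = 20_000
print(V ** 2)  # 400,000,000 bigrams        (4.0e8)
print(V ** 3)  # 8,000,000,000,000 trigrams (8.0e12)
print(V ** 4)  # 1.6e17 four-grams
```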

Bigram Example

Statistical Estimation Methods


Maximum Likelihood Estimation (MLE)
Smoothing:
Laplace's Law
Good-Turing Estimation

Combining estimators:
Linear interpolation
Backing off

Maximum Likelihood Estimation (MLE)


MLE is the choice of parameter values that gives the highest probability to the training corpus.

PMLE(w1,...,wn) = C(w1,...,wn) / N, where C(w1,...,wn) is the frequency of the n-gram w1,...,wn and N is the number of training instances.
PMLE(wn | w1,...,wn-1) = C(w1,...,wn) / C(w1,...,wn-1)

MLE: Problems
Problem of sparseness of data: the vast majority of words are very uncommon (Zipf's Law). Some bins may remain empty or contain too little data. MLE assigns 0 probability to unseen events.
We need to allow for possibility of seeing events not seen in training.

Smoothing
Examples: In some specific corpus, "to want" doesn't occur. But it could: "I'm going to want to eat lunch at 1." The words knit, purl, quilt, and bobcat are missing from our list of the top 10,000 words in a newswire corpus. In Alice's Adventures in Wonderland, the words half and sister both occur, but the bigram "half sister" does not. But this does not mean that the probability of encountering "half sister" in some new text is 0.

Laplace's Law (1814-1995)


Adding-one process: PLAP(w1,...,wn) = (C(w1,...,wn) + 1) / (N + B),
where

C(w1,..,wn) = frequency of n-gram w1,..,wn


B is the number of values the target feature can take on: B = V (the vocabulary size) for n = 1, and B = V^n for n > 1. So, for bigrams, there are V^2 possibilities.

Gives a little bit of the probability space to unseen events. This is the Bayesian estimator assuming a uniform prior on events.
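A sketch of the conditional (bigram) form of add-one smoothing, which follows from adding one to every bigram count; the count dictionaries are assumed to come from a training corpus:

```python
def laplace_bigram_prob(w0, w1, bigram_counts, unigram_counts, vocab_size):
    """Add-one (Laplace) smoothed conditional bigram estimate:
    P(w1 | w0) = (c(w0 w1) + 1) / (c(w0) + V).
    Unseen bigrams receive a small, nonzero probability."""
    numerator = bigram_counts.get((w0, w1), 0) + 1
    denominator = unigram_counts.get(w0, 0) + vocab_size
    return numerator / denominator
```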

Laplace's Law: Problems


For sparse sets of data over large vocabularies, it assigns too much of the probability space to unseen events.

Good-Turing Estimation
Use counts of more frequent n-grams to estimate less frequent n-grams: PGT = r*/N, where r* can be thought of as an adjusted frequency given by r* = (r+1) E(N_{r+1}) / E(N_r).
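A sketch of computing the adjusted counts r* from a table of n-gram counts; as noted in the comment, a practical implementation would also smooth the N_r values:

```python
from collections import Counter

def good_turing_adjusted_counts(ngram_counts):
    """Good-Turing adjusted frequencies r* = (r + 1) * N_{r+1} / N_r,
    where N_r is the number of n-grams seen exactly r times.
    Returns a dict mapping each observed count r to r*.
    (A real implementation would smooth the N_r values, since N_{r+1}
    is often zero for large r.)"""
    freq_of_freqs = Counter(ngram_counts.values())  # N_r
    adjusted = {}
    for r, n_r in freq_of_freqs.items():
        n_r_plus_1 = freq_of_freqs.get(r + 1, 0)
        adjusted[r] = (r + 1) * n_r_plus_1 / n_r
    return adjusted
```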

Combining Estimators
Combining multiple probability estimates from various different models:
Simple linear interpolation
Katz's backing-off

Simple Linear Interpolation


Solve the sparseness in a trigram model by mixing with bigram and unigram models. Combine linearly (termed linear interpolation, finite mixture models, or deleted interpolation): Pli(wn | wn-2, wn-1) = λ1 P1(wn) + λ2 P2(wn | wn-1) + λ3 P3(wn | wn-2, wn-1), where 0 ≤ λi ≤ 1 and Σi λi = 1.
Weights can be picked automatically using Expectation Maximization.
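A minimal sketch of the interpolation itself, with hypothetical p_uni, p_bi, and p_tri estimator functions and illustrative (not EM-trained) weights:

```python
def interpolated_prob(w, w1, w2, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """Simple linear interpolation of unigram, bigram, and trigram estimates:
    P_li(w | w2, w1) = l1*P1(w) + l2*P2(w | w1) + l3*P3(w | w2, w1),
    with the weights summing to 1. In practice the weights would be tuned
    on held-out data, e.g. with Expectation Maximization."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    return l1 * p_uni(w) + l2 * p_bi(w, w1) + l3 * p_tri(w, w1, w2)
```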

Katz's Backing-Off
Different models are consulted depending on their specificity:
1. Use the n-gram probability when the n-gram has appeared more than k times (k usually 0 or 1).
2. If not, back off to the (n-1)-gram probability.
Repeat as necessary.
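A simplified sketch in the spirit of backing off; note that true Katz back-off also discounts the higher-order estimates (e.g. with Good-Turing) so that the probabilities still normalize, which is omitted here. The counts dict is assumed to map tuples of all orders to their training counts:

```python
def backoff_prob(ngram, counts, k=0):
    """Use the n-gram relative frequency if the n-gram was seen more than
    k times; otherwise recursively fall back to the (n-1)-gram."""
    if len(ngram) == 1:
        total = sum(c for ng, c in counts.items() if len(ng) == 1)
        return counts.get(ngram, 0) / total
    history = ngram[:-1]
    if counts.get(ngram, 0) > k and counts.get(history, 0) > 0:
        return counts[ngram] / counts[history]
    return backoff_prob(ngram[1:], counts, k)
```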

Fun Links
N-gram search engine: http://nlp.cs.nyu.edu/nsearch/
http://xkcd.com/798/

Take home message


Language is not random. Context (letter/word/sentence history) is used by humans, and also for automatic processing.

An Overview of Microsoft Web N-gram Corpus and Applications


Kuansan Wang, Christopher Thrasher, Evelyne Viegas, Xiaolong Li, Bo-june (Paul) Hsu

NAACL-HLT 2010

The Microsoft corpus


The effectiveness of statistical natural language processing is highly susceptible to the size of the data used to develop it.
English Giga-word corpus (Graff and Cieri, 2003)
1-tera-word Google N-gram corpus, created from the WWW

But modeling document content is not enough: it neglects anchor text, short messages from social network applications that summarize the document, etc. Different text streams have significantly different properties. The Microsoft Web N-gram corpus processes material from the body, title, and anchor text separately.

General Model Information


Based on Web documents indexed by a commercial search engine (Bing)
Spam and low-quality web pages excluded using dedicated algorithms
The various sections parsed, tokenized, lower-cased, and punctuation marks removed
No stemming, spelling corrections, or inflections
Provides smoothed back-off N-gram models
Live updates from the Web
Number of tokens (as of June 30, 2009): Body 1.4 trillion, Title 12.5 billion, Anchor 357 billion

Demo: Search Query Segmentation


Given: a search query. Goal: word breaking. A query of T characters has 2^(T-1) segmentation hypotheses.
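One standard way to pick the best of these hypotheses is dynamic programming over a word-level language model; the sketch below (not from the paper) assumes a hypothetical word_logprob function returning log P(word):

```python
def best_segmentation(query, word_logprob, max_word_len=20):
    """Dynamic-programming word breaking: among the 2^(T-1) possible
    segmentations of a T-character query, return the one whose words have
    the highest total unigram log-probability."""
    T = len(query)
    best = [(-float("inf"), None)] * (T + 1)   # (score, backpointer) per prefix
    best[0] = (0.0, None)
    for end in range(1, T + 1):
        for start in range(max(0, end - max_word_len), end):
            score = best[start][0] + word_logprob(query[start:end])
            if score > best[end][0]:
                best[end] = (score, start)
    # Backtrack to recover the word boundaries.
    words, end = [], T
    while end > 0:
        start = best[end][1]
        words.append(query[start:end])
        end = start
    return list(reversed(words))
```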

Papers for next week


Memory-Based Context-Sensitive Spelling Correction at Web Scale, 2007
Improving Email Speech Acts Analysis via N-gram Selection, 2006

What makes a good presentation


Motivation is clear. What is the problem/task?
Engaging: examples, illustrative figures (if possible)
Model is clear at a high level (don't drown in the formulas)
Experiments:
Settings
Evaluation measures
Results and their meaning

Related work: relatively brief; what is novel about this work?
Conclusions and future work
What did you think about the work?
