N-Grams - Text Representation
Zipf's Law
The frequency of occurrence of a word is inversely proportional to its rank when words are ordered by frequency of occurrence. When both are plotted on a log scale, the graph is approximately a straight line.
Zipf Distribution
The Important Points:
A few elements occur very frequently.
A medium number of elements have medium frequency.
Many elements occur very infrequently.
Zipf Distribution
The product of the frequency of words (f) and their rank (r) is approximately constant
Rank = order of words by frequency of occurrence
f ≅ C × (1/r), where C ≅ N/10 (N = total number of term occurrences)
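A minimal sketch that tabulates f × r for the most frequent words of a corpus, to see that the product stays roughly constant; the file name corpus.txt is a placeholder for any plain-text collection:

from collections import Counter

# Count word frequencies and check that f * r stays roughly constant
# for the top-ranked words (Zipf's law).
tokens = open("corpus.txt", encoding="utf-8").read().lower().split()
counts = Counter(tokens)

for rank, (word, freq) in enumerate(counts.most_common(20), start=1):
    print(f"{rank:>4}  {word:<15} f={freq:<8} f*r={freq * rank}")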
Income distribution among individuals
Library book checkout patterns
Web page requests
Page links on the Web
Data from AOL users' web requests for one day in December 1997
Statistics from the TIME collection, a 1.6 MB collection of 423 short TIME magazine articles (245,412 term occurrences). Top 50 terms are:
Statistics from the WSJ87 collection, a 131.6 MB collection of 46,449 newspaper articles (19 million term occurrences). Top 50 terms are:
Links
Etaoin Shrdlu and frequencies in the dictionary: http://rinkworks.com/words/letterfreq.shtml
Simon Singh's applet for computing letter frequencies: http://www.simonsingh.net/The_Black_Chamber/frequencyanalysis.html
http://joi.ito.com/archives/2003/09/14/ordering_of_letters_dont_matter.html
But that is not a practical solution. Instead, condition on only the two previous tokens (a trigram approximation):
P(the cat is on the mat) = P(the | <s>) P(cat | <s> the) P(is | the cat) P(on | cat is) P(the | is on) P(mat | on the) P(</s> | the mat)
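The same computation can be written directly from trigram and bigram counts. A minimal sketch, assuming the counts are collections.Counter objects over token tuples and that sentences are padded with two <s> symbols (the factorization above writes the first factor with a single <s>); all names are illustrative:

from collections import Counter

def trigram_sentence_prob(sentence, trigram_counts, bigram_counts):
    """MLE trigram probability of a tokenized sentence, following the
    factorization above. Returns 0.0 if any history or trigram is
    unseen, which is exactly the sparseness problem discussed later."""
    tokens = ["<s>", "<s>"] + sentence + ["</s>"]
    prob = 1.0
    for i in range(2, len(tokens)):
        history = (tokens[i - 2], tokens[i - 1])
        trigram = history + (tokens[i],)
        if bigram_counts[history] == 0:
            return 0.0
        prob *= trigram_counts[trigram] / bigram_counts[history]
    return prob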
N-grams
Approximating reality: let V be the number of words in the lexicon and T be the number of tokens in a training corpus
P(wk = W) = 1/V  (uniform)
P(wk = W) = count(W) / T  (word frequencies)
P(wk = W1 | wk-1 = W0) = c(W0 W1) / c(W0)  (bigrams)
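A minimal sketch of these three estimators on a toy corpus; the toy text and function names are illustrative, and V is taken as the number of distinct words seen in training:

from collections import Counter

tokens = "the cat is on the mat the cat sat".split()   # toy training corpus

T = len(tokens)                                   # number of tokens
unigram_counts = Counter(tokens)
V = len(unigram_counts)                           # lexicon size (as seen in training)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def p_uniform(w):      return 1 / V
def p_unigram(w):      return unigram_counts[w] / T
def p_bigram(w1, w0):  return bigram_counts[(w0, w1)] / unigram_counts[w0]   # P(w1 | w0)

print(p_uniform("cat"), p_unigram("cat"), p_bigram("cat", "the"))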
2. Create equivalence classes and get counts on training data falling into each class.
3. Find statistical estimators.
For a vocabulary of 20,000 words: number of bigrams = 400 million, number of trigrams = 8 trillion, number of four-grams = 1.6 × 10^17!
Data sparseness
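These counts are simply V^2, V^3, and V^4 for V = 20,000, which a quick check confirms:

V = 20_000
print(f"bigrams:    {V**2:,}")     # 400,000,000
print(f"trigrams:   {V**3:,}")     # 8,000,000,000,000
print(f"four-grams: {V**4:.1e}")   # 1.6e+17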
Bigram Example
Combining estimators:
Linear interpolation
Backing off
MLE: Problems
Problem of sparseness of data: the vast majority of words are very uncommon (Zipf's Law). Some bins may remain empty or contain too little data. MLE assigns 0 probability to unseen events.
We need to allow for possibility of seeing events not seen in training.
Smoothing
Examples:
In some specific corpus, "to want" doesn't occur. But it could: "I'm going to want to eat lunch at 1."
The words knit, purl, quilt, and bobcat are missing from our list of the top 10,000 words in a newswire corpus.
In Alice's Adventures in Wonderland, the words half and sister both occur, but the bigram "half sister" does not. This does not mean that the probability of encountering "half sister" in some new text is 0.
Gives a little bit of the probability space to unseen events. This is the Bayesian estimator assuming a uniform prior on events.
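The estimator described here is the classic add-one (Laplace) rule. A minimal sketch for bigrams, assuming counts are stored in collections.Counter objects; the function name is illustrative:

def laplace_bigram_prob(w1, w0, bigram_counts, unigram_counts, V):
    """Add-one (Laplace) smoothed estimate of P(w1 | w0): each of the
    V possible continuations of w0 receives one pseudo-count, so
    unseen bigrams get a small, nonzero share of the probability mass."""
    return (bigram_counts[(w0, w1)] + 1) / (unigram_counts[w0] + V)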
Good-Turing Estimation
Use the counts of n-grams occurring r + 1 times to re-estimate the probability of n-grams occurring r times: P_GT = r*/N, where r* can be thought of as an adjusted frequency given by r* = (r + 1) E(N_{r+1}) / E(N_r).
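A minimal sketch of this estimate, assuming n-gram counts are stored in a collections.Counter and using the observed frequency-of-frequency values N_r directly in place of the expectations E(N_r) (real implementations smooth these, e.g. Simple Good-Turing); names are illustrative:

from collections import Counter

def good_turing_probs(ngram_counts, N):
    """Good-Turing estimates: r* = (r + 1) * N_{r+1} / N_r and
    P_GT = r* / N, where N_r is the number of n-grams seen r times
    and N is the total number of n-gram tokens."""
    freq_of_freqs = Counter(ngram_counts.values())          # N_r
    probs = {}
    for ngram, r in ngram_counts.items():
        n_r = freq_of_freqs[r]
        n_r_plus_1 = freq_of_freqs[r + 1]
        r_star = (r + 1) * n_r_plus_1 / n_r if n_r_plus_1 else r   # fall back to raw r
        probs[ngram] = r_star / N
    return probs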
Combining Estimators
Combining multiple probability estimates from different models:
Simple linear interpolation
Katz's backing-off
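A minimal sketch of simple linear interpolation for a trigram model, assuming unigram, bigram, and trigram estimators like those sketched earlier are passed in as functions; the lambda weights shown are illustrative and would normally be tuned on held-out data:

def interpolated_prob(w, u, v, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """P(w | u v) as a weighted mixture of unigram, bigram, and
    trigram estimates. The weights must sum to 1."""
    l1, l2, l3 = lambdas
    return l1 * p_uni(w) + l2 * p_bi(w, v) + l3 * p_tri(w, u, v)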
Katz's Backing-Off
Different models are consulted depending on their specificity:
1. Use the n-gram probability when the n-gram has appeared more than k times (k usually = 0 or 1).
2. If not, back off to the (n-1)-gram probability.
Repeat as necessary.
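A simplified sketch of these steps for a trigram model, assuming the counts are collections.Counter objects; full Katz back-off also discounts the higher-order counts (e.g. with Good-Turing) and renormalizes with alpha weights, which is omitted here:

def backoff_prob(w, u, v, tri_counts, bi_counts, uni_counts, T, k=0):
    """Back off from trigram to bigram to unigram estimates, using a
    higher-order estimate only when its n-gram was seen more than k times."""
    if tri_counts[(u, v, w)] > k:
        return tri_counts[(u, v, w)] / bi_counts[(u, v)]
    if bi_counts[(v, w)] > k:
        return bi_counts[(v, w)] / uni_counts[v]
    return uni_counts[w] / T      # may still be 0 for unseen words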
Fun Links
N-gram Search Engine: http://nlp.cs.nyu.edu/nsearch/
http://xkcd.com/798/
NAACL-HLT 2010
But modeling document content is not enough: it neglects anchor text, short messages from social network applications that summarize the document, etc.
Different text streams have significantly different properties.
Microsoft Web N-gram corpus: materials from the body, title, and anchor text are processed separately.
Related work: relatively brief; what is novel about this work?
Conclusions and future work
What did you think about the work?