N-Grams - Text Representation

Zipf's law states that word frequencies in a text collection follow a power-law distribution: a few words appear very frequently and many words appear very infrequently. This distribution has been observed across many types of data, including words in texts, incomes, web page requests, and more. N-gram language models estimate the probability of a word from the previous one, two, or more words, and they address the data sparsity implied by Zipf's law through techniques like smoothing and combining multiple probability estimates. Zipf's law and n-gram models are important for applications like speech recognition, machine translation, and predicting future words.


Text Representation

Corpora, Types, and Tokens


We now have available large corpora of machine-readable texts in many languages. We can analyze a corpus into a set of word tokens (instances of words) and word types, or terms (distinct words). So "The boys went to the park" contains 6 tokens and 5 types.
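As a quick illustration (a sketch added here, not from the original slides), the token/type count for the example sentence can be checked in a few lines of Python:

```python
# Count word tokens (instances) and word types (distinct words),
# treating "The" and "the" as the same type.
sentence = "The boys went to the park"
tokens = sentence.split()
types = set(t.lower() for t in tokens)
print(len(tokens), len(types))  # 6 tokens, 5 types
```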

Zipf's Law

George Kingsley Zipf (1902-1950)

The frequency of occurrence of a word is inversely proportional to its rank when words are ordered by decreasing frequency of occurrence. When both are plotted on a log scale, the graph is approximately a straight line.

Zipf Distribution
The Important Points:
A few elements occur very frequently.
A medium number of elements have medium frequency.
Many elements occur very infrequently.

Zipf Distribution
The product of the frequency of words (f) and their rank (r) is approximately constant
Rank = a word's position when words are ordered by decreasing frequency of occurrence

f ≈ C / r, where C ≈ N/10 (N = number of word tokens in the collection)
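A rough way to check this relation on a corpus (a sketch, assuming simple whitespace tokenization and a reasonably large text supplied by the reader) is to print f·r for the top-ranked words and compare it with N/10:

```python
from collections import Counter

def zipf_check(text, top=20):
    """Print f*r for the most frequent words; under Zipf's law this
    product should be roughly constant, and roughly N/10 for English."""
    tokens = text.lower().split()
    n = len(tokens)
    ranked = Counter(tokens).most_common()
    for rank, (word, freq) in enumerate(ranked[:top], start=1):
        print(f"{word:>15}  rank={rank:<3} freq={freq:<7} f*r={freq * rank}")
    print("N/10 =", n / 10)
```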

Zipf Distribution (Same curve on linear and log scale)

Illustration by Jacob Nielsen

What Kinds of Data Exhibit a Zipf Distribution?


Words in a text collection
Virtually any language usage

Income distribution amongst individuals
Library book checkout patterns
Web page requests
Page links on the Web

Zipf and Web Requests

Data from AOL users' web requests for one day in December 1997

Zipf and Web Requests

Applying Zipf's Law to Language


Applying Zipf's law to word frequencies in a large enough corpus: for each term t, r(t) ≈ c * f(t)^-1 for some constant c. In English texts, c is about N/10, where N is the number of words in the collection. English data: http://web.archive.org/web/20000818062828/http://hobart.cs.umass.edu/~allan/cs646-f97/char_of_text.html

Statistics from the TIME collection, a 1.6 MB collection of 423 short TIME magazine articles (245,412 term occurrences). Top 50 terms are:

Statistics from the WSJ87 collection, a 131.6 MB collection of 46,449 newspaper articles (19 million term occurrences). Top 50 terms are:

Visualizing Zipf's Law

Word frequencies in the Brown corpus


From Judith A. Molka-Danielsen

Letter Frequencies in English


[Figure: letter frequencies in English plotted against rank (frequency on the y-axis, rank on the x-axis)]
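Letter frequencies such as those in the plot can be computed with a short sketch like the following (plain ASCII text is assumed):

```python
from collections import Counter
import string

def letter_frequencies(text):
    """Relative frequency of each letter, ranked most to least frequent."""
    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    counts = Counter(letters)
    total = sum(counts.values())
    return [(ch, cnt / total) for ch, cnt in counts.most_common()]
```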

Links

Etaoin Shrdlu and letter frequencies in the dictionary: http://rinkworks.com/words/letterfreq.shtml
Simon Singh's applet for computing letter frequencies: http://www.simonsingh.net/The_Black_Chamber/frequencyanalysis.html

Redundancy in Text - Letters


Her visit-r, she saw as -he opened t-e door, was s-ated in the rmchair be-ore the fir-, dozing it w-uld seem, wi-h his banda-ed head dro-ping on one -ide. The onl- light in th- room was th- red glow fr-m the firew-ich lit his -yes like ad-erse railw-y signals, b-t left his d-wncast fac- in darknes---and the sca-ty vestige- of the day t-at came in t-rough the o-en door. Eveything was -uddy, shado-y, and indis-inct to her, -he more so snce she had -ust been li-hting the b-r lamp, and h-r eyes were azzled.

Redundancy in Text - Letters


Aft-r Mr-. Hall -ad l-ft t-e ro-m, he ema-ned tan-ing -n frnt o- the -ire, -lar-ng, s- Mr. H-nfr-y pu-s it, -t th- clo-k-medin-. Mr. H-nfr-y no- onl- too- off -he h-nds -f th- clo-k, anthe -ace, -ut e-tra-ted -he w-rks; -nd h- tri-d to -ork -n as -low -nd q-iet -nd u-ass-min- a ma-ner -s po-sibl-. He w-rke- with he l-mp c-ose -o hi-, and -he g-een had- thr-w a b-ill-ant ight -pon -is h-nds, -nd u-on t-e fr-me a-d wh-els, -nd l-ft t-e re-t of -he r-om s-ado-y. Wh-n he ook-d up, -olo-red atc-es s-am -n hi- eye-.

Order Doesn't Seem to Matter


Aoccdrnig to rscheearch at an Elingsh uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, olny taht the frist and lsat ltteres are at the rghit pcleas. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by ilstef, but the wrod as a wlohe.

http://joi.ito.com/archives/2003/09/14/ordering_of_letters_dont_matter.html

Chatbots Exploit Redundancy


Let's look at some data on the inputs to ALICE: http://www.alicebot.org/

Chatbots Exploit Redundancy


Human language is not random. Considering the vast size of the set of things people could possibly say that are grammatically correct or semantically meaningful, the number of things people actually do say is surprisingly small. 1,800 words cover 95% of all the first words input to ALICE. The number of choices for the second word is only about two. The average branching factor decreases with each successive word.
8024 YES
5184 NO
2268 OK
2006 WHY
1145 BYE
1101 HOW OLD ARE YOU
946 HI
934 HOW ARE YOU
846 WHAT
840 HELLO
663 GOOD
645 WHY NOT
584 OH
553 REALLY
544 YOU
531 WHAT IS YOUR NAME
525 COOL
516 I DO NOT KNOW
488 FUCK YOU
486 THANK YOU
416 SO
414 ME TOO
403 LOL
403 THANKS
381 NICE TO MEET YOU TOO
375 SORRY
374 ALICE
368 HI ALICE
366 OKAY
353 WELL
352 WHAT IS MY NAME
349 WHERE DO YOU LIVE
340 NOTHING

Why Do We Want to Predict Words?


Speech recognition
Handwriting recognition/OCR
Spelling correction
Statistical machine translation

Predicting a Word Sequence


The probability of "The cat is on the mat" is
P(the cat is on the mat) = P(the | <s>) P(cat | <s> the) P(is | <s> the cat) P(on | <s> the cat is) P(the | <s> the cat is on) P(mat | <s> the cat is on the) P(</s> | <s> the cat is on the mat) where the tags <s> and </s> indicate beginning and end of the sentence.

But that is not a practical solution. Instead, conditioning on only the two previous tokens:
P(the cat is on the mat) = P(the | <s>) P(cat | <s> the) P(is | the cat) P(on | cat is) P(the | is on) P(mat | on the) P(</s> | the mat)
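A sketch of how such a factored probability could be computed, assuming a hypothetical cond_prob(word, context) function that returns smoothed, nonzero conditional probabilities (not part of the original slides); logs are used to avoid numerical underflow:

```python
import math

def sentence_logprob(words, cond_prob, order=3):
    """Log-probability of a sentence under an n-gram model of the given order.
    cond_prob(word, context) is assumed to return P(word | context), where
    context is a tuple of the (order - 1) previous tokens. The sentence is
    padded with start symbols <s> and terminated with </s>."""
    padded = ["<s>"] * (order - 1) + list(words) + ["</s>"]
    logp = 0.0
    for i in range(order - 1, len(padded)):
        context = tuple(padded[i - order + 1:i])
        logp += math.log(cond_prob(padded[i], context))
    return logp
```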

N-grams
Approximating reality: let V be the number of words in the lexicon and T be the number of tokens in a training corpus
P(wk = W) = 1/V
P(wk = W) = count(W) / T (word frequencies)
P(wk = W1 | wk-1 = W0) = c(W0 W1) / c(W0) (bigrams)

Abbreviating P(wk = W1 | wk-1 = W0) to P(W1|W0). For example P(rabbit | the).


P(Wn | Wn-2 Wn-1) = c(Wn-2 Wn-1 Wn) / c(Wn-2 Wn-1) (trigrams)
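A minimal sketch of computing these maximum-likelihood bigram estimates from a token list (whitespace-tokenized text is assumed; trigrams are analogous):

```python
from collections import Counter

def mle_bigram_probs(tokens):
    """MLE bigram estimates: P(w1 | w0) = c(w0 w1) / c(w0)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {(w0, w1): c / unigrams[w0] for (w0, w1), c in bigrams.items()}

# Example: mle_bigram_probs("the cat is on the mat".split())
# gives P(cat | the) = 0.5, P(mat | the) = 0.5, P(is | cat) = 1.0, ...
```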

Building n-gram Models


1. Data preparation:
Decide on a training corpus
Remove punctuation
Sentence breaks: keep them or throw them away?

2. Create equivalence classes and get counts on training data falling into each class.
3. Find statistical estimators.

Problems with n-grams


Sue swallowed the large green ______ . Pill, frog, car, mountain, tree?
Knowing that "Sue swallowed" helps narrow down the possibilities. How far back do we look?

For a vocabulary of 20,000 words, the number of bigrams = 400 million, the number of trigrams = 8 trillion, and the number of four-grams = 1.6 × 10^17! Data sparseness is the result.
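A quick arithmetic check of these figures:

```python
V = 20_000
print(V ** 2)  # 400,000,000 bigrams        (4.0e8)
print(V ** 3)  # 8,000,000,000,000 trigrams (8.0e12)
print(V ** 4)  # 1.6e17 four-grams
```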

Bigram Example

Statistical Estimation Methods


Maximum Likelihood Estimation (MLE)
Smoothing:
Laplace's Law
Good-Turing Estimation

Combining estimators:
Linear interpolation
Backing off

Maximum Likelihood Estimation (MLE)


MLE is the choice of parameter values that gives the highest probability to the training corpus.

PMLE(w1,...,wn) = C(w1,...,wn) / N, where C(w1,...,wn) is the frequency of the n-gram w1,...,wn and N is the number of training instances.
PMLE(wn | w1,...,wn-1) = C(w1,...,wn) / C(w1,...,wn-1)

MLE: Problems
Problem of sparseness of data: the vast majority of words are very uncommon (Zipf's Law). Some bins may remain empty or contain too little data. MLE assigns 0 probability to unseen events.
We need to allow for possibility of seeing events not seen in training.

Smoothing
Examples: In some specific corpus, "to want" doesn't occur. But it could: "I'm going to want to eat lunch at 1." The words knit, purl, quilt, and bobcat are missing from our list of the top 10,000 words in a newswire corpus. In Alice's Adventures in Wonderland, the words half and sister both occur, but the bigram "half sister" does not. But this does not mean that the probability of encountering "half sister" in some new text is 0.

Laplace's Law (1814-1995)


Adding-one process: PLAP(w1,...,wn) = (C(w1,...,wn) + 1) / (N + B),
where

C(w1,..,wn) = frequency of n-gram w1,..,wn


B is the number of values the target feature can take on: B = V (the vocabulary size) for n = 1, and B = V^n for n > 1. So, for bigrams, there are V^2 possibilities.

Gives a little bit of the probability space to unseen events. This is the Bayesian estimator assuming a uniform prior on events.
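A sketch of the conditional (bigram) form of add-one smoothing, which follows from adding one to every bigram count; the count dictionaries are assumed to come from a training corpus:

```python
def laplace_bigram_prob(w0, w1, bigram_counts, unigram_counts, vocab_size):
    """Add-one (Laplace) smoothed conditional bigram estimate:
    P(w1 | w0) = (c(w0 w1) + 1) / (c(w0) + V).
    Unseen bigrams receive a small, nonzero probability."""
    numerator = bigram_counts.get((w0, w1), 0) + 1
    denominator = unigram_counts.get(w0, 0) + vocab_size
    return numerator / denominator
```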

Laplace's Law: Problems


For sparse sets of data over large vocabularies, it assigns too much of the probability space to unseen events.

Good-Turing Estimation
Use counts of more frequent n-grams to estimate less frequent n-grams: PGT = r*/N, where r* can be thought of as an adjusted frequency given by r* = (r+1) E(N_{r+1}) / E(N_r).
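A sketch of computing the adjusted counts r* from a table of n-gram counts; as noted in the comment, a practical implementation would also smooth the N_r values:

```python
from collections import Counter

def good_turing_adjusted_counts(ngram_counts):
    """Good-Turing adjusted frequencies r* = (r + 1) * N_{r+1} / N_r,
    where N_r is the number of n-grams seen exactly r times.
    Returns a dict mapping each observed count r to r*.
    (A real implementation would smooth the N_r values, since N_{r+1}
    is often zero for large r.)"""
    freq_of_freqs = Counter(ngram_counts.values())  # N_r
    adjusted = {}
    for r, n_r in freq_of_freqs.items():
        n_r_plus_1 = freq_of_freqs.get(r + 1, 0)
        adjusted[r] = (r + 1) * n_r_plus_1 / n_r
    return adjusted
```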

Combining Estimators
Combining multiple probability estimates from various different models:
Simple linear interpolation
Katz's backing-off

Simple Linear Interpolation


Solve the sparseness in a trigram model by mixing with bigram and unigram models. Combine linearly (termed linear interpolation, finite mixture models, or deleted interpolation): Pli(wn | wn-2, wn-1) = λ1 P1(wn) + λ2 P2(wn | wn-1) + λ3 P3(wn | wn-2, wn-1), where 0 ≤ λi ≤ 1 and Σi λi = 1.
Weights can be picked automatically using Expectation Maximization.
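A minimal sketch of the interpolation itself, with hypothetical p_uni, p_bi, and p_tri estimator functions and illustrative (not EM-trained) weights:

```python
def interpolated_prob(w, w1, w2, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """Simple linear interpolation of unigram, bigram, and trigram estimates:
    P_li(w | w2, w1) = l1*P1(w) + l2*P2(w | w1) + l3*P3(w | w2, w1),
    with the weights summing to 1. In practice the weights would be tuned
    on held-out data, e.g. with Expectation Maximization."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    return l1 * p_uni(w) + l2 * p_bi(w, w1) + l3 * p_tri(w, w1, w2)
```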

Katz's Backing-Off
Different models are consulted depending on their specificity:
1. Use the n-gram probability when the n-gram has appeared more than k times (k usually 0 or 1).
2. If not, back off to the (n-1)-gram probability.
Repeat as necessary.
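A simplified sketch in the spirit of backing off; note that true Katz back-off also discounts the higher-order estimates (e.g. with Good-Turing) so that the probabilities still normalize, which is omitted here. The counts dict is assumed to map tuples of all orders to their training counts:

```python
def backoff_prob(ngram, counts, k=0):
    """Use the n-gram relative frequency if the n-gram was seen more than
    k times; otherwise recursively fall back to the (n-1)-gram."""
    if len(ngram) == 1:
        total = sum(c for ng, c in counts.items() if len(ng) == 1)
        return counts.get(ngram, 0) / total
    history = ngram[:-1]
    if counts.get(ngram, 0) > k and counts.get(history, 0) > 0:
        return counts[ngram] / counts[history]
    return backoff_prob(ngram[1:], counts, k)
```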

Fun Links
N-gram search engine: http://nlp.cs.nyu.edu/nsearch/
http://xkcd.com/798/

Take home message


Language is not random. Context (letter/word/sentence history) is used by humans, and also for automatic processing.

An Overview of Microsoft Web N-gram Corpus and Applications


Kuansan Wang, Christopher Thrasher, Evelyne Viegas, Xiaolong Li, Bo-june (Paul) Hsu

NAACL-HLT 2010

The Microsoft corpus


The effectiveness of statistical natural language processing is highly susceptible to the size of the data used to develop it.
English Giga-word corpus (Graff and Cieri, 2003)
1-tera-word Google N-gram corpus, created from the WWW

But modeling document content is not enough: it neglects anchor text, short messages from social network applications that summarize the document, etc. Different text streams have significantly different properties. The Microsoft Web N-gram corpus processes material from the body, title, and anchor text separately.

General Model Information


Based on Web documents indexed by a commercial search engine (Bing)
Spam and low-quality web pages excluded using dedicated algorithms
The various sections parsed, tokenized, lower-cased, and punctuation marks removed
No stemming, spelling corrections, or inflections
Provides smoothed back-off N-gram models
Live updates from the Web
Number of tokens (as of June 30, 2009): Body 1.4 trillion, Title 12.5 billion, Anchor 357 billion

Demo: Search Query Segmentation


Given: a search query. Goal: word breaking. A query of T characters has 2^(T-1) segmentation hypotheses.
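One standard way to pick the best of these hypotheses is dynamic programming over a word-level language model; the sketch below (not from the paper) assumes a hypothetical word_logprob function returning log P(word):

```python
def best_segmentation(query, word_logprob, max_word_len=20):
    """Dynamic-programming word breaking: among the 2^(T-1) possible
    segmentations of a T-character query, return the one whose words have
    the highest total unigram log-probability."""
    T = len(query)
    best = [(-float("inf"), None)] * (T + 1)   # (score, backpointer) per prefix
    best[0] = (0.0, None)
    for end in range(1, T + 1):
        for start in range(max(0, end - max_word_len), end):
            score = best[start][0] + word_logprob(query[start:end])
            if score > best[end][0]:
                best[end] = (score, start)
    # Backtrack to recover the word boundaries.
    words, end = [], T
    while end > 0:
        start = best[end][1]
        words.append(query[start:end])
        end = start
    return list(reversed(words))
```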

Papers for next week


Memory-Based Context-Sensitive Spelling Correction at Web Scale, 2007
Improving Email Speech Acts Analysis via N-gram Selection, 2006

What makes a good presentation


Motivation is clear. What is the problem/task?
Engaging: examples, illustrative figures (if possible)
Model is clear at a high level (don't drown in the formulas)
Experiments:
Settings
Evaluation measures
Results and their meaning

Related work: relatively brief; what is novel about this work?
Conclusions and future work
What did you think about the work?
