
Natural Language Processing:
N-Gram Language Models

Language Models
 Formal grammars (e.g. regular, context
free) give a hard “binary” model of the
legal sentences in a language.
 For NLP, a probabilistic model of a
language that gives a probability that a
string is a member of a language is more
useful.
 To specify a correct probability
distribution, the probability of all
sentences in a language must sum to 1.
Uses of Language Models
 Speech recognition
 “I ate a cherry” is a more likely sentence than “Eye eight
uh Jerry”
 OCR & Handwriting recognition
 More probable sentences are more likely correct readings.
 Machine translation
 More likely sentences are probably better translations.
 Generation
 More likely sentences are probably better NL generations.
 Context sensitive spelling correction
 “Their are problems wit this sentence.”
Completion Prediction

 A language model also supports predicting the completion of a sentence.
 Please turn off your cell _____
 Your program does not ______
 Predictive text input systems can guess
what you are typing and give choices on
how to complete it.
N-Gram Models
 Estimate probability of each word given prior context.
 P(phone | Please turn off your cell)
 Number of parameters required grows exponentially
with the number of words of prior context.
 An N-gram model uses only N−1 words of prior context.
 Unigram: P(phone)
 Bigram: P(phone | cell)
 Trigram: P(phone | your cell)
 The Markov assumption is the presumption that the
future behavior of a dynamical system depends only on
its recent history. In particular, in a k-th-order
Markov model, the next state depends only on the k
most recent states; therefore an N-gram model is an
(N−1)-th-order Markov model.
N-Gram Model Formulas
 Word sequences:
   w_1^n = w_1 ... w_n
 Chain rule of probability:
   P(w_1^n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1^2) ... P(w_n | w_1^{n-1})
            = Π_{k=1..n} P(w_k | w_1^{k-1})
 Bigram approximation:
   P(w_1^n) ≈ Π_{k=1..n} P(w_k | w_{k-1})
 N-gram approximation:
   P(w_1^n) ≈ Π_{k=1..n} P(w_k | w_{k-N+1}^{k-1})
Estimating Probabilities
 N-gram conditional probabilities can be
estimated from raw text based on the relative
frequency of word sequences.
   Bigram:  P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
   N-gram:  P(w_n | w_{n-N+1}^{n-1}) = C(w_{n-N+1}^{n-1} w_n) / C(w_{n-N+1}^{n-1})
 To have a consistent probabilistic model, append
a unique start (<s>) and end (</s>) symbol to
every sentence and treat these as additional
words.
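
A minimal sketch of this relative-frequency estimation for bigrams (Python; the function and variable names are illustrative, and sentences are assumed to be pre-tokenized):

    from collections import Counter

    def train_bigram_mle(sentences):
        # Estimate P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}) from tokenized sentences.
        unigram_counts, bigram_counts = Counter(), Counter()
        for sent in sentences:
            padded = ["<s>"] + sent + ["</s>"]
            unigram_counts.update(padded[:-1])             # every token serves as a context except </s>
            bigram_counts.update(zip(padded, padded[1:]))
        return {(prev, w): c / unigram_counts[prev]
                for (prev, w), c in bigram_counts.items()}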
Word Prediction
 Guess the next word...
 ... I notice three guys standing on the ???
 There are many sources of knowledge
that could be helpful, including world
knowledge.
 But we can do pretty well by simply
looking at the preceding words and
keeping track of simple counts.
Word Prediction
 Formalize this task using N-gram models.
 N-grams are token sequences of length N.
 Given N-grams counts, we can guess likely
next words in a sequence.

N-Gram Models
 More formally, we will assess the
conditional probability of candidate words.
 Also, we will assess the probability of an
entire sequence of words.
 Being able to predict the next word (or
any linguistic unit) in a sequence is very
common in many applications.

Applications
 Word prediction lies at the core of the following applications
 Automatic speech recognition
 Handwriting and character recognition
 Spelling correction
 Machine translation
 Augmentative communication
 Word similarity, generation, POS tagging, etc.

Counting
 He stepped out into the hall, was delighted to
encounter a water brother.
 13 tokens, 15 if we include “,” and “.” as separate
tokens.
 Assuming we include the comma and period, how
many bigrams are there?
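
A quick check of these counts, treating the comma and period as separate tokens (the pre-split string below stands in for a real tokenizer):

    tokens = ("He stepped out into the hall , was delighted to "
              "encounter a water brother .").split()
    print(len(tokens))                         # 15 tokens
    print(len(list(zip(tokens, tokens[1:]))))  # 14 bigrams (no <s>/</s> padding)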

Counting
 Not always that simple
 I do uh main- mainly business data processing
 Spoken language poses various challenges.
 Should we count “uh” and other fillers as tokens?
 What about the repetition of “mainly”? Should such do-
overs count twice or just once?
 The answers depend on the application.
 If we’re focusing on something like Automatic Speech
Recognition to support indexing for search, then “uh” isn’t
helpful (it’s not likely to occur as a query).
 But filled pauses are very useful in dialog management, so we
might want them there.

Counting: Types and Tokens
 They picnicked by the pool, then lay back
on the grass and looked at the stars.
 18 tokens (again counting punctuation)
 only 16 types
 In going forward, we’ll sometimes count
types and sometimes tokens.
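
The same kind of quick check for types versus tokens (again using a pre-split string in place of a tokenizer):

    tokens = ("They picnicked by the pool , then lay back on the "
              "grass and looked at the stars .").split()
    print(len(tokens))       # 18 tokens
    print(len(set(tokens)))  # 16 types ("the" occurs three times)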

Counting: Wordforms
 Should “cats” and “cat” count as the
same?
 How about “geese” and “goose”?
 Lemma: cats & cat and geese & goose have
the same lemmas
 Wordform: fully inflected surface form
 Again, we’ll have occasion to count both
lemmas and wordforms

Counting: Corpora
 Corpus: body of text
 Google
 Crawl of 1,024,908,267,229 English tokens
 13,588,391 wordform types
 That seems like a lot of types... Even large dictionaries of English have only
around 500k types. Why so many here?

•Numbers
•Misspellings
•Names
•Acronyms
•etc

Language Modeling
 To perform word prediction:
 Assess the conditional probability of a
word given the previous words in the
sequence
 P(w_n | w_1, w_2, ..., w_{n-1}).
 We’ll call a statistical model that can
assess this a Language Model.

Language Modeling
 One way to calculate them is to use the
definition of conditional probabilities, and
get the counts from a large corpus.
 Counts are used to estimate the probabilities.

Language Modeling
 Unfortunately, for most sequences and for
most text collections we won’t get good
estimates from this method.

 We’ll use the chain rule of probability


and an independence assumption.

The Chain Rule

P(its water was so transparent)=


P(its)*
P(water|its)*
P(was|its water)*
P(so|its water was)*
P(transparent|its water was so)

 So, that’s the chain rule
 We still have the problem of estimating
probabilities of word sequences from counts;
we’ll never have enough data.
 The other thing we mentioned (two slides
back) is an independence assumption
 What type of assumption would make sense?

Independence Assumption
 Make the simplifying assumption
 P(lizard|
the,other,day,I,was,walking,along,and,saw,a)
= P(lizard|a)
 Or maybe
 P(lizard|
the,other,day,I,was,walking,along,and,saw,a)
= P(lizard|saw,a)

Independence Assumption
 This kind of independence assumption is called a
Markov assumption after the Russian
mathematician Andrei Markov.

Markov Assumption

For each component in the product, replace it with the approximation (assuming a prefix of N):

   P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-N+1}^{n-1})

Bigram version:

   P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-1})

Estimating Bigram
Probabilities
 The Maximum Likelihood Estimate (MLE)

count(w i1,w i )
P(w i | w i1 ) 
count(w i1 )

An Example
 <s> I am Sam </s>
 <s> Sam I am </s>
 <s> I do not like green eggs and ham </s>
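
Running a relative-frequency estimator like the train_bigram_mle sketch given earlier on this tiny corpus reproduces the familiar estimates, e.g. P(I|<s>) = 2/3, P(am|I) = 2/3, P(</s>|Sam) = 1/2:

    corpus = [
        "I am Sam".split(),
        "Sam I am".split(),
        "I do not like green eggs and ham".split(),
    ]
    probs = train_bigram_mle(corpus)   # estimator sketched on the earlier slide
    print(probs[("<s>", "I")])         # 2/3 ≈ 0.667
    print(probs[("I", "am")])          # 2/3 ≈ 0.667
    print(probs[("Sam", "</s>")])      # 1/2 = 0.5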

Maximum Likelihood Estimates
 The maximum likelihood estimate of some parameter of
a model M from a training set T
 Is the estimate that maximizes the likelihood of the training set
T given the model M
 Suppose the word Chinese occurs 400 times in a corpus
of a million words (Brown corpus)
 What is the probability that a random word from some
other text from the same distribution will be “Chinese”?
 MLE estimate is 400/1000000 = .0004
 This may be a bad estimate for some other corpus
 (We’ll return to MLEs later – we’ll do smoothing to get
estimates for low-frequency or 0 counts. Even with our
independence assumptions, we still have this problem.)

Berkeley Restaurant Project
Sentences
 can you tell me about any good cantonese restaurants
close by
 mid priced thai food is what i’m looking for
 tell me about chez panisse
 can you give me a listing of the kinds of food that are
available
 i’m looking for a good place to eat breakfast
 when is caffe venezia open during the day

Bigram Counts
 Out of 9222 sentences
 E.g., “I want” occurred 827 times

Bigram Probabilities
 Divide bigram counts by prefix unigram
counts to get probabilities.

Bigram Estimates of Sentence
Probabilities
 P(<s> I want english food </s>) =
P(i|<s>)*
P(want|I)*
P(english|want)*
P(food|english)*
P(</s>|food)
= .000031
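
As a rough check, plugging in the Berkeley Restaurant Project bigram estimates reported by Jurafsky and Martin (approximately P(i|<s>) = .25, P(want|i) = .33, P(english|want) = .0011, P(food|english) = .5, P(</s>|food) = .68):

    p = 0.25 * 0.33 * 0.0011 * 0.5 * 0.68
    print(p)   # ≈ 3.1e-05, i.e. .000031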

Kinds of Knowledge
 As crude as they are, N-gram probabilities capture a
range of interesting facts about language.

 P(english|want) = .0011
 P(chinese|want) = .0065         (world knowledge)
 P(to|want) = .66
 P(eat | to) = .28
 P(food | to) = 0
 P(want | spend) = 0             (syntax)
 P(i | <s>) = .25                (discourse)

Shannon’s Method
 Let’s turn the model around and use it to
generate random sentences that are like
the sentences from which the model was
derived.
 Illustrates:
 Dependence of N-gram models on the specific
training corpus
 Greater N → better model
 Generally attributed to
Claude Shannon.
Shannon’s Method
 Sample a random bigram (<s>, w) according to its probability
 Now sample a random bigram (w, x) according to its probability
 And so on until we randomly choose a (y, </s>)
 Then string the words together
 <s> I
I want
want to
to eat
eat Chinese
Chinese food
food </s>
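
A minimal sketch of this sampling procedure (Python; assumes a bigram probability dict keyed by (previous_word, word), such as the one estimated earlier):

    import random

    def generate(bigram_probs, max_len=20):
        # Sample a sentence word by word from a bigram model, starting at <s>.
        sentence, prev = [], "<s>"
        for _ in range(max_len):
            # Distribution over possible next words given the previous word.
            candidates = [(w, p) for (ctx, w), p in bigram_probs.items() if ctx == prev]
            words, weights = zip(*candidates)
            nxt = random.choices(words, weights=weights)[0]
            if nxt == "</s>":
                break
            sentence.append(nxt)
            prev = nxt
        return " ".join(sentence)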

Shakespeare
 [Figure omitted: sentences randomly generated from unigram, bigram, trigram, and quadrigram models trained on Shakespeare]

Shakespeare as a Corpus
 N=884,647 tokens, V=29,066
 Shakespeare produced 300,000 bigram types
out of V² ≈ 844 million possible bigrams...
 So, 99.96% of the possible bigrams were never seen
(have zero entries in the table)
 This is the biggest problem in language modeling;
we’ll come back to it.
 Quadrigrams are worse: What's coming out
looks like Shakespeare because it is
Shakespeare

The Wall Street Journal is Not Shakespeare
 [Figure omitted: sentences randomly generated from N-gram models trained on the Wall Street Journal]

A bad language model
(thanks to Joshua Goodman)

Evaluation
 How do we know if one model is better
than another?
 Shannon’s game gives us an intuition.
 The generated texts from the higher order
models sound more like the text the model
was obtained from.

Evaluation
 Standard method
 Train parameters of our model on a training set.
 Look at the model’s performance on some new
data: a test set, a dataset that is different from
our training set but is drawn from the same source.
 Then we need an evaluation metric to tell us how
well our model is doing on the test set.
 One such metric is perplexity (to be introduced below)

Perplexity
 Perplexity is the inverse probability of the test set (assigned by the language model), normalized by the number of words:

   PP(W) = P(w_1 w_2 ... w_N)^(-1/N)

 Chain rule:

   PP(W) = ( Π_{i=1..N} 1 / P(w_i | w_1 ... w_{i-1}) )^(1/N)

 For bigrams:

   PP(W) = ( Π_{i=1..N} 1 / P(w_i | w_{i-1}) )^(1/N)

 Minimizing perplexity is the same as maximizing probability
 The best language model is one that best predicts an unseen test set
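
A minimal sketch of computing bigram perplexity over a test set (Python; assumes a bigram probability dict like the one estimated earlier, and works in log space to avoid underflow). Any unseen bigram drives the perplexity to infinity under unsmoothed MLE, which is exactly the zero-count problem smoothing addresses:

    import math

    def bigram_perplexity(test_sentences, bigram_probs):
        # PP = exp( -(1/N) * sum of log P(w_i | w_{i-1}) ) over the test set.
        log_prob, n = 0.0, 0
        for sent in test_sentences:
            padded = ["<s>"] + sent + ["</s>"]
            for prev, w in zip(padded, padded[1:]):
                p = bigram_probs.get((prev, w), 0.0)
                if p == 0.0:
                    return float("inf")   # unseen bigram: unsmoothed MLE assigns zero probability
                log_prob += math.log(p)
                n += 1
        return math.exp(-log_prob / n)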
Lower perplexity means a
better model
 Training 38 million words, test 1.5 million words, WSJ
 Jurafsky and Martin report perplexities of roughly 962 for a unigram model, 170 for a bigram model, and 109 for a trigram model in this setup.

Evaluating N-Gram Models
 Best evaluation for a language model
 Put model A into an application
 For example, a speech recognizer
 Evaluate the performance of the
application with model A
 Put model B into the application and
evaluate
 Compare performance of the application
with the two models
 Extrinsic evaluation
Difficulty of extrinsic (in-vivo)
evaluation of N-gram models
 Extrinsic evaluation
 This is really time-consuming
 Can take days to run an experiment
 So
 As a temporary solution, in order to run experiments
 To evaluate N-grams we often use an intrinsic
evaluation, an approximation called perplexity
 But perplexity is a poor approximation unless the test
data looks just like the training data
 So is generally only useful in pilot experiments
(generally is not sufficient to publish)
 But is helpful to think about.
