Chapter 6: Ambiguity Resolution: Statistical Methods
6.2 Estimating Probabilities
- For instance, we use information on Harry's past performance (his record over 100 previous races) to predict how likely he is to win his 101st race.
- We are interested in parsing sentences that have never been seen before, so we must use data on previously occurring sentences to predict the next sentences.
- We will always be working with estimates of probability rather
than actual probability.
If we have seen the word flies 1000 times before, and 600 of those occurrences were as a verb, we assume that PROB(V | flies) = 0.6.
Sparse data
There are a vast number of estimates needed for natural language applications, and a large proportion of these events are quite rare. This is the problem of sparse data.
For instance, the Brown corpus contains about a million words, but because of duplication there are only about 49,000 different words, and 40,000 of those words occur five times or fewer.
The worst case occurs when a low-frequency word does not occur at all in one of its possible categories.
Its probability in that category would then be estimated as 0, and the probability of any sentence containing the word used in that category would also be 0.
Other techniques attempt to address this problem of estimating the probabilities of low-frequency events.
For a random variable X that can take values x1, …, xM, the estimation techniques all start from a set of values Vi computed from |xi|, the count of the number of times X = xi in the corpus.
The maximum likelihood estimator (MLE) uses Vi = |xi| directly, so that
PROB(X = xi) ≈ Vi / Σj Vj.
One technique for avoiding zero probabilities is to make sure that no Vi has the value 0 by using Vi = |xi| + 0.5, that is, 0.5 is added to every count.
This estimation technique is called the expected likelihood estimator (ELE).
The difference between MLE and ELE:
For instance, consider a word w that does not occur in the corpus, and estimate the probability that w occurs in each of 40 word classes L1 … L40. Compare the estimates produced by MLE and ELE (a sketch follows).
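As a rough illustration of this comparison, here is a minimal Python sketch; the 40 class labels and the all-zero counts are the only inputs, and the helper names are made up for this example:

def compare_mle_ele():
    # Minimal sketch comparing MLE and ELE for a word w that was never
    # observed in the corpus. The 40 word classes are illustrative.
    NUM_CLASSES = 40
    counts = {f"L{i}": 0 for i in range(1, NUM_CLASSES + 1)}  # w never seen in any class

    def mle(counts):
        """Maximum likelihood estimate: V_i = |x_i|."""
        total = sum(counts.values())
        if total == 0:
            return {c: 0.0 for c in counts}      # all zero: the sparse-data problem
        return {c: n / total for c, n in counts.items()}

    def ele(counts):
        """Expected likelihood estimate: V_i = |x_i| + 0.5."""
        values = {c: n + 0.5 for c, n in counts.items()}
        total = sum(values.values())
        return {c: v / total for c, v in values.items()}

    print(mle(counts)["L1"])   # 0.0   -> every category gets probability 0
    print(ele(counts)["L1"])   # 0.025 -> each of the 40 categories gets 0.5/20

compare_mle_ele()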
Evaluation
We want to know how well a new technique performs compared with other algorithms, or with variants of our own algorithm.
The general method for doing this is to divide the corpus into two parts, the training set and the test set. The test set typically consists of 10 to 20% of the total data.
- The training set is used to estimate the probabilities, and
- the algorithm is then run on the test set to see how well it does on new data.
- A more thorough method of testing is called cross-validation: repeatedly remove different parts of the corpus as the test set, train on the remainder of the corpus, and then evaluate on the held-out part (sketched below).
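A minimal sketch of k-fold cross-validation, assuming hypothetical train_model and evaluate functions standing in for whatever estimator and scoring procedure are being tested:

def cross_validate(corpus, k, train_model, evaluate):
    """Split the corpus into k folds; train on k-1 folds, test on the rest."""
    fold_size = len(corpus) // k          # any remainder simply stays in training
    scores = []
    for i in range(k):
        test = corpus[i * fold_size : (i + 1) * fold_size]
        train = corpus[: i * fold_size] + corpus[(i + 1) * fold_size :]
        model = train_model(train)        # e.g. estimate lexical/bigram probabilities
        scores.append(evaluate(model, test))
    return sum(scores) / k                # average score over the k folds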
6.3 Part-of-Speech Tagging
- Part-of-speech tagging involves selecting the most likely sequence of syntactic categories for the words in a sentence.
- A typical set of tags, used in the Penn Treebank project, is shown in Figure 6.3.
- The general method to improve reliability is to use some of the
local context of the sentence in which the word appears.
- For instance, if the word is preceded by the word the, it is much more likely to be a noun (N). In this section, we develop a technique that exploits such local information.
Let W1, …, WT be a sequence of words. We want to find the sequence of lexical categories C1, …, CT that maximizes
Figure 6.3 The Penn Treebank tagset
1) PROB(C1, …, CT | W1, …, WT)
We attack this problem using Bayes' rule, which says that this conditional probability equals
2) (PROB(C1, …, CT) * PROB(W1, …, WT | C1, …, CT)) / PROB(W1, …, WT)
Since we are looking for the C1, …, CT that gives the maximum value, the common denominator PROB(W1, …, WT) is the same in all cases and will not affect the answer. Thus the problem reduces to finding the C1, …, CT that maximizes the formula
3) PROB(C1, …, CT) * PROB(W1, …, WT | C1, …, CT)
There are still no effective methods for computing the probabilities of such long sequences accurately, as doing so would require far too much data.
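The standard way around this, which the rest of this section follows and which Figure 6.8 exploits as the Markov assumption, is to approximate both probabilities with independence assumptions. A compact sketch of the usual bigram approximation, in terms of the bigram and lexical probabilities used in the Viterbi algorithm below:

\[
\text{PROB}(C_1,\dots,C_T) \approx \prod_{i=1}^{T} \text{PROB}(C_i \mid C_{i-1}),
\qquad
\text{PROB}(W_1,\dots,W_T \mid C_1,\dots,C_T) \approx \prod_{i=1}^{T} \text{PROB}(W_i \mid C_i)
\]

with \(C_0\) taken to be the start-of-sentence marker \(\varphi\), so the quantity to maximize becomes \(\prod_{i=1}^{T} \text{PROB}(C_i \mid C_{i-1})\,\text{PROB}(W_i \mid C_i)\).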
To deal with the problem of sparse data, any bigram not listed in Figure 6.4 is assumed to have a small token probability of 0.0001.
(Table columns: category, count at i, pair, count at i,i+1, bigram estimate)
Figure 6.8 Encoding the 256 possible sequences exploiting the Markov assumption
The Viterbi algorithm
Given a word sequence w1 … wT, lexical categories L1 … LN, lexical probabilities PROB(wi | Li), and bigram probabilities PROB(Li | Li-1), find the most likely sequence of lexical categories C1 … CT for the word sequence.
Initialization step:
For i := 1 to N do
  SEQSCORE(i, 1) = PROB(w1 | Li) * PROB(Li | φ)
  BACKPTR(i, 1) = 0
(φ denotes the start of the sentence.)
The Viterbi algorithm (cont.)
Iteration step:
For t := 2 to T do
  For i := 1 to N do
    SEQSCORE(i, t) = MAX over j = 1..N of (SEQSCORE(j, t-1) * PROB(Li | Lj)) * PROB(wt | Li)
    BACKPTR(i, t) = the index j that gave the maximum above
Sequence identification step:
C(T) = the i that maximizes SEQSCORE(i, T)
For i := T-1 down to 1 do
  C(i) = BACKPTR(C(i+1), i+1)
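A minimal Python sketch of the algorithm above, assuming the lexical probabilities PROB(w | L) and bigram probabilities PROB(L | Lprev) are supplied as dictionaries; the 'phi' start marker and the 0.0001 fallback for unlisted bigrams follow the conventions used earlier in this section:

def viterbi(words, categories, lexical, bigram, unseen=0.0001):
    # lexical[(w, L)] stands for PROB(w | L); bigram[(L, Lprev)] for PROB(L | Lprev);
    # 'phi' marks the start of the sentence.
    T, N = len(words), len(categories)
    seqscore = [[0.0] * T for _ in range(N)]
    backptr = [[0] * T for _ in range(N)]

    # Initialization step
    for i, L in enumerate(categories):
        seqscore[i][0] = lexical.get((words[0], L), 0.0) * bigram.get((L, 'phi'), unseen)

    # Iteration step
    for t in range(1, T):
        for i, L in enumerate(categories):
            best_j, best = max(
                ((j, seqscore[j][t - 1] * bigram.get((L, categories[j]), unseen))
                 for j in range(N)),
                key=lambda pair: pair[1])
            seqscore[i][t] = best * lexical.get((words[t], L), 0.0)
            backptr[i][t] = best_j

    # Sequence identification step
    tags = [0] * T
    tags[T - 1] = max(range(N), key=lambda i: seqscore[i][T - 1])
    for t in range(T - 2, -1, -1):
        tags[t] = backptr[tags[t + 1]][t + 1]
    return [categories[i] for i in tags]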
Example: Using the Viterbi algorithm, compute the probability of the best category sequence for W = flies like a flower, with L = {V, N, ART, P}.
For t = 3 (the word a), i = 1 to 4:
SEQ(1,3) = max over j=1..4 of (SEQ(1,2)*PROB(V|V), SEQ(2,2)*PROB(V|N), SEQ(4,2)*PROB(V|P)) * PROB(a|V)
         = max(3.1×10^-4 × 10^-4, 1.13×10^-5 × 0.43, 2.2×10^-4 × 10^-4) × 0 = 0
SEQ(2,3) = max over j=1..4 of (SEQ(1,2)*PROB(N|V), SEQ(2,2)*PROB(N|N), SEQ(4,2)*PROB(N|P)) * PROB(a|N)
         = max(3.1×10^-4 × 0.35, 1.13×10^-5 × 0.13, 2.2×10^-4 × 0.26) × 0.01 = 5.7×10^-7
SEQ(3,3) = max over j=1..4 of (SEQ(1,2)*PROB(ART|V), SEQ(2,2)*PROB(ART|N), SEQ(4,2)*PROB(ART|P)) * PROB(a|ART)
         = max(3.1×10^-4 × 0.65, 1.13×10^-5 × 10^-4, 2.2×10^-4 × 0.74) × 0.36 = 2.01×10^-5
SEQ(4,3) = max over j=1..4 of (SEQ(1,2)*PROB(P|V), SEQ(2,2)*PROB(P|N), SEQ(4,2)*PROB(P|P)) * PROB(a|P)
         = max(3.1×10^-4 × 10^-4, 1.13×10^-5 × 0.44, 2.2×10^-4 × 10^-4) × 0 = 0
Figure 6.10 The forward algorithm for computing the lexical probabilities
6.4 Obtaining Lexical Probabilities
6.5 Language Modeling: Introduction
Zeros
6.6 Applications for Spelling Correction
• Word processing
• Phones
• Web search
Spelling Tasks
Non-word spelling errors
• Non-word spelling error detection:
• Any word not in a dictionary is an error
• The larger the dictionary the better
• Non-word spelling error correction:
• Generate candidates: real words that are similar to error
• Choose the one which is best:
• Shortest weighted edit distance
• Highest noisy channel probability
Non-word spelling error example: acress
Non-word spelling errors (cont.)
• Using a bigram language model to score candidate corrections in context (see the sketch below)
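A rough sketch of how these pieces fit together under the noisy-channel view; the dictionary, channel (error) probabilities, and bigram probabilities below are hypothetical placeholders that a real system would estimate from corpora and observed typos:

def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def correct(word, prev_word, dictionary, channel_prob, bigram_prob):
    """Pick the in-dictionary candidate maximizing P(typo | cand) * P(cand | prev)."""
    candidates = [c for c in edits1(word) if c in dictionary] or [word]
    return max(candidates,
               key=lambda c: channel_prob.get((word, c), 1e-6) *
                             bigram_prob.get((c, prev_word), 1e-6))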
6.7 Probabilistic Context-Free Grammars
Figure 6.15 The full chart for a flower as an NP
Machine translation
The idea behind statistical machine translation comes from information theory.
- A document is translated according to the probability distribution p(e|f) that a string e in the target language (for example, English) is the translation of a string f in the source language (for example, French).
- The problem of modeling the probability distribution p(e|f) has been approached in a number of ways. One approach which lends itself well to computer implementation is to apply Bayes' theorem, that is, p(e|f) ∝ p(f|e)p(e), where the translation model p(f|e) is the probability that the source string is the translation of the target string, and the language model p(e) is the probability of seeing that target-language string. This decomposition is attractive as it splits the problem into two subproblems. Finding the best translation ẽ is done by picking the one that gives the highest probability:
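Written out, this is the usual noisy-channel decoding rule, a compact statement of the maximization just described:

\[
\tilde{e} = \operatorname*{arg\,max}_{e} \; p(e \mid f) = \operatorname*{arg\,max}_{e} \; p(f \mid e)\, p(e)
\]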
6.9 Word Similarity
Why word similarity
• Information retrieval
• Question answering
• Machine translation
• Natural language generation
• Language modeling
• Automatic essay grading
• Plagiarism detection
• Document clustering
Word similarity and word relatedness
Two classes of similarity algorithms
• Thesaurus-based algorithms
• Are words “nearby” in hypernym hierarchy?
• Do words have similar glosses (definitions)?
• Distributional algorithms
• Do words have similar distributional contexts?
Path-based similarity
• simpath(c1,c2) = 1/pathlen(c1,c2) (see the worked example below)
Summary: thesaurus-based similarity
Example: path-based similarity
simpath(c1,c2) = 1/pathlen(c1,c2)
simpath(nickel,coin) = 1/2 = .5
simpath(fund,budget) = 1/2 = .5
simpath(nickel,currency) = 1/4 = .25
simpath(nickel,money) = 1/6 = .17
simpath(coinage,Richter scale) = 1/6 = .17
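A minimal Python sketch of path-based similarity over a tiny hand-built hypernym graph; the fragment of hierarchy below is an illustrative assumption (it reproduces a few of the numbers above but is not the real thesaurus), and pathlen here counts the nodes on the shortest path:

from collections import deque

# Toy hypernym links: child -> parent (illustrative fragment only).
hypernym = {
    "nickel": "coin", "dime": "coin", "coin": "coinage",
    "coinage": "currency", "currency": "medium of exchange",
    "budget": "fund", "fund": "money", "money": "medium of exchange",
}

def pathlen(c1, c2):
    """Number of nodes on the shortest path between c1 and c2 (undirected)."""
    adj = {}
    for child, parent in hypernym.items():
        adj.setdefault(child, set()).add(parent)
        adj.setdefault(parent, set()).add(child)
    seen, queue = {c1}, deque([(c1, 1)])
    while queue:
        node, dist = queue.popleft()
        if node == c2:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return float("inf")

def sim_path(c1, c2):
    return 1.0 / pathlen(c1, c2)

print(sim_path("nickel", "coin"))      # 0.5, matching 1/2 above
print(sim_path("nickel", "currency"))  # 0.25, matching 1/4 above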
Problems with thesaurus-based meaning
• We don’t have a thesaurus for every language
• Even if we do, they have problems with recall
• Many words are missing
• Most (if not all) phrases are missing
• Some connections between senses are missing
• Thesauri work less well for verbs, adjectives
• Adjectives and verbs have less structured
hyponymy relations
Distributional models of meaning
Reminder: Term-document matrix
The words in a term-document matrix
The Term-Context matrix
Sample contexts: 20 words (Brown corpus)
• equal amount of sugar, a sliced lemon, a tablespoonful of apricot
preserve or jam, a pinch each of clove and nutmeg,
• on board for their enjoyment. Cautiously she sampled her first
pineapple and another fruit whose taste she likened to that of
• of a recursive type well suited to programming on
the digital computer. In finding the optimal R-stage
policy from that of
• substantially affect commerce, for the purpose of
gathering data and information necessary for the
study authorized in the first section of this
Term-context matrix for word similarity
• Two words are similar in meaning if their context
vectors are similar
Should we use raw counts?
• For the term-document matrix
• We used tf-idf instead of raw term counts
• For the term-context matrix
• Positive Pointwise Mutual Information
(PPMI) is common
Pointwise Mutual Information
• Do events x and y co-occur more often than if they were independent?
• PMI(x, y) = log2 [ P(x,y) / (P(x) P(y)) ]
p(w=information,c=data) = 6/19 = .32
p(w=information) = 11/19 = .58
p(c=data) = 7/19 = .37
• pmi(information,data) = log2 ( .32 / (.37*.58) ) = .58
(.57 using full precision)
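A small Python sketch that reproduces this arithmetic from raw co-occurrence counts; the 2x2 count table is made up solely so that its totals match the probabilities quoted above (total 19, joint count 6, row total 11, column total 7) and is not the real term-context matrix:

import math

# Made-up counts chosen only to match the quoted probabilities.
counts = {
    ("information", "data"): 6, ("information", "other"): 5,
    ("w2", "data"): 1, ("w2", "other"): 7,
}
total = sum(counts.values())  # 19

def p_joint(w, c):
    return counts.get((w, c), 0) / total

def p_word(w):
    return sum(n for (wi, _), n in counts.items() if wi == w) / total

def p_context(c):
    return sum(n for (_, ci), n in counts.items() if ci == c) / total

def pmi(w, c):
    return math.log2(p_joint(w, c) / (p_word(w) * p_context(c)))

def ppmi(w, c):
    return max(pmi(w, c), 0.0)   # clamp negative PMI values to 0

print(round(pmi("information", "data"), 2))   # 0.57 using full precision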
Reminder: cosine for computing similarity
cos(v, w) = (v · w) / (|v| |w|), i.e. the dot product of the two vectors divided by the product of their lengths (the dot product of the corresponding unit vectors)
Cosine as a similarity metric
cosine(digital,information) =
cosine(apricot,digital) =
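A minimal Python sketch of the cosine measure over count vectors (the example vectors are arbitrary placeholders, not data from the text):

import math

def cosine(v, w):
    """Dot product of v and w divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    if norm_v == 0 or norm_w == 0:
        return 0.0
    return dot / (norm_v * norm_w)

print(cosine([0, 1, 2], [1, 1, 0]))  # about 0.316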
Other possible similarity measures
Evaluating similarity
(the same as for thesaurus-based)
• Intrinsic Evaluation:
• Correlation between algorithm and human word
similarity ratings
• Extrinsic (task-based, end-to-end) Evaluation:
• Spelling error detection, WSD, essay grading
• Taking TOEFL multiple-choice vocabulary tests