
Chapter 6: Ambiguity Resolutions: Statistical methods

6.1 Basic Probability Theory

This section explores some techniques, based on probability theory, for resolving such ambiguity problems.
A probability function, PROB, assigns a probability to every value of a random variable.
A probability function must have the following properties, where e1, …, en are the possible distinct values of a random variable E:
1. PROB(ei) ≥ 0 for all i
2. PROB(ei) ≤ 1 for all i
3. Σ i=1,n PROB(ei) = 1
1
Chapter 6: Ambiguity Resolutions: Statistical methods
6.1 Basic Probability Theory
- Conditional probability is defined by the formula
PROB(e | e') = PROB(e & e') / PROB(e')
where PROB(e & e') is the probability of the two events e and e' occurring simultaneously.
- An important theorem relating conditional probabilities is Bayes' rule. It relates the conditional probability of an event A given B to the conditional probability of an event B given A:
PROB(A | B) = PROB(B | A) * PROB(A) / PROB(B)
2
Chapter 6: Ambiguity Resolutions: Statistical methods
6.1 Basic Probability Theory
- Two events A and B are independent of each other if and only if
PROB(A | B) = PROB(A)
which, using the definition of conditional probability, is equivalent to saying
PROB(A & B) = PROB(A) * PROB(B)
Example: Suppose PROB(Win) = 0.2, PROB(Rain) = 0.3, and PROB(Win & Rain) = 0.15. Then
PROB(Win | Rain) = PROB(Win & Rain) / PROB(Rain) = 0.15 / 0.3 = 0.5
If Win and Rain were independent, we would expect PROB(Win & Rain) = PROB(Win) * PROB(Rain) = 0.2 * 0.3 = 0.06, whereas the observed PROB(Win & Rain) is 0.15. Note that winning and raining occur together at a rate much greater than chance.
3
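The same check can be written as a short Python sketch, using the figures from the example (PROB(Win) = 0.2, PROB(Rain) = 0.3, PROB(Win & Rain) = 0.15); the variable names are illustrative only.

```python
# Conditional probability and an independence check, using the figures above.
p_win, p_rain, p_win_and_rain = 0.2, 0.3, 0.15

p_win_given_rain = p_win_and_rain / p_rain       # 0.5, not equal to PROB(Win) = 0.2
expected_if_independent = p_win * p_rain         # 0.06 if Win and Rain were independent

print(p_win_given_rain)
print(p_win_and_rain > expected_if_independent)  # True: co-occurrence above chance
```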
Chapter 6: Ambiguity Resolutions: Statistical methods

6.1 Basic Probability Theory

Consider an application of probability theory to language, namely part-of-speech identification: given a sentence with ambiguous words, determine the most likely lexical category for each word.

Example: the word flies can be either a V or an N.

Let C be a random variable that ranges over the parts of speech (V, N) and W a random variable that ranges over all possible words. The problem can be stated as determining which of PROB(C = N | W = flies) and PROB(C = V | W = flies) is greater.
4
Chapter 6: Ambiguity Resolutions: Statistical methods

6.1 Basic Probability Theory

The conditional probabilities for the word flies with the lexical categories N and V are:
PROB(N | flies) = PROB(flies & N) / PROB(flies)
PROB(V | flies) = PROB(flies & V) / PROB(flies)
Hence the problem reduces to finding which of PROB(flies & N) and PROB(flies & V) is greater, because the denominator PROB(flies) is the same in each formula.

Let's say we have a corpus of simple sentences containing 1,273,000 words, and that there are 1,000 uses of the word flies, 400 of them in the N sense and 600 in the V sense.
Then:
PROB(flies) = 1000 / 1,273,000 ≈ 0.0008
PROB(flies & N) = 400 / 1,273,000 ≈ 0.0003
5
Chapter 6: Ambiguity Resolutions: Statistical methods

6.1 Basic Probability Theory

PROB ( flies & V ) = 600 / 1273000 = 0.0005


Finally,
PROB ( V / flies ) = PROB ( V & flies ) / PROB ( flies )
= 0.0005 / 0.0008 = 0.625

6
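The whole disambiguation step fits in a few lines of Python, using the corpus figures quoted above (1,273,000 words; 1,000 occurrences of flies, 400 as N and 600 as V); this is only a sketch of the computation, not a general tagger.

```python
# Choosing the more likely category for "flies" from raw corpus counts.
total_words = 1_273_000
count_flies_N, count_flies_V = 400, 600
count_flies = count_flies_N + count_flies_V

p_flies = count_flies / total_words              # ~0.0008
p_flies_and_N = count_flies_N / total_words      # ~0.0003
p_flies_and_V = count_flies_V / total_words      # ~0.0005

# The denominator PROB(flies) is shared, so only the joint probabilities matter.
p_N_given_flies = p_flies_and_N / p_flies        # ~0.4
p_V_given_flies = p_flies_and_V / p_flies        # ~0.6 (the slide's 0.625 comes from rounded intermediates)

print("V" if p_V_given_flies > p_N_given_flies else "N")   # V
```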
Chapter 6: Ambiguity Resolutions: Statistical methods
6.2 Estimating Probabilities
- For instance, we use information about Harry's past performance (his record over 100 previous races) to predict how likely he is to win his 101st race.
- We are interested in parsing sentences that have never been seen before, so we need to use data on previously observed sentences to make predictions about new ones.
- We will always be working with estimates of probabilities rather than the actual probabilities.

Maximum likelihood estimate (MLE)

If we have seen the word flies 1,000 times before, and 600 of those occurrences were as a verb, we estimate PROB(V | flies) = 0.6.
7
Chapter 6: Ambiguity Resolutions: Statistical methods

6.2 Estimating Probabilities


Maximum likelihood estimate (MLE)
This simple ratio estimate is called the maximum likelihood estimate (MLE).
The accuracy of an estimate increases as the amount of data grows. In the coin-tossing example below, an estimate of PROB(H) is considered acceptable if it falls between 0.25 and 0.75; this range is the margin of error.

Result   Estimate of PROB(H)   Acceptable estimate?
HH       1.0                   NO
HT       0.5                   YES
TH       0.5                   YES
TT       0.0                   NO

Figure 6.1: Probabilities estimated from two trials


Chapter 6: Ambiguity Resolutions: Statistical methods
6.2 Estimating Probabilities
Maximum likelihood estimate (MLE)

Results Estimate of Prob (H) Acceptable Estimate


HHH 1.0 NO
HHT 0.66 YES
HTH 0.66 YES
HTT 0.33 YES
THH 0.66 YES
THT 0.33 YES
TTH 0.33 YES
TTT 0.0 NO

Figure 6.2: Probabilities estimated from three trials

9
Chapter 6: Ambiguity Resolutions: Statistical methods
6.2 Estimating Probabilities
Sparse data
A vast number of estimates is needed for natural language applications, and a large proportion of the events involved are quite rare. This is the problem of sparse data.
For instance, the Brown corpus contains about a million words, but because of duplication there are only 49,000 different words, and 40,000 of these occur five times or less.
The worst case occurs when a low-frequency word does not occur at all in one of its possible categories.
Its probability in that category would then be estimated as 0, and the probability of any sentence containing the word in that category would also be 0.
Other estimation techniques attempt to address the problem of estimating the probabilities of low-frequency events.
10
Chapter 6: Ambiguity Resolutions: Statistical methods
6.2 Estimating Probabilities
Sparse data
For a random variable X, estimation techniques start with a set of values Vi computed from |xi|, the count of the number of times X = xi. MLE uses Vi = |xi|, that is, Vi is exactly the count of the number of times X = xi:
PROB(X = xi) ≈ Vi / Σi Vi
One technique for avoiding zero probabilities is to make sure that no Vi has the value 0 by setting Vi = |xi| + 0.5; in other words, 0.5 is added to every count.
This estimation technique is called the expected likelihood estimate (ELE).
The difference between MLE and ELE:
For instance, consider a word w that does not occur in the corpus, and consider estimating the probability that w occurs in one of 40 word classes (categories L1 … L40), comparing MLE and ELE.
11
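A minimal sketch of that comparison in Python, under the assumption that w is simply unseen in all 40 categories; the variable names are illustrative.

```python
# MLE vs. expected likelihood estimate (ELE) for a word w unseen in the corpus.
num_categories = 40
counts = [0] * num_categories                 # w was never observed in any category L1..L40

# MLE: the total count is 0, so the ratio is undefined; we fall back to 0.0 everywhere.
total = sum(counts)
mle = [c / total if total > 0 else 0.0 for c in counts]

# ELE: add 0.5 to every count before normalizing.
ele_counts = [c + 0.5 for c in counts]
ele = [c / sum(ele_counts) for c in ele_counts]   # each category gets 0.5 / 20 = 0.025

print(mle[0], ele[0])                             # 0.0 vs 0.025
```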
Chapter 6: Ambiguity Resolutions: Statistical methods
6.2 Estimating Probabilities
Evaluation
We want to know how well a new technique performs compared with other algorithms or with variants of our own algorithm.
The general method for doing this is to divide the corpus into two parts, the training set and the test set. The test set typically consists of 10-20% of the total data.
- The training set is used to estimate the probabilities, and
- the algorithm is then run on the test set to see how well it does on new data.
- A more thorough method of testing is called cross-validation:
repeatedly remove different parts of the corpus as the test set, train on the remainder of the corpus, and then evaluate on the removed part.
12
Chapter 6: Ambiguity Resolutions: Statistical methods
6.3 Part-of-Speech Tagging
- Part-of-speech tagging involves selecting the most likely sequence of syntactic categories for the words in a sentence.
- A typical tag set, used in the Penn Treebank project, is shown in Figure 6.3.
- The general method for improving reliability is to use some of the local context of the sentence in which the word appears.
- For instance, if a word is preceded by the word the, it is much more likely to be an N. In this section we use this kind of information.
Let W1, …, WT be a sequence of words. We want to find the sequence of lexical categories C1, …, CT that maximizes
13
Figure 6.3: The Penn Treebank tagset
14
Chapter 6: Ambiguity Resolutions: Statistical methods
6.3 Part-of-Speech Tagging
1) PROB(C1, …, CT | W1, …, WT)
We attack this problem with Bayes' rule, which says that this conditional probability equals
2) PROB(C1, …, CT) * PROB(W1, …, WT | C1, …, CT) / PROB(W1, …, WT)
Since we are looking for the C1, …, CT that gives the maximum value, the common denominator in all these cases will not affect the answer. Thus the problem reduces to finding the C1, …, CT that maximizes the formula
3) PROB(C1, …, CT) * PROB(W1, …, WT | C1, …, CT)
There are still no effective methods for calculating the probability of such long sequences accurately, as it would require far too much data.
15
Chapter 6: Ambiguity Resolutions: Statistical methods

6.3 Part-of-Speech Tagging

The probabilities can be approximated, however, by making some independence assumptions. Each of the two expressions in formula 3 will be approximated.
The most common assumptions use either one or two previous categories.
The bigram model looks at pairs of categories (or words) and uses the conditional probability that Ci will follow Ci-1, written PROB(Ci | Ci-1).
The trigram model uses the conditional probability of one category (or word) given the two preceding categories (or words), PROB(Ci | Ci-2, Ci-1).
16
Chapter 6: Ambiguity Resolutions: Statistical methods
6.3 Part-of-Speech Tagging
In an n-gram model, n is the number of items used in the pattern.
While the trigram model produces better results in practice, we use the bigram model here for simplicity:
PROB(C1, …, CT) ≈ Π i=1,T PROB(Ci | Ci-1)
To account for the beginning of the sentence, we posit a pseudo-category O at position 0 as the value of C0.
If an ART appears at the beginning of a sentence, the first bigram will be PROB(ART | O).
Example: the approximation of the probability of the sequence ART N V N using bigrams would be
17
Chapter 6: Ambiguity Resolutions: Statistical methods

6.3 Part-of-Speech Tagging

- PROB(ART N V N) ≈ PROB(ART | O) * PROB(N | ART) * PROB(V | N) * PROB(N | V)
- The second probability in formula 3,
  PROB(W1, …, WT | C1, …, CT) ≈ Π i=1,T PROB(Wi | Ci),
  can be approximated by assuming that a word appears in a category independently of the words in the preceding or succeeding categories.
- With these two approximations, the problem becomes finding the sequence C1 … CT that maximizes the value of
3') PROB(C1, …, CT) * PROB(W1, …, WT | C1, …, CT) ≈ Π i=1,T PROB(Ci | Ci-1) * PROB(Wi | Ci)
18
Chapter 6: Ambiguity Resolutions: Statistical methods

6.3 Part-of-Speech Tagging

The advantage of formula 3' is that the probabilities it involves can be readily estimated from a corpus of text labeled with parts of speech.
In particular, given such a corpus, the bigram probabilities can be estimated simply by counting the number of times each pair of categories occurs and comparing it with the individual category counts.
Example: the probability that a V follows an N would be estimated as
PROB(Ci = V | Ci-1 = N) ≈ Count(N at position i-1 and V at i) / Count(N at position i-1)
To deal with the sparse-data problem, any bigram not listed in Figure 6.4 will be assumed to have a token probability of 0.0001.
19
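A sketch of this counting scheme on a tiny hand-tagged corpus; the two example sentences below are made up for illustration, and O is the sentence-start pseudo-category.

```python
from collections import Counter

# Estimating bigram category probabilities by counting pairs in a tagged corpus.
# Each sentence is a list of (word, category) pairs.
tagged_corpus = [
    [("the", "ART"), ("flies", "N"), ("like", "V"), ("a", "ART"), ("flower", "N")],
    [("flies", "N"), ("like", "V"), ("flowers", "N")],
]

unigram_counts = Counter()   # count of a category at position i-1
bigram_counts = Counter()    # count of (category at i-1, category at i)
for sentence in tagged_corpus:
    categories = ["O"] + [cat for _, cat in sentence]
    for prev, curr in zip(categories, categories[1:]):
        unigram_counts[prev] += 1
        bigram_counts[(prev, curr)] += 1

def bigram_prob(curr, prev, unseen=0.0001):
    # PROB(Ci = curr | Ci-1 = prev), with the 0.0001 token probability for unseen bigrams.
    if bigram_counts[(prev, curr)] == 0:
        return unseen
    return bigram_counts[(prev, curr)] / unigram_counts[prev]

print(bigram_prob("V", "N"))    # count(N,V) / count(N at i-1) = 2/2 = 1.0 in this toy corpus
print(bigram_prob("ART", "N"))  # unseen bigram -> 0.0001
```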
Chapter 6: Ambiguity Resolutions: Statistical methods
6.3 Part-of-Speech Tagging

Category   Count at i   Pair      Count at i, i+1   Bigram estimate
O          300          O, ART    213               PROB(ART|O) = 0.71
O          300          O, N      87                PROB(N|O)   = 0.29
ART        558          ART, N    558               PROB(N|ART) = 1.0
N          833          N, V      258               PROB(V|N)   = 0.43
N          833          N, N      108               PROB(N|N)   = 0.13
N          833          N, P      366               PROB(P|N)   = 0.44
V          300          V, N      75                PROB(N|V)   = 0.35
V          300          V, ART    194               PROB(ART|V) = 0.65
P          307          P, ART    226               PROB(ART|P) = 0.74
P          307          P, N      81                PROB(N|P)   = 0.26

Figure 6.4: Bigram probabilities from a corpus [1]


Chapter 6: Ambiguity Resolutions: Statistical methods
6.3 Part-of-Speech Tagging

Figure 6.5: A summary of some of the word counts in the corpus
21


Chapter 6: Ambiguity Resolutions: Statistical methods

6.3 Part-of-Speech Tagging

- The lexical probabilities PROB(Wi | Ci) can be estimated simply by counting the number of occurrences of each word in each category.
Example: some lexical probabilities, estimated from the data of Figure 6.5, are shown in Figure 6.6.

Figure 6.6: The lexical generation probabilities
22


Chapter 6: Ambiguity Resolutions: Statistical methods
6.3 Part-of-Speech Tagging

- We can find the sequence of categories that has the highest probability of generating a specific sentence, given the independence assumptions that were made about the data; this is the part-of-speech tagging task.
- Since we use only bigram probabilities, the probability that the i-th word is in category Ci depends only on the category of the (i-1)-th word, Ci-1.
- Thus the process can be modeled by a special form of probabilistic finite state machine, as shown in Figure 6.7. Each node represents a possible lexical category, and each arc is labeled with a transition probability. Networks like the one in Figure 6.7 are called Markov chains.
23
Chapter 6: Ambiguity Resolutions: Statistical methods
6.3 Part-of-Speech Tagging
Example: The sequence of categories ART N V N has the following probability:
0.71 * 1 * 0.43 * 0.35 = 0.107 (using the data in Figure 6.7)

Figure 6.7: A Markov chain capturing the bigram probabilities
24


Chapter 6: Ambiguity Resolutions: Statistical methods
6.3 Part-of-Speech Tagging
Example: The probability that the sequence N V ART N generates the output flies like a flower is computed as follows:

- The probability of the path N V ART N, given the Markov model in Figure 6.7:
PROB(N V ART N) = 0.29 * 0.43 * 0.65 * 1 = 0.081
- The probability of the output flies like a flower for this sequence, computed from the lexical generation probabilities in Figure 6.6:
PROB(flies | N) * PROB(like | V) * PROB(a | ART) * PROB(flower | N) = 0.025 * 0.1 * 0.36 * 0.063 ≈ 5.4 * 10^-5
- Multiplying these together gives the likelihood that the HMM would generate the sentence along this path:
5.4 * 10^-5 * 0.081 ≈ 4.37 * 10^-6
25
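The same two products written out in Python, using the figures quoted above; small differences from the slide's rounded results are expected.

```python
# Likelihood that the tag sequence N V ART N generates "flies like a flower",
# combining bigram (Figure 6.4) and lexical generation (Figure 6.6) probabilities.
bigrams = {("O", "N"): 0.29, ("N", "V"): 0.43, ("V", "ART"): 0.65, ("ART", "N"): 1.0}
lexical = {("flies", "N"): 0.025, ("like", "V"): 0.1,
           ("a", "ART"): 0.36, ("flower", "N"): 0.063}

words = ["flies", "like", "a", "flower"]
tags = ["N", "V", "ART", "N"]

path_prob, output_prob, prev = 1.0, 1.0, "O"
for word, tag in zip(words, tags):
    path_prob *= bigrams[(prev, tag)]       # PROB(Ci | Ci-1)
    output_prob *= lexical[(word, tag)]     # PROB(Wi | Ci)
    prev = tag

print(path_prob)                # ~0.081
print(output_prob)              # ~5.7e-05
print(path_prob * output_prob)  # ~4.6e-06, the joint likelihood of tags and words
```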
Chapter 6: Ambiguity Resolutions: Statistical methods
6.3 Part-of-Speech Tagging
• The formula for computing the probability of sentence w1 … wT given the category sequence C1 … CT is:
Π i=1,T PROB(Ci | Ci-1) * PROB(wi | Ci)
• If we keep track of the most likely sequence found so far for each possible ending category, we can ignore all the other, less likely sequences.
Example: finding the most likely categories for the sentence flies like a flower is illustrated in Figure 6.8. There are 256 (= 4^4) different category sequences of length four, since each of the four words can take any of the four categories.
To find the most likely sequence, we sweep forward through the words one at a time, finding the best score for each ending category. This algorithm is usually called the Viterbi algorithm.
26
Chapter 6: Ambiguity Resolutions: Statistical methods
6.3 Part-of-Speech Tagging

Figure 6.8: Encoding the 256 possible sequences, exploiting the Markov assumption
27
Chapter 6: Ambiguity Resolutions: Statistical methods
6.3 Part-of-Speech Tagging
The Viterbi algorithm
Given the word sequence w1 … wT, the lexical categories L1 … LN, the lexical probabilities PROB(wi | Li) and the bigram probabilities PROB(Li | Lj), find the most likely sequence of lexical categories C1 … CT for the word sequence.
Initialization step:
For i := 1 to N do
  SEQSCORE(i, 1) = PROB(w1 | Li) * PROB(Li | O)
  BACKPTR(i, 1) = 0
28
Chapter 6: Ambiguity Resolutions: Statistical methods
6.3 Part-of-Speech Tagging
The Viterbi algorithm (continued)
Iteration step:
For t := 2 to T do
  For i := 1 to N do
    SEQSCORE(i, t) = MAX j=1,N (SEQSCORE(j, t-1) * PROB(Li | Lj)) * PROB(wt | Li)
    BACKPTR(i, t) = the index j that gave the max above
Sequence identification step:
C(T) = the i that maximizes SEQSCORE(i, T)
For i := T-1 down to 1 do
  C(i) = BACKPTR(C(i+1), i+1)
29
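A compact Python version of the algorithm, as a sketch: the dictionary layout for the probability tables and the 0.0001 default for unseen bigrams are assumptions that mirror the conventions used above.

```python
def viterbi(words, categories, lexical_prob, bigram_prob, unseen=0.0001):
    """Return the most likely category sequence for `words`.

    lexical_prob[(word, cat)] approximates PROB(word | cat);
    bigram_prob[(prev_cat, cat)] approximates PROB(cat | prev_cat);
    "O" is the pseudo-category that starts every sentence.
    """
    T, N = len(words), len(categories)
    seqscore = [[0.0] * N for _ in range(T)]
    backptr = [[0] * N for _ in range(T)]

    # Initialization step: SEQSCORE(i, 1) = PROB(w1 | Li) * PROB(Li | O)
    for i, cat in enumerate(categories):
        seqscore[0][i] = (lexical_prob.get((words[0], cat), 0.0)
                          * bigram_prob.get(("O", cat), unseen))

    # Iteration step: keep only the best-scoring predecessor for each ending category.
    for t in range(1, T):
        for i, cat in enumerate(categories):
            best_j, best = 0, 0.0
            for j, prev_cat in enumerate(categories):
                score = seqscore[t - 1][j] * bigram_prob.get((prev_cat, cat), unseen)
                if score > best:
                    best, best_j = score, j
            seqscore[t][i] = best * lexical_prob.get((words[t], cat), 0.0)
            backptr[t][i] = best_j

    # Sequence identification step: follow the back pointers from the best final score.
    best_last = max(range(N), key=lambda i: seqscore[T - 1][i])
    tags = [best_last]
    for t in range(T - 1, 0, -1):
        tags.append(backptr[t][tags[-1]])
    return [categories[i] for i in reversed(tags)]
```

With bigram and lexical tables built from Figures 6.4 and 6.6, this sketch follows exactly the steps of the hand simulation on the slides that follow, which arrives at N V ART N for flies like a flower.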
Chapter 6: Ambiguity Resolutions: Statistical methods
6.3 Part-of-Speech Tagging
Example: Use the Viterbi algorithm to compute the probabilities for the sentence W = flies like a flower, with the categories L = V, N, ART, P.

Initialization (i = 1 to N):
SEQ(1,1) = PROB(flies|V) * PROB(V|O)     = 7.6*10^-2 * 10^-4 = 7.6*10^-6
SEQ(2,1) = PROB(flies|N) * PROB(N|O)     = 0.025 * 0.29      = 7.25*10^-3
SEQ(3,1) = PROB(flies|ART) * PROB(ART|O) = 0
SEQ(4,1) = PROB(flies|P) * PROB(P|O)     = 0

Iteration, t = 2 (i = 1 to 4):
SEQ(1,2) = max_j (SEQ(1,1)*PROB(V|V), SEQ(2,1)*PROB(V|N)) * PROB(like|V)
         = max(7.6*10^-6 * 10^-4, 7.25*10^-3 * 0.43) * 0.1   = 3.1*10^-4
SEQ(2,2) = max_j (SEQ(1,1)*PROB(N|V), SEQ(2,1)*PROB(N|N)) * PROB(like|N)
         = max(7.6*10^-6 * 0.35, 7.25*10^-3 * 0.13) * 0.012  = 1.13*10^-5
SEQ(3,2) = max_j (SEQ(1,1)*PROB(ART|V), SEQ(2,1)*PROB(ART|N)) * PROB(like|ART) = 0
SEQ(4,2) = max_j (SEQ(1,1)*PROB(P|V), SEQ(2,1)*PROB(P|N)) * PROB(like|P)
         = max(7.6*10^-6 * 10^-4, 7.25*10^-3 * 0.44) * 0.068 = 2.2*10^-4
30

Iteration, t = 3 (i = 1 to 4):
SEQ(1,3) = max_j (SEQ(1,2)*PROB(V|V), SEQ(2,2)*PROB(V|N), SEQ(4,2)*PROB(V|P)) * PROB(a|V)
         = max(3.1*10^-4 * 10^-4, 1.13*10^-5 * 0.43, 2.2*10^-4 * 10^-4) * 0 = 0
SEQ(2,3) = max_j (SEQ(1,2)*PROB(N|V), SEQ(2,2)*PROB(N|N), SEQ(4,2)*PROB(N|P)) * PROB(a|N)
         = max(3.1*10^-4 * 0.35, 1.13*10^-5 * 0.13, 2.2*10^-4 * 0.26) * 0.01 = 5.7*10^-7
SEQ(3,3) = max_j (SEQ(1,2)*PROB(ART|V), SEQ(2,2)*PROB(ART|N), SEQ(4,2)*PROB(ART|P)) * PROB(a|ART)
         = max(3.1*10^-4 * 0.65, 1.13*10^-5 * 10^-4, 2.2*10^-4 * 0.74) * 0.36 = 2.01*10^-5
SEQ(4,3) = max_j (SEQ(1,2)*PROB(P|V), SEQ(2,2)*PROB(P|N), SEQ(4,2)*PROB(P|P)) * PROB(a|P)
         = max(3.1*10^-4 * 10^-4, 1.13*10^-5 * 0.44, 2.2*10^-4 * 10^-4) * 0 = 0

Iteration, t = 4 (i = 1 to 4):
SEQ(1,4) = max_j (SEQ(2,3)*PROB(V|N), SEQ(3,3)*PROB(V|ART)) * PROB(flower|V)
         = max(5.7*10^-7 * 0.43, 2.01*10^-5 * 10^-4) * 0.05 = 1.2*10^-8
SEQ(2,4) = max_j (SEQ(2,3)*PROB(N|N), SEQ(3,3)*PROB(N|ART)) * PROB(flower|N)
         = max(5.7*10^-7 * 0.13, 2.01*10^-5 * 1.0) * 0.63   = 1.26*10^-5
SEQ(3,4) = max_j (SEQ(2,3)*PROB(ART|N), SEQ(3,3)*PROB(ART|ART)) * PROB(flower|ART) = 0
SEQ(4,4) = max_j (SEQ(2,3)*PROB(P|N), SEQ(3,3)*PROB(P|ART)) * PROB(flower|P) = 0
31
Example: flies like a flower, L = V, N, ART, P

• Sequence identification step
• C(T) = the i that maximizes SEQ(i, 4); here SEQ(2,4) is the largest, so
• C(4) = 2
• For i = 4-1 down to 1:
• C(3) = BACKPTR(C(4), 4) = BACKPTR(2, 4) = 3
• C(2) = BACKPTR(C(3), 3) = BACKPTR(3, 3) = 1
• C(1) = BACKPTR(C(2), 2) = BACKPTR(1, 2) = 2
• 2-1-3-2 -> N V ART N

32
Chapter 6: Ambiguity Resolutions: Statistical methods

6.4 Obtaining lexical probabilities


- The simplest technique for estimating lexical probabilities is to count the number of times each word appears in the corpus in each category.
- The probability that a word w appears in a lexical category Lj, out of the possible categories L1 … LN, can be estimated by the formula:
PROB(Lj | w) ≈ Count(Lj & w) / Σ i=1,N Count(Li & w)

- A better estimate would be obtained by computing how likely it is that category Lj occurs at position t over all sequences given
33
Chapter 6: Ambiguity Resolutions: Statistical methods
6.4 Obtaining lexical probabilities

the input w1 … wt; that is, we consider all sequences that could generate the input rather than only the single sequence with the maximum probability.

Figure 6.9: Context-independent estimates for the lexical categories


34
Chapter 6: Ambiguity Resolutions: Statistical methods
6.4 Obtaining lexical probabilities
Example: The probability that flies is an N in the sentence The flies like flowers is calculated by summing the probabilities of all sequences that end with flies as an N.
Given the transition probabilities (Figure 6.4) and the lexical generation probabilities (Figure 6.6), the nonzero sequences are:

The/ART flies/N   9.58*10^-3
The/N   flies/N   1.13*10^-6
The/P   flies/N   4.55*10^-9

These three nonzero sequences sum to roughly 9.58*10^-3. Likewise, the three nonzero sequences that end with flies as a V sum to 1.13*10^-5. The sum over all sequences is the probability of the word sequence The flies, namely 9.591*10^-3.
35
Chapter 6: Ambiguity Resolutions: Statistical methods

6.4 Obtaining lexical probabilities

The probability that flies is a noun is then:
PROB(flies/N | The flies) = PROB(flies/N & The flies) / PROB(The flies) = 9.58*10^-3 / 9.591*10^-3 = 0.9988
Likewise, the probability that flies is a verb is 0.0012.
Rather than selecting the maximum score for each node at each stage (as in the Viterbi algorithm), we compute the sum of all scores.
We define the forward probability αi(t) (Figure 6.10), the probability of producing the words w1 … wt and ending in state wt/Li:
αi(t) = PROB(wt/Li, w1, …, wt)

36
Chapter 6: Ambiguity Resolutions: Statistical methods

6.4 Obtaining lexical probabilities

Figure 6.10 The forward algorithm for computing the lexical probabilities
37
Chapter 6: Ambiguity Resolutions: Statistical methods
6.4 Obtaining lexical probabilities

Example: For the sentence The flies like flowers, α2(3) is the sum of the values computed for all sequences ending in V (the second category) at position 3, given the input The flies like.

- Using conditional probability, we can derive the probability that word wt is an instance of lexical category Li:
PROB(wt/Li | w1 … wt) ≈ PROB(wt/Li, w1 … wt) / PROB(w1 … wt)   (1)
PROB(w1 … wt) ≈ Σ j=1,N αj(t)   (2)
From (1) and (2):
PROB(wt/Li | w1 … wt) ≈ αi(t) / Σ j=1,N αj(t)
38
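A sketch of the forward computation summarized in Figure 6.10, written in the same style as the Viterbi sketch earlier; the dictionary layout of the probability tables and the 0.0001 default for unseen bigrams are assumptions of these sketches, not part of the original algorithm.

```python
def forward_lexical_probs(words, categories, lexical_prob, bigram_prob, unseen=0.0001):
    """Context-dependent PROB(w_t/L_i | w_1..w_t) via forward probabilities alpha_i(t)."""
    T, N = len(words), len(categories)
    alpha = [[0.0] * N for _ in range(T)]

    # alpha_i(1) = PROB(L_i | O) * PROB(w_1 | L_i)
    for i, cat in enumerate(categories):
        alpha[0][i] = (bigram_prob.get(("O", cat), unseen)
                       * lexical_prob.get((words[0], cat), 0.0))

    for t in range(1, T):
        for i, cat in enumerate(categories):
            # Sum over all previous categories instead of taking the max (as Viterbi does).
            total = sum(alpha[t - 1][j] * bigram_prob.get((prev, cat), unseen)
                        for j, prev in enumerate(categories))
            alpha[t][i] = total * lexical_prob.get((words[t], cat), 0.0)

    # Normalize per position: PROB(w_t/L_i | w_1..w_t) ~ alpha_i(t) / sum_j alpha_j(t)
    probs = []
    for t in range(T):
        norm = sum(alpha[t]) or 1.0
        probs.append({cat: alpha[t][i] / norm for i, cat in enumerate(categories)})
    return probs
```

With the full tables of Figures 6.4 and 6.6, probs[1] should reproduce the 0.9988 / 0.0012 split for flies computed above.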
39
40
41
Figure 6.11 Computing the sums of the probabilities of the sequences
Figure 6.12 Context dependent estimates for lexical categories in the sentence: The
flies like flowers
42
Chapter 6: Ambiguity Resolutions: Statistical methods

6.4 Obtaining lexical probabilities

- We can also define the backward probability βi(t), the probability of producing the sequence wt, …, wT starting from the state wt/Li.
A better estimate of the lexical probability for word wt is then:
PROB(wt/Li) ≈ αi(t) * βi(t) / Σ j=1,N αj(t) * βj(t)

43
Chapter 6: Ambiguity Resolutions: Statistical methods
6.5 Language Modeling : Introduction

Application Language Modeling


Today’s goal: assign a probability to a sentence
• Machine Translation:
• P(high winds tonite) > P(large winds tonite)
• Spell Correction
• The office is about fifteen minuets from my house
• P(about fifteen minutes from) > P(about fifteen
minuets from)
• Speech Recognition
• P(I saw a van) >> P(eyes awe of an)
• + Summarization, question-answering, etc., etc.!! 44
6.5 Language Modeling: Introduction

• Goal: compute the probability of a sentence or


sequence of words:
P(W) = P(w1,w2,w3,w4,w5…wn)
• Related task: probability of an upcoming word:
P(w5|w1,w2,w3,w4)
• A model that computes either of these:
P(W) or P(wn|w1,w2…wn-1) is called
a language model.
• "The grammar" would be a better name, but "language model" (LM) is the standard term.
45
6.5 Language Modeling: Introduction

• The Chain Rule in General


P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)
Example:
P(“its water is so transparent”) =
P(its) × P(water|its) × P(is|its water)
× P(so|its water is) × P(transparent|its water is so)
• Markov assumption: approximate P(wi | w1 … wi-1) by P(wi | wi-1), giving the bigram model

46
6.5 Language Modeling: Introduction

Zeros

• Training set:
  … denied the allegations
  … denied the reports
  … denied the claims
  … denied the request

• Test set:
  … denied the offer
  … denied the loan

P(“offer” | denied the) = 0

47
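One standard remedy is smoothing; the sketch below uses add-1 (Laplace) smoothing, the same scheme quoted with the COCA counts in Section 6.6, on the tiny "denied the …" training set above. The sentences and helper names are illustrative only.

```python
from collections import Counter

# Add-1 (Laplace) smoothed bigram estimates, so unseen test bigrams are not zero.
training = ["denied the allegations", "denied the reports",
            "denied the claims", "denied the request"]

bigrams, unigrams, vocab = Counter(), Counter(), set()
for sentence in training:
    tokens = sentence.split()
    vocab.update(tokens)
    unigrams.update(tokens[:-1])            # counts of a word appearing as the left context
    bigrams.update(zip(tokens, tokens[1:]))

def p_add1(word, prev):
    # (count(prev, word) + 1) / (count(prev) + vocabulary size)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))

print(p_add1("offer", "the"))    # 0.1: nonzero even though "the offer" never occurs in training
print(p_add1("reports", "the"))  # 0.2: a seen bigram still gets a higher estimate
# (A full model would also reserve mass for out-of-vocabulary words, e.g. via an UNK token.)
```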
6.6 Applications for spelling correction
• Word processing
• Phones
• Web search

48
Spelling Tasks

• Spelling Error Detection


• Spelling Error Correction:
– Autocorrect
• hte -> the
– Suggest a correction
– Suggestion lists

49
Non-word spelling errors
• Non-word spelling error detection:
• Any word not in a dictionary is an error
• The larger the dictionary the better
• Non-word spelling error correction:
• Generate candidates: real words that are similar to error
• Choose the one which is best:
• Shortest weighted edit distance
• Highest noisy channel probability
Non-word spelling error example
acress
50
Non-word spelling errors
• Using a bigram language model

• “a stellar and versatile acress whose combination of sass and


glamour…”
• Counts from the Corpus of Contemporary American English
with add-1 smoothing
• P(actress|versatile) = .000021     P(whose|actress) = .0010
• P(across|versatile)  = .000021     P(whose|across)  = .000006

• P(“versatile actress whose”) = .000021 × .0010   = 210 × 10^-10
• P(“versatile across whose”)  = .000021 × .000006 ≈ 1 × 10^-10

51
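A minimal sketch of how the bigram language model ranks the two candidate corrections, using exactly the four probabilities quoted above; the helper function and dictionary names are illustrative.

```python
# Ranking spelling-correction candidates for "acress" with a bigram language model.
bigram_p = {
    ("versatile", "actress"): 0.000021, ("actress", "whose"): 0.0010,
    ("versatile", "across"): 0.000021, ("across", "whose"): 0.000006,
}

def score(candidate, left="versatile", right="whose"):
    # P(left candidate right) ~ P(candidate | left) * P(right | candidate)
    return bigram_p[(left, candidate)] * bigram_p[(candidate, right)]

for cand in ("actress", "across"):
    print(cand, score(cand))
# actress 2.1e-08 (210 x 10^-10) beats across 1.26e-10 (~1 x 10^-10)
```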
Chapter 6: Ambiguity Resolutions: Statistical methods

6.7 Probabilistic Context-Free Grammars

- Context-free grammars can be generalized to the probabilistic case.
- We count the number of times each rule is used in a corpus containing parsed sentences and use this to estimate the probability of each rule being used.
- For instance, suppose there are m rules R1, …, Rm with left-hand side C.
- The probability of using rule Rj to derive C is estimated as:

PROB(Rj | C) ≈ Count(# times Rj used) / Σ i=1,m Count(# times Ri used)

52
Chapter 6: Ambiguity Resolutions: Statistical methods
6.7 Probabilistic Context-Free Grammars

Figure 6.13: A simple probabilistic grammar

The grammar in Figure 6.13 shows the rule probabilities for a CFG.
53
Chapter 6: Ambiguity Resolutions: Statistical methods
6.7 Probabilistic Context-Free Grammars
- We must assume that the probability of a constituent being derived by a rule Rj is independent of how the constituent is used as a sub-constituent.
Inside probability PROB(wij | C):
the probability that a constituent C generates the sequence of words wi, …, wj, written wij.
It is called the inside probability because it assigns a probability to the word sequence inside the constituent.
Consider how to derive inside probabilities:
- For lexical categories, these are exactly the lexical generation probabilities of Figure 6.6; for instance, PROB(flower | N) is an inside probability.
54
Chapter 6: Ambiguity Resolutions: Statistical methods
6.7 Probabilistic Context-Free Grammars
- For non-lexical categories, we can derive the probability of a constituent from such lexical generation probabilities.
Example:
Derive the probability that the constituent NP generates the sequence a flower.
In the grammar of Figure 6.13 there are two NP rules that can generate two words, so the probability of NP generating a flower is:
PROB(a flower | NP) = PROB(rule 8 | NP) * PROB(a | ART) * PROB(flower | N)
                    + PROB(rule 6 | NP) * PROB(a | N) * PROB(flower | N)
                    = 0.55 * 0.36 * 0.06 + 0.09 * 0.0001 * 0.06 ≈ 0.012
55
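The same inside-probability sum written out in Python, using the numbers quoted above (rule probabilities 0.55 and 0.09; lexical probabilities PROB(a|ART) = 0.36, PROB(a|N) = 0.0001, PROB(flower|N) = 0.06); the dictionary layout is just for illustration.

```python
# Inside probability PROB(a flower | NP): sum over the NP rules that can span two words.
rule_prob = {"NP -> ART N": 0.55, "NP -> N N": 0.09}   # rule probabilities from the grammar
lex = {("a", "ART"): 0.36, ("a", "N"): 0.0001,
       ("flower", "N"): 0.06}                          # lexical generation probabilities

inside_np = (rule_prob["NP -> ART N"] * lex[("a", "ART")] * lex[("flower", "N")]
             + rule_prob["NP -> N N"] * lex[("a", "N")] * lex[("flower", "N")])
print(inside_np)   # ~0.012
```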
Chapter 6: Ambiguity Resolutions: Statistical methods
6.7 Probabilistic Context-Free Grammars
This probability can then be used to compute the probability of larger constituents.
For instance, the probability of generating the words a flower wilted from S could be computed by summing the probabilities of each of the possible trees shown in Figure 6.14.
Note that Figure 6.14 contains three trees, and the first two differ only in the derivation of a flower as an NP.
Thus the probability of a flower wilted is:
PROB(a flower wilted | S) = PROB(Rule 1 | S) * PROB(a flower | NP) * PROB(wilted | VP)
                          + PROB(Rule 1 | S) * PROB(a | NP) * PROB(flower wilted | VP)
56
Chapter 6: Ambiguity Resolutions: Statistical methods
6.7 Probabilistic Context-Free Grammars

Figure 6.14: The three possible ways to generate a flower wilted as an S


57
Chapter 6: Ambiguity Resolutions: Statistical methods
6.7 Probabilistic Context-Free Grammars
Using this method, the probability that a given sentence will be generated by the grammar can be computed efficiently.
The goal of a probabilistic parsing method, however, is to find the most likely parse rather than the overall probability of the sentence.
The probability of each constituent is computed from the probabilities of its sub-constituents and the probability of the rule used.
For instance, for a chart entry E of category C built using rule i with n sub-constituents corresponding to entries E1, …, En:

PROB(E) = PROB(Rule i | C) * PROB(E1) * … * PROB(En)


58
Chart entries (constituent: sub-constituents, probability):
NP425: 1 N422                     0.14
NP424: 1 N417, 2 N422             0.00011
NP423: 1 ART416, 2 N422           0.54
S421:  1 NP418, 2 VP420           3.2 x 10^-8
NP418: 1 N417   0.0018            VP420   0.0018
N417   0.001                      N422    0.00011
ART416 0.99                       V410    0.00047
59
Figure 6.15: The full chart for a flower as an NP
6.7 Probabilistic Context-Free Grammars

Machine translation
The idea behind statistical machine translation comes from information theory.
- A document is translated according to the probability distribution p(e|f) that a string e in the target language (for example, English) is the translation of a string f in the source language (for example, French).
- The problem of modeling the probability distribution p(e|f) has been approached in a number of ways. One approach which lends itself well to computer implementation is to apply Bayes' theorem, that is, p(e|f) ∝ p(f|e) p(e), where the translation model p(f|e) is the probability that the source string is the translation
60
6.7 Probabilistic Context-Free Grammars
Machine translation
of the target string, and the language model p(e) is the probability of seeing that target-language string. This decomposition is attractive because it splits the problem into two subproblems. Finding the best translation ẽ is done by picking the one that gives the highest probability:

ẽ = argmax over e ∈ e* of p(e|f) = argmax over e ∈ e* of p(f|e) p(e)

61
Chapter 6: Ambiguity Resolutions: Statistical methods

6.8 Best-First Parsing (self-study)

62
6.9 Word Similarity

• Synonymy: a binary relation


• Two words are either synonymous or not
• Similarity (or distance): a looser metric
• Two words are more similar if they share more features of
meaning
• Similarity is properly a relation between senses
• The word “bank” is not similar to the word “slope”
• Bank1 is similar to fund3
• Bank2 is similar to slope5
• But we’ll compute similarity over both words and senses

63
Why word similarity

• Information retrieval
• Question answering
• Machine translation
• Natural language generation
• Language modeling
• Automatic essay grading
• Plagiarism detection
• Document clustering

64
Word similarity and word relatedness

• We often distinguish word similarity from word


relatedness
• Similar words: near-synonyms
• Related words: can be related any way
• car, bicycle: similar
• car, gasoline: related, not similar

65
Two classes of similarity algorithms

• Thesaurus-based algorithms
• Are words “nearby” in hypernym hierarchy?
• Do words have similar glosses (definitions)?
• Distributional algorithms
• Do words have similar distributional contexts?

66
Path based similarity

• Two concepts (senses/synsets) are similar if they


are near each other in the thesaurus hierarchy
• =have a short path between them
• concepts have path 1 to themselves
67
Refinements to path-based similarity

• pathlen(c1, c2) = 1 + the number of edges in the shortest path in the hypernym graph between the sense nodes c1 and c2
• simpath(c1, c2) = 1 / pathlen(c1, c2), which ranges from 0 to 1 (1 when c1 = c2, since a concept has a path of length 1 to itself)
• wordsim(w1, w2) = max over c1 ∈ senses(w1), c2 ∈ senses(w2) of sim(c1, c2)

68
Summary: thesaurus-based similarity

69
Example: path-based similarity
simpath(c1, c2) = 1 / pathlen(c1, c2)

simpath(nickel, coin) = 1/2 = .5
simpath(fund, budget) = 1/2 = .5
simpath(nickel, currency) = 1/4 = .25
simpath(nickel, money) = 1/6 = .17
simpath(coinage, Richter scale) = 1/6 = .17
70
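A small sketch of pathlen and simpath over a toy hypernym graph; the graph below is a hand-built stand-in for a real thesaurus such as WordNet, arranged so that the nickel/coin/currency/money examples above come out the same.

```python
from collections import deque

# Edges link a concept to its hypernym in a toy hierarchy.
hypernym_edges = [
    ("nickel", "coin"), ("dime", "coin"), ("coin", "coinage"),
    ("coinage", "currency"), ("currency", "medium_of_exchange"),
    ("budget", "fund"), ("fund", "money"), ("money", "medium_of_exchange"),
]

graph = {}
for a, b in hypernym_edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

def pathlen(c1, c2):
    """1 + number of edges on the shortest path between two concept nodes (BFS)."""
    seen, frontier = {c1}, deque([(c1, 1)])
    while frontier:
        node, length = frontier.popleft()
        if node == c2:
            return length
        for nxt in graph[node] - seen:
            seen.add(nxt)
            frontier.append((nxt, length + 1))
    return float("inf")

def sim_path(c1, c2):
    return 1.0 / pathlen(c1, c2)

print(sim_path("nickel", "coin"))      # 1/2 = 0.5
print(sim_path("nickel", "currency"))  # 1/4 = 0.25
print(sim_path("nickel", "money"))     # 1/6 ~ 0.17
```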
Problems with thesaurus-based meaning
• We don’t have a thesaurus for every language
• Even if we do, they have problems with recall
• Many words are missing
• Most (if not all) phrases are missing
• Some connections between senses are missing
• Thesauri work less well for verbs, adjectives
• Adjectives and verbs have less structured
hyponymy relations

71
Distributional models of meaning

- Also called vector-space models of meaning.


- Offer much higher recall than hand-built thesauri.
- Although they tend to have lower precision

• Zellig Harris (1954): “oculist and eye-doctor …


occur in almost the same environments….
If A and B have almost identical environments we
say that they are synonyms.
• Firth (1957): “You shall know a word by the
company it keeps!”

72
Reminder: Term-document matrix

• Each cell: count of term t in a document d: tft,d:


• Each document is a count vector in ℕv: a column below


73
Reminder: Term-document matrix

• Two documents are similar if their vectors are similar


74
The words in a term-document matrix

• Each word is a count vector in ℕD: a row below



75
The words in a term-document matrix

• Two words are similar if their vectors are similar


76
The Term-Context matrix

• Instead of using entire documents, use smaller contexts


• Paragraph
• Window of 10 words
• A word is now defined by a vector over counts of
context words

77
Sample contexts: 20 words (Brown corpus)
• equal amount of sugar, a sliced lemon, a tablespoonful of apricot
preserve or jam, a pinch each of clove and nutmeg,
• on board for their enjoyment. Cautiously she sampled her first
pineapple and another fruit whose taste she likened to that of
• of a recursive type well suited to programming on
the digital computer. In finding the optimal R-stage
policy from that of
• substantially affect commerce, for the purpose of
gathering data and information necessary for the
study authorized in the first section of this

78
Term-context matrix for word similarity
• Two words are similar in meaning if their context
vectors are similar

             aardvark  computer  data  pinch  result  sugar
apricot          0         0       0     1      0       1
pineapple        0         0       0     1      0       1
digital          0         2       1     0      1       0
information      0         1       6     0      4       0

79
Should we use raw counts?
• For the term-document matrix
• We used tf-idf instead of raw term counts
• For the term-context matrix
• Positive Pointwise Mutual Information
(PPMI) is common

80
Pointwise Mutual Information
• Pointwise mutual information: do events x and y co-occur more than if they were independent?
  PMI(x, y) = log2 [ P(x, y) / ( P(x) P(y) ) ]

• PMI between two words (Church & Hanks 1989): do words w1 and w2 co-occur more than if they were independent?
  PMI(w1, w2) = log2 [ P(w1, w2) / ( P(w1) P(w2) ) ]

• Positive PMI between two words (Niwa & Nitta 1994): replace all PMI values less than 0 with zero
81
Computing PPMI on a term-context matrix
• Matrix F with W rows (words) and C columns (contexts)
• fij is # of times wi occurs in context cj

82
p(w=information,c=data) = 6/19 = .32
p(w=information) = 11/19 = .58
p(c=data) = 7/19 = .37

83
• pmi(information,data) = log2 ( .32 / (.37*.58) ) = .58
(.57 using full precision)

84
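A sketch of the PPMI computation on the term-context matrix shown earlier (apricot, pineapple, digital, information over the contexts computer, data, pinch, result, sugar); representing the matrix as nested Python lists is just for illustration.

```python
import math

# PPMI on the small term-context matrix; these counts give
# p(information, data) = 6/19, p(information) = 11/19, p(data) = 7/19 as above.
words = ["apricot", "pineapple", "digital", "information"]
contexts = ["computer", "data", "pinch", "result", "sugar"]
F = [
    [0, 0, 1, 0, 1],   # apricot
    [0, 0, 1, 0, 1],   # pineapple
    [2, 1, 0, 1, 0],   # digital
    [1, 6, 0, 4, 0],   # information
]

total = sum(sum(row) for row in F)                    # 19
p_w = [sum(row) / total for row in F]                 # row (word) marginals
p_c = [sum(F[i][j] for i in range(len(words))) / total
       for j in range(len(contexts))]                 # column (context) marginals

def ppmi(i, j):
    p_wc = F[i][j] / total
    if p_wc == 0:
        return 0.0
    return max(0.0, math.log2(p_wc / (p_w[i] * p_c[j])))

print(ppmi(words.index("information"), contexts.index("data")))   # ~0.57
```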
Reminder: cosine for computing similarity

cos(v, w) = (v · w) / ( |v| |w| ) = Σi vi wi / ( √(Σi vi²) √(Σi wi²) )

vi is the PPMI value for word v in context i, and wi is the PPMI value for word w in context i.
cos(v, w) is the cosine similarity of v and w (the dot product of the two unit vectors).

85
Cosine as a similarity metric

• -1: vectors point in opposite directions


• +1: vectors point in same directions
• 0: vectors are orthogonal

• Raw frequency or PPMI are non-


negative, so cosine range 0-1
86
large data computer
apricot 1 0 0
digital 0 1 2
information 1 6 1

Which pair of words is more similar?


cosine(apricot,information) =

cosine(digital,information) =

cosine(apricot,digital) =

87
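The three cosines asked for above can be computed directly from the count table (contexts large, data, computer); a short sketch:

```python
import math

def cosine(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    norm = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w))
    return dot / norm if norm else 0.0

# Counts over the contexts (large, data, computer) from the table above.
apricot     = [1, 0, 0]
digital     = [0, 1, 2]
information = [1, 6, 1]

print(cosine(apricot, information))   # ~0.16
print(cosine(digital, information))   # ~0.58
print(cosine(apricot, digital))       # 0.0 -> digital and information are the most similar pair
```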
Other possible similarity measures

88
Evaluating similarity
(the same as for thesaurus-based)
• Intrinsic Evaluation:
• Correlation between algorithm and human word
similarity ratings
• Extrinsic (task-based, end-to-end) Evaluation:
• Spelling error detection, WSD, essay grading
• Taking TOEFL multiple-choice vocabulary tests

Levied is closest in meaning to which of these:


imposed, believed, requested, correlated
89
EXERCISE OF CHAPTER 6
1. Hand simulate the Viterbi algorithm, using the data and probability estimates in Figures 6.4-6.6, on the sentence Flower flowers like flowers. Draw the transition network for the problem (like the networks used in the Viterbi example above) and identify what part of speech the algorithm assigns to each word.
2. Using the bigram and lexical generation probabilities given in this chapter, calculate the word probabilities using the forward algorithm for the sentence The a flies like flower (involving a very rare use of the word a as a noun, as in the a flies, the b flies, and so on). Remember to use 0.0001 as the probability for any bigram not in the table. Are the results you get reasonable? If not, what is the problem and how might it be fixed?
90
EXERCISE OF CHAPTER 6
3. Consider an extended version of the grammar in Figure 6.13 with the additional rule:
10. VP → V PP
The revised rule probabilities are shown here (any rules not mentioned have the same probabilities as in Figure 6.13):
VP → V      0.32        VP → V NP PP   0.20
VP → V NP   0.33        VP → V PP      0.15
In addition, the following bigram probabilities differ from those in Figure 6.4:
PROB(N|V) = 0.53    PROB(ART|V) = 0.32    PROB(P|V) = 0.15
a) Hand simulate (or implement) the forward algorithm on Fruit flies like birds to produce the lexical probabilities.
b) Draw out the full chart for Fruit flies like birds, showing the probabilities of each constituent.
91
EXERCISE OF CHAPTER 6

• Specify the PMI and the positive PMI between each target word and context word in the table below.

             aardvark  computer  data  pinch  result  sugar
Apricot          0         0       0     1      0       1
Pineapple        0         0       0     1      0       1
Digital          0         2       1     0      1       0
Information      0         1       6     0      4       0

92
