Chapter 6: Ambiguity Resolution: Statistical Methods
6.2 Estimating Probabilities
- For instance, we use information on Harry's past performance (his record over 100 previous races) to predict how likely he is to win his 101st race.
- We are interested in parsing sentences that have never been seen before, so we must use data on previously occurring sentences to predict the next sentences.
- We will always be working with estimates of probability rather
than actual probability.
If we have seen the word flies 1000 times before, and 600 of those occurrences were as a verb, we assume that PROB(V | flies) = 0.6.
Sparse data
There are a vast number of estimates needed for natural language applications, and a large proportion of these events are quite rare. This is the problem of sparse data.
For instance, the Brown corpus contains about a million words, but because of duplication there are only about 49,000 different words, and 40,000 of those words occur five times or fewer.
The worst case occurs when a low-frequency word does not occur at all in one of its possible categories.
Its probability in that category would then be estimated as 0, and the probability of any sentence containing the word used in that category would also be 0.
Other techniques attempt to address this problem of estimating the probabilities of low-frequency events.
For a random variable X that can take values x1, …, xM, the estimation techniques all start from a set of values Vi computed from |xi|, the count of the number of times X = xi in the corpus.
The maximum likelihood estimator (MLE) uses Vi = |xi| directly, so that
PROB(X = xi) ≈ Vi / Σj Vj.
One technique for avoiding zero probabilities is to make sure that no Vi has the value 0 by using Vi = |xi| + 0.5, that is, 0.5 is added to every count.
This estimation technique is called the expected likelihood estimator (ELE).
The difference between MLE and ELE:
For instance, consider a word w that does not occur in the corpus, and estimate the probability that w occurs in each of 40 word classes L1 … L40. Compare the estimates produced by MLE and ELE (a sketch follows).
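As a rough illustration of this comparison, here is a minimal Python sketch; the 40 class labels and the all-zero counts are the only inputs, and the helper names are made up for this example:

def compare_mle_ele():
    # Minimal sketch comparing MLE and ELE for a word w that was never
    # observed in the corpus. The 40 word classes are illustrative.
    NUM_CLASSES = 40
    counts = {f"L{i}": 0 for i in range(1, NUM_CLASSES + 1)}  # w never seen in any class

    def mle(counts):
        """Maximum likelihood estimate: V_i = |x_i|."""
        total = sum(counts.values())
        if total == 0:
            return {c: 0.0 for c in counts}      # all zero: the sparse-data problem
        return {c: n / total for c, n in counts.items()}

    def ele(counts):
        """Expected likelihood estimate: V_i = |x_i| + 0.5."""
        values = {c: n + 0.5 for c, n in counts.items()}
        total = sum(values.values())
        return {c: v / total for c, v in values.items()}

    print(mle(counts)["L1"])   # 0.0   -> every category gets probability 0
    print(ele(counts)["L1"])   # 0.025 -> each of the 40 categories gets 0.5/20

compare_mle_ele()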
Evaluation
We want to know how well a new technique performs compared with other algorithms, or with variants of our own algorithm.
The general method for doing this is to divide the corpus into two parts, the training set and the test set. The test set typically consists of 10 to 20% of the total data.
- The training set is used to estimate the probabilities, and
- the algorithm is then run on the test set to see how well it does on new data.
- A more thorough method of testing is called cross-validation: repeatedly remove different parts of the corpus as the test set, train on the remainder of the corpus, and then evaluate on the held-out part (sketched below).
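A minimal sketch of k-fold cross-validation, assuming hypothetical train_model and evaluate functions standing in for whatever estimator and scoring procedure are being tested:

def cross_validate(corpus, k, train_model, evaluate):
    """Split the corpus into k folds; train on k-1 folds, test on the rest."""
    fold_size = len(corpus) // k          # any remainder simply stays in training
    scores = []
    for i in range(k):
        test = corpus[i * fold_size : (i + 1) * fold_size]
        train = corpus[: i * fold_size] + corpus[(i + 1) * fold_size :]
        model = train_model(train)        # e.g. estimate lexical/bigram probabilities
        scores.append(evaluate(model, test))
    return sum(scores) / k                # average score over the k folds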
6.3 Part-of-Speech Tagging
- Part-of-speech tagging involves selecting the most likely sequence of syntactic categories for the words in a sentence.
- A typical set of tags, used in the Penn Treebank project, is shown in Figure 6.3.
- The general method to improve reliability is to use some of the
local context of the sentence in which the word appears.
- For instance, if the word is preceded by the word the, it is much more likely to be a noun (N). In this section, we develop a technique that exploits such local information.
Let W1, …, WT be a sequence of words. We want to find the sequence of lexical categories C1, …, CT that maximizes
Figure 6.3 The Penn Treebank tagset
1) PROB(C1, …, CT | W1, …, WT)
We attack this problem using Bayes' rule, which says that this conditional probability equals
2) (PROB(C1, …, CT) * PROB(W1, …, WT | C1, …, CT)) / PROB(W1, …, WT)
Since we are looking for the C1, …, CT that gives the maximum value, the common denominator PROB(W1, …, WT) is the same in all cases and will not affect the answer. Thus the problem reduces to finding the C1, …, CT that maximizes the formula
3) PROB(C1, …, CT) * PROB(W1, …, WT | C1, …, CT)
There are still no effective methods for computing the probabilities of such long sequences accurately, as doing so would require far too much data.
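The standard way around this, which the rest of this section follows and which Figure 6.8 exploits as the Markov assumption, is to approximate both probabilities with independence assumptions. A compact sketch of the usual bigram approximation, in terms of the bigram and lexical probabilities used in the Viterbi algorithm below:

\[
\text{PROB}(C_1,\dots,C_T) \approx \prod_{i=1}^{T} \text{PROB}(C_i \mid C_{i-1}),
\qquad
\text{PROB}(W_1,\dots,W_T \mid C_1,\dots,C_T) \approx \prod_{i=1}^{T} \text{PROB}(W_i \mid C_i)
\]

with \(C_0\) taken to be the start-of-sentence marker \(\varphi\), so the quantity to maximize becomes \(\prod_{i=1}^{T} \text{PROB}(C_i \mid C_{i-1})\,\text{PROB}(W_i \mid C_i)\).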
To deal with the problem of sparse data, any bigram not listed in Figure 6.4 is assumed to have a small token probability of 0.0001.
(Table columns: category, count at i, pair, count at i,i+1, bigram estimate)
Figure 6.8 Encoding the 256 possible sequences exploiting the Markov assumption
The Viterbi algorithm
Given a word sequence w1 … wT, lexical categories L1 … LN, lexical probabilities PROB(wi | Li), and bigram probabilities PROB(Li | Li-1), find the most likely sequence of lexical categories C1 … CT for the word sequence.
Initialization step:
For i := 1 to N do
  SEQSCORE(i, 1) = PROB(w1 | Li) * PROB(Li | φ)
  BACKPTR(i, 1) = 0
(φ denotes the start of the sentence.)
The Viterbi algorithm (cont.)
Iteration step:
For t := 2 to T do
  For i := 1 to N do
    SEQSCORE(i, t) = MAX over j = 1..N of (SEQSCORE(j, t-1) * PROB(Li | Lj)) * PROB(wt | Li)
    BACKPTR(i, t) = the index j that gave the maximum above
Sequence identification step:
C(T) = the i that maximizes SEQSCORE(i, T)
For i := T-1 down to 1 do
  C(i) = BACKPTR(C(i+1), i+1)
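A minimal Python sketch of the algorithm above, assuming the lexical probabilities PROB(w | L) and bigram probabilities PROB(L | Lprev) are supplied as dictionaries; the 'phi' start marker and the 0.0001 fallback for unlisted bigrams follow the conventions used earlier in this section:

def viterbi(words, categories, lexical, bigram, unseen=0.0001):
    # lexical[(w, L)] stands for PROB(w | L); bigram[(L, Lprev)] for PROB(L | Lprev);
    # 'phi' marks the start of the sentence.
    T, N = len(words), len(categories)
    seqscore = [[0.0] * T for _ in range(N)]
    backptr = [[0] * T for _ in range(N)]

    # Initialization step
    for i, L in enumerate(categories):
        seqscore[i][0] = lexical.get((words[0], L), 0.0) * bigram.get((L, 'phi'), unseen)

    # Iteration step
    for t in range(1, T):
        for i, L in enumerate(categories):
            best_j, best = max(
                ((j, seqscore[j][t - 1] * bigram.get((L, categories[j]), unseen))
                 for j in range(N)),
                key=lambda pair: pair[1])
            seqscore[i][t] = best * lexical.get((words[t], L), 0.0)
            backptr[i][t] = best_j

    # Sequence identification step
    tags = [0] * T
    tags[T - 1] = max(range(N), key=lambda i: seqscore[i][T - 1])
    for t in range(T - 2, -1, -1):
        tags[t] = backptr[tags[t + 1]][t + 1]
    return [categories[i] for i in tags]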
Example: Using the Viterbi algorithm, compute the probability of the best category sequence for W = flies like a flower, with L = {V, N, ART, P}.
For t = 3 (the word a), i = 1 to 4:
SEQ(1,3) = max over j=1..4 of (SEQ(1,2)*PROB(V|V), SEQ(2,2)*PROB(V|N), SEQ(4,2)*PROB(V|P)) * PROB(a|V)
         = max(3.1×10^-4 × 10^-4, 1.13×10^-5 × 0.43, 2.2×10^-4 × 10^-4) × 0 = 0
SEQ(2,3) = max over j=1..4 of (SEQ(1,2)*PROB(N|V), SEQ(2,2)*PROB(N|N), SEQ(4,2)*PROB(N|P)) * PROB(a|N)
         = max(3.1×10^-4 × 0.35, 1.13×10^-5 × 0.13, 2.2×10^-4 × 0.26) × 0.01 = 5.7×10^-7
SEQ(3,3) = max over j=1..4 of (SEQ(1,2)*PROB(ART|V), SEQ(2,2)*PROB(ART|N), SEQ(4,2)*PROB(ART|P)) * PROB(a|ART)
         = max(3.1×10^-4 × 0.65, 1.13×10^-5 × 10^-4, 2.2×10^-4 × 0.74) × 0.36 = 2.01×10^-5
SEQ(4,3) = max over j=1..4 of (SEQ(1,2)*PROB(P|V), SEQ(2,2)*PROB(P|N), SEQ(4,2)*PROB(P|P)) * PROB(a|P)
         = max(3.1×10^-4 × 10^-4, 1.13×10^-5 × 0.44, 2.2×10^-4 × 10^-4) × 0 = 0
Figure 6.10 The forward algorithm for computing the lexical probabilities
6.4 Obtaining Lexical Probabilities
6.5 Language Modeling: Introduction
Zeros
6.6 Applications for Spelling Correction
• Word processing
• Phones
• Web search
Spelling Tasks
Non-word spelling errors
• Non-word spelling error detection:
• Any word not in a dictionary is an error
• The larger the dictionary the better
• Non-word spelling error correction:
• Generate candidates: real words that are similar to error
• Choose the one which is best:
• Shortest weighted edit distance
• Highest noisy channel probability
Non-word spelling error example: acress
Non-word spelling errors (cont.)
• Using a bigram language model to score candidate corrections in context (see the sketch below)
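A rough sketch of how these pieces fit together under the noisy-channel view; the dictionary, channel (error) probabilities, and bigram probabilities below are hypothetical placeholders that a real system would estimate from corpora and observed typos:

def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def correct(word, prev_word, dictionary, channel_prob, bigram_prob):
    """Pick the in-dictionary candidate maximizing P(typo | cand) * P(cand | prev)."""
    candidates = [c for c in edits1(word) if c in dictionary] or [word]
    return max(candidates,
               key=lambda c: channel_prob.get((word, c), 1e-6) *
                             bigram_prob.get((c, prev_word), 1e-6))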
6.7 Probabilistic Context-Free Grammars
Figure 6.15 The full chart for a flower as an NP
Machine translation
The idea behind statistical machine translation comes from information theory.
- A document is translated according to the probability distribution p(e|f) that a string e in the target language (for example, English) is the translation of a string f in the source language (for example, French).
- The problem of modeling the probability distribution p(e|f) has been approached in a number of ways. One approach which lends itself well to computer implementation is to apply Bayes' theorem, that is, p(e|f) ∝ p(f|e)p(e), where the translation model p(f|e) is the probability that the source string is the translation of the target string, and the language model p(e) is the probability of seeing that target-language string. This decomposition is attractive as it splits the problem into two subproblems. Finding the best translation ẽ is done by picking the one that gives the highest probability:
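Written out, this is the usual noisy-channel decoding rule, a compact statement of the maximization just described:

\[
\tilde{e} = \operatorname*{arg\,max}_{e} \; p(e \mid f) = \operatorname*{arg\,max}_{e} \; p(f \mid e)\, p(e)
\]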
6.9 Word Similarity
Why word similarity
• Information retrieval
• Question answering
• Machine translation
• Natural language generation
• Language modeling
• Automatic essay grading
• Plagiarism detection
• Document clustering
Word similarity and word relatedness
Two classes of similarity algorithms
• Thesaurus-based algorithms
• Are words “nearby” in hypernym hierarchy?
• Do words have similar glosses (definitions)?
• Distributional algorithms
• Do words have similar distributional contexts?
Path-based similarity
• simpath(c1,c2) = 1/pathlen(c1,c2) (see the worked example below)
Summary: thesaurus-based similarity
Example: path-based similarity
simpath(c1,c2) = 1/pathlen(c1,c2)
simpath(nickel,coin) = 1/2 = .5
simpath(fund,budget) = 1/2 = .5
simpath(nickel,currency) = 1/4 = .25
simpath(nickel,money) = 1/6 = .17
simpath(coinage,Richter scale) = 1/6 = .17
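A minimal Python sketch of path-based similarity over a tiny hand-built hypernym graph; the fragment of hierarchy below is an illustrative assumption (it reproduces a few of the numbers above but is not the real thesaurus), and pathlen here counts the nodes on the shortest path:

from collections import deque

# Toy hypernym links: child -> parent (illustrative fragment only).
hypernym = {
    "nickel": "coin", "dime": "coin", "coin": "coinage",
    "coinage": "currency", "currency": "medium of exchange",
    "budget": "fund", "fund": "money", "money": "medium of exchange",
}

def pathlen(c1, c2):
    """Number of nodes on the shortest path between c1 and c2 (undirected)."""
    adj = {}
    for child, parent in hypernym.items():
        adj.setdefault(child, set()).add(parent)
        adj.setdefault(parent, set()).add(child)
    seen, queue = {c1}, deque([(c1, 1)])
    while queue:
        node, dist = queue.popleft()
        if node == c2:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return float("inf")

def sim_path(c1, c2):
    return 1.0 / pathlen(c1, c2)

print(sim_path("nickel", "coin"))      # 0.5, matching 1/2 above
print(sim_path("nickel", "currency"))  # 0.25, matching 1/4 above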
Problems with thesaurus-based meaning
• We don’t have a thesaurus for every language
• Even if we do, they have problems with recall
• Many words are missing
• Most (if not all) phrases are missing
• Some connections between senses are missing
• Thesauri work less well for verbs, adjectives
• Adjectives and verbs have less structured
hyponymy relations
Distributional models of meaning
Reminder: Term-document matrix
The words in a term-document matrix
The Term-Context matrix
Sample contexts: 20 words (Brown corpus)
• equal amount of sugar, a sliced lemon, a tablespoonful of apricot
preserve or jam, a pinch each of clove and nutmeg,
• on board for their enjoyment. Cautiously she sampled her first
pineapple and another fruit whose taste she likened to that of
• of a recursive type well suited to programming on
the digital computer. In finding the optimal R-stage
policy from that of
• substantially affect commerce, for the purpose of
gathering data and information necessary for the
study authorized in the first section of this
Term-context matrix for word similarity
• Two words are similar in meaning if their context
vectors are similar
Should we use raw counts?
• For the term-document matrix
• We used tf-idf instead of raw term counts
• For the term-context matrix
• Positive Pointwise Mutual Information
(PPMI) is common
Pointwise Mutual Information
• Do events x and y co-occur more often than if they were independent?
• PMI(x, y) = log2 [ P(x,y) / (P(x) P(y)) ]
p(w=information,c=data) = 6/19 = .32
p(w=information) = 11/19 = .58
p(c=data) = 7/19 = .37
• pmi(information,data) = log2 ( .32 / (.37*.58) ) = .58
(.57 using full precision)
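A small Python sketch that reproduces this arithmetic from raw co-occurrence counts; the 2x2 count table is made up solely so that its totals match the probabilities quoted above (total 19, joint count 6, row total 11, column total 7) and is not the real term-context matrix:

import math

# Made-up counts chosen only to match the quoted probabilities.
counts = {
    ("information", "data"): 6, ("information", "other"): 5,
    ("w2", "data"): 1, ("w2", "other"): 7,
}
total = sum(counts.values())  # 19

def p_joint(w, c):
    return counts.get((w, c), 0) / total

def p_word(w):
    return sum(n for (wi, _), n in counts.items() if wi == w) / total

def p_context(c):
    return sum(n for (_, ci), n in counts.items() if ci == c) / total

def pmi(w, c):
    return math.log2(p_joint(w, c) / (p_word(w) * p_context(c)))

def ppmi(w, c):
    return max(pmi(w, c), 0.0)   # clamp negative PMI values to 0

print(round(pmi("information", "data"), 2))   # 0.57 using full precision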
Reminder: cosine for computing similarity
cos(v, w) = (v · w) / (|v| |w|), i.e. the dot product of the two vectors divided by the product of their lengths (the dot product of the corresponding unit vectors)
Cosine as a similarity metric
cosine(digital,information) =
cosine(apricot,digital) =
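A minimal Python sketch of the cosine measure over count vectors (the example vectors are arbitrary placeholders, not data from the text):

import math

def cosine(v, w):
    """Dot product of v and w divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    if norm_v == 0 or norm_w == 0:
        return 0.0
    return dot / (norm_v * norm_w)

print(cosine([0, 1, 2], [1, 1, 0]))  # about 0.316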
Other possible similarity measures
Evaluating similarity
(the same as for thesaurus-based)
• Intrinsic Evaluation:
• Correlation between algorithm and human word
similarity ratings
• Extrinsic (task-based, end-to-end) Evaluation:
• Spelling error detection, WSD, essay grading
• Taking TOEFL multiple-choice vocabulary tests