Chapter 9
Given the data in Figure 9.11, we want to select a sense for the word bridge,
using a window size of 11 words in a corpus of 10,000,000 words.
Figure 9.11: The counts for the senses for bridge in a hypothetical corpus
Ambiguity Resolution
9.4 Statistical Word Sense Disambiguation
Given the data in Figure 9.11, we get the following estimates:
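$$\mathrm{PROB}_n(\text{teeth} \mid \text{bridge/STRUCTURE1}) = 1/5651 = 1.77 \times 10^{-4}$$
$$\mathrm{PROB}_n(\text{the} \mid \text{bridge/STRUCTURE1}) \times \mathrm{PROB}(\text{bridge/STRUCTURE1}) = 0.97 \times 0.113 = 0.109$$
$$\mathrm{PROB}_n(\text{the} \mid \text{bridge/DENTAL-DEV37}) \times \mathrm{PROB}(\text{bridge/DENTAL-DEV37}) = 0.93 \times 3.87 \times 10^{-4} = 3.6 \times 10^{-4}$$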
It is the content words, like teeth in this example, that have the most
dramatic effect. For instance:
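$$\mathrm{PROB}_n(\text{dentist} \mid \text{bridge/STRUCTURE1}) \times \mathrm{PROB}(\text{bridge/STRUCTURE1}) = 3.54 \times 10^{-4} \times 0.113 = 4 \times 10^{-5}$$
$$\mathrm{PROB}_n(\text{dentist} \mid \text{bridge/DENTAL-DEV37}) \times \mathrm{PROB}(\text{bridge/DENTAL-DEV37}) = 0.18 \times 3.87 \times 10^{-4} = 6.97 \times 10^{-5}$$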
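The comparison above can be sketched in a few lines of Python. This is only a minimal illustration: each sense is scored by multiplying the context word's conditional estimate by the prior estimate for that sense, and the sense with the higher score is preferred. The numbers are the estimates quoted in the text; the names PRIOR, COND, sense_scores and best_sense are made up for this sketch.

```python
# Score each sense of "bridge" given a single context word:
#   score(sense) = PROBn(word | bridge/sense) * PROB(bridge/sense)
# All numbers below are the estimates quoted in the text.

# Prior estimate for each sense of "bridge".
PRIOR = {
    "STRUCTURE1": 0.113,
    "DENTAL-DEV37": 3.87e-4,
}

# Conditional estimate of seeing a context word in the window, per sense.
COND = {
    ("the", "STRUCTURE1"): 0.97,
    ("the", "DENTAL-DEV37"): 0.93,
    ("dentist", "STRUCTURE1"): 3.54e-4,
    ("dentist", "DENTAL-DEV37"): 0.18,
}


def sense_scores(word):
    """Return {sense: conditional estimate * prior} for one context word."""
    return {sense: COND[(word, sense)] * prior for sense, prior in PRIOR.items()}


def best_sense(word):
    """Return the sense with the highest score for this context word."""
    scores = sense_scores(word)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    for word in ("the", "dentist"):
        print(word, sense_scores(word), best_sense(word))
```

With these estimates, a function word such as the leaves the prior in control, so STRUCTURE1 is preferred, while dentist is enough to tip the choice to DENTAL-DEV37.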
Of course, with a larger window, there are many more chances for
content words to strongly affect the decision.
Example: The dentist put a bridge on my teeth
The words teeth and dentist together in the same window combine to
strongly prefer the rare sense of the word bridge.
In fact, the estimate for the sense DENTAL-DEV37 would be
$3.6 \times 10^{-6}$, considerably greater than the estimate of
$7.08 \times 10^{-7}$ for STRUCTURE1.
Collocations and Mutual Information
Other work in this area uses collocations, which measure how likely two
words are to co-occur in a window of text. One way to compute such a
measure is to consider a correlation statistic (where n is the window
size):
$$C_n(w/S, w') = \frac{\mathrm{PROB}(w/S \text{ and } w' \text{ are in the same window})}{\mathrm{PROB}(w/S \text{ in the window}) \times \mathrm{PROB}(w' \text{ in the window})}$$
If K is the number of windows in the corpus, then each of the
probabilities above can be estimated as:
$$\mathrm{Count}(\#\text{times event occurs in window}) / K$$
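After substituting such estimates for each probability used in
$C_n(w/S, w')$ and simplifying, we get the formula:
$$C_n(w/S, w') = \frac{K \times \mathrm{Count}(\#\text{times } w/S \text{ and } w' \text{ co-occur in window})}{\mathrm{Count}(\#\text{times } w/S \text{ in window}) \times \mathrm{Count}(\#\text{times } w' \text{ in window})}$$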
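As a rough illustration of the count-based formula, the following Python sketch computes the statistic over a list of windows and checks that it agrees with the probability form, in which each probability is estimated as Count/K. The toy windows, the sense-tagged token strings, and the function names are assumptions made for this example, not data from the text. Returning 0.0 when a count is zero is also a choice of this sketch.

```python
def correlation(windows, target, other):
    """C_n(target, other) = K * Count(both) / (Count(target) * Count(other)).

    `windows` is a list of sets of tokens; a sense-tagged word is written
    as "word/SENSE" (a convention assumed for this sketch).
    """
    k = len(windows)
    n_target = sum(1 for w in windows if target in w)
    n_other = sum(1 for w in windows if other in w)
    n_both = sum(1 for w in windows if target in w and other in w)
    if n_target == 0 or n_other == 0:
        return 0.0  # statistic undefined; 0.0 is an arbitrary fallback
    return k * n_both / (n_target * n_other)


def correlation_from_probs(windows, target, other):
    """Same statistic, computed from the Count/K probability estimates."""
    k = len(windows)
    p_target = sum(1 for w in windows if target in w) / k
    p_other = sum(1 for w in windows if other in w) / k
    p_both = sum(1 for w in windows if target in w and other in w) / k
    if p_target == 0 or p_other == 0:
        return 0.0
    return p_both / (p_target * p_other)


# Hypothetical toy corpus of K = 4 windows.
WINDOWS = [
    {"dentist", "bridge/DENTAL-DEV37", "teeth"},
    {"river", "bridge/STRUCTURE1", "car"},
    {"dentist", "teeth", "chair"},
    {"bridge/STRUCTURE1", "river"},
]

if __name__ == "__main__":
    for sense in ("bridge/DENTAL-DEV37", "bridge/STRUCTURE1"):
        print(sense, "dentist", correlation(WINDOWS, sense, "dentist"))
```

A value above 1 means the two items share a window more often than independent occurrence would predict. A value near 0 means they rarely or never co-occur.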