Lec04 SpellingCorrection
• Real-word errors: Real-word spelling correction is the task of detecting and
correcting spelling errors that accidentally result in another actual word.
– Typographical errors
• three → there
– Cognitive errors (homophones)
• piece → peace
• too → two
[Figure: noisy channel model. SOURCE → word → noisy channel → noisy word → DECODER → guess at original word]
• In this problem, the real word is the source and the misspelled word is the signal.
• Assume that
V is the space of all the words we know
s denotes the misspelling (signal)
ŵ denotes the correct word (estimate)
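The definitions above lead to the usual noisy channel decision rule. As a sketch, write \hat{w} for the estimate of the correct word (the symbol name is an assumption here) and apply Bayes' rule. P(s) is the same for every candidate w, so it drops out of the argmax, leaving the channel model P(s|w) times the prior P(w):

```latex
\hat{w} = \operatorname*{argmax}_{w \in V} P(w \mid s)
        = \operatorname*{argmax}_{w \in V} \frac{P(s \mid w)\, P(w)}{P(s)}
        = \operatorname*{argmax}_{w \in V} P(s \mid w)\, P(w)
```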
• Computing P(s|w)
– It is fruitless to collect statistics about the misspellings of individual words
for a given dictionary. We will likely never get enough data.
– We need a way to compute P(s|w) without using direct information.
– We can use spelling error pattern statistics to compute P(s|w).
Spelling Error Patterns
• There are four patterns:
Insertion -- ther for the
Deletion -- ther for there
Substitution -- noq for now
Transposition -- hte for the
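The four patterns can be told apart mechanically. The following is a minimal Python sketch, not a reference implementation: the function name classify_edit, the returned tuple layout, and the "#" word-start marker are all assumptions made for illustration. It handles only typos that are exactly one edit away from the correct word.

```python
def classify_edit(correct, typed):
    """Return (pattern, x, y) for a single-edit typo, or None otherwise.

    x and y are the characters involved in the edit, ordered as they
    appear in the correct word (or as typed, for an inserted character).
    """
    # Find the first position where the two strings disagree.
    i = 0
    while i < len(correct) and i < len(typed) and correct[i] == typed[i]:
        i += 1
    # Character before the edit; "#" marks the word start (assumed convention).
    prev = correct[i - 1] if i > 0 else "#"

    # Insertion: typed has one extra character at position i.
    if len(typed) == len(correct) + 1 and correct[i:] == typed[i + 1:]:
        return ("insertion", prev, typed[i])
    # Deletion: typed is missing the character at position i.
    if len(typed) == len(correct) - 1 and correct[i + 1:] == typed[i:]:
        return ("deletion", prev, correct[i])
    if len(typed) == len(correct) and i < len(correct):
        # Substitution: one character replaced, the rest identical.
        if correct[i + 1:] == typed[i + 1:]:
            return ("substitution", correct[i], typed[i])
        # Transposition: two adjacent characters swapped.
        if (i + 1 < len(correct)
                and correct[i] == typed[i + 1]
                and correct[i + 1] == typed[i]
                and correct[i + 2:] == typed[i + 2:]):
            return ("transposition", correct[i], correct[i + 1])
    return None


# The four examples from the list above.
for correct, typed in [("the", "ther"), ("there", "ther"),
                       ("now", "noq"), ("the", "hte")]:
    print(typed, "for", correct, "->", classify_edit(correct, typed))
```

Words that differ by more than one edit, or not at all, return None in this sketch.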
• For each pattern we need a confusion matrix.
– del[x,y] contains the number of times in the training set that characters xy
in the correct word were typed as x.
– ins[x,y] contains the number of times in the training set that character x
in the correct word was typed as xy.
– sub[x,y] contains the number of times that x was typed as y.
– trans[x,y] contains the number of times that xy was typed as yx.
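The matrices hold raw counts, so they still have to be normalised to give P(s|w). The Python sketch below divides each count by how often the affected character or character pair occurs in the correct text, which is one common choice but an assumption here, since the notes above only define the counts. The function name channel_prob and every number in the toy tables are hypothetical.

```python
def channel_prob(kind, x, y, matrices, char_count, pair_count):
    """Estimate P(s|w) for a single edit from confusion-matrix counts.

    kind is one of "del", "ins", "sub", "trans", using the matrix
    definitions above. The normalisation is an assumed convention:
      del[x,y]   / count of the pair xy   (xy typed as x)
      trans[x,y] / count of the pair xy   (xy typed as yx)
      ins[x,y]   / count of the char x    (x typed as xy)
      sub[x,y]   / count of the char x    (x typed as y)
    """
    count = matrices[kind].get((x, y), 0)
    if kind in ("del", "trans"):
        total = pair_count.get(x + y, 0)
    elif kind in ("ins", "sub"):
        total = char_count.get(x, 0)
    else:
        raise ValueError(f"unknown pattern: {kind}")
    return count / total if total else 0.0


# Toy counts, invented purely for illustration.
matrices = {
    "del": {("r", "e"): 5},    # "re" typed as "r"  (ther for there)
    "ins": {("e", "r"): 2},    # "e" typed as "er"  (ther for the)
    "sub": {("w", "q"): 3},    # "w" typed as "q"   (noq for now)
    "trans": {("t", "h"): 4},  # "th" typed as "ht" (hte for the)
}
char_count = {"e": 1000, "w": 600}
pair_count = {"re": 500, "th": 800}

for kind, (x, y) in [("del", ("r", "e")), ("ins", ("e", "r")),
                     ("sub", ("w", "q")), ("trans", ("t", "h"))]:
    print(kind, x + y, channel_prob(kind, x, y, matrices, char_count, pair_count))
```

Unseen edits get probability 0.0 in this sketch. A real system would smooth the counts instead.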
• Counts from the Corpus of Contemporary American English with add-1 smoothing
[Figure: lattice of candidate words for a three-word input. Position 1: to, too, two. Position 2: on, of. Position 3: threw, the, thaw.]
• Channel model
– Same as for non-word spelling correction
– Plus need probability for no error, P(w|w)
• Probability of no error
– What is the channel probability for a correctly typed word?
• P(“the”|“the”)
– Obviously this depends on the application
• .90 (1 error in 10 words)
• .95 (1 error in 20 words)
• .99 (1 error in 100 words)
• .995 (1 error in 200 words)
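The values above follow from P(w|w) = 1 - 1/n when one word in n is mistyped. The Python sketch below shows how that choice can change the decision for a typed real word. It is only an illustration: the function names, the edit probability, and the word priors are all hypothetical, chosen so that the two candidates score close to each other.

```python
def no_error_prob(words_per_error):
    """P(w|w) when one word in `words_per_error` is mistyped."""
    return 1.0 - 1.0 / words_per_error


def correct(typed, candidates, alpha, edit_prob, prior):
    """Pick the candidate w that maximises P(typed|w) * P(w).

    The channel term is alpha (the no-error probability) when w is the
    typed word itself, and the edit-based channel probability otherwise.
    """
    def score(w):
        channel = alpha if w == typed else edit_prob.get((typed, w), 0.0)
        return channel * prior[w]
    return max(candidates, key=score)


# Hypothetical numbers for the "three" / "there" example.
EDIT_PROB = {("three", "there"): 9.5e-4}   # P(typed "three" | meant "there")
PRIOR = {"three": 1e-5, "there": 1e-2}     # toy unigram probabilities

for n in (10, 20, 100, 200):
    alpha = no_error_prob(n)
    best = correct("three", ["three", "there"], alpha, EDIT_PROB, PRIOR)
    print(f"alpha={alpha:.3f} -> {best}")
```

With these toy numbers a low no-error probability favours changing the word, while a high one leaves the typed word alone.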