CS 3308 Learning Journal Unit 2
Gram Correspondence
rectifying errors within a given text corpus. This paper applies a local dictionary to correct
misspellings, using Levenshtein distance and k-gram overlap as the primary methods. The context
for this discussion is a document, Doc1, containing two misspelled words: "inforomation" and
"jopardy." The purpose of this report is to explain the role of the dictionary in spelling
correction, detail the procedures involved in employing these two methods, and assess
In spelling correction, a dictionary serves as the definitive source of correctly spelled words.
When addressing the errors present in Doc1, the dictionary provides a set of candidates that are
close in spelling to the misspelled terms. This process depends on similarity metrics, such as the
Levenshtein distance and k-gram overlap, which quantify the proximity between the incorrect and
correct spellings. While the dictionary used in this instance is relatively small, thus simplifying the task
Levenshtein Distance
The Levenshtein distance is a metric that evaluates the similarity between two strings by
counting the minimum number of single-character insertions, deletions, and substitutions
needed to turn one into the other. For "inforomation," the system computes the edit distance
relative to each term in the dictionary. The word with the smallest distance is selected as
the appropriate correction,
Example 1:
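    I  N  F  O  R  M  A  T  I  O  N
I   0  1  2  3  4  5  6  7  8  9 10
N   1  0  1  2  3  4  5  6  7  8  9
F   2  1  0  1  2  3  4  5  6  7  8
O   3  2  1  0  1  2  3  4  5  6  7
R   4  3  2  1  0  1  2  3  4  5  6
O   5  4  3  2  1  1  2  3  4  4  5
M   6  5  4  3  2  1  2  3  4  5  5
A   7  6  5  4  3  2  1  2  3  4  5
T   8  7  6  5  4  3  2  1  2  3  4
I   9  8  7  6  5  4  3  2  1  2  3
O  10  9  8  7  6  5  4  3  2  1  2
N  11 10  9  8  7  6  5  4  3  2  1
Rows follow the letters of "inforomation" and columns those of "information." The bottom-right
cell gives the edit distance of 1, which corresponds to deleting the extra "o."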
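The edit-distance procedure described in this section can be sketched in Python. This is a minimal illustration, not the exact system used for Doc1: the function names `levenshtein` and `correct` and the four-word `DICTIONARY` are assumptions made for the example, and each edit is given a cost of 1.

```python
from typing import Iterable


def levenshtein(source: str, target: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    # previous[j] is the distance between the source prefix handled so far and target[:j].
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into the empty string takes i deletions
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def correct(word: str, dictionary: Iterable[str]) -> str:
    """Return the dictionary entry with the smallest edit distance to word."""
    return min(dictionary, key=lambda entry: levenshtein(word, entry))


# Hypothetical local dictionary, assumed for illustration only.
DICTIONARY = ["information", "jeopardy", "retrieval", "dictionary"]

print(correct("inforomation", DICTIONARY))  # information
print(correct("jopardy", DICTIONARY))       # jeopardy
```

Keeping only two rows of the table at a time is a design choice that saves memory; the full table is needed only when the sequence of edits itself must be recovered.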
For "jopardy," the Levenshtein distance calculation reveals a distance of 1 between the
K-Gram Overlap
The k-gram overlap technique involves breaking each word into overlapping contiguous
substrings of length k and then measuring how many of them two words share. With k set to 3
for this example, the overlap is computed as follows:
Example 1:
"inforomation": {"inf", "nfo", "for", "oro", "rom", "oma", "mat", "ati", "tio", "ion"}
"information": {"inf", "nfo", "for", "orm", "rma", "mat", "ati", "tio", "ion"}
Example 2:
suitable match.
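A short Python sketch of the k-gram comparison follows. It assumes the overlap is scored with the Jaccard coefficient (shared k-grams divided by all distinct k-grams in either word), which is one common choice and may differ from the scoring used in this report; the function names `k_grams` and `jaccard` are hypothetical.

```python
def k_grams(word: str, k: int = 3) -> set:
    """Return the set of contiguous substrings of length k in word."""
    return {word[i:i + k] for i in range(len(word) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Shared k-grams divided by all distinct k-grams in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


misspelled = k_grams("inforomation")
candidate = k_grams("information")
shared = misspelled & candidate

print(sorted(shared))                    # ['ati', 'for', 'inf', 'ion', 'mat', 'nfo', 'tio']
print(len(shared), len(misspelled | candidate))  # 7 12
print(round(jaccard(misspelled, candidate), 3))  # 0.583
```

For this pair, 7 of the 12 distinct trigrams are shared, so a candidate that scores well above the other dictionary entries would be the one proposed as the correction.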
The arrangement of the dictionary significantly impacts the performance of the spelling
correction process:
Search Complexity: In a sorted dictionary, binary search can be employed to swiftly
dealing with large, unsorted dictionaries, particularly in real-time systems where speed is
of the essence.
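The difference between sorted and unsorted lookup can be sketched as follows. This is an illustrative example using Python's standard `bisect` module; the helper names `contains_sorted` and `contains_unsorted` and the word list are assumptions made for the example.

```python
from bisect import bisect_left


def contains_sorted(sorted_words, word):
    """Binary search: O(log n) comparisons, valid only if sorted_words is sorted."""
    index = bisect_left(sorted_words, word)
    return index < len(sorted_words) and sorted_words[index] == word


def contains_unsorted(words, word):
    """Linear scan: O(n) comparisons, works regardless of order."""
    return any(entry == word for entry in words)


# Hypothetical dictionary contents, assumed for illustration only.
unsorted_words = ["retrieval", "jeopardy", "dictionary", "information"]
sorted_words = sorted(unsorted_words)

print(contains_sorted(sorted_words, "jeopardy"))  # True
print(contains_sorted(sorted_words, "jopardy"))   # False
```

Note that binary search speeds up the exact-match check that decides whether a word needs correcting at all. Scoring candidates by edit distance or k-gram overlap still compares the misspelling against each dictionary entry unless an additional index is built.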
Despite these implications for search efficiency, the accuracy of the Levenshtein distance and k-
Conclusion
distance and k-gram overlap algorithms, effectively addresses the spelling errors present in
Doc1. While an unsorted dictionary may slow the correction process, the accuracy of these
similarity metrics is unaffected. This analysis underscores the importance of structured data
and appropriate metrics in textual error correction.
References
Manning, C. D., Raghavan, P., & Schütze, H. (2009). An introduction to information retrieval. In
https://nlp.stanford.edu/IR-book/pdf/03dict.pdf