CS 3308 Learning Journal Unit 2

This paper discusses the use of a local dictionary for spelling correction through Levenshtein distance and k-gram overlap methods, specifically addressing errors in a document containing 'inforomation' and 'jopardy.' It highlights the dictionary's role in providing correct spelling candidates and the implications of using an unordered dictionary on search efficiency. The study concludes that while an unsorted dictionary may hinder correction speed, the effectiveness of the similarity metrics remains unchanged.

Uploaded by

Reg

Spelling Correction Utilizing a Local Lexicon: Implementing Levenshtein Distance and K-Gram Correspondence

Spelling correction is an essential part of text processing: it identifies and corrects misspelled words in a body of text. This paper examines how a local dictionary can be used to correct misspellings, with Levenshtein distance and k-gram overlap as the two similarity measures. The working example is a document, Doc1, that contains two misspelled words: "inforomation" and "jopardy." The paper explains the role of the dictionary in spelling correction, walks through both methods, and assesses how an unordered dictionary affects the correction process.

The Dictionary as a Catalyst for Spelling Correction

In spelling correction, the dictionary is the reference list of correctly spelled words. For the errors in Doc1, it supplies the candidate words that are orthographically close to each misspelled term. Closeness is measured with similarity metrics such as Levenshtein distance and k-gram overlap, which quantify how far a misspelling is from each correct spelling. The dictionary used here is small, so every entry can be compared directly; for the large lexicons found in real-world applications, more efficient search methods are needed to keep the number of comparisons manageable.

Methodology: Levenshtein Distance and K-Gram Overlap

Levenshtein Distance

The Levenshtein distance measures how different two strings are: it is the minimum number of single-character operations (insertions, deletions, or substitutions) required to convert one string into the other (Manning, Raghavan, & Schütze, 2009). To correct "inforomation," the system computes the edit distance between it and each term in the dictionary and selects the term with the smallest distance, which in this case is "information."
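The procedure described above can be sketched in Python as follows. This is a minimal illustration, not the method of any particular library: the function names `levenshtein` and `closest_word` and the five-word `DICTIONARY` are assumptions made for the example, and the distance is computed with the standard dynamic-programming recurrence, keeping only one previous row in memory.

```python
def levenshtein(source: str, target: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, or substitutions that turn `source` into `target`."""
    # previous[j] is the distance between the prefix of `source`
    # processed so far and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i characters into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def closest_word(word: str, dictionary: list[str]) -> str:
    """Return the dictionary entry with the smallest edit distance."""
    return min(dictionary, key=lambda entry: levenshtein(word, entry))


# Hypothetical dictionary, assumed for illustration only.
DICTIONARY = ["information", "informal", "jeopardy", "leopard", "formation"]

print(closest_word("inforomation", DICTIONARY))  # information
print(closest_word("jopardy", DICTIONARY))       # jeopardy
```

Because `closest_word` compares the input against every entry, its cost grows linearly with the size of the dictionary, which is acceptable for a small lexicon like the one assumed here.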

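Example 1:

To correct "inforomation," the following Levenshtein distance table is constructed. Rows follow the letters of "inforomation" and columns follow the letters of "information." Each cell holds the edit distance between the prefix of the misspelled word ending at that row and the prefix of the candidate ending at that column; the row and column for the empty prefix are omitted.

     I  N  F  O  R  M  A  T  I  O  N
I    0  1  2  3  4  5  6  7  8  9 10
N    1  0  1  2  3  4  5  6  7  8  9
F    2  1  0  1  2  3  4  5  6  7  8
O    3  2  1  0  1  2  3  4  5  6  7
R    4  3  2  1  0  1  2  3  4  5  6
O    5  4  3  2  1  1  2  3  4  4  5
M    6  5  4  3  2  1  2  3  4  5  5
A    7  6  5  4  3  2  1  2  3  4  5
T    8  7  6  5  4  3  2  1  2  3  4
I    9  8  7  6  5  4  3  2  1  2  3
O   10  9  8  7  6  5  4  3  2  1  2
N   11 10  9  8  7  6  5  4  3  2  1

The bottom-right cell gives a Levenshtein distance of 1: deleting the extra "o" after the "r" turns "inforomation" into "information," which makes "information" the closest match.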

Example 2:

For "jopardy," the Levenshtein distance to "jeopardy" is 1, since a single insertion of "e" after the "j" produces the correct word. This confirms "jeopardy" as the correction.

K-Gram Overlap

The k-gram overlap technique breaks each word into its overlapping substrings of k consecutive characters and counts how many of these k-grams two words share. With k set to 3 for this example, the overlap is computed as follows:

Example 1:

"inforomation": {"inf", "nfo", "for", "oro", "rom", "oma", "mat", "ati", "tio", "ion"}

"information": {"inf", "nfo", "for", "orm", "rma", "mat", "ati", "tio", "ion"}

The two words share 7 trigrams, which is 7 of the 10 trigrams in "inforomation" (a Jaccard coefficient of 7/12, about 0.58), signifying a high degree of similarity.

Example 2:

"jopardy": {"jop", "opa", "par", "ard", "rdy"}

"jeopardy": {"jeo", "eop", "opa", "par", "ard", "rdy"}

Here, the two words share 4 trigrams, which is 4 of the 5 trigrams in "jopardy" (a Jaccard coefficient of 4/7, about 0.57), supporting "jeopardy" as the most suitable match.
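The trigram sets and overlap counts in these two examples can be reproduced with a short Python sketch. The helper names `k_grams` and `jaccard` are assumptions made for illustration. The sketch takes k-grams from the bare word, without the boundary markers that some k-gram indexes add, and reports the Jaccard coefficient as one possible way to normalize the raw count of shared k-grams.

```python
def k_grams(word: str, k: int = 3) -> set[str]:
    """Return the set of overlapping k-character substrings of `word`."""
    return {word[i:i + k] for i in range(len(word) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Shared k-grams divided by the distinct k-grams in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


for misspelled, candidate in [("inforomation", "information"),
                              ("jopardy", "jeopardy")]:
    a, b = k_grams(misspelled), k_grams(candidate)
    # Prints: shared trigrams, trigrams in the misspelling, Jaccard score.
    print(misspelled, candidate, len(a & b), len(a), round(jaccard(a, b), 2))

# inforomation information 7 10 0.58
# jopardy jeopardy 4 5 0.57
```

Using sets means a trigram that occurs twice in a word is counted once; none of the words in these examples contains a repeated trigram, so the counts are unaffected.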

Consequences of an Unsorted Dictionary

The ordering of the dictionary affects the speed of the spelling correction process:

• Search Complexity: In a sorted dictionary, binary search can locate a word, or the range of words that share a prefix, in O(log n) comparisons. An unsorted dictionary requires a complete linear scan of all n entries, which is computationally more demanding.

• Scalability Challenges: Correction becomes substantially slower with large, unsorted dictionaries, particularly in real-time systems where speed is essential.

Despite these implications for search efficiency, the accuracy of the Levenshtein distance and k-gram overlap metrics is unaffected by the dictionary's order, because each score depends only on the pair of words being compared.
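The contrast between the two lookup strategies, and the order-independence of the similarity scores, can be illustrated with a small Python sketch. The function names and the five-word dictionary are hypothetical, chosen only for this example; `bisect.bisect_left` from the standard library performs the binary search, and shared trigrams stand in for the similarity metric.

```python
import bisect
import random


def in_sorted(dictionary: list[str], word: str) -> bool:
    """Binary search: O(log n) comparisons, requires a sorted list."""
    i = bisect.bisect_left(dictionary, word)
    return i < len(dictionary) and dictionary[i] == word


def in_unsorted(dictionary: list[str], word: str) -> bool:
    """Linear scan: up to n comparisons, works on any ordering."""
    return any(entry == word for entry in dictionary)


def shared_trigrams(a: str, b: str) -> int:
    """Number of distinct 3-character substrings the two words share."""
    def trigrams(word: str) -> set[str]:
        return {word[i:i + 3] for i in range(len(word) - 2)}
    return len(trigrams(a) & trigrams(b))


# Hypothetical dictionary, assumed for illustration only.
sorted_dict = sorted(["formation", "informal", "information",
                      "jeopardy", "leopard"])
shuffled_dict = sorted_dict[:]
random.Random(0).shuffle(shuffled_dict)  # fixed seed for repeatability

# Both lookups agree that "jopardy" is not a dictionary word.
print(in_sorted(sorted_dict, "jopardy"), in_unsorted(shuffled_dict, "jopardy"))

# The best-scoring candidate is the same in either ordering.
best_sorted = max(sorted_dict,
                  key=lambda w: shared_trigrams("inforomation", w))
best_shuffled = max(shuffled_dict,
                    key=lambda w: shared_trigrams("inforomation", w))
print(best_sorted, best_shuffled)
```

One caveat under these assumptions: if two candidates tie on the score, which one `max` returns depends on their position in the list, so a deterministic tie-breaking rule would be needed to make the chosen correction fully independent of order.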

Conclusion

A dictionary-based approach that uses Levenshtein distance and k-gram overlap corrects both spelling errors in Doc1. An unsorted dictionary slows the correction process, but it does not change the scores these similarity metrics produce. This analysis underscores the importance of well-structured data and appropriate metrics in textual error correction.
References

Manning, C. D., Raghavan, P., & Schütze, H. (2009). Introduction to Information Retrieval (pp. 105-108). Cambridge University Press. https://nlp.stanford.edu/IR-book/pdf/03dict.pdf