CS 3308 Learning Journal Unit 2

This paper discusses the use of a local dictionary for spelling correction through Levenshtein distance and k-gram overlap methods, specifically addressing errors in a document containing 'inforomation' and 'jopardy.' It highlights the dictionary's role in providing correct spelling candidates and the implications of using an unordered dictionary on search efficiency. The study concludes that while an unsorted dictionary may hinder correction speed, the effectiveness of the similarity metrics remains unchanged.

Uploaded by

Reg


Spelling Correction Utilizing a Local Lexicon: Implementing Levenshtein Distance and K-Gram Correspondence

Spelling correction is an essential part of text processing: it identifies and corrects misspelled words in a text corpus. This paper applies a local dictionary to correct misspellings, using Levenshtein distance and k-gram overlap as the primary methods. The working example is a document, Doc1, that contains two misspelled words: "inforomation" and "jopardy." The report explains the role of the dictionary in spelling correction, details the procedure for each method, and assesses how an unordered dictionary affects the correction process.

The Dictionary as a Catalyst for Spelling Correction

In spelling correction, the dictionary serves as the reference list of correctly spelled words. For the errors in Doc1, it supplies the set of candidates that are structurally similar to the misspelled terms. Candidates are ranked with similarity metrics, here Levenshtein distance and k-gram overlap, which quantify how close a misspelling is to a correct spelling. Both metrics compare character sequences rather than pronunciation, so they capture orthographic similarity, not phonetic similarity. The dictionary used here is small, which simplifies the task; in real-world applications with large lexicons, efficient search structures are needed so that not every entry has to be compared.

Methodology: Levenshtein Distance and K-Gram Overlap

Levenshtein Distance

The Levenshtein distance measures the similarity between two strings as the minimum number of single-character operations (insertions, deletions, or substitutions) required to convert one into the other (Manning, Raghavan, & Schütze, 2009). To correct "inforomation," the system computes the edit distance to each term in the dictionary and selects the term with the smallest distance, which in this case is "information."

Example 1:

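To correct "inforomation," the following Levenshtein distance table is constructed. Rows correspond to prefixes of "inforomation" and columns to prefixes of "information"; the first row and column (marked #) represent the empty prefix.

      #   I   N   F   O   R   M   A   T   I   O   N
  #   0   1   2   3   4   5   6   7   8   9  10  11
  I   1   0   1   2   3   4   5   6   7   8   9  10
  N   2   1   0   1   2   3   4   5   6   7   8   9
  F   3   2   1   0   1   2   3   4   5   6   7   8
  O   4   3   2   1   0   1   2   3   4   5   6   7
  R   5   4   3   2   1   0   1   2   3   4   5   6
  O   6   5   4   3   2   1   1   2   3   4   4   5
  M   7   6   5   4   3   2   1   2   3   4   5   5
  A   8   7   6   5   4   3   2   1   2   3   4   5
  T   9   8   7   6   5   4   3   2   1   2   3   4
  I  10   9   8   7   6   5   4   3   2   1   2   3
  O  11  10   9   8   7   6   5   4   3   2   1   2
  N  12  11  10   9   8   7   6   5   4   3   2   1

The bottom-right cell gives the Levenshtein distance, which is 1: deleting the extra "o" after "infor" turns "inforomation" into "information." This confirms "information" as the closest match.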

Example 2:

For "jopardy," the Levenshtein distance to "jeopardy" is 1 (a single insertion of "e" after "j"), confirming "jeopardy" as the correct term.
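
The edit-distance procedure described in this section can be sketched in Python as follows. This is a minimal illustration rather than the implementation used for the report: the function names `levenshtein` and `closest_word` and the contents of `DICTIONARY` are assumptions introduced for the example, since the report does not list the dictionary's entries.

```python
def levenshtein(source: str, target: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, or substitutions needed to turn source into target."""
    # previous[j] is the distance between the prefix of source processed
    # so far and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # i deletions turn a length-i prefix into ""
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


# Hypothetical local dictionary (assumed for illustration only).
DICTIONARY = ["information", "jeopardy", "formation", "leopard", "retrieval"]


def closest_word(misspelled: str, dictionary: list[str]) -> str:
    """Return the dictionary entry with the smallest edit distance."""
    return min(dictionary, key=lambda word: levenshtein(misspelled, word))


print(levenshtein("inforomation", "information"))  # 1
print(levenshtein("jopardy", "jeopardy"))          # 1
print(closest_word("inforomation", DICTIONARY))    # information
print(closest_word("jopardy", DICTIONARY))         # jeopardy
```

The sketch keeps only two rows of the table in memory at a time, which is sufficient when only the final distance is needed and not the full table.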

K-Gram Overlap

The k-gram overlap technique breaks each word into its overlapping substrings of length k and counts how many of these k-grams two words share. With k set to 3 for this example, the overlap is computed as follows:

Example 1:

"inforomation": {"inf", "nfo", "for", "oro", "rom", "oma", "mat", "ati", "tio", "ion"}

"information": {"inf", "nfo", "for", "orm", "rma", "mat", "ati", "tio", "ion"}

The two words share 7 trigrams ("inf", "nfo", "for", "mat", "ati", "tio", "ion"), which is 7 of the 10 trigrams in "inforomation" and signifies a high degree of similarity.

Example 2:

"jopardy": {"jop", "opa", "par", "ard", "rdy"}

"jeopardy": {"jeo", "eop", "opa", "par", "ard", "rdy"}

Here, the words share 4 of the 5 trigrams in "jopardy" ("opa", "par", "ard", "rdy"), supporting "jeopardy" as the most suitable match.
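
The trigram comparison above can be reproduced with a short Python sketch. The function names `k_grams` and `overlap` are assumptions made for this example. The Jaccard coefficient is not used in the text above; it is included only as one common way to normalise the shared-gram count by the size of both sets.

```python
def k_grams(word: str, k: int = 3) -> set[str]:
    """Return the set of overlapping substrings of length k in word."""
    return {word[i:i + k] for i in range(len(word) - k + 1)}


def overlap(misspelled: str, candidate: str, k: int = 3) -> tuple[int, float]:
    """Return the number of shared k-grams and the Jaccard coefficient."""
    a, b = k_grams(misspelled, k), k_grams(candidate, k)
    shared = a & b
    union = a | b
    jaccard = len(shared) / len(union) if union else 0.0
    return len(shared), jaccard


for wrong, right in [("inforomation", "information"), ("jopardy", "jeopardy")]:
    shared, jaccard = overlap(wrong, right)
    print(f"{wrong} -> {right}: {shared} shared trigrams, Jaccard {jaccard:.2f}")
# inforomation -> information: 7 shared trigrams, Jaccard 0.58
# jopardy -> jeopardy: 4 shared trigrams, Jaccard 0.57
```

Because the grams are stored in sets, a trigram that occurs twice in a word is counted once; neither example word contains a repeated trigram, so the counts match the lists above.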

Consequences of an Unsorted Dictionary

The arrangement of the dictionary significantly affects the performance of the spelling correction process:

• Search Complexity: In a sorted dictionary, binary search can check whether a word is present, or locate the range of entries sharing a prefix, in O(log n) comparisons. An unsorted dictionary requires a complete linear scan, O(n) comparisons, for the same lookup.

• Scalability Challenges: Correction slows substantially with large, unsorted dictionaries, because every entry must be compared with the misspelled word; this matters most in real-time systems where speed is essential.

Despite these implications for search efficiency, the accuracy of the Levenshtein distance and k-gram overlap metrics is unaffected by the dictionary's order, since each score depends only on the pair of words being compared.
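
The contrast between the two lookups, and the order-independence of the similarity scores, can be illustrated with the sketch below. All names (`contains_sorted`, `contains_unsorted`, `trigrams`, `best_match`) and the word list are assumptions made for the example; `bisect_left` is from Python's standard `bisect` module.

```python
from bisect import bisect_left


def contains_sorted(sorted_words: list[str], word: str) -> bool:
    """Exact-match lookup by binary search; the input must be sorted."""
    index = bisect_left(sorted_words, word)
    return index < len(sorted_words) and sorted_words[index] == word


def contains_unsorted(words: list[str], word: str) -> bool:
    """Exact-match lookup by linear scan; works for any ordering."""
    return any(entry == word for entry in words)


def trigrams(word: str) -> set[str]:
    """Return the set of overlapping three-character substrings."""
    return {word[i:i + 3] for i in range(len(word) - 2)}


def best_match(words: list[str], misspelled: str) -> str:
    """Scan every entry and keep the one sharing the most trigrams."""
    target = trigrams(misspelled)
    return max(words, key=lambda entry: len(target & trigrams(entry)))


# Hypothetical dictionary in arbitrary order, plus a sorted copy.
unsorted_words = ["retrieval", "jeopardy", "leopard", "information", "formation"]
sorted_words = sorted(unsorted_words)

print(contains_sorted(sorted_words, "jopardy"))      # False
print(contains_unsorted(unsorted_words, "jopardy"))  # False
print(best_match(unsorted_words, "inforomation"))    # information
print(best_match(sorted_words, "inforomation"))      # information
```

Both orderings return the same correction here because one candidate scores strictly highest. If two entries tied, `max` would return whichever it met first, so a deterministic tie-break rule would be needed to keep results independent of order.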

Conclusion

A dictionary-based approach, combined with Levenshtein distance and k-gram overlap, effectively corrects the spelling errors in Doc1. An unsorted dictionary may slow the correction process, but it does not change the scores that these similarity metrics produce. This analysis underscores the importance of structured data and appropriate metrics in textual error correction.
References

Manning, C. D., Raghavan, P., & Schütze, H. (2009). An introduction to information retrieval. In Introduction to Information Retrieval (pp. 105-108). Cambridge University Press. https://nlp.stanford.edu/IR-book/pdf/03dict.pdf
