NLP_Lecture_6_Week_3
One of the most widely used stemming algorithms is the Porter Stemmer, developed by Martin
Porter in 1980. This algorithm is highly efficient and is based on cascaded rewrite rules, which
makes it particularly suitable for implementation as a Finite State Transducer (FST).
The Porter Stemmer follows a sequence of transformation rules to reduce words to their root
forms. Some example rules include:
ATIONAL → ATE
(e.g., relational → relate)
ING → ε (epsilon) if the stem contains a vowel
(e.g., motoring → motor)
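As a minimal illustration of how such rules can be applied, here is a toy Python sketch of just these two rules; it is not the full Porter Stemmer, which cascades many rule groups with further conditions on the stem.

import re

# A toy version of the two Porter-style rules above (not the full Porter Stemmer).
def toy_porter_step(word):
    # ATIONAL -> ATE   (relational -> relate)
    if word.endswith("ational"):
        return word[:-len("ational")] + "ate"
    # ING -> epsilon, but only if the remaining stem contains a vowel (motoring -> motor)
    if word.endswith("ing") and re.search(r"[aeiou]", word[:-3]):
        return word[:-3]
    return word

for w in ["relational", "motoring", "sing"]:
    print(w, "->", toy_porter_step(w))
# relational -> relate, motoring -> motor, sing -> sing (stem "s" has no vowel)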
Krovetz (1993) analyzed the impact of stemming in IR and found that it improves performance,
particularly for smaller documents. In longer documents, the likelihood of encountering the exact
keyword form increases, reducing the need for stemming.
Errors of Commission occur when the stemmer incorrectly reduces a word to an unintended base
form, while Errors of Omission occur when two related words are not recognized as having the
same root.
Introduction
Tokenization is the process of segmenting running text into meaningful units, such as words and
sentences. This process plays a crucial role in Natural Language Processing (NLP) and other
text-processing applications. Tokenization can be broadly classified into two types:
Word Tokenization
Word tokenization may seem straightforward, especially in languages like English, which use
spaces to separate words. However, spaces alone are not always a reliable delimiter. Several
challenges make word tokenization complex:
1. Punctuation Handling:
o Consider the sentence: Mr. Sherwood said reaction to Sea Containers’ proposal
has been "very positive."
o If tokenization is based only on whitespace, incorrect tokens such as positive."
(with the closing quotation mark attached) are produced; similar problems arise
with forms like cents. in other sentences.
o Punctuation marks often appear within words (e.g., Ph.D., AT&T, cap’n,
google.com), which makes segmentation more challenging.
2. Numbers and Symbols:
o Numbers like 62.5 should be a single token, but a naive algorithm might split it
into 62 and 5.
o English uses commas as thousand separators (e.g., 555,500.50), whereas other
languages like German use a comma for decimals (e.g., 555 500,50), requiring
language-specific tokenization rules.
3. Clitic Contractions:
o Contractions like what’re (what are) and we’re (we are) should be expanded
properly.
o Apostrophes are also used for possessives (e.g., book’s cover) and quotes, adding
to the ambiguity.
o French contractions like j’ai (je ai) also require special handling.
4. Multiword Expressions:
o Some multiword expressions (e.g., New York, rock ’n’ roll) should be treated
as single tokens.
o Recognizing named entities (persons, locations, organizations) is part of named
entity detection, discussed in later chapters.
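To make these challenges concrete, the following rough Python sketch tokenizes words with a single regular expression. The pattern, its heuristics (for example the crude title rule), and the example sentence are illustrative assumptions, not a standard or complete tokenizer.

import re

# A rough regex word tokenizer sketch; each alternative targets one of the
# challenges discussed above. The heuristics are deliberately simple.
TOKEN_RE = re.compile(r"""(?x)
      (?:[A-Za-z]\.)+                  # letter-period sequences such as U.S.A. or e.g.
    | [A-Z][a-z]{0,3}\.(?!\s+[a-z])    # short titles/abbreviations like Mr. or Inc. (crude heuristic)
    | \$?\d+(?:[.,]\d+)*%?             # numbers and currency: 62.5, 555,500.50, $12, 82%
    | \w+(?:['&.-]\w+)*                # words with internal marks: AT&T, cap'n, what're, google.com
    | \.\.\.                           # ellipsis
    | [.,;"'?():\[\]-]                 # remaining punctuation as separate tokens
""")

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("Mr. Sherwood said AT&T's offer of $555,500.50 was \"very positive.\""))
# The period stays attached to Mr. but is split off positive, and the
# number and AT&T's each come out as single tokens.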
Sentence Tokenization
Segmenting text into sentences is typically done using punctuation marks such as periods (.),
question marks (?), and exclamation marks (!). However, there are challenges:
1. Ambiguity of Periods:
o The period (.) serves multiple functions, such as marking abbreviations (Mr.,
Inc.) and sentence boundaries.
o Example: In "The company is registered as Inc. It has operations worldwide.", the
period in Inc. marks both the abbreviation and the sentence boundary.
o Advanced methods use machine learning classifiers and abbreviation dictionaries
to resolve this ambiguity.
2. Machine Learning Approaches:
o State-of-the-art sentence tokenization methods use machine learning to classify
whether a period is a sentence boundary.
o A basic method involves using regular expressions to detect sentence boundaries.
o More advanced models incorporate features like part-of-speech tags and
frequency-based abbreviation detection.
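A toy version of the regular-expression approach is sketched below. The boundary rule (sentence-final punctuation followed by whitespace and a capital letter) and the tiny abbreviation list are simplistic assumptions; note that the heuristic fails on the Inc. example above, which is precisely why practical systems move to machine-learned classifiers.

import re

# A toy regex-based sentence splitter. A ., ?, or ! followed by whitespace and a
# capital letter is a candidate boundary, unless the preceding token is a known
# abbreviation. The abbreviation list is a tiny illustrative stand-in.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "Inc.", "St.", "e.g.", "i.e."}

def split_sentences(text):
    boundaries = [m.end() for m in re.finditer(r"[.?!](?=\s+[A-Z])", text)]
    sentences, start = [], 0
    for pos in boundaries:
        last_token = text[start:pos].split()[-1]   # token ending at the punctuation
        if last_token in ABBREVIATIONS:
            continue                               # treated as an abbreviation, not a boundary
        sentences.append(text[start:pos].strip())
        start = pos
    sentences.append(text[start:].strip())
    return [s for s in sentences if s]

# The first call returns a single sentence: the abbreviation heuristic misses the
# genuine boundary after Inc. The second call splits correctly into two sentences.
print(split_sentences("The company is registered as Inc. It has operations worldwide."))
print(split_sentences("Mr. Sherwood said the reaction has been positive. He was pleased."))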
Figure 3.9 illustrates the different challenges in word and sentence tokenization discussed above.
Segmentation in Chinese
Some languages, including Chinese, Japanese, and Thai, do not use spaces to mark word
boundaries. Instead, alternative segmentation techniques are required.
1. Hanzi Characters:
o Chinese words are composed of hanzi characters, each typically representing a
single morpheme and pronounced as a single syllable.
o The average Chinese word length is 2.4 characters.
2. Maximum Matching Algorithm (MaxMatch):
o A greedy algorithm that requires a dictionary.
o Steps:
1. Start at the beginning of the string.
2. Find the longest dictionary word matching the input at the current
position.
3. Move the pointer past the matched word.
4. If no match is found, treat the next character as a single-character word.
5. Repeat until the entire string is processed.
o Example: a minimal sketch of the procedure is given below.
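Because Chinese text is hard to reproduce here, the sketch below runs MaxMatch over a despaced English string with a toy dictionary; both are stand-ins for a real hanzi lexicon and an unsegmented Chinese sentence.

# A minimal sketch of the MaxMatch (maximum matching) segmenter described above.
def max_match(text, dictionary):
    tokens, i = [], 0
    while i < len(text):
        # Find the longest dictionary word starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary match: emit a single character as its own word.
            tokens.append(text[i])
            i += 1
    return tokens

DICT = {"we", "can", "only", "see", "a", "short", "distance", "ahead"}
print(max_match("wecanonlyseeashortdistanceahead", DICT))
# ['we', 'can', 'only', 'see', 'a', 'short', 'distance', 'ahead']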
Introduction
Spelling errors are a common occurrence in written text, affecting both human
communication and digital text processing. The study of spelling errors, their detection, and
correction is crucial in various applications such as word processors, search engines, and Optical
Character Recognition (OCR) systems.
Historical Perspective
The concern for spelling accuracy dates back to the 19th century, as depicted in Oscar
Wilde's The Importance of Being Earnest, where Cecily criticizes Algernon's poorly spelled
letters. Similarly, Gilbert and Sullivan's works highlight the importance of spelling skills.
Thorstein Veblen's 1899 theory suggested that English spelling's complexity served as a test of
social class.
Despite its historical significance, spelling errors remain prevalent. Studies estimate error rates
ranging from 0.05% in professionally edited text to 38% in specialized applications such as
telephone directory lookups (Kukich, 1992).
Types of Spelling Errors
Kukich (1992) distinguishes three increasingly broad problems in spelling error detection
and correction:
1. Non-word error detection: Detecting errors that result in strings that are not words of
the language, e.g., graffe instead of giraffe.
2. Isolated-word error correction: Correcting non-word errors without considering the
context, e.g., correcting graffe to giraffe.
3. Context-dependent error detection and correction: Using the surrounding text to
identify and correct errors, particularly real-word errors, such as there instead of three or
homophones like desert instead of dessert.
Detection and Correction Techniques
1. Dictionary-Based Detection
o The simplest method involves checking words against a predefined dictionary.
o Early research suggested small dictionaries to avoid false positives with rare
words (Peterson, 1986), but later studies (Damerau & Mays, 1989) indicated that
large dictionaries were more beneficial.
o Probabilistic spell-checking algorithms leverage word frequency to improve
accuracy.
2. Finite-State Transducers (FSTs) and Morphological Parsing
o Finite-State Morphological Parsers help recognize words and their variations,
making them useful for spell-checking.
o FST dictionaries efficiently handle inflected forms, which is particularly
beneficial for morphologically rich languages.
3. Error Correction Methods
o Edit Distance Algorithm: Measures how many operations (insertion, deletion,
substitution, or transposition) are required to convert an incorrect word into a
valid one.
o Probabilistic Models: Use statistical methods to determine the most likely
correction based on real-world usage and context.
o Minimum Edit Distance: A non-probabilistic approach that finds the closest
correct word by minimizing the number of changes (a toy detection-and-correction
sketch follows this list).
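As a rough end-to-end illustration of non-word detection followed by isolated-word correction, the toy sketch below checks tokens against a small dictionary and proposes replacements with Python's standard-library difflib. The LEXICON is a hypothetical stand-in for a real dictionary, and difflib ranks candidates by a similarity ratio rather than by true minimum edit distance, so it only approximates the methods listed above.

import difflib

# A toy spell checker: dictionary-based non-word detection plus a similarity-based
# suggestion step. The LEXICON is a tiny illustrative stand-in for a real dictionary.
LEXICON = {"a", "the", "giraffe", "is", "very", "tall", "dessert", "desert"}

def check(tokens):
    for token in tokens:
        if token.lower() in LEXICON:
            continue                                   # known word, nothing to do
        # Isolated-word correction: propose the most similar dictionary entries.
        suggestions = difflib.get_close_matches(token.lower(), LEXICON, n=3, cutoff=0.6)
        print(f"non-word: {token!r}  suggestions: {suggestions}")

check("the graffe is very tall".split())
# non-word: 'graffe'  suggestions: ['giraffe']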
Introduction
The concept of minimum edit distance is essential in determining the similarity between two
strings by measuring how one string can be transformed into another using a set of operations.
This method is crucial in various applications, such as spell-checking, speech recognition, and
machine translation.
Definition
The minimum edit distance between two strings is the smallest number of operations required
to convert one string into the other. These operations are typically the insertion, deletion, and
substitution of single characters.
For example, the transformation from "intention" to "execution" requires five operations.
Levenshtein Distance
Using the basic version, in which each of the three operations has a cost of 1, the Levenshtein
distance between "intention" and "execution" is 5.
In an alternate version where substitutions cost 2 (with insertions and deletions still costing 1),
the distance increases to 8.
The minimum edit distance is computed using dynamic programming, a technique that solves
problems by breaking them down into smaller overlapping subproblems and storing their
solutions in a table.
distance[i, j] = \min \begin{cases}
distance[i-1, j] + ins\_cost(target_{i-1}) \\
distance[i-1, j-1] + subst\_cost(source_{j-1}, target_{i-1}) \\
distance[i, j-1] + del\_cost(source_{j-1})
\end{cases}
This approach ensures that the optimal solution is found by examining all possible
transformations step by step.
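The recurrence can be implemented directly with a table of subproblem solutions. The sketch below uses unit insertion and deletion costs and an optional substitution cost (for the Levenshtein variant mentioned earlier); for readability the rows index the source and the columns the target, which is the same recurrence up to relabeling of the indices.

def min_edit_distance(source, target, sub_cost=1):
    """Dynamic-programming minimum edit distance with unit insert/delete costs."""
    n, m = len(source), len(target)
    # dist[i][j] = cost of transforming source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i                                  # delete all source characters
    for j in range(1, m + 1):
        dist[0][j] = j                                  # insert all target characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + 1,                              # deletion
                dist[i][j - 1] + 1,                              # insertion
                dist[i - 1][j - 1] + (0 if same else sub_cost),  # substitution / copy
            )
    return dist[n][m]

print(min_edit_distance("intention", "execution"))               # 5
print(min_edit_distance("intention", "execution", sub_cost=2))   # 8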
Alignment of Strings
To understand how words align during a transformation, consider an alignment matrix. The
alignment shows which characters of the source correspond to which characters of the target,
and where insertions, deletions, and substitutions occur.
Applications
Viterbi Algorithm: Uses probabilities instead of fixed costs to determine the most
probable alignment.
Weighted Edit Distance: Assigns different costs to each operation based on linguistic
importance.
Dynamic Time Warping (DTW): Used in speech processing for time-series alignment.
Introduction
Human morphological processing examines how multi-morphemic words are represented and
processed in the human mind. This study is crucial for understanding how speakers of a language
store, retrieve, and manipulate words, particularly in morphologically rich languages such as
Turkish, but also in English.
There are two primary hypotheses regarding the storage of words in the human mental lexicon:
1. Full Listing Hypothesis
o Proposes that all words of a language are stored individually in the lexicon
without any internal morphological structure.
o Example: walk, walks, walked, happy, happily are all stored separately.
o This theory is inefficient for languages with complex morphology.
2. Minimum Redundancy Hypothesis
o Suggests that only the base morphemes are stored, and affixes are processed
separately when forming words.
o Example: Instead of storing walks and walked separately, the lexicon contains
walk, and affixes (-s, -ed) are added as needed.
o This approach is more economical, especially for highly inflected languages.
Some of the earliest evidence for morphological structure in the mental lexicon comes from
speech errors, where affixes appear separately from their stems.
These errors suggest that morphemes are stored independently and combined during speech
production.
Experimental Evidence
More recent experiments indicate that neither the Full Listing nor Minimum Redundancy
Hypothesis fully explains human morphological processing. Some morphological relationships
are represented mentally, while others are not.