NLP_Lecture_6_Week_3

The document discusses lexicon-free finite state transducers (FSTs) and the Porter Stemmer, which is a widely used stemming algorithm that reduces words to their base forms for improved keyword matching in information retrieval. It also covers the complexities of tokenization in natural language processing, highlighting challenges in word and sentence segmentation, especially in languages without clear word boundaries. Additionally, it addresses spelling error detection and correction techniques, minimum edit distance, and human morphological processing, emphasizing the representation and retrieval of multi-morphemic words in the mental lexicon.


3.8 Lexicon-Free FSTs: The Porter Stemmer

Introduction to Lexicon-Free Finite State Transducers (FSTs)

A common approach to morphological parsing involves constructing a transducer from a lexicon combined with linguistic rules. However, this method requires a large online lexicon, making it resource-intensive. In contrast, simpler algorithms that do not depend on an extensive lexicon are useful in tasks such as Information Retrieval (IR), particularly for web search and keyword-based querying.

Role of Stemming in Information Retrieval

In Information Retrieval, a query often consists of a Boolean combination of keywords (e.g., marsupial OR kangaroo OR koala). However, different morphological variations of a word (e.g., marsupial vs. marsupials) may not match exactly. To address this issue, stemming is applied to both query words and document words. Stemming removes suffixes and reduces words to their base forms, ensuring better keyword matching.

Porter Stemmer: A Lexicon-Free Approach

One of the most widely used stemming algorithms is the Porter Stemmer, developed by Martin
Porter in 1980. This algorithm is highly efficient and is based on cascaded rewrite rules, which
makes it particularly suitable for implementation as a Finite State Transducer (FST).

How the Porter Algorithm Works

The Porter Stemmer follows a sequence of transformation rules to reduce words to their root
forms. Some example rules include:

 ATIONAL → ATE (e.g., relational → relate)
 ING → ε (epsilon) if the stem contains a vowel (e.g., motoring → motor)

These rules systematically strip suffixes while maintaining word structure.
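
As a quick illustration, the sketch below runs a few example words through the off-the-shelf Porter stemmer shipped with NLTK (assuming the nltk package is installed); note that the full cascade of rules can strip more than the single rules quoted above, so a word like relational may be reduced beyond relate.

```python
# A minimal sketch of Porter-style stemming using NLTK's PorterStemmer
# (assumes `pip install nltk`; the word list below is illustrative).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

for word in ["relational", "motoring", "marsupials", "organization"]:
    # stem() applies the cascaded rewrite rules (ATIONAL -> ATE,
    # ING -> epsilon when the stem contains a vowel, etc.) and returns the stem.
    print(word, "->", stemmer.stem(word))
```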

Effectiveness of Stemming in Information Retrieval

Krovetz (1993) analyzed the impact of stemming in IR and found that it improves performance,
particularly for smaller documents. In longer documents, the likelihood of encountering the exact
keyword form increases, reducing the need for stemming.

Challenges and Limitations of Stemming


Despite its benefits, stemming can lead to errors categorized as Errors of Commission and
Errors of Omission:

Errors of Commission          Errors of Omission
organization → organ          European → Europe
doing → doe                   analysis → analyzes
generalization → generic      matrices → matrix
numerical → numerous          noise → noisy
policy → police               sparse → sparsity

Errors of Commission occur when the stemmer incorrectly reduces a word to an unintended base
form, while Errors of Omission occur when two related words are not recognized as having the
same root.

3.9 Word and Sentence Tokenization

Introduction

Tokenization is the process of segmenting running text into meaningful units, such as words and
sentences. This process plays a crucial role in Natural Language Processing (NLP) and other
text-processing applications. Tokenization can be broadly classified into two types:

1. Word Tokenization - Splitting text into words.
2. Sentence Tokenization - Splitting text into sentences.

Word Tokenization

Word tokenization may seem straightforward, especially in languages like English, which use
spaces to separate words. However, spaces alone are not always a reliable delimiter. Several
challenges make word tokenization complex:

1. Punctuation Handling:
o Consider the sentence: Mr. Sherwood said reaction to Sea Containers’ proposal
has been "very positive."
o If tokenization is based only on whitespace, incorrect tokens like positive."
might be produced.
o Punctuation marks often appear within words (e.g., Ph.D., AT&T, cap’n,
google.com), which makes segmentation more challenging.
2. Numbers and Symbols:
o Numbers like 62.5 should be a single token, but a naive algorithm might split it
into 62 and 5.
o English uses commas as thousand separators (e.g., 555,500.50), whereas other
languages like German use a comma for decimals (e.g., 555 500,50), requiring
language-specific tokenization rules.
3. Clitic Contractions:
o Contractions like what’re (what are) and we’re (we are) should be expanded
properly.
o Apostrophes are also used for possessives (e.g., book’s cover) and quotes, adding
to the ambiguity.
o French contractions like j’ai (je ai) also require special handling.
4. Multiword Expressions:
o Some multiword expressions (e.g., New York, rock ’n’ roll) should be treated
as single tokens.
o Recognizing named entities (persons, locations, organizations) is part of named
entity detection, discussed in later chapters.
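
To make these challenges concrete, the following is a rough, rule-based sketch using Python regular expressions; the pattern and the token classes it covers are illustrative assumptions, not a complete tokenizer (clitic expansion, multiword expressions, and abbreviation-final periods would all need further treatment).

```python
import re

# A rough, rule-based word tokenizer (illustrative sketch only).  The pattern
# tries to keep numbers with separators (62.5, 555,500.50) and words with
# internal punctuation (Ph.D, AT&T, cap'n, google.com) as single tokens,
# while splitting off surrounding punctuation.  Distinguishing abbreviation
# periods (Mr., Inc.) from sentence-final periods, expanding clitics, and
# grouping multiword expressions would all require further steps.
TOKEN_PATTERN = re.compile(r"""
      \d+(?:[.,]\d+)*                    # numbers: 62.5, 555,500.50
    | [A-Za-z]+(?:[.'&-][A-Za-z]+)*      # words: Ph.D, AT&T, cap'n, google.com
    | \S                                 # any other single non-space character
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(tokenize('Mr. Sherwood said reaction to Sea Containers\' proposal '
               'has been "very positive."'))
```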

Sentence Tokenization

Segmenting text into sentences is typically done using punctuation marks such as periods (.),
question marks (?), and exclamation marks (!). However, there are challenges:

1. Ambiguity of Periods:
o The period (.) serves multiple functions, such as marking abbreviations (Mr.,
Inc.) and sentence boundaries.
o Example: In "The company is registered as Inc. It has operations worldwide.", the
period in Inc. marks both the abbreviation and the sentence boundary.
o Advanced methods use machine learning classifiers and abbreviation dictionaries
to resolve this ambiguity.
2. Machine Learning Approaches:
o State-of-the-art sentence tokenization methods use machine learning to classify
whether a period is a sentence boundary.
o A basic method involves using regular expressions to detect sentence boundaries.
o More advanced models incorporate features like part-of-speech tags and
frequency-based abbreviation detection.
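
A minimal sketch of the basic rule-based approach is given below, using a small hypothetical abbreviation list; it deliberately exhibits the limitation discussed above, since a sentence ending in Inc. would not be split.

```python
# A naive, rule-based sentence splitter (illustrative sketch).  The small
# abbreviation set is hypothetical; real systems combine abbreviation
# dictionaries with machine-learned classifiers.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "inc.", "e.g.", "i.e."}

def split_sentences(text):
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        # Treat ., ?, ! as a boundary unless the token is a known abbreviation.
        if token.endswith((".", "?", "!")) and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

# "Mr." is correctly kept inside the first sentence, but a sentence ending in
# "Inc." would (wrongly) be merged with the next one -- the ambiguity above.
print(split_sentences("Mr. Sherwood agreed. Reaction has been very positive. Will it last?"))
```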

Figure 3.9: Word and Sentence Tokenization

Figure 3.9 illustrates different challenges in word and sentence tokenization, showing:

 Incorrect tokenization due to whitespace-based segmentation.
 Handling of punctuation and numbers.
 Multiword expressions and clitics.
 Sentence segmentation difficulties caused by ambiguous punctuation.

Segmentation in Chinese
Some languages, including Chinese, Japanese, and Thai, do not use spaces to mark word
boundaries. Instead, alternative segmentation techniques are required.

1. Hanzi Characters:
o Chinese words are made up of hanzi characters, each representing a morpheme
and syllable.
o The average Chinese word length is 2.4 characters.
2. Maximum Matching Algorithm (MaxMatch):
o A greedy algorithm that requires a dictionary.
o Steps:
1. Start at the beginning of the string.
2. Find the longest dictionary word matching the input at the current
position.
3. Move the pointer past the matched word.
4. If no match is found, treat the next character as a single-character word.
5. Repeat until the entire string is processed.
o Example:
 Removing spaces from "the table down there" results in thetabledownthere.
 MaxMatch incorrectly segments it as theta bled own there instead of the table down there.
o Works better in Chinese due to shorter word lengths but struggles with unknown
words and genres.
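
A minimal sketch of the greedy MaxMatch procedure described above is given below, with a toy dictionary chosen to reproduce the English failure case; the function name and word list are illustrative.

```python
# A sketch of the greedy MaxMatch segmenter (dictionary-based, illustrative).
def max_match(text, dictionary):
    tokens, i = [], 0
    while i < len(text):
        # Find the longest dictionary word that matches at the current position.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word matches: emit a single-character token.
            tokens.append(text[i])
            i += 1
    return tokens

# Toy English dictionary reproducing the failure case from the text.
words = {"the", "theta", "table", "bled", "own", "down", "there"}
print(max_match("thetabledownthere", words))  # ['theta', 'bled', 'own', 'there']
```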

Modern Approaches to Tokenization

Modern tokenization techniques use:

 Finite State Transducers (FSTs): Efficiently implement tokenization rules.
 Machine Learning: Uses hand-segmented training data to improve accuracy.
 Regular Expressions: Provides a rule-based approach for initial tokenization.
 Hybrid Methods: Combine dictionary-based and learning-based approaches.

Detecting and Correcting Spelling Errors

Introduction

Spelling errors are a common occurrence in written text, affecting both human communication and digital text processing. The study of spelling errors, their detection, and correction is crucial in various applications such as word processors, search engines, and Optical Character Recognition (OCR) systems.

Historical Perspective

The concern for spelling accuracy dates back to the 19th century, as depicted in Oscar Wilde’s The Importance of Being Earnest, where Cecily criticizes Algernon’s poorly spelled letters. Similarly, Gilbert and Sullivan’s works highlight the importance of spelling skills. Thorstein Veblen’s 1899 theory suggested that English spelling’s complexity served as a test of social class.
Despite this long history of attention, spelling errors remain prevalent. Studies estimate error rates ranging from 0.05% in professionally edited text to 38% in specialized applications such as telephone directory lookups (Kukich, 1992).

Types of Spelling Errors

Kukich (1992) breaks the treatment of spelling errors into three increasingly broad problems:

1. Non-word error detection: Errors that result in words that do not exist in a language,
e.g., graffe instead of giraffe.
2. Isolated-word error correction: Correcting non-word errors without considering the
context, e.g., correcting graffe to giraffe.
3. Context-dependent error detection and correction: Using the surrounding text to
identify and correct errors, particularly real-word errors, such as there instead of three or
homophones like desert instead of dessert.

Techniques for Spelling Error Detection and Correction

1. Dictionary-Based Detection
o The simplest method involves checking words against a predefined dictionary.
o Early research suggested small dictionaries to avoid false positives with rare
words (Peterson, 1986), but later studies (Damerau & Mays, 1989) indicated that
large dictionaries were more beneficial.
o Probabilistic spell-checking algorithms leverage word frequency to improve
accuracy.
2. Finite-State Transducers (FSTs) and Morphological Parsing
o Finite-State Morphological Parsers help recognize words and their variations,
making them useful for spell-checking.
o FST dictionaries efficiently handle inflected forms, which is particularly
beneficial for morphologically rich languages.
3. Error Correction Methods
o Edit Distance Algorithm: Measures how many operations (insertion, deletion,
substitution, or transposition) are required to convert an incorrect word into a
valid one.
o Probabilistic Models: Use statistical methods to determine the most likely
correction based on real-world usage and context.
o Minimum Edit Distance: A non-probabilistic approach that finds the closest
correct word by minimizing the number of changes.
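
As a toy illustration of dictionary-based detection combined with closest-word correction, the sketch below uses Python's standard difflib module; the word list is a stand-in for a real dictionary, and difflib's similarity score is only a rough proxy for the minimum edit distance developed in the next section.

```python
import difflib

# A minimal non-word error detector/corrector sketch.  DICTIONARY is a toy
# stand-in for a real word list; difflib ranks candidates by string
# similarity, which approximates (but is not identical to) edit distance.
DICTIONARY = {"giraffe", "acceptable", "desert", "dessert", "there", "three"}

def check_word(word):
    if word in DICTIONARY:
        return word, []                     # known word: nothing to correct
    # Non-word error: propose the most similar dictionary entries.
    candidates = difflib.get_close_matches(word, DICTIONARY, n=3, cutoff=0.6)
    return word, candidates

print(check_word("graffe"))   # e.g. ('graffe', ['giraffe'])
```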

Minimum Edit Distance

Introduction

The concept of minimum edit distance is essential in determining the similarity between two
strings by measuring how one string can be transformed into another using a set of operations.
This method is crucial in various applications, such as spell-checking, speech recognition, and
machine translation.
Definition

The minimum edit distance between two strings is the smallest number of operations required
to convert one string into another. These operations include:

 Insertion (i): Adding a character to the string.
 Deletion (d): Removing a character from the string.
 Substitution (s): Replacing one character with another.

For example, the transformation from "intention" to "execution" requires five operations.

Levenshtein Distance

The Levenshtein distance is the simplest form of edit distance where:

 Each insertion, deletion, or substitution has a cost of 1.
 If substitutions are not allowed, they can be represented as an insertion followed by a deletion (cost = 2).

Using the basic version, the Levenshtein distance between "intention" and "execution" is 5.
In an alternate version where substitutions cost 2, the distance increases to 8.
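
One possible minimum-cost alignment (following the standard textbook alignment; d = deletion, s = substitution, i = insertion) makes both figures concrete:

I N T E * N T I O N
* E X E C U T I O N
d s s     i s

This path uses 1 deletion, 1 insertion, and 3 substitutions: 1 + 1 + 3 = 5 operations in the basic version, and 1 + 1 + 3 × 2 = 8 when each substitution costs 2.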

Dynamic Programming Approach

The minimum edit distance is computed using dynamic programming, a technique that solves
problems by breaking them down into smaller overlapping subproblems and storing their
solutions in a table.

Distance Matrix Computation

A distance matrix is used to compute edit distance, where:

 Rows represent characters of the target string.
 Columns represent characters of the source string.
 Each cell distance[i, j] contains the edit distance between the first i characters of the
target and the first j characters of the source.
 The value in each cell is determined by the formula:

distance[i, j] = \min \begin{cases}
distance[i-1, j] + ins\_cost(target_{i-1}) \\
distance[i-1, j-1] + subst\_cost(source_{j-1}, target_{i-1}) \\
distance[i, j-1] + del\_cost(source_{j-1})
\end{cases}

This approach ensures that the optimal solution is found by examining all possible
transformations step by step.
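
A direct translation of this recurrence into code is sketched below, using unit (Levenshtein) costs; the function and variable names are illustrative, and the matrix is indexed as in the formula above (rows over the target, columns over the source).

```python
# A sketch of the dynamic-programming table fill, with unit Levenshtein costs
# (insertions and deletions cost 1; substitutions cost 1, or 0 on a match).
def min_edit_distance(source, target):
    rows, cols = len(target) + 1, len(source) + 1
    # distance[i][j] = edit distance between target[:i] and source[:j]
    distance = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        distance[i][0] = i          # only insertions of target characters
    for j in range(1, cols):
        distance[0][j] = j          # only deletions of source characters
    for i in range(1, rows):
        for j in range(1, cols):
            subst = 0 if source[j - 1] == target[i - 1] else 1
            distance[i][j] = min(
                distance[i - 1][j] + 1,            # ins_cost(target[i-1])
                distance[i - 1][j - 1] + subst,    # subst_cost(source[j-1], target[i-1])
                distance[i][j - 1] + 1,            # del_cost(source[j-1])
            )
    return distance[rows - 1][cols - 1]

print(min_edit_distance("intention", "execution"))   # 5
```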
Alignment of Strings

To understand how words align during transformations, consider an alignment matrix. This
alignment helps in:

 Identifying insertions, deletions, and substitutions.
 Visualizing the transformation path.
 Computing the word error rate in speech recognition.

Backtracking and Backtrace Algorithm

To reconstruct the optimal alignment:

1. Backpointers are stored in each cell to track where a transformation originates.
2. Backtrace Algorithm follows these pointers from the final cell to the initial cell, forming the shortest transformation sequence.
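
The sketch below extends the table fill with stored backpointers and then walks them from the final cell back to the start; the operation labels ("ins", "del", "sub", "match") and the function name are illustrative.

```python
# A sketch of backpointer storage and backtrace, with unit Levenshtein costs.
# "ins" = insert a target character, "del" = delete a source character,
# "sub"/"match" = align a source character with a target character.
def align(source, target):
    rows, cols = len(target) + 1, len(source) + 1
    dist = [[0] * cols for _ in range(rows)]
    back = [[None] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0], back[i][0] = i, "ins"
    for j in range(1, cols):
        dist[0][j], back[0][j] = j, "del"
    for i in range(1, rows):
        for j in range(1, cols):
            subst = 0 if source[j - 1] == target[i - 1] else 1
            choices = [
                (dist[i - 1][j - 1] + subst, "match" if subst == 0 else "sub"),
                (dist[i - 1][j] + 1, "ins"),
                (dist[i][j - 1] + 1, "del"),
            ]
            dist[i][j], back[i][j] = min(choices)   # ties broken by label order
    # Follow the backpointers from the final cell back to [0][0].
    i, j, ops = rows - 1, cols - 1, []
    while i > 0 or j > 0:
        op = back[i][j]
        ops.append(op)
        if op in ("sub", "match"):
            i, j = i - 1, j - 1
        elif op == "ins":
            i -= 1
        else:                                        # "del"
            j -= 1
    return dist[rows - 1][cols - 1], list(reversed(ops))

# Prints the distance (5) and one minimum-cost operation sequence.
print(align("intention", "execution"))
```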

Applications

1. Spell Correction: Identifies the closest correct word to a misspelled input.
2. Speech Recognition: Computes word error rate.
3. Machine Translation: Aligns sentences in bilingual corpora.
4. DNA Sequencing: Helps in gene alignment by finding similarities between sequences.
5. Text Comparison and Evaluation Tools: Used in UNIX diff for comparing files and in NIST sclite for scoring speech recognition output.

Extensions of Minimum Edit Distance

 Viterbi Algorithm: Uses probabilities instead of fixed costs to determine the most
probable alignment.
 Weighted Edit Distance: Assigns different costs to each operation based on linguistic
importance.
 Dynamic Time Warping (DTW): Used in speech processing for time-series alignment.

Human Morphological Processing

Introduction

Human morphological processing examines how multi-morphemic words are represented and
processed in the human mind. This study is crucial in understanding how speakers of a language
store, retrieve, and manipulate words, particularly in morphologically complex languages like
English and Turkish.

Representation of Morphological Structures

There are two primary hypotheses regarding the storage of words in the human mental lexicon:
1. Full Listing Hypothesis
o Proposes that all words of a language are stored individually in the lexicon
without any internal morphological structure.
o Example: walk, walks, walked, happy, happily are all stored separately.
o This theory is inefficient for languages with complex morphology.
2. Minimum Redundancy Hypothesis
o Suggests that only the base morphemes are stored, and affixes are processed
separately when forming words.
o Example: Instead of storing walks and walked separately, the lexicon contains
walk, and affixes (-s, -ed) are added as needed.
o This approach is more economical, especially for highly inflected languages.

Evidence from Speech Errors (Slips of the Tongue)

Some of the earliest evidence for morphological structure in the mental lexicon comes from
speech errors, where affixes appear separately from their stems.

Examples of Speech Errors:

 Screw looses instead of screws loose
 Words of rule formation instead of rules of word formation
 Easy enoughly instead of easily enough

These errors suggest that morphemes are stored independently and combined during speech
production.

Experimental Evidence

More recent experiments indicate that neither the Full Listing nor Minimum Redundancy
Hypothesis fully explains human morphological processing. Some morphological relationships
are represented mentally, while others are not.

1. Repetition Priming Experiment (Stanners et al., 1979)
o Words are recognized faster if they have been seen before (primed).
o Findings:
 Lifting primed lift, and burned primed burn.
 Selective did not prime select, suggesting that some derived forms are
stored separately.
2. Marslen-Wilson et al. (1994) Study
o Found that spoken derived words can prime their stems if they are closely related
in meaning.
o Examples:
 Government primes govern (closely related meaning).
 Department does not prime depart (less related meaning).

Morphological Family Size and Word Recognition


1. Definition:
o The morphological family size of a word refers to the number of related
multimorphemic words and compounds in which it appears.
o Example (fear family): fearful, fearfully, fearfulness, fearless, fearlessly,
fearlessness, fearsome, godfearing
2. Research Findings:
o Words with larger morphological families are recognized faster (Baayen et al.,
1997; De Jong et al., 2002).
o The total amount of morphological information (entropy) in a word affects
recognition speed (Moscoso del Prado Martín et al., 2004).
