NLP_Lecture_6_Week_3
One of the most widely used stemming algorithms is the Porter Stemmer, developed by Martin
Porter in 1980. This algorithm is highly efficient and is based on cascaded rewrite rules, which
makes it particularly suitable for implementation as a Finite State Transducer (FST).
The Porter Stemmer follows a sequence of transformation rules to reduce words to their root
forms. Some example rules include:
ATIONAL → ATE
(e.g., relational → relate)
ING → ε (epsilon) if the stem contains a vowel
(e.g., motoring → motor)
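As a minimal illustration of how such rules can be applied, here is a toy Python sketch of just these two rules; it is not the full Porter Stemmer, which cascades many rule groups with further conditions on the stem.

import re

# A toy version of the two Porter-style rules above (not the full Porter Stemmer).
def toy_porter_step(word):
    # ATIONAL -> ATE   (relational -> relate)
    if word.endswith("ational"):
        return word[:-len("ational")] + "ate"
    # ING -> epsilon, but only if the remaining stem contains a vowel (motoring -> motor)
    if word.endswith("ing") and re.search(r"[aeiou]", word[:-3]):
        return word[:-3]
    return word

for w in ["relational", "motoring", "sing"]:
    print(w, "->", toy_porter_step(w))
# relational -> relate, motoring -> motor, sing -> sing (stem "s" has no vowel)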
Krovetz (1993) analyzed the impact of stemming in IR and found that it improves performance,
particularly for smaller documents. In longer documents, the likelihood of encountering the exact
keyword form increases, reducing the need for stemming.
Errors of Commission occur when the stemmer incorrectly reduces a word to an unintended base
form, while Errors of Omission occur when two related words are not recognized as having the
same root.
Introduction
Tokenization is the process of segmenting running text into meaningful units, such as words and
sentences. This process plays a crucial role in Natural Language Processing (NLP) and other
text-processing applications. Tokenization can be broadly classified into two types:
Word Tokenization
Word tokenization may seem straightforward, especially in languages like English, which use
spaces to separate words. However, spaces alone are not always a reliable delimiter. Several
challenges make word tokenization complex:
1. Punctuation Handling:
o Consider the sentence: Mr. Sherwood said reaction to Sea Containers’ proposal
has been "very positive."
o If tokenization is based only on whitespace, incorrect tokens such as positive."
(with the closing quotation mark attached) are produced; similar problems arise
with forms like cents. in other sentences.
o Punctuation marks often appear within words (e.g., Ph.D., AT&T, cap’n,
google.com), which makes segmentation more challenging.
2. Numbers and Symbols:
o Numbers like 62.5 should be a single token, but a naive algorithm might split it
into 62 and 5.
o English uses commas as thousand separators (e.g., 555,500.50), whereas other
languages like German use a comma for decimals (e.g., 555 500,50), requiring
language-specific tokenization rules.
3. Clitic Contractions:
o Contractions like what’re (what are) and we’re (we are) should be expanded
properly.
o Apostrophes are also used for possessives (e.g., book’s cover) and quotes, adding
to the ambiguity.
o French contractions like j’ai (je ai) also require special handling.
4. Multiword Expressions:
o Some multiword expressions (e.g., New York, rock ’n’ roll) should be treated
as single tokens.
o Recognizing named entities (persons, locations, organizations) is part of named
entity detection, discussed in later chapters.
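To make these challenges concrete, the following rough Python sketch tokenizes words with a single regular expression. The pattern, its heuristics (for example the crude title rule), and the example sentence are illustrative assumptions, not a standard or complete tokenizer.

import re

# A rough regex word tokenizer sketch; each alternative targets one of the
# challenges discussed above. The heuristics are deliberately simple.
TOKEN_RE = re.compile(r"""(?x)
      (?:[A-Za-z]\.)+                  # letter-period sequences such as U.S.A. or e.g.
    | [A-Z][a-z]{0,3}\.(?!\s+[a-z])    # short titles/abbreviations like Mr. or Inc. (crude heuristic)
    | \$?\d+(?:[.,]\d+)*%?             # numbers and currency: 62.5, 555,500.50, $12, 82%
    | \w+(?:['&.-]\w+)*                # words with internal marks: AT&T, cap'n, what're, google.com
    | \.\.\.                           # ellipsis
    | [.,;"'?():\[\]-]                 # remaining punctuation as separate tokens
""")

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("Mr. Sherwood said AT&T's offer of $555,500.50 was \"very positive.\""))
# The period stays attached to Mr. but is split off positive, and the
# number and AT&T's each come out as single tokens.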
Sentence Tokenization
Segmenting text into sentences is typically done using punctuation marks such as periods (.),
question marks (?), and exclamation marks (!). However, there are challenges:
1. Ambiguity of Periods:
o The period (.) serves multiple functions, such as marking abbreviations (Mr.,
Inc.) and sentence boundaries.
o Example: In "The company is registered as Inc. It has operations worldwide.", the
period in Inc. marks both the abbreviation and the sentence boundary.
o Advanced methods use machine learning classifiers and abbreviation dictionaries
to resolve this ambiguity.
2. Machine Learning Approaches:
o State-of-the-art sentence tokenization methods use machine learning to classify
whether a period is a sentence boundary.
o A basic method involves using regular expressions to detect sentence boundaries.
o More advanced models incorporate features like part-of-speech tags and
frequency-based abbreviation detection.
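A toy version of the regular-expression approach is sketched below. The boundary rule (sentence-final punctuation followed by whitespace and a capital letter) and the tiny abbreviation list are simplistic assumptions; note that the heuristic fails on the Inc. example above, which is precisely why practical systems move to machine-learned classifiers.

import re

# A toy regex-based sentence splitter. A ., ?, or ! followed by whitespace and a
# capital letter is a candidate boundary, unless the preceding token is a known
# abbreviation. The abbreviation list is a tiny illustrative stand-in.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "Inc.", "St.", "e.g.", "i.e."}

def split_sentences(text):
    boundaries = [m.end() for m in re.finditer(r"[.?!](?=\s+[A-Z])", text)]
    sentences, start = [], 0
    for pos in boundaries:
        last_token = text[start:pos].split()[-1]   # token ending at the punctuation
        if last_token in ABBREVIATIONS:
            continue                               # treated as an abbreviation, not a boundary
        sentences.append(text[start:pos].strip())
        start = pos
    sentences.append(text[start:].strip())
    return [s for s in sentences if s]

# The first call returns a single sentence: the abbreviation heuristic misses the
# genuine boundary after Inc. The second call splits correctly into two sentences.
print(split_sentences("The company is registered as Inc. It has operations worldwide."))
print(split_sentences("Mr. Sherwood said the reaction has been positive. He was pleased."))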
Figure 3.9 illustrates the different challenges in word and sentence tokenization discussed above.
Segmentation in Chinese
Some languages, including Chinese, Japanese, and Thai, do not use spaces to mark word
boundaries. Instead, alternative segmentation techniques are required.
1. Hanzi Characters:
o Chinese words are composed of hanzi characters, each typically representing a
single morpheme and pronounced as a single syllable.
o The average Chinese word length is 2.4 characters.
2. Maximum Matching Algorithm (MaxMatch):
o A greedy algorithm that requires a dictionary.
o Steps:
1. Start at the beginning of the string.
2. Find the longest dictionary word matching the input at the current
position.
3. Move the pointer past the matched word.
4. If no match is found, treat the next character as a single-character word.
5. Repeat until the entire string is processed.
o Example: a minimal sketch of the procedure is given below.
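Because Chinese text is hard to reproduce here, the sketch below runs MaxMatch over a despaced English string with a toy dictionary; both are stand-ins for a real hanzi lexicon and an unsegmented Chinese sentence.

# A minimal sketch of the MaxMatch (maximum matching) segmenter described above.
def max_match(text, dictionary):
    tokens, i = [], 0
    while i < len(text):
        # Find the longest dictionary word starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary match: emit a single character as its own word.
            tokens.append(text[i])
            i += 1
    return tokens

DICT = {"we", "can", "only", "see", "a", "short", "distance", "ahead"}
print(max_match("wecanonlyseeashortdistanceahead", DICT))
# ['we', 'can', 'only', 'see', 'a', 'short', 'distance', 'ahead']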
Introduction
Spelling errors are a common occurrence in written text, affecting both human
communication and digital text processing. The study of spelling errors, their detection, and
correction is crucial in various applications such as word processors, search engines, and Optical
Character Recognition (OCR) systems.
Historical Perspective
The concern for spelling accuracy dates back to the 19th century, as depicted in Oscar
Wilde's The Importance of Being Earnest, where Cecily criticizes Algernon's poorly spelled
letters. Similarly, Gilbert and Sullivan's works highlight the importance of spelling skills.
Thorstein Veblen's 1899 theory suggested that English spelling's complexity served as a test of
social class.
Despite its historical significance, spelling errors remain prevalent. Studies estimate error rates
ranging from 0.05% in professionally edited text to 38% in specialized applications such as
telephone directory lookups (Kukich, 1992).
Types of Spelling Errors
Kukich (1992) distinguishes three increasingly broad problems in spelling error detection
and correction:
1. Non-word error detection: Detecting errors that result in strings that are not words of
the language, e.g., graffe instead of giraffe.
2. Isolated-word error correction: Correcting non-word errors without considering the
context, e.g., correcting graffe to giraffe.
3. Context-dependent error detection and correction: Using the surrounding text to
identify and correct errors, particularly real-word errors, such as there instead of three or
homophones like desert instead of dessert.
Detection and Correction Techniques
1. Dictionary-Based Detection
o The simplest method involves checking words against a predefined dictionary.
o Early research suggested small dictionaries to avoid false positives with rare
words (Peterson, 1986), but later studies (Damerau & Mays, 1989) indicated that
large dictionaries were more beneficial.
o Probabilistic spell-checking algorithms leverage word frequency to improve
accuracy.
2. Finite-State Transducers (FSTs) and Morphological Parsing
o Finite-State Morphological Parsers help recognize words and their variations,
making them useful for spell-checking.
o FST dictionaries efficiently handle inflected forms, which is particularly
beneficial for morphologically rich languages.
3. Error Correction Methods
o Edit Distance Algorithm: Measures how many operations (insertion, deletion,
substitution, or transposition) are required to convert an incorrect word into a
valid one.
o Probabilistic Models: Use statistical methods to determine the most likely
correction based on real-world usage and context.
o Minimum Edit Distance: A non-probabilistic approach that finds the closest
correct word by minimizing the number of changes (a toy detection-and-correction
sketch follows this list).
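As a rough end-to-end illustration of non-word detection followed by isolated-word correction, the toy sketch below checks tokens against a small dictionary and proposes replacements with Python's standard-library difflib. The LEXICON is a hypothetical stand-in for a real dictionary, and difflib ranks candidates by a similarity ratio rather than by true minimum edit distance, so it only approximates the methods listed above.

import difflib

# A toy spell checker: dictionary-based non-word detection plus a similarity-based
# suggestion step. The LEXICON is a tiny illustrative stand-in for a real dictionary.
LEXICON = {"a", "the", "giraffe", "is", "very", "tall", "dessert", "desert"}

def check(tokens):
    for token in tokens:
        if token.lower() in LEXICON:
            continue                                   # known word, nothing to do
        # Isolated-word correction: propose the most similar dictionary entries.
        suggestions = difflib.get_close_matches(token.lower(), LEXICON, n=3, cutoff=0.6)
        print(f"non-word: {token!r}  suggestions: {suggestions}")

check("the graffe is very tall".split())
# non-word: 'graffe'  suggestions: ['giraffe']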
Introduction
The concept of minimum edit distance is essential in determining the similarity between two
strings by measuring how one string can be transformed into another using a set of operations.
This method is crucial in various applications, such as spell-checking, speech recognition, and
machine translation.
Definition
The minimum edit distance between two strings is the smallest number of operations required
to convert one string into the other. These operations are typically the insertion, deletion, and
substitution of single characters.
For example, the transformation from "intention" to "execution" requires five operations.
Levenshtein Distance
Using the basic version, in which each of the three operations has a cost of 1, the Levenshtein
distance between "intention" and "execution" is 5.
In an alternate version where substitutions cost 2 (with insertions and deletions still costing 1),
the distance increases to 8.
The minimum edit distance is computed using dynamic programming, a technique that solves
problems by breaking them down into smaller overlapping subproblems and storing their
solutions in a table.
distance[i, j] = \min \begin{cases}
distance[i-1, j] + ins\_cost(target_{i-1}) \\
distance[i-1, j-1] + subst\_cost(source_{j-1}, target_{i-1}) \\
distance[i, j-1] + del\_cost(source_{j-1})
\end{cases}
This approach ensures that the optimal solution is found by examining all possible
transformations step by step.
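The recurrence can be implemented directly with a table of subproblem solutions. The sketch below uses unit insertion and deletion costs and an optional substitution cost (for the Levenshtein variant mentioned earlier); for readability the rows index the source and the columns the target, which is the same recurrence up to relabeling of the indices.

def min_edit_distance(source, target, sub_cost=1):
    """Dynamic-programming minimum edit distance with unit insert/delete costs."""
    n, m = len(source), len(target)
    # dist[i][j] = cost of transforming source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i                                  # delete all source characters
    for j in range(1, m + 1):
        dist[0][j] = j                                  # insert all target characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + 1,                              # deletion
                dist[i][j - 1] + 1,                              # insertion
                dist[i - 1][j - 1] + (0 if same else sub_cost),  # substitution / copy
            )
    return dist[n][m]

print(min_edit_distance("intention", "execution"))               # 5
print(min_edit_distance("intention", "execution", sub_cost=2))   # 8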
Alignment of Strings
To understand how words align during a transformation, consider an alignment matrix. The
alignment shows which characters of the source correspond to which characters of the target,
and where insertions, deletions, and substitutions occur.
Applications
Viterbi Algorithm: Uses probabilities instead of fixed costs to determine the most
probable alignment.
Weighted Edit Distance: Assigns different costs to each operation based on linguistic
importance.
Dynamic Time Warping (DTW): Used in speech processing for time-series alignment.
Introduction
Human morphological processing examines how multi-morphemic words are represented and
processed in the human mind. This study is crucial for understanding how speakers of a language
store, retrieve, and manipulate words, particularly in morphologically rich languages such as
Turkish, but also in English.
There are two primary hypotheses regarding the storage of words in the human mental lexicon:
1. Full Listing Hypothesis
o Proposes that all words of a language are stored individually in the lexicon
without any internal morphological structure.
o Example: walk, walks, walked, happy, happily are all stored separately.
o This theory is inefficient for languages with complex morphology.
2. Minimum Redundancy Hypothesis
o Suggests that only the base morphemes are stored, and affixes are processed
separately when forming words.
o Example: Instead of storing walks and walked separately, the lexicon contains
walk, and affixes (-s, -ed) are added as needed.
o This approach is more economical, especially for highly inflected languages.
Some of the earliest evidence for morphological structure in the mental lexicon comes from
speech errors, where affixes appear separately from their stems.
These errors suggest that morphemes are stored independently and combined during speech
production.
Experimental Evidence
More recent experiments indicate that neither the Full Listing nor Minimum Redundancy
Hypothesis fully explains human morphological processing. Some morphological relationships
are represented mentally, while others are not.