
NLP – UNIT II


Text Processing Challenges, Overview of Language Scripts and their
representation on Machines using Character Sets, Language, Corpus and
Application Dependence issues, Segmentation: word level (Tokenization),
Sentence level. Regular Expression and Automata Morphology, Types,
Survey of English and Indian Languages Morphology, Morphological
parsing FSA and FST, Porter stemmer, Rule based and Paradigm based
Morphology, Human Morphological Processing, Machine Learning
approaches.

❖ Text Processing Challenges:


Many issues arise in text preprocessing that must be handled when designing NLP
systems, and many of them can be addressed as part of document triage, when
preparing a corpus for analysis.
The type of writing system used for a language is the most important factor for
determining the best approach to text preprocessing. Writing systems can be
logographic, where a large number (often thousands) of individual symbols
represent words. In contrast, writing systems can be syllabic, in which individual
symbols represent syllables, or alphabetic, in which individual symbols (more or
less) represent sounds; unlike logographic systems, syllabic and alphabetic
systems typically have fewer than 100 symbols.
In this section, we discuss the essential document triage steps, and we emphasize
the main types of dependencies that must be addressed in developing algorithms
for text segmentation: character-set dependence, language dependence, corpus
dependence, and application dependence.
These dependencies include:

• Character-Set Dependence: This aspect pertains to the specific characters
and symbols used in the writing system of a language. Different languages
may have unique character sets, necessitating specific handling during text
preprocessing. For instance, languages with logographic writing systems
(e.g., Chinese, Japanese) require segmentation at the character level, while
alphabetic or syllabic systems may involve word-level segmentation.
• Language Dependence: Languages vary in terms of grammar, morphology, and
syntax, which impact text preprocessing techniques such as tokenization,
stemming, and lemmatization. Language-specific tools and resources (e.g.,
dictionaries, grammars) are often employed to ensure accurate preprocessing
for different languages.
• Corpus Dependence: The characteristics of the corpus being analyzed, such
as domain-specific terminology, jargon, or informal language, influence
preprocessing decisions. Customized preprocessing pipelines may be
necessary to handle specific characteristics of the corpus effectively.

• Application Dependence: The intended NLP application (e.g., sentiment
analysis, named entity recognition) influences text preprocessing
requirements. For example, sentiment analysis may prioritize emoticon
handling and sentiment lexicon integration, while named entity recognition
may focus on entity extraction and normalization.
Addressing these dependencies requires a deep understanding of linguistic
principles, domain expertise, and familiarity with NLP techniques. Designing
robust text preprocessing algorithms involves considering these dependencies to
ensure accurate and effective processing of textual data for downstream NLP tasks.

❖ Overview of Language Scripts:


Historically, handling digital text was straightforward with the 7-bit ASCII
character set, covering basic English. Non-ASCII characters needed
"asciification" for compatibility, like replacing umlauts with quotes or letters.
8-bit sets expanded options but had overlap issues for different languages.
Languages like Chinese or Japanese with many characters required multi-byte
encodings, adding complexity. Unicode solved this with a Universal Character Set
supporting over 100,000 characters, using UTF-8 encoding for efficiency. It
replaced older systems, avoiding ambiguity in multilingual text processing.
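As a small illustration in Python (the example strings below are arbitrary choices,
not taken from the notes), the same characters occupy different numbers of bytes
under different encodings:

# Minimal illustration of how different encodings represent the same text.
text = "naïve"                                  # contains the non-ASCII character ï
print(text.encode("ascii", errors="replace"))   # b'na?ve'  - ASCII cannot represent ï
print(text.encode("latin-1"))                   # b'na\xefve' - one byte per character
print(text.encode("utf-8"))                     # b'na\xc3\xafve' - ï becomes two bytes
print("क".encode("utf-8"))                      # b'\xe0\xa4\x95' - a Devanagari letter needs three bytes in UTF-8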

❖ Character Set Dependence:


➢ About Character Sets
The adoption of Unicode has simplified text handling, but it introduced challenges
for tokenization. For instance, in Latin-1 encoding, bytes 161–191 represent
various symbols in English or Spanish. However, in UTF-8, this byte range
signifies a part of multi-byte sequences and lacks meaning alone. Tokenizers must
thus account for encoding specifics and language nuances.
Identifying character encoding is crucial for accurate tokenization but can be
challenging. Automatic encoding detection algorithms analyze byte patterns to
match known encodings. For example, Russian encodings like ISO-8859-5 and KOI8-
R have distinct byte ranges for Cyrillic characters, unlike Unicode where Cyrillic
characters span two bytes.
An encoding identification algorithm examines byte distributions to deduce the
most likely encoding. However, due to overlap between encodings, determining the
exact encoding may not always be feasible, especially when documents contain only
ASCII characters, which are common across multiple encodings.
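The sketch below illustrates this idea in Python for Russian text; the candidate
encodings and the lowercase-Cyrillic heuristic are illustrative assumptions, and
real detectors (for example, the chardet library) use much richer statistical models.

# A rough sketch, not a production detector, of byte-distribution-based
# encoding identification for Russian text.
def guess_russian_encoding(data: bytes) -> str:
    # Pure ASCII is valid in almost every encoding, so the result is ambiguous.
    if all(b < 0x80 for b in data):
        return "ascii (ambiguous)"
    # A clean UTF-8 decode of non-ASCII bytes is strong evidence for UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Otherwise score each legacy candidate by how much of the decoded text
    # falls in the lowercase Cyrillic range, which dominates ordinary Russian.
    best_enc, best_score = None, -1
    for enc in ("koi8-r", "iso-8859-5", "cp1251"):
        decoded = data.decode(enc, errors="replace")
        score = sum("\u0430" <= ch <= "\u044f" for ch in decoded)
        if score > best_score:
            best_enc, best_score = enc, score
    return best_enc

print(guess_russian_encoding("привет мир".encode("koi8-r")))  # koi8-r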
➢ Character Encoding Identification and Its Impact on Tokenization
Character set dependencies refer to the specific characters and symbols used in
a writing system, which can vary significantly across languages and scripts.
Handling character set dependencies is crucial in text preprocessing for NLP
tasks, as it impacts tokenization, normalization, and other processing steps.
Some key aspects to consider regarding character set dependencies:

1. Character Encoding: Different character encodings, such as ASCII, Unicode
(UTF-8, UTF-16, UTF-32), and various legacy encodings, represent characters
differently. Choosing the appropriate character encoding is essential for
correctly interpreting and processing text data, especially when dealing with
multilingual or non-Latin scripts.
2. Tokenization: Tokenization involves splitting text into tokens, which can be
words, phrases, or symbols. The tokenization process depends on the rules specific
to the character set and language. For instance, languages with logographic
scripts may require character-level tokenization, while alphabetic languages
typically tokenize at the word level.
3. Handling Special Characters: Text often contains special characters,
punctuation marks, emojis, and symbols that may carry meaning or context. Handling
these special characters appropriately during preprocessing is essential to avoid
misinterpretation or loss of information.
4. Multi-byte Characters: Some writing systems, such as Chinese, Japanese, and
Korean (CJK), use multi-byte characters. Proper handling of multi-byte characters
is necessary to ensure correct tokenization, indexing, and processing in NLP
systems.
5. Normalization: Text normalization involves transforming text to a standard
format, which may include converting characters to lowercase, expanding
contractions, and removing diacritics. The normalization process should consider
character set variations and language-specific norms to preserve semantic meaning
accurately.
6. Transliteration and Transcription: When working with languages that use non-
Latin scripts, transliteration (converting characters from one script to another)
or transcription (representing spoken language using a Latin-based script) may
be necessary. These conversions require careful consideration of phonetic and
orthographic differences between scripts.
7. Character-level Operations: Some NLP tasks, such as spell checking, named
entity recognition, and part-of-speech tagging, involve character-level
operations. Understanding the character set dependencies is crucial for designing
accurate algorithms and models for these tasks.
8. Font and Rendering Issues: In addition to character encoding, font and
rendering variations can impact text preprocessing. Differences in font styles,
ligatures, and rendering systems may affect how characters are displayed and
processed by NLP systems.
Addressing character set dependencies effectively requires knowledge of
linguistics, Unicode standards, text processing libraries, and language-specific
rules. NLP practitioners often rely on specialized tools and techniques tailored
to handle diverse character sets and writing systems to ensure robust and accurate
text preprocessing.

❖ Language Dependence:
➢ The Impact of Writing Systems on Text Segmentation: This impact is significant
and varies depending on the specific characteristics of each writing system.
1. Symbol Types: The types of symbols used in a writing system (logographic,
syllabic, alphabetic) can affect text segmentation. For instance, logographic

systems like Chinese may not have explicit word boundaries marked, while
alphabetic systems like English often use whitespace to separate words.
2. Orthographic Conventions: Orthographic conventions play a crucial role in
denoting boundaries between linguistic units such as syllables, words, or
sentences. For example, in written Amharic, both word and sentence boundaries
are explicitly marked. In contrast, written Thai lacks explicit boundary markers,
similar to the lack of clear boundaries in spoken Thai.
3. Segmentation Challenges: Languages with fewer explicit boundary markers, such
as Thai or spoken languages, present challenges for text segmentation algorithms.
Without clear cues or markers, it can be difficult to segment the text accurately
at various linguistic levels.
4. Language-Specific Issues: Many segmentation challenges are language-specific.
For example, English relies on whitespace and punctuation marks for segmentation,
but these cues may not always be sufficient for unambiguous segmentation,
especially in cases of compound words or complex sentence structures.
5. Tokenization Efforts: Tokenization, which involves segmenting text into
meaningful units such as words or sentences, faces challenges in languages where
boundaries are not explicitly marked. Robust tokenization efforts need to account
for these language-specific segmentation nuances.
6. Resource Recommendation: For detailed information on various writing systems
and their segmentation features, Daniels and Bright (1996) provide comprehensive
coverage and examples that can aid in understanding the complexities of text
segmentation across different languages.
Writing systems significantly influence text segmentation by determining how
boundaries between linguistic units are marked or implied. Understanding these
differences is crucial for developing effective text processing and natural
language understanding systems across diverse languages.
➢Language Identification
Language identification and text segmentation are indeed complex tasks that
require consideration of language-specific and orthography-specific features.
Key points related to language identification and text segmentation in
multilingual documents:
1. Language Identification: In multilingual documents or sections, identifying
the language of each part is crucial for accurate text segmentation. For languages
with unique alphabets like Greek or Hebrew, character set identification plays a
role in determining the language. Similarly, character set identification can be
used to distinguish between languages that share many characters, such as Arabic
and Persian, or European languages with identical character sets but different
frequencies.
2. Character Set Identification: Languages with distinct character sets can be
identified based on the presence of specific characters unique to that language.
For example, Persian includes supplemental characters not found in Arabic, aiding
in language differentiation.
3. Byte Range Distribution: Analyzing byte range distributions can help identify
predominant characters in a given language, especially when languages share a

similar character set but differ in character frequencies. This method is
particularly useful for distinguishing between closely related languages like
Arabic and Persian.
4. Training Models for Language Identification: For more challenging cases where
languages share the same character set and have similar frequencies, training
models based on byte/character distributions can be effective. Sorting bytes by
frequency count and using the sorted list as a signature vector for comparison
through n-gram or vector distance models is a basic yet powerful approach for
language identification.
5. Algorithmic Approaches: Algorithms for language identification often involve
statistical analysis, machine learning models, or a combination of both. These
approaches leverage linguistic features, character frequencies, and other
linguistic properties to accurately determine the language of a given text segment.
By integrating these techniques and algorithms, language identification and text
segmentation systems can effectively handle multilingual documents and diverse
writing systems, enhancing overall document triage and processing capabilities.
Successful text segmentation requires considering language-specific and
orthography-specific features due to the wide range of writing systems used
globally. Language identification is crucial during document triage, especially
for multilingual documents.
For languages with unique alphabets like Greek or Hebrew, character set
identification determines the language. Similarly, character set identification
helps narrow down languages with shared characters, like Arabic vs. Persian or
Russian vs. Ukrainian. Byte range distribution aids in identifying predominant
characters unique to a language, such as Persian's supplemental characters in
the Arabic alphabet.
In cases where languages share the same character set but differ in frequency,
training models using byte/character distributions can help. An effective
algorithm is sorting bytes by frequency for signature vectors and comparing them
using n-gram or vector distance models for final identification.
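The following Python sketch illustrates the frequency-signature idea under
simplifying assumptions: single characters instead of byte n-grams, and toy
training samples in place of real corpora.

# Rank characters by frequency to build a per-language "signature", then
# identify new text by comparing rank signatures (an "out-of-place" distance).
from collections import Counter

def rank_signature(text, top_n=30):
    # Characters sorted by descending frequency form the language signature.
    return [ch for ch, _ in Counter(text.lower()).most_common(top_n)]

def signature_distance(sig_a, sig_b):
    # Sum of rank differences; characters missing from sig_b get a maximal penalty.
    max_rank = max(len(sig_a), len(sig_b))
    pos_b = {ch: i for i, ch in enumerate(sig_b)}
    return sum(abs(i - pos_b.get(ch, max_rank)) for i, ch in enumerate(sig_a))

# Tiny, purely illustrative training samples; real systems use large corpora.
profiles = {
    "english": rank_signature("the quick brown fox jumps over the lazy dog and then some"),
    "german":  rank_signature("der schnelle braune fuchs springt über den faulen hund und mehr"),
}
unknown = rank_signature("über den hund springt der fuchs")
scores = {lang: signature_distance(unknown, sig) for lang, sig in profiles.items()}
print(min(scores, key=scores.get))  # german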

❖ Corpus Dependence:
Early natural language processing (NLP) systems were limited in their ability to
handle unstructured or poorly formatted data. They were designed to process well-
formed input following specific grammatical rules. However, with the availability
of large corpora containing diverse data types and irregularities like
misspellings, erratic punctuation, and non-standard spacing, the need for robust
NLP approaches has grown.
Algorithms relying on well-formed input struggle with the varied nature of real-
world text, such as data collected from sources like newswires, emails, internet
pages, and weblogs. Rules governing written language are challenging to define
and enforce due to the evolving nature of language usage and conventions.
Conventional punctuation and spacing, which aid in segmentation in well-formed
texts, may not be reliable indicators in diverse corpora where these conventions
are often disregarded.

For instance, capitalization and punctuation rules may vary widely depending on
the text's origin and purpose. Literary works may adhere closely to conventions,
while internet posts or emails may exhibit erratic punctuation and spelling. Such
irregularities pose challenges for segmentation algorithms designed for
structured and consistent data.
Text harvested from the internet often contains non-textual elements like headers,
images, advertisements, and scripts, which are not part of the actual content.
Robust text segmentation algorithms need to identify and filter out such
extraneous text to focus on meaningful content.
Efforts like Cleaneval (see the note below) aim to address these challenges by providing shared tasks
and evaluations focused on cleaning and segmenting web data for linguistic and
language technology research. These initiatives highlight the importance of
developing NLP techniques that can handle the complexities and irregularities
present in real-world textual data.
Note: Cleaneval is a project that tests and improves tools for cleaning up messy web
data. It focuses on removing non-text elements like images and ads from web pages,
leaving only the useful text for language research and technology. Cleaneval helps
researchers develop better algorithms for handling real-world web content.

❖ Application Dependence:
Word and sentence segmentation, although necessary for computational linguistics,
lack absolute definitions and vary widely across languages. For computational
purposes, we define these distinctions based on language and task requirements.
For instance, English contractions like "I'm" are often expanded during
tokenization to ensure essential grammatical features are preserved. Failure to
expand contractions can lead to tokens being treated as unknown words in later
processing stages.
The treatment of possessives, such as 's in English, also varies in different
tagged corpora, impacting tokenization results. These issues are typically
addressed by normalizing text to suit application needs. For example, in automatic
speech recognition, tokens are represented in a spoken form, like converting
"$300" to "three hundred dollars."
In languages like Chinese, where words don't have spaces between them, various
word segmentation conventions exist, affecting tasks like information retrieval
and text-to-speech synthesis. Task-specific Chinese segmentation has been a focus
in the machine translation community.
The tasks of word and sentence segmentation intersect with techniques discussed
in other chapters, such as Lexical Analysis, Corpus Creation, and Multiword
Expressions, as well as practical applications across different chapters in
computational linguistics literature.

❖ Segmentation: word level(Tokenization):


Tokenization, a critical step in natural language processing, faces various
challenges, especially in freely occurring text. Unlike artificial languages like
programming languages, natural languages lack strict definitions, leading to

lexical and structural ambiguities. This section delves into the technical
complexities of tokenization, highlighting differences between space-delimited
and unsegmented languages.
Space-delimited languages, like most European languages, use whitespace to
indicate word boundaries. However, the delimited character sequences may not
always align with tokens needed for further processing due to writing system
ambiguities and varying tokenization conventions. On the other hand, unsegmented
languages such as Chinese and Thai lack explicit word boundaries, necessitating
additional lexical and morphological information for tokenization.
The challenges of tokenization are influenced by the writing system (logographic,
syllabic, or alphabetic) and the typographical structure of words. Word
structures fall into three main categories: isolating (no smaller units),
agglutinating (clearly divided morphemes), and inflectional (unclear boundaries
and multiple grammatical meanings). While languages tend towards one type (e.g.,
Mandarin Chinese is isolating, Japanese is agglutinative, and Latin is
inflectional), most languages exhibit aspects of all three types. Polysynthetic
languages like Chukchi and Inuktitut form complex words resembling whole
sentences, requiring specialized tokenization approaches.
Given the distinct techniques required for space-delimited and unsegmented
languages, the following sections discuss these challenges and methodologies
separately.
➢Tokenization in Space-Delimited Languages
In alphabetic writing systems like those using the Latin alphabet, words are
typically separated by whitespace. However, even in well-formed sentences,
tokenization faces challenges, especially with punctuation marks such as periods,
commas, quotation marks, apostrophes, and hyphens. These punctuation marks can
serve multiple functions within a sentence or text.
For instance, consider the sentence from the Wall Street Journal:
"Clairson International Corp. said it expects to report a net loss for its second
quarter ended March 26 and doesn’t expect to meet analysts’ profit estimates of
$3.9 to $4 million, or 76 cents a share to 79 cents a share, for its year ending
Sept. 24."
In this sentence, periods are used for decimal points, abbreviations, and sentence
endings. Apostrophes mark possession and contractions. Tokenization must
differentiate when a punctuation mark is part of a token or a separate token.
Additionally, decisions must be made about phrases like "76 cents a share," which
can be treated as four tokens or a single hyphenated token ("76-cents-a-share").
Similar decisions arise with phrases involving numbers like "$3.9 to $4 million"
or "3.9 to 4 million dollars."
It's important to note that the semantics of numbers can vary based on genre and
application. For instance, in scientific literature, numbers like 3.9, 3.90, and
3.900 have distinct significances.
An initial tokenization approach for space-delimited languages involves
considering any character sequence surrounded by whitespace as a token. However,
this method doesn't account for punctuation marks. Certain punctuation marks
should be treated as separate tokens, even if they aren't preceded by whitespace.
Additionally, before tokenization, texts may need filtering to remove markup,
headers (including HTML), extra whitespace, and control characters.
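A first-cut tokenizer for a space-delimited language can therefore combine
whitespace splitting with a few punctuation-aware rules. The regular expression
below is only an illustrative sketch, not a complete solution:

# A minimal regex tokenizer sketch for space-delimited English text. It keeps
# numbers with internal punctuation, period-marked initialisms, and clitics
# like "n't" together, and splits other punctuation into separate tokens.
import re

TOKEN_PATTERN = re.compile(r"""
      \$?\d+(?:[.,]\d+)*%?          # numbers, money, percentages: $3.9, 76
    | (?:[A-Z]\.)+                  # initialisms written with periods: U.S.
    | \w+(?:-\w+)*(?:'\w+)?         # words, hyphenated words, simple clitics
    | [^\w\s]                       # any remaining punctuation as its own token
""", re.VERBOSE)

sentence = "Clairson International Corp. said it doesn't expect $3.9 to $4 million."
print(TOKEN_PATTERN.findall(sentence))
# ['Clairson', 'International', 'Corp', '.', 'said', 'it', "doesn't",
#  'expect', '$3.9', 'to', '$4', 'million', '.']

Note that "Corp." comes out as "Corp" plus a separate period, which is exactly the
abbreviation ambiguity discussed next.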
→ Tokenizing Punctuation
Tokenization, particularly in languages using the Latin alphabet like English,
involves various complexities, especially with punctuation marks and ambiguous
cases.
1. Abbreviations: Abbreviations are often marked by periods and can be challenging
to tokenize. They can represent multiple words or have different meanings based
on context. For example, "St." can stand for Saint, Street, or State.
2. Quotation Marks and Apostrophes: Quotation marks and apostrophes are sources
of ambiguity. They can indicate quoted passages, contractions, or possessive
forms. Apostrophes, in particular, have multiple uses in English, such as marking
contractions (e.g., "I'm" for "I am") or possessive forms (e.g., "Peter's head").
Tokenization decisions regarding apostrophes depend on context and syntactic
analysis.
3. Contractions: Contractions like "I'm" or "we've" may require expansion during
tokenization. However, the decision to expand contractions is language-dependent
and requires knowledge of specific contractions and their expanded forms.
4. Other Languages: Different languages have their own contraction rules and
conventions. For example, French has contracted articles (e.g., "l'homme" for
"the man") and pronouns (e.g., "j'ai" for "I have"). Tokenizers must be aware of
these language-specific rules for accurate tokenization.
Overall, tokenization involves understanding the diverse uses of punctuation
marks, abbreviations, contractions, and possessive forms across languages to
ensure accurate and meaningful tokenization in natural language processing tasks.
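The sketch below illustrates the contraction handling described above with a
small hand-built lookup table; the entries are illustrative only, and genuinely
ambiguous cases (such as "'s" as possessive versus "is") must still be left to
later processing.

# Expand or split contractions using a small, language-specific lookup table.
CONTRACTIONS = {
    "i'm": "i am",
    "we've": "we have",
    "doesn't": "does not",
    "l'homme": "l' homme",   # French: split the contracted article rather than expand
    "j'ai": "j' ai",
}

def expand_contractions(tokens):
    expanded = []
    for tok in tokens:
        expanded.extend(CONTRACTIONS.get(tok.lower(), tok).split())
    return expanded

print(expand_contractions(["I'm", "sure", "we've", "seen", "l'homme"]))
# ['i', 'am', 'sure', 'we', 'have', 'seen', "l'", 'homme']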
→ Multi-Part Words
Many written languages, to varying extents, feature space-delimited words that
consist of multiple units, each conveying a specific grammatical meaning. This
phenomenon is especially common in strongly agglutinative languages like Swahili,
Quechua, and many Altaic languages. German also employs compounding extensively,
combining nouns, adverbs, and prepositions to create complex words.
Agglutinative constructions, where multiple grammatical parts are joined to form
a single word, can also be marked by hyphenation. English often uses hyphens to
create single-token words like "end-of-line" or multi-token words like "Boston-
based." However, hyphen usage varies between languages and dialects, such as
between British and American English.
In French, hyphenated compounds like "va-t-il" (will it?), "c’est-à-dire" (that
is to say), and "celui-ci" (this one) require expansion during tokenization to
capture essential grammatical features. Tokenizers need specific rules to handle
these structures correctly.
Tokenization can be challenging due to end-of-line hyphens used in traditional
typesetting. Distinguishing between incidental hyphenation and naturally
hyphenated words at line breaks is difficult but crucial for accurate tokenization.

When tokenizing multi-part words, including hyphenated or agglutinative ones,
whitespace alone isn't sufficient for further processing. This challenge in
tokenization is closely related to morphological analysis, which is discussed in
detail in Chapter 3 of this handbook.
→ Multiword Expressions
Spacing conventions in written languages can create challenges for tokenization
in natural language processing (NLP) applications, particularly with multiword
expressions.
1. Equivalence of Multiword Expressions: Some multiword expressions, like "in
spite of" and "despite," are functionally equivalent and can be treated as a
single token. Similarly, foreign loanword expressions like "au pair" or "de
facto" can also be considered as single tokens.
2. Multiword Numerical Expressions: Tokenization involves identifying multiword
numerical expressions, such as dates ("March 26"), money expressions ("$3.9 to
$4 million"), and percents. These expressions, along with sequences of digits,
are often treated as single tokens for ease of processing.
3. Text Normalization: Tokenizers must normalize numeric expressions to ensure
consistency across different formats, languages, and applications. For example,
dates can be represented in various ways (e.g., "Nov. 18, 1989" or "18/11/89"),
highlighting the need for text normalization during tokenization.
4. Language and Application Dependence: The treatment of multiword expressions
depends on the language and the specific NLP application. For instance, the
phrase "no one" can be a single token or two separate words based on context,
and this decision may be deferred to later processing stages like parsing.
In summary, tokenization must account for multiword expressions, consider text
normalization for numerical expressions, and be adaptable to different languages
and applications for accurate and effective NLP processing.
➢Tokenization in Unsegmented Languages
Tokenization in unsegmented languages like Chinese, Japanese, and Thai presents
unique challenges due to the absence of spaces between words. This requires a
more sophisticated approach compared to space-delimited languages such as English.
The specific writing systems and orthographic conventions of each language
further complicate the task, making it difficult to devise a single universal
solution.
Various algorithms have been developed to address word segmentation in
unsegmented languages, providing initial approximations for a range of languages.
Successful strategies tailored for Chinese and Japanese segmentation have been
established, highlighting the importance of language-specific approaches.
Additionally, approaches for languages with unsegmented alphabetic or syllabic
writing systems have been explored, addressing the complexities of tokenization
in these linguistic contexts.

Overall, the challenges and strategies related to tokenization in unsegmented
languages underscore the need for nuanced and context-aware techniques in natural
language processing tasks.
→ Common Approaches
Word segmentation in unsegmented languages like Chinese, Japanese, and Thai poses
significant challenges due to the absence of spaces between words and the lack
of widely accepted guidelines on what constitutes a word. This section delves
into the complexities of accurate word segmentation, highlighting key obstacles
and common approaches used in natural language processing (NLP).
One major challenge in word segmentation is dealing with unknown or out-of-
vocabulary words not present in the lexicon of the segmenter. The accuracy of
segmentation is heavily influenced by the lexicon's source and its correspondence
with the text being segmented. Research has shown that using a lexicon constructed
from a similar corpus as the test corpus can significantly improve segmentation
accuracy, emphasizing the importance of lexical resources in this task.
Additionally, the absence of clear guidelines on word boundaries leads to
disagreements among native speakers about the "correct" segmentation. This
ambiguity is evident even in languages like English, where compound words and
hyphenated phrases can be segmented in multiple ways. In languages like Chinese,
disagreement among native speakers regarding word segmentation is more common,
making it challenging to establish a definitive standard for evaluating
segmentation algorithms.
One simplistic approach to word segmentation treats each character as a distinct
word, which is practical for languages with short average word lengths like
Chinese. However, this approach lacks sophistication and is not suitable for
tasks like parsing or part-of-speech tagging. Instead, more advanced algorithms,
such as the maximum matching algorithm (greedy algorithm), are commonly used.
This algorithm attempts to find the longest word in a word list starting from
each character in the text and marks boundaries accordingly.
A variant of the maximum matching algorithm is the reverse maximum matching
algorithm, which matches characters from the end of the string. This approach,
along with forward-backward matching and language-specific heuristics, can
improve segmentation accuracy by optimizing the segmentation based on multiple
matching results.
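A Python sketch of the forward maximum matching algorithm described above is given
below, with a toy lexicon; real segmenters add statistical information and
unknown-word handling on top of this.

# Greedy (forward) maximum matching: repeatedly take the longest lexicon entry.
def maximum_matching(text, lexicon, max_word_len=4):
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a lexicon word matches.
        for j in range(min(len(text), i + max_word_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                # Fall back to a single character if nothing in the lexicon matches.
                words.append(text[i:j])
                i = j
                break
    return words

lexicon = {"北京", "大学", "北京大学", "学生", "生"}
print(maximum_matching("北京大学生", lexicon))
# ['北京大学', '生'] - the alternative reading ['北京', '大学生'] would need
# '大学生' in the lexicon and a different (non-greedy) strategy.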
In summary, accurate word segmentation in unsegmented languages requires a
combination of lexical resources, informed algorithms, and language-specific
considerations to overcome the challenges posed by ambiguous word boundaries and
unknown words. These efforts contribute significantly to the effectiveness of
NLP applications in handling text in these languages.
→ Chinese Segmentation
The Chinese writing system, composed of thousands of characters known as Hanzi,
poses challenges for word segmentation due to the absence of spaces between words.
Previous approaches to Chinese word segmentation fall into three main categories:
statistical, lexical rule-based, and hybrid approaches combining both statistical
and lexical information.

Statistical approaches leverage data such as character mutual information from
training corpora to determine likely word formations. Lexical approaches encode
language features like syntactic and semantic information, phrasal structures,
and morphological rules to refine segmentation. Hybrid approaches integrate
information from both statistical and lexical sources.
Examples of hybrid approaches include the use of weighted finite-state
transducers to identify dictionary entries and unknown words, trainable sequence
transformation rules for incremental segmentation improvement, adaptive language
models similar to text compression techniques, and rapid retraining algorithms
for new genres or segmentation standards.
One significant challenge in evaluating segmentation algorithms is the lack of a
common evaluation corpus and segmentation standards. Organized evaluations like
the "First International Chinese Word Segmentation Bakeoff" have aimed to
establish consistent standards and facilitate direct comparisons between
segmentation methods, contributing to improvements in segmentation accuracy and
corpus quality.
→ Japanese Segmentation
The Japanese writing system is complex, incorporating alphabetic, syllabic, and
logographic symbols. Modern Japanese texts often include Kanji (Chinese Hanzi
symbols), hiragana (a syllabary for grammatical markers and Japanese-origin
words), katakana (a syllabary for foreign-origin words), romaji (words in the
Roman alphabet), Arabic numerals, and punctuation symbols. While transitions
between character sets can provide clues about word boundaries, they are not
sufficient due to words containing characters from multiple sets, like inflected
verbs and company names.
Previous Japanese segmentation approaches, like JUMAN and Chasen programs, rely
on manually derived morphological rules. Some statistical techniques developed
for Chinese can also be applied to Japanese. For instance, Nagata described a
segmentation algorithm similar to Chinese segmentation methods. More recent work
by Ando and Lee introduced an unsupervised statistical segmentation method based
on n-gram counts in Kanji sequences, achieving high performance on long Kanji
sequences.
→ Unsegmented Alphabetic and Syllabic Languages
Unsegmented alphabetic and syllabic languages like Thai, Balinese, Javanese, and
Khmer pose challenges due to longer words despite having fewer characters compared
to Chinese and Japanese. Localized optimization is less practical, necessitating
segmentation based on lists of words, names, and affixes using variations of the
maximum matching algorithm. High-accuracy segmentation requires a deep
understanding of the language's lexical and morphological features.
Early works on Thai segmentation include Kawtrakul et al. (1996), presenting a
robust rule-based Thai segmenter and morphological analyzer. Meknavin et al.
(1997) employ machine learning to automatically derive lexical and collocational
features for optimal segmentation from an n-best maximum matching set.
Aroonmanakun (2002) adopts a statistical Thai segmentation approach, segmenting

the text into syllables first and then merging syllables into words based on a
trained model of syllable collocation.

❖ Segmentation: Sentence level:

Sentences in most written languages are delimited by punctuation marks, yet the
specific rules governing punctuation usage are not always clearly defined.
The scope of this problem varies greatly by language, as does the number of
different punctuation marks that need to be considered.
Thai, for example, sometimes uses a space at sentence breaks, but very often the
space is indistinguishable from the carriage return, or there is no separation
between sentences.
Even languages with relatively rich punctuation systems like English present
surprising problems. Recognizing boundaries in such a written language involves
determining the roles of all punctuation marks, which can denote sentence
boundaries: periods, question marks, exclamation points, and sometimes
semicolons, colons, dashes, and commas.

➢ Sentence Boundary Punctuation:


In most NLP applications, the only sentence boundary punctuation marks considered
are the period, question mark, and exclamation point, and the definition of
sentence is limited to the text sentence, which begins with a capital letter and
ends in a full stop.
However, grammatical sentences can be delimited by many other punctuation marks
and restricting sentence boundary punctuation to these three can cause an
application to overlook many meaningful sentences or can unnecessarily complicate
processing by allowing only longer, complex sentences.
For example,
Here is a sentence. Here is another.
Here is a sentence; here is another.
Another example,
There was nothing so VERY remarkable in that; nor did Alice think it so VERY much
out of the way to hear the Rabbit say to itself, ‘Oh dear! Oh dear! I shall be
late!’ (when she thought it over afterwards, it occurred to her that she ought
to have wondered at this, but at the time it all seemed quite natural); but when

the Rabbit actually TOOK A WATCH OUT OF ITS WAISTCOAT-POCKET, and looked at it,
and then hurried on, Alice started to her feet, for it flashed across her mind
that she had never before seen a rabbit with either a waistcoat-pocket, or a
watch to take out of it, and burning with curiosity, she ran across the field
after it, and fortunately was just in time to see it pop down a large rabbit-
hole under the hedge.
This example contains a single period at the end and three exclamation points
within a quoted passage. Long sentences like this are more likely to produce (and
compound) errors of analysis.
Treating embedded sentences and their punctuation differently could assist in
the processing of the entire text-sentence.
In the following example, the main sentence contains an embedded sentence
(delimited by dashes), and this embedded sentence also contains an embedded
quoted sentence:
The holes certainly were rough – “Just right for a lot of vagabonds like us,”
said Bigwig – but the exhausted and those who wander in strange country are not
particular about their quarters.

➢ The Importance of Context:


In any attempt to disambiguate the various uses of punctuation marks, whether in
text-sentences or embedded sentences, some amount of the context in which the
punctuation occurs is essential.
When analyzing well-formed English documents, for example, it is tempting to
believe that sentence boundary detection is simply a matter of finding a period
followed by one or more spaces followed by a word beginning with a capital letter,
perhaps also with quotation marks before or after the space. Indeed, in some
corpora (e.g., literary texts) this single period-space-capital (or period-quote-
space-capital) pattern accounts for almost all sentence boundaries.
However, the results are different in journalistic texts such as the Wall Street
Journal (WSJ). In a small corpus of the WSJ from 1989 that has 16,466 periods as
sentence boundaries, this simple rule would detect only 14,562 (88.4%) while
producing 2900 false positives, placing a boundary where one does not exist.
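The failure mode of the naive rule is easy to reproduce. In the sketch below, a
false boundary is placed after the abbreviation "Mr." because the following word
is capitalized (and, as written, the rule also consumes the period):

# The naive period-space-capital rule as a regular expression.
import re

NAIVE_BOUNDARY = re.compile(r'\.\s+(?=[A-Z])')

text = 'It expects a loss, Mr. Smith said. Analysts were not surprised.'
print(NAIVE_BOUNDARY.split(text))
# ['It expects a loss, Mr', 'Smith said', 'Analysts were not surprised.']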
Many contextual factors have been shown to assist sentence segmentation in
difficult cases. These contextual factors include
Case distinctions—In languages and corpora where both uppercase and lowercase
letters are consistently used, whether a word is capitalized provides information
about sentence boundaries. For example, checking whether the word after a period
and a space begins with a lowercase or an uppercase letter helps decide whether a
sentence boundary is present.
Part of speech—Palmer and Hearst (1997) showed that the parts of speech of the
words within three tokens of the punctuation mark can assist in sentence
segmentation. Their results indicate that even an estimate of the possible parts
of speech can produce good results. Since a sentence often begins with a noun or
pronoun followed by a verb, the grammatical patterns of the language can also help
identify a sentence boundary.

Word length—Riley (1989) used the length of the words before and after a period
as one contextual feature; abbreviations, for example, are generally no more than
about four characters long.
Lexical endings—Müller et al. (1980) used morphological analysis to recognize
suffixes and thereby filter out words which were not likely to be abbreviations.
The analysis made it possible to identify words that were not otherwise present
in the extensive word lists used to identify abbreviations.
Prefixes and suffixes—Reynar and Ratnaparkhi (1997) used both prefixes and
suffixes of the words surrounding the punctuation mark as one contextual feature.
Abbreviation classes—Riley (1989) and Reynar and Ratnaparkhi (1997) further
divided abbreviations into categories such as titles (which are not likely to
occur at a sentence boundary) and corporate designators (which are more likely
to occur at a boundary).
Internal punctuation—Kiss and Strunk (2006) used the presence of periods within
a token as a feature.
Proper nouns—Mikheev (2002) used the presence of a proper noun to the right of a
period as a feature.

➢ Traditional Rule-Based Approaches:


In well-behaved corpora, simple rules relying on regular punctuation, spacing,
and capitalization can be quickly written, and are usually quite successful.
Traditionally, the method widely used for determining sentence boundaries is a
regular grammar, usually with limited lookahead.
More elaborate implementations include extensive word lists and exception lists
to attempt to recognize abbreviations and proper nouns.
Such systems are usually developed specifically for a text corpus in a single
language and rely on special language-specific word lists; as a result, they are
not portable to other natural languages without repeating the effort of compiling
extensive lists and rewriting rules.
Although the regular grammar approach can be successful, it requires a large
manual effort to compile the individual rules used to recognize the sentence
boundaries.
Nevertheless, rule-based sentence segmentation algorithms can be very successful
when an application deals with well-behaved corpora, and a considerable amount of
research has been devoted to this issue.

➢ Robustness and Trainability:


Throughout this chapter we have emphasized the need for robustness in NLP systems,
and sentence segmentation is no exception.
The traditional rule-based systems, which rely on features such as spacing and
capitalization, will not be as successful when processing texts where these
features are not present, such as the example below:

ive just loaded pcl onto my akcl. when i do an ‘in- package’ to load pcl, ill
get the prompt but im not able to use functions like defclass, etc... is there
womething basic im missing or am i just left hanging, twisting in the breeze?
Similarly, some important kinds of text consist solely of uppercase letters;
closed-caption transcripts are an example of such a corpus, with unpredictable
spelling and punctuation, as can be seen in the following example of uppercase
closed-caption data from CNN:
THIS IS A DESPERATE ATTEMPT BY THE REPUBLICANS TO SPIN THEIR STORY THAT NOTHING
SEAR WHYOUS – SERIOUS HAS BEEN DONE AND TRY TO SAVE THE SPEAKER’S SPEAKERSHIP
AND THIS HAS BEEN A SERIOUS PROBLEM FOR THE SPEAKER, HE DID NOT TELL THE TRUTH
TO THE COMMITTEE, NUMBER ONE.
The limitations of manually crafted rule-based approaches suggest the need for
trainable approaches to sentence segmentation, in order to allow for variations
between languages, applications, and genres.
Trainable methods provide a means for addressing the problem of embedded sentence
boundaries, as well as the capability of processing a range of corpora and the
problems they present, such as erratic spacing, spelling errors, single-case
text, and optical character recognition (OCR) errors.
For each punctuation mark to be disambiguated, a typical trainable sentence
segmentation algorithm will automatically encode the context using some or all
of the features described above.
A set of training data, in which the sentence boundaries have been manually
labelled, is then used to
train a machine learning algorithm to recognize the salient features in the
context.
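A hedged sketch of this idea is shown below, in the spirit of the decision-tree
systems described in the next subsection but not a reproduction of any of them;
the features, the toy training pairs, and the use of scikit-learn are all
illustrative assumptions.

# Encode simple contextual features around each period and train a classifier.
from sklearn.tree import DecisionTreeClassifier

def period_features(prev_word, next_word):
    return [
        int(next_word[:1].isupper()),   # case of the following word
        int(len(prev_word) <= 4),       # short preceding words are often abbreviations
        int("." in prev_word),          # internal periods suggest an abbreviation
        int(prev_word.istitle()),       # capitalized word before the period
    ]

# (word before period, word after period) -> 1 if the period ends a sentence
train = [(("loss", "Analysts"), 1), (("Mr", "Smith"), 0),
         (("share", "The"), 1), (("Sept", "24"), 0)]
X = [period_features(prev, nxt) for (prev, nxt), _ in train]
y = [label for _, label in train]

clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([period_features("Corp", "It")]))  # 0 (no boundary) on this toy data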

➢ Trainable Algorithms:
One of the first published works describing a trainable sentence segmentation
algorithm was Riley (1989).
The method described used regression trees to classify periods according to
contextual features.
These contextual features included word length, punctuation after the period,
abbreviation class, case of the word, and the probability of the word occurring
at beginning or end of a sentence.
Riley's method was trained on a large corpus of AP newswire text, and he reported
an accuracy of 99.8% when tested on the Brown corpus.
Palmer and Hearst (1997) developed a sentence segmentation system called Satz,
which used a machine learning algorithm to disambiguate all occurrences of periods,
exclamation points, and question marks.
The system defined a contextual feature array for three words preceding and three
words following the punctuation mark; the feature array encoded the context as
the parts of speech, which can be attributed to each word in the context.

Using the lexical feature arrays, both a neural network and a decision tree were
trained to disambiguate the punctuation marks and achieved a high accuracy rate
(98%–99%) on a large corpus from the Wall Street Journal.
They also demonstrated that the algorithm, which was trainable in as little as one
minute and required fewer than 1,000 sentences of training data, could be rapidly
ported to new languages.
They adapted the system to French and German, in each case achieving a very high
accuracy.
Additionally, they demonstrated the trainable method to be extremely robust, as
it was able to successfully disambiguate single-case texts and optical character
recognition (OCR) data.
Reynar and Ratnaparkhi (1997) described a trainable approach to identify English
sentence boundaries using a statistical maximum entropy model.
The system used contextual templates that encoded one word of context preceding
and following the punctuation mark, using features such as prefixes, suffixes,
and abbreviation class.
They also reported success in inducing an abbreviation list from the training
data for use in the disambiguation. The algorithm, trained in less than 30 min
on 40,000 manually annotated sentences, achieved a high accuracy rate (98%+) on
the same test corpus used by Palmer and Hearst (1997), without requiring specific
lexical information, word lists, or any domain-specific information.
Though they only reported results on English, they indicated that the ease of
trainability should allow the algorithm to be used with other Roman-alphabet
languages, given adequate training data.
Mikheev (2002) developed a high-performing sentence segmentation algorithm that
jointly identifies abbreviations, proper names, and sentence boundaries.
The algorithm casts the sentence segmentation problem as one of disambiguating
abbreviations to the left of a period and proper names to the right. While using
unsupervised training methods, the algorithm encodes a great deal of manual
information regarding abbreviation structure and length.
The algorithm also relies heavily on consistent capitalization in order to
identify proper names.
Kiss and Strunk (2006) developed a largely unsupervised approach to sentence
boundary detection that focuses primarily on identifying abbreviations. The
algorithm encodes manual heuristics for abbreviation detection into a statistical
model that first identifies abbreviations and then disambiguates sentence
boundaries. The approach is essentially language-independent, and they report
results for a large number of European languages.
Trainable sentence segmentation algorithms such as these are clearly necessary
for enabling robust processing of a variety of texts and languages.
Algorithms that offer rapid training while requiring small amounts of training
data allow systems to be retargeted in hours or minutes to new text genres and
languages.

This adaptation can take into account the reality that good segmentation is task
dependent.
For example, in parallel corpus construction and processing, the segmentation
needs to be consistent in both the source and target language corpus, even if
that consistency comes at the expense of theoretical accuracy in either language.

❖ Regular Expression and Automata Morphology:

A regular expression, or RegEx, is a sequence of characters that is mainly used
to find or replace patterns present in text. In simple words, a regular
expression is a set of characters, or a pattern, that is used to find substrings
in a given string.

A regular expression (RE) is a language for specifying text search strings.


It helps us to match or extract other strings or sets of strings, with the
help of a specialized syntax present in a pattern.

For example, extracting all hashtags from a tweet, or getting email IDs or phone
numbers from large unstructured text content.

Sometimes, we want to identify the different components of an email address.

Simply put, a regular expression is defined as an "instruction" that is given
to a function on what and how to match, search, or replace a set of strings.

Regular Expressions are used in various tasks such as,

• Data pre-processing,
• Rule-based information Mining systems,
• Pattern Matching,
• Text feature Engineering,
• Web scraping,
• Data Extraction, etc.

Let's understand the working of regular expressions with the help of an example:

Consider the following list of some students of a School,

Names: Sunil, Shyam, Ankit, Surjeet, Sumit, Subhi, Surbhi, Siddharth, Sujan

Our goal is to select only those names from the above list that match a certain
pattern, something like this: S u _ _ _

That is, names whose first two letters are S and u, followed by exactly three
more positions that can be taken up by any letters. Which names from the above
list fit this criterion?

Going through the list one by one, the names Sunil, Sumit, Subhi, and Sujan fit
this criterion, as they begin with S and u and have exactly three more letters
after that. The remaining names do not follow the given criterion. So, the new
list extracted is given by:

Extracted Names: Sunil, Sumit, Subhi, Sujan

What we have done here is take a pattern and a list of student names and find the
names that match the given pattern. That is exactly how regular expressions work.
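The same exercise can be written as a regular expression in Python:

# Match "Su" followed by exactly three more letters (the S u _ _ _ pattern).
import re

names = ["Sunil", "Shyam", "Ankit", "Surjeet", "Sumit",
         "Subhi", "Surbhi", "Siddharth", "Sujan"]
pattern = re.compile(r"^Su[a-zA-Z]{3}$")
print([name for name in names if pattern.match(name)])
# ['Sunil', 'Sumit', 'Subhi', 'Sujan']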

In RegEx, we have different types of patterns to recognize different strings of
characters.

Properties of Regular Expressions

Some of the important properties of Regular Expressions are as follows:

1. The Regular Expression language was formalized by the American mathematician
Stephen Cole Kleene.

2. A Regular Expression (RE) is a formula in a special language, which can be
used for specifying simple classes of strings, where a string is a sequence of
symbols. In
simple words, we can say that Regular Expression is an algebraic notation for
characterizing a set of strings.

3. A regular expression requires two things: one is the pattern that we want to
search for, and the other is a corpus of text, or a string, in which we need to
search for the pattern.

Mathematically, we can define the concept of a Regular Expression in the
following manner:

1. ε is a Regular Expression, which indicates that the language contains an
empty string.

2. φ is a Regular Expression, which denotes the empty language.

3. If X and Y are Regular Expressions, then the following expressions are also
regular.

• X, Y
• X.Y (concatenation of X and Y)
• X+Y (union of X and Y)
• X*, Y* (Kleene closure of X and Y)

4. If an expression is derived from the above rules, then it is also a regular
expression.

Use of Regular Expressions in NLP

In NLP, we can use Regular expressions at many places such as,

1. To Validate data fields.

For Example, dates, email addresses, URLs, abbreviations, etc.

2. To Filter a particular text from the whole corpus.

For Example, spam, disallowed websites, etc.

3. To Identify particular strings in a text.

For Example, token boundaries

4. To convert the output of one processing component into the format required
for a second component.
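The sketch below illustrates the first two uses with deliberately simplified
patterns; real-world validation of email addresses and dates needs considerably
more careful expressions.

# Validating an email field and extracting dates with simple regular expressions.
import re

EMAIL = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$")
DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")

print(bool(EMAIL.match("student@example.com")))                  # True
print(bool(EMAIL.match("not-an-email")))                         # False
print(DATE.findall("Submitted on 18/11/89, revised 02/01/1990."))
# ['18/11/89', '02/01/1990']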

Automata

An automaton is represented by the 5-tuple ⟨Q, Σ, δ, I, F⟩, where:


• Q is a set of states,
• Σ is a finite set of symbols, also called the alphabet of the language recognized
by the automaton,
• δ is a transition function:

δ : Q × Σ → Q,

• I is a set of states of Q called initial states,


• F is a set of states of Q called accepting states.
A less formal definition of δ: it is a set of transitions between pairs of states
of Q, each labeled with a letter of Σ.
An automaton accepts a word whenever it has reached an accepting state and the
input has been fully consumed.
Automata morphology refers to the study and analysis of the structure and
properties of automata, particularly finite automata. Automata theory deals with
the study of abstract machines that perform computations on strings of symbols.

Finite automata are one of the fundamental models of computation in automata
theory. Automata morphology involves analyzing the states, transitions, and
behavior of finite automata. It includes understanding different types of
automata (e.g., deterministic finite automata, nondeterministic finite automata)
and their relationships.

While regular expressions and automata morphology are related in the sense that
regular expressions can be converted to equivalent finite automata (and vice
versa), they are distinct concepts with their own areas of study and application.
Deterministic Finite Automata (DFA)
A DFA is a kind of automaton in which each state has an outgoing transition for
each letter of the alphabet, as shown in the figure below. Notice that in a DFA,
there can be only one initial state.

Nondeterministic Finite Automata (NFA)
An NFA does not need to have an outgoing transition from each state for each
letter of the alphabet. Furthermore, a state can even have more than one outgoing
transition labeled with the same letter, as shown in the figure below.
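A tiny DFA can also be written directly from the 5-tuple definition above. The
particular machine below is an illustrative assumption: it accepts strings over
{a, b} that end in "ab".

# Q = {q0, q1, q2}, Σ = {a, b}, start state q0, accepting states {q2}.
DELTA = {
    ("q0", "a"): "q1", ("q0", "b"): "q0",
    ("q1", "a"): "q1", ("q1", "b"): "q2",
    ("q2", "a"): "q1", ("q2", "b"): "q0",
}
START, ACCEPTING = "q0", {"q2"}

def accepts(word):
    state = START
    for symbol in word:
        if (state, symbol) not in DELTA:     # symbol outside the alphabet
            return False
        state = DELTA[(state, symbol)]       # exactly one transition per (state, symbol)
    return state in ACCEPTING                # accept iff we finish in an accepting state

print(accepts("aab"), accepts("abba"))  # True False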

❖ Regular Expression and Automata Morphology Types:


In the context of automata theory, "automata morphology types" typically refers
to the different classes or categories of automata based on their characteristics,
capabilities, or structure. Here are some common types of automata morphology:

1. Finite Automata (FA):
Finite automata are the simplest type of automata, with a finite number of states.
They can be deterministic (DFA) or nondeterministic (NFA). DFA has a unique
transition from each state for every input symbol, while NFA can have multiple
transitions. Finite automata recognize regular languages.
2. Pushdown Automata (PDA):
Pushdown automata extend finite automata with an additional stack memory. They
have finite control, an input tape, and a stack that provides memory. Pushdown
automata recognize context-free languages.
3. Turing Machines (TM):

Turing machines are the most powerful type of automata. They consist of an
infinite tape, a read/write head, and a finite set of states with transitions.
Turing machines can compute anything computable, making them capable of
recognizing recursively enumerable languages.
4. Deterministic Finite Automata (DFA):
DFA is a type of finite automaton where for each state, there is exactly one
transition for each possible input symbol. DFA does not have epsilon (ε)
transitions.
5. Nondeterministic Finite Automata (NFA):
NFA is a type of finite automaton where for each state and input symbol, there
can be multiple possible transitions or no transition at all. NFA can have epsilon
(ε) transitions, which allow transitions without consuming input symbols.
6. Nondeterministic Pushdown Automata (NPDA):
NPDA is a type of pushdown automaton where the transitions are nondeterministic.
It can have multiple possible transitions for a given input symbol and stack
symbol.
7. Linear Bounded Automata (LBA):
Linear bounded automata are Turing machines whose tape space is limited to a
constant multiple of the input length. They can recognize context-sensitive
languages.
These are some of the fundamental types of automata studied in automata theory
and formal languages. Each type has its own characteristics, computational power,
and language recognition capabilities.

❖ Survey of English and Indian Languages Morphology:


A survey of English and Indian languages morphology would delve into the
structural elements and processes that govern word formation and inflection in
these languages. Here's an overview of the morphology of English and Indian
languages:
English Morphology:
1. Word Formation:
English words are typically formed through various processes such as affixation,
compounding, conversion, and derivation. Affixation involves adding prefixes
(e.g., "un-" in "unhappy") or suffixes (e.g., "-ness" in "happiness") to base
words. Compounding combines two or more words to create a new word (e.g.,
"blackboard"). Conversion involves changing the grammatical category of a word
without adding affixes (e.g., "to butter" (verb) -> "a butter" (noun)).
Derivation creates new words by adding affixes to existing words (e.g., "teach"
-> "teacher").
2. Inflection:
English inflection primarily involves the use of suffixes to indicate grammatical
features such as tense, number, case, and comparison. Examples include adding "-
s" for plural nouns (e.g., "cat" -> "cats"), "-ed" for past tense verbs (e.g.,

Downloaded by Bhavesh Fanade ([email protected])


lOMoARcPSD|44355733

"walk" -> "walked"), and "-er" for comparative adjectives (e.g., "tall" ->
"taller").
Indian Languages Morphology:
1. Word Formation:
Indian languages exhibit rich morphological systems with extensive use of
affixation, compounding, reduplication, and sandhi (phonological changes at
morpheme boundaries). Affixation involves prefixes and suffixes to modify the
meaning or grammatical function of words. Compounding is prevalent, combining
roots or words to create new lexical items. Reduplication duplicates part or all
of a word to express various semantic nuances. Sandhi alters the phonetic form
of words when they occur together, affecting pronunciation and morphological
boundaries.
2. Inflection:
Inflection in Indian languages typically includes markers for tense, aspect,
mood, person, number, gender, and case. Many Indian languages are agglutinative,
where morphemes are added to a root in a systematic manner to indicate different
grammatical categories. Verbs in Indian languages often exhibit complex
inflectional paradigms, including conjugation based on tense, aspect, and mood,
as well as agreement with subjects and objects. Nouns may inflect for number,
gender, and case, with different forms used for singular, plural, masculine,
feminine, neuter, and various case distinctions.
Overall, while English morphology tends to be more analytic and reliant on word
order and auxiliary verbs for grammatical information, Indian languages often
feature synthetic and agglutinative morphological systems with extensive
inflectional and derivational processes.

❖ Morphological parsing FSA and FST:


An FST is like an FSA, but it defines regular relations rather than regular languages. It has two alphabet sets (input and output). An FST has a transition function mapping a state and an input symbol to the next state, and an output function mapping a state and an input symbol to an output symbol. It can be used to recognize, generate, translate, or relate sets of strings.
An FST can be thought of as having an upper (input) tape and a lower (output) tape.

Regular Relations
• Regular language: a set of strings
• Regular relation: a set of pairs of strings
• E.g., Regular relation = {a:1, b:2, c:2}
Input alphabet Σ = {a, b, c}
Output alphabet = {1, 2}


FST:

FST Conventions:

Regular languages are closed under difference, complementation and intersection; regular relations are (generally) not. Regular languages and regular relations are both closed under union. Regular relations are also closed under composition and inversion, operations that are not defined for regular languages.
Inversion
FSTs are closed under inversion, i.e., the inverse of an FST is an FST.
Inversion just switches the input and output labels: e.g., if T1 maps ‘a’ to ‘1’, then T1⁻¹ maps ‘1’ to ‘a’. Consequently, an FST designed as a parser can
easily be changed into a generator.
It is possible to run input through multiple FSTs by using the output of one
FST as the input of the next. This is called Cascading. Composing is equivalent
in effect to Cascading but combines two FSTs and creates a new, more complex
FST.
T1 ∘ T2 = T2 (T1(s)) where s is the input string
Very simple example:
T1 = {a:1}
T2 = {1:one}
T1 ∘ T2 = {a:one}
T2 (T1 (a) ) =one
• Note that order matters: T1(T2(a)) ≠ one

• Composing will be useful for adding orthographic rules.
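A hedged way to see composition and inversion concretely: if a small, loop-free FST is viewed simply as an input-to-output mapping, the toy relations T1 and T2 used above can be composed and inverted as follows. Real FST toolkits operate on full transition graphs rather than dictionaries; this is only a sketch.

# Toy FSTs represented as input->output mappings (a simplification of the
# transition-graph view; just enough to illustrate composition and inversion).
T1 = {"a": "1"}
T2 = {"1": "one"}

def compose(t1, t2):
    """T1 ∘ T2: feed the output of T1 into T2, as in T2(T1(s))."""
    return {x: t2[y] for x, y in t1.items() if y in t2}

def invert(t):
    """Swap input and output labels, turning a parser into a generator."""
    return {y: x for x, y in t.items()}

print(compose(T1, T2))   # {'a': 'one'}
print(invert(T1))        # {'1': 'a'}
print(compose(T2, T1))   # {}  (order matters: T1(T2('a')) is undefined here)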


A finite-state automaton (FSA) is a model of computation that takes an input string, processes it one symbol at a time, and either accepts or rejects the string.
• e.g., we can write an FSA that accepts only valid English words.
A particular FSA defines a language (a set of strings that it would accept).
• e.g., the language in the FSA we are writing is the set of strings that are
valid English words.
The set of languages that can be described by some FSA is called the class of regular languages.
A FSA consists of:
• 𝑄 finite set of states
• ∑ set of input symbols
• 𝑞0 ∈ 𝑄 starting state
• 𝛿 ∶ 𝑄 × Σ → 𝑄 transition function from current state and input symbol to next
state
• 𝐹 ⊆ 𝑄 set of accepting, final states
Identify the above components:

Use lexicon to expand the morphotactic FSA into a character-level FSA.


A lexicon is a list of all words, affixes, and their behaviours. Entries are often called lexical items (a.k.a. lexical entries, lexical units).
• e.g., Noun declensions
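As a hedged sketch of this expansion, the fragment below uses an assumed toy lexicon of regular noun stems plus the plural suffix -s (the word list is illustrative, not taken from these notes). Orthographic adjustments such as fox -> foxes are deliberately left out; as noted above, they can be added by composing with spelling-rule FSTs.

# Toy lexicon: regular noun stems plus an optional plural suffix.
reg_nouns = {"cat", "dog", "fox"}
plural_suffix = "s"

def accepts(word):
    """Accept 'stem' or 'stem + s' for every stem in the lexicon.

    This is the morphotactic FSA (stem, then optional plural) expanded over the
    lexicon; a character-level FSA would be the trie of the accepted strings.
    """
    if word in reg_nouns:
        return True
    return word.endswith(plural_suffix) and word[:-1] in reg_nouns

print(accepts("cats"))    # True
print(accepts("catss"))   # False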

We begin from the smallest level of grammatical unit in language: the morpheme.
anti- dis- establish -ment -arian -ism
Six morphemes in one word
cat -s

Two morphemes in one word


Of
One morpheme in one word
Free morphemes
May occur on their own as words (happy, the, robot)
Bound morphemes
Must occur with other morphemes as parts of words
Most bound morphemes are affixes, which attach to other morphemes to form new
words.
Prefixes come before the stem: un-, as in unhappy
Suffixes come after the stem: -s, as in robots
Infixes go inside: -f**king-, as in abso-f**king-lutely (Not really an infix,
but as close as we get in English)
Circumfixes go around: em- -en, as in embolden
Inflectional morphology is used to express some kind of grammatical function
required by the language
go -> goes think -> thought
Derivational morphology is used to derive a new word, possibly of a different
part of speech
happy -> happily establish -> establishment
Exercise: come up with three prefixes and suffixes in English. Make sure to
include at least one derivational and one inflectional affix.
Languages across the world vary in their morpheme-to-word ratio.
Isolating languages: low morpheme-to-word ratio
Synthetic languages: high morpheme-to-word ratio

Morphological parsing involves analyzing the structure of words to identify their constituent morphemes and understand their grammatical properties. Finite
State Automata (FSA) and Finite State Transducers (FST) are computational
models commonly used for morphological parsing.
Finite State Automata (FSA):
FSA is a mathematical model used to recognize or generate strings of symbols
according to a set of rules. In morphological parsing, an FSA can be employed
to recognize morphemes or patterns within words. Each state in the automaton
represents a particular stage in the parsing process, and transitions between states are triggered by input symbols. FSA can be used to model morphological
processes such as affixation, compounding, and reduplication. However, FSAs are
limited in their ability to handle more complex morphological phenomena, such
as non-concatenative morphology (morphological changes that do not involve
concatenating affixes).
Finite State Transducers (FST):
FST extends the capabilities of FSA by associating output symbols with
transitions between states, enabling the modeling of transformations or
mappings between input and output strings. In morphological parsing, FSTs are
particularly useful for analyzing morphological processes that involve
morpheme-to-morpheme mappings, such as morphological inflection and derivation.
Each transition in the transducer not only consumes an input symbol but also
produces an output symbol or symbols, allowing for the generation of
morphologically parsed output. FSTs are capable of handling more complex
morphological phenomena compared to FSAs, making them suitable for a wider
range of morphological parsing tasks.
In summary, while Finite State Automata (FSA) are useful for recognizing
morphological patterns within words, Finite State Transducers (FST) provide a
more comprehensive framework for morphological parsing by allowing for both
recognition and generation of morphologically parsed output. FSTs are
particularly well-suited for modeling complex morphological processes involving
morpheme-to-morpheme mappings, making them a valuable tool in natural language
processing tasks such as morphological analysis and generation.
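As a hedged illustration of the parser/generator duality described above, the sketch below treats a tiny morphological FST as a mapping between surface forms and lexical forms. The tag inventory (+N, +SG, +PL) and the word list are assumptions made only for this example.

# Surface form -> lexical form, as a toy stand-in for a morphological FST.
parse_fst = {
    "cat":  "cat+N+SG",
    "cats": "cat+N+PL",
    "fox":  "fox+N+SG",
}

# Inverting the relation turns the parser into a generator, mirroring the fact
# that FSTs are closed under inversion.
generate_fst = {lexical: surface for surface, lexical in parse_fst.items()}

print(parse_fst["cats"])          # cat+N+PL
print(generate_fst["cat+N+SG"])   # cat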

❖ Porter stemmer:
The Porter Stemmer algorithm, developed by Martin Porter in 1980, is a rule-
based algorithm for English language stemming. Stemming is the process of reducing
words to their root or base form, often by removing suffixes or other affixes.
The Porter Stemmer algorithm applies a series of heuristic rules to strip suffixes
from words in order to obtain their stems. It is designed to be efficient and
relatively simple, making it a popular choice for stemming tasks in applications
such as information retrieval, search engines, and text mining. The algorithm
consists of several phases, each targeting specific types of suffixes and applying
transformation rules to reduce words to their stems.
While the Porter Stemmer is effective for many common cases in English stemming,
it may not always produce linguistically correct stems, as it relies on simple
heuristics rather than linguistic analysis. Despite its limitations, the Porter
Stemmer remains a widely used tool for stemming in English text processing tasks.
An example of the application of the Porter Stemmer algorithm:
Input: "running"
Stem: "run"

Input: "flies"

Downloaded by Bhavesh Fanade ([email protected])


lOMoARcPSD|44355733

Stem: "fli" (Note: "fli" is the stem obtained by the Porter Stemmer for "flies",
as it removes the suffix "es" according to its rules.)
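The same behaviour can be reproduced with the Porter stemmer implementation shipped in NLTK; a minimal sketch, assuming NLTK is installed:

from nltk.stem import PorterStemmer   # pip install nltk

stemmer = PorterStemmer()
for word in ["running", "flies", "caresses", "ponies"]:
    print(word, "->", stemmer.stem(word))
# running -> run, flies -> fli, caresses -> caress, ponies -> poni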

The Porter Stemmer algorithm is a valuable tool for reducing words to their stems
in English text processing tasks, offering simplicity and efficiency for stemming
needs.

❖ Rule based and Paradigm based Morphology:


Rule-based and paradigm-based morphology are two approaches to describing and
analyzing the morphology of languages.
Rule-Based Morphology:
In rule-based morphology, linguistic rules are used to describe the formation
and inflection of words. These rules specify how morphemes (the smallest units
of meaning) are combined or altered to form words or to indicate grammatical
features. Rule-based systems typically consist of a set of morphological rules
that apply in a linear fashion to generate or analyze words.
These rules may involve processes such as affixation (adding prefixes or suffixes),
reduplication (repeating parts of a word), or internal modification. Rule-based
morphology is often used in computational linguistics for tasks such as
morphological analysis, generation, and stemming.
Examples of rule-based morphological systems include the Porter Stemmer algorithm
for English stemming and the Xerox Finite-State Transducer for morphological
analysis in various languages.
Paradigm-Based Morphology:
Paradigm-based morphology focuses on the organization of words into paradigms,
which are sets of related forms sharing a common root or stem. Instead of
specifying rules for individual word forms, paradigm-based approaches describe
the systematic relationships between related forms within a paradigm.
These relationships are often represented in the form of morphological paradigms
or inflectional paradigms, which show how different forms of a word (such as
different tenses of a verb or different cases of a noun) are related to each
other. Paradigm-based approaches emphasize the importance of patterns and
regularities in morphological systems, rather than treating each word form as a
separate entity.
This approach is particularly useful for analyzing languages with rich
inflectional systems, where many words can be systematically derived from a
relatively small set of stems or roots. Paradigm-based morphology is often used
in linguistic typology and theoretical linguistics to study the morphological
structure of languages and the principles governing word formation and inflection.
While rule-based morphology focuses on specifying rules for generating or
analyzing individual word forms, paradigm-based morphology emphasizes the
systematic relationships between related forms within a language's morphological
system. Both approaches offer valuable insights into the structure and organization of words in languages, and they are often used in combination to provide a comprehensive understanding of morphological phenomena.
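A compact way to contrast the two views in code: the rule-based function below derives word forms by applying suffix rules, while the paradigm table simply stores the related forms of one lexeme. Both the toy rule set and the feature labels are assumptions made only for illustration.

# Rule-based view: generate forms by applying affixation rules to a stem.
def inflect_verb(stem, feature):
    rules = {"3SG": "s", "PAST": "ed", "PROG": "ing"}   # toy rule set, regular verbs only
    return stem + rules[feature]

# Paradigm-based view: store the related forms of a lexeme together as a paradigm.
walk_paradigm = {
    "BASE": "walk",
    "3SG": "walks",
    "PAST": "walked",
    "PROG": "walking",
}

print(inflect_verb("walk", "PAST"))   # walked (derived by rule)
print(walk_paradigm["PAST"])          # walked (looked up in the paradigm)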

❖ Human Morphological Processing:


Human morphological processing refers to the cognitive mechanisms and processes
involved in the recognition, production, and understanding of morphologically
complex words in human language processing. Morphological processing plays a
crucial role in various language-related tasks, including reading, writing,
speaking, and comprehension. Some key aspects of human morphological processing:
➢ Morpheme Identification:
Morphemes are the smallest units of meaning in language, such as roots, prefixes,
and suffixes.
Human morphological processing involves the ability to identify and segment
morphemes within words. This includes recognizing affixes and understanding their
meanings and grammatical functions.
➢ Word Recognition:
Morphological processing influences the recognition of words in reading and
listening comprehension.
Humans utilize morphological knowledge to decode and understand complex words.
This involves accessing mental representations of morphologically related words
and applying morphological rules to analyze and interpret novel or unfamiliar
words.
➢ Word Production:
Morphological processing is also involved in word production during speech and
writing.
Humans generate words by combining morphemes according to the rules of morphology.
This includes selecting appropriate affixes and inflections to convey specific
meanings and grammatical functions.
➢ Morphological Awareness:
Morphological awareness refers to the explicit knowledge and understanding of
morphological structure and processes.
Individuals with high morphological awareness have a greater sensitivity to
morphological patterns, can manipulate morphemes to create new words, and can
analyze the meanings of complex words based on their morphological components.
➢ Morphological Priming:
Morphological priming occurs when exposure to a morphologically related word
facilitates the processing of a subsequent word with a similar morphological
structure.
This phenomenon suggests that morphologically related words are stored and
processed together in the mental lexicon, influencing the speed and efficiency
of lexical access and retrieval.
➢ Developmental Aspects:
Morphological processing skills develop over time, with children gradually acquiring morphological knowledge and awareness through exposure to language and
literacy experiences.
Morphological processing abilities continue to develop into adulthood, with
individuals becoming more proficient in recognizing, producing, and understanding
morphologically complex words with increasing linguistic experience.
Human morphological processing encompasses a range of cognitive processes
involved in the analysis, recognition, production, and comprehension of
morphologically complex words in language. It is a fundamental aspect of language
processing that contributes to fluent communication and literacy skills.

❖ Machine Learning approaches:


Machine learning approaches to morphological processing involve the application
of computational techniques to analyze and understand the morphology of languages.
These approaches leverage large amounts of linguistic data to automatically learn
patterns, rules, and relationships within morphological systems. Common machine
learning approaches used in morphological processing:
➢ Supervised Learning:
In supervised learning, machine learning algorithms are trained on labeled data,
where each input (e.g., word) is associated with a corresponding output (e.g.,
morphological analysis or tag).
Supervised learning algorithms, such as support vector machines (SVM), decision
trees, and neural networks, can be used for tasks such as morphological tagging
(assigning morphological tags to words), morphological segmentation (segmenting
words into morphemes), and morphological analysis (identifying morphological
features of words). Training data for supervised learning in morphological
processing typically consists of annotated corpora, which contain words annotated
with morphological information such as part-of-speech tags, morpheme boundaries,
or inflectional features.
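A minimal sketch of the supervised setting, assuming scikit-learn is available: character n-gram features feed a linear classifier that predicts a singular/plural tag for nouns. The tiny training set and the label names are invented for illustration; real systems are trained on annotated corpora as described above.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy annotated data: words labelled with a morphological number tag.
words  = ["cat", "cats", "dog", "dogs", "box", "boxes", "fly", "flies", "house", "houses"]
labels = ["SG", "PL", "SG", "PL", "SG", "PL", "SG", "PL", "SG", "PL"]

# Character n-grams let the model pick up suffix cues such as '-s' and '-es'.
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
model.fit(words, labels)

print(model.predict(["tables", "table"]))   # expected roughly ['PL' 'SG'] on this toy data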
➢ Unsupervised Learning:
Unsupervised learning involves learning patterns and structures from unlabeled
data without explicit supervision. Unsupervised learning algorithms, such as
clustering algorithms (e.g., k-means clustering) and generative models (e.g.,
Hidden Markov Models, topic models), can be applied to discover morphological
patterns and regularities from raw text data.
Unsupervised learning approaches in morphological processing can be used for
tasks such as morphological segmentation, morphological clustering (grouping
words with similar morphological properties), and morphological generation
(creating new words based on learned patterns).
➢ Semi-Supervised Learning:
Semi-supervised learning combines supervised and unsupervised learning techniques,
leveraging both labeled and unlabeled data for training. Semi-supervised learning
algorithms can be beneficial in morphological processing tasks when labeled data
is limited or expensive to obtain. These approaches can improve the performance of supervised models by leveraging the additional information provided by unlabeled data, such as unlabeled text
corpora or partially annotated data.
➢ Deep Learning:
Deep learning techniques, particularly neural networks, have gained popularity
in various natural language processing tasks, including morphological processing.
Deep learning architectures, such as recurrent neural networks (RNNs), long
short-term memory (LSTM) networks, and transformer models (e.g., BERT, GPT), can
be applied to morphological tasks such as morphological tagging, morphological
generation, and morphological parsing.
Deep learning models have the ability to capture complex patterns and dependencies
in morphological data, leading to state-of-the-art performance in many
morphological processing tasks when trained on large datasets.
➢ Hybrid Approaches:
Hybrid approaches combine multiple machine learning techniques, such as combining
rule-based and statistical methods, or integrating symbolic and connectionist
models. Hybrid approaches in morphological processing leverage the strengths of
different methods to address the limitations of individual approaches and improve
overall performance.
For example, a hybrid morphological analyzer may combine rule-based morphological
rules with statistical models learned from data to achieve more accurate
morphological analysis.
Machine learning approaches play a significant role in morphological processing
by enabling automated analysis and understanding of the morphology of languages.
These approaches leverage large amounts of linguistic data to learn patterns and
structures within morphological systems, leading to improved performance in tasks
such as morphological tagging, segmentation, analysis, and generation.
