NLP Unit II Notes
❖ Language Dependence:
➢The impact of writing systems on text segmentation is significant and varies
depending on the specific characteristics of each writing system.
1. Symbol Types: The types of symbols used in a writing system (logographic,
syllabic, alphabetic) can affect text segmentation. For instance, logographic
systems like Chinese may not have explicit word boundaries marked, while
alphabetic systems like English often use whitespace to separate words.
2. Orthographic Conventions: Orthographic conventions play a crucial role in
denoting boundaries between linguistic units such as syllables, words, or
sentences. For example, in written Amharic, both word and sentence boundaries
are explicitly marked. In contrast, written Thai lacks explicit boundary markers,
similar to the lack of clear boundaries in spoken Thai.
3. Segmentation Challenges: Languages with fewer explicit boundary markers, such
as Thai or spoken languages, present challenges for text segmentation algorithms.
Without clear cues or markers, it can be difficult to segment the text accurately
at various linguistic levels.
4. Language-Specific Issues: Many segmentation challenges are language-specific.
For example, English relies on whitespace and punctuation marks for segmentation,
but these cues may not always be sufficient for unambiguous segmentation,
especially in cases of compound words or complex sentence structures.
5. Tokenization Efforts: Tokenization, which involves segmenting text into
meaningful units such as words or sentences, faces challenges in languages where
boundaries are not explicitly marked. Robust tokenization efforts need to account
for these language-specific segmentation nuances.
6. Resource Recommendation: For detailed information on various writing systems
and their segmentation features, Daniels and Bright (1996) provide comprehensive
coverage and examples that can aid in understanding the complexities of text
segmentation across different languages.
Writing systems significantly influence text segmentation by determining how
boundaries between linguistic units are marked or implied. Understanding these
differences is crucial for developing effective text processing and natural
language understanding systems across diverse languages.
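As a simple illustration (a minimal Python sketch; the Chinese sentence is an assumed sample), whitespace splitting recovers word tokens in English but returns an unsegmented Chinese string as a single chunk, since no word boundaries are marked:

    # Whitespace splitting works for English but not for Chinese,
    # where the orthography does not mark word boundaries.
    english = "the cat sat on the mat"
    chinese = "我喜欢自然语言处理"   # assumed sample: "I like natural language processing"

    print(english.split())   # ['the', 'cat', 'sat', 'on', 'the', 'mat']
    print(chinese.split())   # ['我喜欢自然语言处理'] -- one unsegmented chunk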
➢Language Identification
Language identification and text segmentation are complex tasks that
require consideration of language-specific and orthography-specific features.
Key points related to language identification and text segmentation in
multilingual documents:
1. Language Identification: In multilingual documents or sections, identifying
the language of each part is crucial for accurate text segmentation. For languages
with unique alphabets like Greek or Hebrew, character set identification plays a
role in determining the language. Similarly, character set identification can be
used to distinguish between languages that share many characters, such as Arabic
and Persian, or European languages with identical character sets but different
frequencies.
2. Character Set Identification: Languages with distinct character sets can be
identified based on the presence of specific characters unique to that language.
For example, Persian includes supplemental characters not found in Arabic, aiding
in language differentiation.
3. Byte Range Distribution: Analyzing byte range distributions can help identify
the predominant characters in a given language, especially when languages share a
common character set (a rough sketch follows below).
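A rough sketch of character-set-based identification in Python (the Unicode ranges and the small set of scripts are illustrative assumptions, not a complete identifier):

    # Guess the dominant script of a text by counting Unicode code-point ranges.
    # A real identifier would cover many more scripts and use frequency models
    # to separate languages that share a character set.
    def dominant_script(text):
        counts = {"greek": 0, "hebrew": 0, "arabic": 0, "latin": 0}
        for ch in text:
            cp = ord(ch)
            if 0x0370 <= cp <= 0x03FF:
                counts["greek"] += 1
            elif 0x0590 <= cp <= 0x05FF:
                counts["hebrew"] += 1
            elif 0x0600 <= cp <= 0x06FF:
                counts["arabic"] += 1
            elif ch.isalpha():
                counts["latin"] += 1
        return max(counts, key=counts.get)

    print(dominant_script("Καλημέρα κόσμε"))   # greek
    print(dominant_script("שלום עולם"))        # hebrew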
❖ Corpus Dependence:
Early natural language processing (NLP) systems were limited in their ability to
handle unstructured or poorly formatted data. They were designed to process well-
formed input following specific grammatical rules. However, with the availability
of large corpora containing diverse data types and irregularities like
misspellings, erratic punctuation, and non-standard spacing, the need for robust
NLP approaches has grown.
Algorithms relying on well-formed input struggle with the varied nature of real-
world text, such as data collected from sources like newswires, emails, internet
pages, and weblogs. Rules governing written language are challenging to define
and enforce due to the evolving nature of language usage and conventions.
Conventional punctuation and spacing, which aid in segmentation in well-formed
texts, may not be reliable indicators in diverse corpora where these conventions
are often disregarded.
For instance, capitalization and punctuation rules may vary widely depending on
the text's origin and purpose. Literary works may adhere closely to conventions,
while internet posts or emails may exhibit erratic punctuation and spelling. Such
irregularities pose challenges for segmentation algorithms designed for
structured and consistent data.
Text harvested from the internet often contains non-textual elements like headers,
images, advertisements, and scripts, which are not part of the actual content.
Robust text segmentation algorithms need to identify and filter out such
extraneous text to focus on meaningful content.
Efforts like Cleaneval aim to address these challenges by providing shared tasks
and evaluations focused on cleaning and segmenting web data for linguistic and
language technology research. These initiatives highlight the importance of
developing NLP techniques that can handle the complexities and irregularities
present in real-world textual data.
Note: Cleaneval is a project that tests and improves tools for cleaning up messy web
data. It focuses on removing non-text elements like images and ads from web pages,
leaving only the useful text for language research and technology. Cleaneval helps
researchers develop better algorithms for handling real-world web content.
❖ Application Dependence:
Word and sentence segmentation, although necessary for computational linguistics,
lack absolute definitions and vary widely across languages. For computational
purposes, we define these distinctions based on language and task requirements.
For instance, English contractions like "I'm" are often expanded during
tokenization to ensure essential grammatical features are preserved. Failure to
expand contractions can lead to tokens being treated as unknown words in later
processing stages.
The treatment of possessives, such as 's in English, also varies in different
tagged corpora, impacting tokenization results. These issues are typically
addressed by normalizing text to suit application needs. For example, in automatic
speech recognition, tokens are represented in a spoken form, like converting
"$300" to "three hundred dollars."
In languages like Chinese, where words don't have spaces between them, various
word segmentation conventions exist, affecting tasks like information retrieval
and text-to-speech synthesis. Task-specific Chinese segmentation has been a focus
in the machine translation community.
The tasks of word and sentence segmentation intersect with techniques discussed
in other chapters, such as Lexical Analysis, Corpus Creation, and Multiword
Expressions, as well as practical applications across different chapters in
computational linguistics literature.
❖ Tokenization:
Tokenization is complicated by lexical and structural ambiguities. This section
delves into the technical complexities of tokenization, highlighting differences
between space-delimited and unsegmented languages.
Space-delimited languages, like most European languages, use whitespace to
indicate word boundaries. However, the delimited character sequences may not
always align with tokens needed for further processing due to writing system
ambiguities and varying tokenization conventions. On the other hand, unsegmented
languages such as Chinese and Thai lack explicit word boundaries, necessitating
additional lexical and morphological information for tokenization.
The challenges of tokenization are influenced by the writing system (logographic,
syllabic, or alphabetic) and the typographical structure of words. Word
structures fall into three main categories: isolating (no smaller units),
agglutinating (clearly divided morphemes), and inflectional (unclear boundaries
and multiple grammatical meanings). While languages tend towards one type (e.g.,
Mandarin Chinese is isolating, Japanese is agglutinative, and Latin is
inflectional), most languages exhibit aspects of all three types. Polysynthetic
languages like Chukchi and Inuktitut form complex words resembling whole
sentences, requiring specialized tokenization approaches.
Given the distinct techniques required for space-delimited and unsegmented
languages, the following sections explore these challenges and methodologies
separately.
➢Tokenization in Space-Delimited Languages
In alphabetic writing systems like those using the Latin alphabet, words are
typically separated by whitespace. However, even in well-formed sentences,
tokenization faces challenges, especially with punctuation marks such as periods,
commas, quotation marks, apostrophes, and hyphens. These punctuation marks can
serve multiple functions within a sentence or text.
For instance, consider the sentence from the Wall Street Journal:
"Clairson International Corp. said it expects to report a net loss for its second
quarter ended March 26 and doesn’t expect to meet analysts’ profit estimates of
$3.9 to $4 million, or 76 cents a share to 79 cents a share, for its year ending
Sept. 24."
In this sentence, periods are used for decimal points, abbreviations, and sentence
endings. Apostrophes mark possession and contractions. Tokenization must
differentiate when a punctuation mark is part of a token or a separate token.
Additionally, decisions must be made about phrases like "76 cents a share," which
can be treated as four tokens or a single hyphenated token ("76-cents-a-share").
Similar decisions arise with phrases involving numbers like "$3.9 to $4 million"
or "3.9 to 4 million dollars."
It's important to note that the semantics of numbers can vary based on genre and
application. For instance, in scientific literature, numbers like 3.9, 3.90, and
3.900 have distinct significances.
An initial tokenization approach for space-delimited languages involves
considering any character sequence surrounded by whitespace as a token. However,
this method does not account for punctuation marks: certain punctuation marks,
such as commas and sentence-final periods, usually need to be treated as tokens
separate from the words they attach to.
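A first-cut tokenizer of this kind can be sketched in Python (the regular expression is a deliberate simplification that does not yet handle abbreviations, decimals, or clitics):

    import re

    # Split into alphanumeric sequences and single punctuation characters.
    # Hard cases discussed above ("Corp.", "$3.9", "doesn't") are left unresolved.
    def naive_tokenize(text):
        return re.findall(r"\w+|[^\w\s]", text)

    print(naive_tokenize("Clairson International Corp. said it expects $3.9 million."))
    # ['Clairson', 'International', 'Corp', '.', 'said', 'it', 'expects',
    #  '$', '3', '.', '9', 'million', '.']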
For unsegmented languages such as Thai, one reported approach first segments the
text into syllables and then merges syllables into words based on a trained model
of syllable collocation.
❖ Sentence Segmentation:
Sentences in most written languages are delimited by punctuation marks, yet the
specific usage rules for punctuation are not always clearly defined.
The scope of this problem varies greatly by language, as does the number of
different punctuation marks that need to be considered.
Thai, for example, uses a space sometimes at sentence breaks, but very often the
space is indistinguishable from the carriage return, or there is no separation
between sentences.
Even languages with relatively rich punctuation systems like English present
surprising problems. Recognizing boundaries in such a written language involves
determining the roles of all punctuation marks, which can denote sentence
boundaries: periods, question marks, exclamation points, and sometimes
semicolons, colons, dashes, and commas.
Consider, for example, the following sentence from Alice's Adventures in
Wonderland:
There was nothing so very remarkable in that; nor did Alice think it so very
much out of the way to hear the Rabbit say to itself, "Oh dear! Oh dear! I shall
be too late!" (when she thought it over afterwards, it occurred to her that she
ought to have wondered at this, but at the time it all seemed quite natural);
but when the Rabbit actually TOOK A WATCH OUT OF ITS WAISTCOAT-POCKET, and looked
at it, and then hurried on, Alice started to her feet, for it flashed across her
mind that she had never before seen a rabbit with either a waistcoat-pocket, or a
watch to take out of it, and burning with curiosity, she ran across the field
after it, and fortunately was just in time to see it pop down a large rabbit-hole
under the hedge.
This example contains a single period at the end and three exclamation points
within a quoted passage. Long sentences like this are more likely to produce (and
compound) errors of analysis.
Treating embedded sentences and their punctuation differently could assist in
the processing of the entire text-sentence.
In the following example, the main sentence contains an embedded sentence
(delimited by dashes), and this embedded sentence itself contains an embedded
quoted sentence.
The holes certainly were rough - “Just right for a lot of vagabonds like us,”
said Bigwig - but the exhausted and those who wander in strange country are not
particular about their quarters.
A number of contextual features have proven useful for disambiguating sentence
boundaries (a sketch of encoding such features follows this list):
Word length—Riley (1989) used the length of the words before and after a period
as one contextual feature; abbreviations generally contain no more than four
characters.
Lexical endings—Müller et al. (1980) used morphological analysis to recognize
suffixes and thereby filter out words which were not likely to be abbreviations.
The analysis made it possible to identify words that were not otherwise present
in the extensive word lists used to identify abbreviations.
Prefixes and suffixes—Reynar and Ratnaparkhi (1997) used both prefixes and
suffixes of the words surrounding the punctuation mark as one contextual feature.
Abbreviation classes—Riley (1989) and Reynar and Ratnaparkhi (1997) further
divided abbreviations into categories such as titles (which are not likely to
occur at a sentence boundary) and corporate designators (which are more likely
to occur at a boundary).
Internal punctuation—Kiss and Strunk (2006) used the presence of periods within
a token as a feature.
Proper nouns—Mikheev (2002) used the presence of a proper noun to the right of a
period as a feature.
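A hedged sketch of how such contextual features might be encoded for a candidate boundary period (the feature names follow the descriptions above; the helper function and example sentence are illustrative assumptions):

    # Encode a few contextual features for the token ending in a candidate period.
    def boundary_features(tokens, i):
        word = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        stem = word.rstrip(".")
        return {
            "word_len": len(stem),                  # word length (Riley 1989)
            "short_word": len(stem) <= 4,           # plausible abbreviation length
            "internal_period": "." in stem,         # internal punctuation, e.g. "U.S."
            "next_capitalized": nxt[:1].isupper(),  # proper-noun / new-sentence cue
            "suffix": stem[-3:],                    # suffix feature
        }

    tokens = "He works at Clairson International Corp. in Boston .".split()
    print(boundary_features(tokens, tokens.index("Corp.")))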
Many corpora, such as informal e-mail messages, contain no case distinctions at
all and exhibit erratic spelling and punctuation, as in the following example:
ive just loaded pcl onto my akcl. when i do an ‘in-package’ to load pcl, ill
get the prompt but im not able to use functions like defclass, etc... is there
womething basic im missing or am i just left hanging, twisting in the breeze?
Similarly, some important kinds of text consist solely of uppercase letters;
such corpora have unpredictable spelling and punctuation, as can be seen from
the following example of uppercase data from CNN:
THIS IS A DESPERATE ATTEMPT BY THE REPUBLICANS TO SPIN THEIR STORY THAT NOTHING
SEAR WHYOUS – SERIOUS HAS BEEN DONE AND TRY TO SAVE THE SPEAKER’S SPEAKERSHIP
AND THIS HAS BEEN A SERIOUS PROBLEM FOR THE SPEAKER, HE DID NOT TELL THE TRUTH
TO THE COMMITTEE, NUMBER ONE.
The limitations of manually crafted rule-based approaches suggest the need for
trainable approaches to sentence segmentation, in order to allow for variations
between languages, applications, and genres.
Trainable methods provide a means for addressing the problem of embedded sentence
boundaries, as well as the capability of processing a range of corpora and the
problems they present, such as erratic (irregular) spacing, spelling errors,
single-case text, and optical character recognition (OCR) errors.
For each punctuation mark to be disambiguated, a typical trainable sentence
segmentation algorithm will automatically encode the context using some or all
of the features described above.
A set of training data, in which the sentence boundaries have been manually
labelled, is then used to train a machine learning algorithm to recognize the
salient features in the context.
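As an illustration only (a toy sketch using scikit-learn with an invented five-example training set, not any of the specific systems described below), such a classifier might look like this:

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    # Each training instance: contextual features of a period + a manual label
    # (True = sentence boundary, False = abbreviation or internal period).
    train = [
        ({"short_word": True,  "next_capitalized": False}, False),  # "Corp. said"
        ({"short_word": True,  "next_capitalized": True},  False),  # "Mr. Smith"
        ({"short_word": False, "next_capitalized": True},  True),   # "estimates. The"
        ({"short_word": False, "next_capitalized": True},  True),   # "share. Analysts"
        ({"short_word": False, "next_capitalized": False}, False),  # "3.9 million"
    ]

    X, y = zip(*train)
    vec = DictVectorizer()
    clf = LogisticRegression().fit(vec.fit_transform(X), y)

    # Classify a new candidate period from its context features.
    print(clf.predict(vec.transform([{"short_word": False, "next_capitalized": True}])))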
➢ Trainable Algorithms:
One of the first published works describing a trainable sentence segmentation
algorithm was Riley (1989).
The method described used regression trees to classify periods according to
contextual features.
These contextual features included word length, punctuation after the period,
abbreviation class, case of the word, and the probability of the word occurring
at beginning or end of a sentence.
Riley’s method was trained using millions of words from the AP newswire, and he
reported an accuracy of 99.8% when tested on the Brown corpus.
Palmer and Hearst (1997) developed a sentence segmentation system called Satz,
which used a machine learning algorithm to disambiguate all occurrences of periods,
exclamation points, and question marks.
The system defined a contextual feature array for three words preceding and three
words following the punctuation mark; the feature array encoded the context as
the parts of speech, which can be attributed to each word in the context.
Using the lexical feature arrays, both a neural network and a decision tree were
trained to disambiguate the punctuation marks and achieved a high accuracy rate
(98%–99%) on a large corpus from the Wall Street Journal.
They also demonstrated the algorithm, which was trainable in as little as one
minute and required less than 1000 sentences of training data, to be rapidly
ported to new languages.
They adapted the system to French and German, in each case achieving a very high
accuracy.
Additionally, they demonstrated the trainable method to be extremely robust, as
it was able to successfully disambiguate single-case texts and optical character
recognition (OCR) data.
Reynar and Ratnaparkhi (1997) described a trainable approach to identify English
sentence boundaries using a statistical maximum entropy model.
The system used a set of contextual templates, which encoded one word of
context preceding and following the punctuation mark, using such features as
prefixes, suffixes, and abbreviation class.
They also reported success in inducing an abbreviation list from the training
data for use in the disambiguation. The algorithm, trained in less than 30 min
on 40,000 manually annotated sentences, achieved a high accuracy rate (98%+) on
the same test corpus used by Palmer and Hearst (1997), without requiring specific
lexical information, word lists, or any domain-specific information.
Though they only reported results on English, they indicated that the ease of
trainability should allow the algorithm to be used with other Roman-alphabet
languages, given adequate training data.
Mikheev (2002) developed a high-performing sentence segmentation algorithm that
jointly identifies abbreviations, proper names, and sentence boundaries.
The algorithm casts the sentence segmentation problem as one of disambiguating
abbreviations to the left of a period and proper names to the right. While using
unsupervised training methods, the algorithm encodes a great deal of manual
information regarding abbreviation structure and length.
The algorithm also relies heavily on consistent capitalization in order to
identify proper names.
Kiss and Strunk (2006) developed a largely unsupervised approach to sentence
boundary detection that focuses primarily on identifying abbreviations. The
algorithm encodes manual heuristics for abbreviation detection into a statistical
model that first identifies abbreviations and then disambiguates sentence
boundaries. The approach is essentially language independent, and they report
results for a large number of European languages.
Trainable sentence segmentation algorithms such as these are clearly necessary
for enabling robust processing of a variety of texts and languages.
Algorithms that offer rapid training while requiring small amounts of training
data allow systems to be retargeted in hours or minutes to new text genres and
languages.
This adaptation can take into account the reality that good segmentation is task
dependent.
For example, in parallel corpus construction and processing, the segmentation
needs to be consistent in both the source and target language corpus, even if
that consistency comes at the expense of theoretical accuracy in either language.
❖ Regular Expressions:
Regular expressions are used to find and extract patterns from text, for example
extracting all hashtags from a tweet, or pulling email IDs or phone numbers out
of large unstructured text content (a small sketch follows the list below).
Common uses include:
• Data pre-processing,
• Rule-based information mining systems,
• Pattern matching,
• Text feature engineering,
• Web scraping,
• Data extraction, etc.
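For instance, a minimal Python sketch of such extraction (the example string and patterns are assumptions; real-world e-mail and phone patterns are considerably more involved):

    import re

    text = "Contact us at support@example.com or tweet #NLP #regex to +91-9876543210"

    hashtags = re.findall(r"#\w+", text)                      # ['#NLP', '#regex']
    emails   = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)   # ['support@example.com']
    phones   = re.findall(r"\+?\d[\d-]{8,}\d", text)          # ['+91-9876543210']

    print(hashtags, emails, phones)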
Consider a simple example with a list of student names:
Names: Sunil, Shyam, Ankit, Surjeet, Sumit, Subhi, Surbhi, Siddharth, Sujan
Our goal is to select only those names from the above list that match a certain
pattern, such as: S u _ _ _
That is, names whose first two letters are S and u, followed by exactly three
more positions that can be taken up by any letters. Which names from the above
list fit this criterion?
Going through them one by one, the names Sunil, Sumit, Subhi, and Sujan fit this
criterion: they begin with S and u and have exactly three more letters after
that, while the remaining names do not. So the new list extracted is: Sunil,
Sumit, Subhi, Sujan.
What we have done here is take a pattern and a list of student names and find
the names that match the given pattern. That is exactly how regular expressions
work.
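The same selection can be written directly as a regular expression in Python (a small sketch; the pattern assumes exactly three letters after "Su"):

    import re

    names = ["Sunil", "Shyam", "Ankit", "Surjeet", "Sumit",
             "Subhi", "Surbhi", "Siddharth", "Sujan"]

    # Pattern "S u _ _ _": starts with "Su", followed by exactly three more letters.
    pattern = re.compile(r"^Su\w{3}$")

    print([name for name in names if pattern.match(name)])
    # ['Sunil', 'Sumit', 'Subhi', 'Sujan']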
A regular expression requires two things: the pattern that we want to search
for, and a corpus of text (or a string) in which to search for that pattern.
If X and Y are regular expressions, then the following expressions are also
regular:
• X, Y (X and Y themselves)
• X.Y (concatenation of X and Y)
• X+Y (union of X and Y)
• X*, Y* (Kleene closure of X and Y)
Any string derived from the above rules is also a regular expression (see the
short example below).
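These operations map directly onto regular-expression syntax; a short illustrative check in Python (the example strings are arbitrary):

    import re

    print(bool(re.fullmatch(r"ab", "ab")))        # True  (concatenation X.Y)
    print(bool(re.fullmatch(r"a|b", "b")))        # True  (union X+Y)
    print(bool(re.fullmatch(r"(ab)*", "ababab"))) # True  (Kleene closure)
    print(bool(re.fullmatch(r"(ab)*", "")))       # True  (zero repetitions allowed)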
Regular expressions are also used to convert the output of one processing
component into the format required by a second component.
Automata
A finite automaton is specified by a finite set of states Q, an input alphabet
Σ, a start state, a set of accepting states, and a transition function
δ : Q × Σ → Q
which maps the current state and an input symbol to the next state.
While regular expressions and finite automata are related, in the sense that
regular expressions can be converted to equivalent finite automata (and vice
versa), they are distinct concepts with their own areas of study and application.
Deterministic Finite Automata (DFA)
A DFA is a kind of automaton in which each state has exactly one outgoing
transition for each letter of the alphabet. Notice that a DFA can have only one
initial state.
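A small sketch of running a DFA from a transition table (the machine below, which accepts binary strings containing an even number of 1s, is an assumed example):

    # DFA over the alphabet {'0', '1'} accepting strings with an even number of 1s.
    # States: 'even' (start state, accepting) and 'odd'.
    delta = {
        ("even", "0"): "even", ("even", "1"): "odd",
        ("odd",  "0"): "odd",  ("odd",  "1"): "even",
    }

    def accepts(string, start="even", accepting={"even"}):
        state = start
        for symbol in string:
            state = delta[(state, symbol)]   # exactly one transition per symbol
        return state in accepting

    print(accepts("1011"))  # False: three 1s
    print(accepts("1001"))  # True:  two 1s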
Nondeterministic Finite Automata (NFA)
An NFA does not need to have an outgoing transition from every state for every
letter of the alphabet. Furthermore, a state may even have more than one outgoing
transition labeled with the same letter.
Turing machines are the most powerful type of automata. They consist of an
infinite tape, a read/write head, and a finite set of states with transitions.
Turing machines can compute anything computable, making them capable of
recognizing recursively enumerable languages.
4. Deterministic Finite Automata (DFA):
DFA is a type of finite automaton where for each state, there is exactly one
transition for each possible input symbol. DFA does not have epsilon (ε)
transitions.
5. Nondeterministic Finite Automata (NFA):
NFA is a type of finite automaton where for each state and input symbol, there
can be multiple possible transitions or no transition at all. NFA can have epsilon
(ε) transitions, which allow transitions without consuming input symbols.
6. Nondeterministic Pushdown Automata (NPDA):
NPDA is a type of pushdown automaton where the transitions are nondeterministic.
It can have multiple possible transitions for a given input symbol and stack
symbol.
7. Linear Bounded Automata (LBA):
Linear bounded automata are Turing machines whose tape space is limited to a
constant multiple of the input length. They can recognize context-sensitive
languages.
These are some of the fundamental types of automata studied in automata theory
and formal language theory. Each type has its own characteristics, computational
power, and language recognition capabilities.
"walk" -> "walked"), and "-er" for comparative adjectives (e.g., "tall" ->
"taller").
Indian Languages Morphology:
1. Word Formation:
Indian languages exhibit rich morphological systems with extensive use of
affixation, compounding, reduplication, and sandhi (phonological changes at
morpheme boundaries). Affixation involves prefixes and suffixes to modify the
meaning or grammatical function of words. Compounding is prevalent, combining
roots or words to create new lexical items. Reduplication duplicates part or all
of a word to express various semantic nuances. Sandhi alters the phonetic form
of words when they occur together, affecting pronunciation and morphological
boundaries.
2. Inflection:
Inflection in Indian languages typically includes markers for tense, aspect,
mood, person, number, gender, and case. Many Indian languages are agglutinative,
where morphemes are added to a root in a systematic manner to indicate different
grammatical categories. Verbs in Indian languages often exhibit complex
inflectional paradigms, including conjugation based on tense, aspect, and mood,
as well as agreement with subjects and objects. Nouns may inflect for number,
gender, and case, with different forms used for singular, plural, masculine,
feminine, neuter, and various case distinctions.
Overall, while English morphology tends to be more analytic and reliant on word
order and auxiliary verbs for grammatical information, Indian languages often
feature synthetic and agglutinative morphological systems with extensive
inflectional and derivational processes.
Regular Relations
• Regular language: a set of strings
• Regular relation: a set of pairs of strings
• E.g., Regular relation = {a:1, b:2, c:2}, with input alphabet Σ = {a, b, c}
and output alphabet {1, 2}
Finite State Automata (FSA):
In an FSA, transitions between states are triggered by input symbols. FSAs can
be used to model morphological processes such as affixation, compounding, and
reduplication. However, FSAs are limited in their ability to handle more complex
morphological phenomena, such as non-concatenative morphology (morphological
changes that do not involve concatenating affixes).
Finite State Transducers (FST):
FST extends the capabilities of FSA by associating output symbols with
transitions between states, enabling the modeling of transformations or
mappings between input and output strings. In morphological parsing, FSTs are
particularly useful for analyzing morphological processes that involve
morpheme-to-morpheme mappings, such as morphological inflection and derivation.
Each transition in the transducer not only consumes an input symbol but also
produces an output symbol or symbols, allowing for the generation of
morphologically parsed output. FSTs are capable of handling more complex
morphological phenomena compared to FSAs, making them suitable for a wider
range of morphological parsing tasks.
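A very small sketch of this idea (the toy transducer below, its transitions, and the "+N"/"+PL" tag names are illustrative assumptions, not a real morphological analyzer):

    # Toy FST: each transition consumes one input symbol and emits an output string.
    # It maps "cat" -> "cat+N" and "cats" -> "cat+N+PL"; other words are rejected.
    transitions = {
        (0, "c"): (1, "c"),
        (1, "a"): (2, "a"),
        (2, "t"): (3, "t+N"),   # end of the stem contributes the +N tag
        (3, "s"): (4, "+PL"),   # plural suffix maps to the +PL tag
    }
    accepting = {3, 4}

    def analyze(word):
        state, output = 0, ""
        for ch in word:
            if (state, ch) not in transitions:
                return None          # no path: word not covered by this toy FST
            state, out = transitions[(state, ch)]
            output += out
        return output if state in accepting else None

    print(analyze("cats"))   # cat+N+PL
    print(analyze("cat"))    # cat+N
    print(analyze("dog"))    # None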
In summary, while Finite State Automata (FSA) are useful for recognizing
morphological patterns within words, Finite State Transducers (FST) provide a
more comprehensive framework for morphological parsing by allowing for both
recognition and generation of morphologically parsed output. FSTs are
particularly well-suited for modeling complex morphological processes involving
morpheme-to-morpheme mappings, making them a valuable tool in natural language
processing tasks such as morphological analysis and generation.
❖ Porter stemmer:
The Porter Stemmer algorithm, developed by Martin Porter in 1980, is a rule-
based algorithm for English language stemming. Stemming is the process of reducing
words to their root or base form, often by removing suffixes or other affixes.
The Porter Stemmer algorithm applies a series of heuristic rules to strip suffixes
from words in order to obtain their stems. It is designed to be efficient and
relatively simple, making it a popular choice for stemming tasks in applications
such as information retrieval, search engines, and text mining. The algorithm
consists of several phases, each targeting specific types of suffixes and applying
transformation rules to reduce words to their stems.
While the Porter Stemmer is effective for many common cases in English stemming,
it may not always produce linguistically correct stems, as it relies on simple
heuristics rather than linguistic analysis. Despite its limitations, the Porter
Stemmer remains a widely used tool for stemming in English text processing tasks.
An example of the application of the Porter Stemmer algorithm:
Input: "running"
Stem: "run"
Input: "flies"
Stem: "fli" (Note: "fli" is the stem obtained by the Porter Stemmer for "flies",
as it removes the suffix "es" according to its rules.)
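For instance, using the Porter stemmer implementation available in NLTK (assuming the nltk package is installed; the extra example words are standard Porter test cases):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["running", "flies", "caresses", "ponies"]:
        print(word, "->", stemmer.stem(word))
    # running -> run
    # flies -> fli
    # caresses -> caress
    # ponies -> poni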
The Porter Stemmer algorithm is a valuable tool for reducing words to their stems
in English text processing tasks, offering simplicity and efficiency for stemming
needs.