NLP Notes
UNIT - I:
PART 1: Finding the Structure of Words: (chapter 1 txtbk 1)
1.Words and Their Components
2.Issues and Challenges
3.Morphological Models
PART 2: Finding the Structure of Documents:(chapter 2 txtbk 1)
1.Introduction
2.Methods
3.Complexity of the Approaches
4.Performances of the Approaches
UNIT - V:
PART 1: Discourse Processing: (chapter 6 txtbk 2)
1.Cohesion
2. Reference Resolution
3.Discourse Cohesion and Structure
PART 2:Language Modelling:(chapter 5 txtbk 1)
1. Introduction
2. N-Gram Models
3.Language Model Evaluation
4.Parameter Estimation
5.Language Model Adaptation
6.Types of Language Models
7.Language-Specific Modeling Problems
8.Multilingual and Cross Lingual Language Modeling
TEXTBOOK1:
Multilingual Natural Language Processing Applications: From Theory to Practice –
Daniel M. Bikel and Imed Zitouni, Pearson Publication
UNIT - I:
PART 1: Finding the Structure of Words: (chapter 1 txtbk 1)
1.Words and Their Components
2.Issues and Challenges
3.Morphological Models
There are several ways to find the structure of words in NLP. One of the most important is morphological analysis, which breaks a word down into its smallest meaningful units, called morphemes.
For example, the word "unhappily" has three morphemes: "un-" (a prefix meaning "not"), "happy" (a free morpheme meaning "feeling or showing pleasure or contentment"), and "-ly" (a suffix that changes the word into an adverb). By analyzing the morphemes in a word, NLP systems can better understand its meaning and how it relates to other words in a sentence.
In addition to morphemes, words can also be analyzed by their part of speech, such as noun, verb, adjective, or adverb. By identifying the part of speech of each word in a sentence, NLP systems can better understand the relationships between words and the structure of the sentence.
1.1 Tokens:
In natural language processing (NLP), a token refers to a sequence of characters that represents a meaningful unit of text. This could be a word, punctuation mark, number, or other entity that serves as a basic unit of analysis in NLP.
For example, in the sentence "The quick brown fox jumps over the lazy dog," the tokens are "The," "quick," "brown," "fox," "jumps," "over," "the," "lazy," and "dog." Each of these tokens represents a separate unit of meaning that can be analyzed and processed by an NLP system.
In order to analyze and process text effectively, NLP systems must be able to identify and distinguish between different types of tokens, and understand their relationships to one another. This can involve tasks such as tokenization, where the text is divided into individual tokens, and part-of-speech tagging, where each token is assigned a grammatical category (such as noun, verb, or adjective). By accurately identifying and processing tokens, NLP systems can better understand the meaning and structure of a text.
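As a rough illustration of tokenization, the sketch below splits a sentence into word and punctuation tokens with a simple regular expression. The pattern is only an assumption for demonstration; real NLP systems usually rely on a library tokenizer (for example, NLTK's word_tokenize), which handles contractions, abbreviations, and Unicode more carefully.

import re

def tokenize(text):
    # Split into word tokens and single punctuation tokens.
    # This simple pattern is only illustrative; library tokenizers
    # handle contractions, abbreviations, and Unicode more carefully.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("The quick brown fox jumps over the lazy dog."))
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']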
1.2 Lexemes:
In natural language processing (NLP), a lexeme is a unit of vocabulary that represents a single concept, regardless of its inflected forms or grammatical variations. It can be thought of as the abstract representation of a word, with all its possible inflections and variations.
For example, the word "run" has many inflected forms, such as "ran," "running," and
"runs." These inflec ons are not considered separate lexemes because they all
represent the same concept of running or moving quickly on foot.
In contrast, words that have different meanings, even if they are spelled the same
way, are considered separate lexemes. For example, the word "bank" can refer to a
financial ins tu on or the edge of a river. These different meanings are considered
separate lexemes because they represent different concepts.
Lexical analysis is also used to iden fy and analyze the morphological and
syntac cal features of a word, such as its part of speech, inflec on, and deriva on.
This informa on is important for tasks such as stemming, lemma za on, and
part-of-speech tagging, which involve reducing words to their base or root forms and
iden fying their gramma cal func ons.
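A minimal sketch of how inflected forms can be mapped back to a single lexeme, assuming NLTK and its WordNet data are installed:

from nltk.stem import WordNetLemmatizer  # requires: nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()
for form in ["run", "runs", "running", "ran"]:
    # pos="v" tells the lemmatizer to treat each form as a verb
    print(form, "->", lemmatizer.lemmatize(form, pos="v"))
# All four inflected forms map to the single lexeme "run".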
1.3 Morphemes:
In natural language processing (NLP), morphemes are the smallest units of meaning in a language. A morpheme is a sequence of phonemes (the smallest units of sound in a language) that carries meaning. Morphemes can be divided into two types: free morphemes and bound morphemes.
Free morphemes are words that can stand alone and convey meaning. Examples of free morphemes include "book," "cat," "happy," and "run."
Bound morphemes are units of meaning that cannot stand alone but must be attached to a free morpheme to convey meaning. Bound morphemes can be further divided into two types: prefixes and suffixes.
● A prefix is a bound morpheme that is added to the beginning of a word to change its meaning. For example, the prefix "un-" added to the word "happy" creates the word "unhappy," which means not happy.
● A suffix is a bound morpheme that is added to the end of a word to change its meaning. For example, the suffix "-ed" added to the word "walk" creates the word "walked," which represents the past tense of "walk."
Here are some examples of words broken down into their morphemes:
● "unhappily" = "un-" (prefix meaning "not") + "happy" + "-ly" (suffix meaning "in a
manner of")
● "rearrangement" = "re-" (prefix meaning "again") + "arrange" + "-ment" (suffix
indica ng the act of doing something)
● "cats" = "cat" (free morpheme) + "-s" (suffix indica ng plural form)
By analysing the morphemes in a word, NLP systems can be er understand its
meaning and how it relates to other words in a sentence. This can be helpful for
tasks such as part-of-speech tagging, sen ment analysis, and language transla on.
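The toy sketch below splits a few words into morphemes by stripping affixes from small, hand-picked prefix and suffix lists. The lists and the spelling behaviour (e.g., "happi" instead of "happy") are illustrative assumptions, not a real morphological analyzer:

# A toy morpheme splitter using small, hand-picked affix lists.
PREFIXES = ["un", "re"]
SUFFIXES = ["ment", "ly", "ed", "s"]

def split_morphemes(word):
    morphemes = []
    for p in PREFIXES:
        if word.startswith(p):
            morphemes.append(p + "-")
            word = word[len(p):]
            break
    suffixes = []
    for s in SUFFIXES:
        if word.endswith(s) and len(word) > len(s) + 2:
            suffixes.insert(0, "-" + s)
            word = word[: -len(s)]
            break
    return morphemes + [word] + suffixes

print(split_morphemes("unhappily"))      # ['un-', 'happi', '-ly']
print(split_morphemes("rearrangement"))  # ['re-', 'arrange', '-ment']
print(split_morphemes("cats"))           # ['cat', '-s']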
1.4 Typology:
In natural language processing (NLP), typology refers to the classification of languages based on their structural and functional features. This can include features such as word order, morphology, tense and aspect systems, and syntactic structures.
There are many different approaches to typology in NLP, but a common one is the distinction between analytic and synthetic languages. Analytic languages have a relatively simple grammatical structure and tend to rely on word order and prepositions to convey meaning. In contrast, synthetic languages have a more complex grammatical structure and use inflections and conjugations to indicate tense, number, and other grammatical features.
2. Issues and Challenges:
Finding the structure of words raises several issues and challenges for NLP systems, including:
1. Irregularity: Many words do not follow regular patterns of formation or inflection, making their structure hard to predict.
2. Ambiguity: Words and sentences can often be interpreted in more than one way, depending on the context or on the different inflected forms of a word.
3. Word order: The order of words in a sentence can have a significant impact on the meaning of the sentence, making it important to correctly identify the relationship between words.
4. Informal language: Informal language, such as slang or colloquialisms, can be challenging for NLP systems to process since they often deviate from the standard rules of grammar.
5. Out-of-vocabulary words: NLP systems may not have encountered a word before, making it difficult to determine its structure and meaning.
6. Named entities: Proper nouns, such as names of people or organizations, can be challenging to recognize and structure correctly.
7. Language-specific challenges: Different languages have different structures and rules, making it necessary to develop language-specific approaches for NLP.
8. Domain-specific challenges: NLP systems trained on one domain may not be effective in another domain, such as medical or legal language.
Overcoming these issues and challenges requires a combination of linguistic knowledge, machine learning techniques, and careful model design and evaluation.
2.1 Irregularity:
Irregularity is a challenge in natural language processing (NLP) because it refers to words that do not follow regular patterns of formation or inflection. Many languages have irregular words that are exceptions to the standard rules, making it difficult for NLP systems to accurately identify and categorize these words.
For example, in English, irregular verbs such as "go," "do," and "have" do not follow the regular pattern of adding "-ed" to the base form to form the past tense. Instead, they have their unique past tense forms ("went," "did," "had") that must be memorized.
Similarly, in English, there are many irregular plural nouns, such as "child" and "foot," that do not follow the standard rule of adding "-s" to form the plural. Instead, these words have their unique plural forms ("children," "feet") that must be memorized.
Irregularity can also occur in inflectional morphology, where different forms of a word are created by adding inflectional affixes. For example, in Spanish, the irregular verb "tener" (to have) has a unique conjugation pattern that does not follow the standard pattern of other regular verbs in the language.
However, dealing with irregularity remains an ongoing challenge in NLP, particularly in languages with a high degree of lexical variation and complex morphological systems. Therefore, NLP researchers are continually working to improve the handling of irregular and exceptional word forms.
2.2 Ambiguity:
Ambiguity is another major challenge, because a word, phrase, or sentence can often be interpreted in more than one way.
Homonyms are words that have the same spelling and pronunciation but different meanings. For example, the word "bank" can refer to a financial institution or the side of a river. This can create ambiguity in NLP tasks, such as named entity recognition, where the system needs to identify the correct entity based on the context.
Polysemous words are words that have multiple related meanings. For example, the word "book" can refer to a physical object or the act of reserving something. In this case, the intended meaning of the word can be difficult to identify without considering the context in which the word is used.
Syntactic ambiguity occurs when a sentence can be parsed in multiple ways. For example, the sentence "I saw her duck" can be interpreted as "I saw the bird she owns" or "I saw her lower her head to avoid something." In this case, the meaning of the sentence can only be determined by considering the context in which it is used.
Ambiguity can also occur due to cultural or linguistic differences. For example, the phrase "kick the bucket" means "to die" in English, but its meaning may not be apparent to non-native speakers or speakers of other languages.
2.3 Productivity:
Productivity refers to the ability of a language to create new words and word forms from existing elements.
For example, in English, new words can be created by combining existing words, such as "smartphone," "cyberbully," or "workaholic." These words are formed by combining two or more words to create a new word with a specific meaning.
Another example is the use of prefixes and suffixes to create new words. For instance, in English, the prefix "un-" can be added to words to create their opposite meaning, such as "happy" and "unhappy." The suffix "-er" can be added to a verb to create a noun indicating the person who performs the action, such as "run" and "runner."
Productivity can also occur in inflectional morphology, where different forms of a word are created by adding inflectional affixes. For example, in English, the verb "walk" can be inflected to "walked" to indicate the past tense. Similarly, the adjective "big" can be inflected to "bigger" to indicate a comparative degree.
These examples demonstrate how productivity can create a vast number of possible word forms, making it challenging for NLP systems to accurately identify and categorize words. To address this challenge, NLP researchers have developed various techniques, including morphological analysis algorithms that use statistical models to predict the likely structure of a word based on its context. Additionally, machine learning algorithms can be trained on large datasets to learn to recognize and categorize new word forms.
3. Morphological Models:
In natural language processing (NLP), morphological models refer to computational models that are designed to analyze the morphological structure of words in a language. Morphology is the study of the internal structure and the forms of words, including their inflectional and derivational patterns. Morphological models are used in a wide range of NLP applications, including part-of-speech tagging, named entity recognition, machine translation, and text-to-speech synthesis.
There are several types of morphological models used in NLP, including rule-based models, statistical models, and neural models.
Rule-based models rely on hand-crafted rules, written by linguists, that describe the inflectional and derivational patterns of a language. They are transparent and precise, but they can be costly to build and difficult to extend to new words and languages.
Statistical models use machine learning algorithms to learn the morphological structure of words from large datasets of annotated text. These models use probabilistic models, such as Hidden Markov Models (HMMs) or Conditional Random Fields (CRFs), to predict the morphological features of words. Statistical models are more accurate than rule-based models and are used in many NLP applications.
Neural models, such as recurrent neural networks (RNNs) and transformers, use deep learning techniques to learn the morphological structure of words. These models have achieved state-of-the-art results in many NLP tasks and are particularly effective in languages with complex morphological systems, such as Arabic and
Turkish.
In addition to these models, there are also morphological analyzers, which are tools that can automatically segment words into their constituent morphemes and provide additional information about the inflectional and derivational properties of each morpheme. Morphological analyzers are widely used in machine translation and information retrieval applications, where they can improve the accuracy of these systems by providing more precise linguistic information about the words in a text.
3.1 Dictionary Lookup:
Dictionary lookup is one of the simplest forms of morphological modeling used in NLP. In this approach, a dictionary or lexicon is used to store information about the words in a language, including their inflectional and derivational forms, parts of speech, and other relevant features. When a word is encountered in a text, the dictionary is consulted to retrieve its properties.
Dictionary lookup is effective for languages with simple morphological systems, such as English, where most words follow regular patterns of inflection and derivation. However, it is less effective for languages with complex morphological systems, such as Arabic, Turkish, or Finnish, where many words have irregular forms and the inflectional and derivational patterns are highly productive.
To improve the accuracy of dictionary lookup, various techniques have been developed, such as:
● Lemmatization: This involves reducing inflected words to their base or dictionary form, also known as the lemma. For example, the verb "running" would be lemmatized to "run". This helps to reduce the size of the dictionary and make it more manageable.
● Stemming: This involves reducing words to their stem or root form, which is similar to the lemma but not always identical. For example, the word "jumping" would be stemmed to "jump". This can help to group related words together and reduce the size of the dictionary.
● Morphological analysis: This involves analyzing the internal structure of words and identifying their constituent morphemes, such as prefixes, suffixes, and roots. This can help to identify the inflectional and derivational patterns of words and make it easier to store them in the dictionary.
Dictionary lookup is a simple and effective way to handle morphological analysis in NLP for languages with simple morphological systems. However, for more complex languages, it may be necessary to use more advanced morphological models, such as rule-based, statistical, or neural models.
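A minimal sketch of dictionary lookup, assuming a tiny hand-built lexicon; the entries are invented for illustration and a real system would use a much larger resource:

# A minimal dictionary-lookup morphological analyzer.
LEXICON = {
    "ran":     {"lemma": "run", "pos": "VERB", "features": "past"},
    "running": {"lemma": "run", "pos": "VERB", "features": "present participle"},
    "cats":    {"lemma": "cat", "pos": "NOUN", "features": "plural"},
    "cat":     {"lemma": "cat", "pos": "NOUN", "features": "singular"},
}

def analyze(word):
    # Fall back to the word itself when it is not in the lexicon,
    # which is exactly where dictionary lookup breaks down.
    return LEXICON.get(word.lower(), {"lemma": word, "pos": "UNKNOWN", "features": None})

print(analyze("Running"))  # {'lemma': 'run', 'pos': 'VERB', ...}
print(analyze("foxes"))    # unknown word: not covered by the dictionary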
3.2 Finite-State Morphology:
Finite-state morphology is a type of morphological modeling used in natural language processing (NLP) that is based on the principles of finite-state automata. It is a rule-based approach that uses a set of finite-state transducers to generate and recognize words in a language.
In this approach, a word is represented as a set of strings or sequences of symbols, which represent the morphemes that make up the word. Each morpheme is associated with a set of features that describe its properties, such as its part of speech, gender, tense, or case.
Finite-state morphology is particularly effective for languages with regular and productive morphological systems, such as Turkish or Finnish, where many words are generated through inflectional or derivational patterns. It can handle large morphological paradigms with high productivity, such as the conjugation of verbs or the declension of nouns, by using a set of cascading transducers that apply different rules and transformations to the input.
One of the main advantages of finite-state morphology is that it is efficient and fast, since it can handle large vocabularies and morphological paradigms using compact and optimized finite-state transducers. It is also transparent and interpretable, since the rules and transformations used by the transducers can be easily inspected and understood by linguists and language experts.
Finite-state morphology has been used in various NLP applications, such as machine translation, speech recognition, and information retrieval, and it has been shown to be effective for many languages and domains. However, it may be less effective for languages with irregular or non-productive morphological systems, or for languages with complex syntactic or semantic structures that require more sophisticated linguistic analysis.
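The sketch below simulates a very small finite-state transducer by hand: each transition rewrites an input symbol (a stem or a morphological tag) into an output string. It is only a toy under these assumptions, not a full FST toolkit such as XFST or HFST:

# A toy finite-state transducer for one English inflection pattern.
# "+PAST" is rewritten as the suffix "ed", "+3SG" as "s".
TRANSITIONS = {
    ("start", "walk"): ("stem", "walk"),
    ("start", "jump"): ("stem", "jump"),
    ("stem", "+PAST"): ("end", "ed"),
    ("stem", "+3SG"):  ("end", "s"),
}

def generate(symbols):
    state, output = "start", ""
    for sym in symbols:
        state, out = TRANSITIONS[(state, sym)]
        output += out
    return output

print(generate(["walk", "+PAST"]))  # walked
print(generate(["jump", "+3SG"]))   # jumps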
3.3 Unification-Based Morphology:
Unification-based morphology is a type of morphological modeling used in natural language processing (NLP) that is based on the principles of unification and feature-based grammar. It is a rule-based approach that uses a set of rules and constraints to generate and recognize words in a language.
The rules and constraints used in unification-based morphology are designed to perform two main operations: analysis and generation. In analysis, the rules and constraints are applied to the input word and its feature structure, in order to identify its morphemes, their properties, and their relationships. In generation, the rules and
constraints are used to construct a feature structure that corresponds to a given set of morphemes, inflecting the word for the appropriate features and properties.
Unification-based morphology is particularly effective for languages with complex and irregular morphological systems, such as Arabic or German, where many words are generated through complex and idiosyncratic patterns. It can handle rich and detailed morphological and syntactic structures, by using a set of constraints and agreements that ensure the consistency and coherence of the generated words.
One of the main advantages of unification-based morphology is that it is flexible and expressive, since it can handle a wide range of linguistic phenomena and constraints, by using a set of powerful and adaptable rules and constraints. It is also modular and extensible, since the feature structures and the rules and constraints can be easily combined and reused for different tasks and domains.
Unification-based morphology has been used in various NLP applications, such as text-to-speech synthesis, grammar checking, and machine translation, and it has been shown to be effective for many languages and domains. However, it may be less efficient and scalable than other morphological models, since the unification and constraint-solving algorithms can be computationally expensive and complex.
3.4 Functional Morphology:
Functional morphology is a type of morphological modeling used in natural language processing (NLP) that is based on the principles of functional and cognitive linguistics. It is a usage-based approach that emphasizes the functional and communicative aspects of language, and seeks to model the ways in which word forms reflect the functions they perform in communication.
In functional morphology, words are modeled as units of meaning, or lexemes, which are associated with a set of functions and communicative contexts. Each lexeme is composed of a set of abstract features that describe its semantic, pragmatic, and discursive properties, such as its thematic roles and discourse functions.
The functional morphology model seeks to capture the relationship between the form and meaning of words, by analyzing the ways in which the morphological and syntactic structures of words reflect their communicative and discourse functions. It emphasizes the role of context and discourse in the interpretation of words, and seeks to explain the ways in which words are used and modified in response to the communicative and discourse context.
Functional morphology is particularly effective for modeling the ways in which words are inflected, derived, or modified in response to the communicative and discourse context, such as in the case of argument structure alternations or pragmatic marking. It can handle the complexity and variability of natural language, by focusing on the functional and communicative properties of words, and by using a flexible, usage-based set of rules and representations.
One of the main advantages of functional morphology is that it is usage-based and corpus-driven, since it is based on the analysis of natural language data and usage patterns. It is also compatible with other models of language and cognition, such as construction grammar and cognitive linguistics, and can be integrated with other NLP techniques, such as discourse analysis and sentiment analysis.
Functional morphology has been used in various NLP applications, such as text classification, sentiment analysis, and language generation, and it has been shown to be effective for many languages and domains. However, it may require large amounts of annotated data and computational resources, in order to model the complex and variable patterns of natural language use and interpretation.
3.5 Morphology Induction:
Morphology induction refers to unsupervised approaches that learn the morphological structure of a language automatically from raw text, without hand-crafted rules or annotated data.
Morphology induction has been used in various NLP applications, such as machine translation, information retrieval, and language modeling, and it has been shown to be effective for many languages and domains. However, it may produce less accurate and interpretable results than other morphological models, since it relies on statistical patterns and does not capture the full range of morphological and syntactic structures in the language.
PART 2: Finding the Structure of Documents:
1.Introduction
2.Methods
3.Complexity of the Approaches
4.Performances of the Approaches
There are several approaches to finding the structure of documents in NLP, including:
1. Rule-based methods: These methods rely on a set of predefined rules and heuristics to identify the different structural elements of a document, such as headings, paragraphs, and sections. For example, a rule-based method might identify a section heading based on its font size, position, or formatting.
2. Machine learning methods: These methods use statistical and machine learning algorithms to automatically learn the structural patterns and features of a document, based on a training set of annotated data. For example, a machine learning method might use a support vector machine (SVM) classifier to identify the different sections of a document based on their linguistic and structural features (see the sketch after this list).
3. Hybrid methods: These methods combine rule-based and machine learning approaches, in order to leverage the strengths of both. For example, a hybrid method might use a rule-based algorithm to identify the headings and sections of a document, and then use a machine learning algorithm to classify the content of each section.
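To make the machine learning idea concrete (see item 2 above), the sketch below trains a linear SVM to label short text snippets as headings or body text. The tiny training set and its labels are invented for illustration, and scikit-learn is assumed to be available:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy examples: snippet text -> structural label.
snippets = ["1. Introduction", "Methods and Materials",
            "We collected 120 samples over two weeks.",
            "The results suggest a strong correlation."]
labels = ["heading", "heading", "body", "body"]

# TF-IDF character n-gram features feed a linear SVM classifier.
model = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                      LinearSVC())
model.fit(snippets, labels)

print(model.predict(["2. Related Work", "Each sample was weighed twice."]))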
Some of the specific techniques and tools used in finding the structure of documents in NLP include:
1. Named entity recognition: This technique identifies and extracts specific entities, such as people, places, and organizations, from the document, which can help in identifying the different sections and topics.
2. Part-of-speech tagging: This technique assigns a part-of-speech tag to each word in the document, which can help in identifying the syntactic and structural patterns of the text.
3. Coh-Metrix: This is a text analysis tool that uses a range of linguistic and discourse-based features to identify different aspects of text complexity, including topic boundaries. By analyzing the patterns of words, syntax, and discourse in a text, Coh-Metrix can identify potential topic boundaries, as well as provide insights into the overall structure and organization of the text.
Topic boundary detection is an important task in NLP, as it enables more effective organization and analysis of large amounts of text. By accurately identifying topic boundaries, NLP systems can more effectively extract and summarize information, identify key themes and ideas, and provide more insightful and relevant responses to user queries.
2. Methods:
There are several methods and techniques used in NLP to find the structure of documents, which include:
1. Sentence boundary detection: This involves identifying the boundaries between sentences in a document, which is important for tasks like parsing, machine translation, and text-to-speech synthesis.
2. Part-of-speech tagging: This involves assigning a part of speech (noun, verb, adjective, etc.) to each word in a sentence, which is useful for tasks like parsing, information extraction, and sentiment analysis.
3. Named entity recognition: This involves identifying and classifying named entities (such as people, organizations, and locations) in a document, which is important for tasks like information extraction and text categorization.
4. Coreference resolution: This involves identifying all the expressions in a text that refer to the same entity, which is important for tasks like information extraction and machine translation.
5. Topic boundary detection: This involves identifying the points in a document where the topic or theme of the text shifts, which is useful for organizing and summarizing large amounts of text.
6. Parsing: This involves analyzing the grammatical structure of sentences in a document, which is important for tasks like machine translation, text-to-speech synthesis, and information extraction.
7. Sentiment analysis: This involves identifying the sentiment (positive, negative, or neutral) expressed in a document, which is useful for tasks like brand monitoring, customer feedback analysis, and market research.
There are several tools and techniques used in NLP to perform these tasks, including machine learning algorithms, rule-based systems, and statistical models. These tools can be used in combination to build more complex NLP systems that can accurately analyze and understand the structure and content of large amounts of text.
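A short sketch of the first two methods, assuming NLTK and its tokenizer and tagger data have been downloaded:

import nltk
# Assumed one-time downloads:
#   nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

text = "Dr. Smith went to Washington. He arrived on Tuesday."

for sent in nltk.sent_tokenize(text):      # sentence boundary detection
    tokens = nltk.word_tokenize(sent)      # tokenization
    print(nltk.pos_tag(tokens))            # part-of-speech tagging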
2.1 Generative Sequence Classification Methods:
Generative sequence classification methods are a type of NLP method used to find the structure of documents. These methods involve using probabilistic models to classify sequences of words into predefined categories or labels.
One common generative model is the Hidden Markov Model (HMM), which assigns labels to words by modeling the probability distribution of the observed words given a set of hidden states. The hidden states in an HMM can represent different linguistic features, such as part-of-speech tags or named entities, and the model can be trained using labeled data to learn the most likely sequence of hidden states for a given sequence of words.
Both HMMs and CRFs can be used for tasks like part-of-speech tagging, named entity recognition, and chunking, which involve classifying sequences of words into predefined categories or labels. These methods have been shown to be effective in a variety of NLP applications and are widely used in industry and academia.
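The sketch below shows the core idea of decoding an HMM with the Viterbi algorithm for part-of-speech tagging. All probabilities are made up for illustration; in practice they would be estimated from annotated data:

# A minimal Viterbi decoder for a toy HMM part-of-speech tagger.
states = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {"DET":  {"DET": 0.01, "NOUN": 0.9,  "VERB": 0.09},
           "NOUN": {"DET": 0.1,  "NOUN": 0.3,  "VERB": 0.6},
           "VERB": {"DET": 0.5,  "NOUN": 0.4,  "VERB": 0.1}}
emit_p = {"DET":  {"the": 0.7, "a": 0.3},
          "NOUN": {"cat": 0.5, "dog": 0.5},
          "VERB": {"saw": 0.6, "chased": 0.4}}

def viterbi(words):
    # trellis[i][s] = (best probability of reaching state s at word i, backpointer)
    trellis = [{s: (start_p[s] * emit_p[s].get(words[0], 1e-6), None) for s in states}]
    for word in words[1:]:
        column = {}
        for s in states:
            prob, prev = max(
                (trellis[-1][p][0] * trans_p[p][s] * emit_p[s].get(word, 1e-6), p)
                for p in states)
            column[s] = (prob, prev)
        trellis.append(column)
    # Backtrack from the most probable final state.
    state = max(trellis[-1], key=lambda s: trellis[-1][s][0])
    tags = [state]
    for column in reversed(trellis[1:]):
        state = column[state][1]
        tags.append(state)
    return list(reversed(tags))

print(viterbi(["the", "cat", "chased", "the", "dog"]))
# Expected: ['DET', 'NOUN', 'VERB', 'DET', 'NOUN']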
2.2 Discriminative Local Classification Methods:
Discriminative local classification methods are another type of NLP method used to find the structure of documents. These methods involve training a model to classify each individual word or token in a document based on its features and the context in which it appears.
Overall, discriminative local classification methods are useful for tasks where it is necessary to classify each individual word or token in a document based on its features and context. These methods are often used in conjunction with other NLP techniques, such as sentence boundary detection and parsing, to build more complex NLP systems for document analysis and understanding.
2.3 Discriminative Sequence Classification Methods:
Discriminative sequence classification methods, such as Conditional Random Fields (CRFs), model the conditional probability of an entire label sequence given the input sequence, so that the label assigned to each word can depend on the labels of neighboring words.
Overall, discriminative sequence classification methods are useful for tasks where it is necessary to predict the label or category for a sequence of words in a document, based on the features of the sequence and the context in which it appears. These methods have been shown to be effective in a variety of NLP applications and are widely used in industry and academia.
2.4 Hybrid Approaches:
Hybrid approaches to finding the structure of documents in NLP combine multiple methods to achieve better results than any one method alone. For example, a hybrid approach might combine generative and discriminative models, or combine different types of models with different types of features.
One example of a hybrid approach is the use of Conditional Random Fields (CRFs) and Support Vector Machines (SVMs) for named entity recognition. CRFs are used to model the dependencies between neighboring labels in the sequence, while SVMs are used to model the relationship between the input features and the labels.
Overall, hybrid approaches are useful for tasks where a single method may not be sufficient to achieve high accuracy. By combining multiple methods, hybrid approaches can take advantage of the strengths of each method and achieve better performance than any one method alone.
2.5 Extensions for Global Modeling for Sentence Segmentation:
Extensions for global modeling for sentence segmentation in NLP involve using algorithms that analyze an entire document or corpus of documents to identify sentence boundaries, rather than analyzing sentences in isolation. These methods can be more effective in situations where sentence boundaries are not clearly indicated by punctuation, or where there are other sources of ambiguity.
One example of an extension for global modeling for sentence segmentation is the use of Hidden Markov Models (HMMs). HMMs are statistical models that can be used to identify patterns in a sequence of observations. In the case of sentence segmentation, the observations are the words in the document, and the model tries to identify patterns that correspond to the beginning and end of sentences. HMMs can take into account context beyond just the current sentence, which can improve accuracy in cases where sentence boundaries are not clearly marked.
Additionally, there are also neural network-based approaches, such as the use of convolutional neural networks (CNNs) or recurrent neural networks (RNNs) for sentence boundary detection. These models can learn to recognize patterns in the text by analyzing larger contexts, and can be trained on large corpora of text to improve their accuracy.
Overall, extensions for global modeling for sentence segmentation can be more effective than local models when dealing with more complex or ambiguous text, and can lead to more accurate results in certain situations.
3. Complexity of the Approaches:
Finding the structure of documents in natural language processing (NLP) can be a complex task, and there are several approaches with varying degrees of complexity. Here are a few examples:
1. Rule-based approaches: These approaches use a set of predefined rules to identify the structural elements of a document. They are relatively simple to implement and interpret, but they require manual effort to create and maintain and may not generalize well to new domains.
There are two main approaches to syntax analysis in NLP: rule-based parsing and statistical parsing. Rule-based parsing involves the use of a set of pre-defined rules that dictate how the different parts of speech and phrases in a sentence should be structured and related to each other. Statistical parsing, on the other hand, uses machine learning algorithms to learn patterns and relationships in large corpora of text in order to generate parse trees for new sentences.
For example, consider the sentence "The cat sat on the mat."
Step 1: Tokenization
The first step is to break the sentence down into its individual words, or tokens:
"The", "cat", "sat", "on", "the", "mat", "."
Step 2: Part-of-speech tagging
Next, each token is assigned a part of speech tag, which indicates its grammatical function in the sentence:
"The" (determiner), "cat" (noun), "sat" (verb), "on" (preposition), "the" (determiner), "mat" (noun), "." (punctuation)
Step 3: Dependency parsing
Finally, the relationships between the words in the sentence are analyzed using a dependency parser to create a parse tree. In this example, the parse tree might look something like this:
        sat
       /   \
    cat     on
     |       |
    The     mat
             |
            the

This parse tree shows that "cat" is the subject of the verb "sat," and "mat" is the object of the preposition "on." Parse trees like this are useful for applications such as machine translation, text-to-speech conversion, and sentiment analysis. By understanding the grammatical structure of a sentence, NLP models can more accurately interpret its meaning and generate appropriate responses or translations.
Here's an example of a parse tree for the sentence "The cat sat on the mat":

            sat(V)
           /      \
      cat(N)      on(PREP)
        |            |
      The(D)       mat(N)
                     |
                   the(D)

This parse tree shows that the sentence is composed of a noun phrase ("the cat") and a verb phrase ("sat on the mat"), with the verb phrase consisting of a verb ("sat") and a prepositional phrase ("on the mat"). The noun phrase, in turn, consists of a determiner ("the") and a noun ("cat"), and the prepositional phrase consists of a preposition ("on") and a noun phrase ("the mat").
Treebanks can be used to train statistical parsers, which can then automatically analyze new sentences and generate their own parse trees. These parsers work by identifying patterns in the treebank data and using these patterns to make predictions about the structure of new sentences. For example, a statistical parser might learn that a noun phrase is usually followed by a verb phrase and use this pattern to generate a parse tree for a new sentence.
In a dependency representation, the subject "cat" depends on the verb "sat," and the object "mat" depends on the preposition "on."
Here's an example of a dependency graph for the sentence "The cat sat on the mat":

sat ──nsubj──► cat ──det──► The
sat ──prep──► on ──pobj──► mat ──det──► the

In this graph, the word "cat" depends on the word "sat" with a subject relationship, and the word "mat" depends on the word "on" with a prepositional relationship.
Dependency graphs are useful for a variety of NLP tasks, including named entity recognition, relation extraction, and sentiment analysis. They can also be used for parsing and syntactic analysis, as they provide a compact and expressive way to represent the structure of a sentence.
One advantage of dependency graphs is that they are simpler and more efficient than phrase structure trees, which can be computationally expensive to build and manipulate. Dependency graphs also provide a more flexible representation of syntactic structure, as they can easily capture non-projective dependencies and other complex relationships between words.
Here's another example of a dependency graph for the sentence "I saw the man with the telescope":

saw ──nsubj──► I
saw ──dobj──► man ──det──► the
man ──prep──► with ──pobj──► telescope ──det──► the

This graph shows that the subject "I" depends on the verb "saw," and that the noun phrase "the man" depends on the verb "saw" with an object relationship. The prepositional phrase "with the telescope" modifies the noun phrase "the man," with the word "telescope" being the object of the preposition "with."
In summary, dependency graphs provide a flexible and efficient way to represent the syntactic structure of a sentence in NLP. They can be used for a variety of tasks and are a key component of many state-of-the-art NLP models.
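A minimal sketch of obtaining a dependency graph in practice, assuming spaCy and its small English model are installed:

import spacy

# Assumes the small English model has been installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("The cat sat on the mat.")
for token in doc:
    # token.dep_ is the dependency label, token.head is the governing word
    print(f"{token.text:<5} --{token.dep_:>6}--> {token.head.text}")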
3.2 Syntax Analysis Using Phrase Structure Trees:
Syntax analysis, also known as parsing, is the process of analyzing the grammatical structure of a sentence to identify its constituent parts and the relationships between them. In natural language processing (NLP), phrase structure trees are often used to represent the syntactic structure of a sentence.
A phrase structure tree, also known as a parse tree or a syntax tree, is a graphical representation of the syntactic structure of a sentence. It consists of a hierarchical structure of nodes, where each node represents a phrase or a constituent of the sentence.
Here's an example of a phrase structure tree for the sentence "The cat sat on the mat":

              S
        ______|______
       |             |
       NP            VP
     __|__        ___|___
    |     |      |       |
   Det    N      V       PP
    |     |      |     ___|___
   The   cat    sat   |       |
                      P       NP
                      |     __|__
                     on    |     |
                          Det    N
                           |     |
                          the   mat

In this tree, the top-level node represents the entire sentence (S), which is divided into two subparts: the noun phrase (NP) "The cat" and the verb phrase (VP) "sat on the mat". The NP is further divided into a determiner (Det) "The" and a noun (N) "cat". The VP is composed of a verb (V) "sat" and a prepositional phrase (PP) "on the mat", which itself consists of a preposition (P) "on" and another noun phrase (NP) "the mat".
Here's another example of a phrase structure tree for the sentence "John saw the man with the telescope":

           S
      _____|_____
     |           |
     NP          VP
     |        ___|____
     N       |        |
     |       V        NP
   John      |     ___|____________
            saw   |      |         |
                 Det     N         PP
                  |      |      ___|____
                 the    man    |        |
                               P        NP
                               |      __|______
                             with    |         |
                                    Det        N
                                     |         |
                                    the    telescope

In this tree, the top-level node represents the entire sentence (S), which is divided into a noun phrase (NP) "John" and a verb phrase (VP) "saw the man with the telescope". The NP is simply a single noun (N) "John". The VP is composed of a verb (V) "saw" and a noun phrase (NP) "the man with the telescope". The latter is composed of a determiner (Det) "the" and a noun (N) "man", which is modified by a prepositional phrase (PP) "with the telescope", consisting of a preposition (P) "with" and a noun phrase (NP) "the telescope".
Phrase structure trees can be used in NLP for a variety of tasks, such as machine translation, text-to-speech synthesis, and natural language understanding. By identifying the syntactic structure of a sentence, computers can more accurately understand its meaning and generate appropriate responses.
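A small sketch of working with phrase structure trees programmatically, assuming NLTK is available; the bracketed string mirrors the tree for "The cat sat on the mat" described above:

from nltk import Tree

# The bracketed string mirrors the phrase structure tree described above.
t = Tree.fromstring(
    "(S (NP (Det The) (N cat))"
    "   (VP (V sat) (PP (P on) (NP (Det the) (N mat)))))")

t.pretty_print()          # draws an ASCII version of the tree
print(t.label())          # S
print([st.label() for st in t.subtrees()])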
4. Parsing Algorithms:
There are several algorithms used in natural language processing (NLP) for syntax analysis or parsing, each with its own strengths and weaknesses. Here are three common parsing algorithms and their examples:
1. Recursive descent parsing: This is a top-down parsing algorithm that starts with the top-level symbol (usually the sentence) and recursively applies production rules to derive the structure of the sentence. Each production rule corresponds to a non-terminal symbol in the grammar, which can be expanded into a sequence of other symbols. The algorithm selects the first production rule that matches the current input, and recursively applies it to its right-hand side symbols. This process continues until a match is found for every terminal symbol in the input.
Example: Consider the following context-free grammar for arithmetic expressions:
E -> E + T | E - T | T
T -> T * F | T / F | F
F -> ( E ) | num
Suppose we want to parse the expression "3 + 4 * (5 - 2)" using recursive descent parsing. The algorithm would start with the top-level symbol E and apply the first production rule E -> E + T. It would then recursively apply the production rules for E, T, and F until it reaches the terminals "3", "+", "4", "*", "(", "5", "-", "2", and ")". The resulting parse tree would look like this:
E
├── E
│   └── T
│       └── F
│           └── num (3)
├── +
└── T
    ├── T
    │   └── F
    │       └── num (4)
    ├── *
    └── F
        ├── (
        ├── E
        │   ├── E
        │   │   └── T
        │   │       └── F
        │   │           └── num (5)
        │   ├── -
        │   └── T
        │       └── F
        │           └── num (2)
        └── )
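A small sketch of a recursive descent parser for this arithmetic grammar. Because left-recursive rules such as E -> E + T cannot be expanded directly by a recursive descent parser, each one is implemented as a loop here; the tuple-based tree format is just an illustrative choice:

import re

def tokenize(expr):
    return re.findall(r"\d+|[()+\-*/]", expr)

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, tok):
        assert self.peek() == tok, f"expected {tok}, got {self.peek()}"
        self.pos += 1

    def parse_E(self):
        # E -> E + T | E - T | T, with the left recursion turned into a loop
        node = self.parse_T()
        while self.peek() in ("+", "-"):
            op = self.peek(); self.eat(op)
            node = ("E", node, op, self.parse_T())
        return node

    def parse_T(self):
        # T -> T * F | T / F | F
        node = self.parse_F()
        while self.peek() in ("*", "/"):
            op = self.peek(); self.eat(op)
            node = ("T", node, op, self.parse_F())
        return node

    def parse_F(self):
        # F -> ( E ) | num
        if self.peek() == "(":
            self.eat("("); node = self.parse_E(); self.eat(")")
            return ("F", "(", node, ")")
        num = self.peek(); self.eat(num)
        return ("F", "num", num)

print(Parser(tokenize("3 + 4 * (5 - 2)")).parse_E())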
2. Shift-reduce parsing: This is a bottom-up parsing algorithm that starts with the input tokens and constructs a parse tree by repeatedly shifting a token onto a stack and reducing a group of symbols on the stack to a single symbol based on the production rules. The algorithm maintains a parse table that specifies which actions to take based on the current state and the next input symbol.
S -> NP VP
NP -> Det N | NP PP
VP -> V NP | VP PP
PP -> P NP
Det -> the | a
N -> man | ball | woman
V -> saw | liked
P -> with | in
Suppose we want to parse the sentence "the man saw a woman with a ball" using shift-reduce parsing. The algorithm would start with an empty stack and shift the tokens "the", "man", "saw", "a", "woman", "with", "a", and "ball" onto the stack, reducing along the way: "Det N" is reduced to NP, "P NP" to PP, "NP PP" to NP, "V NP" to VP, and finally "NP VP" to S. The resulting parse tree would look like this:
                 S
          _______|________
         |                |
         NP               VP
       __|__           ___|___
      |     |         |       |
     Det    N         V       NP
      |     |         |    ___|________
     the   man       saw  |            |
                          NP           PP
                        __|__        __|__
                       |     |      |     |
                      Det    N      P     NP
                       |     |      |   __|__
                       a   woman   with |     |
                                        Det    N
                                         |     |
                                         a    ball
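A sketch of shift-reduce parsing with NLTK's ShiftReduceParser, using the grammar above written out with its lexical rules. Note that this parser reduces greedily and does not backtrack, so it can fail on sentences such as the prepositional-phrase example above; the shorter sentence below is used so that a parse is found:

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | NP PP
VP -> V NP | VP PP
PP -> P NP
Det -> 'the' | 'a'
N -> 'man' | 'ball' | 'woman'
V -> 'saw' | 'liked'
P -> 'with' | 'in'
""")

parser = nltk.ShiftReduceParser(grammar)
tokens = "the man saw a woman".split()
for tree in parser.parse(tokens):
    tree.pretty_print()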
3. Earley parsing: This is a chart parsing algorithm that uses dynamic programming to store partial parses in a chart, which can be combined to form complete parses.
Here is an example of how shift-reduce parsing can be used to parse the sentence "the cat chased the mouse" using a simple grammar:
S -> NP VP
NP -> Det N
VP -> V NP
V -> chased
Det -> the
N -> cat | mouse
1. Initialization: We start by initializing an empty stack and an input buffer with the sentence tokens "the", "cat", "chased", "the", and "mouse". We also initialize a parse chart, which is a table used to keep track of all the possible partial parses of the sentence.
2. Shifting: We shift the first token "the" onto the stack and the next token "cat" into the lookahead buffer. The stack now contains only the symbol "the".
3. Shifting again: We shift the next token "cat" onto the stack and the next token "chased" into the lookahead buffer. The stack now contains the symbols "the" and "cat".
4. Reduction: We check if the symbols on top of the stack can be reduced to a non-terminal symbol using a production rule from the grammar. In this case, "the" and "cat" (after being reduced to Det and N) can be reduced to the non-terminal symbol NP using the production rule NP -> Det N. We pop "the" and "cat" from the stack and push the non-terminal symbol NP onto the stack.
5. Shifting again: We shift the next token "chased" onto the stack and the next token "the" into the lookahead buffer. The stack now contains the symbols NP and "chased".
6. Reduction again: We check if the symbols on top of the stack can be reduced using a production rule from the grammar. In this case, "chased" is reduced to the non-terminal symbol V using the production rule V -> chased. The stack now contains the symbols NP and V.
7. Shifting again: We shift the next token "the" onto the stack and then the final token "mouse". After reducing them to Det and N, they are combined into another NP using the production rule NP -> Det N. The stack now contains the symbols NP, V, and NP.
8. Reduction again: The symbols V and NP on top of the stack are reduced to the non-terminal symbol VP using the production rule VP -> V NP. The remaining symbols NP and VP are then reduced to the non-terminal symbol S using the production rule S -> NP VP.
9. Completion: The stack now contains only the symbol S, which is the final parse of the input sentence. We can also look at the parse chart to see all the possible partial parses that were considered during the parsing process. The final parse tree for the sentence is:
              S
        ______|______
       |             |
       NP            VP
     __|__        ___|____
    |     |      |        |
   Det    N      V        NP
    |     |      |      __|__
   the   cat  chased   |     |
                      Det    N
                       |     |
                      the  mouse
Note that this example uses a simple grammar and a straightforward parsing process, but more complex grammars and sentences may require additional steps or different strategies to achieve a successful parse.
Hypergraphs represent a generalization of traditional parse trees, allowing many alternative parses of the same sentence to be stored compactly in a single structure.
Here is an example of how chart parsing can be used to parse the sentence "the cat chased the mouse" using a simple grammar:
S -> NP VP
NP -> Det N
VP -> V NP
V -> chased
Det -> the
N -> cat | mouse
1. Initialization: We start by initializing an empty chart with the length of the input sentence (5 words) and a set of empty cells representing all possible partial parses.
2. Scanning: We scan each word in the input sentence and add a corresponding parse to the chart. For example, for the first word "the", we add a parse for the non-terminal symbol Det (Det -> the). We do this for each word in the sentence.
3. Predicting: We use the grammar rules to predict possible partial parses for each span of words in the sentence. For example, we can predict a partial parse for the span (1, 2) (i.e., the first two words "the cat") by applying the rule NP -> Det N to the parses for "the" and "cat". We add this partial parse to the chart.
4. Completing: We combine adjacent partial parses in the chart. For example, the parses for "the" (Det) and "cat" (N) are combined using the rule NP -> Det N to create a parse for the span (1, 2) and the non-terminal symbol NP.
5. Combining: We continue to combine partial parses in the chart using grammar rules until we have a complete parse for the entire sentence.
6. Output: The final parse tree for the sentence is represented by the complete parse in the chart cell for the span (1, 5) and the non-terminal symbol S.
Chart parsing can be more efficient than other parsing algorithms, such as recursive descent or shift-reduce parsing, because it stores all possible partial parses in the chart and avoids redundant parsing of the same span multiple times. Hypergraphs can also be used in chart parsing to represent more complex structures and enable more efficient parsing algorithms.
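A short sketch of chart parsing with NLTK, assuming the library is installed; NLTK also provides an Earley-style incremental chart parser with the same interface:

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'cat' | 'mouse'
V -> 'chased'
""")

# A chart parser stores partial parses so no span is re-parsed twice.
parser = nltk.ChartParser(grammar)
for tree in parser.parse("the cat chased the mouse".split()):
    tree.pretty_print()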
Dependency parsing is a type of syntactic parsing that represents the grammatical structure of a sentence as a directed acyclic graph (DAG). The nodes of the graph represent the words of the sentence, and the edges represent the syntactic relationships between the words.
Minimum spanning tree (MST) algorithms are often used for dependency parsing, as they provide an efficient way to find the most likely parse for a sentence given a set of syntactic dependencies.
Here's an example of how an MST algorithm can be used for dependency parsing:
Consider the sentence "The cat chased the mouse". We can represent this sentence as a graph with nodes for each word and edges representing the possible syntactic dependencies between them, each weighted with a score.
We can use an MST algorithm to find the most likely parse for this graph. One popular algorithm for this is the Chu-Liu/Edmonds algorithm:
1. We first remove all self-loops and multiple edges in the graph. This is because a valid dependency tree must be acyclic and have only one edge between any two nodes.
2. We then choose a node to be the root of the tree. In this example, we can choose "chased" to be the root since it is the main verb of the sentence.
3. We then compute the scores for each edge in the graph based on a scoring function that takes into account the probability of each edge being a valid dependency between the two words.
4. We use the MST algorithm to find the tree that maximizes the total score of its edges. The MST algorithm starts with a set of edges that connect the root node to each of its immediate dependents, and iteratively adds edges that connect other nodes to the tree. At each iteration, we select the edge with the highest score that does not create a cycle in the tree.
5. Once the MST algorithm has constructed the tree, we can assign a label to each edge in the tree based on the type of dependency it represents (e.g., subject, object, etc.).
The resulting dependency tree for the example sentence has "chased" as the root, with "cat" as its subject, "mouse" as its object, and each determiner "the" attached to its noun.
In this tree, each node represents a word in the sentence, and each edge represents a syntactic dependency between two words.
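A sketch of the maximum spanning tree idea, assuming the networkx library, which implements the Chu-Liu/Edmonds algorithm. The candidate edges and their scores are invented for illustration, and the two occurrences of "the" are given distinct node names so they remain separate nodes:

import networkx as nx

# Candidate head -> dependent edges with made-up scores for
# "the cat chased the mouse"; ROOT is an artificial root node.
edges = [
    ("ROOT", "chased", 10), ("ROOT", "cat", 2),
    ("chased", "cat", 9),   ("chased", "mouse", 8),
    ("cat", "the@1", 7),    ("mouse", "the@4", 6),
    ("mouse", "cat", 1),
]

G = nx.DiGraph()
for head, dep, score in edges:
    G.add_edge(head, dep, weight=score)

# Chu-Liu/Edmonds: the maximum spanning arborescence is the best
# dependency tree under these edge scores.
tree = nx.maximum_spanning_arborescence(G, attr="weight")
for head, dep in tree.edges():
    print(f"{head} -> {dep}")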
Dependency parsing can be useful for many NLP tasks, such as information extraction, machine translation, and question answering.
One advantage of dependency parsing is that it captures more fine-grained syntactic information than phrase-structure parsing, as it represents the relationships between individual words rather than just the hierarchical structure of phrases. However, dependency parsing can be more difficult to perform accurately than phrase-structure parsing, as it requires more sophisticated algorithms and models to resolve attachment ambiguities correctly.
5. Models for Ambiguity Resolution in Parsing:
Ambiguity in parsing can be resolved with several kinds of models, and these models can also be combined; for example, a hybrid system might combine a rule-based model, a statistical model, and a neural network model to improve the overall accuracy of the parsing system.
Overall, there are many models for ambiguity resolution in parsing, each with its own strengths and weaknesses. The choice of model depends on the specific application and the available resources, such as training data and computational power.
5.1 Probabilistic Context-Free Grammars (PCFGs):
A probabilistic context-free grammar (PCFG) is a context-free grammar in which each production rule is annotated with a probability. PCFGs can be used to compute the probability of a parse tree for a given sentence, which can then be used to select the most likely parse. The probability of a parse tree is computed by multiplying the probabilities of its constituent production rules, from the root symbol down to the leaves. The probability of a sentence is computed by summing the probabilities of all parse trees that generate the sentence.
Here is an example of a PCFG for the sentence "the cat saw the dog":
S -> NP VP [1.0]
NP -> Det N [0.6]
NP -> N [0.4]
VP -> V NP [0.8]
VP -> V [0.2]
Det -> "the" [0.9]
Det -> "a" [0.1]
N -> "cat" [0.5]
N -> "dog" [0.5]
V -> "saw" [1.0]
In this PCFG, each production rule is annotated with a probability. For example, the rule NP -> Det N [0.6] has a probability of 0.6, indicating that a noun phrase can be generated by first generating a determiner, followed by a noun, with a probability of 0.6.
To parse the sentence "the cat saw the dog" using this PCFG, we can use the CKY algorithm to generate all possible parse trees and compute their probabilities. The algorithm starts by filling in the table of all possible subtrees for each span of the sentence, and then combines these subtrees using the production rules of the PCFG. The final cell in the table represents the probability of the best parse tree for the entire sentence.
Using the probabilities from the PCFG, the CKY algorithm generates the following parse tree for the sentence "the cat saw the dog":

              S
        ______|______
       |             |
       NP            VP
     __|__         __|___
    |     |       |      |
   Det    N       V      NP
    |     |       |    __|__
   the   cat     saw  |     |
                     Det    N
                      |     |
                     the   dog

Multiplying the probabilities of the rules used in this tree (1.0 × 0.6 × 0.9 × 0.5 × 0.8 × 1.0 × 0.6 × 0.9 × 0.5) gives a probability of about 0.058 for the best parse tree of the sentence "the cat saw the dog". This probability can be used to select the most likely parse among all possible parse trees for the sentence.
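A sketch of the same PCFG in NLTK, whose ViterbiParser returns the most probable parse together with its probability; the library is assumed to be installed:

import nltk

grammar = nltk.PCFG.fromstring("""
S -> NP VP [1.0]
NP -> Det N [0.6] | N [0.4]
VP -> V NP [0.8] | V [0.2]
Det -> 'the' [0.9] | 'a' [0.1]
N -> 'cat' [0.5] | 'dog' [0.5]
V -> 'saw' [1.0]
""")

# ViterbiParser finds the single most probable parse under the PCFG.
parser = nltk.ViterbiParser(grammar)
for tree in parser.parse("the cat saw the dog".split()):
    tree.pretty_print()
    print(tree.prob())   # product of the rule probabilities (about 0.058)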
5.2 Generative Models for Parsing:
Generative models for parsing are a family of models that generate a sentence's parse tree by generating each node in the tree according to a set of probabilistic rules. One such model is the probabilistic Earley parser.
The Earley parser uses a chart data structure to store all possible parse trees for a sentence. The parser starts with an empty chart, and then adds new parse trees to the chart as it progresses through the sentence. The parser consists of three main stages: prediction, scanning, and completion.
In the prediction stage, the parser generates new items in the chart by applying grammar rules that can generate non-terminal symbols. For example, if the grammar has a rule S -> NP VP, the parser would predict the presence of an S symbol in the current span of the sentence by adding a new item to the chart that indicates that an S symbol can be generated by an NP symbol followed by a VP symbol.
In the scanning stage, the parser checks whether a word in the sentence can be assigned to a non-terminal symbol in the chart. For example, if the parser has predicted an NP symbol in the current span of the sentence, and the word "dog" appears in that span, the parser would add a new item to the chart that indicates that the NP symbol can be generated by the word "dog".
In the completion stage, the parser combines items in the chart that have the same end position and can be combined according to the grammar rules. For example, if the parser has added an item to the chart that indicates that an NP symbol can be generated by the word "dog", and another item that indicates that a VP symbol can be generated by the word "saw" and an NP symbol, the parser would add a new item to the chart that indicates that an S symbol can be generated by an NP symbol followed by a VP symbol.
Here is an example of a probabilistic Earley parser applied to the sentence "the cat saw the dog":
Grammar:
S -> NP VP [1.0]
NP -> Det N [0.6]
NP -> N [0.4]
VP -> V NP [0.8]
VP -> V [0.2]
Det -> "the" [0.9]
Det -> "a" [0.1]
N -> "cat" [0.5]
N -> "dog" [0.5]
V -> "saw" [1.0]
Initial chart:
0: [S -> * NP VP [1.0], 0, 0]
0: [NP -> * Det N [0.6], 0, 0]
0: [NP -> * N [0.4], 0, 0]
0: [VP -> * V NP [0.8], 0, 0]
0: [VP -> * V [0.2], 0, 0]
0: [Det -> * "the" [0.9], 0, 0]
0: [Det -> * "a" [0.1], 0, 0]
0: [N -> * "cat" [0.5], 0, 0]
0: [N -> * "dog" [0.5], 0, 0]
0: [V -> * "saw" [1.0], 0, 0]
Predicting S:
0: [S -> * NP VP [1.0], 0, 0]
1: [NP -> * Det N [0.6], 0, 0]
1: [NP -> * N [0.4], 0, 0]
1: [VP -> * V NP [0.8], 0, 0]
...

5.3 Discriminative Models for Parsing:
Discriminative models for parsing are a family of models that predict a sentence's parse tree by learning to discriminate between different possible trees. One such model is the maximum entropy Markov model.
The maximum entropy Markov model (MEMM) is a discriminative model that models the conditional probability of a parse tree given a sentence. The model is trained on a corpus of labeled sentences and their corresponding parse trees. During training, the model learns a set of feature functions that map the current state of the parser (i.e., the current span of the sentence and the partial parse tree constructed so far) to a set of binary features that are indicative of a particular parse tree. The model then learns the weight of each feature function using maximum likelihood estimation.
During testing, the MEMM uses the learned feature functions and weights to score each possible parse tree for the input sentence. The model then selects the parse tree with the highest score as the final parse tree for the sentence.
Here is an example of a MEMM applied to the sentence "the cat saw the dog":
Features:
F1: current word is "the"
F2: current word is "cat"
F3: current word is "saw"
Weights:
F1: 1.2
F2: 0.5
F3: 0.9
F4: 1.1
F5: 0.8
F6: 0.6
F7: 0.7
F8: 0.9
F9: 1.5
S -> NP VP
- NP -> N
- - N -> "cat"
- VP -> V NP
- - V -> "saw"
- - NP -> Det N
- - - Det -> "the"
- - - N -> "dog"Score: 4.9
S -> NP VP
- NP -> Det N
- - Det -> "the"
- - N -> "cat"
- VP -> V
- - V -> "saw"
- NP -> Det N
- - Det -> "the"
- - N -> "dog"Score: 3.5
- - N -> "cat"
- VP -> V NP
- - V -> "saw"
- - NP -> Det N
- - - Det -> "the"
- - - N -> "dog"
Score: 5.7
In this example, the MEMM generates a score for each possible parse tree and selects the
parse tree with the highest score as the final parse tree for the sentence.
The selected parse tree corresponds to the correct parse for the sentence.
6. Multilingual Issues:
In natural language processing (NLP), a token is a sequence of characters that represents a single unit of meaning. In other words, it is a word or a piece of a word that has a specific meaning within a language. The process of splitting a text into individual tokens is called tokenization.
However, the definition of what constitutes a token can vary depending on the language being analyzed. This is because different languages have different rules for how words are constructed, how they are written, and how they are used in context.
For example, in English, words are typically separated by spaces, making it relatively easy to tokenize a sentence into individual words. However, in some languages, such as Chinese or Japanese, there are no spaces between words, and the text must be segmented into individual units of meaning based on other cues, such as syntax or context.
Furthermore, even within a single language, there can be variation in how words are spelled or written. For example, in English, words can be spelled with or without hyphens or apostrophes, and there can be differences in spelling between American English and British English.
Multilingual issues in tokenization arise because different languages can have different character sets, which means that the same sequence of characters can represent different words in different languages. Additionally, some languages have complex morphology, which means that a single word can have many different forms that represent different grammatical features or meanings.
To address these issues, NLP researchers have developed multilingual tokenization techniques that take into account the specific linguistic features of different languages. These techniques can include using language-specific dictionaries, models, or rules to identify the boundaries between words or units of meaning in different languages.
6.1 Tokenization, Case, and Encoding:
Tokenization, case, and encoding are all important aspects of natural language processing (NLP) that are used to preprocess text data before it can be analyzed by NLP models.
Tokenization:
1. Tokenization is the process of splitting text into individual tokens, such as words and punctuation marks. For example, the sentence "The quick brown fox jumps over the lazy dog." is tokenized into ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."].
Case:
2. Case refers to the use of upper and lower case letters in text. In NLP, it is often important to standardize the case of words to avoid treating the same word as different simply because it appears in different case. For example, the words "apple" and "Apple" should be treated as the same word.
Encoding:
3. Encoding refers to the process of representing text data in a way that can be processed by machine learning algorithms. One common encoding method used in NLP is Unicode, which is a character encoding standard that can represent a wide range of characters from different languages.
Here is an example of how tokenization, case, and encoding might be applied to a sentence of text:
Text: "The quick brown fox jumps over the lazy dog."
Tokenization: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]
Case: ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]
Encoding: [0x74, 0x68, 0x65, 0x20, 0x71, 0x75, 0x69, 0x63, 0x6b, 0x20, 0x62, 0x72,
0x6f, 0x77, 0x6e, 0x20, 0x66, 0x6f, 0x78, 0x20, 0x6a, 0x75, 0x6d, 0x70, 0x73, 0x20,
0x6f, 0x76, 0x65, 0x72, 0x20, 0x74, 0x68, 0x65, 0x20, 0x6c, 0x61, 0x7a, 0x79, 0x20,
0x64, 0x6f, 0x67, 0x2e]
Note that the encoding is represented in hexadecimal to show the underlying bytes
that represent the text.
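A minimal sketch of these three preprocessing steps in plain Python (whitespace splitting is used as a deliberately crude tokenizer):

text = "The quick brown fox jumps over the lazy dog."

tokens = text.split()                    # crude whitespace tokenization
lowered = [t.lower() for t in tokens]    # case normalization
encoded = lowered[0].encode("utf-8")     # bytes of the first token

print(tokens)
print(lowered)
print(list(encoded))   # [116, 104, 101] -> the UTF-8 bytes 0x74, 0x68, 0x65 for "the"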
6.2 Word Segmentation:
Word segmentation is one of the most basic tasks in Natural Language Processing (NLP), and it involves identifying the boundaries between words in a sentence. However, in some languages, such as Chinese and Japanese, there is no clear spacing or punctuation between words, which makes word segmentation more challenging.
In Chinese, for example, a sentence like "我喜欢中文" (which means "I like Chinese") is written with no spaces between the words, so it must be segmented into the units "我" (I), "喜欢" (like), and "中文" (Chinese) before it can be processed further.
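A short sketch of Chinese word segmentation, assuming the jieba library is installed:

import jieba  # a widely used Chinese word segmentation library (assumed installed)

sentence = "我喜欢中文"
print(list(jieba.cut(sentence)))
# Typically segmented as: ['我', '喜欢', '中文']  ("I", "like", "Chinese")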
Similar challenges arise in languages with rich morphology, where a single word can take many different forms:
● Turkish: Turkish has a rich morphology, with a complex system of affixes that can be added to words to convey different meanings. For example, the word "kitap" (book) can be modified with different suffixes to indicate things like possession, plurality, or tense.
● Arabic: Arabic also has a rich morphology, with a complex system of prefixes, suffixes, and infixes that can be added to words to convey different meanings. For example, the root "k-t-b" (meaning "write") can be modified with different affixes to form words like "kitab" (book) and "kataba" (he wrote).
● Finnish: Finnish has a complex morphology, with a large number of cases, suffixes, and vowel harmony rules that can affect the form of a word. For
example, the word "käsi" (hand) can be modified with different suffixes to
indicate things like possession, loca on, or movement.
● Swahili: Swahili has a complex morphology, with a large number of prefixes
and suffixes that can be added to words to convey different meanings. For
example, the word "kutaka" (to want) can be modified with different prefixes
and suffixes to indicate things like tense, nega on, or subject agreement.
To address these challenges, NLP researchers have developed various techniques for morphological analysis, including rule-based approaches, statistical models, and neural networks. However, morphological analysis is still an active area of research, especially for low-resource languages where large amounts of annotated data are not available.