

NATURAL LANGUAGE PROCESSING

UNIT - I:
PART 1: Finding the Structure of Words: (chapter 1, txtbk 1)
1. Words and Their Components
2. Issues and Challenges
3. Morphological Models
PART 2: Finding the Structure of Documents: (chapter 2, txtbk 1)
1. Introduction
2. Methods
3. Complexity of the Approaches
4. Performances of the Approaches

UNIT - II: Syntax Analysis: (chapter 3, txtbk 1)
1. Parsing Natural Language
2. Treebanks: A Data-Driven Approach to Syntax
3. Representation of Syntactic Structure
4. Parsing Algorithms
5. Models for Ambiguity Resolution in Parsing
6. Multilingual Issues

UNIT - III: Semantic Parsing: (chapter 4, txtbk 1)
1. Introduction
2. Semantic Interpretation
3. System Paradigms
4. Word Sense Systems Software

UNIT - IV: (chapter 4, txtbk 1)
1. Predicate-Argument Structure
2. Meaning Representation
3. Systems Software

UNIT - V:
PART 1: Discourse Processing: (chapter 6, txtbk 2)
1. Cohesion
2. Reference Resolution
3. Discourse Cohesion and Structure
PART 2: Language Modelling: (chapter 5, txtbk 1)
1. Introduction
2. N-Gram Models
3. Language Model Evaluation
4. Parameter Estimation
5. Language Model Adaptation
6. Types of Language Models
7. Language-Specific Modeling Problems
8. Multilingual and Cross-Lingual Language Modeling

TEXTBOOK 1:
Multilingual Natural Language Processing Applications: From Theory to Practice –
Daniel M. Bikel and Imed Zitouni, Pearson Publication

UNIT - I:
PART 1: Finding the Structure of Words: (chapter 1, txtbk 1)
1. Words and Their Components
2. Issues and Challenges
3. Morphological Models

Finding the Structure of Words:

In natural language processing (NLP), finding the structure of words involves breaking down words into their constituent parts and identifying the relationships between those parts. This process is known as morphological analysis, and it helps NLP systems understand the structure of language.

There are several ways to find the structure of words in NLP, including:

1. Tokenization: This involves breaking a sentence or document into individual words or tokens, which can then be analysed further.
2. Stemming and Lemmatization: These techniques involve reducing words to their base or root form, which can help identify patterns and relationships between words.
3. Part-of-Speech Tagging: This involves labelling each word in a sentence with its part of speech, such as noun, verb, adjective, or adverb.
4. Parsing: This involves analysing the grammatical structure of a sentence by identifying its constituent parts, such as subject, object, and predicate.
5. Named Entity Recognition: This involves identifying and classifying named entities in text, such as people, organisations, and locations.
6. Dependency Parsing: This involves analysing the relationships between words in a sentence and identifying which words depend on or modify other words.

By finding the structure of words in text, NLP systems can perform a wide range of tasks, such as machine translation, text classification, sentiment analysis, and information extraction.
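As a concrete illustration of the first three steps, here is a minimal sketch using the NLTK library (the choice of NLTK, the sample sentence, and the downloaded data packages are assumptions of this example, not prescriptions of the text):

```python
# A minimal word-structure pipeline using NLTK.
# Assumes: pip install nltk, plus nltk.download("punkt"),
# nltk.download("averaged_perceptron_tagger"), nltk.download("wordnet").
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The children were running happily towards the river banks."

tokens = nltk.word_tokenize(text)   # 1. tokenization
tagged = nltk.pos_tag(tokens)       # 3. part-of-speech tagging

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
for word, tag in tagged:
    # 2. stemming and lemmatization reduce words toward a base form,
    # e.g. "running" -> stem "run"; the lemmatizer defaults to nouns
    # unless given a part-of-speech hint.
    print(word, tag, stemmer.stem(word), lemmatizer.lemmatize(word))
```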
1. Words and Their Components:
In natural language processing (NLP), words are analysed by breaking them down into smaller units called components or morphemes. The analysis of words and their components is important for various NLP tasks such as stemming, lemmatization, part-of-speech tagging, and sentiment analysis.

There are two main types of morphemes:

1. Free Morphemes: These are standalone words that can convey meaning on their own, such as "book," "dog," or "happy."
2. Bound Morphemes: These are units of meaning that cannot stand alone but must be attached to a free morpheme to convey meaning. There are two types of bound morphemes:
● Prefixes: These are morphemes that are attached to the beginning of a free morpheme, such as "un-" in "unhappy" or "pre-" in "preview."
● Suffixes: These are morphemes that are attached to the end of a free morpheme, such as "-ness" in "happiness" or "-ed" in "jumped."

For example, the word "unhappily" has three morphemes: "un-" (a prefix meaning "not"), "happy" (a free morpheme meaning "feeling or showing pleasure or contentment"), and "-ly" (a suffix that changes the word into an adverb). By analyzing the morphemes in a word, NLP systems can better understand its meaning and how it relates to other words in a sentence.

In addition to morphemes, words can also be analyzed by their part of speech, such as noun, verb, adjective, or adverb. By identifying the part of speech of each word in a sentence, NLP systems can better understand the relationships between words and the structure of the sentence.
1.1 Tokens:
In natural language processing (NLP), a token refers to a sequence of characters that represents a meaningful unit of text. This could be a word, punctuation mark, number, or other entity that serves as a basic unit of analysis in NLP.

For example, in the sentence "The quick brown fox jumps over the lazy dog," the tokens are "The," "quick," "brown," "fox," "jumps," "over," "the," "lazy," and "dog." Each of these tokens represents a separate unit of meaning that can be analyzed and processed by an NLP system.

Here are some additional examples of tokens:

● Punctuation marks, such as periods, commas, and semicolons, are tokens that represent the boundaries between sentences and clauses.
● Numbers, such as "123" or "3.14," are tokens that represent numeric quantities or measurements.
● Special characters, such as "@" or "#," can be tokens that represent symbols used in social media or other online contexts.

Tokens are often used as the input for various NLP tasks, such as text classification, sentiment analysis, and named entity recognition. In these tasks, the NLP system analyzes the tokens to identify patterns and relationships between them, and uses this information to make predictions or draw insights about the text.

In order to analyze and process text effectively, NLP systems must be able to identify and distinguish between different types of tokens, and understand their relationships to one another. This can involve tasks such as tokenization, where the text is divided into individual tokens, and part-of-speech tagging, where each token is assigned a grammatical category (such as noun, verb, or adjective). By accurately identifying and processing tokens, NLP systems can better understand the meaning and structure of a text.
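As a sketch of how a tokenizer might work, the following toy regex splits words, numbers, and punctuation into separate tokens (a deliberate simplification; production tokenizers such as NLTK's or spaCy's handle clitics, URLs, abbreviations, and many other edge cases):

```python
import re

# Naive regex tokenizer: runs of word characters, or any single
# non-space, non-word character (punctuation), become tokens.
def tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("The quick brown fox jumps over the lazy dog."))
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
```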
1.2 Lexemes:
In natural language processing (NLP), a lexeme is a unit of vocabulary that represents a single concept, regardless of its inflected forms or grammatical variations. It can be thought of as the abstract representation of a word, with all its possible inflections and variations.

For example, the word "run" has many inflected forms, such as "ran," "running," and "runs." These inflections are not considered separate lexemes because they all represent the same concept of running or moving quickly on foot.

In contrast, words that have different meanings, even if they are spelled the same way, are considered separate lexemes. For example, the word "bank" can refer to a financial institution or the edge of a river. These different meanings are considered separate lexemes because they represent different concepts.

Here are some additional examples of lexemes:

● "Walk" and "walked" are inflected forms of the same lexeme, representing the concept of walking.
● "Cat" and "cats" are inflected forms of the same lexeme, representing the concept of a feline animal.
● "Bank" and "banking" are derived forms of the same lexeme, representing the concept of finance and financial institutions.

Lexical analysis involves identifying and categorizing lexemes in a text, which is an important step in many NLP tasks, such as text classification, sentiment analysis, and information retrieval. By identifying and categorizing lexemes, NLP systems can better understand the meaning and context of a text.

Lexical analysis is also used to identify and analyze the morphological and syntactical features of a word, such as its part of speech, inflection, and derivation. This information is important for tasks such as stemming, lemmatization, and part-of-speech tagging, which involve reducing words to their base or root forms and identifying their grammatical functions.
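For example, a lemmatizer maps inflected forms back to a single lexeme. A minimal sketch with NLTK's WordNetLemmatizer (assuming NLTK and its WordNet data are installed; the part-of-speech hint is needed to resolve verb forms):

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Different inflected forms collapse onto one lexeme ("run", "cat").
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("ran", pos="v"))      # run
print(lemmatizer.lemmatize("cats", pos="n"))     # cat
```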
1.3 Morphemes:
In natural language processing (NLP), morphemes are the smallest units of meaning in a language. A morpheme is a sequence of phonemes (the smallest units of sound in a language) that carries meaning. Morphemes can be divided into two types: free morphemes and bound morphemes.

Free morphemes are words that can stand alone and convey meaning. Examples of free morphemes include "book," "cat," "happy," and "run."

Bound morphemes are units of meaning that cannot stand alone but must be attached to a free morpheme to convey meaning. Bound morphemes can be further divided into two types: prefixes and suffixes.
● A prefix is a bound morpheme that is added to the beginning of a word to change its meaning. For example, the prefix "un-" added to the word "happy" creates the word "unhappy," which means not happy.
● A suffix is a bound morpheme that is added to the end of a word to change its meaning. For example, the suffix "-ed" added to the word "walk" creates the word "walked," which represents the past tense of "walk."

Here are some examples of words broken down into their morphemes:
● "unhappily" = "un-" (prefix meaning "not") + "happy" + "-ly" (suffix meaning "in a
lOM oARcP SD| 47 12 672 7

manner of")
● "rearrangement" = "re-" (prefix meaning "again") + "arrange" + "-ment" (suffix
indica ng the act of doing something)
● "cats" = "cat" (free morpheme) + "-s" (suffix indica ng plural form)
By analysing the morphemes in a word, NLP systems can be er understand its
meaning and how it relates to other words in a sentence. This can be helpful for
tasks such as part-of-speech tagging, sen ment analysis, and language transla on.
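The following toy splitter mimics the breakdowns above. Its affix lists and length checks are simplified assumptions, and it does not undo spelling changes (it yields "happi" rather than "happy"); real analyzers handle such alternations explicitly:

```python
# Toy rule-based morpheme splitter, for illustration only.
PREFIXES = ["un", "re", "pre"]
SUFFIXES = ["ment", "ness", "ing", "ly", "ed", "s"]

def split_morphemes(word):
    parts = []
    for p in PREFIXES:                      # strip at most one prefix
        if word.startswith(p) and len(word) > len(p) + 2:
            parts.append(p + "-")
            word = word[len(p):]
            break
    suffix = None
    for s in SUFFIXES:                      # strip at most one suffix
        if word.endswith(s) and len(word) > len(s) + 2:
            suffix = "-" + s
            word = word[: -len(s)]
            break
    parts.append(word)                      # what remains is the stem
    if suffix:
        parts.append(suffix)
    return parts

print(split_morphemes("unhappily"))      # ['un-', 'happi', '-ly']
print(split_morphemes("rearrangement"))  # ['re-', 'arrange', '-ment']
print(split_morphemes("cats"))           # ['cat', '-s']
```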
1.4 Typology:
In natural language processing (NLP), typology refers to the classification of languages based on their structural and functional features. This can include features such as word order, morphology, tense and aspect systems, and syntactic structures.

There are many different approaches to typology in NLP, but a common one is the distinction between analytic and synthetic languages. Analytic languages have a relatively simple grammatical structure and tend to rely on word order and prepositions to convey meaning. In contrast, synthetic languages have a more complex grammatical structure and use inflections and conjugations to indicate tense, number, and other grammatical features.

For example, English is considered to be an analytic language, as it relies heavily on word order and prepositions to convey meaning. In contrast, Russian is a synthetic language, with a complex system of noun declensions, verb conjugations, and case markings to convey grammatical information.

Another example of typology in NLP is the distinction between head-initial and head-final languages. In head-initial languages, the head of a phrase (usually a noun) comes before its modifiers (adjectives or other nouns). In head-final languages, the head comes after its modifiers. For example, English is a head-initial language, as in the phrase "red apple," where "apple" is the head and "red" is the modifier. In contrast, Japanese is a head-final language, as in the phrase "aka-i ringo" (red apple), where "ringo" (apple) is the head and "aka-i" (red) is the modifier.

By understanding the typology of a language, NLP systems can better model its grammatical and structural features, and improve their performance in tasks such as language modelling, parsing, and machine translation.
2. Issues and Challenges:
Finding the structure of words in natural language processing (NLP) can be a challenging task due to various issues and challenges. Some of these issues and challenges are:
1. Ambiguity: Many words in natural language have multiple meanings, and it can be difficult to determine the correct meaning of a word in a particular context.
2. Morphology: Many languages have complex morphology, meaning that words can change their form based on various grammatical features like tense, gender, and number. This makes it difficult to identify the underlying structure of a word.
3. Word order: The order of words in a sentence can have a significant impact on the meaning of the sentence, making it important to correctly identify the relationship between words.
4. Informal language: Informal language, such as slang or colloquialisms, can be challenging for NLP systems to process since it often deviates from the standard rules of grammar.
5. Out-of-vocabulary words: NLP systems may not have encountered a word before, making it difficult to determine its structure and meaning.
6. Named entities: Proper nouns, such as names of people or organizations, can be challenging to recognize and structure correctly.
7. Language-specific challenges: Different languages have different structures and rules, making it necessary to develop language-specific approaches for NLP.
8. Domain-specific challenges: NLP systems trained on one domain may not be effective in another domain, such as medical or legal language.

Overcoming these issues and challenges requires a combination of linguistic knowledge, machine learning techniques, and careful model design and evaluation.
2.1 Irregularity:
Irregularity is a challenge in natural language processing (NLP) because it refers to words that do not follow regular patterns of formation or inflection. Many languages have irregular words that are exceptions to the standard rules, making it difficult for NLP systems to accurately identify and categorize these words.

For example, in English, irregular verbs such as "go," "do," and "have" do not follow the regular pattern of adding "-ed" to the base form to form the past tense. Instead, they have their own unique past tense forms ("went," "did," "had") that must be memorized.

Similarly, in English, there are many irregular plural nouns, such as "child" and "foot," that do not follow the standard rule of adding "-s" to form the plural. Instead, these words have their own unique plural forms ("children," "feet") that must be memorized.

Irregularity can also occur in inflectional morphology, where different forms of a word are created by adding inflectional affixes. For example, in Spanish, the irregular verb "tener" (to have) has a unique conjugation pattern that does not follow the standard pattern of other regular verbs in the language.

To address the challenge of irregularity in NLP, researchers have developed various techniques, including creating rule-based systems that incorporate irregular forms into the standard patterns of word formation, or using machine learning algorithms that can learn to recognize and categorize irregular forms based on the patterns present in large datasets.

However, dealing with irregularity remains an ongoing challenge in NLP, particularly in languages with a high degree of lexical variation and complex morphological systems. Therefore, NLP researchers are continually working to improve the accuracy of NLP systems in dealing with irregularity.


2.2 Ambiguity:
Ambiguity is a challenge in natural language processing (NLP) because it refers to situations where a word or phrase can have multiple possible meanings, making it difficult for NLP systems to accurately identify the intended meaning. Ambiguity can arise in various forms, such as homonyms, polysemous words, and syntactic ambiguity.

Homonyms are words that have the same spelling and pronunciation but different meanings. For example, the word "bank" can refer to a financial institution or the side of a river. This can create ambiguity in NLP tasks, such as named entity recognition, where the system needs to identify the correct entity based on the context.

Polysemous words are words that have multiple related meanings. For example, the word "book" can refer to a physical object or the act of reserving something. In this case, the intended meaning of the word can be difficult to identify without considering the context in which the word is used.

Syntactic ambiguity occurs when a sentence can be parsed in multiple ways. For example, the sentence "I saw her duck" can be interpreted as "I saw the bird she owns" or "I saw her lower her head to avoid something." In this case, the meaning of the sentence can only be determined by considering the context in which it is used.

Ambiguity can also occur due to cultural or linguistic differences. For example, the phrase "kick the bucket" means "to die" in English, but its meaning may not be apparent to non-native speakers or speakers of other languages.

To address ambiguity in NLP, researchers have developed various techniques, including using contextual information, part-of-speech tagging, and syntactic parsing to disambiguate words and phrases. These techniques involve analyzing the surrounding context of a word to determine its intended meaning. Additionally, machine learning algorithms can be trained on large datasets to learn to disambiguate words and phrases automatically. However, dealing with ambiguity remains an ongoing challenge in NLP, particularly in languages with complex grammatical structures and a high degree of lexical variation.
2.3 Productivity:
Productivity is a challenge in natural language processing (NLP) because it refers to the ability of a language to generate new words or forms based on existing patterns or rules. This can create a vast number of possible word forms that may not be present in dictionaries or training data, which makes it difficult for NLP systems to accurately identify and categorize words.

For example, in English, new words can be created by combining existing words, such as "smartphone," "cyberbully," or "workaholic." These words are formed by combining two or more words to create a new word with a specific meaning.

Another example is the use of prefixes and suffixes to create new words. For instance, in English, the prefix "un-" can be added to words to create their opposite meaning, such as "happy" and "unhappy." The suffix "-er" can be added to a verb to create a noun indicating the person who performs the action, such as "run" and "runner."

Productivity can also occur in inflectional morphology, where different forms of a word are created by adding inflectional affixes. For example, in English, the verb "walk" can be inflected to "walked" to indicate the past tense. Similarly, the adjective "big" can be inflected to "bigger" to indicate a comparative degree.

These examples demonstrate how productivity can create a vast number of possible word forms, making it challenging for NLP systems to accurately identify and categorize words. To address this challenge, NLP researchers have developed various techniques, including morphological analysis algorithms that use statistical models to predict the likely structure of a word based on its context. Additionally, machine learning algorithms can be trained on large datasets to learn to recognize and categorize new word forms.
3. Morphological Models:
In natural language processing (NLP), morphological models refer to computational models that are designed to analyze the morphological structure of words in a language. Morphology is the study of the internal structure and the forms of words, including their inflectional and derivational patterns. Morphological models are used in a wide range of NLP applications, including part-of-speech tagging, named entity recognition, machine translation, and text-to-speech synthesis.

There are several types of morphological models used in NLP, including rule-based models, statistical models, and neural models.

Rule-based models rely on a set of handcrafted rules that describe the morphological structure of words. These rules are based on linguistic knowledge and are manually created by experts in the language. Rule-based models are often used in languages with relatively simple morphological systems, such as English.

Statistical models use machine learning algorithms to learn the morphological structure of words from large datasets of annotated text. These models use probabilistic models, such as Hidden Markov Models (HMMs) or Conditional Random Fields (CRFs), to predict the morphological features of words. Statistical models are often more accurate than rule-based models and are used in many NLP applications.

Neural models, such as recurrent neural networks (RNNs) and transformers, use deep learning techniques to learn the morphological structure of words. These models have achieved state-of-the-art results in many NLP tasks and are particularly effective in languages with complex morphological systems, such as Arabic and Turkish.

In addition to these models, there are also morphological analyzers, which are tools that can automatically segment words into their constituent morphemes and provide additional information about the inflectional and derivational properties of each morpheme. Morphological analyzers are widely used in machine translation and information retrieval applications, where they can improve the accuracy of these systems by providing more precise linguistic information about the words in a text.
3.1 Dictionary Lookup:
Dictionary lookup is one of the simplest forms of morphological modeling used in NLP. In this approach, a dictionary or lexicon is used to store information about the words in a language, including their inflectional and derivational forms, parts of speech, and other relevant features. When a word is encountered in a text, the dictionary is consulted to retrieve its properties.

Dictionary lookup is effective for languages with simple morphological systems, such as English, where most words follow regular patterns of inflection and derivation. However, it is less effective for languages with complex morphological systems, such as Arabic, Turkish, or Finnish, where many words have irregular forms and the inflectional and derivational patterns are highly productive.

To improve the accuracy of dictionary lookup, various techniques have been developed, such as:
● Lemmatization: This involves reducing inflected words to their base or dictionary form, also known as the lemma. For example, the verb "running" would be lemmatized to "run". This helps to reduce the size of the dictionary and make it more manageable.
● Stemming: This involves reducing words to their stem or root form, which is similar to the lemma but not always identical. For example, the word "jumping" would be stemmed to "jump". This can help to group related words together and reduce the size of the dictionary.
● Morphological analysis: This involves analyzing the internal structure of words and identifying their constituent morphemes, such as prefixes, suffixes, and roots. This can help to identify the inflectional and derivational patterns of words and make it easier to store them in the dictionary.

Dictionary lookup is a simple and effective way to handle morphological analysis in NLP for languages with simple morphological systems. However, for more complex languages, it may be necessary to use more advanced morphological models, such as rule-based, statistical, or neural models.
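A minimal sketch of dictionary lookup with a crude lemmatizing fallback (the tiny LEXICON and the fallback rule are hypothetical; real lexicons hold many thousands of entries):

```python
# Toy morphological lexicon: surface form -> properties.
LEXICON = {
    "run":  {"pos": "verb", "lemma": "run"},
    "ran":  {"pos": "verb", "lemma": "run", "tense": "past"},
    "cat":  {"pos": "noun", "lemma": "cat"},
    "cats": {"pos": "noun", "lemma": "cat", "number": "plural"},
}

def lookup(word):
    entry = LEXICON.get(word.lower())
    if entry is not None:
        return entry
    # Fallback: strip a regular "-s" ending and retry, a crude version
    # of the stemming/lemmatization described above.
    if word.lower().endswith("s"):
        return LEXICON.get(word.lower()[:-1])
    return None

print(lookup("ran"))   # {'pos': 'verb', 'lemma': 'run', 'tense': 'past'}
print(lookup("runs"))  # falls back to the entry for "run"
```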
3.2 Finite-State Morphology:
Finite-state morphology is a type of morphological modeling used in natural language processing (NLP) that is based on the principles of finite-state automata. It is a rule-based approach that uses a set of finite-state transducers to generate and recognize words in a language.

In finite-state morphology, words are modeled as finite-state automata that accept a set of strings or sequences of symbols, which represent the morphemes that make up the word. Each morpheme is associated with a set of features that describe its properties, such as its part of speech, gender, tense, or case.

The finite-state transducers used in finite-state morphology are designed to perform two main operations: analysis and generation. In analysis, the transducer takes a word as input and breaks it down into its constituent morphemes, identifying their features and properties. In generation, the transducer takes a sequence of morphemes and generates a word that corresponds to that sequence, inflecting it for the appropriate features and properties.

Finite-state morphology is particularly effective for languages with regular and productive morphological systems, such as Turkish or Finnish, where many words are generated through inflectional or derivational patterns. It can handle large morphological paradigms with high productivity, such as the conjugation of verbs or the declension of nouns, by using a set of cascading transducers that apply different rules and transformations to the input.

One of the main advantages of finite-state morphology is that it is efficient and fast, since it can handle large vocabularies and morphological paradigms using compact and optimized finite-state transducers. It is also transparent and interpretable, since the rules and transformations used by the transducers can be easily inspected and understood by linguists and language experts.

Finite-state morphology has been used in various NLP applications, such as machine translation, speech recognition, and information retrieval, and it has been shown to be effective for many languages and domains. However, it may be less effective for languages with irregular or non-productive morphological systems, or for languages with complex syntactic or semantic structures that require more sophisticated linguistic analysis.
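The analysis/generation pair can be sketched with a toy two-direction rule set for English plural nouns (a stand-in for a real finite-state toolkit such as HFST or foma; the rules below cover only a few regular cases):

```python
# Toy finite-state-style analysis and generation for English plurals.
RULES = [("ies", "y+PL"), ("es", "+PL"), ("s", "+PL")]

def analyze(surface):
    """Surface form -> lexical form, e.g. 'cities' -> 'city+PL'."""
    for suffix, tag in RULES:
        if surface.endswith(suffix):
            return surface[: -len(suffix)] + tag
    return surface + "+SG"

def generate(lexical):
    """Lexical form -> surface form, e.g. 'city+PL' -> 'cities'."""
    if lexical.endswith("+PL"):
        stem = lexical[:-3]
        if stem.endswith("y"):
            return stem[:-1] + "ies"
        if stem.endswith(("s", "x", "z", "ch", "sh")):
            return stem + "es"
        return stem + "s"
    return lexical.removesuffix("+SG")

print(analyze("cities"))    # city+PL
print(generate("city+PL"))  # cities
print(analyze("cats"))      # cat+PL
```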
3.3 Unification-Based Morphology:
Unification-based morphology is a type of morphological modeling used in natural language processing (NLP) that is based on the principles of unification and feature-based grammar. It is a rule-based approach that uses a set of rules and constraints to generate and recognize words in a language.

In unification-based morphology, words are modeled as a set of feature structures, which are hierarchically organized representations of the properties and attributes of a word. Each feature structure is associated with a set of features and values that describe the word's morphological and syntactic properties, such as its part of speech, gender, number, tense, or case.

The rules and constraints used in unification-based morphology are designed to perform two main operations: analysis and generation. In analysis, the rules and constraints are applied to the input word and its feature structure, in order to identify its morphemes, their properties, and their relationships. In generation, the rules and constraints are used to construct a feature structure that corresponds to a given set of morphemes, inflecting the word for the appropriate features and properties.

Unification-based morphology is particularly effective for languages with complex and irregular morphological systems, such as Arabic or German, where many words are generated through complex and idiosyncratic patterns. It can handle rich and detailed morphological and syntactic structures, by using a set of constraints and agreements that ensure the consistency and coherence of the generated words.

One of the main advantages of unification-based morphology is that it is flexible and expressive, since it can handle a wide range of linguistic phenomena and constraints, by using a set of powerful and adaptable rules and constraints. It is also modular and extensible, since the feature structures and the rules and constraints can be easily combined and reused for different tasks and domains.

Unification-based morphology has been used in various NLP applications, such as text-to-speech synthesis, grammar checking, and machine translation, and it has been shown to be effective for many languages and domains. However, it may be less efficient and scalable than other morphological models, since the unification and constraint-solving algorithms can be computationally expensive and complex.
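The core unification operation can be sketched as a recursive merge of nested feature dictionaries (a minimal, assumption-laden version; real unification grammars also handle variables and reentrant structures):

```python
# Minimal feature-structure unification over nested dicts.
def unify(fs1, fs2):
    result = dict(fs1)
    for key, value in fs2.items():
        if key not in result:
            result[key] = value
        elif isinstance(result[key], dict) and isinstance(value, dict):
            sub = unify(result[key], value)  # recurse into substructures
            if sub is None:
                return None
            result[key] = sub
        elif result[key] != value:
            return None                      # conflicting values: fail
    return result

noun = {"cat": "N", "agr": {"num": "sg"}}
det = {"agr": {"num": "sg", "per": 3}}
print(unify(noun, det))
# {'cat': 'N', 'agr': {'num': 'sg', 'per': 3}}
print(unify({"agr": {"num": "sg"}}, {"agr": {"num": "pl"}}))  # None
```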
3.4 Functional Morphology:
Functional morphology is a type of morphological modeling used in natural language processing (NLP) that is based on the principles of functional and cognitive linguistics. It is a usage-based approach that emphasizes the functional and communicative aspects of language, and seeks to model the ways in which words are used and interpreted in context.

In functional morphology, words are modeled as units of meaning, or lexemes, which are associated with a set of functions and communicative contexts. Each lexeme is composed of a set of abstract features that describe its semantic, pragmatic, and discursive properties, such as its thematic roles, discourse status, or information structure.

The functional morphology model seeks to capture the relationship between the form and meaning of words, by analyzing the ways in which the morphological and syntactic structures of words reflect their communicative and discourse functions. It emphasizes the role of context and discourse in the interpretation of words, and seeks to explain the ways in which words are used and modified in response to the communicative needs of the speaker and the listener.

Functional morphology is particularly effective for modeling the ways in which words are inflected, derived, or modified in response to the communicative and discourse context, such as in the case of argument structure alternations or pragmatic marking. It can handle the complexity and variability of natural language, by focusing on the functional and communicative properties of words, and by using a set of flexible and adaptive rules and constraints.

One of the main advantages of functional morphology is that it is usage-based and corpus-driven, since it is based on the analysis of natural language data and usage patterns. It is also compatible with other models of language and cognition, such as construction grammar and cognitive linguistics, and can be integrated with other NLP techniques, such as discourse analysis and sentiment analysis.

Functional morphology has been used in various NLP applications, such as text classification, sentiment analysis, and language generation, and it has been shown to be effective for many languages and domains. However, it may require large amounts of annotated data and computational resources, in order to model the complex and variable patterns of natural language use and interpretation.

3.5 Morphology Induction:
Morphology induction is a type of morphological modeling used in natural language processing (NLP) that is based on the principles of unsupervised learning and statistical inference. It is a data-driven approach that seeks to discover the underlying morphological structure of a language, by analyzing large amounts of raw text data.

In morphology induction, words are analyzed as sequences of characters or sub-word units, which are assumed to represent the basic building blocks of the language's morphology. The task of morphology induction is to group these units into meaningful morphemes, based on their distributional properties and statistical patterns in the data.

Morphology induction can be approached through various unsupervised learning algorithms, such as clustering, probabilistic modeling, or neural networks. These algorithms use a set of heuristics and metrics to identify the most probable morpheme boundaries and groupings, based on the frequency, entropy, or coherence of the sub-word units in the data.

Morphology induction is particularly effective for modeling the morphological structure of languages with agglutinative or isolating morphologies, where words are composed of multiple morphemes with clear boundaries and meanings. It can also handle the richness and complexity of the morphology of low-resource and under-studied languages, where annotated data and linguistic resources are scarce.

One of the main advantages of morphology induction is that it is unsupervised and data-driven, since it does not require explicit linguistic knowledge or annotated data. It can also be easily adapted to different languages and domains, by using different data sources and feature representations.

Morphology induction has been used in various NLP applications, such as machine translation, information retrieval, and language modeling, and it has been shown to be effective for many languages and domains. However, it may produce less accurate and interpretable results than other morphological models, since it relies on statistical patterns and does not capture the full range of morphological and syntactic structures in the language.
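A crude illustration of the idea: count candidate suffixes across a small word list and split words at the frequent ones (the word list, thresholds, and length limits are arbitrary assumptions; real systems such as Morfessor use probabilistic models instead):

```python
from collections import Counter

words = ["walked", "talked", "jumped", "walking", "talking",
         "jumping", "walks", "talks", "jumps"]

# Count word-final substrings of length 1-3 as candidate suffixes.
suffix_counts = Counter()
for w in words:
    for i in range(1, 4):
        suffix_counts[w[-i:]] += 1

# Keep suffixes that recur often enough to look like real morphemes.
common = {s for s, c in suffix_counts.items() if c >= 3}

def segment(word):
    for s in sorted(common, key=len, reverse=True):
        if word.endswith(s) and len(word) > len(s) + 2:
            return [word[: -len(s)], "-" + s]
    return [word]

print(segment("walked"))   # ['walk', '-ed']
print(segment("jumping"))  # ['jump', '-ing']
```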
PART 2: Finding the Structure of Documents:
1. Introduction
2. Methods
3. Complexity of the Approaches
4. Performances of the Approaches

Finding the Structure of Documents:

1. Introduction:
Finding the structure of documents in natural language processing (NLP) refers to the process of identifying the different components and sections of a document, and organizing them in a hierarchical or linear structure. This is a crucial step in many NLP tasks, such as information retrieval, text classification, and summarization, as it allows for a more accurate and effective analysis of the document's content and meaning.

There are several approaches to finding the structure of documents in NLP, including:
1. Rule-based methods: These methods rely on a set of predefined rules and heuristics to identify the different structural elements of a document, such as headings, paragraphs, and sections. For example, a rule-based method might identify a section heading based on its font size, position, or formatting.
2. Machine learning methods: These methods use statistical and machine learning algorithms to automatically learn the structural patterns and features of a document, based on a training set of annotated data. For example, a machine learning method might use a support vector machine (SVM) classifier to identify the different sections of a document based on their linguistic and structural features.
3. Hybrid methods: These methods combine rule-based and machine learning approaches, in order to leverage the strengths of both. For example, a hybrid method might use a rule-based algorithm to identify the headings and sections of a document, and then use a machine learning algorithm to classify the content of each section.

Some of the specific techniques and tools used in finding the structure of documents in NLP include:
1. Named entity recognition: This technique identifies and extracts specific entities, such as people, places, and organizations, from the document, which can help in identifying the different sections and topics.
2. Part-of-speech tagging: This technique assigns a part-of-speech tag to each word in the document, which can help in identifying the syntactic and semantic structure of the text.
3. Dependency parsing: This technique analyzes the relationships between the words in a sentence, and can be used to identify the different clauses and phrases in the text.
4. Topic modeling: This technique uses unsupervised learning algorithms to identify the different topics and themes in the document, which can be used to organize the content into different sections.

Finding the structure of documents in NLP is a complex and challenging task, as it requires the analysis of multiple linguistic and non-linguistic cues, as well as the use of domain-specific knowledge and expertise. However, it is a critical step in many NLP applications, and can greatly improve the accuracy and effectiveness of the analysis and interpretation of the document's content.
1.1 Sentence Boundary Detection:
Sentence boundary detection is a subtask of finding the structure of documents in NLP that involves identifying the boundaries between sentences in a document. This is an important task, as it is a fundamental step in many NLP applications, such as machine translation, text summarization, and information retrieval.

Sentence boundary detection is a challenging task due to the presence of ambiguities and irregularities in natural language, such as abbreviations, acronyms, and names that end with a period. To address these challenges, several methods and techniques have been developed for sentence boundary detection, including:
1. Rule-based methods: These methods use a set of pre-defined rules and heuristics to identify the end of a sentence. For example, a rule-based method may consider a period followed by a whitespace character as an end-of-sentence marker, unless the period is part of an abbreviation.
2. Machine learning methods: These methods use statistical and machine learning algorithms to learn the patterns and features of sentence boundaries based on a training set of annotated data. For example, a machine learning method may use a support vector machine (SVM) classifier to identify the boundaries between sentences based on linguistic and contextual features, such as the length of the sentence, the presence of quotation marks, and the part-of-speech of the last word.
3. Hybrid methods: These methods combine the strengths of rule-based and machine learning approaches, in order to leverage the advantages of both. For example, a hybrid method may use a rule-based algorithm to identify most sentence boundaries, and then use a machine learning algorithm to correct any errors or exceptions.

Some of the specific techniques and tools used in sentence boundary detection include:
1. Regular expressions: These are patterns that can be used to match specific character sequences in a text, such as periods followed by whitespace characters, and can be used to identify the end of a sentence (a sketch appears at the end of this subsection).
2. Hidden Markov Models: These are statistical models that can be used to identify the most likely sequence of sentence boundaries in a text, based on the probabilities of different sentence boundary markers.
3. Deep learning models: These are neural network models that can learn complex patterns and features of sentence boundaries from a large corpus of text, and can be used to achieve state-of-the-art performance in sentence boundary detection.

Sentence boundary detection is an essential step in many NLP tasks, as it provides the foundation for analyzing and interpreting the structure and meaning of a document. By accurately identifying the boundaries between sentences, NLP systems can more effectively extract information, generate summaries, and perform other language-related tasks.
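A minimal sketch of the rule-based idea referenced above (the abbreviation list is a toy assumption; production systems use much larger lists or trained models such as NLTK's punkt):

```python
import re

# Split on ., ! or ? followed by whitespace, unless the token before
# the period is a known abbreviation.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:match.end()].strip()
        if candidate.split()[-1].lower() in ABBREVIATIONS:
            continue                  # period belongs to an abbreviation
        sentences.append(candidate)
        start = match.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith arrived. He was late. Ask Mr. Jones."))
# ['Dr. Smith arrived.', 'He was late.', 'Ask Mr. Jones.']
```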
1.2 Topic Boundary Detection:
Topic boundary detection is another important subtask of finding the structure of documents in NLP. It involves identifying the points in a document where the topic or theme of the text shifts. This task is particularly useful for organizing and summarizing large amounts of text, as it allows for the identification of different topics or subtopics within a document.

Topic boundary detection is a challenging task, as it involves understanding the underlying semantic structure and meaning of the text, rather than simply identifying specific markers or patterns. As such, there are several methods and techniques that have been developed to address this challenge, including:
1. Lexical cohesion: This method looks at the patterns of words and phrases that appear in a text, and identifies changes in the frequency or distribution of these patterns as potential topic boundaries. For example, if the frequency of a particular keyword or phrase drops off sharply after a certain point in the text, this could indicate a shift in topic.
2. Discourse markers: This method looks at the use of discourse markers, such as "however", "in contrast", and "furthermore", which are often used to signal a change in topic or subtopic. By identifying these markers in a text, it is possible to locate potential topic boundaries.
3. Machine learning: This method involves training a machine learning model to identify patterns and features in a text that are associated with topic boundaries. This can involve using a variety of linguistic and contextual features, such as sentence length, word frequency, and part-of-speech tags, to identify potential topic boundaries.

Some of the specific techniques and tools used in topic boundary detection include:
1. Latent Dirichlet Allocation (LDA): This is a probabilistic topic modeling technique that can be used to identify topics within a corpus of text. By analyzing the distribution of words within a text, LDA can identify the most likely topics and subtopics within the text, and can be used to locate topic boundaries.
2. TextTiling: This is a technique that involves breaking a text into smaller segments, or "tiles", based on the frequency and distribution of key words and phrases. By comparing the tiles to each other, it is possible to identify shifts in topic or subtopic, and locate potential topic boundaries.
3. Coh-Metrix: This is a text analysis tool that uses a range of linguistic and discourse-based features to identify different aspects of text complexity, including topic boundaries. By analyzing the patterns of words, syntax, and discourse in a text, Coh-Metrix can identify potential topic boundaries, as well as provide insights into the overall structure and organization of the text.

Topic boundary detection is an important task in NLP, as it enables more effective organization and analysis of large amounts of text. By accurately identifying topic boundaries, NLP systems can more effectively extract and summarize information, identify key themes and ideas, and provide more insightful and relevant responses to user queries.
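A TextTiling-flavoured sketch: score adjacent blocks of sentences by the cosine similarity of their word counts, and treat low-similarity points as candidate topic boundaries (block size, preprocessing, and thresholds here are illustrative assumptions only):

```python
import math
from collections import Counter

def cosine(c1, c2):
    dot = sum(c1[w] * c2[w] for w in c1)
    norm = math.sqrt(sum(v * v for v in c1.values())) * \
           math.sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

def boundary_scores(sentences, block=2):
    # Compare the `block` sentences before position i with the `block`
    # sentences after it; a low score suggests a topic shift at i.
    scores = []
    for i in range(block, len(sentences) - block + 1):
        left = Counter(" ".join(sentences[i - block:i]).lower().split())
        right = Counter(" ".join(sentences[i:i + block]).lower().split())
        scores.append((i, cosine(left, right)))
    return scores

sents = ["Cats purr.", "Cats sleep a lot.",
         "Stocks fell today.", "Markets remain volatile."]
print(boundary_scores(sents))  # [(2, 0.0)] -> boundary before sentence 2
```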
2. Methods:
There are several methods and techniques used in NLP to find the structure of documents, which include:
1. Sentence boundary detection: This involves identifying the boundaries between sentences in a document, which is important for tasks like parsing, machine translation, and text-to-speech synthesis.
2. Part-of-speech tagging: This involves assigning a part of speech (noun, verb, adjective, etc.) to each word in a sentence, which is useful for tasks like parsing, information extraction, and sentiment analysis.
3. Named entity recognition: This involves identifying and classifying named entities (such as people, organizations, and locations) in a document, which is important for tasks like information extraction and text categorization.
4. Coreference resolution: This involves identifying all the expressions in a text that refer to the same entity, which is important for tasks like information extraction and machine translation.
5. Topic boundary detection: This involves identifying the points in a document where the topic or theme of the text shifts, which is useful for organizing and summarizing large amounts of text.
6. Parsing: This involves analyzing the grammatical structure of sentences in a document, which is important for tasks like machine translation, text-to-speech synthesis, and information extraction.
7. Sentiment analysis: This involves identifying the sentiment (positive, negative, or neutral) expressed in a document, which is useful for tasks like brand monitoring, customer feedback analysis, and market research.

There are several tools and techniques used in NLP to perform these tasks, including machine learning algorithms, rule-based systems, and statistical models. These tools can be used in combination to build more complex NLP systems that can accurately analyze and understand the structure and content of large amounts of text.
2.1 Generative Sequence Classification Methods:
Generative sequence classification methods are a type of NLP method used to find the structure of documents. These methods involve using probabilistic models to classify sequences of words into predefined categories or labels.

One popular generative sequence classification method is Hidden Markov Models (HMMs). HMMs are statistical models that can be used to classify sequences of words by modeling the probability distribution of the observed words given a set of hidden states. The hidden states in an HMM can represent different linguistic features, such as part-of-speech tags or named entities, and the model can be trained using labeled data to learn the most likely sequence of hidden states for a given sequence of words.

Conditional Random Fields (CRFs) are often discussed alongside HMMs. Strictly speaking, CRFs are discriminative rather than generative: they model the conditional probability of a sequence of labels given a sequence of words, and they are more flexible than HMMs in that they can take into account more complex features and dependencies between labels.

Both HMMs and CRFs can be used for tasks like part-of-speech tagging, named entity recognition, and chunking, which involve classifying sequences of words into predefined categories or labels. These methods have been shown to be effective in a variety of NLP applications and are widely used in industry and academia.
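To make the HMM idea concrete, here is a toy Viterbi decoder over a two-tag model; the states and hand-set probabilities are illustrative assumptions (in practice they are estimated from a labeled corpus):

```python
# Toy HMM decoding with the Viterbi algorithm.
states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.6, "VERB": 0.4}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
           "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {"NOUN": {"dogs": 0.5, "bark": 0.1},
          "VERB": {"dogs": 0.1, "bark": 0.6}}

def viterbi(obs):
    # V[t][s] = (best probability of a path ending in state s, that path)
    V = [{s: (start_p[s] * emit_p[s].get(obs[0], 1e-6), [s]) for s in states}]
    for word in obs[1:]:
        layer = {}
        for s in states:
            prob, path = max(
                (V[-1][prev][0] * trans_p[prev][s] * emit_p[s].get(word, 1e-6),
                 V[-1][prev][1] + [s])
                for prev in states)
            layer[s] = (prob, path)
        V.append(layer)
    return max(V[-1].values())[1]   # path of the most probable final state

print(viterbi(["dogs", "bark"]))    # ['NOUN', 'VERB']
```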
2.2 Discriminative Local Classification Methods:
Discriminative local classification methods are another type of NLP method used to find the structure of documents. These methods involve training a model to classify each individual word or token in a document based on its features and the context in which it appears.

One popular example of a discriminative method is Conditional Random Fields (CRFs). CRFs are discriminative models: they model the conditional probability of a sequence of labels given a sequence of features, without making assumptions about the underlying distribution of the data. CRFs have been used for tasks such as named entity recognition, part-of-speech tagging, and chunking.

Another example of a discriminative local classification method is Maximum Entropy Markov Models (MEMMs), which are similar to CRFs but use maximum entropy modeling to make predictions about the next label in a sequence given the current label and features. MEMMs have been used for tasks such as speech recognition, named entity recognition, and machine translation.

Other discriminative local classification methods include support vector machines (SVMs), decision trees, and neural networks. These methods have also been used for tasks such as sentiment analysis, topic classification, and document categorization.

Overall, discriminative local classification methods are useful for tasks where it is necessary to classify each individual word or token in a document based on its features and context. These methods are often used in conjunction with other NLP techniques, such as sentence boundary detection and parsing, to build more complex NLP systems for document analysis and understanding.
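A sketch of a discriminative token labeler using the sklearn-crfsuite package (assuming it is installed; the feature template, tag set, and the tiny training corpus are illustrative only):

```python
import sklearn_crfsuite  # assumption: pip install sklearn-crfsuite

def token_features(sent, i):
    # Per-token feature dict: the word itself plus a little local context.
    word = sent[i]
    return {
        "word.lower": word.lower(),
        "is_title": word.istitle(),
        "prev": sent[i - 1].lower() if i > 0 else "<s>",
        "next": sent[i + 1].lower() if i < len(sent) - 1 else "</s>",
    }

train_sents = [["John", "lives", "in", "Paris"],
               ["Mary", "visited", "London"]]
train_tags = [["PER", "O", "O", "LOC"],
              ["PER", "O", "LOC"]]

X = [[token_features(s, i) for i in range(len(s))] for s in train_sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, train_tags)

test = ["John", "visited", "Paris"]
print(crf.predict([[token_features(test, i) for i in range(len(test))]]))
```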
2.3 Discriminative Sequence Classification Methods:
Discriminative sequence classification methods are another type of NLP method used to find the structure of documents. These methods involve training a model to predict the label or category for a sequence of words in a document, based on the features of the sequence and the context in which it appears.

One popular example of a discriminative sequence classification method is the Maximum Entropy Markov Model (MEMM). MEMMs are a type of discriminative model that can predict the label or category for a sequence of words in a document, based on the features of the sequence and the context in which it appears. MEMMs have been used for tasks such as named entity recognition, part-of-speech tagging, and text classification.

Another example of a discriminative sequence classification method is Conditional Random Fields (CRFs), which were mentioned earlier. CRFs model the conditional probability of a sequence of labels given a sequence of features, without making assumptions about the underlying distribution of the data. CRFs have been used for tasks such as named entity recognition, part-of-speech tagging, and chunking.

Hidden Markov Models (HMMs), which were mentioned earlier as a type of generative model, can also be trained discriminatively, by directly optimizing the probability of a sequence of labels given a sequence of features. HMMs have been used for tasks such as speech recognition, named entity recognition, and part-of-speech tagging.

Overall, discriminative sequence classification methods are useful for tasks where it is necessary to predict the label or category for a sequence of words in a document, based on the features of the sequence and the context in which it appears. These methods have been shown to be effective in a variety of NLP applications and are widely used in industry and academia.
2.4 Hybrid Approaches:
Hybrid approaches to finding the structure of documents in NLP combine multiple methods to achieve better results than any one method alone. For example, a hybrid approach might combine generative and discriminative models, or combine different types of models with different types of features.

One example of a hybrid approach is the use of Conditional Random Fields (CRFs) and Support Vector Machines (SVMs) for named entity recognition. CRFs are used to model the dependencies between neighboring labels in the sequence, while SVMs are used to model the relationship between the input features and the labels.

Another example of a hybrid approach is the use of a rule-based system in combination with machine learning models for sentence boundary detection. The rule-based system might use heuristics to identify common sentence-ending punctuation, while a machine learning model might be trained on a large corpus of text to identify less common patterns.

Hybrid approaches can also be used to combine different types of features in a model. For example, a model might use both lexical features (such as the words in the sequence) and syntactic features (such as the part-of-speech tags of the words) to predict the labels for a sequence.

Overall, hybrid approaches are useful for tasks where a single method may not be sufficient to achieve high accuracy. By combining multiple methods, hybrid approaches can take advantage of the strengths of each method and achieve better performance than any one method alone.
2.5 Extensions for Global Modeling for Sentence Segmentation:
Extensions for global modeling for sentence segmentation in NLP involve using algorithms that analyze an entire document or corpus of documents to identify sentence boundaries, rather than analyzing sentences in isolation. These methods can be more effective in situations where sentence boundaries are not clearly indicated by punctuation, or where there are other sources of ambiguity.

One example of an extension for global modeling for sentence segmentation is the use of Hidden Markov Models (HMMs). HMMs are statistical models that can be used to identify patterns in a sequence of observations. In the case of sentence segmentation, the observations are the words in the document, and the model tries to identify patterns that correspond to the beginning and end of sentences. HMMs can take into account context beyond just the current sentence, which can improve accuracy in cases where sentence boundaries are not clearly marked.

Another example of an extension for global modeling is the use of clustering algorithms. Clustering algorithms group similar sentences together based on features such as the frequency of certain words or the number of common n-grams. Once sentences are clustered together, the boundaries between the clusters can be used to identify sentence boundaries.

Additionally, there are also neural network-based approaches, such as the use of convolutional neural networks (CNNs) or recurrent neural networks (RNNs) for sentence boundary detection. These models can learn to recognize patterns in the text by analyzing larger contexts, and can be trained on large corpora of text to improve their accuracy.

Overall, extensions for global modeling for sentence segmentation can be more effective than local models when dealing with more complex or ambiguous text, and can lead to more accurate results in certain situations.
3. Complexity of the Approaches:
Finding the structure of documents in natural language processing (NLP) can be a complex task, and there are several approaches with varying degrees of complexity. Here are a few examples:
1. Rule-based approaches: These approaches use a set of predefined rules to identify the structure of a document. For instance, they might identify headings based on font size and style, or look for bullet points or numbered lists. While these approaches can be effective in some cases, they are often limited in their ability to handle complex or ambiguous structures.
2. Statistical approaches: These approaches use machine learning algorithms to identify the structure of a document based on patterns in the data. For instance, they might use a classifier to predict whether a given sentence is a heading or a body paragraph. These approaches can be quite effective, but they require large amounts of labeled data to train the model.
3. Deep learning approaches: These approaches use deep neural networks to learn the structure of a document. For instance, they might use a hierarchical attention network to identify headings and subheadings, or a sequence-to-sequence model to summarize the document. These approaches can be very powerful, but they require even larger amounts of labeled data and significant computational resources to train.

Overall, the complexity of these approaches depends on the level of accuracy and precision desired, the size and complexity of the documents being analyzed, and the amount of labeled data available for training. In general, more complex approaches tend to be more accurate but also require more resources and expertise to implement.
4. Performances of the Approaches:
The performance of different approaches for finding the structure of documents in natural language processing (NLP) can vary depending on the specific task and the complexity of the document. Here are some general trends:
1. Rule-based approaches: These approaches can be effective when the document structure is relatively simple and the rules are well-defined. However, they can struggle with more complex or ambiguous structures, and require a lot of manual effort to define the rules.
2. Statistical approaches: These approaches can be quite effective when there is a large amount of labeled data available for training, and the document structure is relatively consistent across examples. However, they may struggle with identifying new or unusual structures that are not well-represented in the training data.
3. Deep learning approaches: These approaches can be very effective in identifying complex and ambiguous document structures, and can even discover new structures that were not present in the training data. However, they require large amounts of labeled data and significant computational resources to train, and can be difficult to interpret.

In general, the performance of these approaches will depend on factors such as the quality and quantity of the training data, the complexity and variability of the document structure, and the specific metrics used to evaluate performance (e.g. accuracy, precision, recall, F1-score). It's also worth noting that different approaches may be better suited for different sub-tasks within document structure analysis, such as identifying headings, lists, tables, or section breaks.
UNIT - II: Syntax Analysis: (chapter 3, txtbk 1)
1. Parsing Natural Language
2. Treebanks: A Data-Driven Approach to Syntax
3. Representation of Syntactic Structure
4. Parsing Algorithms
5. Models for Ambiguity Resolution in Parsing
6. Multilingual Issues
Syntax Analysis:
Syntax analysis in natural language processing (NLP) refers to the process of
iden fying the structure of a sentence and its component parts, such as phrases and
clauses, based on the rules of the language's syntax.

There are several approaches to syntax analysis in NLP, including:


1. Part-of-speech (POS) tagging: This involves identifying the syntactic category
of each word in a sentence, such as noun, verb, adjective, etc. This can be
done using machine learning algorithms trained on annotated corpora of text.
2. Dependency parsing: This involves identifying the relationships between
words in a sentence, such as subject-verb or object-verb relationships. This
can be done using a dependency parser, which generates a parse tree that
represents the relationships between words.
3. Constituency parsing: This involves identifying the constituent parts of a
sentence, such as phrases and clauses. This can be done using a
phrase-structure parser, which generates a parse tree that represents the
structure of the sentence.
Syntax analysis is important for many NLP tasks, such as named entity recognition,
sentiment analysis, and machine translation. By understanding the syntactic
structure of a sentence, NLP systems can better identify the relationships between
words and the overall structure of the text, which can be used to extract meaning and
perform various downstream tasks.
1.Parsing Natural Language:
In natural language processing (NLP), syntax analysis, also known as parsing, refers
to the process of analyzing the grammatical structure of a sentence in order to
determine its constituent parts, their relationships to each other, and their functions
within the sentence. This involves breaking down the sentence into its individual
components, such as nouns, verbs, adjectives, and phrases, and then analyzing how
these components are related to each other.

There are two main approaches to syntax analysis in NLP: rule-based parsing and
statistical parsing. Rule-based parsing involves the use of a set of pre-defined rules
that dictate how the different parts of speech and phrases in a sentence should be
structured and related to each other. Statistical parsing, on the other hand, uses
machine learning algorithms to learn patterns and relationships in large corpora of
text in order to generate parse trees for new sentences.
Here's an example of how syntax analysis works using a simple sentence:
Sentence: "The cat sat on the mat."
Step 1: Tokenization

The first step is to break the sentence down into its individual words, or tokens:

"The", "cat", "sat", "on", "the", "mat", "."

Step 2: Part-of-Speech Tagging

Next, each token is assigned a part-of-speech tag, which indicates its grammatical
function in the sentence:

"The" (determiner), "cat" (noun), "sat" (verb), "on" (preposition), "the" (determiner),
"mat" (noun), "." (punctuation)
Step 3: Dependency Parsing

Finally, the relationships between the words in the sentence are analyzed using a
dependency parser to create a parse tree. In this example, the parse tree might look
something like this:
        sat
       /   \
    cat     on
     |       |
    The     mat
             |
            the
This parse tree shows that "cat" is the subject of the verb "sat," and "mat" is the
object of the preposition "on."

Syntax analysis is a crucial component of many NLP tasks, including machine
translation, text-to-speech conversion, and sentiment analysis. By understanding the
grammatical structure of a sentence, NLP models can more accurately interpret its
meaning and generate appropriate responses or translations.
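To make the first two steps concrete, here is a minimal sketch using NLTK's tokenizer
and part-of-speech tagger; it assumes NLTK is installed and its tokenizer and tagger
models have been downloaded:

import nltk

# One-time model downloads (uncomment on first run):
# nltk.download("punkt")
# nltk.download("averaged_perceptron_tagger")

sentence = "The cat sat on the mat."

tokens = nltk.word_tokenize(sentence)   # Step 1: tokenization
tagged = nltk.pos_tag(tokens)           # Step 2: part-of-speech tagging
print(tagged)
# e.g. [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('on', 'IN'),
#       ('the', 'DT'), ('mat', 'NN'), ('.', '.')]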
2.Treebanks: A Data-Driven Approach to Syntax:
Treebanks are a data-driven approach to syntax analysis in natural language
processing (NLP). They consist of a large collection of sentences, each of which has
been manually annotated with a parse tree that shows the syntactic structure of the
sentence. Treebanks are used to train statistical parsers, which can then
automatically analyze new sentences and generate their own parse trees.

A parse tree is a hierarchical structure that represents the syntactic structure of a
sentence. Each node in the tree represents a constituent of the sentence, such as a
noun phrase or a verb phrase. The edges of the tree represent the relationships
between these constituents, such as subject-verb or verb-object relationships.
Here's an example of a parse tree for the sentence "The cat sat on the mat":

          sat(V)
         /      \
    cat(N)      on(PREP)
      |           |
    The(D)      mat(N)
                  |
                the(D)
This parse tree shows that the sentence is composed of a verb phrase ("sat") and a
prepositional phrase ("on the mat"), with the verb phrase consisting of a verb ("sat")
and a noun phrase ("the cat"). The noun phrase, in turn, consists of a determiner
("the") and a noun ("cat"), and the prepositional phrase consists of a preposition
("on") and a noun phrase ("the mat").
Treebanks can be used to train statistical parsers, which can then automatically
analyze new sentences and generate their own parse trees. These parsers work by
identifying patterns in the treebank data and using these patterns to make
predictions about the structure of new sentences. For example, a statistical parser
might learn that a noun phrase is usually followed by a verb phrase and use this
pattern to generate a parse tree for a new sentence.
Treebanks are an important resource in NLP, as they allow researchers and
developers to train and test statistical parsers and other models that rely on
syntactic analysis. Some well-known treebanks include the Penn Treebank and the
Universal Dependencies treebanks. These resources are publicly available and have
been used in a wide range of NLP research and applications.
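As a quick illustration, the Penn Treebank sample that ships with NLTK can be loaded
directly; this sketch assumes NLTK is installed and the 'treebank' corpus has been
downloaded:

import nltk
from nltk.corpus import treebank

# One-time corpus download (uncomment on first run):
# nltk.download("treebank")

# Load the first manually annotated parse tree from the corpus sample.
tree = treebank.parsed_sents()[0]
print(tree)            # bracketed Penn Treebank notation
tree.pretty_print()    # drawn as an ASCII tree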
3.Representation of Syntactic Structure:
In natural language processing (NLP), the representation of syntactic structure refers
to how the structure of a sentence is represented in a machine-readable form. There
are several different ways to represent syntactic structure, including
constituency-based representations and dependency-based representations.
Constituency-Based Representations:

1. Constituency-based representations, also known as phrase structure trees,
represent the structure of a sentence as a hierarchical tree structure, with
each node in the tree representing a constituent of the sentence. The nodes
are labeled with a grammatical category such as noun phrase (NP) or verb
phrase (VP), and the branches represent the syntactic relationships between
the nodes. Constituency-based representations are often used in rule-based
approaches to parsing.
Here's an example of a constituency-based representation of the sentence "The cat
sat on the mat":
(S
  (NP (DT The) (NN cat))
  (VP (VBD sat)
    (PP (IN on)
      (NP (DT the) (NN mat)))))
This representation shows that the sentence is composed of a noun phrase ("The
cat") and a verb phrase ("sat on the mat"), with the verb phrase consisting of a verb
("sat") and a prepositional phrase ("on the mat"), and the prepositional phrase
consisting of a preposition ("on") and a noun phrase ("the mat").
Dependency-Based Representations:

2. Dependency-based representations represent the structure of a sentence as a
directed graph, with each word in the sentence represented as a node in the
graph, and the relationships between the words represented as directed
edges. The edges are labeled with a grammatical function such as subject
(SUBJ) or object (OBJ), and the nodes are labeled with a part-of-speech tag
such as noun (N) or verb (V). Dependency-based representations are often
used in statistical approaches to parsing.
Here's an example of a dependency-based representation of the sentence "The cat
sat on the mat":

        sat-V
       /     \
   cat-N    on-PREP
              |
            mat-N
This representation shows that the subject "cat" depends on the verb "sat," and the
object "mat" depends on the preposition "on."

Both constituency-based and dependency-based representations are used in a
variety of NLP tasks, including machine translation, sentiment analysis, and
information extraction. The choice of representation depends on the specific task
and the algorithms used to process the data.
3.1 Syntax Analysis Using Dependency Graphs:
Syntax analysis using dependency graphs is a popular approach in natural language
processing (NLP). Dependency graphs represent the syntactic structure of a
sentence as a directed graph, where each word is a node in the graph and the
relationships between words are represented as directed edges. The nodes in the
graph are labeled with the part of speech of the corresponding word, and the edges
are labeled with the grammatical relationship between the two words.
Here's an example of a dependency graph for the sentence "The cat sat on the mat":

sat ──subj──► cat
sat ──prep──► on
on ──pobj──► mat
cat ──det──► The
mat ──det──► the
In this graph, the word "cat" depends on the word "sat" with a subject relationship,
and the word "mat" depends on the word "on" with a prepositional relationship.
Dependency graphs are useful for a variety of NLP tasks, including named entity
recognition, relation extraction, and sentiment analysis. They can also be used for
parsing and syntactic analysis, as they provide a compact and expressive way to
represent the structure of a sentence.

One advantage of dependency graphs is that they are simpler and more efficient than
phrase structure trees, which can be computationally expensive to build and
manipulate. Dependency graphs also provide a more flexible representation of
syntactic structure, as they can easily capture non-projective dependencies and other
complex relationships between words.
Here's another example of a dependency graph for the sentence "I saw the man with
the telescope":

saw ──nsubj──► I
saw ──dobj──► man
man ──det──► the
man ──prep──► with
with ──pobj──► telescope
telescope ──det──► the
This graph shows that the subject "I" depends on the verb "saw," and that the noun
phrase "the man" depends on the verb "saw" with an object relationship. The
prepositional phrase "with the telescope" modifies the noun phrase "the man," with
the word "telescope" being the object of the preposition "with."
In summary, dependency graphs provide a flexible and efficient way to represent the
syntactic structure of a sentence in NLP. They can be used for a variety of tasks and
are a key component of many state-of-the-art NLP models.
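A pre-trained dependency parser such as spaCy's produces exactly these labeled
head-dependent edges; a minimal sketch, assuming spaCy and its small English
model are installed:

import spacy

# Load spaCy's small English pipeline
# (install with: python -m spacy download en_core_web_sm).
nlp = spacy.load("en_core_web_sm")
doc = nlp("I saw the man with the telescope")

# Each token points at its syntactic head via a labeled dependency arc.
for token in doc:
    print(f"{token.text:10} --{token.dep_:>6}--> {token.head.text}")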
3.2 Syntax Analysis Using Phrase Structure Trees:
Syntax analysis, also known as parsing, is the process of analyzing the grammatical
structure of a sentence to identify its constituent parts and the relationships between
them. In natural language processing (NLP), phrase structure trees are often used to
represent the syntactic structure of a sentence.

A phrase structure tree, also known as a parse tree or a syntax tree, is a graphical
representation of the syntactic structure of a sentence. It consists of a hierarchical
structure of nodes, where each node represents a phrase or a constituent of the
sentence.
Here's an example of a phrase structure tree for the sentence "The cat sat on the
mat", written in bracketed notation:

(S
  (NP (Det The) (N cat))
  (VP (V sat)
    (PP (P on)
      (NP (Det the) (N mat)))))
In this tree, the top-level node represents the entire sentence (S), which is divided
into two subparts: the noun phrase (NP) "The cat" and the verb phrase (VP) "sat on
the mat". The NP is further divided into a determiner (Det) "The" and a noun (N) "cat".
The VP is composed of a verb (V) "sat" and a prepositional phrase (PP) "on the mat",
which itself consists of a preposition (P) "on" and another noun phrase (NP) "the
mat".
Here's another example of a phrase structure tree for the sentence "John saw the
man with the telescope":

(S
  (NP (N John))
  (VP (V saw)
    (NP (Det the) (N man)
      (PP (P with)
        (NP (Det the) (N telescope))))))
In this tree, the top-level node represents the entire sentence (S), which is divided
into a noun phrase (NP) "John" and a verb phrase (VP) "saw the man with the
telescope". The NP is simply a single noun (N) "John". The VP is composed of a verb
(V) "saw" and a noun phrase (NP) "the man with the telescope". The latter is
composed of a determiner (Det) "the" and a noun (N) "man", modified by a
prepositional phrase (PP) "with the telescope", which itself consists of a preposition
(P) "with" and a noun phrase (NP) "the telescope".
Phrase structure trees can be used in NLP for a variety of tasks, such as machine
translation, text-to-speech synthesis, and natural language understanding. By
identifying the syntactic structure of a sentence, computers can more accurately
understand its meaning and generate appropriate responses.
4.Parsing Algorithms:
There are several algorithms used in natural language processing (NLP) for syntax
analysis or parsing, each with its own strengths and weaknesses. Here are three
common parsing algorithms and their examples:
1. Recursive descent parsing: This is a top-down parsing algorithm that starts
with the top-level symbol (usually the sentence) and recursively applies
production rules to derive the structure of the sentence. Each production rule
corresponds to a non-terminal symbol in the grammar, which can be
expanded into a sequence of other symbols. The algorithm selects the first
production rule that matches the current input, and recursively applies it to its
right-hand side symbols. This process continues until a match is found for
every terminal symbol in the input.
Example: Consider the following context-free grammar for arithmetic expressions:
E -> E + T | E - T | T
T -> T * F | T / F | F
F -> ( E ) | num
Suppose we want to parse the expression "3 + 4 * (5 - 2)" using recursive descent parsing.
The algorithm would start with the top-level symbol E and apply the first production rule
E -> E + T. It would then recursively apply the production rules for E, T, and F until it
reaches the terminals "3", "+", "4", "*", "(", "5", "-", "2", and ")". The resulting parse
tree would look like this:
E
├── E
│   └── T
│       └── F
│           └── num (3)
├── +
└── T
    ├── T
    │   └── F
    │       └── num (4)
    ├── *
    └── F
        ├── (
        ├── E
        │   ├── E
        │   │   └── T
        │   │       └── F
        │   │           └── num (5)
        │   ├── -
        │   └── T
        │       └── F
        │           └── num (2)
        └── )
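To make the procedure concrete, here is a minimal recursive-descent parser for this
grammar. Because naive recursive descent cannot handle left recursion, the
left-recursive rules (e.g. E -> E + T) are rewritten as iteration
(E -> T (("+"|"-") T)*), the standard transformation; this is an illustrative
sketch rather than code from the textbook:

import re

def tokenize(text):
    # Split the input into numbers, operators, and parentheses.
    return re.findall(r"\d+|[+\-*/()]", text)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def parse_E(self):                      # E -> T (("+"|"-") T)*
        node = self.parse_T()
        while self.peek() in ("+", "-"):
            op = self.eat()
            node = ("E", node, op, self.parse_T())
        return node

    def parse_T(self):                      # T -> F (("*"|"/") F)*
        node = self.parse_F()
        while self.peek() in ("*", "/"):
            op = self.eat()
            node = ("T", node, op, self.parse_F())
        return node

    def parse_F(self):                      # F -> "(" E ")" | num
        if self.peek() == "(":
            self.eat("(")
            node = self.parse_E()
            self.eat(")")
            return ("F", node)
        return ("num", self.eat())

print(Parser(tokenize("3 + 4 * (5 - 2)")).parse_E())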
2. Shift-reduce parsing: This is a bottom-up parsing algorithm that starts with
the input tokens and constructs a parse tree by repeatedly shifting a token
onto a stack and reducing a group of symbols on the stack to a single symbol
based on the production rules. The algorithm maintains a parse table that
specifies which actions to take based on the current state and the next input
symbol.
Example: Consider the following grammar for simple English sentences:
S -> NP VP
NP -> Det N | NP PP
VP -> V NP | VP PP
PP -> P NP
Det -> the | a
N -> man | ball | woman
V -> saw | liked
P -> with | in
Suppose we want to parse the sentence "the man saw a woman with a ball" using
shift-reduce parsing. The algorithm would start with an empty stack and shift the
tokens "the", "man", "saw", "a", "woman", "with", "a", and "ball" onto the stack as
needed. It would then reduce the symbols "Det N" to NP, "P NP" to PP, "NP PP" to NP,
"V NP" to VP, and finally "NP VP" to S. The resulting parse tree would look like this:
(S
  (NP (Det the) (N man))
  (VP (V saw)
    (NP (NP (Det a) (N woman))
      (PP (P with)
        (NP (Det a) (N ball))))))
3. Earley parsing: This is a chart parsing algorithm that uses dynamic programming to
store partial parses in a chart, which can be combined to form complete parses.
4.1 Shift-Reduce Parsing:
Shift-reduce parsing is a bottom-up parsing algorithm commonly used in natural
language processing (NLP) to generate parse trees from input sentences. It works by
incrementally reducing a stack of symbols to a single non-terminal symbol that
matches a production rule.

Here is an example of how shift-reduce parsing can be used to parse the sentence
"the cat chased the mouse" using a simple grammar:
S -> NP VP
NP -> Det N
VP -> V NP
Det -> the
N -> cat | mouse
V -> chased
1. Initialization: We start by initializing an empty stack and an input buffer with
the sentence tokens "the", "cat", "chased", "the", and "mouse".
2. Shifting: We shift the first token "the" onto the stack and reduce it to Det
using the rule Det -> the. The stack now contains Det.
3. Shifting again: We shift the next token "cat" onto the stack and reduce it to N
using the rule N -> cat. The stack now contains Det N.
4. Reduction: The top of the stack matches the right-hand side of the rule
NP -> Det N, so we pop Det and N and push the non-terminal symbol NP. The
stack now contains NP.
5. Shifting again: We shift the next token "chased" onto the stack and reduce it
to V using the rule V -> chased. The stack now contains NP V.
6. Shifting again: We shift the next token "the" onto the stack and reduce it to
Det. The stack now contains NP V Det.
7. Shifting again: We shift the last token "mouse" onto the stack and reduce it to
N, then reduce Det N to NP using NP -> Det N. The stack now contains
NP V NP.
8. Reduction again: The symbols V NP match the rule VP -> V NP, so we pop
them and push VP. The stack now contains NP VP, which matches the rule
S -> NP VP, so we pop NP and VP and push S.
9. Completion: The stack now contains only the symbol S, which is the final
parse of the input sentence. The final parse tree for the sentence is:
(S
  (NP (Det the) (N cat))
  (VP (V chased)
    (NP (Det the) (N mouse))))
Note that this example uses a simple grammar and a straightforward parsing process, but
more complex grammars and sentences may require additional steps or different strategies
to achieve a successful parse.
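NLTK ships a simple shift-reduce parser that can replay this process; a minimal
sketch, assuming NLTK is installed (note that NLTK's implementation is greedy and
may fail to find a parse for some grammars):

from nltk import CFG
from nltk.parse import ShiftReduceParser

grammar = CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'cat' | 'mouse'
V -> 'chased'
""")

# trace=2 prints each shift and reduce action as it happens.
parser = ShiftReduceParser(grammar, trace=2)
for tree in parser.parse("the cat chased the mouse".split()):
    print(tree)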
4.2 Hypergraphs and Chart Parsing:
Hypergraphs and chart parsing are two related concepts used in natural language
processing (NLP) for syntactic parsing.

Hypergraphs represent a generalization of traditional parse trees, allowing for more
complex structures and more efficient parsing algorithms. A hypergraph consists of
a set of nodes (representing words or phrases in the input sentence) and a set of
hyperedges, which connect nodes and represent higher-level structures. A chart, on
the other hand, is a data structure used in chart parsing to efficiently store and
manipulate all possible partial parses of a sentence.
Here is an example of how chart parsing can be used to parse the sentence "the cat
chased the mouse" using a simple grammar:

S -> NP VP
NP -> Det N
VP -> V NP
Det -> the
N -> cat | mouse
V -> chased
1. Initialization: We start by initializing an empty chart with the length of the
input sentence (5 words) and a set of empty cells representing all possible
partial parses.
2. Scanning: We scan each word in the input sentence and add a corresponding
parse to the chart. For example, for the first word "the", we add a parse for the
non-terminal symbol Det (Det -> the). We do this for each word in the
sentence.
3. Predicting: We use the grammar rules to predict possible partial parses for
each span of words in the sentence. For example, we can predict a partial
parse for the span (1, 2) (i.e., the first two words "the cat") by applying the rule
NP -> Det N to the parses for "the" and "cat". We add this partial parse to the
chart cell for the span (1, 2).
4. Scanning again: We scan the input sentence again, this time looking for
matches to predicted partial parses in the chart. For example, if we predicted
a partial parse for the span (1, 2), we look for a parse for the exact same span
in the chart. If we find a match, we can apply a grammar rule to combine the
two partial parses into a larger parse. For example, if we find a parse for (1, 2)
that matches the predicted parse for NP -> Det N, we can combine them to
create a parse for the span (1, 3) and the non-terminal symbol NP.
5. Combining: We continue to combine partial parses in the chart using grammar
rules until we have a complete parse for the entire sentence.
6. Output: The final parse tree for the sentence is represented by the complete
parse in the chart cell for the span (1, 5) and the non-terminal symbol S.
Chart parsing can be more efficient than other parsing algorithms, such as recursive
descent or shift-reduce parsing, because it stores all possible partial parses in the
chart and avoids redundant parsing of the same span multiple times. Hypergraphs
can also be used in chart parsing to represent more complex structures and enable
more efficient parsing algorithms.
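NLTK's chart parser implements this dynamic-programming strategy; a minimal
sketch, assuming NLTK is installed:

from nltk import CFG
from nltk.parse import ChartParser

grammar = CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'cat' | 'mouse'
V -> 'chased'
""")

# The chart stores each partial parse (edge) exactly once, so shared
# substructure is never re-parsed.
parser = ChartParser(grammar)
for tree in parser.parse("the cat chased the mouse".split()):
    print(tree)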
4.3 Minimum Spanning Trees and Dependency Parsing:
Dependency parsing is a type of syntactic parsing that represents the grammatical
structure of a sentence as a directed acyclic graph (DAG). The nodes of the graph
represent the words of the sentence, and the edges represent the syntactic
relationships between the words.
Minimum spanning tree (MST) algorithms are often used for dependency parsing, as
they provide an efficient way to find the most likely parse for a sentence given a set
of syntactic dependencies.
Here's an example of how an MST algorithm can be used for dependency parsing:

Consider the sentence "The cat chased the mouse". We can represent this sentence
as a graph with nodes for each word and edges representing the possible syntactic
dependencies between them.
We can use an MST algorithm to find the most likely parse for this graph. One popular
algorithm for this is the Chu-Liu/Edmonds algorithm:
1. We first remove all self-loops and multiple edges in the graph. This is because
a valid dependency tree must be acyclic and have only one edge between any
two nodes.
2. We then choose a node to be the root of the tree. In this example, we can
choose "chased" to be the root since it is the main verb of the sentence.
3. We then compute the scores for each edge in the graph based on a scoring
function that takes into account the probability of each edge being a valid
dependency. The score function can be based on various linguistic features,
such as part-of-speech tags or word embeddings.
4. We use the MST algorithm to find the tree that maximizes the total score of its
edges. The MST algorithm starts with a set of edges that connect the root
node to each of its immediate dependents, and iteratively adds edges that
connect other nodes to the tree. At each iteration, we select the edge with the
highest score that does not create a cycle in the tree.
5. Once the MST algorithm has constructed the tree, we can assign a label to
each edge in the tree based on the type of dependency it represents (e.g.,
subject, object, etc.).
The resulting dependency tree for the example sentence is shown below:

chased ──subj──► cat
chased ──obj──► mouse
cat ──det──► The
mouse ──det──► the

In this tree, each node represents a word in the sentence, and each edge represents a
syntactic dependency between two words.
Dependency parsing can be useful for many NLP tasks, such as information extraction,
machine translation, and sentiment analysis.

One advantage of dependency parsing is that it captures more fine-grained syntactic
information than phrase-structure parsing, as it represents the relationships between
individual words rather than just the hierarchical structure of phrases. However,
dependency parsing can be more difficult to perform accurately than phrase-structure
parsing, as it requires more sophisticated algorithms and models to capture the
nuances of syntactic dependencies.
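The maximum-spanning-tree step can be sketched with networkx, which provides an
implementation of Edmonds' (Chu-Liu/Edmonds) algorithm; the edge scores below
are made-up numbers standing in for a trained scoring function:

import networkx as nx

# Hypothetical edge scores for "The cat chased the mouse"; in a real parser
# these would come from a learned model over linguistic features.
scores = {
    ("ROOT", "chased"): 10.0,
    ("chased", "cat"): 8.0,
    ("chased", "mouse"): 7.0,
    ("cat", "The"): 5.0,
    ("mouse", "the"): 5.0,
    ("mouse", "cat"): 1.0,   # a low-scoring competing edge
}

G = nx.DiGraph()
for (head, dep), s in scores.items():
    G.add_edge(head, dep, weight=s)

# Find the highest-scoring spanning arborescence (directed spanning tree).
tree = nx.maximum_spanning_arborescence(G)
print(sorted(tree.edges(data="weight")))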
5.Models for Ambiguity Resolution in Parsing:
Ambiguity resolution is an important problem in natural language processing (NLP),
as many sentences can have multiple valid syntactic parses. This means that the
same sentence can be represented by multiple phrase structure trees or dependency
graphs. Resolving ambiguity is crucial for many NLP applications, such as machine
translation, text-to-speech synthesis, and information retrieval.

Here are some common models for ambiguity resolution in parsing:
1. Rule-based models: Rule-based models use hand-crafted grammars and rules
to disambiguate sentences. These rules can be based on linguistic knowledge
or heuristics, and can help resolve ambiguity by preferring certain syntactic
structures over others. For example, a rule-based model might prefer a noun
phrase followed by a verb phrase as the primary syntactic structure for a given
sentence.
2. Statistical models: Statistical models use machine learning algorithms to
learn from large corpora of text and make predictions about the most likely
syntactic structure for a given sentence. These models can be based on
various features, such as part-of-speech tags, word embeddings, or
contextual information. For example, a statistical model might learn to
associate certain word sequences with specific syntactic structures.
3. Hybrid models: Hybrid models combine both rule-based and statistical
approaches to resolve ambiguity. These models can use rules to guide the
parsing process and statistical models to make more fine-grained predictions.
For example, a hybrid model might use a rule-based approach to identify the
main syntactic structures in a sentence, and then use a statistical model to
disambiguate specific substructures.
4. Neural network models: Neural network models use deep learning techniques
to learn from large amounts of text and make predictions about the most
likely syntactic structure for a given sentence. These models can be based on
various neural architectures, such as recurrent neural networks (RNNs) or
transformer models. For example, a neural network model might use an
attention mechanism to learn which words in a sentence are most relevant for
predicting the syntactic structure.
5. Ensemble models: Ensemble models combine the predictions of multiple
parsing models to achieve higher accuracy and robustness. These models
can be based on various techniques, such as voting, weighting, or stacking.
For example, an ensemble model might combine the predictions of a
rule-based model, a statistical model, and a neural network model to improve
the overall accuracy of the parsing system.
Overall, there are many models for ambiguity resolution in parsing, each with its own
strengths and weaknesses. The choice of model depends on the specific application
and the available resources, such as training data and computational power.
5.1 Probabilistic Context-Free Grammars:
Probabilistic context-free grammars (PCFGs) are a popular model for ambiguity
resolution in parsing. PCFGs extend context-free grammars (CFGs) by assigning
probabilities to each production rule, representing the likelihood of generating a
certain symbol given its parent symbol.

PCFGs can be used to compute the probability of a parse tree for a given sentence,
which can then be used to select the most likely parse. The probability of a parse
tree is computed by multiplying the probabilities of its constituent production rules,
from the root symbol down to the leaves. The probability of a sentence is computed
by summing the probabilities of all parse trees that generate the sentence.
Here is an example of a PCFG for the sentence "the cat saw the dog":
S -> NP VP [1.0]
NP -> Det N [0.6]
NP -> N [0.4]
VP -> V NP [0.8]
VP -> V [0.2]
Det -> "the" [0.9]
Det -> "a" [0.1]
N -> "cat" [0.5]
N -> "dog" [0.5]
V -> "saw" [1.0]
In this PCFG, each production rule is annotated with a probability. For example, the
rule NP -> Det N [0.6] has a probability of 0.6, indicating that a noun phrase can be
generated by first generating a determiner, followed by a noun, with a probability of
0.6.
To parse the sentence "the cat saw the dog" using this PCFG, we can use the CKY
algorithm to generate all possible parse trees and compute their probabilities. The
algorithm starts by filling in the table of all possible subtrees for each span of the
sentence, and then combines these subtrees using the production rules of the PCFG.
The final cell in the table represents the probability of the best parse tree for the
entire sentence.
Using the probabilities from the PCFG, the CKY algorithm generates the following
parse tree for the sentence "the cat saw the dog":
(S
  (NP (Det the) (N cat))
  (VP (V saw)
    (NP (Det the) (N dog))))
The probability of this parse tree is computed as follows:

P(S -> NP VP) * P(NP -> Det N) * P(Det -> "the") * P(N -> "cat") * P(VP -> V NP) * P(V ->
"saw") * P(NP -> Det N) * P(Det -> "the") * P(N -> "dog") = 1.0 * 0.6 * 0.9 * 0.5 * 0.8 *
1.0 * 0.6 * 0.9 * 0.5 = 0.05832
Thus, the probability of the best parse tree for the sentence "the cat saw the dog" is
0.05832. This probability can be used to select the most likely parse among all possible
parse trees for the sentence.
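This computation can be reproduced with NLTK, whose ViterbiParser finds the single
most probable tree under a PCFG by dynamic programming; a minimal sketch,
assuming NLTK is installed:

from nltk import PCFG
from nltk.parse import ViterbiParser

# The PCFG above, in NLTK's notation (terminal symbols quoted).
grammar = PCFG.fromstring("""
S -> NP VP [1.0]
NP -> Det N [0.6] | N [0.4]
VP -> V NP [0.8] | V [0.2]
Det -> 'the' [0.9] | 'a' [0.1]
N -> 'cat' [0.5] | 'dog' [0.5]
V -> 'saw' [1.0]
""")

parser = ViterbiParser(grammar)
for tree in parser.parse("the cat saw the dog".split()):
    print(tree)          # the most probable parse
    print(tree.prob())   # its probability, 0.05832 here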
5.2 Generative Models for Parsing:
Generative models for parsing are a family of models that generate a sentence's
parse tree by generating each node in the tree according to a set of probabilistic
rules. One such model is the probabilistic Earley parser.
The Earley parser uses a chart data structure to store all possible parse trees for a
sentence. The parser starts with an empty chart, and then adds new parse trees to
the chart as it progresses through the sentence. The parser consists of three main
stages: prediction, scanning, and completion.
In the prediction stage, the parser generates new items in the chart by applying
grammar rules that can generate non-terminal symbols. For example, if the grammar
has a rule S -> NP VP, the parser would predict the presence of an S symbol in the
current span of the sentence by adding a new item to the chart that indicates that an
S symbol can be generated by an NP symbol followed by a VP symbol.
In the scanning stage, the parser checks whether a word in the sentence can be
assigned to a non-terminal symbol in the chart. For example, if the parser has
predicted an NP symbol in the current span of the sentence, and the word "dog"
appears in that span, the parser would add a new item to the chart that indicates that
the NP symbol can be generated by the word "dog".
In the completion stage, the parser combines items in the chart that have the same
end position and can be combined according to the grammar rules. For example, if
the parser has added an item to the chart that indicates that an NP symbol can be
generated by the word "dog", and another item that indicates that a VP symbol can
be generated by the word "saw" and an NP symbol, the parser would add a new item
to the chart that indicates that an S symbol can be generated by an NP symbol
followed by a VP symbol.
Here is an example of a probabilistic Earley parser applied to the sentence "the cat
saw the dog":
Grammar:
S -> NP VP [1.0]
NP -> Det N [0.6]
NP -> N [0.4]
VP -> V NP [0.8]
VP -> V [0.2]
Det -> "the" [0.9]
Det -> "a" [0.1]
N -> "cat" [0.5]
N -> "dog" [0.5]
V -> "saw" [1.0]

Initial chart:
0: [S -> * NP VP [1.0], 0, 0]
0: [NP -> * Det N [0.6], 0, 0]
0: [NP -> * N [0.4], 0, 0]
0: [VP -> * V NP [0.8], 0, 0]
0: [VP -> * V [0.2], 0, 0]
0: [Det -> * "the" [0.9], 0, 0]
0: [Det -> * "a" [0.1], 0, 0]
0: [N -> * "cat" [0.5], 0, 0]
0: [N -> * "dog" [0.5], 0, 0]
0: [V -> * "saw" [1.0], 0, 0]
Predicting S:
0: [S -> * NP VP [1.0], 0, 0]
1: [NP -> * Det N [0.6], 0, 0]
1: [NP -> * N [0.4], 0, 0]
1: [VP -> * V NP [0.8], 0, 0]
(the trace continues through further prediction, scanning, and completion steps)

5.3 Discriminative Models for Parsing:
Discriminative models for parsing are a family of models that predict a sentence's
parse tree by learning to discriminate between different possible trees. One such
model is the maximum entropy Markov model.
The maximum entropy Markov model (MEMM) is a discriminative model that models
the conditional probability of a parse tree given a sentence. The model is trained on a
corpus of labeled sentences and their corresponding parse trees. During training, the
model learns a set of feature functions that map the current state of the parser (i.e.,
the current span of the sentence and the partial parse tree constructed so far) to a
set of binary features that are indicative of a particular parse tree. The model then
learns the weight of each feature function using maximum likelihood estimation.
During testing, the MEMM uses the learned feature functions and weights to score
each possible parse tree for the input sentence. The model then selects the parse
tree with the highest score as the final parse tree for the sentence.
Here is an example of an MEMM applied to the sentence "the cat saw the dog":

Features:
F1: current word is "the"
F2: current word is "cat"
F3: current word is "saw"
F4: current word is "dog"
F5: current span is "the cat"
F6: current span is "cat saw"
F7: current span is "saw the"
F8: current span is "the dog"
F9: partial parse tree is "S -> NP VP"
Weights:
F1: 1.2
F2: 0.5
F3: 0.9
F4: 1.1
F5: 0.8
F6: 0.6
F7: 0.7
F8: 0.9
F9: 1.5
Possible parse trees and their scores:
S -> NP VP
- NP -> Det N
- - Det -> "the"
- - N -> "cat"
- VP -> V NP
- - V -> "saw"
- - NP -> Det N
- - - Det -> "the"
- - - N -> "dog"Score: 5.7

S -> NP VP
- NP -> N
- - N -> "cat"
- VP -> V NP
- - V -> "saw"
- - NP -> Det N
- - - Det -> "the"
- - - N -> "dog"Score: 4.9

S -> NP VP
- NP -> Det N
- - Det -> "the"
- - N -> "cat"
- VP -> V
- - V -> "saw"
- NP -> Det N
- - Det -> "the"
- - N -> "dog"Score: 3.5

Selected parse tree:
S -> NP VP
- NP -> Det N
- - Det -> "the"
- - N -> "cat"
- VP -> V NP
- - V -> "saw"
- - NP -> Det N
- - - Det -> "the"
- - - N -> "dog"
Score: 5.7
In this example, the MEMM generates a score for each possible parse tree and selects the
parse tree with the highest score as the final parse tree for the sentence.
The selected parse tree corresponds to the correct parse for the sentence.
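The scoring-and-selection step can be illustrated with a toy snippet; the sets of
features activated by each candidate tree below are hypothetical, chosen only to
show the mechanics:

# MEMM-style selection: each candidate parse activates a set of binary
# features, its score is the sum of the learned weights, and the
# highest-scoring candidate wins.
weights = {"F1": 1.2, "F2": 0.5, "F3": 0.9, "F4": 1.1,
           "F5": 0.8, "F6": 0.6, "F7": 0.7, "F8": 0.9, "F9": 1.5}

candidates = {
    "tree_1": ["F1", "F2", "F3", "F4", "F5", "F8", "F9"],  # hypothetical
    "tree_2": ["F2", "F3", "F4", "F8", "F9"],
    "tree_3": ["F1", "F2", "F3", "F5", "F9"],
}

scores = {name: sum(weights[f] for f in feats)
          for name, feats in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "->", best)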
6.Multilingual Issues:
In natural language processing (NLP), a token is a sequence of characters that
represents a single unit of meaning. In other words, it is a word or a piece of a word
that has a specific meaning within a language. The process of splitting a text into
individual tokens is called tokenization.

However, the definition of what constitutes a token can vary depending on the
language being analyzed. This is because different languages have different rules for
how words are constructed, how they are written, and how they are used in context.
For example, in English, words are typically separated by spaces, making it relatively
easy to tokenize a sentence into individual words. However, in some languages, such
as Chinese or Japanese, there are no spaces between words, and the text must be
segmented into individual units of meaning based on other cues, such as syntax or
context.
Furthermore, even within a single language, there can be variation in how words are
spelled or written. For example, in English, words can be spelled with or without
hyphens or apostrophes, and there can be differences in spelling between American
English and British English.
Multilingual issues in tokenization arise because different languages can have
different character sets, which means that the same sequence of characters can
represent different words in different languages. Additionally, some languages have
complex morphology, which means that a single word can have many different forms
that represent different grammatical features or meanings.
To address these issues, NLP researchers have developed multilingual tokenization
techniques that take into account the specific linguistic features of different
languages. These techniques can include using language-specific dictionaries,
models, or rules to identify the boundaries between words or units of meaning in
different languages.
6.1 Tokenization, Case, and Encoding:
Tokenization, case, and encoding are all important aspects of natural language
processing (NLP) that are used to preprocess text data before it can be analyzed by
machine learning algorithms. Here are some examples of each:
1. Tokenization: Tokenization is the process of splitting a text into individual
tokens or words. In English, this is typically done by splitting the text on
whitespace and punctuation marks. For example, the sentence "The quick
brown fox jumps over the lazy dog." would be tokenized into the following list
of words: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."].
2. Case: Case refers to the use of upper and lower case letters in text. In NLP, it
is often important to standardize the case of words to avoid treating the same
word as different simply because it appears in different case. For example,
the words "apple" and "Apple" should be treated as the same word.
3. Encoding: Encoding refers to the process of representing text data in a way
that can be processed by machine learning algorithms. One common
encoding method used in NLP is Unicode, which is a character encoding
standard that can represent a wide range of characters from different
languages.
Here is an example of how tokenization, case, and encoding might be applied to a
sentence of text:

Text: "The quick brown fox jumps over the lazy dog."

Tokenization: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]

Case: ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]
Encoding: [0x74, 0x68, 0x65, 0x20, 0x71, 0x75, 0x69, 0x63, 0x6b, 0x20, 0x62, 0x72,
0x6f, 0x77, 0x6e, 0x20, 0x66, 0x6f, 0x78, 0x20, 0x6a, 0x75, 0x6d, 0x70, 0x73, 0x20,
0x6f, 0x76, 0x65, 0x72, 0x20, 0x74, 0x68, 0x65, 0x20, 0x6c, 0x61, 0x7a, 0x79, 0x20,
0x64, 0x6f, 0x67, 0x2e]
Note that the encoding is represented in hexadecimal to show the underlying bytes
that represent the text.
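A minimal sketch of the three preprocessing steps using only the Python standard
library (the naive tokenizer here simply detaches the final period before splitting on
whitespace):

text = "The quick brown fox jumps over the lazy dog."

tokens = text.replace(".", " .").split()   # tokenization (naive)
lowered = [t.lower() for t in tokens]      # case normalization
encoded = text.encode("utf-8")             # Unicode (UTF-8) byte encoding

print(tokens)
print(lowered)
print(encoded.hex(" "))                    # space-separated hex bytes (Python 3.8+)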
6.2 Word Segmentation:
Word segmentation is one of the most basic tasks in Natural Language Processing
(NLP), and it involves identifying the boundaries between words in a sentence.
However, in some languages, such as Chinese and Japanese, there is no clear
spacing or punctuation between words, which makes word segmentation more
challenging.
In Chinese, for example, a sentence like "我喜欢中文" (which means "I like Chinese")
could be segmented in different ways, such as "我 / 喜欢 / 中文" or "我喜欢 / 中文".
Similarly, in Japanese, a sentence like "私は日本語が好きです" (which also means "I
like Japanese") could be segmented in different ways, such as "私は / 日本語が / 好
きです" or "私は日本語 / が好きです".
Here are some examples of the challenges of word segmentation in different
languages:
● Chinese: In addition to the lack of spacing between words, Chinese also has a
large number of homophones, which are words that sound the same or very
similar but have different meanings. For example, the words "你" (you) and "年"
(year) sound similar in Mandarin, but they are written with different characters.
● Japanese: Japanese also has a large number of homophones, but it also has
different writing systems, including kanji (Chinese characters), hiragana, and
katakana. Kanji can often have multiple readings, which makes word
segmentation more complex.
● Thai: Thai has no spaces between words, and it also has no capitalization or
punctuation. In addition, Thai has a unique script with many consonants that
can be combined with different vowel signs to form words.
● Vietnamese: Vietnamese uses the Latin alphabet, but it also has many
diacritics (accent marks) that can change the meaning of a word. In addition,
Vietnamese words can be formed by combining smaller words, which makes
word segmentation more complex.
To address these challenges, NLP researchers have developed various techniques
for word segmentation, including rule-based approaches, statistical models, and
neural networks. However, word segmentation is still an active area of research,
especially for low-resource languages where large amounts of annotated data are
not available.
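For Chinese, one widely used segmentation library is jieba; a minimal sketch,
assuming the package is installed (pip install jieba):

import jieba

# Segment the example sentence from above into words.
print(jieba.lcut("我喜欢中文"))   # e.g. ['我', '喜欢', '中文']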
6.3 Morphology:
Morphology is the study of the structure of words and how they are formed from
smaller units called morphemes. Morphological analysis is important in many
natural language processing tasks, such as machine translation and speech
recognition, because it helps to identify the underlying structure of words and to
disambiguate their meanings.
Here are some examples of the challenges of morphology in different languages:
● Turkish: Turkish has a rich morphology, with a complex system of affixes that
can be added to words to convey different meanings. For example, the word
"kitap" (book) can be modified with different suffixes to indicate things like
possession, plurality, or tense.
● Arabic: Arabic also has a rich morphology, with a complex system of prefixes,
suffixes, and infixes that can be added to words to convey different meanings.
For example, the root "k-t-b" (meaning "write") can be modified with different
affixes to form words like "kitab" (book) and "kataba" (he wrote).
● Finnish: Finnish has a complex morphology, with a large number of cases,
suffixes, and vowel harmony rules that can affect the form of a word. For
example, the word "käsi" (hand) can be modified with different suffixes to
indicate things like possession, location, or movement.
● Swahili: Swahili has a complex morphology, with a large number of prefixes
and suffixes that can be added to words to convey different meanings. For
example, the word "kutaka" (to want) can be modified with different prefixes
and suffixes to indicate things like tense, negation, or subject agreement.
To address these challenges, NLP researchers have developed various techniques
for morphological analysis, including rule-based approaches, statistical models, and
neural networks. However, morphological analysis is still an active area of research,
especially for low-resource languages where large amounts of annotated data are
not available.
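As a small illustration, NLTK's Snowball stemmers cover several morphologically rich
languages, including Finnish; stemming merely strips common suffixes and is far
cruder than full morphological analysis:

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("finnish")
# "käsissä" is an inflected (inessive plural) form of "käsi" (hand).
print(stemmer.stem("käsissä"))   # prints a truncated stem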