MNLP Unit-2
Syntax Analysis
-Presented By,
Dr D. Teja Santosh
Associate Professor, CSE
CVRCE
Why frame sentences using words … and why one should care about sentence structure
• Framing sentences with words and caring about sentence structure are essential for effective
communication, conveying meaning, and ensuring that your message is understood as intended.
• It is a fundamental aspect of language that contributes to clarity, precision, and the overall impact of your
communication.
Forms of expressing syntactic structure
• Dependency Structure
• Phrase Structure Model
13 Jan 2006
Phrase Structure Grammar (PSG)
Noun Phrases
[NP John]   [NP the student]   [NP the intelligent student]
Noun Phrase
[NP his first five PhD students]
Noun Phrase
[NP [Det The] [Quant five] [AP best] [N students] [PP of my class]]
Verb Phrases
[VP [Aux can] [V sing]]   [VP [Aux can] [V hit] [NP the ball]]
Verb Phrase
[VP [Aux can] [V give] [NP a flower] [PP to Mary]]
Example PSG Grammar
S → NP VP
NP → Det N
VP → V NP
Det → the
N → boy | ball
V → hit
Verb Phrase
[VP [Aux may] [V make] [NP John] [NP the chairman]]
Verb Phrase
[VP [Aux may] [V find] [NP the book] [AP very interesting]]
Prepositional Phrases
[PP [P in] [NP the classroom]]   [PP [P near] [NP the river]]
Adjective Phrases
[AP [A intelligent]]   [AP [Degree very] [A honest]]   [AP [A fond] [PP of sweets]]
Adjective Phrase
• very worried that she might have done badly in the assignment
[AP [Degree very] [A worried] [S' that she might have done badly in the assignment]]
Derivation
The boy hit the ball.
• Sentence
NP + VP (1) S → NP VP
Det + N + VP (2) NP → Det N
Det + N + V + NP (3) VP → V NP
The + N + V + NP (4) Det → the
The + boy + V + NP (5) N → boy
The + boy + hit + NP (6) V → hit
The + boy + hit + Det + N (2) NP → Det N
The + boy + hit + the + N (4) Det → the
The + boy + hit + the + ball (5) N → ball
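The derivation above can be replayed in code. The following is a minimal sketch: the rule table encodes the toy PSG from the previous slide, and `derive` performs a leftmost derivation by depth-first search (the slide's own derivation happens to expand symbols in a slightly different order). The data layout and function names are my own illustration, not a standard library.

```python
# Toy PSG from the slides, as a mapping from nonterminal to alternatives.
RULES = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"]],
    "VP":  [["V", "NP"]],
    "Det": [["the"]],
    "N":   [["boy"], ["ball"]],
    "V":   [["hit"]],
}

def derive(form, target):
    """Leftmost derivation by depth-first search; returns the list of
    sentential forms from `form` down to `target`, or None."""
    if form == target:
        return [form]
    for i, sym in enumerate(form):
        if sym in RULES:                       # leftmost nonterminal
            for rhs in RULES[sym]:
                new = form[:i] + rhs + form[i + 1:]
                if len(new) <= len(target):    # every symbol yields >= 1 word
                    rest = derive(new, target)
                    if rest is not None:
                        return [form] + rest
            return None
    return None                                # all-terminal form, no match

steps = derive(["S"], ["the", "boy", "hit", "the", "ball"])
for step in steps:
    print(" ".join(step))
```

Each printed line is one sentential form, from the start symbol S down to the fully derived sentence.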
PSG Parse Tree
• The boy hit the ball.
(S (NP (Det The) (N boy)) (VP (V hit) (NP (Det the) (N ball))))
PSG Parse Tree
• John wrote those words in the Book of Proverbs.
(S (NP (PropN John)) (VP (V wrote) (NP (NP those words) (PP (P in) (NP (NP the book) (PP (P of) (NP proverbs)))))))
Why do structural ambiguities exist for sentences?
• Structural ambiguities in sentences arise when the arrangement of words
and phrases allows for multiple interpretations or meanings.
• These ambiguities can occur due to various linguistic factors, including syntax, semantics, and pragmatics.
• Syntactic Ambiguity (the arrangement of words or symbols of a given sentence is such that there are
multiple valid ways to parse it, resulting in different syntactic structures.)
• Semantic Ambiguity
• Pragmatic Ambiguity
Two different constituent parses
Sentence-level constructions often involve
• Coordination: combining words, phrases, or clauses of equal syntactic importance using coordinating conjunctions (such as "and," "but," "or") or other coordinating structures.
• Agreement: ensuring that different elements within a sentence match grammatically. This includes agreement in number, person, gender, and sometimes case.
Types of Syntactic Ambiguity
• There are several types of syntactic ambiguity, including:
1.Structural Ambiguity: Ambiguity arising from different possible ways to group
words into phrases or constituents.
• Example: "I saw the man with the telescope." (Does "with the telescope" modify "saw" or
"man"?)
2.Attachment Ambiguity: Ambiguity related to the attachment of phrases to a higher
syntactic structure.
• Example: "I told her I love." (Is "I love" part of what was told, or is it a new statement?)
3.Coordination Ambiguity: Ambiguity in how coordinated elements are grouped
together.
• Example: "I like cooking, reading, and my dog." (Is "my dog" a separate activity or related to
the previous ones?)
4.Prepositional Phrase Attachment Ambiguity: Ambiguity involving the attachment
of prepositional phrases.
• Example: "The old man and woman watched the sunset with a telescope." (Did both the man
and the woman use the telescope?)
Parsing Natural Language
• In natural language processing (NLP), the syntactic analysis of natural language input can vary from very low-level, such as simply tagging each word in the sentence with a part of speech (POS), to very high-level, such as recovering a structural analysis that identifies the dependency between each predicate in the sentence and its explicit and implicit arguments.
1.Tokenization:
1. Definition: Breaking a text into individual words or tokens.
2. Example: "The cat in the hat" -> ['The', 'cat', 'in', 'the', 'hat']
2.Part-of-Speech Tagging (POS Tagging):
1. Definition: Assigning parts of speech (e.g., noun, verb, adjective) to each word in a sentence.
2. Example: "The cat in the hat" -> [('The', 'DT'), ('cat', 'NN'), ('in', 'IN'), ('the', 'DT'), ('hat', 'NN')]
3.Syntax Parsing (Syntactic Parsing):
1. Definition: Determining the syntactic structure of a sentence by analyzing the relationships between
words.
2. Example: "The cat in the hat" -> Tree structure representing how words are connected, e.g., (S (NP
(Det The) (N cat)) (PP (P in) (NP (Det the) (N hat))))
4.Semantic Parsing:
1. Definition: Extracting the meaning or intent from a sentence.
2. Example: "What is the capital of France?" -> {'intent': 'query', 'target': 'capital', 'entity': 'France'}
5.Dependency Parsing:
1. Definition: Analyzing the grammatical structure by identifying relationships between words in a
sentence, represented as a dependency tree.
2. Example: "The cat in the hat" -> Dependency tree with edges indicating grammatical relationships.
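The first two stages of the pipeline above can be sketched in a few lines. This is a toy illustration, not a trained tagger: tokenization is a simple regular expression, and the tag lexicon is a hand-written lookup table I am assuming for the example, using Penn-style tags as in the slide.

```python
import re

# Toy POS lexicon (assumed for illustration); unknown words default to NN.
TAG_LEXICON = {"the": "DT", "cat": "NN", "in": "IN", "hat": "NN"}

def tokenize(text):
    """Split text into alphabetic tokens (a crude whitespace/punct splitter)."""
    return re.findall(r"[A-Za-z]+", text)

def pos_tag(tokens):
    """Look each token up in the lexicon, case-insensitively."""
    return [(tok, TAG_LEXICON.get(tok.lower(), "NN")) for tok in tokens]

tokens = tokenize("The cat in the hat")
tagged = pos_tag(tokens)
print(tokens)   # ['The', 'cat', 'in', 'the', 'hat']
print(tagged)   # [('The', 'DT'), ('cat', 'NN'), ('in', 'IN'), ('the', 'DT'), ('hat', 'NN')]
```

Real systems replace the lookup table with a statistical or neural tagger, but the input/output shapes match the examples in the list above.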
• The major bottleneck in parsing natural language is the fact that ambiguity is so pervasive.
• In syntactic parsing, ambiguity is a particularly difficult problem because the most plausible analysis
has to be chosen from an exponentially large number of alternative analyses.
Example Sentences:
• He wanted to go for a drive in the country.
• The cat who lives dangerously had nine lives.
• Beyond the basic level, the operations of the three products vary widely. The operations of the products vary.
• Open borders imply increasing racial fragmentation in European countries.
• Parsing the sentence with the CFG rules gives us two possible derivations for this sentence.
• In one parse, pockets are a kind of currency that can be used to buy a shirt; in the other parse, which is the more plausible one, John is purchasing a kind of shirt that has pockets.
https://parts-of-speech.info/
sentence = "natural language processing"
N -> N N
N -> 'natural' | 'language' | 'processing' | 'book'
• Note that the ambiguity in the syntactic analysis reflects a real ambiguity: is it the processing of natural language, or language processing that is natural?
• Any system of writing down syntactic rules should represent this ambiguity.
• By using the recursive rule three times, we get five parses for "natural language processing book", and the count grows quickly for longer and longer input noun phrases:
• using the recursive rule four times, we get 14 parses;
• using it five times, we get 42 parses;
• using it six times, we get 132 parses.
• In fact, for CFGs it can be proved that the number of parses obtained by using the recursive rule n times is the nth Catalan number.
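These counts follow from the fact that the parses of an n-word compound under the single rule N → N N are exactly the binary bracketings of n leaves. A small memoized recursion over split points (a sketch of the standard Catalan recurrence, not a full parser) reproduces the numbers quoted above:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def parse_count(n):
    """Number of binary parse trees over a span of n words under N -> N N.
    This equals the Catalan number C(n-1)."""
    if n == 1:
        return 1
    # split the span into a left part of k words and a right part of n - k
    return sum(parse_count(k) * parse_count(n - k) for k in range(1, n))

for n in range(3, 8):
    print(n, "words:", parse_count(n), "parses")
# 4 words ("natural language processing book") -> 5 parses, then 14, 42, 132
```

A chart (CYK) parser computes the same quantities span by span, which is why the chart stays polynomial even though the number of trees is exponential.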
Syntactically ambiguous sentences in Turkish, Korean and Chinese languages
Turkish:
1."Kediyi gördüm ağaç altında." (I saw the cat under the tree.)
• Ambiguity: It's unclear whether the speaker saw the cat under the tree or saw the cat
and it was under the tree.
2."Çocuğa bisikleti verdi kadın." (The woman gave the bike to the child.)
• Ambiguity: It's unclear whether the woman gave the bike to the child or the child
gave the bike to the woman.
3."Kitabı okumadan eve geldim." (I came home without reading the book.)
• Ambiguity: It's unclear whether the speaker read the book before coming home or
hasn't read it at all.
• A treebank is simply a collection of sentences (also called a corpus of text), where each sentence is paired with a syntactic analysis.
• The syntactic analysis for each sentence has been judged by a human expert as the most plausible analysis for that sentence.
• A lot of care is taken during the human annotation process to ensure that a consistent treatment is given to similar constructions across the corpus.
• Since each sentence in a treebank has been given its most plausible syntactic analysis, supervised machine learning methods can be used to learn a scoring function over all possible syntax analyses.
• Two main approaches to syntax analysis are used to construct treebanks: dependency graphs and
phrase structure trees.
• These two representations are very closely related to each other, and under some assumptions, one
representation can be converted to another.
• Dependency analysis is typically favored for languages such as Czech and Turkish, which have free(er) word order, where the arguments of a predicate are often seen in different orderings in the sentence.
• Phrase structure analysis, by contrast, is often used to provide additional information about long-distance dependencies, mostly for languages like English and French, where the word order is less flexible.
Representation of Syntactic Structure
• The main philosophy behind dependency graphs is to connect a word—the head of a phrase—with the
dependents in that phrase.
• The notation connects a head with its dependent using a directed (hence asymmetric) connection.
• Dependency graphs, just like phrase structure trees, are a representation that is consistent with many different linguistic frameworks.
• The main difference between dependency graphs and phrase structure trees is that dependency analyses typically make minimal assumptions about syntactic structure and avoid any annotation of hidden structure, such as, for example, using empty elements as placeholders to represent missing or displaced arguments of predicates, or any unnecessary hierarchical structure.
• The words in the input sentence are treated as the only vertices in the graph, which are linked together by
directed arcs representing syntactic dependencies.
Example
• example sentence: "The cat is sleeping."
• In this sentence, we can identify the following dependencies:
1.cat -> The: The word "The" depends on "cat" as its head. The relationship is labeled with the grammatical role, in this case "det" (indicating that "The" is a determiner for "cat").
2.sleeping -> cat: The word "cat" depends on "sleeping" as its head. This relationship is labeled "nsubj" (nominal subject).
3.sleeping -> is: The word "is" depends on "sleeping" as its head. This relationship is labeled "aux" (auxiliary).
So, in this example, the syntactic head is "cat" for "The", and "sleeping" for both "cat" and "is"; "sleeping" itself is the root of the sentence.
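This analysis can be written down directly as (head, dependent, label) triples. The sketch below uses Universal Dependencies-style labels, in which the main verb "sleeping" is the root; the encoding itself (a synthetic ROOT node at index 0, tuples for arcs) is my own illustration:

```python
# "The cat is sleeping." with 1-based word indices; 0 is a synthetic ROOT.
WORDS = ["ROOT", "The", "cat", "is", "sleeping"]
ARCS = [
    (0, 4, "root"),    # "sleeping" is the root predicate
    (4, 2, "nsubj"),   # "cat" is the subject of "sleeping"
    (2, 1, "det"),     # "The" is the determiner of "cat"
    (4, 3, "aux"),     # "is" is an auxiliary of "sleeping"
]

def head_of(i):
    """Return the index of word i's head (0 for the root word)."""
    return next(h for h, d, _ in ARCS if d == i)

# Map each dependent word to its head word.
print({WORDS[d]: WORDS[h] for h, d, _ in ARCS})
```

Every word has exactly one head, which is what makes the structure a tree rather than a general graph.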
displaCy Dependency Visualizer · Explosion
Projectivity
• A projective dependency tree is one where, if we put the words in a linear order based on the sentence with the root symbol in the first position, the dependency arcs can be drawn above the words without any arcs crossing.
• Another way to state projectivity is to say that for each word in the sentence, its descendants form a contiguous substring of the sentence.
https://demos.explosion.ai/displacy
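The definition above can be checked mechanically: a tree is projective exactly when no two dependency arcs cross if drawn above the sentence. A pure-Python sketch, where `heads[i]` gives the head of word i+1 and 0 is the artificial root (this encoding is an assumption for the example):

```python
def is_projective(heads):
    """heads[i] is the head of word i+1 (0 = root). True iff no arcs cross."""
    # Normalize each arc to (left endpoint, right endpoint).
    arcs = [(min(h, i + 1), max(h, i + 1)) for i, h in enumerate(heads)]
    for a, b in arcs:
        for c, d in arcs:
            # Two arcs cross when exactly one endpoint of one arc lies
            # strictly inside the span of the other.
            if a < c < b < d:
                return False
    return True

# "The cat is sleeping": The->cat, cat->sleeping, is->sleeping, sleeping->root
print(is_projective([2, 4, 4, 0]))   # True  (projective)
# A tree with crossing arcs (word 1 attaches over the root's arc):
print(is_projective([3, 0, 2, 2]))   # False (non-projective)
```

The quadratic pair check is fine for illustration; linear-time tests exist but obscure the definition.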
Converting such a dependency tree to a CFG with the
asterisk notation gives us two options. Either we can
capture that X3 depends on X2 but fail to capture that
X1 depends on X3:
or we can capture the fact that X1 depends on X3 but
fail to capture that X3 depends on X2:
• A phrase structure syntax analysis of a sentence derives from the traditional sentence
diagrams that partition a sentence into constituents, and larger constituents are formed by
merging smaller ones.
• Phrase structure analyses also typically incorporate ideas from generative grammar (from linguistics) to deal with displaced constituents or apparent long-distance relationships between heads and constituents.
• The tree below marks the subject and the predicate (what the subject is doing, or the action of the sentence).
In this tree:
•The subject marker (↑) still points to the subject of the sentence, which is the
noun phrase "(Det) (N)" representing "The cat."
•The predicate marker (↓) points to the predicate of the sentence, which is the
verb phrase "(V) (V)" representing "is sleeping."
•Below the predicate node, there is a branch representing the predicate-argument
structure.
•The first argument is a noun "(N)" representing the subject "cat."
•The second argument is an adverb "(Adv)" representing the adverbial modifier
"sleeping."
Representing a phrase structure tree in predicate-argument structure involves annotating the tree nodes with semantic
roles and relationships between the verb (predicate) and its arguments. Here are the steps to represent a phrase
structure tree in predicate-argument structure:
4. Identify Arguments:
Identify the arguments of the verb in the VP. Arguments are typically noun phrases (NP) or other phrases that fulfill
specific roles in relation to the verb (e.g., subject, object).
5. Annotate Arguments with Semantic Roles:
Annotate each argument with its semantic role. Common roles include "agent" for the entity performing the action,
"theme" for the entity affected by the action, etc.
6. Connect Verb to Arguments:
Establish connections between the main verb and its arguments in the VP. This step involves indicating which argument
fills each specific role associated with the verb.
7. Update Tree Labels:
Update the labels of the tree nodes to reflect the semantic roles and relationships. This might involve adding labels like
"ARG0" for the first argument, "ARG1" for the second argument, and so on.
8. Optional: Add Adverbial Modifiers:
If there are adverbial modifiers in the sentence, such as adverbs or prepositional phrases, annotate them with their
semantic roles and connect them to the appropriate nodes in the tree.
9. Visualize the Predicate-Argument Structure:
Optionally, visualize the modified tree with semantic roles and relationships to represent the predicate-argument
structure.
10. Validate and Refine:
Review the representation to ensure that the semantic roles are accurately assigned, and the relationships between the
verb and its arguments are appropriately captured. Refine the representation as needed.
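Steps 4-7 above can be sketched on a toy tree. The following is an illustrative assumption, not a real semantic role labeler: the tree is a nested tuple for "The cat chased the mouse", and the role mapping (subject NP to ARG0, object NP to ARG1) is hard-coded in PropBank style.

```python
# Phrase structure tree as nested tuples: (label, child, child, ...).
tree = ("S",
        ("NP", ("Det", "The"), ("N", "cat")),            # subject
        ("VP", ("V", "chased"),
               ("NP", ("Det", "the"), ("N", "mouse"))))  # object inside VP

def predicate_arguments(s_node):
    """Map the verb to ARG0 (subject NP) and ARG1 (object NP in the VP)."""
    _, subject_np, vp = s_node
    verb = vp[1][1]          # the V child's word
    obj_np = vp[2]
    def words(node):
        # A leaf stores its word as node[1]; otherwise join the children.
        return node[1] if isinstance(node[1], str) else " ".join(
            words(child) for child in node[1:])
    return {"predicate": verb,
            "ARG0": words(subject_np),
            "ARG1": words(obj_np)}

print(predicate_arguments(tree))
# {'predicate': 'chased', 'ARG0': 'The cat', 'ARG1': 'the mouse'}
```

A real system would identify the argument spans and roles with a trained model rather than by tree position.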
Introduction to Parsing
• Parsing is the process of examining the grammatical structure and relationships inside a
given sentence or text in natural language processing (NLP). It involves analyzing the text to
determine the roles of specific words, such as nouns, verbs, and adjectives, as well as their
interrelationships.
• This analysis produces a structured representation of the text, allowing NLP systems to understand how words in a phrase connect to one another. Parsers expose the structure of a sentence by constructing parse trees or dependency trees that illustrate the hierarchical and syntactic relationships between words.
• This essential NLP stage is crucial for a variety of language understanding tasks, which allow
machines to extract meaning, provide coherent answers, and execute tasks such as machine
translation, sentiment analysis, and information extraction.
Strengths and Limitations of Rule-based parsers and Data-driven parsers
• Top-down parsing
Top-down parsing attempts to derive the sentence from the start symbol; the parse tree is created from the top down.
• Bottom-up parsing
Bottom-up parsing begins with the words of input and attempts to create
trees from the words up, again by applying grammar rules one at a time.
Parsing Algorithms
Shift Reduce Parsing (Bottom-Up Parsing)
• Shift-Reduce parsing is a common technique used in Natural Language Processing (NLP)
for syntactic parsing.
• Shift-Reduce parsing operates by successively shifting input words onto a stack and then
reducing them according to a set of predefined rules until a parse tree or dependency
structure is formed.
• It's particularly well-suited for dependency parsing, which is concerned with identifying
the syntactic relationships (dependencies) between words in a sentence.
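The shift/reduce cycle can be sketched over the toy PSG from earlier in the unit. In this sketch the parser reduces greedily whenever the top of the stack matches a rule's right-hand side; that greedy strategy is an assumption for illustration (real parsers use an oracle, scores, or backtracking to resolve conflicts).

```python
# Toy PSG as (lhs, rhs) pairs; terminal rules first so words reduce eagerly.
RULES = [
    ("Det", ("the",)), ("N", ("boy",)), ("N", ("ball",)), ("V", ("hit",)),
    ("NP", ("Det", "N")), ("VP", ("V", "NP")), ("S", ("NP", "VP")),
]

def shift_reduce(words):
    """Greedy shift-reduce parse; returns (tree, action trace) or (None, trace)."""
    stack, buf, actions = [], list(words), []
    while buf or len(stack) > 1 or (stack and stack[0][0] != "S"):
        for lhs, rhs in RULES:
            n = len(rhs)
            # Reduce if the top n stack symbols match a rule's RHS.
            if [s[0] for s in stack[-n:]] == list(rhs) and len(stack) >= n:
                stack[-n:] = [(lhs, stack[-n:])]   # replace with new subtree
                actions.append(("reduce", lhs))
                break
        else:
            if not buf:
                return None, actions               # stuck: cannot shift or reduce
            stack.append((buf.pop(0), None))       # shift the next word
            actions.append(("shift", stack[-1][0]))
    return stack[0], actions

tree, actions = shift_reduce(["the", "boy", "hit", "the", "ball"])
print(tree[0])          # S
print(actions[:4])
```

For this unambiguous sentence the greedy strategy succeeds; for an ambiguous grammar, different conflict resolutions would yield different single trees.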
Q) Does a shift-reduce parser generate multiple parse trees for an ambiguous sentence?
A) A basic shift-reduce parser is deterministic: it commits to a single sequence of shift and reduce actions, so it produces at most one parse tree. Ambiguity surfaces as shift-reduce or reduce-reduce conflicts, which the parser must resolve through a fixed strategy, backtracking, or scoring.
Hypergraphs:
https://www.youtube.com/watch?v=LDX9qGVa2l0
Q) Does a chart parser generate multiple parse trees for an ambiguous sentence?
A) Yes, chart parsers have the capability to generate multiple parse trees for an ambiguous
sentence. Chart parsing algorithms typically use dynamic programming and construct a
parse chart that records partial parse results. In the presence of ambiguity, the parser
explores different possibilities and may produce multiple valid parse trees corresponding to
different syntactic interpretations of the input sentence.
Earley Parsing Example
https://www.youtube.com/watch?v=9GIgYd1OWfQ
CKY or CYK Algorithm
https://www.youtube.com/watch?v=cpeYw-hWtSc
Q) Does a CYK parser generate multiple parse trees for an ambiguous sentence?
A)
• CYK parsers can potentially generate multiple parse trees for an ambiguous sentence.
• When a sentence is ambiguous, meaning it has more than one valid syntactic interpretation, the CYK
parser explores various possibilities and can yield multiple parse trees.
• The decision on whether to generate and output all parse trees or just one often depends on the specific
implementation and requirements of the parser.
CKY PARSING DEMO
http://lxmls.it.pt/2015/cky.html
MULTIPLE PARSE TREES GENERATED BY CKY PARSER
Minimum Spanning Trees and Dependency Parsing
• Dependency parsing with minimum spanning trees typically aims to produce a single,
projective parse tree for a given sentence.
• However, dependency parsing based on minimum spanning trees might still face
challenges when dealing with ambiguous sentences.
• In practice, when ambiguity exists, dependency parsers often make a choice based on
heuristics, statistical models, or other criteria. The chosen parse may represent the most
likely or most frequently observed structure based on the training data or other linguistic
knowledge.
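The score-based choice described above can be sketched in miniature. MST parsing proper uses the Chu-Liu/Edmonds algorithm; the toy below shows only the greedy core step of picking the highest-scoring head for each word, under an arc score matrix I am assuming for illustration, and the chosen scores happen to yield a well-formed (cycle-free) tree.

```python
# Assumed arc scores for "ROOT The cat sleeps" (indices 0..3):
# scores[h][d] is the score of attaching dependent d to head h.
scores = {
    0: {1: 0.1, 2: 0.3, 3: 0.9},   # ROOT strongly prefers the verb
    2: {1: 0.8, 3: 0.2},           # "cat" is a good head for "The"
    3: {1: 0.2, 2: 0.7},           # "sleeps" is a good head for "cat"
}

def greedy_heads(scores, n_words):
    """For each word, pick the head with the highest arc score."""
    heads = {}
    for d in range(1, n_words + 1):
        heads[d] = max((h for h in scores if d in scores[h]),
                       key=lambda h: scores[h][d])
    return heads

heads = greedy_heads(scores, 3)
print(heads)   # {1: 2, 2: 3, 3: 0}
```

When greedy selection creates a cycle, Chu-Liu/Edmonds contracts the cycle and re-scores; that repair step is what guarantees a true maximum spanning tree.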
Models for Ambiguity Resolution in Parsing
https://m.youtube.com/watch?v=DjJYKmAuAJ0&pp=ygUMcGNmZyBleGFtcGxl
NOTE: Stanford NLP Parser (shown in slide 63) operates based on statistical and
probabilistic models (PCFG). This parser provides a single "best" parse tree when
PCFG language model is used.
Why the Stanford Parser parses ambiguous sentences
• A global linear model for parsing refers to a parsing approach that uses a linear model to globally score
entire structures (such as parse trees) for a given sentence.
• The goal is to find the structure that maximizes or minimizes a global scoring function. This is in contrast to
local models, which score individual decisions in a parsing process.
• In the context of dependency parsing, a global linear model assigns a score to a complete dependency
structure for a sentence.
• The structure can be represented as a set of labeled dependencies between words in the sentence. The
model considers the entire structure at once and assigns scores based on various features and weights.
Components of a Global Linear Model for Parsing
1.Feature Representation:
1. Define a set of features that capture relevant information about the input sentence and its
potential dependency structure. Features can include word-level information, part-of-speech
tags, syntactic context, and more.
2.Parameterized Scoring Function:
1. Define a scoring function that combines the feature values with associated weights. The
scoring function is often a linear combination of the features and weights, giving a global
score to the entire dependency structure.
3.Optimization Objective:
1. Formulate an optimization problem to find the best-scoring dependency structure. The
objective can involve maximizing the global score for correct structures or minimizing it for
incorrect ones.
4.Learning Weights:
1. Train the model by learning the weights associated with the features. This is typically done
using labeled training data, where correct and incorrect structures are provided for each
sentence.
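The components above combine into score(structure) = w · f(structure). The sketch below makes this concrete for dependency structures with one assumed feature template (head POS tag, dependent POS tag); the tags, weights, and candidate structures are illustrative, not learned.

```python
def features(arcs, tags):
    """Count (head tag, dependent tag) features over all arcs in a structure."""
    counts = {}
    for h, d in arcs:
        key = (tags[h], tags[d])
        counts[key] = counts.get(key, 0) + 1
    return counts

def score(arcs, tags, weights):
    """Global linear score: dot product of feature counts and weights."""
    return sum(weights.get(k, 0.0) * v for k, v in features(arcs, tags).items())

# "ROOT The cat sleeps" with assumed tags and hand-set weights.
tags = {0: "ROOT", 1: "DT", 2: "NN", 3: "VB"}
weights = {("ROOT", "VB"): 2.0, ("VB", "NN"): 1.5,
           ("NN", "DT"): 1.0, ("ROOT", "NN"): 0.2}

good = [(0, 3), (3, 2), (2, 1)]   # verb-rooted analysis
bad  = [(0, 2), (2, 1), (2, 3)]   # noun-rooted analysis
print(score(good, tags, weights), ">", score(bad, tags, weights))
```

Training (component 4) would adjust `weights` so that, for every treebank sentence, the correct structure outscores all alternatives; decoding (component 3) searches for the arc set with the highest total score.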
POS Tagging of Chinese Sentence and Chinese sentence parse
tree
Lead to next unit…