Syntax_JB_slides
αAβ → αγβ
where:
A is a non-terminal.
α, β, γ are strings of terminals and/or non-terminals.
γ is non-empty (|γ| ≥ 1), so the right-hand side αγβ is never shorter than the left-hand side αAβ.
A designated start symbol.
S → aSBC
S → aBC
CB → DB
DB → DC
DC → BC
aB → ab
bB → bb
bC → bc
cC → cc
S ⇒ aSBC
⇒ aaBCBC
⇒* aaBBCC (using CB → DB, DB → DC, DC → BC)
⇒* aabbcc (using aB → ab, bB → bb, bC → bc, cC → cc)
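As a quick illustration of the language this grammar generates, here is a minimal Python sketch (an added example, not part of the grammar itself) that tests membership in {aⁿbⁿcⁿ : n ≥ 1}:

def is_anbncn(s: str) -> bool:
    # Membership test for {a^n b^n c^n : n >= 1}, the language generated above.
    n = len(s) // 3
    if n == 0 or len(s) != 3 * n:
        return False
    return s == "a" * n + "b" * n + "c" * n

# The derivation above produces "aabbcc" (n = 2):
assert is_anbncn("aabbcc")
assert not is_anbncn("aabbc")   # unbalanced counts are rejected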
A → γ, where A is a single non-terminal and γ is any string of terminals and/or non-terminals (the form of a context-free production).
S → aSb
S → ab
S ⇒ aSb
⇒ aaSbb
⇒ aaabbb
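A small illustrative sketch that replays this derivation for any n, applying S → aSb repeatedly and finishing with S → ab:

def derive_anbn(n: int) -> list[str]:
    # Return the sentential forms in the derivation of a^n b^n (n >= 1).
    forms, current = ["S"], "S"
    for _ in range(n - 1):                      # apply S -> aSb  (n - 1 times)
        current = current.replace("S", "aSb", 1)
        forms.append(current)
    current = current.replace("S", "ab", 1)     # finish with S -> ab
    forms.append(current)
    return forms

print(derive_anbn(3))   # ['S', 'aSb', 'aaSbb', 'aaabbb']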
The simplest form of grammar, used in Finite State Machines (FSM) and Regular
Expressions.
Production rules:
A → aB (Right Linear)
A → Ba (Left Linear)
A → a (Terminal production)
Example: S → 0S | 1S | 0 | 1
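Because the grammar is regular, the language it generates (every non-empty binary string) can equally be recognized by a regular expression; a minimal Python sketch:

import re

# S -> 0S | 1S | 0 | 1 generates every non-empty string over {0, 1}.
binary = re.compile(r"^[01]+$")

print(bool(binary.match("01101")))   # True
print(bool(binary.match("01a1")))    # False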
7. Preposition (IN)
IN (Preposition/Subordinating Conjunction): in, on, at, because, although.
8. Conjunction (CC, IN)
CC (Coordinating Conjunction): and, but, or, yet.
IN (Subordinating Conjunction): because, although, since.
9. Modal Verbs (MD)
MD (Modal Verbs): can, could, shall, should, will, would, must, might.
10. Particles (RP)
RP (Particle): up, off, out, over (as in look up, take off).
Syntactic Analysis (also called parsing) is the process of analyzing the structure of a sentence
based on grammatical rules. It determines how words are arranged and related to each other
to form meaningful sentences.
Syntactic analysis is the process of checking whether a sentence follows the grammatical
rules of a language.
It involves breaking down a sentence into its components (phrases, clauses, words) and
identifying their roles.
It builds a tree-like structure (Parse Tree) to represent sentence structure.
Parsing in natural language processing (NLP) is the process of analyzing the grammatical
structure of a sentence to determine its meaning and organization.
It helps computers understand and process human language more accurately by
identifying relationships between words and phrases.
Text-to-Speech (TTS) Systems
One of the important applications of parsing is in text-to-speech (TTS) systems. When
converting written text into spoken words, a TTS system must ensure that the output
sounds natural, just as a native speaker would pronounce it. Consider the sentences:
He wanted to go for a drive-in movie.
He wanted to go for a drive in the country.
In spoken language, there is a natural pause between drive and in in the second sentence,
whereas in the first sentence, the words are spoken together as one unit.
Parsing helps identify such structural differences, ensuring correct intonation in TTS systems.
Another challenge in NLP is part-of-speech (POS) tagging, which assigns the correct
grammatical category (noun, verb, adjective, etc.) to each word in a sentence. For example, in
the sentence:
The cat who lives dangerously had nine lives.
Here, lives appears twice but has different meanings: in who lives dangerously, lives is a verb,
while in had nine lives, lives is a noun. A TTS system must correctly identify these roles to
produce the right pronunciation and rhythm.
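To see such tagging decisions in practice, the following minimal sketch runs NLTK's off-the-shelf tagger on this sentence (it assumes the nltk package is installed and its tokenizer and tagger models have been downloaded):

import nltk

# Assumes prior downloads: nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")
sentence = "The cat who lives dangerously had nine lives."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# Expected: the first "lives" is tagged as a verb (VBZ) and the second as a plural noun (NNS).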
Parsing is also essential in text summarization, where long documents need to be condensed
into a shorter, meaningful summary. For instance, given the sentence:
Beyond the basic level, the operations of the three products vary widely.
A summarization system may reduce this to:
The operations of the products vary.
The parse tree systematically breaks down the sentence into its grammatical components,
which include noun phrases (NP), verb phrases (VP), prepositional phrases (PP), and other
syntactic units.
Error Correction in Text – Identifying and correcting grammatical, spelling, and syntactic
errors in written text using rule-based, statistical, or deep learning techniques to enhance
clarity and correctness.
Dialogue Systems – Improving chatbot and virtual assistant responses by analyzing
syntactic structures to enhance user interactions and contextual understanding.
Knowledge Acquisition – Extracting semantic relationships between concepts (e.g., identifying "dog is-a animal") to support automated reasoning and ontology building.
Text-to-Speech Systems – Assisting speech synthesis models in generating more natural
and grammatically correct spoken language by analyzing sentence structure and
intonation patterns.
Writing a CFG for the syntactic analysis of natural language is problematic because, unlike a programming language, a natural language is far too complex to list all of its syntactic rules in terms of a CFG.
A simple list of rules is not sufficient to capture the interactions between different components of the grammar.
Listing all possible syntactic constructions in a language is a difficult task.
In addition, it is difficult to list all the grammar rules in which a particular word can be a participant.
This is known as the knowledge acquisition problem.
Apart from this knowledge acquisition problem, there is another problem: the rules interact with each other in many combinatorial ways. Consider a simple CFG that provides a syntactic analysis of noun phrases as a binary branching tree:
N -> N N
N -> 'natural' | 'language' | 'processing' | 'book'
Recursive rules produce ambiguity. For the phrase natural language processing, is it:
a "processing of natural language"? (correct interpretation)
a "natural way to do language processing"? (incorrect interpretation)
The number of distinct binary-branching parse trees over n + 1 words is the Catalan number Cₙ, given by the recurrence
Cₙ = ∑ᵢ₌₀ⁿ⁻¹ Cᵢ · Cₙ₋₁₋ᵢ
which grows exponentially, so even short noun compounds admit many analyses.
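A short sketch (added for illustration) that evaluates this recurrence and shows how quickly the number of binary parse trees grows:

def catalan(n: int) -> int:
    # C_n = sum over i of C_i * C_{n-1-i}: number of binary trees over n + 1 leaves.
    c = [1] * (n + 1)
    for k in range(1, n + 1):
        c[k] = sum(c[i] * c[k - 1 - i] for i in range(k))
    return c[n]

# 3 leaves ("natural language processing")      -> C_2 = 2 parse trees
# 4 leaves ("natural language processing book") -> C_3 = 5 parse trees
print([catalan(n) for n in range(6)])   # [1, 1, 2, 5, 14, 42]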
Problem-1 : Finding the underlying grammar for syntactic analysis and recursive rules
produce ambiguity.
Problem-2 : not only do we need to know the syntactic rules for a particular language,
but we also need to know which analysis is the most plausible for a given input sentence.
Solution- Treebanks : The construction of a treebank is a data-driven approach to syntax
analysis that allows us to address both of these knowledge acquisition bottlenecks in one
stroke.
Penn Treebank (PTB) – An English constituency treebank used in many NLP benchmarks. It uses bracketed notation, e.g.:
(S
  (NP (NNP John))
  (VP (VBZ loves)
    (NP (NNP Books))))
Universal Dependencies (UD) – A multilingual dependency treebank covering over 100 languages.
TIGER Treebank – A German treebank based on dependency and constituency annotations.
NEGRA Corpus – A treebank for German with both constituency and dependency structures.
Figure: the corresponding constituency tree for "John loves Books", with S dominating an NP (NNP) and a VP (VBZ, NP).
Phrase structure trees, also known as constituency trees, are hierarchical representations of sentence structure based on
constituents or phrases.
This approach is grounded in Chomskyan generative grammar, where sentences are built from recursively nested phrases.
Sentences are divided into nested phrases, which are labeled by syntactic categories such as noun phrases (NP),
verb phrases (VP), and prepositional phrases (PP).
The structure is hierarchical, meaning that each phrase can contain sub-phrases within it.
Phrase structure trees capture constituent relationships, helping in identifying which words form a meaningful unit
together.
Phrase structure trees are widely used in languages like English and French, where word order is relatively fixed. In
these languages, word placement plays a crucial role in conveying meaning (e.g., The boy eats an apple vs. An
apple eats the boy changes the meaning completely).
Since phrases are explicitly marked, phrase structure trees help in identifying subjects, objects, and verb phrases,
aiding in semantic role labeling.
Penn Treebank (PTB) uses phrase-structure annotations, making it one of the most influential resources for
training statistical and neural parsers.
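For illustration, a PTB-style bracketed string such as the one above can be read into a tree object with NLTK; this is a minimal sketch assuming the nltk package is available:

from nltk import Tree

# Bracketed (PTB-style) notation for "John loves Books", as in the earlier example.
t = Tree.fromstring("(S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Books))))")

t.pretty_print()                  # draws the constituency tree as ASCII art
print(t.label())                  # 'S'
print([st.label() for st in t])   # top-level constituents: ['NP', 'VP']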
Dependency graphs (or dependency trees) provide an alternative way to represent syntax, focusing on word-to-word
relationships instead of hierarchical phrase structures. This approach is based on dependency grammar, which directly
links words with syntactic dependencies.
Each word in a sentence is connected to another word based on grammatical relationships.
The structure is not hierarchical like phrase structure trees; instead, it forms a directed graph with words as nodes
and dependencies as edges.
The main verb is usually the root of the tree, with other words connected as dependents.
Better suited for free word order languages, such as Czech, Turkish, Russian, where word placement is flexible, but
syntactic relations remain clear.
Since dependency trees do not require extra phrase labels, they are more compact and faster to process.
Many modern dependency parsers (e.g., Stanford Parser, spaCy, UDPipe) work efficiently using dependency annotations.
Widely used in multilingual NLP:
The Universal Dependencies (UD) Project standardizes dependency treebanks across 100+ languages.
Dependency trees work well for low-resource languages, where phrase structure rules are harder to define.
Dependency graphs connect a word (the head of a phrase) with its dependents using a
directed (asymmetric) connection.
The head-dependent relationship could be either
semantic (head-modifier):
The tall boy runs fast. (adverb 'fast' modifying the head 'runs'; adjective 'tall' modifying 'boy')
She sings beautifully. (adverb 'beautifully' modifying the verb 'sings')
syntactic (head-specifier):
The boy runs. (Determiner specifying noun)
She has eaten dinner. (Auxiliary verb supporting main verb)
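The sketch below (assuming spaCy and its small English model en_core_web_sm are installed) prints each word with its head and dependency label, making these head-dependent arcs explicit:

import spacy

nlp = spacy.load("en_core_web_sm")    # assumes this small English model is installed
doc = nlp("The tall boy runs fast.")

for token in doc:
    # each word points to its head via a labeled dependency relation
    print(f"{token.text:6} --{token.dep_:>7}--> {token.head.text}")
# Typical output: 'tall' attaches to 'boy' (amod), 'fast' to 'runs' (advmod),
# and 'runs' is the root of the sentence.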
Dependency graphs are a fundamental way to represent the syntactic structure of sentences.
They make minimal assumptions about syntactic structure and avoid annotating hidden structures, such as empty elements used as placeholders to represent missing or displaced arguments of predicates, or unnecessary hierarchical structure.
A sentence may contain an extraposition to the right of a noun phrase modifier, which requires a crossing dependency, making the analysis non-projective.
English has very few cases in a treebank that need such a non-projective analysis. In other languages, such as Czech and Turkish, the number of non-projective dependencies can be much higher.
Projectivity and CFG
If a dependency tree can be converted into an equivalent CFG, then it must be projective.
Non-Projective Structures Cannot Be Fully Captured in CFGs
Non-projective dependency trees lead to inconsistencies when converted to CFGs; in fact, there is no CFG that can capture a non-projective dependency.
Any conversion attempt either misses dependencies or creates crossing arcs.
Non-projectivity means that a nonterminal’s descendants do not form a contiguous
substring of the sentence.
CFGs require contiguity, making them incompatible with non-projective dependencies.
In a CFG converted from a dependency tree, we have only the following three types of rules: one type of rule to introduce the terminal symbols, and two rules where Y is dependent on X or vice versa. The head word is indicated by an asterisk (*).
Z -> X* Y
Z -> X Y*
A -> a*
Projectivity
Projectivity: for each word in the sentence, its descendants form a contiguous substring of the sentence.
Non-projectivity: there is a word in the sentence whose descendants do not form a contiguous substring of the sentence.
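A minimal sketch of this contiguity test, assuming the dependency tree is encoded simply as a list of head indices (a hypothetical encoding, with -1 marking the root):

def is_projective(heads: list[int]) -> bool:
    # heads[i] is the index of word i's head, or -1 for the root (0-based).
    n = len(heads)
    children = [[] for _ in range(n)]
    for i, h in enumerate(heads):
        if h >= 0:
            children[h].append(i)

    def descendants(i):
        out = {i}
        for c in children[i]:
            out |= descendants(c)
        return out

    # projective iff every word's descendants occupy a contiguous span
    return all(max(d) - min(d) + 1 == len(d) for d in map(descendants, range(n)))

print(is_projective([1, -1, 1]))      # True: a small projective tree
print(is_projective([2, 3, -1, 2]))   # False: word 3's descendants {1, 3} are not contiguous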
In natural language processing, syntax analysis helps us understand the structure of sentences.
One common method is phrase structure analysis, which breaks a sentence into smaller units
called constituents. These constituents group words together based on their grammatical
relationships, forming a hierarchical tree known as a phrase structure tree.
A phrase structure tree is a graphical representation of how different parts of a sentence fit
together. It follows generative grammar principles, which help handle complex sentence
structures like long-distance relationships between words.
A phrase structure syntax analysis of a sentence derives from the traditional sentence
diagrams that partition a sentence into constituents, and larger constituents are formed
by merging smaller ones.
Phrase structure analysis also typically incorporates ideas from generative grammar (from linguistics) to deal with displaced constituents or apparent long-distance relationships between heads and constituents.
A phrase structure tree can be viewed as implicitly having a predicate-argument structure associated with it.
Both approaches are useful:
Phrase structure trees are better for identifying hierarchical structures and phrase
boundaries.
Dependency trees provide a more direct representation of word relationships, making
them useful for machine translation and information extraction.
Parsing is the process of analyzing an input sentence according to the rules of a grammar to
determine its structure. Given an input sentence, a parser produces an output analysis, which
we assume matches the structure defined by a treebank used for training. Treebank parsers
often do not require an explicit grammar, but to simplify the explanation, we will first consider
parsing algorithms that assume the existence of a Context-Free Grammar (CFG).
A Context-Free Grammar (CFG) consists of a set of production rules that define how
sentences can be derived from a starting symbol. Consider the following simple CFG G that
generates strings such as a and b or c from the start symbol N:
N -> N ’and’ N
N -> N ’or’ N
N -> ’a’ | ’b’ | ’c’
Each rule describes how a nonterminal symbol (N) can be rewritten into terminal symbols (a,
b, c, and, or) or combinations of N. This allows us to build complex expressions using a
hierarchical structure.
Derivation in Parsing
A derivation is a step-by-step process that shows how an input sentence can be generated
using the CFG. Consider the input sentence:
a and b or c
N
=> N 'or' N               # applying rule N -> N 'or' N
=> N 'or' 'c'             # applying rule N -> 'c'
=> N 'and' N 'or' 'c'     # applying rule N -> N 'and' N
=> N 'and' 'b' 'or' 'c'   # applying rule N -> 'b'
=> 'a' 'and' 'b' 'or' 'c' # applying rule N -> 'a'
Here, each step applies a rule from the CFG, progressively transforming the start symbol (N)
into the full input sentence. Each line of this process is called a sentential form.
A rightmost derivation always expands the rightmost nonterminal first. If we reverse the
derivation order, we see how the sentence structure builds up:
This derivation sequence corresponds to the following parse tree, constructed from left to
right:
Parsing Algorithms
Both trees represent different ways of interpreting the sentence structure. This ambiguity is
common in natural language and must be resolved when designing efficient parsers.
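To see this ambiguity concretely, a short sketch using NLTK's chart parser (assuming nltk is installed) enumerates both analyses of 'a and b or c' under the grammar G:

import nltk

grammar = nltk.CFG.fromstring("""
N -> N 'and' N
N -> N 'or' N
N -> 'a' | 'b' | 'c'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("a and b or c".split()):
    print(tree)
# Two parses are produced: (a and (b or c)) versus ((a and b) or c).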
Parsing algorithms in Natural Language Processing (NLP) can be broadly classified into
different categories based on their approach to syntactic analysis. Below are the primary types:
Transition-Based Parsing
Transition-based parsing builds a parse tree by taking incremental parsing decisions in a state-based manner. It is commonly used for dependency parsing.
Uses a stack, buffer, and set of actions to determine a parse tree.
Relies on machine learning models (e.g., neural networks) to predict parsing actions.
Example: Shift-Reduce Parsing Algorithm
Graph-Based Parsing
Graph-based parsers construct the best possible dependency tree by treating parsing as a graph optimization problem.
Constructs a weighted graph where words are nodes and dependencies are edges.
Uses algorithms like maximum spanning tree (MST) search to find the best parse.
Example: Maximum Spanning Tree parsing (e.g., Eisner's algorithm)
Neural Network-Based Parsing
Neural network-based approaches use deep learning models (e.g., transformers, LSTMs) to learn parsing patterns from large datasets.
Key Idea:
Uses word embeddings (e.g., BERT) to encode words.
Relies on sequence-to-sequence models or self-attention mechanisms.
Example: BiLSTM-Based Dependency Parsing Algorithm
To build a parser, we need to create an algorithm that can perform the steps in rightmost
derivation for any grammar and for any input string.
Every CFG turns out to have an automaton that is equivalent to it, called a pushdown automaton. A pushdown automaton is simply a finite-state automaton with some additional memory in the form of a stack (or pushdown).
This is a limited amount of memory because only the top of the stack is used by the machine.
This provides an algorithm for parsing that is general for any given CFG and input string. This algorithm, called shift-reduce parsing, uses two data structures: a buffer for input symbols and a stack for storing CFG symbols.
The shift-reduce parsing algorithm consists of the following actions:
1 Shift: Move the next word from the buffer to the stack.
2 Reduce: Apply a production rule to reduce the top elements of the stack into a
non-terminal.
3 Left-Arc: Establish a dependency in which the top element of the stack is the head of the second-top element, removing the second-top element from the stack (used when shift-reduce is applied to dependency parsing).
4 Right-Arc: Establish a dependency in which the second-top element of the stack is the head of the top element, removing the top element from the stack.
1 Start with an empty stack and with the entire input sentence in the buffer.
2 Exit with success if the top of the stack contains the start symbol of the grammar and the buffer is empty.
3 Otherwise, choose one of the following two actions (if both apply, consult an oracle):
Shift a symbol from the buffer onto the stack.
If the top k symbols of the stack are α1 ... αk, which correspond to the right-hand side of a CFG rule A → α1 ... αk, then replace the top k symbols with the left-hand-side non-terminal A (Reduce).
4 Exit with failure if no action can be taken in the previous step.
5 Else, go to step 2.
This parsing technique is categorized as bottom-up parsing, meaning it builds the parse tree
from the leaves (input symbols) and works its way up to the root (start symbol of the
grammar).
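A minimal sketch of such a bottom-up shift-reduce recognizer for the grammar G above; a greedy 'reduce whenever possible' heuristic stands in for the oracle discussed next, which suffices for this small example but not for grammars in general:

# Shift-reduce recognizer for G:  N -> N 'and' N | N 'or' N | 'a' | 'b' | 'c'
RULES = [
    ("N", ("N", "and", "N")),
    ("N", ("N", "or", "N")),
    ("N", ("a",)),
    ("N", ("b",)),
    ("N", ("c",)),
]

def shift_reduce(tokens, start="N"):
    stack, buffer = [], list(tokens)
    while True:
        # Reduce: does the top of the stack match some rule's right-hand side?
        for lhs, rhs in RULES:
            if tuple(stack[-len(rhs):]) == rhs:
                stack[-len(rhs):] = [lhs]       # replace the RHS by the LHS non-terminal
                break
        else:
            if buffer:
                stack.append(buffer.pop(0))     # Shift the next input symbol
            else:
                return stack == [start]         # success iff only the start symbol remains

print(shift_reduce("a and b or c".split()))   # True
print(shift_reduce("a and or".split()))       # False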
Shift-Reduce Parsing
An oracle in NLP parsing is a decision-making component that helps the parser choose the
correct sequence of shift and reduce operations. In shift-reduce parsing:
Multiple parsing paths may be possible at each step.
The oracle predicts the correct action based on training data or predefined heuristics.
This ensures that the parser does not make mistakes that require backtracking.
If trained on a large corpus of correctly parsed sentences, the oracle learns to select the best
action at each step.
It may use machine learning models (such as neural networks or probabilistic models) to
decide whether to shift or reduce.
If the oracle is perfect, shift-reduce parsing can be done in linear time.
Necessity of oracle
Without an oracle, the parser might choose the wrong action, requiring backtracking
(reparsing from an earlier state).
In worst-case scenarios, backtracking can lead to exponential parsing time due to
repeatedly trying different possibilities.