
Syntax

Dr. John Babu

February 20, 2025

Dr. John Babu Syntax


Grammar

Defines the rules for forming valid sentences in a language.


It consists of a set of rules (often written in formal notation) that describe how words and
phrases should be structured.
Example: In English, a sentence follows the structure: Subject → Verb → Object (e.g.,
”John eats an apple.”).
In programming, grammars are used to define the syntax of programming languages using
formal grammars such as Context-Free Grammar (CFG).

Dr. John Babu Syntax


Syntax

Refers to the structure and arrangement of words, phrases, or symbols in a sentence or


code according to grammatical rules.
Ensures that the given sentence or code follows the correct structure.
Example (English): ”John eats an apple.” is syntactically correct, but ”Eats apple John.”
is incorrect.
Example (Programming):
print("Hello")   # correct syntax
print "Hello"    # incorrect syntax in Python 3

Dr. John Babu Syntax


Parsing

The process of analyzing a sentence (natural language or programming code) to check


whether it follows the defined grammar and syntax.
Parsing breaks down a string into meaningful components using a parser.
In programming, parsing converts source code into a structured format (such as a parse
tree) for further processing.
Example: Given the expression (3 + 5) × 2, parsing will analyze its structure to determine
the correct order of operations (precedence).

Dr. John Babu Syntax


Components of a Grammar

A grammar is expressed with the following components:


Non-Terminals (N): Abstract symbols representing different syntactic categories.
Example: S (Sentence), NP (Noun Phrase), VP (Verb Phrase).
Terminals (T): The actual symbols/words in the language.
Example: ”cat”, ”chases”, ”the”.
Production Rules (P): Rules that define how non-terminals are transformed.
Example: S → NP VP (A sentence consists of a noun phrase followed by a verb phrase).
Start Symbol (S): The starting point of derivations.
Example: S represents a complete sentence.

Dr. John Babu Syntax


Types of Grammars

The Chomsky Hierarchy defines four types of grammars:


1 Unrestricted Grammar (Type 0)
2 Context-Sensitive Grammar (CSG) (Type 1)
3 Context-Free Grammar (CFG) (Type 2)
4 Regular Grammar (Type 3)

Dr. John Babu Syntax


Unrestricted Grammar (Type 0)

The most powerful but computationally expensive.


Every production rule has the form:
α → β, where α and β are arbitrary strings of terminals and non-terminals (and α is non-empty).
Example: S → aSa | bSb | ε
Most Powerful: It can generate any recursively enumerable language, which includes all
languages that can be recognized by a Turing machine.
No Restrictions on Production Rules: Unlike context-free or context-sensitive grammars,
Type 0 grammars do not restrict the structure of the left or right-hand sides of rules.
Computational Equivalence to Turing Machines: Any language generated by a Type 0
grammar can also be recognized by a Turing machine, making it equivalent to the class of
recursively enumerable languages.

Dr. John Babu Syntax


Type-1 : Context-Sensitive Grammar (CSG)

Context-Sensitive Grammar (CSG) was introduced as part of Noam Chomsky’s hierarchy of


formal grammars in the 1950s. It was developed to handle languages that require context to
determine transformations. Unlike context-free grammars, CSG can express rules where
substitutions depend on surrounding symbols. It plays a crucial role in computational
linguistics and natural language processing.
Called ”context-sensitive” because:
The left-hand side (LHS) of a production rule can contain multiple symbols.
The rule applies only if a specific context is met.
The RHS must be at least as long as the LHS.
Example of a CSG rule: A B → A C B
The rule A C B → A B is not context-sensitive, because its right-hand side is shorter than its left-hand side.

Dr. John Babu Syntax


A context-sensitive grammar (CSG) is a Type 1 grammar in the Chomsky hierarchy, which
consists of:
A finite set of non-terminal symbols.
A finite set of terminal symbols.
A finite set of production rules of the form:

αAβ → αγβ

where:
A is a non-terminal.
α, β, γ are strings of terminals and/or non-terminals.
|γ| ≥ 1 (γ is non-empty), so the right-hand side is at least as long as the left-hand side.
A designated start symbol.

Dr. John Babu Syntax


Characteristics of CSG

Generates context-sensitive languages (CSLs).


Expands only within a given context.
More expressive than context-free grammars.
Describes languages requiring context-dependent transformations.

Dr. John Babu Syntax


Example Grammar

A context-sensitive grammar for the language L = {an bn cn | n ≥ 1}:

S → aSBC
S → aBC
CB → DB
DB → DC
DC → BC
aB → ab
bB → bb
bC → bc
cC → cc

Dr. John Babu Syntax


Derivation Example (n = 2)

Derivation of the string aabbcc:

S ⇒ aSBC
⇒ aaBCBC
⇒ aaBDBC (CB → DB)
⇒ aaBDCC (DB → DC)
⇒ aaBBCC (DC → BC)
⇒ aabBCC (aB → ab)
⇒ aabbCC (bB → bb)
⇒ aabbcC (bC → bc)
⇒ aabbcc (cC → cc)

Each step applies a production of the grammar, so aabbcc is a valid derivation from S.

Dr. John Babu Syntax


Type-2: Context-Free Grammar (CFG)

A context-free grammar (CFG) is a Type 2 grammar in the Chomsky hierarchy. It was


introduced to describe the syntax of programming languages and natural languages.
CFG consists of production rules where a single non-terminal is replaced by a string of
terminals and/or non-terminals.
CFGs are widely used in compiler design and formal language theory.
Recognized by pushdown automata.
Called ”context-free” because:
The left-hand side (LHS) of a production rule contains only one non-terminal.
The application of a rule does not depend on surrounding symbols (context).
Example CFG:
E → E + T | E - T | T
T → T * F | T / F | F
F → ( E ) | id

Dr. John Babu Syntax


CFG

A context-free grammar (CFG) consists of:


A finite set of non-terminal symbols.
A finite set of terminal symbols.
A finite set of production rules of the form:

A→γ

where A is a single non-terminal and γ is a string of terminals and/or non-terminals.


A designated start symbol.

Dr. John Babu Syntax


Example Grammar for CFG

A CFG for the language L = {an bn | n ≥ 1}:

S → aSb
S → ab

Given input string aaabbb:

S ⇒ aSb
⇒ aaSbb
⇒ aaaSbbb
⇒ aaabbb

The final output is a valid derivation.
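As a quick illustration of what these two rules recognize, here is a minimal Python sketch (not from the slides) that checks membership in L by undoing S → aSb and S → ab:

def in_anbn(s):
    # rule S -> ab
    if s == "ab":
        return True
    # rule S -> aSb: strip one 'a' from the left and one 'b' from the right
    if len(s) >= 4 and s[0] == "a" and s[-1] == "b":
        return in_anbn(s[1:-1])
    return False

print(in_anbn("aaabbb"))  # True
print(in_anbn("aaabb"))   # False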

Dr. John Babu Syntax


Regular Grammar (Type 3)

The simplest form of grammar, used in Finite State Machines (FSM) and Regular
Expressions.
Production rules:
A → aB (Right Linear)
A → Ba (Left Linear)
A → a (Terminal production)
Example: S → 0S | 1S | 0 | 1
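For comparison, the language generated by this Type 3 grammar is simply the set of non-empty binary strings, which the regular expression [01]+ recognizes; a small Python check (an illustration, not part of the slides):

import re

pattern = re.compile(r"^[01]+$")     # equivalent to S → 0S | 1S | 0 | 1
print(bool(pattern.match("01101")))  # True
print(bool(pattern.match("01a1")))   # False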

Dr. John Babu Syntax


Parsing Techniques and Grammars

Regular Grammar: Finite State Automata, Lexical Analysis.


CFG: LL parsing (top-down) and LR parsing (bottom-up). CFGs can be parsed efficiently
using pushdown automata or chart-parsing algorithms such as CYK and Earley's algorithm,
which run in polynomial time and are widely used in NLP applications.
CSG: augmented CFGs, which can handle context-dependent (semantic) constraints.
Unrestricted Grammar: Turing machine-based analysis; these are mainly of theoretical
importance.

Dr. John Babu Syntax


Example of CFG-Based Parsing

Consider the CFG:


S → NP VP
NP → Det N | N
VP → V NP
Det → ”the”
N → ”cat” | ”dog”
V → ”chases”
Parse tree ensures correct syntactic structure.
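A minimal sketch of this grammar in code, assuming the NLTK library is available (the grammar string mirrors the rules above; nothing here is prescribed by the slides):

import nltk

grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> Det N | N
VP -> V NP
Det -> 'the'
N  -> 'cat' | 'dog'
V  -> 'chases'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the cat chases the dog".split()):
    tree.pretty_print()   # draws the parse tree in ASCII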

Dr. John Babu Syntax


Important POS Tags Used in Parsing I

1. Noun (NN, NNS, NNP, NNPS)


NN (Singular Noun): dog, table, book.
NNS (Plural Noun): dogs, tables, books.
NNP (Proper Singular Noun): John, India, Microsoft.
NNPS (Proper Plural Noun): Americans, Europeans.
2. Pronoun (PRP, PRP$)
PRP (Personal Pronoun): he, she, it, they.
PRP$ (Possessive Pronoun): his, her, its, their.
3. Verb (VB, VBD, VBG, VBN, VBP, VBZ)
VB (Base Form): run, eat, go.
VBD (Past Tense): ran, ate, went.
VBG (Gerund/Present Participle): running, eating, going.
Dr. John Babu Syntax
Important POS Tags Used in Parsing II

VBN (Past Participle): run, eaten, gone.


VBP (Present, Non-3rd Person Singular): run, eat, go.
VBZ (Present, 3rd Person Singular): runs, eats, goes.

Dr. John Babu Syntax


Important POS Tags Used in Parsing III
4. Adjective (JJ, JJR, JJS)
JJ (Adjective): big, fast, beautiful.
JJR (Comparative Adjective): bigger, faster, more beautiful.
JJS (Superlative Adjective): biggest, fastest, most beautiful.
5. Adverb (RB, RBR, RBS)
RB (Adverb): quickly, silently, well.
RBR (Comparative Adverb): faster, better.
RBS (Superlative Adverb): fastest, best.
6. Determiner (DT, PDT, WDT)
DT (Determiner): the, a, an, this.
PDT (Predeterminer): all, half, both.
WDT (Wh-determiner): which, whatever.
Dr. John Babu Syntax
Important POS Tags Used in Parsing IV

7. Preposition (IN)
IN (Preposition/Subordinating Conjunction): in, on, at, because, although.
8. Conjunction (CC, IN)
CC (Coordinating Conjunction): and, but, or, yet.
IN (Subordinating Conjunction): because, although, since.
9. Modal Verbs (MD)
MD (Modal Verbs): can, could, shall, should, will, would, must, might.
10. Particles (RP)
RP (Particle): up, off, out, over (as in look up, take off).

Dr. John Babu Syntax


Important POS Tags Used in Parsing V

11. Interjection (UH)


UH (Interjection): wow, oh, huh.
12. Wh-words (WP, WP$, WRB)
WP (Wh-pronoun): who, what, whom.
WP$ (Possessive Wh-pronoun): whose.
WRB (Wh-adverb): when, where, why.

Dr. John Babu Syntax


Parsing in Natural Language Processing

Parsing uncovers the hidden structure of a sentence.


Essential for applications like:
Machine Translation
Question Answering
Information Extraction
Predicate-Argument Structure (PAS) helps understand relationships in a sentence.

Dr. John Babu Syntax


Predicate-Argument Structure

PAS provides semantic analysis beyond syntactic parsing.


PAS helps in understanding relationships between words, particularly in terms of who is
doing what to whom in a sentence.
Example: ”SaiManesh gave Hemanth a book.”
the predicate gave connects three arguments.
SaiManesh (Who gave?)
Hemanth (Who received?)
a book (What was given?)
By analyzing this structure, we can extract meaningful relationships, which is useful in
tasks like semantic role labeling and information retrieval.

Dr. John Babu Syntax


Syntactic Analysis

Syntactic Analysis (also called parsing) is the process of analyzing the structure of a sentence
based on grammatical rules. It determines how words are arranged and related to each other
to form meaningful sentences.
Syntactic analysis is the process of checking whether a sentence follows the grammatical
rules of a language.
It involves breaking down a sentence into its components (phrases, clauses, words) and
identifying their roles.
It builds a tree-like structure (Parse Tree) to represent sentence structure.

Dr. John Babu Syntax


Levels of Syntactic Analysis

Syntactic Analysis can be done at different levels.


Basic Level – POS Tagging: at this level each word is tagged with its part of speech, such
as noun, verb, adjective, etc. (a small NLTK sketch follows this list).
Example: "The quick brown fox jumps over the lazy dog."
Output: The/DT quick/JJ brown/JJ fox/NN jumps/VBZ over/IN the/DT lazy/JJ dog/NN
Advanced Level – Full Syntactic Parsing: this involves building a parse tree, which
represents the full grammatical structure of a sentence. There are two main types of
parsing techniques:
Constituency Parsing (Phrase Structure Trees): Breaks a sentence into phrases (NP, VP, PP,
etc.) and uses Context-Free Grammar (CFG) to analyze structure.
Dependency Parsing: Focuses on word-to-word relationships (dependencies) and identifies
the main verb and connects words based on their grammatical roles.
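The basic level can be reproduced with off-the-shelf taggers; a hedged sketch using NLTK (assumes the punkt and averaged_perceptron_tagger resources have been downloaded):

import nltk

sentence = "The quick brown fox jumps over the lazy dog."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# roughly: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'),
#           ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
#           ('dog', 'NN'), ('.', '.')]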

Dr. John Babu Syntax


Ambiguity in Parsing
The major challenge in parsing natural language is ambiguity.
Consider the sentence: "The boy saw the man with a telescope."
The ambiguity arising here is: did the boy use the telescope, or did the man have the
telescope?
This type of ambiguity makes parsing difficult because multiple possible structures exist.
Algorithms must choose the most plausible one.
Ambiguity is a major challenge in parsing because multiple interpretations can exist for
the same sentence. This issue arises in:
Lexical Ambiguity (words with multiple meanings, e.g., ”bank” as a financial institution or
riverbank)
Structural Ambiguity (different grammatical structures, e.g., ”I saw the man with the
telescope”)
Attachment Ambiguity (where to attach modifiers, e.g., ”old men and women” could mean
both are old or only men are old)
Because of this, parsing algorithms must be designed carefully to resolve ambiguity efficiently.
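To see structural ambiguity concretely, the following sketch (the toy grammar is an assumption written only to reproduce PP-attachment ambiguity, not a grammar from the slides) makes NLTK's chart parser return two trees for the telescope sentence:

import nltk

grammar = nltk.CFG.fromstring("""
S   -> NP VP
NP  -> Det N | NP PP | PRP
VP  -> V NP | VP PP
PP  -> P NP
PRP -> 'I'
Det -> 'the'
N   -> 'man' | 'telescope'
V   -> 'saw'
P   -> 'with'
""")

parser = nltk.ChartParser(grammar)
trees = list(parser.parse("I saw the man with the telescope".split()))
print(len(trees))   # 2: the PP attaches either to the verb phrase or to the noun phrase
for tree in trees:
    print(tree)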
Dr. John Babu Syntax
Parsing in NLP Applications

Parsing in natural language processing (NLP) is the process of analyzing the grammatical
structure of a sentence to determine its meaning and organization.
It helps computers understand and process human language more accurately by
identifying relationships between words and phrases.
Text-to-Speech (TTS) Systems
One of the important applications of parsing is in text-to-speech (TTS) systems. When
converting written text into spoken words, a TTS system must ensure that the output
sounds natural, just as a native speaker would pronounce it. Consider the sentences:
He wanted to go for a drive-in movie.
He wanted to go for a drive in the country.
In spoken language, there is a natural pause between drive and in in the second sentence,
whereas in the first sentence, the words are spoken together as one unit.
Parsing helps identify such structural differences, ensuring correct intonation in TTS systems.

Dr. John Babu Syntax


Part-of-Speech (POS) Tagging

Another challenge in NLP is part-of-speech (POS) tagging, which assigns the correct
grammatical category (noun, verb, adjective, etc.) to each word in a sentence. For example, in
the sentence:
The cat who lives dangerously had nine lives.
Here, lives appears twice but has different meanings: in who lives dangerously, lives is a verb,
while in had nine lives, lives is a noun. A TTS system must correctly identify these roles to
produce the right pronunciation and rhythm.

Dr. John Babu Syntax


Text Summarization

Parsing is also essential in text summarization, where long documents need to be condensed
into a shorter, meaningful summary. For instance, given the sentence:
Beyond the basic level, the operations of the three products vary widely.
A summarization system may reduce this to:
The operations of the products vary.
The parse tree systematically breaks down the sentence into its grammatical components,
which include noun phrases (NP), verb phrases (VP), prepositional phrases (PP), and other
syntactic units.

Dr. John Babu Syntax


[Figure: parse tree for "Beyond the basic level, the operations of the three products vary widely."]
Dr. John Babu Syntax
Breakdown of the Parse Tree
1 Sentence (S): The root of the tree represents the entire sentence.
2 Prepositional Phrase (PP): The phrase Beyond the basic level is a prepositional phrase
that acts as an adverbial modifier.
3 Noun Phrase (NP): The subject of the sentence, The operations of the three products, is
a noun phrase consisting of a determiner (the) and a plural noun (operations) followed by
a prepositional phrase (of the three products).
4 Verb Phrase (VP): The main action in the sentence is captured in the verb phrase, where
vary is the verb, and widely is an adverb modifying it.
5 Prepositional Phrase within NP (PP): The phrase of the three products further specifies
the noun operations. It consists of a preposition (of) followed by another noun phrase
(the three products), which includes a determiner (the), a cardinal number (three), and a
plural noun (products).
6 Adverbial Phrase (ADVP): The adverb widely functions as an adverbial phrase, modifying
the verb vary.
Dr. John Babu Syntax
Sentence Compression

To generate the more concise sentence:


The operations of the products vary.
Certain elements are removed from the parse tree:
Prepositional Phrase (PP) - ”Beyond the basic level”: This phrase does not change the
essential meaning of the sentence and can be omitted without loss of fluency.
Cardinal Number (CD) - ”three”: The specific number of products is unnecessary for the
summary and is removed.
Adverbial Phrase (ADVP) - ”widely”: Though it provides additional detail, it is not
crucial to the core meaning of the sentence.

Dr. John Babu Syntax


Summarization

Parsing helps in understanding the sentence structure and identifying removable


constituents while maintaining grammatical correctness and coherence.
By removing non-essential elements from the parse tree, a compression model ensures
that the resulting sentence remains fluent and meaningful.
This approach is widely used in text summarization, where long sentences or paragraphs
need to be condensed while preserving key information.

Dr. John Babu Syntax


Paraphrasing with Parsing

Paraphrasing is another important application of parsing, where sentences are rewritten in


different ways while preserving their original meaning. Consider the sentence:
Open borders imply increasing racial fragmentation in EUROPEAN COUNTRIES.
This sentence can be rewritten in multiple ways:
open borders imply increasing racial fragmentation in the countries of europe.
open borders imply increasing racial fragmentation in european states.
open borders imply increasing racial fragmentation in europe.
open borders imply increasing racial fragmentation in european nations.
open borders imply increasing racial fragmentation in the european countries.
Parsing ensures that such substitutions are meaningful and grammatically correct rather than
random word replacements that could lead to awkward or incorrect sentences.

Dr. John Babu Syntax


Applications of Parsing I
Parsing plays a fundamental role in many modern NLP tasks, enabling more accurate and
fluent language processing. Beyond these applications, syntactic parsers are widely used in:
Machine Translation – Facilitating accurate translation of text between languages while
preserving grammatical structure and syntactic coherence.
Information Extraction – Automatically identifying and extracting key entities, events, and
relationships from large text collections to support data-driven applications.
Speech Recognition – Enhancing the accuracy of speech-to-text systems by resolving
syntactic ambiguities and improving transcription quality, particularly in cases of unclear
or noisy speech.
Language Summarization – Generating a concise and coherent summary of a longer text
while retaining essential information, ensuring readability and informativeness.
Producing Entity Grids for Language Generation – Constructing structured
representations of entity occurrences and their interactions across sentences to improve
text coherence and fluency in natural language generation tasks.
Dr. John Babu Syntax
Applications of Parsing II

Error Correction in Text – Identifying and correcting grammatical, spelling, and syntactic
errors in written text using rule-based, statistical, or deep learning techniques to enhance
clarity and correctness.
Dialogue Systems – Improving chatbot and virtual assistant responses by analyzing
syntactic structures to enhance user interactions and contextual understanding.
Knowledge Acquisition – Extracting semantic relationships between concepts (e.g.,
identifying ”dog is-a animal”) to support automated reasoning and ontology building.
Text-to-Speech Systems – Assisting speech synthesis models in generating more natural
and grammatically correct spoken language by analyzing sentence structure and
intonation patterns.

Dr. John Babu Syntax


Requirements for Parsing

Parsing recovers information that is not explicit in the input sentence.


This implies that a parser requires some knowledge in addition to the input sentence
about the kind of syntactic analysis that should be produced as output.
One method to provide such knowledge to the parser is to write down a grammar of the
language—a set of rules of syntactic analysis.
We can write down the rules of syntax as a context-free grammar (CFG).

Dr. John Babu Syntax


Context-Free Grammar (CFG)

One approach to defining syntax is using Context-Free Grammar (CFG). A Context-Free


Grammar (CFG) is a formal grammar used to define the syntactic structure of a language.
A Context-Free Grammar (CFG) is called ”context-free” because the application of its
production rules does not depend on surrounding symbols (context). Instead, each rule applies
to a single non-terminal on the left-hand side (LHS) independently.
It consists of a set of rules that specify how symbols can be combined to form valid sentences.
A CFG consists of:
Non-Terminals: Symbols that can be expanded further (e.g., S, NP, VP).
Terminals: Words in the language (e.g., John, shirt, bought).
Production Rules: Rules that define how non-terminals expand.

Dr. John Babu Syntax


Ambiguity in CFG - Example 1
Example CFG:

[Figure: example CFG]

This grammar allows us to generate and parse sentences such as: "John bought a shirt with pockets."


Dr. John Babu Syntax
There are two possible interpretations:

[Figure: two possible parse trees for "John bought a shirt with pockets"]

John bought a shirt that happens to have pockets.


John used pockets (as currency) to buy a shirt.
This ambiguity arises because the prepositional phrase (with pockets) can attach to either
shirt or bought.

Dr. John Babu Syntax


Knowledge Acquisition Problem

Writing a CFG for the syntactic analysis of natural language is problematic because, unlike
a programming language, a natural language is far too complex to list all of its syntactic
rules in terms of a CFG.
A simple list of rules is not sufficient to capture the interactions between different
components of the grammar.
Listing all possible syntactic constructions in a language is a difficult task.
In addition, it is difficult to list all the grammar rules in which a particular word can be a
participant.
This is known as the knowledge acquisition problem.

Dr. John Babu Syntax


Knowledge Acquisition Problem in NLP Parsing

Knowledge acquisition in Natural Language Processing (NLP) parsing refers to the


challenge of obtaining, representing, and utilizing linguistic and world knowledge required
for accurate syntactic and semantic analysis of sentences.
Since human languages are complex, ambiguous, and context-dependent, parsing requires
extensive knowledge about grammar, syntax, semantics, pragmatics, and world knowledge.
Traditional parsing techniques rely on predefined grammar rules, but these rules may not
always generalize well to new or unseen data.
The knowledge acquisition problem arises due to the difficulty of defining and automating
the learning of these rules, especially in cases where knowledge is vast, implicit, and
constantly evolving.

Dr. John Babu Syntax


Challenges in Knowledge Acquisition for Parsing

The knowledge acquisition problem arises due to the following challenges:


1. Ambiguity in Natural Language
Lexical Ambiguity: A word can have multiple meanings depending on context.
Example: ”The bank approved the loan.” vs. ”He sat by the bank of the river.” The parser
must determine whether bank refers to a financial institution or a riverbank.
Syntactic Ambiguity: A sentence can have multiple syntactic structures. Example: ”The man
saw the boy with a telescope.” Did the man use a telescope to see the boy, or does the boy
have the telescope?
Semantic Ambiguity: Different meanings may arise due to word sense variations.
Example: ”Flying planes can be dangerous.” Does it mean that the act of flying planes is
dangerous, or that planes that are flying are dangerous?

Dr. John Babu Syntax


2. Lack of Sufficient Annotated Data
Rule-based parsers require handcrafted grammar rules, which are difficult to generalize to
all linguistic variations.
Statistical and neural parsers need large amounts of annotated training data, which is
expensive and time-consuming to create.
Training a deep learning-based parser requires millions of labeled sentences, which may
not be available for low-resource languages.
3. Handling Idioms, Colloquial Expressions, and Domain-Specific Jargon
NLP models struggle to acquire knowledge about idiomatic expressions and domain-specific terms.
Examples: "kicked the bucket"; medical domain: "BP is high".
4. Incompleteness of Linguistic Rules
No grammar rule set is exhaustive enough to capture all variations in human language.
Examples: recognizing new slang words such as "ghosting", or handling commonly used
phrases like "Me and my friend went to the park."

Dr. John Babu Syntax


Recursive Rules

Apart from this knowledge acquisition problem, there is another problem that the rules
interact with each other in many combinatorial ways. Consider a simple CFG that
provides a syntactic analysis of noun phrases as a binary branching tree.
N → N N
N → 'natural' | 'language' | 'processing' | 'book'
Recursive rules produce ambiguity. For the phrase "natural language processing", is it:
the processing of natural language? (the correct interpretation)
a natural way to do language processing? (an incorrect, but grammatically possible, interpretation)

Dr. John Babu Syntax


With recursive rules, ambiguity increases exponentially. For instance:
The Catalan number is a mathematical sequence used in Context-Free Grammar (CFG) to
count the number of valid parse trees (binary trees) for a given sentence length.
Formula for Catalan Numbers: the n-th Catalan number C_n is given by

C_n = (1 / (n + 1)) · C(2n, n) = (2n)! / ((n + 1)! · n!)

Catalan numbers can also be computed recursively as

C_n = Σ_{i=0}^{n-1} C_i · C_{n-1-i}

where C_0 = 1, C_1 = 1, C_2 = 2, and so on.


With 4 words, there are 5 possible parse trees.
With 6 words, there are 42 possible parse trees.
With 7 words, there are 132 possible parse trees.
Due to this exponential growth, parsing becomes computationally expensive.
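A quick numerical check of these counts (a small sketch; C(n-1) counts the binary parse trees over n words under the rule N → N N):

from math import comb

def catalan(n):
    return comb(2 * n, n) // (n + 1)

for words in range(2, 8):
    # a noun compound of `words` words has catalan(words - 1) binary parse trees
    print(words, "words ->", catalan(words - 1), "parse trees")
# 4 words -> 5, 6 words -> 42, 7 words -> 132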
Dr. John Babu Syntax
Knowledge Acquisition Problems

Problem 1: finding the underlying grammar for syntactic analysis, where recursive rules
produce ambiguity.
Problem 2: not only do we need to know the syntactic rules for a particular language,
but we also need to know which analysis is the most plausible for a given input sentence.
Solution (Treebanks): the construction of a treebank is a data-driven approach to syntax
analysis that allows us to address both of these knowledge acquisition bottlenecks in one
stroke.

Dr. John Babu Syntax


Treebanks

To address these problems, NLP researchers use Treebanks: collections of sentences (also
called a corpus of text) annotated with their syntactic structure. Each sentence in a treebank
has a single correct parse that has been manually verified by a human expert.
A treebank is essentially a corpus of text where each sentence is annotated with a
syntactic structure.
Unlike traditional grammar-based parsing approaches, treebanks offer a data-driven
method for learning syntax directly from annotated examples.
Treebanks solve the two primary knowledge acquisition challenges:
Finding the underlying grammar for syntactic analysis – Instead of manually writing grammar
rules, treebanks provide syntactic annotations that serve as implicit grammar knowledge.
A parser trained on a treebank does not necessarily need explicit grammar rules but instead
learns to generate syntax analyses based on statistical patterns in the annotated data

Dr. John Babu Syntax


Selecting the most plausible syntactic analysis for a sentence – Each sentence in a treebank is
assigned its most plausible syntactic structure through human annotation.
Supervised learning models use this data to develop scoring functions that rank various
possible parses for new sentences, selecting the most probable one.
These statistical parsers, trained on the treebank, attempt to replicate the human annotation
decisions by using indicators from the input and previous decisions made in the parser
itself to learn such a scoring function.
For a given sentence not seen in the training data, a statistical parser can use the scoring
function to return the syntax analysis that has the highest score, which is taken to be the
most plausible analysis for the sentence.
The scoring function can also be used to produce the k-best syntax analyses for a
sentence.

Dr. John Babu Syntax


Treebank
In Natural Language Processing (NLP), a treebank is a text corpus that has been
annotated by experts with syntactic or semantic sentence structures, typically represented
in a tree-like format.
These annotations provide detailed information about the grammatical relationships
between words in a sentence, facilitating the development and evaluation of
computational models for language understanding.
Construction of Treebanks:
Treebanks are often built upon corpora that have already been annotated with
part-of-speech tags.
The process of creating a treebank can be entirely manual, where linguists meticulously
annotate each sentence’s structure, or semi-automatic, where an initial parsing is
performed by software and subsequently reviewed and corrected by human annotators.
The complexity and scale of this task mean that developing comprehensive treebanks can
be labor-intensive, often requiring teams of linguists several years to complete.
Dr. John Babu Syntax
Popular Treebanks

Penn Treebank (PTB) – English constituency treebank used in many NLP benchmarks. It uses bracketed
notation, e.g.:
(S
  (NP (NNP John))
  (VP (VBZ loves)
      (NP (NNP Books))))
Universal Dependencies (UD) – A multilingual dependency treebank covering over 100 languages.
TIGER Treebank – A German treebank based on dependency and constituency annotations.
NEGRA Corpus – A treebank for German with both constituency and dependency structures.
[Figure: the same parse drawn as a tree, with S dominating NP (NNP John) and VP (VBZ loves, NP (NNP Books))]
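The bracketed PTB string above can be read directly into a tree object; a small sketch assuming NLTK is installed:

from nltk import Tree

t = Tree.fromstring("(S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Books))))")
t.pretty_print()    # draws the constituency tree
print(t.leaves())   # ['John', 'loves', 'Books']
print(t.pos())      # [('John', 'NNP'), ('loves', 'VBZ'), ('Books', 'NNP')]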

Dr. John Babu Syntax


Comparison: Treebanks vs. Traditional Grammars
Treebanks: a corpus of parsed sentences annotated with syntactic structure. Traditional grammars: a set of formal rules describing the structure of a language.
Treebanks: data-driven, based on real-world linguistic data. Traditional grammars: rule-based, defining idealized linguistic structures.
Treebanks: usually represented as syntactic trees derived from actual sentences. Traditional grammars: typically represented using phrase structure rules or dependency rules.
Treebanks: can accommodate variations and ambiguities found in actual usage. Traditional grammars: often rigid and prescriptive, focusing on idealized grammar.
Treebanks: used for training machine learning models and NLP applications. Traditional grammars: used for teaching linguistic rules and formal syntax.
Treebanks: cover a wide range of syntactic constructions found in real texts. Traditional grammars: limited to predefined rules that may not capture all variations.
Treebanks: essential for statistical parsing, POS tagging, and syntactic analysis. Traditional grammars: more theoretical, less used directly in NLP systems.
Treebanks: e.g., Penn Treebank, Universal Dependencies. Traditional grammars: e.g., Chomskyan grammar, Context-Free Grammar (CFG).
Treebanks: directly derived from real text corpora. Traditional grammars: independent of any specific corpus.
Treebanks: used in computational linguistics, corpus linguistics, and NLP tasks like parsing and translation. Traditional grammars: used in theoretical linguistics, education, and formal syntax analysis.

Dr. John Babu Syntax
Dependency graphs vs Phrase structure trees

Treebanks primarily use two types of syntactic representations:


Phrase Structure Trees (Constituency Trees)
Dependency Graphs (Dependency Trees)
These two methods differ in how they represent the syntactic structure of sentences and are
suitable for different types of languages.

Dr. John Babu Syntax


Phrase Structure Trees

Phrase structure trees, also known as constituency trees, are hierarchical representations of sentence structure based on
constituents or phrases.
This approach is grounded in Chomskyan generative grammar, where sentences are built from recursively nested phrases.
Sentences are divided into nested phrases, which are labeled by syntactic categories such as noun phrases (NP),
verb phrases (VP), and prepositional phrases (PP).
The structure is hierarchical, meaning that each phrase can contain sub-phrases within it.
Phrase structure trees capture constituent relationships, helping in identifying which words form a meaningful unit
together.
Phrase structure trees are widely used in languages like English and French, where word order is relatively fixed. In
these languages, word placement plays a crucial role in conveying meaning (e.g., The boy eats an apple vs. An
apple eats the boy changes the meaning completely).
Since phrases are explicitly marked, phrase structure trees help in identifying subjects, objects, and verb phrases,
aiding in semantic role labeling.
Penn Treebank (PTB) uses phrase-structure annotations, making it one of the most influential resources for
training statistical and neural parsers.

Dr. John Babu Syntax


Dependency Trees

Dependency graphs (or dependency trees) provide an alternative way to represent syntax, focusing on word-to-word
relationships instead of hierarchical phrase structures. This approach is based on dependency grammar, which directly
links words with syntactic dependencies.
Each word in a sentence is connected to another word based on grammatical relationships.
The structure is not hierarchical like phrase structure trees; instead, it forms a directed graph with words as nodes
and dependencies as edges.
The main verb is usually the root of the tree, with other words connected as dependents.
Better suited for free word order languages, such as Czech, Turkish, Russian, where word placement is flexible, but
syntactic relations remain clear.
Since dependency trees do not require extra phrase labels, they are more compact and faster to process.
Many modern dependency parsers (e.g., Stanford Parser, spaCy, UDPipe) work efficiently using dependency
annotations, and dependency representations are widely used in multilingual NLP.
The Universal Dependencies (UD) Project standardizes dependency treebanks across 100+ languages.
Dependency trees work well for low-resource languages, where phrase structure rules are harder to define.

Dr. John Babu Syntax


Representation of Syntactic Structure - Dependency Graphs

Dependency graphs connect a word (the head of a phrase) with its dependents using a
directed (asymmetric) connection.
The head-dependent relationship could be either:
semantic (head-modifier):
The tall boy runs fast. (The adverb 'fast' modifies the head verb 'runs'.)
She sings beautifully. (The adverb 'beautifully' modifies the verb.)
syntactic (head-specifier):
The boy runs. (The determiner specifies the noun.)
She has eaten dinner. (The auxiliary verb supports the main verb.)
Dependency graphs are a fundamental way to represent the syntactic structure of sentences.
They make minimal assumptions about syntactic structure and avoid annotating hidden
structure, such as empty elements used as placeholders to represent missing or displaced arguments of
predicates, or unnecessary hierarchical structure.

Dr. John Babu Syntax


Basic Concept of Dependency Graphs

Vertices: Represent words in the sentence.


Directed Arcs: Binary relations from head to dependent representing syntactic
dependencies.
Key Properties:
All words except the root have a syntactic head.
The graph forms a tree with a single independent root node.
Labeled dependency parsing: assigns labels to dependency relations.

Dr. John Babu Syntax


The CoNLL 2007 shared task on dependency parsing provides the following definition of a
dependency graph:
”In dependency-based syntactic parsing, the task is to derive a syntactic structure for an input
sentence by identifying the syntactic head of each word in the sentence. This defines a
dependency graph, where the nodes are the words of the input sentence and the arcs are the
binary relations from head to dependent. Often, but not always, it is assumed that all words
except one have a syntactic head, which means that the graph will be a tree with the single
independent node as the root. In labeled dependency parsing, we additionally require the
parser to assign a specific type (or label) to each dependency relation holding between head
word and dependent word.”

Dr. John Babu Syntax


Dependency Graph Syntax Analysis
The students are interested in languages, but the faculty is missing teachers of English.

Dr. John Babu Syntax


Dependency Representation Example
There are many variants of dependency syntactic analysis, but the basic textual format for a
dependency tree can be written in the following form, where each dependent word specifies the
head word in the sentence, and exactly one word is dependent on the root of the sentence.
The following shows a typical textual representation of a labeled dependency tree: Example
Sentence: They persuaded Mr. Trotter to take it back.
Index Word POS Head Label
1 They PRP 2 SBJ
2 persuaded VBD 0 ROOT
3 Mr. NNP 4 NMOD
4 Trotter NNP 2 IOBJ
5 to TO 6 VMOD
6 take VB 2 OBJ
7 it PRP 6 OBJ
8 back RB 6 PRT
9 . . 2 P
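A similar head/label table can be produced automatically; a hedged sketch with spaCy (assumes the en_core_web_sm model is installed; the labels it prints follow spaCy's own scheme, so they will not match the table above exactly):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("They persuaded Mr. Trotter to take it back.")

for token in doc:
    # the root points to itself in spaCy, so map it to head index 0
    head = 0 if token.head is token else token.head.i + 1
    print(token.i + 1, token.text, token.tag_, head, token.dep_)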
Dr. John Babu Syntax
Projectivity in Dependency Parsing
In dependency parsing, a sentence is said to have a projective dependency structure if, when
drawn as a tree with words in linear order, none of its dependency arcs cross. Conversely, if
any arcs cross, the dependency structure is non-projective.
A dependency tree is projective if:
the words are arranged in linear order, preceded by the artificial root node, and
all dependency arcs can be drawn above the words without any arcs crossing.
Example: Chris saw a dog yesterday which was blind.

The sentence contains an extraposition: the noun phrase modifier (the relative clause which was blind)
appears to the right, after yesterday, which requires a crossing dependency, making the structure non-projective.
English has very few cases in a treebank that need such a non-projective analysis. In
other languages, such as Czech and Turkish, the number of non-projective dependencies can be much higher.

Dr. John Babu Syntax
Projectivity and CFG
If a dependency tree can be converted into an equivalent CFG, then it must be projective.
Non-Projective Structures Cannot Be Fully Captured in CFGs
Non-projective dependency trees lead to inconsistencies when converted to CFGs. In fact,
there is no CFG that can capture a non-projective dependency.
Any conversion attempt either misses dependencies or creates crossing arcs.
Non-projectivity means that a nonterminal’s descendants do not form a contiguous
substring of the sentence.
CFGs require contiguity, making them incompatible with non-projective dependencies.
In a CFG converted from a dependency tree, we have only the following three types of rules:
one type of rule introduces the terminal symbols, and two rules make Y dependent on
X or vice versa. The head word is indicated by an asterisk (*).
Z → X* Y
Z → X Y*
A → a*
Dr. John Babu Syntax
Projectivity

Projectivity: for each word in the sentence, its descendants form a contiguous substring
of the sentence.
Non-projectivity: there is a word in the sentence whose descendants do not form a
contiguous substring of the sentence.
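The definition can be checked mechanically; a small illustrative sketch (the head indices are assumptions for illustration, with the Trotter analysis taken from the earlier table):

def is_projective(heads):
    # heads[i] is the 1-based head of word i+1, with 0 for the root;
    # the tree is projective iff no two arcs (including the root arc) cross
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for l1, r1 in arcs:
        for l2, r2 in arcs:
            if l1 < l2 < r1 < r2:   # one endpoint inside the other arc, one outside
                return False
    return True

# "They persuaded Mr. Trotter to take it back ." (heads from the table above)
print(is_projective([2, 0, 4, 2, 6, 2, 6, 6, 2]))   # True

# "Chris saw a dog yesterday which was blind", with the relative clause headed by
# "blind" attached to "dog" (an assumed analysis): dog->blind crosses saw->yesterday
print(is_projective([2, 0, 4, 2, 2, 8, 8, 4]))      # False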

Dr. John Babu Syntax


3.3.2: Syntax Analysis Using Phrase Structure Trees

In natural language processing, syntax analysis helps us understand the structure of sentences.
One common method is phrase structure analysis, which breaks a sentence into smaller units
called constituents. These constituents group words together based on their grammatical
relationships, forming a hierarchical tree known as a phrase structure tree.
A phrase structure tree is a graphical representation of how different parts of a sentence fit
together. It follows generative grammar principles, which help handle complex sentence
structures like long-distance relationships between words.

Dr. John Babu Syntax


Example: A Simple Phrase Structure Tree. Consider the sentence: Mr. Baker seems especially sensitive.
Its phrase structure tree looks like this:

[Figure: phrase structure tree for "Mr. Baker seems especially sensitive."]

S (Sentence) is the root of the tree.


NP-SBJ (Noun Phrase - Subject) represents Mr. Baker as the subject of the sentence.
VP (Verb Phrase) contains seems especially sensitive as the predicate.
ADJP-PRD (Adjective Phrase - Predicate) describes the adjective phrase especially sensitive modifying the verb
seems.
This structure shows that seems is the main verb, with Mr. Baker as the subject and especially sensitive as the
complement.
Dr. John Babu Syntax
The same sentence gets the following dependency tree analysis. The information from the
bracketing labels from the phrase structure analysis gets mapped onto the labeled arcs of the
dependency analysis. Typically, dependency analysis would not link the subject with the
predicate directly because it would create an inconvenient crossing dependency with the
dependency between seems and the root symbol.

[Figure: dependency tree for "Mr. Baker seems especially sensitive."]

Dr. John Babu Syntax


Phrase Structure Trees vs. Dependency Trees

A phrase structure syntax analysis of a sentence derives from the traditional sentence
diagrams that partition a sentence into constituents, and larger constituents are formed
by merging smaller ones.
Phrase structure analysis also typically incorporate ideas from generative grammar (from
linguistics) to deal with displaced constituents or apparent long-distance relationships
between heads and constituents.
A phrase structure tree can be viewed as implicitly having a predicate-argument structure
associated with it.
Both approaches are useful:
Phrase structure trees are better for identifying hierarchical structures and phrase
boundaries.
Dependency trees provide a more direct representation of word relationships, making
them useful for machine translation and information extraction.

Dr. John Babu Syntax


Parsing Algorithms I

Parsing is the process of analyzing an input sentence according to the rules of a grammar to
determine its structure. Given an input sentence, a parser produces an output analysis, which
we assume matches the structure defined by a treebank used for training. Treebank parsers
often do not require an explicit grammar, but to simplify the explanation, we will first consider
parsing algorithms that assume the existence of a Context-Free Grammar (CFG).
A Context-Free Grammar (CFG) consists of a set of production rules that define how
sentences can be derived from a starting symbol. Consider the following simple CFG G that
generates strings such as a and b or c from the start symbol N:

N -> N ’and’ N
N -> N ’or’ N
N -> ’a’ | ’b’ | ’c’

Dr. John Babu Syntax


Parsing Algorithms II

Each rule describes how a nonterminal symbol (N) can be rewritten into terminal symbols (a,
b, c, and, or) or combinations of N. This allows us to build complex expressions using a
hierarchical structure.
Derivation in Parsing
A derivation is a step-by-step process that shows how an input sentence can be generated
using the CFG. Consider the input sentence:

a and b or c

A rightmost derivation follows these steps:

Dr. John Babu Syntax


Parsing Algorithms III

N
=> N ’or’ N # Applying rule N -> N or N
=> N ’or c’ # Applying rule N -> c
=> N ’and’ N ’or c’ # Applying rule N -> N and N
=> N ’and b or c’ # Applying rule N -> b
=> ’a and b or c’ # Applying rule N -> a

Here, each step applies a rule from the CFG, progressively transforming the start symbol (N)
into the full input sentence. Each line of this process is called a sentential form.
A rightmost derivation always expands the rightmost nonterminal first. If we reverse the
derivation order, we see how the sentence structure builds up:

Dr. John Babu Syntax


Parsing Algorithms IV
’a and b or c’
=> N ’and b or c’ # use rule N -> a
=> N ’and’ N ’or c’ # use rule N -> b
=> N ’or c’ # use rule N -> N and N
=> N ’or’ N # use rule N -> c
=> N # use rule N -> N or N

This derivation sequence corresponds to the following parse tree, constructed from left to right:
[Figure: parse tree for "a and b or c"]
Dr. John Babu Syntax
Parsing Algorithms V

Alternative Parse Trees and Ambiguity in Parsing


One key observation is that a unique derivation sequence is not guaranteed. There can be
multiple valid derivations, leading to different parse trees. For example, another rightmost
derivation results in a different tree structure. This alternative derivation sequence follows these steps:
[Figure: an alternative parse tree and its derivation steps for "a and b or c"]

Dr. John Babu Syntax


Parsing Algorithms VI

Both trees represent different ways of interpreting the sentence structure. This ambiguity is
common in natural language and must be resolved when designing efficient parsers.

Dr. John Babu Syntax


Classification of Parsing Algorithms in NLP I

Parsing algorithms in Natural Language Processing (NLP) can be broadly classified into
different categories based on their approach to syntactic analysis. Below are the primary types:
Transition-Based Parsing
Transition-based parsing builds a parse tree by taking incremental parsing decisions in a
state-based manner. It is commonly used for dependency parsing.
Uses a stack, buffer, and set of actions to determine a parse tree.
Relies on machine learning models (e.g., neural networks) to predict parsing actions.
Example : Shift-Reduce Parsing Algorithm

Dr. John Babu Syntax


Classification of Parsing Algorithms in NLP II

Dynamic Programming-Based Parsing
These parsers use bottom-up or top-down approaches and store intermediate results to
avoid redundant computations. They are widely used for constituency parsing.
Uses a chart to store computed subtrees.
Avoids recomputation of overlapping subproblems.
Example : CKY (Cocke-Kasami-Younger) Algorithm

Dr. John Babu Syntax


Classification of Parsing Algorithms in NLP III

Graph-Based Parsing
Graph-based parsers construct the best possible dependency tree by treating parsing as a
graph optimization problem.
Constructs a weighted graph where words are nodes and dependencies are edges.
Uses maximum spanning tree (MST) algorithms to find the best parse.
Example: maximum spanning tree parsing with the Chu-Liu/Edmonds algorithm (Eisner's
algorithm handles the projective case).

Dr. John Babu Syntax


Classification of Parsing Algorithms in NLP IV

Neural Network-Based Parsing
Neural network-based approaches use deep learning models (e.g., transformers, LSTMs)
to learn parsing patterns from large datasets.
Key Idea:
Uses word embeddings (e.g., BERT) to encode words.
Relies on sequence-to-sequence models or self-attention mechanisms.
Example : BiLSTM-Based Dependency Parsing Algorithm

Dr. John Babu Syntax


Shift-Reduce Parsing I

To build a parser, we need an algorithm that can perform the steps of a rightmost
derivation (in reverse) for any grammar and for any input string.
Every CFG turns out to have an automaton that is equivalent to it, called a pushdown
automaton. A pushdown automaton is simply a finite-state automaton with some additional
memory in the form of a stack (or pushdown).
This is a limited amount of memory because only the top of the stack is used by the machine.
This provides an algorithm for parsing that is general for any given CFG and input string.
This algorithm, called shift-reduce parsing, uses two data structures: a buffer for input
symbols and a stack for storing CFG symbols.
The shift-reduce parsing algorithm consists of the following actions:
1 Shift: Move the next word from the buffer to the stack.
2 Reduce: Apply a production rule to reduce the top elements of the stack into a
non-terminal.

Dr. John Babu Syntax


Shift-Reduce Parsing II

3 Left-Arc: Establish a dependency where the second-top element of the stack is the head
of the top element, removing the top element from the stack.
4 Right-Arc: Establish a dependency where the top element of the stack is the head of the
second-top element, removing the second-top element from the stack.

Dr. John Babu Syntax


Shift Reduce Parsing I
The shift-reduce parsing algorithm is defined as follows:
1 Start with an empty stack and the buffer containing the input string.

2 Exit with success if the top of the stack contains the start symbol of the grammar and if

the buffer is empty


3 Choose between the following two steps (if the choice is ambiguous, choose one based on

an oracle):
Shift a symbol from the buffer onto the stack.
If the top k symbols of the stack are α1 ... αk, which correspond to the right-hand side of a
CFG rule A → α1 ... αk, then replace the top k symbols with the left-hand side non-terminal
A. (Reduce)
4 Exit with failure if no action can be taken in previous step.
5 Else, go to step 2.
This parsing technique is categorized as bottom-up parsing, meaning it builds the parse tree
from the leaves (input symbols) and works its way up to the root (start symbol of the
grammar).
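A minimal shift-reduce recognizer for the toy grammar N → N 'and' N | N 'or' N | 'a' | 'b' | 'c' (a sketch, not from the slides; it uses a greedy stand-in for the oracle that reduces whenever the top of the stack matches a right-hand side):

RULES = [
    (("N", "and", "N"), "N"),
    (("N", "or", "N"), "N"),
    (("a",), "N"),
    (("b",), "N"),
    (("c",), "N"),
]

def shift_reduce(tokens):
    stack, buffer = [], list(tokens)
    while True:
        # Reduce: does the top of the stack match the RHS of some rule?
        reduced = False
        for rhs, lhs in RULES:
            k = len(rhs)
            if tuple(stack[-k:]) == rhs:
                stack[-k:] = [lhs]      # replace the RHS with the LHS non-terminal
                reduced = True
                break
        if reduced:
            continue
        if buffer:                      # otherwise Shift the next input symbol
            stack.append(buffer.pop(0))
            continue
        # No action possible: succeed iff only the start symbol remains
        return stack == ["N"]

print(shift_reduce("a and b or c".split()))  # True
print(shift_reduce("a and or c".split()))    # False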
Dr. John Babu Syntax
Shift Reduce Parsing II

Example: consider the same toy CFG as before:

N -> N 'and' N
N -> N 'or' N
N -> 'a' | 'b' | 'c'

We will parse the sentence:

a and b or c

Dr. John Babu Syntax


Shift Reduce Parsing III

Using shift-reduce parsing, the operations are as follows:
[Figure: the sequence of shift and reduce operations for "a and b or c"]

Dr. John Babu Syntax


Shift Reduce Parsing IV

Dependency Parsing using the Shift-Reduce Algorithm
Shift-Reduce parsing is also applied to


dependency parsing, where relationships between words are determined in terms of a

Dr. John Babu Syntax


Shift Reduce Parsing V

head-dependent structure

Dr. John Babu Syntax


Shift Reduce Parsing VI

Advantages of Shift-Reduce Parsing


Efficient: Works in linear time O(n) for many cases.
Simple Implementation: Requires only a stack and a buffer.
Incremental Parsing: Suitable for real-time applications like speech recognition.
Flexibility: Easily integrates with machine learning models for statistical parsing.
Challenges
Ambiguity: Sometimes the parser must choose between multiple valid actions.
Backtracking Required: If an incorrect decision is made, backtracking or heuristics (like
probabilistic models) are needed.
Grammar Constraints: It works best with LR(1) grammars and may require modifications
for complex cases.

Dr. John Babu Syntax


Shift Reduce Parsing VII

Shift-Reduce parsing is an essential technique in syntactic analysis and dependency parsing. It


efficiently handles parsing in compilers, NLP applications, and machine translation. By
integrating probabilistic models, modern parsers enhance decision-making in ambiguous
situations, making this technique a powerful tool for natural language understanding.

Dr. John Babu Syntax


Oracle in Shift-Reduce Parsing I

An oracle in NLP parsing is a decision-making component that helps the parser choose the
correct sequence of shift and reduce operations. In shift-reduce parsing:
Multiple parsing paths may be possible at each step.
The oracle predicts the correct action based on training data or predefined heuristics.
This ensures that the parser does not make mistakes that require backtracking.
If trained on a large corpus of correctly parsed sentences, the oracle learns to select the best
action at each step.
It may use machine learning models (such as neural networks or probabilistic models) to
decide whether to shift or reduce.
If the oracle is perfect, shift-reduce parsing can be done in linear time.

Dr. John Babu Syntax


Oracle in Shift-Reduce Parsing II

Necessity of an oracle
Without an oracle, the parser might choose the wrong action, requiring backtracking
(reparsing from an earlier state).
In worst-case scenarios, backtracking can lead to exponential parsing time due to
repeatedly trying different possibilities.

Dr. John Babu Syntax
