3. Syntax Parsing
Tushar B. Kute,
http://tusharkute.com
Constituency Grammar
• Intuitive Approach:
– Constituency grammar offers a clear and intuitive way
to represent the structure of sentences, making it
easier to understand how words are grouped together.
• Formalization:
– The use of rewrite rules allows for a formal and
systematic approach to sentence analysis.
• Applications:
– Constituency grammar has various applications,
including natural language processing (NLP), machine
translation, and syntactic parsing.
Limitations
• Focus on Structure:
– CFGs focus on the hierarchical structure of
sentences, defining how words can be
grouped into phrases and clauses to form a
complete sentence.
– They don't consider the surrounding words
when applying a production rule.
Context Free Grammar
• Rewrite Rules:
– CFGs use rewrite rules, also called productions,
to specify how to generate sentences.
– These rules define how a non-terminal symbol (a
placeholder for a phrase or clause) can be
replaced by a sequence of terminal symbols
(actual words) or other non-terminal symbols.
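As a concrete illustration, here is a minimal sketch of rewrite rules using NLTK's CFG class (assuming NLTK is installed; the toy grammar is invented for illustration):

import nltk
from nltk.parse.generate import generate

# A toy CFG: each line is a rewrite rule (production). Uppercase
# symbols (S, NP, ...) are non-terminals; quoted strings are
# terminal symbols (actual words).
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'dog' | 'cat'
V -> 'chased'
""")

# Expand the rules from S to enumerate sentences the grammar licenses.
for words in generate(grammar, n=4):
    print(' '.join(words))
# the dog chased the dog
# the dog chased the cat
# ...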
Grammar Rules for English
• Parts of Speech:
– English sentences are built from eight main parts
of speech: nouns, verbs, adjectives, adverbs,
pronouns, prepositions, conjunctions, and
interjections.
– Each part of speech has a specific function within a
sentence.
• Nouns: represent people, places, things, or ideas
(e.g., cat, book, happiness).
• Verbs: describe actions, states of being, or
occurrences (e.g., run, sleep, happen).
Grammar Rules for English
• Sentence Structure:
– Basic English sentences follow a Subject-Verb-
Object (SVO) word order.
• Subject: who or what the sentence is about (e.g.,
The girl).
• Verb: the action or state of being (e.g., kicks).
• Object: receives the action of the verb (e.g., the
ball, as in "The girl kicks the ball").
– Sentences can also be more complex with
prepositional phrases, adjective clauses, adverb
clauses, etc.
Grammar Rules for English
• Subject-Verb Agreement:
– The subject and verb in a sentence must agree in
number (singular or plural).
• Singular subject - singular verb (e.g., The cat
jumps).
• Plural subject - plural verb (e.g., The cats
jump).
Grammar Rules for English
• Verb Tenses:
– Verbs are conjugated to indicate time (past,
present, future) and aspect (simple, continuous,
perfect).
• English has 12 basic verb tenses that express
different nuances of time.
Grammar Rules for English
• Articles:
– English uses two articles, "a/an" (indefinite) and
"the" (definite), to indicate whether a noun is
being referred to for the first time or is already
known.
Grammar Rules for English
• Punctuation:
– Punctuation marks like periods, commas,
question marks, etc., are used to separate
clauses, indicate pauses, and convey meaning
clearly.
Grammar Rules for English
• Sentence Types:
– English sentences can be declarative
(statements), interrogative (questions),
imperative (commands), or exclamatory
(exclamations).
Treebank
• Semantic Treebanks:
– These delve deeper into the meaning of
sentences, annotating the semantic roles of
words and their relationships within the
sentence.
– They might utilize various formalisms to
represent the meaning structure.
Treebank: Types
• Penn Treebank:
– A widely used treebank for English, focusing on
syntactic structure.
• FrameNet:
– A semantic treebank that annotates sentences based
on semantic frames, which represent stereotypical
situations involving roles and participants.
• PropBank:
– Another semantic treebank that focuses on verb
argument structure, labeling the semantic roles of
noun phrases relative to a verb.
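Penn Treebank annotations are conventionally written as bracketed trees. A minimal sketch with NLTK's Tree class, using a hand-written bracketing in the Penn style (illustrative, not an actual corpus sentence):

from nltk import Tree

# A hand-written Penn-Treebank-style bracketing.
bracketed = ("(S (NP (DT The) (NN cat)) "
             "(VP (VBZ sits) (PP (IN on) (NP (DT the) (NN mat)))))")
tree = Tree.fromstring(bracketed)
tree.pretty_print()    # render the constituency tree as ASCII art
print(tree.leaves())   # ['The', 'cat', 'sits', 'on', 'the', 'mat']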
Grammar Equivalence and Normal Forms
• Normal forms
– are specific types of grammars with particular
properties that make them easier to analyze and
manipulate.
– There are different types of normal forms for
CFGs, with some of the most common being:
• Chomsky Normal Form (CNF)
• Greibach Normal Form (GNF)
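As a sketch of what CNF involves in practice, NLTK can binarize a parse tree so that every production has at most two children, which is the binary-branching requirement of Chomsky Normal Form (assuming NLTK; the tree below is hand-written for illustration):

from nltk import Tree

# An illustrative parse tree with a ternary VP.
t = Tree.fromstring(
    "(S (NP (DT the) (NN dog)) "
    "(VP (VBD chased) (NP (DT the) (NN cat)) "
    "(PP (IN into) (NP (DT the) (NN yard)))))")

# Binarize in place so every production has at most two children.
t.chomsky_normal_form()
t.pretty_print()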
Lexical Functional Grammar (LFG)
• Key Features:
– Dual Representation: LFG employs two separate
levels of representation:
• Constituent structure (c-structure): similar to CFG
phrase-structure rules, it defines the surface word
order and constituency of sentences.
• Functional structure (f-structure): represents
grammatical functions like subject and object,
independent of word order.
– Lexical Entries: Words in LFG have rich lexical entries
that specify their syntactic and semantic properties.
Constituency Parsing
• Core Idea:
– Imagine a sentence as a tree structure. The root of
the tree represents the entire sentence, and it
branches out into smaller and smaller constituents.
– These constituents can be individual words, phrases
(like noun phrases or verb phrases), or even clauses.
– Constituency parsing aims to identify these
constituents and their hierarchical relationships
within the sentence tree.
Constituency Parsing
• Process:
– Input: The parser takes a sentence as input.
– Constituent Identification: The parser identifies the
different constituents within the sentence, such as noun
phrases (NPs), verb phrases (VPs), adjective phrases (ADJPs),
etc.
– Hierarchical Relationships: The parser determines the
hierarchical relationships between the constituents. For
example, an NP might be a child of a VP, and a VP might be a
child of the main sentence (S).
– Output: The parser outputs a parse tree that visually
represents the identified constituents and their
relationships.
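A minimal end-to-end sketch of this process with NLTK's chart parser; the toy grammar and sentence are invented, and the PP-attachment ambiguity deliberately yields two trees:

import nltk

# "in the park" can attach to the NP or the VP, so the
# parser returns two trees for the same sentence.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | Det N PP
VP -> V NP | V NP PP
PP -> P NP
Det -> 'the' | 'a'
N -> 'girl' | 'dog' | 'park'
V -> 'saw'
P -> 'in'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse('the girl saw a dog in the park'.split()):
    tree.pretty_print()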
CKY Parsing
• Core Functionality:
– Bottom-up Approach:
• Unlike top-down parsers that start from the entire
sentence and break it down, CKY parsing starts from
individual words and builds up progressively to
identify larger constituents and ultimately the entire
sentence structure.
– Dynamic Programming:
• It leverages dynamic programming, a technique that
stores intermediate results to avoid redundant
calculations. This makes CKY parsing efficient for
handling even complex sentences.
CKY Parsing
• Process:
– Input:
• The algorithm takes a sentence (string of words)
and a grammar (set of production rules) as input.
– Initialization:
• A 2D table is created, where rows and columns
represent positions in the sentence.
• Initially, each cell holds the non-terminal symbols
(grammar variables) that can generate the single
word at that position based on the grammar rules.
CKY Parsing
• Bottom-up Filling:
– The algorithm iterates through the table diagonally, filling
each cell. For a given cell, it checks all possible ways to
combine constituents from the cells below and to the left
(based on grammar rules) to see if they can generate a
larger constituent that can span the current cell's range in
the sentence.
• Output:
– After processing the entire table, the cell representing the
entire sentence (top-right corner) should contain the start
symbol of the grammar if the sentence is grammatically
correct according to the provided grammar.
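A minimal pure-Python recognizer version of this algorithm, assuming a toy grammar already in Chomsky Normal Form:

from itertools import product

# Toy CNF grammar: either A -> B C or A -> 'word'.
binary = {('NP', 'VP'): {'S'}, ('Det', 'N'): {'NP'}, ('V', 'NP'): {'VP'}}
lexical = {'the': {'Det'}, 'dog': {'N'}, 'cat': {'N'}, 'chased': {'V'}}

def cky_recognize(words):
    n = len(words)
    # table[i][j] holds every non-terminal that can span words[i:j]
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):                 # initialization
        table[i][i + 1] = set(lexical.get(w, ()))
    for span in range(2, n + 1):                  # bottom-up filling
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):             # every split point
                for b, c in product(table[i][k], table[k][j]):
                    table[i][j] |= binary.get((b, c), set())
    return 'S' in table[0][n]                     # the whole-sentence cell

print(cky_recognize('the dog chased the cat'.split()))  # True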
Span based neural constituency parsing
• Core Idea:
– Shift from Rules to Spans:
• Unlike traditional parsers that rely on predefined
grammatical rules, span-based parsing focuses on identifying
spans of words (contiguous sequences) that represent
constituents. The neural network model learns to score and
predict these spans directly from the training data.
– Neural Network Power:
• The model leverages the power of neural networks to
capture complex patterns and relationships within
sentences. This allows it to potentially handle ambiguities
and variations in language structure better than rule-based
approaches.
Span based neural constituency parsing
• Process:
– Input:
• The model takes a sentence as input.
– Word Representation:
• Each word in the sentence is converted into a
vector representation using techniques like
word embedding.
• This vector captures the semantic and
syntactic properties of the word.
Span based neural constituency parsing
• Process:
– Span Scoring:
• The model then employs a neural network architecture to score
each possible span in the sentence (considering all start and
end positions for contiguous sequences). This scoring takes
into account the word representations of the words within the
span and their potential to form a grammatical constituent.
– Prediction and Tree Building:
• Based on the span scores, the model predicts the most likely
set of spans that represents the grammatical structure of the
sentence. This prediction can then be used to build a parse tree
depicting the hierarchical relationships between the
constituents.
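A deliberately toy sketch of this enumerate-and-score idea using NumPy; the random vectors and dot-product scorer are stand-ins for a trained neural encoder and learned scoring layer:

import numpy as np

rng = np.random.default_rng(0)
words = ['the', 'dog', 'chased', 'the', 'cat']

# Stand-in word vectors and weights; a real parser learns these
# with a neural encoder (e.g., LSTM or transformer) on a treebank.
vecs = {w: rng.normal(size=8) for w in set(words)}
weight = rng.normal(size=8)

def span_score(i, j):
    # Toy scorer: mean-pool the word vectors in the span, then take
    # a dot product with a weight vector (stand-in for a learned layer).
    pooled = np.mean([vecs[w] for w in words[i:j]], axis=0)
    return float(pooled @ weight)

# Score every contiguous span (i, j) with j > i, as described above.
scores = {(i, j): span_score(i, j)
          for i in range(len(words))
          for j in range(i + 1, len(words) + 1)}
best = max(scores, key=scores.get)
print(best, words[best[0]:best[1]], round(scores[best], 3))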
Common Evaluation Metrics
• Precision:
– This metric measures the proportion of the parser's
predicted constituents that are actually correct
according to the gold standard.
• Recall:
– This metric measures the proportion of the gold
standard constituents that are correctly identified by
the parser.
• F1-Score:
– This is a harmonic mean of precision and recall,
providing a balanced view of the parser's performance.
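A minimal sketch of these three metrics computed over labeled constituent spans (the gold and predicted sets here are invented):

# Labeled spans as (label, start, end): gold standard vs. prediction.
gold = {('NP', 0, 2), ('VP', 2, 5), ('NP', 3, 5), ('S', 0, 5)}
pred = {('NP', 0, 2), ('VP', 2, 5), ('PP', 3, 5), ('S', 0, 5)}

correct = len(gold & pred)
precision = correct / len(pred)   # fraction of predictions that are right
recall = correct / len(gold)      # fraction of gold constituents found
f1 = 2 * precision * recall / (precision + recall)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")  # P=0.75 R=0.75 F1=0.75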
Partial Parsing: Motivation
• Efficiency:
– For some tasks, a complete parse tree might not be necessary.
Partial parsing can be more efficient, especially for large
datasets or real-time applications.
• Focus on Specific Elements:
– Sometimes, the focus might be on identifying specific
grammatical elements like named entities, verb phrases, or
noun phrases. Partial parsing can be tailored to extract these
elements directly.
• Handling Complexity:
– Complex sentences or ungrammatical structures can be
challenging for full parsers. Partial parsing might be able to
extract useful information even in such cases.
Partial Parsing: Types
• Chunking:
– This involves identifying and labeling non-
overlapping chunks of words representing
phrases like noun phrases, verb phrases, or
prepositional phrases.
• Tagging:
– This assigns part-of-speech (POS) tags to
individual words, indicating their grammatical
function (noun, verb, adjective, etc.).
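A minimal chunking-plus-tagging sketch with NLTK's RegexpParser; it assumes NLTK's tokenizer and tagger models are downloaded, and the chunk rule is a common textbook pattern:

import nltk

# May require one-time downloads:
#   nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
sentence = "The quick brown fox jumped over the lazy dog"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))   # POS tagging

# Chunk rule: an NP is an optional determiner, any number of
# adjectives, then a noun.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
print(chunker.parse(tagged))   # Tree with NP chunks over tagged words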
Partial Parsing: Types
• Entity Recognition:
– This focuses on identifying and classifying named
entities like people, organizations, or locations
within the sentence.
• Shallow Parsing:
– This involves extracting basic syntactic
information like subject-verb relationships or
dependency links between words.
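A minimal sketch of both tasks with spaCy, assuming the en_core_web_sm model is installed:

import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple hired Tim Cook in Cupertino.")

for ent in doc.ents:            # entity recognition
    print(ent.text, ent.label_)
for chunk in doc.noun_chunks:   # shallow NP chunks
    print(chunk.text)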
CCG parsing
• Core Idea:
– Categories and Combinators:
• CCG assigns lexical categories to words and phrases
that represent their grammatical function and how
they can combine with other elements. These
categories can be complex, reflecting the richness of
natural language.
– Combinatory Logic:
• CCG employs a set of predefined operations
(combinators) that specify how categories can be
combined to form new categories. These operations
determine the valid syntactic structures for sentences.
CCG parsing
• Input:
– The parser takes a sentence as input.
• Lexical Categorization:
– Each word in the sentence is assigned a lexical
category based on its part-of-speech and its role in
combining with other words.
– For example, a noun phrase has a category like "NP",
and a transitive verb a category like "(S\NP)/NP" (it
combines first with an NP to its right, then with an
NP to its left, and returns an S (sentence)).
CCG parsing
• CCG Combinators:
– The parser applies the CCG combinators to combine the
categories of adjacent words, following the valid rules
defined by the combinators. These combinations build
up a structure that reflects the grammatical
relationships between words.
• Derivation and Output:
– If a valid combination of categories leads to the category
"S" (sentence) at the end, the parser has successfully
derived a grammatical structure for the sentence. This
derivation process essentially shows how the individual
words combine to form the complete sentence.
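A toy sketch of the two application combinators, with categories encoded as Python tuples; a real CCG parser supports further combinators such as composition and type-raising:

# A category is an atomic string ('NP', 'S') or a tuple
# (slash, result, argument): '/' seeks its argument to the
# right, '\' to the left.  '(S\NP)/NP' becomes:
IV = ('\\', 'S', 'NP')       # S\NP  (verb phrase category)
TV = ('/', IV, 'NP')         # (S\NP)/NP  (transitive verb)

def forward(left, right):
    """Forward application:  X/Y  Y  =>  X."""
    if isinstance(left, tuple) and left[0] == '/' and left[2] == right:
        return left[1]

def backward(left, right):
    """Backward application:  Y  X\\Y  =>  X."""
    if isinstance(right, tuple) and right[0] == '\\' and right[2] == left:
        return right[1]

# "dogs chase cats"  ->  NP  (S\NP)/NP  NP
vp = forward(TV, 'NP')       # chase + cats  =>  S\NP
s = backward('NP', vp)       # dogs + vp     =>  S
print(vp, s)                 # ('\\', 'S', 'NP') S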
Dependency Parsing
• Core Concept:
– Words and Dependencies:
• Each word in a sentence is considered a node, and
the parser identifies the grammatical dependency
between words.
• A dependency link connects a "head" word to its
"dependent" word, indicating how the dependent
word modifies or complements the meaning of
the head word.
Dependency Parsing
• Process:
– Input: The parser takes a sentence as input.
– Dependency Identification: The parser analyzes the
sentence to identify the head word for each word and
the grammatical relationship between them. This
relationship can be labeled with specific dependency
tags like "subject," "object," "modifier," etc.
– Output: The parser outputs a dependency graph (or
tree) that visually represents the identified
relationships. Each word is a node in the graph, and
directed arrows connect the heads to their dependents,
along with the dependency labels.
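A minimal sketch of this input/output behavior using spaCy's statistical dependency parser, assuming the en_core_web_sm model is installed:

import spacy

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("The dog chased the cat")

for token in doc:
    # each word points at its head, with a dependency label on the arc
    print(f"{token.text:8} <--{token.dep_}-- {token.head.text}")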
Dependency Parsing: Types
• Rule-based Parsing:
– Relies on predefined grammatical rules to
identify dependencies.
• Statistical Parsing:
– Uses statistical models trained on large datasets
of pre-annotated sentences to predict the most
likely dependencies.
Dependency Parsing: Benefits
• Straightforward Relationships:
– Dependency parsing directly captures the grammatical
relationships between words, making it intuitive and
interpretable.
• Handling Complexities:
– It can handle complex sentence structures and word
orders more effectively than constituency parsing in
some cases.
• Cross-Lingual Applicability:
– Dependency parsing can be more easily adapted to
different languages compared to constituency parsing.
Dependency Relations
• Subject (subj): This identifies the subject of a verb. (e.g., "The dog
(dependent, subj) chased (head) the cat").
• Object (obj): This identifies the direct or indirect object of a verb.
(e.g., "I gave (head) a gift (dependent, obj) to her").
• Modifier (mod): This is a general category for various modifiers,
including adjectives (e.g., "a red (dependent, mod) car (head)"),
adverbs (e.g., "She ran (head) quickly (dependent, mod)"), and
prepositional phrases (e.g., "The house (head) on the hill (dependent,
mod) is beautiful").
• Possessive (poss): This identifies the possessor of a noun. (e.g., "The
man's (dependent, poss) hat (head) is red").
• Aux (aux): This identifies auxiliary verbs that help form verb tenses.
(e.g., "She has (dependent, aux) been (dependent, aux) waiting (head)
for a long time").
Dependency Treebank: Components
• Sentences:
– The treebank consists of a large number of sentences
from a specific language.
• Dependency Annotations: Each sentence is annotated with
its dependency relationships. This annotation involves:
– Identifying the head word for each word in the sentence.
– Labeling the dependency relation between the head
word and its dependent words. These labels specify the
grammatical role of the dependent word (e.g., subject,
object, modifier).
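Dependency treebanks are commonly distributed in the CoNLL-U format; below is a hand-made, heavily simplified fragment (only the ID, FORM, HEAD, and DEPREL columns of the real 10-column format; HEAD = 0 marks the root) and a few lines to read it:

# A simplified CoNLL-U-style fragment (illustrative only).
rows = """\
1 The    2 det
2 dog    3 nsubj
3 chased 0 root
4 the    5 det
5 cat    3 obj"""

tokens = [line.split() for line in rows.splitlines()]
forms = {idx: form for idx, form, _, _ in tokens}
for idx, form, head, rel in tokens:
    governor = 'ROOT' if head == '0' else forms[head]
    print(f"{form} --{rel}--> {governor}")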
Transition-Based Dependency Parsing
• Process:
– Input: The parser takes a sentence as input.
– Initial State: Parsing starts with an initial state, often representing an
empty dependency structure.
– Transition Sequence: The model predicts a sequence of transitions based
on the current state. These transitions can involve actions like:
• Shift: Move the next word from the sentence to the processing stack.
• Arc-Left/Right: Create a dependency link between a word on the stack
and another word in the structure, depending on whether the head
word is to the left or right.
• Reduce: Finalize a dependency structure for a portion of the sentence.
– Final State: The parsing process ends when a designated final state is
reached, representing a complete and valid dependency structure for the
sentence.
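A toy arc-standard run over a three-word sentence, replaying a hand-chosen (oracle) action sequence instead of a trained model's predictions; the right-arc and reduce actions would be handled symmetrically:

# Arc-standard sketch over "the dog barked".
buffer = ['the', 'dog', 'barked']
stack, arcs = [], []

def shift():
    """Move the next word from the buffer onto the stack."""
    stack.append(buffer.pop(0))

def left_arc(label):
    """Top of stack becomes head of the item beneath it (arc-left)."""
    dependent = stack.pop(-2)
    arcs.append((stack[-1], label, dependent))

for action in [shift, shift,
               lambda: left_arc('det'),     # the  <-det-   dog
               shift,
               lambda: left_arc('nsubj')]:  # dog  <-nsubj- barked
    action()

print(arcs)  # [('dog', 'det', 'the'), ('barked', 'nsubj', 'dog')]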
Summary
• Constituency Parsing:
– Goal: Identifies phrases and clauses within a
sentence based on CFG rules or statistical models.
– Output: A parse tree representing the hierarchical
relationships between phrases (e.g., noun phrases,
verb phrases).
– Benefits: Suitable for tasks that require
understanding phrase structure.
– Limitations: Might struggle with complex sentences
or word order variations.
Summary
• Dependency Parsing:
– Goal: Analyzes sentence structure by focusing on
the grammatical relationships between words.
– Output: A dependency graph (or tree) showing how
words depend on each other (head-dependent
relationships).
– Benefits: More interpretable due to focus on word-
to-word relationships, handles complex structures
well.
– Limitations: Doesn't explicitly represent hierarchical
phrase structure, can struggle with ambiguity.
Thank you
This presentation was created using LibreOffice Impress 7.4.1.2 and may be used freely under the GNU General Public License.
Web Resources
https://mitu.co.in
http://tusharkute.com
@mituskillologies