04 - Parsing in NLP
Syntactic Analysis
Concept of Grammar
Phases of Natural Language Processing
[Figure: the NLP pipeline: Morphological Analysis (stem, morpheme, POS), Syntactic Analysis (grammar rules), Semantic Analysis (semantic rules), Pragmatic Analysis (contextual information).]
What is Syntactic Analysis
Syntactic analysis, or parsing, is the process of analyzing natural language with the
rules of a formal grammar. Grammatical rules are applied to categories and groups
of words, NOT individual words. Syntactic analysis basically assigns a syntactic
structure to text.
• Use of Noun-Verb pair: A sentence includes a subject and a predicate. We combine
every noun phrase with a verb phrase in the sentence.
Example: The dog (noun phrase) went away (verb phrase)
• Adjective before Noun: Adjectives are usually placed before the noun they describe.
Example: The beautiful garden was blooming with flowers.
• Use of Articles: 'A' or 'an' is used before singular, countable nouns that are not
specific; 'the' is used before specific nouns.
Example: A cat sat on the mat. (any cat)
Example: The cat sat on the mat. (a specific cat)
• Proper Placement of Modifiers: Modifiers should be placed next to the word they
modify.
Example: She drove almost six hours to get home. (not: "She almost drove six hours to get home.")
• Pronoun Antecedent Agreement: Pronouns must agree with their antecedents in
number and gender.
Example: Every student must bring his or her own pencil.
• Subject-Verb Agreement: A singular subject takes a singular verb, while a plural
subject takes a plural verb.
Example: The dog barks. (singular)
Example: The dogs bark. (plural)
Chomsky Hierarchy of Grammar
• The field of formal language theory (FLT), initiated by Noam Chomsky, sets a
minimal limit on descriptive adequacy.
• Chomsky's approach entirely ignores meaning, usage of expressions, frequency,
context dependence, and processing complexity of natural language.
• Chomsky's theory only assumes that patterns that are productive for short strings
apply to strings of arbitrary length in an unrestricted way.
• An expression in the sense of FLT is simply a finite string of symbols, and a
(formal) language is a set of such strings. Chomsky's theory explores the
mathematical and computational properties of such sets.
• The immense success of his framework influenced not only linguistics but also
theoretical computer science and molecular biology.
• In particular, FLT deals with formal languages (= sets of strings) that are defined
by a finite set of rules, i.e., a grammar (𝒢).
• Grammar in FLT is composed of four elements:
(1) a finite vocabulary of symbols (Σ), referred to as terminals, that appear in the
strings of the language;
(2) a finite vocabulary of extra symbols called non-terminals (NT);
(3) a special designated non-terminal called the start symbol (S);
(4) a finite set of rules (R).
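To make the four components concrete, here is a minimal sketch in plain Python (the toy vocabulary is illustrative, not taken from the slides):

# A grammar G = (Sigma, NT, S, R) written as plain Python data.
Sigma = {"the", "cat", "chases", "mouse"}        # (1) terminals
NT = {"S", "NP", "VP", "Det", "N", "V"}          # (2) non-terminals
start = "S"                                      # (3) start symbol
R = {                                            # (4) finite set of rules
    "S":   [("NP", "VP")],
    "NP":  [("Det", "N")],
    "VP":  [("V", "NP")],
    "Det": [("the",)],
    "N":   [("cat",), ("mouse",)],
    "V":   [("chases",)],
}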
Use of Articles:
Rule: The definite article 'the' is used before a noun that is specific or known
to the listener, while 'a' or 'an' is used for non-specific nouns in the singular
form.
She wants an apple from the basket.
Subjunctive Mood:
Rule: The subjunctive mood is used for wishes, hypotheticals, or actions that
are contrary to fact.
If I were you, I would not do that.
• These rules illustrate how the context surrounding words or phrases can
dictate the appropriate grammatical forms to use, which is a hallmark of
context-sensitive (Type-1) grammars.
• Starting from a string in question β, there are finitely many ways in which
rules can be applied backward to it.
Chomsky Hierarchy of Grammar (contd)
Type 2 - Context-free Grammar
Chomsky Type-2 Grammar, also known as context-free grammar (CFG), is a formal
grammar in which every production rule is of the form α → β, where α is a single non-
terminal symbol and β is a string of terminals and/or non-terminals (β can be
empty). The productions need NOT satisfy the condition len(α) <= len(β).
- For example, the language of all strings with an equal number of 'a's and 'b's, in
any order, is context-free; it is generated by the rules
S → aSb | bSa | SS | ε
- Further, a CFG follows a hierarchical structure, i.e., it consists of a set of production
rules that can be applied recursively and can generate a tree structure.
The hierarchical structure refers to the way sentences can be broken down into
smaller parts, and those parts can be broken down further, following the CFG rules.
This leads to the creation of a parse tree, which visually represents the breakdown of
a sentence into its grammatical parts.
In a parse tree for a context-free grammar:
The root node is typically the start symbol (often S for sentence).
The leaf nodes are terminal symbols, which correspond to the words of the sentence.
The interior nodes are non-terminal symbols, representing the syntactic categories (like
noun phrases, verb phrases, etc.).
Chomsky Hierarchy of Grammar (contd)
For the sentence "The cat chases the mouse.", we define a context-free rule as
follows:
S → NP_singular VP_singular
NP_singular → Det N_singular
VP_singular → V_singular NP

1. Start with the Sentence (S):
The initial rule identifies the sentence structure: S → NP VP

2. Expand the Noun Phrase (NP) for the Subject:
Here, we expand the noun phrase to include a determiner (Det) and a singular noun (N_singular):
NP → Det N_singular
"The cat": NP → [The][cat]

The resulting parse tree:

                S
         _______|_______
        NP              VP
     ___|___         ___|___
    Det     N       V       NP
     |      |       |     __|___
    The    cat   chases  Det    N
                          |     |
                         the  mouse
The tree shows the hierarchical structure of the sentence. The sentence is divided into a
noun phrase and a verb phrase. The noun phrase NP consists of a determiner Det
("The") and a noun N ("cat"), which together refer to the subject of the sentence. The
verb phrase VP consists of a verb V ("chases") and a noun phrase NP, which is the
object of the sentence. This object NP is again made up of a determiner "The" and a
noun "mouse".
Chomsky Hierarchy of Grammar (contd)
Type 3 - Regular Grammar
• Chomsky's Type-3 Grammar, also known as Regular Grammar, is the simplest
type of grammar in the Chomsky hierarchy.
• Type-3 grammars are suitable for describing the simplest syntactic structures:
those that involve direct adjacency and do not require nesting or recursion.
• Unlike context-free grammars, they do not allow hierarchical structure or nested
(centre-embedded) recursion.
[Figure: compiler pipeline: Lexical Analysis, Syntax Analysis, Code Optimisation, Code Generation. Regular grammars are the formalism behind the lexical-analysis stage.]
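As an illustration, a right-linear (Type-3) grammar such as S -> 'a' S | 'b' A, A -> 'b' A | ε generates the language a*b+, which is exactly what a regular expression recognises (a sketch; the grammar is illustrative):

import re

# The regular grammar S -> aS | bA ; A -> bA | epsilon
# generates a* b+, i.e. this regular expression:
pattern = re.compile(r"a*b+")

for s in ["aab", "abbb", "ba", "aaa"]:
    print(s, bool(pattern.fullmatch(s)))   # aab/abbb match; ba/aaa do not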
Concept of Parsing
Parsing in NLP
• Parsing in basic terms can be described as breaking down the sentence into its
constituent words in order to find out the grammatical type of each word or
alternatively to decompose an input into more easily processed components.
• Every natural language has its own grammar rules according to which its
sentences are formed. Parsing is used to find out the sequence of rules applied for
sentence generation in that particular language.
• The basic connection between a sentence and the grammar is derived from the
parse tree. Natural language processing provides us with two basic parsing
techniques, viz., Top-Down and Bottom-Up. Their names describe the direction in
which the parsing process advances.
Top-Down parsing
• The process involves predicting the structure of a sentence from the start symbol
of the grammar down to the terminals, which correspond to the words in the
sentence.
• The start symbol S represents the most general concept, typically a sentence in
natural language grammars.
• The algorithm starts from the top of the tree, i.e., S, by looking at the grammar
rules with S on the left-hand side, so that all the possible trees are generated.
Top-Down parsing (contd)
• The algorithm proceeds by substituting the start symbol with one of its possible
expansions (productions). This prediction is guided by the grammar rules, which
define how symbols can be replaced or expanded.
• The process is recursive; for each non-terminal symbol encountered, the parser
selects a production rule to expand it further, moving towards the terminal
symbols.
• This expansion continues until the parser reaches the terminal symbols, which are
the actual words or tokens of the input sentence.
• If the parser selects a production that doesn't lead to a successful match with the
input sentence, it may need to backtrack. Backtracking involves going back up the
parse tree to a previous decision point and trying a different production rule.
• This can be computationally expensive in cases where many backtracks are
necessary.
• The goal of top-down parsing is to construct a parse tree that represents the
syntactic structure of the input sentence according to the grammar. If the entire
input sentence is successfully matched against the productions of the grammar,
the sentence is considered syntactically valid.
• Top-down parsing can be implemented in various forms,
o The simplest being a Recursive Descent Parser
o Predictive Parser
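For instance, NLTK ships a textbook top-down parser; a minimal sketch with an illustrative toy grammar (not from the slides):

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'dog' | 'ball'
V -> 'chased'
""")

# Top-down: expand from S, backtracking when a prediction fails to match.
parser = nltk.RecursiveDescentParser(grammar)
for tree in parser.parse("the dog chased the ball".split()):
    print(tree)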
Top-Down parsing (contd)
Recursive Descent Parser
• Recursive descent parsing is one of the most straightforward forms of parsing.
• This parser checks the syntax of the input stream of text by reading it from left to
right (hence, it is also known as the Left-Right Parser).
• The parser first reads a character from the input stream and then verifies it, or
matches it, against the grammar's terminals. If the character is verified, it is
accepted; otherwise, it is rejected.
• Recursive descent parsers are straightforward to implement and can handle a
wide range of context-free grammars (left-recursive rules, however, must be
rewritten before they can be used).
• Since the grammar in the parser is manually coded, it can include sophisticated
error reporting and recovery mechanisms. Consider the grammar:
expression ::= term (('+' | '-') term)*
term       ::= factor (('*' | '/') factor)*
factor     ::= NUMBER | '(' expression ')'

and the sample input: 3 + (4 * 5)
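A hand-written recursive-descent parser for this grammar might look as follows (a minimal sketch that evaluates while it parses; the names are illustrative):

import re

def tokenize(text):
    # NUMBER tokens and single-character operators/parentheses
    return re.findall(r"\d+|[+\-*/()]", text)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, expected=None):
        tok = self.peek()
        if expected is not None and tok != expected:
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def expression(self):          # expression ::= term (('+' | '-') term)*
        value = self.term()
        while self.peek() in ('+', '-'):
            op = self.eat()
            value = value + self.term() if op == '+' else value - self.term()
        return value

    def term(self):                # term ::= factor (('*' | '/') factor)*
        value = self.factor()
        while self.peek() in ('*', '/'):
            op = self.eat()
            value = value * self.factor() if op == '*' else value / self.factor()
        return value

    def factor(self):              # factor ::= NUMBER | '(' expression ')'
        if self.peek() == '(':
            self.eat('(')
            value = self.expression()
            self.eat(')')
            return value
        return int(self.eat())

print(Parser(tokenize("3 + (4 * 5)")).expression())   # prints 23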
Predictive Parser
• The Predictive Parser is a type of top-down parser that is specifically designed to
work with a class of grammars known as LL grammars, where the first "L" stands
for scanning the input from left to right, and the second "L" for producing a leftmost
derivation.
Bottom-Up parsing
Grammar Rules:
The basic sentence is understood in terms of a noun phrase NP and a verb phrase VP.
Let us say the other rules are stated as below (parsing the sentence "Obama eats an apple"):

S -> NP VP
NP -> Det N | N
VP -> V NP
Det -> 'an'
N -> 'Obama' | 'apple'
V -> 'eats'
• Initially, the parser shifts each word of the sentence onto a stack, one word at a
time, starting from "Obama".
• When the items on the stack match the right side of a grammar rule, the parser
reduces those items into a single item based on the rule. For example, after
shifting "Obama", it matches the rule N -> 'Obama', so "Obama" is reduced to N.
Bottom-Up parsing (contd)
Shift "Obama" onto the stack. (Stack: [Obama])
Reduce "Obama" to N using the rule N -> 'Obama'. (Stack: [N])
Reduce N to NP using the rule NP -> N. (Stack: [NP])
Shift "eats" onto the stack. (Stack: [V, eats])
Reduce "eats" to VP using the rule VP -> V -> 'eats'.
Shift "an" onto the stack.
Reduce "an" to Det and further using the rule Det -> NP ->VP
Reduce "apple" to N using the rule N -> NP ->VP
Reduce N to NP using the rule NP -> Det N.
Reduce NP VP to S using the rule S -> VP NP and NP -> Det N
Bottom-Up parsing (contd)
• This process continues with shifting and reducing according to the rules defined
in the grammar until the entire sentence is reduced to the start symbol (S),
indicating successful parsing.
• The ShiftReduceParser might not always find a parse for a sentence, especially if
the grammar is ambiguous or doesn't cover the sentence structure. In such cases,
we need to adjust the grammar.
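The trace above can be reproduced with NLTK's ShiftReduceParser (a sketch using the toy grammar reconstructed above; assumes nltk is installed):

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | N
VP -> V NP
Det -> 'an'
N -> 'Obama' | 'apple'
V -> 'eats'
""")

# The parser shifts words onto a stack and reduces them via the rules.
parser = nltk.ShiftReduceParser(grammar)
for tree in parser.parse("Obama eats an apple".split()):
    print(tree)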
RegexpParser
• The RegexpParser in NLTK, or Regular Expression Parser, is a tool for chunking
text into groups based on patterns defined using regular expressions. We first
define patterns, using regular expressions, that describe the syntactic
structures we want to identify. These patterns are matched against the parts of
speech of words in a sentence.
• It's particularly useful for identifying specific structures within sentences, such as
noun phrases (NPs), verb phrases (VPs), and other syntactic groups, without
requiring a full parsing of the sentence's structure.
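A minimal sketch of NP chunking with NLTK's RegexpParser (the chunk pattern and the pre-tagged sentence are illustrative):

import nltk

# Chunk an NP: optional determiner, any adjectives, then a noun.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")

tagged = [("The", "DT"), ("beautiful", "JJ"), ("garden", "NN"),
          ("was", "VBD"), ("blooming", "VBG")]
print(chunker.parse(tagged))   # groups "The beautiful garden" into an NP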
Parsing … Why? What is the purpose?
Parsing long sentences and generating parse trees offers several practical advantages
in understanding and processing natural language.
1. Syntactic Structure Understanding
Parse trees provide a visual and structural representation of the syntactic relationships
within a sentence. This helps students understand how sentences are constructed in
natural language, including how words relate to each other through dependencies or
hierarchies.
2. Ambiguity Resolution
Long sentences often contain ambiguities in meaning. This can be resolved through
parsing by choosing specific interpretations based on syntactic rules or learned
patterns (context).
I saw the man with the telescope. (Did the seeing happen through the telescope, or was the man carrying it? A parser must commit to one attachment.)
In short summary:
The structured representation that parsers produce enables deeper analysis, such
as identifying subjects, objects, and the actions connecting them, which is crucial
for understanding the meaning of texts, sentiment analysis, information
extraction, and more.
Dependency Parsing
• As the name suggests, dependency grammar is a fundamental concept in natural
language processing (NLP) that allows us to understand how words connect within
sentences. It provides a framework for representing sentence structure based on
word-to-word relationships.
• Every sentence has a central idea represented by a main verb, which connects all
other words in the sentence to it. This central idea is known as the root in
dependency grammar.
• In every word relationship there are two key roles: the governor (also called the
head) and the dependent. The governor is the word that the relation centres on,
while the dependent is the word that relies on, or modifies, the governor.
• In a dependency parse tree, each dependency relation line is labeled to illustrate
the relationship between the words on each end. Labels like subject (subj) and
object (obj) provide the grammatical role of every word in the sentence structure.
• Dependency parsing, using dependency grammar principles, analyzes
sentences and produces a tree that illustrates the grammatical relationships
between words. This is essential for understanding the structure of sentences.
• Dependency graphs satisfy the following constraints:
o They have a single designated root node that has no incoming arcs.
o Each node has exactly one incoming arc, except the root node.
o There is a unique path to each node from the root node.
Dependency Parsing (contd)
An essential part of dependency parsing is the dependency tag. A dependency tag
indicates the grammatical relationship between two words: the dependent and the
head word whose meaning it modifies. For example, take the following sentence:
https://www.analyticsvidhya.com/blog/2021/12/dependency-parsing-in-natural-
language-processing-with-examples/
Dependency Parsing (contd)
Example sentence :
YOU EDUCATE A MAN, YOU EDUCATE A MAN. YOU EDUCATE A
WOMAN, YOU EDUCATE A GENERATION.
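To see such relations in practice, a sketch using the spaCy library (assumes spaCy and its en_core_web_sm model are installed):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("You educate a woman; you educate a generation.")

# Each token prints with its dependency label and its governor (head).
for token in doc:
    print(f"{token.text:<12} {token.dep_:<10} head = {token.head.text}")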
Some open-source NLP libraries …
1. Flair: Flair is an NLP library developed by Zalando Research that focuses on tasks
like named entity recognition, part-of-speech tagging, and syntactic and sentiment analysis.
https://engineering.zalando.com/posts/2018/11/zalando-research-releases-flair.html
3. AllenNLP: AllenNLP is developed by the Allen Institute for AI. It provides a wide
range of pre-built models and components for tasks like text classification, semantic
role labeling, and more. AllenNLP also offers flexibility for custom model development
with its modular design. https://allenai.org/allennlp/software/allennlp-library
4. Gensim: Gensim is a popular Python library for topic modeling, document similarity
analysis, and word vector representations. It provides implementations of algorithms
such as Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and
Word2Vec. Gensim is known for its efficiency, scalability, and ease of use.
Some Natural language processing (NLP) libraries
5. UIMA (Unstructured Information Management Architecture): UIMA is an open-
standard framework for building NLP pipelines, maintained by the Apache Software
Foundation. It provides a scalable and extensible infrastructure for processing
unstructured information, including text, audio, and video, with C++ and Java frameworks.
https://uima.apache.org/
This is not a full list; there are many more. Please explore …
Next Class
Case study of parsers of NLP systems like ELIZA, LUNAR
Case study - ELIZA, LUNAR NLP systems
1. ELIZA
• ELIZA is a computer program that simulates the behavior of a therapist. It was one
of the first programs of its kind, developed back in 1966 at MIT. The program
interacts with the user in simple English and simulates a conversation with an
imaginary therapist (a "Doctor"). Although many concepts of Artificial Intelligence
had not yet been developed, ELIZA surprised a number of individuals, as users
attributed human-like feelings to it.
• ELIZA listened to what the user said, parsed the sentence in a very basic way, and
then presented a question that was somehow related to what the user had said. In
the mid-1960s, people who were told that a real live therapist was talking from the
second computer were fooled by ELIZA.
• A program like ELIZA requires knowledge of three domains:
1. Artificial Intelligence
2. Expert Systems
3. Natural Language Processing
• Weizenbaum, the developer of this program, was shocked to learn that the MIT
lab staff thought the machine was a real therapist and spent hours revealing their
problems to the program. When Weizenbaum informed them that he had access to
the logs of all conversations, the community was outraged at this invasion of their
privacy. He himself was shocked that such a simple program could so easily
deceive a naive user into revealing personal information.
Case study - ELIZA, LUNAR NLP systems
• Although ELIZA doesn't understand context or meaning and is limited to far
shallower analysis than current-generation chatbots, it set the stage for the
development of more sophisticated AI and chatbots that use complex NLP and
machine learning techniques to interact with users.
• The technical blocks of ELIZA can be broken down into the following components:
Input Processing: ELIZA starts with the input processing where the user's
input is scanned for keywords or phrases that the system can recognize. This
is typically a simple string matching process without any understanding of the
language.
Pattern Matching: ELIZA uses a pattern-matching technique to identify the
key elements of the user's statements. This is done using a script, which is
essentially a collection of pattern-response pairs.
Decomposition Rules: Once a pattern is identified in the user's input, ELIZA
applies decomposition rules to break down the input into smaller parts. These
rules are used to transform the input into a form that can be more easily
manipulated to generate a response.
Case study - ELIZA, LUNAR NLP systems
Reassembly Rules: After decomposition, reassembly rules are used to construct the
response. These rules take the decomposed input and reassemble it into a statement
that reflects what the user has said. The reassembly process often involves
rephrasing the user's input and asking for further information.
Script Database: ELIZA operates using a script, a database of pre-defined patterns
and responses. The most famous script, known as DOCTOR, simulates a Rogerian
psychotherapist, which means it primarily uses the user's own statements to form
questions.
Response Generation: The response is generated based on the matching patterns
and associated rules. It then selects an appropriate response from the script that
corresponds to the identified pattern.
Output: Finally, the generated response is output to the user, continuing the
conversation.
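Putting the blocks together, a minimal ELIZA-style sketch (the patterns and responses are illustrative, not Weizenbaum's original DOCTOR script):

import random
import re

# Each rule pairs a decomposition pattern with reassembly templates.
RULES = [
    (re.compile(r"i need (.*)", re.I),
     ["Why do you need {0}?", "Would it really help you to get {0}?"]),
    (re.compile(r"i am (.*)", re.I),
     ["How long have you been {0}?", "Why do you think you are {0}?"]),
    (re.compile(r"(.*)", re.I),
     ["Please tell me more.", "How does that make you feel?"]),
]

def respond(user_input):
    text = user_input.strip().rstrip(".!?")
    for pattern, templates in RULES:
        match = pattern.match(text)                      # pattern matching
        if match:
            template = random.choice(templates)          # response selection
            return template.format(*match.groups())      # reassembly

print(respond("I am unhappy"))   # e.g. "How long have you been unhappy?"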
Case study - ELIZA, LUNAR NLP systems
Thanks