Syntax Analysis (Part-I)

The document discusses parsing and grammars for programming languages. It covers:
• Grammars provide a precise syntactic specification for a programming language and allow parsers to detect syntax errors.
• Parsers verify that a string of tokens can be generated by the grammar and construct a parse tree if it is well-formed.
• Context-free grammars use recursive rules to generate patterns of strings and can describe all regular languages and more.

The majority of the text, diagrams, and tables in these slides is based on the textbook Compilers: Principles, Techniques, and Tools by Aho, Sethi, Ullman, and Lam.
• The syntax of programming language constructs can be specified by context-
free grammars or BNF (Backus-Naur Form) notation.
• Grammars offer significant benefits for both language designers and compiler
writers.
• A grammar gives a precise, yet easy-to-understand, syntactic specification of
a programming language.
• From certain classes of grammars, we can construct automatically an efficient
parser that determines the syntactic structure of a source program.
• The structure imparted to a language by a properly designed grammar is
useful for translating source programs into correct object code and for
detecting errors.
• A grammar allows a language to be evolved or developed iteratively, by
adding new constructs to perform new tasks.
• In the compiler model, the parser obtains a string of tokens from the lexical
analyzer and verifies that the string of token names can be generated by
the grammar for the source language.
• The parser then reports any syntax errors and recovers from commonly
occurring errors to continue processing the remainder of the program.
• For well-formed programs, the parser constructs a parse tree and passes
it to the rest of the compiler for further processing.
• The parse tree need not be constructed explicitly, since checking and
translation actions can be interspersed with parsing.
• The parser and the rest of the front end could well be implemented by a
single module.
Syntax Definition: A Notion of Grammar

• A grammar naturally describes the hierarchical structure of most


programming language constructs. For example, an if-else statement
in Java can have the form
Context-Free Grammar & Language

• A context-free grammar is a set of recursive rules used to generate


patterns of strings.
• A context-free grammar can describe all regular languages and more, but
it cannot describe all possible languages.
• The language generated by a context-free grammar is known as a
context-free language.
Notion of Grammar (Informal)
• Sentence: Sincere students go to college regularly.
• This is a grammatical, acceptable English sentence, because we can parse it
according to English grammar.
<Sentence>

<Noun Phrase> <Verb Phrase>

<Adjective> <Noun> <Verb> <Preposition> <Noun> <Adverb>

Sincere students go to college regularly


Notion of Grammar (Informal)
Rules:
<sentence> → <noun phrase><verb phrase>
<noun phrase> → <adjective><noun>
<adjective> → good
<noun> → students
<verb phrase> → <verb><preposition><noun><adverb>
<verb> → go
<preposition> → to
<noun> → school
<adverb> → regularly
Notion of Grammar

Grammar: S → 0 S 1 | ε

Derivation:
S ⇒ 0 S 1 ⇒ 0 0 S 1 1 ⇒ 0 0 0 S 1 1 1 ⇒ 0 0 0 1 1 1

L(G) = { 0ⁿ1ⁿ | n ≥ 0 }
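Membership in this language can be tested by undoing one application of S → 0S1 at a time. A minimal Python sketch (the function name is ours):

```python
def is_0n1n(s: str) -> bool:
    """Check membership in L(G) = { 0^n 1^n | n >= 0 } by mimicking
    the derivation S -> 0 S 1 | epsilon in reverse: strip a matching
    0/1 pair from the ends until nothing (or a mismatch) remains."""
    while s:
        if s[0] == "0" and s[-1] == "1":
            s = s[1:-1]          # undo one application of S -> 0 S 1
        else:
            return False         # cannot have been derived from S
    return True                  # the empty string: S -> epsilon

print(is_0n1n("000111"))  # True
print(is_0n1n("0101"))    # False
```

Note that no regular expression can do this check: the grammar counts matching 0s and 1s, which is exactly what makes context-free grammars strictly more powerful than regular languages.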
Generalize/Formalize Notion of Grammar
• A context-free grammar has four components:
• ∑: A set of terminal symbols, sometimes referred to as “tokens." The
terminals are the elementary symbols of the language defined by the
grammar.
• V: A set of non-terminals, sometimes called “syntactic variables." Each non-
terminal represents a set of strings of terminals, in a manner we shall
describe.
• P: A set of productions or rewriting rules, where each production consists of
a nonterminal, called the head or left side of the production, an arrow, and a
sequence of terminals and/or non-terminals, called the body or right side of
the production.
• S: A designation of one of the non-terminals as the start symbol.
• The grammar is written G = (V, ∑, P, S).
Notion of Grammar: Example
V = {S}
∑ = {a, b}
P: S → aSa
   S → bSb
   S → ε

Derivation:
S ⇒ a S a ⇒ a a S a a ⇒ a a b S b a a ⇒ a a b a S a b a a ⇒ a a b a a b a a

If X → ϒ is a production, then for arbitrary strings of grammar symbols α and β:

αXβ ⇒ αϒβ

If α₁ ⇒ α₂ ⇒ ⋯ ⇒ αₙ, then α₁ ⇒* αₙ (reflexive transitive closure of ⇒).
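The membership test for the grammar S → aSa | bSb | ε can be sketched by peeling matching end symbols, mirroring the derivation of a a b a a b a a shown above. A small Python illustration (the function name is ours):

```python
def generated_by_S(s: str) -> bool:
    """Membership test for the grammar S -> aSa | bSb | epsilon,
    whose language is the even-length palindromes over {a, b}."""
    if s == "":
        return True                        # S -> epsilon
    if len(s) >= 2 and s[0] == s[-1] and s[0] in "ab":
        return generated_by_S(s[1:-1])     # S -> aSa or S -> bSb
    return False

print(generated_by_S("aabaabaa"))  # True: the string derived above
print(generated_by_S("aba"))       # False: odd length
```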
Types of Parser
• There are three general types of parsers for grammars: universal, top-
down, and bottom-up.
• Universal parsing methods such as the Cocke-Younger-Kasami algorithm
and Earley's algorithm can parse any grammar. These general methods
are, however, too inefficient to use in production compilers.
• Top-down methods build parse trees from the top (root) to the bottom
(leaves).
• Bottom-up methods start from the leaves and work their way up to the
root.
• In either case, the input to the parser is scanned from left to right, one
symbol at a time.
Notational Conventions
Notational Conventions
Notational Conventions
Parse Tree
• A parse tree pictorially shows how the start symbol of a grammar derives a string in
the language.
• If nonterminal A has a production A → XYZ, then a parse tree may have an interior
node labeled A with three children labeled X, Y, and Z, from left to right:

• The root is labeled by the start symbol.


• Each leaf is labeled by a terminal or by ϵ.
• Each interior node is labeled by a nonterminal.
• If A is the nonterminal labeling some interior node and X₁, X₂, …, Xₙ are the
labels of the children of that node from left to right, then there must be a
production A → X₁X₂⋯Xₙ. Here, each Xᵢ stands for a symbol that is
either a terminal or a nonterminal.
Parse Tree: Example
Ambiguity in Grammar
• A grammar may have more than one parse tree generating a given string of
terminals. Such a grammar is said to be ambiguous.
• To show that a grammar is ambiguous, all we need to do is find a terminal
string that is the yield of more than one parse tree.
Abstract and Concrete Syntax
• In an abstract syntax tree for an expression, each interior node
represents an operator; the children of the node represent the operands
of the operator.
• In the syntax tree, interior nodes represent programming constructs
while in the parse tree, the interior nodes represent nonterminals.
• Many nonterminals of a grammar represent programming constructs,
but others are “helpers" such as those representing terms, factors, or
other variations of expressions.
• In the syntax tree, these helpers typically are not needed and are hence
dropped.
• To emphasize the contrast, a parse tree is sometimes called a concrete
syntax tree, and the underlying grammar is called a concrete syntax for
the language.
Abstract and Concrete Syntax
Derivations
• The construction of a parse tree can be made precise by taking a
derivational view, in which productions are treated as rewriting rules.

• Beginning with the start symbol, each rewriting step replaces a


nonterminal by the body of one of its productions.

• This derivational view corresponds to the top-down construction of a


parse tree, but the precision of derivations will be helpful when bottom-up
parsing is discussed.

• Bottom-up parsing is related to a class of derivations known as


“rightmost" derivations, in which the rightmost nonterminal is rewritten
at each step.
Derivations
Derivations
We can take a single E and repeatedly apply productions in any order to get a
sequence of replacements.

We call such a sequence of replacements a derivation of -(id) from E. This derivation
provides a proof that the string -(id) is one particular instance of an expression.

For a general definition of derivation, consider a nonterminal A in the middle of a
sequence of grammar symbols, as in αAβ, where α and β are arbitrary strings of
grammar symbols.

Suppose A → γ is a production. Then we can write αAβ ⇒ αγβ.

When a sequence of derivation steps α₁ ⇒ α₂ ⇒ ⋯ ⇒ αₙ rewrites α₁ to αₙ,
we say α₁ derives αₙ.
Derivations
Derivations
Derivations
Derivations
Parse Tree and Derivations
• A parse tree is a graphical representation of a
derivation that filters out the order in which
productions are applied to replace non-terminals.

• Each interior node of a parse tree represents the


application of a production.

• The interior node is labeled with the nonterminal A


in the head of the production; the children of the
node are labeled, from left to right, by the symbols
in the body of the production by which this A was
replaced during the derivation.
Parse Tree and Derivations
Parse Tree and Derivations
Parse Tree and Derivations
Classification of Parsers
Parsing
• Parsing is the process of determining how a string of terminals can be
generated by a grammar.
• Most parsing methods fall into one of two classes, called the top-down
and bottom-up methods.
• In top-down parsers, construction starts at the root and proceeds towards
the leaves.
• In bottom-up parsers, construction starts at the leaves and proceeds
towards the root.
• The popularity of top-down parsers is due to the fact that efficient parsers
can be constructed more easily by hand using top-down methods.
• Bottom-up parsing, however, can deal with a larger class of grammars and
translation schemes, so software tools for generating parsers directly from
grammars often use bottom-up methods.
• Recursive-descent parsing is a top-down method in which a set of
recursive procedures is used to process the input.
• One procedure is associated with each nonterminal of a grammar.
• Here, we consider a simple form of recursive-descent parsing, called
predictive parsing, in which the lookahead symbol unambiguously
determines the flow of control through the procedure body for each
nonterminal.
• Predictive parsing relies on information about the first symbols that can
be generated by a production body.
• Let ‘α’ be a string of grammar symbols (terminals and/or nonterminals).
• Then, we define FIRST(α) to be the set of terminals that appear as the first
symbols of one or more strings of terminals generated from α.
• If α is ε or can generate ε, then ε is also in FIRST(α).
A predictive parser uses an ε-production as a default when no other
production can be used.
Elimination of Left Recursion
Elimination of Left Recursion
Algorithm: Elimination of Left Recursion
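The standard transformation for immediate left recursion rewrites A → Aα | β as A → βA', with A' → αA' | ε. A Python sketch of that single-nonterminal case (the list-of-symbols representation and the A' naming are our assumptions; indirect left recursion is not handled here):

```python
def eliminate_immediate_left_recursion(head, bodies):
    """Transform  A -> A a1 | ... | A am | b1 | ... | bn  into
       A  -> b1 A' | ... | bn A'
       A' -> a1 A' | ... | am A' | epsilon
    Each body is a list of symbols; [] plays the role of epsilon."""
    recursive = [b[1:] for b in bodies if b and b[0] == head]
    non_recursive = [b for b in bodies if not b or b[0] != head]
    if not recursive:
        return {head: bodies}              # no immediate left recursion
    new = head + "'"
    return {
        head: [b + [new] for b in non_recursive],
        new:  [a + [new] for a in recursive] + [[]],   # [] is epsilon
    }

# E -> E + T | T   becomes   E -> T E',  E' -> + T E' | epsilon
print(eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]]))
```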
Left Factoring
• Left factoring is a grammar transformation that is useful for producing
a grammar suitable for predictive or top-down parsing.
• When the choice between two alternative A-productions is not clear,
we may be able to rewrite the productions to defer the decision until
enough of the input has been seen that we can make the right choice.
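As a sketch of this transformation, the following Python function factors out one longest common prefix shared by two or more A-production bodies (the list-of-symbols representation and the A' naming are our assumptions; a full algorithm would repeat this until no two bodies share a prefix):

```python
def left_factor(head, bodies):
    """One round of left factoring: rewrite  A -> a b1 | a b2 | g  as
       A  -> g | a A'
       A' -> b1 | b2
    where a is the longest prefix shared by at least two bodies.
    Each body is a list of symbols; [] plays the role of epsilon."""
    best = []
    for i in range(len(bodies)):           # find the longest shared prefix
        for j in range(i + 1, len(bodies)):
            k = 0
            while (k < len(bodies[i]) and k < len(bodies[j])
                   and bodies[i][k] == bodies[j][k]):
                k += 1
            if k > len(best):
                best = bodies[i][:k]
    if not best:
        return {head: bodies}              # already left-factored
    new = head + "'"
    factored = [b[len(best):] for b in bodies if b[:len(best)] == best]
    rest = [b for b in bodies if b[:len(best)] != best]
    return {head: rest + [best + [new]], new: factored}

# Dangling-else shape:  S -> i E t S | i E t S e S | a  becomes
# S -> a | i E t S S'   and   S' -> epsilon | e S
print(left_factor("S", [["i", "E", "t", "S"],
                        ["i", "E", "t", "S", "e", "S"], ["a"]]))
```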
Left Factoring Example
Dangling Else Problem
Resolving Dangling Else Problem
A predictive parser is a program consisting of a procedure for every
nonterminal.
The procedure for nonterminal A does two things.
• It decides which A-production to use by examining the lookahead
symbol.
• The procedure then mimics the body of the chosen production.
Top-Down Parsing
• Top-down parsing can be viewed as the problem of constructing a parse
tree for the input string, starting from the root and creating the nodes of
the parse tree in preorder.
• Top-down parsing can be viewed as finding a leftmost derivation for an
input string.
• At each step of a top-down parse, the key problem is that of determining
the production to be applied for a nonterminal, say A.
• Once an A-production is chosen, the rest of the parsing process consists
of “matching" the terminal symbols in the production body with the
input string.
Top-Down Parsing
Recursive-Descent Parsing
• A recursive-descent parsing program consists of a set of procedures, one
for each nonterminal.
• Execution begins with the procedure for the start symbol, which halts and
announces success if its procedure body scans the entire input string.
• Pseudocode for a typical nonterminal appears in Figure. This pseudocode
is nondeterministic, since it begins by choosing the A-production to apply
in a manner that is not specified.
• General recursive-descent may require backtracking; that is, it may
require repeated scans over the input.
• However, backtracking is rarely needed to parse programming language
constructs, so backtracking parsers are not seen frequently.
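As an illustration, here is a hand-written recursive-descent (predictive, no backtracking) parser in Python for the expression grammar whose FIRST/FOLLOW sets appear later in these slides; the token spelling "id" and the class layout are our assumptions:

```python
class PredictiveParser:
    """One procedure per nonterminal for the grammar
       E -> T E'    E' -> + T E' | eps
       T -> F T'    T' -> * F T' | eps
       F -> ( E ) | id
    The lookahead token decides which production body to mimic."""

    def __init__(self, tokens):
        self.toks = list(tokens) + ["$"]   # endmarker
        self.pos = 0

    def look(self):
        return self.toks[self.pos]

    def match(self, t):
        if self.look() != t:
            raise SyntaxError(f"expected {t!r}, found {self.look()!r}")
        self.pos += 1

    def parse(self):
        self.E()
        self.match("$")        # success only if all input is consumed
        return True

    def E(self):
        self.T(); self.Ep()

    def Ep(self):
        if self.look() == "+":             # E' -> + T E'
            self.match("+"); self.T(); self.Ep()
        # else E' -> eps (lookahead is in FOLLOW(E') = {), $})

    def T(self):
        self.F(); self.Tp()

    def Tp(self):
        if self.look() == "*":             # T' -> * F T'
            self.match("*"); self.F(); self.Tp()

    def F(self):
        if self.look() == "(":             # F -> ( E )
            self.match("("); self.E(); self.match(")")
        else:                              # F -> id
            self.match("id")

print(PredictiveParser(["id", "+", "id", "*", "id"]).parse())  # True
```

No procedure ever backtracks: each one commits to a production based on a single token of lookahead, which is exactly the LL(1) property discussed below.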
Recursive-Descent Parsing
Components of Parsing Technique
FIRST and FOLLOW
FIRST and FOLLOW
To compute FIRST(X) for all grammar symbols X, apply the following rules
until no more terminals or ɛ can be added to any FIRST set.
FIRST and FOLLOW
To compute FOLLOW(A) for all nonterminals A, apply the following
rules until nothing can be added to any FOLLOW set.
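The two rule sets above can be sketched as a single fixed-point loop. In this Python sketch, the grammar representation and the marker 'eps' for ε are our assumptions; the example grammar is the expression grammar from these slides:

```python
def first_follow(grammar, start):
    """Fixed-point computation of FIRST and FOLLOW sets.  `grammar`
    maps each nonterminal to a list of production bodies; a body is a
    list of symbols and [] is epsilon.  Any symbol not appearing as a
    key is treated as a terminal; 'eps' marks epsilon in FIRST sets."""
    nonterminals = set(grammar)
    first = {A: set() for A in nonterminals}
    follow = {A: set() for A in nonterminals}
    follow[start].add("$")                 # endmarker follows the start symbol

    def first_of(seq):
        """FIRST of a string of grammar symbols."""
        out = set()
        for X in seq:
            f = first[X] if X in nonterminals else {X}
            out |= f - {"eps"}
            if "eps" not in f:
                return out
        out.add("eps")                     # every symbol in seq can vanish
        return out

    changed = True
    while changed:
        changed = False
        for A, bodies in grammar.items():
            for body in bodies:
                f = first_of(body)         # FIRST rules
                if not f <= first[A]:
                    first[A] |= f
                    changed = True
                for i, B in enumerate(body):   # FOLLOW rules: A -> alpha B beta
                    if B not in nonterminals:
                        continue
                    trailer = first_of(body[i + 1:])
                    add = trailer - {"eps"}
                    if "eps" in trailer:   # beta can vanish: add FOLLOW(A)
                        add |= follow[A]
                    if not add <= follow[B]:
                        follow[B] |= add
                        changed = True
    return first, follow

# The expression grammar whose FIRST/FOLLOW table appears in these slides.
G = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}
fi, fo = first_follow(G, "E")
print(sorted(fi["E"]), sorted(fo["F"]))  # ['(', 'id'] ['$', ')', '*', '+']
```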
FIRST and FOLLOW: Example
FIRST and FOLLOW: Example

Symbol   FIRST      FOLLOW
E        {(, id}    {), $}
E'       {+, ε}     {), $}
T        {(, id}    {+, ), $}
T'       {*, ε}     {+, ), $}
F        {(, id}    {+, *, ), $}
LL(1) Grammars

• Predictive parsers can be constructed for a class of grammars called


LL(1).
• The first “L" in LL(1) stands for scanning the input from left to right,
the second “L" for producing a leftmost derivation, and the “1" for
using one input symbol of lookahead at each step to make parsing
action decisions.
• No left-recursive or ambiguous grammar can be LL(1).
LL(1) Grammars
LL(1) Grammars

• Predictive parsers can be constructed for LL(1) grammars since the proper
production to apply for a nonterminal can be selected by looking only at
the current input symbol.
• Flow-of-control constructs, with their distinguishing key- words, generally
satisfy the LL(1) constraints. For instance, if we have the productions
LL(1) Grammars
Symbol   FIRST      FOLLOW
E        {(, id}    {), $}
E'       {+, ε}     {), $}
T        {(, id}    {+, ), $}
T'       {*, ε}     {+, ), $}
F        {(, id}    {+, *, ), $}
Nonrecursive Predictive Parsing
• A nonrecursive predictive parser can be constructed by maintaining a
stack explicitly, rather than implicitly via recursive calls.
• The parser mimics a leftmost derivation.
• If w is the input that has been matched so far, then the stack holds a
sequence of grammar symbols α such that S ⇒*lm wα.

• The table-driven parser consists of an input buffer, a stack containing a


sequence of grammar symbols, a parsing table constructed by an algorithm,
and an output stream.
• The input buffer contains the string to be parsed, followed by the endmarker $.
• We reuse the symbol $ to mark the bottom of the stack, which initially contains
the start symbol of the grammar on top of $.
Nonrecursive Predictive Parsing
• The parser is controlled by a program that considers X, the symbol on top of the stack,
and a, the current input symbol.
• If X is a nonterminal, the parser chooses an X-production by consulting entry M[X, a]
of the parsing table M.
• Otherwise, it checks for a match between the terminal X and current input symbol a.
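The control loop just described can be sketched in a few lines of Python; the table below encodes the expression grammar used in these slides (the (X, a) dictionary representation of M is our assumption):

```python
def table_parse(tokens, table, start):
    """Driver loop of the nonrecursive predictive parser: X is the
    symbol on top of the stack and a the current input symbol.
    `table` maps (nonterminal, terminal) -> production body (a list
    of symbols, [] for an epsilon-production)."""
    toks = list(tokens) + ["$"]       # input buffer ends with $
    stack = ["$", start]              # $ marks the bottom of the stack
    i = 0
    while stack[-1] != "$":
        X, a = stack[-1], toks[i]
        if X == a:                    # terminal on top: match it
            stack.pop()
            i += 1
        elif (X, a) in table:         # nonterminal: consult M[X, a]
            stack.pop()
            stack.extend(reversed(table[(X, a)]))   # leftmost symbol on top
        else:
            raise SyntaxError(f"no entry M[{X}, {a}]")
    return toks[i] == "$"             # accept iff all input was matched

# Parsing table for the expression grammar E -> T E', etc., built from
# the FIRST/FOLLOW sets shown in these slides.
M = {
    ("E", "("): ["T", "E'"],      ("E", "id"): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"],
    ("E'", ")"): [],              ("E'", "$"): [],
    ("T", "("): ["F", "T'"],      ("T", "id"): ["F", "T'"],
    ("T'", "*"): ["*", "F", "T'"],
    ("T'", "+"): [], ("T'", ")"): [], ("T'", "$"): [],
    ("F", "("): ["(", "E", ")"],  ("F", "id"): ["id"],
}
print(table_parse(["id", "+", "id", "*", "id"], M, "E"))  # True
```

Because the stack replaces the nonterminal by its body with the leftmost symbol on top, the sequence of expansions traces out exactly a leftmost derivation.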
Nonrecursive Predictive Parsing
Nonrecursive Predictive Parsing
Nonrecursive Predictive Parsing: Example
Error Recovery in Predictive Parsing
• In table-driven predictive parsing, an error occurs when the terminal
symbol on top of the stack does not match the next symbol of the
input string.
• An error also occurs when a non-terminal A is on top of the stack, x is
the next symbol of the input string, and the parsing table entry M[A, x]
is empty, where M denotes the parsing table.
• To recover from an error, four error recovery strategies can be used:
panic-mode recovery, phrase-level recovery, error productions, and
global correction.
Error Recovery Techniques
• Panic Mode: When an error occurs, the parser skips input symbols one at a
time until a token from a designated set of synchronizing tokens is found.
Synchronizing tokens may be delimiters such as a semicolon or end in the
source program.
• Phrase-Level: When an error occurs, the parser performs local correction
on the remaining input, replacing a prefix of the remaining input by some
string that allows parsing to continue.
• Error Productions: If common errors are known, we can augment the
grammar with productions that generate the erroneous constructs, and
then use this augmented grammar to build a parser that detects those
errors.
• Global Correction: For an incorrect input string, we choose a minimal
sequence of changes, such as token replacements, insertions, and
deletions, to obtain a globally least-cost correction.
Panic Mode Recovery
• We restrict our discussion to the panic-mode recovery strategy and show
how it can be used to recover from errors in the input string.
• To add synchronizing tokens to the predictive parsing table, the symbols
in the FOLLOW set of each non-terminal are chosen.
• When an error is detected, input symbols are skipped until one of the
synchronizing tokens appears; the non-terminal is then popped from the
stack and parsing resumes.
• Thus, the symbols found in the FOLLOW set of a non-terminal constitute
its set of synchronizing tokens.
Panic Mode Recovery
Some heuristics are as follows:
• To begin with, we place all symbols in FOLLOW(A) into the set of synchronizing
tokens for nonterminal A.
• If the non-terminal A on top of the stack cannot derive the next input symbol,
we skip tokens until a synchronizing token in FOLLOW(A) is found, and then we
pop the non-terminal A from the top of the stack to continue parsing.
• If the elements in FOLLOW(A) are not enough as the synchronizing tokens for
non-terminal A, then we can add the terminal symbols in FIRST(A) to the set
of synchronizing tokens for non-terminal A.
• When a non-terminal can generate the empty string, the production that
derives ε can be used as a default.
• During parsing, if a terminal symbol on top of the stack does not match, then
we pop the terminal, output a message and continue the parsing.
Advantages and Disadvantages

• Advantage: It is easy to implement and is guaranteed not to go into an
infinite loop.
• Disadvantage: A considerable amount of input is skipped without
checking it for additional errors.
Panic Mode Recovery

• Skip symbols in the input until one of a set of synchronizing tokens appears.
• Use FOLLOW symbols as synchronizing tokens.
• Write synch in the predictive parsing table to indicate synchronizing
tokens obtained from the FOLLOW set of the non-terminal.
Symbol   FIRST      FOLLOW
E        {(, id}    {), $}
E'       {+, ε}     {), $}
T        {(, id}    {+, ), $}
T'       {*, ε}     {+, ), $}
F        {(, id}    {+, *, ), $}
Panic Mode Example
Panic Mode Recovery: Rules

• If parser looks up entry M[A, a] and finds it blank, then the input
symbol a is skipped.
• If the entry is synch, then the non-terminal on the top of stack is
popped in an attempt to resume parsing.
• If a token on the top of stack does not match the input symbol, then
we pop the token from the stack.
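These three rules can be grafted directly onto the table-driven parsing loop. A Python sketch (the grammar, table, and synch sets are the ones from these slides; returning an error count rather than printing messages is our simplification):

```python
def parse_with_panic(tokens, table, synch, start):
    """Table-driven predictive parsing with panic-mode recovery:
    blank entry -> skip the input symbol; synch entry -> pop the
    nonterminal; unmatched terminal on top of the stack -> pop it.
    `synch` maps each nonterminal to its synchronizing (FOLLOW) set."""
    toks = list(tokens) + ["$"]
    stack = ["$", start]
    i, errors = 0, 0
    while stack[-1] != "$":
        X, a = stack[-1], toks[i]
        if X not in synch:                     # X is a terminal
            if X == a:
                stack.pop(); i += 1            # match
            else:
                stack.pop(); errors += 1       # rule 3: pop the token
        elif (X, a) in table:
            stack.pop()
            stack.extend(reversed(table[(X, a)]))
        elif a in synch[X]:
            stack.pop(); errors += 1           # rule 2: synch, pop X
        else:
            i += 1; errors += 1                # rule 1: blank, skip a
    return errors                              # 0 means no errors found

# Expression-grammar table and synchronizing sets from these slides.
M = {
    ("E", "("): ["T", "E'"],      ("E", "id"): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"],
    ("E'", ")"): [],              ("E'", "$"): [],
    ("T", "("): ["F", "T'"],      ("T", "id"): ["F", "T'"],
    ("T'", "*"): ["*", "F", "T'"],
    ("T'", "+"): [], ("T'", ")"): [], ("T'", "$"): [],
    ("F", "("): ["(", "E", ")"],  ("F", "id"): ["id"],
}
SYNCH = {"E": {")", "$"}, "E'": {")", "$"}, "T": {"+", ")", "$"},
         "T'": {"+", ")", "$"}, "F": {"+", "*", ")", "$"}}

# 'id * + id' recovers from the missing operand with one synch pop at M[F, +].
print(parse_with_panic(["id", "*", "+", "id"], M, SYNCH, "E"))  # 1
print(parse_with_panic(["id", "+", "id"], M, SYNCH, "E"))       # 0
```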
