UNIT 3 Syntax Analysis–Part1
HARSHITA SHARMA
Syllabus
• Syntax analysis:
• Specification of syntax using grammar.
• Top-down parsing
• recursive-descent
• predictive.
• Bottom-up parsing
• shift-reduce
• SLR
• CLR
• LALR
• Parser generator.
INTRODUCTION
Role of the Parser
• A parser for a grammar is a program that takes as input a string w (the sequence of
tokens obtained from the lexical analyzer) and produces as output either a parse tree
for w, if w is a valid sentence of the grammar, or an error message indicating that w
is not a valid sentence of the grammar.
• The goal of the parser is to determine the syntactic validity of a source string. If
the string is valid, a tree is built for use by the subsequent phases of the compiler.
• The tree reflects the sequence of derivations or reductions used during parsing; hence
it is called a parse tree.
• If the string is invalid, the parser has to issue diagnostic messages identifying the
nature and cause of the errors in the string. Every elementary subtree in the parse
tree corresponds to a production of the grammar.
There are two ways of identifying an elementary subtree:
1. By deriving a string from a non-terminal or
2. By reducing a string of symbols to a non-terminal.
Types of Parsers
The two types of parsers employed are:
a. Top-down parsers, which build parse trees from the top (root) to the bottom (leaves).
b. Bottom-up parsers, which build parse trees from the leaves and work up to the root.
USE OF GRAMMAR
• By design, every programming language has precise rules that prescribe the
syntactic structure of well-formed programs.
• In C, for example, a program is made up of functions, a function out of
declarations and statements, a statement out of expressions, and so on.
• The syntax of programming language constructs can be specified by context-
free grammars or BNF (Backus-Naur Form) notation.
• Grammars offer significant benefits for both language designers and
compiler writers.
Advantages of using a Grammar
• A grammar gives a precise, yet easy-to-understand, syntactic specification of a programming
language.
• From certain classes of grammars, we can construct automatically an efficient parser that
determines the syntactic structure of a source program. As a side benefit, the parser-
construction process can reveal syntactic ambiguities and trouble spots that might have
slipped through the initial design phase of a language.
• The structure imparted to a language by a properly designed grammar is useful for
translating source programs into correct object code and for detecting errors.
• A grammar allows a language to be evolved or developed iteratively, by adding new
constructs to perform new tasks. These new constructs can be integrated more easily into an
implementation that follows the grammatical structure of the language.
SYNTAX ERROR HANDLING
Errors at various levels
• Lexical errors
• Syntax Errors
• Semantic Errors
• Logical Errors
• The precision of parsing methods allows syntactic errors to be detected very efficiently.
Several parsing methods, such as the LL and LR methods, detect an error as soon as
possible; that is, when the stream of tokens from the lexical analyzer cannot be parsed
further according to the grammar for the language. More precisely, they have the viable-
prefix property, meaning that they detect that an error has occurred as soon as they see a
prefix of the input that cannot be completed to form a string in the language.
• Another reason for emphasizing error recovery during parsing is that many errors
appear syntactic, whatever their cause, and are exposed when parsing cannot
continue. A few semantic errors, such as type mismatches, can also be detected
efficiently; however, accurate detection of semantic and logical errors at compile
time is in general a difficult task.
• Goals of error handler:
• Report errors
• Recover from errors
• Minimal overhead
ERROR RECOVERY STRATEGIES
Panic Mode Recovery
• With this method, on discovering an error, the parser discards input symbols
one at a time until one of a designated set of synchronizing tokens is found.
The synchronizing tokens are usually delimiters, such as semicolon or },
whose role in the source program is clear and unambiguous.
• The compiler designer must select the synchronizing tokens appropriate for
the source language. While panic-mode correction often skips a considerable
amount of input without checking it for additional errors, it has the
advantage of simplicity, and, unlike some methods to be considered later, is
guaranteed not to go into an infinite loop.
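• As a minimal sketch of the idea in Python (the token values and the function name here
are illustrative assumptions, not part of any particular parser):

    # Minimal sketch of panic-mode recovery; token names are assumed.
    SYNC_TOKENS = {";", "}"}          # designated synchronizing tokens

    def synchronize(tokens, pos):
        """On an error at tokens[pos], discard input symbols one at a time
        until a synchronizing token is found, then resume just past it."""
        while pos < len(tokens) and tokens[pos] not in SYNC_TOKENS:
            pos += 1                  # skip (and do not re-check) this symbol
        return pos + 1                # position after the synchronizing token

    # Usage: on a syntax error at position p, continue parsing from
    # synchronize(tokens, p). The loop always terminates, since pos only grows.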
Phrase Level Recovery
• On discovering an error, a parser may perform local correction on the remaining input; that
is, it may replace a prefix of the remaining input by some string that allows the parser to
continue.
• A typical local correction is to replace a comma by a semicolon, delete an extraneous
semicolon, or insert a missing semicolon. The choice of the local correction is left to the
compiler designer. Of course, we must be careful to choose replacements that do not lead to
infinite loops, as would be the case, for example, if we always inserted something on the
input ahead of the current input symbol.
• Phrase-level replacement has been used in several error-repairing compilers, as it can correct
any input string. Its major drawback is the difficulty it has in coping with situations in which
the actual error has occurred before the point of detection.
Error Productions
• By anticipating common errors that might be encountered, we can augment
the grammar for the language at hand with productions that generate the
erroneous constructs.
• A parser constructed from a grammar augmented by these error productions
detects the anticipated errors when an error production is used during
parsing.
• The parser can then generate appropriate error diagnostics about the
erroneous construct that has been recognized in the input.
Global Correction
• Ideally, we would like a compiler to make as few changes as possible in processing an
incorrect input string. There are algorithms for choosing a minimal sequence of changes to
obtain a globally least-cost correction. Given an incorrect input string x and grammar G,
these algorithms will find a parse tree for a related string y, such that the number of
insertions, deletions, and changes of tokens required to transform x into y is as small as
possible.
• Unfortunately, these methods are in general too costly to implement in terms of time and
space, so these techniques are currently only of theoretical interest. Do note that a closest
correct program may not be what the programmer had in mind. Nevertheless, the notion of
least-cost correction provides a yardstick for evaluating error-recovery techniques, and has
been used for finding optimal replacement strings for phrase-level recovery.
GRAMMAR PREREQUISITES
Context Free Grammar
• Inherently recursive structures of a programming language are defined by a context-
free Grammar.
• A context-free grammar is a 4-tuple G = (V, T, P, S).
• Here, V is a finite set of non-terminals (syntactic variables),
• T is a finite set of terminals (in our case, this will be the set of tokens),
• P is a finite set of production rules of the following form:
A → α, where A is a non-terminal and α is a string of terminals and non-terminals
(possibly the empty string),
• S is the start symbol (one of the non-terminals).
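• As a small illustration, the arithmetic-expression grammar used later in this unit can
be encoded directly from this definition. The dict-of-lists representation below is one
convenient, assumed choice, not a standard format:

    # One way to encode G = (V, T, P, S) in Python; the representation is an
    # illustrative choice reused in the sketches later in this unit.
    V = {"E", "T", "F"}                       # non-terminals
    T = {"+", "*", "(", ")", "id"}            # terminals (tokens)
    P = {                                     # productions: A -> list of bodies
        "E": [["E", "+", "T"], ["T"]],
        "T": [["T", "*", "F"], ["F"]],
        "F": [["(", "E", ")"], ["id"]],
    }
    S = "E"                                   # start symbol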
Example
Using notational conventions
Language of a Grammar
• L(G) is the language of G (the language generated by G) which is a set of
sentences.
• A sentence of L(G) is a string of terminal symbols of G. If S is the start symbol of
G, then ω is a sentence of L(G) iff S ⇒* ω, where ω is a string of terminals of G. If
G is a context-free grammar, L(G) is a context-free language.
• Two grammars G1 and G2 are equivalent if they generate the same language, i.e.
L(G1) = L(G2).
• Consider a derivation S ⇒* α. If α contains non-terminals, it is called a sentential
form of G. If α does not contain non-terminals, it is called a sentence of G.
Derivations
• The construction of a parse tree can be made precise by taking a derivational view,
in which productions are treated as rewriting rules. Beginning with the start symbol,
each rewriting step replaces a nonterminal by the body of one of its productions.
• This derivational view corresponds to the top-down construction of a parse tree,
but the precision afforded by derivations will be especially helpful when bottom-up
parsing is discussed.
• As we shall see, bottom-up parsing is related to a class of derivations known as
"rightmost" derivations, in which the rightmost nonterminal is rewritten at each step.
Derivations
• In general, a derivation step is
αAβ ⇒ αγβ
where αAβ is a sentential form, there is a production rule A → γ in the grammar, and α
and β are arbitrary strings of terminal and non-terminal symbols.
• α1 ⇒ α2 ⇒ ... ⇒ αn (we say αn derives from α1, or α1 derives αn).
• ⇒* denotes "derives in zero or more steps"; ⇒+ denotes "derives in one or more steps".
• At each derivation step, we can choose any of the non-terminals in the sentential
form of G for the replacement.
Leftmost Derivation
• If we always choose the left-most non-terminal in each derivation step, the derivation
is called a left-most derivation.
• Example:
• E → E + E | E - E | E * E | E / E | - E
• E → (E)
• E → id
• Leftmost derivation:
• E ⇒ E + E ⇒ E * E + E ⇒ id * E + E ⇒ id * id + E ⇒ id * id + id
• The derived string w = id*id+id consists of all terminal symbols.
Rightmost Derivation
• If we always choose the right-most non-terminal in each derivation step, the
derivation is called a right-most derivation.
• Example 1 (same grammar as previous slide):
• E ⇒ E + E ⇒ E + E * E ⇒ E + E * id ⇒ E + id * id ⇒ id + id * id
• Strings that appear in a leftmost derivation are called left-sentential forms.
• Strings that appear in a rightmost derivation are called right-sentential forms.
• Sentential forms: Given a grammar G with start symbol S, if S ⇒* α, where α may
contain non-terminals or terminals, then α is called a sentential form of G.
Question
• Given grammar G : E → E+E | E*E | ( E ) | - E | id
Sentence to be derived : – (id+id).
Derive using both leftmost and rightmost derivation.
Solution
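• Leftmost derivation:
E ⇒ - E ⇒ - ( E ) ⇒ - ( E + E ) ⇒ - ( id + E ) ⇒ - ( id + id )
• Rightmost derivation:
E ⇒ - E ⇒ - ( E ) ⇒ - ( E + E ) ⇒ - ( E + id ) ⇒ - ( id + id )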
Yield / Frontier of a Tree
• Each interior node of a parse tree is a non-terminal; the children of a node can be
terminals or non-terminals. Reading the leaves of the parse tree from left to right
gives a sentential form; this sentential form is called the yield or frontier of
the tree.
PARSE TREE
• A parse tree is a graphical representation of a derivation that filters out the order in
which productions are applied to replace nonterminals.
• Each interior node of a parse tree represents the application of a production.
• The interior node is labeled with the nonterminal A in the head of the production;
the children of the node are labeled, from left to right, by the symbols in the body
of the production by which this A was replaced during the derivation.
• The leaves of a parse tree are terminal symbols.
• A parse tree can be seen as a graphical representation of a derivation.
Question
• Draw parse tree for the input –(id+id).
Solution
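• The parse tree for –(id+id), corresponding to the derivation above:

    E
    ├── -
    └── E
        ├── (
        ├── E
        │   ├── E
        │   │   └── id
        │   ├── +
        │   └── E
        │       └── id
        └── )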
Sequence of Parse Trees
AMBIGUITY
What is it?
• A grammar that produces more than one parse tree for some sentence is said
to be ambiguous. Put another way, an ambiguous grammar is one that
produces more than one leftmost derivation or more than one rightmost
derivation for the same sentence.
• For most parsers, it is desirable that the grammar be made unambiguous, for
if it is not, we cannot uniquely determine which parse tree to select for a
sentence. In other cases, it is convenient to use carefully chosen ambiguous
grammars, together with disambiguating rules that "throw away" undesirable
parse trees, leaving only one tree for each sentence.
Example Question
• Consider the grammar: E -> E + E | E * E | ( E ) | id
• Derive the two distinct derivations and parse trees for id+id*id.
Solution
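• The two distinct leftmost derivations:
• E ⇒ E + E ⇒ id + E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id
• E ⇒ E * E ⇒ E + E * E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id
• The first derivation groups the input as id+(id*id) (the usual precedence); the second
groups it as (id+id)*id. The corresponding parse trees differ, so the grammar is
ambiguous.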
Example
• To disambiguate the grammar E → E+E | E*E | E^E | id | (E), we can use precedence of operators as
follows:
• ^ (right to left)
• /,* (left to right)
• -,+ (left to right)
• We get the following unambiguous grammar:
• E → E+T | T
• T → T*F | F
• F → G^F | G
• G → id | (E)
Verifying languages generated by a Grammar
• A proof that a grammar G generates a language L has two parts: show that
every string generated by G is in L, and conversely that every string in L can
indeed be generated by G.
Every balanced string is derivable from S
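• For the standard balanced-parentheses grammar S → (S)S | ε (assumed here), the proof is
by induction on the length of the string: a nonempty balanced string can be written as
(x)y, where (x) ends at the match of its first parenthesis and x and y are themselves
balanced and shorter, hence derivable from S; then S ⇒ (S)S ⇒* (x)y.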
CFGs vs Regular Expressions
• Grammars are a more powerful notation than regular expressions. Every
construct that can be described by a regular expression can be described by a
grammar, but not vice-versa. Alternatively, every regular language is a
context-free language, but not vice-versa.
Constructing Grammar from an NFA
Example
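• The standard construction: for each state qi of the NFA, introduce a non-terminal Ai;
for each transition from qi to qj on input a, add the production Ai → a Aj; and for
each accepting state qi, add Ai → ε. The start symbol is the non-terminal for the
start state.
• As a small illustrative instance (this automaton is assumed for illustration): for a
two-state NFA for a*b, where state 0 loops on a and moves to accepting state 1 on b,
we get
• A0 → a A0 | b A1
• A1 → ε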
Language describable by a CFG but not by an RE
• L = {a^n b^n | n >= 1}
• A finite automaton cannot keep count, hence no regular expression is possible.
• Grammar: S → aSb | ab
WRITING SUITABLE GRAMMARS
• Grammars are capable of describing most, but not all, of the syntax of
programming languages.
• For instance, the requirement that identifiers be declared before they are
used, cannot be described by a context-free grammar.
• Therefore, the sequences of tokens accepted by a parser form a superset of
the programming language; subsequent phases of the compiler must analyze
the output of the parser to ensure compliance with rules that are not
checked by the parser.
Why use RE to describe LA when CFG is better?
1. Separating the syntactic structure of a language into lexical and non lexical parts
provides a convenient way of modularizing the front end of a compiler into two
manageable-sized components.
2. The lexical rules of a language are frequently quite simple, and to describe them we
do not need a notation as powerful as grammars.
3. Regular expressions generally provide a more concise and easier-to-understand
notation for tokens than grammars.
4. More efficient lexical analyzers can be constructed automatically from regular
expressions than from arbitrary grammars.
Removing Ambiguity
Question
• Draw parse tree(s) for the input:
• if E1 then if E2 then S1 else S2
Removing ambiguity
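• Under the grammar used later in this unit (S → iEtS | iEtSeS | a, with i = if,
t = then, e = else), the input above has two parse trees: the else can attach to
either if.
• One standard disambiguation, which matches each else with the closest unmatched
then, is:
• S → M | U
• M → iEtMeM | a
• U → iEtS | iEtMeU
• where M derives only "matched" statements and U only "unmatched" ones.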
Eliminating Left Recursion
• A grammar is said to be left recursive if it has a non-terminal A such that there is a
derivation A ⇒+ Aα for some string α. Top-down parsing methods cannot handle
left-recursive grammars.
• Hence, left recursion can be eliminated as follows:
• If there is a production A → Aα | β, it can be replaced with the sequence of two
productions
• A → βA’
• A’ → αA’ | ε
• without changing the set of strings derivable from A.
Example Question
• Consider the following grammar for arithmetic expressions and eliminate the left recursion:
• E → E+T | T
• T → T*F | F
• F → (E) | id
Solution
• First eliminate the left recursion for E:
• E → TE’
• E’ → +TE’ | ε
• Then eliminate it for T:
• T → FT’
• T’ → *FT’ | ε
• Thus the grammar obtained after eliminating left recursion is
• E → TE’
• E’ → +TE’ | ε
• T → FT’
• T’ → *FT’ | ε
• F → (E) | id
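• To see why this transformation matters, here is a minimal recursive-descent sketch of
the transformed grammar in Python; with the original left-recursive E → E+T, the
method E() would call itself forever on its first token. The class and its token
handling are illustrative assumptions:

    # Minimal recursive-descent sketch for the left-recursion-free grammar.
    class Parser:
        def __init__(self, tokens):
            self.tokens = tokens
            self.pos = 0

        def peek(self):
            return self.tokens[self.pos] if self.pos < len(self.tokens) else None

        def eat(self, tok):
            if self.peek() != tok:
                raise SyntaxError(f"expected {tok}, got {self.peek()}")
            self.pos += 1

        def E(self):                    # E -> T E'
            self.T()
            self.Eprime()

        def Eprime(self):               # E' -> + T E' | epsilon
            if self.peek() == "+":
                self.eat("+")
                self.T()
                self.Eprime()

        def T(self):                    # T -> F T'
            self.F()
            self.Tprime()

        def Tprime(self):               # T' -> * F T' | epsilon
            if self.peek() == "*":
                self.eat("*")
                self.F()
                self.Tprime()

        def F(self):                    # F -> ( E ) | id
            if self.peek() == "(":
                self.eat("(")
                self.E()
                self.eat(")")
            else:
                self.eat("id")

    # Usage: Parser(["id", "+", "id", "*", "id"]).E() parses without error.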
Algorithm to eliminate left recursion
1. Arrange the non-terminals in some order A1, A2, . . . , An.
2. for i := 1 to n do begin
       for j := 1 to i-1 do begin
           replace each production of the form Ai → Aj γ
           by the productions Ai → δ1 γ | δ2 γ | . . . | δk γ,
           where Aj → δ1 | δ2 | . . . | δk are all the current Aj-productions;
       end
       eliminate the immediate left recursion among the Ai-productions
   end
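• A direct transcription of this algorithm into Python is sketched below, assuming the
dict-of-productions grammar representation shown earlier. Like the textbook algorithm,
it assumes the grammar has no cycles or ε-productions; the function name and the "ε"
marker are illustrative choices.

    EPSILON = "ε"

    def eliminate_left_recursion(grammar, order):
        # Work on a copy so the input grammar is left untouched.
        g = {a: [list(p) for p in prods] for a, prods in grammar.items()}
        for i, ai in enumerate(order):
            # Step 2a: substitute earlier non-terminals Aj (j < i) that
            # appear at the front of an Ai-production.
            for aj in order[:i]:
                new_prods = []
                for prod in g[ai]:
                    if prod and prod[0] == aj:
                        for delta in g[aj]:
                            new_prods.append(delta + prod[1:])
                    else:
                        new_prods.append(prod)
                g[ai] = new_prods
            # Step 2b: eliminate immediate left recursion among Ai-productions.
            alphas = [p[1:] for p in g[ai] if p and p[0] == ai]
            if alphas:
                betas = [p for p in g[ai] if not p or p[0] != ai]
                prime = ai + "'"
                g[ai] = [beta + [prime] for beta in betas]
                g[prime] = [alpha + [prime] for alpha in alphas] + [[EPSILON]]
        return g

    # Usage, on the expression grammar encoded earlier:
    g = {"E": [["E", "+", "T"], ["T"]],
         "T": [["T", "*", "F"], ["F"]],
         "F": [["(", "E", ")"], ["id"]]}
    print(eliminate_left_recursion(g, ["E", "T", "F"]))
    # -> E -> T E',  E' -> + T E' | ε,  T -> F T',  T' -> * F T' | ε,  F unchanged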
Indirect left recursion
• A grammar is said to have indirect left recursion if, starting from a non-terminal, it
is possible to derive, in two or more steps, a sentential form that begins with that
same non-terminal.
• For example: A → Br, B → Cd, C → At,
• where A, B, C are non-terminals and r, d, t are terminals. Here, starting with A, we
can derive a string beginning with A again: A ⇒ Br ⇒ Cdr ⇒ Atdr.
Example
• A1 → A2 A3
• A2 → A3 A1 | b
• A3 → A1 A1 | a
• where A1, A2, A3 are non-terminals and a, b are terminals.
Solution
• Identify the productions that can cause indirect left recursion. In our case:
A3 → A1 A1 | a.
• Substitute the A1-productions wherever A1 appears at the start of a production of A3:
substituting A1 → A2 A3 gives A3 → A2 A3 A1 | a.
• Substituting A2 → A3 A1 | b in turn gives A3 → A3 A1 A3 A1 | b A3 A1 | a, which now
has immediate left recursion; eliminating it yields
• A3 → b A3 A1 A3’ | a A3’
• A3’ → A1 A3 A1 A3’ | ε
Left Factoring
• Left factoring is a grammar transformation that is useful for producing a
grammar suitable for predictive parsing. When it is not clear which of two
alternative productions to use to expand a non-terminal A, we can rewrite
the A-productions to defer the decision until we have seen enough of the
input to make the right choice.
• If there is any production A → αβ1 | αβ2 , it can be rewritten as
• A → αA’
• A’ → β1 | β2
Example
• Consider the grammar G (where i = if, t = then, e = else): S → iEtS | iEtSeS | a
• E → b
• Left factored, this grammar becomes
• S → iEtSS’ | a
• S’ → eS | ε
• E→b
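• A minimal Python sketch of one left-factoring step, using the same dict-of-productions
representation assumed earlier (the function name and "ε" marker are illustrative):

    EPSILON = "ε"

    def left_factor_once(a, prods):
        """Factor the longest prefix shared by two or more A-productions.
        Returns a dict of updated productions (possibly adding A')."""
        # Find the longest prefix shared by at least two alternatives.
        best = []
        for p in prods:
            for q in prods:
                if p is q:
                    continue
                k = 0
                while k < len(p) and k < len(q) and p[k] == q[k]:
                    k += 1
                if k > len(best):
                    best = p[:k]
        if not best:
            return {a: prods}                 # nothing to factor
        prime = a + "'"
        factored, others = [], []
        for p in prods:
            (factored if p[:len(best)] == best else others).append(p)
        # Tails after the common prefix; an empty tail becomes ε.
        tails = [p[len(best):] or [EPSILON] for p in factored]
        return {a: others + [best + [prime]], prime: tails}

    # Example: S -> iEtS | iEtSeS | a  becomes  S -> a | iEtS S',  S' -> ε | eS
    print(left_factor_once("S",
          [["i","E","t","S"], ["i","E","t","S","e","S"], ["a"]]))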