
Lesson 3: Syntax Analysis

Risul Islam Rasel
Assistant Professor
Department of Computer Science and Engineering
Email: [email protected]

Reference book: Compilers: Principles, Techniques, & Tools (2nd Edition), by Aho, Lam, Sethi, Ullman
Benefits of CFG
• A grammar gives a precise, yet easy-to-understand, syntactic specification of a programming language.
• From certain classes of grammars, we can construct automatically an efficient parser that determines the syntactic structure of a source program.
 As a side benefit, the parser-construction process can reveal syntactic ambiguities & trouble spots that might have slipped through the initial design phase of a language.
• The structure imparted to a language by a properly designed grammar is useful for translating source programs into correct object code & for detecting errors.
• A grammar allows a language to be evolved or developed iteratively, by adding new constructs to perform new tasks.
 These new constructs can be integrated more easily into an implementation that follows the grammatical structure of the language.
The Role of the Parser
• The parser obtains a string of tokens from the lexical analyzer & verifies that the string of token names can be generated by the grammar for the source language.
• The parser reports any syntax errors in an intelligible fashion & recovers from commonly occurring errors so it can continue processing the remainder of the program.
• Conceptually, for well-formed programs, the parser constructs a parse tree & passes it to the rest of the compiler for further processing.
Position of parser in compiler model

Parsing Approaches
• Top-down
• Bottom-up
Representative Grammars
• Expression grammar (LR), suitable for bottom-up parsing but not for top-down parsing (left recursive):
E → E + T | T
T → T * F | F
F → ( E ) | id
• Removing left recursion gives a variant suitable for top-down parsing:
E → T E'    E' → + T E' | ε
T → F T'    T' → * F T' | ε
F → ( E ) | id
• An ambiguous variant, E → E + E | E * E | ( E ) | id, produces two parses for a + b * c.
Syntax Error Handling
Common programming errors
• Lexical errors include misspellings of identifiers, keywords, or operators
 e.g., the use of an identifier elipseSize instead of ellipseSize, and missing quotes around text intended as a string.
• Syntactic errors include misplaced semicolons or extra or missing braces; that is, "{" or "}".
 As another example, in C or Java, the appearance of a case statement without an
enclosing switch is a syntactic error
• Semantic errors include type mismatches between operators &
operands.
 An example is a return statement in a Java method with result type void.
• Logical errors can be anything from incorrect reasoning on the part of
the programmer to the use in a C program of the assignment
operator = instead of the comparison operator ==.
 The program containing = may be well formed; however, it may not reflect the
programmer's intent.
Error-Recovery Strategies
• Panic-Mode Recovery
• Phrase-Level Recovery
• Error Productions
• Global Correction

• See details in the Book (Self study)


Context-Free Grammars
• Grammars systematically describe the syntax of programming-language constructs like expressions & statements.
• Using a syntactic variable stmt to denote statements & a variable expr to denote expressions, we can write the production
stmt → if ( expr ) stmt else stmt
Formal Definition of CFG
1. Terminals (T) are the basic symbols from which strings are formed.
• The term "token name" is a synonym for "terminal", and frequently we will use the word "token" for terminal.
• Terminals are the first components of the tokens output by the lexical analyzer.
• Terminals in the example below: if, else, “(”, “)”
stmt → if ( expr ) stmt else stmt
2. Non-terminals (V) are syntactic variables that denote sets of strings.
• Non-terminals in the example: stmt, expr
• The sets of strings denoted by non-terminals help define the language generated by the grammar.
• Non-terminals impose a hierarchical structure on the language that is key to syntax analysis and translation.
stmt → if ( expr ) stmt else stmt

3. Start symbol (S): one non-terminal is distinguished as the start symbol;
• the set of strings it denotes is the language generated by the grammar.
• Conventionally, the productions for the start symbol are listed first.
stmt → if ( expr ) stmt else stmt
4. Productions (P) of a grammar specify the manner in which the terminals & non-terminals can be combined to form strings. Each production consists of:
• (a) A non-terminal called the head or left side of the production; the production defines some of the strings denoted by the head.
• (b) The symbol →. Sometimes ::= has been used in place of the arrow.
• (c) A body or right side consisting of zero or more terminals and non-terminals.
Example: Grammar for simple arithmetic expressions

expression → expression + term
expression → expression - term
expression → term
term → term * factor
term → term / factor
term → factor
factor → ( expression )
factor → id

Terminals: id, +, -, *, /, (, )
Non-terminals: expression, term, factor
Start symbol: expression
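As a quick check of how these productions generate strings, here is one leftmost derivation of id + id * id (each step rewrites the leftmost non-terminal):

expression ⇒ expression + term ⇒ term + term ⇒ factor + term ⇒ id + term
⇒ id + term * factor ⇒ id + factor * factor ⇒ id + id * factor ⇒ id + id * id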
Notational Conventions
1. These symbols are terminals:
- Lowercase letters early in the alphabet: a, b, c.
- Operator symbols: +, *
- Punctuation symbols: parentheses, comma
- Digits: 0, 1, . . . , 9
- Boldface strings: id, if, each of which represents a single terminal symbol.
2. These symbols are non-terminals:
- Uppercase letters early in the alphabet: A, B, C.
- The letter S: start symbol.
- Lowercase italic names: expr, stmt.
- Uppercase letters such as E, T, F for expression grammars.
3. Uppercase letters late in the alphabet, such as X, Y, Z, represent grammar symbols (terminals or non-terminals).
4. Lowercase letters late in the alphabet, such as u, v, ..., z, represent (possibly empty) strings of terminals.
5. Lowercase Greek letters α, β, γ represent (possibly empty) strings of grammar symbols. Thus, a generic production can be written as A → α, where A is the head & α the body.
6. A set of productions A → α1, A → α2, ..., A → αk with a common head A (call them A-productions) may be written A → α1 | α2 | ... | αk. Call α1, α2, ..., αk the alternatives for A.
7. Unless stated otherwise, the head of the first production is the start symbol.
Example
• V (non-terminals): E, T, F
• S (start symbol): E
• T (terminals): the remaining symbols
• P (productions): 8 productions
Derivation
Consider the following example grammar with 5 productions:
1. S → AB
2. A → aaA
3. A → ε
4. B → Bb
5. B → ε
1. S → AB   2. A → aaA   3. A → ε   4. B → Bb   5. B → ε

Leftmost derivation of the string aab (the number on each arrow is the production applied):

S ⇒(1) AB ⇒(2) aaAB ⇒(3) aaB ⇒(4) aaBb ⇒(5) aab

At each step, we substitute the leftmost variable.
1. S → AB   2. A → aaA   3. A → ε   4. B → Bb   5. B → ε

Rightmost derivation of the string aab:

S ⇒(1) AB ⇒(4) ABb ⇒(5) Ab ⇒(2) aaAb ⇒(3) aab

At each step, we substitute the rightmost variable.
1. S → AB   2. A → aaA   3. A → ε   4. B → Bb   5. B → ε

Leftmost derivation of aab:
S ⇒(1) AB ⇒(2) aaAB ⇒(3) aaB ⇒(4) aaBb ⇒(5) aab

Rightmost derivation of aab:
S ⇒(1) AB ⇒(4) ABb ⇒(5) Ab ⇒(2) aaAb ⇒(3) aab
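Because a derivation is just repeated string rewriting, the two orders above are easy to replay mechanically. A minimal Python sketch (the encoding and helper name are mine, not from the lesson):

# Productions of the example grammar, numbered as on the slides.
# Uppercase letters are variables; "" stands for epsilon.
PRODUCTIONS = {1: ("S", "AB"), 2: ("A", "aaA"), 3: ("A", ""),
               4: ("B", "Bb"), 5: ("B", "")}

def rewrite(form, prod_no):
    """Apply a production to the leftmost occurrence of its head.
    In the sequences above that occurrence is also the leftmost
    (or rightmost) variable, so the traces match the slides."""
    head, body = PRODUCTIONS[prod_no]
    i = form.index(head)
    return form[:i] + body + form[i + 1:]

form = "S"
for prod_no in (1, 2, 3, 4, 5):       # leftmost order from the slide
    form = rewrite(form, prod_no)
    print(form)                       # AB, aaAB, aaB, aaBb, aab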
Derivation Trees
Consider the same example grammar:
S → AB   A → aaA | ε   B → Bb | ε

and a derivation of aab:

S ⇒ AB ⇒ aaAB ⇒ aaABb ⇒ aaBb ⇒ aab
S → AB   A → aaA | ε   B → Bb | ε

S ⇒ AB

      S
     / \
    A   B

yield: AB
S → AB   A → aaA | ε   B → Bb | ε

S ⇒ AB ⇒ aaAB

      S
     / \
    A   B
  / | \
 a  a  A

yield: aaAB
S → AB   A → aaA | ε   B → Bb | ε

S ⇒ AB ⇒ aaAB ⇒ aaABb

      S
     / \
    A   B
  / | \  | \
 a  a  A B  b

yield: aaABb
S → AB   A → aaA | ε   B → Bb | ε

S ⇒ AB ⇒ aaAB ⇒ aaABb ⇒ aaBb

      S
     / \
    A   B
  / | \  | \
 a  a  A B  b
       |
       ε

yield: aa ε Bb = aaBb
S → AB   A → aaA | ε   B → Bb | ε

S ⇒ AB ⇒ aaAB ⇒ aaABb ⇒ aaBb ⇒ aab

Derivation tree (parse tree):

      S
     / \
    A   B
  / | \  | \
 a  a  A B  b
       | |
       ε ε

yield: aa ε ε b = aab
Sometimes, derivation order doesn’t matter.

Leftmost derivation:
S ⇒ AB ⇒ aaAB ⇒ aaB ⇒ aaBb ⇒ aab
Rightmost derivation:
S ⇒ AB ⇒ ABb ⇒ Ab ⇒ aaAb ⇒ aab

Both give the same derivation tree:

      S
     / \
    A   B
  / | \  | \
 a  a  A B  b
       | |
       ε ε
Parse Tree & Derivation
Parse tree and leftmost derivation (LMD) for -(id+id).
Ambiguity
• A grammar that produces more than one parse tree for some sentence is said to be ambiguous.
• Equivalently, an ambiguous grammar is one that produces more than one leftmost derivation or more than one rightmost derivation for some sentence.
Grammar for mathematical expressions:

E → E + E | E * E | (E) | a

Example strings (a denotes any number):
(a + a) * a
(a + a * (a + a))
E → E + E | E * E | (E) | a

A leftmost derivation for a + a * a:

E ⇒ E + E ⇒ a + E ⇒ a + E * E ⇒ a + a * E ⇒ a + a * a

Corresponding tree:

      E
    / | \
   E  +  E
   |   / | \
   a  E  *  E
      |     |
      a     a
E → E + E | E * E | (E) | a

Another leftmost derivation for a + a * a:

E ⇒ E * E ⇒ E + E * E ⇒ a + E * E ⇒ a + a * E ⇒ a + a * a

Corresponding tree:

        E
      / | \
     E  *  E
   / | \   |
  E  +  E  a
  |     |
  a     a
E → E + E | E * E | (E) | a

Two derivation trees for a + a * a:

      E                      E
    / | \                  / | \
   E  +  E                E  *  E
   |   / | \            / | \   |
   a  E  *  E          E  +  E  a
      |     |          |     |
      a     a          a     a
Take a = 2:   a + a * a  =  2 + 2 * 2

      E                      E
    / | \                  / | \
   E  +  E                E  *  E
   |   / | \            / | \   |
   2  E  *  E          E  +  E  2
      |     |          |     |
      2     2          2     2
Good Tree: 2 + (2 * 2) = 6      Bad Tree: (2 + 2) * 2 = 8

Computing the expression result using each tree (values shown at the E nodes):

      E = 6                      E = 8
    / | \                      / | \
   E  +  E = 4            E = 4  *  E
   |   / | \              / | \     |
   2  E  *  E            E  +  E    2
      |     |            |     |
      2     2            2     2
Two different derivation trees may cause problems in applications which use the derivation trees:
• Evaluating expressions
• In general, in compilers for programming languages
Ambiguous Grammar:

A context-free grammar G is ambiguous if there is a string w ∈ L(G) which has:
• two different derivation trees, or
• two leftmost derivations.

(Two different derivation trees give two different leftmost derivations, and vice-versa.)
Example: E → E + E | E * E | ( E ) | a

This grammar is ambiguous since the string a + a * a has two derivation trees:

      E                      E
    / | \                  / | \
   E  +  E                E  *  E
   |   / | \            / | \   |
   a  E  *  E          E  +  E  a
      |     |          |     |
      a     a          a     a
E → E + E | E * E | (E) | a

This grammar is ambiguous also because the string a + a * a has two leftmost derivations:

E ⇒ E + E ⇒ a + E ⇒ a + E * E ⇒ a + a * E ⇒ a + a * a

E ⇒ E * E ⇒ E + E * E ⇒ a + E * E ⇒ a + a * E ⇒ a + a * a
Another ambiguous grammar:

IF_STMT → if EXPR then STMT
        | if EXPR then STMT else STMT

(IF_STMT, EXPR, STMT are variables; if, then, else are terminals.)

This is a very common piece of grammar in programming languages.

For the string
if expr1 then if expr2 then stmt1 else stmt2
there are two derivation trees:

Tree 1 (else attached to the inner if):
IF_STMT → if expr1 then STMT
                        STMT → if expr2 then stmt1 else stmt2

Tree 2 (else attached to the outer if):
IF_STMT → if expr1 then STMT else stmt2
                        STMT → if expr2 then stmt1
In general, ambiguity is bad and we want to remove it.

Sometimes it is possible to find a non-ambiguous grammar for a language, but in general this is difficult to achieve.

Example: id + id * id has two leftmost derivations (two LMDs) in the ambiguous expression grammar.
Lexical Versus Syntactic Analysis
• Everything that can be described by a regular expression can also be described by a grammar.
• Why, then, use regular expressions to define the lexical syntax of a language?
Several reasons
1. Separating the syntactic structure of a language into lexical and non-lexical parts provides a convenient way of modularizing the front end of a compiler into two manageable-sized components.
2. The lexical rules of a language are frequently quite simple, and to describe them we do not need a notation as powerful as grammars.
3. Regular expressions generally provide a more concise and easier-to-understand notation for tokens than grammars.
4. More efficient lexical analyzers can be constructed automatically from regular expressions than from arbitrary grammars.

RE: most useful for describing the structure of constructs such as identifiers, constants, keywords, and white space.
CFG: most useful for describing nested structures such as balanced parentheses, matching begin-end's, and corresponding if-then-else's.
Nested structures cannot be described by regular expressions.
Eliminating Ambiguity
• Sometimes an ambiguous grammar can be rewritten to eliminate the ambiguity.
• As an example, we shall eliminate the ambiguity from the following “dangling-else” grammar:

stmt → if expr then stmt
     | if expr then stmt else stmt
     | other

 "other" stands for any other statement.

A compound conditional statement such as
if E1 then S1 else if E2 then S2 else S3
has a unique parse tree.

Ambiguous or non-ambiguous? Let's try:
if E1 then if E2 then S1 else S2
has two parse trees, so the grammar is ambiguous.

Which one is the best? In all programming languages with conditional statements of this form, the tree that associates the else with the nearer if is preferred:
"Match each else with the closest unmatched then."
Rewrite the dangling-else grammar:

Idea: a statement appearing between a then and an else must be "matched"; the interior statement must not end with an unmatched or open then.
A matched statement is either an if-then-else statement containing no open statements or it is any other kind of unconditional statement.

stmt → matched_stmt
     | open_stmt
matched_stmt → if expr then matched_stmt else matched_stmt
             | other
open_stmt → if expr then stmt
          | if expr then matched_stmt else open_stmt

This grammar generates the same strings, but allows only one parse tree for
if E1 then if E2 then S1 else S2
(the one matching each else with the closest unmatched then).


A successful example:

Equivalent grammars generating the same language:

Ambiguous grammar        Non-ambiguous grammar
E → E + E                E → E + T | T
E → E * E                T → T * F | F
E → (E)                  F → (E) | a
E → a
In the non-ambiguous grammar
E → E + T | T
T → T * F | F
F → (E) | a
the string a + a * a has a unique derivation:

E ⇒ E + T ⇒ T + T ⇒ F + T ⇒ a + T ⇒ a + T * F ⇒ a + F * F ⇒ a + a * F ⇒ a + a * a

and a unique derivation tree:

        E
      / | \
     E  +  T
     |   / | \
     T  T  *  F
     |  |     |
     F  F     a
     |  |
     a  a
Elimination of Left Recursion
• A grammar is left recursive if it has a non-terminal A such that there is a derivation A ⇒+ Aα for some string α.
• Top-down parsing methods cannot handle left-recursive grammars, so a transformation is needed to eliminate left recursion.
• The left-recursive pair of productions
A → Aα | β
can be replaced by the non-left-recursive productions
A → β A'
A' → α A' | ε
without changing the strings derivable from A.
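This replacement is mechanical, so it is easy to script. A small Python sketch of the single-non-terminal case (the representation and function name are my own, not from the book):

def remove_immediate_left_recursion(head, bodies):
    """Given A-productions as tuples of symbols, split them into
    left-recursive bodies (A alpha) and the rest (beta), and build
        A -> beta A'        A' -> alpha A' | epsilon
    The empty tuple () encodes an epsilon body."""
    alphas = [b[1:] for b in bodies if b[:1] == (head,)]   # A -> A alpha
    betas  = [b for b in bodies if b[:1] != (head,)]       # A -> beta
    if not alphas:
        return {head: bodies}          # nothing to eliminate
    new_head = head + "'"
    return {
        head: [beta + (new_head,) for beta in betas],
        new_head: [alpha + (new_head,) for alpha in alphas] + [()],
    }

# E -> E + T | T   becomes   E -> T E',  E' -> + T E' | epsilon
print(remove_immediate_left_recursion("E", [("E", "+", "T"), ("T",)]))
# {'E': [('T', "E'")], "E'": [('+', 'T', "E'"), ()]}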
Example: the non-left-recursive expression grammar
• Eliminating immediate left recursion from the expression grammar:
• The left-recursive pair of productions E → E + T | T is replaced by
E → T E'
E' → + T E' | ε
• Immediate left recursion can be eliminated by the following technique, which works for any number of A-productions.
• First, group the productions as
A → Aα1 | Aα2 | ... | Aαm | β1 | β2 | ... | βn
where no βi begins with an A. Then, replace the A-productions by
A → β1 A' | β2 A' | ... | βn A'
A' → α1 A' | α2 A' | ... | αm A' | ε
The non-terminal A generates the same strings as before but is no longer left recursive.
This procedure eliminates all left recursion from the A and A' productions (provided no αi is ε), but it does not eliminate left recursion involving derivations of two or more steps. For example, consider the grammar
S → Aa | b
A → Ac | Sd | ε

The non-terminal S is left recursive because
S ⇒ Aa ⇒ Sda,
but it is not immediately left recursive.
Example
• Order the non-terminals: S, A.
• There is no immediate left recursion among the S-productions, so nothing happens during the outer loop for i = 1.
• For i = 2, substitute for S in A → Sd to obtain the following A-productions:
A → Ac | Aad | bd | ε
• Eliminating the immediate left recursion gives:
S → Aa | b
A → bdA' | A'
A' → cA' | adA' | ε
Left Factoring
• Left factoring is a grammar transformation that
is useful for producing a grammar suitable for
predictive, or top-down, parsing.
• When the choice between two alternative A-
productions is not clear, we may be able to
rewrite the productions to defer the decision
until enough of the input has been seen that we
can make the right choice.
For example, if we have the two productions
stmt → if expr then stmt else stmt
     | if expr then stmt
then on seeing the input if, we cannot immediately tell which production to choose to expand stmt.
In general, if A → αβ1 | αβ2 are two A-productions & the input begins with a nonempty string derived from α, we do not know whether to expand A to αβ1 or to αβ2.
We may defer the decision by expanding A to αA'.
Then, after seeing the input derived from α, we expand A' to β1 or β2.
Left-factored, the original productions become:
A → αA'
A' → β1 | β2
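This deferral can also be automated by factoring out the longest common prefix of alternatives that start alike. A simplified Python sketch (it factors each group of alternatives sharing a first symbol; names and encoding are mine):

def common_prefix(bodies):
    """Longest prefix of symbols shared by all the given bodies."""
    prefix = []
    for symbols in zip(*bodies):
        if len(set(symbols)) != 1:
            break
        prefix.append(symbols[0])
    return tuple(prefix)

def left_factor(head, bodies):
    """One round of left factoring: alternatives that begin with the
    same symbol become  A -> alpha A'  with  A' -> beta1 | beta2 | ...
    The empty tuple () encodes an epsilon body."""
    groups = {}
    for body in bodies:
        groups.setdefault(body[:1], []).append(body)
    new_rules, fresh = {head: []}, 0
    for group in groups.values():
        alpha = common_prefix(group)
        if len(group) == 1 or not alpha:
            new_rules[head].extend(group)    # nothing to factor here
            continue
        fresh += 1
        a_prime = head + "'" * fresh
        new_rules[head].append(alpha + (a_prime,))
        new_rules[a_prime] = [body[len(alpha):] for body in group]
    return new_rules

# stmt -> if E then S else S | if E then S   factors into
# stmt -> if E then S stmt',  stmt' -> else S | epsilon
print(left_factor("stmt", [("if", "E", "then", "S", "else", "S"),
                           ("if", "E", "then", "S")]))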
Algorithm: Left factoring a grammar.
INPUT: Grammar G
OUTPUT: An equivalent left-factored grammar.

Example
Left-factoring the dangling-else grammar:
S → i E t S | i E t S e S | a
E → b
Here, i, t, and e stand for if, then, and else; E stands for a conditional expression and S for a conditional statement.
The longest common prefix is i E t S, so left factoring gives:
S → i E t S S' | a
S' → e S | ε
E → b
Top-Down Parsing
• Constructing a parse tree for the input string,
• starting from the root and creating the nodes of the parse tree in preorder (depth-first).
• Equivalent to a leftmost derivation.

Input: id + id * id
Recursive-Descent Parsing
• A recursive-descent parsing program consists of a set of procedures, one for each non-terminal.
• Execution begins with the procedure for the start symbol, which halts and announces success if its procedure body scans the entire input string.
• General recursive descent may require backtracking; that is, it may require repeated scans over the input.
• However, backtracking is rarely needed to parse programming-language constructs, so backtracking parsers are not seen frequently.
• Even for situations like natural-language parsing, backtracking is not very efficient.
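As a concrete sketch, here is a backtracking-free recursive-descent recognizer in Python for the non-left-recursive expression grammar used throughout this lesson (E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε, F → (E) | id). One procedure per non-terminal; the token handling is my own simplification:

class Parser:
    """Recursive descent for:
       E -> T E'     E' -> + T E' | eps
       T -> F T'     T' -> * F T' | eps
       F -> ( E ) | id"""
    def __init__(self, tokens):
        self.tokens = tokens + ["$"]   # endmarker
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def match(self, expected):
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected}, got {self.peek()}")
        self.pos += 1

    def E(self):
        self.T(); self.E_prime()

    def E_prime(self):
        if self.peek() == "+":         # E' -> + T E'
            self.match("+"); self.T(); self.E_prime()
        # else E' -> eps: consume nothing

    def T(self):
        self.F(); self.T_prime()

    def T_prime(self):
        if self.peek() == "*":         # T' -> * F T'
            self.match("*"); self.F(); self.T_prime()

    def F(self):
        if self.peek() == "(":         # F -> ( E )
            self.match("("); self.E(); self.match(")")
        else:                          # F -> id
            self.match("id")

p = Parser(["id", "+", "id", "*", "id"])
p.E()
assert p.peek() == "$"                 # entire input consumed: success

Because this grammar is LL(1), each procedure can pick its production by looking only at the next token, which is why no backtracking is needed.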
Modifying Recursive-Descent Parsing
• To allow backtracking, the recursive-descent parsing algorithm needs to be modified.
• First, we cannot choose a unique A-production at line (1) of the algorithm, so we must try each of several productions in some order.
• Then, failure at line (7) is not ultimate failure, but suggests only that we need to return to line (1) and try another A-production.
• Only if there are no more A-productions to try do we declare that an input error has been found.
• In order to try another A-production, we need to be able to reset the input pointer to where it was when we first reached line (1).
• Thus, a local variable is needed to store this input pointer for future use.
Example: grammar S → cAd, A → ab | a; input w = cad.

After matching c and expanding A by its first alternative, A → ab, there is a match for the second input symbol, a, so advance the input pointer to d, the third input symbol,
- compare d against the next leaf, labeled b;
- since b does not match d,
- report failure and go back to A to see whether there is another alternative for A that has not been tried.

In going back to A, reset the input pointer to position 2, the position it had when we first came to A,
- which means that the procedure for A must store the input pointer in a local variable.
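A tiny Python sketch of this bookkeeping for the grammar above (S → cAd, A → ab | a) on w = cad; the input position is saved before each A-alternative and is still available when the next alternative is tried:

def parse_A(tokens, pos):
    """Try the A-alternatives in order: first A -> ab, then A -> a.
    Return the new input position on success, or None on failure.
    `pos` plays the role of the saved input pointer."""
    for body in (["a", "b"], ["a"]):
        if tokens[pos:pos + len(body)] == body:
            return pos + len(body)
        # mismatch: fall through with the pointer reset to `pos`
    return None

def parse_S(tokens):
    """S -> c A d"""
    if tokens[:1] != ["c"]:
        return False
    after_A = parse_A(tokens, 1)       # A must follow the c
    return after_A is not None and tokens[after_A:] == ["d"]

print(parse_S(list("cad")))   # True: A -> ab fails at d, A -> a succeeds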
FIRST and FOLLOW
• The construction of both top-down and bottom-up parsers is aided by two functions: FIRST and FOLLOW.
• During top-down parsing, FIRST and FOLLOW allow us to choose which production to apply, based on the next input symbol.
FIRST
• Define FIRST(α), where α is any string of grammar symbols, to be the set of terminals that begin strings derived from α.
• If α ⇒* ε, then ε is also in FIRST(α).
• Example: if A ⇒* cγ, then c ∈ FIRST(A).
FOLLOW
• Define FOLLOW(A), for non-terminal A, to be the set of terminals a that can appear immediately to the right of A in some sentential form; that is, the set of terminals a such that there exists a derivation of the form S ⇒* αAaβ, for some α and β.
• Note that there may have been symbols between A & a at some time during the derivation, but if so, they derived ε & disappeared.
• If A can be the rightmost symbol in some sentential form, then $ is in FOLLOW(A);
• $ is a special "endmarker" symbol that is assumed not to be a symbol of any grammar.
• To compute FIRST(X) for all grammar symbols X, apply the following rules until no more terminals or ε can be added to any FIRST set:
1. If X is a terminal, then FIRST(X) = {X}.
2. If X → ε is a production, then add ε to FIRST(X).
3. If X → Y1 Y2 ... Yk is a production, add to FIRST(X) every non-ε symbol of FIRST(Y1); if ε is in FIRST(Y1), also add the non-ε symbols of FIRST(Y2), and so on; if ε is in FIRST(Yi) for all i, add ε to FIRST(X).
• To compute FOLLOW(A) for all non-terminals A, apply the following rules until nothing can be added to any FOLLOW set:
1. Place $ in FOLLOW(S), where S is the start symbol and $ is the input endmarker.
2. If there is a production A → αBβ, then everything in FIRST(β) except ε is in FOLLOW(B).
3. If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then everything in FOLLOW(A) is in FOLLOW(B).
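These fixed-point rules translate almost directly into code. A Python sketch (grammar encoding and names are mine) that computes FIRST and FOLLOW for the non-left-recursive expression grammar, matching the sets listed on the next slide:

EPS, END = "eps", "$"
# E -> T E'    E' -> + T E' | eps    T -> F T'
# T' -> * F T' | eps                 F -> ( E ) | id
GRAMMAR = {
    "E":  [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],     # () encodes an epsilon body
    "T":  [("F", "T'")],
    "T'": [("*", "F", "T'"), ()],
    "F":  [("(", "E", ")"), ("id",)],
}
NONTERMS = set(GRAMMAR)

def first_of(symbols, first):
    """FIRST of a string of symbols, given current per-symbol sets."""
    result = set()
    for s in symbols:
        f = first[s] if s in NONTERMS else {s}
        result |= f - {EPS}
        if EPS not in f:
            return result             # s cannot vanish: stop here
    result.add(EPS)                   # every symbol can derive eps
    return result

first = {a: set() for a in NONTERMS}
follow = {a: set() for a in NONTERMS}
follow["E"].add(END)                  # rule 1: $ in FOLLOW(start)

changed = True
while changed:                        # iterate to a fixed point
    changed = False
    for head, bodies in GRAMMAR.items():
        for body in bodies:
            f = first_of(body, first)
            if not f <= first[head]:
                first[head] |= f; changed = True
            for i, s in enumerate(body):
                if s not in NONTERMS:
                    continue
                rest = first_of(body[i + 1:], first)
                new = rest - {EPS}
                if EPS in rest:       # s can end the body: rule 3
                    new |= follow[head]
                if not new <= follow[s]:
                    follow[s] |= new; changed = True

print(first)    # e.g. first["E'"] == {"+", "eps"}
print(follow)   # e.g. follow["F"] == {"+", "*", ")", "$"}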
LL(1) Grammars
• Predictive parsers (recursive-descent parsers needing no backtracking) can be constructed for a class of grammars called LL(1):
• "L": scanning the input from left to right,
• "L": producing a leftmost derivation,
• "1": using one input symbol of lookahead at each step.
• A grammar G is LL(1) if and only if whenever A → α | β are two distinct productions of G, the following conditions hold:
1. FIRST(α) and FIRST(β) are disjoint sets; for no terminal a do both α and β derive strings beginning with a.
2. At most one of α and β can derive the empty string.
3. If β ⇒* ε, then α does not derive any string beginning with a terminal in FOLLOW(A), and likewise with α and β interchanged.
For the non-left-recursive expression grammar:
FIRST(F) = FIRST(T) = FIRST(E) = {(, id}
FIRST(E') = {+, ε}
FIRST(T') = {*, ε}
FOLLOW(E) = FOLLOW(E') = {), $}
FOLLOW(T) = FOLLOW(T') = {+, ), $}
FOLLOW(F) = {+, *, ), $}
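A quick check of the LL(1) conditions with these sets: for the pair E' → +TE' | ε, FIRST(+TE') = {+}, and since the second alternative derives ε we also need FIRST(+TE') ∩ FOLLOW(E') = {+} ∩ {), $} = ∅, which holds. The same check succeeds for every pair of alternatives, so the non-left-recursive expression grammar is LL(1).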
An ambiguous grammar, by contrast, yields multiply defined entries in the predictive parsing table.
Non-recursive Predictive Parsing
Error Recovery in Predictive Parsing
Bottom-Up Parsing

Shift
Reduce
Algorithm
Reductions
• Bottom-up parsing is the process of "reducing" a string w to the start symbol of the grammar.
• At each reduction step, a specific substring matching the body of a production is replaced by the non-terminal at the head of that production.
• The key decisions (see the worked sequence below):
 when to reduce
 what production to apply
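For example, with the unambiguous expression grammar E → E + T | T, T → T * F | F, F → (E) | id, the string id * id is reduced to the start symbol by the sequence

id * id,  F * id,  T * id,  T * F,  T,  E

where each form is obtained from the previous one by replacing one production body with its head.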
Reductions
By definition, a reduction is the reverse of a step in a derivation.
The goal of bottom-up parsing is therefore to construct a derivation in reverse.
This derivation is in fact a rightmost derivation.
Handle Pruning
Bottom-up parsing during a left-to-right scan of the
input constructs a rightmost derivation in reverse.
 Informally, a "handle" is a substring that matches the
body of a production,
 and whose reduction represents one step along the
reverse of a rightmost derivation.
A Handle
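A small illustration with the same expression grammar: in the right-sentential form F * id, the handle is F, to be reduced by T → F (giving T * id). The trailing id also matches the body of F → id, but reducing it there would produce F * F, which appears in no rightmost derivation, so that occurrence is not a handle.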
Shift Reduce Parsing
• Shift-reduce parsing is a form of bottom-up parsing in
which a stack holds grammar symbols and an input
buffer holds the rest of the string to be parsed.
• As we shall see, the handle always appears at the top
of the stack just before it is identified as the handle.
• We use $ to mark the bottom of the stack and also the
right end of the input.
• Conventionally, when discussing bottom-up parsing,
we show the top of the stack on the right, rather than
on the left as we did for top-down parsing.
• Initially, the stack is empty, and the string w is on the input, as follows:
STACK    INPUT
$        w$
• During a left-to-right scan of the input string, the parser shifts zero or more input symbols onto the stack, until it is ready to reduce a string β of grammar symbols on top of the stack.
• It then reduces β to the head of the appropriate production.
• The parser repeats this cycle until it has detected an error or until the stack contains the start symbol and the input is empty:
STACK    INPUT
$S       $
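A worked trace of this cycle for input id * id with the unambiguous expression grammar (the action column names the production used by each reduction):

STACK      | INPUT      | ACTION
$          | id * id $  | shift
$ id       | * id $     | reduce by F → id
$ F        | * id $     | reduce by T → F
$ T        | * id $     | shift
$ T *      | id $       | shift
$ T * id   | $          | reduce by F → id
$ T * F    | $          | reduce by T → T * F
$ T        | $          | reduce by E → T
$ E        | $          | accept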
Shift-Reduce Parsing
Why LR Parsers?
• LR parsers are table-driven, much like the non-recursive LL parsers of Section 4.4.4.
• A grammar for which we can construct a parsing table using one of the methods in this section and the next is said to be an LR grammar.
• Intuitively, for a grammar to be LR it is sufficient that a left-to-right shift-reduce parser be able to recognize handles of right-sentential forms when they appear on top of the stack.
• LR parsing is attractive for a variety of reasons.
• The principal drawback of the LR method is that it is too much work to construct an LR parser by hand for a typical programming-language grammar.
• A specialized tool, an LR parser generator, is needed. Fortunately, many such generators are available, and we shall discuss one of the most commonly used ones, Yacc, in Section 4.9.
• Such a generator takes a context-free grammar and automatically produces a parser for that grammar.
• If the grammar contains ambiguities or other constructs that are difficult to parse in a left-to-right scan of the input, then the parser generator locates these constructs and provides detailed diagnostic messages.
The LR-Parsing Algorithm
