Ch3 Compiler Ebook
Syntax Analysis
The second phase of a compiler is called syntax analysis. The input to this phase
consists of the stream of tokens put out by the lexical analysis phase. These tokens are
then checked for proper syntax, i.e., the compiler checks to make sure the statements and
expressions are correctly formed. Some examples of syntax errors in Java are an
unbalanced parenthesis, a missing semicolon, and a misspelled keyword.
When the compiler encounters such an error, it should put out an informative message for
the user. At this point, it is not necessary for the compiler to generate an object program.
A compiler is not expected to guess the intended purpose of a program with syntax errors.
A good compiler, however, will continue scanning the input for additional syntax errors.
The output of the syntax analysis phase (if there are no syntax errors) could be a
stream of atoms or syntax trees. An atom is a primitive operation which is found in most
computer architectures, or which can be implemented using only a few machine language
instructions. Each atom also includes operands, which are ultimately converted to
memory addresses on the target machine. A syntax tree is a data structure in which the
interior nodes represent operations, and the leaves represent operands, as discussed in
Section 1.2.2. We will see that the parser can be used not only to check for proper syntax,
but to produce output as well. This process is called syntax directed translation.
Just as we used formal methods to specify and construct the lexical scanner, we
will do the same with syntax analysis. In this case however, the formal methods are far
more sophisticated. Most of the early work in the theory of compiler design focused on
syntax analysis. We will introduce the concept of a formal grammar not only as a means
of specifying the programming language, but also as a means of implementing the syntax
analysis phase of the compiler.

3.0 Grammars, Languages, and Pushdown Machines
Before we discuss the syntax analysis phase of a compiler, there are some concepts of
formal language theory which the student must understand. These concepts play a vital
role in the design of the compiler. They are also important for the understanding of
programming language design and programming in general.
3.0.1 Grammars
Recall our definition of language from Chapter 2 as a set of strings. We have already
seen two ways of formally specifying a language – regular expressions and finite state
machines. We will now define a third way of specifying languages, i.e. by using a
grammar. A grammar is a list of rules which can be used to produce or generate all the
strings of a language, and which does not generate any strings which are not in the
language. More formally a grammar consists of:
1. A finite set of characters, called the input alphabet, the input symbols, or terminal
symbols.
2. A finite set of symbols, distinct from the terminal symbols, called nonterminal
symbols, exactly one of which is designated the starting nonterminal.
3. A finite list of rewriting rules, also called productions, which define how strings in the
language may be generated. Each of these rewriting rules is of the form α → β, where
α and β are arbitrary strings of terminals and nonterminals, and α is not null.
The grammar specifies a language in the following way: beginning with the starting
nonterminal, any of the rewriting rules are applied repeatedly to produce a sentential
form, which may contain a mix of terminals and nonterminals. If at any point, the
sentential form contains no nonterminal symbols, then it is in the language of this gram-
mar. If G is a grammar, then we designate the language specified by this grammar as
L(G).
A derivation is a sequence of rewriting rules, applied to the starting nonterminal,
ending with a string of terminals. A derivation thus serves to demonstrate that a particular
string is a member of the language. Assuming that the starting nonterminal is S, we will
write derivations in the following form:
S ⇒ α ⇒ β ⇒ γ ⇒ ... ⇒ x
G1:
1. S → 0S0
2. S → 1S1
3. S → 0
4. S → 1
An example of a derivation using grammar G1 is:
S ⇒ 0S0 ⇒ 00S00 ⇒ 001S100 ⇒ 0010100
Thus, 0010100 is in L(G1), i.e. it is one of the strings in the language of grammar G1.
The student should find other derivations using G1 and verify that G1 specifies the
language of palindromes of odd length over the alphabet {0,1}. A palindrome is a
string which reads the same from left to right as it does from right to left.
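The claim that G1 generates exactly the odd-length palindromes can be checked mechanically for short strings. The sketch below is our own illustration (the function name derive_all is not from the text): it expands every possible derivation of G1 up to a length bound and verifies that each derived string is an odd-length palindrome.

```python
def derive_all(max_len):
    """Return every terminal string of length <= max_len derivable in G1.

    G1:  S -> 0S0 | 1S1 | 0 | 1   (S is the only nonterminal)
    """
    rhs = ["0S0", "1S1", "0", "1"]
    done, frontier = set(), {"S"}
    while frontier:
        nxt = set()
        for s in frontier:
            i = s.find("S")
            if i < 0:                      # no nonterminal left: a sentence
                done.add(s)
                continue
            for r in rhs:                  # rewrite the single S each way
                t = s[:i] + r + s[i + 1:]
                if len(t) <= max_len:      # prune: strings never shrink
                    nxt.add(t)
        frontier = nxt
    return done

words = derive_all(7)
assert "0010100" in words
assert all(w == w[::-1] and len(w) % 2 == 1 for w in words)
```

Since an odd-length palindrome is completely determined by its first half, there are 2 + 4 + 8 + 16 = 30 such strings of length at most 7, which is exactly the number of strings the function produces.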
In our next example, the terminal symbols are {a,b} (ε represents the null string and is
not a terminal symbol).
G2:
1. S → ASB
2. S → ε
3. A → a
4. B → b
Thus, aabb is in L(G2), as the following derivation shows:
S ⇒ ASB ⇒ aSB ⇒ aSb ⇒ aASBb ⇒ aaSBb ⇒ aaBb ⇒ aabb
Note that the null string is permitted in a rewriting rule. G2 specifies the set of all
strings of a's and b's which consist of zero or more a's followed by exactly the same
number of b's; i.e., all the a's precede all the b's.
Two grammars, g1 and g2, are said to be equivalent if L(g1) = L(g2) – i.e., they
specify the same language. In this example (grammar G2) there can be several different
derivations for a particular string – i.e., the rewriting rules could have been applied in a
different sequence to arrive at the same result.
Sample Problem: Show three different derivations using the following grammar:
1. S → a S A
2. S → B A
3. A → a b
4. B → b A
Solution
S ⇒ a S A ⇒ a B A A ⇒ a B a b A ⇒ a B a b a b
⇒ a b A a b a b ⇒ a b a b a b a b
S ⇒ a S A ⇒ a S a b ⇒ a B A a b ⇒ a b A A a b
⇒ a b a b A a b ⇒ a b a b a b a b
S ⇒ B A ⇒ b A A ⇒ b a b A ⇒ b a b a b
Note that in the solution to this problem we have shown that it is possible to have more
than one derivation for the same string: abababab.
Grammars are classified according to the restrictions placed on their rewriting rules:
0. Unrestricted – A grammar in which no restrictions are placed on the rewriting rules.
An example of an unrestricted rule is shown below:
SaB → cS
1. Context-Sensitive – A context-sensitive grammar is one in which each rule must be of
the form:
αAγ → αβγ
where α, β and γ are any strings of terminals and nonterminals (including ε), and A
represents a single nonterminal. In this type of grammar, it is the nonterminal on the left
side of the rule (A) which is being rewritten, but only if it appears in a particular context,
α on its left and γ on its right. An example of a context-sensitive rule is shown below:
SaB → caB
which is another way of saying that an S may be rewritten as a c, but only if the S is
followed by aB (i.e. when S appears in that context). In the above example, the left
context is null.
2. Context-Free – A context-free grammar is one in which each rule must be of the form:
A → α
where A represents a single nonterminal and α is any string of terminals and nonterminals.
Most programming languages are defined by grammars of this type; consequently, we will
focus on context-free grammars. Note that both grammars G1 and G2, above, are
context-free. An example of a context-free rule is shown below:
A → aABb
3. Right Linear – A right linear grammar is one in which each rule is of the form:
A → aB
or
A → a
where A and B represent nonterminals, and a represents a terminal. Right linear
grammars can be used to define lexical items such as identifiers, constants, and keywords.
[Figure 3.1 Classes of Grammars, shown as nested circles: Right Linear inside
Context-Free, inside Context-Sensitive, inside Unrestricted]
Note that every context-sensitive grammar is also in the unrestricted class. Every
context-free grammar is also in the context-sensitive and unrestricted classes. Every right
linear grammar is also in the context-free, context-sensitive, and unrestricted classes.
This is represented by the diagram of Figure 3.1, above, which depicts the classes of
grammars as circles. All points in a circle belong to the class of that circle.
A context-sensitive language is one for which there exists a context-sensitive
grammar. A context-free language is one for which there exists a context-free grammar.
A right linear language is one for which there exists a right linear grammar. These
classes of languages form the same hierarchy as the corresponding classes of grammars.
We conclude this section with an example of a context-sensitive grammar which
is not context-free.
G3:
1. S → aSBC
2. S → ε
3. aB → ab
4. bB → bb
5. C → c
6. CB → CX
7. CX → BX
8. BX → BC
The language of grammar G3 is {aⁿbⁿcⁿ | n ≥ 0}; i.e., it is the set of all strings
consisting of a's followed by exactly the same number of b's followed by exactly the
same number of c's. This is an example of a context-sensitive language which is not
also context-free; i.e., there is no context-free grammar for this language. An intuitive
understanding of why this is true is beyond the scope of this text.
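Because G3's rules rewrite strings rather than single symbols, membership can be tested by blindly searching the space of sentential forms. The sketch below is our own illustration (the function name derivable is not from the text): it performs a breadth-first search over G3's rewriting rules. Since only S → ε shortens a string, and at most one S is ever present, any sentential form longer than the target plus one symbol can safely be pruned, so the search terminates.

```python
from collections import deque

# The rules of G3, written as (left side, right side) pairs.
RULES = [("S", "aSBC"), ("S", ""), ("aB", "ab"), ("bB", "bb"),
         ("C", "c"), ("CB", "CX"), ("CX", "BX"), ("BX", "BC")]

def derivable(target):
    """Breadth-first search: can G3 derive `target` from S?"""
    limit = len(target) + 1          # only S -> eps shrinks, by one symbol
    seen, queue = {"S"}, deque(["S"])
    while queue:
        s = queue.popleft()
        if s == target:
            return True
        for lhs, rhs in RULES:
            start = 0
            while (i := s.find(lhs, start)) >= 0:   # each occurrence of lhs
                t = s[:i] + rhs + s[i + len(lhs):]
                if len(t) <= limit and t not in seen:
                    seen.add(t)
                    queue.append(t)
                start = i + 1
    return False                     # finitely many strings up to `limit`

assert derivable("abc") and derivable("aabbcc")
assert not derivable("aabbc") and not derivable("ab")
```

Running it confirms the claim above: strings of the form aⁿbⁿcⁿ are derivable, and strings with mismatched counts are not.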
Since programming languages are typically specified with context-free grammars, we are
particularly interested in this class of grammars. Although there are some aspects of
programming languages that cannot be specified with a context-free grammar, it is
generally felt that using more complex grammars would only serve to confuse rather than
clarify. In addition, context-sensitive grammars could not be used in a practical way to
construct the compiler.
Context-free grammars can be represented in a form called Backus-Naur Form
(BNF) in which nonterminals are enclosed in angle brackets <>, and the arrow is replaced
by a ::=, as shown in the following example. The grammar rules
S → a S b
S → ε
would appear in BNF as
<S> ::= a <S> b
<S> ::= ε
This form also permits multiple definitions of one nonterminal on one line, using the
alternation vertical bar (|), as in: <S> ::= a <S> b | ε
[Figure 3.2 A Derivation Tree for aaabbb Using Grammar G2]
BNF and context-free grammars are equivalent forms, and we choose to use context-free
grammars only for the sake of appearance.
We now present some definitions which apply only to context-free grammars. A
derivation tree is a tree in which each interior node corresponds to a nonterminal in a
sentential form and each leaf node corresponds to a terminal symbol in the derived string.
An example of a derivation tree for the string aaabbb, using grammar G2, is shown in
Figure 3.2.
A context-free grammar is said to be ambiguous if there is more than one
derivation tree for a particular string. In natural languages, ambiguous phrases are those
which may have more than one interpretation. Thus, the derivation tree does more than
show that a particular string is in the language of the grammar – it shows the structure of
the string, which may affect the meaning or semantics of the string. For example,
consider the following grammar for simple arithmetic expressions:
G4:
1. Expr → Expr + Expr
2. Expr → Expr ∗ Expr
3. Expr → ( Expr )
4. Expr → var
5. Expr → const

[Figure 3.3 Two Different Derivation Trees for the String var + var ∗ var]
Figure 3.3 shows two different derivation trees for the string var + var ∗ var;
consequently this grammar is ambiguous. It should be clear that the second derivation
tree in Figure 3.3 represents a preferable interpretation because it correctly shows the
structure of the expression as defined in most programming languages (since multiplica-
tion takes precedence over addition). In other words, all subtrees in the derivation tree
correspond to subexpressions in the derived expression. A nonambiguous grammar for
expressions will be given in Section 3.1.
A left-most derivation is one in which the left-most nonterminal is always the
one to which a rule is applied. An example of a left-most derivation for grammar G2
above is:
S ⇒ ASB ⇒ aSB ⇒ aASBB ⇒ aaSBB ⇒ aaBB ⇒ aabB ⇒ aabb
Like the finite state machine, the pushdown machine is another example of an abstract or
theoretic machine. Pushdown machines can be used for syntax analysis, just as finite state
machines are used for lexical analysis. A pushdown machine consists of:
1. A finite set of states, one of which is designated the starting state.
2. A finite set of input symbols, with an endmarker appended to the end of the input
string.
Determine whether the following grammar is ambiguous. If so, show two different
derivation trees for the same string of terminals, and show a left-most derivation corre-
sponding to each tree.
1. S → a S b S
2. S → a S
3. S → c
Solution: The string aacbc has two different derivation trees: one in which rule 1 is
applied to S first, and one in which rule 2 is applied first. The corresponding left-most
derivations are:
S ⇒ a S b S ⇒ a a S b S ⇒ a a c b S ⇒ a a c b c
S ⇒ a S ⇒ a a S b S ⇒ a a c b S ⇒ a a c b c
We note that the two derivation trees correspond to two different left-most derivations,
and the grammar is ambiguous.
3. An infinite stack and a finite set of stack symbols which may be pushed on top or
removed from the top of the stack in a last-in first-out manner. The stack symbols
need not be distinct from the input symbols. The stack must be initialized to contain
at least one stack symbol before the first input symbol is read.
4. A state transition function which takes as arguments the current state, the current input
symbol, and the symbol currently on top of the stack; its result is the new state of the
machine.
5. On each state transition the machine may advance to the next input symbol or retain
the input pointer (i.e., not advance to the next input symbol).
6. On each state transition the machine may perform one of the stack operations, push(X)
or pop, where X is one of the stack symbols.
7. A state transition may include an exit from the machine labeled either Accept or
Reject. This determines whether or not the input string is in the specified language.
Note that without the infinite stack, the pushdown machine is nothing more than a finite
state machine as defined in Chapter 2. Also, the pushdown machine halts by taking an
exit from the machine, whereas the finite state machine halts when all input symbols have
been read.
An example of a pushdown machine is shown below, in Figure 3.4, in which the
rows are labeled by stack symbols and the columns are labeled by input symbols. The N
character is used as an endmarker, indicating the end of the input string, and the comma
(,) is a stack symbol which we are using to mark the bottom of the stack so that we
can test for the empty stack condition. The states of the machine are S1 (in our examples
S1 will always be the starting state) and S2, and there is a separate transition table for
each state. Each cell of those tables shows a stack operation (push() or pop), an input
pointer function (advance or retain), and the next state. “Accept” and “Reject” are exits
from the machine. The language of strings accepted by this machine is aⁿbⁿ where n ≥
0 – i.e., the same language specified by grammar G2, above. To see this, the student
should trace the operation of the machine for a particular input string. A trace showing
the sequence of stack configurations and states of the machine for the input string aabb
is shown in Figure 3.5. Note that while in state S1 the machine is pushing X's on the stack
as each a is read, and while in state S2 the machine is popping an X off the stack as each
b is read.
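The machine of Figure 3.4 can be simulated directly by table lookup. The sketch below is our own illustration: the transition table is a Python dict keyed by (state, top of stack, input symbol), the comma is the bottom-of-stack marker and N the endmarker as in the figure, and every transition of this particular machine happens to advance the input pointer.

```python
# Transition table for the pushdown machine of Figure 3.4.
# Key: (state, top of stack, input symbol).
# Value: "Accept", or (stack operation, next state); empty cells Reject.
DELTA = {
    ("S1", "X", "a"): (("push", "X"), "S1"),
    ("S1", "X", "b"): (("pop",), "S2"),
    ("S1", ",", "a"): (("push", "X"), "S1"),
    ("S1", ",", "N"): "Accept",
    ("S2", "X", "b"): (("pop",), "S2"),
    ("S2", ",", "N"): "Accept",
}

def accepts(string):
    """Run the machine of Figure 3.4 on `string` (endmarker added here)."""
    stack, state = [","], "S1"       # initial stack holds the bottom marker
    for symbol in string + "N":      # append the endmarker
        action = DELTA.get((state, stack[-1], symbol))
        if action is None:           # an empty cell means Reject
            return False
        if action == "Accept":
            return True
        op, state = action
        if op[0] == "push":
            stack.append(op[1])
        else:                        # pop
            stack.pop()
    return False

assert accepts("") and accepts("ab") and accepts("aabb")
assert not accepts("aab") and not accepts("abb") and not accepts("ba")
```

Tracing accepts("aabb") by hand visits exactly the sequence of stack configurations shown in Figure 3.5.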
An example of a pushdown machine which accepts any string of correctly
balanced parentheses is shown in Figure 3.6. In this machine, the input symbols are left
and right parentheses, and the stack symbols are X and ,. Note that this language could
not be accepted by a finite state machine because there could be an unlimited number of
left parentheses before the first right parenthesis. The student should compare the
language accepted by this machine with the language of grammar G2.
The pushdown machines, as we have described them, are purely deterministic
machines. A deterministic machine is one in which all operations are uniquely and
completely specified regardless of the input (computers are deterministic), whereas a
nondeterministic machine may be able to choose from zero or more operations in an
unpredictable way. With nondeterministic pushdown machines it is possible to specify a
larger class of languages. In this text we will not be concerned with nondeterministic
machines.
We define a pushdown translator to be a machine which has an output function
in addition to all the features of the pushdown machine described above. We may include
this output function in any of the cells of the state transition table to indicate that the
machine produces a particular output (e.g. Out(x)) before changing to the new state.
We now introduce an extension to pushdown machines which will make them
easier to work with, but will not make them any more powerful. This extension is the
Replace operation designated Rep(X,Y,Z,...), where X, Y, and Z are any stack
symbols. The replace function replaces the top stack symbol with all the symbols in its
argument list. The Replace function is equivalent to a pop operation followed by a push
operation for each symbol in its argument list.
Figure 3.4 A Pushdown Machine to Accept the Language of Grammar G2

State S1:        a                   b                  N
   X             push(X), advance    pop, advance       Reject
                 next state S1       next state S2
   ,             push(X), advance    Reject             Accept
                 next state S1
(initial stack: , )

State S2:        a                   b                  N
   X             Reject              pop, advance       Reject
                                     next state S2
   ,             Reject              Reject             Accept

Figure 3.5 Sequence of Stacks as Pushdown Machine of Figure 3.4 Accepts the Input
String aabb (the top of each stack is written at the left)

  stack:   ,      X,      XX,     X,      ,
  state:   S1     S1      S1      S2      S2
  reading:        a       a       b       b       then N: Accept
Figure 3.6 A Pushdown Machine to Accept Strings of Balanced Parentheses

State S1:        (                   )                  N
   X             push(X), advance    pop, advance       Reject
                 next state S1       next state S1
   ,             push(X), advance    Reject             Accept
                 next state S1
(initial stack: , )
An example of a pushdown translator is the machine of Figure 3.8, below, which
translates simple infix expressions to postfix. Some examples of equivalent infix and
postfix expressions are shown below:

Infix            Postfix
2 + 3            2 3 +
2 + 3 ∗ 5        2 3 5 ∗ +
2 ∗ 3 + 5        2 3 ∗ 5 +
(2 + 3) ∗ 5      2 3 + 5 ∗
Note that parentheses are never used in postfix notation. In Figure 3.8 the default state
transition is to stay in the same state, and the default input pointer operation is advance.
States S2 and S3 show only a few input symbols and stack symbols in their transition
tables, because those are the only configurations which are possible in those states. The
stack symbol E represents an expression, and the stack symbol L represents a left
parenthesis. Similarly, the stack symbols Ep and Lp represent an expression and a left
parenthesis on top of a plus symbol, respectively.
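The same translation that the pushdown translator of Figure 3.8 performs can be sketched with the classic operator-precedence ("shunting-yard") technique, which likewise keeps pending operators on a stack. This is our own illustrative version, not the book's machine; it handles single-character operands, +, ∗ (written * in the code), and parentheses.

```python
def infix_to_postfix(expr):
    """Convert an infix expression (single-char operands) to postfix."""
    prec = {"+": 1, "*": 2}          # * binds tighter than +
    output, stack = [], []
    for t in expr:
        if t not in "+*()":          # an operand goes straight to the output
            output.append(t)
        elif t == "(":
            stack.append(t)
        elif t == ")":               # pop back to the matching parenthesis
            while stack[-1] != "(":
                output.append(stack.pop())
            stack.pop()              # discard the "(" itself
        else:                        # operator: first pop >= precedence
            while stack and stack[-1] != "(" and prec[stack[-1]] >= prec[t]:
                output.append(stack.pop())
            stack.append(t)
    while stack:                     # flush the remaining operators
        output.append(stack.pop())
    return "".join(output)

# The four examples from the infix/postfix table above:
assert infix_to_postfix("2+3") == "23+"
assert infix_to_postfix("2+3*5") == "235*+"
assert infix_to_postfix("2*3+5") == "23*5+"
assert infix_to_postfix("(2+3)*5") == "23+5*"
```

Note that, just as in the table, no parentheses appear in the output: the order of the operators alone encodes the structure of the expression.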
We now examine the class of languages which can be specified by a particular machine.
A language can be accepted by a finite state machine if, and only if, it can be specified
with a right linear grammar (and if, and only if, it can be specified with a regular expres-
sion). This means that if we are given a right linear grammar, we can construct a finite
state machine which accepts exactly the language of that grammar. It also means that if
we are given a finite state machine, we can write a right linear grammar which specifies
the same language accepted by the finite state machine.
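One direction of this construction can be sketched concretely: each right linear rule A → aB becomes a machine transition from state A to state B on symbol a, and each rule A → a becomes a transition to a special accepting state. The code below is our own illustration; the grammar used (S → 0, S → 1S, S → 0A, A → 1, A → 1A, A → 0S, generating strings over {0,1} with an odd number of 0's) is assumed for the example. Since two rules may begin with the same terminal, the resulting machine may be nondeterministic, so we simulate a set of possible states.

```python
# A right linear grammar for strings over {0,1} with an odd number of 0's,
# as (lhs, terminal, rhs) triples; rhs None means the rule is X -> a.
RULES = [("S", "0", None), ("S", "1", "S"), ("S", "0", "A"),
         ("A", "1", None), ("A", "1", "A"), ("A", "0", "S")]

def fsm_accepts(string, start="S"):
    """Simulate the finite state machine built from RULES.

    A rule X -> aY is a transition X --a--> Y; a rule X -> a is a
    transition X --a--> FINAL.  The set `states` tracks every state
    the (possibly nondeterministic) machine could be in.
    """
    states = {start}
    for symbol in string:
        nxt = set()
        for lhs, term, rhs in RULES:
            if lhs in states and term == symbol:
                nxt.add(rhs if rhs is not None else "FINAL")
        states = nxt
    return "FINAL" in states         # accepted iff some rule X -> a ended here

assert fsm_accepts("0") and fsm_accepts("000") and fsm_accepts("1101")
assert not fsm_accepts("") and not fsm_accepts("00") and not fsm_accepts("11")
```

The subset simulation used here is the standard way of running a nondeterministic finite state machine without first converting it to a deterministic one.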
[Figure 3.8 Pushdown Translator for Infix to Postfix Expressions. The machine has
states S1, S2, and S3; its stack symbols are E, Ep, L, Lp, Ls, +, ∗, and the bottom
marker , with the initial stack containing only the , symbol.]
There are algorithms which can be used to produce any of these three forms
(finite state machines, right linear grammars, and regular expressions), given one of the
other two (see, for example, Hopcroft and Ullman [1979]). However, here we rely on the
student's ingenuity to solve these problems.
Show the sequence of stacks and states which the pushdown machine of Figure 3.8
would go through if the input were: a+(a∗a)
Solution:
[sequence of stack configurations omitted; the machine ends in state S1 and accepts]
Output: aaa∗+
Give a right linear grammar for each of the languages specified in Sample Problem
2.0 (a).
Solution:
(1) Strings over {0,1} containing an odd number of 0's:

1. S → 0
2. S → 1S
3. S → 0A
4. A → 1
5. A → 1A
6. A → 0S

(2) Strings over {0,1} which contain three consecutive 1's:

1. S → 1S
2. S → 0S
3. S → 1A
4. A → 1B
5. B → 1C
6. B → 1
7. C → 1C
8. C → 0C
9. C → 1
10. C → 0
(3) Strings over {0,1} which contain exactly three 0's:

1. S → 1S
2. S → 0A
3. A → 1A
4. A → 0B
5. B → 1B
6. B → 0C
7. B → 0
8. C → 1C
9. C → 1

(4) Strings over {0,1} which contain an odd number of 0's and an even number of 1's:

1. S → 0A
2. S → 1B
3. S → 0
4. A → 0S
5. A → 1C
6. B → 0C
7. B → 1S
8. C → 0B
9. C → 1A
10. C → 1
The first grammar below generates palindromes over {0,1} with centermarker c; the
second generates all palindromes over {0,1}:

S → 0S0
S → 1S1
S → c

S → 0S0
S → 1S1
S → 0
S → 1
S → ε
Exercises 3.0
1. Show three different derivations using each of the following grammars, with starting
nonterminal S.
(a) S → a S
    S → b A
    A → b S
    A → c

(b) S → a B c
    B → A B
    A → B A
    A → a
    B → ε

(c) S → a S B c
    a S A → a S b b
    B c → A c
    S b → b A b
    A → a

(d) S → a b
    a → a A b B
    B → ε
2. Classify the grammars of Problem 1 according to Chomsky’s definitions (give the most
restricted classification applicable).
4. For each of the given input strings show a derivation tree using the following grammar.
1. S → S a A
2. S → A
3. A → A b B
4. A → B
5. B → c S d
6. B → e
7. B → f
5. Show a left-most derivation for each of the following strings, using grammar G4 of
Section 3.0.3.
7. Some of the following grammars may be ambiguous; for each ambiguous grammar,
show two different derivation trees for the same input string:
(a) 1. S → a S b        (b) 1. S → A a A
    2. S → A A              2. S → A b A
    3. A → c                3. A → c
    4. A → S                4. A → S

(c) 1. S → a S b S      (d) 1. S → a S b c
    2. S → a S              2. S → A B
    3. S → c                3. A → a
                            4. B → b
8. Show a pushdown machine that will accept each of the following languages:
Hint: Use the first state to push Ni onto the stack until the c is read. Then use
another state to pop the stack as long as the input is the complement of the
stack symbol, until the top stack symbol and the input symbol are equal. Then
use a third state to ensure that the remaining input symbols match the symbols
on the stack.
9. Show the output and the sequence of stacks for the machine of Figure 3.8 for each of
the following input strings:
10. Show a grammar and an extended pushdown machine for the language of prefix
expressions involving addition and multiplication. Use the terminal symbol a to
represent a variable or constant. Example: ∗+aa∗aa
11. Show a pushdown machine to accept palindromes over {0,1} with centermarker
c. This is the language, Pc, referred to in Section 3.0.5.
12. Show a grammar for the language of valid regular expressions over the alphabet
{0,1}. Hint: Think about grammars for arithmetic expressions.
3.1 Ambiguities in Programming Languages
G5:
1. Expr → Expr + Term
2. Expr → Term
3. Term → Term ∗ Factor
4. Term → Factor
5. Factor → ( Expr )
6. Factor → var
7. Factor → const
A derivation tree for the input string var + var ∗ var is shown, below, in Figure
3.9. The student should verify that there is no other derivation tree for this input string,
and that the grammar is not ambiguous. Also note that in any derivation tree using this
grammar, subtrees correspond to subexpressions, according to the usual precedence rules.
The derivation tree in Figure 3.9 indicates that the multiplication takes precedence over
the addition. The left associativity rule would also be observed in a derivation tree for
var + var + var.
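The structure that an unambiguous expression grammar imposes can be seen in a small recursive descent parser. The sketch below is our own illustration, built on the usual unambiguous rules (Expr → Expr + Term | Term, Term → Term ∗ Factor | Factor, Factor → ( Expr ) | var), with the left recursion replaced by iteration; it returns the derivation structure as nested tuples.

```python
def parse_expr(toks, i=0):
    """Expr -> Expr + Term | Term (left recursion turned into a loop)."""
    node, i = parse_term(toks, i)
    while i < len(toks) and toks[i] == "+":
        right, i = parse_term(toks, i + 1)
        node = ("+", node, right)    # left operand built first: left assoc
    return node, i

def parse_term(toks, i):
    """Term -> Term * Factor | Factor."""
    node, i = parse_factor(toks, i)
    while i < len(toks) and toks[i] == "*":
        right, i = parse_factor(toks, i + 1)
        node = ("*", node, right)
    return node, i

def parse_factor(toks, i):
    """Factor -> ( Expr ) | var."""
    if toks[i] == "(":
        node, i = parse_expr(toks, i + 1)
        assert toks[i] == ")", "missing )"
        return node, i + 1
    return toks[i], i + 1            # a var or const token

# Multiplication is grouped below the addition, as in Figure 3.9:
tree, _ = parse_expr(["var", "+", "var", "*", "var"])
assert tree == ("+", "var", ("*", "var", "var"))

# Addition associates to the left:
tree, _ = parse_expr(["var", "+", "var", "+", "var"])
assert tree == ("+", ("+", "var", "var"), "var")
```

Because each function can build only one shape of tree for a given token list, the parser is a direct demonstration that the grammar is unambiguous.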
Another example of ambiguity in programming languages is the conditional
statement as defined by grammar G6:
[Figure 3.9 A Derivation Tree for var + var ∗ var Using Grammar G5]
[Figure 3.10 Two Different Derivation Trees for the Statement
if ( Expr ) if ( Expr ) Stmt else Stmt Using Grammar G6]
G6:
1. Stmt → IfStmt
2. IfStmt → if ( Expr ) Stmt
3. IfStmt → if ( Expr ) Stmt else Stmt
[Figure 3.11 A Derivation Tree for
if ( Expr ) if ( Expr ) OtherStmt else OtherStmt Using Grammar G7]

In most programming languages this ambiguity is resolved by matching each else with
the closest preceding unmatched if. Grammar G7, below, enforces this matching:
G7:
1. Stmt → IfStmt
2. IfStmt → Matched
3. IfStmt → Unmatched
4. Matched → if ( Expr ) Matched else Matched
5. Matched → OtherStmt
6. Unmatched → if ( Expr ) Stmt
7. Unmatched → if ( Expr ) Matched else Unmatched
This grammar differentiates between the two different kinds of if statements, those
with a matching else (Matched) and those without a matching else (Unmatched).
The nonterminal OtherStmt would be defined with rules for statements
other than if statements (while, expression, for, ...). A derivation tree for the
string if ( Expr ) if ( Expr ) OtherStmt else OtherStmt is shown
in Figure 3.11.
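Grammar G7's "match the closest if" behavior is exactly what a simple recursive descent parser does naturally: after parsing the inner statement, it greedily consumes an else if one is present. The sketch below is our own illustration over a toy token list, in which E stands for an Expr and O for an OtherStmt.

```python
def parse_stmt(toks, i=0):
    """Parse one statement; an `else` always binds to the nearest `if`."""
    if toks[i] == "if":
        # consume the prefix: if ( E )
        assert toks[i + 1] == "(" and toks[i + 3] == ")"
        cond = toks[i + 2]
        then, i = parse_stmt(toks, i + 4)
        if i < len(toks) and toks[i] == "else":   # greedy else matching
            other, i = parse_stmt(toks, i + 1)
            return ("if-else", cond, then, other), i
        return ("if", cond, then), i
    return toks[i], i + 1            # an OtherStmt token

toks = ["if", "(", "E", ")", "if", "(", "E", ")", "O", "else", "O"]
tree, _ = parse_stmt(toks)
# The else is attached to the inner if, as in Figure 3.11:
assert tree == ("if", "E", ("if-else", "E", "O", "O"))
```

The inner call consumes the else before returning, so the outer if can never claim it; this mirrors the Matched/Unmatched distinction of G7.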
Exercises 3.1
1. Show derivation trees for each of the following input strings using grammar G5.
S → a A b        S → x B y
A → b A a        B → y B x
A → a            B → x
5. How many different derivation trees are there for each of the following if statements
using grammar G6?
The student may recall, from high school days, the problem of diagramming English
sentences. You would put words together into groups and assign syntactic types to them,
such as noun phrase, predicate, and prepositional phrase. An example of a diagrammed
English sentence is shown, below, in Figure 3.12. The process of diagramming an
English sentence corresponds to the problem a compiler must solve in the syntax analysis
phase of compilation.
The syntax analysis phase of a compiler must be able to solve the parsing
problem for the programming language being compiled: Given a grammar, G, and a
string of input symbols, decide whether the string is in L(G); also, determine the structure
of the input string. The solution to the parsing problem will be “yes” or “no”, and, if
“yes”, some description of the input string’s structure, such as a derivation tree.
A parsing algorithm is one which solves the parsing problem for a particular
class of grammars. A good parsing algorithm will be applicable to a large class of
grammars and will accommodate the kinds of rewriting rules normally found in grammars
for programming languages. For context-free grammars, there are two kinds of parsing
algorithms – bottom up and top down. These terms refer to the sequence in which the
derivation tree of a correct input string is built. A parsing algorithm is needed in the
syntax analysis phase of a compiler.
There are parsing algorithms which can be applied to any context-free grammar,
employing a complete search strategy to find a parse of the input string. These algorithms
are generally considered unacceptable since they are too slow; they cannot run in “poly-
nomial time” (see Aho and Ullman [1972], for example).
[Figure 3.12 A Diagrammed English Sentence, showing the Predicate divided into a
Verb and a DirectObject]