CSC 415
1 Compilers - Introduction
A compiler is a program that reads a program written in one language (the source language) and
translates it into an equivalent program in another language (the target language).
As part of this translation process, the compiler reports to its user the presence of errors in the source
program.
Many software tools that manipulate source programs first perform some kind of analysis. Examples
include:
1. Structure Editors: take as input a sequence of commands to build a source program. They
perform text creation and modification, and also analyze the program text, imposing a hierarchical
structure on it (checking that the input is correctly formed).
2. Pretty Printers: analyze the program and print it so that the structure of the program becomes
clearly visible.
3. Static Checkers: reads a program, analyzes it, and attempts to discover potential bugs without
running the program.
4. Interpreters: Instead of producing a target program as a translation, an interpreter performs the
operations implied by the source program.
Lexical Analysis: (Linear Analysis or Scanning)
In lexical analysis, the stream of characters making up the source program is read and grouped into
tokens. For example, the characters in the assignment statement position := initial + rate * 60 would
be grouped into the following tokens:
• The identifier position
• The assignment symbol :=
• The identifier initial
• The plus symbol +
• The identifier rate
• The multiplication symbol *
• The number 60
• The blanks separating the characters of these tokens would normally be eliminated during lexical
analysis.
Syntax Analysis: (Hierarchical Analysis or Parsing)
It involves grouping the tokens of the source program into grammatical phrases that are used by the
compiler to synthesize output. The grammatical phrase is represented by a parse tree.
Semantic Analysis:
This phase checks the source program for semantic errors and gathers type information for the
subsequent code-generation phase. It uses the hierarchical structure determined by the syntax
analysis phase to identify the operators and operands of expressions and statements.
An important component of semantic analysis is type checking.
The Phases of a Compiler:
A compiler operates in phases. Each phase transforms the source program from one representation
to another.
There are six phases:
– Lexical analysis
– Syntax analysis
– Semantic analysis
– Intermediate code generation
– Code optimization
– Code generation
Symbol-table management and error handling interact with all six phases.
Some of the phases may be grouped together.
These phases are illustrated by considering the following statement:
position := initial + rate * 60
• Symbol Table Management:
– A symbol table is a data structure containing a record for each identifier, with fields for the
attributes of the identifier.
– It records the identifiers used in the source program and collects information about each
identifier, such as:
• its type (gathered by semantic analysis and intermediate code generation),
• its scope (gathered by semantic analysis and intermediate code generation),
• its storage allocation (gathered by code generation),
• for a procedure name, the number and types of its arguments and the type returned.
• Error Detecting and Reporting:
– Each phase can encounter errors.
– The lexical phase detects input that does not form any token.
– The syntax phase detects token streams that violate the syntax rules.
– The semantic phase detects constructs that are syntactically correct but meaningless, such as an
operator applied to an incompatible operand.
The Analysis Phase:
– Involves: Lexical, Syntax and Semantic phase
– As translation progresses, the compiler’s internal representation of the source program changes.
Lexical Analysis Phase:
– The lexical phase reads the characters in the source program and groups them into a stream of
tokens in which each token represents a logically cohesive sequence of characters, such as, An
identifier, A keyword, A punctuation character.
– The character sequence forming a token is called the lexeme for the token.
Syntax Analysis Phase:
– Syntax analysis imposes a hierarchical structure on the token stream. This hierarchical structure is
called a syntax tree.
– In a syntax tree, each interior node is a record with a field for the operator and two fields
containing pointers to the records for the left and right children.
– A leaf is a record with two or more fields, one to identify the token at the leaf, and the other to
record information about the token.
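As an illustration, the interior and leaf records just described might be declared as follows. This is a
minimal sketch in C; the type and field names are mine rather than the text's.

/* Sketch of syntax-tree records: an interior node holds an operator and
   pointers to its children; a leaf identifies a token and records
   information about it (here, a symbol-table index). */
typedef enum { INTERIOR, LEAF } NodeKind;

typedef struct Node {
    NodeKind kind;
    union {
        struct {
            char op;                 /* the operator, e.g. '+' or '*'    */
            struct Node *left;       /* pointer to left-child record     */
            struct Node *right;      /* pointer to right-child record    */
        } interior;
        struct {
            int token;               /* identifies the token at the leaf */
            int symtab_entry;        /* information about the token      */
        } leaf;
    } u;
} Node;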
Semantic Analysis Phase:
– This phase checks the source program for semantic errors and gathers type information for the
subsequent code-generation phase.
– It uses the hierarchical structure determined by the syntax-analysis phase to identify the operators
and operands of expressions and statements.
– An important component of semantic analysis is type checking.
Intermediate Code Generation:
– The syntax and semantic analysis phases generate an explicit intermediate representation of the
source program.
– The intermediate representation should have two important properties:
• It should be easy to produce,
• and it should be easy to translate into the target program.
– Intermediate representation can have a variety of forms.
– One such form is three-address code, which is like the assembly language for a machine in
which every memory location can act like a register.
– Three address code consists of a sequence of instructions, each of which has at most three
operands.
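For the statement position := initial + rate * 60, the intermediate code generator might produce
three-address code of the following form, where t1, t2, t3 are compiler-generated temporary names
and id1, id2, id3 stand for the symbol-table entries of position, initial, and rate:

t1 := inttoreal(60)
t2 := id3 * t1
t3 := id2 + t2
id1 := t3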
Code Optimization:
– The code optimization phase attempts to improve the intermediate code, so that faster-running
machine code will result.
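On the code above, for example, the optimizer might deduce that the conversion of 60 from integer
to real can be done once at compile time, and that t3 is used only once, shortening the sequence to:

t1 := id3 * 60.0
id1 := id2 + t1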
Code Generation:
– The final phase of the compiler is the generation of target code, consisting normally of relocatable
machine code or assembly code.
– Memory locations are selected for each of the variables used by the program.
– Then each intermediate instruction is translated into a sequence of machine instructions that
perform the same task.
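Continuing the example, the optimized code above might then be translated into assembly code such
as the following, where R1 and R2 are registers, the F in each mnemonic marks a floating-point
instruction, and # marks an immediate constant (the mnemonics are illustrative):

MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1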
Cousins of the Compiler:
The programs with which a compiler typically operates are:
– Preprocessors
– Assemblers
– Loader and Link-Editors
• Preprocessors:
– Preprocessors produce input to compilers.
– They perform the following functions:
• 1. Macro processing: a preprocessor allows a user to define macros (shorthands for longer
constructs).
• 2. File inclusion: a preprocessor includes header files into the program text.
• 3. Rational preprocessors: a preprocessor provides the user with built-in macros for constructs
like while-statements or if-statements if none exist in the language itself.
• 4. Language extension: they add capabilities to the language by what amounts to built-in
macros; e.g., the language Equel is a database query language embedded in C.
• Assemblers:
– Some compilers produce assembly code that is passed to an assembler for further processing.
– Assembly code is a mnemonic version of machine code, in which names are used instead of
binary codes for operations and names are also given to memory addresses.
• Loaders and Link-Editors:
– A loader is a program that performs the two functions of loading and link-editing.
– The process of loading consists of taking relocatable machine code, altering the relocatable
addresses, and placing the altered instructions and data in memory at the proper locations.
– The link-editor allows a single program to be made from several files of relocatable machine code.
The Grouping of Phases:
– The phases are often collected into a front end and a back end.
– The front end consists of the phases, or parts of phases, that depend primarily on the source
language and are largely independent of the target machine.
• These normally include lexical and syntactic analysis, the creation of the symbol table, semantic
analysis, and intermediate code generation.
– The back end includes portions of the compiler that depend on the target machine.
• Passes:
– Several phases of compilation are implemented in a single pass consisting of reading an input
file and writing an output file.
– Grouping several phases into one pass may force the entire program to be kept in memory,
because one phase may need information in a different order than a previous phase produces it.
– Intermediate code generation and code generation are often merged into one pass using a
technique called backpatching.
Compiler-Construction Tools:
Compiler writers use general software tools such as debuggers, version managers, profilers, and so
on. In addition, the following more specialized tools are used for implementing the phases of a
compiler:
1. Parser generators
2. Scanner generators
3. Syntax-directed translation engines
4. Automatic code generators
5. Data-flow engines
1. Parser generators: These produce syntax analysers from input based on a context-free grammar.
2. Scanner generators: These automatically produce lexical analysers from a specification based on
regular expressions.
3. Syntax-directed translation engines: These produce collections of routines that walk the parse
tree, generating intermediate code.
4. Automatic code generators: These take a collection of rules that define the translation of each
operation of the intermediate language into the machine language for the target machine.
5. Data-flow engines: Good code optimization involves data-flow analysis, the gathering of
information about how values are transmitted from one part of a program to every other part.
A simple way to build a lexical analyser is to construct a diagram that illustrates the structure of the
tokens of the source language, and then to translate the diagram into a program for finding tokens.
The software tool that automates the construction of lexical analysers is a lexical-analyser generator.
The major advantage of a lexical-analyser generator is that it can utilize the best-known pattern-
matching algorithms, and thereby create efficient lexical analysers for people who are not experts in
pattern-matching techniques.
The main task of the lexical analyser is to read the input characters and produce as output a
sequence of tokens.
Upon receiving a "get next token" command from the parser, the lexical analyser reads input
characters until it can identify the next token. Lexical analysers read source text and also perform
certain secondary tasks:
Stripping out from the source program comments and white space in the form of blank, tab,
and newline characters.
Correlating error messages from the compiler with the source program.
For example, the lexical analyser may keep track of the number of newline characters seen, so
that a line number is associated with an error message. If the source language supports some
macro preprocessor functions, then these preprocessor functions may also be implemented as
lexical analysis takes place.
Reasons for separating the analysis phase into lexical analysis and parsing:
1. simpler design;
2. compiler efficiency is improved: specialized buffering techniques for reading input characters
and processing tokens can significantly speed up the compiler;
3. compiler portability is enhanced: input-device-specific peculiarities can be restricted to the
lexical analyser.
Tokens, lexemes and patterns:
– A token is produced as output for a set of strings in the input.
– This set of strings is described by a rule called a pattern associated with the token.
– A lexeme is a sequence of characters in the source program that is matched by the pattern for a
token.
– e.g., for the keyword token if, the lexeme is the string if, and the pattern is simply the string if
itself.
Lexical errors:
– Few errors are discernible at the lexical level alone; for example, a misspelled keyword may still
look like a valid (if undeclared) function identifier.
Error-recovery actions:
– deleting an extraneous character,
– inserting a missing character,
– replacing an incorrect character by a correct one,
– transposing two adjacent characters.
Input Buffering:
• This section considers some efficiency issues concerned with the buffering of input.
• A two-buffer input scheme that is useful when lookahead on the input is necessary to identify
tokens.
• Techniques for speeding up the lexical analyser, such as the use of sentinels to mark the buffer
end.
• There are three general approaches to the implementation of a lexical analyser:
• 1. Use a lexical-analyser generator, such as the Lex compiler, to produce the lexical analyser from
a regular-expression-based specification. In this case, the generator provides routines for reading
and buffering the input.
• 2. Write the lexical analyser in a conventional systems-programming language, using I/O
facilities of that language to read the input.
• 3. Write the lexical analyser in assembly language and explicitly manage the reading of input.
• Buffer pairs:
• Because a large amount of time can be consumed moving characters, specialized buffering
techniques have been developed to reduce the amount of overhead required to process an input
character.
• The scheme to be discussed consists of a buffer divided into two N-character halves.
• Two pointers into the buffer are maintained: a lexeme-beginning pointer marks the start of the
lexeme being scanned, and a forward pointer scans ahead until a match for a pattern is found. Each
buffer half is filled with N characters by a single read command.
– This limited lookahead may make it impossible to recognize tokens in situations where the
distance that the forward pointer must travel is more than the length of the buffer.
– For example, consider DECLARE ( ARG1, ARG2, … , ARGn ) in a PL/I program;
– we cannot determine whether DECLARE is a keyword or an array name until we see the character
that follows the right parenthesis.
– Sentinels:
– In the previous scheme, each time the forward pointer is moved we must check that we have not
moved off one half of the buffer; if we have, the other half must be reloaded.
– Thus the ends of the buffer halves require two tests for each advance of the forward pointer.
– The two tests can be reduced to one if each buffer half is extended to hold a sentinel character at
the end.
– The sentinel is a special character that cannot be part of the source program (the eof character is
used as the sentinel).
• With sentinels, most of the time only one test is performed: whether forward points to an eof.
• Only when the end of a buffer half or the end of the input is reached are more tests performed.
• Since N input characters are encountered between eof’s, the average number of tests per input
character is very close to 1.
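A sketch of the forward-pointer advance with sentinels, written in C under the assumption that both
halves live in a single array; the names, the buffer size, and the fill_half() helper are mine, not the
text's. At startup the lexer would call fill_half(buf) once and set forward to buf.

#include <stdio.h>

#define N 4096                     /* size of each buffer half (assumed) */
static char buf[2 * N + 2];        /* two halves, each ending in a sentinel */
static char *forward = buf;

static void fill_half(char *half) {
    size_t n = fread(half, 1, N, stdin);   /* read up to N characters  */
    half[n] = (char)EOF;                   /* plant the sentinel       */
}

/* Advance forward one character; on the common path only the single
   sentinel test is performed. Returns 0 at the real end of input. */
int advance(void) {
    forward++;
    if (*forward == (char)EOF) {               /* the one test            */
        if (forward == buf + N) {              /* end of first half       */
            fill_half(buf + N + 1);
            forward = buf + N + 1;
        } else if (forward == buf + 2 * N + 1) { /* end of second half    */
            fill_half(buf);
            forward = buf;
        } else {
            return 0;                          /* eof within a half: done */
        }
    }
    return 1;
}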
Specification of Tokens:
• Regular expressions are notation for specifying patterns.
• Each pattern matches a set of strings.
• Regular expressions will serve as names for sets of strings.
• Strings and Languages:
• The term alphabet or character class denotes any finite set of symbols.
• e.g., the set {0,1} is the binary alphabet.
• The terms sentence and word are often used as synonyms for the term string.
• The length of a string s, written |s|, is the number of occurrences of symbols in s.
• e.g., the string "banana" is of length six.
• The empty string is denoted by ε; the length of the empty string is zero.
• The term language denotes any set of strings over some fixed alphabet.
• e.g., {ε}, the set containing only the empty string, is a language, as is φ, the empty set.
• If x and y are strings, then the concatenation of x and y (written as xy) is the string formed by
appending y to x. x = dog and y = house; then xy is doghouse.
• sε = εs = s.
• Exponentiation of strings: s^0 = ε, s^1 = s, s^2 = ss, s^3 = sss, and so on (s^i is s concatenated
with itself i times).
Definitions of terms:
– Prefix of s: a string obtained by removing zero or more trailing symbols of string s; e.g., ban is a
prefix of banana.
– Suffix of s: a string formed by deleting zero or more of the leading symbols of s; e.g., nana is a
suffix of banana.
– Substring of s: a string obtained by deleting a prefix and a suffix from s; e.g., nan is a substring
of banana.
– Proper prefix, suffix, or substring of s: any nonempty string x that is, respectively, a prefix,
suffix, or substring of s such that s ≠ x.
– Subsequence of s: any string formed by deleting zero or more not necessarily contiguous
symbols from s; e.g., baaa is a subsequence of banana.
• Operations on Languages:
– There are several operations that can be applied to languages:
Table 1.3 Definitions of operations on languages L and M
– Union of L and M, written L ∪ M: L ∪ M = { s | s is in L or s is in M }
– Concatenation of L and M, written LM: LM = { st | s is in L and t is in M }
– Kleene closure of L, written L*: L* denotes "zero or more concatenations of" L.
– Positive closure of L, written L+: L+ denotes "one or more concatenations of" L.
• Regular Expressions:
– Regular expressions allow the sets of strings that form tokens to be defined precisely.
– e.g., letter ( letter | digit )*
– defines a Pascal identifier: an identifier is formed by a letter followed by zero or more letters
or digits.
– A regular expression is built up out of simpler regular expressions using a set of defining rules.
– Each regular expression r denotes a language L(r).
• The following rules define the regular expressions over an alphabet ∑.
– (Associated with each rule is a specification of the language denoted by the regular expression
being defined)
• 1. ε is a regular expression that denotes {ε}, i.e. the set containing the empty string.
• 2. If a is a symbol in ∑, then a is a regular expression that denotes {a}, i.e. the set containing the
string a.
• 3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then
– a) (r) | (s) is a regular expression denoting the language L(r) ∪ L(s).
– b) (r)(s) is a regular expression denoting the language L(r)L(s).
– c) (r)* is a regular expression denoting the language (L(r))*.
– d) (r) is a regular expression denoting the language L(r).
• A language denoted by a regular expression is said to be a regular set.
• The specification of a regular expression is an example of a recursive definition.
• Rules (1) and (2) form the basis of the definition.
• Rule (3) provides the inductive step.
Algebraic properties of regular expressions (axiom and description):
– r|s = s|r : | is commutative
– r|(s|t) = (r|s)|t : | is associative
– (rs)t = r(st) : concatenation is associative
– r(s|t) = rs|rt and (s|t)r = sr|tr : concatenation distributes over |
– εr = r and rε = r : ε is the identity element for concatenation
– r* = (r|ε)* : relation between * and ε
– r** = r* : * is idempotent
• Regular Definition:
– If ∑ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the
form
• d1 → r1
• d2 → r2
• …
• dn → rn
– where each di is a distinct name, and each ri is a regular expression over the symbols in
∑ ∪ {d1, d2, … , di-1}, i.e., the basic symbols and the previously defined names.
– e.g. (regular definition in bold):
• letter → A|B|…|Z|a|b|…|z
• digit → 0|1|…|9
• id → letter ( letter | digit ) *
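As a small illustration, a string can be checked against the regular definition
id → letter ( letter | digit )* in C using the standard character-classification routines; the function
name is mine.

#include <ctype.h>

/* Returns 1 if s matches id -> letter ( letter | digit )*, else 0. */
int is_id(const char *s) {
    if (!isalpha((unsigned char)*s))      /* must begin with a letter */
        return 0;
    for (s++; *s; s++)
        if (!isalnum((unsigned char)*s))  /* then letters or digits   */
            return 0;
    return 1;
}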
• Notational Shorthand:
– This shorthand is used in certain constructs that occur frequently in regular expressions.
• 1. One or more instances: the unary postfix operator + means "one or more instances of". If r is a
regular expression that denotes the language L(r), then r+ is a regular expression that denotes the
language (L(r))+. Similarly, the unary postfix operator * means "zero or more instances of". The two
algebraic identities r* = r+ | ε and r+ = rr* relate the Kleene and positive closure operators.
• 2. zero or one instance: unary postfix operator ? means “zero or one instance of”. The
notation r? is a shorthand for r|ε.
• 3. character class: the notation [abc] is a shorthand for a|b|c.
In the compiler model, the parser obtains a string of tokens from the lexical analyser, and verifies
that the string can be generated by the grammar for the source language.
The parser reports any syntax errors in the source program.
There are two common methods of parsing: top-down and bottom-up parsing.
Top-down parsers build parse trees from the top (root) to the bottom (leaves).
Bottom-up parsers build parse trees from the leaves and work up to the root.
In both cases, the input to the parser is scanned from left to right, one symbol at a time.
The output of the parser is some representation of the parse tree for the stream of tokens.
There are a number of tasks that might be conducted during parsing, such as:
Collecting information about various tokens into the symbol table.
Performing type checking and other kinds of semantic analysis.
Generating intermediate code.
Syntax Error Handling:
Planning the error handling right from the start can both simplify the structure of a compiler and
improve its response to errors.
The program can contain errors at many different levels. e.g.,
Lexical – such as misspelling an identifier, keyword, or operator.
Syntax – such as an arithmetic expression with unbalanced parentheses.
Semantic – such as an operator applied to an incompatible operand.
Logical – such as an infinitely recursive call.
Much of the error detection and recovery in a compiler is centered on the syntax analysis phase.
One reason for this is that many errors are syntactic in nature or are exposed when the stream of
tokens coming from the lexical analyser disobeys the grammatical rules defining the programming
language.
Another is the precision of modern parsing methods; they can detect the presence of syntactic errors
in programs very efficiently.
The error handler in a parser has simple goals:
It should report the presence of errors clearly and accurately.
It should recover from each error quickly enough to be able to detect subsequent errors.
It should not significantly slow down the processing of correct programs.
Error-Recovery Strategies:
There are many different general strategies that a parser can employ to recover from a syntactic
error.
Panic mode
Phrase level
Error production
Global correction
Panic mode:
This is used by most parsing methods.
On discovering an error, the parser discards input symbols one at a time until one of a designated
set of synchronizing tokens (delimiters, such as semicolon or end) is found.
Panic-mode recovery often skips a considerable amount of input without checking it for additional
errors.
Its advantage is simplicity.
Phrase-level recovery:
On discovering an error, the parser may perform local correction on the remaining input; i.e., it may
replace a prefix of the remaining input by some string that allows the parser to continue. e.g., a local
correction would be replacing a comma by a semicolon, deleting an extraneous semicolon, or
inserting a missing semicolon.
Its major drawback is the difficulty it has in coping with situations in which the actual error has
occurred before the point of detection.
Error productions:
If an error production is used by the parser, it can generate appropriate error diagnostics to indicate
the erroneous construct that has been recognized in the input.
Global correction:
Given an incorrect input string x and grammar G, the algorithm will find a parse tree for a related
string y, such that the number of insertions, deletions, and changes of tokens required to transform x
into y is as small as possible.
Writing Grammars
Grammars are capable of describing the syntax of programming languages.
Eliminating Ambiguity:
Sometimes an ambiguous grammar can be rewritten to eliminate the ambiguity.
As an example, the ambiguity from the following “dangling-else” grammar is eliminated:
stmt → if expr then stmt
| if expr then stmt else stmt
| other (Grammar 2.2.1)
Here other stands for any other statement.
According to this grammar, the compound conditional statement
if E1 then S1 else if E2 then S2 else S3 has a unique parse tree.
Grammar (2.2.1) is ambiguous since the string: if E1 then if E2 then S1 else S2 has two parse trees.
In all programming languages with conditional statements of this form, the first parse tree is
preferred.
The general rule is: "Match each else with the closest previous unmatched then." This
disambiguating rule can be incorporated directly into the grammar,
i.e., by rewriting the grammar (2.2.1) as follows:
stmt → matched_stmt
| unmatched_stmt
matched_stmt → if expr then matched_stmt else matched_stmt
| other
unmatched_stmt → if expr then stmt
| if expr then matched_stmt else unmatched_stmt (Grammar 2.2.2)
Eliminating Left Recursion:
A grammar is left recursive if it has a nonterminal A such that there is a derivation A =>+ Aα for
some string α. Top-down parsing methods cannot handle left-recursive grammars, so a
transformation that eliminates left recursion is needed.
A left-recursive production of the form A → Aα | β can be replaced by the non-left-recursive
productions:
A → βA`
A` → αA` | ε
without changing the set of strings derivable from A.
Exercise 2.2.1
Consider the following grammar for arithmetic expression:
E→E+T|T
T→T*F|F
F → ( E ) | id (Grammar 2.2.3)
Eliminate the immediate left recursion if any in the above grammar.
Exercise 2.2.2
Eliminate the left recursion for the following grammar using the Algorithm 2.2.1
S → Aa | b
A → Ac | Sd | ε (Grammar 2.2.4)
Left Factoring:
Left factoring is a grammar transformation that is useful for producing a grammar suitable for
predictive parsing.
The basic idea is that when it is not clear which of two alternative productions to use to expand a
nonterminal A, we rewrite the A-productions to defer the decision until enough of the input has been
seen to make the right choice.
In general, if productions are of the form A → αβ1 | αβ2, then they are left factored as:
A → αA`
A` → β1 | β2
Exercise 2.2.3
For the following grammar do the left factoring:
S → iEtS | iEtSeS | a
E→b
Exercise 2.3.1:
In the following grammar find terminals, nonterminals, start symbols, and productions:
expr → expr op expr
expr → ( expr )
expr → - expr
expr → id
op → +
op → -
op → *
op → /
op → ^ (Grammar 2.3.2)
Notational Conventions:
To avoid having to state that “these are terminals” and “these are nonterminals”, the following
notational conventions with regard to grammars are used:
1. These symbols are terminals:
i) lower-case letters early in the alphabet such as a,b,c
ii) operator symbols such as +, -, etc.,
iii) punctuation symbols such as parentheses, commas, etc.,
iv) boldface strings such as id or if
2. These symbols are nonterminals:
i) upper-case letters early in the alphabet such as A, B, C.
ii) The letter S, when it appears, is usually the start symbol.
iii) Lower-case italic names such as expr or stmt.
3. Upper-case letters late in the alphabet, such as X, Y, Z represent grammar symbols, i.e., either
nonterminals or terminals.
4. Lower-case Greek letters, α, β, γ, for example, represent strings of grammar symbols.
5. If A → α1, A → α2, …, A → αk are all productions with A on the left (A-productions), then we
can write as; A → α1| α2 | … | αk. (α1, α2, … , αk, the alternatives for A).
e.g., A sample grammar:
E → E A E | ( E ) | - E | id
A→+|-|*|/|^ (Grammar 2.3.3)
Derivations:
A derivation gives a precise description of the top-down construction of a parse tree.
In derivation, a production is treated as a rewriting rule in which the nonterminal on the left is
replaced by the string on the right side of the production.
e.g. E → E + E | E * E | - E | ( E ) | id (Grammar 2.3.4)
tells us that we can replace one instance of an E in any string of grammar symbols by E + E or
E * E or - E or ( E ) or id.
E => - E is read as "E derives - E".
Take single E and repeatedly apply productions in any order to obtain a sequence of replacements.
e.g., E => - E => - ( E ) => - ( id )
Such a sequence of replacements is called a derivation of - ( id ) from E.
The symbol => means "derives in one step".
The symbol =>* means "derives in zero or more steps".
The symbol =>+ means "derives in one or more steps".
Given a grammar G with start symbol S, the relation =>* is used to define L(G), the language
generated by G.
Strings in L(G) may contain only terminal symbols of G.
A string of terminals w is in L(G) if and only if S =>* w; the string w is called a sentence of G.
A language that can be generated by a grammar is said to be a context-free language.
If two grammars generate the same language, the grammars are said to be equivalent.
If S =>* α, where α may contain nonterminals, then α is a sentential form of G.
A sentence is a sentential form with no nonterminals.
Example: using the grammar (Grammar 2.3.4), the string – ( id + id ) is a sentence of
grammar(Grammar 2.3.4).
E => - E => - ( E ) => - ( E + E ) => - ( id + E ) => - ( id + id ),
The strings E, - E, - ( E + E ), - ( id + E ), - ( id + id ) appearing in the derivation are all sentential
form of the grammar (Grammar 2.3.4)
It can be written as E =>* - ( id + id ) to indicate that - ( id + id ) can be derived from E.
There are two types of derivations: leftmost derivation and rightmost derivation;
The derivation in which only the leftmost nonterminal in any sentential form is replaced at each
step. Such derivation is called leftmost derivation.
Example: E => - E => - ( E ) => - ( E + E ) => - ( id + E ) => - ( id + id )
If S =>* α by a leftmost derivation, then α is a left-sentential form of the grammar at hand.
The derivation in which only the rightmost nonterminal in any sentential form is replaced at each
step. Such derivation is called rightmost derivation or canonical derivation.
Example: E => - E => - ( E ) => - ( E + E ) => - ( E + id ) => - ( id + id )
If S =>* α by a rightmost derivation, then α is a right-sentential form of the grammar at hand.
Parse Trees and Derivations:
A parse tree may be viewed as a graphical representation for a derivation that filters out the choice
regarding replacement order.
Each interior node of a parse tree is labeled by some nonterminal A, and the children of the node
are labeled, from left to right, by the symbols in the right side of the production by which this A
was replaced in the derivation. The leaves of the parse tree are labeled by nonterminals or
terminals and, read from left to right, they constitute a sentential form, called the yield or frontier
of the tree.
Example: the parse tree for - ( id + id ) corresponds to the derivations above.
Ambiguity:
A grammar that produces more than one parse tree for some sentence is said to be ambiguous.
An ambiguous grammar is one that produces more than one leftmost or more than one rightmost
derivation for the same sentence.
Example: from the grammar (Grammar 2.3.4), the sentence id + id * id has two distinct leftmost
derivations.
Top-Down Parsing
Parsing is the process of determining if a string of tokens can be generated by a grammar.
For any context-free grammar there is a parser that takes at most O(n^3) time to parse a string of
n tokens.
Top-down parsers build parse trees from the top (root) to the bottom (leaves).
Two top-down parsing methods are to be discussed:
o Recursive Descent Parsing
o An efficient non-backtracking parsing called Predictive Parsing for LL(1) grammars.
Recursive Descent Parsing:
Consider the grammar
S → cAd
A → ab | a
and the input string w = cad.
To construct a parse tree for this string using the top-down approach, initially create a tree
consisting of a single node labeled S.
Procedure S
procedure S()
begin
if input symbol = ‘c’ then
begin
ADVANCE( );
if A( ) then
if input symbol = ‘d’ then
begin ADVANCE( ); return true end
end;
return false
end
Procedure A
procedure A( )
begin
isave := input-pointer;
if input symbol = ‘a’ then
begin
ADVANCE( );
if input symbol = ‘b’ then
begin ADVANCE( ); return true end
end
input-pointer := isave;
/* failure to find ab */
if input symbol = ‘a’ then
begin ADVANCE( ); return true end
else return false
end
A parser that uses a set of recursive procedures to recognize its input with no backtracking is
called a recursive descent parser.
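The two procedures translate directly into C. The following is a runnable sketch for the grammar
S → cAd, A → ab | a; the global pointer input plays the role of the input-pointer, and backtracking
saves and restores it.

#include <stdio.h>

static const char *input;            /* next unread input character */

/* A -> ab | a : try "ab" first; on failure, backtrack and try "a". */
static int A(void) {
    const char *isave = input;       /* isave := input-pointer      */
    if (*input == 'a') {
        input++;
        if (*input == 'b') { input++; return 1; }
    }
    input = isave;                   /* failure to find ab          */
    if (*input == 'a') { input++; return 1; }
    return 0;
}

/* S -> cAd */
static int S(void) {
    if (*input == 'c') {
        input++;
        if (A() && *input == 'd') { input++; return 1; }
    }
    return 0;
}

int main(void) {
    input = "cad";
    /* accept only if S succeeds and all input is consumed */
    printf("%s\n", (S() && *input == '\0') ? "accepted" : "rejected");
    return 0;
}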
Exercise 2.5.1
o * Write recursive descent parsing for the (Grammar 2.2.3) after eliminating the left
recursion. (refer reference book 1, for solution, page 181.)
Predictive Parsing:
The predictive parser has an input, a stack, a parsing table, and an output.
The input contains the string to be parsed, followed by $, the right endmarker.
The stack contains a sequence of grammar symbols, preceded by $, the bottom-of-stack marker.
Initially the stack contains the start symbol of the grammar preceded by $.
The parsing table is a two-dimensional array M[A,a], where A is a nonterminal, and a is a
terminal or the symbol $.
The parser is controlled by a program that behaves as follows:
The program determines X, the symbol on top of the stack, and a, the current input symbol.
These two symbols determine the action of the parser.
There are three possibilities:
o 1. If X = a = $, the parser halts and announces successful completion of parsing.
o 2. If X = a ≠ $, the parser pops X off the stack and advances the input pointer to the next input
symbol.
o 3. If X is a nonterminal, the program consults entry M[X,a] of the parsing table M. This entry will
be either an X-production of the grammar or an error entry.
If M[X,a] = {X → UVW}, the parser replaces X on top of the stack by WVU (with U on top).
If M[X,a] = error, the parser calls an error recovery routine.
Construction of Parsing Table:
o Before constructing the parsing table, two functions are computed to fill in the entries of the
table:
the FIRST( ) and FOLLOW( ) functions.
o These functions indicate the proper entries in the table for a grammar G.
o To compute FIRST(X) for all grammar symbols X, apply the following rules until no more
terminals or ε can be added to any FIRST set.
1. If X is a terminal, then FIRST(X) is {X}.
2. If X is a nonterminal and X → aα is a production, then add a to FIRST(X). If X → ε is a
production, then add ε to FIRST(X).
3. If X → Y1 Y2 … Yk is a production, then for all i such that all of Y1, … , Yi-1 are nonterminals
and FIRST(Yj) contains ε for j = 1, 2, … , i-1 (i.e., Y1 Y2 … Yi-1 =>* ε), add every non-ε symbol
in FIRST(Yi) to FIRST(X). If ε is in FIRST(Yj) for all j = 1, 2, … , k, then add ε to FIRST(X).
o To compute FOLLOW(A) for all nonterminals A, apply the following rules until nothing can
be added to any FOLLOW set.
1. $ is in FOLLOW(S), where S is the start symbol.
2. If there is a production A → αBβ, where β ≠ ε, then everything in FIRST(β) except ε is in
FOLLOW(B).
3. If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε (i.e.,
β =>* ε), then everything in FOLLOW(A) is in FOLLOW(B).
o Example 2.6.1
Consider the following grammar
E→E+T|T
T→T*F|F
F → ( E ) | id (Grammar 2.6.1)
Compute the FIRST and FOLLOW function for the above grammar.
Solution:
The (Grammar 2.6.1) is left recursive, so eliminating the left recursion from (Grammar 2.6.1)
we get:
E → TE`
E` → +TE` | ε
T → FT`
T` → *FT` | ε
F → ( E ) | id (Grammar 2.6.2)
Then:
FIRST(E) = FIRST(T) = FIRST(F) = {(, id}.
FIRST(E`) = {+,ε}
FIRST(T`) = {*,ε}
FOLLOW(E) = FOLLOW(E`) = {),$}
FOLLOW(T) = FOLLOW(T`) = {+,),$}
FOLLOW(F) = {+,*,),$}
o Exercise 2.6.1
Consider the grammar
S → iCtSS` | a
S` → eS | ε
C→b (Grammar 2.6.3)
Compute the FIRST and FOLLOW for the (Grammar 2.6.3)
o Construction of Predictive Parsing Table:
The following algorithm can be used to construct a predictive parsing table for a grammar G.
Algorithm 2.6.1 Constructing a predictive parsing table
Input: Grammar G
Output: Parsing table M
Method:
1. For each production A → α of the grammar, do step 2 and 3.
2. For each terminal a in FIRST(α), add A → α to M[A,a].
3. If ε is in FIRST(α), add A → α to M[A,b] for each terminal b in FOLLOW(A). If ε is in
FIRST(α) and $ is in FOLLOW(A), add A → α to M[A,$].
4. Make each undefined entry of M error.
Example 2.6.2
Construct the predictive parsing table for the (Grammar 2.6.2) using the Algorithm 2.6.1.
Solution (Table 2.6.1; blank entries are error entries):
            id         +           *           (          )          $
E           E → TE`                            E → TE`
E`                     E` → +TE`                          E` → ε     E` → ε
T           T → FT`                            T → FT`
T`                     T` → ε      T` → *FT`              T` → ε     T` → ε
F           F → id                             F → ( E )
Exercise 2.6.2
Construct the predictive parsing table for the (Grammar 2.6.3)
Moves made by a predictive parser on an input string:
o The predictive parsing program:
repeat
begin
let X be the top stack symbol and a the next input symbol;
if X is a terminal or $ then
if X = a then
pop X from the stack and remove a from the input
else
ERROR( )
else /* X is a nonterminal */
if M[X,a] = X → Y1 Y2 … Yk then
begin
pop X from the stack;
push Yk, Yk-1, … ,Y1 onto the stack, Y1 on top
end
else
ERROR( )
end
until
X = $ /* stack has emptied */
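As an illustration, the driver above can be realized in C. The following runnable sketch hard-codes
Table 2.6.1 for (Grammar 2.6.2) in the function M; the single-character encodings ('e' for E`, 't' for
T`, 'i' for id) are mine, not the text's.

#include <stdio.h>
#include <string.h>

static const char *M(char X, char a) {   /* parsing table; NULL = error */
    switch (X) {
    case 'E': if (a == 'i' || a == '(') return "Te";   /* E  -> TE`  */
              break;
    case 'e': if (a == '+') return "+Te";              /* E` -> +TE` */
              if (a == ')' || a == '$') return "";     /* E` -> ε    */
              break;
    case 'T': if (a == 'i' || a == '(') return "Ft";   /* T  -> FT`  */
              break;
    case 't': if (a == '*') return "*Ft";              /* T` -> *FT` */
              if (a == '+' || a == ')' || a == '$') return "";  /* T` -> ε */
              break;
    case 'F': if (a == 'i') return "i";                /* F  -> id   */
              if (a == '(') return "(E)";              /* F  -> (E)  */
              break;
    }
    return NULL;
}

int main(void) {
    const char *ip = "i+i*i$";        /* id + id * id followed by $ */
    char stack[64] = "$E";            /* $ below the start symbol E */
    int top = 1;

    for (;;) {
        char X = stack[top], a = *ip;
        if (X == '$' && a == '$') { puts("accepted"); return 0; }
        if (X == a) {                 /* match: pop X, advance input */
            top--; ip++;
        } else if (strchr("EeTtF", X)) {
            const char *rhs = M(X, a);
            if (rhs == NULL) break;   /* error entry                 */
            top--;                    /* pop X                       */
            for (int j = (int)strlen(rhs) - 1; j >= 0; j--)
                stack[++top] = rhs[j];   /* push RHS reversed, Y1 on top */
        } else {
            break;                    /* terminal mismatch           */
        }
    }
    puts("error");
    return 1;
}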
Example 2.6.3
o For the input string id + id * id, perform predictive parsing using (Grammar 2.6.2) and
Table 2.6.1.
LL(1) Grammars:
o When the parsing table contains at least one multiply-defined entry, the easiest recourse is to
transform the grammar by eliminating all left recursion and then left factoring whenever possible,
to produce a grammar for which the parsing table has no multiply-defined entries.
o A grammar whose parsing table has no multiply-defined entries is said to be LL(1) grammar.
o A grammar is LL(1) if and only if whenever A → α | β are two distinct productions of G the
following conditions hold:
1. For no terminal a do α and β derive strings beginning with a.
2. At most one of α and β can derive the empty string.
3. If β =>* ε, then α does not derive any string beginning with a terminal in FOLLOW(A).
o Exercise 2.6.3 check whether the (Grammar 2.6.2) and (Grammar 2.6.3) are LL(1).
Error recovery in predictive parsing
o An error is detected during predictive parsing when the terminal on top of the stack does not
match the next input symbol, or when nonterminal A is on top of the stack, a is the next input
symbol, and the parsing table entry M[A,a] is empty.
o Panic-mode error recovery:
It is based on the idea of skipping symbols on the input until a token in a selected set of
synchronizing tokens appears.
Its effectiveness depends on the choice of the synchronizing set.
The sets should be chosen so that the parser recovers quickly from errors that are likely to occur
in practice.
o Phrase-level recovery:
It is implemented by filling in the blank entries in the predictive parsing table with pointers to
error routines.
These routines may change, insert, or delete symbols on the input and issue appropriate error
messages.
They may also pop from the stack.
In any event, we must be sure that there is no possibility of an infinite loop.
Checking that any recovery action eventually results in an input symbol being consumed is a good
way to protect against such loops.
Bottom-Up Parsing:
Bottom-up parsers build parse trees from the leaves and work up to the root.
A general style of bottom-up syntax analysis is known as shift-reduce parsing.
An easy-to-implement form of shift-reduce parsing is called operator-precedence parsing.
A general method of shift-reduce parsing is called LR parsing.
Shift-reduce parsing attempts to construct a parse tree for an input string beginning at the leaves
(the bottom) and working up towards the root (the top).
At each reduction step a particular substring matching the right side of a production is replaced
by the symbol on the left of that production, and if the substring is chosen correctly at each step,
a rightmost derivation is traced out in reverse.
Example 2.7.1
o Consider the grammar
S → aABe
A → Abc | b
B→d (Grammar 2.7.1)
The sentence abbcde can be reduced to S by the following steps.
abbcde
aAbcde
aAde
aABe
S
These reductions trace out the following rightmost derivation in reverse:
S => aABe => aAde => aAbcde => abbcde
Handles:
o A handle of a string is a substring that matches the right side of a production, and whose reduction
to the nonterminal on the left side of the production represents one step along the reverse of a
rightmost derivation.
Precise definition of a handle:
o A handle of a right-sentential form γ is a production A → β and a position of γ where the string β
may be found and replaced by A to produce the previous right-sentential form in a rightmost
derivation of γ.
o i.e., if S =>* αAw => αβw by a rightmost derivation, then A → β in the position following α is
a handle of αβw.
o The string w to the right of the handle contains only terminal symbols.
o In the example above, abbcde is a right sentential form whose handle is A → b at position 2.
Likewise, aAbcde is a right sentential form whose handle is A → Abc at position 2.
Handle Pruning:
A rightmost derivation in reverse can be obtained by handle pruning.
i.e., start with a string of terminals w that we wish to parse. If w is a sentence of the grammar at
hand, then w = γn, where γn is the nth right-sentential form of some as yet unknown rightmost
derivation
S = γ0 => γ1 => γ2 => … => γn-1 => γn = w.
Example of right-sentential forms and handles for the grammar
E → E + E
E → E * E
E → ( E )
E → id
For the input id1 + id2 * id3, the right-sentential forms and their handles are:
RIGHT-SENTENTIAL FORM      HANDLE      REDUCING PRODUCTION
id1 + id2 * id3            id1         E → id
E + id2 * id3              id2         E → id
E + E * id3                id3         E → id
E + E * E                  E * E       E → E * E
E + E                      E + E       E → E + E
E
Stack Implementation of Shift-Reduce Parsing:
A convenient way to implement a shift-reduce parser is to use a stack to hold grammar symbols and
an input buffer to hold the string w to be parsed. Initially the stack contains only $ and the input
buffer contains w$. The parser shifts and reduces until the configuration becomes:
Stack                      Input
$S                         $
After entering this configuration, the parser halts and announces successful completion of parsing.
There are four possible actions that a shift-reduce parser can make: 1) shift 2) reduce 3) accept 4)
error.
1. In a shift action, the next input symbol is shifted onto the top of the stack.
2. In a reduce action, the parser knows the right end of the handle is at the top of the stack. It
must then locate the left end of the handle within the stack and decide with what nonterminal to
replace the handle.
3. In an accept action, the parser announces successful completion of parsing.
4. In an error action, the parser discovers that a syntax error has occurred and calls an error
recovery routine.
Note: an important fact that justifies the use of a stack in shift-reduce parsing: the handle will
always appear on top of the stack, and never inside.
Example 2.8.1
Consider the grammar
E→E+E
E→E*E
E→(E)
E → id (Grammar 2.8.1)
and the input string id1 + id2 * id3. Use the shift-reduce parser to check whether the input string is
accepted by the (Grammar 2.8.1)
o Viable Prefixes:
The set of prefixes of right sentential forms that can appear on the stack of a shift-reduce parser
are called viable prefixes.
o Conflicts during shift-reduce parsing:
There are CFGs for which shift-reduce parsing cannot be used.
Every shift-reduce parser for such a grammar can reach a configuration in which the parser
cannot decide whether to shift or to reduce (a shift/reduce conflict), or cannot decide which of
several reductions to make (a reduce/reduce conflict), even knowing the entire stack contents and
the next input symbol.
An example of such a grammar is given below.
These grammars are not in the LR(k) class of grammars; we refer to them as non-LR grammars.
The k in the LR(k) grammars refer to the number of symbols of lookahead on the input.
Grammars used in compiling usually fall in the LR(1) class, with one symbol lookahead.
An ambiguous grammar can never be LR.
stmt → if expr then stmt
| if expr then stmt else stmt
| other (Grammar 2.8.2)
In this grammar a shift/reduce conflict occurs for some input strings,
so this (Grammar 2.8.2) is not an LR(1) grammar.
Operator-Precedence Parsing:
This parsing technique applies to grammars with the property that no production right side is ε or
has two adjacent nonterminals.
A grammar with the latter property is called an operator grammar.
The following grammar for expressions is not an operator grammar:
E → E A E | ( E ) | - E | id
A→+|-|*|/|^ (Grammar 2.9.1)
o Because the right side EAE has two consecutive nonterminals.
To turn (Grammar 2.9.1) into an operator grammar, substitute for A each of its alternatives in the
production for E:
E → E + E | E – E | E * E | E / E | E ^ E | ( E ) | - E | id (Grammar 2.9.2)
Operator-precedence parsing has a number of disadvantages:
o It is hard to handle tokens like the minus sign, which has two different precedences (depending
on whether it is unary or binary).
o Since the relationship between a grammar for the language being parsed and the operator-
precedence parser itself is weak, one cannot always be sure that the parser accepts exactly the
desired language.
o Only a small class of grammars can be parsed using the operator-precedence technique.
In operator-precedence parsing, three disjoint precedence relations, <., =, and .>, are defined
between pairs of terminals; one of the relations <., =, and .> holds between two terminals (a pair
for which none holds indicates an error). Further, $ is used to mark each end of the string, and we
define $ <. b and b .> $ for all terminals b.
Example 2.9.1
Consider the grammar
E → E + E | E * E | ( E ) | id (Grammar 2.9.3)
The corresponding operator-precedence relation table is as follows (blank entries denote errors):
        +     *     (     )     id    $
+       .>    <.    <.    .>    <.    .>
*       .>    .>    <.    .>    <.    .>
(       <.    <.    <.    =     <.
)       .>    .>          .>          .>
id      .>    .>          .>          .>
$       <.    <.    <.          <.
Algorithm 2.9.1 Operator-precedence parsing
Input: An input string w and a table of precedence relations.
Output: If w is well formed, a skeletal parse tree with a placeholder nonterminal E labeling all
interior nodes; otherwise, an error indication.
Method: Initially, the stack contains $ and the input buffer the string w$.
(1) set ip to point to the first symbol of w$;
(2) repeat forever
(3) if $ is on top of the stack and ip points to $ then
(4) return
(5) else begin
(6) let a be the topmost terminal symbol on the stack
and let b be the symbol pointed to by ip;
(7) if a <. b or a = b then begin
(8) push b onto the stack;
(9) advance ip to the next input symbol;
end;
(10) else if a .> b then /* reduce */
(11) repeat
(12) pop the stack
(13) until the top stack terminal is related by <.
to the terminal most recently popped
(14) else error( )
end;
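A runnable C sketch of Algorithm 2.9.1 for (Grammar 2.9.3), using the precedence relation table
from Example 2.9.1. The stack here holds only terminals, leaving the placeholder nonterminal E
implicit; 'i' encodes id, a blank table entry means error, and the names and encodings are mine.

#include <stdio.h>
#include <string.h>

static char rel(char a, char b) {   /* returns '<', '=', '>' or ' ' (error) */
    const char *t = "+*()i$";
    static const char *m[6] = {
        "><<><>",   /* +  */
        ">><><>",   /* *  */
        "<<<=< ",   /* (  */
        ">> > >",   /* )  */
        ">> > >",   /* id */
        "<<< < ",   /* $  */
    };
    const char *pa = strchr(t, a), *pb = strchr(t, b);
    return (pa && pb) ? m[pa - t][pb - t] : ' ';
}

int main(void) {
    const char *ip = "i+(i*i)$";    /* id + ( id * id ), then the endmarker */
    char stack[64] = "$";
    int top = 0;

    for (;;) {
        char a = stack[top], b = *ip;
        if (a == '$' && b == '$') { puts("accepted"); return 0; }
        char r = rel(a, b);
        if (r == '<' || r == '=') {            /* shift b onto the stack */
            stack[++top] = b;
            ip++;
        } else if (r == '>') {                 /* reduce the handle      */
            char last;
            do {                               /* pop until top terminal */
                last = stack[top--];           /* is <.-related to the   */
            } while (rel(stack[top], last) != '<');  /* last one popped  */
        } else {
            puts("error");                     /* no relation holds      */
            return 1;
        }
    }
}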
Operator-Precedence Relations from Associativity and Precedence:
o The rules are designed to select the “proper” handles to reflect a given set of associativity
and precedence rules for binary operators.
1. If operator θ1 has higher precedence than operator θ2, make θ1 .> θ2 and θ2 <. θ1. These
relations ensure that, in an expression of the form E + E * E + E, the central E * E is the handle
that will be reduced first.
2. If θ1 and θ2 are operators of equal precedence (they may in fact be the same operator), then
make θ1 .> θ2 and θ2 .> θ1 if the operators are left associative, or make θ1 <. θ2 and θ2 <. θ1 if
the operators are right associative. These relations ensure that E - E + E will have the handle
E - E selected and E ^ E ^ E will have the right E ^ E selected.
3. Make θ <. id and id .> θ,
θ <. ( and ( <. θ,
) .> θ and θ .> ),
θ .> $ and $ <. θ,
for all operators θ. Also let
( = ),  $ <. (,  $ <. id,
( <. (,  id .> $,  ) .> $,
( <. id,  id .> ),  ) .> )
o These rules ensure that both id and (E) will be reduced to E. Also, $ serves as both the left
and right endmarker, causing handles to be found between $’s wherever possible.
o Example 2.9.2
Consider the grammar
E → E + E | E - E | E * E | E / E | E ^ E | ( E ) | - E | id
(Grammar 2.9.2)
Produce the operator-precedence relations for the above grammar, and then parse the input
id * ( id ^ id ) - id / id using the resulting operator-precedence relation table.
Precedence Functions:
o Compilers using operator-precedence parsers need not store the table of precedence
relations.
o In most cases, the table is encoded by two precedence functions f and g that map terminal
symbols to integers.
o f and g are selected so that, for symbols a and b:
f(a) < g(b) whenever a <. b,
f(a) = g(b) whenever a = b,
f(a) > g(b) whenever a .> b.
Algorithm 2.9.2 Constructing precedence functions
Input: An operator precedence matrix.
Output: Precedence functions representing the matrix, or an indication that none exist.
Method:
1. Create symbols fa and ga for each a that is a terminal or $.
2. Partition the created symbols into groups so that fa and gb are in the same group if a = b.
3. Create a directed graph whose nodes are the groups. For each pair of terminals a and b, if
a <. b, place an edge from the group of gb to the group of fa; if a .> b, place an edge from the
group of fa to the group of gb.
4. If the graph has a cycle, no precedence functions exist. If it has no cycles, let f(a) be the length
of the longest path beginning at the group of fa, and let g(a) be the length of the longest path
beginning at the group of ga.
o Exercise 2.9.1
Consider the grammar
E → E + E | E * E | ( E ) | id (Grammar 2.9.3)
and its operator-precedence relation table from Example 2.9.1.
Construct the graph using Algorithm 2.9.2 and derive the precedence functions.
The resulting precedence functions are:
        +    *    id    $
f       2    4    4     0
g       1    3    5     0
For example, there is a path from gid to f* to g* to f+ to g+ to f$ of length 5, so the value of the
precedence function g at id is 5.
Error Recovery in operator-precedence parsing:
o There are two points in the parsing process at which an operator-precedence parser can
discover syntactic errors:
If no precedence relation holds between the terminal on top of the stack and the current
input.
If a handle has been found, but there is no production with this handle as a right side.
o The error checker issues diagnostics such as the following:
Missing operand
Missing operator
No expression between parentheses
Illegal b on line (line containing b)
Missing d on line (line containing c)
Missing E on line (line containing b)
o These error diagnostics are issued during the handling of errors detected while reducing.
o During the handling of shift/reduce errors, the diagnostics issued are:
Missing operand
Unbalanced right parenthesis
Missing right parenthesis
Missing operator