CSC 415

A compiler translates programs from a source language to a target language, consisting of analysis and synthesis phases. The analysis phase includes lexical, syntax, and semantic analysis, which breaks down the source program into tokens, checks for grammatical structure, and ensures semantic correctness. The document also discusses various tools and techniques used in compiler construction, including preprocessors, lexical analyzers, and optimization methods.

1.1 Compilers - Introduction

A compiler is a program that reads a program written in one language (the source language) and
translates it into an equivalent program in another language (the target language).
As part of this translation process, the compiler reports to its user the presence of errors in the source
program.

• The Analysis-Synthesis Model of Compilation:


There are two parts to compilation: analysis and synthesis.
– The analysis part breaks up the source program into constituent pieces and creates an
intermediate representation of the source program.
– The synthesis part constructs the desired target program from the intermediate representation.
During analysis, the operations implied by the source program are determined and recorded in a
hierarchical structure called a syntax tree,
– in which each node represents an operation and the children of a node represent the arguments of
the operation.

Software tools that manipulate the source program first perform some kind of analysis:
1. Structure Editors: take as input a sequence of commands to build a source program.
They perform text creation and modification, and analyze the program text to impose a hierarchical
structure on it (checking that the input is correctly formed).

2. Pretty Printers: analyze the program and print it so that its structure becomes clearly
visible.
3. Static Checkers: reads a program, analyzes it, and attempts to discover potential bugs without
running the program.
4. Interpreters: Instead of producing a target program as a translation, it performs the operations
implied by the source program.

Some examples where the analysis portion is similar to conventional compiler:


1. Text Formatters: take as input a stream of characters that includes commands to indicate
paragraphs, figures, or mathematical structures like subscripts and superscripts.
2. Silicon Compilers: has a source language similar to a conventional programming language.
• However, the variables of the language represent, not locations in memory, but logical signals (0
or 1) or groups of signals in a switching circuit.
3. Query Interpreters: translate a predicate containing relational and boolean operators into
commands to search a database for records satisfying that predicate.
Preprocessor: The task of collecting the source program is sometimes entrusted to a distinct
program called a preprocessor.

1.2 Analysis of the Source Program:


In compiling, analysis consists of three phases:
1. Linear Analysis: in which the stream of characters (source program) is read from left to right
and grouped into tokens, which are sequences of characters having a collective meaning.
2. Hierarchical Analysis: in which characters or tokens are grouped hierarchically into nested
collections with collective meaning.
3. Semantic Analysis: in which certain checks are performed to ensure that the components of a
program fit together meaningfully.
Lexical Analysis: (Linear Analysis or Scanning)
For example, in lexical analysis the characters in the assignment statement
position := initial + rate * 60
Would be grouped into the following tokens:

• The identifier position
• The assignment symbol :=
• The identifier initial
• The plus symbol +
• The identifier rate
• The multiplication symbol *
• The number 60
• The blanks separating the characters of these tokens would normally be eliminated during lexical
analysis.
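As a concrete illustration, the grouping above can be reproduced with a few patterns. The following is a minimal Python sketch (the token names and the regular expressions are assumptions of the sketch, not part of the text):

import re

TOKEN_SPEC = [
    ("NUM",    r"\d+"),                   # the number 60
    ("ID",     r"[A-Za-z][A-Za-z0-9]*"),  # identifiers such as position
    ("ASSIGN", r":="),                    # the assignment symbol
    ("PLUS",   r"\+"),
    ("TIMES",  r"\*"),
    ("SKIP",   r"[ \t]+"),                # blanks are eliminated, as noted above
]

def tokenize(text):
    pattern = "|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC)
    for m in re.finditer(pattern, text):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(tokenize("position := initial + rate * 60")))
# [('ID', 'position'), ('ASSIGN', ':='), ('ID', 'initial'), ('PLUS', '+'),
#  ('ID', 'rate'), ('TIMES', '*'), ('NUM', '60')]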
Syntax Analysis: (Hierarchical Analysis or Parsing)
It involves grouping the tokens of the source program into grammatical phrases that are used by the
compiler to synthesize output. The grammatical phrase is represented by a parse tree.

Semantic Analysis:
This phase checks the source program for semantic errors and gathers type information for the
subsequent code-generation phase. It uses the hierarchical structure determined by the syntax
analysis phase to identify the operators and operands of expressions and statements.
An important component of semantic analysis is type checking.

1.3 The Phases of a Compiler:

A compiler operates in phases. Each phase transforms the source program from one representation to
another.
Six phases:
– Lexical Analyser
– Syntax Analyser
– Semantic Analyser
– Intermediate code generation
– Code optimization
– Code Generation
Symbol table and error handling interact with the six phases.
Some of the phases may be grouped together.
These phases are illustrated by considering the following statement:
position := initial + rate * 60

• Symbol Table Management:
– Symbol table is a data structure containing a record for each identifier, with fields for the
attributes of the identifier.
– It records the identifiers used in the source program and collects information about each
identifier, such as:
• its type, (by semantic and intermediate code)
• its scope, (by semantic and intermediate code)
• storage allocation, (by code generation)
• for procedures: the number of arguments and their types, and the type returned
• Error Detecting and Reporting:
– Each phase can encounter errors.
– The lexical phase detects input that does not form a token.
– The syntax phase detects token streams that violate the syntax rules.
– The semantic phase detects constructs that have the right structure but no meaning for the
operands involved.
The Analysis Phase:
– Involves: Lexical, Syntax and Semantic phase
– As translation progresses, the compiler’s internal representation of the source program changes.
Lexical Analysis Phase:
– The lexical phase reads the characters in the source program and groups them into a stream of
tokens in which each token represents a logically cohesive sequence of characters, such as, An
identifier, A keyword, A punctuation character.
– The character sequence forming a token is called the lexeme for the token.
Syntax Analysis Phase:
– Syntax analysis imposes a hierarchical structure on the token stream. This hierarchical structure is
called syntax tree.
– In a syntax tree, an interior node is a record with a field for the operator and two fields
containing pointers to the records for the left and right children.

– A leaf is a record with two or more fields, one to identify the token at the leaf, and the other to
record information about the token.
Semantic Analysis Phase:
– This phase checks the source program for semantic errors and gathers type information for the
subsequent code-generation phase.
– It uses the hierarchical structure determined by the syntax-analysis phase to identify the operators
and operands of expressions and statements.
– An important component of semantic analysis is type checking.
Intermediate Code Generation:
– The syntax and semantic analysis phases generate an explicit intermediate representation of the
source program.
– The intermediate representation should have two important properties:
• It should be easy to produce,
• And easy to translate into the target program.
– Intermediate representation can have a variety of forms.
– One of the forms is: three address code; which is like the assembly language for a machine in
which every location can act like a register.
– Three address code consists of a sequence of instructions, each of which has at most three
operands.
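For example, the statement position := initial + rate * 60 might be translated into three-address code of the following form (a sketch, using id1, id2 and id3 for the identifiers position, initial and rate, and assuming the integer 60 is converted to a real number):

temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3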
Code Optimization:
– Code optimization phase attempts to improve the intermediate code, so that faster-running
machine code will result.
Code Generation:
– The final phase of the compiler is the generation of target code, consisting normally of relocatable
machine code or assembly code.
– Memory locations are selected for each of the variables used by the program.
– Then each intermediate instruction is translated into a sequence of machine instructions that
perform the same task.

1.4 Cousins of the Compiler:


The input to a compiler may be produced by one or more preprocessors. The contexts in which a
compiler typically operates are:

– Preprocessors
– Assemblers
– Loader and Link-Editors
• Preprocessors:
– Preprocessors produce input to a compiler.
– They perform the following functions:
• 1. Macro processing: a preprocessor allows a user to define macros (short forms for
longer constructs).
• 2. File inclusion: a preprocessor includes header files into the program text.

• 3. Rational preprocessor: a preprocessor provides the user with built-in macros for constructs
like while-stmt or if-stmt where none exist in the programming language itself.
• 4. Language extension: they add capabilities to the language by what amounts to built-in
macros, e.g., the language Equel, a database query language embedded in C.
• Assemblers:
– Some compilers produce assembly code that is passed to an assembler for further processing.
– Assembly code is a mnemonic version of machine code, in which names are used instead of
binary codes for operations and names are also given to memory addresses.
• Loaders and Link-Editors:
– A loader is a program that performs two functions: loading and link-editing.
– The process of loading consists of taking relocatable machine code, altering the relocatable
addresses and placing the altered instructions and data in memory at the proper locations.
– The link-editor allows making a single program from several files of relocatable machine code.

1.5 The Grouping of Phases:


Phases deal with the logical organisation of a compiler. In an implementation, activities from more than

one phase are often grouped together.

• Front and Back Ends:

– The phases are collected into a front and a back end.

– The front end consists of phases, or parts of phases, that depend primarily on the source language and
are largely independent of the target machine.

• These normally include lexical and syntactic analysis, the creation of the symbol table, semantic
analysis and intermediate code generation.

– The back end includes portions of the compiler that depend on the target machine.

• This includes part of code optimization and code generation.

• Passes:

– Several phases of compilation are implemented in a single pass consisting of reading an input
file and writing an output file.

• Reducing the number of passes:

– It takes time to read and write intermediate files.

– Grouping several phases into one pass may force the entire program to be kept in memory, because one
phase may need information in a different order than the previous phase produces it.

– Intermediate code and code generation are often merged into one pass using a technique called
backpatching.

1.6 Compiler-Construction Tools:

Compiler writers use software tools such as debuggers, version managers, profilers, and so on.

• The following is a list of some useful compiler-construction tools:

1. Parser generators

2. Scanner generators

3. Syntax-directed translation engines

4. Automatic code generators

5. Data-flow engines

1. Parser generators: These produce syntax analysers from a context-free grammar given as input.

2. Scanner generators: These automatically produce lexical analysers from a specification based on
regular expressions.

3. Syntax-directed translation engines: These produce collections of routines that walk the parse tree,
generating the intermediate code.

4. Automatic code generators: These take a collection of rules that define the translation of each
operation of the intermediate language into the machine language for the target machine.

5. Data-flow engines: Good code optimization involves data-flow analysis – the gathering of
information about how values are transmitted from one part of a program to another.

1.7 Lexical Analysis:

A simple way to build a lexical analyser is to construct a diagram that illustrates the structure of the

tokens of the source language, and then to translate the diagram into a program for finding tokens.

The software tool that automates the construction of lexical analysers is a lexical analyser generator.

The major advantage of a lexical analyser generator is that it can utilize the best-known pattern-matching algorithms and thereby create efficient lexical analysers for people who are not experts in pattern-matching techniques.

1.8 Role of the Lexical Analyser:

The lexical analyser is the first phase of a compiler.

Its main task is to read the input characters and produce as output a sequence of tokens.

Upon receiving a “get next token” command from the parser, the lexical analyser reads input characters until it can identify the next token. Lexical analysers read source text and also perform certain secondary tasks:

 Stripping out from the source program comments and white space in the form of blank, tab,
and newline characters.
 Correlating error messages from the compiler with the source program.

For example, the lexical analyser may keep track of the number of newline characters seen, so
that a line number is associated with an error message. If the source language supports some
macro preprocessor functions, then these preprocessor functions may also be implemented as
lexical analysis takes place.

Issues in lexical analysis:

Reasons for separating the analysis phase into lexical analysis and parsing:

1. simpler design

2. compiler efficiency is improved. Specialized buffering techniques for reading input characters and processing tokens can significantly speed up the performance of the compiler.

3. compiler portability is enhanced

Tokens, lexeme and pattern:

– A token corresponds to a set of strings in the input for which the same token is produced as output.

– This set of strings is described by a rule called a pattern associated with the token.

– A lexeme is a sequence of characters in the source program that is matched by the pattern for a
token.

For example: const pi = 3.1416;

Table 1.1 Examples of tokens

Token      Sample Lexemes          Informal description of pattern

const      const                   const

if         if                      if

relation   <, <=, =, >, >=, <>     < or <= or = or > or >= or <>

id         pi, count, D2           letter followed by letters and digits

num        3.1416, 0, 6.02E23      any numeric constant

literal    “core dumped”           any characters between “ and “

Lexical errors:

– Misspelling of the keyword; e.g. fi(a==f)

– Undeclared function identifier;

Error-recovery actions:

– Deleting an extraneous character

– Inserting a missing character

– Replacing an incorrect character by a correct character

– Transposing two adjacent characters

Buffering

Input Buffering:
• Some efficiency issues concerned with the buffering of input.
• A two-buffer input scheme that is useful when lookahead on the input is necessary to identify
tokens.
• Techniques for speeding up the lexical analyser, such as the use of sentinels to mark the buffer
end.
• There are three general approaches to the implementation of a lexical analyser:
• 1. Use a lexical-analyser generator, such as Lex compiler to produce the lexical analyser from a
regular expression based specification. In this, the generator provides routines for reading and
buffering the input.
• 2. Write the lexical analyser in a conventional systems-programming language, using I/O
facilities of that language to read the input.
• 3. Write the lexical analyser in assembly language and explicitly manage the reading of input.

• Buffer pairs:
• Because a large amount of time can be consumed moving characters, specialized buffering
techniques have been developed to reduce the amount of overhead required to process an input
character.
• The scheme to be discussed:
• It consists of a buffer divided into two N-character halves.

N – Number of characters on one disk block, e.g., 1024 or 4096.


– Read N characters into each half of the buffer with one system read command.
– If fewer than N characters remain in the input, then eof is read into the buffer after the input
characters.
– Two pointers to the input buffer are maintained.
– The string of characters between two pointers is the current lexeme.
– Initially both pointers point to the first character of the next lexeme to be found.
– Forward pointer, scans ahead until a match for a pattern is found.
– Once the next lexeme is determined, the forward pointer is set to the character at its right end.
– If the forward pointer is about to move past the halfway mark, the right half is filled with N new
input characters.
– If the forward pointer is about to move past the right end of the buffer, the left half is filled with
N new characters and the forward pointer wraps around to the beginning of the buffer.
– Disadvantage of this scheme:
– This scheme works well most of the time, but the amount of lookahead is limited.

– This limited lookahead may make it impossible to recognize tokens in situations where the
distance that the forward pointer must travel is more than the length of the buffer.
– For example: DECLARE ( ARG1, ARG2, … , ARGn ) in a PL/I program;
– we cannot determine whether DECLARE is a keyword or an array name until we see the character that
follows the right parenthesis.
– Sentinels:
– In the previous scheme, each time the forward pointer is moved, we must check that it has not moved off
one half of the buffer; if it has, the other half must be reloaded.
– Therefore the ends of the buffer halves require two tests for each advance of the forward pointer.
– The two tests can be reduced to one if each buffer half is extended to hold a sentinel character at
the end.
– The sentinel is a special character that cannot be part of the source program. (eof character is used
as sentinel).

• With sentinels, most of the time only one test is performed, to see whether forward points to an eof.
• Only when the end of a buffer half or the end of input is reached are more tests performed.
• Since N input characters are encountered between eof’s, the average number of tests per input
character is very close to 1.
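A minimal Python sketch of the sentinel scheme is given below (the choice of '\0' as the sentinel character and the class layout are assumptions of the sketch; the point is that advance() usually performs a single comparison):

N = 4         # characters per half; 1024 or 4096 in practice
EOF = "\0"    # the sentinel; eof cannot be part of the source program

class TwoBufferInput:
    def __init__(self, stream):
        self.stream = stream
        # halves at [0..N-1] and [N+1..2N]; sentinel slots at N and 2N+1
        self.buf = [EOF] * (2 * N + 2)
        self.forward = -1
        self._fill(0)                        # load the first half

    def _fill(self, start):
        data = self.stream.read(N)
        self.buf[start:start + len(data)] = list(data)
        self.buf[start + len(data)] = EOF    # eof after the input characters

    def advance(self):
        self.forward += 1
        if self.buf[self.forward] == EOF:    # the one test done most of the time
            if self.forward == N:            # hit the sentinel of the first half
                self._fill(N + 1)
                self.forward = N + 1
            elif self.forward == 2 * N + 1:  # hit the sentinel of the second half
                self._fill(0)
                self.forward = 0             # wrap around to the beginning
            # otherwise: true end of input
        return self.buf[self.forward]

import io
src = TwoBufferInput(io.StringIO("position:=initial+rate*60"))
chars = []
c = src.advance()
while c != EOF:
    chars.append(c)
    c = src.advance()
print("".join(chars))   # position:=initial+rate*60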

Specification of Tokens:
• Regular expressions are notation for specifying patterns.
• Each pattern matches a set of strings.
• Regular expressions will serve as names for sets of strings.
• Strings and Languages:
• The term alphabet or character class denotes any finite set of symbols.

• e.g., set {0,1} is the binary alphabet.
• The terms sentence and word are often used as synonyms for the term string.
• The length of a string s, written |s|, is the number of occurrences of symbols in s.
• e.g., the string “banana” is of length six.
• The empty string denoted by ε – length of empty string is zero.
• The term language denotes any set of strings over some fixed alphabet.
• e.g., the empty set ∅ and {ε}, the set containing only the empty string, are both languages.
• If x and y are strings, then the concatenation of x and y (written as xy) is the string formed by
appending y to x. x = dog and y = house; then xy is doghouse.
• sε = εs = s.
• s^0 = ε, s^1 = s, s^2 = ss, s^3 = sss, and so on.

Table1.2 Terms for parts of a string

TERM                                       DEFINITION
Prefix of s                                A string obtained by removing zero or more trailing symbols of string s; e.g., ban is a prefix of banana.
Suffix of s                                A string formed by deleting zero or more of the leading symbols of s; e.g., nana is a suffix of banana.
Substring of s                             A string obtained by deleting a prefix and a suffix from s; e.g., nan is a substring of banana.
Proper prefix, suffix, or substring of s   Any nonempty string x that is a prefix, suffix or substring of s such that s ≠ x.
Subsequence of s                           Any string formed by deleting zero or more not necessarily contiguous symbols from s; e.g., baaa is a subsequence of banana.
• Operations on Languages:
– There are several operations that can be applied to languages:

Table 1.3 Definitions of operations on languages L and M

OPERATION                              DEFINITION
Union of L and M, written L ∪ M        L ∪ M = { s | s is in L or s is in M }
Concatenation of L and M, written LM   LM = { st | s is in L and t is in M }
Kleene closure of L, written L*        L* denotes “zero or more concatenations of” L.
Positive closure of L, written L+      L+ denotes “one or more concatenations of” L.

• Regular Expressions:
– It allows defining the sets to form tokens precisely.
– e.g., letter ( letter | digit) *
– Defines a Pascal identifier – which says that the identifier is formed by a letter followed by zero
or more letters or digits.
– A regular expression is built up out of simpler regular expressions using a set of defining rules.
– Each regular expression r denotes a language L(r).
• The following rules define the regular expressions over an alphabet ∑.
– (Associated with each rule is a specification of the language denoted by the regular expression
being defined)
• 1. ε is a regular expression that denotes {ε}, i.e. the set containing the empty string.
• 2. If a is a symbol in ∑, then a is a regular expression that denotes {a}, i.e. the set containing the
string a.
• 3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then
– a) (r) | (s) is a regular expression denoting the language L(r) ∪ L(s).
– b) (r)(s) is a regular expression denoting the language L(r)L(s).
– c) (r)* is a regular expression denoting the language (L(r))*.
– d) (r) is a regular expression denoting the language L(r).

• A language denoted by a regular expression is said to be a regular set.
• The specification of a regular expression is an example of a recursive definition.
• Rule (1) and (2) form the basis of the definition.
• Rule (3) provides the inductive step.

Table 1.4 Algebraic Properties of regular expressions

AXIOM                DESCRIPTION
r|s = s|r            | is commutative
r|(s|t) = (r|s)|t    | is associative
(rs)t = r(st)        concatenation is associative
r(s|t) = rs|rt       concatenation distributes over |
(s|t)r = sr|tr
εr = r               ε is the identity element for concatenation
rε = r
r* = (r|ε)*          relation between * and ε
r** = r*             * is idempotent

• Regular Definition:
– If ∑ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the
form
• d1 → r1
• d2 → r2
• …
• dn → rn
– Where each di is a distinct name, and each ri is a regular expression over the symbols in ∑ ∪ {d1,
d2, … , di-1}, i.e., the basic symbols and the previously defined names.
– e.g. (regular definition in bold):
• letter → A|B|…|Z|a|b|…|z
• digit → 0|1|…|9
• id → letter ( letter | digit ) *
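This regular definition carries over directly to any regular-expression engine. As a quick illustration in Python (the character-class spellings are assumptions of the sketch):

import re

letter = "[A-Za-z]"
digit = "[0-9]"
ident = re.compile(f"{letter}({letter}|{digit})*")   # id → letter ( letter | digit )*

print(bool(ident.fullmatch("count")))    # True: a letter followed by letters and digits
print(bool(ident.fullmatch("2count")))   # False: must begin with a letter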

• Notational Shorthand:
– This shorthand is used in certain constructs that occur frequently in regular expressions.
• 1. one or more instances: the unary postfix operator + means “one or more instances of”. If r is a
regular expression that denotes the language L(r), then r+ is a regular expression that denotes the
language (L(r))+. Similarly, the unary postfix operator * means “zero or more instances of”. The two
algebraic identities r* = r+ | ε and r+ = r r* relate the Kleene and positive closure operators.
• 2. zero or one instance: unary postfix operator ? means “zero or one instance of”. The
notation r? is a shorthand for r|ε.
• 3. character class: the notation [abc] is a shorthand for a|b|c.

Role of the Parser

In the compiler model, the parser obtains a string of tokens from the lexical analyser, and verifies
that the string can be generated by the grammar for the source language.
The parser reports any syntax errors in the source program.

There are three general types of parsers for grammars.


Universal parsing methods such as the Cocke-Younger-Kasami algorithm and Earley’s
algorithm can parse any grammar. These methods are too inefficient to use in production compilers.
The methods commonly used in compilers are classified as either top-down parsing or bottom-up
parsing.
Top-down parsers build parse trees from the top (root) to the bottom (leaves).
Bottom-up parsers build parse trees from the leaves and work up to the root.
In both cases, the input to the parser is scanned from left to right, one symbol at a time.
The output of the parser is some representation of the parse tree for the stream of tokens.
There are a number of tasks that might be conducted during parsing, such as:
 Collecting information about various tokens into the symbol table.
 Performing type checking and other kinds of semantic analysis.
 Generating intermediate code.
Syntax Error Handling:
Planning the error handling right from the start can both simplify the structure of a compiler and
improve its response to errors.
The program can contain errors at many different levels. e.g.,
 Lexical – such as misspelling an identifier, keyword, or operator.
 Syntax – such as an arithmetic expression with unbalanced parenthesis.
 Semantic – such as an operator applied to an incompatible operand.
 Logical – such as an infinitely recursive call.

Much of the error detection and recovery in a compiler is centered on the syntax analysis phase.
One reason for this is that many errors are syntactic in nature or are exposed when the stream of
tokens coming from the lexical analyser disobeys the grammatical rules defining the programming
language.
Another is the precision of modern parsing methods; they can detect the presence of syntactic errors
in programs very efficiently.
The error handler in a parser has simple goals:
 It should report the presence of errors clearly and accurately.
 It should recover from each error quickly enough to be able to detect subsequent errors.
 It should not significantly slow down the processing of correct programs.

Error-Recovery Strategies:
There are many different general strategies that a parser can employ to recover from a syntactic
error.
 Panic mode
 Phrase level
 Error production
 Global correction
Panic mode:
This is used by most parsing methods.
On discovering an error, the parser discards input symbols one at a time until one of a designated set
of synchronizing tokens (delimiters, such as semicolon or end) is found.
Panic mode correction often skips a considerable amount of input without checking it for additional
errors.
It is simple.
Phrase-level recovery:
On discovering an error, the parser may perform local correction on the remaining input; i.e., it may
replace a prefix of the remaining input by some string that allows the parser to continue. e.g., a local
correction would be to replace a comma by a semicolon, delete an extraneous semicolon, or insert
a missing semicolon.
Its major drawback is the difficulty it has in coping with situations in which the actual error has
occurred before the point of detection.
Error productions:
If error productions are used, the parser can generate appropriate error diagnostics to indicate the
erroneous construct that has been recognized in the input.
Global correction:
Given an incorrect input string x and grammar G, the algorithm will find a parse tree for a related

string y, such that the number of insertions, deletions and changes of tokens required to transform x

into y is as small as possible.

Writing Grammars
Grammars are capable of describing the syntax of the programming languages.

Regular Expressions Vs. Context-Free Grammars:


Every construct that can be described by a regular expression can also be described by a grammar.
e.g., for the regular expression (a|b)*abb, the grammar is:
A0 → aA0 | bA0 | aA1
A1 → bA2
A2 → bA3
A3 → ε
which describe the same language, the set of strings of a’s and b’s ending in abb.
Mathematically, the NFA is converted into a grammar that generates the same language as
recognized by the NFA.
The above grammar was constructed using the following constructions:
 For each state i of the NFA, create a nonterminal symbol Ai.
 If state i has a transition to state j on symbol a, introduce the production Ai → aAj.
 If state i goes to state j on input ε, introduce the production Ai → Aj.
 If i is an accepting state, introduce Ai → ε.
 If i is the start state, make Ai be the start symbol of the grammar.
There are several reasons why regular expressions, rather than grammars, are used for the lexical rules of a language:
1. The lexical rules of a language are frequently quite simple. No need of any notation as
powerful as grammars.
2. Regular expressions generally provide a more concise and easier to understand notation for
tokens than grammars.
3. More efficient lexical analysers can be constructed automatically from regular expressions
than from arbitrary grammars.
4. Separating the syntactic structure of a language into lexical and nonlexical parts provides a
convenient way of modularizing the front end of a compiler into two manageable-sized
components.

Eliminating Ambiguity:
Sometimes an ambiguous grammar can be rewritten to eliminate the ambiguity.
As an example, the ambiguity from the following “dangling-else” grammar is eliminated:
stmt → if expr then stmt
| if expr then stmt else stmt
| other (Grammar 2.2.1)
Here other stands for other statement.
According to this grammar, the compound conditional statement
if E1 then S1 else if E2 then S2 else S3 has a single parse tree.
Grammar (2.2.1) is ambiguous since the string: if E1 then if E2 then S1 else S2 has two parse trees.

In all programming languages with conditional statements of this form, the parse tree that matches
each else with the closest unmatched then is preferred.
The general rule is: “Match each else with the closest previous unmatched then.” This disambiguating
rule can be incorporated directly into the grammar,
i.e., by rewriting the grammar (2.2.1) as follows:
stmt → matched_stmt
| unmatched_stmt
matched_stmt → if expr then matched_stmt else matched_stmt
| other
unmatched_stmt → if expr then stmt
| if expr then matched_stmt else unmatched_stmt (Grammar 2.2.2)

Elimination of Left Recursion:


A grammar is left recursive if it has a nonterminal A such that there is a derivation A =>+ Aα for some
string α.
Top-down parsing methods cannot handle left-recursive grammars, so a transformation that
eliminates left recursion is needed.
A left-recursion production of the form A → Aα | β could be replaced by the non-left-recursive
productions as follows:
A → βA`
A` → αA` | ε
Without changing the set of strings derivable from A.

Algorithm 2.2.1 Algorithm to eliminating left recursion from a grammar


Input: Grammar G with no cycles or ε-productions.
Output: An equivalent grammar with no left recursion.
Method: Note that the resulting non-left-recursive grammar may have ε-productions.
1. Arrange the nonterminals in some order A1, A2, …, An.
2. for i := 1 to n do begin
for j := 1 to i-1 do begin
replace each production of the form Ai → Ajγ
by the productions Ai → δ1γ | δ2γ | … | δkγ
where Aj → δ1 | δ2 | … | δk are all the current Aj-productions;
end
eliminate the immediate left recursion among the Ai-productions
end
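For immediate left recursion, the core transformation is mechanical. A minimal Python sketch is given below (the encoding of productions as lists of symbols, and the use of [] for an ε right side, are assumptions of the sketch):

def eliminate_immediate_left_recursion(head, productions):
    # Replace A -> A a1 | ... | A am | b1 | ... | bn   with
    #   A  -> b1 A' | ... | bn A'
    #   A' -> a1 A' | ... | am A' | eps
    recursive = [p[1:] for p in productions if p and p[0] == head]
    others = [p for p in productions if not p or p[0] != head]
    if not recursive:
        return {head: productions}        # nothing to do
    new = head + "'"
    return {
        head: [beta + [new] for beta in others],
        new: [alpha + [new] for alpha in recursive] + [[]],   # [] stands for eps
    }

# E -> E + T | T   becomes   E -> T E',   E' -> + T E' | eps
print(eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]]))
# {'E': [['T', "E'"]], "E'": [['+', 'T', "E'"], []]}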

Exercise 2.2.1
Consider the following grammar for arithmetic expression:
E→E+T|T
T→T*F|F
F → ( E ) | id (Grammar 2.2.3)
Eliminate the immediate left recursion if any in the above grammar.

Exercise 2.2.2

Eliminate the left recursion for the following grammar using the Algorithm 2.2.1
S → Aa | b
A → Ac | Sd | ε (Grammar 2.2.4)

Left Factoring:
Left factoring is a grammar transformation that is useful for producing a grammar suitable for
predictive parsing.
The basic idea is that when it is not clear which of two alternative productions to use to expand a
nonterminal A, we rewrite the A-productions to defer the decision until enough of the input has been
seen to make the right choice.
In general, if the productions are of the form A → αβ1 | αβ2, then they are left factored as:
A → αA`
A` → β1 | β2

Algorithm 2.2.2 Left Factoring a grammar


Input: Grammar G.
Output: An equivalent left-factored grammar.
Method: For each nonterminal A, find the longest prefix α common to two or more of its
alternatives. If α ≠ ε, i.e., there is a nontrivial common prefix, replace all the A-productions A →
αβ1 | αβ2 | … | αβn | γ, where γ represents all alternatives that do not begin with α, by
A → αA` | γ
A` → β1 | β2 | … | βn
Repeatedly apply this transformation until no two alternatives for a nonterminal have common
prefix.
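One left-factoring step can be sketched in a few lines of Python (the list-of-symbols encoding of productions is an assumption carried over from the earlier sketch):

def common_prefix(a, b):
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    return a[:i]

def left_factor_once(head, prods):
    # Find the longest prefix alpha common to two or more alternatives.
    best = []
    for i in range(len(prods)):
        for j in range(i + 1, len(prods)):
            p = common_prefix(prods[i], prods[j])
            if len(p) > len(best):
                best = p
    if not best:
        return {head: prods}                  # no nontrivial common prefix
    new = head + "'"
    factored = [p[len(best):] for p in prods if common_prefix(p, best) == best]
    rest = [p for p in prods if common_prefix(p, best) != best]
    return {head: [best + [new]] + rest, new: factored}

# A -> a b1 | a b2   becomes   A -> a A',   A' -> b1 | b2
print(left_factor_once("A", [["a", "b1"], ["a", "b2"]]))
# {'A': [['a', "A'"]], "A'": [['b1'], ['b2']]}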

Exercise 2.2.3
For the following grammar do the left factoring:
S → iEtS | iEtSeS | a
E→b

Context Free Grammars (CFG or simply grammar)


Many programming language constructs have an inherently recursive structure that can be
defined by context-free grammars. e.g., conditional statement defined by a rule such as;

if S1 and S2 are statements and E is an expression, then


“if E then S1 else S2” is a statement (Grammar 2.3.1)
This form of conditional statement cannot be specified using the notation of regular expressions.
CFG consists of terminals, nonterminals, a start symbol, and productions.
1. Terminals are basic symbols from which strings are formed. e.g., in a programming
language, if, then, and else are terminals.
2. Non-terminals are syntactic variables that denote set of strings.
e.g., in production: stmt → if expr then stmt else stmt, expr and stmt are nonterminals.
The nonterminals define sets of strings that help define the language generated by the grammar.
They also impose hierarchical structure on the language that is useful for both syntax analysis and
translation.
3. In a grammar, one nonterminal is distinguished as the start symbol, and the set of strings it
denotes is the language defined by the grammar.
4. The productions of a grammar specify the manner in which the terminals and nonterminals are
combined to form strings.
Each production consists of a nonterminal, followed by an arrow, followed by a string of
nonterminals and terminals.

Exercise 2.3.1:
In the following grammar find terminals, nonterminals, start symbols, and productions:
expr → expr op expr
expr → ( expr )
expr → - expr
expr → id
op → +
op → -
op → *
op → /
op → ^ (Grammar 2.3.2)

Notational Conventions:
To avoid having to state that “these are terminals” and “these are nonterminals”, the following
notational conventions with regard to grammars are used:
1. These symbols are terminals:
i) lower-case letters early in the alphabet such as a,b,c
ii) operator symbols such as +, -, etc.,
iii) punctuation symbols such as parenthesis, comma, etc.,
iv) boldface strings such as id or if
2. These symbols are nonterminals:
i) upper-case letters early in the alphabet such as A, B, C.
ii) The letter S, when it appears, is usually the start symbol.
iii) Lower-case italic names such as expr or stmt.
3. Upper-case letters late in the alphabet, such as X, Y, Z represent grammar symbols, i.e., either
nonterminals or terminals.
4. Lower-case Greek letters, α, β, γ, for example, represent strings of grammar symbols.
5. If A → α1, A → α2, …, A → αk are all productions with A on the left (A-productions), then we
can write A → α1 | α2 | … | αk (α1, α2, …, αk are the alternatives for A).
e.g., A sample grammar:
E → E A E | ( E ) | - E | id
A→+|-|*|/|^ (Grammar 2.3.3)

Derivations:
Derivation gives a precise description of the top-down constructions of a parse tree.
In derivation, a production is treated as a rewriting rule in which the nonterminal on the left is
replaced by the string on the right side of the production.
e.g. E → E + E | E * E | - E | ( E ) | id (Grammar 2.3.4)
tells us we can replace one instance of an E in any string of grammar symbols by E + E or E * E
or – E or ( E ) or id.
E => - E read as “E derives – E”
Take single E and repeatedly apply productions in any order to obtain a sequence of replacements.
e.g., E => - E => - ( E ) => - ( id )

Such a sequence of replacements is called as derivation of – ( id ) from E.
The symbol => means “derives in one step”.
The symbol =>* means “derives in zero or more steps”.
The symbol =>+ means “derives in one or more steps”.
Given a grammar G with start symbol S, the relation =>+ is used to define L(G), the language generated by G.
Strings in L(G) may contain only terminal symbols of G.
A string of terminals w is in L(G) if and only if S =>+ w. The string w is called a sentence of G.
A language that can be generated by a grammar is said to be context-free language.
If two grammars generate the same language, the grammars are said to be equivalent.
If S =>* α, where α may contain nonterminals, then α is a sentential form of G.
A sentence is a sentential form with no nonterminals.
Example: using the grammar (Grammar 2.3.4), the string – ( id + id ) is a sentence of
grammar(Grammar 2.3.4).
E => - E => - ( E ) => - ( E + E ) => - ( id + E ) => - ( id + id ),
The strings E, - E, - ( E + E ), - ( id + E ), - ( id + id ) appearing in the derivation are all sentential
form of the grammar (Grammar 2.3.4)
This can be written as E =>* – ( id + id ) to indicate that – ( id + id ) can be derived from E.
There are two types of derivations: leftmost derivation and rightmost derivation;
The derivation in which only the leftmost nonterminal in any sentential form is replaced at each
step. Such derivation is called leftmost derivation.
Example: E => - E => - ( E ) => - ( E + E ) => - ( id + E ) => - ( id + id )
If S =>* α by a leftmost derivation, then α is a left-sentential form of the grammar at hand.

The derivation in which only the rightmost nonterminal in any sentential form is replaced at each
step. Such derivation is called rightmost derivation or canonical derivation.
Example: E => - E => - ( E + E ) => - ( E + id ) => - ( id + id )
If S =>* α by a rightmost derivation, then α is a right-sentential form of the grammar at hand.

Parse Trees and Derivations:


A parse tree may be viewed as a graphical representation for a derivation that filters out the choice
regarding replacement order.
Each interior node of a parse tree is labeled by some nonterminal A, and the children of the
node are labeled, from left to right, by the symbols in the right side of the production by which this
A was replaced in the derivation. The leaves of the parse tree are labeled by nonterminals or
terminals and, read from left to right, they constitute a sentential form, called the yield or frontier of
the tree.
Example: the parse tree for – ( id + id ) corresponds to the derivation above.

Ambiguity:
A grammar that produces more than one parse tree for some sentence is said to be ambiguous.
An ambiguous grammar is one that produces more than one leftmost or more than one rightmost
derivation for the same sentence.
Example: from the grammar (Grammar 2.3.4), the sentence id + id * id has two distinct leftmost
derivations.

Top-Down Parsing
 Parsing is the process of determining if a string of tokens can be generated by a grammar.

 For any context-free grammar there is a parser that takes at most O(n^3) time to parse a string of
n tokens.
 Top-down parsers build parse trees from the top (root) to the bottom (leaves).
 Two top-down parsing methods are discussed:
o Recursive Descent Parsing
o An efficient non-backtracking parsing called Predictive Parsing for LL(1) grammars.


2.5 Recursive Descent Parsing


 Recursive descent parsing is a top-down method of syntax analysis in which a set of recursive
procedures is executed to process the input.
 A procedure is associated with each nonterminal of a grammar.
 Top-down parsing can be viewed as an attempt to find a leftmost derivation for an input string.
 Equivalently, it attempts to construct a parse tree for the input starting from the root and
creating the nodes of the parse tree in preorder.
 Recursive descent parsing may involve backtracking.

 Example 2.5.1 (backtracking)


 Consider the grammar
S → cAd
A → ab | a (Grammar 2.5.1)

and the input string w = cad
 To construct a parse tree for this string using top-down approach, initially create a tree
consisting of a single node labeled S.

 An input pointer points to c, the first symbol of w.


 Then use the first production for S to expand the tree and obtain the tree (as in Fig 2.5(a))
 The leftmost leaf, labeled c, matches the first symbol of w.
 Next, advance the input pointer to a, the second symbol of w.
 Consider the next leaf, labeled A.
 Expand A using the first alternative for A to obtain the tree (as in Fig 2.5(b)).
 We now have a match for the second input symbol, so we advance the input pointer to d, the
third input symbol, and compare d against the next leaf, labeled b. Since b does not match d,
we report failure and go back to A to see whether there is another alternative. (Backtracking takes
place.)
 If there is another alternative for A, substitute and compare the input symbol with leaf.
 Repeat the step for all the alternatives of A to find a match using backtracking. If match found,
then the string is accepted by the grammar. Else report failure.
 A left-recursive grammar can cause a recursive-descent parser, even one with backtracking, to
go into an infinite loop.
 As discussed above, an easy way to implement recursive descent parsing with backtracking is
to create a procedure for each nonterminal.
 For the grammar (Grammar 2.5.1) above write the recursive procedure for each nonterminal S
and A.

 Procedure S
procedure S()
begin
if input symbol = ‘c’ then
begin
ADVANCE( );
if A( ) then
if input symbol = ‘d’ then
begin ADVANCE( ); return true end
end;
return false
end
 Procedure A
procedure A( )
begin
isave := input-pointer;
if input symbol = ‘a’ then
begin
ADVANCE( );
if input symbol = ‘b’ then
begin ADVANCE( ); return true end
end
input-pointer := isave;
/* failure to find ab */
if input symbol = ‘a’ then
begin ADVANCE( ); return true end
else return false
end
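The same pair of procedures can be written as a compact, runnable sketch. Below is a minimal Python version for (Grammar 2.5.1), with the backtracking in A made explicit (the function and variable names are assumptions of the sketch):

def parse(w):
    # Backtracking recursive descent for  S -> cAd,  A -> ab | a
    pos = 0
    def match(ch):
        nonlocal pos
        if pos < len(w) and w[pos] == ch:
            pos += 1
            return True
        return False
    def A():
        nonlocal pos
        save = pos                       # remember the input pointer
        if match("a") and match("b"):    # try the first alternative, A -> ab
            return True
        pos = save                       # failure: backtrack
        return match("a")                # then try A -> a
    def S():
        return match("c") and A() and match("d")
    return S() and pos == len(w)

print(parse("cad"))    # True: A -> ab fails, backtrack, A -> a succeeds
print(parse("cabd"))   # True: A -> ab succeeds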
 A parser that uses a set of recursive procedures to recognize its input with no backtracking is
called a recursive descent parser.
 Exercise 2.5.1

o Write a recursive descent parser for (Grammar 2.2.3) after eliminating the left
recursion. (Refer to reference book 1 for the solution, page 181.)

2.6 Predictive Parsing


 A predictive parser is an efficient way of implementing recursive-descent parsing by handling
the stack of activation records explicitly.

 The predictive parser has an input, a stack, a parsing table, and an output.
 The input contains the string to be parsed, followed by $, the right endmarker.
 The stack contains a sequence of grammar symbols, preceded by $, the bottom-of-stack marker.
 Initially the stack contains the start symbol of the grammar preceded by $.
 The parsing table is a two dimensional array M[A,a], where A is a nonterminal, and a is a
terminal or the symbol $.
 The parser is controlled by a program that behaves as follows:
 The program determines X, the symbol on top of the stack, and a, the current input symbol.
 These two symbols determine the action of the parser.
 There are three possibilities:
o 1. If X = a = $, the parser halts and announces successful completion of parsing.
o 2. If X = a ≠ $, the parser pops X off the stack and advances the input pointer to the next input
symbol.

o 3. If X is a nonterminal, the program consults entry M[X,a] of the parsing table M. This entry will
be either an X-production of the grammar or an error entry.
 If M[X,a] = {X → UVW}, the parser replaces X on top of the stack by WVU (with U on top).
 If M[X,a] = error, the parser calls an error recovery routine.
 Construction of Parsing Table:
o Before constructing the parsing table, two functions are computed to fill in the entries of the
table:
 the FIRST( ) and FOLLOW( ) functions.
o These functions will indicate proper entries in the table for a grammar G.
o To compute FIRST(X) for all grammar symbols X, apply the following rules until no more
terminals or ε can be added to any FIRST set.
1. If X is terminal, then FIRST(X) is {X}.
2. If X is nonterminal and X → aα is a production, then add a to FIRST(X). If X → ε is a
production, then add ε to FIRST(X).
3. If X → Y1 Y2 … Yk is a production, then for all i such that all of Y1, … , Yi-1 are nonterminals
and FIRST(Yj) contains ε for j = 1, 2, … , i-1 (i.e., Y1 Y2 … Yi-1 =>* ε), add every non-ε symbol in
FIRST(Yi) to FIRST(X). If ε is in FIRST(Yj) for all j = 1, 2, … , k, then add ε to FIRST(X).
o To compute FOLLOW(A) for all nonterminals A, apply the following rules until nothing can
be added to any FOLLOW set.
1. $ is in FOLLOW(S), where S is the start symbol.
2. If there is a production A → αBβ, β ≠ ε, then everything in FIRST(β) except ε is in FOLLOW(B).
3. If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε (i.e.,
β =>* ε), then everything in FOLLOW(A) is in FOLLOW(B).
o Example 2.6.1
 Consider the following grammar
E→E+T|T
T→T*F|F
F → ( E ) | id (Grammar 2.6.1)
 Compute the FIRST and FOLLOW function for the above grammar.
 Solution:

 Here (Grammar 2.6.1) is left recursive, so eliminating the left recursion from (Grammar
2.6.1) we get:
E → TE`
E` → +TE` | ε
T → FT`
T` → *FT` | ε
F → ( E ) | id (Grammar 2.6.2)
 Then:
FIRST(E) = FIRST(T) = FIRST(F) = {(, id}.
FIRST(E`) = {+,ε}
FIRST(T`) = {*,ε}
FOLLOW(E) = FOLLOW(E`) = {),$}
FOLLOW(T) = FOLLOW(T`) = {+,),$}
FOLLOW(F) = {+,*,),$}
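The FIRST computation above can be mechanized directly from the three rules. A minimal Python sketch follows (the grammar encoding, with E1 and T1 standing for E` and T` and "" standing for ε, is an assumption of the sketch):

def compute_first(grammar, terminals):
    first = {t: {t} for t in terminals}           # rule 1: FIRST(a) = {a}
    first.update({nt: set() for nt in grammar})
    changed = True
    while changed:                                # iterate to a fixed point
        changed = False
        for nt, prods in grammar.items():
            for prod in prods:
                before = len(first[nt])
                if not prod:                      # rule 2: X -> eps
                    first[nt].add("")
                for sym in prod:                  # rule 3
                    first[nt] |= first[sym] - {""}
                    if "" not in first[sym]:
                        break
                else:
                    if prod:                      # every Yi derives eps
                        first[nt].add("")
                if len(first[nt]) != before:
                    changed = True
    return first

g = {                                             # Grammar 2.6.2
    "E":  [["T", "E1"]],
    "E1": [["+", "T", "E1"], []],
    "T":  [["F", "T1"]],
    "T1": [["*", "F", "T1"], []],
    "F":  [["(", "E", ")"], ["id"]],
}
first = compute_first(g, {"+", "*", "(", ")", "id"})
print(sorted(first["E"]))    # ['(', 'id']
print(sorted(first["E1"]))   # ['', '+']   ('' stands for eps)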
o Exercise 2.6.1
 Consider the grammar
S → iCtSS` | a
S` → eS | ε
C→b (Grammar 2.6.3)
 Compute the FIRST and FOLLOW for the (Grammar 2.6.3)
o Construction of Predictive Parsing Table:
 The following algorithm can be used to construct a predictive parsing table for a grammar G
 Algorithm 2.6.1 Constructing a predictive parsing table
Input: Grammar G
Output: Parsing table M
Method:
1. For each production A → α of the grammar, do step 2 and 3.
2. For each terminal a in FIRST(α), add A → α to M[A,a].
3. If ε is in FIRST(α), add A → α to M[A,b] for each terminal b in FOLLOW(A). If ε is in
FIRST(α) and $ is in FOLLOW(A), add A → α to M[A,$].
4. Make each undefined entry of M error.
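Algorithm 2.6.1 transcribes almost line for line into code. A minimal Python sketch is shown below; it reuses the FIRST sets from the sketch above and takes hand-computed FOLLOW sets as input (the tuple-keyed table and the inclusion of $ in FOLLOW are assumptions of the sketch):

def build_predictive_table(grammar, first, follow):
    def first_of(prod):
        # FIRST of a whole right side alpha, from the FIRST sets of its symbols
        out = set()
        for sym in prod:
            f = first.get(sym, {sym})
            out |= f - {""}
            if "" not in f:
                return out
        out.add("")
        return out

    table = {}
    for A, prods in grammar.items():
        for prod in prods:
            fs = first_of(prod)
            for a in fs - {""}:           # step 2: each terminal in FIRST(alpha)
                table[(A, a)] = prod
            if "" in fs:                  # step 3: eps in FIRST(alpha)
                for b in follow[A]:       # FOLLOW(A) is assumed to contain $
                    table[(A, b)] = prod
    return table                          # step 4: absent entries are errors

follow = {"E": {")", "$"}, "E1": {")", "$"}, "T": {"+", ")", "$"},
          "T1": {"+", ")", "$"}, "F": {"+", "*", ")", "$"}}
table = build_predictive_table(g, first, follow)
print(table[("E", "id")])   # ['T', 'E1']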

 Example 2.6.2
 Construct the predictive parsing table for the (Grammar 2.6.2) using the Algorithm 2.6.1

Table 2.6.1 Predictive parsing table for the (Grammar 2.6.2)

 Exercise 2.6.2
 Construct the predictive parsing table for the (Grammar 2.6.3)
 Moves by Predictive parser using the input string
o Predictive parsing program
repeat
begin
let X be the top stack symbol and a the next input symbol;
if X is a terminal or $ then
if X = a then
pop X from the stack and remove a from the input
else
ERROR( )
else /* X is a nonterminal */
if M[X,a] = X → Y1 Y2 … Yk then
begin
pop X from the stack;
push Yk, Yk-1, … ,Y1 onto the stack, Y1 on top
end
else
ERROR( )
end
until
X = $ /* stack has emptied */
 Example 2.6.3

o For the input string id + id * id, do the predictive parsing using (Grammar 2.6.2) and
Table 2.6.1.
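The parsing program above can also be transcribed into a short driver. A minimal Python sketch, reusing the table built by the sketch after Algorithm 2.6.1, is given below (the token list ending in "$" is an assumption of the sketch):

def predictive_parse(table, start, tokens):
    stack = ["$", start]
    i = 0
    while True:
        X, a = stack[-1], tokens[i]
        if X == a == "$":
            return True                            # stack emptied: accept
        if X == a:                                 # terminal on top matches input
            stack.pop()
            i += 1
        elif (X, a) in table:
            stack.pop()
            stack.extend(reversed(table[(X, a)]))  # push Yk..Y1, Y1 on top
        else:
            raise SyntaxError(f"no entry M[{X}, {a}]")

print(predictive_parse(table, "E", ["id", "+", "id", "*", "id", "$"]))  # True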

 LL(1) Grammars:
o When the parsing table contains at least one multiply-defined entry,
the easiest recourse is to transform the grammar by eliminating all left recursion and then
left factoring wherever possible, to produce a grammar for which the parsing table has no
multiply-defined entries.
o A grammar whose parsing table has no multiply-defined entries is said to be LL(1) grammar.
o A grammar is LL(1) if and only if whenever A → α | β are two distinct productions of G the
following conditions hold:
 1. For no terminal a do α and β derive strings beginning with a.
 2. At most one of α and β can derive the empty string.
 3. If β =>* ε, then α does not derive any string beginning with a terminal in FOLLOW(A).
o Exercise 2.6.3: Check whether (Grammar 2.6.2) and (Grammar 2.6.3) are LL(1).
 Error recovery in predictive parsing
o An error is detected during predictive parsing when the terminal on top of the stack does not
match the input symbol, or when a nonterminal A is on top of the stack, a is the next input symbol, and
the parsing table entry M[A,a] is empty.
o Panic-mode error recovery:

 It is based on the idea of skipping symbols on the input until a token in a selected set of
synchronizing tokens appears.
 Its effectiveness depends on the choice of synchronizing set.
 The sets should be chosen so that the parser recovers quickly from errors that are likely to occur in
practice.
o Phrase-level recovery:
 It is implemented by filling in the blank entries in the predictive parsing table with pointers to
error routines.
 These routines may change, insert, or delete symbols on the input and issue appropriate error
messages.
 They may also pop from the stack.
 In any event, we must be sure that there is no possibility of an infinite loop.
 Checking that any recovery action eventually results in an input symbol being consumed is a good
way to protect against such loops.

Bottom-up Parsing & Shift-Reduce Parsing

 Bottom-up parsers build parse trees from the leaves and work up to the root.
 Bottom-up syntax analysis is known as shift-reduce parsing.
 An easy-to-implement shift-reduce parser is called operator precedence parsing.
 General method of shift-reduce parsing is called LR parsing.
 Shift-reduce parsing attempts to construct a parse tree for an input string beginning at the leaves
(the bottom) and working up towards the root (the top).
 At each reduction step a particular substring matching the right side of a production is replaced
by the symbol on the left of that production, and if the substring is chosen correctly at each step,
a rightmost derivation is traced out in reverse.
 Example 2.7.1
o Consider the grammar
S → aABe

A → Abc | b
B→d (Grammar 2.7.1)
The sentence abbcde can be reduced to S by the following steps.
abbcde
aAbcde
aAde
aABe
S
These reductions trace out the following rightmost derivation in reverse.
S => aABe => aAde => aAbcde => abbcde
 Handles:
o A handle of a string is a substring that matches the right side of a production, and whose reduction
to the nonterminal on the left side of the production represents one step along the reverse of a
rightmost derivation.
 Precise definition of a handle:
o A handle of a right-sentential form γ is a production A → β and a position of γ where the string β
may be found and replaced by A to produce the previous right-sentential form in a rightmost
derivation of γ.
o i.e., if S =>* αAw => αβw, then A → β in the position following α is a handle of
αβw.
o The string w to the right of the handle contains only terminal symbols.
o In the example above, abbcde is a right sentential form whose handle is A → b at position 2.
Likewise, aAbcde is a right sentential form whose handle is A → Abc at position 2.
 Handle Pruning:
 A rightmost derivation in reverse can be obtained by handle pruning.
 i.e., start with a string of terminals w that is to be parsed. If w is a sentence of the grammar at hand,
then w = γn, where γn is the nth right sentential form of some as yet unknown rightmost
derivation:
 S = γ0 => γ1 => γ2 => … => γn-1 => γn = w.
Example for right sentential form and handle for grammar

E→E+E
E→E*E
E→(E)
E → id

2.8 Shift Reduce Parsing


 Stack implementation of Shift-reduce parsing:
o There are two problems that must be solved to parse by handle pruning.
 The first is to locate the substring to be reduced in a right sentential form.
 The second is to determine what production to choose in case there is more than one production
with that substring on the right side.
o A further question is what type of data structure to use in a shift-reduce parser.
o Implementation of Shift-Reduce Parser:
 To implement shift-reduce parser, use a stack to hold grammar symbols and an input buffer to
hold the string w to be parsed.
 Use $ to mark the bottom of the stack and also the right end of the input.
 Initially the stack is empty, and the string w is on the input, as follows:
Stack Input
$ w$
 The parser operates by shifting zero or more input symbols onto the stack until a handle β is on
top of the stack.
 The parser then reduces β to the left side of the appropriate production.
 The parser repeats this cycle until it has detected an error or until the stack contains the start
symbol and the input is empty:

Stack Input
$S $
 After entering this configuration, the parser halts and announces successful completion of parsing.
 There are four possible actions that a shift-reduce parser can make: 1) shift 2) reduce 3) accept 4)
error.
1. In a shift action, the next symbol is shifted onto the top of the stack.
2. In a reduce action, the parser knows the right end of the handle is at the top of the stack. It
must then locate the left end of the handle within the stack and decide with what nonterminal to
replace the handle.
3. In an accept action, the parser announces successful completion of parsing.
4. In an error action, the parser discovers that a syntax error has occurred and calls an error
recovery routine.
 Note: an important fact that justifies the use of a stack in shift-reduce parsing: the handle will
always appear on top of the stack, and never inside.
 Example 2.8.1
 Consider the grammar
E→E+E
E→E*E
E→(E)
E → id (Grammar 2.8.1)
and the input string id1 + id2 * id3. Use the shift-reduce parser to check whether the input string is
accepted by the (Grammar 2.8.1)
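One possible sequence of moves is sketched below (resolving each shift/reduce choice in favour of the usual precedence of * over +; other resolutions are possible for this ambiguous grammar):

STACK              INPUT                ACTION
$                  id1 + id2 * id3 $    shift
$ id1              + id2 * id3 $        reduce by E → id
$ E                + id2 * id3 $        shift
$ E +              id2 * id3 $          shift
$ E + id2          * id3 $              reduce by E → id
$ E + E            * id3 $              shift
$ E + E *          id3 $                shift
$ E + E * id3      $                    reduce by E → id
$ E + E * E        $                    reduce by E → E * E
$ E + E            $                    reduce by E → E + E
$ E                $                    accept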

o Viable Prefixes:

 The set of prefixes of right sentential forms that can appear on the stack of a shift-reduce parser
are called viable prefixes.
o Conflicts during shift-reduce parsing:
 There are CFGs for which shift-reduce parsing cannot be used.
 Every shift-reduce parser for such a grammar can reach a configuration in which the parser
cannot decide whether to shift or to reduce (a shift/reduce conflict), or cannot decide which of
several reductions to make (a reduce/reduce conflict), even knowing the entire stack contents and
the next input symbol.
 An example of such a grammar is given below.
 These grammars are not in the LR(k) class of grammars; we refer to them as non-LR grammars.
 The k in the LR(k) grammars refer to the number of symbols of lookahead on the input.
 Grammars used in compiling usually fall in the LR(1) class, with one symbol lookahead.
 An ambiguous grammar can never be LR.
stmt → if expr then stmt
| if expr then stmt else stmt
| other (Grammar 2.8.2)
 In this grammar, a shift/reduce conflict occurs for some input strings.
 So this (Grammar 2.8.2) is not an LR(1) grammar.

Operator Precedence Parsing

 Operator precedence parsing applies to grammars with the property that no production right side is ε or has two
adjacent nonterminals.
 A grammar with this property is called an operator grammar.
 The following grammar for expressions is not an operator grammar:
E → E A E | ( E ) | - E | id
A→+|-|*|/|^ (Grammar 2.9.1)
o Because the right side EAE has two consecutive nonterminals.
 To turn (Grammar 2.9.1) into an operator grammar, substitute for A each of its
alternatives in the production for E:

E → E + E | E – E | E * E | E / E | E ^ E | ( E ) | - E | id (Grammar 2.9.2)
 Operator precedence parsing has a number of disadvantages.
o It is hard to handle tokens like the minus sign, which has two different precedences (depending
on whether it is unary or binary).
o Since the relationship between a grammar for the language being parsed and the operator-
precedence parser itself is weak, one cannot always be sure that the parser accepts exactly the
desired language.
o Only small class of grammars can be parsed using operator-precedence technique.

 How operator-precedence parsing proceeds:


o In operator-precedence parsing, three disjoint precedence relations between certain pairs of
terminals are defined: <., =, and .>
o These precedence relations guide the selection of handles and have the following meaning:
a <. b means a “yields precedence to” b; a = b means a “has the same precedence as” b; and
a .> b means a “takes precedence over” b.

o Using operator-precedence relations:


 The intention of the precedence relations is to delimit the handle of a right sentential form,
with <. marking the left end, = appearing in the interior of the handle, and .> marking the
right end.
 To be more precise: since no two adjacent nonterminals appear on the right side of any production
of an operator grammar, no right sentential form will have two adjacent nonterminals either.
 A right sentential form can then be written as β0 a1 β1 … an βn, where each βi is either ε or a single
nonterminal and each ai is a single terminal.

 For certain pairs of terminals, exactly one of the relations <., =, .> holds. Further, $ is used to mark each
end of the string, and we define $ <. b and b .> $ for all terminals b.
 Example 2.9.1
 Consider the grammar
E → E + E | E * E | ( E ) | id (Grammar 2.9.3)
The corresponding operator precedence relation table is as follows:

Consider the initial right sentential form id + id * id.


 Then the string with the precedence relations inserted is:
$ <. id .> + <. id .> * <. id .> $ (2.9.1)
 The handle can be found by following process:
 1. Scan the string from the left end until the first .> is encountered. In above (2.9.1), this
occurs between the first id and +.
 2. Then scan backwards (to the left) over any =’s until a <. is encountered. In (2.9.1), we scan
backwards to the $ on the left.
 3. The handle contains everything to the left of the first .> and to the right of the
<. encountered in step (2), including any intervening or surrounding nonterminals.
In (2.9.1), the handle is the first (leftmost) id.
 Then from the (Grammar 2.9.3), reduce id to E. i.e., the input string becomes E + id * id.
 Similarly, continue with the above three steps; after the remaining two id’s are reduced, (2.9.1)
becomes E + E * E.
 Since the parser treats nonterminals as placeholders, the terminals alone, $ + * $, determine the
precedence relations for the remaining reductions.
o Algorithm 2.9.1 Operator-Precedence parsing algorithm.

 Input: An input string w and a table of precedence relations.
 Output: If w is well formed, a skeletal parse tree, with a placeholder nonterminal E labeling all
interior nodes; otherwise, an error indication.
 Method: Initially, the stack contains $ and the input buffer the string w$.
(1) set ip to point to the first symbol of w$;
(2) repeat forever
(3) if $ is on top of the stack and ip points to $ then
(4) return
(5) else begin
(6) let a be the topmost terminal symbol on the stack
and let b be the symbol pointed to by ip;
(7) if a <. b or a = b then begin
(8) push b onto the stack;
(9) advance ip to the next input symbol;
end;
(10) else if a .> b then /* reduce */
(11) repeat
(12) pop the stack
(13) until the top stack terminal is related by <.
to the terminal most recently popped
(14) else error( )
end;
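Since nonterminals act only as placeholders, the algorithm can be sketched with a stack of terminals. A minimal Python version follows (the dictionary encoding of the relations, restricted here to id, +, * and $, is an assumption of the sketch):

def op_precedence_parse(rel, tokens):
    # rel maps a pair of terminals to '<', '=' or '>'; tokens ends with '$'
    stack = ["$"]
    i = 0
    while True:
        a, b = stack[-1], tokens[i]
        if a == b == "$":
            return True                    # accept
        r = rel.get((a, b))
        if r in ("<", "="):                # shift b onto the stack
            stack.append(b)
            i += 1
        elif r == ">":                     # reduce: pop back to the matching <.
            top = stack.pop()
            while rel.get((stack[-1], top)) != "<":
                top = stack.pop()
        else:
            raise SyntaxError(f"no relation between {a} and {b}")

R = {("$","id"):"<", ("$","+"):"<", ("$","*"):"<",
     ("id","+"):">", ("id","*"):">", ("id","$"):">",
     ("+","id"):"<", ("+","*"):"<", ("+","+"):">", ("+","$"):">",
     ("*","id"):"<", ("*","+"):">", ("*","*"):">", ("*","$"):">"}
print(op_precedence_parse(R, ["id", "+", "id", "*", "id", "$"]))  # True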
 Operator-Precedence Relations from Associativity and Precedence:
o The rules are designed to select the “proper” handles to reflect a given set of associativity
and precedence rules for binary operators.
 1. If operator θ1 has higher precedence than operator θ2, make θ1 .> θ2 and θ2 <. θ1. These
relations ensure that, in an expression of the form E + E * E + E, the central E * E is the
handle that will be reduced first.
 2. If θ1 and θ2 are operators of equal precedence (they may in fact be the same operator),
then make θ1 .> θ2 and θ2 .> θ1 if the operators are left associative, or make θ1 <. θ2 and
θ2 <. θ1 if the operators are right associative. These relations ensure that E – E + E will
have the handle E – E selected and E ^ E ^ E will have the rightmost E ^ E selected.
 3. Make θ <. id and id .> θ,
θ <. ( and ( <. θ,
) .> θ and θ .> ),
θ .> $ and $ <. θ,
for all operators θ. Also
( = ), $ <. (, $ <. id,
( <. (, id .> $, ) .> $,
( <. id, id .> ), ) .> )
o These rules ensure that both id and (E) will be reduced to E. Also, $ serves as both the left
and right endmarker, causing handles to be found between $’s wherever possible.
o Example 2.9.2
 Consider the grammar
E → E + E | E – E | E * E | E / E | E ^ E | ( E ) | - E | id
(Grammar 2.9.2)
produce the operator-precedence relations for the above grammar, and then, from the resulting
operator-precedence relation table, parse the input id * ( id ^ id ) – id / id.
 Precedence Functions:

o Compilers using operator-precedence parsers need not store the table of precedence
relations.
o In most cases, the table is encoded by two precedence functions f and g that map terminal
symbols to integers.
o We select f and g so that, for symbols a and b:
 f(a) < g(b) whenever a <. b,
 f(a) = g(b) whenever a = b,
 f(a) > g(b) whenever a .> b

o Thus the precedence relation between a and b can be determined by a numerical comparison
between f(a) and g(b).
o Algorithm 2.9.2 Constructing precedence functions:
 Input: An operator precedence matrix.
 Output: Precedence functions representing the input matrix, or an indication that none exist.
 Method:
1. Create symbols fa and ga for each a that is a terminal or $.
2. Partition the created symbols into as many groups as possible, in such a way that if a = b,
then fa and gb are in the same group.
3. Create a directed graph whose nodes are the groups found in (2). For any a and b, if a
<. b, place an edge from the group of gb to the group of fa. If a .> b, place an edge from the
group of fa to that of gb.
Note: an edge or path from fa to gb means that f(a) must exceed g(b); a path
from gb to fa means that g(b) must exceed f(a).
4. If the graph constructed in (3) has a cycle, then no precedence functions exist. If there
are no cycles, let f(a) be the length of the longest path beginning at the group of fa; let g(b)
be the length of the longest path beginning at the group of gb.
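Algorithm 2.9.2 can also be sketched in a few lines of Python (the union-find grouping and the omission of an explicit cycle check are simplifications of the sketch; a cyclic graph would overflow the recursion):

from functools import lru_cache

def precedence_functions(rel, terminals):
    # Steps 1-2: put fa and gb in one group whenever a = b (union-find)
    parent = {}
    def find(x):
        while parent.setdefault(x, x) != x:
            x = parent[x]
        return x
    for (a, b), r in rel.items():
        if r == "=":
            parent[find(("f", a))] = find(("g", b))
    # Step 3: a <. b gives an edge g_b -> f_a; a .> b gives f_a -> g_b
    edges = {}
    for (a, b), r in rel.items():
        if r == "<":
            edges.setdefault(find(("g", b)), set()).add(find(("f", a)))
        elif r == ">":
            edges.setdefault(find(("f", a)), set()).add(find(("g", b)))
    # Step 4: f and g are longest path lengths (assumes the graph is acyclic)
    @lru_cache(maxsize=None)
    def longest(node):
        return max((1 + longest(m) for m in edges.get(node, ())), default=0)
    f = {a: longest(find(("f", a))) for a in terminals}
    g = {b: longest(find(("g", b))) for b in terminals}
    return f, g

# With the relation table R from the parsing sketch above:
# precedence_functions(R, {"id", "+", "*", "$"}) gives
# f(id)=4, f(+)=2, f(*)=4, f($)=0 and g(id)=5, g(+)=1, g(*)=3, g($)=0.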
o Example 2.9.3
 For the operator-precedence relations of (Grammar 2.9.2), the precedence functions are
given as follows:

o Exercise 2.9.1
 Consider the grammar
E → E + E | E * E | ( E ) | id (Grammar 2.9.3)
 The corresponding operator precedence relation table is as follows:
Construct the graph using the Algorithm 2.9.2 and derive the precedence function.

 The graph representing the precedence function is as follows:

 The resulting precedence functions are:

 For example, there is a path of length 5 from gid through f*, g*, f+ and g+ to f$, so the
precedence function value g(id) is 5.
 Error Recovery in operator-precedence parsing:
o There are two points in the parsing process at which an operator-precedence parser can
discover syntactic errors:
 If no precedence relation holds between the terminal on top of the stack and the current
input.
 If a handle has been found, but there is no production with this handle as a right side.
o The error checker reports errors such as:
 Missing operand
 Missing operator
 No expression between parentheses
 Illegal b on line (line containing b)
 Missing d on line (line containing c)
 Missing E on line (line containing b)
o These error diagnostics are issued during the handling of errors during reduction.
o During the handling of shift/reduce errors, the diagnostics issued are:
 Missing operand
 Unbalanced right parenthesis
 Missing right parenthesis

 Missing operator
