Module 3: Introduction, Lexical Analysis
1.1 Language Processors:
Compilers:
A compiler is a program that can read a program in one language (the source language) and
translate it into an equivalent program in another language (the target language).
An important role of the compiler is to report any errors in the source program that it detects
during the translation process.
If the target program is an executable machine-language program, it can then be called by the
user to process inputs and produce outputs.
Interpreters:
An interpreter is another kind of language processor, which directly executes the operations
specified in the source program on inputs supplied by the user.
Note:
The machine-language target program produced by a compiler is usually much faster than an
interpreter at mapping inputs to outputs.
An interpreter can usually give better error diagnostics than a compiler, because it executes the
source program statement by statement.
Hybrid Compilers:
Ex: Java language processors combine compilation and interpretation.
A Java source program is first compiled into an intermediate form called bytecodes.
The bytecodes are then interpreted by a virtual machine. A benefit of this arrangement is
that bytecodes compiled on one machine can be interpreted on another machine, perhaps
across a network.
Source Program → Translator → Target Program
Along with the compiler, several other programs may be required to create an executable target
program. These are: preprocessors, assemblers, linkers and loaders. The language-processing
system is shown below.
Source Program → Preprocessor → Modified Source Program → Compiler → Target Assembly Program
→ Assembler → Relocatable Machine Code → Linker/Loader (together with library files and other
relocatable object files) → Target Machine Code
A source program may be divided into modules stored in separate files. The task of collecting the
source program is sometimes entrusted to a separate program, called a preprocessor.
The preprocessor may also expand shorthands, called macros, into source language statements.
The modified source program is then fed to a compiler. The compiler may produce an assembly-
language program as its output, because assembly language is easier to produce as output and is
easier to debug.
The assembly-language program is then processed by a program called an assembler, which
produces relocatable machine code as its output.
The relocatable machine code may have to be linked together with other relocatable object files
and library files into the code that actually runs on the machine.
The linker resolves external memory addresses, where the code in one file may refer to a location
in another file.
The loader then puts together all of the executable object files into memory for execution.
There are two major parts of a compiler: analysis (the front end) and synthesis (the back end).
The analysis part breaks up the source program into constituent pieces and imposes a
grammatical structure on them. It then uses this structure to create an intermediate representation
of the source program.
If the analysis part detects that the source program is either syntactically ill-formed or
semantically unsound, then it must provide informative messages, so that the user can take
corrective action.
The analysis part also collects information about the source program and stores it in a data
structure called a symbol table, which is passed along with the intermediate representation to the
synthesis part.
The synthesis part constructs the desired target program from the intermediate representation and
the information in the symbol table.
Compilation Process:
The compiler operates as a sequence of phases, each of which transforms one representation of the
source program into another. The phases, all of which use the symbol table, are:

Character Stream → Lexical Analyzer → Token Stream → Syntax Analyzer → Syntax Tree →
Semantic Analyzer → Syntax Tree → Intermediate Code Generator → Intermediate Representation →
Machine-Independent Code Optimizer → Intermediate Representation → Code Generator →
Target-Machine Code → Machine-Dependent Code Optimizer → Target-Machine Code
Ex: Consider the assignment statement position = initial + rate * 60, where all the variables are
real. The translation of this statement through the phases is shown below:
Lexical Analyzer produces the token stream:
< id,1 > < = > < id,2 > < + > < id,3 > < * > < 60 >

Syntax Analyzer builds the syntax tree: the root is =, with < id,1 > as its left child; the right
child is the subtree for < id,2 > + ( < id,3 > * 60 ).

Semantic Analyzer checks the tree and, since the variables are real while 60 is an integer, wraps
60 in an inttofloat node. The SYMBOL TABLE holds the entries for position, initial and rate,
referred to by < id,1 >, < id,2 > and < id,3 >.

Intermediate Code Generator:
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3

Code Optimizer:
t1 = id3 * 60.0
id1 = id2 + t1

Code Generator:
LDF  R2, id3
MULF R2, R2, #60.0
LDF  R1, id2
ADDF R1, R1, R2
STF  id1, R1
Semantic Analysis:
The semantic analyzer uses the syntax tree and the information in the symbol table to check the
source program for semantic consistency with the language definition. An important part of
semantic analysis is type checking; the language may also permit coercions, such as the inttofloat
conversion inserted in the example above.
Intermediate Code Generation:
After syntax and semantic analysis, a compiler may produce an explicit intermediate code
representing the source program.
Intermediate code can take several forms; one common form is three-address code.
Three-address code consists of a sequence of assembly-like instructions with at most three
operands per instruction.
Each instruction has at most one operator on the right side.
Temporary names are generated by the compiler to hold the value computed by a
three-address instruction.
For the assignment statement example, the intermediate code generated in three-address form is
shown above.
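As an illustration (one possible data structure, not a prescribed one), a compiler might store each
three-address instruction as a quadruple: one operator field and up to three operand fields. A
minimal, self-contained C sketch holding the example's intermediate code:

#include <stdio.h>

/* A hypothetical quadruple representation of three-address code:
   at most one operator per instruction, plus source and result operands. */
struct quad {
    const char *op;      /* the single operator */
    const char *arg1;    /* first source operand */
    const char *arg2;    /* second source operand ("" if unused) */
    const char *result;  /* temporary or program variable receiving the value */
};

int main(void) {
    struct quad code[] = {          /* the three-address code shown above */
        { "inttofloat", "60",  "",   "t1"  },
        { "*",          "id3", "t1", "t2"  },
        { "+",          "id2", "t2", "t3"  },
        { "=",          "t3",  "",   "id1" },
    };
    for (int i = 0; i < 4; i++)
        printf("%-10s %-4s %-4s -> %s\n",
               code[i].op, code[i].arg1, code[i].arg2, code[i].result);
    return 0;
}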
Code Optimization (Machine-Independent):
This phase is an intermediate phase between the front end and the back end.
Its purpose is to perform transformations on the intermediate representation so that the back end
can produce a better target program than it otherwise would from an unoptimized intermediate
representation.
That is, this phase attempts to improve the intermediate code so that better target code results.
For the assignment statement example, the code generated after this phase is shown above: the
inttofloat operation is eliminated by replacing the integer 60 with the floating-point number 60.0.
Code Generation:
The code generator takes as input an intermediate representation of the source program and maps
it into the target language.
If the target language is machine code, registers or memory locations are selected for each of the
variables used by the program.
Then the intermediate instructions are translated into sequences of machine instructions that
perform the same task.
A crucial aspect of code generation is the judicious assignment of registers to hold variables.
For the assignment statement example, the target code generated is shown in the diagram.
Grouping of Phases into Passes:
Several phases may be grouped together into a pass that reads an input file and writes an
output file:
The front-end phases of lexical analysis, syntax analysis, semantic analysis, together with
intermediate code generation, might be grouped into one pass.
Code optimization might be an optional pass.
There could then be a back-end pass consisting of code generation for a particular target
machine.
Compiler-Construction Tools:
1. Parser generators: These automatically produce syntax analyzers from a grammatical
description of a programming language.
2. Scanner generators: These produce lexical analyzers from a regular-expression description of
the tokens of a language.
3. Syntax-directed translation engines: These produce collections of routines for walking a parse
tree and generating intermediate code.
4. Code-generator generators: These produce a code generator from a collection of rules for
translating each operation of the intermediate language into the machine language for a target
machine.
5. Data-flow analysis engines: These facilitate the gathering of information about how values are
transmitted from one part of a program to every other part.
Classification of Languages:
1. Based on generation:
a) First-generation languages: machine languages.
b) Second-generation languages: assembly languages.
c) Third-generation languages: higher-level languages, ex: Fortran, Cobol, Lisp, C, C++, C#,
Java.
d) Fourth-generation languages: designed for specific applications, like SQL for database
applications and PostScript for text formatting.
e) Fifth-generation languages: applied to logic- and constraint-based languages like Prolog
and OPS5.
2. Based on how the computation is specified:
Imperative languages: These are languages in which a program specifies how a computation is
to be done.
Eg: C, C++, C#, Java
Declarative languages: These are languages in which a program specifies what computation is
to be done.
Eg: ML, Haskell, Prolog
3. Von Neumann languages: These are languages whose computational model is based on the
von Neumann architecture.
Eg: Fortran, C, etc.
4. Object-oriented languages: These are languages that support object-oriented programming, in
which a program consists of a collection of objects that interact with one another.
Eg: C++, C#, Java
5. Scripting languages: Interpreted languages that make use of high-level operators to perform
computations.
Eg: JavaScript, Perl, PHP, Python, Ruby, Tcl
Impacts on Compilers:
As new architectures and programming languages were evolving, the compiler writers had to
track new language features and had to devise translation algorithms that would take maximal
advantage of the new hardware capabilities.
A compiler by itself is a large program that must translate correctly the potentially infinite set of
programs that could be written in the source language.
A compiler writer must evaluate tradeoffs about what problems to tackle and what heuristics to
use to approach the problem of generating efficient code.
A compiler must accept all source programs that conform to the specification of the language.
Any transformation performed by the compiler while translating a source program must preserve
the meaning of the program being compiled.
The Science of Code Optimization:
Optimization refers to the attempts that a compiler makes to produce code that is more efficient
than the obvious code.
The optimization of code that a compiler performs has become both more important and more
complex:
More complex, because processor architectures have become more complex.
More important, because most parallel computers require optimization; otherwise,
performance degrades.
Compiler optimizations must meet the following design objectives:
1. The optimization must be correct, i.e., preserve the meaning of the compiled program.
2. It must improve the performance of many programs: shorter code, faster execution,
and minimum power consumption.
3. The compilation time must be kept reasonable: a short compilation time supports a rapid
development and debugging cycle.
4. The engineering effort required must be manageable: keeping the system simple assures that
the engineering and maintenance costs of the compiler are manageable.
Parallelism:
All modern microprocessors exploit instruction-level parallelism, and this parallelism can be
hidden from the programmer:
The hardware scheduler dynamically checks for dependencies in the sequential
instruction stream and issues instructions in parallel when possible.
Whether or not the hardware reorders the instructions, compilers can rearrange the
instructions to make instruction-level parallelism more effective.
Memory Hierarchies:
A memory hierarchy consists of several levels of storage with different speeds and sizes.
A processor usually has a small number of registers holding hundreds of bytes, several
levels of caches containing kilobytes to megabytes, and finally secondary storage
containing gigabytes and beyond.
Correspondingly, the speeds of accesses at adjacent levels of the hierarchy can
differ by two or three orders of magnitude.
The performance of a system is often limited not by the speed of the processor but by the
performance of the memory subsystem.
While compilers traditionally focused on optimizing processor execution, more
emphasis is now placed on making the memory hierarchy more effective.
Program Translations:
Although we normally think of compiling as the translation of a high-level language into
machine-level language, the same technology can be applied to translate between different kinds
of languages.
The following are some of the important applications of program-translation techniques:
Binary Translation:
Compiler technology can be used to translate the binary code for one machine into that of another.
Eg: Binary translators have been developed to convert x86 code into both Alpha and Sparc code.
Software Productivity Tools:
Program analysis, building on techniques originally developed to optimize code in compilers,
has improved software productivity in several ways:
Type Checking:
Type checking is an effective technique for catching inconsistencies in programs.
It can be used to catch errors.
Ex: a type mismatch on an object, or parameter types that do not match a procedure's signature.
Through program analysis, i.e., by analyzing the flow of data through a program, further errors
can be detected.
Ex: use of a null pointer.
The technique can also be used to catch a variety of security holes.
Ex: an attacker may supply a "dangerous" string; if this string is not checked properly, there is a
chance that it will influence the control flow of the code at some point in the program.
Bounds Checking:
This technique is mainly used to catch buffer overflows in a program.
Ex: C does not check array bounds; it is up to the programmer to ensure that arrays
are not accessed out of bounds. Failing this, the program may store data outside
the buffer.
Several other techniques perform a similar job.
Ex: data-flow analysis.
Static/Dynamic Distinction:
If a language uses a policy that allows the compiler to decide an issue, then we say that the
language uses a static policy, or that the issue can be decided at compile time.
Conversely, a policy that allows a decision to be made only while the program runs is said to be
dynamic, or to require a decision at run time.
Aliasing:
It is possible that two formal parameters can refer to the same location; such variables are
said to be aliases of one another.
Lexical Analysis:
The lexical analyzer reads the source program character by character, groups the characters into
lexemes, and produces a sequence of tokens, one for each lexeme in the source program.
These tokens are sent to the syntax analyzer.
The interaction between the lexical analyzer and the parser is shown below:

Source Program → Lexical Analyzer --(token)--> Syntax Analyzer → to semantic analysis
(the parser requests each token with getNextToken; both the Lexical Analyzer and the
Syntax Analyzer consult the Symbol Table)
Here the parser issues a getNextToken call, which causes the lexical analyzer to read characters
from its input until it can identify the next lexeme; it produces the corresponding token and
returns it to the parser.
The lexical analyzer also interacts with the symbol table: when it identifies a new identifier
lexeme, it writes it into the table, and it may also read information from the table.
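To make the interface concrete, here is a toy, self-contained C sketch of getNextToken (the token
names, the 64-character limit, and reading from stdin are our own assumptions, not part of a real
compiler's design):

#include <stdio.h>
#include <ctype.h>

enum { ID = 256, NUM, DONE };            /* hypothetical token names */

typedef struct { int name; char lexeme[64]; } Token;

static Token getNextToken(void) {        /* called by the parser */
    Token t = { DONE, "" };
    int c = getchar(), i = 0;
    while (c == ' ' || c == '\t' || c == '\n')   /* strip whitespace */
        c = getchar();
    if (c == EOF) return t;
    if (isalpha(c)) {                    /* identifier: letter (letter|digit)* */
        while (isalnum(c) && i < 63) { t.lexeme[i++] = (char)c; c = getchar(); }
        ungetc(c, stdin);                /* the one-character lookahead is retracted */
        t.name = ID;
    } else if (isdigit(c)) {             /* number: digit+ */
        while (isdigit(c) && i < 63) { t.lexeme[i++] = (char)c; c = getchar(); }
        ungetc(c, stdin);
        t.name = NUM;
    } else {                             /* any other single character */
        t.name = c; t.lexeme[i++] = (char)c;
    }
    t.lexeme[i] = '\0';
    return t;
}

int main(void) {                         /* stands in for the parser's calls */
    for (Token t = getNextToken(); t.name != DONE; t = getNextToken())
        printf("<%d, \"%s\">\n", t.name, t.lexeme);
    return 0;
}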
Some other tasks performed by the lexical analyzer:
Stripping out comments and whitespace:
Normally, the lexical analyzer does not return a comment as a token;
it skips the comment and returns the next (non-comment) token to the parser.
Correlating error messages:
It can associate a line number with each error message.
In some compilers it makes a copy of the source program with the error messages inserted at the
appropriate positions.
If the source program uses a macro preprocessor, the expansion of macros may also be performed
by the lexical analyzer.
Sometimes, lexical analyzers are divided into a cascade of two processes:
Scanning: the simple process that does not require tokenization of the input, such as
deletion of comments and compaction of whitespace.
Lexical analysis proper: the more complex process, wherein the scanner's output is grouped
into a sequence of tokens.
The reasons for separating the analysis phase of compilation into lexical analysis and syntax
analysis are:
1. Simplicity of design: separating the two concerns lets each be simpler; for example, the parser
need not deal with whitespace and comments.
2. Improved compiler efficiency: a separate lexical analyzer can apply specialized techniques,
such as the buffering methods described later, to speed up reading the input.
3. Enhanced compiler portability: input-device-specific peculiarities can be confined to the
lexical analyzer.
Tokens, Patterns, Lexemes:
Token:
A token is a pair consisting of a token name and an optional attribute value.
The token name is an abstract symbol representing a kind of lexical unit.
Ex: keywords, operators, identifiers, constants, literal strings, punctuation symbols (such
as commas and semicolons).
Pattern:
Description of the form that the lexemes of a token may take.
Lexeme:
It is a sequence of characters in the source program that matches the pattern for a token
Ex:
Consider the following statement:
if(a>=b)
printf("total=%d\n",a);
Here the lexemes include if, (, a, >=, b, ), printf, the literal string "total=%d\n", the comma,
and ;. The corresponding tokens include the keyword if, the identifiers a, b and printf, the
relational-operator token relop for >=, the literal string, and the punctuation symbols.
The lexical analyzer returns to the parser not only a token name but also an attribute value that
describes the lexeme represented by the token.
This attribute information is kept in the symbol table for future reference.
Ex: for a token id, the attribute information includes its lexeme, its type, and the location at
which it is first found; thus, the appropriate attribute value here is a pointer to the symbol-table
entry.
Ex: the token names and associated attribute values for the Fortran statement E = M * C ** 2 are:
<id, pointer to symbol-table entry for E>
<assign_op>
<id, pointer to symbol-table entry for M>
<mult_op>
<id, pointer to symbol-table entry for C>
<exp_op>
<number, integer value 2>
Lexical Errors:
If, instead of the statement if(a==b), we mistype fi(a==b), the lexical analyzer will not catch
the mistake, since it treats fi as a valid lexeme (an identifier).
In certain situations the lexical analyzer cannot proceed, because none of the patterns for tokens
matches any prefix of the remaining input. In this case it can adopt a panic-mode error-recovery
strategy.
The possible actions here would be:
Delete successive characters from the remaining input until the lexical analyzer can find a
well-formed token at the beginning of what input is left.
Delete one character from the remaining input.
Insert a missing character into the remaining input.
Replace a character by another character.
Transpose two adjacent characters.
Input Buffering:
To recognize tokens, at least one character beyond the lexeme often has to be read, which incurs
overhead. To minimize this overhead, and thereby speed up reading, special buffering techniques
have been developed.
One such technique is the two-buffer scheme, in which two buffers are alternately reloaded:

E = M * C * * 2 eof
(lexemeBegin marks the start of the current lexeme; forward scans ahead)
Buffer Pairs:
Each buffer is of size N, where N is usually the size of a disk block, ex: 4096 bytes.
One system read command is used to read N characters into a buffer.
If fewer than N characters remain in the input file, then a special character, represented by eof,
marks the end of the source file.
Two pointers into the input are maintained:
1) Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are
attempting to determine.
2) Pointer forward scans ahead until a pattern match is found.
Once the next lexeme is determined, forward is set to the character at its right end.
Then, after the lexeme is recorded as an attribute value of a token returned to the parser,
lexemeBegin is set to the character immediately after the lexeme just found.
Advancing forward requires that we first test whether we have reached the end of one of the
buffers; if so, we must reload the other buffer from the input and move forward to the
beginning of the newly loaded buffer.
Sentinels:
With the scheme above, for each character read we make two tests:
1) for the end of a buffer, and
2) to determine what character was read.
The buffer-end test can be combined with the character test by placing a sentinel character at
the end of each buffer half.
The sentinel is a special character that cannot be part of the source program; eof is used as the
sentinel.
With this arrangement (eof after each buffer half, with forward and lexemeBegin as before), the
lookahead code becomes:
switch(*forward++)
{
case eof:
if (forward is at end of first buffer)
{
reload second buffer;
forward = beginning of second buffer;
}
else if (forward is at end of second buffer)
{
reload first buffer;
forward = beginning of first buffer;
}
else /* eof within a buffer marks the end of input; terminate lexical analysis */
break;
/* cases for the other characters */
}
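The logic above can be made concrete. The following self-contained C sketch (the buffer size,
the helper names, and the use of '\0' as the eof sentinel are our own simplifications) echoes its
input using a buffer pair, with only one test per character in the common case:

#include <stdio.h>

#define N 16                        /* in practice, a disk-block size such as 4096 */
static char buf[2 * N + 2];         /* two halves, each followed by one sentinel slot */
static char *forward = buf;
static FILE *src;

static void fillBuffer(int half)    /* read up to N characters into one half */
{
    char *base = buf + half * (N + 1);
    size_t n = fread(base, 1, N, src);
    base[n] = '\0';                 /* sentinel: end of half, or true end of input */
}

static char nextChar(void)          /* advance forward, reloading on a sentinel */
{
    char c = *forward++;
    if (c != '\0')
        return c;
    if (forward == buf + N + 1) {           /* sentinel at end of first half */
        fillBuffer(1);
        forward = buf + N + 1;
        return nextChar();
    }
    if (forward == buf + 2 * N + 2) {       /* sentinel at end of second half */
        fillBuffer(0);
        forward = buf;
        return nextChar();
    }
    return '\0';                    /* eof within a half: true end of input */
}

int main(void)
{
    src = stdin;
    fillBuffer(0);
    for (char c = nextChar(); c != '\0'; c = nextChar())
        putchar(c);                 /* echo the input, one character at a time */
    return 0;
}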
Regular expressions are an important notation for specifying lexeme patterns. While they cannot
express all possible patterns, they are very effective in specifying those types of patterns that we
actually need for tokens.
Strings and Languages:
An alphabet is any finite set of symbols, such as letters, digits, and punctuation.
o The set {0,1} is the binary alphabet.
o The ASCII character set is an alphabet.
A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
o |s| represents the length of a string s. Ex: college is a string of length 7.
o The empty string ε is the string of length zero.
o The empty string is the identity under concatenation; that is, for any string s, εs = sε = s.
o If x and y are strings, then the concatenation of x and y, denoted xy, is also a string.
For example, if x = hello and y = world, then xy = helloworld.
A language is any countable set of strings over some fixed alphabet.
o Abstract languages such as ∅, the empty set, and {ε}, the set containing only the empty string,
are languages under this definition.
Operations on Languages:
The most important operations performed on languages are:
Union: L ∪ M = { s | s is in L or s is in M }
Concatenation: LM = { st | s is in L and t is in M }
Kleene closure: L* = the set of strings formed by concatenating zero or more strings of L
Positive closure: L+ = the set of strings formed by concatenating one or more strings of L
Example:
Let L be the set of letters {A, B, . . . , Z, a, b, . . . , z } and let D be the set of digits {0,1,.. .9}.
Other languages can be constructed from L and D, using the operators illustrated above :
1. L U D is the set of letters and digits - strictly speaking the language with 62 (52+10) strings
of length one, each of which strings is either one letter or one digit.
2. LD is the set of 520 strings of length two, each consisting of one letter followed by one
digit (52 × 10).
Ex: A1, a1, B0, etc.
3. L4 is the set of all 4-letter strings. (ex: aaba, bcef)
4. L* is the set of all strings of letters, including ε, the empty string.
5. L(L U D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.
Regular Expressions:
Each regular expression r denotes a language L(r), defined recursively as follows.
Basis:
1) ε is a regular expression, and L(ε) = { ε }.
2) If a is a symbol in the alphabet Σ, then a is a regular expression, and L(a) = { a }.
Induction:
Larger regular expressions are built from smaller ones. Let r and s be regular expressions
denoting the languages L(r) and L(s), respectively. Then:
1) (r) | (s) is a regular expression denoting L(r) ∪ L(s).
2) (r)(s) is a regular expression denoting L(r)L(s).
3) (r)* is a regular expression denoting (L(r))*.
4) (r) is a regular expression denoting L(r).
Regular Definitions:
Writing a regular expression for some languages can be difficult, because their regular expressions
can be quite complex. In such cases, we may use regular definitions.
We can give names to regular expressions, and we can use these names as symbols to define other
regular expressions.
A regular definition is a sequence of definitions of the form:
d1 → r1
d2 → r2
...
dn → rn
where each di is a new symbol not in Σ, and each ri is a regular expression over
Σ ∪ {d1, d2, ..., di-1}.
Ex. 1: C identifiers are strings of letters, digits, and underscores. A regular definition for the
language of C identifiers is:
letter_ → A | B | ... | Z | a | b | ... | z | _
digit → 0 | 1 | ... | 9
id → letter_ ( letter_ | digit )*
Ex. 2: Unsigned numbers (integer or floating point) are strings such as 5280, 0.01234, 6.336E4,
or 1.89E-4. The regular definition is:
digit → 0 | 1 | 2 | ... | 9
digits → digit digit*
optionalFraction → . digits | ε
optionalExponent → ( E ( + | - | ε ) digits ) | ε
number → digits optionalFraction optionalExponent
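This regular definition collapses into a single regular expression. As an illustration only, the
following C sketch tests an extended-syntax (POSIX) equivalent of it with the standard <regex.h>
API; note that POSIX notation differs slightly from the textbook's:

#include <regex.h>
#include <stdio.h>

int main(void) {
    /* digits (. digits)? ( E (+|-)? digits )? , anchored to the whole string */
    const char *pattern = "^[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?$";
    const char *tests[] = { "5280", "0.01234", "6.336E4", "1.89E-4", "1.2E", "abc" };
    regex_t re;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0) return 1;
    for (int i = 0; i < 6; i++)
        printf("%-8s %s\n", tests[i],
               regexec(&re, tests[i], 0, NULL, 0) == 0 ? "number" : "no match");
    regfree(&re);
    return 0;
}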
Recognition of Tokens:
Consider the following grammar for branching statements:
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | number
o The terminals of the grammar, which are if, then, else, relop, id, and number, are the names of
tokens as used by the lexical analyzer.
o The pattern description using regular definition for the terminals is
digit → [0-9]
digits → digit+
number → digits (. digits)? (E[+-]? digits)?
letter → [A-Za-z]
id → letter ( letter | digit )*
if → if
then → then
else → else
relop → < | > | <= | >= | = | <>
o The lexical analyzer also has the job of stripping out whitespace, by recognizing the "token" ws
defined by:
ws → ( blank | tab | newline ) +
o The tokens and attribute values for the above grammar are:
Lexemes        Token Name    Attribute Value
any ws         -             -
if             if            -
then           then          -
else           else          -
any id         id            pointer to table entry
any number     number        pointer to table entry
<              relop         LT
<=             relop         LE
=              relop         EQ
<>             relop         NE
>              relop         GT
>=             relop         GE
o Note:
The lexer will be called by the parser when the latter needs a new token. If the lexer then
recognizes the token ws, it does not return it to the parser but instead goes on to recognize the
next token, which is then returned.
We can't have two consecutive ws tokens in the input because, for a given token, the lexer
will match the longest lexeme starting at the current position that yields this token.
For the parser, all the relational operators are to be treated the same so they are all the same
token, relop.
Other parts of the compiler, for example the code generator, will need to distinguish between
the various relational operators so that appropriate code is generated. Hence, they have
distinct attribute values.
Transition diagram:
o A transition diagram is similar to a flowchart for (a part of) the lexer. We draw one for each
possible token. It shows the decisions that must be made based on the input seen.
o The two main components are circles representing states (think of them as decision points of the
lexer) and arrows representing edges (think of them as the decisions made).
o The transition diagram always begins in the start state before any input symbols have been read.
o The accepting states indicate that a lexeme has been found.
o Sometimes it is necessary to retract the forward pointer one position (i.e., the lexeme does not
include the symbol that got us to the accepting state); in that case we additionally place a * near
that accepting state.
o Consider the transition diagram for the relational operators. It is fairly clear how to write code
corresponding to this diagram: look at the first character; if it is <, look at the next character.
If that character is =, return (relop, LE) to the parser.
o If instead that character is >, return (relop, NE).
o If it is any other character, return (relop, LT) and adjust the input buffer so that this
character is read again, since it is not part of the current lexeme.
o If the first character is =, return (relop, EQ).
o There are two ways that we can handle reserved words that look like identifiers.
1. We will continue to assume that the keywords are reserved, i.e., may not be used as
identifiers.
These reserved words are installed into the symbol table prior to any invocation of
the lexer. A field of the symbol-table entry indicates that the entry is a keyword.
When we find an identifier, installID() checks if the lexeme is already in the table. If it is
not present, the lexeme is installed as an id token. In either case a pointer to the entry is
returned.
gettoken() examines the lexeme and returns the token name, either id or a name
corresponding to a reserved keyword.
2. Alternatively, create separate transition diagrams for each keyword.
Note: If we adopt this second approach, then we must prioritize the tokens so that the
reserved-word tokens are recognized in preference to id when a lexeme matches both patterns.
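A minimal, self-contained C sketch of the first scheme (the fixed-size linear table is a
simplification; a real compiler would use a hash table; installID and getToken follow the names
used above, and entries are identified by index rather than pointer):

#include <stdio.h>
#include <string.h>

enum { IF = 256, THEN, ELSE, ID };            /* hypothetical token names */

struct entry { char lexeme[32]; int token; };
static struct entry symtable[100];
static int nEntries = 0;

static int install(const char *lexeme, int token) {
    for (int i = 0; i < nEntries; i++)
        if (strcmp(symtable[i].lexeme, lexeme) == 0)
            return i;                         /* lexeme already in the table */
    strcpy(symtable[nEntries].lexeme, lexeme);
    symtable[nEntries].token = token;
    return nEntries++;
}

static int installID(const char *lexeme) {    /* called when an identifier is found */
    return install(lexeme, ID);
}

static int getToken(int p) {                  /* token name stored for entry p */
    return symtable[p].token;
}

int main(void) {
    /* install the reserved words before any invocation of the lexer */
    install("if", IF);
    install("then", THEN);
    install("else", ELSE);

    int p1 = installID("if");      /* matches the existing keyword entry */
    int p2 = installID("count");   /* a new identifier, installed as id  */
    printf("\"if\"    -> token %d (IF = %d)\n", getToken(p1), IF);
    printf("\"count\" -> token %d (ID = %d)\n", getToken(p2), ID);
    return 0;
}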
o Recognizing Numbers:
o When an accepting state is reached, the identified number is stored in a table of numbers, and a
pointer to the corresponding entry is returned.
o These numbers are needed when code is generated.
o Depending on the source language, we may wish to indicate in the table whether this is a real or
integer. A similar, but more complicated, transition diagram could be produced if the language
permitted complex numbers as well.
o Recognizing Whitespace:
o The delim in the diagram represents any of the whitespace characters, say space, tab, and
newline.
o The final star is there because we needed to find a non-whitespace character in order to know
when the whitespace ends and this character begins the next token.
o There is no action performed at the accepting state. Indeed the lexer does not return to the parser,
but starts again from its beginning as it still must find the next token.
o The idea is that we write a piece of code for each transition diagram; collectively, these pieces
build the lexical analyzer.
o Here, we may imagine a variable state holding the number of the current state for a transition
diagram.
o A switch based on the value of state takes us to code for each of the possible states, where we
find the action of the state.
o Ex: An implementation of the relop transition diagram as a function is sketched below:
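The following is a simplified, self-contained sketch in C (the real lexer's TOKEN type,
nextChar(), retract() and fail() are modeled by hypothetical stand-ins, and the attribute is
returned as a plain int):

#include <stdio.h>
#include <stdlib.h>

enum { LT, LE, EQ, NE, GT, GE };       /* attribute values for relop */

static const char *input = "<= rest";  /* pretend input buffer */
static int forward = 0;                /* forward pointer */

static char nextChar(void) { return input[forward++]; }
static void retract(void)  { forward--; }
static void fail(void)     { puts("not a relop"); exit(1); }

static int getRelop(void)              /* returns the token's attribute value */
{
    int state = 0;                     /* current state of the diagram */
    char c;
    while (1) {
        switch (state) {
        case 0:
            c = nextChar();
            if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else fail();               /* lexeme is not a relop */
            break;
        case 1:
            c = nextChar();
            if (c == '=') state = 2;        /* <= */
            else if (c == '>') state = 3;   /* <> */
            else state = 4;                 /* just <, read one char too many */
            break;
        case 2: return LE;
        case 3: return NE;
        case 4: retract(); return LT;       /* starred state: restore one char */
        case 5: return EQ;
        case 6:
            c = nextChar();
            if (c == '=') state = 7;        /* >= */
            else state = 8;                 /* just >, read one char too many */
            break;
        case 7: return GE;
        case 8: retract(); return GT;       /* starred state: restore one char */
        }
    }
}

int main(void) {
    printf("attribute = %d (LE expected = %d)\n", getRelop(), LE);
    return 0;
}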
This piece of code contains a case for each state, which typically reads a character and then
goes to the next case depending on the character read.
The numbers in the circles are the names of the cases.
Accepting states often need to take some action and return to the parser. Many of these
accepting states (the ones with stars) need to restore one character of input. This is called
retract() in the code.
What should the code for a particular diagram do if at one state the character read is not one
of those for which a next state has been defined? That is, what if the character read is not the
label of any of the outgoing arcs? This means that we have failed to find the token
corresponding to this diagram. The code in this case calls fail().
This is not an error case. It simply means that the current input does not match
this particular token. So we need to go to the code section for another diagram
after restoring the input pointer so that we start the next diagram at the point
where this failing diagram started.
If we have tried all the diagrams, then we have a real failure and need to print an
error message and perhaps try to repair the input.
The transition diagrams can be tried in several orders:
1. Use the diagrams one at a time, in a fixed (sequential) order. If the input matches more than
one token, the first diagram tried determines the choice.
For example, the code for the keywords must be tried before the code for identifiers.
2. Run the various transition diagrams in parallel: each character read is passed to every
diagram that hasn't already failed. Care is needed when one diagram has accepted the input
but others haven't failed yet; the usual strategy is to prefer the longest matching prefix of
the input.
Ex: prefer the identifier thenext to the keyword then.
3. The preferred approach is to combine all the diagrams into one. In our example this is easy,
because no two diagrams begin with the same matched character; we then have one large start
state with multiple outgoing edges. In general, however, the problem of combining transition
diagrams for several tokens is more complex.
Transition Tables
o We can also represent an NFA by a transition table, whose rows correspond to states, and whose
columns correspond to the input symbols and ɛ.
o The entry for a given state and input is the value of the transition function applied to those
arguments. If the transition function has no information about that state-input pair, we put ɸ in the
table for the pair.
o Example: Consider the classic NFA that recognizes (a|b)*abb, with states 0 (start), 1, 2, and 3
(accepting). Its transition table is:

STATE    a        b        ɛ
0        {0,1}    {0}      ∅
1        ∅        {2}      ∅
2        ∅        {3}      ∅
3        ∅        ∅        ∅
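A transition table maps naturally onto a two-dimensional array. The following self-contained C
sketch (the bitmask encoding of state sets is our own illustration) stores the table above and
simulates the NFA on sample inputs:

#include <stdio.h>

enum { A, B, NSYM };                    /* input symbols a and b; no eps-moves here */
#define STATES 4

/* next[s][sym] is the set of next states, encoded as a bitmask; 0 = empty set */
static const unsigned next[STATES][NSYM] = {
    { 1u<<0 | 1u<<1, 1u<<0 },   /* state 0: a -> {0,1},  b -> {0} */
    { 0,             1u<<2 },   /* state 1: b -> {2} */
    { 0,             1u<<3 },   /* state 2: b -> {3} */
    { 0,             0     },   /* state 3: accepting, no moves */
};

static int accepts(const char *w) {
    unsigned cur = 1u << 0;                 /* start in the set {0} */
    for (; *w; w++) {
        int sym = (*w == 'a') ? A : B;
        unsigned nxt = 0;
        for (int s = 0; s < STATES; s++)    /* union the moves of all current states */
            if (cur & (1u << s)) nxt |= next[s][sym];
        cur = nxt;
    }
    return (cur & (1u << 3)) != 0;          /* accept iff state 3 was reached */
}

int main(void) {
    printf("aabb  -> %s\n", accepts("aabb")  ? "accept" : "reject");
    printf("babba -> %s\n", accepts("babba") ? "accept" : "reject");
    return 0;
}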
Module 4: Syntax Analysis
4.1 Introduction:
The syntax analyzer obtains a string of tokens from the lexical analyzer, and verifies that
the string of token names can be generated by the grammar for the source language.
i.e., the syntax analyzer creates the syntactic structure of the given source program. This
syntactic structure is most often a parse tree.
Thus the syntax analyzer is also known as the parser.
The syntax of a programming language is described by a context-free grammar (CFG).
The syntax analyzer checks whether a given source program satisfies the rules implied by
a context-free grammar or not.
If the program satisfies the rules, the parser creates its parse tree.
Otherwise, the parser issues error messages.
It then passes the parse tree to the rest of the compiler for further processing.
A context-free grammar
Gives a precise, easy-to-understand, syntactic specification of a programming
language.
Can be used to construct an efficient parser that determines the syntactic
structure of a source program. The parser-construction process can reveal syntactic
ambiguities and trouble spots that might otherwise have gone unnoticed in the initial
design phase of a language.
Useful for translating source programs into correct object code and for detecting errors.
Allows a language to be evolved or developed iteratively, by adding new constructs to
perform new tasks.
Source Program → Lexical Analyzer --(token)--> Syntax Analyzer → Parse Tree → Rest of Front End
→ Intermediate Representation
(the parser requests tokens with getNextToken; the analyzers and the rest of the front end all
consult the Symbol Table)
Parsers are commonly categorized into three groups:
1) universal: can parse any grammar, but are too inefficient to use in production compilers.
2) top-down: build parse trees from the root (top) and work down to the leaves.
3) bottom-up: build parse trees from the leaves and work their way up to the root.
The methods commonly used in compilers are the top-down and bottom-up ones.
Both top-down and bottom-up parsers scan the input from left to right (one symbol at a time).
Efficient top-down and bottom-up parsers can be implemented only for subclasses of CFGs:
LL grammars for top-down parsing
LR grammars for bottom-up parsing
Representative grammars: throughout this module we use the expression grammar
E → E + T | T
T → T * F | F
F → ( E ) | id
together with its ambiguous variant E → E + E | E * E | ( E ) | id.
Error-Recovery Strategies:
1. Panic-Mode Recovery:
o In this method, on discovering an error, the parser discards input symbols one at a time
until one of a designated set of synchronizing tokens is found.
o Synchronizing tokens are usually delimiters.
Ex: } or ;, whose role in the source program is clear and unambiguous.
o Advantage:
Simple method
Is guaranteed not to go into an infinite loop
o Disadvantage:
It often skips a considerable amount of input without checking it for additional errors.
Careful selection of the synchronizing tokens is required.
2. Phrase-Level Recovery:
o In this method, a parser may perform local correction on the remaining input, i.e., it may
replace a prefix of the remaining input by some string that allows the parser to continue.
o Ex: replace a comma by a semicolon, insert a missing semicolon
o Advantage:
It is used in several error-repairing compilers, as it can correct any input string.
o Disadvantage:
Difficulty in coping with situations in which the actual error occurred before
the point of detection.
Also, this method is not guaranteed not to go into an infinite loop.
3. Error Productions:
o Augment the grammar for the language with productions that would generate the erroneous
constructs.
o Then use this grammar augmented by the error productions to construct a parser.
o If an error production is used by the parser, we can generate appropriate error diagnostics
to indicate the erroneous construct that has been recognized in the input.
4. Global Correction:
o We use algorithms that perform a minimal sequence of changes to obtain a globally
least-cost correction.
o Given an incorrect input string x and grammar G, these algorithms will find a parse tree
for a related string y such that the number of insertions, deletions and changes of tokens
required to transform x into y is as small as possible.
o It is too costly to implement in terms of time and space, so these techniques are currently
only of theoretical interest.
Notational Conventions:
1. Lowercase letters early in the alphabet (a, b, c), operator symbols, punctuation symbols,
digits, and boldface strings such as id are terminals.
2. Uppercase letters early in the alphabet (A, B, C) and the start symbol S are nonterminals.
3. Lowercase Greek letters such as α, β, γ represent (possibly empty) strings of grammar symbols.
4. X, Y, Z represent grammar symbols(Terminal or Nonterminal)
5. u,v,…,z represent strings of terminals.
6. A → α1, A → α2, A → α3 can be written as A → α1 | α2 | α3
Using the conventions listed above, the grammar for arithmetic expressions can be written as:
E -> E+T | E-T | T
T -> T*F | T/F | F
F -> ( E ) | id
Derivations:
Consider the following grammar,
E -> E+E | E*E | -E | (E) | id
For example, E ⇒ E+E ⇒ id+E ⇒ id+id. Since multiple derivation steps are involved, this can
also be written as E ⇒* id+id.
If we always choose the left-most non-terminal in each derivation step, this derivation is
called as left-most derivation(LMD)
Ex: E => E+E
=> id+E
=> id+id
If we always choose the right-most non-terminal in each derivation step, this derivation is
called as right-most derivation.(RMD)
Ex: E => E+E
=> E+id
=> id+id
If S ⇒* α, where S is the start symbol of a grammar G, we say that α is a sentential form
of G. A sentential form may contain both terminals and nonterminals, and may be empty.
Eg: In the above example, the sentential forms are E+E and E+id.
The sentential forms obtained in an LMD are called left sentential forms, whereas the
sentential forms obtained in an RMD are called right sentential forms.
Ambiguity:
A grammar that produces more than one parse tree for some sentence is called an ambiguous
grammar.
Ex: The grammar E → E+E | E*E | (E) | id is ambiguous, since the sentence id+id*id has two
distinct leftmost derivations (and hence two parse trees):
E ⇒ E+E ⇒ id+E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id
E ⇒ E*E ⇒ E+E*E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id
Grammars are capable of describing most of the syntax of programming languages. The sequences
of tokens accepted by a parser form a superset of the programming language; subsequent phases
of the compiler must analyze the output of the parser to ensure compliance with rules that are not
checked by the parser.
Regular expressions are most useful for describing the structure of constructs such as
identifiers, constants, keywords, and white space.
Grammars, on the other hand, are most useful for describing nested structures such as balanced
parentheses, matching begin-end's, corresponding if-then-else's, and so on. These nested
structures cannot be described by regular expressions.
Here "other" stands for any other statement. According to this grammar, the compound
conditional statement “ if E1 then S1 else if E2 then S2 else S3 “ has the following parse tree:
In all programming languages with conditional statements of this form, the second parse tree is
preferred. The general rule is, "Match each else with the closest unmatched then."
We can rewrite the dangling-else grammar as the following unambiguous grammar.
The idea is that a statement appearing between a then and an else must be "matched"; that
is, the interior statement must not end with an unmatched, or open, then.
A matched statement is either an if-then-else statement containing no open statements, or
it is any other kind of unconditional statement.
Thus, we may use the following grammar, which allows only one parse for each string, namely
the one that associates each else with the closest previous unmatched then:
stmt → matched_stmt | open_stmt
matched_stmt → if expr then matched_stmt else matched_stmt | other
open_stmt → if expr then stmt | if expr then matched_stmt else open_stmt
E -> E+T | T
T -> T*F | F
F -> G^F | G
G -> id | (E)
Elimination of Left Recursion:
A grammar is left recursive if it has a nonterminal A such that there is a derivation A ⇒+ Aα
for some string α. Immediate left recursion can be eliminated by the following technique, which
works for any number of A-productions. Group the productions as
A → Aα1 | Aα2 | ... | Aαm | β1 | β2 | ... | βn
where no βi begins with an A, and replace them by
A → β1A' | β2A' | ... | βnA'
A' → α1A' | α2A' | ... | αmA' | ε
This procedure eliminates all left recursion from the A and A' productions (provided no αi
is ε).
But the above procedure does not eliminate left recursion involving derivations of two or
more steps.
For example, consider the grammar
S -> Aa | b
A -> Ac | Sd | ɛ
The nonterminal S is left recursive because S ⇒ Aa ⇒ Sda, but it is not immediately left
recursive.
The algorithm below systematically eliminates left recursion from a grammar.
It is guaranteed to work if:
the grammar has no cycles (derivations of the form A ⇒+ A), and
the grammar has no ε-productions (productions of the form A → ε).
The algorithm: arrange the nonterminals in some order A1, A2, ..., An; for each i, replace each
production of the form Ai → Ajγ (with j < i) by Ai → δ1γ | δ2γ | ..., where Aj → δ1 | δ2 | ...
are the current Aj-productions, and then eliminate the immediate left recursion among the
Ai-productions.
Example:
Consider the grammar
S -> Aa | b
A -> Ac | Sd | ɛ
Order the nonterminals S, A. Nothing is done for S. For A, substitute the S-productions into
A → Sd to obtain A → Ac | Aad | bd | ε; then eliminate the immediate left recursion:
S → Aa | b
A → bdA' | A'
A' → cA' | adA' | ε
Left Factoring
Left factoring is a grammar transformation that is useful for producing a grammar suitable
for predictive, or top-down, parsing.
When the choice between two alternative A-productions is not clear, we may be able to rewrite
the productions to defer the decision until enough of the input has been seen that we can make
the right choice.
For example, with the two productions stmt → if expr then stmt else stmt | if expr then stmt,
on seeing the input token if, we cannot immediately tell which production to choose to expand stmt.
In general, if A → αβ1 | αβ2 are two A-productions, and the input begins with a nonempty
string derived from α, we do not know whether to expand A to αβ1 or to αβ2.
However, we may defer the decision by left factoring, so that the original productions become
A → αA'
A' → β1 | β2
Example: Left factor the following grammar, which abstracts the "dangling-else" problem:
S -> iEtS | iEtSeS | a
E -> b
Here, i, t, and e stand for if, then, and else; E and S stand for "conditional
expression" and "statement." Left-factored, the grammar becomes:
S → iEtSS' | a
S' → eS | ε
E → b
Non-Context-Free Language Constructs:
Example 1:
o The language in this example abstracts the problem of checking that identifiers are
declared before they are used in a program.
o The language consists of strings of the form wcw, where the first w represents the
declaration of an identifier w, c represents an intervening program fragment, and the
second w represents the use of the identifier.
o The abstract language is L = { wcw | w is in (a|b)* }.
o L consists of all words composed of a repeated string of a's and b's separated by c, such
as aabcaab.
o The noncontext-freedom of L directly implies the non-context-freedom of
programming languages like C and Java, which require declaration of identifiers before
their use and which allow identifiers of arbitrary length. For this reason, a grammar for
C or Java does not distinguish among identifiers that are different character strings.
Instead, all identifiers are represented by a token such as id in the grammar. In a
compiler for such a language, the semantic-analysis phase checks that identifiers are
declared before they are used.
Example 2 :
o The problem of checking that the number of formal parameters in the declaration of a
function agrees with the number of actual parameters in a use of the function.
o The language consists of strings of the form a^n b^m c^n d^m. Here a^n and b^m could represent
the formal-parameter lists of two functions declared to have n and m arguments, respectively,
while c^n and d^m represent the actual-parameter lists in calls to these two functions.
o The abstract language is L2 = { a^n b^m c^n d^m | n ≥ 1 and m ≥ 1 }. That is, L2 consists of
strings in the language generated by the regular expression a*b*c*d* such that the number of a's
and c's are equal and the number of b's and d's are equal.
o This language is not context free.
o The typical syntax of function declarations and uses does not concern itself with counting
the number of parameters. For example, a function call in C-like language might be
specified by
Stmt -> id ( expr_list)
expr_list -> expr_list , expr | expr
with suitable productions for expr. Checking that the number of parameters in a call is
correct is usually done during the semantic-analysis phase.
Exercises:
2) S → S S + | S S * | a
3) S → 0 S 1 | 0 1
4) S → ( L ) | a
L → L , S | S
Recursive-Descent Parsing:
A recursive-descent parsing program consists of a set of procedures, one for each nonterminal.
Execution begins with the procedure for the start symbol, which halts and announces success if
its procedure body scans the entire input string. General recursive-descent parsing may require
backtracking, that is, repeated scans over the input.
void A( ) {
1) Choose an A-production, A → X1X2 ... Xk;
2) for ( i = 1 to k ) {
3) if ( Xi is a nonterminal )
4) call procedure Xi( );
5) else if ( Xi equals the current input symbol a )
6) advance the input to the next symbol;
7) else /* an error has occurred */;
}
}
Figure: A typical procedure for a nonterminal in a top-down parser
Example: Consider the grammar
S -> cAd
A -> ab | a
To construct a parse tree top-down for the input string w = cad, begin with a tree
consisting of a single node labeled S, and the input pointer pointing to c, the first
symbol of w.
S has only one production, so we use it to expand S and obtain the tree of Fig.
4.14(a).
The leftmost leaf, labeled c, matches the first symbol of input w, so we advance the
input pointer to a, the second symbol of w, and consider the next leaf, labeled A.
Now, we expand A using the first alternative A -> ab to obtain the tree of Fig.
4.14(b). We have a match for the second input symbol, a, so we advance the input
pointer to d, the third input symbol, and compare d against the next leaf, labeled b.
Since b does not match d, we report failure and go back to A to see whether there
is another alternative for A that has not been tried, but that might produce a match.
In going back to A, we must reset the input pointer to position 2, the position it had
when we first came to A, which means that the procedure for A must store the input
pointer in a local variable. The second alternative for A produces the tree of Fig.
4.14(c).
The leaf a matches the second symbol of w and the leaf d matches the third symbol.
Since we have produced a parse tree for w, we halt and announce successful
completion of parsing.
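The walkthrough above translates directly into code. A self-contained C sketch (the helper names
are our own), with the backtracking in A made explicit:

#include <stdio.h>

/* A minimal backtracking recursive-descent parser for
   S -> c A d
   A -> a b | a                                        */

static const char *input;   /* the string being parsed */
static int pos;             /* the input pointer */

static int match(char c) {          /* consume c if it is the next input symbol */
    if (input[pos] == c) { pos++; return 1; }
    return 0;
}

static int A(void) {
    int saved = pos;                          /* remember the input pointer */
    if (match('a') && match('b')) return 1;   /* try the first alternative A -> a b */
    pos = saved;                              /* backtrack: reset the input pointer */
    return match('a');                        /* try the second alternative A -> a  */
}

static int S(void) {
    return match('c') && A() && match('d');
}

int main(void) {
    input = "cad"; pos = 0;
    if (S() && input[pos] == '\0')
        printf("parse succeeded\n");
    else
        printf("parse failed\n");
    return 0;
}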
FIRST and FOLLOW:
FIRST(α) is defined as the set of terminals that begin strings derived from α, where α is any
string of grammar symbols. If α ⇒* ε, then ε is also in FIRST(α).
To compute FIRST(X) for all grammar symbols X, apply the following rules until no more
terminals or ε can be added to any FIRST set:
1. If X is a terminal, then FIRST(X) = { X }.
2. If X is a nonterminal and X → Y1Y2 ... Yk is a production for some k ≥ 1, then
place a in FIRST(X) if, for some i, a is in FIRST(Yi) and ε is in all of FIRST(Y1), ...,
FIRST(Yi-1); that is, Y1 ... Yi-1 ⇒* ε.
If ε is in FIRST(Yj) for all j = 1, 2, ..., k, then add ε to FIRST(X).
3. If X → ε is a production, then add ε to FIRST(X).
To compute FOLLOW(A) for all nonterminals A, apply the following rules until nothing
can be added to any FOLLOW set:
1. Place $ in FOLLOW(S), where S is the start symbol and $ is the input right endmarker.
2. If there is a production A → αBβ, then everything in FIRST(β) except ε is in FOLLOW(B).
3. If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε,
then everything in FOLLOW(A) is in FOLLOW(B).
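The FIRST rules amount to a fixed-point iteration over the productions. As an illustration, the
following self-contained C sketch (the string encoding of productions and the use of '#' for ε
are our own assumptions) computes FIRST for the expression grammar E → TE', E' → +TE' | ε,
T → FT', T' → *FT' | ε, F → (E) | id used in the examples below, with E' and T' written as
e and t and id written as i:

#include <stdio.h>
#include <string.h>

#define NT 5
static const char nts[NT] = { 'E', 'e', 'T', 't', 'F' };
static const char *prods[][2] = {   /* { head, body }; "" stands for an eps body */
    {"E","Te"}, {"e","+Te"}, {"e",""},
    {"T","Ft"}, {"t","*Ft"}, {"t",""},
    {"F","(E)"}, {"F","i"},
};
static const int NP = sizeof prods / sizeof prods[0];

static char first[NT][16];          /* FIRST set of each nonterminal; '#' = eps */

static int ntIndex(char c) {        /* index of a nonterminal, or -1 for terminals */
    for (int i = 0; i < NT; i++) if (nts[i] == c) return i;
    return -1;
}
static int addSym(char *set, char c) {   /* add c to a set; return 1 if it was new */
    if (strchr(set, c)) return 0;
    size_t n = strlen(set); set[n] = c; set[n+1] = '\0'; return 1;
}

int main(void) {
    int changed = 1;
    while (changed) {               /* iterate the rules until a fixed point */
        changed = 0;
        for (int p = 0; p < NP; p++) {
            int A = ntIndex(prods[p][0][0]);
            const char *body = prods[p][1];
            int allEps = 1;         /* can the body so far derive eps? */
            for (int i = 0; body[i] && allEps; i++) {
                int B = ntIndex(body[i]);
                allEps = 0;
                if (B < 0) {        /* a terminal begins some derived string */
                    changed |= addSym(first[A], body[i]);
                } else {            /* copy FIRST(B) minus eps into FIRST(A) */
                    for (const char *s = first[B]; *s; s++)
                        if (*s != '#') changed |= addSym(first[A], *s);
                    if (strchr(first[B], '#')) allEps = 1;  /* B may vanish */
                }
            }
            if (allEps)             /* the whole body can derive eps */
                changed |= addSym(first[A], '#');
        }
    }
    for (int i = 0; i < NT; i++)
        printf("FIRST(%c) = { %s }\n", nts[i], first[i]);
    return 0;
}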
Transition Diagrams for Predictive Parsers:
To construct the transition diagram for a nonterminal A:
1. Create an initial and a final (return) state.
2. For each production A → X1X2 ... Xk, create a path from the initial to the final state, with
edges labeled X1, X2, ..., Xk. If A → ε, the path is a single edge labeled ε.
Transition diagrams for predictive parsers have one diagram for each nonterminal. The
labels of edges can be tokens or nonterminals.
A transition on a token (terminal) means that we take that transition if that token is the next
input symbol.
A transition on a nonterminal A is a call of the procedure for A.
Example: Transition diagrams for non terminals E and E'
LL(1) Grammars
Predictive parsers, that is, recursive-descent parsers needing no backtracking, can be
constructed for a class of grammars called LL(1).
The first "L" in LL(1) stands for scanning the input from left to right, the second "L" for
producing a leftmost derivation, and the " 1 " for using one input symbol of lookahead at
each step to make parsing action decisions.
No left-recursive or ambiguous grammar can be LL(1).
A grammar G is LL(1) if and only if whenever A → α | β are two distinct productions of
G, the following conditions hold:
1. For no terminal a do both α and β derive strings beginning with a.
2. At most one of α and β can derive the empty string.
3. If β ⇒* ε, then α does not derive any string beginning with a terminal in FOLLOW(A).
Likewise, if α ⇒* ε, then β does not derive any string beginning with a terminal in
FOLLOW(A).
The first two conditions are equivalent to the statement that FIRST(α) and FIRST(β) are
disjoint sets.
The third condition is equivalent to stating that if ε is in FIRST(β), then FIRST(α) and
FOLLOW(A) are disjoint sets, and likewise if ε is in FIRST(α).
Predictive parsers can be constructed for LL(1) grammars since the proper production to apply
for a nonterminal can be selected by looking only at the current input symbol.
Construction of a Predictive Parsing Table:
For each production A → α of the grammar, do the following:
1. For each terminal a in FIRST(α), add A → α to M[A, a].
2. If ε is in FIRST(α), then for each terminal b in FOLLOW(A), add A → α to M[A, b];
if ε is in FIRST(α) and $ is in FOLLOW(A), add A → α to M[A, $] as well.
If, after performing the above, there is no production at all in M[A, a], then set M[A, a] to
error (which we normally represent by an empty entry in the table).
Example: Consider the expression grammar
E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → ( E ) | id
a) FIRST sets:
FIRST(E) = FIRST(T) = FIRST(F) = { (, id }
FIRST(E') = { +, ɛ }
FIRST(T') = { *, ɛ }
FOLLOW sets:
FOLLOW (E) = {$, ) }
FOLLOW (E') = {$, ) }
FOLLOW (T) = {+, ), $}
FOLLOW (T') = {+, ), $}
FOLLOW (F) = {+, *, ), $}
b) Parsing Table M:
M[E, id] = E → TE'        M[E, (] = E → TE'
M[E', +] = E' → +TE'      M[E', )] = E' → ε      M[E', $] = E' → ε
M[T, id] = T → FT'        M[T, (] = T → FT'
M[T', *] = T' → *FT'      M[T', +] = T' → ε      M[T', )] = T' → ε      M[T', $] = T' → ε
M[F, id] = F → id         M[F, (] = F → ( E )
(all other entries are error and are left blank)
The construction of predictive parsing table Algorithm can be applied to any grammar G
to produce a parsing table M.
For every LL(1) grammar, each parsing-table entry uniquely identifies a production or
signals an error.
For some grammars, however, M may have some entries that are multiply defined. For
example, if G is left-recursive or ambiguous, then M will have at least one multiply defined
entry.
Although left recursion elimination and left factoring are easy to do, there are some
grammars for which no amount of alteration will produce an LL(1) grammar.
Example: Consider the left-factored dangling-else grammar:
S -> iEtSS' | a
S' -> eS | 𝜖
E -> b
FIRST(S) = { i, a }
FIRST(S') = { e, 𝜖 }
FIRST(E) = { b }
FOLLOW(S) = { e, $ }
FOLLOW(S') = { e, $ }
FOLLOW(E) = { t }
The entry M[S', e] contains both S' → eS and S' → ε.
Therefore, the grammar is not LL(1).
The grammar is ambiguous and the ambiguity is manifested by a choice in what
production to use when an e (else) is seen.
We can resolve this ambiguity by choosing S'->eS. This choice corresponds to
associating an else with the closest previous then.
Nonrecursive Predictive Parsing:
The parser has an input buffer, a stack containing a sequence of grammar symbols, a
parsing table constructed by the table-construction algorithm above, and an output stream.
The input buffer contains the string to be parsed, followed by the endmarker $. We reuse
the symbol $ to mark the bottom of the stack, which initially contains the start symbol of
the grammar on top of $.
The parser is controlled by a program that considers X, the symbol on top of the stack, and
a, the current input symbol.
If X is a nonterminal, the parser chooses an X-production by consulting entry M[X, a]
of the parsing table M.
Otherwise, it checks for a match between the terminal X and current input symbol a.
The behavior of the parser can be described in terms of its configurations, which give the
stack contents and the remaining input.
METHOD: Initially, the parser is in a configuration with w$ in the input buffer and the start
symbol S of G on top of the stack, above $. The program in above figure uses the predictive parsing
table M to produce a predictive parse for the input.
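As a concrete illustration of this method, here is a self-contained C sketch of a table-driven
predictive parser for the expression grammar, with E' and T' encoded as the single characters
e and t, and id as i (these encodings are our own simplification):

#include <stdio.h>
#include <string.h>

/* Nonterminals: E, e (E'), T, t (T'), F.  Terminals: i (id), +, *, (, ), $ */

static char stack[100];
static int top = 0;

static void push(const char *s) {            /* push a production body in reverse */
    for (int i = (int)strlen(s) - 1; i >= 0; i--)
        stack[top++] = s[i];
}

/* M[A, a]: the production body to push, or NULL on error; "" is the eps body */
static const char *table(char A, char a) {
    switch (A) {
    case 'E': if (a == 'i' || a == '(') return "Te";  break;
    case 'e': if (a == '+') return "+Te";
              if (a == ')' || a == '$') return "";    break;
    case 'T': if (a == 'i' || a == '(') return "Ft";  break;
    case 't': if (a == '*') return "*Ft";
              if (a == '+' || a == ')' || a == '$') return ""; break;
    case 'F': if (a == 'i') return "i";
              if (a == '(') return "(E)";             break;
    }
    return NULL;
}

int main(void) {
    const char *w = "i+i*i$";       /* token stream for id + id * id */
    int ip = 0;
    stack[top++] = '$';             /* $ marks the bottom of the stack */
    stack[top++] = 'E';             /* start symbol on top of $ */
    while (stack[top-1] != '$') {
        char X = stack[top-1], a = w[ip];
        if (X == a) { top--; ip++; }             /* match a terminal */
        else if (X=='i'||X=='+'||X=='*'||X=='('||X==')') {
            printf("error: expected %c\n", X); return 1;
        } else {                                 /* nonterminal: consult the table */
            const char *body = table(X, a);
            if (!body) { printf("error at %c\n", a); return 1; }
            top--;                               /* pop X, push the body */
            push(body);
        }
    }
    puts(w[ip] == '$' ? "accepted" : "rejected");
    return 0;
}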
On input id + id * id, the nonrecursive predictive parser makes a sequence of moves corresponding to a leftmost derivation; the productions printed by the sketch above are exactly the productions applied in that derivation.
In the usual tabular presentation of these moves, the sentential forms of the derivation correspond to the input that has already been matched (the MATCHED column) followed by the stack contents. The matched input is shown only to highlight the correspondence. The input pointer points to the leftmost symbol of the string in the INPUT column.
Error Recovery in Predictive Parsing:
An error is detected during predictive parsing when the terminal on top of the stack does not match the next input symbol, or when nonterminal A is on top of the stack, a is the next input symbol, and M[A, a] is error (i.e., the parsing-table entry is empty).
Panic Mode
o Panic-mode error recovery is based on the idea of skipping symbols on the input
until a token in a selected set of synchronizing tokens appears.
o Its effectiveness depends on the choice of synchronizing set. The sets should be chosen
so that the parser recovers quickly from errors that are likely to occur in practice.
o Some heuristics are as follows:
1. As a starting point, place all symbols in FOLLOW(A) into the synchronizing set
for nonterminal A. If we skip tokens until an element of FOLLOW(A) is seen and
pop A from the stack, it is likely that parsing can continue.
2. It is not enough to use FOLLOW(A) as the synchronizing set for A. For example,
if semicolons terminate statements, as in C, then keywords that begin statements
may not appear in the FOLLOW set of the nonterminal representing expressions.
A missing semicolon after an assignment may therefore result in the keyword
beginning the next statement being skipped. Often, there is a hierarchical structure
on constructs in a language; for example, expressions appear within statements,
which appear within blocks, and so on. We can add to the synchronizing set of a
lower-level construct the symbols that begin higher-level constructs. For example,
we might add keywords that begin statements to the synchronizing sets for the
nonterminals generating expressions.
3. If we add symbols in FIRST(A) to the synchronizing set for nonterminal A, then it
may be possible to resume parsing according to A if a symbol in FIRST (A) appears
in the input.
4. If a nonterminal can generate the empty string, then the production deriving 𝜖 can
be used as a default. Doing so may postpone some error detection, but cannot cause
an error to be missed. This approach reduces the number of nonterminals that have
to be considered during error recovery.
5. If a terminal on top of the stack cannot be matched, a simple idea is to pop the
terminal, issue a message saying that the terminal was inserted, and continue
parsing. In effect, this approach takes the synchronizing set of a token to consist of
all other tokens.
Example: Using FIRST and FOLLOW symbols as synchronizing tokens works reasonably well when expressions are parsed according to the grammar
E -> TE'
E' -> +TE' | ε
T -> FT'
T' -> *FT' | ε
F -> ( E ) | id
The parsing table for this grammar, with "synch" indicating synchronizing tokens obtained from the FOLLOW set of the nonterminal in question (heuristic 1 above), is:

              id          +            *            (           )           $
E             E -> TE'                              E -> TE'    synch       synch
E'                        E' -> +TE'                            E' -> ε     E' -> ε
T             T -> FT'    synch                     T -> FT'    synch       synch
T'                        T' -> ε      T' -> *FT'               T' -> ε     T' -> ε
F             F -> id     synch        synch        F -> (E)    synch       synch
On the erroneous input ) id * + id, the parser with the error-recovery entries of the above table behaves as in the trace produced by the sketch below.
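This is a hedged sketch of panic-mode recovery layered on the table-driven driver, following the rules above: a blank entry skips the offending input token, a synch entry pops the nonterminal, and an unmatched terminal on top of the stack is popped with an "inserted" message. One design choice of the sketch, an assumption not spelled out in the notes: when popping on synch would leave only $ on the stack, the token is skipped instead, so that recovery can continue. On the input ) id * + id it reports "error, skip )" and later "error, M[F, +] = synch: pop F", then finishes the parse.

    TABLE = {                            # LL(1) entries, as in the driver sketch
        ("E", "id"): ["T", "E'"],        ("E", "("): ["T", "E'"],
        ("E'", "+"): ["+", "T", "E'"],   ("E'", ")"): [], ("E'", "$"): [],
        ("T", "id"): ["F", "T'"],        ("T", "("): ["F", "T'"],
        ("T'", "*"): ["*", "F", "T'"],   ("T'", "+"): [], ("T'", ")"): [], ("T'", "$"): [],
        ("F", "id"): ["id"],             ("F", "("): ["(", "E", ")"],
    }
    SYNCH = {                            # heuristic 1: synch tokens = FOLLOW(A)
        "E": {")", "$"}, "E'": set(), "T": {"+", ")", "$"},
        "T'": set(), "F": {"+", "*", ")", "$"},
    }

    def parse_with_recovery(tokens):
        stack, toks, i = ["$", "E"], tokens + ["$"], 0
        while stack[-1] != "$":
            X, a = stack[-1], toks[i]
            if X not in SYNCH:                         # terminal on top of stack
                if X == a:
                    stack.pop(); i += 1
                else:                                  # heuristic 5: pop and report
                    print(f"error, inserted {X}"); stack.pop()
            elif (X, a) in TABLE:
                stack.pop(); stack.extend(reversed(TABLE[(X, a)]))
            elif a in SYNCH[X] and len(stack) > 2:     # synch entry: pop nonterminal
                print(f"error, M[{X}, {a}] = synch: pop {X}"); stack.pop()
            else:                                      # blank entry: skip the token
                print(f"error, skip {a}"); i += 1
        print("done" if toks[i] == "$" else "input left over")

    parse_with_recovery([")", "id", "*", "+", "id"])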
Phrase-level Recovery
o Phrase-level error recovery is implemented by filling in the blank entries in the
predictive parsing table with pointers to error routines.
o These routines may change, insert, or delete symbols on the input and issue appropriate
error messages. They may also pop from the stack.
o Alteration of stack symbols or the pushing of new symbols onto the stack is
questionable for several reasons.
First, the steps carried out by the parser might then not correspond to the
derivation of any word in the language at all.
Second, we must ensure that there is no possibility of an infinite loop. Checking
that any recovery action eventually results in an input symbol being consumed
(or the stack being shortened if the end of the input has been reached) is a good
way to protect against such loops.
Exercises: Construct the LL(1) predictive parser for each of the following grammars and parse the given string.
1. S -> S(S)S | 𝜖 String= ( ( ) ( ) )
5. S -> 0S1 | 01 String=00011
6. S -> aB | aC | Sd | Se
B -> bBc | f
C -> g
7. P -> Ra | Qba
R -> aba | caba | Rbc
Q -> bbc | bc String= cababca
8. S -> PQR
P -> a | Rb | 𝜖
Q -> c | dP | 𝜖
R -> e | f String= adeb
9. E -> E+T | T
T -> id | id[ ] | id[X]
X -> E,E | E String= id[id]
Bottom-up Parsing
A bottom-up parse corresponds to the construction of a parse tree for an input string beginning at the leaves (the bottom) and working up towards the root (the top).
Ex:
Consider the grammar,
E -> E+T | T
T -> T*F | F
F -> ( E ) | id
Let the string be id*id
The following sequence of sentential forms illustrates construction of the parse tree bottom-up; at each step, the substring just reduced becomes the children of a new interior node of the growing tree:
id * id, F * id, T * id, T * F, T, E
Classification: Bottom-up parsers are commonly classified into operator-precedence parsers and LR parsers (SLR, canonical LR, and LALR), all of which are forms of shift-reduce parsing.
Reductions:
We can think of bottom-up parsing as the process of "reducing" a string w to the start
symbol of the grammar.
At each reduction step, a specific substring matching the body of a production is
replaced by the non-terminal at the head of that production.
The key decisions during bottom-up parsing are about when to reduce and about what
production to apply, as the parse proceeds.
Ex: the sequence of reductions in the above example: id * id, F * id, T * id, T * F, T, E
By definition, a reduction is the reverse of a step in a derivation. The goal of bottom-
up parsing is therefore to construct a derivation in reverse. The following derivation
corresponds to the parse in the above example.
E => T
=> T * F
=> T * id
=> F * id
=> id * id
Handle Pruning:
A "handle" is a substring that matches the body of a production, and whose reduction
represents one step along the reverse of a rightmost derivation.
For example, the handles during the parse of id1*id2 according to the above grammar
are as shown in the following:
Right Sentential Form    Handle    Reducing Production
id1 * id2                id1       F -> id
F * id2                  F         T -> F
T * id2                  id2       F -> id
T * F                    T * F     T -> T * F
Formally, if S =>* αAw => αβw by a rightmost derivation, then the production A -> β in the position following α is a handle of αβw. That is, a handle of a right-sentential form γ is a production A -> β and a position of γ where the string β may be found, such that replacing β at that position by A produces the previous right-sentential form in a rightmost derivation of γ.
(Figure: a parse tree for S with frontier αβw; the subtree rooted at A with yield β is the handle.)
If a grammar is unambiguous, then every right-sentential form of the grammar has exactly
one handle.
A rightmost derivation in reverse can be obtained by "handle pruning."
o That is, we start with a string of terminals w to be parsed. If w is a sentence of the
grammar at hand, then let w = γn , where γn is the nth right-sentential form of some as
yet unknown rightmost derivation,
S = γ0 => γ1 => γ2 => ... => γn-1 => γn = w
o To reconstruct this derivation in reverse order, we locate the handle βn in γn and replace
βn by the head of the relevant production An -> βn to obtain the previous right-sentential
form γn-1.
o We then repeat this process.
o If by continuing this process we produce a right-sentential form consisting only of the
start symbol S, then we halt and announce successful completion of parsing.
o The reverse of the sequence of productions used in the reductions is a rightmost
derivation for the input string.
Shift-Reduce Parsing:
Shift-reduce parsing is a form of bottom-up parsing in which a stack holds grammar
symbols and an input buffer holds the rest of the string to be parsed.
We use $ to mark the bottom of the stack and also the right end of the input. Conventionally,
when discussing bottom-up parsing, we show the top of the stack on the right.
Initially, the stack is empty, and the string w is on the input, as follows:
Stack Input
$ w$
During a left-to-right scan of the input string, the parser shifts zero or more input symbols
onto the stack, until it is ready to reduce a string β of grammar symbols on top of the stack.
It then reduces β to the head of the appropriate production. The parser repeats this cycle
until it has detected an error or until the stack contains the start symbol and the input is
empty:
Stack Input
$S $
Upon entering this configuration, the parser halts and announces successful completion of
parsing.
There are actually four possible actions a shift-reduce parser can make: (1) shift, (2) reduce,
(3) accept, and (4) error.
1. Shift. Shift the next input symbol onto the top of the stack.
2. Reduce. The right end of the string to be reduced must be at the top of the stack. Locate
the left end of the string within the stack and decide with what nonterminal to replace the
string.
3. Accept. Announce successful completion of parsing.
4. Error. Discover a syntax error and call an error recovery routine.
The actions of a shift-reduce parser parsing the input string id1 * id2 according to the expression grammar are shown here:
Stack Input Action
$ id1*id2$ shift
$id1 *id2$ reduce by F -> id
$F *id2$ reduce by T -> F
$T *id2$ shift
$T* id2$ shift
$T*id2 $ reduce by F -> id
$T*F $ reduce by T -> T * F
$T $ reduce by E -> T
$E $ accept
Note: The handle will always eventually appear on top of the stack, never inside.
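How the parser knows when to shift and when to reduce is exactly what an LR parsing table (discussed later) decides. Purely to reproduce the trace above, here is a Python sketch in which the shift/reduce decisions for this expression grammar are hard-coded; the decision rules are ad-hoc assumptions for illustration, and the sketch handles only id, + and * (no parentheses).

    def shift_reduce(tokens):
        """Trace a shift-reduce parse of id/+/* expressions (no parentheses)."""
        stack, toks, i = [], tokens + ["$"], 0
        def show(action):
            print(f"${' '.join(stack):<12} {' '.join(toks[i:]):<16} {action}")
        while True:
            la = toks[i]                               # lookahead symbol
            if stack and stack[-1] == "id":
                show("reduce by F -> id"); stack[-1] = "F"
            elif stack and stack[-1] == "F":
                if len(stack) >= 3 and stack[-2] == "*":
                    show("reduce by T -> T * F"); del stack[-2:]
                else:
                    show("reduce by T -> F"); stack[-1] = "T"
            elif stack and stack[-1] == "T":
                if la == "*":                          # * binds tighter: keep shifting
                    show("shift"); stack.append(la); i += 1
                elif len(stack) >= 3 and stack[-2] == "+":
                    show("reduce by E -> E + T"); del stack[-2:]
                else:
                    show("reduce by E -> T"); stack[-1] = "E"
            elif stack == ["E"] and la == "$":
                show("accept"); return
            elif la != "$":
                show("shift"); stack.append(la); i += 1
            else:
                show("error"); return

    shift_reduce(["id", "*", "id"])   # reproduces the trace in the table above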
Question: Consider the following grammars and parse the respective strings using a shift-reduce parser.
(1) S -> TL;
    T -> int | float
    L -> L, id | id
    String: int id, id;
(2) S -> (L) | a
    L -> L, S | S
    String: (a, (a, a))
Conflicts During Shift-Reduce Parsing:
1) Shift/reduce conflict: this conflict arises when the parser cannot decide whether to perform a shift action or a reduce action.
Ex: Consider the grammar,
stmt -> if expr then stmt
| if expr then stmt else stmt
| other
We cannot tell whether if expr then stmt is the handle, no matter what appears below it on
the stack. Here there is a shift/reduce conflict. Depending on what follows the else on the
input, it might be correct to reduce if expr then stmt to stmt, or it might be correct to shift
else and then to look for another stmt to complete the alternative if expr then stmt else stmt.
2) reduce/reduce conflict: This conflict arises when a parser cannot decide which of several
reductions to make.
Ex 1: Consider the following grammar,
S -> AB
A -> aA | ab
B -> bB | ab
Suppose the string is abab
Then the actions of a shift-reduce parser will be
Stack Input Action
$ abab$ shift
$a bab$ shift
$ab ab$ reduce by A -> ab or B -> ab [conflict]
Here the parser cannot decide which of the two productions to use for the reduce action.
Ex 2: Suppose we have a lexical analyzer that returns the token name id for all names,
regardless of their type. Suppose also that our language invokes procedures by giving their
names, with parameters surrounded by parentheses, and that arrays are referenced by the
same syntax. Our grammar might therefore have (among others) productions such as the following (the numbering is referred to below):
(1) stmt -> id ( parameter_list )
(2) stmt -> expr := expr
(3) parameter_list -> parameter_list , parameter
(4) parameter_list -> parameter
(5) parameter -> id
(6) expr -> id ( expr_list )
(7) expr -> id
(8) expr_list -> expr_list , expr
(9) expr_list -> expr
A statement beginning with p( i , j ) would appear as the token stream id(id, id) to the
parser.
After shifting the first three tokens onto the stack, a shift-reduce parser would be in
configuration
Stack                 Input
• • • id ( id         , id ) • • •
It is evident that the id on top of the stack must be reduced, but by which production? The
correct choice is production (5) if p is a procedure, but production (7) if p is an array. The
stack does not tell which; information in the symbol table obtained from the declaration of
p must be used.
Exercises:
For the following grammar, indicate the handle in each of the following right-sentential forms:
1) S -> 0S1 | 01
   a) 000111
   b) 00S11
Why LR Parsers?
For a grammar to be LR it is sufficient that a left-to-right shift-reduce parser be able to
recognize handles of right-sentential forms when they appear on top of the stack.
LR parsing is attractive for a variety of reasons:
1) LR parsers can be constructed to recognize virtually all programming language constructs
for which context-free grammars can be written.