INTRODUCTION, LEXICAL ANALYSIS

Compiled by: Dept. of CSE, SJEC, Mangaluru

1.1 Language Processors:

 Compilers:
 A compiler is a program that can read a program in one language (the source language) and translate it into an equivalent program in another language (the target language).
 An important role of the compiler is to report any errors in the source program that it detects
during the translation process.

 If the target program is an executable machine-language program, it can then be called by the user to process inputs and produce outputs.

 Interpreters:
 An interpreter is another kind of language processor, which directly executes the operations specified in the source program on inputs supplied by the user.

Note:
 The machine-language target program produced by a compiler is usually much faster than an
interpreter at mapping inputs to outputs.
 An interpreter, however, can usually give better error diagnostics than a compiler, because it executes the source program statement by statement.

 Hybrid Compilers:
 Ex: Java language processors, which combine compilation and interpretation:
 A Java source program is first compiled into an intermediate form called bytecodes.
 The bytecodes are then interpreted by a virtual machine. A benefit of this arrangement is
that bytecodes compiled on one machine can be interpreted on another machine, perhaps
across a network.
Source Program → Translator → Intermediate Program
Intermediate Program + Input → Virtual Machine → Output
 In order to achieve faster processing of inputs to outputs, some Java compilers, called just-in-time
compilers, translate the bytecodes into machine language immediately before they run the
intermediate program to process the input.


 Along with the compiler, several other language processors are required to create an executable target program. These are: preprocessors, assemblers, linkers and loaders. The language-processing system is shown below.
Source Program
↓ (Preprocessor)
Modified Source Program
↓ (Compiler)
Target Assembly Program
↓ (Assembler)
Relocatable Machine Code
↓ (Linker/Loader, together with library files and other relocatable object files)
Target Machine Code

A source program may be divided into modules stored in separate files. The task of collecting the
source program is sometimes entrusted to a separate program, called a preprocessor.
 The preprocessor may also expand shorthands, called macros, into source language statements.
 The modified source program is then fed to a compiler. The compiler may produce an assembly-
language program as its output, because assembly language is easier to produce as output and is
easier to debug.
 The assembly language is then processed by a program called an assembler that produces
relocatable machine code as its output.
 The relocatable machine code may have to be linked together with other relocatable object files
and library files into the code that actually runs on the machine.
 The linker resolves external memory addresses, where the code in one file may refer to a location
in another file.
 The loader then puts together all of the executable object files into memory for execution.
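
As a small illustration of macro expansion by the preprocessor, consider the following C++ fragment (the macro here is a made-up example, not taken from the notes):

#include <iostream>

// A shorthand (macro) that the preprocessor expands before compilation proper.
#define SQUARE(x) ((x) * (x))

int main() {
    int n = 5;
    // After preprocessing, the next line reads: std::cout << ((n) * (n)) << '\n';
    std::cout << SQUARE(n) << '\n';   // prints 25
    return 0;
}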

1.2 The Structure of a Compiler:

 There are two major parts of a compiler: analysis (the front end) and synthesis (the back end).

 The analysis part breaks up the source program into constituent pieces and imposes a
grammatical structure on them. It then uses this structure to create an intermediate representation
of the source program.
 If the analysis part detects that the source program is either syntactically ill formed or semantically unsound, then it must provide informative messages so the user can take corrective action.


 The analysis part also collects information about the source program and stores it in a data
structure called a symbol table, which is passed along with the intermediate representation to the
synthesis part.
 The synthesis part constructs the desired target program from the intermediate representation and
the information in the symbol table.

 Compilation process:
 It operates as a sequence of phases, each of which transforms one representation of the source program into another. The symbol table is shared by all of the phases:

Character Stream
↓ (Lexical Analyzer)
Token Stream
↓ (Syntax Analyzer)
Syntax Tree
↓ (Semantic Analyzer)
Syntax Tree
↓ (Intermediate Code Generator)
Intermediate Representation
↓ (Machine-Independent Code Optimizer)
Intermediate Representation
↓ (Code Generator)
Target-Machine Code
↓ (Machine-Dependent Code Optimizer)
Target-Machine Code


Ex: Consider the assignment statement position = initial + rate * 60, where all the variables are floating point. The translation of this statement through the phases is shown below:
position = initial + rate * 60

↓ (Lexical Analyzer)

< id, 1 > < = > < id, 2 > < + > < id, 3 > < * > < 60 >

↓ (Syntax Analyzer)

           =
         /   \
    <id,1>    +
            /   \
       <id,2>    *
               /   \
          <id,3>    60

↓ (Semantic Analyzer)

           =
         /   \
    <id,1>    +
            /   \
       <id,2>    *
               /   \
          <id,3>  inttofloat
                     |
                     60

(The symbol table holds the entries 1: position, 2: initial, 3: rate.)

↓ (Intermediate Code Generator)

t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3

↓ (Code Optimizer)

t1 = id3 * 60.0
id1 = id2 + t1

↓ (Code Generator)

LDF  R2, id3
MULF R2, R2, #60.0
LDF  R1, id2
ADDF R1, R1, R2
STF  id1, R1


Lexical Analysis (also called Scanning):

 This is the first phase of the compiler.
 The lexical analyzer reads the stream of characters making up the source program and groups the characters into meaningful sequences called lexemes.
 For each lexeme, it produces as output a token of the form < token-name, attribute-value >, where:
 token-name is an abstract symbol that is used during syntax analysis.
 attribute-value points to an entry in the symbol table for this token.
 For the above example, the following table shows lexemes and their corresponding tokens.
Lexeme      token-name      attribute-value
position    id              1 (points to symbol-table entry)
=           =               ---
initial     id              2 (points to symbol-table entry)
+           +               ---
rate        id              3 (points to symbol-table entry)
*           *               ---
60          60              ---

 Thus, after lexical analysis, the sequence of tokens generated is:

< id, 1 > < = > < id, 2 > < + > < id, 3 > < * > < 60 >
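
A minimal sketch of how such tokens might be represented in code (hypothetical C++ names, for illustration only; the notes do not prescribe a layout):

// A token pairs an abstract token name with an optional attribute value.
enum class TokenName { ID, ASSIGN, PLUS, TIMES, NUMBER };

struct Token {
    TokenName name;   // abstract symbol used during syntax analysis
    int attribute;    // e.g., symbol-table index for ID, literal value for NUMBER
};

// position = initial + rate * 60 would then reach the parser as:
// {ID,1} {ASSIGN,-} {ID,2} {PLUS,-} {ID,3} {TIMES,-} {NUMBER,60}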

Syntax Analysis (also called Parsing):

 This is the second phase of the compiler.

 It uses the token stream generated by the lexical analyzer to create a tree-like intermediate representation that depicts the grammatical structure of the token stream.
 A typical representation is a syntax tree, in which each interior node represents an operation and
the children of the node represent the arguments of the operation.
 The syntax tree generated after parsing for the example is shown in the diagram.

Semantic Analysis:

 This is the third phase of the compiler.
 The semantic analyzer uses the syntax tree and the information in the symbol table to check the
source program for semantic consistency with the language definition.
 It also gathers type information and saves it in either the syntax tree or the symbol table, for subsequent use during intermediate-code generation.
 An important part of semantic analysis is type checking, where the compiler checks that each
operator has matching operands.
In some cases, it must also support coercions (implicit type conversions, e.g., casting int to float).
 The output of semantic analysis for the example is shown in the diagram.

Intermediate Code Generation:

 A compiler may produce an explicit intermediate code representing the source program.


 There exist different forms for representing intermediate code. One such form is three-address code.
 Three-address code consists of a sequence of assembly-like instructions with at most three operands per instruction.
Each instruction has at most one operator on the right side.
The compiler must generate temporary names to hold the value computed by a three-address instruction.
 For the assignment statement example, intermediate code generated in three-address code form is
shown in the diagram.
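
A minimal sketch of a container for such instructions (hypothetical C++ names; a quadruple-style layout is only one of several possibilities):

#include <iostream>
#include <string>
#include <vector>

// One three-address instruction: result = arg1 op arg2 (op/arg2 may be empty).
struct ThreeAddr {
    std::string result, arg1, op, arg2;
};

int main() {
    // Hand-built sequence for: position = initial + rate * 60
    std::vector<ThreeAddr> code = {
        {"t1", "inttofloat(60)", "", ""},
        {"t2", "id3", "*", "t1"},
        {"t3", "id2", "+", "t2"},
        {"id1", "t3", "", ""},
    };
    for (const auto& q : code) {
        std::cout << q.result << " = " << q.arg1;
        if (!q.op.empty()) std::cout << ' ' << q.op << ' ' << q.arg2;
        std::cout << '\n';
    }
    return 0;
}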

Machine independent code optimization:

 This phase is an intermediate phase between front end and back end.
 The purpose of this phase is to perform transformations on the intermediate representation so that
the back end can produce a better target program than it would have otherwise produced from an
un-optimized intermediate representation.
 That is, this phase attempts to improve the intermediate code so that better target code will result.
 For the assignment statement example, the code generated after this phase is shown in the diagram above, where the inttofloat operation is eliminated by replacing the integer 60 with the floating-point number 60.0.

Code Generation:

 The code generator takes as input an intermediate representation of the source program and maps
it into the target language.
 If the target language is machine code, registers or memory locations are selected for each of the
variables used by the program.
 Then the intermediate instructions are translated into sequences of machine instructions that
perform the same task.
 A crucial aspect of code generation is the judicious assignment of registers to hold variables.
 For the assignment statement example, the target code generated is shown in the diagram.

Symbol Table Management:


 The symbol table is a data structure containing a record for each variable name used in
the source program.
 It stores the attributes of each name, for example:
o the name, its type, its scope
o for procedure names: the method of passing each argument (by value or by reference)
o the return type
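
A minimal sketch of such a table (hypothetical C++ layout; the attribute fields shown are an illustrative subset):

#include <string>
#include <unordered_map>

// Attributes recorded for each name.
struct SymbolInfo {
    std::string type;   // e.g., "float"
    int scopeDepth;     // lexical nesting depth where the name is declared
};

class SymbolTable {
    std::unordered_map<std::string, SymbolInfo> table;
public:
    // Install a name if absent and return a reference to its record.
    SymbolInfo& install(const std::string& name) { return table[name]; }
    bool contains(const std::string& name) const { return table.count(name) > 0; }
};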

Grouping of phases into passes:


 In an implementation, activities from several phases may be grouped together into a pass
that reads an input file and writes an output file.
 For example,


 the front-end phases of lexical analysis, syntax analysis and semantic analysis, together with intermediate code generation, might be grouped into one pass.
 Code optimization might be an optional pass.
 There could be a back-end pass consisting of code generation for a particular target
machine.

Compiler Construction Tools:

Some commonly used compiler construction tools are:

1. Parser generators: These automatically produce syntax analyzers from a grammatical description of a programming language.

2. Scanner generators: These produce lexical analyzers from a regular-expression description of the tokens of a language.

3. Syntax-directed translation engines: These produce collections of routines for walking a parse
tree and generating intermediate code.

4. Code-generator generators: These produce a code generator from a collection of rules for
translating each operation of the intermediate language into the machine language for a target
machine.

5. Data-flow analysis engines: These facilitate the gathering of information about how values are transmitted from one part of a program to every other part.

6. Compiler-construction toolkits: These provide an integrated set of routines for constructing the various phases of a compiler.

1.3 The Evolution of Programming Languages:

The Move to Higher-Level Languages:

Classification of Languages:
1. Based on generation:
a) First generation languages: machine languages.
b) Second generation languages: assembly languages.
c) Third generation languages: higher-level languages, e.g., Fortran, Cobol, Lisp, C, C++, C#, Java.
d) Fourth generation languages: designed for specific applications, like SQL for database applications and PostScript for text formatting.
e) Fifth generation languages: logic- and constraint-based languages like Prolog and OPS5.

2. Imperative vs. declarative languages:

Imperative languages: these are the languages in which a program specifies how a computation is to be done.


e.g., C, C++, C#, Java
Declarative languages: these are the languages in which a program specifies what computation is to be done.
e.g., ML, Haskell, Prolog

3. Von Neumann languages: these are the languages whose computational model is based on the Von Neumann architecture.
e.g., Fortran, C

4. Object-oriented languages: languages which support an object-oriented programming style.
e.g., C++, Java, C#, Ruby

5. Scripting languages: languages that make use of high-level operators to perform computations.
e.g., JavaScript, Perl, PHP, Python, Ruby, Tcl

Impacts on Compilers:

 As new architectures and programming languages were evolving, the compiler writers had to
track new language features and had to devise translation algorithms that would take maximal
advantage of the new hardware capabilities.
 A compiler by itself is a large program that must translate correctly the potentially infinite set of
programs that could be written in the source language.
 A compiler writer must evaluate tradeoffs about what problems to tackle and what heuristics to
use to approach the problem of generating efficient code.

1.4 The Science of Building a Compiler:

 The science behind the compiler:


1) Take a problem
2) Formulate a mathematical abstraction that captures the key characteristics; this requires a solid understanding of the characteristics of computer programs.
3) Solve it using mathematical techniques.

 A compiler must accept all source programs that conform to the specification of the language.
 Any transformation performed by the compiler while translating a source program must preserve
the meaning of the program being compiled.

Modeling in Compiler Design and Implementation:


 Some of the models used are:
1) Finite state machines and regular expressions: useful
 For describing the lexical units of programs
 For describing the algorithms used by the compiler to recognize those units.
2) Context-Free-Grammars:
Used to describe the syntactic structure of programming languages
3) Trees:
Used for representing the structure of programs and their translation into object code.

The Science of Code Optimization:


 It refers to the attempts that a compiler makes to produce code that is more efficient than the
obvious code.
 The optimization of code that a compiler performs has become both more important and more
complex.
 More complex: Because processor architectures have become more complex.
 More important: Because most parallel computers require optimization; otherwise, performance degrades.
 Compiler optimizations must meet the following design objectives:
1. The optimization must be correct, i.e., preserve the meaning of the compiled program.
2. It must improve the performance of many programs: shorter code, faster execution of the program, minimum power consumption.
3. The compilation time must be kept reasonable: short compilation times support a rapid development and debugging cycle.
4. The engineering effort required must be manageable: keep the system simple to assure that the engineering and maintenance costs of the compiler are manageable.

1.5 Applications of Compiler Technology:

Implementation of High-Level Programming Languages:


 A high-level programming language defines a programming abstraction:
The programmer expresses an algorithm using the language, and the compiler must translate that
program to the target language.
 Generally, high-level programming languages are easier to program in, but are less efficient, i.e., the target programs run more slowly.
 Programmers using a low-level programming language have more control over a computation and can produce more efficient code.
 Unfortunately, low-level programs are harder to write and, worse still, less portable, more prone to errors and harder to maintain.
 Optimizing compilers include techniques to improve the performance of general code, thus
offsetting the inefficiency introduced by HL abstractions.
 Ex: the register keyword in early C: as effective register-allocation techniques were developed, this keyword became unnecessary.
 Programming languages such as C and Fortran support user-defined aggregate data types (e.g., arrays, structures) and high-level control flow (loops, procedures).
Data-flow optimizations analyze the flow of data through the program and remove redundancies across these constructs.
 Object-oriented programming languages such as C++, C# and Java support data abstraction and inheritance.
Optimizations to speed up virtual method dispatch have also been developed for these languages.

Optimizations for Computer Architectures:


 The rapid evolution of computer architectures has also led to an insatiable demand for new compiler technology.
 Almost all high-performance systems take advantage of the same two basic techniques: parallelism and memory hierarchies.


 Parallelism can be found at several levels:
at the instruction level, where multiple operations are executed simultaneously;
at the processor level, where different threads of the same application are run on different processors.
 Memory hierarchies are a response to the basic limitation that we can build very fast storage or very large storage, but not storage that is both fast and large.

 Parallelism:

 All modern microprocessors exploit instruction-level parallelism; however, this can be hidden from the programmer.
 The hardware scheduler dynamically checks for dependencies in the sequential instruction stream and issues instructions in parallel when possible.
 Whether or not the hardware reorders the instructions, compilers can rearrange the instructions to make instruction-level parallelism more effective.

 Memory Hierarchies:

 Memory hierarchy consists of several levels of storage with different speeds and sizes.
 A processor usually has a small number of registers totaling a few hundred bytes, several levels of caches containing kilobytes to megabytes, and finally secondary storage that contains gigabytes and beyond.
 Correspondingly, the speed of accesses between adjacent levels of the hierarchy can
differ by two or three orders of magnitude.
 The performance of a system is often limited not by the speed of the processor but by the
performance of the memory subsystem.
 While compilers traditionally focus on optimizing processor execution, more emphasis is now placed on making the memory hierarchy more effective.

Design of New Computer Architectures:
 In modern computer architecture development, compilers are developed in the processor-design stage, and compiled code, running on simulators, is used to evaluate the proposed architectural design.
 One of the best-known examples of how compilers influenced the design of computer architecture was the invention of the RISC (reduced instruction set computer) architecture.
 Over the last 3 decades, many architectural concepts have been proposed. They include
data flow machines, vector machines, VLIW(very long instruction word) machines,
multiprocessors with shared memory, and with distributed memory.
 The development of each of these architectural concepts was accompanied by the research and
development of corresponding compiler technology.
 Compiler technology is not only needed to support programming of these architectures but also to
evaluate the proposed architectural designs.

Program Translations:
 Although we normally think of compiling as translation from a high-level language to machine-level language, the same technology can be applied to translate between different kinds of languages.
 The following are some of the important applications of program translation techniques:
 Binary Translation:
 Compiler technology can be used to translate the binary code for one machine to another.
 e.g., binary translators have been developed to convert x86 code into both Alpha and SPARC code.


 Binary translation can also be used to provide backward compatibility.
e.g., PowerPC processors were allowed to run legacy MC 68040 code.
 Hardware Synthesis:
 Compiler technology is also used in high level hardware description languages like VHDL.
 Database Query Interpreters:
 Compiler technology is also used in query languages such as SQL which are used to search
databases.
 Compiled Simulation:
 Compiled simulation can run orders of magnitude faster than an interpreter-based approach. Simulation is used mainly in scientific and engineering disciplines to understand a phenomenon or to validate a design.

Software Productivity Tools:

 Program-analysis techniques, originally developed to optimize code in compilers, have improved software productivity in several ways.

 Type Checking:
 It is an effective technique to catch inconsistencies in programs.
 It can be used to catch errors
Ex: Type mismatch of the object, parameter type mismatches with the procedure signature
etc.
 Through program analysis, i.e., by analyzing the flow of data through a program, errors can be detected.
Ex: usage of null pointer
 This technique can also be used to catch a variety of security holes.
Ex: if an attacker supplies a “dangerous” string and that string is not checked properly, there is a chance that it will influence the control flow of the code at some point in the program.

 Bounds Checking:
 This technique is mainly used to prevent buffer overflows in programs.
 Ex: C does not perform array-bounds checks; it is up to the programmer to ensure that arrays are not accessed out of bounds. Without such checks, a program may store data outside the buffer.
 There are several other techniques that perform a similar job.
Ex: data-flow analysis

 Memory Management Tools:


 There are several tools that deal with checking memory management issues.
 “Garbage collection” is a central issue in memory management.
 Ex: “Purify” is a tool that dynamically catches memory management errors as they occur.

1.6 Programming Language Basics

 Static/Dynamic Distinction
 If a language uses a policy that allows the compiler to decide an issue, then we say that the
language uses a static policy or that the issue can be decided at compile time.


 It became unfavorable as it did not support actual parameters containing expressions.

 Aliasing:
 It is possible that two formal parameters can refer to the same location; such variables are
said to be aliases of one another.

Lexical Analysis:

2.1 The role of the Lexical Analyzer:

 The lexical analyzer reads the source program character by character, groups the characters into lexemes, and produces a sequence of tokens, one for each lexeme in the source program.
 These tokens are sent to syntax analyzer.
 The interaction between the lexical analyzer and the parser is shown below:

Source Program → Lexical Analyzer --(token)--> Syntax Analyzer → to semantic analysis
                 Lexical Analyzer <--(getNextToken)-- Syntax Analyzer
                 (both consult the Symbol Table)

 Here the parser issues a getNextToken call, which causes the lexical analyzer to read characters from its input until it can identify the next lexeme and produce for it the next token, which it returns to the parser.
 The lexical analyzer also interacts with the symbol table: a newly identified identifier lexeme is entered into the table, and information from the table may be read back.
 Some Other tasks performed by the Lexical Analyzer:
 Stripping out comments and whitespace
Normally the lexical analyzer doesn't return a comment as a token.
It skips the comment and returns the next token (which is not a comment) to the parser.
 Correlating error messages
It can associate a line number with each error message.
In some compilers it makes a copy of the source program with the error messages inserted at the appropriate positions.
 If the source program uses a macro preprocessor, the expansion of macros may also be performed by the lexical analyzer.
 Sometimes, lexical analyzers are divided into a cascade of two processes:
 Scanning: the simple processes that do not require tokenization of the input, such as deletion of comments and stripping out whitespace.
 Lexical analysis: This is the complex process, wherein the scanner produces a sequence of
tokens.


 Lexical Analysis Versus Parsing:

The reasons for separating analysis phase of the compilation into lexical analysis and syntax analysis:

1. Simplicity of design:

 The separation makes each of the tasks simpler to perform.
 It leads to cleaner overall language design.

2. Compiler efficiency is improved:

 Specialized techniques such as input buffering can speed up the compiler significantly.

3. Compiler portability is enhanced:

 Input-device-specific peculiarities can be restricted to the lexical analyzer alone; hence the other phases are free from these limitations.

 Tokens, Patterns, Lexemes:

Token:
 It is a pair consisting of a token name and an optional attribute value.
 The token name is an abstract symbol representing a kind of lexical unit.
 Ex: keywords, operators, identifiers, constants, literal strings, punctuation symbols (such as commas, semicolons)

Pattern:
 Description of the form that the lexemes of a token may take.

Lexeme:
 It is a sequence of characters in the source program that matches the pattern for a token

Ex: Consider the following statement:

if(a>=b)
printf("total=%d\n", a);


Lexemes: if ( a >= b ) printf ( "total=%d\n" , a ) ;
Tokens:  IF LP id relop id RP id LP literal comma id RP semi

 Most of the tokens belong to one of the following classes:
1) One token for each keyword.
2) Tokens for the operators either individually or in classes
3) One token representing all identifiers
4) One or more tokens representing constants such as numbers and literal strings
5) Tokens for each punctuation symbol such as left and right parentheses, comma and
semicolon.

 Attributes for tokens:

 The lexical analyzer returns to the parser not only a token name but also an attribute value that describes the lexeme represented by the token.
 This attribute information is kept in the symbol table for future reference.
 Ex: for a token id, the attribute information includes its lexeme, its type, the location at which it
is first found etc. Thus, the appropriate attribute value here would be pointer to the symbol table
entry.
 Ex: the token names and associated attribute values for the Fortran statement E=M * C ** 2 are:

< id, pointer to symbol table entry for E >
<assign_op>
< id, pointer to symbol table entry for M >
<mult_op>
< id, pointer to symbol table entry for C >
<exp_op>
< number, integer value 2>

 A tricky problem arises when recognizing tokens:

Ex: In FORTRAN (where blanks are insignificant),
DO 5 I = 1.25 // an assignment to the variable DO5I (DO5I is a lexeme)
DO 5 I = 1, 25 // a do-statement
The lexical analyzer cannot tell which case applies until it reaches the '.' or the ','.

 Lexical Errors:

 If, instead of the statement if(a==b), we mistype fi(a==b), the lexical analyzer will not catch the mistake, since it treats fi as a valid lexeme (an identifier).
 In certain situations the lexical analyzer cannot proceed because none of the patterns matches any prefix of the remaining input. In this case it can adopt a panic-mode error-recovery strategy.
The possible actions here would be:
 Delete successive characters from the remaining input until the lexical analyzer can find a
well formed token at the beginning of what input is left.
 Delete one character from the remaining input.
 Insert a missing character into the remaining input.
 Replace a character by another character.
 Transpose two adjacent characters.
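
A minimal sketch of the first action, deleting characters until something that could begin a token appears (the helper names and the set of token-starting characters are invented for illustration):

#include <cctype>
#include <cstring>

// True if c could begin some token in a toy language (assumed character set).
bool canBeginToken(char c) {
    return std::isalpha(static_cast<unsigned char>(c)) ||
           std::isdigit(static_cast<unsigned char>(c)) ||
           std::strchr("<>=+-*/();", c) != nullptr;
}

// Panic mode: discard characters until a plausible token start is found.
const char* panicRecover(const char* p) {
    while (*p != '\0' && !canBeginToken(*p))
        ++p;              // delete successive characters
    return p;             // scanning resumes here
}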


2.2 Input Buffering:

 To recognize tokens, the analyzer often has to read at least one character beyond the lexeme, which incurs some overhead. To minimize this overhead, and thereby speed up reading, special buffering techniques have been developed.
 One such technique is the two-buffer scheme, in which two buffers are alternately reloaded:

E | = | M | * | C | * | * | 2 | eof
(lexemeBegin marks the start of the current lexeme; forward scans ahead)

 Buffer Pairs:
 Each buffer has the same size N, usually the size of a disk block, e.g., 4096 bytes.
 System read command is used to read N characters into a buffer.
 If fewer than N characters remain in the input file, then a special character, represented by eof, marks the end of the source file.
 Two pointers to the input are maintained.
1) Pointer lexemeBegin, marks the beginning of the current lexeme, whose extent we are
attempting to determine.
2) Pointer forward scans ahead until a pattern match is found.
 Once the next lexeme is determined, forward is set to the character at its right end.
 Then, after the lexeme is recorded as an attribute value of a token returned to the parser,
lexemeBegin is set to the character immediately after the lexeme just found.
 Advancing forward requires that we first test whether we have reached the end of one of the
buffers and if so we must reload the other buffer from the input and move forward to the
beginning of the newly loaded buffer.
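
A minimal sketch of the buffer-pair layout (hypothetical C++; '\0' stands in for the eof sentinel on the assumption that it cannot occur in source text):

#include <cstddef>
#include <cstdio>

constexpr int N = 4096;              // buffer size; typically one disk block
constexpr char SENTINEL = '\0';      // plays the role of eof inside the buffers

char buf[2 * (N + 1)];               // two halves, each with one sentinel slot
char* lexemeBegin = buf;             // marks the start of the current lexeme
char* forward = buf;                 // scans ahead until a pattern match is found

// Reload one half from the input and plant the sentinel after the valid bytes.
std::size_t reload(std::FILE* in, char* half) {
    std::size_t n = std::fread(half, 1, N, in);
    half[n] = SENTINEL;
    return n;                        // n < N signals true end of input
}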

 Sentinel:
 For each character read, we make two tests:
1) For the end of the buffer
2) To determine what character is read.
 Here, the buffer-end test can be done by using a sentinel character at the end of each buffer.
 A sentinel is a special character that cannot be part of the source program; eof is used as the sentinel.
 The arrangement is shown below:

E | = | M | * | eof | C | * | * | 2 | eof | … | eof
(an eof sentinel ends each buffer half; a final eof marks the end of the input; lexemeBegin and forward point into the buffer as before)

 The following algorithm (lookahead code with sentinels) summarizes how the forward pointer is advanced:


switch (*forward++)
{
case eof:
    if (forward is at end of first buffer)
    {
        reload second buffer;
        forward = beginning of second buffer;
    }
    else if (forward is at end of second buffer)
    {
        reload first buffer;
        forward = beginning of first buffer;
    }
    else /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    break;
cases for the other characters
}

2.3 Specification of tokens:

 Regular expressions are an important notation for specifying lexeme patterns. While they cannot
express all possible patterns, they are very effective in specifying those types of patterns that we
actually need for tokens.
 Strings and Languages:

 An alphabet is any finite set of symbols such as letters, digits, and punctuation.
o The set {0,1} is the binary alphabet.
o The ASCII character set is an alphabet.

 A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
o |s| represents the length of a string s, Ex: college is a string of length 7
o The empty string ε is the string of length zero.
o The empty string is the identity under concatenation; that is, for any string s, εs = sε = s.
o If x and y are strings, then the concatenation of x and y, denoted xy, is also a string.
For example, if x = hello and y = world, then xy = helloworld.

 A language is any countable set of strings over some fixed alphabet.


o Abstract languages like ∅, the empty set, or {ε}, the set containing only the empty string, are languages under this definition.

 Operations on Languages:
 The standard operations performed on languages are shown below:

Operation                     Definition
union of L and M              L ∪ M = { s | s is in L or s is in M }
concatenation of L and M      LM = { st | s is in L and t is in M }
Kleene closure of L           L* = union of L^i for i ≥ 0 (L^0 = {ε})
positive closure of L         L+ = union of L^i for i ≥ 1

 Example:
Let L be the set of letters {A, B, …, Z, a, b, …, z} and let D be the set of digits {0, 1, …, 9}.
Other languages can be constructed from L and D, using the operators illustrated above:

1. L ∪ D is the set of letters and digits; strictly speaking, the language with 62 (52+10) strings of length one, each of which is either one letter or one digit.
2. LD is the set of 520 (52×10) strings of length two, each consisting of one letter followed by one digit.
Ex: A1, a1, B0, etc.
3. L4 is the set of all 4-letter strings. (ex: aaba, bcef)
4. L* is the set of all strings of letters, including ε, the empty string.
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.

 Regular Expressions

 Regular Expressions can be defined as follows:

Basis:
1) ε is a regular expression, and L(ε) = {ε}.
2) If a is a symbol in the alphabet, then a is a regular expression, and L(a) = {a}.

Induction:
Larger regular expressions are built from smaller ones. Let r and s be regular expressions denoting the languages L(r) and L(s), respectively. Then:

1. (r) | (s) is a regular expression denoting the language L(r) ∪ L(s).
2. (r)(s) is a regular expression denoting the language L(r)L(s).
3. (r)* is a regular expression denoting (L(r))*.
4. (r) is a regular expression denoting L(r). This last rule says that we can add additional pairs of
parentheses around expressions without changing the language they denote.
For example, we may replace the regular expression (a) | ((b) * (c)) by a| b*c.


 Regular Definition:

 Writing a regular expression for some languages can be difficult, because their regular expressions can be quite complex. In such cases, we may use regular definitions.
 We can give names to regular expressions, and we can use these names as symbols to define other
regular expressions.
 A regular definition is a sequence of definitions of the form:

d1 → r1
d2 → r2
…
dn → rn

where each di is a new symbol not in the alphabet, and each ri is a regular expression over the alphabet together with the previously defined d1, …, di-1.
 Ex. 1: C identifiers are strings of letters, digits, and underscores. The regular definition for the
language of C identifiers:

Letter A | B | C|…| Z | a | b | … |z| -


digit  0|1|2 |… | 9
id  letter( letter | digit )*

Ex. 2: Unsigned numbers (integer or floating point) are strings such as 5280, 0.01234, 6.336E4,
or 1.89E-4. The regular definition is
digit → 0 | 1 | 2 | … | 9
digits → digit digit*
optionalFraction → . digits | ε
optionalExponent → ( E ( + | - | ε ) digits ) | ε
number → digits optionalFraction optionalExponent
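
As a quick check of this definition, the following sketch transliterates it into C++ std::regex notation, where ? plays the role of the "| ε" alternatives (an illustration, not part of the original notes):

#include <iostream>
#include <regex>
#include <string>

int main() {
    // digits (. digits)? (E (+|-)? digits)?
    std::regex number(R"([0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?)");

    for (std::string s : {"5280", "0.01234", "6.336E4", "1.89E-4", "abc"})
        std::cout << s << " -> "
                  << (std::regex_match(s, number) ? "number" : "no match") << '\n';
    return 0;
}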

2.4 Recognition of Tokens:


 Our current goal is to perform the lexical analysis needed for the following grammar.
stmt → if expr then stmt
| if expr then stmt else stmt

expr → term relop term // relop is relational operator =, >, etc
| term


term → id
| number

o The terminals of the grammar, which are if, then, else, relop, id, and number, are the names of
tokens as used by the lexical analyzer.
o The pattern description using regular definition for the terminals is
digit → [0-9]
digits → digit+
number → digits (. digits)? (E[+-]? digits)?
letter → [A-Za-z]
id → letter ( letter | digit )*
if → if
then → then
else → else
relop → < | > | <= | >= | = | <>

o The lexical analyzer also has the job of stripping out whitespace, by recognizing the "token" ws
defined by:
ws → ( blank | tab | newline ) +
o The tokens and attribute values for the above grammar are:

Lexemes        Token name    Attribute value
any ws         -             -
if             if            -
then           then          -
else           else          -
any id         id            pointer to symbol-table entry
any number     number        pointer to symbol-table entry
<              relop         LT
<=             relop         LE
=              relop         EQ
<>             relop         NE
>              relop         GT
>=             relop         GE
o Note:
 The lexer will be called by the parser when the latter needs a new token. If the lexer then
recognizes the token ws, it does not return it to the parser but instead goes on to recognize the
next token, which is then returned.
 We can't have two consecutive ws tokens in the input because, for a given token, the lexer
will match the longest lexeme starting at the current position that yields this token.
 For the parser, all the relational operators are to be treated the same so they are all the same
token, relop.
 Other parts of the compiler, for example the code generator, will need to distinguish between
the various relational operators so that appropriate code is generated. Hence, they have
distinct attribute values.

 To recognize tokens there are two steps:

1. Design of transition diagrams
2. Implementation of transition diagrams


 Transition diagram:
o A transition diagram is similar to a flowchart for (a part of) the lexer. We draw one for each
possible token. It shows the decisions that must be made based on the input seen.
o The two main components are circles representing states (think of them as decision points of the
lexer) and arrows representing edges (think of them as the decisions made).
o The transition diagram always begins in the start state before any input symbols have been read.
o The accepting states indicate that a lexeme has been found.
o Sometimes it is necessary to retract the forward pointer one position (i.e., the lexeme does not include the symbol that got us to the accepting state); in that case we additionally place a * near that accepting state.

 Transition diagram for the token relop:

o It is fairly clear how to write code corresponding to this diagram. Look at the first character; if it is <, look at the next character. If that character is =, return (relop, LE) to the parser.
o If instead that character is >, return (relop, NE).
o If it is any other character, return (relop, LT) and adjust the input buffer so that we will read this character again, since it is not part of the current lexeme.
o If the first character is =, return (relop, EQ).

o Recognition of reserved words and identifiers:

The transition diagram for identifiers has a start state with an edge on letter to a second state, a loop on letter or digit at that state, and an edge on any other character to a starred accepting state that executes return(gettoken(), installID()).

Note again the star affixed to the final state.


o Two questions remain.
1. How do we distinguish between identifiers and keywords such as then, which also match the
pattern in the transition diagram?
2. What do gettoken() and installID() do?

o There are two ways that we can handle reserved words that look like identifiers.
1. We will continue to assume that the keywords are reserved, i.e., may not be used as
identifiers.
 These reserved words should be installed into the symbol table prior to any invocation of the lexer. A field of the symbol-table entry will indicate that the entry is a keyword.


 When we find an identifier, installID() checks if the lexeme is already in the table. If it is
not present, the lexeme is installed as an id token. In either case a pointer to the entry is
returned.
 gettoken() examines the lexeme and returns the token name, either id or a name
corresponding to a reserved keyword.

2. Create separate transition diagrams for each keyword.

Ex: the transition diagram for the keyword then is a chain of states on t, h, e, n, followed by an edge labeled nonletter-or-digit into a starred accepting state.

Note: If we adopt this approach, then we must prioritize the tokens so that the reserved-word tokens are recognized in preference to id when a lexeme matches both patterns.
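
A minimal sketch of approach 1, with the symbol table preloaded with reserved words (hypothetical C++ types and names; a real installID() would also create the full symbol-table record):

#include <string>
#include <unordered_map>

enum TokenName { ID, IF, THEN, ELSE };   // illustrative token names

// Symbol table preloaded with the reserved words before the lexer runs.
std::unordered_map<std::string, TokenName> symtab = {
    {"if", IF}, {"then", THEN}, {"else", ELSE}};

// installID(): enter the lexeme as an id if it is not already present.
// gettoken(): return the stored token name, either a keyword or ID.
TokenName gettoken(const std::string& lexeme) {
    auto result = symtab.emplace(lexeme, ID);  // no-op if lexeme already present
    return result.first->second;
}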

o Recognizing Numbers:

o When an accepting state is reached, the identified number is entered into a separate table of numbers, and a pointer to the corresponding entry is returned.
o These numbers are needed when code is generated.
o Depending on the source language, we may wish to indicate in the table whether this is a real or
integer. A similar, but more complicated, transition diagram could be produced if the language
permitted complex numbers as well.

o Recognizing Whitespace:

o The delim in the diagram represents any of the whitespace characters, say space, tab, and
newline.
o The final star is there because we needed to find a non-whitespace character in order to know
when the whitespace ends and this character begins the next token.
o There is no action performed at the accepting state. Indeed the lexer does not return to the parser,
but starts again from its beginning as it still must find the next token.


o Architecture of a Transition-Diagram-Based Lexical Analyzer:

o The idea is that we write a piece of code for each transition diagram; collectively, these pieces build a lexical analyzer.
o Here, we may imagine a variable state holding the number of the current state for a transition
diagram.
o A switch based on the value of state takes us to code for each of the possible states, where we
find the action of the state.
o Ex: Implementation of relop transition diagram using C++ function is described below:

TOKEN getRelop()                       // TOKEN has two components: name and attribute
{
    TOKEN retToken = new(RELOP);       // first component (name) set here
    while (1)
    {                                  // repeat until a return or failure occurs
        switch (state)
        {
        case 0:
            c = nextChar();
            if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else fail();               // lexeme is not a relop
            break;
        case 1: ...
        ...
        case 8:
            retract();                 // an accepting state with a star
            retToken.attribute = GT;   // second component (attribute)
            return (retToken);
        }
    }
}

 This piece of code contains a case for each state, which typically reads a character and then
goes to the next case depending on the character read.
 The numbers in the circles are the names of the cases.
 Accepting states often need to take some action and return to the parser. Many of these
accepting states (the ones with stars) need to restore one character of input. This is called
retract() in the code.
 What should the code for a particular diagram do if at one state the character read is not one
of those for which a next state has been defined? That is, what if the character read is not the
label of any of the outgoing arcs? This means that we have failed to find the token
corresponding to this diagram. In this case, the code calls fail().
 This is not an error case. It simply means that the current input does not match
this particular token. So we need to go to the code section for another diagram
after restoring the input pointer so that we start the next diagram at the point
where this failing diagram started.
 If we have tried all the diagrams, then we have a real failure and need to print an error message and perhaps try to repair the input.

o Alternate Methods to fit the code into the lexical analyzer:


1. The order in which the diagrams are tried is important.


One possible ordering is sequential: if the input matches more than one token, the first one tried will be chosen.
For example, code must be written first for the keywords, followed by the identifiers.
2. Run the various transition diagrams in parallel. That is, each character read is passed to each diagram (that hasn't already failed). Care is needed when one diagram has accepted the input, but others still haven't failed. One strategy is to prefer the longest prefix of the input that matches some pattern.
Ex: prefer the identifier thenext to the keyword then.
3. The preferred approach is to combine all the diagrams into one. This is easy here because all the diagrams begin with different characters being matched; hence we simply have one large start state with multiple outgoing edges. In general, however, the problem of combining transition diagrams for several tokens is more complex.

2.5 The Lexical-Analyzer Generator (given as assignment topic, Q. No. 4)

2.6 Finite Automata:


o Finite automata are recognizers; they simply say "yes" or "no" about each possible input string.
o Finite automata come in two flavors:
(a) Nondeterministic finite automata (NFA) have no restrictions on the labels of their edges. A
symbol can label several edges out of the same state, and ɛ, the empty string, is a possible label.
(b) Deterministic finite automata (DFA) have, for each state and for each symbol of the input alphabet, exactly one edge with that symbol leaving that state.
o Both deterministic and nondeterministic finite automata are capable of recognizing the same
languages. The languages accepted by these automata are regular languages.

 Nondeterministic Finite Automata


o A nondeterministic finite automaton (NFA) consists of:
1. A finite set of states S.
2. A set of input symbols Ʃ, the input alphabet. We assume that ɛ, which stands for the empty
string, is never a member of Ʃ.
3. A transition function that gives, for each state and for each symbol in Ʃ ∪ {ɛ}, a set of next states.
4. A state s0 from S that is distinguished as the start state (or initial state).
5. A set of states F, a subset of S, that is distinguished as the accepting states (or final states).
o We can represent either an NFA or DFA by a transition graph, where the nodes are states and the
labeled edges represent the transition function. There is an edge labeled a from state s to state t if
and only if t is one of the next states for state s and input a.
 Example: The transition graph for an NFA recognizing the language of regular expression (a|b)*abb has states 0, 1, 2, 3, with start state 0 and accepting state 3: state 0 loops to itself on both a and b, and the chain 0 -a-> 1 -b-> 2 -b-> 3 recognizes strings ending in abb.

 Transition Tables
o We can also represent an NFA by a transition table, whose rows correspond to states, and whose
columns correspond to the input symbols and ɛ.
o The entry for a given state and input is the value of the transition function applied to those
arguments. If the transition function has no information about that state-input pair, we put ɸ in the
table for the pair.
o Example: The transition table for the above NFA is:

State    a        b        ɛ
0        {0,1}    {0}      ɸ
1        ɸ        {2}      ɸ
2        ɸ        {3}      ɸ
3        ɸ        ɸ        ɸ
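
The same table could be held in code roughly as follows (a hypothetical C++ sketch, not from the notes; missing entries stand for ɸ, and this particular NFA needs no ɛ-moves):

#include <map>
#include <set>

// delta[state][symbol] = set of next states for the NFA of (a|b)*abb.
std::map<int, std::map<char, std::set<int>>> delta = {
    {0, {{'a', {0, 1}}, {'b', {0}}}},
    {1, {{'b', {2}}}},
    {2, {{'b', {3}}}},
};

const int start = 0;
const std::set<int> accepting = {3};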


4.1 Introduction:

The Role of the Parser:

 The syntax analyzer obtains a string of tokens from the lexical analyzer, and verifies that
the string of token names can be generated by the grammar for the source language.
 i.e., Syntax Analyzer creates the syntactic structure of the given source program. This
syntactic structure is mostly a parse tree.
 Thus Syntax Analyzer is also known as parser.
 The syntax of a programming language is described by a context-free grammar (CFG).
 The syntax analyzer checks whether a given source program satisfies the rules implied by
a context-free grammar or not.
 If it satisfies, the parser creates the parse tree of that program.
 Otherwise the parser gives the error messages.
 It then passes the parse tree to the rest of the compiler for further processing.
 A context-free grammar
 Gives a precise, easy-to-understand, syntactic specification of a programming
language.
 Can be used effectively to construct an efficient parser that determines the syntactic structure of a source program. The parser-construction process can reveal syntactic ambiguities and trouble spots that might otherwise have gone unnoticed in the initial design phase of a language.
 Useful for translating source programs into correct object code and for detecting errors.
 Allows a language to be evolved or developed iteratively, by adding new constructs to
perform new tasks.
Source Program → Lexical Analyzer --(token)--> Syntax Analyzer --(parse tree)--> Rest of front end → intermediate representation
                 Lexical Analyzer <--(getNextToken)-- Syntax Analyzer
                 (both consult the Symbol Table)

Fig: Position of the parser in the compiler model

 There are three general types of parsers for grammars:
1) Universal: universal parsing methods such as the Cocke-Younger-Kasami algorithm and Earley's algorithm can parse any grammar; however, they are too inefficient to use in production compilers.
2) top-down: build parse trees from the top (root) to the bottom (leaves)


3) bottom-up: build parse trees from the leaves and work up to the root.
 Both top-down and bottom-up parsers scan the input from left to right (one symbol at a
time).
 Efficient top-down and bottom-up parsers can be implemented only for subclasses of
CFG’s:
 LL grammars for top-down parsing
 LR grammars for bottom-up parsing

Representative grammars:

 The expression grammar used for top-down parsing:

E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id

 The expression grammar used for bottom-up parsing:

E → E + T | T
T → T * F | F
F → ( E ) | id

Syntax Error Handling:

 Common programming errors can occur at many different levels:
1. Lexical errors: include misspelling of identifiers, keywords, or operators.
2. Syntactic errors: include misplaced semicolons or extra or missing braces.
3. Semantic errors: include type mismatches between operators and operands.
4. Logical errors: can be anything from incorrect reasoning on the part of the programmer.

 Goals of the Parser:

 Report the presence of errors clearly and accurately
 Recover from each error quickly enough to detect subsequent errors.
 Add minimal overhead to the processing of correct programs.

Error-Recovery Strategies:
1. Panic-Mode Recovery:
o In this method, on discovering an error, the parser discards input symbols one at a time
until one of a designated set of Synchronizing tokens is found.
o Synchronizing tokens are usually delimiters.
Ex: } or ; whose role in the source program is clear and unambiguous.
o Advantage:
 Simple method
 Is guaranteed not to go into an infinite loop


o Disadvantage:
 It often skips a considerable amount of input without checking it for additional errors.
 It requires careful selection of the synchronizing tokens.
2. Phrase-Level Recovery:
o In this method, a parser may perform local correction on the remaining input, i.e., it may replace a prefix of the remaining input by some string that allows the parser to continue.
o Ex: replace a comma by a semicolon, insert a missing semicolon
o Advantage:
 It is used in several error-repairing compilers, as it can correct any input string.
o Disadvantage:
 Difficulty in coping with situations in which the actual error occurred before the point of detection.
 This method is not guaranteed to avoid infinite loops.

3. Error Productions:
o Augment the grammar for the language with productions that would generate the erroneous
constructs.
o Then use this grammar augmented by the error productions to construct a parser.
o If an error production is used by the parser, we can generate appropriate error diagnostics
to indicate the erroneous construct that has been recognized in the input.

4. Global Correction:
o We use algorithms that perform a minimal sequence of changes to obtain a globally least-cost correction.
o Given an incorrect input string x and grammar G, these algorithms will find a parse tree
for a related string y such that the number of insertions, deletions and changes of tokens
required to transform x into y is as small as possible.
o It is too costly to implement in terms of time and space, so these techniques are currently only of theoretical interest.

4.2 Context-Free Grammars:


 Grammars describe the syntax of programming-language constructs like expressions and statements.
 A CFG is defined as G= (V,T,P,S) where
V is the finite set of non-terminals (variables)
T is the finite set of terminals (tokens)
P is the finite set of production rules, each of the form
A → α, where
A is a non-terminal and
α is a string of terminals and non-terminals (possibly including the empty string)
S is the start symbol (one of the non-terminal symbols)

Notational conventions:

1. Symbols used for terminals are :


a. Lower case letters early in the alphabet (such as a, b, c, . . .)


b. Operator symbols (such as +, *, . . . )


c. Punctuation symbols (such as parenthesis, comma and so on)
d. The digits (0…9)
e. Boldface strings and keywords (such as id or if) each of which represents a single terminal
symbol

2. Symbols used for non terminals are:


a. Uppercase letters early in the alphabet (such as A, B, C, …)
b. The letter S, which when it appears is usually the start symbol.
c. Lowercase, italic names (such as expr or stmt).

3. Lower case greek letters such as α, β, γ represent (possibly empty) strings of grammar symbols.
4. X, Y, Z represent grammar symbols(Terminal or Nonterminal)
5. u,v,…,z represent strings of terminals.
6. A α1 , A α2 , A α3 can be written as A α1 | α 2 | α3

Ex: The following grammar defines the arithmetic expression


expression -> expression + term
expression -> expression - term
expression -> term
term -> term * factor
term -> term / factor
term -> factor
factor-> ( expression )
factor -> id

Using the conventions listed above, the above grammar can be written as,
E -> E+T | E-T | T
T -> T*F | T/F | F
F -> ( E ) | id
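
Since the rest of this module transforms grammars mechanically (left recursion, left factoring,
FIRST and FOLLOW), it helps to fix a concrete representation. A minimal sketch in Python (our own
convention, not something from the text): a grammar is a dictionary mapping each nonterminal to a
list of production bodies, where a body is a list of symbols and [] stands for ɛ.

GRAMMAR = {
    "E": [["E", "+", "T"], ["E", "-", "T"], ["T"]],
    "T": [["T", "*", "F"], ["T", "/", "F"], ["F"]],
    "F": [["(", "E", ")"], ["id"]],
}
START = "E"

def terminals(grammar):
    # every symbol that appears in some body but never as a head
    syms = {s for bodies in grammar.values() for body in bodies for s in body}
    return syms - set(grammar)

print(sorted(terminals(GRAMMAR)))    # ['(', ')', '*', '+', '-', '/', 'id']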

Derivations:
 Consider the following grammar,
E -> E+E | E*E | -E | (E) | id

 A sequence of replacements of non-terminal symbols by its production body is called as


derivation
Ex: E => E+E => id+E => id+id

 In general, a derivation step is αAβ => αγβ where A -> γ is a production

 Since the above example involves multiple derivation steps, it can also be written compactly as
E ⇒* id+id (derives in zero or more steps).


 If we always choose the left-most non-terminal in each derivation step, this derivation is
called as left-most derivation(LMD)
Ex: E => E+E
=> id+E
=> id+id
 If we always choose the right-most non-terminal in each derivation step, this derivation is
called as right-most derivation.(RMD)
Ex: E => E+E
=> E+id
=> id+id

 If S ⇒* α, where S is the start symbol of a grammar G, we say that α is a sentential form
of G. A sentential form may contain both terminals and nonterminals, and may be empty.
Eg: In the above example, the sentential forms are E+E and E+id.

 The sentential forms obtained in a LMD are said to be as left sentential form whereas the
sentential forms obtained in a RMD are called as right sentential form.

 A sentence of G is a sentential form with no nonterminals.


Eg: In the above example, sentence is id+id
 The language generated by a grammar, L(G) is its set of sentences.
Thus, a string of terminals w is in L(G), if and only if w is a sentence of G (i.e., S ⇒* w).
 If G is a context-free grammar, L(G) is a context-free language.
 Two grammars are equivalent if they produce the same language.

Parse Trees and derivations:


 Root node has the Start variable
 Inner nodes of a parse tree are non-terminal symbols.
 The leaves of a parse tree are terminal symbols.
 The leaves of a parse tree when read from left to right constitute a sentential form, called
yield or frontier of the tree.
 A parse tree can be seen as a graphical representation of a derivation.
 Ex: Parse tree construction for the string –(id+id) is shown below along with the
derivation


Problems:

1. Consider the grammar
S -> ( L ) | a
L -> L , S | S
i. What are the terminals, non-terminals and the start symbol?
ii. Construct parse tree for the following sentence
a. ( a , a)
b. (a, (a, a ) )
c. (a, ( ( a, a ) , ( a , a ) ) )
d. ( ( a, a ) , a , ( a ) )
iii. Obtain LMD and RMD for each.

2. Do the above steps for the following grammars:


a) S  aS | aSbS | ε for the string aaabaab
b) SSS+|SS*|a for the string aa+a*
c) S0S1|01 with string 000111.
d) S  + SS | * S S | a with string + * aaa.
e) S  S (S) S | ε with string (()()).
f) S S + S | S S | ( S ) | S *| a with string (a + a) * a.
g) SaSbS|bSaS|ε with string aabbab.

Ambiguity:
 A grammar that produces more than one parse tree for a sentence is called as an ambiguous
grammar.
 Ex: For the grammar E -> E+E | E*E | -E | (E) | id above, the sentence id+id*id has two
distinct parse trees: one groups the operands as id+(id*id), the other as (id+id)*id.
 For most parsers, the grammar must be unambiguous.


 i.e.,We should eliminate the ambiguity in the grammar during the design phase of the
compiler.

Verifying the Language Generated by a Grammar


Although compiler designers rarely do so for a complete programming-language grammar, it is


useful to be able to reason that a given set of productions generates a particular language.
Troublesome constructs can be studied by writing a concise, abstract grammar and studying the
language that it generates.
We shall construct such a grammar for conditional statements below.
A proof that a grammar G generates a language L has two parts: show that every string generated
by G is in L, and conversely that every string in L can indeed be generated by G.

Context-Free Grammars versus Regular Expressions


Grammars are a more powerful notation than regular expressions. Every construct that can be
described by a regular expression can be described by a grammar, but not vice-versa.
Alternatively, every regular language is a context-free language, but not vice-versa.

4.3 Writing a Grammar

Grammars are capable of describing most of the syntax of programming languages. The sequences
of tokens accepted by a parser form a superset of the programming language; subsequent phases
of the compiler must analyze the output of the parser to ensure compliance with rules that are not
checked by the parser.

Lexical Versus Syntactic Analysis


"Why do we use regular expressions to define the lexical syntax of a language?" There are several
reasons.
1. Separating the syntactic structure of a language into lexical and nonlexical parts provides
a convenient way of modularizing the front end of a compiler into two manageable-sized
components.
2. The lexical rules of a language are frequently quite simple, and to describe them we do not
need a notation as powerful as grammars.
3. Regular expressions generally provide a more concise and easier-to-understand notation
for tokens than grammars.
4. More efficient lexical analyzers can be constructed automatically from regular expressions
than from arbitrary grammars.

 Regular expressions are most useful for describing the structure of constructs such as
identifiers, constants, keywords, and white space.
 Grammars, on the other hand, are most useful for describing nested structures such as balanced
parentheses, matching begin-end's, corresponding if-then-else's, and so on. These nested
structures cannot be described by regular expressions.

4.3.2 Eliminating Ambiguity


An ambiguous grammar can be rewritten to eliminate the ambiguity.
As an example, we shall eliminate the ambiguity from the following "dangling else" grammar:
stmt -> if expr then stmt
| if expr then stmt else stmt
| other


Here "other" stands for any other statement. According to this grammar, the compound
conditional statement “ if E1 then S1 else if E2 then S2 else S3 “ has the following parse tree:

However, the Grammar is ambiguous since the string


if E1 then if E2 then S1 else S2 has the following two parse trees.

In all programming languages with conditional statements of this form, the second parse tree is
preferred. The general rule is, "Match each else with the closest unmatched then."

We can rewrite the above dangling-else grammar as the following unambiguous grammar.
 The idea is that a statement appearing between a then and an else must be "matched"; that
is, the interior statement must not end with an unmatched or open then.
 A matched statement is either an if-then-else statement containing no open statements or
it is any other kind of unconditional statement.
 Thus, we may use the following grammar , that allows only one parsing for string; namely,
the one that associates each else with the closest previous unmatched then.

stmt -> matchedstmt | openstmt


matchedstmt -> if expr then matchedstmt else matchedstmt | other
openstmt -> if expr then stmt | if expr then matchedstmt else openstmt

Ambiguity – Operator Precedence

Ambiguous grammars (because of ambiguous operators) can be disambiguated according to the


precedence and associativity rules.
E -> E+E | E*E | E^E | id | (E)

disambiguate the grammar


precedence: ^ (right to left)
* (left to right)
+ (left to right)


E -> E+T | T
T -> T*F | F
F -> G^F | G
G -> id | (E)

Elimination of Left Recursion


 A grammar is left recursive if it has a nonterminal A such that there is a derivation
A ⇒+ Aα (one or more steps) for some string α.
 Top-down parsing methods cannot handle left-recursive grammars, so a transformation is
needed to eliminate left recursion.
 immediate left recursion: If there is a production of the form A -> A𝜶 | 𝜷 then it could be
replaced by the non-left-recursive productions:
A -> βA'
A'->αA' | ɛ

Example: Consider the expression grammar,


E  E+T | T
T  T*F | F
F  ( E ) | id

The non-left-recursive expression grammar is


E -> T E'
E' -> + T E' | ɛ
T -> FT'
T' -> * F T' | ɛ
F -> (E) | id

Immediate left recursion can be eliminated by the following technique, which works for
any number of A-productions.

First, group the productions as


A -> 𝐴𝛼1 | 𝐴𝛼2 | … | 𝐴𝛼𝑚 | 𝛽1| 𝛽2 | ... | 𝛽𝑛 where no 𝛽𝑖 begins with an A.

Then, replace the A-productions by


A -> 𝛽1 𝐴′ | 𝛽2 𝐴′ | … | 𝛽𝑛 𝐴′
A' -> 𝛼1 A' | 𝛼2 A' | … | 𝛼𝑚 A' | ɛ

This procedure eliminates all left recursion from the A and A' productions (provided no 𝛼𝑖
is ɛ).

 But the above procedure does not eliminate left recursion involving derivations of two or
more steps.
For example, consider the grammar
S -> Aa | b
A -> Ac | Sd | ɛ


The nonterminal S is left recursive because S ⇒ Aa ⇒ Sda, but it is not immediately left
recursive.

The algorithm below systematically eliminates left recursion from a grammar.
It is guaranteed to work if
 The grammar has no cycles (derivations of the form A ⇒+ A)
 The grammar has no ɛ-productions (productions of the form A -> ɛ).
(A sketch of the procedure in code follows the worked example.)

Algorithm: Eliminating left recursion.

INPUT : Grammar G with no cycles or ɛ-productions.


OUTPUT: An equivalent grammar with no left recursion.
METHOD: Apply the below algorithm to G. Note that the resulting non-left-recursive grammar
may have ɛ -productions.
1) Arrange the nonterminals in some order A1,A2,... ,An.
2) for ( each i from 1 to n ) {
3) for ( each j from 1 to i-1 ) {
4) replace each production of the form Ai -> Aj𝛾 by the
productions Ai -> 𝛿1 𝛾| 𝛿2 𝛾 | • • • | 𝛿𝑘 𝛾, where
Aj ->𝛿1 | 𝛿2 | . .. | 𝛿𝑘 are all current Aj -productions
5) }
6) eliminate the immediate left recursion among the Ai -productions
7) }

Example:
Consider the grammar
S -> Aa | b
A -> Ac | Sd | ɛ

 We order the nonterminals S, A.


 For i = 1, there is no immediate left recursion among the S-productions, so nothing
happens.
 For i = 2, we substitute for S in A -> Sd to obtain the following A-productions.
A-> Ac | Aad | bd | ɛ
Eliminating the immediate left recursion among these A-productions yields the following
grammar.
S -> Aa | b
A -> bdA' | A'
A' -> cA' | adA' | ɛ
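
The algorithm is mechanical enough to run. Below is a hedged Python sketch (the function names and
the primed-nonterminal naming scheme are our choices), using the dictionary representation
introduced earlier with [] standing for ɛ; it reproduces the S, A example just worked:

def eliminate_immediate(grammar, A):
    """Replace A -> Aα | β by A -> βA', A' -> αA' | ɛ ([] stands for ɛ)."""
    alphas = [body[1:] for body in grammar[A] if body[:1] == [A]]
    betas = [body for body in grammar[A] if body[:1] != [A]]
    if not alphas:
        return
    A1 = A + "'"
    grammar[A] = [beta + [A1] for beta in betas]
    grammar[A1] = [alpha + [A1] for alpha in alphas] + [[]]

def eliminate_left_recursion(grammar, order):
    for i, Ai in enumerate(order):
        for Aj in order[:i]:
            new_bodies = []
            for body in grammar[Ai]:   # replace Ai -> Aj γ by Ai -> δ γ for each Aj -> δ
                if body[:1] == [Aj]:
                    new_bodies += [delta + body[1:] for delta in grammar[Aj]]
                else:
                    new_bodies.append(body)
            grammar[Ai] = new_bodies
        eliminate_immediate(grammar, Ai)

g = {"S": [["A", "a"], ["b"]], "A": [["A", "c"], ["S", "d"], []]}
eliminate_left_recursion(g, ["S", "A"])
for head, bodies in g.items():
    print(head, "->", " | ".join(" ".join(b) or "ɛ" for b in bodies))
# S -> A a | b
# A -> b d A' | A'
# A' -> c A' | a d A' | ɛ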

Left Factoring


 Left factoring is a grammar transformation that is useful for producing a grammar suitable
for predictive, or top-down, parsing.
 When the choice between two alternative A-productions is not clear, we may be able to rewrite
the productions to defer the decision until enough of the input has been seen that we can make
the right choice.

For example, if we have the two productions


stmt -> if expr then stmt else stmt | if expr then stmt

on seeing the input if, we cannot immediately tell which production to choose to expand stmt.

 In general, if A -> 𝜶𝜷𝟏 | 𝜶𝜷𝟐 are two A-productions, and the input begins with a nonempty
string derived from α, we do not know whether to expand A to 𝛼𝛽1 or 𝛼𝛽2.
 However, we may defer the decision by left factoring, so that the original productions become
A -> αA'
A' -> β1 | β2

Algorithm: Left factoring a grammar.


INPUT: Grammar G.
OUTPUT: An equivalent left-factored grammar.
METHOD:
 For each nonterminal A, find the longest prefix α common to two or more of its alternatives.
 If α ≠ ɛ — i.e., there is a nontrivial common prefix — replace all of the A-productions
A -> 𝛼𝛽1 | 𝛼𝛽2 | • • • | 𝛼𝛽𝑛 | 𝛾, where 𝛾 represents all alternatives that do not begin with 𝛼,
by
A -> αA' | 𝛾
A' -> 𝛽1 | 𝛽2 | • • • | 𝛽𝑛
Here A' is a new nonterminal.
 Repeatedly apply this transformation until no two alternatives for a nonterminal have a
common prefix. (A sketch of the transformation in code follows the example below.)

Example: Left factor the following grammar which abstracts the "dangling-else" problem:
S -> iEtS | iEtSeS | a
E -> b
Here, i, t, and e stand for if, then, and else; E and S stand for "conditional
expression" and "statement."

Left-factored, this grammar becomes:


S -> iEtSS' | a
S' -> eS | ɛ
E -> b
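
A rough Python sketch of this transformation, under the same grammar representation ([] is ɛ; the
fresh-name scheme A', A'' is ours). Run on the abstract dangling-else grammar, it reproduces the
left-factored result above (alternatives may come out in a different order):

def common_prefix(x, y):
    p = []
    for a, b in zip(x, y):
        if a != b:
            break
        p.append(a)
    return p

def left_factor(grammar):
    changed = True
    while changed:
        changed = False
        for A in list(grammar):
            bodies = grammar[A]
            best = []                 # longest prefix shared by two or more alternatives
            for i in range(len(bodies)):
                for j in range(i + 1, len(bodies)):
                    p = common_prefix(bodies[i], bodies[j])
                    if len(p) > len(best):
                        best = p
            if not best:
                continue
            A1 = A + "'"
            while A1 in grammar:      # pick a fresh primed name
                A1 += "'"
            shared = [b for b in bodies if b[:len(best)] == best]
            others = [b for b in bodies if b[:len(best)] != best]
            grammar[A] = [best + [A1]] + others
            grammar[A1] = [b[len(best):] for b in shared]
            changed = True

g = {"S": [["i", "E", "t", "S"], ["i", "E", "t", "S", "e", "S"], ["a"]],
     "E": [["b"]]}
left_factor(g)
for head, bodies in g.items():
    print(head, "->", " | ".join(" ".join(b) or "ɛ" for b in bodies))
# S -> i E t S S' | a
# E -> b
# S' -> ɛ | e S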

Non-Context-Free Language Constructs


 A few syntactic constructs found in typical programming languages cannot be specified using
grammars alone.


 Example 1:
o The language in this example abstracts the problem of checking that identifiers are
declared before they are used in a program.
o The language consists of strings of the form wcw, where the first w represents the
declaration of an identifier w, c represents an intervening program fragment, and the
second w represents the use of the identifier.
o The abstract language is L = { wcw | w is in (a|b)* }.
o L consists of all words composed of a repeated string of a's and b's separated by c, such
as aabcaab.
o The noncontext-freedom of L directly implies the non-context-freedom of
programming languages like C and Java, which require declaration of identifiers before
their use and which allow identifiers of arbitrary length. For this reason, a grammar for
C or Java does not distinguish among identifiers that are different character strings.
Instead, all identifiers are represented by a token such as id in the grammar. In a
compiler for such a language, the semantic-analysis phase checks that identifiers are
declared before they are used.

 Example 2 :
o The problem of checking that the number of formal parameters in the declaration of a
function agrees with the number of actual parameters in a use of the function.
o The language consists of strings of the form a^n b^m c^n d^m. Here a^n and b^m could represent the
formal-parameter lists of two functions declared to have n and m arguments, respectively,
while c^n and d^m represent the actual-parameter lists in calls to these two functions.
o The abstract language is L2 = { a^n b^m c^n d^m | n ≥ 1 and m ≥ 1 }. That is, L2 consists of strings
in the language generated by the regular expression a*b*c*d* such that the number of a's
and c's are equal and the number of b's and d's are equal.
o This language is not context free.
o The typical syntax of function declarations and uses does not concern itself with counting
the number of parameters. For example, a function call in C-like language might be
specified by
Stmt -> id ( expr_list)
expr_list -> expr_list , expr | expr
with suitable productions for expr. Checking that the number of parameters in a call is
correct is usually done during the semantic-analysis phase.

Exercises:

For each of the following grammars,


a) Left factor the grammar.
b) In addition to left factoring, eliminate left recursion from the original grammar.

1) rexpr  rexpr + rterm | rterm


rterm  rterm rfactor \ rfactor
rfactor  rfactor * | rprimary
rprimary  a|b


2) S  S S + | S S * | a

3) S  0 S 1 | 0 1

4) S  ( L ) | a
L L , S | S

5) bexpr  bexpr or bterm | bterm


bterm  bterm and bfactor \ bfactor
bfactor  not bfactor | ( bexpr ) | true | false

4.4 Top-Down Parsing:


Top-down parsing can be viewed as the problem of constructing a parse tree for the input string,
starting from the root and creating the nodes of the parse tree in preorder (depth-first).
Equivalently, top-down parsing can be viewed as finding a leftmost derivation for an input string.
Example: Consider the grammar,
E  TE'
E'  + TE' | ɛ
T  FT'
T'  *FT' | ɛ
F  ( E ) | id
The sequence of parse trees for the input id+id*id corresponds to a leftmost derivation of the
input (see below).


High-level classification of top-down parsers:

Recursive-Descent Parsing:

void A( ) {
1) Choose an A-production, A -> X1 X2 · · · Xk;
2) for ( i = 1 to k ) {
3) if ( Xi is a nonterminal )
4) call procedure Xi( );
5) else if ( Xi equals the current input symbol a )
6) advance the input to the next symbol;
7) else /* an error has occurred */;
}
}
Figure: A typical procedure for a nonterminal in a top-down parser

 A recursive-descent parsing program consists of a set of procedures, one for each


nonterminal.
 Execution begins with the procedure for the start symbol, which halts and announces
success if its procedure body scans the entire input string. Pseudocode for a typical
nonterminal is shown in the above figure.
 General recursive-descent may require backtracking; that is, it may require repeated scans
over the input. However, backtracking is rarely needed to parse programming language
constructs, so backtracking parsers are not seen frequently.
 To allow backtracking, the above code needs to be modified.
o First, we cannot choose a unique A-production at line (1), so we must try each of several
Productions in some order.
o Then, failure at line (7) is not ultimate failure, but suggests only that we need to return
to line (1) and try another A-production.
o Only if there are no more A-productions to try do we declare that an input error has
been found.
o In order to try another A-production, we need to be able to reset the input pointer to
where it was when we first reached line (1). Thus, a local variable is needed to store
this input pointer for future use.

o Example: Consider the grammar


S -> cAd
A -> ab | a

 To construct a parse tree top-down for the input string w = cad, begin with a tree
consisting of a single node labeled S, and the input pointer pointing to c, the first
symbol of w.
 S has only one production, so we use it to expand S and obtain the tree of Fig.
4.14(a).
 The leftmost leaf, labeled c, matches the first symbol of input w, so we advance the
input pointer to a, the second symbol of w, and consider the next leaf, labeled A.
 Now, we expand A using the first alternative A -> ab to obtain the tree of Fig.
4.14(b). We have a match for the second input symbol, a, so we advance the input
pointer to d, the third input symbol, and compare d against the next leaf, labeled b.
 Since b does not match d, we report failure and go back to A to see whether there
is another alternative for A that has not been tried, but that might produce a match.
 In going back to A, we must reset the input pointer to position 2, the position it had
when we first came to A, which means that the procedure for A must store the input
pointer in a local variable. The second alternative for A produces the tree of Fig.
4.14(c).
 The leaf a matches the second symbol of w and the leaf d matches the third symbol.
 Since we have produced a parse tree for w, we halt and announce successful
completion of parsing.

Figure 4.14 : Steps in a top-down parser

 A left-recursive grammar can cause a recursive-descent parser, even one with


backtracking, to go into an infinite loop. That is, when we try to expand a nonterminal A,
we may eventually find ourselves again trying to expand A without having consumed any
input.
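
The walk-through translates almost line for line into code. A minimal Python sketch for the toy
grammar S -> cAd, A -> ab | a (the saved variable is the local input pointer mentioned above; this
simple scheme backtracks only inside A, which suffices for this grammar):

pos, inp = 0, ""

def match(ch):
    """Consume ch if it is the current input symbol."""
    global pos
    if pos < len(inp) and inp[pos] == ch:
        pos += 1
        return True
    return False

def A():
    global pos
    saved = pos                      # store the input pointer in a local variable
    if match('a') and match('b'):    # try the first alternative, A -> ab
        return True
    pos = saved                      # failure: reset the input pointer, try A -> a
    return match('a')

def S():                             # S -> cAd
    return match('c') and A() and match('d')

def parse(w):
    global pos, inp
    pos, inp = 0, w
    return S() and pos == len(inp)

print(parse("cad"))    # True: succeeds after backtracking out of A -> ab
print(parse("cabd"))   # True
print(parse("cd"))     # False: no alternative for A matches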

FIRST and FOLLOW


 The construction of both top-down and bottom-up parsers is aided by two functions,
FIRST and FOLLOW, associated with a grammar G.
 During top-down parsing, FIRST and FOLLOW allow us to choose which production to
apply, based on the next input symbol.
 During panic-mode error recovery, sets of tokens produced by FOLLOW can be used as
synchronizing tokens.

 FIRST(𝜶) is defined as the set of terminals that begin strings derived from α, where

α is any string of grammar symbols. If α ⇒* ɛ, then ɛ is also in FIRST(α).


 For nonterminal A, we define FOLLOW(A) , to be the set of the terminals a which


occur immediately after (follow) the non-terminal A in some sentential form;

That is, the set of terminals a such that there exists a derivation of the form S ⇒* αAaβ, for
some α and β.

 To compute FIRST(X) for all grammar symbols X, apply the following rules until no more
terminals or ɛ can be added to any FIRST set.
1. If X is a terminal, then FIRST(X) = { X }.
2. If X is a nonterminal and X -> Y1Y2 • • • Yk is a production for some k ≥ 1, then
 place a in FIRST(X) if for some i,
a is in FIRST(Yi), and ɛ is in all of FlRST(Y1), ... , FIRST(Yi-1);

that is, Y1 · · · Yi-1 ⇒* ɛ.
 If ɛ is in FIRST(Yj) for all j = 1,2,... , k, then add ɛ to FIRST(X).
3. If X -> ɛ is a production, then add ɛ to FIRST(X).

 Now, we can compute FIRST for any string X1X2 • • • Xn as follows.


 Add to FlRST(X1X2 • • • Xn) all non-ɛ symbols of FIRST (X1).
 Also add the non- ɛ symbols of FIRST(X2), if ɛ is in FlRST(X1);
 Add the non- ɛ symbols of FIRST(X3), if ɛ is in FIRST(X1) and FIRST(X2); and so
on.
 Finally, add ɛ to FIRST(X1X2 • • • Xn) if, for all i, ɛ is in FIRST (Xi).

 To compute FOLLOW(A) for all nonterminals A, apply the following rules until nothing
can be added to any FOLLOW set. (A fixed-point sketch of both computations in code follows the rules.)
1. Place $ in FOLLOW(S), where S is the start symbol, and $ is the input right endmarker.
2. If there is a production A->αBβ, then everything in FIRST(β) except ɛ is in
FOLLOW(B).
3. If there is a production A->αB, or a production A -> αBβ, where FIRST ( β ) contains ɛ,
then everything in FOLLOW( A ) is in FOLLOW( B ) .
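
Both computations are fixed points: keep applying the rules until no set grows. A Python sketch
under the earlier grammar convention ([] stands for ɛ; helper names are ours), shown for the
non-left-recursive expression grammar:

EPS = "ɛ"
G = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],      # [] is the ɛ-body
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}

def first_of(symbols, FIRST):
    """FIRST of a string X1 X2 ... Xn, per the rules above."""
    out = set()
    for X in symbols:
        out |= FIRST[X] - {EPS}
        if EPS not in FIRST[X]:
            return out
    return out | {EPS}                 # every Xi can derive ɛ (or the string is empty)

def compute_first(g):
    FIRST = {t: {t} for bodies in g.values() for b in bodies for t in b if t not in g}
    FIRST.update({A: set() for A in g})
    changed = True
    while changed:                     # iterate to a fixed point
        changed = False
        for A, bodies in g.items():
            for body in bodies:
                f = first_of(body, FIRST)
                if not f <= FIRST[A]:
                    FIRST[A] |= f
                    changed = True
    return FIRST

def compute_follow(g, start, FIRST):
    FOLLOW = {A: set() for A in g}
    FOLLOW[start].add("$")             # rule 1
    changed = True
    while changed:
        changed = False
        for A, bodies in g.items():
            for body in bodies:
                for i, B in enumerate(body):
                    if B not in g:
                        continue       # only nonterminals have FOLLOW sets
                    f = first_of(body[i + 1:], FIRST)
                    add = (f - {EPS}) | (FOLLOW[A] if EPS in f else set())  # rules 2, 3
                    if not add <= FOLLOW[B]:
                        FOLLOW[B] |= add
                        changed = True
    return FOLLOW

FIRST = compute_first(G)
FOLLOW = compute_follow(G, "E", FIRST)
print(sorted(FIRST["E"]), sorted(FIRST["E'"]))   # ['(', 'id'] ['+', 'ɛ']
print(sorted(FOLLOW["T'"]))                      # ['$', ')', '+']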

Transition Diagrams for Predictive Parsers


 Transition diagrams are useful for visualizing predictive parsers.
 To construct the transition diagram from a grammar, first eliminate left recursion and then
left factor the grammar.
 Then, for each nonterminal A,
1. Create an initial and final (return) state.


2. For each production A -> X1X2 … Xk, create a path from the initial to the final state, with
edges labeled X1, X2, …, Xk. If A -> 𝜖, the path is a single edge labeled 𝜖.
 Transition diagrams for predictive parsers have one diagram for each nonterminal. The
labels of edges can be tokens or nonterminals.
 A transition on a token (terminal) means that we take that transition if that token is the next
input symbol.
 A transition on a nonterminal A is a call of the procedure for A.
 Example: Transition diagrams for the nonterminals E and E'

LL(1) Grammars
 Predictive parsers, that is, recursive-descent parsers needing no backtracking, can be
constructed for a class of grammars called LL(1).
 The first "L" in LL(1) stands for scanning the input from left to right, the second "L" for
producing a leftmost derivation, and the " 1 " for using one input symbol of lookahead at
each step to make parsing action decisions.
 No left-recursive or ambiguous grammar can be LL(1).
 A grammar G is LL(1) if and only if whenever A -> α | β are two distinct productions of
G, the following conditions hold:
1. For no terminal a do both α and β derive strings beginning with a.
2. At most one of α and 𝛽 can derive the empty string.

3. If β ⇒* ɛ, then α does not derive any string beginning with a terminal in FOLLOW(A).

Likewise, if α ⇒* ɛ, then β does not derive any string beginning with a terminal in
FOLLOW(A).

The first two conditions are equivalent to the statement that FIRST(α) and FIRST(β) are
disjoint sets.
The third condition is equivalent to stating that if ɛ is in FIRST(β), then FIRST (α) and
FOLLOW(A) are disjoint sets, and likewise if ɛ is in FIRST(α).
Predictive parsers can be constructed for LL(1) grammars since the proper production to apply
for a nonterminal can be selected by looking only at the current input symbol.

Algorithm: Construction of a predictive parsing table,


INPUT: Grammar G.
OUTPUT: Parsing table M, a two-dimensional array M[A, a], where A is a nonterminal, and a is
a terminal or the symbol $, the input endmarker.
METHOD: For each production A -> α of the grammar, do the following:
1. For each terminal a in FIRST(α), add A -> α to M[A, a].
2. If 𝜖 is in FlRST(α), then for each terminal b in FOLLOW(A), add A -> α to M[A,b].
If 𝜖 is in FIRST(α) and $ is in FOLLOW(A), add A -> α to M[A, $] as well.


If, after performing the above, there is no production at all in M[A, a], then set M[A, a] to
error (which we normally represent by an empty entry in the table). A sketch of this
construction in code follows.
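
A Python sketch of this method, with the FIRST and FOLLOW sets of the expression grammar written
out by hand (they match Example 1 below; [] stands for ɛ). The conflict check makes the
"multiply defined entry" case concrete:

EPS = "ɛ"
G = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}
FIRST = {"E": {"(", "id"}, "E'": {"+", EPS}, "T": {"(", "id"}, "T'": {"*", EPS},
         "F": {"(", "id"}, "+": {"+"}, "*": {"*"}, "(": {"("}, ")": {")"}, "id": {"id"}}
FOLLOW = {"E": {")", "$"}, "E'": {")", "$"}, "T": {"+", ")", "$"},
          "T'": {"+", ")", "$"}, "F": {"+", "*", ")", "$"}}

def first_of(body):
    out = set()
    for X in body:
        out |= FIRST[X] - {EPS}
        if EPS not in FIRST[X]:
            return out
    return out | {EPS}

def build_table(g):
    M = {}
    for A, bodies in g.items():
        for body in bodies:
            f = first_of(body)
            targets = f - {EPS}                  # rule 1: a in FIRST(α)
            if EPS in f:
                targets |= FOLLOW[A]             # rule 2: b (or $) in FOLLOW(A)
            for a in targets:
                if (A, a) in M:                  # multiply defined: G is not LL(1)
                    raise ValueError(f"conflict at M[{A}, {a}]")
                M[A, a] = body
    return M

M = build_table(G)
print(M["E'", "+"])    # ['+', 'T', "E'"]
print(M["T'", ")"])    # []  (the ɛ-production, selected via FOLLOW(T'))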

Example 1 : For the following expression grammar,


E  TE'
E'  + TE' | ɛ
T  FT'
T'  *FT' | ɛ
F  ( E ) | id
a) Compute FIRST & FOLLOW
b) Construct predictive parsing table

a) FIRST sets:
FIRST(E) = FIRST(T) = FIRST(F) = { (, id }
FIRST(E') = { +, ɛ }
FIRST(T') = { *, ɛ }
FOLLOW sets:
FOLLOW (E) = {$, ) }
FOLLOW (E') = {$, ) }
FOLLOW (T) = {+, ), $}
FOLLOW (T') = {+, ), $}
FOLLOW (F) = {+, *, ), $}

b) Parsing Table :

 The construction of predictive parsing table Algorithm can be applied to any grammar G
to produce a parsing table M.
 For every LL(1) grammar, each parsing-table entry uniquely identifies a production or
signals an error.
 For some grammars, however, M may have some entries that are multiply defined. For
example, if G is left-recursive or ambiguous, then M will have at least one multiply defined
entry.
 Although left recursion elimination and left factoring are easy to do, there are some
grammars for which no amount of alteration will produce an LL(1) grammar.

 Example 2: Show that the following grammar is not LL(1).


S -> iEtSS' | a
S' -> eS | 𝜖
E -> b

FIRST(S) = { i, a }
FIRST(S') = { e, 𝜖 }
FIRST(E) = { b }

FOLLOW(S) = { e, $ }
FOLLOW(S') = { e, $ }
FOLLOW(E) = { t }

The parsing table:

 The entry for M[S', e] contains both S' -> eS and S' -> 𝜖.
 Therefore, the grammar is not LL(1).
 The grammar is ambiguous and the ambiguity is manifested by a choice in what
production to use when an e (else) is seen.
 We can resolve this ambiguity by choosing S'->eS. This choice corresponds to
associating an else with the closest previous then.

Nonrecursive Predictive Parsing:


 A nonrecursive predictive parser can be built by maintaining a stack explicitly, rather than
implicitly via recursive calls.
 The parser mimics a leftmost derivation. If w is the input that has been matched so far, then
the stack holds a sequence of grammar symbols α such that

S ⇒*lm wα (a leftmost derivation)
 The following is a model of a table-driven predictive parser


 The parser has an input buffer, a stack containing a sequence of grammar symbols, a
parsing table constructed by Algorithm (Construction of a predictive parsing table) and an
output stream.
 The input buffer contains the string to be parsed, followed by the endmarker $. We reuse
the symbol $ to mark the bottom of the stack, which initially contains the start symbol of
the grammar on top of $.
 The parser is controlled by a program that considers X, the symbol on top of the stack, and
a, the current input symbol.
 If X is a nonterminal, the parser chooses an X-production by consulting entry M[X, a]
of the parsing table M.
 Otherwise, it checks for a match between the terminal X and current input symbol a.
 The behavior of the parser can be described in terms of its configurations, which give the
stack contents and the remaining input.

Algorithm: Table-driven predictive parsing (Describes how configurations are manipulated)

INPUT: A string w and a parsing table M for grammar G.

OUTPUT: If w is in L(G), a leftmost derivation of w; otherwise, an error indication.

METHOD: Initially, the parser is in a configuration with w$ in the input buffer and the start
symbol S of G on top of the stack, above $. The program in above figure uses the predictive parsing
table M to produce a predictive parse for the input.

set ip to point to the first symbol of w;


set X to the top stack symbol;
while ( X ≠ $ ) { /* stack is not empty */
if ( X = a )
pop the stack and advance ip;
else if ( X is a terminal )
error( );


else if ( M[X,a] is an error entry )


error( );
else if ( M[X,a] = X -> Y1Y2 … Yk ) {
output the production X -> Y1Y2 … Yk;
pop the stack;
push Yk ,Yk-1,... ,Y1 onto the stack, with Y1 on top;
}
set X to the top stack symbol;
}
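
The driver loop transcribes directly into Python. A sketch assuming the table M built in the
previous sketch for the expression grammar (only the non-blank entries are listed):

M = {
    ("E", "id"): ["T", "E'"], ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"], ("T", "("): ["F", "T'"],
    ("T'", "*"): ["*", "F", "T'"], ("T'", "+"): [], ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

def parse(tokens, start="E"):
    tokens = tokens + ["$"]
    stack = ["$", start]                 # start symbol on top of $
    ip = 0
    while stack[-1] != "$":              # stack is not empty
        X, a = stack[-1], tokens[ip]
        if X == a:                       # terminal on top matches the input
            stack.pop(); ip += 1
        elif X not in NONTERMINALS:      # terminal on top, but no match
            raise SyntaxError(f"expected {X!r}, saw {a!r}")
        elif (X, a) not in M:            # blank (error) table entry
            raise SyntaxError(f"no entry for M[{X}, {a}]")
        else:
            body = M[X, a]
            print(X, "->", " ".join(body) or "ɛ")   # output the production
            stack.pop()
            stack.extend(reversed(body))            # push Yk ... Y1, with Y1 on top
    if tokens[ip] != "$":
        raise SyntaxError("trailing input")

parse(["id", "+", "id", "*", "id"])      # prints the leftmost derivation used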

Example: Consider the Parsing Table constructed for the grammar


E  TE'
E'  + TE' | ɛ
T  FT'
T'  *FT' | ɛ
F  ( E ) | id

On input id + id * id, the nonrecursive predictive parser makes the sequence of moves as
follows. These moves correspond to a leftmost derivation

 The sentential forms in this derivation correspond to the input that has already been
matched (in column MATCHED ) followed by the stack contents.
 The matched input is shown only to highlight the correspondence. The input pointer points
to the leftmost symbol of the string in the INPUT column.

Error Recovery in Predictive Parsing:


 An error is detected during predictive parsing when the terminal on top of the stack does
not match the next input symbol or when nonterminal A is on top of the stack, a is the next
input symbol, and M[A, a] is error (i.e., the parsing-table entry is empty).

 Panic Mode
o Panic-mode error recovery is based on the idea of skipping symbols on the input
until a token in a selected set of synchronizing tokens appears.
o Its effectiveness depends on the choice of synchronizing set. The sets should be chosen
so that the parser recovers quickly from errors that are likely to occur in practice.
o Some heuristics are as follows:
1. As a starting point, place all symbols in FOLLOW(A) into the synchronizing set
for nonterminal A. If we skip tokens until an element of FOLLOW(A) is seen and
pop A from the stack, it is likely that parsing can continue.
2. It is not enough to use FOLLOW(A) as the synchronizing set for A. For example,
if semicolons terminate statements, as in C, then keywords that begin statements
may not appear in the FOLLOW set of the nonterminal representing expressions.
A missing semicolon after an assignment may therefore result in the keyword
beginning the next statement being skipped. Often, there is a hierarchical structure
on constructs in a language; for example, expressions appear within statements,
which appear within blocks, and so on. We can add to the synchronizing set of a
lower-level construct the symbols that begin higher-level constructs. For example,
we might add keywords that begin statements to the synchronizing sets for the
nonterminals generating expressions.
3. If we add symbols in FIRST(A) to the synchronizing set for nonterminal A, then it
may be possible to resume parsing according to A if a symbol in FIRST (A) appears
in the input.
4. If a nonterminal can generate the empty string, then the production deriving 𝜖 can
be used as a default. Doing so may postpone some error detection, but cannot cause
an error to be missed. This approach reduces the number of nonterminals that have
to be considered during error recovery.
5. If a terminal on top of the stack cannot be matched, a simple idea is to pop the
terminal, issue a message saying that the terminal was inserted, and continue
parsing. In effect, this approach takes the synchronizing set of a token to consist of
all other tokens. (The sketch below wires heuristics 1 and 5 into the table-driven parser.)
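
To make the heuristics concrete, here is the earlier table-driven parser with heuristics 1 and 5
wired in; this is an assumed rendering, not code from the text. A blank entry skips the input
token, a would-be "synch" entry (the token is in FOLLOW of the nonterminal) pops the nonterminal,
and a mismatched terminal on top of the stack is popped as if the missing token had been
inserted. On the erroneous input id * + id it recovers with a single diagnostic:

M = {
    ("E", "id"): ["T", "E'"], ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"], ("T", "("): ["F", "T'"],
    ("T'", "*"): ["*", "F", "T'"], ("T'", "+"): [], ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"],
}
NT = {"E", "E'", "T", "T'", "F"}
FOLLOW = {"E": {")", "$"}, "E'": {")", "$"}, "T": {"+", ")", "$"},
          "T'": {"+", ")", "$"}, "F": {"+", "*", ")", "$"}}

def parse_with_recovery(tokens, start="E"):
    tokens = tokens + ["$"]
    stack, ip = ["$", start], 0
    while stack[-1] != "$":
        X, a = stack[-1], tokens[ip]
        if X == a:
            stack.pop(); ip += 1
        elif X not in NT:                  # heuristic 5: pop ("insert") the terminal
            print(f"error, inserted missing {X!r}"); stack.pop()
        elif (X, a) in M:
            stack.pop(); stack.extend(reversed(M[X, a]))
        elif a in FOLLOW[X]:               # "synch" entry: pop the nonterminal
            print(f"error, M[{X}, {a}] = synch: popped {X}"); stack.pop()
        elif a != "$":                     # blank entry: skip the input symbol
            print(f"error, skipped {a!r}"); ip += 1
        else:
            print("error, unexpected end of input"); return
    print("accept" if tokens[ip] == "$" else "error, input left over")

parse_with_recovery(["id", "*", "+", "id"])   # recovers by popping F on seeing +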

Example: Using FIRST and FOLLOW symbols as synchronizing tokens works reasonably
well when expressions are parsed according to grammar
E  TE'
E'  + TE' | ɛ
T  FT'
T'  *FT' | ɛ
F  ( E ) | id

The parsing table for this grammar is repeated below, with "synch" indicating synchronizing
tokens obtained from the FOLLOW set of the nonterminal.


The table is to be used as follows.


 If the parser looks up entry M[A,a] and finds that it is blank, then the input symbol
a is skipped.
 If the entry is "synch," then the nonterminal on top of the stack is popped in an
attempt to resume parsing.
 If a token on top of the stack does not match the input symbol, then we pop the
token from the stack, as mentioned above.

On the erroneous input ) id * + id, the parser and error recovery mechanism of the above
table, behaves as follows:

 Phrase-level Recovery
o Phrase-level error recovery is implemented by filling in the blank entries in the
predictive parsing table with pointers to error routines.
o These routines may change, insert, or delete symbols on the input and issue appropriate
error messages. They may also pop from the stack.


o Alteration of stack symbols or the pushing of new symbols onto the stack is
questionable for several reasons.
 First, the steps carried out by the parser might then not correspond to the
derivation of any word in the language at all.
 Second, we must ensure that there is no possibility of an infinite loop. Checking
that any recovery action eventually results in an input symbol being consumed
(or the stack being shortened if the end of the input has been reached) is a good
way to protect against such loops.

Construct the predictive LL(1) parser for each of the following grammars and parse the given string
1. S -> S(S)S | 𝜖 String= ( ( ) ( ) )

2. S -> +SS | * SS | a String= +*aaa

3. S -> aSbS | bSaS | 𝜖 String = aabbab

4. bexpr bexpr or bterm | bterm


bterm bterm and bfactor \ bfactor
bfactor not bfactor | ( bexpr ) | true | false String= not ( true or false )

5. S  0S1 | 01 String=00011

6. S -> aB | aC | Sd | Se
B -> bBc | f
C -> g

7. P -> Ra | Qba
R -> aba | caba | Rbc
Q -> bbc | bc String= cababca

8. S -> PQR
P -> a | Rb | 𝜖
Q -> c | dP | 𝜖
R -> e | f String= adeb

9. E -> E+T | T
T -> id | id[ ] | id[X]
X -> E,E | E String= id[id]

10. S -> (A) | 0


A -> SB
B -> ,SB | 𝜖 String= (0,(0,0))

11. S -> a | ↑ | (T) String= (a,(a,a))


T -> T,S | S = ((a,a),↑,(a),a)


4.5 Bottom-Up Parsing:

 A bottom-up parse corresponds to the construction of a parse tree for an input string
beginning at the leaves (the bottom) and working up towards the root (the top).
 Ex:
Consider the grammar,
E  E+T | T
T  T*F | F
F  ( E ) | id
Let the string be id*id
The following illustrates construction of parse-tree using bottom-up parsing.

The snapshots of the growing parse tree, read left to right, correspond to the sequence
id * id ,  F * id ,  T * id ,  T * F ,  T ,  E
(the first id is reduced to F and then F to T; the second id is reduced to F, then T * F
is reduced to T, and finally T to E at the root).

 Classification: bottom-up parsing splits into two broad approaches:
1. Shift-Reduce Parsing
2. LR Parsing, with three table-construction methods: SLR, LALR, and Canonical LR

 Reductions:
 We can think of bottom-up parsing as the process of "reducing" a string w to the start
symbol of the grammar.
 At each reduction step, a specific substring matching the body of a production is
replaced by the non-terminal at the head of that production.
 The key decisions during bottom-up parsing are about when to reduce and about what
production to apply, as the parse proceeds.
 Ex: the sequence of reductions in the above example: id*id, F*id, T*id, T*F, T, E
 By definition, a reduction is the reverse of a step in a derivation. The goal of bottom-
up parsing is therefore to construct a derivation in reverse. The following derivation
corresponds to the parse in the above example.
E => T
=> T * F
=> T * id
=> F * id
=> id* id


This derivation is in fact a rightmost derivation.


 Thus, Bottom-up parsing during a left-to-right scan of the input constructs a
rightmost derivation in reverse.

 Handle Pruning:
 A "handle" is a substring that matches the body of a production, and whose reduction
represents one step along the reverse of a rightmost derivation.
 For example, the handles during the parse of id1*id2 according to the above grammar
are as shown in the following:
Right Sentential Form    Handle    Reducing Production
id1 * id2                id1       F -> id
F * id2                  F         T -> F
T * id2                  id2       F -> id
T * F                    T * F     T -> T * F

 Formally, if S ⇒*rm αAw ⇒rm αβw, then the production A -> β in the position following α is
a handle of αβw. i.e., a handle of a right-sentential form γ is a production A -> β and a
position of γ where the string β may be found such that replacing β at that position by A
produces the previous right-sentential form in a rightmost derivation of γ.
(Pictorially: in the parse tree, A sits above β, with α to its left and the terminal
string w to its right.)
 If a grammar is unambiguous, then every right-sentential form of the grammar has exactly
one handle.
 A rightmost derivation in reverse can be obtained by "handle pruning."
o That is, we start with a string of terminals w to be parsed. If w is a sentence of the
grammar at hand, then let w = γn , where γn is the nth right-sentential form of some as
yet unknown rightmost derivation,
S = γ0 ⇒ γ1 ⇒ γ2 ⇒ · · · ⇒ γn-1 ⇒ γn = w
o To reconstruct this derivation in reverse order, we locate the handle βn in γn and replace
βn by the head of the relevant production An -> βn to obtain the previous right-sentential
form γn-1.
o We then repeat this process.
o If by continuing this process we produce a right-sentential form consisting only of the
start symbol S, then we halt and announce successful completion of parsing.
o The reverse of the sequence of productions used in the reductions is a rightmost
derivation for the input string.

 Shift-Reduce Parsing:
 Shift-reduce parsing is a form of bottom-up parsing in which a stack holds grammar
symbols and an input buffer holds the rest of the string to be parsed.
 We use $ to mark the bottom of the stack and also the right end of the input. Conventionally,
when discussing bottom-up parsing, we show the top of the stack on the right.


 Initially, the stack is empty, and the string w is on the input, as follows:
Stack Input
$ w$
 During a left-to-right scan of the input string, the parser shifts zero or more input symbols
onto the stack, until it is ready to reduce a string β of grammar symbols on top of the stack.
It then reduces β to the head of the appropriate production. The parser repeats this cycle
until it has detected an error or until the stack contains the start symbol and the input is
empty:
Stack Input
$S $
Upon entering this configuration, the parser halts and announces successful completion of
parsing.
 There are actually four possible actions a shift-reduce parser can make: (1) shift, (2) reduce,
(3) accept, and (4) error.
1. Shift. Shift the next input symbol onto the top of the stack.
2. Reduce. The right end of the string to be reduced must be at the top of the stack. Locate
the left end of the string within the stack and decide with what nonterminal to replace the
string.
3. Accept. Announce successful completion of parsing.
4. Error. Discover a syntax error and call an error recovery routine.
 The actions of a shift-reduce parser in parsing the input string id1*id2 according to the
expression grammar is shown here:
Stack Input Action
$ id1*id2$ shift
$id1 *id2$ reduce by F -> id
$F *id2$ reduce by T -> F
$T *id2$ shift
$T* id2$ shift
$T*id2 $ reduce by F -> id
$T*F $ reduce by T -> T * F
$T $ reduce by E -> T
$E $ accept
Note: The handle will always eventually appear on top of the stack, never inside.
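
The moves in such a table can be checked mechanically. A small Python sketch that replays a given
script of shift/reduce actions, verifying at each reduce that the production body really is on top
of the stack. Deciding when to reduce, rather than being handed a script, is exactly what the LR
parsing tables introduced next automate:

PRODS = {1: ("E", ["E", "+", "T"]), 2: ("E", ["T"]), 3: ("T", ["T", "*", "F"]),
         4: ("T", ["F"]), 5: ("F", ["(", "E", ")"]), 6: ("F", ["id"])}

def replay(tokens, script):
    stack, ip = ["$"], 0
    tokens = tokens + ["$"]
    for act in script:
        print(f"{' '.join(stack):<14} {' '.join(tokens[ip:]):<14} {act}")
        if act == "shift":
            stack.append(tokens[ip]); ip += 1
        else:                                        # act = ("reduce", production no.)
            head, body = PRODS[act[1]]
            assert stack[-len(body):] == body, "handle not on top of stack"
            del stack[-len(body):]                   # pop the body ...
            stack.append(head)                       # ... and push the head
    assert stack == ["$", "E"] and tokens[ip] == "$"
    print("accept")

replay(["id", "*", "id"],
       ["shift", ("reduce", 6), ("reduce", 4), "shift", "shift",
        ("reduce", 6), ("reduce", 3), ("reduce", 2)])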

Question: Consider the following grammar and parse the respective strings using shift-reduce
parser.
(1) S -> TL;                    (2) S -> (L) | a
    T -> int | float                L -> L,S | S
    L -> L, id | id                 String : (a,(a,a))
    String : int id, id;

 Conflicts During Shift-Reduce Parsing:


 There are context-free grammars for which shift-reduce parsing cannot be used. Every
shift-reduce parser for such a grammar can reach a configuration in which the parser,
knowing the entire stack contents and the next input symbol, cannot make certain decisions.
Accordingly there are two types of conflicts:


1) shift/reduce conflict: This conflict arises when a parser cannot decide whether to perform
shift action or reduce action.
Ex: Consider the grammar,
stmt -> if expr then stmt
| if expr then stmt else stmt
| other

If we have a shift-reduce parser in configuration


Stack                       Input
· · · if expr then stmt     else · · · $

We cannot tell whether if expr then stmt is the handle, no matter what appears below it on
the stack. Here there is a shift/reduce conflict. Depending on what follows the else on the
input, it might be correct to reduce if expr then stmt to stmt, or it might be correct to shift
else and then to look for another stmt to complete the alternative if expr then stmt else stmt.

2) reduce/reduce conflict: This conflict arises when a parser cannot decide which of several
reductions to make.
Ex 1: Consider the following grammar,
S -> AB
A -> aA | ab
B -> bB | ab
Suppose the string is abab
Then the actions of a shift-reduce parser will be
Stack Input Action
$ abab$ shift
$a bab$ shift
$ab ab$ reduce by A -> ab or B -> ab [ conflict ]

Here, parser will have a confusion as to which production to use for reduce action.

Ex 2: Suppose we have a lexical analyzer that returns the token name id for all names,
regardless of their type. Suppose also that our language invokes procedures by giving their
names, with parameters surrounded by parentheses, and that arrays are referenced by the
same syntax. Our grammar might therefore have (among others) productions such as shown
below (these are the numbered productions of the standard textbook example; the numbering
matters for the conflict described next):
(1) stmt -> id ( parameter_list )
(2) stmt -> expr := expr
(3) parameter_list -> parameter_list , parameter
(4) parameter_list -> parameter
(5) parameter -> id
(6) expr -> id ( expr_list )
(7) expr -> id
(8) expr_list -> expr_list , expr
(9) expr_list -> expr


A statement beginning with p( i , j ) would appear as the token stream id(id, id) to the
parser.
After shifting the first three tokens onto the stack, a shift-reduce parser would be in
configuration
Stack              Input
· · · id ( id      , id ) · · ·
It is evident that the id on top of the stack must be reduced, but by which production? The
correct choice is production (5) if p is a procedure, but production (7) if p is an array. The
stack does not tell which; information in the symbol table obtained from the declaration of
p must be used.

Exercises :
For the following grammars, indicate the handle in each of the following right-sentential forms:
1) S -> 0 S 1 | 0 1    a) 000111
b) 00S11

2)S -> SS + | SS * | a a) SSS+a*+


b) SS+a*a+
c) aaa*a++

Introduction to LR Parsing: Simple LR(SLR)


 Most bottom-up parsers today are based on a concept called LR(k) parsing;
"L" is for left-to-right scanning of the input,
"R" for constructing a rightmost derivation in reverse, and
k for the number of input symbols of lookahead that are used in making parsing decisions.
When (k) is omitted, k is assumed to be 1.

Why LR Parsers?
 For a grammar to be LR it is sufficient that a left-to-right shift-reduce parser be able to
recognize handles of right-sentential forms when they appear on top of the stack.
 LR parsing is attractive for a variety of reasons:
1) LR parsers can be constructed to recognize virtually all programming language constructs
for which context-free grammars can be written.
