Compiler Design
Introduction
Programming languages are notations for describing computations to people and to machines. The
world as we know it depends on programming languages, because all the software running on all
the computers was written in some programming language. But, before a program can be run, it
first must be translated into a form in which it can be executed by a computer. The software systems
that do this translation are called compilers.
Generations of programming languages
Programming languages are categorized into five generations: (1st, 2nd, 3rd, 4th and 5th
generation languages)
These programming languages can also be categorized into two broad categories: low level
and high level languages.
Low level languages are machine specific or dependent.
High level languages like COBOL and BASIC are machine independent and can run
on a variety of computers.
From the five categories of programming languages, first and second generation languages
are low level languages and the rest are high level programming languages.
1. First Generation (Machine languages, 1940’s):
Difficult to write applications with.
Dependent on machine languages of the specific computer being used.
Machine languages allow the programmer to interact directly with the hardware,
and it can be executed by the computer without the need for a translator.
2. Second Generation (Assembly languages, early 1950’s):
Uses symbolic names for operations and storage locations.
A system program called an assembler translates a program written in assembly
language to machine language.
Programs written in assembly language are not portable, i.e., different computer
architectures have their own machine and assembly languages. Assembly languages
are widely used in system software development.
3. Third Generation (High level languages, 1950’s to 1970’s):
Use English-like instructions, and programmers can define variables
with statements such as Z = A + B.
Such languages are much easier to use than assembly language.
Programs written in high level languages need to be translated into machine
language in order to be executed.
All third generation programming languages are procedural languages.
The use of common words (reserved words) within instructions makes them easier
to learn.
Preprocessor
A preprocessor produces input to compilers. It may perform the following functions.
1. Macro processing: A preprocessor may allow a user to define macros that are short hands for
longer constructs.
2. File inclusion: A preprocessor may include header files into the program text.
3. Rational preprocessors: these preprocessors augment older languages with more modern
flow-of-control and data-structuring facilities.
4. Language extensions: these preprocessors attempt to add capabilities to the language in the
form of built-in macros.
Compiler
A compiler is a translator program that takes a program written in a high-level language (HLL), the
source program, and translates it into an equivalent program in a machine-level language (MLL),
the target program. An important part of a compiler is reporting errors in the source program to
the programmer.
Executing a program written in a HLL programming language basically has two parts. The source
program must first be compiled (translated) into an object program. Then the resulting object
program is loaded into memory and executed.
Loading: the loader takes the executable file from disk and transfers it to memory. Additional
components from shared libraries that support the program are also loaded. Finally, the computer,
under the control of its CPU, executes the program.
Translator
A translator is a program that takes as input a program written in one language and produces as
output a program in another language. Besides program translation, the translator performs another
very important role: error detection. Any violation of the HLL specification is detected and
reported to the programmer.
TYPES OF TRANSLATORS:
INTERPRETER
COMPILER
ASSEMBLER
Compiler
Simply stated, a compiler is a program that can read a program in one language (the source
language) and translate it into an equivalent program in another language (the target language),
as shown in the figure. An important role of the compiler is to report any errors in the source program
that it detects during the translation process.
Compiler versus Interpreter
The machine-language target program produced by a compiler is usually much faster than an
interpreter at mapping inputs to outputs. An interpreter, however, can usually give better error
diagnostics than a compiler, because it executes the source program statement by statement.
Languages such as BASIC, SNOBOL and LISP can be translated using interpreters. Java also uses
an interpreter.
Example 1.1: Java language processors combine compilation and interpretation, as shown in
figure. A Java source program may first be compiled into an intermediate form called bytecodes.
The bytecodes are then interpreted by a virtual machine.
• Suppose you want to write compilers from m source languages to n computer platforms. A
naive solution requires m*n programs, whereas a shared intermediate representation needs only
m front ends and n back ends.
There are two major parts of a compiler: Analysis (Front end) and Synthesis (Back end)
In analysis phase, an intermediate representation is created from the given source program. Lexical
Analyzer, Syntax Analyzer and Semantic Analyzer are the parts of this phase.
In synthesis phase, the equivalent target program is created from this intermediate representation.
• Each phase transforms the source program from one representation into another
representation.
• They communicate with error handlers.
• They communicate with the symbol table.
Lexical Analysis:-
• Lexical Analyzer reads the source program character by character and returns the tokens
of the source program.
• A token describes a pattern of characters having the same meaning in the source program
(such as identifiers, operators, keywords, numbers, delimiters and so on).
Ex: newval := oldval + 12 => tokens:
newval   identifier
:=       assignment operator
oldval   identifier
+        add operator
12       number
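To make the token idea concrete, here is a minimal hand-written C sketch (our own illustration, not part of the original notes) that splits this statement into the token categories listed above:

#include <ctype.h>
#include <stdio.h>

/* Minimal sketch: classify the tokens of "newval := oldval + 12". */
int main(void)
{
    const char *p = "newval := oldval + 12";
    char lexeme[32];
    while (*p) {
        while (*p == ' ') p++;                     /* skip whitespace between tokens */
        if (*p == '\0') break;
        int n = 0;
        if (isalpha((unsigned char)*p)) {          /* identifier: letter(letter|digit)* */
            while (isalnum((unsigned char)*p)) lexeme[n++] = *p++;
            lexeme[n] = '\0';
            printf("%-8s identifier\n", lexeme);
        } else if (isdigit((unsigned char)*p)) {   /* number: digit+ */
            while (isdigit((unsigned char)*p)) lexeme[n++] = *p++;
            lexeme[n] = '\0';
            printf("%-8s number\n", lexeme);
        } else if (*p == ':' && p[1] == '=') {     /* two-character operator */
            printf(":=       assignment operator\n");
            p += 2;
        } else {                                   /* single-character operator */
            printf("%c        operator\n", *p++);
        }
    }
    return 0;
}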
Syntax Analysis:-
The second stage of translation is called syntax analysis or parsing. In this phase expressions,
statements, declarations, etc. are identified by using the results of lexical analysis. Syntax analysis
is aided by using techniques based on the formal grammar of the programming language.
• Ex: We use BNF (Backus-Naur Form) to specify a CFG:
assgstmt → identifier := expression
expression → identifier
expression → number
expression → expression + expression
• Which constructs of a program should be recognized by the lexical analyzer, and which ones
by the syntax analyzer?
– Both of them do similar things;
– But the lexical analyzer deals with simple non-recursive constructs of the language.
– The syntax analyzer deals with recursive constructs of the language.
– The lexical analyzer simplifies the job of the syntax analyzer.
– The lexical analyzer recognizes the smallest meaningful units (tokens) in a source
program.
– The syntax analyzer works on the smallest meaningful units (tokens) in a source
program to recognize meaningful structures in our programming language.
Semantic Analyzer
• A semantic analyzer checks the source program for semantic errors and collects the type
information for the code generation.
• Type-checking is an important part of semantic analyzer.
• Normally semantic information cannot be represented by a context-free language used in
syntax analyzers.
• Context-free grammars used in the syntax analysis are integrated with attributes (semantic
rules)
– the result is a syntax-directed translation,
– Attribute grammars
• Ex:
newval := oldval + 12
• The type of the identifier newval must match the type of the expression (oldval+12).
An intermediate representation of the final machine language code is produced. This phase bridges
the analysis and synthesis phases of translation.
• A compiler may produce explicit intermediate code representing the source program.
• This intermediate code is generally machine (architecture) independent, but its level is
close to the level of machine code.
• Ex: newval := oldval * fact + 1
id1 := id2 * id3 + 1
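Expanded into three-address statements (the temporary names t1 and t2 are illustrative), this becomes:
t1 := id2 * id3
t2 := t1 + 1
id1 := t2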
This is an optional phase intended to improve the intermediate code so that the output runs faster and
takes less space.
The last phase of translation is code generation. A number of optimizations to reduce the length
of machine language program are carried out during this phase. The output of the code generator
is the machine language program of the specified computer.
[Figure: the statement id1 := id2 + id3 * 60 as it passes through the phases. The syntax analyzer
builds the tree for id1 := id2 + id3 * 60; the semantic analyzer inserts an int-to-real conversion
around the constant 60; after code optimization, the code generator emits target code such as:
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1]
Compiler-Construction Tools
The compiler writer, like any software developer, can profitably use modern software development
environments containing tools such as language editors, debuggers, version managers, profilers,
test harnesses, and so on. In addition to these general software-development tools, other more
specialized tools have been created to help implement various phases of a compiler.
These tools use specialized languages for specifying and implementing specific components, and
many use quite sophisticated algorithms. The most successful tools are those that hide the details
of the generation algorithm and produce components that can be easily integrated into the
remainder of the compiler. Some commonly used compiler-construction tools include:
1. Parser generators that automatically produce syntax analyzers from a grammatical
description of a programming language.
2. Scanner generators that produce lexical analyzers from a regular-expression description of
the tokens of a language.
3. Syntax-directed translation engines that produce collections of routines for walking a parse
tree and generating intermediate code.
4. Code-generator generators that produce a code generator from a collection of rules for
translating each operation of the intermediate language into the machine language for a
target machine.
5. Data-flow analysis engines that facilitate the gathering of information about how values are
transmitted from one part of a program to each other part. Data-flow analysis is a key part
of code optimization.
6. Compiler-construction toolkits that provide an integrated set of routines for constructing
various phases of a compiler.
Passes
Single-pass Compiler:-
All the phases are combined into a single pass. The phases work in an interleaved way.
Multi-pass Compiler:-
The phases are combined into a number of groups, called passes; each pass reads the output of the previous one.
Why multi-pass?
(Multi pass compiler can be made to use less space than single pass compiler.).
Analogy: just as understanding natural language turns English text into its semantics (meaning),
the JLex scanner generator turns regular expressions into a scanner in Java.
Cross Compiler
• a compiler which generates target code for a different machine from one on which the
compiler runs.
• A host language is a language in which the compiler is written.
– T-diagram
Porting
• Porting: constructing a compiler for the same source and target languages in one host
language, starting from a compiler written in another host language.
Cousins of Compilers
• Linkers
• Loaders
• Interpreters
• Assemblers
• Compiler: translates high-level languages into machine code.
• Interpreter: temporarily executes high-level languages, one statement at a time.
• Assembler: translates low-level assembly code into machine code.
A lexical analyzer, also called a scanner, typically has the following functionality and
characteristics.
Its primary function is to convert an (often very long) sequence of characters into a
(much shorter, perhaps 10x shorter) sequence of tokens.
The scanner must identify and categorize specific character sequences into tokens. It
must know whether two adjacent characters in the file belong together in the
same token, or whether the second character must be in a different token.
Most lexical analyzers discard comments & whitespace. In most languages these
characters serve to separate tokens from each other, but once lexical analysis is
completed they serve no purpose.
Handle lexical errors (illegal characters, malformed tokens) by reporting them intelligibly
to the user.
Efficiency is crucial; a scanner may perform elaborate input buffering.
Token categories can be (precisely, formally) specified using regular expressions, e.g.
IDENTIFIER=[a-zA-Z][a-zA-Z0-9]*
Lexical Analyzers can be written by hand, or implemented automatically using finite
automata.
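As a small illustration of the automaton view (our own sketch, not from the original notes), the following C code is a hand-coded DFA that recognizes IDENTIFIER = [a-zA-Z][a-zA-Z0-9]*:

#include <ctype.h>
#include <stdio.h>

/* States: 0 = start, 1 = inside an identifier, -1 = dead. State 1 accepts. */
static int step(int state, char c)
{
    if (state == 0) return isalpha((unsigned char)c) ? 1 : -1;
    if (state == 1) return isalnum((unsigned char)c) ? 1 : -1;
    return -1;
}

static int is_identifier(const char *s)
{
    int state = 0;
    for (; *s; s++) {
        state = step(state, *s);
        if (state < 0) return 0;    /* no transition: reject immediately */
    }
    return state == 1;              /* accept only if we end in state 1 */
}

int main(void)
{
    printf("%d\n", is_identifier("newval"));  /* prints 1 */
    printf("%d\n", is_identifier("12x"));     /* prints 0 */
    return 0;
}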
Input buffering
Sometimes the lexical analyzer needs to look ahead some symbols to decide which token to return.
In C: we need to look ahead after -, = or < to decide what token to return.
In Fortran: DO 5 I = 1.25
We need to introduce a two-buffer scheme to handle large look-aheads safely.
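A rough sketch of that two-buffer scheme in C (our own illustration; the buffer size, the sentinel value and the helper names are all made up):

#include <stdio.h>

#define N 4096           /* size of each buffer half (an arbitrary choice)        */
#define SENTINEL '\0'    /* marks the end of each half and the end of the input   */

static char buf[2 * N + 2];  /* [0..N-1] half 1, [N] sentinel, [N+1..2N] half 2, [2N+1] sentinel */
static char *forward = buf + 2 * N + 1;  /* positioned so the first advance reloads half 1 */
static FILE *in;

/* Refill one half from the input and place a sentinel after the bytes read. */
static void load(char *half)
{
    size_t n = fread(half, 1, N, in);
    half[n] = SENTINEL;
}

/* Return the next character, reloading a half whenever a sentinel is reached. */
static int advance(void)
{
    if (*forward == SENTINEL) {
        if (forward == buf + N) {                /* end of half 1: reload half 2 */
            load(buf + N + 1);
            forward = buf + N + 1;
        } else if (forward == buf + 2 * N + 1) { /* end of half 2: reload half 1 */
            load(buf);
            forward = buf;
        } else {
            return EOF;                          /* sentinel inside a half: real end of input */
        }
        if (*forward == SENTINEL) return EOF;    /* the freshly loaded half is empty */
    }
    return (unsigned char)*forward++;
}

int main(void)
{
    in = stdin;
    int c;
    while ((c = advance()) != EOF) putchar(c);   /* echo the input through the double buffer */
    return 0;
}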
Specification of tokens
In theory of compilation regular expressions are used to formalize the specification of tokens
Regular expressions are means for specifying regular languages
Example:
Letter(letter | digit)*
Each regular expression is a pattern specifying the form of strings
Terminology of Languages
Alphabet : a finite set of symbols (ASCII characters)
String :
Finite sequence of symbols on an alphabet
Sentence and word are also used in terms of string
ε is the empty string.
|s| is the length of string s.
Rules
A recognizer for a language is a program that takes a string x and answers “yes” if x is a
sentence of that language, and “no” otherwise.
We call the recognizer of the tokens a finite automaton.
A finite automaton can be: deterministic(DFA) or non-deterministic (NFA)
This means that we may use a deterministic or a non-deterministic automaton as a lexical
analyzer.
Both deterministic and non-deterministic finite automata recognize regular sets.
Which one?
deterministic – faster recognizer, but it may take more space
non-deterministic – slower, but it may take less space
Deterministic automata are widely used in lexical analyzers.
First, we define regular expressions for tokens; then we convert them into a DFA to get a
lexical analyzer for our tokens.
Algorithm 1: Regular Expression → NFA → DFA (two steps: first to NFA, then to DFA)
Algorithm 2: Regular Expression → DFA (directly convert a regular expression into a DFA)
For all input symbols a, states s1 and s2 have transitions to states in the same group.
Start state of the minimized DFA is the group containing the start state of the original DFA.
Accepting states of the minimized DFA are the groups containing the accepting states of the
original DFA.
Lex (A LEXical Analyzer Generator)
Generates lexical analyzers (scanners or Lexers)
Yacc (Yet Another Compiler-Compiler) generates a parser based on an analytic grammar.
Flex is a free scanner alternative to Lex.
Bison is a free parser generator program.
1. Lex: a tool for automatically generating a lexer or scanner given a lex specification (.l
file)
2. A lexer or scanner is used to perform lexical analysis, or the breaking up of an input
stream into meaningful units, or tokens.
3. For example, consider breaking a text file up into individual words.
Lexical analyzer: scans the input stream and converts sequences of characters into tokens.
%{
< C global variables, prototypes, comments >   /* this part will be embedded into *.c */
%}
[DEFINITION SECTION]   /* substitutions, code and start states; will be copied into *.c */
Two Rules
1. lex will always match the longest (number of characters) token possible.
2. If two or more possible tokens are of the same length, then the token with the regular
expression that is defined first in the lex specification is favored.
a matches a
abc matches abc
[abc] matches a, b or c
[a-f] matches a, b, c, d, e, or f
[0-9] matches any digit
X+ matches one or more of X
X* matches zero or more of X
[0-9]+ matches any integer
(…) grouping an expression into a single unit
Special Functions
• yytext
–where text matched most recently is stored
• yyleng
–number of characters in text most recently matched
• yylval
–associated value of current token
• yymore()
–append next string matched to current contents of yytext
• yyless(n)
–remove from yytext all but the first n characters
• unput(c)
–return character c to input stream
• yywrap()
–may be replaced by user
–The yywrap method is called by the lexical analyzer whenever it inputs an
EOF as the first character when trying to match a regular expression
Yacc: a tool for automatically generating a parser given a grammar written in a yacc
specification (.y file)
%{
< C global variables, prototypes, comments >   /* this part will be embedded into *.c */
%}
/* definition section: declarations of tokens */
%%
/* rules section: productions with actions, alternatives separated by | */
%%
/* user code */
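For instance, a minimal .y specification might look as follows (our own sketch; it assumes a companion lex file that returns NUMBER tokens with integer values, plus the characters +, * and newline):

%{
#include <stdio.h>
int yylex(void);
int yyparse(void);
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
%}
%token NUMBER
%left '+'
%left '*'
%%
input : expr '\n'      { printf("= %d\n", $1); }
      ;
expr  : expr '+' expr  { $$ = $1 + $3; }   /* value of the sum is synthesized */
      | expr '*' expr  { $$ = $1 * $3; }
      | NUMBER         { $$ = $1; }
      ;
%%
int main(void) { return yyparse(); }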
2. Lex program to count the types of numbers
%{
int pi=0,ni=0,pf=0,nf=0;
%}
%%
\+?[0-9]+ pi++;
\+?[0-9]*\.[0-9]+ pf++;
\-[0-9]+ ni++;
\-[0-9]*\.[0-9]+ nf++;
%%
int main()
{
printf("ENTER INPUT : ");
yylex();
printf("\nPOSITIVE INTEGER : %d",pi);
printf("\nNEGATIVE INTEGER : %d",ni);
printf("\nPOSITIVE FRACTION : %d",pf);
printf("\nNEGATIVE FRACTION : %d\n",nf);
return 0;
}
int yywrap() { return 1; } /* signal end of input so the scanner stops */
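The program can be generated and run in the usual way (numbers.l is a hypothetical file name for the specification above):
lex numbers.l      # produces lex.yy.c
cc lex.yy.c -o numbers
./numbers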
3. Lex program to find simple and compound statements
%{
#include <stdlib.h>
%}
%%
"and" |
"or" |
"but" |
"because" |
"nevertheless" { printf("COMPOUND STATEMENT\n"); exit(0); }
. ;
\n return 0;
%%
int main()
{
yylex();
/* no connective was matched, so we assume the sentence is simple */
printf("SIMPLE STATEMENT\n");
return 0;
}
int yywrap() { return 1; }
Syntax Analyzer
• Syntax Analyzer creates the syntactic structure of the given source program.
• This syntactic structure is mostly a parse tree.
• Syntax Analyzer is also known as parser.
• The syntax of a programming language is described by a context-free grammar (CFG). We will use
BNF (Backus-Naur Form) notation in the description of CFGs.
• The syntax analyzer (parser) checks whether a given source program satisfies the rules
implied by a context-free grammar or not.
– If it satisfies, the parser creates the parse tree of that program.
– Otherwise the parser gives the error messages.
• A context-free grammar
– Gives a precise syntactic specification for a programming language
– The design of the grammar is an initial phase of the design of a compiler.
– A grammar can be directly converted into a parser by some tools.
Parser
• Top-Down Parser
– The parse tree is created top to bottom, starting from the root.
• Bottom-Up Parser
– The parse tree is created bottom to top; starting from the leaves
• Efficient top-down and bottom-up parsers can be implemented only for sub-classes of
context-free grammars.
– LL for top-down parsing
– LR for bottom-up parsing
Context-Free Grammars
• Inherently recursive structures of a programming language are defined by a context-free
grammar.
• Example: E → E + E
Derivations
E ⇒ E+E means that E derives E+E in one step, by applying the production E → E + E.
CFG - Terminology
• L(G) is the language of G (the language generated by G) which is a set of sentences.
Derivation Example
• At each derivation step, we can choose any of the non-terminals in the sentential form of
G for the replacement.
• If we always choose the left-most non-terminal in each derivation step, this derivation is
called as left-most derivation.
• If we always choose the right-most non-terminal in each derivation step, this derivation is
called as right-most derivation.
• We will see that the top-down parsers try to find the left-most derivation of the given
source program.
• We will see that the bottom-up parsers try to find the right-most derivation of the given
source program in the reverse order.
Parse Tree
• Inner nodes of a parse tree are non-terminal symbols.
Ambiguity
• A grammar that produces more than one parse tree for some sentence is called an ambiguous
grammar.
• An unambiguous grammar produces a unique parse tree for each sentence.
• We should eliminate the ambiguity in the grammar during the design phase of the
compiler.
• We prefer the second parse tree (each else matches the closest unmatched if).
• So, we have to disambiguate our grammar to reflect this choice.
• The unambiguous grammar will be:
Top-Down Parsing
• The parse tree is created top to bottom.
• Top-down parser
– Recursive-Descent Parsing
• Backtracking is needed (If a choice of a production rule does not work, we
backtrack to try other alternatives.)
• It is a general parsing technique, but not widely used.
• Not efficient
– Predictive Parsing
• No backtracking, efficient
• Needs a special form of grammars (LL(1) grammars).
• Recursive Predictive Parsing is a special form of Recursive Descent parsing
without backtracking.
• Non-Recursive (Table Driven) Predictive Parser is also known as LL(1) parser.
• Backtracking is needed. Ex:
S → aBc
B → bc | b
For the input abc, if the parser first tries B → bc it consumes bc and then fails to match the
final c, so it must backtrack and try B → b.
A → aBb
proc A {
- match the current token with a, and move to the next token;
- call B;
- match the current token with b, and move to the next token;
}
A → aBb | bAB
proc A {
case of the current token {
‘a’: - match the current token with a, and move to the next token;
- call ‘B’;
- match the current token with b, and move to the next token;
‘b’: - match the current token with b, and move to the next token;
- call ‘A’;
- call ‘B’;
} }
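In C, this procedure could be realized roughly as follows (our own sketch; tokens are simplified to single characters, and since the productions for B are not given here, B is left as an empty stub):

#include <stdio.h>
#include <stdlib.h>

static const char *input;    /* current position in the token stream */

static void error(void) { printf("syntax error\n"); exit(1); }

static void match(char t)    /* match the current token and advance */
{
    if (*input == t) input++; else error();
}

static void B(void) { /* stub: B's productions are not shown in this example */ }

/* A -> aBb | bAB */
static void A(void)
{
    switch (*input) {
    case 'a': match('a'); B(); match('b'); break;
    case 'b': match('b'); A(); B(); break;
    default:  error();
    }
}

int main(void)
{
    input = "ab";            /* derives from A -> aBb with B empty */
    A();
    if (*input == '\0') printf("accepted\n"); else error();
    return 0;
}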
• If all other productions fail, we should apply an ε-production. For example, if the current
token is not a or b, we may apply the ε-production.
• Most correct choice: we should apply an ε-production for a non-terminal A when the
current token is in the FOLLOW set of A (the terminals that can follow A in the sentential forms).
Predictive Parser
stmt → if ...... |
while ...... |
begin ...... |
for .....
• When we are trying to write the non-terminal stmt, if the current token is if we have to
choose first production rule.
• When we are trying to write the non-terminal stmt, we can uniquely choose the
production rule by just looking the current token.
• We eliminate the left recursion in the grammar, and left factor it. But it still may not be
suitable for predictive parsing (it may not be an LL(1) grammar).
Left Recursion
• So, we have to convert our left-recursive grammar into an equivalent grammar which is
not left-recursive.
• The left-recursion may appear in a single step of the derivation (immediate left-recursion),
or may appear in more than one step of the derivation.
Immediate Left-Recursion
A → A α | β
an equivalent grammar:
A → β A’
A’ → α A’ | ε
For example,
E → E+T | T
T → T*F | F
F → id | (E)
becomes
E → T E’
E’ → +T E’ | ε
T → F T’
T’ → *F T’ | ε
F → id | (E)
Left-Recursion – Problem
• By just eliminating the immediate left-recursion, we may not get a grammar which is not left-
recursive.
S → Aa | b
With a production such as A → Sc we get S ⇒ Aa ⇒ Sca, so S is still left-recursive through
two derivation steps.
Left-Factoring
• A predictive parser (a top-down parser without backtracking) insists that the grammar
must be left-factored.
• When we see if, we cannot know which production rule to choose to rewrite stmt in the
derivation (e.g. when stmt → if expr then stmt else stmt | if expr then stmt).
• In general, a grammar A → αβ1 | αβ2 is left-factored as
A → αA’
A’ → β1 | β2
Left-Factoring – Example1
A → abB | aB | cdg | cdeB | cdfB
after factoring a:
A → aA’ | cdg | cdeB | cdfB
A’ → bB | B
after factoring cd:
A → aA’ | cdA’’
A’ → bB | B
A’’ → g | eB | fB
Left-Factoring – Example2
A → ad | a | ab | abc | b
after factoring a:
A → aA’ | b
A’ → d | ε | b | bc
after factoring b inside A’:
A → aA’ | b
A’ → d | ε | bA’’
A’’ → ε | c
Non-Recursive Predictive Parsing (LL(1) Parser)
• It is a top-down parser.
Input buffer
– our string to be parsed. We will assume that its end is marked with a special symbol
$.
Output
– a production rule representing a step of the derivation sequence (left-most
derivation) of the string in the input buffer.
Stack
– contains the grammar symbols
– at the bottom of the stack, there is a special end marker symbol $.
– initially the stack contains only the symbol $ and the starting symbol S.
$S initial stack
– when the stack is emptied (i.e. only $ is left in the stack), the parsing is completed.
Parsing table
– a two-dimensional array M[A,a]
– each row is a non-terminal symbol
– each column is a terminal symbol or the special symbol $
– each entry holds a production rule.
At each step the parser considers X, the symbol on top of the stack, and a, the current input symbol:
1. If X = a = $, the parser halts and announces successful parsing.
2. If X = a ≠ $, the parser pops X from the stack and advances the input to the next symbol.
3. If X is a non-terminal, the parser looks at the parsing table entry M[X,a]. If M[X,a] holds a
production rule X → Y1Y2...Yk, it pops X from the stack and pushes Yk,Yk-1,...,Y1 onto the
stack. The parser also outputs the production rule X → Y1Y2...Yk to represent a step of the
derivation.
S → aBa
B → bB | ε
LL(1) Parsing Table
       a          b         $
S    S → aBa
B    B → ε     B → bB
For the input abba, the outputs are: S → aBa, B → bB, B → bB, B → ε
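A compact C sketch of the table-driven LL(1) loop for exactly this grammar (our own illustration; the table lookup is hard-coded):

#include <stdio.h>
#include <string.h>

static char stack[100];
static int top = 0;
static void push(char c) { stack[top++] = c; }

/* Hard-coded parsing table: M[S,a] = aBa, M[B,b] = bB, M[B,a] = eps. */
static const char *lookup(char X, char a)
{
    if (X == 'S' && a == 'a') return "aBa";
    if (X == 'B' && a == 'b') return "bB";
    if (X == 'B' && a == 'a') return "";      /* B -> eps, since a is in FOLLOW(B) */
    return NULL;                              /* error entry */
}

int main(void)
{
    const char *ip = "abba$";                 /* the input, ending with $ */
    push('$'); push('S');
    while (top > 0) {
        char X = stack[--top];
        if (X == 'a' || X == 'b' || X == '$') {          /* terminal: must match */
            if (X == *ip) ip++; else { puts("error"); return 1; }
        } else {                                          /* non-terminal: consult table */
            const char *rhs = lookup(X, *ip);
            if (!rhs) { puts("error"); return 1; }
            printf("%c -> %s\n", X, *rhs ? rhs : "eps");  /* output the derivation step */
            for (int i = (int)strlen(rhs) - 1; i >= 0; i--)
                push(rhs[i]);                             /* push right side in reverse */
        }
    }
    puts("accepted");
    return 0;
}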
[Figure: the corresponding parse tree]
E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id

       id          +             *             (           )          $
E    E → TE’                               E → TE’
E’              E’ → +TE’                              E’ → ε     E’ → ε
T    T → FT’                               T → FT’
T’              T’ → ε       T’ → *FT’                 T’ → ε     T’ → ε
F    F → id                                F → (E)
• FIRST(α) is the set of the terminal symbols which occur as first symbols in strings derived
from α, where α is any string of grammar symbols.
• FOLLOW(A) is the set of the terminals which occur immediately after (follow) the
non-terminal A in the strings derived from the starting symbol.
– $ is in FOLLOW(A) if S ⇒* αA
• If X is ε, then FIRST(X) = {ε}.
• If X is Y1Y2...Yn:
– if a terminal a is in FIRST(Yi) and ε is in all FIRST(Yj) for j=1,...,i-1, then a is in
FIRST(X);
– if ε is in all FIRST(Yj) for j=1,...,n, then ε is in FIRST(X).
• $ is in FOLLOW(S), where S is the starting symbol.
• If A → αBβ is a production rule, then everything in FIRST(β) except ε is in FOLLOW(B).
• If ( A → αB is a production rule ) or
( A → αBβ is a production rule and ε is in FIRST(β) ),
then everything in FOLLOW(A) is in FOLLOW(B).
We apply these rules until nothing more can be added to any FOLLOW set.
FOLLOW Example
E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id
FOLLOW(E) = { $, ) }
FOLLOW(E’) = { $, ) }
FOLLOW(T) = { +, ), $ }
FOLLOW(T’) = { +, ), $ }
FOLLOW(F) = { +, *, ), $ }
• For each production rule A → α of the grammar:
– for each terminal a in FIRST(α), add A → α to M[A,a];
– if ε is in FIRST(α), then for each terminal a in FOLLOW(A), add A → α to M[A,a].
• All other undefined entries of the parsing table are error entries.
E’ → ε       FIRST(ε) = {ε}: none,
but since ε is in FIRST(ε) and FOLLOW(E’) = {$, )}: E’ → ε into M[E’,$] and M[E’,)]
T → FT’      FIRST(FT’) = {(, id}:    T → FT’ into M[T,(] and M[T,id]
T’ → *FT’    FIRST(*FT’) = {*}:       T’ → *FT’ into M[T’,*]
T’ → ε       FIRST(ε) = {ε}: none,
but since ε is in FIRST(ε) and FOLLOW(T’) = {$, ), +}: T’ → ε into M[T’,$], M[T’,)] and
M[T’,+]
F → (E)      FIRST((E)) = {(}:        F → (E) into M[F,(]
F → id       FIRST(id) = {id}:        F → id into M[F,id]
LL(1) Grammars
• The parsing table of a grammar may contain more than one production rule in some entry.
In this case, we say that it is not an LL(1) grammar.
S → iCtSE | a
E → eS | ε
C → b
FIRST(iCtSE) = {i}, FIRST(a) = {a}, FIRST(eS) = {e}, FIRST(ε) = {ε}, FIRST(b) = {b}
Since e is in FOLLOW(E), both E → eS and E → ε go into M[E,e], so the table has a
multiply defined entry.
What do we have to do if the resulting parsing table contains multiply defined entries?
– If we didn’t eliminate left recursion, eliminate the left recursion in the grammar.
– If the grammar is not left factored, we have to left factor the grammar.
– If its (the new grammar’s) parsing table still contains multiply defined entries, that
grammar is ambiguous or it is inherently not an LL(1) grammar.
• A grammar G is LL(1) if and only if the following conditions hold for every pair of distinct
production rules A → α and A → β:
1. Both α and β cannot derive strings starting with the same terminals.
2. At most one of α and β can derive ε.
3. If β can derive ε, then α cannot derive any string starting with a
terminal in FOLLOW(A).
Error Recovery in Predictive Parsing
– An error occurs if the top of the stack is a non-terminal A, the current input symbol is a,
and the parsing table entry M[A,a] is empty.
– The parser should be able to give an error message (as meaningful an error
message as possible).
– It should recover from that error case, and it should be able to continue
parsing the rest of the input.
• Error-Productions
– If we have a good idea of the common errors that might be encountered, we can
augment the grammar with productions that generate erroneous constructs.
– When an error production is used by the parser, we can generate appropriate error
diagnostics.
– Since it is almost impossible to know all the errors that can be made by the
programmers, this method is not practical.
• Global-Correction
– Ideally, we would like a compiler to make as few changes as possible in processing
incorrect inputs.
– We have to globally analyze the input to find the error.
– This is an expensive method, and it is not used in practice.
Bottom-Up Parsing
• A bottom-up parser creates the parse tree of the given input starting from leaves towards
the root.
• A bottom-up parser tries to find the right-most derivation of the given input in the reverse
order.
S ⇒ ... ⇒ ω   (the right-most derivation of ω)
(the bottom-up parser finds the right-most derivation in the reverse order)
• Bottom-up parsing is also known as shift-reduce parsing because its two main actions are
shift and reduce.
– At each shift action, the current symbol in the input string is pushed to a stack.
– At each reduction step, the symbols at the top of the stack (this symbol sequence is
the right side of a production) are replaced by the non-terminal at the left side of
that production.
– There are also two more actions: accept and error.
Shift-Reduce Parsing
• A shift-reduce parser tries to reduce the given input string into the starting symbol.
Rightmost derivation: S ⇒ ... ⇒ ω
The shift-reduce parser finds: ω ⇐ ... ⇐ S (the same derivation, traced in reverse)
Handle
• Informally, a handle of a string is a substring that matches the right side of a production
rule.
– But not every substring that matches the right side of a production rule is a handle.
– If S ⇒* αAw ⇒ αβw by a right-most derivation, then the production A → β in the
position following α is a handle of αβw.
• If the grammar is unambiguous, then every right-sentential form of the grammar has
exactly one handle.
Handle Pruning
• A right-most derivation in reverse can be obtained by handle-pruning.
S = γ0 ⇒ γ1 ⇒ γ2 ⇒ ... ⇒ γn-1 ⇒ γn = ω   (ω is the input string)
• Start from γn, find a handle An → βn in γn, and replace βn by An to get γn-1.
• Then find a handle An-1 → βn-1 in γn-1, and replace βn-1 by An-1 to get γn-2.
• Repeat this, until we reach S.
id+id*id     reduce by F → id
F+id*id      reduce by T → F
T+id*id      reduce by E → T
E+id*id      reduce by F → id
E+F*id       reduce by T → F
E+T*id       reduce by F → id
E+T*F        reduce by T → T*F
E+T          reduce by E → E+T
E
• Shift: The next input symbol is shifted onto the top of the stack.
• Reduce: Replace the handle on the top of the stack by the non-terminal.
• Error: Parser discovers a syntax error, and calls an error recovery routine.
• There are context-free grammars for which shift-reduce parsers cannot be used.
• Stack contents and the next input symbol may not decide action:
• Operator-Precedence Parser
– Simple, but only a small class of grammars
• LR-Parsers
– Cover a wide range of grammars.
• SLR – simple LR parser
• LR – the most general (canonical) LR parser
• LALR – intermediate LR parser (lookahead LR parser)
– SLR, LR and LALR work the same way; only their parsing tables are different.
Operator-Precedence Parser
• Operator grammar
• Ex:
EAB EEOE EE+E |
Aa Eid E*E |
Bb O+|*|/ E/E | id
not operator grammar not operator grammar operator grammar
Precedence Relations
• The determination of correct precedence relations between terminals is based on the
traditional notions of associativity and precedence of operators.
Using Operator-Precedence Relations
• The intention of the precedence relations is to find the handle of a right-sentential form, with
<. marking the left end,
=· appearing in the interior of the handle, and
.> marking the right end.
• In our input string $a1a2...an$, we insert the precedence relation between the pairs of
terminals (the precedence relation holds between the terminals in that pair).
The operator-precedence relations for id, +, * and $ are:
        id     +      *      $
id             .>     .>     .>
+       <.     .>     <.     .>
*       <.     .>     .>     .>
$       <.     <.     <.
• Then the input string id+id*id with the precedence relations inserted will be:
$ <. id .> + <. id .> * <. id .> $
• Scan the string from left end until the first .> is encountered.
• Then scan backwards (to the left) over any =· until a <. is encountered.
• The handle contains everything to the left of the first .> and to the right of the <. that is
encountered.
Algorithm:
set p to point to the first symbol of w$ ;
repeat forever
if ( $ is on top of the stack and p points to $ ) then return
else {
let a be the topmost terminal symbol on the stack and let b be the symbol pointed
to by p;
if ( a <. b or a =· b ) then { /* SHIFT */
push b onto the stack;
advance p to the next input symbol;
}
else if ( a .> b ) then /* REDUCE */
repeat pop stack
until ( the top of stack terminal is related by <. to the terminal most
recently popped );
else error(); }
Also, let:
( =· )    ( <. (    ( <. id    $ <. (    $ <. id
id .> )   id .> $   ) .> )     ) .> $
Operator-Precedence Relations
[Table: the operator-precedence relations among the terminals +, -, *, /, ^, id, (, ) and $]
Precedence Functions
• Compilers using operator precedence parsers do not need to store the table of precedence
relations.
• The table can be encoded by two precedence functions f and g that map terminal symbols
to integers.
• Disadvantages:
– It cannot handle the unary minus (the lexical analyzer should handle the unary
minus).
• Advantages:
– Simple
Error Cases:
• No relation holds between the terminal on the top of stack and the next input
symbol.
• A handle is found (reduction step), but there is no production with this handle as a
right side
Error Recovery:
• Decide which right-hand side the popped handle “looks like”, and try to recover
from that situation.
• LR-Parsers
– SLR, LR and LALR work the same way (they use the same algorithm); only their parsing
tables are different.
LR Parsing Algorithm
Actions of A LR-Parser
• shift s -- shifts the next input symbol and the state s onto the stack
( S0 X1 S1 ... Xm Sm, ai ai+1 ... an $ )  ⊢  ( S0 X1 S1 ... Xm Sm ai s, ai+1 ... an $ )
• Error -- Parser detected an error (an empty entry in the action table)
Reduce Action
• reduce A → β: pop 2|β| (= 2r) items from the stack; let us assume that β = Y1Y2...Yr
• An LR(0) item of a grammar G is a production of G with a dot at some position of the right
side.
• Sets of LR(0) items will be the states of action and goto table of the SLR parser.
• A collection of sets of LR(0) items (the canonical LR(0) collection) is the basis for
constructing SLR parsers.
• Augmented Grammar:
G’ is G with a new production rule S’ → S, where S’ is the new starting symbol.
• If I is a set of LR(0) items for a grammar G, then closure(I) is the set of LR(0) items
constructed from I by the two rules:
We will apply this rule until no more new LR(0) items can be added to closure(I).
E’ → E        closure({E’ → .E}) =
E → E+T       { E’ → .E        (kernel item)
E → T           E → .E+T
T → T*F         E → .T
T → F           T → .T*F
F → (E)         T → .F
F → id          F → .(E)
                F → .id }
Goto Operation
Example:
I = { E’ → .E, E → .E+T, E → .T,
      T → .T*F, T → .F,
      F → .(E), F → .id }
goto(I,E) = { E’ → E., E → E.+T }
goto(I,T) = { E → T., T → T.*F }
goto(I,F) = { T → F. }
goto(I,() = { F → (.E), E → .E+T, E → .T, T → .T*F, T → .F,
              F → .(E), F → .id }
goto(I,id) = { F → id. }
• To create the SLR parsing tables for a grammar G, we will create the canonical LR(0)
collection of the grammar G’.
• Algorithm:
C is { closure({S’ → .S}) }
repeat the following until no more sets of LR(0) items can be added to C:
for each I in C and each grammar symbol X
if goto(I,X) is not empty and not in C
add goto(I,X) to C
[Figure: the goto graph (transition diagram) of the canonical LR(0) collection I0, ..., I11 for the
augmented expression grammar, with transitions on the symbols E, T, F, id, (, ), + and *]
Construct the canonical collection of sets of LR(0) items for G’. C{I0,...,In}
Create the parsing action table as follows
If a is a terminal, A → α.aβ is in Ii and goto(Ii,a) = Ij, then action[i,a] is shift j.
If A → α. is in Ii, then action[i,a] is reduce A → α for all a in FOLLOW(A), where
A ≠ S’.
If S’ → S. is in Ii, then action[i,$] is accept.
If any conflicting actions are generated by these rules, the grammar is not SLR(1).
Create the parsing goto table:
for all non-terminals A, if goto(Ii,A) = Ij then goto[i,A] = j.
All entries not defined by (2) and (3) are errors.
The initial state of the parser is the one containing the item S’ → .S.
SLR(1) Grammar
An LR parser using SLR(1) parsing tables for a grammar G is called as the SLR(1) parser for G.
If a grammar G has an SLR(1) parsing table, it is called SLR(1) grammar (or SLR grammar in
short).
Every SLR grammar is unambiguous, but not every unambiguous grammar is an SLR grammar.
If a state does not know whether it will make a shift or a reduction for a terminal, we say that
there is a shift/reduce conflict. If it does not know whether it will make a reduction using the
production rule i or j for a terminal, we say that there is a reduce/reduce conflict.
If the SLR parsing table of a grammar G has a conflict, we say that the grammar is not an SLR
grammar.
Conflict Example
S → L=R
S → R
L → *R
L → id
R → L

I0: S’ → .S, S → .L=R, S → .R, L → .*R, L → .id, R → .L
I1: S’ → S.
I2: S → L.=R, R → L.
I3: S → R.
I4: L → *.R, R → .L, L → .*R, L → .id
I5: L → id.
I6: S → L=.R, R → .L, L → .*R, L → .id
I7: L → *R.
I8: R → L.
I9: S → L=R.

Problem: FOLLOW(R) = {=, $}. In state I2, on the input symbol = the parser can shift 6
(from S → L.=R) or reduce by R → L: a shift/reduce conflict.
Problem:
FOLLOW(A) = {a, b}
FOLLOW(B) = {a, b}
On input a: reduce by A or reduce by B (a reduce/reduce conflict).
On input b: reduce by A or reduce by B (a reduce/reduce conflict).
LR(1) Item
To avoid some of invalid reductions, the states need to carry more information.
Extra information is put into a state by including a terminal symbol as a second component
in an item.
An LR(1) item is:
A → α.β, a   where a is the lookahead of the LR(1) item
(a is a terminal or the end-marker $).
The construction of the canonical collection of the sets of LR(1) items are similar to the
construction of the canonical collection of the sets of LR(0) items, except that closure and
goto operations work a little bit different.
goto operation
Algorithm:
C is { closure({S’ → .S, $}) }
repeat the following until no more sets of LR(1) items can be added to C:
for each I in C and each grammar symbol X
if goto(I,X) is not empty and not in C
add goto(I,X) to C
goto function is a DFA on the sets in C.
1. Construct the canonical collection of sets of LR(1) items for G’. C{I0,...,In}
2. Create the parsing action table as follows
If a is a terminal, A → α.aβ, b is in Ii and goto(Ii,a) = Ij, then action[i,a] is shift j.
If A → α., a is in Ii, then action[i,a] is reduce A → α, where A ≠ S’.
If S’ → S., $ is in Ii, then action[i,$] is accept.
If any conflicting actions are generated by these rules, the grammar is not LR(1).
3. Create the parsing goto table:
for all non-terminals A, if goto(Ii,A) = Ij then goto[i,A] = j.
4. All entries not defined by (2) and (3) are errors.
5. The initial state of the parser is the one containing the item S’ → .S, $.
We will do this for all states of a canonical LR(1) parser to get the states of the LALR parser.
In fact, the number of the states of the LALR parser for a grammar will be equal to the number
of states of the SLR parser for that grammar.
Create the canonical LR(1) collection of the sets of LR(1) items for the given grammar.
Find each core; find all sets having that same core; replace those sets having same cores with
a single set which is their union.
C = {I0,...,In} becomes C’ = {J1,...,Jm}, where m ≤ n
Create the parsing tables (action and goto tables) same as the construction of the parsing
tables of LR(1) parser.
– Note that if J = I1 ∪ ... ∪ Ik, then, since I1,...,Ik have the same cores, the cores of
goto(I1,X),...,goto(Ik,X) must also be the same.
– So, goto(J,X) = K where K is the union of all sets of items having the same cores as
goto(I1,X).
If no conflict is introduced, the grammar is LALR(1) grammar. (We may only introduce
reduce/reduce conflicts; we cannot introduce a shift/reduce conflict)
Shift/Reduce Conflict
We say that we cannot introduce a shift/reduce conflict during the shrink process for the
creation of the states of a LALR parser.
Assume that we can introduce a shift/reduce conflict. In this case, a state of the LALR parser must
have:
A → α., a    and    B → β.aγ, b
This means that some state of the canonical LR(1) parser must have:
A → α., a    and    B → β.aγ, c
But this state also has a shift/reduce conflict, i.e. the original canonical LR(1) parser already has
a conflict.
(The reason: the shift operation does not depend on lookaheads.)
Reduce/Reduce Conflict
But, we may introduce a reduce/reduce conflict during the shrink process for the creation of
the states of a LALR parser.
[Figure: merging the LR(1) sets with the same cores for the S → L=R grammar: I4 and I11,
I5 and I12, I7 and I13, and I8 and I10 are replaced by their unions I411, I512, I713 and I810;
for example, I6: S → L=.R,$ has a transition on R to I9: S → L=R.,$]
An LR parser will detect an error when it consults the parsing action table and finds an error
entry. All empty entries in the action table are error entries.
Errors are never detected by consulting the goto table.
An LR parser will announce error as soon as there is no valid continuation for the scanned
portion of the input.
A canonical LR parser (an LR(1) parser) will never make even a single reduction before
announcing an error.
Panic-mode error recovery:
Scan down the stack until a state s with a goto on a particular nonterminal A is
found (get rid of everything from the stack before this state s).
Discard zero or more input symbols until a symbol a is found that can legitimately follow
A.
– The symbol a is simply in FOLLOW(A), but this may not work for all situations.
The parser stacks the nonterminal A and the state goto[s,A], and it resumes
normal parsing.
This nonterminal A is normally a basic programming block (there can be more
than one choice for A).
– stmt, expr, block, ...
In phrase-level error recovery, each empty entry in the action table is marked with a specific
error routine.
An error routine reflects the error that the user most likely will make in that case.
An error routine inserts symbols into the stack or the input (or it deletes symbols from the
stack and the input, or it can do both insertion and deletion), for example for:
– a missing operand
– an unbalanced right parenthesis
Contents
Introduction
A Short Program
Semantic Analysis
Annotated Abstract Syntax tree (AST)
Syntax-Directed Translation
Syntax-Directed Definitions (SDD)
Evaluation of S-Attributed Definitions
L-Attributed Definitions
Translation Schemes
Semantic Analyzer
Semantic analysis is the third phase of the compiler. Semantic analysis makes sure that the
declarations and statements of the program are semantically correct. It is a collection of
procedures which are called by the parser as and when required by the grammar. Both the
syntax tree of the previous phase and the symbol table are used to check the consistency of
the given code.
It uses syntax tree and symbol table to check whether the given program is semantically
consistent with language definition. It gathers type information and stores it in either
syntax tree or symbol table. This type information is subsequently used by compiler
during intermediate-code generation.
Semantic Errors
We have mentioned some of the semantic errors that the semantic analyzer is
expected to recognize:
Type mismatch
Undeclared variable
Therefore, semantic analysis verifies properties of the program that aren't caught during
the earlier phases:
Variables are declared before they're used.
Expressions have the right types.
Classes don't inherit from nonexistent base classes
Type consistence;
Inheritance relationship is correct;
A class is defined only once;
A method in a class is defined only once;
Reserved identifiers are not misused;
Once we finish semantic analysis, we know that the user's input program is legal.
Short program that shows semantic error
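The program itself was not preserved in these notes; a stand-in C example (our own) that deliberately contains the two kinds of errors listed above:

#include <stdio.h>

int main(void)
{
    int x = "hello";   /* type mismatch: a char* is assigned to an int */
    y = x + 1;         /* undeclared variable: y was never declared    */
    return 0;
}

Both statements are syntactically well-formed, yet the semantic analyzer rejects them.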
Semantic analysis also gathers useful information about program for later phases:
E.g. count how many variables are in scope at each point.
Why can't we just do this during parsing?
Context-free grammars cannot represent all language constraints,
e.g. non-local/context-dependent relations.
Limitations of CFGs
How would you prevent duplicate class definitions?
How would you differentiate variables of one type from variables of another
type?
How would you ensure classes implement all interface methods?
For most programming languages, these are provably impossible.
Semantic analysis can be implemented using an annotated abstract syntax tree (AST).
The input for semantic analysis is the abstract syntax tree produced by the syntax analyzer,
and the output is an annotated abstract syntax tree.
An annotated abstract syntax tree is a parse tree that also shows the values of the
attributes at each node.
Attribute Grammar
Attribute grammar is a special form of context-free grammar where some additional
information (attributes) is appended to one or more of its non-terminals in order to
provide context-sensitive information. Each attribute has a well-defined domain of values,
such as integer, float, character, string, and expressions.
Attribute grammar is a medium to provide semantics to the context-free grammar and it
can help specify the syntax and semantics of a programming language.
Example:
E → E + T { E.value = E.value + T.value }
The right part of the CFG contains the semantic rules that specify how the grammar
should be interpreted. Here, the values of non-terminals E and T are added together and
the result is copied to the non-terminal E.
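In a yacc specification this rule appears almost verbatim as a semantic action (a fragment, assuming the token and nonterminal values are integers):

expr : expr '+' term   { $$ = $1 + $3; }   /* E.value = E1.value + T.value */
     | term            { $$ = $1; }
     ;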
Dependency Graph
L → id { addtype(id.entry,L.in) }
Symbol T is associated with a synthesized attribute type.
Symbol L is associated with an inherited attribute in.
We can use inherited attributes to track type information
We can use inherited attributes to track whether an identifier appear on the left or right side of
an assignment operator “:=” ( e.g. a := a +1 )
[Figure: the parser stack while evaluating 2 + 3 with an S-attributed definition]
L-Attributed Definitions
A syntax-directed definition is L-attributed if each inherited attribute of Xj, where
1 ≤ j ≤ n, on the right side of A → X1X2...Xn depends only on:
1. the attributes of the symbols X1,...,Xj-1 to the left of Xj in the production, and
2. the inherited attributes of A.
L-attributed definitions can always be evaluated by a depth-first visit of the parse
tree; this means that they can also be evaluated during parsing.
Translation Schemes
In a syntax-directed definition, we do not say anything about the evaluation times of the
semantic rules, i.e. when the semantic rules associated with a production should be evaluated.
A translation scheme is a context-free grammar in which:
attributes are associated with the grammar symbols and semantic actions
enclosed between braces {} are inserted within the right sides of
productions.
Semantic Actions
Translation schemes indicate the order in which semantic rules and attributes are to
be evaluated
A Translation Scheme Example
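The original example is not preserved here; a classic translation scheme of this kind (found in standard compiler texts) prints an expression in postfix while parsing it:

E → T R
R → + T { print("+") } R
R → ε
T → id { print(id.name) }

For the input a+b+c, the actions fire in the order a, b, +, c, +, producing the postfix string a b + c +.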
Benefits of intermediate code:
1. Retargeting is facilitated.
2. Machine-independent code optimization can be applied.
Intermediate Code
Intermediate codes are machine independent codes, but they are close to machine instructions.
The given program in a source language is converted to an equivalent program in an
intermediate language by the intermediate code generator.
Intermediate language can be many different languages, and the designer of the compiler
decides this intermediate language.
Syntax trees can be used as an intermediate language.
Postfix notation can be used as an intermediate language.
Three-address code (quadruples) can be used as an intermediate language;
we will use quadruples to discuss intermediate code generation.
Quadruples are close to machine instructions, but they are not actual machine
instructions.
Postfix form
Example
a+b ab+
(a+b)*c ab+c*
a+b*c abc*+
a:=b*c+b*d abc*bd*+:=
(+) simple and concise
Observe that given the syntax tree or the DAG of the graphical representation, we can easily
derive three-address code for assignments as above.
Quadruples
A quadruple is:
x := y op z
But we may also use the following notation for quadruples (a much better notation, because it
looks like a machine-code instruction):
op y, z, x
We use the term “three-address code” because each statement usually contains three addresses (two
for operands, one for the result).
Example:
t1:=- c
t2:=b * t1
t3:=- c
t4:=b * t3
t5:=t2 + t4
a:=t5
     op       arg1   arg2   result
(0)  uminus   c             t1
(1)  *        b      t1     t2
(2)  uminus   c             t3
(3)  *        b      t3     t4
(4)  +        t2     t4     t5
(5)  :=       t5            a
Temporary names must be entered into the symbol table as they are created.
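In an implementation, a quadruple is naturally a record with four fields (a minimal C sketch of our own; the field widths are arbitrary):

#include <stdio.h>

typedef struct {
    char op[8];       /* operator, e.g. "uminus", "*", "+", ":=" */
    char arg1[8];     /* first operand (may be empty)            */
    char arg2[8];     /* second operand (may be empty)           */
    char result[8];   /* destination, usually a temporary name   */
} Quad;

int main(void)
{
    /* The first two quadruples of the example above: t1 := -c, t2 := b * t1 */
    Quad code[] = {
        { "uminus", "c",  "",   "t1" },
        { "*",      "b",  "t1", "t2" }
    };
    for (int i = 0; i < 2; i++)
        printf("(%d) %-7s %-3s %-3s %s\n", i, code[i].op,
               code[i].arg1, code[i].arg2, code[i].result);
    return 0;
}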
Three-Address Statements
Binary Operator:
op y,z,result or result := y op z
where op is a binary arithmetic or logical operator. This binary operator is applied to y and z, and the
result of the operation is stored in result.
Unary Operator:
op y, result    or    result := op y
where op is a unary arithmetic or logical operator. This unary operator is applied to y, and the
result of the operation is stored in result.
Triples
A triple has only three fields, which we call op, arg1, and arg2. Note that the result field in the
quadruple representation is used primarily for temporary names. Using triples, we refer to the
result of an operation x op y by its position, rather than by an explicit temporary name. Thus,
instead of the temporary t1, a triple refers to position (0):
     op       arg1   arg2
(0)  uminus   c
(1)  *        b      (0)
(2)  uminus   c
(3)  *        b      (2)
(4)  +        (1)    (3)
(5)  :=       a      (4)
Indirect Triples
Indirect triples consist of a listing of pointers to triples, rather than a listing of triples themselves. For
example, let us use an array instruction to list pointers to triples in the desired order. Then, the
triples in Fig. might be represented as in Fig. With indirect triples, an optimizing compiler can move
an instruction by reordering the instruction list, without affecting the triples themselves.
Tradeoffs:
1) Performance vs. size
2) Compilation speed and memory
> There is no perfect optimizer.
Register Allocation
— keep frequently used values and temporary variables in registers
Instruction Selection
> For every expression, there are many ways to realize it on a given processor.
Peephole Optimization
— replace short instruction sequences using a pre-computed table of patterns
— structure simplifications
Constant Folding
— evaluate expressions with constant operands at compile time
Constant Propagation
— replace a use of a variable c by its constant value if there is no change of c in between
(see the example after this list)
Copy Propagation
> After a copy x := y, replace later uses of x with y, if x and y have not been changed.
Algebraic Simplifications
Strength Reduction
Dead Code elimination
Loop Optimizations
Advanced Optimizations
— auto parallelization
— profile-guided optimization
> vectorization
Iterative Process
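As a small illustration of constant folding and constant propagation (our own example):

/* before optimization */
x = 4 * 8;     /* the operands are compile-time constants         */
y = x + 2;     /* x has not changed since the previous assignment */

/* after constant folding and constant propagation */
x = 32;        /* 4 * 8 folded at compile time                    */
y = 34;        /* the use of x replaced by 32, then folded again  */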
Code generation
The final phase of a compiler is code generation
It receives an intermediate representation (IR) with supplementary information in symbol table
Produces a semantically equivalent target program
Code generator main tasks:
o Instruction selection
o Register allocation and assignment
o Instruction ordering
• Input: IR + symbol table
• We assume the front end produces a low-level IR, i.e. values of names in it can be directly
manipulated by the machine instructions.
• Common target architectures are: RISC, CISC and stack-based machines.
Instruction Selection
The code generator must map the IR program into a code sequence that can be executed by
the target machine. The complexity of performing this mapping is determined by factors
such as
the level of the IR
the nature of the instruction-set architecture
the desired quality of the generated code.
For example, every three-address statement of the form x = y + z, where x, y, and z are statically
allocated, can be translated into the code sequence
LD R0, y // R0 = y
ADD R0, R0, z // R0 = R0 + z
ST x, R0 // x = R0
This strategy often produces redundant loads and stores. For example, the sequence of
three-address statements
a = b + c
d = a + e
would be translated into
LD R0, b // R0 = b
ADD R0, R0, c // R0 = R0 + c
ST a, R0 // a = R0
LD R0, a // R0 = a
ADD R0, R0, e // R0 = R0 + e
ST d, R0 // d = R0
Here, the fourth statement is redundant since it loads a value that has just been stored, and so
is the third if a is not subsequently used.
The quality of the generated code is usually determined by its speed and size. On most
machines, a given IR program can be implemented by many different code sequences, with
significant cost differences between the different implementations.
Register Allocation
A key problem in code generation is deciding what values to hold in what registers.
Registers are the fastest computational unit on the target machine, but we usually do not
have enough of them to hold all values. Values not held in registers need to reside in
memory. Instructions involving register operands are invariably shorter and faster than
those involving operands in memory, so efficient utilization of registers is particularly
important.
1. Register allocation, during which we select the set of variables that will reside in registers
at each point in the program.
2. Register assignment, during which we pick the specific register that a variable will reside
in.
3. Complications imposed by the hardware architecture
In this section, we shall consider an algorithm that generates code for a single basic block. It
considers each three-address instruction in turn, and keeps track of what values are in what
registers so it can avoid generating unnecessary loads and stores.
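A toy sketch of that bookkeeping in C (our own illustration, far simpler than a real register descriptor):

#include <stdio.h>
#include <string.h>

#define NREGS 4
static char regContents[NREGS][8];   /* regContents[r] = name of the value currently in R r */

/* Return a register already holding `name`, or -1 if it must be loaded. */
static int findReg(const char *name)
{
    for (int r = 0; r < NREGS; r++)
        if (strcmp(regContents[r], name) == 0) return r;
    return -1;
}

static void load(const char *name, int r)
{
    printf("LD R%d, %s\n", r, name);
    strcpy(regContents[r], name);
}

int main(void)
{
    /* Generate code for a = b + c; d = a + e (the example from the text). */
    load("b", 0);
    printf("ADD R0, R0, c\n");
    strcpy(regContents[0], "a");          /* R0 now holds the value of a */
    printf("ST a, R0\n");
    if (findReg("a") < 0) load("a", 0);   /* a is already in R0, so the redundant load is skipped */
    printf("ADD R0, R0, e\n");
    strcpy(regContents[0], "d");
    printf("ST d, R0\n");
    return 0;
}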
One of the primary issues during code generation is deciding how to use registers to best
advantage. There are four principal uses of registers: