
Bonga University

College of Engineering and Technology


Department of Computer Science
CoSc 4103– COMPILER DESIGN
Chapter 1 Handout – Introduction to Compiler

Introduction
Programming languages are notations for describing computations to people and to machines. The
world as we know it depends on programming languages, because all the software running on all
the computers was written in some programming language. But, before a program can be run, it
first must be translated into a form in which it can be executed by a computer. The software systems
that do this translation are called compilers.
Generations of programming language
 Programming languages are categorized into five generations: (1st, 2nd, 3rd, 4th and 5th
generation languages)
 These programming languages can also be categorized into two broad categories: low level
and high level languages.
 Low level languages are machine specific or dependent.
 High level languages like COBOL and BASIC are machine independent and can run
on a variety of computers.
 From the five categories of programming languages, first and second generation languages
are low level languages and the rest are high level programming languages.
1. First Generation (Machine languages, 1940’s):
 Difficult to write applications with.
 Dependent on machine languages of the specific computer being used.
 Machine languages allow the programmer to interact directly with the hardware,
and they can be executed by the computer without the need for a translator.
2. Second Generation (Assembly languages, early 1950’s):
 Uses symbolic names for operations and storage locations.
 A system program called an assembler translates a program written in assembly
language to machine language.
 Programs written in assembly language are not portable; i.e., different computer
architectures have their own machine and assembly languages.
 They are highly used in system software development.
3. Third Generation (High level languages, 1950’s to 1970’s):
 Uses English-like instructions, and mathematicians were able to define variables
with statements such as Z = A + B.
 Such languages are much easier to use than assembly language.
 Programs written in high level languages need to be translated into machine
language in order to be executed.
 All third generation programming languages are procedural languages.
 The use of common words (reserved words) within instructions makes them easier
to learn.



4. Fourth Generation (since late 1970’s):
 Have simple, English-like syntax rules; commonly used to access databases.
 Fourth generation languages are non-procedural languages.
 The non-procedural method is easier to write, but you have less control over how
each task is actually performed.
 Fourth generation languages have a minimum number of syntax rules.
 This saves time and frees professional programmers for more complex tasks.
 Some examples of 4GL are structured query languages (SQL), report generators,
application generators and graphics languages.
5. Fifth Generation (1990’s):
 These are used in artificial intelligence (AI) and expert systems; also used for
accessing databases.
 5GLs are “natural” languages whose instructions closely resemble human speech.
E.g. “get me John Brown’s sales figures for the 1997 financial year”.
 5GLs require very powerful hardware and software because of the complexity
involved in interpreting commands in human language.
6. Another classification of programming languages is:
 Imperative for languages in which a program specifies how a computation is to be
done.
 Languages such as C, C++, C#, and Java are imperative languages. In imperative
languages there is a notion of program state and statements that change the state.
 And declarative for languages in which a program specifies what computation is
to be done.
 Functional languages such as ML and Haskell and constraint logic languages such
as Prolog are often considered to be declarative languages.
7. An object-oriented language is one that supports object-oriented programming, a
programming style in which a program consists of a collection of objects that interact with
one another.
 Simula 67 and Smalltalk are the earliest major object-oriented languages.
Languages such as C++, C#, Java, and Ruby are more recent object-oriented
languages.
8. Scripting languages are interpreted languages with high-level operators designed for
"gluing together" computations.
 These computations were originally called "scripts." Awk, JavaScript, Perl, PHP,
Python, Ruby, and Tcl are popular examples of scripting languages.
 Programs written in scripting languages are often much shorter than equivalent
programs written in languages like C.



Overview of language processing system

Preprocessor
A preprocessor produces input to compilers. It may perform the following functions.
1. Macro processing: A preprocessor may allow a user to define macros that are short hands for
longer constructs.
2. File inclusion: A preprocessor may include header files into the program text.
3. Rational preprocessor: These preprocessors augment older languages with more modern flow-
of-control and data-structuring facilities.
4. Language extensions: These preprocessors attempt to add capabilities to the language by means
of built-in macros.

Compiler
A compiler is a translator program that takes a program written in a high-level language (HLL),
the source program, and translates it into an equivalent program in machine-level language (MLL),
the target program. An important part of a compiler's job is reporting errors to the programmer.
Executing a program written in an HLL programming language basically has two parts. The source
program must first be compiled (translated) into an object program. Then the resulting object
program is loaded into memory and executed.



Assembler: programmers found it difficult to write or read programs in machine language. They
began to use a mnemonic (symbol) for each machine instruction, which they would subsequently
translate into machine language.
Such a mnemonic machine language is now called an assembly language. Programs known as
assemblers were written to automate the translation of assembly language into machine language.
The input to an assembler program is called the source program; the output is a machine language
translation (object program).

Loader and Link-editor:


Linking: programs typically contain references to functions and data defined elsewhere, such as in
the standard libraries. The object code produced by the compiler typically contains “holes” due to
these missing parts. A linker links the object code with the code for the missing function to produce
an executable image (with no missing pieces). Generally, the linker completes the object code by
linking it with the object code of any library modules that the program may have referred to. The
final result is an executable file.

Loading: the loader takes the executable file from disk and transfers it to memory. Additional
components from shared libraries that support the program are also loaded. Finally, the computer,
under the control of its CPU, executes the program.

Translator

A translator is a program that takes as input a program written in one language and produces as
output a program in another language. Besides program translation, the translator performs another
very important role: error detection. Any violation of the HLL specification is detected and
reported to the programmer. The important roles of a translator are:

1. Translating the HLL program input into an equivalent MLL program.


2. Providing diagnostic messages wherever the programmer violates specification of the HLL.

TYPES OF TRANSLATORS:-

INTERPRETER
COMPILER
ASSEMBLER

Compiler
Simply stated, a compiler is a program that can read a program in one language (the source
language) and translate it into an equivalent program in another language (the target language),
as shown in the figure. An important role of the compiler is to report any errors in the source
program that it detects during the translation process.



If the target program is an executable machine-language program, it can then be called by the user
to process inputs and produce outputs;

Advantages of Compiler

- Conciseness, which improves programmer productivity, and enforcement of semantic restrictions.


- The HLL source program is translated and analyzed only once; the equivalent machine code
produced by the compiler can then be run many times.

Disadvantage of Compiler

- Size of compiler and compiled code


- Debugging involves all source code

Interpreter: An interpreter is another common kind of language processor. Instead of producing


a target program as a translation, an interpreter appears to directly execute the operations specified
in the source program on inputs supplied by the user, as shown in figure

The machine-language target program produced by a compiler is usually much faster than an
interpreter at mapping inputs to outputs. An interpreter, however, can usually give better error
diagnostics than a compiler, because it executes the source program statement by statement.
Languages such as BASIC, SNOBOL, and LISP can be translated using interpreters. Java also uses
an interpreter.

Example 1.1: Java language processors combine compilation and interpretation, as shown in
figure. A Java source program may first be compiled into an intermediate form called bytecodes.
The bytecodes are then interpreted by a virtual machine.



Addressing Portability

• Suppose you want to write compilers from m source languages to n computer platforms. A
naive solution requires n*m programs:

• but we can do it with n+m programs:
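For instance, with m = 3 source languages and n = 4 platforms, the naive approach needs
3 * 4 = 12 separate translators, whereas translating every source language to a shared
intermediate representation and then translating that representation to each platform needs
only 3 + 4 = 7.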

Major Parts of Compilers

There are two major parts of a compiler: Analysis (Front end) and Synthesis (Back end)

In analysis phase, an intermediate representation is created from the given source program. Lexical
Analyzer, Syntax Analyzer and Semantic Analyzer are the parts of this phase.

In synthesis phase, the equivalent target program is created from this intermediate representation.
Intermediate Code Generator, Code Generator, and Code Optimizer are the parts of this phase.

Structure of the compiler design:-

The compilation process is partitioned into a number of sub-processes called ‘phases’.

• Each phase transforms the source program from one representation into another
representation.
• They communicate with error handlers.
• They communicate with the symbol table.

Lexical Analysis:-

• Lexical Analyzer reads the source program character by character and returns the tokens
of the source program.
• A token describes a pattern of characters having the same meaning in the source program
(such as identifiers, operators, keywords, numbers, delimiters and so on).
Ex: newval := oldval + 12 => tokens:
newval    identifier
:=        assignment operator
oldval    identifier
+         add operator
12        a number

• Puts information about identifiers into the symbol table.


• Regular expressions are used to describe tokens (lexical constructs).
• A (Deterministic) Finite State Automaton can be used in the implementation of a lexical
analyzer.
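To make the token-by-token flow concrete, here is a minimal hand-written scanner sketch in C
for the example above. The token names, the error handling, and the fixed input string are our
illustrative assumptions, not part of the handout:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

typedef enum { TOK_ID, TOK_NUM, TOK_ASSIGN, TOK_PLUS, TOK_ERROR, TOK_EOF } TokenType;

/* Read one token starting at *p; copy its characters into lexeme and
   advance *p past it. */
TokenType next_token(const char **p, char *lexeme) {
    int i = 0;
    while (**p == ' ' || **p == '\t') (*p)++;          /* skip whitespace */
    if (**p == '\0') return TOK_EOF;
    if (isalpha((unsigned char)**p)) {                 /* identifier: letter(letter|digit)* */
        while (isalnum((unsigned char)**p)) lexeme[i++] = *(*p)++;
        lexeme[i] = '\0';
        return TOK_ID;
    }
    if (isdigit((unsigned char)**p)) {                 /* number: digit+ */
        while (isdigit((unsigned char)**p)) lexeme[i++] = *(*p)++;
        lexeme[i] = '\0';
        return TOK_NUM;
    }
    if (**p == ':' && (*p)[1] == '=') {                /* assignment operator := */
        strcpy(lexeme, ":="); *p += 2;
        return TOK_ASSIGN;
    }
    if (**p == '+') { strcpy(lexeme, "+"); (*p)++; return TOK_PLUS; }
    lexeme[0] = *(*p)++; lexeme[1] = '\0';             /* anything else: lexical error */
    return TOK_ERROR;
}

int main(void) {
    const char *src = "newval := oldval + 12";
    char lexeme[64];
    TokenType t;
    while ((t = next_token(&src, lexeme)) != TOK_EOF)
        printf("lexeme '%s'  token %d\n", lexeme, t);
    return 0;
}

Running it on newval := oldval + 12 prints one line per token, mirroring the token list above.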

Syntax Analysis:-
The second stage of translation is called Syntax analysis or parsing. In this phase expressions,
statements, declarations etc… are identified by using the results of lexical analysis. Syntax analysis
is aided by using techniques based on formal grammar of the programming language.

• The syntax of a language is specified by a context free grammar (CFG).


• The rules in a CFG are mostly recursive.
• A syntax analyzer checks whether a given program satisfies the rules implied by a CFG or
not.
– If it satisfies, the syntax analyzer creates a parse tree for the given program.

• Ex: We use BNF (Backus Naur Form) to specify a CFG:
assgstmt   -> identifier := expression
expression -> identifier
expression -> number
expression -> expression + expression



Syntax Analyzer versus Lexical Analyzer

• Which constructs of a program should be recognized by the lexical analyzer, and which ones
by the syntax analyzer?
– Both of them do similar things;
– But the lexical analyzer deals with simple non-recursive constructs of the language.
– The syntax analyzer deals with recursive constructs of the language.
– The lexical analyzer simplifies the job of the syntax analyzer.
– The lexical analyzer recognizes the smallest meaningful units (tokens) in a source
program.
– The syntax analyzer works on the smallest meaningful units (tokens) in a source
program to recognize meaningful structures in our programming language.

Semantic Analyzer

• A semantic analyzer checks the source program for semantic errors and collects the type
information for the code generation.
• Type-checking is an important part of semantic analyzer.
• Normally semantic information cannot be represented by a context-free language used in
syntax analyzers.
• Context-free grammars used in the syntax analysis are integrated with attributes (semantic
rules)
– the result is a syntax-directed translation,
– Attribute grammars
• Ex:
newval := oldval + 12
• The type of the identifier newval must match with type of the expression (oldval+12)

Intermediate Code Generations:-

An intermediate representation of the final machine language code is produced. This phase bridges
the analysis and synthesis phases of translation.

• A compiler may produce an explicit intermediate code representing the source program.
• These intermediate codes are generally machine (architecture) independent. But the level
of intermediate codes is close to the level of machine codes.
• Ex: newval := oldval * fact + 1
      id1 := id2 * id3 + 1

Intermediate codes (quadruples):
MULT id2,id3,temp1
ADD temp1,#1,temp2
MOV temp2,,id1
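As an illustration only (not part of the handout), the quadruples above could be stored in a
compiler as a small record type; the field names and the printing loop are hypothetical:

#include <stdio.h>

/* One quadruple: operator, two operands (arg2 may be empty), result. */
typedef struct {
    const char *op, *arg1, *arg2, *result;
} Quad;

int main(void) {
    /* The intermediate code for  newval := oldval * fact + 1  from the example: */
    Quad code[] = {
        { "MULT", "id2",   "id3", "temp1" },
        { "ADD",  "temp1", "#1",  "temp2" },
        { "MOV",  "temp2", "",    "id1"   },   /* empty arg2: MOV has one operand */
    };
    for (int i = 0; i < 3; i++)
        printf("%-4s %-6s %-6s %s\n", code[i].op, code[i].arg1, code[i].arg2, code[i].result);
    return 0;
}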
Code Optimization

This is an optional phase, intended to improve the intermediate code so that the output runs
faster and takes less space.



Code Generation

The last phase of translation is code generation. A number of optimizations to reduce the
length of the machine language program are carried out during this phase. The output of the
code generator is the machine language program for the specified computer.

Symbol Table Management (or) Book-keeping

• Identifiers are names of variables, constants, functions, data types, etc.


• Store information associated with identifiers
– Information associated with different types of identifiers can be different
• Information associated with variables are name, type, address,size (for array), etc.



• Information associated with functions are name,type of return value, parameters, address,
etc.
• Accessed in every phase of compilers
– The scanner, parser, and semantic analyzer put names of identifiers in symbol table.
– The semantic analyzer stores more information (e.g. data types) in the table.
– The intermediate code generator, code optimizer and code generator use information
in symbol table to generate appropriate code.
• Mostly, a hash table is used for efficiency, as in the sketch below.
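As a sketch of that idea (chained hashing; the table size, field names, and helper functions
below are our illustrative assumptions, not an API from the handout):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 211                /* a prime, a common choice for small symbol tables */

typedef struct Symbol {
    char *name;                       /* identifier lexeme                               */
    char *type;                       /* e.g. "int", "real"; filled by semantic analysis */
    struct Symbol *next;              /* chain for collision handling                    */
} Symbol;

static Symbol *table[TABLE_SIZE];

static unsigned hash(const char *s) { /* simple multiplicative string hash */
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

Symbol *lookup(const char *name) {    /* find an identifier, or return NULL */
    for (Symbol *p = table[hash(name)]; p; p = p->next)
        if (strcmp(p->name, name) == 0) return p;
    return NULL;
}

Symbol *insert(const char *name, const char *type) {
    Symbol *p = lookup(name);
    if (p) return p;                  /* already present: return the existing entry */
    p = malloc(sizeof *p);
    p->name = strdup(name);
    p->type = strdup(type);
    p->next = table[hash(name)];
    table[hash(name)] = p;
    return p;
}

int main(void) {
    insert("newval", "int");
    insert("oldval", "int");
    printf("oldval has type %s\n", lookup("oldval")->type);
    return 0;
}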
Error Handlers
• Errors can be found in every phase of compilation.
– Errors found during compilation are called static (or compile-time) errors.
– Errors found during execution are called dynamic (or run-time) errors.
• Compilers need to detect, report, and recover from errors found in source programs.
• Error handlers are different in different phases of the compiler.
Example of compiler design

position := initial + rate * 60



Lexical Analyzer

Tokens: id1 := id2 + id3 * 60

Syntax Analyzer

(parse tree; each node's children are indented below it)
:=
    id1
    +
        id2
        *
            id3
            60

Semantic Analyzer

:=
    id1
    +
        id2
        *
            id3
            inttoreal
                60

Intermediate Code Generator

temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3

Code Optimizer

temp1 := id3 * 60.0
id1 := id2 + temp1



Code Generator

MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1

Compiler-Construction Tools
The compiler writer, like any software developer, can profitably use modern software development
environments containing tools such as language editors, debuggers, version managers, profilers,
test harnesses, and so on. In addition to these general software-development tools, other more
specialized tools have been created to help implement various phases of a compiler.
These tools use specialized languages for specifying and implementing specific components, and
many use quite sophisticated algorithms. The most successful tools are those that hide the details
of the generation algorithm and produce components that can be easily integrated into the
remainder of the compiler. Some commonly used compiler-construction tools include:
1. Parser generators that automatically produce syntax analyzers from a grammatical
description of a programming language.
2. Scanner generators that produce lexical analyzers from a regular-expression description of
the tokens of a language.
3. Syntax-directed translation engines that produce collections of routines for walking a parse
tree and generating intermediate code.
4. Code-generator generators that produce a code generator from a collection of rules for
translating each operation of the intermediate language into the machine language for a
target machine.
5. Data-flow analysis engines that facilitate the gathering of information about how values are
transmitted from one part of a program to each other part. Data-flow analysis is a key part
of code optimization.
6. Compiler-construction toolkits that provide an integrated set of routines for constructing
various phases of a compiler.

Passes

One or more phases are combined into a module called a pass.

Single-pass Compiler:-

All the phases are combined into a single pass. The phases work in an interleaved way.



The target program is already generated while the source program is read.

Multi-pass Compiler:-

The phases are combined into a number of groups, each of which forms one pass of a multi-pass compiler.

Why multi-pass?

• If memory is scarce (irrelevant today)


• If the language is complex
• If portability is important



Today: Often Two-Pass Compilers

(A multi-pass compiler can be made to use less space than a single-pass compiler.)

Compilers and interpreters are not the only examples of translators.

Here are a few more:

Source Language        Translator                       Target Language

LaTeX                  Text Formatter                   PostScript

SQL                    database query optimizer         Query Evaluation Plan

Java                   javac compiler                   Java byte code

Java                   cross-compiler                   C++ code

English text           Natural Language Understanding   semantics (meaning)

Regular Expressions    JLex scanner generator           a scanner in Java

BNF of a language      CUP parser generator             a parser in Java

Cross Compiler

• a compiler which generates target code for a different machine from one on which the
compiler runs.
• A host language is a language in which the compiler is written.
– T-diagram



(T-diagram: a compiler from source language S to target language T)

– Cross compilers are used very often in practice.


– If we want a compiler from language A to language B on a machine with language E:
– write one in E; or
– write one in D, if you have a compiler from D to E on some machine. This is better
than the former approach if D is a high-level language but E is a machine language; or
– write one from G to B in E, if we have a compiler from A to G written in E.

Porting

• Porting: construct a compiler between a source and a target language using one host
language from another host language.



Bootstrapping

• If we have to implement, from scratch, a compiler from a high-level language A to a
machine language, which is also the host language, we can use:
– the direct method
– bootstrapping

Cousins of Compilers
• Linkers
• Loaders
• Interpreters
• Assemblers

Properties of a good compiler


 The compiler itself must be bug-free.
 It must generate correct machine code.
 The generated machine code must run fast.
 The compiler itself must run fast.
 The compiler must be portable.
 It must give good diagnostics and error messages.
 The generated code must work well with existing debuggers.
 It must have consistent optimization.

Why we study about languages and compiler


 Improve understanding of program behaviors.
 Increase ability to learn new languages.
 Learn to build a large and reliable system.



 See many basic computer science concepts at work.
 Increase capacity of expression.
Summary of translators

Compiler
- Translates high-level languages into machine code.
- An executable file of machine code is produced (object code).
- Compiled programs no longer need the compiler.
- An error report is produced once the entire program is compiled; these errors may cause the
program to crash.
- Compiling may be slow, but the resulting program code will run quickly (directly on the
processor).
- One high-level language statement may be several lines of machine code when compiled.

Interpreter
- Temporarily executes high-level languages, one statement at a time.
- No executable file of machine code is produced (no object code).
- Interpreted programs cannot be used without the interpreter.
- An error message is produced immediately (and the program stops at that point).
- Interpreted code is run through the interpreter (IDE), so it may be slow, e.g. to execute
program loops.

Assembler
- Translates low-level assembly code into machine code.
- An executable file of machine code is produced (object code).
- Assembled programs no longer need the assembler.
- One low-level language statement is usually translated into one machine code instruction.



Bonga University
College of Engineering and Technology
Department of Computer Science
CoSc3112 – COMPILER DESIGN
Chapter 2 Handouts – Lexical Analysis

Overview of Lexical Analysis

A lexical analyzer, also called a scanner, typically has the following functionality and
characteristics.

 Its primary function is to convert an (often very long) sequence of characters into a
(much shorter, perhaps 10X shorter) sequence of tokens.
 The scanner must identify and categorize specific character sequences into tokens. It
must know whether every two adjacent characters in the file belong together in the
same token, or whether the second character must be in a different token.
 Most lexical analyzers discard comments & whitespace. In most languages these
characters serve to separate tokens from each other, but once lexical analysis is

completed they serve no purpose.
 Handle lexical errors (illegal characters, malformed tokens) by reporting them intelligibly
to the user.
 Efficiency is crucial; a scanner may perform elaborate input buffering.
 Token categories can be (precisely, formally) specified using regular expressions, e.g.
 IDENTIFIER=[a-zA-Z][a-zA-Z0-9]*

 Lexical Analyzers can be written by hand, or implemented automatically using finite
automata.

Role of Lexical Analysis

Issues (why separating lexical analysis from parsing):
 Simpler design
 Compiler efficiency
 Compiler portability (e.g. Linux to Win)



What’s a Token?
 A syntactic category
 In English: noun, verb, adjective, …
 In a programming language: Identifier, Integer, Keyword, Whitespace, …

Tokens, Patterns and Lexemes

 A token is a pair of a token name and an optional token value.
 A pattern is a description of the form that the lexemes of a token may take.
 A lexeme is a sequence of characters in the source program that matches the pattern for a
token.

Input buffering

 Sometimes the lexical analyzer needs to look ahead some symbols to decide about the token
to return.
 In the C language: we need to look after -, = or < to decide what token to return.
 In Fortran: DO 5 I = 1.25
 We need to introduce a two-buffer scheme to handle large look-aheads safely (a sketch
follows below).
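A minimal sketch of the two-buffer scheme in C, using '\0' as the sentinel planted after each
half (this assumes the source text contains no NUL bytes; the names and buffer size are our
illustrative choices, and a real scanner would also keep a lexeme_begin pointer):

#include <stdio.h>

#define HALF 4096
static char buf[2 * HALF + 2];   /* two halves, each followed by one sentinel slot */
static char *forward;            /* look-ahead pointer */
static FILE *src;

static void load(char *half) {   /* read up to HALF chars, plant the '\0' sentinel */
    size_t n = fread(half, 1, HALF, src);
    half[n] = '\0';
}

void init_buffers(FILE *fp) {
    src = fp;
    load(buf);                   /* fill the first half */
    buf[2 * HALF + 1] = '\0';    /* sentinel slot of the (still empty) second half */
    forward = buf;
}

/* Return the next character. Hitting the sentinel at the end of a half
   triggers one fread for the other half; a sentinel anywhere else marks
   the true end of the input. */
int advance(void) {
    char c = *forward++;
    if (c != '\0') return (unsigned char)c;
    if (forward - 1 == buf + HALF) {             /* end of first half   */
        load(buf + HALF + 1);                    /* refill second half  */
        forward = buf + HALF + 1;
        return advance();
    }
    if (forward - 1 == buf + 2 * HALF + 1) {     /* end of second half  */
        load(buf);                               /* wrap to first half  */
        forward = buf;
        return advance();
    }
    return EOF;                                  /* true end of input   */
}

With this layout, testing for the end of a half and for the end of input costs a single
comparison per character, which is the point of the sentinel trick.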

Specification of tokens

 In the theory of compilation, regular expressions are used to formalize the specification
of tokens.
 Regular expressions are means for specifying regular languages.
 Example: letter(letter | digit)*
 Each regular expression is a pattern specifying the form of strings.

Terminology of Languages
 Alphabet: a finite set of symbols (e.g. ASCII characters)
 String: a finite sequence of symbols over an alphabet
 Sentence and word are also used in terms of string
 ε is the empty string
 |s| is the length of string s.

 Language: a set of strings over some fixed alphabet
 ∅, the empty set, is a language.
 {ε}, the set containing the empty string, is a language.
 The set of well-formed C programs is a language.
 The set of all possible identifiers is a language.
 Operators on Strings:
 Concatenation: xy represents the concatenation of strings x and y. sε = s and εs = s.
 s^n = s s s .. s (n times); s^0 = ε
Regular Expressions
 We use regular expressions to describe tokens of a programming language.
 A regular expression is built up of simpler regular expressions (using defining rules).
 Each regular expression denotes a language.
 A language denoted by a regular expression is called a regular set.

Rules

Regular expressions over alphabet Σ:

Reg. Expr        Language it denotes
ε                {ε}
a                {a}
(r1) | (r2)      L(r1) ∪ L(r2)
(r1) (r2)        L(r1) L(r2)
(r)*             (L(r))*
(r)              L(r)

 (r)+ = (r)(r)*
 (r)? = (r) | ε
 We may remove parentheses by using precedence rules:
* highest
concatenation next
| lowest

ab*|c means (a(b)*)|(c)
 
 Ex:
 Σ = {0,1}
 0|1 => {0,1}
 (0|1)(0|1) => {00,01,10,11}
 0* => {ε, 0, 00, 000, 0000, ....}
 (0|1)* => all strings with 0 and 1, including the empty string



Finite Automata

 A recognizer for a language is a program that takes a string x, and answers “yes” if x is a
sentence of that language, and “no” otherwise.
 We call the recognizer of the tokens a finite automaton.
 A finite automaton can be: deterministic (DFA) or non-deterministic (NFA).
 This means that we may use a deterministic or non-deterministic automaton as a lexical
analyzer.
 Both deterministic and non-deterministic finite automatons recognize regular sets.
 Which one?
 deterministic – faster recognizer, but it may take more space
 non-deterministic – slower, but it may take less space
 Deterministic automatons are widely used in lexical analyzers.

First, we define regular expressions for tokens; then we convert them into a DFA to get a
lexical analyzer for our tokens.

Algorithm 1: Regular Expression → NFA → DFA (two steps: first to NFA, then to DFA)

Algorithm 2: Regular Expression → DFA (directly convert a regular expression into a DFA)

Non-Deterministic Finite Automaton (NFA)

 A non-deterministic finite automaton (NFA) is a mathematical model that consists of:
 S – a set of states
 Σ – a set of input symbols (alphabet)
 move – a transition function mapping state-symbol pairs to sets of states
 s0 – a start (initial) state
 F – a set of accepting states (final states)

ε-transitions are allowed in NFAs. In other words, we can move from one state to another one
without consuming any symbol.

An NFA accepts a string x if and only if there is a path from the starting state to one of the
accepting states such that the edge labels along this path spell out x.
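To make the acceptance condition concrete, here is a small NFA simulation sketch in C. The
bit-mask state encoding and the example machine (which accepts strings over {a,b} ending in
ab) are our illustrative assumptions:

#include <stdio.h>

#define N 3                            /* states 0 (start), 1, 2 (accepting) */

/* trans[s][c]: bit mask of states reachable from s on symbol c (0 = a, 1 = b). */
static unsigned trans[N][2] = {
    { (1u<<0) | (1u<<1), 1u<<0 },      /* 0 --a--> {0,1},  0 --b--> {0} */
    { 0,                 1u<<2 },      /* 1 --b--> {2}                  */
    { 0,                 0     },      /* 2: no outgoing moves          */
};
static unsigned eps[N] = { 0, 0, 0 };  /* ε-moves (none in this example) */

static unsigned closure(unsigned set) {    /* add states reachable by ε-moves */
    unsigned prev;
    do {
        prev = set;
        for (int s = 0; s < N; s++)
            if (set & (1u << s)) set |= eps[s];
    } while (set != prev);
    return set;
}

int accepts(const char *x) {               /* input assumed to be over {a,b} */
    unsigned cur = closure(1u << 0);       /* start in state 0 */
    for (; *x; x++) {
        int c = (*x == 'a') ? 0 : 1;
        unsigned next = 0;
        for (int s = 0; s < N; s++)        /* follow every possible path at once */
            if (cur & (1u << s)) next |= trans[s][c];
        cur = closure(next);
    }
    return (cur & (1u << 2)) != 0;         /* accept iff state 2 was reached */
}

int main(void) {
    printf("aab: %d  aba: %d\n", accepts("aab"), accepts("aba"));   /* prints 1 and 0 */
    return 0;
}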

Deterministic Finite Automaton (DFA)

 A Deterministic Finite Automaton (DFA) is a special form of an NFA:
o no state has an ε-transition
o for each symbol a and state s, there is at most one labeled edge a leaving s,
i.e. the transition function maps a state-symbol pair to a state (not a set of states)

Converting a Regular Expression into NFA (Thompson’s Construction)

 This is one way to convert a regular expression into an NFA.
 There can be other (more efficient) ways for the conversion.

Thompson’s Construction is a simple and systematic method. It guarantees that the resulting
NFA will have exactly one final state and one start state.




Construction starts from the simplest parts (alphabet symbols). To create an NFA for a complex
regular expression, the NFAs of its sub-expressions are combined to create its NFA.



Minimizing Number of States of a DFA
 Partition the set of states into two groups:
 G1: the set of accepting states
 G2: the set of non-accepting states
 For each new group G:
 partition G into subgroups such that states s1 and s2 are in the same group iff,
for all input symbols a, states s1 and s2 have transitions to states in the same group.
 The start state of the minimized DFA is the group containing the start state of the
original DFA.
 The accepting states of the minimized DFA are the groups containing the accepting states
of the original DFA.



Deterministic and Nondeterministic Automata
 Deterministic Finite Automata (DFA)
 One transition per input per state
 No ε-moves
 Nondeterministic Finite Automata (NFA)
 Can have multiple transitions for one input in a given state
 Can have ε-moves
 Finite automata have finite memory
 Need only to encode the current state

NFA vs. DFA

 NFAs and DFAs recognize the same set of languages (regular languages).
 DFAs are easier to implement.
 There are no choices to consider.

Regular Expressions to Finite Automata



Overview of Lex and Yacc

Lex (A LEXical Analyzer Generator)
 Generates lexical analyzers (scanners or lexers)

Yacc (Yet Another Compiler-Compiler)
 Generates a parser based on an analytic grammar
 Flex is a free scanner alternative to Lex
 Bison is a free parser generator program, written for the GNU project as an alternative to
Yacc

Lex: what is it?

1. Lex: a tool for automatically generating a lexer or scanner given a lex specification (.l
file)
2. A lexer or scanner is used to perform lexical analysis, or the breaking up of an input
stream into meaningful units, or tokens.
3. For example, consider breaking a text file up into individual words.

Lexical analyzer: scans the input stream and converts sequences of characters into tokens.

Token: a classification of groups of characters.


Examples:
Lexeme    Token
Sum       ID
for       FOR
=         ASSIGN_OP
==        EQUAL_OP
57        INTEGER_CONST
*         MULT_OP
,         COMMA
(         LEFT_PAREN

Lex / Flex is a tool for writing lexical analyzers.

Lex / Flex: reads a specification file containing regular expressions


and generates a C routine that performs lexical analysis.

Matches sequences that identify tokens.



Skeleton of a lex specification (.l file)

(Running lex on x.l generates a *.c file.)

%{
< C global variables, prototypes, comments >    /* this part will be embedded into *.c */
%}

[DEFINITION SECTION]    /* substitutions, code and start states; will be copied into *.c */

%%

[RULES SECTION]    /* defines how to scan and what action to take for each token */

%%

< C auxiliary subroutines >    /* any user code, e.g. a main function to call the scanning
                                  function yylex() */



The rules section
%%

<pattern>    { <action to take when matched> }
<pattern>    { <action to take when matched> }

%%

Patterns are specified by regular expressions.


For example:
%%
[A-Za-z]* { printf("this is a word"); }
%%

Two Rules

1. lex will always match the longest (number of characters) token possible.

2. If two or more possible tokens are of the same length, then the token with the regular
expression that is defined first in the lex specification is favored.
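A tiny illustrative spec (ours, not from the handout) that exercises both rules: on the input
if, both patterns match two characters, so the earlier "if" rule wins the tie; on the input
ifdef, the identifier rule matches five characters and wins by the longest-match rule:

%%
"if"        { printf("KEYWORD\n"); }
[a-zA-Z]+   { printf("IDENTIFIER\n"); }
[ \t\n]     ;
%%
main() { yylex(); }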

Regular Expressions in lex / Flex:

a matches a
abc matches abc
[abc] matches a, b or c
[a-f] matches a, b, c, d, e, or f
[0-9] matches any digit
X+ matches one or more of X
X* matches zero or more of X
[0-9]+ matches any integer
(…) grouping an expression into a single unit



| alternation (or)
(a|b|c)* is equivalent to [a-c]*
X? X is optional (0 or 1 occurrence)
if(def)? matches if or ifdef (equivalent to if|ifdef)
[A-Za-z] matches any alphabetical character
. matches any character except newline character
\. matches the . character
\n matches the newline character
\t matches the tab character
\\ matches the \ character
[ \t] matches either a space or tab character
[^a-d] matches any character other than a,b,c and d

Examples:

Real numbers, e.g., 0, 27, 2.10, .17


[0-9]+|[0-9]+\.[0-9]+|\.[0-9]+
[0-9]+(\.[0-9]+)?|\.[0-9]+
[0-9]*(\.)?[0-9]+

To include an optional preceding sign: [+-]?[0-9]*(\.)?[0-9]+

Special Functions
• yytext
–where text matched most recently is stored
• yyleng
–number of characters in text most recently matched
• yylval
–associated value of current token
• yymore()
–append next string matched to current contents of yytext
• yyless(n)
–remove from yytext all but the first n characters
• unput(c)
–return character c to input stream
• yywrap()
–may be replaced by user
–The yywrap method is called by the lexical analyser whenever it inputs an
EOF as the first character when trying to match a regular expression

Yacc / Bison: what is it?

Yacc: a tool for automatically generating a parser given a grammar written in a yacc
specification (.y file)



A grammar specifies a set of production rules, which define a language. A production rule
specifies a sequence of symbols (a sentence) that is legal in the language.

Skeleton of a yacc specification (.y file)

(Running yacc on x.y generates a *.c file.)

%{
< C global variables, prototypes, comments >    /* this part will be embedded into *.c */
%}

[DEFINITION SECTION]    /* contains token declarations; tokens are recognized in the lexer */

%%

[PRODUCTION RULES SECTION]    /* defines how to “understand” the input language, and what
                                 actions to take for each “sentence” */

%%

< C auxiliary subroutines >    /* any user code, e.g. a main function to call the parser
                                  function yyparse() */

Structure of yacc File

Definition section

Declarations of tokens

Type of values used on parser stack

Rules section

List of grammar rules with semantic routines

User code

The Production Rules Section

%%

production: symbol1 symbol2 … { action }

| symbol3 symbol4 … { action }

|…

production: symbol1 symbol2 { action }

%%
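As a minimal sketch following this skeleton (the grammar for sums, the hand-written yylex
stand-in, and all names are our illustrative assumptions, not part of the handout):

%{
#include <stdio.h>
#include <ctype.h>
int yylex(void);
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
%}
%token NUMBER
%%
input : expr '\n'          { printf("valid expression\n"); }
      ;
expr  : expr '+' NUMBER
      | NUMBER
      ;
%%
int yylex(void) {          /* trivial stand-in for a lex-generated scanner */
    int c = getchar();
    if (c == EOF) return 0;                   /* 0 tells yacc: end of input */
    if (isdigit(c)) {
        while (isdigit(c = getchar())) ;      /* consume the rest of the number */
        ungetc(c, stdin);
        return NUMBER;
    }
    return c;              /* '+' and '\n' are returned as themselves */
}
int main(void) { return yyparse(); }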



1. Lex program to count number of vowels and consonants
%{
int v=0,c=0;
%}
%%
[aeiouAEIOU] v++;
[a-zA-Z] c++;
%%
main()
{
printf("ENTER INPUT : \n");
yylex();
printf("VOWELS=%d\nCONSONANTS=%d\n",v,c);
}
2. Lex program to count the type of numbers
%{
int pi=0,ni=0,pf=0,nf=0;
%}
%%
\+?[0-9]+ pi++;
\+?[0-9]*\.[0-9]+ pf++;
\-[0-9]+ ni++;
\-[0-9]*\.[0-9]+ nf++;
%%
main()
{
printf("ENTER INPUT : ");
yylex();
printf("\nPOSITIVE INTEGER : %d",pi);
printf("\nNEGATIVE INTEGER : %d",ni);
printf("\nPOSITIVE FRACTION : %d",pf);
printf("\nNEGATIVE FRACTION : %d\n",nf);
}
3. Lex program to find simple and compound statements
%{
%}
%%
"and" |
"or" |
"but" |
"because" |
"nevertheless" { printf("COMPOUND STATEMENT"); exit(0); }
. ;
\n return 0;
%%
main()
{
printf("\nENTER THE STATEMENT : ");
yylex();
printf("SIMPLE STATEMENT");
}
4. Lex program for word count
/* just like Unix wc */
%{
#include <string.h>
int chars = 0;
int words = 0;
int lines = 0;
%}
%%
[a-zA-Z]+ { words++; chars += strlen(yytext); }
\n { chars++; lines++; }
. { chars++; }
%%
main(int argc, char **argv)
{
yylex();
printf("%8d%8d%8d\n", lines, words, chars);
}
5. Lex program for English to American
%%
"colour" { printf("color"); }
"flavour" { printf("flavor"); }
"clever" { printf("smart"); }
"smart" { printf("elegant"); }
"conservative" { printf("liberal"); }
… lots of other words …
. { printf("%s", yytext); }
%%



Bonga University
College of Engineering and Technology
Department of Computer Science
CoSc3112 – COMPILER DESIGN
Chapter 3 Handouts – Syntax Analysis

Syntax Analyzer
• Syntax Analyzer creates the syntactic structure of the given source program.
• This syntactic structure is mostly a parse tree.
• Syntax Analyzer is also known as parser.
• The syntax of a programming language is described by a context-free grammar (CFG). We will
use BNF (Backus-Naur Form) notation in the description of CFGs.
• The syntax analyzer (parser) checks whether a given source program satisfies the rules
implied by a context-free grammar or not.
– If it satisfies, the parser creates the parse tree of that program.
– Otherwise the parser gives the error messages.
• A context-free grammar
– Gives a precise syntactic specification for a programming language
– The design of the grammar is an initial phase of the design of a compiler.
– A grammar can be directly converted into a parser by some tools.
Parser

• Parser works on a stream of tokens.

• The smallest item is a token.

• We categorize the parsers into two groups:

• Top-Down Parser
– The parse tree is created top to bottom, starting from the root.

• Bottom-Up Parser
– The parse tree is created bottom to top; starting from the leaves



• Both top-down and bottom-up parsers scan the input from left to right (one symbol at a
time).

• Efficient top-down and bottom-up parsers can be implemented only for sub-classes of
context-free grammars.
– LL for top-down parsing
– LR for bottom-up parsing

Context-Free Grammars
• Inherently recursive structures of a programming language are defined by a context-free
grammar.

• In a context-free grammar, we have:


– A finite set of terminals (in our case, this will be the set of tokens)
– A finite set of non-terminals (syntactic-variables)
– A finite set of production rules in the following form:
A → α, where A is a non-terminal and α is a string of terminals and
non-terminals (including the empty string)
– A start symbol (one of the non-terminal symbols)

• Example:

E → E+E | E–E | E*E | E/E | -E
E → (E)
E → id

Derivations

E ⇒ E+E

• E+E derives from E


– We can replace E by E+E.
– To be able to do this, we have to have a production rule E → E+E in our grammar.

E ⇒ E+E ⇒ id+E ⇒ id+id

• A sequence of replacements of non-terminal symbols is called a derivation of id+id from


E.
• In general, a derivation step is
αAβ ⇒ αγβ if there is a production rule A → γ in our grammar,
where α and β are arbitrary strings of terminal and non-terminal symbols.

α1 ⇒ α2 ⇒ ... ⇒ αn (αn derives from α1, or α1 derives αn)

⇒  : derives in one step
⇒* : derives in zero or more steps
⇒+ : derives in one or more steps

CFG - Terminology
• L(G) is the language of G (the language generated by G) which is a set of sentences.

• A sentence of L(G) is a string of terminal symbols of G.

• If S is the start symbol of G, then
ω is a sentence of L(G) iff S ⇒* ω, where ω is a string of terminals of G.

• If G is a context-free grammar, L(G) is a context-free language.

• Two grammars are equivalent if they produce the same language.

• S ⇒* α - If α contains non-terminals, it is called a sentential form of G.
- If α does not contain non-terminals, it is called a sentence of G.

Derivation Example

E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)
OR
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(E+id) ⇒ -(id+id)

• At each derivation step, we can choose any of the non-terminals in the sentential form of
G for the replacement.
• If we always choose the left-most non-terminal in each derivation step, this derivation is
called as left-most derivation.
• If we always choose the right-most non-terminal in each derivation step, this derivation is
called as right-most derivation.

Left-Most and Right-Most Derivations


Left-Most Derivation

E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)


Right-Most Derivation

E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(E+id) ⇒ -(id+id)

• We will see that the top-down parsers try to find the left-most derivation of the given
source program.

• We will see that the bottom-up parsers try to find the right-most derivation of the given
source program in the reverse order.

Parse Tree
• Inner nodes of a parse tree are non-terminal symbols.

• The leaves of a parse tree are terminal symbols.

• A parse tree can be seen as a graphical representation of a derivation.

Ambiguity
• A grammar that produces more than one parse tree for a sentence is called an ambiguous
grammar.

E ⇒ E+E ⇒ id+E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id

E ⇒ E*E ⇒ E+E*E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id



• For most parsers, the grammar must be unambiguous.

• Unambiguous grammar
→ unique selection of the parse tree for a sentence

• We should eliminate the ambiguity in the grammar during the design phase of the
compiler.

• An unambiguous grammar should be written to eliminate the ambiguity.

• We have to prefer one of the parse trees of a sentence (generated by an ambiguous
grammar), and disambiguate the grammar to restrict it to this choice.

Example: stmt → if expr then stmt |
if expr then stmt else stmt | otherstmts

Sentence: if E1 then if E2 then S1 else S2

• We prefer the second parse tree (else matches with closest if).
• So, we have to disambiguate our grammar to reflect this choice.
• The unambiguous grammar will be:

stmt → matchedstmt | unmatchedstmt

matchedstmt → if expr then matchedstmt else matchedstmt | otherstmts

unmatchedstmt → if expr then stmt | if expr then matchedstmt else unmatchedstmt



Ambiguity – Operator Precedence

• Ambiguous grammars (because of ambiguous operators) can be disambiguated according


to the precedence and associativity rules.

E → E+E | E*E | E^E | id | (E)

To disambiguate the grammar, use the precedence:
^ (right to left)
* (left to right)
+ (left to right)

E → E+T | T
T → T*F | F
F → G^F | G
G → id | (E)

Top-Down Parsing
• The parse tree is created top to bottom.

• Top-down parser
– Recursive-Descent Parsing
• Backtracking is needed (If a choice of a production rule does not work, we
backtrack to try other alternatives.)
• It is a general parsing technique, but not widely used.
• Not efficient
– Predictive Parsing
• No backtracking, efficient
• Needs a special form of grammars (LL(1) grammars).
• Recursive Predictive Parsing is a special form of Recursive Descent parsing
without backtracking.
• Non-Recursive (Table Driven) Predictive Parser is also known as LL(1) parser.

Recursive-Descent Parsing (uses Backtracking)

• Backtracking is needed.

• It tries to find the left-most derivation.

S → aBc
B → bc | b

input: abc

(Two parses are attempted: first with B → bc, which fails to match the input, and then, after
backtracking, with B → b, which succeeds.)

Recursive Predictive Parsing

• Each non-terminal corresponds to a procedure.


Ex: A → aBb (this is the only production rule for A)

proc A {
- match the current token with a, and move to the next token;
- call ‘B’;
- match the current token with b, and move to the next token;
}
A → aBb | bAB

proc A {
case of the current token {
‘a’: - match the current token with a, and move to the next token;
- call ‘B’;
- match the current token with b, and move to the next token;
‘b’: - match the current token with b, and move to the next token;
- call ‘A’;
- call ‘B’;
} }

• When to apply ε-productions:

A → aA | bB | ε

• If all other productions fail, we should apply an ε-production. For example, if the current
token is not a or b, we may apply the ε-production.

• Most correct choice: we should apply an ε-production for a non-terminal A when the
current token is in the follow set of A (which terminals can follow A in the sentential forms).

Predictive Parser

a grammar --(eliminate left recursion, left-factor)--> a grammar suitable for predictive
parsing (an LL(1) grammar); there is no 100% guarantee.

• When re-writing a non-terminal in a derivation step, a predictive parser can uniquely
choose a production rule by just looking at the current symbol in the input string.

A → α1 | ... | αn      input: ... a .......  (a is the current token)

Predictive Parser (example)

stmt → if ...... |
while ...... |
begin ...... |
for ......

• When we are trying to expand the non-terminal stmt, if the current token is if, we have to
choose the first production rule.

• When we are trying to expand the non-terminal stmt, we can uniquely choose the
production rule by just looking at the current token.

• We eliminate the left recursion in the grammar, and left factor it. But it may not be
suitable for predictive parsing (not LL(1) grammar).

Left Recursion

• A grammar is left recursive if it has a non-terminal A such that there is a derivation
A ⇒+ Aα for some string α.

• Top-down parsing techniques cannot handle left-recursive grammars.

• So, we have to convert our left-recursive grammar into an equivalent grammar which is
not left-recursive.

• The left-recursion may appear in a single step of the derivation (immediate left-recursion),
or may appear in more than one step of the derivation.

Immediate Left-Recursion

A → Aα | β    where β does not start with A

⇓ eliminate immediate left recursion

A → β A’
A’ → α A’ | ε    (an equivalent grammar)

In general,

A → Aα1 | ... | Aαm | β1 | ... | βn    where β1 ... βn do not start with A

⇓ eliminate immediate left recursion

A → β1 A’ | ... | βn A’
A’ → α1 A’ | ... | αm A’ | ε    (an equivalent grammar)

Immediate Left-Recursion – Example

E → E+T | T
T → T*F | F
F → id | (E)

⇓ eliminate immediate left recursion

E → T E’
E’ → +T E’ | ε
T → F T’
T’ → *F T’ | ε
F → id | (E)
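Since every non-terminal of the transformed grammar now maps to one procedure, a compact
recursive predictive parser can be sketched in C as follows (single-character tokens, with 'i'
standing for id; an illustrative sketch, not part of the handout):

#include <stdio.h>
#include <stdlib.h>

static const char *tok;   /* current position in the input string */

static void error(void) { printf("syntax error at '%c'\n", *tok); exit(1); }
static void match(char c) { if (*tok == c) tok++; else error(); }

static void E(void);  static void Ep(void);
static void T(void);  static void Tp(void);
static void F(void);

static void E(void)  { T(); Ep(); }                                   /* E  -> T E'        */
static void Ep(void) { if (*tok == '+') { match('+'); T(); Ep(); } }  /* E' -> +T E' | eps */
static void T(void)  { F(); Tp(); }                                   /* T  -> F T'        */
static void Tp(void) { if (*tok == '*') { match('*'); F(); Tp(); } }  /* T' -> *F T' | eps */
static void F(void) {                                                 /* F  -> id | ( E )  */
    if (*tok == 'i') match('i');
    else if (*tok == '(') { match('('); E(); match(')'); }
    else error();
}

int main(void) {
    tok = "i+i*i";            /* id + id * id */
    E();
    if (*tok == '\0') printf("accepted\n"); else error();
    return 0;
}

Note how the ε-alternatives of E’ and T’ become the do-nothing path: each procedure simply
returns when the current token is not in the FIRST set of its non-ε alternative.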

Left-Recursion – Problem

• A grammar cannot be immediately left-recursive, but it still can be left-recursive.

• By just eliminating the immediate left-recursion, we may not get a grammar which is not
left-recursive.

S → Aa | b

A → Sc | d    (this grammar is not immediately left-recursive, but it is still left-recursive)

S ⇒ Aa ⇒ Sca, or

A ⇒ Sc ⇒ Aac, causes a left-recursion

• So, we have to eliminate all left-recursions from our grammar

Left-Factoring

• A predictive parser (a top-down parser without backtracking) insists that the grammar
must be left-factored.

Grammar → a new equivalent grammar suitable for predictive parsing

stmt → if expr then stmt else stmt | if expr then stmt

• When we see if, we cannot know which production rule to choose to re-write stmt in the
derivation.

• In general,

A → αβ1 | αβ2    where α is non-empty and the first symbols
of β1 and β2 (if they have one) are different.

• When processing α, we cannot know whether to expand
A to αβ1 or A to αβ2.

• But, if we re-write the grammar as follows:

A → αA’
A’ → β1 | β2    so we can immediately expand A to αA’

Left-Factoring – Example1
A → abB | aB | cdg | cdeB | cdfB

⇓

A → aA’ | cdg | cdeB | cdfB
A’ → bB | B

⇓

A → aA’ | cdA’’
A’ → bB | B
A’’ → g | eB | fB

Left-Factoring – Example2
A → ad | a | ab | abc | b

⇓

A → aA’ | b
A’ → d | ε | b | bc

⇓

A → aA’ | b
A’ → d | ε | bA’’
A’’ → ε | c

Non-Recursive Predictive Parsing -- LL(1) Parser

• Non-Recursive predictive parsing is a table-driven parser.

• It is a top-down parser.

• It is also known as LL(1) Parser.



LL(1) Parser

Input buffer
– our string to be parsed. We will assume that its end is marked with a special symbol
$.

Output
– a production rule representing a step of the derivation sequence (left-most
derivation) of the string in the input buffer.
Stack
– contains the grammar symbols
– at the bottom of the stack, there is a special end marker symbol $.
– initially the stack contains only the symbol $ and the starting symbol S.
$S ← initial stack
– when the stack is emptied (ie. only $ left in the stack), the parsing is completed.

Parsing table
– a two-dimensional array M[A,a]
– each row is a non-terminal symbol
– each column is a terminal symbol or the special symbol $
– each entry holds a production rule.



LL(1) Parser – Parser Actions
• The symbol at the top of the stack (say X) and the current symbol in the input string (say
a) determine the parser action.
• There are four possible parser actions.

1. If X and a are both $ → parser halts (successful completion)

2. If X and a are the same terminal symbol (different from $)
→ parser pops X from the stack, and moves to the next symbol in the input buffer.

3. If X is a non-terminal
→ parser looks at the parsing table entry M[X,a]. If M[X,a] holds a production rule
X → Y1Y2...Yk, it pops X from the stack and pushes Yk, Yk-1, ..., Y1 onto the stack. The parser
also outputs the production rule X → Y1Y2...Yk to represent a step of the derivation.

4. None of the above → error
– all empty entries in the parsing table are errors.
– If X is a terminal symbol different from a, this is also an error case.

LL(1) Parser – Example1

S → aBa
B → bB | ε

LL(1) Parsing Table:

          a           b          $
S      S → aBa
B      B → ε      B → bB

stack      input      output
$S         abba$      S → aBa
$aBa       abba$
$aB        bba$       B → bB
$aBb       bba$
$aB        ba$        B → bB
$aBb       ba$
$aB        a$         B → ε
$a         a$
$          $          accept, successful completion

Outputs: S → aBa   B → bB   B → bB   B → ε

Derivation (left-most): S ⇒ aBa ⇒ abBa ⇒ abbBa ⇒ abba

Parse tree
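The same run can be reproduced by a small table-driven driver. This C sketch hard-codes the
parsing table above for S → aBa, B → bB | ε; the stack layout and output format are our
illustrative choices:

#include <stdio.h>
#include <string.h>

static char stack[64];
static int top = -1;

static void push(const char *rhs) {      /* push a rule's right-hand side, reversed */
    for (int i = (int)strlen(rhs) - 1; i >= 0; i--) stack[++top] = rhs[i];
}

int main(void) {
    const char *ip = "abba$";            /* input string, end-marked with $ */
    stack[++top] = '$'; stack[++top] = 'S';
    while (stack[top] != '$') {
        char X = stack[top], a = *ip;
        if (X == 'S' && a == 'a')      { top--; push("aBa"); printf("S -> aBa\n"); }
        else if (X == 'B' && a == 'b') { top--; push("bB");  printf("B -> bB\n");  }
        else if (X == 'B' && a == 'a') { top--;              printf("B -> eps\n"); }
        else if (X == a)               { top--; ip++; }      /* match a terminal */
        else { printf("error\n"); return 1; }
    }
    printf(*ip == '$' ? "accept\n" : "error\n");
    return 0;
}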

LL(1) Parser – Example2

E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id

        id           +            *            (           )          $
E    E → TE’                              E → TE’
E’                E’ → +TE’                            E’ → ε     E’ → ε
T    T → FT’                              T → FT’
T’                T’ → ε       T’ → *FT’               T’ → ε     T’ → ε
F    F → id                               F → (E)

stack        input      output
$E           id+id$     E → TE’
$E’T         id+id$     T → FT’
$E’T’F       id+id$     F → id
$E’T’id      id+id$
$E’T’        +id$       T’ → ε
$E’          +id$       E’ → +TE’
$E’T+        +id$
$E’T         id$        T → FT’
$E’T’F       id$        F → id
$E’T’id      id$
$E’T’        $          T’ → ε
$E’          $          E’ → ε
$            $          accept

Constructing LL(1) Parsing Tables

• Two functions are used in the construction of LL(1) parsing tables:


– FIRST and FOLLOW

• FIRST(α) is the set of the terminal symbols which occur as first symbols in strings derived
from α, where α is any string of grammar symbols.

• If α derives ε, then ε is also in FIRST(α).

• FOLLOW(A) is the set of the terminals which occur immediately after (follow) the
non-terminal A in the strings derived from the starting symbol.

– a terminal a is in FOLLOW(A) if S ⇒* αAaβ

– $ is in FOLLOW(A) if S ⇒* αA

Compute FIRST for Any String X

• If X is a terminal symbol → FIRST(X) = {X}.

• If X is a non-terminal symbol and X → ε is a production rule
→ ε is in FIRST(X).

• If X is a non-terminal symbol and X → Y1Y2..Yn is a production rule
→ if a terminal a is in FIRST(Yi), and ε is in all FIRST(Yj) for j=1,...,i-1,
then a is in FIRST(X).
→ if ε is in all FIRST(Yj) for j=1,...,n, then ε is in FIRST(X).

• If X is ε → FIRST(X) = {ε}.

• If X is a string Y1Y2..Yn
→ if a terminal a is in FIRST(Yi), and ε is in all FIRST(Yj) for j=1,...,i-1, then a is in
FIRST(X).
→ if ε is in all FIRST(Yj) for j=1,...,n, then ε is in FIRST(X).

FIRST Example



E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id

FIRST(F) = {(, id}        FIRST(TE’) = {(, id}
FIRST(T’) = {*, ε}        FIRST(+TE’) = {+}
FIRST(T) = {(, id}        FIRST(ε) = {ε}
FIRST(E’) = {+, ε}        FIRST(FT’) = {(, id}
FIRST(E) = {(, id}        FIRST(*FT’) = {*}
                          FIRST((E)) = {(}
                          FIRST(id) = {id}

Compute FOLLOW (for non-terminals)

• If S is the start symbol → $ is in FOLLOW(S).

• If A → αBβ is a production rule
→ everything in FIRST(β) is in FOLLOW(B) except ε.

• If (A → αB is a production rule) or
(A → αBβ is a production rule and ε is in FIRST(β))
→ everything in FOLLOW(A) is in FOLLOW(B).

We apply these rules until nothing more can be added to any follow set.

FOLLOW Example

E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id

FOLLOW(E) = { $, ) }
FOLLOW(E’) = { $, ) }
FOLLOW(T) = { +, ), $ }
FOLLOW(T’) = { +, ), $ }
FOLLOW(F) = { +, *, ), $ }

Constructing LL(1) Parsing Table – Algorithm

• for each production rule A → α of a grammar G

– for each terminal a in FIRST(α)
→ add A → α to M[A,a]

– if ε is in FIRST(α)
→ for each terminal a in FOLLOW(A), add A → α to M[A,a]

– if ε is in FIRST(α) and $ is in FOLLOW(A)
→ add A → α to M[A,$]

• All other undefined entries of the parsing table are error entries.

Constructing LL(1) Parsing Table – Example

E → TE’      FIRST(TE’) = {(, id}    → E → TE’ into M[E,(] and M[E,id]

E’ → +TE’    FIRST(+TE’) = {+}       → E’ → +TE’ into M[E’,+]

E’ → ε       FIRST(ε) = {ε}          → none
             but since ε is in FIRST(ε)
             and FOLLOW(E’) = {$, )} → E’ → ε into M[E’,$] and M[E’,)]

T → FT’      FIRST(FT’) = {(, id}    → T → FT’ into M[T,(] and M[T,id]

T’ → *FT’    FIRST(*FT’) = {*}       → T’ → *FT’ into M[T’,*]

T’ → ε       FIRST(ε) = {ε}          → none
             but since ε is in FIRST(ε)
             and FOLLOW(T’) = {$, ), +} → T’ → ε into M[T’,$], M[T’,)] and M[T’,+]

F → (E)      FIRST((E)) = {(}        → F → (E) into M[F,(]

F → id       FIRST(id) = {id}        → F → id into M[F,id]

LL(1) Grammars

• A grammar whose parsing table has no multiply-defined entries is said to be an LL(1)
grammar:
one input symbol used as a look-ahead to determine the parser action; LL(1) means input
scanned from left to right, producing a left-most derivation.

• The parsing table of a grammar may contain more than one production rule in an entry. In
this case, we say that it is not an LL(1) grammar.

A Grammar which is not LL (1)

S → iCtSE | a      FOLLOW(S) = { $, e }
E → eS | ε         FOLLOW(E) = { $, e }
C → b              FOLLOW(C) = { t }

FIRST(iCtSE) = {i}   FIRST(a) = {a}   FIRST(eS) = {e}   FIRST(ε) = {ε}   FIRST(b) = {b}

        a         b          e              i            t         $
S    S → a                           S → iCtSE
E                         E → eS                                E → ε
                          E → ε
C              C → b

Two production rules for M[E,e]

Problem → ambiguity

What do we have to do if the resulting parsing table contains multiply defined entries?
– If we didn’t eliminate left recursion, eliminate the left recursion in the grammar.
– If the grammar is not left factored, we have to left factor the grammar.
– If its (new grammar’s) parsing table still contains multiply defined entries, that
grammar is ambiguous or it is inherently not a LL(1) grammar.

• A left recursive grammar cannot be an LL(1) grammar.

– A → Aα | β
→ any terminal that appears in FIRST(β) also appears in FIRST(Aα), because
Aα ⇒ βα.
→ If β is ε, any terminal that appears in FIRST(α) also appears in FIRST(Aα)
and FOLLOW(A).

• If a grammar is not left factored, it cannot be an LL(1) grammar:

• A → αβ1 | αβ2
→ any terminal that appears in FIRST(αβ1) also appears in FIRST(αβ2).

• An ambiguous grammar cannot be an LL(1) grammar.

Properties of LL (1) Grammars

• A grammar G is LL(1) if and only if the following conditions hold for any two distinct
production rules A → α and A → β:

1. Both α and β cannot derive strings starting with the same terminals.
2. At most one of α and β can derive ε.
3. If β can derive ε, then α cannot derive any string starting with a
terminal in FOLLOW(A).

Error Recovery in Predictive Parsing

• An error may occur in the predictive parsing (LL(1) parsing)



– If the terminal symbol on the top of stack does not match with the current input
symbol.

– if the top of stack is a non-terminal A, the current input symbol is a, and the parsing
table entry M[A,a] is empty.

• What should the parser do in an error case?

– The parser should be able to give an error message (as much as possible meaningful
error message).

– The parser should recover from that error case, and it should be able to continue
  parsing with the rest of the input.

Error Recovery Techniques

• Panic-Mode Error Recovery


– Skipping the input symbols until a synchronizing token is found.

• Phrase-Level Error Recovery


– Each empty entry in the parsing table is filled with a pointer to a specific error
  routine to take care of that error case.

• Error-Productions
– If we have a good idea of the common errors that might be encountered, we can
augment the grammar with productions that generate erroneous constructs.
– When an error production is used by the parser, we can generate appropriate error
diagnostics.
– Since it is almost impossible to know all the errors that can be made by the
programmers, this method is not practical.

• Global-Correction
– Ideally, we would like a compiler to make as few changes as possible in processing
incorrect inputs.
– We have to globally analyze the input to find the error.
– This is an expensive method, and it is not used in practice.



Bonga University
College of Engineering and Technology
Department of Computer Science
CoSc3112 –COMPILER DESIGN
Chapter 4 Handout – Bottom-up parsing

Bottom-Up Parsing
• A bottom-up parser creates the parse tree of the given input starting from leaves towards
the root.

• A bottom-up parser tries to find the right-most derivation of the given input in the reverse
order.
      S ⇒ ... ⇒ ω    (the right-most derivation of ω)
                     (the bottom-up parser finds the right-most derivation in the
                      reverse order)

• Bottom-up parsing is also known as shift-reduce parsing because its two main actions are
shift and reduce.
– At each shift action, the current symbol in the input string is pushed to a stack.
– At each reduction step, the symbols at the top of the stack (this symbol sequence is
  the right side of a production) will be replaced by the non-terminal on the left side
  of that production.
– There are also two more actions: accept and error.

Shift-Reduce Parsing
• A shift-reduce parser tries to reduce the given input string into the starting symbol.

      a string   →   the starting symbol
          (reduced to)

• At each reduction step, a substring of the input matching the right side of a production
  rule is replaced by the non-terminal on the left side of that production rule.
• If the substrings are chosen correctly, the rightmost derivation of that string is created
  in reverse order.

      Rightmost Derivation:         S ⇒ ... ⇒ ω
      Shift-Reduce Parser finds:    ω ⇐ ... ⇐ S



Shift-Reduce Parsing -- Example

S → aABb        input string:  aaabb
A → aA | a                     aaAbb
B → bB | b                     aAbb      ← reductions
                               aABb
                               S

S ⇒ aABb ⇒ aAbb ⇒ aaAbb ⇒ aaabb

Right Sentential Forms

• How do we know which substring is to be replaced at each reduction step?

Handle
• Informally, a handle of a string is a substring that matches the right side of a
  production rule.
  – But not every substring that matches the right side of a production rule is a handle.

• A handle of a right sentential form γ ( γ = αβω ) is
  a production rule A → β and a position in γ
  where the string β may be found and replaced by A to produce the previous
  right-sentential form in a rightmost derivation of γ:

      S ⇒ αAω ⇒ αβω

• If the grammar is unambiguous, then every right-sentential form of the grammar has
exactly one handle.

Handle Pruning
• A right-most derivation in reverse can be obtained by handle-pruning.
S = γ0 ⇒ γ1 ⇒ γ2 ⇒ ... ⇒ γn-1 ⇒ γn = ω    (ω is the input string)

• Start from γn, find a handle An → βn in γn, and replace βn by An to get γn-1.
• Then find a handle An-1 → βn-1 in γn-1, and replace βn-1 by An-1 to get γn-2.
• Repeat this, until we reach S.



A Shift-Reduce Parser

E → E+T | T       Right-Most Derivation of id+id*id:
T → T*F | F       E ⇒ E+T ⇒ E+T*F ⇒ E+T*id ⇒ E+F*id
F → (E) | id        ⇒ E+id*id ⇒ T+id*id ⇒ F+id*id ⇒ id+id*id

Right-Most Sentential Form      Reducing Production
id+id*id                        F → id
F+id*id                         T → F
T+id*id                         E → T
E+id*id                         F → id
E+F*id                          T → F
E+T*id                          F → id
E+T*F                           T → T*F
E+T                             E → E+T
E

In each right-sentential form, the handle is the substring that is replaced by the
reducing production shown on its right.

A Stack Implementation of a Shift-Reduce Parser

• There are four possible actions of a shift-reduce parser:

• Shift: The next input symbol is shifted onto the top of the stack.

• Reduce: Replace the handle on the top of the stack by the non-terminal.

• Accept: Successful completion of parsing.

• Error: Parser discovers a syntax error, and calls an error recovery routine.

• Initially, the stack contains only the end-marker $.

• The end of the input string is marked by the end-marker $.



A Stack Implementation of a Shift-Reduce Parser – Example

Stack        Input        Action
$            id+id*id$    shift
$id          +id*id$      reduce by F → id
$F           +id*id$      reduce by T → F
$T           +id*id$      reduce by E → T
$E           +id*id$      shift
$E+          id*id$       shift
$E+id        *id$         reduce by F → id
$E+F         *id$         reduce by T → F
$E+T         *id$         shift
$E+T*        id$          shift
$E+T*id      $            reduce by F → id
$E+T*F       $            reduce by T → T*F
$E+T         $            reduce by E → E+T
$E           $            accept

[Parse tree figure: the nodes are created in the order F1, T2, E3, F4, T5, F6, T7, E8,
giving the tree E8 = E3 + T7, with E3 → T2 → F1 → id and T7 = T5 * F6,
T5 → F4 → id, F6 → id.]
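The stack mechanics of the trace above can be sketched as follows. Note that this is not a complete parser: deciding when to shift and when to reduce is the job of a parsing table (covered in the LR sections below), so here the action sequence is supplied by hand purely to illustrate the shift and reduce operations; all names are illustrative.

    # Sketch: stack mechanics of a shift-reduce parse (actions given by hand).
    def run_shift_reduce(tokens, script):
        stack, pos = ['$'], 0
        for action in script:
            if action == 'shift':
                stack.append(tokens[pos]); pos += 1      # push next input symbol
            elif action == 'accept':
                return stack == ['$', 'E'] and tokens[pos] == '$'
            else:                                        # ('reduce', lhs, rhs)
                _, lhs, rhs = action
                assert stack[-len(rhs):] == rhs, "handle not on top of stack"
                del stack[-len(rhs):]                    # pop the handle
                stack.append(lhs)                        # push the non-terminal
        return False

    tokens = ['id', '+', 'id', '*', 'id', '$']
    script = ['shift', ('reduce', 'F', ['id']), ('reduce', 'T', ['F']),
              ('reduce', 'E', ['T']), 'shift', 'shift',
              ('reduce', 'F', ['id']), ('reduce', 'T', ['F']), 'shift', 'shift',
              ('reduce', 'F', ['id']), ('reduce', 'T', ['T', '*', 'F']),
              ('reduce', 'E', ['E', '+', 'T']), 'accept']
    print(run_shift_reduce(tokens, script))              # True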

Conflicts during Shift-Reduce Parsing

• There are context-free grammars for which shift-reduce parsers cannot be used.

• The stack contents and the next input symbol may not be enough to decide the action:

  – Shift/reduce conflict: the parser cannot decide whether to make a shift operation
    or a reduction.

  – Reduce/reduce conflict: the parser cannot decide which of several reductions to
    make.

• If a shift-reduce parser cannot be used for a grammar, that grammar is called a
  non-LR(k) grammar.

      LR(k):   L – left-to-right scanning of the input
               R – constructing a right-most derivation in reverse
               k – number of lookahead symbols

• An ambiguous grammar can never be an LR grammar.



Shift-Reduce Parsers

• There are two main categories of shift-reduce parsers

• Operator-Precedence Parser
– Simple, but only a small class of grammars

• LR-Parsers
– Covers wide range of grammars.
• SLR – simple LR parser
• LR – most general LR parser
• LALR – intermediate LR parser (lookahead LR parser)
– SLR, LR and LALR work the same way; only their parsing tables are different.

Operator-Precedence Parser

• Operator grammar

– Small, but an important class of grammars

– We may have an efficient operator precedence parser (a shift-reduce parser) for an


operator grammar.

• In an operator grammar, no production rule can have:

  – ε on the right side
  – two adjacent non-terminals on the right side

• Ex:
      E → AB                 E → EOE                E → E+E  |
      A → a                  E → id                     E*E  |
      B → b                  O → + | * | /              E/E  | id
      not operator grammar   not operator grammar   operator grammar

Precedence Relations

• In operator-precedence parsing, we define three disjoint precedence relations between


certain pairs of terminals.

      a <. b     b has higher precedence than a
      a =· b     b has same precedence as a
      a .> b     b has lower precedence than a

• The determination of the correct precedence relations between terminals is based on the
  traditional notions of associativity and precedence of operators.
Using Operator-Precedence Relations

• The intention of the precedence relations is to find the handle of a right-sentential
  form, with <. marking the left end, =· appearing in the interior of the handle, and
  .> marking the right end.

• In our input string $a1a2...an$, we insert the precedence relation between the pairs of
terminals (the precedence relation holds between the terminals in that pair).
Using Operator -Precedence Relations

E → E+E | E-E | E*E | E/E | E^E | (E) | -E | id

The partial operator-precedence table for this grammar

        id      +       *       $

id              .>      .>      .>

+       <.      .>      <.      .>

*       <.      .>      .>      .>

$       <.      <.      <.

• Then the input string id+id*id with the precedence relations inserted will be:
$ <. id .> + <. id .> * <. id .> $

To Find The Handles

• Scan the string from left end until the first .> is encountered.

• Then scan backwards (to the left) over any =· until a <. is encountered.

• The handle contains everything to the left of the first .> and to the right of the <.
  that is encountered.

      $ <. id .> + <. id .> * <. id .> $      E → id      $ id + id * id $
      $ <. + <. id .> * <. id .> $            E → id      $ E + id * id $
      $ <. + <. * <. id .> $                  E → id      $ E + E * id $
      $ <. + <. * .> $                        E → E*E     $ E + E * E $
      $ <. + .> $                             E → E+E     $ E + E $
      $ $                                                 $ E $



Operator-Precedence Parsing Algorithm
• The input string is w$, the initial stack is $ and a table holds precedence relations between
certain terminals

Algorithm:
set p to point to the first symbol of w$ ;
repeat forever
if ( $ is on top of the stack and p points to $ ) then return
else {
let a be the topmost terminal symbol on the stack and let b be the symbol pointed
to by p;
if ( a <. b or a =· b ) then { /* SHIFT */
push b onto the stack;
advance p to the next input symbol;
}
else if ( a .> b ) then /* REDUCE */
repeat pop stack
until ( the top of stack terminal is related by <. to the terminal most
recently popped );
else error(); }
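A hedged, runnable rendering of this algorithm for the partial table of the previous example (terminals id, +, * and $ only) might look as follows. Non-terminals are omitted from the stack, since precedence relations are defined only between terminals; all names are illustrative.

    # Sketch: operator-precedence parsing over the partial table above.
    # '<' stands for <., '=' for =·, '>' for .>.
    PREC = {('$','id'):'<', ('$','+'):'<', ('$','*'):'<',
            ('id','+'):'>', ('id','*'):'>', ('id','$'):'>',
            ('+','id'):'<', ('+','+'):'>', ('+','*'):'<', ('+','$'):'>',
            ('*','id'):'<', ('*','+'):'>', ('*','*'):'>', ('*','$'):'>'}

    def op_precedence_parse(tokens):
        stack, p = ['$'], 0
        while True:
            a, b = stack[-1], tokens[p]
            if a == '$' and b == '$':
                return True                          # accept
            rel = PREC.get((a, b))
            if rel in ('<', '='):                    # SHIFT
                stack.append(b); p += 1
            elif rel == '>':                         # REDUCE
                while True:
                    popped = stack.pop()             # pop until top <. popped
                    if PREC.get((stack[-1], popped)) == '<':
                        break
            else:
                raise SyntaxError(f"no relation between {a} and {b}")

    print(op_precedence_parse(['id', '+', 'id', '*', 'id', '$']))   # True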

Operator-Precedence Parsing Algorithm – Example

Stack      Input         Relation    Action
$          id+id*id$     $ <. id     shift
$id        +id*id$       id .> +     reduce E → id
$          +id*id$                   shift
$+         id*id$                    shift
$+id       *id$          id .> *     reduce E → id
$+         *id$                      shift
$+*        id$                       shift
$+*id      $             id .> $     reduce E → id
$+*        $             * .> $      reduce E → E*E
$+         $             + .> $      reduce E → E+E
$          $                         accept

How to Create Operator-Precedence Relations


• We use associativity and precedence relations among operators.

  If operator O1 has higher precedence than operator O2,
      O1 .> O2   and   O2 <. O1

  If operator O1 and operator O2 have equal precedence,
      they are left-associative   →   O1 .> O2  and  O2 .> O1
      they are right-associative  →   O1 <. O2  and  O2 <. O1

  For all operators O,
      O <. id,  id .> O,  O <. (,  ( <. O,  O .> ),  ) .> O,  O .> $,  and  $ <. O

  Also, let
      ( =· )     $ <. (      id .> )     ) .> $
      ( <. (     $ <. id     id .> $     ) .> )
      ( <. id
Operator-Precedence Relations

        +      -      *      /      ^      id     (      )      $

 +      .>     .>     <.     <.     <.     <.     <.     .>     .>

 -      .>     .>     <.     <.     <.     <.     <.     .>     .>

 *      .>     .>     .>     .>     <.     <.     <.     .>     .>

 /      .>     .>     .>     .>     <.     <.     <.     .>     .>

 ^      .>     .>     .>     .>     <.     <.     <.     .>     .>

 id     .>     .>     .>     .>     .>                   .>     .>

 (      <.     <.     <.     <.     <.     <.     <.     =·

 )      .>     .>     .>     .>     .>                   .>     .>

 $      <.     <.     <.     <.     <.     <.     <.

Precedence Functions

• Compilers using operator precedence parsers do not need to store the table of precedence
relations.

• The table can be encoded by two precedence functions f and g that map terminal symbols
to integers.

• For symbols a and b


f(a) < g(b) whenever a <. b
f(a) = g(b) whenever a =· b
f(a) > g(b) whenever a .> b
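For the { +, *, id, $ } fragment of the running example, one possible pair of such functions is sketched below (the classic textbook assignment; any pair of integer functions that preserves the comparisons would do equally well — the error cases of the table are not modeled here):

    # Sketch: precedence functions encoding part of the relation table.
    f = {'+': 2, '*': 4, 'id': 4, '$': 0}
    g = {'+': 1, '*': 3, 'id': 5, '$': 0}

    def relation(a, b):
        if f[a] < g[b]: return '<.'     # a yields precedence to b
        if f[a] > g[b]: return '.>'     # a takes precedence over b
        return '=·'

    print(relation('+', '*'))    # <.   (since + <. *)
    print(relation('id', '+'))   # .>   (since id .> +)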



Disadvantages of Operator Precedence Parsing

• Disadvantages:

– It cannot handle the unary minus (the lexical analyzer should handle the unary
minus).

– Small class of grammars.

– Difficult to decide which language is recognized by the grammar.

• Advantages:

– Simple

– Powerful enough for expressions in programming languages

Error Recovery in Operator-Precedence Parsing

Error Cases:

• No relation holds between the terminal on the top of stack and the next input
symbol.

• A handle is found (reduction step), but there is no production with this handle as a
right side

Error Recovery:

• Each empty entry is filled with a pointer to an error routine.

• Decides which right-hand side the popped handle “looks like”, and tries to recover
  from that situation.



LR Parsers
• The most powerful shift-reduce parsing (yet efficient) is:
LR(k) parsing.

      LR(k):   L – left-to-right scanning of the input
               R – constructing a right-most derivation in reverse
               k – number of lookahead symbols (when k is omitted, it is 1)

• LR parsing is attractive because:


– LR parsing is the most general non-backtracking shift-reduce parsing method, yet it
  is still efficient.
– The class of grammars that can be parsed using LR methods is a proper superset of
  the class of grammars that can be parsed with predictive parsers:
  LL(1) grammars ⊂ LR(1) grammars.
– An LR parser can detect a syntactic error as soon as it is possible to do so on a
  left-to-right scan of the input.



LR Parsers

• LR-Parsers

– Covers wide range of grammars.

– SLR – simple LR parser

– LR – most general LR parser

– LALR – intermediate LR parser (lookahead LR parser)

– SLR, LR and LALR work the same way (they use the same algorithm); only their
  parsing tables are different.

LR Parsing Algorithm

A Configuration of LR Parsing Algorithm

• A configuration of an LR parser is:

      ( S0 X1 S1 ... Xm Sm,  ai ai+1 ... an $ )
            Stack                Rest of Input

• Sm and ai decide the parser action by consulting the parsing action table. (The initial
  stack contains just S0.)

• A configuration of an LR parser represents the right sentential form:

      X1 ... Xm ai ai+1 ... an $

Actions of A LR-Parser

• shift s – shifts the next input symbol and the state s onto the stack:
  ( S0 X1 S1 ... Xm Sm, ai ai+1 ... an $ )  →  ( S0 X1 S1 ... Xm Sm ai s, ai+1 ... an $ )

• reduce A → β (or rn where n is a production number)

  – pop 2|β| items from the stack (where r = |β|);

  – then push A and s where s = goto[sm-r, A]

  ( S0 X1 S1 ... Xm Sm, ai ai+1 ... an $ )  →  ( S0 X1 S1 ... Xm-r Sm-r A s, ai ... an $ )

  – Output is the reducing production: reduce A → β

• Accept – Parsing successfully completed

• Error -- Parser detected an error (an empty entry in the action table)

Reduce Action

• pop 2|β| items from the stack; let us assume that β = Y1Y2...Yr

• then push A and s where s = goto[sm-r, A]

  ( S0 X1 S1 ... Xm-r Sm-r Y1 Sm-r+1 ... Yr Sm, ai ai+1 ... an $ )
      →  ( S0 X1 S1 ... Xm-r Sm-r A s, ai ... an $ )

• In fact, Y1Y2...Yr is a handle:

  X1 ... Xm-r A ai ... an $   ⇐   X1 ... Xm-r Y1...Yr ai ai+1 ... an $

(SLR) Parsing Tables for Expression Grammar

[SLR parsing table figure not reproduced; the state numbers used in the trace below refer
to it.]

Actions of A (S)LR-Parser -- Example


Stack        Input        Action                Output
0            id*id+id$    shift 5
0id5         *id+id$      reduce by F → id      F → id
0F3          *id+id$      reduce by T → F       T → F
0T2          *id+id$      shift 7
0T2*7        id+id$       shift 5
0T2*7id5     +id$         reduce by F → id      F → id
0T2*7F10     +id$         reduce by T → T*F     T → T*F
0T2          +id$         reduce by E → T       E → T
0E1          +id$         shift 6
0E1+6        id$          shift 5
0E1+6id5     $            reduce by F → id      F → id
0E1+6F3      $            reduce by T → F       T → F
0E1+6T9      $            reduce by E → E+T     E → E+T
0E1          $            accept

Constructing SLR Parsing Tables – LR(0) Item

• An LR(0) item of a grammar G is a production of G with a dot at some position of the
  right side.

• Ex:  A → aBb     Possible LR(0) items:    A → .aBb
       (four different possibilities)       A → a.Bb
                                            A → aB.b
                                            A → aBb.

• Sets of LR(0) items will be the states of action and goto table of the SLR parser.

• A collection of sets of LR(0) items (the canonical LR(0) collection) is the basis for
constructing SLR parsers.

• Augmented Grammar:
  G' is G with a new production rule S' → S, where S' is the new starting symbol.

The Closure Operation

• If I is a set of LR(0) items for a grammar G, then closure(I) is the set of LR(0) items
constructed from I by the two rules:

• Initially, every LR(0) item in I is added to closure(I).

• If A → α.Bβ is in closure(I) and B → γ is a production rule of G,

  then B → .γ will be in closure(I).

We will apply this rule until no more new LR(0) items can be added to closure(I).
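A small sketch of this fixed-point computation, assuming an item is a (head, body, dot) triple with body a tuple of symbols and prods mapping each non-terminal to its list of right sides (illustrative names, not a fixed interface):

    # Sketch: closure(I) for LR(0) items.
    def closure(items, prods, nonterminals):
        result = set(items)
        changed = True
        while changed:
            changed = False
            for head, body, dot in list(result):
                if dot < len(body) and body[dot] in nonterminals:
                    B = body[dot]                     # item is A -> alpha . B beta
                    for rhs in prods[B]:
                        item = (B, rhs, 0)            # add B -> . gamma
                        if item not in result:
                            result.add(item)
                            changed = True
        return frozenset(result)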

The Closure Operation -- Example

E' → E          closure({E' → .E}) =
E → E+T         { E' → .E        (kernel item)
E → T             E → .E+T
T → T*F           E → .T
T → F             T → .T*F
F → (E)           T → .F
F → id            F → .(E)
                  F → .id }

Goto Operation

• If I is a set of LR(0) items and X is a grammar symbol (terminal or non-terminal), then


goto(I,X) is defined as follows:

• If A → α.Xβ is in I, then every item in closure({A → αX.β}) will be in goto(I,X).

Example:
I = { E' → .E, E → .E+T, E → .T,
      T → .T*F, T → .F,
      F → .(E), F → .id }
goto(I,E) = { E' → E., E → E.+T }
goto(I,T) = { E → T., T → T.*F }
goto(I,F) = { T → F. }
goto(I,() = { F → (.E), E → .E+T, E → .T, T → .T*F, T → .F,
              F → .(E), F → .id }
goto(I,id) = { F → id. }

Construction of The Canonical LR(0) Collection

• To create the SLR parsing tables for a grammar G, we will create the canonical LR(0)
collection of the grammar G’.

• Algorithm:
      C is { closure({S' → .S}) }
      repeat the following until no more sets of LR(0) items can be added to C:
          for each I in C and each grammar symbol X
              if goto(I,X) is not empty and not in C
                  add goto(I,X) to C

• The goto function is a DFA on the sets in C.
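Continuing the same sketch, goto(I, X) and the canonical collection can be written as below; the loop visits new sets as they are appended, exactly as the algorithm above describes:

    # Sketch: goto(I, X) and the canonical LR(0) collection.
    def goto(items, X, prods, nonterminals):
        moved = {(h, b, d + 1) for (h, b, d) in items
                 if d < len(b) and b[d] == X}   # A -> a.Xb  =>  A -> aX.b
        return closure(moved, prods, nonterminals) if moved else frozenset()

    def canonical_collection(prods, nonterminals, symbols, start):
        I0 = closure({(start + "'", (start,), 0)}, prods, nonterminals)
        C = [I0]
        for I in C:                             # C grows while we iterate
            for X in symbols:
                J = goto(I, X, prods, nonterminals)
                if J and J not in C:
                    C.append(J)
        return C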



The Canonical LR(0) Collection – Example

I0: E' → .E        I1: E' → E.         I6: E → E+.T        I9: E → E+T.
    E → .E+T           E → E.+T            T → .T*F            T → T.*F
    E → .T                                 T → .F
    T → .T*F       I2: E → T.              F → .(E)       I10: T → T*F.
    T → .F             T → T.*F            F → .id
    F → .(E)
    F → .id        I3: T → F.          I7: T → T*.F       I11: F → (E).
                                           F → .(E)
                   I4: F → (.E)            F → .id
                       E → .E+T
                       E → .T          I8: F → (E.)
                       T → .T*F            E → E.+T
                       T → .F
                       F → .(E)
                       F → .id
                   I5: F → id.

Transition Diagram (DFA) of the Goto Function

[Figure: the DFA of the goto function over the sets I0–I11. Its transitions are:
    I0:  E → I1,  T → I2,  F → I3,  ( → I4,  id → I5
    I1:  + → I6
    I2:  * → I7
    I4:  E → I8,  T → I2,  F → I3,  ( → I4,  id → I5
    I6:  T → I9,  F → I3,  ( → I4,  id → I5
    I7:  F → I10, ( → I4,  id → I5
    I8:  ) → I11, + → I6
    I9:  * → I7]



Constructing SLR Parsing Table (of an augmented grammar G')

1. Construct the canonical collection of sets of LR(0) items for G'.   C = {I0, ..., In}
2. Create the parsing action table as follows:
   • If a is a terminal, A → α.aβ is in Ii and goto(Ii,a) = Ij, then action[i,a] is shift j.
   • If A → α. is in Ii, then action[i,a] is reduce A → α for all a in FOLLOW(A),
     where A ≠ S'.
   • If S' → S. is in Ii, then action[i,$] is accept.
   • If any conflicting actions are generated by these rules, the grammar is not SLR(1).
3. Create the parsing goto table:
   • for all non-terminals A, if goto(Ii,A) = Ij then goto[i,A] = j
4. All entries not defined by (2) and (3) are errors.
5. The initial state of the parser is the one containing S' → .S.
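A sketch of step 2 under the same illustrative conventions (C, goto and closure from the previous sketches, follow precomputed); writing two different actions into one cell signals that the grammar is not SLR(1):

    # Sketch: filling the SLR action table from the canonical collection C.
    def slr_action_table(C, prods, nonterminals, terminals, follow, start):
        action = {}
        def set_action(i, a, act):
            if action.get((i, a), act) != act:
                raise ValueError(f"conflict at action[{i},{a}]: not SLR(1)")
            action[(i, a)] = act
        for i, I in enumerate(C):
            for (head, body, dot) in I:
                if dot < len(body) and body[dot] in terminals:        # shift
                    j = C.index(goto(I, body[dot], prods, nonterminals))
                    set_action(i, body[dot], ('shift', j))
                elif dot == len(body) and head != start + "'":        # reduce
                    for a in follow[head]:
                        set_action(i, a, ('reduce', head, body))
                elif dot == len(body):                                # accept
                    set_action(i, '$', ('accept',))
        return action    # goto table for non-terminals is built separately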

Parsing Tables of Expression Grammar

[Table figure not reproduced.]

SLR(1) Grammar
• An LR parser using SLR(1) parsing tables for a grammar G is called the SLR(1) parser
  for G.
• If a grammar G has an SLR(1) parsing table, it is called an SLR(1) grammar (or SLR
  grammar in short).
• Every SLR grammar is unambiguous, but not every unambiguous grammar is an SLR
  grammar.

shift/reduce and reduce/reduce conflicts


• If a state does not know whether it will make a shift operation or a reduction for a
  terminal, we say that there is a shift/reduce conflict.

• If a state does not know whether it will make a reduction using production rule i or
  production rule j for a terminal, we say that there is a reduce/reduce conflict.

• If the SLR parsing table of a grammar G has a conflict, we say that the grammar is
  not an SLR grammar.

Conflict Example
S → L=R        I0: S' → .S        I1: S' → S.         I6: S → L=.R       I9: S → L=R.
S → R              S → .L=R                               R → .L
L → *R             S → .R         I2: S → L.=R           L → .*R
L → id             L → .*R            R → L.              L → .id
R → L              L → .id
                   R → .L         I3: S → R.          I7: L → *R.

                                  I4: L → *.R
                                      R → .L          I8: R → L.
                                      L → .*R
                                      L → .id
                                  I5: L → id.

Problem: FOLLOW(R) = { =, $ }
  In state I2 on '=':  shift 6  or  reduce by R → L   ⇒  shift/reduce conflict



Conflict Example 2

S → AaAb       I0: S' → .S
S → BbBa           S → .AaAb
A → ε              S → .BbBa
B → ε              A → .
                   B → .

Problem: FOLLOW(A) = { a, b },  FOLLOW(B) = { a, b }
  on a:  reduce by A → ε  or  reduce by B → ε   ⇒  reduce/reduce conflict
  on b:  reduce by A → ε  or  reduce by B → ε   ⇒  reduce/reduce conflict

Constructing Canonical LR(1) Parsing Tables


• In the SLR method, state i makes a reduction by A → α when the current token is a:

  – if the item A → α. is in Ii and a is in FOLLOW(A).

• In some situations, βA cannot be followed by the terminal a in a right-sentential form
  when βα and the state i are on the top of the stack. This means that making the
  reduction in this case is not correct.

      S → AaAb        S ⇒ AaAb ⇒ Aab ⇒ ab        S ⇒ BbBa ⇒ Bba ⇒ ba
      S → BbBa
      A → ε
      B → ε

  With lookahead a in the initial state, SLR allows both reductions A → ε and B → ε
  (a is in both FOLLOW(A) and FOLLOW(B)), but only A → ε can lead to a correct parse.

LR(1) Item

• To avoid some of these invalid reductions, the states need to carry more information.
• Extra information is put into a state by including a terminal symbol as a second
  component of an item.
• An LR(1) item is:
      A → α.β, a      where a is the lookahead of the LR(1) item
                      (a is a terminal or the end-marker $)

• When β (in the LR(1) item A → α.β, a) is not empty, the lookahead does not have any
  effect.
• When β is empty (A → α., a), we do the reduction by A → α only if the next input
  symbol is a (not for every terminal in FOLLOW(A)).

• A state will contain    A → α., a1      where {a1, ..., an} ⊆ FOLLOW(A)
                          ...
                          A → α., an
Canonical Collection of Sets of LR(1) Items

• The construction of the canonical collection of the sets of LR(1) items is similar to the
  construction of the canonical collection of the sets of LR(0) items, except that the
  closure and goto operations work a little differently.

closure(I) is: (where I is a set of LR(1) items)

– every LR(1) item in I is in closure(I)

– if A → α.Bβ, a is in closure(I) and B → γ is a production rule of G, then
  B → .γ, b will be in closure(I) for each terminal b in FIRST(βa).
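The only change relative to the LR(0) closure sketch given earlier is the computation of the lookaheads via FIRST(βa); a hedged rendering, with items now (head, body, dot, lookahead) quadruples and first_of as assumed before:

    # Sketch: closure(I) for LR(1) items.
    def closure_lr1(items, prods, nonterminals, first_of):
        result = set(items)
        changed = True
        while changed:
            changed = False
            for head, body, dot, la in list(result):
                if dot < len(body) and body[dot] in nonterminals:
                    B, beta = body[dot], body[dot + 1:]
                    for b in first_of(beta + (la,)):   # lookaheads = FIRST(beta a)
                        for rhs in prods[B]:
                            item = (B, rhs, 0, b)
                            if item not in result:
                                result.add(item)
                                changed = True
        return frozenset(result)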

goto operation

• If I is a set of LR(1) items and X is a grammar symbol (terminal or non-terminal), then


goto(I,X) is defined as follows:

– If A → α.Xβ, a is in I, then every item in closure({A → αX.β, a}) will be in goto(I,X).

Construction of The Canonical LR(1) Collection

• Algorithm:
      C is { closure({S' → .S, $}) }
      repeat the following until no more sets of LR(1) items can be added to C:
          for each I in C and each grammar symbol X
              if goto(I,X) is not empty and not in C
                  add goto(I,X) to C

• The goto function is a DFA on the sets in C.

A Short Notation for The Sets of LR(1) Items


• A set of LR(1) items containing the following items
      A → α., a1
      ...
      A → α., an

  can be written as

      A → α., a1/a2/.../an

Canonical LR(1) Collection -- Example


S → AaAb       I0: S' → .S, $          I1: S' → S., $
S → BbBa           S → .AaAb, $
A → ε              S → .BbBa, $        I2: S → A.aAb, $      (on a, to I4)
B → ε              A → ., a
                   B → ., b            I3: S → B.bBa, $      (on b, to I5)

I4: S → Aa.Ab, $    (on A, to I6)      I6: S → AaA.b, $      (on b, to I8)
    A → ., b                           I8: S → AaAb., $
I5: S → Bb.Ba, $    (on B, to I7)      I7: S → BbB.a, $      (on a, to I9)
    B → ., a                           I9: S → BbBa., $



Canonical LR(1) Collection – Example2

S' → S          I0: S' → .S, $           I1: S' → S., $
1) S → L=R          S → .L=R, $
2) S → R            S → .R, $            I2: S → L.=R, $
3) L → *R           L → .*R, $/=             R → L., $
4) L → id           L → .id, $/=
5) R → L            R → .L, $            I3: S → R., $

I4: L → *.R, $/=     I5: L → id., $/=    I6: S → L=.R, $      I9: S → L=R., $
    R → .L, $/=                              R → .L, $
    L → .*R, $/=                             L → .*R, $       I10: R → L., $
    L → .id, $/=                             L → .id, $

I7: L → *R., $/=     I8: R → L., $/=     I11: L → *.R, $      I12: L → id., $
                                              R → .L, $
                                              L → .*R, $      I13: L → *R., $
                                              L → .id, $

States with the same core:  I4 and I11,  I5 and I12,  I7 and I13,  I8 and I10.

Construction of LR(1) Parsing Tables

1. Construct the canonical collection of sets of LR(1) items for G'.   C = {I0, ..., In}
2. Create the parsing action table as follows:
   • If a is a terminal, A → α.aβ, b is in Ii and goto(Ii,a) = Ij, then action[i,a] is
     shift j.
   • If A → α., a is in Ii, then action[i,a] is reduce A → α, where A ≠ S'.
   • If S' → S., $ is in Ii, then action[i,$] is accept.
   • If any conflicting actions are generated by these rules, the grammar is not LR(1).
3. Create the parsing goto table:
   • for all non-terminals A, if goto(Ii,A) = Ij then goto[i,A] = j
4. All entries not defined by (2) and (3) are errors.
5. The initial state of the parser is the one containing S' → .S, $.



LR(1) Parsing Tables – (for Example2)

        id      *       =       $           S       L       R
  0     s5      s4                          1       2       3
  1                             acc
  2                     s6      r5
  3                             r2
  4     s5      s4                                  8       7
  5                     r4      r4
  6     s12     s11                                 10      9
  7                     r3      r3
  8                     r5      r5
  9                             r1
  10                            r5
  11    s12     s11                                 10      13
  12                            r4
  13                            r3

No shift/reduce or reduce/reduce conflict, so it is an LR(1) grammar.

LALR Parsing Tables

• LALR stands for LookAhead LR.
• LALR parsers are often used in practice because LALR parsing tables are smaller than
  LR(1) parsing tables.
• The numbers of states in the SLR and LALR parsing tables for a grammar G are equal.
• But LALR parsers recognize more grammars than SLR parsers.
• yacc creates an LALR parser for the given grammar.
• A state of an LALR parser will again be a set of LR(1) items.

Creating LALR Parsing Tables

Canonical LR(1) Parser   →   LALR Parser
                (shrink the number of states)

• This shrink process may introduce a reduce/reduce conflict in the resulting LALR parser
  (in which case the grammar is NOT LALR).
• But, this shrink process does not produce a shift/reduce conflict.

The Core of A Set of LR(1) Items


• The core of a set of LR(1) items is the set of its first components.

      S → L.=R, $     →     S → L.=R      (core)
      R → L., $             R → L.

• We will find the states (sets of LR(1) items) in a canonical LR(1) parser with the same
  cores. Then we will merge them into a single state.

      I1: L → id., =       a new state:   I12: L → id., =
                                                L → id., $
      I2: L → id., $       (they have the same core, so merge them)

• We will do this for all states of a canonical LR(1) parser to get the states of the LALR
  parser.
• In fact, the number of states of the LALR parser for a grammar will be equal to the
  number of states of the SLR parser for that grammar.

Creation of LALR Parsing Tables

• Create the canonical LR(1) collection of the sets of LR(1) items for the given grammar.
• Find each core; find all sets having that same core; replace those sets having the same
  core with a single set which is their union.
      C = {I0, ..., In}   →   C' = {J1, ..., Jm}   where m ≤ n
• Create the parsing tables (action and goto tables) in the same way as the construction
  of the parsing tables of an LR(1) parser.
  – Note that: if J = I1 ∪ ... ∪ Ik, since I1, ..., Ik have the same core,
    the cores of goto(I1,X), ..., goto(Ik,X) must also be the same.
  – So, goto(J,X) = K where K is the union of all sets of items having the same core as
    goto(I1,X).
• If no conflict is introduced, the grammar is an LALR(1) grammar. (We may only
  introduce reduce/reduce conflicts; we cannot introduce a shift/reduce conflict.)

Shift/Reduce Conflict
• We say that we cannot introduce a shift/reduce conflict during the shrink process for
  the creation of the states of an LALR parser.
• Assume that we can introduce a shift/reduce conflict. In this case, a state of the LALR
  parser must have:
      A → α., a     and     B → β.aγ, b
  This means that a state of the canonical LR(1) parser must have:
      A → α., a     and     B → β.aγ, c
  But this state also has a shift/reduce conflict, i.e. the original canonical LR(1) parser
  already had the conflict.
  (The reason for this: the shift operation does not depend on lookaheads.)

Reduce/Reduce Conflict
• But, we may introduce a reduce/reduce conflict during the shrink process for the
  creation of the states of an LALR parser:

      I1: A → α., a          I2: A → α., b
          B → β., b              B → β., c

      I12: A → α., a/b      ⇒  reduce/reduce conflict
           B → β., b/c



Canonical LALR(1) Collection – Example2

S' → S          I0: S' → .S, $           I1: S' → S., $
1) S → L=R          S → .L=R, $
2) S → R            S → .R, $            I2: S → L.=R, $
3) L → *R           L → .*R, $/=             R → L., $
4) L → id           L → .id, $/=
5) R → L            R → .L, $            I3: S → R., $

I411: L → *.R, $/=      I512: L → id., $/=     I6: S → L=.R, $      I9: S → L=R., $
      R → .L, $/=                                  R → .L, $
      L → .*R, $/=                                 L → .*R, $
      L → .id, $/=                                 L → .id, $

I713: L → *R., $/=      I810: R → L., $/=

Same cores merged:  I4 and I11 → I411,   I5 and I12 → I512,
                    I7 and I13 → I713,   I8 and I10 → I810.

LALR(1) Parsing Tables – (for Example2)

          id       *        =       $           S       L       R
  0       s512     s411                         1       2       3
  1                                 acc
  2                         s6      r5
  3                                 r2
  411     s512     s411                                 810     713
  512                       r4      r4
  6       s512     s411                                 810     9
  713                       r3      r3
  810                       r5      r5
  9                                 r1

No shift/reduce or reduce/reduce conflict, so it is an LALR(1) grammar.

Error Recovery in LR Parsing

• An LR parser will detect an error when it consults the parsing action table and finds an
  error entry. All empty entries in the action table are error entries.
• Errors are never detected by consulting the goto table.
• An LR parser will announce an error as soon as there is no valid continuation for the
  scanned portion of the input.
• A canonical LR parser (LR(1) parser) will never make even a single reduction before
  announcing an error.
• The SLR and LALR parsers may make several reductions before announcing an error.
• But all LR parsers (LR(1), LALR and SLR parsers) will never shift an erroneous input
  symbol onto the stack.

Panic Mode Error Recovery in LR Parsing

• Scan down the stack until a state s with a goto on a particular nonterminal A is found.
  (Get rid of everything from the stack before this state s.)
• Discard zero or more input symbols until a symbol a is found that can legitimately
  follow A.
  – The symbol a is simply in FOLLOW(A), but this may not work for all situations.
• The parser stacks the nonterminal A and the state goto[s,A], and it resumes normal
  parsing.
• This nonterminal A is normally a basic programming block (there can be more than one
  choice for A).
  – stmt, expr, block, ...

Phrase-Level Error Recovery in LR Parsing

• Each empty entry in the action table is marked with a specific error routine.
• An error routine reflects the error that the user most likely will make in that case.
• An error routine inserts symbols into the stack or the input (or it deletes symbols from
  the stack and the input, or it can do both insertion and deletion).
  – missing operand
  – unbalanced right parenthesis



Bonga University
College of Engineering and Technology
Department of Computer Science
CoSc3112 – COMPILER DESIGN
Chapter 5 Handouts – Semantic Analysis

Contents
 • Introduction
 • A Short Program
 • Semantic Analysis
 • Annotated Abstract Syntax Tree (AST)
 • Syntax-Directed Translation
 • Syntax-Directed Definitions (SDD)
 • Evaluation of S-Attributed Definitions
 • L-Attributed Definitions
 • Translation Schemes

Semantic Analyzer

Semantic analysis is the third phase of the compiler. Semantic analysis makes sure that
the declarations and statements of a program are semantically correct. It is a collection of
procedures which are called by the parser as and when required by the grammar. Both the
syntax tree of the previous phase and the symbol table are used to check the consistency
of the given code.
It uses the syntax tree and the symbol table to check whether the given program is
semantically consistent with the language definition. It gathers type information and
stores it in either the syntax tree or the symbol table. This type information is
subsequently used by the compiler during intermediate-code generation.
Semantic Errors
Some of the semantic errors that the semantic analyzer is expected to recognize:

 • Type mismatch
 • Undeclared variable
 • Reserved identifier misuse
 • Multiple declaration of a variable in a scope
 • Accessing an out-of-scope variable
 • Actual and formal parameter mismatch

Therefore, semantic analysis verifies properties of the program that aren't caught during
the earlier phases:
 • Variables are declared before they're used.
 • Expressions have the right types.
 • Classes don't inherit from nonexistent base classes.
 • Types are consistent.
 • The inheritance relationship is correct.
 • A class is defined only once.
 • A method in a class is defined only once.
 • Reserved identifiers are not misused.
Once we finish semantic analysis, we know that the user's input program is legal.
Short program that shows a semantic error

[Program listing not reproduced.] The scope of an identifier is the portion of a program in
which that identifier is accessible.

• Semantic analysis also gathers useful information about the program for later phases:
  – e.g. count how many variables are in scope at each point.

Why can't we just do this during parsing? A context-free grammar cannot represent all
language constraints, e.g. non-local/context-dependent relations.

• Limitations of CFGs:
  – How would you prevent duplicate class definitions?
  – How would you differentiate variables of one type from variables of another type?
  – How would you ensure classes implement all interface methods?
  For most programming languages, these are provably impossible with a CFG alone.

• Semantic analysis can be implemented using an annotated abstract syntax tree (AST).
• The input for semantic analysis is the abstract syntax tree (produced by the syntax
  analyzer) and the output is the annotated abstract syntax tree.
• An annotated abstract syntax tree is a parse tree that also shows the values of the
  attributes at each node.

Syntax-Directed Translation (SDT)

• SDT is used to drive semantic analysis tasks based on the language's syntax structures.
• What semantic tasks?
  – Generate the AST (abstract syntax tree)
  – Check type errors
  – Generate intermediate representation (IR)
• What syntax structures?
  – Context-free grammar (CFG)
  – Parse tree generated by the parser
• How?
  – Attach attributes to grammar symbols/parse-tree nodes
  – Attach either rules or program fragments to productions in a grammar
  – Evaluate attribute values using the semantic actions/semantic rules associated
    with the production rules.
Two Types of Attributes
• Attributes can represent anything we need: a string, a type, a number, a memory
  location, ... and they are of two types:
• Synthesized attributes: attribute values are computed from the attribute values of the
  node's children.
      S → ABC
  If S is taking values from its child nodes (A, B, C), then it is said to be a synthesized
  attribute, as the values of A, B and C are synthesized into S. In (E → E + T), the
  parent node E gets its value from its child nodes. Synthesized attributes never take
  values from their parent nodes or any sibling nodes.
• Inherited attributes: attribute values are computed from the attributes of the siblings
  and parent of the node.
      S → ABC
  A can get values from S, B and C. B can take values from S, A and C. Likewise, C can
  take values from S, A and B.

Attribute Grammar
Attribute grammar is a special form of context-free grammar where some additional
information (attributes) are appended to one or more of its non-terminals in order to
provide context-sensitive information. Each attribute has a well-defined domain of values,
such as integer, float, character, string, and expressions.
Attribute grammar is a medium to provide semantics to the context-free grammar and it
can help specify the syntax and semantics of a programming language.
Example:
E → E + T { E.value = E.value + T.value }
The right part of the CFG contains the semantic rules that specify how the grammar
should be interpreted. Here, the values of non-terminals E and T are added together and
the result is copied to the non-terminal E.

Two Types of Syntax Directed Translation


When we associate semantic rules with productions, we use two notations:
• Syntax-Directed Definitions
  – associate a production rule with a set of semantic actions, and we do not say when
    they will be evaluated
  – don't specify the order of evaluation/translation – hides implementation details
• Translation Schemes
  – indicate the order of evaluation of the semantic actions associated with a
    production rule
  – show more implementation details
• Syntax-Directed Definitions (SDD)
  An SDD is a context-free grammar together with attributes and rules.
  – Attributes are associated with grammar symbols and rules are associated with
    productions.
  If X is a symbol and a is one of its attributes, then we write X.a to denote the value
  of a at a particular parse-tree node labeled X.

  Production       Semantic Rules
  L → E return     { print(E.val) }
  E → E1 + T       { E.val = E1.val + T.val }
  E → T            { E.val = T.val }
  T → T1 * F       { T.val = T1.val * F.val }
  T → F            { T.val = F.val }
  F → ( E )        { F.val = E.val }
  F → digit        { F.val = digit.lexval }

  • Symbols E, T, and F are associated with a synthesized attribute val.
  • The token digit has a synthesized attribute lexval (it is assumed that it is evaluated
    by the lexical analyzer).

Annotated Parse Tree – Example

[Annotated parse tree figure not reproduced.]

Dependency Graph

A dependency graph suggests possible evaluation orders for an annotated parse tree.

Syntax-Directed Definition – Inherited Attributes

Production       Semantic Rules

D → T L          { L.in = T.type }
T → int          { T.type = integer }
T → real         { T.type = real }
L → L1 id        { L1.in = L.in, addtype(id.entry, L.in) }
L → id           { addtype(id.entry, L.in) }

• Symbol T is associated with a synthesized attribute type.
• Symbol L is associated with an inherited attribute in.
• We can use inherited attributes to track type information.
• We can use inherited attributes to track whether an identifier appears on the left or
  right side of an assignment operator ":=" (e.g. a := a + 1).

Parse Tree and A Dependency Graph – Inherited Attributes

[Figure not reproduced.]

S-Attributed and L-Attributed Definitions

• There are two sub-classes of the syntax-directed definitions:
  – S-Attributed Definitions: only synthesized attributes are used in the
    syntax-directed definitions.
    • S-attributed SDTs are evaluated in bottom-up parsing, as the values of the
      parent nodes depend upon the values of the child nodes.
    • Semantic actions are placed at the rightmost place of the RHS.
      Example: S → MN { S.val = M.val + N.val }
  – L-Attributed Definitions: in addition to synthesized attributes, we may also use
    inherited attributes in a restricted fashion.
    • Semantic actions are placed anywhere in the RHS.
    • Attributes in L-attributed SDTs are evaluated in a depth-first, left-to-right
      parsing manner.
      Example: M → PQ { M.val = P.val * Q.val and P.val = Q.val }

Note – If a definition is S-attributed, then it is also L-attributed, but NOT vice-versa.

• Implementations of S-attributed and L-attributed definitions are easy:
  – we can evaluate the semantic rules in a single pass during the parsing.
• However, implementations of S-attributed definitions are a little bit easier than
  implementations of L-attributed definitions.
• An S-attributed SDD can be implemented naturally in conjunction with an LR parser.
Bottom-Up Evaluation of S-Attributed Definitions

• We put the values of the synthesized attributes of the grammar symbols into a parallel
  stack.
  – When an entry of the parser stack holds a grammar symbol X (terminal or
    non-terminal), the corresponding entry in the parallel stack will hold the
    synthesized attribute(s) of the symbol X.
• We evaluate the values of the attributes during reductions.
      A → XYZ      A.a = f(X.x, Y.y, Z.z)    where all attributes are synthesized.
• At each shift of digit, we also push digit.lexval onto the val-stack.
• At all other shifts, we do not put anything into the val-stack because other terminals
  do not have attributes (but we increment the stack pointer for the val-stack).
Bottom-Up Evaluation – Example

[Worked example figure not reproduced.]

Approach:
1. Semantic stack: store attributes (may be separate from the main stack).
2. For every symbol shifted, store its corresponding attribute on the stack.
3. For every reduction A → µq, compute the attribute of A by popping the attributes
   for µ and q from the semantic stack.

Example 2 – Stack for 2 + 3:
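A tiny sketch of the val-stack for this input follows. The unit reductions (F → digit, T → F, E → T) only copy a value, so they are collapsed here and only the interesting reduction E → E + T is shown; all names are illustrative.

    # Sketch: parallel value stack for the calculator SDD on input 2 + 3.
    val = []                            # parallel semantic stack

    def shift(attr):                    # push the token's attribute
        val.append(attr)                # (lexval for digit, None for '+')

    def reduce_add():                   # E -> E + T  { E.val = E1.val + T.val }
        t_val, _, e_val = val.pop(), val.pop(), val.pop()
        val.append(e_val + t_val)

    shift(2)          # digit '2': val = [2]   (copied up through F, T, E)
    shift(None)       # '+':       val = [2, None]
    shift(3)          # digit '3': val = [2, None, 3]
    reduce_add()      # val = [5] -- the attribute of E
    print(val)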

Top-Down Evaluation of S-Attributed Definitions

• Remember that in a recursive predictive parser, each non-terminal corresponds to a
  procedure.

  procedure A() {
      call B();                                           // A → B
  }
  procedure B() {
      if (currtoken == 0) { consume 0; call B(); }        // B → 0B
      else if (currtoken == 1) { consume 1; call B(); }   // B → 1B
      else if (currtoken == $) { }                        // B → ε  ($ is the end-marker)
      else error("unexpected token");
  }

L-Attributed Definitions
• A syntax-directed definition is L-attributed if each inherited attribute of Xj, where
  1 ≤ j ≤ n, on the right side of A → X1X2...Xn depends only on:
  1. the attributes of the symbols X1, ..., Xj-1 to the left of Xj in the production, and
  2. the inherited attributes of A.
• L-attributed definitions can always be evaluated by a depth-first visit of the parse
  tree – this means that they can also be evaluated during the parsing.

Algorithm: L-Eval(n: Node)
Input: node of an annotated parse tree.
Output: attribute evaluation.

Translation Schemes
• In a syntax-directed definition, we do not say anything about the evaluation times of
  the semantic rules (when should the semantic rules associated with a production be
  evaluated?).
• A translation scheme is a context-free grammar in which:
  – attributes are associated with the grammar symbols, and semantic actions enclosed
    between braces {} are inserted within the right sides of productions.

• Ex:  A → { ... } X { ... } Y { ... }

Semantic Actions
• Translation schemes indicate the order in which semantic rules and attributes are to be
  evaluated.

A Translation Scheme Example

[Example figure not reproduced.] The depth-first traversal of the parse tree (executing the
semantic actions in that order) will produce the postfix representation of the infix
expression.


Bonga University
Department of Computer Science
CoSc3112 –COMPILER DESIGN
Chapter 6 Handouts – Intermediate Code Generation, Code Optimization and Code
generation

Intermediate Code Generation

• Translating the source program into an "intermediate language":
  – simple,
  – CPU-independent,
  – yet close in spirit to machine language.
• Depending on the application, other intermediate languages may be used, but in general
  we opt for simple, well-structured intermediate forms.
• (And this completes the "front end" of compilation.)

Benefits
1. Retargeting is facilitated.
2. Machine-independent code optimization can be applied.

Intermediate Code

• Intermediate codes are machine-independent codes, but they are close to machine
  instructions.
• The given program in a source language is converted to an equivalent program in an
  intermediate language by the intermediate code generator.
• The intermediate language can be many different languages; the designer of the
  compiler decides this intermediate language.
  – Syntax trees can be used as an intermediate language.
  – Postfix notation can be used as an intermediate language.
  – Three-address code (quadruples) can be used as an intermediate language;
    we will use quadruples to discuss intermediate code generation.
  – Quadruples are close to machine instructions, but they are not actual machine
    instructions.
• Some programming languages have well-defined intermediate languages:
  – Java – the Java virtual machine
  – Prolog – the Warren abstract machine
  – In fact, there are byte-code emulators to execute instructions in these intermediate
    languages.
[Figure: source-to-target translation with and without an intermediate representation (IR).]

Types of Intermediate Languages

Postfix form

Example:
      a+b                 ab+
      (a+b)*c             ab+c*
      a+b*c               abc*+
      a := b*c+b*d        abc*bd*+:=

(+) simple and concise
(+) good for driving an interpreter
(–) not good for optimization or code generation

Three Address Code

• Statements of the general form x := y op z.

• No built-up arithmetic expressions are allowed.

• As a result, x := y + z * w should be represented as

      t1 := z * w
      t2 := y + t1
      x  := t2

• Observe that given the syntax tree or the DAG of the graphical representation we can
  easily derive three-address code for assignments as above.

• In fact, three-address code is a linearization of the tree.

• Three-address code is useful: related to machine language / simple / optimizable.

Example of 3-address code

• Consider the assignment a := b*-c + b*-c:

      t1 := -c
      t2 := b * t1
      t3 := -c
      t4 := b * t3
      t5 := t2 + t4
      a  := t5

Types of Three-Address Statements

Assignment statement:      x := y op z
Assignment statement:      x := op z
Copy statement:            x := z
Unconditional jump:        goto L
Conditional jump:          if x relop y goto L
Stack operations:          push / pop
Implementations of 3-address statements

• Quadruples

  A quadruple is:

        x := y op z

  where x, y and z are names, constants or compiler-generated temporaries; op is any
  operator.

  But we may also use the following notation for quadruples (a much better notation
  because it looks like a machine-code instruction):

        op y, z, x

  Apply operator op to y and z, and store the result in x.

  We use the term "three-address code" because each statement usually contains three
  addresses (two for the operands, one for the result).

  Example:

        t1 := -c                     op        arg1    arg2    result
        t2 := b * t1            (0)  uminus    c               t1
        t3 := -c                (1)  *         b       t1      t2
        t4 := b * t3            (2)  uminus    c               t3
        t5 := t2 + t4           (3)  *         b       t3      t4
        a  := t5                (4)  +         t2      t4      t5
                                (5)  :=        t5              a

  Temporary names must be entered into the symbol table as they are created.
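One straightforward way to hold this table in a compiler's data structures is sketched below; the field names simply mirror the four columns, and the Quad class is illustrative rather than a fixed interface.

    # Sketch: a quadruple record and the table above as data.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Quad:
        op: str
        arg1: Optional[str]
        arg2: Optional[str]
        result: Optional[str]

    quads = [
        Quad('uminus', 'c',  None, 't1'),   # (0)
        Quad('*',      'b',  't1', 't2'),   # (1)
        Quad('uminus', 'c',  None, 't3'),   # (2)
        Quad('*',      'b',  't3', 't4'),   # (3)
        Quad('+',      't2', 't4', 't5'),   # (4)
        Quad(':=',     't5', None, 'a'),    # (5)
    ]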

Three-Address Statements

Binary Operator:

op y,z,result or result := y op z

where op is a binary arithmetic or logical operator. This binary operator is applied to y and z, and the
result of the operation is stored in result.

Ex: add a,b,c


gt a,b,c



addr a,b,c
addi a,b,c
Unary Operator:

op y,result or result := op y

where op is a unary arithmetic or logical operator. This unary operator is applied to y, and the result of
the operation is stored in result.

Ex: uminus a,c


not a,c
inttoreal a,c
Indexed Assignments:

move y[i],,x or x := y[i]


move x,,y[i] or y[i] := x

Address and Pointer Assignments:

moveaddr y,,x or x := &y


movecont y,,x or x := *y
Other types of 3-address statements

e.g. ternary operations like x[i] := y and x := y[i] require two or more entries.

• Triples

A triple has only three fields, which we call op, arg1, and arg2. Note that the result field
of a quadruple is used primarily for temporary names. Using triples, we refer to the result
of an operation x op y by its position, rather than by an explicit temporary name. Thus,
instead of the temporary t1 above, a triple representation would refer to position (0).
Parenthesized numbers represent pointers into the triple structure itself. Such positions,
or pointers to positions, were called value numbers.
      t1 := -c                     op        arg1    arg2
      t2 := b * t1            (0)  uminus    c
      t3 := -c                (1)  *         b       (0)
      t4 := b * t3            (2)  uminus    c
      t5 := t2 + t4           (3)  *         b       (2)
      a  := t5                (4)  +         (1)     (3)
                              (5)  assign    a       (4)

Temporary names are not entered into the symbol table.

• Indirect Triples

Indirect triples consist of a listing of pointers to triples, rather than a listing of the
triples themselves. For example, we can use an array instruction to list pointers to triples
in the desired order. With indirect triples, an optimizing compiler can move an instruction
by reordering the instruction list, without affecting the triples themselves.



Code Optimization: The Idea

> Transform the program to improve efficiency


> Performance: faster execution
> Size: smaller executable, smaller memory footprint

Tradeoffs:
1) Performance vs. Size
2) Compilation speed and memory
> There is no perfect optimizer

> Example: optimize for simplicity

Optimization on many levels

> Optimizations both in the optimizer and back-end

Optimizations in the Backend

> Register Allocation

> Instruction Selection

> Peep-hole Optimization

Register Allocation

> Processor has only finite amount of registers

— Can be very small (x86)

— Temporary variables

— non-overlapping temporaries can share one register

> Passing arguments via registers

> Optimizing register allocation very important for good performance



— Especially on x86

Instruction Selection

> For every expression, there are many ways to realize them for a processor

> Example: multiplication by 2 can be done by a bit-shift

Instruction selection is a form of optimization

Peephole Optimization

> Simple local optimization

> Look at code “through a hole”

— replace sequences by known shorter ones

— table pre-computed

Important when using simple instruction selection!

Examples for Optimizations

> Constant Folding / Propagation

> Copy Propagation

> Algebraic Simplifications

> Strength Reduction

> Dead Code Elimination

— Structure Simplifications

> Loop Optimizations

> Partial Redundancy Elimination

> Code Inlining

Constant Folding



> Evaluate constant expressions at compile time

> Only possible when side-effect freeness guaranteed
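A minimal sketch of folding on a tiny expression tree, where a node is either a leaf (number or variable name) or an (op, left, right) tuple; the operators used here are side-effect free, which is what makes evaluation at compile time safe. All names are illustrative.

    # Sketch: constant folding on a tiny expression tree.
    def fold(node):
        if not isinstance(node, tuple):
            return node                              # leaf: number or name
        op, l, r = node[0], fold(node[1]), fold(node[2])
        if isinstance(l, (int, float)) and isinstance(r, (int, float)):
            return {'+': l + r, '-': l - r, '*': l * r}[op]   # evaluate now
        return (op, l, r)                            # keep the residual tree

    print(fold(('+', ('*', 3, 4), 'y')))             # ('+', 12, 'y')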

Constant Propagation

> Variables that have a constant value, e.g. c := 3

— Later uses of c can be replaced by the constant

— If there is no change of c in between!

Analysis is needed, as c can be assigned more than once!

Copy Propagation

> for a statement x := y

> Replace later uses of x with y, if x and y have not been changed.

Analysis needed, as y and x can be assigned more than once!

Algebraic Simplifications

> Use algebraic properties to simplify expressions



Important to simplify code for later optimizations

Strength Reduction

> Replace expensive operations with simpler ones

> Example: Multiplications replaced by additions

Peephole optimizations are often strength reductions

Dead Code

> Remove unnecessary code

— e.g. variables assigned but never read

> Remove code never reached

Loop Optimizations

> Optimizing code in loops is important

— often executed, large payoff

— All optimizations help when applied to loop-bodies

> Some optimizations are loop specific

Advanced Optimizations

> Optimizing for using multiple processors

— Auto parallelization

— Very active area of research (again)

> Inter-procedural optimizations



— Global view, not just one procedure

— Profile-guided optimization

> Vectorization

> Dynamic optimization

— Used in virtual machines (both hardware and language VM)

Iterative Process

> There is no general “right” order of optimizations

> One optimization generates new opportunities for a preceding one.

> Optimization is an iterative process

Code generation
 The final phase of a compiler is code generation
 It receives an intermediate representation (IR) with supplementary information in symbol table
 Produces a semantically equivalent target program
 Code generator main tasks:
o Instruction selection
o Register allocation and assignment
o Instruction ordering

Issues in the Design of Code Generator

⚫ The most important criterion is that it produces correct code

⚫ Input to the code generator

⚫ IR + Symbol table

⚫ We assume front end produces low-level IR, i.e. values of names in it can be directly
manipulated by the machine instructions.

⚫ Syntactic and semantic errors have been already detected

⚫ The target program

⚫ Common target architectures are: RISC, CISC and Stack based machines



⚫ In this chapter we use a very simple RISC-like computer with addition of some CISC-like
addressing modes

Instruction Selection

The code generator must map the IR program into a code sequence that can be executed by
the target machine. The complexity of performing this mapping is determined by factors
such as
 the level of the IR
 the nature of the instruction-set architecture
 the desired quality of the generated code.

For example, every three-address statement of the form x = y + z, where x, y, and z are statically
allocated, can be translated into the code sequence

LD  R0, y         // R0 = y   (load y into register R0)
ADD R0, R0, z     // R0 = R0 + z   (add z to R0)
ST  x, R0         // x = R0   (store R0 into x)

This strategy often produces redundant loads and stores. For example, the sequence of
three-address statements
      a = b + c
      d = a + e
would be translated into

      LD  R0, b         // R0 = b
      ADD R0, R0, c     // R0 = R0 + c
      ST  a, R0         // a = R0
      LD  R0, a         // R0 = a
      ADD R0, R0, e     // R0 = R0 + e
      ST  d, R0         // d = R0

Here, the fourth statement is redundant since it loads a value that has just been stored,
and so is the third if a is not subsequently used.

The quality of the generated code is usually determined by its speed and size. On most
machines, a given IR program can be implemented by many different code sequences, with
significant cost differences between the different implementations.

Register Allocation

A key problem in code generation is deciding what values to hold in what registers.
Registers are the fastest computational unit on the target machine, but we usually do not
have enough of them to hold all values. Values not held in registers need to reside in
memory. Instructions involving register operands are invariably shorter and faster than
those involving operands in memory, so efficient utilization of registers is particularly
important.



The use of registers is often subdivided into two subproblems:

1. Register allocation, during which we select the set of variables that will reside in
   registers at each point in the program.
2. Register assignment, during which we pick the specific register that a variable will
   reside in.

In addition, the problem is complicated by restrictions imposed by the hardware
architecture.

A simple target machine model

⚫ Load operations: LD r,x and LD r1, r2


⚫ Store operations: ST x,r
⚫ Computation operations: OP dst, src1, src2
⚫ Unconditional jumps: BR L
⚫ Conditional jumps: Bcond r, L like BLTZ r, L

A Simple Code Generator

In this section, we shall consider an algorithm that generates code for a single basic block. It
considers each three-address instruction in turn, and keeps track of what values are in what
registers so it can avoid generating unnecessary loads and stores.

One of the primary issues during code generation is deciding how to use registers to best
advantage. There are four principal uses of registers:

• In most machine architectures, some or all of the operands of an operation must be in
  registers in order to perform the operation.
• Registers make good temporaries – places to hold the result of a subexpression while a
  larger expression is being evaluated, or more generally, a place to hold a variable that
  is used only within a single basic block.
• Registers are used to hold (global) values that are computed in one basic block and
  used in other blocks, for example, a loop index that is incremented going around the
  loop and is used several times within the loop.
• Registers are often used to help with run-time storage management, for example, to
  manage the run-time stack, including the maintenance of stack pointers.

