
CSC 421:

Compiler Construction
Course Objectives
• By the end of the course unit, the student
should be able to:
1. Appreciate the basic concepts of compiler
construction.
2. Construct compilers.
Course Contents
• WEEK 1
– Overview of the Compilation Process
– The phases of compilation
– Review of necessary concepts from programming languages
• WEEK 2
– Lexical Analysis
– The role of lexical analyzers
– Regular expressions
– Conversion of regular expressions to finite automata and lexical
analyzers
– The use of Lex in developing lexical analyzers under Unix
• WEEK 3 to 7
– Parsing
– Basic bottom-up and top-down parsing techniques
– LR, SLR, and LALR parsing
– YACC under Unix
– Other parser generating schemes
• WEEK 8 to 9
– Syntax Directed Translation
– Use of attributes in translation
– Syntax directed translation schemes for the common constructs of
programming languages
– Intermediate code representations
• WEEK 10
– Supporting Considerations
– Symbol table management
– Run time support
– Error detection and recovery techniques
• WEEK 11
– Optimization and Code Generation
– Brief introduction to code generation issues
• WEEK 12
– Reviews and Examinations
Course Info
• Meeting time:
Monday 4:00-7:00 p.m.

• Meeting Room: N6
• Prerequisites:
• Automata and Languages
• Computer Programming
• Computer Architecture
• Assembly Language Programming
Textbooks
• Primary Textbooks:
– Compilers: Principles, Techniques and Tools by Aho, Sethi, and
Ullman; Addison-Wesley Pub Co, ISBN: 0201100886
– Compiler Design by Santanu Chattopadhyay; Prentice Hall of India
Private Limited, ISBN: 81-203-2725-X
• Recommended Textbooks:
– The Theory And Practice Of Compiler Writing by Jean-Paul
Tremblay and Paul G. Sorenson;
– Systems Software: An Introduction to Systems Programming
by Leland L Beck; Addison-Wesley Pub Co, ISBN: 0201423006
– Constructing Language Processors for Little Languages by
Randy M. Kaplan; John Wiley & Sons, ISBN: 0471597546
Lecturer Info
• Name: Dr. Shikali
• Office: Northern
• Phone: 0720-832863
• E-mail: [email protected]
• Office Hours:
• Mondays: 10:00 - 1:00,
• Tuesday: 11:00 – 1:30,
• Wednesday: 11:00 – 1:30, and
• By appointment.
Delivery & Grading

• Delivery: Lectures

• Evaluation
– Continuous Assessment - 10%
– Written assignments & Projects - 20%
– Final Examination - 70%
– Total - 100%
Projects
• Basically, one big project in 5 parts.
• You must work in small groups of 2-3 students
per group. Hand in only one written set of
answers per group.
Projects (cont’d)
• Project 1: Lexical Analysis (Scanner)
• Project 2: Syntax Analysis (Parser)
• Project 3: Semantic Analysis (Compile-time error handling)
• Project 4: Intermediate Code Generation
• Project 5: Target Code Generation
Why take this course?
• Compilers draw together all of the theory and
techniques that you’ve learned about in most of your
previous computer science courses.
• We will focus on “little languages” - you will be
writing simple compilers to solve the kinds of
problems you may face in a career as a programmer.
• You will gain a deeper understanding of how
compilers work, and be able to write better code.
• You will learn to write other useful tools, such as
parsers, interpreters, and debuggers.
Programming Languages
• Humans use natural languages to
communicate with each other
– Kiswahili, English, French, etc.
• Humans use programming languages to
communicate with computers
– Perl, Pascal, C++, …
The translation process
1. The sequence of characters of a source text is translated
into a corresponding sequence of symbols of the
vocabulary of the language. For instance, identifiers
consisting of letters and digits, numbers consisting of
digits, delimiters and operators consisting of special
characters are recognized in this phase, which is called
lexical analysis.
2. The sequence of symbols is transformed into a
representation that directly mirrors the syntactic structure
of the source text and lets this structure easily be
recognized. This phase is called syntax analysis
(parsing).
The translation process
3. High-level languages are characterized by the fact that
objects of programs, for example variables and functions,
are classified according to their type. Therefore, in
addition to syntactic rules, compatibility rules among
types of operators and operands define the language.
Hence, verification of whether these compatibility rules
are observed by a program is an additional duty of a
compiler. This verification is called type
checking/Semantic analysis.
4. On the basis of the representation resulting from step 2, a
sequence of instructions taken from the instruction set of
the target computer is generated. This phase is called code
generation. In general it is the most involved part, not
least because the instruction sets of many computers lack
the desirable regularity. Often, the code generation part is
therefore subdivided further.
Language and Syntax
• Every language displays a structure called
its grammar or syntax.
• For example, a correct sentence always
consists of a subject followed by a
predicate, correct here meaning well
formed.
• This fact can be described by the following
formula:
sentence = subject predicate.
Cont..
• If we add to this formula the two further formulas
subject = "John" | "Mary".
predicate = "eats" | "talks".
• then we define herewith exactly four possible
sentences, namely
John eats    Mary eats
John talks   Mary talks
• where the symbol | is to be pronounced as or. We
call these formulas syntax rules, productions, or
simply syntactic equations.
Cont..
• Subject and predicate are syntactic classes. A
shorter notation for the above omits meaningful
identifiers:
S = A B.
A = "a" | "b".
B = "c" | "d".
L = {ac, ad, bc, bd}
• The set L of sentences which can be generated in
this way, that is, by repeated substitution of the
left hand sides by the right-hand sides of the
equations, is called the language.
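For example, the sentence ac is obtained by the
derivation
S ⇒ A B ⇒ "a" B ⇒ "a" "c" = ac
and the other three sentences of L follow by
choosing "b" and/or "d" instead.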
A language is defined by the
following:
1. The set of terminal symbols. These are the symbols that
occur in its sentences. They are said to be terminal,
because they cannot be substituted by any other symbols.
The substitution process stops with terminal symbols. In
our first example this set consists of the elements a, b, c
and d. The set is also called vocabulary.
2. The set of nonterminal symbols. They denote syntactic
classes and can be substituted. In our first example this set
consists of the elements S, A and B.
3. The set of syntactic equations (also called productions).
These define the possible substitutions of nonterminal
symbols. An equation is specified for each nonterminal
symbol.
4. The start symbol. It is a nonterminal symbol, in the
examples above denoted by S.
Computer Organization
• Applications
• Translator
• Operating System
• Hardware Machine
What is a compiler?
• A computer program is a set of instructions that the
computer can understand and execute.
• In reality computers don’t understand the instructions, they
simply process data
• Computer languages need to be unambiguous and have an
exactly defined syntax and semantics (unlike human
languages)
• High level programming languages have been developed
for human convenience and readability
• A compiler is a program that reads the high level input
program and translates the high level language into
machine code.
Compiler

Program text → Compiler → Machine code
                  ↓
                Errors
What is a language?
• Major elements of a language
– Syntax – determines what phrases there are in the language
– Semantics – determines what a phrase means
– Pragmatics – how the language is used
• Language has two parts
– “Words” of the language, or tokens (e.g. “if”, “{“)
– “Phrases” of the language (e.g. if (x<y) then x++;)
– Words and phrases define language syntax
• Tokens themselves may be specified using regular
expressions
• The language structure can be defined using context free
grammars (see later)
What is a compiler? Contd.
Compilers are large, complicated programs that can only convert
programs that conform to the syntax and semantic rules for a
particular language.
• Analysis - Front End
» Lexical Analysis
» Syntax Analysis
» Semantic Analysis
1. Recognises legal programs
2. Reports errors
3. Produces Intermediate Language, preliminary storage map
4. Shapes code for back end
• Synthesis - Back End
» Intermediate Code Generation
» Code Optimization
» Code Generation
1. Translates the intermediate language into target machine code
2. Chooses the instructions required for each IL operation
3. Decides what information to keep in the processor registers
4. Ensures that the resulting program uses the target system interfaces correctly
How are languages
implemented?
• Various strategies depend on how much pre-
processing is done before a program can be
run, and how CPU-specific the program is.
• Interpreters run a program “as is” with little or
no pre-processing, but no changes need to be
made to run on a different platform.
• Compilers do extensive pre-processing, but
will run a program 2 to 20 times faster.
Compilation Process

Source Program → Compiler → Object Program
Object Program + Data → Executing Computer → Result

• The source program and its data are processed at
different times: compile time and runtime, respectively.
Interpretative Process

Source Program + Data → Interpreter → Result

• Processes an internal form of the source program
and data at the same time. Interpretation of the
internal source form occurs at runtime, and no
object program is generated.
Language implementations
cont’d
• Some newer languages use a combination of
compiler and interpreter to get many of the benefits
of each.
• Examples are Java and Microsoft’s .NET, which
compile into a virtual assembly language (while
being optimized), which can then be interpreted on
any computer.
• Some languages (such as Basic or Lisp) have both
compilers and interpreters written for them.
• Recently, “Just-in-Time” compilers are becoming
more common - compile code only when it’s used!
History of compiler
development
1953: IBM develops the 701 EDPM (Electronic Data
Processing Machine), the first general-purpose
computer, built as a “defense calculator” in the
Korean War.

No high-level languages were available, so all
programming was done in assembly.
History of compilers (cont’d)
As expensive as these early computers were, most of the
money companies spent was for software development,
due to the complexities of assembly.

In 1953, John Backus came up with the idea of “speed
coding”, and developed the first interpreter.
Unfortunately, this was 10-20 times slower than
programs written in assembly.

He was sure he could do better.
History of compilers (cont’d)
In 1954, Backus and his team released a research paper
titled “Preliminary Report, Specifications for the IBM
Mathematical FORmula TRANslating System,
FORTRAN.”
The initial release of FORTRAN I was in 1956, totaling
25,000 lines of assembly code. Compiled programs ran
almost as fast as handwritten assembly!
Projects that had taken two weeks to write now took
only 2 hours. By 1958 more than half of all software
was written in FORTRAN.
Modern Compilers
Compilers have not changed a great deal since the days
of Backus. They still consist of two main components:

• The front-end reads in the program in the source
language, makes sense of it, and stores it in an internal
representation…

• …and the back-end takes the internal representation and
converts it into the target language, perhaps with some
optimizations. The target language is typically
assembly, but it is often easier to use an established,
higher-level language.
A Compiler Model

Analysis:
Source Program → Scanner → Parser → Semantic Analyzer
→ Intermediate Form

Synthesis:
Intermediate Form → Initial Code Generator → Code Generator
→ Object Code

All phases consult the Symbol Table, and emit analysis and
error diagnostics / error messages.
Structure of a Compiler

Source Language
↓
Front End: Lexical Analyzer → Syntax Analyzer →
Semantic Analyzer → Int. Code Generator
↓
Intermediate Code
↓
Back End: Code Optimizer → Target Code Generator
↓
Target Language

(Errors and warnings are reported along the way.)
Example Compilation

Source Code:
cur_time = start_time + cycles * 60

Lexical Analysis:
ID(1) ASSIGN ID(2) ADD ID(3) MULT INT(60)

Syntax Analysis:
ASSIGN
  ID(1)
  ADD
    ID(2)
    MULT
      ID(3)
      INT(60)

Semantic Analysis (an int2real conversion is inserted
over INT(60)):
ASSIGN
  ID(1)
  ADD
    ID(2)
    MULT
      ID(3)
      int2real
        INT(60)

Intermediate Code:
temp1 = int2real(60)
temp2 = id3 * temp1
temp3 = id2 + temp2
id1 = temp3

Optimized Code (step 1 - the conversion of the
constant is folded):
temp1 = 60.0
temp2 = id3 * temp1
temp3 = id2 + temp2
id1 = temp3

Optimized Code (step 2 - the constant is propagated
and temp1 eliminated):
temp2 = id3 * 60.0
temp3 = id2 + temp2
id1 = temp3

Optimized Code (step 3 - the copy through temp3 is
eliminated):
temp2 = id3 * 60.0
id1 = id2 + temp2

Optimized Code (final, with temporaries renumbered):
temp1 = id3 * 60.0
id1 = id2 + temp1

Target Code:
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
Example II
Refer to Section 1.5 of Santanu Chattopadhyay’s
Compiler Design.
Lexical Analysis

Structure of a Compiler (recap):
Source Language → Front End (Lexical Analyzer →
Syntax Analyzer → Semantic Analyzer → Int. Code
Generator) → Intermediate Code → Back End (Code
Optimizer → Target Code Generator) → Target Language

Today: the Lexical Analyzer.
What exactly is lexing?
Consider the code:
if (i==j);
z=1;
else;
z=0;
endif;

This is really nothing more than a string of characters:

i f _ ( i = = j ) ; \n\tz = 1 ; \ne l s e ; \n\tz = 0 ; \ne n d i f ;

During our lexical analysis phase we must divide this
string into meaningful sub-strings.
Tokens
The output of our lexical analysis phase is a
streams of tokens.
A token is a syntactic category.
In English this would be types of words or
punctuation, such as a “noun”, “verb”,
“adjective” or “end-mark”.
In a program, this could be an identifier, a floating-point
number, a math symbol or a keyword.
Identifying Tokens
A sub-string that represents an instance of a token
is called a lexeme.
The class of all possible lexemes for a token is
described by the use of a pattern.
For example, the pattern to describe an identifier
(a variable) is a string of letters, numbers, or
underscores, beginning with a non-number.
Patterns are typically described using regular
expressions.
Implementation
A lexical analyzer must be able to do three things:
1. Remove all whitespace and comments.
2. Identify tokens within a string.
3. Return the lexeme of a found token, as
well as the line number it was found on.

How do we go about implementing this? (See the
sketch below.)
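As a first cut, here is a minimal sketch in C of a scanner loop that
does exactly these three things. It is an illustration only: the token
classes (T_ID, T_NUMBER, T_PUNCT) and the use of # for comments are
assumptions, not part of the course language.

#include <ctype.h>
#include <stdio.h>

/* Minimal scanner sketch (illustrative): skips whitespace and
 * '#'-to-end-of-line comments, then returns one token per call,
 * with its lexeme and the line number it was found on. */

typedef enum { T_ID, T_NUMBER, T_PUNCT, T_EOF } TokenType;

typedef struct {
    TokenType type;
    char lexeme[64];
    int line;
} Token;

static int line_no = 1;

Token next_token(FILE *in) {
    Token t = { T_EOF, "", 0 };
    int c, n = 0;

    /* 1. Remove all whitespace and comments. */
    for (;;) {
        c = fgetc(in);
        if (c == '#')                        /* comment: skip to line end */
            while ((c = fgetc(in)) != '\n' && c != EOF)
                ;
        if (c == '\n') { line_no++; continue; }
        if (c == ' ' || c == '\t' || c == '\r') continue;
        break;                               /* c begins a token, or is EOF */
    }

    t.line = line_no;
    if (c == EOF) return t;

    /* 2. Identify the token that the character c begins. */
    if (isalpha(c) || c == '_') {            /* identifier */
        t.type = T_ID;
        do { t.lexeme[n++] = (char)c; c = fgetc(in); }
        while (n < 63 && (isalnum(c) || c == '_'));
        ungetc(c, in);                       /* one character of lookahead */
    } else if (isdigit(c)) {                 /* integer */
        t.type = T_NUMBER;
        do { t.lexeme[n++] = (char)c; c = fgetc(in); }
        while (n < 63 && isdigit(c));
        ungetc(c, in);
    } else {                                 /* anything else: punctuation */
        t.type = T_PUNCT;
        t.lexeme[n++] = (char)c;
    }

    /* 3. Return the lexeme and the line number. */
    t.lexeme[n] = '\0';
    return t;
}

int main(void) {
    Token t;
    while ((t = next_token(stdin)).type != T_EOF)
        printf("line %d: type %d, lexeme \"%s\"\n", t.line, t.type, t.lexeme);
    return 0;
}

A scanner for the course language would additionally distinguish
keywords and multi-character operators, as in the example that follows.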
Example
i f _ ( i = = j ) ; \n\tz = 1 ; \ne l s e ; \n\tz = 0 ; \ne n d i f ;

Line-number - Token - Lexeme
1 COM_BLOCK if
1 OPEN (
1 ID i
1 OP_RELATION ==
1 CLOSE )
1 ENDLINE ;
2 ID z
2 OP_ASSIGN =
2 NUMBER 1
Etc…
Lookahead
Lookahead will typically be important to a
lexical analyzer.
Tokens are typically read in from left-to-right,
recognized one at a time from the input string.
It is not always possible to instantly decide if a
token is finished without looking ahead at the
next character. For example…
Is “i” a variable, or the first character of “if”?
Is “=” an assignment or the beginning of “==”?
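One way to resolve the “=” vs “==” case is to read one extra
character and push it back if it does not extend the token. A
sketch in C (the token names are illustrative):

#include <stdio.h>

/* Called after an initial '=' has been read: look one character
 * ahead, and push it back if it is not part of this token. */
const char *scan_equals(FILE *in) {
    int c = fgetc(in);
    if (c == '=')
        return "OP_RELATION";   /* the token was "==" */
    ungetc(c, in);              /* not ours: undo the lookahead */
    return "OP_ASSIGN";         /* the token was "=" */
}

int main(void) {
    /* reads the character that follows an initial '=' from stdin */
    printf("%s\n", scan_equals(stdin));
    return 0;
}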
Lookahead example
Some languages require more lookahead than
others. Example: Fortran removes all whitespace
before processing and cannot get clues from it.

DO 5 I = 1.25
Here, “DO5I” is a variable!

DO 5 I = 1,25
Here, “DO” is a keyword!
Uglier Lookahead example
PL/I is another example of a difficult-to-lex
language because it allows identifiers to be the
same as keywords. Consider this legal statement:

IF THEN THEN THEN = ELSE;
ELSE ELSE = THEN;

ELSE and THEN were both previously defined as
variables.
Details…
How much lookahead will we need? And how
do we figure out ambiguities?
If we see the characters “intact”, how do we
know if we are declaring an integer called “act”
(“int act”) or making use of a previously defined
identifier by the name “intact”?
We need specific rules that will ensure we never
have more than one possible answer, ideally with
only one character of lookahead.
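A common such rule (the one Lex uses, and an assumption for our
language) is “longest match”: keep consuming characters while they
can still extend the current lexeme, and only afterwards decide
what the lexeme is. A sketch of the final classification step in C:

#include <string.h>
#include <stdio.h>

/* Longest-match classification: the scanner first consumes the
 * whole identifier-shaped lexeme, then checks it against the
 * keyword list. Under this rule "intact" is always one token,
 * never "int" followed by "act". The keyword list is illustrative. */
const char *classify(const char *lexeme) {
    static const char *keywords[] = { "if", "else", "while", "int" };
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(lexeme, keywords[i]) == 0)
            return "KEYWORD";
    return "ID";
}

int main(void) {
    printf("%s %s\n", classify("int"), classify("intact")); /* KEYWORD ID */
    return 0;
}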
To do…
Introduction to Lexical Analysis
Lexing examples
Regular expressions
Project 1 overview
Review of our language
Finite automata (Refer to what was taught…)
Languages
An alphabet is a well-defined set of characters.
The symbol Σ is typically used to represent an
alphabet.

A language over Σ is a set of strings made up of
characters drawn from Σ.

Examples:
Alphabet: A-Z      Language: English
Alphabet: ASCII    Language: C++
Regular Expressions
Each regular expression is a notation for a
regular language (a well-defined set of possible
words).
If A is a regular expression, then L(A) is the
language defined by that regular expression.
L(“c”) is the language with the single word “c”.
Concatenation:
L(AB) = { ab | a ∈ L(A) and b ∈ L(B) }
L(“i” “f”) is the language with just “if” in it.
Regular Expressions (cont’d)
Union:
L(A | B) = { s | s ∈ L(A) or s ∈ L(B) }

L(“if” | “then” | “else”) is the language with just
the words “if”, “then”, and “else”.

L((“0” | “1”)(“0” | “1”)) is the language
consisting of “00”, “01”, “10” and “11”.
Regular Expressions (cont’d)
There are several special symbols:

“+” indicates one or more repeats can be used.
L(A+) = { L(A) or L(AA) or L(AAA) or … }

ε is the empty string.

“*” indicates zero or more repeats can be used.
L(A*) = { ε or L(A+) }
Defining our language…
The first thing we can define in our language are
keywords. These are easy:

L(“if” | “else” | “while” | “find” | …)

When we scan a file, we can either have a single
token represent all keywords, or else break them
down into groups, such as “commands”, “types”,
etc.
Language def cont’d
Next we will define integers in our language:

digit = “0” | “1” | “2” | “3” | “4” | “5” | “6” | “7” | “8” | “9”
integer = digit+

Note that we can abbreviate ranges using the dash (“-”).
Thus, digit = 0-9

Float is not much more complicated:

float = digit+ “.” digit+
Language def cont’d
Identifiers are strings of letters, underscores, or digits
beginning with a non-digit. Identifiers are used for
names the programmer comes up with, such as variable
or function names.

letter = a-z | A-Z
identifier = (letter | “_”)(letter | “_” | digit)*

Note that in most languages (including ours) keywords
are reserved and cannot be used as identifiers.
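These patterns carry over directly to POSIX extended regular
expression syntax. A small sketch in C (assuming a Unix system with
the standard <regex.h> interface, in the same spirit as the Lex tool
from Week 2); the ^ and $ anchors force the whole string, not just
part of it, to match:

#include <regex.h>
#include <stdio.h>

/* The token patterns defined above, in POSIX extended regex syntax. */
static const char *integer_re    = "^[0-9]+$";
static const char *float_re      = "^[0-9]+\\.[0-9]+$";
static const char *identifier_re = "^[A-Za-z_][A-Za-z0-9_]*$";

static int matches(const char *pattern, const char *s) {
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;                    /* pattern failed to compile */
    int ok = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}

int main(void) {
    const char *samples[] = { "42", "3.14", "cur_time", "2bad" };
    for (int i = 0; i < 4; i++)
        printf("%-8s integer=%d float=%d identifier=%d\n", samples[i],
               matches(integer_re, samples[i]),
               matches(float_re, samples[i]),
               matches(identifier_re, samples[i]));
    return 0;
}

Note that “2bad” matches none of the three patterns (it begins with a
digit); a real scanner would also check identifier-shaped lexemes
against the reserved keyword list.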
Real-world example
What is the regular expression that defines all phone
numbers?

Σ = { 0-9, (, ), - }

area = digit digit digit
exchange = digit digit digit
local = digit digit digit digit

phone_number = “(” area “)” exchange “-” local

How many strings are defined by L(phone_number)?
Problems for you
What is the regular expression that defines all e-mail
addresses?

Describe what binary strings are defined by the
following languages:
0 (0 | 1)* 0
(0 | 1)* 0 (0 | 1)*
(ε | 0) 1*
( (ε | 0) 1* )*
Lexical Analysis
Identifiers and integers are recognized
directly as single tokens:
e.g.
<ident> ::= <letter> | <ident> <letter> | <ident> <digit>
<letter> ::= A | B | C | D | … | Z
<digit> ::= 0 | 1 | 2 | 3 | … | 9

Without a scanner, the parser would have to interpret a
sequence of such characters as the language construct
<ident>.
Lexical Analysis
Since a large part of a source program consists
of such multiple-character tokens, recognizing
them in the scanner saves compilation time.
The scanner recognizes both single- and
multiple-character tokens directly.
The output of a scanner is a sequence of
tokens.
Example of a PASCAL program
PROGRAM STATS
VAR
SUM, SUMSQ, I, VALUE, MEAN, VARIANCE: INTEGER
BEGIN
SUM:=0;
SUMSQ:=0;
FOR I:=1 TO 100 DO
BEGIN
READ (VALUE);
SUM:=SUM+VALUE;
SUMSQ:=SUMSQ+VALUE*VALUE
END;
MEAN:=SUM DIV 100;
VARIANCE:=SUMSQ DIV 100 - MEAN*MEAN;
WRITE (MEAN, VARIANCE)
END.
BNF

A BNF grammar contains a set of rules that
defines the syntax of some construct in the
programming language.

::= means “is defined as” / “is (to be)”
< > enclose non-terminal symbols
(constructs defined in the grammar)
Symbols without angle brackets are terminal symbols
SIMPLIFIED PASCAL GRAMMAR
1. <program> ::= PROGRAM <prog-name> VAR <dec-list> BEGIN <stmt-
list> END.
2. <prog-name>::= id
3. <dec-list> ::= <dec> | <dec-list>; <dec>
4. <dec> ::= <id-list> : <type>
5. <type> ::= INTEGER
6. <id-list> ::= id | <id-list> , id
7. <stmt-list> ::= <stmt> | <stmt-list> ; <stmt>
8. <stmt> ::= <assign> | <read> | <write> | <for>
9. <assign> ::= id := <exp>
10. <exp> ::= <term> | <exp> + <term> | <exp> - <term>
11. <term> ::= <factor> | <term> * <factor> | <term> DIV <factor>
12. <factor> ::= id | int | ( <exp> )
13. <read> ::= READ ( <id-list> )
14. <write> ::= WRITE ( <id-list> )
15. <for> ::= FOR <index-exp> DO <body>
16. <index-exp> ::= id := <exp> TO <exp>
17. <body> ::= <stmt> | BEGIN <stmt-list> END
Lexical Analysis
• The scanner reads a character stream and converts it into
a token stream
• White space is ignored (space, tab, return, formfeed)
• Comments are ignored
• Tokens are the basic entities of the language (identifiers,
numbers, operators, keywords, punctuation symbols etc.)
• The character string associated with a token is called its
lexeme
• The scanner may produce error messages (strings that
don't match any known token)
• The scanner may store information in the symbol table
Token Coding Scheme for the above grammar
Token       Code
PROGRAM 1
VAR 2
BEGIN 3
END 4
END. 5
INTEGER 6
FOR 7
READ 8
WRITE 9
TO 10
DO 11
; 12
: 13
, 14
:= 15
+ 16
- 17
* 18
DIV 19
( 20
) 21
Id 22
int 23
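In code, such a coding scheme is just an enumeration. A sketch in C
mirroring the table (the TK_ names are illustrative; the explicit
values keep the codes identical to the table):

/* Token codes from the table above; explicit values match the table. */
typedef enum {
    TK_PROGRAM = 1, TK_VAR = 2,    TK_BEGIN = 3,  TK_END = 4,    TK_ENDDOT = 5,
    TK_INTEGER = 6, TK_FOR = 7,    TK_READ = 8,   TK_WRITE = 9,  TK_TO = 10,
    TK_DO = 11,     TK_SEMI = 12,  TK_COLON = 13, TK_COMMA = 14, TK_ASSIGN = 15,
    TK_PLUS = 16,   TK_MINUS = 17, TK_STAR = 18,  TK_DIV = 19,
    TK_LPAREN = 20, TK_RPAREN = 21, TK_ID = 22,   TK_INT = 23
} TokenCode;

With this scheme, the scanner would emit the statement SUM:=0; from
the STATS program as the code sequence 22 15 23 12 (id, :=, int, ;).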
Syntactic Analysis
• Syntax refers to the structure (or grammar) of the language
(layout, statements, blocks etc.)
• The parser groups tokens into grammatical phrases
corresponding to the structure of the language
• Syntactic errors are things like "missing ;"
Example of a grammar for arithmetic expressions:
<exp> -> <exp> + <term> | <exp> - <term> | <term>
<term> -> <term> * <factor> | <term> / <factor> | <factor>
<factor> -> ( <exp> ) | id | num

tokens: id, num, (, ), +, -, *, /
structures: <exp>, <term>, <factor>
-> means “is composed of”; | means “or”
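A parser for this grammar can be sketched by recursive descent. The
left-recursive rules (<exp> -> <exp> + <term>, …) cannot be coded as
direct recursion, so the sketch below rewrites them as loops, which
preserves left associativity. Its simplifications are assumptions for
illustration: the “tokens” are single characters of a string, num is a
single digit, and the id alternative is omitted.

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Recursive-descent sketch for the expression grammar above.
 * Left recursion (<exp> -> <exp> + <term>) is rewritten as a loop. */

static const char *p;                 /* next unread character (token) */

static double parse_exp(void);        /* forward declaration */

static double parse_factor(void) {    /* <factor> -> ( <exp> ) | num */
    if (*p == '(') {
        p++;
        double v = parse_exp();
        if (*p != ')') { fprintf(stderr, "missing )\n"); exit(1); }
        p++;
        return v;
    }
    if (isdigit((unsigned char)*p))
        return (double)(*p++ - '0');
    fprintf(stderr, "unexpected '%c'\n", *p);
    exit(1);
}

static double parse_term(void) {      /* <term> -> <factor> { */ or / <factor> } */
    double v = parse_factor();
    while (*p == '*' || *p == '/') {
        char op = *p++;
        double r = parse_factor();
        v = (op == '*') ? v * r : v / r;
    }
    return v;
}

static double parse_exp(void) {       /* <exp> -> <term> { + or - <term> } */
    double v = parse_term();
    while (*p == '+' || *p == '-') {
        char op = *p++;
        double r = parse_term();
        v = (op == '+') ? v + r : v - r;
    }
    return v;
}

int main(void) {
    p = "(1+2)*3-4";
    printf("(1+2)*3-4 = %g\n", parse_exp());   /* prints 5 */
    return 0;
}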
Semantic Analysis
• Semantics refers to meaning
• Type checking and function argument checking
(sketched below):
– variables defined before use
– operands are compatible (coercion may be applied)
– reals can't be used to index arrays
– the right number and type of function arguments
• The Symbol Table has the required information for semantic
analysis
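As a minimal illustration of the operand-compatibility check, here is
a sketch in C. The two-type system (int, real) and the specific rules
are assumptions chosen to match the bullet points above:

#include <stdio.h>

/* Minimal type-compatibility sketch: int operands may be coerced
 * to real in arithmetic, but a real can never be used where an
 * int is required (e.g. as an array index). */
typedef enum { TY_INT, TY_REAL, TY_ERROR } Type;

Type binary_op_type(Type a, Type b) {
    if (a == TY_ERROR || b == TY_ERROR) return TY_ERROR;
    if (a == TY_REAL || b == TY_REAL)   return TY_REAL;  /* coerce int to real */
    return TY_INT;
}

Type index_type(Type t) {
    return (t == TY_INT) ? TY_INT : TY_ERROR;  /* reals can't index arrays */
}

int main(void) {
    printf("%d\n", binary_op_type(TY_INT, TY_REAL));  /* 1 = TY_REAL */
    printf("%d\n", index_type(TY_REAL));              /* 2 = TY_ERROR */
    return 0;
}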
Code Generation and
Optimization
• Possible intermediate code representations:
– syntax trees
– directed acyclic graphs
– postfix notation
– 3-address code (see the sketch below)
• Possible optimizations:
– remove redundant or unreachable code
– propagate constant values
– optimize loops
• Target code generation:
– optimal use of registers for frequently used data
– taking advantage of specific architectural features
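3-address code is the representation used in the example compilation
earlier; it is commonly stored as quadruples. A minimal sketch in C
(the field names are an assumption):

#include <stdio.h>

/* 3-address code stored as quadruples: result = arg1 op arg2. */
typedef struct {
    char op;             /* '+', '-', '*', '=' (copy), ... */
    const char *arg1;
    const char *arg2;    /* NULL for unary or copy operations */
    const char *result;
} Quad;

int main(void) {
    /* temp2 = id3 * temp1, from the example compilation earlier */
    Quad q = { '*', "id3", "temp1", "temp2" };
    printf("%s = %s %c %s\n", q.result, q.arg1, q.op, q.arg2);
    return 0;
}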
Compiler Structure
A pass is a reading of the program from a file.
Most compilers are 1- or 2-pass compilers.
Most C compilers are 3-pass:
1st pass generates intermediate code
2nd pass generates assembler code
3rd pass generates optimized assembler
Some languages make a 1-pass compiler structure
almost impossible because they allow undeclared
variables or forward jumps (gotos).
1-Pass Structure
• Easy to implement
• Requires large memory in order to store the intermediate
representation
• Produces relatively inefficient code
• Example: the first Pascal compilers
2-Pass Structure
• 1st Pass: The front end or analysis phases
– The parser is the main routine
– it calls the scanner to get the next token
– groups structures and calls generator subroutines
to write the intermediate representation
• 2nd Pass: The back end or synthesis phases
– Reads the intermediate representation
– optimizes and writes target code
Example of a 2-pass
Assembler
• Source program:
mov a, R1
add #2, R1
mov R1, b

• 1st pass: identifiers added to symbol table with relocatable addresses
id location
a 0
b 4

• 2nd pass: generate opcodes + addresses
00000001 00000000 11000001
00000010 10000010 11000001
00000001 11000001 00000100
Retargetable Compilers
• Advantages of having a 2-pass structure:
– Increased opportunities for optimization
– Platform-independent languages
– Language-independent programs
Symbol Table
A data structure with a record for each identifier used in the program
(variables, user-defined type names, functions, formal arguments
etc.)

• Attributes may include:
– Storage size
– Type
– Scope (visible within what language blocks)
– Number and types of arguments

• Possible structures:
– Array
– Linked List
– Binary Search Tree
– Hash Table (sketched below)
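As one concrete possibility, here is a sketch of the hash-table
variant in C. The bucket count, hash function, and attribute fields
are illustrative choices, and strdup assumes a POSIX system:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* A chained hash table holding one record per identifier, with a
 * few of the attributes listed above. */
#define NBUCKETS 211

typedef struct Symbol {
    char *name;
    char *type;              /* e.g. "INTEGER" */
    int   size;              /* storage size in bytes */
    int   scope;             /* nesting level of the defining block */
    struct Symbol *next;     /* chain of entries that hash alike */
} Symbol;

static Symbol *table[NBUCKETS];

static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h % NBUCKETS;
}

Symbol *lookup(const char *name) {
    for (Symbol *s = table[hash(name)]; s != NULL; s = s->next)
        if (strcmp(s->name, name) == 0)
            return s;
    return NULL;
}

Symbol *insert(const char *name, const char *type, int size, int scope) {
    unsigned h = hash(name);
    Symbol *s = malloc(sizeof *s);
    s->name = strdup(name);          /* strdup: POSIX */
    s->type = strdup(type);
    s->size = size;
    s->scope = scope;
    s->next = table[h];              /* prepend to the bucket's chain */
    table[h] = s;
    return s;
}

int main(void) {
    insert("SUM", "INTEGER", 4, 0);  /* 4-byte INTEGER: an assumption */
    Symbol *s = lookup("SUM");
    if (s != NULL)
        printf("%s: %s, %d bytes, scope %d\n", s->name, s->type, s->size, s->scope);
    return 0;
}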
Error Handling
• Each analysis phase may produce errors
• Error messages should be meaningful
– "Syntax Error" isn't very helpful
• Error messages should indicate the location in the source file
– Often the error is not detected until the compiler is already past it
• Ideally, the compiler should recover and report as many errors as
possible rather than die the first time it encounters a problem