COMPILER
Specification
Lecture Objectives
• To understand the significance of specifying
programming languages
• To describe the formal methods used in
specifying programming languages
• To recognise the processing requirements in
compiling programs based on programming
language specifications
Language Processors: Why do we need them?
[Figure: the “semantic gap” between the programmer's concepts and ideas (e.g. compute the surface area of a triangle) and the hardware (0101001001...). How to bridge it? Java program → JVM byte code → JVM interpreter → X86 processor → hardware.]
Programming Language Specification
• Why?
– A communication device between people who need to have
a common understanding of the PL:
• language designer, language implementer, user
• What to specify?
– Specify what is a ‘well formed’ program
• syntax
• contextual constraints (also called static semantics):
– scoping rules
– type rules
– Specify what is the meaning of (well formed) programs
• semantics (also called runtime semantics)
Programming Language Specification
• Why?
• What to specify?
• How to specify ?
– Formal specification: use some kind of
precisely defined formalism
– Informal specification: description in English.
– Usually a mix of both (e.g. Java specification)
• Syntax => formal specification using CFG/BNF
• Contextual constraints and semantics => informal
Syntax Definition
Syntax: the form of expressions, statements and
programming units.
Grammar: a formal set of rules that describes a
valid syntax of a language
Context Free Grammar (CFG): formal way of
describing syntax.
Backus-Naur Form (BNF): a particular way of
expressing CFGs.
How do we start?
• Input: Source program
• Need to convert the sequence of characters
(stream) into representations that can be
processed (tokens)
Lexemes
Lexemes are the lowest level syntactic units.
Example:
val = (int)(xdot + y*0.3) ;
In the above statement, the lexemes are
val, =, (, int, ), (, xdot, +, y, *, 0.3, ), ;
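The split into lexemes above can be reproduced with a small scanner. This is only a sketch: the regular expression `LEXEME` and the function name `lexemes` are illustrative assumptions, not part of any particular compiler, and the pattern covers just numbers, names and single-character symbols.

```python
import re

# Assumed lexeme pattern: a decimal number, an integer, a name,
# or any single non-whitespace character (operators, punctuation).
LEXEME = re.compile(r"\d+\.\d+|\d+|[A-Za-z_]\w*|\S")

def lexemes(source: str) -> list[str]:
    """Split a source string into its lexemes, skipping whitespace."""
    return LEXEME.findall(source)

print(lexemes("val = (int)(xdot + y*0.3) ;"))
```

The order of the alternatives matters: the decimal pattern comes first so that 0.3 is kept as one lexeme instead of being split at the dot.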
Tokens
Tokens are the categories of lexemes.
• Identifiers: Names chosen by the programmer.
val, xdot, y
• Keywords: Names chosen by the language designer to
help syntax and structure.
int, return, void.
(Keywords that cannot be used as identifiers are
known as RESERVED WORDS)
Tokens (Contd.)
• Operators: Identify actions.
+, &&, !
• Literals: Denote values directly.
3.14, -10, ‘a’, true, null
• Punctuation Symbols: Support syntactic
structure.
(, ), ;, {, }
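As a rough illustration of how lexemes are grouped into the token categories listed so far, the sketch below classifies a single lexeme. The sets `KEYWORDS`, `OPERATORS` and `PUNCTUATION` are small assumed samples taken from the examples on these slides (treating `=` as an operator is also an assumption), not a complete language definition.

```python
# Assumed sample sets; a real language defines many more entries.
KEYWORDS = {"int", "return", "void"}
OPERATORS = {"+", "-", "*", "/", "=", "&&", "!"}
PUNCTUATION = {"(", ")", ";", "{", "}"}
WORD_LITERALS = {"true", "false", "null"}

def classify(lexeme: str) -> str:
    """Return the token category of one non-empty lexeme."""
    if lexeme in KEYWORDS:
        return "KEYWORD"
    if lexeme in OPERATORS:
        return "OPERATOR"
    if lexeme in PUNCTUATION:
        return "PUNCTUATION"
    if lexeme in WORD_LITERALS or lexeme[0] in "0123456789":
        return "LITERAL"
    if lexeme[0].isalpha() or lexeme[0] == "_":
        return "IDENTIFIER"
    return "UNKNOWN"

for lx in ["val", "=", "(", "int", "xdot", "0.3", ";"]:
    print(lx, classify(lx))
```

Keywords are tested before identifiers because both look like names; that ordering is what makes a keyword such as int a reserved word in this sketch.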
Tokens (Contd.)
• Integers: 2 1000 -20
• Floating-point: 2.0 -0.010 .02
• Symbols: $ # @ { } << >> [ ]
• Strings: “x” “He said, I love Compilers”
• Comments: /* Hi and Bye */
Token Structure (Example)
What do we do with tokens?
• The sequence of tokens must conform to the
grammar of the language
• The tokens have to be checked against the
specifications given in the grammar
Grammars
Like natural languages (English), programming
languages are described by their grammar
It is essential to know the grammar of the source and
target languages when writing a compiler
Context Free Grammar (CFG): formal way of describing syntax
Backus-Naur Form (BNF): a particular way of expressing CFGs
Context Free Grammar
The Components of CFG
1. A set of tokens, known as terminal symbols
2. A set of nonterminals
3. A set of productions, LHS → RHS
4. A designation of one of the nonterminals as the start
symbol
Context Free Grammar can be used to help
guide the translation of programs
Grammar, Formally
A grammar G of a programming language is a four-tuple (quadruple),
G = (T, N, S, P) where:
T is a finite set of terminal symbols
N is a finite set of non-terminal symbols
S is the start symbol
P is a finite set of production rules
Example:
<assign> → <ident> = <expr>
<ident> → A | B | C
<expr> → <ident> + <expr>
| <ident> * <expr>
| ( <expr> )
| <ident>
T = { =, A, B, C, *, +, (, ) }
N = { <assign>, <ident>, <expr> }
S = <assign>
P = { <assign> → <ident> = <expr>, <ident> → A | B | C,
<expr> → <ident> + <expr> | <ident> * <expr> | ( <expr> ) | <ident> }
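One possible way to hold the quadruple G = (T, N, S, P) in a program is sketched below. The class name `Grammar`, its field names and the helper `is_well_formed` are hypothetical choices made for this illustration; the sets and productions are the ones from the example grammar.

```python
from typing import NamedTuple

class Grammar(NamedTuple):
    terminals: frozenset     # T
    nonterminals: frozenset  # N
    start: str               # S, a single nonterminal
    productions: dict        # P: nonterminal -> list of right-hand sides

G = Grammar(
    terminals=frozenset({"=", "A", "B", "C", "*", "+", "(", ")"}),
    nonterminals=frozenset({"<assign>", "<ident>", "<expr>"}),
    start="<assign>",
    productions={
        "<assign>": [("<ident>", "=", "<expr>")],
        "<ident>": [("A",), ("B",), ("C",)],
        "<expr>": [
            ("<ident>", "+", "<expr>"),
            ("<ident>", "*", "<expr>"),
            ("(", "<expr>", ")"),
            ("<ident>",),
        ],
    },
)

def is_well_formed(g: Grammar) -> bool:
    """Check the basic consistency conditions of a grammar quadruple."""
    symbols = g.terminals | g.nonterminals
    return (
        g.start in g.nonterminals
        and not (g.terminals & g.nonterminals)      # T and N are disjoint
        and set(g.productions) <= g.nonterminals    # every LHS is a nonterminal
        and all(sym in symbols
                for alternatives in g.productions.values()
                for rhs in alternatives
                for sym in rhs)
    )

print(is_well_formed(G))
```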
Production rules
* Consists of a nonterminal (LHS), an arrow (-> or
::=), and a sequence of tokens (terminals) and/or
nonterminals (RHS)
* Describes how the non-terminal LHS can be
expanded into the RHS
* Productions with the same LHS can have their RHS
combined, using a vertical bar (‘|’)
Backus Naur Form (BNF)
* Useful for describing the syntax of programming languages
if-else statement in Java:
if (expression) statement else statement
The structuring rule for if-else can have the form of a production:
stat → if (exp) stat else stat
Tokens (terminals): if, (, ), else
Nonterminals: stat, exp
Logical OR in BNF
list → list + digit
list → list – digit
list → digit
digit → 0
digit → 1
digit → 2
digit → 3
digit → 4
digit → 5
digit → 6
digit → 7
digit → 8
digit → 9
Tokens: + – 0 1 2 3 4 5 6 7 8 9
Nonterminals: list, digit
The same productions combined with OR:
list → list + digit | list – digit | digit
digit → 0|1|2|3|4|5|6|7|8|9
Empty string: a string containing zero tokens, written as ε
Logical OR in BNF is denoted by |
Each rule is shown first in arrow notation, then in ::= notation.
digit → 0|1|2|3|4|5|6|7|8|9
<digit> ::= 0|1|2|3|4|5|6|7|8|9
if_stmt → if expr then stmt
| if expr then stmt else stmt
<if_stmt> ::= if <expr> then <stmt>
| if <expr> then <stmt> else <stmt>
sign → + | −
<sign> ::= + | −
Recursive Rules in BNF
A BNF rule is recursive if its LHS also appears on its RHS.
<ident_list> ::= <identifier>
| <identifier> , <ident_list>
<integer> ::= <digit>
| <digit> <integer>
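A recursive rule maps naturally onto a recursive function. The sketch below mirrors the <integer> rule one alternative at a time; the function name `is_integer` is a hypothetical choice, and the input is assumed to be a plain character string.

```python
DIGITS = "0123456789"

def is_integer(s: str) -> bool:
    """Recognise <integer> ::= <digit> | <digit> <integer>."""
    if len(s) == 0 or s[0] not in DIGITS:
        return False              # both alternatives start with a <digit>
    if len(s) == 1:
        return True               # <integer> ::= <digit>
    return is_integer(s[1:])      # <integer> ::= <digit> <integer>

print(is_integer("2024"), is_integer("12a"), is_integer(""))
```

The recursion ends because each call consumes one digit, just as each application of the second alternative adds one digit to the derived string.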
Extended BNF
• [ ] Optional element:
<if_stmt> ::= if (<logic_expr>) <stmt> [ else <stmt>]
<real_num> ::= [<int_num>] . <int_num>
• { } Unspecified number of repetitions
<ident_list> ::= <identifier> { , <identifier> }
• ( …| …) Multiple choice options. A single element must be
chosen from a group. “for” loop in Pascal:
<for_stmt> ::= for <var> := <expr> (to | downto) <expr> do <stmt>
EBNF enhances the readability and writability of BNF
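The EBNF repetition braces translate directly into a loop, where plain BNF would need recursion. The following sketch recognises <ident_list> ::= <identifier> { , <identifier> } over a list of tokens; the function name `parse_ident_list` is hypothetical, and Python's own `str.isidentifier` stands in for the <identifier> rule as a simplifying assumption.

```python
def parse_ident_list(tokens: list[str]) -> list[str]:
    """Return the identifiers in the list, or raise SyntaxError."""
    pos = 0

    def identifier() -> str:
        nonlocal pos
        if pos < len(tokens) and tokens[pos].isidentifier():
            pos += 1
            return tokens[pos - 1]
        raise SyntaxError(f"identifier expected at position {pos}")

    names = [identifier()]                            # <identifier>
    while pos < len(tokens) and tokens[pos] == ",":   # { , <identifier> }
        pos += 1
        names.append(identifier())
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return names

print(parse_ident_list(["a", ",", "b", ",", "c"]))
```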
Parse Tree
Parse tree shows how the start symbol of a grammar derives a
string in the language.
Parse trees describe the hierarchical structure of sentences.
Parser: carries out the parsing.
Parsing: the process of determining whether a string of tokens can
be generated by a grammar.
Parse Tree: a graphical (tree) proof showing the steps in the
derivation of a string from the start symbol. It has the
following properties:
1. The root is labeled by the start symbol
2. Each leaf is labeled by a token or by ε
3. Each interior node is labeled by a nonterminal
Example: for the production A → XYZ, an interior node labeled A has the children X, Y and Z.
Parse Tree
Parse tree (concrete syntax tree) differs from the Abstract
Syntax Tree (AST)
The AST does not contain superficial distinctions of form, unimportant for
translation
[Figure: parse tree for the string 1 + 1 - 0]
AST for the string 1 + 1 - 0:
      -
     / \
    +   0
   / \
  1   1
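Because the AST keeps only what matters for translation, it is convenient to process. The sketch below stores the AST for 1 + 1 - 0 as nested tuples and evaluates it; the tuple layout (operator, left, right) and the function name `evaluate` are assumptions made for this illustration.

```python
# AST for 1 + 1 - 0: interior nodes are (operator, left, right), leaves are ints.
ast = ("-", ("+", 1, 1), 0)

def evaluate(node):
    """Evaluate an AST built from + and - nodes over integer leaves."""
    if isinstance(node, int):
        return node
    op, left, right = node
    lhs, rhs = evaluate(left), evaluate(right)
    return lhs + rhs if op == "+" else lhs - rhs

print(evaluate(ast))
```

Note that no nodes for nonterminals appear here: the tree shape alone records that the addition is performed before the subtraction.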
Example 1
list → list + digit | list – digit | digit
digit → 0|1|2|3|4|5|6|7|8|9
Parse tree for 9–5+2 (children are indented under their parent):
list
  list
    list
      digit
        9
    –
    digit
      5
  +
  digit
    2
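The parse tree for 9–5+2 can also be built by a short program. This is a sketch under several assumptions: the function name `parse_list` and the tuple encoding of nodes are hypothetical, the minus sign is typed as the ASCII character `-`, and the left-recursive rule is handled with a loop that grows the tree to the left instead of literal recursion.

```python
DIGITS = set("0123456789")

def parse_list(text: str):
    """Build the parse tree of `text` for
    list → list + digit | list - digit | digit."""
    tokens = [c for c in text if not c.isspace()]

    def digit(tok: str):
        if tok not in DIGITS:
            raise SyntaxError(f"digit expected, found {tok!r}")
        return ("digit", tok)

    if not tokens:
        raise SyntaxError("empty input")
    tree = ("list", digit(tokens[0]))              # list → digit
    i = 1
    while i < len(tokens):
        if tokens[i] not in ("+", "-") or i + 1 >= len(tokens):
            raise SyntaxError(f"unexpected input at position {i}")
        # list → list + digit  or  list → list - digit
        tree = ("list", tree, tokens[i], digit(tokens[i + 1]))
        i += 2
    return tree

print(parse_list("9-5+2"))
```

Each pass through the loop wraps the tree built so far as the left child of a new list node, which is why 9-5 ends up grouped below the + node.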
Example 2
<assign> ::= <ident> = <expr>
<ident> ::= A | B | C
<expr> ::= <ident> + <expr>
| <ident> * <expr>
| ( <expr> )
| <ident>
Parse tree for A=B*C (children are indented under their parent):
<assign>
  <ident>
    A
  =
  <expr>
    <ident>
      B
    *
    <expr>
      <ident>
        C
Derivation
Derivation is a mechanism by which the rules of a grammar
can be repeatedly applied to generate a sentence.
At each stage, a nonterminal is replaced by the RHS of a
rule, till finally the whole sentence is generated.
Grammar:
<assign> → <ident> = <expr>
<ident> → A | B | C
<expr> → <ident> + <expr>
| <ident> * <expr>
| ( <expr> )
| <ident>
Derivation of A = B * C:
<assign> => <ident> = <expr>
=> A = <expr>
=> A = <ident> * <expr>
=> A = B * <expr>
=> A = B * <ident>
=> A = B * C
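The derivation of A = B * C can be replayed mechanically: at each stage the leftmost nonterminal is replaced by the RHS of a chosen rule. In the sketch below, the table `PRODUCTIONS` encodes the grammar above, while the function name `leftmost_derivation` and the idea of passing the chosen alternatives as a list of indices are assumptions made for illustration.

```python
PRODUCTIONS = {
    "<assign>": [["<ident>", "=", "<expr>"]],
    "<ident>": [["A"], ["B"], ["C"]],
    "<expr>": [
        ["<ident>", "+", "<expr>"],
        ["<ident>", "*", "<expr>"],
        ["(", "<expr>", ")"],
        ["<ident>"],
    ],
}

def leftmost_derivation(start: str, choices: list[int]) -> list[str]:
    """Apply one production per choice to the leftmost nonterminal."""
    form = [start]
    steps = [" ".join(form)]
    for choice in choices:
        # Position of the leftmost nonterminal in the current sentential form.
        i = next(k for k, sym in enumerate(form) if sym in PRODUCTIONS)
        form[i:i + 1] = PRODUCTIONS[form[i]][choice]
        steps.append(" ".join(form))
    return steps

for step in leftmost_derivation("<assign>", [0, 0, 1, 1, 3, 2]):
    print(step)
```

The index list [0, 0, 1, 1, 3, 2] selects, in order, the rules used in the six derivation steps above.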
Example
<exp> ::= <exp> <op> <exp> | (<exp> ) | <number>
<op> ::= + | - | *
<number> ::= {0..9}+
derivation for (34-3)*42:
<exp> => <exp> <op> <exp>
=> ( <exp> ) <op> <exp>
=> ( <exp> <op> <exp> ) <op> <exp>
=> ( <number> <op> <exp> ) <op> <exp>
=> ( 34 <op> <exp> ) <op> <exp>
=> ( 34 - <exp> ) <op> <exp>
=> ( 34 - <number> ) <op> <exp>
=> ( 34 - 3 ) <op> <exp>
=> ( 34 - 3 ) * <exp>
=> ( 34 - 3 ) * <number>
=> ( 34 - 3 ) * 42
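Invalid Sentence
A = B * C *
Grammar:
<assign> → <ident> = <expr>
<ident> → A | B | C
<expr> → <ident> + <expr>
| <ident> * <expr>
| ( <expr> )
| <ident>
First attempt:
<assign> => <ident> = <expr>
=> A = <expr>
=> A = <ident> * <expr>
=> A = B * <expr>
=> A = B * <ident>
=> A = B * C
invalid: the trailing * is never generated
Second attempt:
<assign> => <ident> = <expr>
=> A = <expr>
=> A = <ident> * <expr>
=> A = B * <expr>
=> A = B * <ident> * <expr>
=> A = B * <ident> * <ident>
=> A = B * C * <ident>
invalid: the last <ident> must still be replaced by A, B or C, so the sentence cannot end in *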
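The same check can be automated with a small recursive-descent recogniser, one function per nonterminal. This is only a sketch for this particular grammar: the function name `accepts` is hypothetical, and the input is assumed to be already split into tokens.

```python
def accepts(tokens: list[str]) -> bool:
    """Return True if tokens form a sentence of the <assign> grammar."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def ident() -> bool:            # <ident> → A | B | C
        nonlocal pos
        if peek() in ("A", "B", "C"):
            pos += 1
            return True
        return False

    def expr() -> bool:
        nonlocal pos
        if peek() == "(":           # <expr> → ( <expr> )
            pos += 1
            if not expr() or peek() != ")":
                return False
            pos += 1
            return True
        if not ident():
            return False
        if peek() in ("+", "*"):    # <expr> → <ident> + <expr> | <ident> * <expr>
            pos += 1
            return expr()
        return True                 # <expr> → <ident>

    if not ident() or peek() != "=":    # <assign> → <ident> = <expr>
        return False
    pos += 1
    return expr() and pos == len(tokens)

print(accepts("A = B * C".split()))
print(accepts("A = B * C *".split()))
```

For A = B * C * the recogniser consumes the final * and then calls expr once more, which fails because no token is left to start an <ident>.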
Ambiguity
A grammar that generates a sentence which has two or
more distinct parse trees is said to be an ambiguous
grammar
If we rewrite the grammar as below
<string> ::= <string> + <string> | <string> – <string>
|0|1|2|3|4|5|6|7|8|9
then the sentence 9 – 5 + 2 would have two distinct parse trees,
and therefore the above grammar is ambiguous
The two parse trees for 9 – 5 + 2 correspond to the groupings (9 – 5) + 2 and 9 – (5 + 2).
Example 1
Two parse trees for 9 – 5 + 2: one groups it as (9 – 5) + 2, the other as 9 – (5 + 2)
string → string + string | string – string | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
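Ambiguity can be confirmed by counting parse trees. The sketch below counts the distinct parse trees of a sentence under the ambiguous string grammar by trying every operator as the root of the tree; the function name `count_parse_trees` is hypothetical, tokens are assumed to be separated by spaces, and the minus sign is typed as the ASCII character `-`.

```python
from functools import lru_cache

DIGITS = set("0123456789")

def count_parse_trees(sentence: str) -> int:
    """Count parse trees for
    string → string + string | string - string | 0 | ... | 9."""
    tokens = tuple(sentence.split())

    @lru_cache(maxsize=None)
    def count(lo: int, hi: int) -> int:
        # Number of ways tokens[lo:hi] can be derived from <string>.
        if hi - lo == 1:
            return 1 if tokens[lo] in DIGITS else 0
        total = 0
        for i in range(lo + 1, hi - 1):
            if tokens[i] in ("+", "-"):        # tokens[i] is the root operator
                total += count(lo, i) * count(i + 1, hi)
        return total

    return count(0, len(tokens)) if tokens else 0

print(count_parse_trees("9 - 5 + 2"))
```

A result greater than 1 for any sentence shows that the grammar is ambiguous; for 9 - 5 + 2 the two trees are those rooted at - and at +.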
Example 2
Two parse trees for A = B*C+A
Grammar:
<assign> → <ident> = <expr>
<ident> → A | B | C
<expr> → <expr> + <expr>
| <expr> * <expr>
| ( <expr> )
| <ident>
First parse tree (+ at the top, grouping (B*C)+A):
<assign>
  <ident>
    A
  =
  <expr>
    <expr>
      <expr>
        <ident>
          B
      *
      <expr>
        <ident>
          C
    +
    <expr>
      <ident>
        A
Second parse tree (* at the top, grouping B*(C+A)):
<assign>
  <ident>
    A
  =
  <expr>
    <expr>
      <ident>
        B
    *
    <expr>
      <expr>
        <ident>
          C
      +
      <expr>
        <ident>
          A
Further checking with the parse tree
• If the tokens can be matched against the
grammar, the parse tree can be produced
• This means the source program is
syntactically correct
• However, most programming languages
also have semantic specifications that must
be checked before the right code can be
generated
Contextual Constraints
Syntax rules alone are not enough to specify the format of
well-formed programs.
Example 1 (Scope Rules):
let const m~2
in putint(m + x)
Undefined! x is used but never declared.
Example 2 (Type Rules):
let const m~2 ;
var n:Boolean
in begin
n := m<4;
n := n+1
end
Type error! n is a Boolean, so n+1 is not allowed.
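Both kinds of contextual constraint can be checked by walking expressions with a table of declared names. The sketch below is a simplified assumption, not the checker of any real compiler: expressions are encoded as nested tuples, `env` maps each declared name to its type, and the function name `check_expr` is hypothetical.

```python
def check_expr(expr, env: dict) -> str:
    """Return the type of expr, or raise on a scope or type violation."""
    if isinstance(expr, bool):          # test bool first: bool is a subtype of int
        return "Boolean"
    if isinstance(expr, int):
        return "Integer"
    if isinstance(expr, str):           # a name: apply the scope rule
        if expr not in env:
            raise NameError(f"{expr} is undefined")
        return env[expr]
    op, left, right = expr              # a binary operation: apply the type rule
    if check_expr(left, env) != "Integer" or check_expr(right, env) != "Integer":
        raise TypeError(f"operands of {op} must be Integer")
    return "Boolean" if op == "<" else "Integer"

# Example 1: m is declared, x is not.
try:
    check_expr(("+", "m", "x"), {"m": "Integer"})
except NameError as err:
    print("scope error:", err)

# Example 2: n is Boolean, so m<4 may be assigned to n, but n+1 is ill-typed.
env = {"m": "Integer", "n": "Boolean"}
print(check_expr(("<", "m", 4), env))
try:
    check_expr(("+", "n", 1), env)
except TypeError as err:
    print("type error:", err)
```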
Semantics
Specification of semantics is concerned with specifying the
“meaning” of well-formed programs.
Terminology:
Expressions are evaluated and yield values (and may or may not
perform side effects).
Commands are executed and perform side effects.
Declarations are elaborated to produce bindings.
Side effects:
• change the values of variables
• perform input/output
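The three terms can be made concrete with a toy interpreter. Everything in the sketch below is an assumption made for illustration: the tuple encodings, the function names `elaborate`, `evaluate` and `execute`, and the use of a list named `output` to stand in for real input/output.

```python
def elaborate(declaration, bindings: dict) -> None:
    """Elaborate a declaration: produce a binding of a name to a value."""
    _kind, name, value = declaration           # e.g. ("const", "m", 2)
    bindings[name] = value

def evaluate(expression, bindings: dict) -> int:
    """Evaluate an expression: yield a value (no side effects in this sketch)."""
    if isinstance(expression, int):
        return expression
    if isinstance(expression, str):
        return bindings[expression]
    op, left, right = expression
    lhs, rhs = evaluate(left, bindings), evaluate(right, bindings)
    return lhs + rhs if op == "+" else lhs * rhs

def execute(command, bindings: dict, output: list) -> None:
    """Execute a command: perform a side effect."""
    kind, target, expression = command
    value = evaluate(expression, bindings)
    if kind == "assign":
        bindings[target] = value               # changes the value of a variable
    elif kind == "putint":
        output.append(value)                   # performs output

bindings, output = {}, []
elaborate(("const", "m", 2), bindings)
elaborate(("var", "n", 0), bindings)
execute(("assign", "n", ("+", "m", 1)), bindings, output)
execute(("putint", None, "n"), bindings, output)
print(output)
```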
The End