Lec02 Programming Language Specification

Uploaded by

ASTROHALT C4
COMPILER

Specification

1
Lecture Objectives

• To understand the significance of specifying programming languages
• To describe the formal methods used in specifying programming languages
• To recognise the processing requirements in compiling programs based on programming language specifications

2
Language Processors: Why do we need them?
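[Figure: the programmer's concepts and ideas (e.g. "compute the surface area of a triangle?") are at the top and the hardware (X86 processor, 0101001001...) is at the bottom. The question is how to bridge the "semantic gap" between them; the layers in between are the Java program, the JVM byte code and the JVM interpreter.]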
3
Programming Language Specification
• Why?
  – A communication device between people who need to have a common understanding of the PL: language designer, language implementer, user
• What to specify?
  – Specify what is a ‘well formed’ program
    • syntax
    • contextual constraints (also called static semantics):
      – scoping rules
      – type rules
  – Specify what is the meaning of (well formed) programs
    • semantics (also called runtime semantics)

4
Programming Language Specification
• Why?
• What to specify?
• How to specify?
  – Formal specification: use some kind of precisely defined formalism
  – Informal specification: description in English
  – Usually a mix of both (e.g. the Java specification)
    • Syntax => formal specification using CFG/BNF
    • Contextual constraints and semantics => informal
5
Syntax Definition

Syntax: the form of expressions, statements and programming units.
Grammar: a formal set of rules that describes the valid syntax of a language.
Context Free Grammar (CFG): a formal way of describing syntax.
Backus-Naur Form (BNF): a particular way of expressing CFGs.

6
How do we start?

• Input: Source program
• Need to convert the sequence of characters (stream) into representations that can be processed (tokens)

7
Lexemes

Lexemes are the lowest-level syntactic units.

Example:
val = (int)(xdot + y*0.3) ;

In the above statement, the lexemes are:

val, =, (, int, ), (, xdot, +, y, *, 0.3, ), ;

8
Tokens
Tokens are the categories of lexemes.
• Identifiers: Names chosen by the programmer.
val, xdot, y

• Keywords: Names chosen by the language designer to help syntax and structure.
int, return, void

(Keywords that cannot be used as identifiers are known as RESERVED WORDS)

9
Tokens (Contd.)

• Operators: Identify actions.
+, &&, !

• Literals: Denote values directly.
3.14, -10, ‘a’, true, null

• Punctuation Symbols: Support syntactic structure.
(, ), ;, {, }

10
Tokens (Contd.)
• Integers: 2 1000 -20
• Floating-point: 2.0 -0.010 .02
• Symbols: $ # @ { } << >> [ ]
• Strings: “x” “He said, I love Compilers”
• Comments: /* Hi and Bye */
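
The step from lexemes to tokens can be sketched as a small scanner. This is only an illustration: the category names follow the slides, but the regular expressions, the choice to treat = as an operator, and the function name tokenize are assumptions made for this sketch, not part of any particular compiler.

```python
import re

# Token categories follow the slides; the patterns are illustrative assumptions.
TOKEN_SPEC = [
    ("COMMENT",     r"/\*.*?\*/"),
    ("FLOAT",       r"\d*\.\d+"),      # tried before INTEGER so 0.3 stays one lexeme
    ("INTEGER",     r"\d+"),
    ("STRING",      r'"[^"]*"'),
    ("NAME",        r"[A-Za-z_]\w*"),
    ("OPERATOR",    r"&&|[+*=!-]"),
    ("PUNCTUATION", r"[(){};]"),
    ("SKIP",        r"\s+"),
]
KEYWORDS = {"int", "return", "void"}
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC), re.DOTALL)


def tokenize(source):
    """Split a character stream into (token category, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind in ("SKIP", "COMMENT"):
            continue                    # whitespace and comments produce no token
        if kind == "NAME":
            kind = "KEYWORD" if lexeme in KEYWORDS else "IDENTIFIER"
        tokens.append((kind, lexeme))
    return tokens


for kind, lexeme in tokenize("val = (int)(xdot + y*0.3) ;"):
    print(f"{kind:12} {lexeme}")
```

Names are matched first and then looked up in the keyword set, which is one common way to separate keywords from identifiers.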

11
Token Structure (Example)

12
What do we do with tokens?

• The sequence of tokens must conform to the grammar of the language
• The tokens have to be checked against the specifications given in the grammar

13
Grammars

Like natural languages (English), programming languages are described by their grammar.

It is essential to know the grammar of the source and target languages when writing a compiler.

Context Free Grammar (CFG): a formal way of describing syntax

Backus-Naur Form (BNF): a particular way of expressing CFGs

14
Context Free Grammar

The Components of a CFG

1. A set of tokens, known as terminal symbols
2. A set of nonterminals
3. A set of productions, LHS → RHS
4. A designation of one of the nonterminals as the start symbol

A Context Free Grammar can be used to help guide the translation of programs

15
Grammar, Formally
Grammar G of a programming language is a 4-tuple (quadruple), G = (T, N, S, P), where:
T is a finite set of terminal symbols
N is a finite set of non-terminal symbols
S is the start symbol
P is a finite set of production rules

Example grammar:
<assign> → <ident> = <expr>
<ident> → A | B | C
<expr> → <ident> + <expr>
 | <ident> * <expr>
 | ( <expr> )
 | <ident>

For this grammar:
T = { =, A, B, C, *, +, (, ) }
N = { <assign>, <ident>, <expr> }
S = <assign>
P = { <assign> → <ident> = <expr>, <ident> → A | B | C,
<expr> → <ident> + <expr> | <ident> * <expr> | ( <expr> ) | <ident> }
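
The quadruple above can be written down directly as data. The following is a minimal sketch; the representation (Python sets, and a dictionary mapping each nonterminal to its list of alternatives) and the helper name is_well_formed are assumptions chosen for illustration.

```python
# Representation choices (sets, dict of alternatives) are assumptions for illustration.
T = {"=", "A", "B", "C", "*", "+", "(", ")"}
N = {"<assign>", "<ident>", "<expr>"}
S = "<assign>"
P = {
    "<assign>": [["<ident>", "=", "<expr>"]],
    "<ident>":  [["A"], ["B"], ["C"]],
    "<expr>":   [["<ident>", "+", "<expr>"],
                 ["<ident>", "*", "<expr>"],
                 ["(", "<expr>", ")"],
                 ["<ident>"]],
}
G = (T, N, S, P)


def is_well_formed(grammar):
    """Check the basic consistency conditions on a grammar quadruple."""
    terminals, nonterminals, start, productions = grammar
    # T and N must be disjoint, and S must be one of the nonterminals.
    if terminals & nonterminals or start not in nonterminals:
        return False
    # Every LHS must be a nonterminal.
    if set(productions) - nonterminals:
        return False
    # Every RHS symbol must be a terminal or a nonterminal.
    return all(symbol in terminals or symbol in nonterminals
               for alternatives in productions.values()
               for rhs in alternatives
               for symbol in rhs)


print(is_well_formed(G))
```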
16
Production rules

* A production consists of a nonterminal (LHS), an arrow (-> or ::=), and a sequence of tokens (terminals) and/or nonterminals (RHS)
* It describes how the nonterminal LHS can be expanded into the RHS
* Productions with the same LHS can have their RHS combined, using a vertical bar (‘|’)

17
Backus Naur Form (BNF)
* Useful for describing the syntax of programming languages

if-else statement in Java:

  if (expression) statement else statement

The structuring rule for if-else can have the form of a production:

  stat → if (exp) stat else stat

Tokens (terminals): if, (, ), else
Nonterminals: stat, exp

18
Logical OR in BNF

list → list + digit
list → list – digit
list → digit
digit → 0
digit → 1
digit → 2
digit → 3
digit → 4
digit → 5
digit → 6
digit → 7
digit → 8
digit → 9

Tokens: + – 0 1 2 3 4 5 6 7 8 9
Nonterminals: list, digit

Using OR, the same productions can be written as:

list → list + digit | list – digit | digit
digit → 0|1|2|3|4|5|6|7|8|9

Empty string: a string containing zero tokens, written as ε
19
Logical OR in BNF is denoted by |
Each rule is shown first in arrow notation and then in ::= notation.

digit → 0|1|2|3|4|5|6|7|8|9
<digit> ::= 0|1|2|3|4|5|6|7|8|9

if_stmt → if expr then stmt
 | if expr then stmt else stmt
<if_stmt> ::= if <expr> then <stmt>
 | if <expr> then <stmt> else <stmt>

sign → + | −
<sign> ::= + | −

20
Recursive Rules in BNF

A BNF rule is recursive if its LHS appears on its RHS.

<ident_list> ::= <identifier>
 | <identifier> , <ident_list>

<integer> ::= <digit>
 | <digit> <integer>
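
A recursive rule maps naturally onto a recursive function. The sketch below mirrors the two rules above one alternative at a time; the function names, and the use of Python's str.isidentifier to stand in for <identifier>, are assumptions made for illustration.

```python
def is_integer(s):
    """<integer> ::= <digit> | <digit> <integer>"""
    if len(s) == 0 or s[0] not in "0123456789":
        return False
    # Either a single <digit>, or a <digit> followed by another <integer>.
    return len(s) == 1 or is_integer(s[1:])


def is_ident_list(tokens):
    """<ident_list> ::= <identifier> | <identifier> , <ident_list>"""
    if not tokens or not tokens[0].isidentifier():
        return False
    if len(tokens) == 1:
        return True
    # A comma must be followed by another <ident_list>.
    return tokens[1] == "," and is_ident_list(tokens[2:])


print(is_integer("2024"), is_ident_list(["a", ",", "b"]))
```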

21
Extended BNF
• [ ] Optional element:
<if_stmt> ::= if (<logic_expr>) <stmt> [ else <stmt> ]
<real_num> ::= [<int_num>] . <int_num>

• { } Unspecified number of repetitions:
<ident_list> ::= <identifier> { , <identifier> }

• ( … | … ) Multiple choice options. A single element must be chosen from a group. “for” loop in Pascal:
<for_stmt> ::= for <var> := <expr> (to | downto) <expr> do <stmt>

EBNF enhances the readability and writability of BNF
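
One practical consequence of the { } notation is that a repetition can be implemented as a loop instead of recursion. The following is a rough sketch for the <ident_list> rule only; the function name parse_ident_list and the token representation (a list of strings) are assumptions.

```python
def parse_ident_list(tokens):
    """<ident_list> ::= <identifier> { , <identifier> }"""
    if not tokens or not tokens[0].isidentifier():
        raise SyntaxError("expected an identifier")
    names = [tokens[0]]
    pos = 1
    # The braces { , <identifier> } become a while loop.
    while pos < len(tokens) and tokens[pos] == ",":
        if pos + 1 >= len(tokens) or not tokens[pos + 1].isidentifier():
            raise SyntaxError("expected an identifier after ','")
        names.append(tokens[pos + 1])
        pos += 2
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return names


print(parse_ident_list(["x", ",", "y", ",", "z"]))
```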


22
Parse Tree
A parse tree shows how the start symbol of a grammar derives a string in the language.
Parse trees describe the hierarchical structure of sentences.
Parser: carries out the parsing.
Parsing: the process of determining if a string of tokens can be generated by a grammar.
Parse tree: a graphical (tree) proof showing the steps in the derivation of a string from the start symbol. It has the following properties:
1. The root is labeled by the start symbol
2. Each leaf is labeled by a token or by ε
3. Each interior node is labeled by a nonterminal

Example: for the production A → XYZ, the interior node A has the children X, Y and Z:

      A
    / | \
   X  Y  Z
23
Parse Tree
A parse tree (concrete syntax tree) differs from the Abstract Syntax Tree (AST).
The AST does not contain superficial distinctions of form that are unimportant for translation.

[Figure: parse tree for the string 1 + 1 - 0]

AST (syntax tree) for the string 1 + 1 - 0:

      -
     / \
    +   0
   / \
  1   1


24
Example 1
list → list + digit | list – digit | digit
digit → 0|1|2|3|4|5|6|7|8|9

Parse tree for 9–5+2:

              list
            /  |   \
        list   +   digit
      /  |  \        |
  list   –  digit    2
    |         |
  digit       5
    |
    9
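
The parse tree for this grammar can also be built programmatically, and the AST obtained from it by dropping the list and digit wrappers. The sketch below is one possible way to do this, under some assumptions: trees are nested tuples, the ASCII character - stands in for –, and the left-recursive rule is handled with a loop that keeps extending the tree to the left.

```python
DIGITS = set("0123456789")


def parse_list(text):
    """Build a parse tree for: list -> list + digit | list - digit | digit."""
    tokens = [c for c in text if not c.isspace()]
    if not tokens or tokens[0] not in DIGITS:
        raise SyntaxError("expected a digit")
    tree = ("list", ("digit", tokens[0]))          # list -> digit
    pos = 1
    while pos < len(tokens):
        op = tokens[pos]
        if op not in "+-" or pos + 1 >= len(tokens) or tokens[pos + 1] not in DIGITS:
            raise SyntaxError(f"unexpected input at position {pos}")
        # list -> list op digit: the tree built so far becomes the left child.
        tree = ("list", tree, op, ("digit", tokens[pos + 1]))
        pos += 2
    return tree


def to_ast(tree):
    """Drop the list/digit wrappers, keeping only operators and operands."""
    if tree[0] == "digit":
        return int(tree[1])
    if len(tree) == 2:                              # list -> digit
        return to_ast(tree[1])
    _, left, op, right = tree                       # list -> list op digit
    return (op, to_ast(left), to_ast(right))


def evaluate(ast):
    """Evaluate an AST made of ints and (op, left, right) tuples."""
    if isinstance(ast, int):
        return ast
    op, left, right = ast
    if op == "+":
        return evaluate(left) + evaluate(right)
    return evaluate(left) - evaluate(right)


tree = parse_list("9-5+2")
print(to_ast(tree), evaluate(to_ast(tree)))
```

Because the grammar is left recursive, the subtraction ends up lower in the tree than the addition, so 9-5+2 is grouped as (9-5)+2.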


25
Example 2
Parse Tree for A=B*C

<assign> ::= <ident> = <expr>
<ident> ::= A | B | C
<expr> ::= <ident> + <expr>
 | <ident> * <expr>
 | ( <expr> )
 | <ident>

            <assign>
          /    |     \
    <ident>    =    <expr>
       |          /    |    \
       A     <ident>   *   <expr>
                |            |
                B         <ident>
                             |
                             C
26
Derivation
Derivation is a mechanism by which the rules of a grammar can be repeatedly applied to generate a sentence.
At each stage, a nonterminal is replaced by the RHS of a rule, until finally the whole sentence is generated.

Sentence: A = B * C

<assign> → <ident> = <expr>
<ident> → A | B | C
<expr> → <ident> + <expr>
 | <ident> * <expr>
 | ( <expr> )
 | <ident>

<assign> ⇒ <ident> = <expr>
 ⇒ A = <expr>
 ⇒ A = <ident> * <expr>
 ⇒ A = B * <expr>
 ⇒ A = B * <ident>
 ⇒ A = B * C
27
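
The derivation of A = B * C can be replayed mechanically by always expanding the leftmost nonterminal. In this sketch the grammar is stored as a dictionary of alternatives, and the caller supplies the index of the alternative to apply at each step; both the representation and the function name derive are assumptions for illustration.

```python
P = {
    "<assign>": [["<ident>", "=", "<expr>"]],
    "<ident>":  [["A"], ["B"], ["C"]],
    "<expr>":   [["<ident>", "+", "<expr>"],
                 ["<ident>", "*", "<expr>"],
                 ["(", "<expr>", ")"],
                 ["<ident>"]],
}


def derive(start, choices):
    """Apply one production per step to the leftmost nonterminal.

    `choices` lists, for each step, the index of the alternative to use.
    Returns every sentential form, starting with the start symbol.
    """
    form = [start]
    steps = [" ".join(form)]
    for choice in choices:
        # Position of the leftmost nonterminal in the current sentential form.
        i = next(k for k, sym in enumerate(form) if sym in P)
        form[i:i + 1] = P[form[i]][choice]
        steps.append(" ".join(form))
    return steps


for step in derive("<assign>", [0, 0, 1, 1, 3, 2]):
    print("=>", step)
```

The choice list [0, 0, 1, 1, 3, 2] reproduces the six steps shown above.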
Example
<exp> ::= <exp> <op> <exp> | ( <exp> ) | <number>
<op> ::= + | - | *
<number> ::= {0..9}+

Derivation for (34-3)*42:

<exp> => <exp> <op> <exp>
 => ( <exp> ) <op> <exp>
 => ( <exp> <op> <exp> ) <op> <exp>
 => ( <number> <op> <exp> ) <op> <exp>
 => ( 34 <op> <exp> ) <op> <exp>
 => ( 34 - <exp> ) <op> <exp>
 => ( 34 - <number> ) <op> <exp>
 => ( 34 - 3 ) <op> <exp>
 => ( 34 - 3 ) * <exp>
 => ( 34 - 3 ) * <number>
 => ( 34 - 3 ) * 42
28
Invalid Sentence

A = B * C *

<assign> → <ident> = <expr>
<ident> → A | B | C
<expr> → <ident> + <expr>
 | <ident> * <expr>
 | ( <expr> )
 | <ident>

Attempt 1:
<assign> ⇒ <ident> = <expr>
 ⇒ A = <expr>
 ⇒ A = <ident> * <expr>
 ⇒ A = B * <expr>
 ⇒ A = B * <ident>
 ⇒ A = B * C
 ⇒ invalid (the trailing * of the sentence is not derived)

Attempt 2:
<assign> ⇒ <ident> = <expr>
 ⇒ A = <expr>
 ⇒ A = <ident> * <expr>
 ⇒ A = B * <expr>
 ⇒ A = B * <ident> * <ident>
 ⇒ A = B * C * <ident>
 ⇒ invalid (the sentence has no token left to match the last <ident>)
29
Ambiguity
A grammar that generates a sentence which has two or more distinct parse trees is said to be an ambiguous grammar.
If we rewrite the grammar as below
<string> ::= <string> + <string> | <string> – <string>
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
then the sentence 9 – 5 + 2 would have two distinct parse trees, and therefore the above grammar is ambiguous.
The two parse trees for 9 – 5 + 2 correspond to the groupings (9 – 5) + 2 and 9 – (5 + 2).
30
Example 1: 9 – 5 + 2

string → string + string | string – string | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

[Figure: two parse trees for 9 – 5 + 2, one grouping it as (9 – 5) + 2 and the other as 9 – (5 + 2)]
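
The ambiguity can be made concrete by enumerating every parse tree the grammar allows for a token sequence. The sketch below tries each operator as the root of the tree; the nested-tuple tree representation, the function names, and the use of ASCII - in place of – are assumptions for illustration.

```python
DIGITS = set("0123456789")


def parse_trees(tokens):
    """All parse trees for: string -> string + string | string - string | digit."""
    if len(tokens) == 1:
        return [tokens[0]] if tokens[0] in DIGITS else []
    trees = []
    # Try every operator position as the root of the tree.
    for i in range(1, len(tokens) - 1):
        if tokens[i] in ("+", "-"):
            for left in parse_trees(tokens[:i]):
                for right in parse_trees(tokens[i + 1:]):
                    trees.append((left, tokens[i], right))
    return trees


def value(tree):
    """Evaluate a tree made of digit strings and (left, op, right) tuples."""
    if isinstance(tree, str):
        return int(tree)
    left, op, right = tree
    if op == "+":
        return value(left) + value(right)
    return value(left) - value(right)


for tree in parse_trees(["9", "-", "5", "+", "2"]):
    print(tree, "=", value(tree))
```

The two trees evaluate to different results (2 and 6), which is why ambiguity matters for translation and not only for parsing.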

31
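Example 2: A = B*C+A

<assign> → <ident> = <expr>
<ident> → A | B | C
<expr> → <expr> + <expr>
 | <expr> * <expr>
 | ( <expr> )
 | <ident>

Parse tree 1, grouping the sentence as A = (B*C)+A:

              <assign>
            /    |     \
      <ident>    =     <expr>
         |           /   |    \
         A      <expr>   +   <expr>
               /  |  \          |
          <expr>  *  <expr>  <ident>
             |         |        |
          <ident>   <ident>     A
             |         |
             B         C

Parse tree 2, grouping the sentence as A = B*(C+A):

              <assign>
            /    |     \
      <ident>    =     <expr>
         |           /   |    \
         A      <expr>   *   <expr>
                  |         /  |  \
               <ident> <expr>  +  <expr>
                  |       |         |
                  B    <ident>   <ident>
                          |         |
                          C         A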
32
Further checking with the parse tree

• If the tokens can be matched against the grammar, the parse tree can be produced
• This means the source program is syntactically correct
• However, most programming languages have semantic specifications that must be checked in order to generate the correct code

33
Contextual Constraints
Syntax rules alone are not enough to specify the format of well-formed programs.

Example 1 (Scope Rules):
let const m~2
in putint(m + x)          <- x is undefined!

Example 2 (Type Rules):
let const m~2 ;
    var n:Boolean
in begin
    n := m<4;
    n := n+1              <- Type error!
end
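
Both checks can be sketched as a small function over expression trees. This is only an illustration of the idea: the tuple representation of expressions, the environment dictionary, the type names, and the assumption that const m~2 gives m the type Integer are all choices made for this sketch.

```python
def type_of(expr, env):
    """Return the type of an expression, or raise on a scope or type error.

    expr is an int literal, a name (str), or a tuple (op, left, right)
    with op being "+" or "<". env maps declared names to their types.
    """
    if isinstance(expr, int):
        return "Integer"
    if isinstance(expr, str):
        if expr not in env:                       # scope rule
            raise NameError(f"{expr} is undefined")
        return env[expr]
    op, left, right = expr
    lt, rt = type_of(left, env), type_of(right, env)
    if lt != "Integer" or rt != "Integer":        # type rule for + and <
        raise TypeError(f"operands of {op} must be Integer, got {lt} and {rt}")
    return "Boolean" if op == "<" else "Integer"


def check_assign(name, expr, env):
    """Check `name := expr`: name must be declared and the types must agree."""
    if name not in env:
        raise NameError(f"{name} is undefined")
    actual = type_of(expr, env)
    if env[name] != actual:
        raise TypeError(f"cannot assign {actual} to {name}:{env[name]}")


# Example 2: n := m<4 is accepted, n := n+1 is rejected.
env = {"m": "Integer", "n": "Boolean"}
check_assign("n", ("<", "m", 4), env)
try:
    check_assign("n", ("+", "n", 1), env)
except TypeError as err:
    print("Type error:", err)
```

Example 1 corresponds to calling type_of(("+", "m", "x"), {"m": "Integer"}), which raises NameError because x is not in the environment.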

34
Semantics
Specification of semantics is concerned with specifying the
“meaning” of well-formed programs.
Terminology:
Expressions are evaluated and yield values (and may or may not
perform side effects).
Commands are executed and perform side effects.
Declarations are elaborated to produce bindings.

Side effects:
• change the values of variables
• perform input/output
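
The terminology above can be illustrated with a toy interpreter in which each kind of phrase has its own function. This is a sketch, not a definition of any real language's semantics: the tuple encodings of expressions, declarations and commands, and the use of a dictionary as the store and a list as the output channel, are assumptions.

```python
def evaluate(expr, store):
    """Expressions are evaluated and yield values."""
    if isinstance(expr, int):
        return expr
    if isinstance(expr, str):
        return store[expr]
    op, left, right = expr
    a, b = evaluate(left, store), evaluate(right, store)
    return a + b if op == "+" else a * b


def elaborate(decl, store):
    """Declarations are elaborated to produce bindings."""
    name, initial = decl
    store[name] = evaluate(initial, store)


def execute(cmd, store, output):
    """Commands are executed and perform side effects."""
    if cmd[0] == "assign":            # side effect: change a variable's value
        _, name, expr = cmd
        store[name] = evaluate(expr, store)
    elif cmd[0] == "putint":          # side effect: perform output
        output.append(evaluate(cmd[1], store))
    else:
        raise ValueError(f"unknown command {cmd[0]!r}")


store, output = {}, []
elaborate(("m", 2), store)
elaborate(("n", 0), store)
execute(("assign", "n", ("+", "m", 1)), store, output)
execute(("putint", ("*", "n", "m")), store, output)
print(store, output)
```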

35
The End

36
