lecture2
Describing Syntax: Terminology
• Alphabet: Σ; all strings over the alphabet: Σ*
• A language L is a set of sentences, L ⊆ Σ*
Describing Syntax: Terminology
• A language is a set of sentences
– Natural languages: English, Turkish, …
– Programming languages: C, Fortran, Java, …
– Formal languages: a*b*, 0^n 1^n
• String of the language:
– Sentences
– Program statements
– Words (aaaaabb, 000111)
Lexemes
• A lexeme is the lowest-level syntactic unit of a language (e.g., *, sum, begin)
• Lower-level constructs are given not by the syntax but by lexical specifications.
• Examples: identifiers, constants, operators, special words:
    total, sum_of_products, 1254, ++, (
• So, a language is considered a set of strings of lexemes rather than strings of characters.
Token
• A token of a language is a
category of lexemes
• For example, identifier is a token whose lexemes (instances) include sum and total
Example in Java Language
x = (y+3.1) * z_5 ;
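The statement above can be split into lexemes, each belonging to a token. The following is a minimal sketch of that split; the token names (IDENTIFIER, ASSIGN_OP, and so on) and the regular expressions are assumptions chosen for illustration and cover only what this one statement needs, not the full Java lexical specification.

```python
import re

# Hypothetical token categories for this sketch; real Java has many more.
TOKEN_SPEC = [
    ("FLOAT_LITERAL", r"\d+\.\d+"),
    ("INT_LITERAL",   r"\d+"),
    ("IDENTIFIER",    r"[A-Za-z_]\w*"),
    ("ASSIGN_OP",     r"="),
    ("ADD_OP",        r"\+"),
    ("MULT_OP",       r"\*"),
    ("LEFT_PAREN",    r"\("),
    ("RIGHT_PAREN",   r"\)"),
    ("SEMICOLON",     r";"),
    ("SKIP",          r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return a list of (token, lexeme) pairs, skipping whitespace."""
    pairs = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            pairs.append((match.lastgroup, match.group()))
        pos = match.end()
    return pairs


for token, lexeme in tokenize("x = (y+3.1) * z_5 ;"):
    print(f"{lexeme:6} {token}")
```

Here x, y, and z_5 are three different lexemes of the same token (identifier), which is the distinction the previous slides draw.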
Describing Syntax
• Higher level constructs are given by syntax
rules.
• Syntax rules specify which strings
from Σ* are in the language
• Examples: organization of the
program, loop structures, assignment,
expressions, subprogram definitions,
and calls.
Elements of Syntax
• An alphabet of symbols
• Symbols are terminal and non-terminal
– Terminals cannot be broken down
– Non-terminals can be broken down further
• Grammar rules that express how
symbols are combined to make legal
sentences
• Rules are of the general form
non-terminal symbol ::= list of zero or more
terminals or non-terminals
• One uses rules to recognize (parse) or
generate legal sentences
Formal Definition of Languages
• Recognizers
– A recognition device reads input strings over the
alphabet of the language and decides whether
the input strings belong to the language
– Example: syntax analysis part of a compiler
• Generators
– A device that generates sentences of a language
– One can determine whether a particular sentence is syntactically correct by comparing it to the structure of the generator
Formal Methods of Describing Syntax
• Grammars: formal language-generation
mechanisms.
• In the mid-1950s, Chomsky described four classes of generative devices (or grammars) that define four classes of languages.
– Two of these grammar classes, named context-free and regular, turned out to be useful for describing the syntax of programming languages.
– Regular grammars: the forms of the tokens of programming languages
– Context-free grammars: the syntax of whole programming languages
Deterministic Finite Automata
Σ = {0,1}
Example inputs:
    1110000100000
    00000000000001
    01
    0000
    111111
There should be at least one “01” somewhere in the input.
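The acceptance condition above (the input contains "01" at least once) can be checked by a three-state deterministic finite automaton. The sketch below is one possible encoding; the state names start, seen0, and seen01 are assumptions introduced for this example and do not come from the slide.

```python
# States (names are assumptions for this sketch):
#   "start"  - no 0 has been read yet
#   "seen0"  - the last symbol read was 0, so a following 1 completes "01"
#   "seen01" - "01" has appeared; accepting state that is never left
TRANSITIONS = {
    ("start", "0"): "seen0",
    ("start", "1"): "start",
    ("seen0", "0"): "seen0",
    ("seen0", "1"): "seen01",
    ("seen01", "0"): "seen01",
    ("seen01", "1"): "seen01",
}
ACCEPTING = {"seen01"}


def accepts(string):
    """Run the DFA over a string on the alphabet {0, 1}."""
    state = "start"
    for symbol in string:
        if (state, symbol) not in TRANSITIONS:
            raise ValueError(f"symbol {symbol!r} is not in the alphabet")
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING


for example in ["1110000100000", "00000000000001", "01", "0000", "111111"]:
    print(example, accepts(example))
```

The first three example inputs are accepted; 0000 and 111111 are rejected because neither contains a 0 immediately followed by a 1.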
Regular Languages
• Tokens can be generated using three formal
rules
– Concatenation
– Alternation (|)
– Kleene closure (*): repetition an arbitrary number of times

    Regular expression    Language
    0 | 1                 L = {0, 1}
    (0 | 1) (0 | 1)       L = {00, 01, 10, 11}
    0 (1 | 0)             L = {01, 00}
    0*                    L = {empty, 0, 00, 000, 0000, …}
    (0 | 10)*             L = {0, 010, 00000, 1010, 1000, 0010, …}
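The pairings of regular expressions and languages above can be checked mechanically. The sketch below uses Python's re module, whose syntax for concatenation, | and * coincides with the notation on this slide; the helper name matches and the choice of counterexamples are assumptions for illustration.

```python
import re


def matches(pattern, string):
    """True if the whole string belongs to the language of the pattern."""
    return re.fullmatch(pattern, string) is not None


# Each regular expression with some strings listed for its language.
EXAMPLES = {
    "0|1": ["0", "1"],
    "(0|1)(0|1)": ["00", "01", "10", "11"],
    "0(1|0)": ["01", "00"],
    "0*": ["", "0", "00", "000", "0000"],
    "(0|10)*": ["0", "010", "00000", "1010", "1000", "0010"],
}

for pattern, members in EXAMPLES.items():
    print(pattern, all(matches(pattern, s) for s in members))

# Strings that are NOT in the corresponding language.
print(matches("0(1|0)", "10"))    # does not start with 0
print(matches("(0|10)*", "011"))  # a 1 must always be followed by a 0
```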
0*1*
0*01*1
0+1+
+ means 1 or more
Is the following language regular?
L = {a number of 0s followed by an equal number of 1s}
L = {0^n 1^n, n >= 0}
Context Free Grammars
L = {a^n b^n, n >= 0}
L = {a^n b^n, n >= 1}
S -> aSb | ab
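The rule S -> aSb | ab generates exactly the strings a^n b^n with n >= 1. As a small sketch, the first function below produces the derivation for a given n and the second recognizes the language by undoing the rule; the function names derive and in_language are assumptions made for this example.

```python
def derive(n):
    """Sentential forms for a^n b^n (n >= 1): apply S -> aSb n-1 times, then S -> ab."""
    if n < 1:
        raise ValueError("this grammar only generates strings with n >= 1")
    form = "S"
    steps = [form]
    for _ in range(n - 1):
        form = form.replace("S", "aSb")
        steps.append(form)
    form = form.replace("S", "ab")
    steps.append(form)
    return steps


def in_language(s):
    """Recognizer that mirrors the two alternatives of the rule."""
    if s == "ab":                       # S -> ab
        return True
    # S -> aSb: strip one a from the front and one b from the back, then recurse
    return len(s) >= 4 and s[0] == "a" and s[-1] == "b" and in_language(s[1:-1])


print(" => ".join(derive(3)))
print(in_language("aaabbb"), in_language("aabbb"), in_language("abab"))
```

The recursion in in_language has to remember how many a's were stripped, which is the kind of unbounded counting a finite automaton cannot do.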
Regular Language vs CFL
• Regular languages: tokens can be generated using three formal rules
– Concatenation
– Alternation (|)
– Kleene closure (*): repetition an arbitrary number of times
• Context-Free Grammars
– Developed by Noam Chomsky in the mid-1950s
– Language generators, meant to describe the syntax of natural languages
– Define the class of context-free languages
– Programming languages are contained in the class of CFLs.
Backus-Naur Form (BNF)
• A notation to describe the syntax of
programming languages.
• Named after
– John Backus – Algol 58
– Peter Naur – Algol 60
• A metalanguage is a language used to
describe another language.
• BNF is a metalanguage used to describe PLs.
BNF Fundamentals
• BNF uses abstractions for
syntactic structures.
<LHS> → <RHS>
• LHS: abstraction being defined
• RHS: definition
• “→” means “can have the
form”
• Sometimes ::= is used for →
BNF Fundamentals
• For example, a Java assignment statement can be represented by the abstraction <assign>
• <assign> → <var> = <expression>
• This is a rule or production
• Here, <var> and <expression> must also be defined.
• Example instances of this abstraction can be
    total = sub1 + sub2
    myVar = 4
BNF Fundamentals
• These abstractions are called variables or nonterminals of a grammar.
• Lexemes and tokens are the terminals of a grammar.
• Nonterminals are often enclosed in angle brackets
• Examples of BNF rules:
    <ident_list> → identifier | identifier, <ident_list>
    <if_stmt> → if <logic_expr> then <stmt>
BNF Fundamentals
• A formal definition of rule:
A rule has a left-hand side (LHS), which
is a nonterminal, and a right-hand
side (RHS), which is a string of
terminals and/or nonterminals
<LHS> → <RHS>
• Grammar: a finite non-empty set of rules
Additional Notes on Terminals and
NonTerminals
• Terminals are the smallest block we
consider in our grammars.
• A terminal could be either:
– a quoted literal
– a regular expression
– a name referring to a terminal definition.
Terminals
• Let’s see some typical terminals:
– identifiers: these are the names used for
variables, classes, functions, methods and
so on.
– keywords: almost every language uses
keywords. They are exact strings that are
used to indicate the start of a definition
(think about class in Java or def in Python),
a modifier (public, private, static, final,
etc.) or control flow structures (while, for,
until, etc.)
Terminals
– literals: these let us define values in our languages. We can have string literals, numeric literals, char literals, boolean literals (though we could consider them keywords as well), array literals, map literals, and more, depending on the language
– separators and delimiters: like colons, semicolons, commas, parentheses, brackets, braces
– whitespace: spaces, tabs, newlines.
– comments
Terminals and Non-terminals
Non-terminals
• Examples of non-terminals are:
– program/document: represent the entire file
– module/classes: group several declarations together
– functions/methods: group statements together
– statements: these are the single instructions. Some of them can contain other statements. Example: loops
– expressions: typically used within statements and can be composed in …
Examples
An initial example
• Consider the sentence “Mary greets John”
• A simple grammar
    <sentence> ::= <subject><predicate>
    <subject> ::= Mary
    <predicate> ::= <verb><object>
    <verb> ::= greets
    <object> ::= John
Alternations
• Multiple definitions can be separated by | (OR).
    <object> ::= John | Alfred
• This adds “Mary greets Alfred” to the legal sentences
    <subject> ::= Mary | John | Alfred
    <object> ::= Mary | John | Alfred
• Alternative to the previous grammar
    <sentence> ::= <subject><predicate>
    <subject> ::= <noun>
    <predicate> ::= <verb><object>
    <verb> ::= greets
    <object> ::= <noun>
    <noun> ::= Mary | John | Alfred
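Because this grammar has no recursion, it generates a finite set of sentences that can be listed exhaustively. The sketch below does so, assuming the rule <noun> ::= Mary | John | Alfred for the shared nonterminal; the dictionary layout and the function name expand are choices made for this example only.

```python
from itertools import product

# Each nonterminal maps to a list of alternatives; each alternative is a list of symbols.
# The <noun> rule is an assumption: it collects the three names used on the slide.
GRAMMAR = {
    "<sentence>": [["<subject>", "<predicate>"]],
    "<subject>": [["<noun>"]],
    "<predicate>": [["<verb>", "<object>"]],
    "<verb>": [["greets"]],
    "<object>": [["<noun>"]],
    "<noun>": [["Mary"], ["John"], ["Alfred"]],
}


def expand(symbol):
    """Return every list of words derivable from symbol (finite: no recursive rules)."""
    if symbol not in GRAMMAR:           # terminal symbol
        return [[symbol]]
    results = []
    for alternative in GRAMMAR[symbol]:
        parts = [expand(s) for s in alternative]
        for combination in product(*parts):
            results.append([word for piece in combination for word in piece])
    return results


sentences = [" ".join(words) for words in expand("<sentence>")]
print(len(sentences))
for sentence in sentences:
    print(sentence)
```

Three subjects times three objects give nine legal sentences, including "Mary greets Alfred".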
Infinite Number of Sentences
<object> ::= John |
             John again |
             John again and again |
             ….
Instead use recursive definition
<object> ::= John |
             John <repeat factor>
<repeat factor> ::= again |
                    again and <repeat factor>
Simple example for PLs
• How can you describe simple arithmetic?
    … | <identifier><digit>
PASCAL/Ada If Statement
<if_stmt> → if <logic_expr> then <stmt>
<if_stmt> → if <logic_expr> then <stmt> else <stmt>
Or
<if_stmt> → if <logic_expr> then <stmt>
          | if <logic_expr> then <stmt> else <stmt>
Grammars and Derivations
• A grammar is a generative device for defining languages
• The sentences of the language are generated through a sequence of applications of the rules, starting from a special nonterminal called the start symbol.
• Such a generation is called a derivation.
• The start symbol represents a complete program, so it is usually named <program>.
An Example Grammar
<program> → begin <stmt_list> end
<stmt_list> → <stmt> |
<stmt> ; <stmt_list>
<stmt> → <var> := <expression>
<var> → A | B | C
<expression>→ <var> |
<var> <arith_op> <var>
<arith_op> → + | - | * | /
Derivation
• In order to check if a given string
represents a valid program in the
language, we try to derive it in
the grammar.
• Derivation starts from the start symbol
<program>.
• At each step we replace a
nonterminal with its definition (RHS
of the rule).
Derivations
• Every string of symbols in a derivation is a sentential form
• A sentence is a sentential form that has only terminal symbols
• A leftmost derivation is one in which the leftmost nonterminal in each sentential form is the one that is expanded
• A derivation may be neither leftmost nor rightmost
Derive string: begin A := B; C := A * B end

<program> → begin <stmt_list> end
<stmt_list> → <stmt> | <stmt> ; <stmt_list>
<stmt> → <var> := <expression>
<var> → A | B | C
<expression> → <var> | <var> <arith_op> <var>
<arith_op> → + | - | * | /

<program> ⇒ begin <stmt_list> end
          ⇒ begin <stmt> ; <stmt_list> end
          ⇒ begin <var> := <expression>; <stmt_list> end
          ⇒ begin A := <expression>; <stmt_list> end
          ⇒ begin A := <var>; <stmt_list> end
          ⇒ begin A := B; <stmt_list> end
          ⇒ begin A := B; <stmt> end
          ⇒ begin A := B; <var> := <expression> end
          ⇒ begin A := B; C := <expression> end
          ⇒ begin A := B; C := <var><arith_op><var> end
          ⇒ begin A := B; C := A <arith_op> <var> end
          ⇒ begin A := B; C := A * <var> end
          ⇒ begin A := B; C := A * B end
If the leftmost nonterminal is always the one replaced, the derivation is called a leftmost derivation.
Derive string: begin A := B; C := A * B end

<program> → begin <stmt_list> end
<stmt_list> → <stmt> | <stmt> ; <stmt_list>
<stmt> → <var> := <expression>
<var> → A | B | C
<expression> → <var> | <var> <arith_op> <var>
<arith_op> → + | - | * | /

<program> ⇒ begin <stmt_list> end
          ⇒ begin <stmt> ; <stmt_list> end
          ⇒ begin <stmt> ; <stmt> end
          ⇒ begin <stmt> ; <var> := <expression> end
          ⇒ begin <stmt> ; <var> := <var><arith_op><var> end
          ⇒ begin <stmt> ; <var> := <var><arith_op> B end
          ⇒ begin <stmt> ; <var> := <var> * B end
          ⇒ begin <stmt> ; <var> := A * B end
          ⇒ begin <stmt> ; C := A * B end
          ⇒ begin <var> := <expression>; C := A * B end
          ⇒ begin <var> := <var>; C := A * B end
          ⇒ begin <var> := B ; C := A * B end
          ⇒ begin A := B; C := A * B end
If the rightmost nonterminal is always the one replaced, the derivation is called a rightmost derivation.
A Parse Tree
• A parse tree is a hierarchical representation of a derivation
[Figure: parse tree for begin A := B; C := A * B end]
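A parse tree for begin A := B; C := A * B end can be built with one function per nonterminal of the example grammar. The following is a sketch under stated assumptions: the input is already split into tokens, the tree is represented as nested tuples whose first element is the nonterminal, and all function names are invented for this example.

```python
def parse_program(tokens):
    """Recursive-descent parser for the example grammar; returns a nested-tuple parse tree."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(terminal):
        nonlocal pos
        if peek() != terminal:
            raise SyntaxError(f"expected {terminal!r}, found {peek()!r}")
        pos += 1
        return terminal

    def var():                      # <var> -> A | B | C
        if peek() not in ("A", "B", "C"):
            raise SyntaxError(f"expected a variable, found {peek()!r}")
        return ("<var>", expect(peek()))

    def expression():               # <expression> -> <var> | <var> <arith_op> <var>
        children = [var()]
        if peek() in ("+", "-", "*", "/"):
            children.append(("<arith_op>", expect(peek())))
            children.append(var())
        return ("<expression>", *children)

    def stmt():                     # <stmt> -> <var> := <expression>
        return ("<stmt>", var(), expect(":="), expression())

    def stmt_list():                # <stmt_list> -> <stmt> | <stmt> ; <stmt_list>
        children = [stmt()]
        if peek() == ";":
            children.append(expect(";"))
            children.append(stmt_list())
        return ("<stmt_list>", *children)

    tree = ("<program>", expect("begin"), stmt_list(), expect("end"))
    if pos != len(tokens):
        raise SyntaxError("unexpected input after 'end'")
    return tree


def show(node, depth=0):
    """Print the tree with one level of indentation per tree level."""
    if isinstance(node, tuple):
        print("  " * depth + node[0])
        for child in node[1:]:
            show(child, depth + 1)
    else:
        print("  " * depth + node)


show(parse_program("begin A := B ; C := A * B end".split()))
```

Reading the leaves of the printed tree from top to bottom gives back the original sentence.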
Another Example
<program> → <stmt_list>
<stmt_list> → <stmt> | <stmt> ; <stmt_list>
<stmt> → <var> = <expr>
<var> → a | b | c | d
<expr> → <term> + <term> | <term> - <term>
<term> → <var> | const

Derive string: a = b + const

Rightmost derivation:
<program> ⇒ <stmt_list>
          ⇒ <stmt>
          ⇒ <var> = <expr>
          ⇒ <var> = <term> + <term>
          ⇒ <var> = <term> + const
          ⇒ <var> = <var> + const
          ⇒ <var> = b + const
          ⇒ a = b + const

Leftmost derivation:
<program> ⇒ <stmt_list>
          ⇒ <stmt>
          ⇒ <var> = <expr>
          ⇒ a = <expr>
          ⇒ a = <term> + <term>
          ⇒ a = <var> + <term>
          ⇒ a = b + <term>
          ⇒ a = b + const

[Figure: parse tree for a = b + const]
Copyright © 2009 Addison-Wesley. All rights reserved.
Ambiguity in Grammars
Example
• Given the following grammar
    <assign> ::= <id> = <expr>
    <id> ::= A | B | C
    <expr> ::= <expr> + <expr>
             | <expr> * <expr>
             | (<expr>)
             | <id>
[Figure: parse trees for A = B + C * A]
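Two leftmost derivations for A = B + C * A
    <assign> ::= <id> = <expr>
    <id> ::= A | B | C
    <expr> ::= <expr> + <expr>
             | <expr> * <expr>
             | (<expr>)
             | <id>

First derivation (the + rule is applied first):
<assign> => <id> = <expr>
         => A = <expr>
         => A = <expr> + <expr>
         => A = <id> + <expr>
         => A = B + <expr>
         => A = B + <expr> * <expr>
         => A = B + <id> * <expr>
         => A = B + C * <expr>
         => A = B + C * <id>
         => A = B + C * A

Second derivation (the * rule is applied first):
<assign> => <id> = <expr>
         => A = <expr>
         => A = <expr> * <expr>
         => A = <expr> + <expr> * <expr>
         => A = <id> + <expr> * <expr>
         => A = B + <expr> * <expr>
         => A = B + <id> * <expr>
         => A = B + C * <expr>
         => A = B + C * <id>
         => A = B + C * A

The two derivations group the operands differently: with operand values 3, 4, and 5, the first corresponds to 3 + [4 * 5] and the second to [3 + 4] * 5.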
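A grammar is ambiguous when some sentence has more than one parse tree, and for a small grammar this can be checked by counting. The sketch below counts the distinct parse trees of a token sequence as an <expr> in the grammar above; the function name count_parse_trees and the dynamic-programming approach are assumptions chosen for this illustration, not something the slides prescribe.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Count distinct parse trees of tokens as an <expr> in the ambiguous grammar."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of ways tokens[i:j] can be derived from <expr>.
        total = 0
        if j - i == 1 and tokens[i] in ("A", "B", "C"):
            total += 1                                   # <expr> ::= <id>
        if j - i >= 3 and tokens[i] == "(" and tokens[j - 1] == ")":
            total += count(i + 1, j - 1)                 # <expr> ::= ( <expr> )
        for k in range(i + 1, j - 1):
            if tokens[k] in ("+", "*"):
                # <expr> ::= <expr> + <expr> | <expr> * <expr>, split at operator k
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))


print(count_parse_trees("B + C * A".split()))
print(count_parse_trees("B + C".split()))
print(count_parse_trees("( B + C ) * A".split()))
```

The right-hand side B + C * A has two parse trees, which matches the two leftmost derivations; adding parentheses leaves only one.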
A = B + C + A
    <assign> ::= <id> = <expr>
    <id> ::= A | B | C
    <expr> ::= <expr> + <expr>
             | <expr> * <expr>
             | (<expr>)
             | <id>
A = 3, B = 4, C = 5
3 + 4 + 5 can be grouped as 3 + [4 + 5] or as [3 + 4] + 5
Single leftmost derivation for A = B + C
    <assign> ::= <id> = <expr>
    <id> ::= A | B | C
    <expr> ::= <expr> + <expr>
             | <expr> * <expr>
             | (<expr>)
             | <id>
Operator Precedence
• In mathematics, the * operation has a higher precedence than +
• This can be implemented with extra nonterminals

Ambiguous grammar:
    <assign> ::= <id> = <expr>
    <id> ::= A | B | C
    <expr> ::= <expr> + <expr>
             | <expr> * <expr>
             | (<expr>)
             | <id>

Grammar with precedence:
    <assign> ::= <id> = <expr>
    <id> ::= A | B | C
    <expr> ::= <expr> + <term>
             | <term>
    <term> ::= <term> * <factor>
             | <factor>
    <factor> ::= (<expr>)
               | <id>
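The effect of the extra nonterminals <term> and <factor> can be seen by parsing with one function per nonterminal. The sketch below is an illustration under stated assumptions: the left-recursive rules are implemented with loops rather than recursion (a direct recursive call would never terminate), the tree is simplified to nested tuples of the form (operator, left, right), and the function names are invented for this example.

```python
def parse_expr(tokens):
    """Parse an <expr> of the precedence grammar into (operator, left, right) tuples."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def factor():                   # <factor> ::= ( <expr> ) | <id>
        nonlocal pos
        token = peek()
        if token == "(":
            pos += 1
            node = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            pos += 1
            return node
        if token in ("A", "B", "C"):
            pos += 1
            return token
        raise SyntaxError(f"unexpected token {token!r}")

    def term():                     # <term> ::= <term> * <factor> | <factor>
        nonlocal pos
        node = factor()
        while peek() == "*":        # loop replaces the left recursion
            pos += 1
            node = ("*", node, factor())
        return node

    def expr():                     # <expr> ::= <expr> + <term> | <term>
        nonlocal pos
        node = term()
        while peek() == "+":        # loop replaces the left recursion
            pos += 1
            node = ("+", node, term())
        return node

    tree = expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


print(parse_expr("B + C * A".split()))
print(parse_expr("( B + C ) * A".split()))
print(parse_expr("B + C + A".split()))
```

In the first tree the multiplication sits below the addition, so it is evaluated first; in the last tree the left addition sits below the right one, which is left associativity.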
Unique Parse Tree for A = B + C * A
Associativity of Operators
• What about equal precedence operators?
[Figure: parse tree in which <expr> expands repeatedly to <expr> + const, ending in const]
Associativity
• In a BNF rule, if the LHS appears at the beginning of the RHS, the rule is said to be left recursive
• Left recursion specifies left associativity
    <expr> ::= <expr> + <term>
             | <term>
• Left associativity: in the parse tree, the left addition is lower than the right addition, so it is performed first
Is this ambiguous?
<stmt> ::= <if_stmt> | <other_stmt>
<if_stmt> ::= if <logic_expr> then <stmt>
| if <logic_expr> then <stmt> else <stmt>
<other_stmt> ::= S1 | S2
<logic_expr> ::= L1 | L2
Derive: if L1 then if L2 then S1 else S2
[Figure: two parse trees for this sentence, one attaching else S2 to the outer if and one attaching it to the inner if]
An Unambiguous grammar for “if then else”
• Dangling else problem: there are more ifs than elses
• To design an unambiguous if-then-else statement we have to decide which if a dangling else belongs to
• Most PLs adopt the following rule:
– “an else is matched with the closest previous unmatched if statement” (unmatched if = else-less if)
• The grammar that follows encodes this rule, so every statement has a unique parse tree
<stmt> ::= <matched> | <unmatched>
<matched> ::= if <logic_expr> then <matched> else <matched>
| any non-if-statement
<unmatched> ::= if <logic_expr> then <stmt>
| if <logic_expr> then <matched> else <unmatched>
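The rule "an else is matched with the closest previous unmatched if" can be demonstrated with a small parser. Note that the sketch below does not implement the <matched>/<unmatched> nonterminals literally; it uses the common recursive-descent shortcut of consuming an else as soon as one is seen, which yields the same pairing. The terminals L1, L2, S1, S2 are taken from the earlier example, and the tuple layout is an assumption for illustration.

```python
def parse_stmt(tokens):
    """Parse an if-then-else statement; each else attaches to the closest unmatched if."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(terminal):
        nonlocal pos
        if peek() != terminal:
            raise SyntaxError(f"expected {terminal!r}, found {peek()!r}")
        pos += 1
        return terminal

    def stmt():
        if peek() == "if":
            expect("if")
            if peek() not in ("L1", "L2"):
                raise SyntaxError(f"expected a logic expression, found {peek()!r}")
            condition = expect(peek())
            expect("then")
            then_branch = stmt()
            if peek() == "else":
                # Greedy choice: the innermost if that is still open takes the else.
                expect("else")
                return ("if", condition, then_branch, stmt())
            return ("if", condition, then_branch)
        if peek() in ("S1", "S2"):
            return expect(peek())
        raise SyntaxError(f"unexpected token {peek()!r}")

    tree = stmt()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


print(parse_stmt("if L1 then if L2 then S1 else S2".split()))
```

The result nests else S2 inside the inner if L2, leaving the outer if L1 without an else branch.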
Draw the parse tree
<stmt> ::= <matched> | <unmatched>
<matched> ::= if <logic_expr> then <matched> else <matched>
| <other_stmt>
<unmatched> ::= if <logic_expr> then <stmt>
| if <logic_expr> then <matched> else <unmatched>
If L1 then if L2 then S1
else S2
Extended BNF
• EBNF: Same power but more convenient
• Optional parts are placed in brackets [ ]
    [X] : X is optional (0 or 1 occurrence)
    <writeln> ::= WRITELN [(<item_list>)]
    <selection> ::= if (<expr>) <stmt> [else <stmt>]
• Repetitions (0 or more) are placed inside braces { }
    {X} : 0 or more occurrences
    <identlist> = <identifier> {, <identifier>}
• Alternative parts of RHSs are placed inside parentheses and separated via vertical bars
    (X1 | X2 | X3) : choose X1 or X2 or X3
    <term> → <term> (+|-) const
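One reason EBNF is convenient is that its braces translate directly into loops in a hand-written parser. As a sketch, the function below parses <identlist> = <identifier> {, <identifier>}; the definition of an identifier as a letter or underscore followed by letters, digits, or underscores is an assumption, since the slides do not define <identifier>.

```python
import re

IDENTIFIER = r"[A-Za-z_]\w*"   # assumed shape of an identifier for this sketch


def parse_ident_list(text):
    """Parse <identlist> = <identifier> {, <identifier>} and return the names."""
    # Split the input into identifiers, commas, and any other single character.
    tokens = re.findall(IDENTIFIER + r"|,|\S", text)
    pos = 0

    def identifier():
        nonlocal pos
        if pos < len(tokens) and re.fullmatch(IDENTIFIER, tokens[pos]):
            pos += 1
            return tokens[pos - 1]
        raise SyntaxError("expected an identifier")

    names = [identifier()]
    while pos < len(tokens) and tokens[pos] == ",":   # { , <identifier> }: 0 or more times
        pos += 1
        names.append(identifier())
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return names


print(parse_ident_list("total, sum_of_products, x"))
```

An optional part in brackets would map to an if statement in the same way that the braces map to the while loop here.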
BNF vs Extended BNF
• BNF
    <expr> ::= <expr> + <term>
             | <expr> - <term>
             | <term>
    <term> ::= <term> * <factor>
             | <term> / <factor>
             | <factor>
    <factor> ::= <exp> ** <factor>
               | <exp>
    <exp> ::= (<expr>)
            | <id>
BNF vs Extended BNF
• EBNF
    <expr> ::= <term> {(+ | -) <term>}
    <term> ::= <factor> {(* | /) <factor>}
    <factor> ::= <exp> {** <exp>}
    <exp> ::= (<expr>)
            | id
Recent Variations in EBNF
• Alternative RHSs are put on separate
lines
• Use of a colon instead of =>
• Use of opt for optional parts
• Use of oneof for choices