0% found this document useful (0 votes)
58 views

15 Syntax Parsing

This document describes syntax description and parsing of programming languages. It begins with definitions related to syntax, such as sentences, lexemes, tokens, and syntax recognition vs generation. It then discusses Backus-Naur Form (BNF) and context-free grammars (CFGs), which are used to formally describe a language's syntax. Specific grammar rules and examples of parse trees are provided. The document notes that ambiguous grammars can lead to different parse trees for the same code, and discusses how precedence rules resolve such ambiguities.

Uploaded by

Egga
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
58 views

15 Syntax Parsing

This document describes syntax description and parsing of programming languages. It begins with definitions related to syntax, such as sentences, lexemes, tokens, and syntax recognition vs generation. It then discusses Backus-Naur Form (BNF) and context-free grammars (CFGs), which are used to formally describe a language's syntax. Specific grammar rules and examples of parse trees are provided. The document notes that ambiguous grammars can lead to different parse trees for the same code, and discusses how precedence rules resolve such ambiguities.

Uploaded by

Egga
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 30

Programming Languages:Syntax Description and Parsing

Programming Languages:
Syntax Description and Parsing

Onur Tolga Şehitoğlu

Computer Engineering,METU

27 May 2009
Programming Languages:Syntax Description and Parsing

Outline
Programming Languages:Syntax Description and Parsing
Describing Syntax
Introduction

Introduction

Syntax: the form and structure of a program.


Semantics: meaning of a program
Language definitions are used by:
Programmers
Implementors of the language processors
Language designers
Programming Languages:Syntax Description and Parsing
Describing Syntax
Introduction

Definitions

A sentence is a string of characters over some alphabet


A language is a set of sentences
A lexeme is the lowest level syntactic unit of the language (i.e.
++, int, total)
A token is a category of lexemes (i.e. identifier )
Programming Languages:Syntax Description and Parsing
Describing Syntax
Introduction

Definitions

syntax recognition: read input strings of the language and


verify the input belonging to the language
syntax generation: generate sentences of the language (i.e.
from a given data structure)
Compilers and interpreters recognize syntax and convert it
into machine understandable form.
Programming Languages:Syntax Description and Parsing
Describing Syntax
Backus-Naur Form and CFGs

Backus-Naur Form and CFGs

CFG’s introduced by Noam Chomsky (mid 1950s)


Programming languages are usually in context free language
class
BNF introduced by John Bakus and modified by Peter Naur
for describing Algol language
BNF is equivalent to CFGs. It is a meta-language that
decribes other languages
Extended BNF improves readability of BNF
Programming Languages:Syntax Description and Parsing
Describing Syntax
Backus-Naur Form and CFGs

A Grammar Rule

hwhile stmti → while ( hlogic expri ) hstmti


LHS is a non-terminal denoting an intermediate phrase
LHS can be defined (rewritten) as the RHS sequence which
can contain terminals (lexems and tokens) of the language
and other non-terminals
Non-terminals are denoted as strings enclosed in angle
brackets.
::= may be used in BNF notation instead of the arrow
| is used to combine multiple rules with same LHS in a single
rule
hlgc consi ::= true ≡ hlgc consi ::= true | false
hlgc consi ::= false
Programming Languages:Syntax Description and Parsing
Describing Syntax
Context Free Grammar

Context Free Grammar

A grammar G is defined as G = (V , Σ, R, S):


N, finite set of non terminals
Σ, finite set of terminals
R is a set of grammar rules. A relation from V to (V ∪ Σ)∗ .
S ∈ N the start symbol
Application of a rule maps one sentential form into the other
by replacing a non-terminal element in sentential form with its
right handside seuqence in the rule, u 7→ v .
n o

Language of a grammar L(G ) = w | w ∈ Σ∗ , S 7→ w
Programming Languages:Syntax Description and Parsing
Describing Syntax
Context Free Grammar

Recursive or list like structures can be represented using


recursion
hexpr listi → hexpri , hexpr listi
hbtreei → hheadi ( hbtreei , hbtreei )
A derivation starts with a starting non-terminal and rules are
applied repeteadly to end with a sentence containing only
terminal symbols.
leftmost derivation: always leftmost non-terminal is chosen for
replacement
rightmost derivation: always rightmost non-terminal is chosen
for replacement
Same sentence can be derived using leftmost, rightmost, or
other derivaionts.
Programming Languages:Syntax Description and Parsing
Describing Syntax
Context Free Grammar

Sample Grammar

hstmti → hidi = hexpri


hexpri → hexpri hopi hexpri | hidi
hopi → + | *
hidi → a | b | c
Leftmost derivation of a = a * b :
hstmti 7→ hidi = hexpri 7→ a = hexpri
7→ a = hidi hopi hexpri 7→ a = b hopi hexpri
7→ a = b * hexpri 7→ a = b * hidi 7→ a = b * c
Rightmost derivation of a = a * b :
hstmti 7→ hidi = hexpri 7→ hidi = hexpri hopi hexpri
7→ hidi = hexpri hopi hidi 7→ hidi = hexpri hopi b
7→ hidi = hexpri * b 7→ hidi = hidi * b
7→ hidi = a * b 7→ a = a * b
Programming Languages:Syntax Description and Parsing
Describing Syntax
Context Free Grammar

Parse Tree

Steps of a derivation gives the structure of the sentence. This


structure can be represented as a tree.
All non-terminals used in derivation are intermediate nodes.
Each grammar rule replaces the non-terminal node with is
children. Root node is the start symbol.
Terminal nodes are the leaf nodes.
preorder traversal of leaf nodes gives the resulting sentence.
leftmost and rightmos derivations can be retrieved by traversal
of the tree.
Programming Languages:Syntax Description and Parsing
Describing Syntax
Context Free Grammar

Parse Tree Example

a = a * b
hstmti

hidi = hexpri

a hexpri hopi hexpri

hidi * hidi

a b
Programming Languages:Syntax Description and Parsing
Describing Syntax
Context Free Grammar

Parse Tree Generation

A parse tree gives the structure of the program so semantics


of the program is related to this structure.
For example local scopes, evaluation order of expressions etc.
During compilation, parse trees might be required for code
generation, semantic analysis and optimization phases.
After a parse tree generated, it can be traversed to do various
tasks of compilation.
The processing of parse tree takes too long, so creation of
parse trees is usually avoided.
Approaches like syntax directed translation combines parsing
with code generation, semantic analysis etc..
Programming Languages:Syntax Description and Parsing
Describing Syntax
Ambigous Grammars

Ambigous Grammars

Consider a = a + b * c in our grammar:


hstmti hstmti

hidi = hexpri vs hidi = hexpri

a hexpri hopi hexpri a hexpri hopi hexpri

hidi + hexpri hopi hexpri


hexpri hopi hexpri * hidi
a hidi * hidi
hidi + hidi c
b c
a b
Both can be derived by the grammar!
Programming Languages:Syntax Description and Parsing
Describing Syntax
Ambigous Grammars

A grammar is called ambigous if same sentence can be derived


by following different set of rules, thus resulting in a different
parse tree
If structure changes semantic meaning of the program,
ambiguity is a serious problem.
Even if not, which one is the result?
i.e. Precedence of operators affects the value of the
expression.
Programming languages enforces precedence rules to resolv
ambiguity.
Solution:
1 design grammar not to be ambigous, or
2 during parsing, choose rules to generate the correct parse tree
Programming Languages:Syntax Description and Parsing
Describing Syntax
Ambigous Grammars

Precedence and Grammar

Operators with different precedence levels should be treated


differently
Higher precedence operations should be deep in the parse tree
→ their rules should be applied later.
Lower precedence operations should be closer to root →
applied earlier in derivation.
For each precedence level, define a non-terminal
One rewritten on the other based on the precedence lower to
higher
Programming Languages:Syntax Description and Parsing
Describing Syntax
Ambigous Grammars

Rewritten Grammar

hstmti
hstmti → hidi = hexpri hidi = hexpri
hexpri → hexpri + htermi | htermi
a hexpri + htermi
htermi → htermi * hfactori | hfactori
htermi htermi * hfactori
hfactori → hidi | ( hexpri )
hidi → a | b | c hfactori hfactori hidi

hidi hidi c

a b
htermi and hexpri has different precedence.
Once inside of htermi, there is no way to derive +
Only one parse possible
Programming Languages:Syntax Description and Parsing
Describing Syntax
Associativity

Associativity

Associativity of operators is another issue


a - b - c ≡ ( a - b ) - c or a - ( b - c)
Recursion of grammar defines how tree is constructed for
operators in the same level.
If left recursive, later operators in the sentence will be closer
to root, if right recursive earlier operators will be closer to root
left recursion implies left associativity, right recursion implies
right associativity.
Consider a + b + c in these grammars:
hexpri → hexpri + hidi | hidi hexpri → hidi + hexpri | hidi
hidi → a | b | c
vs hidi → a | b | c
Programming Languages:Syntax Description and Parsing
Describing Syntax
An Assignment Grammar

Sample Grammar

hasgni → hidi = hasgni | hidi = hexpri


hexpri → hexpri + htermi | htermi
htermi → htermi * hfactori | hfactori
hfactori → hpowi ^ hfactori | hpowi
hpowi → hidi | ( hexpri )
hidi → a | b | c

hasgni is right recursive like right associative C assignments.


hexpri and htermi are left recursive, * and + left associative
hfactori is right recursive for power operation ^ to be right associative.
precedence order is (...) ≺ ^ ≺ * ≺ + ≺ =
Programming Languages:Syntax Description and Parsing
Describing Syntax
An Assignment Grammar

a = a + b * c * a ^ b ^ c

hasgni

hidi = hexpri

a hexpri + htermi

htermi htermi * hfactori

hfactori htermi * hfactori hpowi ^ hfactori

hpowi hfactori hpowi hidi hpowi ^ hfactori

hidi hpowi hidi a hidi hpowi

a hidi c b hidi

b c
Programming Languages:Syntax Description and Parsing
Parsing
Compilation

Compilation
source code

Lexical Analysis

sequence of lexemes

Syntax analysis

parse tree

Intermediate while ( c o u n t e r < 12341) {


Symbol code generation f () ;
Optimization
Table and Semantic c o u n t e r += 12;
Analysis }

intermediate code
WHL LP ID LT I L I T RP LB
Code ID LP RP SC
generation ID PLEQ I L I T SC
RB
machine code

hwhlstmti
Programming Languages:Syntax Description and Parsing
Parsing
Parsing

Parsing

input: sequence of lexemes (output of lexical analysis) or


characters.
output: parse tree, intermediate code, translated code, or
sometimes only if document is valid or not.
Two main classes of parser:
Top down parsing
Tottom up parsing
Programming Languages:Syntax Description and Parsing
Parsing
Top-down Parsing

Top-down Parsing
Start from the starting non-terminal, apply grammar rules to
reach the input sentence
hassigni 7→ a = hexpr i 7→ a = hexpr i + htermi 7→
a = htermi + htermi 7→ a = hfacti + htermi 7→
a = a + htermi 7→ a = a + htermi ∗ hfacti 7→
a = a + hfacti ∗ hfacti 7→ a = a + b ∗ hfacti 7→
a=a+b∗a
Simplest form gives leftmost derivation of a grammar
processing input from left to right.
Left recursion in grammar is a problem. Elimination of left
recursion needed.
Deterministic parsing: Look at input symbols to choose next
rule to apply.
recursive descent parsers, LL family parsers are top-down
parsers
Programming Languages:Syntax Description and Parsing
Parsing
Top-down Parsing

Recursive Descent Parser


typedef enum { i d e n t , number , l p a r e n , r p a r e n , t i m e s ,
s l a s h , p l u s , minus } Symbol ;
int a c c e p t ( Symbol s ) { if (sym == s ) { n e x t (); return 1; }
return 0;
}
void f a c t o r ( void ) {
if ( a c c e p t ( i d e n t )) ;
else if ( a c c e p t ( number )) ;
else if ( a c c e p t ( l p a r e n )) { e x p r e s s i o n (); e x p e c t ( r p a r e n );}
else { e r r o r ( " factor : syntax error at " , c u r r s y m ); n e x t (); }
}
void term ( void ) {
f a c t o r ();
while ( a c c e p t ( t i m e s ) || a c c e p t ( s l a s h ))
f a c t o r ();
}
void e x p r e s s i o n ( void ) {
term ();
while ( a c c e p t ( p l u s ) || a c c e p t ( minus ))
term ();
}
Programming Languages:Syntax Description and Parsing
Parsing
Top-down Parsing

Each non-terminal realized as a parsing function


Parsing functions calls the right handside functions in
sequence
Rule choices are based on the current input symbol. accept
checks a terminal and consumes if matches.
Cannot handle direct or indirect left recursion. A function has
to call itself before anything else.
Hand coded, not flexible.
Programming Languages:Syntax Description and Parsing
Parsing
Top-down Parsing

LL Parsers

First L is ‘left to right input processing’, second is ‘leftmost


derivation’
Checks next N input symbols to decide on which rule to
apply: LL(N) parsing.
For example LL(1) checks the next input symbol only.
LL(N) parsing table: A table for V × ΣN 7→ R
for expanding a nonterminal NT ∈ V , looking at this table
and the next N input symbols, LL(N) parser chooses the
grammar rule r ∈ R to apply in the next step.
Programming Languages:Syntax Description and Parsing
Parsing
Top-down Parsing

Grammar and lookup table for a LL(1) parser:


1 S →E
2 S → −E a b - (
3 E → N+E S 1 1 2 1
4 E → (E ) E 3 3 4
5 N→a N 5 6
6 N→b
What if we add E → N to grammar?
You need an LL(2) grammar. What if N is recursive?
Programming Languages:Syntax Description and Parsing
Parsing
Bottom-up Parsing

Bottom-up Parsing

Start from input sentence and merge parts of sentential form


matching RHS of a rule into LHS at each step. Try to reach
the starting non-terminal. reach the input sentence

a = a + b ∗ a 7→ a = hfacti + b ∗ a 7→ a = htermi + b ∗ a 7→
a = hexpr i + b ∗ a 7→ a = hexpr i + hfacti ∗ a 7→
a = hexpr i + htermi ∗ a 7→ a = hexpr i + htermi ∗ hfacti 7→
a = hexpr i + htermi 7→ a = hexpr i 7→ hassigni

Simplest form gives rightmost derivation of a grammar (in


reverse) processing input from left to right.
Shift-reduce parsers are bottom-up:
shift: take a symbol from input and push to stack.
reduce: match and pop a RHS from stack and reduce into
LHS.
Programming Languages:Syntax Description and Parsing
Parsing
Bottom-up Parsing

Shift-Reduce Parser in Prolog

% Grammar is E - > E - T | E + T | T T -> a | b


r u l e ( e ,[ e , - , t ]).
r u l e ( e ,[ e ,+ , t ]).
r u l e ( e ,[ t ]).
r u l e ( t ,[ a ]).
r u l e ( t ,[ b ]).

p a r s e ([] ,[ S]) : - S = e . % s t a r t i n g symbol alone in the stack


% reduce : find RHS of a rule on stack , reduce it to LHS
p a r s e ( I n p u t , S t a c k ) : - match (LHS, S t a c k , R e m a i n d e r ) ,
p a r s e ( I n p u t ,[LHS| R e m a i n d e r ]).

% shift : n o n t e r m i n a l s are removed from input added on stack


p a r s e ([H| I n p u t ] , S t a c k ) : - member(X,[ a ,b , - ,+]) ,
p a r s e ( I n p u t ,[H| S t a c k ]).

% check if RSH of a rule is a prefix of Stack ( r e v e r s e d ).


match (LHS, L i s t ,L) : - r u l e (LHS,RHS) , r e v e r s e (RHS,NRHS) ,
p r e f i x (NRHS, L i s t ,L ).
Programming Languages:Syntax Description and Parsing
Parsing
Bottom-up Parsing

Shift reduce parser tries all non-deterministic shift


combinations to get all parses.
Deterministic bottom up parsers: LALR, SLR(1).

You might also like