Lecture 7-8 - Context-Free Grammars and Bottom-Up Parsing


Lecture 7-8: Context-free grammars and bottom-up parsing

Grammars and languages
Ambiguity and expression grammars
Parser generators
Bottom-up (shift/reduce) parsing
ocamlyacc example
Handling ambiguity in ocamlyacc

CS 421 Class 7-8, 2/7/12-2/9/12

Grammar for (almost) MiniJava


Program -> ClassDeclList
ClassDecl -> class id { VarDeclList MethodDeclList }
VarDecl -> Type id ;
MethodDecl -> Type id ( FormalList ) { VarDeclList StmtList return Exp ; }
Formal -> Type id
Type -> int [ ] | boolean | int | id
Stmt -> { StmtList }
      | if ( Exp ) Stmt else Stmt
      | while ( Exp ) Stmt
      | System.out.println ( Exp ) ;
      | id = Exp ;
      | id [ Exp ] = Exp ;
Exp -> Exp Op Exp | Exp [ Exp ] | Exp . length | Exp . id ( ExpList )
     | integer | true | false | id | this
     | new int [ Exp ] | new id ( ) | ! Exp | ( Exp )
Op -> && | < | + | - | *
ExpList -> Exp ExpRest | ε
ExpRest -> , Exp ExpRest | ε
FormalList -> Type id FormalRest | ε
FormalRest -> , Type id FormalRest | ε
ClassDeclList -> ClassDeclList ClassDecl | ε
MethodDeclList -> MethodDeclList MethodDecl | ε
VarDeclList -> VarDeclList VarDecl | ε
StmtList -> StmtList Stmt | ε

Grammars and languages

E.g. the program class C {int f () { return 0; }} has this parse tree:

The parser converts the source file to a parse tree. The AST is easily calculated from the parse tree, or constructed while parsing (without explicitly constructing the parse tree).

Context-free grammar (cfg) notation

A CFG is a set of productions A -> X1 X2 ... Xn (n ≥ 0). If n = 0, we may write either A -> or A -> ε.

A, B, C, ... ∈ N, the set of non-terminals. One non-terminal in the grammar is the start symbol.

T is the set of tokens, a.k.a. terminals.

X, Y, Z ∈ S = N ∪ T, the set of grammar symbols.

u, v, w ∈ T*; α, β, γ ∈ S*.

Productions A -> α, A -> β, ..., are abbreviated as A -> α | β | ...

A parse tree, or (concrete) syntax tree, from A is a tree whose root is labelled A, and whose internal nodes are labelled with non-terminals such that if a node is labelled B, its children are labelled X1, X2, ..., Xn, for some production B -> X1 X2 ... Xn. As a special case, if a node is labelled B, it may have a child labelled ε, if B has an ε-production.

An A-form of a grammar is any frontier of a parse tree (i.e. the labels of the leaf nodes, read left to right) from A, with ε's deleted. An A-sentence is an A-form in T*.

Sentential form, sentence, and parse tree refer to, respectively, an S-form, an S-sentence, and a parse tree from S, where S is the start symbol.

A ∈ N is nullable if ε is an A-sentence.

An ambiguous grammar is one for which at least one sentence has more than one parse tree.
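The definitions of parse tree, frontier, and sentence can be sketched in OCaml. This is only an illustration: the type and function names (symbol, tree, frontier, is_sentence) are made up here, and the example trees assume the small grammar E -> E + T | T, T -> id.

```ocaml
(* A grammar symbol is a non-terminal (N) or a terminal/token (T). *)
type symbol = N of string | T of string

(* A (possibly partial) parse tree. A Leaf carries any grammar symbol;
   a Node with an empty child list stands for an epsilon-production. *)
type tree =
  | Leaf of symbol
  | Node of string * tree list

(* Frontier: leaf labels read left to right. Epsilon nodes contribute
   nothing, which matches "with epsilons deleted". *)
let rec frontier (t : tree) : symbol list =
  match t with
  | Leaf s -> [s]
  | Node (_, kids) -> List.concat (List.map frontier kids)

(* An A-form is an A-sentence when it contains only terminals. *)
let is_sentence (form : symbol list) : bool =
  List.for_all (function T _ -> true | N _ -> false) form

(* E => E + T => E + id : a sentential form that is not a sentence. *)
let partial =
  Node ("E", [Leaf (N "E"); Leaf (T "+"); Node ("T", [Leaf (T "id")])])

(* A complete tree for id + id. *)
let full =
  Node ("E", [Node ("E", [Node ("T", [Leaf (T "id")])]);
              Leaf (T "+");
              Node ("T", [Leaf (T "id")])])
```

Under this encoding, a non-terminal B is nullable exactly when some tree rooted at B has an empty frontier, e.g. Node ("B", []) for a hypothetical rule B -> ε.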

A recognizer is a function of type string -> bool that says whether a string is a sentence. A parser is a recognizer that also constructs a parse tree (or AST). (In practice, a recognizer is usually easy to convert to a parser, by adding tree-constructing operations at appropriate places.)

An extended cfg is one whose right-hand sides can contain the regular expression operations *, +, and ?. These are used to abbreviate ordinary context-free rules:

α* should be replaced by a new non-terminal, say B, and additional rules B -> ε | B α (or, equivalently, B -> ε | α B).

α+ should be replaced by a new non-terminal, say B, and additional rules B -> α | B α (or, equivalently, B -> α | α B).

α? should be replaced by a new non-terminal, say B, and additional rules B -> α | ε.
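The three replacement rules can be written as one small OCaml function. This is a sketch under assumptions: the item and rule types and the desugar function are hypothetical names, a right-hand side is a list of symbol names, and an empty list stands for ε. It uses the left-recursive variants (B α rather than α B).

```ocaml
(* One item of an extended right-hand side; alpha is a list of symbol names. *)
type item =
  | Sym of string
  | Star of string list    (* alpha*  *)
  | Plus of string list    (* alpha+  *)
  | Opt of string list     (* alpha?  *)

(* An ordinary production: left-hand side and right-hand side ([] = epsilon). *)
type rule = string * string list

(* desugar fresh it returns the symbols that replace the item, together
   with the extra rules for the new non-terminal named fresh. *)
let desugar (fresh : string) (it : item) : string list * rule list =
  match it with
  | Sym x -> ([x], [])
  | Star alpha -> ([fresh], [(fresh, []); (fresh, fresh :: alpha)])
  | Plus alpha -> ([fresh], [(fresh, alpha); (fresh, fresh :: alpha)])
  | Opt alpha -> ([fresh], [(fresh, alpha); (fresh, [])])
```

For instance, desugar "B" (Star ["+"; "T"]) yields the symbol B plus the rules B -> ε and B -> B + T.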

Exercises on cfg notation

E -> E + T | T
T -> T * P | P
P -> id | int | ( P )

Parse tree for: x * 10 + y:

Give examples of each:

E-sentence:
T-sentence:
E-sentential form that is not a sentence:

More exercises on cfg notation


E A T P EA | A+T T (P ) id | int | ( P )

Transform this grammar to remove the Kleene star.

Show that both E+T and EA+T are sentential forms:


Expression grammars

Recall that the expression 3 + 4*5 + -7 has AST

The shape of this AST represents the precedence of multiplication and the left-associativity of addition. It ensures, for example, that eval would return the correct value.
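To see why the shape matters, here is a minimal sketch of an evaluator. The arith type and its constructors are assumptions made for this illustration, not the course's actual AST definition.

```ocaml
(* Hypothetical AST for integer arithmetic. *)
type arith =
  | Int of int
  | Neg of arith
  | Plus of arith * arith
  | Mult of arith * arith

(* Evaluate by recursion on the shape of the tree. *)
let rec eval (e : arith) : int =
  match e with
  | Int n -> n
  | Neg a -> 0 - eval a
  | Plus (a, b) -> eval a + eval b
  | Mult (a, b) -> eval a * eval b

(* Intended shape: (3 + (4 * 5)) + (-7) *)
let good = Plus (Plus (Int 3, Mult (Int 4, Int 5)), Neg (Int 7))

(* A shape that ignores precedence: ((3 + 4) * 5) + (-7) *)
let bad = Plus (Mult (Plus (Int 3, Int 4), Int 5), Neg (Int 7))
```

Here eval good is 16, while eval bad is 28: the same tokens, grouped differently, give a different value.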

Parsing produces a parse tree, which is translated to an AST. It simplifies this translation greatly if the shape of the concrete syntax tree correctly represents the precedences and associativities of operators.

Some expression grammars

GA: E -> id | E - E | E * E

GB: E -> id | id - E | id * E

GC: E -> id | E - id | E * id

GD: E -> T - E | T
    T -> id | id * T

GE: E -> E - T | T
    T -> id | T * id

GF: E -> T E'
    E' -> ε | - E
    T -> id T'
    T' -> ε | * T

GA: E -> id | E - E | E * E
x-y*z
x-y-z

Ambiguous?

Precedence?

Associativity?

GB: E -> id | id - E | id * E
x-y*z
x-y*z-w

x*y-z

Ambiguous?

Precedence?

Associativity?

GC: E -> id | E - id | E * id
x-y*z
x-y*z-w

x*y-z

Ambiguous?

Precedence?

Associativity?

GD: E -> T - E | T
    T -> id | id * T

x-y*z

x*y-z

x-y-z

Ambiguous?

Precedence?

Associativity?

GE: E -> E - T | T
    T -> id | T * id

x-y*z

x*y-z

x-y-z

Ambiguous?

Precedence?

Associativity?

GF: E -> T E'
    E' -> ε | - E
    T -> id T'
    T' -> ε | * T

x-y*z

x*y-z

x-y-z

Ambiguous?

Precedence?

Associativity?

Parser generators

Like lexer generators, these are programs that take as input a specification in the form of a context-free grammar, with an action associated with each production, and output a parser.

The most famous of all parser generators is yacc, which, ironically, stands for "yet another compiler compiler." Originally written to generate parsers in C, it has been copied in many other languages. We will use ocamlyacc.

We start with a small but complete example. Next week, we will discuss how to write a parser by hand, using the method of recursive descent.

Example - expression grammar

In this example, we will use ocamlyacc to create a parser for this grammar:

M -> E eof
E -> T | E + T | E - T
T -> P | T * P | T / P
P -> id | ( E )

It will produce ASTs of type exp:

(* File: exp.ml *)
type exp = Plus of exp * exp
         | Minus of exp * exp
         | Mult of exp * exp
         | Div of exp * exp
         | Id of string

Example - exprlex.mll

{
type token = PlusT | MinusT | TimesT | DivideT | OParenT | CParenT
           | IdT of string | EOF
}

let numeric = ['0' - '9']
let letter = ['a' - 'z' 'A' - 'Z']

rule tokenize = parse
  | "+"                                      {PlusT}
  | "-"                                      {MinusT}
  | "*"                                      {TimesT}
  | "/"                                      {DivideT}
  | "("                                      {OParenT}
  | ")"                                      {CParenT}
  | letter (letter | numeric | "_")* as id   {IdT id}
  | [' ' '\t' '\n']                          {tokenize lexbuf}
  | eof                                      {EOF}

Example - exprparse.mly

%{
open Exp   (* AST constructors defined in exp.ml *)
%}
%token <string> IdT
%token OParenT CParenT TimesT DivideT PlusT MinusT EOF
%start main
%type <Exp.exp> main
%%
expr:
    term                       {$1}
  | expr PlusT term            {Plus($1,$3)}
  | expr MinusT term           {Minus($1,$3)}
term:
    factor                     {$1}
  | term TimesT factor         {Mult($1,$3)}
  | term DivideT factor        {Div($1,$3)}
factor:
    IdT                        {Id $1}
  | OParenT expr CParenT       {$2}
main:
  | expr EOF                   {$1}

Shift-reduce parsing

ocamlyacc uses a method of parsing known as shift/reduce, a.k.a. bottom-up parsing. Here's how it works:

Keep a stack of grammar symbols (initially empty). Based on this stack, and the next input token (the "lookahead symbol"), take one of these actions:

Shift: Move the lookahead symbol to the stack.
Reduce A -> α: The symbols on top of the stack are α; replace them by A. (If you create a tree node here, you can construct the parse tree while parsing.)
Accept: When the stack consists of just the start symbol, and the input is exhausted.
Reject
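The loop described above can be sketched as a tiny driver in OCaml. Everything here is an assumption made for illustration: the grammar is L -> L ; E | E, E -> id, and the decide function is a decision table written by hand for that one grammar (a parser generator such as ocamlyacc would compute the table from the grammar instead).

```ocaml
type sym = Id | Semi | E | L          (* grammar symbols *)

type action =
  | Shift
  | Reduce of sym * sym list          (* production: lhs, rhs *)
  | Accept
  | Reject

(* The stack is a list with its top at the head. *)
let decide (stack : sym list) (lookahead : sym option) : action =
  match stack, lookahead with
  | Id :: _, _ -> Reduce (E, [Id])
  | E :: Semi :: L :: _, _ -> Reduce (L, [L; Semi; E])
  | [E], _ -> Reduce (L, [E])
  | [L], None -> Accept
  | [L], Some Semi -> Shift
  | [], Some Id -> Shift
  | [Semi; L], Some Id -> Shift
  | _ -> Reject

(* Drop the first n elements of a list. *)
let rec drop n xs =
  if n = 0 then xs
  else match xs with [] -> [] | _ :: t -> drop (n - 1) t

(* Run the parser and return the sequence of actions taken. *)
let rec run (stack : sym list) (input : sym list) : action list =
  let lookahead = match input with [] -> None | t :: _ -> Some t in
  match decide stack lookahead, input with
  | Shift, t :: rest -> Shift :: run (t :: stack) rest
  | Reduce (lhs, rhs), _ ->
      (* Pop the right-hand side, push the left-hand side. *)
      Reduce (lhs, rhs) :: run (lhs :: drop (List.length rhs) stack) input
  | Accept, _ -> [Accept]
  | (Shift | Reject), _ -> [Reject]
```

Calling run [] [Id; Semi; Id] traces the parse of an input of the form x; y. A Reduce step leaves the input untouched, since only Shift consumes a token.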

Shift-reduce example 1

L -> L ; E | E
E -> id

Input: x; y

Shift-reduce example 2

E -> E + T | T
T -> T * P | P
P -> id | int

Input: x + 10 * y

Parsing conflicts

Parsers for programming languages must be very efficient, but no efficient method for parsing arbitrary cfgs is known. So all parser generators accept only certain grammars. The hardest part of using any parser generator is getting the grammar into a form the parser generator will accept.

yacc (and ocamlyacc) accept a class of grammars known as LALR(1) grammars. We will not be able to describe exactly what distinguishes this class; that is done in CS 426.

If you present a non-LALR(1) grammar to ocamlyacc, it will report an error message, called a conflict.

To understand how to deal with conflicts, we need to look at shift-reduce parsing a little closer.

S/R parsing <-> parse tree

Each shift-reduce parse (i.e. each sequence of s/r actions) produces a unique parse tree.

Every parse tree is built by a unique s/r parse: Traverse the tree in post-order. A leaf node corresponds to a shift action, and an internal node to a reduce action.
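The post-order correspondence can be written down directly. This is a minimal sketch; the tree and action types are hypothetical, and a reduce is recorded by its left-hand side and the number of symbols popped. The example tree assumes the grammar A -> id | ( A ).

```ocaml
type tree =
  | Leaf of string                  (* a token *)
  | Node of string * tree list      (* a non-terminal and its children *)

type action =
  | Shift of string                 (* token shifted *)
  | Reduce of string * int          (* lhs, length of the rhs *)

(* Post-order traversal: children first (left to right), then the node. *)
let rec actions (t : tree) : action list =
  match t with
  | Leaf tok -> [Shift tok]
  | Node (lhs, kids) ->
      List.concat (List.map actions kids) @ [Reduce (lhs, List.length kids)]

(* Parse tree for the input ( id ). *)
let example = Node ("A", [Leaf "("; Node ("A", [Leaf "id"]); Leaf ")"])
```

For example, actions example gives: shift (, shift id, reduce A -> id, shift ), reduce A -> ( A ).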

An LALR(1) parser generator will accept a grammar only if it can determine, based on the stack and lookahead symbol, the unique correct action.

It follows that ambiguous grammars can never be LALR(1).

Shift-reduce example 3

Grammar: E -> E + E | E * E | id

Input: x + y + z

Show a parse tree, and corresponding s/r parse, that represents left-associativity of addition.

Shift-reduce example 3 (cont.)

Grammar: E -> E + E | E * E | id

Input: x + y + z

Show a parse tree, and corresponding s/r parse, that represents right-associativity of addition.

Dealing with ambiguity

yacc has a special trick for dealing with ambiguity: annotations telling the parser explicitly what to do in some cases.

For the previous grammar, there are four interesting inputs: x+y+z, x*y*z, x+y*z, x*y+z.

Consider x+y+z. It has two parse trees. For both, the stack looks the same until the second + is the lookahead symbol.

What is the right decision?

Dealing with ambiguity (cont.)

For x*y*z, consider where the stack configurations that can occur for the two parse trees differ. What is the correct decision?

Do the same for x+y*z:

and for x*y+z:

Precedence declarations

Looking at these four cases, we can see that they will be parsed correctly if we follow these rules:

If the operator nearest the top of the stack and the lookahead symbol have the same precedence, then shift if the operator is right-associative, and reduce if it is left-associative.

If the operator nearest the top of the stack has higher precedence than the lookahead symbol, then reduce; otherwise, shift.

ocamlyacc will follow these rules if you tell it which operators have higher precedence, and which are left- or right-associative. Do that using precedence declarations...
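The two rules above amount to a small decision function. The following is a sketch only: the operator table in info (including a right-associative '^', which does not appear in the lecture's grammar) is invented for illustration, and this is not how ocamlyacc is implemented internally.

```ocaml
type assoc = Left | Right
type decision = Shift | Reduce

(* Hypothetical operator table: (precedence level, associativity).
   A larger level means the operator binds tighter. *)
let info (op : char) : int * assoc =
  match op with
  | '+' | '-' -> (1, Left)
  | '*' | '/' -> (2, Left)
  | '^' -> (3, Right)
  | _ -> invalid_arg "unknown operator"

(* on_stack: the operator nearest the top of the stack.
   lookahead: the operator that is the lookahead symbol. *)
let decide ~(on_stack : char) ~(lookahead : char) : decision =
  let (p_stack, a) = info on_stack in
  let (p_look, _) = info lookahead in
  if p_stack > p_look then Reduce            (* higher precedence on stack *)
  else if p_stack < p_look then Shift        (* lookahead binds tighter *)
  else match a with                          (* same precedence *)
    | Left -> Reduce
    | Right -> Shift
```

For x+y*z the call decide ~on_stack:'+' ~lookahead:'*' gives Shift, so y*z is grouped first; for x*y+z the call decide ~on_stack:'*' ~lookahead:'+' gives Reduce.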

Precedence declarations (cont.)

Precedence declarations are added to the ocamlyacc specification after the %token declarations. Syntax:

%left symbol ... symbol
%right symbol ... symbol
%nonassoc symbol ... symbol

Two symbols appearing in the same precedence declaration have the same precedence and the given associativity. (If they appear in a %nonassoc declaration, they cannot follow one another; e.g. x < y < z is rejected.)

Two symbols appearing in different precedence declarations have different precedences: the one that comes earlier has lower precedence.

Precedence declaration example

Recall this ocamlyacc specification:

expr:
    term                       {$1}
  | term PlusT expr            {Plus($1,$3)}
  | term MinusT expr           {Minus($1,$3)}
term:
    factor                     {$1}
  | factor TimesT term         {Mult($1,$3)}
  | factor DivideT term        {Div($1,$3)}
factor:
    IdT                        {Id $1}
  | OParenT expr CParenT       {$2}
main:
  | expr EOF                   {$1}

Precedence declarations (cont.)

This can be simplified with precedence declarations (after the %token declarations):

%left PlusT MinusT
%left TimesT DivideT
%start main
%type <Exp.exp> main
%%
expr:
    IdT                        {Id $1}
  | expr PlusT expr            {Plus($1,$3)}
  | expr MinusT expr           {Minus($1,$3)}
  | expr TimesT expr           {Mult($1,$3)}
  | expr DivideT expr          {Div($1,$3)}
  | OParenT expr CParenT       {$2}
main:
  | expr EOF                   {$1}

Debugging ocamlyacc specifications

In doing MP4, the main question will be: what operators are causing conflicts? Once you've identified them, you can add precedence declarations.

When you run ocamlyacc, it will report the number of conflicts. Running with the -v option produces a file with the extension .output, containing details.

E.g. the grammar Expr -> Expr + Expr | id has a conflict. Search for "conflict" in the .output file:

6: shift/reduce conflict (shift 5, reduce 1) on plus
state 6
    Expr : Expr . plus Expr  (1)
    Expr : Expr plus Expr .  (1)

This says that there is a problem with the plus token.

Constructing ASTs

Precedence rules make it easier to construct ASTs, because concrete syntax is closer to abstract syntax.

Different non-terminals can produce different types of values. An important case is list-like syntax categories. E.g. consider this grammar:

funcall -> id ( arglist )
arglist -> funcall arglistrest | ε
arglistrest -> , funcall arglistrest | ε

Suppose our abstract syntax is

type funcall = Funcall of string * (funcall list)

Here is how to do this in ocamlyacc:

funcall:
    IdT OParenT arglist CParenT      { Funcall($1, $3) }
arglist:
    funcall arglistrest              { $1 :: $2 }
  | /* empty */                      { [] }
arglistrest:
    CommaT funcall arglistrest       { $2 :: $3 }
  | /* empty */                      { [] }
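As a sanity check on the types, here is the value those actions would be expected to build for an input such as f(g(), h(k())), together with a small printer. The to_string function and the example value are illustrative additions, not part of the assignment code.

```ocaml
type funcall = Funcall of string * funcall list

(* Unparse an AST back to concrete syntax. *)
let rec to_string (Funcall (name, args)) : string =
  name ^ "(" ^ String.concat ", " (List.map to_string args) ^ ")"

(* funcall yields a funcall; arglist and arglistrest yield funcall lists,
   built with (::) and [] exactly as in the actions above. *)
let example =
  Funcall ("f", [Funcall ("g", []); Funcall ("h", [Funcall ("k", [])])])
```

Note that the funcall rule produces a value of type funcall while arglist and arglistrest produce values of type funcall list; ocamlyacc infers these types from the actions.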

Bonus topic: A little LR theory...

The shift-or-reduce decision seems very mysterious: We know what to do when we already have the parse tree, but how can we know, based only on the grammar, what will be the correct action in every case?

We can start by looking at the stack configurations of s/r parses. Define SC(G) = { α ∈ S* | α can be the stack in a shift-reduce parse for G }.

Consider this example: A -> id | ( A )
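One possible way to check a candidate answer for this example is a membership test for SC(G). The sketch below assumes that only stacks arising in successful parses count, in which case the stacks are: any number of open parentheses, followed by nothing, by id, by A, or (with at least one open parenthesis) by A ). The names sym and in_sc are made up for this illustration.

```ocaml
type sym = LP | RP | Id | A     (* "(", ")", id, and the non-terminal A *)

(* The stack is written bottom-to-top. *)
let rec in_sc (stack : sym list) : bool =
  match stack with
  | [] | [Id] | [A] -> true
  | LP :: [A; RP] -> true           (* "( A )" just before it is reduced *)
  | LP :: rest -> in_sc rest        (* strip one leading "(" *)
  | _ -> false
```

The function reads the stack once from the bottom and needs no counting, which is consistent with SC(G) being describable by a regular expression such as (* (ε | id | A) together with (+ A ).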

More examples of SC(G)

E -> E + T | T
T -> id

E -> T + E | T
T -> id

A little LR theory (cont.)

Theorem [Knuth]: For any grammar G, SC(G) is a finite-state language over S.

yacc starts by constructing the "characteristic DFA" for the specified grammar (a DFA recognizing SC(G)). To parse a sentence, repeat the following:

Take the stack and concatenate the lookahead symbol. If the result is in SC(G), then shift; otherwise reduce.

Actually, constructing the characteristic DFA is just the start. The simple parsing method just given does not always work. (E.g. even when it says to reduce, it may not be clear which production to reduce by.) The full construction of the parser is quite involved; see CS 426, or a compiler textbook.
