CS107-08 CFGs

Context-free grammars are also used to describe (parts of) programming languages. They are essential for understanding the meaning of computer programs. Code: (2 + 3) * 5. Meaning: "add 2 and 3, and then multiply by 5".

Uploaded by

'Elijah Recto
Copyright
© Attribution Non-Commercial (BY-NC)

Context-Free Grammar

CS 107
Theory of Automata
Chapter 08

This is a different model for describing languages.

The language is specified by productions (substitution rules) that tell how strings can be obtained, e.g.
    A → 0A1
    A → B
    B → #
A, B are variables
0, 1, # are terminals
A is the start variable

Using these rules, we can derive strings like this:
    A ⇒ 0A1 ⇒ 00A11 ⇒ 000A111 ⇒ 000B111 ⇒ 000#111
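
The derivation of 000#111 from the rules A → 0A1, A → B, B → # can be replayed mechanically. The following is a minimal Python sketch; the rule labels and the names RULES and derive are our own and do not come from the slides.

```python
# Productions of the example grammar, keyed by a label we made up for each rule.
RULES = {
    "A->0A1": ("A", "0A1"),
    "A->B": ("A", "B"),
    "B->#": ("B", "#"),
}

def derive(start, rule_labels):
    """Apply the named rules in order, each to the leftmost matching variable."""
    forms = [start]
    for label in rule_labels:
        variable, body = RULES[label]
        current = forms[-1]
        if variable not in current:
            raise ValueError(f"{variable} does not occur in {current}")
        # Replace only the first (leftmost) occurrence of the variable.
        forms.append(current.replace(variable, body, 1))
    return forms

steps = derive("A", ["A->0A1", "A->0A1", "A->0A1", "A->B", "B->#"])
print(" => ".join(steps))
```

Each entry of steps is one intermediate string of the derivation, so the list ends with the derived string of terminals.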
CJD

Context-Free Grammars

Natural Language

Context-free grammars were first used for natural languages.

    a girl with a flower likes the boy

[Parse tree figure: the words are tagged ART NOUN PREP ART NOUN VERB ART NOUN and grouped into CMPLX-NOUN, PREP-PHRASE, NOUN-PHRASE, CMPLX-VERB and VERB-PHRASE under the root SENTENCE.]

English and CFGs

We can describe (some fragments of) the English language by a context-free grammar:
    SENTENCE → NOUN-PHRASE VERB-PHRASE
    NOUN-PHRASE → CMPLX-NOUN
    NOUN-PHRASE → CMPLX-NOUN PREP-PHRASE
    VERB-PHRASE → CMPLX-VERB
    VERB-PHRASE → CMPLX-VERB PREP-PHRASE
    PREP-PHRASE → PREP CMPLX-NOUN
    CMPLX-NOUN → ARTICLE NOUN
    CMPLX-VERB → VERB NOUN-PHRASE
    CMPLX-VERB → VERB
    ARTICLE → a
    ARTICLE → the
    NOUN → boy
    NOUN → girl
    NOUN → flower
    VERB → likes
    VERB → touches
    VERB → sees
    PREP → with

variables: SENTENCE, NOUN-PHRASE, ...
terminals: a, the, boy, girl, flower, likes, touches, sees, with
start variable: SENTENCE
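
To see the English fragment grammar at work, here is a small Python sketch that checks whether a sequence of words can be derived from SENTENCE. It is only an illustration: the dictionary GRAMMAR and the function is_sentence are names we introduce here, and the simple span-splitting strategy relies on the fact that no production of this grammar is empty.

```python
from functools import lru_cache

# Productions of the English fragment grammar; every other word is a terminal.
GRAMMAR = {
    "SENTENCE": [["NOUN-PHRASE", "VERB-PHRASE"]],
    "NOUN-PHRASE": [["CMPLX-NOUN"], ["CMPLX-NOUN", "PREP-PHRASE"]],
    "VERB-PHRASE": [["CMPLX-VERB"], ["CMPLX-VERB", "PREP-PHRASE"]],
    "PREP-PHRASE": [["PREP", "CMPLX-NOUN"]],
    "CMPLX-NOUN": [["ARTICLE", "NOUN"]],
    "CMPLX-VERB": [["VERB"], ["VERB", "NOUN-PHRASE"]],
    "ARTICLE": [["a"], ["the"]],
    "NOUN": [["boy"], ["girl"], ["flower"]],
    "VERB": [["likes"], ["touches"], ["sees"]],
    "PREP": [["with"]],
}

def is_sentence(text):
    """Return True if the words of text can be derived from SENTENCE."""
    words = tuple(text.split())

    @lru_cache(maxsize=None)
    def derives(symbol, i, j):
        # Does symbol derive exactly words[i:j]?
        if symbol not in GRAMMAR:
            return j == i + 1 and words[i] == symbol
        return any(sequence(tuple(rhs), i, j) for rhs in GRAMMAR[symbol])

    @lru_cache(maxsize=None)
    def sequence(rhs, i, j):
        # Does the symbol sequence rhs derive exactly words[i:j]?
        if not rhs:
            return i == j
        # No production is empty, so each symbol covers at least one word.
        return any(derives(rhs[0], i, k) and sequence(rhs[1:], k, j)
                   for k in range(i + 1, j + 1))

    return derives("SENTENCE", 0, len(words))

print(is_sentence("a girl with a flower likes the boy"))
print(is_sentence("girl the likes"))
```

The first call corresponds to the example sentence from the slide; the second is a word order the grammar cannot produce.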

Programming Languages

Context-free grammars are also used to describe (parts of) programming languages.

For instance, expressions like (2 + 3) * 5 or 3 + (8 + 2) * 7 can be described by the CFG
    <expr> → <expr> + <expr>
    <expr> → <expr> * <expr>
    <expr> → (<expr>)
    <expr> → 0
    <expr> → 1
    ...
    <expr> → 9

Variables: <expr>
Terminals: +, *, (, ), 0, 1, ..., 9

CFGs for Compilers

Context-free grammars are essential for understanding the meaning of computer programs.
    code: (2 + 3) * 5
    meaning: "add 2 and 3, and then multiply by 5"

They are used in compilers.
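
One way to picture how a compiler gets from code to meaning is to evaluate the tree that the grammar assigns to an expression. The sketch below is an assumption-laden illustration: the tuple encoding of trees and the function name evaluate are our own choices, and the tree for (2 + 3) * 5 is written out by hand rather than produced by a parser.

```python
def evaluate(tree):
    """Evaluate a tree that is either a digit or a tuple (operator, left, right)."""
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    if op == "+":
        return evaluate(left) + evaluate(right)
    if op == "*":
        return evaluate(left) * evaluate(right)
    raise ValueError(f"unknown operator {op!r}")

# (2 + 3) * 5: the addition sits below the multiplication in the tree.
first = ("*", ("+", 2, 3), 5)
# 3 + (8 + 2) * 7, read as 3 + ((8 + 2) * 7).
second = ("+", 3, ("*", ("+", 8, 2), 7))

print(evaluate(first))   # 25
print(evaluate(second))  # 73
```

The shape of the tree, not the left-to-right order of the characters, decides which operation is performed first.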

BNF

John Backus and Peter Naur
BNF: Backus-Naur Form
A way to describe grammars and define the syntax of programming languages (Algol), 1959-1963

A BNF grammar is a CFG, with notational changes:
    Nonterminals are written as words enclosed in angle brackets: <exp> instead of E
    Productions use ::= instead of →
    The empty string is <empty> instead of ε

CFGs (due to Chomsky) came a few years earlier, but BNF was developed independently.

Example

    <exp> ::= <exp> - <exp> | <exp> * <exp> | <exp> = <exp> | <exp> < <exp> | (<exp>) | a | b | c

This BNF generates a little language of expressions, which includes:
    a < b
    ( a - ( b * c ) )
    ( b * a ) = ( c < b ) - a
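
Because BNF is only a change of notation, a BNF rule can be turned into ordinary productions with very little code. The following Python sketch is one possible way to do it; the names TOKEN and parse_bnf are ours, and the sketch assumes one rule per line and that the character | is never used as a terminal.

```python
import re

# A symbol is a bracketed nonterminal, a word, or any single other character.
TOKEN = re.compile(r"<[^<>\s]+>|\w+|\S")

def parse_bnf(text):
    """Map each nonterminal to the list of its right-hand sides (lists of symbols)."""
    grammar = {}
    for line in text.strip().splitlines():
        head, body = line.split("::=", 1)
        alternatives = []
        for alt in body.split("|"):
            symbols = TOKEN.findall(alt)
            # <empty> stands for the empty string, i.e. an empty right-hand side.
            alternatives.append([] if symbols == ["<empty>"] else symbols)
        grammar.setdefault(head.strip(), []).extend(alternatives)
    return grammar

BNF = "<exp> ::= <exp> - <exp> | <exp> * <exp> | <exp> = <exp> | <exp> < <exp> | (<exp>) | a | b | c"
grammar = parse_bnf(BNF)
for rhs in grammar["<exp>"]:
    print("<exp> →", " ".join(rhs))
```

Note that the terminal < in the fourth alternative is kept apart from the bracketed nonterminals because a nonterminal name must be closed by > without intervening spaces.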

Example

    <stmt> ::= <exp-stmt> | <while-stmt> | <compound-stmt> | ...
    <exp-stmt> ::= <exp> ;
    <while-stmt> ::= while ( <exp> ) <stmt>
    <compound-stmt> ::= { <stmt-list> }
    <stmt-list> ::= <stmt> <stmt-list> | <empty>

This BNF generates C-like statements, like
    while (a<b) { c = c * a; a = a + a; }

This is just a toy example; the BNF grammar for a full language may include hundreds of productions.

HTML

A subset of HTML can be described as follows:
    Doc is a sequence of elements
    Element is Text, or a pair of matching tags and the document between them, or an unmatched tag followed by a document
    Text is any string of characters literally interpreted (i.e. there are no tags, user-text)
    Char is any single character legal in HTML tags
    List is a sequence of zero or more list items
    ListItem is the <LI> tag followed by a document followed by </LI>

Limited HTML Grammar

    Doc → ε | Element Doc
    Element → Text | <EM> Doc </EM> | <P> Doc | <OL> List </OL>
    Text → ε | Char Text
    Char → a | A | ...
    List → ε | ListItem List
    ListItem → <LI> Doc </LI>

Formal Definition

A Context-Free Grammar (CFG) is a 4-tuple (V, T, P, S) where
    V is a finite set of variables or non-terminals
    T is a finite set of terminals (V ∩ T = ∅)
    P is a set of productions or substitution rules of the form A → α, where A is a symbol in V and α is a string over V ∪ T
    S is a variable in V called the start variable
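
The 4-tuple (V, T, P, S) of the formal definition translates almost directly into a data structure. Below is a minimal Python sketch, with the class name CFG and its field names chosen by us; the checks in __post_init__ mirror the side conditions of the definition (V and T disjoint, S in V, left sides in V, right sides over V ∪ T).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CFG:
    variables: frozenset    # V
    terminals: frozenset    # T
    productions: tuple      # P, as pairs (A, alpha) with alpha a tuple of symbols
    start: str              # S

    def __post_init__(self):
        if self.variables & self.terminals:
            raise ValueError("V and T must be disjoint")
        if self.start not in self.variables:
            raise ValueError("the start variable must be in V")
        symbols = self.variables | self.terminals
        for head, body in self.productions:
            if head not in self.variables:
                raise ValueError(f"left side {head!r} is not a variable")
            if any(symbol not in symbols for symbol in body):
                raise ValueError(f"right side of {head!r} uses an unknown symbol")

# The grammar A → 0A1 | B, B → # as a 4-tuple.
G = CFG(
    variables=frozenset({"A", "B"}),
    terminals=frozenset({"0", "1", "#"}),
    productions=(("A", ("0", "A", "1")), ("A", ("B",)), ("B", ("#",))),
    start="A",
)
print(G.start, len(G.productions))
```

An empty tuple as right side would represent a production A → ε.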

Convention

Variables: first few uppercase letters and S
    ex. A, B, C, D, E, S
    S is the start symbol unless otherwise specified

Terminals: digits and first few lowercase letters
    ex. a, b, c, d, e, 0, 1, ..., 9

Symbols (variables or terminals): last few uppercase letters
    ex. X, Y, Z

Strings of terminals: last few lowercase letters
    ex. u, v, w, x, y, z

Strings of variables and terminals: Greek letters
    ex. α, β, γ

Shorthand for Productions

When we have multiple productions with the same variable on the left, like
    E → E + E
    E → E * E
    E → (E)
    E → N
    N → 0N
    N → 1N
    N → 0
    N → 1
Variables: E, N
Terminals: +, *, (, ), 0, 1
Start variable: E

we can write this in shorthand as
    E → E + E | E * E | (E) | N
    N → 0N | 1N | 0 | 1

Derivation

A derivation is a top-down sequential application of productions:
    E ⇒ E * E
      ⇒ (E) * E
      ⇒ (E) * N
      ⇒ (E + E) * N
      ⇒ (E + E) * 1
      ⇒ (E + N) * 1
      ⇒ (N + N) * 1
      ⇒ (N + 1N) * 1
      ⇒ (N + 10) * 1
      ⇒ (1 + 10) * 1

α ⇒ β means β can be obtained from α with one production
α ⇒* β means β can be obtained from α after zero or more productions
α ⇒^i β means β can be obtained from α in exactly i productions

Language of a CFG

If S ⇒* α and α contains variables and terminals, then α is called a sentential form of G. If α does not contain variables, it is called a sentence of G.

The language of a CFG G = (V, T, P, S) is the set of all sentences of G:
    L = { w | w ∈ T* and S ⇒* w }

A language L is context-free if it is the language of some CFG.
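
The definition L = { w | w ∈ T* and S ⇒* w } suggests a direct, if naive, way to list short sentences: search through the sentential forms reachable from the start variable. The Python sketch below does this for the grammar A → 0A1 | B, B → #. The names PRODUCTIONS and sentences are ours, and the length-based pruning is only valid because this particular grammar has no ε-productions.

```python
from collections import deque

PRODUCTIONS = {"A": ["0A1", "B"], "B": ["#"]}

def sentences(start, max_len):
    """All sentences of length <= max_len, found by breadth-first search
    over sentential forms (always expanding the leftmost variable)."""
    found, seen, queue = set(), {start}, deque([start])
    while queue:
        form = queue.popleft()
        index = next((i for i, s in enumerate(form) if s in PRODUCTIONS), None)
        if index is None:
            found.add(form)          # no variables left: form is a sentence
            continue
        for body in PRODUCTIONS[form[index]]:
            new_form = form[:index] + body + form[index + 1:]
            # No production is empty, so forms never shrink and long ones can be dropped.
            if len(new_form) <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return sorted(found, key=len)

print(sentences("A", 7))  # ['#', '0#1', '00#11', '000#111']
```

For grammars with ε-productions a sentential form can shrink again, so a different pruning rule would be needed.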

Example 1

productions:
    A → 0A1 | B
    B → #
variables: A, B
terminals: 0, 1, #
start variable: A

Is the string 00#11 in L? How about 00#111, 00#0#1#11?
What is the language of this CFG?
    L = { 0^n # 1^n : n ≥ 0 }

Example 2

    S → SS | (S) | ε

Give derivations of (), (()())
    S ⇒ (S)        (rule 2)
      ⇒ ()         (rule 3)

    S ⇒ (S)        (rule 2)
      ⇒ (SS)       (rule 1)
      ⇒ ((S)S)     (rule 2)
      ⇒ ((S)(S))   (rule 2)
      ⇒ (()(S))    (rule 3)
      ⇒ (()())     (rule 3)

How about ())?
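
The questions in Examples 1 and 2 can be checked against direct descriptions of the two languages. The Python sketch below does not use the grammars themselves; it tests the form 0^n # 1^n for Example 1 and balanced parentheses for Example 2, which is the language generated by S → SS | (S) | ε. The function names are our own.

```python
def in_example1(w):
    """Membership in { 0^n # 1^n : n >= 0 }."""
    left, sep, right = w.partition("#")
    return (sep == "#"
            and set(left) <= {"0"}
            and set(right) <= {"1"}
            and len(left) == len(right))

def in_example2(w):
    """Membership in the language of balanced parentheses."""
    depth = 0
    for ch in w:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:        # a ')' with no matching '(' before it
                return False
        else:
            return False
    return depth == 0

for w in ["00#11", "00#111", "00#0#1#11"]:
    print(w, in_example1(w))
for w in ["()", "(()())", "())"]:
    print(w, in_example2(w))
```

The string ()) fails because its last closing parenthesis has no partner, so no derivation from S can produce it.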

Examples 3 and 4

{ a^n b^(3n) | n ≥ 1 }
    Each a on the left can be paired with three b's on the right. Since n ≥ 1, the derivation must end with one last a and its three b's.
    That gives S → aSbbb | abbb

{ xy | x ∈ {a,b}*, y ∈ {c,d}*, and |x| = |y| }
    Each symbol on the left (either a or b) can be paired with one on the right (either c or d).
    That gives
        S → XSY | ε
        X → a | b
        Y → c | d

Example 5

Design a CFG for the following language:
    L = { 0^i 1^j | i ≤ j ≤ 2i, i = 0, 1, ... }, Σ = {0, 1}

Consider two extreme cases:
(a) if j = i, then L1 = { 0^i 1^j : i = j }:
        S → ε
        S → 0S1      (red-rule)
(b) if j = 2i, then L2 = { 0^i 1^j : 2i = j }:
        S → ε
        S → 0S11     (blue-rule)

If i ≤ j ≤ 2i, then randomly choose red-rule or blue-rule in the generation:
    S → ε
    S → 0S1
    S → 0S11

Example 5 Proof

    L = { 0^i 1^j : i ≤ j ≤ 2i, i = 0, 1, ... }, Σ = {0, 1}
    G:  S → ε
        S → 0S1
        S → 0S11

Need to verify L = L(G).

1) L(G) is a subset of L: Each use of the red-rule adds one 0 and one 1, and each use of the blue-rule adds one 0 and two 1s, with all 0s to the left of all 1s. So in every derivation the number of 1s generated is at least the number of 0s and at most twice that number. So, L(G) is a subset of L.

2) L is a subset of L(G): For any w = 0^i 1^j with i ≤ j ≤ 2i, we use the red-rule (2i - j) times and then the blue-rule (j - i) times, i.e.,
    S ⇒* 0^(2i-j) S 1^(2i-j)
      ⇒* 0^(2i-j) 0^(j-i) S 1^(2(j-i)) 1^(2i-j)
      ⇒  0^i 1^j = w

Example 6

Design a CFG for the following language:
    L = { a^i b^j c^k d^l | i, j, k, l = 0, 1, ...; i + k = j + l }

In other words, each a and c is matched by some b or d; and each b and d is matched by some a or c.
    To match a and d, use S → aSd | ε
    To match a and b, use A → aAb | ε
    To match c and b, use B → bBc | ε
    To match c and d, use C → cCd | ε

a and d are far apart: they must be produced first by letting S be the start symbol. Afterwards, S must transition into the other productions that match adjacent terminals, with S → ABC.
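
For the Example 6 grammar (S → aSd | ABC, A → aAb | ε, B → bBc | ε, C → cCd | ε), every derivation is determined by four counts: how often each of the rules S → aSd, A → aAb, B → bBc and C → cCd is used. The Python sketch below builds the resulting words and compares them with the condition i + k = j + l on small cases. The names word and in_language and the parameter names n, p, q, r are our own.

```python
from itertools import product

def word(n, p, q, r):
    """Word obtained by n uses of S → aSd, then S → ABC, with
    A ⇒* a^p b^p, B ⇒* b^q c^q and C ⇒* c^r d^r."""
    return "a" * (n + p) + "b" * (p + q) + "c" * (q + r) + "d" * (r + n)

def in_language(w):
    """Direct test: w = a^i b^j c^k d^l with i + k = j + l."""
    counts, rest = [], w
    for letter in "abcd":
        stripped = rest.lstrip(letter)
        counts.append(len(rest) - len(stripped))
        rest = stripped
    i, j, k, l = counts
    return rest == "" and i + k == j + l

# All words the grammar derives with each rule used at most 3 times.
generated = {word(n, p, q, r) for n, p, q, r in product(range(4), repeat=4)}
print(all(in_language(w) for w in generated))
print(in_language("abcd"), in_language("aabd"), in_language("ba"))
```

This is a finite sanity check, not a proof; it only confirms on small cases that the generated words satisfy the counting condition.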

Example 6

    L = { a^i b^j c^k d^l | i, j, k, l = 0, 1, ...; i + k = j + l }

Suppose w = a^i b^j c^k d^l.

To get a^i and d^l we need
    S ⇒^n a^n S d^n ⇒ a^n ABC d^n,
    A ⇒* a^(i-n) b^(i-n) and C ⇒* c^(l-n) d^(l-n)

To get b^j and c^k we need
    B ⇒* b^(j-(i-n)) c^(k-(l-n))

So w can be derived ⟺ j - (i - n) = k - (l - n) ⟺ i + k = j + l, i.e. w ∈ L.

[Figure: derivation tree with root S producing a^n ... d^n around A, B and C, where A yields a^(i-n) b^(i-n), B yields b^(j-(i-n)) c^(k-(l-n)) and C yields c^(l-n) d^(l-n).]

Exercise: Designing CFGs

Design a CFG for the following languages:
    Linear equations over integers, x, y, z, like:
        x + 5y - z = 9
        11x - y = 2
    Numbers without leading zeros, e.g., 109, 0 but not 019
    L1 = { a^n b^n c^m d^m | n ≥ 0, m ≥ 0 }
    L2 = { a^n b^m c^m d^n | n ≥ 0, m ≥ 0 }
    L3 = { 0^n 1^n | n ≥ 1 }
    L4 = { a^i b^j c^k | i ≠ j or j ≠ k }

CFLs vs Regular Languages

Write a CFG for the language (0 + 1)*111
    S → A111
    A → ε | 0A | 1A

Can you do so for every regular language?

Every regular language is context-free.

Proof idea: every regular language is described by a regular expression (equivalently, by an NFA or a DFA), and every regular expression can be converted to a CFG.

From Regular to Context-Free

    regular expression        CFG
    ∅                         grammar with no rules
    ε                         S → ε
    a (alphabet symbol)       S → a
    E1 + E2                   S → S1 | S2
    E1E2                      S → S1S2
    E1*                       S → SS1 | ε

Here S1 and S2 are the start variables of the grammars for E1 and E2. In all cases, S becomes the new start symbol.

CFLs vs Regular Languages

Is every context-free language regular? No! We already saw some examples:
    A → 0A1 | B
    B → #
    L = { 0^n # 1^n : n ≥ 0 }

This language is context-free but not regular.

Another CFL that is Not Regular

Language of palindromes

We can easily show using the pumping lemma that the language L = { w | w = w^R } is not regular. However, we can describe this language by the following context-free grammar over the alphabet {0, 1}:

Inductive definition:
    P → ε
    P → 0
    P → 1
    P → 0P0
    P → 1P1

More compactly:
    P → ε | 0 | 1 | 0P0 | 1P1
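
The inductive definition of palindromes (P → ε | 0 | 1 | 0P0 | 1P1) can be run as a construction: a palindrome of length n is a palindrome of length n - 2 wrapped in two equal symbols. The Python sketch below builds all palindromes up to a given length this way and compares them with the defining condition w = w^R. The function name palindromes is our own.

```python
from itertools import product

def palindromes(max_len):
    """All strings of length <= max_len generated by P → ε | 0 | 1 | 0P0 | 1P1."""
    by_len = {0: {""}, 1: {"0", "1"}}            # the three base rules
    for n in range(2, max_len + 1):
        # P → 0P0 | 1P1: wrap every shorter palindrome in a matching pair.
        by_len[n] = {c + p + c for p in by_len[n - 2] for c in "01"}
    return set().union(*(by_len[n] for n in range(max_len + 1)))

# Brute force: all binary strings up to length 5 that equal their reverse.
brute_force = {"".join(t) for n in range(6)
               for t in product("01", repeat=n) if t == t[::-1]}

print(palindromes(5) == brute_force)
print(sorted(palindromes(3), key=lambda w: (len(w), w)))
```

The agreement on short strings illustrates, but does not prove, that the grammar generates exactly the palindromes.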

Parse Trees

Derivations can also be represented using parse trees.

    E → E + E | E - E | (E) | V
    V → x | y | z

    E ⇒ E + E
      ⇒ V + E
      ⇒ x + E
      ⇒ x + (E)
      ⇒ x + (E - E)
      ⇒ x + (V - E)
      ⇒ x + (y - E)
      ⇒ x + (y - V)
      ⇒ x + (y - z)

[Parse tree figure: the root E has children E, + and E; the left E derives V and then x; the right E has children (, E and ), and the inner E has children E, - and E, which derive V and then y and z.]

The yield of the parse tree is x + (y - z).

Definition of a Parse Tree

A parse tree or derivation tree for a CFG G is an ordered tree with labels on the nodes such that
    Every internal node is labeled by a variable
    The root is labeled S
    Nodes labeled by ε are leaves with no siblings
    If a node is labeled A and has children X1, ..., Xk from left to right, then the rule A → X1 ... Xk is a production in G.

A subtree is a node of a tree with all its descendants and connecting edges.

A BNF Parse Tree

    <exp> ::= <ltexp> = <exp> | <ltexp>
    <ltexp> ::= <ltexp> < <subexp> | <subexp>
    <subexp> ::= <subexp> - <mulexp> | <mulexp>
    <mulexp> ::= <mulexp> * <rootexp> | <rootexp>
    <rootexp> ::= (<exp>) | a | b | c

[Parse tree figure for a - b * c: <exp> derives <ltexp>, which derives <subexp>; this <subexp> has children <subexp>, - and <mulexp>. The left <subexp> derives <mulexp>, <rootexp> and then a; the right <mulexp> has children <mulexp>, * and <rootexp>, which derive b and c.]

CFGs and Parse Trees

Theorem: Let G = (V, T, P, S) be a context-free grammar. Then S ⇒* α if and only if there is a parse tree in grammar G with yield α.

Proof: Prove a stronger version of the theorem:
    For any A ∈ V, A ⇒* α ⟺ there is an A-tree (i.e. a parse tree rooted at A) with yield α.

So if it is true for any A, it is true for S.
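
The definition of a parse tree and its yield can be checked on the tree for x + (y - z). In the Python sketch below, a node is encoded as a pair (label, list of children) and a leaf as a plain string; this encoding and the names PRODUCTIONS, TREE, is_parse_tree and tree_yield are our own. The sketch ignores ε-leaves, since the grammar E → E + E | E - E | (E) | V, V → x | y | z has no ε-productions.

```python
PRODUCTIONS = {
    "E": [("E", "+", "E"), ("E", "-", "E"), ("(", "E", ")"), ("V",)],
    "V": [("x",), ("y",), ("z",)],
}

# Parse tree for x + (y - z): internal nodes are (label, children), leaves are strings.
TREE = ("E", [
    ("E", [("V", ["x"])]),
    "+",
    ("E", ["(",
           ("E", [("E", [("V", ["y"])]), "-", ("E", [("V", ["z"])])]),
           ")"]),
])

def label(node):
    return node if isinstance(node, str) else node[0]

def is_parse_tree(node):
    """Check that each internal node with its children spells out a production."""
    if isinstance(node, str):
        return node not in PRODUCTIONS           # leaves must be terminals
    head, children = node
    rule = tuple(label(child) for child in children)
    return (rule in PRODUCTIONS.get(head, [])
            and all(is_parse_tree(child) for child in children))

def tree_yield(node):
    """Concatenate the leaf labels from left to right."""
    if isinstance(node, str):
        return node
    return "".join(tree_yield(child) for child in node[1])

print(is_parse_tree(TREE), tree_yield(TREE))
```

Checking the root label against the start variable is left out here so that the same function also accepts A-trees for any variable A.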

Parse Tree to Derivation

Prove: If G has a parse A-tree with yield α, then A ⇒* α.

Inductive proof on the number of interior vertices (i.v.) of the A-tree.

Basis: If there is only one i.v., it is A. A has children X1, X2, ..., Xn, where the yield is α = X1X2...Xn. So A → α is a production and A ⇒* α.

Induction: Suppose the result is true if |i.v.| < k, k > 1. Let α be the yield of an A-tree with k i.v.'s. Let the sons of A be X1, X2, ..., Xn, so A → X1X2...Xn is a production.
    For each Xi which is not a leaf (they exist because k > 1), Xi is a variable with yield αi. Since the Xi-tree has fewer than k i.v.'s, by the inductive hypothesis, Xi ⇒* αi.
    For any Xj which is a leaf, Xj = αj, so Xj ⇒* αj.
    So A ⇒ X1X2...Xn ⇒* α1α2...αn = α.
    So A ⇒* α.

Derivation to Parse Tree

Prove: If A ⇒* α, then G has a parse A-tree with yield α.

Inductive proof on the number of steps in the derivation of α.

Basis: Suppose A ⇒ α in 1 step. Then A → α = X1X2...Xn is a production, and there is an A-tree with children the Xi's and yield α.

Induction: Suppose every B ∈ V with B ⇒* β in less than k steps, k > 1, has a parse B-tree with yield β, and suppose A ⇒* α = α1α2...αn in k steps, with the first step A ⇒ X1X2...Xn, so that Xi ⇒* αi if Xi is a variable, or Xi = αi if Xi is a terminal.
    If Xi is a variable, it derives αi in less than k steps, so it has a parse Xi-tree with yield αi.
    Construct the A-tree with children X1, ..., Xn, keeping each Xi that is a terminal as a leaf (Xi = αi) and replacing each Xi that is a variable by its Xi-tree.
    Clearly, this A-tree is a parse tree with yield α.

Leftmost & Rightmost Derivations

Leftmost derivation always derives from the leftmost variable first:
    E ⇒ E + E
      ⇒ V + E
      ⇒ x + E
      ⇒ x + (E)
      ⇒ x + (E - E)
      ⇒ x + (V - E)
      ⇒ x + (y - E)
      ⇒ x + (y - V)
      ⇒ x + (y - z)

[Parse tree figure for x + (y - z).]

Rightmost derivation always derives from the rightmost variable first.

Many LM and RM Derivations

Let L be a CFL and w ∈ L.

A leftmost derivation of w corresponds to exactly one parse tree and vice versa.

A parse tree of w corresponds to exactly one rightmost derivation and vice versa.

w may have one or more derivations.
w may have one or more leftmost derivations (i.e. w may have one or more parse trees).
w may have one or more rightmost derivations (i.e. w may have one or more parse trees).
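
The correspondence between a parse tree and its leftmost and rightmost derivations can be made concrete: expanding the tree node by node, always choosing the leftmost (or rightmost) unexpanded variable, reproduces the derivation. The Python sketch below does this for the tree of x + (y - z). The tree encoding (pairs of label and children, strings for terminals) and the names TREE, render and derivation are our own.

```python
# Parse tree for x + (y - z) under E → E + E | E - E | (E) | V, V → x | y | z.
TREE = ("E", [
    ("E", [("V", ["x"])]),
    "+",
    ("E", ["(",
           ("E", [("E", [("V", ["y"])]), "-", ("E", [("V", ["z"])])]),
           ")"]),
])

def render(form):
    """Write a sentential form: subtrees show their root label, terminals themselves."""
    return "".join(item if isinstance(item, str) else item[0] for item in form)

def derivation(tree, leftmost=True):
    """Sentential forms of the leftmost (or rightmost) derivation encoded by tree."""
    form = [tree]                     # mixture of unexpanded subtrees and terminals
    steps = [render(form)]
    while True:
        pending = [i for i, item in enumerate(form) if not isinstance(item, str)]
        if not pending:
            return steps
        index = pending[0] if leftmost else pending[-1]
        _, children = form[index]
        form[index:index + 1] = children      # apply the production at this node
        steps.append(render(form))

print(" ⇒ ".join(derivation(TREE)))
print(" ⇒ ".join(derivation(TREE, leftmost=False)))
```

Both runs use the same productions the same number of times; only the order in which the variables are expanded differs.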

Ambiguity

The parse tree represents the intended meaning.

A grammar is ambiguous if some strings have more than one parse tree (i.e. more than one leftmost or more than one rightmost derivation).

Example:
    E → E + E | E - E | E * E | (E) | V
    V → x | y | z

[Two parse tree figures, both with yield x + y*z.]
    Tree 1, with E → E + E at the root: first multiply y and z, and then add this to x
    Tree 2, with E → E * E at the root: first add x and y, and then multiply the result by z

Disambiguation Example

Some ambiguous grammars can be disambiguated by enforcing precedence and associativity rules.

    E → E+E | E-E | E*E | E/E | E^E | x | y | z | (E)

precedence, from highest to lowest:
    ^       (right to left)
    *, /    (left to right)
    +, -    (left to right)

    P → x | y | z | (E)        (start with most basic indivisible elements)
    F → P^F | P                (not F → F^P because ^ is right to left)
    T → T*F | T/F | F          (in each step, refer to next higher precedence level)
    E → E+T | E-T | T

In x*y^z+x/(y-z)
    T stands for term: x*y^z, x/(y-z), y, z
    F stands for factor: x, y^z, x, (y-z)
    P stands for power: x, y, z, y^z, (y-z)
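
The layered grammar E → E+T | E-T | T, T → T*F | T/F | F, F → P^F | P, P → x | y | z | (E) can be turned into a small recursive-descent parser. The Python sketch below returns a fully parenthesized copy of its input, which makes the unique grouping visible. The function name parenthesize and its helper names are our own, and the left-recursive rules are implemented as loops, a standard technique rather than something taken from the slides.

```python
def parenthesize(text):
    """Parse text with the disambiguated grammar and make the grouping explicit."""
    tokens = list(text.replace(" ", ""))
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def parse_e():                      # E → E+T | E-T | T, as a loop (left to right)
        left = parse_t()
        while peek() in ("+", "-"):
            op = take()
            left = f"({left}{op}{parse_t()})"
        return left

    def parse_t():                      # T → T*F | T/F | F, as a loop (left to right)
        left = parse_f()
        while peek() in ("*", "/"):
            op = take()
            left = f"({left}{op}{parse_f()})"
        return left

    def parse_f():                      # F → P^F | P, right recursion (right to left)
        base = parse_p()
        if peek() == "^":
            take()
            return f"({base}^{parse_f()})"
        return base

    def parse_p():                      # P → x | y | z | (E)
        tok = peek()
        if tok in ("x", "y", "z"):
            return take()
        if tok == "(":
            take()
            inner = parse_e()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            take()
            return inner
        raise SyntaxError(f"unexpected token {tok!r}")

    result = parse_e()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {peek()!r}")
    return result

print(parenthesize("x*y^z+x/(y-z)"))  # ((x*(y^z))+(x/(y-z)))
print(parenthesize("x^y^z"))          # (x^(y^z))
print(parenthesize("x-y-z"))          # ((x-y)-z)
```

The last two calls show the associativity rules at work: ^ groups from the right, while - groups from the left.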

Inherently Ambiguous Languages

Can we always disambiguate a grammar? No, for two reasons:

1. There exist inherently ambiguous context-free languages L: every CFG for such a language L is ambiguous.
    Ex. L = { a^n b^n c^m d^m | n ≥ 1, m ≥ 1 } ∪ { a^n b^m c^m d^n | n ≥ 1, m ≥ 1 }
    Text has shown: a^n b^n c^n d^n, n ≥ 1, has more than one leftmost derivation.

2. There is no general procedure that can tell if a grammar is ambiguous.

However, grammars used in programming languages can typically be disambiguated.

Recursive Inference

A bottom-up process for the derivation of a string.

Example
    E → T | E+T | E-T
    T → F | T*F
    F → (E) | V
    V → x | y | z

[Parse tree figure, read bottom-up from the leaves x, y, + and z through V, F and T up to the root E.]

Another Example

    S → aB | bA
    A → a | aS | bAA
    B → b | bS | aBB

Are ab, baba, abbbaa in L? How about a, bba?
What is the language of this CFG?
Is the CFG ambiguous?

End
