CS107-08 CFGs
CS 107
Theory of Automata
Chapter 08
Natural Language
CMPLX-NOUN PREP-PHRASE
terminals: a, the, boy, girl, flower, likes, touches, sees, with
start variable: SENTENCE
CJD CJD
Programming Languages
BNF
BNF: Backus-Naur Form, named after John Backus and Peter Naur
A way to describe grammars and define the syntax of programming languages (Algol), 1959-1963
Example
<exp> ::= <exp> - <exp> | <exp> * <exp> | <exp> = <exp> | <exp> < <exp> | (<exp>) | a | b | c
Example
<stmt> ::= <exp-stmt> | <while-stmt> | <compound-stmt> | ...
<exp-stmt> ::= <exp> ;
<while-stmt> ::= while ( <exp> ) <stmt>
<compound-stmt> ::= { <stmt-list> }
<stmt-list> ::= <stmt> <stmt-list> | <empty>
HTML
A subset of HTML can be described as follows:
Doc is a sequence of elements
Element is Text, or a pair of matching tags and the document between them, or an unmatched tag followed by a document
Text is any string of characters literally interpreted (i.e. user text with no tags)
Char is any single character legal in HTML tags
List is a sequence of zero or more list items
ListItem is the <LI> tag followed by a document followed by </LI>
Doc → ε | Element Doc
Element → Text | <EM> Doc </EM> | <P> Doc | <OL> List </OL>
Text → ε | Char Text
Char → a | A | ...
List → ε | ListItem List
ListItem → <LI> Doc </LI>
Formal Definition
A Context-Free Grammar (CFG) is a 4-tuple (V, T, P, S) where
V is a finite set of variables or non-terminals
T is a finite set of terminals (V ∩ T = ∅)
P is a finite set of productions or substitution rules of the form A → α, where A is a symbol in V and α is a string over V ∪ T
S is a variable in V called the start variable
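As a minimal sketch, the 4-tuple (V, T, P, S) can be written down as a small Python data structure. The class name `CFG`, its field names, and the `validate` helper are illustrative assumptions, not notation from these notes; the sample grammar is A → 0A1 | B, B → # from Example 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CFG:
    variables: frozenset   # V
    terminals: frozenset   # T
    productions: tuple     # P: pairs (A, alpha), alpha a tuple of symbols
    start: str             # S

    def validate(self):
        """Raise ValueError if the 4-tuple violates the definition."""
        if self.variables & self.terminals:
            raise ValueError("V and T must be disjoint")
        if self.start not in self.variables:
            raise ValueError("start variable must be in V")
        symbols = self.variables | self.terminals
        for head, body in self.productions:
            if head not in self.variables:
                raise ValueError(f"head {head!r} is not a variable")
            if any(sym not in symbols for sym in body):
                raise ValueError(f"body {body!r} uses an unknown symbol")
        return True


# Example 1 grammar: A -> 0A1 | B, B -> #
g1 = CFG(
    variables=frozenset({"A", "B"}),
    terminals=frozenset({"0", "1", "#"}),
    productions=(("A", ("0", "A", "1")), ("A", ("B",)), ("B", ("#",))),
    start="A",
)
```

Calling `g1.validate()` checks each condition of the definition in turn; an empty body `()` would stand for an ε-production.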
Convention
Variables: first few uppercase letters and S
ex. A, B, C, D, E, S
S is the start symbol unless otherwise specified
Derivation
A derivation is a top-down sequential application of productions:
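$E \Rightarrow E*E \Rightarrow (E)*E \Rightarrow (E)*N \Rightarrow (E+E)*N \Rightarrow (E+E)*1 \Rightarrow (E+N)*1 \Rightarrow (N+N)*1 \Rightarrow (N+1N)*1 \Rightarrow (N+10)*1 \Rightarrow (1+10)*1$
$\alpha \Rightarrow \beta$ means $\beta$ can be obtained from $\alpha$ with one production
$\alpha \Rightarrow^{*} \beta$ means $\beta$ can be obtained from $\alpha$ after zero or more productions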
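The derivation from E to (1+10)*1 can be checked mechanically, one step at a time. The sketch below assumes the grammar E → E*E | E+E | (E) | N, N → 1N | 0 | 1, which is inferred from the steps of the derivation and is not stated explicitly in these notes; the names `one_step` and `is_derivation` are hypothetical.

```python
# Assumed grammar, inferred from the derivation steps (not given in the notes).
PRODUCTIONS = [
    ("E", "E*E"), ("E", "E+E"), ("E", "(E)"), ("E", "N"),
    ("N", "1N"), ("N", "0"), ("N", "1"),
]


def one_step(form):
    """Return every sentential form reachable from `form` with one production."""
    results = set()
    for i, sym in enumerate(form):
        for head, body in PRODUCTIONS:
            if sym == head:
                results.add(form[:i] + body + form[i + 1:])
    return results


def is_derivation(forms):
    """True if each form follows from the previous one by a single production."""
    return all(b in one_step(a) for a, b in zip(forms, forms[1:]))


steps = ["E", "E*E", "(E)*E", "(E)*N", "(E+E)*N", "(E+E)*1",
         "(E+N)*1", "(N+N)*1", "(N+1N)*1", "(N+10)*1", "(1+10)*1"]
```

Here `is_derivation(steps)` checks the one-production relation for each consecutive pair; the zero-or-more relation is the reflexive, transitive closure of `one_step`.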
Language of a CFG
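Suppose $S \Rightarrow^{*} \alpha$.
If $\alpha$ contains variables (possibly together with terminals), then $\alpha$ is called a sentential form of G. If $\alpha$ does not contain variables, it is called a sentence of G.
The language of G, written L(G), is the set of all sentences of G.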
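To make "the set of all sentences of G" concrete, here is a hedged sketch that enumerates the short sentences of a grammar by repeatedly expanding the leftmost variable of each sentential form. The function name `sentences` and the dict encoding of productions are assumptions for illustration. The pruning step relies on the fact that terminals never disappear once produced, and the search may fail to terminate for grammars with rules such as S → SS.

```python
from collections import deque


def sentences(productions, start, max_len):
    """Sentences with at most max_len symbols, found by leftmost expansion."""
    variables = set(productions)
    seen, found = {start}, set()
    queue = deque([start])
    while queue:
        form = queue.popleft()
        idx = next((i for i, s in enumerate(form) if s in variables), None)
        if idx is None:              # no variables left: the form is a sentence
            found.add(form)
            continue
        for body in productions[form[idx]]:
            new = form[:idx] + body + form[idx + 1:]
            n_terminals = sum(1 for s in new if s not in variables)
            # Prune forms that already carry too many terminals.
            if n_terminals <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return found


# Grammar A -> 0A1 | B, B -> # (single-character symbols).
example = {"A": ["0A1", "B"], "B": ["#"]}
```

For instance, `sentences(example, "A", 5)` collects the sentences of that grammar with at most five symbols.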
Example 1
productions: A → 0A1 | B, B → #
Is the string 00#11 in L? How about 00#111, 00#0#1#11? What is the language of this CFG?
$L = \{0^n \# 1^n : n \ge 0\}$
Example 2
S → SS | (S) | ε
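The membership questions above can be answered by undoing productions. The sketch below is one possible approach, with hypothetical function names: `in_example1` peels off one application of A → 0A1 at a time, and `in_example2` uses the well-known fact that S → SS | (S) | ε generates exactly the balanced parenthesis strings.

```python
def in_example1(w):
    """Membership for A -> 0A1 | B, B -> #, by undoing the rule A -> 0A1."""
    while len(w) >= 2 and w[0] == "0" and w[-1] == "1":
        w = w[1:-1]          # peel one application of A -> 0A1
    return w == "#"          # what remains must come from A -> B -> #


def in_example2(w):
    """Membership for S -> SS | (S) | epsilon (balanced parentheses)."""
    depth = 0
    for ch in w:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:    # a ')' with no matching '(' before it
                return False
        else:
            return False     # only parentheses are terminals
    return depth == 0
```

The three strings asked about can be passed directly to `in_example1`.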
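Examples 3 and 4
$\{ a^n b^{3n} \mid n \ge 1 \}$
Each a on the left can be paired with three b's on the right. That gives S → aSbbb | ε (use the base case abbb instead of ε if n = 0 must be excluded).
Example 5
Design a CFG for the following language:
$L = \{ 0^i 1^j \mid i \le j \le 2i,\ i = 0, 1, \ldots \}$, $\Sigma = \{0, 1\}$
Consider two extreme cases:
(a) if j = i, then $L_1 = \{ 0^i 1^j : i = j \}$, generated by S → ε, S → 0S1 (red-rule);
(b) if j = 2i, then $L_2 = \{ 0^i 1^j : 2i = j \}$, generated by S → ε, S → 0S11 (blue-rule).
Combining both rules: S → ε | 0S1 | 0S11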
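A short sketch can illustrate why mixing the red rule S → 0S1 and the blue rule S → 0S11 covers every case between the two extremes: using the red rule 2i - j times and the blue rule j - i times yields 0^i 1^j. The function names `derive` and `generated` are assumptions for illustration, and the second function only checks a bounded number of rule applications.

```python
def derive(i, j):
    """Build 0^i 1^j with S -> 0S1 (red), S -> 0S11 (blue), S -> epsilon."""
    if not (0 <= i <= j <= 2 * i):
        raise ValueError("need i <= j <= 2i")
    red, blue = 2 * i - j, j - i
    left, right = "", ""
    for _ in range(red):                 # red rule: one 0, one 1
        left, right = left + "0", "1" + right
    for _ in range(blue):                # blue rule: one 0, two 1s
        left, right = left + "0", "11" + right
    return left + right                  # final step S -> epsilon


def generated(max_rules):
    """All strings derivable with at most max_rules red/blue applications."""
    forms, out = {("", "")}, set()
    for _ in range(max_rules + 1):
        out |= {l + r for l, r in forms}             # finish with S -> epsilon
        forms = {(l + "0", ones + r) for l, r in forms for ones in ("1", "11")}
    return out
```

Comparing `generated(4)` against the condition i ≤ j ≤ 2i is a quick sanity check of the grammar on small cases, not a proof.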
Example 5 Proof
$L = \{0^i 1^j : i \le j \le 2i,\ i = 0, 1, \ldots\}$, $\Sigma = \{0, 1\}$
G: S → ε, S → 0S1 (red-rule), S → 0S11 (blue-rule)
1). L(G) is a subset of L: Each application of the red-rule adds one 0 and one 1, and each application of the blue-rule adds one 0 and two 1s, so in every derivation the number of 1s is between one and two times the number of 0s. So, L(G) is a subset of L.
2). L is a subset of L(G): For any $w = 0^i 1^j$ with $i \le j \le 2i$, we use the red-rule $(2i - j)$ times and then the blue-rule $(j - i)$ times, followed by S → ε, i.e.,
$S \Rightarrow^{*} 0^{2i-j} S 1^{2i-j} \Rightarrow^{*} 0^{2i-j} 0^{j-i} S 1^{2(j-i)} 1^{2i-j} \Rightarrow 0^i 1^j = w$
Example 6
Design a CFG for the following language:
$L = \{ a^i b^j c^k d^l \mid i, j, k, l = 0, 1, \ldots;\ i + k = j + l \}$
In other words, each a and c is matched by some b or d; and each b and d is matched by some a or c.
To match a and d, use S → aSd | ε
To match a and b, use A → aAb | ε
To match c and b, use B → bBc | ε
To match c and d, use C → cCd | ε
a and d are far apart: they must be produced first by letting S be the start symbol. Afterwards, S must transition into the other productions that match adjacent terminals, with S → ABC
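For the Example 6 grammar, read here as S → aSd | ABC, A → aAb | ε, B → bBc | ε, C → cCd | ε, a brute-force comparison on short strings is a useful sanity check. The sketch below is an illustration under that reading, with hypothetical function names; it counts how often each recursive rule is used (n, p, q, r) rather than expanding productions generically.

```python
from itertools import product


def grammar_strings(max_len):
    """Strings of the Example 6 grammar with length <= max_len.

    n, p, q, r are the numbers of uses of S -> aSd, A -> aAb,
    B -> bBc and C -> cCd; each use adds two terminals.
    """
    out = set()
    half = max_len // 2
    for n, p, q, r in product(range(half + 1), repeat=4):
        if 2 * (n + p + q + r) <= max_len:
            out.add("a" * (n + p) + "b" * (p + q) + "c" * (q + r) + "d" * (r + n))
    return out


def language_strings(max_len):
    """Strings a^i b^j c^k d^l with i + k == j + l and length <= max_len."""
    out = set()
    for i, j, k, l in product(range(max_len + 1), repeat=4):
        if i + k == j + l and i + j + k + l <= max_len:
            out.add("a" * i + "b" * j + "c" * k + "d" * l)
    return out
```

Agreement of the two sets for small `max_len` supports the design but does not replace the proof.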
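Example 6
$L = \{ a^i b^j c^k d^l \mid i, j, k, l = 0, 1, \ldots;\ i + k = j + l \}$
Suppose $w = a^i b^j c^k d^l$ is derived by $S \Rightarrow^{n} a^n S d^n \Rightarrow a^n ABC\, d^n \Rightarrow^{*} w$, where
$A \Rightarrow^{*} a^{i-n} b^{i-n}$ and $C \Rightarrow^{*} c^{l-n} d^{l-n}$, so that
$B \Rightarrow^{*} b^{j-(i-n)} c^{k-(l-n)}$
So $w \in L \iff j-(i-n) = k-(l-n) \iff i+k = j+l$
[Figure: parse tree of w. S yields a^n on the left and d^n on the right; below it, A yields a^(i-n) b^(i-n), B yields b^(j-(i-n)) c^(k-(l-n)), and C yields c^(l-n) d^(l-n).]
Linear equations over integers x, y, z, like: x + 5y − z = 9, 11x − y = 2
Numbers without leading zeros, e.g., 109, 0 but not 019
$L_1 = \{a^n b^n c^m d^m \mid n \ge 0, m \ge 0\}$
$L_2 = \{a^n b^m c^m d^n \mid n \ge 0, m \ge 0\}$
$L_3 = \{ 0^n 1^n \mid n \ge 1 \}$
$L_4 = \{ a^i b^j c^k \mid i \ne j \text{ or } j \ne k \}$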
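Proof: by cases on the form of the regular expression (regular expressions, NFAs and DFAs all describe the same languages):
a (alphabet symbol)
$E_1 + E_2$
$E_1 E_2$
$E_1^*$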
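Each regular-expression case listed here (a symbol, E1 + E2, E1E2, E1*) has a standard grammar counterpart, which these notes do not spell out. The sketch below uses the usual textbook constructions as an assumption: a fresh variable per subexpression, with S → S1 | S2 for union, S → S1S2 for concatenation, and S → SS1 | ε for star. The tuple encoding of expressions and the names `regex_to_cfg` and `words` are hypothetical.

```python
import itertools


def regex_to_cfg(expr):
    """Return (start, productions) for a regex given as nested tuples."""
    counter = itertools.count()
    productions = {}          # variable -> list of bodies (lists of symbols)

    def build(e):
        var = f"V{next(counter)}"
        kind = e[0]
        if kind == "sym":                 # a (alphabet symbol)
            productions[var] = [[e[1]]]
        elif kind == "union":             # E1 + E2
            productions[var] = [[build(e[1])], [build(e[2])]]
        elif kind == "concat":            # E1 E2
            productions[var] = [[build(e[1]), build(e[2])]]
        elif kind == "star":              # E1*
            inner = build(e[1])
            productions[var] = [[var, inner], []]
        else:
            raise ValueError(f"unknown node {kind!r}")
        return var

    return build(expr), productions


def words(sym, productions, depth):
    """Terminal strings derivable from sym with parse trees of height <= depth."""
    if sym not in productions:            # terminal symbol
        return {sym}
    if depth == 0:
        return set()
    out = set()
    for body in productions[sym]:
        partial = {""}
        for s in body:
            tails = words(s, productions, depth - 1)
            partial = {x + y for x in partial for y in tails}
        out |= partial
    return out


start, prods = regex_to_cfg(("star", ("union", ("sym", "a"), ("sym", "b"))))
```

The helper `words` is only a bounded check on small examples; it grows quickly with `depth`.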
$L = \{0^n \# 1^n : n \ge 0\}$
P → ε | 0 | 1 | 0P0 | 1P1
Parse Trees
<exp> ::= <ltexp> = <exp> | <ltexp>
<ltexp> ::= <ltexp> < <subexp> | <subexp>
<subexp> ::= <subexp> - <mulexp> | <mulexp>
<mulexp> ::= <mulexp> * <rootexp> | <rootexp>
<rootexp> ::= (<exp>) | a | b | c
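For each $X_i$ which is not a leaf (they exist because $k > 1$), $X_i$ is a variable with yield $\alpha_i$. Since the $X_i$-tree has fewer than $k$ interior vertices, by the inductive hypothesis $X_i \Rightarrow^{*} \alpha_i$.
For any $X_j$ which are leaves, $X_j = \alpha_j$, so $X_j \Rightarrow^{*} \alpha_j$.
So $A \Rightarrow X_1 X_2 \cdots X_n \Rightarrow^{*} \alpha_1 \alpha_2 \cdots \alpha_n = \alpha$.
So $A \Rightarrow^{*} \alpha$.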
w may have one or more derivations.
w may have one or more leftmost derivations (i.e. w may have one or more parse trees).
Ambiguity
The parse tree represents the intended meaning. A grammar is ambiguous if some string has more than one parse tree (i.e. more than one leftmost derivation, or equivalently more than one rightmost derivation).
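Ambiguity can be observed by counting parse trees. The sketch below does this for the expression grammar E → E+E | E-E | E*E | E/E | E^E | x | y | z | (E) used in the disambiguation example of these notes; the function name `count_trees` and the span-based counting scheme are illustrative assumptions.

```python
from functools import lru_cache

OPS = "+-*/^"


def count_trees(s):
    """Number of parse trees of s under E -> E op E | (E) | x | y | z."""
    @lru_cache(maxsize=None)
    def count(i, j):                      # trees for the substring s[i:j]
        if j - i <= 0:
            return 0
        if j - i == 1:
            return 1 if s[i] in "xyz" else 0
        total = 0
        if s[i] == "(" and s[j - 1] == ")":
            total += count(i + 1, j - 1)  # E -> (E)
        for k in range(i + 1, j - 1):     # E -> E op E with the operator at k
            if s[k] in OPS:
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(s)) if s else 0
```

A string with a count above 1, such as `x+y*z`, witnesses that the grammar is ambiguous; a count of 0 means the string is not generated at all.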
Disambiguation Example
Some ambiguous grammars can be disambiguated by enforcing precedence and associativity rules.
E → E+E | E-E | E*E | E/E | E^E | x | y | z | (E)
precedence: ^ (right to left) > *, / (left to right) > +, - (left to right)
Disambiguated grammar:
P → x | y | z | (E) (start with most basic indivisible elements)
F → P^F | P (not F → F^P because ^ is right to left)
T → T*F | T/F | F
E → E+T | E-T | T (in each step, refer to next higher precedence level)
Example:
E → E + E | E − E | E × E | (E) | V
V → x | y | z
[Figure: parse trees built from E and V nodes with leaves x, y, z]
In x*y^z+x/(y-z):
T stands for term: x*y^z, x/(y-z), y, z
F stands for factor: x, y^z, x, (y-z)
P stands for power: x, y, z, y^z, (y-z)
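The term, factor and power levels translate directly into a recursive-descent parser. The sketch below assumes the usual layered grammar E → E+T | E-T | T, T → T*F | T/F | F, F → P^F | P, P → x | y | z | (E), which is a reading of the T/F/P description above rather than something quoted from the notes. Left-recursive rules are implemented as loops (left to right), and the rule for ^ as a recursive call (right to left). The function name `parse` and the fully parenthesized output format are hypothetical.

```python
def parse(s):
    """Parse s and return a fully parenthesized string showing the tree shape."""
    pos = 0

    def peek():
        return s[pos] if pos < len(s) else ""

    def eat(ch):
        nonlocal pos
        if peek() != ch:
            raise SyntaxError(f"expected {ch!r} at position {pos}")
        pos += 1

    def power():                       # P -> x | y | z | (E)
        nonlocal pos
        ch = peek()
        if ch == "(":
            eat("(")
            inner = expr()
            eat(")")
            return inner
        if ch and ch in "xyz":
            pos += 1
            return ch
        raise SyntaxError(f"unexpected {ch!r} at position {pos}")

    def factor():                      # F -> P ^ F | P   (right to left)
        base = power()
        if peek() == "^":
            eat("^")
            return f"({base}^{factor()})"
        return base

    def term():                        # T -> T * F | T / F | F   (left to right)
        left = factor()
        while peek() and peek() in "*/":
            op = peek()
            eat(op)
            left = f"({left}{op}{factor()})"
        return left

    def expr():                        # E -> E + T | E - T | T   (left to right)
        left = term()
        while peek() and peek() in "+-":
            op = peek()
            eat(op)
            left = f"({left}{op}{term()})"
        return left

    result = expr()
    if pos != len(s):
        raise SyntaxError(f"unexpected {peek()!r} at position {pos}")
    return result
```

Running `parse` on `x*y^z+x/(y-z)` makes the grouping into terms, factors and powers explicit, and inputs such as `x^y^z` and `x-y-z` show the two associativity directions.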
Recursive Inference
Another Example
S → aB | bA
A → a | aS | bAA
B → b | bS | aBB
Are ab, baba, abbbaa in L? How about a, bba? What is the language of this CFG? Is the CFG ambiguous?
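One way to explore these questions is to count parse trees for the grammar S → aB | bA, A → a | aS | bAA, B → b | bS | aBB: a count of 0 means the string is not in the language, and a count above 1 shows ambiguity. The sketch below is an illustration with hypothetical names (`GRAMMAR`, `count_parse_trees`); it relies on every production body here starting with a terminal, which guarantees that the recursion terminates.

```python
from functools import lru_cache

GRAMMAR = {
    "S": ["aB", "bA"],
    "A": ["a", "aS", "bAA"],
    "B": ["b", "bS", "aBB"],
}


def count_parse_trees(w, start="S"):
    """Number of parse trees of w from `start` (0 means w is not generated)."""
    @lru_cache(maxsize=None)
    def derive(symbols, i, j):
        # Number of ways the symbol string `symbols` derives w[i:j].
        if not symbols:
            return 1 if i == j else 0
        head, rest = symbols[0], symbols[1:]
        if head not in GRAMMAR:                      # terminal symbol
            return derive(rest, i + 1, j) if i < j and w[i] == head else 0
        total = 0
        # Every symbol derives at least one character, so leave room for `rest`.
        for k in range(i + 1, j - len(rest) + 1):
            left = sum(derive(body, i, k) for body in GRAMMAR[head])
            if left:
                total += left * derive(rest, k, j)
        return total

    return derive(start, 0, len(w))
```

Trying the strings from the questions, and then a longer one such as `aabbab`, gives evidence about both the language and the ambiguity of the grammar.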