Chapter 3 discusses syntax analysis in programming languages, focusing on the parsing problem, parsing notations, and context-free grammars. It explains the goals of parsers, the structure of grammars, and the types of parsers, such as top-down and bottom-up. It also covers recursive-descent parsing, left recursion issues, and the complexity of parsing algorithms.

Chapter 3

Syntax Analysis

The Parsing Problem
• Goals of the parser, given an input program:
• Find all syntax errors;
• for each, produce an appropriate diagnostic message and
recover quickly
• Produce the parse tree, or at least a trace of the parse
tree, for the program

Parsing Notations
• Lowercase letters at the beginning of the alphabet
(a, b, …) for terminal symbols.
• Uppercase letters at the beginning of the alphabet
(A, B, …) for non terminal symbols.
• Uppercase letters at the end of the alphabet (W, X,
Y, Z) for terminals or nonterminals.
• Lowercase letters at the end of the alphabet (w, x,
y, z) for strings of terminals.
• Lowercase Greek letters (α, β, γ, δ) for mixed
strings (terminals and/or nonterminals)

Syntax of Programming Language
• Described by a context-free grammar (Backus-Naur Form -
BNF).
• Similar to the languages specified by regular expressions, but
more general.
• A grammar gives a precise syntactic specification of a language.
• From some classes of grammars, tools exist that can
automatically construct an efficient parser.
• These tools can also detect syntactic ambiguities and other problems
automatically.
• A compiler based on a grammatical description of a language
is more easily maintained and updated.
• Syntax analysis decides whether a program satisfies the syntactic structure
• Error detection
• Error recovery
• Simplification: grammar rules are stated over tokens, not characters
Context-free Grammars
• A context-free grammar for a language specifies the
syntactic structure of programs in that language.
• Components of a grammar:
• a finite set of tokens (obtained from the scanner);
• a set of variables representing “related” sets of strings, e.g.,
declarations, statements, expressions.
• a set of rules that show the structure of these strings.
• an indication of the “top-level” set of strings we care about.

Context-free Grammars: Definition
• Formally, a context-free grammar G is a 4-tuple G = (V, T, P,
S), where:
• V is a finite set of variables (or nonterminals). These describe sets
of “related” strings.
• T is a finite set of terminals (i.e., tokens).
• P is a finite set of productions, each of the form
A → α
where A ∈ V is a variable, and α ∈ (V ∪ T)* is a
sequence of terminals and nonterminals.
• S ∈ V is the start symbol.
Context-free Grammars: An Example
A grammar for palindromic bit-strings:
G = (V, T, P, S), where:
• V = { S, B }
• T = {0, 1}
• P = { S → B,
S → ε,
S → 0 S 0,
S → 1 S 1,
B → 0,
B → 1 }
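Membership in this grammar's language can be decided by mirroring the productions with a recursive function: S → 0S0 and S → 1S1 strip matching outer symbols, while S → ε and S → B (with B → 0 | 1) accept what remains. A minimal sketch in C; the function names are ours, not part of the slides:

```c
#include <string.h>

/* Returns 1 if s[i..j] can be derived from S in the palindrome
   grammar  S -> B | ε | 0 S 0 | 1 S 1,  B -> 0 | 1.            */
static int derives_S(const char *s, int i, int j) {
    if (i > j)  return 1;              /* S -> ε              */
    if (i == j) return 1;              /* S -> B, B -> 0 | 1  */
    if (s[i] == s[j])                  /* S -> 0 S 0 | 1 S 1  */
        return derives_S(s, i + 1, j - 1);
    return 0;
}

int is_palindrome_string(const char *s) {
    return derives_S(s, 0, (int)strlen(s) - 1);
}
```

For example, the function accepts 10101 but rejects 10.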
Context-free Grammars: Terminology

• Derivation: Suppose that
• α and β are strings of grammar symbols, and
• A → γ is a production.
Then, αAβ ⇒ αγβ ("αAβ derives αγβ").

• ⇒ : "derives in one step"
• ⇒* : "derives in 0 or more steps"
α ⇒* α (0 steps)
α ⇒* γ if α ⇒ β and β ⇒* γ (≥ 1 steps)
Derivations: Example
• Grammar for palindromes: G = (V, T, P, S),
• V = {S},
• T = {0, 1},
• P = { S → 0 S 0 | 1 S 1 | 0 | 1 | ε }.
• A derivation of the string 10101:
S
⇒ 1S1 (using S → 1S1)
⇒ 10S01 (using S → 0S0)
⇒ 10101 (using S → 1)
Leftmost and Rightmost Derivations
• A leftmost derivation is one where, at each step, the leftmost
nonterminal is replaced.
(analogous for rightmost derivation)
• Example: a grammar for arithmetic expressions:
E → E + E | E * E | id
• Leftmost derivation:
E ⇒ E * E ⇒ E + E * E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id
• Rightmost derivation:
E ⇒ E + E ⇒ E + E * E ⇒ E + E * id ⇒ E + id * id ⇒ id + id * id
Summary on Syntax
Grammar rules:
E → id
E → num
E → E + E
E → E * E
E → ( E )

Symbols: terminals (tokens) + * ( ) id num; non-terminal: E

Derivation of 1 + 2 * 3:
E ⇒ E + E ⇒ 1 + E ⇒ 1 + E * E ⇒ 1 + 2 * E ⇒ 1 + 2 * 3

Parse tree: E at the root with children E, +, E; the right-hand E expands to E * E, deriving 2 and 3.
Summary on Syntax - Ambiguity
Grammar rules:
E → id
E → num
E → E + E
E → E * E
E → ( E )

The sentence 1 + 2 * 3 has two different derivations with two different parse trees, so the grammar is ambiguous.

Leftmost derivation:
E ⇒ E + E ⇒ 1 + E ⇒ 1 + E * E ⇒ 1 + 2 * E ⇒ 1 + 2 * 3
(parse tree groups the product: 1 + (2 * 3))

Rightmost derivation:
E ⇒ E * E ⇒ E * 3 ⇒ E + E * 3 ⇒ E + 2 * 3 ⇒ 1 + 2 * 3
(parse tree groups the sum: (1 + 2) * 3)
Summary on Syntax - Grammar rewriting
Ambiguous grammar:
E → id
E → num
E → E + E
E → E * E
E → ( E )

Non-ambiguous grammar:
E → E + T
E → T
T → T * F
T → F
F → id
F → ( E )

Derivation of 1 + 2 * 3 with the non-ambiguous grammar:
E ⇒ E + T ⇒ 1 + T ⇒ 1 + T * F ⇒ 1 + F * F ⇒ 1 + 2 * F ⇒ 1 + 2 * 3

Parse tree: * now nests below +, so only the grouping 1 + (2 * 3) is derivable.
Categories of Parsers
• Top down - produce the parse tree, beginning at the root
• Order is that of a leftmost derivation - i.e. branches from a particular node are followed in left-to-
right order
• Traces or builds the parse tree in preorder - i.e. each node is visited before its branches are
followed
• For every non-terminal and token predict the next production
• Top Down Parsers are classified into two:
• Recursive Descent Parsers
• LL Parsers

• Bottom up - produce the parse tree, beginning at the leaves towards the root
• Order is that of the reverse of a rightmost derivation - That is, the sentential forms of the
derivation are produced in order of last to first.
• For every potential right hand side and token decide when a production is found

• Useful parsers look only one token ahead in the input

Top-Down Parsers
• Determining the next sentential form is a matter of choosing the correct grammar rule
that has A as its LHS
• the leftmost derivation, using only the first token produced by A
• E.g. a sentential form, xA with the following A-rules
A → bB
A → cBb
A→a
the parser must choose the correct A-rule to get the next sentential form, which could be xbB, xcBb, or xa
• This is the parsing decision problem for top-down parsers
• The most common top-down parsing algorithms are called LL algorithms
• first L specifies a left-to-right scan of the input
• second L specifies that a leftmost derivation is generated.
• Two implementations of the algorithms are possible
• Using a recursive-descent parser: a hand-coded syntax analyzer based directly on the BNF description of the
syntax of the language.
• Using a parsing table to implement the BNF rules.
Complexity of Parsing
• Parsers (algorithms) that work for any unambiguous
grammar are complex and inefficient
• Complexity of such algorithms is O(n3), where n is the length
of the input
• Thus, need to search for faster algorithms, though less general i.e.
generality is traded for efficiency.
• Compilers use parsers that only work for a subset of all
unambiguous grammars, but do it in linear time ( O(n),
where n is the length of the input )
Recursive-Descent Parsing
• Recursive descent parsing is a method where each non-terminal in the grammar is
associated with a procedure or function in the parsing code.
• There is a subprogram for each nonterminal in the grammar, which can parse sentences
that can be generated by that nonterminal
• These procedures recursively call each other to match the input string against the
production rules of the grammar.
• Recursive descent parsers are relatively straightforward to implement and understand,
but
• They can be inefficient for grammars with left recursion or ambiguity.
• EBNF is ideally suited for being the basis for a recursive-descent parser
• because EBNF minimizes the number of nonterminals
• A recursive-descent parser is an LL parser
Recursive-Descent Parsing (cont.)
• A grammar for simple expressions:

<expr> → <term> {(+ | -) <term>}

<term> → <factor> {(* | /) <factor>}
<factor> → id | int_constant | ( <expr> )
Recursive-Descent Parsing (cont.)
• Assume we have a lexical analyzer named lex, which puts
the next token code in nextToken
• The coding process when there is only one RHS:
• For each terminal symbol in the RHS, compare it with the next
input token;
• if they match, continue, else there is an error
• For each nonterminal symbol in the RHS, call its associated
parsing subprogram

Recursive-Descent Parsing (cont.)
/* Function expr
   Parses strings in the language generated by the rule:
   <expr> → <term> {(+ | -) <term>}
*/
void expr() {
    /* Parse the first term */
    term();
    /* As long as the next token is + or -, call
       lex to get the next token and parse the next term */
    while (nextToken == ADD_OP || nextToken == SUB_OP) {
        lex();
        term();
    }
}

• This particular routine does not detect errors
• Convention: Every parsing routine leaves the next token in nextToken
Recursive-Descent Parsing (cont.)
/* term
   Parses strings in the language generated by the rule:
   <term> → <factor> {(* | /) <factor>}
*/
void term() {
    printf("Enter <term>\n");
    /* Parse the first factor */
    factor();
    /* As long as the next token is * or /, call lex to get the
       next token and parse the next factor */
    while (nextToken == MULT_OP || nextToken == DIV_OP) {
        lex();
        factor();
    }
    printf("Exit <term>\n");
} /* End of function term */
Recursive-Descent Parsing (cont.)
• A nonterminal that has more than one RHS requires an initial
process to determine which RHS it is to parse
• The correct RHS is chosen on the basis of the next token of input (the
lookahead)
• The next token is compared with the first token that can be generated by
each RHS until a match is found
• If no match is found, it is a syntax error

Recursive-Descent Parsing (cont.)
/* Function factor
   Parses strings in the language generated by the rule:
   <factor> → id | int_constant | ( <expr> )
*/
void factor() {
    /* Determine which RHS */
    if (nextToken == ID_CODE || nextToken == INT_CODE)
        /* For the RHS id or int_constant, just call lex */
        lex();
    /* If the RHS is ( <expr> ), call lex to pass over the left parenthesis,
       call expr, and check for the right parenthesis */
    else if (nextToken == LP_CODE) {
        lex();
        expr();
        if (nextToken == RP_CODE)
            lex();
        else
            error();
    } /* End of else if (nextToken == ... */
    else
        error(); /* Neither RHS matches */
}
Recursive-Descent Parsing (cont.)
- Trace of the lexical and syntax analyzers on (sum + 47) / total

Next token is: 25  Next lexeme is (
Enter <expr>
Enter <term>
Enter <factor>
Next token is: 11  Next lexeme is sum
Enter <expr>
Enter <term>
Enter <factor>
Next token is: 21  Next lexeme is +
Exit <factor>
Exit <term>
Next token is: 10  Next lexeme is 47
Enter <term>
Enter <factor>
Next token is: 26  Next lexeme is )
Exit <factor>
Exit <term>
Exit <expr>
Next token is: 24  Next lexeme is /
Exit <factor>
Next token is: 11  Next lexeme is total
Enter <factor>
Next token is: -1  Next lexeme is EOF
Exit <factor>
Exit <term>
Exit <expr>
Recursive-Descent Parsing - Left Recursion
Problem
• A problem in the LL grammar class
• Left recursion: E → E + T
• Symbol on the left is also the first symbol on the right
• Predictive parsing fails when two rules can start with the same token:
E → E + T
E → T
• If a grammar has left recursion, either direct or indirect, it cannot be the basis for a top-down
parser
• A grammar can be modified to remove left recursion
• For each nonterminal, A,
1. Group the A-rules as A → Aα1 | … | Aαm | β1 | β2 | … | βn
where none of the β's begins with A
2. Replace the original A-rules with
A → β1A' | β2A' | … | βnA'
A' → α1A' | α2A' | … | αmA' | ε
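Applying step 2 to the expression grammar E → E + T | T (and similarly to T) yields E → T E', E' → + T E' | ε, which a top-down parser handles directly. The sketch below is our own illustration, using single-character tokens and digits as factors:

```c
#include <assert.h>
#include <ctype.h>

/* Grammar after removing left recursion:
     E  -> T E'       E' -> '+' T E' | ε
     T  -> F T'       T' -> '*' F T' | ε
     F  -> digit
   The primed nonterminals E' and T' are exactly the A' rules
   introduced by the transformation.                            */
static const char *p;                /* cursor into the input */

static int factor(void) {
    assert(isdigit((unsigned char)*p));
    return *p++ - '0';
}
static int termPrime(int left) {     /* T' -> '*' F T' | ε */
    if (*p == '*') { p++; return termPrime(left * factor()); }
    return left;
}
static int term(void) { return termPrime(factor()); }

static int exprPrime(int left) {     /* E' -> '+' T E' | ε */
    if (*p == '+') { p++; return exprPrime(left + term()); }
    return left;
}
int eval(const char *s) { p = s; return exprPrime(term()); }
```

Because the recursion is now on the right, each function consumes at least one token before recursing, so the parser cannot loop forever the way a directly left-recursive rule would.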
Left Factoring
• Non-terminal with two rules starting with the same prefix

Grammar:
S → if E then S else S
S → if E then S

Left-factored grammar:
S → if E then S X
X → else S | ε
Recursive-Descent Parsing – lack of pairwise
disjointness
• Another problem that disallows top-down parsing is
• whether the parser can always choose the correct RHS on the basis of next
token input using only the first token generated by the leftmost nonterminal
in the current sentential form i.e. one token lookahead.
• This is referred to as lack of pairwise disjointness
• To solve this, a pairwise disjointness test needs to be performed on the
FIRST sets.
FIRST(α) = {a | α ⇒* aβ }
(If α ⇒* ε, ε is in FIRST(α))

in which ⇒* means 0 or more derivation steps.
Recursive-Descent Parsing (cont.)
• Pairwise Disjointness Test:
• For each nonterminal, A, in the grammar that has more than one RHS, for each pair
of rules, A → αi and A → αj, it must be true that
FIRST(αi) ∩ FIRST(αj) = ∅
• Examples:
A → aB | bAb | Bb
B → cB | d — passes the test

A → aB | BAb
B → aB | b — fails the test
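The test itself is mechanical once the FIRST sets are known: represent each RHS's FIRST set as a bitmask over terminals and check that every pair intersects to the empty set. The encoding below is ours, with the two example grammars' FIRST sets hard-coded:

```c
/* One bit per terminal: a=1, b=2, c=4, d=8. */
enum { Ta = 1, Tb = 2, Tc = 4, Td = 8 };

/* Returns 1 if every pair of FIRST sets is disjoint. */
int pairwise_disjoint(const unsigned first[], int n) {
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (first[i] & first[j])
                return 0;                  /* shared first token */
    return 1;
}

/* A -> aB | bAb | Bb with B -> cB | d:
   FIRST(aB)={a}, FIRST(bAb)={b}, FIRST(Bb)=FIRST(B)={c,d} */
const unsigned g1[] = { Ta, Tb, Tc | Td };
/* A -> aB | BAb with B -> aB | b:
   FIRST(aB)={a}, FIRST(BAb)=FIRST(B)={a,b} */
const unsigned g2[] = { Ta, Ta | Tb };
```

g1 passes because {a}, {b}, and {c, d} are pairwise disjoint; g2 fails because FIRST(aB) = {a} and FIRST(BAb) = {a, b} share a.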
Recursive-Descent Parsing (cont.)
• Left factoring can resolve the problem

Replace
<variable> → identifier | identifier [<expression>]
with
<variable> → identifier <new>
<new> → ε | [<expression>]
or
<variable> → identifier [[<expression>]]
(the outer brackets are metasymbols of EBNF)
Top-down parsing - Example
• Builds the parse tree in preorder
• LL(1) example

Grammar:
S → if E then S else S
S → begin S L
S → print E
L → end
L → ; S L
E → num

Input: if 5 then print 8 else …

Token : rule applied (sentential form after each step)
if    : S → if E then S else S   →  if E then S else S
5     : E → num                  →  if 5 then S else S
print : S → print E              →  if 5 then print E else S
LL Parsers
• LL parsers are a type of top-down parser used in computer science to analyze and process the structure of
strings according to a formal grammar.
• The term "LL" stands for "Left-to-right, Leftmost derivation," indicating the strategy used by these parsers to
process input.
• Here are some key characteristics of LL parsers:
• Left-to-right scanning: LL parsers scan the input string from left to right, processing symbols in the order
they appear.
• Leftmost derivation: LL parsers aim to derive the leftmost derivation of the input string. This means that
they always expand the leftmost non-terminal in the current sentential form.
• Predictive parsing: LL parsers use predictive parsing to determine which production rule to apply at each
step based on a finite lookahead. This lookahead involves examining a fixed number of input symbols to
predict the next production rule to apply.
• LL(k) grammars: LL parsers are often characterized by the maximum number of tokens they look ahead in
the input string. For example, LL(1) parsers look ahead one token to decide which production rule to apply,
while LL(k) parsers look ahead k tokens.
• Table-driven parsing: LL parsers are typically implemented using parsing tables, which store information
about which production rule to apply for each combination of non-terminal and lookahead symbol.

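A table-driven LL(1) parser can be sketched in a few lines for a toy grammar. The grammar, symbol encoding, and function below are our own illustration (E' is written 'e', id is 'i', and '$' marks end of input); each branch of the if-chain plays the role of one entry of the parsing table:

```c
/* Stack-driven LL(1) recognizer for the toy grammar
     E -> T e      e -> '+' T e | ε      T -> 'i'
   where 'e' stands for E', 'i' for id, and '$' ends the input.
   Each branch below corresponds to one parsing-table entry
   M[nonterminal, lookahead].                                    */
int ll1_parse(const char *input) {
    char stack[64];
    int top = 0;
    const char *p = input;
    stack[top++] = '$';                /* bottom marker */
    stack[top++] = 'E';                /* start symbol  */
    while (top > 0) {
        char X = stack[--top];
        if (X == 'i' || X == '+' || X == '$') {
            if (X != *p) return 0;     /* terminal must match input */
            p++;
        } else if (X == 'E' && *p == 'i') {         /* E -> T e   */
            stack[top++] = 'e'; stack[top++] = 'T';
        } else if (X == 'e' && *p == '+') {         /* e -> + T e */
            stack[top++] = 'e'; stack[top++] = 'T'; stack[top++] = '+';
        } else if (X == 'e' && *p == '$') {         /* e -> ε (just pop) */
        } else if (X == 'T' && *p == 'i') {         /* T -> i     */
            stack[top++] = 'i';
        } else {
            return 0;                  /* empty table entry: error */
        }
    }
    return *p == '\0';
}
```

Expanding a nonterminal pushes its RHS in reverse so the leftmost symbol is processed first, which is what makes the parse trace out a leftmost derivation.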
Bottom Up Parsing
• Unlike top-down parsing, which starts with the root of the parse tree and works down to the leaves, bottom-
up parsing begins with the input string and builds the parse tree from the leaves up to the root.
• Here are the key characteristics of bottom-up parsing:
1. Shift-Reduce Parsing: Bottom-up parsing is often implemented using a strategy called shift-reduce parsing.
In shift-reduce parsing, the parser shifts input symbols onto a stack until it can reduce a portion of the stack
to a non-terminal symbol according to the grammar rules.
2. Reduction: Reduction involves replacing a sequence of symbols on the top of the stack with a non-terminal
symbol according to a production rule in the grammar. The parser continues reducing portions of the stack
until it reaches the start symbol of the grammar.
3. Handle: During reduction, the portion of the stack that matches the right-hand side of a production rule is
called a handle. The parser identifies handles and replaces them with the corresponding non-terminal
symbol.
4. Bottom-up Parse Tree: The result of bottom-up parsing is a parse tree rooted at the start symbol of the
grammar, with the input string as its leaves. Each internal node in the parse tree represents a non-terminal
symbol, and its children represent the symbols derived from that non-terminal.
5. Shift-Reduce Conflict and Reduce-Reduce Conflict: Bottom-up parsing may encounter shift-reduce conflicts
or reduce-reduce conflicts when deciding whether to shift a symbol onto the stack or reduce a portion of
the stack. Conflicts can arise due to ambiguity or lack of sufficient lookahead in the grammar.

Bottom-Up Parsers
• Given a right sentential form, γ, determine what substring
of γ is the right-hand side of the rule in the grammar that
must be reduced to produce the previous sentential form in
the rightmost derivation
• E.g.
S → aAc
A → aA | b

S ⇒ aAc ⇒ aaAc ⇒ aabc

• The correct RHS is called the handle.
• The most common bottom-up parsing algorithms are in the LR family
• L specifies a left-to-right scan of the input and the
• R specifies that a rightmost derivation is generated
Bottom-up Parsing
• The parsing problem is finding the correct RHS in a right-sentential form
(handle) to reduce to get the previous right-sentential form in the
derivation
• No problem with left recursion
• Example grammar:

E→E+T|T
T→T*F|F
F → ( E ) | id (1)

E.g. derived sentence: id + id * id

Bottom-up Parsing (cont.)
• Intuition about handles:
• Def: β is the handle of the right sentential form
γ = αβw if and only if S ⇒*rm αAw ⇒rm αβw

• Def: β is a phrase of the right sentential form
γ if and only if S ⇒* γ = α1Aα2 ⇒+ α1βα2

• Def: β is a simple phrase of the right sentential form γ if and only if S ⇒* γ = α1Aα2 ⇒
α1βα2

• The handle of a right sentential form is its leftmost simple phrase

• Given a parse tree, it is now easy to find the handle
• Parsing can be thought of as handle pruning
Bottom up Parsing - Shift-Reduce parsing
• Uses Pushdown Automata (PDA)
• Parser stack: symbols (terminal and non-terminals) + automaton states
• Parsing actions: sequence of shift and reduce operations
• Action determined by top of stack and k input tokens
• Shift: move next token to top of stack
• Reduce: replacing the handle on the top of the parse stack with its
corresponding LHS
• For example: given the rule X → A B C,
reduce pops C, B, A and then pushes X
• Convention: $ stands for end of file
• The LR family of shift-reduce parsers is the most common bottom-up
parsing approach

Shift-reduce Parsing: Example
Grammar: S → aABe
A → Abc | b
B→d

Input: abbcde
⇒ aAbcde (using A → b)
⇒ aAde (using A → Abc)
⇒ aABe (using B → d)
⇒ S (using S → aABe)
Shift-Reduce Parsing: cont’d
• Need to choose reductions carefully:
abbcde ⇒ aAbcde ⇒ aAbcBe ⇒ …
doesn't work.
• A handle of a string s is a substring β s.t.:
• β matches the RHS of a rule A → β; and
• replacing β by A (the LHS of the rule) represents a step in the
reverse of a rightmost derivation of s.
• For shift-reduce parsing, reduce only handles.
Shift-reduce Parsing: Implementation
• Data Structures:
• a stack, its bottom marked by ‘$’. Initially empty.
• the input string, its right end marked by ‘$’. Initially w.
• Actions:
repeat
1. Shift some (≥ 0) symbols from the input string onto the stack, until a
handle β appears on top of the stack.
2. Reduce β to the LHS of the appropriate production.
until ready to accept.
• Acceptance: when the input is empty and the stack contains only the
start symbol.
Example
Stack (→) Input Action
$ abbcde$ shift
$a bbcde$ shift
$ab bcde$ reduce: A → b Grammar :
$aA bcde$ shift S → aABe
$aAb cde$ shift A → Abc | b
$aAbc de$ reduce: A → Abc B→d
$aA de$ shift
$aAd e$ reduce: B → d
$aAB e$ shift
$aABe $ reduce: S → aABe
$S $ accept

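The trace above can be reproduced by a small hand-written shift-reduce recognizer. Real LR parsers decide when to reduce from automaton states; the fixed-priority suffix checks below (including reducing b to A only directly after a) are a simplification of ours that happens to work for this one grammar:

```c
#include <string.h>

/* Shift-reduce recognizer for:
     S -> aABe      A -> Abc | b      B -> d          */
static int ends_with(const char *st, const char *sfx) {
    size_t n = strlen(st), m = strlen(sfx);
    return n >= m && strcmp(st + n - m, sfx) == 0;
}

int sr_parse(const char *input) {
    char stack[64] = "";
    size_t top = 0;
    const char *p = input;
    for (;;) {
        /* Try to reduce a handle on top of the stack. */
        if (ends_with(stack, "aABe")) { top -= 4; stack[top++] = 'S'; }      /* S -> aABe */
        else if (ends_with(stack, "Abc")) { top -= 3; stack[top++] = 'A'; }  /* A -> Abc  */
        else if (ends_with(stack, "d"))   { top -= 1; stack[top++] = 'B'; }  /* B -> d    */
        else if (ends_with(stack, "ab"))  { top -= 1; stack[top++] = 'A'; }  /* A -> b    */
        else if (*p)                      { stack[top++] = *p++; }           /* shift     */
        else break;                       /* no handle and no input left */
        stack[top] = '\0';
    }
    return strcmp(stack, "S") == 0;      /* accept iff stack is exactly S */
}
```

On abbcde this performs exactly the shift/reduce sequence in the table above.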
Conflicts
• Can’t decide whether to shift or to reduce
• both seem OK (“shift-reduce conflict”).
Example: S → if E then S | if E then S else S | …

• Can’t decide which production to reduce with


• several may fit (“reduce-reduce conflict”).
Example: Stmt → id ( args ) | Expr
Expr → id ( args )

Advantages of LR parsers
• They will work for nearly all grammars that describe
programming languages.
• They work on a larger class of grammars than other bottom-
up algorithms, but are as efficient as any other bottom-up
parser.
• They can detect syntax errors as soon as it is possible.
• The LR class of grammars is a superset of the class parsable
by LL parsers.

Constructing LR Parsers
• LR parsers must be constructed with a tool
• Knuth’s insight: A bottom-up parser could use the entire
history of the parse, up to the current point, to make parsing
decisions
• There were only a finite and relatively small number of different
parse situations that could have occurred, so the history could be
stored in a parser state, on the parse stack

Constructing LR Parsers (cont.)
• An LR configuration stores the state of an LR parser:

(S0X1S1X2S2…XmSm, aiai+1…an$)

where the first component is the parse stack (alternating states Si and grammar symbols Xi) and the second is the remaining input.
Constructing LR Parsers (cont.)
• LR parsers are table driven, where the table has
two components
• The ACTION table specifies the action of the parser,
given the parser state and the next token
• Rows are state names; columns are terminals
• The GOTO table specifies which state to put on top
of the parse stack after a reduction action is done
• Rows are state names; columns are nonterminals

Structure of An LR Parser

(figure omitted: input buffer, parse stack, parser driver, and the ACTION/GOTO tables)
Parser Actions
• Initial configuration: (S0, a1…an$)
• Parser actions:
• If ACTION[Sm, ai] = Shift S, the next configuration is:
(S0X1S1X2S2…XmSmaiS, ai+1…an$)
• If ACTION[Sm, ai] = Reduce A → β and S = GOTO[Sm-r, A], where r =
the length of β, the next configuration is
(S0X1S1X2S2…Xm-rSm-rAS, aiai+1…an$)
• If ACTION[Sm, ai] = Accept, the parse is complete and no errors
were found.
• If ACTION[Sm, ai] = Error, the parser calls an error-handling routine.
LR Parsing Table

• A parser table can be generated from a given grammar with a tool, e.g., yacc
(table figure omitted)
Bottom-up Parsing (cont.)
• Grammar (1) rewritten and numbered for easy
referencing in a parsing table.

1. E→E+T
2. E→T
3. T→T*F
4. T→F
5. F→(E)
6. F → id

Bottom-up Parsing (cont.)
Stack Input Action

0 id + id * id $ Shift 5

0id5 + id * id $ Reduce 6 (use GOTO[0, F])

0F3 + id * id $ Reduce 4 (use GOTO[0, T])

0T2 + id * id $ Reduce 2 (use GOTO[0, E])

0E1 + id * id $ Shift 6

0E1+6 id * id $ Shift 5

0E1+6id5 * id $ Reduce 6 (use GOTO[6, F])

0E1+6F3 * id $ Reduce 4 (use GOTO[6, T])

0E1+6T9 * id $ Shift 7

0E1+6T9*7 id $ Shift 5

0E1+6T9*7id5 $ Reduce 6 (use GOTO[7, F])

0E1+6T9*7F10 $ Reduce 3 (use GOTO[6, T])

0E1+6T9 $ Reduce 1 (use GOTO[0, E])

0E1 $ Accept

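The whole trace can be replayed by a table-driven driver. The sketch below hard-codes ACTION/GOTO tables for the numbered grammar; the state numbering matches the one assumed by the trace, but treat the encoding itself as illustrative:

```c
/* Table-driven recognizer for the numbered grammar:
     1: E->E+T  2: E->T  3: T->T*F  4: T->F  5: F->(E)  6: F->id
   Terminal indices: id=0 '+'=1 '*'=2 '('=3 ')'=4 '$'=5.
   ACTION encoding: 0 = error, 100+s = shift state s,
   200+p = reduce by production p, 999 = accept.                 */
enum { ERR = 0, ACC = 999 };
#define S(x) (100 + (x))
#define R(x) (200 + (x))

static const int ACTION[12][6] = {
    /*  id     +      *      (      )      $   */
    { S(5),  ERR,   ERR,   S(4),  ERR,   ERR  },   /* 0  */
    { ERR,   S(6),  ERR,   ERR,   ERR,   ACC  },   /* 1  */
    { ERR,   R(2),  S(7),  ERR,   R(2),  R(2) },   /* 2  */
    { ERR,   R(4),  R(4),  ERR,   R(4),  R(4) },   /* 3  */
    { S(5),  ERR,   ERR,   S(4),  ERR,   ERR  },   /* 4  */
    { ERR,   R(6),  R(6),  ERR,   R(6),  R(6) },   /* 5  */
    { S(5),  ERR,   ERR,   S(4),  ERR,   ERR  },   /* 6  */
    { S(5),  ERR,   ERR,   S(4),  ERR,   ERR  },   /* 7  */
    { ERR,   S(6),  ERR,   ERR,   S(11), ERR  },   /* 8  */
    { ERR,   R(1),  S(7),  ERR,   R(1),  R(1) },   /* 9  */
    { ERR,   R(3),  R(3),  ERR,   R(3),  R(3) },   /* 10 */
    { ERR,   R(5),  R(5),  ERR,   R(5),  R(5) },   /* 11 */
};
static const int GOTO_[12][3] = {   /* columns: E=0 T=1 F=2 */
    {1, 2, 3}, {0, 0, 0}, {0, 0, 0}, {0, 0, 0},
    {8, 2, 3}, {0, 0, 0}, {0, 9, 3}, {0, 0, 10},
    {0, 0, 0}, {0, 0, 0}, {0, 0, 0}, {0, 0, 0},
};
static const int RHS_LEN[7] = { 0, 3, 1, 3, 1, 3, 1 };
static const int LHS[7]     = { 0, 0, 0, 1, 1, 2, 2 };  /* E=0 T=1 F=2 */

/* tok: terminal indices, ending with 5 ('$'). Returns 1 on accept. */
int lr_parse(const int *tok) {
    int states[128], top = 0, i = 0;
    states[0] = 0;
    for (;;) {
        int a = ACTION[states[top]][tok[i]];
        if (a == ACC) return 1;
        if (a >= 200) {                       /* reduce A -> beta */
            int prod = a - 200;
            top -= RHS_LEN[prod];             /* pop |beta| states */
            states[top + 1] = GOTO_[states[top]][LHS[prod]];
            top++;
        } else if (a >= 100) {                /* shift */
            states[++top] = a - 100;
            i++;
        } else {
            return 0;                         /* error entry */
        }
    }
}
```

For input id + id * id $ (encoded {0, 1, 0, 2, 0, 5}) the driver performs the shift/reduce sequence shown in the table above.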
YACC Syntax Analysis Tool

Introduction
• What is YACC ?
• Tool which will produce a parser for a given grammar.
• YACC (Yet Another Compiler Compiler)
• A program designed to compile an LALR(1) grammar and to
produce the source code of the syntactic analyzer of the
language produced by this grammar.
Common Tools
• ANTLR tool
• Generates LL(k) parsers
• Yacc (Yet Another Compiler Compiler)
• Generates LALR parsers
• Bison
• Improved version of Yacc

YACC File Format
%{
C declarations
%}
yacc declarations
%%
Grammar rules
%%
Additional C code

• Comments enclosed in /* ... */ may appear in any of


the sections.

YACC
• Input specification for YACC (similar to flex)
• Three parts: Definitions, Rules, User code
• Use “%%” as a delimiter for each part

• First part: Definitions (C and YACC declarations)


• Definition of tokens for the second part and for use by flex
• Definition of variables for use by the parser code

• Second part: Rules


• Grammar for the parser

• Third part: User code


• The code in this part is copied into the parser generated by YACC
Definitions Section
%{
#include <stdio.h>
#include <stdlib.h>
%}

%token ID NUM    /* ID and NUM are terminals */
%start expr      /* parsing starts at the nonterminal expr */
YACC Declaration Summary
`%start'
Specify the grammar's start symbol
`%union'
Declare the collection of data types that semantic values may have
`%token'
Declare a terminal symbol (token type name) with no precedence or associativity specified
`%type'
Declare the type of semantic values for a nonterminal symbol
`%right'
Declare a terminal symbol (token type name) that is
right-associative
`%left'
Declare a terminal symbol (token type name) that is left-associative
`%nonassoc'
Declare a terminal symbol (token type name) that is nonassociative (using it in a way that would be
associative is a syntax error, e.g., x op y op z is a syntax error)
Rules Section
• This section defines the grammar
• Example
expr : expr '+' term | term;
term : term '*' factor | factor;
factor : '(' expr ')' | ID | NUM;

Rules Section
• Normally written like this
• Example:
expr : expr '+' term
| term
;
term : term '*' factor
| factor
;
factor : '(' expr ')'
| ID
| NUM
;
The Position of Rules
expr : expr '+' term { $$ = $1 + $3; }
| term { $$ = $1; }
;
term : term '*' factor { $$ = $1 * $3; }
| factor { $$ = $1; }
;
factor : '(' expr ')' { $$ = $2; }
| ID
| NUM
;

In an action, $1, $2, $3, … refer to the semantic values of the corresponding right-hand-side symbols (in expr : expr '+' term, $1 is the value of expr, $2 of '+', and $3 of term), and $$ is the value of the left-hand side. When no action is given, the default is $$ = $1;
YACC File Example
%{
#include <stdio.h>
%}

%token NAME NUMBER


%%

statement: NAME '=' expression


| expression { printf("= %d\n", $1); }
;

expression: expression '+' NUMBER { $$ = $1 + $3; }


| expression '-' NUMBER { $$ = $1 - $3; }
| NUMBER { $$ = $1; }
;
%%
int yyerror(char *s)
{
fprintf(stderr, "%s\n", s);
return 0;
}

int main(void)
{
yyparse();
return 0;
}
Example 1
%{ #include <ctype.h> %}
%token DIGIT          /* also results in the definition of #define DIGIT xxx */
%%
line   : expr '\n'           { printf("= %d\n", $1); }
       ;
expr   : expr '+' term       { $$ = $1 + $3; }
       | term                { $$ = $1; }
       ;
term   : term '*' factor     { $$ = $1 * $3; }
       | factor              { $$ = $1; }
       ;
factor : '(' expr ')'        { $$ = $2; }
       | DIGIT               { $$ = $1; }   /* $1: attribute of the token (stored in yylval) */
       ;
%%
/* Example of a very crude lexical analyzer invoked by the parser */
int yylex()
{   int c = getchar();
    if (isdigit(c))
    {   yylval = c - '0';
        return DIGIT;
    }
    return c;
}
Example 2
%{
#include <ctype.h>
#include <stdio.h>
#define YYSTYPE double   /* double type for attributes and yylval */
%}
%token NUMBER
%left '+' '-'
%left '*' '/'
%right UMINUS
%%
lines : lines expr '\n'  { printf("= %g\n", $2); }
      | lines '\n'
      | /* empty */
      ;
expr  : expr '+' expr    { $$ = $1 + $3; }
      | expr '-' expr    { $$ = $1 - $3; }
      | expr '*' expr    { $$ = $1 * $3; }
      | expr '/' expr    { $$ = $1 / $3; }
      | '(' expr ')'     { $$ = $2; }
      | '-' expr %prec UMINUS { $$ = -$2; }
      | NUMBER
      ;
%%
Example 2 (cont’d)
%%
/* Crude lexical analyzer for floating-point doubles and arithmetic operators */
int yylex()
{   int c;
    while ((c = getchar()) == ' ')
        ;
    if ((c == '.') || isdigit(c))
    {   ungetc(c, stdin);
        scanf("%lf", &yylval);
        return NUMBER;
    }
    return c;
}
/* Run the parser */
int main()
{   if (yyparse() != 0)
        fprintf(stderr, "Abnormal exit\n");
    return 0;
}
/* Invoked by the parser to report parse errors */
int yyerror(char *s)
{   fprintf(stderr, "Error: %s\n", s);
}
How YACC Works
gram.y (file containing the desired grammar in yacc format)
→ yacc → y.tab.c (C source program created by yacc)
→ cc or gcc → a.out (executable program that will parse the grammar given in gram.y)
How YACC Works
(1) Parser generation time:
YACC source (*.y) → yacc → y.tab.c, y.tab.h, y.output

(2) Compile time:
y.tab.c → C compiler/linker → a.out

(3) Run time:
token stream → a.out → abstract syntax tree
Benefits of YACC
• Faster development
• Compared to manual implementation
• Easier to change the specification and generate new
parser
• Than to modify 1000s of lines of code to add, change, delete
an existing feature
• Less error-prone, as code is generated
• Cost: Learning curve
• Invest once, amortized over 40+ years career

Lex with Yacc
Lex source (lexical rules) → Lex → lex.yy.c
Yacc source (grammar rules) → Yacc → y.tab.c

At run time, yyparse() calls yylex() on the input; yylex() returns the next
token, and yyparse() produces the parsed output.
YACC works with Lex

When the parser needs a token, it calls yylex(); the scanner matches a
pattern such as [0-9]+ and returns the token (e.g., NUM) to the parser,
which then matches rules such as NUM '+' NUM.
Simple example
• Implement a calculator which can recognize adding or subtracting of
numbers

[linux33]% ./y_calc
1+101
= 102
[linux33] % ./y_calc
1000-300+200+100
= 1000
[linux33] %

Example – the Lex part
%{
#include <math.h>
#include "y.tab.h"
extern int yylval;
%}
%%
[0-9]+   { yylval = atoi(yytext);
           return NUMBER; }
[\t ]+   ;          /* Do nothing for white space */
\n       return 0;  /* End of the logic */
.        return yytext[0];
%%
Example – the Yacc part
%token NAME NUMBER
%%
statement: NAME '=' expression
         | expression
           { printf("= %d\n", $1); }
         ;
expression: expression '+' NUMBER
           { $$ = $1 + $3; }
          | expression '-' NUMBER
           { $$ = $1 - $3; }
          | NUMBER
           { $$ = $1; }
          ;

/* Link with the Yacc library (-ly) when compiling. */
LEX and YACC – Another Example
scanner.l:
%{
#include <stdio.h>
#include "y.tab.h"
%}
id [_a-zA-Z][_a-zA-Z0-9]*
%%
int    { return INT; }
char   { return CHAR; }
float  { return FLOAT; }
{id}   { return ID; }

parser.y:
%{
#include <stdio.h>
#include <stdlib.h>
%}
%token CHAR FLOAT ID INT
%%

Running yacc -d xxx.y produces y.tab.h:
# define CHAR 258
# define FLOAT 259
# define ID 260
# define INT 261
Lex vs. Yacc
• Lex
• Lex generates C code for a lexical analyzer, or scanner
• Lex uses patterns that match strings in the input and converts the
strings to tokens

• Yacc
• Yacc generates C code for a syntax analyzer, or parser.
• Yacc uses grammar rules that allow it to analyze tokens from Lex
and create a syntax tree.

