Lex Yacc
by
H. Altay Güvenir
1) Lexical Analysis:
Lexical analyzer: scans the input stream and converts sequences of
characters into tokens.
Token: a classification of groups of characters.
Examples:
Lexeme    Token
Sum       ID
for       FOR
=         ASSIGN_OP
==        EQUAL_OP
57        INTEGER_CONST
"Abcd"    STRING_CONST
*         MULT_OP
,         COMMA
;         SEMICOLON
(         LEFT_PAREN
Lex is a tool for writing lexical analyzers.
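To make the lexeme-to-token mapping concrete, here is a minimal hand-written sketch in C of what a lexical analyzer does. It is not what lex generates; the function name next_token and the extra classes END and UNKNOWN are assumptions made for this illustration.

```c
#include <ctype.h>
#include <string.h>

/* Token classes from the examples, plus END and UNKNOWN (added for this sketch). */
typedef enum {
    ID, FOR, ASSIGN_OP, EQUAL_OP, INTEGER_CONST, STRING_CONST,
    MULT_OP, COMMA, SEMICOLON, LEFT_PAREN, END, UNKNOWN
} Token;

/* Scan one token starting at *p, advance *p past its lexeme, return its class. */
Token next_token(const char **p) {
    const char *s = *p;
    while (*s == ' ' || *s == '\t' || *s == '\n') s++;   /* skip white space */
    if (*s == '\0') { *p = s; return END; }

    const char *start = s;
    Token t = UNKNOWN;
    if (isalpha((unsigned char)*s)) {                 /* identifier or keyword */
        while (isalnum((unsigned char)*s)) s++;
        t = (s - start == 3 && strncmp(start, "for", 3) == 0) ? FOR : ID;
    } else if (isdigit((unsigned char)*s)) {          /* integer constant */
        while (isdigit((unsigned char)*s)) s++;
        t = INTEGER_CONST;
    } else if (*s == '"') {                           /* string constant */
        s++;
        while (*s != '\0' && *s != '"') s++;
        if (*s == '"') { s++; t = STRING_CONST; }     /* unterminated: UNKNOWN */
    } else if (*s == '=') {                           /* = or == */
        s++;
        if (*s == '=') { s++; t = EQUAL_OP; } else t = ASSIGN_OP;
    } else {
        switch (*s++) {
        case '*': t = MULT_OP; break;
        case ',': t = COMMA; break;
        case ';': t = SEMICOLON; break;
        case '(': t = LEFT_PAREN; break;
        default:  t = UNKNOWN; break;
        }
    }
    *p = s;
    return t;
}
```

Calling next_token repeatedly on the text Sum = 57 yields ID, ASSIGN_OP, INTEGER_CONST and then END. A lex specification expresses the same classification as a list of patterns and actions.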
3) Actions:
Acting upon input is done by code supplied by the compiler writer.
Basic model of parsing for interpreters and compilers:
[Figure: lex generates the scanner lex.yy.c, which defines yylex(); yacc
generates the parser y.tab.c, which defines yyparse(); both are combined with
custom C routines (*.c).]
Lex
Regular Expressions in lex:
a matches a
abc matches abc
[abc] matches a, b or c
[a-f] matches a, b, c, d, e, or f
[0-9] matches any digit
X+ matches one or more of X
X* matches zero or more of X
[0-9]+ matches any natural number
(…) grouping an expression into a single unit
| alternation (or)
(a|b|c)* is equivalent to [a-c]*
X? X is optional (0 or 1 occurrence)
if(def)? matches if or ifdef (equivalent to if|ifdef)
[A-Za-z] matches any alphabetical character
. matches any character except newline character
\. matches the dot character
\n matches the newline character
\t matches the tab character
\\ matches the \ character
[ \t] matches either a space or tab character
[^a-d] matches any character other than a,b,c and d
Examples:
Real numbers, e.g., 0, 27, 2.10, .17
[0-9]+|[0-9]+\.[0-9]+|\.[0-9]+
[0-9]+(\.[0-9]+)?|\.[0-9]+
[0-9]*(\.)?[0-9]+
To include an optional preceding sign: [+-]?[0-9]*(\.)?[0-9]+
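The patterns above can be tried outside lex with the POSIX regular expression functions, whose extended syntax agrees with lex for these particular expressions. The sketch below is only a way to experiment; the helper full_match and the constant names are assumptions made for this illustration, and a POSIX system is assumed.

```c
#include <regex.h>
#include <stdio.h>

/* The real-number patterns from the notes, written as C string literals. */
static const char REAL_V1[]     = "[0-9]+|[0-9]+\\.[0-9]+|\\.[0-9]+";
static const char REAL_V2[]     = "[0-9]+(\\.[0-9]+)?|\\.[0-9]+";
static const char REAL_V3[]     = "[0-9]*(\\.)?[0-9]+";
static const char SIGNED_REAL[] = "[+-]?[0-9]*(\\.)?[0-9]+";

/* Return 1 if the whole string s matches pattern, 0 if it does not,
   and -1 if the pattern is too long or does not compile. */
int full_match(const char *pattern, const char *s) {
    char anchored[256];
    regex_t re;
    /* Wrap the pattern as ^(pattern)$ so that it must cover the whole string. */
    int n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored) return -1;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0) return -1;
    int rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With this helper, full_match(REAL_V2, "2.10") and full_match(SIGNED_REAL, "-5.4") return 1, while full_match(REAL_V2, "2.") returns 0 because every pattern requires at least one digit after the dot.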
Contents of a lex specification file:
definitions
%%
regular expressions and associated actions (rules)
%%
user routines
During pattern matching, lex searches the set of patterns for the single longest
possible match.
$cat ex2.l
%option main
%%
fun printf("FUN");
funny printf("FUNNY");
$cat test | ex2
FUN
FUNNY
Ali is FUNNY
this course is FUN
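The longest-match behaviour of ex2.l can be imitated by hand. The following C sketch is a simplified stand-in for the generated scanner, not its actual algorithm; the function name scan_fun is an assumption made for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Copy in to out, replacing each match of "fun" or "funny" by the text its
   action prints. At every position the longest matching pattern wins.
   Unmatched characters are copied unchanged, like the default lex action. */
void scan_fun(const char *in, char *out, size_t outsz) {
    static const char *pattern[] = { "fun", "funny" };
    static const char *action[]  = { "FUN", "FUNNY" };
    size_t o = 0;
    if (outsz == 0) return;
    while (*in != '\0' && o + 1 < outsz) {
        int best = -1;
        size_t best_len = 0;
        for (int i = 0; i < 2; i++) {
            size_t len = strlen(pattern[i]);
            if (strncmp(in, pattern[i], len) == 0 && len > best_len) {
                best = i;              /* a longer match replaces a shorter one */
                best_len = len;
            }
        }
        if (best >= 0) {
            size_t alen = strlen(action[best]);
            if (o + alen >= outsz) break;          /* output buffer full */
            memcpy(out + o, action[best], alen);
            o += alen;
            in += best_len;
        } else {
            out[o++] = *in++;                      /* echo unmatched character */
        }
    }
    out[o] = '\0';
}
```

For the input text Ali is funny, the pattern fun also matches at the same position, but funny is longer, so the result is Ali is FUNNY.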
Lex declares an external variable called yytext, which contains the matched
string.
$cat ex3.l
%option main
%%
tom|jerry printf(">%s<", yytext);
$cat test3
Did tom chase jerry?
$cat test3 | ex3
Did >tom< chase >jerry<?
Definitions:
/* float0.l */
%option main
%%
[+-]?[0-9]*(\.)?[0-9]+ printf("FLOAT");
input: ab7.3c--5.4.3+d++5-
output: abFLOATc-FLOATFLOAT+d+FLOAT-
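The input/output pair above can be reproduced with a small C sketch that applies the same pattern at each input position through the POSIX regex functions. It is an imitation of the scanner loop under the assumption of a POSIX system; scan_float is a hypothetical name.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Copy in to out, replacing every longest match of the float pattern by the
   word FLOAT and echoing all other characters.
   Returns 0 on success, -1 if the pattern fails to compile or out is too small. */
int scan_float(const char *in, char *out, size_t outsz) {
    regex_t re;
    regmatch_t m;
    size_t o = 0;
    int status = 0;
    /* '^' anchors each match attempt at the current scanning position. */
    if (regcomp(&re, "^[+-]?[0-9]*(\\.)?[0-9]+", REG_EXTENDED) != 0) return -1;
    while (*in != '\0') {
        if (regexec(&re, in, 1, &m, 0) == 0 && m.rm_eo > 0) {
            if (o + 5 >= outsz) { status = -1; break; }
            memcpy(out + o, "FLOAT", 5);
            o += 5;
            in += m.rm_eo;             /* skip the matched lexeme */
        } else {
            if (o + 1 >= outsz) { status = -1; break; }
            out[o++] = *in++;          /* no match here: echo one character */
        }
    }
    if (outsz > 0) out[o] = '\0';
    regfree(&re);
    return status;
}
```

Note how the first of the two minus signs is echoed: a sign is consumed only when a number follows it directly, so --5.4 becomes a minus sign followed by FLOAT.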
Other examples
/* echo-upcase-words.l */
%option main
%%
[A-Z]+[ \t\n\.\,] printf("%s",yytext);
. ; /* no action specified */
The scanner with the specification above echoes all strings of capital letters,
followed by a space, tab (\t), newline (\n), dot (\.), or comma (\,) to stdout,
and all other characters will be ignored.
Input                 Output
Ali VELI A7, X. 12    VELI X.
HAMI BEY a            HAMI BEY
Definitions can be used in definitions
/* def-in-def.l */
%option main
alphabetic [A-Za-z_$]
digit [0-9]
alphanumeric ({alphabetic}|{digit})
%%
{alphabetic}{alphanumeric}* printf("Java identifier");
\, printf("Comma");
\{ printf("Left brace");
\= printf("Assignment op");
\=\= printf("Equality op");
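The way alphanumeric is built from alphabetic and digit corresponds to composing small predicates. A minimal C sketch of the identifier rule, with function names chosen for this illustration only:

```c
/* alphabetic [A-Za-z_$] */
int is_alphabetic(int c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_' || c == '$';
}

/* digit [0-9] */
int is_digit(int c) {
    return c >= '0' && c <= '9';
}

/* alphanumeric ({alphabetic}|{digit}) : defined in terms of the two above */
int is_alphanumeric(int c) {
    return is_alphabetic(c) || is_digit(c);
}

/* {alphabetic}{alphanumeric}* applied to a whole string */
int is_java_identifier(const char *s) {
    if (!is_alphabetic((unsigned char)*s)) return 0;   /* also rejects "" */
    for (s++; *s != '\0'; s++)
        if (!is_alphanumeric((unsigned char)*s)) return 0;
    return 1;
}
```

Here is_java_identifier("count1") returns 1 and is_java_identifier("1abc") returns 0, since the first character must be alphabetic.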
Among all of the rules that match the same number of characters, the rule given
first in the file will be chosen.
Example,
/* rule-order.l */
%option main
%%
for printf("FOR");
[a-z]+ printf("IDENTIFIER");
for input
for count = 1 to 10
the output would be
FOR IDENTIFIER = 1 IDENTIFIER 10
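The two selection criteria (longest match first, then earliest rule) can be sketched in C as follows. This is an illustration of the selection policy only, not of how lex implements it; all function names are assumptions.

```c
#include <stddef.h>
#include <string.h>

/* Length of the match of rule 0, the literal "for", at s (0 if none). */
static size_t match_for(const char *s) {
    return strncmp(s, "for", 3) == 0 ? 3 : 0;
}

/* Length of the match of rule 1, [a-z]+, at s (0 if none). */
static size_t match_lower(const char *s) {
    size_t n = 0;
    while (s[n] >= 'a' && s[n] <= 'z') n++;
    return n;
}

/* Return the index of the chosen rule (0 = FOR, 1 = IDENTIFIER), or -1 if no
   rule matches at s. The length of the chosen match is stored in *len. */
int choose_rule(const char *s, size_t *len) {
    size_t (*rule[])(const char *) = { match_for, match_lower };
    int best = -1;
    *len = 0;
    for (int i = 0; i < 2; i++) {
        size_t n = rule[i](s);
        if (n > *len) {        /* strictly longer only: the earlier rule wins ties */
            best = i;
            *len = n;
        }
    }
    return best;
}
```

On the text for both rules match three characters, so rule 0 (FOR) is chosen; on the text fort the second rule matches four characters and wins, giving IDENTIFIER.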
Important note:
Do not leave extra spaces and/or empty lines at the end of a lex specification
file.
Yacc
A yacc specification describes a context-free grammar (CFG) that can be used
to generate a parser.
Elements of a CFG:
1. Terminals: tokens and literal characters,
2. Variables (nonterminals): syntactical elements,
3. Production rules, and
4. Start rule.
/*anbn0.y */
%token A B
%%
anbn: s '\n' {return 0;}
s: A B
| A s B
;
%%
#include "lex.yy.c"
int main() {
return yyparse();
}
int yyerror( char *s ) { fprintf(stderr, "%s\n", s); return 0; }
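For comparison, the same grammar can be recognized by a hand-written recursive-descent routine. Note that this is a different technique from the table-driven LR parser that yacc generates; it is shown only to make the grammar rules concrete. Each character a or b stands for the token A or B, and the function names are assumptions.

```c
#include <stddef.h>

/* s: A B | A s B
   Returns a pointer just past the matched s, or NULL if there is no match. */
static const char *parse_s(const char *p) {
    if (*p != 'a') return NULL;        /* both alternatives start with A */
    p++;
    if (*p == 'a') {                   /* one-token lookahead selects A s B */
        p = parse_s(p);
        if (p == NULL) return NULL;
    }
    if (*p != 'b') return NULL;        /* both alternatives end with B */
    return p + 1;
}

/* anbn: s '\n'
   Returns 1 if line is a string of n a's followed by n b's (n >= 1) and a
   newline, 0 otherwise. */
int in_anbn(const char *line) {
    const char *p = parse_s(line);
    return p != NULL && *p == '\n';
}
```

For example, in_anbn("aabb\n") returns 1, while in_anbn("aab\n") returns 0 because the outer A has no matching B.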
If the input stream cannot be derived from the start variable, the default
message "syntax error" is printed and the program terminates.
However, customized error messages can be generated.
/*anbn1.y */
%token A B
%%
anbn: s '\n' { printf(" is in anbn\n");
return 0;}
s: A B
| A s B
;
%%
#include "lex.yy.c"
void yyerror(char *s) { printf("%s, it is not in anbn\n", s); }
int main() {
return yyparse();
}
$./anbn
aabb
is in anbn
$./anbn
acadbefbg
Syntax error, it is not in anbn
$
A grammar to accept L = {a^n b^n | n ≥ 0}:
/*anbn_0.y */
%token A B
%%
anbn: s '\n' { printf(" is in anbn_0\n");
return 0;}
s: empty
| A s B
;
empty: ;
%%
#include "lex.yy.c"
void yyerror(char *s){ printf("%s, it is not in anbn_0\n", s); }
int main() {
return yyparse();
}
$ ./add
003 + 05
= 8
1+2
syntax error
/* print-int.y */
%token INTEGER NEWLINE
%%
lines: /* empty */
| lines NEWLINE
| lines value NEWLINE {printf(" =%d\n", $2);}
| error NEWLINE {yyerror("! Reenter: "); yyerrok;}
;
value: INTEGER {$$ = $1;}
;
%%
#include "lex.yy.c"
void yyerror(char *s) { printf("%s", s); }
int main() {
return yyparse();
}
error is a token provided by yacc. The macro yyerrok says, "the old error is
finished."
Execution:
$./print-int
7
=7
007
=7
funny
syntax error! Reenter: 0007
=7
^D
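The effect of the error rule (report, discard input up to the next NEWLINE, then continue) can be imitated with a line-oriented C sketch. It is a simplification that treats each line as a unit, under the assumptions that an INTEGER is a run of decimal digits small enough to fit in an int and that the output is collected in a buffer; print_ints is a hypothetical name.

```c
#include <stdio.h>
#include <string.h>

/* Process input line by line and append the program's responses to out. */
void print_ints(const char *input, char *out, size_t outsz) {
    size_t o = 0;
    if (outsz == 0) return;
    out[0] = '\0';
    while (*input != '\0' && o + 1 < outsz) {
        size_t len = strcspn(input, "\n");             /* length of this line */
        size_t digits = strspn(input, "0123456789");   /* leading digit count */
        int n = 0;
        if (len == 0) {
            /* lines NEWLINE: an empty line produces no output */
        } else if (digits == len) {
            int value = 0;                             /* value: INTEGER */
            for (size_t i = 0; i < len; i++)
                value = value * 10 + (input[i] - '0'); /* assumes no overflow */
            n = snprintf(out + o, outsz - o, " =%d\n", value);
        } else {
            /* error NEWLINE: report, skip the rest of the line, resume */
            n = snprintf(out + o, outsz - o, "syntax error! Reenter: ");
        }
        if (n < 0 || (size_t)n >= outsz - o) break;    /* output buffer full */
        o += (size_t)n;
        input += len;
        if (*input == '\n') input++;                   /* consume NEWLINE */
    }
}
```

After the bad line, processing simply continues with the next line, which is the behaviour that yyerrok restores in the yacc version.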
/* print-int-wln.y */
/* prints integers with line numbers */
%token INTEGER NEWLINE
%%
lines: /* empty */
| lines NEWLINE
| lines line NEWLINE {printf("%d) %d\n", lineno, $2);}
| error NEWLINE { printf(" in line %d!\nReenter: ",lineno);
yyerrok;
}
;
line: INTEGER {$$ = $1;}
%%
#include "lex.yy.c"
int lineno=0;
void yyerror(char *s) { printf("%s", s); }
int main() {
return yyparse();
}
Execution:
$./print-int-wln
007
1) 7
jhg
syntax error in line 2!
Reenter: 66
3) 66
_
Example: a yacc specification of a calculator is given on the web page of the
course.
(http://www.cs.bilkent.edu.tr/~guvenir/courses/CS315/lex-yacc/calculator/)
Actions between rule elements:
/* actions.l */
%%
a return A;
b return B;
\n return NL;
. ;
%%
int yywrap() { return 1; }
/* actions.y */
%{
#include <stdio.h>
%}
%token A B NL
%%
s: {printf("1");}
a
{printf("2");}
b
{printf("3");}
NL
{return 0;}
;
a: {printf("4");}
A
{printf("5");}
;
b: {printf("6");}
B
{printf("7");}
;
%%
#include "lex.yy.c"
int yyerror(char *s) {
printf ("%s\n", s);
return 0;
}
int main(void){ return yyparse(); }
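Execution (in each run the program prints 14 before it reads any input; the
rest of that line is the typed input):
$ actions
14ab
52673
$ actions
14aa
526syntax error
$ actions
14ba
syntax error
$ actions
14xyzafghbnm
52673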
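The order in which the embedded actions fire can be traced with a hand-written C sketch that places the same print statements at the same positions in the rules. It uses recursive descent as a stand-in for the generated LR parser, and it collects the output in a buffer instead of printing; the type Trace and the function names are assumptions made for this illustration.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *in;    /* remaining input characters */
    char *out;         /* buffer collecting what the actions print */
    size_t cap, len;   /* capacity and current length of out */
} Trace;

/* Append s to the trace buffer if it fits. */
static void emit(Trace *t, const char *s) {
    size_t n = strlen(s);
    if (t->len + n < t->cap) {
        memcpy(t->out + t->len, s, n + 1);
        t->len += n;
    }
}

/* Scanner as in actions.l: a, b and \n are tokens, anything else is skipped.
   Consumes the next token if it equals token; otherwise reports an error. */
static int expect(Trace *t, char token) {
    while (*t->in != '\0' && *t->in != 'a' && *t->in != 'b' && *t->in != '\n')
        t->in++;
    if (*t->in != token) { emit(t, "syntax error"); return 0; }
    t->in++;
    return 1;
}

/* a: {4} A {5} ;   b: {6} B {7} ; */
static int rule_a(Trace *t) { emit(t, "4"); if (!expect(t, 'a')) return 0; emit(t, "5"); return 1; }
static int rule_b(Trace *t) { emit(t, "6"); if (!expect(t, 'b')) return 0; emit(t, "7"); return 1; }

/* s: {1} a {2} b {3} NL ;   Returns 0 on success, 1 on a syntax error. */
int trace_actions(const char *input, char *out, size_t cap) {
    Trace t = { input, out, cap, 0 };
    if (cap > 0) out[0] = '\0';
    emit(&t, "1");
    if (!rule_a(&t)) return 1;
    emit(&t, "2");
    if (!rule_b(&t)) return 1;
    emit(&t, "3");
    return expect(&t, '\n') ? 0 : 1;
}
```

For the input ab followed by a newline the collected trace is 1452673: actions 1 and 4 run before the first token is consumed, and 5, 2 and 6 run between the tokens A and B.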
Conflicts
Pointer model: A pointer moves (right) on the RHS of a rule while input tokens
and variables are processed.
%token A B C
%%
start: A B C ; /* after reading A: start: A . B C */
When all elements on the right-hand side are processed (the pointer reaches
the end of a rule), the rule is reduced.
If a rule reduces, the pointer then returns to the rule where it was called.
Conflict: There is a conflict if a rule is reduced when there is more than one
pointer. yacc looks one-token-ahead to see if the number of pointers
reduces to one before declaring a conflict.
Example:
%token A B C D E F
%%
start: x | y;
x: A B C D;
y: A B E F;
After tokens A and B, either one of the pointers, or both, will disappear. For
example, if the next token is E, the first pointer will disappear; if the next
token is C, the second pointer will disappear. If the next token is anything
other than C or E, both pointers will disappear. Therefore, there is no conflict.
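The pointer model for this example can be sketched in C by keeping one pointer per rule and counting how many survive a given token sequence. This models only the informal pointer picture for the two rules x and y, not yacc's state machine; the names rhs and live_pointers are assumptions.

```c
#include <stddef.h>
#include <string.h>

/* Right-hand sides of x: A B C D and y: A B E F, one letter per token. */
static const char *rhs[] = { "ABCD", "ABEF" };

/* Count the pointers that are still alive after reading the given tokens.
   A pointer survives as long as the tokens read so far are a prefix of the
   right-hand side of its rule. */
int live_pointers(const char *tokens) {
    int live = 0;
    size_t n = strlen(tokens);
    for (int r = 0; r < 2; r++)
        if (n <= strlen(rhs[r]) && strncmp(tokens, rhs[r], n) == 0)
            live++;
    return live;
}
```

After the tokens A B there are two pointers; reading C or E leaves exactly one, and any other token leaves none, so neither rule is ever reduced while two pointers are alive.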
Conflict example:
%token A B
%%
start: x B | y B ;
x: A ; /* reduce */
y: A ; /* reduce: reduce/reduce conflict on B */
After A, there are two pointers. Both rules (x and y) want to reduce at the
same time. If the next token is B, there will still be two pointers. Such
conflicts are called reduce/reduce conflicts.
Debugging:
$yacc -v filename.y
produces a file named y.output for debugging purposes.
Example:
%token A P
%%
s: x | y P;
x: A P; /* shifts on P */
y: A; /* reduces on P */
The y.output file for the grammar above is shown below:
0 $accept : s $end
...
y goto 4
...
1: shift/reduce conflict (shift 5, reduce 4) on P
        /* shift/reduce conflict on P: shift and goto state 5, or reduce rule 4 */
state 1
x : A . P (3)           /* one pointer is in rule 3, between tokens A and P */
y : A . (4)             /* the other pointer is in rule 4, after token A */
P shift 5               /* if the next token is P, the system will choose to
                           shift and goto state 5 */
state 2                 /* input matched the start variable s */
$accept : s . $end (0)  /* if this is the end of string, accept it */
$end accept
...
4 terminals, 4 nonterminals   /* {$end, A, P, .} and {$accept, s, x, y} */
5 grammar rules, 7 states
Recursive Rules:
Consider the following grammar:
/* recursive.y */
%token A
%%
s: A // L ={A, AAA, AAAAA, …}, Not ambiguous !
| A s A
;
y.output file:
0 $accept : s $end
1 s : A
2 | A s A
^L
state 0
$accept : . s $end (0)
A shift 1
. error
s goto 3
...
However, the same language can also be represented by the following
grammar, which does not have any conflict.
/* recursive.y */
%token A
%%
s: A // L ={A, AAA, AAAAA, …}, Not ambiguous !
| s A A
;
Actions on a Rule:
Actions can appear anywhere in the RHS of a rule.
However, for technical reasons, it is convenient for yacc to transform the
grammar so that actions always appear at the very end.
For this reason, yacc introduces new variables, called marker variables
(nonterminals), so that all actions are at the end of the rules.
Example,
Rule
a: {action1} b {action2} c {action3};
is replaced by
a: $$1 b $$2 c {action3};
$$1: {action1}; // Empty rules
$$2: {action2};
Example:
%token A B NL
%%
start: x | y;
x: A A NL ;
y: A B NL ;
Internally:
0 $accept : start $end
1 start : x
2 | y
3 x : A A NL
4 y : A B NL
No Conflict.
However, the following equivalent grammar
%token A B NL
%%
start: x | y;
x: {printf("using x");} A A NL ;
y: {printf("using y");} A B NL ;
is converted into:
0 $accept : start $end
1 start : x
2 | y
3 $$1 :
4 x : $$1 A A NL
5 $$2 :
6 y : $$2 A B NL
Conflict:
reduce/reduce conflict (reduce 3, reduce 5) on A
Make utility
Using the make utility on Linux systems:
Contents of the file named Makefile:
parser: lex.yy.c y.tab.c
	gcc -o parser y.tab.c
y.tab.c: parser.y
	yacc parser.y
lex.yy.c: scanner.l
	lex scanner.l