Lex Yacc tutorial

Kun-Yuan Hsieh
Programming Language Lab., NTHU

PLLab, NTHU,Cs2403 Programming Languages 1


Compilation Sequence



What is Lex?
• The main job of a lexical analyzer (scanner) is to break up an input stream into more usable elements (tokens)
a = b + c * d;
ID ASSIGN ID PLUS ID MULT ID SEMI

• Lex is a utility that helps you rapidly generate scanners

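To make the idea of breaking input into tokens concrete, here is a minimal hand-written sketch in C of what a scanner does for the statement above. It is not the code Lex generates; the function names `token_name` and `tokenize` are hypothetical, and only the five token names from the slide are handled.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token names, mirroring ID ASSIGN PLUS MULT SEMI. */
static const char *token_name(char c) {
    if (isalpha((unsigned char)c)) return "ID";
    switch (c) {
    case '=': return "ASSIGN";
    case '+': return "PLUS";
    case '*': return "MULT";
    case ';': return "SEMI";
    default:  return "UNKNOWN";
    }
}

/* Writes the space-separated token names for `input` into `out`.
   An identifier is a letter followed by letters or digits. */
void tokenize(const char *input, char *out, size_t out_size) {
    assert(input != NULL && out != NULL && out_size > 0);
    out[0] = '\0';
    for (size_t i = 0; input[i] != '\0'; ) {
        unsigned char c = (unsigned char)input[i];
        if (isspace(c)) { i++; continue; }       /* skip white space */
        const char *name = token_name((char)c);
        if (isalpha(c)) {
            /* consume the whole identifier */
            while (isalnum((unsigned char)input[i])) i++;
        } else {
            i++;                                 /* one-character token */
        }
        if (out[0] != '\0')
            strncat(out, " ", out_size - strlen(out) - 1);
        strncat(out, name, out_size - strlen(out) - 1);
    }
}
```

Calling `tokenize("a = b + c * d;", buf, sizeof buf)` with a sufficiently large `buf` fills it with the token sequence shown on the slide.
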
Lex – Lexical Analyzer
• Lexical analyzers tokenize input streams
• Tokens are the terminals of a language
– English
• words, punctuation marks, …
– Programming language
• Identifiers, operators, keywords, …
• Regular expressions define terminals/tokens


Lex Source Program
• Lex source is a table of
– regular expressions and
– corresponding program fragments
digit   [0-9]
letter  [a-zA-Z]
%%
{letter}({letter}|{digit})*   printf("id: %s\n", yytext);
\n                            printf("new line\n");
%%
int main(void) {
    yylex();
    return 0;
}


Lex Source to C Program
• The table is translated to a C program (lex.yy.c) which
– reads an input stream,
– partitions the input into strings that match the given expressions, and
– copies them to an output stream if necessary


An Overview of Lex

Lex source program → Lex → lex.yy.c
lex.yy.c → C compiler → a.out
input → a.out → tokens


Lex Source
• Lex source is separated into three sections by %% delimiters
• The general format of Lex source is

{definitions}
%% (required)
{transition rules}
%% (optional)
{user subroutines}
• The absolute minimum Lex program is thus
%%
• With no definitions and no rules, it copies the input to the output unchanged

Lex vs. Yacc
• Lex
– Lex generates C code for a lexical analyzer, or scanner
– Lex uses patterns that match strings in the input and converts the strings to tokens

• Yacc
– Yacc generates C code for a syntax analyzer, or parser
– Yacc uses grammar rules that allow it to analyze tokens from Lex and create a syntax tree


Lex with Yacc
• Lex source (lexical rules) → Lex → lex.yy.c, which provides yylex()
• Yacc source (grammar rules) → Yacc → y.tab.c, which provides yyparse()
• Input → yylex() → yyparse() → parsed input
• yyparse() calls yylex(), and yylex() returns a token

Regular Expressions

We all know what they are!!!!!!!!



Lex Regular Expressions (Extended Regular Expressions)
• A regular expression matches a set of strings
• Regular expression
– Operators
– Character classes
– Arbitrary character
– Optional expressions
– Alternation and grouping
– Context sensitivity
– Repetitions and definitions

Operators
" \ [ ] ^ - ? . * + | ( ) $ / { } % < >

• To use an operator character as a text character, escape it:
\$ = "$"
\\ = "\"
• Every character except blank, tab (\t), newline (\n), and the operators listed above is always a text character


Character Classes []
• [abc] matches a single character, which may be a, b, or c
• Inside brackets, every operator loses its special meaning except \, -, and ^
• e.g.
[ab] => a or b
[a-z] => a or b or c or … or z
[-+0-9] => all the digits and the two signs
[^a-zA-Z] => any character which is not a letter

Arbitrary Character .
• To match almost any character, use the operator character . which is the class of all characters except newline

• [\40-\176] matches all printable characters in the ASCII character set, from octal 40 (blank) to octal 176 (tilde ~)


Optional & Repeated Expressions
• a? => zero or one instance of a
• a* => zero or more instances of a
• a+ => one or more instances of a

• E.g.
ab?c => ac or abc
[a-z]+ => all strings of lower case letters
[a-zA-Z][a-zA-Z0-9]* => all alphanumeric strings with a leading alphabetic character
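The patterns above can be tried out without Lex. The sketch below uses the POSIX `regcomp`/`regexec` interface from `<regex.h>`, whose extended syntax agrees with Lex for these simple operators. The helper name `matches` is hypothetical, and the `^` and `$` anchors are added here only so that the whole string must match; a Lex rule instead matches a prefix of the remaining input.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Returns 1 if `text` matches the extended regular expression
   `pattern`, 0 if it does not, and -1 if the pattern is invalid.
   The caller anchors the pattern with ^ and $ to require that the
   whole string matches. */
int matches(const char *pattern, const char *text) {
    regex_t re;
    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int ok = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

For example, `matches("^ab?c$", "ac")` and `matches("^ab?c$", "abc")` both succeed, while `matches("^ab?c$", "abbc")` does not, because `?` allows at most one `b`.
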


Precedence of Operators
• Level of precedence
– Kleene closure (*), ?, +
– concatenation
– alternation (|)
• All operators are left associative.
• Ex: a*b|cd* = ((a*)b)|(c(d*))



Pattern Matching Primitives
Metacharacter   Matches
.               any character except newline
\n              newline
*               zero or more copies of the preceding expression
+               one or more copies of the preceding expression
?               zero or one copy of the preceding expression
^               beginning of line / complement
$               end of line
a|b             a or b
(ab)+           one or more copies of ab (grouping)
[ab]            a or b
a{3}            3 instances of a
"a+b"           literal "a+b" (C escapes still work)

Recall: Lex Source
• Lex source is a table of
– regular expressions and
– corresponding program fragments (actions)
%%
<regexp> <action>
<regexp> <action>
%%

• Example: the rule
%%
"=" printf("operator: ASSIGNMENT");
– Input: a = b + c;
– Output: a operator: ASSIGNMENT b + c;


Transition Rules
• regexp <one or more blanks> action (C code);
• regexp <one or more blanks> { actions (C code) }

• A null statement ; discards the matched input (no action)
[ \t\n] ;
– Causes the three spacing characters to be ignored
a = b + c;
d = b * c;

↓↓
a=b+c;d=b*c;


Transition Rules (cont’d)
• Four special options for actions:
|, ECHO;, BEGIN, and REJECT;
• | indicates that the action for this rule is the action of the next rule
– [ \t\n] ; can also be written as
" " |
"\t" |
"\n" ;
• Unmatched input gets the default action, which ECHOes it from the input to the output


Transition Rules (cont’d)
• REJECT
– Go on to the next alternative: the same input is matched again with the next-best rule

%%
pink {npink++; REJECT;}
ink {nink++; REJECT;}
pin {npin++; REJECT;}
. |
\n ;
%%
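The effect of the REJECT rules above is that every occurrence of pink, ink, and pin is counted, even when one sits inside another. The following C sketch imitates that behaviour by hand under this reading of REJECT; the struct `counts` and the function `count_overlapping` are hypothetical names, and no Lex machinery is involved.

```c
#include <assert.h>
#include <string.h>

struct counts { int npink, nink, npin; };

/* Counts every occurrence of "pink", "ink" and "pin" in `text`,
   including occurrences that overlap or sit inside one another. */
struct counts count_overlapping(const char *text) {
    struct counts c = {0, 0, 0};
    assert(text != NULL);
    for (const char *p = text; *p != '\0'; p++) {
        /* Each test stands for one rule matching at this position.
           Because of REJECT, a match does not consume the text. */
        if (strncmp(p, "pink", 4) == 0) c.npink++;
        if (strncmp(p, "ink", 3) == 0)  c.nink++;
        if (strncmp(p, "pin", 3) == 0)  c.npin++;
        /* The final catch-all rule then consumes one character,
           which is the p++ of the loop. */
    }
    return c;
}
```

On the input `pink` each of the three counters ends at 1; without REJECT only the longest match, pink, would have been counted.
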


Lex Predefined Variables
• yytext -- a string containing the lexeme
• yyleng -- the length of the lexeme
• yyin -- the input stream pointer
– the default input of default main() is stdin
• yyout -- the output stream pointer
– the default output of default main() is stdout.
• cs20: % ./a.out < inputfile > outfile

• E.g.
[a-z]+ printf("%s", yytext);
[a-z]+ ECHO;
[a-zA-Z]+ {words++; chars += yyleng;}
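To show what the last rule computes, here is a plain C sketch that finds each maximal run of letters by hand and updates the same two counters. It only illustrates the roles of yytext and yyleng; the names `word_stats` and `count_words` are hypothetical, and letters are assumed to be the ASCII letters of the default C locale.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

struct word_stats { int words; int chars; };

/* Mirrors the rule  [a-zA-Z]+ {words++; chars += yyleng;}
   by scanning for each maximal run of letters. */
struct word_stats count_words(const char *text) {
    struct word_stats s = {0, 0};
    assert(text != NULL);
    size_t i = 0;
    while (text[i] != '\0') {
        if (isalpha((unsigned char)text[i])) {
            size_t start = i;                    /* start of the lexeme */
            while (isalpha((unsigned char)text[i])) i++;
            s.words++;
            s.chars += (int)(i - start);         /* plays the role of yyleng */
        } else {
            i++;                                 /* not part of a word */
        }
    }
    return s;
}
```

For the text `lex and yacc` this yields 3 words and 10 characters.
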


Lex Library Routines
• yylex()
– The default main() contains a call of yylex()
• yymore()
– append the next matched string to the current yytext instead of replacing it
• yyless(n)
– retain the first n characters in yytext and return the rest to the input
• yywrap()
– is called whenever Lex reaches an end-of-file
– The default yywrap() always returns 1


Review of Lex Predefined Variables
Name               Function
char *yytext       pointer to matched string
int yyleng         length of matched string
FILE *yyin         input stream pointer
FILE *yyout        output stream pointer
int yylex(void)    call to invoke lexer, returns token
yymore()           append the next match to the current yytext
int yyless(int n)  retain the first n characters in yytext
int yywrap(void)   wrapup, return 1 if done, 0 if not done
ECHO               write matched string
REJECT             go to the next alternative rule
INITIAL            initial start condition
BEGIN condition    switch start condition

User Subroutines Section
• You can use your own routines in a Lex program the same way you use routines in other programming languages.
%{
void foo(void);
%}
letter [a-zA-Z]
%%
{letter}+ foo();
%%

void foo(void) {
    /* user code */
}

User Subroutines Section (cont’d)
• The section where main() is placed
%{
int counter = 0;
%}
letter [a-zA-Z]

%%
{letter}+ {printf("a word\n"); counter++;}

%%
int main(void) {
    yylex();
    printf("There are %d words in total\n", counter);
    return 0;
}

Usage
• To run Lex on a source file, type
lex scanner.l
• It produces a file named lex.yy.c which is a
C program for the lexical analyzer.
• To compile lex.yy.c, type
gcc lex.yy.c -ll
• To run the lexical analyzer program, type
./a.out < inputfile

EXAMPLE
%{
/* noms = names, mots = words, lignes = lines */
int noms, mots, lignes;
%}
mot [a-z]+
maj [A-Z]

%%
{maj}{mot} {noms++; printf("Hello %s.\n", yytext);}
{mot} {mots++;}
\n {lignes++; printf(" Again!\n");}
. ;
%%
int main(void)
{
    noms = mots = lignes = 0;
    yylex();
    printf("number of names: %d, words: %d, lines: %d.\n", noms, mots, lignes);
    return 0;
}

Versions of Lex
• AT&T -- lex
http://www.combo.org/lex_yacc_page/lex.html
• GNU -- flex
http://www.gnu.org/manual/flex-2.5.4/flex.html
• A Win32 version of flex:
http://www.monmouth.com/~wstreett/lex-yacc/lex-yacc.html
or Cygwin:
http://sources.redhat.com/cygwin/

• Lex implementations differ from machine to machine.


Yacc - Yet Another Compiler-Compiler


Introduction

• What is YACC?
– A tool that produces a parser for a given grammar.
– YACC (Yet Another Compiler Compiler) is a program designed to compile an LALR(1) grammar and to produce the source code of the syntactic analyzer of the language produced by this grammar.


How YACC Works
• gram.y: file containing the desired grammar in yacc format
↓ yacc (the yacc program)
• y.tab.c: C source program created by yacc
↓ cc or gcc (C compiler)
• a.out: executable program that will parse the grammar given in gram.y

How YACC Works
(1) Parser generation time: YACC source (*.y) → yacc → y.tab.c, y.tab.h, y.output
(2) Compile time: y.tab.c → C compiler/linker → a.out
(3) Run time: token stream → a.out → abstract syntax tree

A YACC File Example
%{
#include <stdio.h>
%}

%token NAME NUMBER

%%

statement: NAME '=' expression
         | expression { printf("= %d\n", $1); }
         ;

expression: expression '+' NUMBER { $$ = $1 + $3; }
          | expression '-' NUMBER { $$ = $1 - $3; }
          | NUMBER { $$ = $1; }
          ;
%%
int yyerror(char *s)
{
fprintf(stderr, "%s\n", s);
return 0;
}

int main(void)
{
yyparse();
return 0;
}



Works with Lex

• LEX provides yylex()
• YACC provides yyparse()
• Input program: 12 + 26
• How do they work together?


Works with Lex

• Input program: 12 + 26
• YACC: yyparse() calls yylex() to get the next token
• LEX: yylex() matches [0-9]+ and reports that the next token is NUM
• yyparse() receives the token sequence NUM '+' NUM


YACC File Format
%{
C declarations
%}
yacc declarations
%%
Grammar rules
%%
Additional C code
– Comments enclosed in /* ... */ may appear in any of the sections.


Definitions Section

%{
#include <stdio.h>
#include <stdlib.h>
%}

%token ID NUM   /* ID and NUM are terminals */
%start expr     /* parsing starts from expr */


Start Symbol

• By default, the first non-terminal specified in the grammar specification section.
• To override it, use a %start declaration:
%start non-terminal


Rules Section
• This section defines the grammar
• Example
expr : expr '+' term | term;
term : term '*' factor | factor;
factor : '(' expr ')' | ID | NUM;



Rules Section
• Normally written like this
• Example:
expr : expr '+' term
| term
;
term : term '*' factor
| factor
;
factor : '(' expr ')'
| ID
| NUM
;

The Position of Rules
expr : expr '+' term { $$ = $1 + $3; }
| term { $$ = $1; }
;
term : term '*' factor { $$ = $1 * $3; }
| factor { $$ = $1; }
;
factor : '(' expr ')' { $$ = $2; }
| ID
| NUM
;
• $1, $2, and $3 are the values of the first, second, and third symbols on the right-hand side of a rule; in factor : '(' expr ')', $1 is '(', $2 is expr, and $3 is ')'
• $$ is the value of the left-hand side
• Default: $$ = $1;


Communication between LEX and YACC
• Input program: 12 + 26
• YACC: yyparse() calls yylex() to get the next token
• LEX: yylex() matches [0-9]+ and reports that the next token is NUM
• yyparse() receives the token sequence NUM '+' NUM
• LEX and YACC need an agreed way to identify each token

Communication between LEX and YACC
• Use an enumeration / #define
• One side generates the definitions and the other includes them
• YACC generates y.tab.h
• LEX includes y.tab.h
• yacc -d gram.y will produce y.tab.h


Communication between LEX and YACC
scanner.l:
%{
#include <stdio.h>
#include "y.tab.h"
%}
id [_a-zA-Z][_a-zA-Z0-9]*
%%
int { return INT; }
char { return CHAR; }
float { return FLOAT; }
{id} { return ID; }

parser.y:
%{
#include <stdio.h>
#include <stdlib.h>
%}
%token CHAR FLOAT ID INT
%%

yacc -d xxx.y produces y.tab.h:
# define CHAR 258
# define FLOAT 259
# define ID 260
# define INT 261

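The shared token codes can be illustrated without running either tool. The sketch below copies the four values from the y.tab.h shown above into an enum and adds a hypothetical helper, `token_code`, that returns the code the scanner rules would return for one complete lexeme. It assumes the lexeme is either one of the three keywords or a valid identifier.

```c
#include <assert.h>
#include <string.h>

/* Token codes as they appear in the y.tab.h shown above. */
enum { CHAR = 258, FLOAT = 259, ID = 260, INT = 261 };

/* Hypothetical helper: the code the scanner would return for one
   complete lexeme. Keywords are tested first, just as their rules
   come before {id} in scanner.l. */
int token_code(const char *lexeme) {
    assert(lexeme != NULL);
    if (strcmp(lexeme, "int") == 0)   return INT;
    if (strcmp(lexeme, "char") == 0)  return CHAR;
    if (strcmp(lexeme, "float") == 0) return FLOAT;
    return ID;   /* assumes the lexeme is a valid identifier */
}
```

Because both sides use the same numbers, the parser can compare the returned value directly against the names declared with %token.
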
YACC
• Rules may be recursive
• Rules may be ambiguous*
• Uses bottom-up shift/reduce parsing
– Get a token
– Push it onto the stack
– Can it be reduced? (How do we know?)
• If yes: reduce using a rule
• If no: get another token
• Yacc cannot look ahead more than one token
• Example rule:
Phrase -> cart_animal AND CART
| work_animal AND PLOW
…
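The shift/reduce loop can be sketched by hand for the small expression grammar from the earlier YACC file example. This is only an illustration: a real yacc parser decides between shifting and reducing from generated state tables and one token of lookahead, whereas this sketch simply inspects the top of the stack. The names `sym`, `item`, and `parse` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum sym { NUMBER, PLUS, MINUS, EXPR, END };

struct item { enum sym sym; int value; };

/* Evaluates a token sequence (terminated by END) for the grammar
       expression: expression '+' NUMBER
                 | expression '-' NUMBER
                 | NUMBER
   using an explicit stack. Returns 0 on success, -1 on a syntax error. */
int parse(const struct item *tokens, int *result) {
    struct item stack[64];
    size_t top = 0;     /* number of items on the stack */
    size_t next = 0;    /* index of the next unread token */
    assert(tokens != NULL && result != NULL);

    for (;;) {
        /* Reduce: expression -> expression ('+' | '-') NUMBER */
        if (top >= 3 && stack[top - 3].sym == EXPR &&
            (stack[top - 2].sym == PLUS || stack[top - 2].sym == MINUS) &&
            stack[top - 1].sym == NUMBER) {
            int lhs = stack[top - 3].value, rhs = stack[top - 1].value;
            int v = (stack[top - 2].sym == PLUS) ? lhs + rhs : lhs - rhs;
            top -= 3;
            stack[top].sym = EXPR;
            stack[top].value = v;
            top++;
            continue;
        }
        /* Reduce: expression -> NUMBER (only at the bottom of the stack) */
        if (top == 1 && stack[0].sym == NUMBER) {
            stack[0].sym = EXPR;
            continue;
        }
        /* Accept when the input is exhausted and one expression remains. */
        if (tokens[next].sym == END) {
            if (top == 1 && stack[0].sym == EXPR) {
                *result = stack[0].value;
                return 0;
            }
            return -1;
        }
        /* Otherwise shift the next token onto the stack. */
        if (top == 64) return -1;
        stack[top++] = tokens[next++];
    }
}
```

For the tokens NUMBER(12) '+' NUMBER(26) the loop shifts 12, reduces it to an expression, shifts '+' and 26, and reduces again, leaving a single expression with value 38.
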


Yacc Example
• Taken from lex & yacc (see Reference Books)
• Simple calculator
a = 4 + 6
a
= 10
b = 7
c = a + b
c
= 17
$


Grammar
expression ::= expression '+' term |
               expression '-' term |
               term

term ::= term '*' factor |
         term '/' factor |
         factor

factor ::= '(' expression ')' |
           '-' factor |
           NUMBER |
           NAME


Parser (cont’d)
File: parser.y
statement_list: statement '\n'
| statement_list statement '\n'
;

statement: NAME '=' expression { $1->value = $3; }
| expression { printf("= %g\n", $1); }
;

expression: expression '+' term { $$ = $1 + $3; }
| expression '-' term { $$ = $1 - $3; }
| term
;

Parser (cont’d)
File: parser.y
term: term '*' factor { $$ = $1 * $3; }
| term '/' factor { if ($3 == 0.0)
                        yyerror("divide by zero");
                    else
                        $$ = $1 / $3;
                  }
| factor
;

factor: '(' expression ')' { $$ = $2; }
| '-' factor { $$ = -$2; }
| NUMBER { $$ = $1; }
| NAME { $$ = $1->value; }
;
%%

Scanner
File: scanner.l
%{
#include "y.tab.h"
#include "parser.h"
#include <math.h>
#include <stdlib.h>   /* declares atof() */
%}
%%
([0-9]+|([0-9]*\.[0-9]+)([eE][-+]?[0-9]+)?) {
    yylval.dval = atof(yytext);
    return NUMBER;
}

[ \t] ; /* ignore white space */

Scanner (cont’d)
File: scanner.l

[A-Za-z][A-Za-z0-9]* { /* return symbol pointer */
    yylval.symp = symlook(yytext);
    return NAME;
}

"$" { return 0; /* end of input */ }

\n|"="|"+"|"-"|"*"|"/" return yytext[0];

%%

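The slides do not show parser.h or the body of symlook(). The following is a minimal sketch of what such a symbol table could look like, assuming a fixed-size array of name/value pairs searched linearly. The table size `NSYMS`, the struct layout, and the choice to return NULL when the table is full are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NSYMS 20   /* assumed table size */

struct symtab {
    char  *name;    /* NULL marks an unused slot */
    double value;
};

static struct symtab symbols[NSYMS];

/* Returns the entry for `name`, creating it on first use.
   Returns NULL if the table is full or memory runs out. */
struct symtab *symlook(const char *name) {
    assert(name != NULL);
    for (struct symtab *sp = symbols; sp < symbols + NSYMS; sp++) {
        if (sp->name != NULL && strcmp(sp->name, name) == 0)
            return sp;                       /* already known */
        if (sp->name == NULL) {              /* first free slot */
            sp->name = malloc(strlen(name) + 1);
            if (sp->name == NULL) return NULL;
            strcpy(sp->name, name);
            sp->value = 0.0;
            return sp;
        }
    }
    return NULL;                             /* table is full */
}
```

With a table like this, the action `$1->value = $3` in the parser stores into the entry that the scanner looked up, and a later use of the same name finds the same entry again.
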
YACC Command

• Yacc (AT&T)
– yacc -d xxx.y

• Bison (GNU)
– bison -d -y xxx.y
– With -y, bison produces y.tab.c, the same as yacc; otherwise it produces xxx.tab.c


Precedence / Associativity
expr: expr '-' expr
| expr '*' expr
| expr '<' expr
| '(' expr ')'
...
;

(1) 1 - 2 - 3
(2) 1 - 2 * 3

1. Is 1-2-3 = (1-2)-3 or 1-(2-3)?
Define the '-' operator to be left-associative.
2. 1-2*3 = 1-(2*3)
Define the '*' operator to have higher precedence than the '-' operator.

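The two questions above matter because the groupings give different values. The small C sketch below spells out each grouping as a function and checks that C's own operators follow the same choices (left-associative '-', '*' binding tighter than '-'). The function names are hypothetical.

```c
#include <assert.h>

/* The two possible groupings of 1 - 2 - 3. */
int minus_left(void)  { return (1 - 2) - 3; }   /* left-associative  */
int minus_right(void) { return 1 - (2 - 3); }   /* right-associative */

/* The two possible groupings of 1 - 2 * 3. */
int times_first(void) { return 1 - (2 * 3); }   /* '*' binds tighter */
int minus_first(void) { return (1 - 2) * 3; }   /* '-' binds tighter */

/* C treats '-' as left-associative and gives '*' higher precedence,
   so the unparenthesized forms agree with the choices made above. */
void check_groupings(void) {
    assert(1 - 2 - 3 == minus_left());
    assert(1 - 2 * 3 == times_first());
    assert(minus_left() != minus_right());
    assert(times_first() != minus_first());
}
```

The left-associative reading of 1-2-3 gives -4 and the right-associative one gives 2; for 1-2*3 the two readings give -5 and -3.
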
Precedence / Associativity

%right '='
%left '<' '>' NE LE GE
%left '+' '-'
%left '*' '/'   (highest precedence)
• Operators declared on later lines have higher precedence


Precedence / Associativity
%left '+' '-'
%left '*' '/'
%nonassoc UMINUS
expr : expr '+' expr { $$ = $1 + $3; }
| expr '-' expr { $$ = $1 - $3; }
| expr '*' expr { $$ = $1 * $3; }
| expr '/' expr
{
if ($3 == 0)
yyerror("divide 0");
else
$$ = $1 / $3;
}
| '-' expr %prec UMINUS { $$ = -$2; }
;


Shift/Reduce Conflicts

• Shift/reduce conflict
– occurs when a grammar is written in such a way that a decision between shifting and reducing cannot be made.
– e.g. the ambiguous IF-ELSE.
• To resolve this conflict, yacc will choose to shift.


YACC Declaration Summary
`%start'
Specify the grammar's start symbol

`%union'
Declare the collection of data types that semantic values may have

`%token'
Declare a terminal symbol (token type name) with no precedence or associativity specified

`%type'
Declare the type of semantic values for a nonterminal symbol


YACC Declaration Summary
`%right'
Declare a terminal symbol (token type name) that is right-associative

`%left'
Declare a terminal symbol (token type name) that is left-associative

`%nonassoc'
Declare a terminal symbol (token type name) that is nonassociative
(using it in a way that would be associative is a syntax error, e.g. x op. y op. z is a syntax error)


Reference Books
• lex & yacc, 2nd Edition
– by John R. Levine, Tony Mason & Doug Brown
– O’Reilly
– ISBN: 1-56592-000-7

• Mastering Regular Expressions
– by Jeffrey E.F. Friedl
– O’Reilly
– ISBN: 1-56592-257-3
