
Atif Ishaq - Lecturer GC University, Lahore

Compiler Construction
CS-4207
Lecture – 07

Lexical Analyzer Generator


The Lex tool was originally developed for Unix at Bell Laboratories. The responsibility of Lex
is to generate a lexical analyzer, or scanner. Internally it works the same way we did by hand in
the previous lectures: it accepts a set of regular expressions, converts them into a DFA, minimizes
the DFA, and generates a scanner from it. The companion tool that works with it is Yacc, which we
will discuss with the parser. Scanners are not only required for compilers but are also used in other
applications. For example, in a database loader, such a tool can be used to separate input data into
tokens such as digits and strings before storing it in the database. The Flex tool works equally well
in Windows and Linux environments, as the tool is written in C.
How to use Lex
We first need to build a scanner. The tool needs some contribution from you to describe the tokens
to generate: Flex is provided with a specification file. Flex reads the specification and produces a
C or C++ output file containing the scanner.

First of all we need to write an input file, lex.l, written in the Lex language. The Lex compiler
transforms the lex.l file into a C file that is always named lex.yy.c. This file is later compiled
by the C compiler into an executable, by default called a.out. This executable is a working lexical
analyzer that receives a stream of input characters and produces a stream of tokens.

Structure of Lex Program


The file consists of three sections:
1. Declarations (C or C++ code and Flex definitions)
%%
2. Translation rules (token patterns and actions)
%%
3. User code
A line containing only the symbol %% separates the sections. The declaration section includes
declarations of variables, manifest constants and regular definitions, as discussed in the previous
lecture. The translation rules each have the form Pattern { Action }. Each pattern is a regular
expression, which may use the regular definitions of the declaration section. The third section
holds additional functions used in the actions; alternatively, these functions may be compiled
separately and loaded with the lexical analyzer.
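As a minimal sketch (a hypothetical specification written for this note, not taken from the lecture), the three sections might look like this:

```lex
%{
/* Section 1: declarations. Code between %{ and %} is copied
   verbatim into the generated C file. */
#include <stdio.h>
%}
%option noyywrap
digit  [0-9]
%%
{digit}+   { printf("NUMBER: %s\n", yytext); }
[ \t\n]+   { /* Section 2: rules. Skip white space. */ }
.          { printf("OTHER: %s\n", yytext); }
%%
/* Section 3: user code. */
int main(void) { return yylex(); }
```

Here the regular definition `digit` from section 1 is reused in a rule of section 2 by writing it in curly braces.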
The lexical analyzer works with the parser as follows. When the lexical analyzer receives a request
from the parser, it starts reading its remaining input, one character at a time, until it finds the
longest prefix of the input that matches one of the patterns Pi. It then executes the associated
action Ai. Typically Ai returns a token to the parser, but if Pi describes white space or a comment,
it does not return a token; in that case the lexical analyzer proceeds to find the next lexeme, until
one of the corresponding actions causes a return to the parser. The lexical analyzer returns a single
value, the token name, to the parser, and uses the shared integer variable yylval to pass additional
information about the lexeme found, if needed.
Example Code
%{
#include "tokdefs.h"
%}
D   [0-9]
L   [a-zA-Z_]
id  {L}({L}|{D})*
%%
"void"  {return(TOK_VOID);}
"int"   {return(TOK_INT);}
"if"    {return(TOK_IF);}
"else"  {return(TOK_ELSE);}
"while" {return(TOK_WHILE);}
"<="    {return(TOK_LE);}
">="    {return(TOK_GE);}
"=="    {return(TOK_EQ);}
"!="    {return(TOK_NE);}
{D}+    {return(TOK_NUM);}
{id}    {return(TOK_ID);}
[\n]|[\t]|[ ] ;
%%
The file lex.l includes another file named "tokdefs.h". The contents of the tokdefs.h file are:
#define TOK_VOID  1
#define TOK_INT   2
#define TOK_IF    3
#define TOK_ELSE  4
#define TOK_WHILE 5
#define TOK_LE    6
#define TOK_GE    7
#define TOK_EQ    8
#define TOK_NE    9
#define TOK_NUM   10
#define TOK_ID    11
Flex can also create a C++ class that implements the lexical analyzer. The code for this class is
placed in Flex's output file. Below is the code needed to invoke the scanner; it is placed in
main.cpp:
#include <iostream>
#include <FlexLexer.h>
using namespace std;

int main()
{
    yyFlexLexer lex;
    int tc = lex.yylex();
    while (tc != 0) {
        cout << tc << "," << lex.YYText() << endl;
        tc = lex.yylex();
    }
    return 0;
}

The following commands can be used to generate a scanner executable in Windows (the same
commands, minus the .exe suffix, work on Linux):
flex -+ -o lex.cpp lex.l
g++ -c lex.cpp
g++ -c main.cpp
g++ -o lex.exe lex.o main.o
Below is the output of the scanner when it is executed with the file main.cpp itself as input, i.e.
the scanner is asked to list the tokens found in main.cpp. (The token codes shown come from a
larger set of token definitions than the small tokdefs.h above, which is why they do not match
the values 1-11.)
259,void
258,main
283,(
284,)
285,{
258,FlexLexer
258,lex
290,;
260,int
258,tc
266,=
258,lex
291,.
258,yylex
283,(
284,)
290,;
263,while
283,(
258,tc
276,!=
257,0

284,)
258,cout
279,<<
258,tc
279,<<
292,","
279,<<
258,lex
291,.
258,YYText
283,(
284,)
279,<<
258,endl
290,;
258,tc
266,=
258,lex
291,.
258,yylex
283,(
284,)
290,;
286,}

Here is another example for reference. The declaration section includes a pair of special brackets
%{ and %}. Anything within these brackets is copied directly to the file lex.yy.c and is not
treated as a regular definition. This is the common place where definitions of manifest constants,
using C #define, can be incorporated. In our second example, some of the manifest constants,
LE, GT and so on, are named in a comment rather than given proper definitions. We can also find
the sequence of regular definitions in the declaration section after the manifest-constant block.

Regular definitions (which we have already discussed in previous lectures) that are used in later
definitions or in the patterns of the translation rules are surrounded by curly braces. For example,
'delim' is defined to be shorthand for the character class consisting of newline, tab and space.
In the definitions of id and number, the curly braces are used for grouping and do not stand for
themselves. If a symbol such as +, *, . or ?, or a parenthesis, is to stand for itself, it must be
preceded by a backslash; we can see \. in the definition of number.
Since white space does not return any token, when white space is encountered in the input no
token is returned to the parser and the scanner looks for another lexeme. For a keyword, the
regular expression is the keyword itself. If a keyword also matches the identifier pattern, the
lexical analyzer resolves the conflict in favor of whichever pattern is listed first.
%{
/*definition of manifest constants
LT , LE , EQ , NE , GT , GE , IF, THEN , ELSE , ID , NUMBER , RELOP */
%}

%%

“<=”

%%
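Only the "<=" rule of this second example survives above. A fuller sketch consistent with the description in the text (the helper functions installID and installNum are hypothetical, not defined in the lecture) might read:

```lex
%{
/* definitions of manifest constants
   LT, LE, EQ, NE, GT, GE, IF, THEN, ELSE, ID, NUMBER, RELOP */
%}
/* regular definitions */
delim   [ \t\n]
ws      {delim}+
letter  [A-Za-z]
digit   [0-9]
id      {letter}({letter}|{digit})*
number  {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%
{ws}     { /* no action and no return: white space is skipped */ }
if       { return(IF); }
then     { return(THEN); }
else     { return(ELSE); }
{id}     { yylval = installID();  return(ID); }
{number} { yylval = installNum(); return(NUMBER); }
"<"      { yylval = LT; return(RELOP); }
"<="     { yylval = LE; return(RELOP); }
">"      { yylval = GT; return(RELOP); }
">="     { yylval = GE; return(RELOP); }
%%
```

Note the backslash before the dot in the definition of number, and the use of yylval to pass the attribute value (such as which relational operator was seen) alongside the token name.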

Commonly used variables and functions of Lex are listed below.

yyin :- the input stream pointer (i.e. it points to the input file to be scanned or tokenised);
however, the default input of the default main() is stdin.

yylex() :- the main entry point for Lex. It reads the input stream, generates tokens, and returns
zero at the end of the input stream. It is called to invoke the lexer (or scanner), and each time
yylex() is called, the scanner continues processing the input from where it last left off.

yytext :- a buffer that holds the input characters that actually matched the pattern (i.e. the
lexeme), or say a pointer to the matched string.

yyleng :- the length of the current lexeme.

yylval :- holds the attribute value of the token, shared with the parser.

yyval :- an internal Yacc variable (not normally used directly).

yyout :- the output stream pointer (i.e. it points to the file where output is written); however,
the default output of the default main() is stdout.

yywrap() :- called by Lex when the input is exhausted (at EOF); the default yywrap() always
returns 1, which stops scanning.

yymore() :- tells the scanner that the next matched text should be appended to the current
yytext rather than replacing it.

yyless(k) :- keeps the first k characters of yytext and pushes the remaining matched characters
back onto the input to be rescanned.

yyparse() :- the entry point of a Yacc-generated parser; it builds the parse tree, calling yylex()
each time it needs a token.
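As a small hedged sketch (written for this note, not from the lecture) of how some of these appear inside rules:

```lex
%option noyywrap
%%
[a-zA-Z]+   { printf("word \"%s\" of length %d\n", yytext, yyleng); }
"<="        { printf("LE\n"); }
"<"         { printf("LT\n"); }
[0-9]+\.    { yyless(yyleng - 1);   /* give the trailing '.' back to the input */
              printf("NUM %s\n", yytext); }
.|\n        ;
%%
```

After the yyless call, yytext and yyleng are adjusted to cover only the retained characters, and the pushed-back '.' will be rescanned as the start of the next lexeme.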

Conflict resolution in Lex

1. Always prefer a longer prefix to a shorter prefix.

2. If the longest possible prefix matches two or more patterns, prefer the pattern listed first in
the Lex program.
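A hypothetical pair of rule groups illustrating both principles (the token constants are assumed to be defined elsewhere):

```lex
%%
"<"     { return LT; }   /* rule 1: on input "<=", the longer match below wins */
"<="    { return LE; }
"if"    { return IF; }   /* rule 2: "if" matches both this pattern and [a-z]+; */
[a-z]+  { return ID; }   /* the keyword wins because it is listed first         */
%%
```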
Atif Ishaq - Lecturer GC University, Lahore

Lex automatically reads one character ahead of the last character that forms the selected lexeme,
and then retracts the input so that only the lexeme itself is consumed from the input.
