UNIT 2 Part 3 Lexical Analyzer Generator
HARSHITA SHARMA
The Lex tool
► The Lex tool allows one to specify a lexical analyzer by writing regular expressions that describe the patterns for tokens.
► The input notation for the Lex tool is referred to as the Lex language, and the tool itself as the Lex compiler.
► Behind the scenes, the Lex compiler transforms the input patterns into a
transition diagram and generates code, in a file called lex.yy.c, that
simulates this transition diagram.
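For concreteness, here is a minimal sketch of such an input (the digit pattern and the message are illustrative, not from the source); even this one-rule specification is enough for the Lex compiler to produce lex.yy.c:

    %%
    [0-9]+   { printf("number: %s\n", yytext); }

Here yytext is Lex's name for the matched lexeme. Any character not matched by a pattern falls through to Lex's default action, which simply echoes it to the output.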
Use of Lex
► An input file, which we call lex.l, is written in the Lex language and describes the
lexical analyzer to be generated.
► The Lex compiler transforms lex.l to a C program, in a file that is always named
lex.yy.c. The latter file is compiled by the C compiler into a file called a.out, as
always.
► The C-compiler output is a working lexical analyzer that can take a stream of input
characters and produce a stream of tokens. The normal use of the compiled C
program, referred to as a.out, is as a subroutine of the parser.
► It is a C function that returns an integer, which is a code for one of the possible
token names.
► The attribute value, whether it be another numeric code, a pointer to the symbol table, or nothing, is placed in a global variable yylval, which is shared between the lexical analyzer and parser, thereby making it simple to return both the name and an attribute value of a token.
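As a sketch of this arrangement: after running lex lex.l and compiling the result (for example, cc lex.yy.c driver.c -ll, where -ll links the Lex support library that supplies yywrap), a driver standing in for the parser might call the analyzer as below. The names yylex, yylval, and lex.yy.c are Lex's conventions; the assumption that yylval holds an int is ours:

    /* driver.c: a stand-in for the parser */
    #include <stdio.h>

    extern int yylex(void);   /* the generated analyzer in lex.yy.c */
    extern int yylval;        /* shared attribute value, defined in the Lex program */

    int main(void) {
        int token;
        while ((token = yylex()) != 0) {   /* 0 conventionally signals end of input */
            printf("token %d, attribute %d\n", token, yylval);
        }
        return 0;
    }

Each call to yylex() returns the integer code of the next token, exactly as the parser would receive it.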
Structure of a Lex Program
► A Lex program has three sections, separated by %% lines:
declarations
%%
translation rules
%%
auxiliary functions
► The declarations section includes declarations of variables, manifest constants (identifiers declared to stand for a constant, e.g., the name of a token), and regular definitions.
► The translation rules each have the form
Pattern { Action }
► Each pattern is a regular expression, which may use the regular definitions of
the declaration section.
► The actions are fragments of code, typically written in C, although many
variants of Lex using other languages have been created.
► The third section holds whatever additional functions are used in the actions.
Alternatively, these functions can be compiled separately and loaded with the
lexical analyzer.
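Putting the three sections together, a minimal sketch of a complete Lex program might look as follows (the token codes, the regular definitions, and the helpers installID and installNum are illustrative assumptions, not from the source):

    %{
    /* declarations: C declarations and manifest constants */
    #define ID     258    /* hypothetical token codes */
    #define NUMBER 259
    int yylval;           /* attribute value shared with the parser */
    int installID(void);
    int installNum(void);
    %}

    delim   [ \t\n]
    ws      {delim}+
    letter  [A-Za-z]
    digit   [0-9]
    id      {letter}({letter}|{digit})*
    number  {digit}+

    %%

    {ws}      { /* skip whitespace: no return, keep scanning */ }
    {id}      { yylval = installID(); return ID; }
    {number}  { yylval = installNum(); return NUMBER; }

    %%

    int installID(void)  { /* would enter yytext into the symbol table */ return 0; }
    int installNum(void) { /* would convert yytext to a number and store it */ return 0; }

Here installID and installNum are placeholders for the auxiliary functions that the third section would hold.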
Working
► The lexical analyzer created by Lex behaves in concert with the parser as
follows. When called by the parser, the lexical analyzer begins reading its
remaining input, one character at a time, until it finds the longest prefix of
the input that matches one of the patterns Pi.
► It then executes the associated action Ai. Typically, Ai will return to the parser, but if it does not (e.g., because Pi describes whitespace or comments), then the lexical analyzer proceeds to find additional lexemes, until one of the corresponding actions causes a return to the parser.
► The lexical analyzer returns a single value, the token name, to the parser, but uses the shared integer variable yylval to pass additional information about the lexeme found, if needed.
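Using the illustrative names from the sketch above, the two cases look like this: an action with no return discards the lexeme and keeps scanning, while a returning action hands the parser a token name and passes the attribute through yylval:

    {ws}   { /* matched, but no return: the analyzer keeps scanning */ }
    {id}   { yylval = installID();   /* attribute passed via yylval */
             return ID; }            /* token name returned to the parser */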
Conflict Resolution in Lex
► When the input matches more than one pattern, or a pattern matches prefixes of several lengths, Lex resolves the conflict with two rules:
1. Always prefer a longer prefix to a shorter prefix.
2. If the longest possible prefix matches two or more patterns, prefer the pattern listed first in the Lex program.
► The first rule tells us to continue reading letters and digits to find the longest prefix of these characters to group as an identifier. It also tells us to treat <= as a single lexeme, rather than selecting < as one lexeme and = as the next lexeme.
► The second rule makes keywords reserved, if we list the keywords before id in the program.
► For instance, if then is determined to be the longest prefix of the input that matches any pattern, and the pattern for the keyword then precedes {id}, then the token THEN is returned, rather than ID.
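A sketch of both rules in action (THEN, ID, LT, and LE are hypothetical token codes):

    then   { return THEN; }  /* listed before {id}, so rule 2 makes "then" a keyword */
    {id}   { yylval = installID(); return ID; }
    "<"    { return LT; }
    "<="   { return LE; }    /* rule 1: on input <=, the longer prefix wins,
                                so LE is returned even though "<" also matches */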
The Lookahead Operator
► Lex automatically reads one character ahead of the last character that forms the selected lexeme, and then retracts the input so that only the lexeme itself is consumed. Sometimes, however, we want a certain pattern to be matched to the input only when it is followed by certain other characters.
► If so, we may use the slash (/) in a pattern to indicate the end of the part of the pattern that matches the lexeme. What follows the / is an additional pattern that must be matched before we can decide that the token in question was seen, but whatever matches this second pattern is not part of the lexeme.
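The classic sketch is Fortran's IF, which is a keyword only when followed by a parenthesized condition (the pattern below adapts the common textbook example; letter is assumed to be defined as in the program sketch above):

    IF/\(.*\){letter}   { return IF; }

The part after the / (an open parenthesis, anything, a close parenthesis, then a letter) must be present for the rule to fire, but only the two characters IF are consumed as the lexeme; the rest stays in the input.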