Comp Chap2
Lexical Analysis
• Recognize tokens and ignore white spaces, comments
• Error reporting
• Model using regular expressions
• Recognize using Finite State Automata
The first phase of the compiler is lexical analysis. The lexical analyzer breaks a sentence into
a sequence of words or tokens and ignores white spaces and comments. It generates a
stream of tokens from the input. This is modeled through regular expressions and the
structure is recognized through finite state automata. If the token is not valid, i.e., it does not fall into any of the identifiable groups, then the lexical analyzer reports an error. Lexical analysis thus involves recognizing the tokens in the source program and reporting errors, if any. We will study more about all these processes in the subsequent slides.
Lexical Analysis
• Sentences consist of string of tokens (a syntactic category) for example, number, identifier,
keyword, string
• Sequences of characters in a token form a lexeme, for example, 100.01, counter, const, "How are you?"
• A rule of description is a pattern, for example, letter(letter|digit)*
• Discard whatever does not contribute to parsing, like white spaces (blanks, tabs, newlines) and comments
• Construct constants: convert numbers to the token num and pass the number as its attribute, for example, integer 31 becomes <num, 31>
• Recognize keywords and identifiers, for example, counter = counter + increment becomes id = id + id /* check if id is a keyword */
We often use the terms "token", "pattern" and "lexeme" while studying lexical analysis.
Let's see what each term stands for.
Cont…
• Token: A token is a syntactic category. Sentences consist of a string of tokens. For example
number, identifier, keyword, string etc are tokens.
• Lexeme: Sequence of characters in a token is a lexeme. For example 100.01, counter, const, "How
are you?" etc are lexemes.
• Pattern: A rule of description is a pattern. For example, letter (letter | digit)* is a pattern that symbolizes the set of strings which consist of a letter followed by zero or more letters or digits. In general, there is a set of
strings in the input for which the same token is produced as output. This set of strings is
described by a rule called a pattern associated with the token. This pattern is said to match each
string in the set. A lexeme is a sequence of characters in the source program that is matched by
the pattern for a token. The patterns are specified using regular expressions. For example, in the
Pascal statement
Const pi = 3.1416;
• The substring pi is a lexeme for the token "identifier". We discard whatever does not contribute to
parsing like white spaces (blanks, tabs, new lines) and comments. When more than one pattern
matches a lexeme, the lexical analyzer must provide additional information about the particular
lexeme that matched to the subsequent phases of the compiler. For example, the
pattern num matches both 1 and 0 but it is essential for the code generator to know what string
was actually matched. The lexical analyzer collects information about tokens into their associated
attributes. For example integer 31 becomes <num, 31>. So, the constants are constructed by
converting numbers to token 'num' and passing the number as its attribute. Similarly, we recognize
keywords and identifiers. For example count = count + inc becomes id = id + id.
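To make the token-attribute pairing concrete, the following is a minimal C sketch (illustrative only; the names Token, TokenKind and attr are assumptions, not code from the course) of how the pair <num, 31> might be represented:

#include <stdio.h>

/* Illustrative sketch: a token is a syntactic category plus an attribute. */
enum TokenKind { NUM, ID, KEYWORD };

struct Token {
    enum TokenKind kind;   /* the syntactic category, e.g. NUM               */
    int attr;              /* the attribute, e.g. the value 31 in <num, 31>  */
};

int main(void) {
    struct Token t = { NUM, 31 };        /* the integer 31 becomes <num, 31> */
    printf("<num, %d>\n", t.attr);
    return 0;
}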
Interface to other phases
• Push back is required due to lookahead, for example >= and >
• It is implemented through a buffer
• Keep input in a buffer
• Move pointers over the input
The lexical analyzer reads characters from the input and passes tokens to the syntax analyzer
whenever it asks for one. For many source languages, there are occasions when the lexical
analyzer needs to look ahead several characters beyond the current lexeme for a pattern before a
match can be announced. For example, > and >= cannot be distinguished merely on the basis of
the first character >. Hence there is a need to maintain a buffer of the input for look ahead and
push back. We keep the input in a buffer and move pointers over the input. Sometimes, we may
also need to push back extra characters due to this lookahead.
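A minimal C sketch of this idea (illustrative only; the token codes GT and GE are assumptions) reads one character of lookahead and pushes it back with ungetc() when it turns out not to be needed:

#include <stdio.h>

enum { GT = 256, GE };          /* assumed token codes above the character range */

/* Distinguish ">" from ">=" using one character of lookahead with pushback. */
int relop_token(FILE *in) {
    int c = fgetc(in);
    if (c == '>') {
        int look = fgetc(in);   /* look one character ahead                  */
        if (look == '=')
            return GE;          /* the lexeme is ">="                        */
        ungetc(look, in);       /* not part of this token: push it back      */
        return GT;              /* the lexeme is ">" alone                   */
    }
    return c;                   /* any other character is returned as itself */
}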
Approaches to implementation
A crude implementation of a lexical analyzer that eliminates white space and collects numbers is outlined here. Every time the body of the while statement is executed, a character is read into t. If the character is a blank (written ' ') or a tab (written '\t'), then no token is returned to the parser; we merely go around the while loop again. If the character is a newline (written '\n'), then a global variable lineno is incremented, thereby keeping track of line numbers in the input, but again no token is returned. Supplying a line number with the error messages helps pinpoint errors. The code for reading a sequence of digits is on lines 11-19. The predicate isdigit(t) from the include file <ctype.h> is used on lines 11 and 14 to determine whether an incoming character t is a digit. If it is, then its integer value is given by the expression t-'0' in both ASCII and EBCDIC. With other character sets, the conversion may need to be done differently.
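The listing itself is not reproduced in these notes; a minimal C sketch along the lines described above (the token codes NUM and NONE and the globals lineno and tokenval are assumed names, and the line numbers cited in the text refer to the original listing, not to this sketch) is:

#include <stdio.h>
#include <ctype.h>

#define NONE -1
#define NUM  256               /* assumed token code outside the character range */

int lineno = 1;                /* current line number in the input               */
int tokenval = NONE;           /* attribute of the token just read               */

/* Skip blanks and tabs, count newlines, and collect an integer constant
 * into tokenval; any other character is returned as a token by itself.  */
int lexan(void) {
    int t;
    while (1) {
        t = getchar();
        if (t == ' ' || t == '\t')
            ;                            /* strip blanks and tabs              */
        else if (t == '\n')
            lineno = lineno + 1;         /* keep track of line numbers         */
        else if (isdigit(t)) {
            tokenval = t - '0';          /* value of the digit in ASCII/EBCDIC */
            t = getchar();
            while (isdigit(t)) {
                tokenval = tokenval * 10 + t - '0';
                t = getchar();
            }
            ungetc(t, stdin);            /* push back the lookahead character  */
            return NUM;
        }
        else {
            tokenval = NONE;
            return t;
        }
    }
}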
Problems
• Scans text character by character
• Look ahead character determines what kind of token to read and when the current token
ends
• First character cannot determine what kind of token we are going to read
• The problem with lexical analyzer is that the input is scanned character by
character.
• Now, it's not possible to determine, by looking only at the first character, what kind of token we are going to read, since the first character might be common to multiple tokens. We saw one such example with > and >= previously.
• So one needs to use a lookahead character, depending on which one can determine what kind of token to read or when a particular token ends. The character that ends a token may not be punctuation or a blank but just another kind of token which acts as the word boundary.
• The lexical analyzer that we just saw used a function ungetc() to push lookahead
characters back into the input stream. Because a large amount of time can be
consumed moving characters, there is actually a lot of overhead in processing an
input character. To reduce the amount of such overhead involved, many specialized
buffering schemes have been developed and used.
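One widely described scheme is the buffer pair. The following C sketch (illustrative only; the names lexeme_beginning and forward and the use of a NUL sentinel are assumptions, textbook treatments typically use an eof marker instead) outlines the idea of reading the input alternately into two halves of one array, with a sentinel marking the end of each half:

#include <stdio.h>

#define BUF_SIZE 4096
#define SENTINEL '\0'            /* marks the end of the valid characters in a half */

/* Buffer pair: the input is read alternately into two halves of one array,
 * so a lexeme that straddles a refill is not lost and lookahead is cheap. */
static char buf[2 * (BUF_SIZE + 1)];
static char *lexeme_beginning;   /* first character of the current lexeme  */
static char *forward;            /* lookahead (forward) pointer            */

static void load_half(FILE *in, char *half) {
    size_t n = fread(half, 1, BUF_SIZE, in);
    half[n] = SENTINEL;          /* sentinel just after the last valid character */
}

static void init_buffers(FILE *in) {
    load_half(in, buf);
    lexeme_beginning = forward = buf;
}

/* Return the next character, refilling the other half whenever the sentinel
 * at the end of a full half is reached.                                     */
static int next_char(FILE *in) {
    while (*forward == SENTINEL) {
        if (forward == buf + BUF_SIZE) {                 /* end of first half  */
            load_half(in, buf + BUF_SIZE + 1);
            forward = buf + BUF_SIZE + 1;
        } else if (forward == buf + 2 * BUF_SIZE + 1) {  /* end of second half */
            load_half(in, buf);
            forward = buf;
        } else {
            return EOF;                                  /* true end of input  */
        }
    }
    return (unsigned char)*forward++;
}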
Symbol Table
• Consider tokens DIV and MOD with lexemes div and mod.
• Initialize the symbol table with insert("div", DIV) and insert("mod", MOD).
• Any subsequent lookup returns a nonzero value; therefore, these lexemes cannot be used as identifiers.
• To handle keywords, we consider the keywords themselves as lexemes. We
store all the entries corresponding to keywords in the symbol table while
initializing it and do lookup whenever we see a new lexeme.
• Now, whenever a lookup is done, if a nonzero value is returned, it means that
there already exists a corresponding entry in the Symbol Table.
• So, if someone tries to use a keyword as an identifier, it will not be allowed, because an entry with this name already exists in the Symbol Table.
• For instance, consider the tokens DIV and MOD with lexemes "div" and
"mod".
• We initialize symbol table with insert("div", DIV) and insert("mod", MOD). Any
subsequent lookup now would return a nonzero value, and therefore, neither
"div" nor "mod" can be used as an identifier.
Difficulties in design of lexical analyzers
• Is it as simple as it sounds?
• Lexemes in a fixed position. Fixed format vs. free format languages
• Handling of blanks
• in Pascal, blanks separate identifiers
• in Fortran, blanks are important only in literal strings, for example, the variable counter is the same as count er
Cont…
• Another example
DO 10 I = 1.25   (i.e., DO10I=1.25)
DO 10 I = 1,25   (i.e., DO10I=1,25)
• The design of a lexical analyzer is quite complicated and not as simple as it looks.
• There are several kinds of problems because of all the different types of languages we have. Let us have a look
at some of them. For example:
• 1. We have both fixed format and free format languages - A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.
• FORTRAN has lexemes in a fixed position. These white space and fixed format rules came into force due to
punch cards and errors in punching. Fixed format languages make life difficult because in this case we have to
look at the position of the tokens also.
• 2. Handling of blanks - We must decide how to handle blanks, since many languages (like Pascal, FORTRAN, etc.) attach significance to blanks and void spaces.
• When more than one pattern matches a lexeme, the lexical analyzer must provide additional information about the particular lexeme that matched to the subsequent phases of the compiler.
• In Pascal blanks separate identifiers.
• In FORTRAN blanks are important only in literal strings. For example, the variable "counter" is the same as "count er".
• Another example is DO 10 I = 1.25 versus DO 10 I = 1,25.
• The first line is a variable assignment DO10I = 1.25.
• The second line is the beginning of a Do loop.
• In such a case we might need an arbitrary long lookahead. Reading from left to right, we cannot distinguish
between the two until the " , " or " . " is reached.
Cont….
• Cannot tell whether Declare is a keyword or an array reference until after ")"
• Requires arbitrary lookahead and very large buffers. Worse, the buffers may have to be reloaded.
• In many languages certain strings are reserved, i.e., their meaning is predefined and cannot be changed by the user. If keywords are not reserved, then the lexical analyzer must distinguish between a keyword and a user-defined identifier.
• PL/1 has several problems:
1. In PL/1 keywords are not reserved; thus, the rules for distinguishing keywords from identifiers are quite
complicated as the following PL/1 statement illustrates. For example - If then then then = else else else
= then
2. PL/1 declarations: Example - Declare (arg1, arg2, arg3, ..., argn). In this statement, we cannot tell whether 'Declare' is a keyword or an array name until we see the character that follows the ")".
This requires arbitrary lookahead and very large buffers. This buffering scheme works quite well most of the time, but with it the amount of lookahead is limited, and this limited lookahead may make it impossible to recognize tokens in situations where the distance the forward pointer must travel is more than the length of the buffer, as the slide illustrates.
The situation worsens even further if the buffers have to be reloaded.
Problem continues even today!!
• C++ template syntax: Foo<Bar>
• C++ stream syntax: cin >> var;
• Nested templates: Foo<Bar<Bazz>>
• Can these problems be resolved by lexical analyzers alone?
Even C++ has similar problems, for example:
1. C++ template syntax: Foo<Bar>
2. C++ stream syntax: cin >> var;
3. Nested templates: Foo<Bar<Bazz>>
We have to see whether these problems can be resolved by lexical analyzers alone.
How to specify tokens?
• If r and s are regular expressions denoting the languages L(r) and L(s) then
• (r)|(s) is a regular expression denoting L(r) U L(s)
• (r)(s) is a regular expression denoting L(r)L(s)
• (r)* is a regular expression denoting (L(r))*
• (r) is a regular expression denoting L(r)
Suppose r and s are regular expressions denoting the languages L(r) and
L(s). Then,
• (r)|(s) is a regular expression denoting L(r) U L(s).
• (r) (s) is a regular expression denoting L(r) L(s).
• (r)* is a regular expression denoting (L(r))*.
• (r) is a regular expression denoting L(r).
• Let us take an example to illustrate: Let S = {a, b}.
1. The regular expression a|b denotes the set {a,b}.
2. The regular expression (a | b) (a | b) denotes {aa, ab, ba, bb}, the set of all strings
of a's and b's of length two. Another regular expression for this same set is aa |
ab | ba | bb.
3. The regular expression a* denotes the set of all strings of zero or more a's, i.e., {ε, a, aa, aaa, ...}.
4. The regular expression (a | b)* denotes the set of all strings containing zero or
more instances of an a or b, that is, the set of strings of a's and b's. Another
regular expression for this set is (a*b*)*.
5. The regular expression a | a*b denotes the set containing the string a and all
strings consisting of zero or more a's followed by a b.
• If two regular expressions r and s denote the same language, we say r and s are equivalent and write r = s. For example, (a | b) = (b | a).
Notation ....
A regular definition is a sequence of definitions of the form d1 → r1, d2 → r2, ..., dn → rn, where each di is a distinct name, and each ri is a regular expression over the symbols in S ∪ {d1, d2, ..., di-1}, i.e., the basic symbols and the previously defined names. By restricting each ri to symbols of S and the previously defined names, we can construct a regular expression over S for any ri by repeatedly replacing regular-expression names by the expressions they denote. If ri used dk for some k >= i, then ri might be recursively defined, and this substitution process would not terminate. So, we treat tokens as terminal symbols in the grammar for the source language. The lexeme matched by the pattern for the token consists of a string of characters in the source program and can be treated as a lexical unit. The lexical analyzer collects information about tokens into their associated attributes. As a practical matter, a token usually has only a single attribute, a pointer to the symbol table entry in which the information about the token is kept; the pointer becomes the attribute for the token.
Example
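A standard example of a regular definition (assumed here; it may differ from the slide's own example) is the definition of identifiers:

letter → A | B | ... | Z | a | b | ... | z
digit  → 0 | 1 | ... | 9
id     → letter ( letter | digit )*

Here letter and digit name the basic character classes and id is defined in terms of them; repeatedly replacing the names by the expressions they denote yields a single regular expression over S for id.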
Lexical analyzer generator
• Input to the generator
• List of regular expressions in priority order
• Associated actions for each regular expression (generate the kind of token and other bookkeeping information)
• Output of the generator
• Program that reads input character stream and breaks that into tokens
• Reports lexical errors (unexpected characters), if any
We assume that we have a specification of lexical analyzers in the form of regular expressions and the corresponding action parameters.
An action parameter is the program segment that is to be executed whenever a lexeme matched by its regular expression is found in the input. So, the input to the generator is a list of regular expressions in a priority order and associated actions for each of the regular expressions. These actions generate the kind of token and other bookkeeping information.
Our problem is to construct a recognizer that looks for lexemes in the input buffer. If more than one pattern matches, the recognizer is to choose the longest lexeme matched. If there are two or more patterns that match the longest lexeme, the first listed matching pattern is chosen. For instance, on the input ifx, a keyword pattern if and an identifier pattern both match a prefix; the longest match wins, so ifx is recognized as a single identifier.
So, the output of the generator is a program that reads the input character stream and breaks it into tokens. It also reports lexical errors, i.e., unexpected characters in the input, if any.
LEX: A lexical analyzer generator
In this section, we consider the design of a software tool that automatically constructs the
lexical analyzer code from the LEX specifications. LEX is one such lexical analyzer generator
which produces C code based on the token specifications. This tool has been widely used to
specify lexical analyzers for a variety of languages. We refer to the tool as Lex Compiler, and to
its input specification as the Lex language. Lex is generally used in the manner depicted in the
slide. First, a specification of a lexical analyzer is prepared by creating a program lex.l in the lex
language. Then lex.l is run through the Lex compiler to produce a C program lex.yy.c. The
program lex.yy.c consists of a tabular representation of a transition diagram constructed from
the regular expressions of the lex.l, together with a standard routine that uses the table to
recognize lexemes. The actions associated with the regular expressions in lex.l are pieces of C
code and are carried over directly to lex.yy.c. Finally, lex.yy.c is run through the C compiler to
produce an object program a.out, which is the lexical analyzer that transforms the input stream
into a sequence of tokens.
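As an illustration (a minimal sketch, not the course's own specification; the token names and actions are assumptions), a lex.l file that pairs regular expressions with C actions might look like this:

%{
#include <stdio.h>
%}
digit    [0-9]
letter   [A-Za-z]
%%
[ \t\n]+                     { /* discard white space */ }
"const"                      { printf("KEYWORD(const)\n"); }
{digit}+                     { printf("NUM(%s)\n", yytext); }
{letter}({letter}|{digit})*  { printf("ID(%s)\n", yytext); }
"="                          { printf("ASSIGN\n"); }
.                            { printf("lexical error: %s\n", yytext); }
%%
int yywrap(void) { return 1; }
int main(void)   { yylex(); return 0; }

Because "const" is listed before the identifier pattern, it takes priority when both match a lexeme of the same length, and since each pattern matches the longest possible lexeme, an input such as 3141 is returned as a single NUM token.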
How does LEX work?