CoSc3112 Compiler Design Chapter 2
Figure 2.1: Interactions between the lexical analyzer and the parser
Since the lexical analyzer is the part of the compiler that reads the source text, it may perform
certain other tasks besides identification of lexemes. One such task is stripping out comments and
whitespace (blank, newline, tab). Another task is correlating error messages generated by the
compiler with the source program.
Needs / Roles / Functions of lexical analyzer
It produces a stream of tokens.
It eliminates comments and whitespace.
It keeps track of line numbers.
It reports errors encountered while generating tokens.
It stores information about identifiers, keywords, constants, and so on in the symbol table.
Lexical analyzers are sometimes divided into a cascade of two processes:
a) Scanning consists of the simple processes that do not require tokenization of the input, such
as deletion of comments and compaction of consecutive whitespace characters into one.
b) Lexical analysis is the more complex portion, where the scanner produces the sequence of
tokens as output.
Lexical Analysis versus Parsing / Issues in Lexical analysis
1. Simplicity of design: This is the most important consideration. The separation of lexical and
syntactic analysis often allows us to simplify at least one of these tasks; for example,
whitespace and comments are removed by the lexical analyzer, so the parser never has to deal
with them.
2. Compiler efficiency is improved. A separate lexical analyzer allows us to apply
specialized techniques that serve only the lexical task, not the job of parsing. In addition,
specialized buffering techniques for reading input characters can speed up the compiler
significantly.
3. Compiler portability is enhanced. Input-device-specific peculiarities can be restricted to
the lexical analyzer.
Tokens, Patterns, and Lexemes
A token is a pair consisting of a token name and an optional attribute value. The token name is an
abstract symbol representing a kind of single lexical unit, e.g., a particular keyword, or a
sequence of input characters denoting an identifier. Operators, special symbols and constants are
also typical tokens.
A pattern is a description of the form that the lexemes of a token may take; that is, a set of rules
that describes the token. A lexeme is a sequence of characters in the source program that matches the
pattern for a token.
Table 2.1: Tokens and Lexemes

TOKEN        INFORMAL DESCRIPTION (PATTERN)           SAMPLE LEXEMES
if           characters i, f                          if
else         characters e, l, s, e                    else
comparison   < or > or <= or >= or == or !=           <=, !=
id           letter, followed by letters and digits   pi, score, D2, sum, id_1, AVG
number       any numeric constant                     35, 3.14159, 0, 6.02e23
literal      anything surrounded by “ ”               “Core”, “Design”, “Appasami”
In many programming languages, the following classes cover most or all of the tokens:
1. One token for each keyword. The pattern for a keyword is the same as the keyword itself.
2. Tokens for the operators, either individually or in classes such as the token comparison
mentioned in table 2.1.
3. One token representing all identifiers.
4. One or more tokens representing constants, such as numbers and literal strings.
5. Tokens for each punctuation symbol, such as left and right parentheses, comma, and
semicolon.
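The five classes above can be sketched as a toy tokenizer. This is an illustrative sketch only, not the text's own code: the `tokenize` helper, the chosen keywords, and the class ordering (keywords tried before identifiers) are assumptions layered on table 2.1.

```python
import re

# Token classes from the list above, tried in order: keywords first,
# then operators, numbers, identifiers, and punctuation.
TOKEN_SPEC = [
    ("keyword",    r"\b(if|else|while|return)\b"),   # illustrative keyword set
    ("comparison", r"<=|>=|==|!=|<|>"),
    ("number",     r"\d+(\.\d+)?([eE][+-]?\d+)?"),
    ("id",         r"[A-Za-z_][A-Za-z0-9_]*"),
    ("punct",      r"[(),;]"),
    ("ws",         r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(src):
    """Return (token_name, lexeme) pairs, skipping whitespace."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(src)
            if m.lastgroup != "ws"]

print(tokenize("if (score >= 60) pi;"))
```

Note that the keyword patterns use word boundaries, so an identifier such as `ifx` is not mistaken for the keyword `if`.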
Attributes for Tokens
When more than one lexeme can match a pattern, the lexical analyzer must provide the subsequent
compiler phases additional information about the particular lexeme that matched.
The lexical analyzer returns to the parser not only a token name, but an attribute value that
describes the lexeme represented by the token.
The token name influences parsing decisions, while the attribute value influences translation of
tokens after the parse.
Information about an identifier - e.g., its lexeme, its type, and the location at which it is first found
(in case an error message is needed) - is kept in the symbol table.
Thus, the appropriate attribute value for an identifier is a pointer to the symbol-table entry for that
identifier.
Example: The token names and associated attribute values for the Fortran statement E=M
* C ** 2 are written below as a sequence of pairs.
<id, pointer to symbol-table entry for E>
<assign_op>
<id, pointer to symbol-table entry for M>
<mult_op>
<id, pointer to symbol-table entry for C>
<exp_op>
<number, integer value 2 >
Note that in certain pairs, especially operators, punctuation, and keywords, there is no need for an
attribute value. In this example, the token number has been given an integer-valued attribute.
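As a minimal sketch (not from the text: `lex_pairs` is a hypothetical helper, and integer symbol-table indices stand in for real symbol-table pointers), the pair sequence above can be produced like this:

```python
# Emit <token, attribute> pairs for the statement E = M * C ** 2.
# Identifiers get a symbol-table index as their attribute; operators get none.
def lex_pairs(stmt):
    symtab = {}                      # identifier -> symbol-table index
    pairs = []
    for lexeme in stmt.split():
        if lexeme == "=":
            pairs.append(("assign_op", None))
        elif lexeme == "**":         # checked before "*" would be, via equality
            pairs.append(("exp_op", None))
        elif lexeme == "*":
            pairs.append(("mult_op", None))
        elif lexeme.isdigit():
            pairs.append(("number", int(lexeme)))
        else:                        # identifier: attribute is its symtab entry
            idx = symtab.setdefault(lexeme, len(symtab))
            pairs.append(("id", idx))
    return pairs, symtab

pairs, symtab = lex_pairs("E = M * C ** 2")
print(pairs)
```

The token names with a `None` attribute correspond to the pairs above that carry no attribute value.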
If x and y are strings, then the concatenation of x and y, denoted xy, is the string formed
by appending y to x.
Operations on Languages
In lexical analysis, the most important operations on languages are union, concatenation, and
closure, which are defined in table 2.2.
Table 2.2: Definitions of operations on languages

OPERATION                    DEFINITION AND NOTATION
Union of L and M             L U M = { s | s is in L or s is in M }
Concatenation of L and M     LM = { st | s is in L and t is in M }
Kleene closure of L          L* = L0 U L1 U L2 U . . . (zero or more concatenations of L)
Positive closure of L        L+ = L1 U L2 U L3 U . . . (one or more concatenations of L)
Example: Let L be the set of letters {A, B, . . . , Z, a, b, . . . , z} and let D be the set of digits {0,
1, . . . , 9}. Other languages can be constructed from L and D:
1. L U D is the set of letters and digits - strictly speaking, the language with 62 strings of
length one, each of which is either one letter or one digit.
2. LD is the set of 520 strings of length two, each consisting of one letter followed by one
digit.
3. L4 is the set of all 4-letter strings.
4. L* is the set of all strings of letters, including ε, the empty string.
5. L(L U D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.
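The counts claimed in items 1-3 can be checked mechanically. A small sketch (the helper names `concat` and `power` are illustrative):

```python
from itertools import product

L = set("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz")
D = set("0123456789")

def concat(L1, L2):
    """Concatenation L1 L2 = { st | s in L1, t in L2 }."""
    return {s + t for s, t in product(L1, L2)}

def power(lang, n):
    """L^n: n-fold concatenation of lang with itself."""
    result = {""}
    for _ in range(n):
        result = concat(result, lang)
    return result

print(len(L | D))          # 1. L U D: 62 strings of length one
print(len(concat(L, D)))   # 2. LD: 520 letter-digit strings
print(len(power(L, 4)))    # 3. L4: 52^4 four-letter strings
```

Note that LD contains "a1" but not "1a": concatenation is not commutative, even though union is.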
Regular expression
A regular expression is a sequence of symbols and characters expressing a string or pattern to be
searched for. Regular expressions are a mathematical notation that describes the set of strings of a
specific language.
The regular expression for identifiers is letter_ ( letter_ | digit )*. The vertical bar means
union, the parentheses are used to group subexpressions, and the star means "zero or more
occurrences of".
Each regular expression r denotes a language L(r), which is also defined recursively from the
languages denoted by r's subexpressions.
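Assuming the usual ASCII classes for letter_ and digit, the identifier expression can be tried directly with Python's re module (`is_identifier` is an illustrative helper, not part of the text):

```python
import re

# letter_ ( letter_ | digit )* : a letter or underscore, followed by any
# number of letters, underscores, or digits.
ident = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")

def is_identifier(s):
    return ident.fullmatch(s) is not None

print(is_identifier("sum_1"), is_identifier("2pi"))
```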
The following rules define the regular expressions over some alphabet Σ.
Basis rules:
1. ε is a regular expression, and L(ε) is { ε }.
2. If a is a symbol in Σ , then a is a regular expression, and L(a) = {a}, that is,
the language with one string of length one.
Induction rules: Suppose r and s are regular expressions denoting languages L(r) and L(s),
respectively.
1. (r) | (s) is a regular expression denoting the language L(r) U L(s).
2. (r) (s) is a regular expression denoting the language L(r) L(s) .
3. (r) * is a regular expression denoting (L (r)) * .
4. (r) is a regular expression denoting L(r); i.e., additional pairs of parentheses
around an expression do not change the language it denotes.
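The basis and induction rules can be turned into a toy evaluator that computes L(r) as a Python set. This is a sketch under one stated assumption: since L(r*) is infinite, star is truncated to strings up to a length bound. The nested-tuple encoding and the `lang` helper are illustrative.

```python
# Regular expressions as nested tuples:
#   ()              for ε
#   "a"             for a symbol a in Σ
#   ("|", r, s)     for r | s
#   (".", r, s)     for r s (concatenation)
#   ("*", r)        for r*
def lang(r, limit=3):
    if r == ():                       # basis 1: L(ε) = { ε }
        return {""}
    if isinstance(r, str):            # basis 2: L(a) = { a }
        return {r}
    op = r[0]
    if op == "|":                     # induction 1: L(r) U L(s)
        return lang(r[1], limit) | lang(r[2], limit)
    if op == ".":                     # induction 2: L(r) L(s)
        return {s + t for s in lang(r[1], limit) for t in lang(r[2], limit)}
    if op == "*":                     # induction 3: (L(r))*, truncated
        result, base = {""}, lang(r[1], limit)
        for _ in range(limit):
            result |= {s + t for s in result for t in base}
        return {s for s in result if len(s) <= limit}
    raise ValueError(op)

print(lang(("*", "a"), 2))
```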
Algebraic laws
Algebraic laws that hold for arbitrary regular expressions r, s, and t:
LAW                               DESCRIPTION
r|s = s|r                         | is commutative
r|(s|t) = (r|s)|t                 | is associative
r(st) = (rs)t                     Concatenation is associative
r(s|t) = rs|rt; (s|t)r = sr|tr    Concatenation distributes over |
εr = rε = r                       ε is the identity for concatenation
r* = (r|ε)*                       ε is guaranteed in a closure
r** = r*                          * is idempotent
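Laws such as r* = (r|ε)* can be spot-checked by comparing the strings each side matches over a small alphabet. This only verifies equality up to a length bound, it does not prove the law; the `language` helper is illustrative.

```python
import re
from itertools import product

def language(pattern, alphabet="ab", max_len=4):
    """All strings over alphabet, up to max_len, fully matching pattern."""
    strings = [""] + ["".join(p) for n in range(1, max_len + 1)
                      for p in product(alphabet, repeat=n)]
    return {s for s in strings if re.fullmatch(pattern, s)}

# r* = (r|ε)* : ε is guaranteed in a closure (empty alternative plays ε)
assert language("(ab)*") == language("((ab)|)*")
# r(s|t) = rs|rt : concatenation distributes over |
assert language("a(b|a)") == language("ab|aa")
print("laws hold on all strings up to length 4")
```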
Figure 2.3 : firstpos and lastpos for nodes in the syntax tree for (a|b)*abb#
We must also apply rule 2 to the star-node. That rule tells us that positions 1 and 2 are in both
followpos(1) and followpos(2), since both firstpos and lastpos for this node are {1, 2}. The complete
followpos sets are summarized in the table below.
NODE n    followpos(n)
1         {1, 2, 3}
2         {1, 2, 3}
3         {4}
4         {5}
5         {6}
6         {}
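The followpos computation for (a|b)*abb# can be replayed in code. In this sketch the syntax tree is written out by hand to match Figure 2.3, with leaves numbered 1 to 6 for a, b, a, b, b, #; the tuple encoding and function names are illustrative.

```python
LEAF, ALT, CAT, STAR = "leaf", "|", ".", "*"

# (a|b)*abb# as nested cat-nodes: ((((a|b)* . a) . b) . b) . #
tree = (CAT, (CAT, (CAT, (CAT,
        (STAR, (ALT, (LEAF, 1), (LEAF, 2))),
        (LEAF, 3)), (LEAF, 4)), (LEAF, 5)), (LEAF, 6))

def nullable(n):
    if n[0] == LEAF: return False
    if n[0] == STAR: return True
    if n[0] == ALT:  return nullable(n[1]) or nullable(n[2])
    return nullable(n[1]) and nullable(n[2])            # CAT

def firstpos(n):
    if n[0] == LEAF: return {n[1]}
    if n[0] == STAR: return firstpos(n[1])
    if n[0] == ALT:  return firstpos(n[1]) | firstpos(n[2])
    f = firstpos(n[1])                                  # CAT
    return f | firstpos(n[2]) if nullable(n[1]) else f

def lastpos(n):
    if n[0] == LEAF: return {n[1]}
    if n[0] == STAR: return lastpos(n[1])
    if n[0] == ALT:  return lastpos(n[1]) | lastpos(n[2])
    l = lastpos(n[2])                                   # CAT
    return l | lastpos(n[1]) if nullable(n[2]) else l

followpos = {i: set() for i in range(1, 7)}

def fill(n):
    if n[0] == LEAF:
        return
    if n[0] == CAT:       # rule 1: lastpos(c1) feeds into firstpos(c2)
        for i in lastpos(n[1]):
            followpos[i] |= firstpos(n[2])
        fill(n[1]); fill(n[2])
    elif n[0] == STAR:    # rule 2: lastpos(n) feeds into firstpos(n)
        for i in lastpos(n):
            followpos[i] |= firstpos(n)
        fill(n[1])
    else:                 # ALT
        fill(n[1]); fill(n[2])

fill(tree)
print(followpos)
```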
(c) Let s be the representative of some group G of Πfinal, and let the transition of D from
s on input a be to state t. Let r be the representative of t's group H. Then in D', there
is a transition from s to r on input a.
Example: Let us reconsider the DFA of Figure 2.8 for minimization.
STATE a b
A B C
B B D
C B C
D B E
(E) B C
The initial partition consists of the two groups {A, B, C, D}{E}, which are respectively the
nonaccepting states and the accepting states.
To construct Πnew, the procedure considers both groups and inputs a and b. The group {E}
cannot be split, because it has only one state, so {E} will remain intact in Πnew.
The other group {A, B, C, D} can be split, so we must consider the effect of each input symbol. On
input a, each of these states goes to state B, so there is no way to distinguish these states using
strings that begin with a. On input b, states A, B, and C go to members of group {A, B, C, D},
while state D goes to E, a member of another group.
Thus, in Πnew, group {A, B, C, D} is split into {A, B, C}{D}, and Πnew for this round is {A, B,
C}{D}{E}.
In the next round, we can split {A, B, C} into {A, C}{B}, since A and C each go to a
member of {A, B, C} on input b, while B goes to a member of another group, {D}. Thus, after the
second round, Πnew = {A, C}{B}{D}{E}.
For the third round, we cannot split the one remaining group with more than one state, since A and
C each go to the same state (and therefore to the same group) on each input. We conclude that
Πfinal = {A, C}{B}{D}{E}.
Now, we shall construct the minimum-state DFA. It has four states, corresponding to the four
groups of Πfinal, and let us pick A, B, D, and E as the representatives of these groups. The
initial state is A, and the only accepting state is E.
Table: Transition table of minimum-state DFA

STATE    a    b
A        B    A
B        B    D
D        B    E
(E)      B    A
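The refinement rounds above can be sketched as a small partition-refinement loop. This is the basic quadratic algorithm, not Hopcroft's faster one, and the names `refine` and `group_of` are illustrative.

```python
# The DFA of the example: transitions on inputs a and b.
DFA = {"A": {"a": "B", "b": "C"},
       "B": {"a": "B", "b": "D"},
       "C": {"a": "B", "b": "C"},
       "D": {"a": "B", "b": "E"},
       "E": {"a": "B", "b": "C"}}
accepting = {"E"}

def refine(dfa, accepting, alphabet=("a", "b")):
    # Initial partition: nonaccepting states vs accepting states.
    partition = [set(dfa) - accepting, set(accepting)]
    while True:
        def group_of(s):
            return next(i for i, g in enumerate(partition) if s in g)
        new = []
        for g in partition:
            # Split g by the tuple of target groups on each input symbol.
            buckets = {}
            for s in g:
                key = tuple(group_of(dfa[s][c]) for c in alphabet)
                buckets.setdefault(key, set()).add(s)
            new.extend(buckets.values())
        if len(new) == len(partition):   # no group was split: done
            return new
        partition = new

print(refine(DFA, accepting))
```

Running this reproduces Πfinal = {A, C}{B}{D}{E} after the same three rounds described above.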
Use of Lex
The Lex compiler transforms the input patterns into a transition diagram and
generates code.
An input file “lex.l” is written in the Lex language and describes the lexical analyzer
to be generated. The Lex compiler transforms “lex.l” into a C program, in a file that is
always named “lex.yy.c”.
The file “lex.yy.c” is compiled by the C compiler into a file “a.out”. The C compiler's
output is a working lexical analyzer that can take a stream of input characters and
produce a stream of tokens.
The attribute value, whether it be another numeric code, a pointer to the symbol
table, or nothing, is placed in a global variable yylval, which is shared between the
lexical analyzer and the parser.
The declarations section includes declarations of variables, manifest constants (identifiers declared
to stand for a constant, e.g., the name of a token), and regular definitions.
The translation rules of a Lex program have the form Pattern { Action }:
Pattern P1   { Action A1 }
Pattern P2   { Action A2 }
…
Pattern Pn   { Action An }
Each pattern is a regular expression. The actions are fragments of code typically written in C
language.
The third section holds whatever additional functions are used in the actions. Alternatively, these
functions can be compiled separately and loaded with the
lexical analyzer.
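Putting the three sections together, a minimal Lex specification might look like the sketch below. The `%%` lines separate the declarations, translation rules, and auxiliary functions; the token codes `ID` and `NUMBER` and the helpers `install_id()` and `install_num()` are illustrative placeholders for real token definitions and symbol-table routines.

```lex
%{
/* Declarations section: C code copied verbatim, plus manifest constants. */
#define ID      1
#define NUMBER  2
int yylval;                     /* attribute value shared with the parser */
%}

delim   [ \t\n]
ws      {delim}+
letter  [A-Za-z]
digit   [0-9]
id      {letter}({letter}|{digit})*
number  {digit}+

%%
{ws}      { /* no action and no return: whitespace is stripped */ }
{id}      { yylval = install_id();  return ID; }
{number}  { yylval = install_num(); return NUMBER; }
%%
/* Auxiliary functions used in the actions would go here, e.g.
   install_id() and install_num() entering lexemes into the symbol table. */
```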
The lexical analyzer begins reading its remaining input, one character at a time, until it finds the
longest prefix of the input that matches one of the patterns Pi. It then executes the associated action
Ai. Typically, Ai will return to the parser, but if it does not (e.g., because Pi describes whitespace or
comments), then the lexical analyzer proceeds to find additional lexemes, until one of the
corresponding actions causes a return to the parser. The lexical analyzer returns a single value, the
token name, to the parser, but uses the shared, integer variable yylval to pass additional information
about the lexeme found.
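The longest-prefix rule and the "no return for whitespace" behavior can be sketched as follows. The `scan` helper and the three sample patterns are illustrative; as in Lex, ties between equally long matches go to the pattern listed first.

```python
import re

# Maximal munch: at each position, try every pattern, keep the longest
# match, and break ties by listing order.
PATTERNS = [("if", r"if"), ("id", r"[a-z]+"), ("ws", r"\s+")]

def scan(src):
    tokens, pos = [], 0
    while pos < len(src):
        best = None
        for name, pat in PATTERNS:
            m = re.match(pat, src[pos:])
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())   # strictly longer wins; ties keep first
        if best is None:
            raise SyntaxError(f"no pattern matches at {src[pos:]!r}")
        name, lexeme = best
        if name != "ws":                   # whitespace: no return to the parser
            tokens.append((name, lexeme))
        pos += len(lexeme)
    return tokens

print(scan("if ifx"))
```

On input `ifx`, the `id` pattern wins because its match is longer than the keyword's, which is exactly why the longest-prefix rule matters.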
Figure 2.10: A Lex program is turned into a transition table and actions, which are used by a finite-automaton simulator
Figure 2.14: Sequence of sets of states entered when processing input aaba
Figure 2.15: Transition graph for DFA handling the patterns a, abb, and a*b+