
CoSc3112 Compiler Design Chapter II

CHAPTER II: LEXICAL ANALYSIS

2.1 NEED AND ROLE OF LEXICAL ANALYZER
Lexical analysis is the first phase of a compiler. It reads the input characters of the source program from left to right, one character at a time.
It generates a sequence of tokens, one for each lexeme. Each token is a logically cohesive unit such as an identifier, keyword, operator, or punctuation mark.
The lexical analyzer enters lexemes into the symbol table and also reads information back from it.
These interactions are suggested in Figure 2.1.

Figure 2.1: Interactions between the lexical analyzer and the parser
Since the lexical analyzer is the part of the compiler that reads the source text, it may perform
certain other tasks besides identification of lexemes. One such task is stripping out comments and
whitespace (blank, newline, tab). Another task is correlating error messages generated by the
compiler with the source program.
Needs / Roles / Functions of lexical analyzer
 It produces a stream of tokens.
 It eliminates comments and whitespace.
 It keeps track of line numbers.
 It reports any errors encountered while generating tokens.
 It stores information about identifiers, keywords, constants, and so on in the symbol table.
Lexical analyzers are divided into two processes:
a) Scanning consists of the simple processes that do not require tokenization of the input, such
as deletion of comments and compaction of consecutive whitespace characters into one.
b) Lexical analysis is the more complex portion, where the scanner produces the sequence of
tokens as output.
Lexical Analysis versus Parsing / Issues in Lexical analysis
1. Simplicity of design: This is the most important consideration. The separation of lexical and
syntactic analysis often allows us to simplify at least one of these tasks; for example, whitespace
and comments are removed by the lexical analyzer, so the parser need not handle them.
2. Compiler efficiency is improved. A separate lexical analyzer allows us to apply
specialized techniques that serve only the lexical task, not the job of parsing. In addition,
specialized buffering techniques for reading input characters can speed up the compiler
significantly.
3. Compiler portability is enhanced. Input-device-specific peculiarities can be restricted to
the lexical analyzer.
Tokens, Patterns, and Lexemes
A token is a pair consisting of a token name and an optional attribute value. The token name is an
abstract symbol representing a kind of single lexical unit, e.g., a particular keyword, or a
sequence of input characters denoting an identifier. Operators, special symbols and constants are
also typical tokens.
A pattern is a description of the form that the lexemes of a token may take; it is a set of rules
that describe the token. A lexeme is a sequence of characters in the source program that matches the
pattern for a token.
Table 2.1: Tokens and Lexemes

TOKEN       INFORMAL DESCRIPTION (PATTERN)            SAMPLE LEXEMES
if          characters i, f                           if
else        characters e, l, s, e                     else
comparison  < or > or <= or >= or == or !=            <=, !=
id          letter, followed by letters and digits    pi, score, D2, sum, id_1, AVG
number      any numeric constant                      35, 3.14159, 0, 6.02e23
literal     anything surrounded by " "                "Core", "Design", "Appasami"

In many programming languages, the following classes cover most or all of the tokens:
1. One token for each keyword. The pattern for a keyword is the same as the keyword itself.
2. Tokens for the operators, either individually or in classes such as the token comparison
mentioned in table 2.1.
3. One token representing all identifiers.
4. One or more tokens representing constants, such as numbers and literal strings.
5. Tokens for each punctuation symbol, such as left and right parentheses, comma, and
semicolon.
Attributes for Tokens
When more than one lexeme can match a pattern, the lexical analyzer must provide the subsequent
compiler phases additional information about the particular lexeme that matched.
The lexical analyzer returns to the parser not only a token name, but an attribute value that
describes the lexeme represented by the token.
The token name influences parsing decisions, while the attribute value influences translation of
tokens after the parse.
Information about an identifier - e.g., its lexeme, its type, and the location at which it is first found
(in case an error message must be issued) - is kept in the symbol table.
Thus, the appropriate attribute value for an identifier is a pointer to the symbol-table entry for that
identifier.
Example: The token names and associated attribute values for the Fortran statement E=M
* C ** 2 are written below as a sequence of pairs.
<id, pointer to symbol-table entry for E>
< assign_op >
<id, pointer to symbol-table entry for M>
<mult_op>
<id, pointer to symbol-table entry for C>
<exp_op>
<number, integer value 2 >
Note that in certain pairs, especially operators, punctuation, and keywords, there is no need for an
attribute value. In this example, the token number has been given an integer-valued attribute.
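The <token, attribute> pairs above can be produced mechanically. The following Python sketch (my own illustration, not the book's implementation) tokenizes the statement E = M * C ** 2; symbol-table "pointers" are modeled as integer indices, and the token names follow the example above.

```python
import re

symtab = {}                    # identifier lexeme -> symbol-table index

def tokenize(src):
    # Order matters: '**' must be tried before '*'
    spec = [('id', r'[A-Za-z_][A-Za-z0-9_]*'), ('number', r'\d+'),
            ('exp_op', r'\*\*'), ('mult_op', r'\*'), ('assign_op', r'='),
            ('ws', r'\s+')]
    pos, out = 0, []
    while pos < len(src):
        for name, pat in spec:
            m = re.match(pat, src[pos:])
            if m:
                lexeme = m.group()
                if name == 'id':
                    # attribute: pointer (index) into the symbol table
                    out.append(('id', symtab.setdefault(lexeme, len(symtab))))
                elif name == 'number':
                    out.append(('number', int(lexeme)))
                elif name != 'ws':     # operators carry no attribute
                    out.append((name, None))
                pos += len(lexeme)
                break
        else:
            raise SyntaxError(f'lexical error at position {pos}')
    return out

print(tokenize('E = M * C ** 2'))
```

The output mirrors the pair sequence shown above: identifiers carry symbol-table indices, number carries its integer value, and operator tokens carry no attribute.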
2.2 LEXICAL ERRORS


It is hard for a lexical analyzer to tell that there is a source-code error without the aid of other
components.
Consider a C program statement fi ( a == f(x)). The lexical analyzer cannot tell whether fi is a
misspelling of the keyword if or an undeclared function identifier. Since fi is a valid lexeme for the
token id, the lexical analyzer must return the token id to the parser.
However, suppose a situation arises in which the lexical analyzer is unable to proceed because none
of the patterns for tokens matches any prefix of the remaining input. The simplest recovery strategy is "panic mode" recovery.
We delete successive characters from the remaining input, until the lexical analyzer can find a
well-formed token at the beginning of what input is left.
Other possible error-recovery actions are:
1. Delete one character from the remaining input.
2. Insert a missing character into the remaining input.
3. Replace a character by another character.
4. Transpose two adjacent characters.
Transformations like these may be tried in an attempt to repair the input. The simplest such strategy
is to see whether a prefix of the remaining input can be transformed into a valid lexeme by a single
transformation.
In practice most lexical errors involve a single character. A more general correction strategy is to
find the smallest number of transformations needed to convert the source program into one that
consists only of valid lexemes.
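Both strategies can be sketched in a few lines. The following Python fragment (an illustration under an assumed token pattern, not the book's code) shows panic-mode recovery and a generator of all single-character repairs:

```python
import re

# Assumed token pattern: identifiers, numbers, and a few operators
TOKEN = re.compile(r'[A-Za-z_]\w*|\d+|[=+*();]')
ALPHABET = 'abcdefghijklmnopqrstuvwxyz'

def panic_mode(s):
    """Delete successive characters until some token pattern matches."""
    while s and not TOKEN.match(s):
        s = s[1:]
    return s

def single_edits(s):
    """All strings reachable from s by one of the four repair actions."""
    yield s[1:]                          # 1. delete one character
    for c in ALPHABET:
        yield c + s                      # 2. insert a missing character
        yield c + s[1:]                  # 3. replace a character
    if len(s) >= 2:
        yield s[1] + s[0] + s[2:]        # 4. transpose adjacent characters

print(panic_mode('#@x = 1'))             # 'x = 1'
```

Note that transposing the first two characters of the erroneous fi yields the keyword if, which is why a single-transformation repair often suffices in practice.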

2.3 EXPRESSING TOKENS BY REGULAR EXPRESSIONS

Specification of Tokens
Regular expressions are an important notation for specifying lexeme patterns. Although they cannot
express all possible patterns, they are very effective in specifying the kinds of patterns that we
actually need for tokens.
Strings and Languages
An alphabet is any finite set of symbols. Examples of symbols are letters, digits, and punctuation.
The set {0, 1} is the binary alphabet. ASCII is an important example of an alphabet.
A string (sentence or word) over an alphabet is a finite sequence of symbols drawn from that
alphabet. The length of a string s, usually written |s|, is the number of occurrences of symbols in s.
For example, banana is a string of length six. The empty string, denoted ε, is the string of length
zero.
A language is any countable set of strings over some fixed alphabet. Abstract languages like Φ,
the empty set, or { ε }, the set containing only the empty string, are languages under this
definition.
Parts of Strings:
1. A prefix of string s is any string obtained by removing zero or more symbols from the
end of s. For example, ban, banana, and ε are prefixes of banana.
2. A suffix of string s is any string obtained by removing zero or more symbols from the
beginning of s. For example, nana, banana, and ε are suffixes of banana.
3. A substring of s is obtained by deleting any prefix and any suffix from s. For instance,
banana, nan, and ε are substrings of banana.
4. The proper prefixes, suffixes, and substrings of a string s are those prefixes, suffixes,
and substrings, respectively, of s that are not ε and are not equal to s itself.
5. A subsequence of s is any string formed by deleting zero or more not necessarily
consecutive positions of s. For example, baan is a subsequence of banana.
6. If x and y are strings, then the concatenation of x and y, denoted xy, is the string formed
by appending y to x.
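The six definitions above can be checked directly on the running example banana. A small Python sketch (my own illustration):

```python
from itertools import combinations

s = 'banana'
prefixes   = {s[:i] for i in range(len(s) + 1)}            # remove from the end
suffixes   = {s[i:] for i in range(len(s) + 1)}            # remove from the start
substrings = {s[i:j] for i in range(len(s) + 1)            # remove a prefix
              for j in range(i, len(s) + 1)}               # ... and a suffix
subsequences = {''.join(t) for n in range(len(s) + 1)      # drop any positions
                for t in combinations(s, n)}

print('ban' in prefixes, 'nana' in suffixes,
      'nan' in substrings, 'baan' in subsequences)         # True True True True
```

Note that the empty string ε (here '') is a prefix, suffix, and substring of every string, and concatenation is simply 'ba' + 'nana' == 'banana'.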
Operations on Languages
In lexical analysis, the most important operations on languages are union, concatenation, and
closure, which are defined in table 2.2.
Table 2.2: Definitions of operations on languages

Example: Let L be the set of letters {A, B, ..., Z, a, b, ..., z} and let D be the set of digits
{0, 1, ..., 9}. Other languages can be constructed from L and D:
1. L U D is the set of letters and digits - strictly speaking, the language with 62 strings of
length one, each of which is either one letter or one digit.
2. LD is the set of 520 strings of length two, each consisting of one letter followed by one
digit.
3. L4 is the set of all four-letter strings.
4. L* is the set of all strings of letters, including ε, the empty string.
5. L(L U D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.
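The operations behind these constructions (union, concatenation, closure) are easy to model on tiny stand-ins for L and D. A Python sketch (my own illustration; the closure is truncated to a maximum length, since L* is infinite):

```python
L = {'a', 'b'}          # letters, reduced to two for illustration
D = {'0', '1'}          # digits, reduced to two

union  = L | D                                   # L U D
concat = {x + y for x in L for y in D}           # LD

def star(lang, max_len=3):
    """Kleene closure of lang, truncated to strings of length <= max_len."""
    result, frontier = {''}, {''}
    while True:
        frontier = {x + y for x in frontier for y in lang
                    if len(x + y) <= max_len}
        if frontier <= result:                   # no new strings: done
            return result
        result |= frontier

print(len(union), len(concat), sorted(star(L, 1)))   # 4 4 ['', 'a', 'b']
```

With the full 52-letter L and 10-digit D, the same expressions give |L U D| = 62 and |LD| = 520, matching items 1 and 2 above.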
Regular expression
A regular expression is a sequence of symbols and characters expressing a string or pattern to be
searched for. Regular expressions are a mathematical notation that describes the set of strings of a
specific language.
The regular expression for identifiers can be written as letter_ ( letter_ | digit )*. The vertical bar
means union, the parentheses are used to group subexpressions, and the star means "zero or more
occurrences of".
Each regular expression r denotes a language L(r), which is also defined recursively from the
languages denoted by r's subexpressions.
The following rules define the regular expressions over some alphabet Σ.
Basis rules:
1. ε is a regular expression, and L(ε) is { ε }.
2. If a is a symbol in Σ , then a is a regular expression, and L(a) = {a}, that is,
the language with one string of length one.
Induction rules: Suppose r and s are regular expressions denoting languages L(r) and L(s),
respectively.
1. (r) | (s) is a regular expression denoting the language L(r) U L(s).
2. (r) (s) is a regular expression denoting the language L(r) L(s) .
3. (r) * is a regular expression denoting (L (r)) * .
4. (r) is a regular expression denoting L(r). That is, additional pairs of parentheses
around an expression do not change the language it denotes.

Example: Let Σ = {a, b}.

REGULAR EXPRESSION   LANGUAGE                              MEANING
a|b                  {a, b}                                a single a or b
(a|b)(a|b)           {aa, ab, ba, bb}                      all strings of length two over Σ
a*                   {ε, a, aa, aaa, ...}                  all strings of zero or more a's
(a|b)*               {ε, a, b, aa, ab, ba, bb, aaa, ...}   all strings of zero or more instances of a or b
a|a*b                {a, b, ab, aab, aaab, ...}            the string a, and all strings of zero or more a's ending in b
A language that can be defined by a regular expression is called a regular set. If two regular
expressions r and s denote the same regular set, we say they are equivalent and write r = s. For
instance, (a|b) = (b|a), (a|b)* = (a*b*)*, (b|a)* = (a|b)*, and (a|b)(b|a) = aa|ab|ba|bb.
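Such equivalences can be checked exhaustively on all short strings, using Python's re module as an oracle (a brute-force sketch of my own, not a proof; it only tests strings up to a bounded length):

```python
import re
from itertools import product

def agree(r, s, max_len=4):
    """True iff r and s match exactly the same strings over {a,b} up to max_len."""
    for n in range(max_len + 1):
        for t in product('ab', repeat=n):
            w = ''.join(t)
            if bool(re.fullmatch(r, w)) != bool(re.fullmatch(s, w)):
                return False
    return True

print(agree(r'(a|b)*', r'(a*b*)*'),      # True
      agree(r'(b|a)*', r'(a|b)*'),       # True
      agree(r'a|a*b', r'a*b|a'))         # True
```

A non-equivalence such as a* versus (a|b)* is caught immediately, since the string b distinguishes them.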

Algebraic laws
Algebraic laws that hold for arbitrary regular expressions r, s, and t:
LAW                              DESCRIPTION
r|s = s|r                        | is commutative
r|(s|t) = (r|s)|t                | is associative
r(st) = (rs)t                    concatenation is associative
r(s|t) = rs|rt; (s|t)r = sr|tr   concatenation distributes over |
εr = rε = r                      ε is the identity for concatenation
r* = (r|ε)*                      ε is guaranteed in a closure
r** = r*                         * is idempotent

Extensions of Regular Expressions

A few notational extensions, first incorporated into Unix utilities such as Lex, are particularly
useful in the specification of lexical analyzers.
1. One or more instances: The unary postfix operator + represents the positive closure of
a regular expression and its language. If r is a regular expression, then (r)+ denotes the
language (L(r))+. Two useful algebraic laws are r* = r+|ε and r+ = rr* = r*r.
2. Zero or one instance: The unary postfix operator ? means "zero or one occurrence."
That is, r? is equivalent to r|ε, and L(r?) = L(r) U {ε}.
3. Character classes: A regular expression a1|a2|…|an, where the ai's are each symbols of
the alphabet, can be replaced by the shorthand [a1a2…an]. Thus, [abc] is shorthand
for a|b|c, and [a-z] is shorthand for a|b|…|z.
Example: Regular definition for C identifiers
letter_ → [A-Za-z_]
digit → [0-9]
id → letter_ ( letter_ | digit )*
Example: Regular definition for unsigned numbers
digit → [0-9]
digits → digit+
number → digits ( . digits )? ( E [+-]? digits )?
Note: The operators *, +, and ? have the same precedence and associativity.
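These two regular definitions translate almost directly into Python re patterns (re syntax stands in here for the Lex notation above; I allow both E and e in the exponent, a small liberty):

```python
import re

letter_ = r'[A-Za-z_]'
digit   = r'[0-9]'

# id     -> letter_ ( letter_ | digit )*
id_re = re.compile(rf'{letter_}({letter_}|{digit})*')

# number -> digits ( . digits )? ( E [+-]? digits )?
number_re = re.compile(rf'{digit}+(\.{digit}+)?([Ee][+-]?{digit}+)?')

print(bool(id_re.fullmatch('sum_1')),
      bool(number_re.fullmatch('6.02e23')),
      bool(number_re.fullmatch('3.')))        # True True False
```

Note that 3. is rejected: the optional fraction ( . digits )? requires at least one digit after the decimal point.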
2.4 CONVERTING REGULAR EXPRESSION TO DFA

To construct a DFA directly from a regular expression, we construct its syntax tree and then
compute four functions: nullable, firstpos, lastpos, and followpos, defined as follows. Each
definition refers to the syntax tree for a particular augmented regular expression (r)#.
1. nullable(n) is true for a syntax-tree node n if and only if the subexpression represented
by n has ε in its language. That is, the subexpression can be "made null" or the empty
string, even though there may be other strings it can represent as well.
2. firstpos(n) is the set of positions in the subtree rooted at n that correspond to the first
symbol of at least one string in the language of the subexpression rooted at n.
3. lastpos(n) is the set of positions in the subtree rooted at n that correspond to the last
symbol of at least one string in the language of the subexpression rooted at n.
4. followpos(p), for a position p, is the set of positions q in the entire syntax tree such that
there is some string x = a1a2 …an in L((r)#) such that for some i, there is a way to
explain the membership of x in L((r)#) by matching ai to position p of the syntax tree
and ai+1 to position q.
We can compute nullable, firstpos, and lastpos by a straightforward recursion on the height of the
tree. The basis and inductive rules for nullable and firstpos are summarized in table.
The rules for lastpos are essentially the same as for firstpos, but the roles of children c1 and c2 must
be swapped in the rule for a cat-node.
There are only two rules for computing followpos:
1. If n is a cat-node with left child c1 and right child c2, then for every position i in
lastpos(c1), all positions in firstpos(c2) are in followpos(i).
2. If n is a star-node, and i is a position in lastpos(n), then all positions in firstpos(n) are
in followpos(i).
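The four functions can be computed by a direct recursion on the syntax tree. The following Python sketch (helper names and the tuple encoding are my own) builds the tree for (a|b)*abb# with positions 1-6 and computes all four functions:

```python
def leaf(pos, sym): return ('leaf', pos, sym)
def cat(l, r):      return ('cat', l, r)
def alt(l, r):      return ('or', l, r)
def star(c):        return ('star', c)

# Syntax tree for (a|b)*abb#, leaf positions 1..6 as in Figure 2.2
tree = cat(cat(cat(cat(star(alt(leaf(1, 'a'), leaf(2, 'b'))),
                       leaf(3, 'a')), leaf(4, 'b')), leaf(5, 'b')),
           leaf(6, '#'))

def nullable(n):
    k = n[0]
    if k == 'leaf': return False
    if k == 'star': return True
    if k == 'or':   return nullable(n[1]) or nullable(n[2])
    return nullable(n[1]) and nullable(n[2])              # cat-node

def firstpos(n):
    k = n[0]
    if k == 'leaf': return {n[1]}
    if k == 'star': return firstpos(n[1])
    if k == 'or':   return firstpos(n[1]) | firstpos(n[2])
    # cat-node: add firstpos of the right child if the left child is nullable
    return firstpos(n[1]) | (firstpos(n[2]) if nullable(n[1]) else set())

def lastpos(n):
    k = n[0]
    if k == 'leaf': return {n[1]}
    if k == 'star': return lastpos(n[1])
    if k == 'or':   return lastpos(n[1]) | lastpos(n[2])
    # cat-node: roles of the children are swapped relative to firstpos
    return lastpos(n[2]) | (lastpos(n[1]) if nullable(n[2]) else set())

def compute_followpos(n, table):
    k = n[0]
    if k == 'cat':                     # rule 1
        for i in lastpos(n[1]):
            table[i] |= firstpos(n[2])
    elif k == 'star':                  # rule 2
        for i in lastpos(n):
            table[i] |= firstpos(n)
    for child in n[1:]:
        if isinstance(child, tuple):   # recurse into subtrees only
            compute_followpos(child, table)
    return table

fp = compute_followpos(tree, {i: set() for i in range(1, 7)})
print(sorted(firstpos(tree)), sorted(fp[1]), sorted(fp[4]))  # [1, 2, 3] [1, 2, 3] [5]
```

The computed sets reproduce the worked example below: firstpos of the root is {1, 2, 3}, and followpos(3) = {4}, followpos(4) = {5}, followpos(5) = {6}.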

Converting a Regular Expression Directly to a DFA

Algorithm: Construction of a DFA from a regular expression r.
INPUT: A regular expression r.
OUTPUT: A DFA D that recognizes L(r).
METHOD:
1. Construct a syntax tree T from the augmented regular expression (r)#.
2. Compute nullable, firstpos, lastpos, and followpos for T.
3. Construct Dstates, the set of states of DFA D, and Dtran, the transition function for D:

initialize Dstates to contain only the unmarked state firstpos(n0),
where n0 is the root of syntax tree T for (r)#;
while ( there is an unmarked state S in Dstates )
{
    mark S;
    for ( each input symbol a )
    {
        let U be the union of followpos(p) for all p in S that correspond to a;
        if ( U is not in Dstates )
            add U as an unmarked state to Dstates;
        Dtran[S, a] = U;
    }
}

By the above procedure, the states of D are sets of positions in T. Initially, each state is "unmarked,"
and a state becomes "marked" just before we consider its out-transitions. The start state of D is
firstpos(n0), where node n0 is the root of T. The accepting states are those containing the position
for the endmarker symbol #.

Example: Construct a DFA for the regular expression r = (a|b)*abb

Figure 2.2: Syntax tree for (a|b)*abb#

Figure 2.3 : firstpos and lastpos for nodes in the syntax tree for (a|b)*abb#
We must also apply rule 2 to the star-node. That rule tells us positions 1 and 2 are in both
followpos(1) and followpos(2), since both firstpos and lastpos for this node are {1, 2}. The complete
followpos sets are summarized in the table below.
NODE n   followpos(n)
1        {1, 2, 3}
2        {1, 2, 3}
3        {4}
4        {5}
5        {6}
6        {}

Figure 2.4: Directed graph for the function followpos


nullable is true only for the star-node, and we exhibited firstpos and lastpos in Figure 2.3. The value
of firstpos for the root of the tree is {1,2,3}, so this set is the start state of D. Call this set of states A.
We must compute Dtran[A, a] and Dtran[A, b]. Among the positions of A, 1 and 3 correspond to a,
while 2 corresponds to b. Thus, Dtran[A, a] = followpos(1) U followpos(3) = {1, 2, 3, 4}, and
Dtran[A, b] = followpos(2) = {1, 2, 3}.
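Continuing the construction mechanically produces the whole DFA. The Python sketch below (my own illustration) hardcodes the followpos sets and position symbols from the worked example and runs the Dstates/Dtran loop of the algorithm:

```python
sym = {1: 'a', 2: 'b', 3: 'a', 4: 'b', 5: 'b', 6: '#'}
followpos = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: {5}, 5: {6}, 6: set()}
start = frozenset({1, 2, 3})             # firstpos of the root

dstates, dtran, work = {start}, {}, [start]
while work:
    S = work.pop()                       # "mark" state S
    for a in 'ab':
        U = frozenset(q for p in S if sym[p] == a for q in followpos[p])
        if U not in dstates:
            dstates.add(U)
            work.append(U)               # new unmarked state
        dtran[S, a] = U

def accepts(w):
    s = start
    for c in w:
        s = dtran[s, c]
    return 6 in s                        # contains position of '#': accepting

print(len(dstates), accepts('abb'), accepts('ab'))   # 4 True False
```

The loop discovers exactly the four states A = {1,2,3}, B = {1,2,3,4}, {1,2,3,5}, and {1,2,3,6} of Figure 2.5, the last being accepting because it contains position 6 of the endmarker #.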

Figure 2.5: DFA constructed for (a|b)*abb#


The latter is state A, and so does not have to be added to Dstates, but the former, B = {1,2,3,4}, is
new, so we add it to Dstates and proceed to compute its transitions. The complete DFA is shown in
Figure 2.5.
Example: Construct an ε-NFA for (a|b)*abb and convert it to a DFA by subset construction.

Figure 2.6: ε-NFA for (a|b)*abb


Figure 2.7: NFA for (a|b)*abb

Figure 2.8: Result of applying the subset construction to Figure 2.6

2.5 MINIMIZATION OF DFA


There can be many DFA's that recognize the same language. For instance, the DFAs of Figure 2.5
and 2.8 both recognize the same language L((a|b)*abb).
We would generally prefer a DFA with as few states as possible, since each state requires entries in
the table that describes the lexical analyzer.
Algorithm: Minimizing the number of states of a DFA.
INPUT: A DFA D with set of states S, input alphabet Σ, initial state s0, and set of accepting states F.
OUTPUT: A DFA D' accepting the same language as D and having as few states as possible.
METHOD:
1. Start with an initial partition Π with two groups, F and S - F, the accepting and
nonaccepting states of D.
2. Apply the following procedure to construct a new partition Πnew:

initially, let Πnew = Π;
for ( each group G of Π )
{
    partition G into subgroups such that two states s and t are in the same subgroup
    if and only if, for all input symbols a, states s and t have transitions on a to
    states in the same group of Π; /* at worst, a state will be in a subgroup by itself */
    replace G in Πnew by the set of all subgroups formed;
}

3. If Πnew = Π, let Πfinal = Π and continue with step (4). Otherwise, repeat step (2) with
Πnew in place of Π.
4. Choose one state in each group of Πfinal as the representative for that group. The
representatives will be the states of the minimum-state DFA D'.
5. The other components of D' are constructed as follows:
(a) The start state of D' is the representative of the group containing the start state of D.
(b) The accepting states of D' are the representatives of those groups that contain an
accepting state of D.
(c) Let s be the representative of some group G of Πfinal, and let the transition of D from
s on input a be to state t. Let r be the representative of t's group H. Then in D', there
is a transition from s to r on input a.
Example: Let us reconsider the DFA of Figure 2.8 for minimization.
STATE a b
A B C
B B D
C B C
D B E
(E) B C
The initial partition consists of the two groups {A, B, C, D} and {E}, which are respectively the
nonaccepting states and the accepting states.
To construct Πnew, the procedure considers both groups and inputs a and b. The group {E}
cannot be split, because it has only one state, so {E} will remain intact in Πnew.
The other group {A, B, C, D} can be split, so we must consider the effect of each input symbol. On
input a, each of these states goes to state B, so there is no way to distinguish these states using
strings that begin with a. On input b, states A, B, and C go to members of group {A, B, C, D},
while state D goes to E, a member of another group.
Thus, in Πnew, group {A, B, C, D} is split into {A, B, C} and {D}, and Πnew for this round is {A, B,
C}{D}{E}.
In the next round, we can split {A, B, C} into {A, C}{B}, since A and C each go to a
member of {A, B, C} on input b, while B goes to a member of another group, {D}. Thus, after the
second round, Πnew = {A, C}{B}{D}{E}.
For the third round, we cannot split the one remaining group with more than one state, since A and
C each go to the same state (and therefore to the same group) on each input. We conclude that
Πfinal = {A, C}{B}{D}{E}.
Now, we shall construct the minimum-state DFA. It has four states, corresponding to the four
groups of Πfinal, and let us pick A, B, D, and E as the representatives of these groups. The
initial state is A, and the only accepting state is E.
Table: Transition table of the minimum-state DFA
STATE   a   b
A       B   A
B       B   D
D       B   E
(E)     B   A
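The refinement rounds above can be run mechanically. The Python sketch below (my own illustration of the partition-refinement step) minimizes the five-state DFA of Figure 2.8 and recovers the groups {A, C}{B}{D}{E}:

```python
delta = {('A', 'a'): 'B', ('A', 'b'): 'C', ('B', 'a'): 'B', ('B', 'b'): 'D',
         ('C', 'a'): 'B', ('C', 'b'): 'C', ('D', 'a'): 'B', ('D', 'b'): 'E',
         ('E', 'a'): 'B', ('E', 'b'): 'C'}
states, accepting = set('ABCDE'), {'E'}

partition = [accepting, states - accepting]      # initial Π = {F, S - F}
while True:
    def group_of(s):
        return next(i for i, g in enumerate(partition) if s in g)
    new = []
    for g in partition:
        # two states stay together iff, for every input symbol, their
        # transitions lead into the same group of the current partition
        buckets = {}
        for s in g:
            key = tuple(group_of(delta[s, a]) for a in 'ab')
            buckets.setdefault(key, set()).add(s)
        new.extend(buckets.values())
    if len(new) == len(partition):               # Πnew = Π: fixed point
        break
    partition = new

print(sorted(sorted(g) for g in partition))      # [['A', 'C'], ['B'], ['D'], ['E']]
```

The loop splits {A, B, C, D} into {A, B, C}{D} in the first round and {A, C}{B} in the second, exactly as in the hand computation.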

2.6 LANGUAGE FOR SPECIFYING LEXICAL ANALYZERS - LEX

There is a wide range of tools for constructing lexical analyzers based on regular expressions. Lex
is a tool (computer program) that generates lexical analyzers from regular expressions that describe
the patterns for tokens. Its input notation is referred to as the Lex language, and the tool itself as
the Lex compiler.

Use of Lex
 The Lex compiler transforms the input patterns into a transition diagram and
generates code.
 An input file “lex.l” is written in the Lex language and describes the lexical analyzer
to be generated. The Lex compiler transforms “lex.l” to a C program, in a file that is
always named “lex.yy.c”.
 The file “lex.yy.c” is compiled by the C compiler into a file “a.out”. The C compiler's
output is a working lexical analyzer that can take a stream of input characters and
produce a stream of tokens.
 The attribute value, whether it be another numeric code, a pointer to the symbol
table, or nothing, is placed in a global variable yylval, which is shared between the
lexical analyzer and the parser.

Figure 2.9: Creating a lexical analyzer with Lex

Structure of Lex Programs


A Lex program has the following form:
declarations
%%
translation rules
%%
auxiliary functions

The declarations section includes declarations of variables, manifest constants (identifiers declared
to stand for a constant, e.g., the name of a token), and regular definitions.
The translation rules of a Lex program each have the form Pattern { Action }:
P1 { Action A1 }
P2 { Action A2 }

Pn { Action An }
Each pattern is a regular expression. The actions are fragments of code typically written in C
language.
The third section holds whatever additional functions are used in the actions. Alternatively, these
functions can be compiled separately and loaded with the
lexical analyzer.
The lexical analyzer begins reading its remaining input, one character at a time, until it finds the
longest prefix of the input that matches one of the patterns Pi. It then executes the associated action
Ai. Typically, Ai will return to the parser, but if it does not (e.g., because Pi describes whitespace or
comments), then the lexical analyzer proceeds to find additional lexemes, until one of the
corresponding actions causes a return to the parser. The lexical analyzer returns a single value, the
token name, to the parser, but uses the shared, integer variable yylval to pass additional information
about the lexeme found.
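The longest-prefix discipline described above can be sketched in Python (an illustration of the matching rule, not Lex's generated C code; the rule set is an assumption). On a tie in length, the earlier rule wins, which is how the keyword pattern beats the identifier pattern:

```python
import re

rules = [('if', r'if'), ('id', r'[A-Za-z_]\w*'),
         ('number', r'\d+'), ('relop', r'<=|>=|==|!=|<|>'),
         ('ws', r'[ \t\n]+')]

def lex(src):
    tokens, pos = [], 0
    while pos < len(src):
        best = None
        for name, pat in rules:
            m = re.match(pat, src[pos:])
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())   # strictly longer match wins;
                                           # ties keep the earlier rule
        if best is None:                   # no pattern matches: lexical error
            raise SyntaxError(f'lexical error at position {pos}')
        name, lexeme = best
        pos += len(lexeme)
        if name != 'ws':                   # whitespace produces no token
            tokens.append((name, lexeme))
    return tokens

print(lex('if x1 <= 42'))
```

Note that on input ifx the identifier pattern matches the longer prefix ifx, so no keyword token is produced; this is exactly the longest-match behavior described above.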
2.7 DESIGN OF LEXICAL ANALYZER FOR A SAMPLE LANGUAGE


A lexical-analyzer generator such as Lex is architected around an automaton simulator. The
implementation of the Lex compiler can be based on either an NFA or a DFA.
2.7.1 The Structure of the Generated Analyzer
Figure 2.10 shows the architecture of a lexical analyzer generated by Lex. A Lex program is
converted into a transition table and actions which are used by a finite Automaton simulator.
The program that serves as the lexical analyzer includes a fixed program that simulates an
automaton; the automaton is deterministic or nondeterministic. The rest of the lexical analyzer
consists of components that are created from the Lex program by Lex itself.

Figure 2.10: A Lex program is turned into a transition table and actions, which are used by a finite-
automaton simulator

These components are:


1. A transition table for the automaton.
2. Those functions that are passed directly through Lex to the output.
3. The actions from the input program, which appear as fragments of code to be invoked at the
appropriate time by the automaton simulator.

2.7.2 Pattern Matching Based on NFA's


To construct the automaton for several regular expressions, we combine all the NFAs into one by
introducing a new start state with ε-transitions to each of the start states of the NFAs Ni
for pattern pi, as shown in Figure 2.11.

Figure 2.11: An NFA constructed from a Lex program


Example: Consider the following three patterns and their associated actions:

a    { action A1 for pattern p1 }
abb  { action A2 for pattern p2 }
a*b+ { action A3 for pattern p3 }

Figure 2.12: NFA's for a, abb, and a*b+

Figure 2.13: Combined NFA

Figure 2.14: Sequence of sets of states entered when processing input aaba
Figure 2.15: Transition graph for DFA handling the patterns a, abb, and a*b+
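The behavior shown in Figure 2.14 can be imitated with Python's re module as a stand-in for the combined automaton (a sketch of the longest-match resolution only, not an NFA simulation): each pattern reports the longest prefix it matches, and the longest overall wins, with ties going to the earliest pattern.

```python
import re

patterns = [('p1', r'a'), ('p2', r'abb'), ('p3', r'a*b+')]

def longest_match(src):
    """Return (pattern name, lexeme) for the longest prefix match of src."""
    best = None
    for name, pat in patterns:
        m = re.match(pat, src)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

print(longest_match('aaba'))    # ('p3', 'aab')
```

On input aaba, pattern p1 matches the prefix a and p3 matches the longer prefix aab, so action A3 would be executed, matching the sequence of state sets in Figure 2.14.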
