Chapter 2 Lexical Analysis
2.1 Introduction
The process of compilation starts with the first phase, called lexical analysis. In this phase the
input is scanned completely in order to identify the tokens. The token structure can be recognized
with the help of diagrams popularly known as finite automata, and regular expressions are used
to construct such finite automata. These diagrams can then be translated into a program for
identifying tokens. In this chapter:
▪ we will see the role of the lexical analyzer in the compilation process;
▪ we will discuss methods of identifying tokens in a source program;
▪ finally, we will learn about LEX, a tool which automates the construction of lexical
analyzers.
2.2 Role of Lexical Analyzer
The lexical analyzer is the first phase of a compiler. It reads the input source program
from left to right, one character at a time, and generates a sequence of tokens. Each token is a
single logically cohesive unit such as an identifier, keyword, operator, or punctuation mark.
The parser can then use these tokens to determine the syntax of the source program. The role of
the lexical analyzer in the process of compilation is as shown below:
Compiled by Fikru T. & Dr. Velmurugan
Lexical analyzer works in two phases:
a. In the first phase, it performs scanning of the input.
b. In the second phase, it generates the series of tokens.
Blank and newline characters are ignored. The resulting stream of tokens is passed to the
syntax analyzer.
2.3 Input Buffering
The lexical analyzer scans the input string from left to right, one character at a time. It uses two
pointers, begin_ptr (bp) and forward_ptr (fp), to keep track of the portion of the input scanned.
Initially both pointers point to the first character of the input string, as shown in Figure 2.2.
The fp is moved ahead in search of the end of the lexeme. When fp encounters whitespace,
the lexeme ends; the whitespace itself is ignored. Both begin_ptr (bp) and forward_ptr (fp) are
then set to the beginning of the next token.
The input characters are read from secondary storage, but reading character by character from
secondary storage is costly. Hence a buffering technique is used: a block of data is first read into a
buffer and then scanned by the lexical analyzer. Two schemes are used in this context:
1. One buffer scheme
2. Two buffer scheme
1. One buffer scheme:
In this scheme, only one buffer is used to store the input string. The problem with this
scheme is that if a lexeme is very long it crosses the buffer boundary; to scan the rest of the
lexeme the buffer has to be refilled, which overwrites the first part of the lexeme.
2. Two buffer scheme:
Initially both bp and fp point to the first character. Then fp moves towards the right in search
of the end of the lexeme. As soon as a blank character is recognized, the string between bp and fp
is identified as the corresponding token. To identify the boundary of the first buffer, an
end-of-buffer character is placed at its end; similarly, the end of the second buffer is recognized
by the end-of-buffer mark at its end. When fp encounters the first eof, the end of the first buffer
is recognized and filling of the second buffer starts. In the same way, when the second eof is
reached, it indicates the end of the second buffer. The two buffers are filled alternately in this
fashion until the end of the input program, and the stream of tokens is identified. The eof
character introduced at the end is called a sentinel; it is used to identify the end of a buffer.
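The buffering idea above can be sketched in a few lines of Python. This is only an illustrative model: the buffer size, the sentinel character, and the whitespace-delimited notion of a lexeme are simplifying assumptions, not part of any real compiler.

```python
# A sketch of sentinel-based buffered scanning. Buffer size, sentinel
# character, and whitespace-delimited lexemes are simplifying assumptions.
BUF = 8  # hypothetical buffer size

def tokens(source):
    """Yield whitespace-separated lexemes from sentinel-terminated buffers."""
    sentinel = "\0"
    # Cut the input into BUF-sized blocks, each ending with a sentinel,
    # mimicking the alternating refills of the two-buffer scheme.
    buffers = [source[i:i + BUF] + sentinel for i in range(0, len(source), BUF)]
    lexeme = []
    for buf in buffers:                    # refill: switch to the next buffer
        for ch in buf:                     # forward pointer fp moves right
            if ch == sentinel:             # end-of-buffer mark, not input
                break
            if ch.isspace():               # whitespace terminates a lexeme
                if lexeme:
                    yield "".join(lexeme)  # token between bp and fp
                    lexeme = []
            else:
                lexeme.append(ch)
    if lexeme:
        yield "".join(lexeme)

print(list(tokens("int x = 10 ;")))        # ['int', 'x', '=', '10', ';']
```

The sentinel check comes first in the inner loop, so the scanner never confuses an end-of-buffer mark with real input, which is exactly the purpose of the sentinel described above.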
2.4.1 String and Language
A string is a finite sequence of symbols (letters) drawn from an alphabet. Strings are
synonymously called words.
• The length of a string is denoted by |S|
• The empty string is denoted by ε
• The empty set of strings is denoted by Φ
The following operations on languages are commonly used to build new languages:
• Union (L ∪ M): the set of strings belonging to L or to M.
• Concatenation (LM): the set of strings formed by a string of L followed by a string of M.
• Kleene closure (L*): zero or more concatenations of strings of L, including ε.
• Positive closure (L+): one or more concatenations of strings of L, excluding ε.
For example, let L be the set of letters L = {A, B, C, …, Z, a, b, c, …, z} and D the set of
digits D = {0, 1, 2, …, 9}. By performing the operations discussed above, new languages can be
generated as follows:
• L ∪ D is the set of letters and digits.
• LD is the set of strings consisting of a letter followed by a digit.
• L5 (L concatenated with itself five times) is the set of all strings of exactly five letters.
• L* is the set of all strings of letters, including ε.
• L+ is the set of all strings of letters, except ε.
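These operations can be checked concretely with a short Python sketch; the alphabets are abbreviated to two symbols each here so the results stay small.

```python
# A concrete check of the language operations above on tiny alphabets
# (the full letter and digit sets are abbreviated to two symbols each).
L = {"a", "b"}
D = {"0", "1"}

union = L | D                               # L U D: letters or digits
LD = {x + y for x in L for y in D}          # concatenation: letter then digit

def power(lang, n):
    """lang^n: concatenations of exactly n strings of lang."""
    result = {""}
    for _ in range(n):
        result = {x + y for x in result for y in lang}
    return result

def kleene_upto(lang, k):
    """A finite slice of lang*: strings of at most k concatenations."""
    out = set()
    for n in range(k + 1):
        out |= power(lang, n)
    return out

print(sorted(LD))                 # ['a0', 'a1', 'b0', 'b1']
print("" in kleene_upto(L, 3))    # True: L* contains the empty string
print(len(power(L, 5)))           # 32 strings of length 5
```

Note that the true Kleene closure is infinite; `kleene_upto` only enumerates a finite slice of it, which is all a program can do.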
2.4.3 Regular Expressions
Regular expressions are a mathematical notation describing the sets of strings of specific
languages. They provide a convenient and useful notation for representing tokens. The following
rules define the regular expressions over an input set (alphabet) denoted by ∑:
1. ε is a regular expression that denotes the set containing only the empty string.
2. If R1 and R2 are regular expressions, then R = R1+R2 (also written as
R = R1|R2) is a regular expression which represents the union operation.
3. If R1 and R2 are regular expressions, then R = R1.R2 is a regular expression which
represents the concatenation operation.
4. If R1 is a regular expression, then R = R1* is a regular expression which represents the
Kleene closure.
A language denoted by regular expressions is said to be a regular set or regular language. Let
us see some examples of regular expressions.
Example 1: Write a regular expression (R.E) for a language containing the string of length two
over ∑ = {0,1}.
Solution: R.E = (0+1) (0+1)
Example 2: Write a regular expression (R.E) for a language containing the string which end
with “abb” over ∑ ={a,b}.
Solution: R.E = (a+b)*abb
Example 3: Write a regular expression (R.E) for recognizing identifiers.
Solution: For denoting identifiers we consider the sets of letters and digits, because an identifier
is a combination of letters and digits whose first character is always a letter. Hence the R.E.
can be denoted as:
R.E = letter(letter+digit)*
Where letter = (A, B,…Z, a,b,…z) and digit = (0, 1, 2, …9)
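The identifier R.E. can be checked directly with Python's re module; the '+' of the textbook notation is alternation, which a character class expresses here.

```python
import re

# letter(letter+digit)* written in re syntax: '+' in the text means
# alternation, so letter+digit becomes the character class [A-Za-z0-9].
identifier = re.compile(r"[A-Za-z][A-Za-z0-9]*")

for s in ["count", "x9", "9x", "sum1"]:
    # fullmatch requires the whole string to match the pattern
    print(s, bool(identifier.fullmatch(s)))
```

Strings starting with a digit, such as "9x", fail because the first character class admits only letters.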
Example 4: Write a regular expression (R.E) for the language accepting all combination of a’s
except the null string over ∑ ={a}.
Solution: The regular expression has to be built for the language
L = {a, aa, aaa, aaaa, …}
This set indicates that the language contains no null string. So, we can write:
R.E = a+
The + is called positive closure.
Example 5: Design a regular expression (R.E) for the language containing all strings with any
number of a’s and b’s over ∑ ={a,b}.
Solution: The language is L = {ε, a, aa, ab, b, ba, bab, abab, …}
R.E = (a+b)*
The set for this r.e will be:
L = {ε, a, aa, ab, b, ba, bab, abab, … any combination of a and b}
The (a + b)* means any combination of a and b, including the null string.
Example 6: Construct a regular expression for the language containing all strings having any
number of a’s and b’s except the null string.
Solution: R.E = (a+b)+
This regular expression will give the set of strings of any combination of a’s and b’s except a null
string.
Example 7: Construct a regular expression for the language accepting all the strings which are
ending with 00 over the set ∑ = {0,1}.
Solution: The r.e has to be formed in which, at the end, there should be 00. That means
r.e = (any combination of 0’s and 1’s) 00
R.E = (0 + 1)*00
Thus, the valid strings are 100, 0100, 1000, and so on: all strings ending with 00.
Example 8: Construct a regular expression for the language accepting the strings which are
starting with 1 and ending with 0, over the set ∑ = {0,1}.
Solution: The first symbol in r.e should be 1 and the last symbol should be 0.
So, R.E = 1( 0 + 1 ) * 0
8
Compiled by Fikru T. & Dr. Velmurugan
Note that the condition is strictly enforced by fixing the starting and ending symbols. In
between them there can be any combination of 0 and 1, including the null string.
Example 9: Write a regular expression to denote the language L over ∑*, where ∑ = {a, b, c},
in which every string consists of any number of a’s followed by any number of b’s followed by
any number of c’s.
Solution: Any number of a’s means a*. Similarly, any number of b’s and any number of c’s
means b* and c*. So, the regular expression is:
R.E = a* b* c*
Example 10: Construct regular expression for the language which consists of exactly two b’s
over the set ∑ = {a, b}.
Solution: There should be exactly two b’s.
Hence, R.E = a*ba*ba*
Each a* indicates either a run of a’s or the null string. Thus, we can derive any string having
exactly two b’s and any number of a’s.
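Several of the examples above can be verified with Python's re module. The translation is mechanical: '+' between alternatives becomes a character class or '|', while * and + keep their closure meaning.

```python
import re

# The textbook R.E.s of the examples above rewritten in re syntax.
patterns = {
    "length two":       r"[01][01]",    # Example 1: (0+1)(0+1)
    "ends in abb":      r"[ab]*abb",    # Example 2: (a+b)*abb
    "starts 1, ends 0": r"1[01]*0",     # Example 8: 1(0+1)*0
    "exactly two b's":  r"a*ba*ba*",    # Example 10: a*ba*ba*
}

assert re.fullmatch(patterns["length two"], "01")
assert re.fullmatch(patterns["ends in abb"], "ababb")
assert re.fullmatch(patterns["starts 1, ends 0"], "1010")
assert not re.fullmatch(patterns["starts 1, ends 0"], "0110")
assert re.fullmatch(patterns["exactly two b's"], "abab")
assert not re.fullmatch(patterns["exactly two b's"], "abbb")
print("all example regexes behave as described")
```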
2.4.4 Notations Used for Representing Regular Expressions
Regular expressions are compact units which are useful for representing the strings belonging to
some specific languages. Let us see some notations used for writing regular expressions.
1. One or more instances: the + sign is used to represent one or more instances. If r is a
regular expression, then r+ denotes one or more occurrences of r. Example: the set of strings
with one or more occurrences of ‘a’ over the input set {a} can be written as the regular
expression a+. It denotes the set {a, aa, aaa, aaaa, …}.
2. Zero or more instances: the * sign is used to represent zero or more instances. If r is a
regular expression, then r* denotes zero or more occurrences of r. Example: the set of strings
with zero or more occurrences of ‘a’ over the input set {a} can be written as the regular
expression a*. It denotes the set {ε, a, aa, aaa, aaaa, …}.
3. Character classes: a class of symbols can be denoted by [ ]. Example: [012] means 0
or 1 or 2. The complete class of small letters from a to z can be represented by the regular
expression [a-z], and the complete class of digits from 0 to 9 by the regular expression [0-9].
The hyphen indicates a range. We can also write a regular expression representing any word
of small letters as [a-z]*.
As we know, languages can be represented by regular expressions, but not all languages can be
so represented; a language that cannot be represented by an R.E is called a non-regular language.
2.4.5 Non-Regular Languages
A language which cannot be described by a regular expression is called a non-regular
language, and the set represented by such a language is called a non-regular set.
There are some languages which cannot be described by regular expressions. Example: we
cannot write a regular expression to check whether a given string is a palindrome.
Similarly, we cannot write a regular expression to check whether a string has balanced
parentheses.
Thus, regular expressions can denote only a fixed number of repetitions, or an unspecified
number of repetitions of one given construct; constraints that require unbounded counting, such
as matching nested structure, cannot be represented by a regular expression.
2.5 Recognition of Tokens
For a programming language there are various types of tokens, such as identifiers, keywords,
constants, operators and so on. A token is usually represented by a pair: a token type and a
token value.
The token type tells us the category of the token, and the token value gives us information
about that particular token; the token value is also called the token attribute. During the lexical
analysis process a symbol table is maintained. For identifiers and constants, the token value can
be a pointer into the symbol table. The lexical analyzer reads the input program and generates a
symbol table for tokens.
For example:
We will consider some encoding of tokens as follows.
Consider, a program code as
In the above example the scanner scans the input string and recognizes "if" as a keyword,
returning token type 1, since in the given encoding code 1 indicates the keyword "if"; hence 1 is
at the beginning of the token stream. Next is a pair (8, 1), where 8 indicates a parenthesis and 1
indicates an opening parenthesis ‘(’. Then the scanner reads ‘a’, recognizes it as an identifier,
and searches the symbol table to check whether an entry for it is already present. If not, it inserts
the information about this identifier into the symbol table and returns its location, e.g. 100. If
the same identifier or variable is already present in the symbol table, the lexical analyzer does
not insert it again; instead, it returns the location where it is present.
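The behaviour described above can be sketched as a toy scanner. The numeric codes (1 for the keyword "if", 8 for parentheses, symbol-table locations starting at 100) mirror the encoding discussed in the text but are otherwise hypothetical, as is the whole token set.

```python
import re

# A toy scanner in the spirit of the example above. The codes are
# hypothetical: 1 = keyword "if", 8 = parenthesis, identifiers are
# looked up in (and inserted into) a symbol table starting at 100.
KEYWORDS = {"if": 1}
symbol_table = {}                          # name -> location in the table

def scan(code):
    tokens = []
    for lexeme in re.findall(r"[A-Za-z_]\w*|[()]|\S", code):
        if lexeme in KEYWORDS:
            tokens.append(KEYWORDS[lexeme])
        elif lexeme == "(":
            tokens.append((8, 1))          # 8 = parenthesis, 1 = opening
        elif lexeme == ")":
            tokens.append((8, 2))          # 2 = closing
        elif lexeme[0].isalpha() or lexeme[0] == "_":
            # Insert into the symbol table only on first sight; afterwards
            # return the existing location, as the text describes.
            loc = symbol_table.setdefault(lexeme, 100 + len(symbol_table))
            tokens.append(("id", loc))
        else:
            tokens.append(("other", lexeme))
    return tokens

print(scan("if ( a )"))   # [1, (8, 1), ('id', 100), (8, 2)]
```

Running `scan("if ( a )")` twice returns the same location 100 for `a`, illustrating the lookup-before-insert behaviour of the symbol table.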
2.6 Block Schematic of Lexical Analyzer
Lexical analysis is the process of recognizing tokens from the input source program. The lexical
analyzer stores the input in a buffer and builds the regular expressions for the corresponding
tokens. From these regular expressions, a finite automaton is built. When a lexeme matches the
pattern generated by the finite automaton, the specific token is recognized. The block schematic
of this process is shown in the figure below.
While constructing the lexical analyzer we first design the regular expressions for recognizing
the corresponding tokens. A diagram representing the flow of recognition is then built; such a
diagram is called a transition diagram. The transition diagram elaborates the actions to be taken
while recognizing a token. The lexeme is stored in an input buffer, and the forward pointer scans
the input character by character, moving from left to right. The transition diagram keeps track of
the information about the characters seen as the forward pointer scans the input. Positions in a
transition diagram are called states and are drawn as circles; the edges in the diagram represent
the transitions from one state to another.
There is a special state called the start state, which denotes the beginning of the transition
diagram. From the start state we begin recognizing a token, and after recognizing the token we
should reach a final state. In Fig. 2.9, S1 is the start state and S2 is a final state.
Let us take some examples to understand the concept of a transition table. The transition table
is a tabular representation of the transition diagram: we will first design the transition diagram
and then build the transition table.
The finite automata can be represented as:
Example 2: Design an FA to check whether a given binary number is even.
Solution: While designing the FA we will assume one start state, one state for inputs ending in 0
and another for inputs ending in 1. Since we want to check whether the given binary number is
even, we make the state for 0 the final state.
Fig. 2.13 for Example 2.
Example 3: Design FA which accepts only those strings which start with 1 and ends with 0.
Solution: The FA will have a start state A from which only the edge with input 1 goes to the
next state.
Fig. 2.14 for Example 3.
In state B, if we read 1 we remain in state B, but if we read 0 we reach state C, which is a
final state. In state C, if we read 0 or 1 we go to state C or B respectively. Note that special care
is taken for 0: if the input ends with 0, the FA is in the final state.
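The FA of Example 3 can be simulated directly from its transition table. The dictionary below encodes exactly the transitions described above; configurations with no outgoing edge (such as reading 0 in the start state) reject immediately.

```python
# The FA of Example 3: strings over {0,1} that start with 1 and end
# with 0, encoded as a transition table plus a generic DFA runner.
delta = {
    ("A", "1"): "B",                   # a leading 1 moves to B; a
                                       # leading 0 has no move (reject)
    ("B", "0"): "C", ("B", "1"): "B",
    ("C", "0"): "C", ("C", "1"): "B",
}
start, finals = "A", {"C"}

def accepts(s):
    state = start
    for ch in s:
        if (state, ch) not in delta:   # dead configuration: reject
            return False
        state = delta[(state, ch)]
    return state in finals

print(accepts("10"), accepts("1010"), accepts("01"))   # True True False
```

The same `accepts` runner works for any DFA once `delta`, `start`, and `finals` are replaced, so it can also be used to check the other FA examples in this section.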
Example 4: Design FA which checks whether the given unary number is divisible by 3.
Solution:
Example 5: Design a transition diagram for the language accepting strings that start with any
number of a’s followed by a single b, over the set {a, b}.
Solution:
• R.E = a*b
• Strings belonging to this language are {b, ab, aab, aaab, aaaab, …}
2.7 Conversion of NFA to DFA
In an NFA, when a specific input is given in the current state, the machine may go to multiple
states: it can have zero, one, or more than one move on a given symbol. In a DFA, on the other
hand, when a specific input is given in the current state, the machine goes to exactly one state;
a DFA has only one move on a given input symbol.
Let M = (Q, Σ, δ, q0, F) be an NFA which accepts the language L(M). There is an
equivalent DFA, denoted by M′ = (Q′, Σ, δ′, q0′, F′), such that L(M) = L(M′).
Steps for converting NFA to DFA
Step 1: Initially Q′ = Φ.
Step 2: Add q0, the start state of the NFA, to Q′, then find the transitions from this start state.
Step 3: For each state in Q′ and each input symbol, find the reachable set of NFA states. If this
set of states is not in Q′, add it to Q′.
Step 4: In the DFA, the final states are all the states which contain a state of F (the final states
of the NFA).
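The four steps can be sketched as a short subset-construction routine. The NFA used here is the one worked through in Example 1 below, reading δ(q2, 0) = {q2} as in the worked δ′ computation; frozensets stand for the bracketed DFA states like [q1, q2].

```python
# A compact subset construction following the four steps above.
nfa = {
    ("q0", "0"): {"q0"},       ("q0", "1"): {"q1"},
    ("q1", "0"): {"q1", "q2"}, ("q1", "1"): {"q1"},
    ("q2", "0"): {"q2"},       ("q2", "1"): {"q1", "q2"},
}
alphabet = ["0", "1"]
nfa_finals = {"q2"}

def subset_construction(start="q0"):
    start_set = frozenset({start})        # Step 2: begin from {q0}
    seen = {start_set}                    # Step 1: Q' starts empty, then grows
    worklist = [start_set]
    delta = {}
    while worklist:                       # Step 3: explore each new state set
        S = worklist.pop()
        for a in alphabet:
            T = frozenset().union(*(nfa.get((q, a), set()) for q in S))
            delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                worklist.append(T)
    # Step 4: final DFA states are those containing an NFA final state.
    finals = {S for S in seen if S & nfa_finals}
    return delta, finals

delta, finals = subset_construction()
print(sorted(sorted(S) for S in finals))
```

Because this version only explores states reachable from {q0}, it produces the states [q0], [q1], and [q1, q2]; the isolated row for [q2] in the hand computation is unreachable and so never appears here.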
Example 1: Convert the given NFA to DFA
Solution: For the given transition diagram we will first construct the transition table.
State      0          1
q0         q0         q1
q1         {q1, q2}   q1
*q2        q2         {q1, q2}
Now we obtain the δ′ transitions for state q0:
δ′([q0], 0) = [q0]
δ′([q0], 1) = [q1]
The δ′ transitions for state q1 are obtained as:
δ′([q1], 0) = [q1, q2] (new state generated)
δ′([q1], 1) = [q1]
The δ′ transitions for state q2 are obtained as:
δ′([q2], 0) = [q2]
δ′([q2], 1) = [q1, q2]
Now we obtain the δ′ transitions on [q1, q2]:
δ′([q1, q2], 0) = δ(q1, 0) ∪ δ(q2, 0)
               = {q1, q2} ∪ {q2}
               = [q1, q2]
δ′([q1, q2], 1) = δ(q1, 1) ∪ δ(q2, 1)
               = {q1} ∪ {q1, q2}
               = {q1, q2}
               = [q1, q2]
The state [q1, q2] is a final state as well, because it contains the final state q2. The transition
table for the constructed DFA will be:
17
Compiled by Fikru T. & Dr. Velmurugan
State       0           1
[q0]        [q0]        [q1]
[q1]        [q1, q2]    [q1]
*[q2]       [q2]        [q1, q2]
*[q1, q2]   [q1, q2]    [q1, q2]
Fig.: transition diagram of the resulting DFA (states [q0], [q1], [q2], [q1, q2]).
Example 2: Convert the given NFA to DFA.
Solution: For the given transition diagram we will first construct the transition table.
State    0          1
q0       {q0, q1}   {q1}
*q1      Φ          {q0, q1}
Now we obtain the δ′ transitions for state q0:
δ′([q0], 0) = [q0, q1]
δ′([q0], 1) = [q1]
Now we obtain the δ′ transitions on [q0, q1]:
δ′([q0, q1], 0) = δ(q0, 0) ∪ δ(q1, 0)
               = {q0, q1} ∪ Φ
               = {q0, q1}
               = [q0, q1]
Similarly,
δ′([q0, q1], 1) = δ(q0, 1) ∪ δ(q1, 1)
               = {q1} ∪ {q0, q1}
               = {q0, q1}
               = [q0, q1]
Since q1 is a final state in the given NFA, every DFA state containing q1 becomes a final
state. Hence in the DFA the final states are [q1] and [q0, q1]; the set of final states is
F = {[q1], [q0, q1]}.
The transition table for the constructed DFA will be:
State        0           1
[q0]         [q0, q1]    [q1]
*[q1]        Φ           [q0, q1]
*[q0, q1]    [q0, q1]    [q0, q1]
Fig.: transition diagram of the resulting DFA (states [q0], [q1], [q0, q1]).
2.8 DFA (Deterministic Finite Automata)
Deterministic refers to the uniqueness of the computation. A finite automaton is called a
deterministic finite automaton if, as the machine reads the input string one symbol at a time,
there is only one path for a specific input from the current state to the next state. A DFA does
not allow a null move, i.e. it cannot change state without consuming an input character. A DFA
can contain multiple final states. It is used for lexical analysis in a compiler.
In the following diagram, we can see that from state q0 for input a, there is only one path
which is going to q1. Similarly, from q0, there is only one path for input b going to q2.
4. The final state is denoted by a double circle.
Example 1:
Explanation:
In the above diagram, we can see that given input 0 in state q0, the DFA changes state to q1,
and on a starting 0 it always reaches the final state q1. It can accept 00, 01, 000, 001, … etc.
It cannot accept any string which starts with 1, because on a string starting with 1 it will never
reach the final state.
Solution:
Explanation:
In the above diagram, we can see that given 0 as input in state q0, the DFA changes state to q1.
It can accept any string which ends with 0, like 00, 10, 110, 100, … etc. It cannot accept any
string which ends with 1, because on input 1 it will never be in the final state q1; so a string
ending with 1 will always be rejected.
2.9.1 Function of Lex
First, a lexical-analyzer specification is written as a program firstLab.l in the Lex language.
The Lex compiler then translates firstLab.l into a C program lex.yy.c. Finally, the C/C++
compiler compiles lex.yy.c and produces an object program a.exe; a.exe is the lexical analyzer
that transforms an input stream into a sequence of tokens.
2.9.2 Lex File Format
A Lex program is separated into three sections by %% delimiters. The format of a Lex source
file is as follows:
%{
Declaration Section
%}
%%
Rule Section
%%
Auxiliary Procedure Section
1. In the declaration section, declarations of variables and constants can be made. Some regular
definitions can also be written in this section; the regular definitions are basically named
components of the regular expressions appearing in the rule section.
2. The rule section consists of regular expressions with associated actions. The translation rules
are given in the form:
R1 {action1}
R2 {action2}
. . .
Rn {actionn}
where each Ri is a regular expression and each actioni is a program fragment describing what
action is to be taken for the corresponding regular expression. These actions can be specified
by pieces of C/C++ code.
3. The auxiliary procedure section is where all the required procedures are defined. Sometimes
these procedures are needed by the actions in the rule section. The lexical analyzer (scanner)
works in coordination with the parser. When activated by the parser, the lexical analyzer begins
reading its remaining input, one character at a time. When the string read so far matches one of
the regular expressions Ri, the corresponding actioni is executed, and this actioni returns
control to the parser. The search for lexemes is then repeated in order to return all the tokens
in the source string. The lexical analyzer ignores white space and comments in this process.
Sample Lex Program to understand the above concept
This is a simple program that displays the given string "Welcome to Compiler Design Lab
Session". In the main function a call to the yylex routine is made; this function is defined in the
generated lex.yy.c program.
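A minimal Lex specification matching this description might look as follows. This is a sketch, not the original sample file: the rule that ignores all input and the exact placement of the message are illustrative assumptions.

```lex
%{
/* Declaration section: copied verbatim into the generated lex.yy.c */
#include <stdio.h>
%}
%%
.|\n    ;   /* Rule section: match and ignore every input character */
%%
/* Auxiliary procedure section */
int yywrap(void) { return 1; }   /* 1 = end of input reached */
int main(void) {
    printf("Welcome to Compiler Design Lab Session\n");
    yylex();                     /* defined in the generated lex.yy.c */
    return 0;
}
```

The helper names used below (yytext, yylex, yywrap, yyin, yyleng) are the standard ones provided by the Lex-generated scanner.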
yytext: when the lexer matches or recognizes a token in the input, the lexeme is stored in a
null-terminated string called yytext.
yylex(): is the most important function. As soon as a call to yylex() is made, the scanner starts
scanning the source program.
yywrap(): is called when the scanner encounters end of file. If yywrap() returns 0, the scanner
continues scanning; when yywrap() returns 1, it means the end of file has been reached.
yyin: the standard input file pointer from which the input source program is read.
yyleng: when the lexer recognizes a token, the lexeme is stored in the null-terminated string
yytext, and yyleng stores the length (number of characters) of that lexeme. The value in yyleng
is the same as strlen(yytext) would return.
Before running and executing the above Lex program you have to install and configure some
software, such as Code::Blocks and Flex, as follows.
Let Us See Steps to Install, Configure and Integrate CodeBlocks, Flex and other Tools
1. First install codeblocks in appropriate directory.
2. Install flex in appropriate directory.
3. Set the paths as follows:
Go to CodeBlocks->MinGW->bin and copy the address of the bin directory; it should look
somewhat like C:\Program Files (x86)\CodeBlocks\MinGW\bin
Open Control Panel -> System -> Advanced Settings -> Environment Variables -> System
Variables, click on Path (inside System Variables) -> click Edit -> click New and paste the
copied path C:\Program Files (x86)\CodeBlocks\MinGW\bin
Press OK
4. Set the path for Flex in the same way:
5. Go to GnuWin32->bin and copy the address of the bin directory; it should look somewhat
like C:\GnuWin32\bin
6. Open Control Panel -> System -> Advanced Settings -> Environment Variables -> System
Variables, click on Path (inside System Variables) -> click Edit -> click New and paste
the copied path C:\GnuWin32\bin
Now let us write a Lex program and execute it
1. Create folder on Desktop with name LexProgram or any name you want.
2. Open notepad and type a Lex program
3. Save it inside the folder as filename.l
4. Make sure, while saving, to choose "All Files" rather than "Text Document"
5. Go to the command prompt (cmd)
6. Go to the directory where you have saved the program
7. Type in command: flex filename.l
8. Type in command: g++ lex.yy.c
9. Execute/Run for windows command prompt: a.exe
The output looks like the one below after executing the Lex program.