Chapter 2: Lexical Analysis

2.1 Introduction
The process of compilation starts with the first phase, called lexical analysis. In this phase the
input is scanned completely in order to identify the tokens. The token structure can be recognized
with the help of diagrams popularly known as finite automata, which are constructed from
regular expressions and can be translated into a program for identifying tokens. In this chapter:
▪ we will see what the role of the lexical analyzer in the compilation process is;
▪ we will discuss the method of identifying tokens in a source program;
▪ finally, we will learn about LEX, a tool which automates the construction of lexical analyzers.
2.2 Role of Lexical Analyzer
The lexical analyzer is the first phase of a compiler. It reads the input source program
from left to right, one character at a time, and generates a sequence of tokens. Each token is a
single logical cohesive unit such as an identifier, keyword, operator, or punctuation mark.
The parser can then use these tokens to determine the syntax of the source program. The role of
the lexical analyzer in the process of compilation is as shown below:

Figure 2.1 Role of Lexical Analyzer


Because the lexical analyzer scans the source program to recognize tokens, it is also called a scanner.
Apart from token identification, the lexical analyzer also performs the following functions.
Functions of Lexical Analyzer
1. It produces a stream of tokens.
2. It eliminates blanks and comments.
3. It generates a symbol table, which stores information about the identifiers and constants
encountered in the input.
4. It keeps track of line numbers.
5. It reports errors encountered while generating the tokens.

Compiled by Fikru T. & Dr. Velmurugan
The lexical analyzer works in two phases:
a. In the first phase, it scans the input.
b. In the second phase, it generates the series of tokens.

2.2.1 Tokens, Patterns, Lexemes


Let us learn some terminology that is frequently used when we talk about the activity of
lexical analysis.
Token: a token describes the class or category of input strings. For example, identifiers,
keywords, and constants are tokens.
Pattern: a set of rules that describes a token.
Lexeme: a sequence of characters in the source program that matches the pattern of a
token. For example: int, i, num, ans, choice.
Let us take one programming statement as an example to clarify these terms:
if (a<b)
Here "if", "(", "a", "<", "b", ")" are all lexemes; "if" is a keyword, "(" is an opening
parenthesis, "a" and "b" are identifiers, "<" is an operator, and so on.
The pattern defining an identifier could now be: an identifier is a collection of alphanumeric
characters whose first character must be a letter.
When we want to compile a given source program, we submit that program to the compiler.
The compiler scans the source program and produces a sequence of tokens; this is why the
lexical analyzer is also called a scanner. For example, consider the piece of source code given below:

Blank and newline characters can be ignored. This stream of tokens is then given to the
syntax analyzer.
2.3 Input Buffering
The lexical analyzer scans the input string from left to right, one character at a time. It uses
two pointers, begin_ptr (bp) and forward_ptr (fp), to keep track of the portion of the input
scanned. Initially both pointers point to the first character of the input string, as shown in Figure 2.2.

Figure 2.2 Initial Configuration


The forward_ptr moves ahead in search of the end of the lexeme. As soon as a blank space is
encountered, it indicates the end of the lexeme. In the above example, as soon as forward_ptr (fp)
encounters a blank space, the lexeme "int" is identified.

When fp encounters whitespace, it ignores it and moves ahead. Then both begin_ptr (bp)
and forward_ptr (fp) are set to the beginning of the next token.

The input characters are read from secondary storage, but reading character by character
from secondary storage is costly. Hence a buffering technique is used: a block of data is first
read into a buffer and then scanned by the lexical analyzer. There are two schemes used in this context:
1. One-buffer scheme
2. Two-buffer scheme
1. One-buffer scheme:
In this scheme, only one buffer is used to store the input string. The problem with this
scheme is that if a lexeme is very long, it crosses the buffer boundary; to scan the rest of the
lexeme the buffer has to be refilled, which overwrites the first part of the lexeme.

2. Two buffer Scheme:


To overcome the problem of the one-buffer scheme, this method uses two buffers to store
the input string. The first and second buffers are scanned alternately; when the end of the current
buffer is reached, the other buffer is filled. The only problem with this method is that if the
lexeme is longer than the buffer, the input cannot be scanned completely.

Initially both bp and fp point to the first character. Then fp moves to the right in search of
the end of the lexeme. As soon as a blank character is recognized, the string between bp and fp
is identified as the corresponding token. To identify the boundary of the first buffer, an
end-of-buffer character is placed at the end of the first buffer; similarly, the end of the second
buffer is recognized by an end-of-buffer mark at its end. When fp encounters the first
eof, the end of the first buffer is recognized and filling of the second buffer starts. In the same
way, the second eof indicates the end of the second buffer. The two buffers are filled alternately
until the end of the input program, and the stream of tokens is identified. The eof character
introduced at the end is called a sentinel; it is used to identify the end of a buffer.

2.4 Token Specification


Regular expressions are used to specify tokens: when a pattern is matched by some regular
expression, the token is recognized. Let us first understand the fundamental concepts of
languages.

2.4.1 String and Language
A string is a finite sequence of symbols (letters) drawn from an alphabet. Strings are also
called words.
• The length of a string S is denoted by |S|
• The empty string is denoted by ε
• The empty set of strings is denoted by Φ
The following terms are commonly used for strings.
• Prefix of a string: a string obtained by removing zero or more trailing symbols.
E.g., for the string predefined, a prefix could be "pre".
• Suffix of a string: a string obtained by removing zero or more leading symbols.
E.g., for the string predefined, a suffix could be "d".
• Substring of a string: a string obtained by removing a prefix and a suffix of the given string.
E.g., for the string predefined, "define" is a substring.
• Subsequence of a string: any string formed by removing zero or more, not necessarily
contiguous, symbols. E.g., "prefin" is a subsequence of predefined.
2.4.2 Operations on Languages
As we have seen, a language is a collection of strings. Various operations can be performed
on languages:
• Union of two languages L1 and L2: L1 ∪ L2 = {s : s is in L1 or s is in L2}
• Concatenation of two languages L1 and L2: L1.L2 = {st : s is in L1 and t is in L2}
• Kleene closure of L: L* = L^0 ∪ L^1 ∪ L^2 ∪ …, i.e. zero or more concatenations of L.
• Positive closure of L: L+ = L^1 ∪ L^2 ∪ …, i.e. one or more concatenations of L.
For example, let L be the set of letters, L = {A, B, C, …, Z, a, b, c, …, z}, and D the set of
digits, D = {0, 1, 2, …, 9}. By performing the operations discussed above, new languages can
be generated as follows:
• L ∪ D is the set of letters and digits.
• LD is the set of strings consisting of a letter followed by a digit.
• L^5 is the set of strings of length 5 over L.
• L* is the set of all strings over L, including ε.
• L+ is the set of all strings over L except ε.
2.4.3 Regular Expressions
Regular expressions are mathematical notations which describe the sets of strings of specific languages.
They provide a convenient and useful notation for representing tokens. The following rules
define the regular expressions over an input alphabet ∑:
1. ε is a regular expression that denotes the set containing the empty string.
2. If R1 and R2 are regular expressions, then R = R1 + R2 (also written as
R = R1 | R2) is a regular expression which represents the union operation.
3. If R1 and R2 are regular expressions, then R = R1.R2 is a regular expression which
represents the concatenation operation.
4. If R1 is a regular expression, then R = R1* is a regular expression which represents the
Kleene closure.
A language denoted by regular expressions is said to be a regular set or regular language. Let
us see some examples of regular expressions.
Example 1: Write a regular expression (R.E) for a language containing the string of length two
over ∑ = {0,1}.
Solution: R.E = (0+1) (0+1)
Example 2: Write a regular expression (R.E) for the language containing the strings which end
with “abb” over ∑ = {a,b}.
Solution: R.E = (a+b)*abb
Example 3: Write a regular expression (R.E) for recognizing identifiers.
Solution: To denote an identifier we consider the set of letters and digits, because an identifier
is a combination of letters, or letters and digits, whose first character is always a letter. Hence
the R.E. can be denoted as:

R.E = letter(letter+digit)*
where letter = (A + B + … + Z + a + b + … + z) and digit = (0 + 1 + 2 + … + 9)
Example 4: Write a regular expression (R.E) for the language accepting all combination of a’s
except the null string over ∑ ={a}.
Solution: The regular expression has to be built for the language
L={a, aa, aaa, aaaa,….}
This set indicates that there is no null string. So, we can write,
R.E = a+
The + is called positive closure.
Example 5: Design a regular expression (R.E) for the language containing all strings with any
number of a’s and b’s over ∑ ={a,b}.
Solution: The language is L = {ε, a, aa, ab, b, ba, bab, abab, …}
R.E = (a+b)*
The set for this R.E. is:
L = {ε, a, aa, ab, b, ba, bab, abab, … any combination of a and b}
(a + b)* means any combination of a’s and b’s, including the null string.
Example 6: Construct a regular expression for the language containing all strings having any
number of a’s and b’s except the null string.
Solution: R.E = (a+b)+
This regular expression will give the set of strings of any combination of a’s and b’s except a null
string.
Example 7: Construct a regular expression for the language accepting all the strings which are
ending with 00 over the set ∑ = {0,1}.
Solution: The R.E has to be formed such that at the end there is 00. That means
r.e = (Any combination of 0’s and 1’s) 00
R.E = ( 0 + 1)*00
Thus, the valid strings are 100, 0100, 1000, …. We have all strings ending with 00.
Example 8: Construct a regular expression for the language accepting the strings which are
starting with 1 and ending with 0, over the set ∑ = {0,1}.
Solution: The first symbol in r.e should be 1 and the last symbol should be 0.
So, R.E = 1( 0 + 1 ) * 0

Note that the condition is strictly followed by keeping the starting and ending symbols fixed.
In between them there can be any combination of 0’s and 1’s, including the null string.
Example 9: Write a regular expression to denote the language L over ∑*, where ∑ = {a, b, c},
in which every string consists of any number of a’s followed by any number of b’s followed by
any number of c’s.
Solution: Any number of a’s means a*. Similarly, any number of b’s and any number of c’s
means b* and c*. So the regular expression is:
R.E = a* b* c*
Example 10: Construct regular expression for the language which consists of exactly two b’s
over the set ∑ = {a, b}.
Solution: There should be exactly two b’s.

Hence, R.E = a*ba*ba*
a* indicates either a string of a’s or the null string. Thus we can derive any string having
exactly two b’s and any number of a’s.
2.4.4 Notations Used for Representing Regular Expressions
Regular expressions are compact notations which are useful for representing sets of strings
belonging to specific languages. Let us see some notations used for writing regular expressions.
1. One or more instances: the + sign is used to represent one or more instances. If r is a regular
expression then r+ denotes one or more occurrences of r. Example: for the set of strings in
which there are one or more occurrences of ‘a’ over the input set {a}, the regular
expression can be written as a+. It denotes the set {a, aa, aaa, aaaa, …}.
2. Zero or more instances: the * sign is used to represent zero or more instances. If r is a
regular expression then r* denotes zero or more occurrences of r. Example: for the set of
strings in which there are zero or more occurrences of ‘a’ over the input set {a}, the
regular expression can be written as a*. It denotes the set {ε, a, aa, aaa, aaaa, …}.
3. Character classes: a class of symbols can be denoted by [ ]. Example: [012] means 0
or 1 or 2. The complete class of small letters from a to z can be represented by the
regular expression [a-z], and the complete class of digits from 0 to 9 by the regular
expression [0-9]; the hyphen indicates a range. We can also write a regular
expression representing any word of small letters as [a-z]*.
As we know, a language can be represented by a regular expression, but not all languages
can be; a language that cannot is called a non-regular language.
2.4.5 Non-Regular Languages
A language which cannot be described by a regular expression is called a non-regular
language, and the set represented by such a language is called a non-regular set.
There are languages which cannot be described by regular expressions. Example: we
cannot write a regular expression to check whether a given string is a palindrome.
Similarly, we cannot write a regular expression to check whether a string has balanced
parentheses.
Regular expressions cannot count or match arbitrarily nested structure; such information
cannot be represented by a regular expression.
2.5 Recognition of Tokens
For a programming language there are various types of tokens, such as identifiers, keywords,
constants, and operators. A token is usually represented as a pair of token type and
token value.

The token type tells us the category of the token, and the token value gives us information
about the token; the token value is also called the token attribute. During lexical analysis
a symbol table is maintained, and for identifiers and constants the token value can be a
pointer into the symbol table. The lexical analyzer reads the input program and generates the
symbol table for its tokens.
For example:
We will consider some encoding of tokens as follows.

Consider, a program code as

Our lexical analyzer will generate the following token stream:

1, (8,1), (5,100), (7,1), (6,105), (8,2), (5,107), (9,1), (6,110), 2, (5,107), 10, (5,107), (9,2), (6,110)
The corresponding symbol table for identifiers and constants will be,

In the above example the scanner scans the input string and recognizes "if" as a keyword; it
returns token type 1, since in the given encoding code 1 indicates the keyword "if", and hence 1
is at the beginning of the token stream. Next is the pair (8,1), where 8 indicates a parenthesis
and 1 indicates the opening parenthesis ‘(’. Then the scanner reads ‘a’, recognizes it as an
identifier, and searches the symbol table to check whether an entry for it is already present. If
not, it inserts the information about this identifier into the symbol table and returns its location,
100. If the same identifier or variable is already present in the symbol table, the lexical analyzer
does not insert it again; instead it returns the location where it is present.
2.6 Block Schematic of Lexical Analyzer
Lexical analysis is the process of recognizing tokens from the input source program. The
lexical analyzer stores the input in a buffer and builds regular expressions for the corresponding
tokens. From these regular expressions, finite automata are built. When a lexeme matches the
pattern generated by a finite automaton, the specific token is recognized. The block schematic
for this process is shown in the figure below.

While constructing the lexical analyzer, we first design the regular expressions for recognizing
the corresponding tokens. A diagram representing the recognition process, called a transition
diagram, is then built. The transition diagram elaborates the actions to be taken while recognizing
the token. The lexeme is stored in an input buffer, and the forward pointer scans the input
character by character, moving from left to right. The transition diagram keeps track of the
information about the characters seen as the forward pointer scans the input. Positions in a
transition diagram are called states and are drawn as circles; the edges in the diagram
represent the transitions from one state to another.
There is a special state called the start state, which denotes the
starting of the transition diagram. From the start state, we start
recognizing the tokens. After recognizing a token, we should reach
a final state. In Fig. 2.9, S1 is the start state and S2 is the final state.
Let us take some examples to understand the concept of a transition table. A transition table
is a tabular representation of a transition diagram. We will first design the transition diagram and
then build the transition table.

The finite state machine (FSM) can be

defined as a collection of 5 tuples (Q, ∑, δ, q0, F) where:
Q is a finite, non-empty set of states
∑ is the input alphabet, indicating the input symbols
q0 in Q is the initial state
F is the set of final states
δ is the transition function, which determines the next state.

The finite automata can be represented as:

A finite automaton is a

mathematical representation of a
finite state machine. The machine has an input tape in which the input is placed, occupying one
character per cell. The tape head reads a symbol from the input tape. The finite control is always
in one of the internal states, and it decides what the next state will be after the tape head reads
the input. For example, suppose the current state is q1 and the tape is pointing to c; it is the
finite control which decides what the next state will be on input c.
Example 1: Design an FA which accepts only the input 101 over the input set ∑ = {0,1}.
Solution:

Fig.2.12 for Example 1.


Note that the problem statement says that only the input 101 will be accepted. Hence in
the solution we have simply shown the transitions for input 101; no paths are shown for
other inputs.
Example 2: Design an FA which checks whether a given binary number is even or not.
Solution: A binary number is made up of 0’s and 1’s; a binary number ending in 0 is always
even, and a binary number ending in 1 is always odd. For example,
0100 is even number, equal to 4
0011 is odd number, equal to 3

So, while designing the FA we will assume one start state, one state for numbers ending in 0,
and another state for numbers ending in 1. Since we want to check whether the given binary
number is even, we make the state for 0 the final state.

Fig.2.13
for Example 2.
Example 3: Design an FA which accepts only those strings which start with 1 and end with 0.
Solution: The FA will have a start state A from which only the edge with input 1 goes to the next state.

Fig.2.14 for
Example 3.
In state B, if we read 1 we remain in B, but if we read 0 we reach state C, which is a final
state. In state C, if we read 0 we stay in C, and if we read 1 we go back to B. Note that special
care is taken for 0: if the input ends with 0, the machine is in the final state.
Example 4: Design an FA which checks whether a given unary number is divisible by 3.
Solution:

Fig.2.15 for Example 4.


A unary number is made up of 1’s: the number 3 is written in unary form as 111, the
number 5 as 11111, and so on. The unary numbers divisible by 3 are 111, 111111,
111111111, and so on. The transition table is as follows.

Example 5: Design a transition diagram for the language that accepts strings consisting of
any number of a’s followed by b, over the set {a,b}.
Solution:

• R.E = a*b
• The strings belonging to this language are {b, ab, aab, aaab, …}
2.7 Conversion of NFA to DFA
In an NFA, when a specific input is given in the current state, the machine can go to multiple
states: it can have zero, one, or more than one move on a given symbol. In a DFA, on the other
hand, a specific input in the current state takes the machine to exactly one state: a DFA has
exactly one move on a given input symbol.
Let M = (Q, ∑, δ, q0, F) be an NFA which accepts the language L(M). There is an
equivalent DFA, denoted by M’ = (Q’, ∑, δ’, q0’, F’), such that L(M) = L(M’).
Steps for converting an NFA to a DFA
Step 1: Initially Q’ = Φ.
Step 2: Add the start state q0 of the NFA to Q’. Then find the transitions from this start state.
Step 3: For each state in Q’, find the set of states reachable on each input symbol. If this set of
states is not in Q’, add it to Q’.
Step 4: The final states of the DFA are all the states which contain a state of F (the final
states of the NFA).

Example 1: Convert the given NFA to DFA

Solution: For the given transition diagram we will first construct the transition table.
State 0 1
q0 {q0} {q1}
q1 {q1,q2} {q1}
*q2 {q2} {q1,q2}
Now we obtain the δ’ transitions for state q0:
δ’([q0],0) = [q0]
δ’([q0],1) = [q1]
The δ’ transitions for state q1 are obtained as:
δ’([q1],0) = [q1,q2] (new state generated)
δ’([q1],1) = [q1]
The δ’ transitions for state q2 are obtained as:
δ’([q2],0) = [q2]
δ’([q2],1) = [q1,q2]
Now we obtain the δ’ transitions on [q1,q2]:
δ’([q1,q2],0) = δ(q1,0) ∪ δ(q2,0)
= {q1,q2} ∪ {q2}
= [q1,q2]
δ’([q1,q2],1) = δ(q1,1) ∪ δ(q2,1)
= {q1} ∪ {q1,q2}
= [q1,q2]
The state [q1, q2] is the final state as well because it contains a final state q2. The transition
table for the constructed DFA will be.

State 0 1
[q0] [q0] [q1]
[q1] [q1,q2] [q1]
*[q2] [q2] [q1,q2]
*[q1,q2] [q1,q2] [q1,q2]

The transition diagram will be:

(Transition diagram of the DFA: [q0] loops on 0 and goes to [q1] on 1; [q1] loops on 1 and
goes to [q1,q2] on 0; [q1,q2] loops on both 0 and 1; [q2] loops on 0 and goes to [q1,q2] on 1.)

The state [q2] can be eliminated because it is unreachable.


Example 2: Convert the given NFA to DFA

(Transition diagram of the NFA: on 0, q0 goes to both q0 and q1; on 1, q0 goes to q1; on 1,
q1 goes to both q0 and q1; q1 is the final state.)
Solution: For the given transition diagram we will first construct the transition table.

State 0 1
q0 {q0,q1} {q1}
*q1 Φ {q0,q1}

Now we obtain the δ’ transitions for state q0:

δ’([q0],0) = {q0, q1}
= [q0, q1] (new state obtained)
δ’([q0],1) = {q1} = [q1]
The δ’ transitions for state q1 are obtained as:
δ’([q1],0) = Φ

δ’([q1],1) = [q0, q1]
Now we obtain the δ’ transitions on [q0, q1]:
δ’([q0, q1],0) = δ(q0,0) ∪ δ(q1,0)
= {q0,q1} ∪ Φ
= [q0,q1]
Similarly,
δ’([q0, q1],1) = δ(q0,1) ∪ δ(q1,1)
= {q1} ∪ {q0,q1}
= [q0,q1]
Since q1 is a final state of the given NFA, any DFA state containing q1 becomes a final
state. Hence in the DFA the final states are [q1] and [q0, q1], i.e. the set of final states is
F = {[q1], [q0, q1]}.
The transition table for the constructed DFA will be:

State 0 1
[q0] [q0,q1] [q1]
*[q1] Φ [q0,q1]
*[q0,q1] [q0,q1] [q0,q1]

The transition diagram will be:

(Transition diagram of the DFA: [q0] goes to [q0,q1] on 0 and to [q1] on 1; [q1] goes to
[q0,q1] on 1 and has no move on 0; [q0,q1] loops on both 0 and 1; [q1] and [q0,q1] are
final states.)

2.8 DFA (Deterministic Finite Automata)
"Deterministic" refers to the uniqueness of the computation. A finite automaton is called
deterministic when, as the machine reads the input string one symbol at a time, there is
only one path for each input from the current state to the next state. A DFA does not allow
null moves, i.e. it cannot change state without reading an input character. A DFA can
contain multiple final states. DFAs are used for lexical analysis in a compiler.
In the following diagram, we can see that from state q0, for input a, there is only one path,
which goes to q1. Similarly, from q0 there is only one path for input b, going to q2.

Formal definition of DFA


A DFA is a 5-tuple, as described in the definition of an FA.

The transition function can be defined as:

Graphical representation of DFA


A DFA can be represented by a graph called a state diagram, in which:
1. States are represented by vertices.
2. Arcs labeled with input characters show the transitions.
3. The initial state is marked with an arrow.

4. A final state is denoted by a double circle.
Example 1:

Solution: Transition diagram:

The transition table:


State 0 1
q0 q0 q1
q1 q2 q1
*q2 q2 q2

Example 2: A DFA with ∑ = {0,1} that accepts all strings starting with 0.


Solution:

Explanation:
In the above diagram we can see that when 0 is given as input to the DFA in state q0, the
DFA changes state to q1, and on a starting 0 it always reaches the final state q1. It can accept
0, 00, 01, 000, 001, etc. It cannot accept any string which starts with 1, because on a string
starting with 1 it will never reach the final state.

Example 3: A DFA with ∑ = {0,1} that accepts all strings ending with 0.

Solution:

Explanation:
In the above diagram we can see that when 0 is given as input to the DFA in state q0, the
DFA changes state to q1. It can accept any string which ends with 0, such as 00, 10, 110,
100, etc. It cannot accept any string which ends with 1, because it will never be in the final
state q1 after reading a 1; hence a string ending with 1 is rejected.

2.9 Error Recovery


A character sequence which cannot be scanned into any valid token is a lexical error.
Important facts about lexical errors:
• Lexical errors are not very common, but they must be managed by the scanner.
• Misspellings of identifiers, operators, and keywords are considered lexical errors.
• Generally, a lexical error is caused by the appearance of some illegal character, mostly
at the beginning of a token.
Here are a few of the most common error recovery techniques:
• Remove one character from the remaining input.
• In panic mode, successive characters are ignored until a well-formed token is reached.
• Insert the missing character into the remaining input.
• Replace one character with another.
• Transpose two adjacent characters.
2.10 A Typical Lexical Analyzer Generator
For efficient compiler construction, various tools have been built for constructing lexical
analyzers from special-purpose notations called regular expressions, which are used to
recognize tokens. A tool called LEX takes such a specification: LEX is a utility which
generates the lexical analyzer. In the LEX tool, designing the regular expressions for the
corresponding tokens is the central task.
The lex specification file is created with the extension .l (pronounced "dot l"), for example
firstLab.l. This firstLab.l is given to the LEX compiler to produce lex.yy.c. lex.yy.c is a C
program which is the actual lexical analyzer. The specification file stores the regular
expressions for the tokens; lex.yy.c contains a tabular representation of the transition
diagrams constructed for the regular expressions of the specification file firstLab.l. Lexemes
are recognized with the help of these tabular transition diagrams and some standard routines.
In the LEX specification file, actions are associated with each regular expression; these
actions are simply pieces of C code and are carried over directly into lex.yy.c. Finally, the
compiler (here g++) compiles the generated lex.yy.c and produces an object program a.exe.
When an input stream is given to a.exe, the sequence of tokens is generated.
The above scenario is modeled below:

2.10.1 Lex Program


Lex is a program that generates lexical analyzers, and it is commonly used with the YACC
parser generator. The generated lexical analyzer transforms an input stream into a sequence
of tokens: it reads the input stream and implements the recognition of tokens as a C program.

2.10.2 Function of Lex
First, the user writes a Lex program firstLab.l in the Lex language. Then the Lex compiler
runs on firstLab.l and produces the C program lex.yy.c. Finally, the compiler (g++) compiles
lex.yy.c and produces an object program a.exe; a.exe is the lexical analyzer that transforms
an input stream into a sequence of tokens.
2.10.3 Lex File Format
A Lex program is separated into three sections by %% delimiters. The format of Lex source is
as follows:
%{
Declaration Section
%}
%%
Rule Section
%%
Auxiliary Procedure Section
1. In the declaration section, declarations of variables and constants can be made. Some regular
definitions can also be written in this section; the regular definitions are basically components
of the regular expressions appearing in the rule section.
2. The rule section consists of regular expressions with associated actions. The translation rules
are given in the form:
R1 {action1}
R2 {action2}
. . .
Rn {actionn}
where each Ri is a regular expression and actioni is a program fragment describing what
action is to be taken for the corresponding regular expression. These actions are specified
as pieces of C code.
3. The auxiliary procedure section defines all the required procedures; these procedures may
be needed by the actions in the rule section. The lexical analyzer (scanner) works in
coordination with the parser. When activated by the parser, the lexical analyzer begins reading
its remaining input, one character at a time, until the string read is matched by one of the
regular expressions Ri; then the corresponding actioni is executed, and actioni returns control
to the parser. The search is repeated, lexeme by lexeme, in order to return all the tokens of
the source string. The lexical analyzer ignores white space and comments in this process.

Sample Lex Program to understand the above concept
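The listing referred to here might look like the following minimal sketch (the identifier rule and messages are assumptions for illustration; the exact lab listing may differ):

```lex
%{
/* A minimal sketch, assumed rather than the exact lab listing:
   print a welcome message, then echo each identifier found. */
#include <stdio.h>
%}
%%
[ \t\n]+                 { /* skip white space */ }
[a-zA-Z][a-zA-Z0-9]*     { printf("identifier: %s\n", yytext); }
.                        { /* ignore anything else */ }
%%
int main() {
    printf("Welcome to Compiler Design Lab Session\n");
    yylex();             /* start scanning the input stream */
    return 0;
}
int yywrap() { return 1; }   /* 1 = stop at end of file */
```

The three %%-delimited sections match the file format described in Section 2.10.3: declarations, rules, and auxiliary procedures.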

This is a simple program that displays the string "Welcome to Compiler Design Lab Session".
In the main function a call to the yylex routine is made; this function is defined in the lex.yy.c
program.
yytext: when the lexer matches or recognizes a token in the input, the lexeme is stored in a
null-terminated string called yytext.
yylex(): the central function; as soon as yylex() is called, the scanner starts scanning the
source program.
yywrap(): called when the scanner encounters end of file. If yywrap() returns 0 the scanner
continues scanning; when yywrap() returns 1, end of file has been reached.
yyin: the standard input file pointer from which the source program is read.
yyleng: when the lexer recognizes a token, the lexeme is stored in yytext, and yyleng stores
the number of characters in it; the value of yyleng is the same as strlen(yytext).
Before running and executing the above lex program you have to install and configure some
software, such as CodeBlocks and Flex, as follows (in my case).
Let Us See Steps to Install, Configure and Integrate CodeBlocks, Flex and Other Tools
1. First, install CodeBlocks in an appropriate directory.
2. Install Flex in an appropriate directory.
3. Set the path as follows:
Go to CodeBlocks -> MinGW -> bin and copy the address of bin; it will look something like
C:\Program Files (x86)\CodeBlocks\MinGW\bin.
Open Control Panel -> System -> Advanced Settings -> Environment Variables -> System
Variables, click on Path inside System Variables, click Edit, click New, and paste the
copied path C:\Program Files (x86)\CodeBlocks\MinGW\bin.
Press OK.
4. Set the path for Flex in the same way:
5. Go to GnuWin32 -> bin and copy the address of bin; it should look like C:\GnuWin32\bin.
6. Open Control Panel -> System -> Advanced Settings -> Environment Variables -> System
Variables, click on Path inside System Variables, click Edit, click New, and paste the
copied path C:\GnuWin32\bin.
Now Let Us Write a Lex Program and Execute It
1. Create a folder on the Desktop with the name LexProgram or any name you want.
2. Open Notepad and type a Lex program.
3. Save it inside the folder as filename.l.
4. Make sure that while saving you choose "All Files" rather than "Text Document".
5. Go to the command prompt (cmd).
6. Go to the directory where you have saved the program.
7. Type the command: flex filename.l
8. Type the command: g++ lex.yy.c
9. Execute/run from the Windows command prompt: a.exe
The output after executing the lex program looks like the one below.
