
Compiler Design

Unit 1

Introduction to Compiler: Phases and passes, Bootstrapping, Finite state machines and regular
expressions and their applications to lexical analysis, Optimization of DFA-based pattern
matchers, implementation of lexical analyzers, lexical-analyzer generator, LEX compiler, Formal
grammars and their application to syntax analysis, BNF notation, ambiguity, YACC. The
syntactic specification of programming languages: Context free grammars, derivation and parse
trees, capabilities of CFG.

Introduction to Compiler

o A compiler is a translator that converts a high-level language into machine language.
o High-level language is written by a developer; machine language can be understood by
the processor.
o A compiler also reports errors in the source program to the programmer.
o The main purpose of a compiler is to translate code written in one language into another
without changing the meaning of the program.
o When we execute a program written in a high-level programming language, execution
proceeds in two parts.
o In the first part, the source program is compiled and translated into the object program
(low-level language).
o In the second part, the object program is translated into the target program by the
assembler.

Role of Compilers
A program written in a high-level language cannot run without compilation. Each
programming language has its own compiler, but the fundamental tasks performed by all
compilers remain the same. Translating source code into machine code involves multiple
stages, such as lexical analysis, syntax analysis, semantic analysis, code generation, and
optimization.
Compilers are specialized translators. A translator or language processor, in general, is a tool
that converts an input program written in one programming language into an equivalent
program in another language.
Language Processing Systems
We know a computer is a logical assembly of software and hardware. The hardware understands
a language that is hard for us to grasp; consequently, we tend to write programs in a high-level
language that is much easier for us to comprehend and maintain. These programs then go
through a series of transformations so that they can readily be used by machines. This is where
language processing systems come in handy.

High-Level Language to Machine Code

 High-Level Language: A program that contains pre-processor directives such as #include
or #define is written in a high-level language (HLL). High-level languages are closer to
humans but far from machines. These (#) tags are called preprocessor directives, and they
direct the pre-processor about what to do.
 Pre-Processor: The pre-processor resolves all the #include directives (file inclusion) and
all the #define directives (macro expansion). It performs file inclusion, augmentation,
macro-processing, etc. For example, if the source program contains #include "stdio.h",
the pre-processor replaces this directive with the contents of that file in the produced
output.
 Assembly Language: It’s neither in binary form nor high level. It is an intermediate state
that is a combination of machine instructions and some other useful data needed for
execution.
 Assembler: For every platform (Hardware + OS) we will have an assembler. They are not
universal since for each platform we have one. The output of the assembler is called an
object file. It translates assembly language to machine code.
 Compiler: The compiler is an intelligent program as compared to an assembler. The
compiler verifies all types of limits, ranges, errors, etc. A compiler takes more time to run
and occupies a larger amount of memory space. The speed of the compiler is slower than
other system software, because it reads through the entire program and then translates the
full program.
 Interpreter: An interpreter converts high-level language into low-level machine
language, just like a compiler. But they are different in the way they read the input. The
Compiler in one go reads the inputs, does the processing, and executes the source code
whereas the interpreter does the same line by line. A compiler scans the entire program
and translates it as a whole into machine code whereas an interpreter translates the
program one statement at a time. Interpreted programs usually run slower than compiled
ones.
 Relocatable Machine Code: It can be loaded at any address in memory and run. The
addresses within the program are arranged so that they remain valid when the program is
moved.
 Loader/Linker: The loader/linker converts the relocatable code into absolute code and tries
to run the program, resulting in a running program or an error message (or sometimes both
can happen). The linker combines a variety of object files into a single file to make it
executable. Then the loader loads it into memory and executes it.
o Linker: The basic work of a linker is to merge the object codes produced by the
compiler and assembler with standard library functions and operating system
resources, resolving the references between them.
o Loader: The code generated by the compiler, assembler, and linker is generally
relocatable by nature, which means the starting location of the code is not
fixed; it can be placed anywhere in computer memory. Thus the basic task of
the loader is to find/calculate the exact addresses of these memory locations
and load the program for execution.
Overall, compiler design is a complex process that involves multiple stages and requires a
deep understanding of both the programming language and the target platform. A well-
designed compiler can greatly improve the efficiency and performance of software programs,
making them more useful and valuable for users.
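
As a concrete illustration of this pipeline, consider the minimal C program below. The comments
mark which component of the language processing system handles each construct, and the header
comment lists the standard GCC options that stop the pipeline after each stage (this assumes a
Unix-like system with GCC installed; the file name pipeline.c is arbitrary).

/* pipeline.c -- a minimal program for observing the language processing stages.
 *
 * Typical commands to stop after each stage:
 *   gcc -E pipeline.c -o pipeline.i   (pre-processor only: file inclusion, macro expansion)
 *   gcc -S pipeline.i -o pipeline.s   (compiler proper: C to assembly language)
 *   gcc -c pipeline.s -o pipeline.o   (assembler: assembly to relocatable object code)
 *   gcc pipeline.o -o pipeline        (linker: object code to an absolute executable)
 */
#include <stdio.h>            /* resolved by the pre-processor (file inclusion)  */
#define SQUARE(x) ((x) * (x)) /* resolved by the pre-processor (macro expansion) */

int main(void)
{
    /* translated to assembly by the compiler proper; the call to printf
       is resolved later, by the linker, against the C standard library */
    printf("5 squared is %d\n", SQUARE(5));
    return 0;
}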

Types of Compiler
 Self/Native Compiler: When the compiler runs on the same machine and produces
machine code for the same machine on which it is running, it is called a self compiler,
resident compiler, or native compiler. Example: Turbo C or GCC. If a compiler runs on
a Windows machine and produces executable code for Windows, it is a native compiler.
 Cross Compiler: A compiler may run on one machine and produce machine code for
another computer; in that case it is called a cross-compiler. It is capable of creating
code for a platform other than the one on which the compiler is running. Example: a
compiler that runs on a Linux/x86 box but builds a program that will run on a separate
Arduino/ARM board. If a compiler runs on a Linux machine and produces executable
code for Windows, then it is a cross-compiler.
 Source-to-Source Compiler: A Source-to-Source Compiler or transcompiler or transpiler
is a compiler that translates source code written in one programming language into the
source code of another programming language.
 Single Pass Compiler: When all the phases of the compiler are present inside a single
module, it is simply called a single-pass compiler. It performs the work of converting
source code to machine code.
 Two Pass Compiler: A two-pass compiler translates the program twice: once by the front
end and once by the back end.
 Multi-Pass Compiler: When several intermediate codes are created in a program and a
syntax tree is processed many times, it is called Multi-Pass Compiler. It breaks codes into
smaller programs.
 Just-in-Time (JIT) Compiler: It is a type of compiler that converts code into machine
language during program execution, rather than before it runs. It combines the benefits of
interpretation (real-time execution) and traditional compilation (faster execution).
 Ahead-of-Time (AOT) Compiler: It converts the entire source code into machine code
before the program runs. This means the code is fully compiled during development,
resulting in faster startup times and better performance at runtime.
 Incremental Compiler: It compiles only the parts of the code that have changed, rather
than recompiling the entire program. This makes the compilation process faster and more
efficient, especially during development.

Phases of a Compiler
There are two major phases of compilation, which in turn have many parts. Each of them
takes input from the output of the previous level and works in a coordinated way.

Analysis Phase
An intermediate representation is created from the given source code:
 Lexical Analyzer
 Syntax Analyzer
 Semantic Analyzer
 Intermediate Code Generator

Synthesis Phase
An equivalent target program is created from the intermediate representation. It has two parts:
 Code Optimizer
 Code Generator
Phases of Compiler
In a compiler, there are the following six phases:

Points to Remember
A token is a sequence of characters representing a lexical unit that matches a pattern, such as an
operator, identifier, or keyword.
A lexeme is a sequence of characters in the source program that matches the pattern for a
token. Simply, it is an instance of a token.
A pattern describes the rule that lexemes follow. For a keyword token, the pattern is the sequence
of characters forming the keyword; for identifiers, the pattern is matched by strings of a
particular form.

 Lexical analysis: This is the first phase of the compiler; it converts the high-level input
program into a sequence of tokens. These tokens are sequences of characters that are treated
as a unit in the grammar of the programming language. Lexical analysis can be implemented
with a deterministic finite automaton. The output is a sequence of tokens sent to the parser
for syntax analysis.
 Syntactic analysis or Parsing: Also known as syntax analysis; this is the second phase,
where the provided input is scanned to validate its structure against the grammar. The
syntactical structure is analyzed and inspected to check whether the given input is correct in
terms of programming syntax.
 Semantic analysis: In the third phase, the compiler checks whether the parse tree follows
the rules of the language. The compiler keeps track of identifiers and expressions. The
semantic analyzer determines the validity of the parse tree, and an annotated syntax tree is
the output.
 Intermediate code generation: Once the parse tree is semantically verified, an
intermediate code generator produces three-address code: a middle-level language code
generated by the compiler during the translation of the source program into object code.
 Code optimizer: This is an optional phase of the compiler that is used for optimizing the
intermediate code. Through this optimization, the program runs faster and consumes less
space. To increase program speed, unnecessary code is eliminated and sequences of
statements are reorganized.
 Code generation: This is the final phase of the compiler, in which the fully optimized
intermediate code is taken as input and translated into machine code. (A worked example
covering all six phases is given below.)
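
To make these six phases concrete, here is a schematic walkthrough for the statement
a = b + c * 2. This is an illustration only: exact token names, tree shapes, and target
instructions vary from compiler to compiler, and the register-style instructions at the end
are hypothetical.

Source statement:      a = b + c * 2 ;

1. Lexical analysis:   id(a) = id(b) + id(c) * num(2) ;
2. Syntax analysis:    parse tree for a = (b + (c * 2)), since * binds
                       tighter than + in the grammar
3. Semantic analysis:  types of a, b, c are checked; the tree is annotated
                       (e.g., an int-to-float conversion is inserted for 2
                       if the variables are floats)
4. Intermediate code:  t1 = c * 2
                       t2 = b + t1
                       a = t2
5. Code optimization:  t1 = c * 2
                       a = b + t1        (the copy through t2 is removed)
6. Code generation:    LOAD R1, c
                       MUL R1, #2
                       ADD R1, b
                       STORE a, R1       (schematic machine instructions)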
Compiler Passes
A compiler pass refers to the traversal of a compiler through the entire
program. Compiler passes are of two types: Single Pass Compiler, and Two Pass
Compiler or Multi-Pass Compiler. These are explained as follows.
Types of Compiler Pass
1. Single Pass Compiler
If we combine or group all the phases of the compiler into a single module, it is known as a
single pass compiler.

Single Pass Compiler

In the above diagram, all six phases are grouped into a single module. Some points about the
single pass compiler:
 A one-pass/single-pass compiler is a type of compiler that passes through each part of each
compilation unit exactly once.
 A single pass compiler is faster and smaller than a multi-pass compiler.
 A disadvantage of a single-pass compiler is that it is less efficient in comparison with a
multi-pass compiler.
 A single pass compiler processes the input exactly once, going directly from lexical
analysis to code generation and then returning to read the next portion of the input.
Note: Single pass compilation is rarely done in practice; early Pascal compilers are a
well-known example of this approach.
Problems with Single Pass Compiler
 We cannot optimize very well, because the context available for expressions is limited.
 Since we cannot back up and reprocess the input, the grammar must be limited or simplified.
 Command interpreters such as bash/sh/tcsh can be considered single pass compilers, but
they also execute entries as soon as they are processed.
2. Two-Pass compiler or Multi-Pass compiler
A two-pass/multi-pass compiler is a type of compiler that processes the source
code or abstract syntax tree of a program multiple times. In a multi-pass compiler, we divide
the phases into two passes as follows:

The first pass is referred to as the:

 Front end
 Analysis part
 Platform-independent part

The second pass is referred to as the:

 Back end
 Synthesis part
 Platform-dependent part
Problems that can be Solved With Multi-Pass Compiler
First: If we want to design compilers for several different programming languages for the same
machine, then for each programming language we only need to build a new front end (first
pass), while a single back end (second pass) can be shared.
Second: If we want to design compilers for the same programming language for different
machines/systems, then we build a different back end for each machine/system and share the
single front end for the programming language.
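
The practical payoff of this front end/back end split is simple arithmetic: to support m
programming languages on n machines, a multi-pass design needs only m front ends plus n back
ends, i.e., m + n components, instead of m × n complete compilers. For example, 3 languages on
2 machines require 3 + 2 = 5 components rather than 3 × 2 = 6, and the saving grows quickly as
m and n increase.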

Bootstrapping

Bootstrapping is an approach for producing a self-compiling compiler, that is, a compiler
written in the source programming language that it is intended to compile. A bootstrap compiler
can compile the compiler itself, and we can then use this compiled compiler to compile
everything else, including future versions of the compiler.

o Bootstrapping is widely used in compiler development.


o Bootstrapping is used to produce a self-hosting compiler. Self-hosting compiler is a type
of compiler that can compile its own source code.
o Bootstrap compiler is used to compile the compiler and then you can use this compiled
compiler to compile everything else as well as future versions of itself.
A compiler can be characterized by three languages:

1. Source Language
2. Target Language
3. Implementation Language

The T-diagram shows a compiler SCIT: a compiler for source language S and target language T, implemented in language I.

Follow these steps to produce a compiler for a new language L on machine A:

1. Create a compiler SCAA for a subset S of the desired language L, written in language A; this
compiler runs on machine A.

2. Create a compiler LCSA for the full language L, written in the subset S of L.

3. Compile LCSA using the compiler SCAA to obtain LCAA. LCAA is a compiler for language L,
which runs on machine A and produces code for machine A.

The process described by the T-diagrams is called bootstrapping.

Cross Compiler
A compiler is characterized by three languages as its source language, its object language, and
the language in which it is written. These languages may be quite different. A compiler can run
on one machine and produce target code for another machine. Such a compiler is known as a
cross-compiler.
 Suppose we can write a cross-compiler for a new language L, implemented in an existing
language S, that generates code for machine N; denote it LSN.
 If an existing compiler for S runs on machine M and generates code for M, it is denoted by
SMM.
 If LSN is run through SMM, we get a compiler LMN, i.e., a compiler from L to N that runs on M.

Example − Create a cross compiler using bootstrapping, given that SSM (a compiler from S to M, written in S) is run through SAA (a compiler from S to A, written in A).

Solution − First of all, represent the two compilers with T-diagrams.

When SSM is run through SAA, SAM will be generated: a compiler from S to M that runs on machine A.


Finite state machine

o A finite state machine is used to recognize patterns.


o A finite automaton takes a string of symbols as input and changes its state
accordingly. When a desired symbol is found in the input, a transition occurs.
o During a transition, the automaton can either move to the next state or stay in the same state.
o An FA has two possible outcomes for an input string: accept or reject. When the input
string has been processed successfully and the automaton is in a final state, the string is
accepted.

A finite automaton consists of the following:


Q: finite set of states
∑: finite set of input symbols
q0: initial state
F: set of final states
δ: transition function

The transition function is defined as

δ: Q × ∑ → Q

FA is classified into two types:

1. DFA (deterministic finite automata)


2. NDFA (non-deterministic finite automata)

DFA
DFA stands for Deterministic Finite Automata. "Deterministic" refers to the uniqueness of the
computation. In a DFA, each input character takes the automaton to exactly one state. A DFA
does not accept the null move, which means it cannot change state without consuming an input
character.

A DFA is defined by five tuples {Q, ∑, q0, F, δ}:

Q: set of all states
∑: finite set of input symbols
q0: initial state
F: set of final states
δ: transition function, δ: Q × ∑ → Q

Example
See an example of a deterministic finite automaton:

1. Q = {q0, q1, q2}
2. ∑ = {0, 1}
3. q0 = {q0}
4. F = {q2}
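
The example above lists Q, ∑, q0, and F but omits the transition function δ. To show how a
DFA is actually executed, here is a small C sketch; the transition table is an assumption chosen
for illustration (it makes the machine accept binary strings ending in "01") and is not taken
from the notes.

#include <stdio.h>

/* States q0, q1, q2 over the alphabet {0, 1}.
   Hypothetical transition table: this DFA accepts strings ending in "01". */
enum { Q0, Q1, Q2, NUM_STATES };

static const int delta[NUM_STATES][2] = {
    /*           input 0  input 1 */
    /* Q0 */   {   Q1,      Q0   },
    /* Q1 */   {   Q1,      Q2   },
    /* Q2 */   {   Q1,      Q0   },
};

static const int accepting[NUM_STATES] = { 0, 0, 1 };  /* F = {q2} */

/* Run the DFA on a string of '0'/'1' characters; return 1 if accepted. */
int dfa_accepts(const char *input)
{
    int state = Q0;                           /* q0 is the initial state     */
    for (; *input; input++)
        state = delta[state][*input - '0'];   /* exactly one move per symbol */
    return accepting[state];                  /* accept iff we end in F      */
}

int main(void)
{
    const char *tests[] = { "01", "1101", "10", "0" };
    for (int i = 0; i < 4; i++)
        printf("%s -> %s\n", tests[i], dfa_accepts(tests[i]) ? "accept" : "reject");
    return 0;
}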
NDFA
NDFA refers to Non-Deterministic Finite Automata. For a particular input, an NDFA may
transition to any number of states. An NDFA accepts the NULL (ε) move, which means it can
change state without reading a symbol.

An NDFA also has the same five tuples as a DFA, but it has a different transition function,
defined as:

δ: Q × ∑ → 2^Q

Example
See an example of non deterministic finite automata:

1. Q = {q0, q1, q2}


2. ∑ = {0, 1}
3. q0 = {q0}
4. F = {q2}

Regular expression
o A regular expression is a pattern that defines a set of strings. Regular expressions are
used to denote regular languages.
o They are also used to match character combinations in strings. String searching
algorithms use these patterns to perform operations on strings.
o In a regular expression, x* means zero or more occurrences of x. It can generate {ε, x, xx,
xxx, xxxx, .....}
o In a regular expression, x+ means one or more occurrences of x. It can generate {x, xx, xxx,
xxxx, .....}

Operations on Regular Language


The various operations on regular language are:

Union: If L and M are two regular languages, then their union L ∪ M is also regular.

L ∪ M = {s | s is in L or s is in M}

Intersection: If L and M are two regular languages, then their intersection L ∩ M is also regular.

L ∩ M = {s | s is in L and s is in M}

Kleene closure: If L is a regular language, then its Kleene closure L* is also a regular
language.

L* = zero or more occurrences of language L.

Example
Write the regular expression for the language:

L = {ab^n w : n ≥ 3, w ∈ (a, b)+}

Solution:
Every string of L starts with "a" followed by at least three b's, and then contains at least one
more "a" or "b". Strings of the language look like abbba, abbbbbba, abbbbbbbb, abbbb.....a

So the regular expression is:

r = abbbb* (a+b)+

Here + is the positive closure, i.e., (a+b)+ = (a+b)* − {ε}

Example 1: Write the regular expression for the language accepting all combinations of a's,
over the set ∑ = {a}

Solution: All combinations of a's means 'a' may appear zero, one, two, or more times. If a
appears zero times, we get the null string. That is, we expect the set {ε, a, aa, aaa, ....}. So
we give the regular expression for this as:
R = a*
That is, the Kleene closure of a.

Example 2: Write the regular expression for the language accepting all combinations of a's
except the null string, over the set ∑ = {a}

Solution: The regular expression has to be built for the language L = {a, aa, aaa, ....}. This set
indicates that there is no null string. So we can denote regular expression as:
R = a+

Example 3: Write the regular expression for the language accepting all the string
containing any number of a's and b's.

Solution: The regular expression will be:


R.E. = (a + b)*
This will give the set as L = {ε, a, aa, b, bb, ab, ba, aba, bab, .....}, any combination of a and b.
The (a + b)* shows any combination with a and b even a null string.

Example 4: Write the regular expression for the language accepting all the string which are
starting with 1 and ending with 0, over ∑ = {0, 1}.
Solution: In a regular expression, the first symbol should be 1, and the last symbol should be 0.
The R.E. is as follows:
R = 1 (0+1)* 0

Example 5: Write the regular expression for the language of strings starting and ending with a
and having any combination of b's in between.

Solution: The regular expression will be:


R = a b* a

Example 6: Write the regular expression for the language starting with a but not having
consecutive b's.

Solution: The regular expression has to be built for the language: L = {a, aa, ab, aab, aba,
aaa, abab, .....}. The regular expression for the above language is:
R = (a + ab)*
(Strictly, to exclude the null string, one may write (a + ab)+.)

Example 7: Write the regular expression for the language accepting all the string in which
any number of a's is followed by any number of b's is followed by any number of c's.

Solution: As we know, any number of a's means a*, any number of b's means b*, and any number
of c's means c*. Since, as given in the problem statement, b's appear after a's and c's appear
after b's, the regular expression is:
R = a*b*c*
Regular expressions are fundamental in lexical analysis, the initial phase of a compiler that transforms
source code into tokens. They define patterns for valid tokens—such as keywords, identifiers, literals, and
operators—enabling the lexical analyzer to recognize and categorize these elements.

Role/Application of Regular Expressions in Lexical Analysis:

1. Token Definition: Regular expressions specify the structure of tokens. For example, an
identifier in many programming languages can be defined as a sequence starting with a
letter or underscore, followed by letters, digits, or underscores. This pattern can be
expressed as:

[a-zA-Z_][a-zA-Z0-9_]*

2. Finite Automata Generation: Regular expressions can be converted into finite


automata, which are used by lexical analyzers to recognize tokens. This conversion
process is well-established and efficient, making it a standard approach in compiler
design.
3. Tool Support: Tools like Flex (Fast Lexical Analyzer) and RE/flex (regex-centric, fast
lexical analyzer) utilize regular expressions to generate lexical analyzers. These tools
automate the creation of scanners based on regular expression patterns, streamlining the
development process.

Example:

Consider a simple lexical analyzer specification using regular expressions:


%%
[0-9]+ { printf("Integer: %s\n", yytext); }
[a-zA-Z_][a-zA-Z0-9_]* { printf("Identifier: %s\n", yytext); }
"+" { printf("Plus operator\n"); }
"-" { printf("Minus operator\n"); }
"=" { printf("Assignment operator\n"); }
[ \t\n]+ { /* Ignore whitespace */ }
. { printf("Unknown character: %s\n", yytext); }
%%

In this example, each regular expression pattern corresponds to a specific token type. The lexical
analyzer uses these patterns to identify and categorize parts of the input source code.
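
To try this specification, one would typically save it in a file such as scanner.l (a
hypothetical name) and build it with Flex. Linking with -lfl supplies default main() and
yywrap() functions, so the rules section above is enough for a working scanner:

flex scanner.l                 # generates the C source file lex.yy.c
cc lex.yy.c -o scanner -lfl    # compile and link against the Flex library
echo "x = y + 42" | ./scanner  # expected: Identifier, Assignment, Identifier, Plus, Integer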

Regular expressions are essential in lexical analysis for defining token patterns, generating finite
automata for recognition, and facilitating the use of tools that automate the creation of lexical
analyzers.

Optimization of DFA-Based Pattern Matchers

1 Important States of an NFA

2 Functions Computed From the Syntax Tree

3 Computing nullable, firstpos, and lastpos

4 Computing followpos

5 Converting a Regular Expression Directly to a DFA

6 Minimizing the Number of States of a DFA

7 State Minimization in Lexical Analyzers

8 Trading Time for Space in DFA Simulation

There are three algorithms that have been used to implement and optimize pattern matchers
constructed from regular expressions.

 The first algorithm is useful in a Lex compiler, because it constructs a DFA directly from
a regular expression, without constructing an intermediate NFA. The resulting DFA also
may have fewer states than the DFA constructed via an NFA.
 The second algorithm minimizes the number of states of any DFA, by combining states
that have the same future behavior. The algorithm itself is quite efficient, running in
time O(n log n), where n is the number of states of the DFA.
 The third algorithm produces more compact representations of transition tables than the
standard, two-dimensional table.

1. Important States of an NFA


To begin our discussion of how to go directly from a regular expression to a DFA, we must first
dissect the NFA construction of Algorithm 3.23 and consider the roles played by various states.
We call a state of an NFA important if it has a non-ε out-transition. Notice that the subset
construction (Algorithm 3.20) uses only the important states in a set T when it computes
ε-closure(move(T, a)), the set of states reachable from T on input a. That is, the set of
states move(s, a) is nonempty only if state s is important. During the subset construction, two sets
of NFA states can be identified (treated as if they were the same set) if they:

o Have the same important states, and


o Either both have accepting states or neither does.

When the NFA is constructed from a regular expression, we can say more about
the important states. The only important states are those introduced as initial states in the basis
part for a particular symbol position in the regular expression. That is, each important state
corresponds to a particular operand in the regular expression.

The constructed NFA has only one accepting state, but this state, having no out-
transitions, is not an important state. By concatenating a unique right endmarker # to a regular
expression r, we give the accepting state for r a transition on #, making it an important state of
the NFA for (r)#. In other words, by using the augmented regular expression (r)#, we can
forget about accepting states as the subset construction proceeds; when the construction is
complete, any state with a transition on # must be an accepting state.

The important states of the NFA correspond directly to the positions in the
regular expression that hold symbols of the alphabet. It is useful, as we shall see, to represent the
regular expression by its syntax tree, where the leaves correspond to operands and the interior
nodes correspond to operators. An interior node is called a cat-node, or-node, or star-node if it is
labeled by the concatenation operator (dot), union operator |, or star operator *, respectively. We
can construct a syntax tree for a regular expression just as we did for arithmetic expressions.

Example: Figure 3.56 shows the syntax tree for the regular expression of our running example,
(a|b)*abb#. Cat-nodes are represented by circles.

Leaves in a syntax tree are labeled by ε or by an alphabet symbol. To each leaf not labeled ε, we
attach a unique integer. We refer to this integer as the position of the leaf and also as a position
of its symbol. Note that a symbol can have several positions; for instance, a has positions 1 and 3
in Fig. 3.56. The positions in the syntax tree correspond to the important states of the constructed
NFA.

Example: Figure 3.57 shows the NFA for the same regular expression as Fig. 3.56, with the
important states numbered and other states represented by letters. The numbered states in the
NFA and the positions in the syntax tree correspond in a way we shall soon see.

2. Functions Computed From the Syntax Tree

To construct a DFA directly from a regular expression, we construct its syntax tree and then
compute four functions: nullable, firstpos, lastpos, and followpos, defined as follows. Each
definition refers to the syntax tree for a particular augmented regular expression (r)#.
1. nullable(n) is true for a syntax-tree node n if and only if the subexpression represented
by n has ε in its language. That is, the subexpression can be "made null" or generate the empty
string, even though there may be other strings it can represent as well.

2. firstpos(n) is the set of positions in the subtree rooted at n that correspond to the first symbol
of at least one string in the language of the subexpression rooted at n.

3. lastpos(n) is the set of positions in the subtree rooted at n that correspond to the last symbol
of at least one string in the language of the subexpression rooted at n.

4. followpos(p), for a position p, is the set of positions q in the entire syntax tree such that there
is some string x = a1a2 ··· an in L((r)#) such that for some i, there is a way to explain the
membership of x in L((r)#) by matching ai to position p of the syntax tree and ai+1 to position q.

Example: Consider the cat-node n in Fig. 3.56 that corresponds to the expression (a|b)*a. We
claim nullable(n) is false, since this node generates all strings of a's and b's ending in an a; it
does not generate ε. On the other hand, the star-node below it is nullable; it generates ε along
with all other strings of a's and b's.

firstpos(n) = {1, 2, 3}. In a typical generated string like aa, the first position of the string
corresponds to position 1 of the tree, and in a string like ba, the first position of the string comes
from position 2 of the tree. However, when the string generated by the expression of node n is
just a, then this a comes from position 3.

lastpos(n) = {3}. That is, no matter what string is generated from the expression of node n, the
last position is the a from position 3 of the tree.

followpos is trickier to compute, but we shall see the rules for doing so shortly. Here is an
example of the reasoning: followpos(1) = {1, 2, 3}. Consider a string ···ac···, where the c is
either a or b, and the a comes from position 1. That is, this a is one of those generated by the a in
expression (a|b)*. This a could be followed by another a or b coming from the same
subexpression, in which case c comes from position 1 or 2. It is also possible that this a is the last
in the string generated by (a|b)*, in which case the symbol c must be the a that comes from
position 3. Thus, 1, 2, and 3 are exactly the positions that can follow position 1.

3 Computing nullable, firstpos, and lastpos

We can compute nullable, firstpos, and lastpos by a straightforward recursion on the height of
the tree. The basis and inductive rules for nullable and firstpos are summarized in Fig. 3.58. The
rules for lastpos are essentially the same as for firstpos, but the roles of children c1 and c2 must
be swapped in the rule for a cat-node.
Example: Of all the nodes in Fig. 3.56 only the star-node is nullable. We note from the table of
Fig. 3.58 that none of the leaves are nullable, because they each correspond to non-ε operands.
The or-node is not nullable, because neither of its children is. The star-node is nullable, because
every star-node is nullable. Finally, each of the cat-nodes, having at least one non-nullable child,
is not nullable.

The computation of firstpos and lastpos for each of the nodes is shown in Fig. 3.59, with
firstpos(n) to the left of node n, and lastpos(n) to its right. Each of the leaves has only itself
for firstpos and lastpos, as required by the rule for non-ε leaves in Fig. 3.58. For the or-node,
we take the union of firstpos at the children and do the same for lastpos. The rule for the
star-node says that we take the value of firstpos or lastpos at the one child of that node.

Now, consider the lowest cat-node, which we shall call n. To compute firstpos(n), we first
consider whether the left operand is nullable, which it is in this case. Therefore, firstpos for n is
the union of firstpos for each of its children, that is {1, 2} ∪ {3} = {1, 2, 3}. The rule for lastpos
does not appear explicitly in Fig. 3.58, but as we mentioned, the rules are the same as for
firstpos, with the children interchanged. That is, to compute lastpos(n) we must ask whether its
right child (the leaf with position 3) is nullable, which it is not. Therefore, lastpos(n) is the same
as lastpos of the right child, or {3}.
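
These rules translate almost directly into code. The following C sketch is illustrative (the node
representation and the use of an unsigned bitset for position sets, workable for positions
numbered below 32, are choices made here, not part of the original text); it computes nullable,
firstpos, and lastpos bottom-up over the syntax tree:

#include <stdbool.h>

/* Syntax-tree nodes for an (augmented) regular expression. */
enum kind { LEAF, EPSILON, OR, CAT, STAR };

struct node {
    enum kind    kind;
    int          pos;        /* position number, for LEAF nodes only     */
    struct node *c1, *c2;    /* children (c2 unused for STAR)            */
    bool         nullable;
    unsigned     firstpos;   /* bit i set  <=>  position i is in the set */
    unsigned     lastpos;
};

/* Post-order traversal applying the rules of Fig. 3.58. */
void compute(struct node *n)
{
    if (n->c1) compute(n->c1);
    if (n->c2) compute(n->c2);

    switch (n->kind) {
    case EPSILON:                     /* leaf labeled epsilon              */
        n->nullable = true;
        n->firstpos = n->lastpos = 0;
        break;
    case LEAF:                        /* leaf with position i              */
        n->nullable = false;
        n->firstpos = n->lastpos = 1u << n->pos;
        break;
    case OR:                          /* nullable(c1) or nullable(c2)      */
        n->nullable = n->c1->nullable || n->c2->nullable;
        n->firstpos = n->c1->firstpos | n->c2->firstpos;
        n->lastpos  = n->c1->lastpos  | n->c2->lastpos;
        break;
    case CAT:                         /* children's roles swap for lastpos */
        n->nullable = n->c1->nullable && n->c2->nullable;
        n->firstpos = n->c1->nullable
                    ? n->c1->firstpos | n->c2->firstpos : n->c1->firstpos;
        n->lastpos  = n->c2->nullable
                    ? n->c1->lastpos  | n->c2->lastpos  : n->c2->lastpos;
        break;
    case STAR:                        /* every star-node is nullable       */
        n->nullable = true;
        n->firstpos = n->c1->firstpos;
        n->lastpos  = n->c1->lastpos;
        break;
    }
}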

4 Computing followpos
Finally, we need to see how to compute followpos. There are only two ways that a position of a
regular expression can be made to follow another.

1. If n is a cat-node with left child c1 and right child c2, then for every position i
in lastpos(c1), all positions in firstpos(c2) are in followpos(i).

2. If n is a star-node, and i is a position in lastpos(n), then all positions in firstpos(n) are
in followpos(i).

Example: Let us continue with our running example; recall that firstpos and lastpos were
computed in Fig. 3.59. Rule 1 for followpos requires that we look at each cat-node, and put each
position in firstpos of its right child in followpos for each position in lastpos of its left child. For
the lowest cat-node in Fig. 3.59, that rule says position 3 is in followpos(1) and followpos(2). The
next cat-node above says that 4 is in followpos(3), and the remaining two cat-nodes give us 5
in followpos(4) and 6 in followpos(5).

Figure 3.59: firstpos and lastpos for nodes in the syntax tree for (a|b)*abb#

We must also apply rule 2 to the star-node. That rule tells us positions 1 and 2 are in both
followpos(1) and followpos(2), since both firstpos and lastpos for this node are {1, 2}. The
complete followpos sets are summarized in Fig. 3.60.
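
Fig. 3.60 itself is not reproduced in these notes, so for reference, here are the complete
followpos sets for the running example (a|b)*abb# (positions 1, 2, 3 for a, b, a; positions 4
and 5 for the two b's; position 6 for #), as they work out from rules 1 and 2:

Position n    followpos(n)
1             {1, 2, 3}
2             {1, 2, 3}
3             {4}
4             {5}
5             {6}
6             ∅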
We can represent the function followpos by creating a directed graph with a node for each
position and an arc from position i to position j if and only if j is in followpos(i). Figure 3.61
shows this graph for the function of Fig. 3.60.

It should come as no surprise that the graph for followpos is almost an NFA without ε-transitions
for the underlying regular expression, and would become one if we:

1. Make all positions in firstpos of the root be initial states,

2. Label each arc from i to j by the symbol at position i, and

3. Make the position associated with endmarker # be the only accepting state.

5. Converting a Regular Expression Directly to a DFA

Algorithm 3.36 : Construction of a DFA from a regular expression r.

INPUT : A regular expression r.

OUTPUT : A DFA D that recognizes L(r).

METHOD :
1. Construct a syntax tree T from the augmented regular expression (r)#.

2. Compute nullable, firstpos, lastpos, and followpos for T, using the methods of Sections 3.9.3
and 3.9.4.

3. Construct Dstates, the set of states of DFA D, and Dtran, the transition function for D, by the
procedure of Fig. 3.62. The states of D are sets of positions in T. Initially, each state is
"unmarked," and a state becomes "marked" just before we consider its out-transitions. The
start state of D is firstpos(n0), where node n0 is the root of T. The accepting states are those
containing the position for the endmarker symbol #.

Example: We can now put together the steps of our running example to construct a DFA for the
regular expression r = (a|b)*abb. The syntax tree for (r)# appeared in Fig. 3.56. We observed
that for this tree, nullable is true only for the star-node, and we exhibited firstpos and lastpos in
Fig. 3.59. The values of followpos appear in Fig. 3.60.

The value of firstpos for the root of the tree is {1, 2, 3}, so this set is the start state of D. Call this
set of states A. We must compute Dtran[A, a] and Dtran[A, b]. Among the positions of A, 1 and
3 correspond to a, while 2 corresponds to b. Thus, Dtran[A, a] = followpos(1) ∪ followpos(3) =
{1, 2, 3, 4}.

The procedure of Fig. 3.62 constructs Dstates and Dtran as follows:

initialize Dstates to contain only the unmarked state firstpos(n0),
        where n0 is the root of syntax tree T for (r)#;
while ( there is an unmarked state S in Dstates ) {
        mark S;
        for ( each input symbol a ) {
                let U be the union of followpos(p) for all p
                        in S that correspond to a;
                if ( U is not in Dstates )
                        add U as an unmarked state to Dstates;
                Dtran[S, a] = U;
        }
}

Figure 3.62: Construction of a DFA directly from a regular expression

Continuing the example: Dtran[A, b] = followpos(2) = {1, 2, 3}. The latter is state A, and so does
not have to be added to Dstates, but the former, B = {1, 2, 3, 4}, is new, so we add it to Dstates
and proceed to compute its transitions. The complete DFA is shown in Fig. 3.63.
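
Fig. 3.63 is likewise not reproduced here, so for reference, completing the construction yields
the following transition table (state D, which contains position 6 of the endmarker #, is the
accepting state):

State              a    b
A = {1, 2, 3}      B    A
B = {1, 2, 3, 4}   B    C
C = {1, 2, 3, 5}   B    D
D = {1, 2, 3, 6}   B    A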

6. Minimizing the Number of States of a DFA

There can be many DFA's that recognize the same language. For instance, note that the DFA's of
Figs. 3.36 and 3.63 both recognize the language L((a|b)*abb). Not only do these automata
have states with different names, but they don't even have the same number of states. If we
implement a lexical analyzer as a DFA, we would generally prefer a DFA with as few states as
possible, since each state requires entries in the table that describes the lexical analyzer.

The matter of the names of states is minor. We shall say that two automata are the same up to
state names if one can be transformed into the other by doing nothing more than changing the
names of states. Figures 3.36 and 3.63 are not the same up to state names. However, there is a
close relationship between the states of each. States A and C of Fig. 3.36 are actually equivalent,
in the sense that neither is an accepting state, and on any input they transfer to the same state —
to B on input a and to C on input b. Moreover, both states A and C behave like state 123 of Fig.
3.63. Likewise, state B of Fig. 3.36 behaves like state 1234 of Fig. 3.63, state D behaves like
state 1235, and state E behaves like state 1236.

It turns out that there is always a unique (up to state names) minimum-state DFA for any regular
language. Moreover, this minimum-state DFA can be constructed from any DFA for the same
language by grouping sets of equivalent states. In the case of L((a|b)*abb), Fig. 3.63 is
the minimum-state DFA, and it can be constructed by partitioning the states of Fig. 3.36 as {A,
C}{B}{D}{E}.

In order to understand the algorithm for creating the partition of states that converts any DFA
into its minimum-state equivalent DFA, we need to see how input strings distinguish states from
one another. We say that string x distinguishes state s from state t if exactly one of the states
reached from s and t by following the path with label x is an accepting state. State s is
distinguishable from state t if there is some string that distinguishes them.
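
For example, the empty string ε distinguishes any accepting state from any non-accepting state.
In the DFA of Fig. 3.36 described above, the string bb distinguishes state A from state B: from A
the path labeled bb ends in the non-accepting state C, while from B it ends in the accepting state
E. The minimization algorithm partitions the states into groups that no string can distinguish,
and each group becomes a single state of the minimum-state DFA.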

Implementation of a Lexical Analyzer

Lexical Analysis is the first step of the compiler which reads the source code one character at a
time and transforms it into an array of tokens. The token is a meaningful collection of characters
in a program. These tokens can be keywords including do, if, while etc. and identifiers including
x, num, count, etc. and operator symbols including >,>=, +, etc., and punctuation symbols
including parenthesis or commas. The output of the lexical analyzer phase passes to the next
phase called syntax analyzer or parser.

The syntax analyzer, or parser, implements the parsing phase. It takes tokens as input from the
lexical analysis phase and groups them together into syntactic structures. The output of this
phase is a parse tree.

Function of Lexical Analysis

The main functions of lexical analysis are as follows:

 It separates tokens from the program and returns those tokens to the parser as requested.
 It strips out comments and whitespace (tab, newline, blank, and other characters used to
separate tokens) from the input.
 It inserts identified tokens into the symbol table.
 It returns an integer code for each token to the parser.
 It correlates error messages produced by the compiler with positions in the source program.
 It can implement the expansion of macros; in the case of macros, a pre-processor is used on
the source code.

LEX generates a lexical analyzer as its output, taking a LEX program as its input. A LEX
program is a collection of patterns (regular expressions) and their corresponding actions.
The patterns represent the tokens to be recognized by the lexical analyzer to be generated. For
each pattern, a corresponding NFA is designed, so there can be n NFAs for n patterns; these
are then combined into a single recognizer.

Lexical Analysis: Lexical analysis, also known as scanning is the first phase of a compiler
which involves reading the source program character by character from left to right and
organizing them into tokens. Tokens are meaningful sequences of characters. There are
usually only a small number of tokens for a programming language including constants (such
as integers, doubles, characters, and strings), operators (arithmetic, relational, and logical),
punctuation marks and reserved keywords.

Lexical Analysis

 The lexical analyzer takes a source program as input, and produces a stream of tokens as
output.
Token
A lexical token is a sequence of characters that can be treated as a unit in the grammar of the
programming languages.
Categories of Tokens
 Keywords: In C programming, keywords are reserved words with specific meanings used
to define the language’s structure like if, else, for, and void. These cannot be used as
variable names or identifiers, as doing so causes compilation errors. C programming has a
total of 32 keywords.
 Identifiers: Identifiers in C are names for variables, functions, arrays, or other user-
defined items. They must start with a letter or an underscore (_) and can include letters,
digits, and underscores. C is case-sensitive, so uppercase and lowercase letters are
different. Identifiers cannot be the same as keywords like if, else or for.
 Constants: Constants are fixed values that cannot change during a program’s execution,
also known as literals. In C, constants include types like integers, floating-point numbers,
characters, and strings.
 Operators: Operators are symbols in C that perform actions on variables or other data
items, called operands.
 Special Symbols: Special symbols in C are compiler tokens used for specific purposes,
such as separating code elements or defining operations. Examples include ; (semicolon)
to end statements, , (comma) to separate values, {} (curly braces) for code blocks, and []
(square brackets) for arrays. These symbols play a crucial role in the program’s structure
and syntax.
Lexeme
A lexeme is an actual string of characters that matches a pattern and generates a token,
e.g. "float", "abs_zero_Kelvin", "=", "-", "273", ";".
Lexemes and Tokens Representation

For a statement such as while (a >= b) a = a - 2; the lexemes and their tokens are:

Lexeme      Token           Lexeme      Token
while       WHILE           a           IDENTIFIER
(           LPAREN          =           ASSIGNMENT
a           IDENTIFIER      a           IDENTIFIER
>=          COMPARISON      -           ARITHMETIC
b           IDENTIFIER      2           INTEGER
)           RPAREN          ;           SEMICOLON
Working of Lexical Analyzer:


Tokens in a programming language can be described using regular expressions. A scanner, or
lexical analyzer, uses a Deterministic Finite Automaton (DFA) to recognize these tokens, as
DFAs are designed to identify regular languages. Each final state of the DFA corresponds to a
specific token type, allowing the scanner to classify the input. The process of creating a DFA
from regular expressions can be automated, making it easier to handle token recognition
efficiently.
The lexical analyzer identifies errors with the help of the automaton and the grammar of the
given language on which it is based (such as C or C++), and reports the row number and
column number of each error.
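
To make this concrete, here is a minimal hand-written scanner sketch in C (illustrative only:
real scanners are usually table-driven DFAs or generated by tools such as Lex, and this
fragment does no symbol-table handling or error recovery):

#include <stdio.h>
#include <ctype.h>

/* A minimal hand-written scanner: classifies each lexeme in the input. */
void scan(const char *p)
{
    while (*p) {
        if (isspace((unsigned char)*p)) {          /* skip whitespace   */
            p++;
        } else if (isalpha((unsigned char)*p) || *p == '_') {
            const char *start = p;                 /* identifier state  */
            while (isalnum((unsigned char)*p) || *p == '_') p++;
            printf("IDENTIFIER: %.*s\n", (int)(p - start), start);
        } else if (isdigit((unsigned char)*p)) {
            const char *start = p;                 /* integer state     */
            while (isdigit((unsigned char)*p)) p++;
            printf("INTEGER: %.*s\n", (int)(p - start), start);
        } else if (*p == '=' || *p == '+' || *p == '-' || *p == ';') {
            printf("OPERATOR/PUNCT: %c\n", *p++);  /* single-char token */
        } else {
            printf("UNKNOWN: %c\n", *p++);         /* lexical error     */
        }
    }
}

int main(void)
{
    scan("a = b + c;");   /* prints id, =, id, +, id, ; in order */
    return 0;
}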
Suppose we pass the statement a = b + c; through the lexical analyzer.
It will generate a token sequence like this: id = id + id ;, where each id refers to its variable in
the symbol table, which holds all of its details. For example, consider the program
int main()
{
// 2 variables
int a, b;
a = 10;
return 0;
}
All the valid tokens are:
'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';'
'a' '=' '10' ';' 'return' '0' ';' '}'
Above are the valid tokens. You can observe that we have omitted comments. As another
example, a statement such as printf("hello"); contains 5 valid tokens: printf, (, "hello", ),
and ;.
Exercise 1: Count the number of tokens:
int main()
{
int a = 10, b = 20;
printf("sum is:%d", a+b);
return 0;
}
Answer: Total number of tokens: 27.
Exercise 2: Count the number of tokens:
int max(int i);
 The lexical analyzer first reads int, finds it to be valid, and accepts it as a token.
 max is read and found to be a valid function name after reading (.
 int is also a token, then i is another token, and finally ) and ; are tokens.
Answer: Total number of tokens is 7: int, max, (, int, i, ), ;
