CD - Unit I
UNIT I
Introduction: Overview of compilation, Language Processors, The structure of a Compiler, Pass
and Phases of translation, Interpretation and bootstrapping.
Lexical Analysis: The Role of the Lexical Analyzer, Input Buffering, Recognition of Tokens,
Design of a Lexical-Analyzer Generator, Optimization of DFA-Based Pattern Matchers, The
Lexical-Analyzer Generator (LEX) tool.
1.1.1.1 PREPROCESSOR
A pre-processor produces input to compilers. It may perform the following functions.
Macro processing: A pre-processor may allow a user to define macros that are
shorthands for longer constructs.
File inclusion: A pre-processor may include header files into the program text.
Rational pre-processor: these pre-processors augment older languages with
more modern flow-of-control and data structuring facilities.
Language extensions: These pre-processors attempt to add capabilities to the
language by what amounts to built-in macros.
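As an illustration (a minimal C sketch, not from the original notes), the C pre-processor performs both macro processing and file inclusion before compilation proper begins:

    #include <stdio.h>              /* file inclusion: the header text is spliced in */
    #define SQUARE(x) ((x) * (x))   /* macro: a shorthand for a longer construct */

    int main(void)
    {
        printf("%d\n", SQUARE(5));  /* expands to ((5) * (5)) before compilation */
        return 0;
    }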
1.1.1.2 COMPILER
A compiler is a translator program that takes a program written in a high-level
language (HLL), the source program, and translates it into an equivalent program in a
machine-level language (MLL), the target program. An important part of a compiler is
reporting errors in the source program to the programmer.
1.1.1.3 ASSEMBLER
Programmers found it difficult to write or read programs in machine language.
They began to use mnemonics (symbols) for each machine instruction, which they
would subsequently translate into machine language. Such a mnemonic machine
language is now called an assembly language. Programs known as assemblers were
written to automate the translation of assembly language into machine language.
The input to an assembler program is called the source program; the output is a
machine language translation (object program).
1.1.1.4 INTERPRETER:
An interpreter is a program that appears to execute a source program as if it
were machine language.
Languages such as BASIC, SNOBOL, and LISP can be translated using interpreters. Java
also uses an interpreter. The process of interpretation can be carried out in the following
phases.
Lexical analysis
Syntax analysis
Semantic analysis
Direct Execution
Advantages
Modifications to the user program can easily be made and applied as execution
proceeds.
The types of objects may change dynamically during execution.
Debugging a program and finding errors is a simpler task for an interpreted
program.
The interpreter for the language makes it machine independent.
Disadvantages
The execution of the program is slower.
Memory consumption is more.
1.1.1.5 Loader and Link-editor:
Once the assembler produces an object program, that program must be placed
into memory and executed. The assembler could place the object program directly in
memory and transfer control to it, thereby causing the machine language program to be
executed. However, this would waste memory (core) by leaving the assembler in memory
while the user's program was being executed. Also, the programmer would have to
retranslate his program with each execution, thus wasting translation time. To overcome
these problems of wasted translation time and memory, system programmers developed
another component called the loader.
“A loader is a program that places programs into memory and prepares them for
execution.” It would be more efficient if subroutines could be translated into an object
form that the loader could “relocate” directly behind the user’s program. The task of
adjusting programs so they may be placed in arbitrary memory locations is called
relocation. Relocating loaders perform four functions: allocation, linking, relocation,
and loading.
1.1.2 STRUCTURE OF THE COMPILER DESIGN
The lexical analyzer (LA), or scanner, reads the source program one character at a time,
carving the source program into a sequence of atomic units called tokens.
Syntax Analysis:-
The second stage of translation is called syntax analysis or parsing. In this phase
expressions, statements, declarations, etc. are identified by using the results of lexical
analysis. Syntax analysis is aided by using techniques based on the formal grammar of
the programming language.
Code Optimization:-
This is an optional phase designed to improve the intermediate code so that the output
runs faster and takes less space.
Code Generation:-
The last phase of translation is code generation. A number of optimizations to
reduce the length of the machine language program are carried out during this phase. The
output of the code generator is the machine language program for the specified computer.
Error Handlers:-
It is invoked when a flaw or error in the source program is detected. The output of the LA
is a stream of tokens, which is passed to the next phase, the syntax analyzer or parser. The
SA groups the tokens together into syntactic structures called expressions. Expressions
may further be combined to form statements. The syntactic structure can be regarded as
a tree, called a parse tree, whose leaves are tokens.
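As a hypothetical illustration (not from the original notes) of the LA-to-parser interface, a token can be represented as an integer code plus the matched lexeme; a minimal C sketch:

    #include <stdio.h>

    /* A hypothetical token representation, for illustration only. */
    enum token_type { ID, ASSIGN, PLUS, STAR, NUM };

    struct token {
        enum token_type type;    /* integer code for the token class */
        const char     *lexeme;  /* the characters that were matched */
    };

    int main(void)
    {
        /* The stream an LA might emit for the input "a = b + c * 60". */
        struct token stream[] = {
            {ID, "a"}, {ASSIGN, "="}, {ID, "b"}, {PLUS, "+"},
            {ID, "c"}, {STAR, "*"}, {NUM, "60"},
        };
        for (size_t i = 0; i < sizeof stream / sizeof stream[0]; i++)
            printf("(%d, \"%s\")\n", stream[i].type, stream[i].lexeme);
        return 0;
    }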
The front end consists of those phases, or parts of phases, that are source-language
dependent and target-machine independent. These generally consist of lexical analysis,
semantic analysis, syntactic analysis, symbol table creation, and intermediate code
generation. A small part of code optimization can also be included in the front end.
The front end also includes the error handling that goes along with each of these
phases.
The portions of a compiler that depend on the target machine and do not depend on
the source language are included in the back end. The back end includes code generation
and the machine-dependent parts of code optimization, along with the associated error
handling and symbol table operations.
A pass is a component in which parts of one or more phases of the compiler are
combined when a compiler is implemented. A pass reads or scans the instructions of the
source program, or the output produced by the previous pass, and makes the necessary
transformations specified by its phases.
1. One-pass
2. Two-pass
Grouping
Several phases are grouped together into a pass so that the pass can read an input file and
write an output file.
1. One-Pass – In a one-pass compiler all the phases are grouped into a single pass;
all six phases are included in one pass.
2. Two-Pass – In a two-pass compiler the phases are divided into two parts, i.e., the
analysis or front-end part of the compiler and the synthesis or back-end part of the compiler.
A two-pass compiler uses its first pass to enter into its symbol table a list of
identifiers together with the memory locations to which they correspond. Then a
second pass replaces mnemonic operation codes by their machine language
equivalents and replaces uses of identifiers by their machine addresses. In the second pass, the
compiler can read the result file produced by the first pass, build the syntax
tree and carry out the syntactic analysis. The output of this stage is a file that contains
the syntax tree.
1.1.5 Interpretation
Interpreters were first used in 1952, making programming simpler given the limitations of
early computers. They are commonly used in micro-computers and help programmers
debug errors before moving to the next statement.
The advantage of an interpreter is that it executes the program line by line, which helps
users find errors easily.
The disadvantage of an interpreter is that it takes more time to execute a program
than a compiler does.
Applications of Interpreters
1.1.6 Bootstrapping
1. Start with a Basic Compiler: A simple compiler is created using a basic language
(e.g., assembly language). It handles essential features of a programming language.
2. Create an Advanced Version: The basic compiler is used to compile a more advanced
version, which can handle additional features like better error checking and
optimizations.
3. Gradually Improve: Each version of the compiler builds on the previous one, adding
more features and improving efficiency. This process continues until the desired
result is achieved.
In the T-diagram:
1. Step 1: The source language is a subset of C (C0), the target language is Assembly,
and the implementation language is also Assembly.
2. Step 2: Using the C0 compiler, a compiler for the full C language is created, with C as
the source language and Assembly as the target language.
Cross compilation
Cross-compilation is a process where a compiler runs on one platform (host) but
generates machine code for a different platform (target). This is useful when the target
platform is not powerful enough to run the full compiler or when the target architecture is
different from the host system. Using bootstrapping in cross-compilation can help create a
compiler that runs on one system (the host) but produces code for another system (the
target).
Advantages of Bootstrapping:
Challenges of Bootstrapping:
1. Initial Effort: Requires significant time and effort to build the first simple compiler.
2. Complexity of Self-Compilation: Ensuring the compiler can compile itself while
supporting advanced features is challenging.
3. Time Consumption: Iterative improvements in early stages are slow and resource-
intensive.
Lexical Analysis
1.2.1 OVERVIEW OF LEXICAL ANALYSIS
1. To identify the tokens we need some method of describing the possible tokens
that can appear in the input stream. For this purpose we introduce regular
expressions, a notation that can be used to describe essentially all the tokens of a
programming language.
2. Secondly, having decided what the tokens are, we need some mechanism to
recognize these in the input stream. This is done by token recognizers, which
are designed using transition diagrams and finite automata.
The LA is the first phase of a compiler. Its main task is to read the input
characters and produce as output a sequence of tokens that the parser uses for syntax
analysis.
Upon receiving a ‘get next token’ command from the parser, the lexical
analyzer reads input characters until it can identify the next token. The LA returns to
the parser a representation for the token it has found. The representation will be an
integer code if the token is a simple construct such as a parenthesis, comma or colon.
The LA may also perform certain secondary tasks at the user interface. One such
task is stripping out from the source program comments and white space in the
form of blank, tab and newline characters. Another is correlating error messages from
the compiler with the source program.
Pattern: A set of strings in the input for which the same token is produced as
output. This set of strings is described by a rule called a pattern associated with the
token.
LEXICAL ERRORS:
Lexical errors are the errors thrown by the lexer when it is unable to continue,
meaning that there is no way to recognise a lexeme as a valid token. Syntax
errors, on the other hand, are thrown by the parser when a given set of already
recognised valid tokens does not match any of the right sides of the grammar rules. A simple
panic-mode error handling system requires that we return to a high-level parsing
function when a parsing or lexical error is detected.
Input buffering is a critical concept in compiler design that improves the efficiency of
reading and processing source code. Typically, a compiler scans the input one character at a
time, which can be slow and inefficient. Input buffering addresses this issue by allowing the
compiler to read chunks of input data into a buffer before processing them. This reduces the
number of system calls, each of which carries overhead, thereby improving performance.
One major advantage of input buffering is its ability to reduce the frequency of
system calls needed to read the source code, leading to faster compilation times.
Additionally, it simplifies the compiler's design by minimizing the amount of code required
for input management.
However, input buffering is not without its challenges. If the buffer size is excessively
large, it can consume too much memory, potentially leading to slower performance or even
crashes, especially on systems with limited resources. Furthermore, improper management
of the buffer can result in errors during compilation, such as incorrect processing of the
input data.
Initially, both pointers point to the first character of the input string.
In the process of lexical analysis, the forward pointer (fp) scans the input to identify the end
of a lexeme. When a blank space is encountered, it signifies the end of the current lexeme
(e.g., recognizing the lexeme "int"). The fp then moves ahead, skipping the white space,
while both the begin pointer (bp) and fp are reset to the starting position of the next token.
1. One Buffer Scheme: This approach uses a single buffer to hold the input data. It is
simpler but may require extra effort to manage overlapping lexemes.
2. Two Buffer Scheme: This method employs two buffers alternately. While one buffer
is being processed, the other is being filled with the next block of input, enabling
seamless processing and reducing delays caused by input operations.
One Buffer Scheme: In this scheme, only one buffer is used to store the input string. The
problem with this scheme is that if a lexeme is very long it crosses the buffer
boundary, and to scan the rest of the lexeme the buffer has to be refilled, which overwrites
the first part of the lexeme.
Two Buffer Scheme: The Two Buffer Scheme improves input buffering by using two
alternating buffers to store input. When one buffer is processed, the other is filled with the
next block of data, ensuring uninterrupted processing. Initially, both the begin pointer (bp)
and forward pointer (fp) point to the first character of the first buffer. The fp moves right to
find the end of a lexeme, which is marked by a blank space. The lexeme is identified as the
string between bp and fp.
To mark buffer boundaries, a Sentinel (end-of-buffer character) is placed at the end of each
buffer. When fp encounters the first sentinel, the second buffer is filled. Similarly,
encountering the second sentinel prompts refilling of the first buffer. This process continues
until all input is processed. A limitation of this method is that lexemes longer than the buffer
size cannot be fully scanned. Despite this, the scheme efficiently reduces secondary storage
access delays.
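A minimal C sketch of this scheme is given below; it models only the forward pointer. The buffer size N and the use of '\0' as the sentinel are assumptions for illustration (real lexers typically use a special eof character):

    #include <stdio.h>

    #define N 4096                  /* size of each buffer half */
    #define SENTINEL '\0'           /* assumed sentinel character */

    static char buf[2 * N + 2];     /* two halves, one sentinel byte after each */
    static char *forward;           /* the forward (fp) pointer */
    static FILE *src;

    /* Fill one half with up to N characters and mark the end of valid data. */
    static void reload(char *half)
    {
        size_t n = fread(half, 1, N, src);
        half[n] = SENTINEL;
    }

    /* Advance fp by one character, switching halves at a sentinel.
     * Returns the character, or -1 at the true end of input. */
    static int advance(void)
    {
        char c = *forward++;
        if (c != SENTINEL)
            return (unsigned char)c;
        if (forward == buf + N + 1) {       /* sentinel after the first half */
            reload(buf + N + 1);            /* fill the second half */
            forward = buf + N + 1;
            return advance();
        }
        if (forward == buf + 2 * N + 2) {   /* sentinel after the second half */
            reload(buf);                    /* refill the first half */
            forward = buf;
            return advance();
        }
        return -1;                          /* sentinel inside a half: real EOF */
    }

    int main(int argc, char **argv)
    {
        src = (argc > 1) ? fopen(argv[1], "r") : stdin;
        if (!src) return 1;
        reload(buf);                        /* prime the first half */
        forward = buf;
        long count = 0;
        while (advance() != -1) count++;
        printf("%ld characters scanned\n", count);
        return 0;
    }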
Token recognizers can be represented in two ways:
1. Transition Table
2. Transition Diagram
EXAMPLE
Assume the following grammar fragment, which generates a specific language,
where the terminals if, then, else, relop, id and num generate sets of strings given by
the following regular definitions, in which letter and digit are defined as
letter → [A-Za-z] and digit → [0-9].
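The grammar fragment and the full regular definitions are not shown above; the classic textbook (dragon-book) versions that this example follows are:

    stmt  → if expr then stmt
          | if expr then stmt else stmt
          | ε
    expr  → term relop term
          | term
    term  → id
          | num

    if    → if
    then  → then
    else  → else
    relop → < | <= | = | <> | > | >=
    id    → letter ( letter | digit )*
    num   → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?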
For this language, the lexical analyzer will recognize the keywords if, then, and else,
as well as lexemes that match the patterns for relop, id, and number.
To simplify matters, we make the common assumption that keywords are also
reserved words: that is, they cannot be used as identifiers.
The pattern num matches the unsigned integer and real numbers of Pascal.
In addition, we assume lexemes are separated by white space, consisting of nonnull
sequences of blanks, tabs, and newlines.
Our lexical analyzer will strip out white space. It will do so by comparing a string
against the regular definition ws, below.
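The definition of ws itself does not appear above; the standard dragon-book form is:

    delim → blank | tab | newline
    ws    → delim+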
If a match for ws is found, the lexical analyzer does not return a token to the parser.
It is the following token that gets returned to the parser.
1.2.4.2 Transition Diagram
It is a directed labelled graph consisting of nodes and edges. Nodes represent states, while edges
represent state transitions.
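As a hypothetical illustration, a transition diagram such as the one for relop is commonly implemented as code in which each state is a branch and each edge is a character test; a minimal C sketch (token codes and the function name are assumptions):

    #include <stdio.h>

    /* Sketch: the relop transition diagram as code. Each state is a branch,
     * each edge a character test. Token codes and names are hypothetical. */
    enum relop_token { LT, LE, NE, EQ, GT, GE, NONE = -1 };

    int relop(const char *s, int *len)
    {
        switch (s[0]) {
        case '<':
            if (s[1] == '=') { *len = 2; return LE; }   /* "<=" */
            if (s[1] == '>') { *len = 2; return NE; }   /* "<>" */
            *len = 1; return LT;                        /* "<" : retract one char */
        case '=':
            *len = 1; return EQ;
        case '>':
            if (s[1] == '=') { *len = 2; return GE; }   /* ">=" */
            *len = 1; return GT;                        /* ">" : retract one char */
        }
        return NONE;                                    /* no relop recognized */
    }

    int main(void)
    {
        int len;
        printf("%d %d\n", relop("<=", &len), len);      /* prints "1 2" (LE) */
        return 0;
    }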
The NFA simulation reads input beginning at the point in the input which we have referred
to as lexemeBegin. As it moves the pointer called forward ahead in the input, it calculates
the set of states it is in at each point.
Eventually, the NFA simulation reaches a point on the input where there are no next states. At that
point, there is no hope that any longer prefix of the input would ever get the NFA to an
accepting state; rather, the set of states will always be empty. Thus, we are ready to decide on the
longest prefix that is a lexeme matching some pattern.
An alternative architecture, resembling the output of Lex, is to convert the NFA for all the
patterns into an equivalent DFA, using the subset construction method.
The accepting states are labelled by the pattern that is identified by that state.
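A table-driven matcher over such a DFA can be sketched as follows; the three-state DFA for the id pattern and the longest-prefix loop are illustrative assumptions, not the full multi-pattern automaton:

    #include <stdio.h>

    /* Hypothetical DFA recognizing the pattern letter(letter|digit)*. */
    enum { START, IN_ID, DEAD };

    static int is_letter(int c) { return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'); }
    static int is_digit(int c)  { return c >= '0' && c <= '9'; }

    static int step(int state, int c)
    {
        switch (state) {
        case START: return is_letter(c) ? IN_ID : DEAD;
        case IN_ID: return (is_letter(c) || is_digit(c)) ? IN_ID : DEAD;
        default:    return DEAD;
        }
    }

    /* Run the DFA and remember the last accepting position: this yields
     * the longest prefix that is a lexeme matching the id pattern. */
    static int match_id(const char *s)
    {
        int state = START, last_accept = -1;
        for (int i = 0; s[i] != '\0' && state != DEAD; i++) {
            state = step(state, s[i]);
            if (state == IN_ID) last_accept = i;
        }
        return last_accept + 1;   /* length of the match, 0 if none */
    }

    int main(void)
    {
        printf("%d\n", match_id("rate60 = 5"));   /* prints 6 ("rate60") */
        return 0;
    }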
LOOKAHEAD OPERATOR
The Lex lookahead operator / in a Lex pattern r1/r2 is sometimes necessary, because
the pattern r1 for a particular token may need to describe some trailing context r2 in
order to correctly identify the actual lexeme.
When converting the pattern r1/r2 to an NFA, we treat the / as if it were ε, so we do
not actually look for a / in the input. However, if the NFA recognizes a prefix xy of
the input buffer as matching this regular expression, the end of the lexeme is not where the
NFA entered its accepting state.
AN NFA FOR THE PATTERN FOR THE FORTRAN IF WITH LOOKAHEAD
Notice that the ε-transition from state 2 to state 3 represents the lookahead operator. State
6 indicates the presence of the keyword IF. However, we find the lexeme IF by scanning
backwards to the last occurrence of state 2 whenever state 6 is entered.
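In Lex notation, this pattern is written with the lookahead operator; the classic dragon-book form for the Fortran IF is:

    IF / \( .* \) {letter}

Here r1 is the keyword IF, and the trailing context r2 is a parenthesized condition followed by a letter, which distinguishes the IF statement from an assignment to an array named IF.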
1.2.6 Optimization of DFA-Based Pattern Matchers
To optimize (minimize) the DFA you have to follow various steps. These are as follows:
Step 1: Remove all the states that are unreachable from the initial state via any set of
transitions of the DFA.
Step 2: Draw the transition table for all the remaining states.
Step 3: Now split the transition table into two tables T1 and T2. T1 contains all the final states
and T2 contains the non-final states.
Step 4: Find the similar rows in T1 such that:
1. δ (q, a) = p
2. δ (r, a) = p
That means, find the two states which have the same transitions on each input symbol
and merge them, removing one of the two.
Step 5: Repeat step 4 until no similar rows remain in the transition table T1.
Step 6: Repeat steps 4 and 5 for table T2 as well.
Step 7: Now combine the reduced T1 and T2 tables. The combined transition table is the
transition table of the minimized DFA.
Solution:
Step 1: In the given DFA, q2 and q4 are unreachable states, so remove them.
Step 3: Split the transition table into two sets:
1. One set contains the rows which start from non-final states;
2. The other set contains the rows which start from final states.
Step 5: In set 2, row 1 and row 2 are similar, since q3 and q5 transit to the same states on 0
and 1. So skip q5 and replace q5 by q3 in the rest.
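A compact C sketch of this table-splitting idea, written as partition refinement over a small hypothetical DFA (not the one from the example above; here q1 and q3 have identical rows and should be merged):

    #include <stdio.h>
    #include <string.h>

    #define N 4   /* states q0..q3 */
    #define A 2   /* input symbols 0 and 1 */

    /* Hypothetical DFA: q1 and q3 have identical rows, so they merge. */
    static int delta[N][A] = {
        {1, 3},   /* q0 */
        {2, 2},   /* q1 */
        {2, 2},   /* q2 (final) */
        {2, 2},   /* q3 */
    };
    static int final_st[N] = {0, 0, 1, 0};

    int main(void)
    {
        int cls[N], newcls[N];

        /* Step 3: initial split into final (T1) and non-final (T2) classes. */
        for (int q = 0; q < N; q++) cls[q] = final_st[q];

        for (;;) {
            /* Steps 4-6: states stay together only if their rows are
             * similar, i.e. they agree on the class of every successor. */
            for (int q = 0; q < N; q++) {
                newcls[q] = q;
                for (int r = 0; r < q; r++) {
                    if (cls[r] != cls[q]) continue;
                    int same = 1;
                    for (int a = 0; a < A; a++)
                        if (cls[delta[r][a]] != cls[delta[q][a]]) { same = 0; break; }
                    if (same) { newcls[q] = r; break; }
                }
            }
            if (memcmp(cls, newcls, sizeof cls) == 0) break;  /* no change: done */
            memcpy(cls, newcls, sizeof cls);
        }

        /* Step 7: each state maps to the representative of its class. */
        for (int q = 0; q < N; q++)
            printf("q%d -> q%d\n", q, cls[q]);   /* prints q3 -> q1 */
        return 0;
    }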
LEX in compiler design is a tool that generates lexical analyzers, which are programs that
convert streams of characters into meaningful units called tokens. This process is known as
tokenization and is a key part of lexical analysis, the first phase of a compiler's workflow.
What is Lex? Lex is a specialized tool (or program) that automates the generation of lexical
analyzers. It takes input in the form of Lex source programs (File.l) and produces C programs
(lex.yy.c) as output. The generated C program can be compiled using a standard C compiler,
resulting in a lexical analyzer (a.out), which converts character streams into tokens.
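A minimal Lex source program illustrates the File.l → lex.yy.c → a.out workflow described above (the counting example and file names are illustrative, not from these notes):

    %{
    /* count.l - counts identifier-like words in the input.
     * Assumed classic build: lex count.l && cc lex.yy.c -ll && ./a.out < input */
    #include <stdio.h>
    int ids = 0;
    %}

    letter [A-Za-z]
    digit  [0-9]

    %%
    {letter}({letter}|{digit})*  { ids++; }   /* an id lexeme */
    .|\n                         { ; }        /* skip everything else */
    %%

    int main(void)
    {
        yylex();                 /* run the generated lexical analyzer */
        printf("identifiers: %d\n", ids);
        return 0;
    }

    int yywrap(void) { return 1; }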
Functions of Lex: