UNIT-1: Lexical Analyser
LANGUAGE TRANSLATORS:
A language translator is a computer program that translates a program written in one (Source) language into its equivalent program in another (Target) language. The Source program is in a high-level language, whereas the Target language can be anything from the machine language of a target machine (anywhere from a microprocessor to a supercomputer) to another high-level language.
Two commonly used translators are the Compiler and the Interpreter.
1. Compiler: A compiler is a program that reads a program in one language, called the Source Language, and translates it into its equivalent program in another language, called the Target Language. In addition, it presents error information to the user.
2. Interpreter: An interpreter is a program that takes the source program together with its input and executes it directly, producing the output without first producing a separate target program.
[Figure: a compiler maps a Source Program to a Target Program; an interpreter takes the Source Program and Input directly and produces Output.]
In addition to these translators, programs like interpreters, text formatters, etc., may be used in a language processing system. To translate a program in a high-level language into an executable one, the Compiler by default performs the compile and linking functions.
Normally, the steps in a language processing system include preprocessing the skeletal source program, which produces an extended or expanded source program (a ready-to-compile unit); compiling the result; then linking/loading; and finally producing the equivalent executable code. As noted earlier, not all these steps are mandatory; in some cases the Compiler performs the linking and loading functions implicitly.
The steps involved in a typical language processing system can be understood with the following diagram.
[Figure: Source Program (e.g., filename.c) -> Preprocessor -> Compiler -> Assembler -> Linker/Loader -> Executable]
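As a concrete illustration, the familiar gcc driver can run each stage of this pipeline separately (these are standard gcc flags; the file names are illustrative):

    gcc -E prog.c -o prog.i    # preprocess: expand #include files and macros
    gcc -S prog.i -o prog.s    # compile to assembly
    gcc -c prog.s -o prog.o    # assemble to relocatable object code
    gcc prog.o -o prog         # link into an executable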
TYPES OF COMPILERS:
Based on the specific input it takes and the output it produces, Compilers can be classified into the following types:
Traditional Compilers (C, C++, Pascal): These Compilers convert a source program in a HLL into its equivalent native machine code or object code.
Interpreters (LISP, SNOBOL, Java 1.0): These first convert source code into intermediate code, and then interpret (emulate) it on the target machine.
Cross-Compilers: These are the compilers that run on one machine and produce code for
another machine.
Incremental Compilers: These compilers separate the source into user-defined steps, compiling/recompiling step by step and interpreting the steps in a given order.
Converters (e.g., COBOL to C++): These programs compile from one high-level language to another.
Just-In-Time (JIT) Compilers (Java, Microsoft .NET): These are runtime compilers from an intermediate language (byte code, MSIL) to executable or native machine code. They perform type-based verification, which makes the executable code more trustworthy.
Ahead-of-Time (AOT) Compilers (e.g., .NET ngen): These are pre-compilers to native code for Java and .NET.
Binary Compilation: These compilers compile the object code of one platform into the object code of another platform.
PHASES OF A COMPILER:
Compiler phases are the individual modules that are chronologically executed to perform their respective sub-activities, and finally integrate their solutions to give the target code. It is desirable to have relatively few phases, since it takes time to read and write intermediate files. Figure 1.4 depicts the phases a compiler goes through during compilation. A typical Compiler has the following phases: Lexical Analyzer (Scanner), Syntax Analyzer (Parser), Semantic Analyzer, Intermediate Code Generator, Code Optimizer, and Code Generator.
In addition to these, it also has Symbol table management and an Error handler. Not all the phases are mandatory in every Compiler; e.g., the Code Optimizer phase is optional in some compilers.
The phases of a compiler are divided into two parts: the first three phases are called the Analysis part, and the remaining three are called the Synthesis part.
In some applications, a compiler is organized into what are called passes, where a pass is a collection of phases that convert the input from one representation to a completely different representation. Each pass makes a complete scan of the input and produces output to be processed by the subsequent pass; for example, a two-pass Assembler.
All of these phases of a general Compiler are conceptually divided into the Front-end and the Back-end. This division is due to their dependence on either the Source language or the Target machine. This model is called the Analysis & Synthesis model of a compiler.
The Front-end of the compiler consists of phases that depend primarily on the Source language and are largely independent of the target machine. For example, the front-end of the compiler includes the Scanner, the Parser, creation of the Symbol table, the Semantic Analyzer, and the Intermediate Code Generator.
The Back-end of the compiler consists of phases that depend on the target machine; these portions do not depend on the Source language, just on the Intermediate language. Here we have the different aspects of the Code Optimization phase and code generation, along with the necessary Error handling and Symbol table operations.
LEXICAL ANALYZER (SCANNER): The Scanner is the first phase; it works as the interface between the compiler and the Source language program and performs the following functions:
Reads the characters of the Source program and groups them into a stream of tokens, in which each token specifies a logically cohesive sequence of characters, such as an identifier, a keyword, a punctuation mark, or a multi-character operator like :=.
The Scanner generates a token-id, and also enters the identifier's name in the Symbol table if it does not already exist.
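For instance, given the input fragment count := count + 1, the scanner would emit a token stream along these lines (the notation is illustrative), entering count into the symbol table the first time it is seen:

    id(count)  :=  id(count)  +  num(1)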
SYNTAX ANALYZER (PARSER): The Parser interacts with the Scanner and its subsequent phase, the Semantic Analyzer, and performs the following functions:
Groups the received and recorded token stream into syntactic structures, usually into a structure called a Parse Tree, whose leaves are tokens.
The interior nodes of this tree represent streams of tokens that logically belong together.
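As a small sketch, the parse tree for the statement count := count + 1 has the operators as interior nodes and the tokens at the leaves:

               :=
              /  \
      id(count)    +
                  / \
          id(count)  num(1)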
SEMANTIC ANALYZER: This phase receives the syntax tree as input and checks the semantic correctness of the program. Though the tokens are valid and syntactically correct, it may happen that they are not correct semantically. Therefore, the semantic analyzer checks the semantics (meaning) of the statements formed.
INTERMEDIATE CODE GENERATOR: The syntactically and semantically correct structures are produced here in the form of a syntax tree, a DAG, or some other sequential representation like a matrix. The intermediate form should be easy to produce and easy to translate into the target program. Example intermediate code forms are three-address code, postfix (Polish) notation, and syntax trees or DAGs.
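For instance, the assignment a := b + c * d could be written in postfix notation as a b c d * + := , or in three-address code as below (the temporaries t1 and t2 are compiler-generated names):

    t1 = c * d
    t2 = b + t1
    a  = t2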
CODE OPTIMIZER: This phase is optional in some Compilers, but very useful and beneficial in terms of saving development time, effort, and cost. It transforms the intermediate code so that the eventual target program runs faster and/or takes less space. Sometimes the data structures used in representing the intermediate forms may also be changed.
CODE GENERATOR: This is the final phase of the compiler; it generates the target code, normally consisting of relocatable machine code, Assembly code, or absolute machine code. Memory locations are selected for each variable used, and variables are assigned to registers.
The Compiler also performs Symbol table management and Error handling throughout the compilation process. The Symbol table is a data structure that stores the different source language constructs and tokens generated during compilation. These two components interact with all phases of the Compiler.
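As a rough sketch (the struct layout and the install/lookup names are illustrative assumptions, not taken from the text), a symbol table can be as simple as a linear list in C:

    #include <string.h>

    #define MAXSYMS 1024
    #define MAXNAME 64

    struct symbol {
        char name[MAXNAME];   /* the lexeme, e.g. an identifier */
        int  token;           /* token code assigned by the scanner */
        int  type;            /* filled in by later phases */
    };

    static struct symbol symtab[MAXSYMS];
    static int nsyms = 0;

    /* Return the index of name in the table, or -1 if absent. */
    int lookup(const char *name) {
        for (int i = 0; i < nsyms; i++)
            if (strcmp(symtab[i].name, name) == 0)
                return i;
        return -1;
    }

    /* Install name if it is new; return its index either way. */
    int install(const char *name, int token) {
        int i = lookup(name);
        if (i >= 0)
            return i;
        strncpy(symtab[nsyms].name, name, MAXNAME - 1);
        symtab[nsyms].token = token;
        return nsyms++;
    }

Real compilers typically use a hash table here so that lookup runs in near-constant time, and layer scope handling on top.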
For example, suppose the source program is an assignment statement; the following walkthrough shows how the phases of the compiler process the program.
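A sketch in the style of the classic textbook example, for the statement position := initial + rate * 60 (the temporaries, the inttofloat conversion, and the register code below are illustrative):

    Lexical analyzer    :  <id,1> := <id,2> + <id,3> * <num,60>
    Syntax analyzer     :  builds the tree   :=( id1, +( id2, *( id3, 60 ) ) )
    Semantic analyzer   :  inserts a type conversion, 60 -> inttofloat(60)
    Intermediate code   :  t1 = inttofloat(60)
                           t2 = id3 * t1
                           t3 = id2 + t2
                           id1 = t3
    Code optimizer      :  t1 = id3 * 60.0
                           id1 = id2 + t1
    Code generator      :  LDF  R2, id3
                           MULF R2, R2, #60.0
                           LDF  R1, id2
                           ADDF R1, R1, R2
                           STF  id1, R1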
LEXICAL ANALYSIS:
As the first phase of a compiler, the main task of the lexical analyzer is to read the
input characters of the source program, group them into lexemes, and produce as output tokens
for each lexeme in the source program. This stream of tokens is sent to the parser for syntax
analysis. It is common for the lexical analyzer to interact with the symbol table as well.
When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table.
When the lexical analyzer identifies the first token, it sends it to the parser; the parser receives the token and calls the lexical analyzer for the next token by issuing the getNextToken() command. This process continues until the lexical analyzer has identified all the tokens. During this process the lexical analyzer neglects (discards) white space and comment lines.
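A minimal sketch of this interaction in C (the Token layout and the TOK_EOF end marker are illustrative assumptions; getNextToken() is the call named in the text):

    enum { TOK_EOF = 0 };            /* assumed end-of-input token code */

    typedef struct {
        int name;                    /* token name, e.g. ID, NUMBER */
        int attribute;               /* e.g. a symbol-table index */
    } Token;

    Token getNextToken(void);        /* supplied by the lexical analyzer */

    void parse(void) {
        /* the parser pulls tokens on demand, one at a time */
        for (Token tok = getNextToken(); tok.name != TOK_EOF; tok = getNextToken()) {
            /* grammar-driven processing of tok goes here */
        }
    }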
A token is a pair consisting of a token name and an optional attribute value. The token
name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a
sequence of input characters denoting an identifier. The token names are the input symbols that
the parser processes. In what follows, we shall generally write the name of a token in boldface.
We will often refer to a token by its token name.
A pattern is a description of the form that the lexemes of a token may take (or match). In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword. For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings.
A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token. For example, in the C statement printf("Total = %d\n", score); both printf and score are lexemes matching the pattern for token id, and "Total = %d\n" is a lexeme matching the pattern for token literal (or string).
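The token stream produced for this statement consists of the following pairs (single-character tokens such as the parentheses carry no attribute value):

    <id, pointer to symbol-table entry for printf>
    <(>
    <literal, "Total = %d\n">
    <,>
    <id, pointer to symbol-table entry for score>
    <)>
    <;>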
There are a number of reasons why the analysis portion of a compiler is normally separated into lexical analysis and parsing (syntax analysis) phases:
1. Simplicity of design: separating lexical and syntactic analysis lets each be simpler than a combined analyzer would be.
2. Compiler efficiency is improved: a separate lexical analyzer allows specialized techniques, such as the buffering schemes described below, that serve only the lexical task.
3. Compiler portability is enhanced: input-device-specific peculiarities can be restricted to the lexical analyzer.
INPUT BUFFERING:
Before discussing the problem of recognizing lexemes in the input, let us examine some ways in which the simple but important task of reading the source program can be sped up. This task is made difficult by the fact that we often have to look one or more characters beyond the next lexeme before we can be sure we have the right lexeme. There are many situations where we need to look at least one additional character ahead. For instance, we cannot be sure we've seen the end of an identifier until we see a character that is not a letter or digit, and therefore is not part of the lexeme for id. In C, single-character operators like -, =, or < could also be the beginning of a two-character operator like ->, ==, or <=. Thus, we shall introduce a two-buffer scheme that handles large lookaheads safely. We then consider an improvement involving "sentinels" that saves time checking for the ends of buffers.
Buffer Pairs
Because of the amount of time taken to process characters and the large number of characters
that must be processed during the compilation of a large source program, specialized buffering
techniques have been developed to reduce the amount of overhead required to process a single
input character. An important scheme involves two buffers that are alternately reloaded.
Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096 bytes. Using one system read command we can read N characters into a buffer, rather than using one system call per character. If fewer than N characters remain in the input file, then a special character, represented by eof, marks the end of the source file; it is different from any possible character of the source program.
Two pointers to the input are maintained:
1. The pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to determine.
2. The pointer forward scans ahead until a pattern match is found; the exact strategy whereby this determination is made will be covered in the balance of this chapter.
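A sketch of the two-buffer arrangement, using the classic textbook input E = M * C ** 2 (the layout is illustrative; here forward has scanned past the ** operator):

    |  E  |  =  |  M  |  *  |  C  |  *  |  *  |  2  | eof |
                                     ^              ^
                               lexemeBegin       forward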
Once the next lexeme is determined, forward is set to the character at its right end. Then, after the lexeme is recorded as an attribute value of a token returned to the parser, lexemeBegin is set to the character immediately after the lexeme just found. In the figure above, we see that forward has passed the end of the next lexeme, ** (the Fortran exponentiation operator), and must be retracted one position to its left.
Advancing forward requires that we first test whether we have reached the end of one
of the buffers, and if so, we must reload the other buffer from the input, and move forward to
the beginning of the newly loaded buffer. As long as we never need to look so far ahead of the
actual lexeme that the sum of the lexeme's length plus the distance we look ahead is greater
than N, we shall never overwrite the lexeme in its buffer before determining it.
If we use the above scheme as described, we must check, each time we advance forward, that we have not moved off one of the buffers; if we have, then we must also reload the other buffer. Thus, for each character read, we make two tests: one for the end of the buffer, and one to determine what character is read (the latter may be a multiway branch). We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end. The sentinel is a special character that cannot be part of the source program, and a natural choice is the character eof. Figure 1.8 shows the same arrangement as Figure 1.7, but with the sentinels added. Note that eof retains its use as a marker for the end of the entire input.
Any eof that appears other than at the end of a buffer means that the input is at an end. Figure 1.9 summarizes the algorithm for advancing forward. Notice how the first test, which can be part of a multiway branch based on the character pointed to by forward, is the only test we make, except in the case where we actually are at the end of a buffer or the end of the input.
switch ( *forward++ ) {
    case eof:
        if ( forward is at end of first buffer ) {
            reload second buffer;
            forward = beginning of second buffer;
        }
        else if ( forward is at end of second buffer ) {
            reload first buffer;
            forward = beginning of first buffer;
        }
        else /* eof within a buffer marks the end of input */
            terminate lexical analysis;
        break;
    /* cases for the other characters */
}
Figure 1.9: Use of switch-case with the sentinel
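A minimal runnable C sketch of the same scheme (the buffer size N, the choice of NUL as sentinel, and the function names are illustrative assumptions, not from the text):

    #include <stdio.h>

    #define N 4096                 /* buffer size, typically one disk block */
    #define SENTINEL '\0'          /* assumes NUL never occurs in source text */

    static char buf[2 * (N + 1)];  /* two buffers, each followed by a sentinel slot */
    static char *forward = buf;
    static FILE *src;

    /* Fill half h (0 or 1) from the input and plant the sentinel after it. */
    static char *reload(int h) {
        char *base = buf + h * (N + 1);
        size_t n = fread(base, 1, N, src);
        base[n] = SENTINEL;        /* a short read leaves the sentinel early: true eof */
        return base;
    }

    /* Advance forward and return the next character, or EOF at end of input. */
    static int next_char(void) {
        for (;;) {
            char c = *forward++;
            if (c != SENTINEL)
                return (unsigned char) c;
            if (forward - 1 == buf + N)               /* end of first buffer */
                forward = reload(1);
            else if (forward - 1 == buf + 2 * N + 1)  /* end of second buffer */
                forward = reload(0);
            else
                return EOF;                           /* sentinel mid-buffer: real end */
        }
    }

    /* Callers initialize with: src = fopen(path, "r"); forward = reload(0); */

Note the single comparison against SENTINEL per character on the hot path; the buffer-end checks run only when a sentinel is actually seen.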
SPECIFICATION OF TOKENS:
Regular expressions are an important notation for specifying lexeme patterns. While they cannot express
all possible patterns, they are very effective in specifying those types of patterns that we actually need for
tokens.
Lex is a tool used to generate a lexical analyzer. The input notation for the Lex tool is referred to as the Lex language, and the tool itself is the Lex compiler. Behind the scenes, the Lex compiler transforms the input patterns into a transition diagram and generates code in a file called lex.yy.c; this is a C program that, when given to a C compiler, yields the object code of the scanner. Here we need to know how to write the Lex language. The structure of a Lex program is given below.
declarations
%%
translation rules
%%
auxiliary functions

Each translation rule has the form Pattern { Action }, where the pattern is a regular expression and the action is the C code to be executed when that pattern matches. The auxiliary function definitions section includes the definitions of functions used to install identifiers and numbers in the Symbol table. A program in this form follows.
%{
    /* definitions of manifest constants, e.g. THEN, ELSE, ID, NUMBER */
%}

/* regular definitions */
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%
{ws}      { /* no action and no return */ }
then      { return(THEN); }
else      { return(ELSE); }
{id}      { yylval = (int) installID(); return(ID); }
{number}  { yylval = (int) installNum(); return(NUMBER); }
%%
int installID() { /* function to install the lexeme, whose first character is
                     pointed to by yytext and whose length is yyleng, into the
                     symbol table, and return a pointer thereto */ }
int installNum() { /* similar to installID, but puts numerical constants into
                      a separate table */ }
Figure 1.10: A Lex program for some common tokens
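Assuming the program above is saved as scanner.l, it can be turned into a running scanner with the usual Lex/flex tool chain (with flex, the -lfl library supplies default main and yywrap routines; classic lex uses -ll):

    lex scanner.l          # or: flex scanner.l  -- both write lex.yy.c
    cc lex.yy.c -lfl       # compile the generated C scanner
    ./a.out < input.txt    # run it over some input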
In our compiler model, the parser obtains a string of tokens from the lexical analyzer,
as shown in the below Figure, and verifies that the string of token names can be generated
by the grammar for the source language. We expect the parser to report any syntax errors in
an intelligible fashion and to recover from commonly occurring errors to continue processing the
remainder of the program. Conceptually, for well-formed programs, the parser constructs a parse
tree and passes it to the rest of the compiler for further processing.
Figure 2.1: The Parser in the Compiler
During the process of parsing, the parser may encounter errors and present the error information back to the user. Syntactic errors include misplaced semicolons or extra or missing braces, that is, "{" or "}". As another example, in C or Java, the appearance of a case statement without an enclosing switch is a syntactic error (however, this situation is usually allowed by the parser and caught later in the processing, as the compiler attempts to generate code).
Based on the way/order in which the Parse Tree is constructed, Parsing is classified into the following two types:
1. Top-Down Parsing: Parse tree construction starts at the root node and moves down to the children (i.e., in top-down order).
2. Bottom-Up Parsing: Parse tree construction begins at the leaf nodes and proceeds towards the root node (in bottom-up order).
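For example, with the grammar S -> a S b | a b and the input string aabb, a top-down parse grows the tree from the start symbol, S => aSb => aabb, while a bottom-up parse works from the input back to the start symbol, reducing aabb to a(ab)b = aSb and then to S.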