UNIT-I
INTRODUCTION TO LANGUAGE PROCESSING:
As computers became an inevitable and integral part of human life, and as several languages with different and more advanced features evolved to make communication with the machine easier, the development of translator or mediator software became essential to bridge the huge gap between human and machine understanding. This process is called Language Processing, reflecting the goal and intent of the process. To understand this process better, we should be familiar with some key terms and concepts explained in the following lines.
LANGUAGE TRANSLATORS :
In addition to translators, programs such as interpreters, text formatters, etc., may be used in a language processing system. To translate a program written in a high-level language into an executable one, the compiler by default performs the compile and linking functions.
Normally, the steps in a language processing system include preprocessing the skeletal source program, which produces an extended or expanded source program (a ready-to-compile unit), followed by compiling the result, then linking/loading, until finally the equivalent executable code is produced. As mentioned earlier, not all of these steps are mandatory; in some cases, the compiler performs the linking and loading functions implicitly.
The steps involved in a typical language processing system can be understood with the following diagram.
[Figure: a language processing system (Preprocessor, Compiler, Assembler)]
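As a small illustration (an assumed example, not part of the original notes), consider the following skeletal C source program. The preprocessor expands the #include and #define directives into an expanded, ready-to-compile source program; the compiler proper then translates it, the assembler turns the result into object code, and the linker/loader finally produces the executable.

    /* skeletal source program: the preprocessor expands the directives below */
    #include <stdio.h>            /* replaced by the contents of the header     */
    #define PI 3.14159            /* every later use of PI is textually         */
                                  /* substituted before compilation proper      */

    int main(void)
    {
        double r = 2.0;
        printf("area = %f\n", PI * r * r);   /* PI becomes 3.14159 here */
        return 0;
    }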
TYPES OF COMPILERS:
Based on the specific input it takes and the output it produces, compilers can be classified into the following types:
Traditional Compilers (C, C++, Pascal): These compilers convert a source program written in a high-level language into its equivalent native machine code or object code.
Interpreters (LISP, SNOBOL, Java 1.0): These first convert the source code into an intermediate code and then interpret (emulate) it to its equivalent machine code.
Cross-Compilers: These are compilers that run on one machine and produce code for another machine.
Incremental Compilers: These compilers separate the source into user-defined steps, compiling/recompiling step by step and interpreting the steps in a given order.
Converters (e.g., COBOL to C++): These programs compile (translate) from one high-level language to another.
Just-In-Time (JIT) Compilers (Java, Microsoft .NET): These are runtime compilers that translate an intermediate language (byte code, MSIL) into executable or native machine code. They perform type-based verification, which makes the executable code more trustworthy.
Ahead-of-Time (AOT) Compilers (e.g., .NET ngen): These are pre-compilers to native code for Java and .NET.
Binary Compilation: These compilers translate the object code of one platform into the object code of another platform.
PHASES OF A COMPILER:
Compiler phases are the individual modules that are executed in order to perform their respective sub-activities, and their results are finally integrated to give the target code. It is desirable to have relatively few phases, since it takes time to read and write intermediate files. The following diagram (Figure 1.4) depicts the phases of a compiler through which it goes during compilation. Therefore, a typical compiler has the following phases:
The phases of a compiler are divided into two parts: the first three phases are called the Analysis part, and the remaining three are called the Synthesis part.
In some applications, a compiler is organized into what are called passes, where a pass is a collection of phases that convert the input from one representation to a completely different representation. Each pass makes a complete scan of the input and produces output to be processed by the subsequent pass; a two-pass assembler is a typical example.
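As a brief illustration (an assumed example in the style of the standard textbook treatment, not taken verbatim from these notes), the analysis phases and the intermediate code generator might transform the assignment statement  position = initial + rate * 60  as follows:

    Lexical analysis   : <id,1> <=> <id,2> <+> <id,3> <*> <60>
    Syntax analysis    : a parse/syntax tree with = at the root, id (position)
                         as its left child, and the subtree for
                         initial + rate * 60 as its right child
    Semantic analysis  : the integer constant 60 is converted (inttofloat) so
                         that it matches the floating-point type of rate
    Intermediate code  : t1 = inttofloat(60)
                         t2 = rate * t1
                         t3 = initial + t2
                         position = t3

The synthesis phases then optimize this intermediate code and turn it into target code, as illustrated further in the Code Optimizer and Code Generator sections below.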
THE FRONT-END & BACK-END OF A COMPILER
All the phases of a general compiler are conceptually divided into the Front-end and the Back-end. This division is based on their dependence on either the source language or the target machine. This model is called the Analysis & Synthesis model of a compiler.
The Front-end of the compiler consists of the phases that depend primarily on the source language and are largely independent of the target machine. For example, the front-end includes the Scanner, the Parser, the creation of the Symbol table, the Semantic Analyzer, and the Intermediate Code Generator.
The Back-end of the compiler consists of the phases that depend on the target machine; these portions do not depend on the source language, only on the intermediate language. The back-end covers the different aspects of the Code Optimization phase and code generation, along with the necessary error handling and symbol-table operations.
LEXICAL ANALYZER (SCANNER): The Scanner is the first phase that works as
interface between the compiler and the Source language program and performs the
following functions:
Reads the characters of the source program and groups them into a stream of tokens, where each token specifies a logically cohesive sequence of characters, such as an identifier, a keyword, a punctuation mark, or a multi-character operator like := .
The Scanner generates a token-id and also enters the identifier's name in the Symbol table if it does not already exist there.
SYNTAX ANALYZER (PARSER): The Parser interacts with the Scanner, and its
subsequent phase Semantic Analyzer and performs the following functions:
Groups the received and recorded token stream into syntactic structures, usually into a structure called a Parse Tree whose leaves are tokens.
The interior nodes of this tree represent streams of tokens that logically belong together.
SEMANTIC ANALYZER: The structures built by the parser are syntactically correct, but it may happen that they are not correct semantically. Therefore, the semantic analyzer checks the semantics (meaning) of the statements formed.
INTERMEDIATE CODE GENERATOR: The syntactically and semantically correct structures are produced here in the form of a syntax tree, a DAG, or some other sequential representation such as a matrix.
CODE OPTIMIZER: This phase is optional in some compilers, but it is so useful and beneficial in terms of saving development time, effort, and cost. This phase performs the following specific functions:
Sometimes the data structures used in representing the intermediate forms may
also be changed.
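Continuing the assumed example from the phases overview above, a simple optimizer could fold the constant conversion and eliminate a redundant temporary, shrinking the four intermediate instructions to two:

    t1 = rate * 60.0
    position = initial + t1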
CODE GENERATOR: This is the final phase of the compiler; it generates the target code, normally consisting of relocatable machine code, assembly code, or absolute machine code.
Memory locations are selected for each variable used, and assignment of
variables to registers is done.
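For the same assumed example, the code generator might emit target code roughly like the following (an illustrative pseudo-assembly; the register names and mnemonics are assumptions, not a specific real instruction set):

    LDF  R2, rate        ; load rate into floating-point register R2
    MULF R2, R2, #60.0   ; R2 = rate * 60.0
    LDF  R1, initial     ; load initial into R1
    ADDF R1, R1, R2      ; R1 = initial + rate * 60.0
    STF  position, R1    ; store the result into position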
The Compiler also performs the Symbol table management and Error handling
throughout the compilation process. Symbol table is nothing but a data structure that
stores different source language constructs, and tokens generated during the
compilation.
LEXICAL ANALYSIS:
As the first phase of a compiler, the main task of the lexical analyzer is to
read the input characters of the source program, group them into lexemes, and produce
as output tokens for each lexeme in the source program. This stream of tokens is
sent to the parser for syntax analysis. It is common for the lexical analyzer to interact
with the symbol table as well.
When the lexical analyzer identifies the first token, it sends it to the parser; the parser receives the token and calls the lexical analyzer for the next token by issuing the getNextToken() command. This process continues until the lexical analyzer has identified all the tokens. During this process, the lexical analyzer neglects or discards white space and comment lines.
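A minimal runnable sketch of this hand-shake in C (the token kinds, the fixed token table, and the getNextToken() signature here are assumptions made purely for illustration):

    #include <stdio.h>

    /* hypothetical token kinds and token record */
    enum TokenKind { TK_ID, TK_ASSIGN, TK_NUM, TK_EOF };

    struct Token { enum TokenKind kind; const char *lexeme; };

    /* stand-in lexical analyzer: hands out tokens from a fixed table so the
       scanner/parser interaction can be seen; a real scanner would group the
       input characters into lexemes instead */
    static struct Token stream[] = {
        { TK_ID, "score" }, { TK_ASSIGN, "=" }, { TK_NUM, "60" }, { TK_EOF, "" }
    };
    static int next = 0;

    struct Token getNextToken(void) { return stream[next++]; }

    int main(void)                        /* plays the role of the parser */
    {
        struct Token tok = getNextToken();
        while (tok.kind != TK_EOF) {      /* keep demanding tokens ...          */
            printf("token %d  lexeme '%s'\n", tok.kind, tok.lexeme);
            tok = getNextToken();         /* ... until the whole input is read  */
        }
        return 0;
    }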
A pattern is a description of the form that the lexemes of a token may take [ or match].
In the case of a keyword as a token, the pattern is just the sequence of characters
that form the keyword. For identifiers and some other tokens, the pattern is a more
complex structure that is matched by many strings.
For example, in the C statement
printf("Total = %d\n", score);
both printf and score are lexemes matching the pattern for the token id, and "Total = %d\n" is a lexeme matching the pattern for the token literal (a string).
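Laid out lexeme by lexeme (with illustrative token names), the statement above is tokenized as follows:

    Lexeme              Token      Pattern matched
    printf              id         letter ( letter | digit )*
    (                   (          the single character (
    "Total = %d\n"      literal    characters enclosed in double quotes
    ,                   ,          the single character ,
    score               id         letter ( letter | digit )*
    )                   )          the single character )
    ;                   ;          the single character ;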
There are a number of reasons why the analysis portion of a compiler is normally
separated into lexical analysis and parsing (syntax analysis) phases.
INPUT BUFFERING:
Buffer Pairs:
Because of the amount of time taken to process characters and the large number
of characters that must be processed during the compilation of a large source
program, specialized buffering techniques have been developed to reduce the amount
of overhead required to process a single input character. An important scheme
involves two buffers that are alternately reloaded.
Each buffer is of the same size N, and N is usually the size of a disk block, e.g.,
4096 bytes. Using one system read command we can read N characters in to a
buffer, rather than using one system call per character. If fewer than N characters
remain in the input file, then a special character, represented by eof, marks the end
of the source file and is different from any possible character of the source program.
Two pointers into the input are maintained:
1. Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to determine.
2. Pointer forward scans ahead until a pattern match is found; the exact
strategy whereby this determination is made will be covered in the
balance of this chapter.
Once the next lexeme is determined, forward is set to the character at its right end. Then, after the lexeme is recorded as an attribute value of a token returned to the parser, lexemeBegin is set to the character immediately after the lexeme just found. In the figure, forward has passed the end of the next lexeme, ** (the Fortran exponentiation operator), and must be retracted one position to its left.
Advancing forward requires that we first test whether we have reached the end
of one of the buffers, and if so, we must reload the other buffer from the input, and
move forward to the beginning of the newly loaded buffer. As long as we never need to
look so far ahead of the actual lexeme that the sum of the lexeme's length plus the
distance we look ahead is greater than N, we shall never overwrite the lexeme in its
buffer before determining it.
If we use the above scheme as described, we must check, each time we advance
forward, that we have not moved off one of the buffers; if we have, then we must reload the other buffer. Thus, for each character read, we make two tests: one for the end of the buffer, and one to determine what character is read (the latter may be a multiway branch). We can combine the buffer-end test with the test for the current
character if we extend each buffer to hold a sentinel character at the end. The sentinel
is a special character that cannot be part of the source program, and a natural choice is
the character eof. Figure 1.8 shows the same arrangement as Figure 1.7, but with the
sentinels added. Note that eof retains its use as a marker for the end of the entire
input.
Any eof that appears other than at the end of a buffer means that the input is at an end.
Figure 1.9 summarizes the algorithm for advancing forward. Notice how the first test,
which can be part of
a multiway branch based on the character pointed to by forward, is the only test we
make, except in the case where we actually are at the end of a buffer or the end of the
input.
switch ( *forward++ )
{
    case eof:
        if (forward is at end of first buffer )
        {
            reload second buffer;
            forward = beginning of second buffer;
        }
        else if (forward is at end of second buffer )
        {
            reload first buffer;
            forward = beginning of first buffer;
        }
        else /* eof within a buffer marks the end of the input */
            terminate lexical analysis;
        break;
    /* cases for the other characters */
}
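A compact, self-contained C sketch of the same two-buffer scheme (the buffer size, the names, the choice of '\0' as the stand-in sentinel, and the reload helper are all assumptions made for illustration):

    #include <stdio.h>

    #define N 4096                 /* buffer size, typically one disk block        */
    #define SENTINEL '\0'          /* stand-in for the special eof character       */

    static char buf[2 * N + 2];    /* two halves, each followed by a sentinel slot */
    static char *forward;          /* scanning pointer                             */
    static FILE *src;              /* the source file being scanned                */

    /* read up to N characters into the half starting at 'start' and place the
       sentinel immediately after the last character actually read */
    static void reload(char *start)
    {
        size_t n = fread(start, 1, N, src);
        start[n] = SENTINEL;
    }

    /* advance 'forward' by one character; when the sentinel at the end of a
       full half is reached, the other half is reloaded and scanning continues
       there. Returns the character read, or SENTINEL at the true end of input. */
    static char advance(void)
    {
        char c = *forward++;
        if (c == SENTINEL) {
            if (forward == buf + N + 1) {              /* end of first half  */
                reload(buf + N + 1);
                forward = buf + N + 1;
                c = *forward++;
            } else if (forward == buf + 2 * N + 2) {   /* end of second half */
                reload(buf);
                forward = buf;
                c = *forward++;
            }
            /* otherwise the sentinel lies inside a half: real end of input */
        }
        return c;
    }

    int main(void)
    {
        long count = 0;
        src = stdin;               /* treat standard input as the source program */
        reload(buf);
        forward = buf;
        while (advance() != SENTINEL)
            count++;
        printf("%ld characters scanned\n", count);
        return 0;
    }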
SPECIFICATION OF TOKENS:
Regular expressions are an important notation for specifying lexeme patterns. While they
cannot express all possible patterns, they are very effective in specifying those types of patterns
that we actually need for tokens.
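For example (illustrative patterns written with the usual regular-expression operators), the token id can be specified by the pattern letter ( letter | digit )*, an unsigned integer constant by digit digit*, and a keyword such as if simply by the fixed string if.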
Lex is a tool used to generate a lexical analyzer; the input notation for the Lex tool is referred to as the Lex language, and the tool itself is the Lex compiler. Behind the scenes, the Lex compiler transforms the input patterns into a transition diagram and generates code in a file called lex.yy.c; this is a C program which, when compiled by a C compiler, yields the object code of the lexical analyzer. Here we need to know how to write the Lex language. The structure of a Lex program is given below.
declarations
%%
translation rules
%%
auxiliary functions
In the Translation rules section, we place pattern-action pairs, where each pair has the form
Pattern   {Action}
%}
/* regular definitions */
delim     [ \t\n]
ws        {delim}+
letter    [A-Za-z]
digit     [0-9]
else {return(ELSE) ; }
int installID() {/* function to install the lexeme, whose first character is pointed to by yytext and whose length is yyleng, into the symbol table and return a pointer to it */}
int installNum() {/* similar to installID, but puts numerical constants into a separate table */}
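Putting the fragments above together, a minimal Lex specification might look as follows (a sketch only: the manifest constants, the extra patterns, and the bodies of the installation helpers are assumptions, and a real specification would handle many more tokens):

    %{
    /* assumed manifest constants naming the tokens */
    #define IF     256
    #define ELSE   257
    #define ID     258
    #define NUMBER 259
    int yylval;                 /* attribute value of the last token    */
    int installID();            /* declared here, defined further below */
    int installNum();
    %}
    /* regular definitions */
    delim    [ \t\n]
    ws       {delim}+
    letter   [A-Za-z]
    digit    [0-9]
    id       {letter}({letter}|{digit})*
    number   {digit}+(\.{digit}+)?

    %%

    {ws}       { /* no action and no return: white space is discarded */ }
    if         { return(IF); }
    else       { return(ELSE); }
    {id}       { yylval = installID();  return(ID); }
    {number}   { yylval = installNum(); return(NUMBER); }

    %%

    int installID()  { /* install the lexeme pointed to by yytext, of length
                          yyleng, into the symbol table and return a handle */
                       return 0; }
    int installNum() { /* similar to installID, but for numerical constants */
                       return 0; }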
SYNTAX ANALYSIS (PARSER)
THE ROLE OF THE PARSER:
In our compiler model, the parser obtains a string of tokens from the
lexical analyzer, as shown in the below Figure, and verifies that the string of
token names can be generated by the grammar for the source language.
We expect the parser to report any syntax errors in an intelligible fashion and to
recover from commonly occurring errors to continue processing the remainder
of the program. Conceptually, for well-formed programs, the parser constructs a
parse tree and passes it to the rest of the compiler for further processing.
Figure 2.1: Parser in the Compiler
During the process of parsing, the parser may encounter errors and must present the error information back to the user.
Parsing techniques are broadly classified into two categories (a small illustration follows the list):
1. Top-Down Parsing: Parse tree construction starts at the root node and moves towards the children nodes (i.e., in top-down order).
2. Bottom-Up Parsing: Parse tree construction starts at the leaf nodes and proceeds towards the root (i.e., in bottom-up order).
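As a tiny illustration (an assumed grammar, not one defined in these notes), take the grammar E -> E + T | T, T -> T * F | F, F -> ( E ) | id. One leftmost derivation of the string id + id * id is

    E => E + T => T + T => F + T => id + T => id + T * F
      => id + F * F => id + id * F => id + id * id

A top-down parser builds the corresponding parse tree from the root E downwards towards these leaves, whereas a bottom-up parser starts from the tokens at the leaves and reduces them step by step back to E.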
1. What is a Compiler? Explain the working of a Compiler with your own example?
2. What is the Lexical analyzer? Discuss the Functions of Lexical Analyzer.
3. Write short notes on tokens, patterns and lexemes?
4. Write short notes on Input buffering scheme? How do you change
the basic input buffering algorithm to achieve better performance?
5. What do you mean by a Lexical analyzer generator? Explain LEX tool.