CD Unit1 Notes
Compiler: A program that reads a program written in one high-level (source) language and translates it into an equivalent
program in another (object) language, which is ready to be executed on a computer.
Related Topics
Compilers vs. Translators
Compiler typically refers to translation from high-level source code to low-level code; translator is the more general term, and the examples below include translations between languages at the same level.
Examples
Typical compilers: gcc, javac…
Non-typical compilers:
o Latex (document compiler).
o C-to-silicon compiler.
Translators
o F2c: Fortran-to-C translator (both high-level).
o Latex2html (both documents).
o Dvips2ps (both low-level).
Compiler vs. Interpreter
1) Compiler: It translates the source code into object code as a whole.
   Interpreter: It translates the statements of the source code one by one and executes them immediately.
2) Compiler: The translator program is not required each time you want to run the program.
   Interpreter: The translator program is required each time you want to run the program.
3) Compiler: It does not make it easy to correct mistakes in the source code.
   Interpreter: It makes it easier to correct mistakes in the source code.
4) Compiler: Most high-level programming languages have a compiler.
   Interpreter: A few high-level programming languages have an interpreter.
Assembler: A program that translates an assembly-language program into relocatable machine code.
In addition to a compiler, several other programs may be required to create an executable target program.
A source program may be divided into modules stored in separate files. The task of collecting the source program is sometimes
entrusted to a separate program, called a preprocessor. The preprocessor may also expand shorthands, called macros, into source
language statements.
The modified source program is then fed to a compiler. The compiler may produce an assembly-language program as its output,
because assembly language is easier to produce as output and is easier to debug. The assembly language is then processed by a
program called an assembler that produces relocatable machine code as its output.
Large programs are often compiled in pieces, so the relocatable machine code may have to be linked together with other
relocatable object files and library files into the code that actually runs on the machine. The linker resolves external memory
addresses, where the code in one file may refer to a location in another file. The loader then puts all of the executable
object files together into memory for execution.
Fig: Language Processing System
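As a concrete illustration, on a Unix-like system with the GNU toolchain these stages can be driven one at a time (a sketch; exact commands and flags vary by platform):
$ gcc -E prog.c -o prog.i     # preprocessor: expands #include directives and macros
$ gcc -S prog.i -o prog.s     # compiler proper: produces assembly language
$ as prog.s -o prog.o         # assembler: produces relocatable machine code
$ gcc prog.o -o prog          # driver invokes the linker: resolves external references, links libraries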
Structure of a Compiler
There are two parts to compilation: analysis and synthesis.
The analysis part breaks up the source program into constituent pieces and imposes a grammatical structure on them. It
then uses this structure to create an intermediate representation of the source program. If the analysis part detects that the
source program is either syntactically ill formed or semantically unsound, then it must provide informative messages, so
the user can take corrective action. The analysis part also collects information about the source program and stores it in a
data structure called a symbol table, which is passed along with the intermediate representation to the synthesis part.
The synthesis part constructs the desired target program from the intermediate representation and the information in the
symbol table.
The analysis part is often called the front end of the compiler; the synthesis part is the back end.
Phases Of Compiler:
The compilation process is a sequence of various phases. Each phase transforms the source program from one representation
to another: it takes its input from the previous phase, has its own representation of the source program, and feeds its output to the
next phase of the compiler.
Fig: Phases of Compiler
Lexical Analysis (also called Scanner)
This phase works as a text scanner: it reads the source code as a stream of characters and groups it into meaningful lexemes. The
lexical analyzer represents these lexemes in the form of tokens:
<token-name, attribute-value>
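For example, for the assignment statement position = initial + rate * 60, the lexical analyzer might produce the token stream below, where 1, 2, and 3 are symbol-table entries for position, initial, and rate:
<id,1> <=> <id,2> <+> <id,3> <*> <60>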
Syntax Analysis
The next phase is called syntax analysis or parsing. It takes the tokens produced by lexical analysis as input and generates a parse
tree (or syntax tree). In this phase, token arrangements are checked against the source code grammar, i.e., the parser checks if the
expression made by the tokens is syntactically correct.
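For the token stream above, the parser would group rate * 60 before the addition, producing a syntax tree equivalent to the bracketing <id,1> = (<id,2> + (<id,3> * 60)).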
Semantic Analysis
Semantic analysis checks whether the parse tree constructed follows the rules of the language, for example, that values are assigned
only between compatible data types and that a string is not added to an integer. The semantic analyzer also keeps track of identifiers,
their types and expressions, and whether identifiers are declared before use. It produces an annotated syntax tree as its
output.
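A small C fragment showing the kind of errors a semantic analyzer reports (this is deliberately erroneous code; the names are arbitrary):
void demo(void) {
    int x;
    y = 10;         /* semantic error: y is used without being declared */
    x = 3.5 % 2;    /* semantic error: the % operator requires integer operands */
}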
Intermediate Code Generation
After semantic analysis, the compiler generates an intermediate code of the source code for the target machine. It represents a program
for some abstract machine. It is in between the high-level language and the machine language. This intermediate code should be
generated in such a way that it makes it easier to be translated into the target machine code.
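Continuing the running example position = initial + rate * 60, and assuming rate is a floating-point variable so that 60 must be converted, a typical three-address intermediate code is:
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3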
Code Optimization
The next phase does code optimization of the intermediate code. Optimization can be assumed as something that removes unnecessary
code lines, and arranges the sequence of statements in order to speed up the program execution without wasting resources (CPU,
memory).
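For the same fragment, the optimizer can perform the int-to-float conversion of 60 once at compile time and eliminate the temporary t3, which is used only once:
t1 = id3 * 60.0
id1 = id2 + t1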
Code Generation
In this phase, the code generator takes the optimized representation of the intermediate code and maps it to the target machine language.
The code generator translates the intermediate code into a sequence of (generally) relocatable machine code.
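One possible target code for the optimized fragment above, using registers R1 and R2 (the mnemonics are illustrative; a real instruction set will differ):
LDF  R2, id3
MULF R2, R2, #60.0
LDF  R1, id2
ADDF R1, R1, R2
STF  id1, R1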
Symbol Table
It is a data structure maintained throughout all the phases of a compiler. All identifier names, along with their types, are stored here.
The symbol table makes it easier for the compiler to quickly search the identifier record and retrieve it.
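A minimal sketch in C of one way a symbol table can be organized, as a hash table with entries chained per bucket; the struct layout, sizes, and function names here are assumptions for illustration, not a prescribed design:

#include <string.h>

#define TABLE_SIZE 211                 /* number of hash buckets (an arbitrary prime) */

struct symbol {
    char name[64];                     /* the identifier's lexeme */
    char type[16];                     /* e.g., "int", "float" */
    struct symbol *next;               /* chain for entries that hash to the same bucket */
};

static struct symbol *buckets[TABLE_SIZE];

static unsigned hash(const char *s) {  /* simple string hash over the lexeme */
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

struct symbol *lookup(const char *name) {   /* return the entry for name, or NULL if absent */
    for (struct symbol *p = buckets[hash(name)]; p != NULL; p = p->next)
        if (strcmp(p->name, name) == 0)
            return p;
    return NULL;
}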
Fig: Phases of Compiler with Example
Pass: several phases may be grouped into one pass.
For example, the front-end phases of lexical analysis, syntax analysis, semantic analysis, and intermediate code generation might
be grouped together into one pass. Code optimization might be an optional pass. Then there could be a back-end pass consisting of
code generation for a particular target machine.
Compiler-Construction Tools:
These are tools used to implement the various phases of a compiler.
They are also called compiler-compilers, compiler-generators, or translator-writing systems.
1. Scanner generators that produce lexical analyzers from a regular-expression description of the tokens of a language.
2. Parser generators that automatically produce syntax analyzers from a grammatical description of a programming
language.
3. Syntax-directed translation engines that produce collections of routines for walking a parse tree and generating
intermediate code.
4. Code-generator generators that produce a code generator from a collection of rules for translating each operation of the
intermediate language into the machine language for a target machine.
5. Data-flow analysis engines that facilitate the gathering of information about how values are transmitted from one part of a
program to each other part. Data-flow analysis is a key part of code optimization.
6. Compiler-construction toolkits that provide an integrated set of routines for constructing various phases of a compiler.
Today, there are thousands of programming languages. They can be classified in a variety of ways.
One classification is by generation
Another classification of languages uses the term imperative for languages in which a program specifies how a
computation is to be done, and declarative for languages in which a program specifies what computation is to be done.
Declarative
o Functional : Lisp/Scheme, ML, Haskell
o Dataflow: Id, Val
o Logic, constraint-based: Prolog, spreadsheets
o Template-based: XSLT
Imperative
o Von Neumann: C, Ada, Fortran, . . .
o Scripting: Perl, Python, PHP, . . .
o Object-oriented: Smalltalk, Eiffel, C++, Java, . . .
The environment is a mapping from names to locations in the store. Since variables refer to locations
("l-values" in the terminology of C), we could alternatively define an environment as a mapping from
names to variables.
The state is a mapping from locations in the store to their values. That is, the state maps l-values to their
corresponding r-values, in the terminology of C.
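A two-line C fragment that makes the two mappings concrete (the address shown is purely illustrative):
int x = 5;    /* environment: the name x is bound to a storage location, say location 1000; state: location 1000 holds the r-value 5 */
x = x + 1;    /* the environment is unchanged, but the state now maps location 1000 to 6 */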
An identifier is a string of characters, typically letters or digits, that refers to (identifies) an entity, such
as a data object, a procedure, a class, or a type. All identifiers are names, but not all names are
identifiers. Names can also be expressions
A variable refers to a particular location of the store.
The scope rules for C are based on program structure; the scope of a declaration is determined implicitly
by where the declaration appears in the program. Later languages, such as C++, Java, and C# also
provide explicit control over scopes through the use of keywords like public, private, and protected.
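A small C example in which the scope of each declaration of i is fixed implicitly by where it appears (the names are only for illustration):
int i = 1;              /* visible from here to the end of the file, except where hidden */
void f(void) {
    int i = 2;          /* hides the global i throughout the body of f */
    {
        int i = 3;      /* hides the outer declarations, but only inside this inner block */
    }
}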
A function generally returns a value of some type (the "return type"), while a procedure does not return
any value. C and similar languages, which have only functions, treat procedures as functions that have the
special return type "void" to signify no return value. Object-oriented languages like Java and C++ use
the term "methods."
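In C terms, a minimal sketch of the distinction:
#include <stdio.h>

int square(int n) { return n * n; }         /* a function: returns a value of its return type, int */
void report(int n) { printf("%d\n", n); }   /* procedure-like: the return type void signals no return value */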
Through keywords like public, private, and protected, object-oriented languages such as C++ or Java provide
explicit control over access to member names in a superclass. These keywords support encapsulation by
restricting access.
Dynamic scope resolution is also essential for polymorphic procedures, those that have two or more
definitions for the same name.
All programming languages have a notion of a procedure, but they can differ in how these procedures
get their arguments (a sketch contrasting the first two mechanisms follows the list below).
o Call by value
o Call by reference
o Call by name
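A short C sketch contrasting the first two mechanisms. C itself passes arguments by value, so call by reference is simulated here with a pointer; call by name has no direct C equivalent:

#include <stdio.h>

void by_value(int n)      { n = 99; }    /* changes only the local copy of the argument */
void by_reference(int *n) { *n = 99; }   /* changes the caller's variable through its address */

int main(void) {
    int a = 1, b = 1;
    by_value(a);          /* a is still 1 afterwards */
    by_reference(&b);     /* b is now 99 */
    printf("%d %d\n", a, b);             /* prints: 1 99 */
    return 0;
}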
Token: A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol
representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters denoting an identifier
The token names are the input symbols that the parser processes
Pattern: A pattern is a description of the form that the lexemes of a token may take (i.e., the regular expression that the
lexemes should match).
Lexeme: A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the
lexical analyzer as an instance of that token.
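For example, in the statement count = count + 1, the character sequence count is a lexeme; it matches the pattern letter (letter | digit)* and is therefore reported to the parser as an instance of the token id, while the lexeme 1 matches the pattern for the token number.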
Lexical Errors
A lexical error occurs when the scanner encounters a sequence of characters that cannot form any valid token, for example an illegal
character in the input. A common recovery strategy is panic mode: delete successive characters from the remaining input until a
well-formed token can be found.
Input Buffering
Input buffering is used in the lexical analyzer to speed up the task of reading the source program.
Without buffering, the scanner reads every character directly from secondary storage, which is very time consuming; a buffering
technique is used to avoid this.
In this technique, a block of data is first read into a buffer and then scanned by the lexical analyzer. Using one system read command
we can read N characters (the block size) into a buffer, rather than using one system call per character. If fewer than N characters
remain in the input file, then a special character, represented by eof, marks the end of the source file.
Two pointers to the input are maintained:
1. Pointer lexemeBegin marks the beginning of the current lexeme.
2. Pointer forward scans ahead until a pattern match for the next lexeme is found.
Once the next lexeme is determined, forward is set to the character at its right end. Then, after the lexeme is recorded as an
attribute value of a token returned to the parser, lexemeBegin is set to the character immediately after the lexeme just found. In
below Figure, we see forward has passed the end of the next lexeme, ** (the Fortran exponentiation operator), and must be
retracted one position to its left.
Fig: Using a pair of input buffers
If only one buffer is used and the length of a lexeme exceeds the length of the buffer, then the buffer has to be refilled to scan
the rest of the lexeme, which overwrites the first part of the lexeme.
To overcome this, a two-buffer scheme is used. In this scheme, advancing forward requires that we first test whether we have
reached the end of one of the buffers; if so, we must reload the other buffer from the input and move forward to the beginning
of the newly loaded buffer.
To identify the end of a buffer, we place an eof character at the end, called a sentinel (i.e., a sentinel is a special character that
marks the end of the buffer).
Fig: Look ahead code with sentinels
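A C-flavored sketch of the buffer-end test that the figure describes; the buffer size N, the reload helper, and the use of '\0' as the eof sentinel are assumptions made for illustration:

#include <stdio.h>

#define N 4096                      /* block size; each buffer holds N characters plus a sentinel */
char buf[2][N + 1];                 /* the pair of input buffers */
char *forward;                      /* the scanning pointer, assumed to start at buf[0] after the first reload */

void reload(char *buffer);          /* assumed helper: reads the next block and appends the '\0' sentinel */

int advance(void) {                 /* return the next input character, or EOF at end of input */
    int c = (unsigned char)*forward++;
    if (c == '\0') {                              /* reached a sentinel */
        if (forward == buf[0] + N + 1) {          /* sentinel at the end of the first buffer */
            reload(buf[1]);
            forward = buf[1];
        } else if (forward == buf[1] + N + 1) {   /* sentinel at the end of the second buffer */
            reload(buf[0]);
            forward = buf[0];
        } else {
            return EOF;                           /* a sentinel inside a buffer marks the real end of input */
        }
        c = (unsigned char)*forward++;            /* (a second check would be needed if input ends exactly on a buffer boundary) */
    }
    return c;
}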
Specification of Tokens
Regular expressions are an important notation for specifying lexeme patterns. While they cannot express all possible patterns, they
are very effective in specifying those types of patterns that we actually need for tokens
An alphabet is any finite set of symbols. Typical examples of symbols are letters, digits, and punctuation. The set {0,1} is the
binary alphabet.
A string over an alphabet is a finite sequence of symbols drawn from that alphabet. The length of a string s is usually written |s|;
banana is a string of length six. The empty string, denoted by ε, is the string of length zero.
Regular Expressions (R.E.) are useful for representing sets of strings of a specific language. They provide a convenient and useful
notation for representing tokens.
A Regular Expression can be defined recursively as follows (examples applying these rules appear after the list):
1. Any element x ∈ ∑ is a regular expression.
2. The null string ε is a R.E.
3. The union of two R.E.'s R1 and R2 is also a R.E.: (R1+R2) or (R1|R2).
4. The concatenation of two R.E.'s R1 and R2 is also a R.E.: (R1.R2) or (R1R2).
5. The iteration (closure) of a R.E. R is also a R.E.: (R*).
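For example, over the alphabet ∑ = {a, b}: a|b denotes the set {a, b}; (a|b)(a|b) denotes {aa, ab, ba, bb}; a* denotes {ε, a, aa, aaa, ...}; and (a|b)* denotes the set of all strings of a's and b's, including the empty string.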
Regular definition:
If ∑ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:
d1 → r1
d2 → r2
...
dn → rn
where:
1. Each di is a new symbol, not in ∑ and not the same as any other of the d’s, and
2. Each ri is a regular expression over the alphabet ∑ U {d1,d2,.. . ,di-1}.
By restricting ri to ∑ and the previously defined d’s, we avoid recursive definitions
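A standard example is the regular definition for identifiers:
letter → A | B | ... | Z | a | b | ... | z
digit  → 0 | 1 | ... | 9
id     → letter ( letter | digit )*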
The terminals of the grammar, which are if, then, else, relop, id, and number, are the names of tokens as far as the lexical
analyzer is concerned. The patterns for these tokens are described using regular definitions, as shown in the figure below.
In addition, we assign the lexical analyzer the job of stripping out whitespace, by recognizing the "token" ws, defined by ws → ( blank | tab | newline )+.
Token ws is different from the other tokens in that, when we recognize it, we do not return it to the parser, but rather restart the
lexical analysis from the character that follows the whitespace. It is the following token that gets returned to the parser.
Fig: Regular Expression Patterns for Tokens
Transition Diagrams
As an intermediate step in the construction of a lexical analyzer, we first convert patterns into stylized flowcharts, called
"transition diagrams."
Here we perform the conversion from regular-expression patterns to transition diagrams by hand, but there is also a mechanical
way to construct these diagrams from collections of regular expressions.
Fig: Transition Diagram for Relational Operators
Here states 4 and 8 have a * to indicate that we must retract the input by one position.
There are two ways that we can handle reserved words that look like identifiers:
1. Install the reserved words in the symbol table initially. When we find an identifier, a call to installID
places it in the symbol table if it is not already there and returns a pointer to the symbol-table entry for the lexeme found.
The function getToken examines the symbol-table entry for the lexeme found and returns whatever token name the symbol table
says this lexeme represents: either id or one of the keyword tokens that was initially installed in the table (see the sketch after this list).
2. Create separate transition diagrams for each keyword
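A minimal C sketch of the first approach; the token codes and the install/lookup helpers are assumptions standing in for a real symbol-table interface:

enum token { ID, IF, THEN, ELSE };                          /* illustrative token codes */

struct entry { char name[64]; enum token tok; };
struct entry *install(const char *name, enum token tok);   /* assumed: adds an entry and returns it */
struct entry *lookup(const char *name);                    /* assumed: finds an entry or returns NULL */

void init_reserved(void) {              /* install the reserved words before scanning starts */
    install("if", IF);
    install("then", THEN);
    install("else", ELSE);
}

enum token get_token(const char *lexeme) {  /* classify an identifier-shaped lexeme */
    struct entry *e = lookup(lexeme);
    if (e == 0)
        e = install(lexeme, ID);        /* not seen before: an ordinary identifier */
    return e->tok;                      /* ID, or a keyword token if it was pre-installed */
}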
The C source code for the lexical analyzer is generated when you enter
$ lex lex.l
where lex.l is the file containing your lex specification
The lexical analyzer code stored in lex.yy.c (or the .c file to which it was redirected) must be compiled to generate the executable
object program, or scanner, that performs the lexical analysis of an input text.
The lex library supplies a default main() that calls the function yylex(), so you need not supply your own main(). The library is
accessed by invoking the -ll option to cc:
$ cc lex.yy.c -ll
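A minimal lex specification of the kind lex.l might contain (a sketch only; a real scanner would return token codes to a parser instead of printing):

%{
#include <stdio.h>
%}
%%
[0-9]+                    { printf("NUMBER: %s\n", yytext); }
[a-zA-Z_][a-zA-Z0-9_]*    { printf("ID: %s\n", yytext); }
[ \t\n]+                  { /* skip whitespace */ }
.                         { printf("OTHER: %s\n", yytext); }
%%

Built with the commands shown above, the resulting a.out reads its standard input, reports each number and identifier it finds, and skips whitespace.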