CD Unit1 Notes

A compiler is a program that translates programs written in a high-level language into an equivalent program in a lower-level language. Studying compiler design is applicable to many fields like command interpreters, text formatters, graphic interpreters, and more. Compilers translate source code into object code as a whole and create object files for faster execution, while interpreters directly execute code and do not create object files but allow for easier debugging.

What is a Compiler?

A program that reads a program written in one high-level language and translates it into an equivalent
program in another (object) language, which is ready to be executed on a computer.

Why Study Compiler Design


Compiler theory and tools are applicable to other fields:
 Command and query interpreters;
 Text formatters (TeX, LaTeX, HTML);
 Graphic interpreters (PS, GIF, JPEG);
 Translating javadoc comments to HTML;
 Generating a table from the results of a SQL query;
 Spam filters;
 Servers that respond to a network protocol.

Related Topics
Compilers vs. Translators
Compilers typically refer to the translation from high-level source code to low-level code.

Translators refer to the transformation at the same level of abstraction.

Examples
 Typical compilers: gcc, javac…
 Non-typical compilers:
o LaTeX (document compiler).
o C-to-silicon compiler.
 Translators
o f2c: Fortran-to-C translator (both high-level).
o latex2html (both documents).
o dvips, DVI to PostScript (both low-level).

Compilers vs. Interpreters


Interpreter:
It directly executes the source program on inputs supplied by the user.

Hybrid compiler (combines compilation and interpretation): the source program is first compiled into an intermediate form, which is then interpreted; Java is a common example, where source code is compiled to bytecode that is then interpreted (or further compiled) by the Java virtual machine.

Compiler vs. Interpreter
1) Compiler: translates the source code into object code as a whole.
   Interpreter: translates the statements of the source code one by one and executes each immediately.
2) Compiler: creates an object file.
   Interpreter: does not create an object file.
3) Compiler: program execution is very fast.
   Interpreter: program execution is slow.
4) Compiler: the translator program is not required each time you want to run the program.
   Interpreter: the translator program is required each time you want to run the program.
5) Compiler: does not make it easy to correct mistakes in the source code.
   Interpreter: makes it easier to correct mistakes in the source code.
6) Compiler: most high-level programming languages have a compiler.
   Interpreter: only a few high-level programming languages have an interpreter.

Assembler: Program that translates an assembly-language program into a relocatable machine code.

In addition to a compiler, several other programs may be required to create an executable target program.

A source program may be divided into modules stored in separate files. The task of collecting the source program is sometimes
entrusted to a separate program, called a preprocessor. The preprocessor may also expand shorthands, called macros, into source-language statements.

The modified source program is then fed to a compiler. The compiler may produce an assembly-language program as its output,
because assembly language is easier to produce as output and is easier to debug. The assembly language is then processed by a
program called an assembler that produces relocatable machine code as its output.

Large programs are often compiled in pieces, so the relocatable machine code may have to be linked together with other
relocatable object files and library files into the code that actually runs on the machine. The linker resolves external memory
addresses, where the code in one file may refer to a location in another file. The loader then puts together all of the executable
object files into memory for execution.
Fig: Language Processing System

Structure of a Compiler
There are two parts to compilation: analysis and synthesis.
The analysis part breaks up the source program into constituent pieces and imposes a grammatical structure on them. It
then uses this structure to create an intermediate representation of the source program. If the analysis part detects that the
source program is either syntactically ill formed or semantically unsound, then it must provide informative messages, so
the user can take corrective action. The analysis part also collects information about the source program and stores it in a
data structure called a symbol table, which is passed along with the intermediate representation to the synthesis part.
The synthesis part constructs the desired target program from the intermediate representation and the information in the
symbol table.
The analysis part is often called the front end of the compiler; the synthesis part is the back end.
Phases Of Compiler:
The compilation process is a sequence of phases, each of which transforms the source program from one representation to another.
Each phase takes its input from the previous stage, has its own representation of the source program, and feeds its output to the
next phase of the compiler.
Fig: Phases of Compiler
Lexical Analysis (also called Scanner)
It works as a text scanner. This phase scans the source code as a stream of characters and converts it into meaningful lexemes. The lexical
analyzer represents these lexemes in the form of tokens as:
<token-name, attribute-value>
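As a rough sketch (the enum and struct names below are only illustrative), such a pair can be represented in C as:

/* Minimal sketch of a <token-name, attribute-value> pair (illustrative names). */
enum TokenName { TOK_ID, TOK_NUMBER, TOK_IF, TOK_RELOP, TOK_ASSIGN };

struct Token {
    enum TokenName name;   /* abstract symbol used by the parser, e.g. TOK_ID       */
    int attribute;         /* e.g. index of the lexeme's entry in the symbol table  */
};

For example, the statement position = initial + rate * 60 is tokenized as <id,1> <=> <id,2> <+> <id,3> <*> <60>, where 1, 2, and 3 are the symbol-table entries for position, initial, and rate.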
Syntax Analysis
The next phase is called syntax analysis or parsing. It takes the tokens produced by lexical analysis as input and generates a parse
tree (or syntax tree). In this phase, token arrangements are checked against the source code grammar, i.e., the parser checks if the
expression made by the tokens is syntactically correct.
Semantic Analysis
Semantic analysis checks whether the parse tree constructed follows the rules of the language: for example, that assignments are
between compatible data types, and that errors such as adding a string to an integer are reported. The semantic analyzer also keeps
track of identifiers, their types, and expressions, and checks whether identifiers are declared before use. It produces an annotated
syntax tree as its output.
Intermediate Code Generation
After semantic analysis, the compiler generates an intermediate code of the source code for the target machine. It represents a program
for some abstract machine. It is in between the high-level language and the machine language. This intermediate code should be
generated in such a way that it makes it easier to be translated into the target machine code.
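For example, in the widely used three-address form, the assignment position = initial + rate * 60 might be translated as:

t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3

where id1, id2, and id3 refer to the symbol-table entries for position, initial, and rate.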
Code Optimization
The next phase performs code optimization on the intermediate code. Optimization can be thought of as removing unnecessary
code lines and arranging the sequence of statements in order to speed up program execution without wasting resources (CPU,
memory).
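Continuing the example above, an optimizer can convert 60 to the floating-point constant 60.0 once at compile time and eliminate the copy through t3, reducing the intermediate code to:

t1 = id3 * 60.0
id1 = id2 + t1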
Code Generation
In this phase, the code generator takes the optimized representation of the intermediate code and maps it to the target machine language.
The code generator translates the intermediate code into a sequence of (generally) relocatable machine code.

Symbol Table
It is a data structure maintained throughout all the phases of a compiler. All the identifiers' names along with their types are stored here.
The symbol table makes it easier for the compiler to quickly search the identifier record and retrieve it.
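A minimal sketch of such a table in C is given below (a simple linear table; the names insertSymbol and lookupSymbol are illustrative, and real compilers usually use a hash table):

#include <string.h>

#define MAX_SYMBOLS 1024

struct Symbol {
    char name[64];   /* identifier lexeme        */
    char type[16];   /* e.g. "int" or "float"    */
};

static struct Symbol table[MAX_SYMBOLS];
static int nsymbols = 0;

/* Return the index of name in the table, or -1 if it is not present. */
int lookupSymbol(const char *name)
{
    for (int i = 0; i < nsymbols; i++)
        if (strcmp(table[i].name, name) == 0)
            return i;
    return -1;
}

/* Insert name with its type if absent; return its index either way. */
int insertSymbol(const char *name, const char *type)
{
    int i = lookupSymbol(name);
    if (i >= 0)
        return i;
    strncpy(table[nsymbols].name, name, sizeof table[nsymbols].name - 1);
    strncpy(table[nsymbols].type, type, sizeof table[nsymbols].type - 1);
    return nsymbols++;
}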
Fig: Phases of Compiler with Example
Pass: several phases may be grouped into one pass.

For example, the front-end phases of lexical analysis, syntax analysis, semantic analysis, and intermediate code generation might
be grouped together into one pass. Code optimization might be an optional pass. Then there could be a back-end pass consisting of
code generation for a particular target machine.
Compiler-Construction Tools:
These are tools used to implement the various phases of a compiler.
They are also called compiler-compilers, compiler-generators, or translator-writing systems.

1. Scanner generators that produce lexical analyzers from a regular-expression description of the tokens of a language.
2. Parser generators that automatically produce syntax analyzers from a grammatical description of a programming
language.
3. Syntax-directed translation engines that produce collections of routines for walking a parse tree and generating
intermediate code.
4. Code-generator generators that produce a code generator from a collection of rules for translating each operation of the
intermediate language into the machine language for a target machine.
5. Data-flow analysis engines that facilitate the gathering of information about how values are transmitted from one part of a
program to each other part. Data-flow analysis is a key part of code optimization.
6. Compiler-construction toolkits that provide an integrated set of routines for constructing various phases of a compiler.

The Evolution of Programming Languages

Today, there are thousands of programming languages. They can be classified in a variety of ways.
One classification is by generation

1. First Generation Languages:


 Machine language
o Operation code – such as addition or subtraction.
o Operands – that identify the data to be processed.
o Machine language is machine dependent as it is the only language the computer can
understand.
o Very efficient code but very difficult to write.
2. Second Generation Languages
 Assembly languages
o Symbolic operation codes replaced binary operation codes.
o Assembly language programs needed to be “assembled” for execution by the computer.
Each assembly language instruction is translated into one machine language instruction.
o Very efficient code and easier to write.

3. Third Generation Languages


Closer to English but included simple mathematical notation.
 Programs written in source code which must be translated into machine language programs
called object code.
 The translation of source code to object code is accomplished by a machine language system
program called a compiler.
 Alternative to compilation is interpretation which is accomplished by a system program called an
interpreter.
 Common third generation languages
 FORTRAN
 COBOL
 C and C++
 Visual Basic

4. Fourth Generation Languages


A high level language (4GL) that requires fewer instructions to accomplish a task than a third generation
language.
 Used with databases
o Query languages
o Report generators
o Forms designers
o Application generators

5. Fifth Generation Languages


 Declarative languages
 Functional: Lisp, Scheme, SML
o Also called applicative
o Everything is a function
 Logic: Prolog
o Based on mathematical logic
o Rule- or Constraint-based

Another classification of languages uses the term imperative for languages in which a program specifies how a
computation is to be done, and declarative for languages in which a program specifies what computation is to be done.

 Declarative
o Functional : Lisp/Scheme, ML, Haskell
o Dataflow: Id, Val
o Logic, constraint-based: Prolog, spreadsheets
o Template-based: XSLT
 Imperative
o Von Neumann: C, Ada, Fortran, . . .
o Scripting: Perl, Python, PHP, . . .
o Object-oriented: Smalltalk, Eiffel, C++, Java, . . .

The Science of Building a Compiler


A compiler must accept all source programs that conform to the specification of the language; the set of source
programs is infinite and any program can be very large, consisting of possibly millions of lines of code. Any transformation
performed by the compiler while translating a source program must preserve the meaning of the program being compiled.

Modeling in Compiler Design and Implementation


The study of compilers is mainly a study of how we design the right mathematical models and choose the right
algorithms, while balancing the need for generality and power against simplicity and efficiency.
Some of the most fundamental models are:
o Finite-state machines and regular expressions
 useful for describing the lexical units of programs
o Context-free grammars
 used to describe the syntactic structure of programming languages
o Trees
 model for representing the structure of programs
 For translation into object code

The Science of Code Optimization


The term "optimization" in compiler design refers to the attempts tha t a compiler makes to produce code that is
more efficient than the obvious code.

Compiler optimizations must meet the following design objectives:


• The optimization must be correct, that is, preserve the meaning of the compiled program,
• The optimization must improve the performance of many programs,
• The compilation time must be kept reasonable, and
• The engineering effort required must be manageable.

Applications of Compiler Technology


 Implementation of High-Level Programming Languages
 Optimizations for Computer Architectures
 Parallelism
 Memory Hierarchies
 Design of New Computer Architectures
 RISC
 Specialized architectures
 Program Translators
 Binary Translation
 Hardware Synthesis
 Database Query Interpreters
 Compiled Simulation
 Software Productivity tools
 Type checking
 Bounds Checking
 Memory Management tools
Programming Language Basics
 A language uses static scope or lexical scope if it is possible to determine the scope of a declaration by
looking only at the program text. Otherwise, the language uses dynamic scope: as the program runs, the same use of a name can refer to different declarations depending on the sequence of calls.

 The environment is a mapping from names to locations in the store. Since variables refer to locations
("l-values" in the terminology of C), we could alternatively define an environment as a mapping from
names to variables.
 The state is a mapping from locations in the store to their values. That is, the state maps l-values to their
corresponding r-values, in the terminology of C.
 An identifier is a string of characters, typically letters or digits, that refers to (identifies) an entity, such
as a data object, a procedure, a class, or a type. All identifiers are names, but not all names are
identifiers. Names can also be expressions
 A variable refers to a particular location of the store.
 The scope rules for C are based on program structure; the scope of a declaration is determined implicitly
by where the declaration appears in the program. Later languages, such as C++ , Java, and C# also
provide explicit control over scopes through the use of keywords like public, private, and protected.
 A function generally returns a value of some type (the "return type"), while a procedure does not return
any value. C and similar languages, which have only functions, treat procedures as functions that have a
special return type "void," to signify no return value. Object-oriented languages like Java and C++ use
the term "methods."
 Through keywords like public, private, and protected, object-oriented languages such as C++ or Java provide
explicit control over access to member names in a superclass. These keywords support encapsulation by
restricting access.
 Dynamic scope resolution is also essential for polymorphic procedures, those that have two or more
definitions for the same name.
 All programming languages have a notion of a procedure, but they can differ in how these procedures
get their arguments.
o Call by value
o Call by reference
o Call by name
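The difference between the first two mechanisms can be sketched in C, which passes arguments by value but can simulate call by reference with pointers (the function names are illustrative):

#include <stdio.h>

/* Call by value: the function gets a copy, so the caller's variable is unchanged. */
void incrementByValue(int x)      { x = x + 1; }

/* Call by reference (simulated with a pointer): the caller's variable is updated. */
void incrementByReference(int *x) { *x = *x + 1; }

int main(void)
{
    int a = 5;
    incrementByValue(a);
    printf("%d\n", a);       /* prints 5 */
    incrementByReference(&a);
    printf("%d\n", a);       /* prints 6 */
    return 0;
}

Call by name, used in Algol 60, substitutes the argument expression textually and has no direct C equivalent.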

Role of the Lexical Analyzer


The lexical analyzer is the first phase of a compiler. Its main task is to read the input characters of the
source program, group them into lexemes, and produce as output a sequence of tokens for each lexeme in the source program. The
stream of tokens is sent to the parser for syntax analysis.

Fig: Interaction between Lexical Analyzer and parser


The interaction is implemented by having the parser call the lexical analyzer. The call, suggested by the getNextToken
command, causes the lexical analyzer to read characters from its input until it can identify the next lexeme and produce for it the
next token, which it returns to the parser.
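A rough sketch of this interaction in C (the type and function names are illustrative) is:

/* Sketch of the parser driving the lexical analyzer on demand. */
struct Token { int name; int attribute; };

extern struct Token getNextToken(void);   /* implemented by the lexical analyzer */

#define TOK_EOF 0

void parse(void)
{
    struct Token tok = getNextToken();    /* parser asks for the next token      */
    while (tok.name != TOK_EOF) {
        /* ... grammar rules consume tok here ... */
        tok = getNextToken();             /* ask for the following token         */
    }
}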
The lexical analyzer may perform certain other tasks besides the identification of lexemes:
1. Stripping out comments and whitespace (blank, newline, tab, and perhaps other characters that are used to separate
tokens in the input).
2. Correlating error messages generated by the compiler with the source program. For instance, the lexical analyzer
may keep track of the number of newline characters seen, so it can associate a line number with each error message.
In some compilers, the lexical analyzer makes a copy of the source program with the error messages inserted at the
appropriate positions.
3. If the source program uses a macro-preprocessor, the expansion of macros may also be performed by the lexical
analyzer.
Lexical Analysis versus Parsing
There are a number of reasons why the analysis portion of a compiler is normally separated into lexical analysis and parsing
(syntax analysis) phases.
1. The separation of lexical and syntactic analysis often allows us to simplify at least one of these tasks.
2. Compiler efficiency is improved. A separate lexical analyzer allows us to apply specialized techniques that serve only the
lexical task, not the job of parsing. In addition, specialized buffering techniques for reading input characters can speed up
the compiler significantly.
3. Compiler portability is enhanced.

Token: A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol
representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters denoting an identifier
The token names are the input symbols that the parser processes

Pattern: A pattern is a description of the form that the lexemes of a token may take (i.e., the regular expression that the
lexemes should match).

Lexeme: A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the
lexical analyzer as an instance of that token.

Lexical Errors

 A lexical analyzer, on its own, cannot recognize misspelled words (for example, whether fi in fi(a == f(x)) is a misspelling of the keyword if or an undeclared function name).


 Lexical analyzer is unable to proceed if none of the patterns for tokens matches any prefix of the remaining input.
The simplest recovery strategy is "panic mode" recovery. We delete successive characters from the remaining input, until
the lexical analyzer can find a well-formed token at the beginning of what input is left.

Other possible error-recovery actions are:


1. Delete one character from the remaining input.
2. Insert a missing character into the remaining input.
3. Replace a character by another character.
4. Transpose two adjacent characters.
This strategy makes sense, since in practice most lexical errors involve a single character. A more general correction strategy is to
find the smallest number of transformations needed to convert the source program into one that consists only of valid lexemes.

Input Buffering
Input buffering can be used in the lexical analyzer to speed up the task of reading the source program.
A scanner that reads every character directly from secondary storage is very time consuming, so a buffering technique
is used to avoid this.
In this scheme, a block of data is first read into a buffer and then scanned by the lexical analyzer. Using one system read command we can read
N characters (the block size) into a buffer, rather than using one system call per character. If fewer than N characters remain in the input file, then a special
character, represented by eof, marks the end of the source file.
Two pointers to the input are maintained:
1. Pointer lexemeBegin marks the beginning of the current lexeme

2. Pointer forward scans ahead until a pattern match is found

Once the next lexeme is determined, forward is set to the character at its right end. Then, after the lexeme is recorded as an
attribute value of a token returned to the parser, lexemeBegin is set to the character immediately after the lexeme just found. In
below Figure, we see forward has passed the end of the next lexeme, ** (the Fortran exponentiation operator), and must be
retracted one position to its left.
Fig: Using a pair of input buffers
If only one buffer is used and the length of a lexeme exceeds the length of the buffer, then the buffer must be refilled in order to
scan the rest of the lexeme, which overwrites the first part of the lexeme.
To overcome this, a two-buffer scheme is used. In the two-buffer scheme, advancing forward requires that we first test
whether we have reached the end of one of the buffers; if so, we must reload the other buffer from the input and move
forward to the beginning of the newly loaded buffer.
To identify the end of a buffer, we place an eof character at the end, called a sentinel (i.e., a sentinel is a special character that
represents the buffer end).
Fig: Look ahead code with sentinels

Fig: Sentinels at end of each buffer
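The figures are not reproduced here; the following C sketch shows the sentinel test described above (the buffer size N and the names buf1, buf2, forward, and reloadBuffer are assumptions made for this sketch):

#define N 4096                 /* block size of one buffer                     */
#define EOF_CHAR '\0'          /* sentinel character marking a buffer end/eof  */

extern char buf1[N + 1], buf2[N + 1];   /* each buffer ends with a sentinel    */
extern char *forward;                   /* the forward (lookahead) pointer     */
extern void reloadBuffer(int which);    /* refills buffer 1 or buffer 2        */

/* Advance forward by one character, switching buffers when a sentinel is hit. */
int nextChar(void)
{
    for (;;) {
        char c = *forward++;
        if (c != EOF_CHAR)
            return (unsigned char)c;              /* ordinary character        */
        if (forward == buf1 + N + 1) {            /* sentinel ends buffer 1    */
            reloadBuffer(2);
            forward = buf2;                       /* continue in buffer 2      */
        } else if (forward == buf2 + N + 1) {     /* sentinel ends buffer 2    */
            reloadBuffer(1);
            forward = buf1;                       /* continue in buffer 1      */
        } else {
            return -1;   /* eof inside a buffer: the real end of the input     */
        }
    }
}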

Specification of Tokens
Regular expressions are an important notation for specifying lexeme patterns. While they cannot express all possible patterns, they
are very effective in specifying those types of patterns that we actually need for tokens

An alphabet is any finite set of symbols. Typical examples of symbols are letters, digits, and punctuation. The set {0, 1} is the
binary alphabet.
A string over an alphabet is a finite sequence of symbols drawn from that alphabet. The length of a string s is usually written |s|;
for example, banana is a string of length six. The empty string, denoted by ε, is the string of length zero.

The following string-related terms are commonly used:


1. A prefix of string s is any string obtained by removing zero or more symbols from the end of s. For example, ban, banana, and ε
are prefixes of banana.
2. A suffix of string s is any string obtained by removing zero or more symbols from the beginning of s. For example, nana,
banana, and ε are suffixes of banana.
3. A substring of s is obtained by deleting any prefix and any suffix from s. For instance, banana, nan, and ε are substrings of
banana.
4. The proper prefixes, suffixes, and substrings of a string s are those prefixes, suffixes, and substrings, respectively, of s that are
not ε and not equal to s itself.
5. A subsequence of s is any string formed by deleting zero or more not necessarily consecutive positions of s. For example, baan
is a subsequence of banana.

Regular expressions (R.E.) are useful for representing sets of strings of a specific language. They provide a convenient and useful
notation for representing tokens.
A Regular Expression can be defined recursively
1. Any element x ∈ ∑ is a regular expression.
2. The null string ε is a R.E.
3. The union of two R.E.'s R1 and R2 is also a R.E.: (R1+R2) or (R1|R2).
4. The concatenation of two R.E.'s R1 and R2 is also a R.E.: (R1.R2) or (R1R2).
5. The iteration (closure) of a R.E. R is also a R.E.: (R*).

Example Let ∑ = {a, b}.


1. The regular expression a|b denotes the language {a, b}.
2. (a|b)(a|b) denotes {aa, ab, ba, bb}, the language of all strings of length two over the alphabet ∑. Another regular expression for
the same language is aa|ab|ba|bb.
3. a* denotes the language consisting of all strings of zero or more a's, that is, {ε, a, aa, aaa, ...}.
4. (a|b)* denotes the set of all strings consisting of zero or more instances of a or b, that is, all strings of a's and b's: {ε, a, b, aa, ab,
ba, bb, aaa, ...}. Another regular expression for the same language is (a*b*)*.
5. a|a*b denotes the language {a, b, ab, aab, aaab, ...}, that is, the string a and all strings consisting of zero or more a's and ending
in b.

Regular definition:
If ∑ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:
d1 → r1
d2 → r2
...
dn → rn
where:
1. Each di is a new symbol, not in ∑ and not the same as any other of the d’s, and
2. Each ri is a regular expression over the alphabet ∑ ∪ {d1, d2, ..., di-1}.
By restricting ri to ∑ and the previously defined d’s, we avoid recursive definitions

Examples: 1. Regular Definition for Identifiers

2. Regular Definition for Unsigned numbers (integer or floating point)
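In the standard formulation (the figures are omitted here), these definitions read:

1. Identifiers:
letter → A | B | ... | Z | a | b | ... | z
digit → 0 | 1 | ... | 9
id → letter ( letter | digit )*

2. Unsigned numbers:
digits → digit digit*
optionalFraction → . digits | ε
optionalExponent → ( E ( + | - | ε ) digits ) | ε
number → digits optionalFraction optionalExponent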


(As an aside on the practical regex notation used by tools such as lex and grep: ^[^aeiou]*$ matches any complete line that does not contain a lowercase vowel.)
RECOGNITION OF TOKENS
Consider the following grammar fragment:

Fig: A grammar for branching statement in PASCAL
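In its standard textbook form, the grammar referred to by the figure is:

stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | number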

The terminals of the grammar, which are if, then, else, relop, id, and number, are the names of tokens as far as the lexical
analyzer is concerned. The patterns for these tokens are described using regular definitions, as
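In their usual form (with letter and digit as in the earlier regular-definition examples), these definitions are:

if     → if
then   → then
else   → else
relop  → < | > | <= | >= | = | <>
id     → letter ( letter | digit )*
digits → digit+
number → digits ( . digits )? ( E ( + | - )? digits )?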

In addition, we assign the lexical analyzer the job of stripping out whitespace, by recognizing the "token" ws defined by:
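In its usual form this definition is:

ws → ( blank | tab | newline )+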

Token ws is different from the other tokens in that, when we recognize it, we do not return it to the parser, but rather restart the
lexical analysis from the character that follows the whitespace. It is the following token that gets returned to the parser.
Fig: Regular Expression Patterns for Tokens

Transition Diagrams
As an intermediate step in the construction of a lexical analyzer, we first convert patterns into stylized flowcharts, called
"transition diagrams."
Now we perform the conversion from regular-expression patterns to transition diagrams by hand. But there is a mechanical
way to construct these diagrams from collections of regular expressions.
Fig: Transition Diagram for Relational Operators

Here states 4 and 8 carry a * to indicate that we must retract the input one position.
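The diagram itself is not shown here, but the following C sketch simulates the standard relop transition diagram (state numbers 0, 1, 6 with retracting states 4 and 8 as mentioned above; nextChar, retract, and the attribute codes are illustrative names):

enum RelopAttr { LT, LE, EQ, NE, GT, GE };

extern int  nextChar(void);   /* returns the next input character            */
extern void retract(void);    /* moves the forward pointer back one position */

/* Returns 1 and sets *attr if a relational operator is recognized, else 0. */
int getRelop(int *attr)
{
    int state = 0;
    for (;;) {
        int c = nextChar();
        switch (state) {
        case 0:
            if (c == '<')      state = 1;
            else if (c == '=') { *attr = EQ; return 1; }
            else if (c == '>') state = 6;
            else               { retract(); return 0; }            /* not a relop */
            break;
        case 1:                                     /* we have seen '<'            */
            if (c == '=')      { *attr = LE; return 1; }
            else if (c == '>') { *attr = NE; return 1; }
            else               { retract(); *attr = LT; return 1; } /* state 4 */
        case 6:                                     /* we have seen '>'            */
            if (c == '=')      { *attr = GE; return 1; }
            else               { retract(); *attr = GT; return 1; } /* state 8 */
        }
    }
}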

Fig: Transition Diagram for Identifiers

There are two ways that we can handle reserved words that look like identifiers:
1. Install the reserved words in the symbol table initially. When we find an identifier, a call to installID
places it in the symbol table if it is not already there and returns a pointer to the symbol-table entry for the lexeme found.
The function getToken examines the symbol-table entry for the lexeme found, and returns whatever token name the symbol table
says this lexeme represents: either id or one of the keyword tokens that was initially installed in the table.
2. Create separate transition diagrams for each keyword

Fig: Transition Diagram for Keyword ‘then’

Fig: Transition Diagram for Unsigned Numbers

Fig: Transition Diagram for White Spaces

Lexical analyzer generator - LEX


lex generates a C-language scanner from a source specification that you write. This specification contains a list of rules indicating
sequences of characters -- expressions -- to be searched for in an input text, and the actions to take when an expression is found.
... definitions ...
%%
... rules ...
%%
... subroutines ...
The following example prepends line numbers to each line in a file
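The listing itself is omitted here; a typical lex specification for this task (a reconstruction for illustration) is:

%{
int lineno = 1;   /* current line number */
%}
%%
^(.*)\n    printf("%d\t%s", lineno++, yytext);   /* print the line number, a tab, then the line */
%%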

The C source code for the lexical analyzer is generated when you enter
$ lex lex.l
where lex.l is the file containing your lex specification
The lexical analyzer code stored in lex.yy.c (or the .c file to which it was redirected) must be compiled to generate the executable
object program, or scanner, that performs the lexical analysis of an input text.
The lex library supplies a default main() that calls the function yylex(), so you need not supply your own main(). The library is
accessed by invoking the -ll option to cc:
$ cc lex.yy.c -ll

Lex Predefined Variables
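The table listing them is not reproduced here; the most commonly used predefined names are:
 yytext - the text of the lexeme that was just matched.
 yyleng - the length of the text held in yytext.
 yyin - the FILE pointer from which lex reads its input (stdin by default).
 yyout - the FILE pointer to which output is written (stdout by default).
 yylex() - the scanning routine generated by lex (a function rather than a variable, called by the default main()).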
