III Year-V Semester: B.Tech. Computer Science and Engineering
5CS4-02: Compiler Design
UNIT-1
A program written in a high-level language is called source code. To convert the source code
into machine code, translators are needed.
A translator takes a program written in a source language as input and converts it into a program in
a target language as output.
It also detects and reports errors during translation.
The roles of a translator are:
• Translating the high-level language program input into an equivalent machine language
program.
• Providing diagnostic messages wherever the programmer violates the specification of the high-level
language.
Different types of translators: Compiler and Interpreter
• A compiler makes debugging hard, as error messages are generated only after the entire program has been scanned; an interpreter stops translation when the first error is met, so debugging is easy.
• Programming languages like C and C++ use compilers, while languages like Python, BASIC, and Ruby use interpreters.
A compiler carries out the translation in the following phases:
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Intermediate code generator
5. Code optimizer
6. Code generator
Lexical Analysis:
Lexical analysis is the first phase of the compilation process. It takes source code as input, reads
the source program one character at a time, and converts it into meaningful lexemes. The lexical
analyzer represents these lexemes in the form of tokens.
Syntax Analysis
Syntax analysis is the second phase of the compilation process. It takes tokens as input and
generates a parse tree as output. In this phase, the parser checks whether the expression
made by the tokens is syntactically correct.
Semantic Analysis
Semantic analysis is the third phase of the compilation process. It checks whether the parse tree
follows the rules of the language. The semantic analyzer keeps track of identifiers, their types and
expressions. The output of this phase is the annotated syntax tree.
Code Generation
Code generation is the final stage of the compilation process. It takes the optimized intermediate
code as input and maps it to the target machine language. Code generator translates the
intermediate code into the machine code of the specified computer.
Example:
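Since the phases are easiest to see on a concrete statement, the small C program below traces, in comments, how the assignment inside main might move through them. The statement, the temporaries t1 and t2, and the listed target instructions are illustrative assumptions, not the output of any particular compiler.

#include <stdio.h>

/* Hypothetical walk-through: the compilation phases applied to the
   assignment below. The intermediate forms are illustrative only. */
int main(void) {
    int initial = 10, rate = 2, position;

    /* Source statement handed to the compiler: */
    position = initial + rate * 60;

    /* 1. Lexical analysis  : id(position) = id(initial) + id(rate) * num(60) ;
       2. Syntax analysis   : parse tree for  position = initial + (rate * 60)
       3. Semantic analysis : operand types checked; all operands are int here
       4. Intermediate code : t1 = rate * 60 ; t2 = initial + t1 ; position = t2
       5. Code optimization : t1 = rate * 60 ; position = initial + t1
       6. Code generation   : target instructions such as MUL, ADD, MOV        */

    printf("position = %d\n", position);   /* prints 130 */
    return 0;
}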
1.3 Bootstrapping
o Bootstrapping is widely used in compiler development.
o Bootstrapping is used to produce a self-hosting compiler. A self-hosting compiler is a compiler
that can compile its own source code.
o A bootstrap compiler is used to compile the compiler; the compiled compiler can then be used to
compile everything else, including future versions of itself.
1. Create a compiler SCAA for a subset S of the desired language L, using language "A"; that
compiler runs on machine A and produces code for machine A.
2. Write a compiler LCSA for the full language L in the subset language S; it produces code for machine A.
3. Compile LCSA using the compiler SCAA to obtain LCAA. LCAA is a compiler for language L, which
runs on machine A and produces code for machine A.
A finite automaton is a state machine that takes a string of symbols as input and changes its state
accordingly. A finite automaton is a recognizer for regular expressions. When a regular-expression
string is fed into a finite automaton, it changes its state for each literal. If the input string is
processed successfully and the automaton reaches a final state, the string is accepted, i.e., the string
just fed was a valid token of the language in hand.
The mathematical model of a finite automaton consists of:
• a finite set of states Q
• a finite set of input symbols Σ (the alphabet)
• a transition function δ
• a start state q0
• a set of final (accepting) states F
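As a small worked example (the machine below is an assumption chosen for illustration, not taken from the notes), a DFA that accepts binary strings ending in 1 can be written as the 5-tuple
M = ({q0, q1}, {0, 1}, δ, q0, {q1})
with δ(q0, 0) = q0, δ(q0, 1) = q1, δ(q1, 0) = q0 and δ(q1, 1) = q1; feeding it the string 0011 leaves it in the final state q1, so the string is accepted.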
Lexical analysis is the first phase of a compiler. It takes the modified source code from language
preprocessors, written in the form of sentences. The lexical analyzer breaks this text into a series of
tokens, removing any whitespace and comments in the source code.
If the lexical analyzer finds a token invalid, it generates an error. The lexical analyzer works
closely with the syntax analyzer. It reads character streams from the source code, checks for
legal tokens, and passes the data to the syntax analyzer on demand.
1.6 Tokens
Lexemes are said to be a sequence of characters (alphanumeric) in a token. There are some
predefined rules for every lexeme to be identified as a valid token. These rules are defined by
grammar rules, by means of a pattern. A pattern explains what can be a token, and these patterns
are defined by means of regular expressions.
In a programming language, keywords, constants, identifiers, strings, numbers, operators and
punctuation symbols can be considered tokens.
For example, in C language, the variable declaration line
int value = 100;
contains the tokens:
int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol).
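As a rough sketch (the token classes, the single hard-coded keyword, and the scanning logic are assumptions made for this illustration), the following C program splits the declaration above into lexemes and prints a token class for each one.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Illustrative scanner for one declaration; only "int" is treated as a keyword. */
int main(void) {
    const char *src = "int value = 100;";
    const char *p = src;

    while (*p) {
        if (isspace((unsigned char)*p)) { p++; continue; }

        if (isalpha((unsigned char)*p)) {            /* identifier or keyword */
            char buf[32]; int n = 0;
            while (isalnum((unsigned char)*p)) buf[n++] = *p++;
            buf[n] = '\0';
            printf("%-10s %s\n", strcmp(buf, "int") == 0 ? "keyword" : "identifier", buf);
        } else if (isdigit((unsigned char)*p)) {     /* integer constant */
            char buf[32]; int n = 0;
            while (isdigit((unsigned char)*p)) buf[n++] = *p++;
            buf[n] = '\0';
            printf("%-10s %s\n", "constant", buf);
        } else if (*p == '=') {                      /* operator */
            printf("%-10s =\n", "operator"); p++;
        } else {                                     /* any other single symbol */
            printf("%-10s %c\n", "symbol", *p); p++;
        }
    }
    return 0;
}

Running it prints the same classification as listed above: int as a keyword, value as an identifier, = as an operator, 100 as a constant and ; as a symbol.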
Specifications of Tokens
Let us understand how the language theory undertakes the following terms:
Alphabets
Any finite set of symbols is an alphabet: {0,1} is the set of binary alphabets, {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the
set of hexadecimal alphabets, and {a-z, A-Z} is the set of English-language alphabets.
Strings
Any finite sequence of alphabet symbols is called a string. The length of a string is the total number of
occurrences of symbols in it; e.g., the length of the string tutorialspoint is 14 and is denoted by
|tutorialspoint| = 14. A string having no symbols, i.e., a string of zero length, is known as an empty
string and is denoted by ε (epsilon).
Special Symbols
A typical high-level language contains special symbols such as:
Assignment: =
Special assignment: +=, /=, *=, -=
Preprocessor: #
Language
A language is considered as a set of strings over some finite set of alphabet symbols. Computer
languages are considered as such sets, and the usual mathematical set operations can be performed on
them. Finite languages (and, more generally, regular languages) can be described by means of regular expressions.
When the lexical analyzer reads the source code, it scans the code letter by letter; and when it
encounters a whitespace, an operator symbol, or a special symbol, it decides that a word is
completed.
For example:
int intvalue;
While scanning the lexeme up to ‘int’, the lexical analyzer cannot determine whether it is the
keyword int or the prefix of the identifier intvalue.
The Longest Match Rule states that the lexeme scanned should be determined based on the
longest match among all the tokens available.
The lexical analyzer also follows rule priority, where a reserved word, e.g., a keyword, of a
language is given priority over user input. That is, if the lexical analyzer finds a lexeme that
matches an existing reserved word, it is recognized as that keyword rather than as an identifier,
and an attempt to use a reserved word as an identifier is reported as an error.
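A minimal sketch of how these two rules can be applied together (the two-entry keyword table and the helper name classify are assumptions for the example): the scanner first reads the longest run of letters and digits, and only then checks the lexeme against the reserved-word table.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Illustrative sketch: longest match first, then keyword priority. */
static const char *keywords[] = { "int", "return" };

static const char *classify(const char *lexeme) {
    for (size_t i = 0; i < sizeof keywords / sizeof *keywords; i++)
        if (strcmp(lexeme, keywords[i]) == 0)
            return "keyword";            /* rule priority: reserved words win */
    return "identifier";
}

int main(void) {
    const char *input = "int intvalue;";
    const char *p = input;
    char lexeme[64];

    while (*p) {
        if (isalpha((unsigned char)*p)) {
            int n = 0;
            while (isalnum((unsigned char)*p))   /* longest match: keep reading */
                lexeme[n++] = *p++;
            lexeme[n] = '\0';
            printf("%s -> %s\n", lexeme, classify(lexeme));
        } else {
            p++;                                 /* skip whitespace and ';' */
        }
    }
    return 0;
}

For the input int intvalue; it prints int -> keyword and intvalue -> identifier: the longest match keeps intvalue as one lexeme, and the keyword table then decides its class.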
Recognition of Tokens
Tokens can be recognized by Finite Automata
A finite automaton (FA) is a simple idealized machine used to recognize patterns within input taken
from some character set (or alphabet) C. The job of an FA is to accept or reject an input depending on
whether the pattern defined by the FA occurs in the input.
There are two notations for representing Finite Automata. They are
Transition Diagram
Transition Table
A transition diagram is a directed labeled graph that contains nodes and edges.
Nodes represent the states and edges represent the transitions between states.
Every transition diagram has exactly one initial state, indicated by an arrow mark (-->), and zero or
more final states, represented by double circles.
Example:
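As a sketch of the transition-table notation (the states, input classes, and the identifier pattern letter (letter | digit)* are assumptions chosen for this example), the table below drives a small recognizer written in C.

#include <ctype.h>
#include <stdio.h>

/* Illustrative DFA for identifiers: letter (letter | digit)*
   States: 0 = start, 1 = inside an identifier (final), 2 = dead (reject).
   Input classes: 0 = letter, 1 = digit, 2 = anything else.                 */
static const int next_state[3][3] = {
    /* letter  digit  other */
    {  1,      2,     2 },   /* state 0: start  */
    {  1,      1,     2 },   /* state 1: final  */
    {  2,      2,     2 }    /* state 2: dead   */
};

static int input_class(char c) {
    if (isalpha((unsigned char)c)) return 0;
    if (isdigit((unsigned char)c)) return 1;
    return 2;
}

/* Returns 1 if the whole string is accepted as an identifier. */
static int is_identifier(const char *s) {
    int state = 0;
    for (; *s; s++)
        state = next_state[state][input_class(*s)];
    return state == 1;
}

int main(void) {
    const char *samples[] = { "value", "intvalue", "2cool", "x9" };
    for (int i = 0; i < 4; i++)
        printf("%-9s %s\n", samples[i],
               is_identifier(samples[i]) ? "accepted" : "rejected");
    return 0;
}

The same machine drawn as a transition diagram would have an arrow into state 0, a double circle around state 1, and edges labeled with the input classes.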
Types or Sources of Error – There are two types of errors: run-time and compile-time errors.
1. A run-time error is an error that takes place during the execution of a program, and
usually happens because of adverse system parameters or invalid input data. The lack of
sufficient memory to run an application, a memory conflict with another program, and
logic errors are examples of this. Logic errors occur when executed code does not produce
the expected result; they are best handled by meticulous program debugging.
2. Compile-time errors arise at compile time, before execution of the program. A syntax error or
a missing file reference that prevents the program from compiling successfully is an example
of this.
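A small illustration of the difference (the statements below are assumptions for the example): the program compiles and runs, while the comments point out what a compile-time error and a logic error look like.

#include <stdio.h>

int main(void) {
    /* A compile-time (syntax) error is shown only as a comment, because it
       would stop compilation:   int x = 10    <- missing ';'               */

    /* A logic error: the loop is intended to sum 1..5, but the bound is
       wrong, so the program compiles and runs yet prints 10 instead of 15. */
    int sum = 0;
    for (int i = 1; i < 5; i++)    /* should be i <= 5 */
        sum += i;
    printf("sum = %d\n", sum);

    return 0;
}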
Finding or reporting an error – The viable-prefix property of a parser allows early
detection of syntax errors.
Error Recovery –
The minimum requirement for a compiler is to simply stop, issue a message, and cease
compilation. Some common recovery methods are as follows.
1. Panic mode recovery: This is the easiest way of error recovery, and it also prevents the
parser from developing infinite loops while recovering from an error. The parser discards input
symbols one at a time until one of a designated set of synchronizing tokens (typically statement
or expression terminators such as end or the semicolon) is found. This is adequate when the
presence of multiple errors in the same statement is rare. Example: consider the erroneous
expression (1 + + 2) + 3. Panic-mode recovery skips ahead to the next integer and then
continues. Bison uses the special terminal error to describe how much input to skip:
E -> int | E + E | ( E ) | error int | ( error )
A minimal sketch of this skipping idea appears after this list.
2. Phase level recovery: Perform local correction on the input to repair the error. But error
correction is difficult in this strategy.
3. Error productions: Compiler designers know some common errors that may occur in the
code. The grammar can be augmented with productions that generate these erroneous
constructs, so that the errors are detected and handled when they are encountered. Example:
writing 5x instead of 5*x.
4. Global correction: Its aim is to make as few changes as possible while converting an
incorrect input string to a valid string. This strategy is costly to implement.
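The sketch below (referred to from the panic-mode item above) illustrates the skip-to-synchronizing-token idea on the token stream of the erroneous expression (1 + + 2) + 3. The one-character-per-token representation and the synchronizing set {';', ')'} are simplifying assumptions.

#include <stdio.h>

/* Panic-mode recovery sketch: on an error, discard input symbols until a
   synchronizing token is found, then resume.                               */

static int is_sync(char tok) {
    return tok == ';' || tok == ')';
}

int main(void) {
    /* Tokens of the erroneous expression (1 + + 2) + 3 ;                    */
    const char *tokens = "(1++2)+3;";

    for (const char *p = tokens; *p; p++) {
        if (*p == '+' && p[1] == '+') {           /* "+ +" detected: error   */
            printf("error at '+ +': skipping to a synchronizing token\n");
            while (*p && !is_sync(*p))            /* panic mode: discard     */
                p++;
            if (!*p) break;
            printf("resuming at '%c'\n", *p);
        } else {
            printf("consume '%c'\n", *p);
        }
    }
    return 0;
}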