
CT - Lecture 2


Compiler phases

LECTURE 2
Phases of Compiler:

Lexical Analysis
• The first phase of the compiler works as a text scanner. This phase scans the source code as a stream of characters and converts it into meaningful lexemes.

• The lexical analyzer represents these lexemes in the form of tokens as:

<token-name, attribute-value>
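As a minimal sketch (the type and token names here are illustrative, not from any particular compiler), such a token could be represented in C as:

    /* Illustrative token representation: token class plus attribute value. */
    enum TokenName { TOK_ID, TOK_NUMBER, TOK_ASSIGN, TOK_SEMI };

    struct Token {
        enum TokenName name;  /* the token class, e.g. TOK_ID                 */
        int attribute;        /* e.g. a symbol-table index or a literal value */
    };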
Phases of Compiler:
Syntax Analysis
• The next phase is called the syntax analysis or parsing.

• It takes the tokens produced by lexical analysis as input and generates a parse tree (or syntax tree).

• In this phase, token arrangements are checked against the grammar of the source language, i.e. the parser checks whether the expressions formed by the tokens are syntactically correct.
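As a minimal sketch of this check, here is a toy recursive-descent parser in C that accepts single-digit arithmetic expressions; the grammar, the input string, and all function names are assumptions made for this example, not part of any real compiler:

    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Toy grammar:  expr   -> term { '+' term }
                     term   -> factor { '*' factor }
                     factor -> DIGIT | '(' expr ')'    */
    static const char *p;  /* current position in the input */

    static void expr(void);

    static void error(void) {
        fprintf(stderr, "syntax error near '%c'\n", *p);
        exit(1);
    }

    static void factor(void) {
        if (isdigit((unsigned char)*p)) p++;
        else if (*p == '(') { p++; expr(); if (*p == ')') p++; else error(); }
        else error();
    }

    static void term(void) {
        factor();
        while (*p == '*') { p++; factor(); }
    }

    static void expr(void) {
        term();
        while (*p == '+') { p++; term(); }
    }

    int main(void) {
        p = "1+2*(3+4)";
        expr();
        puts(*p == '\0' ? "syntactically correct" : "syntax error: trailing input");
        return 0;
    }

Each nonterminal of the grammar becomes one function, and a syntax error is reported as soon as the tokens fail to match any production.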
Phases of Compiler:
Semantic Analysis
• Semantic analysis checks whether the parse tree constructed follows the rules of the language.

• For example, it checks that assignments are between compatible data types and reports errors such as adding a string to an integer. The semantic analyzer also keeps track of identifiers, their types and expressions, and checks whether identifiers are declared before use. It produces an annotated syntax tree as output.
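For instance, in C the following declarations illustrate the kinds of checks the semantic analyzer performs:

    int count = 10;         /* accepted: int literal assigned to int          */
    int bad   = "hello";    /* rejected: incompatible types (string to int)   */
    int sum   = count + 5;  /* the analyzer verifies that count is declared
                               and that the operand types are compatible      */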
Phases of Compiler:
Intermediate Code Generation
• After semantic analysis, the compiler generates an intermediate representation of the source code for the target machine. It represents a program for some abstract machine and sits between the high-level language and the machine language. This intermediate code should be generated in such a way that it is easy to translate into the target machine code.
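A widely used intermediate representation is three-address code, where each instruction has at most one operator on its right-hand side. As an illustrative example, the statement a = b + c * 60 might be translated into:

    t1 = c * 60    /* t1 and t2 are compiler-generated temporaries */
    t2 = b + t1
    a  = t2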
Phases of Compiler:
Code Optimization
• The next phase is code optimization of the intermediate code. Optimization can be thought of as removing unnecessary code lines and arranging the sequence of statements so that the program executes faster without wasting resources (CPU, memory).
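Continuing the three-address example above, an optimizer might notice that the temporary t2 is only copied into a and eliminate it (an illustrative copy-propagation step):

    Before optimization:        After optimization:
    t1 = c * 60                 t1 = c * 60
    t2 = b + t1                 a  = b + t1
    a  = t2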
Phases of Compiler:
Code Generation
• In this phase, the code generator takes the optimized representation of the intermediate code and maps it to the target machine language. The code generator translates the intermediate code into a sequence of (generally) relocatable machine code. This sequence of machine instructions performs the same task as the intermediate code.
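As an illustrative final step, the optimized intermediate code above could be mapped to a register-based target; the pseudo-assembly instruction set here is invented for the example:

    LOAD  R1, c     ; R1 <- value of c
    MUL   R1, #60   ; R1 <- c * 60
    LOAD  R2, b     ; R2 <- value of b
    ADD   R2, R1    ; R2 <- b + (c * 60)
    STORE a, R2     ; a  <- R2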
Phases of Compiler:
Symbol Table
• It is a data structure maintained throughout all the phases of a compiler. All the identifiers' names along with their types are stored here. The symbol table makes it easier for the compiler to quickly search for an identifier's record and retrieve it. The symbol table is also used for scope management.
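A minimal sketch of a symbol table in C (a real compiler would more likely use a hash table together with a stack of scopes; all names here are illustrative):

    #include <string.h>

    struct Symbol {
        char name[32];   /* identifier name       */
        char type[16];   /* e.g. "int", "float"   */
        int  scope;      /* scope nesting level   */
    };

    static struct Symbol table[256];
    static int n_symbols = 0;

    /* Linear search from the most recent entry backward;
       returns the index, or -1 if the identifier is undeclared. */
    int lookup(const char *name) {
        for (int i = n_symbols - 1; i >= 0; i--)
            if (strcmp(table[i].name, name) == 0)
                return i;
        return -1;
    }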
Lexical Analysis
• Lexical analysis is a compiler’s first phase, also called linear analysis or scanning. It takes modified source code from language pre-processors, written in the form of sentences. The lexical analyzer reads the source program character by character and groups the characters into meaningful sequences called lexemes. For each lexeme, the lexical analyzer produces as output a token of the form

<token-name, attribute-value>
• If the lexical analyzer finds a token invalid, it generates an error. The lexical analyzer works closely with the syntax analyzer. It reads character streams from the source code, checks for legal tokens, and passes the data to the syntax analyzer on demand.
Lexical Analysis

• Upon receiving the “get next token” command from the parser, the lexical analyzer reads input characters until it can identify the next token.
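A minimal sketch of this pull model in C (the type and function names are illustrative):

    enum TokenName { TOK_ID, TOK_NUMBER, TOK_EOF };
    struct Token { enum TokenName name; int attribute; };

    /* Implemented by the lexical analyzer: consumes input characters
       until one complete lexeme has been recognized. */
    struct Token get_next_token(void);

    /* The parser drives the scanner, pulling one token at a time. */
    void parse(void) {
        struct Token tok = get_next_token();
        while (tok.name != TOK_EOF) {
            /* ... match tok against the grammar productions ... */
            tok = get_next_token();
        }
    }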
Lexical Analysis
• The lexical analyzer reads the source text and, thus, it may perform certain
secondary tasks:
• Eliminate comments and white space, such as blanks, tabs and newline characters.

• Correlate error messages from the compiler with the source program.
Lexical Analysis
• Token: A token is a group of characters having collective meaning: typically a word or punctuation mark, identified by the lexical analyzer and passed to the parser.
• Lexeme: A lexeme is an actual character sequence forming a specific instance of a token, such as num.
• Pattern: A rule describing the set of strings associated with a token, expressed as a regular expression that explains how a particular token can be formed.
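For example, the classic correspondence between the three terms (the entries are illustrative):

    Token     Sample lexemes      Informal pattern
    id        count, num, x       a letter followed by letters and digits
    number    100, 3.14           one or more digits, optional fraction
    relop     <, <=, ==           one of the relational operator symbols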
Lexical Analysis
• For example, in C language, the variable declaration
line
int value = 100;

contains the tokens:

<int, keyword> <value, identifier> < =, operator > <100, constant> <; , symbol>
Lexical Analysis
• What are Tokens?
• A token is the smallest individual element of a program that
is meaningful to the compiler. It cannot be further divided.
Identifiers, strings, keywords, etc., are examples of tokens. In the lexical analysis phase of the compiler, the program is converted into a stream of tokens.
Lexical Analysis
• Different Types of Tokens

• There can be multiple types of tokens. Some of them are-


1. Keywords
• Keywords are words reserved for particular purposes that carry a special meaning for the compiler. Keywords must not be used for naming a variable, function, etc.
Lexical Analysis
• Different Types of Tokens
2. Identifier
• The names given to various components in the program, like
the function's name or variable's name, etc., are called
identifiers. Keywords cannot be identifiers.
3. Operators
• Operators are different symbols used to perform different
operations in a programming language.
Lexical Analysis
• Different Types of Tokens
4. Punctuations
• Punctuations are special symbols that separate different
code elements in a programming language.
• Consider the following line of code in C++ language -
int x = 45;
• The above statement has multiple tokens, which are-
Lexical Analysis
• Different Types of Tokens
• Keywords: int
• Identifiers: x
• Constants: 45
• Operators: =
• Punctuators: ;
Lexical Analysis
• Specifications of Tokens
• Strings
• A string is a finite sequence of characters. These characters can be letters or digits. There is also the empty string, which is denoted by ε.
Lexical Analysis
• Specifications of Tokens
• Language
• A language is a set of strings over some finite alphabet. Because languages are sets, the usual mathematical set operations can be performed on them. Regular languages can be described by means of regular expressions.
Lexical Analysis
• Specifications of Tokens
• Regular Expressions
• The lexical analyzer needs to scan and identify only the valid strings/tokens/lexemes that belong to the language at hand. It searches for the patterns defined by the language rules.
• Regular expressions express such patterns for finite strings of symbols. The grammar defined by regular expressions is known as a regular grammar, and the language defined by a regular grammar is known as a regular language.
Lexical Analysis
• Specifications of Tokens
• Regular Expressions
• A regular expression is an important notation for specifying patterns. Each pattern matches a set of strings, so regular expressions serve as names for sets of strings. Programming language tokens can be described by regular languages.
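As an illustrative example, the patterns for identifiers and unsigned numbers are commonly written as regular definitions:

    letter -> [A-Za-z]
    digit  -> [0-9]
    id     -> letter ( letter | digit )*
    number -> digit+ ( "." digit+ )?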
Thank you for listening
