Compiler Unit 1

The document provides an overview of translators, specifically focusing on compilers, assemblers, and interpreters, detailing their functions and differences. It outlines the phases of compiler design, including lexical, syntax, and semantic analysis, as well as code generation and optimization. Additionally, it discusses various types of compilers, operations, and construction tools used in compiler development.

Translators

A software system which converts the source code from one form of language to
another form of language is known as translator.
There are three types of translator:
1.​ Assembler
2.​ Compiler
3.​ Interpreter
Introduction of Compiler Design
Compiler design is the design of software that translates the source code
written in a high-level programming language (Like C, C++, Java or Python)
into the low level language like Machine code, Assembly code or Byte code. It
comprises a number of phases where the code is pre-processed, and parsed,
its semantics or meaning are checked, and optimized, and then the final
executable code is generated.
●​ Without compilation, no program written in a high-level language can
be executed.
●​ For every programming language, we have a different compiler;
however, the basic tasks performed by every compiler are the same.
●​ The process of translating the source code into machine code
involves several stages, including lexical analysis, syntax analysis,
semantic analysis, code generation, and optimization.
●​ Compiler is different from a general translator. A translator or
language processor is a program that translates an input program
written in a programming language into an equivalent program in
another language.

Language Processing Systems

●​ High-Level Language: If a program contains pre-processor
directives such as #include or #define it is called HLL. They are
closer to humans but far from machines. These (#) tags are called
preprocessor directives. They direct the preprocessor about what to
do.
●​ Pre-Processor: The preprocessor removes all the #include
directives by including the referenced files (file inclusion) and expands
all the #define directives using macro expansion. It performs file
inclusion, augmentation, macro-processing, etc. For example, if the source
program contains #include "stdio.h", the pre-processor replaces this
directive with the contents of that file in the produced output.
●​ Assembly Language: It’s neither in binary form nor high level. It is
an intermediate state that is a combination of machine instructions
and some other useful data needed for execution.
●​ Assembler: For every platform (Hardware + OS) we will have an
assembler. They are not universal since for each platform we have
one. The output of the assembler is called an object file. It translates
assembly language to machine code.
●​ Compiler: The compiler is an intelligent program as compared to an
assembler. The compiler verifies all types of limits, ranges, errors,
etc. The compiler takes more time to run and occupies a larger
amount of memory. Its speed is slower than other system software
because it reads through the entire program and then translates the
full program.
●​ Interpreter: An interpreter converts high-level language into
low-level machine language, just like a compiler. But they are
different in the way they read the input. The Compiler in one go reads
the inputs, does the processing, and executes the source code
whereas the interpreter does the same line by line. A compiler scans
the entire program and translates it as a whole into machine code
whereas an interpreter translates the program one statement at a
time. Interpreted programs are usually slower than compiled ones.
●​ Relocatable Machine Code: It can be loaded at any point in memory and
run. The addresses within the program are arranged in such a way that
they remain valid when the program is moved.
●​ Loader/Linker: Loader/Linker converts the relocatable code into
absolute code and tries to run the program resulting in a running
program or an error message (or sometimes both can happen).
The linker combines a variety of object files into a single file to make it
executable. Then the loader loads it into memory and executes it.
○​ Linker: The basic work of a linker is to merge object
codes (that have not yet been linked),
produced by the compiler, assembler, standard library
functions, and operating system resources.
○​ Loader: The codes generated by the compiler,
assembler, and linker are generally relocatable by
nature, which means the starting location of
these codes is not fixed; they can
be anywhere in the computer memory. Thus the basic
task of the loader is to find/calculate the exact addresses
of these memory locations.

ANALYSIS OF THE SOURCE PROGRAM


In Compiling, analysis consists of three phases:
1. Lexical Analysis
2. Syntax Analysis
3. Semantic Analysis

1.Lexical analysis:

Lexical analysis, also known as scanning, is the first phase of a compiler
which involves reading the source program character by character from left to
right and organizing them into tokens. Tokens are meaningful sequences of
characters. There are usually only a small number of tokens for a
programming language including constants (such as integers, doubles,
characters, and strings), operators (arithmetic, relational, and logical),
punctuation marks and reserved keywords.
Example:
position := initial + rate * 60
Identifiers – position, initial, rate
Operators – +, *
Assignment symbol – :=
Number – 60
Blanks – eliminated.
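
For illustration only (this example is not part of the original notes), the scanner's output for this statement can be thought of as a list of (token type, lexeme) pairs. The following C sketch simply hard-codes that token stream and prints it; the enum and struct names are invented for the example.

#include <stddef.h>
#include <stdio.h>

/* Hypothetical token kinds for the example statement. */
enum token_kind { IDENTIFIER, ASSIGN, PLUS, TIMES, NUMBER };

struct token {
    enum token_kind kind;
    const char *lexeme;
};

int main(void) {
    /* Token stream produced for: position := initial + rate * 60 */
    struct token stream[] = {
        { IDENTIFIER, "position" }, { ASSIGN, ":=" },
        { IDENTIFIER, "initial"  }, { PLUS,   "+"  },
        { IDENTIFIER, "rate"     }, { TIMES,  "*"  },
        { NUMBER,     "60"       }
    };
    for (size_t i = 0; i < sizeof stream / sizeof stream[0]; i++)
        printf("<%d, %s>\n", stream[i].kind, stream[i].lexeme);
    return 0;
}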

2. Syntax analysis:
Hierarchical analysis is also called parsing or syntax analysis. Syntax
Analysis (also known as parsing) is the step after Lexical Analysis.
Lexical analysis breaks the source code into tokens.
●​ Tokens are inputs for Syntax Analysis.
●​ The goal of Syntax Analysis is to determine the structure formed by
these tokens.
●​ It checks whether the tokens produced by the lexical analyzer are
arranged according to the language’s grammar.
●​ The syntax analyzer attempts to build a Parse Tree or Abstract
Syntax Tree (AST), which represents the program’s structure.

A syntax tree is the tree generated as a result of syntax analysis, in
which the interior nodes are the operators and the exterior nodes are the
operands.

This analysis shows an error when the syntax is incorrect.

Example:
position := initial + rate * 60
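
To make the idea of interior operator nodes and leaf operand nodes concrete, here is a small hand-built syntax tree for the same statement, written in C. The node layout (a label plus left/right children) and the helper names are illustrative assumptions, not a prescribed representation.

#include <stdio.h>

/* A toy syntax-tree node: leaves hold operands, interior nodes hold operators. */
struct node {
    const char *label;          /* operator or operand text   */
    struct node *left, *right;  /* NULL for leaves (operands) */
};

static struct node *mk(const char *label, struct node *l, struct node *r) {
    static struct node pool[16];
    static int used = 0;
    struct node *n = &pool[used++];
    n->label = label; n->left = l; n->right = r;
    return n;
}

static void print_prefix(const struct node *n) {
    if (!n) return;
    printf("%s ", n->label);
    print_prefix(n->left);
    print_prefix(n->right);
}

int main(void) {
    /* Tree for: position := initial + rate * 60 */
    struct node *tree =
        mk(":=", mk("position", NULL, NULL),
                 mk("+", mk("initial", NULL, NULL),
                         mk("*", mk("rate", NULL, NULL),
                                 mk("60",   NULL, NULL))));
    print_prefix(tree);   /* prints: := position + initial * rate 60 */
    printf("\n");
    return 0;
}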
3. Semantic analysis:
This phase checks the source program for semantic errors and gathers
type information for subsequent code generation phases. An important

component of semantic analysis is type checking. Here the compiler
checks that each operator has operands that are permitted by the source
language specification.
Phases of a Compiler
There are two major phases of compilation, which in turn have many parts.
Each of them takes input from the output of the previous level and works in a
coordinated way.

Analysis Phase

An intermediate representation is created from the given source code :
●​ Lexical Analyzer
●​ Syntax Analyzer
●​ Semantic Analyzer
●​ Intermediate Code Generator

The lexical analyzer divides the program into “tokens”, the Syntax analyzer
recognizes “sentences” in the program using the syntax of the language and
the Semantic analyzer checks the static semantics of each construct.
Intermediate Code Generator generates “abstract” code.
Synthesis Phase

An equivalent target program is created from the intermediate representation.
It has two parts :
●​ Code Optimizer
●​ Code Generator
Code Optimizer optimizes the abstract code, and the final Code Generator
translates abstract intermediate code into specific machine instructions.
Stages of Compiler Design
●​ Lexical Analysis: The first stage of compiler design is lexical
analysis , also known as scanning. In this stage, the compiler reads
the source code character by character and breaks it down into a
series of tokens, such as keywords, identifiers, and operators. These
tokens are then passed on to the next stage of the compilation
process.
●​ Syntax Analysis: The second stage of compiler design is syntax
analysis , also known as parsing. In this stage, the compiler checks
the syntax of the source code to ensure that it conforms to the rules
of the programming language. The compiler builds a parse tree,
which is a hierarchical representation of the program’s structure, and
uses it to check for syntax errors.
●​ Semantic Analysis: The third stage of compiler design is semantic
analysis . In this stage, the compiler checks the meaning of the
source code to ensure that it makes sense. The compiler performs
type checking, which ensures that variables are used correctly and
that operations are performed on compatible data types. The
compiler also checks for other semantic errors, such as undeclared
variables and incorrect function calls.
●​ Code Generation: The fourth stage of compiler design is code
generation . In this stage, the compiler translates the parse tree into
machine code that can be executed by the computer. The code
generated by the compiler must be efficient and optimized for the
target platform.
●​ Optimization: The final stage of compiler design is optimization. In
this stage, the compiler analyzes the generated code and makes
optimizations to improve its performance. The compiler may perform
optimizations such as constant folding, loop unrolling, and
function inlining.
●​ Error Handling: Error handling is integrated into all stages of
compilation to detect, report, and recover from errors. It identifies
invalid tokens in lexical analysis, syntax errors in parsing, and
semantic issues like type mismatches or undeclared variables.

Effective error handling ensures smooth compilation and clear
feedback for developers.
●​ Symbol Table: The symbol table is a key data structure that stores
information about identifiers like variables, functions, and their
attributes. It is used in lexical analysis to record identifiers, in
semantic analysis for type checking and scope resolution, and in
code generation for mapping memory locations.
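
As a rough sketch of the symbol table described above, the C fragment below stores identifiers with a type and a scope level and looks them up innermost-scope-first. The field names and the linear search are assumptions made for brevity; production compilers typically use hash tables.

#include <stdio.h>
#include <string.h>

/* Illustrative symbol-table entry: name, type, and scope level. */
struct symbol {
    char name[32];
    char type[16];    /* e.g. "int", "float", "function"              */
    int  scope_level; /* 0 = global, 1 = first nested scope, and so on */
};

static struct symbol table[128];
static int count = 0;

/* Insert an identifier discovered during lexical/semantic analysis. */
static void insert(const char *name, const char *type, int scope) {
    strncpy(table[count].name, name, sizeof table[count].name - 1);
    strncpy(table[count].type, type, sizeof table[count].type - 1);
    table[count].scope_level = scope;
    count++;
}

/* Look an identifier up, e.g. for type checking in semantic analysis. */
static struct symbol *lookup(const char *name) {
    for (int i = count - 1; i >= 0; i--)       /* innermost scope first */
        if (strcmp(table[i].name, name) == 0)
            return &table[i];
    return NULL;
}

int main(void) {
    insert("rate", "float", 0);
    struct symbol *s = lookup("rate");
    if (s) printf("%s : %s (scope %d)\n", s->name, s->type, s->scope_level);
    return 0;
}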
Types of Compiler
●​ Self Compiler: When the compiler runs on the same machine and
produces machine code for the same machine on which it is running
then it is called as self compiler or resident compiler.
●​ Cross Compiler: The compiler may run on one machine and
produce the machine codes for other computers then in that case it is
called a cross-compiler. It is capable of creating code for a platform
other than the one on which the compiler is running.
●​ Source-to-Source Compiler: A Source-to-Source Compiler or
transcompiler or transpiler is a compiler that translates source code
written in one programming language into the source code of another
programming language.
●​ Single Pass Compiler: When all the phases of the compiler are
present inside a single module, it is simply called a single-pass
compiler. It performs the work of converting source code to machine
code.
●​ Two Pass Compiler: A two-pass compiler is a compiler in which the
program is processed twice: the first pass (the front end) analyzes the
source code, and the second pass (the back end) generates the target code.
●​ Multi-Pass Compiler: When several intermediate codes are created
in a program and a syntax tree is processed many times, it is called
Multi-Pass Compiler. It breaks codes into smaller programs.
●​ Just-in-Time (JIT) Compiler: It is a type of compiler that converts
code into machine language during program execution, rather than
before it runs. It combines the benefits of interpretation (real-time
execution) and traditional compilation (faster execution).
●​ Ahead-of-Time (AOT) Compiler: It converts the entire source code
into machine code before the program runs. This means the code is
fully compiled during development, resulting in faster startup times
and better performance at runtime.
●​ Incremental Compiler: It compiles only the parts of the code that
have changed, rather than recompiling the entire program. This
makes the compilation process faster and more efficient, especially
during development.

Operations of Compiler
These are some operations that are done by the compiler.
●​ It breaks source programs into smaller parts.
●​ It enables the creation of symbol tables and intermediate
representations.
●​ It helps in code compilation and error detection.
●​ It saves all codes and variables.
●​ It analyses the full program and translates it.
●​ Convert source code to machine code.
Compiler Construction Tools
The compiler writer can use some specialized tools that help in implementing
various phases of a compiler. These tools assist in the creation of an entire
compiler or its parts. Some commonly used compiler construction tools
include:
1.​ Scanner Generator – It generates lexical analyzers from an input that
consists of regular-expression descriptions of the tokens of a
language. It generates a finite automaton to recognize the regular
expressions. Example: Lex. (A minimal Lex specification is sketched after this list.)

2.​ Parser Generator – It produces syntax analyzers (parsers) from an
input that is based on a grammatical description of a programming
language or on a context-free grammar. It is useful because the syntax
analysis phase is highly complex and consumes a lot of manual effort and
compilation time. Example: PIC, EQM

3. Syntax directed translation engines – They generate intermediate code
in three-address format from an input that consists of a parse tree. These
engines have routines to traverse the parse tree and then produce the
intermediate code. In this, each node of the parse tree is associated with one
or more translations.

4. Data-flow analysis engines – They are used in code optimization. Data-flow
analysis is a key part of code optimization that gathers information about
how values flow from one part of a program to another.

5. Automatic code generators – They generate the machine language for a
target machine. Each operation of the intermediate language is translated
using a collection of rules and then taken as input by the code generator.
A template-matching process is used: an intermediate-language statement is
replaced by its equivalent machine-language statement using templates.

6. Compiler construction toolkits – They provide an integrated set of routines
that aid in building compiler components or in the construction of the various
phases of a compiler.
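
To connect the scanner-generator idea (item 1 above) to something concrete, here is a minimal specification in the style accepted by Lex/Flex, with C actions. The token names and printf actions are invented for the example; yywrap and main are supplied so the generated scanner can be built standalone (for instance with flex followed by a C compiler), assuming Flex is installed.

%{
/* Minimal Lex/Flex sketch: recognizes numbers, identifiers, and operators. */
#include <stdio.h>
%}

%%
[0-9]+                  { printf("NUMBER(%s)\n", yytext); }
[A-Za-z_][A-Za-z0-9_]*  { printf("IDENTIFIER(%s)\n", yytext); }
[-+*/=]                 { printf("OPERATOR(%s)\n", yytext); }
[ \t\n]+                { /* skip whitespace between tokens */ }
.                       { printf("UNKNOWN(%s)\n", yytext); }
%%

int yywrap(void) { return 1; }

int main(void) {
    yylex();           /* reads stdin and prints one classified token per line */
    return 0;
}

Running the generated scanner over a line of source text would print one classified token per line, mirroring the token examples given later in these notes.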

Features of compiler construction tools :


Lexical Analyzer Generator: This tool helps in generating the lexical
analyzer or scanner of the compiler. It takes as input a set of regular
expressions that define the syntax of the language being compiled and

produces a program that reads the input source code and tokenizes it based
on these regular expressions.
Parser Generator: This tool helps in generating the parser of the compiler. It
takes as input a context-free grammar that defines the syntax of the language
being compiled and produces a program that parses the input tokens and
builds an abstract syntax tree.
Code Generation Tools: These tools help in generating the target code for
the compiler. They take as input the abstract syntax tree produced by the
parser and produce code that can be executed on the target machine.
Optimization Tools: These tools help in optimizing the generated code for
efficiency and performance. They can perform various optimizations such as
dead code elimination, loop optimization, and register allocation.
Debugging Tools: These tools help in debugging the compiler itself or the
programs that are being compiled. They can provide debugging information
such as symbol tables, call stacks, and runtime errors.
Profiling Tools: These tools help in profiling the compiler or the compiled
code to identify performance bottlenecks and optimize the code accordingly.
Documentation Tools: These tools help in generating documentation for the
compiler and the programming language being compiled. They can generate
documentation for the syntax, semantics, and usage of the language.
Language Support: Compiler construction tools are designed to support a
wide range of programming languages, including high-level languages such
as C++, Java, and Python, as well as low-level languages such as assembly
language.
Cross-Platform Support: Compiler construction tools may be designed to
work on multiple platforms, such as Windows, Mac, and Linux.
User Interface: Some compiler construction tools come with a user interface
that makes it easier for developers to work with the compiler and its
associated tools.
COUSINS OF COMPILER

1) Preprocessor

It converts the HLL (high-level language) source into pure high-level language. It
includes all the header files and also expands any macros. It is
optional, because a preprocessor is not required for a language that does not
support #include directives and macros.

2) Compiler

It takes pure high level language as input and converts it into assembly code.

3) Assembler

It takes assembly code as an input and converts it into machine (object) code.

4) Linking and loading

It has four functions

1.​ Allocation:​
It means getting the memory portions from the operating system and
storing the object data.
2.​ Relocation:​
It maps the relative address to the physical address and relocates the
object code.
3.​ Linker:​
It combines all the executable object modules into one single executable
file.
4.​ Loader:​
It loads the executable file into main memory for execution.

Grouping of Phases and Compiler construction tools


Compiler construction is divided into three main groups of phases: Front-End,
Middle-End, and Back-End. Each group focuses on specific aspects of code
transformation, from source code to executable machine code. Let’s explore
these groups, their associated phases, and the tools used in detail.

1. Front-End Tools (Analysis Phase)

The front-end of a compiler is responsible for understanding the source code. It
ensures that the input program is syntactically and semantically correct.
Phases in the Front-End:

1.​ Lexical Analysis (Scanner)


○​ Breaks the input source code into tokens (basic units such as
keywords, operators, and identifiers).
○​ Identifies lexical errors, such as invalid symbols.
○​ Tools:
■​ Flex: A fast lexical analyzer generator.
■​ JFlex: A lexical analyzer generator for Java-based
applications.
2.​ Syntax Analysis (Parser)
○​ Parses the sequence of tokens into a syntax tree (or parse tree)
based on grammar rules.
○​ Detects syntax errors and reports them.
○​ Tools:
■​ YACC: Yet Another Compiler-Compiler, used for generating
parsers.
■​ Bison: A GNU replacement for YACC.

■​ ANTLR: A tool for generating both lexers and parsers for
language recognition.
3.​ Semantic Analysis
○​ Ensures the correctness of the parse tree by checking data types,
variable declarations, and scope rules.
○​ Detects semantic errors, such as type mismatches.
○​ Tools:
■​ Custom-built semantic analyzers.
■​ LLVM Clang Static Analyzer: Performs semantic checks
during the compilation process.

2. Middle-End Tools (Optimization Phase)

The middle-end works on improving the intermediate representation (IR) of the
program for performance and efficiency, without altering its meaning.
Phases in the Middle-End:

1.​ Intermediate Code Generation


○​ Converts the high-level source code into an intermediate form, such
as three-address code or abstract syntax trees.
○​ Simplifies analysis and optimization by creating a
machine-independent representation.
○​ Tools:
■​ LLVM IR: The intermediate representation used in the LLVM
framework.
■​ GCC GIMPLE: An intermediate representation used in the
GNU Compiler Collection.
2.​ Code Optimization
○​ Improves the intermediate code by removing redundancies, reducing
execution time, and minimizing memory usage.
○​ Includes loop unrolling, constant folding, and dead code elimination.
○​ Tools:
■​ LLVM Optimizer (opt): Optimizes LLVM IR.
■​ GCC Optimizer: Optimizes GIMPLE and other intermediate
representations.
■​ Polly: A polyhedral optimizer for LLVM to enhance
performance for data-intensive applications.

3. Back-End Tools (Synthesis Phase)

The back-end is responsible for generating the final machine code and optimizing
it for specific hardware architectures.

Phases in the Back-End:

1.​ Code Generation


○​ Transforms the optimized intermediate representation into target
machine code or assembly language.
○​ Assigns registers, handles memory allocation, and generates
efficient instruction sequences.
○​ Tools:
■​ LLVM Backend: Converts LLVM IR to machine-specific code.
■​ GCC Code Generator: Produces machine code from
GIMPLE.
2.​ Machine-Level Optimization
○​ Further optimizes the machine code to leverage hardware features,
such as instruction pipelining and parallelism.
○​ Focuses on improving runtime performance and reducing resource
usage.
○​ Tools:
■​ LLVM LLD: A linker that performs machine-specific
optimizations.
■​ GCC Assembler: Converts assembly language to machine
code.

Integrated Compiler Construction Tools

Several tools cover multiple phases of the compiler construction process:

●​ GCC (GNU Compiler Collection)


○​ A comprehensive suite that includes lexical analysis, syntax
analysis, code generation, and optimization.
●​ LLVM (Low-Level Virtual Machine)
○​ A modular and flexible framework supporting all phases of
compilation, from IR generation to machine code optimization.
●​ Clang
○​ A front-end for the LLVM framework, widely used for compiling C,
C++, and Objective-C programs.

Lexical Analysis
Lexical analysis is the first phase of a compiler. It takes the modified source code
from the language preprocessor, written in the form of sentences. The
lexical analyzer breaks this text into a series of tokens, removing any
whitespace or comments in the source code.

If the lexical analyzer finds a token invalid, it generates an error. The lexical
analyzer works closely with the syntax analyzer. It reads character streams from
the source code, checks for legal tokens, and passes the data to the syntax
analyzer when it demands.

There are three important terms to grab:

1.​ Tokens: A Token is a pre-defined sequence of characters that cannot be
broken down further. It is like an abstract symbol that represents a unit. A
token can have an optional attribute value. There are different types of
tokens:
○​ Identifiers (user-defined)
○​ Delimiters/ punctuations (;, ,, {}, etc.)
○​ Operators (+, -, *, /, etc.)
○​ Special symbols
○​ Keywords
○​ Numbers
2.​ Lexemes: A lexeme is a sequence of characters matched in the source
program that matches the pattern of a token.​
For example: (, ) are lexemes of type punctuation where punctuation is the
token.
3.​ Patterns: A pattern is a set of rules a scanner follows to match a lexeme in
the input program to identify a valid token. It is like the lexical analyzer's
description of a token to validate a lexeme.​
For example, the characters in the keyword are the pattern to identify a
keyword. To identify an identifier the pre-defined set of rules to create an
identifier is the pattern.

Token         Lexeme    Pattern
Keyword       while     w-h-i-l-e
Relop         <         <, >, >=, <=, !=, ==
Integer       7         (0-9)+ : a sequence of digits with at least one digit
String        "Hi"      characters enclosed by " "
Punctuation   ;         , ; . ! etc.
Identifier    number    a sequence of letters (A-Z, a-z) and digits, beginning with a letter

Specifications of Tokens
Let us understand how language theory defines the following terms:

Alphabets
Any finite set of symbols {0,1} is a set of binary alphabets,
{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is a set of Hexadecimal alphabets, {a-z, A-Z} is
a set of English language alphabets.

Strings
Any finite sequence of alphabets (characters) is called a string. Length of the
string is the total number of occurrence of alphabets, e.g., the length of the string
tutorialspoint is 14 and is denoted by |tutorialspoint| = 14. A string having no
alphabets, i.e. a string of zero length is known as an empty string and is denoted
by ε (epsilon).

Special symbols
A typical high-level language contains the following symbols:

Arithmetic Symbols     Addition (+), Subtraction (-), Modulo (%), Multiplication (*), Division (/)
Punctuation            Comma (,), Semicolon (;), Dot (.), Arrow (->)
Assignment             =
Special Assignment     +=, /=, *=, -=
Comparison             ==, !=, <, <=, >, >=
Preprocessor           #
Location Specifier     &
Logical                &, &&, |, ||, !
Shift Operator         >>, >>>, <<, <<<

Language
A language is considered as a finite set of strings over some finite set of
alphabets. Computer languages are considered as finite sets, and
mathematically set operations can be performed on them. Finite languages can
be described by means of regular expressions.

Roles and Responsibility of Lexical Analyzer

The lexical analyzer performs the following tasks-

●​ The lexical analyzer is responsible for removing the white spaces and
comments from the source program.
●​ It correlates error messages with the source program.
●​ It helps to identify the tokens.
●​ The input characters are read by the lexical analyzer from the source code.

Role of Lexical Analyzer

The lexical analysis is the first phase of the compiler, where the lexical analyser
operates as an interface between the source code and the rest of the phases of the
compiler. It reads the input characters of the source program, groups them into
lexemes, and produces a sequence of tokens for each lexeme. The tokens are
sent to the parser for syntax analysis.

If the lexical analyzer is implemented as a separate pass in the compiler, it may need an
intermediate file to hold its output, from which the parser would then take its
input. To eliminate the need for this intermediate file, the lexical analyzer and
the syntactic analyser (parser) are often grouped into the same pass, where the
lexical analyser operates either under the control of the parser or as a subroutine
called by the parser.

The lexical analyzer also interacts with the symbol table while passing tokens to
the parser. Whenever a token is discovered, the lexical analyzer returns a
representation for that token to the parser. If the token is a simple construct
such as a parenthesis, comma, or colon, it returns an integer code. If
the token is a more complex item such as an identifier or another token with a
value, the value is also passed to the parser.

The lexical analyzer separates the characters of the source language into groups
that logically belong together, called tokens. A token includes the token name, which is
an abstract symbol that defines a type of lexical unit, and an optional attribute
value called the token value. Tokens can be identifiers, keywords, constants,
operators, and punctuation symbols such as commas and parentheses. A rule
that describes the set of input strings for which the same token is produced as
output is called a pattern.

Regular expressions play an essential role in specifying patterns. If a keyword is
treated as a token, the pattern is simply the sequence of characters that forms the
keyword. For identifiers and various other tokens, patterns form a more complex structure.

The lexical analyzer also handles issues such as stripping out comments
and whitespace (tab, newline, blank, and other characters that are used to
separate tokens in the input), and correlating error messages generated
by the compiler with the source program.

For example, it can keep track of all newline characters so that it can associate
a line number with each error message. If macro pre-processors are used in
the source program, the expansion of macros may also be performed by the
lexical analyzer.

Following are the steps of how a lexical analyzer works:
1. Input pre-processing: This stage involves cleaning up the input text and
preparing it for lexical analysis. This may include removing comments,
whitespace, and other non-essential text from the input.
2. Tokenization: This is the process of breaking the input text into a sequence of
tokens.
3. Token classification: The lexer determines the type of each token; it can be
classified as a keyword, identifier, number, operator, or separator.
4. Token validation: The lexer checks that each token is valid according to the rules of
the programming language.
5. Output generation: In this final stage, the lexer generates the output of the
lexical analysis process, which is typically a list of tokens.
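
The C sketch below walks these five steps on a single hard-coded line of input; it is a simplified assumption of how a small hand-written lexer might be organized, not the only possible design.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Step 3 helper: classify a scanned word as keyword, number, or identifier. */
static const char *classify(const char *lexeme) {
    const char *keywords[] = { "int", "return", "if", "while" };
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(lexeme, keywords[i]) == 0)
            return "KEYWORD";
    return isdigit((unsigned char)lexeme[0]) ? "NUMBER" : "IDENTIFIER";
}

int main(void) {
    const char *input = "int count = 42 ;";   /* step 1: pre-processed input */
    char lexeme[64];
    int n = 0;
    for (const char *p = input; ; p++) {
        if (isalnum((unsigned char)*p)) {      /* step 2: group characters   */
            lexeme[n++] = *p;
        } else {
            if (n > 0) {                       /* steps 3-5: classify, emit  */
                lexeme[n] = '\0';
                printf("%-10s %s\n", classify(lexeme), lexeme);
                n = 0;
            }
            if (*p == '\0') break;
            if (!isspace((unsigned char)*p))   /* single-character tokens    */
                printf("%-10s %c\n", "SYMBOL", *p);
        }
    }
    return 0;
}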

Input Buffering
The lexical analyzer scans the input from left to right one character at a time. It
uses two pointers begin ptr(bp) and forward ptr(fp) to keep track of the pointer of
the input scanned.
Input buffering is an important concept in compiler design that refers to the way in
which the compiler reads input from the source code. In many cases, the
compiler reads input one character at a time, which can be a slow and inefficient
process. Input buffering is a technique that allows the compiler to read input in
larger chunks, which can improve performance and reduce overhead.
1.​ The basic idea behind input buffering is to read a block of input from the
source code into a buffer, and then process that buffer before reading
the next block. The size of the buffer can vary depending on the specific
needs of the compiler and the characteristics of the source code being
compiled. For example, a compiler for a high-level programming
language may use a larger buffer than a compiler for a low-level
language, since high-level languages tend to have longer lines of code.
2.​ One of the main advantages of input buffering is that it can reduce the
number of system calls required to read input from the source code.
Since each system call carries some overhead, reducing the number of
calls can improve performance. Additionally, input buffering can simplify
the design of the compiler by reducing the amount of code required to
manage input.

However, there are also some potential disadvantages to input buffering. For
example, if the size of the buffer is too large, it may consume too much memory,
leading to slower performance or even crashes. Additionally, if the buffer is not
properly managed, it can lead to errors in the output of the compiler.

Overall, input buffering is an important technique in compiler design that can help
improve performance and reduce overhead. However, it must be used carefully
and appropriately to avoid potential problems.

Initially both the pointers point to the first character of the input string.
The forward ptr moves ahead to search for the end of the lexeme. As soon as a
blank space is encountered, it indicates the end of the lexeme. For example, as soon as
the forward ptr (fp) encounters a blank space, the lexeme "int" is
identified. When fp encounters white space, it ignores it and moves ahead;
then both the begin ptr (bp) and forward ptr (fp) are set at the next token.
The input characters are thus read from secondary storage, but reading from
secondary storage in this way is costly. Hence a buffering technique is used:
a block of data is first read into a buffer and then scanned by the lexical
analyzer. There are two methods used in this context: the One Buffer Scheme
and the Two Buffer Scheme. These are explained below.

1.​ One Buffer Scheme: In this scheme, only one buffer is used to store the
input string. The problem with this scheme is that if a lexeme is very long,
it crosses the buffer boundary; to scan the rest of the lexeme the buffer
has to be refilled, which overwrites the first part of the lexeme.

The drawback of the one-buffer scheme: when the string we want to read is longer
than the buffer length, the end of the buffer is reached before the whole string
has been read, and the whole buffer has to be reloaded with the rest of the string,
which makes identification hard.

2.​ Two Buffer Scheme: To overcome the problem of the one buffer scheme, in
this method two buffers are used to store the input string. The first and
second buffers are scanned alternately; when the end of the current
buffer is reached, the other buffer is filled. The only problem with this
method is that if the length of the lexeme is longer than the combined buffer
length, the input cannot be scanned completely. Initially both
bp and fp point to the first character of the first buffer. Then fp
moves towards the right in search of the end of the lexeme. As soon as a blank
character is recognized, the string between bp and fp is identified as the
corresponding token. To identify the boundary of the first buffer, an end-of-buffer
character is placed at the end of the first buffer. Similarly, the
end of the second buffer is recognized by the end-of-buffer mark
present at the end of the second buffer. When fp encounters the first eof,
the end of the first buffer is recognized and filling of the
second buffer is started. In the same way, when a second eof is obtained,
it indicates the end of the second buffer. Alternating in this way, both buffers can be filled
until the end of the input program is reached and a stream of tokens is identified.
The character introduced at the end of each buffer is called a Sentinel, and it is used to
identify the end of a buffer.
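
A hedged C sketch of the sentinel test is given below. It loosely follows the scheme described above: each half of the buffer ends in a '\0' sentinel, and the buffer size, reload helper, and hard-coded input string are placeholders chosen for the example (a real lexer would refill the halves from a source file).

#include <stdio.h>

/* Two-buffer scheme sketch: each half holds N characters followed by a
   '\0' sentinel. Layout: buf[0..N-1] | sentinel | buf[N+1..2N] | sentinel. */
#define N 16
static char buf[2 * N + 2];
static char *forward = buf;
static const char *src = "int position = initial + rate * 60 ;";
static size_t src_pos = 0;

/* Refill one half of the buffer from the (simulated) source. */
static void reload(char *half) {
    size_t i = 0;
    while (i < N && src[src_pos] != '\0')
        half[i++] = src[src_pos++];
    half[i] = '\0';                 /* sentinel (also marks real end of input) */
}

static char next_char(void) {
    char c = *forward++;
    if (c == '\0') {                /* hit a sentinel                     */
        if (forward == buf + N + 1) {             /* end of first half    */
            reload(buf + N + 1);                  /* fill second half     */
        } else if (forward == buf + 2 * N + 2) {  /* end of second half   */
            reload(buf);                          /* refill first half    */
            forward = buf;                        /* wrap around          */
        } else {
            return '\0';            /* sentinel inside a half: real EOF   */
        }
        c = *forward++;
    }
    return c;
}

int main(void) {
    reload(buf);                     /* prime the first half              */
    for (char c = next_char(); c != '\0'; c = next_char())
        putchar(c);
    putchar('\n');
    return 0;
}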

How the lexical analyzer matches patterns with lexemes to check the
validity of lexemes against tokens
Patterns: The lexical analyzer has to scan and identify only a finite set of valid
tokens/lexemes from the program, for which it uses patterns. Patterns are used to find
a valid lexeme in the program. These patterns are specified using "regular
grammar". All the valid tokens are given predefined patterns to check the validity
of detected lexemes in the program.

1. Numbers
A number can be in the form of:

1.​ A whole number (0, 1, 2...)


2.​ A decimal number (0.1, 0.2...)
3.​ Scientific notation(1.25E), (1.25E23)

The grammar has to identify all types of numbers:

Sample Regular grammar:

​ Digit -> 0 | 1 | .... | 9
​ Digits -> Digit (Digit)*
​ Number -> Digits (.Digits)? (E[+ -]? Digits)?
​ Number -> Digit+ (.Digit+)? (E[+ -]? Digit+)?
​ ? represents 0 or 1 occurrence of the previous expression
​ * represents 0 or more occurrences of the base expression
​ + represents 1 or more occurrences of the base expression

2. Delimiters
There are different types of delimiters like white space, newline character, tab
space, etc.

Sample Regular grammar:

​ Delimiter -> ' ', '\t', '\n'


​ Delimiters -> delimiter (delimiter)*
3. Identifiers
The rules of an identifier are:

1.​ It has to start only with an alphabet.


2.​ After the first alphabet, it can have any number of alphabets, digits, and
underscores.

Sample Regular grammar:

​ Letter -> a | b | .... | z
​ Letter -> A | B | ... | Z
​ Digit -> 0 | 1 | ... | 9
​ Identifier -> Letter (Letter | Digit)*
Now, we have detected lexemes and pre-defined patterns for every token. The
lexical analyzer needs to recognize and check the validity of every lexeme using
these patterns.

To recognize and verify the tokens, the lexical analyzer builds Finite Automata for
every pattern. Transition diagrams can be built and converted into programs as
an intermediate step. Each state in the transition diagram represents a piece of
code. Every identified lexeme walks through the Automata. The programs built
from Automata can consist of switch statements to keep track of the state of the
lexeme. The lexeme is verified to be a valid token if it reaches the final state.
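
As one concrete reading of this paragraph, the C sketch below hard-codes a tiny transition diagram (a switch over states) that accepts identifiers matching Letter (Letter | Digit)*; the state numbering and the function name are illustrative assumptions.

#include <ctype.h>
#include <stdio.h>

/* A tiny transition-diagram recognizer for identifiers:
   state 0 = start, state 1 = accepting, state 2 = dead state. */
static int is_identifier(const char *lexeme) {
    int state = 0;
    for (const char *p = lexeme; *p != '\0'; p++) {
        switch (state) {
        case 0:  /* start: first character must be a letter */
            state = isalpha((unsigned char)*p) ? 1 : 2;
            break;
        case 1:  /* accepting: letters or digits may follow */
            state = isalnum((unsigned char)*p) ? 1 : 2;
            break;
        default: /* dead state: stay here */
            break;
        }
    }
    return state == 1;   /* valid only if we end in the accepting state */
}

int main(void) {
    const char *samples[] = { "rate", "x1", "1abc", "position60" };
    for (int i = 0; i < 4; i++)
        printf("%-12s -> %s\n", samples[i],
               is_identifier(samples[i]) ? "identifier" : "not an identifier");
    return 0;
}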

Token
In programming, a token is the smallest unit of meaningful data; it may be an
identifier, keyword, operator, or symbol. A token represents a series or
sequence of characters that cannot be decomposed further. In languages
such as C, some examples of tokens would include:
●​ Keywords : Those reserved words in C like ` int `, ` char `, ` float `, `
const `, ` goto `, etc.
●​ Identifiers: Names of variables and user-defined functions.
●​ Operators : ` + `, ` – `, ` * `, ` / `, etc.
●​ Delimiters /Punctuators: Symbols used such as commas ” , ”
semicolons ” ; ” braces ` {} `.

By and large, tokens may be divided into three categories:


●​ Terminal Symbols (TRM) : Keywords and operators.
●​ Literals (LIT) : Values like numbers and strings.
●​ Identifiers (IDN) : Names defined by the user.

Let’s understand now how to calculate tokens in a source code (C language):


Example 1:

int a = 10; //Input Source code

Tokens

int (keyword), a(identifier), =(operator), 10(constant) and


;(punctuation-semicolon)

Answer – Total number of tokens = 5


Example 2:

int main() {

printf("Welcome to SVVV!");

return 0;

}

Tokens

'int', 'main', '(', ')', '{', 'printf', '(', '"Welcome to SVVV!"',

')', ';', 'return', '0', ';', '}'

Answer – Total number of tokens = 14


Lexeme
A lexeme is a sequence of source code that matches one of the predefined
patterns and thereby forms a valid token. For example, in the expression `x + 5`,
both `x` and `5` are lexemes that correspond to certain tokens. These lexemes
follow the rules of the language in order for them to be recognized as valid
tokens.
Example:

main is lexeme of type identifier(token)

(,),{,} are lexemes of type punctuation(token)

Pattern
A pattern is a rule or syntax that designates how tokens are identified in a
programming language. In fact, it is supposed to specify the sequences of
characters or symbols that make up valid tokens, and provide guidelines as to
how to identify them correctly to the scanner.

Example of Programming Language (C, C++)
For a keyword to be identified as a valid token, the pattern is the sequence of
characters that make the keyword.
For identifier to be identified as a valid token, the pattern is the predefined rules
that it must start with alphabet, followed by alphabet or a digit.
Difference Between Token, Lexeme, and Pattern

Definition
Token: A token is basically a sequence of characters that is treated as a unit because it cannot be broken down further.
Lexeme: A lexeme is a sequence of characters in the source code that is matched against the predefined language rules for it to be specified as a valid token.
Pattern: A pattern specifies the set of rules that a scanner follows to create a token.

Interpretation of type Keyword
Token: all the reserved keywords of that language (main, printf, etc.)
Lexeme: int, goto
Pattern: the sequence of characters that make up the keyword

Interpretation of type Identifier
Token: name of a variable, function, etc.
Lexeme: main, a
Pattern: it must start with an alphabet, followed by alphabets or digits

Interpretation of type Operator
Token: all the operators are considered tokens
Lexeme: +, =
Pattern: +, =

Interpretation of type Punctuation
Token: each kind of punctuation (semicolon, bracket, comma, etc.) is considered a token
Lexeme: (, ), {, }
Pattern: (, ), {, }

Interpretation of type Literal
Token: a grammar rule or boolean literal
Lexeme: "Welcome to GeeksforGeeks!"
Pattern: any string of characters (except " ") enclosed between " and "

Specification of Token
In compiler design, there are three specifications of token-

1.​ String
2.​ Language
3.​ Regular Expressions

1. Strings
Strings are a finite set of symbols or characters. These symbols can be a digit or
an alphabet. There is also an empty string which is denoted by ε.

Operations on String

The operations that can be performed on a string are-

1. Prefix
The prefix of String S is any string that is extracted by removing zero or more
characters from the end of string S. For example, if the String is "NINJA", the
prefix can be "NIN" which is obtained by removing "JA" from that String. A string
is a prefix in itself.
Proper prefixes are special types of prefixes that are not equal to the String itself
or equal to ε. We obtain it by removing at least one character from the end of the
String.

2. Suffix
The suffix of string S is any string that is extracted by removing any number of
characters from the beginning of string S. For example, if the String is "NINJA",
the suffix can be "JA," which is obtained by removing "NIN" from that String. A
string is a suffix of itself.
Proper suffixes are special types of suffixes that are not equal to the String itself
or equal to ε. It is obtained by removing at least one character from the beginning
of the String.

3. Substring
A substring of a string S is any string obtained by removing any prefixes and
suffixes of that String. For example, if the String is "AYUSHI," then the substring

can be "US," which is formed by removing the prefix "AY" and suffix "HI." Every
String is a substring of itself.
Proper substrings are special types that are not equal to the String itself or equal
to ε. It is obtained by removing at least one prefix or suffix from the String.

4. Subsequence
The subsequence of the String is a string obtained by eliminating zero or more
symbols from the String. The symbols that are removed need not be consecutive.
For example, if the String is "NINJAMANIA," then a subsequence can be
"NIJAANIA," which is produced by removing "N" and "M."
Proper subsequences are special subsequences that are not equal to the String
itself or equal to ε. It is obtained by removing at least one symbol from the String.

5. Concatenation
Concatenation is defined as the addition of two strings. For example, if we have
two strings S = "Cod" and T = "ing", then the concatenation ST would be "Coding".

2. Language
A language can be defined as a finite set of strings over some symbols or
alphabets.

Operations on Language
The following operations are performed on a language in the lexical
analysis phase-

1. Union
Union is one of the most common operations we perform on a set. In terms of
languages also, it will hold a similar meaning.
Suppose there are two languages, L and S. Then the union of these two
languages will be
L ∪ S will be equal to { x | x belongs to either L or S }
For example If L = {a, b} and S = {c, d}Then L ∪ S = {a, b, c, d}

2. Concatenation
Concatenation links two languages by linking the strings from one language to all
the strings of the other language.

If there are two languages, L and S, then the concatenation of L and S will be LS
equal to { ls | where l belongs to L and s belongs to S }.
For example, suppose there are two languages L and S such that {L', L''} is the set of
strings belonging to language L and {S', S''} is the set of strings belonging to
language S.
Then the concatenation of L and S will be LS = {L'S', L'S'', L''S', L''S''}.

Kleene closure of a language L is denoted by L* and provides the set of all strings
that can be obtained by concatenating L zero or more times.
If L = {a, b}
then L* = {ε, a, b, aa, ab, ba, bb, aaa, …}

Positive Closure
L+ denotes the Positive closure of a language L and provides a set of all the
strings that can be obtained by concatenating L one or more times.
If L = {a, b}
then L+ = {a, b, aa, ab, ba, bb, aaa, …}

3. Regular Expression
Regular expressions are strings of characters that define a searching pattern with
the help of which we can form a language, and each regular expression
represents a language.
A regular expression r can denote a language L(r) which can be built recursively
over the smaller regular expression by following some rules.

Writing Regular Expressions


Following symbols are used very frequently to write regular expressions

●​ The asterisk symbol ( * ): It is used in our regular expression to


instruct the compiler that the symbol that preceded the * symbol can
be repeated any number of times in the pattern. For example, if the
expression is ab*c, then it gives the following string- ac, abc, abbc,
abbbc, abbbbbc.. and so on.
●​ The plus symbol ( + ): It is used in our regular expression to tell the
compiler that the symbol that preceded + can be repeated one or more
times in the pattern. For example, if the expression is ab+c, then it
gives the following string- abc, abbc, abbbc, abbbbbc.. and so on.

●​ Wildcard Character ( . ): The '.' symbol, also known as the wildcard
character, is a character in our regular expression that can match any
single character.
●​ Character Class: It is a way of representing multiple characters. For
example, [a – z] denotes the regular expression a | b | c | d | ….|z.
The following rules are used to define a regular expression r over some alphabet
Σ and the languages denoted by these regular expressions.

●​ ε is a regular expression that denotes a language L(ε). The language
L(ε) has the set of strings {ε}, which means that this language has a
single string, the empty string.
●​ If there is a symbol 'a' in Σ, then 'a' is a regular expression that
denotes a language L(a). The language L(a) = {a}, i.e., the language
has only one string, of length one, consisting of the symbol 'a'.
●​ Consider two regular expressions r and s; then:
r|s denotes the language L(r) ∪ L(s).
(r)(s) denotes the language L(r) ⋅ L(s).
(r)* denotes the language (L(r))*.
(r)+ denotes the language (L(r))+.
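
As a small, hedged illustration of putting such a regular expression to work, the C snippet below uses the POSIX regex API (regcomp/regexec) to test strings against the identifier pattern letter (letter | digit)*; the exact pattern string and the sample inputs are assumptions made for the example.

#include <regex.h>
#include <stdio.h>

int main(void) {
    /* Identifier pattern: one letter followed by any letters or digits. */
    const char *pattern = "^[A-Za-z][A-Za-z0-9]*$";
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0) {
        fprintf(stderr, "failed to compile pattern\n");
        return 1;
    }
    const char *samples[] = { "rate", "x1", "1abc", "a+b" };
    for (int i = 0; i < 4; i++) {
        int match = (regexec(&re, samples[i], 0, NULL, 0) == 0);
        printf("%-5s -> %s\n", samples[i], match ? "matches" : "does not match");
    }
    regfree(&re);
    return 0;
}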
