
COMPILER DESIGN

UNIT-1
Language processing system:

1. In a language processing system, the source code is first preprocessed.
2. In C, for example, constructs such as macros and #include directives are handled by the preprocessor. The output of this step is the preprocessed code.
3. The preprocessed code is given to the compiler. The compiler translates the preprocessed code and generates assembly-level code.
4. This assembly code is given to the assembler (a language translator). The assembler takes the assembly-level code and converts it into machine-level code (object code).
5. The object code is given to the linker. The linker links all the object files and libraries together to form the executable code.
6. The executable code is given to the loader. The loader loads the executable code into memory for execution.
Components of Language processing system:

A language processing system consists of the following components. Let's discuss them one by one.

Source code:

Source code is the fundamental component of a computer program. It is created by a programmer, often written in the form of functions, declarations, definitions, calls, methods and other statements. It is designed to be human-readable and formatted so that developers and other users can understand it.

Preprocessor:

A preprocessor is system software that performs preprocessing of the High Level Language (HLL).

Preprocessing is the first step of the language processing system, which translates the high-level language into machine language or absolute machine code (i.e. into the form that can be understood by the machine). The preprocessor includes all header files and also expands any macros that are used. It takes source code as input and produces modified source code as output. The preprocessor is also known as a macro evaluator; this step is optional, since a language that does not support #include directives or macros needs no preprocessing.

Compiler:

The compiler takes the modified code as input and produces the target code as output.

Assembler:
The assembler takes the target code as input and produces relocatable machine code as output.

Linker: Linker or link editor is a program that takes a collection of objects (created by assemblers and
compilers) and combines them into an executable program.

Loader: The loader keeps the linked program in the main memory.

Executable code: It is low-level, machine-specific code that the machine can directly execute. Once the linker and loader have done their jobs, the object code is finally converted into executable code.

Language processors / language translators:


Computer programs are generally written in high-level languages (like C++, Python, and Java). A language processor, or language translator, is a computer program that converts source code written in one programming language into another language or into machine code. It also reports errors found during translation.

What is a Language Processor?

Compilers and interpreters translate programs written in high-level languages into machine code that a computer understands, while assemblers translate programs written in low-level (assembly) language into machine code. The compilation process has several stages, and tools are available at each stage to help programmers write error-free code.

Assembly language is machine-dependent, yet the mnemonics used to represent its instructions are not directly understandable by the machine, while a high-level language is machine-independent. A computer understands instructions only in machine code, i.e. in the form of 0s and 1s, and it is a tedious task to write a program directly in machine code. Programs are therefore written mostly in high-level languages like Java, C++ or Python; this text is called source code. Source code cannot be executed directly by the computer and must be converted into machine language. Hence, a special translator system software, called a language processor, is used to translate a program written in a high-level language into machine code; the translated program is called the object program (object code).

Types of Language Processors


The language processors can be any of the following three types:

1. Compiler

The language processor that reads the complete source program written in a high-level language as a whole, in one go, and translates it into an equivalent program in machine language is called a compiler. Examples: C, C++, C#.

In a compiler, the source code is translated to object code successfully only if it is free of errors. If there are errors in the source code, the compiler reports them at the end of compilation, together with line numbers. The errors must be removed before the compiler can successfully recompile the source code. Once compiled, the object program can be executed any number of times without translating it again.
2. Assembler

The assembler is used to translate a program written in assembly language into machine code. The source program given to an assembler contains assembly-language instructions, and the output it generates is the object code or machine code understandable by the computer. The assembler was essentially the first interface that allowed humans to communicate with the machine; it fills the gap between human and machine so that they can communicate with each other. Code written in assembly language consists of mnemonics (instructions) such as ADD, MUL, SUB, DIV and MOV, and the assembler converts these mnemonics into binary code. These mnemonics also depend on the architecture of the machine.

For example, the architectures of the Intel 8085 and Intel 8086 are different.

3. Interpreter

An interpreter is a language processor that translates a single statement of the source program into machine code and executes it immediately, before moving on to the next line. If there is an error in a statement, the interpreter stops translating at that statement and displays an error message; it moves on to the next line only after the error has been removed. An interpreter directly executes instructions written in a programming or scripting language without previously converting them to object code or machine code: it translates one line at a time and then executes it.

Example: Perl, Python and Matlab.


DIFFERENCES BETWEEN COMPILER AND INTERPRETER:

1. A compiler translates code from a high-level programming language into machine code before the program runs; an interpreter translates the code into machine code line-by-line as the program runs.
2. Compiled code runs faster; interpreted code runs slower.
3. A compiler displays all errors after compilation; an interpreter displays the errors of each line one by one.
4. The linking-loading model is the basic working model of the compiler; the interpretation model is the basic working model of the interpreter.
5. The compiler generates an output file (e.g. an .exe); the interpreter does not generate any output file.
6. CPU utilization is higher in the case of a compiler and lower in the case of an interpreter.
7. With a compiler, object code is permanently saved for future use; with an interpreter, no object code is saved for future use.
8. A compiler can check syntactic and semantic errors in the program simultaneously; an interpreter checks syntactic errors only.
9. Compilers are larger in size; interpreters are smaller in size.
10. Compilers are not flexible; interpreters are relatively flexible.
11. Compilers are more efficient; interpreters are less efficient.
12. A compiler works best for the production environment; an interpreter works best for the programming and development environment.
13. C, C++, C#, etc. are compiler-based programming languages; Python, Ruby, Perl, SNOBOL, MATLAB, etc. are interpreter-based programming languages.

Preprocessor:
A preprocessor produces input to compilers. It may perform the following functions (a toy sketch of macro processing follows this list):
1. Macro processing: A preprocessor may allow a user to define macros that are shorthands for longer constructs.
2. File inclusion: A preprocessor may include header files into the program text.
3. Rational preprocessing: These preprocessors augment older languages with more modern flow-of-control and data-structuring facilities.
4. Language extensions: These preprocessors attempt to add capabilities to the language by what amounts to built-in macros.
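To make macro processing concrete, here is a toy Python sketch. It is only an illustration, not a real C preprocessor (it ignores conditional compilation, function-like macros, include search paths, and so on); the names MACROS, expand_macros and preprocess are invented for this example.

```python
import re

# Toy macro table, as if built from "#define PI 3.14159" and "#define MAX 100".
MACROS = {"PI": "3.14159", "MAX": "100"}

def expand_macros(line):
    """Replace every whole-word macro name with its replacement text."""
    for name, body in MACROS.items():
        line = re.sub(rf"\b{name}\b", body, line)
    return line

def preprocess(source_lines):
    """Drop #define lines (already captured in MACROS) and expand macro uses."""
    for line in source_lines:
        if line.lstrip().startswith("#define"):
            continue
        yield expand_macros(line)

src = ["#define PI 3.14159", "area = PI * r * r;", "if (n > MAX) n = MAX;"]
print(list(preprocess(src)))
# ['area = 3.14159 * r * r;', 'if (n > 100) n = 100;']
```

File inclusion works in the same spirit: when the preprocessor sees an #include directive, it splices the text of the named header file into the output stream.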

Translator:
A translator or language processor is a program that translates an input program written in a programming language into an equivalent program in another language. The compiler is a type of translator which takes a program written in a high-level programming language as input and translates it into an equivalent program in a low-level language such as machine language or assembly language.
TYPES OF TRANSLATORS:
INTERPRETER
COMPILER
PREPROCESSOR

Linker:
A linker is a program in a system which helps to link the object modules of a program into a single object file. It performs the process of linking. Linkers are also called link editors. Linking is the process of collecting and combining pieces of code and data into a single file. The linker also links a particular module into the system library. It takes object modules from the assembler as input and forms an executable file as output for the loader. Linking is performed both at compile time, when the source code is translated into machine code, and at load time, when the program is loaded into memory by the loader. Linking is the last step in compiling a program.

A linker, also known as a link editor or binder, is a program that combines object modules into a single object file. Generally, it performs the process of linking: it takes one or more object files generated by a compiler and combines them into a single executable file.
Loader:
In compiler design, a loader is a program that is responsible for loading executable programs
into memory for execution. The loader reads the object code of a program, which is usually in
binary form, and copies it into memory. It also performs other tasks such as allocating
memory for the program’s data and resolving any external references to other programs or
libraries. The loader is typically part of the operating system and is invoked by the system’s
bootstrap program or by a command from a user. Loaders can be of two types:

 Absolute Loader: It loads a program at a specific memory location specified in the program's object code. This location is absolute and does not change when the program is loaded into memory.
 Relocating Loader: It loads a program at any memory location, and then adjusts all memory references in the program to reflect the new location. This allows the same program to be loaded into different memory locations without having to modify the program's object code.

The architecture of a loader in a compiler design typically consists of several components:

1. Source program: This is a program written in a high-level programming language that needs to be executed.
2. Translator: This component, such as a compiler or interpreter, converts the source program into an object program.
3. Object program: This is the program in machine-readable form, usually binary, that contains both the instructions and the data of the program.
4. Executable object code: This is the object program after it has been processed by the loader, ready to be executed.

Overall, the Loader is responsible for loading the program into memory, preparing it for
execution, and transferring control to the program’s entry point. It acts as a bridge between
the Operating System and the program being loaded.

LIST OF COMPILERS
1. Ada compilers
2. ALGOL compilers
3. BASIC compilers
4. C# compilers
5. C compilers
6. C++ compilers
7. COBOL compilers
8. D compilers
9. Common Lisp compilers
10. ECMAScript interpreters
11. Eiffel compilers
12. Felix compilers
13. Fortran compilers
14. Haskell compilers
15. Java compilers
16. Pascal compilers
17. PL/I compilers
18. Python compilers
19. Scheme compilers
20. Smalltalk compilers
21. CIL compilers

Structure of a Compiler/phases of compiler:


We basically have two phases of compilers, namely the Analysis phase and Synthesis phase.
The analysis phase creates an intermediate representation from the given source code. The
synthesis phase creates an equivalent target program from the intermediate representation.

A compiler is a software program that converts the high-level source code written in a
programming language into low-level machine code that can be executed by the computer
hardware. The process of converting the source code into machine code involves several
phases or stages, which are collectively known as the phases of a compiler. The typical
phases of a compiler are:

The 6 phases of a compiler are:

1. Lexical Analysis
2. Syntactic Analysis or Parsing
3. Semantic Analysis
4. Intermediate Code Generation
5. Code Optimization
6. Code Generation

Lexical Analysis

The first phase of the compiler works as a text scanner. This phase scans the source code as a stream of characters and converts it into meaningful lexemes. The lexical analyzer represents these lexemes in the form of tokens as:

<token-name, attribute-value>
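For instance, a statement such as position = initial + rate * 60 could be represented by the lexical analyzer roughly as the following token stream. The token names here are illustrative only; a real compiler would typically use a pointer to the symbol-table entry as the attribute value of each id.

```python
# Illustrative token stream for:  position = initial + rate * 60
tokens = [
    ("id", "position"),    # <id, symbol-table entry for position>
    ("assign_op", "="),
    ("id", "initial"),
    ("add_op", "+"),
    ("id", "rate"),
    ("mul_op", "*"),
    ("num", "60"),
]
```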
Syntax Analysis

The next phase is called the syntax analysis or parsing. It takes the token produced by lexical
analysis as input and generates a parse tree (or syntax tree). In this phase, token arrangements
are checked against the source code grammar, i.e. the parser checks if the expression made
by the tokens is syntactically correct.

Roles and Responsibilities of Syntax Analyzer

 Note syntax errors.


 Helps in building a parse tree.
 Acquire tokens from the lexical analyzer.
 Scan the syntax errors, if any.

Semantic Analysis

Semantic analysis checks whether the parse tree constructed follows the rules of the language. For example, it checks that values are assigned between compatible data types and flags errors such as adding a string to an integer. The semantic analyzer also keeps track of identifiers, their types and expressions, and whether identifiers are declared before use. It produces an annotated syntax tree as output.

Roles and Responsibilities of Semantic Analyzer:

 Saving collected data to symbol tables or syntax trees.


 It notifies semantic errors.
 Scanning for semantic errors.

Intermediate Code Generation

After semantic analysis the compiler generates an intermediate code of the source code for
the target machine. It represents a program for some abstract machine. It is in between the
high-level language and the machine language. This intermediate code should be generated
in such a way that it makes it easier to be translated into the target machine code.

 A code that is neither high-level nor machine code, but a middle-level code is an
intermediate code.
 We can translate this code to machine code later.
 This stage serves as a bridge or way from analysis to synthesis.

Code Optimization

The next phase does code optimization of the intermediate code. Optimization can be
assumed as something that removes unnecessary code lines, and arranges the sequence of
statements in order to speed up the program execution without wasting resources (CPU,
memory).

Roles and Responsibilities:

 Remove the unused variables and unreachable code.


 Enhance runtime and execution of the program.
 Produce streamlined code from the intermediate expression.
Code Generation

In this phase, the code generator takes the optimized representation of the intermediate code and maps it to the target machine language. The code generator translates the intermediate code into a sequence of (generally) relocatable machine code. This sequence of machine instructions performs the same task as the intermediate code would.

Roles and Responsibilities:

 Translate the intermediate code to target machine code.


 Select and allocate memory spots and registers.

Symbol Table

It is a data-structure maintained throughout all the phases of a compiler. All the identifier's
names along with their types are stored here. The symbol table makes it easier for the
compiler to quickly search the identifier record and retrieve it. The symbol table is also used
for scope management.

The symbol table connects or interacts with all phases of the compiler and error handler for
updates. It is also accountable for scope management.

It stores:

 Literal constants and strings.
 Function names.
 Variable names and constants.
 Labels in the source language.
Reasons for the separation:
Simplicity — Removing the details of lexical analysis from the syntax analyzer makes it smaller and less complex.
Efficiency — It becomes easier to optimize the lexical analyzer on its own.
Portability — The lexical analyzer reads source files and handles input, so it isolates the platform-dependent parts.
 Lexical analysis is separated from syntax analysis because they serve different purposes in language processing.
 Lexical analysis focuses on identifying and grouping characters into lexical tokens according to the rules of the language; it deals with individual words (lexemes) and their classification.
 Syntax analysis, on the other hand, is concerned with the arrangement of and relationships between these lexical tokens, creating data structures (parse trees) that reflect the syntactic structure of the input.
 Separating lexical analysis from syntax analysis allows a more modular approach to language processing, where knowledge about the lexicon is kept separate from knowledge about syntax.

Lexical analysis and syntax analysis are two distinct phases in the process of
compiling a program. They are separated for several reasons:

1. Modularity: Separating lexical analysis from syntax analysis allows for a modular design of the compiler. Each phase can be implemented and tested independently, making the overall compiler design more manageable and easier to maintain.
2. Simplicity: Breaking down the compilation process into smaller,
more focused phases simplifies the overall complexity of the
compiler. Lexical analysis deals with individual tokens and their
categorization, while syntax analysis focuses on the structure and
relationships between these tokens. Keeping these tasks separate
makes the implementation of each phase more straightforward.
3. Efficiency: Lexical analysis can be implemented more efficiently
using specialized techniques like deterministic finite automata
(DFAs) or regular expressions. These techniques are optimized for
recognizing simple patterns in the input stream of characters. By
separating lexical analysis, the compiler can efficiently identify
tokens without the additional overhead of handling complex
grammatical rules.
4. Error Handling: Separating lexical and syntax analysis allows for
more precise error reporting. If an error occurs during lexical
analysis (e.g., an invalid token), the compiler can report the error at
the exact position in the source code where the issue was detected.
This can help programmers quickly identify and correct errors in
their code.
5. Flexibility: Separating lexical and syntax analysis provides
flexibility in the design of the compiler. Different languages may
have different lexical rules or tokenization requirements. By
decoupling lexical analysis from syntax analysis, compilers can be
more easily adapted to support different languages or language
features.

In summary, separating lexical analysis from syntax analysis in the compilation process helps in achieving modularity, simplifying the overall design, improving efficiency, enhancing error handling, and providing flexibility in compiler implementation.
Cousins of compiler:
Cousins of a compiler consist of a preprocessor, an assembler, and a loader and linker, which play an essential role in
converting a high-level language into a low-level language along with the Compiler.

Cousins of Compiler
Converting a high-level language into a low-level language takes multiple steps and involves many programs
apart from the Compiler. Before the compilation can start, our source code needs to be preprocessed. After
the compilation, our code needs to be converted into executable code to execute on our machine. These
essential tasks are performed by the preprocessor, assembler, Linker, and Loader. They are known as the
Cousins of the Compiler.

Preprocessor
The preprocessor is one of the cousins of the Compiler. It is a program that performs preprocessing. It
performs processing on the given data and produces an output. The output generated is used as an input for
some other program.
The preprocessor increases the readability of the code by replacing a complex expression with a simpler one
by using a macro.

A preprocessor performs multiple types of functionality and operations on the data.

Some of them are-

Macro processing
Macro processing is mapping the input to output data based on a certain set of rules and defined processes.
These rules are known as macros.

Rational Preprocessors
Rational preprocessors are processors that augment older languages with more modern flow-of-control and data-structuring facilities.

File Inclusion
The preprocessor is also used to include header files in the program text. A header file is a text file included
in our source program file during compilation. When the preprocessor finds an #include directive in the
program, it replaces it with the entire content of the specified header file.

Language extension
Language extension is used to add new capabilities to the existing language. This is done by including certain
libraries in our program, which provides extra functionality. An example of this is Equel, a database query
language embedded in C.

Error Detection
Some preprocessors are capable of performing error checking on the source code given to them as input. For example, they can check whether the header files are included properly and whether the macros are defined correctly.

Conditional Compilation
Certain preprocessors are capable of including or excluding certain pieces of code based on the result of a
condition. They provide more flexibility to the programmers for writing the code as they allow the
programmers to include or exclude certain features of the program based upon some condition.

Assembler
Assembler is also one of the cousins of the compiler. A compiler takes the preprocessed code and then
converts it into assembly code. This assembly code is given as input to the assembler, and the assembler
converts it into the machine code. Assembler comes into effect in the compilation process after the Compiler
has finished its job.

There are two types of assemblers-


 One-Pass assembler: They go through the source code (output of Compiler) only once and assume that all
symbols will be defined before any instruction that references them.

 Two-Pass assembler: Two-pass assemblers work by creating a symbol table with the symbols and their values
in the first pass, and then using the symbol table in a second pass, they generate code.

Linker
Linker takes the output produced by the assembler as input and combines them to create an executable file. It
merges two or more object files that might be created by different assemblers and creates a link between
them. It also appends all the libraries that will be required for the execution of the file. A linker's primary
function is to search and find referred modules in a program and establish the memory address where these
codes will be loaded.

Multiple tasks that can be performed by linkers include-

 Library Management: Linkers can be used to add external libraries to our code to add additional
functionalities. By adding those libraries, our code can now use the functions defined in those libraries.

 Code Optimization: Linkers are also used to optimize the code generated by the compiler by reducing the
code size and increasing the program's performance.

 Memory Management: Linkers are also responsible for managing the memory requirement of the executable
code. It allocates the memory to the variables used in the program and ensures they have a consistent memory
location when the code is executed.

 Symbol Resolution: Linkers link multiple object files, and a symbol can be redefined in multiple files, giving
rise to a conflict. The linker resolves these conflicts by choosing one definition to use.

Loader
The loader works after the linker has performed its task and created the executable code. It takes the input of
executable files generated from the linker, loads it to the main memory, and prepares this loaded code for
execution by a computer. It also allocates memory space to the program. The loader is also responsible for
the execution of programs by allocating RAM to the program and initializing specific registers.

Following tasks are performed by the loader

 Loading: The loader loads the executable files in the memory and provides memory for executing the
program.

 Relocation: The loader adjusts the memory addresses of the program to relocate its location in memory.

 Symbol Resolution: The loader is used to resolve the symbols not defined directly in the program. They do
this by looking for the definition of that symbol in a library linked to the executable file.

 Dynamic Linking: The loader dynamically links the libraries into the executable file at runtime to add
additional functionality to our program.

Left Recursion:
Recursion can be classified into following three types-

1. Left Recursion
2. Right Recursion
3. General Recursion

1. Left Recursion-
 A production of grammar is said to have left recursion if the leftmost variable of its RHS is
same as variable of its LHS.
 A grammar containing a production having left recursion is called as Left Recursive
Grammar.

Example-

S → Sa / ε

(Left Recursive Grammar)

 Left recursion is considered to be a problematic situation for Top down parsers.


 Therefore, left recursion has to be eliminated from the grammar.

Elimination of Left Recursion


Left recursion is eliminated by converting the grammar into a right recursive grammar.

If we have the left-recursive pair of productions-

A → Aα / β

(Left Recursive Grammar)

where β does not begin with an A.

Then, we can eliminate left recursion by replacing the pair of productions with-

A → βA’
A’ → αA’ / ε

(Right Recursive Grammar)

This right recursive grammar generates the same language as the left recursive grammar.

2. Right Recursion-
 A production of grammar is said to have right recursion if the rightmost variable of its RHS is
same as variable of its LHS.
 A grammar containing a production having right recursion is called as Right Recursive
Grammar.

Example-

S → aS / ε

(Right Recursive Grammar)


 Right recursion does not create any problem for the Top down parsers.
 Therefore, there is no need of eliminating right recursion from the grammar.


3. General Recursion-
 The recursion which is neither left recursion nor right recursion is called as general
recursion.

Example-

S → aSb / ε

Left recursion is a common problem that occurs in grammar during parsing in the syntax analysis
part of compilation. It is important to remove left recursion from grammar because it can create
an infinite loop, leading to errors and a significant decrease in performance

A Grammar G (V, T, P, S) is left recursive if it has a production in the form.

A → A α |β.

The above grammar is left recursive because the non-terminal on the left of the production also occurs at the first position on the right side of the production. We can eliminate left recursion by replacing the pair of productions with

A → βA′

A′ → αA′ | ε

Elimination of Left Recursion

Left recursion can be eliminated by introducing a new non-terminal A′ such that A → βA′ and A′ → αA′ | ε. This type of recursion is also called immediate left recursion.

In a left-recursive grammar, expansion of A generates Aα, Aαα, Aααα, ... at each step, causing the parser to enter an infinite loop.

Left Recursion: A grammar of the form

S ⇒ S a | b

is called left recursive, where S is any non-terminal and a and b are any strings of terminals.
Problem with Left Recursion: If left recursion is present in a grammar then, during parsing in the syntax-analysis part of compilation, there is a chance that the grammar will create an infinite loop. This is because, at every expansion, S produces another S without consuming any input.
Algorithm to Remove Left Recursion with an example: Suppose we have a grammar
which contains left recursion:

S ⇒S a | S b | c | d

Check if the given grammar contains left recursion. If present, then separate the production
and start working on it. In our example:

S ⇒S a | S b | c | d

Introduce a new non-terminal S' and append it at the end of every production that does not begin with S; write these as the new productions for S:

S ⇒ cS' | dS'

Now write the new non-terminal S' on the LHS. On the RHS it can either produce ε or reproduce the left-recursive alternatives, with the symbols that followed the leading S now followed by S' at the end:

S' ⇒ ε | aS' | bS'

So, after conversion, the new equivalent production is:

S ⇒ cS' | dS'
S' ⇒ ε | aS' | bS'
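The same transformation can be automated. Below is a minimal Python sketch (the function name and the encoding of the grammar as strings of single-character symbols are assumptions made for illustration) that eliminates immediate left recursion from the productions of one non-terminal, reproducing the S ⇒ cS' | dS', S' ⇒ ε | aS' | bS' result above:

```python
def eliminate_immediate_left_recursion(head, productions):
    """Split A -> A alpha | beta into A -> beta A' and A' -> alpha A' | epsilon.
    'productions' is a list of RHS strings, e.g. ['Sa', 'Sb', 'c', 'd']."""
    new_head = head + "'"
    # alphas: what follows the leading left-recursive occurrence of 'head'
    alphas = [p[len(head):] for p in productions if p.startswith(head)]
    # betas: alternatives that do not start with 'head'
    betas = [p for p in productions if not p.startswith(head)]
    if not alphas:
        return {head: productions}          # no left recursion, nothing to do
    return {
        head:     [b + new_head for b in betas],
        new_head: [a + new_head for a in alphas] + ["epsilon"],
    }

print(eliminate_immediate_left_recursion("S", ["Sa", "Sb", "c", "d"]))
# {'S': ["cS'", "dS'"], "S'": ["aS'", "bS'", 'epsilon']}
```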

Types of left recursion in compiler design


Direct left recursion in compiler design

A grammar is said to have direct left recursion when a production rule is of the form

S ⇒ S a | b

where S is a non-terminal symbol and a and b are strings of terminals.

Indirect left recursion

A grammar is said to be having indirect left recursion when it does not have a direct left
recursion, but the productions rule is given in such a way that it is possible to derive a string
from a given Non-terminal symbol such that the leftmost symbol or the head of the derived
string is that non-terminal itself.

Example:

A ⇒B x
B ⇒C y
C ⇒ A z

Explanation:

The above grammar has indirect left recursion because it is possible to derive the
following production-

A ⇒B x
A ⇒ (C y) x
A ⇒ ((A z) y) x
A ⇒ A z y x

Hence the above grammar has indirect left recursion.

Refer notes for practice problems

LEXICAL ANALYSIS:
1. Lexical analysis can be implemented with deterministic finite automata.

2. The output is a sequence of tokens that is sent to the parser for syntax analysis.

Typical lexical errors include:
- Spelling errors: int written as intt is a spelling error.
- Exceeding the allowed length of an identifier or numeric constant.
- Appearance of an illegal character in the input (for example, a stray @ inside an identifier).

Lexical analysis breaks down an input text into meaningful components called tokens.
Here are the simplified steps:
1. **Identify Tokens**: Determine the set of symbols (letters, digits, operators, etc.)
that can form tokens.

2. **Assign Strings to Tokens**: Recognize and categorize strings. For example, "cat"
as a word token, "2023" as a number token.

3. **Return the Token Value**: Extract and return the smallest units (lexemes) that form
each token for further processing.

Advantages of lexical analysis:


1. **Data Cleaning**: Removes unnecessary elements like spaces and comments,
making the input cleaner.

2. **Simplifies Further Analysis**: Organizes input into useful tokens, making it


easier for the next analysis steps.

3. **Compresses Input**: Reduces and compiles the input, streamlining the data
for processing.

Limitations of lexical analysis:


1. **Ambiguity**: Sometimes struggles to categorize tokens correctly.

2. **Lookahead Complexity**: Needs to look ahead in the input, which can be complex.

3. **Limited Scope**: Can't detect issues like undeclared identifiers or misspelled


words since it only handles individual tokens without context.

TOKEN GENERATION:
In example 1 (the statement int num1 = 100;), the total number of tokens is 5: int, num1, =, 100 and ;.

A string literal is considered as one token.

How Lexical Analyzer Works?


1. Input preprocessing: This stage involves cleaning up the input text and preparing it
for lexical analysis. This may include removing comments, whitespace, and other
non- essential characters from the input text.

2. Tokenization: This is the process of breaking the input text into a sequence of tokens.
This is usually done by matching the characters in the input text against a set of
patterns or regular expressions that define the different types of tokens.

3. Token classification: In this stage, the lexer determines the type of each token. For
example, in a programming language, the lexer might classify keywords,
identifiers, operators, and punctuation symbols as separate token types.

4. Token validation: In this stage, the lexer checks that each token is valid according to
the rules of the programming language. For example, it might check that a variable
name is a valid identifier, or that an operator has the correct syntax.

5. Output generation: In this final stage, the lexer generates the output of the lexical
analysis process, which is typically a list of tokens. This list of tokens can then
be passed to the next stage of compilation or interpretation.

Suppose we pass a statement through the lexical analyzer: a = b + c; It will generate a token sequence like this: id = id + id ; where each id refers to its variable in the symbol table, which holds all of its details.
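A hedged sketch of such a lexer is shown below. It uses Python regular expressions; the token names and the tiny keyword set are assumptions for illustration, not the token set of any particular compiler. Running it on a = b + c; yields the id = id + id ; sequence described above.

```python
import re

# Hypothetical token specification for a tiny C-like language.
TOKEN_SPEC = [
    ("NUM",  r"\d+"),
    ("ID",   r"[A-Za-z_]\w*"),
    ("OP",   r"[+\-*/=]"),
    ("SEMI", r";"),
    ("WS",   r"\s+"),            # white space is skipped, never sent to the parser
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))
KEYWORDS = {"int", "if", "else", "while"}

def tokenize(text):
    for m in MASTER.finditer(text):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "WS":
            continue                       # strip white space
        if kind == "ID" and lexeme in KEYWORDS:
            kind = lexeme.upper()          # reserved words become their own tokens
        yield (kind, lexeme)

print(list(tokenize("a = b + c;")))
# [('ID', 'a'), ('OP', '='), ('ID', 'b'), ('OP', '+'), ('ID', 'c'), ('SEMI', ';')]
```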
ROLE OF LEXICAL ANALYSER:
-Lexical analysis (Scanner): to read the input characters and
output a sequence of tokens
– Syntactic analysis (Parser): to read the tokens and output a
parse tree and report syntax errors if any
Here is a simpler view of how the lexical analyzer (lexer) and syntactic analyzer (parser) interact:

1. **Lexical Analyzer (Lexer)**:


- **Job**: Breaks the input text into small pieces
called tokens (like words, numbers, symbols).
- **Output**: Sends these tokens to the parser.

2. **Syntactic Analyzer (Parser)**:


- **Job**: Looks at the tokens to check if they follow
the rules of the language (grammar).
- **Output**: Builds a structure (like a tree) showing
the organization of the tokens.

**Interaction**:
- The parser asks the lexer for the next token.
- The lexer reads the input and gives the next token to
the parser.
- This continues until the input is fully processed.

Basically, the lexer chops up the input into tokens, and the
parser checks if these tokens fit together correctly according
to the language rules.
Bootstrapping:
 Bootstrapping is a process in which simple language is used to translate more
complicated program which in turn may handle for more complicated program.
 This complicated program can further handle even more complicated program and
so on. Writing a compiler for any high level language is a complicated process.
 It takes lot of time to write a compiler from scratch.
 Hence simple language is used to generate target code in some stages.
 To clearly understand the Bootstrapping technique consider a following scenario.
 Suppose we want to write a cross compiler for new language X.
 The implementation language of this compiler is say Y and the target code being
generated is in language Z.
 That is, we create XYZ.
 Now if existing compiler Y runs on machine M and generates code for M then it is
denoted as YMM.
 Now if we run XYZ using YMM then we get a compiler XMZ.
 That means a compiler for source language X that generates a target code in
language Z and which runs on machine M.
 The following diagram illustrates the above scenario. Compilers of many different forms can be created this way.
Bootstrapping is the process of writing a compiler for a programming language using the language itself. In other words, it is the process of using a compiler written in a particular programming language to compile a new version of the compiler written in the same language.

INPUT BUFFERING:
The lexical analyzer reads the source program in large blocks into a buffer rather than one character at a time; a special sentinel character eof (end of file) marks the end of the buffer and of the source file.
Specification of Tokens:

Recognition of Tokens
The question is how to recognize the tokens?
Example: assume the following grammar fragment generates a specific language:

stmt → if expr then stmt | if expr then stmt else stmt | ε
expr → term relop term | term
term → id | num

where the terminals if, then, else, relop, id, and num generate sets of strings given by the following regular definitions:

if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
id → letter ( letter | digit )*
num → digit+ optional-fraction optional-exponent

where letter and digit are as defined previously.
For this language fragment the lexical analyzer will recognize the keywords if, then, else, as well as the lexemes denoted by relop, id, and num. To simplify matters, we assume keywords are reserved; that is, they cannot be used as identifiers. num represents the unsigned integer and real numbers of Pascal.
In addition, we assume lexemes are separated by white space, consisting of non-null sequences of blanks, tabs, and newlines. The lexical analyzer will strip out white space. It will do so by comparing a string against the regular definition ws, below:

delim → blank | tab | newline
ws → delim+

If a match for ws is found, the lexical analyzer does not return a token to the parser.


Transition Diagrams (TD)


As an intermediate step in the construction of a lexical analyzer, we first produce a flowchart, called a transition diagram. Transition diagrams depict the actions that take place when a lexical analyzer is called by the parser to get the next token.
The TD is used to keep track of information about characters that are seen as the forward pointer scans the input. It does that by moving from position to position in the diagram as characters are read.

Components of Transition Diagram

1. One state is labeled the start state; it is the initial state of the transition diagram where control resides when we begin to recognize a token.

2. Positions in a transition diagram are drawn as circles and are called states.

3. The states are connected by arrows, called edges. Labels on edges indicate the input characters.

4. The accepting states are the states in which a token has been found.

5. A * on a state indicates that, on reaching it, one character must be retracted (the last character read is not part of the lexeme).


Example: A transition diagram for the relational-operator token "relop" (the original figure is reproduced here as a list of edges; states 4 and 8 are marked * for retraction):

- state 0 (start): on < go to state 1; on = go to state 5; on > go to state 6
- state 1: on = go to state 2, return (relop, LE); on > go to state 3, return (relop, NE); on any other character go to state 4*, return (relop, LT)
- state 5: return (relop, EQ)
- state 6: on = go to state 7, return (relop, GE); on any other character go to state 8*, return (relop, GT)

Transition diagram for relational operators

Example: A transition diagram for identifiers and keywords:

- state 9 (start): on letter go to state 10
- state 10: on letter or digit stay in state 10; on any other character go to state 11* (accept, retract)

Transition diagram for identifiers and keywords


Example: Transition diagrams for unsigned numbers in Pascal (the original figure showed three diagrams):

- states 12–19: recognize numbers with an optional fraction and an optional exponent (digit+ followed by . digit+, then E with an optional + or - and digit+, ending in a retracting accept state);
- states 20–24: recognize numbers with a fraction but no exponent (digit+ . digit+);
- states 25–27: recognize plain integers (digit+, with a retracting accept state).

Together they implement the regular definition

num → digit+ ( . digit+ | ε ) ( E ( + | - | ε ) digit+ | ε )

Transition diagram for unsigned numbers in Pascal
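As a quick check, the regular definition for num can be written directly as a Python regular expression. This is only a sketch; a real Pascal scanner handles more cases and returns attribute values as well.

```python
import re

# num -> digit+ ( . digit+ | eps ) ( E ( + | - | eps ) digit+ | eps )
NUM = re.compile(r"\d+(\.\d+)?(E[+-]?\d+)?\Z")

for s in ["2", "3.14", "6E4", "1.5E-10", "12.", "E5"]:
    print(s, bool(NUM.match(s)))   # the first four match, the last two do not
```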

Treatment of White Space (WS):

- state 28 (start): on delim go to state 29
- state 29: on delim stay in state 29; on any other character go to state 30* (retract)

Transition diagram for white space

Nothing is returned when the accepting state is reached; we merely


go back to the start state of the first transition diagram to look for
another pattern.


Finite Automata (FA)


A finite automaton is a generalized transition diagram (TD), constructed to compile a regular expression (RE) into a recognizer.
A recognizer for a language is a program that takes a string x as input and answers "yes" if x is a sentence of the language and "no" otherwise.

FA is classified into Nondeterministic Finite Automata (NFA) and Deterministic Finite Automata (DFA).

Note: Both NFA and DFA are capable of recognizing exactly what regular expressions can denote.

Nondeterministic Finite Automata (NFA)

NFA means that more than one transition out of a state may be possible on the same input symbol (for example, two edges labeled a leaving the same state). A transition on the empty input ε (an ε-transition) is also possible.


A nondeterministic finite automaton (NFA) is a mathematical model that consists of
1) a set of states S;
2) a set of input symbols ∑, called the input alphabet;
3) a transition function that maps state–symbol pairs to sets of states;
4) a state s0 called the initial or start state;
5) a set of states F called the accepting or final states.

Example: The NFA that recognizes the language (a | b)*abb has states 0–3, with state 0 the start state and state 3 accepting; state 0 has self-loops on a and b, followed by the edges 0 →a 1 →b 2 →b 3.

Example: The NFA that recognizes the language aa* | bb* has ε-transitions from the start state into two branches: one branch reads an a and then loops on a, the other reads a b and then loops on b.

Deterministic Finite Automata (DFA)

A deterministic finite automaton (DFA, for short) is a special case of a non-deterministic finite automaton (NFA) in which
1. no state has an ε-transition, i.e., a transition on the empty input ε, and
2. for each state S and input symbol a, there is at most one edge labeled a leaving S.
A deterministic finite automaton thus has at most one transition from each state on any input.

Example: A DFA that recognizes the language (a|b)*abb. Its transitions are exactly those listed in the transition table below; state 0 is the start state and state 3 is the accepting state.

The transition table is:

State | a | b
0     | 1 | 0
1     | 1 | 2
2     | 1 | 3
3     | 1 | 0
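A table-driven recognizer simply looks up the next state in this table. The following Python sketch (the names DTRAN, START and accepts are illustrative) encodes the table above and accepts exactly the strings over {a, b} that end in abb:

```python
# Transition table Dtran taken directly from the table above.
DTRAN = {
    (0, 'a'): 1, (0, 'b'): 0,
    (1, 'a'): 1, (1, 'b'): 2,
    (2, 'a'): 1, (2, 'b'): 3,
    (3, 'a'): 1, (3, 'b'): 0,
}
START, ACCEPTING = 0, {3}

def accepts(word):
    """Run the table-driven DFA; accept iff we end in an accepting state.
    Assumes the input word is over the alphabet {a, b}."""
    state = START
    for ch in word:
        state = DTRAN[(state, ch)]
    return state in ACCEPTING

for w in ["abb", "aabb", "babb", "ab", "abba"]:
    print(w, accepts(w))     # the first three end in 'abb' and are accepted
```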


Conversion of an NFA into a DFA


It is hard for a computer program to simulate an NFA because
the transition function is multivalued. The algorithm that called the
subset construction will convert an NFA for any language into a
DFA that recognizes the same languages.

Algorithm: (Subset construction): constructing DFA from NFA.


Input: NFA N.
Output: DFA D accepting the same language.
Method: this algorithm constructs a transition table Dtran for D. Each DFA state is a set of NFA states, and we construct Dtran so that D will simulate "in parallel" all possible moves N can make on a given input string.
It uses the operations below to keep track of sets of NFA states (s represents an NFA state and T a set of NFA states).

Operation       | Description
ε-closure(s)    | Set of NFA states reachable from NFA state s on ε-transitions alone.
ε-closure(T)    | Set of NFA states reachable from some NFA state s in T on ε-transitions alone.
move(T, a)      | Set of NFA states to which there is a transition on input symbol a from some NFA state s in T.

1) ε-closure(s0) is the start state of D.
2) A state of D is accepting if it contains at least one accepting state of N.


Algorithm (subset construction):

Initially, ε-closure(s0) is the only state in Dstates and it is unmarked;
while there is an unmarked state T in Dstates do begin
    mark T;
    for each input symbol a do begin
        U := ε-closure(move(T, a));
        if U is not in Dstates then
            add U as an unmarked state to Dstates;
        Dtran[T, a] := U
    end
end

We construct Dstates, the set of states of D, and Dtran, the transition table for D, in the following manner. Each state of D corresponds to a set of NFA states that N could be in after reading some sequence of input symbols, including all possible ε-transitions before or after symbols are read.

Algorithm: computation of ε-closure(T)

push all states in T onto stack;
initialize ε-closure(T) to T;
while stack is not empty do begin
    pop t, the top element, off the stack;
    for each state u with an edge from t to u labeled ε do
        if u is not in ε-closure(T) then begin
            add u to ε-closure(T);
            push u onto stack
        end
end

A simple algorithm to compute ε-closure(T) uses a stack to hold states whose edges have not yet been checked for ε-labeled transitions.
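Both algorithms are short to implement. The Python sketch below (the dictionary encoding of the NFA and the function names are assumptions made for illustration) computes ε-closure with a stack and then runs the subset construction. Applied to the NFA for (a | b)*abb used in the next example, its start state comes out as {0, 1, 2, 4, 7}, i.e. the set A.

```python
from collections import deque

def epsilon_closure(states, nfa):
    """Set of NFA states reachable from 'states' on epsilon-transitions alone."""
    stack, closure = list(states), set(states)
    while stack:
        t = stack.pop()
        for u in nfa.get(t, {}).get("", set()):   # "" denotes an epsilon edge
            if u not in closure:
                closure.add(u)
                stack.append(u)
    return closure

def move(states, symbol, nfa):
    """Set of NFA states reachable on 'symbol' from some state in 'states'."""
    result = set()
    for s in states:
        result |= nfa.get(s, {}).get(symbol, set())
    return result

def subset_construction(nfa, start, accepting, alphabet):
    """Build a DFA (transition dict) whose states are frozensets of NFA states."""
    start_state = frozenset(epsilon_closure({start}, nfa))
    dstates, unmarked, dtran = {start_state}, deque([start_state]), {}
    while unmarked:
        T = unmarked.popleft()
        for a in alphabet:
            U = frozenset(epsilon_closure(move(T, a, nfa), nfa))
            if U and U not in dstates:
                dstates.add(U)
                unmarked.append(U)
            if U:
                dtran[(T, a)] = U
    dfa_accepting = {S for S in dstates if S & accepting}
    return start_state, dtran, dfa_accepting

# NFA for (a|b)*abb from the worked example below (states 0..10, 10 accepting).
nfa = {
    0: {"": {1, 7}}, 1: {"": {2, 4}}, 2: {"a": {3}}, 3: {"": {6}},
    4: {"b": {5}},   5: {"": {6}},    6: {"": {1, 7}},
    7: {"a": {8}},   8: {"b": {9}},   9: {"b": {10}},
}
start, dtran, accepting = subset_construction(nfa, 0, {10}, "ab")
print(sorted(start))                 # [0, 1, 2, 4, 7]  -> the set A
```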


Example: The figure below shows the NFA N accepting the language (a | b)*abb (states 0–10, with state 10 accepting). Its transitions are:

0 →ε 1, 0 →ε 7; 1 →ε 2, 1 →ε 4; 2 →a 3; 4 →b 5; 3 →ε 6, 5 →ε 6; 6 →ε 1, 6 →ε 7; 7 →a 8; 8 →b 9; 9 →b 10.

Sol: apply the subset-construction algorithm as follows:

1) Find the start state of the equivalent DFA, which is ε-closure(0): it consists of the start state of the NFA and all states reachable from state 0 via paths in which every edge is labeled ε.
A = {0, 1, 2, 4, 7}
2) Compute move(A, a), the set of NFA states having transitions on a from members of A. Among the states 0, 1, 2, 4 and 7, only 2 and 7 have such transitions, to 3 and 8, so
move(A, a) = {3, 8}
Compute ε-closure(move(A, a)) = ε-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8}. Let us call this set B.
So the DFA has a transition on a from A to B.


3) Compute move(A, b), the set of NFA states having transitions on b from members of A. Among the states 0, 1, 2, 4 and 7, only 4 has such a transition, to 5, so
move(A, b) = {5}
Compute ε-closure(move(A, b)) = ε-closure({5}) = {1, 2, 4, 5, 6, 7}. Let us call this set C.
So the DFA has a transition on b from A to C.

4) We apply steps 2 and 3 to B and C, and this process continues for every new state of the DFA until all sets that are states of the DFA are marked.

The five different sets of states we actually construct are:


A = {0, 1, 2, 4, 7}
B = {1, 2, 3, 4, 6, 7, 8}
C = {1, 2, 4, 5, 6, 7}
D = {1, 2, 4, 5, 6, 7, 9}
E = {1, 2, 4, 5, 6, 7, 10}

State A is the start state, and state E is the only accepting state.
The complete transition table Dtran is shown in below:

            INPUT SYMBOL
STATE | a | b
A     | B | C
B     | B | D
C     | B | C
D     | B | E
E     | B | C

Transition table Dtran for DFA


A transition graph for the resulting DFA was shown in the original figure: A is the start state, E the accepting state, and the edges are exactly those of the table Dtran above. It should be noted that the DFA also accepts (a | b)*abb.

From a Regular Expression to an NFA


We now give an algorithm to construct an NFA from a regular expression. The algorithm is syntax-directed in that it uses the syntactic structure of the regular expression to guide the construction process.

Algorithm (Thompson's construction):

Input: a regular expression R over alphabet ∑.
Output: NFA N accepting L(R).

1- For ε, construct the NFA with a start state i, an accepting state f, and a single ε-edge from i to f. Clearly this NFA recognizes {ε}.


2- For a in ∑, construct the NFA with a start state i, an accepting state f, and a single edge labeled a from i to f. This machine recognizes {a}.

3- For the regular expression a | b, construct the composite NFA N(a | b): a new start state i has ε-edges to the start states of N(a) and N(b), and their accepting states have ε-edges to a new accepting state f.

4- For the regular expression ab, construct the composite NFA N(ab): the machine N(a) is followed by N(b), with i the start state of N(a) and f the accepting state of N(b).

5- For the regular expression a*, construct the composite NFA N(a*): a new start state i has ε-edges to the start state of N(a) and to a new accepting state f; the accepting state of N(a) has ε-edges back to the start state of N(a) and forward to f.


Example: let us use Thompson's construction to build NFAs for the following regular expressions (the figures are omitted; each NFA is obtained by composing the basic machines of rules 1–5 above):
1) RE = (ab)*
2) RE = (a | b)*a
3) RE = a (bb | a)* b


4) RE = a* (a | b)
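A compact way to implement Thompson's construction is to recurse over a small abstract syntax tree of the regular expression. The Python sketch below is an illustration, not the textbook construction verbatim: it links sub-machines with extra ε-edges (for example in the concatenation case) instead of merging states, which yields a slightly larger but equivalent NFA. The nested-tuple AST encoding and the function names are assumptions for illustration.

```python
import itertools

counter = itertools.count()

def new_state():
    """Return a fresh state number."""
    return next(counter)

def thompson(ast):
    """Return (start, accept, transitions) for a regex AST.
    AST forms: ('sym', 'a'), ('cat', l, r), ('alt', l, r), ('star', e).
    Transitions are (from, label, to) triples; '' is an epsilon label."""
    kind = ast[0]
    i, f = new_state(), new_state()
    trans = []
    if kind == 'sym':
        trans.append((i, ast[1], f))
    elif kind == 'cat':
        s1, f1, t1 = thompson(ast[1])
        s2, f2, t2 = thompson(ast[2])
        trans = t1 + t2 + [(i, '', s1), (f1, '', s2), (f2, '', f)]
    elif kind == 'alt':
        s1, f1, t1 = thompson(ast[1])
        s2, f2, t2 = thompson(ast[2])
        trans = t1 + t2 + [(i, '', s1), (i, '', s2), (f1, '', f), (f2, '', f)]
    elif kind == 'star':
        s1, f1, t1 = thompson(ast[1])
        trans = t1 + [(i, '', s1), (f1, '', s1), (i, '', f), (f1, '', f)]
    return i, f, trans

# (a|b)*abb as an AST:  cat(cat(cat(star(alt(a,b)), a), b), b)
ast = ('cat', ('cat', ('cat', ('star', ('alt', ('sym', 'a'), ('sym', 'b'))),
                       ('sym', 'a')), ('sym', 'b')), ('sym', 'b'))
start, accept, transitions = thompson(ast)
print("start =", start, "accept =", accept, "edges =", len(transitions))
```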

Lexical Errors
What if user omits the space in “Fori”?
No lexical error, single token IDENT (“Fori”) is produced instead of
sequence For, IDENT (“i”).

Typically few lexical error types


1) the illegal chars, for example:
Writ@ln (x);
2) unterminated comments, for example:
{Main program
3) Ill-formed constants

How is a Scanner Programmed?


1) Describe tokens with regular expressions.
2) Draw transition diagrams.
3) Code the diagram as table/program.

Recognition of Tokens:
ws – white space
Lexical Analyzer Generator – LEX:
Lex is used for writing the declarations section and the regular-expression-based translation rules of a lexical analyzer.
Lexical Analysis
It is the first step of compiler design; it takes a stream of characters as input and gives tokens as output (this is also known as tokenization). The tokens can be classified into identifiers, separators, keywords, operators, constants and special characters.

It has three phases:

1. Tokenization: It takes the stream of characters and converts it into tokens.

2. Error messages: It reports errors related to lexical analysis, such as an identifier exceeding the allowed length or an unmatched string.

3. Eliminating comments: It eliminates comments, extra spaces, blank lines, new lines, and indentation.

Lex
Lex is a tool or computer program that generates lexical analyzers (which convert the stream of characters into tokens). The Lex tool itself is a compiler: the Lex compiler takes a specification as input and transforms its patterns into a lexical analyzer. It is commonly used with YACC (Yet Another Compiler Compiler). It was written by Mike Lesk and Eric Schmidt.

Function of Lex
1. In the first step, the source code written in the Lex language (with a file name such as File.l) is given as input to the Lex compiler, commonly known as Lex, which produces lex.yy.c as output.

2. After that, the output lex.yy.c is given as input to the C compiler, which produces an a.out file; finally, the executable a.out takes a stream of characters and generates tokens as output.

lex.yy.c: It is a C program.
File.l: It is a Lex source program
a.out: It is a Lexical analyzer

Lex File Format


A Lex program consists of three parts and is separated by %% delimiters:-

Declarations
%%
Translation rules
%%
Auxiliary procedures

Declarations: The declarations include declarations of variables.

In the declaration section a regular expression can be defined. Following is an example of declaration section.
Each statement has two components: a name and a regular expression that is used to denote the name.

1. delim   [ \t\n]

2. ws      {delim}+

3. letter  [A-Za-z]

4. digit   [0-9]

5. id      {letter}({letter}|{digit})*
Translation rules: These rules consist of a pattern and an action.

This is the second section of the Lex program, after the declarations. The declarations section is separated from the translation-rules section by the "%%" delimiter. Here, each rule consists of two components: a pattern and an action. The pattern is matched against the input; if a pattern matches, the action listed against the pattern is carried out. Thus the Lex tool can be looked upon as a rule-based programming language. The following is the general form of patterns p1, p2, ..., pn and their corresponding actions 1 to n:

p1 {action1} /* p — pattern (regular expression) */
...
pn {actionn}

For example, if the keyword IF is to be returned as a token for a match with the input string "if", then the translation rule is defined as

{if} {return(IF);}

The ";" at the end of return(IF) marks the end of the first statement of the action, and the entire sequence of actions is enclosed in a pair of braces. If an action is written over multiple lines, the continuation character needs to be used. Similarly, the following is an example for an identifier "id", where "id" is already defined in the declarations section:

{id} {yylval = install_id(); return(ID);}

In the above rule, when an identifier is encountered, two actions are taken: first the install_id() function is called and its result is assigned to yylval, and then the token ID is returned.

Auxiliary procedures: The auxiliary section holds auxiliary functions used in the actions.

This section is separated from the translation-rules section by the "%%" delimiter. In this section, the C program's main function can be declared and any other necessary functions defined. In the example from the translation-rules section, install_id() is a procedure that installs the lexeme (whose first character is pointed to by yytext and whose length is given by yyleng) into the symbol table and returns a pointer to the beginning of its entry.

install_id() {

The body of install_id can be written separately or combined with the main function. yytext and yyleng are Lex-provided names that give the text of the matched lexeme and its length.

(write the example program given above.)

Finite Automata:
 Finite Automata(FA) is the simplest machine to recognize patterns.
 It is used to characterize a Regular Language.
 Also it is used to analyze and recognize Natural language Expressions.
 The finite automata or finite state machine is an abstract machine that
has five elements or tuples.
 It has a set of states and rules for moving from one state to another but
it depends upon the applied input symbol.
 Based on the states and the set of rules the input string can be either
accepted or rejected.
 Basically, it is an abstract model of a digital computer which reads an input
string and changes its internal state depending on the current input
symbol.
 Every automaton defines a language i.e. set of strings it accepts.
 The following figure shows some essential features of a general automaton.

The above figure shows the following features of automata:

1. Input
2. Output
3. States of automata
4. State relation
5. Output relation
6. A Finite Automaton consists of the following:

Q : Finite set of states.

∑ : set of input symbols.
q : Initial state.
F : set of final states.
δ (delta) : Transition function.

FA is characterized into two types:

1) Deterministic Finite Automata (DFA):

DFA consists of 5 tuples {Q, ∑, q, F, δ}.

Q : set of all states.
∑ : set of input symbols (the symbols which the machine takes as input).
q : initial state (the starting state of the machine).
F : set of final states.
δ : transition function, defined as δ : Q × ∑ → Q.

In a DFA, for a particular input character, the machine goes to one state only. A transition function is defined on every state for every input symbol. Also, in a DFA a null (or ε) move is not allowed, i.e., a DFA cannot change state without reading an input character.

For example, construct a DFA which accepts the language of all strings ending with 'a'.
Given: ∑ = {a, b}, q = {q0}, F = {q1}, Q = {q0, q1}

First, consider the set of acceptable strings in order to construct an accurate state transition diagram.
L = {a, aa, aaa, aaaa, aaaaa, ba, bba, bbbaa, aba, abba, aaba, abaa}
The above is a simple subset of the acceptable strings; there are many other strings that end with 'a' and contain only the symbols {a, b}.

Strings not accepted are: ab, bb, aab, abbb, etc.

State transition table for the above automaton:

State | a  | b
q0    | q1 | q0
q1    | q1 | q0
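This automaton is small enough to run directly. A minimal Python sketch (the state names and the accepts helper are illustrative) encoding the table above:

```python
# q0 is the start state; q1 (reached whenever the last symbol read was 'a') is accepting.
DELTA = {
    ('q0', 'a'): 'q1', ('q0', 'b'): 'q0',
    ('q1', 'a'): 'q1', ('q1', 'b'): 'q0',
}
ACCEPTING = {'q1'}

def accepts(word):
    """Run the DFA over a word drawn from {a, b}."""
    state = 'q0'
    for ch in word:
        state = DELTA[(state, ch)]
    return state in ACCEPTING

for w in ["a", "ba", "abba", "ab", "bb"]:
    print(w, accepts(w))   # True, True, True, False, False
```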

2) Nondeterministic Finite Automata (NFA): NFA is similar to DFA except for the following additional features:

A null (or ε) move is allowed, i.e., it can move forward without reading a symbol.

It has the ability to transition to any number of states for a particular input.

δ : transition function
δ : Q × (∑ ∪ {ε}) → 2^Q

As you can see, the transition function is defined for any input including the null input ε, and an NFA can go to any number of states. For example, an NFA can also be drawn for the above problem.
1. Both NFA and DFA have the same power, and each NFA can be translated into a DFA.
2. There can be multiple final states in both a DFA and an NFA.
3. The NFA is more of a theoretical concept.
4. The DFA is used in lexical analysis in a compiler.
5. If the number of states in the NFA is N, then its DFA can have at most 2^N states.

Regular Expressions and Finite Automata:


 The language accepted by finite automata can be easily described by simple expressions called Regular
Expressions. It is the most effective way to represent any language.
 The languages accepted by some regular expression are referred to as Regular languages.
 A regular expression can also be described as a sequence of pattern that defines a string.
 Regular expressions are used to match character combinations in strings. String searching algorithm used this
pattern to find the operations on a string.

For instance:

In a regular expression, x* means zero or more occurrences of x. It can generate {ε, x, xx, xxx, xxxx, .....}

In a regular expression, x+ means one or more occurrences of x. It can generate {x, xx, xxx, xxxx, .....}

Operations on Regular Language


The various operations on regular language are:

Union: If L and M are two regular languages then their union L U M is also a regular language.

1. L U M = {s | s is in L or s is in M}

Intersection: If L and M are two regular languages then their intersection L ⋂ M is also a regular language.

1. L ⋂ M = {s | s is in L and s is in M}

Kleene closure: If L is a regular language then its Kleene closure L* is also a regular language.

1. L* = zero or more occurrences of language L.

Example 1:
Write the regular expression for the language accepting all combinations of a's, over the set ∑ = {a}

Solution:

All combinations of a's means a may be zero, single, double and so on. If a is appearing zero times, that
means a null string. That is we expect the set of {ε, a, aa, aaa, ....}. So we give a regular expression for this
as:

1. R = a*

That is Kleen closure of a.

Example 2:
Write the regular expression for the language accepting all combinations of a's except the null string, over the
set ∑ = {a}

Solution:

The regular expression has to be built for the language

1. L = {a, aa, aaa, ....}

This set indicates that there is no null string. So we can denote the regular expression as:

1. R = a+ (that is, one or more occurrences of a, equivalently a.a*)

Example 3:
Write the regular expression for the language accepting all the string containing any number of a's and b's.

Solution:

The regular expression will be:

1. r.e. = (a + b)*

This will give the set as L = {ε, a, aa, b, bb, ab, ba, aba, bab, .....}, any combination of a and b.

The (a + b)* shows any combination with a and b even a null string.
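
As a practical aside (not part of the original notes), the same closure operators appear in the POSIX regular-expression library. The sketch below assumes a POSIX system providing <regex.h>; note that POSIX writes alternation as '|' where the notes write '+':

```c
#include <regex.h>
#include <stdio.h>

/* Returns 1 if 'text' as a whole matches 'pattern', 0 otherwise. */
static int matches(const char *pattern, const char *text) {
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;                                   /* bad pattern */
    int ok = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}

int main(void) {
    /* a* : zero or more a's (Example 1 above) */
    printf("%d\n", matches("^a*$", ""));         /* 1 (the null string) */
    printf("%d\n", matches("^a*$", "aaa"));      /* 1                   */
    /* (a + b)* from Example 3, written (a|b)* in POSIX syntax */
    printf("%d\n", matches("^(a|b)*$", "abba")); /* 1                   */
    printf("%d\n", matches("^(a|b)*$", "abc"));  /* 0 ('c' not allowed) */
    return 0;
}
```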

Role of transition diagrams in construction of


lexical analyser:
Transition diagram is a special kind of flowchart for language analysis. In transition diagram the boxes of
flowchart are drawn as circle and called as states. States are connected by arrows called as edges. The label
or weight on edge indicates the input character that can appear after that state. Transition diagram of
identifier is given below:
This transition diagram for an identifier reads a first letter and then letters or digits, until the next input
character is a delimiter for the identifier, meaning a character that is neither a letter nor a digit. To turn the
transition diagram into a program, we construct a program segment of code for each state of the transition
diagram. The program segment code for each state is given below:

State-0:

C=GETCHAR();
if LETTER(C) then goto State 1
else FAIL()

State-1:

C=GETCHAR();
if LETTER(C) OR DIGIT(C) then goto State 1
else if DELIMITER(C) then goto State 2
else FAIL()

State-2:

RETRACT();
RETURN(ID, INSTALL())

Here, GETCHAR() returns the next character from the input buffer. LETTER(C) is a
procedure which returns true if and only if C is a letter. FAIL() is a routine which retracts the
look-ahead pointer and starts up the next transition diagram, or otherwise calls the error routine. DIGIT(C) is a
procedure which returns true if and only if C is a digit. DELIMITER(C) is a procedure which
returns true if and only if C is a character that could follow the identifier, for example a blank
symbol, an arithmetic or logical operator, a left parenthesis, a right parenthesis, +, :, ; etc. Because the delimiter is
not part of the identifier, we must retract the look-ahead pointer one character; for this purpose we
use the RETRACT() procedure. Because an identifier has a value, we use the INSTALL() procedure to install it in the symbol
table.
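
The program-segment pseudocode above can be written as an ordinary C function. Below is a minimal sketch that follows the same three states; the symbol-table INSTALL() step is omitted, and the function simply returns the recognized lexeme:

```c
#include <ctype.h>
#include <stdio.h>

/* Returns 1 and copies the identifier into 'lexeme' if the input
   begins with letter (letter|digit)*; returns 0 otherwise.
   Mirrors State-0, State-1 and State-2 above.                      */
int recognize_identifier(const char *input, char *lexeme) {
    int i = 0, j = 0;

    /* State 0: the first character must be a letter */
    if (!isalpha((unsigned char)input[i]))
        return 0;                                   /* FAIL() */
    lexeme[j++] = input[i++];

    /* State 1: keep reading letters or digits */
    while (isalpha((unsigned char)input[i]) || isdigit((unsigned char)input[i]))
        lexeme[j++] = input[i++];

    /* State 2: the character that stopped the loop is the delimiter;
       it is NOT part of the identifier (the RETRACT step), so it is
       simply not copied.                                            */
    lexeme[j] = '\0';
    return 1;                                       /* RETURN(ID, ...) */
}

int main(void) {
    char lexeme[64];
    if (recognize_identifier("count1 = 0;", lexeme))
        printf("identifier: %s\n", lexeme);         /* prints "count1" */
    return 0;
}
```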

In compiler design, an identifier is a name given to a variable, function, or other programming language
construct. Identifiers must follow a set of rules and conventions to be recognized and interpreted correctly by
the compiler. One way to represent these rules is through a transition diagram, also known as a finite-state
machine.

The transition diagram for identifiers typically consists of several states, each representing a different stage in
the process of recognizing an identifier. Here is a high-level overview of the states and transitions involved:

1. Start state: This is the initial state of the diagram. It represents the point at which the compiler begins
scanning the input for an identifier.
2. First character state: In this state, the compiler has identified the first character of an identifier. The
next transition will depend on whether this character is a letter or an underscore.
3. Letter state: If the first character is a letter, the compiler moves into this state. The next transition will
depend on whether the next character is a letter, digit, or underscore.
4. Underscore state: If the first character is an underscore, the compiler moves into this state. The next
transition will depend on whether the next character is a letter, digit, or another underscore.
5. Digit state: If the first character is a digit, the compiler cannot recognize it as an identifier and will
move to an error state.
6. Identifier state: If the compiler successfully follows the appropriate transitions, it will eventually
reach an identifier state. This indicates that the sequence of characters scanned so far constitutes a
valid identifier.
7. Error state: If the compiler encounters an unexpected character or sequence of characters, it will move
to an error state. This indicates that the input does not constitute a valid identifier.

The transition diagram for identifiers can be more complex than this, depending on the specific rules and
conventions of the programming language. However, this basic structure provides a good starting point for
understanding how compilers recognize and interpret identifiers.

What is Design of Lexical Analysis in


Compiler Design?
Lexical Analysis can be designed using Transition Diagrams.

Finite Automata (Transition Diagram) − A Directed Graph or flowchart used to recognize token.

The transition Diagram has two parts −

 States − It is represented by circles.

 Edges − States are connected by Edges Arrows.

Example − Draw Transition Diagram for "if" keyword.

To recognize the token ("if"), the lexical analyzer has to read the next character after "f" as well. Depending upon the
next character, it will judge whether it is the "if" keyword or something else.

So, Blank space after "if" determines that "If" is a keyword.

"*" on Final State 3 means Retract, i.e., control will again come to previous state 2. Therefore Blank space is
not a part of the Token ("if").

Transition Diagram for an Identifier − An identifier starts with a letter followed by letters or Digits.
Transition Diagram will be:
For example, In statement int a2; Transition Diagram for identifier a2 will be:

As (;) is not part of Identifier ("a2"), so use "*" for Retract i.e., coming back to state 1 to recognize identifier
("a2").

The Transition Diagram for identifier can be converted to Program Code as −

Coding

State 0: C = Getchar()

If letter (C) then goto state 1 else fail

State1: C = Getchar()

If letter (C) or Digit (C) then goto state 1

else if Delimiter (C) goto state 2

else Fail

State2: Retract ()

return (6, Install ());

In state 2, Retract() will move the look-ahead pointer one character back, i.e., to the end of state 1, and declares that whatever has been
found up to state 1 is a token.

The lexical Analysis will return the token to the Parser, not in the form of an English word but the form of a
pair, i.e., (Integer code, value).
In the case of identifier, the integer code returned to the parser is 6 as shown in the table.

Install () − It will return a pointer to the symbol table, i.e., address of tokens.

The following table shows the integer code and value of various tokens returned by lexical analysis to the
parser.

Integer Codes for different Tokens

Token Integer Code Value


Begin 1 -
End 2 -
If 3 -
Then 4 -
Else 5 -
Identifier 6 Pointer to Symbol Table
Constants 7 Pointer to Symbol Table
< 8 1
<= 8 2
= 8 3
<> 8 4
> 8 5
>= 8 6

These integer values are not fixed. Different Programmers can choose other integer codes and values while
designing the Lexical Analysis.

Suppose, if the identifier is stored at location 236 in the symbol table, then

Similarly, if constant is stored at location 238 then

Integer code = 7

Install () = 238 i.e., Pair will be (7, 238)
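
As a hedged illustration (the type and field names are ours, not from the notes), such (integer code, value) pairs could be represented in C as:

```c
/* Illustrative token representation: the lexer hands the parser a
   pair (integer code, value) rather than the raw lexeme.           */
struct Token {
    int code;   /* e.g. 6 for an identifier, 7 for a constant       */
    int value;  /* symbol-table index, or operator sub-code         */
};

/* Identifier stored at symbol-table location 236: {6, 236}          */
/* Constant stored at symbol-table location 238:   {7, 238}          */
```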

Transition Diagram (Finite Automata) for Tokens −


Creating a C program to read and display a string, and


then designing a lexical analyzer to identify tokens in
it, involves several steps. Here’s a basic
implementation:

1. **Read and Display String**: We'll start with a


simple program to read a string from the user and
display it.
2. **Lexical Analyzer**: We'll extend this program to
tokenize the input string. Tokens typically include
keywords, identifiers, numbers, operators, and special
symbols.
### Step 1: Read and Display String

Here's a simple C program to read a string from the


user and display it:

```c
#include <stdio.h>

int main() {
char str[100];

printf("Enter a string: ");


fgets(str, 100, stdin);

printf("You entered: %s", str);

return 0;
}
```

### Step 2: Design Lexical Analyzer


To build a lexical analyzer, we need to define what
tokens we want to recognize. For simplicity, let's
consider the following tokens:
- **Keywords**: `if`, `else`, `while`, `return`
- **Identifiers**: variable names, function names
- **Numbers**: integer literals
- **Operators**: `+`, `-`, `*`, `/`, `=`
- **Special Symbols**: `;`, `,`, `(`, `)`, `{`, `}`

We'll use a simple state machine approach for


tokenizing the input string.

Here's the complete C program:

```c
#include <stdio.h>
#include <ctype.h>
#include <string.h>

#define MAX_KEYWORDS 4
#define MAX_STRING_LENGTH 100
const char *keywords[MAX_KEYWORDS] = {"if",
"else", "while", "return"};

int isKeyword(const char *str) {


for (int i = 0; i < MAX_KEYWORDS; i++) {
if (strcmp(str, keywords[i]) == 0) {
return 1;
}
}
return 0;
}

void tokenize(char *str) {


char *p = str;
while (*p != '\0') {
// Skip whitespaces
while (isspace(*p)) {
p++;
}
// Stop if the string ended after trailing whitespace; otherwise
// '\0' could be misread as an operator by strchr below
if (*p == '\0') {
break;
}

// Recognize keywords and identifiers


if (isalpha(*p)) {
char buffer[MAX_STRING_LENGTH];
int i = 0;
while (isalpha(*p) || isdigit(*p) || *p == '_') {
buffer[i++] = *p++;
}
buffer[i] = '\0';

if (isKeyword(buffer)) {
printf("Keyword: %s\n", buffer);
} else {
printf("Identifier: %s\n", buffer);
}
}

// Recognize numbers
else if (isdigit(*p)) {
char buffer[MAX_STRING_LENGTH];
int i = 0;
while (isdigit(*p)) {
buffer[i++] = *p++;
}
buffer[i] = '\0';
printf("Number: %s\n", buffer);
}

// Recognize operators
else if (strchr("+-*/=", *p)) {
printf("Operator: %c\n", *p);
p++;
}

// Recognize special symbols


else if (strchr(";,(){}", *p)) {
printf("Special symbol: %c\n", *p);
p++;
}

// Handle unrecognized characters


else {
printf("Unrecognized character: %c\n", *p);
p++;
}
}
}

int main() {
char str[MAX_STRING_LENGTH];

printf("Enter a string: ");


fgets(str, MAX_STRING_LENGTH, stdin);

// Remove newline character from input


size_t len = strlen(str);
if (len > 0 && str[len-1] == '\n') {
str[len-1] = '\0';
}
printf("You entered: %s\n", str);
printf("Tokens:\n");
tokenize(str);
return 0;
}
```
1. **Reading and Displaying String**: The program
uses `fgets` to read a line of text from the user,
ensuring to strip the newline character if present.
2. **Tokenization**:
- **Whitespace Handling**: Skips any whitespace
characters.
- **Keywords and Identifiers**: If the character is
alphabetic, it reads a sequence of alphanumeric
characters (including underscores) and checks if the
resulting string is a keyword or an identifier.
- **Numbers**: Reads sequences of digits as
numbers.
- **Operators and Special Symbols**: Recognizes
single-character operators and special symbols.
- **Unrecognized Characters**: Prints any
unrecognized character.
This simple lexical analyzer can be extended to handle
more complex cases, such as multi-character operators
(`==`, `!=`, `<=`, `>=`), floating-point numbers, string
literals, and comments.
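
For instance, one possible way (a sketch, not the only approach) to recognize two-character operators such as `==` and `<=` is to try the longest match first. The self-contained helper below could replace the single-character operator branch in `tokenize()`:

```c
#include <stdio.h>
#include <string.h>

/* Print the longest operator starting at p and return how many
   characters were consumed (0 if p does not start an operator).   */
int match_operator(const char *p) {
    const char *two[] = { "==", "!=", "<=", ">=" };
    for (int i = 0; i < 4; i++) {               /* two-char first  */
        if (strncmp(p, two[i], 2) == 0) {
            printf("Operator: %s\n", two[i]);
            return 2;
        }
    }
    if (*p != '\0' && strchr("+-*/=<>!", *p)) { /* then one-char   */
        printf("Operator: %c\n", *p);
        return 1;
    }
    return 0;                                   /* not an operator */
}

int main(void) {
    const char *src = "a<=b";
    int n = match_operator(src + 1);  /* prints "Operator: <=" */
    printf("consumed %d chars\n", n); /* consumed 2 chars       */
    return 0;
}
```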

Syntax Analysis
Syntax analysis or parsing is the second phase of a compiler. In this chapter, we shall learn the basic concepts
used in the construction of a parser.

We have seen that a lexical analyzer can identify tokens with the help of regular expressions and pattern
rules. But a lexical analyzer cannot check the syntax of a given sentence due to the limitations of the regular
expressions. Regular expressions cannot check balancing tokens, such as parenthesis. Therefore, this phase
uses context-free grammar (CFG), which is recognized by push-down automata.

CFG, on the other hand, is a superset of Regular Grammar, as depicted below:


It implies that every Regular Grammar is also context-free, but there exist some languages which are beyond
the scope of Regular Grammar. CFG is a helpful tool in describing the syntax of programming languages.

Context-Free Grammar
In this section, we will first see the definition of context-free grammar and introduce terminologies used in
parsing technology.

A context-free grammar has four components:

 A set of non-terminals (V). Non-terminals are syntactic variables that denote sets of strings. The
non-terminals define sets of strings that help define the language generated by the grammar.
 A set of tokens, known as terminal symbols (Σ). Terminals are the basic symbols from which strings
are formed.
 A set of productions (P). The productions of a grammar specify the manner in which the terminals
and non-terminals can be combined to form strings. Each production consists of a non-terminal
called the left side of the production, an arrow, and a sequence of tokens and/or non-terminals, called
the right side of the production.
 One of the non-terminals is designated as the start symbol (S); from where the production begins.

The strings are derived from the start symbol by repeatedly replacing a non-terminal (initially the start
symbol) by the right side of a production, for that non-terminal.

Example
We take the problem of palindrome language, which cannot be described by means of Regular Expression.
That is, L = { w | w = wR } is not a regular language. But it can be described by means of CFG, as illustrated
below:

G = ( V, Σ, P, S )

Where:

V = { Q, Z, N }
Σ = { 0, 1 }
P = { Q → Z | Q → N | Q → 0 | Q → 1 | Q → ℇ | Z → 0Q0 | N → 1Q1 }
S = { Q }

This grammar describes palindrome language, such as: 1001, 11100111, 00100, 1010101, 11111, etc.
Syntax Analyzers
A syntax analyzer or parser takes the input from a lexical analyzer in the form of token streams. The parser
analyzes the source code (token stream) against the production rules to detect any errors in the code. The
output of this phase is a parse tree.

This way, the parser accomplishes two tasks, i.e., parsing the code, looking for errors and generating a parse
tree as the output of the phase.

Parsers are expected to parse the whole code even if some errors exist in the program. Parsers use error
recovering strategies, which we will learn later in this chapter.

Derivation
A derivation is basically a sequence of production rules, in order to get the input string. During parsing, we
take two decisions for some sentential form of input:

 Deciding the non-terminal which is to be replaced.


 Deciding the production rule, by which, the non-terminal will be replaced.

To decide which non-terminal to be replaced with production rule, we can have two options.

Left-most Derivation
If the sentential form of an input is scanned and replaced from left to right, it is called left-most derivation.
The sentential form derived by the left-most derivation is called the left-sentential form.

Right-most Derivation
If we scan and replace the input with production rules, from right to left, it is known as right-most derivation.
The sentential form derived from the right-most derivation is called the right-sentential form.

Example

Production rules:

E → E + E
E → E * E
E → id

Input string: id + id * id

The left-most derivation is:


E → E * E
E → E + E * E
E → id + E * E
E → id + id * E
E → id + id * id

Notice that the left-most side non-terminal is always processed first.

The right-most derivation is:

E → E + E
E → E + E * E
E → E + E * id
E → E + id * id
E → id + id * id

Parse Tree
A parse tree is a graphical depiction of a derivation. It is convenient to see how strings are derived from the
start symbol. The start symbol of the derivation becomes the root of the parse tree. Let us see this by an
example from the last topic.

We take the left-most derivation of id + id * id

The left-most derivation is:

E → E * E
E → E + E * E
E → id + E * E
E → id + id * E
E → id + id * id

Step 1:

E→E*E

Step 2:

E→E+E*E

Step 3:
E → id + E * E

Step 4:

E → id + id * E

Step 5:

E → id + id * id

In a parse tree:

 All leaf nodes are terminals.


 All interior nodes are non-terminals.
 In-order traversal gives original input string.

A parse tree depicts associativity and precedence of operators. The deepest sub-tree is traversed first,
therefore the operator in that sub-tree gets precedence over the operator which is in the parent nodes.

Ambiguity
A grammar G is said to be ambiguous if it has more than one parse tree (left or right derivation) for at least
one string.

Example

E → E + E
E → E – E
E → id

For the string id + id – id, the above grammar generates two parse trees:

A language is said to be inherently ambiguous if every grammar that generates it is ambiguous. Ambiguity in

a grammar is not good for compiler construction. No method can detect and remove ambiguity automatically,
but it can be removed by either re-writing the whole grammar without ambiguity, or by setting and following
associativity and precedence constraints.

Associativity
If an operand has operators on both sides, the side on which the operator takes this operand is decided by the
associativity of those operators. If the operation is left-associative, then the operand will be taken by the left
operator or if the operation is right-associative, the right operator will take the operand.

Example

Operations such as Addition, Multiplication, Subtraction, and Division are left associative. If the expression
contains:

id op id op id

it will be evaluated as:

(id op id) op id

For example, (id + id) + id

Operations like Exponentiation are right associative, i.e., the order of evaluation in the same expression will
be:

id op (id op id)

For example, id ^ (id ^ id)


Precedence
If two different operators share a common operand, the precedence of operators decides which will take the
operand. That is, 2+3*4 can have two different parse trees, one corresponding to (2+3)*4 and another
corresponding to 2+(3*4). By setting precedence among operators, this problem can be easily removed. As in
the previous example, mathematically * (multiplication) has precedence over + (addition), so the expression
2+3*4 will always be interpreted as:

2 + (3 * 4)

These methods decrease the chances of ambiguity in a language or its grammar.

Left Recursion
A grammar becomes left-recursive if it has any non-terminal ‘A’ whose derivation contains ‘A’ itself as the
left-most symbol. Left-recursive grammar is considered to be a problematic situation for top-down parsers.
Top-down parsers start parsing from the Start symbol, which in itself is non-terminal. So, when the parser
encounters the same non-terminal in its derivation, it becomes hard for it to judge when to stop parsing the
left non-terminal and it goes into an infinite loop.

Example:

(1) A => Aα | β

(2) S => Aα | β
A => Sd

(1) is an example of immediate left recursion, where A is a non-terminal symbol and α, β represent strings
of grammar symbols (with β not beginning with A).

(2) is an example of indirect-left recursion.

A top-down parser will first parse the A, which in-turn will yield a string consisting of A itself and the parser
may go into a loop forever.

Removal of Left Recursion


One way to remove left recursion is to use the following technique:

The production

A => Aα | β
is converted into following productions

A => βA'
A'=> αA' | ε

This does not impact the strings derived from the grammar, but it removes immediate left recursion.

Second method is to use the following algorithm, which should eliminate all direct and indirect left
recursions.

START

Arrange non-terminals in some order like A1, A2, A3,…, An

for each i from 1 to n


{
for each j from 1 to i-1
{
replace each production of form Ai ⟹ Aj𝜸
with Ai ⟹ δ1𝜸 | δ2𝜸 | δ3𝜸 |…| δn𝜸
where Aj ⟹ δ1 | δ2 |…| δn are the current Aj productions
}
}
eliminate immediate left-recursion

END

Example

The production set

S => Aα | β
A => Sd

after applying the above algorithm, should become

S => Aα | β
A => Aαd | βd

and then, remove immediate left recursion using the first technique.

A => βdA'
A' => αdA' | ε

Now none of the production has either direct or indirect left recursion.
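
To see why this matters for a top-down parser, consider the classic expression grammar E → E + T | T (left-recursive) rewritten as E → T E', E' → + T E' | ε. A recursive-descent procedure for the original E would call itself immediately and never terminate, while the rewritten form below does terminate. The following is an illustrative C sketch (the grammar and helper names are assumptions, not taken from the notes); T is simplified to a single identifier token written as 'i':

```c
#include <stdio.h>

static const char *p;        /* cursor over the input string        */

static int T(void);          /* T  -> id (the single character 'i') */
static int Eprime(void);     /* E' -> + T E' | epsilon              */

/* E -> T E'   (left recursion already eliminated)                  */
static int E(void) {
    return T() && Eprime();
}

static int Eprime(void) {
    if (*p == '+') {         /* choose E' -> + T E'                 */
        p++;
        return T() && Eprime();
    }
    return 1;                /* choose E' -> epsilon                */
}

static int T(void) {
    if (*p == 'i') {         /* 'i' stands for the token id         */
        p++;
        return 1;
    }
    return 0;
}

int main(void) {
    p = "i+i+i";
    /* Accept only if E succeeds and the whole input was consumed. */
    printf("%s\n", (E() && *p == '\0') ? "accepted" : "rejected");
    return 0;
}
```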

Left Factoring
If more than one grammar production rules has a common prefix string, then the top-down parser cannot
make a choice as to which of the production it should take to parse the string in hand.

Example

If a top-down parser encounters a production like

A ⟹ αβ | α𝜸 | …

Then it cannot determine which production to follow to parse the string, as both alternatives begin with the
same prefix α. To remove this confusion, we use a technique called left factoring.
Left factoring transforms the grammar to make it useful for top-down parsers. In this technique, we make one
production for each common prefix, and the rest of the derivation is added by new productions.

Example

The above productions can be written as

A => αA'
A'=> β | 𝜸 | …

Now the parser has only one production per prefix which makes it easier to take decisions.
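
A concrete illustration (a common textbook example, not taken from the notes above) is the dangling-else grammar:

S => iEtS | iEtSeS | a
E => b

Both S-alternatives begin with the common prefix iEtS, so after left factoring the grammar becomes:

S => iEtSS' | a
S' => eS | ε
E => b

A predictive parser can now read the prefix iEtS first and postpone the choice between the two original alternatives until it sees whether an 'e' (else) follows.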

First and Follow Sets


An important part of parser table construction is to create the FIRST and FOLLOW sets. These sets indicate which
terminal symbols can appear at particular positions in a derivation. They are used to build the parsing table, where
the decision to fill an entry T[A, t] with a production rule α is based on these sets.

First Set
This set is created to know what terminal symbol is derived in the first position by a non-terminal. For
example,

α → t β

That is α derives t (terminal) in the very first position. So, t ∈ FIRST(α).

Algorithm for calculating First set


Look at the definition of FIRST(α) set:

 if α is a terminal, then FIRST(α) = { α }.


 if α is a non-terminal and α → ℇ is a production, then ℇ is in FIRST(α).
 if α is a non-terminal with a production α → 𝜸1 𝜸2 𝜸3 … 𝜸n, then t is in FIRST(α) if t is in FIRST(𝜸1); if 𝜸1 can derive ℇ, the symbols of FIRST(𝜸2) are also included, and so on.

First set can be seen as: FIRST(α) = { t | α ⇒* t𝜷 }, the set of terminals that can begin a string derived from α.

Follow Set
Likewise, we calculate what terminal symbol immediately follows a non-terminal α in production rules. We
do not consider what the non-terminal can generate but instead, we see what would be the next terminal
symbol that follows the productions of a non-terminal.

Algorithm for calculating Follow set:


 if α is the start symbol, then $ is in FOLLOW(α).
 if there is a production α → AB, then everything in FIRST(B) except ℇ is in FOLLOW(A).
 if there is a production α → AB, where B can derive ℇ (or B is empty), then everything in FOLLOW(α) is in
FOLLOW(A).

Follow set can be seen as: FOLLOW(A) = { t | S ⇒* 𝜷At𝜸 }, the set of terminals that can appear immediately to the right of A in some sentential form.
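
As a concrete illustration (a standard textbook example, not taken from the notes above), consider the non-left-recursive expression grammar:

E → T E'
E' → + T E' | ℇ
T → F T'
T' → * F T' | ℇ
F → ( E ) | id

Applying the rules above gives:

FIRST(E) = FIRST(T) = FIRST(F) = { (, id }
FIRST(E') = { +, ℇ }
FIRST(T') = { *, ℇ }
FOLLOW(E) = FOLLOW(E') = { ), $ }
FOLLOW(T) = FOLLOW(T') = { +, ), $ }
FOLLOW(F) = { *, +, ), $ }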


Limitations of Syntax Analyzers
Syntax analyzers receive their inputs, in the form of tokens, from lexical analyzers. Lexical analyzers are
responsible for the validity of a token supplied to the syntax analyzer. Syntax analyzers have the following
drawbacks -

 it cannot determine if a token is valid,


 it cannot determine if a token is declared before it is being used,
 it cannot determine if a token is initialized before it is being used,
 it cannot determine if an operation performed on a token type is valid or not.

These tasks are accomplished by the semantic analyzer, which we shall study in Semantic Analysis.

Role of parser:
 A parser in a compiler checks the syntax of the source code and builds a data structure called a parse
tree.

In more detail, a parser is a crucial component of a compiler, which is a program that translates source
code written in a programming language into machine code that a computer can understand and
execute.
 The parser's role is to ensure that the source code is syntactically correct, meaning it adheres to the
rules and structure of the language in which it is written.
 If the source code does not follow these rules, the parser will generate an error message, and the
compilation process will stop.

The parser operates after the lexical analysis phase of the compiler, which breaks down the source
code into individual words or tokens.
 The parser takes these tokens and checks them against the grammar of the language.
 This grammar defines how tokens can be combined to form valid statements and expressions.
 If the tokens follow the grammar rules, the parser will construct a parse tree, a hierarchical data
structure that represents the syntactic structure of the source code.

The parse tree is then used in the next stages of the compilation process, such as semantic analysis
and code generation.
 The semantic analysis phase checks that the source code makes sense in the context of the language's
semantics, while the code generation phase translates the parse tree into machine code.
 1. It verifies the structure generated by the tokens based on the grammar.
 2. It constructs the parse tree.
 3. It reports the errors.
 4. It performs error recovery.

performs context-free syntax analysis

• guides context-sensitive analysis

• constructs an intermediate representation


• produces meaningful error messages

• attempts error correction

Context-Free Grammars:

Context-Free Grammar (CFG) is a type of formal grammar used to describe the syntax or structure of a formal
language. The grammar is a 4-tuple: (V, T, P, S).

V - It is the collection of variables or nonterminal symbols.


T - It is a set of terminals.
P - It is the production rules that consist of both terminals and nonterminals.
S - It is the Starting symbol.

A grammar is said to be a Context-free grammar if every production is of the form:

A -> (V∪T)*, where A ∊ V

 The left-hand side of a production can only be a single Variable (non-terminal); it cannot be a terminal.

 The right-hand side can be any combination of Variables and Terminals (or the empty string).

The above form states that a grammar in which every production has this shape is a context-free grammar.

For example, consider the grammar G = (V, T, P, S) with V = {S} and T = {a, b}, having the productions:

 Here S is the starting symbol (and the only variable).

 {a, b} are the terminals, generally represented by lowercase characters.

 P is the set of productions given below.

S -> aS
S -> bSa

but

a -> bSa, or
a -> ba is not a CFG, as on the left-hand side there is a terminal, which does not follow
the CFG rule.

In the computer science field, context-free grammars are frequently used, especially in the areas of formal
language theory, compiler development, and natural language processing. It is also used for explaining the
syntax of programming languages and other formal languages.

Limitations of Context-Free Grammar


Apart from the many uses and the importance of Context-Free Grammar in compiler design and computer
science, there are some limitations. CFGs have limited expressive power: neither natural languages such as
English nor all the constraints of programming languages can be expressed with a context-free grammar. A
context-free grammar can be ambiguous, meaning that multiple parse trees can be generated for the same input.
For some grammars, parsing can be inefficient because of exponential time complexity. Finally, error reporting
based on CFGs alone is less precise, since it cannot by itself give detailed error messages and information.

Derivations:

Derivation is used to find whether the string belongs to a given grammar.

Types

• Leftmost derivation.

• Rightmost derivation.

Leftmost Derivation
In leftmost derivation, at each and every step the leftmost non-terminal is expanded by substituting its
corresponding production to derive a string.
Example

Rightmost Derivation
In rightmost derivation, at each and every step the rightmost non-terminal is expanded by substituting its
corresponding production to derive a string.

Example

Parse Trees:

 Parse : It means to resolve (a sentence) into its component parts and describe their syntactic roles or
simply it is an act of parsing a string or a text.
 Tree: A tree may be a widely used abstract data type that simulates a hierarchical tree structure, with
a root value and sub-trees of youngsters with a parent node, represented as a group of linked nodes.

Parse Tree:

 Parse tree is the hierarchical representation of terminals or non-terminals.


 These symbols (terminals or non-terminals) represent the derivation of the grammar to yield input
strings.
 In parsing, the string is derived starting from the start symbol.
 The starting symbol of the grammar must be used as the root of the Parse Tree.
 Leaves of parse tree represent terminals.
 Each interior node represents productions of a grammar.

Rules to Draw a Parse Tree:


1. All leaf nodes need to be terminals.
2. All interior nodes need to be non-terminals.
3. In-order traversal gives the original input string.

Example 1: Let us take an example of Grammar (Production Rules).

S -> sAB
A -> a
B -> b

The input string is “sab”, then the Parse Tree is:

Example-2: Let us take another example of Grammar (Production Rules).

S -> AB
A -> c/aA
B -> d/bB

The input string is “acbd”, then the Parse Tree is as follows:

Uses of Parse Tree:

 It helps in making syntax analysis by reflecting the syntax of the input language.
 It uses an in-memory representation of the input with a structure that conforms to the grammar.
 The advantage of using parse trees rather than immediate semantic actions: multiple passes can be made over
the tree without having to re-parse the input.

Ambiguity:

A grammar is said to be ambiguous if there exists more than one leftmost derivation or more than one
rightmost derivation or more than one parse tree for the given input string. If the grammar is not ambiguous,
then it is called unambiguous.

If the grammar has ambiguity, then it is not good for compiler construction. No method can automatically
detect and remove the ambiguity, but we can remove ambiguity by re-writing the whole grammar without
ambiguity.

Example 1:
Let us consider a grammar G with the production rule

1. E→I
2. E→E+E
3. E→E*E
4. E → (E)
5. I → ε | 0 | 1 | 2 | ... | 9

Solution:

For the string "3 * 2 + 5", the above grammar can generate two parse trees by leftmost derivation:

Since there are two parse trees for a single string "3 * 2 + 5", the grammar G is ambiguous.

Example 2:
Check whether the given grammar G is ambiguous or not.

1. E → E + E
2. E → E - E
3. E → id

Solution:
From the above grammar String "id + id - id" can be derived in 2 ways:

First Leftmost derivation

1. E → E + E
2. → id + E
3. → id + E - E
4. → id + id - E
5. → id + id- id

Second Leftmost derivation

1. E → E - E
2. →E+E-E
3. → id + E - E
4. → id + id - E
5. → id + id - id

Since there are two leftmost derivation for a single string "id + id - id", the grammar G is ambiguous.

Example 3:
Check whether the given grammar G is ambiguous or not.

1. S → aSb | SS
2. S → ε

Solution:

For the string "aabb" the above grammar can generate two parse trees

Since there are two parse trees for a single string "aabb", the grammar G is ambiguous.

Example 4:
Check whether the given grammar G is ambiguous or not.

1. A → AA
2. A → (A)
3. A → a

Solution:
For the string "a(a)aa" the above grammar can generate two parse trees:

Since there are two parse trees for a single string "a(a)aa", the grammar G is ambiguous.

Left Recursion:
A Grammar G (V, T, P, S) is left recursive if it has a production in the form.

A → A α |β.

The above Grammar is left recursive because the left of production is occurring at a first position on the right
side of production. It can eliminate left recursion by replacing a pair of production with

A → βA′

A′ → αA′|ϵ

Elimination of Left Recursion

Left Recursion can be eliminated by introducing a new non-terminal A′, as in the productions shown above.

This type of recursion is also called Immediate Left Recursion.

(see examples from notes)

Left Factoring:
Data structures used in lexical analysis:
Lexical analysis, also known as scanning, is the first phase of a compiler. It processes the
input source code to produce a sequence of tokens. To accomplish this task efficiently,
several data structures are commonly employed. Here are the primary data structures used
in lexical analysis:

### 1. **Finite State Machines (FSMs)**

Finite State Machines, particularly Deterministic Finite Automata (DFA) and Non-
Deterministic Finite Automata (NFA), are foundational in lexical analysis. These automata
are used to recognize patterns in the input string.

- **NFA (Non-Deterministic Finite Automaton):**

- Used initially to describe lexical patterns because NFAs are more flexible and easier to
construct from regular expressions.

- Consists of states, transitions between states, an initial state, and one or more accepting
states.

- **DFA (Deterministic Finite Automaton):**

- NFAs are often converted into DFAs for practical implementation, as DFAs do not have
ambiguities and are more efficient for scanning input strings.

- In a DFA, for each state and input symbol, there is exactly one transition to a next state.
### 2. **Symbol Table**

The symbol table is a data structure used to store information about identifiers (e.g.,
variable names, function names) encountered in the source code.

- **Hash Tables:** Frequently used due to their average O(1) time complexity for insertions,
deletions, and lookups.

- **Binary Search Trees (BSTs):** Sometimes used, especially when the order of identifiers
needs to be preserved or when the table requires frequent range queries.
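
A minimal C sketch of a chained hash-table symbol table (illustrative only; a real compiler stores much more information per entry):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 211                    /* a small prime bucket count */

struct Entry {
    char *name;                           /* identifier lexeme          */
    struct Entry *next;                   /* chain for collisions       */
};

static struct Entry *table[TABLE_SIZE];

/* Simple multiplicative string hash. */
static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* Returns the existing entry for 'name', or NULL. O(1) on average. */
struct Entry *lookup(const char *name) {
    for (struct Entry *e = table[hash(name)]; e; e = e->next)
        if (strcmp(e->name, name) == 0) return e;
    return NULL;
}

/* Inserts 'name' if it is not already present and returns its entry. */
struct Entry *insert(const char *name) {
    struct Entry *e = lookup(name);
    if (e) return e;
    e = malloc(sizeof *e);
    e->name = malloc(strlen(name) + 1);
    strcpy(e->name, name);
    unsigned h = hash(name);
    e->next = table[h];
    table[h] = e;
    return e;
}

int main(void) {
    insert("count");
    insert("main");
    printf("%s\n", lookup("count") ? "found" : "missing");  /* found   */
    printf("%s\n", lookup("total") ? "found" : "missing");  /* missing */
    return 0;
}
```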

### 3. **Buffer**

Buffers are used to manage input streams efficiently. Two common buffering techniques
are:

- **Single Buffering:** Simple but can be inefficient due to frequent I/O operations.

- **Double Buffering:** Uses two buffers to reduce I/O operations. While one buffer is being
processed, the other is being filled with input data.

### 4. **Lexeme Table**

A lexeme table stores the lexemes (actual character sequences) identified during scanning.
This table is useful for quickly retrieving the lexemes corresponding to tokens.

### 5. **Trie (Prefix Tree)**

Tries are used for efficient storage and retrieval of keywords, especially when handling
reserved words in programming languages. They allow for quick prefix-based searches.
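
A small C sketch of such a trie over lowercase letters, used to test whether a scanned lexeme is a reserved word (the keyword list is an illustrative assumption):

```c
#include <stdio.h>
#include <stdlib.h>

struct TrieNode {
    struct TrieNode *child[26];      /* one branch per lowercase letter */
    int is_keyword;                  /* 1 if a keyword ends here        */
};

static struct TrieNode *new_node(void) {
    return calloc(1, sizeof(struct TrieNode));
}

static void trie_insert(struct TrieNode *root, const char *word) {
    for (; *word; word++) {
        int i = *word - 'a';
        if (!root->child[i]) root->child[i] = new_node();
        root = root->child[i];
    }
    root->is_keyword = 1;
}

static int trie_contains(const struct TrieNode *root, const char *word) {
    for (; *word; word++) {
        int i = *word - 'a';
        if (!root->child[i]) return 0;
        root = root->child[i];
    }
    return root->is_keyword;
}

int main(void) {
    struct TrieNode *root = new_node();
    const char *keywords[] = { "if", "else", "while", "return", "int" };
    for (int k = 0; k < 5; k++) trie_insert(root, keywords[k]);

    printf("%d\n", trie_contains(root, "while"));   /* 1: keyword    */
    printf("%d\n", trie_contains(root, "whilst"));  /* 0: identifier */
    return 0;
}
```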

### 6. **Transition Table**

A transition table represents the state transitions in a DFA or NFA. This table is often
implemented as a two-dimensional array where the rows represent states and the columns
represent input symbols.

### 7. **Character Classes**

Character classes group sets of characters (e.g., digits, letters) into categories. This simplifies
the state machine and makes pattern matching more efficient.

### 8. **Stack**

A stack can be used to handle nested structures, such as nested comments or parentheses
in source code. It helps manage the scope and context while scanning the input.

### Practical Example in Lexical Analysis


When a lexical analyzer (lexer) starts processing the source code, it typically performs the
following steps:

1. **Reading Input:** Uses a buffer to read the source code.

2. **Pattern Matching:** Uses FSMs (DFA/NFA) to match input strings against patterns
defined by the language's grammar.

3. **Token Generation:** Produces tokens and, for identifiers, interacts with the symbol
table to store and retrieve information.

4. **Handling Reserved Words:** Uses a trie or hash table to quickly identify reserved
words.

5. **Managing Context:** Uses a stack to manage nested constructs if necessary.

By leveraging these data structures, a lexical analyzer can efficiently process source code
and produce a meaningful sequence of tokens for further stages of compilation.
A literal table is a data structure that is used to keep track of literal values in the program. It holds the constants and
strings used in the program; each appears only once in the literal table, and its contents apply to the whole program,
which is why deletions are not necessary. The literal table allows the reuse of constants and strings, which plays an
important role in reducing the program size.
A parse tree is the hierarchical representation of symbols. The symbols include terminal or non-terminal. In
the parse tree the string is derived from the starting symbol and the starting symbol is mainly the root of the
parse tree. All the leaf nodes are symbols and the inner nodes are the operators or non-terminals. To get the
output we can use Inorder Traversal.

For example:- Parse tree for a+b*c.


And there is intermediate code which also needs data structures to store the data.

Nfa to dfa:
An NFA can have zero, one or more than one move from a given state on a given input symbol. An NFA can
also have NULL moves (moves without input symbol). On the other hand, DFA has one and only one move
from a given state on a given input symbol.

Steps for converting NFA to DFA:


Step 1: Convert the given NFA to its equivalent transition table
To convert the NFA to its equivalent transition table, we need to list all the states, input symbols, and the
transition rules. The transition rules are represented in the form of a matrix, where the rows represent the
current state, the columns represent the input symbol, and the cells represent the next state.

Step 2: Create the DFA’s start state


The DFA’s start state is the set of all possible starting states in the NFA. This set is called the “epsilon
closure” of the NFA’s start state. The epsilon closure is the set of all states that can be reached from the start
state by following epsilon (ε) transitions.

Step 3: Create the DFA’s transition table


The DFA’s transition table is similar to the NFA’s transition table, but instead of individual states, the rows
and columns represent sets of states. For each input symbol, the corresponding cell in the transition table
contains the epsilon closure of the set of states obtained by following the transition rules in the NFA’s
transition table.

Step 4: Create the DFA’s final states


The DFA’s final states are the sets of states that contain at least one final state from the NFA.

Step 5: Simplify the DFA


The DFA obtained in the previous steps may contain unnecessary states and transitions. To simplify the
DFA, we can use the following techniques:

 Remove unreachable states: States that cannot be reached from the start state can be removed from the DFA.
 Remove dead states: States that cannot lead to a final state can be removed from the DFA.
 Merge equivalent states: States that have the same transition rules for all input symbols can be merged into a
single state.

Step 6: Repeat steps 3-5 until no further simplification is possible


After simplifying the DFA, we repeat steps 3-5 until no further simplification is possible. The final DFA
obtained is the minimized DFA equivalent to the given NFA.

Example: Consider the following NFA shown in Figure 1.

(refer notes)
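To make the subset idea concrete, here is a small C sketch that simulates an NFA by carrying the set of currently possible states as a bit mask; each distinct bit mask corresponds to one state of the equivalent DFA. The NFA used is an illustrative two-state machine for "strings over {a, b} ending in a", not the one in Figure 1, and it has no ε-moves, so the ε-closure step is trivial:

```c
#include <stdio.h>
#include <string.h>

/* NFA with 2 states: bit 0 = q0 (start), bit 1 = q1 (accepting).
   delta[state][symbol] is the SET (bit mask) of next states.
   Symbol 0 = 'a', symbol 1 = 'b'.                                  */
static const unsigned delta[2][2] = {
    /* q0 */ { 0x3, 0x1 },   /* on 'a' -> {q0,q1}, on 'b' -> {q0}   */
    /* q1 */ { 0x0, 0x0 }    /* q1 has no outgoing transitions      */
};

/* One step of the subset construction: from a set of NFA states,
   compute the set of states reachable on the given input symbol.   */
static unsigned move(unsigned states, int symbol) {
    unsigned next = 0;
    for (int q = 0; q < 2; q++)
        if (states & (1u << q))
            next |= delta[q][symbol];
    return next;
}

static int accepts(const char *w) {
    unsigned states = 0x1;                    /* start set = {q0}    */
    for (size_t i = 0; i < strlen(w); i++)
        states = move(states, w[i] == 'a' ? 0 : 1);
    return (states & 0x2) != 0;               /* does it contain q1? */
}

int main(void) {
    printf("%d\n", accepts("abba"));   /* 1: ends in 'a' */
    printf("%d\n", accepts("ab"));     /* 0              */
    return 0;
}
```

Building the full DFA amounts to tabulating move() for every reachable bit mask instead of computing it on the fly.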
Parsing techniques:
Parsing is known as Syntax Analysis.

It involves arranging the tokens of the source code into grammatical phrases that are used by the compiler to synthesize
output; generally, the grammatical phrases of the source code are defined by a parse tree.

There are various types of parsing techniques which are as follows –


The process of transforming the data from one format to another is called Parsing. This process is accomplished by
the parser. The parser is a component of the translator that helps to organise the linear text structure according to a set of
defined rules, which is known as the grammar.

There are two types of Parsing:

 The Top-down Parsing


 The Bottom-up Parsing

Top-down Parsing: When the parser builds the parse tree from the top (the start symbol) and expands the left-most
non-terminal first, producing a left-most derivation of the input, it is called top-down parsing. Top-down parsing
initiates with the start symbol and ends on the terminals. Such parsing is also known as predictive parsing.

o
 Recursive Descent Parsing: Recursive descent parsing is a type of top-down parsing
technique. This technique uses one procedure for each non-terminal of the grammar.
It reads the input from left to right and constructs the parse tree from the top
downwards. As the technique works recursively, it is called recursive descent parsing.
 Back-tracking: This parsing technique starts from the root node with one production
rule; if the derivation fails, it restarts the process with a different rule.

 Bottom-up Parsing: Bottom-up parsing works just the reverse of top-down parsing. It starts from the
input symbols and traces a rightmost derivation in reverse until it reaches the start symbol.


o
 Shift-Reduce Parsing: Shift-reduce parsing works with two steps: the Shift step and the Reduce
step (see the worked trace after this list).
 Shift step: The shift step advances the input pointer to the next input
symbol; the symbol read is shifted onto the stack.
 Reduce Step: When the stack holds the complete right-hand side of a
grammar rule, the parser replaces it with the non-terminal on its left-hand side.
 LR Parsing: The LR parser is one of the most efficient syntax analysis techniques as it
works with context-free grammars. In LR parsing, L stands for left-to-right scanning of
the input, and R stands for constructing a rightmost derivation in reverse.
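
As an illustration of the shift and reduce steps (an assumed example, not from the notes), consider the grammar E → E + E | id and the input id + id:

Stack          Input          Action
$              id + id $      shift id
$ id           + id $         reduce by E → id
$ E            + id $         shift +
$ E +          id $           shift id
$ E + id       $              reduce by E → id
$ E + E        $              reduce by E → E + E
$ E            $              accept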
Why is parsing useful in compiler designing?
In the world of software, every different entity has its own criteria for the data to be processed. So parsing is the
process that transforms the data in such a way that it can be understood by the specific software.

The Technologies Use Parsers:

 The programming languages like Java.


 The database languages like SQL.
 The Scripting languages.
 The protocols like HTTP.
 The XML and HTML.

Types of grammar used for parsing:


Syntax tree vs parse tree
When you create a parse tree, it contains more details than are actually needed. So it is very difficult for the compiler to
process the parse tree. Take the following parse tree as an example:
 In the parse tree, most of the leaf nodes are single child to their parent nodes.
 In the syntax tree, we can eliminate this extra information.
 Syntax tree is a variant of parse tree. In the syntax tree, interior nodes are operators and leaves are
operands.
 A syntax tree is usually used to represent a program in a tree structure.

A sentence id + id * id would have the following syntax tree:

Abstract syntax tree can be represented as:

Abstract syntax trees are important data structures in a


compiler. They contain only the essential information and omit unnecessary detail. Abstract syntax trees are more compact than a parse tree and
can be easily used by a compiler.
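
A minimal sketch (illustrative field names, not from the notes) of how such a syntax-tree node could be declared, and how the tree for id + id * id would be built in C:

```c
#include <stdio.h>
#include <stdlib.h>

/* Interior nodes hold an operator; leaves hold an operand name. */
struct Node {
    char op;                 /* '+', '*', or 0 for a leaf         */
    const char *id;          /* operand name when op == 0         */
    struct Node *left, *right;
};

static struct Node *leaf(const char *id) {
    struct Node *n = calloc(1, sizeof *n);
    n->id = id;
    return n;
}

static struct Node *node(char op, struct Node *l, struct Node *r) {
    struct Node *n = calloc(1, sizeof *n);
    n->op = op;
    n->left = l;
    n->right = r;
    return n;
}

int main(void) {
    /* id + id * id, with '*' binding tighter than '+'            */
    struct Node *root = node('+', leaf("a"), node('*', leaf("b"), leaf("c")));
    printf("root operator: %c\n", root->op);   /* prints '+'      */
    return 0;
}
```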
 Phase: A phase is a distinguishable stage in the compiler that takes input from the previous stage,
processes it, and yields output that can be used as input for the next stage. Phases are the steps in the
compilation process, and each phase takes input from the previous stage. Examples of phases include
lexical analysis, syntax analysis, and code generation.
 Pass: A pass refers to the traversal of a compiler through the entire program. It is the total number of
times the compiler goes through the entire program during the compilation process. Each pass takes
the result of the previous pass as input and creates intermediate outputs, allowing the code to improve
in each pass. The final code is generated after the final pass.

In the context of compiler design, the terms "phases of a compiler" and "passes
of a compiler" refer to different aspects of the compilation process. Here's a
detailed comparison:

### Phases of a Compiler


The compilation process is typically divided into several distinct phases, each
responsible for a specific aspect of transforming source code into executable
code. These phases include:

1. **Lexical Analysis (Scanning):**

- **Function:** Converts the sequence of characters in the source code into a


sequence of tokens.

- **Output:** Tokens (e.g., identifiers, keywords, operators).

2. **Syntax Analysis (Parsing):**

- **Function:** Analyzes the token sequence to determine its grammatical


structure according to the language's grammar.

- **Output:** Parse tree or abstract syntax tree (AST).

3. **Semantic Analysis:**

- **Function:** Checks for semantic errors and ensures the meaning of the
syntax is consistent with the language's rules.

- **Output:** Annotated syntax tree (AST with type information and other
semantic annotations).

4. **Intermediate Code Generation:**

- **Function:** Translates the syntax tree into an intermediate


representation (IR) that is easier to optimize and translate into machine code.

- **Output:** Intermediate code (e.g., three-address code).

5. **Optimization:**
- **Function:** Improves the intermediate code to make it more efficient
(e.g., faster execution, reduced memory usage).

- **Output:** Optimized intermediate code.

6. **Code Generation:**

- **Function:** Converts the intermediate code into target machine code or


assembly language.

- **Output:** Machine code or assembly code.

7. **Code Linking and Assembly:**

- **Function:** Resolves references between modules and libraries, and


converts assembly code into executable machine code.

- **Output:** Executable binary.

### Passes of a Compiler

A pass refers to a single traversal over the entire source code or intermediate
representation. A compiler can be either a single-pass compiler or a multi-pass
compiler:

- **Single-Pass Compiler:**

- **Function:** Completes all the compilation phases in one pass over the
source code.

- **Advantages:** Faster and uses less memory.

- **Disadvantages:** Less opportunity for optimization and may require more


complex code generation techniques to handle forward references.
- **Multi-Pass Compiler:**

- **Function:** Divides the compilation process into multiple passes, each


performing a subset of the phases or reprocessing the output of the previous
pass.

- **Advantages:** Allows for better optimization and more manageable code


generation.

- **Disadvantages:** Slower and requires more memory as it needs to store


intermediate representations.

### Examples of Passes in a Multi-Pass Compiler

1. **First Pass:**

- Performs lexical analysis, syntax analysis, and semantic analysis.

- Outputs an intermediate representation.

2. **Second Pass:**

- Performs intermediate code optimization.

- Outputs optimized intermediate code.

3. **Third Pass:**

- Generates target machine code from the optimized intermediate code.

### Relationship Between Phases and Passes

- **Phases:** Logical stages of the compilation process (e.g., lexical analysis,


syntax analysis).
- **Passes:** Physical traversals over the code, during which one or more phases
are executed.

A multi-pass compiler might, for example, perform all phases in the first pass
(up to generating intermediate code), then use subsequent passes to optimize
and generate final machine code. Alternatively, it might interleave phases such
as performing semantic analysis and intermediate code generation in separate
passes to allow for intermediate optimizations.

### Summary

- **Phases of a Compiler:** Conceptual steps in the compilation process, each


handling a specific task from converting source code to executable code.

- **Passes of a Compiler:** The actual number of times the compiler traverses


the entire source code or intermediate representation during the compilation
process. A single-pass compiler does this once, while a multi-pass compiler does
it multiple times, allowing for more complex and effective optimizations.

Understanding both concepts is crucial for appreciating the complexity and


efficiency considerations in compiler design.
The front end and back end of a compiler are two distinct parts that handle
different aspects of the compilation process. Here's how they differ:

### Front End

The front end of a compiler is responsible for the initial stages of the
compilation process, starting from the source code and producing an
intermediate representation (IR) that captures the program's syntax and
semantics. The main tasks of the front end include:

1. **Lexical Analysis (Scanning):**

- Converts the source code into a sequence of tokens.

2. **Syntax Analysis (Parsing):**

- Analyzes the token sequence to determine its grammatical structure


according to the language's grammar.

- Constructs a parse tree or abstract syntax tree (AST).

3. **Semantic Analysis:**

- Performs checks to ensure the program adheres to the language's semantics.

- Assigns types to expressions and identifiers.

- Performs error checking for type mismatches, undeclared variables, etc.

- Augments the AST with semantic information.

4. **Intermediate Code Generation:**

- Translates the annotated AST into an intermediate representation (IR) that


is independent of the source and target languages.

- Typically uses representations like three-address code or quadruples.

### Back End


The back end of a compiler takes the intermediate representation generated by
the front end and translates it into executable machine code or an equivalent
target representation. The main tasks of the back end include:

1. **Optimization:**

- Performs various optimizations on the intermediate representation to


improve the efficiency of the generated code.

- Optimization techniques may include constant folding, loop optimization,


register allocation, and dead code elimination.

2. **Code Generation:**

- Translates the optimized intermediate representation into the target


machine code or assembly language.

- Generates efficient and correct machine instructions that implement the


program's functionality.

3. **Code Linking and Assembly:**

- Resolves references between modules and libraries.

- Converts assembly code into executable machine code.

- May involve generating relocation information and resolving external


symbols.

### Key Differences

- **Responsibilities:**

- **Front End:** Handles tasks related to analysis, understanding, and


transformation of the source code.
- **Back End:** Focuses on optimization and code generation, translating the
high-level program representation into low-level machine instructions.

- **Input and Output:**

- **Front End:** Takes source code as input and produces an intermediate


representation (IR) as output.

- **Back End:** Takes the intermediate representation (IR) as input and


produces executable machine code or assembly code as output.

- **Language Independence:**

- **Front End:** Typically language-specific, as it deals with understanding the


syntax and semantics of a particular programming language.

- **Back End:** Can be more language-agnostic, as it focuses on generating


efficient code based on the intermediate representation, which may be shared
among multiple source languages.

- **Optimization:**

- **Front End:** Primarily concerned with semantic analysis and basic


optimizations that require semantic understanding.

- **Back End:** Performs more advanced optimizations tailored to the target


architecture and runtime environment.

### Summary

In summary, the front end of a compiler deals with the analysis and
understanding of the source code, producing an intermediate representation
(IR) that captures the program's semantics. The back end then takes this IR
and performs optimizations and code generation to produce efficient executable
code. Together, the front end and back end work in tandem to translate high-
level source code into machine-executable instructions.
