CD Unit 1 Merged
UNIT-1
Language processing system:
A language processing system consists of the following components; let's discuss them one by one.
Source code: The program written by the programmer in a high-level language.
Preprocessor: The preprocessor takes the source code as input, processes macros and file inclusions, and produces modified source code as output.
Compiler:
The compiler takes the modified code as input and produces the target code as output.
Assembler:
The assembler takes the target code as input and produces relocatable machine code as output.
Linker: Linker or link editor is a program that takes a collection of objects (created by assemblers and
compilers) and combines them into an executable program.
Loader: The loader loads the linked program into main memory.
Executable code: It is low-level, machine-specific code that the machine can easily understand. Once the linker and loader have done their jobs, the object code is finally converted into executable code.
1. Compiler
The language processor that reads the complete source program written in high-level
language as a whole in one go and translates it into an equivalent program in machine
language is called a Compiler. Example: C, C++, C#.
In a compiler, the source code is translated to object code successfully only if it is free of errors. When there are errors in the source code, the compiler reports them at the end of compilation, along with line numbers. The errors must be removed before the compiler can successfully recompile the source code. Once compiled, the object program can be executed any number of times without translating it again.
2. Assembler
The assembler is used to translate a program written in assembly language into machine code. The source program, containing assembly language instructions, is the input of the assembler; the output it generates is the object code or machine code understandable by the computer. The assembler is basically the first interface that enables humans to communicate with the machine; we need an assembler to fill the gap between human and machine so that they can communicate with each other. Code written in assembly language consists of mnemonics (instructions) such as ADD, MUL, MUX, SUB, DIV, and MOV, and the assembler converts these mnemonics into binary code. These mnemonics also depend upon the architecture of the machine. For example, the architectures of the Intel 8085 and Intel 8086 are different.
3. Interpreter
The language processor that translates a single statement of the source program into machine code and executes it immediately, before moving on to the next line, is called an interpreter. If there is an error in a statement, the interpreter terminates its translating process at that statement and displays an error message; it moves on to the next line only after the error is removed. An interpreter directly executes instructions written in a programming or scripting language without previously converting them to object code or machine code: it translates one line at a time and then executes it.
Compiler: A compiler translates code from a high-level programming language (like C, C++, or Go) into machine code before the program runs.
Interpreter: An interpreter translates code written in a high-level programming language into machine code line-by-line as the code runs.
Preprocessor:
A preprocessor produces input to compilers. It may perform the following functions:
1. Macro processing: A preprocessor may allow a user to define macros that are shorthands for longer constructs.
2. File inclusion: A preprocessor may include header files into the program text.
3. Rational preprocessor: These preprocessors augment older languages with more modern flow-of-control and data-structuring facilities.
4. Language extensions: These preprocessors attempt to add capabilities to the language in the form of built-in macros.
Translator:
A translator or language processor is a program that translates an input
program written in a programming language into an equivalent program in another
language. The compiler is a type of translator, which takes a program written in
a high-level programming language as input and translates it into an equivalent
program in low-level languages such as machine language or assembly language.
TYPES OF TRANSLATORS:
Interpreter
Compiler
Preprocessor
Linker:
A linker is a program in a system that helps to link the object modules of a program into a single object file. It performs the process of linking. Linkers are also called link editors. Linking is the process of collecting and combining pieces of code and data into a single file. The linker also links particular modules into the system library. It takes object modules from the assembler as input and forms an executable file as output for the loader. Linking is performed both at compile time, when the source code is translated into machine code, and at load time, when the program is loaded into memory by the loader. Linking is performed as the last step in compiling a program.
A linker, also known as a link editor or binder, combines object modules into a single object file. Generally, it is a program that performs the process of linking: it takes one or more object files generated by a compiler and combines them into an executable file.
Loader:
In compiler design, a loader is a program that is responsible for loading executable programs
into memory for execution. The loader reads the object code of a program, which is usually in
binary form, and copies it into memory. It also performs other tasks such as allocating
memory for the program’s data and resolving any external references to other programs or
libraries. The loader is typically part of the operating system and is invoked by the system’s
bootstrap program or by a command from a user. Loaders can be of two types: absolute loaders, which place the program at a fixed address, and relocating loaders, which can load it anywhere in memory.
Overall, the Loader is responsible for loading the program into memory, preparing it for
execution, and transferring control to the program’s entry point. It acts as a bridge between
the Operating System and the program being loaded.
LIST OF COMPILERS
1. Ada compilers
2. ALGOL compilers
3. BASIC compilers
4. C# compilers
5. C compilers
6. C++ compilers
7. COBOL compilers
8. D compilers
9. Common Lisp compilers
10. ECMAScript interpreters
11. Eiffel compilers
12. Felix compilers
13. Fortran compilers
14. Haskell compilers
15. Java compilers
16. Pascal compilers
17. PL/I compilers
18. Python compilers
19. Scheme compilers
20. Smalltalk compilers
21. CIL compilers
A compiler is a software program that converts the high-level source code written in a
programming language into low-level machine code that can be executed by the computer
hardware. The process of converting the source code into machine code involves several
phases or stages, which are collectively known as the phases of a compiler. The typical
phases of a compiler are:
1. Lexical Analysis
2. Syntactic Analysis or Parsing
3. Semantic Analysis
4. Intermediate Code Generation
5. Code Optimization
6. Code Generation
Lexical Analysis
The first phase of the compiler works as a text scanner. This phase scans the source code as a stream of characters and converts it into meaningful lexemes. The lexical analyzer represents these lexemes in the form of tokens as:
<token-name, attribute-value>
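As a small illustration (an assumed example in the style of the classic textbook statement), the assignment below is broken into the token stream shown in the comment:

```c
position = initial + rate * 60;
/* token stream emitted by the lexical analyzer:
   <id, position> <assign_op> <id, initial> <add_op>
   <id, rate> <mul_op> <num, 60> <semicolon>          */
```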
Syntax Analysis
The next phase is called the syntax analysis or parsing. It takes the token produced by lexical
analysis as input and generates a parse tree (or syntax tree). In this phase, token arrangements
are checked against the source code grammar, i.e. the parser checks if the expression made
by the tokens is syntactically correct.
Semantic Analysis
Semantic analysis checks whether the parse tree constructed follows the rules of the language. For example, it checks that assignments are between compatible data types, flagging operations such as adding a string to an integer. The semantic analyzer also keeps track of identifiers, their types, and expressions, and checks whether identifiers are declared before use. The semantic analyzer produces an annotated syntax tree as an output.
Intermediate Code Generation
After semantic analysis, the compiler generates an intermediate code of the source code for
the target machine. It represents a program for some abstract machine. It is in between the
high-level language and the machine language. This intermediate code should be generated
in such a way that it makes it easier to be translated into the target machine code.
A code that is neither high-level nor machine code, but a middle-level code is an
intermediate code.
We can translate this code to machine code later.
This stage serves as a bridge or way from analysis to synthesis.
Code Optimization
The next phase does code optimization of the intermediate code. Optimization can be
assumed as something that removes unnecessary code lines, and arranges the sequence of
statements in order to speed up the program execution without wasting resources (CPU,
memory).
Code Generation
In this phase, the code generator takes the optimized representation of the intermediate code and maps it to the target machine language. The code generator translates the intermediate code into a sequence of (generally) relocatable machine code. This sequence of machine-code instructions performs the same task as the intermediate code would.
Symbol Table
It is a data structure maintained throughout all the phases of a compiler. All the identifiers' names along with their types are stored here. The symbol table makes it easier for the compiler to quickly search for an identifier's record and retrieve it. The symbol table is also used
for scope management.
The symbol table connects or interacts with all phases of the compiler and error handler for
updates. It is also accountable for scope management.
It stores: the names of all entities (variables, functions, etc.), their types, and scope information.
Lexical analysis and syntax analysis are two distinct phases in the process of compiling a program. They are separated for several reasons: it simplifies the design of each phase, specialized buffering techniques make the lexical analyzer more efficient, and isolating character-set and device-specific details in the lexer improves the compiler's portability.
Cousins of Compiler
Converting a high-level language into a low-level language takes multiple steps and involves many programs
apart from the Compiler. Before the compilation can start, our source code needs to be preprocessed. After
the compilation, our code needs to be converted into executable code to execute on our machine. These
essential tasks are performed by the preprocessor, assembler, Linker, and Loader. They are known as the
Cousins of the Compiler.
Preprocessor
The preprocessor is one of the cousins of the Compiler. It is a program that performs preprocessing. It
performs processing on the given data and produces an output. The output generated is used as an input for
some other program.
The preprocessor increases the readability of the code by replacing a complex expression with a simpler one
by using a macro.
Macro processing
Macro processing is mapping the input to output data based on a certain set of rules and defined processes.
These rules are known as macros.
Rational Preprocessors
Rational preprocessors are processors that augment older languages with some modern flow-of-control and data-structuring facilities.
File Inclusion
The preprocessor is also used to include header files in the program text. A header file is a text file included
in our source program file during compilation. When the preprocessor finds an #include directive in the
program, it replaces it with the entire content of the specified header file.
Language extension
Language extension is used to add new capabilities to the existing language. This is done by including certain
libraries in our program, which provides extra functionality. An example of this is Equel, a database query
language embedded in C.
Error Detection
Some preprocessors are capable of performing error-checking on the source code that is given as input to
them. For example, it can check if the headers files are included properly and if the macros are defined
correctly or not.
Conditional Compilation
Certain preprocessors are capable of including or excluding certain pieces of code based on the result of a
condition. They provide more flexibility to the programmers for writing the code as they allow the
programmers to include or exclude certain features of the program based upon some condition.
Assembler
Assembler is also one of the cousins of the compiler. A compiler takes the preprocessed code and then
converts it into assembly code. This assembly code is given as input to the assembler, and the assembler
converts it into machine code. The assembler comes into effect in the compilation process after the compiler has finished its job. There are two types of assemblers:
One-pass assembler: It goes through the source code once and translates it directly into machine code, requiring every symbol to be defined before it is used.
Two-pass assembler: Two-pass assemblers work by creating a symbol table with the symbols and their values in the first pass, and then using the symbol table in a second pass, they generate code.
Linker
Linker takes the output produced by the assembler as input and combines them to create an executable file. It
merges two or more object files that might be created by different assemblers and creates a link between
them. It also appends all the libraries that will be required for the execution of the file. A linker's primary
function is to search and find referred modules in a program and establish the memory address where these
codes will be loaded.
Library Management: Linkers can be used to add external libraries to our code to add additional
functionalities. By adding those libraries, our code can now use the functions defined in those libraries.
Code Optimization: Linkers are also used to optimize the code generated by the compiler by reducing the
code size and increasing the program's performance.
Memory Management: Linkers are also responsible for managing the memory requirement of the executable
code. It allocates the memory to the variables used in the program and ensures they have a consistent memory
location when the code is executed.
Symbol Resolution: Linkers link multiple object files, and a symbol can be redefined in multiple files, giving
rise to a conflict. The linker resolves these conflicts by choosing one definition to use.
Loader
The loader works after the linker has performed its task and created the executable code. It takes the input of
executable files generated from the linker, loads it to the main memory, and prepares this loaded code for
execution by a computer. It also allocates memory space to the program. The loader is also responsible for
the execution of programs by allocating RAM to the program and initializing specific registers.
Loading: The loader loads the executable files in the memory and provides memory for executing the
program.
Relocation: The loader adjusts the addresses within the program so that it can run correctly at the memory location where it is loaded.
Symbol Resolution: The loader is used to resolve the symbols not defined directly in the program. They do
this by looking for the definition of that symbol in a library linked to the executable file.
Dynamic Linking: The loader dynamically links the libraries into the executable file at runtime to add
additional functionality to our program.
Left Recursion:
Recursion in a grammar can be classified into the following three types:
1. Left Recursion
2. Right Recursion
3. General Recursion
1. Left Recursion-
A production of grammar is said to have left recursion if the leftmost variable of its RHS is
same as variable of its LHS.
A grammar containing a production having left recursion is called as Left Recursive
Grammar.
Example-
S → Sa / ε
Consider a pair of productions with left recursion:
A → Aα / β
Then, we can eliminate the left recursion by replacing the pair of productions with-
A → βA’
A’ → αA’ / ε
2. Right Recursion-
A production of grammar is said to have right recursion if the rightmost variable of its RHS is
same as variable of its LHS.
A grammar containing a production having right recursion is called as Right Recursive
Grammar.
Example-
S → aS / ε
3. General Recursion-
The recursion which is neither left recursion nor right recursion is called as general
recursion.
Example-
S → aSb / ε
Left recursion is a common problem that occurs in grammar during parsing in the syntax analysis
part of compilation. It is important to remove left recursion from grammar because it can create
an infinite loop, leading to errors and a significant decrease in performance
A → Aα | β
The above grammar is left recursive because the non-terminal on the left of the production occurs at the first position on the right side of the production. We can eliminate the left recursion by replacing the pair of productions with
A → βA′
A′ → αA′ | ϵ
A production of the form
S ⇒ S | a | b
is called left recursive, where S is any non-terminal and a and b are any strings of terminals.
Problem with Left Recursion: If left recursion is present in a grammar then, during parsing in the syntax-analysis part of compilation, there is a chance that the grammar will create an infinite loop. This is because, at each derivation step, S can produce another S without consuming any input or checking any condition.
Algorithm to Remove Left Recursion, with an example: Suppose we have a grammar that contains left recursion:
S ⇒ Sa | Sb | c | d
Check if the given grammar contains left recursion. If present, then separate such productions and start working on them. In our example:
S ⇒ Sa | Sb | c | d
Introduce a new nonterminal and write it at the end of every non-recursive alternative. We create a new nonterminal S’ and write the new production as:
S ⇒ cS' | dS'
Write the newly produced nonterminal S’ on the LHS; on the RHS it can either produce ε or reproduce the symbols that followed the previous LHS, with the new nonterminal S’ appended at the end of each alternative:
S ⇒ cS' | dS'
S' ⇒ ε | aS' | bS'
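To see why this transformation matters for top-down parsing, here is a minimal C sketch (hypothetical code, not from the text) of a recursive-descent parser for the transformed grammar S ⇒ cS' | dS', S' ⇒ ε | aS' | bS'. With the original left-recursive grammar, S() would call itself forever without consuming any input.

```c
#include <stdio.h>

/* Recursive-descent parser for:
     S  -> c S' | d S'
     S' -> a S' | b S' | epsilon
   Possible only because the left recursion was removed. */
static const char *in;

static int Sprime(void) {
    if (*in == 'a' || *in == 'b') { in++; return Sprime(); }
    return 1;                        /* epsilon: match nothing */
}

static int S(void) {
    if (*in == 'c' || *in == 'd') { in++; return Sprime(); }
    return 0;                        /* neither c nor d: syntax error */
}

int main(void) {
    in = "cab";
    printf("%s\n", S() && *in == '\0' ? "accepted" : "rejected");
    return 0;
}
```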
A grammar is said to have direct left recursion when any production rule is of the form
S ⇒ S | a | b
A grammar is said to have indirect left recursion when it does not have direct left recursion, but the production rules are written in such a way that it is possible to derive a string from a given non-terminal symbol such that the leftmost symbol (the head) of the derived string is that non-terminal itself.
Example:
A ⇒ Bx
B ⇒ Cy
C ⇒ Az
Explanation:
The above grammar has indirect left recursion because it is possible to derive the following:
A ⇒ Bx
A ⇒ (Cy)x
A ⇒ ((Az)y)x
A ⇒ Azyx
LEXICAL ANALYSIS:
1. Lexical Analysis can be implemented with the Deterministic finite Automata.
2. The output is a sequence of tokens that is sent to the parser for syntax analysis
Spelling error: int written as intt.
Exceeding length: an identifier that is too long, or a numeric constant given a larger value than an int can hold.
Lexical analysis breaks down an input text into meaningful components called tokens.
Here are the simplified steps:
1. **Identify Tokens**: Determine the set of symbols (letters, digits, operators, etc.)
that can form tokens.
2. **Assign Strings to Tokens**: Recognize and categorize strings. For example, "cat"
as a word token, "2023" as a number token.
3. **Return the Token Value**: Extract and return the smallest units (lexemes) that form
each token for further processing.
A lexical analyzer also **compresses input**: it reduces and streamlines the data for processing. One complication is **lookahead**: the analyzer may need to look ahead in the input to decide where a token ends, which can be complex.
TOKEN GENERATION:
In Example 1 (int num1 = 100;), the total number of tokens is 5: int, num1, =, 100, and ;.
1. Input preprocessing: This stage involves cleaning up the input text and preparing it for lexical analysis, e.g., removing comments and extra whitespace.
2. Tokenization: This is the process of breaking the input text into a sequence of tokens. This is usually done by matching the characters in the input text against a set of patterns or regular expressions that define the different types of tokens.
3. Token classification: In this stage, the lexer determines the type of each token. For
example, in a programming language, the lexer might classify keywords,
identifiers, operators, and punctuation symbols as separate token types.
4. Token validation: In this stage, the lexer checks that each token is valid according to
the rules of the programming language. For example, it might check that a variable
name is a valid identifier, or that an operator has the correct syntax.
5. Output generation: In this final stage, the lexer generates the output of the lexical
analysis process, which is typically a list of tokens. This list of tokens can then
be passed to the next stage of compilation or interpretation.
**Interaction**:
- The parser asks the lexer for the next token.
- The lexer reads the input and gives the next token to
the parser.
- This continues until the input is fully processed.
Basically, the lexer chops up the input into tokens, and the
parser checks if these tokens fit together correctly according
to the language rules.
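A minimal runnable sketch of this pull model in C (the names here, such as getNextToken, are illustrative and not taken from any particular tool):

```c
#include <stdio.h>
#include <ctype.h>

static const char *input = "x = 1 + 2";

typedef enum { TOK_ID, TOK_NUM, TOK_OP, TOK_EOF } TokType;

/* The "lexer": chops the next token off the input and prints it. */
TokType getNextToken(void) {
    while (*input == ' ') input++;               /* skip white space */
    if (*input == '\0') return TOK_EOF;
    if (isalpha((unsigned char)*input)) {
        printf("id: ");
        while (isalpha((unsigned char)*input)) putchar(*input++);
        putchar('\n');
        return TOK_ID;
    }
    if (isdigit((unsigned char)*input)) {
        printf("num: ");
        while (isdigit((unsigned char)*input)) putchar(*input++);
        putchar('\n');
        return TOK_NUM;
    }
    printf("op: %c\n", *input++);
    return TOK_OP;
}

/* The "parser": keeps asking for the next token until input ends. */
int main(void) {
    while (getNextToken() != TOK_EOF)
        ;   /* a real parser would check each token against the grammar */
    return 0;
}
```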
Bootstrapping:
Bootstrapping is a process in which a simple language is used to translate a more complicated program, which in turn may handle an even more complicated program, and so on. Writing a compiler for any high-level language is a complicated process, and it takes a lot of time to write one from scratch. Hence, a simple language is used to generate the target code in stages.
To clearly understand the bootstrapping technique, consider the following scenario. Suppose we want to write a cross compiler for a new language X. The implementation language of this compiler is, say, Y, and the target code being generated is in language Z. That is, we create XYZ. Now if the existing compiler for Y runs on machine M and generates code for M, then it is denoted as YMM. Now if we run XYZ using YMM, then we get a compiler XMZ: that is, a compiler for source language X that generates target code in language Z and runs on machine M. [Diagram omitted.] We can create compilers of many different forms in this way.
Bootstrapping is the process of writing a compiler for a programming language using the
language itself. In other words, it is the process of using a compiler written in a particular
programming language to compile a new version of the compiler written in the same
language.
INPUT BUFFERING:
The lexical analyzer reads the source program through a buffer, commonly divided into two halves (the two-buffer scheme). A sentinel character, eof (end of file), is placed at the end of each buffer half so that the end of a half, or of the input itself, can be detected with a single test per character.
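A toy C sketch of the two-buffer scheme under simplifying assumptions (tiny buffer halves, '\0' standing in for the eof sentinel, and both halves loaded up front instead of being reloaded on demand):

```c
#include <stdio.h>

#define HALF 8          /* real compilers use disk-block-sized halves */
#define EOF_CH '\0'     /* sentinel standing in for eof */

char buf[2 * HALF + 2]; /* two halves, each terminated by a sentinel */

/* Fill one half from the source and terminate it with the sentinel. */
void loadHalf(char *half, const char **src) {
    int n = 0;
    while (n < HALF && **src) half[n++] = *(*src)++;
    half[n] = EOF_CH;
}

int main(void) {
    const char *source = "int num1 = 100;";
    loadHalf(buf, &source);             /* first half:  buf[0..HALF]  */
    loadHalf(buf + HALF + 1, &source);  /* second half: buf[HALF+1..] */
    char *forward = buf;
    for (;;) {
        char c = *forward++;
        if (c == EOF_CH) {
            if (forward == buf + HALF + 1)
                continue;               /* sentinel of half 1: move on */
            break;                      /* end of input (simplified)   */
        }
        putchar(c);                     /* hand the character to the lexer */
    }
    putchar('\n');
    return 0;
}
```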
Specification of Tokens:
Recognition of Tokens
The question is how to recognize the tokens?
Example: assume the following grammar fragment to generate a
specific language:
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term | term
term → id | num
where the terminals if, then, else, relop, id, and num generate sets of strings given by the following regular definitions:
if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
id → letter (letter | digit)*
num → digit+ optional-fraction optional-exponent
where letter and digit are as defined previously.
For this language fragment the lexical analyzer will recognize
the keywords if, then, else, as well as the lexemes denoted by relop,
id, and num. To simplify matters, we assume keywords are
reserved; that is, they cannot be used as identifiers. The num
represents the unsigned integer and real numbers of Pascal.
In addition, we assume lexemes are separated by white space,
consisting of nonnull sequences of blanks, tabs, and newlines. The
lexical analyzer will strip out white space. It will do so by
comparing a string against the regular definition ws, below.
delim → blank | tab | newline
ws → delim+
If a match for ws is found, the lexical analyzer does not return a
token to the parser.
[Figure: transition diagram for relop. From start state 0: '<' leads to state 1, '=' leads to state 5 (return (relop, EQ)), and '>' leads to state 6. From state 1: '=' leads to state 2 (return (relop, LE)); any other character leads to state 4 with retract (*), returning (relop, LT). From state 6: '=' leads to state 7 (return (relop, GE)); any other character leads to state 8 with retract (*), returning (relop, GT).]
[Figure: transition diagram for identifiers — a letter moves to a state that loops on letter or digit, then exits with retract on any other character.]
[Figure: transition diagram for num — states 25, 26, 27: loop on digit, exit to the accepting state with retract (*) on any other character.]
[Figure: transition diagram for ws — states 28, 29, 30: loop on delim, exit with retract (*) on any other character.]
FA
Note: Both NFA and DFA are capable of recognizing what regular expressions can denote.
[Figure: an NFA with states 1 and 2 and transitions on a and b.]
A transition on input ε (an ε-transition) is also possible:
[Figure: an NFA with states 1–4 including ε-transitions.]
[Figure: NFA for the pattern abb — start state 0 with transitions 0 →a 1 →b 2 →b 3 — and an NFA fragment over states 1–4 with moves on a and b.]
[Figure: DFA accepting (a|b)*abb, with states 0–3.] Its transition table:

State    a    b
0        1    0
1        1    2
2        1    3
3        1    0
Operation        Description
ε-closure(s)     Set of NFA states reachable from NFA state s on ε-transitions alone.
ε-closure(T)     Set of NFA states reachable from some NFA state s in T on ε-transitions alone.
move(T, a)       Set of NFA states to which there is a transition on input symbol a from some NFA state s in T.
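As an illustration, here is a small C sketch of ε-closure computed with a worklist over a tiny hard-coded NFA (the states and ε-edges below are made up for the demo, not taken from the figures):

```c
#include <stdio.h>

#define N 4
/* eps[s][t] = 1 means there is an epsilon-transition from s to t. */
int eps[N][N] = {
    {0, 1, 1, 0},   /* 0 -eps-> 1, 0 -eps-> 2 */
    {0, 0, 0, 1},   /* 1 -eps-> 3             */
    {0, 0, 0, 0},
    {0, 0, 0, 0}
};

/* Add to 'in' every state reachable from the states already in 'in'
   using epsilon-transitions alone (simple worklist search). */
void epsClosure(int in[N]) {
    int stack[N], top = 0;
    for (int s = 0; s < N; s++)
        if (in[s]) stack[top++] = s;
    while (top > 0) {
        int s = stack[--top];
        for (int t = 0; t < N; t++)
            if (eps[s][t] && !in[t]) { in[t] = 1; stack[top++] = t; }
    }
}

int main(void) {
    int T[N] = {1, 0, 0, 0};         /* T = {0} */
    epsClosure(T);
    for (int s = 0; s < N; s++)      /* prints: 0 1 2 3 */
        if (T[s]) printf("%d ", s);
    printf("\n");
    return 0;
}
```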
[Figure: ε-NFA for (a|b)*abb with states 0–10. Steps 1 and 2 of the subset construction compute the DFA start state A = ε-closure({0}) = {0, 1, 2, 4, 7} and its transition on input a to state B.]
3) Compute move(A, b), the set of NFA states having transitions on b from members of A. Among the states 0, 1, 2, 4, and 7, only state 4 has such a transition, to state 5, so
move(A, b) = {5}
Compute ε-closure(move(A, b)) = ε-closure({5}):
ε-closure({5}) = {1, 2, 4, 5, 6, 7}. Let us call this set C.
So the DFA has a transition on b from A to C.
State A is the start state, and state E is the only accepting state.
The complete transition table Dtran is shown below (columns are the input symbols):

STATE    a    b
A        B    C
B        B    D
C        B    C
D        B    E
E        B    C
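Since Dtran is just a table, simulating the resulting DFA takes only a few lines of C. This sketch encodes states A–E as 0–4 (start state A, accepting state E) and accepts exactly the strings over {a, b} ending in abb:

```c
#include <stdio.h>

/* Dtran from the text; rows are states A..E (0..4), columns a, b. */
int Dtran[5][2] = {
    { 1, 2 },   /* A: a -> B, b -> C */
    { 1, 3 },   /* B: a -> B, b -> D */
    { 1, 2 },   /* C: a -> B, b -> C */
    { 1, 4 },   /* D: a -> B, b -> E */
    { 1, 2 }    /* E: a -> B, b -> C */
};

/* Return 1 if s (a string over {a, b}) matches (a|b)*abb. */
int matches(const char *s) {
    int state = 0;                        /* start in state A */
    for (; *s; s++) {
        if (*s != 'a' && *s != 'b') return 0;
        state = Dtran[state][*s - 'a'];
    }
    return state == 4;                    /* accept only in state E */
}

int main(void) {
    printf("%d\n", matches("abb"));       /* 1 */
    printf("%d\n", matches("aabb"));      /* 1 */
    printf("%d\n", matches("ab"));        /* 0 */
    return 0;
}
```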
[Figure: the resulting DFA for (a|b)*abb, with start state A and accepting state E.]

A composite NFA can be constructed for any regular expression, with a single start state i and a single accepting state f (Thompson's construction); the figures for the basic constructions are omitted here.
5- For the regular expression a*, construct the following composite NFA N(a*). [Figure omitted.]
Further worked examples (figures omitted):
2) RE = (a | b)*a
4) RE = a* (a | b)
Lexical Errors
What if the user omits the space and writes “Fori”? There is no lexical error: a single token IDENT(“Fori”) is produced instead of the sequence FOR, IDENT(“i”).
Recognition of Tokens:
ws – white space
Lexical Analyzer Generator-LEX:
A LEX specification defines declarations and the regular expressions (with their actions) that describe tokens.
Lexical Analysis
It is the first step of compiler design; it takes the input as a stream of characters and gives the output as tokens. Its functions include:
1. Tokenization: Converting the character stream into tokens, which can be classified into identifiers, separators, keywords, operators, constants, and special characters.
2. Error Messages: It gives errors related to lexical analysis, such as exceeding length or an unmatched string.
3. Eliminating Comments and Whitespace: Eliminates comments and all the spaces, blank spaces, new lines, and indentation.
Lex
Lex is a tool or a computer program that generates Lexical Analyzers (converts the stream of characters into
tokens). The Lex tool itself is a compiler. The Lex compiler takes the input and transforms that input into
input patterns. It is commonly used with YACC(Yet Another Compiler Compiler). It was written by Mike
Lesk and Eric Schmidt.
Function of Lex
1. In the first step, the source program written in the Lex language (with file name File.l) is given as input to the Lex compiler, commonly known as Lex, to produce lex.yy.c as output.
2. After that, the output lex.yy.c is used as input to the C compiler, which produces an 'a.out' file as output; finally, the output file a.out takes a stream of characters and generates tokens as output.
lex.yy.c: It is a C program.
File.l: It is a Lex source program
a.out: It is a Lexical analyzer
Declarations
%%
Translation rules
%%
Auxiliary procedures
In the declarations section, regular definitions can be given. The following is an example of a declarations section; each statement has two components, a name and a regular expression used to denote that name:
delim   [ \t\n]
ws      {delim}+
letter  [A-Za-z]
digit   [0-9]
id      {letter}({letter}|{digit})*
Translation rules: These rules consist of a pattern and an action.
This is the second section of the LEX program, after the declarations. The declarations section is separated from the translation rules section by means of the “%%” delimiter. Here, each statement consists of two components: a pattern and an action. The pattern is matched against the input; if a pattern matches, the action listed against it is carried out. Thus the LEX tool can be looked upon as a rule-based programming language. Rules are written as patterns p1, p2, ..., pn with their corresponding actions:
p1 {action1}
p2 {action2}
...
pn {actionn}
For example, if the keyword IF is to be returned as a token for a match with the input string “if”, then the translation rule is defined as
{if} {return(IF);}
The “;” at the end of return(IF) indicates the end of the first statement of an action; an entire sequence of actions is enclosed between a pair of braces. If an action is written across multiple lines, the continuation character needs to be used. Similarly, the following is an example for an identifier “id”, where the usage of “id” is already stated in the first “declarations” section.
{id} {yylval=install_id();return(ID);}
In the above statement, when an identifier is encountered, two actions are taken: first, the install_id() function is called and its result assigned to yylval; second, a return statement returns the token ID.
Auxiliary procedures: The auxiliary section holds auxiliary functions used in the actions. This section is separated from the translation rules section using the delimiter “%%”. In this section, the C program's main function is declared and the other necessary functions are also defined. In the example in the translation rules section, the function install_id() is a procedure used to install the lexeme, whose first character is pointed to by yytext and whose length is given by yyleng, into the symbol table, returning a pointer to the beginning of the lexeme's entry.
install_id() { /* insert the lexeme (yytext, yyleng) into the symbol table ... */ }
The functionality of install_id can be written separately or combined with the main function. yytext and yyleng are Lex-provided variables that give the text of the matched input and the length of that string.
Finite Automata:
Finite Automata(FA) is the simplest machine to recognize patterns.
It is used to characterize a Regular Language.
Also it is used to analyze and recognize Natural language Expressions.
The finite automata or finite state machine is an abstract machine that
has five elements or tuples.
It has a set of states and rules for moving from one state to another but
it depends upon the applied input symbol.
Based on the states and the set of rules the input string can be either
accepted or rejected.
Basically, it is an abstract model of a digital computer which reads an input
string and changes its internal state depending on the current input
symbol.
Every automaton defines a language i.e. set of strings it accepts.
The following figure shows some essential features of a general automaton:
1. Input
2. Output
3. States of automata
4. State relation
5. Output relation
A finite automaton consists of the following five tuples: Q (a finite set of states), Σ (a finite set of input symbols), q0 (the initial state), F (the set of final states), and δ (the transition function mapping Q × Σ to Q).
For example, construct a DFA which accept a language of all strings ending with ‘a’.
Given: Σ = {a, b}, initial state q0, F = {q1}, Q = {q0, q1}
First, consider a language set of all the possible acceptable strings in order to construct an
accurate state transition diagram.
L = {a, aa, aaa, aaaa, aaaaa, ba, bba, bbbaa, aba, abba, aaba, abaa}
The above is a simple subset of the possible acceptable strings; there can be many other strings that end with 'a' and contain the symbols {a, b}.
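A direct C sketch of this two-state DFA (q0 encoded as 0, q1 as 1):

```c
#include <stdio.h>

/* DFA over {a, b}: Q = {q0, q1}, start q0, F = {q1}.
   'a' always moves to q1; 'b' always moves back to q0. */
int acceptsEndingInA(const char *s) {
    int state = 0;                       /* q0 */
    for (; *s; s++) {
        if (*s == 'a')      state = 1;   /* q1 */
        else if (*s == 'b') state = 0;   /* q0 */
        else return 0;                   /* symbol outside the alphabet */
    }
    return state == 1;                   /* accept iff we halt in q1 */
}

int main(void) {
    printf("%d\n", acceptsEndingInA("bba"));   /* 1 */
    printf("%d\n", acceptsEndingInA("ab"));    /* 0 */
    return 0;
}
```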
Nondeterministic Finite Automata (NFA): An NFA is similar to a DFA except for the following additional features:
A null (or ε) move is allowed, i.e., it can move forward without reading symbols.
δ: Transition function
δ: Q × (Σ ∪ {ε}) → 2^Q
As you can see, the transition function is defined for any input including ε, so an NFA can go to any number of states. For example, an NFA can be drawn for the above problem. [Figure omitted.]
1. Both NFA and DFA have the same power and each NFA can be translated into a DFA.
2. There can be multiple final states in both DFA and NFA.
3. NFA is more of a theoretical concept.
4. DFA is used in Lexical Analysis in Compiler.
5. If the number of states in the NFA is N, then its DFA can have at most 2^N states.
For instance:
In a regular expression, x* means zero or more occurrence of x. It can generate {e, x, xx, xxx, xxxx, .....}
In a regular expression, x+ means one or more occurrence of x. It can generate {x, xx, xxx, xxxx, .....}
Union: If L and M are two regular languages, then their union L ∪ M is also a regular language.
L ∪ M = {s | s is in L or s is in M}
Concatenation: If L and M are two regular languages, then their concatenation L.M is also a regular language.
L.M = {st | s is in L and t is in M}
Kleene closure: If L is a regular language, then its Kleene closure L* will also be a regular language.
Example 1:
Write the regular expression for the language accepting all combinations of a's, over the set ∑ = {a}
Solution:
All combinations of a's means a may be zero, single, double and so on. If a is appearing zero times, that
means a null string. That is we expect the set of {ε, a, aa, aaa, ....}. So we give a regular expression for this
as:
R = a*
Example 2:
Write the regular expression for the language accepting all combinations of a's except the null string, over the
set ∑ = {a}
Solution:
This set indicates that there is no null string, so we can denote the regular expression as:
R = a+
Example 3:
Write the regular expression for the language accepting all the string containing any number of a's and b's.
Solution:
r.e. = (a + b)*
This will give the set as L = {ε, a, aa, b, bb, ab, ba, aba, bab, .....}, any combination of a and b.
The (a + b)* shows any combination with a and b even a null string.
State-0: C = GETCHAR();
         if LETTER(C) then goto State-1
         else FAIL()
State-1: C = GETCHAR();
         if LETTER(C) or DIGIT(C) then goto State-1
         else if DELIMITER(C) then goto State-2
         else FAIL()
State-2: RETRACT();
         RETURN(ID, INSTALL())
Here GETCHAR() returns the next character from the input buffer. LETTER(C) is a procedure that returns true if and only if C is a letter. FAIL() is a routine that retracts the lookahead pointer and starts up the next transition diagram, or otherwise calls the error routine. DIGIT(C) is a procedure that returns true if and only if C is a digit. DELIMITER(C) is a procedure that returns true if and only if C is a character that could follow the identifier, for example a blank, an arithmetic or logical operator, a left or right parenthesis, +, :, ;, etc. Because the delimiter is not part of the identifier, we must retract the lookahead pointer by one character, for which we use the RETRACT() procedure. Because an identifier has a value, we use the INSTALL() procedure to install the identifier's value in the symbol table.
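A runnable C sketch of this three-state recognizer, with GETCHAR, RETRACT, and INSTALL modeled as simple stand-ins (assumptions: the input is a fixed string, any non-letter, non-digit character counts as a delimiter, and INSTALL merely copies the lexeme instead of updating a real symbol table):

```c
#include <stdio.h>
#include <ctype.h>
#include <string.h>

static const char *src = "count1 ";   /* input ending in a delimiter */
static int pos = 0;

static int  GETCHAR(void)  { return src[pos++]; }
static void RETRACT(void)  { pos--; }   /* push back one character */

/* Stand-in for INSTALL(): copy the lexeme instead of inserting it
   into a symbol table. */
static void INSTALL(char *out, int start, int end) {
    memcpy(out, src + start, end - start);
    out[end - start] = '\0';
}

int main(void) {
    char lexeme[64];
    int start = pos, C;

    /* State 0: the first character must be a letter. */
    C = GETCHAR();
    if (!isalpha(C)) { printf("FAIL\n"); return 1; }

    /* State 1: stay here while letters or digits keep arriving. */
    for (;;) {
        C = GETCHAR();
        if (isalpha(C) || isdigit(C)) continue;   /* stay in state 1 */
        break;                     /* delimiter seen: go to state 2 */
    }

    /* State 2: retract the lookahead and return (ID, INSTALL()). */
    RETRACT();
    INSTALL(lexeme, start, pos);
    printf("(ID, %s)\n", lexeme);
    return 0;
}
```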
In compiler design, an identifier is a name given to a variable, function, or other programming language
construct. Identifiers must follow a set of rules and conventions to be recognized and interpreted correctly by
the compiler. One way to represent these rules is through a transition diagram, also known as a finite-state
machine.
The transition diagram for identifiers typically consists of several states, each representing a different stage in
the process of recognizing an identifier. Here is a high-level overview of the states and transitions involved:
1. Start state: This is the initial state of the diagram. It represents the point at which the compiler begins
scanning the input for an identifier.
2. First character state: In this state, the compiler has identified the first character of an identifier. The
next transition will depend on whether this character is a letter or an underscore.
3. Letter state: If the first character is a letter, the compiler moves into this state. The next transition will
depend on whether the next character is a letter, digit, or underscore.
4. Underscore state: If the first character is an underscore, the compiler moves into this state. The next
transition will depend on whether the next character is a letter, digit, or another underscore.
5. Digit state: If the first character is a digit, the compiler cannot recognize it as an identifier and will
move to an error state.
6. Identifier state: If the compiler successfully follows the appropriate transitions, it will eventually
reach an identifier state. This indicates that the sequence of characters scanned so far constitutes a
valid identifier.
7. Error state: If the compiler encounters an unexpected character or sequence of characters, it will move
to an error state. This indicates that the input does not constitute a valid identifier.
The transition diagram for identifiers can be more complex than this, depending on the specific rules and
conventions of the programming language. However, this basic structure provides a good starting point for
understanding how compilers recognize and interpret identifiers.
Finite Automata (Transition Diagram) − A Directed Graph or flowchart used to recognize token.
To recognize Token ("if"), Lexical Analysis has to read also the next character after "f". Depending upon the
next character, it will judge whether the "if" keyword or something else is.
"*" on Final State 3 means Retract, i.e., control will again come to previous state 2. Therefore Blank space is
not a part of the Token ("if").
Transition Diagram for an Identifier − An identifier starts with a letter followed by letters or Digits.
Transition Diagram will be:
For example, In statement int a2; Transition Diagram for identifier a2 will be:
As (;) is not part of Identifier ("a2"), so use "*" for Retract i.e., coming back to state 1 to recognize identifier
("a2").
Coding
State 0: C = Getchar()
         if letter(C) then goto State 1
         else Fail()
State 1: C = Getchar()
         if letter(C) or digit(C) then goto State 1
         else if delimiter(C) then goto State 2
         else Fail()
State 2: Retract()
         return (ID, Install())
In state 2, Retract() takes the lookahead pointer one character back, i.e., to state 1, and declares that whatever has been found up to state 1 is a token.
The lexical Analysis will return the token to the Parser, not in the form of an English word but the form of a
pair, i.e., (Integer code, value).
In the case of identifier, the integer code returned to the parser is 6 as shown in the table.
Install () − It will return a pointer to the symbol table, i.e., address of tokens.
The table (not reproduced here) shows the integer code and value of various tokens returned by the lexical analysis to the parser.
These integer values are not fixed. Different Programmers can choose other integer codes and values while
designing the Lexical Analysis.
Suppose, if the identifier is stored at location 236 in the symbol table, then
Integer code = 7
A simple hand-written lexical analyzer in C that recognizes keywords, identifiers, numbers, and operators:

```c
#include <stdio.h>
#include <ctype.h>
#include <string.h>

#define MAX_KEYWORDS 4
#define MAX_STRING_LENGTH 100

const char *keywords[MAX_KEYWORDS] = {"if", "else", "while", "return"};

/* Return 1 if the lexeme in buffer is a reserved keyword. */
int isKeyword(const char *buffer) {
    for (int k = 0; k < MAX_KEYWORDS; k++)
        if (strcmp(buffer, keywords[k]) == 0)
            return 1;
    return 0;
}

/* Scan the input string and print one token per lexeme. */
void lex(const char *p) {
    while (*p != '\0') {
        /* Recognize identifiers and keywords */
        if (isalpha(*p) || *p == '_') {
            char buffer[MAX_STRING_LENGTH];
            int i = 0;
            while (isalnum(*p) || *p == '_') {
                buffer[i++] = *p++;
            }
            buffer[i] = '\0';
            if (isKeyword(buffer)) {
                printf("Keyword: %s\n", buffer);
            } else {
                printf("Identifier: %s\n", buffer);
            }
        }
        /* Recognize numbers */
        else if (isdigit(*p)) {
            char buffer[MAX_STRING_LENGTH];
            int i = 0;
            while (isdigit(*p)) {
                buffer[i++] = *p++;
            }
            buffer[i] = '\0';
            printf("Number: %s\n", buffer);
        }
        /* Recognize operators */
        else if (strchr("+-*/=", *p)) {
            printf("Operator: %c\n", *p);
            p++;
        }
        /* Skip white space and anything unrecognized */
        else {
            p++;
        }
    }
}

int main() {
    char str[MAX_STRING_LENGTH];
    if (fgets(str, MAX_STRING_LENGTH, stdin))
        lex(str);
    return 0;
}
```
Syntax Analysis
Syntax analysis or parsing is the second phase of a compiler. In this chapter, we shall learn the basic concepts
used in the construction of a parser.
We have seen that a lexical analyzer can identify tokens with the help of regular expressions and pattern rules. But a lexical analyzer cannot check the syntax of a given sentence due to the limitations of regular expressions: regular expressions cannot check balancing of tokens, such as parentheses. Therefore, this phase uses context-free grammar (CFG), which is recognized by push-down automata.
Context-Free Grammar
In this section, we will first see the definition of context-free grammar and introduce terminologies used in
parsing technology.
A set of non-terminals (V). Non-terminals are syntactic variables that denote sets of strings. The
non-terminals define sets of strings that help define the language generated by the grammar.
A set of tokens, known as terminal symbols (Σ). Terminals are the basic symbols from which strings
are formed.
A set of productions (P). The productions of a grammar specify the manner in which the terminals and non-terminals can be combined to form strings. Each production consists of a non-terminal called the left side of the production, an arrow, and a sequence of tokens and/or non-terminals, called the right side of the production.
One of the non-terminals is designated as the start symbol (S); from where the production begins.
The strings are derived from the start symbol by repeatedly replacing a non-terminal (initially the start
symbol) by the right side of a production, for that non-terminal.
Example
We take the problem of palindrome language, which cannot be described by means of Regular Expression.
That is, L = { w | w = wR } is not a regular language. But it can be described by means of CFG, as illustrated
below:
G = ( V, Σ, P, S )
Where:
V = { Q, Z, N }
Σ = { 0, 1 }
P = { Q → Z | Q → N | Q → ε | Z → 0Q0 | N → 1Q1 }
S = { Q }
This grammar describes palindrome language, such as: 1001, 11100111, 00100, 1010101, 11111, etc.
Syntax Analyzers
A syntax analyzer or parser takes the input from a lexical analyzer in the form of token streams. The parser
analyzes the source code (token stream) against the production rules to detect any errors in the code. The
output of this phase is a parse tree.
This way, the parser accomplishes two tasks, i.e., parsing the code, looking for errors and generating a parse
tree as the output of the phase.
Parsers are expected to parse the whole code even if some errors exist in the program. Parsers use error
recovering strategies, which we will learn later in this chapter.
Derivation
A derivation is basically a sequence of production rules used to get the input string. During parsing, we take two decisions for some sentential form of input:
1. Deciding the non-terminal which is to be replaced.
2. Deciding the production rule by which the non-terminal will be replaced.
To decide which non-terminal to replace with which production rule, we have two options, described below.
Left-most Derivation
If the sentential form of an input is scanned and replaced from left to right, it is called left-most derivation.
The sentential form derived by the left-most derivation is called the left-sentential form.
Right-most Derivation
If we scan and replace the input with production rules, from right to left, it is known as right-most derivation.
The sentential form derived from the right-most derivation is called the right-sentential form.
Example
Production rules:
E → E + E
E → E * E
E → id
Input string: id + id * id
E → E + E
E → E + E * E
E → E + E * id
E → E + id * id
E → id + id * id
Parse Tree
A parse tree is a graphical depiction of a derivation. It is convenient to see how strings are derived from the
start symbol. The start symbol of the derivation becomes the root of the parse tree. Let us see this by an
example from the last topic.
E → E * E
E → E + E * E
E → id + E * E
E → id + id * E
E → id + id * id
Step 1:
E→E*E
Step 2:
E→E+E*E
Step 3:
E → id + E * E
Step 4:
E → id + id * E
Step 5:
E → id + id * id
In a parse tree:
All leaf nodes are terminals.
All interior nodes are non-terminals.
In-order traversal gives the original input string.
A parse tree depicts associativity and precedence of operators. The deepest sub-tree is traversed first,
therefore the operator in that sub-tree gets precedence over the operator which is in the parent nodes.
Ambiguity
A grammar G is said to be ambiguous if it has more than one parse tree (left or right derivation) for at least
one string.
Example
E → E + E
E → E – E
E → id
For the string id + id – id, the above grammar generates two parse trees:
Associativity
If an operand has operators on both sides, the side on which the operator takes this operand is decided by the
associativity of those operators. If the operation is left-associative, then the operand will be taken by the left
operator or if the operation is right-associative, the right operator will take the operand.
Example
Operations such as Addition, Multiplication, Subtraction, and Division are left associative. If the expression
contains:
id op id op id
(id op id) op id
Operations like Exponentiation are right associative, i.e., the order of evaluation in the same expression will
be:
id op (id op id)
For example, precedence groups 2 + 3 * 4 as 2 + (3 * 4) = 14, because * has higher precedence than +.
Left Recursion
A grammar becomes left-recursive if it has any non-terminal ‘A’ whose derivation contains ‘A’ itself as the
left-most symbol. Left-recursive grammar is considered to be a problematic situation for top-down parsers.
Top-down parsers start parsing from the Start symbol, which in itself is non-terminal. So, when the parser
encounters the same non-terminal in its derivation, it becomes hard for it to judge when to stop parsing the
left non-terminal and it goes into an infinite loop.
Example:
(1) A => Aα | β
(2) S => Aα | β
A => Sd
(1) is an example of immediate left recursion, where A is any non-terminal symbol and α represents a string
of non-terminals.
A top-down parser will first parse the A, which in-turn will yield a string consisting of A itself and the parser
may go into a loop forever.
The production
A => Aα | β
is converted into following productions
A => βA'
A'=> αA' | ε
This does not impact the strings derived from the grammar, but it removes immediate left recursion.
Second method is to use the following algorithm, which should eliminate all direct and indirect left
recursions.
START
Arrange the non-terminals in some order: A1, A2, ..., An
for each i from 1 to n
  for each j from 1 to i−1
    replace each production of the form Ai ⟹ Aj𝜸
    with Ai ⟹ δ1𝜸 | δ2𝜸 | ... | δn𝜸,
    where Aj ⟹ δ1 | δ2 | ... | δn are the current Aj-productions
  end
  eliminate immediate left recursion from the Ai-productions
end
END
Example
S => Aα | β
A => Sd
First, substitute the S-productions into A => Sd:
S => Aα | β
A => Aαd | βd
and then remove the immediate left recursion using the first technique.
A => βdA'
A' => αdA' | ε
Now none of the production has either direct or indirect left recursion.
Left Factoring
If more than one production rule of a grammar has a common prefix string, then the top-down parser cannot make a choice as to which of the productions it should take to parse the string in hand.
Example
A ⟹ αβ | α𝜸 | …
Then it cannot determine which production to follow to parse the string as both productions are starting from
the same terminal (or non-terminal). To remove this confusion, we use a technique called left factoring.
Left factoring transforms the grammar to make it useful for top-down parsers. In this technique, we make one production for each common prefix, and the rest of the derivation is added by new productions.
Example
A => αA'
A'=> β | 𝜸 | …
Now the parser has only one production per prefix which makes it easier to take decisions.
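For instance (a small worked example): the grammar A → ab | ac has the common prefix a, so left factoring gives
A → aA'
A' → b | c
After reading a, the parser can defer the choice between b and c until the next input symbol is seen.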
First Set
This set is created to know what terminal symbol is derived in the first position by a non-terminal. For example, if
α → t β
where t is a terminal, then t is in FIRST(α).
Follow Set
Likewise, we calculate what terminal symbol immediately follows a non-terminal α in production rules. We
do not consider what the non-terminal can generate but instead, we see what would be the next terminal
symbol that follows the productions of a non-terminal.
These tasks are accomplished by the semantic analyzer, which we shall study in Semantic Analysis.
Role of parser:
A parser in a compiler checks the syntax of the source code and builds a data structure called a parse
tree.
In more detail, a parser is a crucial component of a compiler, which is a program that translates source
code written in a programming language into machine code that a computer can understand and
execute.
The parser's role is to ensure that the source code is syntactically correct, meaning it adheres to the
rules and structure of the language in which it is written.
If the source code does not follow these rules, the parser will generate an error message, and the
compilation process will stop.
The parser operates after the lexical analysis phase of the compiler, which breaks down the source
code into individual words or tokens.
The parser takes these tokens and checks them against the grammar of the language.
This grammar defines how tokens can be combined to form valid statements and expressions.
If the tokens follow the grammar rules, the parser will construct a parse tree, a hierarchical data
structure that represents the syntactic structure of the source code.
The parse tree is then used in the next stages of the compilation process, such as semantic analysis
and code generation.
The semantic analysis phase checks that the source code makes sense in the context of the language's
semantics, while the code generation phase translates the parse tree into machine code.
1. It verifies the structure generated by the tokens based on the grammar.
2. It constructs the parse tree.
3. It reports the errors.
4. It performs error recovery.
Context-Free Grammars:
Context-free grammar (CFG) is a type of formal grammar: the syntax or structure of a formal language can be described using a CFG. The grammar has four tuples: (V, T, P, S).
The left-hand side of a production in G can only be a variable; it cannot be a terminal. On the right-hand side there can be a variable, a terminal, or any combination of variables and terminals.
Every production that derives any combination of variables from 'V' and terminals from 'T' from a single variable is a context-free production. For example,
S -> aS
S -> bSa
are valid CFG productions, but
a -> bSa, or
a -> ba
is not a CFG, as the left-hand side is a terminal, which does not follow the CFG rules.
In the computer science field, context-free grammars are frequently used, especially in the areas of formal
language theory, compiler development, and natural language processing. It is also used for explaining the
syntax of programming languages and other formal languages.
Derivations:
Types
• Leftmost derivation.
• Rightmost derivation.
Leftmost Derivation
In leftmost derivation, at each and every step the leftmost non-terminal is expanded by substituting its
corresponding production to derive a string.
Example
Rightmost Derivation
In rightmost derivation, at each and every step the rightmost non-terminal is expanded by substituting its
corresponding production to derive a string.
Example
Parse Trees:
Parse : It means to resolve (a sentence) into its component parts and describe their syntactic roles or
simply it is an act of parsing a string or a text.
Tree: A tree is a widely used abstract data type that simulates a hierarchical tree structure, with a root value and sub-trees of children with a parent node, represented as a set of linked nodes.
Parse Tree:
For example, parse trees can be drawn for grammars such as:
S -> sAB
A -> a
B -> b
S -> AB
A -> c/aA
B -> d/bB
It helps in making syntax analysis by reflecting the syntax of the input language.
It uses an in-memory representation of the input with a structure that conforms to the grammar.
The advantage of using parse trees rather than immediate semantic actions: you can make multiple passes over the data without having to re-parse the input.
Ambiguity:
A grammar is said to be ambiguous if there exists more than one leftmost derivation or more than one
rightmost derivation or more than one parse tree for the given input string. If the grammar is not ambiguous,
then it is called unambiguous.
If the grammar has ambiguity, then it is not good for compiler construction. No method can automatically
detect and remove the ambiguity, but we can remove ambiguity by re-writing the whole grammar without
ambiguity.
Example 1:
Let us consider a grammar G with the production rule
1. E→I
2. E→E+E
3. E→E*E
4. E → (E)
5. I → ε | 0 | 1 | 2 | ... | 9
Solution:
For the string "3 * 2 + 5", the above grammar can generate two parse trees by leftmost derivation:
Since there are two parse trees for a single string "3 * 2 + 5", the grammar G is ambiguous.
Example 2:
Check whether the given grammar G is ambiguous or not.
1. E → E + E
2. E → E - E
3. E → id
Solution:
From the above grammar, the string "id + id - id" can be derived in two ways:
First leftmost derivation:
E → E + E
  → id + E
  → id + E - E
  → id + id - E
  → id + id - id
Second leftmost derivation:
E → E - E
  → E + E - E
  → id + E - E
  → id + id - E
  → id + id - id
Since there are two leftmost derivations for the single string "id + id - id", the grammar G is ambiguous.
Example 3:
Check whether the given grammar G is ambiguous or not.
1. S → aSb | SS
2. S → ε
Solution:
For the string "aabb" the above grammar can generate two parse trees
Since there are two parse trees for a single string "aabb", the grammar G is ambiguous.
Example 4:
Check whether the given grammar G is ambiguous or not.
1. A → AA
2. A → (A)
3. A → a
Solution:
For the string "a(a)aa" the above grammar can generate two parse trees:
Since there are two parse trees for a single string "a(a)aa", the grammar G is ambiguous.
Left Recursion:
A Grammar G (V, T, P, S) is left recursive if it has a production in the form.
A → A α |β.
The above Grammar is left recursive because the left of production is occurring at a first position on the right
side of production. It can eliminate left recursion by replacing a pair of production with
A → βA′
A′ → αA′ | ϵ
Left Factoring: When more than one production for a non-terminal shares a common prefix, the common prefix is factored out into a new production so that a top-down parser can choose deterministically (see the Left Factoring section above).
Datastructures used in lexical analysis:
Lexical analysis, also known as scanning, is the first phase of a compiler. It processes the
input source code to produce a sequence of tokens. To accomplish this task efficiently,
several data structures are commonly employed. Here are the primary data structures used
in lexical analysis:
### 1. **Finite State Machines (FSMs)**
Finite State Machines, particularly Deterministic Finite Automata (DFA) and Non-Deterministic Finite Automata (NFA), are foundational in lexical analysis. These automata are used to recognize patterns in the input string.
**NFA:**
- Used initially to describe lexical patterns because NFAs are more flexible and easier to construct from regular expressions.
- Consists of states, transitions between states, an initial state, and one or more accepting states.
**DFA:**
- NFAs are often converted into DFAs for practical implementation, as DFAs do not have ambiguities and are more efficient for scanning input strings.
- In a DFA, for each state and input symbol, there is exactly one transition to a next state.
### 2. **Symbol Table**
The symbol table is a data structure used to store information about identifiers (e.g.,
variable names, function names) encountered in the source code.
- **Hash Tables:** Frequently used due to their average O(1) time complexity for insertions,
deletions, and lookups.
- **Binary Search Trees (BSTs):** Sometimes used, especially when the order of identifiers
needs to be preserved or when the table requires frequent range queries.
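A minimal C sketch of such a chained hash-table symbol table (illustrative only; real compilers store far richer entries than a name and a type):

```c
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define BUCKETS 64

typedef struct Sym {
    char name[32];
    char type[16];
    struct Sym *next;       /* chaining resolves hash collisions */
} Sym;

static Sym *table[BUCKETS];

static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % BUCKETS;
}

/* Insert an identifier with its type (no duplicate checking here). */
void insert(const char *name, const char *type) {
    Sym *s = malloc(sizeof *s);
    strcpy(s->name, name);
    strcpy(s->type, type);
    unsigned h = hash(name);
    s->next = table[h];
    table[h] = s;
}

/* Average O(1) lookup, as noted above. */
Sym *lookup(const char *name) {
    for (Sym *s = table[hash(name)]; s; s = s->next)
        if (strcmp(s->name, name) == 0) return s;
    return NULL;
}

int main(void) {
    insert("num1", "int");
    Sym *s = lookup("num1");
    if (s) printf("%s : %s\n", s->name, s->type);
    return 0;
}
```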
### 3. **Buffer**
Buffers are used to manage input streams efficiently. Two common buffering techniques
are:
- **Single Buffering:** Simple but can be inefficient due to frequent I/O operations.
- **Double Buffering:** Uses two buffers to reduce I/O operations. While one buffer is being
processed, the other is being filled with input data.
### 4. **Lexeme Table**
A lexeme table stores the lexemes (actual character sequences) identified during scanning. This table is useful for quickly retrieving the lexemes corresponding to tokens.
### 5. **Trie**
Tries are used for efficient storage and retrieval of keywords, especially when handling reserved words in programming languages. They allow for quick prefix-based searches.
### 6. **Transition Table**
A transition table represents the state transitions in a DFA or NFA. This table is often implemented as a two-dimensional array where the rows represent states and the columns represent input symbols.
### 7. **Character Classes**
Character classes group sets of characters (e.g., digits, letters) into categories. This simplifies the state machine and makes pattern matching more efficient.
### 8. **Stack**
A stack can be used to handle nested structures, such as nested comments or parentheses
in source code. It helps manage the scope and context while scanning the input.
1. **Input Reading:** Reads the source text efficiently, typically via the buffering techniques described above.
2. **Pattern Matching:** Uses FSMs (DFA/NFA) to match input strings against patterns defined by the language's grammar.
3. **Token Generation:** Produces tokens and, for identifiers, interacts with the symbol
table to store and retrieve information.
4. **Handling Reserved Words:** Uses a trie or hash table to quickly identify reserved
words.
By leveraging these data structures, a lexical analyzer can efficiently process source code
and produce a meaningful sequence of tokens for further stages of compilation.
A literal table is a data structure used to keep track of literals in the program. It holds the constants and strings used in the program; each can appear only once in the literal table, and its contents apply to the whole program, which is why deletions are not necessary. The literal table allows the reuse of constants and strings, which plays an important role in reducing the program size.
A parse tree is the hierarchical representation of symbols; the symbols include terminals and non-terminals. In the parse tree, the string is derived from the starting symbol, and the starting symbol is the root of the parse tree. All the leaf nodes are terminal symbols and the inner nodes are the operators or non-terminals. To recover the derived string, we can use in-order traversal.
Nfa to dfa:
An NFA can have zero, one or more than one move from a given state on a given input symbol. An NFA can
also have NULL moves (moves without input symbol). On the other hand, DFA has one and only one move
from a given state on a given input symbol.
Remove unreachable states: States that cannot be reached from the start state can be removed from the DFA.
Remove dead states: States that cannot lead to a final state can be removed from the DFA.
Merge equivalent states: States that have the same transition rules for all input symbols can be merged into a
single state.
(refer notes)
Parsing techniques:
Parsing is also known as syntax analysis.
It consists of arranging the tokens of the source code into grammatical phrases that are used by the compiler to synthesize output; generally, the grammatical phrases of the source code are described by the parse tree.
Top-down Parsing: When the parser generates the parse tree by top-down expansion, following the left-most derivation of the input, it is called top-down parsing. Top-down parsing starts with the start symbol and ends on the terminals. Such parsing is also known as predictive parsing.
Recursive Descent Parsing: Recursive descent parsing is a type of top-down parsing technique. This technique uses a procedure for every terminal and non-terminal entity. It reads the input from left to right and constructs the parse tree from the top down. As the technique works recursively, it is called recursive descent parsing.
Back-tracking: A parsing technique that starts from the initial pointer, the root node. If a derivation fails, it restarts the process with different rules.
Bottom-up Parsing: Bottom-up parsing works just the reverse of top-down parsing: it traces out the rightmost derivation of the input in reverse, reducing the input until it reaches the start symbol.
Shift-Reduce Parsing: Shift-reduce parsing works on two steps: Shift step and Reduce
step.
Shift step: The shift step advances the input pointer to the next input symbol, shifting that symbol onto the stack.
Reduce Step: When the parser has the complete right-hand side of a grammar rule on top of the stack, it replaces it with the rule's left-hand side (a reduction).
LR Parsing: The LR parser is one of the most efficient syntax-analysis techniques, as it works with context-free grammars. In LR parsing, L stands for left-to-right scanning of the input, and R stands for constructing a rightmost derivation in reverse.
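As a small illustration (using the expression grammar E → E + E | id seen earlier), a shift-reduce parser processes the input id + id as follows:

Stack        Input        Action
$            id + id $    shift id
$ id         + id $       reduce by E → id
$ E          + id $       shift +
$ E +        id $         shift id
$ E + id     $            reduce by E → id
$ E + E      $            reduce by E → E + E
$ E          $            accept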
Why is parsing useful in compiler designing?
In the world of software, every different entity has its criteria for the data to be processed. So parsing is the
process that transforms the data in such a way so that it can be understood by any specific software.
In the context of compiler design, the terms "phases of a compiler" and "passes of a compiler" refer to different aspects of the compilation process. Here's a detailed comparison.

### Phases of a Compiler

1. **Lexical Analysis:**
- **Function:** Converts the stream of source characters into tokens.
- **Output:** Token stream.
2. **Syntax Analysis (Parsing):**
- **Function:** Checks the token arrangements against the grammar and builds the parse tree.
- **Output:** Parse tree (syntax tree).
3. **Semantic Analysis:**
- **Function:** Checks for semantic errors and ensures the meaning of the
syntax is consistent with the language's rules.
- **Output:** Annotated syntax tree (AST with type information and other
semantic annotations).
4. **Intermediate Code Generation:**
- **Function:** Produces a machine-independent intermediate representation of the program.
- **Output:** Intermediate code (e.g., three-address code).
5. **Optimization:**
- **Function:** Improves the intermediate code to make it more efficient
(e.g., faster execution, reduced memory usage).
6. **Code Generation:**
- **Function:** Maps the optimized intermediate code to target machine code.
- **Output:** Relocatable machine code or assembly.
### Passes of a Compiler

A pass refers to a single traversal over the entire source code or intermediate representation. A compiler can be either a single-pass compiler or a multi-pass compiler:
- **Single-Pass Compiler:**
- **Function:** Completes all the compilation phases in one pass over the
source code.
- **Multi-Pass Compiler:**
- **Function:** Traverses the source code or its intermediate representation more than once, for example:
1. **First Pass:** Analysis phases, up to generating intermediate code.
2. **Second Pass:** Optimization of the intermediate code.
3. **Third Pass:** Final machine code generation.
A multi-pass compiler might, for example, perform all phases in the first pass
(up to generating intermediate code), then use subsequent passes to optimize
and generate final machine code. Alternatively, it might interleave phases such
as performing semantic analysis and intermediate code generation in separate
passes to allow for intermediate optimizations.
The front end of a compiler is responsible for the initial stages of the
compilation process, starting from the source code and producing an
intermediate representation (IR) that captures the program's syntax and
semantics. The main tasks of the front end include:
1. **Lexical Analysis:** converting the character stream into tokens.
2. **Syntax Analysis:** building the parse tree from the tokens.
3. **Semantic Analysis:** checking meaning and annotating the tree.
The back end then takes over and performs:
1. **Optimization:** improving the intermediate representation.
2. **Code Generation:** emitting the target machine code.
- **Responsibilities:** The back end's work starts from the intermediate representation rather than the source text.
- **Language Independence:** Because it works on the IR, the back end is largely independent of the source language.
- **Optimization:** Most machine-specific optimization is concentrated in the back end.
### Summary
In summary, the front end of a compiler deals with the analysis and
understanding of the source code, producing an intermediate representation
(IR) that captures the program's semantics. The back end then takes this IR
and performs optimizations and code generation to produce efficient executable
code. Together, the front end and back end work in tandem to translate high-
level source code into machine-executable instructions.