Unit 1 CD Own
Compilers:
A compiler is a type of translator that translates high-level programming code written in
languages like C, C++, Java, or Python into low-level machine code. The primary purpose
of a compiler is to convert source code into an executable program. Here's how it works:
1. Scanning or Lexical Analysis: The first phase of the compiler, known as lexical
analysis, involves breaking the source code into individual tokens (e.g., keywords,
identifiers, operators, and constants).
2. Parsing or Syntax Analysis: In this phase, the compiler checks the syntax of the
code, ensuring it adheres to the language's grammar rules. It builds a parse tree or
an abstract syntax tree (AST) to represent the program's structure.
3. Semantic Analysis: The compiler performs a deeper analysis to check for semantic
errors, like type mismatches or undeclared variables. It also resolves variables and
expressions.
5. Linker: A linker combines multiple object files or libraries into a single executable
program. It resolves external references and sets up the program's memory layout.
Each type of translator has its own role in the software development process, with
compilers being particularly crucial for turning high-level code into machine code.
Translators are essential components in the field of computer science and software
development for various reasons. Here are some of the key needs for translators:
6. Security: By translating high-level code into machine code, translators can add a
layer of security to the program. Machine code is less human-readable and harder
to tamper with, making it more difficult for malicious actors to exploit vulnerabilities.
7. Portability: Translators help achieve code portability, meaning that programs can be
moved from one platform to another with minimal effort. This is particularly
important for software developers who want to make their programs accessible to a
wide range of users and environments.
8. Code Reusability: Translators allow for the reuse of libraries and code
components written by others. For example, you can use libraries in different
programming languages within your codebase.
Phases of a Compiler
1. Lexical Analysis
The first phase is lexical analysis, where the source code is analyzed to break it
down into individual tokens (such as keywords, identifiers, operators, and
literals).
The output of this phase is a stream of tokens that represent the basic building
blocks of the program
The lexical analyzer reads the stream of characters making up the source program and
groups the characters into meaningful sequences called lexemes. For each lexeme, the
lexical analyzer produces as output a token of the form <token-name, attribute-value>
that it passes on to the subsequent phase, syntax analysis.
In the token, the first component token-name is an abstract symbol that is used
during syntax analysis, and
the second component attribute-value points to an entry in the symbol table for
this token.
Information from the symbol-table entry is needed for semantic analysis and code
generation
For example, suppose a source program contains the assignment statement position =
initial + rate * 60
The characters in this assignment could be grouped into the following lexemes and
mapped into the following tokens passed on to the syntax analyzer:
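In the standard textbook treatment of this example, position, initial, and rate are each
mapped to a token <id, k>, where id is the token name and k points to the symbol-table
entry for that identifier; the operators =, + and * become the tokens <=>, <+> and <*>;
the lexeme 60 becomes the token <60>; and the blanks separating the lexemes are
discarded. The statement is therefore passed to the syntax analyzer as the token sequence

    <id,1> <=> <id,2> <+> <id,3> <*> <60>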
2. Syntax Analysis
o In this phase, the compiler checks the syntax of the source code to
ensure it follows the language's grammar rules.
3. Semantic Analysis
Semantic analysis ensures that the code is not only syntactically correct but also
semantically meaningful.
Suppose that position, initial, and rate have been declared to be floating-point
numbers, and that the lexeme 60 by itself forms an integer. The type checker in the
semantic analyser discovers that the operator * is applied to a floating-point number
rate and an integer 60. In this case, the integer may be converted into a floating-point
number.
4. Intermediate Code Generation
Syntax trees are a form of intermediate representation; they are commonly used
during syntax and semantic analysis. This intermediate representation should have
two important properties: it should be easy to produce and it should be easy to
translate into the target machine.
The output of the intermediate code generator consists of the three-address code
sequence shown below.
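In the standard textbook rendering of this example, that sequence is:

    t1 = inttofloat(60)
    t2 = id3 * t1
    t3 = id2 + t2
    id1 = t3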
There are several points worth noting about three-address instructions. First, each
three-address assignment instruction has at most one operator on the right side.
Second, the compiler must generate a temporary name to hold the value computed by a
three-address instruction.
5. Code Optimization
The goal is to make the generated code faster and more space-efficient.
6. Code Generation
o In this phase, the compiler translates the intermediate code or AST into
low-level code that can be executed on a specific target architecture
(e.g., assembly language or machine code).
7. Symbol-Table Management
The symbol table helps in scope resolution, type checking, and generating
the correct machine code.
8. Error Handling
Each phase can detect errors; the compiler reports them with enough context for the
programmer to locate the problem and, where possible, recovers so that later errors
can still be detected.
The structure of a compiler can vary slightly depending on the specific compiler
design and language it targets. Additionally, some modern compilers may combine
or rearrange certain phases for optimization and performance reasons.
Nonetheless, these fundamental phases provide a clear overview of the compilation
process from source code to executable program.
Lex (Lexical Analyzer Generator)
Purpose: Lex is a tool used to generate a lexical analyser or lexer. The lexer reads the input
stream of characters (source code) and groups them into meaningful sequences called
lexemes, which are then classified into tokens.
How it works:
Lex takes a set of patterns (usually written as regular expressions) and converts
them into a C program that can recognize those patterns in the input.
The output of Lex is a program that reads input and produces tokens as output,
which can then be fed into a parser for further syntactic analysis.
Yacc (Yet Another Compiler-Compiler): Yacc is a tool for generating parsers. It takes
a grammar specification file and generates code for syntax analysis, typically in the form
of a parser that constructs a parse tree or an abstract syntax tree (AST).
Purpose: YACC is a tool used to generate a parser. A parser processes tokens from the
lexical analyzer and checks them against the grammatical rules of the programming
language (defined by a Context-Free Grammar). YACC produces a C program that
performs syntax analysis.
How it works:
The parser uses shift-reduce parsing techniques (LR parsing) to analyse the
structure of the input and check if it conforms to the defined grammar
Common Use: YACC is used for generating parsers for programming languages or data
formats. It works hand-in-hand with Lex, which provides the tokens that YACC uses for
parsing.
1. Lex reads the input source code and generates tokens based on predefined
patterns.
2. YACC then uses the tokens to build a parse tree, ensuring the input adheres to the
grammatical structure of the language.
Bison
Bison is the GNU implementation of Yacc: it accepts Yacc-compatible grammar
specifications and generates an LALR parser, typically in C.
Here are some tips for using Flex and Bison together:
Call the scanner from the parser
To build the scanner and parser into a working program, you can include a
header file created by Bison in the scanner. You can also delete the testing
main routine in the scanner, since the parser will now call the scanner.
You can modify the Yacc/Bison file to include the symbol table and routines
to install an identifier in the symbol table and perform context checking
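As a rough illustration of how the generated pieces fit together, the sketch below assumes
a scanner and parser produced with, say, flex scanner.l and bison -d parser.y (the file
names are placeholders; a real project would also add grammar actions and symbol-table
routines):

#include <stdio.h>

int yyparse(void);              /* generated by Bison; it calls yylex() internally   */

void yyerror(const char *msg)   /* Bison reports syntax errors through this routine  */
{
    fprintf(stderr, "syntax error: %s\n", msg);
}

int main(void)
{
    return yyparse();           /* returns 0 on success, nonzero on a syntax error   */
}

Compiling the generated lex.yy.c and parser.tab.c together with this driver (for example
cc lex.yy.c parser.tab.c driver.c -lfl, depending on the Flex options used) yields a single
executable in which the parser repeatedly calls the scanner for tokens.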
ANTLR (ANother Tool for Language Recognition): ANTLR is a powerful and widely used
tool for generating parsers and lexers. It supports various target languages, including Java,
C#, Python, and others. ANTLR works with context-free grammars and generates parsers
that can build parse trees.
JavaCC (Java Compiler Compiler): JavaCC is a parser generator specifically designed for
Java. It allows you to define your language grammar and generates Java code for parsing and
processing that language.
LALR (Look-Ahead LR, i.e., Left-to-Right scan, Rightmost derivation in reverse): LALR parser
generators like Bison and byacc (Berkeley Yacc) are popular for their efficiency in generating
parsers for many programming languages. They work well for context-free grammars.
LL (Left-to-Right, Leftmost Derivation) Parser Generators: Tools like ANTLR and JavaCC are
examples of LL parser generators. They are suitable for creating parsers for languages with LL
grammars.
Code Generation Tools: For generating machine code or assembly code, compiler
developers often use tools specific to the target architecture. These tools might include
assemblers, linkers, and loaders
Code-generator generators produce a code generator from a collection of rules for
translating each operation of the intermediate language into the machine language for a
target machine.
These tools can significantly expedite the process of building a compiler, making it more
efficient and less error-prone. The choice of tool depends on various factors, including the
target language, the complexity of the grammar, and the desired output format (e.g., AST or
machine code). Compiler developers often select the tool that best aligns with their project's
requirements and their familiarity with the tool itself.
Role of the Lexical Analyzer
1. Tokenization: The primary task of the lexical analyzer is to read the stream of
characters in the source program and group them into lexemes, producing a token
for each lexeme that it passes on to the parser.
2. Error Detection: The lexical analyzer can identify and report lexical errors,
such as misspelled or undefined tokens. This initial error checking can save
time in later phases of the compilation process.
3. Building a Symbol Table: In some compilers, the lexical analyzer may start
building a symbol table, a data structure used to keep track of all the identifiers
(variables, functions, etc.) in the program. It records the names, types, and
positions of these identifiers for later reference. In some cases, information
regarding the kind of identifier may be read from the symbol table by the lexical
analyzer to assist it in determining the proper token it must pass to the parser
4. Line Number Tracking: The lexical analyzer often keeps track of line numbers and
column positions within the source code. This information can be helpful for
producing meaningful error messages and for debugging.
5. Generating Output: After identifying and categorizing tokens, the lexical analyzer
generates output in the form of a stream of tokens or a sequence of (token,
attribute) pairs. This output is usually passed on to the next phase of the
compiler, which is the syntax analyzer (parser).
The output of the lexical analysis is used as input for the subsequent phases of the
compiler, particularly the parser. The parser then constructs a hierarchical
representation of the program's structure, often in the form of a parse tree or an
abstract syntax tree (AST).
In summary, the lexical analyzer is responsible for scanning the source code, breaking
it into tokens, and performing basic error checking. Its role is critical in making the
source code more manageable for subsequent phases of the compiler, which involve
parsing, semantic analysis, and ultimately code generation.
Sometimes, lexical analysers are divided into a cascade of two processes:
a) Scanning consists of the simple processes that do not require tokenization of the
input, such as deletion of comments and compaction of consecutive whitespace
characters into one.
b) Lexical analysis proper is the more complex portion, which produces tokens from
the output of the scanner.
1. Lexeme: A lexeme is the actual sequence of characters in the source program that
matches the pattern for a token, for example the identifier count or the constant 123.
2. Pattern: A pattern is a rule, usually written as a regular expression, that describes
the set of strings (lexemes) that can represent a particular token.
o For example, a pattern for an integer literal might be \d+, meaning any
sequence of one or more digits. This pattern would match lexemes like
123 or 456.
3. Token: A token is the category assigned to a lexeme, together with any attribute
information; it is the unit that the lexical analyzer passes to the parser (a small C
sketch of such a token appears after this list).
o Tokens are often represented as pairs: the token type (or name) and,
optionally, the lexeme itself (or an associated attribute). For example, in
the lexeme 123, the token might be <INT, 123>, where INT indicates an
integer token and 123 is the lexeme.
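As a rough sketch (not tied to any particular compiler; the names TokenType, Token, and
scan_int are invented for illustration), such a token can be represented in C as a small
record, and the pattern \d+ recognized by a simple loop:

#include <ctype.h>

typedef enum { TOK_INT, TOK_ID, TOK_PLUS, TOK_EOF } TokenType;

typedef struct {
    TokenType type;        /* the token name, e.g. TOK_INT               */
    char      lexeme[64];  /* the matched character sequence (attribute) */
} Token;

/* Recognize the pattern \d+ starting at *p and build an <INT, lexeme> token. */
static Token scan_int(const char **p)
{
    Token t = { TOK_INT, "" };
    int   n = 0;
    while (isdigit((unsigned char)**p) && n < 63)
        t.lexeme[n++] = *(*p)++;
    t.lexeme[n] = '\0';
    return t;
}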
Summary of Differences:
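In short: the lexeme is the actual string of characters found in the source program; the
pattern is the rule (typically a regular expression) that describes which strings may form
a given token; and the token is the token name plus an optional attribute value that the
lexical analyzer hands to the parser.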
Input Buffering
Input buffering is a crucial concept in the context of lexical analysis and parsing within a
compiler. It refers to the practice of reading and processing the source code text in
chunks or buffers rather than character by character. Input buffering is used to improve
the efficiency of the lexical analysis phase and other phases of compilation. Here's why
input buffering is important:
1. Efficiency: Reading and processing a file character by character can be slow and
inefficient. Input buffering involves reading a portion of the source code into a
buffer (a temporary storage area) and then processing that buffer. This reduces the
number of file read operations, making the compilation process faster.
2. Reduced I/O Overhead: File input and output (I/O) operations are relatively slow
compared to in-memory processing. By buffering the input, the compiler minimizes
the number of disk or file system reads, which can be a bottleneck in the
compilation process.
4. Lookahead: In some cases, the lexical analyzer or parser needs to look ahead at the
upcoming characters in the source code to determine the correct token. Input
buffering allows the compiler to read a few characters ahead and make tokenization
decisions based on the buffered data.
5. Parsing Simplicity: During parsing, syntax analysis often involves examining several
characters at a time to recognize keywords or operators. Input buffering simplifies
this process, as the parser can work with a buffer of characters rather than
individual characters.
6. Error Reporting: When a lexical or syntax error is encountered, the context provided
by input buffering can help generate more informative error messages. The compiler
can show the portion of the source code containing the error and highlight the
specific characters involved.
7. Efficient Memory Usage: Buffering allows for efficient use of memory. Instead of
loading the entire source code into memory, which may not be feasible for very large
files, the compiler can load smaller portions as needed, keeping memory usage
manageable.
In practice, input buffering can involve reading a fixed-size chunk of the source code at a
time or dynamically adjusting the buffer size based on the needs of the compiler. The
size of the buffer is chosen to strike a balance between minimizing I/O operations and
efficiently utilizing memory.
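A minimal sketch of this idea in C is shown below: the input is read from the file in
fixed-size blocks, and the scanner sees the buffer through a peek/advance interface that
gives it one character of lookahead (the buffer size and function names are illustrative):

#include <stdio.h>

#define BUF_SIZE 4096

static char   buf[BUF_SIZE];
static size_t pos = 0, len = 0;

/* Refill the buffer with the next block of the file; returns 0 at end of input. */
static int refill(FILE *fp)
{
    len = fread(buf, 1, BUF_SIZE, fp);
    pos = 0;
    return len > 0;
}

/* Look at the next character without consuming it (one-character lookahead). */
static int peek(FILE *fp)
{
    if (pos >= len && !refill(fp))
        return EOF;
    return (unsigned char)buf[pos];
}

/* Consume and return the next character. */
static int advance(FILE *fp)
{
    int c = peek(fp);
    if (c != EOF)
        pos++;
    return c;
}

Classic two-buffer schemes with sentinel characters refine this further, folding the
end-of-buffer test into the ordinary character test.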
Designing a Simple Lexical Analyzer
Define Token Types: Start by defining the token types that your lexical analyzer will
recognize. These include keywords, identifiers, constants (integers, floats, strings),
operators, delimiters, and comments. Create a list of these token types.
Write Regular Expressions: Write regular expressions for each token type to describe
their patterns. Regular expressions are used to match and extract substrings that
correspond to tokens. For example:
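A few representative patterns (illustrative only; the exact set depends on the language
being implemented):

    identifier        [A-Za-z_][A-Za-z0-9_]*
    integer constant  [0-9]+
    float constant    [0-9]+\.[0-9]+
    keyword           if|else|while|return
    operator          \+|-|\*|/|==|!=|<=|>=|=|<|>
    line comment      //[^\n]*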
Tokenize the Input: Read the source code character by character, and use the regular
expressions to match the patterns of token types. As you find matches, extract the
substrings and create tokens with a token type and attribute (the matched substring).
Handle Whitespace and Comments: Ignore whitespace characters (e.g., spaces, tabs, line
breaks) and comments during tokenization. You can skip these characters to simplify
token extraction.
Error Handling: Implement error handling to deal with unexpected characters or invalid
token patterns. You might want to report a syntax error or an unrecognized token when
such issues occur.
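Putting the steps above together, a minimal hand-written tokenizer might look like the
following C sketch (the token names and the get_token interface are invented for
illustration, and keywords are not separated from identifiers here):

#include <ctype.h>
#include <stdio.h>
#include <string.h>

typedef enum { T_ID, T_NUM, T_OP, T_EOF, T_ERROR } TokType;

typedef struct {
    TokType type;
    char    text[64];      /* the attribute: the matched lexeme */
} Tok;

/* Return the next token from the string *src, advancing the cursor past it. */
static Tok get_token(const char **src)
{
    Tok t = { T_EOF, "" };
    const char *p = *src;
    int n = 0;

    while (isspace((unsigned char)*p))                      /* skip whitespace */
        p++;

    if (*p == '\0') {
        t.type = T_EOF;
    } else if (isalpha((unsigned char)*p) || *p == '_') {   /* identifier (or keyword) */
        while ((isalnum((unsigned char)*p) || *p == '_') && n < 63)
            t.text[n++] = *p++;
        t.type = T_ID;
    } else if (isdigit((unsigned char)*p)) {                /* integer constant */
        while (isdigit((unsigned char)*p) && n < 63)
            t.text[n++] = *p++;
        t.type = T_NUM;
    } else if (strchr("+-*/=();", *p)) {                    /* single-character operator/delimiter */
        t.text[n++] = *p++;
        t.type = T_OP;
    } else {                                                /* error handling: unrecognized character */
        fprintf(stderr, "lexical error: unexpected character '%c'\n", *p++);
        t.type = T_ERROR;
    }
    t.text[n] = '\0';
    *src = p;
    return t;
}

Calling get_token in a loop until it returns T_EOF yields the token stream that the parser
consumes.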
Build a Symbol Table (Optional): If your language supports variables, functions, or other
named entities, you can build a symbol table to keep track of these identifiers. Include
each identifier's name, type, and other relevant information.
Output Tokens: As you tokenize the input, produce tokens by recording the token type
and attribute (if applicable). These tokens can be stored in memory or written to a file for
further processing by the parser.
Provide a User Interface (Optional): If you want to interactively test your lexical
analyser, create a simple user interface that accepts input code, runs the lexical analysis,
and displays the resulting tokens.
Testing and Debugging: Test your lexical analyser with various code snippets, including
valid and invalid constructs. Pay special attention to corner cases and edge cases to
ensure accurate tokenization.
Integration with Parser: The output of the lexical analyser (the stream of tokens) will be
used as input for the parser in the subsequent phases of the compiler. Ensure that the
format of tokens produced by the lexical analyser is compatible with what the parser
expects.
Documentation: Document your lexical analyser, including the token types, regular
expressions, and any special handling you have implemented. Provide clear instructions
for usage and testing.
Remember that this is a simple approach to designing a lexical analyzer. In practice, for
complex programming languages, you may encounter additional challenges, such as
handling nested comments or managing reserved words. More advanced lexical
analyzers often use tools like Lex or Flex to generate code from regular expressions and
automate much of the tokenization process. However, this step-by-step approach
provides a solid foundation for understanding the basic principles of lexical analysis.
Specification of Tokens:
Token specification involves defining the patterns for different token types using
regular expressions or similar notations. Each token type corresponds to a
particular lexical construct in the programming language. Here are some
common token types and their specifications:
Comments: Comments can be specified using regular expressions that match the
comment style used in the language; for example, a C++-style line comment can be
matched by //[^\n]* (everything from // to the end of the line).
Recognition of Tokens
Recognition of tokens involves the actual process of identifying and extracting tokens
from the source code based on the specifications. Here's how it works:
The lexical analyzer reads the source code character by character, often using a
buffer to improve efficiency.
It maintains the current state or position within the input. For each character
read, it applies the defined regular expressions for each token type to check if the
character(s) match the token's pattern.
When a match is found, the lexical analyser records the matched substring as a
token and assigns it the appropriate token type.
It may also capture additional attributes, such as the value of a constant or the
name of an identifier.
If the current input does not match any defined token patterns, the lexical
analyser reports an error or handles the situation according to its error-handling
rules.
The extracted tokens, along with their types and attributes, are passed on to the
subsequent phases of the compiler for further processing (e.g., parsing).
In practice, lexical analysers are often built with lexical-analyser generators such as Lex
or Flex, or are written by hand from the specifications of the language's tokens.
These tools generate efficient and optimized code for token recognition, making the
process more reliable and maintainable. By specifying and recognizing tokens accurately,
the lexical analyser simplifies the task of parsing and understanding the structure of the
source code, which is crucial for the subsequent phases of the compilation process.
Finite automata
A finite automaton (plural: finite automata), also known as a finite state machine (FSM),
is a mathematical model used in computer science and formal language theory to
describe processes with a finite number of states and transitions between those states.
Finite automata are fundamental tools for various applications, including lexical analysis
in compiler design, modeling state-based systems, and pattern recognition. There are
two main types of finite automata: deterministic and non-deterministic.
1. States: A finite automaton has a finite set of states. Each state represents a particular
condition or configuration of the system.
2. Transitions: Transitions describe the way the automaton moves from one state to
another based on input. For each state and input symbol, there is a defined transition
that specifies the next state.
3. Alphabet: The alphabet is the set of input symbols that the automaton recognizes. It
defines the language over which the automaton operates
4. Start State: The start state is the initial state where the automaton begins its operation
when given input.
5. Accepting (or Final) States: Some states are designated as accepting or final states.
When the automaton reaches an accepting state after processing the input, it recognizes
the input as part of the language and can accept it.
6. Transitions Function (or Transition Table): The transitions function defines the
behavior of the automaton. It is a mapping that takes a current state and an input symbol
and returns the next state. For deterministic finite automata (DFA), the transitions
function is often represented as a transition table.
7. Deterministic Finite Automaton (DFA): In a DFA, for each state and input symbol,
there is exactly one possible transition. DFAs are deterministic in that they can uniquely
determine the next state given a specific input. (A small C sketch of a DFA appears after
this list.)
8. Non-deterministic Finite Automaton (NFA): In an NFA, a state may have zero, one, or
several possible transitions for the same input symbol, and ε-transitions (moves that
consume no input) are allowed. Every NFA can be converted into an equivalent DFA.
10. Regular Languages: Finite automata are closely associated with regular languages,
which are a class of languages described by regular expressions. Regular languages can
be recognized by finite automata, both DFAs and NFAs.
11. Applications: Finite automata have various applications, including lexical analysis in
compilers, parsing, text pattern recognition, and modeling finite-state systems in
hardware design and natural language processing.
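As a small concrete example (a hypothetical table-driven recognizer, not tied to any
particular language), the C sketch below implements a DFA that accepts identifiers of the
form letter (letter | digit)*:

#include <ctype.h>

/* States of the DFA: 0 = start, 1 = inside an identifier (accepting), 2 = dead state. */
enum { START = 0, IN_ID = 1, REJECT = 2 };

/* Transition function delta(state, input symbol). */
static int delta(int state, int c)
{
    switch (state) {
    case START:  return isalpha(c) ? IN_ID : REJECT;
    case IN_ID:  return (isalpha(c) || isdigit(c)) ? IN_ID : REJECT;
    default:     return REJECT;
    }
}

/* Run the DFA over the string s; accept iff the run ends in the accepting state. */
static int is_identifier(const char *s)
{
    int state = START;
    for (; *s != '\0'; s++)
        state = delta(state, (unsigned char)*s);
    return state == IN_ID;
}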
Finite automata serve as the foundation for understanding the concept of computation
and play a significant role in the theoretical and practical aspects of computer science.
They provide a structured way to analyze and process sequences of symbols, which is
critical in various fields of computing and engineering.
UNIT 2
The parser is a crucial component in a compiler or interpreter, and its primary role is
to analyse the syntactic structure of the source code according to the grammar of
the programming language. Here are the key roles and responsibilities of a parser:
1. Syntactic Analysis: The primary role of the parser is to perform syntactic analysis
or parsing. It reads the tokens produced by the lexical analyser and checks whether
they form valid sentences in the programming language's grammar. It ensures that
the source code adheres to the specified syntax rules.
2. Grammar Compliance: The parser enforces the rules and constraints defined by
the language's grammar. It checks for the correct order of statements, the use of
correct operators, proper nesting of constructs, and adherence to language-specific
syntactic rules.
3. Parsing Trees or Abstract Syntax Trees (AST): In the process of parsing, the
parser often constructs a data structure called a parse tree or an abstract syntax
tree (AST). These trees represent the hierarchical structure of the program, making
it easier for subsequent phases of the compiler or interpreter to analyze and
transform the code.
4. Error Detection and Reporting: The parser detects and reports syntax errors in
the source code. It generates error messages that provide information about the
location and nature of the errors. These error messages are essential for developers
to identify and correct issues in their code.
6. Scope Analysis: The parser may perform initial scope analysis by tracking variable
declarations, function definitions, and other scope-related information. This helps in
resolving identifiers and detecting scope-related errors.
7. Type Checking: The parser may perform basic type checking by ensuring that
operations involving variables, literals, and expressions conform to the expected
data types and compatibility rules defined by the language.
10. Integration with Semantic Analysis: The parser serves as the interface between
the lexical analysis and the subsequent semantic analysis phases of the compiler. It
provides the structured syntactic representation of the code for semantic analysis.
The parser is a bridge between the lexical analysis (which identifies tokens and their
lexical structure) and the semantic analysis and code generation phases. It plays a
critical role in ensuring that source code adheres to the language's syntax, and it
provides a foundation for further analysis and transformation of the program.
Parsing methods fall into three general categories:
Universal: universal parsing methods, such as the Cocke-Younger-Kasami (CYK)
algorithm and Earley's algorithm, can parse any grammar, but they are too inefficient
for use in production compilers.
top-down: top-down methods build parse trees from the top (root) to the bottom
(leaves)
bottom-up: bottom-up methods start from the leaves and work their way up to
the root.
In either case, the input to the parser is scanned from left to right, one symbol at a time.
Context-Free Grammars (CFG): A context-free grammar is defined by the following
components:
1. Terminal Symbols: These are the basic symbols of the language, such as
keywords, operators, and constants. Terminal symbols are the actual
tokens recognized by the lexer.
2. Non-terminal Symbols: These are syntactic variables (such as expr or stmt)
that denote sets of strings; they are rewritten using the production rules.
Production Rules: CFGs consist of a set of production rules that define how non-
terminal symbols can be replaced by a sequence of terminal and non-terminal
symbols. A production rule has the form A → α, where A is a single non-terminal
(the head) and α is a string of terminals and/or non-terminals (the body).
Start Symbol: CFGs have a designated start symbol, typically denoted as S . The
start symbol represents the entire language or program. All derivations in the
grammar start from this symbol.
Ambiguity: A CFG is considered ambiguous if there are multiple valid parse trees
for the same input string. Ambiguity can complicate the parsing process and lead
to ambiguous language constructs.
Backus-Naur Form (BNF): BNF is a widely used notation for specifying CFGs. It
uses angle brackets ("<" and ">") to represent non-terminal symbols and defines
production rules using "::=".
To avoid always having to state that "these are the terminals," "these are the
non-terminals," and so on, the following notational conventions for grammars are
commonly used, for example:
(e) Boldface strings such as id or if, each of which represents a single terminal symbol.
(b) The letter S, which, when it appears, is usually the start symbol.
The construction of a parse tree can be made precise by taking a derivational view, in
which productions are treated as rewriting rules. Beginning with the start symbol, each
rewriting step replaces a nonterminal by the body of one of its productions. This
derivational view corresponds to the top-down construction of a parse tree, but the
precision afforded by derivations will be especially helpful when bottom-up parsing is
discussed. As we shall see, bottom-up parsing is related to a class of derivations known as
\rightmost" derivations, in which the rightmost nonterminal is rewritten at each step
(See page 202 of the textbook.)
1. Lexical Analysis (Scanning): Lexical analysis deals with recognizing and tokenizing
the basic building blocks of a programming language.
For lexical analysis, you would typically use regular expressions to define the
patterns for each token type. Regular expressions are concise and powerful for this
purpose. For example, here's a simplified grammar for some token types:
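For instance (illustrative patterns only; a real specification would enumerate every
keyword and operator of the language):

    digit        [0-9]
    number       {digit}+
    letter       [A-Za-z_]
    identifier   {letter}({letter}|{digit})*
    relop        <|<=|==|!=|>|>=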
For syntactic analysis, you would use a context-free grammar (CFG). CFGs are
suitable for defining the hierarchical structure of programming languages, as they
allow you to specify the relationships between language constructs. Here's a
simplified example of a CFG for a simple arithmetic expression language:
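A representative grammar of this kind (reconstructed here as an illustration) is:

    expr   → expr + term | expr - term | term
    term   → term * factor | term / factor | factor
    factor → ( expr ) | number | identifier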
The above CFG defines the syntax for arithmetic expressions composed of addition,
subtraction, multiplication, division, numbers, and identifiers. It specifies the order
of operations and how expressions are structured hierarchically.
Ambiguity
A grammar that produces more than one parse tree for some sentence is said to be
ambiguous. (an ambiguous grammar is one that produces more than one leftmost
derivation or more than one rightmost derivation for the same sentence)
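For example, the grammar E → E + E | E * E | ( E ) | id is ambiguous: the sentence
id + id * id has two distinct parse trees, one grouping the operands as id + (id * id)
and the other as (id + id) * id.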
Eliminating ambiguity
Eliminating ambiguity in a context-free grammar (CFG) is an important step in ensuring
that the grammar can be used to uniquely define the syntax of a programming language or
any formal language. Ambiguity arises when there are multiple valid interpretations or
parse trees for a given input, which can lead to parsing conflicts and difficulties in language
processing. Here are some strategies to eliminate ambiguity from a CFG
1. Left Recursion Removal: Left recursion occurs when a non-terminal can directly or
indirectly produce a string that starts with itself. Left-recursive rules can cause
problems during top-down parsing. To remove left recursion, rewrite the rules so
that the recursion occurs on the right; the standard transformation replaces
A → Aα | β with A → βA' and A' → αA' | ε (left factoring is a separate transformation,
used to remove common prefixes rather than left recursion). For example