
COMPILER DESIGN

Compiler design refers to the process of creating a special software tool called a compiler, which helps in translating and converting high-level programming code written by humans into a format that can be understood and executed by a computer. It involves various steps like analyzing the code's structure, checking for errors, optimizing the code for better performance, and generating the final executable program. In simpler terms, compiler design is all about building a translator that turns human-readable code into instructions that a computer can follow.

A compiler is a special tool that helps translate code written in a high-level programming language (like Python or Java) into a language that the computer's processor can understand, which is called machine language.

The high-level language is what developers use to write code because it's easier for humans to understand. The machine language, on the other hand, is what the computer actually understands and can execute.

When you write and run a program, the compiler checks the code for errors and makes sure it follows the rules of the programming language. Its main job is to convert the code written in one language into another language without changing the program's meaning.

The program execution happens in two parts :-

1. First, the compiler takes the code you wrote (source program) and translates it into a lower-level language called the object program. This translation makes the code easier for the computer to work with.

2. Then, the object program is further translated into the target program using a tool called an assembler. The target program is the final version that the computer can directly understand and execute.

In simpler terms, a compiler is like a language translator that helps turn human-readable code into a language that the computer can understand and run. It makes sure the code is correct, and then it converts it into a form that the computer can execute.

PHASES OF COMPILER

• Lexical Analysis: This is the first phase of the compiler. It reads the source code character by character and groups them into meaningful units called lexemes. For example, it recognizes keywords, identifiers, numbers, and symbols in the code. These lexemes are then converted into tokens, which are representations of these meaningful units.

• Syntax Analysis: In this phase, the compiler takes the tokens generated in the lexical analysis phase and checks if they form valid expressions according to the rules of the programming language's syntax. It constructs a parse tree, which represents the structure and relationships between the components of the code. The parser ensures that the code is syntactically correct.

• Semantic Analysis: The semantic analysis phase checks whether the parse tree, generated in the syntax analysis phase, follows the rules of the programming language in terms of meaning and context. It verifies the types of identifiers, expressions, and statements. This phase also performs tasks like type checking and symbol table management to ensure that the code has meaningful semantics.

• Intermediate Code Generation: Here, the compiler converts the source code into an intermediate representation that is closer to the machine language but still independent of the target machine. This intermediate code should be easy to translate into the final machine code. It helps in further analysis and optimizations before generating the actual target code.

• Code Optimization: This phase is optional but aims to improve the intermediate code generated in the previous phase. It analyzes the code for opportunities to make it more efficient and optimized in terms of execution speed and memory usage. This can involve eliminating redundant code, rearranging the sequence of statements, or applying mathematical transformations to simplify expressions.

• Code Generation: The final phase of the compilation process is code generation. It takes the optimized intermediate code and translates it into the specific machine language of the target computer. The code generator generates the instructions that the computer's processor can understand and execute. This phase produces the final executable program that can be run on the target machine.

In simpler terms, the compiler goes through several steps: analyzing the code's structure and meaning, generating an intermediate representation, optimizing the code if needed, and finally translating it into the machine language of the target computer. Each phase plays a specific role in converting human-readable code into executable machine code.
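As an illustration of how these phases fit together, here is a small hedged walk-through (this sketch is not part of the original document); the per-phase outputs shown in the comments are simplified assumptions rather than the output of any particular compiler.

/* A hypothetical walk-through of how one C statement might move through
   the phases described above. The phase outputs in the comments are
   simplified, illustrative assumptions. */
int add_example(int x, int y) {
    int z = x + y;  /* the source statement being compiled */
    /* Lexical analysis (tokens):  'int'  id(z)  '='  id(x)  '+'  id(y)  ';'       */
    /* Syntax analysis:            a declaration node: type, name, initializer      */
    /* Semantic analysis:          x, y and z are all of type int, so '+' is valid  */
    /* Intermediate code:          t1 = x + y        z = t1                         */
    /* Code optimization:          nothing worth improving in this tiny case        */
    /* Code generation (x86-like): MOV eax, x   ADD eax, y   MOV z, eax             */
    return z;
}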
LEXICAL ANALYSIS

Lexical analysis refers to the process of analyzing the source code of a program to identify and extract meaningful units called lexemes. It is the first phase of the compilation process.

During lexical analysis, the compiler reads the source code character by character and groups them into lexemes. Lexemes represent the smallest meaningful units in a programming language, such as keywords, identifiers, operators, constants, and punctuation symbols.

The lexical analyzer, also known as the lexer or scanner, is responsible for recognizing these lexemes by applying predefined rules and patterns specific to the programming language. It scans the source code, identifies lexemes based on these rules, and generates a sequence of tokens.

Tokens are abstract representations of lexemes and carry more meaning in the context of the programming language. They provide a higher-level view of the code, categorizing the lexemes into different types like keywords, identifiers, operators, or literals.

The purpose of lexical analysis is to transform the continuous stream of characters in the source code into a structured sequence of tokens. This structured representation makes it easier for subsequent compiler phases, such as syntax analysis and semantic analysis, to analyze and understand the code's structure and meaning.

In simpler terms, lexical analysis is like breaking down the source code into meaningful chunks and assigning labels to these chunks. It helps the compiler recognize and categorize different parts of the code, which is important for understanding and processing the code correctly.

Lexical analysis is the first phase of the compilation process in which the compiler reads the source code character by character. Its main job is to identify and extract meaningful units called lexemes from the code.

Think of the source code as a long string of characters. The lexical analyzer breaks down this string into smaller chunks, such as keywords, identifiers, numbers, operators, and punctuation symbols. These smaller chunks are the lexemes.

For example, let's take the following line of code:

x = 10 + y

During the lexical analysis phase, the lexical analyzer will recognize and extract the following lexemes :-

• The identifier 'x'
• The assignment operator '='
• The number '10'
• The addition operator '+'
• The identifier 'y'

Once the lexemes are identified, they are converted into tokens, which are abstract representations of these lexemes. These tokens will be used in later stages of the compiler to analyze the code's syntax and semantics.

After the lexical analyzer identifies the lexemes (meaningful units) from the source code, the next step is to convert them into tokens. Tokens are abstract representations of these lexemes and carry more meaning in the context of the programming language.

To convert lexemes into tokens, the compiler uses a set of predefined rules and patterns specific to the programming language. These rules define the different categories or types of tokens that can be encountered in the code. Each token has a specific meaning and role in the language's syntax.

For example, let's consider the following line of code in the Python programming language:

x = 10 + y

The lexical analyzer has identified the following lexemes:

• The identifier lexeme: "x" and "y"
• The assignment operator lexeme: "="
• The number lexeme: "10"
• The addition operator lexeme: "+"
Now, let's see how these lexemes are converted into tokens:

• The identifier lexeme "x" becomes an identifier token.
• The assignment operator lexeme "=" becomes an assignment token.
• The number lexeme "10" becomes a numeric constant token.
• The addition operator lexeme "+" becomes an addition operator token.

These tokens provide a higher-level representation of the code. They carry information about the type of the lexeme and its role in the programming language. The tokens are then used in subsequent phases of the compiler, such as syntax analysis and semantic analysis, to understand the structure and meaning of the code.

In simpler terms, converting lexemes into tokens is like giving special names or labels to different parts of the code. These names or labels help us understand the role and meaning of each part. It's like putting a tag on each word or symbol to say what it represents in the programming language.
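As an illustration of this idea (this sketch is not part of the original document), a hand-written scanner for the tiny statement used above might look like the following; the token names and the single-character treatment of identifiers are simplifying assumptions.

#include <stdio.h>
#include <ctype.h>

/* A minimal, hypothetical scanner for statements such as "x = 10 + y".
   It groups characters into lexemes and prints the token each one becomes. */
int main(void) {
    const char *src = "x = 10 + y";
    for (const char *p = src; *p != '\0'; ) {
        if (isspace((unsigned char)*p)) { p++; continue; }   /* skip blanks */
        if (isalpha((unsigned char)*p)) {                     /* identifier */
            printf("IDENTIFIER: %c\n", *p);
            p++;
        } else if (isdigit((unsigned char)*p)) {              /* number */
            printf("NUMBER: ");
            while (isdigit((unsigned char)*p)) putchar(*p++);
            putchar('\n');
        } else if (*p == '=') { printf("ASSIGN: =\n"); p++; }
        else if (*p == '+')   { printf("PLUS: +\n");   p++; }
        else                  { printf("UNKNOWN: %c\n", *p); p++; }
    }
    return 0;
}

Running it prints one token per lexeme: IDENTIFIER x, ASSIGN =, NUMBER 10, PLUS +, IDENTIFIER y.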
SYNTAX ANALYSIS

Syntax analysis, also known as parsing, is the second phase of the compilation process after lexical analysis. It focuses on analyzing the structure of the source code based on the grammar rules of the programming language.

Syntax analysis checks whether the sequence of tokens generated from the lexical analysis phase conforms to the rules specified by the language's syntax. It ensures that the code follows the correct arrangement and combination of tokens to form valid expressions, statements, and program structures.

Syntax analysis typically constructs a parse tree or syntax tree, which is a hierarchical representation of the code's structure. The parse tree shows how the different parts of the code relate to each other based on the language's syntax rules.

Here's a simplified example in the C programming language to illustrate syntax analysis:

Consider the code:

int x = 10;

During syntax analysis, the parser checks the tokens generated from lexical analysis against the grammar rules of the C language. It verifies that the tokens form a valid declaration statement.

In this case, the syntax analysis will construct a parse tree like this:

      declaration
      /    |    \
    int    =     ;
     |     |     |
     x    10    null
The parse tree reflects the structure of the
declaration statement, showing that it starts with
the keyword "int," followed by an identifier "x,"
the assignment operator "=", and a numeric
constant "10." The tree also indicates that the
semicolon ";" marks the end of the statement.

Syntax analysis detects syntax errors if the tokens cannot be arranged according to the language's grammar rules. For example, consider the following invalid code:

int x = ;

The syntax analysis will detect an error because the assignment operator "=" expects an expression on the right side, but in this case it encounters a semicolon without a valid expression.

Example :-

[The illustration for this example appears as an image in the original document.]

Syntax analysis is crucial because it ensures that the code follows the correct structure and grammar of the programming language. By constructing the parse tree, it provides a structural representation of the code that can be used for subsequent analysis, optimization, and code generation phases of the compiler.

SEMANTIC ANALYZER

Semantic analysis is the phase of the compilation process that follows syntax analysis. It focuses on understanding the meaning and correctness of the code beyond its structure. Semantic analysis checks for semantic errors and ensures that the code makes sense according to the rules and constraints of the programming language.

Here's a brief explanation of semantic analysis and an example to illustrate its workings:

Semantic analysis involves performing various tasks such as :-

1. Type Checking: The compiler verifies that the operations and expressions in the code are applied to compatible types. For example, adding a number to a string would be a type mismatch and flagged as an error.

2. Scope Analysis: The compiler checks the scope rules to ensure that variables and identifiers are declared and used correctly within their respective scopes. It ensures that variables are declared before use and are accessible where they are referenced.

3. Declaration Analysis: The compiler verifies that variables, functions, and other symbols are properly declared and used according to the language's rules. It checks for issues like re-declaration, multiple definitions, or referencing an undeclared symbol.

4. Semantic Constraints: The compiler enforces additional semantic constraints specific to the programming language. For example, ensuring that a function call has the correct number and type of arguments.

Semantic analysis ensures that the code is semantically valid based on the information provided by the parse tree generated during the syntax analysis phase. It performs various checks and validations to ensure that the code makes sense and follows the rules and constraints of the programming language. Let's explore some of the tasks performed during semantic analysis in more detail :-

1. Type Checking: Semantic analysis verifies that the operations and expressions in the code are applied to compatible types. It checks that variables, constants, and expressions are used in a manner consistent with their declared types.

For example, consider the following code snippet in C:

[The code snippet appears as an image in the original document.]

During semantic analysis, the compiler would perform type checking and detect an error in the line int z = x + y;. This error arises because the addition operation is not allowed between an integer (x) and a character (y). The compiler would report a type mismatch error.

2. Scope Analysis: Semantic analysis checks that variables and identifiers are used within their proper scopes. It ensures that variables are declared before use and are accessible where they are referenced. The compiler maintains a symbol table to track the information about variables and their scopes.

For example, consider the following code snippet in Python:

[The code snippet appears as an image in the original document.]

During semantic analysis, the compiler would ensure that the reference to x inside the function my_function is valid. It checks the symbol table to verify that x is accessible from the function's scope.

3. Declaration Analysis: Semantic analysis verifies that variables, functions, and other symbols are correctly declared and used according to the language's rules. It checks for issues such as re-declaration, multiple definitions, or referencing an undeclared symbol.

For example, consider the following code snippet in Java:

[The code snippet appears as an image in the original document.]

During semantic analysis, the compiler would detect an error because the variable x is re-declared. This violates the language's rule that a variable should only be declared once in the same scope.

4. Semantic Constraints :- Semantic analysis enforces additional semantic constraints specific to the programming language. These constraints can include rules such as ensuring that a function call has the correct number and type of arguments, or that only methods and members that actually exist are accessed.

For example, consider the following code snippet (shown as an image in the original document), which defines a Rectangle class and then calls r.area():

Semantic analysis will identify an error in the line r.area() because the area() method is not defined within the Rectangle class. It ensures that the code follows the language's rules regarding method calls and member accessibility.

By performing these checks, semantic analysis verifies the correctness and meaningfulness of the code. It helps in generating appropriate error messages and enables subsequent compiler phases like optimization and code generation to work with reliable and valid code.
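To make the type-checking task concrete, here is a small hedged sketch (not part of the original document) of the kind of compatibility check a semantic analyzer performs for the '+' operator; the type set and the rules are deliberately simplified assumptions.

#include <stdio.h>

typedef enum { TY_INT, TY_FLOAT, TY_STRING, TY_ERROR } Type;

/* Type-check an addition node of the parse tree: given the types of the two
   operands, decide whether '+' is legal and what type the result has. */
Type check_add(Type left, Type right) {
    if (left == TY_STRING || right == TY_STRING)
        return TY_ERROR;            /* e.g. number + string: type mismatch */
    if (left == TY_FLOAT || right == TY_FLOAT)
        return TY_FLOAT;            /* simplified "usual arithmetic conversion" */
    return TY_INT;
}

int main(void) {
    if (check_add(TY_INT, TY_STRING) == TY_ERROR)
        printf("type mismatch error: cannot add a number and a string\n");
    return 0;
}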

INTERMEDIATE CODE GENERATOR

The intermediate code generator is a phase in the compilation process that converts the source code into an intermediate representation that is closer to the target machine code. It serves as a bridge between the high-level programming language and the final machine code.

The intermediate code generator takes the input from previous phases, such as syntax and semantic analysis, and generates an intermediate code representation. This intermediate code is designed to be easier to analyze, optimize, and translate into the target machine code.

The process of intermediate code generation involves several steps:

1. Mapping Expressions and Statements: The intermediate code generator maps high-level language constructs, such as expressions and statements, into equivalent representations in the intermediate code. This mapping captures the essence of the original code in a more abstract form.

2. Handling Control Flow: The generator handles control flow constructs, such as conditionals (if-else) and loops (while, for), and represents them in the intermediate code. This may involve generating intermediate code instructions to handle branches, jumps, and loops.

3. Managing Data and Variables: The generator assigns memory locations and manages the storage of variables and data in the intermediate code. It keeps track of data types, declarations, and references, ensuring consistent and efficient memory usage.

4. Handling Function Calls: The generator deals with function calls, including passing arguments, managing return values, and maintaining the execution context during function invocations.

5. Generating Code Annotations: The intermediate code generator may include annotations or additional information to aid subsequent optimization or code generation phases. These annotations can provide hints or constraints for further analysis and transformation.

Example: Let's consider the following code snippet in C:

int x = 5;
int y = 10;
int z = x + y;

During intermediate code generation, the generator can produce an intermediate representation like three-address code :-

t1 = 5
t2 = 10
t3 = t1 + t2

In this example, the intermediate code generator translates the variable assignments and the addition operation into three-address code. It assigns the value 5 to temporary variable t1, the value 10 to t2, and performs the addition of t1 and t2, storing the result in t3.

The generated intermediate code is closer to the machine code but still independent of the specific target machine architecture. It provides a more manageable representation for subsequent optimization and code generation phases.

The purpose of the intermediate code generator is to facilitate further analysis and transformations on the code before generating the final machine code. By using an intermediate representation, the compiler can apply various optimizations and target-specific translations in a more structured and efficient manner.
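A rough sketch (not from the original document) of how a generator might emit three-address code for a single addition; the temporary-name scheme and the emit functions are assumptions made for illustration.

#include <stdio.h>

static int next_temp = 0;

/* Return the number of a fresh temporary: t1, t2, t3, ... */
static int new_temp(void) { return ++next_temp; }

/* Emit three-address code for "left + right" and return the temporary
   that holds the result. */
static int gen_add(const char *left, const char *right) {
    int t = new_temp();
    printf("t%d = %s + %s\n", t, left, right);
    return t;
}

int main(void) {
    /* Intermediate code for: int z = x + y; */
    int t = gen_add("x", "y");
    printf("z = t%d\n", t);
    return 0;
}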
CODE OPTIMIZATION

Code optimization is a phase in the compilation process that aims to improve the efficiency, speed, and overall performance of the generated code. It focuses on transforming the code to produce an optimized version while preserving its original functionality.

The code optimization phase analyzes the code to identify opportunities for improvement and applies various techniques to make the code more efficient. These techniques can include removing redundant operations, reordering instructions, simplifying expressions, and minimizing memory usage.

Here's an overview of how code optimization works:

1. Analysis: The optimizer analyzes the code to identify patterns, dependencies, and potential areas for improvement. It examines the control flow, data dependencies, and resource usage within the code.

2. Identifying Optimization Opportunities: Based on the analysis, the optimizer identifies specific areas where the code can be optimized. This can include eliminating redundant computations, reducing memory access, and improving loop structures.

3. Transformation: The optimizer applies various transformation techniques to improve the code. This can involve reordering instructions, eliminating unnecessary branches or loops, and simplifying complex expressions.

4. Optimization Techniques: The optimizer employs a range of techniques such as constant folding, common subexpression elimination, loop unrolling, register allocation, and instruction scheduling. These techniques aim to eliminate inefficiencies, reduce overhead, and exploit parallelism.

5. Trade-offs: During optimization, the compiler needs to strike a balance between code efficiency and other factors such as code size, compilation time, and maintainability. Some optimizations may result in increased code size or longer compilation time, so the compiler must consider these trade-offs.

DIFFERENT TECHNIQUES OF OPTIMIZATION

Here's a more detailed explanation of various optimization techniques commonly applied during the code optimization phase :-

1. Constant Folding :- This optimization technique involves evaluating expressions with constant values at compile-time instead of at runtime. It replaces the expression with the computed result. For example:

int x = 5;
int y = x + 3;

During constant folding, the expression x + 3 is computed as 5 + 3 (which is 8) at compile-time. The code is then transformed to:

int x = 5;
int y = 8;

2. Common Subexpression Elimination :- This technique identifies and eliminates redundant computations by storing the result of a subexpression in a temporary variable. The temporary variable is then reused wherever the subexpression appears. For example:

int x = 5;
int y = x + 3;
int z = x * y;

During common subexpression elimination, the optimizer recognizes that x is used twice and stores its value in a temporary variable. The code is transformed to:

int x = 5;
int temp = x;
int y = temp + 3;
int z = temp * y;

3. Strength Reduction :- This technique aims to replace expensive operations with cheaper ones. For example, multiplication can be replaced with repeated addition or division with bit shifting. Here's an example:

int x = 10;
int y = x * 4;

During strength reduction, the optimizer replaces the multiplication by 4 with a series of additions:

int x = 10;
int y = x + x + x + x;

4. Loop Unrolling :- Loop unrolling reduces loop overhead by replicating loop iterations. Instead of executing the loop for each iteration, it performs multiple iterations in a single pass. This can improve performance by reducing loop control instructions and improving instruction pipelining. Here's an example:

for (int i = 0; i < 5; i++) {
    // Loop body
}

During loop unrolling, the optimizer replicates the loop body multiple times:
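The unrolled version of this loop appears only as a figure in the original document. Assuming, purely for illustration, that the loop body sums the elements of an array, the fully unrolled form would look like this:

int unrolled_sum(const int a[5]) {
    int sum = 0;
    /* Original loop:  for (int i = 0; i < 5; i++) { sum += a[i]; }  */
    sum += a[0];   /* iteration i = 0 */
    sum += a[1];   /* iteration i = 1 */
    sum += a[2];   /* iteration i = 2 */
    sum += a[3];   /* iteration i = 3 */
    sum += a[4];   /* iteration i = 4 */
    return sum;    /* no loop counter updates or branch instructions remain */
}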
These are just a few examples of the many optimization techniques used in the code optimization phase. Other techniques include register allocation, instruction scheduling, function inlining, dead code elimination, and more. Each technique aims to improve the code's performance, reduce overhead, and make more efficient use of system resources. The specific optimization techniques applied depend on the compiler and the target platform.

TARGET CODE GENERATOR

The target code generator is the final phase of the compilation process. It takes the optimized intermediate code representation and translates it into the specific machine code instructions that can be executed directly on the target hardware or platform.

The target code generator is responsible for generating efficient and executable code that closely corresponds to the architecture and instruction set of the target machine. It considers the specific features and constraints of the target platform to produce optimized machine code.

Here's a detailed explanation of the target code generator and examples to illustrate its workings:

1. Instruction Selection: The target code generator selects appropriate machine instructions based on the intermediate code instructions. It maps each operation and expression in the intermediate code to a corresponding sequence of machine instructions. For example:

Intermediate Code:

t1 = a + b

Target Machine Code (x86 assembly):

ADD eax, ebx

In this example, the target code generator selects the ADD instruction in x86 assembly to perform the addition operation.

2. Register Allocation: The target code generator assigns intermediate variables and values to specific registers or memory locations in the target machine. It considers the availability and limitations of registers in the target architecture. For example:

Intermediate Code:

t1 = a + b
t2 = c * d

Target Machine Code (ARM assembly):

ADD r1, r2, r3
MUL r4, r5, r6

3. Memory Access and Addressing Modes: The target code generator handles memory access and determines the appropriate addressing modes for loading or storing data in memory. It optimizes memory access patterns to minimize latency and maximize cache efficiency. For example:

Intermediate Code:

x = y + z

Target Machine Code (MIPS assembly):

lw  $t0, y          # Load y into register $t0
lw  $t1, z          # Load z into register $t1
add $s0, $t0, $t1   # Add y and z, store result in $s0
sw  $s0, x          # Store result in x

In this example, the target code generator uses the lw instruction to load values from memory into registers and the sw instruction to store the result back into memory.

4. Optimization and Target-Specific Features: The target code generator may perform additional optimizations specific to the target machine architecture. It can exploit pipeline features, vectorization, or other architectural characteristics for improved performance. The generated code may also include target-specific instructions or extensions. For example:

Intermediate Code:

x = a * b

Target Machine Code (SIMD instruction for vectorized multiplication):

VMULPS xmm0, xmm1, xmm2

In this example, the target code generator recognizes the opportunity for vectorization and generates SIMD (Single Instruction, Multiple Data) instructions to perform a vectorized multiplication.

The target code generator tailors the generated machine code to the specific target machine architecture, optimizing it for efficient execution. It leverages the features and capabilities of the target platform to produce code that takes advantage of the hardware resources and maximizes performance.

It's important to note that the target code generator is highly dependent on the target architecture and can vary significantly between different machines or platforms.
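As a hedged sketch (not part of the original document) of the instruction-selection step, the following code maps one three-address instruction of the form dst = src1 op src2 to a short x86-like sequence; the register use and mnemonics are simplified assumptions, not the output of a real code generator.

#include <stdio.h>

typedef struct {
    char op;                       /* '+' or '*' */
    const char *dst, *src1, *src2;
} ThreeAddress;

/* Emit a naive x86-like translation of one three-address instruction. */
static void emit(const ThreeAddress *ins) {
    printf("MOV eax, %s\n", ins->src1);                           /* load first operand */
    printf("%s eax, %s\n", ins->op == '+' ? "ADD" : "IMUL", ins->src2);
    printf("MOV %s, eax\n", ins->dst);                            /* store the result   */
}

int main(void) {
    ThreeAddress t1 = { '+', "t1", "a", "b" };   /* t1 = a + b */
    emit(&t1);
    return 0;
}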
CHAPTER TWO

LEXICAL ANALYSIS

Lexical analysis, also known as scanning or tokenization, is an important phase in the compilation process. Its purpose is to break down the source code into smaller, meaningful units called lexemes or tokens. These tokens serve as the building blocks for further analysis and processing by the compiler.

During lexical analysis, the source code is read character by character. The lexical analyzer, also called a lexer or scanner, scans the code and identifies different lexemes based on predefined rules and patterns specific to the programming language.

Here's an overview of how lexical analysis works:

1. Scanning: The lexical analyzer reads the source code character by character, ignoring whitespaces and comments that do not affect the meaning of the code.

2. Lexeme Recognition: As the lexical analyzer scans the source code, it identifies and recognizes lexemes based on the language's grammar rules and lexical conventions. Lexemes can include keywords, identifiers, operators, literals, and punctuation symbols.

3. Token Generation: Once a lexeme is recognized, it is transformed into a token. Tokens are categorized based on their role and type in the programming language. For example, a token can represent a keyword, an identifier, a numeric constant, or an operator.

4. Token Output: The lexical analyzer outputs the generated tokens to the subsequent phases of the compiler for further analysis and processing. The tokens provide a higher-level representation of the code, making it easier to analyze its structure and meaning.

Let's consider an example in the C programming language :-

int x = 42;

During lexical analysis, the following tokens would be generated:

• Keyword token: int
• Identifier token: x
• Operator token: =
• Numeric constant token: 42
• Punctuation token: ;

These tokens represent the meaningful components of the code, such as the keyword int, the variable name x, the assignment operator =, the numeric constant 42, and the semicolon ;.

The lexical analysis phase is crucial as it provides a foundation for subsequent phases like syntax analysis and semantic analysis. It transforms the continuous stream of characters into a structured sequence of tokens, enabling the compiler to understand and process the code effectively.

COUNTING THE NUMBER OF TOKENS IN A CODE SEGMENT

Counting the number of tokens in a code segment involves identifying and categorizing the individual tokens present in the code. Tokens represent the smallest meaningful units of code, such as keywords, identifiers, operators, literals, and punctuation symbols. Here's a detailed explanation of the techniques for counting tokens and an example to illustrate the process :-

Techniques for Counting Tokens :-

1. Manual Counting: This technique involves visually inspecting the code segment and identifying each token by following the language's syntax rules and lexical conventions. Tokens are counted manually, and a tally is kept for each token type encountered.

2. Lexical Analysis Tools: Lexical analysis tools, such as Lex, Flex, or ANTLR, automate the process of tokenization.
These tools generate lexical analyzers based on predefined rules and patterns specified using regular expressions or other formal grammars. The generated lexer scans the code segment and produces tokens, making it easy to count and categorize them.

Example: Let's consider a simple code segment in the C programming language:

[The code segment appears as an image in the original document.]

Using manual counting, we can identify and count the tokens :-

1. Keywords:
• int: 4
• main: 1
• return: 1

2. Identifiers:
• x: 3
• y: 2
• sum: 2

3. Punctuation:
• (: 1
• ): 1
• {: 1
• }: 1
• ;: 3

4. Operators:
• =: 3
• +: 1

5. Numeric Literals:
• 5: 1
• 10: 1

By tallying the counts, we can determine that the code segment contains :-

• 6 keyword tokens
• 7 identifier tokens
• 7 punctuation tokens
• 4 operator tokens
• 2 numeric literal tokens

In total, there are 26 tokens in the given code segment.

Using lexical analysis tools, the token counting process becomes automated. The lexical analyzer generated by the tool would handle the scanning and tokenization, providing an accurate count of tokens based on the specified rules.

Counting tokens in a code segment is essential for understanding the code's structure, identifying its components, and performing subsequent phases of compilation. It helps in analyzing the code's syntax, detecting errors, and generating meaningful representations for further processing by the compiler.

Example :-

[The worked example appears as an image in the original document.]
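Since the worked example referred to above is only available as an image, here is a small hypothetical stand-in showing how the tokens of a single C statement would be tallied:

int count_tokens_example(void) {
    int a = 4, b = 2;
    int total = a + b * 2;   /* the statement whose tokens we count */
    /* Tokens in the line above:
       'int'  'total'  '='  'a'  '+'  'b'  '*'  '2'  ';'
       -> 1 keyword, 3 identifiers, 3 operators, 1 numeric literal,
          1 punctuation symbol, i.e. 9 tokens in total.              */
    return total;
}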
CHAPTER THREE

SYNTAX ANALYSIS

The second phase of compiler design is syntax analysis, also known as parsing. The primary goal of syntax analysis is to ensure that the source code follows the rules and structure specified by the language's grammar. It involves analyzing the sequence of tokens generated by the lexical analysis phase and constructing a parse tree or syntax tree that represents the syntactic structure of the program.

Syntax analysis verifies the correctness of the program's syntax by checking if the sequence of tokens can be derived from the grammar of the programming language. This process involves applying a set of production rules defined by a context-free grammar (CFG) or a similar formalism. The CFG describes the syntax rules of the language and determines how different language constructs can be combined and nested.

During syntax analysis, the following steps are typically performed:

1. Tokenization :- The input source code is divided into tokens using the rules specified by the lexical analyzer. Each token represents a meaningful unit of the programming language, such as identifiers, keywords, operators, and literals.

2. Parsing: The tokens are processed and organized hierarchically to create a parse tree or syntax tree. The parse tree represents the syntactic structure of the program and shows how the various language constructs are related to each other. There are different parsing techniques, such as top-down parsing (e.g., recursive descent parsing) and bottom-up parsing (e.g., LR parsing), which systematically apply the grammar rules to construct the parse tree.

3. Error Handling: If any syntax errors are encountered during parsing, error messages are generated to inform the programmer about the specific issues in the code. These errors may include missing semicolons, mismatched parentheses, or incorrect usage of language constructs.

The output of the syntax analysis phase is either a parse tree or an abstract syntax tree (AST). An AST simplifies and abstracts the parse tree by removing unnecessary details and focuses on the essential elements of the program's syntax. The resulting parse tree or AST is then used in subsequent phases of the compiler, such as semantic analysis and code generation.

Syntax analysis is a crucial step in the compilation process as it ensures that the source code adheres to the language's grammar and can be further processed and transformed into executable code.

THE ROLE OF THE PARSER

The parser plays a crucial role in the syntax analysis phase of the compiler. Its main task is to take the sequence of tokens generated by the lexical analyzer and determine whether this sequence adheres to the grammar rules of the source language. The grammar used by the parser is typically a context-free grammar (CFG), which provides a set of production rules for generating valid program structures.

The parser's primary objective is to ensure that the input program is well-formed and syntactically correct. It accomplishes this by constructing a parse tree, which represents the hierarchical structure of the program based on the grammar rules. The parse tree serves as a visual representation of how the different language constructs are nested within each other.

There are two main types of parsers used in compilers: top-down parsers and bottom-up parsers.

1. Top-Down Parsing: In top-down parsing, the parser starts from the root of the parse tree and works its way down to the leaves. It begins with the start symbol of the grammar and recursively applies production rules to generate the parse tree. The most common top-down parsing method is recursive descent parsing, where each non-terminal in the grammar is associated with a separate procedure or function.

2. Bottom-Up Parsing: In bottom-up parsing, the parser starts from the input tokens and gradually builds the parse tree by reducing the tokens according to the grammar rules. Bottom-up parsing algorithms, such as LR (Left-to-Right, Rightmost derivation) or LALR (Look-Ahead LR) parsing, are commonly used to construct the parse tree.

Regardless of the parsing method used, the parser scans the input program from left to right, examining one symbol (token or non-terminal) at a time. If the parser encounters any syntax errors, such as an invalid sequence of tokens or a violation of the grammar rules, it reports these errors to the programmer in a meaningful and understandable way. Error recovery mechanisms may also be implemented in the parser to handle common syntax errors and allow the compilation process to continue.

Once the parser constructs the parse tree successfully, it passes the parse tree or an abstract syntax tree (AST) to the subsequent phases of the compiler for further processing. The parse tree or AST serves as the foundation for subsequent analysis, such as semantic analysis, optimization, and code generation.

Overall, the parser is responsible for verifying the syntactic correctness of the program by applying the grammar rules, constructing the parse tree, reporting syntax errors, and passing the resulting tree structure to the next phases of the compiler.

CONTEXT FREE GRAMMARS

In the field of compiler design, context-free grammars play a crucial role in specifying the syntactic structure of programming languages. A context-free grammar describes a set of strings, or language, and provides rules for composing these strings from various syntactic elements. These elements include terminals, non-terminals, start symbols, and production rules.

Terminals represent the basic symbols from which strings are formed. They correspond to the fundamental building blocks of a programming language, such as identifiers, keywords, operators, and literals. Non-terminals, on the other hand, act as placeholders that can be replaced by other terminals or non-terminals to create larger structures.

The start symbol is a designated non-terminal that indicates the entry point or top-level construct of the language's syntax. It defines the language that the grammar represents. By starting from the start symbol and applying production rules, valid programs or expressions in the language can be derived.

Production rules specify how terminals and non-terminals can be combined to form valid strings in the language. Each production rule consists of a head (left side), an arrow symbol (→), and a body (right side). The head represents the construct being defined, while the body describes the components that can be used to construct valid strings.

With these components, context-free grammars provide a precise and formal way to describe the syntactic structure of programming languages. They help compilers understand and analyze the structure of programs during the syntax analysis phase, enabling the detection of syntax errors and the construction of parse trees for further processing.
1. Terminals: Terminals are the basic symbols from which strings are formed. They represent the fundamental building blocks of a language. In the context of a compiler, terminals are often referred to as token names. Tokens are generated by the lexical analyzer and serve as input to the syntax analyzer. Each token corresponds to a specific lexeme, such as identifiers, keywords, operators, or literals.

2. Non-Terminals: Non-terminals, also known as syntactic variables, represent sets of strings of terminals. They act as placeholders that can be replaced by other terminals or non-terminals according to the grammar rules. Non-terminals are used to define the structure and syntax of a programming language. They are typically represented by uppercase letters or symbols.

3. Start Symbol: In a context-free grammar, one non-terminal is designated as the start symbol. The start symbol specifies the language that the grammar defines. It represents the top-level construct or entry point of the language's syntax. All valid programs or expressions in the language can be derived by starting from the start symbol and applying the production rules.

4. Production Rules: Production rules specify the different ways in which terminals and non-terminals can be combined to form strings in the language. Each production rule consists of three parts :-

a) Head or Left Side: The head of a production rule is a non-terminal that defines the strings generated by that rule. It represents the current construct being defined or expanded.

b) Arrow Symbol (→): The arrow symbol separates the head from the body of the production rule. It indicates that the head can be replaced by the elements in the body.

c) Body or Right Side: The body of a production rule consists of zero or more terminals and non-terminals. It describes one possible way in which strings can be constructed from the non-terminal at the head. The components of the body can be terminals (tokens) or non-terminals, and they specify the structure of the language.

Production rules provide the building blocks for constructing valid strings in the language. By applying the production rules recursively, starting from the start symbol, a parser can generate the valid syntax tree or parse tree for a given program.

Overall, context-free grammars are used to specify the syntactic structure of a programming language. They define the valid combinations and arrangements of terminals and non-terminals, which are necessary for parsing and understanding the structure of a program during the syntax analysis phase of the compiler.

For example, take the following grammar and input string: E → E - E | E * E | a | b | c, with the input string "a - b * c".

In the given example, we have a context-free grammar with a start symbol E, terminals -, *, a, b, c, and a single non-terminal E. The grammar consists of several production rules, each with the head A (in this case, E) and alternative bodies (α1, α2, ..., αk).

The production rules for this grammar are as follows :-

1. E → E - E
2. E → E * E
3. E → a
4. E → b
5. E → c

The first two production rules have the same head E, and their bodies represent alternative ways to construct expressions involving subtraction and multiplication. The third, fourth, and fifth production rules define the possible terminals that can be directly derived from E, which are the single characters a, b, and c.

To generate a parse tree for the given input string "a - b * c" using this grammar, the parser applies these production rules in a way that matches the input string. The parser starts with the start symbol E and applies the production rules to derive the input string. Here's a step-by-step breakdown :-

1. E ⟹ E - E (using the first production rule): E is expanded into E - E.
2. E - E ⟹ a - E (using the third production rule): the left E is expanded into a.
3. a - E ⟹ a - E * E (using the second production rule): the remaining E is expanded into E * E.
4. a - E * E ⟹ a - b * E (using the fourth production rule): the first of the two remaining E's is expanded into b.
5. a - b * E ⟹ a - b * c (using the fifth production rule): the last remaining E is expanded into c.

At this point, the parse tree has been constructed, and the input string "a - b * c" has been successfully derived using the given grammar.

The production rules in a context-free grammar define the structure of valid strings in the language it represents. By applying these rules, parsers can analyze the syntax of programs and construct parse trees that capture the hierarchical relationships between the language constructs.

DERIVATIONS

In the process of parsing, a derivation refers to a sequence of production rules that are applied to transform a given input string according to the grammar of the language. During parsing, two decisions are made: selecting the non-terminal to be replaced and determining the production rule to be used for the replacement.

To illustrate the concept of derivation, let's consider a non-terminal symbol A surrounded by grammar symbols α and β, such that the current sentential form is αAβ. Suppose we have a production rule A → γ. In this case, we can write the derivation as αAβ ⟹ αγβ, indicating that A has been replaced by γ.

There are two types of derivations commonly used :-

1. Left-Most Derivation (LMD): In a left-most derivation, the input string is scanned and replaced from left to right. At each step, the left-most non-terminal is chosen for replacement. This type of derivation ensures that the left-most variable in a production body is always replaced first, and the process continues from left to right. For example, let's consider the grammar productions:

E → E + E
E → E * E
E → id

Suppose we have the input string "id + id * id". The left-most derivation for this input will be as follows:
E ⟹ E * E
  ⟹ E + E * E
  ⟹ id + E * E
  ⟹ id + id * E
  ⟹ id + id * id

In each step, the left-most non-terminal is the one selected for replacement.

2. Right-Most Derivation (RMD): In a right-most derivation, the input is scanned and replaced from right to left. At each step, the right-most non-terminal is chosen for replacement. This type of derivation ensures that the right-most non-terminal is replaced first, and the process continues from right to left. Using the same grammar productions as before, the right-most derivation for the input "id + id * id" will be as follows:

E ⟹ E + E
  ⟹ E + E * E
  ⟹ E + E * id
  ⟹ E + id * id
  ⟹ id + id * id

In each step, the right-most non-terminal is the one selected for replacement.

Both left-most and right-most derivations are useful for analyzing the structure of a program and understanding how the grammar rules are applied to transform the input string. They provide insights into the order of replacements and the hierarchical relationships between the grammar symbols.

DERIVATION AND PARSE TREE

In the context of parsing, a parse tree is a graphical representation of a derivation. It provides a visual depiction of how the start symbol of a grammar generates a particular string in the language. Each node in the parse tree represents a symbol (either a terminal or a non-terminal), and the edges connecting the nodes represent the production rules used to derive the string.

Here are some key characteristics of a parse tree :

1. Root Node: The root of the parse tree corresponds to the start symbol of the grammar. It represents the highest-level construct in the language.

2. Leaf Nodes: The leaf nodes of the parse tree correspond to the terminal symbols in the input string. They represent the individual tokens or basic elements of the language.

3. Interior Nodes: The interior nodes of the parse tree correspond to non-terminal symbols. They represent higher-level constructs in the language and serve as the points of derivation.

4. Children and Parent Nodes: Each interior node has children nodes, which are labeled from left to right according to the symbols in the right-hand side of the production rule. The parent node is the non-terminal that was replaced by these symbols during the derivation.

5. In-order Traversal: Performing an in-order traversal of the parse tree, from left to right, results in the original input string. This traversal visits the nodes in the same order as they appear in the input.

The parse tree captures the hierarchical structure of the language and shows how the input string can be derived from the start symbol. It provides a visual representation that filters out the choice of replacement order made during the derivation process, making it easier to understand the structure and relationships within the parsed program.
Example :-

Consider again the grammar productions E → E + E, E → E * E, E → id and the input string "id + id * id". One possible derivation is:

E ⟹ E + E
  ⟹ E + E * E
  ⟹ id + E * E
  ⟹ id + id * E
  ⟹ id + id * id
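The parse tree for this derivation is shown only as a figure in the original document; drawn in text form (in the same style as the declaration tree earlier), it looks like this:

          E
        / | \
       E  +  E
       |   / | \
      id  E  *  E
          |     |
         id    id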

Ambiguity

Ambiguity in the context of grammars refers to a situation where a grammar can produce more than one parse tree for a particular input string. This means that there are multiple ways to derive the same string using the production rules of the grammar, leading to different possible interpretations.

As an example, let's consider the grammar:

E → E + E
E → E * E
E → E - E
E → id

And the input string: id - id + id

[The original document shows, as figures, two different parse trees for this input string.]

CONTEXT FREE GRAMMARS VERSUS REGULAR GRAMMARS

Regular expressions and context-free grammars are two different notations used in language specification. Here's a brief explanation of each:

1. Regular Expressions: Regular expressions are a powerful tool for defining patterns of tokens in a string. They are used to describe regular languages, which are a subset of context-free languages. Regular expressions consist of a combination of characters, operators, and metacharacters that specify patterns to match against input strings. They are concise and easy to understand, making them suitable for describing simple token structures such as identifiers, constants, keywords, and other lexical constructs. Regular expressions can be used in lexical analysis to recognize and tokenize input strings efficiently.

2. Context-Free Grammars: Context-free grammars (CFGs) are used to define the syntactic structure of languages. They consist of a set of production rules that specify how non-terminal symbols can be expanded into a sequence of terminal and
non-terminal symbols. CFGs are more powerful than regular expressions and can describe context-free languages, which include both regular and non-regular languages. They are particularly useful for describing nested structures, such as balanced parentheses, matching begin-end pairs, if-then-else constructs, and other hierarchical language features.

In the given example, the regular expression (a|b)*abb and the context-free grammar represent the same language, which consists of strings of "a" and "b" that end with "abb"; a grammar for this language is sketched below. The regular expression provides a concise and straightforward notation for describing the pattern, while the grammar expresses the language in a structured and hierarchical manner.
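The grammar itself appears only as a figure in the original document. One simple context-free grammar that generates the same language as (a|b)*abb is, for example:

S → aS | bS | abb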

It's worth noting that regular expressions can describe only regular languages, while context-free grammars can describe both regular and non-regular languages. Regular expressions are typically used in lexical analysis for tokenizing input strings, while grammars are used in syntax analysis to analyze the hierarchical structure of a language.

TOP DOWN PARSING

Top-down parsing is a parsing technique used in computer science and compiler design to construct a parse tree or find a leftmost derivation for an input string. It starts from the top (the start symbol) of a context-free grammar and recursively expands non-terminals to match the input symbols, ultimately producing a parse tree. This parsing technique is also known as predictive parsing because it predicts the production rules to be applied based on the current non-terminal symbol and the next input symbol.

In top-down parsing, the goal is to construct the parse tree in a depth-first, pre-order manner. The process begins with the start symbol as the root of the parse tree and proceeds by expanding non-terminals from left to right, attempting to match the input symbols. The choice of which production rule to apply at each step is determined by the current non-terminal symbol and the next input symbol. By following this process recursively, the parser explores different paths in the grammar until a successful parse or a parsing error is encountered.

Recursive descent parsing is a common implementation technique for top-down parsing. It involves creating separate recursive procedures or functions for each non-terminal in the grammar. Each procedure handles the parsing logic for its corresponding non-terminal, recursively calling other procedures as necessary to handle the expansion of non-terminals. This approach makes the parsing process intuitive and closely aligns with the grammar's structure.

Top-down parsing has advantages such as simplicity and ease of understanding. It allows for straightforward error reporting and recovery since it detects errors as soon as they occur during the parsing process. However, it can be inefficient in cases where backtracking is required due to ambiguous or non-deterministic grammars. To address this, optimization techniques like memoization or lookahead may be employed to improve the efficiency of top-down parsers.

RECURSIVE DESCENT PARSING

Recursive-descent parsing is a specific type of top-down parsing that uses procedures or functions to handle each non-terminal in the grammar.

Here's a step-by-step explanation of the recursive-descent parsing process:

1. Initialization :-
• Create a parse tree with the start symbol S as the root.
• Set two pointers: one for the parse tree (tree pointer) and one for the input string (input pointer).
• Initially, the tree pointer points to S, and the input pointer points to the first symbol of the input string.
2. Expansion:
• Use the first production rule for the non-terminal pointed to by the tree pointer to expand the tree.
• Move the tree pointer to the leftmost symbol of the newly created subtree.

3. Matching:
• Compare the symbol pointed to by the tree pointer with the symbol pointed to by the input pointer.
• If they match, advance both pointers to the right.
• If they don't match, backtrack to the previous step before the non-terminal expansion and try another production.

4. Recursive Expansion:
• Whenever the tree pointer points to a non-terminal, repeat steps 2 to 4 for that non-terminal.
• Expand the non-terminal using its first production rule.
• Continue matching symbols between the tree pointer and the input pointer.

5. Parsing Completion:
• If the input pointer reaches the end of the input string and the tree pointer passes the last symbol of the tree, parsing is successful.
• Otherwise, if there is no match or no alternative production can be applied, parsing fails.

EXAMPLE :-

Draw a parse tree for the input string "cad" using the following grammar:

S → cAd
A → ab | a

To construct a parse tree for the input string w = cad, begin with a tree consisting of a single node labeled S, and the input pointer pointing to c, the first symbol of w. S has only one production, so we use it to expand S and obtain the tree of Fig. 3.2 (a). The leftmost leaf, labeled c, matches the first symbol of input w, so we advance the input pointer to a, the second symbol of w, and consider the next leaf, labeled A.

Now, we expand A using the first alternative A → ab to obtain the tree of Fig. 3.2 (b). We have a match for the second input symbol, a, so we advance the input pointer to d, the third input symbol, and compare d against the next leaf, labeled b. Since b does not match d, we report failure and go back to A to see whether there is another alternative for A that has not been tried, but that might produce a match.

In going back to A, we must reset the input pointer to position 2, the position it had when we first came to A, which means that the procedure for A must store the input pointer in a local variable. The second alternative for A produces the tree of Fig. 3.2 (c). The leaf a matches the second symbol of w and the leaf d matches the third symbol. Since we have produced a parse tree for w, we halt and announce successful completion of parsing.
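As a compact sketch (not taken from the original document), a backtracking recursive-descent parser for exactly this grammar can be written with one function per non-terminal, mirroring the steps described above; the global input cursor is a simplification.

#include <stdio.h>

/* Grammar:  S -> c A d      A -> a b | a
   One function per non-terminal; 'pos' is the input cursor. */
static const char *input;
static int pos;

static int match(char c) {
    if (input[pos] == c) { pos++; return 1; }
    return 0;
}

static int A(void) {
    int saved = pos;                               /* remember the cursor for backtracking */
    if (match('a') && match('b')) return 1;        /* first alternative: A -> a b */
    pos = saved;                                   /* backtrack */
    return match('a');                             /* second alternative: A -> a */
}

static int S(void) {
    return match('c') && A() && match('d');        /* S -> c A d */
}

int main(void) {
    input = "cad";
    pos = 0;
    if (S() && input[pos] == '\0')
        printf("\"%s\" parsed successfully\n", input);
    else
        printf("\"%s\" rejected\n", input);
    return 0;
}

On the input "cad" the call to A first tries the alternative ab, fails on d, resets the cursor, and then succeeds with the alternative a, exactly as in the narrative above.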
PREDICTIVE PARSING

Predictive parsing is a top-down parsing technique used to construct a parse tree or verify the syntax of a given input string. It is called "predictive" because it predicts which production rule to apply based on the current input symbol and the lookahead symbol. Unlike other top-down parsing methods that use backtracking, predictive parsing eliminates the need for backtracking by using a parsing table that determines the next production to apply.

In predictive parsing, the parser uses a stack to keep track of the grammar symbols and an input buffer to hold the input symbols. The parsing process starts with the stack containing the start symbol of the grammar. The parser then compares the top of the stack with the current input symbol. Based on this comparison, it uses the parsing table to predict the production rule to apply.

The parsing table is typically a two-dimensional table that maps a pair of a non-terminal and an input symbol to a production rule. The non-terminals are the rows of the table, and the input symbols are the columns. Each entry in the table specifies the production rule to use when the given non-terminal and input symbol are encountered.

The predictive parsing algorithm proceeds by repeatedly comparing the top of the stack with the current input symbol. If they match, the parser advances to the next input symbol and pops the stack. If they don't match, the parser consults the parsing table to determine the production rule to apply based on the current non-terminal and the lookahead symbol. The parser then pushes the symbols of the production rule onto the stack in reverse order.

The parsing process continues until the input is completely consumed and the stack becomes empty. At this point, if the parsing is successful, it means that the input string conforms to the grammar rules, and a valid parse tree can be constructed.

Predictive parsing is efficient because it avoids backtracking and can handle LL(1) grammars, which are a class of grammars where a decision can be made based on one token of lookahead. An LL(1) grammar ensures that the parsing process is deterministic and no conflicts arise in the parsing table.

To implement predictive parsing, a parser generator tool or manual construction of the parsing table can be used. The parsing table is typically generated from the grammar rules and can be constructed by analyzing the first and follow sets of the non-terminals in the grammar.

Overall, predictive parsing is a straightforward and efficient technique for top-down parsing, making it a popular choice for implementing parsers in many programming languages.

LL(1) GRAMMARS

An LL(1) parser is a type of predictive parser that can handle a class of grammars called LL(1) grammars. LL(1) stands for Left-to-right, Leftmost derivation, with one symbol of Lookahead. It means that the parser reads the input from left to right and constructs a leftmost derivation of the input string, using one symbol of lookahead to predict the next production rule to apply. "Using one symbol of lookahead" means that the LL(1) parser examines the next symbol in the input string to make parsing decisions; in other words, it looks ahead at the next input symbol to determine which production rule to apply.

The LL(1) parsing algorithm is based on a parsing table that determines the next production rule to apply based on the current non-terminal on the stack and the lookahead symbol. The parsing table is typically constructed by analyzing the grammar and computing the first and follow sets of the non-terminals.
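
For illustration, here is a minimal Python sketch of a table-driven predictive parser. The grammar, its parsing table, and the function names are assumptions made for this example, not part of the original notes; the assumed grammar is E → a E', E' → + a E' | ε.

# Hand-written LL(1) parsing table for the assumed grammar:
#   E  -> a E'
#   E' -> + a E' | epsilon
table = {
    "E":  {"a": ["a", "E'"]},
    "E'": {"+": ["+", "a", "E'"], "$": []},   # [] encodes the epsilon production
}

def ll1_parse(tokens):
    tokens = tokens + ["$"]           # end-of-input marker
    stack = ["$", "E"]                # start symbol on top of the stack
    i = 0
    while stack:
        top = stack.pop()
        lookahead = tokens[i]
        if top == lookahead:          # terminal on top matches the input symbol
            i += 1
        elif top in table and lookahead in table[top]:
            # replace the non-terminal by its right-hand side, pushed in reverse
            stack.extend(reversed(table[top][lookahead]))
        else:
            return False              # no table entry: syntax error
    return i == len(tokens)

print(ll1_parse(["a", "+", "a"]))     # True
print(ll1_parse(["+", "a"]))          # False

The loop above is essentially the stack-and-table procedure that the following algorithm describes step by step.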
The LL(1) parsing algorithm proceeds as follows :-

1. Initialize a stack with the start symbol of the grammar.
2. Read the current input symbol.
3. Repeat the following steps until the stack is empty:
• a) If the top of the stack is a terminal symbol and matches the current input symbol, pop the stack and read the next input symbol.
• b) If the top of the stack is a non-terminal symbol, consult the parsing table to determine the production rule to apply based on the current non-terminal and the lookahead symbol.
• c) If there is a production rule to apply, pop the non-terminal from the stack and push the symbols of the production rule onto the stack in reverse order.
• d) If there is no production rule to apply, report a syntax error.
4. If the stack is empty and the input symbols have been consumed, the parsing is successful. Otherwise, report a syntax error.

The LL(1) parsing algorithm eliminates the need for backtracking by using the lookahead symbol to make a deterministic decision at each step. LL(1) grammars are carefully designed to avoid conflicts in the parsing table, ensuring that there is a unique production rule to apply for each combination of non-terminal and lookahead symbol.

LL(1) parsers are commonly used in the implementation of programming language compilers and interpreters. They are efficient and can handle a wide range of programming language grammars. However, LL(1) grammars have some restrictions, such as not allowing left recursion, to maintain determinism and avoid conflicts in the parsing process.

Overall, an LL(1) parser is a type of predictive parser that uses a parsing table and one symbol of lookahead to parse the input string according to the grammar rules. It is a powerful and efficient parsing technique used in many compiler implementations.

BOTTOM-UP PARSING

Bottom-up parsing is a parsing technique that constructs a parse tree for an input string by starting from the leaves (the bottom) and working upwards towards the root (the top). In this approach, the parser attempts to find the rightmost derivation in reverse for the given input string. It begins with the input string and applies production rules in reverse order to reach the start symbol of the grammar. If the start symbol can be obtained from the input string, then the string is said to be accepted by the language defined by the grammar.

The process of bottom-up parsing involves repeatedly applying reduction rules to reduce a group of input symbols to a non-terminal symbol until the entire input string is reduced to the start symbol.

A common bottom-up parsing method is called shift-reduce parsing, which involves two main actions (a small sketch follows the list):

1. Shift: This action involves reading the next input symbol and pushing it onto the stack. The parser keeps shifting symbols from the input buffer to the stack until it can apply a reduction rule.

2. Reduce: This action involves applying a production rule in reverse. The parser looks for a sequence of symbols on the top of the stack that matches the right-hand side of a production rule and replaces them with the corresponding non-terminal symbol (the left-hand side of the production). The reduction reduces a group of symbols to a non-terminal.
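
The Python sketch below is illustrative only; the grammar E → E + a | a is an assumption chosen for this example, and a real shift-reduce parser decides between shifting and reducing with a parsing table rather than by pattern-matching the stack.

# Toy shift-reduce recognizer for the assumed grammar  E -> E + a | a
def shift_reduce(tokens):
    stack = []
    for tok in tokens:
        stack.append(tok)                    # shift the next input symbol
        while True:                          # reduce while a handle is on top
            if stack[-3:] == ["E", "+", "a"]:
                stack[-3:] = ["E"]           # reduce by E -> E + a
            elif stack[-1:] == ["a"]:
                stack[-1:] = ["E"]           # reduce by E -> a
            else:
                break
    return stack == ["E"]                    # accept if only the start symbol remains

print(shift_reduce(["a", "+", "a"]))   # True
print(shift_reduce(["a", "+"]))        # False

An LR parser performs the same shift and reduce moves, but uses parser states and an LR parsing table to decide which move to make, as described in the LR PARSING discussion below.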

The shift-reduce parsing continues until the entire input string is reduced to the start symbol of the grammar. If this is achieved, the parsing process is successful, and the input string is accepted by the language defined by the grammar. Otherwise, if the parser reaches a point where it cannot apply any reduction or shift action, the input string is not valid according to the grammar.

The process of bottom-up parsing can be visualized using a sequence of tree snapshots, where each snapshot shows the reduction steps applied at different stages of parsing. These snapshots illustrate how the input string is gradually reduced to the start symbol of the grammar.

Shift-reduce parsing is a powerful technique that can handle a wide range of context-free grammars, including ambiguous ones. However, it requires careful handling of conflicts that may arise due to multiple possible reductions or shifts at certain points in the parsing process, such as shift-reduce and reduce-reduce conflicts. To handle these conflicts, bottom-up parsers may employ techniques like operator precedence, precedence climbing, or LR (left-to-right, rightmost derivation) parsing methods. LR parsers are a class of bottom-up parsers known for their efficiency and ability to handle a broad class of context-free grammars.

LR PARSING

LR parsing is a bottom-up parsing technique that constructs a parse tree for the input string by performing a series of reductions (reversing the right-hand side of a production) in a rightmost derivation. It stands for "left-to-right, rightmost derivation" and is known for its power and efficiency in parsing a wide range of context-free languages.

In LR parsing, the input string is placed in an input buffer and the parser maintains a stack. The parser reads the input symbols from left to right and tries to find occurrences of production rules in the reverse order of their right-hand sides. This process involves shifting input symbols onto the stack and reducing them when a production rule is recognized.

The LR parser uses a parsing table, known as the LR parsing table, to determine its actions based on the current state of the parser and the input symbol. The parsing table consists of entries that specify whether to shift the input symbol onto the stack, reduce a portion of the stack using a production rule, or accept the input string as a valid parse.

LR parsing can handle a broader class of grammars than LL parsing, including left-recursive grammars and, with suitable conflict resolution, even some ambiguous grammars. It is commonly used in industrial-strength compiler generators and parser generators due to its efficiency and ability to handle complex languages.

There are different types of LR parsers, such as LR(0), SLR(1), LALR(1), and LR(1), each with varying degrees of lookahead and parsing power. These types differ in terms of the information they use to make parsing decisions and the size of their parsing tables.

In summary, LR parsing is a bottom-up parsing technique that constructs a parse tree by performing reductions based on the rightmost derivation of the input string. It uses a parsing table to guide its actions and can handle a wide range of context-free languages efficiently.

CHAPTER FIVE

SYMBOL TABLE, SYNTAX RELATED TRANSLATION AND TYPE CHECKING

A symbol table is a data structure used by compilers to store and manage information about identifiers (such as variables, functions, classes) encountered in the source code of a program. It acts as a central repository for collecting and organizing information about these program constructs.

The symbol table is built incrementally during the analysis phases of a compiler and utilized during the synthesis phases for generating the target code.
Each entry in the symbol table contains relevant information about an identifier, such as its name (lexeme), data type, memory location, scope, and other properties specific to the programming language.

The symbol table plays a crucial role in maintaining scope and binding information. It helps in resolving references to identifiers and ensures that they are used correctly and consistently within the program. When a name is encountered in the source code, the symbol table is searched to retrieve information about that name. If a new identifier is discovered or additional information about an existing identifier is found, the symbol table is updated accordingly.

The primary purposes of a symbol table are :-

1. Storing entity names: It provides a structured form to store the names of all entities (variables, functions, classes, etc.) encountered in the program. This allows for efficient access and retrieval of information.

2. Declaration verification: The symbol table is used to check if a variable or any other entity has been properly declared before its usage. It helps in ensuring that the program follows the language's scoping and declaration rules.

3. Type checking: The symbol table is involved in the process of type checking, which ensures that assignments and expressions in the source code are semantically correct. It verifies that the operations performed on variables or expressions are compatible with their declared types.

4. Scope resolution: The symbol table helps determine the scope of a name, i.e., where the name is valid and accessible within the program. It assists in resolving conflicts or ambiguities that may arise due to the presence of the same name in different scopes.

Efficient symbol table mechanisms are essential to support fast insertion and retrieval of entries. Depending on the size and complexity of the language being compiled, symbol tables may be implemented using different data structures such as hash tables, binary search trees, or other suitable data structures to ensure efficient symbol management during the compilation process.

INTRODUCTION TO TYPE CHECKING

Type checking is a crucial process performed by a compiler to ensure that the source program adheres to the syntactic and semantic rules of the programming language. It is a static checking process that occurs during compilation, prior to the actual execution of the program.

The main objective of type checking is to detect and prevent type errors, which occur when an operation or expression is applied to incompatible data types. Type errors can lead to unexpected behavior or runtime errors in the compiled program. By performing type checks, the compiler ensures that the program operates with consistent and compatible data types, promoting safer and more reliable code execution.

Static type checking involves examining the program's structure, expressions, and statements to verify their compatibility and correctness according to the language's type system. Here are some key aspects of type checking :-

1. Type Compatibility: The type checker verifies that the operands and operators used in expressions or statements are compatible. For example, adding an integer and a string would be flagged as a type error.

2. Type Inference: In languages with type inference, the type checker deduces the types of variables or expressions based on their usage and context. This allows for implicit type declarations without explicit type annotations.

3. Type Consistency: The type checker ensures that the assigned value or result of an expression is consistent with the declared or expected type.
For instance, assigning a floating-point value to an integer variable may require a type conversion or trigger a type error.

4. Function and Procedure Calls: The type checker validates that function and procedure calls match the expected number, order, and types of arguments. It checks that the arguments provided correspond to the defined parameter types.

5. Data Structures and Operations: Type checking ensures that operations such as indexing, dereferencing, or member access are applied to compatible data structures. For example, accessing an element in an array requires an integer index, and accessing a field in a struct requires the correct field name.

6. Type Safety: Type checking promotes type safety by preventing implicit type conversions that may lead to data loss or unpredictable behavior. It enforces explicit type conversions or casting when necessary.

The type checker analyzes the program's structure, symbol table entries, and type declarations to perform these checks. If any type errors are detected, the compiler reports them as compilation errors or warnings, allowing the programmer to address and fix them before executing the program.

Overall, type checking plays a vital role in ensuring the integrity and correctness of the program's type system, promoting robustness and reliability in the compiled code.

TYPE SYSTEMS

A type system is a fundamental component of programming languages that governs the classification and organization of data in a program. It provides a set of rules and mechanisms for assigning types to various language constructs, such as variables, expressions, functions, and data structures.

In the context of type checking, the design of a type checker relies on information about the syntactic constructs in the language, the concept of types, and the rules for assigning types to different language constructs. The type system defines the properties and behavior of these types and how they interact with each other.

In many programming languages, including Pascal and C, types are classified as either basic or constructed.

1. Basic Types: Basic types are atomic types that do not have any internal structure as far as the programmer is concerned. They represent fundamental data types supported by the language. Examples of basic types in Pascal include boolean, character, integer, and real. Additionally, Pascal allows the construction of other basic types, such as enumerated types (e.g., (violet, indigo, blue, green, yellow, orange, red)) and subrange types (e.g., 1..10). Basic types provide the building blocks for constructing more complex types.

2. Constructed Types: Constructed types are formed by combining basic types and other constructed types. They have a composite structure and are used to represent more complex data structures. Common examples of constructed types are arrays, records (structs in C), and sets. Arrays are collections of elements of the same type, while records consist of multiple fields or members, each with its own type. Sets represent a collection of distinct values from a predefined range.

The type system also defines rules for type compatibility and type coercion, specifying how types can be used and manipulated in expressions and operations. For example, the type system may dictate that addition, subtraction, and multiplication operators can only be applied to operands of the same basic type (such as integers), and the result will also be of the same type.
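
A minimal sketch of such a compatibility rule (the exact rule set and function name are assumptions chosen for illustration, not a definition from these notes):

# +, -, * require both operands to have the same basic type; the result
# then has that same type.
def result_type(op, left, right):
    if op in ("+", "-", "*"):
        if left == right and left in ("integer", "real"):
            return left                  # result has the operands' common type
        raise TypeError(f"cannot apply '{op}' to {left} and {right}")
    raise TypeError(f"unknown operator '{op}'")

print(result_type("+", "integer", "integer"))   # integer
print(result_type("*", "real", "real"))         # real
try:
    result_type("-", "integer", "real")
except TypeError as err:
    print(err)                                  # cannot apply '-' to integer and real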
Furthermore, the type system allows for the
construction of derived types, such as pointers in
C, which are used to store memory addresses of
other objects. Pointers have a type associated
with them, often expressed as "pointer to type,"
indicating the type of the object they point to.
The type system ensures that operations involving
pointers adhere to the correct type rules.
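
A small sketch of how a compiler might represent and check a "pointer to type" (the class and function names below are illustrative assumptions):

# Represent 'pointer to T' as a small type object; dereferencing is only
# legal on a pointer and yields the pointed-to type.
class Pointer:
    def __init__(self, target):
        self.target = target             # the type being pointed to
    def __repr__(self):
        return f"pointer to {self.target}"

def deref_type(t):
    if isinstance(t, Pointer):
        return t.target
    raise TypeError(f"cannot dereference a value of type {t}")

p = Pointer("integer")
print(p)                 # pointer to integer
print(deref_type(p))     # integer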

By enforcing type rules and performing type
checking, the type system helps prevent type
errors and ensures that programs follow the
correct usage of types, leading to safer and more
reliable code execution.

Compiler writers rely on the information provided
by the language specification, such as the Pascal
report or the C reference manual, to define the
type system and implement the type checker
accordingly. The type system forms a foundation
for the compilation process, guiding the analysis
and synthesis phases of the compiler to generate
correct and efficient target code.

SYNTAX RELATED TRANSLATION
