Heba Compiler Design Book - 2025
Prepared by:
Dr. Heba El Hadidi
2024-2025
A compiler is a program that takes a program written in a source language and translates it
into an equivalent program in a target language. The source language is often a high-level
language, and the target language is machine language.
Necessity of compiler:
• Techniques used in a lexical analyzer can be used in text editors, information retrieval
systems, and pattern recognition programs.
• Techniques used in a parser can be used in query processing systems such as SQL.
• Many software tools that have a complex front end may need techniques used in compiler design.
• A symbolic equation solver takes an equation as input; such a program must parse the
given input equation.
• Most of the techniques used in compiler design can be used in Natural Language Processing
(NLP) systems.
Properties of Compiler:
a) Correctness
i) Correct output in execution. ii) It should report errors. iii) It should correctly report if the
programmer is not following the language syntax.
b) Efficiency
c) Debugging / Usability.
Types of Compiler:
One pass
Two pass
Multi pass
Source-to-source compiler: a type of compiler that takes a high-level language as input
and produces its output in another high-level language. Example: OpenMP.
List of compilers
1. Ada compiler
2. ALGOL compiler
3. BASIC compiler
4. C# compiler
5. C compiler
6. C++ compiler
7. COBOL compiler
8. Smalltalk compiler
9. Java compiler
The compiler works together with several other programs, among them:
1. Preprocessor.
2. Assembler.
3. Linker.
4. Loader.
A preprocessor is a program that processes its input data to produce output that is used as
input to another program. The output is said to be a preprocessed form of the input data,
which is often used by some subsequent programs like compilers. The preprocessor is
executed before the actual compilation of code begins; it therefore digests all preprocessor
directives before any code is generated from the statements.
1. Macro processing: A preprocessor may allow a user to define macros that are shorthands
for longer constructs.
A macro is a rule or pattern that specifies how a certain input sequence (often a sequence of
characters) should be mapped to an output sequence (also often a sequence of characters)
according to a defined procedure. The mapping process that instantiates (transforms) a
macro into a specific output sequence is known as macro expansion.
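For instance, a small C example (the macro name is illustrative) showing a macro definition and its expansion:

#include <stdio.h>

/* SQUARE is a shorthand: the preprocessor maps every use of
   SQUARE(e) to ((e) * (e)) before compilation proper begins. */
#define SQUARE(x) ((x) * (x))

int main(void) {
    /* After macro expansion this line reads:
       printf("%d\n", ((3 + 1) * (3 + 1))); */
    printf("%d\n", SQUARE(3 + 1)); /* prints 16 */
    return 0;
}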
2. File inclusion: A preprocessor may include header files into the program text.
Preprocessor includes header files into the program text. When the preprocessor finds
#include directive it replaces it by the entire content of the specified file. There are two ways
to specify a file to be included:
#include "file"
#include <file>
The only difference between both expressions is the places (directories) where the compiler
is going to look for the file. In the first case where the file name is specified between double-
quotes, the file is searched first in the same directory that includes the file containing the
directive. In case that it is not there, the compiler searches the file in the default directories
where it is configured to look for the standard header files.
If the file name is enclosed between angle-brackets <> the file is searched directly where the
compiler is configured to look for the standard header files. Therefore, standard header files
are usually included in angle-brackets, while other specific header files are included using
quotes.
3. Language extension: For example, such a preprocessor might provide the user with built-in
macros for constructs like while-statements or if-statements, where none exist in the
programming language itself.
For example, the language Equel is a database query language embedded in C. Statements
beginning with ## are taken by the preprocessor to be database-access statements, unrelated
to C, and are translated into procedure calls on routines that perform the database access.
ASSEMBLER
There are two types of assemblers based on how many passes through the source are needed
to produce the executable program.
One-pass assemblers go through the source code once and assume that all symbols will be
defined before any instruction that references them.
Two-pass assemblers create a table with all symbols and their values in the first pass, and
then use the table in a second pass to generate code. The assembler must at least be able to
determine the length of each instruction on the first pass so that the addresses of symbols
can be calculated.
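A toy illustration of the two passes, using a made-up assembly in which every instruction occupies one word; the program, names, and format here are all my own, for illustration only:

#include <stdio.h>
#include <string.h>

/* Toy two-pass assembler: pass 1 records the address of each label,
   pass 2 replaces label operands by their addresses.
   Each line is either "label:" or an instruction occupying one word. */
static const char *prog[] = { "start:", "load", "jmp end", "add", "end:", "halt" };
static const int n = 6;

static char names[10][16]; static int addrs[10]; static int nsyms = 0;

int main(void) {
    int pc = 0;
    for (int i = 0; i < n; i++) {                 /* pass 1: build the symbol table */
        size_t len = strlen(prog[i]);
        if (prog[i][len - 1] == ':') {
            strncpy(names[nsyms], prog[i], len - 1);
            names[nsyms][len - 1] = '\0';
            addrs[nsyms++] = pc;                  /* label: note its address, emit no word */
        } else pc++;                              /* instruction: one word */
    }
    pc = 0;
    for (int i = 0; i < n; i++) {                 /* pass 2: emit code */
        if (prog[i][strlen(prog[i]) - 1] == ':') continue;
        char op[16], arg[16]; arg[0] = '\0';
        sscanf(prog[i], "%15s %15s", op, arg);
        printf("%d: %s", pc++, op);
        for (int s = 0; s < nsyms; s++)           /* resolve label operands */
            if (strcmp(arg, names[s]) == 0) { printf(" %d", addrs[s]); arg[0] = '\0'; }
        if (arg[0]) printf(" %s", arg);
        printf("\n");
    }
    return 0;
}

Because "end" is defined after the instruction "jmp end" that references it, a single pass could not have resolved the reference; the first pass makes the definition available before any code is emitted.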
The advantage of a one-pass assembler is speed, which is not as important as it once was with
advances in computer speed and capabilities. The advantage of the two-pass assembler is
that symbols can be defined anywhere in the program source. As a result, the program can be
defined in a more logical and meaningful way. This makes two-pass assembler programs
easier to read and maintain.
Interpreter
Languages such as BASIC, Python, and LISP can be translated using interpreters. Java also uses
an interpreter. The process of interpretation can be carried out in the following phases.
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Direct Execution
Advantages: an interpreter needs no separate compilation step and makes debugging easier,
since errors are reported in terms of the source program.
Disadvantages: interpreted programs generally run slower than compiled programs, because
each statement is analyzed every time it is executed.
Once the assembler produces an object program, that program must be placed into
memory and executed. The assembler could place the object program directly in memory and
transfer control to it, thereby causing the machine language program to be executed. However,
this would waste core by leaving the assembler in memory while the user’s program was being
executed. Also, the programmer would have to retranslate his program with each execution,
thus wasting translation time. To overcome this problem of wasted translation time and
memory, linkers and loaders are used.
A linker or link editor is a program that takes one or more objects generated by a compiler
and combines them into a single executable program.
A linker performs three tasks:
1. Searches the program to find library routines used by the program, e.g. printf(), math
routines.
2. Determines the memory locations that code from each module will occupy and
relocates its instructions by adjusting absolute references.
3. Resolves references among files.
Loader
“A loader is a program that places programs into memory and prepares them for execution.”
It would be more efficient if subroutines could be translated into an object form that the loader
could "relocate" directly behind the user’s program. The task of adjusting programs so that they
may be placed in arbitrary core locations is called relocation. Relocating loaders perform four
functions.
A loader is the part of an operating system that is responsible for loading programs, one of
the essential stages in the process of starting a program. Loading a program involves reading
the contents of the executable file, the file containing the program text, into memory, and then
carrying out other required preparatory tasks to prepare the executable for running. Once
loading is complete, the operating system starts the program by passing control to the loaded
program code.
All operating systems that support program loading have loaders, apart from systems where
code executes directly from ROM or in the case of highly specialized computer systems that
only have a fixed set of specialized programs.
In many operating systems the loader is permanently resident in memory, although some
OSs that support virtual memory may allow the loader to be located in a region of memory
that is pageable. In the case of operating systems that support virtual memory, the loader
may not actually copy the contents of executable files into memory, but rather may simply
declare to the virtual memory subsystem that there is a mapping between a region of memory
allocated to contain the running program's code and the contents of the associated
executable file. The virtual memory subsystem is then made aware that pages within that region
of memory need to be filled on demand if and when program execution actually hits those
areas of unfilled memory. This may mean parts of a program's code are not actually copied
into memory until they are actually used, and unused code may never be loaded into memory
at all.
In general, loading a program involves the following steps:
- Read the executable file's header to determine the size of the text and data segments
- Create a new address space for the program
- Copy instructions and data into the address space
- Copy arguments passed to the program onto the stack
- Initialize the machine registers, including the stack pointer
- Jump to a startup routine that copies the program's arguments from the stack to registers
and calls the program's main routine
Compiler Phases:
Each phase transforms the source program from one representation into another
representation. They communicate with error handlers and the symbol table.
Lexical Analyzer
• Lexical Analyzer reads the source program character by character and returns the tokens of
the source program.
• A token describes a pattern of characters having the same meaning in the source program.
(such as identifiers, operators, keywords, numbers, delimiters and so on)
Example:
For the statement newval := oldval + 12, the tokens are:
newval (identifier)
:= (assignment operator)
oldval (identifier)
+ (add operator)
12 (a number)
Since the lexical analyzer is the part of the compiler that reads the source text, it may perform
certain other tasks besides identification of lexemes.
One such task is stripping out comments and whitespace (blank, newline, tab, and perhaps
other characters that are used to separate tokens in the input).
Another task is correlating error messages generated by the compiler with the source
program. For instance, the lexical analyzer may keep track of the number of newline
characters seen, so it can associate a line number with each error message.
In some compilers, the lexical analyzer makes a copy of the source program with the error
messages inserted at the appropriate positions. If the source program uses a macro-
preprocessor, the expansion of macros may also
be performed by the lexical analyzer.
A token is a pair consisting of a token name and an optional attribute value. The token name
is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a
sequence of input characters denoting an identifier. The token names are the input symbols
that the parser processes. In what follows, we shall generally write the name of a token in
boldface. We will often refer to a token by its token name.
Pattern:
A pattern is a description of the form that the lexemes of a token may take. In the case of a
keyword as a token, the pattern is just the sequence of characters that form the keyword. For
identifiers and some other tokens, the pattern is a more complex structure that is matched by
many strings.
Lexeme:
A lexeme is a sequence of characters in the source program that matches the pattern for a
token and is identified by the lexical analyzer as an instance of that token.
Examples of Tokens:
1. Tokens are treated as terminal symbols in the grammar for the source language, using
boldface names to represent tokens.
2. The lexemes matched by the pattern for a token represent the strings of characters in
the source program that can be treated together as a lexical unit.
3. In most programming languages, keywords, operators, identifiers, constants, literals,
and punctuation symbols are treated as tokens.
4. A pattern is a rule describing the set of lexemes that can represent a particular token in the
source program.
5. In many languages certain strings are reserved, i.e., their meanings are predefined and
cannot be changed by the users.
6. If the keywords are not reserved, then the lexical analyzer must distinguish between a
keyword and a user-defined identifier.
ATTRIBUTES FOR TOKENS:
When more than one lexeme can match a pattern, the lexical analyzer must provide the
subsequent compiler phases additional information about the particular lexeme that
matched. For example, the pattern for token number matches both 0 and 1, but it is extremely
important for the code generator to know which lexeme was found in the source program.
Thus, in many cases the lexical analyzer returns to the parser not only a token name, but an
attribute value that describes the lexeme represented by the token; the token name
influences parsing decisions, while the attribute value influences translation of tokens after
the parse. We shall assume that tokens have at most one associated attribute, although this
attribute may have a structure that combines several pieces of information. The most
important example is the token id, where we need to associate with
the token a great deal of information. Normally, information about an identifier - e.g., its
lexeme, its type, and the location at which it is first found (in case an error message about that
identifier must be issued) - is kept in the symbol table. Thus, the appropriate attribute value
for an identifier is a pointer to the symbol-table entry for that identifier.
Example: The token names and associated attribute values for the Fortran statement
E = M * C ** 2 are written below as a sequence of pairs:
<id, pointer to symbol-table entry for E>
<assign_op>
<id, pointer to symbol-table entry for M>
<mult_op>
<id, pointer to symbol-table entry for C>
<exp_op>
<number, integer value 2>
INPUT BUFFERING:
During lexical analysis, to identify a lexeme, it is important to look ahead at least one
additional character. Specialized buffering techniques have been developed to reduce the
amount of overhead required to process a single input character.
An important scheme involves two buffers that are alternately reloaded.
Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096 bytes.
Using one system read command we can read N characters into a buffer, rather than using
one system call per character. If fewer than N characters remain in the input file, then a special
character, represented by eof, marks the end of the source file; it is different from any
possible character of the source program.
Two pointers to the input are maintained:
1. Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are
attempting to determine.
2. Pointer forward scans ahead until a pattern match is found; the exact strategy whereby this
determination is made will be covered later.
Once the next lexeme is determined, forward is set to the character at its right end. Then, after
the lexeme is recorded as an attribute value of a token returned to the parser, lexemeBegin
is set to the character immediately after the lexeme just found. In Fig. 3.3, we see forward has
passed the end of the next lexeme, ** (the Fortran exponentiation operator), and must be
retracted one position to its left.
Advancing forward requires that we first test whether we have reached the end of one of the
buffers, and if so, we must reload the other buffer from the input, and move forward to the
beginning of the newly loaded buffer. As long as we never need to look so far ahead of the
actual lexeme that the sum of the lexeme's length plus the distance we look ahead is greater
than N, we shall never overwrite the lexeme in its buffer before determining it.
Sentinels
If we use the scheme just described, we must check, each time we advance
forward, that we have not moved off one of the buffers; if we do, then we must also reload the
other buffer. Thus, for each character read, we make two tests: one for the end of the buffer,
and one to determine what character is read (the latter may be a multiway branch). We can
combine the buffer-end test with the
test for the current character if we extend each buffer to hold a sentinel character at the end.
The sentinel is a special character that cannot be part of the source program, and a natural
choice is the character eof. Figure 3.4 shows the same arrangement as Fig. 3.3, but with the
sentinels added. Note that eof retains its use as a marker for the end of the entire input. Any
eof that appears other than at the end of a buffer means that the input is at an end. Figure 3.5
summarizes the algorithm for advancing forward. Notice how the first test, which can be part
of a multiway branch based on the character pointed to by forward, is the only test we make,
except in the case where we actually are at the end of a buffer or the end of the input.
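A minimal C sketch of the forward-advancing step with sentinels; the buffer layout and the reload helper are my assumptions, and the book's eof sentinel is represented here by the '\0' character:

#include <stdio.h>

#define N 4096            /* buffer size, e.g., one disk block */
#define SENTINEL '\0'     /* stands in for the special eof character */

static char buf1[N + 1], buf2[N + 1];   /* one extra slot for the sentinel */
static char *forward;                   /* the scanning pointer */

/* Fill buf[0..N-1] from stdin and place the sentinel after the data.
   If fewer than N characters remain, the sentinel lands inside the
   buffer and marks the real end of the input. */
static char *reload(char *buf) {
    size_t n = fread(buf, 1, N, stdin);
    buf[n] = SENTINEL;
    return buf;
}

/* Advance forward one character. Only one test is made per character,
   except when a sentinel is actually hit. */
static char advance(void) {
    char c = *forward++;
    if (c == SENTINEL) {
        if (forward == buf1 + N + 1)        /* end of the first buffer */
            forward = reload(buf2);
        else if (forward == buf2 + N + 1)   /* end of the second buffer */
            forward = reload(buf1);
        else
            return SENTINEL;                /* sentinel inside a buffer: end of input */
        c = *forward++;
    }
    return c;
}

int main(void) {
    forward = reload(buf1);
    for (char c; (c = advance()) != SENTINEL; )
        putchar(c);                          /* echo the input as a demo */
    return 0;
}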
SPECIFICATION OF TOKENS :
Regular expressions are an important notation for specifying lexeme patterns.
Strings and Languages:
• An alphabet is any finite set of symbols, e.g., letters, digits, and punctuation.
• The set {0,1} is the binary alphabet.
• A string over an alphabet is a finite sequence of symbols drawn from the alphabet.
• The length of the string s, represented as |s|, is the number of occurrences of symbols
in s.
• The empty string, denoted ε, is the string of length 0.
• A language is any countable set of strings over some fixed alphabet, e.g., abstract
languages.
• If x and y are strings, then the concatenation of x and y, denoted by xy, is the string formed
by appending y to x. For example, if x=cse and y=department, then xy=csedepartment.
REGULAR EXPRESSIONS:
Suppose we wanted to describe the set of valid C identifiers.
We are able to describe identifiers by giving names to sets of letters and digits and using the
language operators union, concatenation, and closure. This process is so useful that a
notation called regular expressions has come into common use for describing all the
languages that can be built from these operators applied to the symbols of some alphabet. In
this notation, if letter_ is established to stand for any letter or the underscore, and digit is
established to stand for any digit, then we could describe the language of C
identifiers by:
letter_ ( letter_ | digit )*
Example: Let Σ = {a, b}.
1. The regular expression a|b denotes the language {a, b}.
2. (a|b)(a|b) denotes {aa, ab, ba, bb}, the language of all strings of length two over the
alphabet Σ. Another regular expression for the same language is aa|ab|ba|bb.
3. a* denotes the language consisting of all strings of zero or more a's, that is, {ε, a, aa, aaa, ...}.
4. (a|b)* denotes the set of all strings consisting of zero or more instances of a or b, that is, all
strings of a's and b's: {ε, a, b, aa, ab, ba, bb, aaa, ...}. Another regular expression for the same
language is (a*b*)*.
5. a|a*b denotes the language {a, b, ab, aab, aaab, ...}, that is, the string a and all strings
consisting of zero or more a's and ending in b. A language that can be defined by a regular
expression is called a regular set. If two regular expressions r and s denote the same regular
set, we say they are equivalent and write r = s. For instance, (a|b) = (b|a).
Example: C identifiers are strings of letters, digits, and underscores. Here is a regular
definition for the language of C identifiers. We shall conventionally use italics for the symbols
defined in regular definitions.
letter_ → A | B | ... | Z | a | b | ... | z | _
digit → 0 | 1 | ... | 9
id → letter_ ( letter_ | digit )*
For this language, the lexical analyzer will recognize the keywords if, then, and else, as well
as lexemes that match the patterns for relop, id, and number. To simplify matters, we make
the common assumption that keywords are also reserved words: that is, they are not
identifiers, even though their lexemes match the pattern for identifiers. In addition, we assign
the lexical analyzer the job of stripping out whitespace, by recognizing the "token" ws defined
by:
ws → ( blank | tab | newline )+
Here, blank, tab, and newline are abstract symbols that we use to express the ASCII characters
of the same names. Token ws is different from the other tokens in that, when we recognize it,
we do not return it to the parser, but rather restart the lexical analysis from the character that
follows the whitespace. It is the following token that gets returned to the parser.
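A minimal C sketch of such a lexer loop, which strips out ws and returns id and number tokens; the token codes and names here are mine, for illustration:

#include <ctype.h>
#include <stdio.h>

/* Illustrative token codes. */
enum { TOK_ID, TOK_NUMBER, TOK_OTHER, TOK_EOF };

static const char *src;   /* input cursor */

static int next_token(char *lexbuf) {
    while (*src == ' ' || *src == '\t' || *src == '\n')
        src++;                                  /* ws is recognized but not returned */
    if (*src == '\0') return TOK_EOF;
    int i = 0;
    if (isalpha((unsigned char)*src) || *src == '_') {   /* id: letter_ (letter_ | digit)* */
        while (isalnum((unsigned char)*src) || *src == '_')
            lexbuf[i++] = *src++;
        lexbuf[i] = '\0';
        return TOK_ID;
    }
    if (isdigit((unsigned char)*src)) {                   /* number: digit+ */
        while (isdigit((unsigned char)*src))
            lexbuf[i++] = *src++;
        lexbuf[i] = '\0';
        return TOK_NUMBER;
    }
    lexbuf[0] = *src++; lexbuf[1] = '\0';                 /* any other single character */
    return TOK_OTHER;
}

int main(void) {
    char lex[64];
    src = "newval := oldval + 12";
    for (int t; (t = next_token(lex)) != TOK_EOF; )
        printf("token %d, lexeme \"%s\"\n", t, lex);
    return 0;
}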
Syntax Analyzer
• A Syntax Analyzer creates the syntactic structure (generally a parse tree) of the given
program.
Example:
For the line of code newval := oldval + 12, the parse tree will be:
• A syntax analyzer checks whether a given program satisfies the rules implied by a CFG or not.
• If it satisfies, the syntax analyzer creates a parse tree for the given program
Example:
• Depending on how the parse tree is created, there are different parsing techniques.
– Top-Down Parsing,
– Bottom-Up Parsing
• Top-Down Parsing:
– Construction of the parse tree starts at the root, and proceeds towards the leaves.
• Bottom-Up Parsing:
– Construction of the parse tree starts at the leaves and proceeds towards the root.
– Normally efficient bottom-up parsers are created with the help of some software tools.
Semantic Analyzer
• A semantic analyzer checks the source program for semantic errors and collects the type
information for the code generation.
• The context-free grammar used in the syntax analysis is integrated with attributes (semantic
rules). The result is a syntax-directed translation and attribute grammars.
Example:
the type of the identifier newval must match the type of the expression (oldval+12)
Code Optimizer
The code optimizer optimizes the code produced by the intermediate code generator in
terms of time and space.
Example:
Code Generator
• The target program is normally a relocatable object file containing the machine codes.
Example:
Assuming that we have an architecture with instructions that have at least one operand as a
machine register, the final code for our line of code will be:
MOVE id2, R1
MULT id3, R1
ADD #1, R1
Exercise:
Errors
• An important concept in PLs is the ability to name objects such as variables, functions and
types. Each such named object will have a declaration, where the name is defined as a
synonym for the object. This is called binding.
• Each name will also have a number of uses, where the name is used as a reference to the
object to which it is bound. Often, the declaration of a name has a limited scope: a portion of
the program where the name will be visible. Such declarations are called local declarations,
whereas a declaration that makes the declared name visible in the entire program is called
global.
• It may happen that the same name is declared in several nested scopes. In this case, it is
normal that the declaration closest to a use of the name will be the one that defines that
particular use. In this context closest is related to the syntax tree of the program: The scope
of a declaration will be a sub-tree of the syntax tree and nested declarations will give rise to
scopes that are nested sub-trees. The closest declaration of a name is hence the declaration
corresponding to the smallest sub-tree that encloses the use of the name.
• As an example, look at this C statement block:
{
int x = 1; // declare integer variables x with scope until the closing brace in the last line.
int y = 2; // declare integer variables y with scope until the closing brace in the last line.
{
//A new scope is started by the second opening brace { and a floating-point variable x with an initial value close to π is declared. //This
will have scope until the first closing brace }, so the original x variable is not visible until the inner scope ends.
double x = 3.14159265358979;
y += (int)x; //This assignment will add 3 to y, so its new value is 5.
}
y += x; //this assignment, we have exited the inner scope, so the original x is restored. So, 1 will be added to y, //which
will have the final value 6
}
• Scoping based on the structure of the syntax tree, as shown in the example, is called static
or lexical binding and is the most common scoping rule in modern PLs. We will in the rest of
this book assume that static binding is used.
• A few PLs have dynamic binding, where the declaration that was most recently
encountered during execution of the program defines the current use of the name. By its
nature, dynamic binding cannot be resolved at compile-time, so the techniques that in the
rest of this book are described as being used in a compiler will have to be used at run-time if
the language uses dynamic binding.
• A compiler will need to keep track of names and the objects these are bound to, so that any
use of a name will be attributed correctly to its declaration. This is typically done using a
symbol table.
Symbol Table
• The symbol table is an important data structure created and maintained by the compiler in
order to keep track of the semantics of variables, i.e., it stores scope and binding information
about names, and information about instances of various entities such as variable and
function names, classes, objects, etc.
• A symbol table is a table that binds names to information. We need a number of operations
on symbol tables to accomplish this:
- We need an empty symbol table, in which no name is defined.
- We need to be able to bind a name to a piece of information. In case the name is already defined in
the symbol table, the new binding takes precedence over the old.
- We need to be able to look up a name in a symbol table to find the information the name is bound to.
If the name is not defined in the symbol table, we need to be told that.
- We need to be able to enter a new scope.
- We need to be able to exit a scope, reestablishing the symbol table to what it was before the scope
was entered.
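These five operations can be summarized as an interface; a possible set of C declarations (the type and function names are mine, for illustration):

/* A possible interface for the five symbol-table operations. */
typedef struct SymbolTable SymbolTable;

SymbolTable *empty_table(void);                           /* no name is defined */
void bind(SymbolTable *t, const char *name, void *info);  /* a new binding takes precedence over an old one */
void *lookup(SymbolTable *t, const char *name);           /* returns NULL if the name is not defined */
void enter_scope(SymbolTable *t);
void exit_scope(SymbolTable *t);                          /* reestablish the table as it was before the scope was entered */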
Among the compiler phases that use the symbol table:
5. Code Optimization: uses information present in the symbol table for machine-dependent
optimization.
6. Target Code Generation: generates code by using the address information of the
identifiers present in the table.
The compiler needs information about the names that appear in the source code. This
information, called attributes, is entered into the symbol table; it is collected during the
analysis phase.
1- Variable name: the search key for the variable. The information is entered into the table
during the lexical analysis phase.
2- Object code address: determines the location of the variable's value at execution time.
3- Data type: the variable's type. This information is used to determine the memory space
required for the variable's values, e.g., int 2B, char 1B, float 4B, double 8B, ...
4- Number of dimensions & number of parameters: a simple variable has 0, a vector (1-D
array) 1, a 2-D array (matrix) 2. For a function, this field records the number of parameters,
i.e., the number of variables the function receives.
5- Line declaration: the line number at which the variable is declared in the program
(one line).
6- Signal (reference) line: the number of the line(s) on which the variable is used, other than
the line on which it is declared. Since a variable may be used on more than one line, such a
table is called a cross-reference table.
Exercise:
Draw the cross-reference symbol table that would result when compiling the following C
source code segment:
1- void main()
2- {
3- int i, j[5];
4- char C, index[5][6], block[5];
5- float f;
6- i=0;
7- i=i+k;
8- f=f+i;
9- C='x';
10- block[4]=C;
11- }
Solution:
counter Variable name Object address Type Dim Line Line
declared reference
1 i 0 int 0 3 6, 7, 8
2 j 2 (as i need 2Bytes) int 1 3 -
3 C 12 (j need 2x5 char 0 4 9, 10
bytes)
4 index 13 (C 1 Byte) char 2 4 -
5 block 43 char 1 4 10
6 f 48 float 0 5 8
7 k - - 0 - 7
So, the variables are kept alphabetically ordered, and entering a variable with its attributes
into the table requires a search operation in the table to decide the suitable place to store it.
[Figure omitted: a tree-structured symbol table whose nodes hold the variables i, j, k, m, l;
each node has a left (L) and a right (R) pointer.]
Ex:
[Figure omitted: a binary search tree storing the variables frog, bird, tree, z1, och; each node
has left (L) and right (R) pointers. Comparison starts at the root and proceeds downward.]
A hash symbol table depends on an equation such as:
Hash(var name) = (var word length + ASCII of first letter of var word) % hash max
The ASCII codes of the letters are:
A 65   B 66   C 67   D 68   E 69   F 70
G 71   H 72   I 73   J 74   K 75   L 76
M 77   N 78   O 79   P 80   Q 81   R 82
S 83   T 84   U 85   V 86   W 87   X 88
Y 89   Z 90   a 97   b 98   c 99   d 100
e 101  f 102  g 103  h 104  i 105  j 106
k 107  l 108  m 109  n 110  o 111  p 112
q 113  r 114  s 115  t 116  u 117  v 118
w 119  x 120  y 121  z 122
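A minimal C sketch of this hash function (assuming hash max = 100, as in one of the questions below):

#include <stdio.h>
#include <string.h>

#define HASH_MAX 100   /* assumed hash table size */

/* Hash(var name) = (word length + ASCII of first letter) % HASH_MAX */
unsigned hash(const char *name) {
    return (unsigned)(strlen(name) + name[0]) % HASH_MAX;
}

int main(void) {
    const char *vars[] = { "floor", "door", "window", "apple", "frog" };
    for (int i = 0; i < 5; i++)     /* floor->7, door->4, window->25, apple->2, frog->6 */
        printf("%-6s -> slot %u\n", vars[i], hash(vars[i]));
    return 0;
}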
Advantages of symbol tables:
1. Increased efficiency: the efficiency of a program can be increased by using symbol
tables, which give quick and simple access to crucial data such as variable and
function names, data types, and memory locations.
2. Better code structure: symbol tables can be used to organize and simplify code,
making it simpler to comprehend, discover, and correct problems.
3. Faster code execution: by offering quick access to information like memory
addresses, symbol tables can be utilized to optimize code execution by lowering the
number of memory accesses required during execution.
4. Improved portability: symbol tables can be used to increase the portability of code by
offering a standardized method of storing and retrieving data, which can make it
simpler to migrate code between different systems or programming languages.
5. Improved code reuse: by offering a standardized method of storing and accessing
information, symbol tables can be utilized to increase the reuse of code across
multiple projects.
6. Easier debugging: symbol tables can be used to facilitate easy access to and
examination of a program's state during execution, enhancing debugging by making it
simpler to identify and correct mistakes.
Disadvantages of symbol tables:
1. Increased memory consumption: systems with low memory resources may suffer from
symbol tables' high memory requirements.
2. Increased processing time: the creation and processing of symbol tables can take a
long time, which can be problematic in systems with constrained processing power.
3. Complexity: developers who are not familiar with compiler design may find symbol
tables difficult to construct and maintain.
4. Limited scalability: symbol tables may not be appropriate for large-scale projects or
applications that require the management of enormous amounts of data.
5. Upkeep: maintaining and updating symbol tables on a regular basis can be time- and
resource-consuming.
6. Limited functionality: symbol tables may not offer all the features a developer needs,
so additional tools or libraries may be needed to round out their capabilities.
There are many ways to implement symbol tables, but the most important distinction
between these is how scopes are handled. This may be done using a persistent (or functional)
data structure, or it may be done using an imperative (or destructively-updated) data
structure. A persistent data structure has the property that no operation on the structure will
destroy it. Conceptually, a new modified copy is made of the data structure whenever an
operation updates it, hence preserving the old structure unchanged. This means that it is
trivial to reestablish the old symbol table when exiting a scope, as it has been preserved by
the persistent nature of the data structure. In practice, only a small portion of the data
structure is copied when a symbol table is updated, most is shared with the previous version.
In the imperative approach, only one copy of the symbol table exists, so explicit actions are
required to store the information needed to restore the symbol table to a previous state. This
can be done by using an auxiliary stack. When an update is made, the old binding of a name
that is overwritten is recorded (pushed) on the auxiliary stack. When a new scope is entered,
a marker is pushed on the auxiliary stack. When the scope is exited, the bindings on the
auxiliary stack (down to the marker) are used to reestablish the old symbol table. The
bindings and the marker are popped off the auxiliary stack in the process, returning the
auxiliary stack to the state it was in before the scope was entered.
1- Simple persistent symbol tables
2- A simple imperative symbol table
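A minimal C sketch of the imperative approach; here the table itself doubles as the auxiliary stack, with a NULL name acting as the scope marker (all names and types are mine, for illustration):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Each binding is pushed on a list; a NULL name is a scope marker. */
typedef struct Binding {
    const char *name;
    int info;                  /* the information bound to the name */
    struct Binding *next;
} Binding;

static Binding *table = NULL;  /* the empty symbol table */

void bind(const char *name, int info) {   /* a new binding shadows an old one */
    Binding *b = malloc(sizeof *b);
    b->name = name; b->info = info; b->next = table;
    table = b;
}

int lookup(const char *name, int *info) { /* returns 0 if name is undefined */
    for (Binding *b = table; b; b = b->next)
        if (b->name && strcmp(b->name, name) == 0) { *info = b->info; return 1; }
    return 0;
}

void enter_scope(void) { bind(NULL, 0); } /* push a scope marker */

void exit_scope(void) {   /* pop bindings down to (and including) the marker */
    while (table && table->name) { Binding *b = table; table = b->next; free(b); }
    if (table) { Binding *m = table; table = m->next; free(m); }
}

int main(void) {
    int v;
    bind("x", 1);
    enter_scope();
    bind("x", 3);                                        /* shadows the outer x */
    if (lookup("x", &v)) printf("inner x = %d\n", v);    /* prints 3 */
    exit_scope();
    if (lookup("x", &v)) printf("outer x = %d\n", v);    /* prints 1 */
    return 0;
}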
Efficiency issues :
While all of the above implementations are simple, they all share the same efficiency problem:
Lookup is done by linear search, so the worst-case time for lookup is proportional to the size
of the symbol table. This is mostly a problem in relation to libraries: It is quite common for a
program to use libraries that define literally hundreds of names. A common solution to this
problem is hashing: Names are hashed (processed) into integers, which are used to index an
array. Each array element is then a linear list of the bindings of names that share the same
hash code. Given a large enough hash table, these lists will typically be very short, so lookup
time is basically constant.
Using hash tables complicates entering and exiting scopes somewhat. While each element of
the hash table is a list that can be handled like in the simple cases, doing this for all the array
elements at every entry and exit imposes a major overhead. Instead, it is typical for
imperative implementations to use a single auxiliary stack (as described above) to
record all updates to the table so they can be undone in time proportional to the number of
updates that were done in the local scope. Functional implementations typically use
persistent hash tables, which eliminates the problem.
In some languages (like Pascal) a variable and a function in the same scope may have the
same name, as the context of use will make it clear whether a variable or a function is used.
We say that functions and variables have separate name spaces, which means that defining a
name in one space doesn’t affect the same name in the other space. In other languages (e.g.
C or SML) the context cannot (easily) distinguish variables from functions. Hence, declaring
a local variable might hide a function declared in an outer scope or vice versa. These
languages have a shared name space for variables and functions. Name spaces may be shared
or separate for all the kinds of names that can appear in a program, e.g., variables, functions,
types, exceptions, constructors, classes, field selectors etc. Which name spaces are shared is
language-dependent. Separate name spaces are easily implemented using one symbol table
per name space, whereas shared name spaces naturally share a single symbol table.
However, it is sometimes convenient to use a single symbol table even if there are separate
name spaces. This can be done fairly easily by adding name-space indicators to the names. A
name-space indicator can be a textual prefix to the name or it may be a tag that is paired with
the name. In either case, a lookup in the symbol table must match both the name and the
name-space indicator of the symbol that is looked up with the name and the name-space
indicator of the entry in the table.
Questions:
1- In some programming languages, identifiers are case-insensitive, so, e.g., size and SiZe
refer to the same identifier. Describe how symbol tables can be made case-insensitive.
2- What is the difference between compilation time and runtime?
3- Draw a figure to illustrate the compilation process.
4- What is the dimension attribute? What type of error can be determined based on the
dimension attribute?
5- Draw tree structured symbol table to store the following variables: name, age, degree,
average.
6- How does a one-pass compiler work?
7- What are the input and output of the semantic phase in a compiler?
8- Where will the variables (floor, door, window, apple, frog) be stored in a hash symbol table
if the hash table size = 100 records?
9- Define parser and lexeme.
10- Draw the syntax tree for the expression a=d*c/2-5
11- What are the languages described by the following regular expressions (RE), defined over
the alphabet {a, b}?
(i) a(a|b)*b (ii) aab(aa|bb)+ (iii) (aa)*a
1- The language whose words each begin with the letter a and end with the letter b.
2- The language whose words all begin with the string aab followed by at least one repetition of aa or bb.
3- The language whose words contain only the letter a and have odd length.
12- Write RE to define the following languages:
(i) If the language alphabet is {a, b, c} and all its words start with letter a
a(a|b|c)*
(ii) Integers that are multiples of 5
[0-9]*(0|5)
13- Write the leftmost derivation of a0+ab*(a+b1) for the G with the productions:
E→I| E+E| E*E| (E)
I→ a|b|Ia|Ib|I0|I1
Solution:
E→ E+E→ I+E→ I0+E→a0+E→a0+E*E→ a0+I*E→ a0+Ib*E→ a0+ab*E→a0+ab*(E)→
a0+ab*(E+E)→ a0+ab*(I+E)→ a0+ab*(a+E)→ a0+ab*(a+I)→ a0+ab*(a+I1)→ a0+ab*(a+b1)
14- Write the rightmost derivation of a0+ab*(a+b1) for the G with the productions:
E→I| E+E| E*E| (E)
I→ a|b|Ia|Ib|I0|I1
15- Consider G:
state→ type list terminator
type→int|float|char
list→list,id|id
terminator→;
consider the input w=int id,id,id,id;
build the parse tree.
Solution:
state
├── type
│   └── int
├── list
│   ├── list
│   │   ├── list
│   │   │   ├── list
│   │   │   │   └── id
│   │   │   ├── ,
│   │   │   └── id
│   │   ├── ,
│   │   └── id
│   ├── ,
│   └── id
└── terminator
    └── ;
[Figures omitted: syntax trees for expressions involving id1, id2, id3, and a constant; the
operator with the lowest precedence appears at the root of the tree.]
In parsing, code is taken from the preprocessor, broken into smaller pieces and analyzed
so other software can understand it. The parser does this by building a data structure out
of the pieces of input.
More specifically, a person writes code in a human-readable language like C++ or Java and
saves it as a series of text files. The parser takes those text files as input and breaks them
down so they can be translated on the target platform.
Stage 1: Lexical analysis
A lexical analyzer -- or scanner -- takes code from the preprocessor and breaks it into smaller
pieces. It groups the input code into sequences of characters called lexemes, each of which
corresponds to a token. Tokens are units of grammar in the programming language that the
compiler understands.
Lexical analyzers also remove white space characters, comments and errors from the
input.
Ex:
Lexeme   Token
x        identifier
+        addition operator
z        identifier
=        assignment operator
11       number/const
Stage 2: Syntactic analysis
This stage of parsing checks the syntactical structure of the input, using a data structure
called a parse tree or derivation tree. A syntax analyzer uses tokens to construct a parse tree
that combines the predefined grammar of the programming language with the tokens of the
input string. The syntactic analyzer reports a syntax error if the syntax is incorrect.
Ex:
The syntactic analyzer takes (x+y)*3 as input and returns this parse tree, which enables
the parser to understand the equation.
Stage 3: Semantic analysis
This stage checks the parse tree for meaning, e.g., type consistency. For example, if an
integer 20 is combined with a floating-point operand, then the analyzer will treat 20 as 20.0
before performing the operation.
Some sources refer only to the syntactic analysis stage as parsing because it generates the
parse tree. They leave out lexical and semantic analysis.
1- Terminal symbols
- Terminal symbols are the constituents of the sentences generated using the grammar.
- Terminal symbols are denoted by lowercase letters such as a, b, c, etc.
2- Non-Terminal symbols
- Non-Terminal symbols take part in the generation of the sentence but are not part of it.
- Non-Terminal symbols are also called auxiliary symbols or variables.
- Non-Terminal symbols are denoted by capital letters such as A, B, C, etc.
Example:
Consider a grammar G = (V , T , P , S) where-
•V={ S} // Set of Non-Terminal symbols
•T ={a,b } // Set of Terminal symbols
• P = { S → aSbS , S → bSaS , S → ∈ } // Set of production rules
This grammar generates the strings having equal number of a’s and b’s
Example:
• V = { S , A , B , C } // Set of Non-Terminal symbols
• T = { a , b , c } // Set of Terminal symbols
• P = { S → ABC , A → a , B → b , C → c } // Set of production rules
• S = S // Start symbol
L(G) = { abc }
♫ A syntax analyzer or parser takes the input from a lexical analyzer in the form of token
streams. The parser analyzes the source code (token stream) against the production
rules to detect any errors in the code.
♫ The output of this phase is a parse tree.
Derivation
For the string id + id * id with the productions E→E+E, E→E*E, E→id, two different
derivations (and hence two different parse trees) exist:
Derivation 1 (replacing the rightmost nonterminal at each step; + ends up at the root):
E→ E+E
→ E+E*E
→ E+E*id
→ E+id*id
→ id+id*id
Derivation 2 (replacing the leftmost nonterminal at each step; * ends up at the root):
E→ E*E
→ E+E*E
→ id+E*E
→ id+id*E
→ id+id*id
♫ A grammar G is said to be ambiguous if it has more than one parse tree (left or right
derivation) for at least one string.
♫ Example:
E→E+ E
E→E- E
E→id
For the string id+id-id, G will generate two parse tree:
Types of Parsing
❖ Syntax analysers follow production rules defined by means of context-free grammar.
❖ The way the production rules are implemented (derivation) divides parsing into two
types:
1- Top-down parsing
2- Bottom-up parsing
When a software language is created, its creators must specify a set of rules. These rules
provide the grammar needed to construct valid statements in the language.
Consider the following set of grammatical rules for a simple fictional language that only
contains a few words:
In this language, a sentence must contain a subject, verb and noun in that order, and specific
words are matched to the parts of speech.
A noun can be one of the following three words: dog, cat or person.
Parsing checks a statement that a user provides as input against these rules to prove that the
statement is valid.
Different parsing algorithms check in different orders. There are two main types of parsers:
• Top-down parsers. These start with a rule at the top, such as <sentence> ::= <subject>
<verb> <object>.
Given the input string "The person fed a cat," the parser would look at the first rule, and
work its way down all the rules checking to make sure they are correct.
In this case, the first word is a <subject>, it follows the subject rule, and the parser will
continue reading the sentence looking for a <verb>.
i.e, top-down parsers begin their work at the start symbol of the grammar at the top of the
parse tree. They then work their way down from the rule to the sentence.
• Bottom-up parsers. These start with the rule at the bottom.
In this case, the parser would look for an <object> first, then look for a <verb> next and so
on. i.e, Bottom-up parsers work their way up from the sentence to the rule.
Beyond these types, it's important to know the two types of derivation. Derivation is the
order in which the grammar reconciles the input string. They are:
• LL parsers. These parse input from left to right using leftmost derivation to match the
rules in the grammar to the input. This process derives a string that validates the input
by expanding the leftmost element of the parse tree.
• LR parsers. These parse input from left to right using rightmost derivation. This process
derives a string by expanding the rightmost element of the parse tree.
• Recursive descent parsers. Recursive descent parsers backtrack after each decision
point to double-check accuracy. Recursive descent parsers use top-down parsing.
• Earley parsers. These parse all context-free grammars, unlike LL and LR parsers. Most
real-world programming languages avoid the full generality of context-free grammars.
• Shift-reduce parsers. These shift and reduce an input string. At each stage in the
string, they reduce the word to a grammar rule. This approach reduces the string until
it has been completely checked.
Parser
Parsers are used when there is a need to represent input data from source code abstractly as
a data structure so that it can be checked for the correct syntax. Coding languages and other
technologies use parsing of some type for this purpose.
Technologies that use parsing to check code inputs include the following:
• C++
• Extensible Markup Language or XML
• Hypertext Markup Language or HTML
• Hypertext Preprocessor or PHP
• Java
• JavaScript
• JavaScript Object Notation or JSON
• Perl
• Python
➢ Database languages. Database languages such as Structured Query Language also use
parsers.
➢ Protocols. Protocols like the Hypertext Transfer Protocol and internet remote function
calls use parsers.
Parser generator. Parser generators take grammar as input and generate source code, which
is parsing in reverse. They construct parsers from regular expressions, which are special
strings used to manage and match patterns in text.
Top-down parsing is a method of parsing the input string provided by the lexical
analyzer.
The top-down parser parses the input string and then generates the parse tree for it.
In the top-down approach construction of the parse tree starts from the root node and
ends up creating the leaf nodes.
It uses leftmost derivation to derive a string that matches the input string.
Here the leaf nodes represent the terminals that match the terminals of the input string.
1- Left Recursion
✓ A grammar G(V, T, P, S) is left recursive if it has a production rule of the form
A→Aα|ß, where the nonterminal on the left side of the production also occurs at the first
position on the right side, and ß does not begin with A.
✓ With the left recursion problem, it becomes hard for a top-down parser to judge
when to stop parsing the left nonterminal, and it goes into an infinite loop.
A→Aα|ß ➔ A→ ßA’
A’→αA’|Ɛ
✓ In a left recursive grammar, expansion of A will generate Aα, Aαα, Aααα, ... at each step,
causing it to enter an infinite loop.
Example:
E→E+T|T
T→T*F|F
F→(E)|id
Solution:
E→ TE’
E’→ +TE’|Ɛ
T→FT’
T’→*FT’|Ɛ
F→(E)|id
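The transformed grammar can be parsed directly by a recursive-descent parser with one procedure per nonterminal; a minimal C sketch (id is abbreviated to the single letter 'i'; the function names are mine):

#include <stdio.h>
#include <stdlib.h>

/* Recursive-descent parser for the transformed grammar:
   E -> T E'    E' -> + T E' | Ɛ
   T -> F T'    T' -> * F T' | Ɛ
   F -> ( E ) | id            (id abbreviated to 'i') */
static const char *in;   /* next input character */

static void error(void) { printf("syntax error\n"); exit(1); }
static void match(char c) { if (*in == c) in++; else error(); }

static void E(void), Ep(void), T(void), Tp(void), F(void);

static void E(void)  { T(); Ep(); }
static void Ep(void) { if (*in == '+') { match('+'); T(); Ep(); } }  /* else Ɛ */
static void T(void)  { F(); Tp(); }
static void Tp(void) { if (*in == '*') { match('*'); F(); Tp(); } }  /* else Ɛ */
static void F(void)  {
    if (*in == '(') { match('('); E(); match(')'); }
    else if (*in == 'i') match('i');
    else error();
}

int main(void) {
    in = "i+i*i";
    E();
    printf(*in == '\0' ? "accepted\n" : "syntax error\n");
    return 0;
}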
2- Left Factoring
If more than one production rule of a grammar has a common prefix string, then the
top-down parser cannot make a choice as to which of the productions it should take to
parse the string in hand.
Example: for productions of the form
A → αβ1 | αβ2
the parser cannot determine which production to follow, as both productions start with the
same prefix α. Left factoring rewrites them as:
A → αA’
A’ → β1 | β2
Recursion-
1. Left Recursion-
Earlier discussed.
Example-
S → Sa | ∈
2. Right Recursion-
Example-
S → aS | ∈
(Right Recursive Grammar)
• Right recursion does not create any problem for the Top down parsers.
• Therefore, there is no need of eliminating right recursion from the grammar.
3. General Recursion-
• The recursion which is neither left recursion nor right recursion is called as general
recursion.
Example-
S → aSb | ∈
Problem:
A → ABd | Aa | a
B → Be | b
Solution-
A → aA’
A’ → BdA’ | aA’ | ∈
B → bB’
B’ → eB’ | ∈
Problem:
E→E+E|ExE|a
Solution-
E → aA
A → +EA | xEA | ∈
Problem:
E→E+T|T
T→TxF|F
F → id
Solution-
E → TE’
E’ → +TE’ | ∈
T → FT’
T’ → xFT’ | ∈
F → id
Problem:
S → (L) | a
L→L,S|S
Solution-
S → (L) | a
L → SL’
L’ → ,SL’ | ∈
Problem:
S → S0S1S | 01
Solution-
S → 01A
A → 0S1SA | ∈
Problem:
S→A
A → Ad | Ae | aB | ac
B → bBc | f
Solution-
S→A
A → aBA’ | acA’
A’ → dA’ | eA’ | ∈
B → bBc | f
Problem:
A → AAα | β
Solution-
A → βA’
A’ → AαA’ | ∈
Problem:
Consider the following grammar and eliminate left recursion-
A → Ba | Aa | c
B → Bb | Ab | d
Solution-
Step-01:
A → BaA’ | cA’
A’ → aA’ | ∈
A → BaA’ | cA’
A’ → aA’ | ∈
B → Bb | Ab | d
Step-02:
A → BaA’ | cA’
A’ → aA’ | ∈
B → Bb | BaA’b | cA’b | d
Step-03:
Now, eliminating left recursion from the productions of B, we get the following grammar-
A → BaA’ | cA’
A’ → aA’ | ∈
B → cA’bB’ | dB’
B’ → bB’ | aA’bB’ | ∈
Problem:
X → XSb | Sa | b
S → Sb | Xa | a
Solution-
Step-01:
X → SaX’ | bX’
X’ → SbX’ | ∈
X → SaX’ | bX’
X’ → SbX’ | ∈
S → Sb | Xa | a
Step-02:
X → SaX’ | bX’
X’ → SbX’ | ∈
S → Sb | SaX’a | bX’a | a
Step-03:
Now, eliminating left recursion from the productions of S, we get the following grammar-
X → SaX’ | bX’
X’ → SbX’ | ∈
S → bX’aS’ | aS’
S’ → bS’ | aX’aS’ | ∈
Problem:
S → Aa |b
A → Ac | Sd | ∈
Solution-
Step-01:
The productions of S contain no left recursion, so S is left unchanged. Substituting
S → Aa | b into A → Sd gives:
Step-02:
S → Aa | b
A → Ac | Aad | bd | ∈
Step-03:
Now, eliminating left recursion from the productions of A, we get the following grammar-
S → Aa | b
A → bdA’ | A’
A’ → cA’ | adA’ | ∈
An important part of parser table construction is to create FIRST and FOLLOW sets.
FIRST Function:
FIRST(α) is the set of terminal symbols that can begin the strings derived from α; if α can
derive ∈, then ∈ is also in FIRST(α).
Example:
➔ First(A)= {a, d, g}
FOLLOW Function:
FOLLOW(α) is a set of terminal symbols that appear immediately to the right of α. The
nonterminals are processed in the same order as they appear in the grammar. The start
symbol (the first element of the grammar) always has $ in its FOLLOW set; in the following
example the start symbol is E.
FOLLOW(E) has 3 cases:
1- E is followed by a terminal symbol, e.g. +; we take that terminal, so Follow(E) = {+}.
2- Nothing follows E (it is at the end of the right-hand side); we then go back and compute
Follow of the LHS. For example, for a production F → ...E, we compute Follow(F), and
Follow(E) = Follow(F).
3- In A → ET, E is followed by another nonterminal, T, so we look at First(T). There are two
cases: if First(T) contains ∈, then Follow(E) = (First(T) − {∈}) ∪ Follow(A);
otherwise Follow(E) = First(T).
Example:
E → TE’
E’ → +TE’| ∈
T→ FT’
T’→*FT’ | ∈
F→ (E) | id
Follow(E) = {$, ...}: we search for E on the right-hand side of the grammar and find its first
occurrence in the fifth rule, F→(E)| id. E is followed by the terminal ), so
Follow(E) = {$, )}.
To calculate Follow(E’): E’ first appears on the right in the first rule, E→TE’, and nothing
follows it (case 2), so we compute Follow of the LHS, i.e., FOLLOW(E).
So, Follow(E’)=Follow(E)={$, )}.
To calculate Follow(T): T first appears on the right in the first rule, E→TE’, followed by the
nonterminal E’ (case 3), so we take First(E’) = {+, ∈}. Since it contains ∈,
Follow(T) = (First(E’) − {∈}) ∪ Follow(E) = {+, $, )}.
To calculate Follow(T’): T’ appears at the right end of T→FT’ (case 2), so we take Follow of
the LHS, Follow(T). T’ also appears on the right of T’→*FT’ | ∈, but the LHS there is T’ itself,
so it gives nothing new. Hence Follow(T’)=Follow(T)={+, ), $}.
To calculate Follow(F): F first appears on the right in the rule T→FT’, followed by the
nonterminal T’ (case 3). FIRST(T’) = {*, ∈} contains ∈, so we remove ∈ and add Follow(T’):
Follow(F) = {*} ∪ Follow(T’) = {*, +, ), $}.
Example:
S1 → eS | ∈
E → b
Solution:
First(S1)={e, ∈}
First(E)={b}
Example:
S → aBDh
B → cC
C → bC | ∈
D → EF
E → g | ∈
F → f | ∈
Solution:
➢ First(S) = { a }
➢ First(B) = { c }
➢ First(C) = { b , ∈ }
➢ First(D) = (First(E) − {∈}) ∪ First(F) = { g , f , ∈ }
➢ First(E) = { g , ∈ }
➢ First(F) = { f , ∈ }
➢ Follow(S) = { $ }
➢ Follow(B) = (First(D) − {∈}) ∪ { h } = { g , f , h }
➢ Follow(C) = Follow(B) = { g , f , h }
➢ Follow(D) = { h }
➢ Follow(E) = (First(F) − {∈}) ∪ Follow(D) = { f , h }
➢ Follow(F) = Follow(D) = { h }
Exercises: compute the FIRST and FOLLOW sets for the following grammars.
1)
S → bXY
X → b | c
2)
S → aXb | X
X → cXb | b
3)
X → bXZ
Z → n
4)
S → ABb | bc
A → abAB | ∈
E → E+E|E*E|(E)|id
1- Remove ambiguity.
E→E+T| T
T→T*F| F
F→(E)|id
2- Remove left-Recursion
E→TE’
E’→+TE’| ∈
T→FT’
T’→*FT’| ∈
F→(E)|id
Now calculate First and Follow:
First(E)=First(T)=First(F)={(, id}
First(E’)={+, ∈}
First(T’)={*, ∈}
Follow(E)=Follow(E’)={$, )}
Follow(T)=Follow(T’)={+, $, )}
Follow(F)={*, +, $, )}
Exercise: eliminate left recursion and compute the FIRST and FOLLOW sets for:
S→bSX| Y
X→XC| bb
Y→b| bY
C→ccC| CX| cc
(Eliminating left recursion from X gives X→ bb𝑋̅, 𝑋̅→C𝑋̅| ∈.)
S→ (L)|a
L→ L, S| S
S→ cAd
A→ aA’
A’→ b| ∈
First(S)={c}
First(A)={a} First(A’)={b, ∈}
S→ L=R| R
L→*R| id
R→ L
Question:
S→aAbB| bAaB| ∈
A→S
B→S
      a        b        $
S     E1       E2       S→∈
A     A→S      A→S      Error
B     B→S      B→S      E3
Entries E1, E2, and E3 need to be filled. ∈ is the empty string, and $ indicates the end of the
input.
FIRST(A) = FIRST(S) = FIRST(B) = {a, b, ∈}
FOLLOW(A) = FIRST(bB) ∪ FIRST(aB) = {a, b}
FOLLOW(B) = FOLLOW(S) = FOLLOW(A) ∪ {$} = {a, b, $}
Example:
Calculate the first and follow functions for the given grammar:
S→A
A → aB | Ad
B→b
C→g
After eliminating left recursion, the grammar becomes:
S→A
A → aBA’
A’ → dA’ | ∈
B→b
C→g
Then:
• First(S) = First(A) = { a }
• First(A) = { a }
• First(A’) = { d , ∈ }
• First(B) = { b }
• First(C) = { g }
• Follow(S) = { $ }
• Follow(A) = Follow(S) = { $ }
• Follow(A’) = Follow(A) = { $ }
• Follow(B) = { First(A’) – ∈ } ∪ Follow(A) = { d , $ }
• Follow(C) = NA
Example:
Calculate the first and follow functions for the following grammar:
S → (L) | a
L → SL’
L’ → ,SL’ | ∈
Solution-
First(S) = { ( , a }
First(L) = First(S) = { ( , a }
First(L’) = { , , ∈ }
Follow(L) = { ) }
Follow(L’) = Follow(L) = { ) }
Example:
Calculate the first and follow functions for the following grammar:
S → AaAb |BbBa
A→∈
B→∈
Solution-
First(A) = { ∈ }
First(B) = { ∈ }
Follow(S) = { $ }
Problem:
Calculate the first and follow functions for the given grammar-
E→E+T|T
T→TxF|F
F → (E) | id
Solution-
E → TE’
E’ → + TE’ | ∈
T → FT’
T’ → x FT’ | ∈
F → (E) | id
First(E) = First(T) = First(F) = { ( , id }
First(E’) = { + , ∈ }
First(T’) = { x , ∈ }
Follow(E) = Follow(E’) = { $ , ) }
Follow(T) = Follow(T’) = { + , $ , ) }
Follow(F) = { x , + , $ , ) }
Example:
S→ ACB| Cbb|Ba
A→da| BC
B→g|∈
C→h|∈
Solution:
FIRST(B)= {g, ∈}
FIRST(C)= {h, ∈}
FOLLOW(S)={$}
FOLLOW(A)={h, g, $}
FOLLOW(B)={a, $, h, g}
Example:
S→ aSb| ba|∈
Solution:
FIRST(S)= {a,b, ∈}
Example:
S→ TabS| X
T→cT|∈
X→b| bX
Solution:
FIRST(S)= {c, a, b}
FIRST(T)={c, ∈}
FIRST(X)={b}
Example:
S→ AB| bS
A→aB| BB
B→b| cB
Solution:
FIRST(S)= {a, b, c}
FIRST(A)={a, b, c}
FIRST(B)={b, c}
Example:
S→ XYB| ccb
X→xX| ∈
Y→yY| Xy|∈
B→bbc|b
Solution:
FIRST(S)= {c, x, y, b}
FIRST(X)={x, ∈}
FIRST(Y)={y, x, ∈}
FIRST(B)={b}
Example:
S→ abS| bX
X→∈| cN
N→Nb| c
Solution:
S→ abS| bX
X→∈| cN
N→cN’
N’→ bN’| ∈
FIRST(S)= {a, b}
FIRST(X)={c, ∈}
FIRST(N)={c}
FIRST(N’)={b, ∈}
Example:
S→ XYZ
X→x | λ
Y→y | λ
Z→z
Solution:
First(X)={x, λ}, First(Y)={y, λ}, First(Z)={z}
First(S)={x, y, z}
Follow(S)={$}, Follow(X)={y, z}, Follow(Y)={z}, Follow(Z)={$}
Example:
S→ XYZ
X→x | λ
Y→y | λ
Z→z| λ
Solution:
First(X)={x, λ}, First(Y)={y, λ}, First(Z)={z, λ}
First(S)={x, y, z, λ}
Follow(S)={$}, Follow(X)={y, z, $}, Follow(Y)={z, $}, Follow(Z)={$}
Example:
S→ XYZa
X→Y
Y→y | λ
Z→z| λ
Solution:
First(Y)={y, λ}, First(X)=First(Y)={y, λ}, First(Z)={z, λ}
First(S)={y, z, a}
Follow(S)={$}, Follow(X)={y, z, a}, Follow(Y)={y, z, a}, Follow(Z)={a}
Example:
S→ XaYZ
X→Y
Y→y | λ
Z→z| λ
Solution:
First(Y)={y, λ}, First(X)={y, λ}, First(Z)={z, λ}
First(S)={y, a}
Follow(S)={$}, Follow(X)={a}, Follow(Y)={a, z, $}, Follow(Z)={$}
Exercise:
S→ aBDh
B→cC
C→bC|∈
D→EF
E→ g|∈
F→ f|∈
▪ Recursive descent is a top-down technique that constructs the parse tree from the
top and input is read from left to right.
▪ It uses procedures for every terminal and non-terminal entity.
▪ This parsing technique recursively parses the input to make a parse tree, which may
or may not require back-tracking.
1- Back-tracking
▪ In Top-Down Parsing with Backtracking, parser will attempt multiple rules or
production to identify the match for input string by backtracking at every step of
derivation.
▪ So, if the applied production does not give the input string as needed, or it does not
match with the needed string, then it can undo that shift.
▪ Example:
Consider the grammar:
S→aAd
A→bc| b
Make parse tree for the string ‘abd’.
Also show parse tree when backtracking is required when the wrong alternative is
chosen.
Solution:
- The derivation for the string abd will be:
- S→ aAd → abd (required string)
- If bc is substituted in place of non-terminal A then the string obtained will be abcd.
- S→aAd → abcd (Wrong Alternative (needs backtrack))
Backtracking looks very simple and easy to implement, but choosing a different
alternative causes a lot of problems:
o Undoing semantic actions requires a lot of overhead.
o Entries made in the symbol table during parsing have to be removed.
Due to these reasons, backtracking is not used for practical compilers.
LL Parser
• An LL parser accepts an LL grammar and is used to implement a predictive parser.
➢ LL grammar is a subset of context-free grammar but with some restrictions to
get the simplified version in order to achieve easy implementation.
➢ LL grammar can be implemented by means of both algorithms namely
recursive-descent or table driven.
• LL parser is denoted as LL(k).
➢ The first L in LL(k) is parsing the input from left to right.
➢ The second L in LL(k) stands for left-most derivation
➢ k represents the number of lookahead symbols. Generally k=1, so LL(k) may also be
written as LL(1).
- Find First(α) and for each terminal in First(α), make entry A→α in the table.
- If First(α) contains Є as terminal then find Follow(A) and for each terminal in
Follow(A) make entry A→α in the table.
- If First(α) contains Є and Follow(A) contains $ as terminal, then make entry A→α in
the table for the $.
- To construct the parsing table, we have two functions: FIRST and FOLLOW.
♫ In the table, rows will contain the Non-Terminals and the columns will contain
the Terminal Symbols. All the Null productions of the grammars will go under
the Follow elements and the remaining productions will lie under the elements
of the First set.
Example: Construct the predictive parsing table for the grammar:
E→ E+T| T
T→ T*F| F
F→ id| (E)
Solution:
1- Eliminate left recursion:
E→TE’
E’→+TE’| Ɛ
T→FT’
T’→*FT’| Ɛ
F→ id| (E)
2- No left factoring
3- Calculate First and Follow sets:

Production        First       Follow
E→TE’             {id, (}     {$, )}
E’→+TE’| Ɛ        {+, Ɛ}      {$, )}
T→FT’             {id, (}     {+, $, )}
T’→*FT’| Ɛ        {*, Ɛ}      {+, $, )}
F→ id| (E)        {id, (}     {*, +, $, )}
4- Predictive Parsing Table

        id       +          *          (        )        $
E     E→TE’                          E→TE’
E’              E’→+TE’                       E’→ Ɛ    E’→ Ɛ
T     T→FT’                          T→FT’
T’              T’→ Ɛ     T’→*FT’             T’→ Ɛ    T’→ Ɛ
F     F→id                           F→(E)
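This table can drive the parser mechanically with a stack; a minimal C sketch (E’ and T’ are encoded as the characters 'e' and 't', and id as 'i'; the encoding is mine, for illustration):

#include <stdio.h>
#include <string.h>

/* Table-driven LL(1) parser for the table above.
   Nonterminals: E, e (=E'), T, t (=T'), F.  Terminals: i (=id) + * ( ) $. */
static const char *rhs(char nt, char a) {     /* parse table M[nt, a] */
    switch (nt) {
    case 'E': return (a == 'i' || a == '(') ? "Te"  : NULL;
    case 'e': return a == '+' ? "+Te" : (a == ')' || a == '$') ? "" : NULL;
    case 'T': return (a == 'i' || a == '(') ? "Ft"  : NULL;
    case 't': return a == '*' ? "*Ft" : (a == '+' || a == ')' || a == '$') ? "" : NULL;
    case 'F': return a == 'i' ? "i" : a == '(' ? "(E)" : NULL;
    }
    return NULL;
}

static int parse(const char *w) {             /* w must end with '$' */
    char stack[256] = "$E";                   /* $ at the bottom, start symbol on top */
    int top = 1;
    while (top >= 0) {
        char X = stack[top];
        if (strchr("EeTtF", X)) {             /* nonterminal: expand via the table */
            const char *r = rhs(X, *w);
            if (!r) return 0;                 /* blank table entry: error */
            top--;
            for (int i = (int)strlen(r) - 1; i >= 0; i--)
                stack[++top] = r[i];          /* push the RHS reversed */
        } else if (X == *w) {                 /* terminal on top: match input */
            top--; w++;
            if (X == '$') return 1;           /* matched the bottom marker: accept */
        } else return 0;
    }
    return 0;
}

int main(void) {
    printf("%s\n", parse("i+i*i$") ? "Accepted" : "Rejected");
    return 0;
}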
Example:
S→ iEtSS’| a
S’→eS| Ɛ
E→ b
Solution:
First(E)={b}
First(S’)={e, Ɛ}
First(S)={i, a}
Follow(S)={e, $}
Follow(S’)={e, $}
Follow(E)={t}
The parsing table for this grammar is:

      a      b      e              i            t      $
S    S→a                          S→iEtSS’
S’                  S’→ Ɛ                              S’→ Ɛ
                    S’→eS
E           E→b
As the table has multiply defined entry, the given grammar is not LL(1).
Follow(C)={…}
Follow(B)={…}
The parse table for this grammar is:

       a          b         c         d        q        $
S    S→AC$     S→AC$     S→AC$     S→AC$              S→AC$
A    A→aBCd    A→BQ      A→Ɛ       A→BQ               A→Ɛ
B               B→bB                B→d
C                         C→c       C→Ɛ               C→Ɛ
Q                                             Q→q
For the input string abdcdc$:

Stack      Input      Output
$S         abdcdc$    S→AC$
$CA        abdcdc$    A→aBCd
$CdCBa     abdcdc$    Pop a
$CdCB      bdcdc$     B→bB
$CdCBb     bdcdc$     Pop b
$CdCB      dcdc$      B→d
$CdCd      dcdc$      Pop d
$CdC       cdc$       C→c
$Cdc       cdc$       Pop c
$Cd        dc$        Pop d
$C         c$         C→c
$c         c$         Pop c
$          $          Pop $
                      Accepted
Reductions
Bottom-up parsing
- Reducing a string w to the start symbol
- At each reduction step, a particular substring matching the RHS of a production is replaced
by the LHS.
- Right-most derivation is traced out in reverse.
- E.g., for the grammar:
S→ aABe
A→ Abc | b
B→ d
the string abbcde can be reduced to S:
abbcde → aAbcde → aAde → aABe → S
Syntax tree
Syntax trees are abstract or compact representations of parse trees.
They are also called Abstract Syntax Trees.
Example: For the expression (a+b)*(c-d)+((e/f)*(a+b)), draw:
1. Parse tree
2. Syntax tree
3. Directed Acyclic Graph (DAG)
Solution:
[Parse-tree, syntax-tree, and DAG figures omitted.]
Converting the expression to postfix form, step by step:
(a+b)*(c-d)+((e/f)*(a+b))
ab+ * ( c - d ) + ( ( e / f ) * ( a + b ) )
ab+ * cd- + ( ( e / f ) * ( a + b ) )
ab+ * cd- + ( ef/ * ( a + b ) )
ab+ * cd- + ( ef/ * ab+ )
ab+ * cd- + ef/ab+*
ab+cd-* + ef/ab+*
ab+cd-*ef/ab+*+
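This conversion can be mechanized with an operator stack; a minimal C sketch (shunting-yard style; the function names are mine, for illustration):

#include <stdio.h>
#include <string.h>

/* Infix -> postfix with an operator stack, for single-letter operands
   and the operators + - * / and parentheses. */
static int prec(char op) {
    switch (op) {
    case '+': case '-': return 1;
    case '*': case '/': return 2;
    }
    return 0;                                 /* '(' has the lowest rank */
}

void to_postfix(const char *in, char *out) {
    char stack[128]; int top = -1, k = 0;
    for (; *in; in++) {
        char c = *in;
        if (c >= 'a' && c <= 'z') out[k++] = c;          /* operand: emit */
        else if (c == '(') stack[++top] = c;
        else if (c == ')') {                              /* pop until '(' */
            while (top >= 0 && stack[top] != '(') out[k++] = stack[top--];
            top--;                                        /* discard '(' */
        } else if (strchr("+-*/", c)) {
            while (top >= 0 && prec(stack[top]) >= prec(c))
                out[k++] = stack[top--];                  /* pop higher/equal precedence */
            stack[++top] = c;
        }
    }
    while (top >= 0) out[k++] = stack[top--];             /* flush the stack */
    out[k] = '\0';
}

int main(void) {
    char out[128];
    to_postfix("(a+b)*(c-d)+((e/f)*(a+b))", out);
    printf("%s\n", out);                                  /* ab+cd-*ef/ab+*+ */
    return 0;
}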
There may exist derivations for a string which are neither leftmost nor rightmost.
Example:
Consider the following grammar:
S→ABC
A→a
B→b
C→c
Consider s string w=abc
Total 6 derivations exist for string w.
The following 4 derivations are neither leftmost nor rightmost.
Derivation 1:
S → ABC
→ aBC (Using A → a)
→ aBc (Using C → c)
→ abc (Using B → b)
Derivation 2:
S → ABC
→ AbC (Using B → b)
→ abC (Using A → a)
→ abc (Using C → c)
Derivation 3:
S → ABC
→ AbC (Using B → b)
→ Abc (Using C → c)
→ abc (Using A → a)
Derivation 4:
S → ABC
→ ABc (Using C → c)
→ aBc (Using A → a)
→ abc (Using B → b)
The other 2 derivations are leftmost derivation and rightmost derivation.
• In fact, there may exist a grammar in which leftmost derivation and rightmost
derivation is exactly same for all the strings.
Example
Consider the following grammar-
S → aS | ∈
The language generated by this grammar is-
L = { a^n : n >= 0 }, i.e., a*
All the strings generated from this grammar have their leftmost derivation and rightmost
derivation exactly same.
Let us consider a string w = aaa.
Leftmost Derivation:
S → aS
→ aaS (Using S → aS)
→ aaaS (Using S → aS)
→ aaa∈
→ aaa
Rightmost Derivation:
S → aS
→ aaS (Using S → aS)
→ aaaS (Using S → aS)
→ aaa∈
→ aaa
Clearly,
Leftmost derivation = Rightmost derivation
Similar is the case for all other strings.
A table-driven predictive parser consists of:
1- Input buffer
2- Stack
3- Parsing table
Example:
Construct the Predictive Parsing Table for the following grammar:
S→ iEtSS1 | a
S1→ eS1| λ
E→ b
Solution:
The LR parser is an efficient bottom-up syntax analysis technique that can be used for a large
class of context-free grammars.
L stands for the left-to-right scanning of the input.
R stands for rightmost derivation in reverse.
The number in parentheses (here 0) stands for the number of lookahead input symbols.
LR(0) is the simplest technique in the LR family. Although that makes it the easiest to learn,
these parsers are too weak to be of practical use for anything but a very limited set of
grammars.
Augmented grammar:
If G is a grammar with starting symbol S, then G’ (the augmented grammar for G) is a grammar
with a new starting symbol S’ and the production S’ → .S. The purpose of this new starting
production is to indicate to the parser when it should stop parsing. The ‘.’ before S indicates
that the part to the left of the ‘.’ has been read by the compiler and the part to the right of the
‘.’ is yet to be read.
Steps for constructing the LR parsing table :
1. Writing augmented grammar
2. LR(0) collection of items to be found
3. Defining 2 functions: goto(list of non-terminals) and action(list of terminals) in the
parsing table.
Q. Construct an LR parsing table for the given context-free grammar:
S → AA
A → aA | b
Solution:
STEP 1 - Find the augmented grammar.
The augmented grammar of the given grammar is:
S’ → .S [0th production]
S → .AA [1st production]
A → .aA [2nd production]
A → .b [3rd production]
STEP 2 – Find LR(0) collection of items
Below is the figure showing the LR(0) collection of items (figure omitted). We will understand
everything one by one.
• I0 goes to I3 when the ‘.’ of the 2nd production is shifted towards the right (A → a.A);
a is seen by the compiler.
• I0 goes to I4 when the ‘.’ of the 3rd production is shifted towards the right (A → b.);
b is seen by the compiler.
• I2 goes to I5 when the ‘.’ of the 1st production is shifted towards the right (S → AA.);
A is seen by the compiler.
• I2 goes to I4 when the ‘.’ of the 3rd production is shifted towards the right (A → b.);
b is seen by the compiler.
• I2 goes to I3 when the ‘.’ of the 2nd production is shifted towards the right (A → a.A);
a is seen by the compiler.
• I3 goes to I4 when the ‘.’ of the 3rd production is shifted towards the right (A → b.);
b is seen by the compiler.
• I3 goes to I6 when the ‘.’ of the 2nd production is shifted towards the right (A → aA.);
A is seen by the compiler.
• I3 goes to I3 when the ‘.’ of the 2nd production is shifted towards the right (A → a.A);
a is seen by the compiler.
As each cell has only one value in it, the given grammar is LR(0).
Example:
(0) S0 → S$
(1) S → AA
(2) S → bc
(3) S → baA
(4) A → c
Example:
(0) E ‘ → E $
(1) E → E + T
(2) E → T
(3) T → T * id
(4) T → id
SLR(1)
The simple improvement that SLR(1) makes on the basic LR(0) parser is to reduce only if
the next input token is a member of the follow set of the nonterminal being reduced. When
filling in the table, we don't assume a reduce on all inputs as we did in LR(0); we selectively
choose the reduction only when the next input symbol is a member of the follow set.
Example:
0) S' –> S
1) S –> XX
2) X –> aX
3) X –> b
Let’s parse the example string baab. It is a valid sentence in this language, as shown by this
leftmost derivation:
S → XX
→ bX
→ baX
→ baaX
→ baab
Now, let’s consider what the states mean. S4 is where X → b is completed; S2 and S6 are
where we are in the middle of processing the two a's; S7 is where we process the final b; S9
is where we complete the X → aX production; S5 is where we complete S → XX; and S1 is
where we accept.
LR(1)
LALR