Compiler Construction NOTE 1
2024/2025 SESSION
Recommended Textbook
Compilers: Principles, Techniques and Tools by Alfred V. Aho, Ravi Sethi and Jeffrey D.
Ullman.
INTRODUCTION
Compiler construction is a broad field. Compiler writing is perhaps the most pervasive topic
in computer science, involving many fields such as:
• Programming languages
• Architecture
• Theory of computation
• Algorithms
• Software engineering
In the early days, much of the effort went into how to implement high-level constructs. Then,
for a long time, the major emphasis was on improving the efficiency of generated code. These
topics remain important today, but many new technologies have caused them to become more
specialized.
Translator
This special software translates other programs into machine language. Translators fall into
three categories:
1. Assembler: this translates a program written in assembly language into machine language.
2. Interpreter: this translates the source program into the object program line by line and
executes (runs) each line straightaway.
3. Compiler: this translates the whole source program at once into the object program.
No.  Interpreter                                       Compiler
1.   A line in error can be corrected and the          The whole program must be compiled
     program run again before moving on                again if there is an error
2.   Requires more memory space, since the             Less memory space is needed for the
     interpreter must be stored while the              program to run
     program is running
3.   Programs run very slowly                          Programs usually run faster
4.   Very useful in a teaching environment             More useful in a production environment
More on Compilers
A compiler is a program that reads a program written in one language (the source language) and
translates it into an equivalent program in another language (the target language). The compiler
reports any errors found during the translation process.
The source language might be general purpose, e.g. C or Pascal, or a “little language” for a
specific domain, e.g. SIML
The target language might be
– Some other programming language
– The machine language of a specific machine
A compiler creates machine code that runs on a processor with a specific Instruction Set
Architecture (ISA), which is processor-dependent. For example, you cannot compile code
for an x86 and run it on a MIPS architecture without a special compiler. Compilers are also
platform dependent. That is, a compiler can convert C++, for example, to machine code that’s
targeted at a platform that is running the Linux OS. A cross-compiler, however, can generate
code for a platform other than the one it runs on itself. A cross-compiler running on a
Windows machine, for instance, could generate code that runs on a specific Windows
operating system or a Linux (operating system) platform. Source-to-source compilers
translate one program, or code, to another of a different language (e.g., from Java to C).
Choosing a compiler, then, means that first you need to know the ISA, operating system, and
the programming language that you plan to use. Compilers often come as a package with
other tools, and each processor manufacturer will have at least one compiler or a package of
software development tools (that includes a compiler).
More on Interpreters
An interpreter translates code like a compiler but reads the code and immediately executes
it, and therefore it starts up faster than a compiler. Thus, interpreters are often used in
software development tools as debugging tools, as they can execute a single line of code at
a time. Compilers translate code all at once, and the processor then executes the machine
language that the compiler produced. If changes are made to the code after compilation, the
changed code will need to be compiled and added to the compiled code (or perhaps the entire
program will need to be recompiled). But an interpreter, although it skips the step of
compiling the entire program up front, is much slower to execute than the same program
that has been completely compiled. Interpreters, however, are useful in
areas where speed doesn’t matter (e.g., debugging and training) and it is possible to take the
entire interpreter and use it on another ISA, which makes it more portable than a compiler
when working between hardware architectures. There are several types of interpreters: the
syntax-directed interpreter (i.e., the Abstract Syntax Tree (AST) interpreter), bytecode
interpreter, and threaded interpreter (not to be confused with concurrent processing threads),
Just-in-Time (a kind of hybrid interpreter/compiler), and a few others. Instructions on how
to build an interpreter can be found on the web. Some examples of programming languages
that use interpreters are Python, Ruby, Perl, and PHP.
More on Assemblers
An assembler translates a program written in assembly language into machine language and is
effectively a compiler for the assembly language. Assembly language is a low-level programming
language. Low-level programming languages are less like human language in that they are more
difficult to understand at a glance; you have to study assembly code carefully in order to follow
the intent of execution and in most cases, assembly code has many more lines of code to represent
the same functions being executed as a higher-level language. An assembler converts assembly
language code into machine code (also known as object code), an even lower-level language that
the processor can execute directly.
Types of Compilers
Compilers are categorized based on various criteria such as their design, target language,
execution method, and translation approach. Here are some of the common types of
compilers:
1. Single-pass Compiler
A single-pass compiler processes the source code only once and generates the target code
directly.
❖ Characteristics: It is simpler and faster but may be limited in optimization capabilities.
❖ Use Case: Often used for small programs or languages with simple syntax structures.
2. Multi-pass Compiler
A multi-pass compiler goes through the source code multiple times before generating the
target code.
❖ Characteristics: Allows for better error checking and optimizations, making it suitable
for complex language structures.
❖ Use Case: Commonly used for optimizing high-level languages like C++.
3. Cross-compiler
Compiles source code for a different platform or architecture than the one on which the
compiler is running.
❖ Characteristics: Essential for embedded systems where the target device may not have
the capacity to host the compiler.
❖ Use Case: Building software for different hardware architectures, such as ARM
processors used in mobile devices.
4. Bootstrap Compiler
A special type of compiler that is used to compile itself.
❖ Characteristics: Useful in developing new programming languages or updating a
compiler.
❖ Use Case: Often used in the early stages of language development.
5. Just-In-Time (JIT) Compiler
A compiler that translates code during program execution rather than before.
❖ Characteristics: Improves performance by compiling only the parts of the program that are
frequently executed.
❖ Use Case: Commonly used in runtime environments like Java's JVM and .NET's CLR.
6. Interpreter-based Compiler (Hybrid Compiler)
A combination of a compiler and an interpreter, where the code is first compiled to
intermediate code and then interpreted.
❖ Characteristics: Provides a balance between the speed of execution and portability.
❖ Use Case: Common in scripting languages like Python and JavaScript.
Phases of a Compiler
The phases of a compiler are: lexical analysis, syntax analysis, semantic analysis,
intermediate code generation, code optimization and code generation, supported throughout
by symbol-table management and error handling. Each phase is discussed below.
Linear (Lexical) Analysis
The linear analysis stage is called lexical analysis or scanning. For example, the characters in the
assignment statement:
position = initial + rate * 60
would be grouped into the following tokens:
1. the identifier position
2. the assignment symbol =
3. the identifier initial
4. the plus sign +
5. the identifier rate
6. the multiplication sign *
7. the number 60
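As a minimal illustration (a Python sketch; the token names and the helper scan() are ours, not from the textbook), this grouping can be reproduced with a regular-expression-based scanner:

import re

# Illustrative token specification; order matters (more specific first).
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_][A-Za-z0-9_]*"),
    ("ASSIGN", r"="),
    ("PLUS",   r"\+"),
    ("TIMES",  r"\*"),
    ("WS",     r"\s+"),   # white space is skipped, not returned
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def scan(source):
    # Characters matching no pattern are silently skipped in this sketch.
    for m in MASTER.finditer(source):
        if m.lastgroup != "WS":
            yield (m.lastgroup, m.group())

print(list(scan("position = initial + rate * 60")))
# [('ID', 'position'), ('ASSIGN', '='), ('ID', 'initial'),
#  ('PLUS', '+'), ('ID', 'rate'), ('TIMES', '*'), ('NUMBER', '60')]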
Syntax Analysis
Syntax analysis or parsing involves grouping the tokens of the source program into
grammatical phrases that are used by the compiler to synthesize output. The hierarchical
structure of the source program can be represented by a PARSE TREE. For our running
assignment statement, the tree has the following shape (indentation shows nesting):

assignment statement
    identifier: position
    =
    expression
        expression
            identifier: initial
        +
        expression
            expression
                identifier: rate
            *
            expression
                number: 60
Semantic Analysis
The semantic analysis stage:
– Checks for semantic errors, e.g. undeclared variables
– Gathers type information
– Determines the operators and operands of expressions
Example: if rate is a float, the integer literal 60 should be converted to a float before
multiplying. Then, we have
position = initial + rate * inttofloat(60)
Symbol-table management
A symbol table is a data structure containing a record for each identifier with fields for the
attributes of the identifier. During analysis, we record the identifiers used in the program.
The symbol table stores each identifier with its attributes. Example of attributes:
– How much storage is allocated for the id
– The id’s type
– The id’s scope
– For functions, the parameter protocol
Some attributes can be determined immediately; some are delayed until later phases.
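A minimal sketch of such a table in Python (the record fields here are illustrative; real compilers store more):

class SymbolTable:
    """One record per identifier, with fields for its attributes."""
    def __init__(self):
        self.entries = {}

    def install(self, name):
        # Install the identifier if it is new; return its record.
        if name not in self.entries:
            self.entries[name] = {"type": None, "storage": None, "scope": None}
        return self.entries[name]

table = SymbolTable()
record = table.install("rate")
record["type"] = "float"      # some attributes are filled in by later phases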
Error detection
Each compilation phase can have errors. Normally, we want to keep processing after an error,
in order to find more errors. Each stage has its own characteristic errors, e.g.
– Lexical analysis: a string of characters that do not form a legal token
– Syntax analysis: unmatched { } or missing ;
– Semantic: trying to add a float and a pointer
The internal representations at each of these phases are illustrated with our running example below.
Intermediate code generation
Some compilers explicitly create an intermediate representation of the source code program
after semantic analysis. The representation is as a program for an abstract machine. Most
common representation is “three-address code” in which all memory locations are treated as
registers, and most instructions apply an operator to two operand registers, and store the
result to a destination register.
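For our running example, the three-address code produced at this point would look something like the following (inttofloat is the conversion inserted by semantic analysis; id1, id2 and id3 stand for position, initial and rate):

temp1 := inttofloat(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3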
Code optimization
This phase attempts to improve the intermediate code. At this stage, we improve the code to
make it run faster.
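On the running example, the optimizer can perform the int-to-float conversion of 60 once and for all at compile time and eliminate the redundant temporaries, reducing the four instructions above to two:

temp1 := id3 * 60.0
id1 := id2 + temp1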
Code generation
This is the final phase of the compiler, where the target code is generated. In this final stage,
we take the three-address code (3AC) or other intermediate representation and convert it to
the target language. We must pick memory locations for variables and allocate registers. For
example, using registers 1 and 2, the translation of our running example might become
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
i.e.
temp1 := id3 * 60.0
id1 := id2 + temp1
Assignment
Distinguish between a compiler and an interpreter.
LEXICAL ANALYSIS
This is the first phase of a compiler. Its task is to read the input characters and produce as
output a sequence of tokens that the parser uses for syntax analysis.
This interaction is commonly implemented by making the lexical analyzer be a subroutine or
a co-routine of the parser. Upon receiving a “get next token” command from the parser, the
lexical analyzer reads input characters until it can identify a token and returns it. Since the
lexical analyzer is part of the compiler that reads the source text, it may also perform certain
secondary tasks at the user interface. They are:
– Strip out comments and white space from the source code.
– Correlate parser errors with the source code location (the parser does not know what line
of the file it is at, but the lexer does).
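A sketch of this interaction in Python (the function names are ours): the lexical analyzer is written as a generator, and each value the parser pulls from it answers one “get next token” request:

import re

def lexer(source):
    # Each next() on this generator answers one "get next token"
    # request; white space is skipped and never returned.
    for m in re.finditer(r"[A-Za-z_]\w*|\d+|[^\s]", source):
        yield m.group()

def parser(source):
    get_next_token = lexer(source)
    for token in get_next_token:     # the parser pulls tokens on demand
        print("parser received:", token)

parser("rate * 60")
# parser received: rate
# parser received: *
# parser received: 60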
Together, the complete set of tokens forms the set of terminal symbols used in the grammar
for the parser. In most languages, the tokens fall into these categories:
– Keywords
– Operators
– Identifiers
– Constants
– Literal strings
– Punctuation
Token attributes
If there is more than one lexeme for a token, we have to save additional information about
the token.
Example: the token number matches lexemes 10 and 20.
Code generation needs the actual number, not just the token. With each token, we associate
ATTRIBUTES, normally just a pointer into the symbol table. For the C source code E = M * C
* C, we have the token/attribute pairs
<ID, ptr to symbol table entry for E>
<Assign_op, NULL>
<ID, ptr to symbol table entry for M>
<Mult_op, NULL>
<ID, ptr to symbol table entry for C>
<Mult_op, NULL>
<ID, ptr to symbol table entry for C>
Lexical errors
When errors occur, we could just crash, but it is better to print an error message and then continue.
Possible techniques to continue on error:
➢Delete a character
➢Insert a missing character
➢Replace an incorrect character by a correct character
➢Transpose adjacent characters
These are written in order of increasing difficulty for the implementer. Unfortunately,
the harder-to-implement approaches often yield faster lexical analyzers. Note that the design
and use of an automatic generator, and some concepts for the organization of a hand-designed
lexical analyzer, shall be discussed here.
Token specification
REGULAR EXPRESSIONS (REs) are the most common notation for pattern specification.
Every pattern specifies a set of strings, so a RE names a set of strings. Some definitions are
stated below:
– The ALPHABET or CHARACTER CLASS (often written Σ) is the set of legal input symbols and
denotes any finite set of symbols, e.g. letters and characters. {0,1} is an example of a
binary alphabet; ASCII and EBCDIC are examples of computer alphabets.
– A STRING over some alphabet is a finite sequence of symbols drawn from that alphabet;
sentence and word are synonyms in language theory.
– The LENGTH of string s is written |s|. It is the number of occurrences of symbols in s
– The EMPTY STRING is a special 0-length string denoted by ε
A PREFIX of s is formed by removing 0 or more trailing symbols of s e.g. ban is a prefix of
banana
A SUFFIX of s is formed by removing 0 or more leading symbols of s e.g. nana is a suffix
of banana
A SUBSTRING of s is formed by deleting a prefix and a suffix from s e.g. nan is a substring
of banana
A PROPER prefix, suffix, or substring is a nonempty string x that is, respectively, a prefix,
suffix, or substring of s but with x ≠ s.
A SUBSEQUENCE of s is any string formed by deleting zero or more not
necessarily contiguous symbols from s, e.g. baaa is a subsequence of banana
A LANGUAGE is a set of strings over a fixed alphabet Σ. Examples are:
• ∅ (the empty set)
• {ε}
• {a, aa, aaa, aaaa}
The CONCATENATION of two strings x and y is written xy
String EXPONENTIATION is written s^i, where s^0 = ε and s^i = s^(i-1)s for i > 0.
Operations on Languages
Several important operations can be applied to languages. For lexical analysis, we are interested
primarily in union, concatenation, and closure, which are defined below. We can also generalize
the “exponentiation” operator to languages by defining L^0 to be {ε} and L^i to be L^(i-1)L. Thus,
L^i is L concatenated with itself i-1 times.
The UNION of L and M: L ∪ M = {s | s is in L or s is in M}
The CONCATENATION of L and M: LM = {st | s is in L and t is in M}
The KLEENE CLOSURE of L: L* = ∪(i≥0) L^i; that is, the closure (or star or Kleene closure)
of a language L is denoted by L* and represents any number of strings that can be formed
from L, possibly with repetitions. E.g. if L = {0, 1}, then L* = {ε, 0, 1, 00, 01, 10, 11, …}. L* is the
infinite union ∪(i≥0) L^i where L^0 = {ε}, L^1 = L and L^i = L^(i-1)L for i > 1.
The POSITIVE CLOSURE of a language L is denoted by L+, meaning concatenations of strings
from L excluding the empty string ε, i.e. L+ = L* − {ε}
Example:
Let L be the set {A, B, …, Z, a, b, …, z} and D the set {0, 1, 2, …, 9}. We can think of L and D
in two ways. We can think of L as the alphabet consisting of the set of upper and lower case
letters, and D as the alphabet consisting of the set of the ten decimal digits. Alternatively,
since a symbol can be regarded as a string of length one, the sets L and D are each finite
languages. Here are some examples of new languages created from L and D by applying the
operators defined above.
1. L ∪ D is the set of letters and digits
2. LD is the set of strings consisting of a letter followed by a digit
3. L^4 is the set of all four-letter strings
4. L* is the set of all strings of letters, including the empty string ε
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter
6. D+ is the set of all strings of one or more digits
A regular expression is built up out of simpler regular expressions using a set of defining rules.
Each regular expression r denotes a language L(r). The defining rules specify how L(r) is formed
by combining in various ways the languages denoted by the sub expressions of r.
Here are the rules that define the regular expressions over alphabet Σ. Associated with each rule is
a specification of the language denoted by the regular expression being defined.
1. ε is a regular expression that denotes {ε}, that is, the set containing the empty string.
2. If a is a symbol in Σ, then a is a regular expression that denotes {a}, i.e., the set
containing the string a. Although we use the same notation for all three, technically, the
regular expression a is different from the string a or the symbol a. It will be clear from
the context whether we are talking about a as a regular expression, string, or symbol.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s).
Then,
a) (r)|(s) is a regular expression denoting L(r) ∪ L(s)
b) (r)(s) is a regular expression denoting L(r)L(s)
c) (r)* is a regular expression denoting (L(r))*
d) (r) is a regular expression denoting L(r)
Unnecessary parentheses can be avoided in regular expressions if we adopt the conventions that:
1. the unary operator * has the highest precedence and is left associative
2. concatenation has the second highest precedence and is left associative
3. | has the lowest precedence and is left associative
Under these conventions, (a)|((b)*(c)) is equivalent to a|b*c. Both expressions denote the set of
strings that are either a single a or zero or more b’s followed by one c.
Examples:
1. The regular expression a|b denotes the set {a, b}.
2. The regular expression (a|b)(a|b) denotes {aa, ab, ba, bb}, that is, the set of all strings of a’s
and b’s of length two. Another regular expression for this same set is aa|ab|ba|bb.
3. The regular expression a* denotes the set of all strings of zero or more a’s, i.e. {ε, a,
aa, aaa, …}.
4. The regular expression (a|b)* denotes the set of all strings containing zero or more instances
of an a or b, that is, the set of all strings of a’s and b’s. Another regular expression for this
set is (a*b*)*
5. The regular expression a|a*b denotes the set containing the string a and all strings
consisting of zero or more a’s followed by b.
If two regular expressions r and s denote the same language, we say r and s are equivalent and
write r = s. For example, (a|b) = (b|a).
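These denotations can be spot-checked with Python’s re module (a convenience for experimenting, not a proof of equivalence):

import re

# (a|b)(a|b) matches exactly the length-two strings of a's and b's.
two = re.compile(r"(a|b)(a|b)")
assert all(two.fullmatch(s) for s in ("aa", "ab", "ba", "bb"))
assert not two.fullmatch("abb")

# a|a*b: the single string a, or zero or more a's followed by b.
p = re.compile(r"a|a*b")
assert p.fullmatch("a") and p.fullmatch("b") and p.fullmatch("aaab")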
There are a number of algebraic laws obeyed by regular expressions, and these can be used to
manipulate regular expressions into equivalent forms. The following laws hold for regular
expressions r, s and t:
r|s = s|r                 | is commutative
r|(s|t) = (r|s)|t         | is associative
(rs)t = r(st)             concatenation is associative
r(s|t) = rs|rt, (s|t)r = sr|tr   concatenation distributes over |
εr = r, rε = r            ε is the identity for concatenation
r* = (r|ε)*               relation between * and ε
r** = r*                  * is idempotent
Regular Definitions
To make our REs simpler, we may wish to give names to regular expressions and to define
regular expressions using these names as if they were symbols. If Σ is an alphabet of basic
symbols, then a regular definition is a sequence of definitions of the form
d1 → r1
d2 → r2
………
dn → rn
where each di is a distinct name, and each ri is a regular expression over the symbols in
Σ ∪ {d1, d2, . . ., di-1}, i.e., the basic symbols and the previously defined names. By restricting
each ri to symbols of Σ and the previously defined names, we can construct a regular
expression over Σ for any ri by repeatedly replacing regular-expression names by the
expressions they denote. If ri used dj for some j ≥ i, then ri might be recursively defined and
this substitution process would not terminate. To distinguish names from symbols, we print
the names in regular definitions in bold face.
Example:
Here is a regular definition for identifiers in Pascal:
letter → A | B | . . . | Z | a | b | …… | z
digit → 0 | 1 | . . . | 9
id → letter ( letter | digit )*
For a language such as C, where an identifier may also contain (and begin with) an
underscore, the definition becomes:
id → ( letter | _ ) ( letter | digit | _ )*
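These definitions translate directly into machine-checkable patterns; a small Python sketch (the variable names are ours):

import re

pascal_id = re.compile(r"[A-Za-z][A-Za-z0-9]*$")   # letter (letter|digit)*
c_id = re.compile(r"[A-Za-z_][A-Za-z0-9_]*$")      # variant with underscores

assert pascal_id.match("rate60")
assert not pascal_id.match("60rate")               # must begin with a letter
assert c_id.match("_tmp1") and not pascal_id.match("_tmp1")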
Example for numbers in Pascal:
digit → 0 | 1 | …….. | 9
digits → digit digit*
optional_fraction → . digits | ε
optional_exponent → ( E ( + | - | ε ) digits ) | ε
num → digits optional_fraction optional_exponent
This definition says that an optional_fraction is either a decimal point followed by one or
more digits, or ε if it is missing (the empty string). An optional_exponent, if it is not missing,
is an E followed by an optional + or - sign, followed by one or more digits. Note that at least
one digit must follow the period, so num does not match 1. (with a trailing period); however,
it does match 1.0
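The same definition written as a single Python regular expression (illustrative):

import re

# digits optional_fraction optional_exponent
num = re.compile(r"\d+(\.\d+)?(E[+-]?\d+)?$")

assert num.match("42") and num.match("1.0") and num.match("6.336E4")
assert not num.match("1.")   # at least one digit must follow the period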
Notational Shorthand
To simplify our REs, we can use a few shortcuts:
1. + means “one or more instances of”, e.g. a+ and (ab)+
2. ? means “zero or one instance of”, e.g. digits?
3. Character classes abbreviate alternation, e.g. [a-z] stands for a | b | . . . | z
Token Recognition
Having known how to specify the tokens for our language, how then do we write a program
to recognize them?
Consider this grammar example:
stmt→ if expr then stmt | if expr then stmt else stmt
expr→term relop term | term
term→ id | num
where the terminals if, then, else, relop, id and num generate sets of strings given by the
following regular definitions:
if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
id → letter ( letter | digit )*
num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
It is assumed that keywords are reserved and that lexemes are separated by white space,
consisting of non-null sequences of blanks, tabs and newlines. The lexical analyzer will strip
out the white space by matching the input against the regular definition ws below:
delim → blank | tab | newline
ws → delim delim*
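Putting these definitions together, a sketch of such a lexical analyzer in Python (the organization is illustrative; real scanners are usually generated by tools such as Lex):

import re

KEYWORDS = {"if", "then", "else"}          # reserved words
PATTERNS = [
    ("relop", r"<=|<>|>=|<|=|>"),
    ("num",   r"\d+(\.\d+)?(E[+-]?\d+)?"),
    ("id",    r"[A-Za-z][A-Za-z0-9]*"),
    ("ws",    r"[ \t\n]+"),                # white space: stripped, not returned
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in PATTERNS))

def tokens(source):
    for m in MASTER.finditer(source):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "ws":
            continue
        if kind == "id" and lexeme in KEYWORDS:
            yield (lexeme, None)           # keyword token, no attribute
        else:
            yield (kind, lexeme)

print(list(tokens("if x1 <= 60 then y else z")))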
NOTE
Note that the regular expression constructs permitted by Lex are listed below in decreasing order
of precedence. In this table, c stands for any single character, r for a regular expression, and s for
a string.
c       the character c literally
\c      the character c even when c is an operator
"s"     the string s literally
.       any character but newline
^       beginning of line
$       end of line
[s]     any one of the characters in string s
r*      zero or more strings matching r
r+      one or more strings matching r
r?      zero or one r
r{m,n}  between m and n occurrences of r
r1r2    an r1 followed by an r2
r1|r2   an r1 or an r2
(r)     same as r
r1/r2   r1 when followed by r2
Transition Diagrams
The transition diagram is a stylized flowchart produced as one of the intermediate steps in the
construction of a lexical analyzer. It depicts the action taken when a lexical analyzer is called by
the parser to get the next token. Transition diagrams are also called finite automata.
Positions in a transition diagram are drawn as circles and are called states. The states are connected
by arrows, called edges. Edges leaving state s have labels indicating the input characters that can
next appear after the transition diagram has reached state s. The label other refers to any character
that is not indicated by any of the other edges leaving s. Usually, when we recognize OTHER, we
need to put it back in the source stream since it is part of the next token. This action is denoted
with a * next to the corresponding state.
We assume the transition diagrams of this section are deterministic; that is, no symbol can match
the labels of two edges leaving one state. One state is labeled the start state; it is the initial state of
the transition diagram where control resides when we begin to recognize a token. Certain states
may have actions that are executed when the flow of control reaches that state. On entering a state
we read the next input character. If there is an edge from the current state whose label matches this
input character, we then go to the state pointed to by the edge. Otherwise, we indicate failure. The
figure below shows a transition diagram for the patterns >= and >. The transition diagram works
as follows, its start state is state 0. In state 0, we read the next input character. The edge labeled >
from state 0 is to be followed to state 6 if this input character is >. Otherwise, we have failed to
recognize either > or >=.
On reaching state 6 we read the next input character. The edge labeled = from state 6 is to be
followed to state 7 if this input character is an =. Otherwise, the edge labeled other indicates that
we are to go to state 8. The double circle on state 7 indicates that it is an accepting state, a state in
which the token >= has been found.
Notice that the character > and another extra character are read as we follow the sequence of edges
from the start state to the accepting state 8. Since the extra character is not a part of the relational
operator >, we must retract the forward pointer one character. We use a * to indicate states on
which this input retraction must take place.
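Transcribed into code, the diagram for > and >= looks as follows (a Python sketch; the state numbers 0, 6, 7 and 8 follow the description above):

def relop_gt(source, pos):
    # State 0: expect '>'.
    if pos < len(source) and source[pos] == ">":
        pos += 1                      # move to state 6
    else:
        return None                   # fail: neither > nor >=
    # State 6: look at the next input character.
    if pos < len(source) and source[pos] == "=":
        return ("GE", pos + 1)        # state 7: accept >=
    return ("GT", pos)                # state 8 (*): accept >, retract input

print(relop_gt("x >= y", 2))   # ('GE', 4)
print(relop_gt("x > y", 2))    # ('GT', 3)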
In general, there may be several transition diagrams, each specifying a group of tokens. If failure
occurs while we are following one transition diagram, then we retract the forward pointer to where
it was in the start state of this diagram, and activate the next transition diagram. Since the lexeme-
beginning and forward pointers marked the same position in the start state of the diagram, the
forward pointer is retracted to the position marked by the lexeme-beginning pointer. If failure
occurs in all transition diagrams, then a lexical error has been detected and we invoke an error-
recovery routine.
A simple technique for separating keywords from identifiers is to initialize appropriately the
symbol table in which information about identifiers is saved. For the tokens described above, we
need to enter the strings if, then, and else into the symbol table before any characters in the input
are seen. We also make a note in the symbol table of the token to be returned when one of these
strings is recognized. The return statement next to the accepting state in the figure above uses
gettoken() and install_id() to obtain the token and attribute value, respectively, to be returned. The
procedure install_id() has access to the buffer, where the identifier lexeme has been located. The
symbol table is examined and if the lexeme is found there marked as a keyword, install_id() returns
0. If the lexeme is found and is a program variable install_id() returns a pointer to the symbol table
entry. If the lexeme is not found in the symbol table, it is installed as a variable and a pointer to
the newly created entry is returned.
The procedure gettoken() similarly looks for the lexeme in the symbol table. If the lexeme is
a keyword, the corresponding token is returned; otherwise, the token Id is returned. Note that
the transition diagram does not change if additional keywords are to be recognized; we
simply initialize the symbol table with the strings and tokens of the additional keywords. The
technique of placing keywords in the symbol table is almost essential if the lexical analyzer
is coded by hand. Without doing so, the number of states in a lexical analyzer for a typical
programming language is several hundred, while using the trick, fewer than a hundred states
will probably suffice.
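A sketch of this technique in Python (install_id and gettoken follow the description above; the data layout is ours):

# Pre-load the symbol table with the reserved keywords before scanning.
symtab = {"if": {"token": "IF"}, "then": {"token": "THEN"}, "else": {"token": "ELSE"}}

def install_id(lexeme):
    # Returns 0 for a keyword, otherwise the id's symbol-table entry.
    entry = symtab.get(lexeme)
    if entry is not None and "token" in entry:
        return 0
    if entry is None:
        entry = symtab[lexeme] = {}   # install as a new program variable
    return entry

def gettoken(lexeme):
    entry = symtab.get(lexeme)
    if entry is not None and "token" in entry:
        return entry["token"]         # the keyword's own token
    return "ID"

print(gettoken("then"), gettoken("rate"))   # THEN ID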
The treatment of ws, representing white space, is different from that of the patterns discussed
above because nothing is returned to the parser when white space is found in the input. A
transition diagram recognizing ws by itself is shown below:
Nothing is returned when the accepting state is reached; we merely go back to the start state
of the first transition diagram to look for another pattern. Whenever possible, it is better to
look for frequently occurring tokens before less frequently occurring ones, because a
transition diagram is reached only after we fail on all earlier diagrams. Since white space is
expected to occur frequently, putting the transition diagram for white space near the
beginning should be an improvement over testing for white space at the end.
For each state, in every diagram, a piece of code is written to compare the next input with the
outgoing transitions and set the next state. If no next state is possible for the given input, we
call fail(). fail() backtracks to the beginning of the token and tries again with the next
transition diagram. If all diagrams fail, we generate an error. Ordering of the diagrams is
important. E.g. for unsigned numbers, the longer diagrams have to be tried first.
FINITE AUTOMATA
A recognizer for a language is a program that takes as input a string x and answers “yes” if x is a
sentence of the language and “no” otherwise. A regular expression is compiled into a recognizer by
constructing a generalized transition diagram called a finite automaton. A finite automaton can be
deterministic or nondeterministic, where nondeterministic means that more than one transition out
of a state may be possible on the same input symbol. Both deterministic and nondeterministic
finite automata are capable of recognizing precisely the regular sets. Thus, they both can recognize
exactly what regular expressions can denote. However, there is a time-space tradeoff: while
deterministic finite automata can lead to faster recognizers than nondeterministic automata, a
deterministic finite automaton can be much bigger than an equivalent nondeterministic automaton.
The conversion into a nondeterministic automaton is more direct.
Nondeterministic Finite Automata
A nondeterministic finite automaton (NFA, for short) is a mathematical model that consists of
1. a set of states S
2. a set of input symbols (the input symbol alphabets)
3. a transition function move that maps state-symbol pairs to sets of states
4. a state s0 that is distinguished as the start (or initial) state
5. a set of states F distinguished as accepting (or final) states
An NFA can be represented diagrammatically by a labeled directed graph called a transition graph,
in which the nodes are the states and the labeled edges represent the transition function. This graph
looks like a transition diagram, but the same character can label two or more transitions out of one
state, and edges can be labeled by the special symbol ε as well as by input symbols. The transition
graph for an NFA that recognizes the language (a|b)*abb is shown below. The set of states of the
NFA is {0, 1, 2, 3} and the input symbol alphabet is {a, b}. State 0 is distinguished as the start state,
and the accepting state 3 is indicated by a double circle.
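This NFA can be written down and simulated directly; a Python sketch (state numbering as in the text):

# Transition function of the NFA for (a|b)*abb.
move = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START, ACCEPTING = 0, {3}

def nfa_accepts(x):
    states = {START}                  # all states reachable so far
    for c in x:
        states = set().union(*(move.get((s, c), set()) for s in states))
    return bool(states & ACCEPTING)

for s in ("abb", "aabb", "babb", "ab"):
    print(s, nfa_accepts(s))          # True True True False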
In a computer, the transition function of an NFA can be implemented in several different
ways, of which the easiest is a transition table, in which there is a row for each state and a
column for each input symbol (and one for ε, if necessary). The entry for row i and symbol
a in the table is the set of states (or, more likely in practice, a pointer to the set of states) that
can be reached by a transition from state i on input a. The transition table for the NFA above
is shown below:
STATE    a         b
0        {0, 1}    {0}
1        –         {2}
2        –         {3}
The transition table representation has the advantage that it provides fast access to the transitions of
a given state on a given character; its disadvantage is that it can take up a lot of space when the
input alphabet is large and most transitions are to the empty set. Adjacency list representations of
the transition function provide more compact implementations, but access to a given transition is
slower. It should be clear that we can easily convert any one of these implementations of a finite
automaton into another.
An NFA accepts an input string x if and only if there is some path in the transition graph from the
start state to some accepting state, such that the edge labels along this path spell out x. The NFA
described above accepts the input strings abb, aabb, babb, aaabb, etc. For example, aabb is accepted
by the path from state 0, following the edge labeled a to state 0 again, then to states 1, 2, and 3 via
edges labeled a, b, and b, respectively.
A path can be represented by a sequence of state transitions called moves. The following
diagram shows the moves made in accepting the input string aabb:
In general, more than one sequence of moves can lead to an accepting state. Notice that
several other sequences of moves may be made on the input string aabb, but none of the others
happen to end in an accepting state. For example, another sequence of moves on input aabb
keeps reentering the non-accepting state 0:
The language defined by an NFA is the set of input strings it accepts.
Exercise:
1. Show that the NFA above accepts (a|b)*abb.
2. Design an NFA to accept aa*|bb*
Deterministic Finite Automata
A deterministic finite automaton (DFA, for short) has at most one transition from each state on
any input. If we are using a transition table to represent the transition function of a DFA, then
each entry in the transition table is a single state. Consequently, it is very easy to determine
whether a deterministic finite automaton accepts an input string, since there is at most one path
from the start state labelled by that string.
The example below shows the transition graph of a deterministic finite automaton accepting the
same language (a|b)*abb as that accepted by the NFA described above. On the input string ababb,
it follows the sequence of states
0, 1, 2, 1, 2, 3
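Simulating a DFA is even simpler than simulating an NFA, since each (state, symbol) pair has exactly one successor; a Python sketch with the standard transition table for (a|b)*abb:

dfa = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 3},
    3: {"a": 1, "b": 0},
}

def dfa_accepts(x):
    state = 0                         # the start state
    for c in x:
        state = dfa[state][c]
    return state == 3                 # the accepting state

print(dfa_accepts("ababb"))   # True, via the states 0, 1, 2, 1, 2, 3
print(dfa_accepts("abab"))    # False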
Assignment
1. Write regular expressions for identifiers in four programming languages: Java, Python,
C++ and Pascal.
IDENTIFIERS IN C++
How do we check an identifier in C++?
• Names can contain letters, digits and underscores.
• Names must begin with a letter or an underscore (_)
• Names are case-sensitive ( myVar and myvar are different variables)
• Names cannot contain whitespaces or special characters like !, #, %, etc.
Regular Expression
letter → A | B | . . . | Z | a | b | …… | z
digit → 0 | 1 | . . . | 9
id → ( letter | _ ) ( letter | digit | _ )*
Also written in regular-expression (Lex-style) notation as
^[A-Za-z_][A-Za-z0-9_]*$
Question?
1. How many identifiers are in the above code? Name them.
IDENTIFIERS IN C#
In programming languages, identifiers are used for identification purposes. Or in other words,
identifiers are the user-defined name of the program components. In C#, an identifier can be
a class name, method name, variable name, or label.
Example:
public class GFG {
    static public void Main()
    {
        int x;
    }
}
Here the total number of identifiers present in the above example is 3, and the names of these
identifiers are:
GFG: name of the class
Main: method name
x: variable name
class GFG {
    // Main Method
    static public void Main()
    {
        // variables
        int a = 10;
        int b = 39;
        int c;

        // simple addition
        c = a + b;
        Console.WriteLine("The sum of two number is: {0}", c);
    }
}
Output:
The sum of two number is: 49
The identifiers present in the above example are GFG, Main, a, b, c, Console and WriteLine,
while the keywords are class, static, public, void and int.
IDENTIFIERS IN PYTHON
A Python identifier is the name we give to identify a variable, function, class, module or other
object. That means whenever we want to give an entity a name, that name is called an
identifier. The isidentifier() method returns True if the string is a valid identifier, otherwise
False. A string is considered a valid identifier if it only contains alphanumeric characters
(a-z, A-Z, 0-9) or underscores (_). A valid identifier cannot start with a number, or contain
any spaces. In Python, identifiers are case-sensitive, meaning that foo and Foo are considered
to be two different identifiers.
Regular Expression
letter → A | B | . . . | Z | a | b | …… | z
digit → 0 | 1 | . . . | 9
id → ( letter | _ ) ( letter | digit | _ )*
Also written in regular-expression (Lex-style) notation as
^[A-Za-z_][A-Za-z0-9_]*$
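For example (the strings below are illustrative):

print("my_var1".isidentifier())   # True
print("1st_var".isidentifier())   # False: cannot start with a digit
print(" x".isidentifier())        # False: spaces are not allowed
print("foo" == "Foo")             # False: identifiers are case-sensitive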
IDENTIFIERS IN JAVA
The valid rules for defining Java identifiers are:
• It must start with either a lower case letter [a-z], an upper case letter [A-Z], an
underscore (_) or a dollar sign ($).
• It should be a single word; white spaces are not allowed.
• It should not start with a digit.
In regular-expression notation:
^[A-Za-z_$][A-Za-z0-9_$]*$
Here's a table on valid and invalid identifiers in Java for a quick glance.

Criteria                                Valid Identifiers         Invalid Identifiers
Start with a letter, underscore (_),    myVariable, _value, $id   9pins, -name
or dollar sign ($)
Subsequent characters                   var1, i9, _1_value        a@b, hello!world
Case sensitivity                        myVariable, MyVariable    (case variations are valid but
                                                                  represent different identifiers)
No reserved words                       userInput, totalSum       class, int, void
Unlimited length                        longIdentifierName123     (no length-based invalidity, but
(practically reasonable)                                          very long names are discouraged
                                                                  for readability)
SYNTAX ANALYSIS
Every programming language has rules that prescribe the syntactic structure of well-formed
programs. In Pascal, for example, a program is made out of blocks, a block out of statements,
a statement out of expressions, an expression out of tokens, and so on. The syntax of
programming language constructs can be described by context-free grammars or BNF
(Backus-Naur Form) notation. Grammars offer significant advantages to both language
designers and compiler writers.
What is a Grammar?
A grammar gives a precise, yet easy-to-understand, syntactic specification of a programming
language. From certain classes of grammars we can automatically construct an efficient
parser that determines if a source program is syntactically well formed. As an additional
benefit, the parser construction process can reveal syntactic ambiguities and other difficult-
to-parse constructs that might otherwise go undetected in the initial design phase of a
language and its compiler.
A properly designed grammar imparts a structure to a programming language that is useful
for the translation of source programs into correct object code and for the detection of errors.
Tools are available for converting grammar-based descriptions of translations into working
programs.
Commonly used parsing methods in compilers are classified as being either LL (left-to-right scan,
leftmost derivation) parsers, also known as top-down parsers, or LR (left-to-right scan, rightmost
derivation) parsers, also known as bottom-up parsers. As indicated by their names, top-
down parsers build parse trees from the top (root) to the bottom (leaves), while bottom-up parsers
start from the leaves and work up to the root. In both cases, the input to the parser is scanned from
left to right, one symbol at a time.
Often much of the error detection and recovery in a compiler is centered around the syntax analysis
phase. One reason for this is that many errors are syntactic in nature or are exposed when the
stream of tokens coming from the lexical analyzer disobeys the grammatical rules defining the
programming language. Another is the precision of modern parsing methods; they can detect the
presence of syntactic errors in programs very efficiently. Accurately detecting the presence of
semantic and logical errors at compile time is a much more difficult task.
The error handler in a parser has simple-to-state goals:
1. It should report the presence of errors clearly and accurately.
2. It should recover from each error quickly enough to be able to detect subsequent errors.
3. It should not significantly slow down the processing of correct programs.
The effective realization of these goals presents difficult challenges. Fortunately, common errors
are simple ones and a relatively straightforward error-handling mechanism often suffices. In some
cases, however, an error may have occurred long before the position at which its presence is
detected and the precise nature of the error may be very difficult to deduce. In difficult cases, the
error handler may have to guess what the programmer had in mind when the program was written.
Several parsing methods, such as the LL and LR methods, detect an error as soon as possible. More
precisely, they have the viable-prefix property, meaning they detect that an error has occurred as
soon as they see a prefix of the input that is not a prefix of any string in the language.
Context-Free Grammars (CFG)
Grammars were introduced in the previous section to systematically describe the syntax of
programming language constructs like expressions and statements. Using a syntactic variable stmt
to denote statements and variable expr to denote expressions, the production
stmt → if (expr ) stmt else stmt
specifies the structure of this form of conditional statement. Other productions then define
precisely what an expr is and what else a stmt can be.
Terminals:
Terminals are the basic symbols from which strings are formed. The term "token name" is a
synonym for "terminal", and frequently we will use the word "token" for terminal when it is clear
that we are talking about just the token name. We assume that the terminals are the first
components of the tokens output by the lexical analyzer. In the statement example above, the
terminals are the keywords if and else and the symbols "(" and ")".
Nonterminals:
Nonterminals are syntactic variables that denote sets of strings. In the example above, stmt
and expr are nonterminals. The sets of strings denoted by nonterminals help define the
language generated by the grammar. Nonterminals impose a hierarchical structure on the
language that is key to syntax analysis and translation.
In a grammar, one nonterminal is distinguished as the start symbol, and the set of strings it
denotes is the language generated by the grammar. Conventionally, the productions for the
start symbol are listed first. The productions of a grammar specify the manner in which the
terminals and nonterminals can be combined to form strings. Each production consists of:
i. A nonterminal called the head or left side of the production; this production defines
some of the strings denoted by the head.
ii. The symbol → (sometimes ::= has been used in place of the arrow).
iii. A body or right side consisting of zero or more terminals and nonterminals. The
components of the body describe one way in which strings of the nonterminal at the head
can be constructed.
EXAMPLE
1. Consider the grammar below for arithmetic expressions; identify the terminals,
nonterminals and the start symbol in the grammar.
expression → expression + term | expression - term | term
term → term * factor | term / factor | factor
factor → ( expression ) | id
Solution
The nonterminals are expression, term and factor; the start symbol is expression. The
remaining symbols, id + - * / ( ), are the terminals.
NOTATIONAL CONVENTIONS
To avoid always having to state that "these are the terminals," "these are the nonterminals," and
so on, the following notational conventions for grammars are used in most texts.
1. Lowercase letters early in the alphabet, such as a, b, c, together with operator symbols,
punctuation symbols and digits, represent terminals.
2. Uppercase letters early in the alphabet, such as A, B, C, represent nonterminals.
3. Uppercase letters late in the alphabet, such as X, Y, Z, represent grammar symbols; that is,
either nonterminals or terminals.
4. Lowercase Greek letters, α, β, γ for example, represent (possibly empty) strings of grammar
symbols. Thus, a generic production can be written as A → α, where A is the head and α
the body.
5. A set of productions with a common head A (call them A-productions), A → α1, A → α2,
A → α3, ……, A → αk, may be written A → α1 | α2 | α3 | …… | αk. Call α1, α2, α3, ……, αk
the alternatives for A.
6. Unless stated otherwise, the head of the first production is the start symbol.
Example: Using the above conventions, the grammar in the example above can be rewritten
concisely as
E → E+T | E-T | T
T → T*F | T/F | F
F → (E) | id
The notational conventions tell us that E, T, and F are nonterminals, with E the start symbol.
The remaining symbols are terminals.
DERIVATIONS
The construction of a parse tree can be made precise by taking a derivational view, in which
productions are treated as rewriting rules. Beginning with the start symbol, each rewriting step
replaces a nonterminal by the body of one of its productions. This derivational view corresponds
to the top-down construction of a parse tree, but the precision afforded by derivations will be
especially helpful when bottom-up parsing is discussed. As we shall see, bottom-up parsing is
related to a class of derivations known as “rightmost" derivations, in which the rightmost
nonterminal is rewritten at each step.
For example, consider the following grammar, with a single nonterminal E,
E → E + E | E * E | -E | (E) | id
The production E → -E signifies that if E denotes an expression, then -E must also denote an
expression. The replacement of a single E by -E will be described by writing:
E ⇒ -E, which is read, “E derives -E.”
The production E → (E) can be applied to replace any instance of E in any string of grammar
symbols by (E). For example,
E * E ⇒ (E) * E or
E * E ⇒ E * (E). We can take a single E and repeatedly apply productions in any order to
get a sequence of replacements.
For example,
E ⇒ -E ⇒ -(E) ⇒ -(id)
We call such a sequence of replacements a derivation of -(id) from E. This derivation provides
a proof that the string -(id) is one particular instance of an expression.
At each step in a derivation, there are two choices to be made: we need to choose which
nonterminal to replace, and having made this choice, we must pick a production with that
nonterminal as head. If the leftmost nonterminal is always the one replaced, the derivation is
called LEFTMOST; if the rightmost nonterminal is always the one replaced, it is called
RIGHTMOST. For example, the string -(id + id) has the leftmost derivation
E ⇒ -E ⇒ -(E) ⇒ -(E + E) ⇒ -(id + E) ⇒ -(id + id)
and the rightmost derivation
E ⇒ -E ⇒ -(E) ⇒ -(E + E) ⇒ -(E + id) ⇒ -(id + id)
The parse tree for -(id + id) resulting from these derivations is shown below.
Note that the leaves of a parse tree are labelled by nonterminals or terminals and, read from left to
right, constitute a sentential form, called the yield or frontier of the tree.
To see the relationship between derivations and parse trees, consider any derivation α1 ⇒ α2 ⇒
α3 ⇒ …… ⇒ αn, where α1 is a single nonterminal A. For each sentential form αi in the derivation,
we can construct a parse tree whose yield is αi. The process is an induction on i.
Note that a parse tree ignores variations in the order in which symbols in sentential forms are
replaced, and there is a many-to-one relationship between derivations and parse trees. For
example, both the leftmost and rightmost derivations above are associated with the same parse
tree. The figure below shows the sequence of parse trees for the leftmost derivation above.
In what follows, we shall frequently parse by producing a leftmost or a rightmost derivation, since
there is a one-to-one relationship between parse trees and either leftmost or rightmost derivations.
Both leftmost and rightmost derivations pick a particular order for replacing symbols in sentential
forms, so they too filter out variations in the order. It is not hard to show that every parse tree has
associated with it a unique leftmost and a unique rightmost derivation.
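Looking ahead, a hand-written top-down parser for the expression grammar E → E+T | E-T | T, T → T*F | T/F | F, F → (E) | id traces out exactly such a leftmost derivation. Recursive descent cannot handle left recursion directly, so the sketch below assumes the standard rewriting E → T E', E' → + T E' | - T E' | ε (and similarly for T); the Python code and its names are illustrative, not from the textbook:

import re

def tokenize(src):
    # ids, numbers, and the operator/punctuation symbols of the grammar
    return re.findall(r"\d+|[A-Za-z_]\w*|[()+\-*/]", src) + ["$"]

class Parser:
    def __init__(self, src):
        self.toks, self.i = tokenize(src), 0

    def peek(self):
        return self.toks[self.i]

    def eat(self, t):
        assert self.peek() == t, f"expected {t}, found {self.peek()}"
        self.i += 1

    def E(self):                       # E -> T E',  E' -> (+|-) T E' | epsilon
        self.T()
        while self.peek() in ("+", "-"):
            self.eat(self.peek()); self.T()

    def T(self):                       # T -> F T',  T' -> (*|/) F T' | epsilon
        self.F()
        while self.peek() in ("*", "/"):
            self.eat(self.peek()); self.F()

    def F(self):                       # F -> (E) | id
        if self.peek() == "(":
            self.eat("("); self.E(); self.eat(")")
        else:
            self.i += 1                # accept an id or number

    def parse(self):
        self.E(); self.eat("$")
        print("syntactically well formed")

Parser("(a + b) * c").parse()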