Compiler Construction NOTE 1
2024/2025 SESSION
Recommended Textbook
Compilers: Principles, Techniques and Tools by Alfred V. Aho, Ravi Sethi and Jeffrey D.
Ullman.
INTRODUCTION
Compiler construction is a broad field. Compiler writing is perhaps the most pervasive topic
in computer science, involving many fields such as:
• Programming languages
• Architecture
• Theory of computation
• Algorithms
• Software engineering
In the early days, much of the effort went into how to implement high-level constructs. Then,
for a long time, the major emphasis was on improving the efficiency of generated code. These
topics remain important today, but many new technologies have caused them to become more
specialized.
Translator
This special software translates other programs into machine language. Translators fall into
three categories:
1. Assembler: this translates a program written in assembly language into machine language.
2. Interpreter: this translates the source program into the object program line by line and
executes (runs) each line straightaway.
3. Compiler: this translates the whole source program at once into the object program.
No.  Interpreter                                       Compiler
1.   A line in error can be corrected and the          The whole program must be compiled
     program run again before moving on                again if there is an error
2.   Requires more memory space, since the             Less memory space is needed for the
     interpreter must be stored while the              program to run
     program is running
3.   Programs run very slowly                          Programs usually run faster
4.   Very useful in a teaching environment             More useful in a production environment
More on Compilers
A compiler is a program that reads a program written in one language (the source language) and
translates it into an equivalent program in another language (the target language). The compiler
reports any errors found during the translation process.
The source language might be general purpose, e.g. C or Pascal, or a “little language” for a
specific domain, e.g. SIML
The target language might be
– Some other programming language
– The machine language of a specific machine
A compiler creates machine code that runs on a processor with a specific Instruction Set
Architecture (ISA), which is processor-dependent. For example, you cannot compile code
for an x86 and run it on a MIPS architecture without a special compiler. Compilers are also
platform dependent. That is, a compiler can convert C++, for example, to machine code that’s
targeted at a platform that is running the Linux OS. A cross-compiler, however, can generate
code for a platform other than the one it runs on itself. A cross-compiler running on a
Windows machine, for instance, could generate code that runs on a specific Windows
operating system or a Linux (operating system) platform. Source-to-source compilers
translate one program, or code, to another of a different language (e.g., from Java to C).
Choosing a compiler, then, means that first you need to know the ISA, operating system, and
the programming language that you plan to use. Compilers often come as a package with
other tools, and each processor manufacturer will have at least one compiler or a package of
software development tools (that includes a compiler).
More on Interpreters
An interpreter translates code like a compiler but reads the code and immediately executes
it, and therefore it starts up faster than a compiler. Thus, interpreters are often used in
software development tools as debugging tools, as they can execute a single line of code at
a time. Compilers translate code all at once, and the processor then executes the machine
language that the compiler produced. If changes are made to the code after compilation, the
changed code will need to be compiled and added to the compiled code (or perhaps the entire
program will need to be recompiled). But an interpreter, although it skips the step of
compiling the entire program up front, is much slower to execute than the same program
that has been completely compiled. Interpreters, however, are useful in
areas where speed doesn’t matter (e.g., debugging and training) and it is possible to take the
entire interpreter and use it on another ISA, which makes it more portable than a compiler
when working between hardware architectures. There are several types of interpreters: the
syntax-directed interpreter (i.e., the Abstract Syntax Tree (AST) interpreter), bytecode
interpreter, and threaded interpreter (not to be confused with concurrent processing threads),
Just-in-Time (a kind of hybrid interpreter/compiler), and a few others. Instructions on how
to build an interpreter can be found on the web. Some examples of programming languages
that use interpreters are Python, Ruby, Perl, and PHP.
More on Assemblers
An assembler translates a program written in assembly language into machine language and is
effectively a compiler for the assembly language. Assembly language is a low-level programming
language. Low-level programming languages are less like human language in that they are more
difficult to understand at a glance; you have to study assembly code carefully in order to follow
the intent of execution and in most cases, assembly code has many more lines of code to represent
the same functions being executed as a higher-level language. An assembler converts assembly
language code into machine code (also known as object code), an even lower-level language that
the processor can execute directly.
Types of Compilers
Compilers are categorized based on various criteria such as their design, target language,
execution method, and translation approach. Here are some of the common types of
compilers:
1. Single-pass Compiler
A single-pass compiler processes the source code only once and generates the target code
directly.
❖ Characteristics: It is simpler and faster but may be limited in optimization capabilities.
❖ Use Case: Often used for small programs or languages with simple syntax structures.
2. Multi-pass Compiler
A multi-pass compiler goes through the source code multiple times before generating the
target code.
❖ Characteristics: Allows for better error checking and optimizations, making it suitable
for complex language structures.
❖ Use Case: Commonly used for optimizing high-level languages like C++.
3. Cross-compiler
Compiles source code for a different platform or architecture than the one on which the
compiler is running.
❖ Characteristics: Essential for embedded systems where the target device may not have
the capacity to host the compiler.
❖ Use Case: Building software for different hardware architectures, such as ARM
processors used in mobile devices.
4. Bootstrap Compiler
A special type of compiler that is used to compile itself.
❖ Characteristics: Useful in developing new programming languages or updating a
compiler.
❖ Use Case: Often used in the early stages of language development.
5. Just-In-Time (JIT) Compiler
A compiler that translates code during program execution rather than before.
❖ Characteristics: Improves performance by compiling only the parts of the program that are
frequently executed.
❖ Use Case: Commonly used in runtime environments like Java's JVM and .NET's CLR.
6. Interpreter-based Compiler (Hybrid Compiler)
A combination of a compiler and an interpreter, where the code is first compiled to
intermediate code and then interpreted.
❖ Characteristics: Provides a balance between the speed of execution and portability.
❖ Use Case: Common in scripting languages like Python and JavaScript.
Phases of a Compiler
The phases of a compiler are: lexical analysis, syntax analysis, semantic analysis,
intermediate code generation, code optimization and code generation, supported throughout
by symbol-table management and error handling. Each phase is discussed below.
Linear (Lexical) Analysis
The linear analysis stage is called lexical analysis or scanning. For example, the characters in the
assignment statement:
position = initial + rate * 60
would be grouped into the following tokens:
1. the identifier position
2. the assignment symbol =
3. the identifier initial
4. the plus sign +
5. the identifier rate
6. the multiplication sign *
7. the number 60
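As a minimal illustration (a Python sketch; the token names and the helper scan() are ours, not from the textbook), this grouping can be reproduced with a regular-expression-based scanner:

import re

# Illustrative token specification; order matters (more specific first).
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_][A-Za-z0-9_]*"),
    ("ASSIGN", r"="),
    ("PLUS",   r"\+"),
    ("TIMES",  r"\*"),
    ("WS",     r"\s+"),   # white space is skipped, not returned
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def scan(source):
    # Characters matching no pattern are silently skipped in this sketch.
    for m in MASTER.finditer(source):
        if m.lastgroup != "WS":
            yield (m.lastgroup, m.group())

print(list(scan("position = initial + rate * 60")))
# [('ID', 'position'), ('ASSIGN', '='), ('ID', 'initial'),
#  ('PLUS', '+'), ('ID', 'rate'), ('TIMES', '*'), ('NUMBER', '60')]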
Syntax Analysis
Syntax analysis or parsing involves grouping the tokens of the source program into
grammatical phrases that are used by the compiler to synthesize output. The hierarchical
structure of the source program can be represented by a PARSE TREE. For our running
assignment statement, the tree has the following shape (indentation shows nesting):

assignment statement
    identifier: position
    =
    expression
        expression
            identifier: initial
        +
        expression
            expression
                identifier: rate
            *
            expression
                number: 60
Semantic Analysis
The semantic analysis stage:
– Checks for semantic errors, e.g. undeclared variables
– Gathers type information
– Determines the operators and operands of expressions
Example: if rate is a float, the integer literal 60 should be converted to a float before
multiplying. Then, we have
position = initial + rate * inttofloat(60)
Symbol-table management
A symbol table is a data structure containing a record for each identifier with fields for the
attributes of the identifier. During analysis, we record the identifiers used in the program.
The symbol table stores each identifier with its attributes. Example of attributes:
– How much storage is allocated for the id
– The id’s type
– The id’s scope
– For functions, the parameter protocol
Some attributes can be determined immediately; some are delayed until later phases.
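A minimal sketch of such a table in Python (the record fields here are illustrative; real compilers store more):

class SymbolTable:
    """One record per identifier, with fields for its attributes."""
    def __init__(self):
        self.entries = {}

    def install(self, name):
        # Install the identifier if it is new; return its record.
        if name not in self.entries:
            self.entries[name] = {"type": None, "storage": None, "scope": None}
        return self.entries[name]

table = SymbolTable()
record = table.install("rate")
record["type"] = "float"      # some attributes are filled in by later phases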
Error detection
Each compilation phase can have errors. Normally, we want to keep processing after an error,
in order to find more errors. Each stage has its own characteristic errors, e.g.
– Lexical analysis: a string of characters that do not form a legal token
– Syntax analysis: unmatched { } or missing ;
– Semantic: trying to add a float and a pointer
The internal representations at each of these phases are illustrated with our running example below.
Intermediate code generation
Some compilers explicitly create an intermediate representation of the source code program
after semantic analysis. The representation is as a program for an abstract machine. Most
common representation is “three-address code” in which all memory locations are treated as
registers, and most instructions apply an operator to two operand registers, and store the
result to a destination register.
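For our running example, the three-address code produced at this point would look something like the following (inttofloat is the conversion inserted by semantic analysis; id1, id2 and id3 stand for position, initial and rate):

temp1 := inttofloat(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3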
Code optimization
This phase attempts to improve the intermediate code. At this stage, we improve the code to
make it run faster.
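On the running example, the optimizer can perform the int-to-float conversion of 60 once and for all at compile time and eliminate the redundant temporaries, reducing the four instructions above to two:

temp1 := id3 * 60.0
id1 := id2 + temp1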
Code generation
This is the final phase of the compiler, where the target code is generated. In this final stage,
we take the three-address code (3AC) or other intermediate representation and convert it to
the target language. We must pick memory locations for variables and allocate registers. For
example, using registers 1 and 2, the translation of our running example might become
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
i.e.
temp1 := id3 * 60.0
id1 := id2 + temp1
Assignment
Distinguish between a compiler and an interpreter.
LEXICAL ANALYSIS
This is the first phase of a compiler. Its task is to read the input characters and produce as
output a sequence of tokens that the parser uses for syntax analysis.
This interaction is commonly implemented by making the lexical analyzer be a subroutine or
a co-routine of the parser. Upon receiving a “get next token” command from the parser, the
lexical analyzer reads input characters until it can identify a token and returns it. Since the
lexical analyzer is part of the compiler that reads the source text, it may also perform certain
secondary tasks at the user interface. They are:
– Strip out comments and white space from the source code.
– Correlate parser errors with the source code location (the parser does not know what line
of the file it is at, but the lexer does).
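A sketch of this interaction in Python (the function names are ours): the lexical analyzer is written as a generator, and each value the parser pulls from it answers one “get next token” request:

import re

def lexer(source):
    # Each next() on this generator answers one "get next token"
    # request; white space is skipped and never returned.
    for m in re.finditer(r"[A-Za-z_]\w*|\d+|[^\s]", source):
        yield m.group()

def parser(source):
    get_next_token = lexer(source)
    for token in get_next_token:     # the parser pulls tokens on demand
        print("parser received:", token)

parser("rate * 60")
# parser received: rate
# parser received: *
# parser received: 60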
Together, the complete set of tokens forms the set of terminal symbols used in the grammar
for the parser. In most languages, the tokens fall into these categories:
– Keywords
– Operators
– Identifiers
– Constants
– Literal strings
– Punctuation
Token attributes
If there is more than one lexeme for a token, we have to save additional information about
the token.
Example: the token number matches lexemes 10 and 20.
Code generation needs the actual number, not just the token. With each token, we associate
ATTRIBUTES, normally just a pointer into the symbol table. For the C source code E = M * C
* C, we have the token/attribute pairs
<ID, ptr to symbol table entry for E>
<Assign_op, NULL>
<ID, ptr to symbol table entry for M>
<Mult_op, NULL>
<ID, ptr to symbol table entry for C>
<Mult_op, NULL>
<ID, ptr to symbol table entry for C>
Lexical errors
When errors occur, we could just crash, but it is better to print an error message and then continue.
Possible techniques to continue on error:
➢Delete a character
➢Insert a missing character
➢Replace an incorrect character by a correct character
➢Transpose adjacent characters
These are written in order of increasing difficulty for the implementer. Unfortunately,
the harder-to-implement approaches often yield faster lexical analyzers. Note that the design
and use of an automatic generator, and some concepts for the organization of a hand-designed
lexical analyzer, shall be discussed here.
Token specification
REGULAR EXPRESSIONS (REs) are the most common notation for pattern specification.
Every pattern specifies a set of strings, so a RE names a set of strings. Some definitions are
stated below:
– The ALPHABET or CHARACTER CLASS (often written Σ) is the set of legal input symbols and
denotes any finite set of symbols, e.g. letters and characters. {0,1} is an example of a
binary alphabet; ASCII and EBCDIC are examples of computer alphabets.
– A STRING over some alphabet is a finite sequence of symbols drawn from that alphabet;
sentence and word are synonyms in language theory.
– The LENGTH of string s is written |s|. It is the number of occurrences of symbols in s
– The EMPTY STRING is a special 0-length string denoted by ε
A PREFIX of s is formed by removing 0 or more trailing symbols of s e.g. ban is a prefix of
banana
A SUFFIX of s is formed by removing 0 or more leading symbols of s e.g. nana is a suffix
of banana
A SUBSTRING of s is formed by deleting a prefix and a suffix from s e.g. nan is a substring
of banana
A PROPER prefix, suffix, or substring is a nonempty string x that is, respectively, a prefix,
suffix, or substring of s but with x ≠ s.
A SUBSEQUENCE of s is any string formed by deleting zero or more not
necessarily contiguous symbols from s, e.g. baaa is a subsequence of banana
A LANGUAGE is a set of strings over a fixed alphabet Σ. Examples are:
• ∅ (the empty set)
• {ε}
• {a, aa, aaa, aaaa}
The CONCATENATION of two strings x and y is written xy
String EXPONENTIATION is written s^i, where s^0 = ε and s^i = s^(i-1)s for i > 0.
Operations on Languages
Several important operations can be applied to languages. For lexical analysis, we are interested
primarily in union, concatenation, and closure, which are defined below. We can also generalize
the “exponentiation” operator to languages by defining L^0 to be {ε} and L^i to be L^(i-1)L. Thus,
L^i is L concatenated with itself i-1 times.
The UNION of L and M: L ∪ M = {s | s is in L or s is in M}
The CONCATENATION of L and M: LM = {st | s is in L and t is in M}
The KLEENE CLOSURE of L: L* = ∪(i≥0) L^i; that is, the closure (or star or Kleene closure)
of a language L is denoted by L* and represents any number of strings that can be formed
from L, possibly with repetitions. E.g. if L = {0, 1}, then L* = {ε, 0, 1, 00, 01, 10, 11, …}. L* is the
infinite union ∪(i≥0) L^i where L^0 = {ε}, L^1 = L and L^i = L^(i-1)L for i > 1.
The POSITIVE CLOSURE of a language L is denoted by L+, meaning concatenations of strings
from L excluding the empty string ε, i.e. L+ = L* − {ε}
Example:
Let L be the set {A, B, …, Z, a, b, …, z} and D the set {0, 1, 2, …, 9}. We can think of L and D
in two ways. We can think of L as the alphabet consisting of the set of upper and lower case
letters, and D as the alphabet consisting of the set of the ten decimal digits. Alternatively,
since a symbol can be regarded as a string of length one, the sets L and D are each finite
languages. Here are some examples of new languages created from L and D by applying the
operators defined above.
1. L ∪ D is the set of letters and digits
2. LD is the set of strings consisting of a letter followed by a digit
3. L^4 is the set of all four-letter strings
4. L* is the set of all strings of letters, including the empty string ε
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter
6. D+ is the set of all strings of one or more digits
A regular expression is built up out of simpler regular expressions using a set of defining rules.
Each regular expression r denotes a language L(r). The defining rules specify how L(r) is formed
by combining in various ways the languages denoted by the sub expressions of r.
Here are the rules that define the regular expressions over alphabet Σ. Associated with each rule is
a specification of the language denoted by the regular expression being defined.
1. ε is a regular expression that denotes {ε}, that is, the set containing the empty string.
2. If a is a symbol in Σ, then a is a regular expression that denotes {a}, i.e., the set
containing the string a. Although we use the same notation for all three, technically, the
regular expression a is different from the string a or the symbol a. It will be clear from
the context whether we are talking about a as a regular expression, string, or symbol.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s).
Then,
a) (r)|(s) is a regular expression denoting L(r) ∪ L(s)
b) (r)(s) is a regular expression denoting L(r)L(s)
c) (r)* is a regular expression denoting (L(r))*
d) (r) is a regular expression denoting L(r)
Unnecessary parentheses can be avoided in regular expressions if we adopt the conventions that:
1. the unary operator * has the highest precedence and is left associative
2. concatenation has the second highest precedence and is left associative
3. | has the lowest precedence and is left associative
Under these conventions, (a)|((b)*(c)) is equivalent to a|b*c. Both expressions denote the set of
strings that are either a single a or zero or more b’s followed by one c.
Examples:
1. The regular expression a|b denotes the set {a, b}.
2. The regular expression (a|b)(a|b) denotes {aa, ab, ba, bb}, that is, the set of all strings of a’s
and b’s of length two. Another regular expression for this same set is aa|ab|ba|bb.
3. The regular expression a* denotes the set of all strings of zero or more a’s, i.e. {ε, a,
aa, aaa, …}.
4. The regular expression (a|b)* denotes the set of all strings containing zero or more instances
of an a or b, that is, the set of all strings of a’s and b’s. Another regular expression for this
set is (a*b*)*
5. The regular expression a|a*b denotes the set containing the string a and all strings
consisting of zero or more a’s followed by b.
If two regular expressions r and s denote the same language, we say r and s are equivalent and
write r = s. For example, (a|b) = (b|a).
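These denotations can be spot-checked with Python’s re module (a convenience for experimenting, not a proof of equivalence):

import re

# (a|b)(a|b) matches exactly the length-two strings of a's and b's.
two = re.compile(r"(a|b)(a|b)")
assert all(two.fullmatch(s) for s in ("aa", "ab", "ba", "bb"))
assert not two.fullmatch("abb")

# a|a*b: the single string a, or zero or more a's followed by b.
p = re.compile(r"a|a*b")
assert p.fullmatch("a") and p.fullmatch("b") and p.fullmatch("aaab")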
There are a number of algebraic laws obeyed by regular expressions, and these can be used to
manipulate regular expressions into equivalent forms. The following laws hold for regular
expressions r, s and t:
r|s = s|r                 | is commutative
r|(s|t) = (r|s)|t         | is associative
(rs)t = r(st)             concatenation is associative
r(s|t) = rs|rt, (s|t)r = sr|tr   concatenation distributes over |
εr = r, rε = r            ε is the identity for concatenation
r* = (r|ε)*               relation between * and ε
r** = r*                  * is idempotent
Regular Definitions
To make our REs simpler, we may wish to give names to regular expressions and to define
regular expressions using these names as if they were symbols. If Σ is an alphabet of basic
symbols, then a regular definition is a sequence of definitions of the form
d1 → r1
d2 → r2
………
dn → rn
where each di is a distinct name, and each ri is a regular expression over the symbols in
Σ ∪ {d1, d2, . . ., di-1}, i.e., the basic symbols and the previously defined names. By restricting
each ri to symbols of Σ and the previously defined names, we can construct a regular
expression over Σ for any ri by repeatedly replacing regular-expression names by the
expressions they denote. If ri used dj for some j ≥ i, then ri might be recursively defined and
this substitution process would not terminate. To distinguish names from symbols, we print
the names in regular definitions in bold face.
Example:
Here is a regular definition for identifiers in Pascal:
letter → A | B | . . . | Z | a | b | …… | z
digit → 0 | 1 | . . . | 9
id → letter ( letter | digit )*
For a language such as C, where an identifier may also contain (and begin with) an
underscore, the definition becomes:
id → ( letter | _ ) ( letter | digit | _ )*
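These definitions translate directly into machine-checkable patterns; a small Python sketch (the variable names are ours):

import re

pascal_id = re.compile(r"[A-Za-z][A-Za-z0-9]*$")   # letter (letter|digit)*
c_id = re.compile(r"[A-Za-z_][A-Za-z0-9_]*$")      # variant with underscores

assert pascal_id.match("rate60")
assert not pascal_id.match("60rate")               # must begin with a letter
assert c_id.match("_tmp1") and not pascal_id.match("_tmp1")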
Example for numbers in Pascal:
digit → 0 | 1 | …….. | 9
digits → digit digit*
optional_fraction → . digits | ε
optional_exponent → ( E ( + | - | ε ) digits ) | ε
num → digits optional_fraction optional_exponent
This definition says that an optional_fraction is either a decimal point followed by one or
more digits, or ε if it is missing (the empty string). An optional_exponent, if it is not missing,
is an E followed by an optional + or - sign, followed by one or more digits. Note that at least
one digit must follow the period, so num does not match 1. (with a trailing period); however,
it does match 1.0
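The same definition written as a single Python regular expression (illustrative):

import re

# digits optional_fraction optional_exponent
num = re.compile(r"\d+(\.\d+)?(E[+-]?\d+)?$")

assert num.match("42") and num.match("1.0") and num.match("6.336E4")
assert not num.match("1.")   # at least one digit must follow the period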
Notational Shorthand
To simplify our REs, we can use a few shortcuts:
1. + means “one or more instances of”, e.g. a+ and (ab)+
2. ? means “zero or one instance of”, e.g. digits?
3. Character classes abbreviate alternation, e.g. [a-z] stands for a | b | . . . | z
Token Recognition
Having known how to specify the tokens for our language, how then do we write a program
to recognize them?
Consider this grammar example:
stmt→ if expr then stmt | if expr then stmt else stmt
expr→term relop term | term
term→ id | num
where the terminals if, then, else, relop, id and num generate sets of strings given by the
following regular definitions:
if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
id → letter ( letter | digit )*
num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
It is assumed that keywords are reserved and that lexemes are separated by white space,
consisting of non-null sequences of blanks, tabs and newlines. The lexical analyzer will strip
out the white space by matching the input against the regular definition ws below:
delim → blank | tab | newline
ws → delim delim*
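Putting these definitions together, a sketch of such a lexical analyzer in Python (the organization is illustrative; real scanners are usually generated by tools such as Lex):

import re

KEYWORDS = {"if", "then", "else"}          # reserved words
PATTERNS = [
    ("relop", r"<=|<>|>=|<|=|>"),
    ("num",   r"\d+(\.\d+)?(E[+-]?\d+)?"),
    ("id",    r"[A-Za-z][A-Za-z0-9]*"),
    ("ws",    r"[ \t\n]+"),                # white space: stripped, not returned
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in PATTERNS))

def tokens(source):
    for m in MASTER.finditer(source):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "ws":
            continue
        if kind == "id" and lexeme in KEYWORDS:
            yield (lexeme, None)           # keyword token, no attribute
        else:
            yield (kind, lexeme)

print(list(tokens("if x1 <= 60 then y else z")))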
NOTE
Note that the regular expression constructs permitted by Lex are listed below in decreasing order
of precedence. In this table, c stands for any single character, r for a regular expression, and s for
a string.
c       the character c literally
\c      the character c even when c is an operator
"s"     the string s literally
.       any character but newline
^       beginning of line
$       end of line
[s]     any one of the characters in string s
r*      zero or more strings matching r
r+      one or more strings matching r
r?      zero or one r
r{m,n}  between m and n occurrences of r
r1r2    an r1 followed by an r2
r1|r2   an r1 or an r2
(r)     same as r
r1/r2   r1 when followed by r2
Transition Diagrams
The transition diagram is a stylized flowchart produced as one of the intermediate steps in the
construction of a lexical analyzer. It depicts the action taken when a lexical analyzer is called by
the parser to get the next token. Transition diagrams are also called finite automata.
Positions in a transition diagram are drawn as circles and are called states. The states are connected
by arrows, called edges. Edges leaving state s have labels indicating the input characters that can
next appear after the transition diagram has reached state s. The label other refers to any character
that is not indicated by any of the other edges leaving s. Usually, when we recognize OTHER, we
need to put it back in the source stream since it is part of the next token. This action is denoted
with a * next to the corresponding state.
We assume the transition diagrams of this section are deterministic; that is, no symbol can match
the labels of two edges leaving one state. One state is labeled the start state; it is the initial state of
the transition diagram where control resides when we begin to recognize a token. Certain states
may have actions that are executed when the flow of control reaches that state. On entering a state
we read the next input character. If there is an edge from the current state whose label matches this
input character, we then go to the state pointed to by the edge. Otherwise, we indicate failure. The
figure below shows a transition diagram for the patterns >= and >. The transition diagram works
as follows, its start state is state 0. In state 0, we read the next input character. The edge labeled >
from state 0 is to be followed to state 6 if this input character is >. Otherwise, we have failed to
recognize either > or >=.
On reaching state 6 we read the next input character. The edge labeled = from state 6 is to be
followed to state 7 if this input character is an =. Otherwise, the edge labeled other indicates that
we are to go to state 8. The double circle on state 7 indicates that it is an accepting state, a state in
which the token >= has been found.
Notice that the character > and another extra character are read as we follow the sequence of edges
from the start state to the accepting state 8. Since the extra character is not a part of the relational
operator >, we must retract the forward pointer one character. We use a * to indicate states on
which this input retraction must take place.
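Transcribed into code, the diagram for > and >= looks as follows (a Python sketch; the state numbers 0, 6, 7 and 8 follow the description above):

def relop_gt(source, pos):
    # State 0: expect '>'.
    if pos < len(source) and source[pos] == ">":
        pos += 1                      # move to state 6
    else:
        return None                   # fail: neither > nor >=
    # State 6: look at the next input character.
    if pos < len(source) and source[pos] == "=":
        return ("GE", pos + 1)        # state 7: accept >=
    return ("GT", pos)                # state 8 (*): accept >, retract input

print(relop_gt("x >= y", 2))   # ('GE', 4)
print(relop_gt("x > y", 2))    # ('GT', 3)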
In general, there may be several transition diagrams, each specifying a group of tokens. If failure
occurs while we are following one transition diagram, then we retract the forward pointer to where
it was in the start state of this diagram, and activate the next transition diagram. Since the lexeme-
beginning and forward pointers marked the same position in the start state of the diagram, the
forward pointer is retracted to the position marked by the lexeme-beginning pointer. If failure
occurs in all transition diagrams, then a lexical error has been detected and we invoke an error-
recovery routine.
A simple technique for separating keywords from identifiers is to initialize appropriately the
symbol table in which information about identifiers is saved. For the tokens described above, we
need to enter the strings if, then, and else into the symbol table before any characters in the input
are seen. We also make a note in the symbol table of the token to be returned when one of these
strings is recognized. The return statement next to the accepting state in the figure above uses
gettoken() and install_id() to obtain the token and attribute value, respectively, to be returned. The
procedure install_id() has access to the buffer, where the identifier lexeme has been located. The
symbol table is examined and if the lexeme is found there marked as a keyword, install_id() returns
0. If the lexeme is found and is a program variable install_id() returns a pointer to the symbol table
entry. If the lexeme is not found in the symbol table, it is installed as a variable and a pointer to
the newly created entry is returned.
The procedure gettoken() similarly looks for the lexeme in the symbol table. If the lexeme is
a keyword, the corresponding token is returned; otherwise, the token Id is returned. Note that
the transition diagram does not change if additional keywords are to be recognized; we
simply initialize the symbol table with the strings and tokens of the additional keywords. The
technique of placing keywords in the symbol table is almost essential if the lexical analyzer
is coded by hand. Without doing so, the number of states in a lexical analyzer for a typical
programming language is several hundred, while using the trick, fewer than a hundred states
will probably suffice.
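A sketch of this technique in Python (install_id and gettoken follow the description above; the data layout is ours):

# Pre-load the symbol table with the reserved keywords before scanning.
symtab = {"if": {"token": "IF"}, "then": {"token": "THEN"}, "else": {"token": "ELSE"}}

def install_id(lexeme):
    # Returns 0 for a keyword, otherwise the id's symbol-table entry.
    entry = symtab.get(lexeme)
    if entry is not None and "token" in entry:
        return 0
    if entry is None:
        entry = symtab[lexeme] = {}   # install as a new program variable
    return entry

def gettoken(lexeme):
    entry = symtab.get(lexeme)
    if entry is not None and "token" in entry:
        return entry["token"]         # the keyword's own token
    return "ID"

print(gettoken("then"), gettoken("rate"))   # THEN ID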
The treatment of ws, representing white space, is different from that of the patterns discussed
above because nothing is returned to the parser when white space is found in the input. A
transition diagram recognizing ws by itself is shown below:
Nothing is returned when the accepting state is reached; we merely go back to the start state
of the first transition diagram to look for another pattern. Whenever possible, it is better to
look for frequently occurring tokens before less frequently occurring ones, because a
transition diagram is reached only after we fail on all earlier diagrams. Since white space is
expected to occur frequently, putting the transition diagram for white space near the
beginning should be an improvement over testing for white space at the end.
For each state, in every diagram, a piece of code is written to compare the next input with the
outgoing transitions and set the next state. If no next state is possible for the given input, we
call fail(). fail() backtracks to the beginning of the token and tries again with the next
transition diagram. If all diagrams fail, we generate an error. Ordering of the diagrams is
important. E.g. for unsigned numbers, the longer diagrams have to be tried first.
FINITE AUTOMATA
A recognizer for a language is a program that takes as input a string x and answers “yes” if x is a
sentence of the language and “no” otherwise. A regular expression is compiled into a recognizer by
constructing a generalized transition diagram called a finite automaton. A finite automaton can be
deterministic or nondeterministic, where nondeterministic means that more than one transition out
of a state may be possible on the same input symbol. Both deterministic and nondeterministic
finite automata are capable of recognizing precisely the regular sets. Thus, they both can recognize
exactly what regular expressions can denote. However, there is a time-space tradeoff: while
deterministic finite automata can lead to faster recognizers than nondeterministic automata, a
deterministic finite automaton can be much bigger than an equivalent nondeterministic automaton.
The conversion into a nondeterministic automaton is more direct.
Nondeterministic Finite Automata
A nondeterministic finite automaton (NFA, for short) is a mathematical model that consists of
1. a set of states S
2. a set of input symbols (the input symbol alphabets)
3. a transition function move that maps state-symbol pairs to sets of states
4. a state s0 that is distinguished as the start (or initial) state
5. a set of states F distinguished as accepting (or final) states
An NFA can be represented diagrammatically by a labeled directed graph called a transition graph,
in which the nodes are the states and the labeled edges represent the transition function. This graph
looks like a transition diagram, but the same character can label two or more transitions out of one
state, and edges can be labeled by the special symbol ε as well as by input symbols. The transition
graph for an NFA that recognizes the language (a|b)*abb is shown below. The set of states of the
NFA is {0, 1, 2, 3} and the input symbol alphabet is {a, b}. State 0 is distinguished as the start state,
and the accepting state 3 is indicated by a double circle.
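This NFA can be written down and simulated directly; a Python sketch (state numbering as in the text):

# Transition function of the NFA for (a|b)*abb.
move = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START, ACCEPTING = 0, {3}

def nfa_accepts(x):
    states = {START}                  # all states reachable so far
    for c in x:
        states = set().union(*(move.get((s, c), set()) for s in states))
    return bool(states & ACCEPTING)

for s in ("abb", "aabb", "babb", "ab"):
    print(s, nfa_accepts(s))          # True True True False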
In a computer, the transition function of an NFA can be implemented in several different
ways, of which the easiest is a transition table, in which there is a row for each state and a
column for each input symbol (and one for ε, if necessary). The entry for row i and symbol
a in the table is the set of states (or, more likely in practice, a pointer to the set of states) that
can be reached by a transition from state i on input a. The transition table for the NFA above
is shown below:
STATE    a         b
0        {0, 1}    {0}
1        –         {2}
2        –         {3}
The transition table representation has the advantage that it provides fast access to the transitions of
a given state on a given character; its disadvantage is that it can take up a lot of space when the
input alphabet is large and most transitions are to the empty set. Adjacency list representations of
the transition function provide more compact implementations, but access to a given transition is
slower. It should be clear that we can easily convert any one of these implementations of a finite
automaton into another.
An NFA accepts an input string x if and only if there is some path in the transition graph from the
start state to some accepting state, such that the edge labels along this path spell out x. The NFA
described above accepts the input strings abb, aabb, babb, aaabb, etc. For example, aabb is accepted
by the path from state 0, following the edge labeled a to state 0 again, then to states 1, 2, and 3 via
edges labeled a, b, and b, respectively.
A path can be represented by a sequence of state transitions called moves. The following
diagram shows the moves made in accepting the input string aabb:
In general, more than one sequence of moves can lead to an accepting state. Notice that
several other sequences of moves may be made on the input string aabb, but none of the others
happen to end in an accepting state. For example, another sequence of moves on input aabb
keeps reentering the non-accepting state 0:
The language defined by an NFA is the set of input strings it accepts.
Exercise:
1. Show that the NFA above accepts (a|b)*abb.
2. Design an NFA to accept aa*|bb*
Deterministic Finite Automata
A deterministic finite automaton (DFA, for short) has at most one transition from each state on
any input. If we are using a transition table to represent the transition function of a DFA, then
each entry in the transition table is a single state. Consequently, it is very easy to determine
whether a deterministic finite automaton accepts an input string, since there is at most one path
from the start state labelled by that string.
The example below shows the transition graph of a deterministic finite automaton accepting the
same language (a|b)*abb as that accepted by the NFA described above. On the input string ababb,
it follows the sequence of states
0, 1, 2, 1, 2, 3
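Simulating a DFA is even simpler than simulating an NFA, since each (state, symbol) pair has exactly one successor; a Python sketch with the standard transition table for (a|b)*abb:

dfa = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 3},
    3: {"a": 1, "b": 0},
}

def dfa_accepts(x):
    state = 0                         # the start state
    for c in x:
        state = dfa[state][c]
    return state == 3                 # the accepting state

print(dfa_accepts("ababb"))   # True, via the states 0, 1, 2, 1, 2, 3
print(dfa_accepts("abab"))    # False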
Assignment
1. Write regular expressions for identifiers in four programming languages: Java, Python,
C++ and Pascal.
IDENTIFIERS IN C++
How do we check an identifier in C++?
• Names can contain letters, digits and underscores.
• Names must begin with a letter or an underscore (_)
• Names are case-sensitive ( myVar and myvar are different variables)
• Names cannot contain whitespaces or special characters like !, #, %, etc.
Regular Expression
letter → A | B | . . . | Z | a | b | …… | z
digit → 0 | 1 | . . . | 9
id → ( letter | _ ) ( letter | digit | _ )*
Also written in regular-expression (Lex-style) notation as
^[A-Za-z_][A-Za-z0-9_]*$
Question?
1. How many identifiers are in the above code? Name them.
IDENTIFIERS IN C#
In programming languages, identifiers are used for identification purposes. Or in other words,
identifiers are the user-defined name of the program components. In C#, an identifier can be
a class name, method name, variable name, or label.
Example:
public class GFG {
    static public void Main()
    {
        int x;
    }
}
Here the total number of identifiers present in the above example is 3, and the names of these
identifiers are:
GFG: name of the class
Main: method name
x: variable name
class GFG {
    // Main Method
    static public void Main()
    {
        // variables
        int a = 10;
        int b = 39;
        int c;

        // simple addition
        c = a + b;
        Console.WriteLine("The sum of two number is: {0}", c);
    }
}
Output:
The sum of two number is: 49
The identifiers present in the above example are GFG, Main, a, b, c, Console and WriteLine,
while the keywords are class, static, public, void and int.
IDENTIFIERS IN PYTHON
A Python identifier is the name we give to identify a variable, function, class, module or other
object. That means whenever we want to give an entity a name, that name is called an
identifier. The isidentifier() method returns True if the string is a valid identifier, otherwise
False. A string is considered a valid identifier if it only contains alphanumeric characters
(a-z, A-Z, 0-9) or underscores (_). A valid identifier cannot start with a number, or contain
any spaces. In Python, identifiers are case-sensitive, meaning that foo and Foo are considered
to be two different identifiers.
Regular Expression
letter → A | B | . . . | Z | a | b | …… | z
digit → 0 | 1 | . . . | 9
id → ( letter | _ ) ( letter | digit | _ )*
Also written in regular-expression (Lex-style) notation as
^[A-Za-z_][A-Za-z0-9_]*$
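For example (the strings below are illustrative):

print("my_var1".isidentifier())   # True
print("1st_var".isidentifier())   # False: cannot start with a digit
print(" x".isidentifier())        # False: spaces are not allowed
print("foo" == "Foo")             # False: identifiers are case-sensitive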
IDENTIFIERS IN JAVA
The valid rules for defining Java identifiers are:
• It must start with either a lower case letter [a-z], an upper case letter [A-Z], an
underscore (_) or a dollar sign ($).
• It should be a single word; white spaces are not allowed.
• It should not start with a digit.
In regular-expression notation:
^[A-Za-z_$][A-Za-z0-9_$]*$
Here's a table on valid and invalid identifiers in Java for a quick glance.

Criteria                                Valid Identifiers         Invalid Identifiers
Start with a letter, underscore (_),    myVariable, _value, $id   9pins, -name
or dollar sign ($)
Subsequent characters                   var1, i9, _1_value        a@b, hello!world
Case sensitivity                        myVariable, MyVariable    (case variations are valid but
                                                                  represent different identifiers)
No reserved words                       userInput, totalSum       class, int, void
Unlimited length                        longIdentifierName123     (no length-based invalidity, but
(practically reasonable)                                          very long names are discouraged
                                                                  for readability)
SYNTAX ANALYSIS
Every programming language has rules that prescribe the syntactic structure of well-formed
programs. In Pascal, for example, a program is made out of blocks, a block out of statements,
a statement out of expressions, an expression out of tokens, and so on. The syntax of
programming language constructs can be described by context-free grammars or BNF
(Backus-Naur Form) notation. Grammars offer significant advantages to both language
designers and compiler writers.
What is a Grammar?
A grammar gives a precise, yet easy-to-understand, syntactic specification of a programming
language. From certain classes of grammars we can automatically construct an efficient
parser that determines if a source program is syntactically well formed. As an additional
benefit, the parser construction process can reveal syntactic ambiguities and other difficult-
to-parse constructs that might otherwise go undetected in the initial design phase of a
language and its compiler.
A properly designed grammar imparts a structure to a programming language that is useful
for the translation of source programs into correct object code and for the detection of errors.
Tools are available for converting grammar-based descriptions of translations into working
programs.
Commonly used parsing methods in compilers are classified as being either LL (left-to-right scan,
leftmost derivation) parsers, also known as top-down parsers, or LR (left-to-right scan, rightmost
derivation) parsers, also known as bottom-up parsers. As indicated by their names, top-
down parsers build parse trees from the top (root) to the bottom (leaves), while bottom-up parsers
start from the leaves and work up to the root. In both cases, the input to the parser is scanned from
left to right, one symbol at a time.
Often much of the error detection and recovery in a compiler is centered around the syntax analysis
phase. One reason for this is that many errors are syntactic in nature or are exposed when the
stream of tokens coming from the lexical analyzer disobeys the grammatical rules defining the
programming language. Another is the precision of modern parsing methods; they can detect the
presence of syntactic errors in programs very efficiently. Accurately detecting the presence of
semantic and logical errors at compile time is a much more difficult task.
The error handler in a parser has simple-to-state goals:
1. It should report the presence of errors clearly and accurately.
2. It should recover from each error quickly enough to be able to detect subsequent errors.
3. It should not significantly slow down the processing of correct programs.
The effective realization of these goals presents difficult challenges. Fortunately, common errors
are simple ones and a relatively straightforward error-handling mechanism often suffices. In some
cases, however, an error may have occurred long before the position at which its presence is
detected and the precise nature of the error may be very difficult to deduce. In difficult cases, the
error handler may have to guess what the programmer had in mind when the program was written.
Several parsing methods, such as the LL and LR methods, detect an error as soon as possible. More
precisely, they have the viable-prefix property, meaning they detect that an error has occurred as
soon as they see a prefix of the input that is not a prefix of any string in the language.
Context-Free Grammars (CFG)
Grammars were introduced in the previous section to systematically describe the syntax of
programming language constructs like expressions and statements. Using a syntactic variable stmt
to denote statements and variable expr to denote expressions, the production
stmt → if (expr ) stmt else stmt
specifies the structure of this form of conditional statement. Other productions then define
precisely what an expr is and what else a stmt can be.
Terminals:
Terminals are the basic symbols from which strings are formed. The term "token name" is a
synonym for "terminal", and frequently we will use the word "token" for terminal when it is clear
that we are talking about just the token name. We assume that the terminals are the first
components of the tokens output by the lexical analyzer. In the statement example above, the
terminals are the keywords if and else and the symbols "(" and ")".
Nonterminals:
Nonterminals are syntactic variables that denote sets of strings. In the example above, stmt
and expr are nonterminals. The sets of strings denoted by nonterminals help define the
language generated by the grammar. Nonterminals impose a hierarchical structure on the
language that is key to syntax analysis and translation.
In a grammar, one nonterminal is distinguished as the start symbol, and the set of strings it
denotes is the language generated by the grammar. Conventionally, the productions for the
start symbol are listed first. The productions of a grammar specify the manner in which the
terminals and nonterminals can be combined to form strings. Each production consists of:
i. A nonterminal called the head or left side of the production; this production defines
some of the strings denoted by the head.
ii. The symbol → (sometimes ::= has been used in place of the arrow).
iii. A body or right side consisting of zero or more terminals and nonterminals. The
components of the body describe one way in which strings of the nonterminal at the head
can be constructed.
EXAMPLE
1. Consider the grammar below for arithmetic expressions; identify the terminals,
nonterminals and the start symbol in the grammar.
expression → expression + term | expression - term | term
term → term * factor | term / factor | factor
factor → ( expression ) | id
Solution
The nonterminals are expression, term and factor; the start symbol is expression. The
remaining symbols, id + - * / ( ), are the terminals.
NOTATIONAL CONVENTIONS
To avoid always having to state that "these are the terminals," "these are the nonterminals," and
so on, the following notational conventions for grammars are used in most texts.
1. Lowercase letters early in the alphabet, such as a, b, c, together with operator symbols,
punctuation symbols and digits, represent terminals.
2. Uppercase letters early in the alphabet, such as A, B, C, represent nonterminals.
3. Uppercase letters late in the alphabet, such as X, Y, Z, represent grammar symbols; that is,
either nonterminals or terminals.
4. Lowercase Greek letters, α, β, γ for example, represent (possibly empty) strings of grammar
symbols. Thus, a generic production can be written as A → α, where A is the head and α
the body.
5. A set of productions with a common head A (call them A-productions), A → α1, A → α2,
A → α3, ……, A → αk, may be written A → α1 | α2 | α3 | …… | αk. Call α1, α2, α3, ……, αk
the alternatives for A.
6. Unless stated otherwise, the head of the first production is the start symbol.
Example: Using the above conventions, the grammar in the example above can be rewritten
concisely as
E → E+T | E-T | T
T → T*F | T/F | F
F → (E) | id
The notational conventions tell us that E, T, and F are nonterminals, with E the start symbol.
The remaining symbols are terminals.
DERIVATIONS
The construction of a parse tree can be made precise by taking a derivational view, in which
productions are treated as rewriting rules. Beginning with the start symbol, each rewriting step
replaces a nonterminal by the body of one of its productions. This derivational view corresponds
to the top-down construction of a parse tree, but the precision afforded by derivations will be
especially helpful when bottom-up parsing is discussed. As we shall see, bottom-up parsing is
related to a class of derivations known as “rightmost" derivations, in which the rightmost
nonterminal is rewritten at each step.
For example, consider the following grammar, with a single nonterminal E,
E → E + E | E * E | -E | (E) | id
The production E → -E signifies that if E denotes an expression, then -E must also denote an
expression. The replacement of a single E by -E will be described by writing:
E ⇒ -E, which is read, “E derives -E.”
The production E → (E) can be applied to replace any instance of E in any string of grammar
symbols by (E). For example,
E * E ⇒ (E) * E or
E * E ⇒ E * (E). We can take a single E and repeatedly apply productions in any order to
get a sequence of replacements.
For example,
E ⇒ -E ⇒ -(E) ⇒ -(id)
We call such a sequence of replacements a derivation of -(id) from E. This derivation provides
a proof that the string -(id) is one particular instance of an expression.
At each step in a derivation, there are two choices to be made: we need to choose which
nonterminal to replace, and having made this choice, we must pick a production with that
nonterminal as head. If the leftmost nonterminal is always the one replaced, the derivation is
called LEFTMOST; if the rightmost nonterminal is always the one replaced, it is called
RIGHTMOST. For example, the string -(id + id) has the leftmost derivation
E ⇒ -E ⇒ -(E) ⇒ -(E + E) ⇒ -(id + E) ⇒ -(id + id)
and the rightmost derivation
E ⇒ -E ⇒ -(E) ⇒ -(E + E) ⇒ -(E + id) ⇒ -(id + id)
The parse tree for -(id + id) resulting from these derivations is shown below.
Note that the leaves of a parse tree are labelled by nonterminals or terminals and, read from left to
right, constitute a sentential form, called the yield or frontier of the tree.
To see the relationship between derivations and parse trees, consider any derivation α1 ⇒ α2 ⇒
α3 ⇒ …… ⇒ αn, where α1 is a single nonterminal A. For each sentential form αi in the derivation,
we can construct a parse tree whose yield is αi. The process is an induction on i.
Note that a parse tree ignores variations in the order in which symbols in sentential forms are
replaced, and there is a many-to-one relationship between derivations and parse trees. For
example, both the leftmost and rightmost derivations above are associated with the same parse
tree. The figure below shows the sequence of parse trees for the leftmost derivation above.
In what follows, we shall frequently parse by producing a leftmost or a rightmost derivation, since
there is a one-to-one relationship between parse trees and either leftmost or rightmost derivations.
Both leftmost and rightmost derivations pick a particular order for replacing symbols in sentential
forms, so they too filter out variations in the order. It is not hard to show that every parse tree has
associated with it a unique leftmost and a unique rightmost derivation.
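Looking ahead, a hand-written top-down parser for the expression grammar E → E+T | E-T | T, T → T*F | T/F | F, F → (E) | id traces out exactly such a leftmost derivation. Recursive descent cannot handle left recursion directly, so the sketch below assumes the standard rewriting E → T E', E' → + T E' | - T E' | ε (and similarly for T); the Python code and its names are illustrative, not from the textbook:

import re

def tokenize(src):
    # ids, numbers, and the operator/punctuation symbols of the grammar
    return re.findall(r"\d+|[A-Za-z_]\w*|[()+\-*/]", src) + ["$"]

class Parser:
    def __init__(self, src):
        self.toks, self.i = tokenize(src), 0

    def peek(self):
        return self.toks[self.i]

    def eat(self, t):
        assert self.peek() == t, f"expected {t}, found {self.peek()}"
        self.i += 1

    def E(self):                       # E -> T E',  E' -> (+|-) T E' | epsilon
        self.T()
        while self.peek() in ("+", "-"):
            self.eat(self.peek()); self.T()

    def T(self):                       # T -> F T',  T' -> (*|/) F T' | epsilon
        self.F()
        while self.peek() in ("*", "/"):
            self.eat(self.peek()); self.F()

    def F(self):                       # F -> (E) | id
        if self.peek() == "(":
            self.eat("("); self.E(); self.eat(")")
        else:
            self.i += 1                # accept an id or number

    def parse(self):
        self.E(); self.eat("$")
        print("syntactically well formed")

Parser("(a + b) * c").parse()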