Compiler Design
Module- I:
Compiler Structure: Model of compilation:
Programming languages are notations for describing computations to people and to machines.
The world as we know it depends on programming languages, because all the software running
on all the computers was written in some programming language. But, before a program can be
run, it first must be translated into a form in which it can be executed by a computer.
The software systems that do this translation are called compilers.
Language Processors
A compiler is a program that can read a program in one language — the source language —
and translate it into an equivalent program in another language — the target language; see Fig.
1.1. An important role of the compiler is to report any errors in the source program that it detects
during the translation process.
Fig. 1.1: A compiler.
If the target program is an executable machine-language program, it can then be called by the user to process inputs and produce outputs.
The machine-language target program produced by a compiler is usually much faster than an interpreter at mapping inputs to outputs. An interpreter, however, can usually give better error diagnostics than a compiler, because it executes the source program statement by statement.
A preprocessor may first collect the source program and expand macros; the modified source program is then fed to a compiler. The compiler may produce an assembly-language program as its output, because assembly language is easier to produce as output and is easier to debug. The assembly language is then processed by a program called an assembler that produces relocatable machine code as its output.
Large programs are often compiled in pieces, so the relocatable machine code may have to be
linked together with other relocatable object files and library files into the code that actually runs
on the machine. The linker resolves external memory addresses, where the code in one file may
refer to a location in another file. The loader then puts together all of the executable object files
into memory for execution.
Some compilers have a machine-independent optimization phase between the front end and the
back end. The purpose of this optimization phase is to perform transformations on the
intermediate representation, so that the back end can produce a better target program than it
would have otherwise produced from an unoptimized intermediate representation.
Lexical analysis:
The first phase of a compiler is called lexical analysis or scanning. The lexical analyzer reads the
stream of characters making up the source program and groups the characters into meaningful
sequences called lexemes. For each lexeme, the lexical analyzer produces as output a token of the form <token-name, attribute-value> that it passes on to the subsequent phase, syntax analysis. In the token, the first component token-name is an abstract symbol that is used during syntax analysis, and the second component attribute-value points to an entry in the symbol table for this token. Information from the symbol-table entry is needed for semantic analysis and code generation.
For example, suppose a source program contains the assignment statement
position = initial + rate * 60 (1.1)
The characters in this assignment could be grouped into the following lexemes and mapped into
the following tokens passed on to the syntax analyzer:
1. position is a lexeme that would be mapped into a token (id, 1), where id is an abstract symbol standing for identifier and 1 points to the symbol-table entry for position. The symbol-table entry for an identifier holds information about the identifier, such as its name and type.
2. The assignment symbol = is a lexeme that is mapped into the token (=). Since this token needs no attribute-value, we have omitted the second component. We could have used any abstract symbol such as assign for the token-name, but for notational convenience we have chosen to use the lexeme itself as the name of the abstract symbol.
3. initial is a lexeme that is mapped into the token (id, 2), where 2 points to the symbol-table entry for initial.
4. + is a lexeme that is mapped into the token (+).
5. rate is a lexeme that is mapped into the token (id, 3), where 3 points to the symbol-table entry for rate.
6. * is a lexeme that is mapped into the token (*).
7. 60 is a lexeme that is mapped into the token (60).
Blanks separating the lexemes would be discarded by the lexical analyzer.
After lexical analysis, the assignment (1.1) is represented as the sequence of tokens
<id,1> <=> <id,2> <+> <id,3> <*> <60> (1.2)
In this representation, the token names =, +, and * are abstract symbols for
the assignment, addition, and multiplication operators, respectively.
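As a rough sketch (an illustration added here, with the Token structure and field names chosen only for this example), the token stream (1.2) could be represented and printed in C++ as follows:

#include <iostream>
#include <string>
#include <vector>

// A token is a pair: an abstract token name and an optional attribute value
// (here, the symbol-table index for identifiers, or -1 when unused).
struct Token {
    std::string name;      // e.g. "id", "=", "+", "*", "60"
    int symtabIndex;       // points to the symbol-table entry, -1 if unused
};

int main() {
    // Token sequence produced for: position = initial + rate * 60
    std::vector<Token> tokens = {
        {"id", 1}, {"=", -1}, {"id", 2}, {"+", -1}, {"id", 3}, {"*", -1}, {"60", -1}
    };
    for (const Token& t : tokens) {
        std::cout << "<" << t.name;
        if (t.symtabIndex != -1) std::cout << "," << t.symtabIndex;
        std::cout << "> ";
    }
    std::cout << "\n";     // prints: <id,1> <=> <id,2> <+> <id,3> <*> <60>
    return 0;
}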
Interface with input parser and symbol table:
Symbol Table is an important data structure created and maintained by the compiler in order
to keep track of semantics of variables i.e. it stores information about the scope and binding
information about names, information about instances of various entities such as variable and
function names, classes, objects, etc.
It is built in the lexical and syntax analysis phases.
The information is collected by the analysis phases of the compiler and is used by the
synthesis phases of the compiler to generate code.
It is used by the compiler to achieve compile-time efficiency.
It is used by various phases of the compiler as follows:
Lexical Analysis: Creates new entries in the table, for example entries for tokens such as
identifiers.
Syntax Analysis: Adds information regarding attribute type, scope, dimension, line of
reference, use, etc in the table.
Semantic Analysis: Uses the available information in the table to check semantics, i.e.
to verify that expressions and assignments are semantically correct (type checking), and
updates it accordingly.
Intermediate Code generation: Refers to the symbol table to know how much and what
type of run-time storage is allocated; the table also helps in adding temporary variable information.
Code Optimization: Uses information present in the symbol table for machine-
dependent optimization.
Target Code generation: Generates code by using address information of identifier
present in the table.
Symbol Table entries – Each entry in the symbol table is associated with attributes that
support the compiler in different phases.
Items stored in Symbol table:
Variable names and constants
Procedure and function names
Literal constants and strings
Compiler generated temporaries
Labels in source languages
Information used by the compiler from Symbol table:
Data type and name
Declaring procedures
Offset in storage
For a structure or record, a pointer to the structure table.
For parameters, whether parameter passing by value or by reference
Number and type of arguments passed to function
Base Address
Operations of Symbol table – The basic operations defined on a symbol table include insert(), which adds a new name together with its attributes, and lookup(), which searches for a name and returns its entry (if any).
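A minimal C++ sketch of these two operations follows; the entry layout, field names and class name are illustrative assumptions, not a layout prescribed by these notes:

#include <iostream>
#include <string>
#include <unordered_map>

// One symbol-table entry holding a few of the attributes listed above.
struct SymbolEntry {
    std::string name;
    std::string type;      // e.g. "int", "float"
    int offset;            // offset in storage
};

class SymbolTable {
    std::unordered_map<std::string, SymbolEntry> table;
public:
    // insert(): add a new name together with its attributes
    void insert(const SymbolEntry& e) { table[e.name] = e; }

    // lookup(): find the entry for a name; returns nullptr if the name is absent
    const SymbolEntry* lookup(const std::string& name) const {
        auto it = table.find(name);
        return it == table.end() ? nullptr : &it->second;
    }
};

int main() {
    SymbolTable st;
    st.insert({"position", "float", 0});
    if (const SymbolEntry* e = st.lookup("position"))
        std::cout << e->name << " : " << e->type << " at offset " << e->offset << "\n";
    return 0;
}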
Token:
It is basically a sequence of characters that are treated as a unit as it cannot be further broken
down. In programming languages like C language- keywords (int, char, float, const, goto,
continue, etc.) identifiers (user-defined names), operators (+, -, *, /), delimiters/punctuators
like comma (,), semicolon (;), braces ({ }), etc., and strings can be considered as tokens. This phase
recognizes three types of tokens: Terminal Symbols (TRM)- Keywords and Operators,
Literals (LIT), and Identifiers (IDN).
Let’s understand now how to count the tokens in C source code.
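As an illustrative example, the comments below count the lexemes/tokens in each statement of a small C program (the program itself is chosen only for this illustration):

#include <cstdio>
int main() {
    int a = 5, b = 10;     // tokens: int, a, =, 5, ',', b, =, 10, ;   -> 9 tokens
    int max = a + b;       // tokens: int, max, =, a, +, b, ;          -> 7 tokens
    printf("%d", max);     // tokens: printf, (, "%d", ',', max, ), ;  -> 7 tokens
    return 0;              // tokens: return, 0, ;                     -> 3 tokens
}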
Lexeme:
A lexeme is a sequence of characters in the source code that is matched by the predefined language rules (patterns) so that it can be identified as a valid token.
Example: in the statement int x = 5;, the lexemes are int, x, =, 5 and ;.
Pattern:
A pattern is the rule that the characters of a lexeme must follow for the lexeme to be a valid token.
For a keyword to be identified as a valid token, the pattern is the sequence of characters that
make the keyword.
For an identifier to be identified as a valid token, the pattern is the predefined rule that it must
start with an alphabet (letter), followed by alphabets or digits.
Criteria | Token | Lexeme | Pattern
Interpretation of type Identifier | name of a variable, function, etc. | main, a | it must start with the alphabet, followed by the alphabet or a digit
Interpretation of type Operator | all the operators are considered tokens | +, = | +, =
Interpretation of type Literal | a grammar rule or boolean literal | "Welcome to GeeksforGeeks!" | any string of characters (except ' ') between " and "
Example:
z = x + y;
This statement has the below form for syntax analyzer
<id> = <id> + <id>; //<id>- identifier (token)
The lexical analyzer not only provides a series of tokens but also builds a Symbol Table that
records the names appearing in the source code; whitespace and comments are discarded.
Lexical analysis is the process of producing tokens from the source program. It has the
following issues:
• Lookahead
• Ambiguities
1. Lookahead:
Lookahead is required to decide where one token ends and the next token begins. Simple examples that raise lookahead issues are distinguishing i from if, and = from ==. Therefore a way to describe the lexemes of each token is required.
2. Ambiguities:
The lexical analysis programs written with lex accept ambiguous specifications and choose the
longest match possible at each input point. When more than one expression can match the
current input, lex chooses as follows:
• The longest match is preferred.
• Among rules which matched the same number of characters, the rule given first is preferred.
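The following small C++ sketch illustrates the longest-match (maximal munch) idea for the = vs. == and i vs. if cases; it is a toy scanner written for this illustration, not lex itself, and the token names are assumptions:

#include <cctype>
#include <iostream>
#include <string>

// Return the next token starting at position pos, preferring the longest match.
std::string nextToken(const std::string& src, size_t& pos) {
    // "==" needs one character of lookahead beyond "="
    if (src.compare(pos, 2, "==") == 0) { pos += 2; return "EQ"; }
    if (src[pos] == '=')               { pos += 1; return "ASSIGN"; }
    // Identifiers and keywords: consume the longest run of letters/digits,
    // then decide whether the lexeme is the keyword "if".
    if (std::isalpha(static_cast<unsigned char>(src[pos]))) {
        size_t start = pos;
        while (pos < src.size() &&
               std::isalnum(static_cast<unsigned char>(src[pos]))) ++pos;
        std::string lexeme = src.substr(start, pos - start);
        return lexeme == "if" ? std::string("KEYWORD_if") : "ID(" + lexeme + ")";
    }
    ++pos;
    return "OTHER";
}

int main() {
    std::string src = "if i==j";
    size_t pos = 0;
    while (pos < src.size()) {
        if (src[pos] == ' ') { ++pos; continue; }   // skip blanks
        std::cout << nextToken(src, pos) << " ";
    }
    std::cout << "\n";   // prints: KEYWORD_if ID(i) EQ ID(j)
    return 0;
}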
Error recovery strategies for a lexical analyzer
A character sequence that cannot be scanned into any valid token is a lexical error. Misspelling
of identifiers, keyword, or operators are considered as lexical errors. Usually, a lexical error is
caused by the appearance of some illegal character, mostly at the beginning of a token.
The following are the error-recovery strategies in lexical analysis:
1) Panic Mode Recovery: Once an error is found, successive characters are ignored until we
reach a well-formed token such as end or a semicolon.
2) Deleting an extraneous character.
3) Inserting a missing character.
4) Replacing an incorrect character by a correct character.
5) Transposing two adjacent characters.
Lexical Error:
When the token pattern does not match the prefix of the remaining input, the lexical analyzer
gets stuck and has to recover from this state to analyze the remaining input. In simple words,
a lexical error occurs when a sequence of characters does not match the pattern of any token.
It is detected at compile time, during the lexical analysis phase.
Types of Lexical Error:
Types of lexical error that can occur in a lexical analyzer are as follows:
1. Exceeding length of identifier or numeric constants.
Example:
#include <iostream>
int main() {
    int x = 123456789012345678901234567890; // numeric constant too long (illustrative completion)
    return 0;
}
2. Appearance of illegal characters.
Example:
#include <iostream>
using namespace std;
int main() {
printf("Geeksforgeeks");$
return 0;
}
This is a lexical error since an illegal character $ appears at the end of the statement.
3. Unmatched string
Example:
#include <iostream>
using namespace std;
int main() {
/* comment
cout<<"GFG!";
return 0;
}
This is a lexical error since the ending of comment “*/” is not present but the beginning is
present.
4. Spelling Error.
#include <iostream>
using namespace std;
int main() {
int 3num= 1234; /* spelling error as identifier
cannot start with a number*/
return 0;
}
5. Transposition of two characters.
#include <iostream>
using namespace std;
int mian()
{
/* the spelling of main here ("mian") would be treated as a lexical
error and won't be considered as an identifier,
transposition of characters 'i' and 'a'*/
cout << "GFG!";
return 0;
}
Error Recovery Technique
A situation may arise in which the lexical analyzer is unable to proceed because none of the
patterns for tokens matches any prefix of the remaining input. The simplest recovery strategy is
"panic mode" recovery: we delete successive characters from the remaining input until the
lexical analyzer can identify a well-formed token at the beginning of what input is left.
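A rough C++ sketch of panic-mode recovery; the choice of synchronizing characters (a blank or a semicolon) and the input string are assumptions made for this illustration:

#include <iostream>
#include <string>

// Skip characters until a likely token boundary (';' or a blank) is found,
// so scanning can resume on what is hopefully a well-formed token.
size_t panicModeRecover(const std::string& src, size_t pos) {
    while (pos < src.size() && src[pos] != ';' && src[pos] != ' ')
        ++pos;
    return pos;
}

int main() {
    std::string src = "x = @#$ ; y = 2;";
    size_t pos = 4;                        // suppose the scanner is stuck at the illegal '@'
    pos = panicModeRecover(src, pos);      // the characters '@', '#', '$' are discarded
    std::cout << "scanning resumes at position " << pos << "\n";   // prints 7 (the blank before ';')
    return 0;
}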
Input buffering:
The lexical analyzer scans the input from left to right one character at a time. It uses two
pointers, begin ptr (bp) and forward ptr (fp), to keep track of the portion of the input scanned.
Initially both pointers point to the first character of the input string.
The forward ptr moves ahead to search for the end of the lexeme. As soon as a blank space is
encountered, it indicates the end of the lexeme; for instance, as soon as fp encounters a blank
space after the characters i, n, t, the lexeme "int" is identified. When fp encounters white space,
it ignores it and moves ahead; then both the begin ptr (bp) and forward ptr (fp) are set to the
start of the next token. The input characters are thus read from secondary storage, but reading
this way from secondary storage is costly, hence a buffering technique is used. A block of data
is first read into a buffer and then scanned by the lexical analyzer. There are two methods used
in this context: the One Buffer Scheme and the Two Buffer Scheme. These are explained below.
1. One Buffer Scheme: In this scheme, only one buffer is used to store the input string. The
problem with this scheme is that if a lexeme is very long, it crosses the buffer boundary; to
scan the rest of the lexeme the buffer has to be refilled, which overwrites the first part of the
lexeme.
2. Two Buffer Scheme: To overcome the problem of the one buffer scheme, in this method two
buffers are used to store the input string. The first and second buffers are scanned alternately;
when the end of the current buffer is reached, the other buffer is filled. The only remaining
problem is that if the length of a lexeme is longer than the length of a buffer, the input cannot
be scanned completely. Initially both bp and fp point to the first character of the first buffer.
Then fp moves towards the right in search of the end of the lexeme; as soon as a blank character
is recognized, the string between bp and fp is identified as the corresponding token. To identify
the boundary of the first buffer, an end-of-buffer character is placed at the end of the first
buffer. Similarly, the end of the second buffer is recognized by the end-of-buffer mark placed at
its end. When fp encounters the first eof, the end of the first buffer is recognized and filling of
the second buffer begins. In the same way, when the second eof is reached, it indicates the end
of the second buffer. Alternately, both buffers can be refilled until the end of the input program
is reached and the whole stream of tokens is identified. This eof character introduced at the end
is called a sentinel, and it is used to identify the end of the buffer.
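A simplified C++ sketch of the two-buffer scheme with sentinels; the buffer size, the refill logic and the use of an in-memory string as the "input file" are assumptions made for this illustration:

#include <cstring>
#include <iostream>

const int N = 16;                  // size of each buffer half
char buf[2 * N + 2];               // two buffers, each followed by a sentinel slot

int main() {
    const char* input = "int position = initial + rate * 60;";
    std::strncpy(buf, input, N);   // fill the first buffer
    buf[N] = '\0';                 // sentinel at the end of the first buffer
    char* forward = buf;           // the forward pointer fp
    while (true) {
        if (*forward == '\0') {
            if (forward == buf + N) {
                // sentinel of the first buffer reached: refill the second buffer
                std::strncpy(buf + N + 1, input + N, N);
                buf[2 * N + 1] = '\0';         // sentinel at the end of the second buffer
                forward = buf + N + 1;
            } else {
                // sentinel of the second buffer (or real end of input);
                // wrapping around to refill the first buffer is omitted here
                break;
            }
        } else {
            std::cout << *forward;             // "scan" one character
            ++forward;
        }
    }
    std::cout << "\n";
    return 0;
}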
Specification of tokens:
In language theory, the terms "sentence" and "word" are often used as synonyms for "string."
The length of a string s, usually written |s|, is the number of occurrences of symbols in s. For
example, banana is a string of length six. The empty string, denoted ε, is the string of length zero.
Operations on strings: the most important operations are concatenation (the concatenation of x and y is written xy), exponentiation (s^0 = ε and s^i = s^(i-1)s), and forming prefixes, suffixes, substrings, and subsequences of a string.
Operations on languages:
The following are the operations that can be applied to languages:
1. Union
2. Concatenation
3. Kleene closure
4. Positive closure
The following example shows the operations on languages: let L = {0, 1} and S = {a, b, c}. Then
L ∪ S = {0, 1, a, b, c}
LS = {0a, 0b, 0c, 1a, 1b, 1c}
L* = the set of all strings of 0's and 1's, including the empty string ε
L+ = the set of all strings of one or more 0's and 1's
Regular Expressions
· Each regular expression r denotes a language L(r).
· Here are the rules that define the regular expressions over some alphabet Σ
and the languages that those expressions denote
1. ε is a regular expression, and L(ε) is { ε }, that is, the language whose sole member is the empty string.
2. If ‘a’ is a symbol in Σ, then ‘a’ is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with ‘a’ in its one position.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,
a) (r)|(s) is a regular expression denoting the language L(r) U L(s).
b) (r)(s) is a regular expression denoting the language L(r)L(s).
c) (r)* is a regular expression denoting (L(r))*.
d) (r) is a regular expression denoting L(r).
4. The unary operator * has highest precedence and is left associative.
5. Concatenation has second highest precedence and is left associative.
6. | has lowest precedence and is left associative.
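For example, over the alphabet Σ = {a, b}: the expression a|b denotes the language {a, b}; (a|b)(a|b) denotes {aa, ab, ba, bb}; a* denotes {ε, a, aa, aaa, ...}; and (a|b)* denotes the set of all strings of a's and b's, including the empty string.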
Regular set
A language that can be defined by a regular expression is called a regular set. If two
regular expressions r and s denote the same regular set, we say they are equivalent and
write r = s.
There are a number of algebraic laws for regular expressions that can be used to
manipulate regular expressions into equivalent forms.
For instance, r|s = s|r is commutative; r|(s|t)=(r|s)|t is associative.
Regular Definitions
For notational convenience, we may give names to certain regular expressions and use those names in subsequent expressions. For example:
letter → A | B | ... | Z | a | b | ... | z
digit → 0 | 1 | ... | 9
id → letter ( letter | digit )*
Shorthands
Common shorthands include: one or more instances (r+ denotes the language L(r)+), zero or one instance (r? denotes L(r) ∪ {ε}), and character classes ([a-z] denotes a | b | ... | z).
Non-regular Set
A language that cannot be described by any regular expression is a non-regular set. For example, the set of strings of balanced parentheses cannot be described by a regular expression, because regular expressions cannot count.
Regular Languages are the most restricted types of languages and are accepted by finite
automata.
Regular Expressions.
Regular Expressions are used to denote regular languages. An expression is regular if:
ɸ is a regular expression for regular language ɸ.
ɛ is a regular expression for regular language {ɛ}.
If a ∈ Σ (Σ represents the input alphabet), a is a regular expression with language {a}.
If a and b are regular expressions, a + b is also a regular expression with language {a, b}.
If a and b are regular expressions, ab (concatenation of a and b) is also regular.
If a is a regular expression, a* (0 or more occurrences of a) is also regular.
Regular Grammar : A grammar is regular if it has rules of form A -> a or A -> aB or A -> ɛ
where ɛ is a special symbol called NULL.
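For example, the grammar with rules S -> aS, S -> bS and S -> ɛ is regular and generates all strings over {a, b}.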
Union : If L1 and If L2 are two regular languages, their union L1 ∪ L2 will also be regular.
For example, L1 = {a^n | n ≥ 0} and L2 = {b^n | n ≥ 0}
L3 = L1 ∪ L2 = {a^n ∪ b^n | n ≥ 0} is also regular.
Intersection : If L1 and If L2 are two regular languages, their intersection L1 ∩ L2 will also
be regular. For example,
L1 = {a^m b^n | n ≥ 0 and m ≥ 0} and L2 = {a^m b^n ∪ b^n a^m | n ≥ 0 and m ≥ 0}
L3 = L1 ∩ L2 = {a^m b^n | n ≥ 0 and m ≥ 0} is also regular.
Concatenation : If L1 and If L2 are two regular languages, their concatenation L1.L2 will also
be regular. For example,
L1 = {a^n | n ≥ 0} and L2 = {b^n | n ≥ 0}
L3 = L1.L2 = {a^m b^n | m ≥ 0 and n ≥ 0} is also regular.
Kleene Closure : If L1 is a regular language, its Kleene closure L1* will also be regular. For
example,
L1 = (a ∪ b)
L1* = (a ∪ b)*.
Complement : If L(G) is regular language, its complement L’(G) will also be regular.
Complement of a language can be found by subtracting strings which are in L(G) from all
possible strings. For example,
L(G) = {a^n | n > 3}
L’(G) = {a^n | n ≤ 3}
Note : Two regular expressions are equivalent if languages generated by them are same. For
example, (a+b*)* and (a+b)* generate same language. Every string which is generated by
(a+b*)* is also generated by (a+b)* and vice versa.
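One way to see this: each repetition inside (a+b*)* contributes either an a or a (possibly empty) run of b's, so any string over {a, b} can be assembled and nothing outside {a, b}* can be produced; since (a+b)* also denotes exactly the set of all strings over {a, b}, the two expressions are equivalent.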
Question 1: Which one of the following languages over the alphabet {0,1} is described by the
regular expression (0+1)*0(0+1)*0(0+1)* ?
(A) The set of all strings containing the substring 00.
(B) The set of all strings containing at most two 0’s.
(C) The set of all strings containing at least two 0’s.
(D) The set of all strings that begin and end with either 0 or 1.
Solution: Option A says that every string must have the substring 00. But 10101 is also part of
the language and it does not contain 00 as a substring, so this option is not correct.
Option B says that a string can have at most two 0’s, but 00000 is also part of the language, so
this option is not correct.
Option C says that a string must contain at least two 0’s. In the regular expression, two 0’s are
present, so this is the correct option.
Option D says that the language is the set of all strings that begin and end with either 0 or 1.
But the expression can also generate strings which start with 0 and end with 1, or vice versa, so
it is not correct.
Solution: Two regular expressions are equivalent if the languages generated by them are the same.
Option (A) can generate all strings generated by 0*(10*)*, so they are equivalent.
Option (B): the null string cannot be generated by the given language, but 0*(10*)* can generate
it, so they are not equivalent.
Option (C) will always have 10 as a substring, but 0*(10*)* may or may not, so they are not equivalent.
Question 4 : The regular expression for the language having input alphabets a and b, in which
two a’s do not come together:
(A) (b + ab)* + (b +ab)*a
(B) a(b + ba)* + (b + ba)*
(C) both options (A) and (B)
(D) none of the above
Solution:
Option (C), stating that both options (A) and (B) are correct, is the right answer for the stated
question.
The language in the question can be expressed as
L = {ε, a, b, bb, ab, aba, ba, bab, baba, abab, …}.
In option (A), ‘ab’ is considered the building block for finding the required regular
expression. (b + ab)* covers all cases of strings ending with ‘b’, and (b + ab)*a covers all
cases of strings ending with ‘a’.
Applying similar logic to option (B), we can see that the regular expression is derived
considering ‘ba’ as the building block, and it covers all cases of strings starting with ‘a’ and
strings starting with ‘b’.