CMP 335 Regular Expression Exercises Note


Introduction

Regular expressions (Regex) are a powerful tool used in various fields of computer science, including
compiler design. They provide a concise and flexible means of describing patterns in text, which is essential
for tasks like lexical analysis in compiler design. This blog will delve into the significance of regular
expressions in compiler design, explaining their role, usage, and impact on the overall process of compiling
a program.

What are Regular Expressions?
Regular expressions are sequences of characters that define search patterns, primarily used for string
matching within texts. These patterns can be simple, such as matching a single character, or complex,
involving combinations of various characters, special symbols, and operators.
Basic Syntax of Regular Expressions
• Literals: The simplest form of a regular expression is a literal, which matches exactly that character. For example, the regex a will match the character 'a' in a text.
• Concatenation: Two regular expressions can be concatenated, meaning their matches must appear in sequence. For example, ab will match the sequence 'ab'.
• Alternation: The alternation operator | allows for matching one of several patterns. For example, a|b matches either 'a' or 'b'.
• Repetition Operators:
   • * (Kleene star) matches zero or more occurrences of the preceding element.
   • + matches one or more occurrences.
   • ? matches zero or one occurrence.
Examples
• The regex a*b matches any number of 'a's followed by a 'b' (e.g., b, ab, aaab).
• The regex (a|b)c matches either 'ac' or 'bc'.
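The two examples above can be tried out directly. The following is a minimal sketch using Python's standard re module; the helper name accepts is only an illustration and is not part of the original notes.

```python
import re

# a*b    : zero or more 'a' followed by exactly one 'b'
STAR = re.compile(r"a*b")
# (a|b)c : either 'a' or 'b', followed by 'c'
ALT = re.compile(r"(a|b)c")

def accepts(pattern, text):
    """Return True when the whole string (not just a prefix) matches."""
    return pattern.fullmatch(text) is not None

if __name__ == "__main__":
    for s in ["b", "ab", "aaab", "ba", ""]:
        print("a*b   ", repr(s), accepts(STAR, s))
    for s in ["ac", "bc", "cc", "abc"]:
        print("(a|b)c", repr(s), accepts(ALT, s))
```

fullmatch is used rather than search so that the whole input has to fit the pattern, which is how a pattern is read when it defines a language.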
Role of Regular Expressions in Compiler Design
In compiler design, regular expressions play a crucial role in the lexical analysis phase, which is the first
phase of a compiler. The primary task of lexical analysis is to read the source code and convert it into
tokens, which are the smallest units of meaning (like keywords, operators, and identifiers).
Lexical Analysis
The lexical analyzer, also known as the scanner, uses regular expressions to identify patterns in the source
code and classify them into tokens. These tokens are then used by the parser in the subsequent phase of the
compiler.
For example, consider the following piece of code:
int main() {
int a = 5;
}

The lexical analyzer would break this code into tokens like int, main, (, ), {, }, a, =, 5, ;. To identify each of these tokens, the lexer relies on regular expressions:
• Keywords like int can be matched directly using literals or predefined patterns. (main is not a keyword; it is matched as an ordinary identifier.)
• Identifiers (like main and the variable name a) can be matched using a regex that allows a letter followed by any sequence of letters and digits.
• Operators like = can be matched using specific literals.
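As a rough illustration of the list above, here is a small regex-driven tokenizer in Python. The token class names, the keyword list and the function name tokenize are assumptions made for this sketch; a real lexer for C would need many more patterns.

```python
import re

# Token classes in priority order; names and keyword list are illustrative.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:int|return|if|else)\b"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NUMBER",     r"[0-9]+"),
    ("OPERATOR",   r"="),
    ("PUNCT",      r"[(){};]"),
    ("SKIP",       r"\s+"),
]
# One master pattern with a named group per token class.
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenize(source):
    """Split source into (token class, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

if __name__ == "__main__":
    for token in tokenize("int main() {\nint a = 5;\n}"):
        print(token)
```

Listing KEYWORD before IDENTIFIER means that int is classified as a keyword, while a name such as integer still falls through to the identifier pattern because of the word boundaries.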
Conversion to Finite Automata
Regular expressions are not only used for pattern matching but can also be converted into finite automata, which are used to recognize tokens. This conversion is an essential step in the lexical analysis process.
• Deterministic Finite Automata (DFA): A DFA has exactly one move for each state and input symbol, so it recognizes tokens in a single pass over the input string with a fixed amount of work per character.
• Nondeterministic Finite Automata (NFA): An NFA may have several possible moves from a state, which makes it easy to build from a regular expression, but it requires more complex processing because it can be in multiple states simultaneously.
The process usually involves converting the regular expression into an NFA, which is then transformed into a DFA and optimized. This DFA can then be used by the lexer to scan the source code efficiently.
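The NFA-to-DFA step can be sketched with the subset construction. The NFA below for a*b is written by hand with arbitrary state numbers, and all names are assumptions for this example; it is a sketch of the idea, not the output of any particular tool.

```python
from collections import deque

# Hand-written NFA for a*b. Keys are (state, symbol); "" marks an epsilon move.
NFA = {
    (0, ""): {1, 3},
    (1, "a"): {2},
    (2, ""): {1, 3},
    (3, "b"): {4},
}
NFA_START, NFA_ACCEPT = 0, {4}
ALPHABET = "ab"

def eps_closure(states):
    """All NFA states reachable from states using only epsilon moves."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, ""), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)

def subset_construction():
    """Build a DFA whose states are sets of NFA states."""
    start = eps_closure({NFA_START})
    dfa, seen, queue = {}, {start}, deque([start])
    while queue:
        current = queue.popleft()
        for sym in ALPHABET:
            moved = set()
            for s in current:
                moved |= NFA.get((s, sym), set())
            target = eps_closure(moved)      # the empty set acts as a dead state
            dfa[(current, sym)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    accepting = {state for state in seen if state & NFA_ACCEPT}
    return start, dfa, accepting

def dfa_accepts(text):
    """Run the constructed DFA; text must only use symbols from ALPHABET."""
    start, dfa, accepting = subset_construction()
    state = start
    for ch in text:
        state = dfa[(state, ch)]
    return state in accepting
```

Once the table dfa is built, scanning needs only one table lookup per input character, which is the efficiency argument made above.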
Advantages of Using Regular Expressions
• Simplicity: Regular expressions offer a simple and concise way to represent patterns, making the lexer easier to implement.
• Efficiency: Once compiled into finite automata, regular expressions can be used to quickly and efficiently recognize tokens in the source code.
• Flexibility: Regular expressions are highly flexible, allowing the lexer to handle a wide variety of token patterns, from simple keywords to complex operators.
Limitations
While regular expressions are powerful, they have limitations:
• Context-Free Grammars: Regular expressions cannot describe nested constructs such as balanced parentheses. These require context-free grammars, which are needed for the syntactic structure of programming languages. Therefore, regular expressions are limited to the lexical analysis phase.
• Complexity: For very complex patterns, regular expressions can become hard to read and maintain, especially as the language's syntax grows in complexity.

Conclusion
Regular expressions are a fundamental tool in compiler design, particularly in the lexical analysis phase. They enable the efficient identification and classification of tokens in source code, facilitating the compilation process. By converting regular expressions into finite automata, compilers can quickly recognize token patterns, so that the source code is reliably split into tokens before it is parsed and transformed into executable code. Despite their limitations, regular expressions remain an essential component in the toolkit of compiler designers.
QUESTIONS AND SOLUTIONS

1. What is Operator Precedence Parsing Algorithm in compiler design?


Any string of a grammar can be parsed using a stack, as in shift-reduce parsing. In operator precedence parsing, shifting and reducing are decided by the precedence relation between the topmost terminal on the stack and the current input symbol of the string being parsed.

The operator precedence parsing algorithm is as follows −


Input − The precedence relations from some operator precedence grammar and an input string of terminals
from that grammar.

Output − There is no output but it can construct a skeletal parse tree as we parse, with one non-terminal
labeling all interior nodes and the use of single productions not shown. Alternatively, the sequence of shift-
reduce steps can be considered the output.

Method − Let the input string be a1 a2 … an $. Initially, the stack contains $.

Repeat forever
    if only $ is on the stack and only $ is on the input then
        accept and break
    else
    begin
        let a be the topmost terminal symbol on the stack and let b be the current input symbol
        if a <. b or a =. b then
            shift b onto the stack                       /* shift */
        else if a .> b then                               /* reduce */
            repeat
                pop the stack
            until the top stack terminal is related by <. to the terminal most recently popped
        else
            call the error-correcting routine
    end

Example1 − Construct the Precedence Relation table for the Grammar.

E → E + E|E − E|E ∗ E|E⁄E|E ↑ E|(E)| − E|id

Using Assumptions

Operators     Precedence       Associativity
↑             Highest          Right associative
* and /       Next highest     Left associative
+ and −       Lowest           Left associative


Solution

Operator Precedence Relations (err marks a blank entry, i.e., an error)

       +     −     *     /     ↑     id    (     )     $
+      .>    .>    <.    <.    <.    <.    <.    .>    .>
−      .>    .>    <.    <.    <.    <.    <.    .>    .>
*      .>    .>    .>    .>    <.    <.    <.    .>    .>
/      .>    .>    .>    .>    <.    <.    <.    .>    .>
↑      .>    .>    .>    .>    <.    <.    <.    .>    .>
id     .>    .>    .>    .>    .>    err   err   .>    .>
(      <.    <.    <.    <.    <.    <.    <.    =.    err
)      .>    .>    .>    .>    .>    err   err   .>    .>
$      <.    <.    <.    <.    <.    <.    <.    err   err

Example2 − Find out all precedence relations between various operators & symbols in the following
Grammar & show them using the precedence table.

E → E + T|T

T → T ∗ F|F

F → (E)|id

Solution

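Operator Precedence Relations (err marks a blank entry, i.e., an error)

       +     *     (     )     id    $
+      .>    <.    <.    .>    <.    .>
*      .>    .>    <.    .>    <.    .>
(      <.    <.    <.    =.    <.    err
)      .>    .>    err   .>    err   .>
id     .>    .>    err   .>    err   .>
$      <.    <.    <.    err   <.    err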
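The parsing algorithm from question 1 can be sketched in Python using the relations of the Example 2 table. The stack in this sketch holds terminals only, and the sketch does not check that a popped handle matches a production, so some malformed inputs (for example + id) are not rejected; the names ROWS and parse are assumptions of the example.

```python
LT, EQ, GT = "<.", "=.", ".>"

# ROWS[a][b] is the relation between stack terminal a and input terminal b.
# Pairs that are missing correspond to blank (error) entries in the table.
ROWS = {
    "+":  {"+": GT, "*": LT, "(": LT, ")": GT, "id": LT, "$": GT},
    "*":  {"+": GT, "*": GT, "(": LT, ")": GT, "id": LT, "$": GT},
    "(":  {"+": LT, "*": LT, "(": LT, ")": EQ, "id": LT},
    ")":  {"+": GT, "*": GT, ")": GT, "$": GT},
    "id": {"+": GT, "*": GT, ")": GT, "$": GT},
    "$":  {"+": LT, "*": LT, "(": LT, "id": LT},
}

def parse(tokens):
    """Return the sequence of shift/reduce steps, or raise SyntaxError."""
    stack = ["$"]                      # terminals only (skeletal parse)
    tokens = list(tokens) + ["$"]
    pos, steps = 0, []
    while True:
        a, b = stack[-1], tokens[pos]
        if a == "$" and b == "$":
            steps.append("accept")
            return steps
        rel = ROWS[a].get(b)
        if rel in (LT, EQ):            # shift
            stack.append(b)
            pos += 1
            steps.append(f"shift {b}")
        elif rel == GT:                # reduce: pop until top <. last popped
            handle = []
            while True:
                popped = stack.pop()
                handle.append(popped)
                if ROWS[stack[-1]].get(popped) == LT:
                    break
            steps.append("reduce " + " ".join(reversed(handle)))
        else:
            raise SyntaxError(f"no precedence relation between {a!r} and {b!r}")

if __name__ == "__main__":
    for step in parse(["id", "+", "id", "*", "id"]):
        print(step)
```

For id + id * id the multiplication is reduced before the addition, which is what the entry + <. * in the table encodes.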

2. What are Precedence Functions in compiler design?


Precedence relations between any two operators or symbols in the precedence table can be converted to two precedence functions f and g that map terminal symbols to integers.

• If a <. b, then f(a) < g(b)
• If a =. b, then f(a) = g(b)
• If a .> b, then f(a) > g(b)
Here a and b represent terminal symbols; f(a) and g(b) are the integer values of the precedence functions.

Computations of Precedence Functions

• For each terminal a, create the symbols fa and ga.
• Make a node for each symbol, grouping symbols as follows.
  If a =. b, then fa and gb are in the same group or node.
  If a =. b and c =. b, then fa and fc must be in the same group or node.
• (a) If a <. b, mark an edge from gb to fa.
  (b) If a .> b, mark an edge from fa to gb.
• If the graph constructed has a cycle, then no precedence functions exist.
• If there are no cycles:
  (a) f(a) = length of the longest path beginning at the group of fa.
  (b) g(a) = length of the longest path beginning at the group of ga.

Example1 − Construct precedence graph & precedence function for the following table.

Solution

Step1 − Create the symbols fid, f+, f*, f$ and gid, g+, g*, g$.

Step2 − No symbol has equal precedence, as can be seen in the given table; therefore, each symbol will remain in a different node.

Step3 − If a <. b, create an edge gb → fa.

If a .> b, create an edge fa → gb.

Since $ <. +, *, id, make edges from g+, g*, gid to f$.

Similarly, + <. *, id, so make edges from g*, gid to f+.

Similarly, * <. id, so make an edge from gid to f*.

Since +, *, id .> $, make edges from f+, f*, fid to g$.

Similarly, +, *, id .> +, so make edges from f+, f*, fid to g+.

Similarly, *, id .> *, so make edges from f*, fid to g*.

Combining all the edges gives the precedence graph.


Step4 − Computing the maximum length of the path from each node, we get the following precedence
functions

       id    +     *     $
f      4     2     4     0
g      5     1     3     0
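The longest-path computation can be reproduced with a short Python sketch. The relations are the ones listed in Step 3; the function and variable names are assumptions of this example, and grouping for =. is left out because this table has no =. entries.

```python
from collections import defaultdict

# Relations from Step 3: RELATIONS[(a, b)] is the relation between a and b.
RELATIONS = {}
for a, bs in {"$": ["+", "*", "id"], "+": ["*", "id"], "*": ["id"]}.items():
    for b in bs:
        RELATIONS[(a, b)] = "<."
for b, left_terms in {"$": ["+", "*", "id"], "+": ["+", "*", "id"], "*": ["*", "id"]}.items():
    for a in left_terms:
        RELATIONS[(a, b)] = ".>"

# Build the precedence graph: a <. b gives gb -> fa, a .> b gives fa -> gb.
EDGES = defaultdict(set)
for (a, b), rel in RELATIONS.items():
    if rel == "<.":
        EDGES[("g", b)].add(("f", a))
    else:
        EDGES[("f", a)].add(("g", b))

def longest_path(node, visiting=()):
    """Length of the longest path starting at node; a cycle raises ValueError."""
    if node in visiting:
        raise ValueError("cycle found: no precedence functions exist")
    successors = EDGES.get(node, ())
    return max((1 + longest_path(n, visiting + (node,)) for n in successors), default=0)

def precedence_functions(terminals=("id", "+", "*", "$")):
    f = {t: longest_path(("f", t)) for t in terminals}
    g = {t: longest_path(("g", t)) for t in terminals}
    return f, g

if __name__ == "__main__":
    f, g = precedence_functions()
    print("f:", f)
    print("g:", g)
```

The values agree with the table: for example, the longest path from gid is gid → f* → g* → f+ → g+ → f$, which has length 5.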
Example2 − Construct precedence graph & precedence function for the following table.

Solution

As we have ( =. ), the symbols f( and g) will be in the same group.

Computation of precedence graph


3. What is Design of Lexical Analysis in Compiler Design?
Lexical Analysis can be designed using Transition Diagrams.

Finite Automata (Transition Diagram) − A directed graph or flowchart used to recognize a token.
The transition diagram has two parts −

• States − represented by circles.

• Edges − arrows that connect the states.


Example − Draw Transition Diagram for "if" keyword.

To recognize the token "if", the lexical analyzer also has to read the next character after "f". Depending on that character, it decides whether the input is the "if" keyword or something else.

So, a blank space after "if" determines that "if" is a keyword.

"*" on final state 3 means retract, i.e., the input pointer moves back one character to where state 2 left off. Therefore the blank space is not a part of the token "if".

Transition Diagram for an Identifier − An identifier starts with a letter followed by letters or digits. The transition diagram will be:

For example, in the statement int a2; the transition diagram for identifier a2 will be:
As ";" is not part of the identifier "a2", "*" is used for retract, i.e., going back to where state 1 left off so that "a2" is recognized as the identifier.

The Transition Diagram for identifier can be converted to Program Code as −

Coding
State 0: C = Getchar()
         if Letter(C) then goto State 1
         else Fail

State 1: C = Getchar()
         if Letter(C) or Digit(C) then goto State 1
         else if Delimiter(C) then goto State 2
         else Fail

State 2: Retract()
         return (6, Install())
In state 2, Retract() moves the input pointer back one character, i.e., to where state 1 left off, and declares that whatever has been read up to state 1 is a token.

The lexical analyzer returns the token to the parser not in the form of an English word but in the form of a pair, i.e., (integer code, value).

In the case of an identifier, the integer code returned to the parser is 6, as shown in the table.

Install() − It returns a pointer to the symbol table, i.e., the address of the token's entry.
The following table shows the integer code and value of various tokens returned by the lexical analyzer to the parser.

Suppose the identifier is stored at location 236 in the symbol table; then

Integer code = 6

Install() = 236, i.e., the pair will be (6, 236)

Similarly, if a constant is stored at location 238, then

Integer code = 7

Install() = 238, i.e., the pair will be (7, 238)
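The three states of the identifier diagram and the (integer code, value) pair can be sketched in Python as follows. The symbol table is modelled as a plain list, so Install() returns a list index rather than an address such as 236, and the set of delimiter characters is an assumption of this sketch.

```python
IDENTIFIER_CODE = 6          # integer code for identifiers, as in the text
symbol_table = []            # hypothetical symbol table: a list of lexemes

def install(lexeme):
    """Return the symbol-table position of lexeme, adding it if needed."""
    if lexeme not in symbol_table:
        symbol_table.append(lexeme)
    return symbol_table.index(lexeme)

def is_delimiter(ch):
    # Assumption: end of input, whitespace and common punctuation end an identifier.
    return ch == "" or ch.isspace() or ch in ";,()=+-*/{}"

def scan_identifier(text, pos=0):
    """Run states 0-2 from text[pos]; return ((code, value), next position)."""
    def getchar():
        nonlocal pos
        ch = text[pos] if pos < len(text) else ""
        pos += 1
        return ch

    start = pos
    # State 0: the first character must be a letter.
    if not getchar().isalpha():
        raise ValueError("fail: an identifier must start with a letter")
    # State 1: loop on letters and digits until a delimiter is read.
    while True:
        c = getchar()
        if c.isalnum():
            continue
        if is_delimiter(c):
            break
        raise ValueError(f"fail: unexpected character {c!r}")
    # State 2: retract one character; the delimiter is not part of the token.
    pos -= 1
    return (IDENTIFIER_CODE, install(text[start:pos])), pos

if __name__ == "__main__":
    print(scan_identifier("int a2;", 4))
```

The returned position points at the delimiter, so the next call to the scanner starts with the character that was retracted.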

Transition Diagram (Finite Automata) for Tokens −


4. What are the Rules of Regular Expressions in Compiler Design?
The language accepted by a finite automaton can be described by simple expressions known as regular expressions, which are an effective way to describe any regular language. A regular expression is a pattern that describes a set of strings, and string-searching algorithms use such patterns to find matching character sequences in a text.

There are various rules for regular expressions which are as follows −

• ε is a regular expression.
• Each symbol a of the alphabet Σ is a regular expression.
• Union of two regular expressions R1 and R2, i.e., R1 + R2 or R1|R2, is also a regular expression.
• Concatenation of two regular expressions R1 and R2, i.e., R1R2, is also a regular expression.
• Closure of a regular expression R, i.e., R*, is also a regular expression.
• If R is a regular expression, then (R) is also a regular expression.
Algebraic Laws
R1|R2 = R2|R1 or R1 + R2 = R2 + R1 (Commutative)
R1|(R2|R3) = (R1|R2)|R3 or R1 + (R2 + R3) = (R1 + R2) + R3 (Associative)
R1(R2R3) = (R1R2)R3 (Associative)
R1(R2|R3) = R1R2|R1R3 or R1(R2 + R3) = R1R2 + R1R3 (Distributive)
εR = Rε = R (ε is the identity for concatenation)
Example1 − Write Regular Expressions for the following languages over Σ = {a, b}
• Strings of length zero or one.
Answer: ε | a | b or (ε+a+b)

• Strings of length two.
Answer: aa | ab | ba | bb or (aa+ab+ba+bb)

• Strings of even length.
Answer: (aa | ab | ba | bb)* or (aa+ab+ba+bb)*

• Set of all strings of a's and b's having at least two occurrences of aa.
Answer: (a+b)*aa(a+b)*aa(a+b)* (this counts non-overlapping occurrences; if overlapping occurrences such as the two in aaa also count, add + (a+b)*aaa(a+b)*)

Example2 − Find Regular Expressions for the following languages.

• L = {ε, 1, 11, 111, …}
Answer: 1* (since 1^0 = ε, 1^1 = 1, 1^2 = 11, 1^3 = 111, …)

• L = {ε, 11, 1111, 111111, …}
Answer: (11)*

• L = Set of all strings of 0's and 1's = {ε, 0, 1, 01, 11, 00, 000, 101, …}
Answer: (0+1)* or (0|1)*

• L = Set of all strings of 0's and 1's ending with 11.
Answer: (0+1)*11

• L = Set of all strings of 0's and 1's beginning with 0 and ending with 1.
Answer: 0(0+1)*1

Example3 − Write a Regular Expression over Σ = {0, 1} in which the second letter from the right end of the string is 1.
Answer: (0+1)*1(0+1)

Example4 − Write Regular Expressions for the following languages over Σ = {a, b}
• L = Set of strings having at least one occurrence of a double letter.
Answer: (a+b)*(aa+bb)(a+b)*

• L = Set of strings having a double letter at the beginning and at the end of the string.
Answer: (aa+bb)(a+b)*(aa+bb)

• L = Set of strings having a double letter at the beginning or at the end of the string.
Answer: (aa+bb)(a+b)* + (a+b)*(aa+bb) + (aa+bb)(a+b)*(aa+bb)
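Answers like these can be checked mechanically for short strings. The sketch below translates the textbook notation (+ for union, ε for the empty string) into Python's re syntax and compares each expression with a plain-Python description of the language; the helper names and the length bound are assumptions, and a bounded check of this kind can find mistakes but cannot prove an answer correct.

```python
import re
from itertools import product

def to_python(regex):
    """Translate textbook notation to Python re syntax (+ -> |, ε -> empty)."""
    return regex.replace("+", "|").replace("ε", "").replace(" ", "")

def counterexample(regex, alphabet, predicate, max_len=8):
    """Return the first string on which regex and predicate disagree, else None."""
    pattern = re.compile(to_python(regex))
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if (pattern.fullmatch(s) is not None) != predicate(s):
                return s
    return None

CHECKS = [
    ("(ε+a+b)", "ab", lambda s: len(s) <= 1),
    ("(aa+ab+ba+bb)*", "ab", lambda s: len(s) % 2 == 0),
    ("(0+1)*11", "01", lambda s: s.endswith("11")),
    ("0(0+1)*1", "01", lambda s: s.startswith("0") and s.endswith("1")),
    ("(0+1)*1(0+1)", "01", lambda s: len(s) >= 2 and s[-2] == "1"),
    ("(a+b)*(aa+bb)(a+b)*", "ab", lambda s: "aa" in s or "bb" in s),
]

if __name__ == "__main__":
    for regex, alphabet, predicate in CHECKS:
        print(regex, "->", counterexample(regex, alphabet, predicate))
```

A result of None means that no disagreement was found up to the chosen length; any other result is a string that the expression handles incorrectly.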

5. What is assignment statements with Integer types in compiler design?


An assignment statement assigns the value of an expression to a variable. Here the expressions involve only integer variables.

Abstract Translation Scheme

Consider the grammar, which consists of an assignment statement.

S → id = E

E→E+E

E→E∗E

E → −E

E → (E)

E → id

Here the translation of E can have two attributes −

• E.PLACE − the name that will hold the value of the expression.
• E.CODE − the sequence of three-address statements evaluating the expression E.

S.CODE represents the three-address code of the whole assignment statement. The CODE for the non-terminal on the left of a production is the concatenation of the CODE for each non-terminal on the right of the production, followed by the statement generated for the production itself.

Abstract Translation Scheme

Production           Semantic Action

S → id = E           {S.CODE = E.CODE || id.PLACE || '=' || E.PLACE}

E → E(1) + E(2)      {T = newtemp();
                      E.PLACE = T;
                      E.CODE = E(1).CODE || E(2).CODE ||
                               E.PLACE || '=' || E(1).PLACE || '+' || E(2).PLACE}

E → E(1) ∗ E(2)      {T = newtemp();
                      E.PLACE = T;
                      E.CODE = E(1).CODE || E(2).CODE ||
                               E.PLACE || '=' || E(1).PLACE || '*' || E(2).PLACE}

E → −E(1)            {T = newtemp();
                      E.PLACE = T;
                      E.CODE = E(1).CODE || E.PLACE || '=−' || E(1).PLACE}

E → (E(1))           {E.PLACE = E(1).PLACE;
                      E.CODE = E(1).CODE}

E → id               {E.PLACE = id.PLACE;
                      E.CODE = null}
In the first production S → id = E,

id.PLACE || '=' || E.PLACE is the string that is appended after E.CODE to form S.CODE.

In the second production E → E(1) + E(2),

E.PLACE || '=' || E(1).PLACE || '+' || E(2).PLACE is the string that is appended after E(1).CODE || E(2).CODE to form E.CODE.

The fifth production, E → (E(1)), does not append any string after E.CODE = E(1).CODE. This is because it has no operator on the right-hand side of the production.

Similarly, the sixth production E → id does not append any string: E.CODE = null. Its right-hand side consists only of id, which is a terminal symbol and not an expression to be evaluated, so there are no three-address statements to generate.

We can also use a procedure GEN(statement) in place of S.CODE and E.CODE, as the GEN procedure emits the three-address statement directly.

So, the GEN statements will replace the CODE definitions.

GEN Statements replacing CODE definitions

Production           Semantic Action

S → id = E           GEN(id.PLACE = E.PLACE)

E → E(1) + E(2)      GEN(E.PLACE = E(1).PLACE + E(2).PLACE)

E → E(1) ∗ E(2)      GEN(E.PLACE = E(1).PLACE ∗ E(2).PLACE)

E → −E(1)            GEN(E.PLACE = −E(1).PLACE)

E → (E(1))           None

E → id               None
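The GEN-based scheme can be sketched in Python by walking an expression tree. The tuple representation of the tree, the class name Translator and the temporary names T1, T2, … are assumptions made for this illustration; parsing the source text into a tree is not shown.

```python
from itertools import count

class Translator:
    """Emits three-address statements with gen() while walking an expression tree.

    A tree node is either an identifier (str), ("neg", E1), or (op, E1, E2)
    with op in "+" or "*". Parentheses only shape the tree: E -> (E1) emits nothing.
    """
    def __init__(self):
        self.code = []
        self._temps = count(1)

    def newtemp(self):
        return f"T{next(self._temps)}"

    def gen(self, statement):
        self.code.append(statement)

    def expr(self, node):
        """Return E.PLACE for node, emitting code for its sub-expressions first."""
        if isinstance(node, str):              # E -> id
            return node
        op, *args = node
        if op == "neg":                        # E -> -E(1)
            place1 = self.expr(args[0])
            t = self.newtemp()
            self.gen(f"{t} = -{place1}")
            return t
        place1 = self.expr(args[0])            # E -> E(1) op E(2)
        place2 = self.expr(args[1])
        t = self.newtemp()
        self.gen(f"{t} = {place1} {op} {place2}")
        return t

    def assign(self, target, node):            # S -> id = E
        self.gen(f"{target} = {self.expr(node)}")
        return self.code

if __name__ == "__main__":
    # x = -a * (b + c)
    for line in Translator().assign("x", ("*", ("neg", "a"), ("+", "b", "c"))):
        print(line)
```

Because the code for both operands is emitted before the statement for the operator, the output order matches the concatenation E(1).CODE || E(2).CODE followed by the new statement.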


7. Explain about regular expressions in TOC

A regular expression is basically a shorthand way of showing how a regular language is built from the base set of regular languages.

The symbols used in a regular expression mirror the operations used to construct the languages, so every expression has a language closely associated with it.

For each regular expression E, there is a regular language L(E).

Example 1
If the regular expression is as follows −

a + b · a*

It can be written in fully parenthesized form as follows −

(a + (b · (a*)))

Regular expressions vs. Languages


The symbols of the regular expressions are distinct from those of the languages. These symbols are given
below −

Operators in Regular expression −


There are two binary operations on regular expressions (+ and ·) and one unary operator (*)

These are closely associated with the union, product and closure operations on the corresponding
languages.

The regular expression a + bc* is basically shorthand for the regular language {a} ∪ ({b} · ({c}*)).

Example 2
Find the language of the regular expression a + bc*. It is explained below (∧ denotes the empty string) −

L(a + bc*) = L(a) ∪ L(bc*)
= L(a) ∪ (L(b) · L(c*))
= L(a) ∪ (L(b) · L(c)*)
= {a} ∪ ({b} · {c}*)
= {a} ∪ ({b} · {∧, c, c^2, …, c^n, …})
= {a} ∪ {b, bc, bc^2, …, bc^n, …}
= {a, b, bc, bc^2, …, bc^n, …}.
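The same derivation can be followed with finite sets in Python, provided the closure is cut off at a maximum string length (the real language is infinite). The function names and the bound max_len are assumptions of this sketch.

```python
def union(l1, l2):
    """Language union: L1 ∪ L2."""
    return l1 | l2

def product(l1, l2, max_len):
    """Language product L1 · L2, keeping only strings up to max_len."""
    return {x + y for x in l1 for y in l2 if len(x + y) <= max_len}

def closure(lang, max_len):
    """Kleene closure L*, truncated to strings of length <= max_len."""
    result = {""}                      # the empty string is always in L*
    frontier = {""}
    while frontier:
        frontier = product(frontier, lang, max_len) - result
        result |= frontier
    return result

def language_of_a_plus_bc_star(max_len):
    # L(a + bc*) = {a} ∪ ({b} · {c}*)
    return union({"a"}, product({"b"}, closure({"c"}, max_len), max_len))

if __name__ == "__main__":
    print(sorted(language_of_a_plus_bc_star(4)))
```

Each Python function corresponds to one of the language operations (union, product, closure) that the regular expression operators +, · and * stand for.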

8. What are the properties of Regular expressions in TOC?


A regular expression is basically a shorthand way of showing how a regular language is built from the base
set of regular languages.

The symbols used in a regular expression mirror the operations used to construct the languages, so every expression has a language closely associated with it.

For each regular expression E, there is a regular language L(E).

There are some general equalities for the regular expressions.

Properties
All the properties hold for any regular expressions R, E and F, and can be verified by using properties of languages and sets.

Additive (+) properties


The additive properties of regular expressions are as follows −

R+∅=∅+R=R
R+E=E+R

R+R=R
(R + E) + F = R + (E + F)
Product (·) properties
The product properties of regular expressions are as follows −

R∅ = ∅R = ∅
R∧ = ∧R = R
(RE)F = R(EF)
Distributive properties
The distributive properties of regular expressions are as follows −

R(E + F) = RE + RF
(R + E)F = RF + EF
Closure properties
The closure properties of regular expressions are as follows −

∅* = ∧ * = ∧

R* = ∧ + RR* = (∧ + R)R*
R* = R*R* = (R*)* = R + R*

RR* = R*R
R(ER)* = (RE)*R
(R + E)* = (R*E*)* = (R* + E*)* = R*(ER*)*
All the properties can be verified by using the properties of languages and sets.

Example 1
Show that

(∅ + a + b)* = a*(ba*)*

Using the properties above:

(∅ + a + b)* = (a + b)* (+ property: ∅ + R = R)
= a*(ba*)* (closure property: (R + E)* = R*(ER*)*)

Example 2
Show that

∧ + ab + abab(ab)* = (ab)*

Using the properties above:

∧ + ab + abab(ab)* = ∧ + ab(∧ + ab(ab)*) (distributive property)
= ∧ + ab((ab)*) (using R* = ∧ + RR*)
= (ab)* (using R* = ∧ + RR* again)
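Equalities like these can be sanity-checked by comparing both sides on every short string. The sketch below uses Python's re syntax, where | plays the role of + and an empty alternative plays the role of ∧; the function name and the length bound are assumptions, and the check is only evidence for an identity, not a proof.

```python
import re
from itertools import product

def first_difference(lhs, rhs, alphabet="ab", max_len=8):
    """Return the first string on which the two patterns disagree, else None."""
    p, q = re.compile(lhs), re.compile(rhs)
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if (p.fullmatch(s) is None) != (q.fullmatch(s) is None):
                return s
    return None

IDENTITIES = [
    ("(a|b)*", "a*(ba*)*"),            # (R + E)* = R*(ER*)* with R = a, E = b
    ("(|ab|abab(ab)*)", "(ab)*"),      # ∧ + ab + abab(ab)* = (ab)*
    ("a(ba)*", "(ab)*a"),              # R(ER)* = (RE)*R with R = a, E = b
    ("aa*", "a*a"),                    # RR* = R*R
]

if __name__ == "__main__":
    for lhs, rhs in IDENTITIES:
        print(lhs, "=", rhs, "->", first_difference(lhs, rhs))
    # A pair that is not an identity, for contrast:
    print(first_difference("(a|b)*", "a*b*"))
```

For the last pair the check reports ba, a string that (a|b)* accepts but a*b* does not.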

9. Explain about right linear regular grammars in TOC


Regular grammar describes a regular language. It consists of four components, which are as follows −

G = (N, Σ, P, S)
Where,

• N − a finite set of non-terminal symbols,
• Σ − a finite set of terminal symbols,
• P − a set of production rules, each of which has one of the forms
   S → aB
   S → a
   S → ε
• S ∈ N is the start symbol.

The above grammar can be of two forms −

• Right Linear Regular Grammar
• Left Linear Regular Grammar
Linear Grammar
A grammar is linear when the right side of every production contains at most one non-terminal; otherwise it is non-linear.

Let's discuss right linear grammar −

Right linear grammar

Right linear grammar means that the non-terminal symbol, if there is one, is at the right end of the right side of the production.

It is a formal grammar (N, Σ, P, S) such that all the production rules in P are of one of the following forms −

• L → a, {L is a non-terminal in N and a is a terminal in Σ}
• L → aM, {L and M are non-terminals in N and a is in Σ}
• L → ε, {ε is the empty string}

Example
Consider the language L = {b^n a b^m a | n >= 2, m >= 2}

The production rules or grammar for the given language L is −

S → bbB ⇒ the first 2 b's
B → bB | aC ⇒ any number of b's followed by a
C → bbD ⇒ 2 b's
D → bD | a ⇒ any number of b's followed by a
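A right linear grammar can be run like a finite automaton: each non-terminal acts as a state, and each production consumes some terminals from the left of the input. The following Python sketch encodes the four productions above; the dictionary layout and the function name derives are assumptions of the example.

```python
# Each alternative is (terminals to consume, next non-terminal or None when
# the production ends the derivation).
RIGHT_LINEAR = {
    "S": [("bb", "B")],
    "B": [("b", "B"), ("a", "C")],
    "C": [("bb", "D")],
    "D": [("b", "D"), ("a", None)],
}

def derives(text, symbol="S"):
    """Return True if symbol derives exactly text using RIGHT_LINEAR."""
    for terminals, nxt in RIGHT_LINEAR[symbol]:
        if not text.startswith(terminals):
            continue
        rest = text[len(terminals):]
        if nxt is None:
            if rest == "":
                return True
        elif derives(rest, nxt):
            return True
    return False

if __name__ == "__main__":
    for s in ["bbabba", "bbbabbba", "babba", "bbaba", "bbabb"]:
        print(s, derives(s))
```

Every production consumes at least one terminal, so the recursion always terminates.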

10. Explain about left linear regular grammar in TOC


Regular grammar describes a regular language. It consists of four components, which are as follows −

G = (N, Σ, P, S)
Where,

• N − a finite set of non-terminal symbols,
• Σ − a finite set of terminal symbols,
• P − a set of production rules, each of which has one of the forms
   S → aB
   S → a
   S → ε
• S ∈ N is the start symbol.

The above grammar can be of two forms −

• Right Linear Regular Grammar
• Left Linear Regular Grammar
Linear Grammar
A grammar is linear when the right side of every production contains at most one non-terminal; otherwise it is non-linear.

Left linear grammar


In a left-regular grammar (also called left-linear grammar), the rules are of the form as given below −

• L → a, {L is a non-terminal in N and a is a terminal in Σ}
• L → Ma, {L and M are in N and a is in Σ}
• L → ε, {ε is the empty string}

Left linear grammar means that the non-terminal symbol, if there is one, is at the left end of the right side of the production.

Example
Consider the language {b^n a b^m a | n >= 2, m >= 2}
The left linear grammar for the given language is −

S → Bbba ⇒ the last 3 symbols bba
B → Bb | Dbba ⇒ Bb gives b^(m-2), and Dbba gives b^n followed by a
D → Db | ε ⇒ gives b^(n-2)
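One way to gain confidence in a left linear grammar is to enumerate every string it derives up to a length bound. In the Python sketch below a sentential form is stored as a pair (leading non-terminal, terminal suffix); this representation and the names LEFT_LINEAR and generate are assumptions of the example.

```python
# Each alternative is (leading non-terminal or None, terminals that follow it).
LEFT_LINEAR = {
    "S": [("B", "bba")],
    "B": [("B", "b"), ("D", "bba")],
    "D": [("D", "b"), (None, "")],     # (None, "") encodes D -> ε
}

def generate(max_len, start="S"):
    """Return every terminal string of length <= max_len derivable from start."""
    results = set()
    pending = [(start, "")]            # sentential forms still to expand
    seen = set(pending)
    while pending:
        symbol, suffix = pending.pop()
        for nxt, terminals in LEFT_LINEAR[symbol]:
            new_suffix = terminals + suffix    # terminals grow to the left
            if len(new_suffix) > max_len:
                continue
            if nxt is None:
                results.add(new_suffix)
            elif (nxt, new_suffix) not in seen:
                seen.add((nxt, new_suffix))
                pending.append((nxt, new_suffix))
    return results

if __name__ == "__main__":
    print(sorted(generate(8), key=lambda s: (len(s), s)))
```

Since the non-terminal always sits at the left end, a derivation builds the string from right to left: S first fixes the final bba, and D supplies the leading b's last.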
