CMP 335 Regular Expression Exercises Note
Regular expressions (Regex) are a powerful tool used in various fields of computer science, including
compiler design. They provide a concise and flexible means of describing patterns in text, which is essential
for tasks like lexical analysis in compiler design. This note examines the significance of regular
expressions in compiler design, explaining their role, usage, and impact on the overall process of compiling
a program.
What are Regular Expressions?
Regular expressions are sequences of characters that define search patterns, primarily used for string
matching within texts. These patterns can be simple, such as matching a single character, or complex,
involving combinations of various characters, special symbols, and operators.
Basic Syntax of Regular Expressions
Literals: The simplest form of a regular expression is a literal, which matches the exact character. For
example, the regex a will match the character 'a' in a text.
Concatenation: Two regular expressions can be concatenated, meaning they must appear in sequence.
For example, ab will match the sequence 'ab'.
Alternation: The alternation operator | allows for matching one of several patterns. For example, a|b
matches either 'a' or 'b'.
Repetition Operators:
* (Kleene Star) matches zero or more occurrences of the preceding element.
+ matches one or more occurrences.
? matches zero or one occurrence.
Examples
The regex a*b matches any number of 'a's followed by a 'b' (e.g., b, ab, aaab).
The regex (a|b)c matches either 'ac' or 'bc'.
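The operators above can be tried out directly. The following is a minimal sketch using Python's standard `re` module; the helper name `check` and the test strings are illustrative assumptions, not part of the notes. `re.fullmatch` is used so that a pattern must match the whole string rather than a substring.

```python
import re

# Each entry: (pattern, strings that should match in full, strings that should not).
cases = [
    (r"a*b", ["b", "ab", "aaab"], ["a", "ba"]),    # Kleene star
    (r"(a|b)c", ["ac", "bc"], ["c", "abc"]),       # alternation, then concatenation
    (r"a+", ["a", "aa"], [""]),                    # one or more
    (r"a?", ["", "a"], ["aa"]),                    # zero or one
]

def check(pattern, accepted, rejected):
    """Return True if the pattern matches every accepted string and no rejected one."""
    matches_all = all(re.fullmatch(pattern, s) for s in accepted)
    matches_none = not any(re.fullmatch(pattern, s) for s in rejected)
    return matches_all and matches_none

results = {pattern: check(pattern, yes, no) for pattern, yes, no in cases}
for pattern, ok in results.items():
    print(pattern, "behaves as described" if ok else "differs from the description")
```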
Role of Regular Expressions in Compiler Design
In compiler design, regular expressions play a crucial role in the lexical analysis phase, which is the first
phase of a compiler. The primary task of lexical analysis is to read the source code and convert it into
tokens, which are the smallest units of meaning (like keywords, operators, and identifiers).
Lexical Analysis
The lexical analyzer, also known as the scanner, uses regular expressions to identify patterns in the source
code and classify them into tokens. These tokens are then used by the parser in the subsequent phase of the
compiler.
For example, consider the following piece of code:
int main() {
int a = 5;
}
The lexical analyzer would break this code into tokens like int, main, (, ), {, }, a, =, 5, ;. To identify each of
these tokens, the lexer relies on regular expressions:
Keywords like int can be matched directly using literals or predefined patterns. Note that main is not a
keyword; it is an ordinary identifier.
Identifiers (like main and variable names such as a) can be matched using a regex that allows a letter
followed by any sequence of letters and digits.
Operators like = can be matched using specific literals.
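A minimal sketch of such a lexer, written with Python's `re` module, is shown here. The token class names, the small keyword list, and the function name `tokenize` are assumptions made for illustration; a real lexer for C would need many more token classes.

```python
import re

# Token classes in priority order: keywords are tried before identifiers.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:int|return|if|else)\b"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NUMBER",     r"[0-9]+"),
    ("OPERATOR",   r"="),
    ("PUNCT",      r"[(){};]"),
    ("SKIP",       r"\s+"),       # whitespace is recognized but not returned
    ("ERROR",      r"."),         # any other single character
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Split source text into (token class, lexeme) pairs."""
    tokens = []
    for match in MASTER.finditer(source):
        kind = match.lastgroup
        if kind == "SKIP":
            continue
        if kind == "ERROR":
            raise ValueError(f"unexpected character {match.group()!r}")
        tokens.append((kind, match.group()))
    return tokens

for token in tokenize("int main() {\n int a = 5;\n}"):
    print(token)
```

Listing KEYWORD before IDENTIFIER matters: both patterns match the text int, and the alternation picks the first alternative that succeeds.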
Conversion to Finite Automata
Regular expressions are not only used for pattern matching but can also be converted into finite automata,
which are used to recognize tokens. This conversion is an essential step in the lexical analysis process.
Deterministic Finite Automata (DFA): A DFA has exactly one next state for each state and input symbol,
so it recognizes tokens in a single pass over the input string without backtracking. It’s efficient and
suitable for real-time processing.
Nondeterministic Finite Automata (NFA): An NFA is easier to build directly from a regular expression but
requires more complex processing, as it can be in multiple states simultaneously. NFAs and DFAs
recognize exactly the same class of languages.
The process usually involves converting the regular expression into an NFA, which is then transformed into
a DFA (typically by the subset construction) and minimized. This DFA can then be used by the lexer to scan
the source code efficiently.
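The NFA-to-DFA step can be sketched as follows. This is a minimal version of the subset construction under simplifying assumptions: the NFA has no ε-transitions, and it is hand-written for the regular expression (a|b)*abb rather than generated from it. The names `NFA`, `subset_construction` and `dfa_accepts` are illustrative.

```python
from collections import deque

# Hand-written NFA for (a|b)*abb; state 3 is accepting.
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
NFA_START, NFA_ACCEPT, ALPHABET = 0, {3}, "ab"

def subset_construction(nfa, start, accept, alphabet):
    """Build a DFA whose states are sets of NFA states."""
    start_set = frozenset([start])
    dfa = {}
    seen = {start_set}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            # The DFA target is the union of all NFA targets reachable on this symbol.
            target = frozenset(t for s in current for t in nfa.get((s, symbol), ()))
            dfa[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    accepting = {state for state in seen if state & accept}
    return dfa, start_set, accepting

def dfa_accepts(dfa, start, accepting, text):
    """Run the DFA over text; text must use only symbols from the alphabet."""
    state = start
    for ch in text:
        state = dfa[(state, ch)]
    return state in accepting

DFA, DFA_START, DFA_ACCEPTING = subset_construction(NFA, NFA_START, NFA_ACCEPT, ALPHABET)
print(dfa_accepts(DFA, DFA_START, DFA_ACCEPTING, "aabb"))
```

The DFA follows exactly one transition per input character, which is what makes scanning a single pass.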
Advantages of Using Regular Expressions
Simplicity: Regular expressions offer a simple and concise way to represent patterns, making the
lexer easier to implement.
Efficiency: Once compiled into finite automata, regular expressions can be used to quickly and
efficiently recognize tokens in the source code.
Flexibility: Regular expressions are highly flexible, allowing the lexer to handle a wide variety of
token patterns, from simple keywords to complex operators.
Limitations
While regular expressions are powerful, they have limitations:
Context-Free Grammars: Regular expressions cannot describe nested or recursive structures such as
balanced parentheses, which require context-free grammars. Since the syntactic structure of
programming languages needs such grammars, regular expressions are limited to the lexical analysis phase.
Complexity: For very complex patterns, regular expressions can become hard to read and maintain,
especially as the language’s syntax grows in complexity.
Conclusion
Regular expressions are a fundamental tool in compiler design, particularly in the lexical analysis phase.
They enable the efficient identification and classification of tokens in source code, facilitating the
compilation process. By converting regular expressions into finite automata, compilers can quickly process
and recognize patterns, so that the source code is reliably split into tokens before it is parsed and
transformed into executable code. Despite their limitations, regular expressions remain an essential
component in the toolkit of compiler designers.
QUESTIONS AND SOLUTIONS
Output − There is no output as such, but a skeletal parse tree can be constructed as we parse, with one
non-terminal labeling all interior nodes and single productions not shown. Alternatively, the sequence of
shift-reduce steps can be considered the output.
Method − Let the input string be a1 a2 … an $. Initially, the stack contains $.
Repeat forever
    if only $ is on the stack and only $ is on the input then
        accept and break
    else begin
        let a be the topmost terminal symbol on the stack and let b be the current input symbol.
        if a <. b or a =. b then
            shift b onto the stack                  /* shift */
        else if a .> b then                         /* reduce */
            repeat
                pop the stack
            until the top stack terminal is related by <. to the terminal most recently popped
        else
            call the error-correcting routine
    end
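The method above can be sketched in Python. The precedence relations in `PREC` are an assumption: they are the usual textbook relations for the terminals id, +, * and $ only, not a table taken from these notes. The stack holds terminals only, so this sketch reports the shift and reduce steps but does not check that every operator has its operands.

```python
# Assumed precedence relations for id, +, * and $ ("<" is <., ">" is .>).
PREC = {
    ("id", "+"): ">", ("id", "*"): ">", ("id", "$"): ">",
    ("+", "id"): "<", ("+", "+"): ">", ("+", "*"): "<", ("+", "$"): ">",
    ("*", "id"): "<", ("*", "+"): ">", ("*", "*"): ">", ("*", "$"): ">",
    ("$", "id"): "<", ("$", "+"): "<", ("$", "*"): "<",
}

def precedence_parse(tokens):
    """Return the list of shift/reduce steps, or raise ValueError on an error entry."""
    stack = ["$"]
    tokens = list(tokens) + ["$"]
    pos = 0
    steps = []
    while True:
        a, b = stack[-1], tokens[pos]
        if a == "$" and b == "$":
            steps.append("accept")
            return steps
        relation = PREC.get((a, b))
        if relation in ("<", "="):
            stack.append(b)                      # shift
            pos += 1
            steps.append(f"shift {b}")
        elif relation == ">":
            while True:                          # reduce: pop the handle
                popped = stack.pop()
                if PREC.get((stack[-1], popped)) == "<":
                    break
            steps.append("reduce")
        else:
            raise ValueError(f"no precedence relation between {a!r} and {b!r}")

print(precedence_parse(["id", "+", "id", "*", "id"]))
```

In this sketch the error-correcting routine is replaced by raising an exception.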
Example1 − Construct the Precedence Relation table for the Grammar.
Using Assumptions
+ − * / ↑ id ( ) $
E → E + T|T
T → T ∗ F|F
F → (E)|id
Solution
+ * ( ) id $
Example1 − Construct precedence graph & precedence function for the following table.
Solution
Step2 − No symbol has equal precedence, as can be seen in the given table; therefore, each symbol will
remain in a different node.
Since $ <. +, *, id, make an edge from g+, g*, gid to f$.
Since +, *, id .> $, make an edge from f+, f*, fid to g$.
Similarly, +, *, id .> +, so make an edge from f+, f*, fid to g+.
      id    +    *    $
f     4     2    4    0
g     5     1    3    0
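The f and g values can be reproduced by building the precedence graph and taking, for each node, the length of the longest path that starts there. The sketch below assumes the usual relation table for id, +, * and $ (the table itself is not reproduced in these notes, so `PREC` is an assumption), that no pair is related by =., and that the graph has no cycle.

```python
from functools import lru_cache

# Assumed precedence relations for id, +, * and $ ("<" is <., ">" is .>).
PREC = {
    ("id", "+"): ">", ("id", "*"): ">", ("id", "$"): ">",
    ("+", "id"): "<", ("+", "+"): ">", ("+", "*"): "<", ("+", "$"): ">",
    ("*", "id"): "<", ("*", "+"): ">", ("*", "*"): ">", ("*", "$"): ">",
    ("$", "id"): "<", ("$", "+"): "<", ("$", "*"): "<",
}

def precedence_functions(prec):
    """Compute f and g as longest path lengths in the precedence graph."""
    edges = {}
    for (a, b), relation in prec.items():
        if relation == ">":                       # a .> b: edge f_a -> g_b
            edges.setdefault(("f", a), set()).add(("g", b))
        elif relation == "<":                     # a <. b: edge g_b -> f_a
            edges.setdefault(("g", b), set()).add(("f", a))

    @lru_cache(maxsize=None)
    def longest(node):
        # Longest path starting at node; 0 for a node with no outgoing edge.
        return max((1 + longest(nxt) for nxt in edges.get(node, ())), default=0)

    terminals = sorted({t for pair in prec for t in pair})
    f = {t: longest(("f", t)) for t in terminals}
    g = {t: longest(("g", t)) for t in terminals}
    return f, g

f, g = precedence_functions(PREC)
print("f:", f)
print("g:", g)
```

With these relations the computed values agree with the f and g rows shown above.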
Example2 − Construct precedence graph & precedence function for the following table.
Solution
Finite Automata (Transition Diagram) − A directed graph or flowchart used to recognize tokens.
The transition diagram has two parts: states, drawn as circles, and edges, drawn as arrows that connect
the states.
To recognize the token "if", the lexical analyzer also has to read the character that follows "f". Depending
on that character, it decides whether it has seen the keyword "if" or something else.
"*" on final state 3 means retract, i.e., control goes back to the previous state 2. Therefore the blank space
is not a part of the token "if".
Transition Diagram for an Identifier − An identifier starts with a letter followed by letters or digits.
Transition Diagram will be:
For example, in the statement int a2; the transition diagram for identifier a2 will be:
As ";" is not part of the identifier "a2", "*" is used for retract, i.e., coming back to state 1 to recognize
the identifier "a2".
The transition diagram for an identifier can be converted to program code as −
Coding
State 0: C = Getchar()
         if Letter(C) then goto State 1 else Fail
State 1: C = Getchar()
         if Letter(C) or Digit(C) then goto State 1
         else if Delimiter(C) then goto State 2
         else Fail
State 2: Retract()
         return (6, Install())
In state 2, Retract() moves the input pointer back one character, i.e., back to where it was in state 1, and
declares that whatever has been read up to state 1 is a token.
The lexical analyzer returns the token to the parser not as an English word but as a pair, i.e.,
(integer code, value).
In the case of an identifier, the integer code returned to the parser is 6, as shown in the table.
Install() − It returns a pointer to the symbol table, i.e., the address of the token.
The following table shows the integer code and value of various tokens returned by lexical analysis to the
parser.
Suppose the identifier is stored at location 236 in the symbol table; then
Integer code = 6
Value = 236
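The state-by-state code for identifiers can be sketched in Python as follows. The class name `Scanner` and its method names are illustrative assumptions. Two simplifications are made: any character that is not a letter or digit (including end of input) is treated as a delimiter, and Install() returns the identifier's index in a list standing in for the symbol table. The integer code 6 is the one returned in state 2 of the code above.

```python
IDENTIFIER_CODE = 6   # integer code returned for identifiers in state 2

class Scanner:
    def __init__(self, text):
        self.text = text
        self.pos = 0              # input pointer
        self.symbol_table = []    # stand-in for the real symbol table

    def getchar(self):
        """Return the next character ('' at end of input) and advance the pointer."""
        ch = self.text[self.pos] if self.pos < len(self.text) else ""
        self.pos += 1
        return ch

    def retract(self):
        """Move the input pointer back by one character."""
        self.pos -= 1

    def install(self, lexeme):
        """Enter the lexeme in the symbol table if needed and return its location."""
        if lexeme not in self.symbol_table:
            self.symbol_table.append(lexeme)
        return self.symbol_table.index(lexeme)

    def identifier(self):
        start = self.pos
        c = self.getchar()                    # state 0
        if not c.isalpha():
            self.pos = start                  # fail: restore the pointer
            return None
        while True:                           # state 1
            c = self.getchar()
            if c.isalnum():
                continue
            self.retract()                    # state 2: delimiter is not part of the token
            return (IDENTIFIER_CODE, self.install(self.text[start:self.pos]))

scanner = Scanner("a2;")
print(scanner.identifier(), scanner.symbol_table)
```

After the call, the input pointer is left on ";" so that the next token starts there.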
There are various rules for regular expressions which are as follows −
ε is a Regular expression.
Union of two Regular Expressions R1 and R2.
i.e., R1 + R2 or R1|R2 is also a regular expression.
Concatenation of two Regular Expressions R1 and R2.
i.e., R1 R2 is also a Regular Expression.
Closure of Regular Expression R, i.e., R* is also a Regular Expression.
If R is a Regular Expression, then (R) is also a Regular Expression.
Algebraic Laws
R1|R2 = R2|R1 or R1+R2 = R2+R1 (Commutative)
R1|(R2|R3) = (R1|R2)|R3 (Associative)
Or
R1+(R2+R3) = (R1+R2)+R3
R1(R2R3) = (R1R2)R3 (Associative)
R1(R2|R3) = R1R2|R1R3 (Distributive)
Or
R1(R2+R3) = R1R2+R1R3
εR = Rε = R (Identity for concatenation)
Example1 − Write Regular Expressions for the following language over Σ = {a, b}
String of length zero or one.
Answer: ε | a | b or (ε+a+b)
Set of all strings of a’s and b’s having at least two occurrences of aa.
Answer − (a+b)*aa(a+b)*aa(a+b)* (the two occurrences of aa are counted without overlap)
Answer: (11)*
L = {ε, 11, 1111, 111111, …}
Answer: (0+1)*11
L = Set of all strings of 0’s and 1’s ending with 11.
Answer: 0(0+1)*1
L = Set of all strings of 0’s and 1’s beginning with 0 and ending with 1.
Example3 − Write Regular Expression in which the second letter from the right end of the string is 1
Answer: (0+1)*1(0+1), taking Σ = {0, 1}
Example4 − Write Regular Expressions for the following language over Σ = {a, b}
L = Set of strings having at least one occurrence of a double letter
Answer: (a+b)*(aa+bb)(a+b)*
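Answers like these can be checked mechanically for short strings. The sketch below translates the notes' notation (+ for union) into Python's `re` syntax and compares each expression with a plain-language description of its language on every string up to a length bound. The helper names `to_python`, `strings` and `agrees`, and the bound of 8, are assumptions made for illustration; agreement up to a bound is evidence, not a proof.

```python
import re
from itertools import product

def to_python(textbook_regex):
    """Translate + (union in the notes) into | (union in Python's re module)."""
    return textbook_regex.replace("+", "|").replace(" ", "")

def strings(alphabet, max_len):
    """Yield every string over the alphabet of length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

def agrees(textbook_regex, alphabet, predicate, max_len=8):
    """True if the regex and the predicate accept the same strings up to max_len."""
    pattern = re.compile(to_python(textbook_regex))
    return all(bool(pattern.fullmatch(s)) == predicate(s)
               for s in strings(alphabet, max_len))

checks = {
    "(11)*": agrees("(11)*", "01", lambda s: set(s) <= {"1"} and len(s) % 2 == 0),
    "(0+1)*11": agrees("(0+1)*11", "01", lambda s: s.endswith("11")),
    "0(0+1)*1": agrees("0(0+1)*1", "01",
                       lambda s: s.startswith("0") and s.endswith("1")),
    "(a+b)*(aa+bb)(a+b)*": agrees("(a+b)*(aa+bb)(a+b)*", "ab",
                                  lambda s: "aa" in s or "bb" in s),
}
for regex, ok in checks.items():
    print(regex, "agrees" if ok else "disagrees")
```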
S → id = E
E→E+E
E→E∗E
E → −E
E → (E)
E → id
E.PLACE − It tells about the name that will hold the value of the expression.
E.CODE − It represents a sequence of three address statements evaluating the expression E. The
grammar above represents an assignment statement, and CODE holds the three address code of the
statement. CODE for the non-terminal on the left is the concatenation of CODE for each non-terminal
on the right of the production.
Production           Semantic Action
S → id = E           {S.CODE = E.CODE || id.PLACE || '=' || E.PLACE}
E → E(1) + E(2)      {T = newtemp();
                      E.PLACE = T;
                      E.CODE = E(1).CODE || E(2).CODE ||
                               E.PLACE || '=' || E(1).PLACE || '+' || E(2).PLACE}
E → E(1) ∗ E(2)      {T = newtemp();
                      E.PLACE = T;
                      E.CODE = E(1).CODE || E(2).CODE ||
                               E.PLACE || '=' || E(1).PLACE || '*' || E(2).PLACE}
E → −E(1)            {T = newtemp();
                      E.PLACE = T;
                      E.CODE = E(1).CODE ||
                               E.PLACE || '=−' || E(1).PLACE}
E → (E(1))           {E.PLACE = E(1).PLACE;
                      E.CODE = E(1).CODE}
E → id               {E.PLACE = id.PLACE;
                      E.CODE = null;}
In the first production, S → id = E, the string id.PLACE || '=' || E.PLACE follows S.CODE = E.CODE.
In the second production, E → E(1) + E(2), the string E.PLACE || '=' || E(1).PLACE || '+' || E(2).PLACE
is appended to E.CODE = E(1).CODE || E(2).CODE.
The fifth production, E → (E(1)), does not have any string following E.CODE = E(1).CODE. This is
because there is no operator on the R.H.S. of the production.
Similarly, the sixth production does not have any string appended after E.CODE = null. Its CODE is null
because no expression appears on the R.H.S. of the production: it consists only of id, which is a terminal
symbol, not an expression. Since CODE represents a sequence of three address statements evaluating an
expression, there is nothing to generate.
We can also use a procedure GEN(Statement) in place of S.CODE and E.CODE, as the GEN procedure
automatically generates three address statements. With GEN, the last two productions generate nothing:
E → (E(1))           None
E → id               None
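The semantic actions can be sketched in Python. Because the grammar is ambiguous, the sketch does not parse text; it assumes the expression has already been parsed into a small tree of tuples, a representation invented here for illustration. Each call returns the pair (PLACE, CODE) for a node, and temporaries are named T1, T2, and so on in the order they are created.

```python
from itertools import count

def translate(assignment):
    """Translate ("=", target, expr) into a list of three address statements."""
    temps = count(1)

    def newtemp():
        return f"T{next(temps)}"

    def expr(node):
        """Return (PLACE, CODE) for an expression node."""
        kind = node[0]
        if kind == "id":                          # E -> id: CODE is empty
            return node[1], []
        if kind == "paren":                       # E -> (E1): nothing is appended
            return expr(node[1])
        if kind == "neg":                         # E -> -E1
            place1, code1 = expr(node[1])
            t = newtemp()
            return t, code1 + [f"{t} = -{place1}"]
        if kind in ("+", "*"):                    # E -> E1 + E2 and E -> E1 * E2
            place1, code1 = expr(node[1])
            place2, code2 = expr(node[2])
            t = newtemp()
            return t, code1 + code2 + [f"{t} = {place1} {kind} {place2}"]
        raise ValueError(f"unknown node kind {kind!r}")

    _, target, rhs = assignment                   # S -> id = E
    place, code = expr(rhs)
    return code + [f"{target} = {place}"]

# x = a + b * -c
tree = ("=", "x", ("+", ("id", "a"), ("*", ("id", "b"), ("neg", ("id", "c")))))
for statement in translate(tree):
    print(statement)
```

For this tree the statements come out as T1 = -c, T2 = b * T1, T3 = a + T2 and x = T3: the CODE of the operands always precedes the statement for the operator, as in the concatenation rule.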
7. Explain regular expressions.
A regular expression is basically a shorthand way of showing how a regular language is built from the
base set of regular languages.
The symbols used in a regular expression are the same as those used to construct the languages, and every
expression has a language closely associated with it.
Example 1
If the regular expression is as follows −
a + b · a*
it can be written in fully parenthesized form as −
(a + (b · (a*)))
The operators +, · and * correspond to the union, product and closure operations on the corresponding
languages.
The regular expression a + bc* is basically shorthand for the regular language {a} ∪ ({b} · ({c}*)).
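The correspondence between operators and language operations can be made concrete with sets of strings. The sketch below is illustrative only: the function names are assumptions, and because {c}* is infinite, every operation is cut off at a maximum string length.

```python
def union(lang1, lang2):
    """Union of two languages."""
    return lang1 | lang2

def product(lang1, lang2, max_len):
    """All concatenations xy with x in lang1 and y in lang2, up to max_len."""
    return {x + y for x in lang1 for y in lang2 if len(x + y) <= max_len}

def closure(lang, max_len):
    """Kleene closure of lang, restricted to strings of length <= max_len."""
    result = {""}
    frontier = {""}
    while frontier:
        # Extend the newest strings by one more factor from lang.
        frontier = product(frontier, lang, max_len) - result
        result |= frontier
    return result

MAX_LEN = 4
# {a} ∪ ({b} · ({c}*)), restricted to strings of length at most 4
language = union({"a"}, product({"b"}, closure({"c"}, MAX_LEN), MAX_LEN))
print(sorted(language))
```

Up to length 4, the language of a + bc* is {a, b, bc, bcc, bccc}.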
Example 2
Find the language of the given regular expression. It is explained below −
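Properties
The following properties hold for any regular expressions R, E and F, and can be verified by using the
properties of languages and sets. Here ∧ denotes the empty string (written ε earlier) and ∅ the empty
language.
Additive (+) properties
The additive properties of regular expressions are as follows −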
R+∅=∅+R=R
R+E=E+R
R+R=R
(R + E) + F = R + (E + F)
Product (·) properties
The product properties of regular expressions are as follows −
R∅ = ∅R = ∅
R∧ = ∧R = R
(RE)F = R(EF)
Distributive properties
The distributive properties of regular expressions are as follows −
R(E + F) = RE + RF
(R + E)F = RF + EF
Closure properties
The closure properties of regular expressions are as follows −
∅* = ∧ * = ∧
R* = ∧ + RR* = (∧ + R)R*
R* = R*R* = (R*)* = R + R*
RR* = R*R
R(ER)* = (RE)*R
(R + E)* = (R*E*)* = (R* + E*)* = R*(ER*)*
All the properties can be verified by using the properties of languages and sets.
Example 1
Show that
(∅ + a + b)* = a*(ba*)*
∧ + ab + abab(ab)* = (ab)*
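Both identities follow from the properties listed above. One possible derivation is sketched here, with Λ standing for the empty-string symbol ∧; the justification for each step is given on the right.

```latex
\begin{align*}
(\emptyset + a + b)^* &= (a + b)^*
    && \text{since } R + \emptyset = R \\
 &= a^*(ba^*)^*
    && \text{since } (R + E)^* = R^*(ER^*)^* \text{ with } R = a,\ E = b \\[1ex]
\Lambda + ab + abab(ab)^* &= \Lambda + ab\,(\Lambda + ab(ab)^*)
    && \text{distributivity and } R\Lambda = R \\
 &= \Lambda + ab(ab)^*
    && \text{since } R^* = \Lambda + RR^* \text{ with } R = ab \\
 &= (ab)^*
    && \text{by the same property}
\end{align*}
```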
G = (N, Σ, P, S)
Where,
S → ε,
S → a
It is a formal grammar (N, Σ, P, S) such that all the production rules in P are of one of the following forms −
L → ε
L → aM, where L and M are non-terminals in N and a is in Σ
Example
Consider a language L = {b^n a b^m a | n ≥ 2, m ≥ 2}
G = (N, Σ, P, S)
Where,
S → ε,
S → a
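To make the example concrete, the sketch below uses one possible right linear grammar for {b^n a b^m a | n ≥ 2, m ≥ 2}. This grammar is an assumption written for illustration, not the grammar from the notes; every production has the form L → aM or L → ε. The grammar is compared with the Python regex b{2,}ab{2,}a on all short strings.

```python
import re
from itertools import product

# Hypothetical right linear grammar: each right side is "" (ε) or terminal + non-terminal.
PRODUCTIONS = {
    "S": ["bA"],
    "A": ["bB"],
    "B": ["bB", "aC"],     # at least two b's, then the first a
    "C": ["bD"],
    "D": ["bF"],
    "F": ["bF", "aG"],     # at least two b's, then the final a
    "G": [""],
}

def derives(string, start="S"):
    """True if the grammar derives string, tracking all reachable non-terminals."""
    current = {start}
    for ch in string:
        current = {rhs[1] for nt in current
                   for rhs in PRODUCTIONS[nt] if rhs and rhs[0] == ch}
    # Accept if some reachable non-terminal can be erased with an ε production.
    return any("" in PRODUCTIONS[nt] for nt in current)

def grammar_matches_regex(max_len=9):
    """Compare the grammar with b{2,}ab{2,}a on every string up to max_len."""
    pattern = re.compile(r"b{2,}ab{2,}a")
    for n in range(max_len + 1):
        for chars in product("ab", repeat=n):
            s = "".join(chars)
            if derives(s) != bool(pattern.fullmatch(s)):
                return False
    return True

print(derives("bbabba"), derives("babba"), grammar_matches_regex())
```

Reading a right linear grammar this way is the same as running a finite automaton whose states are the non-terminals.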
Example
Consider a language {b^n a b^m a | n ≥ 2, m ≥ 2}
The left linear grammar that is generated based on given language is −