Unit 3 - Regular Expression
Definitions
In formal language theory, a regular expression (a.k.a. regex, regexp, or r.e.) is a string that represents a
regular (type-3) language.
•A regular expression (RegEx) is a special sequence of characters that defines a pattern for complex string-matching functionality.
•It uses a search pattern to find a string or set of strings. It can detect the presence or absence of text by
matching it against a particular pattern, and it can also split a pattern into one or more sub-patterns.
For example, ^a...s$ defines a RegEx pattern: any five-character string starting with a and
ending with s. Each . matches any single character, not only a letter.
String        Result
abs           No match
alias         Match
abyss         Match
Alias         No match
An abacus     No match
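The table above can be reproduced with Python's built-in re module. This is a minimal sketch; the helper name is_match is an illustrative assumption, not part of the original material.

```python
import re

# The pattern from the example: 'a', then any three characters, then 's'.
PATTERN = re.compile(r"^a...s$")

def is_match(text: str) -> bool:
    """Return True if `text` fits the pattern ^a...s$ (hypothetical helper)."""
    return PATTERN.search(text) is not None

for word in ["abs", "alias", "abyss", "Alias", "An abacus"]:
    result = "Match" if is_match(word) else "No match"
    print(f"{word!r:12} {result}")
```

Matching is case-sensitive by default, which is why Alias is rejected while alias is accepted.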
Basic Examples
Rather than start with technical details, we’ll start with a bunch of examples.
Regex                 Matches any string that
hello                 contains {hello}
gray|grey             contains {gray, grey}
gr(a|e)y              contains {gray, grey}
gr[ae]y               contains {gray, grey}
b[aeiou]bble          contains {babble, bebble, bibble, bobble, bubble}
[b-chm-pP]at|ot       contains {bat, cat, hat, mat, nat, oat, pat, Pat, ot}
colou?r               contains {color, colour}
rege(x(es)?|xps?)     contains {regex, regexes, regexp, regexps}
go*gle                contains {ggle, gogle, google, gooogle, goooogle, ...}
go+gle                contains {gogle, google, gooogle, goooogle, ...}
z{3}                  contains {zzz}
z{3,6}                contains {zzz, zzzz, zzzzz, zzzzzz}
z{3,}                 contains {zzz, zzzz, zzzzz, ...}
[Bb]rainf\*\*k        contains {Brainf**k, brainf**k}
\d                    contains {0,1,2,3,4,5,6,7,8,9}
\d{5}(-\d{4})?        contains a United States zip code
1\d{10}               contains an 11-digit string starting with a 1
^dog                  begins with "dog"
dog$                  ends with "dog"
^dog$                 is exactly "dog"
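A few rows of the table can be checked mechanically. The sketch below assumes Python's re module, whose re.search has the "contains" meaning used in the table; the sample texts and the helper names contains and mismatches are made up for illustration.

```python
import re

def contains(pattern: str, text: str) -> bool:
    """True if some substring of `text` matches `pattern` (re.search semantics)."""
    return re.search(pattern, text) is not None

# (pattern, sample text, expected result) triples based on rows of the table.
CASES = [
    (r"gr(a|e)y", "a grey sky", True),
    (r"colou?r", "colouur", False),          # at most one 'u' is allowed
    (r"go*gle", "ggle", True),               # * allows zero o's
    (r"go+gle", "ggle", False),              # + needs at least one o
    (r"z{3,6}", "zz", False),
    (r"[Bb]rainf\*\*k", "brainf**k", True),  # \* is a literal asterisk
    (r"\d{5}(-\d{4})?", "NJ 08540-1234", True),
    (r"[b-chm-pP]at|ot", "rat", False),
    (r"[b-chm-pP]at|ot", "hot", True),       # matched by the 'ot' alternative
    (r"^dog", "dogma", True),
    (r"dog$", "hotdog", True),
    (r"^dog$", "hotdog", False),
]

def mismatches():
    """Return the cases whose actual result differs from the expected one."""
    return [(p, t) for p, t, expected in CASES if contains(p, t) != expected]

print(mismatches())  # an empty list means every case behaved as expected
```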
Regular Expressions
Notation to specify a language
•The languages accepted by finite automata can be described by
simple expressions called regular expressions. A regular expression can
also be described as a pattern that defines a set of strings.
• They describe exactly the regular languages.
• If r is a regular expression, then L(r) is the language it defines.
• Sort of like a programming language.
• Fundamental in some languages like Perl and Python, and in
applications like grep or lex
• Capable of describing the same languages as an NFA
• The two are actually equivalent, so RE = NFA = DFA in expressive power
• We can define an algebra for regular expressions
Language of given Regular Expression?
Regular Expression      Regular Language
(0 + 10*)               L = {0, 1, 10, 100, 1000, 10000, …}
(0*10*)                 L = {1, 01, 10, 010, 0010, …}
(0 + ε)(1 + ε)          L = {ε, 0, 1, 01}
(a + b)*                Set of strings of a's and b's of any length, including the null string. So L = {ε, a, b, aa, ab, bb, ba, aaa, …}
(a + b)*abb             Set of strings of a's and b's ending with the string abb. So L = {abb, aabb, babb, aaabb, ababb, …}
(11)*                   Set of strings with an even number of 1's, including the empty string. So L = {ε, 11, 1111, 111111, …}
(aa)*(bb)*b             Set of strings consisting of an even number of a's followed by an odd number of b's. So L = {b, aab, aabbb, aabbbbb, aaaab, aaaabbb, …}
(aa + ab + ba + bb)*    Strings of a's and b's of even length, obtained by concatenating any combination of the strings aa, ab, ba and bb, including null. So L = {ε, aa, ab, ba, bb, aaab, aaba, …}
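The languages in the table can be listed by brute force. The sketch below is one way to do it, assuming Python's re module; the textbook union operator + is written as | in Python syntax, ε is written as an empty alternative, and the function name language is a hypothetical helper.

```python
import re
from itertools import product

def language(pattern: str, alphabet: str, max_len: int) -> list[str]:
    """List every string over `alphabet` with length <= max_len that the
    whole pattern matches, shortest first."""
    compiled = re.compile(pattern)
    words = []
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            word = "".join(letters)
            if compiled.fullmatch(word):
                words.append(word)
    return words

print(language(r"(0|)(1|)", "01", 3))            # (0 + ε)(1 + ε)
print(language(r"(a|b)*abb", "ab", 5))           # (a + b)*abb
print(language(r"(11)*", "1", 6))                # (11)*
print(language(r"(aa)*(bb)*b", "ab", 5))         # (aa)*(bb)*b
print(language(r"(aa|ab|ba|bb)*", "ab", 2))      # (aa + ab + ba + bb)*
```

Only a finite prefix of each language is listed, so the bound max_len plays the role of the "…" in the table.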
Equivalence of FA and RE
Finite automata and regular expressions are equivalent. To show
this:
• Show that any DFA can be expressed as an equivalent RE.
• Show that any RE can be expressed as an ε-NFA. Since an ε-NFA can be
converted to a DFA, REs are then equivalent to all the automata we
have described.
DFA, NFA, Regular Expression (RegEx)
and Regular Language (RegLang)
A DFA recognizes a regular language, i.e. the language of some regular expression.
Converting an RE to an Automaton
We have shown that we can convert an automaton to an RE. To
show equivalence we must also go in the other direction and
convert an RE to an automaton.
The easiest way to do this is to convert the RE to an ε-NFA.
Inductive construction
Start with a simple basis and use it to build the more complex parts of the
NFA.
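RE to ε-NFA
Basis: R = a, R = ε, R = Ø
Inductive cases: R = S+T, R = ST, R = S*
[Figure: ε-NFA fragment for each case; S and T are sub-automata joined by ε-transitions]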
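The inductive construction can be sketched in code. The version below follows the standard Thompson-style construction, which may differ in detail from the diagrams on the slides; the class NFA and all of its method names are assumptions made for this illustration.

```python
from itertools import count

class NFA:
    """A pool of states; every fragment is a (start, accept) pair of states."""

    def __init__(self):
        self.edges = {}       # state -> list of (label, target); label None is ε
        self._ids = count()

    def new_state(self):
        state = next(self._ids)
        self.edges[state] = []
        return state

    def link(self, src, label, dst):
        self.edges[src].append((label, dst))

    # Basis cases
    def symbol(self, a):                      # R = a
        s, f = self.new_state(), self.new_state()
        self.link(s, a, f)
        return s, f

    def epsilon(self):                        # R = ε
        s, f = self.new_state(), self.new_state()
        self.link(s, None, f)
        return s, f

    def empty(self):                          # R = Ø: accept state unreachable
        return self.new_state(), self.new_state()

    # Inductive cases (each sub-fragment must be used only once)
    def union(self, S, T):                    # R = S+T
        s, f = self.new_state(), self.new_state()
        for start, accept in (S, T):
            self.link(s, None, start)
            self.link(accept, None, f)
        return s, f

    def concat(self, S, T):                   # R = ST
        self.link(S[1], None, T[0])
        return S[0], T[1]

    def star(self, S):                        # R = S*
        s, f = self.new_state(), self.new_state()
        self.link(s, None, S[0])
        self.link(S[1], None, f)
        self.link(S[1], None, S[0])           # go round again
        self.link(s, None, f)                 # zero repetitions
        return s, f

    def closure(self, states):
        """All states reachable from `states` using ε-transitions only."""
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for label, target in self.edges[q]:
                if label is None and target not in seen:
                    seen.add(target)
                    stack.append(target)
        return seen

    def accepts(self, fragment, word):
        current = self.closure({fragment[0]})
        for ch in word:
            moved = {t for q in current for label, t in self.edges[q] if label == ch}
            current = self.closure(moved)
        return fragment[1] in current

# Build (ab+a)* bottom-up, mirroring the staged construction.
nfa = NFA()
ab = nfa.concat(nfa.symbol("a"), nfa.symbol("b"))
r = nfa.star(nfa.union(ab, nfa.symbol("a")))
print([w for w in ["", "a", "ab", "aab", "b", "abb"] if nfa.accepts(r, w)])
```

The simulation tracks the set of states the ε-NFA could be in, which is the same idea used when converting an ε-NFA to a DFA.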
RE to ε-NFA Example
Convert R = (ab+a)* to an NFA.
We proceed in stages, starting from simple elements
and working our way up: a, b, ab, ab+a, (ab+a)*.
[Figure: ε-NFAs for a, b and ab]
RE to ε-NFA Example (2)
[Figure: ε-NFAs for ab+a and (ab+a)*]
Examples
2. a(a+b)*bb
3. (ab+a)*(aa+b)
What have we shown?
Regular expressions and finite state automata are
really two different ways of expressing the same
thing.
In some cases you may find it easier to start with
one and move to the other.
E.g., the language of strings with an even number of 1's is
typically easier to design as an NFA or DFA and then
convert to an RE.
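As an illustration of the even-number-of-1's remark, the sketch below writes the two-state DFA directly and compares it with one possible regular expression for the same language, (0 + 10*1)*. Both the state names and that expression are assumptions chosen for this example, and the comparison is only exhaustive up to a length bound.

```python
import re
from itertools import product

# Two-state DFA over {0, 1}: the state records the parity of 1's read so far.
DELTA = {
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd",   ("odd", "1"): "even",
}

def dfa_accepts(word: str) -> bool:
    state = "even"                      # start state, also the accepting state
    for ch in word:
        state = DELTA[(state, ch)]
    return state == "even"

# Candidate RE: (0 + 10*1)*, written with | for union in Python syntax.
EVEN_ONES = re.compile(r"(0|10*1)*")

def agree_up_to(max_len: int) -> bool:
    """Check that the DFA and the RE agree on every string up to max_len."""
    for n in range(max_len + 1):
        for letters in product("01", repeat=n):
            word = "".join(letters)
            if dfa_accepts(word) != (EVEN_ONES.fullmatch(word) is not None):
                return False
    return True

print(agree_up_to(8))
```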
Convert NFA to DFA for a given RegLang
1) NFA for the * operator. Example: L(M) = (0+1)*
2) NFA for the + operator (union). Example: L(M) = (0+1)
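Arden's theorem: if P does not contain ε, the equation R = Q + RP has the unique solution R = QP*.
R = Q + RP ......(i)
Replace R on the right-hand side by Q + RP:
R = Q + (Q + RP)P = Q + QP + RP²
Again replace R by Q + RP:
R = Q + QP + (Q + RP)P²
  = Q + QP + QP² + RP³
Continuing in the same way:
R = Q + QP + QP² + ... + QPⁿ + RPⁿ⁺¹
Since P does not contain ε, every string in RPⁿ⁺¹ has length greater than n, so
R = Q(ε + P + P² + ...) = QP*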
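The result R = QP* can be sanity-checked on small finite languages by truncating everything to a maximum string length. This is only a bounded check under that assumption, not a proof, and the helper names concat, star and arden_holds are hypothetical.

```python
def concat(A, B, max_len):
    """Concatenation of two languages, keeping strings of length <= max_len."""
    return {a + b for a in A for b in B if len(a + b) <= max_len}

def star(P, max_len):
    """Kleene star of P, keeping strings of length <= max_len."""
    result = {""}
    frontier = {""}
    while frontier:
        # Extend the newest strings by one more factor from P.
        frontier = concat(frontier, P, max_len) - result
        result |= frontier
    return result

def arden_holds(Q, P, max_len):
    """Check that R = QP* satisfies R = Q + RP on strings up to max_len."""
    R = concat(Q, star(P, max_len), max_len)
    right_hand_side = {q for q in Q if len(q) <= max_len} | concat(R, P, max_len)
    return R == right_hand_side

print(arden_holds({"0"}, {"10"}, 9))
print(arden_holds({"a", "bb"}, {"ab", "b"}, 7))
```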
DFA to RE using Arden's theorem
Steps: to convert a given DFA to its regular expression using Arden's Theorem, the following
steps are followed.
Step-01:
Form an equation for each state, considering the transitions that come into that state.
•Add 'ε' to the equation of the initial state.
Step-02:
Bring the equation of the final state into the form R = Q + RP to get the required regular expression.
Note
Arden's Theorem can be used to find a regular expression for both a DFA and an NFA.
If there are multiple final states, then:
•Write a regular expression for each final state separately.
•Add (take the union of) all the regular expressions to get the final regular expression.
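Step-01 can be sketched as a small function that turns a transition table into state equations, writing ε for the empty string. The function name state_equations and the two-state DFA used here (A goes to B on 0, B goes back to A on 1, with A as the initial state) are hypothetical choices for illustration, and the dictionary format only covers deterministic transitions.

```python
def state_equations(transitions, start):
    """Build one equation per state from the transitions coming into it.

    `transitions` maps (source_state, symbol) to a target state.
    The initial state additionally gets an ε term.
    """
    states = sorted({start} | {src for src, _ in transitions} | set(transitions.values()))
    equations = {}
    for state in states:
        terms = ["ε"] if state == start else []
        terms += [
            f"{src}.{symbol}"
            for (src, symbol), dst in sorted(transitions.items())
            if dst == state
        ]
        # A state with no incoming transitions and no ε term is unreachable.
        equations[state] = " + ".join(terms) if terms else "Ø"
    return equations

transitions = {("A", "0"): "B", ("B", "1"): "A"}
for state, rhs in state_equations(transitions, "A").items():
    print(f"{state} = {rhs}")
```

Step-02 then works on these equations by substitution until the equation of the final state has the form R = Q + RP.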
Example 1: Find the regular expression for the following DFA using Arden's Theorem.
Solution:
Step-01:
Form an equation for each state:
A = ε + B.1 ……(1)
B = A.0 ……(2)
Step-02:
Bring the final state into the form R = Q + RP.
Using (1) in (2), we get:
B = (ε + B.1).0
B = ε.0 + B.1.0
B = 0 + B.(1.0) ……(3)