Java Regular Expression Final
Navaneethan S
IT VAC Team
Email Validation
• Let’s say you want to verify an email address in the form given below without
regular expressions
Program seems
less complex.
• What is it?
• “Regular expressions are a way to search for patterns within data sets.”
• Once a data set is available, the next task is to extract the specific
data that you need from it.
• Example:
[Diagram: Telephone Directory -> Extraction process (looking for specific data; match and extract if needed) -> Specific Data]
[Callout from the email example: looking for a specific pattern in a given email; the program seems less complex.]
• Looking for specific data in a huge data set is a terrible process.
• The traditional “if” control will not serve you well.
• You need to be equipped with better tools to wade through the data in an effective
and efficient way.
• This is where regular expressions can be super useful.
[Linux Users]
“The true power of the Linux command line is unleashed only if you
supplement it with regex.”
Regular Expression
Something that validates what we have typed in against a pattern.
VideogameSalesData2019.csv file
Drive Location : https://drive.google.com/drive/folders/0ACCIIBaRrWySUk9PVA
Select all game records that were developed by PubG Corporation: 7 hits
Search for all the records containing the year 2019: 768 hits
Search Keyword : PUBG, 2019
(Search Mode : Normal)
Use Case for Regex
Complex without
using regex
Now take a complex scenario.
I want all the game records that were developed by “PUBG” or “Nintendo”, and the year must be either “2018” or
“2019”.
How do I frame my search term? I cannot come up with a search term using normal mode.
.*(PUBG|Nintendo).*201(8|9).*
Total Hits : 63 Hits
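The same search can be sketched with the Java regex engine. This is only an illustration: the class name GameRecordFilter and the sample rows are hypothetical, since the actual layout of the CSV file is not shown here.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class GameRecordFilter {
    // Pattern from the slide: PUBG or Nintendo, followed later on the line by 2018 or 2019.
    static final Pattern RECORD = Pattern.compile(".*(PUBG|Nintendo).*201(8|9).*");

    // Keeps only the lines that match the pattern as a whole.
    static List<String> filter(List<String> lines) {
        return lines.stream()
                .filter(line -> RECORD.matcher(line).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical rows, made up for this sketch.
        List<String> rows = List.of(
                "Game A,PUBG Corporation,2018",
                "Game B,Nintendo,2019",
                "Game C,Nintendo,2017",
                "Game D,Other Studio,2019");
        filter(rows).forEach(System.out::println); // prints the Game A and Game B rows
    }
}
```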
A Simple RegEx
RegEx pattern: fooa*bar

Input        Match
fooaaaabar   foo aaaa bar
fooabar      foo a bar
foobar       foo bar
fooaabar     foo aa bar
fooxxxbar    (no match)
fooxbar      (no match)
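A minimal sketch of this pattern in Java, assuming whole-string matching is what is wanted. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class StarQuantifierDemo {
    // Pattern.matches requires the entire input to fit the pattern.
    static boolean matchesFooBar(String input) {
        return Pattern.matches("fooa*bar", input);
    }

    public static void main(String[] args) {
        String[] inputs = {"fooaaaabar", "fooabar", "foobar", "fooaabar", "fooxxxbar", "fooxbar"};
        for (String s : inputs) {
            System.out.println(s + " -> " + matchesFooBar(s));
        }
    }
}
```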
Sample Exercises
Step 1: Understand the requirement. What needs to be included? What needs to be excluded?
Step 2: Identify the patterns in the inclusions list or the exclusions list.
Step 3: Represent the patterns using regular expressions.
Step 4: Use a regex engine like GREP, Python or Java to apply the regex pattern on the input.
Hands-on with Java regex engine
FYIP
POSIX Standards
(Regular Expressions)
• The BASIC Set comes with the set of symbols, each one having
specific meaning and interpretation.
Symbol   What does it represent?
\s       Represents whitespace
[a-d]    A single character that falls in the range ‘a-d’, i.e. one of ‘a’, ‘b’, ‘c’ or ‘d’.
The period (.) is the wildcard symbol: it represents any single character.
The number of letters between foo and bar, and the letters themselves, are unpredictable.

RegEx pattern: foo.*bar

Input       Match
fooabar     foo a bar
fooxbar     foo x bar
baryfoo     (no match)
foobar      foo bar
barfoo      (no match)
fooabcbar   foo abc bar
foobxcbar   foo bxc bar
barcbyfoo   (no match)
foozbar     foo z bar
barafoo     (no match)
barabfoo    (no match)
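The wildcard pattern can be tried out in Java along these lines. This is a sketch with hypothetical names, not the only way to apply the pattern.

```java
import java.util.regex.Pattern;

class WildcardDemo {
    // ".*" stands for any run of characters, including an empty run.
    static boolean matchesWildcard(String input) {
        return Pattern.matches("foo.*bar", input);
    }

    public static void main(String[] args) {
        String[] inputs = {"fooabar", "foobar", "fooabcbar", "baryfoo", "barabfoo"};
        for (String s : inputs) {
            System.out.println(s + " -> " + matchesWildcard(s));
        }
    }
}
```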
Representing WhiteSpaces
RegEx pattern: foo\s*bar

Input               Match
fooxxxbar           (no match)
foo<3 spaces>bar    foo <3 spaces> bar
fooxbar             (no match)
fooxxbar            (no match)
foo<1 space>bar     foo <1 space> bar
foo<6 spaces>bar    foo <6 spaces> bar
foobar              foo <0 spaces> bar
fooyyybar           (no match)
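In Java source code the backslash of \s has to be doubled inside a string literal. A small sketch, with illustrative names:

```java
import java.util.regex.Pattern;

class WhitespaceDemo {
    // The regex foo\s*bar is written "foo\\s*bar" in a Java string literal.
    static boolean matchesSpaced(String input) {
        return Pattern.matches("foo\\s*bar", input);
    }

    public static void main(String[] args) {
        String[] inputs = {"foo   bar", "foo bar", "foobar", "fooxbar"};
        for (String s : inputs) {
            System.out.println("[" + s + "] -> " + matchesSpaced(s));
        }
    }
}
```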
RegEx pattern: [fcl]oo (no spaces or commas between the letters)

Input   Match
loo     yes
boo     no
hoo     no

Character classes are represented using square brackets.
Example: [abc] is a character class. It matches one of the characters inside the square brackets: a, b or c.
A character class is not a wildcard. You can't put just anything in that position; the character has to be either a, b or c.
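A character class can be checked in Java as below. The class name and inputs are illustrative; the pattern uses the letter o twice after the bracket.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // [fcl] accepts exactly one character: f, c or l.
    static boolean matchesClass(String input) {
        return Pattern.matches("[fcl]oo", input);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"foo", "coo", "loo", "boo", "hoo"}) {
            System.out.println(s + " -> " + matchesClass(s));
        }
    }
}
```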
Character Class - Quiz
RegEx pattern: [fcdplb]oo (no spaces or commas between the letters)

Input   Match
foo     yes
moo     no
coo     yes
doo     yes
poo     yes
loo     yes
boo     yes
hoo     no

If there are too many entries inside a character class, it starts to get unmanageable.
In this example we have 6 valid cases.
Is there a better regex pattern that we can come up with, one that is not as lengthy?
Caret Symbol
RegEx pattern: [^mh]oo

Input   Match
foo     yes
moo     no
coo     yes
doo     yes
poo     yes
loo     yes
boo     yes
hoo     no

The caret symbol (^), also called the exponent operator, negates the class.
Example: [^abc]. What does this mean?
It represents any character other than ‘a’, ‘b’ or ‘c’.
Please note that it represents only a single character position.
So far we have come up with regular expressions using an inclusion list.
When the regex gets lengthy, we can also look for a regex based on an exclusion list, just to avoid
a lengthy regex pattern.
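A short Java sketch of the negated class, with made-up names. It also shows that the negated class accepts any single character that is not listed, not only letters.

```java
import java.util.regex.Pattern;

class NegatedClassDemo {
    // [^mh] accepts any single character except m and h.
    static boolean matchesNegated(String input) {
        return Pattern.matches("[^mh]oo", input);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"foo", "moo", "hoo", "boo", "1oo"}) {
            System.out.println(s + " -> " + matchesNegated(s));
        }
    }
}
```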
Character classes with ranges
RegEx pattern: [j-m]oo

Input   Match
joo     j oo
boo     (no match)
koo     k oo
loo     l oo
woo     (no match)
moo     m oo
zoo     (no match)
coo     (no match)
Character classes with ranges
RegEx pattern: [j-mz]oo

Input   Match
joo     j oo
boo     (no match)
koo     k oo
loo     l oo
woo     (no match)
moo     m oo
zoo     z oo
coo     (no match)

Example: [a-cx] represents one of the characters falling in the range OR any of the other
choices given in the square brackets: a, b, c, x
Character classes with ranges
Even though the inputs look like they are in sequence, they are a
mix of lowercase and uppercase letters.

RegEx pattern: [j-mJ-Mz]oo (2 ranges)

Input   Match
joo     j oo
boo     (no match)
Koo     K oo
Loo     L oo
woo     (no match)
moo     m oo
zoo     z oo
coo     (no match)

Example: [a-cA-Cx] represents one of the characters falling in one of the ranges OR any of the
other choices given in the square brackets: a, b, c, A, B, C, x
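Ranges and single choices can be mixed inside one class. A minimal Java sketch with illustrative names:

```java
import java.util.regex.Pattern;

class RangeClassDemo {
    // Two ranges (j-m and J-M) plus the single choice z.
    static boolean matchesRange(String input) {
        return Pattern.matches("[j-mJ-Mz]oo", input);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"joo", "Koo", "Loo", "moo", "zoo", "boo", "woo", "coo"}) {
            System.out.println(s + " -> " + matchesRange(s));
        }
    }
}
```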
Escaping with Backslash
Requirement: occurrences of x, then a period (no repetition of the period), then occurrences of y.

Input     Match
xxx.yy    xxx . yy
xx.yyyy   xx . yyyy
x.yy      x . yy
xy        (no match)
xxyy      (no match)
yyxx      (no match)
yx        (no match)
yxxx      (no match)

FYIP:
The letters a, b, c or x, y, z are all literals. They don't mean anything special to the
regex engine. But certain symbols like the period, star, square brackets, etc. mean
something special to the regex engine.
What if these special symbols become part of our input string?
In this case the period symbol (.) is part of our input string.
Our regex engine treats the period symbol as a wildcard, which is not
what we want.
We need to escape this character with a backslash symbol.
Special symbols: ^ $ * . [ ( ) \
Escaping with Backslash
In other words, the backslash symbol is a way of escaping the symbol from being
interpreted as a special symbol
^ $ * . [ ( ) \
Escaping with Backslash
RegEx pattern: x*\.y*  (x* then \. then y*)

Input     Match
xxx.yy    xxx . yy
xx.yyyy   xx . yyyy
x.yy      x . yy
xy        (no match)
xxyy      (no match)
yyxx      (no match)
yx        (no match)
yxxx      (no match)
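The effect of the escape can be seen by comparing the escaped and unescaped patterns in Java. The sketch below uses hypothetical names. Pattern.quote is a standard java.util.regex helper that turns a string into a literal pattern, shown as an alternative to writing the backslash by hand.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Escaped period: only a literal '.' is accepted between the x's and the y's.
    static boolean matchesEscaped(String input) {
        return Pattern.matches("x*\\.y*", input);
    }

    // Unescaped period: the '.' acts as a wildcard for any single character.
    static boolean matchesUnescaped(String input) {
        return Pattern.matches("x*.y*", input);
    }

    // Same as matchesEscaped, but lets Pattern.quote build the literal period.
    static boolean matchesQuoted(String input) {
        return Pattern.matches("x*" + Pattern.quote(".") + "y*", input);
    }

    public static void main(String[] args) {
        System.out.println(matchesEscaped("xxx.yy"));  // true
        System.out.println(matchesEscaped("xy"));      // false
        System.out.println(matchesUnescaped("xy"));    // true, which is not what we want
        System.out.println(matchesQuoted("x.yy"));     // true
    }
}
```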
Escaping with Backslash
RegEx pattern: x[.:#]y

Input   Match
x#y     x # y
x:y     x : y
x.y     x . y
x&y     (no match)
x%y     (no match)

# (the pound symbol) does not have a special meaning in the regex engine.
Likewise, the colon symbol (:) does not have a special meaning either.
But the period symbol (.) does.
Note: A period outside the square brackets represents any single character, but it does not have
any special meaning inside the square brackets. It is simply treated as a literal. (This is the reason
why we have left out the backslash for the period symbol.)
Escaping with Backslash
RegEx pattern: x[#:\^]y

Input   Match
x#y     x # y
x:y     x : y
x^y     x ^ y
x&y     (no match)
x%y     (no match)

# (the pound symbol) does not have a special meaning in the regex engine.
Likewise, the colon symbol (:) does not have a special meaning either.
Note: A caret symbol (^) inside the square brackets has a special meaning. Placed first, it negates the
character class. (This is the reason why we have included a backslash for the caret symbol.)
Important Points
RegEx pattern: x[#\\\^]y

Input   Match
x#y     x # y
x\y     x \ y
x^y     x ^ y
x&y     (no match)
x%y     (no match)
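Writing this pattern in Java takes care, because every regex backslash must be doubled again inside a string literal. The following sketch, with illustrative names, shows one way to spell it.

```java
import java.util.regex.Pattern;

class BracketEscapeDemo {
    // Regex:        x[#\\\^]y   (class contains '#', a literal backslash and a literal caret)
    // Java literal: each regex backslash is doubled, giving six backslashes before the caret.
    static final Pattern PATTERN = Pattern.compile("x[#\\\\\\^]y");

    static boolean matchesSymbol(String input) {
        return PATTERN.matcher(input).matches();
    }

    public static void main(String[] args) {
        // "x\\y" is the Java literal for the three characters x, backslash, y.
        for (String s : new String[] {"x#y", "x\\y", "x^y", "x&y", "x%y"}) {
            System.out.println(s + " -> " + matchesSymbol(s));
        }
    }
}
```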
Anchors
• Which of the following represents three digit numbers that are multiples of 5?
A. ^[0-9][0-9][05]$
B. ^[0-9][0-9][0-9]$
C. ^[0-9]*$
D. ^[0-9].[0-9]$
POSIX Standard – Extended Set
Just like basic set, the extended set also comes with the set of symbols,
each one having a specific meaning and interpretation.
Symbol   What does it represent?
?        Zero or one occurrence of the character that precedes this question mark
RegEx pattern: ^[0-9][0-9][0-9]$

Input   Match
834     yes
519     yes
4874    no
5       no
89      no
45687   no
25      no
645     yes

A digit can be any character from 0 to 9.
So a digit can be represented by a character class with the ASCII range from 0 to 9.
We also have a line-beginning anchor (^) and a line-end anchor ($) at the left and right.
Why do we need these?
The reason is that we do not want matches done with a subset of the string,
i.e. we do not want to match against substrings. Let's discuss this with an example.
Let's take the number 45687 from the list above.
Take a substring of it: the middle 3 characters, 5-6-8.
This forms a 3-digit number.
Just because the string contains a three-digit number somewhere in between, we don't want to identify it as a positive
match.
We are only interested in matching the whole string. In order to ensure this, we put an anchor at both ends.
This way the match is run only against whole strings and not against substrings.
Curly Braces Repeater
RegEx pattern: ^[0-9][0-9][0-9]$ can be written as ^[0-9]{3}$

Input   Match
834     yes
519     yes
4874    no
5       no
89      no
45687   no
25      no
645     yes

For the problem above, what would the pattern look like if we wanted to represent 10-digit numbers?
It would be cumbersome to write the character class 10 times over.
We need a more compact way to represent this.
This leads us to the next regex symbol, the repeater.
It is represented by opening and closing curly braces with a number in between.
This number signifies the number of repetitions.
a{m} represents exactly ‘m’ repetitions of whatever immediately precedes it, i.e. ‘a’.
You might think you can do this with the asterisk symbol ‘*’. Beware: the limitation of the asterisk is
that you can't represent an exact number of repetitions with it.
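Both the anchoring argument and the {m} repeater can be checked with a small Java sketch. The names are hypothetical. Matcher.find() is used because it searches for the pattern anywhere in the input, which makes the effect of the anchors visible.

```java
import java.util.regex.Pattern;

class AnchorsAndRepeaterDemo {
    // No anchors: succeeds if three digits occur anywhere in the input.
    static boolean containsThreeDigits(String input) {
        return Pattern.compile("[0-9]{3}").matcher(input).find();
    }

    // Anchored at both ends: the whole input must be exactly three digits.
    static boolean isThreeDigitNumber(String input) {
        return Pattern.compile("^[0-9]{3}$").matcher(input).find();
    }

    // The same idea scaled up to ten digits without repeating the class ten times.
    static boolean isTenDigitNumber(String input) {
        return Pattern.compile("^[0-9]{10}$").matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(containsThreeDigits("45687")); // true: 456 is found inside
        System.out.println(isThreeDigitNumber("45687"));  // false
        System.out.println(isThreeDigitNumber("834"));    // true
        System.out.println(isTenDigitNumber("1234567890")); // true
    }
}
```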
Curly braces Repeater
RegEx pattern: ^[a-z]{4,6}$

Input      Length      Match
lion       4 letters   yes, as ^[a-z]{4}$
tiger      5 letters   yes, as ^[a-z]{5}$
leopard    7 letters   no
fox        3 letters   no
kangaroo   8 letters   no
bat        3 letters   no
mouse      5 letters   yes, as ^[a-z]{5}$
cuckoo     6 letters   yes, as ^[a-z]{6}$
deer       4 letters   yes, as ^[a-z]{4}$

a{m,n} represents at least ‘m’ and at most ‘n’ repetitions of whatever immediately
precedes it, i.e. ‘a’.
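The bounded repeater can be sketched in Java as follows, using the animal names from the slide. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class BoundedRepeaterDemo {
    // Accepts lowercase words that are 4 to 6 letters long.
    static boolean isFourToSixLetters(String word) {
        return Pattern.matches("^[a-z]{4,6}$", word);
    }

    public static void main(String[] args) {
        String[] animals = {"lion", "tiger", "leopard", "fox", "kangaroo", "bat", "mouse", "cuckoo", "deer"};
        for (String animal : animals) {
            System.out.println(animal + " -> " + isFourToSixLetters(animal));
        }
    }
}
```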
Single Ended Curly Braces Repeater
RegEx pattern: (ha){4,}

Input                Match
ha                   no
hahahahaha           yes, as (ha){5}
hahaha               no
hahahaha             yes, as (ha){4}
haha                 no
hahahahahaha         yes, as (ha){6}
hahahahahahahaha     yes, as (ha){8}
hahahahahahahahaha   yes, as (ha){9}

Parentheses are used to group characters and treat them as a single entity.
If we had not used the () parentheses, the repetition count inside the curly braces would be applied
only to ‘a’.
{m,} represents at least ‘m’ repetitions of whatever immediately
precedes it.
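The difference the parentheses make can be seen in a short Java sketch with illustrative names. Without the group, the repeater applies only to the letter a.

```java
import java.util.regex.Pattern;

class GroupedRepeaterDemo {
    // The group (ha) is repeated at least four times.
    static boolean matchesGrouped(String input) {
        return Pattern.matches("(ha){4,}", input);
    }

    // Without parentheses only 'a' is repeated: h followed by at least four a's.
    static boolean matchesUngrouped(String input) {
        return Pattern.matches("ha{4,}", input);
    }

    public static void main(String[] args) {
        System.out.println(matchesGrouped("hahahaha"));   // true
        System.out.println(matchesGrouped("hahaha"));     // false
        System.out.println(matchesUngrouped("hahahaha")); // false
        System.out.println(matchesUngrouped("haaaa"));    // true
    }
}
```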
Single Ended Curly Braces Repeater
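RegEx pattern: ^(ha){,2}$ (without anchors: (ha){1,2})

Input            Repetitions     Match
ha               (ha){1}         yes
haha             (ha){2}         yes
hahaha           ^(ha){3}$       no
hahahaha         ^(ha){4}$       no
hahahahaha       ^(ha){5}$       no
hahahahahaha     ^(ha){6}$       no
hahahahahahaha   ^(ha){7}$       no

Why anchor at both ends here?
Note: java.util.regex does not accept the {,n} form and requires an explicit lower bound, so write
{0,2} or {1,2} when using the Java regex engine.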
Single Ended Curly Braces Repeater
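Reason why the anchors at both ends are needed here:
hahahaha contains ha and haha as substrings, so without the anchors it would produce a match.
With ^(ha){,2}$ the whole string has to fit the pattern, and hahahaha is rejected.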
RegEx pattern: fooa+bar

Input        Match
fooaaaabar   foo aaaa bar
fooabar      foo a bar
foobar       (no match)
fooaabar     foo aa bar
fooxxxbar    (no match)
fooxbar      (no match)

Note: + denotes one or more occurrences.
RegEx pattern: https?://website

Input              Match
https://website    http s ://website
http://website     http ://website
httpss://website   (no match)
httpx://website    (no match)
httpxx://website   (no match)

The number of ‘s’ can either be zero or one.
Only zero occurrences or a single occurrence should qualify.
This brings us to the next regex symbol, the question mark (?),
which represents only two possibilities: either 0 or 1 repetition.
a?   Zero or one occurrence of ‘a’ (the character just preceding the question mark)
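A minimal Java sketch of the optional character, with hypothetical names; the host name website is the placeholder used on the slide.

```java
import java.util.regex.Pattern;

class OptionalCharDemo {
    // The 's' may appear zero times or once, but not more.
    static boolean isHttpOrHttps(String url) {
        return Pattern.matches("https?://website", url);
    }

    public static void main(String[] args) {
        String[] urls = {"https://website", "http://website", "httpss://website", "httpx://website"};
        for (String url : urls) {
            System.out.println(url + " -> " + isHttpOrHttps(url));
        }
    }
}
```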
Making Choices With Pipe
RegEx pattern: (log|ply)wood

Input      Match
sapwood    (no match)
rosewood   (no match)
logwood    log wood
teakwood   (no match)
plywood    ply wood
redwood    (no match)
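Alternation with the pipe can be tried in Java as sketched below. The names are illustrative. Note that there are no spaces around the pipe, because a space inside the group would be treated as a literal character.

```java
import java.util.regex.Pattern;

class AlternationDemo {
    // Either "log" or "ply", followed by "wood".
    static boolean isLogOrPlyWood(String word) {
        return Pattern.matches("(log|ply)wood", word);
    }

    public static void main(String[] args) {
        String[] woods = {"sapwood", "rosewood", "logwood", "teakwood", "plywood", "redwood"};
        for (String wood : woods) {
            System.out.println(wood + " -> " + isLogOrPlyWood(wood));
        }
    }
}
```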
Extended Set - Quiz
Which one of the following regular expressions can represent the words
‘colour’ as well as ‘color’?
A. colou*r
B. colou?r
C. colo.r
Which of the following regular expressions can represent the words
‘ascending’ as well as ‘descending’?
D. (asc/desc)ending
E. [asc|desc]ending
F. (asc|desc)ending
Extended Set - Quiz
Step 1: Understand the requirement. What needs to be replaced? What should be the replacement?
Step 2: Represent the search patterns using regex. Enclose the pattern(s) that need to be replaced in parentheses to segregate them into capture groups.
Step 3: Come up with the substitution string by using the captured pattern groups.
Step 4: Use a regex-enabled find-and-replace engine to do the replacement.
The Monitor Resolutions Problem
([0-9]+)x([0-9]+)
The Monitor Resolutions Problem
Java uses the group API of the Matcher class to get the substitution string.
• 745.256.3369 -> xxx.xxx.3369
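A sketch of how capture groups might be used from Java. The class name, the method names, the sample resolution 1280x720 and the output wording are assumptions made for illustration; Matcher.group and String.replaceAll are standard Java APIs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureGroupDemo {
    // Group 1 captures the width, group 2 captures the height.
    static final Pattern RESOLUTION = Pattern.compile("([0-9]+)x([0-9]+)");

    // Reads the captured groups through the Matcher group API; returns null if there is no match.
    static String describe(String resolution) {
        Matcher m = RESOLUTION.matcher(resolution);
        if (!m.matches()) {
            return null;
        }
        return "width=" + m.group(1) + ", height=" + m.group(2);
    }

    // Masks the first two blocks of a phone number; $1 refers to capture group 1 (the last four digits).
    static String maskPhone(String phone) {
        return phone.replaceAll("[0-9]{3}\\.[0-9]{3}\\.([0-9]{4})", "xxx.xxx.$1");
    }

    public static void main(String[] args) {
        System.out.println(describe("1280x720"));      // width=1280, height=720
        System.out.println(maskPhone("745.256.3369")); // xxx.xxx.3369
    }
}
```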
Given a US state code and a US zip code, separated by a space (e.g. NY 10520), which of the following regular
expressions would capture the state code into capture group 1 and the zip code into capture group 2?
D. ([A-Z]+)\s([0-9])+
E. ([A-Z]+)\s([0-9]+)
F. ([A-Z])+\s([0-9])+
The dollar price tag of a product (e.g. $21.44) is captured using the regex: \$([0-9]+)\.([0-9]+). Which of the
substitution strings below can you use to transform it to a string of the format: '44 cents and 21 dollars'?