06 Regularexpression

This document provides an overview of regular expressions in Java, covering their definition, usage, and various components such as pattern matching, quantifiers, and capturing groups. It explains how to compile patterns, create matchers, and utilize methods for searching and manipulating strings. Additionally, it discusses metacharacters, boundary matchers, and the significance of spaces in regular expressions.

Uploaded by

ngyntantai76

Java Programming Course

Regular Expression
Session objectives

– Introduction
– Pattern match in Java
– Simple patterns
– Character classes
– Boundary matchers
– Types of quantifiers
– Capturing Groups
– Other problems

Regular Expressions

• A regular expression is a kind of pattern that can be applied to text (Strings in Java)
• A regular expression either matches the text (or part of the text), or it fails to match
o If a regular expression matches a part of the text, then you can easily find out which part
o If a regular expression is complex, then you can easily find out which parts of the regular expression match which parts of the text
o With this information, you can readily extract parts of the text, or do substitutions in the text
• Regular expressions are an extremely useful tool for manipulating text
o Regular expressions are heavily used in the automatic generation of Web pages

A first example

• The regular expression "[a-z]+" will match a sequence of one or more lowercase letters
o [a-z] means any character from a through z, inclusive
o + means “one or more”
• Suppose we apply this pattern to the String "Now is the time"
o There are three ways we can apply this pattern:
• To the entire string: it fails to match because the string contains characters other than lowercase letters
• To the beginning of the string: it fails to match because the string does not begin with a lowercase letter
• To search the string: it will succeed and match ow
– If applied repeatedly, it will find is, then the, then time, then fail

Pattern match in Java

• First, you must compile the pattern


import java.util.regex.*;
Pattern p = Pattern.compile("[a-z]+");
• Next, you must create a matcher for a specific piece of text by
sending a message to your pattern
Matcher m = p.matcher("Now is the time");
• Points to notice:
o Pattern and Matcher are both in java.util.regex
o Neither Pattern nor Matcher has a public constructor; you create
these by using methods in the Pattern class
o The matcher contains information about both the pattern to use and
the text to which it will be applied.
Methods of Pattern class

Patterns Flags

• Call the compile method with or without a flag value to define the way the pattern is matched.
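
A minimal sketch of compiling the same pattern with and without a flag. It assumes the Pattern.CASE_INSENSITIVE flag from java.util.regex; the class and helper names (FlagsDemo, fullMatch) are illustrative only.

```java
import java.util.regex.Pattern;

public class FlagsDemo {
    // Returns true if the whole text matches the regex under the given flags.
    public static boolean fullMatch(String regex, int flags, String text) {
        return Pattern.compile(regex, flags).matcher(text).matches();
    }

    public static void main(String[] args) {
        // No flags: [a-z] only accepts lowercase letters.
        System.out.println(fullMatch("[a-z]+", 0, "NOW"));                        // false
        // CASE_INSENSITIVE: the same class now also accepts uppercase ASCII letters.
        System.out.println(fullMatch("[a-z]+", Pattern.CASE_INSENSITIVE, "NOW")); // true
    }
}
```

Several flags can be combined with the bitwise OR operator (|) when more than one matching mode is needed.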

Methods of Matcher class

Matcher methods

• Now that we have a matcher m,
o m.matches() returns true if the pattern matches the entire text string, and false otherwise
o m.lookingAt() returns true if the pattern matches at the beginning of the text string, and false otherwise
o m.find() returns true if the pattern matches any part of the text string, and false otherwise
• If called again, m.find() will start searching from where the last match was found
• m.find() will return true for as many matches as there are in the string; after that, it will return false
• After m.find() has returned false, call m.reset() before searching the same text again; do not rely on the matcher rewinding by itself
o m.reset() resets the search position to the start of the string.
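
The three ways of applying a pattern can be tried on the text "Now is the time" with the pattern [a-z]+. This is a sketch only; countMatches is a hypothetical helper that resets the matcher explicitly before counting.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherMethodsDemo {
    // Counts how many times find() succeeds on a fresh scan of the text.
    public static int countMatches(Matcher m) {
        m.reset();               // start again from the beginning of the text
        int count = 0;
        while (m.find()) {       // each call continues after the previous match
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("[a-z]+").matcher("Now is the time");
        System.out.println(m.matches());     // false: 'N' and the spaces are not in [a-z]
        System.out.println(m.lookingAt());   // false: the text starts with 'N'
        System.out.println(countMatches(m)); // 4: ow, is, the, time
    }
}
```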
Finding what was matched

• After a successful match,
o m.start() will return the index of the first character matched
o m.end() will return the index of the last character matched, plus one
• If no match was attempted, or if the match was unsuccessful, m.start() and m.end() will throw an IllegalStateException
o This is a RuntimeException, so you don’t have to catch it
• It may seem strange that m.end() returns the index of the last character matched plus one, but this is just what most String methods require
o For example, "Now is the time".substring(m.start(), m.end()) will return exactly the matched substring

A complete example
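
A minimal sketch of what a complete program might look like, combining compile, matcher, find, start, and end. The class name CompleteExample and the helper findAll are assumptions made for illustration, not taken from the original slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CompleteExample {
    // Collects every match of regex in text as "start-end:substring".
    public static List<String> findAll(String regex, String text) {
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(text);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            // end() is one past the last matched character, as substring() expects
            found.add(m.start() + "-" + m.end() + ":" + text.substring(m.start(), m.end()));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findAll("[a-z]+", "Now is the time"));
        // prints [1-3:ow, 4-6:is, 7-10:the, 11-15:time]
    }
}
```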

Additional methods

• If m is a matcher, then
o m.replaceFirst(replacement) returns a new String where the first substring matched by the pattern has been replaced by replacement
o m.replaceAll(replacement) returns a new String where every substring matched by the pattern has been replaced by replacement
o m.find(startIndex) looks for the next pattern match, starting at the specified index
o m.reset() resets this matcher
o m.reset(newText) resets this matcher and gives it new text to examine (which may be a String, StringBuffer, or CharBuffer)
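
These methods can be sketched on the same matcher as before. The helper firstFrom and the sample strings are hypothetical; the Matcher calls themselves are the ones listed above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplaceDemo {
    // Returns the first match found at or after startIndex, or null if there is none.
    public static String firstFrom(Matcher m, int startIndex) {
        return m.find(startIndex) ? m.group() : null;
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("[a-z]+").matcher("Now is the time");
        System.out.println(m.replaceFirst("X")); // NX is the time
        System.out.println(m.replaceAll("X"));   // NX X X X
        System.out.println(firstFrom(m, 4));     // is
        m.reset("ABC def");                      // same pattern, new text
        System.out.println(m.find() ? m.group() : null); // def
    }
}
```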
Metacharacters

• Metacharacters are special characters that affect the way a pattern is matched.
• The special characters that behave as metacharacters and are supported by the API are: ( [ { \ ^ - $ | ] } ) ? * + .
• A metacharacter can be treated as an ordinary character by:
o preceding the metacharacter with a backslash (\)
o enclosing the metacharacter between \Q at the beginning and \E at the end
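
A short sketch of both quoting styles, using the sample text 1+1 (an assumption for illustration). Pattern.quote is not mentioned on the slide; it is a standard java.util.regex helper that produces a quoted literal pattern, shown here as one more option.

```java
import java.util.regex.Pattern;

public class QuoteDemo {
    // True if the regex matches the whole text.
    public static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // Unescaped, + is a quantifier: the pattern means "one or more 1s, then a 1".
        System.out.println(matches("1+1", "1+1"));                // false
        // A backslash (doubled inside a Java string literal) makes + ordinary.
        System.out.println(matches("1\\+1", "1+1"));              // true
        // Everything between \Q and \E is taken literally.
        System.out.println(matches("\\Q1+1\\E", "1+1"));          // true
        // Pattern.quote builds a literal pattern from any string.
        System.out.println(matches(Pattern.quote("1+1"), "1+1")); // true
    }
}
```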

Some simple patterns

Pattern       Description
abc           exactly this sequence of three letters
[abc]         any one of the letters a, b, or c
[^abc]        any character except one of the letters a, b, or c (immediately within an open bracket, ^ means “not,” but anywhere else it just means the character ^)
[ab^c]        a, b, ^ or c
[a-z]         any one character from a through z, inclusive
[a-zA-Z0-9]   any one letter or digit

Character Classes

• Set of characters enclosed within square brackets

[abc]           a, b, or c (simple class)
[^abc]          Any character except a, b, or c (negation)
[a-zA-Z]        a through z or A through Z, inclusive (range)
[a-d[m-p]]      a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]    d, e, or f (intersection)
[a-z&&[^bc]]    a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]   a through z, and not m through p: [a-lq-z] (subtraction)
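
The union, intersection, and subtraction forms can be checked with a small sketch. The helper keep and the sample text are assumptions for illustration; keep filters a string down to the characters that a one-character class accepts.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharClassDemo {
    // Keeps only the characters of text that match the one-character class.
    public static String keep(String charClass, String text) {
        StringBuilder sb = new StringBuilder();
        Matcher m = Pattern.compile(charClass).matcher(text);
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "abcdefmnopqxyz";
        System.out.println(keep("[a-d[m-p]]", text));    // abcdmnop   (union)
        System.out.println(keep("[a-z&&[def]]", text));  // def        (intersection)
        System.out.println(keep("[a-z&&[^bc]]", text));  // adefmnopqxyz (subtraction)
        System.out.println(keep("[a-z&&[^m-p]]", text)); // abcdefqxyz (subtraction)
    }
}
```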
Sequences and alternatives

• If one pattern is followed by another, the two patterns must match consecutively
o For example, [A-Za-z]+[0-9] will match one or more letters immediately followed by one digit
• The vertical bar, |, is used to separate alternatives
o For example, the pattern abc|xyz will match either abc or xyz
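
Both examples can be tried with a small search helper. This is a sketch; firstMatch and the sample texts are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SequenceDemo {
    // Returns the first substring of text matched by regex, or null if none.
    public static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Letters must be followed immediately by a digit, so "room" alone does not match.
        System.out.println(firstMatch("[A-Za-z]+[0-9]", "room B12 is free")); // B1
        // Either alternative may match.
        System.out.println(firstMatch("abc|xyz", "wxyz"));                    // xyz
        System.out.println(firstMatch("abc|xyz", "abxy"));                    // null
    }
}
```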

Some predefined Character Classes
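
A short sketch of a few predefined classes, assuming the standard ones documented for java.util.regex.Pattern: \d (a digit), \s (a whitespace character), \w (a word character: letter, digit, or underscore), and the negated form \W. The helper replace and the sample text are illustrative only.

```java
public class PredefinedDemo {
    // Replaces every match of regex in text with replacement.
    public static String replace(String text, String regex, String replacement) {
        return text.replaceAll(regex, replacement);
    }

    public static void main(String[] args) {
        String text = "Room 12, floor 3";
        System.out.println(replace(text, "\\d", "#")); // Room ##, floor #
        System.out.println(replace(text, "\\s", "_")); // Room_12,_floor_3
        System.out.println(replace(text, "\\W", ""));  // Room12floor3
    }
}
```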

Boundary matchers

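• These patterns match the empty string, but only at the specified position:

^     the beginning of a line (by default the beginning of the whole input; of each line only with Pattern.MULTILINE)
$     the end of a line (by default the end of the whole input; of each line only with Pattern.MULTILINE)
\b    a word boundary
\B    not a word boundary
\A    the beginning of the input (can be multiple lines)
\Z    the end of the input except for the final terminator, if any
\z    the end of the input
\G    the end of the previous match
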
Example of Boundary Matchers
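
A minimal sketch of boundary matchers in use, not necessarily the listing from the original slide. The helper count and the sample text "cat concat cats" are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryDemo {
    // Counts the matches of regex in text.
    public static int count(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "cat concat cats";
        System.out.println(count("cat", text));         // 3: cat, con(cat), (cat)s
        System.out.println(count("\\bcat\\b", text));   // 1: only the standalone word
        System.out.println(count("\\Bcat", text));      // 1: only the one inside concat
        System.out.println(count("^cat", text));        // 1: at the start of the input
        System.out.println(count("cat$", text));        // 0: the input ends with cats
    }
}
```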

Greedy quantifiers

• Assume X represents some pattern

X? optional, X occurs zero or one time

X* X occurs zero or more times

X+ X occurs one or more times

X{n} X occurs exactly n times

X{n,} X occurs n or more times

X{n,m} X occurs at least n but not more than m times

Note that these are all postfix operators; that is, they come after the operand
Types of quantifiers

• A greedy quantifier [longest match first] (default) will match as much as it can, and back off if it needs to
• A reluctant quantifier [shortest match first] will match as little as possible, then take more if it needs to
o You make a quantifier reluctant by appending a ?:
X?? X*? X+? X{n}? X{n,}? X{n,m}?
• A possessive quantifier [longest match and never backtrack] will match as much as it can, and never let go
o You make a quantifier possessive by appending a +:
X?+ X*+ X++ X{n}+ X{n,}+ X{n,m}+

Quantifiers examples

Suppose your text is succeed

• Using the pattern suc*ce{2}d (c* is greedy):
o The c* will first match cc, but then ce{2}d won’t match
o The c* then “backs off” and matches only a single c, allowing the rest of the pattern (ce{2}d) to succeed
• Using the pattern suc*?ce{2}d (c*? is reluctant):
o The c*? will first match zero characters (the empty string), but then ce{2}d won’t match
o The c*? then extends and matches the first c, allowing the rest of the pattern (ce{2}d) to succeed
• Using the pattern suc*+ce{2}d (c*+ is possessive):
o The c*+ will match the cc, and will not back off, so ce{2}d never matches and the pattern match fails.
Example of Quantifiers
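
A minimal sketch that runs the three quantifier types on the text succeed and then shows how greedy and reluctant matching differ in what they capture. The helpers fullMatch and firstMatch and the sample text <a><b> are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierDemo {
    // True if the regex matches the whole text.
    public static boolean fullMatch(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    // First substring of text matched by regex, or null if none.
    public static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("suc*ce{2}d", "succeed"));  // true: greedy c* backs off
        System.out.println(fullMatch("suc*?ce{2}d", "succeed")); // true: reluctant c*? extends
        System.out.println(fullMatch("suc*+ce{2}d", "succeed")); // false: possessive c*+ keeps cc
        System.out.println(firstMatch("<.+>", "<a><b>"));        // <a><b> (longest match)
        System.out.println(firstMatch("<.+?>", "<a><b>"));       // <a>    (shortest match)
    }
}
```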

Capturing Groups - 1

• In regular expressions, parentheses are used for grouping, but they also capture (keep for later use) anything matched by that part of the pattern
o Example: ([a-zA-Z]*)([0-9]*) matches any number of letters followed by any number of digits
o If the match succeeds, \1 holds the matched letters and \2 holds the matched digits
o In addition, group 0 holds everything matched by the entire pattern (in Java, read it with m.group(0) or m.group(); \0 is not a valid back-reference)
• Capturing groups are numbered by counting their opening parentheses from left to right:
o ((A)(B(C)))
  12  3 4
o group 0 = group 1 = ((A)(B(C))), group 2 = (A), group 3 = (B(C)), group 4 = (C)
• Example: ([a-zA-Z])\1 will match a double letter, such as the tt in letter

Capturing Groups - 2

• If m is a matcher that has just performed a successful match, then
o m.group(n) returns the String matched by capturing group n
• This could be an empty string
• This will be null if the pattern as a whole matched but this particular group didn’t match anything
o m.group() returns the String matched by the entire pattern (same as m.group(0))
• This could be an empty string
• If m didn’t match (or wasn’t tried), then these methods will throw an IllegalStateException

Example of Capturing Groups
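
A minimal sketch of reading captured groups, not necessarily the listing from the original slide. It uses the letters-then-digits pattern from the Capturing Groups slide; the helper groups and the sample inputs are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Returns groups 0..groupCount() of a full match, or null if the text does not match.
    public static String[] groups(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String regex = "([a-zA-Z]*)([0-9]*)";
        System.out.println(Arrays.toString(groups(regex, "abc123"))); // [abc123, abc, 123]
        // Group 1 took part in the match but matched nothing: empty string, not null.
        System.out.println(Arrays.toString(groups(regex, "123")));    // [123, , 123]
        // Group 1 did not take part in the match at all: null.
        System.out.println(Arrays.toString(groups("(a)|(b)", "b")));  // [b, null, b]
    }
}
```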
Example of Numbering Capture Group
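
A minimal sketch of group numbering using the nested pattern ((A)(B(C))) from the Capturing Groups slide, plus the double-letter back-reference. The helpers describe and firstMatch are hypothetical names introduced for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupNumberingDemo {
    // Lists every group of a full match as "number=text", separated by spaces.
    public static String describe(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.matches()) {
            return "no match";
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i <= m.groupCount(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(i).append('=').append(m.group(i));
        }
        return sb.toString();
    }

    // First substring of text matched by regex, or null if none.
    public static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Groups are numbered by the position of their opening parenthesis.
        System.out.println(describe("((A)(B(C)))", "ABC")); // 0=ABC 1=ABC 2=A 3=BC 4=C
        // \1 inside the pattern refers back to whatever group 1 matched.
        System.out.println(firstMatch("([a-zA-Z])\\1", "letter")); // tt
    }
}
```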
Example use of capturing groups

• Suppose word holds a word in English
• Also suppose we want to move all the consonants at the beginning of word (if any) to the end of the word (so string becomes ingstr)
• Note the use of (.*) to indicate “all the rest of the characters”
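
One way this could be written is sketched here. It assumes that "consonant" means any character other than the lowercase vowels a, e, i, o, u, and the class and method names (ConsonantMover, moveLeadingConsonants) are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ConsonantMover {
    // Moves any leading run of non-vowels to the end of the word.
    public static String moveLeadingConsonants(String word) {
        // Group 1: leading non-vowels (possibly empty); group 2: all the rest.
        Pattern p = Pattern.compile("([^aeiou]*)(.*)");
        Matcher m = p.matcher(word);
        if (m.matches()) {
            return m.group(2) + m.group(1);
        }
        return word; // only reached if word contains a line terminator
    }

    public static void main(String[] args) {
        System.out.println(moveLeadingConsonants("string")); // ingstr
        System.out.println(moveLeadingConsonants("apple"));  // apple
    }
}
```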
Double backslashes

• Backslashes have a special meaning in regular expressions; for example, \b means a word boundary
• Backslashes have a special meaning in Java; for example, \b means the backspace character
• Java syntax rules apply first!
• If you write "\b[a-z]+\b" you get a string with backspace characters in it, which is not what you want!
• Remember, you can quote a backslash with another backslash, so "\\b[a-z]+\\b" gives the correct string
• Note: if you read in a String from somewhere, this does not apply; you get whatever characters are actually there
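
The difference between the two string literals can be seen in a short sketch. The helper count and the sample text are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BackslashDemo {
    // Counts the matches of regex in text.
    public static int count(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String right = "\\b[a-z]+\\b"; // regex sees \b: word boundary
        String wrong = "\b[a-z]+\b";   // regex sees two backspace characters
        System.out.println(right.length()); // 10
        System.out.println(wrong.length()); // 8
        System.out.println(count(right, "now is the time")); // 4
        System.out.println(count(wrong, "now is the time")); // 0: the text has no backspaces
    }
}
```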
Escaping metacharacters

• A lot of special characters (parentheses, brackets, braces, stars, plus signs, etc.) are used in defining regular expressions; these are called metacharacters
• Suppose you want to search for the character sequence a* (an a followed by a star)
o "a*" doesn’t work; that means “zero or more as”
o "a\*" doesn’t work; \* is not a legal escape sequence in a Java String constant, so the compiler rejects it
o "a\\*" does work; it’s the three-character string a, \, *
• Just to make things even more difficult, it’s illegal to put a backslash before an alphabetic character that is not a defined escape (non-alphabetic characters may always be escaped)
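
A short sketch of searching for the literal sequence a*, and of which escapes the regex compiler accepts. The helpers found and compiles, and the choice of \y as an undefined alphabetic escape, are assumptions for illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class EscapeDemo {
    // True if regex matches somewhere in text.
    public static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    // True if regex is accepted by Pattern.compile.
    public static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(found("a*", "bbb"));    // true: "zero or more as" matches the empty string
        System.out.println(found("a\\*", "bbb"));  // false: no literal a* in the text
        System.out.println(found("a\\*", "xa*y")); // true
        System.out.println(compiles("\\y"));       // false: \y is not a defined escape
        System.out.println(compiles("\\-"));       // true: non-alphabetic characters may be escaped
    }
}
```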
Spaces

• There is only one thing to be said about spaces (blanks) in regular expressions, but it’s important:
o Spaces are significant!
• A space stands for a space: when you put a space in a pattern, that means to match a space in the text string
• It’s a really bad idea to put spaces in a regular expression just to make it look better
• Ex:
o Pattern.compile("a b+").matcher("abb").matches() returns false.

Additions to the String class

• All of the following are public
o public boolean matches(String regex)
o public String replaceFirst(String regex, String replacement)
o public String replaceAll(String regex, String replacement)
o public String[] split(String regex)
o public String[] split(String regex, int limit)
• If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter.
• If n is non-positive then the pattern will be applied as many times as possible; if n is zero, trailing empty strings are also discarded
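
The effect of the limit argument can be seen on a small sample. This is a sketch; the helper show and the input "a,b,c,," are assumptions for illustration.

```java
import java.util.Arrays;

public class SplitDemo {
    // Splits text around regex with the given limit and formats the resulting array.
    public static String show(String text, String regex, int limit) {
        return Arrays.toString(text.split(regex, limit));
    }

    public static void main(String[] args) {
        String text = "a,b,c,,";
        System.out.println(Arrays.toString(text.split(","))); // [a, b, c]
        System.out.println(show(text, ",", 2));  // [a, b,c,,]     pattern applied once
        System.out.println(show(text, ",", 0));  // [a, b, c]      trailing empty strings dropped
        System.out.println(show(text, ",", -1)); // [a, b, c, , ]  trailing empty strings kept
    }
}
```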

Thinking in regular expressions

• Regular expressions are not easy to use at first
o It’s a bunch of punctuation, not words
o The individual pieces are not hard, but it takes practice to learn to put them together correctly
o Regular expressions form a miniature programming language
• It’s a different kind of programming language than Java, and requires you to learn new thought patterns
o In Java you can’t just use a regular expression; you have to first create Patterns and Matchers
o Java’s syntax for String constants doesn’t help, either
• Despite all this, regular expressions bring so much power and convenience to String manipulation that they are well worth the effort of learning

That’s all for this session!

Thank you all for your attention and patience!