Perlre - Perl Regular Expressions Used in WM
NAME
perlre - Perl regular expressions
DESCRIPTION
This page describes the syntax of regular expressions in Perl.
If you haven't used regular expressions before, a quick-start introduction is available in perlrequick,
and a longer tutorial introduction is available in perlretut.
For reference on how regular expressions are used in matching operations, plus various examples of
the same, see discussions of m//, s///, qr// and ?? in "Regexp Quote-Like Operators" in perlop.
Modifiers
Matching operations can have various modifiers. Modifiers that relate to the interpretation of the
regular expression inside are listed below. Modifiers that alter the way a regular expression is used by
Perl are detailed in "Regexp Quote-Like Operators" in perlop and "Gory details of parsing quoted
constructs" in perlop.
m
Treat string as multiple lines. That is, change "^" and "$" from matching the start of the string's
first line and the end of its last line to matching the start and end of each line within the string.
s
Treat string as single line. That is, change "." to match any character whatsoever, even a
newline, which normally it would not match.
Used together, as /ms, they let the "." match any character whatsoever, while still allowing "^"
and "$" to match, respectively, just after and just before newlines within the string.
i
Do case-insensitive pattern matching.
If locale matching rules are in effect, the case map is taken from the current locale for code
points less than 255, and from Unicode rules for larger code points. However, matches that
would cross the Unicode rules/non-Unicode rules boundary (ords 255/256) will not succeed.
See perllocale.
There are a number of Unicode characters that match multiple characters under /i. For
example, LATIN SMALL LIGATURE FI should match the sequence fi. Perl is not currently
able to do this when the multiple characters are in the pattern and are split between
groupings, or when one or more are quantified. Thus
    "\N{LATIN SMALL LIGATURE FI}" =~ /fi/i;          # Matches
    "\N{LATIN SMALL LIGATURE FI}" =~ /[fi][fi]/i;    # Doesn't match!
    "\N{LATIN SMALL LIGATURE FI}" =~ /fi*/i;         # Doesn't match!

    # The below doesn't match, and it isn't clear what $1 and $2 would
    # be even if it did!!
    "\N{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i;      # Doesn't match!
Perl doesn't match multiple characters in a bracketed character class unless the character that
maps to them is explicitly mentioned, and it doesn't match them at all if the character class is
inverted, which otherwise could be highly confusing. See "Bracketed Character Classes" in
perlrecharclass, and "Negation" in perlrecharclass.
x
Extend your pattern's legibility by permitting whitespace and comments. Details in /x
p
http://perldoc.perl.org
Regular expression modifiers are usually written in documentation as e.g., "the /x modifier", even
though the delimiter in question might not really be a slash. The modifiers /imsxadlup may also be
embedded within the regular expression itself using the (?...) construct, see Extended Patterns
below.
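As a minimal sketch of the difference between a trailing modifier and an embedded one, the following compares /i, (?i) and the group-scoped form (?i:...). The variable names are illustrative only.

```perl
use strict;
use warnings;

my $subject = "HELLO WORLD";

# Trailing modifier: the whole pattern is case-insensitive.
my $trailing = $subject =~ /hello world/i ? 1 : 0;

# (?i) embedded at the start: same effect as the trailing /i here.
my $embedded = $subject =~ /(?i)hello world/ ? 1 : 0;

# (?i:...) limits the modifier to the enclosed group, so " world"
# is still matched case-sensitively and the match fails.
my $scoped = $subject =~ /(?i:hello) world/ ? 1 : 0;

print "trailing=$trailing embedded=$embedded scoped=$scoped\n";
```

The scoped form is handy when only part of a pattern should ignore case.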
/x
/x tells the regular expression parser to ignore most whitespace that is neither backslashed nor
within a bracketed character class. You can use this to break up your regular expression into (slightly)
more readable parts. Also, the # character is treated as a metacharacter introducing a comment that
runs up to the pattern's closing delimiter, or to the end of the current line if the pattern extends onto
the next line. Hence, this is very much like an ordinary Perl code comment. (You can include the
closing delimiter within the comment only if you precede it with a backslash, so be careful!)
Use of /x means that if you want real whitespace or # characters in the pattern (outside a bracketed
character class, which is unaffected by /x), then you'll either have to escape them (using backslashes
or \Q...\E) or encode them using octal, hex, or \N{} escapes. It is ineffective to try to continue a
comment onto the next line by escaping the \n with a backslash or \Q.
You can use (?#text) to create a comment that ends earlier than the end of the current line, but text
also can't contain the closing delimiter unless escaped with a backslash.
Taken together, these features go a long way towards making Perl's regular expressions more
readable. Here's an example:
    # Delete (most) C comments.
    $program =~ s {
        /\*     # Match the opening delimiter.
        .*?     # Match a minimal number of characters.
        \*/     # Match the closing delimiter.
    } []gsx;
the pattern explicitly mentions a code point that is above 255 (say by \x{100}); or
Another mnemonic for this modifier is "Depends", as the rules actually used depend on various things,
and as a result you can get unexpected results. See "The "Unicode Bug"" in perlunicode. The
Unicode Bug has become rather infamous, leading to yet another (printable) name for this modifier,
"Dodgy".
Unless the pattern or string are encoded in UTF-8, only ASCII characters can match positively.
Here are some examples of how that works on an ASCII platform:
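    $str =  "\xDF";      # $str is not in UTF-8 format.
    $str =~ /^\w/;       # No match, as $str isn't in UTF-8 format.
    $str .= "\x{0e0b}";  # Now $str is in UTF-8 format.
    $str =~ /^\w/;       # Match! $str is now in UTF-8 format.
    chop $str;
    $str =~ /^\w/;       # Still a match! $str remains in UTF-8 format.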
This modifier is automatically selected by default when none of the others are, so yet another name
for it is "Default".
Because of the unexpected behaviors associated with this modifier, you probably should only use it to
maintain weird backward compatibilities.
/a (and /aa)
This modifier stands for ASCII-restrict (or ASCII-safe). This modifier, unlike the others, may be
doubled-up to increase its effect.
When it appears singly, it causes the sequences \d, \s, \w, and the Posix character classes to
match only in the ASCII range. They thus revert to their pre-5.6, pre-Unicode meanings. Under /a, \d
always means precisely the digits "0" to "9"; \s means the five characters [ \f\n\r\t], and
starting in Perl v5.18, experimentally, the vertical tab; \w means the 63 characters [A-Za-z0-9_];
and likewise, all the Posix classes such as [[:print:]] match only the appropriate ASCII-range
characters.
This modifier is useful for people who only incidentally use Unicode, and who do not wish to be
burdened with its complexities and security concerns.
With /a, one can write \d with confidence that it will only match ASCII characters, and should the
need arise to match beyond ASCII, you can instead use \p{Digit} (or \p{Word} for \w). There are
similar \p{...} constructs that can match beyond ASCII both white space (see "Whitespace" in
perlrecharclass), and Posix classes (see "POSIX Character Classes" in perlrecharclass). Thus, this
modifier doesn't mean you can't use Unicode, it means that to get Unicode matching you must
explicitly use a construct (\p{}, \P{}) that signals Unicode.
As you would expect, this modifier causes, for example, \D to mean the same thing as [^0-9]; in
fact, all non-ASCII characters match \D, \S, and \W. \b still means to match at the boundary between
\w and \W, using the /a definitions of them (similarly for \B).
Otherwise, /a behaves like the /u modifier, in that case-insensitive matching uses Unicode rules; for
example, "k" will match the Unicode \N{KELVIN SIGN} under /i matching, and code points in the
Latin1 range, above ASCII will have Unicode rules when it comes to case-insensitive matching.
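The effect of /a can be sketched as follows, assuming a perl recent enough to support the modifier (5.14 or later). The two sample characters are chosen for illustration: an Arabic-Indic digit and the Kelvin sign mentioned above.

```perl
use strict;
use warnings;

my $arabic_three = "\x{0663}";   # ARABIC-INDIC DIGIT THREE
my $kelvin       = "\x{212A}";   # KELVIN SIGN

# Without /a, \d follows Unicode rules for this string.
my $unicode_digit  = $arabic_three =~ /^\d$/  ? 1 : 0;

# With /a, \d is restricted to "0" through "9" ...
my $ascii_digit    = $arabic_three =~ /^\d$/a ? 1 : 0;

# ... and every non-ASCII character matches \D.
my $ascii_nondigit = $arabic_three =~ /^\D$/a ? 1 : 0;

# Case-insensitive matching under a single /a still uses Unicode rules.
my $kelvin_folds   = $kelvin =~ /^k$/ia ? 1 : 0;

print "$unicode_digit $ascii_digit $ascii_nondigit $kelvin_folds\n";
```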
Regular Expressions
Metacharacters
The patterns used in Perl pattern matching evolved from those supplied in the Version 8 regex
routines. (The routines are derived (distantly) from Henry Spencer's freely redistributable
reimplementation of the V8 routines.) See Version 8 Regular Expressions for details.
In particular the following metacharacters have their standard egrep-ish meanings:
    \        Quote the next metacharacter
    ^        Match the beginning of the line
    .        Match any character (except newline)
    $        Match the end of the string (or before newline at the end
             of the string)
    |        Alternation
    ()       Grouping
    []       Bracketed Character class
By default, the "^" character is guaranteed to match only the beginning of the string, the "$" character
only the end (or before the newline at the end), and Perl does certain optimizations with the
assumption that the string contains only one line. Embedded newlines will not be matched by "^" or
"$". You may, however, wish to treat a string as a multi-line buffer, such that the "^" will match after
any newline within the string (except if the newline is the last character in the string), and "$" will
match before any newline. At the cost of a little more overhead, you can do this by using the /m
modifier on the pattern match operator. (Older programs did this by setting $*, but this option was
removed in perl 5.10.)
To simplify multi-line substitutions, the "." character never matches a newline unless you use the /s
modifier, which in effect tells Perl to pretend the string is a single line--even if it isn't.
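The behaviour of "^", "$" and "." described above can be checked with a short sketch. The sample text and variable names are made up for the illustration.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without /m, "^" matches only at the very start of the string.
my @plain = $text =~ /^(\w+)/g;

# With /m, "^" also matches just after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;

# Without /s, "." stops at the first newline.
my ($one_line) = $text =~ /(first.*)/;

# With /s, "." crosses newlines, so .* runs to the end of the string.
my ($all) = $text =~ /(first.*)/s;

print scalar(@plain), " ", scalar(@multi), "\n";
```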
Quantifiers
The following standard quantifiers are recognized:
    *        Match 0 or more times
    +        Match 1 or more times
    ?        Match 1 or 0 times
    {n}      Match exactly n times
    {n,}     Match at least n times
    {n,m}    Match at least n but not more than m times
(If a curly bracket occurs in any other context and does not form part of a backslashed sequence like
\x{...}, it is treated as a regular character. In particular, the lower quantifier bound is not optional,
and a typo in a quantifier silently causes it to be treated as the literal characters. For example,
/o{4,a}/
compiles to match the sequence of six characters "o { 4 , a }". It is planned to eventually
require literal uses of curly brackets to be escaped, say by preceding them with a backslash or
enclosing them within square brackets, ("\{" or "[{]"). This change will allow for future syntax
extensions (like making the lower bound of a quantifier optional), and better error checking. In the
meantime, you should get in the habit of escaping all instances where you mean a literal "{".)
The "*" quantifier is equivalent to {0,}, the "+" quantifier to {1,}, and the "?" quantifier to {0,1}. n
and m are limited to non-negative integral values less than a preset limit defined when perl is built.
This is usually 32766 on the most common platforms. The actual limit can be seen in the error
message generated by code such as this:
$_ **= $_ , / {$_} / for 2 .. 42;
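A brief sketch of the counted quantifiers, with the literal braces escaped as recommended above. The test strings are arbitrary.

```perl
use strict;
use warnings;

my $exact    = "aaa"   =~ /^a{3}$/   ? 1 : 0;   # exactly three
my $at_least = "aaaaa" =~ /^a{2,}$/  ? 1 : 0;   # two or more
my $in_range = "aaaaa" =~ /^a{2,4}$/ ? 1 : 0;   # five is too many
my $star     = ""      =~ /^a{0,}$/  ? 1 : 0;   # same as a*

# Escaped braces are always literal, whatever they enclose.
my $literal  = "o{4,a}" =~ /^o\{4,a\}$/ ? 1 : 0;

print "$exact $at_least $in_range $star $literal\n";
```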
By default, a quantified subpattern is "greedy", that is, it will match as many times as possible (given a
particular starting location) while still allowing the rest of the pattern to match. If you want it to match
the minimum number of times possible, follow the quantifier with a "?". Note that the meanings don't
change, just the "greediness":
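    *?        Match 0 or more times, not greedily
    +?        Match 1 or more times, not greedily
    ??        Match 0 or 1 time, not greedily
    {n}?      Match exactly n times, not greedily (redundant)
    {n,}?     Match at least n times, not greedily
    {n,m}?    Match at least n but not more than m times, not greedily
By default, when a quantified subpattern does not allow the rest of the overall pattern to match, Perl
will backtrack. However, this behaviour is sometimes undesirable. Thus Perl provides the
"possessive" quantifier form as well.
    *+        Match 0 or more times and give nothing back
    ++        Match 1 or more times and give nothing back
    ?+        Match 0 or 1 time and give nothing back
    {n}+      Match exactly n times and give nothing back (redundant)
    {n,}+     Match at least n times and give nothing back
    {n,m}+    Match at least n but not more than m times and give nothing back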
For instance,
'aaaa' =~ /a++a/
will never match, as the a++ will gobble up all the a's in the string and won't leave any for the
remaining part of the pattern. This feature can be extremely useful to give perl hints about where it
shouldn't backtrack. For instance, the typical "match a double-quoted string" problem can be most
efficiently performed when written as:
/"(?:[^"\\]++|\\.)*+"/
as we know that if the final quote does not match, backtracking will not help. See the independent
subexpression (?>pattern) for more details; possessive quantifiers are just syntactic sugar for that
construct. For instance the above example could also be written as follows:
/"(?>(?:(?>[^"\\]+)|\\.)*)"/
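The three flavours of quantifier can be compared side by side. This is a small sketch with made-up input; the last pattern is the double-quoted-string pattern shown above, wrapped in a capture group.

```perl
use strict;
use warnings;

my $s = '"one" and "two"';

# Greedy: .* runs to the last quote it can still end on.
my ($greedy) = $s =~ /"(.*)"/;

# Non-greedy: .*? stops at the first closing quote.
my ($lazy) = $s =~ /"(.*?)"/;

# Possessive: a++ keeps every "a", leaving none for the final "a".
my $possessive = 'aaaa' =~ /a++a/ ? 1 : 0;

# The double-quoted-string pattern, applied to text with escaped quotes.
my ($quoted) = 'say "a \"b\" c" now' =~ /("(?:[^"\\]++|\\.)*+")/;

print "$greedy | $lazy | $possessive | $quoted\n";
```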
Note that the possessive quantifier modifier cannot be combined with the non-greedy modifier.
This is because it would make no sense. Consider the following equivalency table:
    Illegal         Legal
    ------------    ------
    X??+            X{0}
    X+?+            X{1}
    X{min,max}?+    X{min}
Escape sequences
Because patterns are processed as double-quoted strings, the following also work:
    \t          tab                   (HT, TAB)
    \n          newline               (LF, NL)
    \r          return                (CR)
    \f          form feed             (FF)
    \a          alarm (bell)          (BEL)
    \e          escape (think troff)  (ESC)
    \cK         control char          (example: VT)
    \x{}, \x00  character whose ordinal is the given hexadecimal number
    \N{name}    named Unicode character or character sequence
    \N{U+263D}  Unicode character     (example: FIRST QUARTER MOON)
    \o{}, \000  character whose ordinal is the given octal number
    \l          lowercase next char (think vi)
    \u          uppercase next char (think vi)
    \L          lowercase until \E (think vi)
    \U          uppercase until \E (think vi)
    \Q          quote (disable) pattern metacharacters until \E
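A few of these escapes in action, as a sketch. The \o{} form assumes perl 5.14 or later, and $name is a placeholder variable.

```perl
use strict;
use warnings;

my $name = "perl";

my $tab     = "a\tb"  =~ /^a\tb$/    ? 1 : 0;   # \t is a tab
my $hex     = "A"     =~ /^\x41$/    ? 1 : 0;   # hexadecimal 41 is "A"
my $octal   = "A"     =~ /^\o{101}$/ ? 1 : 0;   # octal 101 is "A"
my $control = "\x0B"  =~ /^\cK$/     ? 1 : 0;   # control-K is VT

# Case escapes are applied when the variable is interpolated.
my $upper   = "PERL"  =~ /^\U$name\E$/ ? 1 : 0;
my $ucfirst = "Perl"  =~ /^\u$name$/   ? 1 : 0;

print "$tab $hex $octal $control $upper $ucfirst\n";
```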
    Sequence   Note   Description
    [...]      [1]    Match a character according to the rules of the
                      bracketed character class defined by the "...".
                      Example: [a-z] matches "a" or "b" or "c" ... or "z"
    [[:...:]]  [2]    Match a character according to the rules of the POSIX
                      character class "..." within the outer bracketed
                      character class. Example: [[:upper:]] matches any
                      uppercase character.
    (?[...])   [8]    Extended bracketed character class
    \w         [3]    Match a "word" character (alphanumeric plus "_", plus
                      other connector punctuation chars plus Unicode
                      marks)
    \W         [3]    Match a non-"word" character
    \s         [3]    Match a whitespace character
    \S         [3]    Match a non-whitespace character
    \d         [3]    Match a decimal digit character
    \D         [3]    Match a non-digit character
    \pP        [3]    Match P, named property. Use \p{Prop} for longer names
    \PP        [3]    Match non-P
    \X         [4]    Match Unicode "eXtended grapheme cluster"
    \C                Match a single C-language char (octet) even if that is
                      part of a larger UTF-8 character. Thus it breaks up
                      characters into their UTF-8 bytes, so you may end up
                      with malformed pieces of UTF-8. Unsupported in
                      lookbehind. (Deprecated.)
    \1         [5]    Backreference to a specific capture group or buffer.
                      '1' may actually be any positive integer.
    \g1        [5]    Backreference to a specific or previous group,
    \g{-1}     [5]    The number may be negative indicating a relative
                      previous group and may optionally be wrapped in
                      curly brackets for safer parsing.
    \g{name}   [5]    Named backreference
    \k<name>   [5]    Named backreference
    \K         [6]    Keep the stuff left of the \K, don't include it in $&
    \N         [7]    Any character but \n. Not affected by /s modifier
    \v         [3]    Vertical whitespace
    \V         [3]    Not vertical whitespace
    \h         [3]    Horizontal whitespace
    \H         [3]    Not horizontal whitespace
    \R         [4]    Linebreak
[1] See "Bracketed Character Classes" in perlrecharclass for details.
[2] See "POSIX Character Classes" in perlrecharclass for details.
[3]
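Several of the sequences from the table can be exercised in a short sketch. \R, \h and \K assume perl 5.10 or later; the sample strings are invented.

```perl
use strict;
use warnings;

# \w includes the underscore but not the hyphen.
my ($word) = "foo_bar-baz" =~ /^(\w+)/;

# \d+ with /g in list context collects every run of digits.
my @digits = "a1b22c333" =~ /(\d+)/g;

# \R treats "\r\n" as a single linebreak.
my @lines = split /\R/, "a\r\nb\nc";

# \h matches horizontal whitespace such as spaces and tabs.
my ($gap) = "a \t b" =~ /a(\h+)b/;

# \K keeps "price: " out of the matched text, so only the digits change.
(my $kept = "price: 42") =~ s/price:\s*\K\d+/99/;

print "$word @digits @lines $kept\n";
```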
A word boundary (\b) is a spot between two characters that has a \w on one side of it and a \W on
the other side of it (in either order), counting the imaginary characters off the beginning and end of the
string as matching a \W. (Within character classes \b represents backspace rather than a word
boundary, just as it normally does in any double-quoted string.) The \A and \Z are just like "^" and
"$", except that they won't match multiple times when the /m modifier is used, while "^" and "$" will
match at every internal line boundary. To match the actual end of the string and not ignore an optional
trailing newline, use \z.
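The differences between "$", \Z, \z and \A are easiest to see on a string with a trailing newline. A minimal sketch, with arbitrary test strings:

```perl
use strict;
use warnings;

my $line = "foo\n";

my $dollar  = $line =~ /foo$/  ? 1 : 0;   # allows the trailing newline
my $big_z   = $line =~ /foo\Z/ ? 1 : 0;   # same as "$" without /m
my $small_z = $line =~ /foo\z/ ? 1 : 0;   # true end of string only

# Under /m, "^" matches at each line start, while \A does not.
my @caret  = "a\nb\nc" =~ /^(\w)/mg;
my @anchor = "a\nb\nc" =~ /\A(\w)/mg;

# \b requires a word boundary on each side, so "concat" is skipped.
my $count = () = "cat concat" =~ /\bcat\b/g;

print "$dollar $big_z $small_z @caret | @anchor | $count\n";
```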
The \G assertion can be used to chain global matches (using m//g), as described in "Regexp
Quote-Like Operators" in perlop. It is also useful when writing lex-like scanners, when you have
several patterns that you want to match against consequent substrings of your string; see the
previous reference. The actual location where \G will match can also be influenced by using pos()
as an lvalue: see "pos" in perlfunc. Note that the rule for zero-length matches (see Repeated Patterns
Matching a Zero-length Substring) is modified somewhat, in that contents to the left of \G are not
counted when determining the length of the match. Thus the following will not match forever:
my $string = 'ABC';
pos($string) = 1;
while ($string =~ /(.\G)/g) {
print $1;
}
It will print 'A' and then terminate, as it considers the match to be zero-width, and thus will not match
at the same position twice in a row.
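The lex-like use of \G mentioned above might look like the following sketch. The token classes and the input are invented for the example; the /c modifier (described in perlop) keeps pos() in place when an alternative fails, so the next pattern is tried at the same offset.

```perl
use strict;
use warnings;

my $input = "x = 42 + y";
my @tokens;

pos($input) = 0;
while (pos($input) < length $input) {
    if    ($input =~ /\G\s+/gc)             { next }   # skip whitespace
    elsif ($input =~ /\G(\d+)/gc)           { push @tokens, "NUM:$1" }
    elsif ($input =~ /\G([A-Za-z_]\w*)/gc)  { push @tokens, "ID:$1" }
    elsif ($input =~ /\G([=+])/gc)          { push @tokens, "OP:$1" }
    else {
        die "unexpected character at offset " . pos($input) . "\n";
    }
}

print join(" ", @tokens), "\n";
```

Every branch either consumes at least one character or dies, so the loop always makes progress.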
It is worth noting that \G improperly used can result in an infinite loop. Take care when using patterns
    /
     (Y)            # group 1
     (              # group 2
        (X)         # group 3
        \g{-1}      # backref to group 3
        \g{-3}      # backref to group 1
     )
    /x
would match the same as /(Y) ( (X) \g3 \g1 )/x. This allows you to interpolate regexes into
larger regexes and not have to worry about the capture groups being renumbered.
You can dispense with numbers altogether and create named capture groups. The notation is
(?<name>...) to declare and \g{name} to reference. (To be compatible with .Net regular expressions,
\g{name} may also be written as \k{name}, \k<name> or \k'name'.) name must not begin with a
number, nor contain hyphens. When different groups within the same pattern have the same name,
any reference to that name assumes the leftmost defined group. Named groups count in absolute and
relative numbering, and so can also be referred to by those numbers. (It's possible to do things with
named capture groups that would otherwise require (??{}).)
Capture group contents are dynamically scoped and available to you outside the pattern until the end
of the enclosing block or until the next successful match, whichever comes first. (See "Compound
Statements" in perlsyn.) You can refer to them by absolute number (using "$1" instead of "\g1",
etc); or by name via the %+ hash, using "$+{name}".
Braces are required in referring to named capture groups, but are optional for absolute or relative
numbered ones. Braces are safer when creating a regex by concatenating smaller strings. For
example if you have qr/$a$b/, and $a contained "\g1", and $b contained "37", you would get
/\g137/ which is probably not what you intended.
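Named groups, relative backreferences and the braces advice can be sketched together. The group names and sample strings are illustrative; perl 5.10 or later is assumed.

```perl
use strict;
use warnings;

my %time;
if ("12:34:56" =~ /^(?<hours>\d\d):(?<minutes>\d\d):(?<seconds>\d\d)$/) {
    %time = %+;            # copy the named captures out of %+
}

# \g{-1} refers to the group that closed most recently.
my ($relative) = "xyzz" =~ /(.)\g{-1}/;

# \g{char} refers to the group by name.
my $named = "xyzz" =~ /(?<char>.)\g{char}/ ? $+{char} : undef;

# Braces keep a following digit from being read as part of the number.
my $fragment = '(.)\g{1}';
my $safe = "aa0" =~ /${fragment}0/ ? 1 : 0;

print "$time{hours} $time{minutes} $time{seconds} $relative $named $safe\n";
```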
The \g and \k notations were introduced in Perl 5.10.0. Prior to that there were no named nor
relative numbered capture groups. Absolute numbered groups were referred to using \1, \2, etc.,
and this notation is still accepted (and likely always will be). But it leads to some ambiguities if there
    /(.)\g1/                        # find first doubled char
         and print "'$1' is the first doubled character\n";

    /(?<char>.)\k<char>/            # ... a different way
         and print "'$+{char}' is the first doubled character\n";

    /(?'char'.)\g1/                 # ... mix and match
         and print "'$1' is the first doubled character\n";

    if (/Time: (..):(..):(..)/) {
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }

    /(.)(.)(.)(.)(.)(.)(.)(.)(.)\g10/   # \g10 is a backreference
    /(.)(.)(.)(.)(.)(.)(.)(.)(.)\10/    # \10 is octal
    /((.)(.)(.)(.)(.)(.)(.)(.)(.))\10/  # \10 is a backreference
    /((.)(.)(.)(.)(.)(.)(.)(.)(.))\010/ # \010 is octal

    $a = '(.)\1';        # Creates problems when concatenated.
    $b = '(.)\g{1}';     # Avoids the problems.
    "aa" =~ /${a}/;      # True
    "aa" =~ /${b}/;      # True
    "aa0" =~ /${a}0/;    # False!
    "aa0" =~ /${b}0/;    # True
    "aa\x08" =~ /${a}0/;  # True!
    "aa\x08" =~ /${b}0/;  # False
Several special variables also refer back to portions of the previous match. $+ returns whatever the
last bracket match matched. $& returns the entire matched string. (At one point $0 did also, but now it
returns the name of the program.) $` returns everything before the matched string. $' returns
everything after the matched string. And $^N contains whatever was matched by the most-recently
closed group (submatch). $^N can be used in extended patterns (see below), for example to assign a
submatch to a variable.
These special variables, like the %+ hash and the numbered match variables ($1, $2, $3, etc.) are
dynamically scoped until the end of the enclosing block or until the next successful match, whichever
comes first. (See "Compound Statements" in perlsyn.)
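A small sketch of these variables after one successful match. The values are copied inside the block because of the dynamic scoping just described; the pattern and string are arbitrary.

```perl
use strict;
use warnings;

my ($before, $matched, $after, $last_group, $recent);

if ("hello world" =~ /o (w)(or)/) {
    ($before, $matched, $after) = ($`, $&, $');
    $last_group = $+;     # text of the last group that matched
    $recent     = $^N;    # text of the most recently closed group
}

print "[$before] [$matched] [$after] [$last_group] [$recent]\n";
```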
Quoting metacharacters
Backslashed metacharacters in Perl are alphanumeric, such as \b, \w, \n. Unlike some other regular
expression languages, there are no backslashed symbols that aren't alphanumeric. So anything that
looks like \\, \(, \), \[, \], \{, or \} is always interpreted as a literal character, not a metacharacter. This
was once used in a common idiom to disable or quote the special meanings of regular expression
metacharacters in a string that you want to use for a pattern. Simply quote all non-"word" characters:
$pattern =~ s/(\W)/\\$1/g;
(If use locale is set, then this depends on the current locale.) Today it is more common to use the
quotemeta() function or the \Q metaquoting escape sequence to disable all metacharacters' special
meanings like this:
/$unquoted\Q$quoted\E$unquoted/
Beware that if you put literal backslashes (those not inside interpolated variables) between \Q and \E,
double-quotish backslash interpolation may lead to confusing results. If you need to use literal
backslashes within \Q...\E, consult "Gory details of parsing quoted constructs" in perlop.
quotemeta() and \Q are fully described in "quotemeta" in perlfunc.
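The effect of quoting can be seen by matching a string full of metacharacters against itself. A minimal sketch with an invented string:

```perl
use strict;
use warnings;

my $literal = "1+1=2 (really?)";
my $text    = "sum: 1+1=2 (really?)";

# Interpolated as-is, "+", "(", ")" and "?" act as metacharacters.
my $unquoted = $text =~ /$literal/ ? 1 : 0;

# \Q...\E disables them, so the text is matched literally.
my $inline = $text =~ /\Q$literal\E/ ? 1 : 0;

# quotemeta() produces the same quoting ahead of time.
my $escaped    = quotemeta $literal;
my $precooked  = $text =~ /$escaped/ ? 1 : 0;

print "$unquoted $inline $precooked\n";
```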
Extended Patterns
Perl also defines a consistent extension syntax for features not found in standard tools like awk and
lex. The syntax for most of these is a pair of parentheses with a question mark as the first thing within
the parentheses. The character after the question mark indicates the extension.
The stability of these extensions varies widely. Some have been part of the core language for many
years. Others are experimental and may change without warning or be completely removed. Check
the documentation on an individual feature to verify its current status.
A question mark was chosen for this and for the minimal-matching construct because 1) question
marks are rare in older regular expressions, and 2) whenever you see one, you should stop and
"question" exactly what is going on. That's psychology....
(?<a> x ) (?<b> y )
(?<a> z ) (?<b> w )) /x
# Initialize $cnt.
# Update $cnt,
# backtracking-safe.
# On success copy to
# non-localized location.
>x;
will initially increment $cnt up to 8; then during backtracking, its value will be unwound back
(??{ $re })
|
)*
\)
}x;
See also (?PARNO) for a different, more efficient way to accomplish the same task.
\)
)
)
}x;
If the pattern was used as follows
'foo(bar(baz)+baz(bop))'=~/$re/
and print "\$1 = $1\n",
"\$2 = $2\n",
"\$3 = $3\n";
the output produced should be the following:
$1 = foo(bar(baz)+baz(bop))
$2 = (bar(baz)+baz(bop))
$3 = bar(baz)+baz(bop)
If there is no corresponding capture group defined, then it is a fatal error. Recursing deeper
than 50 times without consuming any input string will also result in a fatal error. The maximum
depth is compiled into perl, so changing it requires a custom build.
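As a self-contained sketch of recursion into a capture group, the pattern below matches one balanced parenthesized group by calling group 1 from inside itself with (?1). The name $balanced and the test strings are assumptions made for this illustration, not the pattern from the example above.

```perl
use strict;
use warnings;

# Group 1: "(", then any mix of non-parentheses or nested groups, then ")".
my $balanced = qr/ ( \( (?: [^()]++ | (?1) )* \) ) /x;

# The first balanced group in the string, nesting included.
my ($group) = "f(a(b)c)(d)" =~ $balanced;

# Anchored, the pattern accepts only a fully balanced string.
my $ok  = "(a(b)(c))" =~ /^$balanced$/ ? 1 : 0;
my $bad = "(a(b)"     =~ /^$balanced$/ ? 1 : 0;

print "$group $ok $bad\n";
```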
The following shows how using negative indexing can make it easier to embed recursive
# First capture
# Second capture
Will output 2, not 1. This is particularly important if you intend to compile the definitions with
the qr// operator, and later interpolate them in another pattern.
(?>pattern)
An "independent" subexpression, one which matches the substring that a standalone
pattern would match if anchored at the given position, and it matches nothing other than this
substring. This construct is useful for optimizations of what would otherwise be "eternal"
matches, because it will not backtrack (see Backtracking). It may also be useful in places
where the "grab all you can, and do not give anything back" semantic is desirable.
For example: ^(?>a*)ab will never match, since (?>a*) (anchored at the beginning of
# x+
|
\( [^()]* \)
)+
\)
}x
That will efficiently match a nonempty group with matching parentheses two levels deep or
less. However, if there is no such group, it will take virtually forever on a long string. That's
because there are so many different ways to split a long string into several substrings. This is
what (.+)+ is doing, and (.+)+ is similar to a subpattern of the above pattern. Consider how
the pattern above detects no-match on ((()aaaaaaaaaaaaaaaaaa in several seconds, but
that each extra letter doubles this time. This exponential performance will make it appear that
your program has hung. However, a tiny change to this pattern
m{ \(
(
(?> [^()]+ )
|
\( [^()]* \)
)+
\)
}x
which uses (?>...) matches exactly when the one above does (verifying this yourself would
be a productive exercise), but finishes in a fourth of the time when used on a similar string with
1000000 "a"s. Be aware, however, that when this construct is followed by a quantifier, it
currently triggers a warning message under the use warnings pragma or -w switch saying it
"matches null string many times in regex".
On simple groups, such as the pattern (?> [^()]+ ), a comparable effect may be achieved
by negative look-ahead, as in [^()]+ (?! [^()] ). This was only 4 times slower on a
string with 1000000 "a"s.
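The refusal of (?>...) to give characters back can be checked on a tiny input. This sketch uses the ^(?>a*)ab pattern discussed earlier, alongside its ordinary and possessive counterparts.

```perl
use strict;
use warnings;

# Ordinary a* gives back one "a" so that "ab" can match.
my $plain      = "aaab" =~ /^a*ab/     ? 1 : 0;

# (?>a*) keeps all three, so no "a" is left for "ab".
my $atomic     = "aaab" =~ /^(?>a*)ab/ ? 1 : 0;

# The possessive quantifier behaves the same way.
my $possessive = "aaab" =~ /^a*+ab/    ? 1 : 0;

# Nothing needs to be given back here, so the match succeeds.
my $no_giveback = "aaab" =~ /^(?>a*)b/ ? 1 : 0;

print "$plain $atomic $possessive $no_giveback\n";
```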
The "grab all you can, and do not give anything back" semantic is desirable in many situations
where at first sight a simple ()* looks like the correct solution. Suppose we parse text
with comments being delimited by # followed by some optional (horizontal) whitespace.
Contrary to its appearance, #[ \t]* is not the correct subexpression to match the comment
delimiter, because it may "give up" some whitespace if the remainder of the pattern can be
made to match that way. The correct answer is either one of these:
(?>#[ \t]*)
    Quantifier Form     Bracketing Form
    ---------------     ---------------
    PAT*+               (?>PAT*)
    PAT++               (?>PAT+)
    PAT?+               (?>PAT?)
    PAT{min,max}+       (?>PAT{min,max})
(?[ ])
See "Extended Bracketed Character Classes" in perlrecharclass.
Backtracking
NOTE: This section presents an abstract approximation of regular expression behavior. For a more
rigorous (and complicated) view of the rules involved in selecting a match among possible
alternatives, see Combining RE Pieces.
A fundamental feature of regular expression matching involves the notion called backtracking, which
is currently used (when needed) by all regular expression quantifiers that are not possessive, namely
*, *?, +, +?, {n,m}, and {n,m}?. Backtracking is often optimized internally, but the general principle
outlined here is valid.
For a regular expression to match, the entire regular expression must match, not just part of it. So if
the beginning of a pattern containing a quantifier succeeds in a way that causes later parts in the
pattern to fail, the matching engine backs up and recalculates the beginning part--that's why it's called
backtracking.
Here is an example of backtracking: Let's say you want to find the word following "foo" in the string
"Food is on the foo table.":
$_ = "Food is on the foo table.";
if ( /\b(foo)\s+(\w+)/i ) {
print "$2 follows $1.\n";
}
# Wrong!
That won't work at all, because .* was greedy and gobbled up the whole string. As \d* can match
an empty string, the complete regular expression matched successfully.
Beginning is <I have 2 numbers: 53147>, number is <>.
Here are some variants, most of which don't work:
$_ = "I have 2 numbers: 53147";
@pats = qw{
(.*)(\d*)
(.*)(\d+)
(.*?)(\d*)
(.*?)(\d+)
(.*)(\d+)$
(.*?)(\d+)$
(.*)\b(\d+)$
(.*\D)(\d+)$
};
for $pat (@pats) {
printf "%-12s ", $pat;
if ( /$pat/ ) {
    (.*)(\d*)    <I have 2 numbers: 53147> <>
    (.*)(\d+)    <I have 2 numbers: 5314> <7>
    (.*?)(\d*)   <> <>
    (.*?)(\d+)   <I have > <2>
    (.*)(\d+)$   <I have 2 numbers: 5314> <7>
    (.*?)(\d+)$  <I have 2 numbers: > <53147>
    (.*)\b(\d+)$ <I have 2 numbers: > <53147>
    (.*\D)(\d+)$ <I have 2 numbers: > <53147>
As you see, this can be a bit tricky. It's important to realize that a regular expression is merely a set of
assertions that gives a definition of success. There may be 0, 1, or several different ways that the
definition might succeed against a particular string. And if there are multiple ways it might succeed,
you need to understand backtracking to know which variety of success you will achieve.
When using look-ahead assertions and negations, this can all get even trickier. Imagine you'd like to
find a sequence of non-digits not followed by "123". You might try to write that as
    $_ = "ABC123";
    if ( /^\D*(?!123)/ ) {                # Wrong!
        print "Yup, no 123 in $_\n";
    }
But that isn't going to match; at least, not the way you're hoping. It claims that there is no 123 in the
string. Here's a clearer picture of why that pattern matches, contrary to popular expectations:
$x = 'ABC123';
$y = 'ABC445';
print "1: got $1\n" if $x =~ /^(ABC)(?!123)/;
print "2: got $1\n" if $y =~ /^(ABC)(?!123)/;
print "3: got $1\n" if $x =~ /^(\D*)(?!123)/;
print "4: got $1\n" if $y =~ /^(\D*)(?!123)/;
This prints
2: got ABC
3: got AB
4: got ABC
You might have expected test 3 to fail because it seems to be a more general-purpose version of test 1.
The important difference between them is that test 3 contains a quantifier (\D*) and so can use
backtracking, whereas test 1 will not. What's happening is that you've asked "Is it true that at the start
of $x, following 0 or more non-digits, you have something that's not 123?" If the pattern matcher had
let \D* expand to "ABC", this would have caused the whole pattern to fail.
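One way to stop \D* from backing off, offered here as a sketch rather than the only fix, is to add a second assertion requiring that the non-digits be followed by a digit. With it, the shorter candidates such as "AB" are rejected as well.

```perl
use strict;
use warnings;

my $x = 'ABC123';
my $y = 'ABC445';

# (?=\d) pins the end of \D* to the first digit, so backing off
# to "AB" no longer helps and the (?!123) check cannot be dodged.
my $x_matches = $x =~ /^(\D*)(?=\d)(?!123)/ ? 1 : 0;

my ($y_prefix) = $y =~ /^(\D*)(?=\d)(?!123)/;

print "$x_matches $y_prefix\n";
```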
The search engine will initially match \D* with "ABC". Then it will try to match (?!123) with "123",
which fails. But because a quantifier (\D*) has been used in the regular expression, the search
Warning on \1 Instead of $1
Some people get too used to writing things like:
$pattern =~ s/(\W)/\\\1/g;
This is grandfathered (for \1 to \9) for the RHS of a substitute to avoid shocking the sed addicts, but
it's a dirty habit to get into. That's because in PerlThink, the righthand side of an s/// is a
double-quoted string. \1 in the usual double-quoted string means a control-A. The customary Unix
meaning of \1 is kludged in for s///. However, if you get into the habit of doing that, you get yourself
into trouble if you then add an /e modifier.
s/(\d+)/ \1 + 1 /eg;
Or if you try to do
s/(\d+)/\1000/;
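Both situations work as intended when $1 is used on the right-hand side. A short sketch with made-up input; ${1} keeps the digits that follow from being read as part of the variable name.

```perl
use strict;
use warnings;

# With /e the replacement is Perl code, so $1 is an ordinary variable.
(my $incremented = "a1 b22") =~ s/(\d+)/ $1 + 1 /eg;

# ${1}000 means "the first capture followed by 000".
(my $thousands = "7") =~ s/(\d+)/${1}000/;

print "$incremented $thousands\n";
```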
Combining RE Pieces
Each of the elementary pieces of regular expressions which were described before (such as ab or \Z)
could match at most one substring at the given position of the input string. However, in a typical
regular expression these elementary pieces are combined into more complicated patterns using
combining operators ST, S|T, S* etc. (in these examples S and T are regular subexpressions).
Such combinations can include alternatives, leading to a problem of choice: if we match a regular
expression a|ab against "abc", will it match substring "a" or "ab"? One way to describe which
substring is actually matched is the concept of backtracking (see Backtracking). However, this
description is too low-level and makes you think in terms of a particular implementation.
Another description starts with notions of "better"/"worse". All the substrings which may be matched
by the given regular expression can be sorted from the "best" match to the "worst" match, and it is the
"best" match which is chosen. This substitutes the question of "what is chosen?" by the question of
"which matches are better, and which are worse?".
Again, for elementary pieces there is no such question, since at most one match at a given position is
possible. This section describes the notion of better/worse for combining operators. In the description
below S and T are regular subexpressions.
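The a|ab question raised above can be answered by experiment. In this sketch the capture group only serves to show which alternative was used.

```perl
use strict;
use warnings;

# The leftmost alternative that lets the whole pattern match wins.
my ($first)  = "abc" =~ /(a|ab)/;

# Reordering the alternatives changes the result.
my ($second) = "abc" =~ /(ab|a)/;

# If what follows cannot match after "a", the next alternative is tried.
my ($forced) = "abc" =~ /(a|ab)c/;

print "$first $second $forced\n";
```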
ST
Consider two possible matches, AB and A'B', where A and A' are substrings which can be
matched by S, and B and B' are substrings which can be matched by T.
If A is a better match for S than A', AB is a better match than A'B'.
PCRE/Python Support
As of Perl 5.10.0, Perl supports several Python/PCRE-specific extensions to the regex syntax. While
Perl programmers are encouraged to use the Perl-specific syntax, the following are also accepted:
(?P<NAME>pattern)
Define a named capture group. Equivalent to (?<NAME>pattern).
(?P=NAME)
Backreference to a named capture group. Equivalent to \g{NAME}.
(?P>NAME)
Subroutine call to a named capture group. Equivalent to (?&NAME).
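The three forms can be tried next to their Perl-native equivalents. A sketch with invented group names and strings, assuming perl 5.10 or later:

```perl
use strict;
use warnings;

# (?P<NAME>...) with (?P=NAME): the second year must repeat the first.
my $pcre   = "2024-2024" =~ /^(?P<year>\d{4})-(?P=year)$/ ? $+{year} : undef;

# The same pattern in Perl-native syntax.
my $native = "2024-2024" =~ /^(?<year>\d{4})-\g{year}$/   ? $+{year} : undef;

# (?P>NAME) re-runs the group's pattern; the text need not be identical.
my $call   = "12-345" =~ /^(?P<num>\d+)-(?P>num)$/ ? 1 : 0;

# A backreference, by contrast, demands the same text again.
my $backref = "12-345" =~ /^(?P<num>\d+)-(?P=num)$/ ? 1 : 0;

print "$pcre $native $call $backref\n";
```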
BUGS
Many regular expression constructs don't work on EBCDIC platforms.
There are a number of issues with regard to case-insensitive matching in Unicode rules. See i under
Modifiers above.
This document varies from difficult to understand to completely and utterly opaque. The wandering
prose riddled with jargon is hard to fathom in several places.
This document needs a rewrite that separates the tutorial content from the reference content.
SEE ALSO
perlrequick.
perlretut.
"Regexp Quote-Like Operators" in perlop.
"Gory details of parsing quoted constructs" in perlop.
perlfaq6.
"pos" in perlfunc.
perllocale.
perlebcdic.
Mastering Regular Expressions by Jeffrey Friedl, published by O'Reilly and Associates.