Regular Expressions: Item 15: Know The Precedence of Regular Expression Operators
Regular expressions are the most obvious very high-level feature of Perl. A
single pattern match in Perl—even a simple one—can perform the work of
many lines in a different language. Pattern matches, especially when combined
with Perl’s handling of strings and lists, provide capabilities that are
very difficult to mimic in other programming languages.
The power of regular expressions is one thing. Making use of it is another.
Getting the full benefit from regular expressions in Perl requires both
experience and understanding. Becoming fluent in regular expressions
may seem to be a difficult task, but I commend it to you. Once you have
mastered regular expressions in Perl, your programs will be faster,
shorter, and easier to write. In other words, more effective—which is why
you are reading this book, right?
This section discusses many commonly-encountered issues relating to
regular expressions. It is not a reference, however. For a complete description
of regular expressions and Perl, see the Perl man pages and/or the
Camel book. For an illuminating and extremely thorough discussion of
regular expressions that reaches far beyond Perl, see Jeffrey Friedl’s excellent
Mastering Regular Expressions, the so-called “Hip Owls book.”
http://www.effectiveperl.com
52 Item 15 Regular Expressions
\n Matches newline.
Lowest | Alternation
The last entry in the precedence chart is alternation. Let’s continue to use
the “•” notation for a moment:
The zero-width atoms, for example, ^ and \b, group in the same way as
other atoms:
The pattern was meant to match Sender: and From: lines in a mail header,
but it actually matches something somewhat different. Here it is with
some parentheses added to clarify the precedence:
/(^Sender)|(From:\s+(.*))/;
Adding a pair of parentheses, or perhaps memory-free parentheses (?:…),
fixes the problem:
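A minimal sketch of the fix, using memory-free parentheses. The header lines in @lines are hypothetical and exist only for illustration:

```perl
use strict;
use warnings;

# Hypothetical header lines, used only for illustration.
my @lines = (
    'From: joebloe@example.com',
    'Sender: list-owner@example.com',
    'X-Note: From: not a header we want',
);

my @found;
for my $line (@lines) {
    # (?:...) groups the alternation, so ^ and the colon apply to both
    # names, and the header value is still captured in $1.
    push @found, $1 if $line =~ /^(?:Sender|From):\s+(.*)/;
}

print "$_\n" for @found;
```

With the grouping in place, the third line is rejected because the anchor now applies to From as well as Sender.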
Double-quote interpolation
Perl regular expressions are subject to the same kind of interpolation that
double-quoted strings are. 2 Interpolated variables and string escapes like
\U and \Q are not regular expression atoms and are never seen by the regular
expression parser. Interpolation takes place in a single pass that
occurs before a regular expression is parsed:
$x = 'test';
/$x*/;      # Matches tes, test, testt, etc.
/test*/;    # Same thing as /$x*/.
Read a pattern into $pat and match two consecutive occurrences of it.
2. Well, more or less. The $ anchor receives special treatment so that it is not
always interpreted as a scalar variable prefix.
In this example, if the user types in bob, the first regular expression will
match bobb, because the contents of $pat are expanded before the regular
expression is interpreted.
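The effect can be sketched as follows. Here $pat is assigned directly rather than read from the user, and the variable names are assumptions made for the illustration. The naive pattern is built by string concatenation to make it plain that the {2} is simply appended to the interpolated text:

```perl
use strict;
use warnings;

my $pat  = 'bob';               # stands in for a line typed by the user
my $text = 'bobb and bobbob';

# Interpolation happens first, so this pattern is really /bob{2}/:
# the {2} applies only to the final b.
my $naive = $pat . '{2}';
my ($first) = $text =~ /($naive)/;

# Grouping the interpolated text makes {2} apply to all of it.
my ($second) = $text =~ /((?:$pat){2})/;

print "naive:   $first\n";
print "grouped: $second\n";
```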
All three regular expressions in this example have another potential pitfall.
Suppose the user types in the string “hello :-)”. This will generate a
fatal run-time error. The result of interpolating this string into
/($pat){2}/ is /(hello :-)){2}/, which, aside from being nonsense, has
unbalanced parentheses.
If you don’t want special characters like parentheses, asterisks, periods,
and so forth interpreted as regular expression metacharacters, use the
quotemeta operator or the quotemeta escape, \Q. Both quotemeta and \Q
put a backslash in front of any character that isn’t a letter, digit, or underscore:
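A short sketch of both forms, assuming the troublesome input from above has already been stored in $pat:

```perl
use strict;
use warnings;

my $pat  = 'hello :-)';                  # input that breaks /($pat){2}/
my $text = 'hello :-)hello :-) again';

# quotemeta returns a copy with every non-word character backslashed.
my $quoted = quotemeta $pat;

# Either form is safe to interpolate into a pattern.
my $with_function = $text =~ /(?:$quoted){2}/   ? 1 : 0;
my $with_escape   = $text =~ /(?:\Q$pat\E){2}/  ? 1 : 0;

print "$quoted\n";
print "function: $with_function, escape: $with_escape\n";
```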
The “count left parentheses” rule applies to all regular expressions, even
ones involving alternation:
The $+ special variable contains the value of the last non-empty memory:
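A small sketch of $+ with an alternation, where only one of the two memories is set by any given match. The input strings are made up for the example:

```perl
use strict;
use warnings;

my @revs;
for my $line ('Version: 5.004', 'Revision: 1.12') {
    # $1 is set for the first alternative and $2 for the second;
    # $+ picks up whichever one actually matched.
    push @revs, $+ if $line =~ /Version: (.*)|Revision: (.*)/;
}

print "@revs\n";
```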
Backreferences
Regular expressions can make use of the contents of memories via backreferences.
The atoms \1, \2, \3, and so on match the contents of the corresponding
memories. An obvious (but not necessarily useful) application
of backreferences is solving simple word puzzles:
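As a rough illustration, the sketch below runs two backreference patterns over a small hard-coded word list (an assumption, standing in for a real dictionary file):

```perl
use strict;
use warnings;

my @words = qw(bookkeeper banana level mississippi perl);

# A letter immediately followed by the same letter.
my @doubled = grep { /(\w)\1/ } @words;

# A two-letter sequence that appears again later in the word.
my @repeated_pair = grep { /(\w\w).*\1/ } @words;

print "doubled letters: @doubled\n";
print "repeated pairs:  @repeated_pair\n";
```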
This kind of thing is always good for 10 minutes of fun on a really slow
day. Just sit at your Unix box and type things like:
% perl -ne 'print if /([aeiou])(.*\1){3}/' /usr/dict/words
I get 106 words from this one, including “tarantara.” Hmm.
Backreferences are a powerful feature, but you may not find yourself using
them all that often. Sometimes they are handy for dealing with delimiters
in a simplistic way:
Go through the contents of OLD a line at a time, replacing some one-line HTML
comments.
while (<OLD>) {
    while (/<!--\s*(.*?)\s*-->/g) {    # Extract info from comment and check it out.
        # Replace comment.
        $_ = $` . new_html($1) . $'
            if ok_to_replace($1);
    }
    print NEW $_;
}
Some people complain that using match variables makes Perl programs
run slower. This is true. There is some extra work involved in maintaining
the values of the match variables, and once any of the match variables
appears in a program, Perl maintains them for every regular expression
match in the program. If you are concerned about speed, you may want to
rewrite code that uses match variables. You can generally rephrase such
code as substitutions that use memory variables. In the case above, you
could do the obvious (but incorrect):
while (<OLD>) {
    # Use s///eg for replacement (looks better using braces as delimiters).
    s{(<!--\s*(.*?)\s*-->)}{
        ok_to_replace($2) ?
            new_html($2) : $1;
    }eg;
    print NEW $_;
}
In most cases, though, I would recommend that you write whatever makes
your code clearer, including using match variables when appropriate.
Worry about speed after everything works and you’ve made your deadline
(see Item 22).
The localizing behavior of match variables is the same as that of memory
variables.
Memory in substitutions
Memory and match variables are often used in substitutions. Uses of $1,
$2, $&, and so on within the replacement string of a substitution refer to
the memories from the match part, not an earlier statement (hopefully,
this is obvious):
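A minimal sketch, with made-up data, showing that $1 and $2 in the replacement come from the substitution's own match and not from an earlier one:

```perl
use strict;
use warnings;

# An earlier, unrelated match sets $1 to 'earlier'.
'earlier match' =~ /(\w+)/;

my $name = 'Hall, Joseph';

# Inside the replacement, $1 and $2 refer to this substitution's memories.
(my $swapped = $name) =~ s/(\w+), (\w+)/$2 $1/;

print "$swapped\n";
```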
You can use the /e (eval) option to help solve some tricky problems:
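Two small sketches of /e, where the replacement is evaluated as Perl code for each match. The input strings are invented for the example:

```perl
use strict;
use warnings;

# Decode %XX escapes: the replacement is the expression chr(hex($1)).
my $encoded = 'Effective%20Perl%21';
(my $decoded = $encoded) =~ s/%([0-9a-fA-F]{2})/chr hex $1/eg;

# Do arithmetic on each number found in the string.
my $prices = 'tea 3, cake 10';
(my $doubled = $prices) =~ s/(\d+)/$1 * 2/eg;

print "$decoded\n$doubled\n";
```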
Using a match inside a map is even more succinct. This is one of my favorite
ultra-high-level constructs:
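A sketch of the idea, assuming a small list of hypothetical header lines. Each successful match contributes its memory to the result, and each failed match contributes nothing:

```perl
use strict;
use warnings;

my @lines = ('From: fred', 'Subject: hi', 'From: barney');

# In list context the match returns ($1) on success and () on failure,
# so map both filters and extracts in one step.
my @senders = map { /^From:\s+(.*)/ } @lines;

print "@senders\n";
```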
Note that it turns out to be extremely handy that a failed match returns an
empty list.
A match with the /g option in a list context returns all the memories for
each successful match:
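For instance, with a made-up input string, a pattern with two memories returns a flat list of pairs that can be assigned straight to a hash:

```perl
use strict;
use warnings;

my $text = 'fred: 205 barney: 195 dino: 30';

# Two memories per match, repeated for every match: a key/value list.
my %scores = $text =~ /(\w+):\s*(\d+)/g;

# With no parentheses at all, /g returns each whole match instead.
my @numbers = $text =~ /\d+/g;

print "$_ => $scores{$_}\n" for sort keys %scores;
print "@numbers\n";
```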
Memory-free parentheses
Parentheses in Perl regular expressions serve two different purposes:
grouping and memory. Although this is usually convenient, or at least
irrelevant, it can get in the way at times. Here’s an example we just saw:
We need the first set of parentheses for grouping (so the ? will work right),
but they get in the way memory-wise. What we would like to have is the
ability to group without memory. Perl 5 introduced a feature for this specific
purpose. Memory-free parentheses (?:…) group like parentheses,
but they don’t create backreferences or memory variables:
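The difference is easy to see in list context. This sketch uses a host name pattern and an invented input string:

```perl
use strict;
use warnings;

my $text = 'see www.effectiveperl.com for details';

# Ordinary parentheses: the inner group also produces a memory.
my @with_memory = $text =~ /(\w+(\.\w+)+)/;

# Memory-free inner parentheses: grouping only, one memory returned.
my @memory_free = $text =~ /(\w+(?:\.\w+)+)/;

print scalar(@with_memory), " values: @with_memory\n";
print scalar(@memory_free), " value:  @memory_free\n";
```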
changed) through Perl’s pos operator. Applying a scalar m//g match allows
you to use a single regular expression, and it frees you from having to
keep track of the current position explicitly:
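A minimal sketch with an invented string. Each pass through the loop resumes where the previous match left off, and pos reports that position:

```perl
use strict;
use warnings;

my $text = 'alpha beta gamma';
my @report;

# In scalar context, each m//g match continues from the previous one.
while ($text =~ /(\w+)/g) {
    push @report, "$1 ends at " . pos($text);
}

print "$_\n" for @report;
```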
The most recent versions of Perl support a /c option for matches, which
modifies the way scalar m//g operates. Normally, when a scalar m//g
match fails, the match position is reset, and the next scalar m//g will start
matching at the beginning of the target string. The /c option causes the
match position to be retained following an unsuccessful match. This,
combined with the \G anchor, which forces a match beginning at the last
match position, allows you to write more straightforward tokenizers:
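A sketch of such a tokenizer for a toy expression language. The token classes and the NUM/ID/OP labels are assumptions made for the illustration:

```perl
use strict;
use warnings;

my $src = 'x = 10 + y2';
my @tokens;

pos($src) = 0;
while (pos($src) < length $src) {
    # \G anchors each attempt at the current position; /c keeps that
    # position when an attempt fails, so the next branch can try there.
    if    ($src =~ /\G\s+/gc)             { next }
    elsif ($src =~ /\G(\d+)/gc)           { push @tokens, "NUM:$1" }
    elsif ($src =~ /\G([A-Za-z_]\w*)/gc)  { push @tokens, "ID:$1" }
    elsif ($src =~ /\G([-+*\/=])/gc)      { push @tokens, "OP:$1" }
    else  { die "unexpected character at offset " . pos($src) . "\n" }
}

print "@tokens\n";
```

Without /c, the first failed branch would reset the position and the later branches would start over from the beginning of the string.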
$_ = " Here are { nested {} { braces } }!";    # Input goes into $_.
{
    my $c;                                      # $c counts braces.
    while (/([{}])/gc) {                        # Find braces and count them
        last unless ($c += {qw({ 1 } -1)}->{$1}) > 0    # until count is 0.
    };
}
print substr substr($_, 0, pos()), index($_, "{");    # Print found string.
These examples illustrate incorrect patterns for matching text enclosed by delimiters—
in this case single-quoted strings and C comments.
$_ = "/* temp */ x = 10; /* too much? */";    # Hoping to match /* temp */.
Do you see the problem with it? It fails on the following input:
/***/
The reason is that there is no way for it to match an asterisk inside the
comment that isn’t followed by exactly one other character, thus an odd
number of asterisks fails to match. It has other problems, too, but this
one is enough. The real answer looks like this: 4
s#/\*[^*]*\*+([^/*][^*]*\*+)*/##g;    # CORRECT
You are not likely to understand the how and why of this without recourse
to a diagram of the underlying state machine:
[Figure: state machine for the C comment pattern, with transitions labeled /, *, [^*], and [^/*]]
These examples illustrate patterns that correctly match text enclosed by delimiters.
You can now attempt more ambitious things, like a double-quoted string
with character escapes (let’s support \", \\, and \123):
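One possible pattern for this, offered as a sketch and not necessarily the one originally shown. It accepts \", \\, three-digit octal escapes, and any character other than a quote or backslash:

```perl
use strict;
use warnings;

# Sample text containing escaped quotes, an escaped backslash, and \101.
my $text = 'say "a \"quoted\" word\\\\ and \101" now';

my ($body) = $text =~ /
    "(                      # opening quote, start of memory
        (?: \\["\\]         #   an escaped quote or backslash
          | \\[0-3][0-7]{2} #   an octal escape such as \123
          | [^"\\]          #   any ordinary character
        )*
    )"                      # end of memory, closing quote
/x;

print "$body\n";
```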
The only problem with non-greedy matching is that it can be slower than
greedy matching. Don’t use non-greedy operators unnecessarily. But do
use non-greedy operators to avoid having to write complex regular expressions
that might or might not be correct.
$_ = 'testing';
/t(e|es)./;
print "matched: $&\n";    # matched: tes
If Perl could talk, it might describe the matching process something like
this:
“OK, start at first character position. Looking for a t. Got one.
“Now, an alternation, first one is e. Looking for e. Got one.
“OK, the alternation matched. Next thing is a dot. Need one char to
match the dot. Got an s.
“Anything else? Nope. Guess we’re done.”
If you have no background experience with tools like lex or flex, or if this
is the only kind of regular expression you have ever known, you probably
don’t see anything unusual with this interpretation of this regular expression.
On the other hand, if you are familiar with flex, you might be thinking,
“Hmm, why didn’t that match test instead?”
Well, you could get it to match test by rewriting it:
$_ = 'testing';
/t(es|e)./;
print "matched: $&\n";    # matched: test
[Figure: state machine with transitions labeled t, e, s, ., and [^s\n]]
5. The Hip Owls book uses the term “NFA regular expressions” to refer to what I
call “procedural” matching.
This works fine on input like “joebloe ttyp0…”. However, it will not match
at all on strings like “webmaster-1 ttyp1…” and will return a strange result
on “joebloe pts/10…”. This match probably should have been written:
Another thing to watch out for is a “word” that contains punctuation characters.
Suppose you want to search for a whole word in a text string:
This works fine for input like hacker and even Perl5-Porter, but fails for
words like goin', or any word that does not begin and end with a \w character.
It also will consider isn a matchable word if $text contains isn't.
The reason is that \b matches transitions between \w and \W characters—
not transitions between \s and \S characters. If you want to support
searching for words delimited by whitespace, you will have to write something
like this instead:
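One way to sketch this is with lookaround assertions that test for whitespace (or the edge of the string) on each side. The has_word helper and the sample text are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: true if $word appears in $text delimited by
# whitespace or by the start or end of the string.
sub has_word {
    my ($text, $word) = @_;
    return $text =~ /(?<!\S)\Q$word\E(?!\S)/ ? 1 : 0;
}

my $text = "goin' home isn't easy";

my %found = map { $_ => has_word($text, $_) } ("goin'", "isn't", 'isn', 'home');

print "$_: $found{$_}\n" for sort keys %found;
```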
The word boundary anchor, \b, and its inverse, \B, are zero-width patterns.
Even though they are not the only zero-width patterns (^, \A, etc.
are others), they are the hardest to understand. If you are not sure what \b
and \B will match in your string, try substituting for them:
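A short script along these lines (a sketch; the variable names are assumptions) marks every \b position with a colon on one line and every \B position on the next:

```perl
use strict;
use warnings;

my $text = q{What's a "word" boundary?};

# Put a colon at every position where \b matches ...
(my $at_boundaries = $text) =~ s/\b/:/g;

# ... and at every position where \B matches.
(my $at_non_boundaries = $text) =~ s/\B/:/g;

print "$at_boundaries\n";
print "$at_non_boundaries\n";
```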
% tryme
:What:':s: :a: ":word:" :boundary:?
W:h:a:t's a :"w:o:r:d": b:o:u:n:d:a:r:y?:
To force $ to really match the end of the string, you need to be more insistent.
One way to do this is to use the (?!…) regular expression operator:
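A minimal sketch comparing a plain $ with the stricter form, using invented strings with and without a trailing newline:

```perl
use strict;
use warnings;

my $with_newline = "the end\n";
my $chomped      = 'the end';

# A plain $ is satisfied just before a trailing newline.
my $loose = $with_newline =~ /end$/ ? 1 : 0;

# Adding (?!\n) rejects a match that has a newline after it.
my $strict_nl   = $with_newline =~ /end$(?!\n)/ ? 1 : 0;
my $strict_bare = $chomped      =~ /end$(?!\n)/ ? 1 : 0;

print "loose: $loose, strict with newline: $strict_nl, strict without: $strict_bare\n";
```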
Here, (?!\n) ensures that there are no newlines after the $.6
Ordinarily, $ only matches before the end of the string or a trailing newline.
However, the /m (multi-line) option modifies the operation of $ so that
it can also match before intermediate newlines. The /m option also modifies
^ so that it will match a position immediately following a newline in
the middle of the string:
$_ = "2\nlines";
s/^/<start>/mg;    # <start>2\n<start>lines
$_ = "2\nlines";
s/$/<end>/mg;      # 2<end>\nlines<end>
6. In earlier versions of Perl you may have to surround the $ with memory-free
parentheses—(?:$) instead of $—since the regular expression parser recognizes
$( as a special variable. This behavior was recently changed so that $ preceding (
is now recognized as an anchor, not part of a variable—as has long been the case
with $ preceding ).
%scores =
    <<'EOF' =~ /^(.*?):\s*(.*)/mg;
fred: 205
barney: 195
dino: 30
EOF
# (See Item 13 for more about here-doc strings.)

# The result is equivalent to:
%scores = (
    'fred' => 205,
    'barney' => 195,
    'dino' => 30
);
$_ = "2\nlines";
s/\A/<start>/mg;    # <start>2\nlines
$_ = "2\nlines";
s/\Z/<end>/mg;      # 2\nlines<end>
($a, $b, $c) = split /\s+/, $_;    # Get first 3 fields of $_.
The two approaches take about the same amount of time to run, but the
code using split is simpler.
You can use pattern matches for more complex chores:
If you go to the trouble to benchmark these examples, you may find that
the version using a pattern match runs significantly faster than the version
using split. This wouldn’t be a problem, except that the pattern match is
significantly harder to read and understand. This is a general rule: pattern
matches tend to be faster, and split tends to be simpler and easier to
read. In cases like this, you have a decision to make. Do you use the faster
code, or do you use the code that is easier to understand? I think the
choice is obvious. If you must have speed, use a pattern match. But in general,
readability comes first. If speed is not the most important issue, use
split whenever the problem fits it.
You can use split several times to divide a string into successively smaller
pieces. For example, suppose that you have a line from a Unix passwd file
whose fifth field (the “GCOS” field) contains something like "Joseph N.
Hall, 555-2345, Room 888", and you would like to pick out just the last
name:
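A sketch of the successive splits. The passwd line is invented, apart from the GCOS text quoted above:

```perl
use strict;
use warnings;

# Hypothetical passwd entry; only the GCOS field comes from the text.
my $line = 'jhall:x:100:10:Joseph N. Hall, 555-2345, Room 888:/home/jhall:/bin/sh';

my $gcos   = (split /:/, $line)[4];      # fifth colon-separated field
my ($name) = split /,\s*/, $gcos;        # first comma-separated piece
my $last   = (split /\s+/, $name)[-1];   # last whitespace-separated word

print "$last\n";
```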
There are some situations where split can yield elegant solutions. Consider
one of our favorite problems, matching and removing C comments
from a string. You could use split to chop such a string up into a list of
comment delimiters and whatever appears between them, then process
the result to toss out the comments:
The following code will print $_ with C comments removed. It deals with double-quoted
strings that possibly contain comment delimiters. The memory parentheses in the
split pattern cause the delimiters, as well as the parts between them, to be returned.
% ps l
F UID PID PPID CP PRI NI SZ RSS WCHAN S TT
8 100 7363 7352 0 48 20 1916 1492 write3ve S pts/3
8 100 14227 7363 0 58 20 868 704 write3ve S pts/3
8 998 28693 3327 0 58 20 3068 1724 T pts/2
The following example extracts a few fields from the output of the ps command and
prints them.
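The sketch below shows the general technique on a fixed-width record built with sprintf. The column layout and field names are assumptions chosen for the illustration and do not correspond to the exact columns of the ps listing above:

```perl
use strict;
use warnings;

# Build a fixed-width record: user in columns 0-7, PID in 8-13,
# PPID in 14-19, and terminal starting at column 22 (assumed layout).
my $line = sprintf '%-8s%6d%6d  %s', 'joebloe', 7363, 7352, 'pts/3';

# @N moves to absolute column N; AN extracts N characters there and
# strips trailing whitespace.
my ($user, $pid, $ppid, $tty) = unpack '@0 A8 @8 A6 @14 A6 @22 A5', $line;

printf "%s %d %d %s\n", $user, $pid, $ppid, $tty;
```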
Note that the @ specifier does not return a value. It moves to an absolute
position in the string being unpacked. In the example above, “@8 A6”
means six characters starting at position 8.
You may find it aggravating to have to manually count out columns for the
unpack format. The following program may help you get the right numbers
with less effort:
Put a “picture” of the input in $_, and this program will generate a format.
$_ =
' aaaaabbbbbb ccccc ddddd';
while (/(\w)\1+/g) {
print '@' . length($`) . ' A' . length($&) . ' ';
}
print "\n";
You could also experiment interactively with the debugger (see Item 39) to
find the correct column numbers.
A few more complex comparisons are also faster if you avoid regular
expressions:
The index operator is very fast—it uses a Boyer-Moore algorithm for its
searches. Perl will also compile index-like regular expressions into
Boyer-Moore searches. You could write:
or, avoiding the use of $' (see Item 16 and Item 21):
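A sketch of the index-based approach, which finds a fixed delimiter and takes everything after it without touching the match variables. The header string is made up:

```perl
use strict;
use warnings;

my $header = 'Subject: Re: regular expressions';

# index returns the offset of the first occurrence, or -1 if absent.
my $i = index $header, ': ';

# Take everything after the two-character delimiter.
my $rest = $i >= 0 ? substr($header, $i + 2) : '';

print "$rest\n";
```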
The nifty thing about substr is that you can make replacements with it by
using it on the left side of an expression. The text referred to by substr is
replaced by the string value of the right-hand side:
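For example, with an invented string:

```perl
use strict;
use warnings;

my $s = 'Hello, world';

# Replace the first five characters.
substr($s, 0, 5) = 'Howdy';
my $after_first = $s;

# With no length, everything from the offset to the end is replaced.
substr($s, 7) = 'Perl';

print "$after_first\n$s\n";
```

The replacement does not have to be the same length as the text it replaces; the string grows or shrinks as needed.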
You can combine index and substr to perform s///-like substitutions, but
in this case s/// is usually faster:
You can also do other lvalue-ish things with a substr, such as binding it to
substitutions or tr///:
$_ = "secret message";
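A small sketch of binding a substr to tr/// and to s///, using a separate variable $msg so that $_ is left alone:

```perl
use strict;
use warnings;

my $msg = 'secret message';

# Uppercase only the first six characters.
substr($msg, 0, 6) =~ tr/a-z/A-Z/;
my $after_tr = $msg;

# Substitute only within the portion that starts at offset 7.
substr($msg, 7) =~ s/s/z/g;

print "$after_tr\n$msg\n";
```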
The tr/// operator has other uses as well. It is the fastest way to count
characters in a string, and it can be used to remove duplicated characters:
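Both uses can be sketched on a single invented word:

```perl
use strict;
use warnings;

my $word = 'bookkeeper';

# With an empty replacement list and no /d, tr leaves the string
# unchanged and returns the number of matching characters.
my $e_count = ($word =~ tr/e//);

# The /s option squeezes each run of a repeated character to one.
(my $squeezed = $word) =~ tr/a-z//s;

print "$e_count\n$squeezed\n";
```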
The /x flag, which can be applied to both pattern matches and substitutions,
causes the regular expression parser to ignore whitespace (so long
as it isn’t preceded by a backslash, or isn’t inside a character class),
including comments:
/(
" (
\\\W | \\x[0-9a-fA-F]{2} | \\[0-3]?[0-7]?[0-7] | [^"\\]
)* "
)/xo;
The reason for this behavior is that the variables making up the pattern
might have changed since the last time the pattern was compiled, and thus
the pattern itself might be different. Perl makes this assumption to be
safe, but such recompilation is often unnecessary. In many cases, like the
The pattern /\b$magic\b/o is compiled on the first iteration of the foreach loop, using
whatever the value of $magic is at that time. The pattern is never compiled again,
even if the value of $magic changes.
The /o flag also works for substitutions. Note that the replacement string
in the substitution continues to work as it normally does—it can vary from
match to match:
Using a match variable anywhere in your program activates a feature that makes
copies of the match ($&), before-match ($`) and after-match ($') strings for every
single match in the program.
$_ = "match variable";
/.*/;
print "Gratuitous use of a $&\n";    # Uh-oh: We activated the match variable feature.
Don’t use match variables ($`, $&, $') if speed is important.
Perl isn’t smart enough to know which pattern(s) the match variables
might be referring to, so Perl sets up the values of the match variables
every time it does a pattern match. This results in a lot of extra copying
and unnecessary shuffling around of bytes.
Fortunately, the penalty isn’t that severe. In most cases (particularly if
some I/O is involved, as above), your program will run only slightly slower,
if at all. In test cases designed to spot the penalty, the extra time consumed
can range from a few percent to 30 to 40 percent. Jeffrey Friedl reports a
contrived test case in which the run time with a match variable present
was 700 times longer, but it is unlikely you will face a situation like this.
The pattern match below finds a word boundary, then tries to match george.
If that fails, it backs up to the boundary and tries to match jane. If that fails,
it tries judy, then elroy. If a match is found, it looks for another word
boundary.
while (<>) {
print if
/\b(george|jane|judy|elroy)\b/;
}
ing more than a single character back from the end of a word. If that character
isn’t a t or d, there’s no point in continuing, because even if we did
find one earlier in the string it wouldn’t be at the end of the word.
There’s no way to force Perl to change this backtracking behavior (at least
not so far as I know), but you can approach the problem in a slightly different
manner. Ask yourself: “If I were looking for words ending in t or d,
what would I be looking for?” More than likely, you’d be looking at the
ends of words. You’d be looking for something like:
/[td]\b/
Now, this is interesting. This little regular expression does everything that
the other two do, even though it may not be obvious at first. But think
about it. To the left of the t or d there will be zero or more \w characters.
We don’t care what sort of \w characters they are; so, tautologically if you
will, once we have a t or d to the left of a word boundary, we have a word
ending in t or d.
Naturally, this little regular expression runs much faster than either of the
two above—about twice as fast, more or less. Obviously there’s not much
backtracking, because the expression matches only a single character!
There’s no point in memorizing the contents of the inner parentheses in this pattern,
so if you want to save a little time, use memory-free parentheses.
The time saved isn’t generally all that great, and memory-free parentheses
don’t exactly improve readability. But sometimes, every little bit of speed
helps!
See Item 16 for more about memory-free parentheses.
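use Benchmark;
my $host;
# The test code goes in an eval string (see Item 54).
# @data is assumed to have been filled in earlier.
timethese (100,
    { mem => q{
        for (@data) { ($host) = m/(\w+(\.\w+)+)/; }
      },
      # Same test with memory-free inner parentheses.
      memfree => q{
        for (@data) { ($host) = m/(\w+(?:\.\w+)+)/; }
      },
    });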
The results:
Benchmark: timing 100 iterations of mem, memfree...
mem: 12 secs (12.23 usr 0.00 sys = 12.23 cpu)
memfree: 11 secs (10.64 usr 0.00 sys = 10.64 cpu)
Not bad: it takes about 15 percent longer to run the version without the
memory-free parentheses.