Tutorial: Using Regular Expressions: Section 1. Introduction To The Tutorial Who Is This Tutorial For?
ibm.com/developer
Note on presentation
For purposes of presenting examples in this tutorial, regular expressions will be surrounded by forward slashes. This style of delimiting regular expressions is used by sed, awk, Perl, and other tools. For instance, an example might mention: /[A-Z]+(abc|xyz)*/ Read ahead to understand this example; for now, just understand that the actual regular expression is everything between the slashes. Many examples will be accompanied by an illustration that shows a regular expression, and text that is highlighted for every match on that expression.
Tutorial navigation
Navigating through the tutorial is easy: Select Next and Previous to move forward and backward through the tutorial. When you're finished with a section, select the Main menu for the next section. Within a section, use the Section menu. If you'd like to tell us what you think, or if you have a question for the author about the content of the tutorial, use the Feedback button.
Contact
David Mertz is a writer, a programmer, and a teacher who always endeavors to improve his communication to readers (and tutorial takers). He welcomes any comments; please direct them to [email protected].
Character literals
The very simplest pattern matched by a regular expression is a literal character or a sequence of literal characters. Anything in the target text that consists of exactly those characters in exactly the order listed will match. A lowercase character is not identical to its uppercase version, and vice versa. A space in a regular expression, by the way, matches a literal space in the target (this is unlike most programming languages or command-line tools, where spaces separate keywords).
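As a minimal sketch of literal matching, the following uses Python's re module (the choice of Python and the sample sentence are assumptions made for illustration, not part of the original examples). It checks that a literal sequence matches only in exactly the given order and case, and that a space in the pattern is significant.

```python
import re

# Illustrative target text (an assumption, not taken from a tutorial figure).
text = "Mary had a little lamb"

# A literal pattern matches exactly those characters, in that order.
found_literal = re.search(r"a little", text) is not None

# Case matters: lowercase "mary" is not the same as "Mary".
found_lowercase = re.search(r"mary", text) is not None

# A space in the pattern must match a literal space in the target.
found_with_space = re.search(r"little lamb", text) is not None
found_without_space = re.search(r"littlelamb", text) is not None

print(found_literal, found_lowercase, found_with_space, found_without_space)
```

Only the first and third searches succeed; the lowercase and space-free variants find nothing.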
/.*/
Special characters must be escaped.*

/\.\*/
Special characters must be escaped.*
/^Mary/
Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.

/Mary$/
Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.

/.a/
Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.
/(Mary)( )(had)/
Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.

/[a-z]a/
Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.
Character classes
Rather than name only a single character, you can include a pattern in a regular expression that matches any of a set of characters. A set of characters can be given as a simple list inside square brackets; for example, /[aeiou]/ will match any single lowercase vowel. For letter or number ranges you may also use only the first and last character of a range, with a dash in the middle; for example, /[A-Ma-m]/ will match any lowercase or uppercase letter in the first half of the alphabet. Many regular expression tools also provide escape-style shortcuts to the most commonly used character classes, such as \s for a whitespace character and \d for a digit. You could always define these character classes with square brackets, but the shortcuts can make regular expressions more compact and readable.
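The character classes described above can be tried out with a short sketch in Python's re module (Python and the sample strings are assumptions chosen for illustration). It shows a listed set, a pair of ranges, and the digit and whitespace shortcuts; for plain ASCII text such as this, the digit shortcut and the explicit range [0-9] find the same characters.

```python
import re

# Illustrative target text (an assumption, not from the tutorial's figures).
text = "Mary had 2 little lambs"

# A simple list of characters inside square brackets.
vowels = re.findall(r"[aeiou]", text)

# Two ranges in one class: first half of the alphabet, either case.
first_half = re.findall(r"[A-Ma-m]", "Regex")

# Shortcut classes: \d for a digit, \s for a whitespace character.
digits = re.findall(r"\d", text)
digits_by_range = re.findall(r"[0-9]", text)
whitespace_count = len(re.findall(r"\s", text))

print(vowels, first_half, digits, digits_by_range, whitespace_count)
```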
/[^a-z]a/
Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.
Complement operator
The caret symbol can actually have two different meanings in regular expressions. Most of the time, it means to match the zero-length pattern for line beginnings. But if it is used at the beginning of a character class, it reverses the meaning of the character class. Everything not included in the listed character set is matched.
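The two meanings of the caret can be compared side by side in a small Python sketch (Python is an assumption; the nursery-rhyme text is the one used in this tutorial's illustrations, placed on a single line here). Outside a character class the caret anchors the match to the beginning; at the start of a class it complements the set.

```python
import re

text = ("Mary had a little lamb. And everywhere that Mary went, "
        "the lamb was sure to go.")

# Caret as an anchor: "Mary" only where the target begins.
anchored = re.findall(r"^Mary", text)

# An "a" preceded by a lowercase letter.
after_lowercase = re.findall(r"[a-z]a", text)

# Caret as complement: an "a" preceded by anything EXCEPT a lowercase letter.
after_other = re.findall(r"[^a-z]a", text)

print(anchored)
print(after_lowercase)
print(after_other)
```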
/cat|dog|bird/
The pet store sold cats, dogs, and birds.

/=xxx|yyy=/
=xxx xxx= # =yyy yyy= # =xxx= # =yyy=

/(=)(xxx)|(yyy)(=)/
=xxx xxx= # =yyy yyy= # =xxx= # =yyy=

/=(xxx|yyy)=/
=xxx xxx= # =yyy yyy= # =xxx= # =yyy=
Alternation of patterns
Using character classes is a way of indicating that either one thing or another thing can occur in a particular spot. But what if you want to specify that either of two whole subexpressions occurs in a position in the regular expression? For that, you use the alternation operator, the vertical bar ("|"). This is the symbol that is also used to indicate a pipe in UNIX/DOS shells, and is sometimes called the pipe character. The pipe character in a regular expression indicates an alternation between everything in the group enclosing it. Even if there are several groups to the left and right of a pipe character, the alternation greedily asks for everything on both sides. To select the scope of the alternation, you must define a group that encompasses the patterns that may match. The example illustrates this.
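How grouping limits the scope of an alternation can be sketched in Python (an assumption; the patterns and target strings are the ones from this section's illustration). Without a group, the pipe splits the whole expression in two; with a group, only the part inside the parentheses alternates.

```python
import re

pets = re.findall(r"cat|dog|bird", "The pet store sold cats, dogs, and birds.")

text = "=xxx xxx= # =yyy yyy= # =xxx= # =yyy="

# No group: the alternatives are the whole "=xxx" and the whole "yyy=".
ungrouped = [m.group(0) for m in re.finditer(r"=xxx|yyy=", text)]

# Grouped: "=" is required on both sides; only xxx/yyy alternate.
grouped = [m.group(0) for m in re.finditer(r"=(xxx|yyy)=", text)]

print(pets)
print(ungrouped)
print(grouped)
```

The sketch uses finditer with group(0) rather than findall because, in Python, findall returns only the captured group when a pattern contains one.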
/@(=+=)*@/
Match with zero in the middle: @@
Subexpression occurs, but...: @=+=ABC@
Many occurrences: @=+==+==+==+==+=@
Repeat entire pattern: @=+==+=+==+=@
/a{5} b{,6} c{4,8}/
aaaaa bbbbb ccccc
aaa bbb ccc
aaaaa bbbbbbbbbbbbbb ccccc

/a+ b{3,} c?/
aaaaa bbbbb ccccc
aaa bbb ccc
aaaaa bbbbbbbbbbbbbb ccccc

/a{5} b{6,} c{4,8}/
aaaaa bbbbb ccccc
aaa bbb ccc
aaaaa bbbbbbbbbbbbbb ccccc
Numeric quantifiers
Using extended regular expressions, you can specify arbitrary pattern occurrence counts using a more verbose syntax than the question-mark, plus-sign, and asterisk quantifiers. The curly-braces ("{" and "}") can surround a precise count of how many occurrences you are looking for. The most general form of the curly-brace quantification uses two range arguments (the first must be no larger than the second, and both must be non-negative integers). The occurrence count is specified this way to fall between the minimum and maximum indicated (inclusive). As shorthand, either argument may be left empty: if so, the minimum/maximum is specified as zero/infinity, respectively. If only one argument is used (with no comma in there), exactly that many occurrences are matched.
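The forms of the curly-brace quantifier can be checked with a brief Python sketch (Python is an assumption; other tools may differ, particularly in whether an empty minimum such as {,6} is accepted). It uses re.fullmatch so that the whole string must satisfy the count.

```python
import re

# Exactly five occurrences.
exact_ok = re.fullmatch(r"a{5}", "aaaaa") is not None
exact_short = re.fullmatch(r"a{5}", "aaa") is not None

# Empty minimum: zero up to six occurrences.
upto_ok = re.fullmatch(r"b{,6}", "bbbbb") is not None
upto_long = re.fullmatch(r"b{,6}", "b" * 14) is not None

# Empty maximum: three or more occurrences.
atleast_ok = re.fullmatch(r"b{3,}", "b" * 14) is not None

# Both bounds, inclusive: four to eight occurrences.
range_ok = re.fullmatch(r"c{4,8}", "ccccc") is not None
range_short = re.fullmatch(r"c{4,8}", "ccc") is not None

print(exact_ok, exact_short, upto_ok, upto_long, atleast_ok, range_ok, range_short)
```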
/(abc|xyz) \1/
jkl abc xyz
jkl xyz abc
jkl abc abc
jkl xyz xyz
Backreferences
One powerful option in creating search patterns is specifying that a subexpression that was matched earlier in a regular expression is matched again later in the expression. We do this using backreferences. Backreferences are named by the numbers 1 through 9, preceded by the backslash/escape character when used in this manner. These backreferences refer to each successive group in the match pattern, as in /(one)(two)(three)/\1\2\3/. Each numbered backreference refers to the group that, in this example, has the word corresponding to the number. It is important to note something the example illustrates. What gets matched by a backreference is the same literal string matched the first time, even if the pattern that matched the string could have matched other strings. Simply repeating the same grouped subexpression later in the regular expression does not match the same targets as using a backreference (but you have to decide what you actually want to match in either case). Backreferences refer back to whatever occurred in the previous grouped expressions, in the order those grouped expressions occurred. Because of the naming convention (\1-\9), many tools limit you to nine backreferences. Some tools allow actual naming of backreferences and/or saving them to program variables. Section 4 touches on these topics.
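The difference between a backreference and a repeated subexpression can be sketched in Python (an assumption; the four target lines are illustrative and use the same words as this section's patterns). The backreference demands the same literal string twice, while the repeated group accepts any combination.

```python
import re

# Illustrative target lines (an assumption for this sketch).
lines = ["jkl abc xyz", "jkl xyz abc", "jkl abc abc", "jkl xyz xyz"]

# \1 must repeat exactly what the first group matched.
with_backref = [ln for ln in lines if re.search(r"(abc|xyz) \1", ln)]

# Repeating the group lets each occurrence match independently.
with_repeat = [ln for ln in lines if re.search(r"(abc|xyz) (abc|xyz)", ln)]

print(with_backref)
print(with_repeat)
```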
/(abc|xyz) (abc|xyz)/
jkl abc xyz
jkl xyz abc
jkl abc abc
jkl xyz xyz
/th.*s/
-- Match the words that start
-- with 'th' and end with 's'.
this
thus
thistle
this line matches too much
/th.*s/
-- Match the words that start
-- with 'th' and end with 's'.

/th[^s]*./
-- Match the words that start
-- with 'th' and end with 's'.
this
thus
thistle
this line matches too much
This one simply substitutes some literal text for some other literal text. The search-and-replace capability of many tools can do this much, even without using regular expressions.
Most of the time, if you are using regular expressions to modify a target text, you will want to match more general patterns than just literal strings. Whatever is matched is what gets replaced (even if it is several different strings in the target).
s/([A-Z])([0-9]{2,4}) /\2:\1 /g
< A37 B4 C107 D54112 E1103 XXX
> 37:A B4 107:C D54112 1103:E XXX
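For readers working in Python rather than sed or Perl, the substitution above might be written roughly as follows. This is a sketch under the assumption that re.sub stands in for the s/.../.../g form; re.sub replaces every match by default, so no separate global flag is needed.

```python
import re

source = "A37 B4 C107 D54112 E1103 XXX"

# Group 1 is the capital letter, group 2 is a run of two to four digits.
# The trailing space in the pattern keeps longer digit runs from matching.
result = re.sub(r"([A-Z])([0-9]{2,4}) ", r"\2:\1 ", source)

print(result)
```

Items such as B4 (too few digits) and D54112 (too many) are left alone, because only the text that the whole pattern matches is replaced.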
/th.*s/
-- Match the words that start
-- with 'th' and end with 's'.
this line matches just right
this # thus # thistle

/th.*?s/
-- Match the words that start
-- with 'th' and end with 's'.
this # thus # thistle
this line matches just right

/th.*?s /
-- Match the words that start
-- with 'th' and end with 's'.
-- (FINALLY!)
this # thus # thistle
this line matches just right
Non-greedy quantifiers
Earlier in the tutorial, the problems of matching too much were discussed, and some workarounds were suggested. Some regular expression tools make this easier by providing optional non-greedy quantifiers. These quantifiers grab as little as possible while still matching whatever comes next in the pattern (instead of as much as possible). Non-greedy quantifiers have the same syntax as regular greedy ones, except that the quantifier is followed by a question mark. For example, a non-greedy pattern might look like: /A[A-Z]*?B/. In English, this means "match an A, followed by only as many capital letters as are needed to find a B." One little thing to look out for is the fact that the pattern /[A-Z]*?./ will always match zero capital letters. If you use non-greedy quantifiers, watch out for matching too little, which is a symmetric danger.
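The contrast between greedy and non-greedy matching can be seen in a short Python sketch (Python and the sample strings are assumptions for illustration). It also demonstrates the "matching too little" caveat: with nothing after it but a wildcard, the non-greedy quantifier settles for zero capital letters.

```python
import re

# Greedy: grab as many capitals as possible before the final B.
greedy = re.search(r"A[A-Z]*B", "AXXBXXB").group(0)

# Non-greedy: stop at the first B that completes the match.
lazy = re.search(r"A[A-Z]*?B", "AXXBXXB").group(0)

# The same contrast on words that start with 'th' and end with 's'.
greedy_words = re.findall(r"th.*s", "this thus thistle")
lazy_words = re.findall(r"th.*?s", "this thus thistle")

# Too little: zero capitals are taken, and the wildcard matches the "A".
too_little = re.match(r"[A-Z]*?.", "ABC").group(0)

print(greedy, lazy, greedy_words, lazy_words, too_little)
```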
/M.*[ise] /
MAINE # Massachusetts # Colorado # mississippi # Missouri # Minnesota #

/M.*[ise] /i
MAINE # Massachusetts # Colorado # mississippi # Missouri # Minnesota #

/M.*[ise] /gis
MAINE # Massachusetts # Colorado # mississippi # Missouri # Minnesota #
Pattern-match modifiers
We already saw one pattern-match modifier in the modification examples: the global modifier. In fact, in many regular expression tools, we should have been using the "g" modifier for all our pattern matches. Without the "g", many tools will match only the first occurrence of a pattern on a line in the target. So this is a useful modifier (but not one you necessarily want to use always). Let us look at some others. As a little mnemonic, it is nice to remember the word "gismo" (it even seems somehow appropriate). The most frequent modifiers are:
g - Match globally
i - Case-insensitive match
s - Treat string as single line
m - Treat string as multiple lines
o - Only compile pattern once
The o option is an implementation optimization, and not really a regular expression issue (but it helps the mnemonic). The single-line option allows the wildcard to match a newline character (it won't otherwise). The multiple-line option causes "^" and "$" to match the beginning and end of each line in the target, not just the beginning and end of the target as a whole (with sed or grep this is the default). The insensitive option ignores differences between the case of letters.
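In Python, these modifiers appear as flags or as a choice of function rather than as trailing letters. The sketch below rests on that assumed mapping: re.findall versus re.search for "g", re.IGNORECASE for "i", re.DOTALL for "s", and re.MULTILINE for "m". The two-line sample text is illustrative only.

```python
import re

text = "Mary had\na little lamb"

# "g": search stops at the first match, findall collects every match.
first_l = re.search(r"l", text).group(0)
all_l = re.findall(r"l", text)

# "i": ignore differences in letter case.
any_case = re.findall(r"mary", text, re.IGNORECASE)

# "s": let the wildcard match a newline character.
dot_default = re.search(r"had.a", text)
dot_single_line = re.search(r"had.a", text, re.DOTALL)

# "m": let ^ match at the beginning of every line, not just the target.
start_of_target = re.findall(r"^\w+", text)
start_of_lines = re.findall(r"^\w+", text, re.MULTILINE)

print(first_l, all_l, any_case)
print(dot_default, dot_single_line.group(0) == "had\na")
print(start_of_target, start_of_lines)
```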
Naming backreferences
import re

txt = "A-xyz-37 # B:abcd:142 # C-wxy-66 # D-qrs-93"
# Name the groups "pre" and "id", then refer to them by name in the replacement.
new = re.sub(r"(?P<pre>[A-Z])(-[a-z]{3}-)(?P<id>[0-9]*)", r"\g<pre>\g<id>", txt)
print(new)

A37 # B:abcd:142 # C66 # D93
The language Python offers a particularly handy syntax for really complex pattern backreferences. Rather than just play with the numbering of matched groups, you can give them a name. The syntax of using regular expressions in Python is a standard programming language function/method style of call, rather than Perl- or sed-style slash delimiters. Check your own tool to see if it supports this facility.
s/([A-Z]-)(?=[a-z]{3})([a-z0-9]* )/\2\1/g
< A-xyz37 # B-ab6142 # C-Wxy66 # D-qrs93
> xyz37A- # B-ab6142 # C-Wxy66 # qrs93D-

s/([A-Z]-)(?![a-z]{3})([a-z0-9]* )/\2\1/g
< A-xyz37 # B-ab6142 # C-Wxy66 # D-qrs93
> A-xyz37 # ab6142B- # Wxy66C- # D-qrs93
Lookahead assertions
Another trick of advanced regular expression tools is "lookahead assertions." These are similar to regular grouped subexpressions, except they do not actually grab what they match. There are two advantages to using lookahead assertions. On the one hand, a lookahead assertion can function in a similar way to a group that is not backreferenced; that is, you can match something without counting it in backreferences. More significantly, however, a lookahead assertion can specify that the next chunk of a pattern has a certain form, but let a different subexpression actually grab it (usually for purposes of backreferencing that other subexpression). There are two kinds of lookahead assertions: positive and negative. As you would expect, a positive assertion specifies that something does come next, and a negative one specifies that something does not come next. Emphasizing their connection with non-backreferenced groups, the syntax for lookahead assertions is similar: (?=pattern) for positive assertions, and (?!pattern) for negative assertions.
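A small Python sketch can illustrate both kinds of lookahead (Python is an assumption; the sample codes are modeled on this section's illustration but separated by plain spaces). The assertion checks what comes next without consuming it, so a later part of the pattern is still free to grab that text.

```python
import re

text = "A-xyz37 B-ab6142 C-Wxy66 D-qrs93"

# Positive: a prefix only where three lowercase letters come next.
positive = re.findall(r"[A-Z]-(?=[a-z]{3})", text)

# Negative: a prefix only where three lowercase letters do NOT come next.
negative = re.findall(r"[A-Z]-(?![a-z]{3})", text)

# The lookahead consumes nothing; the following class grabs "xyz37".
whole = re.search(r"[A-Z]-(?=[a-z]{3})[a-z0-9]*", text).group(0)

print(positive, negative, whole)
```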
The URL for my site is: http://mysite.com/mydoc.html. You might also enjoy ftp://yoursite.com/index.html for a good place to download files.
In the later examples we have started to see just how complicated regular expressions can get. These examples are not the half of it. It is possible to do some almost absurdly difficult-to-understand things with regular expressions (but things that are nonetheless useful). There are two basic facilities that some of the more advanced regular expression tools use in clarifying expressions. One is allowing regular expressions to continue over multiple lines (by ignoring whitespace like trailing spaces and newlines). The second is allowing comments within regular expressions. Some tools allow you to do one or another of these things alone, but when it gets complicated, do both! The example given uses Perl's extended modifier (x) to enable commented multi-line regular expressions. Consult the documentation for your own tool for details on how to compose these.
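Python offers a comparable facility through the re.VERBOSE flag. The sketch below is not the tutorial's own Perl expression; it is a simplified, hypothetical URL pattern written only to show whitespace and comments inside a regular expression, applied to the sample text about the two sites.

```python
import re

text = ("The URL for my site is: http://mysite.com/mydoc.html. You might "
        "also enjoy ftp://yoursite.com/index.html for a good place to "
        "download files.")

# Hypothetical, simplified URL pattern; whitespace and comments are ignored.
url_pattern = re.compile(r"""
    \b(?:http|ftp)     # scheme: either http or ftp
    ://                # separator between scheme and host
    [\w.-]+            # host name
    (?:/[\w./-]*\w)?   # optional path that must end in a word character,
                       # so sentence punctuation is left out of the match
""", re.VERBOSE)

urls = url_pattern.findall(text)
print(urls)
```

Requiring the path to end in a word character is one design choice for keeping the sentence-ending period out of the first URL; real-world URL matching needs considerably more care.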
Your feedback
Please let us know whether this tutorial was helpful to you and how we could make it better. We'd also like to hear about other tutorial topics you'd like to see covered. Thanks! For questions about the content of this tutorial, contact the author, David Mertz, at [email protected].
Colophon
This tutorial was written entirely in XML, using the developerWorks tutorial tag set. The tutorial is converted into a number of HTML pages, a zip file, JPEG heading graphics, and a PDF file by a Java program and a set of XSLT stylesheets.