Ruby Regexp - Sundeep Agarwal
Preface
Prerequisites
Conventions
Acknowledgements
Feedback and Errata
Author info
License
Book version
Why is it needed?
How this book is organized
Regexp introduction
Regexp documentation
match? method
Regexp literal reuse and interpolation
sub and gsub methods
Regexp operators
Cheatsheet and Summary
Exercises
Anchors
String anchors
Line anchors
Word anchors
Cheatsheet and Summary
Exercises
Alternation and Grouping
OR conditional
Regexp.union method
Grouping
Regexp.source method
Precedence rules
Cheatsheet and Summary
Exercises
Escaping metacharacters
Escaping with \
Regexp.escape method
Escaping delimiter
Escape sequences
Cheatsheet and Summary
Exercises
Dot metacharacter and Quantifiers
Dot metacharacter
split method
Greedy quantifiers
AND conditional
What does greedy mean?
Non-greedy quantifiers
Possessive quantifiers
Cheatsheet and Summary
Exercises
Interlude: Tools for debugging and visualization
rubular
debuggex
regexcrossword
Summary
Working with matched portions
match method
match method with block
Using regexp as a string index
scan method
split with capture groups
regexp global variables
Using hashes
Substitution in conditional expression
Cheatsheet and Summary
Exercises
Character class
Custom character sets
Range of characters
Negating character sets
Set intersection
Matching metacharacters literally
Escape sequence character sets
Named character sets
Numeric ranges
Cheatsheet and Summary
Exercises
Groupings and backreferences
Backreferences
Non-capturing groups
Subexpression calls
Recursive matching
Named capture groups
Negative backreferences
Conditional groups
Cheatsheet and Summary
Exercises
Interlude: Common tasks
CommonRegexRuby
Summary
Lookarounds
Conditional expressions
Negative lookarounds
Positive lookarounds
Capture groups inside positive lookarounds
AND conditional with lookarounds
Emulating positive lookbehind with \K
Variable length lookbehind
Negated groups and absence operator
\G anchor
Cheatsheet and Summary
Exercises
Modifiers
i modifier
m modifier
o modifier
x modifier
Inline comments
Inline modifiers
Cheatsheet and Summary
Exercises
Unicode
Encoding modifiers
Unicode character sets
Codepoints and Unicode escapes
\X vs dot metacharacter
Cheatsheet and Summary
Exercises
Further Reading
Preface
Scripting and automation tasks often need to extract
particular portions of text from input data or modify them
from one format to another. This book will help you learn
Regular Expressions, a mini-programming language for all
sorts of text processing needs.
Prerequisites
You should have prior experience working with Ruby and
should know concepts like blocks, string formats, string
methods, Enumerable, etc.
Conventions
The examples presented here have been tested with
Ruby version 2.7.1 and include features not available
in earlier versions.
Code snippets shown are copy-pasted from the
irb --simple-prompt shell and modified for presentation
purposes. Some commands are preceded by comments
to provide context and explanations. Blank lines have
been added to improve readability. nil return value is
not shown for puts statements. Error messages are
shortened. And so on.
Unless otherwise noted, all examples and explanations
are meant for ASCII characters.
External links are provided for further reading
throughout the book. It is not necessary to visit them
immediately. They have been chosen with care and would
help, especially during re-reads.
The Ruby_Regexp repo has all the code snippets and
files used in examples and exercises and other details
related to the book. If you are not familiar with the git
command, click the Code button on the webpage to get
the files.
Acknowledgements
ruby-lang documentation — manuals and tutorials
/r/ruby/ and /r/regex/ — helpful forum for beginners
and experienced programmers alike
stackoverflow — for getting answers to pertinent
questions on Ruby and regular expressions
tex.stackexchange — for help on pandoc and tex
related questions
Cover image:
draw.io
tree icon by Gopi Doraisamy under Creative
Commons Attribution 3.0 Unported
wand icon by roundicons.com
Warning and Info icons by Amada44 under public
domain
softwareengineering.stackexchange and skolakoda for
programming quotes
Issue Manager:
https://github.com/learnbyexample/Ruby_Regexp/issues
Goodreads:
https://www.goodreads.com/book/show/48641238-ruby-regexp
E-mail: [email protected]
Twitter: https://twitter.com/learn_byexample
Author info
Sundeep Agarwal is a freelance trainer, author and mentor.
His previous experience includes working as a Design
Engineer at Analog Devices for more than 5 years. You can
find his other works, primarily focused on Linux command
line, text processing, scripting languages and curated lists,
at https://github.com/learnbyexample. He has also been a
technical reviewer for Command Line Fundamentals book
and video course published by Packt.
License
This work is licensed under a Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 International
License
Regexp introduction
Anchors
Alternation and Grouping
Escaping metacharacters
Dot metacharacter and Quantifiers
Interlude: Tools for debugging and visualization
Working with matched portions
Character class
Groupings and backreferences
Interlude: Common tasks
Lookarounds
Modifiers
Unicode
Further Reading
Regexp documentation
It is always a good idea to know where to find the
documentation. Visit ruby-doc: Regexp for information on
Regexp class, available methods, syntax, features, examples
and more. Here's a quote:
match? method
First up, a simple example to test whether a string is part
of another string or not. Normally, you'd use the include?
method and pass a string as argument. For regular
expressions, use the match? method and enclose the search
string within // delimiters (regexp literal).
>> sentence.include?('is')
=> true
>> sentence.include?('z')
=> false
>> sentence.match?(/is/)
=> true
>> sentence.match?(/z/)
=> false
>> sentence.match?(/is/, 2)
=> true
>> sentence.match?(/is/, 6)
=> false
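The snippet above uses a sentence variable whose definition is not shown. Here is a minimal self-contained sketch; the sample string is an assumption chosen so that the results agree with the outputs listed above.

```ruby
# assumed sample input, chosen for illustration
sentence = 'This is a sample string'

# plain substring test
has_substring = sentence.include?('is')   # => true

# the same test with a regexp literal
has_is = sentence.match?(/is/)            # => true
has_z = sentence.match?(/z/)              # => false

# the optional second argument is the index to start searching from
from_2 = sentence.match?(/is/, 2)         # => true
from_6 = sentence.match?(/is/, 6)         # => false
```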
Some regular expression functionality is enabled by
passing modifiers, each represented by a single letter. If
you have used the command line, modifiers are similar to
command options. For example, grep -i performs case
insensitive matching. Modifiers will be discussed in detail in
the Modifiers chapter. Here's an example for the i modifier.
>> sentence.match?(/this/)
=> false
# 'i' is a modifier to enable case insensitive matching
>> sentence.match?(/this/i)
=> true
>> pet
=> /dog/i
=> true
=> false
Similar to double quoted string literals, you can use
interpolation and escape sequences in a regexp literal. See
ruby-doc: Strings for syntax details on string escape
sequences. Regexp literals have their own special escapes,
which will be discussed in Escape sequences section.
>> "cat\tdog".match?(/\t/)
=> true
>> "cat\tdog".match?(/\a/)
=> false
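Reuse and interpolation of regexp literals can be sketched as follows. The variable names and sample strings are assumptions made for illustration.

```ruby
# a regexp literal can be saved in a variable and reused
pet = /dog/i
matches_upper = 'They bought a Dog'.match?(pet)        # => true
matches_none = 'A cat crossed their path'.match?(pet)  # => false

# interpolation works as it does in double quoted strings
animal = 'cat'
pat = /#{animal}s?/
built_source = pat.source                # => "cats?"
plural = 'two cats'.match?(pat)          # => true

# \t inside a regexp literal matches a tab character
tab = "cat\tdog".match?(/\t/)            # => true
```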
=> "wager"
>> word
=> "cater"
=> "wager"
>> word
=> "wager"
Regexp operators
Ruby also provides operators for regexp matching.
=> 2
=> nil
=> true
=> false
hi
oh
=> true
=> false
>> words.grep(/tt/)
>> words.all?(/at/)
=> true
>> words.none?(/temp/)
=> false
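The operators and Enumerable methods mentioned in this section can be sketched as follows. The sample inputs are assumptions made for illustration.

```ruby
sentence = 'This is a sample string'   # assumed sample input

# =~ returns the index of the first match, or nil if there is none
first_idx = sentence =~ /is/      # => 2
no_idx = sentence =~ /xyz/        # => nil

# !~ returns true when there is no match
not_found = sentence !~ /xyz/     # => true

# === returns true or false, which is what case/when relies on
triple = /is/ === sentence        # => true

# a regexp can be passed to Enumerable methods like grep, all? and none?
words = %w[cat attempt tattle]
with_tt = words.grep(/tt/)        # => ["attempt", "tattle"]
all_at = words.all?(/at/)         # => true
none_temp = words.none?(/temp/)   # => false
```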
Note                          Description
ruby-doc: Regexp              Ruby Regexp documentation
/pat/ or %r{pat}              regexp literal
s =~ /pat/ or /pat/ =~ s      returns index of first match or nil
s !~ /pat/ or /pat/ !~ s      returns true if no match or false
Exercises
=> false
>> line2.match?()     ##### add your solution here
=> true
b) For the given input file, print all lines containing the
string two.
# note that expected output shown here is wrapped to fit pdf width
g) For the given input string, print all lines NOT containing
the string 2
'> apple 24
'> mango 50
'> guava 42
'> onion 31
items qty
mango 50
onion 31
water 10
i) For the given array, filter all elements that contain both
e and n.
j) For the given string, replace 0xA0 with 0x7F and 0xC0
with 0x1F.
>> ip = 'start address: 0xA0, func1 address: 0xC0'
=> 8
Anchors
Now that you're familiar with regexp syntax and some of
the methods, the next step is to know about the special
features of regular expressions. In this chapter, you'll be
learning about qualifying a pattern. Instead of matching
anywhere in the given input string, restrictions can be
specified. For now, you'll see the ones that are already part
of regular expression features. In later chapters, you'll
learn how to define your own rules for restriction.
String anchors
This restriction is about qualifying a regexp to match only
at start or end of an input string. These provide
functionality similar to the string methods start_with?
and end_with?. There are three different escape sequences
related to string level regexp anchors. First up is \A which
restricts the matching to the start of string.
>> 'cater'.match?(/\Acat/)
=> true
>> 'concatenation'.match?(/\Acat/)
=> false
=> true
=> false
>> 'spare'.match?(/are\z/)
=> true
>> 'nearest'.match?(/are\z/)
=> false
>> words.grep(/er\z/)
>> words.grep(/t\z/)
=> ["pest"]
=> "dX"
=> "dX"
=> "dare\n"
=> "dX\n"
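The three string anchors can be compared side by side in a short sketch. The sample strings are assumptions, though the last two results agree with the outputs shown above.

```ruby
# \A restricts the match to the start of the string
start_ok = 'cater'.match?(/\Acat/)           # => true
start_no = 'concatenation'.match?(/\Acat/)   # => false

# \z restricts the match to the very end of the string
end_ok = 'spare'.match?(/are\z/)             # => true

# \Z also matches just before a newline that ends the string
upper_z = "dare\n".sub(/are\Z/, 'X')         # => "dX\n"
lower_z = "dare\n".sub(/are\z/, 'X')         # => "dare\n"
```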
Combining both the start and end string anchors, you can
restrict the matching to the whole string. This is similar to
comparing strings using the == operator.
>> 'cat'.match?(/\Acat\z/)
=> true
>> 'cater'.match?(/\Acat\z/)
=> false
>> 'concatenation'.match?(/\Acat\z/)
=> false
=> "relive"
=> "resend"
=> "cater"
=> "hacker"
Line anchors
A string input may contain single or multiple lines. The
newline character \n is used as the line separator. There
are two line anchors, ^ metacharacter for matching the
start of line and $ for matching the end of line. If there are
no newline characters in the input string, these will behave
the same as the \A and \z anchors respectively.
>> pets.match?(/^cat/)
=> true
>> pets.match?(/^dog/)
=> false
>> pets.match?(/dog$/)
=> true
>> pets.match?(/^dog$/)
=> false
=> true
>> "spare\npar\ndare".match?(/er$/)
=> false
>> "spare\npar\ndare".each_line.grep(/are$/)
>> "spare\npar\ndare".match?(/^par$/)
=> true
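The pets variable used above is not defined in the text that survives. This sketch assumes a single-line sample value for it and adds a multiline string to show the difference that newlines make.

```ruby
pets = 'cat and dog'   # assumed sample input without newlines

# without newlines, ^ and $ behave like \A and \z
starts_cat = pets.match?(/^cat/)     # => true
starts_dog = pets.match?(/^dog/)     # => false
ends_dog = pets.match?(/dog$/)       # => true
only_dog = pets.match?(/^dog$/)      # => false

# with newlines, every line gets its own start and end
lines = "spare\npar\ndare"
whole_line = lines.match?(/^par$/)       # => true
no_er = lines.match?(/er$/)              # => false
are_lines = lines.each_line.grep(/are$/) # => ["spare\n", "dare"]
```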
Just like string anchors, you can use the line anchors by
themselves as a pattern. gsub and puts will be used here to
better illustrate the transformation. The gsub method
returns an Enumerator if you neither specify a replacement
string nor pass a block. That paves the way to use all those
wonderful Enumerator and Enumerable methods.
1: catapults
1: concatenate
1: cat
1: catapults
2: concatenate
3: cat
catapults.
concatenate.
cat.
foo 1
foo 2
foo 1
foo
1 baz
2 baz
baz
1 baz
baz
baz
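Using line anchors by themselves, and the Enumerator form of gsub described above, can be sketched as follows. The input string is an assumption chosen to agree with the first few outputs listed.

```ruby
ip = "catapults\nconcatenate\ncat"   # assumed sample input

# ^ matches at the start of every line, so each line gets the prefix
prefixed = ip.gsub(/^/, '1: ')
# => "1: catapults\n1: concatenate\n1: cat"

# $ matches at the end of every line, so each line gets the suffix
suffixed = ip.gsub(/$/, '.')
# => "catapults.\nconcatenate.\ncat."

# without a replacement or a block, gsub returns an Enumerator
enum = ip.gsub(/^/)

# with_index(1) supplies a running counter to the replacement block
numbered = ip.gsub(/^/).with_index(1) { |_m, i| "#{i}: " }
# => "1: catapults\n2: concatenate\n3: cat"
```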
Word anchors
The third type of restriction is word anchors. Alphabets
(irrespective of case), digits and the underscore character
qualify as word characters. You might wonder why there
are digits and underscores as well, why not only alphabets?
This comes from variable and function naming conventions
— typically alphabets, digits and underscores are allowed.
So, the definition is more oriented to programming
languages than natural ones.
You can get a lot more creative by using the word boundary as
a pattern by itself:
"par","spar","apparent","spare","part"
=> ":copper:"
=> "c:o:p:p:e:r"
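The word anchors \b and \B can be sketched as follows. The sample string is an assumption based on the words listed above.

```ruby
words = 'par spar apparent spare part'   # assumed sample input

# \b matches at the start or end of a word
whole = words.gsub(/\bpar\b/, 'X')    # => "X spar apparent spare part"
starts = words.gsub(/\bpar/, 'X')     # => "X spar apparent spare Xt"
ends = words.gsub(/par\b/, 'X')       # => "X sX apparent spare part"

# \B is the opposite: it matches wherever \b does not
inside = words.gsub(/\Bpar\B/, 'X')   # => "par spar apXent sXe part"

# the anchors can also be used as a pattern by themselves
edges = 'copper'.gsub(/\b/, ':')      # => ":copper:"
between = 'copper'.gsub(/\B/, ':')    # => "c:o:p:p:e:r"
```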
Note Description
Exercises
a) Check if the given strings start with be.
>> pat.match?(line1)
=> true
>> pat.match?(line2)
=> false
>> pat.match?(line3)
=> true
>> pat.match?(line4)
=> false
b) For the given input string, change only whole word red
to brown
d) For the given input array, filter all elements that start
with den or end with ly
>> items = ['lovely', "1\ndentist", '2 lonely', 'eden', "fly\n", 'dent']
h) For the given input array, replace hand with X for all
words that start with hand followed by at least one word
character.
i) For the given input array, filter all elements starting with
h. Additionally, replace e with X for these filtered elements.
>> items = %w[handed hand handy unhanded handle hand-2]
OR conditional
A conditional expression combined with logical OR
evaluates to true if any of the conditions is satisfied.
Similarly, in regular expressions, you can use the |
metacharacter to combine multiple patterns to indicate
logical OR. The matching will succeed if any of the
alternate patterns is found in the input string. These
alternatives have the full power of a regular expression, for
example they can have their own independent anchors.
Here are some examples.
=> true
=> true
=> false
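A short sketch of alternation, including alternatives that carry their own anchors. The sample strings are assumptions made for illustration.

```ruby
# match succeeds if either alternative is found
likes_cats = 'I like cats'.match?(/cat|dog/)        # => true
likes_dogs = 'I like dogs'.match?(/cat|dog/)        # => true
likes_other = 'I like parrots'.match?(/cat|dog/)    # => false

# each alternative can have its own anchors:
# 'cat' at the start of the string OR 'cat' at the end of a word
replaced = 'catapults concatenate cat scat'.gsub(/\Acat|cat\b/, 'X')
# => "Xapults concatenate X sX"
```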
Regexp.union method
You might infer from above examples that there can be
cases where lots of alternation is required. The
Regexp.union method can be used to build the alternation
list automatically. It accepts an array as argument or a list
of comma separated arguments.
=> /car|jeep/
>> pat
=> /cat|dog|fox/
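Both calling styles of Regexp.union can be sketched as follows. The variable names are assumptions made for illustration.

```ruby
# comma separated arguments
vehicles = Regexp.union('car', 'jeep')   # => /car|jeep/

# a single array argument
pets = %w[cat dog fox]
pat = Regexp.union(pets)                 # => /cat|dog|fox/

# the result is an ordinary regexp object
found = 'the quick brown fox'.match?(pat)   # => true
```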
Grouping
Often, there are some common things among the regexp
alternatives. It could be common characters or regexp
qualifiers like the anchors. In such cases, you can group
them using a pair of parentheses metacharacters. Similar
to a(b+c)d = abd+acd in maths, you get a(b|c)d =
abd|acd in regular expressions.
# without grouping
# with grouping
# without grouping
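The effect of grouping common portions can be sketched as follows. The sample strings are assumptions made for illustration.

```ruby
# without grouping: the common prefix 're' is repeated
plain = 'red reform read arrest'.gsub(/reform|rest/, 'X')
# => "red X read arX"

# with grouping: the common prefix is written once
grouped = 'red reform read arrest'.gsub(/re(form|st)/, 'X')
# => "red X read arX"

# anchors can be taken out of the alternatives in the same way
ungrouped_words = 'par spare part party'.gsub(/\bpar\b|\bpart\b/, 'X')
# => "X spare X party"
grouped_words = 'par spare part party'.gsub(/\b(par|part)\b/, 'X')
# => "X spare X party"
```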
Regexp.source method
The Regexp.source method helps to interpolate a regexp
literal inside another regexp. For example, adding anchors
to alternation list created using the Regexp.union method.
>> alt
=> /cat|par/
=> /\b(cat|par)\b/
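The steps above can be sketched end to end as follows. The sample sentence is an assumption; the comparison with direct interpolation is added to show why source is used.

```ruby
alt = Regexp.union(%w[cat par])          # => /cat|par/

# source gives the pattern text without delimiters or modifiers
pat = /\b(#{alt.source})\b/              # => /\b(cat|par)\b/

# interpolating the regexp object itself embeds its modifiers too
direct = /\b(#{alt})\b/                  # => /\b((?-mix:cat|par))\b/

replaced = 'cater cat concatenate par spare'.gsub(pat, 'X')
# => "cater X concatenate X spare"
```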
Precedence rules
There are some tricky situations when using alternation. If it
is used for testing a match to get true/false against a
string input, there is no ambiguity. However, for other
things like string replacement, it depends on a few factors.
Say, you want to replace either are or spared. Which one
should get precedence? The bigger word spared, or the
substring are inside it, or something else?
=> 2
=> 10
=> 5
=> 5
>> alt
=> /handful|handy|hand/
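A sketch of the two precedence rules: the alternative that matches earliest in the input wins, and for matches starting at the same index, the alternative listed first wins. The sample strings are assumptions chosen to agree with the index values shown above.

```ruby
words = 'lion elephant are rope not'   # assumed sample input
on_idx = words =~ /on/      # => 2
ant_idx = words =~ /ant/    # => 10

# 'on' occurs earlier in the input, so order in the regexp doesn't matter
earliest_a = words.sub(/on|ant/, 'X')   # => "liX elephant are rope not"
earliest_b = words.sub(/ant|on/, 'X')   # => "liX elephant are rope not"

mood = 'best years'                     # assumed sample input
# both alternatives match at index 5, so the leftmost alternative wins
short_first = mood.sub(/year|years/, 'X')   # => "best Xs"
long_first = mood.sub(/years|year/, 'X')    # => "best X"

# sorting by length (longest first) avoids the issue when building the list
alt = Regexp.union(%w[hand handy handful].sort_by { |w| -w.length })
# => /handful|handy|hand/
```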
Regexp.union(array)    programmatically combine multiple strings/regexps
()                     group pattern(s)
for ex: Regexp.union(words.sort_by { |w| -w.length })
Exercises
a) For the given input array, filter all elements that start
with den or end with ly
d) For the given input strings, replace all matches from the
array words with A.
e) Filter all whole elements from the input array items that
exactly matches any of the elements present in the array
words.
>> items.grep(pat)
Escaping with \
You have seen a few metacharacters and escape sequences
that help to compose a regexp literal. To match the
metacharacters literally, i.e. to remove their special
meaning, prefix those characters with a \ character. To
indicate a literal \ character, use \\.
=> false
# escaping will work
# match ( or ) literally
=> "/learn/by/example"
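Escaping with \ can be sketched as follows. The sample strings are assumptions, though the last result agrees with the output shown above.

```ruby
# ^ is a line anchor here, so this cannot match
unescaped = 'a^2 + b^2 - C*3'.match?(/b^2/)     # => false

# escaping makes ^ match literally
escaped = 'a^2 + b^2 - C*3'.match?(/b\^2/)      # => true

# match ( or ) literally
no_parens = '(a*b) + c'.gsub(/\(|\)/, '')       # => "a*b + c"

# \\ matches a single literal backslash
slashes = '\learn\by\example'.gsub(/\\/, '/')   # => "/learn/by/example"
```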
Regexp.escape method
How to escape all the metacharacters when a regexp is
constructed dynamically? Relax, Regexp.escape method
has got you covered. No need to manually take care of all
the metacharacters or worry about changes in future
versions.
\(a\^b\)
>> pat
=> /a_42|\(a\^b\)|2\|3/
=> /(?-mix:^cat|dog$)|a\^b/
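A sketch of Regexp.escape and of Regexp.union, which escapes string arguments but leaves regexp arguments alone. The eqn and expr names are assumptions made for illustration.

```ruby
expr = '(a^b)'
escaped = Regexp.escape(expr)
puts escaped                       # prints \(a\^b\)

# use the escaped text when building a regexp dynamically
eqn = 'f*(a^b) - 3*(a^b)'
replaced = eqn.gsub(/#{Regexp.escape(expr)}/, 'c')   # => "f*c - 3*c"

# Regexp.union escapes string arguments automatically
pat = Regexp.union('a_42', '(a^b)', '2|3')
# => /a_42|\(a\^b\)|2\|3/

# regexp arguments are kept as they are
mixed = Regexp.union(/^cat|dog$/, 'a^b')
# => /(?-mix:^cat|dog$)|a\^b/
```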
Escaping delimiter
Another character to keep track of for escaping is the
delimiter used to define the regexp literal. Alternatively, you
can use %r with a delimiter other than / to define the regexp
literal and avoid escaping. Also, you need not worry about an
unescaped delimiter inside #{} interpolation.
=> "~/foo/baz/ip.txt"
=> "~/foo/baz/ip.txt"
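The three options described above can be sketched as follows. The path and prefix values are assumptions chosen to agree with the output shown.

```ruby
path = '/foo/123/foo/baz/ip.txt'   # assumed sample input

# the / delimiter has to be escaped inside a /.../ literal
with_escapes = path.sub(/\A\/foo\/123\//, '~/')     # => "~/foo/baz/ip.txt"

# %r with a different delimiter avoids the escaping
with_percent_r = path.sub(%r{\A/foo/123/}, '~/')    # => "~/foo/baz/ip.txt"

# a delimiter coming from interpolation needs no escaping either
prefix = '/foo/123/'
with_interp = path.sub(/\A#{prefix}/, '~/')         # => "~/foo/baz/ip.txt"
```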
Escape sequences
In regexp literals, characters like tab and newline can be
expressed using escape sequences as \t and \n
respectively. These are similar to how they are treated in
normal string literals (see ruby-doc: Strings for details).
However, escapes like \b (word boundary) and \s (see
Escape sequence character sets section) are different for
regexps. And octal escapes \nnn have to be three digits to
avoid conflict with Backreferences.
=> "a:b:c"
>> 'h%x'.match?(/h\%x/)
=> true
>> 'h\%x'.match?(/h\%x/)
=> false
>> 'hello'.match?(/\l/)
=> true
=> "hello"
>> 'a+b'.match?(/a\053b/)
=> true
=> "150"
>> '12|30'.gsub(/2|3/, '5')
=> "15|50"
Note Description
\\ to match \ literally
Exercises
a) Transform given input strings to expected output using
same logic on both strings.
=> "35+qty/3"
=> "(qty+4)/2-35+pq/4"
=> "0Xcd"
=> "EXAMPLE"
>> ip = "123\b456"
>> puts ip
12456
>> ip = '3-(a^b)+2*(a^b)-(a/b)+3'
=> "3-X*X-X+3"
Dot metacharacter and Quantifiers
This chapter introduces the dot metacharacter and quantifiers.
As the name implies, quantifiers allow you to specify how
many times a character or grouping should be matched.
With * string operator, you can do something like 'no' * 5
to get "nonononono". This saves you manual repetition as
well as gives the ability to programmatically repeat a string
object as many times as you need. Quantifiers support this
simple repetition as well as ways to specify a range of
repetition. This range has the flexibility of being bounded
or unbounded with respect to start and end values.
Combined with dot metacharacter (and alternation if
needed), quantifiers allow you to construct conditional AND
logic between patterns.
Dot metacharacter
The dot metacharacter matches any character except the
newline character.
=> "483"
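The dot metacharacter can be sketched as follows. The sample strings are assumptions, though the second result agrees with the output shown above.

```ruby
# c, then any one character, then t
replaced = 'tac tin c.t abc;tuv acute'.gsub(/c.t/, 'X')
# => "taXin X abXuv aXe"

# the dot matches a tab character as well
tabbed = "42\t33".sub(/2.3/, '8')        # => "483"

# but by default it does not match a newline
newline = "a\nb".match?(/a.b/)           # => false
```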
split method
This chapter will additionally use split method to
illustrate examples. The split method separates the string
based on given regexp (or string) and returns an array of
strings.
>> 'apple-85-mango-70'.split(/-/)
=> ["apple", "85", "mango", "70"]
>> 'bus:3:car:5:van'.split(/:.:/)
=> ["bus", "car", "van"]
>> 'apple-85-mango-70'.split(/-/, 2)
=> ["apple", "85-mango-70"]
Greedy quantifiers
Quantifiers have functionality like the string repetition
operator and range method. They can be applied to both
characters and groupings (and more, as you'll see in later
chapters). Apart from the ability to specify an exact quantity
and a bounded range, these can also match unbounded varying
quantities. If the input string can satisfy a pattern with
varying quantities in multiple ways, you can choose among
three types of quantifiers to narrow down the possibilities. In
this section, the greedy type of quantifiers is covered.
>> words.grep(/\bre.?d\b/)
=> "3X511114X"
Here are some examples with split and related methods.
partition splits the input string on the first match and the
text matched by the regexp is also present in the output.
rpartition is like partition but splits on the last match.
>> '3111111111125111142'.split(/1*/)
>> '3111111111125111142'.partition(/1*2/)
>> '3111111111125111142'.rpartition(/1*2/)
=> "3X5111142"
>> '3111111111125111142'.split(/1+/)
Pattern Description
>> demo.grep(/ab{1,4}c/)
>> demo.grep(/ab{3,}c/)
>> demo.grep(/ab{,2}c/)
>> demo.grep(/ab{3}c/)
=> ["xabbbcz"]
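The greedy quantifiers ?, *, + and {m,n} can be sketched as follows. The sample strings and the demo array are assumptions; the last result agrees with the output shown above.

```ruby
# ? matches the preceding character zero or one time
optional = 'far feat flare fear'.gsub(/e?ar/, 'X')
# => "fX feat flXe fX"

# * matches zero or more times
star = 'tr tear tare steer sitaara'.gsub(/ta*r/, 'X')
# => "X tear Xe steer siXa"

# + matches one or more times
plus = 'tr tear tare steer sitaara'.gsub(/ta+r/, 'X')
# => "tr tear Xe steer siXa"

# {m,n} and its variations specify a range of repetitions
demo = %w[abc ac adc abbc xabbbcz bbb bc abbbbbc]   # assumed sample input
one_to_four = demo.grep(/ab{1,4}c/)   # => ["abc", "abbc", "xabbbcz"]
three_plus = demo.grep(/ab{3,}c/)     # => ["xabbbcz", "abbbbbc"]
up_to_two = demo.grep(/ab{,2}c/)      # => ["abc", "ac", "abbc"]
exactly_three = demo.grep(/ab{3}c/)   # => ["xabbbcz"]
```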
AND conditional
Next up, how to construct AND conditional using dot
metacharacter and quantifiers.
=> true
=> false
>> seq1.match?(/cat.*dog|dog.*cat/)
=> true
>> seq2.match?(/cat.*dog|dog.*cat/)
=> true
=> true
=> true
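The AND conditional can be sketched as follows. The values of seq1 and seq2 are assumptions chosen to agree with the outputs shown above.

```ruby
seq1 = 'cat and dog'   # assumed sample input
seq2 = 'dog and cat'   # assumed sample input

# match 'cat' followed by 'dog', with anything in between
ordered1 = seq1.match?(/cat.*dog/)   # => true
ordered2 = seq2.match?(/cat.*dog/)   # => false

# combine with alternation to match both in any order
any_order1 = seq1.match?(/cat.*dog|dog.*cat/)   # => true
any_order2 = seq2.match?(/cat.*dog|dog.*cat/)   # => true
```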
=> "Xt"
>> puts 'blah \< foo < bar \< blah < baz'.gsub(/\\?</, '\<')
=> "Xle"
>> 'star'.sub(/t.*a/, 'X')
=> "sXr"
Non-greedy quantifiers
As the name implies, these quantifiers will try to match as
minimally as possible. They are also known as lazy or reluctant
quantifiers. Appending a ? to greedy quantifiers makes
them non-greedy.
=> "Xot"
>> 'frost'.sub(/f.??o/, 'X')
=> "Xst"
=> "X3456789"
>> 'green:3.14:teal::brown:oh!:blue'.split(/:.*?:/)
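Greedy and non-greedy quantifiers can be compared in a short sketch. Sample strings not shown in this section are assumptions.

```ruby
# ?? prefers zero occurrences, then tries one if the rest fails
lazy_zero = 'foot'.sub(/f.??o/, 'X')    # => "Xot"
lazy_one = 'frost'.sub(/f.??o/, 'X')    # => "Xst"

# {2,5}? takes the minimum of 2, the greedy {2,5} takes the maximum of 5
lazy_range = '123456789'.sub(/.{2,5}?/, 'X')    # => "X3456789"
greedy_range = '123456789'.sub(/.{2,5}/, 'X')   # => "X6789"

# non-greedy stops at the nearest closing :
ip = 'green:3.14:teal::brown:oh!:blue'
lazy_split = ip.split(/:.*?:/)     # => ["green", "teal", "brown", "blue"]
greedy_split = ip.split(/:.*:/)    # => ["green", "blue"]
```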
Possessive quantifiers
Appending a + to greedy quantifiers makes them possessive
quantifiers. These are like greedy quantifiers, but without
the backtracking. So, something like /Error.*+valid/ will
never match because .*+ will consume all the remaining
characters. If both the greedy and possessive quantifier
versions are functionally equivalent, then possessive is
preferred because it will fail faster for non-matching cases.
# different results
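The difference between greedy and possessive quantifiers can be sketched as follows. The sample strings are assumptions made for illustration.

```ruby
# functionally equivalent: b*+ never needs to give characters back
greedy_same = 'abc ac adc abbbc'.gsub(/ab*c/, 'X')        # => "X X adc X"
possessive_same = 'abc ac adc abbbc'.gsub(/ab*+c/, 'X')   # => "X X adc X"

# different results: .*+ consumes the rest of the string and never
# backtracks, so 'valid' can no longer be matched
msg = 'Error: not a valid input'
greedy_msg = msg.match?(/Error.*valid/)        # => true
possessive_msg = msg.match?(/Error.*+valid/)   # => false

# the greedy version backtracks one character so that 'at' can match
greedy_alt = 'feat ft feaeat'.gsub(/f(a|e)*at/, 'X')       # => "X ft X"
possessive_alt = 'feat ft feaeat'.gsub(/f(a|e)*+at/, 'X')  # => "feat ft feaeat"
```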
Note          Description
non-greedy    append ? to greedy quantifier
              match as minimally as possible
possessive    append + to greedy quantifier
Exercises
>> eqn1.split(pat)
>> eqn2.split(pat)
>> eqn3.split(pat)
=> "a+b"
>> str2.gsub(remove_parentheses, '')
# wrong output
# expected output
? is same as
* is same as
+ is same as
m) Can you reason out why this code results in the output
shown? The aim was to remove all <characters> patterns
but not the <> ones. The expected result was 'a 1<> b 2<>
c'.
>> s1.split(pat, 2)
>> s2.split(pat, 2)
>> s1.match?(pat)
=> true
>> s2.match?(pat)
=> true
>> s3.match?(pat)
=> false
>> s4.match?(pat)
=> false
>> s5.match?(pat)
=> false
Interlude: Tools for debugging and visualization
As your regexp gets complicated, it can get difficult to
debug if you run into issues. Building your regexp step by
step from scratch and testing against input strings will go a
long way in correcting the problem. To aid in such a
process, you could use various online tools.
rubular
rubular is an online Ruby regular expression editor (based
on Ruby 2.5.7) to visually test your regexp. You need to add
your regexp, input string and optional modifiers. Matching
portions will be highlighted.
debuggex
Another useful tool is debuggex which converts your regexp
to a railroad diagram, thus providing a visual aid to
understanding the pattern. This doesn't support Ruby, so
select the JavaScript flavor.
regexcrossword
For practice, regexcrossword is often recommended. It only
supports JavaScript, so some of the puzzles may not work
the same with Ruby syntax. See regexcrossword: howtoplay
for help.
match method
First up, the match method which is similar to match? method. Both these
methods accept a regexp and an optional index to indicate the starting
location. Furthermore, these methods treat a string argument as if it was
a regexp all along (which is not the case with other string methods like
sub, split, etc). The match method returns a MatchData object from
which various details can be extracted like the matched portion of string,
location of matched portion, etc. nil is returned if there's no match for
the given regexp.
>> m.to_a
>> m.captures
>> m[1]
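The m variable used above is not defined in the text that survives. This sketch assumes the same sample that the offset example in this section uses.

```ruby
m = 'awesome'.match(/w(.*)me/)

whole = m[0]             # => "wesome"  (entire matched portion)
first = m[1]             # => "eso"     (first capture group)
all = m.to_a             # => ["wesome", "eso"]
groups = m.captures      # => ["eso"]
before = m.pre_match     # => "a"

# nil is returned if there's no match
none = 'awesome'.match(/xyz/)            # => nil

# a string argument is treated as a regexp
from_string = 'awesome'.match('w.*m')[0]   # => "wesom"
```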
The offset method gives the starting and ending + 1 indexes of the
matching portion. It accepts an argument to indicate entire matching
portion or specific capture group. You can also use begin and end
methods to get either of those locations.
>> m = 'awesome'.match(/w(.*)me/)
>> m.offset(0)
=> [1, 7]
>> m.offset(1)
=> [2, 5]
>> m.begin(0)
=> 1
>> m.end(1)
=> 5
>> m.regexp
=> /hi.*bye/i
>> m.string
c a
bc ac a
=> nil
=> "abbbc"
=> "w"
>> word
=> "want"
scan method
The scan method returns all the matched portions as an array. With match
method you can get only the first matching portion.
>> '2020/04/25,1986/Mar/02,77/12/31'.scan(%r{(.*?)/(.*?)/(.*?),})
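A sketch of scan with and without capture groups. The first sample string is an assumption; the second is the one used above.

```ruby
# without capture groups: an array of matched portions
plain = 'abc ac adc abbbc'.scan(/ab*c/)
# => ["abc", "ac", "abbbc"]

# with capture groups: an array of arrays, one per match
dates = '2020/04/25,1986/Mar/02,77/12/31'
fields = dates.scan(%r{(.*?)/(.*?)/(.*?),})
# => [["2020", "04", "25"], ["1986", "Mar", "02"]]
# the last date is missing as it isn't followed by a comma
```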
ABC
ABBBC
>> '31111111111251111426'.split(/1*4?2/)
>> '31111111111251111426'.split(/(1*4?2)/)
If part of the regexp is outside a capture group, the text thus matched
won't be in the output. If a capture group didn't participate, that element
will be totally absent in the output.
>> '31111111111251111426'.split(/(1*)4?2/)
>> '3.14aabccc42'.split(/(a+)b+(c+)/)
>> '31111111111251111426'.split(/(1*)(4)?2/)
>> '3.14aabccc42abc88'.split(/(a+b+c+)/, 2)
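The rules for split with capture groups can be sketched as follows. The last sample string is an assumption made for illustration.

```ruby
ip = '3.14aabccc42'

# no capture group: separators are dropped
no_group = ip.split(/a+b+c+/)          # => ["3.14", "42"]

# whole separator captured: it shows up in the output
whole = ip.split(/(a+b+c+)/)           # => ["3.14", "aabccc", "42"]

# only the captured parts of the separator show up
partial = ip.split(/(a+)b+(c+)/)       # => ["3.14", "aa", "ccc", "42"]

# a group that didn't participate adds nothing to the output
optional = 'one:1two:-2three'.split(/:(-)?\d/)
# => ["one", "two", "-", "three"]

# the limit argument works as usual
limited = '3.14aabccc42abc88'.split(/(a+b+c+)/, 2)
# => ["3.14", "aabccc", "42abc88"]
```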
Here's an example:
=> 8
>> $~
>> $~[0]
>> $`
>> $&
>> $'
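The global variables listed above can be sketched as follows. The sentence value and the pattern are assumptions chosen to agree with the index shown.

```ruby
sentence = 'that is quite a fabricated tale'   # assumed sample input
idx = sentence =~ /q.*b/     # => 8

match_data = $~              # MatchData object for the last match
matched = $~[0]              # => "quite a fab"
before = $`                  # => "that is "   (text before the match)
same_matched = $&            # => "quite a fab"
after = $'                   # => "ricated tale"   (text after the match)
```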
For methods that match multiple times, like scan and gsub, the global
variables will be updated for each match. Referring to them in later
instructions will give you information only for the final match.
ABC
ABBBC
11
>> $~
>> $`
If you need to apply methods like map and use regexp global
variables, use gsub instead of scan.
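The difference between scan and the Enumerator form of gsub can be sketched as follows. The sample string is an assumption made for illustration.

```ruby
s = 'abc ac adc abbbc'   # assumed sample input

# after scan, the global variables reflect only the final match
matches = s.scan(/ab*c/)        # => ["abc", "ac", "abbbc"]
last_start = $~.begin(0)        # => 11

# the Enumerator from gsub updates them for every match,
# so they can be used inside the map block
starts = s.gsub(/ab*c/).map { $~.begin(0) }   # => [0, 4, 11]
```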
In addition to using $~, you can also use $N where N is the capture group
you want. $1 will have string matched by the first group, $2 will have
string matched by the second group and so on. As a special case, $+ will
have string matched by the last group. Default value is nil if that
particular capture group wasn't used in the regexp.
=> 2
>> $&
# same as $~[1]
>> $1
>> $2
=> "fab"
>> $+
>> $4
=> nil
>> $~[-2]
=> "fab"
>> $~.values_at(1, 3)
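The numbered global variables can be sketched as follows. The sentence value and the pattern are assumptions chosen to agree with the outputs that survive above.

```ruby
sentence = 'that is quite a fabricated tale'   # assumed sample input
idx = sentence =~ /a.*(q.*(f.*b).*c)(a.*a)/    # => 2

first = $1          # => "quite a fabric"   (same as $~[1])
second = $2         # => "fab"
last_group = $+     # => "ated ta"          (last group, here the third)
fourth = $4         # => nil                (there is no fourth group)

# MatchData supports negative indexing and values_at
from_end = $~[-2]               # => "fab"
picked = $~.values_at(1, 3)     # => ["quite a fabric", "ated ta"]
```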
Using hashes
With the help of block form and global variables, you can use a hash
variable to determine the replacement string based on the matched text.
If the requirement is as simple as passing entire matched portion to the
hash variable, both sub and gsub methods accept a hash instead of string
in replacement section.
>> h = { '1' => 'one', '2' => 'two', '4' => 'four' }
>> '9234012'.gsub(/1|2|4/, h)
=> "9two3four0onetwo"
# with a default value set, unmatched keys use that value
>> h.default = 'X'
=> "X"
>> '9234012'.gsub(/./, h)
=> "XtwoXfourXonetwo"
For hashes that have many entries and are likely to undergo changes during
development, building the alternation list manually is not a good choice. Also,
recall that as per precedence rules, the longest string should come
first.
>> pat
=> /handful|handy|hand|a\^b/
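Building the alternation from the hash keys can be sketched as follows. The hash contents and the input string are assumptions chosen to agree with the pattern shown above.

```ruby
# assumed sample hash; values are the replacement strings
h = { 'hand' => '1', 'handy' => '2', 'handful' => '3', 'a^b' => '4' }

# longest keys first, and Regexp.union takes care of escaping
pat = Regexp.union(h.keys.sort_by { |k| -k.length })
# => /handful|handy|hand|a\^b/

replaced = 'handful hand pin handy (a^b)'.gsub(pat, h)
# => "3 1 pin 2 (4)"
```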
=> nil
16 apples
=> nil
=> ["cog", 2]
Note                    Description
s[/pat/] = 'replace'    same as s.sub!(/pat/, 'replace')
inplace substitution    sub! and gsub! return nil if substitution fails
Exercises
a) For the given strings, extract the matching portion from first is to last
t.
=> 12
=> 4
=> 2
=> 4
=> 12
=> 18
=> 17
=> 14
d) The given input string contains : exactly once. Extract all characters
after the : as output.
>> ip = 'fruits:apple, mango, guava, blueberry'
>> s1 = 'first-3.14'
>> s2 = 'next-123'
=> "first-1.144222799920162"
=> "next-4.812184355372417"
f) Replace all occurrences of par with spar, spare with extra and park
with garden for the given input strings.
g) Extract all words between ( and ) from the given input string as an
array. Assume that the input will not contain any broken parentheses.
>> ip = 'another (way) to reuse (portion) matched (by) capture groups'
i) Use scan to get the output as shown below for the given input strings.
Note the characters used in the input strings carefully.
>> row1.scan(pat)
>> row2.scan(pat)
For row1, find the sum of integers of each array element. For example,
sum of -2 and 5 is 3.
For row2, find the sum of floating-point numbers of each array
element. For example, sum of 1.32 and -3.14 is -1.82.
>> ip = '42:no-output;1000:car-truck;SQEX49801'
Range of characters
Character classes have their own metacharacters to help define
the sets succinctly. Metacharacters outside of character classes
like ^, $, () etc either don't have special meaning or have a
completely different one inside the character classes. First up, the
- metacharacter that helps to define a range of characters instead
of having to specify them all individually.
# all digits, same as: scan(/[0123456789]+/)
>> 'Sample123string42with777numbers'.scan(/[0-9]+/)
=> ["best"]
>> 'Sample123string42with777numbers'.scan(/[^0-9]+/)
=> "bar:baz"
>> dates.scan(%r{([^/]+)/([^/]+)/([^/,]+),?})
>> words.grep(/\A[^aeiou]+\z/)
>> words.grep_v(/[aeiou]/)
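Ranges and negated character sets can be sketched as follows. The words array and the third sample string are assumptions made for illustration.

```ruby
ip = 'Sample123string42with777numbers'

# - defines a range, ^ as the first character negates the set
digits = ip.scan(/[0-9]+/)        # => ["123", "42", "777"]
non_digits = ip.scan(/[^0-9]+/)   # => ["Sample", "string", "with", "numbers"]

# remove the first two colon separated fields
trimmed = 'foo:123:bar:baz'.sub(/\A([^:]+:){2}/, '')   # => "bar:baz"

# two ways to filter words that have no vowels
words = %w[tryst fun glyph pity why]       # assumed sample input
negated_set = words.grep(/\A[^aeiou]+\z/)  # => ["tryst", "glyph", "why"]
inverted = words.grep_v(/[aeiou]/)         # => ["tryst", "glyph", "why"]
```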
Set intersection
Using && between two sets of characters will result in matching
only the intersection of those two sets. To aid in such definitions,
you can use [] in nested fashion.
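Set intersection can be sketched as follows. The sample strings are assumptions made for illustration.

```ruby
# lowercase letters that are not vowels
consonant_words = 'tryst,fun,glyph,pity,why'.scan(/\b[a-z&&[^aeiou]]+\b/)
# => ["tryst", "glyph", "why"]

# nested sets: characters from a to j, except c to f
runs = 'abcdefghij'.scan(/[[a-j]&&[^c-f]]+/)
# => ["ab", "ghij"]
```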
=> "words[5]"
ba\bab
>> 'Sample123string42with777numbers'.split(/\d+/)
=> "secret"
=> "-123-42-777-"
=> "foo5bar3x83y120"
\R matches line break characters \n, \v, \f, \r, \u0085 (next
line), \u2028 (line separator), \u2029 (paragraph separator) or
\r\n. Unlike other escapes, \R cannot be used inside a character
class.
>> "food\r\ngood"[/d\Rg/]
=> "d\r\ng"
>> 'Sample123string42with777numbers'.split(/[[:digit:]]+/)
>> 'Sample123string42with777numbers'.scan(/[[:alpha:]]+/)
Numeric ranges
Character classes can also be used to construct numeric ranges.
# numbers between 10 to 29
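A numeric range built from character classes can be sketched as follows. The sample strings are assumptions, and the second approach (comparing after conversion to integers) is an alternative suggested here for ranges that are awkward to express as a pattern.

```ruby
# numbers between 10 to 29: first digit 1 or 2, then any digit
in_range = '23 154 12 26 98234'.scan(/\b[12]\d\b/)
# => ["23", "12", "26"]

# for arbitrary ranges, extract the numbers and compare them as integers
between = '45 349 651 593 4 204'.scan(/\d+/)
                                 .select { |n| n.to_i.between?(200, 650) }
# => ["349", "593", "204"]
```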
Note               Description
[a-z&&[^aeiou]]    intersection of a-z and [^aeiou]
Exercises
a) For the array items, filter all elements starting with hand and
ending with s or y or le.
g) For the array words, filter all elements not starting with e or p
or u.
>> words = %w[p-t you tea heel owe new reed ear]
=> "(2),kite,12,,D,WHTSZ323"
=> "hi,WHTSZ323"
>> str1.split(pat)
>> str2.split(pat)
known
mood
know
pony
inns
m) For the given array, filter all elements containing any number
sequence greater than 624.
>> end
>> max_nested_braces('a*b')
=> 0
>> max_nested_braces('}a+b{')
=> -1
>> max_nested_braces('a*b+{}')
=> 1
>> max_nested_braces('{{a+2}*{b+c}+e}')
=> 2
>> max_nested_braces('{{a+2}*{b+{c*d}}+e}')
=> 3
>> max_nested_braces("{{a+2}*{\n{b+{c*d}}+e*d}}")
=> 4
>> max_nested_braces('a*{b+c*{e*3.14}}}')
=> -1
>> ip.split
r) Extract all whole words for the given input strings. However,
based on user input ignore, do not match words if they contain
any character present in the ignore variable. Assume that ignore
variable will not contain any regexp metacharacter.
>> s1.scan(pat)
=> ["newline"]
>> s2.scan(pat)
=> []
>> s1.scan(pat)
=> ["match"]
>> s2.scan(pat)
Backreferences
Backreferences are like variables in a programming language.
You have already seen how to use MatchData object and regexp
global variables to refer to the text matched by capture groups.
Backreferences provide the same functionality, with the
advantage that these can be directly used in regexp definition
as well as replacement section. Another advantage is that you
can apply quantifiers to backreferences.
=> "fork,42,nice,3.14,fork"
=> false
# there's no ambiguity here as \k<1> can only mean 1st backreference
=> true
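Backreferences in the regexp definition and in the replacement section can be sketched as follows. The sample inputs are assumptions, though the third result agrees with the first output shown above.

```ruby
# \1 in the regexp: the same character repeated twice in a row
words = %w[effort flee facade oddball rat tool]   # assumed sample input
doubled = words.grep(/(\w)\1/)
# => ["effort", "flee", "oddball", "tool"]

# \1 in the replacement: keep only the digits inside the brackets
unwrapped = '[52] apples and [31] mangoes'.gsub(/\[(\d+)\]/, '\1')
# => "52 apples and 31 mangoes"

# \0 refers to the entire matched portion
appended = 'fork,42,nice,3.14'.sub(/\A([^,]+).*/, '\0,\1')
# => "fork,42,nice,3.14,fork"

# \k<1> avoids ambiguity when a digit follows the backreference
followed = 'abcabc2'.match?(/(abc)\k<1>2/)   # => true
```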
>> s = 'abcdefghijklmna1d'
>> s.sub(/(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.).*\1\x31/, 'X')
=> "Xd"
>> '123hand42handy777handful500'.split(/hand(?:y|ful)?/)
>> '1,2,3,4,5,6,7'.sub(/\A(([^,]+,){3})([^,]+)/, '\1(\3)')
=> "1,2,3,(4),5,6,7"
>> '1,2,3,4,5,6,7'.sub(/\A((?:[^,]+,){3})([^,]+)/, '\1(\2)')
=> "1,2,3,(4),5,6,7"
>> s.scan(/(123)+/)
>> s.scan(/(?:123)+/)
3.14,"42",five
one,2,3.14,"42",five
Subexpression calls
It may be obvious, but it should be noted that backreference
will provide the string that was matched, not the regexp that
was inside the capture group. For example, if (\d[a-f])
matches 3b, then backreferencing will only give 3b and not any
other valid match of regexp like 8f, 0a etc. This is akin to how
variables behave in programming, only the result of expression
stays after variable assignment, not the expression itself.
To refer to the regexp itself, use \g<0>, \g<1>, \g<2> etc. This
is applicable only in regexp definition, not in replacement
sections. This behavior, which is similar to function call, is
known as subexpression call in regular expression parlance.
Recursion will be discussed in the next section.
>> row = 'today,2008-03-24,food,2012-08-12,nice,5632'
>> row[/(\d{4}-\d{2}-\d{2}).*\g<1>/]
=> "2008-03-24,food,2012-08-12"
>> d.scan(/(\d{4}-\d{2}-\d{2}),\g<1>/)
Recursive matching
The \g<N> subexpression call supports recursion as well. This is
useful for matching nested patterns, a task that is usually not
recommended for regular expressions. Indeed, use a proper parser
library if you are looking to parse file formats like html, xml,
json, csv, etc. But for some cases, a parser might not be
available and using regexp might be simpler than writing a
parser from scratch.
>> eqn0.scan(/\([^()]++\)/)
>> eqn1.scan(/\([^()]++\)/)
>> eqn1.scan(/\((?:[^()]++|\([^()]++\))++\)/)
>> eqn2.scan(/\((?:[^()]++|\([^()]++\))++\)/)
[^()]++ #non-parentheses characters
| #OR
\) #literal )
/x
>> eqn1.scan(lvl2)
>> eqn2.scan(lvl2)
[^()]++ #non-parentheses characters
| #OR
\) #literal )
/x
>> eqn0.scan(lvln)
>> eqn1.scan(lvln)
>> eqn2.scan(lvln)
>> eqn3.scan(lvln)
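The definitions of the eqn and lvln variables used above are not shown in full. This sketch assumes a sample equation and uses the recursive pattern that appears in this chapter's cheatsheet.

```ruby
eqn = '(3+a) * ((r-2)*(t+2)/6) + 42'   # assumed sample input

# one level only: a pair of parentheses with no parentheses inside
level_one = eqn.scan(/\([^()]++\)/)
# => ["(3+a)", "(r-2)", "(t+2)"]

# any depth: \g<0> calls the whole regexp recursively
nested = /\((?:[^()]++|\g<0>)++\)/
all_levels = eqn.scan(nested)
# => ["(3+a)", "((r-2)*(t+2)/6)"]
```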
# alternate syntax
>> row[/(?<date>\d{4}-\d{2}-\d{2}).*\g<date>/]
=> "2008-03-24,food,2012-08-12"
=> 0
# same as: $1
>> date
=> "2018-10-25"
# same as: $2
>> product
=> "car"
# single match
>> details.match(/(?<date>[^,]+),(?<product>[^,]+)/).named_captures
>> details.match(/(?<date>[^,]+),([^,]+)/).named_captures
=> {"date"=>"2018-10-25"}
# multiple matches
>> s.gsub(/(?<fw>\w+),(?<sw>\w+)/).map { $~.named_captures }
Negative backreferences
Another approach when there are multiple capture groups is to
use negative backreference. The negative numbering starts
with -1 to refer to capture group closest to the backreference
that was defined before the use of negative backreference. In
other words, the highest numbered capture group prior to the
negative backreference will be -1, the second highest will be
-2 and so on. The \k<N> syntax with negative N becomes a
negative backreference. This can only be used in regexp
definition section as \k in replacement section is reserved for
named references.
>> '1,2,3,3,5'.match?(/\A([^,]+,){2}([^,]+),\k<-1>,/)
=> true
Conditional groups
This special grouping allows you to add a condition that
depends on whether a capture group succeeded in matching.
You can also add an optional else condition. The syntax as per
Onigmo doc is shown below.
(?(cond)yes-subexp|no-subexp)
>> words.grep(/\A(<)?\w+(?(1)>)\z/)
>> words.grep(/\A(?:<\w+>|\w+)\z/)
>> words.grep(/\A(?:<?\w+>?)\z/)
>> words.grep(/\A(?:(\()?\w+(?(1)\)|-\w+))\z/)
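The sketch below runs two of the patterns above on sample word lists that were invented for illustration (the original inputs are not shown here). The constant and variable names are assumptions as well.

```ruby
# (<)? optionally captures '<'
# (?(1)>) demands '>' only if group 1 took part in the match
TAGGED = /\A(<)?\w+(?(1)>)\z/
tags = %w[<hi> bye bye> <cat <a>b]
p tags.grep(TAGGED)
# => ["<hi>", "bye"]

# with an else branch: ')' if '(' was captured, otherwise '-' and a word
PAREN_OR_HYPHEN = /\A(?:(\()?\w+(?(1)\)|-\w+))\z/
items = %w[(hi) 2-3 hi (a a) a-b-c]
p items.grep(PAREN_OR_HYPHEN)
# => ["(hi)", "2-3"]
```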
Note Description
ex: /\((?:[^()]++|\g<0>)++\)/ matches nested parentheses
(?(cond)yes|no) conditional group
Exercises
a) Replace the space character that occurs after a word ending
with a or r with a newline character.
area
not a
_a2_ roar
took 22
b) Add [] around words starting with s and containing e and t
in any order.
c) Replace all whole words with X that start and end with the
same word character. Single character word should get
replaced with X too, as it satisfies the stated condition.
=> 13
>> ip = 'firecatlioncatcatcatbearcatcatparrot'
>> str1.scan(hex_seq)
>> str2.scan(hex_seq)
>> ip1[pat]
=> "if(3-(k*3+4)/12-(r+2/3))"
>> ip2[pat]
=> "if(a(b)c(d(e(f)1)2)3)"
Interlude: Common tasks
Tasks like matching phone numbers, IP addresses, dates,
etc are so common that you can often find them collected
as a library. This chapter shows some examples for
CommonRegexRuby. See also Awesome Regex: Collections.
CommonRegexRuby
You can either install the commonregex gem or go through
commonregex.rb and choose the regular expression you
need. See also CommonRegexRuby: README for details
and examples of available patterns.
=> true
>> parsed.get_ipv4
=> ["255.21.255.22"]
>> parsed.get_dates
=> ["23/04/96"]
>> CommonRegex.get_ipv4(data)
=> ["255.21.255.22"]
>> CommonRegex.get_dates(data)
=> ["23/04/96"]
Make sure to test these patterns for your use case. For
example, the data below has a valid IPv4 address followed
by a dot character and another number. If such cases
should be ignored, you'll have to create your own version
of the pattern or change the input accordingly.
>> CommonRegex.get_ipv4(new_data)
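To illustrate the kind of adjustment meant here without relying on the gem, the sketch below uses two hand-written patterns. Both LOOSE and STRICT are hypothetical simplifications (they do not check the 0-255 range) and are not the patterns used by CommonRegexRuby.

```ruby
data = 'ip: 255.21.255.22 and 10.1.2.3.4 is a version string'

# four dot-separated numbers, bounded only by word boundaries
LOOSE = /\b(?:\d{1,3}\.){3}\d{1,3}\b/
# same, but not adjacent to another digit or dot on either side
STRICT = /(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])/

p data.scan(LOOSE)    # => ["255.21.255.22", "10.1.2.3"]
p data.scan(STRICT)   # => ["255.21.255.22"]
```

Note that STRICT would also reject an address that is followed by a sentence-ending dot, so whether this trade-off is acceptable depends on your input.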
Summary
Some patterns are quite complex and not easy to build and
validate from scratch. Libraries like CommonRegexRuby
help reduce the time and effort needed for commonly known
tasks. However, you do need to test the solution for your
use case. See also stackoverflow: validating email addresses.
Lookarounds
You've already seen how to create custom character classes
and various avatars of special groupings. In this chapter
you'll learn more groupings, known as lookarounds, that
help to create custom anchors and add conditions within
regexp definition. These assertions are also known as zero-
width patterns because they add restrictions similar to
anchors and are not part of the matched portions. Also, you
will learn how to negate a grouping similar to negated
character sets and what's special about the \G anchor.
Conditional expressions
Before you get too used to lookarounds, it is good to
remember that Ruby is a programming language. You have
control structures and you can combine multiple conditions
using logical operators, methods like all?, any?, etc. Also,
do not forget that regular expressions are only one of the
tools available for string processing.
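As a small sketch of this idea, the snippet below combines simple regexps with all? and logical operators instead of building a single complex pattern. The helper name contains_all? and the sample data are made up for illustration.

```ruby
# true if every regexp in patterns matches str
def contains_all?(str, patterns)
  patterns.all? { |re| str.match?(re) }
end

lines = ['hot tea', 'iced tea', 'hot coffee', 'tea at night']

# AND: lines containing both 'tea' and 'hot'
p lines.select { |s| contains_all?(s, [/tea/, /hot/]) }
# => ["hot tea"]

# a condition combined with a negated condition
p lines.select { |s| s.match?(/tea/) && !s.match?(/\Aiced/) }
# => ["hot tea", "tea at night"]
```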
Negative lookarounds
Lookaround assertions can be added in two ways:
lookbehind and lookahead. Each of these can be a
positive or a negative assertion. Syntax wise, lookbehind
has an extra < compared to the lookahead version.
Negative lookarounds can be identified by the use of !
whereas = is used for positive lookarounds. This section is
about negative lookarounds, whose complete syntax is
shown below.
(?!pat) for negative lookahead assertion
(?<!pat) for negative lookbehind assertion
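A minimal sketch of both negative forms, using sample strings chosen for illustration:

```ruby
# change 'cat' only if it is not followed by a digit character
p 'hey cats! cat42 cat_5 catcat'.gsub(/cat(?!\d)/, 'dog')
# => "hey dogs! cat42 dog_5 dogdog"

# change 'cat' only if it is not preceded by '_'
p 'cat _cat 42catcat'.gsub(/(?<!_)cat/, 'dog')
# => "dog _cat 42dogdog"
```

In both cases only cat is replaced. The character checked by the lookaround is left untouched because it is not part of the matched portion.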
# overlap example
=> ["ink"]
>> 'foo_baz=num1+35*42/num2'.gsub(/(?<!\A)\b(?!\z)/, ' ')
=> "**a*e"
=> "**a*e"
>> 'foo_baz=num1+35*42/num2'.gsub(/(?!\z)\b(?<!\A)/, ' ')
Positive lookarounds
Unlike negative lookarounds, absence of something will not
satisfy positive lookarounds. Instead, for the condition to
be satisfied, the pattern has to match actual characters and/or
zero-width assertions. Positive lookarounds can be identified
by the use of = in the grouping. The complete syntax looks like:
(?=pat) for positive lookahead assertion
(?<=pat) for positive lookbehind assertion
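A minimal sketch of both positive forms, using a sample string chosen for illustration:

```ruby
s = 'a:1, b:22, c:333'

# digits that are preceded by ':' and followed by ','
p s.scan(/(?<=:)\d+(?=,)/)
# => ["1", "22"]

# also accept end of string as the condition on the right
p s.scan(/(?<=:)\d+(?=,|\z)/)
# => ["1", "22", "333"]

# the ':' is checked but not replaced
p s.gsub(/(?<=:)\d+/, 'X')
# => "a:X, b:X, c:X"
```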
=> ["20"]
>> '1,two,3,four,5'.scan(/(?<=,)[^,]+(?=,)/)
>> ',1,,,two,3,,'.gsub(/(?<=\A|,)(?=,|\z)/, 'nil')
=> "nil,1,nil,nil,two,3,nil,nil"
=> "{},{cat}{},{tiger}{}"
=> "{},{cat},{tiger}"
Capture groups inside positive lookarounds
Even though lookarounds are not part of matched portions,
capture groups can be used inside positive lookarounds.
Can you reason out why it won't work for negative
lookarounds?
a b
b c
c d
d e
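One way to get pairwise output like the lines above is sketched below. The input string and the patterns are assumptions made for illustration. The second capture group sits inside a lookahead, so its text is available in the replacement section but is not consumed by the match.

```ruby
# group 2 is captured by the lookahead, so the next match can start at it
puts 'a b c d e'.gsub(/(\S+\s+)(?=(\S+)\s)/, "\\1\\2\n")
# a b
# b c
# c d
# d e

# overlapping two-character portions via a capture group in a lookahead
p 'abcde'.scan(/(?=(\w\w))/).flatten
# => ["ab", "bc", "cd", "de"]
```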
# same as:
/b.*e.*t|b.*t.*e|e.*b.*t|e.*t.*b|t.*b.*e|t.*e.*b/
>> words.grep(/(?=.*b)(?=.*e).*t/)
>> words.grep(/(?=.*a)(?=.*e)(?=.*i)(?=.*o).*u/)
>> words.grep(/(?=.*a)(?=.*q)(?!.*n\z)/)
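The three patterns above can be tried on a sample list as sketched below. The contents of words are an assumption, since the original definition is not shown here.

```ruby
words = %w[sequoia subtle questionable exhibit equation]

# words containing 'b', 'e' and 't' in any order
p words.grep(/(?=.*b)(?=.*e).*t/)
# => ["subtle", "questionable", "exhibit"]

# words containing all the vowels in any order
p words.grep(/(?=.*a)(?=.*e)(?=.*i)(?=.*o).*u/)
# => ["sequoia", "questionable", "equation"]

# words containing 'a' and 'q' but not 'n' at the end
p words.grep(/(?=.*a)(?=.*q)(?!.*n\z)/)
# => ["sequoia", "questionable"]
```

Each lookahead starts checking from the same position, which is why the order of the conditions does not matter.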
=> "secret"
=> nil
>> row
=> "421,foo,2425,42,5,6"
=> "{},{cat},{tiger}"
>> ',cat,tiger'.gsub(/(?:\A|,)\K[^,]*+/, '{\0}')
=> "{},cat,{tiger}"
# allowed
>> s.scan(/(?<=(?:po|da)re)\d+/)
=> ["42", "7"]
>> s.scan(/(?<=\b[a-z]{4})\d+/)
>> s.scan(/(?<!tar|dare)\d+/)
# not allowed
>> s.scan(/(?<=(?:o|ca)re)\d+/)
>> s.scan(/(?<=\b[a-z]+)\d+/)
>> s.gsub(/(?:tar|dare)(\d+)/).map { $1 }
>> s.gsub(/\b[pd][a-z]*(\d+)/).map { $1 }
>> 'fox,cat,dog,parrot'.match?(/\A((?!cat).)*dog/)
=> false
# match 'dog' only if it is not preceded by 'parrot'
>> 'fox,cat,dog,parrot'.match?(/\A((?!parrot).)*dog/)
=> true
>> 'fox,cat,dog,parrot'[/\A((?!cat).)*/]
=> "fox,"
>> 'fox,cat,dog,parrot'[/\A((?!parrot).)*/]
=> "fox,cat,dog,"
>> 'fox,cat,dog,parrot'[/\A(?:(?!(.)\1).)*/]
=> "fox,cat,dog,pa"
>> 'fox,cat,dog,parrot'.match?(/at(?~do)par/)
=> false
>> 'fox,cat,dog,parrot'.match?(/at(?~go)par/)
=> true
>> 'fox,cat,dog,parrot'[/at(?~go)par/]
=> "at,dog,par"
\G anchor
The \G anchor restricts matching to the start of the string,
like the \A anchor. In addition, after a match is done, the
end of that match is considered as the new anchor location.
This process repeats until the given regexp fails to match
(assuming multiple matches with methods like scan and gsub).
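A short sketch of this behavior with scan and gsub, using sample strings chosen for illustration:

```ruby
# matching stops at the first position where \G\d+-? cannot match
p '123-87-593 42 foo'.scan(/\G\d+-?/)
# => ["123-", "87-", "593"]

# only the word characters at the start of the string are replaced
p 'cat12 bat pin'.gsub(/\G\w/, '*')
# => "***** bat pin"

# without \G, all word characters are replaced
p 'cat12 bat pin'.gsub(/\w/, '*')
# => "***** *** ***"
```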
Note Description
Exercises
>> ip = 'poke,on=-=so:ink.to/is(vast)ever-sit'
>> ip = 'poke,on=-=so:ink.to/is(vast)ever-sit'
>> ip = 'poke,on=-=so:ink.to/is(vast)ever-sit'
>> ip = 'poke,on=-=so:ink.to/is(vast)ever-sit'
=> "comma,separated,values"
>> pwds.grep(rule_chk)
>> s1 = 'apple'
>> s2 = '1.2-3:4'
>> s1 = '42:cat'
>> s2 = 'twelve:a2b'
>> s3 = 'we:be:he:0:a:b:bother'
=> "42"
=> "twelve:a2b"
=> "we:be:he:0:a:b"
>> ip = '::very--at<=>row|in.a_b#b2c=>lion----east'
>> str1.match?(neg)
=> true
>> str2.match?(neg)
=> false
>> str3.match?(neg)
=> false
>> str4.match?(neg)
=> true
>> str5.match?(neg)
=> false
r) Can you reason out why the output shown is different for
these two regular expressions?
=> "a2b2:ride:in:awe:b2b:3list:end"
=> "3e:s4w:seer"
=> "thr33:f0ur"
Modifiers
Just like options change the default behavior of commands
used from a terminal, modifiers are used to change aspects
of regexp. They can be applied to the entire regexp or to a
particular portion of the regexp, and both forms can be
mixed as well. The cryptic output of Regexp.union when one
of the arguments is a regexp will also be explained in this
chapter. In regular expression parlance, modifiers are also
known as flags.
i modifier
First up is the i modifier, which ignores case while
matching letters.
=> nil
=> 0
>> 'Sample123string42with777numbers'.scan(/[a-z]+/i)
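A minimal sketch contrasting a pattern with and without the i modifier (sample strings chosen for illustration):

```ruby
# case sensitive by default
p 'Cat' =~ /cat/
# => nil

# i modifier ignores case
p 'Cat' =~ /cat/i
# => 0

p 'Cat scatter CATER cAts'.scan(/cat/i)
# => ["Cat", "cat", "CAT", "cAt"]
```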
m modifier
Use the m modifier to allow the . metacharacter to match
newline characters as well.
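For instance, with a two-line sample string (chosen for illustration), the same pattern behaves differently with and without the m modifier:

```ruby
text = "Hi there\nHave a Nice Day"

# by default, . will not match the newline, so nothing is replaced
p text.sub(/the.*ice/, 'X')
# => "Hi there\nHave a Nice Day"

# with the m modifier, .* can cross the line boundary
p text.sub(/the.*ice/m, 'X')
# => "Hi X Day"
```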
o modifier
The o modifier restricts the #{} interpolations inside a
regexp definition to be performed only once, even if it is
inside a loop. As an alternative, you could simply assign
the regexp definition to a variable and use that within the
loop without needing the o modifier.
>> n = 2
?> for w in words
>> end
bike
auto
>> n = 1
?> for w in words
?> n += 1
>> end
bus
auto
train
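The difference can be seen in the following sketch, where the interpolated value changes on every iteration. The method names and the word list are made up for illustration and are not the loop from the transcript above.

```ruby
WORDS = %w[car bike bus auto train plane]

# interpolation is redone for every word, so the length being
# matched grows from 3 to 4 to 5 and so on
def exact_length_each_time(words)
  n = 3
  words.select do |w|
    hit = w.match?(/\A\w{#{n}}\z/)
    n += 1
    hit
  end
end

# with the o modifier, the interpolation is done only the first time
# this regexp literal is evaluated, so the length stays at 3
def exact_length_once(words)
  n = 3
  words.select do |w|
    hit = w.match?(/\A\w{#{n}}\z/o)
    n += 1
    hit
  end
end

p exact_length_each_time(WORDS)   # => ["car", "bike"]
p exact_length_once(WORDS)        # => ["car", "bus"]
```

The regexp built with the o modifier is kept for the rest of the program, so calling the second method again would still use the value interpolated during the first call.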
x modifier
The x modifier is another provision, like named capture
groups, that helps add clarity to regexp definitions. This
modifier allows you to use literal whitespace for alignment
and to add comments after the # character, so that a
complex regexp can be broken down into multiple lines with
comments.
(?:[^,]+,){3} # non-capturing group to get the 3 columns
([^,]+) # group-2, captures 4th column
/x
=> "1,2,3,(4),5,6,7"
=> false
>> 'cat and dog'.match?(/t\ a/x)
=> true
=> true
=> "a"
=> "a#b"
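The sketch below shows a multiline definition with comments, followed by the ways to match whitespace and # literally when the x modifier is active. The DATE pattern and its sample input are made up for illustration.

```ruby
DATE = /
  \A
  (?<year>\d{4})  -   # four digit year
  (?<month>\d{2}) -   # two digit month
  (?<day>\d{2})       # two digit day
  \z
/x

m = DATE.match('2018-10-25')
p [m[:year], m[:month], m[:day]]
# => ["2018", "10", "25"]

# unescaped whitespace is ignored, so this looks for 'ta'
p 'cat and dog'.match?(/t a/x)     # => false
# escape the space or place it inside a character class
p 'cat and dog'.match?(/t\ a/x)    # => true
p 'cat and dog'.match?(/t[ ]a/x)   # => true

# unescaped # starts a comment, so only 'a' is part of the regexp
p 'a#b'[/a#b/x]     # => "a"
p 'a#b'[/a\#b/x]    # => "a#b"
```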
Inline comments
Comments can also be added using the (?#comment) grouping,
independent of the x modifier.
=> "1,2,3,(4),5,6,7"
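Output like the above can be produced with a pattern along these lines. This is a sketch with an assumed constant name, input and comment text.

```ruby
# (?#...) groups are ignored while matching and do not affect
# the numbering of capture groups
COLS = /\A((?:[^,]+,){3})(?#first 3 columns)([^,]+)(?#4th column)/

p '1,2,3,4,5,6,7'.sub(COLS, '\1(\2)')
# => "1,2,3,(4),5,6,7"
```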
Inline modifiers
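To apply modifiers to specific portions of a regexp, specify
them inside a special grouping syntax. This will override
the modifiers applied to the entire regexp definition, if any.
The syntax variations are:
(?modifiers:pat) will apply modifiers only for this portion
(?-modifiers:pat) will negate modifiers only for this portion
(?modifiers-modifiers:pat) will apply and negate particular modifiers only for this portion
(?modifiers), (?-modifiers) and (?modifiers-modifiers) will apply, negate or change modifiers from this point onwards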
=> /(?i-mx:^cat)|123/
=> /(?-mix:cat)|a\^b|(?mi-x:the.*ice)/
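A short sketch of applying and negating the i modifier for a portion of a regexp (sample strings chosen for illustration). The last statement shows where the (?i-mx:...) notation in the Regexp.union output comes from.

```ruby
# ignore case only for the first character
p 'Cat cat CAT'.scan(/(?i:c)at/)
# => ["Cat", "cat"]

# ignore case for the whole regexp except the 'at' portion
p 'Cat cat CAT'.scan(/c(?-i:at)/i)
# => ["Cat", "cat"]

# modifiers of a regexp argument are preserved as an inline group
p Regexp.union(/^cat/i, '123')
# => /(?i-mx:^cat)|123/
```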
Cheatsheet and Summary
Note Description
(?modifiers:pat) apply modifiers only for this pat
(?-modifiers:pat) negate modifiers only for this pat
Exercises
a) Remove from first occurrence of hat to last occurrence
of ice for the given input strings. Match these markers
case insensitively.
'> hi there
'> 42
>> bye}
good start
hi there
42
bye
>> end
>> end
>> File.open('sample_mod.md').read == File.open('expected.md').read
=> true
>> end
>> aLtErNaTe_CaSe('Sample123string42with777numbers')
=> "sAmPlE123sTrInG42wItH777nUmBeRs"
e) For the given input strings, match all of these three
patterns:
>> s1.match?(pat)
=> true
>> s2.match?(pat)
=> false
>> s3.match?(pat)
=> true
>> s1.match?(pat)
=> true
>> s2.match?(pat)
=> true
>> s3.match?(pat)
=> false
Unicode
So far in the book, all examples were meant for strings
made up of ASCII characters only. However, the Regexp class
uses the source encoding by default, and the default string
encoding is UTF-8. See ruby-doc: Encoding for details on
working with different string encodings.
Encoding modifiers
Modifiers can be used to override the encoding to be used.
For example, the n modifier will use ASCII-8BIT instead of
source encoding.
>> 'fox:αλεπού'.scan(/\w+/n)
=> ["fox"]
>> 'fox:αλεπού'.scan(/\w+/)
=> ["fox"]
>> 'fox:αλεπού'.scan(/[[:word:]]+/)
>> 'fox:αλεπού'.scan(/(?u)\w+/)
>> 'fox:αλεπού,eagle:αετός'.scan(/\p{L}+/)
>> 'fox:αλεπού,eagle:αετός'.scan(/\p{Greek}+/)
>> 'φοο12,βτ_4,foo'.scan(/\p{Word}+/)
=> "φοοβτfoo"
fox:αλεπού
>> 'fox:αλεπού,eagle:αετός'.scan(/[\u{61}-\u{7a}]+/)
\X vs dot metacharacter
Some characters are made up of more than one codepoint.
Unicode handles these as grapheme clusters. The dot
metacharacter will only match one codepoint at a time. You
can use \X to match any character, even if it has multiple
codepoints.
g̈
=> "cag̈ed"
=> "cod"
=> "cod"
=> "he\nat"
=> "heat"
=> "heat"
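Outputs like the ones above can be reproduced with the sketch below. The sample string is written with an explicit \u escape so that the two codepoints are visible, and the patterns are assumptions for illustration.

```ruby
# 'g' followed by U+0308 (combining diaeresis) is displayed as one character
caged = "cag\u0308ed"

# a single . matches only 'g', so there is no match and no change
p caged.sub(/a.e/, 'o') == caged   # => true

# two dots are needed to cover both codepoints
p caged.sub(/a..e/, 'o')           # => "cod"

# \X matches the whole grapheme cluster
p caged.sub(/a\Xe/, 'o')           # => "cod"

# unlike the dot, \X matches newline characters without the m modifier
p "he\nat".sub(/e.a/, 'ea')        # => "he\nat"
p "he\nat".sub(/e.a/m, 'ea')       # => "heat"
p "he\nat".sub(/e\Xa/, 'ea')       # => "heat"
```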
Exercises
a) Output true or false depending on whether the input
string is made up of ASCII characters only. Consider the
input to be non-empty strings. Any character that isn't
part of the 7-bit ASCII set should give false.
=> false
##### add your solution here for str2
=> false
##### add your solution here for str3
=> true
>> s2 = (0x80..0xff).to_a.pack('U*')
>> s3 = (0x2600..0x27eb).to_a.pack('U*')
=> "!\"#%&'()*,-./:;?@[\\]_{}"
=> "¡§«¶·»¿"
=> "❨❩❪❫❬❭❮❯❰❱❲❳❴❵⟅⟆⟦⟧⟨⟩⟪⟫"