GitHub - Cyrilbois - You-Should-Learn-Regex - Regular Expresion Tutorial (Blog - Patricktriest.com) Source Code
com/cyrilbois/You-Should-Learn-Regex
blog.patricktriest.com/you-should-learn-regex/
Regular Expressions (Regex): One of the most powerful, widely applicable, and
sometimes intimidating techniques in software engineering. From validating email
addresses to performing complex code refactors, regular expressions have a wide
range of uses and are an essential entry in any software engineer's toolbox.
The complexity of the specialized regex syntax, however, can make these expressions
somewhat inaccessible. For instance, here is a basic regex that describes any time in
the 24-hour HH:MM format.
\b([01]?[0-9]|2[0-3]):([0-5]\d)\b
If this looks complex to you now, don't worry. By the time we finish the tutorial,
understanding this expression will be trivial.
1 de 23 25/02/2021 19:54
GitHub - cyrilbois/You-Should-Learn-Regex: Regular Expresi... https://github.com/cyrilbois/You-Should-Learn-Regex
This web application is my favorite tool for building, testing, and debugging regular
expressions. I highly recommend that you use it to test out the expressions that we'll
cover in this tutorial.
The source code for the examples in this tutorial can be found at the Github
repository here - https://github.com/triestpa/You-Should-Learn-Regex
We'll start with a very simple example - Match any line that only contains numbers.
^[0-9]+$
We could replace [0-9] with \d, which will do the same thing (match any
digit).
The great thing about this expression (and regular expressions in general) is that it
can be used, without much modification, in almost any programming language.
To demonstrate, we'll now quickly go through how to perform this simple regex
search on a text file using 16 of the most popular programming languages.
1234
abcde
12db2
5362
Each script will read the test.txt file shown above, search it using our regular
expression, and print the result (1234 and 5362) to the console.
0.0 - Javascript
const fs = require('fs')
const testFile = fs.readFileSync('test.txt', 'utf8')
const regex = /^([0-9]+)$/gm
let results = testFile.match(regex)
console.log(results)
0.1 - Python
import re
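A minimal Python sketch of the same search, assuming the four-line test file shown earlier. The helper name find_number_lines is hypothetical, and the file contents are inlined so the snippet runs on its own.

```python
import re

# re.MULTILINE makes ^ and $ match at the start and end of every line.
NUMBER_LINE = re.compile(r"^[0-9]+$", re.MULTILINE)


def find_number_lines(text):
    """Return every line of `text` that contains only digits."""
    return NUMBER_LINE.findall(text)


# Same four lines as the test file, inlined (an assumption made for portability).
sample = "1234\nabcde\n12db2\n5362\n"
print(find_number_lines(sample))  # ['1234', '5362']

# To search the real file instead:
#   with open("test.txt", encoding="utf8") as f:
#       print(find_number_lines(f.read()))
```

Without re.MULTILINE the anchors would only match at the very start and end of the whole string, so nothing would be found in a multi-line file.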
0.2 - R
0.3 - Ruby
0.4 - Haskell
import Text.Regex.PCRE
main = do
  fileContents <- readFile "test.txt"
  let stringResult = fileContents =~ "^[0-9]+$" :: AllTextMatches [] String
  print (getAllTextMatches stringResult)
0.5 - Perl
0.6 - PHP
<?php
$myfile = fopen("test.txt", "r") or die("Unable to open file.");
$test_str = fread($myfile,filesize("test.txt"));
fclose($myfile);
$re = '/^[0-9]+$/m';
preg_match_all($re, $test_str, $matches, PREG_SET_ORDER, 0);
var_dump($matches);
?>
0.7 - Go
package main

import (
	"fmt"
	"io/ioutil"
	"regexp"
)

func main() {
	testFile, err := ioutil.ReadFile("test.txt")
	if err != nil {
		// Stop here: without the file there is nothing to search.
		fmt.Print(err)
		return
	}
	testString := string(testFile)
	// (?m) enables multi-line mode so ^ and $ match at each line boundary.
	var re = regexp.MustCompile(`(?m)^([0-9]+)$`)
	var results = re.FindAllString(testString, -1)
	fmt.Println(results)
}
0.8 - Java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
class FileRegexExample {
public static void main(String[] args) {
try {
String content = new String(Files.readAllBytes(Paths.get("test.txt")));
Pattern pattern = Pattern.compile("^[0-9]+$", Pattern.MULTILINE);
Matcher matcher = pattern.matcher(content);
ArrayList<String> matchList = new ArrayList<String>();
while (matcher.find()) {
matchList.add(matcher.group());
}
System.out.println(matchList);
} catch (IOException e) {
e.printStackTrace();
}
}
}
0.9 - Kotlin
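import java.io.File
import kotlin.text.Regex
import kotlin.text.RegexOption

// MULTILINE lets ^ and $ match at every line boundary in the file.
val numberLine = Regex("^[0-9]+$", RegexOption.MULTILINE)
fun main() = println(numberLine.findAll(File("test.txt").readText()).map { it.value }.toList())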
0.10 - Scala
import scala.io.Source
import scala.util.matching.Regex
object FileRegexExample {
def main(args: Array[String]) {
val fileContents = Source.fromFile("test.txt").getLines.mkString("\n")
val pattern = "(?m)^[0-9]+$".r
val results = (pattern findAllIn fileContents).mkString(",")
println(results)
}
}
0.11 - Swift
import Cocoa
do {
    let fileText = try String(contentsOfFile: "test.txt", encoding: String.Encoding.utf8)
    let regex = try! NSRegularExpression(pattern: "^[0-9]+$", options: [ .anchorsMatchLines ])
    // NSRange counts UTF-16 code units, so measure the string the same way.
    let results = regex.matches(in: fileText, options: [], range: NSRange(location: 0, length: fileText.utf16.count))
    let matches = results.map { String(fileText[Range($0.range, in: fileText)!]) }
    print(matches)
} catch {
    print(error)
}
0.12 - Rust
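use regex::Regex; // assumes the `regex` crate is listed as a dependency
use std::fs;

fn main() {
    let test_str = fs::read_to_string("test.txt").expect("something went wrong reading the file");
    let re = Regex::new(r"(?m)^[0-9]+$").expect("invalid regex"); // (?m): ^ and $ match per line
    let results: Vec<&str> = re.find_iter(&test_str).map(|m| m.as_str()).collect();
    println!("{:?}", results);
}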
0.13 - C#
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using System.Linq;
namespace RegexExample
{
class FileRegexExample
{
static void Main()
{
string text = File.ReadAllText(@"./test.txt", Encoding.UTF8);
Regex regex = new Regex("^[0-9]+$", RegexOptions.Multiline);
MatchCollection mc = regex.Matches(text);
var matches = mc.OfType<Match>().Select(m => m.Value).ToArray();
Console.WriteLine(string.Join(" ", matches));
}
}
}
0.14 - C++
#include <string>
#include <fstream>
#include <iostream>
#include <sstream>
#include <regex>
using namespace std;
int main () {
ifstream t("test.txt");
stringstream buffer;
buffer << t.rdbuf();
string testString = buffer.str();
regex numberLineRegex("(^|\n)([0-9]+)($|\n)");
sregex_iterator it(testString.begin(), testString.end(), numberLineRegex);
sregex_iterator it_end;
while(it != it_end) {
cout << it -> str();
++it;
}
}
0.15 - Bash
#!/bin/bash
grep -E '^[0-9]+$' test.txt
Writing out the same operation in sixteen languages is a fun exercise, but we'll be
mostly sticking with Javascript and Python (along with a bit of Bash at the end) for
the rest of the tutorial since these languages (in my opinion) tend to yield the
clearest, most readable implementations.
Let's go through another simple example - matching any valid year in the 20th or
21st centuries.
\b(19|20)\d{2}\b
We're starting and ending this regex with \b instead of ^ and $. \b represents a
word boundary: the zero-width position between a word character and a non-word
character, such as the edge between a word and a space. This will allow us to match years
within the text blocks (instead of on their own lines), which is very useful for searching
through, say, paragraph text.
\b - Word boundary
(19|20) - Matches either '19' or '20' using the OR (|) operand.
\d{2} - Two digits, same as [0-9]{2}
\b - Word boundary
We can use this expression in a Python script to find how many times each year in
the 20th or 21st century is mentioned in a historical Wikipedia article.
import re
import urllib.request
import operator
for year in sorted(year_counts, key=year_counts.get, reverse=True):
    print(year, year_counts[year])
The above script will print each year, along with the number of times it is mentioned.
1941 137
1943 80
1940 76
1945 73
1939 71
...
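The counting step can also be tried offline. The following is a minimal sketch that runs the year pattern over an inline sample sentence rather than a downloaded article; the sample text and the helper name count_years are made up for illustration. It uses a non-capturing group, (?:19|20), because Python's re.findall returns only the group contents when the pattern contains a capture group.

```python
import re
from collections import Counter

# Non-capturing group so findall returns the whole year, not just '19' or '20'.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def count_years(text):
    """Count how often each 19xx/20xx year appears in `text`."""
    return Counter(YEAR.findall(text))


# Hypothetical sample text standing in for the article body.
sample = (
    "The war began in 1939 and ended in 1945. "
    "By late 1945, room 21945 had been empty since 1850."
)

for year, count in count_years(sample).most_common():
    print(year, count)
# 1945 2
# 1939 1
```

Note that 21945 and 1850 are skipped: the word boundary stops the pattern from matching inside a longer number, and 18 is not one of the allowed prefixes.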
Now we'll define a regular expression to match any time in the 24-hour format
(HH:MM, such as 16:59).
\b([01]?[0-9]|2[0-3]):([0-5]\d)\b
\b - Word boundary
[01] - 0 or 1
? - Signifies that the preceding pattern is optional.
[0-9] - Any number between 0 and 9
| - OR operand
2[0-3] - 2, followed by any number between 0 and 3 (i.e. 20-23)
: - Matches the : character
[0-5] - Any number between 0 and 5
\d - Any number between 0 and 9 (same as [0-9])
\b - Word boundary
You might have noticed something new in the above pattern - we're wrapping the
hour and minute segments in parentheses ( ... ). This allows us to define
each part of the pattern as a capture group.
Capture groups allow us to individually extract, transform, and rearrange pieces of each
matched pattern.
For example, in the above 24-hour pattern, we've defined two capture groups - one
for the hour and one for the minute.
Here's how we could use Javascript to parse a 24-hour formatted time into hours and
minutes.
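The same capture-group extraction, sketched in Python rather than Javascript. The function name parse_time and the sample sentences are assumptions made for this illustration.

```python
import re

# Group 1 is the hour, group 2 is the minute.
TIME_24H = re.compile(r"\b([01]?[0-9]|2[0-3]):([0-5]\d)\b")


def parse_time(text):
    """Return (hour, minute) as ints for the first 24-hour time in `text`, or None."""
    match = TIME_24H.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(parse_time("The train leaves at 16:59 sharp"))  # (16, 59)
print(parse_time("Doors open at 9:05"))               # (9, 5)
print(parse_time("25:00 is not a valid time"))        # None
```

match.group(0) would return the full matched text (for example 16:59), while groups 1 and 2 hold the individual pieces.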
As an extra exercise, you could try modifying this script to convert 24-hour times to
12-hour (am/pm) times.
\b(0?[1-9]|[12]\d|3[01])([\/\-])(0?[1-9]|1[012])\2(\d{4})
This one is a bit longer, but it should look pretty similar to what we've covered
already.
The only new concept here is that we're using \2 to match the second capture
group, which is the divider (/ or -). This enables us to avoid repeating our pattern
matching specification, and will also require that the dividers are consistent (if the
first divider is /, then the second must be as well).
Using capture groups, we can dynamically reorganize and transform our string input.
The standard way to refer to capture groups is to use the $ or \ symbol, along
with the index of the capture group (remember that capture group 0 is the
full matched text).
Let's imagine that we were tasked with converting a collection of documents from
using the international date format style (DD/MM/YYYY) to the American style
(MM/DD/YYYY).
Our replacement pattern (\3\2\1\2\4) will simply swap the month and day content
in the expression.
import re
regex = r'\b(0?[1-9]|[12]\d|3[01])([ \/\-])(0?[1-9]|1[012])\2(\d{4})'
test_str = "Today's date is 18/09/2017"
subst = r'\3\2\1\2\4'
result = re.sub(regex, subst, test_str)
print(result)
^[^@\s]+@[^@\s]+\.\w{2,6}$
^ - Start of input
[^@\s] - Match any character except for @ and whitespace (\s)
+ - 1+ times
@ - Match the '@' symbol
[^@\s]+ - Match any character except for @ and whitespace, 1+ times
\. - Match the '.' character.
\w{2,6} - Match any word character (letter, digit, or underscore), 2-6 times
$ - End of input
// Returns true when the whole input matches the simple email pattern above.
function isValidEmail (input) {
  const regex = /^[^@\s]+@[^@\s]+\.\w{2,6}$/
  return regex.test(input)
}

// The example.com addresses are illustrative placeholders.
const tests = [
  `user@example.com`, // Valid
  '', // Invalid
  `test.test`, // Invalid
  '@user@example.com', // Invalid
  'invalid@@test.com', // Invalid
  `gmail.com`, // Invalid
  `this is a user@example.com`, // Invalid
  `user@example.com@gmail.com` // Invalid
]
console.log(tests.map(isValidEmail))
This is a very simple example which ignores lots of very important email-validity edge
cases, such as invalid start/end characters and consecutive periods. I really don't
recommend using the above expression in your applications; it would be best to
instead use a reputable email-validation library or to track down a more complete
email validation regex.
For instance, here's a more advanced expression from (the aptly named)
emailregex.com which matches 99% of RFC 5322 compliant email addresses.
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
One of the most useful ad-hoc uses of regular expressions can be code refactors.
Most code editors support regex-based find/replace operations. A well-formed regex
substitution can turn a tedious 30-minute busywork job into a beautiful single-
expression piece of regex refactor wizardry.
Instead of writing scripts to perform these operations, try doing them natively in your
text editor of choice. Nearly every text editor supports regex based find-and-replace.
What if we wanted to find all of the single-line comments within a CSS file?
To capture any single-line CSS comment, we can use the following expression.
(\/\*+)(.*)(\*+\/)
Note that we have defined three capture groups in the above expression: the
opening characters ((\/\*+)), the comment contents ((.*)), and the closing
characters ((\*+\/)).
We could use this expression to turn each single-line comment into a multi-line
comment by performing the following substitution.
$1\n$2\n$3
/*
Multiline Comment
*/
h1 {
font-size: 2rem;
}
The substitution will yield the same file, but with each single-line comment converted
to a multi-line comment.
/*
Single Line Comment
*/
body {
background-color: pink;
}
/*
Multiline Comment
*/
h1 {
font-size: 2rem;
}
/*
Another Single Line Comment
*/
h2 {
font-size: 1rem;
}
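For readers who want to check the substitution outside a text editor, here is a minimal Python sketch of it. The CSS string is a shortened stand-in for the file above, and Python writes backreferences as \1 (or \g<1>) rather than $1.

```python
import re

# Same three capture groups: opener, comment contents, closer.
SINGLE_LINE_COMMENT = re.compile(r"(\/\*+)(.*)(\*+\/)")

# Shortened, hypothetical stand-in for the CSS file.
css = (
    "/* Single Line Comment */\n"
    "body { background-color: pink; }\n"
    "/*\n"
    " Multiline Comment\n"
    "*/\n"
    "h1 { font-size: 2rem; }\n"
)

# Put the opener, contents, and closer on separate lines.
expanded = SINGLE_LINE_COMMENT.sub(r"\1\n\2\n\3", css)
print(expanded)
```

The existing multi-line comment is left alone because . does not match a newline by default, so the pattern can only match a comment that opens and closes on one line.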
Let's say we have a big messy CSS file that was written by a few different people. In
this file, some of the comments start with /*, some with /**, and some with
/*****.
Let's write a regex substitution to standardize all of the single-line CSS comments to
start with /*.
In order to do this, we'll extend our expression to only match comments with two or
more starting asterisks.
(\/\*{2,})(.*)(\*+\/)
This expression is very similar to the original. The main difference is that at the
beginning we've replaced \*+ with \*{2,}. The \*{2,} syntax signifies "two or
more" instances of *.
To standardize the opening of each comment we can pass the following substitution.
/*$2$3
The result will be the same file with standardized comment openings.
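A minimal Python sketch of this standardization, assuming a few made-up comment lines. As before, Python uses \2 and \3 where an editor would use $2 and $3.

```python
import re

# Only comments that open with two or more asterisks are matched.
MULTI_STAR_OPENER = re.compile(r"(\/\*{2,})(.*)(\*+\/)")

# Hypothetical sample lines with inconsistent openers.
css = (
    "/** Double Asterisk Comment */\n"
    "/***** Many Asterisk Comment */\n"
    "/* Already Standard */\n"
)

# Drop group 1 and write a plain /* opener in its place.
standardized = MULTI_STAR_OPENER.sub(r"/*\2\3", css)
print(standardized)
# /* Double Asterisk Comment */
# /* Many Asterisk Comment */
# /* Already Standard */
```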
(https?:\/\/)(www\.)?(?<domain>[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6})(?<path>\/[-a-zA-Z0-9@:%_\/+.~#?&=]*)?
(https?:\/\/) - Match http(s)
(www\.)? - Optional "www" prefix
(?<domain>[-a-zA-Z0-9@:%._\+~#=]{2,256} - Match a valid domain name
\.[a-z]{2,6}) - Match a domain extension (e.g. ".com" or ".org")
(?<path>\/[-a-zA-Z0-9@:%_\/+.~#?&=]*)? - Match URL path, query
string, and/or file extension, all optional.
You'll notice here that some of the capture groups now begin with a ?<name>
identifier. This is the syntax for a named capture group, which makes the data
extraction cleaner. (Python spells the same construct ?P<name>.)
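To see what named groups buy us, here is a small offline sketch in Python. The sample text and the example.com / example.org addresses are placeholders invented for the illustration, and the pattern uses Python's ?P<name> spelling.

```python
import re

URL = re.compile(
    r"(https?:\/\/)(www\.)?"
    r"(?P<domain>[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6})"
    r"(?P<path>\/[-a-zA-Z0-9@:%_\/+.~#?&=]*)?"
)

# Hypothetical snippet of page text.
text = 'See <a href="https://www.example.com/docs/index.html">docs</a> and http://example.org'

# Groups are read by name instead of by position.
found = [(m.group("domain"), m.group("path")) for m in URL.finditer(text)]
print(found)
# [('example.com', '/docs/index.html'), ('example.org', None)]
```

Reading m.group("domain") keeps working even if more groups are later added to the front of the pattern, which is not true of numeric indexes.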
6.1 - Real-World Example - Parse Domain Names From URLs on A Web Page
Here's how we could use named capture groups to extract the domain name of each
URL in a web page using Python.
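import re
import urllib.request
html = str(urllib.request.urlopen("https://moz.com/top500").read())
regex = r"(https?:\/\/)(www\.)?(?P<domain>[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6})(?P<path>\/[-a-zA-Z0-9@:%_\/+.~#?&=]*)?"
matches = re.finditer(regex, html)
# Print the named "domain" group of every URL found in the page.
for match in matches:
    print(match.group('domain'))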
The script will print out each domain name it finds in the raw web page HTML
content.
...
facebook.com
twitter.com
google.com
youtube.com
linkedin.com
wordpress.org
instagram.com
pinterest.com
wikipedia.org
wordpress.com
...
Regular expressions are also supported by many Unix command line utilities! We'll
walk through how to use them with grep to find specific files, and with sed to
replace text file content in-place.
We'll define another basic regular expression, this time to match image files.
^.+\.(?i)(png|jpg|jpeg|gif|webp)$
^ - Start of line.
.+ - Match any character (letters, digits, symbols), except for \n (new line), 1+
times.
Here's how you could list all of the image files in your directory.
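As one way to try the image-file pattern without a shell, here is a minimal Python sketch. The helper name image_files and the file names are made up. It passes re.IGNORECASE instead of the inline (?i), since recent Python versions reject an inline global flag that is not at the start of the pattern.

```python
import re

# Case-insensitive match on the file extension.
IMAGE_FILE = re.compile(r"^.+\.(png|jpg|jpeg|gif|webp)$", re.IGNORECASE)


def image_files(names):
    """Return only the names that look like image files."""
    return [name for name in names if IMAGE_FILE.match(name)]


# Hypothetical directory listing; os.listdir(".") would supply real names.
names = ["photo.JPG", "notes.txt", "logo.png", "archive.tar.gz", "png", "anim.webp"]
print(image_files(names))  # ['photo.JPG', 'logo.png', 'anim.webp']
```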
Redacting email addresses within a text file can be done quite easily using the sed
command, along with a modified version of our email regex from earlier.
sed - The Unix "stream editor" utility, which allows for powerful text file
transformations.
-E - Use extended regex pattern matching
-i - Replace the file stream in-place
- Wrap the beginning of the line in a capture group
- Simplified version of our email regex.
- Replace each email address with {redacted}.
- Perform the operation on the file.
My email is [email protected]
Once the command has been run, the email will be redacted from the file.
My email is {redacted}
Warning - This command will automatically remove all email addresses from
any file that you pass it, so be careful where/when you run it, since this
operation cannot be reversed. To preview the results within the terminal,
instead of replacing the text in-place, simply omit the -i flag.
Note - While the above command should work on most Linux distributions,
macOS uses the BSD implementation of sed, which is more limited in its
supported regex syntax. To use sed on macOS with decent regex support, I
would recommend installing the GNU implementation of sed with
brew install gnu-sed, and then using gsed from the command line instead of sed.
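The same redaction can be previewed in Python without touching any file. This is a sketch under assumptions: the pattern is an un-anchored variant of the simple email regex from earlier (not the exact expression passed to sed), and the address is a placeholder.

```python
import re

# Un-anchored variant of the simple email pattern, so it can match mid-line.
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.\w{2,6}")


def redact_emails(text):
    """Replace anything that looks like an email address with {redacted}."""
    return EMAIL.sub("{redacted}", text)


print(redact_emails("My email is jane@example.com"))
# My email is {redacted}
```

Because the function returns a new string, the original text stays intact until you decide to write the result back.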
Ok, so clearly regex is a powerful, flexible tool. Are there times when you should
avoid writing your own regular expressions? Yes!
Parsing structured languages, from English to Java to JSON, can be a real pain using
regular expressions.
Writing your own regular expression for this purpose is likely to be an exercise in
frustration that will result in eventual (or immediate) disaster when an edge case or
minor syntax/grammar error in the data source causes the expression to fail.
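A small illustration of the kind of edge case meant here, using a made-up JSON document. The naive pattern is a deliberately simplistic assumption; the point is only that a real parser handles escaping that the regex does not.

```python
import json
import re

# Hypothetical document whose title contains escaped quotes.
doc = r'{"title": "The \"Best\" Guide", "pages": 12}'

# Naive approach: grab everything up to the next double quote.
naive_title = re.search(r'"title":\s*"([^"]*)"', doc).group(1)

# Parser approach: let the JSON library deal with escaping.
parsed_title = json.loads(doc)["title"]

print(repr(naive_title))   # 'The \\'
print(repr(parsed_title))  # 'The "Best" Guide'
```

The regex stops at the first escaped quote and silently returns a truncated value, while json.loads returns the full title.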
It may seem tempting to use regular expressions to filter user input (such as from a
web form), to prevent hackers from sending malicious commands (such as SQL
injections) to your application.
Using a custom regular expression here is unwise since it is very difficult to cover every
potential attack vector or malicious command. For instance, hackers can use
alternative character encodings to get around naively programmed input blacklist
filters.
This is another instance where I would strongly recommend using well-tested
libraries and/or services, along with the use of whitelists instead of blacklists, in order
to protect your application from malicious inputs.
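One common library-level defence against SQL injection is a parameterized query, where the database driver keeps user input separate from the SQL text. A minimal sketch using Python's built-in sqlite3 module follows; the users table and the find_user helper are hypothetical.

```python
import sqlite3

# In-memory database with a hypothetical users table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES (?)", ("alice",))


def find_user(connection, name):
    """Look up a user by name; the ? placeholder keeps `name` out of the SQL text."""
    query = "SELECT name FROM users WHERE name = ?"
    return connection.execute(query, (name,)).fetchall()


print(find_user(conn, "alice"))         # [('alice',)]
print(find_user(conn, "x' OR '1'='1"))  # []
```

The second call is a classic injection attempt. It is treated as an odd user name rather than as SQL, so no regex filtering of the input is needed for this purpose.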
Regex matching speeds can range from not-very-fast to extremely slow, depending
on how well the expression is written. This is fine for most use cases, especially if the
text being matched is very short (such as an email address form). For high-
performance server applications, however, regex can be a performance bottleneck,
especially if the expression is poorly written or the text being searched is long.
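As a rough illustration of how much the way an expression is written can matter, the sketch below times two patterns that accept exactly the same strings. The patterns and the input are assumptions chosen to trigger heavy backtracking in Python's re engine; actual timings depend on the machine and are not shown.

```python
import re
import time

# Both patterns accept one or more 'a' characters and nothing else.
NESTED = re.compile(r"^(a+)+$")  # nested quantifier: many ways to split the run
FLAT = re.compile(r"^a+$")       # single quantifier: one way to match


def time_match(pattern, text):
    """Return (matched, seconds) for one match attempt."""
    start = time.perf_counter()
    matched = pattern.match(text) is not None
    return matched, time.perf_counter() - start


# A near-miss input: the trailing 'b' forces the engine to try every split.
text = "a" * 20 + "b"

results = {}
for name, pattern in (("nested", NESTED), ("flat", FLAT)):
    matched, seconds = time_match(pattern, text)
    results[name] = matched
    print(f"{name}: matched={matched} in {seconds:.6f}s")
```

Both attempts fail to match, but the nested version typically takes far longer, and its cost roughly doubles with each extra a in the input.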
Regex is an incredibly useful tool, but that doesn't mean you should use it
everywhere.
Overusing regex is a great way to make your co-workers (and anyone else who needs
to work with your code) very angry with you.
I hope that this has been a useful introduction to the many uses of regular
expressions.
There still are lots of regex use cases that we have not covered. For instance, regex
can be used in PostgreSQL queries to dynamically search for text patterns within a
database.
We have also left lots of powerful regex syntax features uncovered, such as
lookahead, lookbehind, atomic groups, recursion, and subroutines.
To improve your regex skills and to learn more about these features, I would
recommend the following resources.
The source code for the examples in this tutorial can be found at the Github
repository here - https://github.com/triestpa/You-Should-Learn-Regex
Feel free to comment below with any suggestions, ideas, or criticisms regarding this
tutorial.