Regex

M1 Programming
1 Strings
1.1 Manipulate strings
1.2 Escape character
1.3 Formatted strings
2 Regular Expressions
2.1 Anchors
2.2 Character classes
2.3 Special Character Classes
2.4 Repetition Cases
2.5 Grouping with Parentheses
2.6 Backreferences
3 Exercise
1 Strings
1.1 Manipulate strings
# Python
import os
wd_t = r'C:\Users\mypath\to\myfolder'
wd_t
## 'C:\\Users\\mypath\\to\\myfolder'
print(wd_t)
## C:\Users\mypath\to\myfolder
The r'…' notation creates a raw string, in which backslash escape sequences are not interpreted. There is no need to use r if you're just getting the value from another variable.
In Python, a string can be treated as an array, and it's easy to select specific pieces of the string using the position of the letters. In R, a string is a block that needs to be decomposed in order to select specific letters. The number of characters in a string is its length. Strings can be modified in various ways, which are listed, with some examples, at https://www.w3schools.com/python/python_ref_string.asp.
# Python
wd_info = 'Current working directory:\n'+os.getcwd()
slc_infos = wd_info[10:20]
len(slc_infos)
## 10
slc_infos
## 'rking dire'
# mutable
slc_infos = list(slc_infos)
slc_infos
## ['r', 'k', 'i', 'n', 'g', ' ', 'd', 'i', 'r', 'e']
slc_infos[7] = '0'
slc_infos
## ['r', 'k', 'i', 'n', 'g', ' ', 'd', '0', 'r', 'e']
# count
wd_info.count('s')
## 1
# modify the string
wd_info.title()
## 'Current Working Directory:\n/Home/Peltouz/Documents/Github/M1-Programming'
wd_info.upper()
## 'CURRENT WORKING DIRECTORY:\n/HOME/PELTOUZ/DOCUMENTS/GITHUB/M1-PROGRAMMING'
wd_info.lower()
## 'current working directory:\n/home/peltouz/documents/github/m1-programming'
splt_inf = wd_info.split(': ')
splt_inf
## ['Current working directory:\n/home/peltouz/Documents/GitHub/M1-Programming']
' '.join(splt_inf)
## 'Current working directory:\n/home/peltouz/Documents/GitHub/M1-Programming'
import numpy as np
"".join(['a :','1','b :','2','c :','3'])
## 'a :1b :2c :3'
" ".join(['a :','1','b :','2','c :','3'])
## 'a : 1 b : 2 c : 3'
[''.join([letter,number]) for letter,number in zip(['a :','b :','c :'],['1','2','3'])]
## ['a :1', 'b :2', 'c :3']
list(map(''.join,zip(['a :','b :','c :'],['1','2','3'])))
## ['a :1', 'b :2', 'c :3']
'.'.join((np.char.array(['a :','b :','c :'])+np.char.array(['1','2','3'])).tolist())
## 'a :1.b :2.c :3'
np.char.array([['a :','b :','c :']])+'O'
## chararray([['a :O', 'b :O', 'c :O']], dtype='<U4')
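The conversion to a list in the Python block above is needed because Python strings are immutable: a character cannot be assigned in place. The following is a minimal sketch of that behaviour. The sample string is hard-coded for illustration (it mirrors the slice shown above) and the variable names are assumptions, not part of the original slides.

```python
text = 'rking dire'  # hypothetical sample, same content as the slice above

# Strings are immutable: item assignment raises TypeError.
try:
    text[7] = '0'
    assigned = True
except TypeError:
    assigned = False

# Convert to a list, edit in place, then join back into a new string.
chars = list(text)
chars[7] = '0'
modified = ''.join(chars)

print(assigned)   # False
print(modified)   # rking d0re
print(text)       # rking dire (the original string is unchanged)
```

Joining the list with ''.join() gives back an ordinary string, so the usual string methods are available again.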
# R
wd_t = 'C:\\Users\\mypath\\to\\myfolder'
print(wd_t)
## [1] "C:\\Users\\mypath\\to\\myfolder"
cat(wd_t)
## C:\Users\mypath\to\myfolder
In R before version 4.0, there is no raw-string syntax, so a literal backslash must be doubled when it is written (R 4.0 and later also accept raw strings of the form r"(...)"). Also, we need to split the string before accessing its elements. The ‘strsplit’ function is designed to work with vectors and returns a list in which each element is a decomposed string. This is why ‘[[1]]’ follows each ‘strsplit’ call: we only want the first element, because the object passed to the function contains a single string.
# R
library(stringr)
library(tools)
wd_info <- paste0('Current working directory: ',getwd())
slc_infos <- wd_info[10:20]
slc_infos
## [1] NA NA NA NA NA NA NA NA NA NA NA
slc_infos <- paste(strsplit(wd_info,split = '')[[1]][10:20],collapse = "")
nchar(slc_infos)
## [1] 11
slc_infos
## [1] "orking dire"
slc_infos <- strsplit(slc_infos,split = '')[[1]]
slc_infos
## [1] "o" "r" "k" "i" "n" "g" " " "d" "i" "r" "e"
# mutable
slc_infos[7] = '0'
# count
length(which(strsplit(wd_info,split = '')[[1]]=='S'))
## [1] 0
# modify the string
toTitleCase(wd_info)
## [1] "Current Working Directory: /Home/Peltouz/Documents/GitHub/M1-Programming"
toupper(wd_info)
## [1] "CURRENT WORKING DIRECTORY: /HOME/PELTOUZ/DOCUMENTS/GITHUB/M1-PROGRAMMING"
tolower(wd_info)
## [1] "current working directory: /home/peltouz/documents/github/m1-programming"
paste0('a :','1','b :','2','c :','3')
## [1] "a :1b :2c :3"
paste('a :','1','b :','2','c :','3')
## [1] "a : 1 b : 2 c : 3"
paste0(c('a :','1','b :','2','c :','3')) # do nothing with a vector
## [1] "a :" "1" "b :" "2" "c :" "3"
paste(c('a :','1','b :','2','c :','3'),collapse = " ") # safer to use this
## [1] "a : 1 b : 2 c : 3"
str_c(c('a :','1','b :','2','c :','3'),collapse = " ")
## [1] "a : 1 b : 2 c : 3"
paste0(c('a :','b :','c :'),c('1','2','3'))
## [1] "a :1" "b :2" "c :3"
paste(c('a :','b :','c :'),c('1','2','3'))
## [1] "a : 1" "b : 2" "c : 3"
paste(c('a :','b :','c :'),c('1','2','3'),collapse = ".")
## [1] "a : 1.b : 2.c : 3"
paste0(c('a :','b :','c :'),c('0'))
## [1] "a :0" "b :0" "c :0"
paste(c('a :','b :','c :'),c('0'),collapse = "~")
## [1] "a : 0~b : 0~c : 0"
str_c(c('a :~','b :~','c :~'),c('0'))
## [1] "a :~0" "b :~0" "c :~0"
1.2 Escape character

# Python
import time
import psutil
import os
time_ = time.ctime(time.time())
time_info = 'Time: '
wd = os.getcwd()
wd_info = '\nCurrent working directory: '
infos = time_info+time_+wd_info+wd
print(infos)
## Time: Wed Feb 8 10:39:10 2023
## Current working directory: /home/peltouz/Documents/GitHub/M1-Programming
In R, we use cat() to interpret ‘\n’ (new line); Python interprets it automatically with print.
# R
library(benchmarkme)
time_ <- Sys.time()
time_info <- 'Time: '
wd = getwd()
wd_info = '\nCurrent working directory: '
infos <- paste0(time_info,time_,wd_info,wd)
print(infos)
## [1] "Time: 2023-02-08 10:39:10\nCurrent working directory: /home/peltouz/Documents/GitHub/M1-Programming"
cat(infos)
## Time: 2023-02-08 10:39:10
Here is some information on the computer where this document was created:
The working directory is the path to the folder where we are currently working. If we read or write a file without specifying the exact path (only the name of the file), the program will search for the file in the current working directory.
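To make the role of the working directory concrete, here is a small sketch showing that a bare file name resolves against the current working directory. The file name 'notes.txt' is a hypothetical placeholder; the file does not need to exist for the path to be resolved.

```python
import os

cwd = os.getcwd()

# A bare file name is resolved against the current working directory.
relative_name = 'notes.txt'  # hypothetical file name; the file need not exist
resolved = os.path.abspath(relative_name)

print(resolved == os.path.join(cwd, relative_name))  # True
```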
‘\’ is the backslash character and is used to specify special characters: tabulation, new lines, backslashes, quotes, etc. This causes trouble with file paths: in a Windows path, a backslash followed by ‘t’ or ‘n’ is read as a tab or a newline. To avoid this in Python, we can specify that the string must be read as it is by putting ‘r’ before the opening quotation mark.
Here is a list of escape characters that can be useful:
Escape Sequence   Description
\t                Insert a tab in the text at this point.
\b                Insert a backspace in the text at this point.
\n                Insert a newline in the text at this point.
\r                Insert a carriage return in the text at this point.
\f                Insert a form feed in the text at this point.
\'                Insert a single quote character in the text at this point.
\"                Insert a double quote character in the text at this point.
\\                Insert a backslash character in the text at this point.
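The effect of the escape sequences in the table, and of the r prefix, can be checked by comparing string lengths. This is a minimal sketch; the path 'C:\temp\new' is a made-up example chosen because it contains \t and \n.

```python
plain = 'C:\temp\new'      # \t and \n are interpreted as a tab and a newline
raw = r'C:\temp\new'       # raw string: backslashes are kept literally
doubled = 'C:\\temp\\new'  # doubling each backslash gives the same result

print(len(plain))                    # 9
print(len(raw))                      # 11
print(raw == doubled)                # True
print('\t' in plain, '\n' in plain)  # True True
```

Each interpreted escape sequence counts as a single character, which is why the plain string is two characters shorter than the raw one.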
1.3 Formatted strings

In Python, the % sign is used for producing formatted output. Another way to format output is the “{}”.format() method, which automatically replaces placeholders in a template string with values. Let’s rewrite our function using these methods.
# Python
def get_infos():
    time_info = 'Time: '+time.ctime(time.time())
    wd_info = '\nCurrent working directory: '+os.getcwd()
    infos = time_info+wd_info
    print(infos)

def get_infos_2():
    infos = '''Time: %s
Current working directory: %s'''% (time.ctime(time.time()),
                                   os.getcwd())
    print(infos)

def get_infos_3():
    infos = '''Time: {t}
Current working directory: {wd}'''.format(t = time.ctime(time.time()),
                                          wd = os.getcwd())
    print(infos)
get_infos()
## Time: Wed Feb 8 10:39:11 2023
get_infos_2()
## Time: Wed Feb 8 10:39:11 2023
## Current working directory: /home/
g
get_infos_3()
## Time: Wed Feb 8 10:39:11 2023
g
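For comparison, the same output can also be produced with an f-string, which the slides do not cover. The sketch below assumes Python 3.6 or later; the function name get_infos_4 is hypothetical and simply continues the numbering used above. It also returns the string so that it can be stored for later use.

```python
import os
import time

def get_infos_4():
    # Hypothetical fourth variant; f-strings require Python 3.6 or later.
    now = time.ctime(time.time())
    wd = os.getcwd()
    # Expressions inside {} are evaluated and inserted into the string.
    infos = f'Time: {now}\nCurrent working directory: {wd}'
    print(infos)
    return infos

result = get_infos_4()
```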
In R, % can also be used for formatting strings. The sprintf function builds the formatted string, which can be printed or stored for later use. We can use the cat function to print a string containing escape characters, or we can use writeLines as shown in the ‘get_infos_2’ function.
# R
get_infos <- function(){
  time_info = paste0('Time: ',Sys.time())
  wd_info = paste0('\nCurrent working directory: ',getwd())
  infos = paste0(time_info,wd_info)
  cat(infos)
}
get_infos_2 <- function(){
  infos = sprintf('Time: %s
Current working directory: %s',
                  Sys.time(),
                  getwd())
  writeLines(infos) # another way to interpret escape characters
}
get_infos()
## Time: 2023-02-08 10:39:11
get_infos_2()
## Time: 2023-02-08 10:39:11
g
2 Regular Expressions
Text used in this section is the abstract of ‘ImageNet Classification with Deep Convolutional Neural Networks’ (Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton 2012) presented at ‘Advances in Neural Information Processing Systems 25’ (NIPS 2012)
'We trained a large, deep convolutional neural network to classify the 1.3 million
high-resolution images in the LSVRC-2010 ImageNet training set into the 1000 different
classes. On the test data, we achieved top-1 and top-5 error rates of 39.7\% and
18.9\% which is considerably better than the previous state-of-the-art results.
The neural network, which has 60 million parameters and 500,000 neurons,
consists of five convolutional layers, some of which are followed by max-pooling
layers, and two globally connected layers with a final 1000-way softmax.
To make training faster, we used non-saturating neurons and a very efficient GPU
implementation of convolutional nets. To reduce overfitting in the globally
connected layers we employed a new regularization method that proved to be very effective.'
Regular expressions (regex) are a very flexible way to express how a character sequence should be matched. They are commonly used in text manipulation, by string-searching algorithms for “find” or “find and replace” operations on strings, and for input validation.
Except for control characters (+ ? . * ^ $ ( ) [ ] { } | \), all characters match themselves. You can escape a control character by preceding it with a backslash.
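The difference between an unescaped and an escaped control character can be sketched as follows. The sample text is made up for illustration; re.escape is a standard-library helper that adds the backslashes automatically and is not mentioned in the original slides.

```python
import re

text = 'rates 1.3 and 1x3'  # hypothetical sample text

# Unescaped, the dot matches any character.
loose = re.findall('1.3', text)
# Escaped by hand, the dot matches only a literal dot.
strict = re.findall(r'1\.3', text)
# re.escape adds the backslashes for us.
escaped_pattern = re.escape('1.3')
auto = re.findall(escaped_pattern, text)

print(loose)   # ['1.3', '1x3']
print(strict)  # ['1.3']
print(auto)    # ['1.3']
```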
Some functions in ‘re’ (Python) and ‘stringr’ (R) that can be useful when working with regular expressions are:
re         stringr           Description
findall    str_match_all     Returns a list containing all matches
finditer                     Returns an iterator containing all matches
search     str_locate_all    Returns a Match object if there is a match anywhere in the string
split      str_split         Returns a list where the string has been split at each match
sub        str_replace_all   Replaces one or many matches with a string
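As a quick illustration of the Python column of this table, the sketch below applies search, finditer, split and sub to the same short test string. Note that Python reports positions as 0-based, end-exclusive spans, whereas stringr reports 1-based, inclusive start and end positions.

```python
import re

text = 'net ne nett networks'

# search returns only the first match (or None if there is none).
first = re.search('net+', text)
print(first.span())  # (0, 3)

# finditer yields one Match object per match, including its position.
spans = [m.span() for m in re.finditer('net+', text)]
print(spans)  # [(0, 3), (7, 11), (12, 15)]

# split cuts the string at each match; sub replaces each match.
pieces = re.split('net+', text)
replaced = re.sub('net+', 'X', text)
print(pieces)    # ['', ' ne ', ' ', 'works']
print(replaced)  # X ne X Xworks
```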
Metacharacters are characters that have a special meaning in regular expressions. Their uses are discussed in more detail in the next few sections.
Parentheses ( ) are used for grouping. Square brackets [ ] define a character class, and curly braces { } are used by a quantifier with specific limits.
Character   Description                                                                  Example
[]          A set of characters                                                          “[a-m]”
\           Signals a special sequence (can also be used to escape special characters)   “\d”
.           Any character (except newline character)                                     “he..o”
^           Starts with                                                                  “^hello”
$           Ends with                                                                    “world$”
*           Zero or more occurrences                                                     “aix*”
+           One or more occurrences                                                      “aix+”
{}          Exactly the specified number of occurrences                                  “al{2}”
|           Either or                                                                    “x|y”
()          Grouping characters (or other regexes)                                       “(ab)\1” matches “abab”
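The example patterns from the table can be tried directly with re.findall. The test strings below are made up for illustration. The backreference example uses re.search because findall returns only the content of the group when a pattern contains parentheses.

```python
import re

results = {
    '[a-m]': re.findall(r'[a-m]', 'hello'),        # letters from a to m only
    'he..o': re.findall(r'he..o', 'hello hexxo'),  # any two characters
    'aix*':  re.findall(r'aix*', 'ai aix aixx'),   # zero or more x
    'aix+':  re.findall(r'aix+', 'ai aix aixx'),   # one or more x
    'al{2}': re.findall(r'al{2}', 'al all'),       # exactly two l
    'x|y':   re.findall(r'x|y', 'xyz'),            # either x or y
}
for pattern, found in results.items():
    print(pattern, found)
# [a-m] ['h', 'e', 'l', 'l']
# he..o ['hello', 'hexxo']
# aix* ['ai', 'aix', 'aixx']
# aix+ ['aix', 'aixx']
# al{2} ['all']
# x|y ['x', 'y']

# \1 refers back to whatever the first group matched.
backref = re.search(r'(ab)\1', 'xabab').group()
print(backref)  # abab
```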
# Python
import re
abstract = open("data/imagenet_abstract.txt", "r").read()
re.search('layers',abstract)
## <_sre.SRE_Match object; span=(437, 443), match='layers'>
re.findall('layers',abstract)
## ['layers', 'layers', 'layers', 'layers']
re.findall('net+','net ne nett networks')
## ['net', 'nett', 'net']
re.findall('net*','net ne nett networks')
## ['net', 'ne', 'nett', 'net']
re.sub(r'\\','',abstract)
## 'We trained a large, deep convolutional neural network to classify the 1.3 million high-resolution images in the LSVRC-2010 ImageNet training set into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 39.7% and 18.9% which is considerably better than the previous state-of-the-art results. The neural network, which has 60 million parameters and 500,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and two globally connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of convolutional nets. To reduce overfitting in the globally connected layers we employed a new regularization method that proved to be very effective.'
re.split(' ',abstract)[:5]
## ['We', 'trained', 'a', 'large,', 'deep']
# R
library(stringr)
library(readr)
abstract <- as.character(read_file('data/imagenet_abstract.txt'))
str_locate_all(pattern = 'layers',abstract)
## [[1]]
##      start end
## [1,]   438 443
## [2,]   488 493
## [3,]   523 528
## [4,]   728 733
str_locate_all(pattern = 'net+','net ne nett networks')
## [[1]]
##      start end
## [1,]     1   3
## [2,]     8  11
## [3,]    13  15
str_locate_all(pattern = 'net*','net ne nett networks')
## [[1]]
##      start end
## [1,]     1   3
## [2,]     5   6
## [3,]     8  11
## [4,]    13  15
str_replace_all(abstract,'\\\\','')
## [1] "We trained a large, deep convolutional neural network to classify the 1.3 million high-resolution images in the LSVRC-2010 ImageNet training set into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 39.7% and 18.9% which is considerably better than the previous state-of-the-art results. The neural network, which has 60 million parameters and 500,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and two globally connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of convolutional nets. To reduce overfitting in the globally connected layers we employed a new regularization method that proved to be very effective."
str_split(abstract,' ',simplify = T)[1:5]
## [1] "We" "trained" "a" "large," "deep"
2.1 Anchors
Anchors specify the position at which a match must occur.
Example       Description
^Python       Match “Python” at the start of a string or internal line
Python$       Match “Python” at the end of a string or line
\APython      Match “Python” at the start of a string
Python\Z      Match “Python” at the end of a string
\bPython\b    Match “Python” at a word boundary
\brub\B       \B is nonword boundary: match “rub” in “rube” and “ruby” but not alone
Python(?=!)   Match “Python”, if followed by an exclamation point.
Python(?!!)   Match “Python”, if not followed by an exclamation point.
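The anchors in the table can be compared on short, made-up strings, as in the sketch below. One detail the table leaves implicit: in Python's re module, ^ only matches at the start of an internal line when the re.MULTILINE flag is passed, whereas \A always refers to the start of the whole string.

```python
import re

two_lines = 'deep\nPython'  # hypothetical two-line sample

# ^ matches at the start of the string only, unless re.MULTILINE is set.
caret_default = re.findall(r'^Python', two_lines)
caret_multiline = re.findall(r'^Python', two_lines, re.MULTILINE)
# \A always means the start of the whole string.
start_only = re.findall(r'\APython', two_lines, re.MULTILINE)
print(caret_default, caret_multiline, start_only)  # [] ['Python'] []

# \b is a word boundary, \B a non-boundary: "rub" alone is not matched.
rub = re.findall(r'\brub\B', 'rub ruby rube')
print(rub)  # ['rub', 'rub']

# Lookaheads check the next character without consuming it.
followed = re.search(r'Python(?=!)', 'Python? Python!').start()
not_followed = re.search(r'Python(?!!)', 'Python! Python?').start()
print(followed, not_followed)  # 8 8
```

In both lookahead examples the match starts at position 8, that is, at the second “Python”, because the first one fails the lookahead condition.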
# Python
import re
words = ['neuron','neural','network','networks','deep convolutional neural network']
[w for w in words if re.findall(r'\Aneu',w)]
## ['neuron', 'neural']
[w for w in words if re.findall(r'^neu',w)]
## ['neuron', 'neural']
[w for w in words if re.findall(r'works?\Z',w)]
## ['network', 'networks', 'deep convolutional neural network']
[w for w in words if re.findall(r'works?$',w)]
## ['network', 'networks', 'deep convolutional neural network']
# R
library(stringr)
words = c('neuron','neural','network','networks','deep convolutional neural network')
str_detect(words,'\\Aneu')
## [1] TRUE TRUE FALSE FALSE FALSE
str_detect(words,'^neu')
## [1] TRUE TRUE FALSE FALSE FALSE
str_detect(words,'neu\\Z')
## [1] FALSE FALSE FALSE FALSE FALSE
str_detect(words,'neu$')
## [1] FALSE FALSE FALSE FALSE FALSE
2.2 Character classes

Example      Description
[Pp]ython    Match “Python” or “python”
rub[ye]      Match “ruby” or “rube”
[aeiou]      Match any one lowercase vowel
[0-9]        Match any digit; same as [0123456789]
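A short sketch of the character classes listed above, applied to made-up test strings:

```python
import re

# [Pp] accepts either case for the first letter only.
pythons = re.findall(r'[Pp]ython', 'Python python jython')
# [ye] accepts exactly one of the two listed letters.
rubs = re.findall(r'rub[ye]', 'ruby rube rubs')
# Each match of a character class is a single character.
vowels = re.findall(r'[aeiou]', 'neural')
# A range and the explicit list of digits are equivalent.
digits_range = re.findall(r'[0-9]', 'LSVRC-2010')
digits_list = re.findall(r'[0123456789]', 'LSVRC-2010')

print(pythons)                      # ['Python', 'python']
print(rubs)                         # ['ruby', 'rube']
print(vowels)                       # ['e', 'u', 'a']
print(digits_range)                 # ['2', '0', '1', '0']
print(digits_range == digits_list)  # True
```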
