Regex


M1 Programming

1 Strings

1.1 Manipulate strings

1.2 Escape character

1.3 Formatted strings

2 Regular Expressions

2.1 Anchors

2.2 Character classes

2.3 Special Character Classes

2.4 Repetition Cases

2.5 Grouping with Parentheses

2.6 Backreferences

3 Exercise

1 Strings
1.1 Manipulate strings
# Python
import os

wd_t = r'C:\Users\mypath\to\myfolder'
wd_t
## 'C:\\Users\\mypath\\to\\myfolder'
print(wd_t)
## C:\Users\mypath\to\myfolder

The r'…' notation creates a raw string, in which backslashes are kept as literal characters instead of introducing escape sequences. There is no need to use r if you're just getting the value from another variable.
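As a quick check of the point above, the following minimal sketch compares a raw string with its escaped counterpart. The variable names are illustrative only and are not part of the original notes.

```python
# Raw strings keep backslashes as literal characters.
raw_path = r'C:\Users\mypath\to\myfolder'
escaped_path = 'C:\\Users\\mypath\\to\\myfolder'

# Both literals describe the same sequence of characters.
same = raw_path == escaped_path

# '\n' is a single newline character; r'\n' is a backslash followed by 'n'.
newline_len = len('\n')
raw_newline_len = len(r'\n')

print(same, newline_len, raw_newline_len)
```

The raw prefix only changes how the literal is read in the source code; the resulting object is an ordinary string.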

In Python, a string behaves like a sequence of characters, so it is easy to select specific pieces of the string by position (indexing and slicing). In R, a string is a single block that must be decomposed before specific letters can be selected. The number of characters in a string is its length. Python strings are immutable, so string methods return a new string rather than changing the original. The available methods are listed at https://www.w3schools.com/python/python_ref_string.asp, which also provides some examples.

# Python

wd_info = 'Current working directory:\n'+os.getcwd()
slc_infos = wd_info[10:20]
len(slc_infos)
## 10
slc_infos
## 'rking dire'
slc_infos = list(slc_infos)
slc_infos
## ['r', 'k', 'i', 'n', 'g', ' ', 'd', 'i', 'r', 'e']

# mutable
slc_infos[7] = '0'
slc_infos
## ['r', 'k', 'i', 'n', 'g', ' ', 'd', '0', 'r', 'e']

# count
wd_info.count('s')
## 1

# modify the string
wd_info.title()
## 'Current Working Directory:\n/Home/Peltouz/Documents/Github/M1-Programming'
wd_info.upper()
## 'CURRENT WORKING DIRECTORY:\n/HOME/PELTOUZ/DOCUMENTS/GITHUB/M1-PROGRAMMING'
wd_info.lower()
## 'current working directory:\n/home/peltouz/documents/github/m1-programming'
# ': ' does not occur in wd_info (the colon is followed by '\n'),
# so split() returns a list with a single element
splt_inf = wd_info.split(': ')
splt_inf
## ['Current working directory:\n/home/peltouz/Documents/GitHub/M1-Programming']
' '.join(splt_inf)
## 'Current working directory:\n/home/peltouz/Documents/GitHub/M1-Programming'

import numpy as np

"".join(['a :','1','b :','2','c :','3'])
## 'a :1b :2c :3'
" ".join(['a :','1','b :','2','c :','3'])
## 'a : 1 b : 2 c : 3'
[''.join([letter,number]) for letter,number in zip(['a :','b :','c :'],['1','2','3'])]
## ['a :1', 'b :2', 'c :3']
list(map(''.join,zip(['a :','b :','c :'],['1','2','3'])))
## ['a :1', 'b :2', 'c :3']
'.'.join((np.char.array(['a :','b :','c :'])+np.char.array(['1','2','3'])).tolist())
## 'a :1.b :2.c :3'

np.char.array([['a :','b :','c :']])+'O'
## chararray([['a :O', 'b :O', 'c :O']], dtype='<U4')

# R

wd_t = 'C:\\Users\\mypath\\to\\myfolder'
print(wd_t)
## [1] "C:\\Users\\mypath\\to\\myfolder"
cat(wd_t)
## C:\Users\mypath\to\myfolder

In R versions before 4.0.0 there is no raw-string syntax, so a literal backslash must be doubled when it is written (R 4.0.0 added raw strings of the form r"(...)"). Also, we need to split the string before accessing its elements. The strsplit function is designed to work with vectors and returns a list, with each element of the list being a decomposed string. This is why we see [[1]] after each strsplit call: we only want the first element, because there is only one string in the object submitted to the function.

# R
library(stringr)
library(tools)

wd_info <- paste0('Current working directory: ',getwd())
slc_infos <- wd_info[10:20]
slc_infos
## [1] NA NA NA NA NA NA NA NA NA NA NA
slc_infos <- paste(strsplit(wd_info,split = '')[[1]][10:20],collapse = "")
nchar(slc_infos)
## [1] 11
slc_infos
## [1] "orking dire"
slc_infos <- strsplit(slc_infos,split = '')[[1]]
slc_infos
## [1] "o" "r" "k" "i" "n" "g" " " "d" "i" "r" "e"

# mutable
slc_infos[7] = '0'

# count
length(which(strsplit(wd_info,split = '')[[1]]=='S'))
## [1] 0

# modify the string
toTitleCase(wd_info)
## [1] "Current Working Directory: /Home/Peltouz/Documents/GitHub/M1-Programming"
toupper(wd_info)
## [1] "CURRENT WORKING DIRECTORY: /HOME/PELTOUZ/DOCUMENTS/GITHUB/M1-PROGRAMMING"
tolower(wd_info)
## [1] "current working directory: /home/peltouz/documents/github/m1-programming"

paste0('a :','1','b :','2','c :','3')
## [1] "a :1b :2c :3"
paste('a :','1','b :','2','c :','3')
## [1] "a : 1 b : 2 c : 3"

paste0(c('a :','1','b :','2','c :','3')) # do nothing with a vector
## [1] "a :" "1"   "b :" "2"   "c :" "3"
paste(c('a :','1','b :','2','c :','3'),collapse = " ") # safer to use this
## [1] "a : 1 b : 2 c : 3"
str_c(c('a :','1','b :','2','c :','3'),collapse = " ")
## [1] "a : 1 b : 2 c : 3"

paste0(c('a :','b :','c :'),c('1','2','3'))
## [1] "a :1" "b :2" "c :3"
paste(c('a :','b :','c :'),c('1','2','3'))
## [1] "a : 1" "b : 2" "c : 3"
paste(c('a :','b :','c :'),c('1','2','3'),collapse = ".")
## [1] "a : 1.b : 2.c : 3"

paste0(c('a :','b :','c :'),c('0'))
## [1] "a :0" "b :0" "c :0"
paste(c('a :','b :','c :'),c('0'),collapse = "~")
## [1] "a : 0~b : 0~c : 0"
str_c(c('a :~','b :~','c :~'),c('0'))
## [1] "a :~0" "b :~0" "c :~0"

1.2 Escape character


# Python
import time
import psutil
import os

time_ = time.ctime(time.time())
time_info = 'Time: '
wd = os.getcwd()
wd_info = '\nCurrent working directory: '

infos = time_info+time_+wd_info+wd

print(infos)
## Time: Wed Feb 8 10:39:10 2023
## Current working directory: /home/peltouz/Documents/GitHub/M1-Programming

In R, we use cat() so that '\n' is interpreted as a new line; in Python, print() interprets it automatically.

# R
library(benchmarkme)
time_ <- Sys.time()
time_info <- 'Time: '
wd = getwd()
wd_info = '\nCurrent working directory: '

infos <- paste0(time_info,time_,wd_info,wd)
print(infos)
## [1] "Time: 2023-02-08 10:39:10\nCurrent working directory: /home/peltouz/Documents/GitHub/M1-Programming"

cat(infos)
## Time: 2023-02-08 10:39:10
## Current working directory: /home/peltouz/Documents/GitHub/M1-Programming

Here is some information on the computer where this document was created:

The working directory is the path to the folder where we are currently working. If we read or write a file without specifying the exact path (only the name of the file), the program will search for the file in the current working directory.

'\' is the backslash character and is used to specify special characters such as tabulations, new lines, backslashes or quotes. When working with file paths this can lead to misinterpretation, because a backslash followed by certain letters is read as an escape sequence. To avoid this in Python, we can specify that the string must be read as it is by placing r before the opening quotation mark.

Here is a list of escape characters that can be useful:

Escape Sequence    Description
\t                 Insert a tab in the text at this point.
\b                 Insert a backspace in the text at this point.
\n                 Insert a newline in the text at this point.
\r                 Insert a carriage return in the text at this point.
\f                 Insert a form feed in the text at this point.
\'                 Insert a single quote character in the text at this point.
\"                 Insert a double quote character in the text at this point.
\\                 Insert a backslash character in the text at this point.
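To make the list above concrete, here is a small sketch showing that each escape sequence becomes a single character once Python has interpreted the literal. The dictionary and its keys are illustrative names, not part of the original notes.

```python
# Each escape sequence below is a single character once interpreted.
escapes = {
    'tab': '\t',
    'backspace': '\b',
    'newline': '\n',
    'carriage return': '\r',
    'form feed': '\f',
    'single quote': '\'',
    'double quote': '\"',
    'backslash': '\\',
}

for name, char in escapes.items():
    # repr() shows the escaped form; len() confirms it is one character.
    print(f'{name:16} {char!r:6} length={len(char)}')

lengths = [len(char) for char in escapes.values()]
```

Using repr() is a convenient way to inspect a string without having the escape characters acted upon by the terminal.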

1.3 Formatted strings


In Python, the % operator is used for producing formatted output. Another way to format output is the "{}".format() method, which replaces placeholders in a template string with the supplied values. Let's rewrite our function using each of these methods.

# Python
import os
import time

def get_infos():
    # plain concatenation
    time_info = 'Time: '+time.ctime(time.time())
    wd_info = '\nCurrent working directory: '+os.getcwd()
    infos = time_info+wd_info
    print(infos)

def get_infos_2():
    # % operator: each %s is replaced by the matching tuple element
    infos = '''Time: %s
Current working directory: %s''' % (time.ctime(time.time()),
                                    os.getcwd())
    print(infos)

def get_infos_3():
    # str.format(): named placeholders are replaced by keyword arguments
    infos = '''Time: {t}
Current working directory: {wd}'''.format(t = time.ctime(time.time()),
                                          wd = os.getcwd())
    print(infos)

get_infos()
## Time: Wed Feb 8 10:39:11 2023
## Current working directory: /home/peltouz/Documents/GitHub/M1-Programming
get_infos_2()
## Time: Wed Feb 8 10:39:11 2023
## Current working directory: /home/peltouz/Documents/GitHub/M1-Programming
get_infos_3()
## Time: Wed Feb 8 10:39:11 2023
## Current working directory: /home/peltouz/Documents/GitHub/M1-Programming
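Since Python 3.6, formatted string literals (f-strings) offer a further notation. The sketch below is not one of the original functions: `format_infos` and its arguments are hypothetical names, and fixed placeholder values are passed in so that the three styles can be compared directly.

```python
def format_infos(t, wd):
    """Build the same message with three formatting styles."""
    with_percent = 'Time: %s\nCurrent working directory: %s' % (t, wd)
    with_format = 'Time: {t}\nCurrent working directory: {wd}'.format(t=t, wd=wd)
    with_fstring = f'Time: {t}\nCurrent working directory: {wd}'
    return with_percent, with_format, with_fstring

# Placeholder values, used only to keep the example reproducible.
results = format_infos('Wed Feb 8 10:39:11 2023', '/tmp/example')
all_equal = len(set(results)) == 1
print(results[2])
```

All three styles produce the same string here; which one to use is mostly a matter of readability.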

In R, % placeholders can also be used for formatting strings, through the sprintf function. sprintf returns the formatted string, so it can be printed or stored for later use. To display a string containing escape characters, we can use cat, or writeLines as shown in the get_infos_2 function.

# R

get_infos <- function(){
  time_info = paste0('Time: ',Sys.time())
  wd_info = paste0('\nCurrent working directory: ',getwd())
  infos = paste0(time_info,wd_info)
  cat(infos)
}

get_infos_2 <- function(){
  infos = sprintf('Time: %s
Current working directory: %s',
                  Sys.time(),
                  getwd())
  writeLines(infos) # another way to interpret escape characters
}

get_infos()
## Time: 2023-02-08 10:39:11
## Current working directory: /home/peltouz/Documents/GitHub/M1-Programming

get_infos_2()
## Time: 2023-02-08 10:39:11
## Current working directory: /home/peltouz/Documents/GitHub/M1-Programming

2 Regular Expressions
The text used in this section is the abstract of 'ImageNet Classification with Deep Convolutional Neural Networks' (Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, 2012), presented at 'Advances in Neural Information Processing Systems 25' (NIPS 2012):

'<p>We trained a large, deep convolutional neural network to classify the 1.3 million
high-resolution images in the LSVRC-2010 ImageNet training set into the 1000 different
classes. On the test data, we achieved top-1 and top-5 error rates of 39.7\% and
18.9\% which is considerably better than the previous state-of-the-art results.
The neural network, which has 60 million parameters and 500,000 neurons,
consists of five convolutional layers, some of which are followed by max-pooling
layers, and two globally connected layers with a final 1000-way softmax.
To make training faster, we used non-saturating neurons and a very efficient GPU
implementation of convolutional nets. To reduce overfitting in the globally
connected layers we employed a new regularization method that proved to be very effective.</p>'

Regular expressions (regex) describe how a character sequence should be matched, and are a very flexible tool. They are commonly used in text manipulation: string-searching algorithms use them for "find" or "find and replace" operations on strings, and for input validation.

Except for control characters, also called metacharacters (+ ? . * ^ $ ( ) [ ] { } | \), all characters match themselves. You can match a control character literally by preceding it with a backslash.
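The effect of escaping can be seen in the following minimal sketch. The sample sentence is made up for the illustration; `re.escape` is the standard-library helper that escapes a literal string for use in a pattern.

```python
import re

text = 'rates of 39.7% and 18.9%, version 1x3 and 1.3 million'

# Unescaped '.' matches any character, so '1x3' is matched too.
loose = re.findall(r'1.3', text)

# Escaping the dot restricts the match to a literal '.'.
strict = re.findall(r'1\.3', text)

# re.escape() escapes the metacharacters of a literal string for us.
pattern = re.escape('39.7%')
escaped_hits = re.findall(pattern, text)

print(loose, strict, escaped_hits)
```

Writing patterns as raw strings (r'...') avoids having to double the backslash at the Python level as well.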

Some functions in 're' (Python) and 'stringr' (R) that can be useful when working with regular expressions are:

re          stringr            Description
findall     str_match_all      Returns a list containing all matches
finditer                       Returns an iterator yielding a Match object for each match
search      str_locate_all     Returns a Match object if there is a match anywhere in the string (str_locate_all returns the start and end positions of every match)
split       str_split          Returns a list where the string has been split at each match
sub         str_replace_all    Replaces one or many matches with a string
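The differences between these functions are easiest to see side by side. The following sketch uses a made-up sample string and illustrative variable names; it relies only on functions of the standard `re` module.

```python
import re

sample = 'five convolutional layers, max-pooling layers, connected layers'

# search() stops at the first match and returns a single Match object (or None).
first = re.search('layers', sample)

# finditer() lazily yields one Match object per non-overlapping match.
spans = [m.span() for m in re.finditer('layers', sample)]

# sub() replaces every match; split() cuts the string at every match.
replaced = re.sub('layers', 'blocks', sample)
pieces = re.split(', ', sample)

print(first.span(), spans)
print(replaced)
print(pieces)
```

Match objects carry the position of the match (span, start, end), which findall does not return.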

Metacharacters are characters that have a special meaning in regular expressions. Their uses are discussed in more detail in the next few sections.

Parentheses ( ) are used for grouping. Square brackets [ ] define a character class, and curly braces { } are used by a quantifier with specific limits.

Character    Description                                                                   Example
[]           A set of characters                                                           "[a-m]"
\            Signals a special sequence (can also be used to escape special characters)    "\d"
.            Any character (except newline character)                                      "he..o"
^            Starts with                                                                   "^hello"
$            Ends with                                                                     "world$"
*            Zero or more occurrences                                                      "aix*"
+            One or more occurrences                                                       "aix+"
{}           Exactly the specified number of occurrences                                   "al{2}"
|            Either or                                                                     "x|y"
()           Grouping characters (or other regexes)                                        "(ab)\1" matches "abab"
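A few of the rows above can be checked with the short sketch below. The test strings are invented for the illustration; the {1,2} form is the range variant of the {} quantifier.

```python
import re

# {n}: exactly n repetitions of the preceding item ('l' here).
exact = re.findall(r'al{2}', 'al all alll')

# {m,n}: between m and n repetitions (greedy, so it takes as many as allowed).
ranged = re.findall(r'al{1,2}', 'al all alll')

# |: alternation between two alternatives.
either = re.findall(r'cat|dog', 'cat bird dog')

# (...) groups a sub-pattern; \1 must repeat the text captured by group 1.
backref = re.search(r'(ab)\1', 'xxababxx')

print(exact, ranged, either, backref.group(0))
```

re.search is used for the backreference because re.findall returns the content of the group, not the whole match, when the pattern contains a capturing group.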

# Python
import re

abstract = open("data/imagenet_abstract.txt", "r").read()

re.search('layers',abstract)
## <_sre.SRE_Match object; span=(437, 443), match='layers'>
re.findall('layers',abstract)
## ['layers', 'layers', 'layers', 'layers']
re.findall('net+','net ne nett networks')
## ['net', 'nett', 'net']
re.findall('net*','net ne nett networks')
## ['net', 'ne', 'nett', 'net']

re.sub(r'\\','',abstract)
## '<p>We trained a large, deep convolutional neural network to classify the 1.3 million high-resolution images in the LSVRC-2010 ImageNet training set into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 39.7% and 18.9% which is considerably better than the previous state-of-the-art results. The neural network, which has 60 million parameters and 500,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and two globally connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of convolutional nets. To reduce overfitting in the globally connected layers we employed a new regularization method that proved to be very effective.</p>'
re.split(' ',abstract)[:5]
## ['<p>We', 'trained', 'a', 'large,', 'deep']

# R
library(stringr)
library(readr)

abstract <- as.character(read_file('data/imagenet_abstract.txt'))

str_locate_all(pattern = 'layers',abstract)
## [[1]]
##      start end
## [1,]   438 443
## [2,]   488 493
## [3,]   523 528
## [4,]   728 733

str_locate_all(pattern = 'net+','net ne nett networks')
## [[1]]
##      start end
## [1,]     1   3
## [2,]     8  11
## [3,]    13  15

str_locate_all(pattern = 'net*','net ne nett networks')
## [[1]]
##      start end
## [1,]     1   3
## [2,]     5   6
## [3,]     8  11
## [4,]    13  15

str_replace_all(abstract,'\\\\','')
## [1] "<p>We trained a large, deep convolutional neural network to classify the 1.3 million high-resolution images in the LSVRC-2010 ImageNet training set into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 39.7% and 18.9% which is considerably better than the previous state-of-the-art results. The neural network, which has 60 million parameters and 500,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and two globally connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of convolutional nets. To reduce overfitting in the globally connected layers we employed a new regularization method that proved to be very effective.</p>"

str_split(abstract,' ',simplify = T)[1:5]
## [1] "<p>We"   "trained" "a"       "large,"  "deep"

2.1 Anchors
Anchors specify the position at which a match must occur.

Example        Description
^Python        Match "Python" at the start of a string or internal line
Python$        Match "Python" at the end of a string or line
\APython       Match "Python" at the start of a string
Python\Z       Match "Python" at the end of a string
\bPython\b     Match "Python" at a word boundary
\brub\B        \B is nonword boundary: match "rub" in "rube" and "ruby" but not alone
Python(?=!)    Match "Python", if followed by an exclamation point.
Python(?!!)    Match "Python", if not followed by an exclamation point.

# Python
import re

words = ['neuron','neural','network','networks','deep convolutional neural network']

[w for w in words if re.findall(r'\Aneu',w)]
## ['neuron', 'neural']
[w for w in words if re.findall(r'^neu',w)]
## ['neuron', 'neural']
[w for w in words if re.findall(r'works?\Z',w)]
## ['network', 'networks', 'deep convolutional neural network']
[w for w in words if re.findall(r'works?$',w)]
## ['network', 'networks', 'deep convolutional neural network']

# R
library(stringr)

words = c('neuron','neural','network','networks','deep convolutional neural network')

str_detect(words,'\\Aneu')
## [1] TRUE TRUE FALSE FALSE FALSE
str_detect(words,'^neu')
## [1] TRUE TRUE FALSE FALSE FALSE
str_detect(words,'neu\\Z')
## [1] FALSE FALSE FALSE FALSE FALSE
str_detect(words,'neu$')
## [1] FALSE FALSE FALSE FALSE FALSE
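The anchors table also lists word boundaries and lookahead assertions, which the examples above do not exercise. The following sketch, with made-up test strings, illustrates them in Python; re.MULTILINE is the flag that lets ^ match at the start of internal lines.

```python
import re

text = 'rub ruby rube'

# \brub\B: 'rub' starting at a word boundary but NOT ending at one.
inside_word = [m.start() for m in re.finditer(r'\brub\B', text)]

# \brub\b: 'rub' as a whole word only.
whole_word = [m.start() for m in re.finditer(r'\brub\b', text)]

# Lookahead assertions check what follows without consuming it.
followed = re.findall(r'Python(?=!)', 'Python! Python?')
not_followed = [m.start() for m in re.finditer(r'Python(?!!)', 'Python! Python?')]

# With re.MULTILINE, ^ also matches right after each newline; \A never does.
line_starts = re.findall(r'^Python', 'Python\nPython', flags=re.MULTILINE)
string_start = re.findall(r'\APython', 'Python\nPython', flags=re.MULTILINE)

print(inside_word, whole_word, followed, not_followed)
print(len(line_starts), len(string_start))
```

Without the re.MULTILINE flag, ^ and \A behave identically in Python, which is why the two give the same result on the single-line words used earlier.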

2.2 Character classes


Example      Description
[Pp]ython    Match "Python" or "python"
rub[ye]      Match "ruby" or "rube"
[aeiou]      Match any one lowercase vowel
[0-9]        Match any digit; same as [0123456789]
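Each row of the table can be tried out with re.findall. The sketch below uses invented test strings and illustrative variable names.

```python
import re

# [Pp] accepts either case for the first letter only.
pythons = re.findall(r'[Pp]ython', 'Python and python, not PYTHON')

# rub[ye] matches 'ruby' or 'rube' but not 'rubs'.
rubs = re.findall(r'rub[ye]', 'ruby rube rubs')

# [aeiou] matches one lowercase vowel at a time.
vowels = re.findall(r'[aeiou]', 'neural network')

# [0-9] matches a single digit; add + to match runs of digits.
digits = re.findall(r'[0-9]', 'LSVRC-2010')
numbers = re.findall(r'[0-9]+', '60 million parameters and 500,000 neurons')

print(pythons, rubs, vowels, digits, numbers)
```

A character class always matches exactly one character; repetition has to be requested with a quantifier.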
