Pattern Matching With Regular Expressions: Perl For Biologists
CANTERBURY
She hath been then more fear'd than harm'd, my liege;
For hear her but exampled by herself:
When all her chivalry hath been in France
And she a mourning widow of her nobles,
She hath herself not only well defended
But taken and impounded as a stray
The King of Scots; whom she did send to France,
To fill King Edward's fame with prisoner kings...
Perl for biologists
if ($text =~ /King/i) {
    print "found\n";
}
• The i after the closing / is a “modifier” and means ignore case.
• =~ is the binding operator and in an if means “contains”.
• / / is used to denote a pattern (called a “regular expression” or RE) which can be used to search within textual data.
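A minimal sketch of how the i modifier changes the result of a match, assuming a sample string copied from the quotation above. The variable names $exact, $nocase and $no_queen are illustrative only.

```perl
use strict;
use warnings;

# Sample text taken from the quotation above (illustrative only).
my $text = "The King of Scots; whom she did send to France";

# Case-sensitive: lowercase "king" does not occur in $text.
my $exact = ($text =~ /king/) ? 1 : 0;

# Case-insensitive: the i modifier lets "king" match "King".
my $nocase = ($text =~ /king/i) ? 1 : 0;

# !~ is the negated binding operator ("does not contain").
my $no_queen = ($text !~ /Queen/i) ? 1 : 0;

print "exact=$exact nocase=$nocase no_queen=$no_queen\n";
```

Run as written, this should print exact=0 nocase=1 no_queen=1.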
Example-2
$DNA = "ACGTCGGACCCGGAATTACTA";
print "Starting DNA\n$DNA\n";
# Transcribe the DNA to RNA by replacing T's with U's
$RNA = $DNA;
$RNA =~ s/T/U/g;
# print RNA to screen
print "Transcribed RNA\n$RNA\n";

Output:
Starting DNA
ACGTCGGACCCGGAATTACTA
Transcribed RNA
ACGUCGGACCCGGAAUUACUA
$RNA =~ s/T/U/g;
• $RNA: variable
• =~ : binding operator
• T: RE giving the pattern to look for
• U: replacement text
• g: the g modifier means “global”, i.e. everywhere in the string
Naturally, the pattern to be found and the replacement text can both be variables.
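As a rough illustration of using variables on both sides of a substitution, the sketch below repeats the transcription example with the pattern and replacement held in scalars. The names $pattern and $replace are hypothetical, chosen only for this example.

```perl
use strict;
use warnings;

my $dna     = "ACGTCGGACCCGGAATTACTA";
my $pattern = "T";   # text to look for (hypothetical variable name)
my $replace = "U";   # replacement text (hypothetical variable name)

# Copy the DNA, then substitute every occurrence of $pattern in the copy.
my $rna = $dna;
$rna =~ s/$pattern/$replace/g;

print "$rna\n";
```

The variable is interpolated into the RE, so any metacharacters it contains keep their special meaning. Writing \Q$pattern\E inside the RE treats the contents as literal text instead.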
Making the search more flexible
Other examples:
# translate codon to amino acid
if    ($codon =~ /GC./)    { $aa = "ala"; }   # alanine
elsif ($codon =~ /TG[TC]/) { $aa = "cys"; }   # cysteine
elsif ($codon =~ /GA[TC]/) { $aa = "asp"; }   # aspartic acid
elsif ($codon =~ /GA[AG]/) { $aa = "glu"; }   # glutamic acid
..
The . means any char except newline.
# non bases
if ($dna =~ /[^acgt]/) {
    print "Invalid base pairs found\n";
}
[^acgt] means any character other than a, c, g or t.
# check input contains a digit; [0-9] is a range
if ($input[$i] !~ /[0-9]/) {
    print "Invalid input\n";
    exit;
}
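The codon patterns above can be tried out with a small sketch like the following. The subroutine name codon_to_aa, the "???" fallback and the sample inputs are assumptions made for illustration. Only the four codon families shown above are covered, and each codon is assumed to be exactly three uppercase letters.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the codon patterns shown above.
# Assumes $codon is exactly three uppercase letters.
sub codon_to_aa {
    my ($codon) = @_;
    if    ($codon =~ /GC./)    { return "ala"; }   # . matches any third base
    elsif ($codon =~ /TG[TC]/) { return "cys"; }   # third base T or C
    elsif ($codon =~ /GA[TC]/) { return "asp"; }
    elsif ($codon =~ /GA[AG]/) { return "glu"; }
    return "???";                                  # not covered by this sketch
}

my @codons = ("GCA", "TGT", "GAC", "GAG", "TTT");
my @aas    = map { codon_to_aa($_) } @codons;
print "@aas\n";

# Negated character class: true if any character is not a, c, g or t.
my $dna         = "acgtnacgt";
my $has_invalid = ($dna =~ /[^acgt]/) ? 1 : 0;
print "invalid bases present: $has_invalid\n";
```

With these inputs the first line printed should be "ala cys asp glu ???", and the "n" in $dna should trigger the invalid-base check.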
Escaping characters
# look for the $$$$ separator in a .sd structure file
The $ has special meaning, so in this example it needs to be escaped with a backslash (\).
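A minimal sketch of what such a search might look like, assuming the file has already been read into an array of lines. The array @lines and its contents are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical lines as they might appear in an .sd file.
# Single quotes keep the dollar signs literal in the string itself.
my @lines = ('M  END', '$$$$', 'next molecule');

my $records = 0;
foreach my $line (@lines) {
    # Each \$ matches one literal dollar sign.
    if ($line =~ /\$\$\$\$/) {
        $records++;
    }
}

print "Found $records separator(s)\n";
```

Without the backslashes, Perl would try to interpret the dollar signs as variables or as the end-of-string anchor rather than as literal characters.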
Special characters
Characters which cannot be typed in or are not easily visible also use the \
notation. In addition, Perl defines certain metacharacters for REs.
\n    newline
\t    tab
\s    whitespace (e.g. space, tab or newline)
\r    carriage return
\033  octal char (here the escape character)
\w    match a word char (letter, digit or underscore)
\W    match a non-word char
\d    match a digit
\D    match a non-digit
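The following sketch shows a few of these escapes in simple match tests. The input line (an identifier, a tab, then a number) and all variable names are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical input line: an identifier, a tab, then a number.
my $line = "seq_01\t1542";

my $has_tab      = ($line =~ /\t/) ? 1 : 0;   # contains a tab
my $has_space    = ($line =~ /\s/) ? 1 : 0;   # a tab counts as whitespace
my $has_digit    = ($line =~ /\d/) ? 1 : 0;   # contains at least one digit
my $has_nondigit = ($line =~ /\D/) ? 1 : 0;   # contains at least one non-digit

# The identifier alone holds only word chars (letters, digits, underscore),
# so \W finds nothing in it.
my $id          = "seq_01";
my $has_nonword = ($id =~ /\W/) ? 1 : 0;

print "tab=$has_tab space=$has_space digit=$has_digit ";
print "nondigit=$has_nondigit nonword=$has_nonword\n";
```

Each test only asks whether at least one matching character occurs somewhere in the string.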
join
The join function does the reverse of split: it sticks fragments together
using a string (strictly speaking, join is not a RE function).
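A short sketch of split and join working as a pair, assuming a comma-separated record. The record contents and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical comma-separated record (contents invented for this example).
my $record = "P12345,insulin,Homo sapiens";

# split uses an RE to cut the string into fragments.
my @fields = split /,/, $record;

# join glues the fragments back together, here with a tab between them.
my $tabbed = join "\t", @fields;

print scalar(@fields), " fields\n";
print "$tabbed\n";
```

Note that the first argument of split is a pattern, whereas the first argument of join is an ordinary string.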
Summary
• Text in Perl can be searched and manipulated with regular expressions (REs)
• By default REs are enclosed by two / , e.g. /word/
• Modifiers such as i or g can be added to change the default behaviour, e.g.
/word/i searches for “word” regardless of case
• REs can be just text but they can also contain metacharacters to be
more/less specific. Examples include
• ^ and $ (anchors) to restrict the search to the start or end of the text
• char classes with [], e.g. [0-9] to allow a selection of chars
• full stop . means any character (except newline)
• Characters which mustn’t be interpreted as metacharacters need to be
escaped, e.g. \$, \^ , \[ , etc.
• Useful functions include split and join; useful operators include the binding operator =~ and substitution s///.
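The anchors mentioned in the summary can be illustrated with a sketch like the one below. The sequence is invented, and the variable names are assumptions for this example.

```perl
use strict;
use warnings;

# Hypothetical coding sequence: ATG GCG TGT TAA
my $gene = "ATGGCGTGTTAA";

# ^ anchors the pattern to the start of the string.
my $starts_with_atg = ($gene =~ /^ATG/) ? 1 : 0;

# $ anchors the pattern to the end of the string.
my $ends_with_taa = ($gene =~ /TAA$/) ? 1 : 0;

# GCG occurs in the sequence, but not at the start, so this fails.
my $starts_with_gcg = ($gene =~ /^GCG/) ? 1 : 0;

print "start=$starts_with_atg end=$ends_with_taa gcg_at_start=$starts_with_gcg\n";
```

Without the anchors, /GCG/ would match because the pattern may occur anywhere in the string.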