
Perl for Bioinformatics

This document provides an overview of programming for computational biology. It discusses commonly used programming languages for bioinformatics like Perl, Python, and R. It notes that Perl is particularly popular for bioinformatics due to its strengths in text manipulation and its forgiving nature. The document outlines some basic Perl concepts like variables, arithmetic operations, strings, conditional statements, loops, reading/writing files. It states that the goals of the course are to introduce programming concepts, teach rudimentary Perl, and expose students to bioinformatics software and file formats.

Uploaded by

Amitha Sampath
Copyright
© Attribution Non-Commercial (BY-NC)

Programming for Computational Biology

Ian Holmes Department of Bioengineering University of California, Berkeley

Programming languages

Self-contained language
  Platform-independent
  Used to write O/S
  C (imperative, procedural)
  C++, Java (object-oriented)
  Lisp, Haskell (functional)
  Prolog (logic)

Scripting language
  Closely tied to O/S
  Perl, Python, Ruby

Domain-specific language
  R (statistics)
  MatLab (numerics)
  SQL (databases)

An O/S typically manages
  Devices
  Files & directories
  Users & permissions
  Processes & signals

Bioinformatics pipelines often involve chaining together multiple tools

Perl is the most-used bioinformatics language


Most popular bioinformatics programming languages Bioinformatics career survey, 2008 Michael Barton

Pros and Cons of Perl


Reasons for Perl's popularity in bioinformatics (Lincoln Stein)
  Perl is remarkably good for slicing, dicing, twisting, wringing, smoothing, summarizing and otherwise mangling text
  Perl is forgiving
  Perl is component-oriented
  Perl is easy to write and fast to develop in
  Perl is a good prototyping language
  Perl is a good language for Web CGI scripting

Problems with Perl
  Hard to read ("there's more than one way to do it", cryptic syntax)
  Too forgiving (no strong typing, allows sloppy code)

Perl overview
Interpreted, not compiled
  Fast edit-run-revise cycle

Procedural & imperative
  Sequence of instructions (control flow)
  Variables, subroutines

Syntax close to C (the de facto standard minimal language)
  Weakly typed (unlike C)
  Redundant, not minimal ("there's more than one way to do it")
  Syntactic sugar

High-level data structures & algorithms
  Hashes, arrays

Operating System support (files, processes, signals)
String manipulation

Goals of this course


Concepts of computer programming
Rudimentary Perl (widely-used language)
  "How Perl saved the Human Genome Project" (Lincoln Stein)
Introduction to Bioinformatics file formats
Practical data-handling algorithms
Exposure to Bioinformatics software

Structural elements
Learning Perl, Schwartz et al., O'Reilly, ISBN 0-596-10105-8
"There's more than one way to do it"
Q: But which is best? A: TESTS
Terminal session
Description of test conditions

Tests (above) supersede texts (below):


The main program
Terminal input

The program output


Standard output stream

Files are shown in yellow Filename

General principles of programming


Make incremental changes
Test everything you do
  the edit-run-revise cycle
Write so that others can read it
  (when possible, write with others)
Think before you write
Use a good text editor
Good debugging style

Perl for Bioinformatics Section 1: Scalars and Loops



Perl basics
Basic syntax of a Perl program:
    # Elementary Perl program
    print "Hello World\n";

Output:
    Hello World

Lines beginning with "#" are comments, and are ignored by Perl
All statements end with a semicolon
Single or double quotes enclose a "string literal" (double quotes are "interpolated")
"\n" means new line
The print statement tells Perl to print the following stuff to the screen

Variables
We can tell Perl to "remember" a particular value, using the assignment operator =:
    $x = 3;
    print $x;
    $x = "ACGCGT";     # binding site for yeast transcription factor MCB
    print $x;

ACGCGT

The $x is referred to as a "scalar variable".


Variable names can contain alphabetic characters, numbers (but not at the start of the name), and underscore symbols "_".
Scalar variable names are all prefixed with the dollar symbol.

Arithmetic operations
Basic operators are + - / * %
    $x = 14;
    $y = 3;
    print "Sum: ", $x + $y, "\n";
    print "Product: ", $x * $y, "\n";
    print "Remainder: ", $x % $y, "\n";

Output:
    Sum: 17
    Product: 42
    Remainder: 2

    $x = 5;
    print "x started as $x\n";
    $x = $x * 2;       # could write $x *= 2;
    print "Then x was $x\n";
    $x = $x + 1;       # could write $x += 1; or even ++$x;
    print "Finally x was $x\n";

Output:
    x started as 5
    Then x was 10
    Finally x was 11

Can also use += -= /= *= ++ --

String operations
Concatenation . .=
    $a = "pan";
    $b = "cake";
    $a = $a . $b;
    print $a;

Output:
    pancake

    $a = "soap";
    $b = "dish";
    $a .= $b;
    print $a;

Output:
    soapdish

Can find the length of a string using the function length($x)


    $mcb = "ACGCGT";
    print "Length of $mcb is ", length($mcb);

Output:
    Length of ACGCGT is 6

More string operations


    $x = "A simple sentence";
    print $x, "\n";
    print uc($x), "\n";        # convert to upper case
    print lc($x), "\n";        # convert to lower case
    $y = reverse($x);          # reverse the string
    print $y, "\n";
    $x =~ tr/i/a/;             # transliterate "i"'s into "a"'s
    print $x, "\n";
    print length($x), "\n";    # calculate the length of the string

Output:
    A simple sentence
    A SIMPLE SENTENCE
    a simple sentence
    ecnetnes elpmis A
    A sample sentence
    17

Concatenating DNA fragments


    $dna1 = "accacgt";
    $dna2 = "taggtct";
    print $dna1 . $dna2;

Output:
    accacgttaggtct

"Transcribing" DNA to RNA


    $dna = "accACgttAGGTct";   # DNA string is a mixture of upper & lower case
    $rna = lc($dna);           # make it all lower case
    $rna =~ tr/t/u/;           # transliterate "t" to "u"
    print $rna;

Output:
    accacguuaggucu

Comparison: variables in C are typed

C does not have a basic type for strings, only individual characters. Strings are built up from more basic elements as arrays of characters (we'll get to arrays later). Much of this functionality is provided in C and C++ as part of the standard library.

Conditional blocks
The ability to execute an action contingent on some condition is what distinguishes a computer from a calculator. In Perl, this looks like this:
    if (condition) { action } else { alternative }

    $x = 149;
    $y = 100;
    if ($x > $y)
    {
        print "$x is greater than $y\n";
    }
    else
    {
        print "$x is less than $y\n";
    }

Output:
    149 is greater than 100

These braces { } tell Perl which piece of code is contingent on the condition.

Conditional operators
Numeric: > >= < <= != ==
  != means "does not equal"

    $x = 5 * 4;
    $y = 17 + 3;
    if ($x == $y) { print "$x equals $y"; }

Output:
    20 equals 20

Note that the test for "$x equals $y" is $x==$y, not $x=$y

String: eq ne gt lt ge le
  eq "equals"
  ne "does not equal"
  gt "is alphabetically greater than"
  lt "is alphabetically less than"
  ge "is alphabetically greater-or-equal"
  le "is alphabetically less-or-equal"

    ($x, $y) = ("Apple", "Banana");   # shorthand syntax for assigning more than one variable at a time
    if ($y gt $x) { print "$y after $x "; }

Output:
    Banana after Apple
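
The difference between the numeric and string operators matters most when a value could be read either way. The following is a minimal sketch (the helper name compare_both is made up for illustration) showing that the same pair of values can be equal under == but different under eq:

```perl
use strict;
use warnings;

# Hypothetical helper: report how two values compare numerically and as strings.
sub compare_both {
    my ($p, $q) = @_;
    my $numeric = ($p == $q) ? "equal" : "different";
    my $string  = ($p eq $q) ? "equal" : "different";
    return "numerically $numeric, as strings $string";
}

print compare_both("10", "10.0"), "\n";   # same number, different text
print compare_both(9, 10), "\n";          # different either way

# Ordering differs too: as numbers 9 < 10, but as strings "9" sorts after "10".
print "9 lt 10 is ", ("9" lt "10" ? "true" : "false"), "\n";
```

A reasonable rule of thumb is to use the numeric operators for quantities (lengths, counts, scores) and the string operators for names and sequences.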

Logical operators
Logical operators: && means "and", || means "or" (the words and, or can be used too, as here)
    $x = 222;
    if ($x % 2 == 0 and $x % 3 == 0) {
        print "$x is an even multiple of 3\n";
    }

Output:
    222 is an even multiple of 3

An exclamation mark ! is used to negate what follows
Thus !($x < $y) means the same as ($x >= $y)

In computers, the value zero is often used to represent falsehood, while any non-zero value (e.g. 1) represents truth. Thus:
    if (1) { print "1 is true\n"; }
    if (0) { print "0 is true\n"; }
    if (-99) { print "-99 is true\n"; }

Output:
    1 is true
    -99 is true
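
Perl's own rule is slightly broader than "zero is false": the number 0, the string "0", the empty string and undef all count as false, and everything else counts as true. A small sketch (the helper name truth is made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: describe whether Perl treats a value as true or false.
sub truth {
    my ($value) = @_;
    return $value ? "true" : "false";
}

print "0 is ",     truth(0),     "\n";   # false
print "'0' is ",   truth("0"),   "\n";   # false: the string "0"
print "'' is ",    truth(""),    "\n";   # false: the empty string
print "undef is ", truth(undef), "\n";   # false
print "'0.0' is ", truth("0.0"), "\n";   # true: a non-empty string other than "0"
print "-99 is ",   truth(-99),   "\n";   # true
```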

Loops
Here's how to print out the numbers 1 to 10:
    $x = 1;
    while ($x <= 10) {
        print $x, " ";
        ++$x;              # equivalent to $x = $x + 1;
    }

Output:
    1 2 3 4 5 6 7 8 9 10

The code inside the braces is repeatedly executed as long as the condition $x<=10 remains true.
This is a while loop. The code is executed while the condition is true.

A common kind of loop


Let's dissect the code of the while loop again:
    $x = 1;                  # initialisation
    while ($x <= 10) {       # test for completion
        print $x, " ";
        ++$x;                # continuation
    }

This form of while loop is common enough to have its own shorthand: the for loop.
    # for (initialisation; test for completion; continuation)
    for ($x = 1; $x <= 10; ++$x) {
        print $x, " ";
    }

Loops in C++ are similar to Perl


cout is the standard output stream, part of the standard library. Used in C++ only (C has a complicated printf command)

defined and undef


The function defined($x) is true if $x has been assigned a value:
    if (defined($newvar)) {
        print "newvar is defined\n";
    } else {
        print "newvar is not defined\n";
    }

Output:
    newvar is not defined

A variable that has not yet been assigned a value has the special value undef
Often, if you try to do something "illegal" (like reading from a nonexistent file), you end up with undef as a result
C does not have defined or undef. At best, using an uninitialized variable will cause a compiler warning; at worst, it will lead to undefined behavior (i.e. disaster)

Reading a line of data


To read from a file, we first need to open the file and give it a filehandle.
    open FILE, "sequence.txt";
This code snippet opens a file called "sequence.txt", and associates it with a filehandle called FILE

Once the file is opened, we can read a single line from it into the scalar $x :
    $x = <FILE>;
This reads the next line from the file, including the newline at the end, "\n". If the end of the file is reached, $x is assigned the special value undef

Reading an entire file


The following piece of code reads every line in a file and prints it out to the screen:
    open FILE, "sequence.txt";
    while (defined ($x = <FILE>)) {
        print $x;
    }
    close FILE;
This reads a line of data into $x, then checks if $x is defined. If $x is undef, then the file must have ended.

A shorter version of this is as follows:
    open FILE, "sequence.txt";
    while ($x = <FILE>) {      # this is equivalent to defined($x=<FILE>)
        print $x;
    }
    close FILE;
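
The snippets above do not check whether the file could actually be opened. As a hedged sketch, not the form used on these slides, the same loop can be written with a three-argument open, a lexical filehandle and an error check (the helper name read_lines is made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines of a file with newlines removed.
sub read_lines {
    my ($filename) = @_;
    # Three-argument open with a lexical filehandle; die with the system
    # error message ($!) if the file cannot be opened.
    open(my $fh, '<', $filename) or die "Cannot open $filename: $!";
    my @lines;
    while (my $line = <$fh>) {
        chomp $line;
        push @lines, $line;
    }
    close $fh;
    return @lines;
}
```

It could be called as print "$_\n" foreach read_lines("sequence.txt"); if the file is missing, the program stops with a readable message instead of silently reading nothing.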

The default variable, $_


Many operations that take a scalar argument, such as length($x), are assumed to work on $_ if the $x is omitted:
    $_ = "Hello";
    print;
    print length;

Output:
    Hello5

So we can also read a whole file like this:
    open FILE, "sequence.txt";
    while (<FILE>) {           # this line is equivalent to while (defined($_=<FILE>)) {
        print;
    }
    close FILE;

Files in C++ are streams

Debugging
Most programs don't work first time
Most apparently "working" programs actually aren't
Bugs are cryptic
Debugging is a scientific process
As you gain experience, you will begin to "insure" against bugs with your programming technique

Mars Climate Orbiter


Mars Climate Orbiter was the third spacecraft to be launched under the Mars Surveyor program to map & explore Mars
Around 2am PDT on September 23, 1999, the spacecraft disappeared behind Mars following a manoeuvre that should have put it into Mars orbit
This failure, along with a subsequent (unexplained) craft loss, cost NASA $327.6 million

What was the problem?


Following a certain kind of engine burn, designed to stabilise the craft's angular momentum, the Orbiter sent data to the ground station, so that its trajectory could be recalibrated (by a software module called SM_FORCE)
The Orbiter also internally recomputed its trajectory following a burn
The Orbiter's internal software module used metric units (Newton-seconds) while the ground station's SM_FORCE module used Imperial (pound-seconds). The specification called for metric units
The manoeuvre executed on September 23rd was therefore computed using the wrong trajectory, taking the Orbiter too low into Mars' atmosphere

Why was the bug not detected?


The spacecraft periodically transmitted its computed trajectory to the ground station. A quick comparison between the two trajectories would have revealed the error. However,
  Other bugs in the SM_FORCE module prevented its use until 4 months into the flight
  The ground crew weren't aware that trajectory data from the spacecraft were available
  Discrepancies were noticed, but were only reported informally by email, and not taken seriously enough

i.e. incomplete testing; ignoring unexpected results; institutional complacency.

Debugging is scientific
Finding bugs can be very frustrating
A job that you thought was nearly finished, for which you have budgeted a certain amount of time, stretches out indefinitely
Often you may have no idea what's wrong
If you think of debugging as a scientific problem and approach it systematically, much of the pain disappears

The Process of Debugging


Step 1: Identify the Problem
  observe it (e.g. because a test fails)
  reproduce it (so you can make it happen 100% of the time)
  isolate it (strip it down to its bare essentials)

The Process of Debugging


Step 2: Gather Information
  record all symptoms (disparate symptoms may be related; if not, you should tackle them systematically one by one)
  follow the flow of control of the program (many ways of doing this: e.g. you can use a "debugger" to watch the variables; the time-honored method, and definitely the best, is to insert debugging print statements into your code)
  note recent changes (usually the cause of bugs)
  look for similar problems (can ask other developers)
  check "machine environment" (e.g. if you move to a different computer, does it have less memory? less disk space?)

The Process of Debugging


Step 3: Form a Hypothesis
  try to isolate the code that causes the problem
    e.g. strip away all "working" code that is not essential to reproducing the bug
    if you can't find the bug, use a systematic "deletion" strategy (cf. genetics!) until you have narrowed down the problem
  what should that code be doing?
  this can be seen as a continuation of Step 1 ("identify the problem")
  debugging is a cyclic, interactive process

The Process of Debugging


Step 4: Test Your Hypothesis
  do not skip this step!
  often the hypothesis will come to you in a flash of inspiration, but you still need to test it
  for simple bugs, testing just means fixing the problem
  for more complex bugs, you'll need to proceed to the next steps...

The Process of Debugging


Step 5: Propose a Solution
  keep it minimal: try not to redesign all the code unless this is absolutely necessary
  then again, do not flinch from redesign if this is what is called for

Step 6: Test the Solution


also make sure you didn't break existing code

Process of Debugging: Summary


Step 1: Identify the Problem
Step 2: Gather Information
Step 3: Form a Hypothesis
Step 4: Test Your Hypothesis
Step 5: Propose a Solution
Step 6: Test the Solution

Proactive debugging
Place consistency checks in your code
also called assertions

Put comments in your code


this saves time when debugging

Comment known (and fixed) bugs


keep a record of what you've fixed

Put log messages into your code


you can make these optional (e.g. comment them out); having them there can save lots of time
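
As a rough illustration of consistency checks and optional log messages, here is a minimal sketch. The names $DEBUG, debug_log and gc_fraction are assumptions made up for this example; they are not part of the course material.

```perl
use strict;
use warnings;

my $DEBUG = 0;   # assumed switch: set to 1 to turn log messages on

# Log message: printed to standard error only when $DEBUG is set.
sub debug_log {
    my ($message) = @_;
    print STDERR "DEBUG: $message\n" if $DEBUG;
}

# Hypothetical example: GC fraction of a DNA sequence, with consistency checks.
sub gc_fraction {
    my ($dna) = @_;
    # Assertions: stop immediately if the input is not what we expect.
    die "gc_fraction: empty sequence" unless defined($dna) && length($dna);
    die "gc_fraction: unexpected character in '$dna'" if $dna =~ /[^acgtACGT]/;
    my $gc = ($dna =~ tr/gcGC//);   # tr returns the number of matching characters
    debug_log("sequence $dna has $gc G/C bases");
    return $gc / length($dna);
}

print gc_fraction("ACGCGT"), "\n";
```

Switching the log messages with a flag, rather than commenting them out, means they can be turned back on without editing every line.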

Perl for Bioinformatics Section 2: Sequences and Arrays

Summary: scalars and loops


Assignment operator       $x = 5;
Arithmetic operations     $y = $x * 3;
String operations         $s = "Value of y is " . $y;
Conditional tests         if ($y > 10) { print $s; }
Logical operators         if ($y>10 && $s eq "") { exit; }
Loops                     for ($x=1; $x<10; ++$x) { print $x, "\n"; }
defined and undef
Reading a file

Pattern-matching
A very sophisticated kind of logical test is to ask whether a string contains a pattern e.g. does a yeast promoter sequence contain the MCB binding site, ACGCGT?
    $name = "YBR007C";
    $dna="TAATAAAAAACGCGTTGTCG";     # 20 bases upstream of the yeast gene YBR007C
    if ($dna =~ /ACGCGT/) {
        print "$name has MCB!\n";
    }

Output:
    YBR007C has MCB!

=~ is the pattern binding operator
/ACGCGT/ is the pattern for the MCB binding site

FASTA format
A format for storing multiple named sequences in a single file
    >CG11604
    TAGTTATAGCGTGAGTTAGT
    TGTAAAGGAACGTGAAAGAT
    AAATACATTTTCAATACC
    >CG11455
    TAGACGGAGACCCGTTTTTC
    TTGGTTAGTTTCACATTGTA
    AAACTGCAAATTGTGTAAAA
    ATAAAATGAGAAACAATTCT
    GGT
    >CG11488
    TAGAAGTCAAAAAAGTCAAG
    TTTGTTATATAACAAGAAAT
    CAAAAATTATATAATTGTTT
    TTCACTCT
Name of sequence is preceded by > symbol
NB sequences can span multiple lines

This file contains 3' UTRs for Drosophila genes CG11604, CG11455 and CG11488

Call this file fly3utr.txt

Printing all sequence names in a FASTA database


    open FILE, "fly3utr.txt";
    while ($x = <FILE>) {
        if ($x =~ />/) {
            print $x;
        }
    }
    close FILE;

Output:
    >CG11604
    >CG11455
    >CG11488

The key to this program is this block:
    if ($x =~ />/) {       # this pattern matches (and returns TRUE) if $x contains the FASTA sequence-name symbol >
        print $x;          # this line prints $x if the pattern matched
    }

Pattern replacement
    open FILE, "fly3utr.txt";
    while (<FILE>) {
        if (/>/) {
            s/>//;         # new statement removes the ">"
            print;
        }
    }
    close FILE;

Output:
    CG11604
    CG11455
    CG11488

$_ is the default variable for these operations
The new statement s/>// is an example of a replacement.
General form: s/OLD/NEW/ replaces OLD with NEW
Thus s/>// replaces ">" with "" (the empty string)

Finding all sequence lengths


[Flowchart]
Start
Open file
Read line
End of file?
  yes: Print last sequence length, then Stop
  no: Remove \n newline character at end of line, then test: Line starts with > ?
    yes (sequence name):
      First sequence? If no, print last sequence length
      Record the name
      Reset running total of current sequence length
    no (sequence data):
      Add length of line to running total
  Go back to Read line

Finding all sequence lengths


Input file fly3utr.txt (as listed above)

    open FILE, "fly3utr.txt";
    while (<FILE>) {
        chomp;
        if (/>/) {
            if (defined $len) {
                print "$name $len\n";
            }
            $name = $_;
            $len = 0;
        } else {
            $len += length;
        }
    }
    print "$name $len\n";
    close FILE;

Output:
    >CG11604 58
    >CG11455 83
    >CG11488 68

The chomp statement trims the newline character "\n" off the end of the default variable, $_. Try it without this and see what happens, and if you can work out why

Reverse complementing DNA


A common operation due to double-helix symmetry of DNA
    $dna = "accACgttAGgtct";
    $revcomp = lc($dna);             # start by making string lower case again; this is generally good practice
    $revcomp = reverse($revcomp);    # reverse the string
    $revcomp =~ tr/acgt/tgca/;       # replace 'a' with 't', 'c' with 'g', 'g' with 'c' and 't' with 'a'
    print $revcomp;

Output:
    agacctaacgtggt

Running external programs


Suppose you want to get the output of another program into a variable. e.g. the following shell command prints the number of lines in the file myfile.txt
    wc -l myfile.txt
You can execute a command like this from Perl using system
    system "wc -l myfile.txt";
but that only prints the result to standard output; it does not give you access to the output of the command from within the Perl program.

One way to get the output is by enclosing the command in backticks:
    $lines = `wc -l myfile.txt`;

An equivalent way is to open a pipe from the command:
    open FILEHANDLE, "wc -l myfile.txt |";
    $lines = <FILEHANDLE>;
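
None of the forms above checks whether the external command actually succeeded. The following is a hedged sketch of one way to do that, using the list form of a pipe open; the helper name run_and_capture is made up for illustration, and on Windows the list form needs a reasonably recent Perl.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command (given as a list) and return its output.
sub run_and_capture {
    my @command = @_;
    # '-|' opens a pipe for reading from the command; the list form avoids the shell.
    open(my $pipe, '-|', @command) or die "Cannot run @command: $!";
    my @output = <$pipe>;
    # close returns false if the command exited with a non-zero status (held in $?).
    close($pipe) or die "Command @command failed with status $?";
    return join("", @output);
}

# $^X is the path of the running Perl interpreter, used here so that the
# example does not depend on any other program being installed.
my $result = run_and_capture($^X, '-e', 'print 6 * 7');
print "Command printed: $result\n";
```

The same helper could be called as run_and_capture("wc", "-l", "myfile.txt") on a system where wc is available.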

Arrays
An array is a variable holding a list of items
    @nucleotides = ('a', 'c', 'g', 't');
    print "Nucleotides: @nucleotides\n";

Output:
    Nucleotides: a c g t

We can think of this as a list with 4 entries
    element 0    a
    element 1    c
    element 2    g
    element 3    t
The array is the set of all four elements
Note that the element indices start at zero.

Array literals
There are several, equally valid ways to assign an entire array at once.
    @a = (1,2,3,4,5);           # this is the most common: a comma-separated list, delimited by parentheses
    print "a = @a\n";
    @b = ('a','c','g','t');
    print "b = @b\n";
    @c = 1..5;
    print "c = @c\n";
    @d = qw(a c g t);
    print "d = @d\n";

Output:
    a = 1 2 3 4 5
    b = a c g t
    c = 1 2 3 4 5
    d = a c g t

Accessing arrays
To access array elements, use square brackets; e.g. $x[0] means "element zero of array @x"
    @x = ('a', 'c', 'g', 't');
    print $x[0], "\n";
    $i = 2;
    print $x[$i], "\n";

Output:
    a
    g

Remember, element indices start at zero!
If you use an array @x in a scalar context, such as @x+0, then Perl assumes that you wanted the length of the array.
    @x = ('a', 'c', 'g', 't');
    print @x + 0;

Output:
    4
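
Besides the @x+0 idiom, Perl offers a few other standard ways to get at the size and the last element of an array. A short sketch:

```perl
use strict;
use warnings;

my @x = ('a', 'c', 'g', 't');

my $count = scalar(@x);   # scalar() forces scalar context: number of elements
my $last  = $#x;          # index of the last element (one less than the count)

print "Number of elements: $count\n";
print "Last index: $last\n";
print "Last element: $x[$last]\n";
print "Also the last element: $x[-1]\n";   # negative indices count from the end
```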

Array operations
You can sort and reverse arrays...
    @x = ('a', 't', 'g', 'c');
    @y = sort @x;
    @z = reverse @y;
    print "x = @x\n";
    print "y = @y\n";
    print "z = @z\n";

Output:
    x = a t g c
    y = a c g t
    z = t g c a

You can read the entire contents of a file into an array (each line of the file becomes an element of the array)
    open FILE, "sequence.txt";
    @x = <FILE>;

push, pop, shift, unshift


pop removes the last element of an array
push adds an element to the end of an array
shift removes the first element of an array
unshift adds an element to the start of an array
    @x = ("Fame", "Power", "Money");
    print "I started with @x\n";
    $y = pop @x;
    push @x, "Success";
    print "Then I had @x\n";
    $z = shift @x;
    unshift @x, "Glamour";
    print "Now I have @x\n";
    print "I lost $y and $z\n";

Output:
    I started with Fame Power Money
    Then I had Fame Power Success
    Now I have Glamour Power Success
    I lost Money and Fame

foreach
Finding the total of a list of numbers:
    @val = (4, 19, 1, 100, 125, 10);
    $total = 0;
    foreach $x (@val) {        # foreach statement loops through each entry in an array
        $total += $x;
    }
    print $total;

Output:
    259

Equivalent to:
    @val = (4, 19, 1, 100, 125, 10);
    $total = 0;
    for ($i = 0; $i < @val; ++$i) {
        $total += $val[$i];
    }
    print $total;

Output:
    259

Iterator comparison
foreach
    [yoko:~] yam% time perl -e 'foreach $n (1..10**6) { $total += log $n } print $total, "\n"'
    12815518.3846579
    0.765u 0.007s 0:00.80 95.0% 0+0k 0+0io 0pf+0w

for
    [yoko:~] yam% time perl -e 'for ($n = 1; $n <= 10**6; ++$n) { $total += log $n } print $total, "\n"'
    12815518.3846579
    1.080u 0.007s 0:01.12 96.4% 0+0k 0+0io 0pf+0w

Both timed on: iMac G5 1.8GHz 512MB, Mac OS X 10.4.2, perl v5.8.6 built for darwin-thread-multi-2level

(technically, the keywords for and foreach are interchangeable; historically, for was used with initialization-continuation-termination constructs and foreach was used with arrays)

The @ARGV array


A special array is @ARGV
This contains the command-line arguments when the program is invoked at the Unix prompt
It's a way for the user to pass information into the program
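
A minimal sketch of reading @ARGV follows. The script name greet.pl and the helper describe_args are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical script, imagined as being saved as "greet.pl" and run as:
#     perl greet.pl Cop bor
# If no arguments are given, print a usage message instead.
sub describe_args {
    my @args = @_;
    return "Usage: perl greet.pl GENE [GENE ...]\n" unless @args;
    my $text = "Got " . scalar(@args) . " argument(s)\n";
    foreach my $i (0 .. $#args) {
        $text .= "Argument $i is $args[$i]\n";
    }
    return $text;
}

print describe_args(@ARGV);
```

Passing @ARGV into a subroutine, rather than reading it directly everywhere, makes the argument handling easy to test with any list of values.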

Exploding a sequence into an array


The split statement turns a string into an array. Here, it splits after every character, but we can also split at specific points, like a restriction enzyme
    $dna = "accggtgtgcg";
    print "String: $dna\n";
    @array = split //, $dna;
    print "Array: @array\n";

Output:
    String: accggtgtgcg
    Array: a c c g g t g t g c g
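
To illustrate splitting "like a restriction enzyme", here is a sketch that cuts at the EcoRI site (g^aattc). It relies on lookbehind and lookahead assertions, which these slides do not cover, and the helper name ecori_digest is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: cut a sequence wherever the EcoRI site g^aattc occurs.
sub ecori_digest {
    my ($dna) = @_;
    # The pattern matches the empty string between "g" and "aattc", so split
    # cuts there without removing any bases.
    return split /(?<=g)(?=aattc)/, $dna;
}

my @fragments = ecori_digest("ccgaattcttagaattcgg");
print "Fragments: @fragments\n";
```

Splitting on the plain pattern /gaattc/ would also break the sequence at each site, but it would throw away the six bases of the site itself.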

The programming language C treats all strings as arrays

Taking a slice of an array


The syntax @x[i,j,k...] returns a (3-element) array containing elements i,j,k... of array @x
    @nucleotides = ('a', 'c', 'g', 't');
    @purines = @nucleotides[0,2];
    @pyrimidines = @nucleotides[1,3];
    print "Nucleotides: @nucleotides\n";
    print "Purines: @purines\n";
    print "Pyrimidines: @pyrimidines\n";

Output:
    Nucleotides: a c g t
    Purines: a g
    Pyrimidines: c t

Finding elements in an array


The grep command is used to select some elements from an array
The statement grep(EXPR,LIST) returns all elements of LIST for which EXPR evaluates to true (when $_ is set to the appropriate element)
e.g. select all numbers over 100:
    @numbers = (101, 235, 10, 50, 100, 66, 1005);
    @numbersOver100 = grep ($_ > 100, @numbers);
    print "Numbers: @numbers\n";
    print "Numbers over 100: @numbersOver100\n";

Output:
    Numbers: 101 235 10 50 100 66 1005
    Numbers over 100: 101 235 1005

Applying a function to an array


The map command applies a function to every element in an array
Similar syntax to grep: map(EXPR,LIST) applies EXPR to every element in LIST
Example: multiply every number by 3
    @numbers = (101, 235, 10, 50, 100, 66, 1005);
    @numbersTimes3 = map ($_ * 3, @numbers);
    print "Numbers: @numbers\n";
    print "Numbers times 3: @numbersTimes3\n";

Output:
    Numbers: 101 235 10 50 100 66 1005
    Numbers times 3: 303 705 30 150 300 198 3015
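
Both grep and map also accept a block in braces in place of EXPR, which is the form most often seen in practice. A short sketch with made-up example sequences:

```perl
use strict;
use warnings;

my @sequences = ("acgcgt", "ttgaca", "aacgcgtt", "tataat");

# Block form of grep: keep the sequences containing the MCB site.
my @with_mcb = grep { /acgcgt/ } @sequences;

# Block form of map: compute the length of every sequence.
my @lengths = map { length($_) } @sequences;

# The two can be chained: upper-case only the sequences that have the site.
my @shouting = map { uc($_) } grep { /acgcgt/ } @sequences;

print "With MCB: @with_mcb\n";
print "Lengths: @lengths\n";
print "Upper case: @shouting\n";
```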

Perl for Bioinformatics


Section 3: Patterns and Subroutines

Review: pattern-matching
The following code:
    if (/ACGCGT/) {
        print "Found MCB binding site!\n";
    }

prints the string "Found MCB binding site!" if the pattern "ACGCGT" is present in the default variable, $_

Instead of using $_ we can "bind" the pattern to another variable (e.g. $dna) using this syntax:
    if ($dna =~ /ACGCGT/) {
        print "Found MCB binding site!\n";
    }

We can replace the first occurrence of ACGCGT with the string _MCB_ using the following syntax:
    $dna =~ s/ACGCGT/_MCB_/;
We can replace all occurrences by appending a 'g':
    $dna =~ s/ACGCGT/_MCB_/g;

Regular expressions
Perl provides a pattern-matching engine
Patterns are called regular expressions
They are extremely powerful
  probably Perl's strongest feature, compared to other languages
Often called "regexps" for short

Motivation: N-glycosylation motif


Common post-translational modification in ER
Membrane & secreted proteins Purpose: folding, stability, cell-cell adhesion

Attachment of a 14-sugar oligosaccharide
Occurs at asparagine residues with the consensus sequence NX1X2, where
  X1 can be anything (but proline & aspartic acid inhibit)
  X2 is serine or threonine

Can we detect potential N-glycosylation sites in a protein sequence?

Interlude: interactive testing


This script echoes input from the keyboard
while (<STDIN>) { print; }

The special filehandle STDIN means "standard input", i.e. the keyboard

Sometimes (e.g. in Windows IDEs) the output isn't printed until the script stops
This is because of buffering. To stop buffering, set to "autoflush":
    $| = 1;                # $| is the autoflush flag
    while (<STDIN>) {
        print;
    }

Matching alternative characters


[ACGT] matches one A, C, G or T:
    while (<STDIN>) {
        print "Matched: $_" if /[ACGT]/;
    }

Terminal session (the first two lines are typed input; on the slide, italics denote input text):
    this is not printed
    This is printed
    Matched: This is printed

In general square brackets denote a set of alternative possibilities
Use - to match a range of characters: [A-Z]
. matches anything
\s matches spaces or tabs
\S is anything that's not a space or tab
[^X] matches anything but X

Matching alternative strings


/(this|that)/ matches "this" or "that"
...and is equivalent to /th(is|at)/
    while (<STDIN>) {
        print "Matched: $_" if /this|that|other/;
    }

Terminal session (typed input, with the program's output after each matching line):
    Won't match THIS
    Will match this
    Matched: Will match this
    Won't match ThE oThER
    Will match the other
    Matched: Will match the other

Remember, regexps are case-sensitive

Matching multiple characters


x* matches zero or more x's (greedily)
x*? matches zero or more x's (sparingly)
x+ matches one or more x's (greedily)
x{n} matches n x's
x{m,n} matches from m to n x's
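
The difference between greedy and sparing matching is easiest to see side by side. A small sketch using a made-up test string:

```perl
use strict;
use warnings;

my $text = "aaa<gene1>bbb<gene2>ccc";

# Greedy: .* grabs as much as it can, so the match runs to the LAST ">".
my ($greedy) = $text =~ /<(.*)>/;

# Sparing: .*? grabs as little as it can, so the match stops at the FIRST ">".
my ($sparing) = $text =~ /<(.*?)>/;

# {m,n}: a run of between 2 and 3 "a" characters.
my ($run) = "caaaat" =~ /(a{2,3})/;

print "Greedy: $greedy\n";
print "Sparing: $sparing\n";
print "Run: $run\n";
```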

Word and string boundaries


^ matches the start of a string
$ matches the end of a string
\b matches word boundaries
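
A brief sketch of the three anchors in use (the helper name anchor_report and the test strings are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: list which of three anchored patterns match a string.
sub anchor_report {
    my ($text) = @_;
    my @hits;
    push @hits, "starts with ATG"  if $text =~ /^ATG/;
    push @hits, "ends with TAA"    if $text =~ /TAA$/;
    push @hits, "has the word cat" if $text =~ /\bcat\b/;
    return join(", ", @hits);
}

print anchor_report("ATGCCCTAA"), "\n";
print anchor_report("the cat sat"), "\n";
print anchor_report("concatenate"), "\n";   # "cat" is inside a word: no match
```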

"Escaping" special characters


\ is used to "escape" characters that otherwise have meaning in a regexp
so \[ matches the character "["
  if not escaped, "[" signifies the start of a list of alternative characters, as in [ACGT]
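
A short sketch of escaping in practice, using a made-up test string. The \Q...\E form shown at the end is a standard Perl feature that escapes every special character in an interpolated variable; it is not covered on these slides.

```perl
use strict;
use warnings;

my $text = "Gene list: [Cop] [bor] a.b axb";

# Escaped brackets match literal "[" and "]"; the parentheses capture the name.
my @names = $text =~ /\[(\w+)\]/g;
print "Names in brackets: @names\n";

# An unescaped "." matches any character; "\." matches only a full stop.
my @any     = $text =~ /(a.b)/g;
my @literal = $text =~ /(a\.b)/g;
print "a.b  matches: @any\n";
print "a\\.b matches: @literal\n";

# \Q...\E escapes every special character in an interpolated variable.
my $wanted = "[Cop]";
print "Found literal $wanted\n" if $text =~ /\Q$wanted\E/;
```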

Retrieving what was matched


If parts of the pattern are enclosed by parentheses, then (following the match) those parts can be retrieved from the scalars $1, $2...
    $| = 1;
    while (<STDIN>) {
        if (/(a|the) (\S+)/i) {
            print "Noun: $2\n";
        }
    }

Terminal session (typed input, each followed by the program's output):
    Pick up the cup
    Noun: cup
    Sit on a chair
    Noun: chair
    Put the milk in the tea
    Noun: milk

Note: only the first "the" is picked up by this regexp

e.g. /the (\S+) sat on the (\S+) drinking (\S+)/ matches "the cat sat on the mat drinking milk" with $1="cat", $2="mat", $3="milk"

Variations and modifiers


//i ignores upper/lower case distinctions:
    while (<STDIN>) {
        print "Matched: $_" if /pattern/i;
    }

Terminal session (typed input, then program output):
    pAttERn
    Matched: pAttERn

//g starts search where last match left off
  pos($_) is index of first character after last match

s/OLD/NEW/ replaces first "OLD" with "NEW"
s/OLD/NEW/g is "global" (i.e. replaces every occurrence of "OLD" in the string)

N-glycosylation site detector


    $| = 1;
    while (<STDIN>) {
        $_ = uc $_;                        # convert to upper case
        while (/(N[^PD][ST])/g) {          # regexp uses 'g' modifier to get all matches in sequence
            print "Potential N-glycosylation sequence ", $1,
                  " at residue ", pos() - 2, "\n";
        }
    }

The main regular expression:
    while (/(N[^PD][ST])/g) { ... }

pos() is index of first residue after match, starting at zero; so, pos()-2 is index of first residue of three-residue match, starting at one.
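
One limitation is worth knowing: with the 'g' modifier each search resumes after the previous match, so two sites that overlap (for example NNSS, which contains both NNS and NSS) are reported only once. The sketch below is an assumed extension, not part of the original detector. It wraps the pattern in a lookahead so that overlapping sites are found too; the helper name find_sites and the test sequence are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return "position:motif" strings for every potential
# N-glycosylation site, including sites that overlap one another.
sub find_sites {
    my ($protein) = @_;
    $protein = uc $protein;
    my @sites;
    # The lookahead (?=...) matches without consuming any residues, so the
    # next search starts just one residue further along.
    while ($protein =~ /(?=(N[^PD][ST]))/g) {
        my $start = $-[1] + 1;   # $-[1] is the zero-based start of group 1
        push @sites, "${start}:$1";
    }
    return @sites;
}

my @sites = find_sites("MNNSSANPT");
print "Sites: @sites\n";
```

Here positions are counted from one, as in the detector above; the final NPT is not reported because proline is excluded at the middle position.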

PROSITE and Pfam

PROSITE a database of regular expressions for protein families, domains and motifs

Pfam a database of Hidden Markov Models (HMMs) equivalent to probabilistic regular expressions

Subroutines
Often, we can identify self-contained tasks that occur in so many different places we may want to separate their description from the rest of our program. Code for such a task is called a subroutine.
Examples of such tasks:
  finding the length of a sequence (NB: Perl provides the subroutine length($x) to do this already)
  reverse complementing a sequence
  finding the mean of a list of numbers

Finding all sequence lengths (2)


    open FILE, "fly3utr.txt";
    while (<FILE>) {
        chomp;
        if (/>/) {
            print_name_and_len();          # subroutine call
            $name = $_;
            $len = 0;
        } else {
            $len += length;
        }
    }
    print_name_and_len();                  # subroutine call
    close FILE;

    # Subroutine definition; code in here is not executed unless subroutine is called
    sub print_name_and_len {
        if (defined ($name)) {
            print "$name $len\n";
        }
    }

Reverse complement subroutine

    sub revcomp {
        my $rev;                   # "my" announces that $rev is local to the subroutine revcomp
        $rev = reverse ($dna);
        $rev =~ tr/acgt/tgca/;
        return $rev;               # "return" announces that the return value of this subroutine is whatever's in $rev
    }

    $rev = 12345;
    $dna = "accggcatg";
    $rev1 = revcomp();
    print "Revcomp of $dna is $rev1\n";
    $dna = "cggcgt";
    $rev2 = revcomp();
    print "Revcomp of $dna is $rev2\n";
    print "Value of rev is $rev\n";        # value of $rev is unchanged by calls to revcomp

Output:
    Revcomp of accggcatg is catgccggt
    Revcomp of cggcgt is acgccg
    Value of rev is 12345

Revcomp with arguments


    sub revcomp {
        my ($dna) = @_;            # the array @_ holds the arguments to the subroutine (in this case, the sequence to be revcomp'd)
        my $rev = reverse ($dna);
        $rev =~ tr/acgt/tgca/;
        return $rev;
    }

    $dna1 = "accggcatg";
    $rev1 = revcomp ($dna1);
    print "Revcomp of $dna1 is $rev1\n";
    $dna2 = "cggcgt";              # now we don't have to re-use the same variable for the sequence to be revcomp'd
    $rev2 = revcomp ($dna2);
    print "Revcomp of $dna2 is $rev2\n";

Output:
    Revcomp of accggcatg is catgccggt
    Revcomp of cggcgt is acgccg

Mean & standard deviation


    @xdata = (1, 5, 1, 12, 3, 4, 6);
    ($x_mean, $x_sd) = mean_sd (@xdata);
    @ydata = (3.2, 1.4, 2.5, 2.4, 3.6, 9.7);
    ($y_mean, $y_sd) = mean_sd (@ydata);

    sub mean_sd {
        my @data = @_;                 # subroutine takes a list of $n numeric arguments
        my $n = @data + 0;
        my $sum = 0;
        my $sqSum = 0;
        foreach $x (@data) {
            $sum += $x;
            $sqSum += $x * $x;
        }
        my $mean = $sum / $n;
        my $variance = $sqSum / $n - $mean * $mean;
        my $sd = sqrt ($variance);     # square root
        return ($mean, $sd);           # subroutine returns a two-element list: (mean,sd)
    }

Maximum element of an array


Subroutine to find the largest entry in an array

@num = (1, 5, 1, 12, 3, 4, 6);
$max = find_max (@num);
print "Numbers: @num\n";
print "Maximum: $max\n";

sub find_max {
    my @data = @_;
    my $max = pop @data;
    foreach my $x (@data) {
        if ($x > $max) {
            $max = $x;
        }
    }
    return $max;
}

Output:
Numbers: 1 5 1 12 3 4 6
Maximum: 12

Including variables in patterns


Subroutine to find number of instances of a given binding site in a sequence

$dna = "ACGCGTAAGTCGGCACGCGTACGCGT";
$mcb = "ACGCGT";
print "$dna has ", count_matches ($mcb, $dna), " matches to $mcb\n";

sub count_matches {
    my ($pattern, $text) = @_;
    my $n = 0;
    while ($text =~ /$pattern/g) { ++$n }
    return $n;
}

Output:
ACGCGTAAGTCGGCACGCGTACGCGT has 3 matches to ACGCGT

Perl for Bioinformatics Section 4: Hashes

Data structures
Suppose we have a file containing a table of Drosophila gene names and cellular compartments, one pair on each line:
Cyp12a5   Mitochondrion
MRG15     Nucleus
Cop       Golgi
bor       Cytoplasm
Bx42      Nucleus

Suppose this file is in "genecomp.txt"

Reading a table of data


We can split each line into a 2-element array using the split command. This breaks the line at each space:

open FILE, "genecomp.txt";
while (<FILE>) {
    ($g, $c) = split;
    push @gene, $g;
    push @comp, $c;
}
close FILE;
print "Genes: @gene\n";
print "Compartments: @comp\n";

Output:
Genes: Cyp12a5 MRG15 Cop bor Bx42
Compartments: Mitochondrion Nucleus Golgi Cytoplasm Nucleus

The opposite of split is join, which makes a scalar from an array:

print join (" and ", @gene);

Output:
Cyp12a5 and MRG15 and Cop and bor and Bx42

Finding an entry in a table


The following code assumes that we've already read in the table from the file:

$geneToFind = shift @ARGV;
print "Searching for gene $geneToFind\n";
for ($i = 0; $i < @gene; ++$i) {
    if ($gene[$i] eq $geneToFind) {
        print "Gene: $gene[$i]\n";
        print "Compartment: $comp[$i]\n";
        exit;
    }
}
print "Couldn't find gene\n";

Example: $ARGV[0] = "Cop"

Output:
Searching for gene Cop
Gene: Cop
Compartment: Golgi

Binary search
The previous algorithm is inefficient. If there are N entries in the list, then on average we have to search through (N+1)/2 entries to find the one we want (and through all N if it is not there). For the full Drosophila genome, N=12,000. This is painfully slow. An alternative is the Binary Search algorithm:
Start with a sorted list.
Compare the middle element with the one we want.
Pick the half of the list that contains our element.
Iterate this procedure to "home in" on the right element.
This takes around log2(N) steps.
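
The steps above can be sketched in Perl as follows. This is a minimal illustration, not part of the original slides: the subroutine name binary_search and its argument order are assumptions, and the comparison uses cmp, so the list must be sorted in the same (ASCII) order that Perl's default sort produces.

```perl
use strict;
use warnings;

# Return the index of $target in a sorted list of strings, or -1 if absent.
# (binary_search is a hypothetical name chosen for this sketch.)
sub binary_search {
    my ($target, @sorted) = @_;
    my ($lo, $hi) = (0, $#sorted);
    while ($lo <= $hi) {
        my $mid = int (($lo + $hi) / 2);        # middle of the remaining range
        my $cmp = $target cmp $sorted[$mid];    # -1, 0 or +1
        if ($cmp == 0) { return $mid; }
        elsif ($cmp < 0) { $hi = $mid - 1; }    # target is in the lower half
        else { $lo = $mid + 1; }                # target is in the upper half
    }
    return -1;
}

my @gene = sort qw(Cyp12a5 MRG15 Cop bor Bx42);
my $index = binary_search ("Cop", @gene);
print "Sorted genes: @gene\n";
print "Cop is at index $index\n";
```

Note that copying the list into @sorted already costs N steps, so this version shows the logic of the search rather than its speed.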

Associative arrays (hashes)


Implementing algorithms like binary search is a common task in languages like C. Conveniently, Perl provides a type of array called an associative array (also called a hash) that is pre-indexed for quick search.
An associative array is a set of key-value pairs (like our gene-compartment table)

$comp{"Cop"} = "Golgi";

Curly braces {} are used to index an associative array

Reading a table using hashes


open FILE, "genecomp.txt";
while (<FILE>) {
    ($g, $c) = split;
    $comp{$g} = $c;
}
$geneToFind = shift @ARGV;
print "Gene: $geneToFind\n";
print "Compartment: ", $comp{$geneToFind}, "\n";

...with $ARGV[0] = "Cop" as before:

Gene: Cop
Compartment: Golgi
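
If the requested gene is not in the table, the hash lookup simply returns an undefined value. A small sketch of one way to guard against this uses Perl's built-in exists function; the subroutine name lookup_compartment and the fallback string "unknown" are assumptions made for illustration.

```perl
use strict;
use warnings;

my %comp = ("Cop" => "Golgi", "MRG15" => "Nucleus");

# Look up a gene in a gene-to-compartment table,
# returning "unknown" when the key is absent.
sub lookup_compartment {
    my ($gene, %table) = @_;
    return exists $table{$gene} ? $table{$gene} : "unknown";
}

print "Cop: ", lookup_compartment ("Cop", %comp), "\n";
print "xyz: ", lookup_compartment ("xyz", %comp), "\n";
```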

Reading a FASTA file into a hash


sub read_FASTA {
    my ($filename) = @_;
    my (%name2seq, $name, $seq);
    open FILE, $filename;
    while (<FILE>) {
        chomp;
        if (/>/) {
            s/>//;
            if (defined $name) {
                $name2seq{$name} = $seq;
            }
            $name = $_;
            $seq = "";
        } else {
            $seq .= $_;
        }
    }
    $name2seq{$name} = $seq;
    close FILE;
    return %name2seq;
}

Formatted output of sequences


sub print_seq {
    my ($name, $seq) = @_;
    print ">$name\n";
    my $width = 50;    # 50-column output
    for (my $i = 0; $i < length($seq); $i += $width) {
        if ($i + $width > length($seq)) {
            $width = length($seq) - $i;
        }
        print substr ($seq, $i, $width), "\n";
    }
}

The term substr($x,$i,$len) returns the substring of $x starting at position $i (counting from 0) with length $len. For example, substr("Biology",3,3) is "log"

keys and values


keys returns the list of keys in the hash
e.g. names, in the %name2seq hash

values returns the list of values
e.g. sequences, in the %name2seq hash

%name2seq = read_FASTA ("fly3utr.txt");
print "Sequence names: ", join (" ", keys (%name2seq)), "\n";
my $len = 0;
foreach $seq (values %name2seq) {
    $len += length ($seq);
}
print "Total length: $len\n";

Output:
Sequence names: CG11488 CG11604 CG11455
Total length: 210

Files of sequence names


Easy way to specify a subset of a given FASTA database
Each line is the name of a sequence in a given database, e.g.
CG1167
CG685
CG1041
CG1043

Get named sequences


Given a FASTA database and a "file of sequence names", print every named sequence:

($fasta, $fosn) = @ARGV;
%name2seq = read_FASTA ($fasta);
open FILE, $fosn;
while ($name = <FILE>) {
    chomp $name;
    $seq = $name2seq{$name};
    if (defined $seq) {
        print_seq ($name, $seq);
    } else {
        warn "Can't find sequence: $name. ",
             "Known sequences: ", join (" ", keys %name2seq), "\n";
    }
}
close FILE;

Intersection of two sets


Two files of sequence names: what is the overlap?

fosn1.txt:
CG1167
CG685
CG1041
CG1043

fosn2.txt:
CG215
CG1041
CG483
CG1167
CG1163

Find intersection using hashes:

open FILE1, "fosn1.txt";
while (<FILE1>) {
    $gotName{$_} = 1;
}
close FILE1;
open FILE2, "fosn2.txt";
while (<FILE2>) {
    print if $gotName{$_};
}
close FILE2;

Output:
CG1041
CG1167
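
The same idea can be tried without any files. The sketch below holds the two sets of names in arrays (the variable names @set1, @set2 and @common are assumptions) and uses Perl's built-in grep to keep the names from the second set that were seen in the first.

```perl
use strict;
use warnings;

my @set1 = qw(CG1167 CG685 CG1041 CG1043);
my @set2 = qw(CG215 CG1041 CG483 CG1167 CG1163);

# Record every name in the first set as a hash key
my %gotName;
foreach my $name (@set1) {
    $gotName{$name} = 1;
}

# Keep the names in the second set that are also keys of %gotName
my @common = grep { $gotName{$_} } @set2;
print "Intersection: @common\n";
```

Building the hash takes one pass over the first set and checking takes one pass over the second, so the work grows with the sum of the two set sizes rather than with their product.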

Assigning hashes
A hash can be assigned directly, as a list of "key=>value" pairs:

%comp = ('Cyp12a5' => 'Mitochondrion',
         'MRG15'   => 'Nucleus',
         'Cop'     => 'Golgi',
         'bor'     => 'Cytoplasm',
         'Bx42'    => 'Nucleus');
print "keys: ", join(";",keys(%comp)), "\n";
print "values: ", join(";",values(%comp)), "\n";

Output (the order of the keys is not predictable, but keys and values come out in matching order):
keys: bor;Cop;Bx42;Cyp12a5;MRG15
values: Cytoplasm;Golgi;Nucleus;Mitochondrion;Nucleus

The genetic code as a hash


%aa = ('ttt'=>'F', 'ttc'=>'F', 'tta'=>'L', 'ttg'=>'L',
       'ctt'=>'L', 'ctc'=>'L', 'cta'=>'L', 'ctg'=>'L',
       'att'=>'I', 'atc'=>'I', 'ata'=>'I', 'atg'=>'M',
       'gtt'=>'V', 'gtc'=>'V', 'gta'=>'V', 'gtg'=>'V',
       'tct'=>'S', 'tcc'=>'S', 'tca'=>'S', 'tcg'=>'S',
       'cct'=>'P', 'ccc'=>'P', 'cca'=>'P', 'ccg'=>'P',
       'act'=>'T', 'acc'=>'T', 'aca'=>'T', 'acg'=>'T',
       'gct'=>'A', 'gcc'=>'A', 'gca'=>'A', 'gcg'=>'A',
       'tat'=>'Y', 'tac'=>'Y', 'taa'=>'!', 'tag'=>'!',
       'cat'=>'H', 'cac'=>'H', 'caa'=>'Q', 'cag'=>'Q',
       'aat'=>'N', 'aac'=>'N', 'aaa'=>'K', 'aag'=>'K',
       'gat'=>'D', 'gac'=>'D', 'gaa'=>'E', 'gag'=>'E',
       'tgt'=>'C', 'tgc'=>'C', 'tga'=>'!', 'tgg'=>'W',
       'cgt'=>'R', 'cgc'=>'R', 'cga'=>'R', 'cgg'=>'R',
       'agt'=>'S', 'agc'=>'S', 'aga'=>'R', 'agg'=>'R',
       'ggt'=>'G', 'ggc'=>'G', 'gga'=>'G', 'ggg'=>'G' );

Translating: DNA to protein


$prot = translate ("gatgacgaaagttgt");
print $prot;

sub translate {
    my ($dna) = @_;
    $dna = lc ($dna);
    my $len = length ($dna);
    if ($len % 3 != 0) {
        die "Length $len is not a multiple of 3";
    }
    my $protein = "";
    for (my $i = 0; $i < $len; $i += 3) {
        my $codon = substr ($dna, $i, 3);
        if (!defined ($aa{$codon})) {
            die "Codon $codon is illegal";
        }
        $protein .= $aa{$codon};
    }
    return $protein;
}

Output:
DDESC

Counting residue frequencies


%count = count_residues ("gatgacgaaagttgt");
@residues = keys (%count);
foreach $residue (@residues) {
    print "$residue: $count{$residue}\n";
}

sub count_residues {
    my ($seq) = @_;
    my %freq;
    $seq = lc ($seq);
    for (my $i = 0; $i < length($seq); ++$i) {
        my $residue = substr ($seq, $i, 1);
        ++$freq{$residue};
    }
    return %freq;
}

Output:
g: 5
a: 5
c: 1
t: 4

Counting N-mer frequencies


%count = count_nmers ("gatgacgaaagttgt", 2);
@nmers = keys (%count);
foreach $nmer (@nmers) {
    print "$nmer: $count{$nmer}\n";
}

sub count_nmers {
    my ($seq, $n) = @_;
    my %freq;
    $seq = lc ($seq);
    for (my $i = 0; $i <= length($seq) - $n; ++$i) {
        my $nmer = substr ($seq, $i, $n);
        ++$freq{$nmer};
    }
    return %freq;
}

Output:
cg: 1
tt: 1
ga: 3
tg: 2
gt: 2
aa: 2
ac: 1
at: 1
ag: 1

N-mer frequencies for a whole file


my %name2seq = read_FASTA ("fly3utr.txt");
while (($name, $seq) = each %name2seq) {
    %count = count_nmers ($seq, 2, %count);
}
@nmers = keys (%count);
foreach $nmer (@nmers) {
    print "$nmer: $count{$nmer}\n";
}

sub count_nmers {
    my ($seq, $n, %freq) = @_;
    $seq = lc ($seq);
    for (my $i = 0; $i <= length($seq) - $n; ++$i) {
        my $nmer = substr ($seq, $i, $n);
        ++$freq{$nmer};
    }
    return %freq;
}

The each command is a shorthand for looping through each (key,value) pair in a hash

Output:
ct: 5
tc: 9
tt: 26
cg: 4
ga: 11
tg: 12
gc: 2
gt: 17
aa: 39
ac: 10
gg: 4
at: 17
ca: 11
ag: 15
ta: 20
cc: 2

Note how we keep passing %freq back into the count_nmers subroutine, to get cumulative counts

Files and filehandles


Opening a file:        open XYZ, $filename;       (this XYZ is the filehandle)
Closing a file:        close XYZ;
Reading a line:        $data = <XYZ>;
Reading an array:      @data = <XYZ>;
Printing a line:       print XYZ $data;
Read-only:             open XYZ, "<$filename";
Write-only:            open XYZ, ">$filename";
Test if file exists:   if (-e $filename) { print "$filename exists!\n"; }
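
The operations in this table can be exercised end to end as follows. The sketch writes two lines to a scratch file and reads them back. It departs from the slides in two labelled ways: it uses lexical filehandles ($out, $in) with the three-argument form of open, a common modern idiom, and it uses the core File::Temp module to obtain a scratch filename. The file contents are made up for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file for the demonstration (name chosen by File::Temp)
my ($tmp, $filename) = tempfile ();
close $tmp;

# Write-only: print two lines to the file
open (my $out, ">", $filename) or die "Can't write $filename: $!";
print $out ">CG11604\n", "acgt\n";
close $out or die $!;

# Read-only: read every line into an array
open (my $in, "<", $filename) or die "Can't read $filename: $!";
my @lines = <$in>;
close $in;

print "Read ", scalar (@lines), " lines\n";
print "File still exists\n" if -e $filename;
unlink $filename;    # tidy up the scratch file
```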

Perl for Bioinformatics Section 5: References

Behind the Scenes


PC = memory + CPU (+ peripherals)
Memory is just a list of bytes (e.g. 2^27 bytes in a machine with 128 MB of RAM)
To a first approximation, this is just one huge array. The array index is called the address
Some of the array elements are interpreted as instruction codes by the CPU
[Figure: memory drawn as an array of byte values (e.g. 65, 39, 243, ...) connected to the CPU, with addresses running from 0, 1, 2, ... up to 2^27-2, 2^27-1]

Buffer overflow attack

Hexadecimal notation
Computers use binary notation, which is tricky to interconvert to/from decimal notation
However, binary notation is big & unwieldy
A compromise is to use hexadecimal
Hexadecimal is base 16 (decimal is base 10, binary is base 2)
The letters A-F are used to represent the extra digits for 10-15

Binary       Decimal   Hexadecimal
101          5         5
1011         11        B
11100        28        1C
101000011    323       143
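
Perl can do these conversions itself. The sketch below reproduces the table using the built-in sprintf (formats %b and %X) and converts back with the built-in hex and oct functions; the variable names are chosen for illustration.

```perl
use strict;
use warnings;

# Decimal to binary and hexadecimal
foreach my $n (5, 11, 28, 323) {
    my $bin = sprintf ("%b", $n);
    my $hex = sprintf ("%X", $n);
    print "Decimal $n = binary $bin = hexadecimal $hex\n";
}

# Hexadecimal and binary strings back to numbers
my $from_hex = hex ("143");           # hexadecimal string to number
my $from_bin = oct ("0b101000011");   # oct() also understands a 0b prefix
print "143 (hex) = $from_hex, 101000011 (binary) = $from_bin\n";
```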

References
Recall the subroutine find_max(@x), which returns the largest element in the array @x. Count the number of times we create an array in this code.

@x = (1, 5, 1, 12, 3, 4, 6);    # array @x created here
$max = find_max (@x);            # @x passed into @_ here
sub find_max {
    my @data = @_;               # @_ copied into @data here
    ...

All in all, the array has been handled three times. (Strictly, the elements of @_ are aliases for the elements of @x rather than copies, but the assignment to @data is a full copy.) Each copy uses up time and memory. This seems unnecessary... and it is. Instead of passing the whole array into the subroutine, we could simply tell the subroutine where in memory the array begins. The memory address of a particular variable is called a reference to that variable. This is a useful abstraction. Addresses are often displayed in hexadecimal.

Reference syntax
To create a reference to
a scalar, $x:        $scalar_ref = \$x;
an array, @x:        $array_ref = \@x;
a hash, %x:          $hash_ref = \%x;

To access the variable behind a reference to
a scalar:            $x = $$scalar_ref;
an array:            @x = @$array_ref;
an array element:    $x = $array_ref->[3];
a hash:              %x = %$hash_ref;
a hash element:      $x = $hash_ref->{'key'};

Alternative syntax for array elements:
$x = $$array_ref[3];

References to scalars
$x = 10; $y = 20;
print "Initially: x=$x, y=$y\n";
$xReference = \$x;                                # this reference points to $x
print "X-reference: $xReference\n";
print "Referenced variable: $$xReference\n";
$$xReference += 3;                                # this changes the value of $x
print "Now: x=$x, y=$y\n";
$yReference = \$y;                                # this reference points to $y
print "Y-reference: $yReference\n";
print "Referenced variable: $$yReference\n";
$$yReference *= 2;                                # this changes the value of $y
print "Finally: x=$x, y=$y\n";

Output:
Initially: x=10, y=20
X-reference: SCALAR(0x1832ac0)       (this is the memory location used to store $x)
Referenced variable: 10
Now: x=13, y=20
Y-reference: SCALAR(0x1832ae4)       (this is the memory location used to store $y)
Referenced variable: 20
Finally: x=13, y=40

References to arrays
@x = ('a', 'c', 'g', 't');
@y = 1..10;
print "x: @x\n";
print "y: @y\n";
$xReference = \@x;                          # this reference points to @x
print "X-reference: $xReference\n";
print "Referenced array: @$xReference\n";
$$xReference[3] =~ tr/t/u/;                 # this changes the 4th element of @x
print "New x: @x\n";
$yReference = \@y;                          # this reference points to @y
print "Referenced array: @$yReference\n";
$yReference->[3] *= 2;                      # this changes the 4th element of @y (NB alternative notation)
print "New y: @y\n";

Output:
x: a c g t
y: 1 2 3 4 5 6 7 8 9 10
X-reference: ARRAY(0x1832b08)
Referenced array: a c g t
New x: a c g u
Referenced array: 1 2 3 4 5 6 7 8 9 10
New y: 1 2 3 8 5 6 7 8 9 10

Note that the type of reference is now ARRAY, not SCALAR

References to hashes
%comp = ('Cyp12a5' => 'Mitochondrion',
         'MRG15'   => 'Nucleus',
         'Cop'     => 'Golgi',
         'bor'     => 'Cytoplasm',
         'Bx42'    => 'Nucleus');
$ref = \%comp;                              # the reference points to %comp
print "Values: ", join(" ",values(%comp)), "\n";
print "Ref: $ref\n";
print "Ref values: ", join(" ",values(%$ref)), "\n";
$$ref{'MRG15'} =~ s/N/n/;                   # this changes $comp{'MRG15'}
print "New values: ", join(" ",values(%comp)), "\n";

Output:
Values: Cytoplasm Golgi Nucleus Mitochondrion Nucleus
Ref: HASH(0x1832b08)
Ref values: Cytoplasm Golgi Nucleus Mitochondrion Nucleus
New values: Cytoplasm Golgi Nucleus Mitochondrion nucleus

Note lower-case 'n' after change

References to subroutines
We can also have references to subroutines

Syntax for assigning a subroutine reference:
$subref = \&read_FASTA;

Syntax for calling a subroutine reference:
%name2seq = &$subref ("fly3utr.txt");

Anonymous subroutines:
$subref = sub { print "Hello world\n"; };
&$subref();

Output:
Hello world

References to code
sub hello { print "Hello @_!\n"; }
my $codeRef1 = \&hello;                          # the reference points to the subroutine hello
&$codeRef1 ("Mr", "President");
print "Ref: $codeRef1\n";
my $codeRef2 = sub { print "Goodbye @_!" };      # this is an anonymous subroutine reference
&$codeRef2 ("cruel", "world");

Output:
Hello Mr President!
Ref: CODE(0x180cc3c)
Goodbye cruel world!

An anonymous subroutine is one that is never named, but only referenced. We'll be seeing more about anonymous references on the following slides.

Reasons for references


Increased efficiency/performance (pass a reference instead of the whole thing)
Allowing a subroutine to modify the value of a variable, and have this modification be propagated back to the caller of the subroutine
Allowing arrays/hashes to contain (references to) other arrays/hashes
Abstract representation of subroutines
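
The first two reasons can be illustrated with a short sketch. The subroutine names find_max_ref and double_all are assumptions made for this example: the first reads an array through a reference without copying it, and the second changes the caller's array in place.

```perl
use strict;
use warnings;

# Find the largest element via a reference; the array itself is not copied
sub find_max_ref {
    my ($dataRef) = @_;
    my $max = $dataRef->[0];
    foreach my $x (@$dataRef) {
        if ($x > $max) { $max = $x; }
    }
    return $max;
}

# Double every element of the caller's array, in place
sub double_all {
    my ($dataRef) = @_;
    foreach my $x (@$dataRef) {
        $x *= 2;    # $x is an alias for the array element, so the array changes
    }
}

my @x = (1, 5, 1, 12, 3, 4, 6);
print "Maximum: ", find_max_ref (\@x), "\n";
double_all (\@x);
print "Doubled: @x\n";
```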

Anonymous arrays and hashes


Recall the syntax for assigning an entire array...
@nucleotide = ('a', 'c', 'g', 't');

...and the syntax for assigning an entire hash...
%dna2rna = ('a'=>'a', 'c'=>'c', 'g'=>'g', 't'=>'u');

We can also create an array and assign a reference to it, without explicitly naming the array variable:
$nucleotide_ref = ['a', 'c', 'g', 't'];    # note square brackets instead of parentheses

This is called an anonymous array. We can also create anonymous hashes:
$dna2rna_ref = {'a'=>'a', 'c'=>'c', 'g'=>'g', 't'=>'u'};    # note curly brackets

Arrays of arrays
More precisely, arrays of references-to-arrays. Suppose we want to represent this matrix:

0 0 0 2
0 0 3 0
0 3 0 1
2 0 1 0

This matrix could be a table of RNA base-pairing scores if the row and column indices are (A,C,G,U). The score of a pair is the number of strong hydrogen bonds that it forms. Thus, A-U and U-A pairs score +2; C-G and G-C pairs score +3; G-U and U-G pairs score +1; and all other pairs score 0.

We could do it like this:

$row1 = [0,0,0,2];
$row2 = [0,0,3,0];
$row3 = [0,3,0,1];
$row4 = [2,0,1,0];
@matrix = ($row1,$row2,$row3,$row4);    # @matrix is an array of references to arrays

Or, more succinctly, like this:

@matrix = ([0,0,0,2], [0,0,3,0], [0,3,0,1], [2,0,1,0]);
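
Once the matrix is built, an entry is reached by indexing the outer array and then following the reference to the row. The sketch below looks up base-pairing scores by letter; the hash %index (mapping a, c, g, u to rows and columns 0 to 3) and the subroutine name pair_score are assumptions added for illustration.

```perl
use strict;
use warnings;

my @matrix = ([0,0,0,2], [0,0,3,0], [0,3,0,1], [2,0,1,0]);

# Row/column position of each base, in the order (A,C,G,U)
my %index = ('a' => 0, 'c' => 1, 'g' => 2, 'u' => 3);

# Score for pairing base $x with base $y
sub pair_score {
    my ($x, $y) = @_;
    my $row = $matrix[$index{lc($x)}];     # reference to one row
    return $row->[$index{lc($y)}];         # element of that row
}

print "G-C scores ", pair_score ('G', 'C'), "\n";
print "G-U scores ", pair_score ('G', 'U'), "\n";
print "A-A scores ", pair_score ('A', 'A'), "\n";
```

Perl also accepts the shorter form $matrix[2][1], since the arrow between adjacent subscripts is optional.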

Arrays in C and C++


C has nothing like Perl's hashes, although various libraries (e.g. GLIB) have equivalents. C++'s Standard Template Library offers the map template, which is similar to a hash.

The vector is a C++ template. Templates (like C arrays) are strongly typed, unlike Perl's weakly typed arrays & hashes.

Genome annotations

GFF annotation format


Nine-column tab-delimited format for simple annotations:

Sequence name  Program  Feature type  Start  End  Score  Strand  Coding frame  Group
SEQ1           EMBL     atg           103    105  .      +       0             group1
SEQ1           EMBL     exon          103    172  .      +       0             group1
SEQ1           EMBL     splice5       172    173  .      +       .             group1
SEQ1           netgene  splice5       172    173  0.94   +       .             group1
SEQ1           genie    sp5-20        163    182  2.3    +       .             group1
SEQ1           genie    sp5-10        168    177  2.1    +       .             group1
SEQ2           grail    ATG           17     19   2.1    -       0             group2

Start: start residue (starts at 1)
End: end residue (starts at 1)
Strand: + or -
Coding frame: "." if not applicable

Many of these now obsolete, but name/start/end/strand (and sometimes type) are useful
Methods: read, write, compareTo(GFF_file), getSeq(FASTA_file)

Reading a GFF file


This subroutine reads a GFF file
Each line is made into an array via the split command
The subroutine returns an array of such arrays

sub read_GFF {
    my ($filename) = @_;
    open GFF, "<$filename";
    my @gff;
    while (my $line = <GFF>) {
        chomp $line;
        my @data = split /\t/, $line, 9;    # splits the line into at most nine fields, separated by tabs ("\t")
        push @gff, \@data;                  # appends a reference to @data to the @gff array
    }
    close GFF;
    return @gff;
}

Writing a GFF file


We should be able to write as well as read all datatypes
Each array is made into a line via the join command
Arguments: filename & reference to array of arrays

sub write_GFF {
    my ($filename, $gffRef) = @_;
    open GFF, ">$filename" or die $!;    # open evaluates FALSE if the file failed to open, and $! contains the error message
    foreach my $gff (@$gffRef) {
        print GFF join ("\t", @$gff), "\n";
    }
    close GFF or die $!;                 # close evaluates FALSE if there was an error with the file
}

GFF intersect detection


Let (name1,start1,end1) and (name2,start2,end2) be the co-ordinates of two segments
If they don't overlap, there are three possibilities:
name1 and name2 are different;
name1 = name2 but start1 > end2;
name1 = name2 but start2 > end1;

Checking every possible pair takes time proportional to N^2, where N is the number of GFF lines (how can this be improved?)
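
The three non-overlap cases translate directly into a test. A minimal sketch follows, with the subroutine name segments_overlap chosen for illustration; it returns 1 if the two segments overlap and 0 otherwise.

```perl
use strict;
use warnings;

# Return 1 if the two segments overlap, 0 otherwise
sub segments_overlap {
    my ($name1, $start1, $end1, $name2, $start2, $end2) = @_;
    return 0 if $name1 ne $name2;    # different sequences
    return 0 if $start1 > $end2;     # segment 1 lies entirely after segment 2
    return 0 if $start2 > $end1;     # segment 2 lies entirely after segment 1
    return 1;
}

print segments_overlap ("SEQ1", 163, 182, "SEQ1", 168, 177), "\n";
print segments_overlap ("SEQ1", 103, 105, "SEQ1", 172, 173), "\n";
print segments_overlap ("SEQ1", 103, 105, "SEQ2", 103, 105), "\n";
```

Segments that merely share an end residue (e.g. 103-172 and 172-173) count as overlapping, because co-ordinates are inclusive.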

Self-intersection of a GFF file


sub self_intersect_GFF {
    my @gff = @_;
    my @intersect;
    foreach $igff (@gff) {
        foreach $jgff (@gff) {
            if ($igff ne $jgff) {
                if ($$igff[0] eq $$jgff[0]) {
                    if (!($$igff[3] > $$jgff[4] || $$jgff[3] > $$igff[4])) {
                        push @intersect, $igff;
                        last;
                    }
                }
            }
        }
    }
    return @intersect;
}

Fields 0, 3 and 4 of the GFF line are the sequence name, start and end co-ordinates of the feature

Note: this code is slow. Vast improvements in speed can be gained if we sort the @gff array before checking for intersection.
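
One possible way to exploit sorting is sketched below; it is an illustration under stated assumptions, not the author's implementation. After sorting by sequence name and start, a feature overlaps an earlier feature exactly when its start is no greater than the largest end seen so far on that sequence, and it overlaps a later feature exactly when the next feature on the same sequence starts at or before its end. The subroutine name self_intersect_sorted is hypothetical, each feature is assumed to have end >= start, and the result comes back in sorted order rather than input order.

```perl
use strict;
use warnings;

sub self_intersect_sorted {
    # Sort by sequence name, then by start co-ordinate
    my @gff = sort { $a->[0] cmp $b->[0] or $a->[3] <=> $b->[3] } @_;
    my @intersect;
    my $maxEnd;    # largest end co-ordinate seen so far on the current sequence
    for (my $i = 0; $i < @gff; ++$i) {
        my ($name, $start, $end) = @{$gff[$i]}[0, 3, 4];
        # Reset the running maximum when a new sequence begins
        if ($i == 0 || $gff[$i-1]->[0] ne $name) { undef $maxEnd; }
        my $hitsEarlier = defined ($maxEnd) && $start <= $maxEnd;
        my $hitsLater = $i + 1 < @gff
            && $gff[$i+1]->[0] eq $name
            && $gff[$i+1]->[3] <= $end;
        if ($hitsEarlier || $hitsLater) { push @intersect, $gff[$i]; }
        if (!defined ($maxEnd) || $end > $maxEnd) { $maxEnd = $end; }
    }
    return @intersect;
}

# Made-up features: only fields 0, 3 and 4 matter here
my @demo = (["SEQ1", "prog", "type", 103, 105],
            ["SEQ1", "prog", "type", 172, 173],
            ["SEQ1", "prog", "type", 163, 182]);
foreach my $hit (self_intersect_sorted (@demo)) {
    print join ("\t", @$hit), "\n";
}
```

The sort dominates the running time, so the work grows roughly as N log N instead of N^2.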

Converting GFF to sequence


Puts together several previously-described subroutines
Namely: read_FASTA, read_GFF, revcomp, print_seq

($gffFile, $seqFile) = @ARGV;
@gff = read_GFF ($gffFile);
%seq = read_FASTA ($seqFile);
foreach $gffLine (@gff) {
    $seqName = $gffLine->[0];
    $seqStart = $gffLine->[3];
    $seqEnd = $gffLine->[4];
    $seqStrand = $gffLine->[6];
    $seqLen = $seqEnd + 1 - $seqStart;
    $subseq = substr ($seq{$seqName}, $seqStart-1, $seqLen);
    if ($seqStrand eq "-") {
        $subseq = revcomp ($subseq);
    }
    print_seq ("$seqName/$seqStart-$seqEnd/$seqStrand", $subseq);
}

DNA Microarrays

Normalizing microarray data


Often microarray data are normalized as a precursor to further analysis (e.g. clustering)
This can eliminate systematic bias; e.g.
if every level for a particular gene is elevated, this might signal a problem with the probe for that gene
if every level for a particular experiment is elevated, there might have been a problem with that experiment, or with the subsequent image analysis

Normalization is crude (it can eliminate real signal as well as noise), but common

Rescaling an array
For each element of the array: add a, then multiply by b

@array = (1, 3, 5, 7, 9);
print "Array before rescaling: @array\n";
rescale_array (\@array, -1, 2);             # array is passed by reference
print "Array after rescaling: @array\n";

sub rescale_array {
    my ($arrayRef, $a, $b) = @_;
    foreach my $x (@$arrayRef) {
        $x = ($x + $a) * $b;
    }
}

Output:
Array before rescaling: 1 3 5 7 9
Array after rescaling: 0 4 8 12 16

Microarray expression data


A simple format with tab-separated fields
First line contains experiment names
Subsequent lines contain:
gene name
expression levels for each experiment

*        EmbryoStage1  EmbryoStage2  EmbryoStage3  ...
Cyp12a5  104.556       102.441       55.643        ...
MRG15    4590.15       6691.11       9472.22       ...
Cop      33.12         56.3          66.21         ...
bor      5512.36       3315.12       1044.13       ...
Bx42     1045.1        632.7         200.11        ...
...      ...           ...           ...           ...

Messages: readFrom(file), writeTo(file), normalizeEachRow, normalizeEachColumn

Reading a file of expression data


sub read_expr {
    my ($filename) = @_;
    open EXPR, "<$filename";
    my $firstLine = <EXPR>;
    chomp $firstLine;
    my @experiment = split /\t/, $firstLine;
    shift @experiment;
    my %expr;
    while (my $line = <EXPR>) {
        chomp $line;
        my ($gene, @data) = split /\t/, $line;
        if (@data+0 != @experiment+0) {     # note use of scalar context to compare array sizes
            warn "Line has wrong number of fields\n";
        }
        $expr{$gene} = \@data;
    }
    close EXPR;
    return (\@experiment, \%expr);
}

The first return value is a reference to the array of experiment names
The second is a reference to a hash of arrays (hash key is gene name, array elements are expression data)

Normalizing by gene
A program to normalize expression data from a set of microarray experiments

($experiment, $expr) = read_expr ("expr.txt");
while (($geneName, $lineRef) = each %$expr) {
    normalize_array ($lineRef);             # normalizes by gene
}

sub normalize_array {
    my ($data) = @_;                        # NB $data is a reference to an array
    my ($mean, $sd) = mean_sd (@$data);
    @$data = map (($_ - $mean) / $sd, @$data);
}

Could also use the following: rescale_array($data,-$mean,1/$sd);

Normalizing by column
Remaps gene arrays to column arrays

($experiment, $expr) = read_expr ("expr.txt");
my @genes = sort keys %$expr;
for ($i = 0; $i < @$experiment; ++$i) {
    my @col;
    foreach $j (0..@genes-1) {              # puts column data in @col
        $col[$j] = $expr->{$genes[$j]}->[$i];
    }
    normalize_array(\@col);                 # normalizes (note use of reference)
    foreach $j (0..@genes-1) {              # puts @col back into %expr
        $expr->{$genes[$j]}->[$i] = $col[$j];
    }
}

Perl for Bioinformatics Section 6: Advanced topics

Sorting
It is often useful to be able to sort an array
e.g. smallest element first, largest last

Many sort algorithms exist


Bubblesort (swaps) Quicksort (pivots) Binary tree sort (inserts)

Typically, in older languages, you have to implement one of these yourself


although qsort is provided in C

This is changing...

Sorting string data


Perl provides the sort function to sort an array of strings into alphabetic order:

@nucleotides = ('g', 'c', 't', 'a');
@sorted_nucleotides = sort @nucleotides;
print "Nucleotides: @nucleotides\n";
print "Sorted: @sorted_nucleotides\n";

Output:
Nucleotides: g c t a
Sorted: a c g t

Sorting numeric data


To sort numeric data, we have to provide a sort function
This is a subroutine that compares two items, $a and $b
It must return a negative number (e.g. -1) if $a<$b, 0 if $a==$b and a positive number (e.g. +1) if $a>$b
Fortunately, Perl provides an operator that does just this. It is the spaceship operator $a <=> $b
The syntax is as follows:

@x = (5, 1, 16, 2, -1, 10);
@y = sort by_number @x;
print "y: @y\n";

sub by_number {
    return $a <=> $b;
}

The variables $a and $b get passed "automagically" into this subroutine. Yet another example of arbitrary Perl weirdness...

Output:
y: -1 1 2 5 10 16
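
The comparison can also be written inline as a block, which is standard Perl syntax and avoids naming a separate subroutine. The sketch below sorts the same numbers ascending, descending (by swapping $a and $b) and with the default string comparison, to show why the numeric sort function is needed. The variable names are chosen for illustration.

```perl
use strict;
use warnings;

my @x = (5, 1, 16, 2, -1, 10);

my @ascending  = sort { $a <=> $b } @x;    # numeric, smallest first
my @descending = sort { $b <=> $a } @x;    # numeric, largest first
my @as_strings = sort @x;                  # default: compared as strings

print "Ascending:  @ascending\n";
print "Descending: @descending\n";
print "As strings: @as_strings\n";
```

In the string sort, 10 and 16 come before 2 because the comparison goes character by character.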

Standard sort functions

$a <=> $b is the "standard" numeric sort
The "standard" alphabetic sort is $a cmp $b
The alphabetic sort is the one used by default:

$x = "Pears";
$y = "Apples";
$z = "Oranges";
print "$x cmp $y: ", $x cmp $y, "\n";
print "$x cmp $z: ", $x cmp $z, "\n";
print "$y cmp $z: ", $y cmp $z, "\n";
print "$x cmp $x: ", $x cmp $x, "\n";

Output:
Pears cmp Apples: 1
Pears cmp Oranges: 1
Apples cmp Oranges: -1
Pears cmp Pears: 0

Sorting a GFF file


We can "chain" multiple sort functions to sort by sequence name, then by startpoint, then by endpoint:

($infile, $outfile) = @ARGV;
@gff = read_GFF ($infile);
@gff = sort by_GFF_startpoint (@gff);       # this line does the actual sort
write_GFF ($outfile, \@gff);

sub by_GFF_startpoint {
    return ($$a[0] cmp $$b[0]               # "chaining" multiple sort comparisons
            or $$a[3] <=> $$b[3]
            or $$a[4] <=> $$b[4]);
}

Fields 0, 3 and 4 of the GFF line are the sequence name, start and end co-ordinates of the feature

This works because (X or Y or Z) = X (if X!=0) or Y (if X==0 and Y != 0) or Z (if X==Y==0)
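
The same chaining trick works on any array of arrays. As a small self-contained sketch, the gene/compartment table from Section 4 is sorted here by compartment and then, within a compartment, by gene name. The array @rows and the subroutine name by_comp_then_gene are assumptions made for this example.

```perl
use strict;
use warnings;

# Each row is [gene name, compartment]
my @rows = (["Cop",   "Golgi"],
            ["MRG15", "Nucleus"],
            ["Bx42",  "Nucleus"],
            ["bor",   "Cytoplasm"]);

# Compare compartments first; only if they are equal, compare gene names
sub by_comp_then_gene {
    return ($a->[1] cmp $b->[1] or $a->[0] cmp $b->[0]);
}

my @sorted = sort by_comp_then_gene @rows;
foreach my $row (@sorted) {
    print "$row->[1]\t$row->[0]\n";
}
```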

Packages
Perl allows you to organise your subroutines in packages, each with its own namespace

use PackageName;                # this line includes a file called "PackageName.pm" in your code
PackageName::doSomething();     # this invokes a subroutine called doSomething() in the package called "PackageName"

Perl looks for the packages in a list of directories specified by the array @INC

print "INC dirs: @INC\n";

Output:
INC dirs: Perl/lib Perl/site/lib .

The "." means the current working directory (from Perl 5.26 onwards it is no longer included in @INC by default)

Many packages available at http://www.cpan.org/

Object-oriented programming
Data structures are often associated with code
FASTA: read_FASTA, print_seq, revcomp, ...
GFF: read_GFF, write_GFF, ...
Expression data: read_expr, mean_sd, ...

Object-oriented programming makes this association explicit.
A type of data structure, with an associated set of subroutines, is called a class
The subroutines themselves are called methods
A particular instance of the class is an object

OOP concepts
Abstraction
represent the essentials, hide the details

Encapsulation
storing data and subroutines in a single unit hiding private data (sometimes all data, via accessors)

Inheritance
abstract base interfaces multiple derived classes

Polymorphism
different derived classes exhibit different behaviors in response to the same requests

OOP: Analogy

Messages (the words in the speech balloons, and also perhaps the coffee itself)
Overloading (Waiter's response to "A coffee", different response to "A black coffee")
Polymorphism (Waiter and Kitchen implement "A black coffee" differently)
Encapsulation (Customer doesn't need to know about Kitchen)
Inheritance (not exactly used here, except implicitly: all types of coffee can be drunk or spilled, all humans can speak basic English and hold cups of coffee, etc.)
Various OOP Design Patterns: the Waiter is an Adapter and/or a Bridge, the Kitchen is a Factory (and perhaps the Waiter is too), asking for coffee is a Factory Method, etc.

OOP: Advantages
Often more intuitive
Data has behavior

Modularity
Interfaces are well-defined Implementation details are hidden

Maintainability
Easier to debug, extend

Framework for code libraries


Graphics & GUIs BioPerl, BioJava

OOP: Jargon
Member, method
A variable/subroutine associated with a particular class

Overriding
When a derived class implements a method differently from its parent class

Constructor, destructor
Methods called when an object is created/destroyed

Accessor
A method that provides [partial] access to hidden data

Factory
An [abstract] object that creates other objects

Singleton
A class which is only ever instantiated once (i.e. there's only ever one object of this class)
C.f. static member variables, which occur once per class

Objects in Perl
An object in Perl is usually a reference to a hash
The method subroutines for an object are found in a class-specific package
Command bless $x, MyPackage associates the variable referenced by $x with package MyPackage

Syntax of method calls
e.g. $x->save(); this is equivalent to PackageName::save($x);
Typical constructor: PackageName->new();
@EXPORT and @EXPORT_OK arrays used to export method names to user's namespace

Many useful Perl objects available at CPAN
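
These pieces fit together as in the following sketch of a minimal class. The package name Feature, its fields (name, start, end) and its method span are all hypothetical, chosen to echo the GFF examples; only bless and the arrow method-call syntax are standard Perl.

```perl
use strict;
use warnings;

package Feature;    # hypothetical class name

# Constructor: builds a hash, blesses a reference to it, and returns the reference
sub new {
    my ($class, %args) = @_;
    my $self = { name => $args{name}, start => $args{start}, end => $args{end} };
    return bless $self, $class;
}

# Method: number of residues covered by the feature
sub span {
    my ($self) = @_;
    return $self->{end} + 1 - $self->{start};
}

package main;

my $f = Feature->new (name => "SEQ1", start => 103, end => 172);
print "Feature on ", $f->{name}, " spans ", $f->span(), " residues\n";
print "Same call, written out in full: ", Feature::span ($f), "\n";
```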

AUTOLOAD
When an undefined method is called on an object, the special method AUTOLOAD is called, if defined Special variable $AUTOLOAD contains function name Allows implementation of e.g. default accessors for hash elements
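
A sketch of default accessors built this way is given below. The package name Record and its fields are hypothetical. Any method call that Record does not define lands in AUTOLOAD, which strips the package prefix from $AUTOLOAD and returns the hash element of that name. The DESTROY check is needed because Perl also routes object destruction through AUTOLOAD when no DESTROY method exists.

```perl
use strict;
use warnings;

package Record;    # hypothetical class name

our $AUTOLOAD;     # set by Perl to the full name of the missing method

sub new {
    my ($class, %fields) = @_;
    my $self = { %fields };
    return bless $self, $class;
}

# Called for any method that Record does not define
sub AUTOLOAD {
    my ($self) = @_;
    my $method = $AUTOLOAD;
    $method =~ s/.*:://;               # strip the package name
    return if $method eq 'DESTROY';    # ignore object destruction
    die "No such field: $method" unless exists $self->{$method};
    return $self->{$method};
}

package main;

my $gene = Record->new (name => "Cop", compartment => "Golgi");
print $gene->name(), " is found in the ", $gene->compartment(), "\n";
```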

GD.pm
A graphics package by Lincoln Stein
use GD;

# create a new image
$im = new GD::Image(100,100);

# allocate some colors
$white = $im->colorAllocate(255,255,255);
$black = $im->colorAllocate(0,0,0);
$red = $im->colorAllocate(255,0,0);
$blue = $im->colorAllocate(0,0,255);

# make the background transparent
$im->transparent($white);

# Put a black frame around the picture
$im->rectangle(0,0,99,99,$black);

# Draw a blue oval
$im->arc(50,50,95,75,0,360,$blue);

# And fill it with red
$im->fill(50,50,$red);

# Convert the image to PNG and print it out
print $im->png;

CGI.pm
CGI (Common Gateway Interface)
Page-based web programming paradigm

CGI.pm (also by Lincoln Stein)


Perl CGI interface runs on a webserver allows you to write a program that runs behind a webpage

CGI (static, page-based) is gradually being supplemented by AJAX

BioPerl
A set of Open Source Bioinformatics packages
largely object-oriented Can be downloaded from bio.perl.org

Handles various different file formats Parses BLAST and other programs Basis for Ensembl
the human genome annotation project www.ensembl.org

Example: GenBank

Example: Bio::DB::GenBank
Interface to the GenBank database
use Bio::DB::GenBank;
$gb = new Bio::DB::GenBank;
$seq = $gb->get_Seq_by_id('MUSIGHBA1');         # Unique ID
# or ...
$seq = $gb->get_Seq_by_acc('J00522');           # Accession Number
$seq = $gb->get_Seq_by_version('J00522.1');     # Accession.version
$seq = $gb->get_Seq_by_gi('405830');            # GI Number

Saves having to rewrite same old parsers

Digest::MD5
MD5 is a one-way hash function e.g. gravatar.com uses MD5 to map (authenticated) email addresses to avatar icons

use Digest::MD5 qw(md5 md5_hex md5_base64);
my $baseURL = "http://www.gravatar.com/avatar/";
while (<>) {
    chomp;
    print $baseURL, md5_hex(lc($_)), "\n";
}

Other programming languages


Procedural languages

Interpreted/scripting languages
"Shell" languages (TCSH, BASH, CSH)
Python: cleaner, object-oriented
Ruby: even more object-oriented

Compiled languages
C: very basic, portable and fast
C++: more elaborate, object-oriented C
Java: stripped-down portable C++; "safer" & cleaner

Functional languages
More mathematical, cleaner; but less pragmatic
Lisp, Scheme
Lisp is the oldest. (Lots (of (parentheses)))
Prolog, ML, Haskell

Co-ordinate transformation
Motivation: map clones to chromosomes
[Figure: a chromosome segment labelled 17455 to 17855, drawn above a clone labelled 403 to 803]

Co-ordinate transformations (cont.)


What if a segment spans multiple clones?
