Unix Perl Tutorial
Unix and Perl Primer for Biologists by Keith Bradnam & Ian Korf is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License. Please send feedback, questions, money, or abuse to [email protected] or [email protected]. Copyright 2009, all rights reserved.
Introduction
Advances in high-throughput biology have transformed modern biology into an incredibly data-rich science. Biologists who never thought they needed computer programming skills are now finding that using an Excel spreadsheet is simply not enough. Learning to program a computer can be a daunting task, but it is also incredibly worthwhile. You will not only improve your research, you will also open your mind to new ways of thinking and have a lot of fun.

This course is designed for biologists who want to learn how to program but never got around to it. Programming, like language or math, comes more naturally to some than others. But we all learn to read, write, add, subtract, etc., and we can all learn to program. Programming, more than just about any other skill, comes in waves of understanding. You will get stuck for a while and a little frustrated, but then suddenly you will see how a new concept aggregates a lot of seemingly disconnected information. And then you will embrace the new way, and never imagine going back to the old way.

As you are learning, if you are getting confused and discouraged, slow down and ask questions. You can contact us either in person, by email, or (preferably) on the associated Unix and Perl for Biologists Google Group. The lessons build on each other, so do not skip ahead thinking you will return to the confusing concept at a later date.

Why Unix?

The Unix operating system has been around since 1969. Back then there was no such thing as a graphical user interface. You typed everything. It may seem archaic to use a keyboard to issue commands today, but it's much easier to automate keyboard tasks than mouse tasks. There are several variants of Unix (including Linux), though the differences do not matter much. Though you may not have noticed it, Apple has been using Unix as the underlying operating system on all of their computers since 2001.
Increasingly, the raw output of biological research exists as in silico data, usually in the form of large text files. Unix is particularly suited to working with such files and has several powerful (and flexible) commands that can process your data for you. The real strength of learning Unix is that most of these commands can be combined in an almost unlimited fashion. So if you can learn just five Unix commands, you will be able to do a lot more than just five things.

Why Perl?

Perl is one of the most popular Unix programming languages. It doesn't matter much which language you learn first because once you know how one works, it is much easier to learn others. Among languages, there is often a distinction between interpreted (e.g. Perl, Python, Ruby) and compiled (e.g. C, C++, Java) languages. People often call interpreted programs scripts. It is generally easier to learn programming in a scripting language because you don't have to worry as much about variable types and memory allocation. The downside is that interpreted programs often run much slower than compiled ones (100-fold is common). But let's not get lost in petty details. Scripts are programs, scripting is programming, and computers can solve problems quickly regardless of the language.

Typeset Conventions

All of the Unix and Perl code in these guides is written in constant-width font with line numbering. Here is an example with 3 lines.
1. for ($i = 0; $i < 10; $i++) {
2.     print $i, "\n";
3. }
Text you are meant to type into a terminal is indented in constant-width font without line numbering. Here is an example.
ls -lrh
Sometimes a paragraph will include a reference to a Unix command, or will instruct you to type something from within a Unix program. This text will be in underlined constant-width font. E.g. Type the pwd command again.
From time to time this documentation will contain web links to pages that will help you find out more about certain Unix commands and Perl functions. Such links will appear in a standard web link format and can be clicked to take you to the relevant web page. Important or critical points will be placed in text boxes like so:
About the authors

Keith Bradnam started out his academic career studying ecology. This involved lots of field trips and throwing quadrats around on windy hillsides. He was then lucky to be in the right place at the right time to do a Masters degree in Bioinformatics (at a time when nobody was very sure what bioinformatics was). From that point onwards he has spent most of his waking life sat at a keyboard (often staring into a Unix terminal). A PhD studying eukaryotic genome evolution followed; this was made easier by the fact that only one genome had been completed at the time he started (this soon changed). After a brief stint working on an Arabidopsis genome database, he moved to working on the excellent model organism database WormBase at the Wellcome Trust Sanger Institute. It was here that he first met Ian Korf and they bonded over a shared love of Macs, neatly written code, and English puddings. Ian then tried to run away and hide in California at the UC Davis Genome Center but Keith tracked him down and joined his lab. Apart from doing research, he also gets to look after all the computers in the lab and teach the occasional class or two. However, he would give it all up for the chance to be able to consistently beat Ian at foosball, but that seems unlikely to happen anytime soon. Keith still likes Macs and neatly written code, but now has a much harder job finding English puddings.

Ian Korf believes that you can tell what a person will do with their life by examining their passions as a teen. Although he had no idea what a 'sequence analysis algorithm' was at 16, a deep curiosity about biological mechanisms and an obsession with writing/playing computer games is only a few bits away. Ian's first experience with bioinformatics came as a post-doc at Washington University (St. Louis) where he was a member of the Human Genome Project. He then went across the pond to the Sanger Centre for another post-doc. There he met Keith Bradnam, and found someone who truly understood the role of communication and presentation in science. Ian was somehow able to persuade Keith to join his new lab in Davis, California, and this primer on Unix and Perl is but one of their hopefully useful contributions.
Preamble
    What computers can run Perl?
    What computers can run Unix?
    Do I need to run this course from a USB drive?
Unix Part 1
    Learning the essentials
    Introduction to Unix
    U1. The Terminal
    U2. Your first Unix command
    U3: The Unix tree
    U4: Finding out where you are
    U5: Getting from A to B
    U6: Root is the root of all evil
    U7: Up, up, and away
    U8: I'm absolutely sure that this is all relative
    U9: Time to go home
    U10: Making the ls command more useful
    U11: Man your battle stations!
    U12: Make directories, not war
    U13: Time to tidy up
    U14: The art of typing less to do more
    U15: U can touch this
    U16: Moving heaven and earth
    U17: Renaming files
    U18: Stay on target
    U19: Here, there, and everywhere
    U20: To slash or not to slash?
    U21: The most dangerous Unix command you will ever learn!
    U22: Go forth and multiply
    U23: Going deeper and deeper
    U24: When things go wrong
    U25: Less is more
    U26: Directory enquiries
    U27: Fire the editor
    U28: Hidden treasure
    U29: Sticking to the script
    U30: Keep to the $PATH
    U31: Ask for permission
    U32: The power of shell scripts
Unix Part 2
    How to Become a Unix power user
    U33: Match making
    U34: Your first ever Unix pipe
    U35: Heads and tails
    U36: Getting fancy with regular expressions
    U37: Counting with grep
    U38: Regular expressions in less
    U39: Let me transl(iter)ate that for you
    U40: That's what she sed
    U41: Word up
    U42: GFF and the art of redirection
    U43: Not just a pipe dream
    U44: The end of the line
    U45: This one goes to 11
    Summary
Perl
    Your programming environment
    Saving Perl scripts
    P1. Hello World
    P2. Scalar variables
    Variables summary
    P3. Safer programming: use strict
    P4. Math
    Operator Precedence
    P5. Conditional statements
    Numerical comparison operators in Perl
    Indentation and block structure
    Whitespace
    Other Conditional Constructs
    Numeric Precision and Conditionals
    P6. String operators
    String comparison operators in Perl
    Matching Operators
    Matching operators in Perl
    The transliteration operator
    Project 1: DNA composition
    Program Name
    Executable
    Usage Statement
    Goals of your program
    P7. List context
    P8. Safer programming: use warnings
    P9. Arrays
    Making arrays bigger and smaller
    Common Array Functions
    P10. From strings to arrays and back
    P11. Sorting
    P12. Loops
    The for Loop
    The foreach Loop
    The while Loop
    The do Loop
    Loop Control
    When to use each type of loop?
    Project 2: Descriptive statistics
    Count, Sum, and Mean
    Min, Max, and Median
    Variance
    Standard Deviation
    Project 3: Sequence shuffler
    Strategy 1
    Strategy 2
    P13. File I/O
    The default variable $_
    The open() Function
    Naming file handles
    P14. Hashes
    Keys and Values
    Adding, Removing, and Testing
    Hash names
    P15. Organizing with hashes
    P16. Counting codons with substr()
    P17. Regular expressions 101
    The full set of Perl regular expression characters
    P18. Extracting text
    More Info
    P19. Boolean logic
    Project 4: Codon usage of a GenBank file
    P20. Functions (subroutines)
    Why use subroutines?
    P21. Lexical variables and scope
    Loop Variables
    Safer programming: use strict
    P22. Sliding window algorithms
    P23. Function libraries
    Project 5: Useful functions
    P25. Options processing
    P26. References and complex data structures
    Multi-dimensional Arrays
    References
    Anonymous Data
    Records
    What next?
Troubleshooting guide
    Introduction
    How to troubleshoot
    Pre-Perl error messages
    Within-Perl error messages
    Other errors
    Table of common Perl error messages
Version history
Preamble
What computers can run Perl?
One of the main goals of this course is to learn Perl. As a programming language, Perl is platform agnostic. You can write (and run) Perl scripts on just about any computer. We will assume that >99% of the people who are reading this use either a Microsoft Windows PC, an Apple Mac, or one of the many Linux distributions that are available (Linux can be considered as a type of Unix, though this claim might offend the Linux purists reading this). A small proportion of you may be using some other type of dedicated Unix platform, such as Sun or SGI. For the Perl examples, none of this matters. All of the Perl scripts in this course should work on any machine that you can install Perl on (if an example doesn't work then please let us know!).
Unix Part 1
Learning the essentials

Introduction to Unix

These exercises will (hopefully) teach you to become comfortable when working in the environment of the Unix terminal. Unix contains many hundreds of commands but you will probably use just 10 or so to achieve most of what you want to do. You are probably used to working with programs like the Apple Finder or the Windows File Explorer to navigate around the hard drive of your computer. Some people are so used to using the mouse to move files, drag files to trash etc. that it can seem strange switching from this behavior to typing commands instead. Be patient, and try as much as possible to stay within the world of the Unix terminal. Please make sure you complete and understand each task before moving on to the next one.
You should now see something that looks like the following (the text that appears inside your terminal window will be slightly different):
Before we go any further, you should note that you can:
- make the text larger/smaller (hold down 'command' and either '+' or '-')
- resize the window (this will often be necessary)
- have multiple terminal windows on screen (see the 'Shell' menu)
- have multiple tabs open within each window (again see the 'Shell' menu)
There will be many situations where it will be useful to have multiple terminals open and it will be a matter of preference as to whether you want to have multiple windows, or one window with multiple tabs (there are keyboard shortcuts for switching between windows, or moving between tabs).
Library
There are four things that you should note here:

1) You will probably see different output to what is shown here; it depends on your computer. Don't worry about that for now.
2) The 'olson27-1:~ kbradnam$' text that you see is the Unix command prompt. It contains my user name (kbradnam), the name of the machine that I am working on (olson27-1), and the name of the current directory (~, more on that later). Note that the command prompt might not look the same on different Unix systems. In this case, the $ sign marks the end of the prompt.
3) The output of the ls command lists five things. In this case, they are all directories, but they could also be files. We'll learn how to tell them apart later on.
4) After the ls command finishes it produces a new command prompt, ready for you to type your next command.

The ls command is used to list the contents of any directory, not necessarily the one that you are currently in. Plug in your USB drive, and type the following:

olson27-1:~ kbradnam$ ls /Volumes/USB/Unix_and_Perl_course
Applications Code Data Documentation

On a Mac, plugged-in drives appear as subdirectories in the special 'Volumes' directory. The name of the USB flash drive is 'USB'. The above output shows a set of four directories that are all inside the 'Unix_and_Perl_course' directory. Note how the underscore character '_' is used to space out words in the directory name.
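If you do not have the USB drive to hand, you can still see ls listing a directory that you are not currently in. The following is only a sketch: it builds a throwaway copy of the directory layout in a temporary location (the `course` variable and the use of `mktemp` are assumptions made for this illustration, not part of the course material).

```shell
# Build a scratch directory that mimics the layout shown above.
# (Hypothetical location; your real course files live on the USB drive.)
course="$(mktemp -d)/Unix_and_Perl_course"
mkdir -p "$course/Applications" "$course/Code" "$course/Data" "$course/Documentation"

# List its contents without changing into it first.
listing="$(ls "$course")"
echo "$listing"
```

The point is simply that ls accepts the name of any directory as an argument, wherever you happen to be when you type it.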
When you log in to a Unix computer, you are typically placed into your home directory. In this example, after I log in, I am placed in a directory called 'clmuser' which itself is a subdirectory of another directory called 'users'. Conversely, 'users' is the parent directory of 'clmuser'. The first forward slash that appears in a list of directory names always refers to the top level directory of the file system (known as the root directory). The remaining forward slash (between 'users' and 'clmuser') delimits the various parts of the directory hierarchy. If you ever get lost in Unix, remember the pwd command. As you learn Unix you will frequently type commands that don't seem to work. Most of the time this will be because you are in the wrong directory, so it's a really good habit to get used to running the pwd command a lot.
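As a small illustration of how pwd reports where you are, here is a sketch that recreates a 'users/clmuser' hierarchy inside a temporary directory (the `scratch` variable is a made-up name for this example, and the path will be prefixed by wherever your system keeps temporary files).

```shell
# Recreate the users/clmuser hierarchy somewhere harmless.
scratch="$(mktemp -d)"
mkdir -p "$scratch/users/clmuser"

# Move into the subdirectory, then ask Unix where we are.
cd "$scratch/users/clmuser"
where="$(pwd)"
echo "$where"
```

The output ends in /users/clmuser, with each forward slash separating one level of the hierarchy from the next.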
The first command reads as 'change directory to the Unix_and_Perl_course directory that is inside a directory called USB, which itself is inside the Volumes directory that is at the root level of the computer'. Did you notice that the command prompt changed after you ran the cd command? The '~' sign should have changed to 'Unix_and_Perl_course'. This is a useful feature of the command prompt. By default it reminds you where you are as you move through different directories on the computer.
NB. For the sake of clarity, I will now simplify the command prompt in all of the following examples.
Note that the second and third commands do not include a forward slash. When you specify a directory that starts with a forward slash, you are referring to a directory that should exist one level below the root level of the computer. What happens if you try the following two commands? The first command should produce an error message.

$ cd Volumes
$ cd /Volumes

The error is because without including a leading slash, Unix is trying to change to a 'Volumes' directory below your current level in the file hierarchy (/Volumes/USB/Unix_and_Perl_course), and there is no directory called Volumes at this location.
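The same behavior can be reproduced on any Unix machine, even one without a /Volumes directory. This sketch uses a temporary directory and the root directory itself (`scratch`, `relative_result`, and `absolute_result` are hypothetical names chosen for the example).

```shell
scratch="$(mktemp -d)"
mkdir -p "$scratch/Unix_and_Perl_course"
cd "$scratch/Unix_and_Perl_course"

# No leading slash: Unix looks for 'Volumes' inside the current directory.
# Nothing by that name exists here, so the cd fails.
if cd Volumes 2>/dev/null; then
    relative_result="changed directory"
else
    relative_result="no such directory"
fi

# Leading slash: the search always starts at the root of the file system.
cd / && absolute_result="$(pwd)"

echo "$relative_result"
echo "$absolute_result"
```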
What if you wanted to navigate up two levels in the file system in one go? It's very simple, just use two sets of the .. operator, separated by a forward slash:

$ cd /Volumes/USB/Unix_and_Perl_course
$ pwd
/Volumes/USB/Unix_and_Perl_course
$ cd ../..
$ pwd
/Volumes
or...
$ cd /Volumes/USB/Unix_and_Perl_course/Data
They both achieve the same thing, but the 2nd example requires that you know about the full path from the root level of the computer to your directory of interest (the 'path' is an important concept in Unix). Sometimes it is quicker to change directories using the relative path, and other times it will be quicker to use the absolute path.
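To convince yourself that the two routes really do end up in the same place, you could try something like the following sketch, which builds a stand-in hierarchy in a temporary directory (`root`, `via_relative`, and `via_absolute` are names invented for this example).

```shell
root="$(mktemp -d)"
mkdir -p "$root/Unix_and_Perl_course/Data"

# Route 1: relative path, which only works from the right starting point.
cd "$root/Unix_and_Perl_course"
cd Data
via_relative="$(pwd)"

# Route 2: absolute path, which works from anywhere (here, from the root directory).
cd /
cd "$root/Unix_and_Perl_course/Data"
via_absolute="$(pwd)"

echo "$via_relative"
echo "$via_absolute"
```

Both echo commands print the same directory. The relative route needed less typing, but only because we were already one level above 'Data'.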
Hopefully, you should find that cd and cd ~ do the same thing, i.e. they take you back to your home directory (from wherever you were). Also notice how you can specify the single forward slash to refer to the root directory of the computer. When working with Unix you will frequently want to jump straight back to your home directory, and typing cd is a very quick way to get there.
The ls command (like most Unix commands) has a set of options that can be added to the command to change the results. Command-line options in Unix are specified by using a dash ('-') after the command name followed by various letters, numbers, or words. If you add the letter 'l' to the ls command it will give you a 'longer' output compared to the default:

$ ls -l /Volumes/USB/Unix_and_Perl_course
total 192
drwxrwxrwx 1 keith staff 16384 Oct 3 09:03
drwxrwxrwx 1 keith staff 16384 Oct 3 11:11
drwxrwxrwx 1 keith staff 16384 Oct 3 11:12
drwxrwxrwx 1 keith staff 16384 Oct 3 11:34

For each file or directory we now see more information (including file ownership and modification times). The 'd' at the start of each line indicates that these are directories.

Task U10.1: There are many, many different options for the ls command. Try out the following (against any directory of your choice) to see how the output changes.

ls -l
ls -R
ls -l -t -r
ls -lh

Note that the last example combines multiple options but only uses one dash. This is a very common way of specifying multiple command-line options. You may be wondering what some of these options are doing. It's time to learn about Unix documentation...
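If you want to check that separate options and combined options really are equivalent, a quick comparison like this sketch will do it (the `demo` directory and the file names are placeholders invented for the example).

```shell
demo="$(mktemp -d)"
touch "$demo/first" "$demo/second"

# The same three options, written separately and then behind a single dash.
separate="$(ls -l -t -r "$demo")"
combined="$(ls -ltr "$demo")"

if [ "$separate" = "$combined" ]; then
    echo "same output"
else
    echo "different output"
fi
```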
When you are using the man command, press 'space' to scroll down a page, 'b' to go back a page, or 'q' to quit. You can also use the up and down arrows to scroll a line at a time. The man command is actually using another Unix program, a text viewer called less, which we'll come to later on. Some Unix commands have very long manual pages, which might seem very confusing. It is typical though to always list the command line options early on in the documentation, so you shouldn't have to read too much in order to find out what a command-line option is doing.
Work
In the last example we created the two temp directories in two separate steps. If we had used the -p option of the mkdir command we could have done this in one step. E.g.
$ mkdir -p Temp1/Temp2
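To see why the -p option matters, compare what happens with and without it. This is just a sketch run inside a temporary directory (`scratch` and `plain_result` are hypothetical names) so that nothing on your USB drive is touched.

```shell
scratch="$(mktemp -d)"
cd "$scratch"

# Without -p, mkdir refuses: the parent directory Temp1 does not exist yet.
if mkdir Temp1/Temp2 2>/dev/null; then
    plain_result="created"
else
    plain_result="failed"
fi
echo "mkdir without -p: $plain_result"

# With -p, any missing parent directories are created along the way.
mkdir -p Temp1/Temp2
ls Temp1
```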
Task U12.1: Practice creating some directories and navigating between them using the cd command. Try changing directories using both the absolute as well as the relative path (see section U8).
Task U13.1: Remove the remaining empty Temp directories that you have created
earth.txt
heaven.txt
Data Documentation
Temp
For the mv command, we always have to specify a source file (or directory) that we want to move, and then specify a target location. If we had wanted to we could have moved both files in one go by typing any of the following commands:

$ mv *.txt Temp/
$ mv *t Temp/
$ mv *ea* Temp/

The asterisk (*) acts as a wild-card character, essentially meaning 'match anything'. The second example works because there are no other files or directories in the directory that end with the letter 't' (if there were, they would be moved too). Likewise, the third example works because only those two files contain the letters 'ea' in their names. Using wild-card characters can save you a lot of typing.

Task U16.1: Use touch to create three files called 'fat', 'fit', and 'feet' inside the Temp directory. Then type either 'ls f?t' or 'ls f*t' and see what happens. The ? character is also a wild-card but with a slightly different meaning. Try typing 'ls f??t' as well.
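If you would like a second set of names to experiment with, here is a sketch using three made-up files ('cat', 'cot', and 'coat') in a temporary directory. The variable names are only there so that the results can be printed side by side.

```shell
scratch="$(mktemp -d)"
cd "$scratch"
touch cat cot coat

# ? stands for exactly one character; * stands for any number of characters.
one_char="$(echo c?t)"
any_chars="$(echo c*t)"
two_chars="$(echo c??t)"

echo "c?t  matches: $one_char"
echo "c*t  matches: $any_chars"
echo "c??t matches: $two_chars"
```

Here echo is used instead of ls to make it obvious that the shell itself expands the wild-cards into file names before the command ever runs.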
Data Documentation
Temp
rags
riches
In this example we create a new file ('rags') and move it to a new location and in the process change the name (to 'riches'). So mv can rename a file as well as move it. The logical extension of this is using mv to rename a file without moving it (you have to use mv to do this as Unix does not have a separate 'rename' command):

$ mv Temp/riches Temp/rags
$ ls Temp/
earth.txt heaven.txt rags
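Both uses of mv can be tried safely in a scratch area. The sketch below assumes nothing beyond a temporary directory (`scratch` and `contents` are names invented for the example).

```shell
scratch="$(mktemp -d)"
cd "$scratch"
mkdir Temp
touch rags

# Move the file into Temp and change its name in the same step.
mv rags Temp/riches

# Rename it again without moving it anywhere.
mv Temp/riches Temp/rags

contents="$(ls Temp)"
echo "$contents"
```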
Data Documentation
Temp
Temp2
heaven.txt rags
This step moves the Temp2 directory inside the Temp directory.

Task U18.1: Create another Temp directory (Temp3) and then change directory to your home directory (/users/clmuser). Without changing directory, move the Temp3 directory to inside the /Volumes/USB/Temp directory.
Misc
Unix_test_files
or...
$ ls Data/
Arabidopsis GenBank Misc Unix_test_files

In the first example, we change directories just to run the ls command, and then we change directories back to where we were again. The second example shows how we could have just stayed where we were.

The two examples are not quite identical, but they produce identical output. So does the trailing slash character in the second example matter? Well, not really. In both cases we have a directory named 'Documentation' and it is optional as to whether you include the trailing slash. When you tab complete any Unix directory name, you will find that a trailing slash character is automatically added for you. This becomes useful when that directory contains subdirectories which you also want to tab complete. I.e. imagine if you had to type the following (to access a buried directory 'ggg') and tab completion didn't add the trailing slash characters. You'd have to type the seven slashes yourself.

$ cd aaa/bbb/ccc/ddd/eee/fff/ggg/
U21: The most dangerous Unix command you will ever learn!
You've seen how to remove a directory with the rmdir command, but rmdir won't remove directories if they contain any files. So how can we remove the files we have created (in /Volumes/USB/Unix_and_Perl_course/Temp)? In order to do this, we will have to use the rm (remove) command.

Please read the next section VERY carefully. Misuse of the rm command can lead to needless death & destruction.

Potentially, rm is a very dangerous command; if you delete something with rm, you will not get it back! It does not go into the trash or recycle bin; it is permanently removed. It is possible to delete everything in your home directory (all directories and subdirectories) with rm, which is why it is such a dangerous command. Let me repeat that last part again. It is possible to delete EVERY file you have ever created with the rm command. Are you scared yet? You should be. Luckily there is a way of making rm a little bit safer. We can use it with the -i command-line option which will ask for confirmation before deleting anything:

$ pwd
/Volumes/USB/Unix_and_Perl_course/Temp
$ ls
Temp2 Temp3 earth.txt heaven.txt rags
$ rm -i earth.txt
remove earth.txt? y
$ rm -i heaven.txt
remove heaven.txt? y

We could have simplified this step by using a wild-card (e.g. rm -i *.txt).

Task U21.1: Remove the last file in the Temp directory ('rags') and then remove the two empty directories (Temp2 & Temp3).
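The difference that your answer makes can be seen in the following sketch. At the terminal you would type 'y' or 'n' yourself; here the answer is fed to rm with echo and a pipe (something covered later in the course) purely so that the example can run unattended. The file names and the `remaining` variable are made up for the illustration.

```shell
scratch="$(mktemp -d)"
cd "$scratch"
touch keep.txt discard.txt

# Answering 'n' leaves the file alone...
echo n | rm -i keep.txt 2>/dev/null

# ...whereas answering 'y' removes it for good.
echo y | rm -i discard.txt 2>/dev/null

remaining="$(ls)"
echo "$remaining"
```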
What if we wanted to copy files from a different directory to our current directory? Let's put a file in our home directory (specified by '~', remember) and copy it to the USB drive:

$ touch ~/file3
$ ls
file1 file2
$ cp ~/file3 .
$ ls
file1 file2 file3

This last step introduces another new concept. In Unix, the current directory can be represented by a '.' (dot) character. You will mostly use this only for copying files to the current directory that you are in. But just to make a quick point, compare the following:

$ ls
$ ls .
$ ls ./

In this case, using the dot is somewhat pointless because ls will already list the contents of the current directory by default. Also note again how the trailing slash is optional. Let's try the opposite situation and copy these files back to the home directory (even though one of them is already there). The default behavior of copy is to overwrite (without warning) files that have the same name, so be careful.
$ cp file* ~/
Based on what we have already covered, do you think the trailing slash in ~/ is necessary?
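To see the silent overwriting for yourself without risking any real files, you could run something like this sketch in a temporary directory. It uses echo with '>' to put some text into each file (redirection is explained properly in Part 2); the directory name 'destination' and the `result` variable are invented for the example.

```shell
scratch="$(mktemp -d)"
cd "$scratch"
mkdir destination

# Two different files that happen to share the name file1.
echo "old contents" > destination/file1
echo "new contents" > file1

# cp replaces the existing copy without asking.
cp file1 destination/

result="$(cat destination/file1)"
echo "$result"
```

The old contents are gone, and cp gave no warning that anything was being replaced.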
Task U23.1: The -R option means 'copy recursively'; many other Unix commands also have a similar option. See what happens if you don't include the -R option. We've finished with all of these temporary files now. Make sure you remove the Temp directory and its contents (remember to always use rm -i).
In both cases, I made a typo when specifying the name of the directories. With the ls command, we get a fairly useful error message. With the cp command we get a more cryptic message that reveals the correct usage statement for this command. In general, if a command fails, check your current directory (pwd) and check that all the files or directories that you mention actually exist (and are in the right place). Many errors occur because people are not in the right directory!
When you are using less, you can bring up a page of help commands by pressing h, scroll forward a page by pressing 'space', or go forward or backwards one line at a time by pressing j or k. To exit less, press q (for quit). The less program also does about a million other useful things (including text searching).
file1 file2
Hopefully, you'll agree that the second example makes things a little clearer. You can also do things like always capitalizing directory names (like I have done) but ideally I would suggest that you always use ls -p. If this sounds a bit of a pain, then it is. Ideally you want to be able to make 'ls -p' the default behavior for ls. Luckily, there is a way of doing this by using Unix aliases. It's very easy to create an alias:

$ alias ls='ls -p'
$ ls
Applications/ Data/ Code/ Documentation/ file1 file2
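For comparison, this sketch shows the plain and the -p output next to each other for a small made-up directory (the directory contents and the variable names `plain` and `marked` are assumptions for this example only).

```shell
scratch="$(mktemp -d)"
cd "$scratch"
mkdir Code Data
touch file1

plain="$(ls)"        # directories and files look alike
marked="$(ls -p)"    # directories gain a trailing slash

echo "ls:   " $plain
echo "ls -p:" $marked
```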
If you have trouble remembering what some of these very short Unix commands do, then aliases allow you to use human-readable alternatives. I.e. you could make a 'copy' alias for the cp command or even make 'list_files_sorted_by_date' perform the ls -lt command. Note that aliases do not replace the original command. It can be dangerous to use the name of an existing command as an alias for a different command. I.e. you could make an rm alias that moves files to a 'trash' directory by using the mv command. This might work for you, but what if you start working on someone else's machine that doesn't have that alias? Or what if someone else starts working on your machine?

Task U26.1: Create an alias such that typing rm will always invoke rm -i. Try running the alias command on its own to see what happens. Now open a new terminal window (or a new tab) and try running your ls alias. What happens?
The bottom of the nano window shows you a list of simple commands which are all accessible by typing 'Control' plus a letter. E.g. Control + X exits the program.

Task U27.1: Type the following text in the editor and then save it (Control + O). Nano will ask if you want to 'save the modified buffer' and then ask if you want to keep the same name. Then exit nano (Control + X) and use less to confirm that the profile file contains the text you added.

# some useful command line short-cuts
alias ls='ls -p'
alias rm='rm -i'

Now you have successfully created a configuration file (called 'profile') which contains two aliases. The first line that starts with a hash (#) is a comment; these are just notes that you can add to explain what the other lines are doing. But how do you get Unix to recognize the contents of this file? The source command tells Unix to read the contents of a file and treat it as a series of Unix commands (but it will ignore any comments).

Task U27.2: Open a new terminal window or tab (to ensure that any aliases will not work) and then type the following (make sure you first change to the correct directory):
$ source profile
Now try the ls command to see if the output looks different. Next, use touch to make a new file and then try deleting it with the rm command. Are the aliases working?
Remember to type: 'source /Volumes/USB/Unix_and_Perl_course/.profile' every time you use a new terminal window
When you have done that, simply type hello.sh and see what happens. If you have previously run source .profile then you should be able to run 'hello.sh' from any directory that you navigate to. If it worked, then it should have printed 'Hello world'. This very simple script uses the Unix command echo which just prints output to the screen. Also note the comment that precedes the echo command; it is a good habit to add explanatory comments.

Task U29.2: Try moving the script outside of the Code directory (maybe move it 'up' one level) and then cd to that directory. Now try running the script again. You should find that it doesn't work anymore. Now try running ./hello.sh (that's a dot + slash at the beginning). It should work again.
When you try running any program in Unix, your computer will look in a set of predetermined places to see if a program by that name lives there. All Unix commands are just files that live in directories somewhere on your computer. Unix uses something called $PATH (which is an environment variable) to store a list of places to look for programs to run. In our .profile file we have just told Unix to also look in your Code directory. If we hadn't added the Code directory to the $PATH, then we would have to run the program by first typing ./ (dot slash). Remember that the dot means the current directory. Think of it as a way of forcing Unix to run a program (including Perl scripts).
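The effect of $PATH can be watched directly. The sketch below is not the course's own hello.sh or .profile: it writes a stand-in script into a temporary 'Code' directory (using a 'here document', a shell feature not covered in this course) and then adds that directory to $PATH by hand. The `scratch`, `before`, and `after` names are invented for the example, and it assumes that no other program called hello.sh is already on your $PATH.

```shell
scratch="$(mktemp -d)"
mkdir "$scratch/Code"

# A stand-in for the hello.sh script (hypothetical contents).
cat > "$scratch/Code/hello.sh" <<'EOF'
#!/bin/sh
# print a friendly greeting
echo "Hello world"
EOF
chmod u+x "$scratch/Code/hello.sh"

cd /

# The Code directory is not on $PATH yet, so the bare name is not found.
if hello.sh 2>/dev/null; then
    before="found"
else
    before="not found"
fi

# Append the Code directory to the list of places that Unix searches.
PATH="$PATH:$scratch/Code"
after="$(hello.sh)"

echo "before: $before"
echo "after:  $after"
```

Note that the change to $PATH made this way only lasts for the current terminal session, which is why the course puts it in a file that you source.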
This would use the chmod command to add executable permissions (+x) to the file called hello.sh (the 'u' means add this permission to just you, the user). Without it, your script won't run. Except that it did. One of the oddities of using the USB drive for this course is that files copied to a USB drive have all permissions turned on by default. Just remember that you will normally need to run chmod on any script that you create. It's probably a good habit to get into now. The chmod command can also modify read and write permissions for files, and change any of the three sets of permissions (read, write, execute) at the level of 'user', 'group', and 'other'. You probably won't need to know any more about the chmod command other than you need to use it to make scripts executable.
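On a normal (non-USB) file system you can watch the permission change happen. This sketch creates a one-line stand-in script in a temporary directory and uses the shell's `[ -x file ]` test, which is not covered in this course, to ask whether the file is executable. The variable names are made up for the illustration.

```shell
scratch="$(mktemp -d)"
cd "$scratch"
echo 'echo "Hello world"' > hello.sh

# Make sure we start with no execute permission for anybody.
chmod a-x hello.sh
if [ -x hello.sh ]; then before="executable"; else before="not executable"; fi

# Add execute permission for the user (u) only.
chmod u+x hello.sh
if [ -x hello.sh ]; then after="executable"; else after="not executable"; fi

echo "before chmod u+x: $before"
echo "after chmod u+x:  $after"
ls -l hello.sh
```

In the ls -l output, the 'x' in the first group of three permission letters is the one that chmod u+x switched on.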
Make sure that this script is saved in the Code directory. Now return to the Unix_test_files directory and run this script. It should place the relevant files in the correct directories. This is a relatively simple use of shell scripting. As you can see the script just contains regular Unix commands that you might type at the command prompt. But if you had to do this type of file sorting every day, and had many different types of file, then it would save you a lot of time. Did you notice the #!/bin/bash line in this script? There are several different types of shell script in Unix, and this line makes it clearer that a) this is actually a file that can be treated as a program and b) it will be a bash script (bash is a type of Unix shell). As a general rule, all types of scriptable programming languages should have a similar line as the first line in the program.

Task U32.2: Here is another script. Copy this information into a file called change_file_extension.sh and place that file in the Code directory.

#!/bin/bash
# $1 is the old file extension, $2 is the new one
for filename in *."$1"
do
    # quotes keep file names that contain spaces in one piece
    mv "$filename" "${filename%$1}$2"
done

Now go to the Data/Unix_test_files/Text directory. If you have run the exercise from Task U32.1 then your text directory should now contain three files. Run the following command:
$ change_file_extension.sh txt text
Now run the ls command to see what has happened to the files in the directory. You should see that all the files that ended with txt now end with text. Try using this script to change the file extensions of other files. It's not essential that you understand exactly how this script works at the moment (things will become clearer as you learn Perl), but you should at least see how a relatively simple Unix shell script can be potentially very useful.
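If you are curious about the ${filename%$1} part of the script, the sketch below isolates it. The variable names old_ext and new_ext and the file name notes.txt are made up for illustration; they stand in for the script's $1, $2 and one file matched by the loop.

```shell
# Hypothetical values standing in for the script's arguments $1 and $2.
old_ext="txt"
new_ext="text"
filename="notes.txt"

# ${variable%pattern} removes the shortest match of pattern from the END
# of the variable's value, so this strips the old extension.
stem="${filename%$old_ext}"

# Gluing the new extension onto the stem gives the new file name.
new_name="${stem}${new_ext}"

echo "$filename -> $new_name"    # prints: notes.txt -> notes.text
```

Nothing is renamed here; the script only passes these two names to mv.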
End of part 1.
You can now continue to learn a series of much more powerful Unix commands, or you can switch to learning Perl. The choice is yours!
Unix Part 2
How to Become a Unix power user
The commands that you have learnt so far are essential for doing any work in Unix but they don't really let you do anything that is very useful. The following sections will introduce a few new commands that will start to show you how powerful Unix is.
...
This will produce lots of output which will flood past your screen. If you ever want to stop a program running in Unix, you can type Control+C (this sends an interrupt signal which should stop most Unix programs). The grep command has many different command-line options (type man grep to see them all), and one common option is to get grep to show lines that don't match your input pattern. You can do this with the -v option, and in this example we are seeing just the sequence part of the FASTA file.
$ grep -v ">" intron_IME_data.fasta
GTATACACATCTCTCTACTTTCATATTTTGCATCTCTAACGAAATCGGATTCCGTCGTTG
TGAAATTGAGTTTTCGGATTCAGTGTTGTCGAGATTCTATATCTGATTCAGTGATCTAAT
GATTCTGATTGAAAATCTTCGCTATTGTACAG
GTTAGTTTTCAATGTTGCTGCTTCTGATTGTTGAAAGTGTTCATACATTTGTGAATTTAG
TTGATAAAATCTGAACTCTGCATGATCAAAGTTACTTCTTTACTTAGTTTGACAGGGACT
TTTTTTGTGAATGTGGTTGAGTAGAATTTAGGGCTTTGGATTAAATGTGACAAGATTTTG
...
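A small self-contained sketch of the same idea, assuming a made-up two-record file called tiny.fasta rather than the course data: grep ">" keeps the header lines, and grep -v ">" keeps everything else.

```shell
# Build a tiny FASTA file in a temporary directory (not part of the course data).
workdir=$(mktemp -d)
printf '>seq1\nACGT\n>seq2\nGGCC\n' > "$workdir/tiny.fasta"

# Lines that contain '>' are the FASTA headers.
headers=$(grep ">" "$workdir/tiny.fasta")

# -v inverts the match: keep only the lines that do NOT contain '>'.
sequences=$(grep -v ">" "$workdir/tiny.fasta")

echo "Headers:"
echo "$headers"
echo "Sequences:"
echo "$sequences"
```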
Notice that you still have control of your output as you are now in the less program. If you press the forward slash (/) key in less, you can then specify a search pattern. Type ATGTGA after the slash and press enter. The less program will highlight the location of these matches on each line. Note that grep matches patterns on a per-line basis. So if one line ended ATG and the next line started TGA, then grep would not find it.
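The per-line behavior is easy to confirm with a sketch like the one below. It assumes a made-up two-line file in which ATGTGA is split across the line break, and it borrows tr -d (delete characters) to join the lines; that workaround is just one possible approach, not something the course prescribes.

```shell
workdir=$(mktemp -d)

# The pattern ATGTGA straddles the line break: ...ATG / TGA...
printf 'CCCATG\nTGACCC\n' > "$workdir/split.txt"

# grep looks at one line at a time, so it reports zero matching lines.
# ('|| true' is there because grep exits with an error status when nothing matches.)
per_line=$(grep -c "ATGTGA" "$workdir/split.txt" || true)

# Deleting the newlines first joins the sequence, and now the pattern is found.
joined=$(tr -d '\n' < "$workdir/split.txt" | grep -c "ATGTGA")

echo "matching lines in the original file: $per_line"
echo "matching lines after joining:        $joined"
```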
Any time you run a Unix program or command that outputs a lot of text to the screen, you can instead pipe that output into the less program.
The * character acts as a wildcard meaning 'search all files in the current directory' and the head command restricts the total amount of output to 10 lines. Notice that the output also includes the name of the file containing the matching pattern. In this case, the grep command finds the ACGTC pattern in four protein sequences and several lines of the chromosome 1 DNA sequence (we don't know how many exactly because the head command is only giving us ten lines of output).
You'll learn more about regular expressions when you learn Perl. The '^' character is a special character that tells grep to only match a pattern if it occurs at the start of a line. Similarly, the '$' tells grep to match patterns that occur at the end of the line. Task U36.1: The '.' and '*' characters are also special characters that form part of the regular expression. Try to understand how the following patterns all differ. Try using each of these patterns with grep against any one of the sequence files. Can you predict which of the five patterns will generate the most matches?
ACGT
AC.GT
AC*GT
AC.*GT
The asterisk in a regular expression is similar to, but NOT the same as, the other asterisks that we have seen so far. An asterisk in a regular expression means: match zero or more of the preceding character or pattern.
Try searching for the following patterns to ensure you understand what . and * are doing:
A...T
AG*T
A*C*G*T*
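One way to check your understanding of . and * is to run the patterns against a file small enough to reason about by eye. The sketch below assumes a made-up five-line file called toy.txt and counts the matching lines for each pattern with grep -c.

```shell
workdir=$(mktemp -d)

# Five short made-up sequences, one per line.
printf 'ACGT\nACTGT\nAGT\nACCGT\nACTTTGT\n' > "$workdir/toy.txt"

n_plain=$(grep -c 'ACGT' "$workdir/toy.txt")      # literal text only
n_dot=$(grep -c 'AC.GT' "$workdir/toy.txt")       # exactly one character between AC and GT
n_star=$(grep -c 'AC*GT' "$workdir/toy.txt")      # A, then zero or more C, then GT
n_dotstar=$(grep -c 'AC.*GT' "$workdir/toy.txt")  # AC, then anything (or nothing), then GT

echo "ACGT:   $n_plain line(s)"
echo "AC.GT:  $n_dot line(s)"
echo "AC*GT:  $n_star line(s)"
echo "AC.*GT: $n_dotstar line(s)"
```

With this file the counts come out as 1, 2, 3 and 4. Note that AGT matches AC*GT (zero C characters) but none of the other three patterns.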
$ grep -c i2 intron_IME_data.fasta
9785
Task U37.1: Count how many times each pattern from Task U31.1 occurs in all of the sequence files (specifying *.fasta will allow you to specify all sequence files).
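Keep in mind that grep -c counts matching lines, not individual matches. The sketch below, using a made-up three-line file, contrasts it with the -o option (print each match on its own line), which GNU and BSD grep both support although it is not part of the POSIX standard.

```shell
workdir=$(mktemp -d)
printf 'ATATAT\nGGGG\nCCAT\n' > "$workdir/counts.txt"

# -c counts the LINES that contain at least one match.
lines_with_at=$(grep -c 'AT' "$workdir/counts.txt")

# -o prints every match on its own line, so counting those lines with wc
# gives the number of (non-overlapping) matches.
# tr -d ' ' removes the padding that some versions of wc add.
total_at=$(grep -o 'AT' "$workdir/counts.txt" | wc -l | tr -d ' ')

echo "lines containing AT: $lines_with_at"
echo "occurrences of AT:   $total_at"
```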
The 's' part of the sed command puts sed in 'substitute' mode, where you specify one pattern (between the first two forward slashes) to be replaced by another pattern (specified between the second set of forward slashes). Note that this doesn't actually change the contents of the file; it just changes the screen output from the previous command in the pipe. We will learn later on how to send the output from a command into a new file.
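As a minimal sketch of substitute mode, assuming a made-up one-record file called demo.fasta and an arbitrary replacement string: the header is rewritten in the piped output while the file on disk stays exactly as it was.

```shell
workdir=$(mktemp -d)
printf '>seq1_i1\nACGT\n' > "$workdir/demo.fasta"

# s/old/new/ replaces the first match of 'old' on each line with 'new'.
changed=$(grep ">" "$workdir/demo.fasta" | sed 's/i1/intron_1/')

# The file itself is untouched: it still has one line containing 'i1'.
still_original=$(grep -c "i1" "$workdir/demo.fasta")

echo "piped output:  $changed"
echo "lines in the file that still contain i1: $still_original"
```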
U41: Word up
For this section we want to work with a different type of file. It is sometimes good to get a feeling for how large a file is before you start running lots of commands against it. The ls -l command will tell you how big a file is, but for many purposes it is often more desirable to know how many 'lines' it has. That is because many Unix commands like grep and sed work on a line-by-line basis. Fortunately, there is a simple Unix command called wc (word count) that does this:
$ cd Data/Arabidopsis/
$ wc At_genes.gff
531497 4783473 39322356 At_genes.gff
The three numbers in the output above count the number of lines, words and bytes in the specified file(s). If we had run wc -l, the 'l' option would have shown us just the line count.
This step introduces a new concept. Up till now we have sent the output of any command to the screen (this is the default behavior of Unix commands), or through a pipe to another program. Sometimes you just want to redirect the output into an actual file, and that is what the '>' symbol is doing. It acts as one of three redirection operators in Unix.
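The following sketch combines redirection with wc. It assumes a made-up file called tiny.fasta and an output file name (headers.txt) chosen purely for illustration.

```shell
workdir=$(mktemp -d)
printf '>seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > "$workdir/tiny.fasta"

# '>' sends the output of grep into a file instead of to the screen.
# Be careful: if headers.txt already existed it would be overwritten.
grep ">" "$workdir/tiny.fasta" > "$workdir/headers.txt"

# wc -l counts lines. Reading the file through '<' keeps the file name out
# of the output, and tr -d ' ' removes padding that some versions of wc add.
header_count=$(wc -l < "$workdir/headers.txt" | tr -d ' ')

echo "headers.txt contains $header_count lines"
```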
As already mentioned, the GFF file that we are working with is a standard file format in bioinformatics. For now, all you really need to know is that every GFF file has 9 fields, each separated with a tab character. There should always be some text at every position (even if it is just a '.' character). The last field often is used to store a lot of text.
In this example, we combine three separate Unix commands together in one go. Let's break it down (it can be useful to just run each command one at a time to see how each additional command is modifying the preceding output):
1) The cut command first takes the At_genes_subset.gff file and cuts out just the 3rd column (as specified by the -f option). Luckily, the default behavior for the cut command is to split text files into columns based on tab characters (if the columns were separated by another character such as a comma then we would need to use another command-line option to specify the comma).
2) The sort command takes the output of the cut command and sorts it alphanumerically.
3) The uniq command (in its default format) collapses each run of identical adjacent lines into a single line, which is why the output is sorted first (otherwise you would see thousands of 'curated', 'Coding_transcript' etc.).
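The role of sort in this pipeline is easiest to see on a tiny example. The sketch below assumes a made-up four-line, tab-separated file called mini.gff (with only four columns, so it is not a valid GFF file) and runs the pipeline with and without the sort step.

```shell
workdir=$(mktemp -d)

# Four tab-separated lines; column 3 alternates between gene and exon.
printf 'chr1\tsrc\tgene\t100\nchr1\tsrc\texon\t100\nchr1\tsrc\tgene\t500\nchr1\tsrc\texon\t500\n' > "$workdir/mini.gff"

# With sort: identical values end up next to each other, so uniq can collapse them.
features=$(cut -f 3 "$workdir/mini.gff" | sort | uniq)

# Without sort: no two neighboring lines are identical, so uniq removes nothing.
unsorted=$(cut -f 3 "$workdir/mini.gff" | uniq)
unsorted_count=$(printf '%s\n' "$unsorted" | wc -l | tr -d ' ')

echo "cut | sort | uniq gives:"
echo "$features"
echo "cut | uniq leaves $unsorted_count lines"
```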
Now let's imagine that you might want to find which features start earliest in the chromosome sequence. The start coordinate of features is always specified by column 4 of the GFF file, so:
$ cut -f 3,4 At_genes_subset.gff | sort -n -k 2 | head
chromosome 1
exon 3631
five_prime_UTR 3631
gene 3631
mRNA 3631
CDS 3760
protein 3760
CDS 3996
exon 3996
CDS 4486
Here we first cut out just two columns of interest (3 & 4) from the GFF file. The -f option of the cut command lets us specify which columns we want to extract. The output is then sorted with the sort command. By default, sort will sort alphanumerically, rather than numerically, so we use the -n option to specify that we want to sort numerically. We have two columns of output at this point and we could sort based on either column. The -k 2 specifies that we use the second column. Finally, we use the head command to get just the first 10 rows of output. These should be lines from the GFF file that have the lowest starting coordinate.
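To see why the -n option matters, here is a small sketch that assumes a made-up three-line, tab-separated file called coords.txt. A text sort compares character by character, so 10 lands before 9; a numeric sort compares the values.

```shell
workdir=$(mktemp -d)
printf 'exon\t200\ngene\t9\nCDS\t10\n' > "$workdir/coords.txt"

# Text sort on column 2: "10" sorts before "9" because '1' comes before '9'.
echo "Text sort on column 2:"
sort -k 2 "$workdir/coords.txt"

# Numeric sort on column 2: 9, then 10, then 200.
echo "Numeric sort on column 2:"
sort -n -k 2 "$workdir/coords.txt"

# Keep just the line with the lowest coordinate.
numeric_first=$(sort -n -k 2 "$workdir/coords.txt" | head -n 1)
```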
Use less to look at the Data/Misc/excel_data.csv file. This is a simple 4-line file that was exported from a Mac version of Microsoft Excel. You should see that if you use less, then this appears as one line with the newlines replaced with ^M characters. You can convert these carriage returns into Unix-friendly line-feed characters by using the tr command like so:
$ cd Data/Misc
$ tr '\r' '\n' < excel_data.csv
sequence 1,acacagagag
sequence 2,acacaggggaaa
sequence 3,ttcacagaga
sequence 4,cacaccaaacac
This will convert the characters but not save the resulting output. If you want to send this output to a new file, you will have to use a second redirect operator:
$ tr '\r' '\n' < excel_data.csv > excel_data_formatted.csv
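You can confirm that the conversion worked by counting lines before and after. This sketch assumes a made-up two-record file called mac.csv that uses carriage returns as line endings, in place of the course's excel_data.csv.

```shell
workdir=$(mktemp -d)

# Two records separated by carriage returns (\r), as old Mac software wrote them.
printf 'sequence 1,acac\rsequence 2,ttca\r' > "$workdir/mac.csv"

# wc -l counts newline characters, so Unix sees no complete lines at all here.
lines_before=$(wc -l < "$workdir/mac.csv" | tr -d ' ')

# Translate every carriage return into a newline and save the result in a new file.
tr '\r' '\n' < "$workdir/mac.csv" > "$workdir/unix.csv"
lines_after=$(wc -l < "$workdir/unix.csv" | tr -d ' ')

echo "lines before: $lines_before, lines after: $lines_after"
```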
Let's say that we want to extract five sequences from this file that are: a) from first introns, b) in the 5' UTR, and c) closest to the TSS. Therefore we will need to look for FASTA headers with an 'i1' part (first intron) and also a '5UTR' part. We can use grep to find header lines that match these terms, but this will not let us extract the associated sequences. The distance to the TSS is the number in the FASTA header which comes after the intron position. So we want to find the five introns which have the lowest values. Before I show you one way of doing this in Unix, think for a moment how you would go about this if you didn't know any Unix or Perl...would it even be something you could do without manually going through a text file and selecting each sequence by eye? Note that this Unix command is so long that I have had to wrap it across two lines. When you type this, keep it on just one line:
$ tr '\n' '@' < intron_IME_data.fasta | sed 's/>/#>/g' | tr '#' '\n' | grep "i1_.*5UTR" | sort -nk 3 -t "_" | head -n 5 | tr '@' '\n'
>AT4G39070.1_i1_7_5UTR
GTGTGAAACCAAAACCAAAACAAGTCAATTTGGGGGCATTGAAAGCAAAGGAGAGAGTAG
CTATCAAATCAAGAAAATGAGAGGAAGGAGTTAAAAAAGACAAAGGAAACCTAAGCTGCT
TATCTATAAAGCCAACACATTATTCTTACCCTTTTGCCCACACTTATACCCCATCAACCT
CTACATACACTCACCCACATGAGTGTCTCTACATAAACACTACTATATAGTACTGGTCCA
AAGGTACAAGTTGAGGGAG
>AT5G38430.1_i1_7_5UTR
GCTTTTTGCCTCTTACGGTTCTCACTATATAAAGATGACAAAACCAATAGAAAAACAATT
AAG
>AT1G31820.1_i1_14_5UTR
GTTTGTACTTCTTTACCTCTCGTAAATGTTTAGACTTTCGTATAAGGATCCAAGAATTTA
TCTGATTGTTTTTTTTTCTTTGTTTCTTTGTGTTGATTCAG
>AT3G12670.1_i1_18_5UTR
GTAGAATTCGTAAATTTCTTCTGCTCACTTTATTGTTTCGACTCATACCCGATAATCTCT
TCTATGTTTGGTAGAGATATCTTCTCAAAGTCTTATCTTTCCTTACCGTGTTCTGTGTTT
TTTGATGATTTAG
>AT1G26930.1_i1_19_5UTR
GTATAATATGAGAGATAGACAAATGTAAAGAAAAACACAGAGAGAAAATTAGTTTAATTA
ATCTCTCAAATATATACAAATATTAAAACTTCTTCTTCTTCAATTACAATTCTCATTCTT
TTTTTCTTGTTCTTATATTGTAGTTGCAAGAAAGTTAAAAGATTTTGACTTTTCTTGTTT
CAG
That's a long command, but it does a lot. Try to break down each step and work out what it is doing (you may need to consult the man page for some commands). Notice that I use one of the other redirect operators ('<') to read from a file. It took seven Unix commands to do this, but these are all relatively simple Unix commands; it is the combination of them together which makes them so powerful. One might argue that when things get this complex with Unix, it might be easier to do it in Perl!
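If you would like to check your breakdown, here is one way to walk through the same seven commands in stages. It is only a sketch: it assumes a made-up four-record file called tiny.fasta whose headers imitate the layout of the real ones, and the intermediate variable names are invented for illustration.

```shell
workdir=$(mktemp -d)
fasta="$workdir/tiny.fasta"

# Four made-up records; headers imitate the name_intron_distance_region layout.
printf '>G1_i1_50_5UTR\nAAAA\nCCCC\n>G2_i2_10_5UTR\nGGGG\n>G3_i1_7_5UTR\nTTTT\n>G4_i1_20_CDS\nACGT\n' > "$fasta"

# Commands 1-3: replace every newline with '@' (the whole file becomes one line),
# mark the start of each record with '#', then turn each '#' into a newline.
# The result is one line per record: header@sequence@sequence@
one_per_line=$(tr '\n' '@' < "$fasta" | sed 's/>/#>/g' | tr '#' '\n')

# Commands 4-6: keep first introns in the 5' UTR, sort numerically on the third
# '_'-separated field (the distance), and keep at most five records.
selected=$(printf '%s\n' "$one_per_line" | grep "i1_.*5UTR" | sort -nk 3 -t "_" | head -n 5)

# Command 7: turn the '@' placeholders back into newlines to restore FASTA layout.
restored=$(printf '%s\n' "$selected" | tr '@' '\n')

printf '%s\n' "$restored"

# The headers alone show the final order: closest to the TSS first.
headers=$(printf '%s\n' "$restored" | grep ">")
```

Printing one_per_line and selected as well makes it clear what each stage contributes. The G2 record is dropped because it is a second intron, and G4 because it is not in the 5' UTR.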
Summary
Congratulations are due if you have reached this far. If you have learnt (and understood) all of the Unix commands so far then you probably will never need to learn anything more in order to do a lot of productive Unix work. But keep on dipping into the man page for all of these commands to explore them in even further detail. The following table provides a reminder of most of the commands that we have covered so far. If you include the three, as-yet-unmentioned, commands in the last column, then you will probably be able to achieve >95% of everything that you will ever want to do in Unix. The power comes from how you can use combinations of these commands.
Basic file control: mv, cp, mkdir, rmdir, rm, | (pipe), > (write to file), < (read from file)
Viewing/creating/editing files: less, head, tail, touch, nano
Misc. useful commands: man, chmod, source, wc, curl
Power commands: uniq, sort, cut, tr, grep, sed
Perl
Your programming environment
For this course, you will be using two applications, a text editor and a terminal. You should already be familiar with the Terminal application from the Unix lesson. If you are using a Mac then we recommend using a (Mac-specific) text editor called Smultron. A copy of this is provided in /Volumes/USB/Unix_and_Perl_course/Applications.
Smultron is a typical programmer's text editor. It has several useful features such as syntax highlighting, automatic indentation, line numbering, and advanced search & replace. There are many good text editors available for Mac, Unix, and Windows. Smultron is better than most, and it is free. Windows users should consider Notepad++.
Here is a handy Mac tip that will apply to Smultron and also to any other Mac graphical application that allows you to edit and save text. When you first open a new empty document, the program is as yet unsaved.
Now notice what happens when you start entering text into the main Smultron window. The window close button (the red circle in the top left of the window), now has a small black dot inside it.
This is meant to serve as a reminder that your file is still unsaved. As soon as you click the Save button, this black dot will disappear. From time to time you will have problems with your Perl scripts, and this might simply be because you have not saved any changes that you have made.
Line 1 has a # sign on it. When Perl sees a # sign, everything that follows on that line is considered a comment. Programmers use comments to describe what a program does, who wrote the program, what needs to be fixed, etc. It's a good idea to put comments in your programs, especially as they grow larger. Line 2 is the only line of this program that does anything. The print() function outputs its arguments to your terminal. In this case, there is only one argument, the text "Hello World\n". The funny \n at the end is a newline character, which is like a carriage return. Most of the time, Perl statements end with a semicolon. This is like a period at the end of a sentence. The last statement in a block does not require a semicolon. We will revisit this in a later lesson. Save the program as helloworld.pl. To run the program, type the following in the terminal and hit return (make sure you are in the correct directory).
perl helloworld.pl
This will run the perl program and tell it to execute the instructions of the helloworld.pl file. If it worked, great. If you received a message like the one below, you may have forgotten to save the file, misspelled the file name, or saved the file to someplace unintended. Always use tab-completion to prevent spelling mistakes. Always save your programs to the Unix_and_Perl_course/Code directory (for now anyway).
Can't open perl script "helloworld.pl": No such file or directory
Task P1.2: Modify the program to output some other text, for example the date. Add a few more print statements and experiment with what happens if you omit or add extra newlines. Task P1.3: Make a few deleterious mutations to your program. For example, leave off the semicolon or one of the parentheses. Observe the error messages. One of the most important aspects of programming is debugging. Probably more time is spent debugging than programming, so it's a good idea to start recognizing errors now.
Line 1 will appear at the top of every Perl script that we write from now on. This line of code is very similar to the line that appeared at the top of our Unix shell script. It lets Unix know that the Perl program (located at /usr/bin/perl) can read this file and run the remaining code inside it.
Line 2 is simply a comment. You should always include a few comments in your programs.
Line 3 is another line that we will add to every script from now on. This line effectively tells Perl that we would like to be warned if we start writing certain types of bad code. This is a good thing! We will return to this later on.
Line 4 is deliberately blank. You should use spaces and blank lines to improve the readability of your code. In this case we are separating the first three lines of the script (which don't actually calculate anything) from the rest.
Line 5 is a variable assignment. The variable $x gets the value of 3.
Line 6 prints the value of $x and then prints a newline. As you can see, the print() function can take multiple arguments separated by commas.
Run the program by typing the line below in your terminal. Observe the output and go back through the code and line descriptions to make sure you understand everything.
scalar.pl
The addition of #!/usr/bin/perl to the script means that we no longer have to type:
perl scalar.pl
What is actually happening here is that we are making it clear that these text files contain instructions written in Perl. The line that we add tells Unix that it should find a program called perl in the /usr/bin directory and that program should be capable of making sense of your Perl commands. Now try adding the following lines to your program.
7. $s = "something";
8. print($s, "\n");
9. print("$s\n");
Line 7 is another variable assignment, but unlike $x, our new variable $s gets a character string, which is just another term for text. Lines 8 and 9 print our new variable $s and then print a newline character. Save the script and run it again. You should see that although lines 8 and 9 are different they produce exactly the same output. The print function can print a list of items (all separated by commas), but it often makes more sense to print just one thing instead. It would have been possible to rewrite our very first Perl script with the following:
print("H","e","l","l","o"," ","W","o","r","l","d","!","\n");
Hopefully you will agree that printing this phrase as one string and not thirteen separate strings is a lot easier on the eye. Now add the following line to your program, and run it again.
10. print "$s\n";
Line 10 calls the print function without parentheses. You do not have to use parentheses for Perl functions, but they are often useful to keep a line organized. In most cases, you will see the print function without parentheses. Now add the final two lines to the program:
11. print '$x $s\n';
12. print "$x $s\n";
Line 11 puts the two variables between single quotes. Any text between single quotes will print exactly as shown. This also means that \n loses its special meaning as a newline character. In contrast, strings between double quotes will undergo variable interpolation. This means that variables are always expanded inside double quotes, and print will always show what those variables contain.
Task P2.2: Mutate your program. Delete a $ and see what error message you get.
Task P2.3: Modify the program by changing the contents of the variables. Observe the output. Try experimenting by creating more variables.
Variables summary
You can use (almost) anything for your variable names, though you should try to use names which are descriptive and not too long. You should also use lower case names for your variables. This is not essential though. Which of the following is the best variable name for a variable that will store a DNA sequence?
$x = "ATGCAGTGA";           # $x is not a good choice
$dna_sequence_variable;     # also not a good choice, too long
$sequence = "ATGCAGTGA";    # $sequence is better
$dna = "ATGCAGTAGA";        # $dna is even better
It is perfectly fine to give a variable the same name as an existing function in Perl, though this might be confusing. For example, a variable named $print might look a bit too similar to the print() function. Sometimes though the choice of variable name is obvious: $length is often a good name for variables that contain the length of something, even though there is also a length() function in Perl (which we will learn about later on). As shown in the example above, variable names can contain underscore characters to separate words. This is often useful and helps make things easier to understand. E.g.
$first_name = "Keith";
$second_name = "Bradnam";
Finally, you should be aware that (with a few exceptions) you can use spaces to make things clearer (or less clear if you so desire). The following lines are all treated by Perl in exactly the same way:
$dna = "ATGCAGTGA";             # one space either side of the = sign
$dna="ATGCAGTGA";               # no spaces either side of the = sign
$dna      =      "ATGCAGTGA";   # lots of spaces!
You hopefully noticed that this program introduces another new concept; line 3 includes another usage statement: use strict; (in addition to use warnings;). Up till now we have ended each line of Perl code with a semicolon, but there are times when it is simpler to put two lines of Perl code into one line in an editor. Perl will still treat these as two separate lines of code. Telling Perl to use strict means that Perl will insist your script is written in a certain way which is widely considered to be a better way of writing code. At this point it is not important to go into the details of what exactly use strict is doing. Just accept our word that including a use strict; use warnings; line in every script that you write is a good thing to do (we will return to these issues later). Task P3.2: Now try running the script. You should hopefully see the following errors:
Global symbol "$pi" requires explicit package name at strict.pl line 5.
Global symbol "$pi" requires explicit package name at strict.pl line 6.
Global symbol "$pi" requires explicit package name at strict.pl line 8.
Global symbol "$pi" requires explicit package name at strict.pl line 9.
Execution of strict.pl aborted due to compilation errors.
We see one error message for each use of the $pi variable in the script. Now see what happens if you remove the use strict; statement and re-run the script. It should now work. What is happening here? When we tell Perl that we want to use strict; Perl will first check the code and one of the things it will do is to look at how variables are declared. In Perl, when we first introduce any variable we can optionally describe whether it is available to all parts of a program or not. However, if we turn on use strict; it becomes mandatory to say whether the variable is a local or global variable. At this time it is not important to understand the details of this (we will return to it later on), other than that we want our programs to include use strict and so we will be making our variables local variables. Task P3.3: Make sure that use strict; is back in your program. Now change line 5 of the program to the following and run your script again (it should now work and should not produce any errors):
5. my $pi = 3.14;
We are now declaring the $pi variable using the word 'my'. This makes the variable a local variable and we will now be doing this most of the time that we introduce any new variable. It might help to think of the word my as reading as 'let'. At this point you are probably thinking that including use strict; in your programs is making things more complex. That is true, but the benefits of including use strict; outweigh the costs associated with it. Task P3.4: The other point of this programming exercise is to introduce you to the simple fact that you can reassign variables to different values or strings. Try declaring a new variable and assign it a value. Add two more lines to change that value and print it out again.
P4. Math
Perl, like most programming languages, supports a variety of mathematical operators and functions. Let's experiment with some of these. Task P4.1: Write the program below, save it as math.pl, and then run it. But wait, this time we are going to take a slightly different strategy. The program is getting longer. If you type the whole thing and have a lot of errors, it will become difficult to debug. So instead, write only a few lines, and then save, run, and observe the output. Debug if necessary. Try to check that your program is working every few lines. As you get more experience, you will gain skill and confidence and not need to check as frequently.
1. #!/usr/bin/perl
2. # math.pl
3. use strict; use warnings;
4.
5. my $x = 3;
6. my $y = 2;
7. print "$x plus $y is ", $x + $y, "\n";
8. print "$x minus $y is ", $x - $y, "\n";
9. print "$x times $y is ", $x * $y, "\n";
10. print "$x divided by $y is ", $x / $y, "\n";
11. print "$x modulo $y is ", $x % $y, "\n";
12. print "$x to the power of $y is ", $x ** $y, "\n";
Task P4.2: In addition to the mathematical operators we've just seen, there are a number of built-in numeric functions: e.g. abs(), int(), log(), rand(), sin(). Add the following lines to the program, run it, and observe the output.
13. print "the absolute value of -$x is ", abs(-$x), "\n";
14. print "the natural log of $x is ", log($x), "\n";
15. print "the square root of $x is ", sqrt($x), "\n";
16. print "the sin of $x is ", sin($x), "\n";
17. print "a random number up to $y is ", rand($y), "\n";
18. print "a random integer up to $x x $y is ", int(rand($x * $y)), "\n";
Line 18 could have been written as int rand $x * $y. This is another example where you can omit parentheses if you like. But just because you can doesn't mean you should.
Task P4.3: In the examples above, the print() function outputs text as well as the actual mathematical operations. This is fairly uncommon in real programming. Generally, we want to make some computation, store that value, and do more computations. To store values, we need to create a new variable that will hold the contents.
19. my $z = ($x + $y) / 2;
20. print "$z\n";
Task P4.4: In this next exercise, you will build a simple calculator that calculates X to the power of Y. Instead of assigning the variables inside the code, we will let the user input the values without editing the file. In general, this is how programs should work. Once written, they can be used without editing the source code.
1. #!/usr/bin/perl
2. # pow.pl
3. use strict; use warnings;
4.
5. my ($x, $y) = @ARGV;
6. print $x ** $y, "\n";
Line 5 has an unfamiliar construct. @ARGV is a list of values from the command line. We will discuss lists and arrays in greater detail later. For now, just accept that the values from the command line will be contained in $x and $y. For example, if you type the line below in the terminal, when the program runs, $x will contain 3.14 and $y will contain 2.718.
pow.pl 3.14 2.718
Task P4.5: Let's make one more calculator for fun. This one will compute the factorial of a number. Factorials are usually computed with some kind of a loop (we will talk a lot about loops later). Here is an alternate method that provides a reasonable approximation. Unlike the true factorial, this method can use non-integers. Lines 7 to 10 are indented with tabs to make it easier to read.
1. #!/usr/bin/perl
2. # stirling.pl (Stirling's approximation to the factorial)
3. use strict; use warnings;
4.
5. my ($n) = (@ARGV);
6. my $ln_factorial =
7.     (0.5 * log(2 * 3.14159265358979))
8.     + ($n + 0.5) * log($n)
9.     - $n + 1 / (12 * $n)
10.     - 1 / (360 * ($n ** 3));
11. print 2.71828 ** $ln_factorial, "\n";
Try it out:
stirling.pl 5
stirling.pl 7.1
Operator Precedence Let's quickly discuss operator precedence. Some operators have higher precedence than others. We're used to seeing this in math where multiplication and division come before addition and subtraction: 3 + 2 * 5 = 13. If you want to force addition before multiplication, you can do this as (3 + 2) * 5 = 25. Perl has a lot of operators in addition to the mathematical operators and there are a lot of precedence rules. Don't bother memorizing them. The universal precedence rule is this: multiplication comes before addition, use parentheses for everything else.
Did you notice how the print statements on lines 7 and 9 are indented? This is no accident! It shows the logical hierarchy. The spacing is achieved by using a tab character. Many code editors will be smart enough to put tabs in for you automatically.
Numerical comparison operators in Perl
We have just seen the == operator; here are all the ways of comparing two numbers:

Operator   Meaning                    Example
==         equal to                   if ($x == $y)
!=         not equal to               if ($x != $y)
>          greater than               if ($x > $y)
<          less than                  if ($x < $y)
>=         greater than or equal to   if ($x >= $y)
<=         less than or equal to      if ($x <= $y)
<=>        comparison                 if ($x <=> $y)
Indentation and block structure
In general, all the statements that are conditional on some other statement are indented with a tab character. You can have conditional statements inside other conditional statements, in which case you will have multiple levels of indentation. Is this necessary? Yes and no. It is necessary to aid readability, but it is not necessary to get your program to run. Pay attention to the indentation in the example programs and follow them closely. http://en.wikipedia.org/wiki/Indent_style contains a good description of indentation styles. Feel free to choose one of those, but do not make up your own style! Your #1 job as a programmer is to write programs that can be easily understood by others, and inventing new programming paradigms defeats that goal.
Task P5.2: Modify the program by changing the variables and relational operators. The numeric relational operators are in the accompanying table. Experiment to see if you can figure out what the <=> operator does (it is called the spaceship operator).
Task P5.3: Hierarchy is one of the most important concepts in programming. We are used to seeing hierarchical file systems where files are inside of folders which might be inside other folders. Programming uses the same concept. In Perl, hierarchy is shown with tabs and curly brackets. Statements (files) are inside curly brackets (folders) which might be inside other curly brackets (more folders).
1. #!/usr/bin/perl
2. # nested_conditional.pl
3. use strict; use warnings;
4.
5. my ($x, $y) = @ARGV;
6. if ($x > $y) {
7.     print "$x is greater than $y\n";
8.     if ($x < 5) {
9.         print "$x is greater than $y and less than 5\n";
10.     }
11. }
12. else {
13.     print "$x is not greater than $y\n";
14. }
Whitespace
Indentation and white space improve readability. Consider the following legal but confusing code which omits tabs and spaces (and even some semicolons):
if($x>$y){print"1\n";if($x<5){print"2\n"}}else{print"3\n"}
A program must be readable above all else. A program that works but is unreadable is difficult to improve or maintain.
Task P5.4: Sometimes you want to test a series of conditions. This next example shows you how to do this with elsif.
#!/usr/bin/perl
use strict; use warnings;
# elsif.pl

my ($x) = @ARGV;

if ($x >= 3) {
    print "x is at least as big as 3\n";
}
elsif ($x >= 2) {
    print "x is at least as big as 2, but less than 3\n";
}
elsif ($x >= 1) {
    print "x is at least as big as 1, but less than 2\n";
}
else {
    print "x is less than 1\n";
}
Task P5.5: For simple switches such as the above example, it is sometimes useful to break the usual indentation rule. In the example below, note that the obligatory semicolons have been dropped. It turns out that the last line of a block does not need to be terminated with a semicolon precisely for this kind of code beautification. Also note that spaces are added so that the braces line up in columns.
if    ($x >= 3) {print "x is at least as big as 3\n"}
elsif ($x >= 2) {print "x is at least as big as 2, but less than 3\n"}
elsif ($x >= 1) {print "x is at least as big as 1, but less than 2\n"}
else            {print "x is less than 1\n"}
Other Conditional Constructs
An alternative to if ($x != $y) is unless ($x == $y). There are times when unless is more expressive than if not. Perl does accept elsif and else after an unless, but the result is usually hard to read, so it is best to avoid combining them. Perl also lets you do something called post-fix notation. This allows you to put the if or unless at the end of the statement rather than at the beginning. You can't use elsif or else in this case, but sometimes the post-fix just reads much better. Here are two examples:
print "x is less than y\n" if $x < $y;
print "x is less than y\n" unless $x >= $y;
Finally, Perl includes something called the trinary operator that lets you do very simple if-then-else statements with just a few symbols. Consider the following statement:
if ($x == $y) {print "yes"} else {print "no"}
print(($x == $y) ? "yes" : "no");   # the same test written with the trinary operator (? :)
The trinary operator is not that commonly used, but you will see it from time to time.
Numeric Precision and Conditionals
Although Perl hides the details, numbers in a computer are generally stored either as integer or floating point (decimal) numbers. Both ints and floats have minimum and maximum values, and floats have limited precision. You have probably run into these concepts with your calculator. If you keep squaring a number greater than 1.0 you will eventually run into an overflow error. In Perl, this will happen at approximately 1e+308. Similarly, if you repeatedly square a number less than 1.0, you will eventually reach an underflow error. In Perl, the closest you can get to zero is approximately 1e-308. Try some extreme values in pow.pl or stirling.pl to reach underflow and overflow. Floating point numbers do not have the exact value you may expect. For example, 0.1 is not exactly one-tenth. Perl sometimes hides these details. Try the following code. When you run this, you expect to see 0.3 0.3 0.0, but that's not what happens because adding the imprecise 0.1 three times is not the same as the imprecise 0.3.
1. #!/usr/bin/perl
2. # float.pl
3. use strict; use warnings;
4.
5. my $x = 0.1 + 0.1 + 0.1;
6. my $y = 0.3;
7. print $x, "\t", $y, "\t", $x - $y, "\n"; # \t is a tab character
Since floating point numbers are approximations, you should not compare them for equality in conditional statements. Never ask if ($x == $y) if the values are floats because, as we have seen, 0.3 is not necessarily equal to 0.3. Instead, ask if their difference is smaller than some acceptable threshold value.
8. my $threshold = 0.001;
9. if (abs($x - $y) < $threshold) {print "close enough\n"}
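To see the hidden imprecision directly, one option is to ask printf for more decimal places than print shows by default. The following is a minimal sketch; the 20-digit format and the 1e-9 threshold are arbitrary choices of ours, not values from the lesson.

```perl
#!/usr/bin/perl
use strict; use warnings;

my $x = 0.1 + 0.1 + 0.1;
my $y = 0.3;

# Print both values with 20 digits after the decimal point
printf "%.20f\n%.20f\n", $x, $y;

# Compare with a tolerance instead of ==
my $close = (abs($x - $y) < 1e-9) ? 1 : 0;
print "close enough\n" if $close;
```

How small the threshold should be depends on the size of the numbers you are working with, so treat 1e-9 as a starting point rather than a rule.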
Line 7 introduces the concatenate operator which in Perl is represented by the dot (.) character. This operator allows you to join two or more strings together and (optionally) store the result in a new variable. In this case we create a new variable ($s3) which stores the result of joining three things together ($s1, a space character " ", and $s2). Now add the following lines to the script.
9. if ($s1 eq $s2) {print "same string\n"}
10. elsif ($s1 gt $s2) {print "$s1 is greater than $s2\n"}
11. elsif ($s1 lt $s2) {print "$s1 is less than $s2\n"}
How are these strings compared? It might make sense to compare them by length, but that's not what is happening. They are compared by their ASCII values. So 'A' is less than 'B' which is less than 'Z'. Similarly 'AB' is less than 'AC' and 'ABCDE' is also less than 'AC'. Oddly, 'a' is greater than 'A'. See the Wikipedia page on ASCII to see the various values. To get the length of a string, you use the length() function.
String comparison operators in Perl
Operator   Meaning            Example
eq         equal to           if ($x eq $y)
ne         not equal to       if ($x ne $y)
gt         greater than       if ($x gt $y)
lt         less than          if ($x lt $y)
.          concatenation      $z = $x . $y
cmp        comparison         if ($x cmp $y)
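The cmp operator differs from the others in that it returns one of three values (-1, 0, or 1) rather than true or false. A short sketch using the strings discussed above (the $r1 to $r4 variable names are ours):

```perl
#!/usr/bin/perl
use strict; use warnings;

my $r1 = "AB"    cmp "AC";   # -1: 'B' sorts before 'C'
my $r2 = "ABCDE" cmp "AC";   # -1: decided at the second character, length is ignored
my $r3 = "a"     cmp "A";    #  1: lowercase letters have higher ASCII values
my $r4 = "AC"    cmp "AC";   #  0: identical strings

print "$r1 $r2 $r3 $r4\n";
print length("ABCDE"), "\n"; # 5
```

Because cmp returns 0 for identical strings, a test such as if ($x cmp $y) is true only when the two strings differ.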
Task P6.2: Modify the program in P6.1 to experiment with different string comparison operators. Then try comparing a number and a string using both numeric and string comparison operators. Try using the length() function.
Task P6.3: If you are interested in ASCII values, try using the ord() and chr() functions, which convert letters to numbers and vice-versa.
12. print ord("A"), "\n";
13. print chr(66), "\n";
Matching Operators
One of the most common tasks you may have as a programmer is to find a string within another string. In a biological context, you might want to find a restriction site in some DNA sequence. These kinds of operations are really easy in Perl. We are only going to touch on a few examples here. In a few lessons we will get much more detailed.
Task P6.4: Enter the program below and observe the output. The binding operator =~ signifies that we are going to do some string manipulation next. The exact form of that manipulation depends on the next few characters. The most common is the match operator m//. This is used so commonly that the m can be omitted. There are also substitution and transliteration operators. If your script is working then try changing line 6 to make the matching operator match other patterns.
1. #!/usr/bin/perl
2. # matching.pl
3. use strict; use warnings;
4.
5. my $sequence = "AACTAGCGGAATTCCGACCGT";
6. if ($sequence =~ m/GAATTC/) {print "EcoRI site found\n"}
7. else {print "no EcoRI site found\n"}
Meaning        Example
match          if ($s =~ m/GAATTC/)
match          if ($s =~ /GAATTC/)
not match      if ($s !~ m/GAATTC/)
substitution   $s =~ s/thing/other/;
Task P6.5: Add the following lines and observe what happens when you use the substitution operator. This behaves in a similar way to the sed command in Unix.
8. $sequence =~ s/GAATTC/gaattc/;
9. print "$sequence\n";
Now add the following lines and find out what happens to $sequence.
10. $sequence =~ s/A/adenine/;
11. print "$sequence\n";
12. $sequence =~ s/C//;
13. print "$sequence\n";
Line 12 replaces the occurrence of a C character with nothing (//), i.e. it deletes a C character. You should have noticed though that lines 10 and 12 only replaced the first occurrence of the matching pattern. What if you wanted to replace all occurrences? To specify a global option (i.e. replace all occurrences), we add a letter g to the end of the substitution operator:
14. $sequence =~ s/C//g; # adding g on the end of substitution operator
This is similar to how we use command-line options in Unix: the global option modifies the default behavior of the operator.
Task P6.6: Add the following lines to the script and try to work out what happens when you add an 'i' to the matching operator:
15. my $protein = "MVGGKKKTKICDKVSHEEDRISQLPEPLISEILFHLSTKDLWQSVPGLD";
16. print "Protein contains proline\n" if ($protein =~ m/p/i);
Task P6.7: In bioinformatics, you will sometimes be given incorrectly formatted data files which might break your script. Therefore we often want to stop a script early on if we detect that the input data is not what we were expecting. Add the following lines to your script and see if you can work out what the die function is doing.
17. my $input = "ACNGTARGCCTCACACQ"; # do you know your IUPAC characters?
18. die "non-DNA character in input\n" if ($input =~ m/[efijlopqxz]/i);
19. print "We never get here\n";
It is very common to stop scripts by using the 'die ... if' syntax. There is no point letting a script continue processing data if the data contains errors. Perl does not know about the rules of biology so you will need to remember to add suitable checks to your scripts.
The transliteration operator
The transliteration operator gets its own section as it is a little bit different to the other matching operators. If you worked through Part 2 of the Unix lessons you may remember that there is a tr command in Unix. The transliteration operator behaves in the same way as this command. It takes a list of characters and changes each item in the list to a character in a second list, though we often use it with just one thing in each list. It automatically performs this operation on all characters in a string (so no need for a global option).
Task P6.8: Make a new script to test the full range of abilities of the transliteration operator. Notice how there are comments at the end of many of the lines (the hash character # denotes the start of a comment). You don't have to type these comments, but adding comments to your scripts is a good habit to get into. You will need to add suitable print statements to this script in order for it to do anything.
1. #!/usr/bin/perl
2. # transliterate.pl
3. use strict; use warnings;
4.
5. my $text = "these are letters: abcdef, and these are numbers, 123456";
6.
7. $text =~ tr/a/b/;      # changes any occurrence of a to b
8. $text =~ tr/bs/at/;    # the letter b becomes a, and s becomes t
9. $text =~ tr/123/321/;  # 1 becomes 3, 2 stays as 2, 3 becomes 1
10. $text =~ tr/abc/ABC/; # capitalize the letters a, b, and c
11. $text =~ tr/ABC/X/;   # any A, B, or C will become an X
12. $text =~ tr/d/DE/;    # incorrect use, only d will be changed to D
On Line 5 in this script we define a string and save that to a variable ($text). Lines 7-12 then perform a series of transliterations on the text.
Task P6.9: If you have many characters to transliterate, you can use the tr command in a slightly different way, which you may (or may not) find easier to understand:
13. $text =~ tr [abcdefgh]
14.             [hgfedcba]; # semicolon is here and not on line 13
In this case we use two pairs of square brackets to denote the two lists of characters, rather than just three slashes. There are two lines of code in your text editor, but Perl sees this as just one line. Perl scripts can contain any amount of whitespace, and it often helps to split one line of code into two separate lines in your editor. The following line would be treated by Perl as exactly the same as Lines 13-14:
15. $text =~ tr[abcdefgh][hgfedcba]; # whitespace removed
Task P6.10: The transliteration operator can also be used to count how many changes are made. This can be extremely useful when working with DNA sequences. Add the following lines to your script.
16. my $sequence = "AACTAGCGGAATTCCGACCGT";
17. my $g_count = ($sequence =~ tr/G/G/);
18. print "The letter G occurs $g_count times in $sequence\n";
Line 17 may appear confusing. The transliteration operator is changing the letter G to itself, and it then assigns the result of this operation to a new variable ($g_count). So what is happening? Perl performs the code inside the parentheses first and this performs the transliteration. The result of the transliteration is that lots of G->G substitutions are made which leaves $sequence unchanged. The transliteration operator counts how many changes are made. Normally it does nothing with this count, but if you ask Perl to assign the output of the transliteration to a variable (as in this example), then it will store the count in that variable.
Task P6.11: Remove the parentheses from line 17. Does the script still work? This is a case where the parentheses are not needed by Perl, but their inclusion might make your code more understandable. If you have any code where you use the assignment operator (=), Perl always evaluates the right-hand side of the equals sign first.
Task P6.12: Add the following line to your script and see if you can understand how to specify a range of characters with the tr operator.
19. $sequence =~ tr/[A-Z]/[a-z]/;
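Note that the square brackets in this example have a completely different meaning to those used in lines 13-14, and line 15. There they act as delimiters around each list; here they are ordinary characters inside the lists (each bracket is simply changed to itself), so tr/A-Z/a-z/ would work equally well.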
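As an illustration of ranges and multi-character lists working together, here is a minimal sketch that makes a lowercase copy of a sequence and computes its reverse complement. The reverse complement is our own choice of example, and it uses Perl's built-in reverse function, which this lesson has not introduced.

```perl
#!/usr/bin/perl
use strict; use warnings;

my $sequence = "AACTAGCGGAATTCCGACCGT";

# Lowercase copy: copy first, then transliterate the copy
(my $lower = $sequence) =~ tr/A-Z/a-z/;

# Reverse complement: reverse the string, then swap each base for its partner
my $revcomp = reverse $sequence;
$revcomp =~ tr/ACGT/TGCA/;

print "$lower\n$revcomp\n";
```

Copying the string before transliterating leaves the original $sequence untouched, which is usually what you want when the original is needed later in the script.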
While it is possible to use the die function without printing any output, you should always try to include a helpful statement as to why the program has stopped. It is common to see several die statements near the start of a script as this is the point when you should ideally check that all of the script's parameters make sense and that any input files are present (and valid). On line 5 the die function will be run unless the @ARGV array contains exactly one item (remember that this array contains a list of anything you specify on the command line after the script name). You will often use the die function in conjunction with the if operator, i.e. if something is missing, stop the script. Note that line 5 could be replaced with the following if we wanted to make things even more explicit:
7. my $number_of_arguments = @ARGV;
8. if($number_of_arguments != 1){
9.     die "usage: dnastats.pl <dna sequence>\n";
10. }
Goals of your program
Your program should read a sequence that is specified on the command line and report the following:
- The length of the sequence
- The total number of A, C, G, and T nucleotides
- The fraction of A, C, G, and T nucleotides (i.e. %A, %C etc.)
- The GC fraction
The code in line 5 takes a list of three values (1, 2, 3) and assigns them to a list of three variables ($x, $y, $z). Without using lists, we would have to have three separate lines of code in order to declare and initialize each variable with a value. Assignments in lists occur simultaneously. Because of this, line 7 below exchanges the values for $x and $y.
7. ($x, $y) = ($y, $x);
8. print "x=$x y=$y\n";
Task P7.2: Exchange the value of $x and $y without using list context. This is one of those problems that appears difficult at first, but once you see the solution, it will seem so obvious that you can't imagine how you didn't think of it immediately.
Line 5 assigns 3 variables with 5 values. The two extra values on the right are simply thrown away. Line 8 assigns 3 variables from only 2 values. What is $c? The output from line 9 suggests that $c is some kind of a blank, and the output from line 10 suggests it has no length. But the output from line 11 suggests that $c has a value of zero. $c has an undefined value. It is simultaneously zero and an empty string. Do you find this confusing? It is. Undefined values are bad. You should never assume the contents of a variable. Variables should always be assigned before they are used. Similarly, lists should be the same length on each side of an assignment, but Perl has no way of checking this. To find undefined values, always include use warnings in your program. This will alert you when undefined variables are being used. If you have undefined values, stop immediately and debug. A program that runs with undefined values can be very dangerous.
Task P8.2: Modify the original program by adding a use warnings; line. Run the program and observe what happens. The errors that you should see are a good thing!
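If you need to test for an undefined value yourself, Perl's built-in defined() function can do so without triggering a warning. The sketch below is only an illustration: defined() has not been introduced in these lessons, and the variable names and the choice of 0 as a fallback value are ours.

```perl
#!/usr/bin/perl
use strict; use warnings;

my ($p, $q, $r) = (1, 2);    # only two values, so $r is left undefined

# defined() reports whether a variable holds a value
my $r_was_defined = defined($r) ? "yes" : "no";
print "r defined? $r_was_defined\n";

# Give $r an explicit value before using it
$r = 0 unless defined $r;
print "p=$p q=$q r=$r\n";
```

Whether a fallback value is appropriate depends on the program; often the better response to an unexpected undefined value is to stop and find out why it is missing.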
P9. Arrays
Lists are useful for declaring and assigning multiple variables at once, but they are transient and if we want to store the details of a list then we have to capture all the values into separate variables. Ideally, there should be a way of referring to all of a list in one go, and there should be a way to access individual items in a list. In Perl (and in most other programming languages) we do this using arrays. An array is a named list. Each array element can be any scalar value that we have seen so far, e.g. a number, letter, word, sentence etc. In Perl, as in most programming languages, an array is indexed by integers beginning with zero. The first element of an array is therefore the zero-th element. This might confuse you but that's just the way it is. Arrays in Perl are named using the '@' character. Let's imagine that we have an array called @cards that contains five playing cards (we can imagine that each card in the array would be stored as a text string such as '7D' for 'seven of diamonds').
If we wanted to see what the individual elements of the @cards array were, we could access them at array positions 0 through to 4. It's important to note that arrays always have a start (the zero-th position), an end (in this case, position 4), and a length (in this case 5). Arrays can contain just one element in which case the start and the end would be the same. Arrays can also contain no elements whatsoever (more of that later).
In biology you might frequently see arrays used to store DNA or protein sequence information. This could either be where each element is a separate DNA/protein sequence, or where each element is one nucleotide/amino acid and the whole array is the sequence. Task P9.1: Create and run the following program.
1. #!/usr/bin/perl
2. # array.pl
3. use strict; use warnings;
4.
5. my @animals = ('cat', 'dog', 'pig');
6. print "1st animal in array is: $animals[0]\n";
7. print "2nd animal in array is: $animals[1]\n";
8. print "Entire animals array contains: @animals\n";
Line 5 assigns the @animals array a list of 3 values. Note how we also have to declare arrays with 'my' if we are including the use strict; statement. Lines 6 and 7 show how to access individual elements of an array. You specify a position in the array by putting an integer value between square brackets. The integer value is known as the 'array index'. Lines 6 and 7 also show that you can interpolate individual array elements inside double quotes, i.e. Perl prints out the value stored at the specified array position rather than just printing the text $animals[0]. Line 8 shows that if you include an array name between double quotes, then the entire array interpolates and Perl will add spaces between each element in the printed output. Note that each element of the array is a scalar. We write $animals[0], never @animals[0]. Perl treats @animals[0] as a one-element slice of the array rather than as a single scalar, which is almost never what you intend. The membership of $animals[0] in @animals is shown by the square brackets. Writing @animals[0] is one of the most common errors of new programmers (it's so common that Perl 6, since renamed Raku, adopted @animals[0] as its standard syntax). Try modifying the code to include this erroneous syntax and observe the warning message.
6. print "@animals[0]\n"; # bad
Making arrays bigger and smaller
Perl arrays are dynamic. That is, they grow and shrink automatically as you add/remove data from them. It is very common to modify the contents of arrays, and it is also very common to start off with an array full of things, and then remove one thing at a time. Most of the time we add or remove things to either end of an array and Perl has four dedicated functions to do this:
Task P9.2: To examine this dynamic behavior, we will first learn to use the push() function to add some new data onto the array. The push function is used to add one thing to the end of an array. The end of an array is the element with the highest array index position. Add the following lines to your program.
9. push @animals, "fox"; # the array is now longer
10. my $length = @animals;
11. print "The array now contains $length elements\n";
Line 10 introduces a very useful concept in Perl. If you assign an array to a scalar variable, then the scalar variable receives the length of the array. This is so useful that you will use it a lot in your Perl code. You can think of this in another way. Anywhere in a Perl script where it is possible to specify a numerical value, you can instead specify the name of an array. Perl will then calculate the length of the array and use the number of elements.
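A short sketch of an array being used where a number is expected. The $#animals notation for the index of the last element is standard Perl but has not been covered in these lessons, and the $verdict variable is our own.

```perl
#!/usr/bin/perl
use strict; use warnings;

my @animals = ('cat', 'dog', 'pig', 'fox');

my $length     = @animals;     # array assigned to a scalar gives its length: 4
my $last_index = $#animals;    # index of the final element: 3

# An array used in a numeric comparison also gives its length
my $verdict = (@animals > 3) ? "more than three" : "three or fewer";

print "$length $last_index $verdict\n";
```

Note that the last index is always one less than the length, because indexes start at zero.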
It is a common mistake to confuse the following two lines of code. Can you work out what the difference is?
$length = @animals;
($length) = @animals;
The first line of code takes the length of the animals array and assigns that to the $length variable. But when we add parentheses around $length we are now making a list, and the second line of code is therefore a list assignment. It doesn't look much like a list because there is only one thing in it, but it is still a list. So the second line of code could be read as 'take the @animals array and assign all of the elements to a new list called $length'. Of course in this case the new list is shorter than the array so it can only receive one item. Have a look again at section P8.1 to see if that helps you understand things.
Task P9.3: Just to make sure you fully understand arrays, let's add a few more lines.
12. my ($first, $second) = @animals;
13. print "First two animals: $first $second\n";
14. my @animals2 = @animals; # make a copy of @animals
15. @animals = (); # assign @animals an empty list -> destroys contents
16. print "Animals array now contains: @animals\n";
17. print "Animals2 array still contains @animals2\n";
Common Array Functions
We already saw push() as a way of adding an element to the end (tail) of a list. Naturally, you can add an element to the front (head) of a list, or remove elements instead of adding them. Try modifying your program to use the following set of functions: pop(), shift(), unshift(), and if you're really brave splice(). The last function is the hardest one to understand but also the most powerful because it allows you to add, remove, or substitute array elements at any position in the array, not just at the ends.
Function                           Meaning
push(@array, "some value")         add a value to the end of the list
$popped_value = pop(@array)        remove a value from the end of the list
$shifted_value = shift(@array)     remove a value from the front of the list
unshift(@array, "some value")      add a value to the front of the list
splice(...)                        everything above and more!
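The functions in the table can be sketched in sequence as follows. The animal names beyond those in array.pl are arbitrary, and the splice() call shows just one of its several forms (array, offset, number of elements to remove, replacement list).

```perl
#!/usr/bin/perl
use strict; use warnings;

my @animals = ('cat', 'dog', 'pig');

push @animals, 'fox';             # cat dog pig fox
my $popped = pop @animals;        # removes fox -> cat dog pig
unshift @animals, 'ant';          # ant cat dog pig
my $shifted = shift @animals;     # removes ant -> cat dog pig

# Replace 1 element starting at index 1 with two new elements
my @removed = splice(@animals, 1, 1, 'emu', 'owl');   # removes dog -> cat emu owl pig

print "@animals\n";
```

Both pop() and shift() return the element they remove, and splice() returns a list of whatever it removed, so nothing is lost unless you choose to ignore the return value.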
Task P9.4: Experiment with the array functions by adding some new lines to array.pl. Rather than just adding a text string to an array, try to see if you can use the push() or unshift() functions to add variables or even other arrays to existing arrays. For the shift() and pop() functions, try to see what happens if you don't assign the popped or shifted value to a variable. E.g. try to determine the difference between the following two lines of code:
my $value = pop(@array);
pop(@array);
More About Array Indexes
Let's consider a couple more indexing issues. Add the following lines but, before running the program, try to guess what will happen.
18. @animals = ('cat', 'dog', 'pig'); # needed because @animals was emptied
19. print "Animal at array position 1.2 is $animals[1.2]\n";
20. print "Animal at array position 1.7 is $animals[1.7]\n";
21. print "Animal at array position -1 is $animals[-1]\n";
22. print "array length = ", scalar(@animals), "\n";
Floating point values such as 1.2 or 1.7 are truncated to an integer, so both refer to array position 1. Using negative numbers for the array index positions has the effect of counting from the tail of the array. The scalar() function forces scalar context on its argument. As we know, an array gives its length in scalar context. Recall $length = @animals. The scalar() function does the same thing without the need to create an extra variable. Something else you can try is to look up an array element using a text string rather than a number. E.g. what happens if you try the following?
23. print "Animal at array position 'foobar' is ", $animals["foobar"], "\n";
You could substitute "foobar" for any text at all. The first thing that you should notice is that the Perl program should give you a useful warning message:
Argument "foobar" isn't numeric in array element at...
Strings such as "foobar" have a numeric value of zero and so if you use any text instead of a number when trying to look up a specific position in an array, you will always get the first (zero-th) element. Hopefully you will never try doing this.
Line 5 uses qw() to make an array. qw() is short for quote words. It's a little shorthand so that we don't have to keep typing quotation marks. Line 6 creates a string from an array with join(), and specifies that each element of the array should be joined with a comma followed by a space. The opposite function of join() is the split() function. This divides a string into an array. But we have to tell it where to split. This works sort of like a restriction digest but the restriction site is consumed in the process.
Task P10.2: Add the following lines to your program and run it.
8. my $dna = "aaaaGAATTCttttttGAATTCggggggg";
9. my $EcoRI = "GAATTC";
10. my @digest = split($EcoRI, $dna);
11. print "@digest\n";
If we want to convert a string into an array and split the string at every possible position, we need to use an empty string ("") in the split() function. This is often used to convert DNA/protein sequences stored in variables into arrays:
12. my @dna = split("", $dna);
13. print "@dna\n";
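Because split() and join() are opposites, they are often used as a pair to change the separator in a line of data. A minimal sketch follows; the comma-separated record and the GATTACA string are made-up examples, not data from this course.

```perl
#!/usr/bin/perl
use strict; use warnings;

my $record = "gene1,chr2,1050,+";        # hypothetical comma-separated record
my @fields = split(",", $record);        # gene1 chr2 1050 +
my $tabbed = join("\t", @fields);        # same fields, now tab-separated

my @bases    = split("", "GATTACA");     # one base per array element
my $rejoined = join("", @bases);         # back to the original string

print scalar(@fields), " fields, ", scalar(@bases), " bases\n";
```

As with the restriction digest analogy, the separator itself (here the comma) is consumed by split() and does not appear in any of the resulting elements.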
P11. Sorting
As in real life, lists are great, but sorted lists are even better. Imagine looking through a telephone book if it wasn't sorted... tedious. Perl has an incredibly flexible sorting function. But it's a little complicated, so you may want to come back and read this part again later.
Task P11.1: Create the following program and run it. How does Perl sort items in a list?
1. #!/usr/bin/perl
2. # sorting.pl
3. use strict; use warnings;
4.
5. my @list = qw( c b a C B A a b c 3 2 1); # an unsorted list
6. my @sorted_list = sort @list;
7. print "default: @sorted_list\n";
Line 6 calls the sort() function. This could have been written with parentheses around the @list part, but this is one of those cases where parentheses are usually left off. We assign the result of the sort to a new array, but we could have also overwritten the original array, e.g.
@list = sort @list;
Looking at the output, it should be clear that Perl sorts by ASCII value by default. It is using the cmp operator we saw earlier. What if you want to sort numerically? Then you would have to use the numeric comparison operator <=>. To specify this, you use an unfamiliar syntax.
8. @sorted_list = sort {$a <=> $b} @list;
9. print "numeric: @sorted_list\n";
In general, sorting routines compare pairs of values. In Perl, these values are held by the magic variables $a and $b. For this reason, you should not use these variable names in your own programs. Line 8 shows that $a and $b are compared numerically. The default sort is simply {$a cmp $b}. This code should produce a few warning messages (because we have the use warnings; statement), and this is because we are asking to sort values numerically but 'A', 'B', 'C' etc. are not numbers. As we saw previously, text has a numeric value of zero. So if you compare text as numbers, it does not sort alphabetically (and Perl warns us of this fact).
If you want to sort in reverse direction, you simply exchange the variables $a and $b.
10. @list = qw (2 34 -1000 1.6 8 121 73.2 0);
11. @sorted_list = sort {$b <=> $a} @list;
12. print "reversed numeric: @sorted_list\n";
What if you want to sort both numerically and alphabetically and you want no differentiation between capitals and lowercase? Perl can do this, of course, but the explanation will be left for later.
13. @sorted_list = sort {$a <=> $b or uc($a) cmp uc($b)} @list;
14. print "combined: @sorted_list\n";
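As a partial preview of how a combined comparison works: both <=> and cmp return 0 when two values tie, and because 0 is false, the or operator then moves on to the next comparison. The sketch below uses a made-up word list and our own array names to show a case-insensitive sort and a two-level sort.

```perl
#!/usr/bin/perl
use strict; use warnings;

my @words = qw(pear Apple fig banana Cherry);   # hypothetical data

# Ignore case by comparing lowercase copies of each pair
my @ignoring_case = sort { lc($a) cmp lc($b) } @words;

# Sort by length first; when lengths tie (<=> returns 0), fall back to cmp
my @by_length = sort { length($a) <=> length($b) or $a cmp $b } @words;

print "@ignoring_case\n";   # Apple banana Cherry fig pear
print "@by_length\n";       # fig pear Apple Cherry banana
```

In the second sort, Cherry comes before banana because the tie-breaking cmp is case-sensitive and uppercase letters have lower ASCII values than lowercase ones.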
P12. Loops
Loops are one of the most important constructs in programming. Once you have mastered loops, you can do some really useful programming. Loops allow us to do things like count from 1 to 100, or cycle through each element in an array, or even process every line in an input file. There are three main loops that you will use in programming: the for loop, the foreach loop, and the while loop.
The for Loop
The for loop generally iterates over integers, usually from zero to some other number. You can think of the integer as a 'loop counter' which keeps track of how many times you have been through the loop (just like a lap counter during a car race). The for loop has 3 components:
1. initialization - provide some starting value for the loop counter
2. validation - provide a condition for when the loop should end
3. update - how should the loop counter be changed in each loop cycle
If we return to the car race analogy, we can imagine a car having to drive 10 laps around a circular track. At the start of the race the car has not completed any laps so the loop counter would be initialized to zero. The race is clearly over when the counter reaches 10 and each lap of the track updates the counter by 1 lap.
Task P12.1: Create and run the following program.
1. #!/usr/bin/perl
2. # loop.pl
3. use strict; use warnings;
4.
5. for (my $i = 0; $i < 10; $i = $i + 1) {
6.     print "$i\n";
7. }
The syntax for a for loop requires the three loop components to be placed in parentheses and separated with semi-colons. Curly braces are then used to write the code that will be executed during each iteration of the loop. This code is usually indented in the same way that we indent blocks of code following if statements. In this loop we first declare a new variable (my $i) to act as our loop counter. It is a convention in programming to use $i as a loop variable name because of the use of i as a counter in mathematical notation (for example, as the index of a summation).
You could name your loop counter anything that you wanted to, but we suggest that for now you just use $i. Let's see what the three components of our loop are doing:
$i = 0 performs initialization, i.e. start our loop with $i equal to zero
$i < 10 performs validation, i.e. keep the loop going as long as $i is less than 10
$i = $i + 1 performs the update, $i is incremented by one during each loop iteration
It is very common in Perl that you want to take a number and just add 1 to it. In fact, it is so common that Perl has its own operator to do it, the increment operator:
$i++
This is more succinct and is the common way to increment a variable by one. Not surprisingly, you can also decrement a variable by one with --. Note that the 'update' component of the loop should describe a way of increasing (or decreasing) the value of $i otherwise the loop would never end.
Task P12.2: Try looping backwards and skipping.
8. for (my $i = 50; $i >= 45; $i--) {print "$i\n"}
9. for (my $i = 0; $i < 100; $i += 10) {print "$i\n"}
Since the blocks of code following these for loops are only one line long, they do not need a semi-colon at the end. We saw this earlier when making conditional statements tidy. Line 9 uses the += operator. This is a useful shortcut and in this case the result is exactly the same as if we had typed $i = $i + 10. Similar operators exist for subtraction (-=), multiplication (*=) etc. Note how the loop in line 9 is counting in tens and not incrementing by one at a time.
Task P12.3: Let's do something a little bit useful with a loop. This program computes the sum of integers from 1 to n, where n is some number on the command line. Of course you could compute this as (n+1) * n / 2, but what is the point of having a computer if not to do brute force computations?
1. #!/usr/bin/perl
2. # sumint.pl
3. use strict; use warnings;
4.
5. die "usage: sumint.pl <limit>\n" unless @ARGV == 1;
6. my ($limit) = @ARGV;
7. my $sum = 0;
8. for (my $i = 1; $i <= $limit; $i++) {$sum += $i}
9. print "$sum\n";
Line 5 is a usage statement. We saw this earlier in Project 1 and in this script it is just adding a check to ensure that we specify one (and only one) command-line argument when we run the script. Line 6 assigns the command-line argument to $limit. Line 7 creates a variable to hold the sum. Line 8 uses a loop to add the latest value of $i to the $sum variable.
Task P12.4: Write a program, factorial.pl, that computes the factorial of a number. Structurally, it will be very similar to sumint.pl, but of course you will be multiplying values instead of adding.
Task P12.5: One of the most common operations you will do as a programmer is to loop over arrays. Let's do that now. To make it interesting, we will loop over two arrays simultaneously.
1. #!/usr/bin/perl
2. # loops.pl
3. use strict; use warnings;
4.
5. my @animals = qw(cat dog cow);
6. my @sounds = qw(Meow Woof Moo);
7. for (my $i = 0; $i < @animals; $i++) {
8.     print "$i) $animals[$i] $sounds[$i]\n";
9. }
The for loop starts at 0, which is where all arrays start, and continues as long as the loop variable $i is less than the length of the array (which is found from the scalar context of an array).
The foreach Loop
The foreach loop allows you to iterate through the contents of an array without a numeric index. Instead, a temporary variable is set to the contents of each element. Add the following code to your program.
10. foreach my $animal (@animals) {
11.     print "$animal\n";
12. }
Here, $animal is the temporary variable. It changes from cat to dog to cow with each iteration of the loop. It is very common to name the temporary variable as a singular form of the array name, e.g. foreach my $protein (@proteins) {...}. You can also use the foreach loop in a numeric manner. If you are a lazy typist (potentially an admirable quality if you are concerned about RSI), you can even use for rather than foreach. Line 13 shows how to create a numeric list with the .. operator.
13. for my $i (0..5) {print "$i\n"}
The while Loop
The while loop continues to iterate as long as some condition is met, where the condition is some notion of True or False. The 'condition' part of a while loop can be as simple or as complex as you want it to be. Here is an example of a very simple while loop which keeps doubling a number until some limit is reached:
14. my $x = 1;
15. while($x < 1000){
16.     print "$x\n";
17.     $x += $x;
18. }
In this example the code will continue to loop while the value of $x is less than 1000, and $x is doubled for each iteration of the loop. It is important that the test condition will be testing something that is going to change. But Perl will allow you to write code which contains a pointless test condition.
Task P12.6: Add these lines to your program and run it.
19. while (0) {
20.     print "this statement is never executed because 0 is false\n";
21. }
22. while (1) {
23.     print "this statement loops forever\n";
24. }
The first while loop will never print anything at all because a zero value is always treated as false by Perl. So the loop will run only while the value of zero evaluates to true which is never going to happen. The second loop will start but never end because the test condition ('while 1 is true') is always true. In fact, anything which isn't zero (the number 0 or the string "0"), the null string (""), or an undefined value will evaluate as true. To stop this program, press Control+c in your terminal. This sends the Unix 'interrupt' signal to the program (you might want to commit that trick to memory). Let's try looping through an array with a while loop. Replace lines 10-12 with these.
13. while (@animals) {
14.     my $animal = shift @animals;
15.     print "$animal\n";
16. }
In each iteration through the loop, the array @animals is shortened by removing one item from the front of the list (using the shift function). The loop ends when the length of the array is 0 (empty). There are times when this kind of array-deletion construct is useful, but most of the time you will be looping through arrays with for or foreach. The do Loop The do loop is a variation of the while loop. Unlike the while loop, it always executes at least once. Do loops are not so common.
17. do {
18.     print "hello\n";
19. } while (0);
Congratulations! You have now learned about variables, numbers, math, strings, conditionals, arrays, and loops. Even though there is still a lot to learn, you have come a long way. You can now write some very useful programs.
Loop Control There are times when you will want a little more control in your loops. The next keyword immediately restarts the loop at the top and advances the loop variable. The redo keyword restarts the loop also, but does not advance the loop variable. The last keyword terminates the entire loop. P12.7: Here is a program that illustrates redo and last. It computes the prime numbers between 100 and 200.
1. #!/usr/bin/perl
2. # primes.pl
3. use strict; use warnings;
4.
5. my $n = 0;
6. while (1) {
7.     $n++;
8.     redo if $n < 100;
9.     last if $n > 200; # breaks out of while loop
10.
11.     my $prime = 1; # assumed true
12.     for (my $i = 2; $i < $n; $i++) {
13.         if ($n % $i == 0) {
14.             $prime = 0; # now known to be false
15.             last; # breaks out of for loop
16.         }
17.     }
18.     print "$n\n" if $prime;
19. }
Line 8 contains a redo. This short-circuits the while loop as long as $n is less than 100. You could have used next here also because there is no loop variable. Line 9 uses the last keyword to terminate the while loop, effectively ending the program, if $n is greater than 200. Lines 11-17 determine if a number is prime. This method starts off assuming $n is prime. It then checks all the numbers from 2 to $n - 1 to determine if $i is a factor of $n. If $i is a factor of $n (line 13) then there is no point in calculating any further because $n is not prime. So $prime is set to false (line 14) and the for loop is terminated (line 15).
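The primes program uses redo and last, but not next. As a minimal sketch (the file name and the numbers are our own and are not part of the course files), here is next skipping the rest of one iteration, alongside last ending the loop early:

```perl
#!/usr/bin/perl
# loop_control.pl (illustrative sketch, not a course file)
use strict; use warnings;

my @odd;                    # odd numbers collected so far
for my $n (1..20) {
    next if $n % 2 == 0;    # even number: move straight on to the next value of $n
    last if $n > 9;         # stop the whole loop once $n passes 9
    push @odd, $n;
}
print "@odd\n";
```

Because next advances the loop variable, an even number is simply skipped. Using redo on that line instead would re-run the block with the same even value of $n over and over.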
When to use each type of loop? There will be situations where you can use different types of loop structure to achieve exactly the same goal for a program. Conversely there are times when only one type of loop will do. It might not always be clear to you how to make the correct choice, but with practice it becomes more obvious. Feel free to experiment with different loop structures to see what works and what doesn't.
Count, Sum, and Mean
We already know how to do these.
Min, Max, and Median
The median value is at the middle of the sorted list of values. If the list has an even number of elements, then the median is the average of the two at the middle. The minimum and maximum are easily found from the sorted array.
Variance
Variance is the average squared difference from the mean. So compute the mean first and then go back through the values, find the difference from the mean, square it, and add it all up. In the end, you divide by n or n - 1 depending on whether you are computing the population or sample variance.
Standard Deviation
Simply the sqrt() of the variance.
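The descriptions above can be sketched as follows. This is only one way to lay it out: the data values and the variable names are made up for illustration, and a real program would read its numbers from a file or the command line.

```perl
#!/usr/bin/perl
# stats_sketch.pl (illustrative sketch, not a course file)
use strict; use warnings;

my @values = (2, 4, 4, 4, 5, 5, 7, 9);    # made-up data
my @sorted = sort {$a <=> $b} @values;    # numeric sort
my $n      = @sorted;                     # number of values

# min and max sit at the two ends of the sorted array
my ($min, $max) = ($sorted[0], $sorted[-1]);

# median: the middle value, or the average of the two middle values
my $mid = int($n / 2);
my $median;
if ($n % 2 == 1) {$median = $sorted[$mid]}
else             {$median = ($sorted[$mid - 1] + $sorted[$mid]) / 2}

# mean
my $sum = 0;
foreach my $value (@values) {$sum += $value}
my $mean = $sum / $n;

# variance: add up the squared differences from the mean
my $sum_sq = 0;
foreach my $value (@values) {$sum_sq += ($value - $mean) ** 2}
my $pop_var    = $sum_sq / $n;          # population variance
my $sample_var = $sum_sq / ($n - 1);    # sample variance
my $pop_sd     = sqrt($pop_var);        # population standard deviation

print "min $min, max $max, median $median, mean $mean\n";
print "population variance $pop_var, sample variance $sample_var\n";
print "population standard deviation $pop_sd\n";
```

Note that the sample variance divides by n - 1, so it needs at least two values.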
Then try running it against several files at once (by putting multiple file names on the command line).
1. #!/usr/bin/perl
2. # linecount.pl
3. use strict; use warnings;
4.
5. my $lines = 0;
6. my $letters = 0;
7. while (<>) {
8.     $lines++;
9.     $letters += length($_);
10. }
11. print "$lines\t$letters\n"; # \t is a tab character
Line 7 contains something you haven't seen before. This is the <> file operator. By default, this reads one line at a time from the file specified on the command line. If there are multiple files on the command line, it will read them all in succession. It even reads from stdin (standard input) if you include it in a pipe. True Perl magic!
The default variable $_
Line 9 is our first introduction to the default variable $_. Perl automatically assigns this variable in some settings. Here, $_ contains each line of the file. Although you don't see it, Perl is actually performing the following operation.
7. while ($_ = <>) {
But you should get used to using $_ because it is so common among Perl programs. Perl can also retrieve $_ by default in some functions. For example, without any arguments, print() will report $_. The following one-line program simply echoes the contents of a file.
while (<>) {print}
You can use $_ in loops too, but I prefer not to. Here is another one-liner in which $_ is used in place of a named loop variable.
for (0..5) {print}
Confusing? Yes, a little. But you do get used to it. For now, feel free to name all your variables. By the way, in addition to $_, there are a large number of other special variables with equally strange symbols.
The open() Function
There are times when you have several files and you don't want to read them all one after the other. For example, one might be a FASTA file and the other GFF. You wouldn't want to process both files with the same code. To open and read a single file, you use the open() function. This will open a file for reading or writing, but not both. Let's see how we use that.
Task P13.2: Create the following program. This will read the contents of a file that you specify on the command line and then create a second file with slightly altered contents.
1. #!/usr/bin/perl
2. # filemunge.pl
3. use strict; use warnings;
4.
5. open(IN, "<$ARGV[0]") or die "error opening $ARGV[0] for reading";
6. open(OUT, ">$ARGV[0].munge") or die "error creating $ARGV[0].munge";
7. while (<IN>) {
8.     chomp;
9.     my $rev = reverse $_;
10.     print OUT "$rev\n";
11. }
12. close IN;
13. close OUT;
Lines 5 and 6 contain open() statements for reading and writing. IN and OUT are called file handles. These are special variables used only for file operations. The second argument determines if the open() statement is for reading or writing. "<" is for reading and ">" is for writing. This should look familiar from your Unix lessons. If you do not include "<" or ">", then the file is opened for reading. Both of the open() statements include an additional "or" clause in case of failure. We will talk more about this later.
Line 7 should look a little familiar. Instead of using <> by itself, there is a named file handle inside the brackets. Only the file associated with IN will be read. Line 8 introduces the chomp() function. This removes a \n character from the end of a line if present. It is quite common to chomp your $_. Line 9 reverses $_. We haven't seen the reverse() function before. It reverses both strings and arrays. Lines 12 and 13 close the two file handles. You should always get into the habit of making sure that every open function has a matching close function. It is possible that bad things will happen if you don't close a file handle. You should also try to close a file handle at the first opportunity when it is safe to do so, i.e. as soon as you are finished with reading from, or writing to, a file.
Naming file handles
File handles are typically given upper case names. You can use lower-case names and your script will probably still work but Perl will also print out a warning. If you only ever read from one input file and write to one output file then IN and OUT are typical file handle names, though feel free to name them whatever you feel is most suitable (INPUT, DATA etc.). If you need to read from multiple files then it might be a good idea to use file handle names that describe the type of data, e.g. GFF or FASTA.
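You may also come across Perl code that writes open() with three arguments and stores the file handle in an ordinary scalar variable. That is not the style used in this course, but it does the same job. The sketch below is only an illustration: the file name example_input.txt is hypothetical, the program writes its own small input file so that it can run unaided, and $! (a special variable holding the system's error message) has not been introduced in the text.

```perl
#!/usr/bin/perl
# filemunge_lexical.pl (illustrative sketch, not a course file)
use strict; use warnings;

my $infile  = 'example_input.txt';    # hypothetical file name
my $outfile = "$infile.munge";

# create a tiny input file so that this sketch runs on its own
open(my $tmp, '>', $infile) or die "error creating $infile: $!";
print $tmp "ACGT\nTTGA\n";
close $tmp;

# the mode ('<' or '>') is a separate argument, and the handles are scalars
open(my $in,  '<', $infile)  or die "error opening $infile for reading: $!";
open(my $out, '>', $outfile) or die "error creating $outfile: $!";
while (my $line = <$in>) {
    chomp $line;
    my $rev = reverse $line;    # reverse the characters of the line
    print $out "$rev\n";
}
close $in;
close $out;
```

Keeping the mode separate from the file name means that a file name containing an odd character such as ">" cannot change how the file is opened.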
P14. Hashes
A hash is also called a dictionary or associative array. It is very similar to the kind of array we saw earlier except that instead of indexing the array with integers, the array is indexed with text. The dictionary analogy is fitting. A word is an index to its definition. A hash can be created in list context, just like an array. But since we need to provide the text index, it is necessary to provide key, value pairs.
Task P14.1: Create the following program. We have not seen the % sign in front of a variable before. This is the symbol for a hash variable. If we are including the use strict; statement then we will also need to declare any hashes with 'my'.
1. #!/usr/bin/perl
2. # hash.pl
3. use strict; use warnings;
4.
5. my %genetic_code = ('ATG', 'Met', 'AAA', 'Lys', 'CCA', 'Pro');
6. print "$genetic_code{'ATG'}\n";
Notice that when you want to access a value from a hash, you use curly brackets '{' rather than square brackets '['. Curly brackets let Perl know you are accessing a hash rather than an array. You could have variables named $A, @A, and %A, and they would all be different variables. Note that using the same name for different things in this way would be considered bad programming style. $A is a scalar. $A[0] is the first element of the @A array. $A{'cat'} is the value for the 'cat' key of the %A hash. When declaring hashes, there is an alternative syntax that makes the assignments more obvious. Here, we replace the comma between the key and the value with a kind of arrow =>. This reads as 'gets'. Or alternatively 'says'. So 'cat' => 'meow' reads as "cat says meow".
5. my %genetic_code = ('ATG' => 'Met', 'AAA' => 'Lys', 'CCA' => 'Pro');
This looks even more logical when split onto multiple lines.
5. my %genetic_code = (
6.     'ATG' => 'Met',
7.     'AAA' => 'Lys',
8.     'CCA' => 'Pro',
9. );
10. print "$genetic_code{'ATG'}\n";
The last comma in line 8 is unnecessary, but it does no harm, and we like tidy, consistent code. It turns out that when using the => syntax, Perl knows that you are assigning a hash, so the quotes around the keys are actually unnecessary.
5. my %genetic_code = (
6.     ATG => 'Met', # single quotes now removed from keys
7.     AAA => 'Lys',
8.     CCA => 'Pro',
9. );
10. print "$genetic_code{ATG}\n";
The quotes on the values are absolutely required in this example because the values are strings. You would not need them if the values were numbers. Keys and Values It's a simple matter to iterate through arrays because they have numeric indices from 0 to one less than the array size. For hashes, we must iterate over the keys, and for that, we need the various strings. Not surprisingly, this is performed with the keys() function. Task P14.2: Add the following code to your program to report the keys and corresponding values from your hash. It is very common to use the variable name $key in a foreach loop, although in this example $codon may also be a suitable choice.
11. foreach my $key (keys %genetic_code) {
12.     print "$key $genetic_code{$key}\n";
13. }
The keys() function returns an array of keys. Similarly, the values() function returns an array of values. Add the following lines to your program to observe this more explicitly.
14. my @keys = keys(%genetic_code);
15. my @vals = values(%genetic_code);
16. print "keys: @keys\n";
17. print "values: @vals\n";
Hashes store key-value pairs in a semi-random order (it's not random, but you have no control over it). So you will often want to sort the keys. Replace line 11 with the following.
11. foreach my $key (sort keys %genetic_code) {
Adding, Removing, and Testing Recall that for arrays, you generally either push() or unshift() to add new values to an array. You can also assign a value at an arbitrary index such as $array[999] = 5. Adding pairs to a hash is similar to assigning an arbitrary index. If you assume the key exists, Perl will create it for you. But watch out, if you use a key that previously existed, the value will be overwritten.
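As a stand-alone sketch of both cases (a key that is new and a key that already exists), consider the following. The 'Lysine' value is just an illustrative choice.

```perl
#!/usr/bin/perl
# hash_add.pl (illustrative sketch, not a course file)
use strict; use warnings;

my %genetic_code = (ATG => 'Met', AAA => 'Lys', CCA => 'Pro');

$genetic_code{CCG} = 'Pro';       # CCG is a new key, so a new pair is created
$genetic_code{AAA} = 'Lysine';    # AAA already exists, so 'Lys' is overwritten

foreach my $key (sort keys %genetic_code) {
    print "$key $genetic_code{$key}\n";
}
```

Two different keys (CCA and CCG) can share the same value, but any one key can only ever hold one value at a time.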
Line 18 adds a new key ('CCG') to the hash. Note that the value of this key also exists as the value to another key in the hash. Line 19 reassigns the value that the 'AAA' key points to. Sometimes you may want to ask if a particular key exists in a hash, for example, before overwriting something. To do this, you use the exists() function.
20. if (exists $genetic_code{AAA}) {print "AAA codon has a value\n"}
21. else {print "No value set for AAA codon\n"}
To remove a key and its value from a hash, you use the delete() function.
22. delete $genetic_code{AAA};
Function              Meaning
keys %hash            returns an array of keys
values %hash          returns an array of values
exists $hash{key}     returns true if the key exists
delete $hash{key}     removes the key and value from the hash
Hash names
If you work with a lot of hashes, it can sometimes help to make the hash name explain something about the data it contains. Hashes typically link pairs of connected data, e.g. name of sequence, and GC% content of that sequence; name of a politician, and the number of votes that they received. Based on these examples, which of the following hash names do you find easier to understand:
%seq %sequences; %sequence_details; %sequence2gc; %sequence_to_gc; %vote; %names; %name2votes; %name_to_votes;
Task P16.2: Now let's do something useful and determine the codon usage for a sequence given on the command line.
1. #!/usr/bin/perl
2. # codon_usage.pl by ___
3. use strict; use warnings;
4.
5. die "usage: codon_usage.pl <sequence>\n" unless @ARGV == 1;
6.
7. my ($seq) = @ARGV;
8. my %count = (); # individual codons
9. my $total = 0; # total codons
10.
11. # extract each codon from the sequence and count it
12. for (my $i = 0; $i < length($seq); $i += 3) {
13.     my $codon = substr($seq, $i, 3);
14.     if (exists $count{$codon}) {$count{$codon}++}
15.     else {$count{$codon} = 1}
16.     $total++;
17. }
18.
19. # report codon usage of this sequence
20. foreach my $codon (sort keys %count) {
21.     my $frequency = $count{$codon}/$total;
22.     printf "%s\t%d\t%.4f\n", $codon, $count{$codon}, $frequency;
23. }
Note that on line 8 we use a slightly different way of introducing a hash. The following lines of code are similar:
my %count; my %count = ();
The first example declares a new hash, and the second example additionally initializes the hash which means it will empty the hash of any data (if any existed). You may have noticed that line 20 adds a second 'my $codon' declaration to the script. It is important to realize that the $codon within this second foreach loop is completely different to the $codon that exists in the previous for loop. If this seems confusing, then you will have to wait a little longer before we give the full explanation for this. If it bothers you, then feel free to rename the second $codon variable to something else. Line 22 introduces the printf() function to format the output. printf() has a somewhat arcane syntax handed down from the C programming language. %s means string, %d means decimal integer, and %f means floating point. %.4f means 4 decimal places.
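To see the format codes in isolation, here is a small sketch with made-up numbers. It uses sprintf(), a built-in relative of printf() that has not been covered in the text: it takes the same format string but returns the formatted text instead of printing it.

```perl
#!/usr/bin/perl
# format_sketch.pl (illustrative sketch, not a course file)
use strict; use warnings;

my ($codon, $count, $total) = ('ATG', 3, 7);    # made-up numbers

# %s takes the codon, %d the count, %.4f the frequency to 4 decimal places
my $line = sprintf("%s\t%d\t%.4f", $codon, $count, $count / $total);
print "$line\n";
```

Here 3/7 is 0.428571..., so the %.4f field is rounded to 0.4286.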
Imagine doing this for all possible codons... tiresome. Ideally, we want a solution which would search for 'CCN' where N is A, C, G, or T. This is where regular expressions (regex) come in. Simply put, a regular expression describes a finite range of possibilities. Unix, Perl and other programming languages use a fairly standard way of implementing regular expressions (so anything you learn about them in Perl will be very useful if you use Unix commands like 'grep' or 'sed').
Task P17.2: Delete lines 6-10 and replace them with these:
6. if ($seq =~ m/CC./){
7.     print "Contains proline ($&)\n";
8. }
Well that was easy! In the context of regular expressions, the dot (.) on line 6 represents any single character. It should not be confused with the use of a dot as the concatenate operator. Line 7 contains a funny variable called $&. This is sort of like $_. Perl sets $& to the string matched by the most recent regular expression match.
Task P17.3: Change the regex to now see whether the sequence contains an arginine codon.
6. if ($seq =~ m/CG./){
7.     print "Contains arginine ($&)\n";
8. }
If you copied the sequence exactly as above, your script should be telling you that the $seq variable contains an arginine codon, even though it doesn't. Can you see why? The dot character will match any character, including a space. So the last two letters of the ACG codon plus the space that follows matches the pattern. A better solution is to restrict the match to any character that is within a specified set.
Task P17.4: Replace line 6 with this more specific pattern.
6. if ($seq =~ m/CG[ACGT]/) {
The square brackets allow you to denote a number of possible characters, any of which can match (this is known as specifying a character class). This is a much better solution when we have a limited range of characters. Note though, that even when you have many characters inside the square brackets, you are only ever matching one character in the target sequence. Biological sequences are sometimes represented as upper case and sometimes lower case. How do you handle this?
Task P17.5: Go back to line 5 and substitute some of the capital letters for lower case as in the example below.
5. my $seq = "ACG TAC GAA GAC ccA ACA GAT AGC gcg TGC CAG aaa TAG ATT";
There are two solutions to matching both upper and lower case. The first one is to use the square brackets to spell out every possible combination of upper or lower case letters that specify CCN.
6. if ($seq =~ m/[Cc][Cc][ACGTacgt]/){
The second option is much simpler. Use the ignore-case functionality of the matching operator. This just involves appending an 'i' after the second forward slash, and this will now mean that ccc, ccG, cCa, CcT, etc. will all count as a valid match.
6. if ($seq =~ m/CC[ACGT]/i){
Because there is no uppercase or lowercase standard for sequence files, it is good to always use the ignore-case option when working with sequences. This also works with the substitution operator. An alternative is to always convert a sequence to upper or lower case before you start processing it. The uc() and lc() functions perform these operations. Another useful option when specifying a character class is to use a dash to specify a range of characters or numbers.
9. if ($seq =~ m/[a-z]/){
10.     print "Contains at least one lower case letter\n";
11. }
Perl defines several symbols for common character classes. Two of the most useful ones are \s and \S which are used to match whitespace and non-whitespace respectively.
Anchors
To ensure that a pattern matches the beginning or ending of a string, one uses the ^ and $ symbols. This is the same as when using regexes in Unix.
Anti-classes
You can also specify characters that should not occur. Unfortunately, the ^ symbol is reused. But it is used inside square brackets. [^A] matches anything except capital A. \S is equivalent to [^\s].
Repetition
If you want to match several characters or character classes in a row, you use repetition symbols. /A+/ matches 1 or more As. If you want to match exactly 5 As, you could write /AAAAA/ or /A{5}/. To match a range, you specify the minimum and maximum number of characters such as /A{3,5}/. You can also specify zero or one with /A?/ and zero or more with /A*/.
Alternation
You can match more than one pattern at once if you separate them with pipe symbols. To match all stop codons you would use
/TAA|TAG|TGA/;
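The following sketch tries out anchors, an anti-class, repetition, and alternation on a single made-up sequence. The variable names are our own. Note one assumption that goes beyond the text so far: the stop codon test wraps the alternatives in parentheses so that the $ anchor applies to all three of them and not only to TGA.

```perl
#!/usr/bin/perl
# regex_features.pl (illustrative sketch, not a course file)
use strict; use warnings;

my $seq = "ATGAAACCCGGGTAG";    # made-up sequence

my ($has_start, $has_stop, $has_a_run, $all_dna, $has_other) = (0, 0, 0, 0, 0);

$has_start = 1 if $seq =~ /^ATG/;              # anchor: ATG at the very beginning
$has_stop  = 1 if $seq =~ /(TAA|TAG|TGA)$/;    # alternation, anchored at the end
$has_a_run = 1 if $seq =~ /A{3,5}/;            # repetition: a run of 3 to 5 As
$all_dna   = 1 if $seq =~ /^[ACGT]+$/;         # only A, C, G, or T from start to end
$has_other = 1 if $seq =~ /[^ACGT]/;           # anti-class: any other character at all

print "start $has_start, stop $has_stop, A-run $has_a_run, ";
print "DNA only $all_dna, other characters $has_other\n";
```

The last two tests are opposites of each other for any non-empty sequence: one asks whether every character is in the class, the other asks whether any character falls outside it.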
Backslash
Any of the special reserved characters can be matched by prefixing with a backslash. Therefore, you can match a dot (.) with \. and it will only match a dot. The backslash is also used to escape the special meaning of other characters in Perl. What if you wanted to print the value of the variable $answer but also include the text $answer in the output string? Or what if you wanted to print \n but not have it print a newline?
my $answer = 3;
print "\$answer is $answer\n";
print "This is a newline character: \\n\n";
Symbol    Meaning
.         any character
\w        alphanumeric and _
\W        any non-word character
\s        any whitespace
\S        any non-whitespace
\d        any digit character
\D        any non-digit character
\t        tab
\n        newline
*         match 0 or more times
+         match 1 or more times
?         match 1 or 0 times
{n}       match exactly n times
{n,m}     match n to m times
^         match from start
$         match to end
But you won't always have tab-delimited text. Some files are much more complex.
Task P18.1: Let's retrieve all the gene names and coordinates from a GenBank file. Take a look at the file Unix_and_Perl_course//Data/GenBank/E.coli.genbank and scroll down until you find the 'gene' keyword.
FEATURES             Location/Qualifiers
     source          1..4686137
                     /organism="Escherichia coli str. K12 substr. DH10B"
                     /mol_type="genomic DNA"
                     /strain="K-12"
                     /sub_strain="DH10B"
                     /db_xref="taxon:316385"
     gene            190..255
                     /gene="thrL"
                     /locus_tag="ECDH10B_0001"
                     /db_xref="GeneID:6058969"
The coordinates of the gene are given on the same line. One line below contains the gene name as /gene="thrL". Page down a bit and you will find a gene on the complement strand.
     gene            complement(5683..6459)
                     /gene="yaaA"
                     /locus_tag="ECDH10B_0006"
                     /db_xref="GeneID:6061859"
In order to parse this file, we must deal with genes on the complement strand and also the fact that all the information isn't on the same line. The following program reports the name and coordinates of all genes.
1. #!/usr/bin/perl
2. # parse_genes.pl
3. use strict; use warnings;
4.
5. while (my $line = <>) {
6.     if ($line =~ /^\s{5}gene/) {
7.         my ($beg, $end) = $line =~ /(\d+)\.\.(\d+)/;
8.         $line = <>;
9.         my ($name) = $line =~ /.*="(.+)"/;
10.         print "$name $beg $end\n";
11.     }
12. }
Lines 1-5 should look very familiar by now. Line 6 asks if $line starts with 5 spaces (^\s{5}) followed by the word 'gene'. GenBank format is very strict about how many spaces begin each line. Had we been lazy, we could have used \s* or \s+. Line 7 extracts the coordinates from $line and assigns them to the $beg and $end variables. Regular expressions in list context return values from parenthesized patterns. You might want to repeat the phrase a dozen times or so. It's that important. Line 8 gets another line of input because the gene name is one line below. Line 9 extracts the name using (.+) because gene names sometimes contain strange characters and spaces, even though they don't in E. coli. Nearly all CDSs have a gene name, but because a few don't, we have to capture lines that match either a /gene= or a /locus_tag= pattern. This is enabled by using the .* pattern.
More Info
We've only scratched the surface of regular expressions. For more information, read the Perl man pages.
man perlrequick
man perlre
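Returning to the rule that regular expressions in list context return the values from parenthesized patterns, this small sketch runs the same match twice, once in list context and once in scalar context. The $line string is typed in by hand with approximate spacing and is not read from the GenBank file.

```perl
#!/usr/bin/perl
# list_context.pl (illustrative sketch, not a course file)
use strict; use warnings;

my $line = "     gene            complement(5683..6459)";    # hand-typed example

# parentheses on the left give list context: we get the two captured strings
my ($beg, $end) = $line =~ /(\d+)\.\.(\d+)/;

# no parentheses on the left gives scalar context: we only learn if it matched
my $matched = $line =~ /(\d+)\.\.(\d+)/;

print "beg = $beg, end = $end, matched = $matched\n";
```

Forgetting the parentheses around a single variable, as in my $name = $line =~ /(.+)/, is a common slip: $name then holds 1 and not the captured text.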
We understood the 'or' as something that only happens if the file doesn't open. How exactly does that work? All Perl functions return a True or False value. False values are 0 and the empty string "". All other values are true. So we can understand the open() statement above as a more concise version of the following.
my $return_value = open(IN, "<$ARGV[0]");
if (not $return_value) { # open() returned a false value
    die "can't open file $ARGV[0]\n";
}
But why doesn't the die() statement get executed if the return value is True? Because the whole statement from open() to the semicolon is evaluated with Boolean logic. The Boolean operators are 'and', 'or', 'not'. Let's review how 'and' and 'or' behave.
True  and True  = True
True  and False = False
False and True  = False
False and False = False
True  or  True  = True
True  or  False = True
False or  True  = True
False or  False = False
If the open() function works, then the entire Boolean expression "open() or die" will be True. Perl does not attempt to evaluate more than it needs. So once open() succeeds, it short-circuits the rest of the statement. Back in P11.1 we saw this statement. What's going on here?
@list = sort {$a <=> $b or uc($a) cmp uc($b)} @list
The sorting function first compares $a and $b numerically. If their numeric values are equal (e.g. both are zero because they are non-numeric strings), the expression $a <=> $b returns zero, which is false. Perl must then evaluate the right side "uc($a) cmp uc($b)" to determine the value of the whole expression. So numbers get compared first, and if they are equal, they are further compared by ASCII value.
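Here is a small sketch of that sort applied to a made-up list that mixes numbers and letters. One addition is ours and not from the course: comparing letters with <=> triggers "isn't numeric" warnings under use warnings, so the sketch switches that one class of warning off inside a small block.

```perl
#!/usr/bin/perl
# sort_or.pl (illustrative sketch, not a course file)
use strict; use warnings;

my @list = (10, 'b', 2, 'A', 1, 'c');    # made-up mixed list
my @sorted;
{
    no warnings 'numeric';    # letters are treated as 0 by <=>, which normally warns
    @sorted = sort {$a <=> $b or uc($a) cmp uc($b)} @list;
}
print "@sorted\n";
```

The letters all have a numeric value of zero, so they sort ahead of the positive numbers and are ordered among themselves by the right-hand comparison, which ignores case. The numbers never reach the right-hand side unless two of them are equal.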
Tips
Make sure that the coding sequences are correct. Most should start with ATG and end with a stop codon. If they do not, you may need to improve your code.
Remember that sequence coordinates are 1-based but substr() is zero-based. You will have to subtract 1 from the sequence coordinates.
Some proteins may contain the peptide 'CDS' (cysteine-aspartate-serine) in them. So be careful with your regex.
Don't use this program with eukaryotes. Describing the joins of the various exons can take several lines, which makes parsing the file a little more difficult.
valid CDS\n";
sub print_error {
    print "$ARGV[0] is not a valid sequence for a CDS\n";
    print "It may not start with an ATG start codon\n";
    print "It may not end with a stop codon\n";
    print "It may contain non ATCGN DNA characters\n";
}
Lines 9, 12, and 15 all call the print_error() subroutine which is declared on line 21. Subroutines behave just like any other Perl function, but unlike built-in functions like print(), you must include parentheses. To declare a function/subroutine, you use the sub keyword. This is immediately followed by the name of the function and a block structure delimited by curly braces.
This is not a very good script: it prints the same error message regardless of what error is found in the sequence. However, you should see that by using a subroutine we only need to write the code to produce the error message in one place.
Task P20.2: When we use subroutines it is far more common to pass the subroutine one or more variables and get the subroutine to do something useful with those variables. Create the following program. It reads a file of sequences and computes the GC% of each one.
1. #!/usr/bin/perl
2. # gc.pl
3. use strict; use warnings;
4.
5. while (my $seq = <>) {
6.     chomp($seq);
7.     gc($seq);
8. }
9.
10. sub gc {
11.     my ($seq) = @_;
12.     $seq = uc($seq); # convert to upper case to be sure
13.     my $g = $seq =~ tr/G/G/;
14.     my $c = $seq =~ tr/C/C/;
15.     my $gc = ($g + $c) / length($seq);
16.     print "GC% = $gc\n";
17. }
Before you run this script you will need to create a text file which contains a few lines of DNA characters. Use the name of this file when you run the script, e.g. gc.pl dna_file.txt
Line 7 calls the gc() subroutine and passes it the $seq variable. To pass a variable to a subroutine, include it between the parentheses that follow the subroutine name. Lines 10-17 contain the gc() function. Because line 7 passes a variable to the subroutine, we must add code to receive it. Subroutines receive arguments via the special @_ array. Variables that are passed to the subroutine are stored in the @_ array. Note that $seq is used again within the subroutine; we'll explain why in the next section. For now, just accept that the $seq in the subroutine is unrelated to the other $seq. Line 11 shows a typical list assignment: the first element of the @_ array is copied to $seq. You may see some programs using the shift() function to remove elements of the @_ array. E.g.
my $seq = shift(@_);
my $seq = shift;
In the second example, shift is used without specifying an array name. If no array is specified the shift function uses the @_ array by default. Because the subroutine copies its argument out of @_ into its own variable (line 11), line 7 is effectively sending a copy of $seq to the gc() subroutine. In other words, $seq is unchanged by gc(). You can test this by adding a print "$seq\n" statement after line 7. Subroutines can be defined anywhere in a program. It's common to put them at the end of a Perl program. In other languages you might find them at the beginning. You can do it either way, but try to be consistent.
Task P20.3: The previous program demonstrated a much better use of subroutines, but it is still not ideal. Maybe we don't always want to print the value of GC% as soon as we calculate it. In general, we often want a subroutine to calculate something and send that back to wherever we called the subroutine from. We can do this in Perl by using return values within a subroutine. Let's make a script that uses the melting temperature code that we saw earlier in P15.1, but that now puts it in a subroutine.
1. #!/usr/bin/perl
2. # tm.pl
3. use strict; use warnings;
4.
5. while (my $seq = <>) {
6.     chomp($seq);
7.     my $tm = tm($seq);
8.     print "Tm = $tm\n";
9. }
10.
11. # calculate Tm
12. sub tm {
13.     my $seq = shift;
14.     my $A = $seq =~ tr/A/A/;
15.     my $C = $seq =~ tr/C/C/;
16.     my $G = $seq =~ tr/G/G/;
17.     my $T = $seq =~ tr/T/T/;
18.     my $tm = 2 * ($A + $T) + 4 * ($C + $G); # simple Tm formula
19.     return($tm);
20. }
First of all, let's look at line 19. This line returns a value from the subroutine. You can use return anywhere in the function and it will exit at that point. I.e. if there was a print statement on line 20, it would never be performed. You can return multiple values in a return statement or even none. Sometimes we just return 1 or 0 to indicate success or failure.
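As a sketch of both points, here is a subroutine with an early return and a return of two values at once. The name count_gc() is hypothetical and is not one of the course scripts.

```perl
#!/usr/bin/perl
# return_sketch.pl (illustrative sketch, not a course file)
use strict; use warnings;

# count_gc() is a hypothetical helper written for this example
sub count_gc {
    my ($seq) = @_;
    return (0, 0) if length($seq) == 0;    # early exit: nothing below this line runs
    $seq = uc($seq);                       # convert to upper case to be sure
    my $g = $seq =~ tr/G/G/;
    my $c = $seq =~ tr/C/C/;
    return ($g, $c);                       # two values returned as a list
}

my ($g, $c) = count_gc("ggCatGCa");        # list assignment receives both values
print "G = $g, C = $c\n";
```

The calling code collects the returned list with the same kind of list assignment that a subroutine uses to read @_.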
So what happens with that returned value? If we now look at line 7 we can see that the output of the tm() function is assigned to a variable. If we had wanted to make our code more concise (which is not always a good thing) we could have replaced lines 7 and 8 with:
print "TM = ", tm($seq), "\n";
If we didn't need to store the melting temperature in a variable, then we could just include it in a print() statement. We could also have replaced lines 18 and 19 with the following:
return(2 * ($A + $T) + 4 * ($C + $G));
In this case, Perl will first make the calculation of the melting temperature and return the resulting value. Most people find it easier to first store this result in a variable and then return the variable.
Task P20.4: So far we have only ever passed one variable to a subroutine and returned just one thing back to the calling function. It is very common to pass and return multiple arguments. It is also common to have multiple return statements which are all dependent on the outcome of some logical test. Modify the GC% script in order to pass two things to the subroutine: the sequence plus a GC% threshold (a floating point number which will be stored in a $threshold variable within the subroutine). If the GC content is above the value of $threshold then we will return "High GC" else we will return "Low GC". To simplify things, you can specify the sequence and the threshold value on the command line (instead of reading a file). We also want the script to print out whether each sequence is high or low GC, but that print statement must not be in the subroutine! You will have to look up how to pass two things to a subroutine. The end of the subroutine will look like the following:
sub gc {
    #
    # missing code to go here
    #
    $seq = uc($seq); # convert to upper case to be sure
    my $g = $seq =~ tr/G/G/;
    my $c = $seq =~ tr/C/C/;
    my $gc = ($g + $c) / length($seq);
    if ($gc > $threshold) {
        return("High GC");
    }
    else {
        return("Low GC");
    }
}
Why use subroutines?
As your programs get longer you might find yourself wanting to do the same thing more than once in your program. Maybe part of your program takes two input sequences and calculates the percentage similarity. Your program might then modify those sequences and then recalculate the percentage similarity. Without subroutines you would have to have the same lines of code in two places in your script. This is a bad idea. Where possible, code should be reused. As soon as you find yourself writing the same code in more than one place, you should think about putting that code in a subroutine. Subroutines can also help improve the readability of your code. Rather than see all of the details of how you calculate some mathematical function, it might be cleaner to keep that code in a subroutine and this keeps it hidden from the main body of the code.
Line 7 prints $seq but it now prints the version of $seq that was modified in the subroutine. Without the my declarations, they are no longer separate variables. This is probably not the behavior that we wanted. When we don't declare variables with my, they become global variables. Changing that variable in any one part of the program changes it everywhere else. We should never do this; it is just about the worst thing you can do as a programmer. To make sure we do not affect other parts of a program, we will always choose to make variables inside a function exist only within that function. The my keyword does this for us and it creates a lexical variable. A lexical variable lives and dies within a set of curly braces (a block). This means that we can reuse variable names to store different things as long as they exist within different blocks of code.
Task P21.1: The following program demonstrates the use of lexical variables.
1. #!/usr/bin/perl
2. # lexical.pl
3. use strict; use warnings;
4.
5. my $x; # declaration without assignment
6. $x = 1;
7. my ($y, $z) = (2, 3); # you can declare and assign, even as list
8. if ($x < $y) {
9.     my $z = 10;
10.     print "inside: X = $x, Y = $y, Z = $z\n";
11. }
12.
13. print "outside: X = $x, Y = $y, Z = $z\n";
It is critical that you completely understand the scope of a variable. The scope runs from the point where the variable is declared to the closing curly bracket of the enclosing block. There is a $z variable that is born at line 9 and dies at line 11. Importantly, this $z is not the same as the $z on line 7. The inner $z effectively hides the outer $z as soon as it is declared. Variables in a wider scope are visible in a narrower scope, so we can see $x and $y inside the block at line 10. Variables in a narrower scope do not exist in a wider scope. To see this more clearly, try changing lines 9 & 13 to the following.
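9. my ($z,$q) = (10,15);
13. print "outside: $x $y $z $q\n";
$q is declared inside the block, so it no longer exists at line 13. Because the program uses strict, Perl will refuse to run it.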
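If the idea of one variable hiding another is still fuzzy, the following small sketch may help. It is not part of the course scripts; the subroutine name scope_demo() is made up, and a bare block stands in for the if statement.

```perl
#!/usr/bin/perl
use strict; use warnings;

# Returns what the inner block sees and what the outer scope sees afterwards.
sub scope_demo {
    my $z = 3;
    my $inside;
    {
        my $z = 10;          # a new $z that hides the outer one
        $inside = $z;
    }                        # the inner $z dies here
    return ($inside, $z);    # the outer $z was never touched
}

my ($inner, $outer) = scope_demo();
print "inside the block:  $inner\n";
print "outside the block: $outer\n";
```

The two values differ because the two variables only share a name; they never shared storage.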
Loop Variables
Lexical variables in loops look a little strange because they are declared outside the curly braces.
1. for (my $i = 0; $i < 10; $i++) {
2.
3. }
4. foreach my $seq (@seq) {
5.
6. }
$i is declared on line 1 and dies at line 3. So even though it appears outside the curly braces, its scope is actually the entire loop. $seq is born anew with each iteration of the loop at line 4 and dies each time at line 6.
Safer programming: use strict
All variables should be lexical variables. To ensure this behavior, include "use strict" in your programs. In fact, your programs should always contain a line like this.
use strict; use warnings;
You may run into someone who thinks that strict and warnings are a hassle. Feel free to talk to, dine with, or even marry this person, but in no circumstances should you share code with them!
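To see what strict protects you from, the sketch below compiles two snippets of code at run time and reports whether strict accepted them. The helper name compiles_under_strict() is hypothetical, and the string form of eval is used here only to make the compile error observable from inside a running script; it is not something you would normally need.

```perl
#!/usr/bin/perl
use strict; use warnings;

# Compile a snippet of code at run time and report whether strict accepted it.
# Hypothetical helper for this demonstration only.
sub compiles_under_strict {
    my ($code) = @_;
    my $ok = eval "use strict; $code; 1";
    return $ok ? 1 : 0;
}

my $good = 'my $apple = 1; $apple++';
my $typo = 'my $apple = 1; $appple++';   # note the extra p

print "correct spelling compiles: ", compiles_under_strict($good), "\n";
print "misspelled name compiles:  ", compiles_under_strict($typo), "\n";
```

Without strict, the misspelled name would silently become a brand new global variable and the program would carry on with the wrong value.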
Task P22.2: Here is an alternative approach using a single loop that reuses our gc() function. Replace lines 11-15 with the following two lines and then copy the gc() subroutine into the script. This strategy is slightly less efficient because there is some overhead in every function call. But I think you will agree that it reads much better!
my $subseq = substr($seq, $i, $window);
printf "%d\t%.3f\n", $i, gc($subseq);
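Put together, the whole approach might look like the sketch below. We assume here that gc() returns the GC fraction as a number between 0 and 1 (which is what the %.3f format suggests), and the window size and sequence are hard-coded example values rather than command-line arguments.

```perl
#!/usr/bin/perl
use strict; use warnings;

# GC fraction of a sequence, assumed here to be a number between 0 and 1.
sub gc {
    my ($seq) = @_;
    return 0 if length($seq) == 0;     # avoid dividing by zero
    my $count = ($seq =~ tr/GCgc//);   # tr returns how many characters matched
    return $count / length($seq);
}

my ($window, $seq) = (4, "ATGCGCGATATA");   # example values only
for (my $i = 0; $i < length($seq) - $window + 1; $i++) {
    my $subseq = substr($seq, $i, $window);
    printf "%d\t%.3f\n", $i, gc($subseq);
}
```

The loop body says what it does (take a window, report its GC content) and leaves the counting details to the subroutine.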
Task P22.3: Did you notice that both of the previous sliding window algorithms recount the same bases? Imagine a window of 1000 bases. The total number of Cs and Gs is not going to change much as the window slides over one more position. In fact, the combined count of Gs and Cs can change by at most 1. Why count 1000 letters when you only need to change one value? You don't have to. If you count the Cs and Gs in the initial window, you can then update the counts as you slide along. This algorithm turns out to be much more efficient for large windows. You might want to come back to this task at a later time. It doesn't introduce any new concepts, but the code is definitely more complicated.
#!/usr/bin/perl
# sliding_fast.pl
use strict; use warnings;

die "usage: sliding_fast.pl <window> <seq>" unless @ARGV == 2;
my ($window, $seq) = @ARGV;

# initial window
my $gc_count = 0;
for (my $i = 0; $i < $window; $i++) {
    my $nt = substr($seq, $i, 1);
    if ($nt =~ /[CG]/i) {$gc_count++}
}
printf "%d\t%.3f\n", 0, $gc_count/$window;

# all other windows; $i is the last position of each new window
my $limit = length($seq);
for (my $i = $window; $i < $limit; $i++) {
    my $prev = substr($seq, $i - $window, 1); # base that leaves the window
    my $next = substr($seq, $i, 1);           # base that enters the window
    if ($prev =~ /[CG]/i) {$gc_count--}
    if ($next =~ /[CG]/i) {$gc_count++}
    printf "%d\t%.3f\n", $i - $window + 1, $gc_count/$window;
}
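Whenever you replace a simple algorithm with a cleverer one, it is worth checking that both give the same answers. The sketch below does this for GC counts. The two subroutine names and the test sequence are ours, chosen for illustration; the subroutines return raw counts per window rather than printing fractions.

```perl
#!/usr/bin/perl
use strict; use warnings;

# Recount every window from scratch (slow but obviously correct).
sub gc_counts_naive {
    my ($seq, $window) = @_;
    my @counts;
    for (my $i = 0; $i + $window <= length($seq); $i++) {
        my $sub = substr($seq, $i, $window);
        push @counts, ($sub =~ tr/GCgc//);
    }
    return @counts;
}

# Count the first window, then adjust by the base that leaves
# and the base that enters as the window slides one position.
sub gc_counts_incremental {
    my ($seq, $window) = @_;
    return () if $window < 1 or $window > length($seq);
    my $first  = substr($seq, 0, $window);
    my $count  = ($first =~ tr/GCgc//);
    my @counts = ($count);
    for (my $start = 1; $start + $window <= length($seq); $start++) {
        my $leaving  = substr($seq, $start - 1, 1);
        my $entering = substr($seq, $start + $window - 1, 1);
        $count-- if $leaving  =~ /[CG]/i;
        $count++ if $entering =~ /[CG]/i;
        push @counts, $count;
    }
    return @counts;
}

my ($seq, $window) = ("GATTACAGGCCGCATTA", 5);   # example values only
my @slow = gc_counts_naive($seq, $window);
my @fast = gc_counts_incremental($seq, $window);
print "naive:       @slow\n";
print "incremental: @fast\n";
print "same answer: ", ("@slow" eq "@fast" ? "yes" : "no"), "\n";
```

If the last line ever prints "no", the fast version has a bug and the slow version tells you what the answer should have been.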
Sometimes you must choose between readability and speed. Most of the time, you should let readability take precedence. Why? Because readable code is easier to debug and maintain. If you absolutely need something to run faster, there are a variety of possible solutions, including (a) buying a faster computer, (b) changing the structure of the algorithm, or (c) programming in a compiled language such as C.
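The gc() function can now be used in any program you write as long as Library.pm is in the same directory as the script that wants to use it. (Since Perl 5.26 the current directory is no longer searched for libraries by default, so on a recent version of Perl you may also need the line "use lib '.';" before "use Library;".) Now let's see how we use libraries.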
Task P23.2: Go back to sliding.pl and insert the line "use Library;". One generally puts such statements at the top of a program, but you can put them anywhere. This simple statement allows the program to use any of the functions in the library. To call gc(), we must prepend the function call with the library name Library::gc() as in line 9. The reason for this is that we might be using several libraries. So Library::gc() and OtherLibrary::gc() are separate functions.
1. #!/usr/bin/perl
2. # sliding.pl
3. use strict; use warnings;
4. use Library;
5. die "usage: sliding.pl <window> <seq>" unless @ARGV == 2;
6. my ($window, $seq) = @ARGV;
7. for (my $i = 0; $i < length($seq) - $window +1; $i++) {
8.     my $subseq = substr($seq, $i, $window);
9.     printf "%d\t%.3f\n", $i, Library::gc($subseq);
10. }
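The point about separate functions can be tried out without creating any extra files. In the sketch below, two packages are written in a single script purely for brevity; normally each would live in its own .pm file. OtherLibrary and its gc() function are invented for this example, and we assume gc() in Library returns a GC fraction.

```perl
#!/usr/bin/perl
use strict; use warnings;

# Two hypothetical libraries, written in one file for brevity.
package Library;
sub gc {
    my ($seq) = @_;
    my $count = ($seq =~ tr/GCgc//);
    return $count / length($seq);      # GC fraction
}

package OtherLibrary;
sub gc {
    return "garbage collected";        # unrelated function, same name
}

package main;                          # back to the main program
printf "%.3f\n", Library::gc("ATGCGC");
print OtherLibrary::gc(), "\n";
```

Because each function is addressed by its full name, neither one can be called by accident in place of the other.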
Another way to run an external program is with a system() call. Whatever you put into a system() call is run just as it would be on the Unix command line. Unlike the open() function, which returns a false value when it fails, the system() function returns 0 when it succeeds (there are good reasons for this, but for now let's just be angry about it). It is generally preferable to use the system() function rather than backticks, as this gives you more control over testing whether the Unix command that you ran actually worked. Add the following lines to your program.
9. system("ls > foo") == 0 or die "Command failed\n";
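If you need to know how a command failed, and not just that it failed, you can look at the special variable $? after the call. The following is a minimal sketch: run_and_report() is a hypothetical helper, and it runs perl itself (through the built-in variable $^X) as the external command so that the example does not depend on any other program being installed.

```perl
#!/usr/bin/perl
use strict; use warnings;

# Run a command and return its exit code (0 means success).
# Hypothetical helper; the list form of system() bypasses the shell.
sub run_and_report {
    my @command = @_;
    my $status = system(@command);
    die "could not run @command: $!\n" if $status == -1;
    return $? >> 8;                    # the exit code is in the high byte of $?
}

# $^X is the path of the perl that is running this script.
my $ok   = run_and_report($^X, '-e', 'exit 0');
my $fail = run_and_report($^X, '-e', 'exit 3');
print "first command exit code:  $ok\n";
print "second command exit code: $fail\n";
```

By convention a Unix program exits with 0 when all went well, which is why system() returns 0 on success.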
You now have a file called foo that contains your file list. To get this into your program you can use open() as we have seen before. On line 11 we introduce a shorthand for reading all the lines of a file at once. Be careful with this because you could run out of memory if you slurp up a big genome.
10. open(IN, "< foo") or die "Can't open foo\n";
11. my @files = <IN>; # reads the entire file into @files
12. close IN;
13. foreach my $file (@files) {print "$file\n"}
You will most commonly use file handles to read from files, or write to files. However, file handles can also be used in connection with 'pipes', which act just like pipes in Unix. So you can establish a file handle which acts as a pipe that receives input from a Unix command (go back to the Unix lesson if you need a reminder). The program is run, and the output of the Unix command is sent directly to the file handle.
14. # file handle 'IN' will now receive output from the 'ls' command
15. open(IN, "ls |") or die;
16. while (my $line = <IN>) {
17.     print "file: ", $line;
18. }
If we reverse things and put the pipe symbol before the command, then we can even use open() to send input to a program!
19. # the file handle OUT will now connect to the Unix wc command
20. open(OUT, "| wc") or die;
21. print OUT "this sentence has 1 line, 10 words, and 51 letters\n";
22. close OUT;
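You will also see pipes opened in a slightly different style in other people's code: a lexical file handle (a "my" variable instead of a bare word like IN) together with a separate mode argument, where '-|' means "read from this command". The sketch below shows that style. It assumes a Unix-like system, lines_from_command() is a made-up helper name, and perl itself ($^X) is used as the external command so nothing else needs to be installed.

```perl
#!/usr/bin/perl
use strict; use warnings;

# Read all output lines of a command through a pipe.
# Hypothetical helper using a lexical file handle ($in) and the
# multi-argument form of open(); '-|' means "read from this command".
sub lines_from_command {
    my @command = @_;
    open(my $in, '-|', @command) or die "Can't run @command: $!\n";
    my @lines = <$in>;
    close($in) or die "@command did not finish cleanly\n";
    chomp(@lines);                     # remove the trailing newlines
    return @lines;
}

# $^X is the path of the perl that is running this script.
my @lines = lines_from_command($^X, '-e', 'print "one\ntwo\n"');
foreach my $line (@lines) {
    print "line: $line\n";
}
```

Checking the return value of close() matters for pipes, because that is where you find out whether the command itself failed.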
To use Getopt::Std, you must first define global variables called $opt_something where the something is a single letter. For example, if you wanted a command line option -v to indicate that the program should display its version number, you need a global variable called $opt_v. To define a global variable, you can use the "use vars" method or the "our" method (sort of like "my" except for global rather than lexical variables). Both syntaxes are displayed below. You also have to tell Getopt::Std that you want to parse the command line. You do this with the getopts() function. The syntax is a little strange. If the option takes arguments, you follow the letter with a colon. So, getopts('x') signals that -x takes no arguments while getopts('x:') signals that -x requires an argument. The example below shows how you can mix both kinds. Try running the program below with a bunch of different options and see what happens. Note that the options are removed from the command line. So @ARGV never contains the options.
1. #!/usr/bin/perl
2. # getopt.pl
3. use strict; use warnings;
4. use Getopt::Std;
5. use vars qw($opt_h $opt_v);
6. our $opt_p; # alternative to use vars
7. getopts('hvp:');
8.
9. my $VERSION = "1.0"; # it's a good idea to version your programs
10.
11. my $usage = "
12. usage: getopt.pl [options] <arguments...>
13. options:
14.   -h help
15.   -v version
16.   -p <some parameter>
17. ";
18.
19. if ($opt_h) {
20.     print $usage; # it's common to provide a -h to give help
21.     exit;
22. }
23.
24. if ($opt_v) {
25.     print "version ", $VERSION, "\n";
26.     exit;
27. }
28.
29. if ($opt_p) {print "Parameter is: $opt_p\n"}
30.
31. print "Other arguments were: @ARGV\n";
Lines 21 and 26 introduce the exit() function. Unlike die(), exit() terminates the program immediately without printing an error message.
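Getopt::Std can also store the options in a hash instead of in global variables, if you give getopts() a reference to a hash as its second argument (the backslash in \%opt makes the reference). Here is a minimal sketch of that style. The wrapper parse_args() is a hypothetical helper that lets us feed in example arguments without typing them on the command line.

```perl
#!/usr/bin/perl
use strict; use warnings;
use Getopt::Std;

# Parse a list of arguments and return (options hash ref, leftovers).
# Hypothetical wrapper; getopts() itself always reads @ARGV.
sub parse_args {
    local @ARGV = @_;                  # temporary @ARGV for this call only
    my %opt;
    getopts('hvp:', \%opt) or die "unknown option\n";
    return (\%opt, [@ARGV]);
}

my ($opt, $rest) = parse_args('-v', '-p', 'blast', 'file1', 'file2');
print "version requested\n"    if $opt->{v};
print "parameter: $opt->{p}\n" if defined $opt->{p};
print "other arguments: @$rest\n";
```

With this form there is nothing to declare with "use vars" or "our", because every option ends up as a key in one lexical hash.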
References
Up to now, we have never passed two arrays or hashes to a subroutine. Why? Because the arrays would be flattened into a single list as they pass through @_. Consider the following code:
sub compare_two_arrays {
    my (@a, @b) = @_;
}
The intent is to fill up @a and @b from @_. Unfortunately, @_ is one flat list, so Perl cannot tell where the first array ends and the second begins. So what happens is that @a gets all of the data and @b gets none. But surely, we want to be able to make comparisons of arrays. To do this, we must turn an array into a scalar value called a reference. This is done quite simply with the backslash \ operator. To get at a particular element of the array through the reference, we use the arrow operator -> and square brackets.
my @array = qw(cat dog cow);
my $array_ref = \@array;
print $array_ref->[0], "\n"; # prints cat
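With references, the subroutine that failed earlier can be made to work. The following is one possible sketch, not the only way to write it: the subroutine receives two array references and counts the positions at which the arrays agree. The notation @{$ref_a} stands for the whole array behind the reference.

```perl
#!/usr/bin/perl
use strict; use warnings;

# Count how many positions hold the same value in both arrays.
# Each argument is a reference, so the two arrays stay separate.
sub compare_two_arrays {
    my ($ref_a, $ref_b) = @_;
    my $same = 0;
    for (my $i = 0; $i < @{$ref_a} and $i < @{$ref_b}; $i++) {
        $same++ if $ref_a->[$i] eq $ref_b->[$i];
    }
    return $same;
}

my @pets = qw(cat dog cow);
my @farm = qw(cat pig cow hen);
print compare_two_arrays(\@pets, \@farm), " positions match\n";
```

Only two scalars travel through @_, so there is no ambiguity about where one array ends and the next begins.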
We can also create references to hashes and dereference them with the arrow operator. Note that we use curly brackets here to show that the scalar value is a reference to a hash.
my %hash = (cat => 'meow', dog => 'woof', cow => 'moo');
my $hash_ref = \%hash;
print $hash_ref->{cat}, "\n"; # prints meow
To dereference the entire array or hash, rather than a specific element, we wrap the reference in curly brackets and put @ or % in front, as follows.
print join("\t", @{$array_ref}), "\n";
foreach my $key (keys %{$hash_ref}) {
    print $key, "\t", $hash_ref->{$key}, "\n";
}
Here, the arrays in square brackets are references to arrays. But these arrays have no names, so they are called anonymous arrays. In a multi-dimensional array, the first dimension is a reference to other dimensions. References are scalar values, but they point to arrays, hashes, and some other types. To dereference a scalar, you use the -> notation. In a multi-dimensional context, the -> symbols are implied. Previously, we used $matrix[0][0], but this can also be understood more explicitly as $matrix[0]->[0]. But use the former syntax, not the latter. When constructing multi-dimensional structures, the various dimensions can be hashes or arrays, or a mixture. The dimensions need not even be the same size.
my @matrix = (
    [1, 2],
    {cat => 'meow', dog => 'woof', cow => 'moo'},
    [{hello => 'world'}, {foo => 'bar'}],
);
print $matrix[0][1], "\n";
print $matrix[1]{cat}, "\n";
print $matrix[2][1]{foo}, "\n";
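To walk over every cell of such a structure, you dereference each row in turn. The sketch below uses a plain array of arrays with rows of different lengths; matrix_total() is a hypothetical helper, and the two print statements show that the implied and explicit arrow forms reach the same cell.

```perl
#!/usr/bin/perl
use strict; use warnings;

# Sum every cell of an array of arrays; rows may differ in length.
sub matrix_total {
    my @matrix = @_;                   # each element is an array reference
    my $total = 0;
    foreach my $row (@matrix) {
        foreach my $cell (@{$row}) {   # @{$row} is the whole row
            $total += $cell;
        }
    }
    return $total;
}

my @matrix = (
    [1, 2, 3],
    [4, 5],
);
print "implied arrow:  $matrix[1][0]\n";
print "explicit arrow: $matrix[1]->[0]\n";
print "total: ", matrix_total(@matrix), "\n";
```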
Records
One of the most common kinds of reference you will see is a hash reference. Hash references are often used to store record-like data.
my @authors = (
    {first => 'Ian', last => 'Korf', middle => 'F'},
    {first => 'Keith', last => 'Bradnam', middle => 'R'},
);
foreach my $author (@authors) {
    print $author->{last}, ", ", $author->{first}, " ", $author->{middle}, ".\n";
}
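One advantage of records is that you can reorder them by any field. Here is a minimal sketch that sorts the same records alphabetically by last name; sort_by_last_name() is a name we made up, and $a and $b are the special variables that Perl's sort function provides for comparing two elements.

```perl
#!/usr/bin/perl
use strict; use warnings;

# Return the records ordered alphabetically by the 'last' field.
sub sort_by_last_name {
    my @records = @_;
    my @sorted = sort { $a->{last} cmp $b->{last} } @records;
    return @sorted;
}

my @authors = (
    {first => 'Ian',   last => 'Korf',    middle => 'F'},
    {first => 'Keith', last => 'Bradnam', middle => 'R'},
);

foreach my $author (sort_by_last_name(@authors)) {
    print $author->{last}, ", ", $author->{first}, " ", $author->{middle}, ".\n";
}
```

Each record stays intact while it moves, because what gets sorted is a list of references, not the individual fields.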
What next?
If you've come this far, you've done very well. You can now pick up a variety of Perl books and start to learn more advanced and specialized topics. As always, you will learn Perl much more quickly if you have some real-world problems that you need to write a script for. This doesn't have to be work related. If you have any text files that contain data of some sort, then you can probably think of a Perl script to do something with that data. E.g. you could work out the average rating of each artist in your iTunes library by writing a script to parse the 'iTunes Music Library.xml' file that is produced by iTunes. Sometimes the best way of improving your Perl is when you have to fix or improve someone else's script. Seeing how other people code will give you ideas and make you realize what works well and what doesn't. There is a lot of freely available Perl code on the web (just search Google for "perl script to do x, y, and z") and you will often find that you can adapt other people's code. But it usually is much more fun to write your own!
Troubleshooting guide
Introduction
The next few pages list many of the common error messages that you might see if you are having problems with your Perl script. They are broadly divided into three categories:
1) Errors that are caused before your Perl code is even evaluated
2) Errors in the code itself (most commonly, very simple syntax errors)
3) Other mistakes (sometimes achieved by great feats of stupidity)
If there is a problem with your script, you will sometimes see a lot of errors appear when you try to run it. It pays to try to understand these error messages. With time, you will become quicker at fixing errors, or at least knowing where to look first. Many text editors like Smultron are specifically designed for working with programming languages, and they can help you hear and see problems as you create them. Smultron beeps to warn you if you have entered too many closing brackets or parentheses. It also colors code that appears in between pairs of quotation marks, so you quickly see if you have typed one quotation mark too many.
How to troubleshoot
Programming languages like Perl have sophisticated, and therefore complicated, debugging tools. But for simple scripts, these tools can be overkill. Here is some simpler advice on how to go about fixing your scripts:
1) Stay calm and don't blame the computer. In nearly all cases, the computer is only ever doing what you have told it to do. Keeping a clear head will help you find the problem.
2) Check and re-check your code. Most errors are due to simple typos in your script, and sometimes you will be looking at the error without realizing that it is the error.
3) Start with the first error message that you see. Subsequent error messages often all stem from the first problem in your script. Fix one, and you may fix them all.
4) If you think a problem is due to an error on a single line of code, then you can comment out that line (by adding a # character to the start of the line). Then save and re-run your script to see if it now works. If it does, then you have confirmed which line contained the problem. Note that this is not appropriate for commenting out a single line of a block of code, e.g. the first line of an if statement.
5) Sometimes a program will partly work, but fail at some point within your code. Consider adding simple 'print' statements to work out where the program is failing.
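As an illustration of point 5, the sketch below reports its progress as it goes. It uses warn() rather than print so that the messages go to the error stream and do not get mixed up with the program's real output; the $debug switch and the average() subroutine are made up for this example.

```perl
#!/usr/bin/perl
use strict; use warnings;

# Set this hypothetical switch to 0 to silence the progress messages.
my $debug = 1;

# Average of a list of numbers, with optional progress messages.
sub average {
    my @values = @_;
    warn "average() received ", scalar(@values), " values\n" if $debug;
    return 0 unless @values;
    my $sum = 0;
    foreach my $value (@values) {
        $sum += $value;
        warn "  added $value, running sum is $sum\n" if $debug;
    }
    return $sum / @values;
}

print "average: ", average(2, 4, 9), "\n";
```

If the program stops or goes wrong part way through, the last message that appeared tells you roughly where to start looking.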
syntax error at script.pl line X, near YYY
Syntax errors are among the most frequent errors that you will see. On the plus side, they are usually very easy to fix. On the negative side, they can sometimes be very hard to spot as they frequently involve a single character that is either missing or surplus to requirements. Most commonly they might be because of:
unmatched parentheses - like brackets, items that are in (parentheses) should always be a double act.
missing semi-colon - If you start writing some code, then it has to end (at some point) with a semi-colon. The main exceptions to this rule are for the very first line of a script (#!/usr/bin/perl) or when a line ends in a closing curly bracket '}'. Also note that you can write one line of Perl code across several lines of your text editor, but this is still one line of code, and so needs one semi-colon.
missing comma - Perl uses commas in many different ways. Have you forgotten to include one in a place where Perl requires one?
inventing new Perl commands and operators - if you write if ($a === $b), then you have invented a new operator (===) which will cause a syntax error as Perl will have no idea what you mean.
Can't find string terminator '"' anywhere before EOF at script.pl line X
Did you make sure that you have pairs of quotation mark characters? If you have an odd number of single or double quote characters, then you might see this error.
Use of uninitialized value in...
Your scripts will do many things with variables. You will add their values, calculate their lengths, and print their contents to the screen. But what if the variable doesn't actually contain any data? Maybe you were expecting to fill it with data from the command-line or from processing a file, but something went wrong? If you try doing something with a variable that contains no data, you will see this error.
Global symbol "$variable" requires explicit package name at...
You wouldn't happen to be using the strict package and not declaring a variable with 'my', would you? If you definitely have included use strict; then maybe check that all your variable names are spelled correctly. You might have introduced a variable as my $apple but then later incorrectly referred to it as $appple.
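For the uninitialized value warning in particular, the usual cure is to test a variable with defined() before you use it. The following sketch shows one way to do that; describe() is a hypothetical helper, and the sequence is taken from the first command-line argument if there is one.

```perl
#!/usr/bin/perl
use strict; use warnings;

# Describe a sequence, coping with the case where none was supplied.
# Testing with defined() first means we never use an undefined
# value in a string or a calculation.
sub describe {
    my ($seq) = @_;
    return "no sequence" unless defined $seq;
    return "sequence of length " . length($seq);
}

my $from_command_line = $ARGV[0];      # undefined if no argument was given
print describe($from_command_line), "\n";
```

Run it once with a sequence as an argument and once without to see both branches.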
Other errors
Program changes not saved
If you make changes to your program but don't save them, then those changes will not be applied when you run the script. Always check that the script you are running is saved before you run it. If you are using a graphical text editor on an Apple computer, then you will always see a black dot within the red 'close window' icon on the top left of a window when there are any unsaved changes.
Program that you are editing is not the same as program you are running
Occasionally, you might make copies of your programs and your directory might end up with programs named things like script1.pl, script1b.pl, script2.pl, new_script2.pl. This is a bad habit to get into and you might find yourself editing one script but trying to run another script. You will become very frustrated when every change you make to your script has seemingly no effect.
Program runs with no errors but doesn't print any output
It might seem mysterious when your Perl program, which you so carefully wrote, doesn't seem to do anything. It is therefore worth asking yourself the question 'did I ask it to do anything?'. More specifically, have you made sure your program is printing out any output? Making your program calculate the answer to life, the universe, and everything is one thing... but if you don't print out the answer, then it will remain a mystery.
Version history
2.3.5 - 5/24/10 - Fixed a single typo in exercise P20 which would prevent the script working properly.
2.3.4 - 11/13/09 - A couple of typo fixes and slight restructuring to transliteration routine.
2.3.3 - 10/30/09 - Expanded on arrays and loops sections. More explanatory text is given with more examples.
2.3.2 - 10/16/09 - Added a new section on how to trouble-shoot problematic Perl scripts, with explanations of common error messages. Plus some more minor typo fixes.
2.3.1 - 10/9/09 - Minor typo fixes.
2.3 - 9/29/09 - One new Perl task added to introduce the die function slightly earlier. Added new Unix task to learn about converting newline characters.
2.2.1 - 8/2/09 - Fixed incorrect numbering for list of projects in Project 5 section.
2.2 - 7/28/09 - Big change in that all examples (apart from first few) now have 'use strict'. Changed some examples to be more biologically relevant. Added more hyperlinks for Perl functions. Added graphical example of arrays. Expanded explanations in many examples, particularly the section on subroutines which gains many new examples.
2.1 - 7/22/09 - Added Preamble section to explain how to go about this course on a Windows machine. Added author bios. Changed directory structure for course files so that everything is contained within one parent directory (Unix_and_Perl_course). Broke several of the Unix sections into smaller sections.
2.05 - 7/17/09 - Some minor typos fixed.
2.04 - 7/16/09 - Fixed minor typos. Expanded section on variables, and offered advice on variable names. Simplified some print examples.
2.03 - 7/15/09 - Fixed minor typos. Expanded table of useful commands. Expanded explanation of tr operator, @ARGV, and escaping via backslash character. Fixed E.coli project example. Mentioned how to spot unsaved files in Mac editors. Moved tables inline.
2.02 - 7/14/09 - use warnings reinstated to all scripts. Table of contents added. Hash bang line also included in all scripts. A few new sections added to Unix part of course, including a table of commonly used Unix commands. Lots of small formatting changes to stop sections splitting over pages where possible.
2.01 - 7/13/09 - Miscellaneous typos fixed and text reworded to clarify.
2.00 - 7/12/09 - Revision based on feedback from course. Switched to PDF documentation rather than HTML. Smaller, more focused exercises.
1.00 - First taught course to grad students in UC Davis in Fall 2008.
0.5 - Brief Unix/Perl training material written to help new students who join our lab.