Bioinformatics Exercises Print
Paul Craig
Department of Chemistry, Rochester Institute of Technology
GAGTTAGGTCTGAAAGCAAAAGAAATTATGGATGCGGGCAAGTT
GGTGACTGATGAGTTAGTTATCGCATTACTCAAAGAACGTATCACA
CAGGAAGATTGCCGCGATGGTTTTCTGTTAGACGGGTTCCCGCGT
ACCATTCCTCAGGCAGATGCCATGAAAGAAGCCGGTATCAAAGTT
GATTATGTGCTGGAGTTTGATGTTCCAGACGAGCTGATTGTTGAG
CGCATTGTCGGCCGTCGGGTACATGCTGCTTCAGGCCGTGTTTATC
ACGTTAAATTCAACCCACCTAAAGTTGAAGATAAAGATGATGTTAC
CGGTGAAGAGCTGACTATTCGTAAAGATGATCAGGAAGCGACTGT
CCGTAAGCGTCTTATCGAATATCATCAACAAACTGCACCATTGGTT
TCTTACTATCATAAAGAAGCGGATGCAGGTAATACGCAATATTTTAA
ACTGGACGGAACCCGTAATGTAGCAGAAGTCAGTGCTGAACTGG
CGACTATTCTCGGTTAATTCTGGATGGCCTTATAGCTAAGGCGGTT
TAAGGCCGCCTTAGCTATTTCAAGTAAGAAGGGCGTAGTACCTACA
AAAGGAGATTTGGCATGATGCAAAGCAAACCCGGCGTATTAATGG
TTAATTTGGGGACACCAGATGCTCCAACGTCGAAAGCTATCAAGC
GTTATTTAGCTGAGTTTTTGAGTGACCGCCGGGTAGTTGATACTTC
CCCATTGCTATGGTGGCCATTGCTGCATGGTGTTATTTTACCGCTTC
GGTCACCACGTGTAGCAAAACTTTATCAATCCGTTTGGATGGAAG
AGGGCTCTCCTTTATTGGTTTATAGCCGCCGCCAGCAGAAAGCACT
GGCAGCAAGAATGCCTGATATTCCTGTAGAATTAGGCATGAGCTAT
GGTTCAC
a. First, try to find an open reading frame in this segment of DNA. What is an open reading frame
(ORF)? You can find the answer in your textbook or online with a simple Internet search
(http://www.google.com). You may also wish to try the bookshelf at PubMed
(http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Books). In bacteria, an open reading frame on a
piece of mRNA almost always begins with AUG, which corresponds to ATG in the DNA segment
that codes for the mRNA. According to the standard genetic code, there are three Stop codons on
mRNA: UAA, UAG, and UGA, which correspond to TAA, TAG, and TGA in the parent DNA
segment. Here are the rules for finding an open reading frame in this piece of bacterial DNA:
1. It must start with ATG. In this exercise, the first ATG is the Start codon. In a real gene search,
you would not have this information.
2. It must end with TAA, TAG, or TGA.
3. It must be at least 300 nucleotides long (coding for 100 amino acids).
4. The ATG Start codon and the Stop codon must be in frame. This means that the total number
of bases in the sequence from the Start to the Stop codon must be evenly divisible by 3.
Hints: Try this search by pasting the DNA sequence into a word processing program, then
searching for the Start and Stop codons. Once you have found a pair, highlight the text of the
proposed ORF and use the program's Word Count function to count the number of characters
between (or including) the Start and Stop codons. This number must be evenly divisible by 3.
You can also use a fixed-width font such as Courier, enlarge the size of the text, and adjust the
margins so that each line holds just three characters (one codon). Once you find the first ATG,
delete the characters that precede it. Then search for a Stop codon that fits all on one line (is in
the same reading frame as the Start codon).
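The four rules above can also be checked with a short script. The following is a minimal Python sketch, not part of the original exercise. The function name find_first_orf and the min_length parameter are hypothetical, and the sketch assumes that the ORF length is counted from the Start codon through the Stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_first_orf(dna, min_length=300):
    """Return (start, end, orf) for the ORF that begins at the first ATG, or None.

    start and end are 0-based positions in the cleaned sequence; end is
    exclusive and the returned ORF includes the Stop codon (an assumption).
    """
    # Keep only the bases, so pasted line breaks, blanks, and digits are ignored.
    seq = "".join(ch for ch in dna.upper() if ch in "ACGT")

    start = seq.find("ATG")  # Rule 1: the first ATG is the Start codon.
    if start == -1:
        return None

    # Step one codon (3 bases) at a time from the Start codon, so every codon
    # examined is automatically in frame with it (Rule 4).
    for i in range(start + 3, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:  # Rule 2: first in-frame Stop codon.
            end = i + 3
            if end - start >= min_length:  # Rule 3: minimum length.
                return start, end, seq[start:end]
            return None  # The first in-frame Stop codon comes too early.
    return None


# Toy example with a lowered length cutoff; the TGA that overlaps the
# Start codon (ATGA...) is out of frame and is skipped.
print(find_first_orf("CCATGAAATTTGGGTAACC", min_length=9))
```

To apply it to the exercise, paste the DNA segment into a string, call find_first_orf on it with the default cutoff, and confirm that the length of the returned ORF is evenly divisible by 3.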
b. Admittedly, Part (a) is a tedious approach. Here is an easier one: Highlight the entire DNA sequence
again and copy it. Then go to the Translate tool on the ExPASy server
(http://www.expasy.org/tools/dna.html). Paste the sequence into the box entitled Please enter a
DNA or RNA sequence in the box below (numbers and blanks are ignored). Then select Verbose
(Met, Stop, spaces between residues) as the Output format and click on Translate Sequence.
The Results of Translation page that appears contains six different reading frames. What is a
reading frame and why are there six? Identify the reading frame that contains a protein (more than
100 continuous amino acids with no interruptions by a Stop codon) and note its name. Now go back
to the Translate tool page, leave the DNA sequence in the sequence box, but select Compact (M,
-, no spaces) as the Output format. Go to the same reading frame as before and copy the protein
sequence (by one-letter abbreviations) starting with M for methionine and ending in - for the
Stop codon. Save this sequence to a separate text file.
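To see where the six reading frames come from, it can help to generate them yourself. The following Python sketch is not the code behind the Translate tool. It assumes the standard genetic code, writes Stop codons as - (as in the Compact output), and uses hypothetical frame labels ("forward 1" through "reverse 3"). Three frames come from the three possible starting offsets on the given strand, and three more from the same offsets on its reverse complement.

```python
from itertools import product

# Standard genetic code, listed in TCAG order; "-" marks a Stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY--CC-WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the complementary strand, read in its own 5' to 3' direction."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the first base; leftover bases are dropped."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return a dict that maps a frame label to its one-letter translation."""
    # Numbers and blanks are ignored, as on the Translate tool page.
    seq = "".join(ch for ch in dna.upper() if ch in "ACGT")
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"forward {offset + 1}"] = translate(seq[offset:])
        frames[f"reverse {offset + 1}"] = translate(rc[offset:])
    return frames


for label, protein in six_frames("ATGGCCTAA").items():
    print(label, protein)
```

For the exercise sequence, the frame of interest is the one whose translation contains a run of more than 100 residues from an M to the next -.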
c. Now you will identify the protein and the bacterial source. Go to the NCBI BLAST page
(http://www.ncbi.nlm.nih.gov/BLAST/). What does BLAST stand for? You will do a simple
BLAST search using your protein sequence, but you can do much more with BLAST. You are
encouraged to work the Tutorials on the BLAST home page to learn more. On the BLAST page,
select Protein-protein BLAST. Enter your protein sequence in the Search box. Use the default
values for the rest of the page and click on the BLAST! button. You will be taken to the
formatting BLAST page. Click on the Format! button. You may have to wait for the results.
Your protein should be the first one listed in the BLAST output. What is the protein and what is the
source?
4. Sequence Homology.
You will use BLAST to look at sequences that are homologous to the protein that you identified in
Problem 3.
a. First, some definitions: What do the terms homolog, ortholog, and paralog mean? Go to the
NCBI BLAST page (http://www.ncbi.nlm.nih.gov/BLAST/) and choose Protein-protein
BLAST. Paste your protein sequence into the Search box. Before clicking on the BLAST!
button, narrow the search by kingdom. As you look down the BLAST page, you'll see an Options
section. Under Limit by entrez query (followed by an empty box) or select from: (followed by
a drop-down menu), select Eukaryota. Now click on the BLAST! button. Click on the
Format! button on the next page. Can you find a homologous sequence from yeast?
(Hint: Use your browser's Find tool to search for the term Saccharomyces.) Note the Score and
E value given at the right of the entry.
Can you find a homologous sequence from humans?
(Hint: Search for the term Homo.) Note its Score and E value.
Most biochemists consider 25% identity the cutoff for sequence homology, meaning that if two
proteins are less than 25% identical in sequence, more evidence is needed to determine whether
they are homologs. Click on the Score values for the yeast and human proteins to see each
sequence aligned with the Yersinia pestis sequence and to see the percent sequence identity. Are
the yeast and human sequences homologous to the Yersinia pestis sequence?
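The percent identity reported for an alignment can be reproduced by hand or with a few lines of code. The following Python sketch is only an illustration. The function names and the two short aligned strings are hypothetical, gaps are assumed to be written as -, and identity is assumed to be the number of identical positions divided by the full alignment length (gapped positions included).

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns in which both sequences have the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # A column counts as an identity only if both residues match and neither is a gap.
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


def meets_cutoff(aligned_a, aligned_b, cutoff=25.0):
    """Apply the 25% identity rule of thumb described in the text."""
    return percent_identity(aligned_a, aligned_b) >= cutoff


# Hypothetical aligned fragments: 4 identical columns out of 6.
print(round(percent_identity("MKV-LA", "MRVQLA"), 1))
print(meets_cutoff("MKV-LA", "MRVQLA"))
```

A value below the cutoff does not rule out homology; as stated above, it means that more evidence is needed.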
b. Use the BLAST online tutorial
(http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html) to discover the meaning
of the Score and E value for each sequence that is reported. What is the difference between an
identity and a conservative substitution? Provide an example of each from the comparison of your
sequence and a homologous sequence obtained from BLAST.
c. BLAST uses a substitution matrix to assign values in the alignment process, based on the analysis
of amino acid substitutions in a wide variety of protein sequences. Be sure you understand the
meaning of the term substitution matrix. What is the default substitution matrix on the BLAST
page? What other matrices are available? What is the source of the names for these substitution
matrices? Repeat the BLAST search in Problem 4(a) using a different substitution matrix. Do you
find different answers?
h. Use a mixture of the restriction enzymes BamHI, AvaI, and PstI to construct a restriction map of
pUC18 similar to the one shown in Fig. 5-43.
i. For the adventurous: Find an enzyme or combination of enzymes that will produce 10 fragments
from pUC18. Draw a restriction map of your results.
human, archaea, bacterial, and plant) in a single file (suggested name: TIM_5_FASTA.txt). This
must be a simple text file with individual sequences separated by a blank line.
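Before submitting the file, you may wish to confirm that it contains the expected number of sequences. The following Python sketch assumes the usual FASTA layout, in which each sequence begins with a header line that starts with > and blank lines act only as separators. The function name parse_fasta and the two short records in the example are hypothetical placeholders, not real triose phosphate isomerase sequences.

```python
def parse_fasta(text):
    """Return a dict that maps each FASTA header (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # Blank lines only separate the records.
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line  # Sequences may span several lines.
    return records


# Hypothetical two-record example.
example = """>species_one
MAPSRK
FFVGGN

>species_two
MARKFF
"""

records = parse_fasta(example)
print(len(records), "sequences found")
for name, sequence in records.items():
    print(name, len(sequence))
```

To check your own file, read its contents with open("TIM_5_FASTA.txt").read() and pass the result to parse_fasta; five records should be found.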
2. Multiple Sequence Alignment
Multiple sequence alignment is a tool to identify highly conserved residues in homologous proteins. A
program called CLUSTALW will perform multiple sequence alignments on protein sets that are
submitted in FASTA format. CLUSTALW is available as a command line program to be executed in a
UNIX environment (not very user-friendly). Fortunately, the European Bioinformatics Institute has a
web interface that performs CLUSTALW alignments: http://www.ebi.ac.uk/clustalw/.
a. Go to the EBI site and submit your text file containing the five triose phosphate isomerase
sequences in FASTA format on the input form page. There are many options for refining the
alignment, but for now, use the default values. Be sure to enter your email address. The output of
CLUSTALW can be accessed in many ways. The simplest version will be described here, but you
are encouraged to explore other options (especially JalView). In the simple text output, the
sequences are optimally aligned and annotated: Residues that are identical in all chains are marked
with an asterisk (*), those that are highly conserved are marked with a colon (:), and those that are
semiconserved are marked with a period (.). From your multiple sequence alignment, how many
identical residues did you find? Identify the residues, using the single-letter amino acid
abbreviations. Classify these identity sites as polar, nonpolar, acidic, and basic amino acids. Do
most of the identities fall into a single class of amino acids? If you plan to continue to Part (b),
keep your browser open or bookmark the results page. You can learn more about CLUSTALW at
a tutorial provided by EBI (http://www.ebi.ac.uk/2can/tutorials/protein/clustalw.html).
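Counting and classifying the identity sites by eye is error-prone, so a short script can serve as a check. The following Python sketch is not part of CLUSTALW. It assumes that the aligned sequences have been copied into equal-length strings with gaps written as -, it finds the columns that CLUSTALW would mark with an asterisk, and it groups them using one common classification of the amino acids (His is placed with the basic residues here; your textbook's grouping may differ). The function names and the three short aligned strings are hypothetical.

```python
from collections import Counter

# One common grouping of the 20 standard amino acids (an assumption).
CLASSES = {
    "nonpolar": "GAVLIMPFW",
    "polar": "STNQYC",
    "acidic": "DE",
    "basic": "KRH",
}
RESIDUE_CLASS = {aa: name for name, letters in CLASSES.items() for aa in letters}


def identical_columns(aligned):
    """Return (column index, residue) for every column identical in all sequences."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    sites = []
    for i, column in enumerate(zip(*aligned)):
        # Identical in every chain, and not a column made up entirely of gaps.
        if column[0] != "-" and len(set(column)) == 1:
            sites.append((i, column[0]))
    return sites


def classify(sites):
    """Count the identity sites that fall into each amino acid class."""
    return Counter(RESIDUE_CLASS.get(residue, "other") for _, residue in sites)


# Hypothetical three-sequence alignment.
alignment = ["MKD-GA", "MRDLGA", "MKDLGS"]
sites = identical_columns(alignment)
print(sites)
print(classify(sites))
```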
b. Figure 7-21 of the textbook shows a phylogenetic tree, which is described as a diagram that
indicates the ancestral relationships among organisms that produce the protein. There are useful
tutorials on phylogenetic trees at the Los Alamos National Laboratory web site
(http://www.hiv.lanl.gov/content/hiv-db/TREE_TUTORIAL/Tree-tutorial.html), at the EBI help
page (http://www.ebi.ac.uk/clustalw/tree_frame.html), and at the NCBI site
(http://www.ncbi.nlm.nih.gov/About/primer/phylo.html). Complete one or all of these tutorials.
Scroll down the output page from the CLUSTALW program at EBI to the tree representations of
the alignments. What is the difference between a cladogram and a phylogram tree? What do these
trees tell you about triose phosphate isomerase from the five different species? The tree image on
the EBI site is a dynamic image, meaning that you can't just cut and paste it. If you would like to
capture this image, you can use the PrintScreen button on your computer and paste the image into
a simple Paint program (with Mac OS X, use the program Grab for screen capture).