6.1 Bioinformatics Databases and Tools - Introduction: Lecture 6: December, 28, 2001
1. When obtaining a new DNA sequence, one needs to know whether it has already been
deposited in the databanks fully or partially, or whether they contain any homologous
sequences (sequences which are descended from a common ancestor).
2. Some of the databases contain annotation which has already been added to a specific
sequence. Finding annotation for the searched sequence or its homologous sequences
can facilitate its research.
3. Find similar non-coding DNA stretches in the database: for instance repeat elements
or regulatory sequences.
4. Other uses for specific purpose, like locating false priming sites for a set of PCR
oligonucleotides.
5. Search for homologous proteins - proteins similar in their sequence and therefore also
in their presumed folding or structure or function.
Today they suffer from several problems, unpredicted in early years (when their sizes were
much smaller):
• Databases are regulated by users rather than by a central body (except for Swiss-Prot).
• Only the owner of the data can change it.
• Sequences are not up to date.
• Large degree of redundancy in databases and between databases.
• Lack of standard for fields or annotation.
PIR home page: [20]. For a sample PIR entry, see [23].
Swiss-Prot
Swiss-Prot (home page: [35]) was established in 1986. It is maintained collaboratively by
SIB (Swiss Institute of Bioinformatics) and EBI/EMBL. It provides high-level annotations,
including description of protein function, structure of protein domains, post-translational
modifications, variants, etc. It aims to be minimally redundant. Swiss-Prot is linked to
many other resources, including other sequence databases. For a sample entry, see figures
6.1, 6.2, 6.3.
GenPept
GenPept is a supplement to the GenBank nucleotide sequence database. Its entries are
translations of coding regions in GenBank entries. They contain minimal annotation, primarily
extracted from the corresponding GenBank entries. For the complete annotations, one must
refer to the GenBank entry or entries referenced by the accession number(s) in the GenPept
entry. For a sample GenPept entry, see [9].
NRL 3D
NRL 3D is produced and maintained by PIR. It contains sequences extracted from the
Protein DataBank (PDB) (see [45]). The entries include secondary structure, active site,
binding site and modified site annotations, details of experimental method, resolution, R-
factor, etc. NRL 3D makes the sequence data in the PDB available for both text based and
Algorithms for Molecular Biology © Tel Aviv Univ.
sequence-based searching. It also provides cross-reference information for use with the other
PIR Protein Sequence Databases. For NRL 3D information, and sample entry, see [22].
The large DNA databases are: GenBank (US), EMBL (Europe - UK), DDBJ (Japan).
These databases are quite similar in their contents and update one another
periodically. This is a result of the International Nucleotide Sequence Database
Collaboration.
EMBL
EMBL is a DNA sequence database from the European Bioinformatics Institute (EBI). See EBI
home page: [30]. EMBL includes sequences from direct submissions, genome sequencing
projects, scientific literature and patent applications. Its growth is exponential; on 3.12.01 it
contained 15,386,184,380 bases in 14,370,773 records. EMBL supports several retrieval tools:
SRS for text-based retrieval, and Blast and FastA for sequence-based retrieval. See [31] for
more information and for a sample EMBL entry. EMBL is divided into several divisions.
Primary sequence databases 7
The divisions differ in the number of sequences and in the quality of the data. See figure 6.5
for division statistics.
Figure 6.5: Source: [31]. EMBL divisions and number of bases in each division.
GenBank
GenBank is a DNA sequence database from the National Center for Biotechnology Information
(NCBI). See NCBI home page: [38]. It incorporates sequences from publicly available
sources (direct submission and large-scale sequencing). Like EMBL, it is split into
smaller, discrete divisions (see table 6.2). This facilitates an efficient search. See [43] for
more information and for a sample GenBank entry.
Table 6.2: Source: [8]. GenBank divisions. The biggest division is the EST division; due to
its rapid growth, it is divided into 23 pieces.
Glossary
ESTs (Expressed Sequence Tags) - Short fragments of mRNA samples that are taken
from a variety of tissues and organisms. These samples are amplified and sequenced.
The sequencing is done in a single read pass, so ESTs are an error-prone source
of information. There are about 6 million sequenced ESTs (more than 1/3 cloned from
human).
STSs (Sequence-Tagged Sites) - Short genomic samples that serve as genomic markers.
Text based search - Searching the annotations. Examples: SRS, GCG’s Lookup, Entrez.
Sequence based search - Searching the sequence itself. Examples: Blast, FastA, SW.
SRS was developed at the EBI. It provides a homogeneous interface to over 80 biological
databases (see SRS help at [25]). It includes databases of sequences, metabolic pathways,
transcription factors, application results (like BLAST, SSEARCH, FASTA), protein 3-D
structures, genomes, mappings, mutations, and locus-specific mutations. For each of the 80
available databases, there is a short description, including its last release. Before entering a
query, one selects one or more of the databases to search. It is possible to send the query
results as a batch query to a sequence search tool. SRS is highly recommended.
SRS entrance page: [24].
Entrez
Entrez is a molecular biology database and retrieval system, developed by the NCBI (see
Entrez help at [42]). It is an entry point for exploring the NCBI’s integrated databases.
Entrez is easy to use but, unlike SRS, its search is limited: it does not allow customization
with an institute’s preferred databases. Entrez entrance page: [41].
• A DNA sequence is a string of length n over an alphabet of size 4. Its protein translation
is a string of length n/3 over an alphabet of size 20. Statistically, the expected number
of random matches in some arbitrary database is larger for a DNA sequence.
• DNA databases are much larger than protein databases, and they grow faster. This
also means more random hits.
Bottom line: Translating DNA to a protein yields better search results. When possible (i.e.
for a coding DNA sequence), it is the recommended technique.
Protein sequences are always searched against protein databases. Translating them to
DNA is ambiguous and results in a large number of possible DNA sequences. The analysis
in the previous paragraph also discourages translation to DNA.
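The ambiguity of translating a protein back to DNA can be made concrete by counting the coding sequences. The following is a minimal sketch, assuming the standard genetic code (61 sense codons); the table and the function name `count_back_translations` are illustrative and do not appear in the lecture.

```python
# Number of codons that encode each amino acid in the standard genetic code
# (assumption: standard code, stop codons ignored).
CODON_COUNTS = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def count_back_translations(protein: str) -> int:
    """Return how many distinct DNA sequences encode the given protein."""
    total = 1
    for residue in protein.upper():
        total *= CODON_COUNTS[residue]  # independent choice at every position
    return total


if __name__ == "__main__":
    # The count grows exponentially with the protein length.
    for peptide in ("MW", "MKT", "MLS", "MLSRLSLR"):
        print(peptide, count_back_translations(peptide))
```

Even a short peptide rich in leucine, serine or arginine corresponds to thousands of possible DNA sequences, which is why a protein query is searched as a protein.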
Homology modeling
As stated, a primary goal of sequence search is to find sequences which are homologous to
the query sequence. Such a homologous sequence shares sequence similarity with the query
sequence. The similarity is derived from common ancestry and conservation throughout
evolution. Homologous proteins are similar in their structure. This is the basis for homology
modeling: structure determination through the structure of similar proteins.
Sensitivity - The ability to detect “true positive” matches. The most sensitive search finds
all true matches, but might have lots of “false positives”.
Specificity - The ability to reject “false positive” matches. The most specific search will
return only true matches, but might have lots of “false negatives”.
Sequence Based Searching 11
When one chooses which algorithm to use, there is a trade-off between these two figures of
merit. It is quite trivial to create an algorithm which will optimize one of these properties.
The problem is to create an algorithm that will perform well with respect to both of them.
A second criterion for evaluating an algorithm is its time performance.
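The two figures of merit can be computed for a search whose true answer is known. The sketch below uses the usual definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP); the function name `evaluate_search` and the toy record identifiers are hypothetical.

```python
def evaluate_search(reported, true_homologs, database):
    """Return (sensitivity, specificity) of one search result.

    reported      -- set of record ids returned by the search
    true_homologs -- set of record ids that really are homologous
    database      -- set of all record ids that were searched
    """
    tp = len(reported & true_homologs)             # true matches that were found
    fn = len(true_homologs - reported)             # true matches that were missed
    fp = len(reported - true_homologs)             # unrelated records reported
    tn = len(database - reported - true_homologs)  # unrelated records rejected
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


if __name__ == "__main__":
    database = set("abcdefghij")
    true_homologs = set("abcd")
    # Returning everything is perfectly sensitive and not specific at all.
    print(evaluate_search(database, true_homologs, database))
    # A cautious search: fewer false positives, more false negatives.
    print(evaluate_search(set("abe"), true_homologs, database))
```

The first call illustrates why optimizing a single property is trivial: reporting the whole database yields sensitivity 1 and specificity 0.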
We will examine three main search tools: FastA (better for nucleotides than for proteins),
BLAST (better for proteins than for nucleotides) and SW-search (more sensitive than FastA
or BLAST, but much slower).
6.4.1 FastA
FastA is a sequence comparison software that uses the method of Pearson and Lipman [6].
The basic FastA algorithm assumes a query sequence and a database over the same alphabet.
Practically, FastA is a family of programs, allowing also cross queries of DNA versus protein.
The program variants are listed in table 6.3.
PROGRAM     FUNCTION
fasta3      scans a protein or DNA sequence library for similar sequences
fastx/y3    compares a DNA sequence to a protein sequence database, translating
            the DNA sequence in forward and reverse frames
tfastx/y3   compares a protein to a translated DNA databank
fasts3      compares linked peptides to a protein databank
fastf3      compares mixed peptides to a protein databank
Table 6.3: Source: [33]. Variants of the FastA algorithm. Note: fastx3 uses a simpler,
faster algorithm for alignments that allows frameshifts only between codons; fasty3 is slower
but produces better alignments with poor quality sequences because frameshifts are allowed
within codons (source: [32]).
Figure 6.6: Source: [28]. FastA query screen. A - Default gap opening penalty: −12 for
proteins, −16 for DNA. Default gap extension penalty: −2 for proteins, −4 for DNA. B -
Max number of scores and alignments is 100. C - The larger the word length, the less
sensitive but faster the search will be. D - Default matrix: BLOSUM50. Lower PAM and higher
BLOSUM matrices detect close sequences. Higher PAM and lower BLOSUM matrices detect
distant sequences.
FastA - Steps
• Hashing: FastA locates regions of the query sequence and matching regions in the
database sequences that have high densities of exact matches of k-tuple subsequences.
The ktup parameter controls the length of the k-tuple.
• Scoring: The ten highest scoring regions are scored again using a scoring matrix. The
score for such a pair of regions is saved as the init1 score.
• Introduction of Gaps: FastA determines if any of the initial regions from different
diagonals can be joined together to form an approximate alignment with gaps. Only
non-overlapping regions may be joined. The score for the joined regions is the sum of
the scores of the initial regions minus a joining penalty for each gap. The score of the
highest scoring region, at the end of this step, is saved as the initn score.
• Alignment: After computing the initial scores, FastA determines the best segment of
similarity between the query sequence and the search set sequence, using a variation
of the Smith-Waterman algorithm. The score for this alignment is the opt score.
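The hashing step above can be sketched as follows. This is only an illustration of the k-tuple lookup and diagonal counting idea, not the FastA implementation: the rescoring, joining and Smith-Waterman steps are omitted, and the function names are hypothetical.

```python
from collections import defaultdict


def ktup_diagonal_hits(query, target, ktup=2):
    """Count exact k-tuple matches on each diagonal (target pos - query pos)."""
    # Hash table: k-tuple -> list of start positions in the query.
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)

    diagonals = defaultdict(int)
    for j in range(len(target) - ktup + 1):
        for i in index.get(target[j:j + ktup], ()):
            diagonals[j - i] += 1  # matches on one diagonal share an offset
    return dict(diagonals)


def best_diagonals(query, target, ktup=2, top=10):
    """Return the `top` diagonals with the most k-tuple matches."""
    hits = ktup_diagonal_hits(query, target, ktup)
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


if __name__ == "__main__":
    print(best_diagonals("ABCDE", "XXABCDE", ktup=2))
```

A larger `ktup` makes the hash table sparser, so fewer diagonals are populated; this is the speed versus sensitivity trade-off controlled by the ktup parameter.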
FastA Output
The standard FastA output contains a list of the best alignment scores and a visual
representation of the alignments. See figures 6.7 and 6.8. When evaluating FastA E-scores, the
following rule of thumb can be applied: Sequences with E-score less than 0.01 are almost
always found to be homologous. Sequences with E-score between 1 and 10 frequently turn
out to be related as well.
FastA uses a statistical model in order to determine a threshold E-score below which
results are returned. However, sometimes the assumptions of this statistical model fail. The
reliability of the sequence statistics for a given query can be quickly confirmed by looking at
the histogram of observed and expected similarity scores (see [44]). The FastA histogram is
an optional output. A sample histogram is shown in figure 6.9.
Figure 6.7: Source: [28]. A sample FastA output: alignment scores. Columns 1-3 give the
name and annotation of the record. Columns 4-7 are the FastA scores.
Figure 6.8: Source: [28]. A sample FastA output: alignment of the query sequence against
the result sequences.
Figure 6.9: Source: [44]. Histogram of FASTA3 similarity scores - Results of search of a
Drosophila class-theta glutathione transferase against the annotated PIR1 protein sequence
database. The initial histogram output is shown. The shaded section indicates the region
that is most likely to show discrepancies between observed and expected number of scores
when the statistical model fails.
The BLAST program compares the query to each sequence in the database using heuristic
rules to speed up the pairwise comparison. It first creates an abstraction of the query
sequence by listing exact and similar words. BLAST then finds matching words between the
query and each database sequence, and extends such words to obtain high-scoring segment
pairs (HSPs), BLAST parlance for local ungapped alignments. BLAST calculates its statistics
analytically, rather than estimating them empirically from the search as FastA does.
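The seed-and-extend idea can be sketched as follows. This is a simplified illustration and not the BLAST implementation: it seeds on exact words only (no similar words from a scoring matrix), uses a plain match/mismatch score, and stops an extension once the running score drops more than `xdrop` below the best seen. All names and default values are assumptions.

```python
def _extend(query, subject, qi, si, step, match, mismatch, xdrop):
    """Extend without gaps in one direction; return (best gain, kept length)."""
    gain = best = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        gain += match if query[qi] == subject[si] else mismatch
        length += 1
        if gain > best:
            best, best_len = gain, length
        elif best - gain > xdrop:
            break  # score dropped too far below the best: give up
        qi += step
        si += step
    return best, best_len


def find_hsps(query, subject, word=3, match=1, mismatch=-1, xdrop=3):
    """Return sorted (score, query_start, subject_start, length) tuples."""
    index = {}
    for i in range(len(query) - word + 1):
        index.setdefault(query[i:i + word], []).append(i)

    hsps = set()  # different seeds often extend to the same segment pair
    for j in range(len(subject) - word + 1):
        for i in index.get(subject[j:j + word], ()):
            right, rlen = _extend(query, subject, i + word, j + word, 1,
                                  match, mismatch, xdrop)
            left, llen = _extend(query, subject, i - 1, j - 1, -1,
                                 match, mismatch, xdrop)
            score = word * match + left + right
            hsps.add((score, i - llen, j - llen, word + llen + rlen))
    return sorted(hsps, reverse=True)


if __name__ == "__main__":
    for hsp in find_hsps("GGACGTACGG", "TTACGTACTT"):
        print(hsp)
```

In this toy example several seeds on the same diagonal extend to the same segment pair, which is why the results are collected in a set.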
The BLAST graphical output is similar to FastA output. A sample output screen is
shown in figure 6.10.
Figure 6.10: Source: [39]. A sample BLAST output screen. There are three sections: 1.
A graphical representation of the alignments. 2. Scores: for each result, a line containing
name, annotation and BLAST scores. 3. Alignment of the query sequence against the result
sequences.
• If the query has repeated segments, remove them and repeat the search.
Z-score
The Z-score is an old, yet commonly used statistical estimator for the validity of statistical
results, including alignment scores. It is defined by the number of standard deviations
that separate an observed score from the average random score. In other words, it is the
difference between the observed score and the average random score, normalized by the
standard deviation of the distribution. A higher Z-score means that the score can be trusted
with a higher confidence level.
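The definition translates directly into code. A minimal sketch, assuming the random scores were obtained elsewhere (for example by aligning shuffled sequences); the function name `z_score` is illustrative.

```python
import statistics


def z_score(observed, random_scores):
    """Number of standard deviations between `observed` and the random mean."""
    mean = statistics.mean(random_scores)
    sd = statistics.pstdev(random_scores)  # population standard deviation
    if sd == 0:
        raise ValueError("random scores have zero variance")
    return (observed - mean) / sd


if __name__ == "__main__":
    random_scores = [31, 28, 35, 30, 27, 33, 29, 32]
    print(z_score(58, random_scores))
```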
Figure 6.11: Easy Case: Illustration of an easy case of estimating significance. The scores of
truly related records are distributed away from those of random records and thus can easily
be identified.
Figure 6.12: Complex Case: Illustration of a complex case of estimating significance. The
dark area represents the number of random records (shuffled query sequence) that exceed
the query score. In this case the overlap between the random plot and the real plot is
large, which makes it hard to distinguish between the real and random records.
Significance of Scores 21
E-value
The E-value is the most frequently used statistical estimator for the validity of alignment
scores. It is defined as the expected number of false positives with a score higher than the
observed score. This value obviously depends on the number of random alignments, which is
determined by the size of the aligned sequences. A lower E-value indicates that the score
has a higher confidence level.
P-value
Once we have calculated the E-value, E, for a certain score, we can go one step further.
The P-value is the probability that a score at least as good as the observed one occurs
by chance. To find a formula for the P-value, let us define a random variable Y_E
as the number of random records achieving an E-value of E or better. This random variable
has a Poisson distribution with the parameter λ = E. The probability that no random record
scores at least as well as ours, i.e. that Y_E = 0, is therefore e^{-E}. Hence the
probability that at least one random record achieves a score at least as good as ours
can be computed using the following simple formula [1]:
$$P = 1 - e^{-E}$$
Like the E-value, this value is dependent on the size of the database. A lower P-value
means that the score has a higher confidence level. This estimator is not widely used for
determining the validity of sequence alignment scores.
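The conversion from E-value to P-value is a one-liner. The sketch below uses `math.expm1` so that very small E-values do not lose precision; the function name is illustrative.

```python
import math


def p_value_from_e(e_value):
    """P = 1 - exp(-E): probability of at least one random hit this good."""
    return -math.expm1(-e_value)  # expm1(x) = exp(x) - 1, accurate near 0


if __name__ == "__main__":
    for e in (1e-10, 0.01, 1.0, 10.0):
        print(e, p_value_from_e(e))
```

For small E the two estimators nearly coincide (P is approximately E), while for large E the P-value saturates at 1.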
Figure 6.13: Random walk: The score for a match is +2 and the penalty for a mismatch
is -1. As shown, the expectation for the whole walk is negative. The probability that the
top score will be larger than x decreases exponentially with x.
$$X_n \equiv \max\{x_1, \ldots, x_n\}$$
$$B = \frac{\lambda S - \ln K}{\ln 2}$$
This score, unlike the raw score, is measured in standard units, and is independent of
the distribution, and thus it is more instructive. Clearly, it is linear in the raw score. Since
the E-score decreases exponentially with the raw score, it is derived from the
following approximation, which takes into account the lengths m and n of the aligned
sequences:
$$E = mn\,2^{-B}$$
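The two formulas can be checked numerically. In this sketch S is the raw score, λ and K are the parameters of the scoring system, and m and n are the two lengths; the numeric values used below are illustrative assumptions only and are not taken from the lecture.

```python
import math


def bit_score(raw_score, lam, k):
    """B = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, m, n):
    """E = m * n * 2^(-B)."""
    return m * n * 2.0 ** (-bits)


if __name__ == "__main__":
    lam, k = 0.318, 0.13        # illustrative parameter values (assumption)
    m, n = 250, 50_000_000      # illustrative lengths (assumption)
    for raw in (40, 60, 80):
        b = bit_score(raw, lam, k)
        print(raw, round(b, 1), e_value(b, m, n))
```

Substituting B into the second formula gives E = K m n e^{-λS}, so each additional bit halves the expected number of chance hits.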
FASTA
Unlike BLASTN, which uses a statistical model with an extensive theory behind it, FASTA
attempts to give good estimates for E-values using the scores of tens of thousands of random
sequences aligned during the course of the algorithm. The FASTA algorithm uses the following
steps for the estimation:
• Random alignment scores are collected through the course of the algorithm. This is
possible because FASTA has a heuristic, which produces alignment scores for pairs of
sequences very quickly. FASTA assumes that the searched database is large enough,
so a heterogeneous sample of scores is collected.
• Scores are assigned into bins of a histogram based on the length of the sequence
matched. The best scores are removed from each bin, so that possible “positive”
scores will not be taken into account, since we are interested in finding the number of
expected false positives above our score.
• The expected value of a random alignment against the database is calculated. The
expected value is the result of a linear regression of the data against the logarithm of
the length of the sequence.
• For each alignment score for which FASTA needs an E-value, FASTA first calculates
the Z-value. This is done using the standard deviation of the random scores
from the expected value of a random alignment with the same length as the analyzed
alignment.
• The conversion of the Z-value into an E-value follows the assumption that the
distribution of the random scores is an Extreme Value Distribution. To get the E-value,
FASTA multiplies the number of sequences in the database by the probability that such a
sequence will have a value higher than our score. This probability can be directly
calculated from the number of standard deviations separating our score from the average
score.
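The regression, Z-value and E-value steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the FASTA code: binning and the removal of the best scores are omitted, the regression is an ordinary least-squares fit, and the tail probability is that of an extreme value (Gumbel) variable standardized to zero mean and unit variance, P(Z ≥ z) = 1 − exp(−exp(−πz/√6 − γ)). All function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329


def fit_score_vs_log_length(lengths, scores):
    """Least-squares fit score = a + b * ln(length); return (a, b, sd)."""
    xs = [math.log(n) for n in lengths]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(scores) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, scores)]
    sd = math.sqrt(sum(r * r for r in residuals) / count)
    return intercept, slope, sd


def estimate_e_value(score, length, db_size, intercept, slope, sd):
    """Z-value of the score, then E = db_size * P(Z >= z) under the EVD."""
    expected = intercept + slope * math.log(length)
    z = (score - expected) / sd
    p = -math.expm1(-math.exp(-math.pi / math.sqrt(6) * z - EULER_GAMMA))
    return db_size * p


if __name__ == "__main__":
    lengths = [100, 100, 400, 400]   # toy data standing in for random scores
    scores = [31, 29, 41, 39]
    a, b, sd = fit_score_vs_log_length(lengths, scores)
    for s in (30, 35, 45):
        print(s, estimate_e_value(s, 100, 10_000, a, b, sd))
```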
6.7.1 Prosite
The Prosite database [37] is based on Swiss-Prot and thus is very well annotated, but small.
Characterization of protein families is done by the single most conserved motif observed in
a multiple sequence alignment of known homologs. These conserved motifs usually relate
to biological functions such as active sites or binding sites. The search in Prosite does not
require an exact match in structure. Prosite enables searches using complex patterns. It is
possible to search textually using regular expressions for names of known proteins, etc. It is
also possible to scan a protein sequence using Prosite for structural pattern matches. The
database is well cross-linked to Swiss-Prot and TrEMBL.
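Scanning a sequence with a Prosite-style pattern can be sketched by converting the pattern into a regular expression. The conversion below is an assumption-laden illustration that covers only the common elements of the pattern syntax (`x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repetition, `<` and `>` for the sequence ends); the function names are hypothetical.

```python
import re


def prosite_to_regex(pattern):
    """Translate a Prosite-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Order matters: braces become negated classes before parentheses
        # are turned into repetition braces.
        element = element.replace("{", "[^").replace("}", "]")
        element = element.replace("(", "{").replace(")", "}")
        element = element.replace("x", ".")
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


def scan(pattern, sequence):
    """Return (start, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    # N-glycosylation-style motif: N, not P, S or T, not P.
    print(prosite_to_regex("N-{P}-[ST]-{P}."))
    print(scan("N-{P}-[ST]-{P}.", "AANGSAAKNPSAA"))
```

Note that `finditer` reports non-overlapping matches only, so overlapping occurrences of a motif would need a lookahead-based scan.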
6.7.2 FingerPrints
Unlike Prosite, FingerPrints has an improved diagnostic reliability which is achieved by using
more than one conserved structural motif to characterize a protein family. With FingerPrints,
many motifs are encoded using ungapped and unweighted local alignments.
The input to FingerPrints is a small multiple alignment, which has some conserved motifs.
These motifs are searched for in the database, and only sequences that match all the motifs
are considered for further analysis. With the new alignment, the database is searched for
more sequences until no further complete fingerprint matches can be identified. These final
aligned motifs constitute the refined fingerprint that is entered into the database.
6.7.3 Blocks
Blocks [11] uses multiply aligned ungapped segments corresponding to the most highly
conserved regions of proteins. Block Searcher [14], Get Blocks [13] and Block Maker [12] are
aids to the detection and verification of protein sequence homology. They compare a protein
or DNA sequence to a database of protein blocks, retrieve blocks, and create new blocks,
respectively.
6.7.4 Profiles
The Profiles database [19] uses the notion of profiles to achieve good detection of distant
sequence relationships. A profile is a scoring table with multiple alignment information for
the whole sequences, not just for conserved regions. Profiles are weighted to indicate:
• where insertions and deletions (INDELs) are allowed (not within core secondary struc-
tures).
Profiles provide a sensitive means of detecting distant sequence relationships, where only
a few residues are well conserved. The inherent complexity of profiles makes them
highly potent discriminators.
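The notion of a profile as a position-specific scoring table can be sketched as follows. This is a simplified illustration under explicit assumptions: log-odds scores against a uniform background, a fixed pseudocount, and no position-specific gap penalties, so it does not model where insertions and deletions are allowed. The function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Build one log-odds score table per column of an ungapped alignment."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_segment(profile, segment):
    """Sum the position-specific scores of a segment of the profile's length."""
    return sum(column[aa] for column, aa in zip(profile, segment))


if __name__ == "__main__":
    profile = build_profile(["ACD", "ACE", "ACD"])
    for candidate in ("ACD", "ACE", "WWW"):
        print(candidate, round(score_segment(profile, candidate), 2))
```

Because every column has its own score table, a residue that is tolerated at one position can be penalized at another, which a single substitution matrix cannot express.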
The ISREC (Swiss Institute for Experimental Cancer Research) has created a compendium of
profiles, making it possible to find even distant homologs. Each of those profiles has
separate data and family annotations.
6.7.5 Pfam
Pfam [46] uses a different method for its database. High quality seed alignments are used
to create Hidden Markov Models to which sequences are aligned. Pfam has two classes of
alignments, according to their credibility:
6.7.6 eMotif
eMotif [17], also known as identify, uses data from Blocks and FingerPrints to generate
consensus expressions from the conserved regions of sequence alignments.
eMotif adopts a “fuzzy” algorithm which allows certain amino acid alterations. This allows
eMotif to find homologous sequences that other programs cannot find, but it results in a lot
of noise. This trade-off shows why it is important to use multiple programs when searching
for information.
6.7.7 InterPro
InterPro [34] is an interface to several secondary databases: Prosite, PRINTS, ProDom and
Pfam. It has an intuitive interface both for text-based and sequence-based searches, and
since it incorporates several databases, it is highly recommended.
Bibliography
[1] A. Dembo and S. Karlin. Strong limit theorems of empirical functionals for large
exceedances of partial sums of i.i.d. variables. Annals of Probability, 19(4):1737–1755,
1991.
[3] S. Karlin and S. F. Altschul. Methods for assessing the statistical significance of
molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA,
87:2264–2268, 1990.
[5] M. Pagni and C. V. Jongeneel. Making sense of score statistics for sequence alignments.
Briefings in Bioinformatics, 2(1):51–67, 2001.
[6] W. R. Pearson and D. J. Lipman. Improved tools for biological sequence comparison.
Proc. Natl. Acad. Sci. USA, 85:2444–2448, 1988.
[8] ftp://genbank.sdsc.edu/pub/release.notes/gb107.release.notes.
[9] http://bioinfo.md.huji.ac.il/databases/genpept.html.
[10] http://bioinfo.tau.ac.il/GCG/html/unix/tofasta.html.
[11] http://blocks.fhcrc.org/.
[12] http://blocks.fhcrc.org/blockmkr/make_blocks.html.
[13] http://blocks.fhcrc.org/blocks-bin/getblock.sh.
[14] http://blocks.fhcrc.org/blocks/blocks_search.html.
[15] http://circinus.ebi.ac.uk:6543/michele/jalview/help.html.
[16] http://dapsas1.weizmann.ac.il/bcd/bcd_parent/bcd_bioccel/bioccel.html.
[17] http://dna.Stanford.EDU/identify.
[18] http://flybase.bio.indiana.edu/.
[19] http://isrec.isb-sib.ch/software/PFSCAN_for_m.html.
[20] http://pir.georgetown.edu/.
[21] http://pir.georgetown.edu/pirwww/aboutpir/collaborate.html.
[22] http://pir.georgetown.edu/pirwww/dbinfo/nrl3d.html.
[23] http://pir.georgetown.edu/pirwww/dbinfo/sample-hahu.html.
[24] http://srs.ebi.ac.uk/.
[25] http://srs.ebi.ac.uk/srs6/man/mi_srswww.html.
[26] http://www2.ebi.ac.uk/bic_sw/.
[27] http://www2.ebi.ac.uk/clustalw.
[28] http://www2.ebi.ac.uk/fasta3/.
[29] http://www.dna.affrc.go.jp/htdocs/growth/index.html.
[30] http://www.ebi.ac.uk/.
[31] http://www.ebi.ac.uk/embl/.
[32] http://www.ebi.ac.uk/fasta33/fasta3x.txt.
[33] http://www.ebi.ac.uk/fasta3/help.html.
[34] http://www.ebi.ac.uk/interpro.
[35] http://www.ebi.ac.uk/swissprot/.
[36] http://www.ebi.ac.uk/swissprot/Information/information.html.
[37] http://www.expasy.ch/prosite.
[38] http://www.ncbi.nlm.nih.gov/.
[39] http://www.ncbi.nlm.nih.gov/BLAST/.
[40] http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/guide.html.
[41] http://www.ncbi.nlm.nih.gov/Entrez/.
[42] http://www.ncbi.nlm.nih.gov/Entrez/entrezhelp.html.
[43] http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html.
[44] http://www.people.virginia.edu/~wrp/papers/mmol98f.pdf.
[45] http://www.rcsb.org/pdb/.
[46] http://www.sanger.ac.uk/Software/Pfam.