
Module III

The document discusses bioinformatics focusing on database similarity search techniques, particularly substitution matrices like PAM and BLOSUM for scoring amino acid alignments. It explains the importance of scoring matrices in quantifying residue substitutions and the differences between transitions and transversions in nucleotide sequences. Additionally, it covers the construction and application of PAM and BLOSUM matrices in analyzing protein sequences based on evolutionary relationships.

Bioinformatics

Unit III: Database Similarity Search Techniques

Dr. Chandra Mohan D


Assistant Professor
Computer Science and Engineering Group
Indian Institute of Information Technology, Sri City

If you know your own DNA sequence, then you know everything about yourself

February 10, 2025 Bioinformatics 1


Outline
 Substitution or Scoring matrices (13-02-2023)
 PAM matrix and BLOSUM protein substitution matrices

 Database Similarity Search

 Unique requirements
 Heuristic search
 Basic Local Alignment Search Tool (BLAST) and
 FASTA
 Variants, Statistical significance

 Comparison of BLAST and FASTA



Scoring Matrices
 In the dynamic programming algorithm presented, the alignment procedure has to make use of a scoring system,
 which is a set of values for quantifying the likelihood of one residue being substituted by another in an alignment
 The scoring system is called a substitution matrix and is derived from
 statistical analysis of residue substitution data from sets of reliable alignments of highly related sequences


 Scoring matrices for nucleotide sequences are relatively simple
 A positive value or high score is given for a match and a negative value
or low score for a mismatch
 This assignment is based on the assumption that the frequencies of
mutation are equal for all bases
 However, this assumption may not be realistic



Purines and Pyrimidines

Pyrimidines:
A pyrimidine is an organic ring consisting of six atoms: 4 carbon atoms and 2 nitrogen atoms
The nitrogen atoms are placed in the 1 and 3 positions around the ring
Atoms or groups attached to this ring distinguish the pyrimidines, which include cytosine, thymine, and uracil; the ring also occurs in thiamine (vitamin B1) and barbiturates
Pyrimidines function in DNA and RNA, cell signaling, energy storage (as phosphates), enzyme regulation, and to make protein and starch


Purines:
A purine contains a pyrimidine ring fused with an imidazole ring (a five-member ring with two non-adjacent nitrogen atoms)
This two-ringed structure has nine atoms forming the ring: 5 carbon atoms and 4 nitrogen atoms

DNA Bases



Scoring Matrices: Transitions and Transversions
 Observations show that
 Transitions (substitutions between purines and purines or between pyrimidines and pyrimidines) occur more frequently than
 Transversions (substitutions between purines and pyrimidines)
 Therefore, a more sophisticated statistical model with different probability values to reflect the two types of mutations is needed
 Scoring matrices for amino acids are more complicated because
 scoring has to reflect the physicochemical properties of amino acid residues, as well as
 the likelihood of certain residues being substituted among true homologous sequences
 Certain amino acids with similar physicochemical properties can be
more easily substituted than those without similar characteristics
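
The transition/transversion distinction above can be sketched as a small nucleotide scoring function. This is only an illustration: the function names and the score values (+1, -1, -2) are assumptions chosen for the example, not values given in these slides.

```python
# Purines and pyrimidines among the four DNA bases
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def nucleotide_score(a, b, match=1, transition=-1, transversion=-2):
    """Score one aligned pair of DNA bases (score values are illustrative)."""
    a, b = a.upper(), b.upper()
    if a == b:
        return match
    # A transition stays within one chemical class (purine<->purine or
    # pyrimidine<->pyrimidine); a transversion crosses classes.
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_class else transversion


def alignment_score(seq1, seq2):
    """Sum the pair scores of two ungapped sequences of equal length."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(nucleotide_score(x, y) for x, y in zip(seq1, seq2))
```

Penalizing transversions more heavily than transitions reflects the observation that transitions occur more frequently.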



Scoring Matrices: Amino Acid Properties
 Substitutions among similar residues are likely to preserve the essential
functional and structural features
 However, substitutions between residues of different physicochemical
properties are more likely
 to cause disruptions to the structure and function

 This type of disruptive substitution is less likely to be selected in evolution because it renders nonfunctional proteins
 For example, phenylalanine, tyrosine, and tryptophan all share aromatic ring structures
 Because of their chemical similarities,
 they are easily substituted for each other without perturbing the regular function and structure of the protein
 Similarly, arginine, lysine, and histidine are all large basic residues and there is a high probability of them being substituted for each other
Scoring Matrices
 Aspartic acid, glutamic acid, asparagine, and glutamine belong to the
acid and acid amide groups and can be associated with relatively high
frequencies of substitution
 The hydrophobic residue group includes methionine, isoleucine,
leucine, and valine
 Small and polar residues include serine, threonine, and cysteine
 Residues within these groups have high likelihoods of being
substituted for each other
 However, cysteine contains a sulfhydryl group that plays a role in
metal binding, active site, and disulfide bond formation
 Substitution of cysteine with other residues therefore often abolishes
the enzymatic activity or destabilizes the protein structure
 Cysteine is thus a very infrequently substituted residue



Amino Acid Scoring Matrices
 Amino acid substitution matrices, which are 20 × 20 matrices, have
been devised to reflect the likelihood of residue substitutions
 There are essentially two types of amino acid substitution matrices
 One type is based on interchangeability of the genetic code or amino acid properties, and
 The other is derived from empirical studies of amino acid substitutions
 Although the two different approaches coincide to a certain extent,
 the first approach has been shown to be less accurate than the second approach


Amino Acid Scoring Matrices
 Thus, the empirical approach has gained the most popularity in
sequence alignment applications
 The empirical matrices, which include PAM and BLOSUM matrices, are derived from actual alignments of highly similar sequences
 PAM matrices are built from very closely related protein sequences and extrapolated to larger evolutionary distances
 BLOSUM matrices are built directly from conserved blocks of sequences at chosen identity levels
 By analyzing the probabilities of amino acid substitutions in these alignments, a scoring system can be developed by giving
 a high score for a more likely substitution and
 a low score for a rare substitution


Amino Acid Scoring Matrices
 For a given substitution matrix,
 A positive score means that the frequency of amino acid substitutions found in a data set of homologous sequences is greater than would have occurred by random chance
 They represent substitutions of very similar residues or identical residues
 A zero score means that the frequency of amino acid substitutions found in the homologous sequence data set is equal to that expected by chance
 In this case, the relationship between the amino acids is weakly similar in terms of physicochemical properties (size, shape, and charge of amino acids)


Amino Acid Scoring Matrices
 A negative score means that the frequency of amino acid
substitutions found in the homologous sequence data set is less
than would have occurred by random chance
 This normally occurs with substitutions between dissimilar
residues
Probability of amino acid substitutions:
The substitution matrices apply logarithmic conversions to describe the probability of amino acid substitutions
The converted values are the so-called log-odds scores (or log-odds ratios), which are
 Logarithmic ratios of the observed mutation frequency divided by the probability of substitution expected by random chance
The conversion can be either to the base of 10 or to the base of 2


Amino Acid Scoring Matrices
Example:
In an alignment that involves ten sequences, each having only one aligned position,
 nine of the sequences are F (phenylalanine) and the remaining one is I (isoleucine)
The observed frequency of I being substituted by F is one in ten (0.1), whereas the probability of I being substituted by F by random chance is one in twenty (0.05)
Thus, the ratio of the two probabilities is 2 (0.1/0.05)
Taking the logarithm of this ratio to the base of 2 gives a log-odds score of 1
This value can then be interpreted as the substitution between the two residues being 2^1, or two times, more frequent than expected by random chance
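
The arithmetic of this example can be reproduced with a few lines of code. The helper name `log_odds` is hypothetical and introduced only for illustration.

```python
import math


def log_odds(observed, expected, base=2):
    """Log-odds score: log of (observed frequency / frequency expected by chance)."""
    return math.log(observed / expected, base)


# Values from the example: observed 1/10, expected by chance 1/20
score = log_odds(0.1, 0.05)
print(round(score))  # the rounded value is what would be entered in a matrix
```

A score above zero means the substitution is seen more often than chance predicts; a score below zero means it is seen less often.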
Amino Acid Scoring Matrices- PAM
 The PAM matrices (also called Dayhoff PAM matrices) were first
constructed by Margaret Dayhoff, who compiled alignments of 71
groups of very closely related protein sequences
 PAM stands for “point accepted mutation” (also written as “accepted point mutation”, or APM)
 Because of the use of very closely related homologs,
 the observed mutations were not expected to significantly change the common function of the proteins
 Thus, the observed amino acid mutations are considered to be accepted by natural selection
 These protein sequences were clustered based on phylogenetic reconstruction using maximum parsimony
 The PAM matrices were subsequently derived based on the evolutionary divergence between sequences of the same cluster


Amino Acid Scoring Matrices-PAM Matrices
 Each PAM matrix is designed to compare two sequences which are a
specific number of PAM units apart
 Two sequences S1 and S2 are at evolutionary distance PAM1 if S1 has
converted to S2 with an average of one substitution per 100 amino acids
 One PAM unit is defined as 1% of the amino acid positions that have
been changed
 To construct a PAM1 substitution table, a group of closely related
sequences with mutation frequencies corresponding to one PAM unit is
chosen
 Construction of the PAM1 matrix involves alignment of full-length
sequences and construction of phylogenetic trees using the parsimony
principle
 This allows computation of ancestral sequences for each internal node of
the trees
Amino Acid Scoring Matrices-PAM Matrices
 Ancestral sequence information is used to count the number of
substitutions along each branch of a tree
Mutability:
The number of mutational changes from a common ancestor for a
particular amino acid residue divided by the total number of such
residues occurring in an alignment
 The PAM score for a particular residue pair is derived using the following steps
 Calculation of relative mutability
 Normalization of the expected residue substitution frequencies by random chance, and
 Logarithmic transformation to the base of 10 of the normalized mutability value divided by the frequency of a particular residue
 The resulting value is rounded to the nearest integer and entered into the substitution matrix
Amino Acid Scoring Matrices-PAM Matrices
 This completes the log-odds score computation
 After compiling all substitution probabilities of possible amino acid
mutations, a 20 × 20 PAM matrix is established
 Positive scores in the matrix denote substitutions occurring more
frequently than expected among evolutionarily conserved
replacements
 Negative scores correspond to substitutions that occur less frequently
than expected
 Other PAM matrices with increasing numbers for more divergent
sequences are extrapolated from PAM1 through matrix multiplication
 For example, PAM80 is produced by values of the PAM1 matrix
multiplied by itself eighty times
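
The extrapolation by repeated multiplication can be sketched on a toy matrix. Everything here is an assumption made for illustration: the three-residue alphabet, the values in `TOY_PAM1`, and the helper names. The multiplication is applied to mutation probabilities (each row sums to 1); log-odds scores are taken afterwards.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Multiply a matrix by itself `power` times, starting from the identity."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Hypothetical PAM1-like matrix over three made-up residues:
# each residue has roughly a 1% chance of changing in one PAM unit.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

toy_pam80 = mat_pow(TOY_PAM1, 80)
for row in toy_pam80:
    print([round(p, 3) for p in row])
```

In the printed result the diagonal entries (the chance that a residue appears unchanged) remain well above what 80 independent 1% changes would suggest, because repeated and reverse substitutions at the same position are folded into the product.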



Amino Acid Scoring Matrices-PAM Matrices

 The mathematical transformation accounts for multiple substitutions having occurred in an amino acid position during evolution
 For example, when a mutation is observed as F replaced by I, the evolutionary changes may have actually undergone a number of intermediate steps before becoming I, such as in a scenario of F → M → L → I
 For that reason, a PAM80 matrix corresponds to only about 50% observed sequence difference, not 80%
Amino Acid Scoring Matrices-PAM Matrices
 The increasing PAM numbers correlate with increasing PAM units
and thus evolutionary distances of protein sequences
 For example, PAM250, which corresponds to 20% amino acid
identity, represents 250 mutations per 100 residues
 In theory, the number of evolutionary changes approximately
corresponds to an expected evolutionary span of 2,500 million years
 Thus, the PAM250 matrix is normally used for divergent sequences
 Accordingly, PAM matrices with lower serial numbers are more
suitable for aligning more closely related sequences
 The extrapolated values of the PAM250 amino acid substitution matrix are shown in the figure



Amino Acid Scoring Matrices-PAM Matrices

PAM250 amino acid substitution matrix. Residues are grouped according to physicochemical similarities
Amino Acid Scoring Matrices- BLOSUM
 In the PAM matrix construction, the only direct observation of residue substitutions is in PAM1,
 based on a relatively small set of extremely closely related sequences
 Sequence alignment statistics for more divergent sequences are not available
 To fill in the gap, a new set of substitution matrices has been developed
 This is the series of amino acid blocks substitution matrices (BLOSUM), all of which are derived based on
 direct observation for every possible amino acid substitution in multiple sequence alignments
 The sequence patterns, also called blocks, are ungapped alignments of fewer than sixty amino acid residues in length
Amino Acid Scoring Matrices-BLOSUM
 The frequencies of amino acid substitutions of the residues in these blocks are calculated
 to produce a numerical table, or block substitution matrix
 The number in a BLOSUM matrix name is the actual percentage identity value of the sequences selected for construction of the matrix
 BLOSUM62 indicates that the sequences selected for constructing the matrix share an average identity value of 62%
 Other BLOSUM matrices based on sequence groups of various identity levels have also been constructed
 In the reverse order of the PAM numbering system, the lower the BLOSUM number, the more divergent the sequences it represents
Amino Acid Scoring Matrices-BLOSUM

 Count the frequency of occurrence of each amino acid
 Count the frequency of each amino acid pair aligned in the same column
 Count the observed frequency of amino acid pairs: (AB)obs = 8/60


Amino Acid Scoring Matrices-BLOSUM Matrices
 The expected frequency of an amino acid pair is computed from the individual residue frequencies
 Since ancestral states are not known, the substitutions A→B and B→A are equally probable, each with expected frequency (14/24) × (4/24)
 Instead of considering the two directions separately, they are combined: (AB)exp = 2 × (14/24) × (4/24)
 Take the ratio of observed and expected frequency of each pair
 The BLOSUM score for a particular residue pair is derived from
 the log ratio of observed residue substitution frequency versus the expected probability of a particular residue
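
A minimal sketch of this observed-versus-expected calculation, using the counts quoted on these slides (8 of 60 pairs are A-B; residue frequencies 14/24 for A and 4/24 for B). The function name and its `same_residue` flag are assumptions made for the example.

```python
import math


def blosum_log_odds(pair_observed, freq_a, freq_b, same_residue=False):
    """Base-2 log ratio of the observed pair frequency to the expected one.

    For two different residues the expected frequency is doubled, because
    A->B and B->A cannot be told apart without the ancestral state.
    """
    expected = freq_a * freq_b if same_residue else 2 * freq_a * freq_b
    return math.log2(pair_observed / expected)


raw_score = blosum_log_odds(8 / 60, 14 / 24, 4 / 24)
print(raw_score, round(raw_score))
```

With these counts the ratio is below 1, so the log-odds value is negative: the A-B pair is seen less often than chance would predict.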



Amino Acid Scoring Matrices-BLOSUM Matrices
 The log odds is taken to the base of 2 instead of 10 as in the PAM
matrices
 The resulting value is rounded to the nearest integer and entered into
the substitution matrix
 As in the PAM matrices, positive and negative values correspond to
substitutions that occur more or less frequently than expected among
evolutionarily conserved replacements
 The principal difference is that the PAM matrices (except PAM1) are derived from an evolutionary model, whereas the BLOSUM matrices consist entirely of direct observations
 Thus, the BLOSUM matrices may have less evolutionary meaning than the PAM matrices
 This is why the PAM matrices are used most often for reconstructing phylogenetic trees
Comparison between PAM and BLOSUM
 However, because of the mathematical extrapolation procedure used,
the PAM values may be less realistic for divergent sequences
 The BLOSUM matrices are entirely derived from local sequence alignments of conserved sequence blocks, whereas
 the PAM1 matrix is based on the global alignment of full-length sequences composed of both conserved and variable regions
 This is why the BLOSUM matrices may be more advantageous in searching databases and finding conserved domains in proteins
 Several empirical tests have shown that the BLOSUM matrices outperform the PAM matrices in terms of accuracy of local alignment
 This could be largely because the BLOSUM matrices are derived from
 a much larger and more representative dataset than the one used to derive the PAM matrices


Introduction: Database Similarity Search
 A main application of pairwise alignment is
 retrieving biological sequences in databases based on similarity
 This process involves submission of a query sequence and
 performing a pairwise comparison of the query sequence with all individual sequences in a database
 Thus, database similarity searching is pairwise alignment on a large scale
 This type of searching is one of the most effective ways
 to assign putative functions to newly determined sequences
 However, the dynamic programming method is slow and impractical to use in most cases
 Special search methods are needed to speed up the computational process of sequence comparison
Why Database Similarity Search
 A common reason for performing a database search with a query sequence is
 to find a related gene in another organism
 For a query sequence of unknown function, a matched gene may provide a clue as to function
 Alternatively, a query sequence of known function (e.g., a yeast gene) may be used
 to search through sequences of a particular organism (e.g., a plant) to identify a gene that may have the same function
 Sequences of an organism that are collected for such purposes include
 genomic sequences (sequences of BAC clones or the assembled sequence of an entire chromosome),
 EST sequences, and
 cDNA/protein sequences for particular genes
Unique Requirements of Database Searching
 There are unique requirements for implementing algorithms for
sequence database searching
 The first criterion is sensitivity, which refers to the ability to find as
many correct hits as possible
 It is measured by the extent of inclusion of correctly identified
sequence members of the same family
 These correct hits are considered “true positives” in the database
searching exercise
 The second criterion is selectivity, also called specificity, which
refers to the ability to exclude incorrect hits
 These incorrect hits are unrelated sequences mistakenly identified in
database searching and are considered “false positives”
 The third criterion is speed, which is the time it takes to get results
from database searches
Metrics: Sensitivity and Specificity
 Sensitivity is also referred to as the true positive (recognition) rate
 the proportion of positive tuples that are correctly identified
 Specificity is the true negative rate
 the proportion of negative tuples that are correctly identified
 It can be shown that accuracy is a function of sensitivity and specificity:
accuracy = sensitivity × P/(P + N) + specificity × N/(P + N), where P and N are the numbers of positive and negative tuples
 Example with 300 positive and 9,700 negative tuples:
 The sensitivity of the classifier is 90/300 = 30.00%
 The specificity is 9560/9700 = 98.56%
 The classifier’s overall accuracy is 9650/10,000 = 96.50%
 https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
Data Mining: Concepts and Techniques
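
The three figures quoted on this slide can be recomputed from confusion-matrix counts. The false-negative and false-positive counts below (210 and 140) are not stated on the slide; they are derived here as 300 − 90 and 9,700 − 9,560, and the function name is hypothetical.

```python
def classifier_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    positives = tp + fn   # P: all truly positive tuples
    negatives = tn + fp   # N: all truly negative tuples
    total = positives + negatives
    sensitivity = tp / positives
    specificity = tn / negatives
    # Accuracy written as a weighted combination of the two rates
    accuracy = sensitivity * positives / total + specificity * negatives / total
    return sensitivity, specificity, accuracy


sens, spec, acc = classifier_metrics(tp=90, fn=210, tn=9560, fp=140)
print(f"{sens:.2%} {spec:.2%} {acc:.2%}")
```

The example shows how a classifier can reach high accuracy while missing most positives, which is why sensitivity and selectivity are tracked separately in database searching.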
Unique Requirements of Database Searching
 Depending on the size of the database, speed sometimes can be a
primary concern
 Ideally, one wants to have the greatest sensitivity, selectivity, and
speed in database searches
 However, satisfying all three requirements is difficult in reality
 What generally happens is that an increase in sensitivity is associated
with decrease in selectivity
 A very inclusive search tends to include many false positives
 Similarly, an improvement in speed often comes at the cost of
lowered sensitivity and selectivity
 A compromise between the three criteria often has to be made
 In database searching, as well as in many other areas in bioinformatics, there are two fundamental types of algorithms



Unique Requirements of Database Searching
 One is the exhaustive type, which uses a rigorous algorithm to
 find the best or exact solution for a particular problem by examining all mathematical combinations
 Dynamic programming is an example of the exhaustive method and is computationally very intensive
 Another is the heuristic type, which is a computational strategy to
 find an empirical or near optimal solution by using rules of thumb
 Essentially, this type of algorithm takes shortcuts by reducing the search space according to some criteria
 However, the shortcut strategy is not guaranteed to find the best or most accurate solution
 It is often used because of the need for obtaining results within a realistic time frame without significantly sacrificing the accuracy of the computational output
Heuristic Database Searching
 Searching a large database using dynamic programming methods, such as the Smith–Waterman algorithm,
 although accurate and reliable, is too slow and impractical when computational resources are limited
 An earlier estimate showed that
 querying a database of 300,000 sequences using a query sequence of 100 residues took 2–3 hours to complete with a regular computer system at the time
 Thus, speed of searching became an important issue
 To speed up the comparison, heuristic methods have to be used
 The heuristic algorithms perform faster searches because
 they examine only a fraction of the possible alignments examined in regular dynamic programming


Heuristic Database Searching
 Currently, there are two major heuristic algorithms for performing
database searches: BLAST and FASTA
 These methods are not guaranteed to find the optimal alignment or
true homologs, but are 50–100 times faster than dynamic
programming
 The increased computational speed comes at a moderate expense of
sensitivity and specificity of the search, which is easily tolerated by
working molecular biologists
 Both programs can provide a reasonably good indication of sequence
similarity by identifying similar sequence segments
 Both BLAST and FASTA use a heuristic word method for fast
pairwise sequence alignment
 It works by finding short stretches of identical or nearly identical
letters in two sequences
Heuristic Database Searching: BLAST
 These short strings of characters are called words, which are similar
to the windows used in the dot matrix method
 The basic assumption is that two related sequences must have at least
one word in common
 By first identifying word matches, a longer alignment can be
obtained by extending similarity regions from the words
 Once regions of high sequence similarity are found, adjacent high-
scoring regions can be joined into a full alignment
 The BLAST program was developed by Stephen Altschul of NCBI in 1990 and has since become
 one of the most popular programs for sequence analysis
 BLAST uses heuristics to align a query sequence with all sequences in a database


Heuristic Database Searching: BLAST
 The objective is to find high-scoring ungapped segments among related sequences
 The existence of such segments above a given threshold indicates pairwise similarity beyond random chance, which helps
 to discriminate related sequences from unrelated sequences in a database
BLAST performs sequence alignment through the following steps
Step 1:
 Create a list of words from the query sequence
 Each word is typically three residues for protein sequences and eleven residues for DNA sequences
 The list includes every possible word extracted from the query sequence
 This step is also called seeding
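
The seeding step can be sketched as a sliding window over the query. The function name and the example peptide are hypothetical; only the word sizes (three for proteins, eleven for DNA) come from the slide.

```python
def query_words(query, word_size=3):
    """List every overlapping word in the query together with its start position."""
    return [(i, query[i:i + word_size])
            for i in range(len(query) - word_size + 1)]


# Illustrative protein query; word_size=3 is the typical protein setting
for position, word in query_words("MKVLAT"):
    print(position, word)
```

A query of length L yields L − w + 1 words of size w, so a 100-residue protein gives 98 three-letter words.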
Heuristic Database Searching: BLAST
Step 2:
Search a sequence database for the occurrence of these words
This step is to identify database sequences containing the matching words
Step 3:
The matching of the words is scored by a given substitution matrix
A word is considered a match if its score is above a threshold
Step 4:
Perform pairwise alignment starting from the highest-scoring word by extending the word in both directions while counting the alignment score using the same substitution matrix
The extension continues until the score of the alignment drops below a threshold due to mismatches (the drop threshold is twenty-two for proteins and twenty for DNA)
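
A rough sketch of the ungapped extension in Step 4. Several things here are assumptions rather than BLAST's actual implementation: a simple match/mismatch score (+5/−4) stands in for a substitution matrix, the function names are made up, and the stopping rule is read as "stop once the running score has fallen more than `drop` below the best score seen so far".

```python
def pair_score(a, b, match=5, mismatch=-4):
    """Illustrative stand-in for a substitution matrix lookup."""
    return match if a == b else mismatch


def extend_seed(query, subject, q_start, s_start, word_size, drop=20):
    """Extend a word match in both directions without gaps.

    Returns (best_score, query_start, query_end) of the best segment found,
    with query_end exclusive.
    """
    seed = sum(pair_score(query[q_start + k], subject[s_start + k])
               for k in range(word_size))
    best = running = seed
    q_end, s_end = q_start + word_size, s_start + word_size

    # Extend to the right of the seed
    right = step = 0
    while q_end + step < len(query) and s_end + step < len(subject):
        running += pair_score(query[q_end + step], subject[s_end + step])
        step += 1
        if running > best:
            best, right = running, step
        elif best - running > drop:
            break

    # Extend to the left of the seed, starting again from the best score
    running = best
    left = step = 0
    while q_start - step - 1 >= 0 and s_start - step - 1 >= 0:
        running += pair_score(query[q_start - step - 1], subject[s_start - step - 1])
        step += 1
        if running > best:
            best, left = running, step
        elif best - running > drop:
            break

    return best, q_start - left, q_end + right


score, start, end = extend_seed("AAGCTTGG", "CCGCTTAA", 2, 2, word_size=3)
print(score, "AAGCTTGG"[start:end])
```

Keeping only the best-scoring extension, rather than the last position reached, means trailing mismatches are trimmed from the reported segment.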
Heuristic Database Searching: BLAST
Step 5:
The resulting contiguous aligned segment pair without gaps is called a high-scoring segment pair (HSP)
In the original version of BLAST, the highest scored HSPs are presented as the final report
They are also called maximum scoring pairs

-------------------------------------------------------------------------------------
A recent improvement in the implementation of BLAST is the ability to provide gapped alignment
In gapped BLAST, the highest scored segment is chosen to be extended in both directions using dynamic programming where gaps may be introduced
The extension continues if the alignment score is above a certain threshold; otherwise it is terminated


Variants of BLAST
 BLAST is a family of programs that includes BLASTn, BLASTp, BLASTx, tBLASTn, and tBLASTx
 BLASTn queries nucleotide sequences against a nucleotide sequence database
 BLASTp uses protein sequences as queries to search against a protein sequence database
 BLASTx uses nucleotide sequences as queries and translates them in all six reading frames to produce translated protein sequences, which are used to query a protein sequence database
 tBLASTn queries protein sequences against a nucleotide sequence database with the sequences translated in all six reading frames
 tBLASTx uses nucleotide sequences, which are translated in all six frames, to search against a nucleotide sequence database that has all the sequences translated in six frames
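
The "six reading frames" used by BLASTx, tBLASTn, and tBLASTx are three offsets on the forward strand and three on the reverse complement. The sketch below only extracts the six nucleotide frames; codon-to-amino-acid translation is left out, and the function names are hypothetical.

```python
# Translation table mapping each base to its complement
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def six_frames(dna):
    """Return the six reading frames, each trimmed to whole codons."""
    forward = dna.upper()
    frames = []
    for strand in (forward, reverse_complement(forward)):
        for offset in range(3):
            # Drop trailing bases that do not complete a codon
            end = len(strand) - (len(strand) - offset) % 3
            frames.append(strand[offset:end])
    return frames


for frame in six_frames("ATGGCCTAA"):
    print(frame)
```

Each frame would then be translated codon by codon before being compared with protein sequences.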


Statistical Significance
 The BLAST output provides a list of pairwise sequence matches
ranked by statistical significance
 The significance scores help to distinguish evolutionarily related
sequences from unrelated ones
 Generally, only hits above a certain threshold are displayed
 Deriving the statistical measure is slightly different from that for
single pairwise sequence alignment; the larger the database, the more
unrelated sequence alignments there are
 This necessitates a new parameter that takes into account the total
number of sequence alignments conducted, which is proportional to
the size of the database
 In BLAST searches, this statistical indicator is known as the E-value (expectation value), and it indicates the number of alignments with an equivalent or better score that are expected to occur in a database search by random chance
Statistical Significance
 BLAST compares a query sequence against all database sequences,
and so the E-value is determined by the following formula:
E=m×n×P
where m is the total number of residues in a database, n is the
number of residues in the query sequence, and P is the probability
that an HSP alignment is a result of random chance
 Example: aligning a query sequence of 100 residues to a database
containing a total of 10^12 residues results in a P-value for the
ungapped HSP region in one of the database matches of 1 × 10^−20
 The E-value, which is the product of the three values, is 100 × 10^12 × 10^−20, which equals 10^−6
 It is expressed as 1e−6 in BLAST output
 This indicates that about 10^−6 matches of this quality are expected by random chance; for such small values, the E-value is approximately the probability of a chance match
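
The worked example follows directly from E = m × n × P. The function name is hypothetical; the numbers are the ones used on this slide.

```python
def e_value(db_residues, query_residues, p_value):
    """E = m * n * P, with m the database size and n the query length in residues."""
    return db_residues * query_residues * p_value


e = e_value(db_residues=10**12, query_residues=100, p_value=1e-20)
print(f"{e:.0e}")
```

Because E scales with m, the same alignment becomes less significant when searched against a larger database.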
Statistical Significance
 The E-value provides information about the likelihood that a given
sequence match is purely by chance
 The lower the E-value, the less likely the database match is a result
of random chance and therefore the more significant the match is
 Empirical interpretation of the E-value is as follows.
 If E < 1e − 50 (or 1 × 10^−50), there should be an extremely high
confidence that the database match is a result of homologous
relationships
 If E is between 0.01 and 1e − 50, the match can be considered a
result of homology
 If E is between 0.01 and 10, the match is considered not significant,
but may hint at a tentative remote homology relationship
 Additional evidence is needed to confirm the tentative relationship
 If E > 10, the sequences under consideration are either unrelated or too distantly related to be detected by this method
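
These empirical ranges can be written as a small lookup. The function name and the handling of values that fall exactly on a boundary (0.01 or 10) are assumptions; the slide does not say which side they belong to.

```python
def interpret_e_value(e):
    """Map an E-value to the empirical categories listed above."""
    if e < 1e-50:
        return "extremely high confidence of homology"
    if e < 0.01:
        return "homology"
    if e <= 10:
        return "not significant; possible remote homology, needs more evidence"
    return "unrelated or undetectably distant"


for value in (1e-60, 1e-6, 0.5, 50):
    print(value, "->", interpret_e_value(value))
```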
Heuristic Database Searching: FASTA
 FASTA was in fact the first database similarity search tool
developed, preceding the development of BLAST
 FASTA uses a “hashing” strategy to find matches for a short stretch of identical residues with a length of k
 The strings of residues are known as ktuples or ktups, which are equivalent to words in BLAST, but are normally shorter than the words
 Typically, a ktup is composed of two residues for protein sequences and six residues for DNA sequences
FASTA performs sequence alignment through the following steps
Step 1:
 The first step in FASTA alignment is to identify ktups between two sequences by using the hashing strategy


Heuristic Database Searching: FASTA
 This strategy works by constructing a lookup table that shows the position of each ktup for the two sequences under consideration
 The positional difference for each word between the two sequences is obtained by subtracting the position of the first sequence from that of the second sequence and is expressed as the offset
 The ktups that have the same offset values are then linked to reveal a contiguous identical sequence region that corresponds to a stretch of diagonal in a 2D matrix
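
The hashing step can be sketched with a dictionary as the lookup table. This is a simplified illustration, not the FASTA implementation: the function names and the two short example sequences are made up, and positions are 0-based.

```python
from collections import defaultdict


def ktup_positions(seq, k):
    """Lookup table: each ktup mapped to the list of positions where it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def diagonal_hits(seq1, seq2, k=2):
    """Group identical ktup matches by offset (position in seq2 minus position in seq1)."""
    table1 = ktup_positions(seq1, k)
    hits = defaultdict(list)
    for j in range(len(seq2) - k + 1):
        for i in table1.get(seq2[j:j + k], []):
            hits[j - i].append((i, j))
    return dict(hits)


# Matches that share an offset lie on the same diagonal of the dot matrix
print(diagonal_hits("AMPSDG", "TAMPSD", k=2))
```

An offset that collects many hits marks a diagonal with a long run of identical residues, which is what the next step ranks.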
Step 2:
 Narrow down the high similarity regions between the two sequences
 Normally, many diagonals between the two sequences can be identified in the hashing step
 The top ten regions with the highest density of diagonals are identified as high similarity regions
Heuristic Database Searching: FASTA
 The diagonals in these regions are scored using a substitution matrix
 Neighboring high-scoring segments along the same diagonal are selected and joined to form a single alignment
 This step allows introducing gaps between the diagonals while applying gap penalties
 The score of the gapped alignment is calculated again
Step 3:
 The gapped alignment is refined further using the Smith–Waterman algorithm to produce a final alignment
Step 4:
 Perform a statistical evaluation of the final alignment as in BLAST, which produces the E-value




Statistical Significance
 FASTA also uses E-values and bit scores
 Estimation of the two parameters in FASTA is essentially the same
as in BLAST
 However, the FASTA output provides one more statistical parameter,
the Z-score
 This describes the number of standard deviations from the mean
score for the database search
 For a Z-score > 15, the match can be considered extremely
significant, with certainty of a homologous relationship
 If Z is in the range of 5 to 15, the sequence pair can be described as
highly probable homologs
 If Z < 5, their relationship is described as less certain
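
A simplified sketch of the Z-score idea: how many standard deviations a match score lies above the mean score of the search. FASTA's own calculation is more involved, so this should be read only as an illustration of the definition given above; the function names, the example scores, and the treatment of the boundary values 5 and 15 are assumptions.

```python
import statistics


def z_score(score, database_scores):
    """Number of standard deviations between a score and the mean database score."""
    mean = statistics.mean(database_scores)
    spread = statistics.pstdev(database_scores)
    return (score - mean) / spread


def interpret_z(z):
    """Map a Z-score to the categories listed above."""
    if z > 15:
        return "extremely significant; homologous"
    if z >= 5:
        return "highly probable homologs"
    return "less certain"


z = z_score(30, [8, 12])  # made-up scores: mean 10, standard deviation 2
print(z, interpret_z(z))
```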



Variants of FASTA
 Similar to BLAST, FASTA has a number of subprograms
 The web-based FASTA program offered by the European Bioinformatics Institute (www.ebi.ac.uk/) allows
 the use of either DNA or protein sequences as the query to search against a protein database or nucleotide database
 Some available variants of the program are FASTx, which
 translates a DNA sequence and uses the translated protein sequence to query a protein database, and
 tFASTx, which compares a protein query sequence to a translated DNA database


Comparison of FASTA and BLAST
 BLAST and FASTA have been shown to perform almost equally
well in regular database searching
 Both FASTA and BLAST have undergone evolution to recent versions that
 provide very powerful search tools for the molecular biologist and are freely available
 However, there are some notable differences between the two approaches
 The major difference is in the seeding step;
 BLAST uses a substitution matrix to find matching words, whereas
 FASTA identifies identical matching words using the hashing procedure


Comparison of FASTA and BLAST
 By default, FASTA scans smaller window sizes
 Thus, it gives more sensitive results than BLAST, with a better
coverage rate for homologs
 However, it is usually slower than BLAST
 The use of low-complexity masking in the BLAST procedure means
that it may have higher specificity than FASTA because potential
false positives are reduced
 BLAST sometimes gives multiple best-scoring alignments from the
same sequence;
 FASTA returns only one final alignment



End of Module-III

Thank You
