Sequence Analysis in Bioinformatics
Bioinformatics
BIOF605
• Sequence comparison lies at the heart of bioinformatics analysis. It is
an important first step toward structural and functional analysis of
newly determined sequences.
• As new biological sequences are generated at exponential rates,
sequence comparison is becoming increasingly important for drawing
functional and evolutionary inferences about a new protein from
proteins already in the database.
• The most fundamental process in this type of comparison is sequence
alignment. This is the process by which sequences are compared by
searching for common character patterns and establishing residue–
residue correspondence among related sequences. Pairwise sequence
alignment is the process of aligning two sequences and is the basis of
database similarity searching and multiple sequence alignment.
DEFINITION OF SEQUENCE ALIGNMENT
• Sequence alignment is the procedure of comparing two (pair-wise
alignment) or more (multiple sequence alignment) sequences by searching
for a series of individual characters or character patterns that are in the
same order in the sequences. Two sequences are aligned by writing them
across a page in two rows.
• Identical or similar characters are placed in the same column, and
nonidentical characters can either be placed in the same column as a
mismatch or opposite a gap in the other sequence. In an optimal
alignment, nonidentical characters and gaps are placed to bring as many
identical or similar characters as possible into vertical register.
• Sequences that can be readily aligned in this manner are said to be similar.
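The two-row layout described above can be sketched in a few lines of Python. This is only an illustration: the helper name `match_line` and the example sequences are assumptions, and the rows are taken to be already aligned, with "-" marking a gap.

```python
def match_line(row_a: str, row_b: str) -> str:
    """Return a line with '|' under each column holding identical residues."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return "".join(
        "|" if a == b and a != "-" else " "
        for a, b in zip(row_a, row_b)
    )


# Two hypothetical rows of a finished alignment; '-' is a gap.
row_a = "GATTACA"
row_b = "GA-TACA"

print(row_a)
print(match_line(row_a, row_b))
print(row_b)
```

Each "|" marks a column in which identical characters have been brought into vertical register; the blank column is the position where one sequence faces a gap.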
Pairwise Sequence Alignment
• The goal of pairwise sequence alignment is to establish a
correspondence between the elements in a pair of sequences that
share a common property, such as common ancestry or a common
structural or functional role.
• In bioinformatics, the sequences under consideration are typically
nucleic acid or amino acid polymers.
• We will consider two variants of the pairwise sequence alignment
problem: global alignment and local alignment.
• Global Alignment
For two hypothetical protein sequence fragments, the global alignment is stretched over
the entire sequence length to include as many matching amino acids as possible up to and
including the sequence ends. Vertical bars between the sequences indicate the presence of
identical amino acids. Although there is an obvious region of identity in this example (the
sequence GKG preceded by a commonly observed substitution of T for A), a global alignment
may not align such regions so that more amino acids along the entire sequence lengths can be
matched.
• Local Alignment
In a local alignment, the alignment stops at the ends of regions of identity or strong similarity,
and a much higher priority is given to finding these local regions than to extending the alignment
to include more neighboring amino acid pairs. Dashes indicate sequence not included in the
alignment. This type of alignment favors finding conserved nucleotide patterns in DNA sequences
or amino acid patterns in protein sequences.
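A minimal sketch of local alignment, assuming the Smith–Waterman recurrence with a linear gap penalty. The scoring values (match 2, mismatch -1, gap -2) and the example sequences are illustrative assumptions, not values from the slides.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned segment of a, aligned segment of b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # A cell never drops below 0, so a poor region ends the alignment.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    seg_a, seg_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            seg_a.append(a[i - 1]); seg_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            seg_a.append(a[i - 1]); seg_b.append("-"); i -= 1
        else:
            seg_a.append("-"); seg_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(seg_a)), "".join(reversed(seg_b))


print(smith_waterman("TTTGKGAAA", "CCCGKGCCC"))
```

Only the shared GKG region is reported here; the unrelated flanking residues are left out of the alignment rather than forced into it.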
Dot Matrix Method
• The most basic sequence alignment method is the dot matrix method, also known as the dot plot
method.
• It is a graphical way of comparing two sequences in a two-dimensional matrix.
• In a dot matrix, two sequences to be compared are written in the horizontal and vertical axes of
the matrix.
• The comparison is done by scanning each residue of one sequence for similarity with all residues
in the other sequence.
• If a residue match is found, a dot is placed within the graph.
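The procedure above can be sketched as follows. The function names and the example sequences are assumptions for illustration; "*" stands for a dot and "." for an empty cell so that the grid stays readable in plain text.

```python
def dot_matrix(seq_x, seq_y):
    """Rows follow seq_y (vertical axis), columns follow seq_x (horizontal axis)."""
    return [[x == y for x in seq_x] for y in seq_y]


def render(seq_x, seq_y):
    """Draw the matrix as text: '*' where residues match, '.' elsewhere."""
    lines = ["  " + " ".join(seq_x)]
    for y, row in zip(seq_y, dot_matrix(seq_x, seq_y)):
        lines.append(y + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("GATTACA", "GATACA"))
```

Runs of matches along a diagonal indicate regions where the two sequences are similar.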
Dynamic Programming Method
• Dynamic programming is a method that determines the optimal alignment by comparing all
possible pairs of characters between the two sequences.
• It is fundamentally similar to the dot matrix method in that it also creates a two-dimensional
alignment grid.
• However, it finds alignment in a more quantitative way by converting a dot matrix into a scoring
matrix to account for matches and mismatches between sequences.
• By searching for the set of highest scores in this matrix, the best alignment can be accurately
obtained.
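A minimal sketch of the scoring-matrix idea, assuming the Needleman–Wunsch global alignment recurrence with a simple match/mismatch score and a linear gap penalty. The values (match 1, mismatch -1, gap -2) are illustrative assumptions; real protein alignments would normally take residue scores from a substitution matrix.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (optimal global score, aligned row for a, aligned row for b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # First row and column: a prefix aligned entirely against gaps.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    # Fill each cell from its diagonal, upper, and left neighbours.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = n, m
    row_a, row_b = [], []
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


total, row_a, row_b = needleman_wunsch("GATTACA", "GATACA")
print(total)
print(row_a)
print(row_b)
```

When several alignments share the optimal score, this traceback returns just one of them, preferring the diagonal move.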
PAM Matrices
• The PAM matrices (also called Dayhoff PAM matrices) were first constructed by Margaret Dayhoff,
who compiled alignments of seventy-one groups of very closely related protein sequences.
• PAM stands for “point accepted mutation” (although “accepted point mutation” or APM may be a
more appropriate term, PAM is easier to pronounce).
• Because of the use of very closely related homologs, the observed mutations were not expected
to significantly change the common function of the proteins.
• Thus, the observed amino acid mutations are considered to be accepted by natural selection.
BLOSUM Matrices
• Unlike PAM matrices, which rely on an extrapolation function, BLOSUM matrices are derived
directly from observed alignments, and the number in the name refers to the percentage identity
of the sequences selected for constructing the matrix.
• For example, BLOSUM62 is built from sequences clustered at a 62% identity threshold: sequences
sharing 62% identity or more are grouped and weighted as one.
• Other BLOSUM matrices based on sequence groups of various identity levels have also been
constructed.
• In reverse of the PAM numbering system, the lower the BLOSUM number, the more divergent the
sequences the matrix represents.
Pairwise Sequence Alignment Tools (EMBL-EBI)
Global alignment using the Needleman–Wunsch algorithm; local alignment using the Smith–Waterman algorithm
Multiple Sequence Alignment
• From a multiple alignment of three or more protein sequences, the
highly conserved residues that define structural and functional
domains in protein families can be identified.
• New members of such families can then be found by searching
sequence databases for other sequences with these same domains.
• Alignment of DNA sequences can assist in finding conserved
regulatory patterns in DNA sequences.
• Despite the great value of multiple sequence alignments, obtaining
one presents a very difficult algorithmic problem.
SCORING FUNCTION
• Multiple sequence alignment aims to arrange sequences in such a way that a maximum number of
residues from each sequence are matched up according to a particular scoring function.
• The scoring function for multiple sequence alignment is based on the concept of sum of pairs
(SP).
• As the name suggests, it is the sum of the scores of all possible pairs of sequences in a multiple
alignment based on a particular scoring matrix. In calculating the SP scores, each column is scored
by summing the scores for all possible pairwise matches, mismatches and gap costs.
• The score of the entire alignment is the sum of all of the column scores. The purpose of most
multiple sequence alignment algorithms is to achieve maximum SP scores.
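The column-by-column SP calculation can be sketched as follows. The scoring values (match 1, mismatch -1, gap -2) stand in for a real substitution matrix and are assumptions, as is the convention that a gap paired with a gap scores 0.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score one pair of characters taken from the same column."""
    if x == "-" and y == "-":
        return 0  # assumed convention: gap against gap costs nothing
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sum_of_pairs(rows):
    """Sum pair_score over every pair of rows in every column."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    return sum(
        pair_score(x, y)
        for column in zip(*rows)
        for x, y in combinations(column, 2)
    )


# Hypothetical three-sequence alignment.
alignment = ["AC-", "ACT", "GCT"]
print(sum_of_pairs(alignment))
```

For this toy alignment the three columns score -1, 3, and -3, so the SP score of the whole alignment is -1.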
Progressive Alignment Method
• Progressive alignment depends on the stepwise assembly of multiple alignment and is heuristic
in nature.
• It speeds up the alignment of multiple sequences through a multistep process. It first conducts
pairwise alignments for each possible pair of sequences using the Needleman–Wunsch global
alignment method and records these similarity scores from the pairwise comparisons.
• The scores can either be percent identity or similarity scores based on a particular substitution
matrix. Both scores correlate with the evolutionary distances between sequences.
• The scores are then converted into evolutionary distances to generate a distance matrix for all
the sequences involved. A simple phylogenetic analysis is then performed based on the
distance matrix to group sequences based on pairwise distance scores.
• As a result, a phylogenetic tree is generated using the neighbor-joining method. The tree
reflects evolutionary proximity among all the sequences.
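The distance-matrix step can be sketched as follows, under simplifying assumptions: the rows are taken to be already aligned to a common length, and the distance is the plain fraction of differing positions (1 minus fractional identity), not a corrected evolutionary distance. Tree construction by neighbor joining is not implemented; the sketch only reports the closest pair in the matrix. All names and sequences are hypothetical.

```python
from itertools import combinations


def p_distance(row_a, row_b):
    """Fraction of differing positions, ignoring columns that contain a gap."""
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(aligned):
    """Map each pair of sequence names to its distance."""
    return {
        (p, q): p_distance(aligned[p], aligned[q])
        for p, q in combinations(aligned, 2)
    }


def closest_pair(distances):
    """Return the pair of names separated by the smallest distance."""
    return min(distances, key=distances.get)


aligned = {"s1": "ACGT", "s2": "ACGA", "s3": "TCAA"}
distances = distance_matrix(aligned)
print(distances)
print(closest_pair(distances))
```

In this toy case s1 and s2 are the closest pair, so a tree built from these distances would place them next to each other.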
Clustal Omega (Multiple Sequence Alignment, EMBL-EBI)