Sequence Comparison Part 1

The document discusses pairwise sequence alignment. It defines sequence alignment as comparing two or more sequences to find identical or similar characters in the same order. Optimal alignment aims to place as many matching characters as possible in vertical register by inserting gaps. Sequence alignment is useful for identifying evolutionary relationships between sequences and characterizing functions of unknown sequences based on homology. It discusses different types of sequence alignment strategies and concepts like sequence identity, similarity, and homology.

CS-434

BIOINFORMATICS
DR. UROOJ AINUDDIN
PAIRWISE SEQUENCE ALIGNMENT
CHAPTER 3
Sequence alignment
• Sequence alignment is the procedure of comparing two (pair-wise
alignment) or more (multiple sequence alignment) sequences by
searching for a series of individual characters or character patterns that
are in the same order in the sequences.
• Two sequences are aligned by writing them across a page in two rows.
• Identical or similar characters are placed in the same column.
• Non-identical characters can either be placed in the same column as a
mismatch or opposite a gap in the other sequence.
Optimal sequence alignment

• In an optimal alignment, non-identical characters and gaps are
placed to bring as many identical or similar characters as possible
into vertical register.
• Sequences that can be readily aligned in this manner are said to be
similar.
Why are we interested in alignment?

• The building blocks of DNA and proteins, nucleotide bases and
amino acids, form linear sequences that determine the primary
structure of the molecules.
• As the sequences gradually mutate and diverge over time, traces
of evolution may remain in certain portions of the sequences to
allow identification of the common ancestry.
• Identifying the evolutionary relationships between sequences
helps to characterize the function of unknown sequences.
Sequence homology
• When two sequences are descended from a common evolutionary
origin, they are said to have a homologous relationship or share
homology.
• A homologous gene (or homolog) is a gene inherited by two
species from a common ancestor.
• While homologous genes can have similar sequences, similar
sequences are not necessarily homologous.
• Homology is not quantifiable.
Types of homologs

• Orthologs
• Paralogs
Orthologs
• Orthologous genes are genes in different species that originated from
a single gene of a common ancestor.
• Orthologs are generated by speciation (creation of a new species).
Speciation occurs when a group within a species separates from other
members of its species and develops its own unique characteristics.
• Homologous proteins from different species that possess the same
function are called orthologs.
• Example: the beta-hemoglobin genes of humans and chimpanzees.
Paralogs

• Paralogs are generated by gene duplication, i.e., a gene in
an organism is duplicated to occupy two different positions in
the same genome.
• Homologous proteins that have different functions in the same
species are termed paralogs.
• Examples: the genes encoding myoglobin and hemoglobin.
Residue

• In biochemistry, a residue refers to a single unit that makes
up a polymer, such as an amino acid in a polypeptide
or protein, or a nucleotide in a nucleic acid.
Sequence identity
• Sequence identity is the percentage of residues that match exactly
between two aligned sequences.
• Gaps are not counted.
• Identity(A,B) = [Li*2]/[LA+LB], where Li is the number of identical
residues, LA is the length of sequence A, and LB is the length of sequence B.
A: AAGGCTT
B: AAGGC
C: AAGGCAT
• Identity(A,B) = [5*2]/[7+5] = 83.33%
• Identity(A,C) = [6*2]/[7+7] = 85.71%
• Identity(B,C) = [5*2]/[5+7] = 83.33%
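The identity formula can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `identity` and the convention of padding the shorter sequence with `-` are assumptions made for the example.

```python
def identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity using the slide formula 2*Li / (LA + LB)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same number of columns")
    # Li: columns where both rows hold the same residue (gaps never count).
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap)
    # LA and LB are the ungapped lengths of the two sequences.
    len_a = len(aligned_a.replace(gap, ""))
    len_b = len(aligned_b.replace(gap, ""))
    return 100.0 * 2 * matches / (len_a + len_b)


print(round(identity("AAGGCTT", "AAGGC--"), 2))  # 83.33
print(round(identity("AAGGCTT", "AAGGCAT"), 2))  # 85.71
print(round(identity("AAGGC--", "AAGGCAT"), 2))  # 83.33
```

The three calls reproduce Identity(A,B), Identity(A,C), and Identity(B,C) from the slide.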
Calculating sequence identity
X: CCAGTGTGGCCGATACCCCAGGTTGGCACGCATCGTTGCCTTGGTAAGC 49
Y: CCAGTGTGGCCGATGCCCGTGCTACGCATCGTTGCCTTGGTAAGC 45
• We insert gaps into the shorter sequence, Y, to maximize the number of matching columns.
X: CCAGTGTGGCCGATACCCCAGGTTGGCACGCATCGTTGCCTTGGTAAGC
Y: CCAGTGTGGCCGATGCCC--G-T-GCTACGCATCGTTGCCTTGGTAAGC
• Identity(X,Y) = [42*2]/[49+45] = 89.36%
Zones of identity
• If two protein sequences are aligned at full length and share 30% identity
or more, they can be safely regarded as close homologs. They are said to be
in the safe zone.
• If their identity falls between 20% and 30%, a homologous relationship
becomes less certain. This range is the twilight zone, where
remote homologs mix with randomly related sequences.
• Below 20% identity, where a high proportion of unrelated sequences is
present, homologous relationships cannot be reliably determined. This range
is the midnight zone.
• Identity values provide only tentative guidance for identifying homology.
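As a rough sketch, the three zones can be written as a simple threshold function. The name `identity_zone` is hypothetical, and treating exactly 20% as twilight zone is an assumption drawn from the wording "between 20% and 30%"; the thresholds apply only to full-length protein alignments and are tentative guidance, as noted above.

```python
def identity_zone(percent_identity: float) -> str:
    """Classify a full-length protein alignment by its percent identity."""
    if not 0.0 <= percent_identity <= 100.0:
        raise ValueError("percent identity must lie between 0 and 100")
    if percent_identity >= 30.0:
        return "safe zone"      # close homology can be assumed
    if percent_identity >= 20.0:
        return "twilight zone"  # homology is uncertain
    return "midnight zone"      # homology cannot be reliably determined


print(identity_zone(52.63))  # safe zone
print(identity_zone(24.0))   # twilight zone
print(identity_zone(12.5))   # midnight zone
```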
Sequence similarity

• Sequence similarity is the percentage of aligned residues that are
similar in physicochemical properties such as size, charge, and
hydrophobicity.
• Similarity(A,B) = [Ls*2]/[LA+LB], where Ls is the number of
similar residues, LA is the length of sequence A, and LB is the length
of sequence B.
Difference between homology and similarity
• When the two sequences share a high enough degree of similarity, it is
concluded that they are homologous.
• Similarity is a direct result of observation from sequence alignment.
• Sequence similarity can be quantified using percentages; homology is
a qualitative statement.
• One may say that two sequences share 40% similarity. It is incorrect to
say that the two sequences share 40% homology. They are
either homologous or nonhomologous.
What level of similarity leads to inferring
homology?
• Nucleotide sequences consist of only four characters, so each aligned
position in two unrelated sequences has at least a 25% chance of matching
(exactly 25% when the four bases are equally frequent).
• For protein sequences, there are twenty possible amino acid residues,
so two unrelated sequences are expected to match at about 5% of the
residues by random chance.
• If gaps are allowed, the percentage could increase to 10–20%.
• Sequence length is also a crucial factor. The longer the sequence, the
less likely it is for two sequences to match up by random chance.
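The 25% and 5% figures follow from a simple model in which residues at each position are drawn independently. The sketch below assumes that model; the function name `chance_match_probability` and the skewed composition are illustrative, not taken from the slides.

```python
def chance_match_probability(frequencies):
    """Probability that two independently drawn residues are identical."""
    if abs(sum(frequencies) - 1.0) > 1e-9:
        raise ValueError("frequencies must sum to 1")
    # Both residues are the same type i with probability p_i * p_i.
    return sum(p * p for p in frequencies)


dna_uniform = [1 / 4] * 4        # A, C, G, T equally frequent
protein_uniform = [1 / 20] * 20  # twenty amino acids equally frequent
dna_skewed = [0.4, 0.1, 0.1, 0.4]  # hypothetical AT-rich composition

print(round(chance_match_probability(dna_uniform), 4))      # 0.25
print(round(chance_match_probability(protein_uniform), 4))  # 0.05
print(round(chance_match_probability(dna_skewed), 4))       # 0.34
```

A uniform composition gives the lowest value, so uneven base frequencies only raise the chance of a random match.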
Similarity and identity
• Sequence similarity and sequence identity are synonymous
for nucleotide sequences.
• For protein sequences, however, the two concepts are very different.
• Protein sequence identity refers to the percentage of matches of the
same amino acid residues between two aligned sequences.
• Protein sequence similarity refers to the percentage of aligned residues
that have similar physicochemical characteristics and can be more
readily substituted for each other.
Explaining similarity for amino acids

• Amino acids are not exchanged with the same probability as might
be conceived theoretically.
• For example, an exchange of aspartic acid for glutamic acid is
frequently observed; however, a change from aspartic acid to
tryptophan is rarely seen.
• There are several reasons that may facilitate the aspartic acid -
glutamic acid exchange and not the aspartic acid - tryptophan
exchange.
Explaining similarity for amino acids
• One reason for this is the triplet-based genetic code.
• For an exchange of aspartic acid to glutamic acid, only a mutation
of the last nucleotide in the codon is required (GAT/GAC to
GAA/GAG).
• In contrast, a complete mutation of the whole triplet must occur for
aspartic acid to be exchanged for tryptophan (GAT/GAC to TGG).
• Of course, a complete mutational substitution has a much lower
probability of occurrence and needs a longer timeframe.
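The codon argument can be checked by counting nucleotide differences between the codons listed above. This is a small sketch; the table `CODONS` holds only the three amino acids from the example, and `min_nucleotide_changes` is a hypothetical helper name.

```python
from itertools import product

# DNA codons for the three amino acids discussed on the slide.
CODONS = {
    "Asp": ["GAT", "GAC"],
    "Glu": ["GAA", "GAG"],
    "Trp": ["TGG"],
}


def min_nucleotide_changes(src: str, dst: str) -> int:
    """Fewest single-nucleotide substitutions turning a src codon into a dst codon."""
    return min(
        sum(a != b for a, b in zip(c1, c2))
        for c1, c2 in product(CODONS[src], CODONS[dst])
    )


print(min_nucleotide_changes("Asp", "Glu"))  # 1
print(min_nucleotide_changes("Asp", "Trp"))  # 3
```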
Explaining similarity for amino acids
• A second reason for the mutation of aspartic acid to glutamic acid to
occur more often is that both have similar properties.
• In contrast, aspartic acid and tryptophan are chemically different – the
hydrophobic tryptophan is frequently found in the center of proteins,
whereas the hydrophilic aspartic acid occurs more often at the surface.
• An exchange of aspartic acid for tryptophan, therefore, could greatly
alter the structure of a protein and, consequently, its function.
• Such striking amino acid exchanges accompanied by a loss of function
rarely happen.
Alignment strategies

• There are two different alignment strategies that are often used:
1. Global alignment, and
2. Local alignment.
Global alignment
• In global alignment, the two sequences to be aligned are assumed to
be generally similar over their entire length.
• This method is more applicable for aligning two closely
related sequences of roughly the same length.
• Alignment is carried out from beginning to end of both sequences to
find the best possible match across the entire length.
• For divergent sequences and sequences of variable lengths, this
method may not be able to generate optimal results because it fails to
recognize highly similar local regions between the two sequences.
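The slides do not name an algorithm for global alignment. One standard choice is the Needleman-Wunsch dynamic-programming method, sketched below. The scoring values (match +1, mismatch -1, gap -1) are assumptions for illustration only.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j] end to end.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to the top-left corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])  # residue of a opposite a gap in b
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")       # gap in a opposite a residue of b
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


print(needleman_wunsch("AAGGCTT", "AAGGC"))  # ('AAGGCTT', 'AAGGC--', 3)
```

Because every residue of both sequences must appear in the output, the two trailing residues of the longer sequence are forced opposite gaps.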
Local alignment
• Local alignment does not assume that the two sequences in question
have similarity over the entire length.
• The two sequences to be aligned can be of different lengths.
• It only finds local regions with the highest level of similarity between the
two sequences and aligns these regions without regard for the
alignment of the rest of the sequence.
• This approach can be used for aligning more divergent sequences with
the goal of searching for conserved patterns in DNA or protein
sequences. It helps identify modules that are similar, called domains or
motifs.
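Local alignment is commonly computed with the Smith-Waterman method, which the slides do not name. The sketch below is a minimal version; the scoring values (match +2, mismatch -1, gap -2) are assumptions, and the two protein sequences are the ones used in the examples in this chapter.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment of a and b; returns (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    # score[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A score never drops below 0: a poor prefix is simply discarded.
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), best


x = "EARDFNQYYSSIKRSGSIQ"
y = "LPKLFIDQYYSSIKRTMGH"
print(smith_waterman(x, y))  # ('QYYSSIKR', 'QYYSSIKR', 16)
```

Unlike the global method, the flanking regions of both sequences are left out of the result.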
Which alignment strategy is this?
X: CCAGTGTGGCCGATACCCCAGGTTGGCACGCATCGTTGCCTTGGTAAGC 49
Y: CCAGTGTGGCCGATGCCCGTGCTACGCATCGTTGCCTTGGTAAGC 45

X: CCAGTGTGGCCGATACCCCAGGTTGGCACGCATCGTTGCCTTGGTAAGC
Y: CCAGTGTGGCCGATGCCC--G-T-GCTACGCATCGTTGCCTTGGTAAGC
Example of global alignment
X: EARDFNQYYSSIKRSGSIQ 19
Y: LPKLFIDQYYSSIKRTMGH 19
X: EARDFN-QYYSSIKRS-GSIQ
Y: LPKLFIDQYYSSIKRTMGH--
• Identity(X,Y) = [10*2]/[19+19] = 52.63%
• Consider K and R, D and N, H and Q to be similar. Also consider G,
S and T to be similar.
X: EARDF-NQYYSSIKRS-GSIQ
Y: LPKLFIDQYYSSIKRTMG--H
• Similarity(X,Y) = [14*2]/[19+19] = 73.68%
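The two counts in this example can be reproduced with a short script. This is a sketch: `SIMILAR_GROUPS` encodes only the groupings stated on the slide (K/R, D/N, H/Q, and G/S/T), identical residues are counted as similar, and the function names are illustrative.

```python
SIMILAR_GROUPS = [set("KR"), set("DN"), set("HQ"), set("GST")]


def is_identical(x: str, y: str) -> bool:
    return x == y


def is_similar(x: str, y: str) -> bool:
    # Identical residues count as similar, as do residues in the same group.
    return x == y or any(x in g and y in g for g in SIMILAR_GROUPS)


def percent_score(aligned_a, aligned_b, same, gap="-"):
    """Return (hit count, 100 * 2 * hits / (LA + LB)) for an alignment."""
    hits = sum(
        1 for x, y in zip(aligned_a, aligned_b)
        if gap not in (x, y) and same(x, y)
    )
    len_a = len(aligned_a.replace(gap, ""))
    len_b = len(aligned_b.replace(gap, ""))
    return hits, round(100.0 * 2 * hits / (len_a + len_b), 2)


# First alignment from the slide, scored for identity.
print(percent_score("EARDFN-QYYSSIKRS-GSIQ", "LPKLFIDQYYSSIKRTMGH--", is_identical))
# (10, 52.63)

# Second alignment from the slide, scored for similarity.
print(percent_score("EARDF-NQYYSSIKRS-GSIQ", "LPKLFIDQYYSSIKRTMG--H", is_similar))
# (14, 73.68)
```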
Example of local alignment
X: EARDFNQYYSSIKRSGSIQ 19
Y: LPKLFIDQYYSSIKRTMGH 19
X: EARDFN-QYYSSIKRS-GSIQ
Y: LPKLFIDQYYSSIKRTMGH--
• Only the region QYYSSIKR, which is identical in both sequences, is
counted. Each local segment has length 8.
• Identity(X,Y) = [8*2]/[8+8] = 100%
• Consider K and R, D and N, H and Q to be similar. Also consider G,
S and T to be similar.
X: EARDF-NQYYSSIKRS-GSIQ
Y: LPKLFIDQYYSSIKRTMG--H
• Here the local region is NQYYSSIKRS in X and DQYYSSIKRT in Y, each of
length 10, with N and D, and S and T, counted as similar.
• Similarity(X,Y) = [10*2]/[10+10] = 100%