Module 5
Bioinformatics
Module V
(Biology for Engineers BSCD203, BSCG203)
Biological Sequences:
• DNA Sequences: Represent the genetic code of an organism, specifying the order of
nucleotides (adenine, thymine, cytosine, and guanine).
• RNA Sequences: Similar to DNA but with uracil instead of thymine. Involved in various
cellular processes, including protein synthesis.
• Protein Sequences: Represent the amino acid composition of a protein, crucial for
understanding protein structure and function.
Sequence Databases:
1. Sequence Similarity:
• Definition: Sequence similarity refers to the degree of resemblance between two or more
biological sequences.
• Importance: Similarity helps infer evolutionary relationships, identify conserved
regions, and predict functional elements in sequences.
Bioinformatics Databases:
Sequence similarity
Sequence similarity refers to the degree of similarity between two biological sequences, such as
DNA, RNA, or protein sequences. It is a crucial concept in bioinformatics and molecular
biology, as it helps researchers understand the functional and evolutionary relationships between
different biological entities.
There are various methods to measure sequence similarity, and the choice of method depends
on the type of sequences being compared and the specific goals of the analysis. Here are some
common methods for assessing sequence similarity:
4. Sequence Databases and Homology Search: Tools like BLAST (Basic Local Alignment
Search Tool) are used to search sequence databases to find homologous sequences.
6. Quantitative Measures: Metrics like the Jaccard index or the Hamming distance can be used
to quantify the similarity between sequences.
7. Scoring Matrices: Matrices like BLOSUM (for proteins) or PAM (Point Accepted Mutation,
also for proteins) assign scores to different substitutions, aiding in alignment algorithms.
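The quantitative measures in item 6 can be illustrated with a minimal Python sketch. The function names and the choice of k = 3 are assumptions made for this example; the Jaccard index is computed here on sets of overlapping k-mers, which is one common way of applying it to sequences.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def kmer_set(seq, k):
    """All overlapping substrings of length k found in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_index(a, b, k=3):
    """Shared k-mers divided by all distinct k-mers of the two sequences."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 1.0


s1, s2 = "ACGTACGT", "ACGAACGT"
print(hamming_distance(s1, s2))    # 1 (the sequences differ only at position 4)
print(jaccard_index(s1, s2, k=3))  # 2 shared 3-mers out of 7 distinct ones
```

The Hamming distance only applies to sequences of equal length, whereas the k-mer Jaccard index also works when lengths differ.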
Understanding sequence similarity is crucial for predicting the function of genes or proteins,
identifying conserved motifs, and inferring evolutionary relationships. It is a fundamental step
in various bioinformatics applications, including functional annotation, comparative genomics,
and drug discovery.
Homology:
Purpose: Homology is a key concept in evolutionary biology and molecular evolution. When
sequences are homologous, their similarities often reflect shared ancestry and can provide
insights into the evolutionary history of genes or proteins. Functional similarities between
homologous sequences are often retained due to shared ancestry, but they can also diverge over
time.
In summary, similarity is a measure of how alike two sequences are, while homology implies a
shared evolutionary history. Similarity is a practical metric used in various bioinformatics
applications, while homology provides insights into the evolutionary relationships between
biological entities. Homologous sequences are expected to exhibit some level of similarity, but
not all similar sequences are necessarily homologous.
Homology in bioinformatics?
In bioinformatics, homology refers to the relationship between DNA, RNA, or protein sequences
that descend from a common ancestor in the evolutionary tree of life. In other words, it is the
common evolutionary ancestry of two sequences. Homologous sequences can arise through
speciation events (orthologs), duplication events (paralogs), or horizontal gene transfer
events (xenologs).
2nd Semester
Biology for Engineers (BSCD203, BSCG203)
2023-24
Homology between DNA, RNA, or protein sequences is usually inferred from their nucleotide or
amino acid sequence similarity. Significant similarity is strong evidence that two sequences
descend from a common ancestral sequence and have diverged through evolutionary changes.
Alignments of multiple sequences indicate which regions of each sequence are homologous.
Similarity in bioinformatics
Alignment
Output: The output of a pairwise alignment is a set of matched positions and potentially
introduced gaps, indicating where the sequences align or diverge.
Multiple sequence alignment:
Purpose: Aligns three or more sequences simultaneously, often to identify conserved regions
and understand evolutionary relationships.
Algorithms: Popular algorithms for multiple sequence alignment include ClustalW, MAFFT,
and T-Coffee.
Output: The output of an MSA is a column-wise arrangement of sequences, with gaps introduced
to maximize overall similarity. Conserved regions are often easily recognizable in the alignment.
Sequence 1: ACGTACGT
Sequence 2: ACGA--GT
In this example, gaps (represented by dashes) are introduced to align the two sequences. The
aligned positions show where the nucleotides match or differ. The goal is to maximize similarity,
taking into account matches, mismatches, and gap penalties.
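One classical way to find a score-maximizing global alignment of two sequences is the Needleman-Wunsch dynamic programming algorithm. The sketch below is a minimal version under assumed scores (+1 for a match, 0 for a mismatch, -1 per gap position); the function name and default parameters are illustrative choices, not part of any particular tool.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    def sub(x, y):
        return match if x == y else mismatch

    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),  # align two residues
                score[i - 1][j] + gap,                          # gap in b
                score[i][j - 1] + gap,                          # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("ACGTACGT", "ACGAGT")
print(score)
print(top)
print(bottom)
```

When several alignments share the best score, the traceback returns only one of them; which one depends on the order in which the three cases are checked.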
Multiple sequence alignment involves aligning more than two sequences. For example:
Sequence 1: ACGTACGT
Sequence 2: ACGA--GT
Sequence 3: ACGT-CGT
In this case, the alignment considers all three sequences simultaneously, introducing gaps as
needed to align regions of similarity.
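Once sequences are arranged column-wise, conserved positions can be read off directly. The following short sketch, with a hypothetical helper name, lists the columns of the three-sequence example in which every sequence carries the same symbol and no gap.

```python
msa = ["ACGTACGT", "ACGA--GT", "ACGT-CGT"]


def conserved_columns(alignment):
    """Indices of columns where all sequences share one non-gap symbol."""
    columns = zip(*alignment)  # transpose rows (sequences) into columns
    return [i for i, col in enumerate(columns)
            if "-" not in col and len(set(col)) == 1]


print(conserved_columns(msa))  # [0, 1, 2, 6, 7]
```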
Scoring model
A scoring model is a set of rules or parameters used to assign scores to different elements in a
computational or analytical context. In bioinformatics, scoring models are often used in the
context of sequence alignment to evaluate the similarity between two sequences or to assess the
significance of the alignment.
Here are some key components of scoring models used in sequence alignment:
Substitution Matrix:
In the context of sequence alignment, a substitution matrix assigns scores to different amino acid
or nucleotide substitutions. Common examples include BLOSUM (for proteins) and PAM
(Point Accepted Mutation, also for proteins) matrices.
The matrix reflects the likelihood of one residue being substituted for another based on observed
evolutionary changes.
Gap Penalties:
Gap penalties are used to assign scores for introducing gaps in the alignment. There are typically
two types of gap penalties: gap opening penalty (larger penalty for starting a gap) and gap
extension penalty (smaller penalty for extending an existing gap).
Scoring System:
The overall scoring system combines the scores from the substitution matrix and gap penalties
to assess the overall similarity of the aligned sequences.
Scores are assigned to matched residues, mismatched residues, and gap positions.
Scoring Function:
The scoring function calculates the overall score for a given alignment. It is often a sum of the
scores for matched residues, mismatched residues, and gap positions.
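The components above can be combined in a small scoring function. This is a sketch under assumed parameter values (match +1, mismatch -1, gap opening -5, gap extension -1) and one assumed convention: the first position of a gap costs the opening penalty and every further position costs the extension penalty. Real tools use a full substitution matrix and may define the penalties slightly differently.

```python
def score_alignment(a, b, match=1, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Sum substitution scores and affine gap penalties over an alignment."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    prev_gap = None  # which sequence had a gap in the previous column
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        gap_in = "a" if x == "-" else "b" if y == "-" else None
        if gap_in is None:
            total += match if x == y else mismatch
        elif gap_in == prev_gap:
            total += gap_extend   # continuing an existing gap
        else:
            total += gap_open     # starting a new gap
        prev_gap = gap_in
    return total


print(score_alignment("ACGTACGT", "ACGA--GT"))  # -2: one gap of length 2
print(score_alignment("ACGTACGT", "ACG-A-GT"))  # -4: two separate gaps
```

With these values a single longer gap is cheaper than two short ones, which is the intended effect of separating opening and extension penalties.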
The Expect value (E) is a parameter that describes the number of hits one can “expect” to see
by chance when searching a database of a particular size. It decreases exponentially as the Score
(S) of the match increases. Essentially, the E value describes the random background noise.
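The exponential dependence on the score can be made concrete with the Karlin-Altschul estimate E = K * m * n * exp(-lambda * S), where m is the query length and n the database length. In the sketch below the values of lambda and K are placeholders chosen for illustration only; in practice they depend on the scoring matrix and gap penalties in use.

```python
import math


def expect_value(raw_score, query_len, db_len, lam=0.318, k=0.134):
    """Karlin-Altschul estimate E = K * m * n * exp(-lambda * S).

    lam and k are illustrative placeholder values, not universal constants.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


for s in (30, 40, 50, 60):
    print(s, expect_value(s, query_len=100, db_len=1_000_000))
```

Each fixed increase in S divides E by the same factor, while doubling the database size doubles E.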
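Some scoring models may include additional parameters or adjustments, and the choice of
substitution matrix (e.g., BLOSUM30 or BLOSUM62) may vary based on the desired sensitivity
or specificity of the alignment. The number in a BLOSUM matrix name is the percent-identity
threshold used to cluster the sequences it was derived from, so lower-numbered matrices suit
more divergent sequences.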
The choice of scoring model and parameters depends on the nature of the sequences being
compared and the goals of the analysis. Different models may be suitable for comparing protein
sequences, DNA sequences, or RNA sequences, and researchers often select the most
appropriate model based on empirical testing and biological considerations.
Hidden Markov Models (HMMs) can be used for pairwise sequence alignment, and they are
particularly useful when dealing with biological sequences like proteins. The process involves
constructing an HMM that represents the underlying states of the sequences and then using the
model to find the optimal alignment. Here's a simplified overview of the process:
Model Construction:
• States: Define states in the HMM to represent different aspects of the sequences, such as
match states, insertion states, and deletion states.
• Transitions: Define transitions between states, representing the probabilities of moving from
one state to another.
• Emission Probabilities: Assign emission probabilities to each state, representing the
likelihood of emitting a particular symbol (amino acid or nucleotide) given the current state.
HMMs are popular both for the solid mathematical principles they are based on and for the
simplicity that comes with them. Every hidden Markov model relies on the assumption that the
events we observe depend on internal factors or states that are not directly observable. This
assumption is very general, which makes the model widely applicable, and it is where the
"hidden" part of the name comes from. The "Markov" part comes from how we model the changes
of these hidden states through time. We use the Markov property, a strong assumption that the
underlying process is memoryless, meaning the next hidden state depends only on the current
hidden state.
The first order Markov process makes a very important simplification to observed sequential
data—the current system state depends only on the previous system state.
Figure: Graph of a Markov process. The current system state depends only on the previous system
state.
Additionally, hidden Markov models make one more important modification to the Markov
process: the actual system states are assumed to be unobservable, or hidden. For a sequence
of hidden states Z, the hidden Markov process emits a corresponding sequence of observations
X. Using the observations X, we try to infer what Z really is.
Figure: Graph of a hidden Markov process. We are unable to observe the actual hidden states of
the system Z, and can only see the emitted observations X.
An example of a hidden Markov process is guessing someone’s mood. We cannot directly
observe or measure the mood of a person (at least without sticking electrodes in the person’s
brain); instead, we observe his or her facial expression and then try to guess the mood. We
assume that moods can be described as a Markov process, and that there are 2 possible moods:
good and bad. We also assume that there are 2 possible observable facial expressions: smiling
and frowning.
Initial Hidden State Probabilities: When we first meet someone, we assume that there is a 70%
chance that the person is in a good mood and a 30% chance that the person is in a bad mood.
Hidden State Transition Matrix: We also assume that when a person is in a good mood, moments
later there is an 80% chance that he or she will still be in a good mood, and a 20% chance that
he or she will now be in a bad mood. To simplify the problem, we assume the same probabilities
for the opposite situation.
Observable Emission Probabilities: Finally, we assume that when a person is in a good mood,
there is a 90% chance that he or she will be smiling, and a 10% chance that he or she will be
frowning.
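The mood example can be turned into a small forward-filtering sketch that updates the belief about the hidden mood after each observed expression. The initial, transition, and good-mood emission probabilities come from the text; the bad-mood emission row (10% smiling, 90% frowning) is not given above and is an assumption made here by mirroring the good-mood row. All names are illustrative.

```python
STATES = ("good", "bad")
INITIAL = {"good": 0.7, "bad": 0.3}
TRANSITION = {
    "good": {"good": 0.8, "bad": 0.2},
    "bad": {"good": 0.2, "bad": 0.8},
}
EMISSION = {
    "good": {"smile": 0.9, "frown": 0.1},
    "bad": {"smile": 0.1, "frown": 0.9},  # ASSUMPTION: not stated in the text
}


def filter_moods(observations):
    """Forward algorithm with normalisation: P(mood at t | observations up to t)."""
    beliefs = []
    prior = dict(INITIAL)
    for obs in observations:
        # Weight each state by how well it explains the observation.
        unnorm = {s: prior[s] * EMISSION[s][obs] for s in STATES}
        total = sum(unnorm.values())
        posterior = {s: p / total for s, p in unnorm.items()}
        beliefs.append(posterior)
        # Predict the next hidden state from the current belief.
        prior = {s: sum(posterior[r] * TRANSITION[r][s] for r in STATES)
                 for s in STATES}
    return beliefs


for step in filter_moods(["smile", "frown", "frown"]):
    print(step)
```

Profile and pair HMMs used for sequence alignment follow the same principle, with match, insertion, and deletion states in place of moods and residues in place of facial expressions.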
Multiple sequence alignment (MSA) is a bioinformatics technique used to align three or more
biological sequences simultaneously. In the context of MSA, there are different types of
alignments, including local alignment, global alignment, and alignments with or without gaps.
Let's explore the concepts of local and global alignments in the context of multiple sequence
alignment:
1. Local Multiple Sequence Alignment:
• Algorithm: Methods like ClustalW and MAFFT can be adapted for local MSA. These
algorithms use heuristics to identify conserved regions within the sequences.
• Output: The output of local MSA consists of aligned segments that are locally similar
across the input sequences.
2. Global Multiple Sequence Alignment:
• Purpose: Global MSA aligns entire sequences from start to end, aiming to find the overall
similarity and conserved regions across the entire length of the sequences.
• Algorithm: Algorithms like ClustalW, MAFFT, and T-Coffee are commonly used for global
MSA. They consider the entire length of the sequences during alignment.
• Output: The output of global MSA is a complete alignment of all input sequences, spanning
the entire length of each sequence.
3. Gapped Multiple Sequence Alignment:
• Purpose: Gapped MSA allows for the introduction of gaps in the alignment to account for
insertions or deletions in the sequences.
• Algorithm: Most MSA algorithms, including those mentioned above, inherently handle
gapped alignments. Gaps are introduced to maximize the overall similarity between
sequences.
• Output: The output includes gaps introduced to align regions that may have insertions or
deletions in some sequences.
4. Ungapped Multiple Sequence Alignment:
• Purpose: Ungapped MSA does not allow for the introduction of gaps during the alignment
process.
• Algorithm: Some MSA methods provide options to perform ungapped alignments,
ensuring that the aligned sequences are gap-free.
• Output: The output consists of a gap-free alignment, making it suitable for comparing
sequences without considering insertions or deletions.
Choosing between local and global alignments, as well as gapped or ungapped alignments,
depends on the specific goals of the analysis and the characteristics of the sequences being
aligned. Local alignment is often used when focusing on specific conserved regions, while
global alignment provides a comprehensive overview of the entire sequences. The decision to
allow gaps or not depends on the biological context and the expected variability in the sequences.
Ungapped alignment
Gapped alignments
Three kinds of mutations:
• Replacement of a base (or amino acid)
• Insertion of a base (or amino acid)
• Deletion of a base (or amino acid)
Score according to:
+1 if match
0 if no match
-1 if gap
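Applied to the pairwise example from the Alignment section, this simple scheme can be checked with a few lines of Python. The function name is a hypothetical choice for this sketch.

```python
def simple_score(a, b):
    """Score an alignment with +1 per match, 0 per mismatch, -1 per gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score -= 1
        elif x == y:
            score += 1
    return score


print(simple_score("ACGTACGT", "ACGA--GT"))  # 3 = 5 matches + 1 mismatch*0 - 2 gaps
```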
BLAST:
BLAST (Basic Local Alignment Search Tool) and FASTA are both widely used bioinformatics
tools for comparing biological sequences, such as DNA, RNA, or protein sequences, to identify
similarities and potential homologies. Despite having similar purposes, they use different
algorithms and approaches for sequence similarity searches.
BLAST (Basic Local Alignment Search Tool):
Algorithm: BLAST employs a heuristic algorithm that quickly identifies local regions of
similarity between sequences by breaking the search into smaller, manageable pieces. The
algorithm looks for short word matches (seeds), which are exact for nucleotide searches and
high-scoring similar words for protein searches, and extends them to form alignments.
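The seed-and-extend idea can be sketched in a few lines. This is a deliberately simplified illustration, not the actual BLAST implementation: it uses exact k-mer seeds only, extends without gaps, and stops when the running score falls more than an assumed threshold `drop` below the best score seen. All names and parameter values are assumptions for the example.

```python
def _extend(query, subject, qi, sj, step, match, mismatch, drop):
    """Walk from (qi, sj) in direction step; return (best_gain, best_length)."""
    gain = best_gain = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        gain += match if query[qi] == subject[sj] else mismatch
        length += 1
        if gain > best_gain:
            best_gain, best_len = gain, length
        elif best_gain - gain > drop:
            break  # score dropped too far below the best: stop extending
        qi += step
        sj += step
    return best_gain, best_len


def seed_and_extend(query, subject, k=3, match=1, mismatch=-1, drop=2):
    """Best ungapped hit as (score, query_start, subject_start, length)."""
    # Index every k-mer of the subject by its start positions.
    index = {}
    for j in range(len(subject) - k + 1):
        index.setdefault(subject[j:j + k], []).append(j)

    best = (0, 0, 0, 0)
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            right_gain, right_len = _extend(query, subject, i + k, j + k, 1,
                                            match, mismatch, drop)
            left_gain, left_len = _extend(query, subject, i - 1, j - 1, -1,
                                          match, mismatch, drop)
            hit = (k * match + left_gain + right_gain,
                   i - left_len, j - left_len, left_len + k + right_len)
            if hit[0] > best[0]:
                best = hit
    return best


score, q_start, s_start, length = seed_and_extend("GGACGTACGTCC", "TTTACGTACGTAAA")
print(score, q_start, s_start, length)  # 8 2 3 8
```

Because only positions that share a seed are examined, most of the database is never aligned at all, which is the source of the speed advantage.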
Types of BLAST:
FASTA:
Algorithm: The FASTA algorithm is a heuristic. It starts by searching for short regions of
similarity (word matches) between sequences, then joins and rescores them, and uses dynamic
programming only in the final step to create an alignment.
Types of FASTA:
• FASTA (fasta36): The original program for comparing protein or DNA sequences.
• SSEARCH (ssearch36): Performs a rigorous Smith-Waterman local pairwise alignment search.
• TFASTX and TFASTY: Compare a protein query against a DNA database translated in all
reading frames, allowing for frameshifts.
Output: FASTA outputs alignments along with statistical scores, including E-values, sequence
identity, and similarity scores. It provides a summary of the alignment, as well as detailed
information on matched regions.
Comparison:
• BLAST is often preferred for its speed and is suitable for quickly identifying local
similarities in large databases.
• FASTA may be more sensitive for certain applications, as it uses rigorous statistical
methods and dynamic programming for alignment.
Both tools are widely used and have their strengths and weaknesses, making them
complementary in bioinformatics analyses.
4. Phylogenetic tree:
Sequence Alignment: Align the sequences to identify homologous positions. Multiple sequence
alignment (MSA) tools like ClustalW, MAFFT, or Muscle are commonly used for this step.
Accurate alignment is crucial for the accuracy of the phylogenetic tree.
Data cleaning and filtering: Remove poorly aligned or highly variable regions that may
introduce noise into the analysis. This step helps in obtaining a more reliable phylogenetic
signal.
Model selection: Choose an appropriate evolutionary model that describes the substitution
pattern of nucleotides or amino acids in the sequences. Common models include the General
Time Reversible (GTR) model for nucleotides and the Jukes-Cantor model for simpler cases.
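As an illustration of what an evolutionary model does, the Jukes-Cantor model converts the observed proportion of differing sites p into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3), correcting for multiple substitutions at the same site. The helper names below are assumptions for this sketch, and skipping gapped columns is one possible choice among several.

```python
import math


def p_distance(a, b):
    """Proportion of differing sites, ignoring columns that contain a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor_distance(p):
    """d = -(3/4) * ln(1 - 4p/3); undefined once p reaches 0.75."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


p = p_distance("ACGTACGTAC", "ACGTACGTAT")
print(p, jukes_cantor_distance(p))
```

The corrected distance is always at least as large as the raw proportion, and the gap between the two grows as sequences become more divergent.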
Bootstrap analysis: Assess the robustness of the tree topology by performing bootstrap
analysis. This involves resampling the alignment columns with replacement to generate multiple
datasets and re-running the tree-building process on each to estimate the reliability of each
branch.
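The resampling step of a bootstrap analysis can be sketched as follows. Columns of the alignment are drawn at random with replacement until a pseudo-alignment of the original length is obtained. The function name, the toy alignment, and the number of replicates are illustrative assumptions; building a tree from each replicate is left to whatever tree-building method is in use.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    n_cols = len(alignment[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[c] for c in picks) for seq in alignment]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
msa = ["ACGTACGT", "ACGA--GT", "ACGT-CGT"]
replicates = [bootstrap_alignment(msa, rng) for _ in range(100)]
print(replicates[0])
```

The support for a branch is then the fraction of replicate trees in which that branch appears.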
Tree visualization: Visualize the resulting phylogenetic tree using tree visualization software
like FigTree, iTOL, or other phylogenetic tree viewers. Trees are often displayed in a
hierarchical format with branches representing evolutionary relationships.
Interpretation and analysis: Analyze the tree to infer evolutionary relationships, divergence
times, and patterns of speciation. Interpret the tree in the context of biological knowledge and
hypotheses.
Phylogenetic tree construction is a complex process, and the choice of methods and models
depends on the characteristics of the data and the biological questions being addressed.
Researchers often validate their results through multiple methods and consider the biological
context when interpreting phylogenetic trees.
Neighbor-Joining algorithm
The Neighbor-Joining algorithm is advantageous because it is relatively fast and can handle
large datasets. It does not assume a constant evolutionary rate (molecular clock) across
lineages, but it may be sensitive to errors in distance estimation. Researchers often perform
bootstrap analysis to assess the reliability of the branches in the Neighbor-Joining tree.
Despite its limitations, the Neighbor-Joining algorithm is widely used in practice and has
contributed to the field of molecular phylogenetics.
The Neighbor-Joining (NJ) algorithm is a popular method for constructing phylogenetic trees in
the field of biology. It is widely used for analyzing molecular data, such as DNA or protein
sequences, to infer evolutionary relationships among different species or individuals.
Context: The primary goal of phylogenetic tree construction is to represent the evolutionary
history and relationships among biological entities, such as species or genes. This is done by
analyzing molecular data that reflects the genetic similarities and differences between these
entities.
Sequence Data: In biology, researchers often start with molecular sequences, such as DNA,
RNA, or protein sequences, obtained from different species or individuals. These sequences are
aligned to identify homologous positions, reflecting common ancestry.
Distance matrix: The Neighbor-Joining algorithm takes as input a distance matrix, which
quantifies the evolutionary distances between pairs of taxa (species or sequences). These
distances can be estimated based on sequence divergence, substitution rates, or other measures.
Node and Branch Representation: In the context of biology, each leaf node in the
Neighbor-Joining tree represents a taxon (species or sequence), and each internal node
represents an inferred common ancestor. Branch lengths connecting nodes represent
the evolutionary distances between taxa. The algorithm aims to construct a tree that best reflects
the observed distances.
Cluster formation: The algorithm iteratively forms clusters of taxa based on their pairwise
distances. At each step, it identifies the pair of taxa with the minimum Q-value, indicating a
potential cluster. A new node is created to represent this cluster, and the tree is updated
accordingly.
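One iteration of this step can be sketched in Python. For n taxa, the Q-value of a pair is Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k), and the pair with the smallest Q is joined. The five-taxon distance matrix below is a made-up illustration, and the function names are assumptions; a full implementation would also update the distance matrix and repeat until the tree is complete.

```python
def q_matrix(d):
    """Q(i, j) = (n - 2) * d[i][j] - rowsum(i) - rowsum(j) for i != j."""
    n = len(d)
    row_sums = [sum(row) for row in d]
    return [[(n - 2) * d[i][j] - row_sums[i] - row_sums[j] if i != j else 0
             for j in range(n)] for i in range(n)]


def closest_pair(d):
    """Pair of taxa (i, j), i < j, with the minimum Q-value."""
    q = q_matrix(d)
    n = len(d)
    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    return min(pairs, key=lambda ij: q[ij[0]][ij[1]])


def branch_lengths(d, i, j):
    """Distances from taxa i and j to the new node that joins them."""
    n = len(d)
    to_i = d[i][j] / 2 + (sum(d[i]) - sum(d[j])) / (2 * (n - 2))
    return to_i, d[i][j] - to_i


# Illustrative symmetric distance matrix for five taxa a..e.
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
i, j = closest_pair(dist)
print((i, j), q_matrix(dist)[i][j])  # (0, 1) -50
print(branch_lengths(dist, i, j))    # (2.0, 3.0)
```

Note that the joined pair (a, b) is not the pair with the smallest raw distance (d and e, at distance 3); the Q-value also accounts for how far each taxon is from all the others.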
Hierarchy and Evolutionary Relationships: The resulting tree has a hierarchical structure that
reflects the evolutionary relationships among taxa. Nodes closer to each other in the tree are
more closely related, while those farther apart share a more distant common ancestor.
Validation and analysis: In biological studies, researchers often validate the reliability of the
Neighbor-Joining tree through statistical methods, such as bootstrap analysis. The resulting tree
is then analyzed to draw biological conclusions, such as identifying clades, understanding
speciation events, or inferring the functional implications of genetic evolution.
• Which tool is suitable for quickly identifying local similarities in large sequence
databases?
a) ClustalW
b) MAFFT
c) BLAST
d) T-Coffee
Answer: c) BLAST
5. How does the E-value in BLAST results help in sequence similarity analysis?
10. What are the types of sequence databases commonly used in bioinformatics, and what
kind of sequences do they store?
1. Compare and contrast the algorithms used in pairwise sequence alignment, focusing on
Needleman-Wunsch and Smith-Waterman algorithms.
4. Describe the different methods for measuring sequence similarity and their respective
applications in bioinformatics.
5. Explore the challenges and limitations associated with sequence alignment algorithms
and how researchers address them.
6. Analyze the role of hidden Markov models in sequence alignment, highlighting their
advantages and applications in bioinformatics.
7. Discuss the significance of homology in evolutionary biology and its implications for
understanding biological diversity.
9. Evaluate the strengths and weaknesses of BLAST and FASTA as sequence alignment
tools, considering factors such as speed, sensitivity, and accuracy.
10. Investigate the role of scoring models in sequence alignment, including the components
of scoring matrices and their impact on alignment accuracy.