Module 5

The document discusses key concepts in bioinformatics sequence analysis including biological sequences, sequence databases, sequence similarity, homology, and alignment. Sequence similarity refers to the degree of resemblance between biological sequences and helps infer evolutionary relationships and predict functional elements. Homology implies common evolutionary origin and can be inferred through sequence or structural similarity.

Uploaded by

dhrubojyotihazra


2nd Semester

Biology for Engineers (BSCD203, BSCG203)

2023-24

Bioinformatics

Module V
(Biology for Engineers BSCD203, BSCG203)

Table of Contents

S. No. CONTENT

1. Sequence similarity, homology, and alignment

2. Pairwise alignment: Scoring model, pairwise alignment using Hidden Markov models (HMM)

3. Multiple alignment: local alignment, gapped and un-gapped global alignment. BLAST, FASTA.

4. Phylogenetic tree construction: Neighbour Joining Algorithm.


Bioinformatics applies computational techniques to the analysis of biological data, and
sequence analysis is one of its fundamental tasks. The basics of sequence analysis and
sequence similarity are outlined below:

Biological Sequences:

• DNA Sequences: Represent the genetic code of an organism, specifying the order of
nucleotides (adenine, thymine, cytosine, and guanine).
• RNA Sequences: Similar to DNA but with uracil instead of thymine. Involved in various
cellular processes, including protein synthesis.
• Protein Sequences: Represent the amino acid composition of a protein, crucial for
understanding protein structure and function.

Sequence Databases:

• Genomic Databases: Contain complete genomes of organisms.

• Nucleotide Databases: Store DNA and RNA sequences.
• Protein Databases: House protein sequences and related information.

1. Sequence Similarity:
• Definition: Sequence similarity refers to the degree of resemblance between two or more
biological sequences.
• Importance: Similarity helps infer evolutionary relationships, identify conserved
regions, and predict functional elements in sequences.

Bioinformatics Databases:

• GenBank, ENA, DDBJ: Repositories for nucleotide sequences.

• UniProt: A comprehensive database of protein sequences.
• NCBI: Provides various tools and databases for bioinformatics analysis.

Understanding sequence data and similarity is crucial for numerous applications in
bioinformatics, including functional genomics, comparative genomics, drug discovery, and
evolutionary biology. Researchers use computational tools and algorithms to extract
meaningful information from biological sequences, contributing to our understanding of the
structure, function, and evolution of biological molecules.

Sequence similarity

Sequence similarity refers to the degree of similarity between two biological sequences, such as
DNA, RNA, or protein sequences. It is a crucial concept in bioinformatics and molecular
biology, as it helps researchers understand the functional and evolutionary relationships between
different biological entities.

There are various methods to measure sequence similarity, and the choice of method depends
on the type of sequences being compared and the specific goals of the analysis. Here are some
common methods for assessing sequence similarity:

1. Pairwise Sequence Alignment:

• Local Alignment: Methods like Smith-Waterman algorithm are used to find the best local
similarities between sequences.
• Global Alignment: Methods like Needleman-Wunsch algorithm align entire sequences,
allowing for gaps.
2. Multiple Sequence Alignment (MSA):
• MSA is used to align three or more sequences simultaneously. Popular algorithms include
ClustalW, MAFFT, and T-Coffee.
3. Sequence Identity and Similarity:
• Identity: The percentage of identical positions between two sequences.
• Similarity: The percentage of identical and similar (conservative) positions between two
sequences.

4. Sequence Databases and Homology Search: Tools like BLAST (Basic Local Alignment
Search Tool) are used to search sequence databases to find homologous sequences.

5. Phylogenetic Analysis: Construction of phylogenetic trees helps in understanding
evolutionary relationships by comparing sequence similarities.

6. Quantitative Measures: Metrics like the Jaccard index or the Hamming distance can be used
to quantify the similarity between sequences.

7. Scoring Matrices: Matrices like BLOSUM (for proteins) or PAM (Point Accepted Mutation,
also for proteins) assign scores to different substitutions, aiding in alignment algorithms.
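
The two quantitative measures named in item 6 can be sketched in a few lines of Python. This is a minimal illustration, not a standard library API: the function names and the choice of k = 3 for the k-mer sets are assumptions made here. The Hamming distance only applies to sequences of equal length, while the Jaccard index is computed on the sets of overlapping k-mers and therefore needs no alignment.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def kmer_set(seq: str, k: int) -> set:
    """Return the set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_index(a: str, b: str, k: int = 3) -> float:
    """Shared k-mers divided by all distinct k-mers found in either sequence."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 1.0


if __name__ == "__main__":
    print(hamming_distance("ACGTACGT", "ACGAACGT"))
    print(jaccard_index("ACGTACGT", "ACGAACGT", k=3))
```

A Hamming distance of 0 and a Jaccard index of 1.0 both indicate identical sequences (for the Jaccard index, identical k-mer content).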

Understanding sequence similarity is crucial for predicting the function of genes or proteins,
identifying conserved motifs, and inferring evolutionary relationships. It is a fundamental step
in various bioinformatics applications, including functional annotation, comparative genomics,
and drug discovery.

Homology:

Definition: Homology implies a common evolutionary origin of two or more biological
sequences. If sequences are homologous, it suggests that they share a common ancestor.

Measurement: Homology is usually inferred through sequence similarity or structural
similarity. Evolutionary relationships are often inferred by comparing sequences and identifying
similarities that are unlikely to have arisen by chance.

Purpose: Homology is a key concept in evolutionary biology and molecular evolution. When
sequences are homologous, their similarities often reflect shared ancestry and can provide
insights into the evolutionary history of genes or proteins. Functional similarities between
homologous sequences are often retained due to shared ancestry, but they can also diverge over
time.

In summary, similarity is a measure of how alike two sequences are, while homology implies a
shared evolutionary history. Similarity is a practical metric used in various bioinformatics
applications, while homology provides insights into the evolutionary relationships between
biological entities. Homologous sequences are expected to exhibit some level of similarity, but
not all similar sequences are necessarily homologous.

Two very important basic concepts:

• Similarity: Degree of likeness between two sequences, usually expressed as a
percentage of similar (or identical) residues over a given length of the alignment.
• Homology: Statement about common evolutionary ancestry of two sequences. Can only
be true or false. A high degree of similarity implies a high probability of homology:
• If two sequences are very similar, the sequences are usually homologous
• If two sequences are not similar, we don't know if they are homologous
• If two sequences are not homologous, their sequences are usually not similar (but may
be by chance)
• If two sequences are homologous, their sequences may or may not be similar

Homology in bioinformatics

Homology in bioinformatics refers to biological homology between DNA, RNA, or protein
sequences, defined in terms of shared ancestry in the evolutionary tree of life. In other
words, it is the common evolutionary ancestry of two sequences. Homologous sequences can
arise through speciation events (orthologs), horizontal gene transfer events (xenologs), or
duplication events (paralogs).

Multiple sequence alignment

Homology between DNA, RNA, or protein sequences can be deduced from the similarity of their
nucleotide or amino acid sequences. Significant similarity is strong evidence that two
sequences descend from a common ancestral sequence and have diverged through evolutionary
changes. Alignments of multiple sequences indicate which regions of each sequence are
homologous.

Similarity in bioinformatics

In bioinformatics, similarity between two protein or nucleotide sequences is assessed in two
main steps. The first step is pairwise alignment, which finds the optimal alignment between
two sequences (including gaps) using programs such as BLAST, FASTA, and LALIGN. The second
step is to obtain two quantitative parameters from each pairwise comparison: identity and
similarity. In BLAST output, positions with a positive substitution score are reported as
positives.

Alignment

Sequence alignment is a bioinformatics technique used to arrange biological sequences
(DNA, RNA, or protein) in a way that highlights their similarities, differences, and evolutionary
relationships. There are two main types of sequence alignment: pairwise alignment and multiple
sequence alignment.

2. Pairwise Sequence Alignment:

Purpose: Compares two sequences to identify regions of similarity or homology.

Algorithms: Common algorithms for pairwise alignment include the Needleman-Wunsch
algorithm (global alignment) and the Smith-Waterman algorithm (local alignment).

Output: The output of a pairwise alignment is a set of matched positions and potentially
introduced gaps, indicating where the sequences align or diverge.
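
As an illustration of global alignment, the following is a minimal sketch of the Needleman-Wunsch dynamic programming recurrence in Python. The scoring values (match +1, mismatch -1, linear gap penalty -2) and the function name are assumptions chosen for the example, not values prescribed by these notes; production tools use substitution matrices and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a

    # Traceback from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

The table has (len(a) + 1) x (len(b) + 1) cells, so time and memory grow with the product of the two sequence lengths.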

3. Multiple Sequence Alignment (MSA):

Purpose: Aligns three or more sequences simultaneously, often to identify conserved regions
and understand evolutionary relationships.

Algorithms: Popular algorithms for multiple sequence alignment include ClustalW, MAFFT,
and T-Coffee.

Output: The output of an MSA is a column-wise arrangement of sequences, with gaps introduced
to maximize overall similarity. Conserved regions are often easily recognizable in the alignment.

Here's a simple example of a pairwise sequence alignment:

Sequence 1: ACGTACGT

Sequence 2: ACGA--GT

In this example, gaps (represented by dashes) are introduced to align the two sequences. The
aligned positions show where the nucleotides match or differ. The goal is to maximize similarity,
taking into account matches, mismatches, and gap penalties.
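
The example alignment above can be summarised numerically. The sketch below counts matches, mismatches, and gap columns for two aligned rows and reports percent identity. The function name is hypothetical, and dividing by the full alignment length is only one convention (an assumption here); some tools divide by the number of ungapped columns or by the shorter sequence length instead.

```python
def alignment_stats(row1: str, row2: str) -> dict:
    """Summarise a pairwise alignment given as two gapped rows of equal length."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            gaps += 1          # column containing a gap
        elif x == y:
            matches += 1       # identical residues
        else:
            mismatches += 1    # substitution
    return {
        "matches": matches,
        "mismatches": mismatches,
        "gaps": gaps,
        # Identity relative to the alignment length (one of several conventions).
        "identity_pct": 100.0 * matches / len(row1),
    }


if __name__ == "__main__":
    print(alignment_stats("ACGTACGT", "ACGA--GT"))
```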

Multiple sequence alignment involves aligning more than two sequences. For example:

Sequence 1: ACGTACGT

Sequence 2: ACGA--GT

Sequence 3: ACGT-CGT

In this case, the alignment considers all three sequences simultaneously, introducing gaps as
needed to align regions of similarity.
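
One simple way to read such an alignment is column by column. The sketch below, with a hypothetical function name, lists the columns in which every sequence carries the same residue and no gap; it assumes the rows have already been aligned to equal length.

```python
def conserved_columns(rows):
    """Return 0-based indices of gap-free columns where all sequences agree."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    return [i for i, column in enumerate(zip(*rows))
            if len(set(column)) == 1 and column[0] != "-"]


if __name__ == "__main__":
    msa = ["ACGTACGT",
           "ACGA--GT",
           "ACGT-CGT"]
    print(conserved_columns(msa))
```

For the three sequences above, the conserved columns are the first three and the last two positions; the middle columns contain a substitution or gaps.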

Sequence alignment is a fundamental tool in bioinformatics, used for various purposes,
including the identification of conserved motifs, understanding functional regions, and inferring
evolutionary relationships between sequences.

Scoring model

A scoring model is a set of rules or parameters used to assign scores to different elements in a
computational or analytical context. In bioinformatics, scoring models are often used in the
context of sequence alignment to evaluate the similarity between two sequences or to assess the
significance of the alignment.

Here are some key components of scoring models used in sequence alignment:

Substitution Matrix:

In the context of sequence alignment, a substitution matrix assigns scores to different amino acid
or nucleotide substitutions. Common examples include BLOSUM (for proteins) and PAM
(Point Accepted Mutation, also for proteins) matrices.

The matrix reflects the likelihood of one residue being substituted for another based on observed
evolutionary changes.

Gap Penalties:

Gap penalties are used to assign scores for introducing gaps in the alignment. There are typically
two types of gap penalties: gap opening penalty (larger penalty for starting a gap) and gap
extension penalty (smaller penalty for extending an existing gap).

Scoring System:

The overall scoring system combines the scores from the substitution matrix and gap penalties
to assess the overall similarity of the aligned sequences.

Scores are assigned to matched residues, mismatched residues, and gap positions.

Scoring Function:

The scoring function calculates the overall score for a given alignment. It is often a sum of the
scores for matched residues, mismatched residues, and gap positions.
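
A scoring function of this kind can be sketched as follows. The toy nucleotide matrix (match +2, mismatch -1) and the gap penalties (opening -5, extension -1) are illustrative assumptions, as are all names; here the opening penalty is charged for the first position of a gap and the extension penalty for each further position, which is one of several conventions in use.

```python
NUCLEOTIDES = "ACGT"
# Toy substitution matrix (assumed values): +2 for a match, -1 for a mismatch.
SIMPLE_DNA = {(x, y): (2 if x == y else -1)
              for x in NUCLEOTIDES for y in NUCLEOTIDES}


def score_alignment(row1, row2, substitution=SIMPLE_DNA,
                    gap_open=-5, gap_extend=-1):
    """Sum substitution scores and affine gap penalties over an alignment."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    gap_row = None  # which row the current gap run is in, if any
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            current = 1 if x == "-" else 2
            # A gap continuing in the same row is an extension; otherwise it opens.
            total += gap_extend if current == gap_row else gap_open
            gap_row = current
        else:
            total += substitution[(x, y)]
            gap_row = None
    return total


if __name__ == "__main__":
    print(score_alignment("ACGTACGT", "ACGA--GT"))
```

For the alignment ACGTACGT / ACGA--GT this gives five matches (+10), one mismatch (-1), and one gap of length two (-5 - 1), for a total of 3.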

E-value and Significance:

The Expect value (E) is a parameter that describes the number of hits one can “expect” to see
by chance when searching a database of a particular size. It decreases exponentially as the Score
(S) of the match increases. Essentially, the E value describes the random background noise.

Lower E-values suggest more significant alignments.
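
The exponential relationship between score and E-value is usually written in the Karlin-Altschul form E = K · m · n · exp(-λS), where m is the query length, n the database length, and K and λ are statistical parameters of the scoring system. The sketch below evaluates that formula; the default values of λ and K are placeholder assumptions for illustration only, since real values depend on the substitution matrix and gap penalties in use.

```python
import math


def expect_value(raw_score, query_len, db_len, lam=0.318, k=0.134):
    """E = K * m * n * exp(-lambda * S); lam and k are assumed placeholder values."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam=0.318, k=0.134):
    """Normalised score S' = (lambda * S - ln K) / ln 2, so that E = m * n * 2**(-S')."""
    return (lam * raw_score - math.log(k)) / math.log(2)


if __name__ == "__main__":
    for s in (30, 40, 60):
        print(s, expect_value(s, query_len=250, db_len=1_000_000))
```

Two properties follow directly from the formula: raising the score lowers E exponentially, and doubling the database size doubles E for the same score.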

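Matrix Choice and Parameters:

Some scoring models include additional parameters or adjustments, and the substitution
matrix itself comes in several versions (e.g., BLOSUM30, BLOSUM62). The number is not a
size: a BLOSUMn matrix is built from alignment blocks in which sequences more than n%
identical are clustered together, so lower-numbered matrices suit more divergent sequences
and higher-numbered matrices suit closely related ones.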

The choice of scoring model and parameters depends on the nature of the sequences being
compared and the goals of the analysis. Different models may be suitable for comparing protein
sequences, DNA sequences, or RNA sequences, and researchers often select the most
appropriate model based on empirical testing and biological considerations.

Pair wise alignment using Hidden Markov models (HMM)

Hidden Markov Models (HMMs) can be used for pairwise sequence alignment, and they are
particularly useful when dealing with biological sequences like proteins. The process involves
constructing an HMM that represents the underlying states of the sequences and then using the
model to find the optimal alignment. Here's a simplified overview of the process:

Model Construction:

• States: Define states in the HMM to represent different aspects of the sequences, such as
match states, insertion states, and deletion states.
• Transitions: Define transitions between states, representing the probabilities of moving from
one state to another.
• Emission Probabilities: Assign emission probabilities to each state, representing the
likelihood of emitting a particular symbol (amino acid or nucleotide) given the current state.

The Markov Model

The appeal of hidden Markov models lies both in the solid mathematical principles that the
model is based on and in the simplicity that comes with them. Every hidden Markov model relies
on the assumption that the events we observe depend on some internal factors or states, which
are not directly observable. This assumption is very general, which makes the model widely
applicable, and it is where the "hidden" part of the name comes from. The "Markov" part comes
from how we model the changes of these hidden states through time. We use the Markov property,
a strong assumption that the process is memoryless, meaning the next hidden state depends only
on the current hidden state.

The first-order Markov process makes a very important simplification to observed sequential
data: the current system state depends only on the previous system state.

Figure: Graph of a Markov process. The current system state depends only on the previous
system state.

Additionally, hidden Markov models make one more important modification to the Markov
process: the actual system states are assumed to be unobservable, that is, hidden. For a
sequence of hidden states Z, the hidden Markov process emits a corresponding sequence of
observations X. Using the observations X, we try to infer what Z really is.

Figure: Graph of a hidden Markov process. We are unable to observe the actual hidden states of
the system Z, and can only observe the emitted observations X.

Guessing Someone’s Mood (An example of HMM)

An example of a hidden Markov process is guessing someone's mood. We cannot directly
observe or measure the mood of a person (at least without sticking electrodes in the person's
brain); instead, we observe his or her facial features and then try to guess the mood. We assume
that moods can be described as a Markov process and that there are 2 possible moods: good
and bad. We also assume that there are 2 possible observable facial features: smiling and
frowning.

Initial Hidden State Probabilities: When we first meet someone, we assume that there is a 70%
chance that the person is in a good mood and a 30% chance that the person is in a bad mood.

Hidden State Transition Matrix: We also assume that when a person is in a good mood, moments
later there is an 80% chance that he or she will still be in a good mood, and a 20% chance that
he or she will now be in a bad mood. We also assume the same probabilities for the opposite
situation in order to simplify the problem.

Observable Emission Probabilities: Finally, we assume that when a person is in a good mood,
there is a 90% chance that he or she will be smiling, and a 10% chance that he or she will be
frowning.

Guessing someone’s mood using hidden Markov models.
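
The mood example can be turned into a small decoding exercise with the Viterbi algorithm, which finds the most probable sequence of hidden states for a series of observations. The sketch below uses the initial, transition, and good-mood emission probabilities given above. The text does not state the emission probabilities for a bad mood, so the values used here (10% smiling, 90% frowning) are an assumption made purely for illustration, as are the function and variable names.

```python
STATES = ["good", "bad"]
START = {"good": 0.7, "bad": 0.3}
TRANS = {"good": {"good": 0.8, "bad": 0.2},
         "bad": {"good": 0.2, "bad": 0.8}}
EMIT = {"good": {"smile": 0.9, "frown": 0.1},
        # ASSUMPTION: bad-mood emissions are not given in the notes.
        "bad": {"smile": 0.1, "frown": 0.9}}


def viterbi(observations, states=STATES, start_p=START,
            trans_p=TRANS, emit_p=EMIT):
    """Return (most likely hidden state path, probability of that path)."""
    # best[t][s]: probability of the best path that ends in state s at time t.
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for obs in observations[1:]:
        prev = best[-1]
        column, pointers = {}, {}
        for s in states:
            # Choose the predecessor state that maximises the path probability.
            prob, origin = max((prev[r] * trans_p[r][s], r) for r in states)
            column[s] = prob * emit_p[s][obs]
            pointers[s] = origin
        best.append(column)
        back.append(pointers)
    # Follow the back-pointers from the most probable final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]


if __name__ == "__main__":
    print(viterbi(["smile", "smile", "frown"]))
```

Under these assumed parameters, two smiles followed by a frown decode to good, good, bad. Pair HMMs for sequence alignment use the same machinery, with match, insertion, and deletion states in place of moods.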

3. Multiple sequence alignment

Multiple sequence alignment (MSA) is a bioinformatics technique used to align three or more
biological sequences simultaneously. In the context of MSA, there are different types of
alignments, including local alignment, global alignment, and alignments with or without gaps.
Let's explore the concepts of local and global alignments in the context of multiple sequence
alignment:

1. Local Multiple Sequence Alignment:

• Purpose: Local MSA aims to identify regions of similarity or homology within the
sequences, allowing for variations in other parts of the sequences.

• Algorithm: Most MSA programs are global by default, but some offer local modes (for
example, MAFFT's L-INS-i strategy, which builds on local pairwise alignments). These
methods use heuristics to identify conserved regions within the sequences.
• Output: The output of local MSA consists of aligned segments that are locally similar
across the input sequences.
2. Global Multiple Sequence Alignment:
• Purpose: Global MSA aligns entire sequences from start to end, aiming to find the overall
similarity and conserved regions across the entire length of the sequences.
• Algorithm: Algorithms like ClustalW, MAFFT, and T-Coffee are commonly used for global
MSA. They consider the entire length of the sequences during alignment.
• Output: The output of global MSA is a complete alignment of all input sequences, spanning
the entire length of each sequence.
3. Gapped Multiple Sequence Alignment:
• Purpose: Gapped MSA allows for the introduction of gaps in the alignment to account for
insertions or deletions in the sequences.
• Algorithm: Most MSA algorithms, including those mentioned above, inherently handle
gapped alignments. Gaps are introduced to maximize the overall similarity between
sequences.
• Output: The output includes gaps introduced to align regions that may have insertions or
deletions in some sequences.
4. Ungapped Multiple Sequence Alignment:
• Purpose: Ungapped MSA does not allow for the introduction of gaps during the alignment
process.
• Algorithm: Some MSA methods provide options to perform ungapped alignments,
ensuring that the aligned sequences are gap-free.
• Output: The output consists of a gap-free alignment, making it suitable for comparing
sequences without considering insertions or deletions.
Choosing between local and global alignments, as well as gapped or ungapped alignments,
depends on the specific goals of the analysis and the characteristics of the sequences being
aligned. Local alignment is often used when focusing on specific conserved regions, while
global alignment provides a comprehensive overview of the entire sequences. The decision to
allow gaps or not depends on the biological context and the expected variability in the sequences.
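
For two sequences, the local counterpart of global alignment is the Smith-Waterman algorithm named earlier. The sketch below is a minimal version with assumed scoring values (match +2, mismatch -1, linear gap penalty -2) and a hypothetical function name. It differs from a global alignment in two ways: cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell rather than the corner of the table.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    n, m = len(a), len(b)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 option lets a new local alignment start at any cell.
            h[i][j] = max(0,
                          h[i - 1][j - 1] + sub,
                          h[i - 1][j] + gap,
                          h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)

    # Trace back from the best cell until a zero-score cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("TTTACGTTT", "GGACGTGG"))
```

In this example only the shared core ACGT is reported; the unrelated flanking residues are left out of the alignment.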

Ungapped alignment

• Line up two sequences
• Score according to:
  +1 if match
  0 if no match

Gapped alignments

Three kinds of mutations:
• Replacement of a base (or aa)
• Insertion of a base (or aa)
• Deletion of a base (or aa)

Score according to:
  +1 if match
  0 if no match
  -1 if gap
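
The two scoring schemes above can be sketched directly. In the ungapped case the only freedom is how the two sequences are shifted against each other, so the sketch slides one sequence along the other and keeps the offset with the most matches; in the gapped case it scores two already-aligned rows. The function names are hypothetical.

```python
def best_ungapped_offset(a, b):
    """Slide b along a without gaps; return (score, offset) with the most matches.

    offset is the position in a that lines up with b[0] (it may be negative).
    """
    best = (-1, 0)
    for offset in range(-(len(b) - 1), len(a)):
        score = sum(1 for i, ch in enumerate(b)
                    if 0 <= i + offset < len(a) and a[i + offset] == ch)
        if score > best[0]:
            best = (score, offset)
    return best


def gapped_score(row1, row2):
    """Score aligned rows: +1 per match, 0 per mismatch, -1 per gap column."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            total -= 1
        elif x == y:
            total += 1
    return total


if __name__ == "__main__":
    print(best_ungapped_offset("TTACGTAA", "ACGT"))
    print(gapped_score("ACGTACGT", "ACGA--GT"))
```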

BLAST and FASTA:

BLAST (Basic Local Alignment Search Tool) and FASTA are both widely used bioinformatics
tools for comparing biological sequences, such as DNA, RNA, or protein sequences, to identify
similarities and potential homologies. Despite having similar purposes, they use different
algorithms and approaches for sequence similarity searches.
BLAST (Basic Local Alignment Search Tool):

Algorithm: BLAST employs a heuristic algorithm that quickly identifies local regions of
similarity between sequences by breaking the search into smaller, manageable pieces. The
algorithm looks for short, exact matches (seeds) and extends them to form alignments.
Types of BLAST:

• BLASTn: Compares nucleotide sequences to nucleotide databases.

• BLASTp: Compares protein sequences to protein databases.
• BLASTx: Translates nucleotide sequences in all six reading frames and compares the
resulting amino acid sequences to protein databases.
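• tBLASTn: Compares a protein query against a nucleotide database translated in all six
reading frames.
• tBLASTx: Compares the six-frame translations of a nucleotide query against the six-frame
translations of a nucleotide database.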
Output: BLAST provides a list of significant matches, along with alignment details, statistical
scores (E-values), and graphical representations of local alignments.
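
The seed-and-extend idea can be illustrated with a toy sketch. It is not the BLAST implementation: real BLAST uses scored neighbourhood words for proteins, additional hit criteria, gapped extension, and E-value statistics. Here seeds are exact shared words of length k, and each seed is extended without gaps in both directions until the running score falls a fixed amount (x_drop) below the best score seen. All names and parameter values are assumptions for the example.

```python
def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where an exact k-mer is shared."""
    index = {}
    for j in range(len(subject) - k + 1):
        index.setdefault(subject[j:j + k], []).append(j)
    return [(i, j)
            for i in range(len(query) - k + 1)
            for j in index.get(query[i:i + k], [])]


def _extend(query, subject, qi, sj, step, match, mismatch, x_drop):
    """Extend in one direction; return (best score gained, length kept)."""
    score = best = best_len = length = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        score += match if query[qi] == subject[sj] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= x_drop:
            break  # score has dropped too far below the best seen
        qi += step
        sj += step
    return best, best_len


def extend_seed(query, subject, i, j, k=3, match=1, mismatch=-1, x_drop=2):
    """Ungapped extension of a seed; returns (score, q_start, s_start, length)."""
    right, right_len = _extend(query, subject, i + k, j + k, 1,
                               match, mismatch, x_drop)
    left, left_len = _extend(query, subject, i - 1, j - 1, -1,
                             match, mismatch, x_drop)
    score = k * match + left + right
    return score, i - left_len, j - left_len, left_len + k + right_len


if __name__ == "__main__":
    q, s = "GGACGTACTT", "CCACGTACAA"
    seeds = find_seeds(q, s)
    print(seeds)
    print(extend_seed(q, s, *seeds[0]))
```

Indexing the subject by words is what makes the search fast: only positions that share a word with the query are examined further.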
FASTA:

Algorithm: FASTA is a heuristic method. It first searches for short regions of exact
similarity (word matches, or k-tuples) between sequences, joins and rescores the best-matching
regions, and then uses dynamic programming restricted to a band around them to produce the
final alignment.
Types of FASTA:
• FASTA (fasta36): The original program for comparing protein or DNA sequences.
• SSEARCH (ssearch36): Performs a rigorous local alignment with the Smith-Waterman
algorithm; slower than the heuristic search but more sensitive.
• TFASTX, TFASTY, and TFASTZ: Perform translated searches, comparing a protein query
against a DNA database translated into protein.
Output: FASTA outputs alignments along with statistical scores, including E-values, sequence
identity, and similarity scores. It provides a summary of the alignment, as well as detailed
information on matched regions.
Comparison:
• BLAST is often preferred for its speed and is suitable for quickly identifying local
similarities in large databases.
• FASTA may be more sensitive for certain applications, as it uses rigorous statistical
methods and dynamic programming for alignment.
Both tools are widely used and have their strengths and weaknesses, making them
complementary in bioinformatics analyses.

4. Phylogenetic tree:

Phylogenetic tree construction is a method used in bioinformatics and evolutionary biology to
depict the evolutionary relationships among a group of species, genes, or other biological
entities. The process involves analyzing molecular sequences (DNA, RNA, or proteins) to infer
the historical relationships and divergence patterns.

Here's a general overview of the steps involved in phylogenetic tree construction:

Sequence Data Collection:

Obtain molecular sequences (such as DNA, RNA, or protein sequences) from the organisms of
interest. Commonly used genes include 16S rRNA for bacteria and archaea, and mitochondrial
or chloroplast genes for eukaryotes.

Sequence Alignment: Align the sequences to identify homologous positions. Multiple sequence
alignment (MSA) tools like ClustalW, MAFFT, or Muscle are commonly used for this step.
Accurate alignment is crucial for the accuracy of the phylogenetic tree.

Data cleaning and filtering: Remove poorly aligned or highly variable regions that may
introduce noise into the analysis. This step helps in obtaining a more reliable phylogenetic
signal.

Model selection: Choose an appropriate evolutionary model that describes the substitution
pattern of nucleotides or amino acids in the sequences. Common models include the General
Time Reversible (GTR) model for nucleotides and the Jukes-Cantor model for simpler cases.
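
As a concrete example of what an evolutionary model does, the Jukes-Cantor model converts the observed proportion of differing sites p between two aligned nucleotide sequences into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). The correction accounts for multiple substitutions at the same site, so d is always at least as large as p. The function names below are hypothetical, and skipping gap columns when computing p is an assumption of this sketch.

```python
import math


def p_distance(row1, row2):
    """Proportion of differing sites, ignoring columns that contain a gap."""
    pairs = [(x, y) for x, y in zip(row1, row2) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance d = -3/4 * ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("p >= 0.75: sequences are saturated, distance undefined")
    return -0.75 * math.log(1 - 4 * p / 3)


if __name__ == "__main__":
    p = p_distance("ACGTACGT", "ACGA--GT")
    print(p, jukes_cantor_distance(p))
```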

Phylogenetic tree inference:

Use tree inference methods to construct the phylogenetic tree based on the sequence data and
the selected evolutionary model.

Common methods include:

Distance-based methods: Neighbor-Joining (NJ), UPGMA.
Maximum Parsimony (MP): Seeks to find the tree that requires the fewest evolutionary changes.
Maximum Likelihood (ML): Determines the tree that maximizes the likelihood of the observed
data given the evolutionary model.

Bootstrap analysis: Assess the robustness of the tree topology by performing bootstrap
analysis. This involves resampling the data to generate multiple datasets and re-running the tree-
building process to estimate the reliability of each branch.
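
The resampling step of a bootstrap analysis can be sketched as follows: alignment columns are drawn at random, with replacement, until a pseudo-alignment of the original length is obtained, and the analysis is repeated on each replicate. In the sketch below, `supports_clade` is a hypothetical callback standing in for "rebuild the tree and check whether a given grouping appears"; the simple stand-in used in the demonstration (are the first two sequences the closest pair?) is an assumption made to keep the example self-contained.

```python
import random


def bootstrap_alignment(rows, rng):
    """Resample alignment columns with replacement; rows stay column-aligned."""
    length = len(rows[0])
    columns = [rng.randrange(length) for _ in range(length)]
    return ["".join(row[c] for c in columns) for row in rows]


def bootstrap_support(rows, supports_clade, replicates=100, seed=0):
    """Fraction of bootstrap replicates in which supports_clade(...) is True."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = sum(bool(supports_clade(bootstrap_alignment(rows, rng)))
               for _ in range(replicates))
    return hits / replicates


def first_two_are_closest(rows):
    """Stand-in for tree building: is (seq 0, seq 1) the closest pair?"""
    def diff(a, b):
        return sum(x != y for x, y in zip(a, b))
    return diff(rows[0], rows[1]) < min(diff(rows[0], rows[2]),
                                        diff(rows[1], rows[2]))


if __name__ == "__main__":
    alignment = ["ACGTACGT", "ACGAACGT", "TCGATCGA"]
    print(bootstrap_support(alignment, first_two_are_closest, replicates=200))
```

A grouping that appears in a large fraction of replicates is considered well supported by the data.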

Tree visualization: Visualize the resulting phylogenetic tree using tree visualization software
like FigTree, iTOL, or other phylogenetic tree viewers. Trees are often displayed in a
hierarchical format with branches representing evolutionary relationships.

Interpretation and analysis: Analyze the tree to infer evolutionary relationships, divergence
times, and patterns of speciation. Interpret the tree in the context of biological knowledge and
hypotheses.
Phylogenetic tree construction is a complex process, and the choice of methods and models
depends on the characteristics of the data and the biological questions being addressed.
Researchers often validate their results through multiple methods and consider the biological
context when interpreting phylogenetic trees.

Neighbor-Joining algorithm

The Neighbor-Joining (NJ) algorithm is a popular method for constructing phylogenetic trees in
the field of biology. It is widely used for analyzing molecular data, such as DNA or protein
sequences, to infer evolutionary relationships among different species or individuals.

The Neighbor-Joining algorithm is advantageous because it is relatively fast and can handle
large datasets. Unlike UPGMA, it does not assume a constant evolutionary rate across lineages
(a molecular clock). However, it relies entirely on the pairwise distance matrix, so it is
sensitive to errors in distance estimation, and it produces a single unrooted tree. Researchers
often perform bootstrap analysis to assess the reliability of the branches in the
Neighbor-Joining tree. Despite its limitations, the Neighbor-Joining algorithm is widely used
in practice and has contributed to the field of molecular phylogenetics.

Context: The primary goal of phylogenetic tree construction is to represent the evolutionary
history and relationships among biological entities, such as species or genes. This is done by
analyzing molecular data that reflects the genetic similarities and differences between these
entities.

Sequence Data: In biology, researchers often start with molecular sequences, such as DNA,
RNA, or protein sequences, obtained from different species or individuals. These sequences are
aligned to identify homologous positions, reflecting common ancestry.

Distance matrix: The Neighbor-Joining algorithm takes as input a distance matrix, which
quantifies the evolutionary distances between pairs of taxa (species or sequences). These
distances can be estimated based on sequence divergence, substitution rates, or other measures.

Node and Branch Representation: In the Neighbor-Joining tree, each leaf (terminal node)
represents a taxon (species or sequence), and each internal node represents an inferred common
ancestor of the taxa joined beneath it. Branch lengths represent the estimated evolutionary
distances along each branch. The algorithm aims to construct a tree whose path lengths best
reflect the observed distances.

Cluster formation: The algorithm iteratively forms clusters of taxa based on their pairwise
distances. At each step, it identifies the pair of taxa with the minimum Q-value, indicating a
potential cluster. A new node is created to represent this cluster, and the tree is updated
accordingly.
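
One iteration of this clustering step can be sketched in Python. For n taxa with distance matrix d, the Q-value of a pair is Q(i, j) = (n - 2) d(i, j) - Σ_k d(i, k) - Σ_k d(j, k). The pair with the smallest Q is joined at a new node u, the branch lengths from i and j to u are computed, and the distance from u to every other taxon k is set to (d(i, k) + d(j, k) - d(i, j)) / 2. The function name and the small five-taxon matrix are illustrative assumptions; a full implementation repeats this step on the reduced matrix until the tree is complete.

```python
def neighbor_joining_step(dist):
    """Perform one Neighbor-Joining iteration on a symmetric distance matrix.

    Returns ((i, j), (len_i, len_j), dist_to_new, q) where (i, j) is the pair
    joined, len_i and len_j are branch lengths to the new node, dist_to_new maps
    each remaining taxon to its distance from the new node, and q holds all
    Q-values keyed by pair.
    """
    n = len(dist)
    if n < 3:
        raise ValueError("need at least three taxa")
    totals = [sum(row) for row in dist]
    q = {(i, j): (n - 2) * dist[i][j] - totals[i] - totals[j]
         for i in range(n) for j in range(i + 1, n)}
    i, j = min(q, key=q.get)  # pair with the minimum Q-value
    len_i = dist[i][j] / 2 + (totals[i] - totals[j]) / (2 * (n - 2))
    len_j = dist[i][j] - len_i
    dist_to_new = {k: (dist[i][k] + dist[j][k] - dist[i][j]) / 2
                   for k in range(n) if k not in (i, j)}
    return (i, j), (len_i, len_j), dist_to_new, q


if __name__ == "__main__":
    # Illustrative distances between five taxa (indices 0..4).
    d = [[0, 5, 9, 9, 8],
         [5, 0, 10, 10, 9],
         [9, 10, 0, 8, 7],
         [9, 10, 8, 0, 3],
         [8, 9, 7, 3, 0]]
    pair, lengths, to_new, q = neighbor_joining_step(d)
    print(pair, lengths, to_new)
```

In this example taxa 3 and 4 have the smallest raw distance, yet taxa 0 and 1 have the smallest Q-value and are joined first. The Q-value corrects each distance for how far the two taxa are from everything else, which is why Neighbor-Joining does not simply merge the closest pair.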

Hierarchy and Evolutionary Relationships: The resulting tree has a hierarchical structure that
reflects the evolutionary relationships among taxa. Nodes closer to each other in the tree are
more closely related, while those farther apart share a more distant common ancestor.

Application in evolutionary biology: Biologists use Neighbor-Joining trees to study the
evolutionary history of species, populations, or genes. These trees help researchers understand
the branching patterns, divergence times, and relatedness among different biological entities.

Validation and analysis: In biological studies, researchers often validate the reliability of the
Neighbor-Joining tree through statistical methods, such as bootstrap analysis. The resulting tree
is then analyzed to draw biological conclusions, such as identifying clades, understanding
speciation events, or inferring the functional implications of genetic evolution.

In summary, the Neighbor-Joining algorithm is a powerful tool in biology for reconstructing
phylogenetic trees based on molecular sequence data. Its simplicity, speed, and applicability to
large datasets make it a valuable method for studying the evolutionary relationships among
different biological entities.

Multiple Choice Questions (MCQs):

• What is the primary goal of sequence alignment in bioinformatics?
a) Identifying differences between sequences
b) Highlighting evolutionary relationships
c) Generating random alignments

d) Predicting protein structures

Answer: b) Highlighting evolutionary relationships

• Which algorithm is commonly used for local alignment of biological sequences?

a) Needleman-Wunsch
b) Smith-Waterman
c) ClustalW
d) BLAST
Answer: b) Smith-Waterman

• Which database is specifically designed to store protein sequences?

a) GenBank
b) ENA
c) DDBJ
d) UniProt
Answer: d) UniProt

• What does the E-value represent in BLAST results?

a) Expected value of alignment
b) Evolutionary distance
c) Energy value
d) Evaluation score
Answer: a) Expected value of alignment

• Which method is used to align three or more sequences simultaneously?

a) Pairwise alignment
b) Multiple sequence alignment
c) Hidden Markov Models
d) Phylogenetic analysis

Answer: b) Multiple sequence alignment

• What is the main purpose of scoring matrices in sequence alignment?

a) To assign scores to different elements in the alignment
b) To compute E-values
c) To identify gaps in the alignment
d) To visualize the alignment
Answer: a) To assign scores to different elements in the alignment

• Which model is commonly used to describe the substitution pattern of nucleotides or
amino acids in sequences?
a) Hidden Markov Model
b) Markov Chain Model
c) Substitution Matrix Model
d) Evolutionary Tree Model
Answer: c) Substitution Matrix Model

• What does the Neighbor-Joining algorithm aim to construct?

a) Phylogenetic trees
b) Scoring matrices
c) Multiple sequence alignments
d) Substitution patterns
Answer: a) Phylogenetic trees

• Which tool is suitable for quickly identifying local similarities in large sequence
databases?
a) ClustalW
b) MAFFT
c) BLAST
d) T-Coffee
Answer: c) BLAST

• Which step is essential before constructing a phylogenetic tree?

a) Sequence alignment
b) Scoring matrix selection
c) Visualization
d) Model validation
Answer: a) Sequence alignment

Short Answer Questions:

1. Define sequence similarity and its importance in bioinformatics.

2. Explain the purpose of scoring matrices in sequence alignment.

3. What is the difference between global and local sequence alignment?

4. Describe the role of phylogenetic trees in evolutionary biology.

5. How does the E-value in BLAST results help in sequence similarity analysis?

6. Briefly explain the concept of homology in bioinformatics.

7. What are the main steps involved in multiple sequence alignment?

8. How does the Neighbor-Joining algorithm construct phylogenetic trees?

9. Discuss the significance of gap penalties in sequence alignment.

10. What are the types of sequence databases commonly used in bioinformatics, and what
kind of sequences do they store?

Long Answer Questions:

1. Compare and contrast the algorithms used in pairwise sequence alignment, focusing on
Needleman-Wunsch and Smith-Waterman algorithms.

2. Explain the process of constructing a phylogenetic tree, including sequence data
collection, alignment, model selection, and tree inference.

3. Discuss the applications of sequence similarity analysis in functional genomics,
comparative genomics, and drug discovery.

4. Describe the different methods for measuring sequence similarity and their respective
applications in bioinformatics.

5. Explore the challenges and limitations associated with sequence alignment algorithms
and how researchers address them.

6. Analyze the role of hidden Markov models in sequence alignment, highlighting their
advantages and applications in bioinformatics.

7. Discuss the significance of homology in evolutionary biology and its implications for
understanding biological diversity.

8. Elaborate on the process of multiple sequence alignment, including the types of
alignments and the algorithms commonly used for this purpose.

9. Evaluate the strengths and weaknesses of BLAST and FASTA as sequence alignment
tools, considering factors such as speed, sensitivity, and accuracy.

10. Investigate the role of scoring models in sequence alignment, including the components
of scoring matrices and their impact on alignment accuracy.

