Bio Model
5. Write a short note on gap penalties and their usage in comparing biological sequences.
Ans: Gap Penalties and Its Usage in Comparing Biological Sequences
In sequence alignment, especially in bioinformatics, gap penalties are designed to discourage
the introduction of gaps because gaps in an alignment represent evolutionary events such as
insertions and deletions, which are less common than base substitutions. The selection of an
appropriate gap penalty can significantly affect the quality of an alignment.
Gap penalties come in two types: gap opening penalties and gap extension penalties. A gap
opening penalty is the cost to open a gap, and it's usually quite high. On the other hand, a
gap extension penalty is the cost to extend an existing gap, which is usually lower. This
two-part (affine) scheme reflects the biological observation that a single insertion or deletion
event often spans several adjacent residues, so one long gap is more plausible than many short ones.
In Needleman-Wunsch (global alignment) and Smith-Waterman (local alignment) algorithms,
gap penalties are used to adjust the alignment score. They play a crucial role in determining
the final form of the alignment by balancing the number of gaps with the number of
matched or mismatched characters.
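The opening/extension scheme can be sketched as follows. This is a minimal illustration, assuming the convention that the first gapped position pays the opening penalty and every further position pays the extension penalty (some programs instead charge open + length × extend). The function names and the values -10 and -0.5 are illustrative assumptions, not the defaults of any particular tool.

```python
def affine_gap_cost(length, gap_open=-10.0, gap_extend=-0.5):
    """Total penalty for one contiguous gap of `length` positions.

    Assumed convention: first position costs `gap_open`,
    each additional position costs `gap_extend`.
    """
    if length <= 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


def linear_gap_cost(length, gap=-2.0):
    """Linear model: every gapped position costs the same."""
    return length * gap


# One gap of length 4 versus four separate gaps of length 1
one_long_gap = affine_gap_cost(4)         # -10 + 3 * -0.5 = -11.5
four_short_gaps = 4 * affine_gap_cost(1)  # 4 * -10 = -40.0
```

Under the affine model a single long gap is much cheaper than the same number of gapped positions scattered across the alignment, which is the behaviour the two penalties are meant to produce.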
6. List any three types of BLAST and give a short description of each.
Ans: Types of BLAST
• BLASTN: This tool is used for nucleotide-nucleotide comparison. It can compare a
nucleotide query sequence against a nucleotide database. BLASTN is often used for
identifying similar DNA sequences across different species, helping to detect
homologous genes.
• BLASTP: This variant of BLAST is used for protein-protein comparison. It takes a
protein query sequence and compares it to a protein database. BLASTP uses a
scoring matrix (like BLOSUM or PAM) to identify regions of similarity and is helpful in
predicting the function of new or unknown proteins.
• BLASTX: This tool compares the six-frame conceptual translation products of a
nucleotide query sequence (both strands) against a protein sequence database. It
allows the identification of potential protein-coding regions within a given nucleotide
sequence. It's often used to find potential coding regions in newly sequenced DNA.
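The six-frame conceptual translation that BLASTX performs on the query can be sketched as below. This is only an illustration of the idea, not BLAST's implementation; it assumes the standard genetic code, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Translate both strands in all three reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
```

Each of the six resulting protein strings would then be compared against the protein database.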
7. What are the principles underlying the formation of the Ramachandran plot?
Ans: Principles Underlying the Formation of Ramachandran Plot:
The Ramachandran plot, named after G.N. Ramachandran who first developed this concept,
is based on the idea that the peptide bond joining consecutive amino acids is planar and
rigid, so the conformation of the polypeptide backbone is largely described by rotation about
two bonds per residue: N-Cα (angle φ) and Cα-C (angle ψ). The plot shows ψ against φ for the
amino acid residues in a protein structure.
Two key principles underlie the formation of the Ramachandran plot:
• Steric Hindrance: The plot takes into account the spatial constraints of a protein's
structure. Some combinations of angles result in clashes between atoms, making
certain conformations impossible. The plot separates these regions into allowed and
disallowed regions, indicating which combinations of the dihedral angles (φ, ψ) are
sterically allowed and which are not.
• Polypeptide Chain Conformations: The plot depicts the favored angles that peptide
groups in proteins can take, which essentially define the types of secondary
structures in proteins (e.g., α-helices and β-sheets). The plot reveals
that these structures fall in the most favored regions because they combine optimal
hydrogen bonding with minimal steric clashes.
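Each point on the plot is a pair of dihedral angles computed from four consecutive backbone atoms: φ from C(i-1), N(i), Cα(i), C(i) and ψ from N(i), Cα(i), C(i), N(i+1). The sketch below shows one common way to compute a dihedral angle from four 3D points. The helper names and the test coordinates are hypothetical, chosen only so the expected angles are obvious.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    y = _dot(b2, _cross(n1, n2))
    x = math.sqrt(_dot(b2, b2)) * _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Hypothetical coordinates: cis (0 degrees), trans (180 degrees), +90 degrees
cis = dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 1, 0))
trans = dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, -1, 0))
quarter = dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 0, 1))
```

Applying such a function to every residue of a structure and plotting the (φ, ψ) pairs gives the Ramachandran plot.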
PART B
11. (a) What is the central dogma of molecular biology?
(b) Explain the steps involved in the process of transcription. How is the primary transcript
produced by a prokaryote different from that produced by a eukaryotic cell?
Ans: (a) The Central Dogma of Molecular Biology
The central dogma of molecular biology explains the flow of genetic information within a biological
system. It was proposed by Francis Crick and involves three key processes: replication,
transcription, and translation. According to this theory, information encoded in DNA (the genes) is
transcribed into mRNA (messenger RNA), which is then translated into proteins in the cells. Once
information has passed into protein, it cannot flow back to either RNA or DNA.
(b) Process of Transcription & Differences in Prokaryotes and Eukaryotes
Transcription involves the synthesis of RNA from a DNA template. The process can be broken down into
three stages: initiation, elongation, and termination.
Initiation involves the binding of the enzyme RNA polymerase to a specific sequence on the DNA
called the promoter. Once RNA polymerase is bound to the promoter, it begins to separate the DNA
strands.
Elongation is the addition of nucleotides to the mRNA strand. RNA polymerase reads the DNA strand
from 3' to 5' end and builds the mRNA strand from 5' to 3' end.
During termination, the RNA transcript is released and the polymerase detaches from the DNA.
The primary transcript produced by a prokaryote is already mature mRNA because prokaryotes lack a
nucleus and their DNA is not separated from the ribosome (the site of protein synthesis). So,
transcription and translation occur almost simultaneously.
However, in eukaryotic cells, the primary transcript, also known as pre-mRNA, needs further
processing to become mature mRNA. This process includes the addition of a 5' cap and a 3' poly-A
tail, as well as the removal of non-coding sequences (introns) in a process called splicing. Only after
these modifications is the mRNA ready for translation.
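The elongation and splicing steps can be illustrated with simple string operations. This is a toy sketch: the function names, the example template, and the intron coordinates are hypothetical, and capping and polyadenylation are not modelled.

```python
def transcribe(template_3to5):
    """Build the mRNA (5'->3') complementary to a DNA template written 3'->5'."""
    pairing = {"A": "U", "T": "A", "G": "C", "C": "G"}
    return "".join(pairing[base] for base in template_3to5.upper())


def splice(pre_mrna, introns):
    """Remove introns given as (start, end) half-open index pairs."""
    exons = []
    position = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[position:start])
        position = end
    exons.append(pre_mrna[position:])
    return "".join(exons)


# Prokaryote-like case: the transcript is used as is
mrna = transcribe("TACGGT")

# Eukaryote-like case: a hypothetical intron at positions 3..7 is removed
mature = splice("AUGGUAAGCCA", [(3, 8)])
```

In this toy example both routes end in the same mature mRNA; only the eukaryote-like route needed a processing step.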
(b) Explain bio-molecules involved in central dogma, its structure and types.
Ans: (a) Translation Process in Protein Synthesis
Translation is the process of protein synthesis where the sequence of nucleotides in mRNA is decoded
into a sequence of amino acids to form a protein. It happens in three stages: initiation, elongation,
and termination.
Initiation: The ribosome assembles around the target mRNA. The first tRNA is attached at the start
codon.
Elongation: The ribosome traverses along the mRNA, decoding each codon and attaching
corresponding amino acids to the growing polypeptide chain.
Termination: When the ribosome reaches a stop codon on the mRNA, the process of translation
ends. The complete polypeptide is released and can undergo further modifications before becoming
a functional protein.
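The three stages map onto a short decoding loop. The sketch below assumes the standard genetic code and a simplified initiation rule (start at the first AUG); the function name is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_mrna(mrna):
    """Decode an mRNA from the first AUG up to the first stop codon."""
    mrna = mrna.upper()
    start = mrna.find("AUG")           # initiation: locate the start codon
    if start == -1:
        return ""
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")
        if amino_acid == "*":          # termination: stop codon reached
            break
        peptide.append(amino_acid)     # elongation: extend the chain
    return "".join(peptide)


peptide = translate_mrna("GGAUGGCCUUUUAAGG")
```

Here the reading frame is set by the start codon, and decoding stops at UAA.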
(b) Bio-molecules Involved in Central Dogma: Their Structure and Types
There are three primary biomolecules involved in the central dogma: DNA, RNA, and protein.
DNA (Deoxyribonucleic acid): DNA is a double-stranded molecule consisting of two long polymers of
simple units called nucleotides, with backbones made of sugars and phosphate groups joined by
phosphodiester bonds. There are four types of nucleotides in DNA, differentiated by their
nitrogenous bases: adenine (A), cytosine (C), guanine (G), and thymine (T).
RNA (Ribonucleic acid): RNA is usually single-stranded and contains a sugar ribose, as opposed to
the deoxyribose found in DNA. Like DNA, RNA has four types of nucleotides, but thymine is replaced
by uracil (U). The three main types of RNA include mRNA (messenger RNA), tRNA (transfer RNA), and
rRNA (ribosomal RNA).
Protein: Proteins are made up of amino acids. They are the final product in the flow of information
from DNA to RNA to protein. There are 20 standard amino acids, and their sequence in a protein is
determined by the sequence of codons in mRNA, which was itself transcribed from the DNA
sequence.
13. (a) Explain the importance of Primary and secondary databases in Bioinformatics
(b) Illustrate the methods of pairwise sequence alignment. What is the use of assigning gap
penalties in alignment?
Primary and secondary databases are two types of biological databases that have their own
significance in bioinformatics.
• Primary Databases: These databases contain raw experimental data such as nucleotide and
protein sequences. They collect the data directly from researchers and store it in its
original form. Primary databases are the first destination of raw sequence data after it is produced.
An example of a primary database is GenBank, which collects a wide variety of genetic
sequences from scientists globally.
• Secondary Databases: These are derived from the primary databases. They contain analyzed
and interpreted information. These databases offer additional annotations like protein-
protein interactions, gene-disease associations, metabolic and signaling pathways, etc. These
are invaluable for researchers who are interested in specific domains or functions. An
example of a secondary database is PROSITE, which catalogues protein families, domains, and
functional sites derived from primary sequence data.
The existence of these databases allows scientists to store, retrieve, and work with massive volumes
of biological data, making bioinformatics analysis possible and practical.
Pairwise sequence alignment methods are designed to find the best-matching piecewise (local) or
global alignments of two query sequences. There are two fundamental methods:
• Global Alignment: This method, best known by the Needleman-Wunsch algorithm, attempts
to align every residue of both sequences from end to end. It is most useful when the two
sequences are similar and of roughly equal size.
• Local Alignment: This method, best represented by the Smith-Waterman algorithm, tries to
identify regions of similarity within long sequences that are often widely divergent overall.
Local alignment is generally preferred when the sequences are suspected to have regions of
similarity within their larger sequence context (like genes in genomes).
Gap penalties are integral to sequence alignment. They discourage the introduction of gaps (spaces
representing insertions or deletions) into the aligned sequences. The penalties are chosen to reflect
how often insertions and deletions are expected relative to substitutions, and they influence the
outcome of the alignment. High gap penalties tend to result in fewer gaps in the alignment, whereas
low gap penalties may lead to more gaps.
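Local alignment differs from global alignment mainly in that cell scores are never allowed to drop below zero and the traceback starts from the highest-scoring cell. A minimal Smith-Waterman sketch with a linear gap penalty follows; the scoring values (+2, -1, -2) and the example sequences are illustrative assumptions.

```python
def smith_waterman(seq1, seq2, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of seq1, aligned part of seq2)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    h = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            h[i][j] = max(0,                      # restart the alignment
                          h[i - 1][j - 1] + sub,  # match or mismatch
                          h[i - 1][j] + gap,      # gap in seq2
                          h[i][j - 1] + gap)      # gap in seq1
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)

    # Trace back from the best cell until a zero is reached
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + sub:
            top.append(seq1[i - 1])
            bottom.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] + gap:
            top.append(seq1[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(seq2[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, local1, local2 = smith_waterman("TTTACGTTT", "GGACGGG")
```

Only the shared ACG region is reported; the dissimilar flanks are ignored, which is why local alignment suits sequences that are divergent overall.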
14. (a) Illustrate sequence alignment. What are the applications of sequence alignment in
Bioinformatics?
(b) What is the use of scoring matrices? Differentiate between PAM and BLOSUM matrices and its
usage in alignment.
Sequence alignment is a method of arranging sequences of DNA, RNA, or protein to identify regions
of similarity. These similarities may be consequences of functional, structural, or evolutionary
relationships between the sequences.
Applications of sequence alignment in bioinformatics include:
• Phylogenetic Analysis: Aligned sequences are used to create phylogenetic trees, which
depict evolutionary relationships among species or individuals.
• Protein Modeling: Sequence alignment can identify domains and determine the function of
new proteins based on the known functions of similar proteins.
• Genomic Annotation: In genome projects, new genes can be annotated by aligning them
with known genes.
• Drug Discovery: Identifying similarities between pathogenic proteins and human proteins
can help design drugs that inhibit the pathogenic protein without affecting the human one.
Scoring matrices are key elements in bioinformatics analyses such as sequence alignment. They are
used to calculate the alignment score based on the substitutions, matches, and mismatches that
occur. These scores are a measure of sequence similarity.
• PAM (Point Accepted Mutation) Matrices: These matrices are based on the observation of
substitutions in closely related proteins. A PAM1 matrix corresponds to an average of one
accepted point mutation per 100 amino acids. Matrices for larger evolutionary distances
(e.g., PAM250) are extrapolated from PAM1 by repeated matrix multiplication.
• BLOSUM (BLOcks SUbstitution Matrix) Matrices: BLOSUM matrices are derived from
comparisons of blocks of sequences in a database that are already aligned and assumed to
be related. Each BLOSUM matrix is tied to an identity threshold; for instance, BLOSUM50 is
built from blocks in which sequences more than 50% identical are clustered and counted as
a single sequence, so that very similar sequences do not dominate the statistics.
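These matrices are used in sequence alignment to score matches and substitutions, helping to
identify the most likely evolutionary relationships between sequences. The numbering runs in
opposite directions: low-numbered PAM matrices (e.g., PAM30) and high-numbered BLOSUM matrices
(e.g., BLOSUM80) suit closely related sequences, while high-numbered PAM matrices (e.g., PAM250)
and low-numbered BLOSUM matrices (e.g., BLOSUM45) suit distantly related sequences.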
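Both families of matrices are log-odds scores. As a sketch in generic notation (the symbols are introduced here, not taken from the notes): q_ij is the observed frequency with which residues i and j are aligned in trusted alignments, p_i and p_j are their background frequencies, and λ is a scaling factor.

```latex
s_{ij} = \frac{1}{\lambda} \, \log \frac{q_{ij}}{p_i \, p_j}
```

A positive score means the pair occurs more often than expected by chance (a conservative substitution); a negative score means it occurs less often.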
15. (a) Using Needleman and Wunsch dynamic programming method, construct the partial
alignment score table for the following two sequences, using the scoring parameters: match score:
+5, mismatch score: -1, gap penalty: -2. CCATGCU GATTACA Also write down the optimal global
alignment between these sequences along with the optimal score.
(b) Interpret the blast result and statistical significance of the alignment by analyzing the results.
Ans: 15. (a) Using Needleman and Wunsch dynamic programming method
Here's how you would use the Needleman-Wunsch algorithm to align the two sequences "CCATGCU"
and "GATTACA". This is a simplified example and in practice, biological sequences would be much
longer.
First, initialize a matrix with the sequences one on top and one on the side:
- G A T T A C A
- 0
C
C
A
T
G
C
U
Next, fill in the first row and the first column with the cumulative gap penalties:
- G A T T A C A
- 0 -2 -4 -6 -8 -10 -12 -14
C -2
C -4
A -6
T -8
G -10
C -12
U -14
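Then, fill in each cell by taking the maximum of:
1. The diagonal cell's value plus the match score (if the two characters match) or the mismatch
score (if the two characters don't match).
2. The value of the cell above plus the gap penalty.
3. The value of the cell to the left plus the gap penalty.
U is treated as an ordinary character; it matches nothing in GATTACA.
- G A T T A C A
- 0 -2 -4 -6 -8 -10 -12 -14
C -2 -1 -3 -5 -7 -9 -5 -7
C -4 -3 -2 -4 -6 -8 -4 -6
A -6 -5 2 0 -2 -1 -3 1
T -8 -7 0 7 5 3 1 -1
G -10 -3 -2 5 6 4 2 0
C -12 -5 -4 3 4 5 9 7
U -14 -7 -6 1 2 3 7 8
Tracing back from the bottom right corner gives one optimal alignment (ties in the table mean
that other alignments with the same score exist):
CCA-TGCU
-GATTACA
The optimal score is the value in the bottom right corner of the matrix, which is 8 in this case
(3 matches × 5 + 3 mismatches × -1 + 2 gaps × -2 = 8).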
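A hand-filled table can be checked with a short script. The following is a minimal Needleman-Wunsch sketch using the scoring parameters from the question (match +5, mismatch -1, gap -2); the tie-breaking order in the traceback (diagonal, then up, then left) is an arbitrary choice, so other equally good alignments may exist.

```python
def needleman_wunsch(seq1, seq2, match=5, mismatch=-1, gap=-2):
    """Return (score table, aligned seq1, aligned seq2) for a global alignment."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # first column: cumulative gaps
    for j in range(1, cols):
        score[0][j] = j * gap          # first row: cumulative gaps
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # diagonal
                              score[i - 1][j] + gap,      # from above
                              score[i][j - 1] + gap)      # from the left

    # Trace back from the bottom right corner to the top left corner
    top, bottom = [], []
    i, j = len(seq1), len(seq2)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                top.append(seq1[i - 1])
                bottom.append(seq2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(seq1[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(seq2[j - 1])
            j -= 1
    return score, "".join(reversed(top)), "".join(reversed(bottom))


table, aligned1, aligned2 = needleman_wunsch("CCATGCU", "GATTACA")
optimal_score = table[-1][-1]
```

With these inputs the bottom right cell holds 8, and the traceback yields CCA-TGCU over -GATTACA.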
(b) Interpret the BLAST result and statistical significance of the alignment
BLAST returns results in the form of alignments between the query sequence and matching
sequences in the database. For each alignment, it provides the following information:
• The identity percentage, which is the percentage of characters in the alignment that are the
same.
• The positive percentage, which is the percentage of characters in the alignment that are
similar.
• The alignment itself, showing the query sequence, the matching sequence, and a middle line
indicating matches, mismatches, and gaps.
The E-value is a measure of the statistical significance of the alignment. It is the number of
alignments with a given score (or higher) that we would expect to find by chance in a database of the
same size. Lower E-values indicate more significant matches. An E-value of 1e-3, for instance, means
that we would expect to find 1 match with a similar score by chance if we searched a database of the
same size 1000 times.
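The E-value follows from Karlin-Altschul statistics: E = K · m · n · e^(-λS), where S is the raw alignment score, m and n are the query and database lengths, and K and λ depend on the scoring system. The sketch below also shows the equivalent bit-score form. The numeric values of λ, K, and the lengths are illustrative assumptions only, and real BLAST additionally applies effective-length corrections that are omitted here.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalised score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance hits scoring at least `raw_score`."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Illustrative (assumed) parameters
LAMBDA, K = 0.267, 0.041
QUERY_LEN, DB_LEN = 250, 1_000_000

e_low_score = e_value(60, QUERY_LEN, DB_LEN, LAMBDA, K)
e_high_score = e_value(100, QUERY_LEN, DB_LEN, LAMBDA, K)

# Same E-value expressed through the bit score: E = m * n * 2^(-S')
e_from_bits = QUERY_LEN * DB_LEN * 2 ** (-bit_score(100, LAMBDA, K))
```

The E-value falls exponentially as the score rises and grows linearly with database size, which is why the same alignment is less significant in a larger database.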
16. (a) Using Smith Waterman method construct the partial alignment scoring table and obtain the
optimal local alignment of the following two sequences: ACGTATCGCGTATA GATGCTCTCGGAJAA
Multiple sequence alignment (MSA) is a way of aligning three or more biological sequences (often
protein sequences) to identify regions of conservation that may be of functional, structural, or
evolutionary significance.
For example, consider three unaligned sequences:
1. ACGTACGT
2. ACGT
3. ACGTCGT
A multiple sequence alignment of these three sequences might look like:
ACGTACGT
ACGT----
A-CGTCGT
The dashes represent gaps that are inserted so that similar residues line up in the same columns.
Multiple sequence alignment is more complex than pairwise alignment because the goal is to
optimize the alignment of all sequences simultaneously. This often requires sophisticated
computational methods and scoring systems that consider the evolutionary relationships among the
sequences.
There are multiple tools available for multiple sequence alignment like Clustal Omega, MUSCLE, T-
Coffee, etc. These tools are based on different algorithms to carry out the multiple sequence
alignment.
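One common way to score a multiple alignment is the sum-of-pairs score: every column is scored by summing the pairwise scores of all residue pairs in it. The sketch below applies it to the three-sequence example; the scoring values (match +1, mismatch -1, residue against gap -2, gap against gap 0) are illustrative assumptions, and the second alignment is a hypothetical alternative gap placement for comparison.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned symbols."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(alignment, **scoring):
    """Sum pairwise scores over every column of a multiple alignment."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows must have the same length")
    total = 0
    for column in zip(*alignment):
        total += sum(pair_score(a, b, **scoring)
                     for a, b in combinations(column, 2))
    return total


example = ["ACGTACGT", "ACGT----", "A-CGTCGT"]
alternative = ["ACGTACGT", "ACGT----", "ACGT-CGT"]  # hypothetical gap placement

example_score = sum_of_pairs(example)
alternative_score = sum_of_pairs(alternative)
```

Under this scoring the alternative placement scores higher (-1 versus -16), which illustrates why alignment tools search over gap positions rather than accepting the first arrangement.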
(b) Explain how the protein structure is determined by using experimental techniques.
• Primary structure: This is the linear sequence of amino acids that makes up the polypeptide
chain. The primary structure is determined by the gene corresponding to the protein.
Changes in the primary structure can result in a different protein.
• Secondary structure: These are the local folding patterns that occur within a polypeptide
chain due to hydrogen bonding between the backbone atoms. The most common secondary
structures are alpha helices and beta sheets. The type and sequence of secondary structures
in a protein is determined by the primary structure.
• Tertiary structure: This is the overall three-dimensional fold of a single polypeptide
chain, stabilized by interactions between amino acid side chains such as hydrophobic
interactions, hydrogen bonds, ionic bonds, and disulfide bridges.
• Quaternary structure: Some proteins are made up of multiple polypeptide chains, also
known as subunits. The quaternary structure describes the arrangement and interactions of
these subunits.
• X-ray crystallography: This technique involves purifying the protein and forming a crystal. X-
rays are then directed at the crystal, and the pattern of diffracted rays is captured. By
analyzing the diffraction pattern, the electron density of the protein can be determined,
which can then be used to determine the protein's structure.
• Nuclear Magnetic Resonance (NMR): NMR can be used to study proteins in solution. It
works by applying a magnetic field and measuring the spin interactions of atomic nuclei. The
data can be used to determine the structure of the protein.
Protein-protein interactions are essential for many biological processes, including signal
transduction, gene regulation, and metabolic control. Proteins often function in complex networks of
interactions, allowing for intricate regulation and control of biological pathways. The complexity and
versatility of these interactions contribute to the complexity of an organism, as the same protein can
participate in different functional complexes and thus have multiple roles. For instance, proteins can
form large complexes that carry out replication, transcription, translation, cell signaling, immune
responses, and more.
The Protein Data Bank (PDB) is a global repository for the 3D structural data of large biological
molecules, including proteins and nucleic acids. This data is gathered from experimental methods
such as X-ray crystallography, NMR spectroscopy, and Cryo-EM. Researchers worldwide can freely
access this data for studying biological phenomena, creating new algorithms for structure prediction,
understanding disease mechanisms, or designing new drugs. The PDB provides a wealth of structural
details, including atomic coordinates, which helps researchers to visualize the 3D structure and
understand the function of proteins.
19. (a) Discuss systems biology approach of understanding complex biological systems.
Systems biology is an approach that seeks to understand biological systems as a whole, rather
than focusing on individual components in isolation. It is interdisciplinary in nature, integrating
biology with fields like mathematics, computer science, and engineering. It aims to comprehend the
behavior and properties of biological systems by studying the interactions and dynamics of their
components.
In systems biology, instead of breaking down a system into individual components and studying each
separately, the focus is on how these components interact with each other and result in complex
behaviors. This allows for a more holistic understanding of the biological processes. It emphasizes the
study of network interactions among components (genes, proteins, etc.), thus revealing emergent
properties and system-level behaviors that aren't explainable by studying individual elements.
The use of computational and mathematical modeling in systems biology can help simulate, predict,
and manipulate the behavior of biological systems, providing insights into their functions and
dynamics.
In the modeling of biological systems, variables, parameters, and constants play crucial roles:
• Variables: These are measurable quantities that change over time in a model, representing
the dynamic aspects of the system. For example, the concentration of a protein in a cell.
• Parameters: These are numerical values, such as reaction rate constants, that govern the
behavior of the model. They stay fixed during a single simulation but can be adjusted
between runs to fit experimental data.
• Constants: Constants are values that don't change over time. They represent intrinsic
properties of the system, like the Avogadro constant.
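The three roles can be seen in a small simulation. The sketch below assumes a hypothetical one-variable model, dP/dt = k_syn - k_deg · P, integrated with the simple Euler method: P (protein concentration) is the variable, k_syn and k_deg are parameters, and the Avogadro constant is a true constant. All names and numeric values are illustrative.

```python
AVOGADRO = 6.02214076e23  # constant: particles per mole


def simulate_protein(k_syn, k_deg, p0=0.0, t_end=50.0, dt=0.01):
    """Euler integration of dP/dt = k_syn - k_deg * P.

    k_syn, k_deg: parameters (fixed during one run, tunable between runs)
    P: variable (changes at every time step)
    """
    steps = int(round(t_end / dt))
    p = p0
    trajectory = [p]
    for _ in range(steps):
        p += dt * (k_syn - k_deg * p)
        trajectory.append(p)
    return trajectory


def molecules(concentration_molar, volume_litres):
    """Convert a molar concentration into a number of molecules."""
    return concentration_molar * volume_litres * AVOGADRO


trajectory = simulate_protein(k_syn=2.0, k_deg=0.5)
steady_state = trajectory[-1]  # approaches k_syn / k_deg = 4.0
```

Changing a parameter (for example doubling k_deg) changes the steady state the variable settles to, while the constant is the same in every run.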
20. (a) Explain the advantages of computational modeling of biological systems.
• It enables the simulation of complex biological processes that may be difficult or even
impossible to study experimentally.
• It allows for the exploration of various scenarios and the manipulation of variables in a
controlled environment, which helps in understanding how changes in one part of the
system can affect the whole.
• Captures essential features and interactions: Models should capture the essential features
and interactions of the system while omitting unnecessary details.
• Predictive, interpretable, and scalable: Models should be predictive, meaning they can
forecast future system behaviors. They should be interpretable - the components of the
model correspond to elements of the real system. And they should be scalable, maintaining
accuracy as the system complexity or size increases.
• Adequateness: A model's adequateness refers to its ability to capture the relevant aspects of
a biological system and provide meaningful insights. The best models can effectively balance
complexity and simplicity for optimal utility and understanding.