
Module 5

Topic: Predictive Methods Using Protein Sequences
Levels of Protein Structures

Primary structure (Amino acid sequence)



Secondary structure (α-helix, β-sheet)

Tertiary structure (three-dimensional structure
formed by assembly of secondary structures)

Quaternary structure (structure formed by more
than one polypeptide chain)
Secondary Structure
• Protein secondary structure takes one of three
forms:
– α helix
– β sheet
– Turn, coil or loop
• Secondary structures are tightly packed in the
protein core in a hydrophobic environment
• Secondary structure is predicted within a small
window
• Many different algorithms exist, but none is highly accurate
• Better predictions are obtained from a multiple alignment
• Methods: neural networks, nearest-neighbor
methods, HMMs
• Types of Secondary Structure
• Alpha-Helix (α-Helix):
– Structure: Right-handed coil stabilized by hydrogen bonds between the carbonyl
oxygen of one residue and the amide hydrogen four residues away.
– Features:
• Compact and cylindrical.
• Found in transmembrane proteins and coiled-coil regions.
– Examples: Hemoglobin, myoglobin.
• Beta-Sheet (β-Sheet):
– Structure: Extended strands aligned side by side, connected by hydrogen bonds.
• Parallel β-Sheet: Strands run in the same direction.
• Antiparallel β-Sheet: Strands run in opposite directions.
– Features:
• Can form flat or twisted sheets.
• Common in fibrous and globular proteins.
– Examples: Silk fibroin, immunoglobulins.
• Beta-Turns and Loops:
– Beta-Turn: Short loops reversing chain direction, stabilized by hydrogen bonds.
– Loops: Non-repetitive, flexible regions connecting helices and sheets.
– Importance: Provide flexibility and are often involved in active sites or binding
interfaces.
[Figures: alpha-helix structure, the beta-sheet, reverse turns]
• Folding Classes
• Definition
• Folding classes describe the overall arrangement of secondary
structural elements within a protein. These patterns are linked to the
protein’s function and evolutionary relationships.
• Major Folding Classes
• All-Alpha Proteins:
– Composed entirely of α-helices.
– Examples: Globin fold (e.g., myoglobin, hemoglobin).
• All-Beta Proteins:
– Composed entirely of β-sheets.
– Examples: Immunoglobulin fold, beta-barrels.
• Alpha/Beta Proteins:
– Alternating α-helices and β-strands.
– Examples: TIM barrel (e.g., triosephosphate isomerase).
• Alpha + Beta Proteins:
– Contain separate α-helices and β-sheets without alternating patterns.
– Examples: Ferredoxin-like fold.
• Determining Secondary Structure and Folding
Classes
• Experimental Methods
• X-ray Crystallography:
– Provides high-resolution 3D structures.
• NMR Spectroscopy:
– Useful for studying small, soluble proteins.
• Circular Dichroism (CD) Spectroscopy:
– Estimates the proportion of helices, sheets, and random
coils.
• Computational Methods
• Secondary Structure Prediction:
– Tools like PSIPRED, DSSP.
• Folding Class Identification:
– SCOP (Structural Classification of Proteins) or CATH
databases.
Protein Tertiary Structure
• Tertiary structure refers to the three-dimensional (3D) arrangement of a protein’s
secondary structural elements (α-helices, β-sheets, and random coils) into a compact,
functional shape. This structure is determined by interactions between the side chains (R
groups) of amino acids in the polypeptide chain, making it crucial for a protein's biological
function.
• Features of Tertiary Structure
• 1. Globular Proteins
• Structure: Most enzymes and regulatory proteins are globular, with their hydrophobic core
and hydrophilic surface, facilitating interaction with other molecules.
• Examples: Myoglobin, hemoglobin, and enzymes like lactase.
• 2. Fibrous Proteins
• Structure: These proteins have elongated, filamentous shapes and are typically involved
in structural support or muscle contraction.
• Examples: Collagen (in connective tissues), keratin (in hair, nails), and elastin (in skin).
• 3. Domains
• Definition: A protein domain is a distinct, independently folding unit of a protein that often
has its own specific function.
• Example: The SH2 domain in signaling proteins recognizes phosphorylated tyrosine
residues.
• 4. Active Sites
• Structure: Many proteins, particularly enzymes, have specific regions (active sites) where
substrates bind and undergo chemical reactions. These sites are shaped by the tertiary
structure.
• Importance: The active site’s specific shape is determined by the arrangement of side
chains and is crucial for the protein’s catalytic activity.
Protein Tertiary Structure
Classification
• Class α: a bundle of α helices connected by loops
on the surface of protein
• Class β: antiparallel β sheets
• Class α/β: mainly parallel β sheets with
intervening α helices
• Class α+β: mainly segregated α helices and
antiparallel β sheets
• Multidomain proteins: comprise domains
representing more than one of the above 4 classes
• Membrane and cell-surface proteins: α helices
(hydrophobic) with a particular length range,
traversing a membrane
• Tertiary Structure Determination Methods
• 1. X-ray Crystallography
• Description: X-ray diffraction data is collected from protein crystals and used to build a 3D
model of the protein's atomic structure.
• Strengths: Provides high-resolution structures.
• Limitations: Requires high-quality protein crystals, which can be challenging to obtain.
• 2. Nuclear Magnetic Resonance (NMR) Spectroscopy
• Description: Measures the interactions between atomic nuclei in the protein to generate
structural information.
• Strengths: Does not require crystallization and is useful for studying proteins in solution.
• Limitations: Lower resolution compared to X-ray, and it is often limited to smaller proteins.
• 3. Cryo-Electron Microscopy (Cryo-EM)
• Description: Uses electron microscopy to visualize proteins in their native state at near-
atomic resolution, without the need for crystallization.
• Strengths: Can study large, complex structures (e.g., ribosomes, viral capsids).
• Limitations: Relatively high cost and technical complexity.
• 4. Computational Methods
• Description: Methods like homology modeling, folding simulations, and ab initio
prediction use sequence data to predict protein structure.
• Strengths: Fast and useful for proteins lacking experimental structures.
• Limitations: Predictions are less accurate than experimental methods.
[Figures: fold class examples (all-alpha, all-beta, alpha/beta; class α, class β, class α+β, class α/β; membrane proteins)]
QUATERNARY STRUCTURE
Types of Protein Fold Prediction
Tools
• 1. Homology Modeling (Comparative Modeling)
• Homology modeling is based on the assumption that evolutionarily related
proteins (homologs) share similar structures. If the 3D structure of a related
protein is known, the structure of the target protein can be predicted by
aligning their sequences and transferring the structural information.
• Steps:
– Find homologous sequences with known structures (template proteins).
– Align the target sequence to the template.
– Model the 3D structure of the target protein based on the template’s structure.
– Refine the model by adjusting loop regions and side-chain conformations.
• Tools:
– SWISS-MODEL: A web-based server for homology-based protein structure prediction.
– MODELLER: A widely used tool for comparative modeling.
– Phyre2: A tool for predicting protein structure by exploiting known protein structures
and sequence alignments.
• 2. Ab Initio (De Novo) Prediction
• Ab initio methods predict protein structures from scratch, without relying on template
structures. These methods attempt to model the folding process based on physical and
chemical principles, using the amino acid sequence as input.
• Challenges: Ab initio methods are computationally intensive and often require large
amounts of time and resources, particularly for large proteins.
• Approach:
– Use of physical principles (e.g., molecular dynamics, energy minimization).
– Simulation of folding pathways, considering factors such as hydrogen bonding, hydrophobic
interactions, and electrostatic forces.
• Tools:
– ROSETTA: A powerful tool used for ab initio protein structure prediction.
– FOLDIT: A crowdsourced game that uses human intuition for ab initio protein folding predictions.
– I-TASSER: A hybrid method that combines threading, ab initio, and homology modeling.
• 3. Threading (Fold Recognition)
• Threading (also known as fold recognition) is used when a homologous protein with a
known structure cannot be found. This method involves fitting the target sequence onto a
3D structure from a database of known protein folds.
• Steps:
– The sequence of the target protein is "threaded" into a library of known protein structures (called
templates).
– The best-fit structure is selected based on energy scoring functions (similarity, geometric
compatibility).
• Tools:
– TASSER: Combines threading and ab initio approaches.
– Phyre2: Also uses threading in addition to homology modeling.
• 4. Machine Learning-Based Methods
• Recent advancements in artificial intelligence (AI) and
machine learning (ML) have led to the development of
prediction tools that use neural networks and deep learning to
predict protein structures.
• Approach: These methods are trained on large datasets of
protein sequences and structures to identify patterns and
relationships between sequence and fold.
• Advantages: They can make highly accurate predictions
based on patterns learned from millions of known structures.
• Tools:
– AlphaFold: A breakthrough tool developed by DeepMind that
uses deep learning to predict protein structures with remarkable
accuracy. AlphaFold won the CASP14 competition for its
accuracy in predicting protein folds.
– RoseTTAFold: A deep learning-based tool developed as an
alternative to AlphaFold, also using neural networks to predict
protein structure.
Protein Identity Based on
Composition
• Protein identity based on composition refers to analyzing a protein's amino acid
composition to infer its characteristics, structure, function, and evolutionary relationships.
This approach does not rely on the sequence but on the overall proportions of amino acids
in the protein.
• Key Concepts
• 1. Amino Acid Composition
• Proteins are made up of 20 standard amino acids, each with unique chemical properties.
• The relative abundance of these amino acids in a protein determines its composition.
• 2. Protein Features Derived from Composition
• Functional Class: Certain proteins (e.g., enzymes, structural proteins) show characteristic
amino acid profiles.
• Cellular Localization: Proteins targeted to specific cellular compartments (e.g., cytoplasm,
membrane) have distinct compositions.
– Example: Membrane proteins are rich in hydrophobic residues (e.g., leucine, isoleucine).
• Stability: Proteins with higher proline content tend to have greater structural rigidity.
• Evolutionary Relationships: Composition can suggest homology or evolutionary
conservation.
• Methods for Analyzing Protein Composition
• 1. Experimental Determination
• Amino Acid Analysis:
– Hydrolyze the protein into free amino acids.
– Quantify amino acids using chromatography or mass
spectrometry.
• 2. Computational Analysis
• Composition Profiling:
– Count the frequency of each amino acid in the protein
sequence.
– Tools: ExPASy ProtParam, EMBOSS Pepstats.
• Machine Learning:
– Algorithms predict properties like solubility or function
based on amino acid composition.
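The composition-profiling step above can be sketched in a few lines of Python (a minimal illustration; tools like ProtParam and Pepstats compute many more statistics):

```python
from collections import Counter

def aa_composition(seq: str) -> dict:
    """Return the relative frequency of each amino acid in a sequence."""
    seq = seq.upper()
    counts = Counter(seq)
    total = len(seq)
    return {aa: counts[aa] / total for aa in sorted(counts)}

# Example: a short hypothetical peptide
print(aa_composition("MKLVLLGAAA"))
```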
• Applications of Protein Composition Analysis
• 1. Functional Annotation
• Predicts a protein's potential role based on its amino acid makeup.
– Example: Enzymes often have specific conserved residues required for
catalytic activity.
• 2. Cellular Localization
• Distinguishes extracellular, membrane, and intracellular proteins.
– Example: Signal peptides in secreted proteins contain hydrophobic
residues.
• 3. Structure Prediction
• Amino acid composition influences secondary and tertiary structure.
– High glycine and proline content often indicates loop or coil regions.
• 4. Protein Stability
• The Instability Index (derived from composition) predicts whether a
protein is stable under physiological conditions.
– Example: Heat-stable proteins have a higher proportion of specific
residues like arginine and lysine.
• 5. Comparative Genomics
• Compositional similarity can suggest evolutionary relationships or
functional similarities between proteins.
Physical Properties of Proteins
Based on Sequence
• Protein sequences determine their physical properties, which influence structure,
function, and interactions. By analyzing the amino acid sequence, various physical
attributes can be inferred computationally or experimentally.
• Key Physical Properties
• 1. Molecular Weight
• Determined by summing the molecular weights of individual amino acids in the
sequence.
• Importance:
– Crucial for understanding protein size and separation techniques like SDS-
PAGE.
• Tools: ExPASy ProtParam.
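The molecular-weight calculation can be sketched directly by summing residue masses (the masses below are approximate average values; ProtParam uses more precise tables):

```python
# Approximate average residue masses in daltons (amino acid minus water)
RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER = 18.02  # one water added back for the free termini

def molecular_weight(seq: str) -> float:
    """Sum residue masses and add one water for the intact chain."""
    return sum(RESIDUE_MASS[aa] for aa in seq.upper()) + WATER

print(molecular_weight("GG"))  # glycylglycine, about 132.1 Da
```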
• 2. Isoelectric Point (pI)
• The pH at which the protein has no net electrical charge.
• Calculation:
– Based on the pKa values of ionizable groups in amino acid side chains and the
N- and C-termini.
• Importance:
– Helps in protein purification (e.g., isoelectric focusing).
– Predicts solubility under different pH conditions.
• Tools: Compute pI/Mw tool on ExPASy.
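The pI calculation can be sketched as a bisection on the net-charge function built from Henderson-Hasselbalch terms (the pKa values below are one common textbook set; different tools use slightly different tables, so results are indicative only):

```python
# One common textbook pKa set (assumption: tools like ExPASy differ slightly)
PKA_POS = {"Nterm": 9.0, "K": 10.5, "R": 12.5, "H": 6.0}
PKA_NEG = {"Cterm": 2.0, "D": 3.65, "E": 4.25, "C": 8.3, "Y": 10.07}

def net_charge(seq: str, ph: float) -> float:
    """Net charge of a peptide at a given pH (Henderson-Hasselbalch)."""
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_POS["Nterm"]))
    charge -= 1.0 / (1.0 + 10 ** (PKA_NEG["Cterm"] - ph))
    for aa in seq.upper():
        if aa in PKA_POS:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POS[aa]))
        elif aa in PKA_NEG:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEG[aa] - ph))
    return charge

def isoelectric_point(seq: str) -> float:
    """Find the pH of zero net charge by bisection (charge falls as pH rises)."""
    lo, hi = 0.0, 14.0
    while hi - lo > 0.001:
        mid = (lo + hi) / 2
        if net_charge(seq, mid) > 0:
            lo = mid  # still positively charged: pI is higher
        else:
            hi = mid
    return round((lo + hi) / 2, 2)
```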
• 3. Hydrophobicity
• Determines the overall hydrophobic or hydrophilic nature of the
protein.
• Hydropathy Index:
– Positive values indicate hydrophobic residues (e.g., leucine, valine).
– Negative values indicate hydrophilic residues (e.g., glutamate, lysine).
• Importance:
– Predicts transmembrane regions (e.g., membrane proteins are
hydrophobic).
– Influences folding and interactions with water or other molecules.
• Tools: Kyte-Doolittle plot for hydrophobicity profiling.
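A minimal sketch of Kyte-Doolittle profiling (the hydropathy values are the published ones; the window size is a tunable choice, with ~19 typical for transmembrane searches):

```python
# Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982)
KD = {"I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9,
      "A": 1.8, "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3,
      "P": -1.6, "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5,
      "K": -3.9, "R": -4.5}

def hydropathy_profile(seq, window=9):
    """Average hydropathy over a sliding window; sustained positive
    peaks suggest hydrophobic (possibly transmembrane) segments."""
    seq = seq.upper()
    scores = [KD[aa] for aa in seq]
    return [sum(scores[i:i + window]) / window
            for i in range(len(seq) - window + 1)]
```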
• 4. Secondary Structure Propensity
• Sequence composition predicts alpha-helices, beta-sheets, and
random coils.
• Patterns:
– Alpha-helix: Common in sequences rich in alanine, leucine, and
glutamate.
– Beta-sheet: Favored by valine, isoleucine, and phenylalanine.
• Tools: PSIPRED or GOR for secondary structure prediction.
• 5. Solubility
• Calculated from the proportion of polar and charged residues in the
sequence.
• Importance:
– Critical for determining suitability for expression and purification in different
environments.
• Tools: SoluProt or Protein-Sol.
• 6. Stability
• Predicted by analyzing amino acid content and interactions.
• Factors:
– Disulfide Bonds: Cysteine residues form disulfide bridges that enhance stability.
– Thermodynamic Stability: Glycine and proline influence rigidity.
• Importance:
– Essential for designing proteins for industrial or therapeutic purposes.
• Tools: FoldX or iStable.
• 7. Post-Translational Modification Sites
• Specific sequences or motifs predict modifications like phosphorylation,
glycosylation, or acetylation.
• Importance:
– Influences protein activity, localization, and interactions.
• Tools: NetPhos, GlycoEP.
• 8. Charge Distribution
• The arrangement of positively (lysine, arginine) and negatively (glutamate,
aspartate) charged residues.
• Importance:
– Determines electrostatic interactions and binding with nucleic acids or other proteins.
• 9. Flexibility and Disorder
• Some proteins or regions are intrinsically disordered, enabling flexibility and
multiple interactions.
• Prediction:
– High glycine and proline content often indicates disorder.
• Tools: IUPred or DISOPRED.
• 10. Aromatic Content
• Aromatic residues (tryptophan, tyrosine, phenylalanine) influence
absorbance and fluorescence.
• Importance:
– Helps in spectroscopic studies to determine protein concentration (e.g., UV
absorbance at 280 nm).
• Applications of Sequence-Based Property
Analysis
• Protein Engineering:
– Modify sequences to enhance stability, solubility, or
activity.
• Drug Design:
– Understand binding interactions for therapeutic
targeting.
• Structural Prediction:
– Predict folding patterns and functional domains.
• Biotechnology:
– Design proteins optimized for industrial or medical
applications.
Web-Based Software: NNPREDICT
NNPREDICT is a tool that uses artificial neural networks to predict a
protein's secondary structure, specifically the likelihood of regions forming
alpha-helices, beta-sheets, or coils. Neural networks are effective here
because they can learn complex patterns from large datasets.
• Input: It takes the amino acid sequence of a protein.
• Neural Network Analysis: The neural network is trained on known protein
structures, learning the common patterns and features associated with
secondary structures.
• Prediction: For each amino acid, NNPREDICT calculates the probability
that it is part of a helix, sheet, or coil. It relies on sequence similarity to
previously studied proteins, enhancing its predictions by comparing against
these known structures.
• This approach works well because certain amino acid sequences have
strong tendencies to form specific structures, and neural networks can
capture these tendencies from large datasets, leading to relatively accurate
predictions.
JPRED
• JPRED is a popular online tool used for predicting the secondary
structure of proteins.
• How JPRED Works
• Input: Users provide a protein sequence, typically in FASTA format
(a text-based representation of amino acid sequences).
• Sequence Alignments: JPRED uses multiple sequence alignments
(MSA) to identify patterns and evolutionary information across
similar proteins. These alignments show conserved regions, which
can give clues about secondary structures.
• Machine Learning Algorithms: JPRED leverages machine
learning techniques to analyze sequence alignments and predict
whether each segment of the sequence is likely to form an alpha
helix, beta sheet, or coil.
• Outputs: The result is a prediction of the secondary structure for
each amino acid residue in the sequence, often visualized as a
string of letters (H for helix, E for sheet, C for coil).
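A JPRED-style H/E/C output string can be summarized with a few lines of Python (an illustrative helper for reading such strings, not part of JPRED itself):

```python
def ss_summary(pred: str) -> dict:
    """Fraction of helix (H), sheet (E) and coil (C) in an H/E/C
    prediction string."""
    n = len(pred)
    return {state: pred.count(state) / n for state in "HEC"}

print(ss_summary("HHHHCCEEEE"))  # {'H': 0.4, 'E': 0.4, 'C': 0.2}
```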
• Interpreting JPRED’s Results
• Confidence Scores: JPRED provides a confidence score for each
prediction, indicating how certain the algorithm is that a particular
residue belongs to a specific secondary structure.
• Alignment Visualization: Users can also view the sequence
alignment used by JPRED, which helps in understanding which
parts of the sequence are highly conserved across species and
likely to have similar structures.
• Applications of JPRED
• Drug Design and Protein Engineering: Knowing secondary
structures can aid in designing drugs that interact with specific
protein regions or in engineering proteins with altered stability or
activity.
• Function Prediction: By understanding the structural properties of
a protein, scientists can hypothesize about its potential biological
functions.
• Mutation Studies: Researchers studying mutations in proteins can
use JPRED to predict how these changes might impact the protein's
secondary structure and, consequently, its function.
[Figure: sample JPRED result]
SOPMA
• SOPMA, which stands for Self-Optimized Prediction Method with Alignment,
is a computational tool used in bioinformatics to predict the secondary structure
of proteins. SOPMA analyzes protein sequences to identify regions likely to form
specific secondary structures, such as alpha helices, beta sheets, or random
coils, based on patterns in the amino acid sequence. This information is useful
for understanding protein function, as the shape of a protein often influences
its interactions with other molecules.
• Purpose of SOPMA in Protein Structure Prediction
• Proteins are made up of chains of amino acids that fold into specific 3D shapes.
• SOPMA predicts how these chains fold at a secondary level, which includes
structures like:
– Alpha helices: Spiral-shaped structures stabilized by hydrogen bonds.
– Beta sheets: Flat, sheet-like structures with strands held together by
hydrogen bonds.
– Random coils: Irregularly shaped segments without a fixed structure.
• Understanding these structures helps in determining how proteins interact,
which can be crucial for drug design, disease research, and biotechnology.
• How SOPMA Works
• Input: SOPMA requires an amino acid sequence of a protein, which is
represented by single-letter codes for each amino acid.
• Database Comparison: SOPMA compares this sequence to a large database
of known protein structures. It uses alignment techniques to match segments of
the sequence with similar segments in proteins with known structures.
• Machine Learning Techniques: SOPMA uses algorithms to predict secondary
structures based on patterns observed in the database, adjusting predictions to
optimize accuracy.
• Output and Interpretation
• SOPMA provides a prediction of each amino acid in the sequence, categorizing
it into:
– H for helix,
– E for extended strand (beta sheet), or
– C for coil.
• The output usually includes a probability score for each structure type, indicating
the confidence level of each prediction.
[Figure: sample SOPMA result]
DSSP
• DSSP stands for Definition of Secondary Structure of Proteins. It is a
program and file format developed to analyze and classify the secondary
structures of proteins based on their 3D atomic coordinates, typically derived
from X-ray crystallography, NMR, or cryo-electron microscopy experiments.
DSSP is commonly used in bioinformatics and structural biology.
• Key Components of DSSP
• Protein Structure Data Input:
– DSSP works with protein structures that are available in PDB (Protein Data
Bank) format. PDB files contain the 3D coordinates of atoms in a protein
structure, describing the positions of amino acids and atoms within the
molecule.
• Secondary Structure Classification:
– DSSP assigns secondary structure elements to regions of the protein. The
main types of secondary structures are:
• α-helices: Coiled, spiral structures stabilized by hydrogen bonds
between backbone atoms.
• β-strands: Extended, stretched structures that align to form β-sheets
when paired with other β-strands.
• Turns and loops: Irregular structures connecting helices and strands.
• Coils: Random or undefined regions of the structure.
• Hydrogen Bond Analysis:
– Hydrogen bonds play a critical role in stabilizing the secondary
structures in proteins. DSSP calculates possible hydrogen bonds based
on distances and angles between atoms, especially in the backbone.
• Accessible Surface Area (ASA):
– DSSP also calculates the Accessible Surface Area (ASA), which is the
area of each amino acid exposed to the solvent (like water). This
information is helpful to understand which parts of a protein are likely
involved in interactions with other molecules.
• How DSSP Works
• Input:
– The program takes the atomic coordinates from a PDB file.
• Hydrogen Bond Identification:
– DSSP identifies hydrogen bonds by evaluating the geometry of the
amino acid backbone atoms. It calculates distances and angles
between hydrogen donor and acceptor atoms to predict hydrogen
bonding.
• Assigning Secondary Structure:
– Based on the hydrogen bond information, DSSP assigns secondary
structure elements by detecting repetitive patterns:
• α-helices and 3₁₀ helices (a variation of the α-helix with a
different hydrogen-bonding pattern)
• β-strands and β-sheets
• Turns and other coil regions
• Output:
– DSSP outputs a file where each amino acid residue in the protein is
labeled with its assigned secondary structure. It also provides detailed
data on hydrogen bonding, ASA, and other geometric features.
STRIDE
• STRIDE is a secondary structure assignment tool that analyzes the 3D
atomic coordinates of a protein structure to classify regions into alpha-
helices, beta-sheets, and other structural types. Here’s a breakdown of how
STRIDE operates:
• Input: It takes 3D structural data, typically from Protein Data Bank (PDB)
files.
• Hydrogen Bond Analysis: STRIDE identifies hydrogen bonds, which are
critical for stabilizing secondary structures. By examining the bond geometry
and strength, STRIDE determines where hydrogen bonds support specific
structures.
• Empirical Energy Functions: It uses empirical energy calculations to
assess the stability of potential secondary structures, providing a more
detailed picture of the protein's structural layout.
• Nuanced Classification: STRIDE can make more refined assignments by
combining bonding information with energy calculations, often giving it an
edge in accuracy over tools like DSSP, which rely primarily on geometric
criteria.
RESTRICTION MAPPING
AND
PRIMER DESIGN
A restriction map is a description of restriction
endonuclease cleavage sites within a piece of
DNA.
Generating such a map is usually the first step
in characterizing an unknown DNA, and a
prerequisite to manipulating it for other
purposes.
Typically, restriction enzymes that cleave DNA
infrequently (e.g. those with 6 bp recognition
sites) and are relatively inexpensive are used
to produce such a map.
What are the 3 general steps used to clone DNA?
• Isolate DNA from an organism
• Cut the organismal DNA and the vector with restriction
enzymes, making recombinant DNA
• Introduce the recombinant DNA into a host

Restriction Enzymes
• Recognize a specific site (generally a palindromic sequence)
• Produce overhangs or straight (blunt) cuts
• Naturally found in bacteria, where they protect against viruses and
foreign DNA
• More than 400 enzymes have been isolated
• Named for the organism from which they are isolated:
the first letter is that of the genus and the 2nd and 3rd are from the
species
Restriction site in DNA, showing symmetry of the sequence
around the center point. The sequence is a palindrome, reading
the same from left to right (5’-to-3’) on the top strand (GAATTC,
here) as it does from right to left (5’-to-3’) on the bottom strand.
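The palindrome test described above can be expressed directly in code (a minimal sketch):

```python
# A site is palindromic if it equals its own reverse complement
COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(COMP)[::-1]

def is_palindromic(site: str) -> bool:
    return site.upper() == revcomp(site)

print(is_palindromic("GAATTC"))  # True  (EcoRI site)
print(is_palindromic("GATTTC"))  # False
```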
Examples of how restriction enzymes cleave DNA. (a) SmaI results in blunt
ends. (b) BamHI results in 5’ overhanging (“sticky”) ends. (c) PstI results in 3’
overhanging (“sticky”) ends.
Restriction Mapping
There are three methods used to generate
a restriction map:
(i) mapping by multiple R.E. digestions
(ii) mapping by partial R.E. digestions
(iii) using a computer
Mapping by Multiple R.E.
Digestions
The most straightforward method for restriction mapping
is to digest samples of the plasmid with:
(i) a set of individual enzymes,
(ii) and with pairs of those enzymes.
The digestions are then "run out" on an agarose gel to
determine the sizes of the fragments generated.
The sizes of the fragments determined by comparison
with standard DNA molecular weight markers.
If you know the fragment sizes, it is usually a fairly easy
task to deduce where each enzyme cuts
This is what mapping is all about.
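The fragment-size logic can be illustrated in code with hypothetical cut positions on a linear DNA (a toy sketch of the arithmetic, not of the gel-based deduction itself):

```python
def fragment_sizes(length, cut_positions):
    """Fragment sizes from a complete digest of a linear DNA molecule."""
    cuts = sorted(cut_positions)
    edges = [0] + cuts + [length]
    return [edges[i + 1] - edges[i] for i in range(len(edges) - 1)]

# Hypothetical 10 kb linear fragment cut at 2 kb and 7 kb
print(fragment_sizes(10000, [7000, 2000]))  # [2000, 5000, 3000]
```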
Creating a map by Partial
Digestions of End-Labelled DNA
A DNA fragment is labeled with a radioisotope
on only one end.
It can be partially digested with restriction
enzymes to generate labeled fragments.
Partial digestion is performed by using very
small amounts of enzyme or short periods of
time.
Analysis of the resulting products by PAGE
(polyacrylamide gel electrophoresis) enables one to define the
distance of R.E. sites from the labelled end.
Using a Computer to Generate
Restriction Maps
All of the techniques described above for
generating a restriction map assume that
you don't have the sequence of the DNA.
If the sequence is known, it is a simple
matter to feed that sequence into any
number of computer programs.
These programs will search the sequence
for dozens of restriction enzyme
recognition sites and build a map for you.
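A minimal sketch of such a program, searching a sequence for a few well-known recognition sites (consult a database such as REBASE for authoritative enzyme lists):

```python
# Recognition sites for three common enzymes (also shown in the figures above)
ENZYMES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "SmaI": "CCCGGG"}

def restriction_map(seq: str) -> dict:
    """Return 0-based positions of each enzyme's recognition site."""
    seq = seq.upper()
    sites = {}
    for name, site in ENZYMES.items():
        positions, start = [], seq.find(site)
        while start != -1:
            positions.append(start)
            start = seq.find(site, start + 1)
        sites[name] = positions
    return sites

print(restriction_map("AAGAATTCTTGGATCC"))
# {'EcoRI': [2], 'BamHI': [10], 'SmaI': []}
```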
Using R.E. maps for analysing
Recombinant DNA
Checking the size of the insert
Checking the orientation of the insert
Determining pattern of restriction sites
within insert DNA
Utilities
• Identifying cloning sites in plasmids.
• Verifying the integrity of recombinant DNA.
• Diagnosing the structure of mutations or
modifications in DNA sequences.
DNA Strider
DNA Strider is a software program designed for
analyzing DNA and protein sequences. It is
particularly popular for basic sequence analysis
tasks.
Features:
Calculates GC content.
Identifies open reading frames (ORFs).
Predicts restriction enzyme cleavage sites.
Generates complementary sequences.
Translates DNA to protein sequences.
MacVector
MacVector is a comprehensive software package for molecular
biology and sequence analysis, specifically designed for Mac
operating systems.
Features:
Sequence assembly and alignment.
Primer design and analysis.
Protein structure predictions.
Phylogenetic tree construction.
Restriction enzyme mapping and cloning simulation.
Utility:
Used for designing experiments, analyzing sequences, and
visualizing genetic data.
Streamlines tasks in molecular biology like plasmid mapping,
sequence alignment, and SNP analysis.
OMIGA
OMIGA is a sequence analysis software used
for visualizing, annotating, and analyzing DNA
and protein sequences.
Features:
Allows for sequence editing and annotation.
Provides tools for restriction mapping.
Includes functionality for codon optimization and
primer design.
Utility:
It is particularly useful in genetic engineering,
where precise sequence editing and analysis
are required.
Web-Based Tools
There are several web-based tools available for restriction mapping
and sequence analysis. Two prominent ones are:
a. MAP
MAP (Restriction Mapper) is an online tool that identifies all
restriction enzyme sites in a given DNA sequence.
Features:
• Supports a wide range of restriction enzymes.
• Displays positions and fragment sizes resulting from cuts.
Utility:
• Used to plan cloning experiments.
PRIMER DESIGN
Primer design is the process of creating short single-
stranded DNA sequences (primers) that bind specifically
to a target DNA region. These primers serve as starting
points for DNA synthesis during Polymerase Chain
Reaction (PCR) or other amplification techniques.
Primer Design Ensures specificity for the target
sequence to avoid non-specific amplification.
Improves the efficiency of PCR by reducing primer-dimer
formation and mismatches.
Optimizes reaction conditions like annealing temperature
and amplification efficiency
The most critical step in PCR
experiment will be designing
oligonucleotide primers.
Poor primers could result in little or
even no PCR product.
Alternatively, they could amplify many
unwanted DNA fragments.
Specificity
PCR is capable of amplifying a single
target DNA fragment out of a complex
mixture of DNA.
This ability depends on the specificity
of the primers.
Primers are short single-stranded
oligonucleotides which anneal to
template DNA and serve as a “primer”
for DNA synthesis.
In order to achieve the geometric
amplification of a DNA fragment, there
must be two primers, one flanking each
end of the target DNA.
It is essential that the primers have a
sequence that is complementary to the
target DNA.
Critical issues for specificity
Primers must be complementary to
flanking sequences of target region
Primers should not be complementary
to many non-target regions of genome
Melting Temperature (Tm)
The annealing temperature for a PCR reaction is
based on the melting temperature (Tm) of the
primers.
The Tm is the temperature at which a population of a
double stranded DNA molecule is partially denatured
such that half of the molecules are in the single
stranded state and half are in the double stranded
state.
At temperatures above the Tm the DNA molecules
will be in the single stranded form; at temperatures
below the Tm the DNA can form the double stranded
form.
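A quick Tm estimate for short primers is the Wallace rule, Tm = 2(A+T) + 4(G+C). A minimal sketch (this rule is a rough approximation for primers under ~20 nt; real primer tools use nearest-neighbor thermodynamic models):

```python
def tm_wallace(primer: str) -> int:
    """Wallace-rule Tm estimate: 2 degrees per A/T, 4 degrees per G/C."""
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2 * at + 4 * gc

print(tm_wallace("ATGCATGCATGCATGCAT"))  # 52
```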
Primer Length
Annealing efficiency decreases as primer
length increases.
Therefore very long primers will not
anneal efficiently, and this will lead to a
reduction in the amount of PCR
product produced.
Product Size
The choice of primers determines the
size of the PCR product.
If the two primers are complementary
to nearby regions on the template DNA,
then a small fragment of DNA will be
amplified.
If the two primers are complementary
to regions farther apart, then a larger
fragment of DNA will be amplified.
Primer Dimers
If the primers contain self-complementary
sequences, the primers, which are present in high
concentration, will anneal with themselves.
Primers that anneal with themselves are not
available to bind to the target DNA.
There are two types of potential self-
complementary sequences, those that lead
to hairpins and those that lead to primer
dimers.
G/C Content
It is important that primers be about
50% G/C and 50% A/T.
It is also important that regions within
the primer not have long runs of G/C or
A/T.
A stretch of A/Ts may base pair only
weakly, while a stretch of G/Cs may
promote mis-annealing.
G/C clamp
Stable base pairing of the 3’ end of a primer
and the target DNA is necessary for efficient
DNA synthesis.
To ensure the stability of this interaction,
primers are often designed ending in either a
G or a C. (GC base pairs are more stable than
AT base pairs.)
This terminal G or C is called a G/C clamp.
Primer design
1. primers should be 17-28 bases in length;
2. base composition should be 50-60% (G+C);
3. primers should end (3') in a G or C, or CG or GC: this
prevents "breathing" of ends and increases efficiency of
priming;
4. Tms between 55 and 80 °C are preferred;
5. primer self-complementarity (ability to form secondary
structures such as hairpins or primer dimers) should be
avoided;
6. it is especially important that the 3'-ends of primers
should not be complementary (ie. base pair), as
otherwise primer dimers will be synthesised
preferentially to any other product;
7. runs of three or more Cs or Gs at the 3'-ends of
primers may promote mispriming at G or C-rich
sequences (because of stability of annealing), and
should be avoided.
Need for Primer Design Tools
Designing primers manually can be challenging due to the
complexities involved in:
Avoiding Primer-Dimer Formation:
– Complementary sequences between primers can lead to self-
annealing or dimerization, reducing efficiency.
Ensuring Specificity:
– The primers must bind only to the desired target sequence.
Optimizing Tm (Melting Temperature):
– The Tm needs to be appropriate for the PCR reaction conditions.
Minimizing Secondary Structures:
– Primers should not form hairpins or other secondary structures.
Adjusting Primer Length and GC Content:
– Typically, primers are 18–25 bases long with 40–60% GC
content for stability.
Primer Design Programs and
Software
Primer3
Primer3 is one of the most popular open-source tools for
primer design.
Features:
Designs primers for PCR, qPCR, and sequencing.
Allows customization of primer parameters like Tm,
length, and GC content.
Offers options to avoid regions with known
polymorphisms or repetitive sequences.
Generates multiple primer pairs and ranks them based
on user-defined criteria.
Workflow:
Input the target DNA sequence.
Specify the region of interest (e.g., flanking the exon or
SNP).
Set constraints for primer properties (length, Tm, GC
content).
Analyze output to select the best primer pair.
Advantages:
Highly customizable.
Can be integrated with other bioinformatics tools.
Applications:
General PCR primer design.
Designing primers for mutagenesis or gene cloning.
Real-time PCR and probe-based amplification
experiments.
Other Popular Primer Design Tools
Oligo 7: Commercial software with advanced tools for
primer and probe design.
SnapGene: Integrates primer design with plasmid
visualization.
Integrated DNA Technologies (IDT) PrimerQuest: A
web-based tool for designing qPCR and standard PCR
primers.
BLAST Primer Tools: Helps verify primer specificity by
aligning against genomic databases
3D Structure Modeling in Drug
Discovery
• 3D structure modeling refers to the computational techniques used to
predict the three-dimensional structure of biological molecules such as
proteins and ligands. In drug discovery, it provides a detailed view of
molecular interactions.
• Applications in Drug Discovery:
• Understanding Targets: Determines the active site of target proteins for
drug binding.
• Lead Optimization: Guides modifications in lead molecules to improve
binding affinity.
• Virtual Screening: Screens large libraries of compounds for potential drug
candidates.
• Techniques:
• Homology Modeling:
– Builds the 3D structure of a target protein based on a similar, known structure.
• Ab Initio Modeling:
– Predicts protein structures from scratch using physicochemical principles.
• Molecular Dynamics Simulations:
– Studies the motion of atoms in the 3D structure to understand flexibility and stability.
Molecular Docking
• Molecular docking is a computational technique used to
predict the interaction between a small molecule (ligand) and
a macromolecular target (such as a protein, enzyme, or DNA).
It is an essential tool in drug discovery and structural biology
for identifying potential drug candidates and understanding
their binding mechanisms.
• Principle of Molecular Docking
• The primary goal of docking is to predict the "binding pose" or
"binding orientation" of a ligand when it interacts with the
active site of a receptor. The process involves:
• Sampling: Exploring all possible conformations and
orientations of the ligand in the binding site.
• Scoring: Evaluating these poses using scoring functions to
identify the most energetically favorable interaction.
• Types of Molecular Docking
• Rigid Docking:
– Assumes both the ligand and receptor are rigid
structures.
– Simple and computationally less demanding but
may overlook flexibility.
• Flexible Docking:
– Considers the flexibility of the ligand and/or
receptor.
– More accurate but computationally intensive.
• Protein-Protein Docking:
– Simulates interactions between two proteins to
understand complex formation.
• Key Steps in Molecular Docking
• 1. Preparation of the Receptor
• Structure Retrieval: Obtain the 3D structure of the target
protein from databases like PDB (Protein Data Bank).
• Active Site Identification:
– Use tools or literature to identify the binding pocket.
– Alternatively, perform blind docking if the active site is unknown.
• Preprocessing:
– Add missing atoms, correct bonds, and optimize the protein
structure by removing water molecules and assigning charges.
• 2. Preparation of the Ligand
• Structure Retrieval: Obtain the chemical structure of the
ligand from databases like PubChem or design it using
chemical drawing tools.
• Optimization:
– Generate the 3D structure.
– Optimize bond lengths, angles, and torsions.
• Assign Charges: Assign partial charges for energy
calculations.
• 3. Docking Simulation
• Sampling Algorithm: Generates multiple poses of the ligand in the receptor’s binding site.
Common approaches include:
– Systematic Search: Explores all possible conformations.
– Random/Genetic Algorithms: Uses randomness or evolutionary strategies to explore
conformations.
– Monte Carlo Simulation: Uses statistical sampling.
• 4. Scoring
• Evaluates the binding affinity between the ligand and receptor.
• Scoring functions estimate the strength of interactions based on:
– Hydrogen bonding.
– Hydrophobic interactions.
– Electrostatic forces.
– Van der Waals interactions.
• Common scoring functions include:
– Empirical scoring (e.g., ChemScore).
– Knowledge-based scoring (e.g., PMF).
– Force field-based scoring.
• 5. Post-Docking Analysis
• Ranking Poses: Ligand poses are ranked based on their docking scores.
• Visual Inspection: Use molecular visualization tools (e.g., PyMOL, Chimera) to examine
the binding mode.
• Validation:
– Analyze interactions such as hydrogen bonds, π-π stacking, and salt bridges.
– Cross-validate with experimental data or use consensus scoring.
Applications of Molecular
Docking
• 1. Drug Discovery
• Identifies lead compounds with high binding affinity to target
proteins.
• Guides optimization of molecular structure for better activity
and selectivity.
• 2. Virtual Screening
• Screens large chemical libraries to identify potential ligands
for a specific receptor.
• 3. Understanding Binding Mechanisms
• Reveals how drugs interact with targets at the molecular level,
helping in mechanism elucidation.
• 4. Predicting Drug Resistance
• Identifies mutations in target proteins that might alter drug
binding, aiding in the design of resistant variants.
• Popular Docking Tools
• AutoDock:
– Open-source tool with flexible docking options.
– Widely used for academic research.
• Schrödinger Glide:
– High precision with advanced scoring algorithms.
– Suitable for industrial applications.
• Molecular Operating Environment (MOE):
– Integrates docking with QSAR and pharmacophore modeling.
• SwissDock:
– Web-based tool for free docking simulations.
• Dock:
– Focuses on rigid docking with systematic search.
Quantitative Structure-Activity
Relationship (QSAR)
• Quantitative Structure-Activity Relationship (QSAR) is
a computational method that establishes a
mathematical relationship between the chemical
structure of compounds and their biological activities.
It is widely used in drug discovery, toxicology, and
chemical engineering to predict the properties of new
compounds.
• Principle of QSAR
• The fundamental idea behind QSAR is that the
biological activity of a compound is determined by its
chemical structure. By analyzing the structure and
activity of known compounds, QSAR models can
predict the biological activity of new or untested
compounds.
• Steps in QSAR Development
• 1. Data Collection
• Chemical Structures: Obtain the structures of a series of related
compounds.
• Biological Activity: Experimentally measure the biological activity
of these compounds (e.g., IC50, EC50, Ki).
• 2. Descriptor Calculation
• Descriptors are numerical values that represent the structural,
physicochemical, or electronic properties of a compound. These
include:
– Physicochemical Descriptors: Molecular weight, hydrophobicity
(logP), polarizability.
– Electronic Descriptors: Electron density, partial charges.
– Geometrical Descriptors: 3D shape, molecular volume.
– Topological Descriptors: Connectivity indices, molecular branching.
– Quantum Descriptors: HOMO-LUMO gap, dipole moment.
• 3. Data Preprocessing
• Normalize the data for consistency.
• Remove irrelevant or redundant descriptors.
• Ensure the dataset is balanced and unbiased.
• Model Building
• Use statistical or machine learning techniques to correlate
descriptors with biological activity:
– Linear Regression: Models the relationship using a straight-line
equation.
– Partial Least Squares (PLS): Reduces dimensionality while modeling.
– Support Vector Machines (SVM) and Neural Networks: Non-linear
methods for complex datasets.
– k-Nearest Neighbors (k-NN): Predicts activity based on similar
compounds.
• 5. Model Validation
• Split data into training and test sets to ensure the model is
generalizable.
• Use techniques like cross-validation and external validation.
• Metrics to evaluate performance:
– R² (coefficient of determination): Indicates how well the model fits the
data.
– RMSE (root mean square error): Measures prediction accuracy.
– Q² (predictive ability): Validates model performance on new data.
• 6. Predictions
• Apply the QSAR model to predict the biological activity of untested
compounds.
• Screen large libraries to identify potential leads for further testing.
• Applications of QSAR
• 1. Drug Discovery
• Predicts the activity of new drug candidates.
• Guides chemical modifications to enhance efficacy or
reduce toxicity.
• 2. Toxicology
• Identifies potentially harmful effects of chemicals.
• Reduces the need for animal testing by predicting
toxicity in silico.
• 3. Environmental Chemistry
• Assesses the environmental impact of industrial
chemicals (e.g., bioaccumulation, degradation).
• 4. Material Science
• Optimizes the properties of materials, such as
polymers or dyes.
• QSAR Categories
• 1D-QSAR:
– Uses simple properties like molecular weight,
logP, or polarizability.
• 2D-QSAR:
– Incorporates topological and connectivity indices
of molecules.
• 3D-QSAR:
– Uses 3D molecular structures and spatial
features (e.g., CoMFA – Comparative Molecular
Field Analysis).
• QSAR in Practice
• Software for QSAR Modeling:
• MOE (Molecular Operating Environment):
– Integrates QSAR with other computational tools like
docking.
• ChemOffice:
– Calculates descriptors and performs regression
analysis.
• OECD QSAR Toolbox:
– Focuses on environmental and toxicological QSAR.
• AutoQSAR:
– Automates QSAR model development and validation.
Deriving the Pharmacophoric
Pattern
• A pharmacophoric pattern (or pharmacophore) is the
spatial arrangement of essential molecular features
required for a compound to interact with a specific
biological target and produce a desired biological
effect. Deriving a pharmacophore is a critical step in
drug discovery, as it defines the "blueprint" for
designing new drug molecules with improved efficacy
and selectivity.
• What is a Pharmacophore?
• According to the International Union of Pure and
Applied Chemistry (IUPAC), a pharmacophore is:
• “An ensemble of steric and electronic features that is
necessary to ensure the optimal interactions with a
specific biological target to trigger or block its
biological response.”
• Key Molecular Features of a Pharmacophore:
• Hydrogen Bond Donors (HBD):
– Atoms or groups that donate hydrogen bonds (e.g., –OH, –NH
groups).
• Hydrogen Bond Acceptors (HBA):
– Atoms or groups that accept hydrogen bonds (e.g., carbonyl
oxygens, ethers).
• Hydrophobic Regions:
– Nonpolar areas that interact with hydrophobic regions of the
target.
• Aromatic Rings:
– Planar ring systems that participate in stacking interactions (π-π
interactions).
• Positive/Ionic Groups:
– Charged groups that form ionic bonds with oppositely charged
residues.
• Negative/Ionic Groups:
– Charged groups that interact electrostatically.
• Steps to Derive a Pharmacophoric Pattern
• 1. Data Collection
• Gather information about a set of active compounds known to bind to the biological
target.
• Optionally, include inactive compounds to help identify features essential for activity.
• 2. Structural Alignment
• Align the 3D structures of active compounds to identify common features responsible
for their activity.
– Rigid Alignment: Superimposes compounds without altering their conformations.
– Flexible Alignment: Allows conformational changes to better fit compounds together.
• 3. Feature Identification
• Identify the molecular features common to all active compounds:
– Hydrogen bond donors/acceptors.
– Aromatic rings.
– Hydrophobic pockets.
– Charged groups.
• Use software tools like MOE, Schrödinger Phase, or LigandScout to extract these
features.
• 4. Define the Pharmacophore
• Represent the pharmacophore as a 3D model with:
– Spatial positions of features.
– Inter-feature distances.
– Angles and constraints for flexibility.
• 5. Validate the Pharmacophore
• Test the pharmacophore against a database of known active and inactive
compounds:
– Active compounds should match the pharmacophore.
– Inactive compounds should not match.
• Refine the pharmacophore if needed by adjusting feature constraints or
adding/removing features.
• Methods to Derive Pharmacophores
• 1. Ligand-Based Pharmacophore Modeling
• Based solely on the structures of known active compounds.
• Identifies shared features among active ligands, independent of the target
structure.
• 2. Structure-Based Pharmacophore Modeling
• Uses the 3D structure of the biological target (e.g., a protein receptor) to
derive the pharmacophore.
• Identifies critical interactions between the ligand and the target, such as:
– Hydrogen bonds.
– Hydrophobic pockets.
– Salt bridges.
• Often used when crystal structures of the target are available (e.g., from
PDB)
• Applications of Pharmacophore Modeling
• 1. Virtual Screening
• Searches compound libraries for molecules that match
the pharmacophore.
• Identifies potential lead compounds without requiring
physical testing.
• 2. Lead Optimization
• Guides chemical modifications to improve activity,
selectivity, or pharmacokinetics.
• 3. Drug Repurposing
• Identifies existing drugs that match a pharmacophore
for a new target.
• 4. Mechanism Elucidation
• Explains why certain compounds are active or inactive
based on their fit to the pharmacophore.
• Tools for Deriving Pharmacophores
• LigandScout:
– Extracts pharmacophores from ligand-receptor complexes.
– Offers user-friendly visualization tools.
• Schrödinger Phase:
– Provides advanced ligand-based and structure-based pharmacophore
modeling.
• Discovery Studio:
– Features robust algorithms for pharmacophore generation and
validation.
• MOE (Molecular Operating Environment):
– Integrates pharmacophore modeling with docking, QSAR, and other
drug discovery tools.
• PharmaGist:
– A free, web-based tool for automatic pharmacophore detection from
ligands.
• Advantages of Pharmacophore
Modeling
• Cost-Effective:
– Reduces the need for expensive experimental
techniques.
• High Throughput:
– Screens thousands of compounds quickly.
• Target Agnostic:
– Can be applied even when the receptor
structure is unknown (ligand-based).
Receptor Mapping
• Receptor mapping is a computational and experimental
technique used to identify the binding site(s) and essential
interactions of a ligand with a receptor (typically a protein).
The primary goal of receptor mapping is to understand the
structural and functional features of the receptor that are
responsible for ligand binding and biological activity. This
knowledge is crucial for drug discovery and design.
• Receptor mapping involves:
• Identifying the binding pocket or active site on the receptor.
• Characterizing the chemical and structural features of the
receptor that are essential for binding.
• Establishing a correlation between these features and the
biological activity of ligands.
• Importance of Receptor Mapping
• Drug Discovery:Helps design ligands that
optimally interact with the receptor.
• Facilitates identification of druggable binding
sites.
• Understanding Mechanism of
Action:Reveals how drugs or endogenous
molecules interact with their targets.
• Selectivity and Specificity:Guides the
development of selective drugs that target
specific receptors, minimizing side effects.
• Methods for Receptor Mapping
• 1. Experimental Methods
• These involve physical and chemical techniques to determine receptor-
ligand interactions.
• Site-Directed Mutagenesis:
– Mutate specific amino acids in the receptor to determine their role in ligand binding.
– Example: Replacing a polar residue in the active site with a hydrophobic one to
assess its impact on binding affinity.
• X-Ray Crystallography:
– Provides a high-resolution 3D structure of the receptor.
– Visualizes the exact binding pose of a ligand.
• NMR Spectroscopy:
– Identifies interactions between the receptor and ligand in solution.
– Useful for studying dynamic and flexible binding sites.
• Cryo-Electron Microscopy (Cryo-EM):
– Captures the structure of large protein complexes, including receptor-ligand
interactions, at near-atomic resolution.
• Affinity Labeling:
– Chemically modify ligands to react covalently with the receptor at the binding site.
– Identifies the residues involved in binding.
• 2. Computational Methods
• Computational receptor mapping involves analyzing the receptor's structure
and simulating ligand interactions.
• Molecular Docking:
– Predicts the binding pose of a ligand within the receptor's active site.
– Tools: AutoDock, Glide, MOE.
• Molecular Dynamics (MD) Simulations:
– Simulates the behavior of the receptor-ligand complex over time.
– Identifies key interactions and receptor flexibility.
• Binding Pocket Detection:
– Algorithms identify cavities or pockets on the receptor surface that are likely binding
sites.
– Tools: CASTp, Fpocket.
• Energy-Based Mapping:
– Calculates interaction energies between different parts of the receptor and the ligand.
– Helps identify hotspots for binding.
• Pharmacophore-Based Mapping:
– Derives a pharmacophore (ensemble of features) from the receptor's binding site.
– Tools: LigandScout, Schrödinger Phase.
• Fragment-Based Approaches:
– Maps small chemical fragments onto the receptor to identify favorable binding regions.
– Guides the design of larger ligands.
• Steps in Receptor Mapping
• 1. Receptor Preparation
• Obtain the receptor structure (e.g., from the Protein Data Bank).
• Prepare the structure:
– Remove water molecules.
– Add missing residues or atoms.
– Assign charges and optimize geometry.
• 2. Binding Site Identification
• Analyze the receptor's surface to locate potential binding pockets.
• Use computational tools or experimental data to define the binding site.
• 3. Interaction Analysis
• Examine the chemical environment of the binding pocket:
– Hydrogen bond donors/acceptors.
– Hydrophobic regions.
– Charged residues.
• Analyze ligand interactions with these features.
• 4. Mapping
• Systematically probe the binding site using:
– Docking of various ligands.
– Fragment-based mapping.
– Computational screening.
• 5. Validation
• Validate the mapped interactions using:
– Experimental data (e.g., mutagenesis, crystallography).
– Retrospective docking of known ligands.
• Applications of Receptor Mapping
• 1. Drug Discovery and Design
• Helps design ligands that fit the binding site with high affinity
and specificity.
• 2. Identifying Allosteric Sites
• Maps alternative binding sites that modulate receptor activity,
useful for developing allosteric modulators.
• 3. Virtual Screening
• Screens chemical libraries for molecules that interact with the
receptor’s mapped features.
• 4. Predicting Drug Resistance
• Identifies mutations in the receptor that may alter ligand
binding.
• 5. Biomarker Identification
• Maps interactions of endogenous molecules, aiding in
understanding disease mechanisms.
Estimating Biological Activities
• Biological activity estimation involves predicting how a
compound interacts with a biological target and the
resulting effect. This process is key for identifying
potential drug candidates and optimizing their efficacy.
• 1. Definition of Biological Activity
• Biological Activity: The effect a compound has on a
biological system, often quantified by parameters like:
– IC50 (Half-maximal Inhibitory Concentration): The concentration of a
compound required to inhibit a biological process by 50%.
– EC50 (Half-maximal Effective Concentration): The concentration
needed to produce 50% of the maximal biological effect.
– Ki (Inhibition Constant): A measure of binding affinity for
an enzyme or receptor.
• Methods for Estimating Biological Activity
• A. Experimental Approaches
• High-Throughput Screening (HTS):
– Tests large libraries of compounds against a target in vitro.
– Measures activity through assays such as enzymatic
activity or cell viability.
• Dose-Response Curves:
– Generates a curve by testing various concentrations of a
compound.
– Determines potency (IC50 or EC50).
• In Vivo Studies:
– Measures biological effects in animal models to estimate
activity and toxicity.
• Biophysical Techniques:
– Surface Plasmon Resonance (SPR) or Isothermal
Titration Calorimetry (ITC) directly measures binding
affinities.
• B. Computational Approaches
• Quantitative Structure-Activity Relationship (QSAR):
– Correlates chemical structure with biological activity using statistical
models.
– Predicts activity for untested compounds based on structural
descriptors.
• Molecular Docking:
– Simulates binding of a ligand to a receptor and estimates binding
energy.
– Provides insights into the strength of interaction and activity.
• Pharmacophore Modeling:
– Identifies essential features responsible for activity and screens libraries
for matching compounds.
• Machine Learning Models:
– Use algorithms (e.g., random forests, neural networks) trained on
datasets of known activity.
– Predict biological activity for new compounds.
• ADMET Prediction:
– Evaluates Absorption, Distribution, Metabolism, Excretion, and Toxicity
to estimate in vivo activity.
Ligand-Receptor Interactions
• Basics of Ligand-Receptor Interactions
• Ligand: A molecule (e.g., drug, hormone) that binds to a receptor to
produce a biological effect.
• Receptor: A macromolecule (e.g., protein, DNA) that specifically binds
ligands.
• Interaction Types:
– Orthosteric Binding: Ligand binds directly to the active site.
– Allosteric Binding: Ligand binds to a different site, modulating the receptor's activity.
• 2. Key Forces Governing Ligand-Receptor Binding
• Hydrogen Bonds: Between donor and acceptor atoms (e.g., N-H, O-H
groups).
• Electrostatic Interactions: Between charged groups (e.g., carboxylate-
anion and ammonium-cation).
• Hydrophobic Interactions: Nonpolar groups interacting with hydrophobic
receptor regions.
• Van der Waals Forces: Weak, short-range attractions due to induced
dipoles.
• π-Stacking: Aromatic rings interact via π-electron clouds.
• Methods to Study Ligand-Receptor Interactions
• A. Experimental Techniques
• X-Ray Crystallography:
– Provides high-resolution structures of ligand-receptor
complexes.
– Visualizes binding poses and interactions.
• Nuclear Magnetic Resonance (NMR):
– Detects changes in chemical shifts upon ligand binding.
– Suitable for studying dynamics and weak interactions.
• Cryo-Electron Microscopy (Cryo-EM):
– Determines structures of large, dynamic receptor-ligand
complexes.
• Surface Plasmon Resonance (SPR):
– Measures real-time binding kinetics and affinity constants.
• Thermal Shift Assays:
– Detects binding-induced changes in receptor stability.
• B. Computational Techniques
• Molecular Docking:
– Predicts the binding pose and energy of a ligand within the
receptor’s active site.
– Tools: AutoDock, Glide, MOE.
• Molecular Dynamics (MD) Simulations:
– Simulates the behavior of ligand-receptor complexes over time.
– Captures dynamic and flexible interactions.
• Free Energy Calculations:
– Estimates binding free energy using methods like MM-PBSA or
FEP.
– Provides quantitative insights into binding affinity.
• Binding Site Analysis:
– Identifies key residues and properties of the receptor binding
pocket.
– Tools: CASTp, Fpocket.
Applications of Estimating
Biological Activities and Ligand-
Receptor Interactions
1. Drug Discovery and Development
• Identifies compounds with high activity and specificity for
therapeutic targets.
• Optimizes lead compounds to enhance binding affinity and
efficacy.
• 2. Mechanistic Studies
• Explores how ligands modulate receptor function (agonists,
antagonists, allosteric modulators).
• 3. Toxicity Prediction
• Identifies off-target interactions that may cause adverse
effects.
• 4. Personalized Medicine
• Predicts activity for patient-specific receptors or mutations.
• Tools for Biological Activity and Interaction
Analysis
• AutoDock and AutoDock Vina:
– Docking and interaction energy estimation.
• Schrödinger Suite (Glide, Maestro):
– High-accuracy docking and activity prediction.
• MOE (Molecular Operating Environment):
– Integrated tools for activity modeling and interaction
analysis.
• DeepChem:
– Machine learning-based biological activity prediction.
• LigandScout:
– Pharmacophore and interaction feature extraction.
Docking software (AUTODOCK,
HEX)
• AutoDock
• Overview
• AutoDock is one of the most widely used docking programs for predicting how small
molecules (ligands) bind to a receptor (protein or DNA). It is developed and maintained by
the Scripps Research Institute.
• Features
• Flexible Docking:
– Allows flexibility in ligand and receptor (partial flexibility for receptors) during docking.
• Scoring Function:
– Based on a semi-empirical free energy force field that estimates binding affinity.
• Search Algorithms:
– Supports multiple search algorithms like Genetic Algorithm (GA), Simulated Annealing (SA), and
Lamarckian Genetic Algorithm (LGA).
• Visualization Tools:
– Results can be visualized using tools like PyMOL, AutoDockTools (ADT), or Chimera.
• Applications
• Protein-ligand docking.
• Protein-protein docking (to some extent).
• Virtual screening for drug discovery.
• Study of receptor-ligand interaction dynamics.
• Workflow
• Receptor Preparation:
– Add hydrogens and assign charges using
AutoDockTools.
– Specify the binding site using a grid box.
• Ligand Preparation:
– Assign torsional flexibility and charge.
• Docking Simulation:
– Run docking using selected search parameters
and algorithms.
• Analyze Results:
– Evaluate binding poses and energies.
• Advantages
• Well-documented and user-friendly for beginners.
• Free and open-source.
• Supports flexible ligand and partially flexible receptor docking.
• Provides detailed output on binding affinities and interaction
energy.
• Limitations
• Relatively slower compared to newer tools.
• Limited support for complete receptor flexibility.
• Cannot handle large receptor systems efficiently.
• Versions
• AutoDock 4: Traditional docking with detailed ligand-receptor
interaction analysis.
• AutoDock Vina: Faster version with an improved scoring
function, suitable for high-throughput screening.
• 2. HEX
• Overview
• HEX is a molecular docking and molecular superposition program. It
specializes in protein-protein docking but also supports ligand-receptor
docking.
• Features
• FFT-Based Docking:
– Uses Fast Fourier Transform (FFT) to calculate docking solutions efficiently in 3D.
• Scoring Function:
– Combines shape complementarity and electrostatic potential for docking.
• Visualization:
– Includes built-in tools for viewing docking results.
• Supports Large Systems:
– Efficiently handles large receptor and ligand systems.
• Speed:
– Optimized for fast docking, especially for rigid-body systems.
• Applications
• Protein-protein docking.
• Protein-DNA/RNA docking.
• Docking of large complexes.
• Virtual screening for small molecules.
• Workflow
• Prepare Molecules:
– Load the receptor and ligand structures in supported formats (e.g., PDB).
• Set Parameters:
– Define interaction parameters, such as distance constraints and rotation angles.
• Run Docking:
– Perform rigid-body docking or flexible docking simulations.
• Analyze Results:
– Rank solutions based on the docking score and visualize interactions.
• Advantages
• Extremely fast due to FFT-based calculations.
• Well-suited for large systems and protein-protein docking.
• Built-in visualization reduces dependency on external tools.
• Limitations
• Limited ligand flexibility during docking.
• Scoring function is less detailed compared to other docking software.
• Less commonly used for small molecule docking compared to AutoDock.
• Licensing
• HEX is freely available for academic use but requires a license for
commercial purposes.
• Applications in Drug Discovery
• AutoDock Applications
• Docking small molecules to protein active
sites.
• Virtual screening of large chemical libraries.
• Studying mutations and their impact on
binding.
• HEX Applications
• Mapping protein-protein interaction sites.
• Investigating protein-DNA/RNA interactions.
• Analyzing large biomolecular assemblies.
Detecting Functional Sites in the
Prokaryotic and Eukaryotic
Genomes (promoters,
transcription factor
binding sites, translation initiation
sites), Integrated Gene Parsing,
finding RNA Genes, Web based
tools
(GENSCAN, GRAIL,
GENEFINDER)
Introduction
 Less than 2% of vertebrate genomes code for proteins.
 Currently available computational methods are not yet
powerful enough to elucidate precisely the gene
structure from a large-scale genomic sequence.
Therefore, gene prediction programs rely on factors
such as compositional bias found in protein-coding
regions, as well as similarity with known coding
sequences.
 This chapter briefly reviews some of the computational
methods underlying most computational gene finders,
then focuses on a number of the most commonly used
publicly available methods.
Gene Prediction Methods
• Gene-finding methods predict the location of genes in
genomic sequences through a combination of one or
more of the following approaches:
• Intrinsic or template gene prediction (predicting gene
structure without direct comparison to other sequences):
– Searching by signal
– Searching by content
• Extrinsic or look-up gene prediction (predicting gene
structure by direct comparison to known sequences):
– Homology-based gene prediction
– Comparative gene prediction
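The "searching by content" idea can be sketched in a few lines: flag windows whose base composition looks coding-like. This toy uses GC fraction only (real gene finders rely on codon usage and hexamer statistics); the function name and threshold are illustrative.

```python
# "Searching by content" in miniature: slide a window along the sequence and
# flag windows whose GC fraction exceeds a threshold, a crude stand-in for
# the composition statistics real gene finders use.

def gc_rich_windows(dna, win=6, threshold=0.6):
    hits = []
    for i in range(len(dna) - win + 1):
        gc = sum(b in "GC" for b in dna[i:i+win]) / win
        if gc >= threshold:
            hits.append((i, round(gc, 2)))
    return hits

# AT-rich flanks around a GC-rich core: only the core windows are flagged
print(gc_rich_windows("ATATGCGCGCATAT"))
```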
Gene Prediction Methods (cont.)
 Prokaryotic genes vs. eukaryotic genes
 Prokaryotic genes are usually found adjacent
to each other as Open Reading Frames
(ORFs).
 In eukaryotic genes, the coding exons are often
separated by long stretches of non-coding introns,
and the genes themselves by intergenic regions.
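The view of prokaryotic genes as ORFs can be made concrete with a minimal six-frame scan (function names are illustrative; real prokaryotic finders such as Prodigal add start-site and composition models):

```python
# Minimal six-frame ORF scan (an illustrative sketch, not a gene finder).
# Greedy: each reported ORF runs from an ATG to the next in-frame stop,
# and scanning resumes after that stop.
START, STOPS = "ATG", {"TAA", "TAG", "TGA"}

def revcomp(seq):
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))

def find_orfs(seq, min_len=6):
    """Return (frame, start, end, strand); end is past the stop codon."""
    orfs = []
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for frame in range(3):
            i = frame
            while i <= len(s) - 3:
                if s[i:i+3] == START:
                    for j in range(i + 3, len(s) - 2, 3):
                        if s[j:j+3] in STOPS:
                            if j + 3 - i >= min_len:
                                orfs.append((frame, i, j + 3, strand))
                            i = j  # resume after this ORF
                            break
                i += 3
    return orfs

# Example: ATG GCA TAA on the forward strand
print(find_orfs("ATGGCATAA"))
```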
• Promoters
• Definition: Promoters are DNA sequences upstream of genes that
initiate transcription by recruiting RNA polymerase and transcription
factors.
• Prokaryotic Promoters:
• Contain conserved regions such as:
– -10 box (Pribnow box): TATAAT sequence.
– -35 box: TTGACA sequence.
• Detection:
– Computational: Algorithms search for conserved motifs (e.g., BPROM or
neural-network-based predictors).
– Experimental: Techniques like DNase I footprinting and promoter reporter
assays.
• Eukaryotic Promoters:
• Complex and diverse, containing elements such as:
– Core promoter: Includes TATA box, initiator (Inr), and downstream promoter
element (DPE).
– Enhancers/silencers: Influence promoter activity from distant regions.
• Detection:
– Use tools like PromoterScan or TSSFinder.
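The prokaryotic promoter signals above (-35 box, spacer, -10 box) can be searched for with plain consensus matching. A minimal sketch, assuming a 15-19 bp spacer and at most one mismatch per box (real tools such as BPROM use trained statistical models):

```python
# Naive prokaryotic promoter scan: a -35 box (TTGACA) followed 15-19 bp
# later by a -10 Pribnow box (TATAAT), each allowing limited mismatches.

def mismatches(a, b):
    return sum(x != y for x, y in zip(a, b))

def find_promoters(seq, max_mm=1):
    hits = []
    for i in range(len(seq) - 6):
        if mismatches(seq[i:i+6], "TTGACA") <= max_mm:
            for spacer in range(15, 20):
                j = i + 6 + spacer          # start of candidate -10 box
                if j + 6 <= len(seq) and mismatches(seq[j:j+6], "TATAAT") <= max_mm:
                    hits.append((i, j, spacer))
    return hits

seq = "GG" + "TTGACA" + "T" * 17 + "TATAAT" + "GGAGG"
print(find_promoters(seq))  # → [(2, 25, 17)]
```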
• Transcription Factor Binding Sites (TFBS)
• Definition: Short DNA motifs where transcription factors bind to
regulate transcription.
• Prokaryotic TFBS:
• Simple regulatory systems often involve operons.
• Examples: Repressor binding to operators (e.g., lac operon).
• Eukaryotic TFBS:
• More complex and include enhancers, silencers, and insulators.
• Can occur near or far from the gene being regulated.
• Detection:
• Sequence-Based Methods:
– Position Weight Matrices (PWMs): Scans sequences for motif
likelihood.
– Tools like FIMO (Find Individual Motif Occurrences).
• Experimental Methods:
– ChIP-seq: Identifies DNA fragments bound to transcription factors.
– SELEX: Determines sequence preferences of TFs.
• Databases:
– JASPAR and TRANSFAC provide known TFBS motifs.
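The PWM scanning idea mentioned above can be shown end to end with a toy motif. The aligned sites below are invented; tools like FIMO apply the same log-odds scoring with curated JASPAR/TRANSFAC matrices:

```python
# Minimal position weight matrix (PWM): build log2-odds scores from aligned
# binding sites (with a +1 pseudocount, uniform background) and slide the
# matrix over a sequence to find the best-scoring window.
import math

sites = ["TGTGA", "TGAGA", "TGTGA", "TTTGA"]  # hypothetical aligned TFBS
bases = "ACGT"
n = len(sites)

pwm = []
for pos in range(len(sites[0])):
    col = [s[pos] for s in sites]
    pwm.append({b: math.log2((col.count(b) + 1) / (n + 4) / 0.25) for b in bases})

def score(window):
    return sum(pwm[i][b] for i, b in enumerate(window))

def best_hit(seq):
    w = len(pwm)
    return max(range(len(seq) - w + 1), key=lambda i: score(seq[i:i+w]))

seq = "CCCTGTGACCC"
i = best_hit(seq)
print(i, seq[i:i+5])  # → 3 TGTGA
```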
• Translation Initiation Sites (TIS)
• Definition: Sites where ribosomes initiate protein synthesis, typically
around the start codon (AUG).
• Prokaryotic TIS:
• Shine-Dalgarno sequence upstream of the start codon base-pairs
with the 3' end of the 16S rRNA.
• Distance between Shine-Dalgarno and start codon is critical for
initiation.
• Eukaryotic TIS:
• Ribosome scans from the 5' cap of mRNA until it encounters the
Kozak sequence (e.g., ACCAUGG).
• No Shine-Dalgarno sequence; start codon recognition depends on the
surrounding Kozak context and can be affected by mRNA secondary structure.
• Detection:
• Computational Tools:
– Use gene prediction tools like AUGUSTUS or TIS databases.
• Experimental Approaches:
– Ribosome Profiling: Identifies ribosome-protected mRNA fragments.
– Reporter Assays: Confirm functional translation from predicted TIS.
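The eukaryotic scanning model above can be caricatured in code: rank AUG codons by a simplified Kozak context (purine at -3, G at +4). This is a toy with invented scoring; real TIS predictors use trained models:

```python
# Toy translation-initiation scan for a eukaryotic mRNA: score each AUG by
# a simplified Kozak context and, on ties, prefer the 5'-most AUG (mimicking
# ribosome scanning from the cap).

def kozak_score(mrna, i):
    """Score the AUG at position i: +1 for A/G at -3, +1 for G at +4."""
    score = 0
    if i >= 3 and mrna[i-3] in "AG":
        score += 1
    if i + 3 < len(mrna) and mrna[i+3] == "G":
        score += 1
    return score

def best_start(mrna):
    augs = [i for i in range(len(mrna) - 2) if mrna[i:i+3] == "AUG"]
    return max(augs, key=lambda i: (kozak_score(mrna, i), -i)) if augs else None

# The second AUG sits in a stronger (G...AUGG) context than the first
mrna = "GGCAUGUUUGCCAUGG"
print(best_start(mrna))  # → 12
```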
• Significance
• Detecting functional sites allows
understanding of gene regulation and
evolutionary mechanisms.
• Useful in biotechnology for engineering
gene expression and in medicine for
identifying mutations linked to diseases.
Integrated Gene Parsing
• Integrated gene parsing involves analyzing genome sequences to
identify and annotate genes and their structural components such as
exons, introns, regulatory regions, and functional sites. It combines
computational tools and biological data for accurate genome
interpretation.
• Key Steps in Integrated Gene Parsing
• 1. Gene Structure Identification
• Components:
– Exons (coding regions).
– Introns (non-coding regions).
– UTRs (untranslated regions at 5' and 3' ends).
• Approach:
– Ab Initio Methods: Use statistical models to predict gene structures
based solely on sequence data (e.g., AUGUSTUS, GENSCAN).
– Homology-Based Methods: Compare sequences to known genes
using BLAST or alignment tools like ClustalW.
• 2. Promoter and Regulatory Element Detection
• Identifies transcription start sites and upstream regulatory regions.
• Tools: MEME for motif discovery, or JASPAR for known regulatory
element databases.
• 3. Coding Sequence (CDS) Parsing
• Locates regions translated into proteins.
• Starts with a start codon (AUG) and ends with a stop codon (UAA,
UAG, UGA).
• Tools: GeneMark, Prodigal (prokaryotes), or AUGUSTUS
(eukaryotes).
• 4. Splicing Site Detection
• Identifies intron-exon boundaries in eukaryotic genomes.
• Signals:
– Donor sites (5' splice site): GT.
– Acceptor sites (3' splice site): AG.
• Tools: SplicePredictor or RNA-seq data for experimental validation.
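The GT/AG signals above give a first-pass filter for intron candidates. A minimal sketch of the canonical GT-AG rule (hypothetical function name; real splice-site predictors score a much wider sequence context):

```python
# Naive GT-AG rule: list candidate introns as spans that begin with the
# donor dinucleotide GT and end with the acceptor dinucleotide AG.

def candidate_introns(dna, min_len=4):
    cands = []
    for i in range(len(dna) - 1):
        if dna[i:i+2] == "GT":                       # donor (5' splice site)
            for j in range(i + min_len - 2, len(dna) - 1):
                if dna[j:j+2] == "AG":               # acceptor (3' splice site)
                    cands.append((i, j + 2))         # half-open [start, end)
    return cands

# One candidate intron: GTCCCAG
print(candidate_introns("AAGTCCCAGTT"))  # → [(2, 9)]
```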
• 5. Functional Annotation
• Links parsed gene regions to biological functions.
• Uses functional databases like UniProt, Pfam, or KEGG.
• Integration Across Tools and Data
• Combines multiple datasets (e.g., sequence data,
RNA-seq, and proteomics) for comprehensive
annotation.
• Uses integrated platforms like Ensembl, UCSC
Genome Browser, or NCBI GenBank for
visualization and validation.
• Applications
• Genome annotation in research and
biotechnology.
• Identifying genes linked to diseases.
• Understanding evolutionary biology.
Finding RNA Genes
• RNA genes encode non-coding RNAs (ncRNAs) that perform structural,
regulatory, or catalytic roles instead of being translated into proteins.
Examples include rRNA, tRNA, miRNA, and lncRNA. Identifying RNA genes
is challenging due to their lack of protein-coding characteristics (like open
reading frames).
• Key RNA Gene Types and Features
• rRNA (Ribosomal RNA):
– Forms the structural and functional core of ribosomes.
– Highly conserved sequences across species.
• tRNA (Transfer RNA):
– Small RNA molecules that carry amino acids to ribosomes during translation.
– Contains conserved cloverleaf secondary structures.
• miRNA (MicroRNA):
– Regulates gene expression by binding to mRNA targets.
– Small (~22 nt), processed from hairpin precursors.
• lncRNA (Long Non-Coding RNA):
– Involved in gene regulation, chromatin remodeling, and other cellular processes.
– Often lacks sequence conservation but shows expression-level significance.
• Steps for Finding RNA Genes
• 1. Sequence-Based Approaches
• Conserved Motifs: Search for conserved RNA motifs or
sequences in genomic data.
– Databases: Rfam or SILVA for rRNA, tRNA databases for
transfer RNAs.
• Gene Prediction Tools:
– Infernal: Searches for RNA genes using covariance models
(CMs) from Rfam.
– tRNAscan-SE: Specifically predicts tRNA genes.
– RNAz: Predicts functional ncRNAs based on thermodynamic
stability.
• 2. Secondary Structure Prediction
• RNA genes often form characteristic secondary structures like
hairpins or loops.
• Tools:
– RNAfold: Predicts minimum free-energy RNA secondary
structures.
– CMfinder: Identifies RNA motifs using secondary structures.
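The hairpin motif central to RNA secondary structure can be detected naively as an inverted repeat: a stem whose two halves are reverse complements, separated by a short loop. This sketch ignores energetics entirely (RNAfold minimizes free energy instead); parameters are illustrative:

```python
# Naive hairpin detection: find a stem (inverted repeat) separated by a
# short loop, the hallmark of many RNA secondary structures.
PAIR = {"A": "U", "U": "A", "G": "C", "C": "G"}

def find_hairpins(rna, stem=4, loop=(3, 8)):
    hits = []
    for i in range(len(rna) - 2*stem - loop[0] + 1):
        left = rna[i:i+stem]
        for L in range(loop[0], loop[1] + 1):
            j = i + stem + L
            right = rna[j:j+stem]
            # left must pair with right read 3'->5' (Watson-Crick only here)
            if len(right) == stem and all(
                PAIR[a] == b for a, b in zip(left, reversed(right))
            ):
                hits.append((i, L, j + stem))  # (start, loop length, end)
    return hits

# GCGC stem, 4-nt AAAA loop, GCGC closing stem
print(find_hairpins("GCGCAAAAGCGC"))  # → [(0, 4, 12)]
```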
• 3. Comparative Genomics
• Phylogenetic Conservation: RNA genes are often conserved in sequence
and structure.
• Tools like BLASTN, ClustalW, or Mauve alignments help find homologous
RNA genes.
• 4. Expression-Based Approaches
• RNA-seq data provides insights into actively transcribed RNA genes.
– Align reads to the genome and filter for non-coding RNAs using specialized pipelines.
– Tools: StringTie, Cufflinks for transcriptome assembly and annotation.
• 5. Experimental Validation
• Confirm RNA gene predictions through:
– Northern Blotting: Detect specific RNA molecules.
– RT-PCR: Amplify and quantify RNA expression.
– Ribosome Profiling: Identify untranslated RNA regions.
• Applications
• Understanding gene regulation networks.
• Identifying biomarkers for diseases.
• Exploring evolutionary relationships through conserved RNA genes.
• By combining computational and experimental methods, researchers can
effectively identify and annotate RNA genes across diverse genomes.
GENSCAN:
• GENSCAN is a computational program used to identify gene
structures in genomic sequences, specifically for eukaryotic
organisms. It predicts locations and structures of protein-
coding genes based on sequence data and statistical models.
• GENSCAN uses Hidden Markov Models (HMMs) to predict
gene features, such as exons, introns, start/stop codons, and
intergenic regions. It combines sequence features with
statistical probabilities to infer the most likely gene structures.
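The HMM idea behind GENSCAN can be shown with a toy two-state model (coding vs. non-coding) decoded by the Viterbi algorithm. All probabilities below are invented for illustration; GENSCAN's actual model is a far richer semi-Markov HMM with explicit exon, intron, and signal states:

```python
# Toy two-state HMM decoded with Viterbi: a GC-favoring "coding" state (C)
# vs. an AT-favoring "non-coding" state (N), with sticky transitions.
import math

states = ("N", "C")
start = {"N": 0.7, "C": 0.3}
trans = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
emit = {
    "N": {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2},
    "C": {"A": 0.2, "T": 0.2, "G": 0.3, "C": 0.3},
}

def viterbi(seq):
    V = [{s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}]
    back = []
    for b in seq[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + math.log(trans[p][s]))
            row[s] = V[-1][prev] + math.log(trans[prev][s]) + math.log(emit[s][b])
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):      # trace the best path backwards
        path.append(ptr[path[-1]])
    return "".join(reversed(path))

print(viterbi("ATATGCGCGC"))  # AT-rich prefix labeled N, GC-rich suffix C
```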
• 1. Input Requirements
• Genomic Sequence: The tool requires input as raw DNA
sequences (both coding and non-coding regions).
• Strand Information: It can analyze both strands of the DNA.
• 2. Gene Prediction Components
• GENSCAN predicts the following features in a given
sequence:
• Exons:
– Coding regions divided into:
• Initial Exons: Extend from the start codon to the first donor (5') splice site.
• Internal Exons: Flanked by splice sites on both ends.
• Terminal Exons: Extend from an acceptor (3') splice site to the stop codon.
• Introns:
– Non-coding regions between exons.
– GENSCAN identifies splice donor (GT) and acceptor (AG) sites.
• Promoter Regions:
– Indicates transcription start sites.
• Poly-A Signal:
– Recognizes regions that signal transcription termination in
eukaryotes.
• Intergenic Regions:
– Non-coding sequences between genes.
• 3. Outputs
• GENSCAN provides detailed predictions for the following:
• Gene Locations: Start and end positions of predicted genes.
• Exon-Intron Structure: Exact boundaries of exons and
introns.
• Coding Sequences (CDS): Translated amino acid
sequences.
• Likelihood Scores: Statistical confidence for each prediction.
• Applications
• Genome Annotation: Identifying genes in newly sequenced
genomes.
• Comparative Genomics: Comparing gene content across
species.
• Bioinformatics Pipelines: Used as a step in automated
genome annotation tools.
Genefinder
• GeneFinder is a tool used in bioinformatics for identifying and predicting genes within DNA
sequences. It is widely used for interpreting genetic information across organisms, helping
researchers locate the coding regions (genes) responsible for producing proteins essential to
biological function. Here is an overview of GeneFinder, including its features, input, output,
and the algorithms it typically uses.
• 1. Features of GeneFinder
• GeneFinder offers several advanced features to help with gene prediction, including:
• Gene Prediction in DNA Sequences: Identifies and predicts the location of genes within
large DNA sequences by scanning for gene-encoding regions.
• Identification of Exons and Introns: Detects both exons (coding sequences) and introns
(non-coding sequences) within genes.
• Open Reading Frame (ORF) Detection: Recognizes the ORFs, which are sequences that
could potentially code for proteins.
• Multiple Species Compatibility: Adaptable to sequences from various organisms, making
it versatile in genome studies across species.
• Probability-based Prediction Models: Many GeneFinder tools incorporate probabilistic
models or machine learning approaches to improve prediction accuracy.
• Graphical Visualization: Offers visual representations of the gene structures within DNA
sequences to simplify interpretation.
• 2. Input for GeneFinder
• The input for GeneFinder usually includes:
• DNA Sequence: The primary input is a raw DNA sequence in FASTA format. This can come from sequenced
genomes of various organisms, such as human, bacterial, or plant genomes.
• Genome Information (Optional): Additional data about the organism's genome structure, such as known gene
locations, may also be provided to improve accuracy.
• Parameter Settings: Optional configurations to adjust the sensitivity or specificity of the search, depending on the
research focus.
• 3. Output of GeneFinder
• GeneFinder provides detailed outputs including:
• Predicted Genes: A list of genes identified within the DNA sequence, including information about their start and
end positions.
• Gene Structure Information: The locations of exons and introns, with potential coding regions highlighted.
• Amino Acid Sequences: Translated sequences that represent the proteins encoded by the genes found.
• Statistical Confidence Scores: Many tools also output confidence scores for each gene prediction to show the
reliability of the findings.
• Visual Representations: Often, graphical outputs show the arrangement of genes, exons, and introns on the
DNA sequence for easier interpretation.
• 4. Algorithms Used in GeneFinder
• GeneFinder typically employs several computational and statistical algorithms to predict genes, including:
• Hidden Markov Models (HMM): HMMs are commonly used in gene prediction to model the probability of gene
structures based on known patterns in the DNA sequence. It helps in accurately detecting coding and non-coding
regions.
• Dynamic Programming: Used to optimize the search for ORFs and other features by systematically breaking
down the prediction process into smaller, manageable calculations.
• Heuristic Algorithms: Some GeneFinder tools use heuristics for faster but approximate solutions, which can be
particularly useful for scanning large genomes quickly.
• Machine Learning Models: Modern tools may incorporate supervised or unsupervised learning models trained on
known gene data to predict gene locations and structures in unknown sequences.
• Position Weight Matrices (PWMs): PWMs are sometimes used to identify promoter regions, which can indicate
the start of a gene.
• Signal Detection: Techniques like splice site recognition and promoter detection are used to identify biological
signals that mark the start or end of a gene.