Module 5 notes
Class α+β
Class α/β
membrane
Membrane proteins
QUATERNARY STRUCTURE
Types of Protein Fold Prediction Tools
• 1.1. Homology Modeling (Comparative Modeling)
• Homology modeling is based on the assumption that evolutionarily related
proteins (homologs) share similar structures. If the 3D structure of a related
protein is known, the structure of the target protein can be predicted by
aligning their sequences and transferring the structural information.
• Steps:
– Find homologous sequences with known structures (template proteins).
– Align the target sequence to the template.
– Model the 3D structure of the target protein based on the template’s structure.
– Refine the model by adjusting loop regions and side-chain conformations.
• Tools:
– SWISS-MODEL: A web-based server for homology-based protein structure prediction.
– MODELLER: A widely used tool for comparative modeling.
– Phyre2: A tool for predicting protein structure by exploiting known protein structures
and sequence alignments.
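A quick check on the alignment step above is the percent identity between the aligned target and template. The sketch below is only an illustration: the function name and the two aligned sequences are made up, and gaps are assumed to be written as "-".

```python
def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percent identity over the columns where neither sequence has a gap."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_target.upper(), aligned_template.upper()):
        if a == "-" or b == "-":
            continue  # skip columns that contain a gap
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Hypothetical aligned pair, for illustration only
target = "MKT-AYIAKQR"
template = "MKTLAYLAKQK"
print(f"{percent_identity(target, template):.1f}% identity")
```

Other definitions of identity exist (for example, dividing by the full alignment length), so the number depends on the convention chosen.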
• 1.2. Ab Initio (De Novo) Prediction
• Ab initio methods predict protein structures from scratch, without relying on template
structures. These methods attempt to model the folding process based on physical and
chemical principles, using the amino acid sequence as input.
• Challenges: Ab initio methods are computationally intensive and often require large
amounts of time and resources, particularly for large proteins.
• Approach:
– Use of physical principles (e.g., molecular dynamics, energy minimization).
– Simulation of folding pathways, considering factors such as hydrogen bonding, hydrophobic
interactions, and electrostatic forces.
• Tools:
– ROSETTA: A powerful tool used for ab initio protein structure prediction.
– FOLDIT: A crowdsourced game that uses human intuition for ab initio protein folding predictions.
– I-TASSER: A hybrid method that combines threading, ab initio, and homology modeling.
• 1.3. Threading (Fold Recognition)
• Threading (also known as fold recognition) is used when a homologous protein with a
known structure cannot be found. This method involves fitting the target sequence onto a
3D structure from a database of known protein folds.
• Steps:
– The sequence of the target protein is "threaded" into a library of known protein structures (called
templates).
– The best-fit structure is selected based on energy scoring functions (similarity, geometric
compatibility).
• Tools:
– TASSER: Combines threading and ab initio approaches.
– Phyre2: Also uses threading in addition to homology modeling.
• 1.4. Machine Learning-Based Methods
• Recent advancements in artificial intelligence (AI) and
machine learning (ML) have led to the development of
prediction tools that use neural networks and deep learning to
predict protein structures.
• Approach: These methods are trained on large datasets of
protein sequences and structures to identify patterns and
relationships between sequence and fold.
• Advantages: They can make highly accurate predictions
based on patterns learned from the experimentally determined
structures in the PDB and from millions of related sequences.
• Tools:
– AlphaFold: A breakthrough tool developed by DeepMind that
uses deep learning to predict protein structures with remarkable
accuracy. Its second version, AlphaFold2, ranked first in the
CASP14 assessment in 2020.
– RoseTTAFold: A deep learning-based tool developed as an
alternative to AlphaFold, also using neural networks to predict
protein structure.
Protein Identity Based on
Composition
• Protein identity based on composition refers to analyzing a protein's amino acid
composition to infer its characteristics, structure, function, and evolutionary relationships.
This approach does not rely on the order of residues in the sequence but on the overall
proportions of amino acids in the protein.
• Key Concepts
• 1. Amino Acid Composition
• Proteins are made up of 20 standard amino acids, each with unique chemical properties.
• The relative abundance of these amino acids in a protein determines its composition.
• 2. Protein Features Derived from Composition
• Functional Class: Certain proteins (e.g., enzymes, structural proteins) show characteristic
amino acid profiles.
• Cellular Localization: Proteins targeted to specific cellular compartments (e.g., cytoplasm,
membrane) have distinct compositions.
– Example: Membrane proteins are rich in hydrophobic residues (e.g., leucine, isoleucine).
• Stability: Proteins with higher proline content tend to have greater structural rigidity.
• Evolutionary Relationships: Composition can suggest homology or evolutionary
conservation.
• Methods for Analyzing Protein Composition
• 1. Experimental Determination
• Amino Acid Analysis:
– Hydrolyze the protein into free amino acids.
– Quantify amino acids using chromatography or mass
spectrometry.
• 2. Computational Analysis
• Composition Profiling:
– Count the frequency of each amino acid in the protein
sequence.
– Tools: ExPASy ProtParam, EMBOSS Pepstats.
• Machine Learning:
– Algorithms predict properties like solubility or function
based on amino acid composition.
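Composition profiling as described above amounts to counting residues. A minimal sketch, assuming a one-letter protein sequence; the function name is illustrative, and non-standard letters are simply ignored here.

```python
from collections import Counter

STANDARD = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def composition(seq: str) -> dict[str, float]:
    """Percentage of each standard amino acid in the sequence."""
    counts = Counter(seq.upper())
    total = sum(counts[aa] for aa in STANDARD)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: 100.0 * counts[aa] / total for aa in STANDARD}


# Hypothetical peptide, for illustration only
profile = composition("MKTAYIAKQRLLLG")
for aa, pct in sorted(profile.items(), key=lambda item: -item[1])[:3]:
    print(aa, f"{pct:.1f}%")
```

The resulting 20-value vector is the kind of feature that composition-based machine learning methods take as input.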
• Applications of Protein Composition Analysis
• 1. Functional Annotation
• Predicts a protein's potential role based on its amino acid makeup.
– Example: Enzymes often have specific conserved residues required for
catalytic activity.
• 2. Cellular Localization
• Distinguishes extracellular, membrane, and intracellular proteins.
– Example: Signal peptides in secreted proteins contain hydrophobic
residues.
• 3. Structure Prediction
• Amino acid composition influences secondary and tertiary structure.
– High glycine and proline content often indicates loop or coil regions.
• 4. Protein Stability
• The Instability Index (derived from dipeptide composition) estimates whether a
protein is stable in a test tube; values below 40 predict a stable protein.
– Example: Heat-stable proteins have a higher proportion of specific
residues like arginine and lysine.
• 5. Comparative Genomics
• Compositional similarity can suggest evolutionary relationships or
functional similarities between proteins.
Physical Properties of Proteins
Based on Sequence
• Protein sequences determine their physical properties, which influence structure,
function, and interactions. By analyzing the amino acid sequence, various physical
attributes can be inferred computationally or experimentally.
• Key Physical Properties
• 1. Molecular Weight
• Calculated by summing the residue masses of the amino acids in the
sequence and adding the mass of one water molecule for the free termini.
• Importance:
– Crucial for understanding protein size and separation techniques like SDS-PAGE.
• Tools: ExPASy ProtParam.
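The molecular weight calculation can be sketched as follows. Each residue contributes its mass minus the water lost on peptide bond formation, and one water is added back for the free N- and C-termini. The table holds average residue masses in Da; the function name is illustrative, and modifications are not considered.

```python
# Average residue masses in Da (amino acid minus one water)
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # added once for the free termini


def molecular_weight(seq: str) -> float:
    """Average molecular weight (Da) of an unmodified linear peptide."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    try:
        return sum(RESIDUE_MASS[aa] for aa in seq) + WATER
    except KeyError as err:
        raise ValueError(f"non-standard residue: {err.args[0]}") from None


print(f"{molecular_weight('GG'):.2f} Da")  # glycylglycine
```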
• 2. Isoelectric Point (pI)
• The pH at which the protein has no net electrical charge.
• Calculation:
– Based on the pKa values of ionizable groups in amino acid side chains and the
N- and C-termini.
• Importance:
– Helps in protein purification (e.g., isoelectric focusing).
– Predicts solubility under different pH conditions.
• Tools: Compute pI/Mw tool on ExPASy.
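The pI calculation can be sketched with the Henderson-Hasselbalch equation: sum the fractional charge of every ionizable group at a given pH, then search for the pH where the sum is zero. The pKa values below are illustrative textbook values for free amino acids and are an assumption of this sketch; ExPASy and other tools use their own pKa sets, so results will differ slightly.

```python
# Illustrative pKa values (assumption; not the set used by any specific tool)
PKA_POSITIVE = {"K": 10.5, "R": 12.4, "H": 6.0}
PKA_NEGATIVE = {"D": 3.86, "E": 4.25, "C": 8.33, "Y": 10.07}
PKA_N_TERM = 9.69
PKA_C_TERM = 2.34


def net_charge(seq: str, ph: float) -> float:
    """Net charge of the peptide at the given pH."""
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERM))   # N-terminal amine
    charge -= 1.0 / (1.0 + 10 ** (PKA_C_TERM - ph))  # C-terminal carboxyl
    for aa in seq.upper():
        if aa in PKA_POSITIVE:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[aa]))
        elif aa in PKA_NEGATIVE:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[aa] - ph))
    return charge


def isoelectric_point(seq: str, tol: float = 1e-4) -> float:
    """Bisection on pH 0-14; net charge decreases steadily as pH rises."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(seq, mid) > 0:
            lo = mid  # still positive, so the pI lies at higher pH
        else:
            hi = mid
    return (lo + hi) / 2.0


print(f"pI of KKKK: {isoelectric_point('KKKK'):.2f}")
print(f"pI of DDDD: {isoelectric_point('DDDD'):.2f}")
```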
• 3. Hydrophobicity
• Determines the overall hydrophobic or hydrophilic nature of the
protein.
• Hydropathy Index:
– Positive values indicate hydrophobic residues (e.g., leucine, valine).
– Negative values indicate hydrophilic residues (e.g., glutamate, lysine).
• Importance:
– Predicts transmembrane regions (e.g., membrane proteins are
hydrophobic).
– Influences folding and interactions with water or other molecules.
• Tools: Kyte-Doolittle plot for hydrophobicity profiling.
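A Kyte-Doolittle profile is a sliding-window average of per-residue hydropathy values. The sketch below uses the published Kyte-Doolittle scale; the window size of 9 is an arbitrary default chosen for illustration, and the function names are not from any particular tool.

```python
# Kyte-Doolittle hydropathy scale
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq: str, window: int = 9) -> list[float]:
    """Mean hydropathy of every window of the given size along the sequence."""
    seq = seq.upper()
    if window < 1 or window > len(seq):
        raise ValueError("window must be between 1 and the sequence length")
    scores = [KD[aa] for aa in seq]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


def gravy(seq: str) -> float:
    """Grand average of hydropathy: mean value over the whole sequence."""
    return sum(KD[aa] for aa in seq.upper()) / len(seq)


# Hypothetical sequence with a hydrophobic stretch in the middle
seq = "KKDEELLIVALLVIAGLLKKDE"
profile = hydropathy_profile(seq, window=9)
peak = max(range(len(profile)), key=profile.__getitem__)
print(f"most hydrophobic window starts at residue {peak + 1}")
```

Sustained positive stretches in the profile are the candidates for transmembrane segments.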
• 4. Secondary Structure Propensity
• Sequence composition predicts alpha-helices, beta-sheets, and
random coils.
• Patterns:
– Alpha-helix: Common in sequences rich in alanine, leucine, and
glutamate.
– Beta-sheet: Favored by valine, isoleucine, and phenylalanine.
• Tools: PSIPRED or GOR for secondary structure prediction.
• 5. Solubility
• Estimated from the proportion of polar and charged residues in the
sequence.
• Importance:
– Critical for determining suitability for expression and purification in different
environments.
• Tools: SoluProt or Protein-Sol.
• 6. Stability
• Predicted by analyzing amino acid content and interactions.
• Factors:
– Disulfide Bonds: Cysteine residues form disulfide bridges that enhance stability.
– Thermodynamic Stability: Glycine and proline influence rigidity.
• Importance:
– Essential for designing proteins for industrial or therapeutic purposes.
• Tools: FoldX or iStable.
• 7. Post-Translational Modification Sites
• Specific sequences or motifs predict modifications like phosphorylation,
glycosylation, or acetylation.
• Importance:
– Influences protein activity, localization, and interactions.
• Tools: NetPhos, GlycoEP.
• 8. Charge Distribution
• The arrangement of positively (lysine, arginine) and negatively (glutamate,
aspartate) charged residues.
• Importance:
– Determines electrostatic interactions and binding with nucleic acids or other proteins.
• 9. Flexibility and Disorder
• Some proteins or regions are intrinsically disordered, enabling flexibility and
multiple interactions.
• Prediction:
– High glycine and proline content often indicates disorder.
• Tools: IUPred or DISOPRED.
• 10. Aromatic Content
• Aromatic residues (tryptophan, tyrosine, phenylalanine) influence
absorbance and fluorescence.
• Importance:
– Helps in spectroscopic studies to determine protein concentration (e.g., UV
absorbance at 280 nm).
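The use of A280 for concentration can be sketched with the Beer-Lambert law, A = ε · c · l. The molar extinction coefficient is estimated here from the tryptophan, tyrosine, and cystine counts using the commonly cited per-residue values (5500, 1490, and 125 M⁻¹ cm⁻¹); phenylalanine absorbs negligibly at 280 nm and is left out. The number of disulfide-bonded cystines cannot be read from the sequence, so it is passed in as an assumption.

```python
def extinction_coefficient_280(seq: str, n_cystines: int = 0) -> int:
    """Estimated molar extinction coefficient at 280 nm (M^-1 cm^-1)."""
    seq = seq.upper()
    return 5500 * seq.count("W") + 1490 * seq.count("Y") + 125 * n_cystines


def concentration_molar(a280: float, seq: str,
                        path_cm: float = 1.0, n_cystines: int = 0) -> float:
    """Beer-Lambert law solved for concentration: c = A / (epsilon * l)."""
    eps = extinction_coefficient_280(seq, n_cystines)
    if eps == 0:
        raise ValueError("no Trp, Tyr, or cystine: A280 cannot be used")
    return a280 / (eps * path_cm)


# Hypothetical peptide and absorbance reading
seq = "MKWTAYIAKYR"
print(extinction_coefficient_280(seq), "M^-1 cm^-1")
print(f"{concentration_molar(0.42, seq) * 1e6:.1f} micromolar")
```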
• Applications of Sequence-Based Property
Analysis
• Protein Engineering:
– Modify sequences to enhance stability, solubility, or
activity.
• Drug Design:
– Understand binding interactions for therapeutic
targeting.
• Structural Prediction:
– Predict folding patterns and functional domains.
• Biotechnology:
– Design proteins optimized for industrial or medical
applications.
Web-based software - NNPREDICT
NNPREDICT is a tool that uses artificial neural networks to predict a
protein's secondary structure—specifically the likelihood of regions forming
alpha-helices, beta-sheets, or coils. Neural networks are effective here
because they can learn complex patterns from large datasets.
• Input: It takes the amino acid sequence of a protein.
• Neural Network Analysis: The neural network is trained on known protein
structures, learning the common patterns and features associated with
secondary structures.
• Prediction: For each amino acid, NNPREDICT calculates the probability
that it is part of a helix, sheet, or coil. The prediction is made from the
query sequence alone, using the patterns the network learned from
proteins of known structure during training.
• This approach works well because certain amino acid sequences have
strong tendencies to form specific structures, and neural networks can
capture these tendencies from large datasets, leading to relatively accurate
predictions.
JPRED
• JPRED is a popular online tool used for predicting the secondary
structure of proteins.
• How JPRED Works
• Input: Users provide a protein sequence, typically in FASTA format
(a text-based representation of amino acid sequences).
• Sequence Alignments: JPRED uses multiple sequence alignments
(MSA) to identify patterns and evolutionary information across
similar proteins. These alignments show conserved regions, which
can give clues about secondary structures.
• Machine Learning Algorithms: JPRED leverages machine
learning techniques to analyze sequence alignments and predict
whether each segment of the sequence is likely to form an alpha
helix, beta sheet, or coil.
• Outputs: The result is a prediction of the secondary structure for
each amino acid residue in the sequence, shown as a string of
symbols (H for helix, E for strand, - for coil).
• Interpreting JPRED’s Results
• Confidence Scores: JPRED provides a confidence score for each
prediction, indicating how certain the algorithm is that a particular
residue belongs to a specific secondary structure.
• Alignment Visualization: Users can also view the sequence
alignment used by JPRED, which helps in understanding which
parts of the sequence are highly conserved across species and
likely to have similar structures.
• Applications of JPRED
• Drug Design and Protein Engineering: Knowing secondary
structures can aid in designing drugs that interact with specific
protein regions or in engineering proteins with altered stability or
activity.
• Function Prediction: By understanding the structural properties of
a protein, scientists can hypothesize about its potential biological
functions.
• Mutation Studies: Researchers studying mutations in proteins can
use JPRED to predict how these changes might impact the protein's
secondary structure and, consequently, its function.
JPRED RESULT
SOPMA
• SOPMA, which stands for Self-Optimized Prediction Method with Alignment,
is a computational tool used in bioinformatics to predict the secondary structure
of proteins. SOPMA analyzes protein sequences to identify regions likely to form
specific secondary structures, such as alpha helices, beta sheets, or random
coils, based on patterns in the amino acid sequence. This information is useful
for understanding protein function, as the shape of a protein often influences
its interactions with other molecules.
• Purpose of SOPMA in Protein Structure Prediction
• Proteins are made up of chains of amino acids that fold into specific 3D shapes.
• SOPMA predicts how these chains fold at a secondary level, which includes
structures like:
– Alpha helices: Spiral-shaped structures stabilized by hydrogen bonds.
– Beta sheets: Flat, sheet-like structures with strands held together by
hydrogen bonds.
– Random coils: Irregularly shaped segments without a fixed structure.
• Understanding these structures helps in determining how proteins interact,
which can be crucial for drug design, disease research, and biotechnology.
• How SOPMA Works
• Input: SOPMA requires an amino acid sequence of a protein, which is
represented by single-letter codes for each amino acid.
• Database Comparison: SOPMA compares this sequence to a large database
of known protein structures. It uses alignment techniques to match segments of
the sequence with similar segments in proteins with known structures.
• Machine Learning Techniques: SOPMA uses algorithms to predict secondary
structures based on patterns observed in the database, adjusting predictions to
optimize accuracy.
• Output and Interpretation
• SOPMA provides a prediction of each amino acid in the sequence, categorizing
it into:
– H for helix,
– E for extended strand (beta sheet), or
– C for coil.
• The output usually includes a probability score for each structure type, indicating
the confidence level of each prediction.
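A per-residue prediction string of the kind described above can be summarized as state percentages and contiguous segments. This is a sketch for a plain H/E/C string only; it does not parse the actual SOPMA report format, and the example string is invented.

```python
from itertools import groupby


def summarize(ss: str) -> tuple[dict[str, float], list[tuple[str, int, int]]]:
    """Return (percent per state, list of (state, start, end)), 1-based."""
    ss = ss.upper()
    if not ss:
        raise ValueError("empty prediction string")
    fractions = {state: 100.0 * ss.count(state) / len(ss) for state in "HEC"}
    segments = []
    position = 1
    for state, run in groupby(ss):
        length = len(list(run))
        segments.append((state, position, position + length - 1))
        position += length
    return fractions, segments


# Hypothetical prediction string, for illustration only
fractions, segments = summarize("CCHHHHCCEEEC")
for state, pct in fractions.items():
    print(state, f"{pct:.1f}%")
for state, start, end in segments:
    print(f"{state}: residues {start}-{end}")
```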
SOPMA RESULT
DSSP
• DSSP stands for Define Secondary Structure of Proteins. It is a
program and file format developed to analyze and classify the secondary
structures of proteins based on their 3D atomic coordinates, typically derived
from X-ray crystallography, NMR, or cryo-electron microscopy experiments.
DSSP is commonly used in bioinformatics and structural biology.
• Key Components of DSSP
• Protein Structure Data Input:
– DSSP works with protein structures that are available in PDB (Protein Data
Bank) format. PDB files contain the 3D coordinates of atoms in a protein
structure, describing the positions of amino acids and atoms within the
molecule.
• Secondary Structure Classification:
– DSSP assigns secondary structure elements to regions of the protein. The
main types of secondary structures are:
• α-helices: Coiled, spiral structures stabilized by hydrogen bonds
between backbone atoms.
• β-strands: Extended, stretched structures that align to form β-sheets
when paired with other β-strands.
• Turns and loops: Irregular structures connecting helices and strands.
• Coils: Random or undefined regions of the structure.
• Hydrogen Bond Analysis:
– Hydrogen bonds play a critical role in stabilizing the secondary
structures in proteins. DSSP detects them from the relative positions
of backbone atoms.
• Accessible Surface Area (ASA):
– DSSP also calculates the Accessible Surface Area (ASA), which is the
area of each amino acid exposed to the solvent (like water). This
information is helpful to understand which parts of a protein are likely
involved in interactions with other molecules.
• How DSSP Works
• Input:
– The program takes the atomic coordinates from a PDB file.
• Hydrogen Bond Identification:
– DSSP identifies hydrogen bonds from the geometry of the
amino acid backbone atoms. It computes an electrostatic interaction
energy between each backbone C=O group and N-H group from their
interatomic distances and assigns a hydrogen bond when that energy
is below -0.5 kcal/mol.
• Assigning Secondary Structure:
– Based on the hydrogen bond information, DSSP assigns secondary
structure elements by detecting repetitive patterns:
• α-helices and 3₁₀ helices (a variation of α-helix with different
hydrogen bonding patterns)
• β-strands and β-sheets
• Turns and other coil regions
• Output:
– DSSP outputs a file where each amino acid residue in the protein is
labeled with its assigned secondary structure. It also provides detailed
data on hydrogen bonding, ASA, and other geometric features.
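The hydrogen bond test at the heart of DSSP can be sketched from the Kabsch and Sander definition: E = q1 · q2 · f · (1/r_ON + 1/r_CH - 1/r_OH - 1/r_CN), with q1 = 0.42, q2 = 0.20, f = 332, distances in Å, and E in kcal/mol. A bond is assigned when E is below -0.5 kcal/mol. The sketch assumes the amide hydrogen position is already known, and the coordinates in the example are invented to give an idealized linear geometry.

```python
from math import dist

Q1, Q2 = 0.42, 0.20    # partial charges on C=O and N-H (units of e)
F = 332.0              # converts e^2 / angstrom to kcal/mol
HBOND_CUTOFF = -0.5    # kcal/mol


def hbond_energy(c, o, n, h) -> float:
    """Electrostatic energy between a C=O acceptor and an N-H donor."""
    return Q1 * Q2 * F * (
        1.0 / dist(o, n) + 1.0 / dist(c, h)
        - 1.0 / dist(o, h) - 1.0 / dist(c, n)
    )


def is_hbond(c, o, n, h) -> bool:
    return hbond_energy(c, o, n, h) < HBOND_CUTOFF


# Idealized, made-up geometry: C=O ... H-N all on one line
c, o = (-1.23, 0.0, 0.0), (0.0, 0.0, 0.0)
h, n = (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)
print(f"E = {hbond_energy(c, o, n, h):.2f} kcal/mol, bonded: {is_hbond(c, o, n, h)}")
```

Helices and strands are then recognized from repeating patterns of such bonds, for example i to i+4 bonds in an α-helix.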
STRIDE
• STRIDE is a secondary structure assignment tool that analyzes the 3D
atomic coordinates of a protein structure to classify regions into alpha-
helices, beta-sheets, and other structural types. Here’s a breakdown of how
STRIDE operates:
• Input: It takes 3D structural data, typically from Protein Data Bank (PDB)
files.
• Hydrogen Bond Analysis: STRIDE identifies hydrogen bonds, which are
critical for stabilizing secondary structures. By examining the bond geometry
and strength, STRIDE determines where hydrogen bonds support specific
structures.
• Empirical Energy Functions: It uses empirical energy calculations to
assess the stability of potential secondary structures, providing a more
detailed picture of the protein's structural layout.
• Nuanced Classification: STRIDE refines its assignments by
combining hydrogen-bond energy with backbone torsion-angle
information, whereas DSSP relies on hydrogen-bond patterns alone.
Its authors report closer agreement with the assignments made by
crystallographers.
RESTRICTION MAPPING AND PRIMER DESIGN
A restriction map is a description of restriction
endonuclease cleavage sites within a piece of
DNA.
Generating such a map is usually the first step
in characterizing an unknown DNA, and a
prerequisite to manipulating it for other
purposes.
Typically, restriction enzymes that cleave DNA
infrequently (e.g. those with 6 bp recognition
sites) and are relatively inexpensive are used
to produce such a map.
What are the 3 general steps used to clone DNA?
• Isolate DNA from an organism
• Cut the organismal DNA and the vector with restriction
enzymes making recombinant DNA
• Introduce the recombinant DNA into a host
Restriction Enzymes
• Recognize a specific site (generally a palindromic sequence)
• Produce overhangs or straight cuts
• Naturally found in bacteria, they protect against viruses and
foreign DNA
• More than 400 enzymes have been isolated
• Named for the organism from which they are isolated
The first letter is that of the genus and the 2nd and 3rd are from the
species (e.g., EcoRI from Escherichia coli)
Restriction site in DNA, showing symmetry of the sequence
around the center point. The sequence is a palindrome, reading
the same from left to right (5’-to-3’) on the top strand (GAATTC,
here) as it does from right to left (5’-to-3’) on the bottom strand.
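The palindrome property described above means a site equals its own reverse complement. A minimal check, assuming an unambiguous DNA sequence of A, C, G, and T; the function names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Sequence of the opposite strand, read 5'-to-3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_palindromic_site(seq: str) -> bool:
    """True if both strands read the same 5'-to-3'."""
    return seq.upper() == reverse_complement(seq)


for site in ("GAATTC", "GGATCC", "GAATTA"):
    print(site, is_palindromic_site(site))
```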
Examples of how restriction enzymes cleave DNA. (a) SmaI results in blunt
ends. (b) BamHI results in 5’ overhanging (“sticky”) ends. (c) PstI results in 3’
overhanging (“sticky”) ends.
Restriction Mapping
There are three methods used to generate
a restriction map:
(i) mapping by multiple R.E. digestions
(ii) mapping by partial R.E. digestions
(iii) using a computer
Mapping by Multiple R.E.
Digestions
The most straightforward method for restriction mapping
is to digest samples of the plasmid with:
(i) a set of individual enzymes,
(ii) and with pairs of those enzymes.
The digestions are then "run out" on an agarose gel to
determine the sizes of the fragments generated.
The sizes of the fragments are determined by comparison
with standard DNA molecular weight markers.
If you know the fragment sizes, it is usually a fairly easy
task to deduce where each enzyme cuts.
This is what mapping is all about.
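The deduction works by comparing single and double digests. The sketch below runs the logic forwards: given assumed cut positions on a circular plasmid, it lists the fragment sizes each digest would show on the gel. The plasmid length and cut positions are invented for illustration.

```python
def circular_fragments(plasmid_length: int, cut_positions: list[int]) -> list[int]:
    """Fragment sizes (bp), largest first, from cutting a circular plasmid."""
    cuts = sorted(set(cut_positions))
    if any(c < 0 or c >= plasmid_length for c in cuts):
        raise ValueError("cut position outside the plasmid")
    if not cuts:
        return [plasmid_length]  # uncut circle
    sizes = [b - a for a, b in zip(cuts, cuts[1:])]
    # the last fragment wraps around the origin of the circle
    sizes.append(plasmid_length - cuts[-1] + cuts[0])
    return sorted(sizes, reverse=True)


# Hypothetical 5000 bp plasmid
eco_ri = [1000]          # one EcoRI site
bam_hi = [2500, 4000]    # two BamHI sites
print("EcoRI:        ", circular_fragments(5000, eco_ri))
print("BamHI:        ", circular_fragments(5000, bam_hi))
print("EcoRI + BamHI:", circular_fragments(5000, eco_ri + bam_hi))
```

In this example the 3500 bp BamHI fragment disappears in the double digest and is replaced by 2000 and 1500 bp fragments, which places the EcoRI site inside it.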
Creating a map by Partial
Digestions of End-Labelled DNA
A DNA fragment is labeled with a radioisotope
on only one end.
It can be partially digested with restriction
enzymes to generate labeled fragments.
Partial digestion is performed by using very
small amounts of enzyme or short periods of
time.
Analysis of the resulting products by gel electrophoresis
enables one to define the distance of R.E. sites
from the labelled end.
Using a Computer to Generate
Restriction Maps
All of the techniques described above for
generating a restriction map assume that
you don't have the sequence of the DNA.
If the sequence is known, it is a simple
matter to feed that sequence into any
number of computer programs.
These programs will search the sequence
for dozens of restriction enzyme
recognition sites and build a map for you.
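What such a program does can be sketched in a few lines: search the sequence for each recognition site and report where it occurs. Only five well-known enzymes are included, the test sequence is invented, and positions refer to the first base of the recognition site rather than the exact cut point. Because these sites are palindromic, searching one strand is enough.

```python
# Recognition sequences (5'-to-3') of a few common enzymes
ENZYMES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
    "PstI": "CTGCAG",
    "SmaI": "CCCGGG",
}


def find_sites(seq: str, enzymes: dict[str, str] = ENZYMES) -> dict[str, list[int]]:
    """1-based start positions of each recognition site in a linear sequence."""
    seq = seq.upper()
    hits = {}
    for name, site in enzymes.items():
        positions = []
        start = seq.find(site)
        while start != -1:
            positions.append(start + 1)
            start = seq.find(site, start + 1)  # step by 1 to allow overlaps
        hits[name] = positions
    return hits


# Hypothetical sequence, for illustration only
for name, positions in find_sites("AAGAATTCTTGGATCCAAGAATTC").items():
    print(f"{name}: {positions if positions else 'no site'}")
```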
Using R.E. maps for analysing
Recombinant DNA
Checking the size of the insert
Checking the orientation of the insert
Determining pattern of restriction sites
within insert DNA
Utilities
Searching by content