Notes 2
DEPARTMENT OF BIOTECHNOLOGY
In contemplating protein folding, it is necessary to consider different types of amino acid side-chains
separately. For each situation, the reaction involved will be assumed to be:
Unfolded ⇌ Folded
Note that this formalism means that a negative ΔG implies that the folding process is spontaneous.
First we will look at polar groups in an aqueous solvent. For polar groups, the ΔH chain favors the
unfolded structure because the backbone and polar groups form stronger interactions with
water than with themselves. More hydrogen bonds and electrostatic interactions can be formed in
the unfolded state than in the folded state. This is true because many hydrogen bonding groups can form
more than a single hydrogen bond. These groups form multiple hydrogen bonds if exposed to water,
but frequently can form only single hydrogen bonds in the folded structure of a protein.
For similar reasons, the ΔH solvent favors the folded protein because water interacts more strongly
with itself than with the polar groups in the protein. More hydrogen bonds can form in the absence
of an extended protein. The sum of the ΔH polar contributions is therefore close
to zero, but usually favors the folded structure slightly. The chain ΔH contributions
are positive, while the solvent ΔH contributions are negative. The sum is slightly negative in most
cases, and therefore slightly favors folding.
The ΔS chain of the polar groups favors the unfolded state, because the chain is much more disordered
in the unfolded state. In contrast, the ΔS solvent favors the folded
state, because the solvent is more disordered with the protein in the folded state. In most cases, the
sum of the ΔS polar favors the unfolded state slightly. In other words, the ordering of the chain
during the folding process outweighs the other entropic factors.
The ΔG polar that is obtained from the values of ΔH polar and ΔS polar for the polar groups varies
somewhat, but usually tends to favor the unfolded protein. In other words, the folding of proteins
comprised of polar residues is usually a nonspontaneous process.
Next, we will consider a chain constructed from non-polar groups in aqueous
solvent. Once again, the ΔH chain usually favors the unfolded state slightly. Once again, the reason
is that the backbone can interact with water in the unfolded state. However, the effect is smaller for
non-polar groups, due to the greater number of favorable van der Waals interactions in the folded
state. This is a result of the fact that non-polar atoms form better van der Waals contacts with other
non-polar groups than with water; in some cases, these effects mean that the ΔH chain for nonpolar
residues is slightly negative.
As with the polar groups, the ΔH solvent for non-polar groups favors the folded state. In the case of
non-polar residues, ΔH solvent favors folding more than it does for polar groups, because water
interacts much more strongly with itself than it does with non-polar groups.
The sum of the ΔH non-polar favors folding somewhat. The magnitude of the ΔH nonpolar is not
very large, but is larger than the magnitude of the ΔH polar, which also tends to slightly favor folding.
The ΔS chain of the non-polar groups favors the less ordered unfolded state. However, the ΔS solvent
highly favors the folded state, due to the hydrophobic effect. During the burying of the non-polar side
chains, the solvent becomes more disordered. The ΔS solvent is a major driving force for protein
folding.
The ΔG non-polar is therefore negative, due largely to the powerful contribution of the ΔS solvent.
Adding together the terms for ΔG polar and ΔG non-polar gives a slightly negative overall ΔG for
protein folding, and therefore proteins generally fold spontaneously. Raising the temperature,
however, tends to greatly increase the magnitude of the TΔS chain term, and therefore to result in
unfolding of the protein.
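The sign bookkeeping above follows from ΔG = ΔH − TΔS applied to the reaction unfolded → folded. The sketch below sums chain and solvent terms for polar and non-polar groups. Every number in it is a hypothetical placeholder, chosen only so that the signs match the qualitative discussion; none is a measured value, and ΔH and ΔS are assumed to be independent of temperature, which is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contribution:
    """Change on going from unfolded to folded: dH in kJ/mol, dS in kJ/(mol*K)."""
    dH: float
    dS: float

    def dG(self, temperature_k: float) -> float:
        # Gibbs relation: dG = dH - T*dS
        return self.dH - temperature_k * self.dS


def total(*parts: Contribution) -> Contribution:
    """Sum enthalpy and entropy terms of several contributions."""
    return Contribution(sum(p.dH for p in parts), sum(p.dS for p in parts))


# Hypothetical placeholder values (illustrative signs only, not data).
polar_chain = Contribution(dH=+400.0, dS=-1.40)       # chain terms oppose folding
polar_solvent = Contribution(dH=-450.0, dS=+0.90)     # solvent terms favor folding
nonpolar_chain = Contribution(dH=+30.0, dS=-0.60)
nonpolar_solvent = Contribution(dH=-100.0, dS=+0.75)  # hydrophobic effect

polar = total(polar_chain, polar_solvent)
nonpolar = total(nonpolar_chain, nonpolar_solvent)
overall = total(polar, nonpolar)

if __name__ == "__main__":
    for temperature in (298.0, 350.0):
        print(
            f"T = {temperature:.0f} K: "
            f"dG polar = {polar.dG(temperature):+.1f}, "
            f"dG non-polar = {nonpolar.dG(temperature):+.1f}, "
            f"dG overall = {overall.dG(temperature):+.1f} kJ/mol"
        )
```

With these placeholder numbers the polar term is positive, the non-polar term is negative, and the overall ΔG changes sign between 298 K and 350 K because the net entropy change of folding is negative.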
The folded state is the sum of many interactions. Some favor folding, and some
favor the unfolded state. The qualitative discussion above did not include the magnitudes of the
effects. For real proteins, the various ΔH and ΔS values are difficult to measure accurately. However,
for many proteins it is possible to estimate the overall ΔG of folding. Measurements of this value
have shown that the overall ΔG for protein folding is very small: only about –10 to –50
kJ/mol. This corresponds to a few salt bridges or hydrogen bonds.
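To get a feel for what such a small ΔG means, it can be converted into an equilibrium constant with K = exp(−ΔG/RT) and from there into the fraction of molecules that are folded. The sketch below assumes a simple two-state equilibrium (unfolded ⇌ folded) at 298 K; the two-state model and the function names are assumptions made for illustration.

```python
import math

R_KJ_PER_MOL_K = 8.314e-3  # gas constant in kJ/(mol*K)


def equilibrium_constant(dg_kj_per_mol: float, temperature_k: float = 298.0) -> float:
    """K = [folded]/[unfolded] for a two-state equilibrium."""
    return math.exp(-dg_kj_per_mol / (R_KJ_PER_MOL_K * temperature_k))


def fraction_folded(dg_kj_per_mol: float, temperature_k: float = 298.0) -> float:
    """Fraction folded = K / (1 + K), written in a numerically stable form."""
    return 1.0 / (1.0 + math.exp(dg_kj_per_mol / (R_KJ_PER_MOL_K * temperature_k)))


if __name__ == "__main__":
    for dg in (-10.0, -30.0, -50.0):
        print(f"dG = {dg:+.0f} kJ/mol: K = {equilibrium_constant(dg):.3g}, "
              f"folded fraction = {fraction_folded(dg):.6f}")
```

Even at the weak end of the range (−10 kJ/mol), roughly 98% of the molecules are folded in this model, yet a change of only a few kJ/mol is enough to shift the balance noticeably.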
Studies of protein folding have revealed one other important point: the hydrophobic effect is very
important, but it is relatively non-specific. Any hydrophobic group will interact with essentially any
other hydrophobic group. While the hydrophobic effect is a major driving force for protein folding,
it is the constraints imposed by the more geometrically specific hydrogen bonding and electrostatic
interactions in conjunction with the hydrophobic interactions that largely determine the overall folded
structure of the protein.
1.5.10. Anfinsen’s experiment
1. The Observation
Ribonuclease A (RNase A) is an extracellular enzyme of 124 residues with four disulfide bonds. In
the first phase of the experiment, the S-S bonds were reduced to eight –SH groups (using
mercaptoethanol, HS-CH2-CH2-OH); the protein was then denatured with 8 M urea. Under these
conditions, the enzyme is inactive and becomes a flexible random polymer. In the second phase, the
urea was slowly removed (dialysis); then the –SH groups were oxidized back to S-S bonds. If the
protein was able to regain its native structure spontaneously after removal of the urea, we would
expect it to regain its activity as well. In fact, the activity was >90% of that of the untreated enzyme.
Moreover, sequence analysis showed that nearly all of the correct S-S bonds had been formed.
2. The Control
A reasonable objection can be raised to the above result by suggesting that perhaps RNase A was not
completely unfolded in 8 M urea. To address this class of objections, RNase A was first reduced and
denatured as above. But in the second phase, the enzyme was first oxidized to form S-S bonds, and
then the urea was removed, i.e. the order of steps in the second phase of the experiment was reversed.
The resulting activity was only about 1-2% of the untreated enzyme. Sequence analysis showed a
random assortment of S-S bonds.
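One common way to rationalise the 1-2% figure is to count how many distinct ways eight cysteines can be paired into four disulfide bonds. The sketch below does this count, assuming (as an idealisation) that every pairing is equally likely when oxidation occurs in the unfolded state; the function names are illustrative.

```python
def count_disulfide_pairings(n_cysteines: int) -> int:
    """Number of ways to pair all cysteines: (n-1) * (n-3) * ... * 1."""
    if n_cysteines < 0 or n_cysteines % 2:
        raise ValueError("need a non-negative, even number of cysteines")
    count = 1
    for k in range(n_cysteines - 1, 0, -2):
        count *= k
    return count


def enumerate_pairings(items):
    """Yield every complete pairing of the items (brute-force check)."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in enumerate_pairings(remaining):
            yield [(first, partner)] + tail


if __name__ == "__main__":
    n_pairings = count_disulfide_pairings(8)
    print(f"pairings of 8 cysteines: {n_pairings}")
    print(f"chance of the native set if random: {100.0 / n_pairings:.2f}%")
```

Eight cysteines can be paired in 105 ways, so a purely random assortment would give the native set of four bonds a little under 1% of the time, which is of the same order as the residual activity observed in the control.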
Anfinsen's work showed convincingly that proteins can indeed adopt their native conformation
spontaneously, i.e. sequence determines structure. His demonstration of this fundamental property of
proteins opened the problem to a massive amount of experimental and theoretical effort.
Properties
Molecular chaperones interact with unfolded or partially folded protein subunits, e.g. nascent
chains emerging from the ribosome, or extended chains being translocated across subcellular
membranes.
They stabilize non-native conformation and facilitate correct folding of protein subunits.
They do not interact with native proteins, nor do they form part of the final folded structures.
Some chaperones are non-specific, and interact with a wide variety of polypeptide chains, but
others are restricted to specific targets.
They often couple ATP binding/hydrolysis to the folding process.
Essential for viability, their expression is often increased by cellular stress.
Main role: They prevent inappropriate association or aggregation of exposed hydrophobic surfaces
and direct their substrates into productive folding, transport or degradation pathways.
Location and Function
Many chaperones are heat shock proteins, that is, proteins expressed in response to elevated
temperatures or other cellular stresses. The reason for this behaviour is that protein folding is severely
affected by heat and, therefore, some chaperones act to prevent or correct damage caused by
misfolding. Other chaperones are involved in folding newly made proteins as they are extruded from
the ribosome. Although most newly synthesized proteins can fold in the absence of chaperones, a
minority strictly requires them in order to fold.
Some chaperone systems work as foldases: they support the folding of proteins in an ATP-dependent
manner (for example, the GroEL/GroES or the DnaK/DnaJ/GrpE system). Other chaperones work
as holdases: they bind folding intermediates to prevent their aggregation, for
example DnaJ or Hsp33.
Macromolecular crowding may be important in chaperone function. The crowded environment of
the cytosol can accelerate the folding process, since a compact folded protein will occupy less volume
than an unfolded protein chain. However, crowding can reduce the yield of correctly folded protein
by increasing protein aggregation. Crowding may also increase the effectiveness of the chaperone
proteins such as GroEL, which could counteract this reduction in folding efficiency.
A subset of chaperones, the chaperonins, encapsulate their folding substrates (e.g. GroEL/GroES).
Chaperonins are characterized by a stacked double-ring structure and are found in prokaryotes, in
the cytosol of eukaryotes, and in mitochondria.
Other types of chaperones are involved in transport across membranes, for example membranes of
the mitochondria and endoplasmic reticulum (ER) in eukaryotes. A bacterial translocation-specific
chaperone maintains newly synthesized precursor polypeptide chains in a translocation-competent
(generally unfolded) state and guides them to the translocon.
New functions for chaperones continue to be discovered, such as assistance in protein
degradation, bacterial adhesin activity, responses to diseases linked to protein
aggregation (e.g. prion diseases), and cancer maintenance.
CHAPERONINS
Chaperonins are proteins that provide favourable conditions for the correct folding of other proteins,
thus preventing aggregation. Newly made proteins usually must fold from a linear chain of amino
acids into a three-dimensional form. Chaperonins belong to a large class of molecules that assist
protein folding, called molecular chaperones. The energy to fold proteins is supplied by adenosine
triphosphate (ATP).
Group I Chaperonins
1. GroEL is a double-ring 14mer with a greasy hydrophobic patch at its opening and can
accommodate the native folding of substrates 15-60 kDa in size.
2. GroES is a single-ring heptamer that binds to GroEL in the presence of ATP or transition state
analogues of ATP hydrolysis, such as ADP-AlF3. It acts as a lid that caps the GroEL cavity, like the
cover of a box or bottle.
GroEL/GroES may not be able to undo protein aggregates, but kinetically it competes in the pathway
of misfolding and aggregation, thereby preventing aggregate formation.
Group II Chaperonins
Group II chaperonins, found in the eukaryotic cytosol and in archaea, are more poorly characterized.
TRiC (TCP-1 Ring Complex, also called CCT for chaperonin containing TCP-1), the eukaryotic
chaperonin, is composed of two rings of eight different though related subunits, each thought to be
represented once per eight-membered ring. TRiC was originally thought to fold only the cytoskeletal
proteins actin and tubulin but is now known to fold dozens of substrates.
Mm cpn (Methanococcus maripaludis chaperonin), found in the archaeon Methanococcus maripaludis,
is composed of sixteen identical subunits (eight per ring). It has been shown to fold the mitochondrial
protein rhodanese; however, no natural substrates have yet been identified.
Group II chaperonins are not thought to utilize a GroES-type cofactor to fold their substrates. They
instead contain a "built-in" lid that closes in an ATP-dependent manner to encapsulate their substrates,
a process that is required for optimal protein folding activity.
Mechanism of action
Chaperonins undergo large conformational changes during a folding reaction as a function of the
enzymatic hydrolysis of ATP as well as binding of substrate proteins and cochaperonins, such as
GroES. These conformational changes allow the chaperonin to bind an unfolded or misfolded protein,
encapsulate that protein within one of the cavities formed by the two rings, and release the protein
back into solution. Upon release, the substrate protein will either be folded or will require further
rounds of folding, in which case it can again be bound by a chaperonin.
The exact mechanism by which chaperonins facilitate folding of substrate proteins is unknown.
According to recent analyses by different experimental techniques, GroEL-bound substrate proteins
populate an ensemble of compact and locally expanded states that lack stable tertiary interactions. A
number of models of chaperonin action have been proposed, which generally focus on two (not
mutually exclusive) roles of the chaperonin interior: passive and active. Passive models treat the
chaperonin cage as an inert form, exerting influence by reducing the conformational space accessible
to a protein substrate or by preventing intermolecular interactions such as aggregation. The
active role, in turn, involves specific chaperonin–substrate interactions that may be
coupled to conformational rearrangements of the chaperonin.
Probably the most popular model of the chaperonin active role is the iterative annealing mechanism
(IAM), which focuses on the effect of iterative, hydrophobic binding of the protein
substrate to the chaperonin. According to computational simulation studies, the IAM leads to more
productive folding by unfolding the substrate from misfolded conformations or by preventing
misfolding through a change in the folding pathway.
HUMAN CHAPERONE PROTEINS
Chaperones are found in, for example, the endoplasmic reticulum (ER), since protein synthesis often
occurs in this area.
Endoplasmic reticulum
In the endoplasmic reticulum (ER) there are general, lectin- and non-classical molecular chaperones
helping to fold proteins.
General chaperones: GRP78/BiP, GRP94, GRP170.
Lectin chaperones: calnexin and calreticulin
Non-classical molecular chaperones: HSP47 and ERp29
Folding chaperones:
Protein disulfide isomerase (PDI),
Peptidyl prolyl cis-trans-isomerase (PPI)
ERp57
Nomenclature and examples of bacterial and archaeal chaperones
There are many different families of chaperones; each family acts to aid protein folding in a different
way. In bacteria like E. coli, many of these proteins are highly expressed under conditions of high
stress, for example, when the bacterium is placed in high temperatures. For this reason, the term "heat
shock protein" has historically been used to name these chaperones. The prefix "Hsp" designates that
the protein is a heat shock protein.
Hsp60
Hsp60 (GroEL/GroES complex in E. coli) is the best characterized large (~ 1 MDa) chaperone
complex. GroEL is a double-ring 14mer with a hydrophobic patch at its opening; it is so large it can
accommodate native folding of 54-kDa GFP in its lumen. GroES is a single-ring heptamer that binds
to GroEL in the presence of ATP or ADP. GroEL/GroES may not be able to undo previous
aggregation, but it does compete in the pathway of misfolding and aggregation.[19] It also acts
in the mitochondrial matrix as a molecular chaperone.
Hsp70
Hsp70 (DnaK in E. coli) is perhaps the best characterized small (~ 70 kDa) chaperone.
Ubiquitin-independent degradation
Although most proteasomal substrates must be ubiquitinated before being degraded, there are some
exceptions to this general rule, especially when the proteasome plays a normal role in the
post-translational processing of the protein. The proteasomal activation of NF-κB by
processing p105 into p50 via internal proteolysis is one major example. Some proteins that are
hypothesized to be unstable due to intrinsically unstructured regions are degraded in a
ubiquitin-independent manner. The most well-known example of a ubiquitin-independent proteasome
substrate is the enzyme ornithine decarboxylase. Ubiquitin-independent mechanisms targeting key cell
cycle regulators such as p53 have also been reported, although p53 is also subject to
ubiquitin-dependent degradation. Finally, structurally abnormal, misfolded, or highly oxidized proteins
are also subject to ubiquitin-independent and 19S-independent degradation under conditions of cellular stress.
1.5.18. Protein folding errors
Proteins can malfunction for several reasons. When a protein is misfolded, it can lead to
denaturation of the protein. Denaturation is the loss of protein structure and function. Misfolding
does not always lead to a complete lack of function; sometimes the loss of functionality is only
partial. The malfunctioning of proteins can sometimes lead to diseases in the human body.
Alzheimer's disease
Alzheimer's disease (AD) is a neurological degenerative disease that affects around 5 million
Americans, including nearly half of those who are age 85 or older. The predominant risk factors of
AD are age, family history, and heredity. Alzheimer's disease typically results in memory loss,
confusion about time and place, misplacing things, and changes in mood and behavior. AD results in
dense plaques in the brain that are composed of fibrillar β-amyloid proteins with a well-ordered
β-sheet secondary structure. These plaques look like voids in the brain matter and are
directly connected to the deterioration of thought processes. It has been determined that AD is a
protein misfolding disease, where the misfolded protein is directly related to the formation of these
plaques in the brain.
Mad Cow
Diseases caused by prions, such as Mad Cow disease and Creutzfeldt-Jakob disease, are also, in
essence, protein folding disorders. These are caused by a certain protein, named PrP, that will stay
in a misfolded conformation (PrPsc) if encouraged to go into it in the first place. In most people, the PrP
protein folds normally, leaving the person healthy. Rarely, a mutation in the PrP gene will allow
the protein to be made incorrectly, and it will fold incorrectly, making a PrPsc prion. These prions,
when exposed to PrP which is in the process of folding, will encourage that PrP to fold badly too,
thus creating another PrPsc. While PrP can be processed and cleaned out of a cell once it has been
used, PrPsc is shaped differently enough that it cannot be, so it never goes away. PrPsc builds up
into plaques much more quickly than Aβ does in Alzheimer's, destroying whatever nervous
tissue it accumulates in.
Cystic Fibrosis
Besides building up un-processable plaques, protein folding errors can leave behind too little of the
effective conformation for it to do its job. This is the case with diseases like Cystic Fibrosis, and
many other hereditary diseases. Cystic Fibrosis results from lack of a protein that regulates chloride
ion transport through a cell membrane. Findings show that while this protein seems to be forming
correctly, there is a problem with one of its associated chaperone proteins. Chaperone proteins help
encourage unfolded proteins to fold in the right way by surrounding them and protecting their
movement. In Cystic Fibrosis, the chaperone doesn't pull away from the transport protein smoothly,
leaving it partially mis-folded and useless. The broken chaperone protein then moves on to do the
same thing to another transport protein, and so forth.
Within each module, an adenylation (A) domain recognises and activates a specific substrate by
addition of AMP (Fig. 2c). The activated substrate is then tethered to a flexible
4′-phosphopantetheine (PPT) prosthetic group, which is itself covalently attached to a thiolation (T)
domain (also known as a peptidyl carrier protein (PCP) domain) (Fig. 2b). The T domain lies at the
heart of the biosynthetic process, with its flexible PPT prosthesis effectively the ‘‘swinging arm’’ of
a biomolecular assembly line that transfers peptide intermediates between different domains and
modules. Post-attachment of an activated substrate by its A domain partner, a T domain then passes
that substrate to a condensation (C) domain, which catalyses peptide bond formation between the
donor substrate provided by the T domain immediately upstream, and the acceptor substrate provided
by the downstream T domain (Fig. 2d). Following the initial condensation event, the process can
repeat in an iterative fashion, with the previous peptide intermediate now serving as the donor
substrate for the C domain of the next module in an NRPS complex (Fig. 2e). Along the way, certain
modules may contain additional tailoring domains that modify individual substrates in a directed
fashion (e.g., epimerisation (E) domains, for conversion from L- to D-enantiomers). The growing
peptide continues to be passed from the T domain of one module to the T domain of the next until
the product is released, typically via a hydrolysis or intramolecular cyclisation reaction catalysed by
a thioesterase (TE) domain associated with the final module in an NRPS complex (Fig. 2f).
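The assembly-line logic described above can be sketched as a simple data structure: an ordered list of modules, each selecting one substrate and optionally epimerising it, followed by release of the product. This is only a toy model; the `Module` class, the `assemble` function, and the three-module line-up are hypothetical and do not describe any real NRPS.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Module:
    """One NRPS module: the A domain selects `substrate`; an optional E domain epimerises it."""
    substrate: str
    has_e_domain: bool = False


def assemble(modules: List[Module], cyclise: bool = False) -> str:
    """Walk the modules in order, extending the peptide by one residue per module."""
    chain: List[str] = []
    for module in modules:
        # A domain activates the residue; the T domain carries it to the C domain.
        residue = ("D-" if module.has_e_domain else "L-") + module.substrate
        # C domain: the upstream peptide (donor) is joined to the new residue (acceptor).
        chain.append(residue)
    product = "-".join(chain)
    # TE domain: release by hydrolysis (linear) or intramolecular cyclisation.
    return f"cyclo({product})" if cyclise else product


if __name__ == "__main__":
    # Hypothetical three-module assembly line.
    line = [Module("Phe", has_e_domain=True), Module("Pro"), Module("Val")]
    print(assemble(line))
    print(assemble(line, cyclise=True))
```

Because the product is determined entirely by the ordered list of modules, replacing one `Module` entry changes exactly one residue in the output, which is the property that makes module substitution attractive as an engineering strategy.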
1.6.2. Strategies to create novel peptide products via genetic manipulation of NRPS templates
The modular structure of the NRPS assembly line suggests that it should be possible to rationally
alter one or more residues in a non-ribosomal peptide product by substitution or engineering of the
module(s) that specify the target residue(s). In nature, the diversity of non-ribosomal peptides is
thought to have arisen from point mutation, substitution of domains or modules for alternatives that
specify different substrates, and/or the insertion/deletion of modules.
Most enzymes used commercially are extracellular enzymes, and the first step in their isolation is
separation of the cells from the solution. For intracellular enzymes, which are being isolated today in
increasing amounts, the first step involves grinding to rupture the cells. A number of methods for the
disruption of cells are known, corresponding to the different types of cells and the problems involved
in isolating intracellular enzymes. However, only a few of these methods are used on an industrial
scale.
The wet grinding of cells in a high-speed bead mill is another effective method of cell disruption
[190–193]. Glass balls with a diameter of 0.2–1 mm are used to break the cells. The efficiency of this
method depends on the geometry of the stirrer system. A symmetrical arrangement of circular disks
gives better results than the normal asymmetrical arrangement. Given optimal parameters such as
stirring rate, number and size of glass beads, flow rate, cell concentration, and temperature, a protein
release of up to 90 % can be achieved in a single passage.
2.2.1 Filtration
The filtration rate is a function of filter area, pressure, viscosity, and resistance offered by the filter
cake and medium. For a clean liquid, all these terms are constant which results in a constant flow rate
for a constant pressure drop. The cumulative filtrate volume increases linearly with time. During the
filtration of suspensions, the increasing thickness of the formed filter cake and the concomitant
resistance gradually decrease the flow rate. Additional difficulties may arise because of the
compressibility of biological material. In this case, the resistance offered by the filter cake and, hence,
the rate of filtration depend on the pressure applied. If the pressure applied exceeds a certain limit,
the cake may collapse and total blockage of the filter can result.
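The dependence of filtration rate on area, pressure, viscosity, and resistance can be illustrated with the standard constant-pressure cake filtration model, dV/dt = A·ΔP / (μ·(R_m + α·c·V/A)). The text above gives no equation, so this model and all parameter values below are assumptions made for illustration; the cake is treated as incompressible, which the text notes is often not true for biological material.

```python
import math


def filtrate_volume(t, area, dp, mu, r_medium, alpha=0.0, solids=0.0):
    """Cumulative filtrate volume (m^3) after time t (s) at constant pressure drop.

    Integrating dV/dt = area*dp / (mu*(r_medium + alpha*solids*V/area)) gives
    a*V**2 + b*V = t, which is solved here for V.
    area: m^2, dp: Pa, mu: Pa*s, r_medium: 1/m,
    alpha: specific cake resistance (m/kg), solids: kg of cake per m^3 filtrate.
    """
    a = mu * alpha * solids / (2.0 * area ** 2 * dp)  # cake term
    b = mu * r_medium / (area * dp)                   # filter medium term
    if a == 0.0:
        return t / b  # clean liquid: volume grows linearly with time
    return (-b + math.sqrt(b * b + 4.0 * a * t)) / (2.0 * a)


if __name__ == "__main__":
    # Hypothetical parameter values, for illustration only.
    common = dict(area=1.0, dp=1.0e5, mu=1.0e-3, r_medium=1.0e11)
    for t in (60.0, 120.0, 240.0):
        clean = filtrate_volume(t, **common)
        cake = filtrate_volume(t, alpha=1.0e12, solids=10.0, **common)
        print(f"t = {t:5.0f} s: clean liquid {clean:.4f} m^3, suspension {cake:.4f} m^3")
```

For the clean liquid the volume doubles when the time doubles; for the suspension it grows more slowly because the cake resistance increases with the volume already filtered.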
Pressure Filters A filter press (plate filter, chamber filter) is used to filter small volumes or to
remove precipitates formed during purification. The capacity to retain solid matter is limited, and the
method is rather labor-intensive. However, these filters are highly suitable for the fine filtration of
enzyme solutions.
Vacuum Filters Vacuum filtration is generally the method of choice because biological materials are
easily compressible. A rotary vacuum filter is used in the continuous filtration of large volumes. The
suspension is usually mixed with a filter aid, e.g., kieselguhr, before being applied to the filter. The
filter drum is coated with a thin layer of filter aid (precoat). The drum is divided into different
sections so that the filter cake can also
be washed and dried on the filter. The filter cake is subsequently removed by using a series of endless
strings or by scraper discharge (knife). The removal of a thin layer of precoat each time exposes a
fresh filtering area. This system is useful for preventing an increase in resistance with the
accumulation of filter cake during the course of filtration.
Cross-Flow Filtration In conventional methods, the suspension flows perpendicular to the filtering
material. In cross-flow filtration, the input stream flows parallel to the filter area, thus preventing the
accumulation of filter cake and an increased resistance to filtration. To maintain a sufficiently high
filtration rate, this method must consume a relatively large amount of energy, in the form of high flux
rates over the membranes. With the membranes now available, permeate rates can be attained. Indeed,
in many cases the use of a separator is more economical.
[Figure: Rotary vacuum filter]
[Figure: Principles of conventional, dead-end filtration (A) and cross-flow filtration (B)]
The future of this method depends on the development of suitable membranes, but cross-flow
filtration can be conveniently used in recombinant DNA techniques to separate organisms in a
closed system.
2.2.2 Centrifugation
The sedimentation rate of a bacterial cell with a diameter of 0.5 μm is less than 1 mm/h. An
economical separation can be achieved only by sedimentation in a centrifugal field. The range of
applications of centrifuges depends on the particle size and the solids content.
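The need for a centrifugal field can be checked with Stokes' law for a small sphere, v = d²·Δρ·g / (18·μ). The sketch below assumes a cell diameter of 0.5 µm, a density difference of 50 kg/m³ between cell and medium, and the viscosity of water; the density difference and viscosity are illustrative assumptions, not values taken from the text.

```python
G = 9.81  # gravitational acceleration, m/s^2


def stokes_velocity_mm_per_h(diameter_m, delta_rho, viscosity, relative_force=1.0):
    """Settling velocity of a small sphere (Stokes' law), in mm/h.

    diameter_m: particle diameter (m)
    delta_rho: particle density minus medium density (kg/m^3)
    viscosity: dynamic viscosity of the medium (Pa*s)
    relative_force: centrifugal acceleration as a multiple of gravity
    """
    v_m_per_s = diameter_m ** 2 * delta_rho * G * relative_force / (18.0 * viscosity)
    return v_m_per_s * 3600.0 * 1000.0


if __name__ == "__main__":
    cell = dict(diameter_m=0.5e-6, delta_rho=50.0, viscosity=1.0e-3)  # assumed values
    print(f"at 1 x g:      {stokes_velocity_mm_per_h(**cell):.4f} mm/h")
    print(f"at 15000 x g:  {stokes_velocity_mm_per_h(relative_force=15000.0, **cell):.1f} mm/h")
```

Under these assumptions the cell settles far more slowly than 1 mm/h under gravity alone, and the velocity scales linearly with the applied centrifugal force.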
Decanters (scroll-type centrifuges) work with low centrifugal forces and are used in the separation
of large cells or protein precipitates. Solid matter is discharged continuously by a screw conveyer
moving at a differential rotational speed.
Tubular bowl centrifuges are built for very high centrifugal forces and can be used to sediment very
small particles. However, these centrifuges cannot be operated in a continuous process. Moreover,
solid matter must be removed by hand after the centrifuge has come to a stop. A further disadvantage
is the appearance of aerosols.
Separators (disk stack centrifuges) can be used in the continuous removal of solid matter from
suspensions. Solids are discharged by a hydraulically operated discharge port (intermittent discharge)
or by an arrangement of nozzles (continuous discharge). Bacteria and cellular fragments can be
separated by a combination of high centrifugal forces, up to 15 000 × gravity being presently
attainable, and short sedimentation distances. Disk stack
centrifuges that can be sterilized with steam are used for recombinant DNA techniques in a closed
system.
2.2.3 Extraction
An elegant method used to isolate intracellular enzymes is liquid–liquid extraction in an aqueous
two-phase system. This method is based on the incomplete mixing of different polymers, e.g., dextran
and poly(ethylene glycol), or a polymer and a salt in an aqueous solution [208]. The first extraction
step separates cellular fragments. Subsequent purification can be accomplished by extraction or, if
high purity is required, by other methods. The extractability can be improved by using affinity ligands
or modified chromatography gels, e.g., phenyl-Sepharose.
2.2.3.1 Flocculation and Flotation
Flocculation Separation of bacterial cells or cell debris by
filtration or centrifugation can involve considerable difficulties due to their small size and physical
properties. The compressible nature of the cells is the primary limiting factor for using filtration as a
separation step to remove them. The low permeability of a typical cell cake results in a filtration rate
that is often too slow to be practical. In cell removal by centrifugation, the small size and low density
difference between the cells or cell debris and the medium results in a low sedimentation rate.
Flocculation of cell suspensions has been reported to aid cell separation by both filtration and
centrifugation.
Flocculation is the process whereby destabilized particles are induced to come together, make
contact, and subsequently form larger aggregates. Flocculating agents are additives capable of
increasing the degree of flocculation of a suspension. They can be organic or inorganic, and natural
or synthetic. A comprehensive review of various categories of flocculating agents can be found in.
Synthetic organic flocculating agents are by far the most commonly used agents for cell flocculation
in industrial processes. They are typically water-soluble, charged polymeric substances with average
molecular weight ranging from about 10³ to greater than 5 × 10⁶ and are generally referred to as
polyelectrolytes. The positively and negatively charged polymers are referred to as cationic and
anionic polyelectrolytes, respectively. Polyelectrolytes containing both positive and negative charges
are termed polyampholytes. Flocculation of cells by polyelectrolytes is a two-step process. The first
step is the neutralization of the surface charge on the suspended cells or cell debris. The second step
involves the linkage of these particles to form large aggregates. The various mechanisms and theories
of flocculation have been summarized. Flocculant selection for a specific cell separation process is a
challenge as many factors can impact flocculation. These factors can have their origin in the broth
(cell surface charge and size, ionic strength, pH, cell concentration, and the presence of other charged
matter), the polymer (molecular weight, charge and charge density, structure, type), and engineering
parameters (mixing and mode and order of addition). The final criteria for flocculant selection should
take into consideration all aspects of the flocculation process. These include the cost of the added
flocculant, subsequent separation performance, process robustness, and yield. In some cases,
flocculation can also provide purification by selectively removing unwanted proteins, nucleic acids,
lipids and endotoxin from the cell broth.
Flotation If no stable agglomerates are formed, cells can be separated by flotation. Here, cells are
adsorbed onto gas bubbles, rise to the top, and accumulate in a froth. An example is the separation
of single cell protein.
2.3 Concentration
The enzyme concentration in starting material is often very low. The volume of material to be
processed is generally very large, and substantial amounts of waste material must be removed.
Thus, if economic purification is to be achieved, the volume of starting material must be decreased
by concentration. Only mild concentration procedures that do not inactivate enzymes can be
employed. These include thermal methods, precipitation, and to an increasing extent, membrane
filtration.
2.3.1 Thermal Methods
Only brief heat treatment can be used for concentration because enzymes are thermolabile.
Evaporators with rotating components that achieve a thin liquid film (thin-layer evaporator,
centrifugal thin-layer evaporator) or circulation evaporators (long-tube evaporator) can be
employed.
2.3.2 Precipitation
Enzymes are very complex protein molecules possessing both ionizable and hydrophobic groups
which interact with the solvent. Indeed, proteins can be made to agglomerate and, finally, precipitate
by changing their environment. Precipitation is actually a simple procedure for concentrating
enzymes.
Precipitation with Salts High salt concentrations act on the water molecules surrounding the protein
and change the electrostatic forces responsible for solubility. Ammonium sulfate is commonly used
for precipitation and is an effective agent for concentrating enzymes. Enzymes can also be
fractionated, to a limited extent, by using different concentrations of ammonium sulfate. A
disadvantage of ammonium sulfate is that it corrodes stainless steel and cement; it also causes
additional problems in wastewater treatment. Sodium sulfate is more efficient from this point of
view, but it is less soluble and must be used at temperatures of 35–40 °C. The optimal
concentration of salt required for precipitation must be determined experimentally, and generally
ranges from 20 to 80 % saturation.
Precipitation with Organic Solvents Organic solvents influence the solubility of enzymes by
reducing the dielectric constant of the medium. The solvation effect of water molecules surrounding
the enzyme is changed; the interaction of protein molecules is increased; and therefore,
agglomeration and precipitation occur. Commonly used solvents are ethanol and acetone.
Satisfactory results are obtained only if the concentration of solvent and the temperature are
carefully controlled because enzymes can be inactivated easily by organic solvents.
Precipitation with Polymers The polymers generally used are polyethylenimines and poly(ethylene
glycols) of different molecular masses. The mechanism of this precipitation is similar to that of
organic solvents and results from a change in the solvation effect of the water molecules
surrounding the enzyme. Most enzymes precipitate at polymer concentrations ranging from 15 to 20 %.
Precipitation at the Isoelectric Point Proteins are ampholytes and carry both acidic and basic
groups. The solubility of proteins is markedly influenced by pH and is minimal at the isoelectric
point at which the net charge is zero. Because most proteins have isoelectric points in the acidic
range, this process is also called acid precipitation.
2.3.3 Ultrafiltration
A semipermeable membrane permits the separation of solvent molecules from larger enzyme
molecules because only the smaller molecules can penetrate the membrane when the osmotic
pressure is exceeded. This is the principle of all membrane separation processes, including
ultrafiltration. In reverse osmosis, used to separate materials with low molecular mass, solubility
and diffusion phenomena influence the process, whereas ultrafiltration and cross-flow filtration are
based solely on the sieve effect. In processing enzymes, cross-flow filtration is used to harvest cells,
whereas ultrafiltration is employed for concentrating and desalting.
2.4 Purification
For many industrial applications, partially purified enzyme preparations will suffice; however,
enzymes for analytical purposes and for medical use must be highly purified. Special procedures
employed for enzyme purification are crystallization, electrophoresis, and chromatography.
2.4.1 Crystallization
The rapid growth in the utilization of enzymes in commercial sectors such as agriculture and
consumer products requires a cost-effective, industrial-scale purification method. Crystallization,
one of the oldest chemical purification technologies, has the potential to fulfill these requirements.
Enzyme crystallization is the formation of solid enzyme particles of defined shape and size. An
enzyme can be induced to crystallize or form protein-protein interactions by creating solvent
conditions that result in enzyme supersaturation. The theory and history of protein crystallization
are well documented. Much of the emphasis in enzyme crystallization has focused on obtaining
crystals for X-ray diffraction analysis rather than as a purification process.
2.4.2 Electrophoresis
Electrophoresis is used to isolate pure enzymes on a laboratory scale. Depending on the conditions,
the following procedures can be used: zone electrophoresis, isotachophoresis, or porosity gradients.
The heat generated in electrophoresis and the interference caused by convection are problems
associated with a scale-up of this method. An interesting contribution to the industrial application
of electrophoresis is a continuous process in which the electrical field is stabilized by rotation.
2.4.3 Chromatography
Chromatography is of fundamental importance to enzyme purification. Molecules are separated
according to their physical properties (size, shape, charge, hydrophobic interactions), chemical
properties (covalent binding), or biological properties (biospecific affinity). In gel chromatography
(also called gel filtration), hydrophilic, cross-linked gels with pores of finite size are used in
columns to separate biomolecules. Concentrated solutions are necessary for separation because the
sample volume that can be applied to a column is limited to ca. 10 % of the column volume. In gel
filtration, molecules are separated according to size and shape. Molecules larger than the largest
pores in the gel beads, i.e., above the exclusion limit, cannot enter the gel and are eluted first.
Smaller molecules, which enter the gel beads to varying extent depending on their size and shape,
are retarded in their passage through the column and eluted in order of decreasing molecular mass.
Gel filtration is used commercially for both separation and desalting of enzyme solutions.
For hydrophobic chromatography, media derived from the reaction of CNBr-activated Sepharose
with aminoalkanes of varying chain length are suitable. This method is based on the interaction of
hydrophobic areas of protein molecules with hydrophobic groups on the matrix. Adsorption occurs
at high salt concentrations, and fractionation of bound substances is achieved by eluting with a
negative salt gradient. This method is ideally
suited for further purification of enzymes after concentration by precipitation with such salts as
ammonium sulfate.
Covalent chromatography differs from other types of chromatography in that a covalent bond is
formed between the required protein and the stationary phase.
Enzyme assays are laboratory methods for measuring enzymatic activity. They are vital for the study
of enzyme kinetics and enzyme inhibition.
2.6.4. Purification fold: This parameter indicates the degree of purity of an enzyme or protein
after it has been subjected to a series of purification steps.
Yield or Recovery: This parameter indicates the percentage of the total enzyme activity retained
after a purification step. It is calculated as the ratio (expressed as a percentage) of the total
activity of the enzyme in the purified fraction to the total activity of the enzyme in the crude,
unpurified homogenate.
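The two parameters above can be tabulated for each purification step. The sketch below is a minimal illustration, not a prescribed procedure. It assumes the usual convention that purification fold is the specific activity (units per mg protein) of a fraction divided by that of the crude homogenate. The step names and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    total_activity_units: float  # total enzyme activity in the fraction (U)
    total_protein_mg: float      # total protein in the fraction (mg)

    @property
    def specific_activity(self) -> float:
        # Units of activity per mg of protein
        return self.total_activity_units / self.total_protein_mg


def purification_table(steps):
    """Return (name, specific activity, fold, yield %) for each step.

    The first entry in `steps` is treated as the crude homogenate.
    """
    crude = steps[0]
    rows = []
    for s in steps:
        fold = s.specific_activity / crude.specific_activity
        yield_pct = 100.0 * s.total_activity_units / crude.total_activity_units
        rows.append((s.name, s.specific_activity, fold, yield_pct))
    return rows


# Hypothetical example data
steps = [
    Step("Crude homogenate", 10000.0, 2000.0),
    Step("Ammonium sulfate precipitation", 8000.0, 400.0),
    Step("Gel filtration", 6000.0, 60.0),
]

for name, sa, fold, yld in purification_table(steps):
    print(f"{name:32s} {sa:8.1f} U/mg  fold {fold:5.1f}  yield {yld:5.1f} %")
```

With these made-up values, the fold rises at each step while the yield falls, which is the usual trade-off in a purification scheme.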
Chemiluminescence of luminol
Calorimetry is the measurement of the heat released or absorbed by chemical reactions. These assays
are very general, since many reactions involve some change in heat and with use of a
microcalorimeter, not much enzyme or substrate is required. These assays can be used to measure
reactions that are impossible to assay in any other way.
Chemiluminescent
Chemiluminescence is the emission of light by a chemical reaction. Some enzyme reactions produce
light and this can be measured to detect product formation. These types of assay can be extremely
sensitive, since the light produced can be captured by photographic film over days or weeks, but can
be hard to quantify, because not all the light released by a reaction will be detected.
The detection of horseradish peroxidase by enzymatic chemiluminescence (ECL) is a common
method of detecting antibodies in western blotting. Another example is the enzyme luciferase, which
is found in fireflies and naturally produces light from its substrate luciferin.
Light scattering
Static light scattering measures the product of weight-averaged molar mass and concentration of
macromolecules in solution. Given a fixed total concentration of one or more species over the
measurement time, the scattering signal is a direct measure of the weight-averaged molar mass of the
solution, which will vary as complexes form or dissociate. Hence the measurement quantifies the
stoichiometry of the complexes as well as kinetics. Light scattering assays of protein kinetics are
a very general technique that does not require an enzyme.
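To see why the signal changes as complexes form, the weight-averaged molar mass can be computed for a mixture at fixed total concentration. This is a minimal sketch using the standard definition Mw = Σ(c_i · M_i) / Σ(c_i), with c_i as mass concentrations. The monomer mass, total concentration and dimer fractions are hypothetical values chosen only for illustration.

```python
def weight_average_molar_mass(species):
    """species: list of (mass concentration in g/L, molar mass in g/mol)."""
    total_conc = sum(c for c, _ in species)
    return sum(c * m for c, m in species) / total_conc


MONOMER_MASS = 50_000.0  # g/mol, hypothetical protein
TOTAL_CONC = 1.0         # g/L, held constant during the measurement

# Fraction of the total mass that is present as dimer
for dimer_fraction in (0.0, 0.5, 1.0):
    mixture = [
        (TOTAL_CONC * (1.0 - dimer_fraction), MONOMER_MASS),
        (TOTAL_CONC * dimer_fraction, 2.0 * MONOMER_MASS),
    ]
    mw = weight_average_molar_mass(mixture)
    print(f"dimer mass fraction {dimer_fraction:.1f}: Mw = {mw:.0f} g/mol")
```

Because the total concentration is fixed, any rise in Mw here comes only from association, which is what the scattering signal tracks.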
Microscale thermophoresis
Microscale thermophoresis (MST) measures the size, charge and hydration entropy of
molecules/substrates at equilibrium. The thermophoretic movement of a fluorescently labeled
substrate changes significantly as it is modified by an enzyme. This enzymatic activity can be
measured with high time resolution in real time. The material consumption of the all-optical MST
method is very low; only 5 μl sample volume and 10 nM enzyme concentration are needed to measure
the enzymatic rate constants for activity and inhibition. MST allows analysts to measure the
modification of two different substrates at once (multiplexing) if both substrates are labeled with
different fluorophores. Thus substrate competition experiments can be performed.
2.7.2. Discontinuous assays
In discontinuous assays, samples are taken from an enzyme reaction at intervals and the
amount of product formed or substrate consumed is measured in these samples.
Radiometric
Radiometric assays measure the incorporation of radioactivity into substrates or its release from
substrates. The radioactive isotopes most frequently used in these assays are 14C, 32P, 35S and 125I.
Since radioactive isotopes can allow the specific labelling of a single atom of a substrate, these assays
are both extremely sensitive and specific. They are frequently used in biochemistry and are often the
only way of measuring a specific reaction in crude extracts (the complex mixtures of enzymes
produced when you lyse cells).
Radioactivity is usually measured in these procedures using a scintillation counter.
Chromatographic
Chromatographic assays measure product formation by separating the reaction mixture into its
components by chromatography. This is usually done by high-performance liquid
chromatography (HPLC), but can also use the simpler technique of thin layer chromatography.
Although this approach can need a lot of material, its sensitivity can be increased by labelling the
substrates/products with a radioactive or fluorescent tag. Assay sensitivity has also been increased by
switching protocols to improved chromatographic instruments (e.g. ultra-high pressure liquid
chromatography) that operate at pump pressures a few-fold higher than HPLC instruments.
2.7.3. Factors to control in assays
Several factors affect the assay outcome, and a recent review summarizes the various parameters that
need to be monitored to keep an assay up and running.
Salt Concentration: Most enzymes cannot tolerate extremely high salt concentrations. The
ions interfere with the weak ionic bonds of proteins. Typical enzymes are active in salt
concentrations of 1-500 mM. As usual there are exceptions, such as the enzymes of
halophilic algae and bacteria.
Effects of Temperature: All enzymes work within a range of temperature specific to the
organism. Increases in temperature generally lead to increases in reaction rates. There is a
limit to the increase, because above a certain point higher temperatures lead to a sharp decrease
in reaction rates. This is due to denaturation (alteration) of the protein structure resulting
from the breakdown of the weak ionic and hydrogen bonds that stabilize the three-dimensional
structure of the enzyme active site.[16] The "optimum" temperature for human enzymes is usually
between 35 and 40 °C. Normal human body temperature is 37 °C. Human enzymes start to denature
quickly at temperatures above 40 °C. Enzymes from thermophilic archaea found in hot
springs are stable up to 100 °C. However, the idea of an "optimum" rate of an enzyme reaction
is misleading, as the rate observed at any temperature is the product of two rates, the reaction
rate and the denaturation rate. An assay measuring activity for one second would give high
activity at high temperatures; an assay measuring product formation over an hour would give low
activity at these temperatures.
Effects of pH: Most enzymes are sensitive to pH and have specific ranges of activity. All
have an optimum pH. An unfavourable pH can stop enzyme activity by denaturing (altering) the
three-dimensional shape of the enzyme through the breaking of ionic and hydrogen bonds. Most
enzymes function between a pH of 6 and 8; however, pepsin in the stomach works best at a pH of 2
and trypsin at a pH of 8.
Substrate Saturation: Increasing the substrate concentration increases the rate of reaction
(enzyme activity). However, enzyme saturation limits reaction rates. An enzyme is saturated
when the active sites of all the molecules are occupied most of the time. At the saturation
point, the reaction will not speed up, no matter how much additional substrate is added. The
graph of the reaction rate will plateau.
Level of crowding: Large amounts of macromolecules in a solution will alter the rates and
equilibrium constants of enzyme reactions, through an effect called macromolecular crowding.
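The point made under temperature, that the apparent "optimum" depends on how long the assay runs, can be illustrated numerically. The sketch below is a simplified model and not taken from the source: it assumes Arrhenius behaviour for both catalysis and denaturation, first-order loss of active enzyme, and saturating substrate. All rate constants and activation energies are hypothetical.

```python
import math

R = 8.314       # gas constant, J mol^-1 K^-1
T_REF = 310.15  # reference temperature (37 °C) in kelvin


def arrhenius(k_ref, activation_energy, temp_k):
    """Scale a rate constant from T_REF to temp_k (Arrhenius equation)."""
    return k_ref * math.exp(activation_energy / R * (1.0 / T_REF - 1.0 / temp_k))


def product_formed(temp_k, assay_seconds,
                   k_cat_ref=1.0, ea_cat=50_000.0,     # hypothetical catalysis parameters
                   k_den_ref=1e-5, ea_den=250_000.0):  # hypothetical denaturation parameters
    """Product per unit of initial enzyme after assay_seconds.

    Active enzyme decays as exp(-k_den * t), so the product formed is the
    integral of k_cat * exp(-k_den * t) from 0 to assay_seconds.
    """
    k_cat = arrhenius(k_cat_ref, ea_cat, temp_k)
    k_den = arrhenius(k_den_ref, ea_den, temp_k)
    return k_cat * -math.expm1(-k_den * assay_seconds) / k_den


for seconds in (1.0, 3600.0):
    at_37 = product_formed(310.15, seconds)
    at_70 = product_formed(343.15, seconds)
    print(f"assay {seconds:6.0f} s: 37 °C -> {at_37:9.2f}, 70 °C -> {at_70:9.2f}")
```

Under these assumed parameters the one-second assay reports more product at 70 °C, while the one-hour assay reports far more at 37 °C, because at the higher temperature the enzyme is inactivated within seconds.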
Enzymes are biocatalysts that carry out all the essential biochemical reactions inside the body of an
organism. Their unique feature is that they remain unaltered after the reaction is completed.
Therefore, they can be used again and again. The limitation of soluble enzymes, however, is the
difficulty of separating them from the product and the substrate. Most enzymes in living organisms
are attached to the cell membrane or entrapped within the cells. This observation led to the concept
that pure isolated enzymes may actually perform better when they are immobilized on a solid support.
The term immobilized enzyme is used to denote “enzymes physically confined or localized in a defined
region of space with retention of their catalytic activities and which can be used repeatedly and
continuously”. Immobilization is beneficial because it facilitates work-up and product isolation.
Some of the potential advantages and disadvantages of immobilization are highlighted below.
Soluble Enzyme + Substrate → Product (single use of enzyme)
Immobilized Enzyme + Substrate → Product (repeated use of enzyme)
Fig Adsorption
Methods of Entrapment
Inclusion in the gels: enzymes trapped in gels
Inclusion in fibers: enzymes supported in a fiber format
Inclusion in microcapsules: enzymes entrapped in microcapsules formed by monomer mixtures
such as polyamine, calcium alginate
Advantage of Entrapment method:
Fast
Cheap (low cost matrix available)
Mild conditions are required
Less chance of conformational change in the enzyme
Disadvantage of Entrapment method:
Leakage of enzyme
Pore diffusion limitation
Chance of microbial contamination
Cross linking
This method involves attachment of biocatalysts to each other by bi- or multifunctional reagents or
ligands. In this way, very high molecular weight, typically insoluble aggregates are formed.
Cross-linking is a relatively simple process. It is not a preferred method of immobilization because
it does not use any support matrix, so the aggregates are usually gelatinous and not particularly
firm. Since it involves covalent bonding, a biocatalyst immobilized in this way frequently undergoes
changes in conformation with a resultant loss of activity. Still, it finds good use in combination
with other support-dependent immobilization technologies, namely to minimize leakage of enzymes
already immobilized by adsorption.
Figure : Cross-linking
The physical and chemical properties of the support matrix and interactions of the matrix
with substrates or products change the kinetics of the immobilized enzyme.
Generally there is a decrease in the rate of the enzyme-catalyzed reaction because the matrix
restricts the diffusion of the substrate towards the enzyme.
The Km of the enzyme also changes after immobilization because of diffusion limitations. If
the matrix is positively charged and the substrate is also positively charged, electrostatic
repulsion keeps the substrate away from the vicinity of the enzyme, and hence Km is altered.
Sometimes the 3D structure of the enzyme is also changed, which also alters the
kinetic properties of the enzyme.
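The effect of an altered Km on the observed rate can be sketched with the Michaelis–Menten rate law, v = Vmax·[S] / (Km + [S]). The text does not state this rate law explicitly; it is assumed here as the standard model. The Vmax and Km values for the free and immobilized enzyme are hypothetical, chosen only to show the trend.

```python
def michaelis_menten(substrate, vmax, km):
    """Initial rate for a given substrate concentration (same units as km)."""
    return vmax * substrate / (km + substrate)


VMAX = 100.0           # hypothetical maximum rate, arbitrary units
KM_FREE = 1.0          # hypothetical Km of the soluble enzyme, mM
KM_IMMOBILIZED = 5.0   # hypothetical apparent Km after immobilization, mM

for s in (0.5, 1.0, 5.0, 50.0, 1000.0):
    v_free = michaelis_menten(s, VMAX, KM_FREE)
    v_imm = michaelis_menten(s, VMAX, KM_IMMOBILIZED)
    print(f"[S] = {s:7.1f} mM: free {v_free:6.1f}, immobilized {v_imm:6.1f}")
```

In this sketch a higher apparent Km lowers the rate mainly at low substrate concentrations. At saturating substrate both curves approach the same plateau, provided Vmax itself is unchanged, which is a further assumption.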
The performance of the immobilized enzyme can be improved further by studying the
structural changes of the immobilized enzyme
Figure Structural studies improves the performance of the immobilized enzyme
Biosensor: Biosensors are electronic monitoring devices that make use of an enzyme’s specificity
and the technique of enzyme immobilization.
Fig Biosensor
A biosensor has been developed for detecting glucose in the blood of diabetics
2.8.3.2.1.
Table 2 shows some of the immobilized enzymes used for the synthesis of various antibiotics.
Immobilized enzymes are used in the processing of food samples and in their analysis.
2.8.3.2.3. Biodiesel production Biodiesel has gained importance for its ability to replace fossil fuels
which are likely to run out within a century.
Bioremediation is a technique that involves the use of enzymes and living organisms to remove
pollutants from a contaminated site.
Figure 21: Peroxidase immobilized on support and used for continuous palm oil mill effluent
(POME) treatment
Figure 22: Removal of phenolic derivative by immobilized enzyme.
Table 5 lists some recent research about the use of enzymes in different immobilized forms for dye
and phenolic compounds removal.
An abzyme (from antibody and enzyme), also called catmab (from catalytic monoclonal antibody),
is a monoclonal antibody with catalytic activity. Molecules which are modified to gain new catalytic
activity are called synzymes. Abzymes are usually artificial constructs, but are also found in normal
humans (anti-vasoactive intestinal peptide autoantibodies) and in patients with autoimmune diseases
such as systemic lupus erythematosus, where they can bind to and hydrolyze DNA. Abzymes are
potential tools in biotechnology, e.g., to perform specific actions on DNA.
Enzymes function by lowering the activation energy of the transition state, thereby catalyzing the
formation of an otherwise less-favorable molecular intermediate between reactants and products. If
an antibody is developed to a stable molecule that's similar to an unstable intermediate of another
(potentially unrelated) reaction, the developed antibody will enzymatically bind to and stabilize the
intermediate state, thus catalyzing the reaction. A new and unique type of enzyme is produced.
HIV treatment
In a June 2008 issue of the journal Autoimmunity Reviews, researchers S. Planque, Sudhir Paul, Ph.D.,
and Yasuhiro Nishiyama, Ph.D., of the University of Texas Medical School at Houston announced
that they had engineered an abzyme that degrades the superantigenic region of the gp120 CD4
binding site. This is the one part of the HIV outer coating that does not change, because it is the
attachment point to T lymphocytes, the key cell in cell-mediated immunity. Once infected by HIV,
patients produce antibodies to the more changeable parts of the viral coat. The antibodies are
ineffective because of the virus's ability to change its coat rapidly. Because this part of gp120 is
necessary for HIV to attach, it does not change across different strains and is a point of
vulnerability across the entire range of the HIV variant population.
The abzyme does more than bind to the site: it actually destroys the site, rendering the virus
inert, and can then attach to other viruses. A single abzyme can destroy thousands of HIV particles.
Human clinical trials will be the next step in producing treatments and perhaps even preventative
vaccines and microbicides.
The rate of this reaction is promoted by enzyme catalysts that stabilize the transition state of this
reaction, thereby decreasing the activation energy and allowing for more rapid conversion of substrate
to product. In this case, the transition state is thought to involve a transient positive charge on the
sulfur atom and a double-negative charge on the periodate ion as shown below on the left.
In order to generate abzymes complementary in structure to this transition state, mice were
immunized with an aminophosphonic acid hapten, as shown above at the right. Obviously, its
structure mirrors the structure and electrostatic properties of the sulfoxide transition state. Of the
hapten-binding monoclonal antibodies produced with this hapten, many were found to catalyze
sulfide oxidation but with a wide range of binding affinities and catalytic efficiencies. In particular,
abzyme 28B4 binds hapten with high affinity (Kd = 52 nM) and exhibits a correspondingly high
degree of catalytic efficiency (k3/KM = 190,000 M⁻¹ s⁻¹).
Elucidation of the molecular structure of abzyme 28B4 bound to the hapten reveals much about the
nature of its catalytic action. Highly specific structural and electrostatic interactions create a
remarkable degree of structural complementarity between the antigen-binding site and the sulfoxide
transition state analog as illustrated in the following series of three-dimensional views of the
antibody-hapten complex.
The concept involved in amperometric enzyme biosensors is conversion of a chemical signal (in this
case, the enzyme reaction) to an analytical signal (a current) using the working electrode as the
transducer. A schematic diagram for such a sensor is shown in . The enzyme is immobilized on the
surface of an electrode, and this immobilized layer is covered by a membrane. The function of the
membrane is to provide stability, and it can also be used to prevent potential interferants from
reacting with the enzyme. The electrode assembly is placed in the solution containing the analyte,
which can readily diffuse through the membrane, and into the immobilized enzyme layer.
A more recent application of “wired” enzyme technology is its use for the cathode of a biofuel cell
(15). In this example, the cathode reaction was the four-electron reduction of oxygen to water, and
the anode reaction was the oxidation of glucose, again using a “wired” glucose oxidase electrode. In
previous fuel cells, reduction of oxygen to water has been achieved using either noble metal cathodes
at pH 0 or activated carbon cathodes at pH 14; high temperature was required in both cases. In
contrast, in this example, reduction of oxygen to water at a current density of 5 mA cm⁻² was
achieved at 37.5 °C in pH 5 citrate buffer using an enzyme electrode based on laccase.
The scheme for laccase-mediated reduction of oxygen to water is shown in . The electrode material was carbon
cloth (i.e. a large surface area electrode), to which the osmium-containing redox polymer was
covalently attached. The laccase enzyme was electrostatically bound to the osmium centers. One
major advantage of this approach is that the redox potential of the covalently-bound osmium complex
can be altered by varying the ligands. In this case, bidentate dimethyl-bipyridine and tridentate
terpyridine were used, giving the osmium complex a redox potential of +0.78 V (vs. NHE). This
value is close to the redox potential of laccase under these conditions (+0.82 V); that is, the redox
potential of the osmium complex is adjusted to minimize the overpotential required for laccase
reduction. Another advantage of this approach is that the electrode reactions are so selective that the
reactions of glucose at the cathode and oxygen at the anode are insignificant, which eliminates the
need for a membrane to separate the two electrodes into two compartments.
Enzyme multiplied immunoassay technique, or EMIT, is a common method for screening urine and
blood for drugs, both legal and illicit. First introduced by Syva Company in 1973, it was the first
homogeneous immunoassay to be widely used commercially.
A mix and read protocol has been developed that is exceptionally simple and rapid. The most widely
used applications for EMIT are for therapeutic drug monitoring (serum) and as a primary screen for
abused drugs and their metabolites (urine). The US patents covering the major aspects of the method,
3,817,837 and 3,875,011, have expired. While still sold by Siemens Healthcare under its original
trade name, EMIT, assay kits with different names that employ the same technology are supplied by
other companies. The test is not particularly accurate, especially with regard to test results for
cannabis. When the Food and Drug Administration approved EMIT, it did so with the strict provision
that positive test results should be confirmed by an alternative testing method.
UNIT – III - ENZYME AND PROTEIN ENGINEERING – SBTA5202
3. PROTEIN ENGINEERING
Protein engineering is the process of developing useful or valuable proteins. It is a young
discipline, with much research taking place into the understanding of protein
folding and recognition for protein design principles. There are two general strategies for protein
engineering, 'rational' protein design and directed evolution. These techniques are not mutually
exclusive; researchers will often apply both. In the future, more detailed knowledge of protein
structure and function, as well as advancements in high-throughput technology, may greatly expand
the capabilities of protein engineering. Eventually, even unnatural amino acids may be incorporated,
thanks to a new method that allows the inclusion of novel amino acids in the genetic code.
In rational protein design, the scientist uses detailed knowledge of the structure and function
of the protein to make desired changes. In general, this has the advantage of being inexpensive and
technically easy, since site-directed mutagenesis techniques are well-developed. However, its major
drawback is that detailed structural knowledge of a protein is often unavailable, and, even when it is
available, it can be extremely difficult to predict the effects of various mutations. Computational
protein design algorithms seek to identify novel amino acid sequences that are low in energy when
folded to the pre-specified target structure. While the sequence-conformation space that needs to be
searched is large, the most challenging requirement for computational protein design is a fast, yet
accurate, energy function that can distinguish optimal sequences from similar suboptimal ones.
Enzyme engineering
Enzyme engineering is the application of modifying an enzyme's structure (and, thus, its
function) or modifying the catalytic activity of isolated enzymes to produce new metabolites, to allow
new (catalyzed) pathways for reactions to occur, or to convert from some certain compounds into
others (biotransformation). These products will be useful as chemicals, pharmaceuticals, fuel, food,
or agricultural additives. An enzyme reactor consists of a vessel containing a reaction medium that
is used to perform a desired conversion by enzymatic means. Enzymes used in this process are free
in the solution.
APPLICATIONS.
Solid-phase peptide synthesis (SPPS), pioneered by Robert Bruce Merrifield, caused a paradigm
shift within the peptide synthesis community, and it is now the standard method for
synthesizing peptides and proteins in the lab. SPPS allows for the synthesis of natural peptides which
are difficult to express in bacteria, the incorporation of unnatural amino acids, peptide/protein
backbone modification, and the synthesis of D-proteins, which consist of D-amino acids.
Small porous beads are treated with functional units ('linkers') on which peptide chains can be
built. The peptide will remain covalently attached to the bead until cleaved from it by a reagent such
as anhydrous hydrogen fluoride or trifluoroacetic acid. The peptide is thus 'immobilized' on the solid-
phase and can be retained during a filtration process while liquid-phase reagents and by-products of
synthesis are flushed away.
The general principle of SPPS is one of repeated cycles of deprotection-wash-coupling-wash. The
free N-terminal amine of a solid-phase attached peptide is coupled (see below) to a single N-protected
amino acid unit. This unit is then deprotected, revealing a new N-terminal amine to which a further
amino acid may be attached. The superiority of this technique partially lies in the ability to perform
wash cycles after each reaction, removing excess reagent with all of the growing peptide of interest
remaining covalently attached to the insoluble resin.
The overwhelmingly important consideration is to generate extremely high yield in each step. For
example, if each coupling step were to have 99% yield, a 26-amino acid peptide would be synthesized
in 77% final yield (assuming 100% yield in each deprotection); if each step were 95%, it would be
synthesized in about 26% yield. Thus each amino acid is added in major excess (2–10×), and the
coupling of amino acids is highly optimized by a series of well-characterized agents.
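The arithmetic behind these figures is simple compounding of the per-step yield. The sketch below assumes, as the example does, one coupling step per residue (26 steps) and 100% yield in every deprotection. The function name is illustrative only.

```python
def overall_yield(step_yield, n_steps):
    """Overall yield when every step has the same fractional yield."""
    return step_yield ** n_steps


N_STEPS = 26  # one coupling per residue, as assumed in the example above

for step_yield in (0.99, 0.95):
    total = overall_yield(step_yield, N_STEPS)
    print(f"{step_yield:.0%} per step over {N_STEPS} steps -> {total:.1%} overall")
```

This reproduces the 77% figure for 99% couplings and gives roughly 26% for 95% couplings, which shows how quickly small per-step losses compound over a long sequence.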
There are two main forms of SPPS – Fmoc and Boc. Unlike ribosomal protein synthesis,
solid-phase peptide synthesis proceeds in a C-terminal to N-terminal fashion. The N-termini of
amino acid monomers are protected by either of these two groups and added onto a deprotected amino
acid chain.
Automated synthesizers are available for both techniques, though many research groups
continue to perform SPPS manually. SPPS is limited by yields, and typically peptides and proteins
in the range of 70 amino acids are pushing the limits of synthetic accessibility. Synthetic difficulty
also is sequence dependent; typically amyloid peptides and proteins are difficult to make. Longer
lengths can be accessed by using native chemical ligation to couple two peptides together with
quantitative yields.
Since its introduction over 40 years ago, SPPS has been significantly optimized. First, the
resins themselves have been optimized.[2] Furthermore, the 'linkers' between the C-terminal amino
acid and polystyrene resin have improved attachment and cleavage to the point of mostly quantitative
yields. The evolution of side chain protecting groups has limited the frequency of unwanted
side reactions. In addition, the evolution of new activating groups on the carboxyl group of the
incoming amino acid has improved coupling and decreased epimerization. Finally, the
process itself has been optimized. In Merrifield's initial report, the deprotection of the α-amino
group resulted in the formation of a peptide-resin salt, which required neutralization with
base prior to coupling. The time between neutralization of the amino group and coupling
of the next amino acid allowed for aggregation of peptides, primarily through the formation of
secondary structures, and adversely affected coupling. The Kent group showed that concomitant
neutralization of the α-amino group and coupling of the next amino acid led to improved coupling.
Each of these improvements has helped SPPS become the robust technique that it is today.
Novel Proteins
Pets such as dogs and cats are often fed diets consisting of chicken, beef, lamb and fish. If an
intolerance develops, it may be hard to discern exactly what ingredient causes the problem. For these
pets, offering a diet consisting of novel proteins such as bison, duck, rabbit, fish they haven't eaten
before, venison, kangaroo or egg, either alone or with a single carbohydrate source such as potato or
rice, can help resolve the problem.
Increasing the production and use of novel proteins is one of our research lines to help
build new protein value chains. Durable success and impact of these value chains depend entirely
on the ability to use the technical and nutritional functionality of proteins in a final product, and – even
more important – on the acceptance of novel proteins by consumers and regulatory bodies. A typical
example of a new value chain is the FoodWaste2Feed project. The acceptance of novel proteins by
consumers is studied in several programs, focussing on various protein sources and on different target
groups. Most novel proteins to be used as food or food ingredient will have to be approved prior to
market introduction under the Novel Food Regulation (Regulation (EC) No 258/97). Wageningen
UR developed a Guideline for producers of novel proteins on how to fill the NFR application dossiers
(Guideline LEI 14-075). There is no pre-market authorisation procedure for novel feed ingredients
derived from non-animal sources; a notification will suffice (Regulation (EC) No 767/2009). For
proteins derived from animals, like insect proteins, the situation is rather complex and depends, among
other things, on the destination of the feed (food producing animal or not).
The method quite simply involves template DNA - the DNA to be mutated, usually bacterial
DNA - and an oligonucleotide carrying the reverse complement of the desired mutation which can
anneal to the template and be used as a primer for DNA synthesis. For instance, if a TGG codon is
present in the bacterial DNA, and the desired mutation is AAT, then the oligonucleotide primer
should read ATT (all read from 5' to 3'). Once the mutagenic primer is annealed to its template, the
complete structure is called a heteroduplex, owing to the differences between the strands. The
heteroduplex is used to transform a cell - most often E. coli - and it is left there overnight.
In theory, both strands of the heteroduplex should be replicated at equal frequency to give a 50/50
mixture of mutant to template DNA in the cell. In practice, mutant recovery in this way is poor for
two reasons: firstly, because of the cell's intrinsic mismatch repair system, and secondly, because
the template DNA is methylated, and methylated DNA is preferentially replicated by the host cell
machinery. Consequently, higher-efficiency mutagenesis approaches have been developed to raise
the percentage mutant recovery from around 0.1% to as high as 50%. These approaches work on
the principle that once the template DNA has been used to copy the mutant strand it is of no further
use, and can only hinder mutant recovery.
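The primer rule above (the oligonucleotide carries the reverse complement of the desired mutation) can be checked with a short sketch. This is only an illustration: the flanking sequences and the helper name mutagenic_primer are hypothetical and do not come from these notes.

```python
# Watson-Crick pairing for the four DNA bases (uppercase only in this sketch).
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def mutagenic_primer(left_flank: str, desired_codon: str, right_flank: str) -> str:
    """Build a primer that anneals to the template strand around the mutation.

    The mutated coding sequence is left_flank + desired_codon + right_flank;
    the primer is its reverse complement. The flanks are hypothetical inputs.
    """
    return reverse_complement(left_flank + desired_codon + right_flank)


# Example from the text: desired codon AAT, so the primer reads ATT (5' to 3').
print(reverse_complement("AAT"))  # ATT

# Hypothetical flanking sequences, for illustration only.
print(mutagenic_primer("GCTAGC", "AAT", "TTGACC"))
```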
Basic mechanism
The basic procedure requires the synthesis of a short DNA primer. This synthetic primer
contains the desired mutation and is complementary to the template DNA around the mutation site
so it can hybridize with the DNA in the gene of interest. The mutation may be a single base change
(a point mutation), multiple base changes, deletion, or insertion. The single-strand primer is then
extended using a DNA polymerase, which copies the rest of the gene. The gene thus copied contains
the mutated site, and is then introduced into a host cell as a vector and cloned. Finally, mutants are
selected by DNA sequencing to check that they contain the desired mutation.
The original method using single-primer extension was inefficient due to a low yield of
mutants. The resulting mixture contains both the original unmutated template and the mutant
strand, producing a mixed population of mutant and non-mutant progeny. Furthermore, the template
used is methylated while the mutant strand is unmethylated, and the mutants may be counter-selected
due to the presence of a mismatch repair system that favors the methylated template DNA, resulting in
fewer mutants. Many approaches have since been developed to improve the efficiency of
mutagenesis.
Quikchange is one high-efficiency mutagenesis approach that has been developed. The
plasmid template is denatured and mutagenic primers are annealed to each strand. The primers are
de-phosphorylated so that although they can be extended, there cannot be ligation between the end
of the synthesised strand and the start of the primer. Once DNA synthesis is complete, the template
DNA and mutagenic DNA are denatured in each of the two PCR-like products. The mutant strands
cannot be reused for DNA synthesis because when the primers anneal to them, they have no
template material to copy. Instead the parental strands are reused: one new mutagenic primer is
added to each, DNA is synthesised, the products are denatured and then the parental strands used
again. Unlike conventional PCR, which makes products exponentially, this is a linear amplification:
two new mutated strands are made in each 'cycle'. At the end of this amplification period, the
parental templates are recognised by a methylation-specific restriction enzyme called Dpn I.
Because the mutated DNA is unmethylated, it goes unrecognised by this enzyme and remains
intact. The consequence is that parental DNA cannot be used to transform E. coli, while mutant
DNA can. Although the mutant products are all linear at this point, their lengthy (approx. 40 nt)
complementary primers can anneal to give a double-stranded circular plasmid containing a
homoduplex of only mutant DNA. The E. coli repair system will complete ligation where the
dephosphorylated primers fail to do this, to form complete plasmids ready for host cell replication.
2. Mix the template plasmid, primers, dNTPs and a thermostable polymerase and run for 16-25
thermal cycles
6. Sequence the plasmid to ensure that the mutation has been correctly inserted
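The difference between the linear amplification described for Quikchange and the exponential amplification of conventional PCR can be made concrete with a small calculation. This is a sketch under idealised assumptions (every cycle is 100% efficient, and the function names are invented for illustration).

```python
def quikchange_mutant_strands(cycles: int, templates: int = 1) -> int:
    """Mutant strands after linear amplification.

    Only the two parental strands of each template plasmid are copied,
    so each cycle adds two new mutant strands per template.
    """
    return 2 * templates * cycles


def conventional_pcr_duplexes(cycles: int, templates: int = 1) -> int:
    """Product duplexes after idealised exponential PCR (doubling each cycle)."""
    return templates * 2 ** cycles


# Compare the two regimes over the 16-25 cycle range mentioned in the protocol.
for n in (1, 5, 16, 25):
    print(n, quikchange_mutant_strands(n), conventional_pcr_duplexes(n))
```

The modest yield of the linear scheme is one reason the parental template has to be removed before transformation.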
3.6. Uracil-containing DNA method
This approach is based on the simple notion that (deoxy)uracil is not a usual component of DNA. It
involves the following protocol:
1. Grow the template DNA to contain a high proportion of deoxyuracil (dU) by growing it in an E.
coli mutant:
- dut- (which lacks dUTPase; an enzyme which normally prevents the incorporation of uracil into
DNA)
- ung- (which lacks uracil glycosylase; an enzyme which normally removes uracil from DNA)
In vivo: transform the heteroduplex DNA into a wild-type E. coli which retains its uracil
glycosylase function (ung+). The template DNA, which is rich in dU, will then be repaired using
the newly-incorporated mutant strand as a template. The product is a homoduplex DNA containing
only mutant strands.
In vitro: extract the heteroduplex DNA from E. coli and treat with uracil glycosylase to remove the
parental DNA. Then synthesise a new strand with DNA polymerase and dNTPs, using the mutant
strand as a template. Both of these are performed in the test tube. The product, again, is a
homoduplex DNA containing only mutant strands.
Cassette mutagenesis is a technique employed to introduce multiple mutations to the same region of
DNA. A cassette (block of DNA) is designed to contain all of the desired mutations and then given
ligatable ends to facilitate its insertion into the wild-type DNA. Quikchange, described above, can
be used to generate suitable restriction sites for its insertion, and the cassette should have both 5'
phosphorylation and 4-base 'sticky' overhangs at each end in order to encourage its insertion into
the host molecule.
Sticky feet PCR is used to generate insertional mutations in the wild-type DNA. The
mutagenic primer contains a series of bases (the desired insertion) which is not present in the
template DNA. Because it cannot form complementary pairs with the template DNA upon
annealing, the desired insertion 'loops out'. When the primer is extended, to generate heteroduplex
DNA, the parental strand is digested, as usual, using Dpn I. This leaves a single-stranded mutant
strand, containing the insertion, which then itself acts as a template for new DNA synthesis; the
nascent DNA strand will contain the complement of the insertion. The product is homoduplex DNA
containing both strands with the insertional mutation.
The size of insertion that can be generated by sticky feet is limited, however, by the size of
oligonucleotide primer that can be accurately synthesised (certainly no more than 80 nucleotides in
length). Deletions are performed in a similar manner, except the mutagenic primer lacks the bases
which need to be deleted (it contains only the flanking sequences). Upon annealing, this causes the
deletion bases in the template DNA to 'loop out' because they have nothing to anneal to. Unlike
with insertions, there is no size limitation to deletions because the oligonucleotide only need be big
enough to correspond to the flanking regions of the desired site of deletion.
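As a rough sketch of the primer layouts just described, the code below assembles an insertion primer (flank, insertion, flank) and a deletion primer (flanks only), and enforces the 80-nucleotide ceiling quoted above. Strand orientation is ignored for simplicity, and the function names and sequences are hypothetical.

```python
MAX_PRIMER_NT = 80  # upper limit for accurate oligonucleotide synthesis quoted in the text


def insertion_primer(left_flank: str, insertion: str, right_flank: str) -> str:
    """Primer whose central insertion has no partner in the template and loops out."""
    primer = left_flank + insertion + right_flank
    if len(primer) > MAX_PRIMER_NT:
        raise ValueError(
            f"primer is {len(primer)} nt; the insertion is too large for one oligonucleotide"
        )
    return primer


def deletion_primer(left_flank: str, right_flank: str) -> str:
    """Primer containing only the flanks; the deleted template bases loop out."""
    return left_flank + right_flank


# Hypothetical 20 nt flanks around a 12 nt insertion.
left, right = "ACGTACGTACGTACGTACGT", "TTGGCCAATTGGCCAATTGG"
print(len(insertion_primer(left, "GGATCCGGATCC", right)))
print(len(deletion_primer(left, right)))
```

Note that the deletion primer length does not depend on how many bases are removed, which is why deletions have no comparable size limit.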
In animal studies, alkylating agents such as N-ethyl-N-nitrosourea (ENU) have been used to generate
mutant mice. Ethyl methanesulfonate (EMS) is also often used to generate animal and plant
mutants. Random mutagenesis is an incredibly powerful tool for altering the properties of enzymes.
Imagine, for example, you were studying a G-protein coupled receptor (GPCR) and wanted to create
a temperature-sensitive version of the receptor or one that was activated by a different ligand than
the wild-type.
1. Error-prone PCR. This approach uses a “sloppy” version of PCR, in which the polymerase has a
fairly high error rate (up to 2%), to amplify the wild-type sequence. The PCR can be made error-
prone in various ways, including increasing the MgCl2 in the reaction, adding MnCl2 or using unequal
concentrations of each nucleotide.
After amplification, the library of mutant coding sequences must be cloned into a suitable plasmid.
The drawback of this approach is that the size of the library is limited by the efficiency of the cloning
step. Although point mutations are the most common type of mutation in error-prone PCR, deletions
and frameshift mutations are also possible. There are a number of commercial error-prone PCR kits
available, including those from Stratagene and Clontech.
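To get a feel for what an error rate of this size means for a library, the sketch below estimates the mutation load per clone. It assumes the quoted rate is a per-nucleotide mutation frequency in the final product and that positions mutate independently; both are simplifying assumptions, and the gene lengths are arbitrary examples.

```python
def expected_mutations(gene_length_nt: int, per_base_rate: float) -> float:
    """Mean number of mutations per copy, assuming independent positions."""
    return gene_length_nt * per_base_rate


def fraction_unmutated(gene_length_nt: int, per_base_rate: float) -> float:
    """Probability that a copy carries no mutation at all."""
    return (1.0 - per_base_rate) ** gene_length_nt


# Hypothetical gene lengths at the upper-bound rate (2%) and at a gentler 0.2%.
for length in (300, 1000, 3000):
    for rate in (0.02, 0.002):
        print(length, rate,
              round(expected_mutations(length, rate), 1),
              f"{fraction_unmutated(length, rate):.3g}")
```

In practice the conditions are usually tuned so that each clone carries only a few mutations, since heavily mutated copies are mostly non-functional.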
2. Rolling circle error-prone PCR is a variant of error-prone PCR in which the wild-type sequence is
first cloned into a plasmid, then the whole plasmid is amplified under error-prone conditions. This
eliminates the ligation step that limits library size in conventional error-prone PCR, but
amplification of the whole plasmid is less efficient than amplifying the coding sequence alone.
3. Mutator strains. In this approach the wild-type sequence is cloned into a plasmid and transformed
into a mutator strain, such as Stratagene’s XL1-Red. XL1-Red is an E. coli strain whose deficiency in
three of the primary DNA repair pathways (mutS, mutD and mutT) causes it to make errors during
replication of its DNA, including the cloned plasmid. As a result, each copy of the plasmid replicated
in this strain has the potential to be different from the wild-type. One advantage of mutator strains is
that a wide variety of mutations can be incorporated, including substitutions, deletions and frame-
shifts. The drawback with this method is that the strain becomes progressively sick as it accumulates
more and more mutations in its own genome, so several steps of growth, plasmid isolation,
transformation and re-growth are normally required to obtain a meaningful library.
4. Temporary mutator strains. Temporary mutator strains can be built by over-expressing a mutator
allele such as mutD5 (a dominant negative version of mutD), which limits the cell’s ability to repair
DNA lesions. By expressing mutD5 from an inducible promoter it is possible to allow the cells to
cycle between mutagenic (mutD5 expression on) and normal (mutD5 expression off) periods of
growth. The periods of normal growth allow the cells to recover from the mutagenesis, which allows
these strains to grow for longer than conventional mutator strains. If a plasmid with a temperature-
sensitive origin of replication is used, the mutagenic plasmid can easily be removed to restore normal
DNA repair, allowing the mutants to be grown up for analysis/screening. As far as I am aware there are no
commercially available temporary mutator strains.
5. Insertion mutagenesis. Finnzymes have a kit that uses a transposon-based system to randomly
insert a 15-base pair sequence throughout a sequence of interest, be it an isolated insert or plasmid.
This inserts 5 codons into the sequence, allowing any gene with an insertion to be expressed (i.e. no
frame-shifts or stop codons are caused). Since the insertion is random, each copy of the sequence will
have different insertions, thus creating a library.
6. Ethyl methanesulfonate (EMS) is a chemical mutagen. EMS alkylates guanine residues,
causing them to be incorrectly copied during DNA replication. Since EMS directly chemically
modifies DNA, EMS mutagenesis can be carried out either in vivo (i.e. whole-cell mutagenesis) or
in vitro.
7. Nitrous acid is another chemical mutagen. It acts by de-aminating adenine and cytosine residues,
causing transition point mutations (A/T to G/C and vice versa). Note: I have only mentioned two
chemical mutagens but there are many others. Hirokazu Inoue has written an excellent article
describing some of them and their use in mutagenesis.
8. DNA shuffling is a very powerful method in which members of a library (i.e. copies of the same
gene, each with different types of mutation) are randomly shuffled. This is done by randomly
digesting the library with DNase I, then randomly re-joining the fragments using self-priming PCR.
Shuffling can be applied to libraries produced by any of the above methods and allows the effects of
different combinations of mutations to be tested.
PICHIA PASTORIS
Yeast is another traditional, powerful tool for expressing recombinant proteins and has been
used successfully to express a multitude of proteins. Yeast has many of the advantageous features of
E. coli such as a short doubling time and a readily manipulated genome, but also has the additional
benefits of a eukaryote, which include improved folding and most posttranslational modifications. The
first yeast routinely used for recombinant protein expression was Saccharomyces cerevisiae.
However, in the last 15 years, P. pastoris has become the yeast of choice because it typically permits
higher levels of recombinant protein expression than does S. cerevisiae. P. pastoris is a methylotrophic
yeast, and can use methanol as its only carbon source. The growth of P. pastoris in methanol-
containing medium results in the dramatic transcriptional induction of the genes for alcohol oxidase
(AOX) and dihydroxyacetone synthase. After induction, these proteins comprise up to 30% of the P.
pastoris biomass. Investigators have exploited this methanol-dependent gene induction by
incorporating the strong, yet tightly regulated, promoter of the alcohol oxidase I (AOX1) gene into
the majority of vectors for expressing recombinant proteins. The P. pastoris expression vectors
integrate into the genome, whereas S. cerevisiae vectors use the less stable method of
replicating episomally. The length of time to assess recombinant gene expression with the P. pastoris
method is approximately 3–4 weeks which includes the transformation of yeast, screening the
transformants for integration, and an expression timecourse. An appealing feature of P. pastoris is the
extremely high cell densities achievable under appropriate culture conditions. Using inexpensive
medium, the P. pastoris culture can reach 120 g/l of dry cell weight density. An important caveat is
that the induction medium requires a low percentage of methanol. In large-scale cultures, the amount
of methanol becomes a fire hazard requiring a new level of safety conditions.
P. pastoris has been used to obtain both intracellular and secreted recombinant proteins. Like
other eukaryotes, it efficiently generates disulfide bonds and has successfully been used to express
proteins containing many disulfide bonds. To facilitate secretion, the recombinant protein must be
engineered to carry a signal sequence. The most commonly used signal sequence is the pre-pro
sequence from S. cerevisiae α-mating factor. Because P. pastoris secretes few endogenous proteins,
purification of the recombinant protein from the medium is a relatively simple task. If proteolysis of
the recombinant protein is a concern, expression can be completed using the pep4 protease-deficient
strain of P. pastoris. This strain has reduced vacuolar peptidase
A activity, which is responsible for activation of carboxypeptidase Y and protease B1.
Yeast has the posttranslational capacity to add glycans at both specific asparagine residues
(N-linked) and serine/threonine residues (O-linked). These glycan structures are substantially
different from the modifications added by insect and mammalian cells. In P. pastoris the N-linked
glycan is a high mannose type and usually contains 8–17 mannoses, which is quite different from S.
cerevisiae structures that consist of approximately 50–150 mannose residues. Similar to insect and
mammalian cells, the consensus sequence for N-linked glycans in yeast is Asn-Xaa-Ser/Thr. Two
groups have completed extensive engineering to create P. pastoris strains that produce complex
N-linked glycan structures comparable to those produced by mammalian cells. However, only the strains
developed by Roland Contreras’ group are available to investigators and must be licensed through
Research Corporation Technologies. The O-linked structures in P. pastoris have not been studied
comprehensively but are known to be formed by the addition of one to four mannose residues to
serines/threonines. Several reports have indicated that expression of certain proteins in P. pastoris
resulted in the addition of O-linked glycans not observed when the protein was expressed endogenously
in mammalian cells.
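Candidate N-linked glycosylation sites can be located in a protein sequence by scanning for the Asn-Xaa-Ser/Thr consensus mentioned above. The sketch below is illustrative only: the function name and the test sequence are invented, and the exclusion of proline at the Xaa position is a commonly used refinement that is not stated in these notes. A matching sequon is necessary but not sufficient for glycosylation.

```python
import re

# Asn, then any residue except Pro (an added assumption), then Ser or Thr.
# The lookahead lets overlapping sequons (e.g. N-N-T-S) all be reported.
SEQUON = re.compile(r"(?=(N[^P][ST]))")


def find_sequons(protein: str):
    """Return (1-based position of Asn, tripeptide) for each candidate site."""
    return [(m.start() + 1, m.group(1)) for m in SEQUON.finditer(protein.upper())]


# Hypothetical one-letter-code sequence, for illustration only.
print(find_sequons("MNGSANPTNNTS"))
```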
Baculovirus/Insect Cells
Baculovirus-mediated expression in insect cells offers another useful tool for generating
recombinant proteins. Baculovirus is a lytic, large (130 kb), double-stranded DNA virus, and the
Autographa californica virus is the most commonly used baculovirus isolate for recombinant
expression. Baculovirus is routinely amplified in insect cell lines derived from the fall armyworm
Spodoptera frugiperda (Sf 9, Sf 21), and recombinant protein expression is completed either in the
aforementioned lines or in a line derived from the cabbage looper Trichoplusia ni (High-Five).
Originally, creating recombinant baculoviruses involved cotransfecting the gene of interest flanked
by baculovirus sequence with baculovirus DNA into insect cells, and screening for rare homologous
recombination events. Recombinants were identified by screening plaques with a modified
morphology, and often additional rounds of plaque screening were required to ensure that the
recombinant viral preparation was not contaminated with wild-type virus. This lengthy and laborious
process for generating recombinant viruses has been largely replaced by using site-specific
transposition (Bac-to-Bac or BaculoDirect, Invitrogen) or an improved homologous recombination
method with an engineered baculovirus containing a lethal mutation in
orf1629 (flashBAC from Oxford Expression Technologies or BacMagic from EMD-Novagen). Both
of these approaches overcome the requirement to isolate plaques because the efficiency of
recombination is 100%. Following one or two rounds of amplifying the recombinant baculovirus, the
investigator can quantify the concentration of the baculovirus stock either by the plaque assay or by using
the newer, more rapid real-time PCR or antibody-based assays. The improvements in creating and
quantifying recombinant baculoviruses have dramatically reduced the time for evaluating baculovirus
expression to approximately 3 weeks, including a time-course study for optimizing expression.
The most common promoters used with baculovirus expression are the polH and p10
promoters, both of which induce a high level of expression in the very late phase of the baculovirus
infection. During this phase, cells undergo cell death with the concomitant release of proteases, which
can result in degradation of the expressed recombinant protein. To reduce proteolysis of the
recombinant protein, promoters active in earlier phases of the lytic cycle such as the basic promoter
have been used. Alternatively proteolytic activity can be minimized by using constructs deleted in
the chiA and v-cath genes, which encode chitinase and a cathepsin protease, respectively.
Baculovirus-mediated expression is routinely used to generate both cytoplasmic and secreted
recombinant proteins. Efficient secretion generally requires the presence of a signal peptide. Both
insect and mammalian signal sequences can promote entry into the insect cell secretory pathway.
Insect cells were originally grown in serum-containing medium which complicated purification of
the secreted proteins. Recent advances in media development permit the replacement of serum with
protein hydrolysates derived from either animal tissues or plants, thereby greatly simplifying protein
purification. However, the high cost of these specialized media can limit their use for large-scale
bioproduction. Insect cells efficiently generate disulfide bonds in recombinant proteins. They also
produce the majority of the posttranslational modifications found in mammalian cells. However, the
N-linked glycan structure formed in most insect cells is predominantly the fucosylated paucimannose
structure (Man3GlcNAc2-N-Asn). This finding has prompted the recent generation of insect cell
lines that produce glycoproteins with the complex N-linked glycans normally found in mammalian
cells. A transgenic Sf-9 insect line expressing several glycosyltransferases is commercially available
(Mimic cell line, Invitrogen) and produces N-linked glycans containing a biantennary, sialylated
structure. There are only a few reports describing the O-linked glycan structures generated by insect
cells.
MAMMALIAN CELLS
Mammalian expression methods have conventionally been considered to be the least efficient
vehicle for expressing recombinant proteins. However, recent advances have significantly improved
the expression levels from mammalian cell lines. For example, stably transfected Chinese hamster
ovary (CHO) cells have been reported to express recombinant antibodies up to a level of a few grams
per liter. While many cell lines and expression strategies have been tested, this chapter will focus on
transient transfection in human embryonic kidney (HEK293) cells and stable transfection with CHO
cells.
The HEK293 cell line was derived from human embryonic kidney cells transformed with
adenovirus. HEK293 cells can be transiently transfected with a high efficiency (>80%) using certain
cationic lipids, calcium phosphate, or polyethyleneimine as transfection reagents. For large-scale
transient transfections (>100 ml), calcium phosphate or polyethyleneimine reagents are more cost-
effective options when compared to cationic lipids. Transient transfections have been performed at
even the bioreactor level but for most laboratories this scale is technically challenging. The transient
transfection method is relatively easy, and the evaluation for a given recombinant protein can be
made in less than 2 weeks.
CHO cells are commonly used for mammalian expression when large quantities of
recombinant protein are needed. For example, most therapeutic antibodies currently on the market
are manufactured using this method. The standard method for stable CHO expression involves
transfecting dihydrofolate reductase (DHFR)-deficient CHO cells with a DHFR selection cassette
along with an expression cassette containing the gene of interest. Dihydrofolate reductase converts
dihydrofolate into tetrahydrofolate which is required for the de novo synthesis of purines, certain
amino acids, and thymidylic acid. Methotrexate, which binds and inhibits DHFR, is used as a
selection agent and only those cells that have integrated the DHFR selection cassette will survive.
Sequentially increasing the concentration of methotrexate will result in amplification of the DHFR
gene along with the linked gene of interest. Following at least one round of selection with the drug
methotrexate, the stably transfected pools are subcloned using limiting dilution cloning into multiwell
plates. Typically only a small percentage of the screened subclones will be expressing the
recombinant gene at a high level since in the majority of the clones, the expression cassette has
integrated into the heterochromatin region which is transcriptionally inactive. Unfortunately, the
entire selection and screening process takes at least 2–3 months, making this the major drawback of
the CHO method. However, recent high-throughput methods based on flow
cytometry or automation have increased the ease in rapidly screening and selecting high expressing
clones. Another development has been to use specific cis-acting DNA elements flanking the
recombinant gene cassette that confer active transcription to integration sites. Unfortunately, the
majority of these DNA elements are owned by companies and must be licensed for use in the
laboratory, and, even with the aforementioned advances in CHO expression, the timelines for
generating a high expressing CHO clone have not changed considerably.
Mammalian expression systems are used primarily to generate secreted rather than
intracellular recombinant proteins. Serum-free media have been developed for both the CHO and
HEK293 cell lines, which simplifies the purification of secreted recombinant proteins. However, the
cost of the media is quite high, making large-scale bioproduction rather costly. Mammalian cells
provide the best folding and disulfide bond formation when compared to other expression
hosts. The N-linked and O-linked glycan structures formed by mammalian cells are extremely varied
and are not only dependent on the protein but also on the mammalian cell type used as the expression
host. Furthermore, the cell culture conditions such as nutrient content, pH, temperature, oxygen levels
and ammonia concentration can significantly affect the glycosylation profile. N-linked glycosylation
can result in oligomannose, hybrid, and complex structures, and the structures all contain the
Man3GlcNAc2 core. The oligomannose glycans can have two to six additional mannoses and the
mannoses can be phosphorylated or sulfated. The most common complex structures have two to four
Galβ1,4-GlcNAc units attached to the mannoses, which results in bi-, tri-, and tetra-antennary branches.
The branches can terminate with sialic acid, and fucose can also be attached to the structures. Hybrid
structures contain features of both the oligomannose and complex structures. O-Glycosylation
structures can be classified into eight types based on their core structures: O-GalNAc-type
glycosylation, O-GlcNAc-type glycosylation, O-fucosylation, O-mannosylation, O-glucosylation,
phosphoglycosylation, O-glycosaminoglycan-type glycosylation, and collagen-type glycosylation.
3.12. Protein structure: Crystallography
In all forms of microscopy, the amount of detail or the resolution is limited by the wavelength
of the electro-magnetic radiation used. With light microscopy, where the shortest wavelength is about
300 nm, one can see individual cells and sub-cellular organelles. With electron microscopy, where
the wavelength may be below 10 nm, one can see detailed cellular architecture and the shapes of
large protein molecules. In order to see proteins in atomic detail, we need to work with electro-
magnetic radiation with a wavelength of around 0.1 nm or 1 Å, in other words we need to use X-rays.
In light microscopy, the subject is irradiated with light and causes the incident radiation to be
diffracted in all directions. The diffracted beams are then collected, focused and magnified by the
lenses in the microscope to give an enlarged image of the object. The situation with electron
microscopy is similar, except that in this case the diffracted beams are focused using magnets. Unfortunately,
it is not possible to physically focus an X-ray diffraction pattern, so it has to be done mathematically
and this is where the computers come in. The diffraction pattern is recorded using some sort of
detector which used to be X-ray sensitive film, but nowadays is usually an image plate or a charge-
coupled device (CCD).
The diffraction from a single molecule would be too weak to be measurable. So we use an
ordered three-dimensional array of molecules, in other words a crystal, to magnify the signal. Even
a small protein crystal might contain a billion molecules. If the internal order of the crystal is poor,
then the X-rays will not be diffracted to high angles or high resolution and the data will not yield a
detailed structure. If the crystal is well ordered, then diffraction will be measurable at high angles or
high resolution and a detailed structure should result. The X-rays are diffracted by the electrons in
the structure and consequently the result of an X-ray experiment is a 3-dimensional map showing the
distribution of electrons in the structure.
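The link between "high angles" and "high resolution" can be illustrated with Bragg's law, d = λ / (2 sin θ), a standard relation that is not stated explicitly in these notes. The sketch below converts a scattering angle 2θ into the corresponding d-spacing; the function name and the example angles are arbitrary choices, and the default wavelength of 1 Å follows the figure given above.

```python
import math


def resolution_angstrom(two_theta_deg: float, wavelength_angstrom: float = 1.0) -> float:
    """d-spacing (in Angstrom) of a reflection observed at scattering angle 2-theta.

    Uses Bragg's law with n = 1: d = wavelength / (2 * sin(theta)).
    """
    theta = math.radians(two_theta_deg / 2.0)
    return wavelength_angstrom / (2.0 * math.sin(theta))


# Reflections recorded further from the direct beam correspond to finer detail.
for two_theta in (10, 20, 30, 60):
    print(two_theta, round(resolution_angstrom(two_theta), 2))
```

A poorly ordered crystal that only diffracts to small angles therefore yields only large d-spacings, that is, a low-resolution map.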
A crystal behaves like a three-dimensional diffraction grating, which gives rise to both
constructive and destructive interference effects in the diffraction pattern, such that it appears on the
detector as a series of discrete spots which are known as reflections. Each reflection contains
information on all atoms in the structure and conversely each atom contributes to the intensity of each
reflection. As with all forms of electro-magnetic radiation, X-rays have wave properties, in other
words they have both an amplitude and a phase. In order to recombine a diffraction pattern, both of
these parameters are required for each reflection. Unfortunately, only the amplitudes can be recorded
experimentally; all phase information is lost. This is known as "the phase problem". When
crystallographers say they have solved a structure, it means that they have solved "the phase
problem". In other words they have obtained phase information sufficient to enable an interpretable
electron density map to be calculated.
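A one-dimensional toy calculation shows why the missing phases matter. The sketch below is not a crystallographic program: the "electron density" is an invented list of eight numbers and the transform is a plain discrete Fourier transform. It shows that two different densities can give identical amplitudes, and that amplitudes without phases do not reproduce the original density.

```python
import cmath


def structure_factors(density):
    """Discrete Fourier transform of a 1-D density (one complex value per reflection h)."""
    n = len(density)
    return [
        sum(rho * cmath.exp(2j * cmath.pi * h * x / n) for x, rho in enumerate(density))
        for h in range(n)
    ]


def density_from(factors):
    """Inverse transform: rebuild the density from amplitudes AND phases."""
    n = len(factors)
    return [
        (sum(f * cmath.exp(-2j * cmath.pi * h * x / n) for h, f in enumerate(factors)) / n).real
        for x in range(n)
    ]


density = [0, 5, 1, 0, 0, 2, 0, 0]    # invented toy "electron density"
shifted = density[3:] + density[:3]   # same molecule, moved within the cell

F = structure_factors(density)
G = structure_factors(shifted)

# The measurable amplitudes are identical; only the phases differ.
amplitudes_equal = all(abs(abs(a) - abs(b)) < 1e-9 for a, b in zip(F, G))

recovered = density_from(F)                        # amplitudes + phases
amplitude_only = density_from([abs(f) for f in F])  # phases discarded (set to zero)

print(amplitudes_equal)
print([round(v, 3) for v in recovered])
print([round(v, 3) for v in amplitude_only])
```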
Firstly we need to obtain a pure sample of our target protein. We can do this by either isolating it
from its source, or by cloning its gene into a high expression system. The sample then needs to be
assessed for suitability according to the following criteria:
1. Is it pure and homogeneous? We can test this by various electrophoretic methods and mass
spectrometry.
2. Is the protein soluble and folded? If protein estimations suggest that a lot of protein is being
lost, then it may be due to precipitation. The degree of ordered secondary structure can be
tested with circular dichroism; if this is very low then the protein may be misfolded. This
may occur if the protein is being produced faster than it can fold and may result in the
formation of insoluble inclusion bodies. Attenuating the induction can alleviate this problem,
e.g. by using a lower temperature.
3. Is the sample monodisperse? In other words, is the sample free from aggregation? This can
be monitored using a dynamic light scattering (DLS) device.
4. Is the protein still active? Check with activity assays.
5. Is the sample stable? Occasionally good protein crystals will form overnight at room
temperature, but usually it may take several days to one or two weeks before suitable
crystals can grow. Therefore, ideally the sample needs to remain stable over that period.
If the sample fails one or more of the above criteria, it may be worthwhile returning to the expression
and purification protocols and trying something different, such as the addition of ligands known to
interact with the protein, or adding extra purification steps. In extreme cases it may be worthwhile
switching to a different expression system altogether or working with a mutated or truncated
construct. It may be possible to refold protein successfully using chaotropic reagents such as urea.
Aggregated or polydisperse samples may be made monodisperse by simply changing pH or adding
some salt. However, without DLS, this is very difficult to assess.
Crystallization
Before beginning trials the sample needs to be concentrated and transferred to dilute buffer
containing little or no salt, if the protein is happy under these conditions. This can easily be achieved
using centrifugal concentrators. In order to screen a reasonable number of conditions we need at
least 200 µl of protein at 10 mg/ml. If this is not the case then you may need to scale up the
expression and purification to make it so.
If a similar protein has already been crystallized then it is definitely worth trying the
conditions used to grow crystals of this protein. In any case, if enough material is available, one would
normally subject it to one or more sparse matrix screens. To date the total number of different
conditions in our repertoire of screens comes to about 400.
We normally use tissue culture trays to set up crystallizations with up to 24 different
conditions per tray. The method used is hanging drop vapour diffusion, which has the advantage of
being the least expensive on protein. The set-up is as follows:
The well is prepared first and usually contains 1 ml of a buffered precipitant solution such as
polyethylene glycol (PEG) or ammonium sulfate, or even a mixture of PEG and salt. Sometimes additives
such as detergents or metal ions, which may enhance the crystallization, are also included. Then 1 µl
of the concentrated protein sample is pipetted onto a siliconized coverslip, followed by 1 µl of the
well solution. The coverslip is then inverted over the well and sealed using a bead of vacuum grease.
This is then left undisturbed for at least 24 hours to equilibrate. At the start of the experiment, the
precipitant concentration in the drop is half that of the well. Equilibration then takes place via the
vapour phase. Given the relatively large volume of the well, its concentration effectively remains the
same. The drop however loses water vapour to the well until the precipitant concentration equals that
of the well. Hopefully, if the conditions have been favourable, at some point during this process the
protein has become supersaturated and been driven out of solution in the form of crystals. All too
often however these trials result in precipitate or the formation of salt crystals, or nothing happens at
all and the drops remain clear. I would estimate that the success rate at this stage is less than 0.1%.
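The equilibration described above can be followed with a little arithmetic. The sketch below is only an illustration: it assumes that water is the only species that moves through the vapour phase, that the well concentration stays constant, and that volumes are additive. The function name and the example values (10 mg/ml protein, 20% precipitant) are hypothetical.

```python
def equilibrate_drop(protein_mg_ml, well_precipitant, protein_ul=1.0, well_ul=1.0):
    """Return start and end concentrations for a hanging drop.

    Assumes only water leaves the drop and the well concentration is constant.
    """
    drop_ul = protein_ul + well_ul
    # Mixing protein and well solution dilutes both components.
    protein_start = protein_mg_ml * protein_ul / drop_ul
    precipitant_start = well_precipitant * well_ul / drop_ul
    # Water leaves the drop until its precipitant concentration matches the well.
    final_ul = drop_ul * precipitant_start / well_precipitant
    protein_end = protein_start * drop_ul / final_ul
    return {
        "protein_start": protein_start,
        "precipitant_start": precipitant_start,
        "final_volume_ul": final_ul,
        "protein_end": protein_end,
        "precipitant_end": well_precipitant,
    }


result = equilibrate_drop(protein_mg_ml=10.0, well_precipitant=20.0)
print(result)
```

With a 1:1 drop the precipitant starts at half the well concentration, the drop shrinks to half its volume, and the protein concentration doubles back to its stock value while the precipitant concentration rises, which is what can push the protein into supersaturation.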
If no promising leads are found then there are several possible courses of action. We can add
various things to the sample which may affect crystallization. We can work at a different temperature,
since temperature can have a profound effect on protein solubility. Temperatures of 4 °C and 18 °C are
typically used. If we have already been round this cycle more than once, it may be time to go back to
the purification and expression and try something different, such as working with a fragment of our
target protein.
If however we are lucky enough to get one or more "hits" in the screens, then we do follow-
up experiments which will be variations on a theme where the theme is the successful set of
conditions. Essentially we need to refine all variables and possibly introduce some new ones in order
to achieve our goal, which is large, single crystals (see below). Things to try at this stage include
varying the concentrations of all components in the crystallization, slight pH changes, using additives,
switching to similar buffers or precipitants, or even using different crystallization methods (e.g.
dialysis). Occasionally good crystals will form overnight, but more typically they will take from
several days to several weeks to grow.
Using crystallographic terminology, this process is called X-ray data collection. When the
X-rays hit the crystal, a phenomenon called X-ray diffraction takes place. Diffraction is a common
physical phenomenon and occurs when a wave (of any nature) encounters an obstacle, which can be
any material object. This results in bending of the wave around that object, also called scattering of
waves. Another way for diffraction to occur is when a wave encounters a small opening, a small
hole or a slit. This causes spreading of the wave in all directions. In practice, in both cases, the
obstacle and the hole/slit start to act as a new wave source, sending around waves with slightly
different direction of propagation, as compared to the original wave. The "new" scattered waves
interact with each other, resulting in another physical phenomenon called interference, which
translated to normal language simply means addition of waves.
X-ray diffraction is caused by the interaction of electromagnetic waves with the matter
inside the crystals, and particularly with the electrons. These waves get scattered by the electrons,
or each electron becomes a small X-ray source of its own. Scattered waves from all the electrons
within each atom are added to each other, giving diffracted waves from each atom, etc. When the
scattered waves are added, they may either get stronger or cancel each other. Those which get
stronger are registered by the X-ray detector, as in the figure above. Interestingly, we do not
necessarily need X-rays to observe interference. We can, for example, go to a nearby lake, throw
two stones into the water and then observe how the waves from the two stones either reinforce each
other or become weaker.
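The addition of waves can be made concrete with a small calculation. The sketch below assumes two waves of the same frequency that differ only in amplitude and phase; the function name is hypothetical and the formula is the standard result for adding two sinusoids.

```python
import math


def resultant_amplitude(a1, a2, phase_diff):
    """Amplitude of the sum of two same-frequency waves.

    phase_diff is the phase difference between the waves in radians.
    """
    squared = a1 ** 2 + a2 ** 2 + 2.0 * a1 * a2 * math.cos(phase_diff)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(squared, 0.0))


print(resultant_amplitude(1.0, 1.0, 0.0))          # waves in phase reinforce
print(resultant_amplitude(1.0, 1.0, math.pi))      # waves out of phase cancel
print(resultant_amplitude(1.0, 1.0, math.pi / 2))  # intermediate case
```

Waves that arrive in phase double the amplitude, waves that arrive half a cycle apart cancel, and everything in between gives a partial result. Only the directions in which the scattered waves reinforce each other give rise to measurable diffraction.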
Regions of secondary structure. A closely related method assigns a chemical shift index (CSI)
to each residue in a protein, by comparison with a table of chemical shifts corresponding to random
structure. Regions where the CSI is clustered with negative values are assigned as α-helical; those
with positive values are assigned to β-structure. In contrast to optical methods for determination of
protein structure, NMR provides information on the location of secondary structural elements within
the protein sequence. Oldfield has even suggested that sufficiently accurate data may be used to
predict the three-dimensional structure of the protein from chemical-shift data alone. In order to apply
chemical-shift information to predictions of secondary structure, the chemical shifts must be assigned
to particular residues in the protein. This is a tedious task that requires the measurement and analysis
of 2-D and often 3-D spectra. In addition, the limit of about 30 kDa for a protein that produces
sufficiently narrow lines seriously hampers the general application of the method. However, the
chemical-shift index serves as a useful check on further model refinement in high-resolution NMR
studies of small proteins.
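The clustering rule described above can be sketched in a few lines. This is a simplified illustration, not the published CSI procedure: it assumes the index has already been reduced to -1, 0 or +1 per residue, it follows the sign convention given in the text (negative for helix, positive for β-structure), and the minimum run lengths of 4 and 3 are assumed thresholds. The function name is hypothetical.

```python
from itertools import groupby


def assign_from_csi(csi, min_helix=4, min_strand=3):
    """Map per-residue CSI values (-1, 0, +1) to 'H', 'E' or 'C'.

    Uninterrupted runs of -1 become helix (H) and runs of +1 become
    strand (E) when they are long enough; everything else is coil (C).
    """
    assignment = []
    for value, run in groupby(csi):
        length = len(list(run))
        if value == -1 and length >= min_helix:
            assignment.append("H" * length)
        elif value == 1 and length >= min_strand:
            assignment.append("E" * length)
        else:
            assignment.append("C" * length)
    return "".join(assignment)


example = [0, -1, -1, -1, -1, -1, 0, 1, 1, 1, 0, -1, 1]
print(assign_from_csi(example))
```

Isolated values are ignored on purpose: only clusters of like-signed indices are treated as evidence for a secondary structural element.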
CD in the far UV region (180–260nm) provides information regarding different forms of regular
secondary structure found in proteins whereas the near UV region (240-360nm) can provide a
detailed fingerprint of the tertiary structure. It can also provide information about interactions with
ligands or cofactors, e.g. DNA-protein interactions. CD can also be very useful in the comparison
of batches of pharmaceuticals and we can provide some additional help with the analysis using
objective pattern recognition techniques. We also provide advice and consultancy on obtaining a
good quality CD spectral measurement.
FTIR facilitates the structural analysis of proteins in different chemical environments, which makes
it a valuable tool for the biotechnology and pharmaceutical industry. This technique can be utilised
to analyse the structure of protein therapeutics at higher concentrations than CD. The facilities
available in house - ATR accessories specially designed for protein solutions and powders - enable
the analysis of proteins in formulation buffer as well as in powder form.
Fluorescence spectroscopy can also provide tertiary structural information. Changes in the local
environment of tryptophan residues can be followed by changes in the emission spectra.
Proteins are a highly diversified class of biomolecules. Differences in their chemical properties, such
as charge, shape, size and solubility, enable them to perform many biological functions. These
functions include enzyme catalysis, metabolic regulation, binding and transport of small molecules,
gene regulation, immunological defense and cell structure.
Cellular activities and functions involve one or more proteins. Their central place in the
cell is reflected in the fact that genetic information is ultimately expressed as proteins.
The basic building blocks of proteins are amino acids. There are about 20 amino acids found
in proteins, all of which share certain structural features. These features are:
A carboxyl (acid) (-COOH) group
An amino (basic) (-NH2) group
They differ from each other with respect to their side chains. Amino acids of proteins are linked
together by peptide bonds between their carboxyl and α-amino groups to form linear polymers. Proteins
have 3 or 4 levels of structural organization and complexity. The primary structure of a protein is the
sequence of amino acids in its polypeptide chain or chains. Secondary structure is formed and
stabilized by the interaction of amino acids that are fairly close to one another on the polypeptide
chain. The polypeptide with its primary and secondary structure can be coiled or organized along
three axes to form a more complex, three-dimensional shape. This level of organization is the tertiary
structure.
A number of colorimetric and photometric methods are used for the determination of proteins.
Photocolorimetric methods are based on the so-called “colour” reactions for functional groups of
protein molecules. Among these are the biuret reaction for peptide groups and Folin’s test for amino acid
aromatic radicals (tyrosine and tryptophan). The biuret test is more specific since the peptide bond occurs
only in proteins and peptides. It is widely used in clinico-biochemical examination. Lowry’s
method, based on Folin’s reaction, is highly sensitive but of low specificity, since free aromatic amino
acids and numerous materials containing a phenolic group produce a similar colouration.
Photonephelometric methods for protein concentration determination are based on the estimation
of the degree of turbidity (or clouding) of a protein suspension in solution. These methods have not
gained wide acceptance in practice.
Spectrophotometric methods are subdivided into direct and indirect methods. The latter
represent a sensitive and accurate variant of the photocolorimetric techniques. After the
induction of the colour reaction of a protein, the coloured solution is measured spectrophotometrically
and the protein concentration is estimated from the percentage of monochromatic light energy
absorbed by the coloured solution.
The direct method is based on the measurement of light absorption by the protein solution in the
ultraviolet region at 200-220 nm (absorption by peptide bonds) and at 280 nm (characteristic
absorption due to aromatic amino acid radicals, chiefly tryptophan and tyrosine). These methods are
easy to handle and require no preliminary colouration of the solution by a chromogenic agent.
Spectrophotometry at 200-220 nm is more specific than that at 280 nm, since in the latter case
additional absorption due to various low-molecular-weight aromatic compounds found in biological
materials interferes with the measurement accuracy.
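Both the indirect and the direct spectrophotometric readings are usually converted to a concentration through the Beer-Lambert relation A = ε · c · l, which the text above does not state explicitly. The sketch below is a minimal illustration under that assumption; the function names and the extinction coefficient of 0.5 ml/(mg·cm) are hypothetical, and in practice the coefficient comes from a calibration curve or from the protein itself.

```python
import math


def absorbance_from_percent_absorbed(percent_absorbed):
    """Convert the percentage of light absorbed into absorbance (A = -log10 T)."""
    transmittance = 1.0 - percent_absorbed / 100.0
    if transmittance <= 0.0:
        raise ValueError("percent_absorbed must be below 100")
    return -math.log10(transmittance)


def concentration_mg_ml(absorbance, ext_coeff_ml_per_mg_cm, path_cm=1.0):
    """Beer-Lambert law solved for concentration: c = A / (epsilon * l)."""
    return absorbance / (ext_coeff_ml_per_mg_cm * path_cm)


a = absorbance_from_percent_absorbed(90.0)  # 90% of the light is absorbed
print(a)
print(concentration_mg_ml(a, ext_coeff_ml_per_mg_cm=0.5))
```

The relation is only linear over a limited absorbance range, so strongly absorbing samples are normally diluted before measurement.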
The local dye “Uri isi” which was purchased at Nsukka market is used locally for dyeing grey
hair. It is believed to undergo some reactions with certain chemical components of the hair in the
presence of hydrogen peroxide. When applied on the hair in the presence of hydrogen peroxide, the
grey colour of the hair is changed to dark colour. Preliminary screening showed that the dye reacts
with proteins to produce a change in colour. The present study attempts to design a new colorimetric
method for estimation of proteins based on the colour reaction between local dye “uri isi” and
proteins.
QUESTIONS TO PRACTICE:
synthetase pair is used to deliver the analog in response to a nonsense or four-base codon. In 1996,
Drabkin and coworkers used an Escherichia coli tRNA/glutaminyl-tRNA synthetase pair for amber
codon suppression in mammalian cells, and showed that the suppressor tRNA was not charged by
any of the mammalian aminoacyl-tRNA synthetases. Shortly thereafter, Furter (1998) introduced a
yeast tRNA/phenylalanyl-tRNA synthetase (PheRS) pair into E. coli for site-specific incorporation of
the noncanonical amino acid
p-fluorophenylalanine. Since then, amber codon suppression has become the most common method
for site-specific incorporation of noncanonical amino acids in vivo. Schultz and coworkers have been
especially successful in producing orthogonal suppressor tRNA/aminoacyl-tRNA synthetase pairs
for incorporation of chemically, structurally, and spectroscopically diverse amino acid analogs. Site-
specific incorporation has also been accomplished in Xenopus oocytes using microinjected
messenger RNAs and chemically misacylated amber suppressor tRNAs.
Sisido and coworkers have pioneered the use of four-base codons (frameshift suppression) for site-specific
introduction of noncanonical amino acids into proteins, and have employed this strategy to label
streptavidin with fluorophores for fluorescence resonance energy transfer (FRET) experiments. Much
of the work reported to date with four-base codons involves in vitro translation, but design of
appropriate orthogonal tRNA/aminoacyl-tRNA synthetase pairs enables use of the method in
bacterial cells. Anderson and coworkers have reported orthogonal tRNA/leucyl-tRNA synthetase
(LeuRS) pairs for four-base, amber, and opal suppression. Anderson and coworkers have also reported
use of a four-base codon with an amber codon for incorporation of two noncanonical amino acids into a
recombinant protein using two orthogonal sets. An analogous five-base codon strategy has also been
described.
Reassignment of sense codons can also be used for site-specific incorporation of
noncanonical amino acids, although the fidelity of the method is lower than that of nonsense or
frameshift suppression (Fig. 2). Because the 20 canonical amino acids are encoded by 61 sense
codons, the genetic code is highly degenerate. For example, phenylalanine is coded by two codons,
UUC and UUU. In E. coli, both codons are read by a single tRNA, which decodes UUC via Watson–Crick
base-pairing and UUU through a “wobble” interaction. Reassignment of the UUU codon was
achieved by introducing into an E. coli expression host a mutant yeast PheRS capable of charging
2-naphthylalanine, and a mutant yeast tRNAPhe equipped with an AAA anticodon. Expression of
dihydrofolate reductase led to preferential incorporation of phenylalanine at UUC codons and of
2-naphthylalanine at UUU codons. The generality and quantitative specificity of this method have not
yet been established.
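The codon arithmetic behind this strategy can be checked with a short sketch. It counts the sense codons and works out which codon pairs perfectly (Watson-Crick at all three positions) with a given anticodon. The function names are hypothetical, and the GAA anticodon used for the wild-type tRNA is an assumption made here for illustration; the text only specifies the AAA anticodon of the mutant.

```python
from itertools import product

STOP_CODONS = {"UAA", "UAG", "UGA"}
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def sense_codons():
    """All 64 triplets minus the three stop codons."""
    all_codons = ("".join(triplet) for triplet in product("UCAG", repeat=3))
    return [codon for codon in all_codons if codon not in STOP_CODONS]


def watson_crick_codon(anticodon):
    """Codon (5'->3') that pairs perfectly with an anticodon written 5'->3'."""
    # Pairing is antiparallel, so the anticodon is read in reverse.
    return "".join(COMPLEMENT[base] for base in reversed(anticodon))


print(len(sense_codons()))
print(watson_crick_codon("GAA"))  # assumed wild-type anticodon
print(watson_crick_codon("AAA"))  # mutant anticodon from the text
```

An AAA anticodon pairs perfectly with UUU, whereas the assumed GAA anticodon pairs perfectly with UUC and can reach UUU only through a wobble pair at the third codon position. That difference is what allows the two phenylalanine codons to be read preferentially by different tRNAs.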
4.1.1. Translational Fidelity
Aminoacyl-tRNA Synthetases
Translational fidelity is controlled in large measure by the
aminoacyl-tRNA synthetases, which match the 20 canonical amino acids with their cognate tRNAs.
The remarkable capacity of the synthetases to discriminate among the natural amino acids might lead
one to expect noncanonical substrates to be excluded by the translational apparatus (for more details
see the chapter by Mascarenhas et al., this volume). In fact, many noncanonical amino acids are
activated by the wild-type synthetases at rates that support efficient protein synthesis in bacterial
cells. For analogs that are activated more slowly, addition of plasmid-encoded copies of the cognate
synthetase can restore the rate of protein synthesis to levels characteristic of overexpressed
recombinant proteins, and synthetase engineering has enabled further expansion of the set of useful
amino acids. Szostak and coworkers have described a screen for identifying noncanonical amino acid
substrates that are susceptible to enzymatic aminoacylation. Using the screen, they identified 59
previously unknown amino acid substrates.
4.2. Choice of protein scaffold for protein engineering
When engineering a new functionality in a protein, many aspects must be taken into account.
First, it is necessary to know as much as possible about the starting structure in order to assess its
potential. It is important to consider whether the protein will work in a particular selection or
screening system, whether it will tolerate the changes introduced, and whether its production is simple
and scalable enough for future applications. These are just a few examples of the assets to be
considered in a structural framework.
The features we are seeking in a structural framework fall into two categories: experimental
demands, and application demands. The first is associated with the method used in the engineering
experiment, while the latter depends on the intended use of the product.
A number of questions concerning the experimental procedure:
Does an adequate assay, selection or screening system exist, or should the framework
provide a means for testing the newly established functionality? If for example you
are seeking a protein where signal change can be measured as a function of binding,
certain scaffolds such as periplasmic binding proteins will facilitate this more easily
than others.
Does the applied methodology have requirements for the framework? For example,
small single-chain proteins are preferable for the application of phage and ribosome
display and for the construction of fusion proteins. Likewise, cysteine-free scaffolds
are useful when unique cysteines should be introduced to which effector compounds
can be coupled. Further, a robust scaffold with high thermodynamic stability is
preferable because it can compensate for any destabilizing effects of newly introduced
functional residues, an effect often observed in rational design approaches.
The expression of functional molecules is also important in selection and screening
systems; for example, poorly or insolubly expressed proteins will not be able to
complement a missing functionality and thus unstable variants are often eliminated in
the process.
A number of questions concerning applicability:
Under what conditions should the final product be active, should it be especially
stable or degradable, and does it need to be localized specifically?
Is large-scale production feasible, what are the protein yields, and is there an easy
purification?
High thermodynamic stability, reversible folding, and high expression levels are what
you will be looking for. The absence of disulfide bonds or free cysteines is also
advantageous because it allows the expression of functional molecules in the reducing
environment of the bacterial cytoplasm, which usually produces higher yields than
periplasmic or eukaryotic expression or refolding in vitro.
Proteins can be optimized to improve chemical robustness, thermodynamic stability or
recombinant expression yields before using them as a framework in an engineering experiment.
However, if considered well, the choice of a framework may also relieve the need to engineer many
of these properties, so that attention can be focused on the property in question.
Apart from practical considerations, the choice of the structural framework can also be
important for the new functionality that is introduced. Using a partial binding pocket and adjusting it
to fit a new ligand may be easier to achieve than introducing a new one from scratch. Obviously,
studying the structure of the framework is essential in rational design approaches, but it can also be
advantageous in directed evolution experiments. A detailed knowledge of the protein structure can
reveal important parts that are better left untouched and help focus on the variable regions that can
be subjected to randomization. To a certain degree, sequence alignments will also provide this type
of information. Highly conserved residues are often important for folding or stability of the protein,
while variable regions are free to evolve.
Background
Early methods of secondary structure prediction, introduced in the 1960s and early
1970s,[4][5][6][7][8] focused on identifying likely alpha helices and were based mainly on helix-coil
transition models.[9] Significantly more accurate predictions that included beta sheets were
introduced in the 1970s and relied on statistical assessments based on probability parameters derived
from known solved structures. These methods, applied to a single sequence, are typically at most
about 60-65% accurate, and often underpredict beta sheets.[1] The evolutionary conservation of
secondary structures can be exploited by simultaneously assessing many homologous sequences in
a multiple sequence alignment, by calculating the net secondary structure propensity of an aligned
column of amino acids. In concert with larger databases of known protein structures and
modern machine learning methods such as neural nets and support vector machines, these methods
can achieve up to 80% overall accuracy in globular proteins.[10] The theoretical upper limit of accuracy
is around 90%,[10] partly due to idiosyncrasies in DSSP assignment near the ends of secondary
structures, where local conformations vary under native conditions but may be forced to assume a
single conformation in crystals due to packing constraints. Limitations are also imposed by secondary
structure prediction's inability to account for tertiary structure; for example, a sequence predicted as
a likely helix may still be able to adopt a beta-strand conformation if it is located within a beta-sheet
region of the protein and its side chains pack well with their neighbors. Dramatic conformational
changes related to the protein's function or environment can also alter local secondary structure.
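The accuracy figures quoted above are not defined in the text. A common choice, assumed here, is the per-residue three-state accuracy: the percentage of residues whose predicted state (helix H, strand E, coil C) matches the state assigned from the solved structure. The function name and the example strings below are hypothetical.

```python
def three_state_accuracy(predicted, observed):
    """Percentage of residues whose predicted state matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


predicted = "HHHHCCEEEE"
observed = "HHHCCCEEEC"
print(three_state_accuracy(predicted, observed))
```

A per-residue score of this kind is sensitive to how the ends of helices and strands are assigned, which is one reason why disagreements near the ends of secondary structure elements limit the achievable accuracy.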
Historical perspective
To date, over 20 different secondary structure prediction methods have been developed. One of the
first algorithms was the Chou-Fasman method, which relies predominantly on probability parameters
determined from relative frequencies of each amino acid's appearance in each type of secondary
structure.[11] The original Chou-Fasman parameters, determined from the small sample of structures
solved in the mid-1970s, produce poor results compared to modern methods, though the
parameterization has been updated since it was first published. The Chou-Fasman method is roughly
50-60% accurate in predicting secondary structures.
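The probability parameters mentioned above can be illustrated with a toy calculation. The sketch assumes the usual definition of a conformational propensity: the fraction of a given amino acid found in a state, divided by the fraction of all residues found in that state. It covers only the parameter estimation, not the full Chou-Fasman procedure, and the function name and the tiny data set are hypothetical.

```python
from collections import Counter


def propensities(sequence, states):
    """P(aa, s) = (n_aa,s / n_aa) / (n_s / N) for every amino acid and state."""
    if len(sequence) != len(states):
        raise ValueError("sequence and states must have the same length")
    total = len(sequence)
    aa_counts = Counter(sequence)
    state_counts = Counter(states)
    pair_counts = Counter(zip(sequence, states))
    table = {}
    for aa in aa_counts:
        for state in state_counts:
            fraction_of_aa = pair_counts[(aa, state)] / aa_counts[aa]
            fraction_overall = state_counts[state] / total
            table[(aa, state)] = fraction_of_aa / fraction_overall
    return table


# Toy data: residues and their observed states (H = helix, C = coil).
table = propensities("AEAEGPGP", "HHHHCCCC")
print(table[("A", "H")], table[("G", "H")], table[("G", "C")])
```

A value above 1 means the residue occurs in that state more often than an average residue does, and a value below 1 means it occurs less often. With only a handful of structures the estimates are noisy, which is consistent with the poor performance of the original parameters noted above.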
The next notable program was the GOR method, an information theory-based method named for the
three scientists who developed it: Garnier, Osguthorpe, and Robson. It uses the more
powerful probabilistic technique of Bayesian inference.[12] The GOR method takes into account not
only the probability of each amino acid having a particular secondary structure, but also
the conditional probability of the amino acid assuming each structure given the contributions of its
neighbors (it does not assume that the neighbors have that same structure). The approach is both more
sensitive and more accurate than that of Chou and Fasman because amino acid structural propensities
are only strong for a small number of amino acids such as proline and glycine. Weak contributions
from each of many neighbors can add up to strong effects overall. The original GOR method was
roughly 65% accurate and was dramatically more successful in predicting alpha helices than beta sheets,
which it frequently mispredicted as loops or disorganized regions.[1]
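The idea that weak contributions from many neighbours add up can be sketched as a sliding-window sum. This is an illustration in the spirit of GOR, not the published method: the scoring table is a made-up toy, the window of eight residues on each side is an assumption, and the function name is hypothetical.

```python
def window_predict(sequence, info, states="HEC", half_window=8):
    """Predict one state per residue by summing neighbour contributions.

    info maps (state, offset, amino_acid) to a score; missing entries count as 0.
    """
    prediction = []
    for i in range(len(sequence)):
        scores = {}
        for state in states:
            total = 0.0
            for offset in range(-half_window, half_window + 1):
                j = i + offset
                if 0 <= j < len(sequence):
                    total += info.get((state, offset, sequence[j]), 0.0)
            scores[state] = total
        # Ties are resolved in favour of the state listed first in `states`.
        prediction.append(max(states, key=lambda s: scores[s]))
    return "".join(prediction)


# Toy table: alanine weakly favours helix for nearby positions,
# glycine favours coil at its own position.
toy_info = {("H", offset, "A"): 0.2 for offset in range(-2, 3)}
toy_info[("C", 0, "G")] = 0.5
print(window_predict("AAAGG", toy_info))
```

No single alanine contributes much, but several alanines within the window together outweigh the alternatives. That is the effect the conditional-probability treatment is designed to capture.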
Another big step forward was the use of machine learning methods. Artificial neural
network methods were used first. They use solved structures as training sets to identify common
sequence motifs associated with particular arrangements of secondary structures. These methods are
over 70% accurate in their predictions, although beta strands are still often underpredicted due to the
lack of three-dimensional structural information that would allow assessment of hydrogen
bonding patterns that can promote formation of the extended conformation required for the presence
of a complete beta sheet.[1] PSIPRED and JPRED are among the best-known programs based on
neural networks for protein secondary structure prediction. Next, support vector machines have
proven particularly useful for predicting the locations of turns, which are difficult to identify with
statistical methods.
Extensions of machine learning techniques attempt to predict more fine-grained local properties of
proteins, such as backbone dihedral angles in unassigned regions. Both SVMs[15] and neural
networks[16] have been applied to this problem.[13] More recently, real-value torsion angles can be
accurately predicted by SPINE-X and successfully employed for ab initio structure prediction.[17]
Other improvements
It is reported that in addition to the protein sequence, secondary structure formation depends on other
factors. For example, it is reported that secondary structure tendencies depend also on local
environment,[18] solvent accessibility of residues,[19] protein structural class,[20] and even the
organism from which the proteins are obtained.[21] Based on such observations, some studies have
shown that secondary structure prediction can be improved by addition of information about protein
structural class,[22] residue accessible surface area[23][24] and also contact number information.[25]
Tertiary structure
The practical role of protein structure prediction is now more important than ever. Massive amounts
of protein sequence data are produced by modern large-scale DNA sequencing efforts such as
the Human Genome Project. Despite community-wide efforts in structural genomics, the output of
experimentally determined protein structures—typically by time-consuming and relatively
expensive X-ray crystallography or NMR spectroscopy—is lagging far behind the output of protein
sequences.
Protein structure prediction remains an extremely difficult and unresolved undertaking. The two
main problems are calculation of protein free energy and finding the global minimum of this energy.
A protein structure prediction method must explore the space of possible protein structures which
is astronomically large. These problems can be partially bypassed in "comparative" or homology
modeling and fold recognition methods, in which the search space is pruned by the assumption that
the protein in question adopts a structure that is close to the experimentally determined structure of
another homologous protein. On the other hand, the de novo or ab initio protein structure
prediction methods must explicitly resolve these problems. The progress and challenges in protein
structure prediction have been reviewed in Zhang 2008.[26]
Comparative protein modelling uses previously solved structures as starting points, or templates. This
is effective because it appears that although the number of actual proteins is vast, there is a limited
set of tertiary structural motifs to which most proteins belong. It has been suggested that there are
only around 2,000 distinct protein folds in nature, though there are many millions of different
proteins.
These methods may also be split into two groups:
Homology modeling
is based on the reasonable assumption that two homologous proteins will share very similar
structures. Because a protein's fold is more evolutionarily conserved than its amino acid
sequence, a target sequence can be modeled with reasonable accuracy on a very distantly
related template, provided that the relationship between target and template can be discerned
through sequence alignment. It has been suggested that the primary bottleneck in comparative
modelling arises from difficulties in alignment rather than from errors in structure prediction
given a known-good alignment.[37] Unsurprisingly, homology modelling is most accurate
when the target and template have similar sequences.
Protein threading[38]
scans the amino acid sequence of an unknown structure against a database of solved
structures. In each case, a scoring function is used to assess the compatibility of the sequence
to the structure, thus yielding possible three-dimensional models. This type of method is also
known as 3D-1D fold recognition due to its compatibility analysis between three-
dimensional structures and linear protein sequences. This method has also given rise to
methods performing an inverse folding search by evaluating the compatibility of a given
structure with a large database of sequences, thus predicting which sequences have the
potential to produce a given fold.
Accurate packing of the amino acid side chains represents a separate problem in protein
structure prediction. Methods that specifically address the problem of predicting side-chain
geometry include dead-end elimination and the self-consistent mean field methods. The side
chain conformations with low energy are usually determined on the rigid polypeptide
backbone and using a set of discrete side chain conformations known as "rotamers." The
methods attempt to identify the set of rotamers that minimize the model's overall energy.
These methods use rotamer libraries, which are collections of favorable conformations for
each residue type in proteins. Rotamer libraries may contain information about the
conformation, its frequency, and the standard deviations about mean dihedral angles, which
can be used in sampling.[39] Rotamer libraries are derived from structural bioinformatics or
other statistical analysis of side-chain conformations in known experimental structures of
proteins, such as by clustering the observed conformations for tetrahedral carbons near the
staggered (60°, 180°, -60°) values.
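The clustering of observed side-chain dihedral angles around the staggered values can be sketched as follows. This is a minimal illustration that assigns each angle to the nearest of the three staggered values and counts the results; real rotamer libraries use more careful statistics. The function names and the example angles are hypothetical.

```python
from collections import Counter

STAGGERED = (60.0, 180.0, -60.0)


def angular_distance(a, b):
    """Smallest separation between two angles in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def nearest_staggered(chi):
    """Staggered value closest to the dihedral angle chi, in degrees."""
    return min(STAGGERED, key=lambda ref: angular_distance(chi, ref))


def rotamer_frequencies(chis):
    """Fraction of observed angles assigned to each staggered value."""
    counts = Counter(nearest_staggered(chi) for chi in chis)
    return {ref: counts[ref] / len(chis) for ref in STAGGERED}


observed = [-175.0, 170.0, 65.0, -58.0]
print([nearest_staggered(chi) for chi in observed])
print(rotamer_frequencies(observed))
```

The angular distance has to wrap around, so that -175° is treated as 5° away from 180° rather than 355° away. The resulting frequencies are the kind of information a rotamer library stores for each residue type.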
Rotamer libraries can be backbone-independent, secondary-structure-dependent, or
backbone-dependent. Backbone-independent rotamer libraries make no reference to
backbone conformation, and are calculated from all available side chains of a certain type
(for instance, the first example of a rotamer library, done by Ponder and Richards at Yale in
1987).[40] Secondary-structure-dependent libraries present different dihedral angles and/or
rotamer frequencies for α-helix, β-sheet, or coil secondary structures.[41] Backbone-dependent
rotamer libraries present conformations and/or frequencies dependent on the local
backbone conformation as defined by the backbone dihedral angles φ and ψ, regardless of
secondary structure.[42]
The modern versions of these libraries as used in most software are presented as
multidimensional distributions of probability or frequency, where the peaks correspond to
the dihedral-angle conformations considered as individual rotamers in the lists. Some
versions are based on very carefully curated data and are used primarily for structure
validation,[43] while others emphasize relative frequencies in much larger data sets and are
the form used primarily for structure prediction, such as the Dunbrack rotamer libraries.[44]
Side-chain packing methods are most useful for analyzing the protein's hydrophobic core,
where side chains are more closely packed; they have more difficulty addressing the looser
constraints and higher flexibility of surface residues, which often occupy multiple rotamer
conformations rather than just one.[45][46]
Statistical methods have been developed for predicting structural classes of proteins based
on their amino acid composition,[47] pseudo amino acid composition[48][49][50][51] and
functional domain composition.[52]
Quaternary structure
In the case of complexes of two or more proteins, where the structures of the proteins are
known or can be predicted with high accuracy, protein–protein docking methods can be used
to predict the structure of the complex. Information on the effect of mutations at specific sites
on the affinity of the complex helps to understand the complex structure and to guide docking
methods.
Molecular mechanics
Molecular mechanics is one aspect of molecular modelling, as it refers to the use of classical
mechanics/Newtonian mechanics to describe the physical basis behind the models. Molecular
models typically describe atoms (nucleus and electrons collectively) as point charges with an
associated mass. The interactions between neighbouring atoms are described by spring-like
interactions (representing chemical bonds) and van der Waals forces. The Lennard-Jones potential
is commonly used to describe van der Waals forces. The electrostatic interactions are computed
based on Coulomb's law. Atoms are assigned coordinates in Cartesian space or in internal
coordinates, and can also be assigned velocities in dynamical simulations. The atomic velocities are
related to the temperature of the system, a macroscopic quantity. The collective mathematical
expression is known as a potential function and is related to the system internal energy (U), a
thermodynamic quantity equal to the sum of potential and kinetic energies. Methods which
minimize the potential energy are known as energy minimization techniques (e.g., steepest descent
and conjugate gradient), while methods that model the behaviour of the system with propagation of
time are known as molecular dynamics.
This function, referred to as a potential function, computes the molecular potential energy as a sum
of energy terms that describe the deviation of bond lengths, bond angles and torsion angles away
from equilibrium values, plus terms for non-bonded pairs of atoms describing van der Waals and
electrostatic interactions. The set of parameters consisting of equilibrium bond lengths, bond
angles, partial charge values, force constants and van der Waals parameters are collectively known
as a force field. Different implementations of molecular mechanics use different mathematical
expressions and different parameters for the potential function. The common force fields in use
today have been developed by using high level quantum calculations and/or fitting to experimental
data. The technique known as energy minimization is used to find positions of zero gradient for all
atoms, in other words, a local energy minimum. Lower energy states are more stable and are
commonly investigated because of their role in chemical and biological processes. A molecular
dynamics simulation, on the other hand, computes the behaviour of a system as a function of time.
It involves solving Newton's laws of motion, principally the second law, F = ma. Integration of
Newton's laws of motion, using different integration algorithms, leads to atomic trajectories in
space and time. The force on an atom is defined as the negative gradient of the potential energy
function. The energy minimization technique is useful for obtaining a static picture for comparing
between states of similar systems, while molecular dynamics provides information about the
dynamic processes with the intrinsic inclusion of temperature effects.
Variables
Molecules can be modelled either in vacuum or in the presence of a solvent such as water.
Simulations of systems in vacuum are referred to as gas-phase simulations, while those that include
the presence of solvent molecules are referred to as explicit solvent simulations. In another type of
simulation, the effect of solvent is estimated using an empirical mathematical expression; these are
known as implicit solvation simulations.
Applications
Molecular modelling methods are now routinely used to investigate the structure, dynamics, surface
properties and thermodynamics of inorganic, biological and polymeric systems. The types of
biological activity that have been investigated using molecular modelling include protein folding,
enzyme catalysis, protein stability, conformational changes associated with biomolecular function,
and molecular recognition of proteins, DNA, and membrane complexes.
Background
The "mechanical" molecular model was developed out of a need to describe molecular structures
and properties in as practical a manner as possible. The range of applicability of molecular
mechanics includes:
The great computational speed of molecular mechanics allows for its use in procedures such as
molecular dynamics, conformational energy searching, and docking. All the procedures require
large numbers of energy evaluations.
The mechanical molecular model considers atoms as spheres and bonds as springs. The
mathematics of spring deformation can be used to describe the ability of bonds to stretch, bend, and
twist:
Non-bonded atoms (greater than two bonds apart) interact through van der Waals attraction, steric
repulsion, and electrostatic attraction/repulsion. These properties are easiest to describe
mathematically when atoms are considered as spheres of characteristic radii.
The object of molecular mechanics is to predict the energy associated with a given conformation of
a molecule. However, molecular mechanics energies have no meaning as absolute quantities. Only
differences in energy between two or more conformations have meaning. A simple molecular
mechanics energy equation is given by:
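A form consistent with the four energy terms described in the sections that follow is sketched below. The symbol names are assumptions chosen here for illustration, and individual force-fields may add further terms.

```latex
E_{\text{total}} = E_{\text{stretch}} + E_{\text{bend}} + E_{\text{torsion}} + E_{\text{non-bonded}}
```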
These equations, together with the data (parameters) required to describe the behavior of different
kinds of atoms and bonds, are called a force-field. Many different kinds of force-fields have been
developed over the years. Some include additional energy terms that describe other kinds of
deformations. Some force-fields account for coupling between bending and stretching in adjacent
bonds in order to improve the accuracy of the mechanical model.
The mathematical form of the energy terms varies from force-field to force-field. The more
common forms will be described.
Stretching Energy
The stretching energy equation is based on Hooke's law. The "kb"
parameter controls the stiffness of the bond spring, while "ro" defines its
equilibrium length. Unique "kb" and "ro" parameters are assigned to each
pair of bonded atoms based on their types (e.g. C-C, C-H, O-C, etc.). This
equation estimates the energy associated with vibration about the
equilibrium bond length. This is the equation of a parabola, as can be seen
in the following plot:
Notice that the model tends to break down as a bond is stretched toward the point of
dissociation.
Bending Energy
The bending energy equation is also based on Hooke's law. The "ktheta" parameter controls
the stiffness of the angle spring, while "thetao" defines its equilibrium angle. This equation
estimates the energy associated with vibration about the equilibrium bond angle:
Unique parameters for angle bending are assigned to each bonded triplet of atoms based on
their types (e.g. C-C-C, C-O-C, C-C-H, etc.). The effect of the "kb" and "ktheta" parameters
is to broaden or steepen the slope of the parabola. The larger the value of "k", the more
energy is required to deform an angle (or bond) from its equilibrium value. Shallow
potentials are achieved for "k" values between 0.0 and 1.0. The Hookeian potential is shown
in the following plot for three values of "k":
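The two Hooke's-law terms can be sketched in a few lines of Python. This is a minimal illustration rather than any particular force-field's implementation: the function names, the omission of a 1/2 prefactor, and the sample parameter values are all assumptions made here.

```python
def stretch_energy(r, kb, r0):
    """Hooke's-law stretching energy for one bond: kb * (r - r0)**2."""
    return kb * (r - r0) ** 2


def bend_energy(theta, ktheta, theta0):
    """Hooke's-law bending energy for one angle: ktheta * (theta - theta0)**2."""
    return ktheta * (theta - theta0) ** 2


# Hypothetical parameters for illustration only (not taken from a real force-field).
kb, r0 = 300.0, 1.53           # bond stiffness and equilibrium length
ktheta, theta0 = 0.02, 109.5   # angle stiffness and equilibrium angle (degrees)

# Both terms are parabolas centred on the equilibrium value.
for r in (1.43, 1.53, 1.63):
    print(f"r = {r:.2f}  E_stretch = {stretch_energy(r, kb, r0):.3f}")
for theta in (99.5, 109.5, 119.5):
    print(f"theta = {theta:.1f}  E_bend = {bend_energy(theta, ktheta, theta0):.3f}")
```

Doubling either "k" value doubles the energy cost of the same deformation, which is the steepening of the parabola described above.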
Torsion Energy
The torsion energy is modeled by a simple periodic function, as can be seen in the following
plot:
The torsion energy in molecular mechanics is primarily used to correct the remaining energy
terms rather than to represent a physical process. The torsional energy represents the amount
of energy that must be added to or subtracted from the Stretching Energy + Bending Energy
+ Non-Bonded Interaction Energy terms to make the total energy agree with experiment or
rigorous quantum mechanical calculation for a model dihedral angle (ethane, for example,
might be used as a model for any H-C-C-H bond).
The "A" parameter controls the amplitude of the curve, the n parameter controls its
periodicity, and "phi" shifts the entire curve along the rotation angle axis (tau). The
parameters are determined from curve fitting. Unique parameters for torsional rotation are
assigned to each bonded quartet of atoms based on their types (e.g. C-C-C-C, C-O-C-N, H-
C-C-H, etc.). Torsion potentials with three combinations of "A", "n", and "phi" are shown in
the following plot:
Notice that "n" reflects the type of symmetry in the dihedral angle. A CH3-CH3 bond, for
example, ought to repeat its energy every 120 degrees. The cis conformation of a dihedral
angle is assumed to be the zero torsional angle by convention. The parameter phi can be
used to synchronize the torsional potential to the initial rotameric state of the molecule
whose energy is being computed.
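The roles of "A", "n", and "phi" can be checked with a short Python sketch. The functional form A * [1 + cos(n * tau - phi)] is an assumption here (a common choice for a simple periodic torsion term), and the amplitude value is hypothetical.

```python
import math


def torsion_energy(tau_deg, A, n, phi_deg=0.0):
    """Periodic torsion term A * (1 + cos(n*tau - phi)), angles in degrees."""
    return A * (1.0 + math.cos(math.radians(n * tau_deg - phi_deg)))


A = 1.5  # hypothetical amplitude

# With n = 3 the curve repeats every 120 degrees, as expected for CH3-CH3.
for tau in (0.0, 60.0, 120.0, 180.0, 240.0):
    print(f"tau = {tau:5.1f}  E = {torsion_energy(tau, A, n=3):.3f}")

# A phase shift of 180 degrees swaps the positions of maxima and minima.
print(torsion_energy(60.0, A, n=3, phi_deg=180.0))
```

With phi = 0 the maximum falls at the cis (tau = 0) conformation and the minimum at tau = 60 degrees, matching the eclipsed and staggered arrangements of ethane.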
Non-Bonded Energy
The non-bonded energy represents the pair-wise sum of the energies of all possible
interacting non-bonded atoms i and j:
The non-bonded energy accounts for repulsion, van der Waals attraction, and electrostatic
interactions. van der Waals attraction occurs at short range, and rapidly dies off as the
interacting atoms move apart by a few Angstroms. Repulsion occurs when the distance
between interacting atoms becomes even slightly less than the sum of their contact radii.
Repulsion is modeled by an equation that is designed to rapidly blow up at close distances
(1/r^12 dependency). The energy term that describes attraction/repulsion provides for a
smooth transition between these two regimes. These effects are often modeled using a 6-12
equation, as shown in the following plot:
The "A" and "B" parameters control the depth and position (interatomic distance) of the
potential energy well for a given pair of non-bonded interacting atoms (e.g. C:C, O:C, O:H,
etc.). In effect, "A" determines the degree of "stickiness" of the van der Waals attraction and
"B" determines the degree of "hardness" of the atoms (e.g. marshmallow-like, billiard-ball-like,
etc.).
The "A" parameter can be obtained from atomic polarizability measurements, or it can be
calculated quantum mechanically. The "B" parameter is typically derived from
crystallographic data so as to reproduce observed average contact distances between
different kinds of atoms in crystals of various molecules.
Partial atomic charges can be calculated for small molecules using an ab initio or
semiempirical quantum technique (usually MOPAC or AMPAC). Some programs assign
charges using rules or templates, especially for macromolecules. In some force-fields, the
torsional potential is calibrated to a particular charge calculation method (rarely made
known to the user). Use of a different method can invalidate the force-field consistency.
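A minimal Python sketch of the non-bonded term follows. It assumes the 6-12 form -A/r^6 + B/r^12 plus a Coulomb term, uses a single hypothetical "A" and "B" for every pair, and sums over all pairs without excluding bonded neighbours; the Coulomb constant assumes energies in kcal/mol, distances in Angstroms, and charges in electron units.

```python
import math
from itertools import combinations

COULOMB_K = 332.06  # kcal*Angstrom/(mol*e^2); valid only for the units assumed above


def vdw_energy(r, A, B):
    """6-12 term: attraction -A/r^6 plus repulsion B/r^12."""
    return -A / r**6 + B / r**12


def coulomb_energy(r, qi, qj, k=COULOMB_K):
    """Electrostatic energy between two partial charges."""
    return k * qi * qj / r


def well_minimum(A, B):
    """Position and depth of the 6-12 well, obtained from dE/dr = 0."""
    r_min = (2.0 * B / A) ** (1.0 / 6.0)
    return r_min, -A * A / (4.0 * B)


def nonbonded_energy(coords, charges, A, B):
    """Pair-wise sum over all atom pairs i < j (simplified: same A, B for all)."""
    total = 0.0
    for i, j in combinations(range(len(coords)), 2):
        r = math.dist(coords[i], coords[j])
        total += vdw_energy(r, A, B) + coulomb_energy(r, charges[i], charges[j])
    return total


A, B = 1000.0, 1.0e6  # hypothetical parameters
r_min, depth = well_minimum(A, B)
print(f"well at r = {r_min:.3f}, depth = {depth:.3f}")
```

Increasing "A" deepens the well, while increasing "B" pushes the repulsive wall outward, which is one way to read the "stickiness" and "hardness" description above.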
Molecular Dynamics
In the broadest sense, molecular dynamics is concerned with molecular motion. Motion is inherent
to all chemical processes. Simple vibrations, like bond stretching and angle bending, give rise to IR
spectra. Chemical reactions, hormone-receptor binding, and other complex processes are associated
with many kinds of intra- and intermolecular motions.
The driving force for chemical processes is described by thermodynamics. The mechanism by
which chemical processes occur is described by kinetics. Thermodynamics dictates the energetic
relationships between different chemical states, whereas the sequence or rate of events that occur as
molecules transform between their various possible states is described by kinetics:
Conformational transitions and local vibrations are the usual subjects of molecular dynamics
studies. Molecular dynamics alters the intramolecular degrees of freedom in a step-wise fashion,
analogous to energy minimization. The individual steps in energy minimization are merely directed
at establishing a down-hill direction to a minimum. The steps in molecular dynamics, on the other
hand, meaningfully represent the changes in atomic position, ri, over time (i.e. velocity).
Newton's equation is used in the molecular dynamics formalism to simulate atomic motion:
The rate and direction of motion (velocity) are governed by the forces that the atoms of the system
exert on each other as described by Newton's equation. In practice, the atoms are assigned initial
velocities that conform to the total kinetic energy of the system, which in turn, is dictated by the
desired simulation temperature. This is carried out by slowly "heating" the system (initially at
absolute zero) and then allowing the energy to equilibrate among the constituent atoms. The basic
ingredients of molecular dynamics are the calculation of the force on each atom, and from that
information, the position of each atom throughout a specified period of time (typically on the order
of picoseconds = 10^-12 seconds).
The force on an atom can be calculated from the change in energy between its current position and
its position a small distance away. This can be recognized as the derivative of the energy with
respect to the change in the atom's position:
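The idea of estimating the force from the energy at two nearby positions can be sketched as a central finite difference. The one-dimensional spring energy and its parameters below are hypothetical, used only so the numerical result can be compared with the analytic force -2k(x - x0).

```python
def numerical_force(energy, x, h=1.0e-5):
    """Force as the negative derivative of the energy, by central difference."""
    return -(energy(x + h) - energy(x - h)) / (2.0 * h)


def spring_energy(x, k=100.0, x0=1.5):
    """Hypothetical one-dimensional harmonic energy k * (x - x0)**2."""
    return k * (x - x0) ** 2


for x in (1.4, 1.5, 1.6):
    analytic = -2.0 * 100.0 * (x - 1.5)
    print(f"x = {x:.1f}  numerical = {numerical_force(spring_energy, x):8.3f}"
          f"  analytic = {analytic:8.3f}")
```

The force points back toward the equilibrium position on either side of it and vanishes at the minimum.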
Energies can be calculated using either molecular mechanics or quantum mechanics methods.
Molecular mechanics energies are limited to applications that do not involve drastic changes in
electronic structure such as bond making/breaking. Quantum mechanical energies can be used to
study dynamic processes involving chemical changes. The latter technique is extremely novel, and
of limited availability (Gaussian03 is an example of such a program).
Knowledge of the atomic forces and masses can then be used to solve for the positions of each atom
along a series of extremely small time steps (on the order of femtoseconds = 10^-15 seconds). The
resulting series of snapshots of structural changes over time is called a trajectory. The use of this
method to compute trajectories can be more easily seen when Newton's equation is expressed in the
following form:
In practice, trajectories are not directly obtained from Newton's equation due to lack of an
analytical solution. First, the atomic accelerations are computed from the forces and masses. The
velocities are next calculated from the accelerations based on the following relationship:
A trajectory between two states can be subdivided into a series of sub-states separated by a small
time step, "delta t" (e.g. 1 femtosecond):
The initial atomic positions at time "t" are used to predict the atomic positions at time "t + delta t".
The positions at "t + delta t" are used to predict the positions at "t + 2*delta t", and so on.
The leap-frog method derives its name from the fact that the velocity and position information
successively alternate at 1/2 time step intervals.
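A leap-frog style update, in which velocities are held at half steps and positions at whole steps, can be sketched as follows. This is a minimal one-dimensional illustration under stated assumptions: the harmonic force, unit mass, and arbitrary units are hypothetical, and the function name is chosen here rather than taken from any simulation package.

```python
def leapfrog(force, x0, v0, mass, dt, n_steps):
    """Propagate one coordinate with the leap-frog scheme; returns positions."""
    x = x0
    # Bootstrap: move the velocity from t = 0 to t = dt/2.
    v_half = v0 + 0.5 * dt * force(x) / mass
    trajectory = [x]
    for _ in range(n_steps):
        x = x + dt * v_half                      # position at the next whole step
        trajectory.append(x)
        v_half = v_half + dt * force(x) / mass   # velocity at the next half step
    return trajectory


def spring_force(x):
    """Hypothetical restoring force with unit force constant."""
    return -x


# Start at x = 1 with zero velocity; one period of this oscillator is 2*pi.
traj = leapfrog(spring_force, x0=1.0, v0=0.0, mass=1.0, dt=0.01, n_steps=628)
print(f"x after half a period: {traj[314]:.4f}")
print(f"x after one period:    {traj[-1]:.4f}")
```

Each pass through the loop uses the positions at one step to predict those at the next, and the list of positions plays the role of the trajectory.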
Molecular dynamics has no defined point of termination other than the amount of time that can be
practically covered. Unfortunately, the current picosecond order of magnitude limit is often not
long enough to follow many kinds of state to state transformations, such as large conformational
transitions in proteins.
Molecular dynamics calculations can be performed using both HyperChem and Gaussian programs.
Quantum mechanics:
Definition of Computational Chemistry
• Computational Chemistry: Use mathematical approximations and computer programs to
obtain results relative to chemical problems.
• Ab Initio Quantum Chemistry: Uses methods that do not include any empirical parameters
or experimental data.
What’s it Good For?
• Computational chemistry is a rapidly growing field in chemistry.
• Some of the almost limitless properties that can be calculated with computational chemistry
are:
– NMR spectra
– thermochemical data
Motivation
• Schrödinger Equation can only be solved exactly for simple systems.
• For more complex systems (i.e. many electron atoms/molecules) we need to make some
simplifying assumptions/approximations and solve it numerically.
• However, it is still possible to get very accurate results (and also get very crummy results).
– In general, the “cost” of the calculation increases with the accuracy of the
calculation and the size of the system.
– Born-Oppenheimer Approximation
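$$\hat{H}_{el}\,\psi_{el}(r; R) = E_{el}\,\psi_{el}(r; R)$$
$$\hat{H}_{el} = -\frac{\hbar^2}{2m_e}\sum_i \nabla_i^2 - \sum_\alpha \sum_i \frac{Z_\alpha e^2}{r_{i\alpha}} + \sum_j \sum_{i>j} \frac{e^2}{r_{ij}}$$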
The potential energy calculated by summing the energies of various interactions is a numerical
value for a single conformation. This number can be used to evaluate a particular conformation, but
it may not be a useful measure of a conformation because it can be dominated by a few bad
interactions. For instance, a large molecule with an excellent conformation for nearly all atoms can
have a large overall energy because of a single bad interaction, for instance two atoms too near each
other in space and having a huge van der Waals repulsion energy. It is often preferable to carry out
energy minimization on a conformation to find the best nearby conformation. Energy minimization
is usually performed by gradient optimization: atoms are moved so as to reduce the net forces on
them. The minimized structure has small forces on each atom and therefore serves as an excellent
starting point for molecular dynamics simulations.
In other words, each Cartesian component of the gradient equals the derivative of the
potential energy with respect to that coordinate. Only those interactions involving a given particle
contribute to the gradient components for that particle's Cartesian coordinates (x, y, z). The
components of the gradient constitute a path, P, in 3N-dimensional space, where N is the number of
atoms. Finding the minimum along this pathway typically involves an interpolation of two points in
3N-space to find a new point where the gradient component along P is zero. Usually, however, the
full gradient is still nonzero at the new point, so a new path is chosen and minimization proceeds.
It is possible to set the path along the negative gradient at each new point, but it is more
efficient to choose the new pathway to be conjugate to all previous paths (a generalized form of
orthogonality). This method of "conjugate gradients" is perhaps the most popular method of energy
minimization. Details of this method can be found in Reference [16].
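The conjugate-gradient idea can be illustrated on the simplest possible energy surface, a quadratic bowl E(x) = 0.5 x.A.x - b.x. This is a sketch under stated assumptions, not the procedure of Reference [16]: a real force-field is not quadratic and needs a numerical line search, and the matrix, vector, and function names below are hypothetical.

```python
def matvec(A, x):
    """Matrix-vector product for nested lists."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, x0, tol=1.0e-12, max_iter=50):
    """Minimize 0.5*x.A.x - b.x for a symmetric positive-definite A."""
    x = list(x0)
    r = [bi - ai for bi, ai in zip(b, matvec(A, x))]  # negative gradient
    p = list(r)                                       # first path: steepest descent
    steps = 0
    while dot(r, r) > tol and steps < max_iter:
        Ap = matvec(A, p)
        alpha = dot(r, r) / dot(p, Ap)                # exact minimum along path p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r_new = [ri - alpha * api for ri, api in zip(r, Ap)]
        beta = dot(r_new, r_new) / dot(r, r)
        p = [rn + beta * pi for rn, pi in zip(r_new, p)]  # new conjugate path
        r = r_new
        steps += 1
    return x, steps


A = [[4.0, 1.0], [1.0, 3.0]]   # hypothetical curvature matrix
b = [1.0, 2.0]
x_min, n_steps = conjugate_gradient(A, b, x0=[2.0, 1.0])
print(x_min, n_steps)
```

On a quadratic surface with two coordinates the search needs only two paths, which is the efficiency gain over repeatedly following the negative gradient.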
It is also possible to minimize the energy of a conformation by optimizing the dihedral angle degrees
of freedom, rather than the Cartesian coordinates. The minimization occurs in n-dimensional
space, where n is the number of dihedral angles. Torques, or derivatives of the forcefield with
respect to dihedral angles, take the place of the gradient. We have found that "torque minimization,"
when followed by Cartesian minimization, produces an overall lower-energy conformation than
Cartesian minimization alone. Neither method, however, can guarantee that the lowest-energy
conformation (the global minimum) will be reached. The process of moving along pathways in
conformational space usually ends at a "local minimum" - a well in the potential energy surface,
where the energy is lower than for all other nearby conformations, but not necessarily lower than
other local minima.
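The dependence of the result on the starting conformation can be seen with a toy example. The sketch below applies plain steepest descent with a fixed step size to a hypothetical one-dimensional double-well energy; the energy function, step size, and tolerance are assumptions chosen for illustration.

```python
def energy(x):
    """Hypothetical tilted double-well potential with two unequal minima."""
    return x**4 - 2.0 * x**2 + 0.3 * x


def gradient(x):
    """Analytic derivative of the energy above."""
    return 4.0 * x**3 - 4.0 * x + 0.3


def steepest_descent(x, step=0.01, tol=1.0e-8, max_iter=10000):
    """Move downhill along the negative gradient until it nearly vanishes."""
    for _ in range(max_iter):
        g = gradient(x)
        if abs(g) < tol:
            break
        x -= step * g
    return x


for start in (1.5, -1.5):
    x_end = steepest_descent(start)
    print(f"start = {start:+.1f}  ->  x = {x_end:+.4f}, E = {energy(x_end):+.4f}")
```

The run started on the right settles in the shallower well and never finds the deeper one on the left, which is why minimization alone cannot be relied on to locate the global minimum.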
Some contain sets of patterns and motifs derived from sequence homologs.
GenBank - the NIH genetic sequence database, an annotated collection of all publicly available
DNA sequences.
TIGR - a collection of curated databases containing DNA and protein sequence, gene
expression, cellular role, protein family, and taxonomic data for microbes, plants and
humans.
BLOCKS - multiply aligned ungapped segments corresponding to the most highly conserved
regions of proteins.
PFam - a database of multiple alignments of protein domains or conserved protein regions. The
alignments represent some evolutionary conserved structure which has implications for the
protein's function. Profile hidden Markov models (profile HMMs) built from the Pfam
alignments can be very useful for automatically recognizing that a new protein belongs to an
existing protein family, even if the homology is weak.
Protein Profiles - online cross-references to the Oxford University Press Protein Profiles
project.
ProtoMap - site offers an exhaustive classification of all the proteins in the SWISSPROT and
TrEMBL databases, into groups of related proteins. The resulting classification splits the protein
space into well-defined groups of proteins, most of which correlate closely with natural
biological families and superfamilies. The hierarchical
organization may help to detect finer subfamilies that make up known families of proteins as
well as interesting relations between protein families.
SBASE - protein domain library sequences that contains 237,937 annotated structural,
functional, ligand-binding and topogenic segments of proteins, cross-referenced to all major
sequence databases and sequence pattern collections.
SYSTERS - SYSTERS cluster set contains sequences from SWISS-PROT, TrEMBL, PIR,
Wormpep, and MIPS Yeast protein translations which are sorted into disjoint clusters.
Fragmental sequences build single sequence clusters, while the remaining sequences are
contained in clusters of non-redundant sequences per cluster.
PROTEIN STRUCTURE DATABASES
Library of Protein Family Cores - structural alignments of protein families and computed
average core structures for each family. Useful for building models, threading, and exploratory
analysis.
RCSB Protein Data Bank - single international repository for the processing and distribution
of 3-D macromolecular structure data primarily determined experimentally.
Protein Loop Classification - Conformational clusters and consensus sequences for protein
loops derived by computational analysis of their structures.
3 Dee - Database of Protein Domain Definitions - contains structural domain definitions for
all protein chains in the Protein Databank (PDB) that have 20 or more residues and are not
theoretical models.
GENOMES
FlyBase - a comprehensive database for information on the genetics and molecular biology of
Drosophila. It includes data from the Drosophila Genome Projects and data curated from the
literature.
GeneCards - database of human genes, their products and their involvement in diseases.
KEGG: Kyoto Encyclopedia of Genes and Genomes - information on pathways that consist of
interacting molecules or genes, with links from the gene catalogs produced by genome
sequencing projects.
Whitehead Institute for Genomic Research - information on the Neurospora crassa Genome
Database, Human SNP Database, Human Physical Mapping Project, Mouse Genetic and
Physical Mapping Project, Rat Genetic Mapping Project, Mouse RH Mapping Project, Genome
Center ftp Archive (Data).
FastM and ModelInspector - a program for the generation of models for regulatory regions in
DNA sequences.
OTHER
Enzyme Structures Database - contains the known enzyme structures that have been
deposited in the Brookhaven Protein Data Bank (the PDB).
Gene Ontology Consortium - attempts to produce a dynamic controlled vocabulary that can
be applied to all eukaryotes.
Human Transcript Database - a curated source for information related to RNA molecules that
have been sequenced.
NDB - Nucleic Acid Database Project - assembles and distributes structural information about
nucleic acids.
PMD - Protein Mutant Database - covers natural as well as artificial mutants, including random
and site-directed ones, for all proteins except members of the globin and immunoglobulin
families.
TOOLS
ProteinProspector - Proteomics tools for mining sequence databases in conjunction with Mass
Spectrometry experiments.
SignalP - predicts the presence and location of signal peptide cleavage sites in amino acid
sequences from different organisms: Gram-positive prokaryotes, Gram-negative prokaryotes,
and eukaryotes.
These algorithms are designed for the comparison of a protein sequence against sequence
databases to detect similar or homologous proteins. Conserved regions usually have similar
amino acid sequence and/or structural similarities. Perform at least three separate searches using
different algorithms. If default settings do not detect any similar proteins, try varying the PAM
matrix values. Lower PAM matrix values are best for identifying short regions of sequence with very
high similarity. Higher PAM matrices are able to detect longer, weaker
matches. Simultaneously, adjust the gap penalty value around the default value.
BLAST - The BLAST programs have been designed for speed, with a minimal sacrifice of
sensitivity to distant sequence relationships. The BLAST search algorithm is designed to find
close matches rapidly. It is faster than the Smith-Waterman (S-W) algorithm.
BLITZ - performs a sensitive and extremely fast comparison of a protein sequence against the
SWISS-PROT protein sequence database using the Smith-Waterman algorithm. The Smith-
Waterman algorithm is able to detect short matching regions such as binding sites in the middle
of long sequences.
Bic-sw - Smith & Waterman algorithm implementation for protein database searches
FASTA - detects patches of regional similarity rather than the best alignment between the
query sequence and the database sequences. Very fast, but complete sensitivity is sacrificed.
GeneMatcher - The Smith-Waterman (S-W) search algorithm used by the FDF server is about
5% more sensitive towards divergent matches than the BLAST algorithm. This significantly
increases the chances of finding distant homologs of your query sequence in the databases. FDF
software incorporates a frameshift-tolerant search algorithm. This feature is particularly useful
when searching for potential coding sequences in low-quality DNA sequences, such as those
found in EST databases.
MPsearch - MPSRCH is a biological sequence comparison tool that implements the true Smith
and Waterman algorithm. This algorithm exhaustively compares every letter in a query
sequence with every letter in the database.
Paralign and SWMMX - searches a number of sequence databases for sequences similar to your
amino acid query sequence using two very sensitive algorithms. You can choose between the
well-known Smith-Waterman optimal local alignment algorithm or a new algorithm called
ParAlign, which is much faster but still almost as sensitive.
Pfam - HMM Search - Unlike standard pairwise alignment methods (e.g. BLAST, FASTA),
Pfam HMMs deal sensibly with multidomain proteins.
SAS - Sequences Annotated by Structure - will perform a FASTA search of the given sequence
against the proteins of known structure in the PDB and return a multiple alignment of all hits,
each annotated by structural features.
Scanps 2.3 - Fast implementation of the true Smith & Waterman algorithm for protein database
searches.
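The exhaustive comparison performed by Smith-Waterman style searches, in which every letter of one sequence is scored against every letter of the other, can be sketched as a small dynamic-programming routine. The scoring scheme below (match +2, mismatch -1, linear gap -2) is an assumption made for brevity; the servers listed above use substitution matrices such as PAM and their own gap penalties, and the two test sequences are hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]   # H[i][j]: best score ending at a[i-1], b[j-1]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at zero.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


# Two hypothetical sequences sharing a short region inside unrelated flanks.
print(smith_waterman_score("AAAGPPGTGKTLAAA", "CCCGPPGTGKTLCCC"))
```

Because the score is floored at zero, a short conserved stretch in the middle of otherwise dissimilar sequences is still detected, at the cost of filling a table whose size is the product of the two sequence lengths.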
There are a limited number of families into which most proteins are grouped. Proteins within a
given family generally have a shared function. Conserved regions are usually important for
function or for maintaining a specific 3D structure. Conserved regions usually have similar
amino acid sequence and/or structural similarities. Domains are distinct functional regions of a
protein, often linked together by a flexible region. Motifs are recurring substructures found in
many proteins. Proteins of 500 or more amino acids most likely contain discrete functional
domains. Regions of low complexity often separate domains. Long stretches of repeated
residues, particularly proline, glutamine, serine, or threonine, often indicate linker
sequences. Approximately 2000-3000, out of a predicted 10,000-20,000, different protein
families have been characterized. Roughly half of the proteins encoded in a new genome can be
placed in a known family based on their amino acid sequence.
eMatrix - fast and accurate sequence analysis using minimal-risk scoring matrices.
MEME - Multiple EM for Motif Elicitation - MEME is a tool for discovering motifs in a group
of related DNA or protein sequences. Takes as input a group of DNA or protein sequences (the
training set) and outputs as many motifs as requested. MEME uses statistical modeling
techniques to automatically choose the best width, number of occurrences, and description for
each motif.
MOTIF - finds sequence motifs in a query sequence, and also provides functional and genomic
information on the found motifs using DBGET and LinkDB as the hyperlinked annotations.
Results are presented graphically, and, where available, 3D structures of the found motifs can be
examined with the RasMol program when the hits are found in the PROSITE database. Also, given a
profile generated from a multiple sequence alignment, or retrieved from a motif library such
as PROSITE or Pfam, you can align a protein sequence with the profile.
Network Protein Sequence Analysis - this multi-algorithm server offers two pattern and
signature searches: PATTINPROT: scan a protein sequence or a protein database for one or
several pattern(s) and PROSCAN: scan a sequence for sites/signatures against PROSITE
database.
PFam HMM Search - Analyzes a protein query sequence to find Pfam domain matches.
ProDom BLAST - BLAST homology search against all domain sequences in ProDom.
Pscan - uses information derived from the PRINTS database to detect functional fingerprints in
protein.
P-val FingerPRINTScan - find the closest matching PRINTS fingerprint/s to a query sequence.
ScanProsite - Scans a protein sequence for the occurrence of patterns stored in the PROSITE
database.
Folding and coiling due to H-bond formation determines secondary structure. H-bonds form
between carboxyl and amino groups of nonadjacent amino acids. A single polypeptide can have
both helical and sheet regions. Regions that are neither helix nor sheet can form bends, loops or turns.
HTH - gives a practical estimation of the probability that the sequence is a helix-turn-helix
motif.
Jpred2 - takes either a protein sequence or a multiple alignment of protein sequences, and
predicts secondary structure. It works by combining a number of modern, high quality
prediction methods to form a consensus.
META PredictProtein - this multi-algorithm server utilizes eight different algorithms for
predicting secondary structure.
MultiCoil - program predicts the location of coiled-coil regions in amino acid sequences and
classifies the predictions as dimeric or trimeric. The method is based on the PairCoil algorithm.
PairCoil - predicts the location of coiled-coil regions in amino acid sequences by use of
Pairwise Residue Correlations.
PSA Protein Structure Prediction Server - determines the probable placement of secondary
structural elements along a query sequence.
PSIPRED
Structure Prediction Server - this multi-algorithm server uses the PHD algorithm to predict
secondary structure.
SOSUI
Tandem Repeats Finder - a program to locate and display tandem repeats (two or more adjacent,
approximate copies of a pattern of nucleotides) in DNA sequences.
TMHMM - predicts transmembrane helices and the predicted location of the intervening loop
regions.
TERTIARY STRUCTURE
Dali - compares the coordinates of a query protein structure against those in
the Protein Data Bank. The output consists of a multiple alignment of structural neighbours.
3D-pssm - A Fast, Web-based Method for Protein Fold Recognition using 1D and 3D Sequence
Profiles coupled with Secondary Structure and Solvation Potential Information.
PROTEIN CHEMISTRY
Compute pI/MW
FindMod Tool - predicts potential protein post-translational modifications (PTM) and finds
potential single amino acid substitutions in peptides.
GlycoMod Tool - predicts the possible oligosaccharide structures that occur on proteins from
their experimentally determined masses.
ProtParam Tool - allows the computation of various physical and chemical parameters for a
given protein stored in SWISS-PROT or TrEMBL or for a user entered sequence.
YinOYang 1.2 Prediction Server - produces neural network predictions for O-β-GlcNAc
attachment sites in eukaryotic protein sequences.
SRYPGQVSFGGIGGLNDQIRELREVIELPLKNPELFLRVGIKPPKGVLLYGPPGTGKTL
LARAVASSLETNFLKVVSSAIVDKYIGESARLIREMFGYAKGTRALHHLHGRDRCHR
WQAFQRGYICRQRNPAYTYGAPQPARRFRLSRQDQDHHGDEPPRYPRPCFAACRPSR
SQD
QUESTIONS TO PRACTICE:
1. What are non-canonical amino acids? Discuss its applications in protein engineering.
2. Discuss the Aminoacyl t-RNA synthetases structure.
3. Engineering of tRNA and Aminoacyl t-RNA synthetase for the site-specific
incorporation of unnatural amino acids into proteins in vivo.
4. Write a brief note on protein scaffolds and its choice for protein engineering
5. Brief on protein structure prediction methods
6. What is molecular modeling? What are its applications in protein engineering?
7. Discuss in brief about energy minimization
8. Give a detailed account on protein databases
UNIT – V - ENZYME AND PROTEIN ENGINEERING – SBTA5202
5. APPLICATIONS OF DIRECTED EVOLUTION TOOLS
Historically, microbial culture has been the most important route for enzyme discovery, even though
only a small fraction of all microbes can be sampled by this method. This classical strategy has rapidly
been replaced by high-throughput methods based on genomic sequence discovery. However, even
these strategies are limited by the natural ability of enzymes to perform only a well-defined set of
transformations. Directed evolution has been used with great success in recent years for the
diversification of gene sequences and optimization of enzyme phenotypes. By surveying the available
gene sequence space, specific traits are created through screening of libraries consisting of 10^4 to 10^10
individuals. In all cases, development of a suitable assay is critical to success in exploring the fitness
landscape of these enzymes.
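To see why a library of 10^4 to 10^10 individuals samples only a small part of sequence space, the
following sketch counts the distinct variants that carry exactly k amino acid substitutions. The
protein length of 300 residues and the function name variants_with_k_substitutions are
assumptions chosen for illustration; they do not come from the studies described in this unit.

```python
from math import comb


def variants_with_k_substitutions(length: int, k: int) -> int:
    """Count sequences differing from the parent at exactly k positions.

    Choose which k positions change (binomial coefficient), then pick one of
    the 19 alternative amino acids at each chosen position.
    """
    if length < 0 or k < 0:
        raise ValueError("length and k must be non-negative")
    return comb(length, k) * 19 ** k


if __name__ == "__main__":
    protein_length = 300          # hypothetical enzyme length
    largest_library = 10 ** 10    # upper library size quoted in the text
    for k in range(1, 5):
        n = variants_with_k_substitutions(protein_length, k)
        covered = "can be covered" if n <= largest_library else "exceeds the largest library"
        print(f"{k} substitution(s): {n:.3e} variants, {covered}")
```

Under these assumptions there are 5,700 single mutants and 16,190,850 double mutants, while the
triple mutants already number about 3 x 10^10, more than the largest library size mentioned above.
This is one reason directed evolution proceeds through repeated rounds of diversification and screening.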
Expanding specificity
Another application of directed evolution is to fine-tune the specificity of enzymes. Many successful
examples have been demonstrated that are useful for the production of important industrial products.
The E. coli D-2-keto-3-deoxy-6-phosphogluconate (KDPG) aldolase, which catalyzes the highly
specific reversible aldol reaction on D-configured KDPG substrates, was subjected to DNA
shuffling and screening, and one variant was isolated that accepts both D- and L-
glyceraldehyde as substrates in a non-phosphorylated form. More recently, the P450 BM-3
monooxygenase, normally specific for medium-chain fatty acids, has been evolved to accept small
hydrocarbon substrates and convert them at very high rates.
Perhaps the most dramatic success in this area is the use of directed evolution to create novel
specificity and activity. Sun et al. used combinatorial mutagenesis to change the substrate specificity
of galactose oxidase to use glucose as a substrate. One variant (with only three point mutations)
exhibited activity against D-glucose and oxidized other primary and secondary alcohols. Family
shuffling of two homologous biphenyl dioxygenases created several variants with enhanced substrate
specificity towards ortho-substituted polychlorinated biphenyls and other aromatic compounds
such as benzene, suggesting that it is feasible to extend biodegradation to other highly
recalcitrant pollutants.
In addition to substrate specificity, product specificity can also be altered by directed evolution. Wild-
type toluene 4-monooxygenase (T4MO) of Pseudomonas stutzeri OX1 oxidizes toluene to p-cresol
(96%) and oxidizes benzene sequentially to phenol, catechol, and
1,2,3-trihydroxybenzene. To synthesize novel dihydroxy and trihydroxy derivatives of benzene and
toluene, DNA shuffling of the alpha-hydroxylase fragment of T4MO (TouA) and saturation
mutagenesis of the TouA active site residues were used to generate random mutants. Several variants
were isolated that formed 4-methylresorcinol, 3-methylcatechol, and methylhydroquinone from o-cresol,
whereas wild-type T4MO formed only 3-methylcatechol.
These variants also formed catechol, resorcinol, and hydroquinone from phenol, whereas wild-type
T4MO formed only catechol. These reactions show the potential synthesis of important intermediates
for pharmaceuticals.
Changing stereo- and enantio-selectivity
Often the production of enantiomerically pure compounds is of extreme importance, particularly in
the pharmaceutical industry. In this respect, directed evolution has been useful
in creating enzymes with desirable enantioselectivity. May et al. were the first to demonstrate
the feasibility of inverting the enantioselectivity of a D-hydantoinase to generate an
enzyme that has enhanced selectivity towards L-5-(2-methylthioethyl)hydantoin. Similarly,
inversion of the enantioselectivity of a lipase towards (R)-selectivity was achieved with E = 30
(compared with E = 1.1 for the wild-type enzyme). Perhaps the best industrial success was
demonstrated with the synthesis of cis-(1S,2R)-indandiol, a key precursor of an inhibitor of HIV
protease, by toluene dioxygenase. In three rounds of screening, several variants with up to a three-fold
decrease in production of the undesirable by-product 1-indenol (only 20%, down from 60%) were obtained. In addition
to enantioselectivity, stereoselectivity can also be altered by directed evolution. Williams et al.
demonstrated that stereospecificity of tagatose-1,6-bisphosphate aldolase can be altered by 100-fold
via three rounds of DNA shuffling and screening. The resulting mutant catalyzes the formation of
carbon-carbon bonds with unnatural diastereoselectivity, where the >99:<1 preference for the
formation of tagatose 1,6-bisphosphate was switched to a 4:1 preference for the diastereoisomer,
fructose 1,6-bisphosphate.
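The E values quoted above for the lipase can be made more tangible with a short calculation. E is
the ratio of the specificity constants (kcat/Km) for the two enantiomers, and for a racemic substrate
at very low conversion the enantiomeric excess of the product is approximately (E - 1)/(E + 1). The
sketch below applies this textbook relation; the function name initial_product_ee is hypothetical,
and the low-conversion, racemic-substrate conditions are assumptions that may not match the
conditions under which the reported values were measured.

```python
def initial_product_ee(e_value: float) -> float:
    """Approximate product enantiomeric excess at the start of a kinetic resolution.

    Assumes a racemic substrate and negligible conversion, so the two
    enantiomers react at rates in the ratio E : 1.
    """
    if e_value <= 0:
        raise ValueError("E must be positive")
    return (e_value - 1.0) / (e_value + 1.0)


if __name__ == "__main__":
    # E values taken from the text: wild-type lipase and the evolved variant.
    for label, e_value in (("wild type", 1.1), ("evolved variant", 30.0)):
        ee_percent = 100.0 * initial_product_ee(e_value)
        print(f"{label}: E = {e_value}, initial product ee ~ {ee_percent:.1f}%")
```

Under these assumptions, E = 1.1 corresponds to an almost racemic product (about 5% ee), whereas
E = 30 corresponds to roughly 94% ee, which illustrates how large the reported improvement is.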
Directed evolution can also be used as a powerful tool in optimizing an entire metabolic pathway.
Functional evolution of an arsenic resistance operon has been accomplished by three rounds of
shuffling and selection, resulting in cells that grew in 0.5 M arsenate, a 40-fold increase in resistance.
Ten mutations were located in arsB, encoding the arsenite membrane pump, resulting in a 4-fold to
6-fold increase in arsenite resistance. While arsC, the arsenate reductase gene, contained no
mutations, its expression level was increased, and
the rate of arsenate reduction was increased 12-fold.
Directed evolution has also been shown to enable the construction of artificial networks of
transcriptional control elements in living cells. By applying directed evolution to genes comprising a
simple genetic circuit, a nonfunctional circuit containing improperly matched components can evolve
rapidly into a functional one. Such an approach is likely to result in a
library of genetic devices with a range of behaviors that can be used to construct more complex
genetic circuits.
5.3 Protein Engineering of Enzymes Involved in Bioplastic Metabolism
The petroleum industry has optimized profits by producing value-added coproducts, such as plastics
and chemicals, in addition to primary liquid fuels. A similar coproduct strategy applied to
biorefineries processing cellulosic biomass to liquid fuels and/or energy would transform a
technology that is marginally economic, depending on oil prices, to a sustainable business with
enhanced revenue streams from multiple coproducts. The challenge is finding a biobased coproduct
that is compatible with a biorefinery scenario and where markets warrant its production on a similar
scale as liquid fuels and/or energy. Polyhydroxyalkanoate (PHA) bioplastics represent a coproduct
that would be entirely compatible with either production of liquid fuels by hydrolyzing the residual
biomass after PHA extraction or by alternative thermochemical processes. PHA bioplastics possess
properties making them suitable replacements for many of the applications currently served by
petroleum-based plastics, thus providing tremendous market potential.
PHA bioplastics are a value-added coproduct with a market size compatible with large-scale
production of biofuels and/or energy from plants. They have material properties suitable for accessing
the markets currently served by petroleum-based plastics, and their production has been demonstrated
in several leading candidate bioenergy crops. Production of these materials in biomass crops has the
potential to significantly improve the economics of biomass biorefineries producing liquid fuels
and/or energy. We have discussed reasonable production scenarios that will take time to implement
but present very attractive business opportunities with exciting revenue streams, providing an
economic framework for a truly sustainable business.