Chou Fasman
World Academy of Science, Engineering and Technology 48 2010

Er. Rajbir Singh, Asstt. Prof. & Head, Department of Information Tech., Lala Lajpat Rai Institute of Engineering & Technology, Moga, Punjab, INDIA (phone: +91-9417977061; e-mail: [email protected]).
Er. Dheerajpal Kaur, Lecturer. He is now with the Department of Electronic & Comm. Engg., Lala Lajpat Rai Institute of Engineering and Technology, Moga, Punjab, INDIA (e-mail: [email protected]).
Dr. Parvinder S. Sandhu is working as Professor with the Rayat & Bahra Institute Of Engineering & Bio-Technology, Mohali-Sahauran14004. E-Mail: [email protected]

Proteins are complex organic compounds that consist of amino acids joined by peptide bonds. Proteins are essential to the structure and function of all living cells and viruses. Many proteins function as enzymes or form subunits of enzymes. Some proteins play structural or mechanical roles. Some proteins function in immune response and in the storage and transport of various ligands. Proteins serve as nutrients as well; they provide the organism with the amino acids that are not synthesized by that organism. Proteins are amongst the most actively studied molecules in biochemistry.
An amino acid is any molecule that contains both an amino group and a carboxylic acid group. An amino acid residue is what remains of an amino acid after it forms a peptide bond and loses a water molecule.
Amino acids are joined by peptide bonds to form the basic structure of proteins. However, owing to the many ‘side groups’ that are part of the amino acids, other sorts of bonds may form between the amino acid units. These additional bonds twist and turn the protein into convoluted shapes that are unique to the protein and essential to its ability to perform certain functions within the human body.
Amino acids play central roles both as building blocks of proteins and as intermediates in metabolism. The 20 amino acids that are found within proteins convey a vast array of chemical versatility. The precise amino acid content of a specific protein, and the sequence of those amino acids, is determined by the sequence of the bases in the gene that encodes that protein. The chemical properties of the amino acids of proteins determine the biological activity of the protein. Proteins not only catalyze all (or most) of the reactions in living cells, they also control virtually all cellular processes. In addition, proteins contain within their amino acid sequences the necessary information to determine how that protein will fold into a three-dimensional structure, and the stability of the resulting structure.
Amino acids bind together in chains to form the stuff from which our life is born. It is a two-step process: amino acids get together and form peptides or polypeptides, and it is from these groupings that proteins are made. Commonly recognized amino acids include glutamine, glycine, phenylalanine, tryptophan, and valine. Three of those (phenylalanine, tryptophan, and valine) are essential amino acids for humans; the others are isoleucine, leucine, lysine, methionine, and threonine. The essential amino acids cannot be synthesized by the body; instead, they must be ingested through food.
Amino acids make up 75% of the human body. They are essential to nearly every bodily function. Every chemical reaction that takes place in your body depends on amino acids and the proteins that they build.
Humans can produce 10 of the 20 amino acids. The others must be supplied in the food. Failure to obtain enough of even 1 of the 10 essential amino acids, those that we cannot make, results in degradation of the body's proteins (muscle and so forth) to obtain the one amino acid that is needed.
Amino acids are carbon compounds that contain two functional groups: an amino group (NH2) and a carboxylic acid group (COOH). A side chain attached to the compound gives each amino acid a unique set of characteristics.

function. This form includes the position of the protein subunits of the assembly with respect to each other.
The DNA double helix is partially unwound by the polymerase, and transcription always proceeds in a 3’ to 5’ direction on the DNA template so that the RNA produced is extended in a 5’ to 3’ direction.

(ii) Translation (RNA > Protein)
The ribosome binds to the mRNA at the start codon (AUG), which is recognized only by the initiator tRNA. The ribosome then proceeds to the elongation phase of protein synthesis. During this stage, complexes composed of an amino acid linked to tRNA sequentially bind to the appropriate codon in mRNA by forming complementary base pairs with the tRNA anticodon. The ribosome moves from codon to codon along the mRNA. Amino acids are added one by one, translated into polypeptide sequences dictated by DNA and represented by mRNA. At the end, a release factor binds to the stop codon, terminating translation and releasing the complete polypeptide from the ribosome.

Fig. 3 Protein Synthesis from DNA to RNA to Protein.

II. METHODOLOGY
Secondary Structure Prediction
Given a protein sequence with amino acids a1 a2 . . . an, the secondary structure prediction problem is to predict whether each amino acid ai is in an α-helix, a β-sheet, or neither. If you know (say through structural studies) the actual secondary structure for each amino acid, then the 3-state accuracy is the percent of residues for which your prediction matches reality. It is called “3-state” because each residue can be in one of 3 “states”: α, β, or other (O). Because there are only 3 states, random guessing would yield a 3-state accuracy of about 33%, assuming that all structures are equally likely. There are different methods of prediction with various accuracies. Some of these methods are:

(i) GOR Method
The GOR method is named for the three scientists who developed it: Garnier, Osguthorpe, and Robson. It considers the information carried by a residue about its own secondary structure, in combination with the information carried by other residues in a local window of eight residues on either side of the residue concerned.
The accuracy of these early methods, based on the local amino acid composition of single sequences, was fairly low, with often less than 60% of residues being predicted in the correct secondary structure state.

(ii) PHD
The neural net model employed by Rost and Sander was fairly complex and computationally expensive. Because of the computational demands, a 7-fold cross-validation was used in place of jack-knife testing. Accuracy was over 70% using multiple sequence alignment, and the fifth of residues with the highest reliability was predicted with over 90% accuracy. Rost and Sander also tested PHD on 26 new proteins, none with significant sequence similarity to any protein in the training set, and found comparable results. PHD, however, suffers from some of the problems of artificial neural networks (ANNs). Rost and Sander were concerned with overtraining and therefore terminated training once the accuracy was higher than 70% for all training samples.

(iii) Chou-Fasman Method
The Chou-Fasman method was among the first secondary structure prediction algorithms developed and relies predominantly on probability parameters determined from the relative frequencies of each amino acid's appearance in each type of secondary structure. In this method, a helix is predicted if, in a run of six residues, four are helix favouring and the average value of the helix propensity is greater than 100 and greater than the average strand propensity. Such a helix is extended along the sequence until a proline is encountered (helix breaker) or a run of 4 residues with helical propensity less than 100 is found. A strand is predicted if, in a run of 5 residues, three are strand favouring and the average value of the strand propensity is greater than 1.04 (104 on the scale used in Table I) and greater than the average helix propensity. Such a strand is extended along the sequence until a run of 4 residues with strand propensity less than 100 is found.

Data Mining Model used for implementation of the CHOU-FASMAN method
As part of the larger process known as knowledge discovery, data mining is the process of extracting information from large volumes of data. This is achieved through the identification and analysis of relationships and trends within commercial databases. Data mining is used in areas as diverse as space exploration and medical research.
This model makes use of clustering as the data mining method and uses conceptual clustering as the type of clustering. Clustering can be considered the most important unsupervised learning problem.

Clustering: Cluster analysis is an exploratory data analysis tool for solving classification problems. Its object is to sort cases into groups, or clusters, so that the degree of association is strong between members of the same cluster and weak between members of different clusters. Each cluster thus describes, in terms of the data collected, the class to which its members belong; and this description may be abstracted through use from the particular to the general class or type.
Cluster analysis is thus a tool of discovery. It may reveal associations and structure in data which, though not previously evident, nevertheless are sensible and useful.
The cluster analysis technique is applied to identify whether each part of a given amino acid sequence is in an α-helix, a β-sheet or a turn.
• Input: Amino acid sequence (plain text format)
• Output: Clusters of α-helix, β-sheet and turn.

Chou-Fasman Method For Protein Structure Prediction
The Chou-Fasman algorithm for the prediction of protein secondary structure is one of the most widely used predictive schemes. The method depends on assigning a set of prediction values to each residue and then applying a simple algorithm to the conformational parameters and positional frequencies. The Chou-Fasman algorithm is simple in principle.
The conformational parameters for each amino acid were calculated by considering the relative frequency of a given amino acid within a protein, its occurrence in a given type of secondary structure, and the fraction of residues occurring in that type of structure. These parameters are measures of a given amino acid's preference to be found in helix, sheet or coil. Using these conformational parameters, one finds nucleation sites within the sequence and extends them until a stretch of amino acids is encountered that is not disposed to occur in that type of structure, or until a stretch is encountered that has a greater disposition for another type of structure. At that point, the structure is terminated. This process is repeated throughout the sequence until the entire sequence is predicted.

TABLE I CONFORMATIONAL PARAMETERS AND POSITIONAL FREQUENCIES FOR α-HELIX, β-SHEET AND TURN RESIDUES

Name            P(a)   P(b)   P(turn)   f(i)    f(i+1)   f(i+2)   f(i+3)
Alanine         142     83      66      0.060   0.076    0.035    0.058
Arginine         98     93      95      0.070   0.106    0.099    0.085
Aspartic acid   101     54     146      0.147   0.110    0.179    0.081
Asparagine       67     89     156      0.161   0.083    0.191    0.091
Cysteine         70    119     119      0.149   0.050    0.117    0.128
Glutamic acid   151     37      74      0.056   0.060    0.077    0.064
Glutamine       111    110      98      0.074   0.098    0.037    0.098
Glycine          57     75     156      0.102   0.085    0.190    0.152
Histidine       100     87      95      0.140   0.047    0.093    0.054
Isoleucine      108    160      47      0.043   0.034    0.013    0.056
Leucine         121    130      59      0.061   0.025    0.036    0.070
Lysine          114     74     101      0.055   0.115    0.072    0.095
Methionine      145    105      60      0.068   0.082    0.014    0.055
Phenylalanine   113    138      60      0.059   0.041    0.065    0.065
Proline          57     55     152      0.102   0.301    0.034    0.068
Serine           77     75     143      0.120   0.139    0.125    0.106
Threonine        83    119      96      0.086   0.108    0.065    0.079
Tryptophan      108    137      96      0.077   0.013    0.064    0.167
Tyrosine         69    147     114      0.082   0.065    0.114    0.125
Valine          106    170      50      0.062   0.048    0.028    0.053
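The nucleation and extension rules stated under "(iii) Chou-Fasman Method" can be illustrated with a minimal Python sketch that uses the P(a) and P(b) values of Table I. It is not the implementation used in this work, and it rests on several assumptions that the text does not spell out: a residue counts as "favouring" a structure when its propensity exceeds 100; "a run of 4 residues with propensity less than 100" is read as a 4-residue window whose mean propensity is below 100; and where helix and strand regions overlap, the helix label simply wins. The one-letter residue codes and all function names are hypothetical conveniences. Turn prediction from P(turn) and the f(i) frequencies is left out.

```python
# (P(a), P(b)) copied from Table I. One-letter keys are an added convenience.
PROPENSITY = {
    "A": (142, 83), "R": (98, 93), "D": (101, 54), "N": (67, 89),
    "C": (70, 119), "E": (151, 37), "Q": (111, 110), "G": (57, 75),
    "H": (100, 87), "I": (108, 160), "L": (121, 130), "K": (114, 74),
    "M": (145, 105), "F": (113, 138), "P": (57, 55), "S": (77, 75),
    "T": (83, 119), "W": (108, 137), "Y": (69, 147), "V": (106, 170),
}
HELIX, STRAND = 0, 1  # index into each PROPENSITY tuple


def mean(values):
    return sum(values) / len(values)


def nucleation_sites(seq, kind, window, min_favoring, threshold):
    """Return start indices of windows that satisfy the nucleation rule."""
    other = STRAND if kind == HELIX else HELIX
    sites = []
    for start in range(len(seq) - window + 1):
        residues = seq[start:start + window]
        own = [PROPENSITY[r][kind] for r in residues]
        rival = [PROPENSITY[r][other] for r in residues]
        # Assumption: "favouring" means a propensity above 100.
        favoring = sum(p > 100 for p in own)
        if (favoring >= min_favoring and mean(own) > threshold
                and mean(own) > mean(rival)):
            sites.append(start)
    return sites


def extend(seq, start, end, kind, stop_at_proline):
    """Grow the half-open segment [start, end) in both directions."""
    def weak(run):
        # Assumption: a 4-residue run is a breaker when its mean is below 100.
        return mean([PROPENSITY[r][kind] for r in run]) < 100

    while end < len(seq):
        if stop_at_proline and seq[end] == "P":
            break
        if weak(seq[end - 3:end + 1]):
            break
        end += 1
    while start > 0:
        if stop_at_proline and seq[start - 1] == "P":
            break
        if weak(seq[start - 1:start + 3]):
            break
        start -= 1
    return start, end


def predict(seq):
    """Label each residue H (helix), E (strand) or O (other)."""
    states = ["O"] * len(seq)
    # Strand first, helix second: helix overrides on overlap (a simplification).
    rules = ((STRAND, "E", 5, 3, 104), (HELIX, "H", 6, 4, 100))
    for kind, label, window, need, threshold in rules:
        for site in nucleation_sites(seq, kind, window, need, threshold):
            lo, hi = extend(seq, site, site + window, kind, kind == HELIX)
            for i in range(lo, hi):
                states[i] = label
    return "".join(states)


if __name__ == "__main__":
    print(predict("AAAAAAGGGG"))
```

Calling `predict` on a plain-text sequence of one-letter codes returns a string of the same length, one state per residue, which matches the input and output described for the clustering step. A fuller treatment would replace the "helix wins" shortcut with an explicit comparison of average propensities over the overlapping region.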