
World Academy of Science, Engineering and Technology 48 2010

Chou-Fasman Method for Protein Structure Prediction using Cluster Analysis

Rajbir Singh, Sumandeep Kaur Deol, Parvinder S. Sandhu

Er. Rajbir Singh, Asstt. Prof. & Head, Department of Information Tech., Lala Lajpat Rai Institute of Engineering & Technology, Moga, Punjab, INDIA (phone: +91-9417977061; e-mail: [email protected]).
Er. Dheerajpal Kaur, Lecturer. He is now with the Department of Electronic & Comm. Engg., Lala Lajpat Rai Institute of Engineering and Technology, Moga, Punjab, INDIA (e-mail: [email protected]).
Dr. Parvinder S. Sandhu is working as Professor with the Rayat & Bahra Institute Of Engineering & Bio-Technology, Mohali-Sahauran14004. E-Mail: [email protected]

Abstract: Research in bioinformatics has accumulated a large amount of data. As hardware technology advances, the cost of storage is decreasing. Biological data is available in different formats and is comparatively complex. Knowledge discovery from these large and complex databases is the key problem of this era. Data mining and machine learning techniques are needed that can scale to the size of the problems and can be customized to biological applications. In the present research work, the Chou-Fasman Method is implemented with the help of data mining. Protein structure determination and prediction has been a focal research subject in the field of bioinformatics due to the importance of protein structure in understanding the biological and chemical activities of organisms. The experimental methods used by biotechnologists to determine the structures of proteins demand sophisticated equipment and time. A host of computational methods have been developed to predict the location of secondary structure elements in proteins, complementing or creating insights into experimental results. Cluster analysis is used as the data mining model to retrieve the results.

Keywords: Amino acid, Protein, Polypeptide, clusters, DNA, RNA, PHD, GOR

I. INTRODUCTION

PROTEINS are complex organic compounds that consist of amino acids joined by peptide bonds. Proteins are essential to the structure and function of all living cells and viruses. Many proteins function as enzymes or form subunits of enzymes. Some proteins play structural or mechanical roles. Some proteins function in immune response and in the storage and transport of various ligands. Proteins serve as nutrients as well; they provide the organism with the amino acids that are not synthesized by that organism. Proteins are amongst the most actively studied molecules in biochemistry.

An amino acid is any molecule that contains both an amino group and a carboxylic acid group. An amino acid residue is what remains of an amino acid after it forms a peptide bond and loses a water molecule. Since we are interested in amino acids that form proteins, it is safe to use the terms residue and amino acid interchangeably. There are 20 different amino acids in nature that form proteins.

Structure of Protein

-NH2 (amino group, amino acid 1) + -COOH (carboxylic group, amino acid 2) = -CONH- (peptide bond) + H2O

A chain of such peptide bonds is called a polypeptide and is a protein.

Examples of proteins:
a) Protective Proteins, for example, keratin (nails).
b) Defense Proteins, for example, antibodies.
c) Toxins, for example, snake venom.
d) Structural Proteins, for example, collagen of bones.
e) Enzymes (biocatalysts), for example, pepsin, trypsin.
g) Hormones, for example, insulin is a protein.

Amino acids are the basic building blocks of proteins. Fundamentally, amino acids are joined together by peptide bonds to form the basic structure of proteins. However, owing to the many 'side groups' that are part of the amino acids, other sorts of bonds may form between the amino acid units. These additional bonds twist and turn the protein into convoluted shapes that are unique to the protein and essential to its ability to perform certain functions within the human body.

Amino acids play central roles both as building blocks of proteins and as intermediates in metabolism. The 20 amino acids that are found within proteins convey a vast array of chemical versatility. The precise amino acid content of a specific protein, and the sequence of those amino acids, is determined by the sequence of the bases in the gene that encodes that protein. The chemical properties of the amino acids of proteins determine the biological activity of the protein. Proteins not only catalyze all (or most) of the reactions in living cells, they control virtually all cellular processes. In addition, proteins contain within their amino acid sequences the necessary information to determine how that protein will fold into a three-dimensional structure, and the stability of the resulting structure.

Amino acids bind together in chains to form the stuff from which life is born. It is a two-step process: amino acids get together and form peptides or polypeptides, and it is from these groupings that proteins are made. Commonly recognized amino acids include glutamine, glycine, phenylalanine, tryptophan, and valine. Three of those (phenylalanine, tryptophan, and valine) are essential amino acids for humans; the others are isoleucine, leucine, lysine, methionine, and threonine. The essential amino acids cannot be synthesized by the body; instead, they must be ingested through food.

Amino acids, in the form of proteins, make up a large share of the dry mass of the human body. They are essential to nearly every bodily function. Nearly every chemical reaction that takes place in the body depends on amino acids and the proteins that they build.

Humans can produce 10 of the 20 amino acids. The others must be supplied in food. Failure to obtain enough of even 1 of the 10 essential amino acids, those that we cannot make, results in degradation of the body's proteins (muscle and so forth) to obtain the one amino acid that is needed.

Amino acids are carbon compounds that contain two functional groups: an amino group (NH2) and a carboxylic acid group (COOH). A side chain attached to the compound gives each amino acid a unique set of characteristics.

Fig. 1 A generic Amino acid Structure

Structures of proteins are investigated under four primary groups:
• Primary Structure is the sequence of amino acids in the protein. Counting of residues always starts at the N-terminal end (NH2-group), which is the end where the amino group is not involved in a peptide bond. The primary structure of a protein is determined by the gene corresponding to the protein.
• Secondary Structure is the composition of common patterns in the protein. Some patterns are frequently observed in the native states of proteins. This structure class includes the regions of the protein that adopt these patterns but does not include the coordinates of residues.
• Tertiary Structure is the native state, or folded form, of a single protein chain. This form is also called the functional form. The tertiary structure of a protein includes the coordinates of its residues in three-dimensional space. The elements of secondary structure are usually folded into a compact shape using a variety of loops and turns.
• Quaternary Structure is the structure of a protein complex. Some proteins form a large assembly to function. This form includes the position of the protein subunits of the assembly with respect to each other.

Fig. 2 Different representations of protein structure

Protein Synthesis

While the genetic code itself resides in DNA, DNA is never used directly in the synthesis of proteins. DNA is "transcribed" into messenger RNA (mRNA) that carries information from DNA, and it is mRNA that is then used to "translate" this information into a specific sequence of amino acids that constitute proteins. Similarly, amino acids cannot recognize codons directly; an adapter molecule, a transfer RNA (tRNA), is necessary. Amino acids are incorporated into a protein in an order predetermined by the mRNA sequence. A tRNA can recognize more than one codon; very often, the first 2 nucleotides of the codon are sufficient to specify an amino acid and the third nucleotide varies.

(i) Transcription (DNA > RNA)
RNA is synthesized on a DNA template in the process known as DNA transcription. Transcription generates the mRNA containing the information to synthesize a specific protein and also the other RNA molecules (ribosomal RNA, tRNAs) involved in the process. The key enzyme involved in the process is RNA polymerase, an incredibly complex enzyme with a molecular mass of approximately 500 kDa. DNA is transcribed by RNA polymerase binding to a specific start site or "promoter" on the DNA and proceeding until it reaches a termination signal.


The DNA double helix is partially unwound by the polymerase and transcription always proceeds in a 3' to 5' direction on the DNA template so that the RNA produced is extended in a 5' to 3' direction.

(ii) Translation (RNA > Protein)
The ribosome binds to the mRNA at the start codon (AUG) that is recognized only by the initiator tRNA. The ribosome proceeds to the elongation phase of protein synthesis. During this stage, complexes composed of an amino acid linked to tRNA sequentially bind to the appropriate codon in mRNA by forming complementary base pairs with the tRNA anticodon. The ribosome moves from codon to codon along the mRNA. Amino acids are added one by one, translated into polypeptide sequences dictated by DNA and represented by mRNA. At the end, a release factor binds to the stop codon, terminating translation and releasing the complete polypeptide from the ribosome.

Fig. 3 Protein Synthesis from DNA to RNA to Protein.

Secondary Structure Prediction
Given a protein sequence with amino acids a1 a2 . . . an, the secondary structure prediction problem is to predict whether each amino acid ai is in an α-helix, a β-sheet, or neither. If the actual secondary structure for each amino acid is known (say, through structural studies), then the 3-state accuracy is the percent of residues for which the prediction matches reality. It is called "3-state" because each residue can be in one of 3 "states": α, β, or other (O). Because there are only 3 states, random guessing would yield a 3-state accuracy of about 33%, assuming that all structures are equally likely. There are different methods of prediction with various accuracies. Some of these methods are:

(i) GOR Method
The GOR method, named for the three scientists who developed it (Garnier, Osguthorpe, and Robson), considers the information carried by a residue about its own secondary structure, in combination with the information carried by other residues in a local window of eight residues on either side of the residue concerned. The accuracy of these early methods, based on the local amino acid composition of single sequences, was fairly low, with often less than 60% of residues being predicted in the correct secondary structure state.

(ii) PHD
The neural net model employed by Rost and Sander was fairly complex and computationally expensive. Because of the computational demands, a 7-fold cross-validation was used in place of jack-knife testing. Accuracy was over 70% using multiple sequence alignment, and the fifth of residues with the highest reliability was predicted with over 90% accuracy. Rost and Sander also tested PHD on 26 new proteins, none with significant sequence similarity to any protein in the training set, and found comparable results. PHD, however, suffers from some of the problems typical of artificial neural networks (ANNs). Rost and Sander were concerned with overtraining and therefore terminated training once the accuracy was higher than 70% for all training samples.

(iii) Chou-Fasman Method
The Chou-Fasman method was among the first secondary structure prediction algorithms developed and relies predominantly on probability parameters determined from relative frequencies of each amino acid's appearance in each type of secondary structure. In this method, a helix is predicted if, in a run of six residues, four are helix favoring and the average value of the helix propensity is greater than 100 and greater than the average strand propensity. Such a helix is extended along the sequence until a proline is encountered (helix breaker) or a run of 4 residues with helical propensity less than 100 is found. A strand is predicted if, in a run of 5 residues, three are strand favoring, and the average value of the strand propensity is greater than 104 and greater than the average helix propensity. Such a strand is extended along the sequence until a run of 4 residues with strand propensity less than 100 is found.

II. METHODOLOGY

Data Mining Model used for implementation of the CHOU-FASMAN method
As part of the larger process known as knowledge discovery, data mining is the process of extracting information from large volumes of data. This is achieved through the identification and analysis of relationships and trends within commercial databases. Data mining is used in areas as diverse as space exploration and medical research.

This model makes use of clustering as the data mining method and uses conceptual clustering as the type of clustering. Clustering can be considered the most important unsupervised learning problem.

Clustering: Cluster analysis is an exploratory data analysis tool for solving classification problems. Its object is to sort cases into groups, or clusters, so that the degree of association is strong between members of the same cluster and weak between members of different clusters. Each cluster thus describes, in terms of the data collected, the class to which its members belong; and this description may be abstracted through use from the particular to the general class or type.


Cluster analysis is thus a tool of discovery. It may reveal associations and structure in data which, though not previously evident, nevertheless are sensible and useful.

The cluster analysis technique is applied to identify whether the secondary structure of a given amino acid sequence is α-helix, β-sheet or turn.

• Input: Amino acid sequence (plain text format)
• Output: Clusters of α-helix, β-sheet and turn.

Chou-Fasman Method For Protein Structure Prediction

The Chou-Fasman algorithm for the prediction of protein secondary structure is one of the most widely used predictive schemes. The Chou-Fasman method of secondary structure prediction depends on assigning a set of prediction values to a residue and then applying a simple algorithm to the conformational parameters and positional frequencies. The Chou-Fasman algorithm is simple in principle. The conformational parameters for each amino acid were calculated by considering the relative frequency of a given amino acid within a protein, its occurrence in a given type of secondary structure, and the fraction of residues occurring in that type of structure. These parameters are measures of a given amino acid's preference to be found in helix, sheet or coil. Using these conformational parameters, one finds nucleation sites within the sequence and extends them until a stretch of amino acids is encountered that is not disposed to occur in that type of structure or until a stretch is encountered that has a greater disposition for another type of structure. At that point, the structure is terminated. This process is repeated throughout the sequence until the entire sequence is predicted.

TABLE I CONFORMATIONAL PARAMETERS AND POSITIONAL FREQUENCIES FOR α-HELIX, β-SHEET AND TURN RESIDUES

Name            P(α)  P(β)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Alanine         142    83     66     0.060  0.076   0.035   0.058
Arginine         98    93     95     0.070  0.106   0.099   0.085
Aspartic acid   101    54    146     0.147  0.110   0.179   0.081
Asparagine       67    89    156     0.161  0.083   0.191   0.091
Cysteine         70   119    119     0.149  0.050   0.117   0.128
Glutamic acid   151    37     74     0.056  0.060   0.077   0.064
Glutamine       111   110     98     0.074  0.098   0.037   0.098
Glycine          57    75    156     0.102  0.085   0.190   0.152
Histidine       100    87     95     0.140  0.047   0.093   0.054
Isoleucine      108   160     47     0.043  0.034   0.013   0.056
Leucine         121   130     59     0.061  0.025   0.036   0.070
Lysine          114    74    101     0.055  0.115   0.072   0.095
Methionine      145   105     60     0.068  0.082   0.014   0.055
Phenylalanine   113   138     60     0.059  0.041   0.065   0.065
Proline          57    55    152     0.102  0.301   0.034   0.068
Serine           77    75    143     0.120  0.139   0.125   0.106
Threonine        83   119     96     0.086  0.108   0.065   0.079
Tryptophan      108   137     96     0.077  0.013   0.064   0.167
Tyrosine         69   147    114     0.082  0.065   0.114   0.125
Valine          106   170     50     0.062  0.048   0.028   0.053

The algorithm contains the following steps:

(a) Assign parameter values to all residues of the peptide.
(b) Scan the peptide and identify regions where 4 out of 6 contiguous residues have P(α) > 100. These regions nucleate α-helices. Extend these in both directions until a set of four contiguous residues has an average P(α) < 100. This ends the helix.
(c) Scan the peptide and identify regions where 3 out of 5 contiguous residues have P(β) > 100. These residues nucleate β-strands. Extend these in both directions until a set of four contiguous residues has an average P(β) < 100. This ends the β-strand.
(d) Any region containing overlapping α and β assignments is taken to be helical or β depending on whether the average P(α) or the average P(β) for that region is larger. If this reduces an α or β region to fewer than 5 residues, the α or β assignment for that region is removed.
(e) To identify a β-turn at residue number i, the product p(t) = f(i)f(i+1)f(i+2)f(i+3) is calculated, where the f(i) value is taken for residue i, the f(i+1) value for residue i+1, the f(i+2) value for residue i+2 and the f(i+3) value for residue i+3. To predict a β-turn, the following three conditions have to be simultaneously fulfilled:
   • p(t) > 0.000075.
   • The average value of P(turn) for the four amino acids is greater than 100.
   • The average P(turn) is larger than the average P(α) as well as the average P(β).
(f) The remaining parts of the sequence without an assignment are considered coils.

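The output described above, clusters of α-helix, β-sheet and turn, can be read as runs of identical per-residue states. The sketch below assumes the prediction is available as a string over the letters A, B and T; the function name `state_runs` is hypothetical and the grouping uses the standard-library `itertools.groupby`.

```python
from itertools import groupby


def state_runs(states):
    """Collapse a per-residue state string into (state, first, last) runs,
    with 0-based inclusive residue indices."""
    runs = []
    pos = 0
    for state, group in groupby(states):
        length = len(list(group))
        runs.append((state, pos, pos + length - 1))
        pos += length
    return runs


if __name__ == "__main__":
    # A short made-up state string, for illustration only.
    for state, first, last in state_runs("TTBBBBTAAAAA"):
        print(state, first, last)
```

Each tuple then corresponds to one cluster in the sense used here: a contiguous stretch of residues assigned to the same structural class.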

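When a reference assignment is available for a sequence, the 3-state accuracy defined under Secondary Structure Prediction can be computed directly from the predicted and actual state strings. The following is a minimal sketch; the function name `q3_accuracy` and the use of the letters A, B and O for α, β and other are assumptions made for illustration.

```python
def q3_accuracy(predicted, actual):
    """Percent of residues whose predicted state matches the actual state."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return 100.0 * matches / len(predicted)


if __name__ == "__main__":
    # Made-up state strings: four of the five residues agree.
    print(q3_accuracy("AABBO", "AABOO"))
```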
III. CHOICE OF SEQUENCE FORMAT

There are various formats of amino acid sequences, and each has its own set of characters and utility. To get a deeper understanding and better results it is essential to choose a valid input format. The various formats are:

• Plain text format
• FASTA format
• Genetics Computer Group (GCG) format
• NEXUS
• NBRF & PIR

After studying the different amino acid formats, plain text was chosen for the present problem. The plain text format is comparatively simpler than the other formats available. The plain text sequence format is typically generated by word processors (saved as a text file with line breaks). Moreover, the other details regarding DNA are not necessary for the present research work.

Plain text format: Plain text format looks like the following:
MAYPMQLGFQDATSPIMEELLHFHDHTLMIVFLISSLVL
YIISLMLTTKLTHTSTMDAQEVETIWTILPAIILILIALPSL
RILYMMDEINNPSTVKTMGHQWYWSYEYTDYEDLSF
DSYMIPTSELKPGELRLLEVDNRVVLPMEAAQQEEEE
MAYPMQLGFQDATSPIMEELLHFHDHTLMIVFLISSLVL
YIISLMLTTKLTHTSTMDAQEVETIWTILPAIILILIALPSL
RILYMMDEINNPSTVKTMGHQWYWSYEYTDYEDLSF
DSYMIPTSELKPGELRLLEVDNRVVLPMEAAQQE.

IV. RESULTS AND DISCUSSION

For a given sequence of amino acids, this technique first clusters the amino acids and then these amino acid clusters are analyzed to predict the structure of the protein. The user inputs the primary structure of the protein, i.e. the amino acid sequence. The clusters of amino acids are extended until an α-helix, β-sheet or turn is predicted using the conformational parameters and positional frequencies for α-helix, β-sheet and turn residues.

The whole detailed method is explained below:
Example:

INPQAIFDIQIKRLHEYKRQHHDKQVHMANLCVVGGFA
VNGVAALHSDLVVKDLFPEYHQLWPNKFHNVTNGITP
RRWIKQCNPALAALLDKSLQKEWANDLDQLINLVKLA
DDAKFRQLYRVIKQANKVRLAEFVKVRTIDLNLLHILA
LYKERIRENP

The above sequence is divided into clusters, and the conformational parameters and positional frequencies for α-helix, β-sheet and turn residues are taken from Table I.

Hence the final secondary structure of the given sequence is:

TTTBBBBBBBBBBBBBTTTTAAAAAAABBBBBBBTTTT
TTTTTTTTTTTTTTTTTBBBBBTTAAAAAAAAAAAAAA
AAATTBBBBBBTTTTTTTTTTTTBBBBBBTTTTAAAAA
AAATTTTTTTTBBBBTTTT

V. CONCLUSION

The method attempts to classify the amino acids in a protein sequence according to their predicted local structure, which can be subdivided into three states: α-helix, β-sheet or turn.
• This research work will assist researchers in predicting tertiary and quaternary structure and will also act as a base for protein fold and function prediction.
• Protein fold can be predicted with better accuracy with this technique.
• Protein structure and function prediction can be included in the same system.
• Various other data mining techniques can be used to determine an optimum result.
• Various formats of amino acid sequences can be utilized.
• Protein structure and protein function prediction can be done based on an improved Chou-Fasman method which includes 4 amino acids enabling a reverse β-turn.

VI. FUTURE SCOPE OF WORK

The following improvements to the developed model can be made:
• The system can be extended to predict the tertiary structure of the protein.
• Various different mining techniques can be utilized to determine the optimum result.
• Different formats of amino acids can be utilized.
• Protein fold can be predicted with better accuracy with this technique.
• This technique can be further extended for multiple sequence alignment.

Singh R is an Assistant Professor in the Department of Information Technology of Lala Lajpat Rai Institute of Engineering & Technology, Moga, India. He received his B.E. (Honours) degree in Computer Science and Engineering from MD University, Rohtak, Haryana and his M-Tech degree in Computer Science and Engineering from Punjab Technical University, Jalandhar, Pb. (INDIA). He has authored four books on Computer Science. His main fields of research interest are Bioinformatics and Data Mining. He works on Gene Expression, Phylogenetic Trees and Prediction of Protein Sequence & Structure.

Sumandeep Kaur Deol is a faculty member in the Department of Computer Science & Engineering of Lala Lajpat Rai Institute of Engineering & Technology, Moga, India. She received her B.Tech in Computer Science & Engineering and her M-Tech degree in Computer Science & Engineering from Punjab Technical University, Jalandhar, Pb. (INDIA). Her research interests include Neural Networks, Genetic Algorithms and Data Mining.
