Lecture4-Protein Data Analysis
Dr. Y. V. Lokeswari
Associate Professor
SSN College of Engineering
Protein Data Analysis – Protein and Amino Acid Sequence
• Protein synthesis constitutes the final stage of information flow within a cell.
• The genetic code in the coding regions of a DNA sequence is translated into biomolecular end products
that perform specific cellular and biological functions.
• Proteomics is the study of proteins and their interactions.
• An understanding of proteins and their functions would lead to new approaches for the diagnosis and treatment of
diseases, for the discovery of new drugs, and for disease control.
• Proteins are composed of linear, unbranched chains of amino acids (from an alphabet of 20 amino acids), linked
together by peptide bonds.
• The general structure of an amino acid consists of two functional groups (amino group, NH2, and carboxyl group, COOH), an H atom, and a
distinctive side group R, all bound to a central carbon called the alpha carbon (Cα).
• The differences between the 20 amino acids are in the nature of the R groups.
• These vary considerably in their chemical and physical properties.
• It is the chemistry of the R groups that determines the many interactions
that stabilize the structure of a protein and enable its biological function.
The formation of a peptide bond between two amino acids to form a peptide chain.
The N-Cα-C sequence is repeated throughout the protein and forms the backbone of the 3D structure.
• Proteins are complex organic molecules that perform their functions through interactions with other molecules at the
molecular level.
• Understanding these interactions requires information about the proteins' 3D structures at the molecular level.
• Protein structures are hierarchical.
• The primary structure of protein refers to the sequence of amino acids that make up the protein.
• The secondary structure refers to the local folding pattern of the polypeptide chain.
• The tertiary structure describes how the secondary structure elements are arranged to form the overall 3D folding
pattern. The tertiary structure is held together by hydrogen, ionic, and disulfide bonds between amino acids.
• It is this unique structure that gives a protein its specific function.
• The quaternary structure describes the interaction of two or more globular or tertiary structures and other groups such
as metal ions or cofactors that make up the functional protein.
• The quaternary structure is held together by ionic, hydrogen, and disulfide bonds between amino acids.
• An example of a protein with a quaternary structure is hemoglobin.
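The primary structure described above is simply an ordered string over the 20-letter amino acid alphabet, so it can be handled as plain text. The following is a minimal sketch, not a standard tool: the function name `composition` and the example peptide are hypothetical and chosen only for illustration.

```python
from collections import Counter

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequence: str) -> dict:
    """Return the fraction of each standard amino acid in a primary structure."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    unknown = set(seq) - set(AMINO_ACIDS)
    if unknown:
        raise ValueError(f"non-standard residue codes: {sorted(unknown)}")
    counts = Counter(seq)
    # Counter returns 0 for residues that never occur.
    return {aa: counts[aa] / len(seq) for aa in AMINO_ACIDS}

# Hypothetical 10-residue peptide (assumption, not taken from the lecture).
peptide = "MKTAYIAKQR"
comp = composition(peptide)
print({aa: frac for aa, frac in comp.items() if frac > 0})
```

Composition is only a summary of the primary structure; the order of residues, which determines the higher levels of structure, is kept in the string itself.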
• The secondary structure of proteins is predominantly stabilized by hydrogen bonds and is generally classified into four
types: α-helix, β-sheet, loop, and random coil.
• The α-helix is the most common form of secondary structure in proteins.
• The helix has 3.6 amino acid residues per turn and is stabilized by hydrogen bonding between the backbone carbonyl
oxygen of one residue and the backbone NH of the fourth residue along the helix.
• Certain amino acids have a distinct preference for α-helices. Alanine (A), glutamic acid (E), leucine (L), and methionine
(M) are good helix formers, whereas proline (P), glycine (G), tyrosine (Y), and serine (S) are helix-breaking residues.
• The second most common element of secondary structure in proteins is the β-sheet.
• A β-sheet is formed from several individual β-strands that are distant from each other along the primary protein
sequence.
• β-strands are usually 5 to 10 residues long and are in a fully extended conformation.
• The individual strands are aligned next to each other in such a way that carbonyl oxygens are hydrogen-bonded with
neighboring NH groups.
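The α-helix figures quoted above (3.6 residues per turn, a hydrogen bond from the carbonyl oxygen of one residue to the NH of the fourth residue along, and the lists of helix formers and breakers) can be turned into a small sketch. The function names and the 0-based residue indexing are assumptions made for illustration; the +1/-1/0 labels are a crude flag, not a published propensity scale.

```python
RESIDUES_PER_TURN = 3.6          # value given in the lecture
HELIX_FORMERS = set("AELM")      # Ala, Glu, Leu, Met
HELIX_BREAKERS = set("PGYS")     # Pro, Gly, Tyr, Ser

def helix_hbond_pairs(n_residues):
    """Backbone hydrogen bonds in an ideal helix: C=O of residue i to N-H of i+4."""
    return [(i, i + 4) for i in range(n_residues - 4)]

def helical_wheel_angles(n_residues):
    """Angular position of each residue around the helix axis, in degrees."""
    step = 360.0 / RESIDUES_PER_TURN   # about 100 degrees per residue
    return [(i * step) % 360.0 for i in range(n_residues)]

def helix_tendency(sequence):
    """Label each residue +1 (former), -1 (breaker), or 0 (neither list)."""
    labels = []
    for aa in sequence.upper():
        if aa in HELIX_FORMERS:
            labels.append(1)
        elif aa in HELIX_BREAKERS:
            labels.append(-1)
        else:
            labels.append(0)
    return labels

print(helix_hbond_pairs(6))
print(helix_tendency("AGKP"))
```

A run of +1 labels suggests a possible helix and a -1 label a likely interruption, which is the intuition behind the statistical prediction methods discussed later in the lecture.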
Hydrogen bond patterns in beta sheets. Here, a four-stranded beta sheet, which contains three antiparallel and one
parallel strand, is drawn schematically. Hydrogen bonds are indicated with red lines (antiparallel strands) and green lines
(parallel strands) connecting the hydrogen and the acceptor oxygen.
• Loops are regions of a protein chain that connect α-helices and β-strands or sheets to each other.
• The helices and sheets form the stable hydrophobic core of the protein, and the connecting loops are found on the
surface of the structure.
• Because amino acids in loops are not constrained by space and environment, unlike amino acids in the core region, and
because they do not affect the arrangement of secondary structures in the core, more substitutions, insertions,
and deletions may occur in loops.
• Thus, in a sequence alignment, the presence of these features may be an indication of a loop.
• Random coil is the term used for segments of polypeptide chains that do not form regular secondary structures.
• Such conformations are not really random: they are the result of a balance of interactions between amino acid side
chains and the solvent and interactions between side chains.
• Depending on the type of secondary structures present, the tertiary structure of a protein is classified into seven classes in
the SCOP database.
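The observation above that insertions and deletions cluster in loops suggests a simple heuristic: alignment columns rich in gaps are candidate loop positions. The sketch below assumes a multiple alignment given as equal-length strings with "-" as the gap character; the function names, the 0.5 threshold, and the toy alignment are all hypothetical.

```python
def gap_fraction_per_column(alignment):
    """Fraction of sequences that have a gap ('-') in each alignment column."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [
        sum(row[col] == "-" for row in alignment) / len(alignment)
        for col in range(length)
    ]

def candidate_loop_columns(alignment, threshold=0.5):
    """Columns whose gap fraction reaches the (assumed) threshold."""
    fractions = gap_fraction_per_column(alignment)
    return [col for col, frac in enumerate(fractions) if frac >= threshold]

# Toy alignment invented for illustration; two sequences lack a "GS" insertion.
toy_alignment = [
    "MKV--LAG",
    "MKVGSLAG",
    "MRV--LSG",
]
print(candidate_loop_columns(toy_alignment))
```

A gap-rich column is only an indication of a loop, as the lecture says; it is not proof, since gaps can also arise from alignment errors or terminal extensions.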
The seven classes of protein tertiary structure in SCOP:
1. All α proteins (Fig. 4.10a)
2. All β proteins (Fig. 4.10b)
3. Alpha and beta proteins (α/β) (Fig. 4.10c): mainly parallel β-sheets with intervening α-helices
4. Alpha and beta proteins (α + β) (Fig. 4.10d): mainly segregated α-helices and antiparallel β-sheets
5. Multi-domain proteins (α and β) (Fig. 4.10e): folds consisting of two or more domains belonging to different classes
6. Membrane and cell surface proteins and peptides (Fig. 4.10f): excludes proteins in the immune system
7. Small proteins (Fig. 4.10g)
Internet resources for protein structure classification
• The CATH database - hierarchical domain classification of protein structures
• SCOP (Structural Classification of Proteins) database - structural and evolutionary relationships between all proteins
• SWISS-MODEL - fully automated protein structure homology modeling server
• Protein Data Bank (PDB) - repository of 3D protein structures
• The DALI (Distance ALIgnment tool) server is a network service for comparing protein structures in 3D.
• The FSSP (Fold classification based on Structure-Structure alignment of Proteins) database is based on an exhaustive
all-against-all 3D structure comparison of protein structures.
• 3Dee contains structural domain definitions for all protein chains.
• The DSSP (Database of Secondary Structure in Proteins) database is a database of secondary structure assignments for all
protein entries in the PDB.
• The function of a protein is directly related to the 3D shape, i.e., the folding, of the molecule, and the 3D
shape is directly determined by the sequence of amino acids in the molecule.
• Thus, the primary structure, i.e., the sequence of amino acids, ultimately determines the fold (3D structure) and
function of a protein.
• A major goal in bioinformatics and structural molecular biology is to understand the relationship
between the amino acid sequence and the 3D structure of a protein, and to predict the fold based on the
amino acid sequence alone.
• This type of structure prediction directly from the amino acid sequence is called ab initio structure
prediction.
• Protein fold prediction from an amino acid sequence is still a distant goal, and most current algorithms
aim at predicting only the secondary structures, such as α-helices, β-strands, and loops/coils.
• The prediction of the secondary structure is an essential intermediate step on the way to predicting the full
3D structure of a protein. If the secondary structure of a protein is known, it is possible to derive a
comparatively small number of possible tertiary (3D) structures using knowledge about the ways that the
secondary structural elements pack.
• Some of the major computational methods of secondary structure prediction are:
• (1) the statistical feature-based method, (2) the nearest neighbor method, and (3) the neural network model
method.
Protein Data Analysis – Protein Structure Prediction
• Ab Initio Structure Prediction
• Statistical feature-based method: The frequency of occurrence of each of the 20 amino acids in different
secondary structures is used to create a scoring matrix.
• To predict a secondary structure, a sequence is scanned using a sliding window for the occurrence of
amino acids that have a high probability for one type of structure, as measured by the scoring matrices.
• In the Garnier, Osguthorpe, and Robson (GOR) method a window of 17 residues is used for the
prediction of the structural conformation of the central amino acid in the window.
• The GOR method estimates the joint probabilities of secondary structure S and amino acid a from sequences
in structural databases, and uses these probabilities to estimate the information difference between the
hypotheses that residue a is in structure S and residue a is not in structure S.
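The information difference mentioned above can be written out explicitly. The notation below ($\bar{S}$ for "not in structure $S$", $a_{j+m}$ for the residue at offset $m$ from the central position $j$) is introduced here for illustration, and the final sum over the 17-residue window is the independence approximation usually associated with the original version of the method; treat it as a sketch rather than the lecture's exact formulation.

```latex
% Information that residue a carries about state S
I(S; a) = \log \frac{P(S \mid a)}{P(S)} = \log \frac{P(S, a)}{P(S)\,P(a)}

% Information difference between "a is in S" and "a is not in S"
I(\Delta S; a) = I(S; a) - I(\bar{S}; a)
               = \log \frac{P(S \mid a)}{P(\bar{S} \mid a)} + \log \frac{P(\bar{S})}{P(S)}

% Window approximation: residues at offsets -8..+8 contribute independently
I(\Delta S_j; a_{j-8}, \ldots, a_{j+8}) \approx \sum_{m=-8}^{8} I(\Delta S_j; a_{j+m})
```

A positive value of $I(\Delta S; a)$ means that observing residue $a$ favors state $S$ over the alternatives; a negative value means it disfavors $S$.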
The Garnier, Osguthorpe, and Robson (GOR) method is a classic approach used in bioinformatics to
predict the secondary structure of proteins. Here's a detailed breakdown of the process involved in this
method:
1. Window-Based Scanning
• Sliding Window: The GOR method uses a sliding window of 17 residues (amino acids) to analyze the
sequence. This means that for each central amino acid in the window, the surrounding 16 residues (8 on
each side) are considered in the prediction.
2. Scoring Matrices
• Amino Acid Frequencies: The method relies on scoring matrices derived from structural databases. These
matrices contain information about the frequency of each amino acid in different secondary structure
types (alpha-helix, beta-sheet, or coil).
• Probability Calculation: For each amino acid in the window, the GOR method estimates the probability
that the central residue belongs to a particular secondary structure type, based on the frequencies
observed in known protein structures.
3. Joint Probabilities
• Estimating Probabilities: The GOR method estimates joint probabilities P(S,a), i.e., the probability of observing
amino acid a together with secondary structure S, from the sequences in structural databases.
• Probability Differences: For each amino acid a, the method compares the probability of the amino acid being in
structure S with the probability of it not being in structure S. This indicates whether the
presence of the amino acid significantly raises or lowers the likelihood of that secondary structure.
4. Information Difference
• Information Gain: The GOR method uses the calculated probabilities to estimate the "information
difference" or "information gain". This is a measure of how much information is gained about the
secondary structure of the central residue a when considering its occurrence versus non-occurrence in a
given structural state S.
• Predictive Modeling: The difference in probabilities for each possible secondary structure (alpha-helix,
beta-sheet, or coil) is used to predict the most likely secondary structure for the central residue.
5. Prediction
• Optimal Structure Assignment: After calculating the probabilities and information differences for all
possible secondary structures, the GOR method assigns the secondary structure with the highest
score (the largest information difference) to the central amino acid in the window.
• Sliding Window Application: This process is repeated across the entire protein sequence by sliding the
window along the sequence, predicting the secondary structure for each residue based on the surrounding
context.
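The five steps above can be sketched as a toy GOR-style predictor. This is a simplified illustration under stated assumptions, not the published GOR parameters: the function names, the pseudocount of 1, the three-letter state alphabet "HEC", and the tiny training set are all hypothetical, and state frequencies are not corrected for window truncation at sequence ends.

```python
import math
from collections import Counter

HALF_WINDOW = 8   # 8 neighbours on each side gives the 17-residue window
STATES = "HEC"    # H = alpha-helix, E = beta-strand, C = coil

def train(examples):
    """Count (central state, offset, amino acid) co-occurrences in labelled data."""
    pair_counts = Counter()    # (state, offset m, amino acid) -> count
    state_counts = Counter()   # state -> count
    for seq, ss in examples:
        if len(seq) != len(ss):
            raise ValueError("sequence and structure must have equal length")
        for j, state in enumerate(ss):
            state_counts[state] += 1
            for m in range(-HALF_WINDOW, HALF_WINDOW + 1):
                if 0 <= j + m < len(seq):
                    pair_counts[(state, m, seq[j + m])] += 1
    return pair_counts, state_counts

def information_difference(state, m, aa, pair_counts, state_counts, pseudo=1.0):
    """log[P(S|a)/P(not S|a)] + log[P(not S)/P(S)], estimated from counts."""
    others = [s for s in STATES if s != state]
    in_s = pair_counts[(state, m, aa)] + pseudo
    not_s = sum(pair_counts[(s, m, aa)] for s in others) + pseudo
    n_s = state_counts[state] + pseudo
    n_not = sum(state_counts[s] for s in others) + pseudo
    return math.log(in_s / not_s) + math.log(n_not / n_s)

def predict(seq, pair_counts, state_counts):
    """Slide the window along seq and pick the best-scoring state per residue."""
    prediction = []
    for j in range(len(seq)):
        scores = {}
        for state in STATES:
            scores[state] = sum(
                information_difference(state, m, seq[j + m], pair_counts, state_counts)
                for m in range(-HALF_WINDOW, HALF_WINDOW + 1)
                if 0 <= j + m < len(seq)
            )
        prediction.append(max(STATES, key=scores.get))
    return "".join(prediction)

# Invented training data for illustration only; real use needs a structural database.
training = [
    ("AELMAELMAELM", "HHHHHHHHHHHH"),
    ("VIVIVIVIVI", "EEEEEEEEEE"),
    ("GPSGPSGPSG", "CCCCCCCCCC"),
]
pair_counts, state_counts = train(training)
print(predict("AELMAELM", pair_counts, state_counts))
```

Summing the per-position information differences over the window corresponds to treating the residues as independent contributors. With realistic training data the counts would come from thousands of proteins of known structure rather than three invented strings.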
Summary
To summarize, the GOR method predicts secondary structures by:
• Using a sliding window to analyze each residue in context with its neighbors.
• Calculating joint probabilities of amino acids and secondary structures from structural databases.
• Assessing information differences to determine the likelihood of each secondary structure for the central
residue.
• Assigning the most probable secondary structure based on these calculations and repeating the process
for the entire sequence.
This approach combines statistical analysis of known protein structures with probabilistic modeling to make
predictions about unknown sequences.
Protein Data Analysis – Protein Structure Prediction
• Ab Initio Structure Prediction