Lecture4-Protein Data Analysis

Protein Data Analysis

Dr. Y. V. Lokeswari
Associate Professor
SSN College of Engineering
Protein Data Analysis – Protein and Amino Acid Sequence
• Protein synthesis constitutes the final stage of information flow within a cell.
• The genetic code in the coding regions of a DNA sequence is translated into biomolecular end products
that perform specific cellular and biological functions.
• Proteomics is the study of proteins and their interactions.
• An understanding of proteins and their functions would lead to new approaches for the diagnosis and treatment of
diseases, for the discovery of new drugs, and for disease control.
• Proteins are composed of linear, unbranched chains of amino acids (from an alphabet of 20 amino acids), linked
together by peptide bonds.
• The general structure of an amino acid consists of two functional groups (amino group, NH2, and carboxyl group, COOH), an H atom, and a
distinctive side group R, all bound to a carbon center called the alpha carbon (Cα).
• The differences between the 20 amino acids are in the nature of the R groups.
• These vary considerably in their chemical and physical properties.
• It is the chemistry of the R groups that determines the many interactions
that stabilize the structure of a protein and enable its biological function.

General structure of amino acid


Protein Data Analysis – Protein and Amino Acid Sequence
• The amino acids are linked together by peptide bonds to form a polypeptide chain.
• The peptide bond results from a condensation reaction involving the amino and carboxylic acid moieties on two amino
acids.

The formation of a peptide bond between two amino acids to form a peptide chain.
The N-Cα-C sequence is repeated throughout the protein and forms the backbone of the 3D structure.
Protein Data Analysis – Protein and Amino Acid Sequence
• Proteins are complex organic molecules that perform their functions through interactions with other molecules at the
molecular level.
• Understanding these functions requires information about their 3D structures at the molecular level.
• Protein structures are hierarchical.
• The primary structure of protein refers to the sequence of amino acids that make up the protein.
• The secondary structure refers to the local folding pattern of the polypeptide chain.
• The tertiary structure describes how the secondary structure elements are arranged to form the overall 3D folding
pattern. The tertiary structure is held together by hydrogen, ionic, and disulfide bonds between amino acids.
• It is this unique structure that gives a protein its specific function.
• The quaternary structure describes the interaction of two or more globular or tertiary structures and other groups such
as metal ions or cofactors that make up the functional protein.
• The quaternary structure is held together by ionic, hydrogen, and disulfide bonds between amino acids.
• An example of a protein with a quaternary structure is hemoglobin.
Protein Data Analysis – Protein and Amino Acid Sequence
• The secondary structure of proteins is predominantly stabilized by hydrogen bonds and is generally classified into four
types: α-helix, β-sheet, loop, and random coil.
• The α-helix is the most common form of secondary structure in proteins.
• The helix has 3.6 amino acid residues per turn and is stabilized by hydrogen bonding between the backbone carbonyl
oxygen of one residue and the backbone NH of the fourth residue along the helix.
• Certain amino acids have a distinct preference for α-helices. Alanine (A), glutamic acid (E), leucine (L), and methionine
(M) are good helix formers, whereas proline (P), glycine (G), tyrosine (Y), and serine (S) are helix-breaking residues.
• The second most common element of secondary structure in proteins is the β-sheet.
• A β-sheet is formed from several individual β-strands that are distant from each other along the primary protein
sequence.
• β-strands are usually 5 to 10 residues long and are in a fully extended conformation.
• The individual strands are aligned next to each other in such a way that carbonyl oxygens are hydrogen-bonded with
neighboring NH groups.
Protein Data Analysis – Protein and Amino Acid Sequence

Hydrogen bond patterns in beta sheets. Here, a four-stranded beta sheet, which contains three antiparallel and one
parallel strand, is drawn schematically. Hydrogen bonds are indicated with red lines (antiparallel strands) and green lines
(parallel strands) connecting the hydrogen and receptor oxygen
Protein Data Analysis – Protein and Amino Acid Sequence
• Loops are regions of a protein chain that connect α-helices and β-strands or sheets to each other.
• The helices and sheets form the stable hydrophobic core of the protein, and the connecting loops are found on the
surface of the structure.
• Because amino acids in loops are not constrained by space and environment, unlike amino acids in the core region, and
because they do not have an effect on the arrangement of secondary structures in the core, more substitutions, insertions,
and deletions may occur.
• Thus, in a sequence alignment, the presence of these features may be an indication of a loop.
• Random coil is the term used for segments of polypeptide chains that do not form regular secondary structures.
• Such conformations are not really random: they are the result of a balance of interactions between amino acid side
chains and the solvent and interactions between sidechains.
• Depending on the type of secondary structures present, the tertiary structure of a protein is classified into seven classes in
the SCOP database.
Protein Data Analysis – Protein and Amino Acid Sequence
Protein is classified into seven classes in the SCOP database:
1. All α proteins (Fig. 4.10a)
2. All β proteins (Fig. 4.10b)
3. Alpha and beta proteins (α/β) (Fig. 4.10c)
Mainly parallel β-sheets with intervening α-helices
4. Alpha and beta proteins (α+β) (Fig. 4.10d)
Mainly segregated α-helices and antiparallel β-sheets
5. Multi-domain proteins (α and β) (Fig. 4.10e)
Folds consisting of two or more domains belonging to different classes
6. Membrane and cell surface proteins and peptides (Fig. 4.10f)
Exclude proteins in the immune system
7. Small proteins (Fig. 4.10g)
Usually dominated by metal ligand, heme, and/or disulfide bridges

Internet resources for protein structure classification
• The CATH database - hierarchical domain classification of protein structures
• SCOP (Structural Classification of Proteins) database - structural and evolutionary relationships between all proteins
• SWISS-MODEL - fully automated protein structure homology-modeling server
• Protein Data Bank (PDB) - repository of 3D protein structures
• The DALI (Distance ALIgnment tool) server is a network service for comparing protein structures in 3D.
• The FSSP (Fold classification based on Structure-Structure alignment of Proteins) database is based on an exhaustive
all-against-all 3D structure comparison of protein structures
• 3Dee contains structural domain definitions for all protein chains
• The DSSP (Database of Secondary Structure in Proteins) database is a database of secondary structure assignments for all
protein entries in the Protein Data Bank
Protein Data Analysis – Protein Sequence Comparison
• Proteins can be compared in terms of sequence similarity or structural similarity.
• Significant sequence similarity is usually an important indicator of an evolutionary relationship between
sequences.
• In contrast, significant structural similarity is common, even among proteins that do not share any
sequence similarity or evolutionary relationship.
• Similarity between two protein sequences can be assessed by sequence comparison.
• In protein sequence alignment, the problem of degeneracy in the genetic code (where multiple DNA triplets
may code for the same amino acid) does not occur.
• In addition, it is much less likely that two proteins will have the same letter (amino acid), by chance alone,
at any position, since protein sequences are written with a 20-letter alphabet.
• Comparison tools that are used for DNA sequence comparison can also be used for protein sequences
(BLAST, FASTA, CLUSTALW).
• The varying degrees of similarity reflect the different likelihoods of one amino acid being substituted for
another during the course of molecular evolution.
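The chance-match argument above can be made concrete with a small sketch. It assumes, purely for illustration, that all letters of the alphabet are equally frequent (real amino acid frequencies are not uniform), and the function names are hypothetical rather than taken from BLAST, FASTA, or CLUSTALW.

```python
def chance_match_probability(alphabet_size):
    """Probability that two random letters match, assuming uniform frequencies."""
    return 1.0 / alphabet_size


def percent_identity(aligned_a, aligned_b):
    """Percent of aligned, non-gap positions at which the two sequences agree."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


dna_chance = chance_match_probability(4)       # 4-letter DNA alphabet
protein_chance = chance_match_probability(20)  # 20-letter protein alphabet
identity = percent_identity("MKVLA", "MKILA")  # one mismatch in five positions
```

Under this uniform assumption a chance match is five times less likely per position for proteins than for DNA, which is why a given level of identity is more informative in a protein alignment.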
Protein Data Analysis – Protein Sequence Comparison
• Quantification of the similarity between amino acids is by means of scoring matrices.
• The 20 by 20 matrices, relating each amino acid to every other amino acid, fall into two main families: PAM (Percent or Point
Accepted Mutation) and BLOSUM (BLOcks SUbstitution Matrix).
• PAM is a unit introduced to quantify the amount of evolutionary change in a protein sequence.
• One PAM unit is the amount of evolution which will change, on average, 1% of amino acids in a protein
sequence.
• The BLOSUM matrix is constructed from blocks of sequences derived from the Blocks database
(http://www.blocks.fhcrc.org/).
• The Blocks database contains multiply aligned ungapped segments or blocks that correspond to the most highly
conserved regions of proteins.
• BLOSUM is constructed from these blocks by examining the substitution frequencies of each amino acid
pair.
• The matrix number in a BLOSUM matrix, e.g., as in BLOSUM 62, means that the matrix is derived from
blocks containing (>62%) identities in ungapped sequence alignment.
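As a rough illustration of how substitution frequencies in a block become scores, the sketch below counts amino acid pairs column by column and converts observed versus expected pair frequencies into rounded log-odds scores in half-bit units. It is a simplification under stated assumptions: the block is a made-up toy example, the function name is hypothetical, and the identity-based clustering of sequences that distinguishes BLOSUM 62 from other BLOSUM matrices is omitted.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_scores(block):
    """Log-odds scores (half-bits) from an ungapped block of equal-length sequences."""
    pair_counts = Counter()
    for column in zip(*block):
        # Count every unordered pair of residues observed in this column.
        for x, y in combinations(column, 2):
            pair_counts[tuple(sorted((x, y)))] += 1
    total = sum(pair_counts.values())
    observed = {pair: count / total for pair, count in pair_counts.items()}

    # Marginal frequency of each residue among the counted pairs.
    freq = Counter()
    for (x, y), q in observed.items():
        if x == y:
            freq[x] += q
        else:
            freq[x] += q / 2
            freq[y] += q / 2

    scores = {}
    for (x, y), q in observed.items():
        expected = freq[x] ** 2 if x == y else 2 * freq[x] * freq[y]
        scores[(x, y)] = round(2 * math.log2(q / expected))
    return scores


toy_block = ["AAS", "AAS", "AAS", "ASS"]  # invented example, not from the Blocks database
scores = log_odds_scores(toy_block)
```

In this toy block the identical pairs A/A and S/S receive positive scores and the A/S substitution a negative one, because A and S are paired less often than their overall frequencies would predict.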
Protein Data Analysis – Protein Structure Comparison
• As more and more protein structures have been determined and deposited in various protein structure databases, the prediction of protein
structure by computer algorithms is becoming more feasible.
• When proteins of unknown structure are similar to a protein of known structure at the sequence level, the 3D structure of the proteins
can be predicted.
• The stronger the similarity and identity, the more similar are the 3D folds and other structural features of the proteins.
• By tracking their structural similarities, very distant evolutionary relationships between proteins may be inferred.
• Several methods have been proposed to compare protein structures and measure the degree of structural similarity between them.
• These methods are based either on alignment of intra- and inter-molecular atomic distances (e.g., DALI) or on alignment of
secondary structure elements (e.g., VAST).
• In the latter case, two proteins are compared based on the types and arrangements of their α-helices and β-strands, as well as
on the ways in which these elements are connected.
• DALI (Distance ALIgnment tool) is based on the alignment of 2D distance matrices, which represent all intra-molecular Cα-Cα
distances of a protein structure.
• For a given pair of structures, DALI attempts to compute the optimal arrangement of similar contact patterns from their respective
distance matrices.
• Each distance matrix is first split into hexapeptide fragments, and all pairs of similar fragments from the two structures are stored in a
pair list.
• The final alignment is computed by assembling pairs of overlapping fragments from the pair list.
• The scoring function for an alignment of two structures is based on the intra-molecular distances.
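The 2D distance matrix that DALI starts from is simple to compute. The sketch below builds one from a list of Cα coordinates; the coordinates are invented for illustration, the function name is hypothetical, and the fragment pairing and assembly steps of DALI are not shown.

```python
import math


def ca_distance_matrix(ca_coords):
    """Return the symmetric matrix of pairwise C-alpha distances."""
    n = len(ca_coords)
    return [[math.dist(ca_coords[i], ca_coords[j]) for j in range(n)]
            for i in range(n)]


# Invented coordinates (in angstroms) for three residues.
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
matrix = ca_distance_matrix(coords)
```

Because intra-molecular distances do not change when a structure is rotated or translated, two structures can be compared through their distance matrices without first superimposing them.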
Protein Data Analysis – Protein Structure Comparison
• The program VAST (Vector Alignment Search Tool) is based on aligning secondary structure elements.
• In VAST, all pairs of secondary structure elements (one from each structure) that have the same type are represented as
nodes of a graph.
• Two nodes are connected by an edge if the distance and angle between the corresponding pairs of secondary
structure elements from the two proteins are within some threshold.
• The graph therefore represents correspondences between pairs of secondary structure elements that have the same type,
relative orientation, and connectivity.
• This correspondence graph is then searched to find the maximal subgraph such that every node in the subgraph is
connected to every other node in the subgraph and is not contained in any larger subgraph with this property.
• This finds the initial secondary structure alignment.
• VAST then extends this initial alignment to a residue level alignment using a Gibbs sampling technique.
• VAST only reports alignments that yield a P-value less than 0.05.
• A P-value of 0.05 indicates that VAST expects to find an alignment with the same degree of similarity by chance in 5%
of all pair-wise comparisons.
• The results of this computation are included in NCBI's Molecular Modeling Database (MMDB).
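The correspondence graph and the search for a fully connected subgraph can be sketched as follows. This is a minimal illustration, not VAST itself: the element types, distances, angles, and thresholds are all assumed values, the function names are hypothetical, connectivity (element order) is ignored, and the clique search uses the textbook Bron-Kerbosch recursion.

```python
from itertools import combinations


def correspondence_graph(types_a, types_b, geom_a, geom_b,
                         dist_tol=1.0, angle_tol=15.0):
    """Nodes pair same-type elements; edges join geometrically consistent pairs."""
    def geometry(table, i, k):
        return table[(min(i, k), max(i, k))]  # (distance, angle) of two elements

    nodes = [(i, j) for i, ta in enumerate(types_a)
             for j, tb in enumerate(types_b) if ta == tb]
    adjacency = {node: set() for node in nodes}
    for (i, j), (k, l) in combinations(nodes, 2):
        if i == k or j == l:
            continue  # one element cannot be matched twice
        dist_a, angle_a = geometry(geom_a, i, k)
        dist_b, angle_b = geometry(geom_b, j, l)
        if abs(dist_a - dist_b) <= dist_tol and abs(angle_a - angle_b) <= angle_tol:
            adjacency[(i, j)].add((k, l))
            adjacency[(k, l)].add((i, j))
    return adjacency


def maximal_cliques(adjacency):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(current)
            return
        for node in list(candidates):
            expand(current | {node},
                   candidates & adjacency[node],
                   excluded & adjacency[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adjacency), set())
    return cliques


# Toy structures: one helix (H) and two strands (E) each, identical geometry.
types = ["H", "E", "E"]
geom = {(0, 1): (10.0, 90.0), (0, 2): (12.0, 90.0), (1, 2): (5.0, 180.0)}
graph = correspondence_graph(types, types, geom, geom)
best = max(maximal_cliques(graph), key=len)
```

The largest clique gives the initial secondary structure alignment; in this toy case it pairs each element with its counterpart.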
Protein Data Analysis – Protein Structure Prediction
• Comparative Modeling
The structure of a new protein could be predicted based on the presence of certain patterns or motifs, such as specific
amino acid patterns or profiles that are known to have specific structures.
This type of prediction is also called comparative modeling, and is useful when there is a clear sequence relationship
between the target structure and one or more known structures.
PROSITE database (http://us.expasy.org/prosite/) is an annotated collection of motif descriptors dedicated to the
identification of protein families and domains.
The generalized profiles used in PROSITE allow the detection of even poorly conserved domains or families.
Pfam is a collection of protein families and domains, based on multiple protein alignments and profile-HMMs of
these families.
BLOCKS is a collection of multiply aligned ungapped segments that correspond to the most highly conserved regions of
proteins.
eMOTIF is a collection of protein sequence motifs representing conserved biochemical properties and biological
functions derived from the BLOCKS and PRINTS databases
Protein Data Analysis – Protein Structure Prediction
• Ab Initio Structure Prediction

• The function of a protein is directly related to the 3D shape, i.e., the folding, of the molecule, and the 3D
shape is directly determined by the sequence of amino acids in the molecule.
• Thus, the primary structure, i.e., the sequence of amino acids, ultimately determines the fold (3D structure) and
function of a protein.
• A major goal in bioinformatics and structural molecular biology is to understand the relationship
between the amino acid sequence and the 3D structure in protein, and to predict the fold based on the
amino acid sequence alone.
• This type of structure prediction directly from the amino acid sequence is called ab initio structure
prediction.
• Protein fold prediction from an amino acid sequence is still a distant goal, and most current algorithms
aim at predicting only the secondary structures, such as α-helices, β-strands, and loops/coils.
• The prediction of the secondary structure is an essential intermediate step on the way to predicting the full
3D structure of a protein. If the secondary structure of a protein is known, it is possible to derive a
comparatively small number of possible tertiary (3D) structures using knowledge about the ways that the
secondary structural elements pack.
• Some of the major computational methods of secondary structure prediction are:
• (1) statistical feature-based method, (2) nearest neighbor method, and (3) neural network-model
method.
Protein Data Analysis – Protein Structure Prediction
• Ab Initio Structure Prediction

• Statistical feature-based method : The frequency of occurrence of each of the 20 amino acids in different
secondary structures is used to create a scoring matrix.
• To predict a secondary structure, a sequence is scanned using a sliding window for the occurrence of
amino acids that have a high probability for one type of structure, as measured by the scoring matrices.
• In the Garnier, Osguthorpe, and Robson (GOR) method a window of 17 residues is used for the
prediction of the structural conformation of the central amino acid in the window.
• The GOR method estimates the joint probabilities of secondary structure S and amino acid a from sequences
in structural databases, and uses these probabilities to estimate the information difference between the
hypotheses that residue a is in structure S and residue a is not in structure S.
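In symbols, the information difference described above can be written as follows. This is a sketch in the usual information-theoretic notation; the symbol $\bar{S}$ for "not in structure $S$" is introduced here for illustration and does not appear in the slides.

```latex
I(\Delta S; a) = I(S; a) - I(\bar{S}; a)
             = \log\frac{P(S \mid a)}{P(\bar{S} \mid a)} - \log\frac{P(S)}{P(\bar{S})},
\qquad
P(S \mid a) = \frac{P(S, a)}{P(a)}
```

A positive value means that observing residue $a$ makes structure $S$ more likely than its overall frequency in the database would suggest; a negative value means it makes $S$ less likely.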
Protein Data Analysis – Protein Structure Prediction
• Ab Initio Structure Prediction
• Statistical feature-based method :
The Garnier, Osguthorpe, and Robson (GOR) method is a classic approach used in bioinformatics to
predict the secondary structure of proteins. Here's a detailed breakdown of the process involved in this
method:
1. Window-Based Scanning

• Sliding Window: The GOR method uses a sliding window of 17 residues (amino acids) to analyze the
sequence. This means that for each central amino acid in the window, the surrounding 16 residues (8 on
each side) are considered in the prediction.

2. Scoring Matrices

• Amino Acid Frequencies: The method relies on scoring matrices derived from structural databases. These
matrices contain information about the frequency of each amino acid in different secondary structure
types (alpha-helix, beta-sheet, or coil).

• Probability Calculation: For each amino acid in the window, the GOR method estimates the probability
that the central residue belongs to a particular secondary structure type, based on the frequencies
observed in known protein structures.
Protein Data Analysis – Protein Structure Prediction
• Ab Initio Structure Prediction
• Statistical feature-based method :
3. Joint Probabilities
• Estimating Probabilities: The GOR method calculates joint probabilities P(S,a) which represent the
probability that an amino acid a is in a secondary structure S based on the sequences in structural databases.

• Probability Differences: For each amino acid a, the method evaluates the difference in probability of the
amino acid being in structure S versus not being in structure S. This helps in understanding whether the
presence of the amino acid has a significant effect on the likelihood of that secondary structure.

4. Information Difference

• Information Gain: The GOR method uses the calculated probabilities to estimate the "information
difference" or "information gain". This is a measure of how much information is gained about the
secondary structure of the central residue a when considering its occurrence versus non-occurrence in a
given structural state S.

• Predictive Modeling: The difference in probabilities for each possible secondary structure (alpha-helix,
beta-sheet, or coil) is used to predict the most likely secondary structure for the central residue.
Protein Data Analysis – Protein Structure Prediction
• Ab Initio Structure Prediction
• Statistical feature-based method :
5. Prediction
• Optimal Structure Assignment: After calculating the probabilities and information differences for all
possible secondary structures, the GOR method assigns the secondary structure with the highest
probability to the central amino acid in the window.

• Sliding Window Application: This process is repeated across the entire protein sequence by sliding the
window along the sequence, predicting the secondary structure for each residue based on the surrounding
context.

Summary
To summarize, the GOR method predicts secondary structures by:
• Using a sliding window to analyze each residue in context with its neighbors.
• Calculating joint probabilities of amino acids and secondary structures from structural databases.
• Assessing information differences to determine the likelihood of each secondary structure for the central
residue.
• Assigning the most probable secondary structure based on these calculations and repeating the process
for the entire sequence.

This approach combines statistical analysis of known protein structures with probabilistic modeling to make
predictions about unknown sequences.
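The steps summarized above can be sketched in a few lines. This is a simplified illustration rather than the published GOR procedure: it uses one information table for every position in the window (the real method keeps a separate table per window offset), the training data are invented, pseudocounts are an added assumption to avoid taking the log of zero, and the function names are hypothetical.

```python
import math
from collections import Counter

STATES = "HEC"  # helix, strand, coil


def train_information(sequences, structures, pseudo=1.0):
    """Estimate the information difference for every (state, amino acid) pair."""
    joint, state_total, residue_total = Counter(), Counter(), Counter()
    for seq, ss in zip(sequences, structures):
        for aa, state in zip(seq, ss):
            joint[(state, aa)] += 1
            state_total[state] += 1
            residue_total[aa] += 1
    n = sum(state_total.values())
    info = {}
    for state in STATES:
        for aa in residue_total:
            in_state = joint[(state, aa)] + pseudo
            not_in_state = residue_total[aa] - joint[(state, aa)] + pseudo
            prior_in = state_total[state] + pseudo
            prior_not = n - state_total[state] + pseudo
            info[(state, aa)] = (math.log(in_state / not_in_state)
                                 - math.log(prior_in / prior_not))
    return info


def predict(seq, info, half_window=8):
    """Assign each residue the state with the highest summed information."""
    prediction = []
    for center in range(len(seq)):
        lo = max(0, center - half_window)
        hi = min(len(seq), center + half_window + 1)
        totals = {state: sum(info.get((state, seq[pos]), 0.0)
                             for pos in range(lo, hi))
                  for state in STATES}
        prediction.append(max(STATES, key=lambda state: totals[state]))
    return "".join(prediction)


# Invented training data: A always helical, V always strand, G always coil.
info = train_information(["AAAAAAVVVVVVGGGGGG"], ["HHHHHHEEEEEECCCCCC"])
helix_call = predict("A" * 20, info)
```

With `half_window=8` the window spans 17 residues, matching the window size quoted for GOR; near the ends of the sequence the window is simply truncated.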
Protein Data Analysis – Protein Structure Prediction
• Ab Initio Structure Prediction

• Nearest-neighbor method of secondary structure prediction predicts the secondary structural
conformation of an amino acid in the query sequence by identifying training sequences of known
structures that are homologous to the query sequence.
• The nearest-neighbor method requires the availability of a set of training sequences with known structures
but with minimal sequence similarity to each other, and a scoring scheme for measuring similarity
between sequence segments.
• A large list of short sequence fragments is then generated by sliding a window of length n (e.g., n = 17)
along each training sequence, and the secondary structure of the center amino acid in the window is
recorded.
• For structure prediction, a window of the same size is applied to the query sequence and the amino acids in
the window are compared to each of the sequence fragments. The k (e.g., k = 50) best matching fragments
are identified and the frequencies of the known secondary structures of the center amino acids in each of the
matching fragments are used to predict the secondary structure of the center amino acid in the query window.
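A minimal sketch of this procedure is given below. The similarity measure (a count of identical positions) is an assumption standing in for whatever scoring scheme a real predictor uses, the training data are invented, and the function names are hypothetical.

```python
from collections import Counter


def build_fragment_library(seq, ss, n):
    """Slide a window of odd length n and record the state of its center residue."""
    half = n // 2
    return [(seq[i - half:i + half + 1], ss[i])
            for i in range(half, len(seq) - half)]


def similarity(window_a, window_b):
    """Assumed scoring scheme: number of identical positions."""
    return sum(x == y for x, y in zip(window_a, window_b))


def predict_center(query_window, library, k):
    """Vote among the k best-matching fragments for the center residue's state."""
    best = sorted(library,
                  key=lambda fragment: similarity(query_window, fragment[0]),
                  reverse=True)[:k]
    votes = Counter(state for _, state in best)
    return votes.most_common(1)[0][0]


# Invented training sequence with known structure (H = helix, E = strand).
library = build_fragment_library("AAAAAVVVVV", "HHHHHEEEEE", n=3)
state = predict_center("AAA", library, k=3)
```

In practice n and k would be much larger (e.g., n = 17 and k = 50, as in the text), and the library would be built from many training sequences rather than one.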
Protein Data Analysis – Protein Structure Prediction
• Ab Initio Structure Prediction
• Outputs from several nearest-neighbor predictors (i.e., with different parameters for n and k, and balanced or
unbalanced prediction) could be combined using a simple majority vote rule or a more sophisticated machine
learning algorithm such as a neural network to improve the prediction accuracy.
• The program NNSSP at http://searchlauncher.bcm.tmc.edu/pssprediction/Help/nnssp.html (Salamov and
Solovyev, 1995, 1997) is a nearest-neighbor based secondary structure prediction algorithm.
• Another method that also uses nearest-neighbor prediction is the program called PREDATOR.
Protein Data Analysis – Protein Structure Prediction
• Ab Initio Structure Prediction
• The neural network-based method uses an artificial neural network which simulates the neural system in
the brain for structure prediction.
• Neural networks generalize by extracting the underlying physicochemical principles from the training
sequence data. Training the network is the process of adjusting the weights w associated with each link.
• Initially, the weights are assigned random values.
• A sliding window of 13-17 amino acid residues is positioned along a training sequence and the
predicted output is compared to the known structure of the center amino acid residue.
• Errors in the predictions are used for adjusting the weights using the back-propagation algorithm.
• The back-propagation algorithm uses a gradient search technique to minimize a cost function equal to the
mean square difference between the desired and the actual network outputs.
• Training by back-propagation is stopped when the errors cannot be reduced further.

A three-layer feed-forward neural network
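The training procedure described above can be sketched with a very small three-layer network in plain Python. Everything here is an illustrative assumption: the one-hot encoding, the layer sizes, the learning rate, the omission of bias terms, and the class and function names. It shows random initial weights, a forward pass, and one back-propagation step that lowers the mean square error for a single window.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_window(window):
    """Encode a window of residues as a flat 0/1 vector, 20 inputs per residue."""
    return [1.0 if aa == ref else 0.0 for aa in window for ref in AMINO_ACIDS]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyNet:
    """Input layer, one hidden layer, output layer; no bias terms for brevity."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)  # weights start at random values
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
                   for _ in range(n_out)]

    def forward(self, x):
        hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in self.w1]
        out = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in self.w2]
        return hidden, out

    def train_step(self, x, target, lr=0.1):
        """One gradient step on the mean square error; returns the loss before it."""
        hidden, out = self.forward(x)
        loss = sum((o - t) ** 2 for o, t in zip(out, target)) / len(out)
        # Error signal at each output unit (derivative through the sigmoid).
        d_out = [2.0 * (o - t) / len(out) * o * (1.0 - o)
                 for o, t in zip(out, target)]
        # Propagate the error back to the hidden units before changing w2.
        d_hid = [h * (1.0 - h) * sum(d * self.w2[k][j] for k, d in enumerate(d_out))
                 for j, h in enumerate(hidden)]
        for k, d in enumerate(d_out):
            for j, h in enumerate(hidden):
                self.w2[k][j] -= lr * d * h
        for j, d in enumerate(d_hid):
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * d * xi
        return loss


net = TinyNet(n_in=60, n_hidden=4, n_out=3)
x = one_hot_window("AEL")   # a 3-residue window for brevity
target = [1.0, 0.0, 0.0]    # center residue is helix (H, E, L order)
losses = [net.train_step(x, target) for _ in range(200)]
```

A real predictor would loop over many windows from many training sequences and stop when the error on held-out data no longer improves.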


Protein Data Analysis – Protein Structure Prediction
• Ab Initio Structure Prediction
• PHDsec is a neural network-based secondary structure prediction algorithm.
• PHDsec predictions have three main features:
• (1) improved accuracy by using evolutionary information contained in multiple sequence alignments as
input to the neural networks,
• (2) improved β-strand prediction accuracy through a balanced training procedure, and
• (3) more accurate prediction of secondary structure segments by using a multi-level system.
• The first level in PHDsec is a three-layer feed-forward neural network.
• Input to this first level sequence-to-structure network consists of two contributions: one from the local
sequence, i.e., taken from a window of 13 adjacent residues, and another from the global sequence statistics.
• Output of the first level network is the 1D structural state of the residue at the center of the input window,
i.e., α-helix (H), β-strand (E), or loop (L).
• The second level is a three-layer feed-forward structure-to-structure network. The output for the second level
network is identical to the first level.
• The second level network introduces a correlation between adjacent residues with the effect that the
predicted secondary structure segments have length distributions similar to the observed distributions.
• The third level consists of an arithmetic average over independently trained networks (jury decision).
• The final level is a simple filter that affects only drastic, unrealistic predictions (e.g., HEH to HHH; EHE to
EEE; and LHL to LLL).
• PHDsec is reported to have a prediction accuracy of Q3 > 72%.
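Two pieces of this pipeline are easy to illustrate: the final filter and the Q3 accuracy measure. The sketch below applies only the three corrections listed in the text and computes Q3 as the percentage of residues whose three-state assignment is correct. The function names and example strings are hypothetical, and the actual PHDsec filter may contain rules not listed here.

```python
# Only the corrections named in the text: HEH to HHH, EHE to EEE, LHL to LLL.
FILTER_RULES = {"HEH": "H", "EHE": "E", "LHL": "L"}


def filter_prediction(pred):
    """Replace the middle residue of each listed unrealistic triplet."""
    states = list(pred)
    for i in range(1, len(states) - 1):
        triplet = "".join(states[i - 1:i + 2])
        if triplet in FILTER_RULES:
            states[i] = FILTER_RULES[triplet]
    return "".join(states)


def q3(predicted, observed):
    """Percentage of residues predicted in the correct state (H, E, or L)."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


filtered = filter_prediction("HHEHHLLHLL")
accuracy = q3("HHHEEELLL", "HHHEEELLH")
```

Here the isolated E inside a helix and the isolated H inside a loop are both overwritten, and the second example has eight of nine residues correct.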
Protein Data Analysis – Protein Structure Prediction
• Ab Initio Structure Prediction
• PSIPRED (http://bioinf.cs.ucl.ac.uk/psipred/) is another neural network-based secondary structure
prediction algorithm that was reported to have very high prediction accuracy, with a Q3 score of 76.5% to
78.3%.
• PSIPRED incorporates two simple feed-forward neural networks that perform analysis on the iterated
profile (position-specific scoring matrix) obtained from PSI-BLAST (Position-Specific Iterated BLAST).
• The high sensitivity and accuracy of the PSI-BLAST alignments was thought to be a major contributing
factor to the high prediction rate of the PSIPRED method.
• Hidden Markov Models (HMM) have also been applied in protein structure prediction.
• In one example of this approach, the models are trained on patterns of α-helix, β-strand, tight turns, and
loops in specific structural classes, which then may be used to provide the most probable secondary structure
and structural class of a protein.
• A center that is focused on the prediction of protein structure is the Protein Structure Prediction Center
(http://predictioncenter.llnl.gov/Center.html), supported by the National Institutes of Health, National
Library of Medicine, and the U.S. Department of Energy, Office of Biological and Environmental
Research.
• CASP (Critical Assessment of techniques for protein Structure Prediction) is an event that aims to promote
an objective evaluation of prediction methods on a continuing basis.
Protein Data Analysis – Protein Structure Prediction
• Threading
• Because the ways that a protein can fold appear to be limited, there is considerable optimism that methods will
eventually be found to predict the fold of any protein, given just its amino acid sequence.
• One popular and quite successful method for tertiary structure prediction is threading.
• In threading, a new sequence is mounted on a series of known folds (a sequence-structure alignment) from
homologous sequences with the goal of finding a fold that provides the best score (lowest energy).
• Two commonly used techniques for deciding whether a given protein sequence is compatible with a known
fold are the environmental template and the contact potential methods.
• In the environmental template method, the environment, e.g., the secondary structure, the buried status,
the polarity, the types of nearby side chains, and the hydrophobicity, of each amino acid in each known
structural core is determined. The frequencies of different amino acids within multiple alignments in
different environments are then counted and used to create structural 3D profiles.
• Dynamic programming is used to align a sequence to a string of descriptors that describe the 3D
environment of the target structure, and the new sequence is predicted to have a fold similar to that of the
target core if a significantly high score is obtained.
• In the contact potential method, the number of and closeness between amino acids in the core are analyzed,
and each structural core is represented as a 2D contact matrix. The query sequence is evaluated for amino
acid interactions that will correspond to those in the core and that will contribute to the stability of the
protein. The most energetically stable conformations are assumed to be the most likely 3D structures.
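The dynamic programming step of the environmental template method can be sketched as a global alignment between a sequence and a string of environment descriptors. The two environment classes, the profile scores, and the gap penalty below are invented for illustration, and the function name is hypothetical; real 3D profiles distinguish many more environments and derive their scores from observed frequencies.

```python
def thread_score(sequence, environments, profile, gap=-2):
    """Best global alignment score of a sequence against environment descriptors."""
    n, m = len(sequence), len(environments)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Score for placing residue i in structural position j.
            fit = profile[environments[j - 1]].get(sequence[i - 1], 0)
            dp[i][j] = max(dp[i - 1][j - 1] + fit,   # residue occupies the position
                           dp[i - 1][j] + gap,       # residue left unplaced
                           dp[i][j - 1] + gap)       # position left empty
    return dp[n][m]


# Assumed toy profile: B = buried position, E = exposed position.
profile = {
    "B": {"L": 2, "V": 2, "K": -1, "D": -1},  # hydrophobic residues fit buried sites
    "E": {"L": -1, "V": -1, "K": 2, "D": 2},  # charged residues fit exposed sites
}
compatible = thread_score("LVKD", "BBEE", profile)
incompatible = thread_score("KDLV", "BBEE", profile)
```

The first sequence places hydrophobic residues in the buried positions and charged residues in the exposed ones, so it scores higher against this fold than the second; in a real threading run the score would be compared across a library of folds and tested for significance.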
