Exploring Database and Analyzing Protein Sequence
AUBHISHEK ZAMAN
ROLL:O8
4/26/2009
Contents
Chapter 1: About Bioinformatics
1.1 General
1.2 Resources
1.2.1 Gateways
1.2.2 Database
1.2.3 Software or Tools
1.3 Application and Importance
1.4 Project Aim
Chapter 2: Working with Protein Sequences
2.1 General
2.2 Fetching protein sequence from Database
2.2.1 Database
2.2.2 Method
2.2.3 Result
2.3 Analyzing Protein Sequences
2.3.1 Understanding the general, physical, chemical properties of a Protein sequence
2.3.1.1 Software/ Tools
2.3.1.2 Method
2.3.1.3 Result
2.3.2 Searching Database for similar sequences
2.3.2.1 Software/ Tools
2.3.2.2 Methods
2.3.2.3 Result
2.3.3 Sequence Alignment Study
2.3.3.1 Pairwise alignment
2.3.3.1.1 Software/ Programs
2.3.3.1.2 Methods
2.3.3.1.3 Results
2.3.3.2 Multiple Sequence Alignment
2.3.3.2.1 Software/ Programs
2.3.3.2.2 Methods
2.3.3.2.3 Results
2.3.4 Phylogenetic tree construction
2.3.4.1 Software/ Tools
2.3.4.2 Methods
2.3.4.3 Result
2.3.5 Secondary Structure Prediction
2.3.5.1 Software/ Tools
2.3.5.2 Methods
2.3.5.3 Result
List of abbreviation
ABBREVIATION ELABORATION
BLAST Basic Local Alignment Search Tool
DDBJ DNA Data Bank of Japan
EBI European Bioinformatics Institute
List of figures
List of Tables
Table no. Name Of Table Page No.
1.1 Tools at EBI 11
1.2 Available tools at Bioinformatics Group - University College London 12
List of Web Addresses:
• http://www.ncbi.nlm.nih.gov
• http://www.ebi.ac.uk
• http://bioinf.cs.ucl.ac.uk/psipred/
• http://www.expasy.org
• http://www.pdb.org/
Reference Source:
• Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins
B.F. Ouellette and A.D. Baxevanis
• Discovering Genomics, Proteomics and Bioinformatics
A.M. Campbell and L.J. Heyer
• Post-Genome Informatics
Minoru Kanehisa
• Bioinformatics: Sequence and Genome Analysis
D.W. Mount
• Bioinformatics for Dummies
J.-M. Claverie and C. Notredame
• www.wikipedia.org
Chapter 1:
About Bioinformatics
Bioinformatics is an interdisciplinary subject. It may be described as a blend of biological
and computational sciences. Bioinformatics involves storing, retrieving and manipulating
biological data using computational techniques.
[Figure: bioinformatics shown at the intersection of biology, computer science, mathematics and statistics]
1.1 General
Biological data are flooding in at an unprecedented rate. For example, as of August 2000, the
GenBank repository of nucleic acid sequences contained 8,214,000 entries and the SWISS-PROT
database of protein sequences contained 88,166. On average, the amount of
information stored in these databases is doubling every 15 months.
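To make the doubling figure concrete, the following minimal sketch assumes idealized exponential growth, N(t) = N0 · 2^(t/15) with t in months. The function name `projected_entries` is purely illustrative, and real database growth need not follow this curve.

```python
def projected_entries(initial, months, doubling_months=15.0):
    """Project database size under idealized exponential growth (an assumption)."""
    return initial * 2 ** (months / doubling_months)

# GenBank entry count for August 2000, as quoted in the text above.
genbank_2000 = 8_214_000

# 30 months corresponds to two doubling periods, i.e. a fourfold increase.
after_30_months = projected_entries(genbank_2000, 30)
print(f"{after_30_months:,.0f}")  # 32,856,000
```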
Bioinformatics is conceptualising biology in terms of molecules (in the sense of physical chemistry)
and applying informatics techniques (derived from disciplines such as applied maths, computer
science and statistics) to understand and organise the information associated with these molecules, on
a large scale. In short, bioinformatics is a management information system for molecular biology and
has many practical applications.
“Bio” stands for life, and “informatics” comes from the word information. Bioinformatics therefore
refers to the science that deals with the information that comes from living systems. More properly,
however, bioinformatics refers to the creation and advancement of algorithms, computational and
statistical techniques, and theory to solve formal and practical problems arising from the
management and analysis of biological data.
The National Center for Biotechnology Information (NCBI) defines bioinformatics as:
"Bioinformatics is the field of science in which biology, computer science, and information
technology merge into a single discipline. There are three important sub-disciplines within
bioinformatics: the development of new algorithms and statistics with which to assess relationships
among members of large data sets; the analysis and interpretation of various types of data including
nucleotide and amino acid sequences, protein domains, and protein structures; and the development
and implementation of tools that enable efficient access and management of different types of
information."
The terms bioinformatics and computational biology are often used interchangeably. However,
bioinformatics more properly refers to the creation and advancement of algorithms, computational and
statistical techniques, and theory to solve formal and practical problems posed by or inspired from the
management and analysis of biological data. Important sub-disciplines within bioinformatics and
computational biology include:
• the development and implementation of tools that enable efficient access to, and use
and management of, various types of information
• the development of new algorithms (mathematical formulas) and statistics with which
to assess relationships among members of large data sets, such as methods to locate a gene
within a sequence, predict protein structure and/or function, and cluster protein sequences into
families of related sequences
Storing, retrieving and manipulating biological data in a meaningful way to interpret biological
systems is the prime objective of bioinformatics. To this end, the data produced by thousands of
research teams all over the world are first collected and organized in databases specialized for
particular subjects. GDB (Genome Database), SWISS-PROT, GenBank and PDB (Protein Data Bank)
are some well-known examples. As the information kept growing in size and complexity, the need for
specialized tools with diverse algorithmic approaches grew too. This led to the use of
specialized software such as BLAST, ClustalW, BioEdit, SCRATCH and Swiss PDB Viewer
for better data manipulation and sorting.
1.2. The Resources
[Figure: GenBank (NCBI, accessed through Entrez), EMBL (EBI, accessed through SRS) and DDBJ (CIB, accessed through getentry) exchange new submissions and updates with one another]
Figure 1.2: Data flow for new submissions and updates between the three databases
1.2.1 Gateway
A gateway in Information Technology (IT) can be thought of as an open door through which a user
obtains specialized information. A gateway can be reached at a specific Uniform
Resource Locator (URL).
There are several gateways for software and databases that offer access to many of the sites in
bioinformatics. The gateways and databases are listed below:
as well as other information relevant to biotechnology. In addition to GenBank, NCBI
provides Online Mendelian Inheritance in Man, the Molecular Modeling Database (3D
protein structures), the Unique Human Gene Sequence Collection, a Gene Map of the Human
Genome and a Taxonomy Browser, and coordinates with the National Cancer Institute to provide
the Cancer Genome Anatomy Project. All these databases are linked through a unique search
and retrieval system called Entrez, which also includes cross-referenced information to integrate
these resources.
Table 1.1: Tools at EBI
Tool                          Description
Align                         Pairwise global and local alignment tool (EMBOSS).
ClustalW                      Multiple sequence alignments.
CpG Plot/CpGreport            CpG island finder and plotting tool (EMBOSS).
GeneMark                      Gene prediction service.
Genetic Code Viewer           Review of genetic code differences.
Wise2                         Compares a protein sequence or a protein profile HMM to a DNA sequence.
Mutation Checker              Sequence validation.
Pepstats/Pepwindow/Pepinfo    EMBOSS programs for basic protein sequence analysis (EMBOSS).
PromoterWise                  Compares two DNA sequences allowing for inversions and translocations, ideal for promoters.
Reverse Translator            Reverse complement checker.
SAPS                          Statistics on protein sequences.
Transeq                       DNA sequence translation tool (EMBOSS).
ExPASy (the Expert Protein Analysis System) is a proteomics server of the Swiss
Institute of Bioinformatics (SIB) which analyzes protein sequences and structures
and two-dimensional polyacrylamide gel electrophoresis (2-D PAGE) data. The server
functions in collaboration with the European Bioinformatics Institute. ExPASy
also produces the protein sequence knowledgebase, UniProtKB/Swiss-Prot, and its
computer-annotated supplement, UniProtKB/TrEMBL.
1.2.2 Databases
A database on the internet actually consists of a database management system (DBMS) which
has two interfaces: one for the user to query and enter data, and another for management on the
host computer. A database is a compilation of entities described by a defined set of
attributes.
A database (or data base) is a collection of data organised so that its contents can easily be
accessed, managed, and modified by a computer. It is also called a data bank. The most prevalent type
of database is the relational database, which organizes the data in tables; multiple relations can be
mathematically defined between the rows and columns of each table to yield the desired information.
An object-oriented database stores data in the form of objects which are organized in hierarchical
classes that may inherit properties from classes higher in the tree structure.
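As an illustration of the relational model described above, here is a minimal sketch using Python's built-in `sqlite3` module. The table and column names are hypothetical and are not taken from any real biological database; the accession and structure identifiers are those that appear later in this report.

```python
import sqlite3

# In-memory relational database; schema names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protein (accession TEXT PRIMARY KEY, name TEXT, length INTEGER)"
)
conn.execute(
    "CREATE TABLE structure (pdb_id TEXT PRIMARY KEY, "
    "accession TEXT REFERENCES protein(accession))"
)
conn.execute("INSERT INTO protein VALUES ('P00766', 'Chymotrypsin', 245)")
conn.executemany(
    "INSERT INTO structure VALUES (?, ?)",
    [("1GCT", "P00766"), ("1CHG", "P00766")],
)

# A relation between the two tables yields the desired information.
rows = conn.execute(
    "SELECT s.pdb_id, p.name, p.length FROM structure s "
    "JOIN protein p ON p.accession = s.accession ORDER BY s.pdb_id"
).fetchall()
print(rows)  # [('1CHG', 'Chymotrypsin', 245), ('1GCT', 'Chymotrypsin', 245)]
```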
A biological database is a large, organized body of persistent data, usually associated with
computerized software designed to update, query, and retrieve components of the data stored within
the system. A simple database might be a single file containing many records, each of which includes
the same set of information. Most biological databases consist of long strings of nucleotides (guanine,
adenine, thymine, cytosine and uracil) and/or amino acids (threonine, serine, glycine, etc.). Each
sequence of nucleotides or amino acids represents a particular gene or protein (or section thereof),
respectively. Sequences are represented in shorthand, using single letter designations.
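The single-letter shorthand can be expanded back to the three-letter residue names with a simple lookup. The sketch below covers only the 20 standard amino acids; the helper name `expand` is illustrative.

```python
# Standard one-letter to three-letter amino acid codes (20 common residues).
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}

def expand(sequence):
    """Spell out a one-letter protein sequence using three-letter codes."""
    return "-".join(ONE_TO_THREE[residue] for residue in sequence.upper())

print(expand("CGVPA"))  # Cys-Gly-Val-Pro-Ala
```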
Primary or archived databases contain information and annotation of DNA and protein sequences,
DNA and protein structures and DNA and protein expression profiles.
Secondary or derived databases are so called because they contain the results of analysis on the
primary resources including information on sequence patterns or motifs, variants and mutations and
evolutionary relationships. Information from the literature is contained in bibliographic databases,
such as Medline.
The following table lists widely used databases for analyzing DNA and protein sequences, as
well as databases for research on DNA, protein structure and protein
function.
Category             Database                                                          Web address
Databases            DISC – DNA Information and Stock Center, Japan                    http://www.dna.affrc.go.jp/
Protein Databases    NCBI – GenPept                                                    http://www.ncbi.nlm.nih.gov/
                     ExPASy – SwissProt and TrEMBL                                     http://www.expasy.ch/
                     EBI (European Bioinformatics Institute) – SwissProt, TrEMBL, PIR  http://www.ebi.ac.uk/
                     DISC – DNA Information and Stock Center, Japan                    http://www.dna.affrc.go.jp/
Meta-databases: A meta-database can be considered a database of databases, rather than any one
integration project or technology. They collect data from different sources and usually make them
available in new and more convenient form, or with an emphasis on a particular disease or
organism.
1.2.3. Software/Tools
Software tools are computer programs for sequence analysis, database construction
and management, evolutionary relationships, structural analysis and pathways. The
software tools are integrated into databases.
The Bioinformatics Toolbox offers computational molecular biologists and other
research scientists an open and extensible environment in which to explore ideas,
prototype new algorithms, and build applications in drug research, genetic engineering,
and other genomics and proteomics projects. These tools range from a collection of
standalone tools with a common data format under a single, slick standalone or web-
based interface, to integrative and extensible bioinformatics workflow development
environments.
The important software programs in Bioinformatics that have been used in our project
are given in the following table:
1.3 Application and Importance
Bioinformatics is being used in the following fields:
Many expression studies have so far focused on devising methods to cluster genes by
similarities in expression profiles. This is in order to determine the proteins that are
expressed together under different cellular conditions. Briefly, the most common
methods are hierarchical clustering, self-organising maps, and K-means clustering. Hierarchical
methods originally derived from algorithms to construct phylogenetic trees, and group genes in a
bottom-up fashion; genes with the most similar expression profiles are clustered first, and those
with more diverse profiles are included iteratively. In contrast, the self-organising map and K-
means methods employ a top-down approach in which the user pre-defines the number of clusters
for the dataset. The clusters are initially assigned randomly, and the genes are regrouped iteratively
until they are optimally clustered.
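The bottom-up (hierarchical) approach described above can be sketched in a few lines. This is a simplified illustration, not the method of any particular package: it assumes Euclidean distance and average linkage, and the gene names and expression values are made up.

```python
from itertools import combinations
from math import dist

def average_linkage(a, b, profiles):
    """Mean Euclidean distance between all cross-cluster pairs of genes."""
    total = sum(dist(profiles[i], profiles[j]) for i in a for j in b)
    return total / (len(a) * len(b))

def hierarchical_clusters(profiles, k):
    """Repeatedly merge the two most similar clusters until k remain."""
    clusters = [[name] for name in profiles]  # start with one gene per cluster
    while len(clusters) > k:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], profiles),
        )
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return sorted(sorted(c) for c in clusters)

# Hypothetical expression profiles (arbitrary units, three conditions).
profiles = {
    "geneA": (1.0, 1.1, 0.9),
    "geneB": (1.1, 1.0, 1.0),
    "geneC": (5.0, 5.2, 4.9),
    "geneD": (5.1, 5.0, 5.0),
}
print(hierarchical_clusters(profiles, 2))
# [['geneA', 'geneB'], ['geneC', 'geneD']]
```

Genes with the most similar profiles (here geneA and geneB) are merged first, which is the behaviour the paragraph attributes to hierarchical methods.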
Drug development
One of the earliest medical applications of bioinformatics has been in aiding rational drug
design. Figure 1.3 outlines the commonly cited approach, taking the MLH1 gene product as
an example drug target. MLH1 is a human gene encoding a mismatch repair (MMR) protein,
situated on the short arm of chromosome 3. Through linkage analysis and its similarity to
MMR genes in mice, the gene has been implicated in nonpolyposis colorectal cancer (126).
Given the nucleotide sequence, the probable amino acid sequence of the encoded protein can
be determined using translation software. Sequence search techniques can then be used to
find homologues in model organisms, and based on sequence similarity, it is possible to
model the structure of the human protein on experimentally characterised structures. Finally,
docking algorithms could design molecules that could bind the model structure, leading the way for
biochemical assays to test their biological activity on the actual protein. At present all drugs on the
market target only about 500 proteins. With an improved understanding of disease mechanisms and
using computational tools to identify and validate new drug targets, more specific medicines that act
on the cause, not merely the symptoms, of the disease can be developed. These highly specific drugs
promise to have fewer side effects than many of today's medicines.
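The translation step mentioned above (nucleotide sequence to probable amino acid sequence) can be sketched as follows. The sketch assumes the standard genetic code and a single forward reading frame starting at the first base; the example DNA string is hypothetical, and real translation software also handles other frames and alternative codon tables.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna):
    """Translate DNA in frame 1, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3].upper()]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("ATGTGTGGTGTTCCTGCTTAA"))  # MCGVPA
```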
Pharmacogenomics
Clinical medicine will become more personalized with the development of the field of
pharmacogenomics. This is the study of how an individual's genetic inheritance affects the body's
response to drugs. At present, some drugs fail to make it to the market because a small percentage of
the clinical patient population show adverse effects to a drug due to sequence variants in their DNA.
Today, doctors have to use trial and error to find the best drug to treat a particular patient as those
with the same clinical symptoms can show a wide range of responses to the same treatment. In the
future, doctors will be able to analyse a patient's genetic profile and prescribe the best available drug
therapy and dosage from the beginning.
Gene therapy
Gene therapy is the approach used to treat, cure or even prevent disease by changing the expression of
a person’s defective genes. Currently, this field is in its infancy, with clinical trials ongoing for many
different types of cancer and other diseases.
Detection of Antibiotic-resistant pathogens
Scientists have been examining the genome of Enterococcus faecalis, a leading cause of bacterial
infection among hospital patients. They have discovered a region made up of a number of antibiotic-
resistant genes that may transform the bacterium from a harmless gut bacterium to a menacing
invader. The discovery of this region could provide useful markers for detecting pathogenic strains and
help to control the spread of infection in wards.
Agriculture
Bioinformatics tools can be used to sequence the genomes of plants and animals and elucidate the
functions of different genes. This specific genetic knowledge could then be used to produce nutrient
rich, drought, disease and insect resistant plants and improve the quality of livestock making them
healthier, more disease resistant and more productive.
Insect resistance
Genes from Bacillus thuringiensis that can control a number of serious pests have been successfully
transferred to cotton, maize and potatoes. This new ability of the plants to resist insect attack means
that the amount of insecticides being used can be reduced and hence the nutritional quality of the
crops is increased.
Scientists have recently succeeded in transferring genes into rice to increase levels of Vitamin A, iron
and other micronutrients. Scientists have also inserted a gene from yeast into tomato; the result is a
plant whose fruit stays longer on the vine and has an extended shelf life.
Biotechnology
The archaeon Archaeoglobus fulgidus and the bacterium Thermotoga maritima have potential for
practical applications in industry and government-funded environmental remediation. These
microorganisms thrive in water temperatures above the boiling point and therefore may provide the
DOE, the Department of Defence, and private companies with heat-stable enzymes suitable for use in
industrial processes.
Microorganisms are ubiquitous, that is they are found everywhere. They have been found surviving
and thriving in extremes of heat, cold, radiation, salt, acidity and pressure. They are present in the
environment, our bodies, the air, food and water. Traditionally, a variety of microbial properties have
been applied in the baking, brewing and food industries. The arrival of the complete genome
sequences and their potential to provide a greater insight into the microbial world and its capacities
could have broad and far reaching implications for environment, health, energy and industrial
applications.
Waste management
Deinococcus radiodurans is known as the world's toughest bacterium and is the most radiation
resistant organism known. Scientists are interested in this organism because of its potential usefulness
in cleaning up waste sites that contain radiation and toxic chemicals. Microbial Genome Program
(MGP) scientists are determining the DNA sequence of C. crescentus, one of the organisms
responsible for sewage treatment.
Maintenance of climatic balance
Increasing levels of carbon dioxide emission, mainly through the expanding use of fossil fuels for
energy, are thought to contribute to global climate change. Recently, the DOE (Department of Energy,
USA) launched a program to decrease atmospheric carbon dioxide levels. One method of doing so is
to study the genomes of microbes that use carbon dioxide as their sole carbon source.
Evolutionary Studies
The sequencing of genomes from all three domains of life, eukaryota, bacteria and archaea means that
evolutionary studies can be performed in a quest to determine the tree of life and the last universal
common ancestor.
Forensic studies
Bioinformatics has created a great opportunity to ease forensic work by helping investigators
identify the right culprit with high accuracy.
For example, scientists used genomic tools to help distinguish the strain of Bacillus anthracis that
was used in the 2001 terrorist attack in Florida from closely related anthrax strains.
Bioweapon creation
Scientists have recently built poliovirus, the virus that causes poliomyelitis, by entirely artificial
means using genomic data available on the internet and materials from a mail-order chemical supplier.
1.4 Project Aim
The aim of our project was to become acquainted with the field of bioinformatics. More specifically,
the targets were to:
• Become familiar with biological databases and the tools available to analyze the information
in such databases.
• Find the sequence of the protein and study its physicochemical properties.
• Develop methods to predict the structure and/or function and resolve the
secondary structure.
A well-known protein, chymotrypsin (Swiss-Prot accession P00766), was studied in the project.
Chymotrypsin is a proteolytic enzyme. This enzyme catalyzes the hydrolysis of peptide bonds of
proteins in the small intestine. It is selective for peptide bonds with aromatic or large
hydrophobic side chains (Tyr, Trp, Phe) on the carboxyl side of this bond. Chymotrypsin also
catalyzes the hydrolysis of ester bonds. It is termed a serine protease because it has a reactive serine
residue in its active site. Three amino acid residues play the key role in catalysis:
Ser195, His57 and Asp102. Together these residues are termed the “catalytic triad”. Although far
apart in the primary structure, protein folding brings these residues close together and into the
correct orientation in the tertiary structure. Chymotrypsin was the first serine protease to be
discovered. Its crystal structure was first solved by David Blow in 1967. This discovery
provided a key understanding of the catalytic
mechanism of a great variety of enzymes.
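The cleavage specificity described above can be illustrated with a short sketch that marks the bonds following Phe, Trp or Tyr. This is a deliberately simplified rule (it ignores, for instance, the influence of neighbouring residues), and the example peptide is hypothetical.

```python
AROMATIC = set("FWY")  # Phe, Trp, Tyr, as named in the text

def cleavage_sites(sequence):
    """1-based positions of residues whose C-terminal peptide bond may be cut."""
    # The last residue is skipped because no peptide bond follows it.
    return [i for i, residue in enumerate(sequence[:-1], start=1) if residue in AROMATIC]

def digest(sequence):
    """Split a sequence into the fragments expected from complete cleavage."""
    fragments, start = [], 0
    for position in cleavage_sites(sequence):
        fragments.append(sequence[start:position])
        start = position
    fragments.append(sequence[start:])
    return fragments

peptide = "GAFKLWSTYR"  # hypothetical test peptide
print(cleavage_sites(peptide))  # [3, 6, 9]
print(digest(peptide))          # ['GAF', 'KLW', 'STY', 'R']
```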
The mechanism of chymotrypsin action is illustrated on the following page.
Using bioinformatics tools, we performed a number of tasks concerning chymotrypsin, such as:
• Construction of a phylogenetic tree to determine the evolutionary relationships, based on
the proteins chosen for multiple sequence alignment.
[Figure: project workflow starting from the protein sequence file. Primary structure analysis covers physico-chemical properties and sequence comparison (identity, similarity).]
Chapter 2:
Working with Protein Sequences
2.1 General
With the availability of hundreds of complete genome sequences from both prokaryotes and
eukaryotes, efforts are now focused on the identification and functional analysis of the
proteins encoded by these genomes. This urgency has resulted in a burst of fresh
information linked to proteomics, and hence the need for protein sequence databases.
[Figure: primary data sources (submitted peptide data, DDBJ/EMBL/GenBank, VEGA, PDB, patent data, WGS, Ensembl, RefSeq, FlyBase, WormBase) feeding the UniProt archive, UniProt NREF 90 and UniProt NREF 50]
Figure 2.1: The flow of data from primary data sources into the component databases of the
Universal Protein Resource.
2.2.1 Database
We searched the protein database available through the NCBI gateway. It is the NIH protein sequence
database, an annotated collection of all publicly available protein sequences. The complete release
notes for the current version of the protein database are available on the NCBI FTP site. A new release is
made every two months.
Methods
1. The search for the desired sequence was started from the NCBI home page
(http://www.ncbi.nlm.nih.gov).
2. “Protein” was chosen in the “Search” box and a search was run for the chymotrypsin sequence.
3. P00766 was selected from the list of results.
4. The information available on the page was read carefully.
[Figure: the P00766 entry in GenPept format]
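The steps above were carried out by hand in the web interface. As a programmatic alternative (not the method used in this project), the same record can be requested through NCBI's E-utilities `efetch` service. The sketch below only builds the request URL; `fetch_fasta` is an illustrative helper that needs network access and is not called here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(accession, db="protein", rettype="fasta"):
    """Build an efetch URL for one record in plain-text form."""
    query = urlencode(
        {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    )
    return f"{EFETCH}?{query}"

def fetch_fasta(accession):
    """Download the record as text; requires network access (not run here)."""
    with urlopen(efetch_url(accession)) as response:
        return response.read().decode("utf-8")

print(efetch_url("P00766"))
```

Keeping URL construction separate from the download makes the request easy to inspect before any network call is made.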
2.2.2 Results
The sequence was retrieved and saved in Microsoft Word format for further use.
>gi|117615|sp|P00766|
CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGE
FDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSAVCLPSASDDFAAGTTCVTTG
WGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLV
GIVSWGSSTCSTSTPGVYARVTALVNWVQQTLAAN
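A record in this FASTA layout is easy to read programmatically. The following minimal sketch assumes a single-record file; the function name `parse_fasta` is illustrative, and the record is the one shown above.

```python
def parse_fasta(text):
    """Return (header, sequence) for a single-record FASTA string."""
    lines = [line.strip() for line in text.strip().splitlines()]
    header = lines[0].lstrip(">")
    sequence = "".join(lines[1:])  # sequence lines are simply concatenated
    return header, sequence

record = """>gi|117615|sp|P00766|
CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGE
FDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSAVCLPSASDDFAAGTTCVTTG
WGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLV
GIVSWGSSTCSTSTPGVYARVTALVNWVQQTLAAN
"""

header, sequence = parse_fasta(record)
accession = header.split("|")[3]  # fourth field of the header line
print(accession, len(sequence))   # P00766 245
```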
Proteins are condensation polymers of amino acid residues. However, a linear arrangement of
residues by itself does not reveal much about protein structure or protein function. It is the
3D or tertiary native structure (quaternary in the case of a multisubunit protein) which describes a
protein best.
Though primary structure analysis is not a good method for functional and structural analysis of a
protein, it can provide valuable information about the protein's behaviour in solution, its
molecular weight, extinction coefficient etc. Thus general physicochemical properties can be a good
indicator for understanding protein activity on a broader scale.
2.3.1.1 Software
2.3.1.2 Method
5. Then in the box for the sequence, the sequence of “P00766 (swiss-prot accession
no)” was pasted.
2.3.1.3 Results
ProtParam
User-provided sequence:
10 20 30 40 50 60
CGVPAIQPVL SGLSRIVNGE EAVPGSWPWQ VSLQDKTGFH FCGGSLINEN WVVTAAHCGV
TLAAN
Cys (C) 10 4.1%
Gln (Q) 10 4.1%
Glu (E) 5 2.0%
Gly (G) 23 9.4%
His (H) 2 0.8%
Ile (I) 10 4.1%
Leu (L) 19 7.8%
Lys (K) 14 5.7%
Met (M) 2 0.8%
Phe (F) 6 2.4%
Pro (P) 9 3.7%
Ser (S) 28 11.4%
Thr (T) 23 9.4%
Trp (W) 8 3.3%
Tyr (Y) 4 1.6%
Val (V) 23 9.4%
Pyl (O) 0 0.0%
Sec (U) 0 0.0%
(B) 0 0.0%
(Z) 0 0.0%
(X) 0 0.0%
Carbon C 1127
Hydrogen H 1783
Nitrogen N 307
Oxygen O 353
Sulfur S 12
Formula: C1127H1783N307O353S12
Total number of atoms: 3582
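The atom total reported above can be checked directly from the molecular formula, and an approximate molecular weight follows from the same counts. The sketch below is an independent check, not ProtParam's own code; the atomic masses are rounded standard average values entered here as an assumption.

```python
import re

# Rounded standard average atomic masses (assumed values, in daltons).
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "S": 32.06}

def parse_formula(formula):
    """Turn a formula such as 'C2H6O1' into {'C': 2, 'H': 6, 'O': 1}."""
    return {el: int(n) for el, n in re.findall(r"([A-Z][a-z]?)(\d+)", formula)}

counts = parse_formula("C1127H1783N307O353S12")
total_atoms = sum(counts.values())
mass = sum(ATOMIC_MASS[el] * n for el, n in counts.items())

print(total_atoms)            # 3582
print(round(mass / 1000, 1))  # 25.7 (kDa)
```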
Extinction coefficients:
Position Specific Iterated BLAST (PSI-BLAST)
2.3.2.2 Method
BLASTp
1. NCBI home page
2. BLAST
3. Standard protein-protein BLAST
4. Sequence of 1GCT pasted in the Search window
5. BLOSUM62 selected from the MATRIX option
6. BLAST run
PSI-BLAST
1. NCBI home page
2. BLAST
3. PSI-BLAST
2.3.2.3 Results
BLASTp results
Figure 2.3: Graphical presentation of BLASTp results
More such results.......................................................................
Score = 496 bits (1278), Expect = 5e-139, Method: Compositional matrix adjust.
Identities = 245/245 (100%), Positives = 245/245 (100%), Gaps = 0/245 (0%)
Query 1 CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV 60
CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV
Sbjct 56 CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV 115
Query 121 VCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM 180
VCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM
Sbjct 176 VCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM 235
pdb|1P2O|C Chain C, Structural Consequences Of Accommodation Of Four Non-
Cognate Amino-Acid Residues In The S1 Pocket Of Bovine Trypsin
And Chymotrypsin
pdb|1P2Q|A Chain A, Structural Consequences Of Accommodation Of Four Non-
Cognate Amino-Acid Residues In The S1 Pocket Of Bovine Trypsin
And Chymotrypsin
pdb|1P2Q|C Chain C, Structural Consequences Of Accommodation Of Four Non-
Cognate Amino-Acid Residues In The S1 Pocket Of Bovine Trypsin
And Chymotrypsin
pdb|1OXG|A Chain A, Crystal Structure Of A Complex Formed Between Organic
Solvent Treated Bovine Alpha-Chymotrypsin And Its Autocatalytically
Produced Highly Potent 14-Residue Peptide At 2.2 Resolution
pdb|1T7C|A Chain A, Crystal Structure Of The P1 Glu Bpti Mutant- Bovine
Chymotrypsin Complex
pdb|1T7C|C Chain C, Crystal Structure Of The P1 Glu Bpti Mutant- Bovine
Chymotrypsin Complex
pdb|1T8L|A Chain A, Crystal Structure Of The P1 Met Bpti Mutant- Bovine
Chymotrypsin Complex
pdb|1T8L|C Chain C, Crystal Structure Of The P1 Met Bpti Mutant- Bovine
Chymotrypsin Complex
pdb|1T8M|A Chain A, Crystal Structure Of The P1 His Bpti Mutant- Bovine
Chymotrypsin Complex
pdb|1T8M|C Chain C, Crystal Structure Of The P1 His Bpti Mutant- Bovine
Chymotrypsin Complex
pdb|1T8N|A Chain A, Crystal Structure Of The P1 Thr Bpti Mutant- Bovine
Chymotrypsin Complex
pdb|1T8N|C Chain C, Crystal Structure Of The P1 Thr Bpti Mutant- Bovine
Chymotrypsin Complex
pdb|1T8O|A Chain A, Crystal Structure Of The P1 Trp Bpti Mutant- Bovine
Chymotrypsin Complex
pdb|1T8O|C Chain C, Crystal Structure Of The P1 Trp Bpti Mutant- Bovine
Chymotrypsin Complex
pdb|1CHG|A Chain A, Chymotrypsinogen,2.5 Angstroms Crystal Structure,
Comparison
With Alpha-Chymotrypsin,And Implications For Zymogen
Activation
pdb|1GCD|A Chain A, Refined Crystal Structure Of "aged" And "non-Aged"
Organophosphoryl Conjugates Of Gamma-Chymotrypsin
Length=245
Score = 495 bits (1274), Expect = 2e-138, Method: Compositional matrix adjust.
Identities = 245/245 (100%), Positives = 245/245 (100%), Gaps = 0/245 (0%)
Query 1 CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV 60
CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV
Sbjct 1 CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV 60
Query 241 TLAAN 245
TLAAN
Sbjct 241 TLAAN 245
Score = 486 bits (1251), Expect = 6e-136, Method: Compositional matrix adjust.
Identities = 241/245 (98%), Positives = 241/245 (98%), Gaps = 0/245 (0%)
Query 1 CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV 60
CGVPAIQPVLSGL IVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV
Sbjct 1 CGVPAIQPVLSGLXXIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV 60
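The "Identities" figure reported by BLAST is the number of aligned positions at which query and subject carry the same residue. A minimal sketch of that count for the 60-residue block shown above follows; the function name `identity` is illustrative, and BLAST's own percentage is computed over the full 245-residue alignment rather than this single block.

```python
def identity(query, subject):
    """Count identical, non-gap positions in two equal-length aligned strings."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(q == s and q != "-" for q, s in zip(query, subject))
    return matches, len(query)

query = "CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV"
sbjct = "CGVPAIQPVLSGLXXIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV"

matches, length = identity(query, sbjct)
print(f"{matches}/{length} ({100 * matches / length:.0f}%)")  # 58/60 (97%)
```

The two mismatches are the positions where the subject holds the unknown residue X.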
Figure 2.4: Graphical presentation of PSI-BLAST search result
Similar More Results.......................................................................
Query 1 CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV 60
Sbjct 19 CGVPGIPPVITGYSRIVNGEEAVPHSWPWQVSLQEYTGFHFCGGSLINENWVVTAAHCNV 78
VC+ SD F G CVT+GWGLTRY +TP RLQQ +LPLL+N C+K+WG+KI D M
+AAN
Query 1 CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV 60
Sbjct 20 CGTPAISPVITGYSRIVNGEEAVPHSWPWQVSLQDYTGFHFCGGSLINENWVVTAAHCNV 79
+AAN
Query 1 CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV 60
CGVPAI+PVL+GLSRIVNGE+AVPGSWPWQVSLQD TGFHFCGGSLI+E+WVVTAAHCGV
Sbjct 20 CGVPAIEPVLNGLSRIVNGEDAVPGSWPWQVSLQDSTGFHFCGGSLISEDWVVTAAHCGV 79
LAAN
GDSGGPLVCQKDGAWTLVGIVSWGSRTCSTTTPAVYARVTKLIPWVQKILAAN
>|CTRL protein[Human]
LTSATMLLLSLTLSLVLLGSSWGCGIPAIKPALSFSQRIVNGENAVLGSWPWQVSLQDSSGFHFCGGSLI
SQSWVVTAAHCNVSPGRHFVVLGEYDRSSNAEPLQVLSVSRAITHPSWNSTTMNNDVTLLKLASPAQYTT
RISPVCLASSNEALTEGLTCVTTGWGRLSGVGNVTPAHLQQVALPLVTVNQCRQYWGSSITDSMICAGGA
GASSCQGDSGGPLVCQKGNTWVLIGIVSWGTKNCNVRAPAVYTRVSKFSTWINQVIAYN
>|Ela3 protein[Mouse]
PTRPQPSHNPSSRVVNGEEAVPHSWPWQVSLQYEKDGSFHHTCGGSLITPDWVLTAGHCISTSRTYQVVL
GEHERGVEEGQEQVIPINAGDLFVHPKWNSMCVSCGNDIALVKLSRSAQLGDAVQLACLPPAGEILPNGA
PCYISGWGRLSTNGPLPDKLQQALLPVVDYEHCSRWNWWGLSVKTTMVCAGGDIQSGCNGDSGGPLNCPA
DNGTWQVHGVTSFVSSLGCNTLRKPTVFTRVSAFIDWIEETIANN
>gi|chymotrypsinogen 2-like protein [Sparus aurata]
GTRFLWILSCLAFVGAAYGCGTPAISPVITGYSRIVNGEEAVPHSWPWQVSLQDYTGFHFCGGSLINENW
VVTAAHCNVRTSHRVILGEHDRSSNAEAIQVMKVGKVFKHPNYNGYTINNDILLIKLASPAQMGMRVSPV
CVAETADNFPGGMRCVTSGWGLTRYNAPDTPALLQQASLPLLTNEQCRQYWGSKISNLMICAGASGASSC
MGDSGGPLVCEKAGAWTLVGIVSWGSGTCTPTMPGVYARVTELRAWMDQIIAAN
>gi|Zebrafish [Danio rerio]
WLLSCVAFFSAAYGCGVPAIPPVVSGYARIVNGEEAVPHSWPWQVSLQDFTGFHFCGGSLINEFWVVTAA
HCSVRTSHRVILGEHNKGKSNTQEDIQTMKVSKVFTHPQYNSNTIENDIALVKLTAPASLNAHVSPVCLA
EASDNFASGMTCVTSGWGVTRYNALFTPDELQQVALPLLSNEDCKNHWGSNIRDTMICAGAAGASSCMGD
SGGPLVCQKDNIWTLVGIVSWGSSRCDPTMPGVYGRVTELRDWVDQILASN
>pdb|1S0Q|TRYPSINOGEN
IVGGYTCGANTVPYQVSLNSGYHFCGGSLINSQWVVSAAHCYKSGIQVRLGEDNINVVEGNEQFISASKS
IVHPSYNSNTLNNDIMLIKLKSAASLNSRVASISLPTSCASAGTQCLISGWGNTKSSGTSYPDVLKCLKA
PILSDSSCKSAYPGQITSNMFCAGYLEGGKDSCQGDSGGPVVCSGKLQGIVSWGSGCAQKNKPGVYTKVC
NYVSWIKQTIASN
>pdb|3PTN|TRYPSIN
IVGGYTCGANTVPYQVSLNSGYHFCGGSLINSQWVVSAAHCYKSGIQVRLGEDNINVVEGNEQFISASKS
IVHPSYNSNTLNNDIMLIKLKSAASLNSRVASISLPTSCASAGTQCLISGWGNTKSSGTSYPDVLKCLKA
PILSDSSCKSAYPGQITSNMFCAGYLEGGKDSCQGDSGGPVVCSGKLQGIVSWGSGCAQKNKPGVYTKVC
NYVSWIKQTIASN
>gi|PRSS2 protein [Bos taurus]
MHSLLILAFVGAAVAFPSDDDDKIVGGYTCAENSVPYQVSLNAGYHFCRGSLINDQWVVSAAHCYQYHIQ
VRLGEYNIDVLEGGEQFIDASKIIRHPKYSSWTLDNDILLIKLSTPAVINARVSTLALPSACASAGTECL
ISGWGNTLSSGVNYPDLLQCLEAPLLSHADCEASYPGEITNNMICAGFLEGGKDSCQGDSGGPVACNGQL
QGIVSWGYGCAQKGKPGVYTKVCNYVDWIQETIAANS
>gi|tryptase-III [Human]
LPVLASRAYAAPAPGQALQRVGIVGGQEAPRSKWPWQVSLRVRDRYWMHFCGGSLIHPQWVLTAAHCVGP
DVKDLAALRVQLREQHLYYQDQLLPVSRIIVHPQFYTAQIGADIALLELEEPVKVSSHVHTVTLPPASET
FPPGMPCWVTGWGDVDNDERLPPPFPLKQVKVPIMENHICDAKYHLGAYTGDDVRIVRDDMLCAGNTRRD
SCQGDSGGPLVCKVNGTWLQAGVVSWGEGCAQPNRPGIYTRVTYYLDWIHHYVPKKP
>gi|beta 1 tryptase [Gorilla gorilla]
MLNLLLLALPVLASPAYAAPAPGQALQRAGIVGGQEAPRSKWPWQVSLRVRGQYWMHFCGGSLIHPQWVL
TAAHCVGPDVKDLAALRVQLREQHLYYQDQLLPVSRIIVHPQFYTAQIGADIALLELEEPVNVSSHVHTV
TLPPASETFPPGMPCWVTGWGDVDNDERLPPPFPLKQVKVPIMENHICDAKYHLGAYTGDNVRIVRDDML
CAGNTRRDSCQGDSGGPLVCKVNGTWLQAGVVSWGEGCAQPNRPGIYTRVTYYLDWIHHYVPKKP
>gi|58257847|gb|AAW69366.1| try14 [Macaca mulatta]
MNPLLIFAFVGATVAAPFDDDDKIVGGYTCEENSLPYQVSLNSGSHFCGGSLINKQWVVSAAHCYKPRIQ
VRLGEHNIKVLEGNEQFIHAAKIIRHPKYNNETLDNDIMLVKLSTPAIINARVSTISLPSALAAAGTECL
ISGWGNTLSFGADYPDELQCLDAPVLTQAKCEASYPGKITSNMFCVGFLEGGKDSCQRDSGGPVVCNGQL
QGVVSWGYGCARKNRPGVYTKVYNYVDWIRDTIAANS
>pdb|5PTP|HYDROLASE
IVGGYTCGANTVPYQVSLNSGYHFCGGSLINSQWVVSAAHCYKSGIQVRLGEDNINVVEGNEQFISASKS
IVHPSYNSNTLNNDIMLIKLKSAASLNSRVASISLPTSCASAGTQCLISGWGNTKSSGTSYPDVLKCLKA
PILSDSSCKSAYPGQITSNMFCAGYLEGGKDSCQGDXGGPVVCSGKLQGIVSWGSGCAQKNKPGVYTKVC
NYVSWIKQTIASN
>pdb|3EST|ELASTASE
VVGGTEAQRNSWPSQISLQYRSGSSWAHTCGGTLIRQNWVMTAAHCVDRELTFRVVVGEHNLNQNNGTEQ
YVGVQKIVVHPYWNTDDVAAGYDIALLRLAQSVTLNSYVQLGVLPRAGTILANNSPCYITGWGLTRTNGQ
LAQTLQQAYLPTVDYAICSSSSYWGSTVKNSMVCAGGDGVRSGCQGDSGGPLHCLVNGQYAVHGVTSFVS
RLGCNVTRKPTVFTRVSAYISWINNVIASN
>pdb|1ZHR|FactorXI
IVGGTASVRGEWPWQVTLHTTSPTQRHLCGGSIIGNQWILTAAHCFYGVESPKILRVYSGILNQAEIAED
TSFFGVQEIIIHDQYKMAESGYDIALLKLETTVNYADSQRPISLPSKGDRNVIYTDCWVTGWGYRKLRDK
IQNTLQKAKIPLVTNEECQKRYRGHKITHKMICAGYREGGKDACKGDSGGPLSCKHNEVWHLVGITSWGE
GCAQRERPGVYTNVVEYVDWILEKTQAV
>pdb|1DDJ|PLASMINOGEN
SFDCGKPQVEPKKCPGRVVGGCVAHPHSWPWQVSLRTRFGMHFCGGTLISPEWVLTAAHCLEKSPRPSSY
KVILGAHQEVNLEPHVQEIEVSRLFLEPTRKDIALLKLSSPAVITDKVIPACLPSPNYVVADRTECFITG
WGETQGTFGAGLLKEAQLPVIENKVCNRYEFLNGRVQSTELCAGHLAGGTDSCQGDAGGPLVCFEKDKYI
LQGVTSWGLGCARPNKPGVYVRVSRFVTWIEGVMRNN
>|Mast cell protease6
MLKLLLLLALSPLASLVHAAPCPVKQRVGIVGGREASESKWPWQVSLRFKFSFWMHFCGGSLIHPQWVLT
AAHCVGLHIKSPELFRVQLREQYLYYADQLLTVNRTVVHPHYYTVEDGADIALLELENPVNVSTHIHPTS
LPPASETFPSGTSCWVTGWGDIDSDEPLLPPYPLKQVKVPIVENSLCDRKYHTGLYTGDDVPIVQDGMLC
AGNTRSDSCQGDSGGPLVCKVKGTWLQAGVVSWGEGCAEANRPGIYTRVTYYLDWIHRYVPQRS
>gi|899286|Hepsin
TSGFFCVDEGRLPHTQRLLEVISVCDCPRGRFLAAICQDCGRRKLPVDRIVGGRDTSLGRWPWQVSLRYD
GAHLCGGSLLSGDWVLTAAHCFPERNRVLSRWRVFAGAVAQASPHGLQLGVQAVVYHGGYLPFRDPNSEE
NSNDIALVHLSSPLPLTEYIQPVCLPAAGQALVDGKICTVTGWGNTQYYGQQAGVLQEARVPIISNDVCN
GADFYGNQIKPKMFCAGYPEGGIDACQGDSGGPFVCEDSISRTPRWRLCGIVSWGTGCALAQKPGVYTKV
SDFREWIFQAIKTHSEASGMVTQL
>pdb|1SPJ|KALLIKREIN
IVGGWECEQHSQPWQAALYHFSTFQCGGILVHRQWVLTAAHCISDNYQLWLGRHNLFDDENTAQFVHVSE
SFPHPGFNMSLLENHTRQADEDYSHDLMLLRLTEPADTITDAVKVVELPTEEPEVGSTCLASGWGSIEPE
NFSFPDDLQCVDLKILPNDECKKAHVQKVTDFMLCVGHLEGGKDTCVGDSGGPLMCDGVLQGVTSWGYVP
CGTPNKPSVAVRVLSYVKWIEDTIAENS
>pdb|1HCG|FACTOR X
IVGGQECKDGECPWQALLINEENEGFCGGTILSEFYILTAAHCLYQAKRFKVRVGDRNTEQEEGGEAVHE
VEVVIKHNRFTKETYDFDIAVLRLKTPITFRMNVAPACLPERDWAESTLMTQKTGIVSGFGRTHEKGRQS
TRLKMLEVPYVDRNSCKLSSSFIITQNMFCAGYDTKQEDACQGDSGGPHVTRFKDTYFVTGIVSWGEGCA
RKGKYGIYTKVTAFLKWIDRSMKTRGLPKAK
>pdb|1HYL|COLLAGENASE
IINGYEAYTGLFPYQAGLDITLQDQRRVWCGGSLIDNKWILTAAHCVHDAVSVVVYLGSAVQYEGEAVVN
SERIISHSMFNPDTYLNDVALIKIPHVEYTDNIQPIRLPSGEELNNKFENIWATVSGWGQSNTDTVILQY
TYNLVIDNDRCAQEYPPGIIVESTICGDTSDGKSPCFGDSGGPFVLSDKNLLIGVVSFVSGAGCESGKPV
GFSRVTSYMDWIQQNTGIKF
>gi|Cold-Adaption Enzymes [Salmon]
IVGGYECKAYSQAHQVSLNSGYHFCGGSLVNENWVVSAAHCYKSRVEVRLGEHNIKVTEGSEQFISSSRV
IRHPNYSSYNIDNDIMLIKLSKPATLNTYVQPVALPTSCAPAGTMCTVSGWGNTMSSTADSDKLQCLNIP
ILSYSDCNDSYPGMITNAMFCAGYLEGGKDSCQGDSGGPVVCNGELQGVVSWGYGCAEPGNPGVYAKVCI
FSDWLTSTMASY
Sequence alignment is the procedure of comparing two (pair-wise alignment) or more
(multiple sequence alignment) sequences by searching for a series of individual
character patterns that are in the same order in the sequences. Two sequences are aligned
by writing them across a page in two rows. Identical or similar characters are placed in the
same column, and nonidentical characters can either be placed in the same column as a
mismatch or opposite a gap in the other sequence. In an optimal alignment, nonidentical
characters and gaps are placed to bring as many identical or similar characters as possible into
vertical register. Sequences that can be readily aligned in this manner are said to be similar.
There are two types of sequence alignment, global and local. In global alignment, an
attempt is made to align the entire sequence, using as many characters as possible, up to
both ends of each sequence. Sequences that are quite similar and approximately the same
length are suitable candidates for global alignment. In local alignment, stretches of
sequence with the highest density of matches are aligned, thus generating one or more
islands of matches or subalignments in the aligned sequences. Local alignments are more
suitable for aligning sequences that are similar along some of their lengths but dissimilar
in others, sequences that differ in length, or sequences that share a conserved region or
domain.
Pairwise alignment is the process by which a pair of sequences is compared by a sequence
alignment technique, either global or local. A dot plot can also be used for this comparison.
Global alignment
LGPSSKQTGKGS-SRIWDN
LN-ITKSAGKGAIMRLGDA
Local alignment
-------TGKG--------
-------AGKG--------
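The difference between global and local alignment can be sketched with a small dynamic programming routine. This is a minimal illustration, not the method used by BLAST: the function name `align_score` and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for the example. Global mode scores the sequences end to end, while local mode resets negative scores to zero and reports the best-scoring island of matches.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Best alignment score of a and b (illustrative scoring values)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised along both edges.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            cell = max(diag, up, left)
            if local:
                # Local alignment: never carry a negative score forward.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


# The short island from the figure: the global score includes the T/A
# mismatch, the local score covers only the matching GKG stretch.
print(align_score("TGKG", "AGKG"))
print(align_score("TGKG", "AGKG", local=True))
```

With these assumed scores the local result is never lower than zero, which is why local alignment suits sequences that share only a conserved region.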
2.3.3.1.1 Software/Program
BLAST 2 Sequences
This tool produces the alignment of two given sequences using the BLAST engine for
local alignment. While the standard BLAST program is widely used to search for
homologous sequences in nucleotide and protein databases, one often needs to
compare only two sequences that are already known to be homologous, coming from
related species or, e.g., different isolates of the same virus. In such cases searching the
entire database would be unnecessarily time-consuming. 'BLAST 2 Sequences' utilizes
the BLAST algorithm for pairwise DNA-DNA or protein-protein sequence comparison.
The results of BLAST 2 Sequences report the identities and similarities between the
subject protein and the query protein. It also gives a graphical representation
of the alignment.
2.3.3.1.2 Methods
1. Starting with NCBI, “BLAST” search was selected.
4. The query sequence was pasted from the saved file in the first window.
5. The subject sequence was pasted from file in the 2nd window.
6. “Align” was clicked.
7. The results were saved.
2.3.3.1.3 Result
Pair-wise alignment results were obtained separately for each sequence. Among those, the
result for p00766 and the cold adaptation enzyme is given below:
Table 2.1: Pair-wise alignment results for retrieved sequences to identify similarities
No.  Query sequence            Subject sequence                           Score (bits)  E-value  Identities      Positives       Gaps
01   p00766 (S=human, L=245)   Cold adaptation enzyme (S=salmon, L=231)   164 (414)     1e-38    97/231 (41%)    137/231 (59%)   12/ (5%)
02   p00766 (S=human, L=245)   pdb[4CHA] (S=, L=245)                      494 (1271)    5e-138   241/245 (98%)   241/245 (98%)   0/245 (0%)
10   p00766 (S=human, L=245)   3PTN (trypsin) (S=, L=223)                 175 (444)     4e-42    98/232 (42%)    140/232 (60%)   11/232 (4%)
20   p00766 (S=human, L=245)   Hepsin (S=human, L=304)                    159 (403)     2e-37    95/249 (38%)    130/249 (52%)   23/243 (9%)
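The percentages in Table 2.1 follow directly from the reported counts. The short sketch below (the helper name `percent` is hypothetical) reproduces them with integer division; the tabulated values appear to be truncated rather than rounded, since 11/232 is about 4.7% but is listed as 4%.

```python
def percent(count, length):
    """Whole-number percentage of count out of length, truncated (not rounded)."""
    return count * 100 // length


# (identities, positives, gaps) counts over the aligned length,
# taken from rows 10 and 20 of Table 2.1.
rows = {
    "3PTN (trypsin)": (98, 140, 11, 232),
    "Hepsin": (95, 130, None, 249),  # gap count omitted: its denominator differs in the table
}
for name, (ident, pos, gaps, length) in rows.items():
    print(name, percent(ident, length), percent(pos, length))
```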
One of the major contributions of molecular biology to evolutionary analysis is the discovery
that the DNA sequences of different organisms are often related. Similar genes are conserved
across widely divergent species, often performing a similar or even identical function, and at
other times, mutating or rearranging to perform an altered function through the forces of natural
selection. Thus, many genes are represented in highly conserved forms in organisms. Through
simultaneous alignment of the sequences of these genes, sequence patterns that have been subject to
alteration may be analyzed.
Because the potential for learning about the structure and function of molecules by
multiple sequence alignment (msa) is so great, computational methods have received a
great deal of attention. In msa, sequences are aligned optimally by bringing the greatest
number of similar characters into register in the same column of the alignment, just as
described in Section 2.3.3.1 for the alignment of two sequences. Computationally, msa presents
several difficult challenges. First, finding an optimal alignment of more than two sequences
that includes matches, mismatches, and gaps, and that takes into account the degree of
variation in all of the sequences at the same time poses a very difficult challenge. The
dynamic programming algorithm used for optimal alignment of pairs of sequences can be
extended to three sequences, but for more than three sequences, only a small number of
relatively short sequences may be analyzed. Thus, approximate methods are used, including (1) a
progressive global alignment of the sequences starting with an alignment of the
most alike sequences and then building an alignment by adding more sequences, (2) iterative methods
that make an initial alignment of groups of sequences and then revise the
alignment to achieve a more reasonable result, (3) alignments based on locally conserved
patterns found in the same order in the sequences, and (4) use of statistical methods and
probabilistic models of the sequences. A second computational challenge is identifying a
reasonable method of obtaining a cumulative score for the substitutions in the column of
an msa. Finally, the placement and scoring of gaps in the various sequences of an msa presents an
additional challenge.
The msa of a set of sequences may also be viewed as an evolutionary history of the sequences. If the
sequences in the msa align very well, they are likely to be recently derived from a common ancestor
sequence. Conversely, a group of poorly aligned sequences share a more complex and distant
evolutionary relationship. The task of aligning a set of sequences, some more closely and others less
closely related, is identical to that of discovering the evolutionary relationships among the sequences.
As with aligning a pair of sequences, the difficulty in aligning a group of sequences varies
considerably with sequence similarity. On the one hand, if the amount of sequence variation is minimal,
it is quite straightforward to align the sequences, even without the assistance of a computer program.
On the other hand, if the amount of sequence variation is
great, it may be very difficult to find an optimal alignment of the sequences because so
many combinations of substitutions, insertions, and deletions, each predicting a different
alignment, are possible.
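One common way to obtain a cumulative score for a column of an msa is the sum-of-pairs scheme, in which every pair of characters in the column is scored and the pair scores are added. The sketch below illustrates that idea only; the function names and the scoring values (match +1, mismatch -1, gap -2) are assumptions for the example and are not taken from CLUSTALW.

```python
from itertools import combinations


def sum_of_pairs(column, match=1, mismatch=-1, gap=-2):
    """Score one alignment column by summing over all pairs of characters."""
    total = 0
    for x, y in combinations(column, 2):
        if x == "-" and y == "-":
            continue              # two gaps facing each other add nothing
        if x == "-" or y == "-":
            total += gap          # a residue opposite a gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def msa_score(rows, **scores):
    """Total score of an alignment given as equal-length gapped strings."""
    return sum(sum_of_pairs(col, **scores) for col in zip(*rows))


print(msa_score(["TGKG", "AGKG", "AGKG"]))
```

Because the number of pairs grows with the square of the number of sequences, closely related sequences can dominate such a score, which is one reason sequence weighting is used in practice.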
2.3.3.2.1 Software/ Tools
CLUSTALW
CLUSTALW is a general purpose multiple sequence alignment program for DNA or proteins.
It produces biologically meaningful multiple sequence alignments of divergent sequences. It
is a fully automated sequence alignment tool for DNA and protein sequences. It returns the
best match over the total length of the input sequences, be they protein or nucleic acid. This program
follows these steps:
1. Perform pairwise alignments of all of the sequences.
2. Use the alignment scores to produce a phylogenetic tree using the neighbor-joining method.
3. Align the sequences sequentially, guided by the phylogenetic relationships indicated by
the tree.
CLUSTALW improves the sensitivity of progressive multiple sequence alignment through sequence
weighting, position-specific gap penalties and weight matrix choice. Evolutionary relationships can
also be seen by viewing cladograms or phylograms. The sequence alignment is performed as a global
alignment.
JalView
Jalview is a multiple alignment editor written entirely in Java. It was initially intended as a
visualization tool for the Pfam CORBA server and client at the EBI but is available as a
general purpose alignment editor. It is used widely in a variety of web pages (e.g. the EBI
ClustalW server and the Pfam protein domain database). Jalview is also a phylogenetic tree
drawing program. Phylogenetic relationships are
patterns of shared common history between biological replicators.
2.3.3.2.2 Method
2. “Toolbox” was clicked and then “Sequence Analysis” was chosen from the drop down
menu.
3. From the tools available, “CLUSTALW” was selected for multiple sequence
alignment.
6. “Input” was selected for the output order.
Parameters: Alignment: Fast; Matrix: Blosum; Output order: Input
2.3.3.2.3 Results
Clustal results are most informative when sequences that introduce long gaps are omitted,
because the multiple sequence alignment here is a global alignment process. After
omitting the sequences that caused too many gaps against the p00766 sequence, we had 14
meaningful sequences. They are:
>gi|117615|sp|P00766|
CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGE
FDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSAVCLPSASDDFAAGTTCVTTG
WGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLV
GIVSWGSSTCSTSTPGVYARVTALVNWVQQTLAAN
>pdb|4CHA|
CGVPAIQPVLSGLXXIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGE
FDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSAVCLPSASDDFAAGTTCVTTG
WGLTRYXXANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLV
GIVSWGSSTCSTSTPGVYARVTALVNWVQQTLAAN
>pdb|1GCT|CHYMOTRYPSIN*A
CGVPAIQPVLSGLIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGEFD
QGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSAVCLPSASDDFAAGTTCVTTGWG
LTRYANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVS
WGSSTCSTSTPGVYARVTALVNWVQQTLAAN
>|CTRB2 protein[Human]
MASLWLLSCFSLVGAAFGCGVPAIHPVLSGLSRIVNGEDAVPGSWPWQVSLQDKTGFHFCGGSLISEDWV
VTAAHCGVRTSDVVVAGEFDQGSDEENIQVLKIAKVFKNPKFSILTVNNDITLLKLATPARFSQTVSAVC
LPSADDDFPAGTLCATTGWGKTKYNANKTPDKLQQAALPLLSNAECKKSWGRRITDVMICAGASGVSSCM
GDSGGPLVCQKDGAWTLVGIVSWGSRTCSTTTPAVYARVTKLIPWVQKILAAN
>|CTRL protein[Human]
LTSATMLLLSLTLSLVLLGSSWGCGIPAIKPALSFSQRIVNGENAVLGSWPWQVSLQDSSGFHFCGGSLI
SQSWVVTAAHCNVSPGRHFVVLGEYDRSSNAEPLQVLSVSRAITHPSWNSTTMNNDVTLLKLASPAQYTT
RISPVCLASSNEALTEGLTCVTTGWGRLSGVGNVTPAHLQQVALPLVTVNQCRQYWGSSITDSMICAGGA
GASSCQGDSGGPLVCQKGNTWVLIGIVSWGTKNCNVRAPAVYTRVSKFSTWINQVIAYN
>|Ela3 protein[Mouse]
PTRPQPSHNPSSRVVNGEEAVPHSWPWQVSLQYEKDGSFHHTCGGSLITPDWVLTAGHCISTSRTYQVVL
GEHERGVEEGQEQVIPINAGDLFVHPKWNSMCVSCGNDIALVKLSRSAQLGDAVQLACLPPAGEILPNGA
PCYISGWGRLSTNGPLPDKLQQALLPVVDYEHCSRWNWWGLSVKTTMVCAGGDIQSGCNGDSGGPLNCPA
DNGTWQVHGVTSFVSSLGCNTLRKPTVFTRVSAFIDWIEETIANN
>gi|chymotrypsinogen 2-like protein [Sparus aurata]
GTRFLWILSCLAFVGAAYGCGTPAISPVITGYSRIVNGEEAVPHSWPWQVSLQDYTGFHFCGGSLINENW
VVTAAHCNVRTSHRVILGEHDRSSNAEAIQVMKVGKVFKHPNYNGYTINNDILLIKLASPAQMGMRVSPV
CVAETADNFPGGMRCVTSGWGLTRYNAPDTPALLQQASLPLLTNEQCRQYWGSKISNLMICAGASGASSC
MGDSGGPLVCEKAGAWTLVGIVSWGSGTCTPTMPGVYARVTELRAWMDQIIAAN
>gi|Zebrafish [Danio rerio]
WLLSCVAFFSAAYGCGVPAIPPVVSGYARIVNGEEAVPHSWPWQVSLQDFTGFHFCGGSLINEFWVVTAA
HCSVRTSHRVILGEHNKGKSNTQEDIQTMKVSKVFTHPQYNSNTIENDIALVKLTAPASLNAHVSPVCLA
EASDNFASGMTCVTSGWGVTRYNALFTPDELQQVALPLLSNEDCKNHWGSNIRDTMICAGAAGASSCMGD
SGGPLVCQKDNIWTLVGIVSWGSSRCDPTMPGVYGRVTELRDWVDQILASN
>gi|PRSS2 protein [Bos taurus]
MHSLLILAFVGAAVAFPSDDDDKIVGGYTCAENSVPYQVSLNAGYHFCRGSLINDQWVVSAAHCYQYHIQ
VRLGEYNIDVLEGGEQFIDASKIIRHPKYSSWTLDNDILLIKLSTPAVINARVSTLALPSACASAGTECL
ISGWGNTLSSGVNYPDLLQCLEAPLLSHADCEASYPGEITNNMICAGFLEGGKDSCQGDSGGPVACNGQL
QGIVSWGYGCAQKGKPGVYTKVCNYVDWIQETIAANS
>gi|tryptase-III [Human]
LPVLASRAYAAPAPGQALQRVGIVGGQEAPRSKWPWQVSLRVRDRYWMHFCGGSLIHPQWVLTAAHCVGP
DVKDLAALRVQLREQHLYYQDQLLPVSRIIVHPQFYTAQIGADIALLELEEPVKVSSHVHTVTLPPASET
FPPGMPCWVTGWGDVDNDERLPPPFPLKQVKVPIMENHICDAKYHLGAYTGDDVRIVRDDMLCAGNTRRD
SCQGDSGGPLVCKVNGTWLQAGVVSWGEGCAQPNRPGIYTRVTYYLDWIHHYVPKKP
>gi|beta 1 tryptase [Gorilla gorilla]
MLNLLLLALPVLASPAYAAPAPGQALQRAGIVGGQEAPRSKWPWQVSLRVRGQYWMHFCGGSLIHPQWVL
TAAHCVGPDVKDLAALRVQLREQHLYYQDQLLPVSRIIVHPQFYTAQIGADIALLELEEPVNVSSHVHTV
TLPPASETFPPGMPCWVTGWGDVDNDERLPPPFPLKQVKVPIMENHICDAKYHLGAYTGDNVRIVRDDML
CAGNTRRDSCQGDSGGPLVCKVNGTWLQAGVVSWGEGCAQPNRPGIYTRVTYYLDWIHHYVPKKP
>gi|58257847|gb|AAW69366.1| try14 [Macaca mulatta]
MNPLLIFAFVGATVAAPFDDDDKIVGGYTCEENSLPYQVSLNSGSHFCGGSLINKQWVVSAAHCYKPRIQ
VRLGEHNIKVLEGNEQFIHAAKIIRHPKYNNETLDNDIMLVKLSTPAIINARVSTISLPSALAAAGTECL
ISGWGNTLSFGADYPDELQCLDAPVLTQAKCEASYPGKITSNMFCVGFLEGGKDSCQRDSGGPVVCNGQL
QGVVSWGYGCARKNRPGVYTKVYNYVDWIRDTIAANS
>pdb|1DDJ|PLASMINOGEN
SFDCGKPQVEPKKCPGRVVGGCVAHPHSWPWQVSLRTRFGMHFCGGTLISPEWVLTAAHCLEKSPRPSSY
KVILGAHQEVNLEPHVQEIEVSRLFLEPTRKDIALLKLSSPAVITDKVIPACLPSPNYVVADRTECFITG
WGETQGTFGAGLLKEAQLPVIENKVCNRYEFLNGRVQSTELCAGHLAGGTDSCQGDAGGPLVCFEKDKYI
LQGVTSWGLGCARPNKPGVYVRVSRFVTWIEGVMRNN
>|Mast cell protease6
MLKLLLLLALSPLASLVHAAPCPVKQRVGIVGGREASESKWPWQVSLRFKFSFWMHFCGGSLIHPQWVLT
AAHCVGLHIKSPELFRVQLREQYLYYADQLLTVNRTVVHPHYYTVEDGADIALLELENPVNVSTHIHPTS
LPPASETFPSGTSCWVTGWGDIDSDEPLLPPYPLKQVKVPIVENSLCDRKYHTGLYTGDDVPIVQDGMLC
AGNTRSDSCQGDSGGPLVCKVKGTWLQAGVVSWGEGCAEANRPGIYTRVTYYLDWIHRYVPQRS
CLUSTAL W results
JAL view result
Similar results were found when the output order parameter was set to “aligned” instead of
“input”.
2.3.4.1 Software/Tools
CLUSTALW
Multiple sequence comparisons help highlight weak sequence similarity and shed
light on structure, function, or origin. The most widely used programs for global
multiple sequence alignment are from the Clustal series of programs.
CLUSTALW and CLUSTALX are progressive alignment programs that follow the
steps listed in Section 2.3.3.2.1.
ClustalW is used to align DNA or protein sequences in order to elucidate their relatedness as
well as their evolutionary origin.
JalView
Jalview is a multiple alignment editor written entirely in Java. It was initially intended
as a visualization tool for the Pfam CORBA server and client at the EBI but is available
as a general purpose alignment editor. It is used widely in a variety of web pages (e.g.
the EBI ClustalW server and the Pfam protein domain database). Jalview is also a
phylogenetic tree drawing program. Phylogenetic
relationships are patterns of shared common history between biological replicators.
2.3.4.2 Methods
Using CLUSTALW
1. Starting with the EBI home page (http://www.ebi.ac.uk), European Bioinformatics Institute
was selected.
4. The following parameters were selected from Output and Phylogenetic tree:
• TREE TYPE: nj
• CORRECTDISTANCE: on
• IGNORE GAPS: on
5. The multiple sequence alignment result previously obtained from CLUSTALW was
pasted.
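The distance options above can be illustrated with a short sketch. It assumes that the distance correction for proteins is Kimura's formula, d = -ln(1 - p - p^2/5), which is how this option is commonly described; the function names are hypothetical. Gap handling is simplified here: only columns where either of the two sequences has a gap are skipped, whereas ignoring gaps over a whole alignment would drop every column that contains a gap in any sequence.

```python
import math


def p_distance(a, b):
    """Fraction of differing residues between two aligned, gapped sequences."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    differences = sum(1 for x, y in pairs if x != y)
    return differences / len(pairs)


def kimura_distance(p):
    """Kimura's correction for multiple substitutions at the same site."""
    inner = 1.0 - p - p * p / 5.0
    if inner <= 0:
        raise ValueError("sequences too divergent for this correction")
    return -math.log(inner)


p = p_distance("AG-KG", "AGTKA")   # one difference in four ungapped columns
print(p, kimura_distance(p))
```

The corrected distance is always at least as large as the observed fraction of differences, and a matrix of such distances is the input that a neighbor-joining (nj) tree is built from.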
Using JalView
1. Starting with the EBI home page (http://www.ebi.ac.uk), European Bioinformatics
Institute was selected.
3. From the tools available, “CLUSTALW” was selected for multiple sequence
alignment.
4. The parameters chosen were: “Full” for Alignment, “Blosum” for Matrix and “Input”
for Output Order.
Parameters: Alignment: “Fast”; Matrix: “Blosum”; Output Order: “Input”.
2.3.4.3 Results
Cladogram
Phylogram
Fig 2.11: Phylogenetic tree (phylogram) from homologous sequences of p00766
2.3.5 Secondary Structure Prediction
The secondary structure of a protein depends on its primary sequence. Several software tools can
be used to predict secondary structure. Some of them are described below:
2.3.5.1.1 PSI-Pred
PSIPRED is a software tool provided by University College London (UCL). It is widely used
to predict secondary structure from sequence. The PSIPRED protein structure prediction server allows
one to submit a protein sequence, perform a prediction of one’s choice and receive the results of the
prediction via e-mail. PSIPRED is a simple and reliable secondary structure prediction method,
incorporating two feed-forward neural networks which perform an analysis on output obtained from
PSI-BLAST (Position Specific Iterated BLAST). It is a highly accurate method for protein secondary
structure prediction.
2.3.5.1.2 Neural Network
A Neural Network (NN) is a special type of problem-solving algorithm based on the parallel
architecture of complex animal neuronal organization. It is distinct from the Hidden Markov Model,
another statistical method used in sequence analysis. A neural network simulates the human learning
process by mimicking the networked organization of neurons and synapses. A single neuron, in the
computational scheme, is a node in a directed graph, with one or more entering connections
designated as inputs and a single leaving connection called the output. To form a network, several
neurons are assembled and the outputs of some are connected to the inputs of others. Some nodes
contain connections that provide input to the entire network; some deliver output information from
the network to the outside world; and others, which do not interact directly with the outside, form
the “hidden” layers.
Applying this to the interpretation of genotypic information, neural networks are trained using a large
database of input (genotype and treatment) data and output (drug response) data. The model is then
tested on a testing set of input and output data to see how accurate it is.
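The neuron and network described above can be sketched in a few lines. This is a generic illustration and not PSIPRED's implementation: the function names, the sigmoid activation and all weight values are assumptions made for the example.

```python
import math


def neuron(inputs, weights, bias):
    """One node: weighted sum of its inputs passed through a sigmoid."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-total))


def feed_forward(inputs, hidden_layer, output_layer):
    """Input values -> hidden nodes -> output nodes.

    Each layer is a list of (weights, bias) pairs, one pair per neuron.
    """
    hidden = [neuron(inputs, w, b) for w, b in hidden_layer]
    return [neuron(hidden, w, b) for w, b in output_layer]


# Two inputs, two hidden neurons, three outputs (for example one each
# for helix, strand and coil); the weights are arbitrary.
hidden_layer = [([0.5, -0.4], 0.1), ([-0.3, 0.8], 0.0)]
output_layer = [([1.0, -1.0], 0.0), ([-1.0, 1.0], 0.0), ([0.2, 0.2], -0.1)]
print(feed_forward([1.0, 0.0], hidden_layer, output_layer))
```

Training consists of adjusting the weights and biases until the outputs match known examples, after which the network is checked against a separate test set.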
perceptrons or radial basis function networks. Decision trees, hierarchical self-organizing maps,
hierarchies of experts, hierarchical or tree-based classifiers are typical applications for hierarchical
neural networks.
2.3.5.2 Methods
Using PSIPRED:
1. Starting with the Bioinformatics Unit page (http://bioinf.cs.ucl.ac.uk), “Secondary
structure prediction (PSIPRED)” was selected from “Protein Structure Prediction.”
Using HNN
(Screenshot steps: sequence pasted, “Submit” clicked, result saved.)
2.3.5.3 Results
PSIPRED Result
On Tue, 20 Jan 2009 09:03:36 GMT, "Apache" <[email protected]> said:
PSIPRED PREDICTION RESULTS
Key
Conf: Confidence (0=low, 9=high)
Pred: Predicted secondary structure (H=helix, E=strand, C=coil)
AA: Target sequence
PSIPRED HFORMAT (PSIPRED V2.6 by David Jones)
Conf: 998765777799997599999999987868999938997897079940998998841313
Pred: CCCCCCCCCCCCCCCEECCEECCCCCCCCEEEEEECCCCEEEEEEEECCCEEEECHHHCC
  AA: CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV
              10        20        30        40        50        60
Conf: 787579999626553799748997889998999888888785199997888757697801
Pred: CCCCEEEEEEEECCCCCCCCEEEEEEEEEECCCCCCCCCCCCEEEEEECCCCCCCCCEEC
  AA: TTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSA
              70        80        90       100       110       120
Conf: 687998776999898999828854678999988635999787299999987088799887
Pred: CCCCCCCCCCCCCCEEEEEECCCCCCCCCCCCCEEEEEEEEECCHHHHHHHCCCCCCCCE
  AA: VCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM
             130       140       150       160       170       180
Conf: 974899833677888994777119989999999860688879988499887997999999
Pred: EECCCCCCCCCCCCCCCEEEECCCCEEEEEEEEEEECCCCCCCCCEEEEEHHHHHHHHHH
  AA: ICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQ
             190       200       210       220       230       240
Conf: 98559
Pred: HHHCC
  AA: TLAAN
Calculate PostScript, PDF and JPEG graphical output for this result
using: http://bioinf3.cs.ucl.ac.uk/cgi-bin/psipred/gra/nph-view2.cgi?id=0eef479d8c802aad.psi
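The Pred lines of such a result can be summarised as percentages of helix, strand and coil. The sketch below is a minimal illustration with a hypothetical helper name, `composition`; it is applied here to the first 60 predicted states only, so its output should not be read as the composition of the whole protein.

```python
from collections import Counter


def composition(pred):
    """Percentage of helix (H), strand (E) and coil (C) in a prediction string."""
    if not pred:
        raise ValueError("empty prediction string")
    counts = Counter(pred)
    return {state: 100.0 * counts.get(state, 0) / len(pred) for state in "HEC"}


# First block of predicted states from the PSIPRED result above.
first_block = "CCCCCCCCCCCCCCCEECCEECCCCCCCCEEEEEECCCCEEEEEEEECCCEEEECHHHCC"
for state, share in composition(first_block).items():
    print(state, round(share, 1))
```

Joining all the Pred blocks into one string before calling the function gives the figures for the full sequence.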
Figure 2.15: Secondary structure by PSI-Pred.
Chapter 3:
Discussion
3.1 General
The genomic era is characterised by an enormous expansion in the
amount of biological information available in the field of
molecular biology. The greatest challenge for the molecular
biology community is to make sense of the data and to find meaningful
ways to exploit those data in practical genomics and proteomics. The
result was obvious: using computers to store, retrieve and manipulate the
data to produce meaningful information.
From the central dogma of life we know that information passes from genome to
proteome through transcription and translation. Transcription is the
process of copying DNA into mRNA, and translation is the process of decoding mRNA into
protein.
(Diagram: DNA → RNA → Protein)
The aim of our project was to become familiar with basic bioinformatics tools. The protein
we chose was chymotrypsin, one of the most thoroughly studied enzymes. The
enzyme is responsible for breaking polypeptides into smaller fragments. By definition the
enzyme belongs to the protease group of proteins. We extracted homologous protein sequences from the
database and performed pairwise and multiple sequence alignment to
understand the evolutionary relationships among the sequences.
An evolutionary tree was built on the basis of the sequence homology, identity and similarity
present in the protein sequences.
ENTREZ is a combined database and search engine composed of several linked databases.
We used the Protein database of the NCBI gateway to retrieve the chymotrypsin sequence.
We also went through other features of the database and explored the information it holds
about the protein sequence.
GenBank records are kept in flat file format, which is composed of three sets of
information:
1. Header
2. Features
3. Sequence
The Header part is composed of the following information:
• Locus
• Description
• Accession no.
• GI no.
• Version
• Source (organism)
• Organism (in detail)
• Reference (title, journal, author)
The Features part includes:
• Source
• Gene
• RNA
Lastly, the Sequence part contains the protein sequence in flat file format. For input
to BLAST, however, we needed the sequence in FASTA format, which starts with a “>” sign followed by a
short description, a line break, and then the sequence without any spaces (usually shown in a
fixed-width font such as Courier).
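The FASTA layout described above is simple enough to read and write with a few lines of code. The sketch below is illustrative only; the function names are hypothetical, and the line width of 70 residues is an assumption based on the sequence listings in this report.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: store the previous one, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.replace(" ", ""))  # sequences carry no spaces
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def to_fasta(header, sequence, width=70):
    """Format one record: '>' line, then the sequence in fixed-width lines."""
    lines = [">" + header]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


text = to_fasta("pdb|4CHA|", "CGVPAIQPVLSGL")
print(text, end="")
print(parse_fasta(text))
```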
Another important outcome of the project was becoming acquainted with the software available
through the databases. We learned to use the different BLAST programs of the NCBI gateway. Their
uses are listed below:
Apart from this, we also used other useful tools such as CLUSTALW, available through the EBI
gateway, for multiple sequence alignment.
The protein sequence was 245 residues long.
We learned how to search the database for homologous sequences using BLAST tools
such as BLASTp and PSI-BLAST.
We learned about the general physicochemical properties of the protein by using the ProtParam
software. Only the raw sequence was pasted, not the FASTA format. The software reported the
total number of negatively charged residues (Asp + Glu: 14) and the total number of positively
charged residues (Arg + Lys: 18), which relate to the theoretical pI, and the molecular weight,
which can be roughly approximated by multiplying the total number of residues by 110 Da. Other
information, such as the extinction coefficient, could also have been obtained.
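The residue counts quoted above can be checked directly from the p00766 sequence listed in Section 2.3.3.2.3. The sketch below is not ProtParam's algorithm, which works from individual residue masses and pK values; the function name is hypothetical and the mass is only the rough estimate of 110 Da per residue.

```python
P00766 = (
    "CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGE"
    "FDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSAVCLPSASDDFAAGTTCVTTG"
    "WGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLV"
    "GIVSWGSSTCSTSTPGVYARVTALVNWVQQTLAAN"
)


def basic_properties(seq, avg_residue_mass=110):
    """Length, charged-residue counts and a rough mass estimate in daltons."""
    return {
        "length": len(seq),
        "acidic": seq.count("D") + seq.count("E"),   # Asp + Glu
        "basic": seq.count("R") + seq.count("K"),    # Arg + Lys
        "approx_mass_da": len(seq) * avg_residue_mass,
    }


print(basic_properties(P00766))
```

An excess of basic over acidic residues, as counted here, is consistent with a pI above neutral, although the exact value depends on the pK values used.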
The percentage and position of alpha helices and beta sheets were predicted using different
secondary structure prediction tools, namely PSIPRED and Hierarchical Neural Network (HNN).
This gave us an idea of the secondary structure of the protein, which contained more beta-pleated
structure (around 45%) and less alpha-helical structure (around 14%).
3.4 Conclusion
As the field of molecular biology advances, thousands of new proteins are being
discovered. Sequencing unknown proteins and determining their structure therefore remain a
basic necessity for researchers. By studying the structure of a known protein, this
elementary project has prepared us to work with unknown proteins and to infer their
functions during advanced research activities.