"If You Can't Do Bioinformatics, You Can't Do Biology", J.D. Tisdall, 2003


Course: PG Pathshala-Biophysics

Paper 13: Bioinformatics


Module 1: Introduction to Bioinformatics
Content Writer: Dr. Hariprasad. G, All India Institute of Medical Sciences, New Delhi

1. Learning Objectives:
---------------------------------------------------------------------------------------------------------------

 Introduction and History of Bioinformatics
 Definition of Bioinformatics
 Work flow in Bioinformatics
 Biological Data
 Sources of Biological Data
 Drug Discovery
 Genomic data
 Proteomic data
 Applications in human health and medicine
 Flavor of Bioinformatics experimental data outcomes
___________________________________________________________________________

2. Introduction and History of Bioinformatics


"If you can’t do Bioinformatics, you can’t do Biology", J.D. Tisdall, 2003
Bioinformatics is the science of developing algorithms and analyzing biological information in
computer databases to accelerate and enhance biological research. This interdisciplinary field
draws mainly on biophysics, biochemistry, structural biology, pharmacogenomics, microbiology,
genetics, biotechnology, molecular biology, computer science, mathematics, and statistics. It can
therefore be summarized as the application of computer science to solve problems in biology for
the betterment of mankind.
It was in 1960 that protein sequence data were first managed using a high-level computer
language. Margaret O. Dayhoff was a pioneer in the field of bioinformatics. She collected all
known protein sequences in the book “Atlas of Protein Sequence and Structure”, which was
published in 1965. She and her colleagues pioneered computer-based applications and the
“Protein Information Resource”, a protein database (Hunt LT, 1983).
Often called the mother and father of bioinformatics, Dr. Margaret Belle Dayhoff (March 11, 1925 –
February 5, 1983) was an American biophysicist. She introduced mathematical and computational
methods to the field of biochemistry and dedicated her career to applying them in support of
advances in biology and medicine, most notably through the creation of protein and nucleic acid
databases and of tools to retrieve information from these databases. She originated one of the
first substitution matrices, the point accepted mutation (PAM) matrix. She also developed the
one-letter code for amino acids in order to reduce the size of the data files used to describe
amino acid sequences in an age of punch-card computing. One of Dayhoff's most important
contributions to bioinformatics was her Atlas of Protein Sequence and Structure, a book
reporting all known protein sequences (65) that she published in 1965. This led to the Protein
Information Resource database of protein sequences, the first online database system that could
be accessed by telephone line and interrogated by remote computers. The book has since been
cited nearly 4,500 times.
The term bioinformatics was first used in 1970 to refer to the study of information processes in
biological systems by Paulien Hogeweg & Ben Hesper.

 A look into important landmarks in the field of Bioinformatics


 1965 Margaret Dayhoff's Atlas of Protein Sequence and Structure
 1970 Needleman-Wunsch algorithm
 1977 DNA sequencing and software to analyze it
 1981 Smith-Waterman algorithm developed
 1981 The concept of a sequence motif
 1982 GenBank Release 3 made public
 1982 Phage lambda genome sequenced
 1983 Sequence database searching algorithm
 1985 FASTP/FASTN: fast sequence similarity searching
 1988 National Center for Biotechnology Information (NCBI) created at NIH/NLM
 1988 EMBnet network for database distribution
 1990 BLAST: fast sequence similarity searching
 1991 EST: expressed sequence tag sequencing
 1993 Sanger Centre, Hinxton, UK
 1994 EMBL European Bioinformatics Institute, Hinxton, UK
 1995 First bacterial genomes completely sequenced
 1996 Yeast genome completely sequenced
 1997 PSI-BLAST
 1998 Worm (multicellular) genome completely sequenced
 1999 Fly genome completely sequenced

3. Definition and need of Bioinformatics


Bioinformatics is the analysis of biological information using computers and statistical
techniques; the science of developing and utilizing computer databases and algorithms to
accelerate and enhance biological research.

Biological data are being produced at a phenomenal rate. On average, the public databases holding
these data are doubling in size every 15 months (Benson DA et al., 2000). Complete sequences for
over 40 organisms have been released, ranging from 450 genes to over 100,000. Add to this the data
from the myriad of related projects that study gene expression, determine the protein structures
encoded by the genes, and detail how these products interact with one another, and we can begin
to imagine the enormous quantity and variety of information that is being produced. As a result of
this surge in data, computers have become indispensable to biological research. Such an approach
is ideal because of the ease with which computers can handle large quantities of data and probe
the complex dynamics observed in nature. Bioinformatics, the subject of the current review, is
often defined as the application of computational techniques to understand and organize the
information associated with biological macromolecules. This unexpected union between the two
subjects is largely attributed to the fact that life itself is an information technology. The
most important improvements have been in CPU power, disk storage and the Internet, which have
allowed faster computations and better data storage and have revolutionised the methods for
accessing and exchanging data.
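The doubling figure quoted above can be turned into simple arithmetic. If a database doubles every 15 months, its size after t months is N(t) = N0 * 2^(t/15). The short Python sketch below only illustrates this relation; the function names are chosen here for illustration and the 15-month figure is taken from the text, not measured.

```python
import math

DOUBLING_TIME_MONTHS = 15.0  # doubling time quoted in the text


def growth_factor(months, doubling_time=DOUBLING_TIME_MONTHS):
    """Fold increase in size after `months` of steady doubling."""
    return 2.0 ** (months / doubling_time)


def months_to_reach(fold, doubling_time=DOUBLING_TIME_MONTHS):
    """Months needed to grow by a factor of `fold`."""
    return doubling_time * math.log2(fold)


if __name__ == "__main__":
    for years in (1, 5, 10):
        print(f"{years:2d} years: about {growth_factor(12 * years):.1f}-fold")
```

Under this assumption ten years correspond to eight doublings, that is, a 256-fold increase.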

“Annotation of large-scale gene sequence data will benefit from comprehensive and consistent
application of well-documented, standard analysis methods and from progressive and vigilant
efforts to ensure quality and utility and to keep the annotation up to date. However, it is
imperative to learn how to apply information derived from functional genomics and proteomics
technologies to conceptualize and explain the behaviors of biological systems”

4. Work flow in Bioinformatics


 Compilation and storage of biological data generated from across the globe in an
orderly and disciplined manner
 Analysis and interpretation of various types of biological data including nucleotide
sequences, amino acid sequences, protein domains and protein structures
 Development of new algorithms and statistics to assess biological information such as
relationships among members of large data sets
 Making sense of biological data so as to improve the understanding of organisms,
plants, animals and humans
 Integrating data in a manner that allows current biological research outcomes to be
cross-linked with those generated in the past
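As a small illustration of the analysis step in the work flow above, the following Python sketch reads nucleotide sequences in FASTA format and reports their GC content. It is a minimal example under simple assumptions (plain-text FASTA, no ambiguity codes); the sequences and record names are made up for illustration.

```python
def parse_fasta(text):
    """Return {record_name: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record name is the first word after '>'.
            name = (line[1:].split() or ["unnamed"])[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}


def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(1 for base in seq if base in "GC") / len(seq)


# Hypothetical input used only to exercise the functions.
EXAMPLE = """>seq1 made-up example
ATGCGC
GGAT
>seq2
ATATAT
"""

if __name__ == "__main__":
    for name, seq in parse_fasta(EXAMPLE).items():
        print(name, len(seq), round(gc_content(seq), 2))
```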

5. Biological Data
Biological data are being produced at an exceptional rate. The Human Genome Project resulted in a
huge amount of publicly available genomic information. A key is needed to unravel the mass of
information generated by the large-scale sequencing efforts underway in laboratories around the
world. There was therefore a pressing need for a discipline that combines a biological
background with computational skills, both for understanding human diseases and for identifying
new molecular targets for drug discovery. Bioinformatics makes use of mathematics, control
theory, information theory, computer programming and statistics to write and run software
programs whose algorithms analyze biological data and produce meaningful information.

5.1 Sources of Biological Data


 Protein database
 PDB (www.rcsb.org/pdb): A database for solved protein structure.
 UniProt (www.uniprot.org): A protein information database.
 CATH (www.cathdb.info/): It is a protein structure classification database.

 Disease database
 OMIM (www.ncbi.nlm.nih.gov/omim): A database for genetic diseases.
 IEDB (www.iedb.org): An epitope database and prediction resource.
 Metabolite database
 HMDB (www.hmdb.ca/): A database for small molecule metabolites found in
the human body.
 ECMDB (www.ecmdb.ca/): A database for metabolites found in E. coli.

 Literature database
 PubMed (www.ncbi.nlm.nih.gov/pubmed/): A literature reference database.

 Sequence database
 GenBank (www.ncbi.nlm.nih.gov/genbank/): A sequence database.
 DDBJ (www.ddbj.nig.ac.jp): A database for DNA sequences.
 EMBL (www.embl.org/): A nucleotide sequence database.

 Pathway database
 KEGG PATHWAY Database (Kyoto Encyclopedia of Genes and Genomes): A pathway and
interaction network database.
 MINT (www.mint.bio.uniroma2.it/mint/): The Molecular INTeraction database.
 BioGRID (www.thebiogrid.org/): A database for protein-protein interactions, genetic
interactions, chemical interactions, and post-translational modifications
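Records in several of the databases listed above can also be retrieved programmatically by accession number. The Python sketch below only builds download URLs and does not contact any server. The URL patterns for the RCSB file server, the UniProt REST service and the NCBI E-utilities are assumptions based on their public interfaces at the time of writing and may change; the identifiers in the usage lines are illustrative only.

```python
from urllib.parse import urlencode


def pdb_file_url(pdb_id):
    """URL of a PDB-format coordinate file (assumed RCSB download pattern)."""
    pdb_id = pdb_id.strip().upper()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError("expected a four-character PDB identifier")
    return f"https://files.rcsb.org/download/{pdb_id}.pdb"


def uniprot_fasta_url(accession):
    """URL of a UniProt entry in FASTA format (assumed REST pattern)."""
    return f"https://rest.uniprot.org/uniprotkb/{accession.strip().upper()}.fasta"


def genbank_fasta_url(accession):
    """URL of a GenBank nucleotide record in FASTA format via NCBI efetch."""
    query = urlencode({
        "db": "nuccore",
        "id": accession.strip(),
        "rettype": "fasta",
        "retmode": "text",
    })
    return f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?{query}"


if __name__ == "__main__":
    print(pdb_file_url("1crn"))
    print(uniprot_fasta_url("P69905"))
    print(genbank_fasta_url("U00096.3"))
```

Keeping URL construction separate from the download itself makes the pattern easy to check and to update if a service changes its address.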

6. Drug Discovery

Bioinformation that pertains to drug discovery comprises data that are either generated or used
in designing newer drug molecules for various diseases. A simple example is the structural and
functional characterization of the genes and proteins that cause inflammation and pain.
Structural details such as the physicochemical composition and dimensions of the amino acids
that constitute the drug target, the physical space available for the binding of the substrate
to the protein, and the charge and hydrophobic nature of the protein are important parameters
for drug development.
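Two of the parameters mentioned above, hydrophobic nature and charge, can be estimated directly from an amino acid sequence. The Python sketch below computes the mean Kyte-Doolittle hydropathy (often called GRAVY) and a crude net charge. It is a simplification: the charge estimate counts Lys and Arg as +1 and Asp and Glu as -1 and ignores histidine, the termini and pKa effects. The function names and the example peptide are illustrative.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq):
    """Mean hydropathy of a sequence; positive values are more hydrophobic."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(KYTE_DOOLITTLE[res] for res in seq) / len(seq)


def net_charge_estimate(seq):
    """Crude net charge near neutral pH: (K + R) minus (D + E)."""
    seq = seq.upper()
    positive = sum(seq.count(res) for res in "KR")
    negative = sum(seq.count(res) for res in "DE")
    return positive - negative


if __name__ == "__main__":
    peptide = "ARSAR"  # illustrative pentapeptide in one-letter code
    print(round(gravy(peptide), 2), net_charge_estimate(peptide))
```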

7. Genomic data
Genomic data are the data used in the study of genes, their functions and their interactions.
Enormous amounts of data are generated from DNA sequencing, recombinant DNA technology and
mutational studies of genes. Applying bioinformatics to a sequence in order to analyze it and
elucidate its function and structure also generates a large amount of genomic data. Genomic data
are very complex. For example, the genome of the prokaryote E. coli is a sequence of 4,639,221
base pairs. A genomic database stores sequences and subsequences along with the associated
data, such as gene ID, aliases, alternate alleles, locations, exon count and positions,
references, comments and features.
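The kind of entry described above can be pictured as a simple record type. The Python sketch below is only an illustration of how a sequence and its associated annotation might be held together; the field names, coordinates and values are hypothetical and do not follow the schema of any particular database.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class GeneRecord:
    """Hypothetical genomic database entry (not a real database schema)."""
    gene_id: str
    sequence: str
    aliases: List[str] = field(default_factory=list)
    location: Tuple[int, int] = (0, 0)  # 1-based start and end on the genome
    exons: List[Tuple[int, int]] = field(default_factory=list)  # record coordinates
    references: List[str] = field(default_factory=list)
    comments: str = ""

    @property
    def exon_count(self):
        return len(self.exons)

    def subsequence(self, start, end):
        """Bases from `start` to `end`, 1-based and inclusive."""
        if not 1 <= start <= end <= len(self.sequence):
            raise ValueError("coordinates outside the record")
        return self.sequence[start - 1:end]


if __name__ == "__main__":
    rec = GeneRecord(
        gene_id="GENE0001",  # made-up identifier
        sequence="ATGGCCATTGTAATGGGCCGC",
        aliases=["exampleA"],
        location=(100, 120),
        exons=[(1, 9), (13, 21)],
    )
    print(rec.exon_count, [rec.subsequence(s, e) for s, e in rec.exons])
```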

8. Proteomic data
A proteome is the collection of proteins of an organism's tissue at a particular time and under
a given set of physiological conditions. In compiling a proteome, proteins are identified and the
interactions between different sets of proteins or other biomolecules are studied. Such
interactions constitute the different metabolic processes that occur in cells at all times. The
analysis also covers proteins in different cell types under various pathological and
physiological conditions, their post-translational modifications that relate to different
functions, and changes in their expression pattern with time, environmental changes, genetic
factors and disease. Proteomic data are therefore highly relevant to understanding cellular
mechanisms in any biological organism, including the human body, and relate closely to
translational applications such as biomarker discovery and drug target identification. In terms
of information, proteomic experiments generate gel-based and non-gel-based differential
expression profiles, protein identifications, sequence analyses and annotations.

9. Applications in human health and medicine

 Diagnose and treat genetic disorders better
 Predict the vulnerability of people to diseases and patients to complications
 Understand cellular events in health and disease states
 Develop targeted therapies for conditions such as cancer
 Develop more potent drugs for various pathological states

 Develop diagnostics with higher sensitivity and specificity

10. Flavor of Bioinformatics experimental data outcomes


 Protein structural analysis
The cDNA sequence encoding the mature portion (103 amino acids) of the group PLA2 of
Heterometrus fulvipes was deduced. This sequence has 40% identity with bee venom PLA2, whose
crystal structure was known. Based on the details of this crystal structure, homology modeling
was done and the Hf-PLA2 structure was analyzed. The tertiary structure of the enzyme comprises
three helices connected by loops, with a well-defined calcium binding loop and a channel for the
binding of the substrate for catalysis. The replacement of the aspartic acid residue by glutamic
acid in the well-known histidine-aspartic acid dyad is an interesting and rare feature
(Hariprasad et al., 2007).
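A figure such as the 40% identity quoted above is obtained by counting identical residues in a pairwise alignment. The Python sketch below shows one common way of doing this for two sequences that are already aligned. It is an illustration only: conventions for the denominator differ between programs, and here only columns without a gap are counted. The aligned strings in the usage line are made up.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over gap-free alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for res_a, res_b in zip(aligned_a.upper(), aligned_b.upper()):
        if res_a == gap or res_b == gap:
            continue  # skip columns that contain a gap
        compared += 1
        if res_a == res_b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    # Made-up aligned fragments, not the sequences discussed in the text.
    print(round(percent_identity("ACDE-FG", "ACEEHFG"), 1))
```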

 Indole based molecules as potent inhibitors of inflammation drug target, PLA2


Group III phospholipase A(2) is a known mediator of inflammation, atherosclerosis and cancer
in mammals. The availability of the human group III phospholipase A(2) (hIIIPLA(2)) amino
acid sequence offers an opportunity to study its structural features by modeling. The overall
structure comprises three α-helices, a β-wing and the calcium binding loop, which is present
at the N-terminus of the enzyme. The unique structural features of hIIIPLA(2) in comparison
with the other well-known group I/II PLA(2)s are: (1) the replacement of the 'conserved'
tyrosine residue by phenylalanine at position 87 in the active site; (2) a decrease in the
volume of the substrate binding hydrophobic channel; and (3) the presence of a C-terminal
extension that lies in close proximity to the third helix. Docking studies of the enzyme with
small molecules give a detailed insight into the participating residues of the enzyme and into
the binding affinities of the molecules, which are predicted to range from micromolar to
nanomolar, thereby making them either potential lead molecules or potent drugs. (Hariprasad et
al., 2010)
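To give a feel for what micromolar and nanomolar affinities mean energetically, a dissociation constant Kd can be converted into a standard binding free energy with ΔG = RT ln(Kd), where Kd is expressed relative to 1 M. The Python sketch below applies this textbook relation. It is not the scoring function of any docking program, and the temperature of 298.15 K is an assumption.

```python
import math

R_KCAL_PER_MOL_K = 1.987204e-3  # gas constant in kcal/(mol*K)


def binding_free_energy(kd_molar, temperature_k=298.15):
    """Standard binding free energy in kcal/mol from Kd in mol/L."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL_PER_MOL_K * temperature_k * math.log(kd_molar)


if __name__ == "__main__":
    for label, kd in (("1 micromolar", 1e-6), ("1 nanomolar", 1e-9)):
        print(f"{label}: {binding_free_energy(kd):.1f} kcal/mol")
```

At room temperature each tenfold improvement in Kd corresponds to roughly 1.4 kcal/mol of additional binding free energy.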

 Peptide Arg-Ala-Ser-Ala-Arg as a potent inhibitor of inflammation drug target, PLA2
The group III phospholipase A(2) transcript from Mesobuthus tamulus (Indian red scorpion)
codes for three distinct products: a large enzymatic subunit, a pentameric peptide and a small
non-enzymatic subunit. Interestingly, there is a conservative mutation of the conserved
aspartic acid, a classical participant in catalysis in this enzyme family, to glutamic acid.
However, the side chain oxygen atoms of this glutamate are oriented away from the catalytic
histidine, implying that this residue does not participate in stabilizing the tautomeric
conformation of the histidine. The acidic non-enzymatic subunit is rich in hydrophobic
residues and adopts an anti-parallel β-sheet conformation, making it ideal for
tissue-specific targeting. The native pentapeptide with the sequence
Alanine-Arginine-Serine-Alanine-Arginine was docked to the enzymatic subunit. The peptide
ligand occupies the hydrophobic cavity and makes a plethora of interactions with the residues
in the channel. This makes the ligand a potential reversible inhibitor, ideal for preventing
the enzyme from interacting with non-specific molecules en route to the target. (Hariprasad et
al., 2011)

 Structural basis for nephrotoxicity caused by Gentamicin


Gentamicin is a member of the aminoglycoside group of broad-spectrum antibiotics. It impairs
protein synthesis by binding to the A site of the 30S subunit of bacterial ribosomes. One of the
main side effects of this drug is nephrotoxicity. The drug is known to bind to calreticulin, a
chaperone essential for the folding of glycosylated proteins. The study provides a detailed
structural insight into the calreticulin-gentamicin complex by molecular modeling, and the
binding of the drug in the presence of explicit solvent was analyzed by molecular dynamics
simulation. The gentamicin molecule binds to the lectin site of calreticulin and lies in the
concave channel formed by the long beta sheets. These details therefore strongly implicate
gentamicin as a competitive inhibitor of sugar binding to calreticulin. (Hariprasad et al.,
2012)

 Structural basis for the classification of group III PLA2


The group III PLA2 enzymes are present in a wide array of organisms across many species
with completely different functions. A detailed analysis of the structure of these enzymes and
of the evolutionary proximity amongst them was carried out for a meaningful classification of
this group. In addition to the conservation of the calcium binding motif and the catalytic
histidine, the sequences exhibit specific 'amino acid signatures'. Structural analysis reveals
that these enzymes have a conserved globular structure, with species-specific variations seen at
the active site, calcium binding loop, hydrophobic channel, C-terminal domain and quaternary
conformational state. This classification also helps in understanding structure-function
relationships and enzyme-substrate specificity, and in designing potent inhibitors against the
drug target isoforms. (Hariprasad et al., 2013)

 Variations in Phospholipase A2 of Clonorchis sinensis parasite and human host


Hepatic fibrosis is a common complication of infection by the parasite Clonorchis
sinensis. A secretory phospholipase A2 enzyme from the parasite is implicated in the
pathology. The active site of the enzyme shows the classical features of PLA(2), with three
residues, histidine, aspartic acid and tyrosine, participating in hydrogen bond formation.
This is an interesting variation from the housekeeping group III PLA(2) enzyme of humans,
which has a histidine, aspartic acid and phenylalanine arrangement at the active site. This
difference is therefore an important structural parameter that can be exploited to design
specific inhibitor molecules against the pathogen PLA(2) (Hariprasad et al., 2012).

 Pathway analysis of differentially expressed proteins in responders and non-responders
to combination chemotherapy of paclitaxel and carboplatin
Conventional treatment for advanced ovarian cancer is initial debulking surgery followed
by combination chemotherapy with carboplatin and paclitaxel. Despite a high initial response,
three-fourths of these women experience disease recurrence with a dismal prognosis. Patients
with advanced-stage ovarian cancer who underwent cytoreductive surgery were enrolled and
tissue samples were collected. Fluorescence-based differential in-gel expression coupled with
mass spectrometric analysis was used for the discovery phase of the experiments, and pathway
analysis was performed for expression and functional validation of the differentially expressed
proteins. The expression of some of these proteins correlated with increased apoptotic activity
in responders and decreased apoptotic activity in non-responders. Therefore, the proteins
qualify as potential biomarkers to predict chemotherapy response (Sehrawat et al., 2016).

 Potent inhibitor molecules against different isoforms of drug target PLA2
implicated in cholangiocarcinoma
Clonorchis sinensis, the Chinese liver fluke, is one of the most prevalent parasites, affecting
a large population in the oriental countries. The parasite lacks lipid generating mechanisms but
is exposed to fatty acid rich bile in the liver. A secretory phospholipase A2, an enzyme that
breaks down complex lipids, is important for the growth of the parasite. The five isoforms of
this enzyme from the parasite therefore qualify as potential drug targets. In this study, a
detailed structural and ligand binding analysis of the isoforms was done by modeling. The most
significant feature pertains to the catalytic site, where the isoforms exhibit one of three
variations: histidine-aspartate-tyrosine, histidine-glutamate-tyrosine or
histidine-aspartate-phenylalanine. The molecular diversity of the parasitic PLA2 described in
this study provides a platform for personalized medicine in the therapeutics of clonorchiasis
(Hariprasad et al., 2014).

 Molecular modeling of Gly80 and Ser80 variants of human group IID phospholipase
A2 and their receptor complexes: potential basis for weight loss in COPD
Weight loss is a well-known systemic manifestation of Chronic Obstructive Pulmonary Disease.
A Gly80Ser mutation in human group IID secretory phospholipase A2 enhances the expression of
cytokines that are responsible for weight loss. This study seeks to establish the structural
correlation of the wild type and the aforesaid mutation with function. Secretory phospholipase
A2 with glycine and with serine at position 80, and the M-type receptor, were modelled. Major
structural differences between the wild type and mutant enzymes are observed locally at the
site of mutation and in the global conformations. They are: (1) loop L3 between H2 and H3,
which bears residue Gly80 in the wild type, is in a closed conformation with respect to the
channel opening, while in the mutant enzyme it adopts a relatively open conformation; (2) the
mutant enzyme is less compact and has a higher solvent accessible surface area; and (3) the
interfacial binding contact surface area is larger, and the quality of interactions with the
receptor is better, in the mutant enzyme than in the wild type. Therefore, the structural
differences delineated in this study are potential biophysical factors that determine the
increased potency of the mutant enzyme with the macrophage receptor for cytokine secreting
function, resulting in exacerbation of cachexia. (Imran et al., 2016)

11. Bioinformatics: Course content


1. Introduction to Bioinformatics
2. Biological databases
3. Tools used in bioinformatics
4. Genome analysis
5. Next Generation Sequencing
6. Database Mining
7. Protein Sequence analysis
8. Phylogenetic analysis
9. Protein secondary structure predictions
10. Protein tertiary structure prediction
11. Protein Structure validation tools
12. In-silico drug screening
13. Docking to drug targets
14. Functional Genomics
15. Systems Biology
16. Programming in Bioinformatics

CONCLUSIONS & SUMMARY

1. Over the last few decades, a vast amount of research has produced large amounts of
biological data in the fields of genomics, transcriptomics and proteomics.
2. Bioinformatics has emerged as a field of science that compiles and integrates these data
and makes them readily available to all.
3. It interlinks the subjects of biophysics, biochemistry, structural biology, molecular biology,
microbiology, computational biology and statistics.
4. Bioinformatics has been useful for better understanding the innumerable living organisms in
this world.
5. Bioinformatics has had a great impact in the field of medicine and human health by aiding
drug design, biomarker discovery and molecular mapping of microbes.
