"If You Can't Do Bioinformatics, You Can't Do Biology", J.D. Tisdall, 2003
"If You Can't Do Bioinformatics, You Can't Do Biology", J.D. Tisdall, 2003
"If You Can't Do Bioinformatics, You Can't Do Biology", J.D. Tisdall, 2003
1. Learning Objectives:
---------------------------------------------------------------------------------------------------------------
Genomic data
Proteomic data
Applications in human health and medicine
Flavor of Bioinformatics experimental data outcomes
___________________________________________________________________________
known protein sequences in the book “Atlas of Protein Sequence and Structure”, which was
published in 1965. She and her colleagues pioneered computer-based applications and the
“Protein Information Resource”, a protein database (Hunt LT., 1983).
Dr. Margaret Belle Dayhoff (March 11, 1925 – February 5, 1983), often described as the mother
and father of bioinformatics, was an American biophysicist and a pioneer in the field. She
pioneered the application of mathematics and computational methods to biochemistry, and
dedicated her career to applying computational methods to support advances in biology and
medicine, most notably the creation of protein and nucleic acid databases and tools to
retrieve information from these databases. She originated one of the first substitution
matrices, the point accepted mutation (PAM) matrix. She developed the one-letter code for
amino acids in order to reduce the size of the data files used to describe amino acid
sequences in an age of punch-card computing. One of Dayhoff's most important contributions to
bioinformatics was her Atlas of Protein Sequence and Structure, a book reporting all known
protein sequences (65) that she published in 1965. This led to the Protein Information
Resource database of protein sequences, the first online database system that could be accessed
by telephone line and interrogated by remote computers. The book has since been
cited nearly 4,500 times.
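The space saving of the one-letter code can be illustrated with a minimal Python sketch. The
mapping below is the standard three-letter to one-letter amino acid code; the function name
to_one_letter and the hyphen-separated input format are assumptions made for this illustration
only.

```python
# Standard three-letter to one-letter amino acid codes (20 common residues).
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(peptide: str) -> str:
    """Convert a hyphen-separated three-letter peptide to one-letter code.

    Hypothetical helper: raises KeyError for residues outside the table.
    """
    return "".join(THREE_TO_ONE[res.capitalize()] for res in peptide.split("-"))


# Illustrative input: a pentapeptide written in three-letter code.
long_form = "Arg-Ala-Ser-Ala-Arg"
short_form = to_one_letter(long_form)
print(short_form)                       # RASAR
print(len(long_form), len(short_form))  # 19 5
```

The same five residues take 19 characters in three-letter notation and 5 in one-letter
notation, which is the kind of reduction that mattered on punch cards.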
The term bioinformatics was first used in 1970 to refer to the study of information processes in
biological systems by Paulien Hogeweg & Ben Hesper.
1990 BLAST: fast sequence similarity searching
1991 EST: expressed sequence tag sequencing
1993 Sanger Centre, Hinxton, UK
1994 EMBL European Bioinformatics Institute, Hinxton, UK
1995 First bacterial genomes completely sequenced
Biological data are being produced at a phenomenal rate. On average, public sequence databases
are doubling in size every 15 months (Benson DA. et al., 2000). Complete sequences for over 40
organisms have been released, ranging from 450 genes to over 100,000. Add to this the data from
the myriad of related projects that study gene expression, determine the protein structures
encoded by the genes, and detail how these products interact with one another, and we can begin
to imagine the enormous quantity and variety of information that is being produced. As a result of
this surge in data, computers have become indispensable to biological research. A computational
approach is ideal because of the ease with which computers can handle large quantities of data
and probe the complex dynamics observed in nature. Bioinformatics, the subject of the current
review, is often defined as the application of computational techniques to understand and
organize the information associated with biological macromolecules. This unexpected union of
biology and computing is largely attributed to the fact that life itself is an information
technology. The most important improvements have been in CPU speed, disk storage and the
Internet, which have allowed faster computations and better data storage, and have
revolutionised the methods for accessing and exchanging data.
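To give a feel for what a 15-month doubling time means, the following Python sketch computes
the growth factor implied by steady exponential growth. The function name growth_factor and
the assumption that the doubling time stays constant are illustrative only; real databases do
not grow at a perfectly fixed rate.

```python
def growth_factor(months: float, doubling_time_months: float = 15.0) -> float:
    """Factor by which a quantity grows after `months`, assuming it doubles
    every `doubling_time_months` (a constant-rate idealisation)."""
    return 2.0 ** (months / doubling_time_months)


print(growth_factor(15))  # 2.0  (one doubling)
print(growth_factor(60))  # 16.0 (four doublings in five years)
```

Under this idealisation, a database would be sixteen times larger after five years.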
“Annotation of large-scale gene sequence data will benefit from comprehensive and consistent
application of well-documented, standard analysis methods and from progressive and vigilant
efforts to ensure quality and utility and to keep the annotation up to date. However, it is
imperative to learn how to apply information derived from functional genomics and proteomics
technologies to conceptualize and explain the behaviors of biological systems”
5. Biological Data
Biological data are being produced at an exceptional rate. The Human Genome Project resulted in a
huge amount of publicly available genomic information. There needs to be some key to
unraveling the mass of information generated by large-scale sequencing efforts underway in
laboratories around the world. This created a pressing need for a discipline that combines a
biological background with computational skills, both to understand human diseases and to
identify new molecular targets for drug discovery. Bioinformatics draws on mathematics,
control theory, information theory, computer programming and statistics to write and run
software programs whose algorithms analyse biological data and produce meaningful
information.
Disease database
OMIM (www.ncbi.nlm.nih.gov/omim): A database for genetic diseases.
IEDB (www.iedb.org): An epitope database and prediction resource.
Metabolite database
HMDB (www.hmdb.ca/): A database for small molecule metabolites found in
the human body.
ECMDB (www.ecmdb.ca/): A database for metabolites found in E. coli.
Literature database
PubMed (www.ncbi.nlm.nih.gov/pubmed/): A literature reference database.
Sequence database
GenBank (www.ncbi.nlm.nih.gov/genbank/): A sequence database.
DDBJ (www.ddbj.nig.ac.jp): A database for DNA sequences.
EMBL (www.embl.org/): A nucleotide sequence database.
Pathway database
KEGG PATHWAY Database (Kyoto Encyclopedia of Genes and Genomes): An
interaction network database.
MINT (www.mint.bio.uniroma2.it/mint/): The Molecular INTeraction database.
BioGRID (www.thebiogrid.org/): A database for protein-protein interactions, genetic
interactions, chemical interactions, and post-translational modifications.
6. Drug Discovery
Biological information that pertains to drug discovery involves data that are either generated
or used in designing new drug molecules for various diseases. A simple example is the
structural and functional characterization of the genes and proteins that cause
inflammation and pain. Structural details such as the physico-chemical composition and
dimensions of the amino acids that constitute the drug target, the physical space
available for binding of the substrate to the protein, and the charge and hydrophobic nature of
the protein are important parameters for drug development.
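Two of the parameters mentioned above, charge and hydrophobic nature, can be roughly estimated
from sequence alone. The Python sketch below is a simplified illustration, not a method taken
from this text: it uses the Kyte-Doolittle hydropathy scale to compute an average hydropathy
(often called GRAVY), and a crude net charge that counts Lys and Arg as +1 and Asp and Glu as
-1, ignoring histidine, the termini and pH effects. The function names and the example
sequence are assumptions.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(seq: str) -> float:
    """Average Kyte-Doolittle hydropathy over a one-letter sequence."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq.upper()) / len(seq)


def crude_net_charge(seq: str) -> int:
    """Very rough net charge near neutral pH: K/R = +1, D/E = -1.
    Simplification: histidine, termini and pKa shifts are ignored."""
    seq = seq.upper()
    positive = seq.count("K") + seq.count("R")
    negative = seq.count("D") + seq.count("E")
    return positive - negative


peptide = "RASAR"  # illustrative one-letter sequence
print(round(mean_hydropathy(peptide), 2))  # -1.24
print(crude_net_charge(peptide))           # 2
```

A negative average hydropathy indicates an overall hydrophilic sequence. Real drug-design
workflows use three-dimensional structure rather than such sequence-only estimates.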
7. Genomic data
Genomic data refers to the data used in the study of genes, their functions and their interactions.
Enormous amounts of data are generated from DNA sequencing, recombinant DNA technology and
mutational studies of genes. Applying bioinformatics to a sequence to analyse and
elucidate its function and structure also generates a large amount of genomic data. Genomic data
are very complex. For example, the genome of a prokaryote such as E. coli is a sequence of
4,639,221 base pairs. A genomic database stores sequences and subsequences along with
associated data such as gene ID, aliases, alternate alleles, locations, exon count and position,
references, comments and features.
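The kind of record such a database holds can be sketched as a small Python data structure. This
is a hypothetical layout for illustration only and does not follow the schema of any real
database; the class name GeneRecord, its field names and the placeholder values are all
assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GeneRecord:
    """Hypothetical genomic database entry: a sequence plus its annotations."""
    gene_id: str
    sequence: str
    aliases: List[str] = field(default_factory=list)
    location: str = ""            # e.g. chromosome and coordinates
    exon_count: int = 0
    references: List[str] = field(default_factory=list)
    comments: str = ""

    def length(self) -> int:
        """Sequence length in bases."""
        return len(self.sequence)

    def subsequence(self, start: int, end: int) -> str:
        """Return bases start..end, 1-based and inclusive."""
        return self.sequence[start - 1:end]


# Placeholder values, not a real gene.
record = GeneRecord(gene_id="GENE0001", sequence="ATGGCCTAA", exon_count=1)
print(record.length())           # 9
print(record.subsequence(1, 3))  # ATG
```

Keeping the annotations in the same record as the sequence is what allows a query such as "all
genes with more than ten exons" to be answered without re-analysing the raw sequence.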
8. Proteomic data
A proteome is the collection of proteins in an organism’s tissue at a particular time and under a
given set of physiological conditions. In compiling a proteome, proteins are identified
and the interactions between different sets of proteins or other biomolecules are studied. Such
interactions constitute the metabolic processes that occur in cells at all times. The analysis
also covers proteins in different cell types under various pathological and physiological
conditions, their post-translational modifications that relate to different functions, and
changes in their expression pattern with time, environmental changes, genetic factors and
disease. Proteomic data are therefore highly relevant to understanding cellular mechanisms in
any organism, including the human body, and relate closely to translational applications such
as biomarker discovery and drug target identification. In terms of information, proteomic
experiments generate gel-based and non-gel-based differential protein expression profiles,
protein identifications, sequence analyses and annotation.
Diagnose and treat genetic disorders better
Predict the vulnerability of people to diseases and patients to complications
Understand cellular events in health and disease states
Develop targeted therapies for conditions such as cancer
Develop more potent drugs for various pathological states
Peptide Arg-Ala-Ser-Ala-Arg as a potent inhibitor of inflammation drug target, PLA2
The group III phospholipase A(2) enzyme transcript from Mesobuthus tamulus (Indian red
scorpion) codes for three distinct products: a large enzymatic subunit, a pentameric
peptide and a small non-enzymatic subunit. Interestingly, there is a conservative mutation of
the conserved aspartic acid, a classical participant in catalysis in this enzyme family, to
glutamic acid. However, the side chain oxygen atoms of this glutamate are oriented away from
the catalytic histidine, implying that this residue does not participate in stabilizing the
tautomeric conformation of the histidine. The acidic non-enzymatic subunit comprises
extensive hydrophobic residues in an anti-parallel β-sheet conformation, making it
ideal for tissue-specific targeting. The native pentapeptide with the sequence Alanine-
Arginine-Serine-Alanine-Arginine was docked to the enzymatic subunit. The peptide ligand
occupies the hydrophobic cavity and makes a plethora of interactions with the residues in the
channel. This makes the ligand a potential reversible inhibitor, ideal to prevent the enzyme
from interacting with non-specific molecules en route to the target. (Hariprasad et al., 2011)
enzymes have a conserved globular structure with species-specific variations seen at the active
site, calcium binding loop, hydrophobic channel, the C-terminal domain and the quaternary
conformational state. This classification also helps to understand structure-function
relationships and enzyme-substrate specificity, and to design potent inhibitors against the
drug target isoforms. (Hariprasad et al., 2013)
is exposed to fatty acid rich bile in the liver. A secretory phospholipase A2, an enzyme that
breaks down complex lipids, is important for the growth of the parasite. The five isoforms of
this particular enzyme from the parasite therefore qualify as potential drug targets. In this
study, a detailed structural and ligand binding analysis of the isoforms has been done by
modeling. The most significant feature pertains to the catalytic site, where the isoforms exhibit
three variations: histidine-aspartate-tyrosine, histidine-glutamate-tyrosine or
histidine-aspartate-phenylalanine. The molecular diversity of the parasitic PLA2 described in
this study provides a platform for personalized medicine in the therapeutics of clonorchiasis
(Hariprasad et al., 2014).
Molecular modeling of Gly80 and Ser80 variants of human group IID phospholipase
A2 and their receptor complexes: potential basis for weight loss in COPD
Weight loss is a well-known systemic manifestation of Chronic Obstructive Pulmonary Disease.
A Gly80Ser mutation in human group IID secretory phospholipase A2 enhances expression of
cytokines that are responsible for weight loss. In this study, we seek to establish the structural
correlation of the wild type and the aforesaid mutant with their function. Secretory
phospholipase A2 with glycine and with serine at position 80, and the M-type receptor, were
modelled. Major structural differences between wild type and mutant enzymes are observed
locally at the site of mutation and in the global conformations. They are: (1) loop-L3 between
H2 and H3 that bears residue Gly80 in the wild type is in a closed conformation with respect to
the channel opening, while in the mutant enzyme it adopts a relatively open conformation;
(2) the mutant enzyme is less compact and has a higher solvent accessible surface area; and
(3) the interfacial binding contact surface area is larger, and the quality of interactions with
the receptor is better, in the mutant enzyme as compared to the wild type. Therefore, the
structural differences delineated in this study are potential biophysical factors that determine
the increased potency of the mutant enzyme with the macrophage receptor for cytokine
secreting function, resulting in exacerbation of cachexia. (Imran et al., 2016)
5. Next Generation Sequencing
6. Database Mining
7. Protein Sequence analysis
8. Phylogenetic analysis
9. Protein secondary structure predictions
1. Over the last few decades, a vast amount of research has produced large amounts of
biological data in the fields of genomics, transcriptomics and proteomics.
2. Bioinformatics has emerged as a field of science that compiles and integrates these data
and makes them readily available to everyone.
3. It interlinks the subjects of biophysics, biochemistry, structural biology, molecular biology,
microbiology, computational biology and statistics.
4. Bioinformatics has helped us to better understand the innumerable living organisms in
this world.
5. Bioinformatics has had a great impact in the field of medicine and human health by aiding
drug design, biomarker discovery and molecular mapping of microbes.