Bioinformatics-An Introduction and Overview
Abstract: An extraordinary wealth of data is being generated by genome sequencing projects and other experimental
efforts to determine the structure and function of biological molecules. The demands and opportunities for
interpreting these data are expanding faster than ever. Bioinformatics is the science of developing and using
computational techniques, databases and algorithms to analyze biological problems and to accelerate and enhance
biological research. It is commonly described as dry-lab work that can substantially speed up wet-lab work, and it draws
on biology, informatics, statistics, biochemistry and biophysics. Bioinformatics addresses biological problems on the
basis of existing data and experimental results. It gives biologists a way to store their data, it can simplify laboratory
experiments by predicting their likely outcome, and it can suggest a starting point for new experiments, so that
researchers have an idea of what to expect before they begin. Computers have become an essential component of modern
biology. They help to manage the vast and increasing amount of biological data and continue to play an integral role in
the discovery of new biological relationships. This in silico approach to biology has helped to reshape the modern
biological sciences. As a scientific discipline, bioinformatics encompasses all aspects of biological information
acquisition, processing, storage, distribution, analysis and interpretation, and combines the tools and techniques of
biology, physics, chemistry, computer science, information technology and mathematics. It has helped to make possible
the current revolution in molecular biology. Its methods are used to predict biological outcomes before full-scale
research is undertaken, to compare biological data (for example, by sequence analysis), to predict protein structure, to
support personalized medicine in the post-genomic era, to carry out comparative genomics and predict homologs of human
genes in other species, and to annotate newly sequenced genomes. In this review, we provide an introduction to
bioinformatics and to its tools, software and databases.
I. AIMS OF BIOINFORMATICS
The aims of bioinformatics are threefold. The first aim is to organize data in a way that permits
researchers to access existing information and to submit new entries as they are produced, e.g. the Protein Data Bank for 3D
macromolecular structures [1]. While data curation is an essential task, the information stored in these databases is essentially
useless until analysed. Thus the purpose of bioinformatics extends much further. The second aim is to develop tools, software
and resources that help in the analysis of data. For example, having sequenced a particular protein, it is of interest to compare
it with previously characterized sequences. This needs more than a simple text-based search, and programs such as
FASTA [2] and PSI-BLAST [3] must consider what constitutes a biologically significant match. Development of such
resources requires expertise in computational theory as well as a thorough understanding of biology. The third aim is to use
these tools to analyze the data and interpret the results in a biologically meaningful manner. In bioinformatics, we can now
conduct global analyses of all the available data with the aim of uncovering common principles that apply across many
systems and highlighting novel features.
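To illustrate why a simple text-based search is not enough for sequence comparison, the following minimal sketch contrasts an exact substring test with a Smith-Waterman local alignment score. The scoring values (match +2, mismatch -1, gap -2) and the two peptide strings are illustrative assumptions. FASTA and PSI-BLAST rely on substitution matrices, heuristics and statistical significance estimates that are not reproduced here.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1]
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Hypothetical sequences: the subject differs from the query by one substitution (I -> L)
query = "MKTAYIAKQR"
subject = "GGMKTAYLAKQRGG"

exact_hit = query in subject                      # plain text search misses the match
score = local_alignment_score(query, subject)     # alignment still scores it highly
print(exact_hit, score)
```

The exact search fails because of a single substituted residue, while the alignment score remains close to that of a perfect match, which is the kind of tolerance a biologically meaningful comparison needs.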
[Table residue: data sources and bioinformatics topics. Molecular simulations (force field calculations, molecular
movements, docking predictions). Genomes (40 complete genomes, 1.6 million to 3 billion bases each): characterisation of
repeats; structural assignments to genes; phylogenetic analysis; genomic-scale censuses (characterisation of protein
content, metabolic pathways); linkage analysis relating specific genes to diseases. Other data.]
1. Structural Bioinformatics:
This approach helps in predicting the 3D structure of a protein from its amino acid sequence. Homology modeling is
one of the most reliable methods for predicting protein structures; it uses an already solved (for example, crystallized) protein
structure as a template. MODELLER is a widely used program for homology modeling [4]. The Protein Data Bank is the database
of 3D coordinates of proteins. Homology modeling, also known as comparative modeling, is a class of methods in protein
structure prediction for constructing an atomic-resolution model of a protein from its amino acid sequence (the "query
sequence" or "target"). Almost all homology modeling techniques rely on the identification of one or more known protein
structures likely to resemble the structure of the query sequence, and on the production of an alignment that maps residues in
the query sequence to residues in the template sequence. The sequence alignment and template structure are then used to
produce a structural model of the target. Because protein structures are more conserved than DNA sequences, detectable levels
of sequence similarity usually imply significant structural similarity.
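The alignment between target and template can be summarized by its percent sequence identity, which is one simple way to compare candidate templates. The sketch below is a minimal illustration, not part of MODELLER: the function name, the convention of counting only columns without gaps, and the two aligned strings are assumptions made for this example.

```python
def percent_identity(aligned_target, aligned_template, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    aligned = 0
    identical = 0
    for t, s in zip(aligned_target, aligned_template):
        if t == gap or s == gap:
            continue  # skip columns that contain a gap
        aligned += 1
        if t == s:
            identical += 1
    return 100.0 * identical / aligned if aligned else 0.0


# Hypothetical pairwise alignment of a target with a template of known structure
target = "MKV-LAAGIT"
template = "MKVQLSAG-T"
identity = percent_identity(target, template)
print(identity)
```

Other conventions exist (for example, dividing by the length of the shorter sequence), so identities should only be compared when they are computed the same way.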
2. Drug Designing:
Drug design is the process of finding drugs by design, based on knowledge of their biological targets. The field is a rapidly
growing area in which many successes have occurred in recent years. The explosion of genomic, proteomic, and structural
information has provided hundreds of new targets and opportunities for future drug lead discovery [5]. The process of
elucidating the atomic structure of proteins and their complexes, and of designing novel, therapeutically
relevant ligands based on these structures, is known as structure-based drug design [6]. The majority of drugs are
small molecules designed to bind to, interact with, and modulate the activity of specific biological receptors. Receptors are proteins
that bind and interact with other molecules to perform the numerous functions needed for the maintenance of life. They include
a huge array of cell-surface receptors (hormone receptors, cell-signaling receptors, neurotransmitter receptors, etc.), enzymes,
and other functional proteins. Due to genetic abnormalities, physiologic stressors, or some combination thereof, the function
of specific receptors and enzymes may become altered to the point that our well-being is diminished.
A drug target is typically a key molecule involved in a particular metabolic or signalling pathway that is specific to a
disease condition or pathology, or to the infectivity or survival of a microbial pathogen [5, 6].
Some approaches attempt to stop the functioning of the pathway in the diseased state by causing a key molecule to stop
functioning. Drugs may be designed that bind to the active region and inhibit this key molecule. However, these drugs must
also be designed so as not to affect any other important molecules that may be similar in appearance to
the key molecule. Sequence homologies are often used to identify such risks. Other approaches aim to enhance the
normal pathway by promoting specific molecules that may have been affected in the diseased
state. Computer-assisted drug design uses computational chemistry to discover, enhance, or study drugs and related
biologically active molecules.
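As a rough illustration of how sequence similarity can flag molecules that resemble the intended target, the sketch below compares shared 3-residue words (k-mers) between a target and other proteins. This alignment-free shortcut, the 0.3 threshold, the function names and all sequences are assumptions made for the example. In practice such screening is normally done with alignment tools such as BLAST.

```python
def kmers(seq, k=3):
    """Set of all overlapping words of length k in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of the k-mer sets of two sequences (0.0 to 1.0)."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def flag_off_targets(target, proteome, threshold=0.3, k=3):
    """Names of proteins whose k-mer similarity to the target meets the threshold."""
    return sorted(name for name, seq in proteome.items()
                  if jaccard(target, seq, k) >= threshold)


# Hypothetical drug target and two hypothetical host proteins
target = "MKTAYIAKQRQISF"
proteome = {
    "protA": "MKTAYIAKQRQLSF",   # close relative of the target
    "protB": "GGSPLWWNDCE",      # unrelated sequence
}
flagged = flag_off_targets(target, proteome)
print(flagged)
```

Proteins flagged in this way would be candidates for closer inspection, since a drug that binds the target may also bind a close homolog.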
3. Phylogenetics:
In biology, phylogenetics is the study of evolutionary relatedness among various groups of organisms (e.g.,
species, populations). The evolutionary history estimated from phylogenetic analysis is usually depicted as a branching,
treelike diagram that represents an estimated pedigree of the inherited relationships among molecules, organisms, or both.
The approach is sometimes called cladistics, and its goal is to infer the genetic or evolutionary relationships within a set of
organisms. Mitochondrial SNPs and microsatellites (DNA repeats) are commonly used markers in phylogenetics. MEGA [7] and
PAUP (http://paup.csit.fsu.edu/) [8] are two of the important software packages, and Maximum Parsimony and Maximum
Likelihood are commonly used methods. The term phylogenetics
is of Greek origin, from the terms phyle/phylon, meaning "tribe, race," and genetikos, meaning "relative to birth," from
genesis. Taxonomy, the classification of organisms according to similarity, has been richly informed by phylogenetics but
remains methodologically and logically distinct. The fields overlap, however, in the science of phylogenetic systematics or
cladism, where only phylogenetic trees are used to delimit taxa, each representing a group of lineage-connected individuals.
Evolution is regarded as a branching process, whereby populations are altered over time and may speciate into separate
branches, hybridize together, or terminate by extinction. This may be visualized as a multidimensional character-space that a
population moves through over time. The problem posed by phylogenetics is that genetic data are only available for the
present, and fossil records (osteometric data) are sporadic and less reliable. Our knowledge of how evolution operates is used
to reconstruct the full tree [9].
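Maximum Parsimony, mentioned above, prefers the tree that needs the fewest character changes. A minimal sketch of the counting step is given below using Fitch's algorithm on a fixed rooted binary tree. The nested-tuple tree encoding, the taxon names and the four-site alignment are assumptions for illustration. Programs such as MEGA and PAUP also search over tree topologies, which this sketch does not attempt.

```python
def fitch(tree, leaf_states):
    """Return (possible ancestral states, minimum changes) for one character.

    tree is either a leaf name (str) or a (left, right) tuple of subtrees.
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left_set, left_cost = fitch(tree[0], leaf_states)
    right_set, right_cost = fitch(tree[1], leaf_states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra change needed
        return common, left_cost + right_cost
    # Children disagree: one change is required on this branch pair
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum of Fitch scores over all alignment columns."""
    length = len(next(iter(alignment.values())))
    total = 0
    for site in range(length):
        states = {name: seq[site] for name, seq in alignment.items()}
        total += fitch(tree, states)[1]
    return total


# Hypothetical aligned sequences for four taxa
alignment = {"human": "ACGT", "chimp": "ACGT", "mouse": "ACTT", "rat": "AATT"}
tree_a = (("human", "chimp"), ("mouse", "rat"))
tree_b = (("human", "mouse"), ("chimp", "rat"))
score_a = parsimony_score(tree_a, alignment)
score_b = parsimony_score(tree_b, alignment)
print(score_a, score_b)
```

Under this criterion the tree with the lower score would be preferred, here the one grouping human with chimp and mouse with rat.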
4. Computational biology:
Computational biology is an interdisciplinary field that applies the techniques of computer science, applied
mathematics, and statistics to address problems inspired by biology. It encompasses the fields of:
Bioinformatics, which applies algorithms and statistical techniques to the interpretation, classification and
understanding of biological datasets. These typically consist of large numbers of DNA, RNA, or protein
sequences. Sequence alignment is used to assemble the datasets for analysis. Comparisons of homologous
sequences, gene finding, and prediction of gene expression are the most common techniques used on assembled
datasets; however, analysis of such datasets has many applications throughout all fields of biology [10].
Computational biomodeling, a field within biocybernetics concerned with building computational models of
biological systems.
Computational genomics, a field within genomics which studies the genomes of cells and organisms. High-throughput
genome sequencing produces large volumes of data, which require extensive post-processing (genome assembly),
and the field uses DNA microarray technologies to perform statistical analyses on the genes expressed in individual cell
types. This can help find genes of interest for certain diseases or conditions. This field also studies the
mathematical foundations of sequencing.
Molecular modeling, which consists of modelling the behaviour of molecules of biological importance.
Systems biology, which uses systems theory to model large-scale biological interaction networks (also known as
the interactome).
Protein structure prediction and structural genomics, which attempt to systematically produce accurate structural
models for three-dimensional protein structures that have not been determined experimentally.
Computational biochemistry and biophysics, which make extensive use of structural modeling and simulation
methods such as molecular dynamics and Monte Carlo method-inspired Boltzmann sampling methods in an
attempt to elucidate the kinetics and thermodynamics of protein functions.
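The Boltzmann sampling idea mentioned for computational biochemistry and biophysics can be sketched in a few lines. The example below computes Boltzmann populations for a set of state energies and the Metropolis acceptance probability for a trial move. The energy values, the temperature of 300 K and the function names are assumptions for illustration; real simulation packages add force fields, move sets and much more.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)


def boltzmann_probabilities(energies, temperature=300.0):
    """Equilibrium populations of states with the given energies (kcal/mol)."""
    beta = 1.0 / (K_B * temperature)
    e_min = min(energies)  # shift energies for numerical stability
    weights = [math.exp(-beta * (e - e_min)) for e in energies]
    z = sum(weights)       # partition function of the shifted energies
    return [w / z for w in weights]


def metropolis_acceptance(e_old, e_new, temperature=300.0):
    """Probability of accepting a move from e_old to e_new (Metropolis criterion)."""
    if e_new <= e_old:
        return 1.0  # downhill moves are always accepted
    return math.exp(-(e_new - e_old) / (K_B * temperature))


# Hypothetical conformational energies in kcal/mol
probs = boltzmann_probabilities([0.0, 0.5, 2.0])
uphill = metropolis_acceptance(0.0, 1.0)
print(probs, uphill)
```

Repeatedly proposing moves and accepting them with this probability yields samples distributed according to the Boltzmann populations, which is the basis for estimating thermodynamic quantities.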
Automatic annotation tools try to perform all this by computer analysis, as opposed to manual annotation, which
involves human expertise. Ideally, these approaches co-exist and complement each other in the same annotation pipeline.
The most basic level of annotation uses BLAST to find similarities and then annotates genomes based on those hits [14].
However, more and more additional information is now added to annotation platforms. This additional information
allows manual annotators to resolve discrepancies between genes that are given the same annotation.
For example, the SEED database uses genome context information, similarity scores, experimental data, and integration of
other resources to provide accurate genome annotations through its Subsystems approach [15]. The Ensembl
database relies on curated data sources as well as a range of different software tools in its automated genome
annotation pipeline [16].
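The basic BLAST-based level of annotation described above can be sketched as a best-hit transfer. The example assumes hits in the 12-column tab-separated layout produced by BLAST+ tabular output. The E-value and identity thresholds, the identifiers and the annotation labels are all made up for illustration, and production pipelines such as SEED or Ensembl combine far more evidence.

```python
BLAST_TAB_FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                    "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def parse_hits(text):
    """Parse tab-separated BLAST hits into dictionaries with numeric scores."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        row = dict(zip(BLAST_TAB_FIELDS, line.split("\t")))
        for key in ("pident", "evalue", "bitscore"):
            row[key] = float(row[key])
        hits.append(row)
    return hits


def transfer_annotations(hits, subject_annotations, max_evalue=1e-5, min_identity=40.0):
    """Assign each query the annotation of its best-scoring acceptable hit."""
    best = {}
    for hit in hits:
        if hit["evalue"] > max_evalue or hit["pident"] < min_identity:
            continue  # too weak to trust (thresholds are assumptions)
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return {q: subject_annotations.get(h["sseqid"], "hypothetical protein")
            for q, h in best.items()}


# Hypothetical BLAST results and database annotations
blast_output = "\n".join([
    "\t".join(["gene1", "subjA", "85.0", "200", "30", "0", "1", "200", "5", "204", "1e-50", "350"]),
    "\t".join(["gene1", "subjB", "45.0", "180", "99", "2", "3", "182", "10", "189", "1e-10", "120"]),
    "\t".join(["gene2", "subjC", "25.0", "90", "67", "3", "1", "90", "1", "90", "0.5", "30"]),
])
known = {"subjA": "example kinase", "subjB": "example transporter", "subjC": "example protease"}
annotations = transfer_annotations(parse_hits(blast_output), known)
print(annotations)
```

Queries without a sufficiently strong hit, such as gene2 here, remain unannotated and are the natural candidates for manual curation.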
These steps may involve both biological experiments and in silico analysis. A variety of software tools have been
developed to permit scientists to view and share genome annotations. Genome annotation is the next major challenge for the
Human Genome Project, now that the genome sequences of human and several model organisms are largely complete [17].
Identifying the locations of genes and other genetic control elements is often described as defining the biological "parts list"
for the assembly and normal operation of an organism. Scientists are still at an early stage in the process of delineating this
parts list and in understanding how all the parts "fit together". Genome annotation is an active area of investigation and
involves a number of different organizations in the life science community, which publish the results of their efforts in
publicly available biological databases accessible via the web and other electronic means. Here is an alphabetical listing of
ongoing projects relevant to genome annotation:
ENCyclopedia Of DNA Elements (ENCODE)
Ensembl
Gene Ontology Consortium
RefSeq
UniProt
Vertebrate and Genome Annotation Project (Vega)
Protein sequence databases can be grouped into a number of categories. The main databases are as follows:
Name                          Type                                       Web Address
Swiss-Prot                    Primary                                    www.expasy.ch
NCBI Protein                  Composite database                         http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein
PIR-NREF                      Primary                                    http://pir.georgetown.edu/
PROSITE                       Pattern based secondary database           http://www.expasy.org/prosite
InterPro                      Families/Domains                           http://www.ebi.ac.uk/interpro
PRINTS                        Family Fingerprints                        http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
Pfam                          Protein families                           http://www.sanger.ac.uk/Software/Pfam/
ProDom                        Domains                                    http://www.toulouse.inra.fr/prodom.html
AAindex                       Protein property                           http://www.genome.ad.jp/aaindex/
PMD, Protein Mutant Database  Literature based information               http://pmd.ddbj.nig.ac.jp/
PRF/SEQDB                     Amino acid sequences predicted from genes  http://www4.prf.or.jp/en/
OWL                           Composite database                         http://umber.sbs.man.ac.uk/dbbrowser/OWL/
SPTR                          Swiss-Prot + TrEMBL                        http://www.hgmp.mrc.ac.uk/Bioinformatics/Databases/sptr-help.html
Mycobacterium               http://probe.nalusda.gov:8300/cgi-bin/browse/mycdb
Streptococcus               http://dna1.chem.uoknor.edu/strep.html
Streptomyces                http://www.uea.ac.uk/nrp/jic/gstrgenome.htm
HIV                         http://hiv-web.lanl.gov/
Virus Information Resource  http://life.anu.edu.au/./viruses/virus.html
Genome Databases (Human)
Name                  Web Address
GDB, Genome Database  http://gdbwww.gdb.org/
GeneCards             http://bioinformatics.weizmann.ac.il/cards/
Gene Map'99           http://www.ncbi.nlm.nih.gov/genemap/
OMIM                  http://www.ncbi.nlm.nih.gov/Omim/
TIGR                  http://www.tigr.org/tdb/hgi/
GenAtlas              http://bisance.citi2.fr/GENATLAS/
o SeaView
o Strap: interactive, extendable and scriptable editor for large protein alignments
o GeneDoc
Motifs/Pattern Searching
NPS@: PATTINPROT search
PROSITE
PRATT
PatScan
PatternFind
Motif Explorer
N-Glycosylation Site Prediction Server
BNL motif searching
MOTIFS in SwissProt at IBCP
PRINTS
BLOCKS
PRODOM
SBASE
MOTIF at GenomeNet (Japan)
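Pattern-searching tools of this kind typically scan a sequence for a motif written in PROSITE syntax. The sketch below converts a subset of that syntax (single residues, x wildcards, [..] allowed sets, {..} excluded sets, (n) or (n,m) repeats, and the < and > terminal anchors) into a regular expression and reports 1-based match positions. It uses the well-known N-glycosylation pattern N-{P}-[ST]-{P}; the function names and the test sequence are assumptions, and the sketch is not the implementation used by any of the servers listed.

```python
import re

ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Translate a (subset of) PROSITE pattern syntax into a Python regex string."""
    regex = []
    for element in pattern.rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported PROSITE element: {element!r}")
        start, core, low, high, end = m.groups()
        if core == "x":
            core = "."                         # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"     # any residue except those listed
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        regex.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(regex)


def find_motif(sequence, pattern):
    """Return (1-based start, matched text) for every match, overlaps included."""
    lookahead = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in lookahead.finditer(sequence)]


# Hypothetical protein fragment scanned for N-glycosylation sites
sites = find_motif("MNGSANPTQNCTK", "N-{P}-[ST]-{P}.")
print(sites)
```

The lookahead wrapper is a design choice that lets overlapping sites be reported, which a plain left-to-right regex scan would skip.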
Phylogeny Programs
Joe Felsenstein's Phylogeny programs website
Phylogenetic Analysis Computer Programs
Phylogeny software (Glasgow University)
TreeTop - Phylogenetic Tree prediction
CMBI CLUSTAL W
Puzzle: Tree reconstruction for sequences by quartet puzzling and maximum likelihood (Strimmer, von Haeseler)
REFERENCES
[1]. Bernstein FC, Koetzle TF, Williams GJ, Meyer EF Jr, Brice MD, Rodgers JR, et al. The Protein Data Bank: a
computer-based archival file for macromolecular structures. Eur J Biochem 1977;80(2):319-24; Berman HM,
Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, et al. The Protein Data Bank. Nucleic Acids Res
2000;28(1):235-42.
[2]. Pearson WR, Lipman DJ. Improved tools for biological sequence comparison. Proc Natl Acad Sci USA
1988;85(8):2444-2448.
[3]. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a
new generation of protein database search programs. Nucleic Acids Res 1997;25(17):3389-3402.
[4]. http://salilab.org/modeller/
[5]. Greer J, Erickson JW, Baldwin JJ, Varney MD. Application of the three-dimensional structures of protein
target molecules in structure-based drug design. J Med Chem 1994;37(8):1035-1054. doi:10.1021/jm00034a001.
PMID 8164249.
[6]. Gubernator K, Böhm HJ. Structure-Based Ligand Design, Methods and Principles in Medicinal Chemistry.
Weinheim: Wiley-VCH; 1998.
[7]. www.megasoftware.net
[8]. paup.csit.fsu.edu
[9]. Edwards AWF, Cavalli-Sforza LL. Reconstruction of evolutionary trees. In: Phenetic and Phylogenetic
Classification. Systematics Assoc. Publ. No. 6; 1964. p. 67-76.
[10]. computationalbiology.berkeley.edu/
[11]. Korf I. Gene finding in novel genomes. BMC Bioinformatics 2004;5:59-67. doi:10.1186/1471-2105-5-59.
PMID 15144565.
[12]. http://opal.biology.gatech.edu/GeneMark/
[13]. http://genes.mit.edu/GENSCAN.html
[14]. http://blast.ncbi.nlm.nih.gov/Blast.cgi
[15]. http://www.theseed.org/wiki/Main_Page
[16]. http://www.ensembl.org/index.html
[17]. http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml
[18]. Lightstone S, Teorey T, Nadeau T. Physical Database Design: the database professional's guide to exploiting
indexes, views, storage, and more. Morgan Kaufmann Press; 2007. ISBN 0123693896.
[19]. Date CJ. An Introduction to Database Systems, Eighth Edition. Addison Wesley; 2000.