Bioinformatics
Introduction
At the beginning of the "genomic revolution", bioinformatics was applied to the creation and
maintenance of databases that store biological information, such as nucleotide and amino acid
sequences. Developing such databases involved not only design issues but also the construction
of complex interfaces through which researchers could access existing data and submit new or
revised data.
In order to study how normal cellular activities are altered in different disease states, the
biological data must be combined to form a comprehensive picture of these activities. Therefore,
the field of bioinformatics has evolved such that the most pressing task now involves the analysis
and interpretation of various types of data, including nucleotide and amino acid sequences,
protein domains, and protein structures. The actual process of analyzing and interpreting data is
referred to as computational biology. Important sub-disciplines within bioinformatics and
computational biology include:
a) the development and implementation of tools that enable efficient access to, and use and
management of, various types of information;
b) the development of new algorithms (mathematical formulas) and statistics with which to assess
relationships among members of large data sets, such as methods to locate a gene within a
sequence, predict protein structure and/or function, and cluster protein sequences into families
of related sequences.
Major research areas
Sequence analysis
Since the Phage Φ-X174 was sequenced in 1977, the DNA sequences of hundreds of organisms
have been decoded and stored in databases. The information is analyzed to determine genes that
encode polypeptides, as well as regulatory sequences. A comparison of genes within a species or
between different species can show similarities between protein functions, or relations between
species (the use of molecular systematics to construct phylogenetic trees). With the growing
amount of data, it long ago became impractical to analyze DNA sequences manually. Today,
computer programs are used to search the genomes of thousands of organisms, containing billions
of nucleotides. These programs compensate for mutations (exchanged, deleted or inserted
bases) in the DNA sequence in order to identify sequences that are related but not identical. A
variant of this sequence alignment is used in the sequencing process itself. The so-called shotgun
sequencing technique (which was used, for example, by The Institute for Genomic Research to
sequence the first bacterial genome, Haemophilus influenzae) does not give a sequential list of
nucleotides, but instead the sequences of thousands of small DNA fragments (each about 600 to 800
nucleotides long). The ends of these fragments overlap and, when aligned in the right way,
make up the complete genome. Shotgun sequencing yields sequence data quickly, but the task of
assembling the fragments can be quite complicated for larger genomes. In the case of the Human
Genome Project, it took several days of CPU time (on one hundred Pentium III desktop
machines clustered specifically for the purpose) to assemble the fragments. Shotgun sequencing
is the method of choice for virtually all genomes sequenced today, and genome assembly
algorithms are a critical area of bioinformatics research.
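The idea of assembling overlapping fragments can be illustrated with a minimal greedy sketch. It is not how production assemblers work (they must handle sequencing errors, repeats and both strands); the function names, the `min_len` parameter and the three toy reads are assumptions made for this example only.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the longest overlap is shorter than `min_len`.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no overlaps left; the remaining contigs stay separate
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Toy reads (made up for illustration) that tile one short sequence.
reads = ["ATGGCGTAC", "GTACGTTAG", "TTAGCCGA"]
contigs = greedy_assemble(reads)
print(contigs)
```

The pairwise comparison of all fragments in every round is quadratic in the number of fragments, which hints at why assembly becomes expensive for larger genomes.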
Another aspect of bioinformatics in sequence analysis is the automatic search for genes and
regulatory sequences within a genome. Not all of the nucleotides within a genome are part of genes.
Within the genome of higher organisms, large parts of the DNA do not serve any obvious
purpose. This so-called junk DNA may, however, contain unrecognized functional elements.
Bioinformatics helps to bridge the gap between genome and proteome projects, for example in
the use of DNA sequences for protein identification.
Genome annotation
In the context of genomics, annotation is the process of marking the genes and other biological
features in a DNA sequence. The first genome annotation software system was designed in 1995
by Dr. Owen White, who was part of the team that sequenced and analyzed the first genome of a
free-living organism to be decoded, the bacterium Haemophilus influenzae. Dr. White built a
software system to find the genes (places in the DNA sequence that encode a protein), the
transfer RNA, and other features, and to make initial assignments of function to those genes.
Most current genome annotation systems work similarly, but the programs available for analysis
of genomic DNA are constantly changing and improving.
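As a rough illustration of what "finding the places in the DNA sequence that encode a protein" can mean computationally, the sketch below scans the forward strand for open reading frames (a start codon followed in-frame by a stop codon). This is a deliberately naive stand-in, not the system described above: real gene finders also scan the reverse strand and use statistical models. The function name, the `min_codons` threshold and the example sequence are assumptions.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3) -> list[tuple[int, int, str]]:
    """Return (start, end, sequence) for ORFs in the three forward frames.

    `start` is the index of the A in ATG, `end` is one past the stop codon,
    and `min_codons` counts the start and stop codons as well.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == START:
                start = pos
            elif start is not None and codon in STOPS:
                end = pos + 3
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end, dna[start:end]))
                start = None
    return orfs


# Made-up sequence containing two short ORFs in the same frame.
example = "CCATGGCTTGATAAATGAAACCCGGGTAGTT"
for start, end, seq in find_orfs(example):
    print(start, end, seq)
```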
Computational evolutionary biology
Evolutionary biology is the study of the origin and descent of species, as well as their change
over time. Informatics has assisted evolutionary biologists in several key ways; it has enabled
researchers to:
a) trace the evolution of a large number of organisms by measuring changes in their DNA,
rather than through physical taxonomy or physiological observations alone;
b) more recently, compare entire genomes, which permits the study of more complex
evolutionary events, such as gene duplication, horizontal gene transfer, and the prediction
of factors important in bacterial speciation;
c) build complex computational models of populations to predict the outcome of the system
over time;
d) track and share information on an increasingly large number of species and organisms.
Future work endeavours to reconstruct the now more complex tree of life.
The area of research within computer science that uses genetic algorithms is sometimes confused
with computational evolutionary biology, but the two areas are not necessarily related.
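Measuring changes in DNA, as mentioned above, is often reduced to an evolutionary distance between aligned sequences. The sketch below computes the proportion of differing sites (the p-distance) and applies the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), which assumes that all substitutions are equally likely. The function names and the two toy sequences are assumptions; this is one simple model among many.

```python
import math


def p_distance(a: str, b: str) -> float:
    """Proportion of sites that differ between two aligned, equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned, non-empty and of equal length")
    diffs = sum(x != y for x, y in zip(a, b))
    return diffs / len(a)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance (expected substitutions per site)."""
    if not 0 <= p < 0.75:
        raise ValueError("the correction is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


# Made-up aligned sequences that differ at 2 of 10 sites.
seq_a = "ACGTACGTAC"
seq_b = "ACGTTCGTAA"
p = p_distance(seq_a, seq_b)
print(p, jukes_cantor(p))
```

The corrected distance is larger than the raw proportion because some sites may have changed more than once. A matrix of such pairwise distances is a typical input to distance-based tree-building methods.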
Analysis of gene expression
The expression of many genes can be determined by measuring mRNA levels with multiple
techniques including microarrays, expressed cDNA sequence tag (EST) sequencing, serial
analysis of gene expression (SAGE) tag sequencing, massively parallel signature sequencing
(MPSS), or various applications of multiplexed in-situ hybridization. All of these techniques are
extremely noise-prone and/or subject to bias in the biological measurement, and a major research
area in computational biology involves developing statistical tools to separate signal from noise
in high-throughput gene expression studies. Such studies are often used to determine the genes
implicated in a disorder: one might compare microarray data from cancerous epithelial cells to
data from non-cancerous cells to determine the transcripts that are up-regulated and down-
regulated in a particular population of cancer cells.
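The comparison of cancerous and non-cancerous samples can be sketched as a log2 fold-change calculation. The gene names, expression values, pseudocount and threshold below are all invented for illustration, and a real analysis would add normalization and a statistical test to deal with the noise discussed above.

```python
import math
from statistics import mean


def log2_fold_changes(tumor: dict, normal: dict, pseudocount: float = 1.0) -> dict:
    """log2(mean tumor / mean normal) per gene, with a pseudocount to avoid log(0)."""
    result = {}
    for gene in sorted(tumor.keys() & normal.keys()):
        t = mean(tumor[gene]) + pseudocount
        n = mean(normal[gene]) + pseudocount
        result[gene] = math.log2(t / n)
    return result


def classify(lfc: float, threshold: float = 1.0) -> str:
    """Label a gene as up-regulated, down-regulated or unchanged."""
    if lfc >= threshold:
        return "up"
    if lfc <= -threshold:
        return "down"
    return "unchanged"


# Hypothetical replicate measurements (arbitrary units).
tumor = {"GENE_A": [790, 799, 808], "GENE_B": [24, 24, 24], "GENE_C": [50, 52, 48]}
normal = {"GENE_A": [98, 99, 100], "GENE_B": [199, 199, 199], "GENE_C": [49, 51, 50]}

changes = log2_fold_changes(tumor, normal)
for gene, lfc in changes.items():
    print(gene, round(lfc, 2), classify(lfc))
```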
Analysis of regulation
Regulation is the complex orchestration of events starting with an extracellular signal such as a
hormone and leading to an increase or decrease in the activity of one or more proteins.
Bioinformatics techniques have been applied to explore various steps in this process. For
example, promoter analysis involves the identification and study of sequence motifs in the DNA
surrounding the coding region of a gene. These motifs influence the extent to which that region
is transcribed into mRNA. Expression data can be used to infer gene regulation: one might
compare microarray data from a wide variety of states of an organism to form hypotheses about
the genes involved in each state. In a single-cell organism, one might compare stages of the cell
cycle, along with various stress conditions (heat shock, starvation, etc.). One can then apply
clustering algorithms to that expression data to determine which genes are co-expressed. For
example, the upstream regions (promoters) of co-expressed genes can be searched for over-
represented regulatory elements.
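Searching the promoters of co-expressed genes for over-represented elements can be sketched by counting short words (k-mers) and comparing their frequency with a background set. The crude ratio used here stands in for a proper statistical test, and the sequences, function names and thresholds are assumptions for illustration.

```python
from collections import Counter


def kmer_presence(seqs: list[str], k: int) -> Counter:
    """Count, for each k-mer, the number of sequences that contain it at least once."""
    counts = Counter()
    for s in seqs:
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return counts


def enriched_kmers(cluster: list[str], background: list[str],
                   k: int = 6, min_ratio: float = 2.0) -> list[tuple[str, float]]:
    """k-mers whose presence is at least `min_ratio` times higher in the cluster."""
    fg = kmer_presence(cluster, k)
    bg = kmer_presence(background, k)
    hits = []
    for kmer, n in fg.items():
        fg_frac = n / len(cluster)
        bg_frac = (bg[kmer] + 1) / (len(background) + 1)  # add-one smoothing
        ratio = fg_frac / bg_frac
        if ratio >= min_ratio:
            hits.append((kmer, ratio))
    return sorted(hits, key=lambda item: (-item[1], item[0]))


# Made-up promoter fragments: the cluster shares one embedded word.
cluster = ["AATGACTCAGG", "CCTGACTCATT", "GTTGACTCACA"]
background = ["AACCGGTTAACC", "GGTTCCAAGGTT", "ACACACGTGTGT"]
print(enriched_kmers(cluster, background))
```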
Analysis of protein expression
Protein microarrays and high-throughput (HT) mass spectrometry (MS) can provide a snapshot
of the proteins present in a biological sample. Bioinformatics is heavily involved in making
sense of protein microarray and HT MS data. The former approach faces problems similar to those
of microarrays targeted at mRNA; the latter involves matching large amounts of mass data
against predicted masses from protein sequence databases, as well as the complicated
statistical analysis of samples in which multiple but incomplete peptides from each protein are
detected.
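Matching observed masses against masses predicted from a sequence database can be sketched as follows. The example digests each protein in silico with a simplified trypsin rule, computes monoisotopic peptide masses from standard residue masses, and reports database peptides within a tolerance of each observed mass. The two-protein "database", the tolerance and the function names are assumptions, and real search engines also consider missed cleavages, modifications and fragment spectra.

```python
# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def peptide_mass(peptide: str) -> float:
    """Monoisotopic mass of an unmodified, uncharged peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def tryptic_digest(protein: str) -> list[str]:
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def match_masses(observed: list[float], database: dict, tol_ppm: float = 10.0) -> list:
    """Return (observed mass, protein name, peptide) for every match within tolerance."""
    hits = []
    for name, protein in database.items():
        for pep in tryptic_digest(protein):
            m = peptide_mass(pep)
            for obs in observed:
                if abs(obs - m) / m * 1e6 <= tol_ppm:
                    hits.append((obs, name, pep))
    return hits


# Hypothetical database and a single "observed" mass with a small error.
database = {"prot1": "GAKSVR", "prot2": "AAKGGR"}
observed = [peptide_mass("SVR") + 0.0005]
print(match_masses(observed, database))
```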
Analysis of mutations in cancer
In cancer, the genomes of affected cells are rearranged in complex or even unpredictable ways.
Massive sequencing efforts are used to identify previously unknown point mutations in a variety
of genes in cancer. Bioinformaticians continue to produce specialized automated systems to
manage the sheer volume of sequence data produced, and they create new algorithms and
software to compare the sequencing results to the growing collection of human genome
sequences and germline polymorphisms. New physical detection technologies are employed, such
as oligonucleotide microarrays to identify chromosomal gains and losses (called comparative
genomic hybridization), and single nucleotide polymorphism arrays to detect known point
mutations. These detection methods simultaneously measure several hundred thousand sites
throughout the genome, and when used in high-throughput to measure thousands of samples,
generate terabytes of data per experiment. Again, the massive amounts and new types of data
generate new opportunities for bioinformaticians. The data is often found to contain considerable
variability, or noise, and thus hidden Markov model and change-point analysis methods are
being developed to infer real copy number changes.
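The simplest form of change-point analysis can be sketched as a search for the single split of an ordered series of probe measurements that minimizes the squared error around the two segment means. This is only a toy version of the idea (real methods find multiple change points and model the noise explicitly); the log2 ratios and the function name are invented for illustration.

```python
def best_change_point(values: list[float]) -> tuple[int, float, float]:
    """Return (index, left_mean, right_mean) for the best single split.

    The split at index k separates values[:k] from values[k:] and is chosen
    to minimize the total squared deviation from the two segment means.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    best = None
    for k in range(1, n):
        left, right = values[:k], values[k:]
        mean_left = sum(left) / k
        mean_right = sum(right) / (n - k)
        sse = (sum((v - mean_left) ** 2 for v in left)
               + sum((v - mean_right) ** 2 for v in right))
        if best is None or sse < best[0]:
            best = (sse, k, mean_left, mean_right)
    return best[1], best[2], best[3]


# Made-up log2 copy-number ratios along a chromosome: a gain starts at probe 5.
log_ratios = [0.05, -0.02, 0.01, -0.04, 0.03, 0.61, 0.55, 0.58, 0.62, 0.57]
index, left_mean, right_mean = best_change_point(log_ratios)
print(index, round(left_mean, 3), round(right_mean, 3))
```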
Another problem that requires novel informatics development is the analysis of lesions found
to be recurrent among many tumors.
Prediction of protein structure
Protein structure prediction is another important application of bioinformatics. The amino acid
sequence of a protein, the so-called primary structure, can be easily determined from the
sequence of the gene that codes for it. In the vast majority of cases, this primary structure
uniquely determines the protein's structure in its native environment. (There are exceptions,
such as the prion responsible for bovine spongiform encephalopathy, also known as mad cow
disease.) Knowledge of this structure is vital in understanding the function of the protein. For
lack of better terms, structural information is usually classified as secondary, tertiary or
quaternary structure. A viable general solution to such predictions remains an open problem.
Most efforts so far have been directed towards heuristics that work most of the time.
One of the key ideas in bioinformatics is the notion of homology. In the genomic branch of
bioinformatics, homology is used to predict the function of a gene: if the sequence of gene A,
whose function is known, is homologous to the sequence of gene B, whose function is unknown,
one could infer that B may share A's function. In the structural branch of bioinformatics,
homology is used to determine which parts of a protein are important in structure formation and
interaction with other proteins. In a technique called homology modeling, this information is
used to predict the structure of a protein once the structure of a homologous protein is known.
This currently remains the only way to predict protein structures reliably.
One example is the similarity between hemoglobin in humans and the hemoglobin in legumes
(leghemoglobin). Both serve the same purpose of transporting oxygen in the organism. Although
these proteins have very different amino acid sequences, their structures are virtually
identical, which reflects their near-identical purposes.
Other techniques for predicting protein structure include protein threading and de novo (from
scratch) physics-based modeling.
Comparative genomics
The core of comparative genome analysis is the establishment of the correspondence between
genes (orthology analysis) or other genomic features in different organisms. It is these
intergenomic maps that make it possible to trace the evolutionary processes responsible for the
divergence of two genomes. A multitude of evolutionary events acting at various organizational
levels shape genome evolution. At the lowest level, point mutations affect individual nucleotides.
At a higher level, large chromosomal segments undergo duplication, lateral transfer, inversion,
transposition, deletion and insertion. Ultimately, whole genomes are involved in processes of
hybridization, polyploidization and endosymbiosis, often leading to rapid speciation. The
complexity of genome evolution poses many exciting challenges to developers of mathematical
models and algorithms, who have recourse to a spectrum of algorithmic, statistical and
mathematical techniques, ranging from exact, heuristic, fixed-parameter and approximation
algorithms for problems based on parsimony models to Markov chain Monte Carlo algorithms
for Bayesian analysis of problems based on probabilistic models.
Many of these studies are based on homology detection and the computation of protein families.
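One simple quantity used in parsimony-style comparisons of gene order is the number of breakpoints: adjacencies present in one genome but not in the other. The sketch below counts breakpoints for unsigned gene orders containing the same genes exactly once; the gene names and the function name are hypothetical, and signed genes, duplications and multiple chromosomes are ignored.

```python
def breakpoints(order: list[str], reference: list[str]) -> int:
    """Count adjacencies in `order` that are not adjacencies in `reference`.

    Both lists must contain the same genes, each exactly once. The orders are
    framed with virtual start and end markers so the ends also count.
    """
    if sorted(order) != sorted(reference) or len(set(reference)) != len(reference):
        raise ValueError("both orders must contain the same unique genes")
    rank = {gene: i + 1 for i, gene in enumerate(reference)}
    perm = [0] + [rank[g] for g in order] + [len(reference) + 1]
    return sum(1 for a, b in zip(perm, perm[1:]) if abs(a - b) != 1)


# Hypothetical gene orders: the second genome carries an inversion of b..d.
genome_1 = ["a", "b", "c", "d", "e", "f"]
genome_2 = ["a", "d", "c", "b", "e", "f"]
print(breakpoints(genome_2, genome_1))
```

A single inversion creates at most two breakpoints, which is why breakpoint counts give a lower bound on the number of rearrangement events separating two gene orders.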
Modeling biological systems
Systems biology involves the use of computer simulations of cellular subsystems (such as the
networks of metabolites and enzymes which comprise metabolism, signal transduction pathways
and gene regulatory networks) to both analyze and visualize the complex connections of these
cellular processes. Artificial life or virtual evolution attempts to understand evolutionary
processes via the computer simulation of simple (artificial) life forms.
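A computer simulation of a cellular subsystem can be as small as the following sketch, which integrates a two-variable model of gene expression with the forward Euler method: mRNA is produced at a constant rate and degraded, and protein is translated from mRNA and degraded. The model, the rate constants and the function name are assumptions chosen for illustration, not measured values.

```python
def simulate_expression(k_tx: float, d_m: float, k_tl: float, d_p: float,
                        t_end: float, dt: float = 0.01) -> tuple[float, float]:
    """Integrate dm/dt = k_tx - d_m*m and dp/dt = k_tl*m - d_p*p from m = p = 0.

    Returns the mRNA and protein levels at time `t_end`.
    """
    m, p = 0.0, 0.0
    steps = int(round(t_end / dt))
    for _ in range(steps):
        dm = k_tx - d_m * m      # transcription minus mRNA decay
        dp = k_tl * m - d_p * p  # translation minus protein decay
        m += dm * dt
        p += dp * dt
    return m, p


# Illustrative rate constants (arbitrary units).
m_final, p_final = simulate_expression(k_tx=2.0, d_m=0.2, k_tl=1.0, d_p=0.1, t_end=200.0)
print(round(m_final, 3), round(p_final, 3))
```

For this model the long-run levels can be checked by hand: mRNA settles at k_tx / d_m and protein at k_tl * (k_tx / d_m) / d_p, which gives a quick sanity check on the simulation.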
Protein-protein docking
In the last two decades, tens of thousands of protein three-dimensional structures have been
determined by X-ray crystallography and protein nuclear magnetic resonance spectroscopy
(protein NMR). One central question for the biological scientist is whether it is practical to
predict possible protein-protein interactions based only on these 3D shapes, without doing
protein-protein interaction experiments. A variety of methods have been developed to tackle the
protein-protein docking problem, though much work remains to be done in this field.
Software and tools
Software tools for bioinformatics range from simple command-line tools to more complex
graphical programs and standalone web services available from various bioinformatics
companies or public institutions. The computational biology tool best known among biologists is
probably BLAST, an algorithm for determining the similarity of arbitrary sequences to
other sequences, possibly from curated databases of protein or DNA sequences. BLAST is one of
a number of generally available programs for doing sequence alignment. The NCBI provides a
popular web-based implementation that searches its databases.
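To make the notion of sequence similarity concrete, the sketch below computes a Smith-Waterman local alignment score by dynamic programming. This is not BLAST itself: BLAST uses heuristics to approximate this kind of local alignment quickly against large databases. The scoring values (match, mismatch, linear gap penalty) and the example sequences are assumptions.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between `a` and `b` with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Two made-up sequences that share the word ACGT.
print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

The running time grows with the product of the two sequence lengths, which is why exhaustive dynamic programming is usually replaced by faster heuristics when searching whole databases.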
Web services in bioinformatics
SOAP- and REST-based interfaces have been developed for a wide variety of bioinformatics
applications, allowing an application running on one computer in one part of the world to use
algorithms, data and computing resources on servers in other parts of the world. The main
advantage is that end users do not have to deal with software and database maintenance
overheads. Basic bioinformatics services are classified by the EBI into three categories: SSS
(Sequence Search Services), MSA (Multiple Sequence Alignment) and BSA (Biological
Sequence Analysis). The availability of these service-oriented bioinformatics resources
demonstrates the applicability of web-based bioinformatics solutions, which range from a
collection of standalone tools with a common data format under a single, standalone or web-based
interface, to integrative, distributed and extensible bioinformatics workflow management
systems.
Future scope
Bioinformatics is a rapidly growing and developing field in the era of computational science.
The major databases and resources that are useful for life science research include NCBI, DDBJ,
EMBL, TIGR, PDB, SWISS-PROT and TrEMBL. These resources are public, and the institutions behind
them also conduct research in computational biology and develop software tools for analyzing
genome data. Given its rapid emergence and development, the field has bright prospects in the
coming decades.