Bioinformatics
Bioinformatics is both an umbrella term for the body of biological studies that use computer
programming as part of their methodology, and a reference to specific analysis "pipelines"
that are repeatedly used, particularly in the fields of genetics and genomics.
Common uses of bioinformatics include the identification of candidate genes and nucleotides.
Often, such identification is made with the aim of better understanding the genetic basis of
disease, unique adaptations, desirable properties, or differences between populations. In a
less formal way, bioinformatics also tries to understand the organisational principles within
nucleic acid and protein sequences.
Introduction
History
Historically, the term bioinformatics did not mean what it means today. Paulien Hogeweg and
Ben Hesper coined it in 1970 to refer to the study of information processes in biotic systems.
This definition placed bioinformatics as a field parallel to biophysics or biochemistry. Margaret
Dayhoff compiled one of the first protein sequence databases, initially published as books, and
pioneered methods of sequence alignment and molecular evolution. Another early contributor
to bioinformatics was Elvin A. Kabat, who pioneered biological sequence analysis in 1970 with
his comprehensive volumes of antibody sequences released with Tai Te Wu between 1980 and
1991.
Goals
To study how normal cellular activities are altered in different disease states, the biological
data must be combined to form a comprehensive picture of these activities. Therefore, the
field of bioinformatics has evolved such that the most pressing task now involves the analysis
and interpretation of various types of data. This includes nucleotide and amino acid
sequences, protein domains, and protein structures. The actual process of analyzing and
interpreting data is referred to as computational biology. Important sub-disciplines within
bioinformatics and computational biology include:
Development and implementation of computer programs that enable efficient access to, use
and management of, various types of information
Development of new algorithms and statistical measures that assess relationships among
members of large data sets. For example, there are methods to locate a gene within a
sequence, to predict protein structure and/or function, and to cluster protein sequences into
families of related sequences.
Over the past few decades rapid developments in genomic and other molecular research
technologies and developments in information technologies have combined to produce a
tremendous amount of information related to molecular biology. Bioinformatics is the name
given to these mathematical and computing approaches used to glean understanding of
biological processes.
Common activities in bioinformatics include mapping and analyzing DNA and protein
sequences, aligning DNA and protein sequences to compare them, and creating and viewing 3-
D models of protein structures.
Bioinformatics is a science field that is similar to but distinct from biological computation and
computational biology. Biological computation uses bioengineering and biology to build
biological computers, whereas bioinformatics uses computation to better understand biology.
Bioinformatics and computational biology have similar aims and approaches, but they differ in
scale: bioinformatics organizes and analyzes basic biological data, whereas computational
biology builds theoretical models of biological systems, just as mathematical biology does with
mathematical models.
Analyzing biological data to produce meaningful information involves writing and running
software programs that use algorithms from graph theory, artificial intelligence, soft
computing, data mining, image processing, and computer simulation. The algorithms in turn
depend on theoretical foundations such as discrete mathematics, control theory, system
theory, information theory, and statistics.
Sequence analysis
Since the phage Φ-X174 was sequenced in 1977, the DNA sequences of thousands of
organisms have been decoded and stored in databases. This sequence information is analyzed
to determine genes that encode proteins, RNA genes, regulatory sequences, structural motifs,
and repetitive sequences. A comparison of genes within a species or between different species
can show similarities between protein functions, or relations between species. With the
growing amount of data, it long ago became impractical to analyze DNA sequences manually.
Today, computer programs such as BLAST are used daily to search sequences from more than
260 000 organisms, containing over 190 billion nucleotides. These programs can compensate
for mutations in the DNA sequence, to identify sequences that are related, but not identical. A
variant of this sequence alignment is used in the sequencing process itself. The so-called
shotgun sequencing technique does not produce entire chromosomes. Instead it generates
the sequences of many thousands of small DNA fragments. The ends of these fragments
overlap and, when aligned properly by a genome assembly program, can be used to
reconstruct the complete genome. Shotgun sequencing yields sequence data quickly, but the
task of assembling the fragments can be quite complicated for larger genomes. For a genome
as large as the human genome, it may take many days of CPU time on large-memory,
multiprocessor computers to assemble the fragments, and the resulting assembly usually
contains numerous gaps that must be filled in later. Shotgun sequencing is the method of
choice for virtually all genomes sequenced today, and genome assembly algorithms are a
critical area of bioinformatics research.
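To illustrate how alignment algorithms tolerate such mutations, here is a minimal sketch of Needleman-Wunsch global alignment in Python; it is not the heuristic that BLAST actually uses, and the match/mismatch/gap scores are arbitrary values chosen for the example:

# Minimal Needleman-Wunsch global alignment (illustrative only; BLAST itself
# relies on a faster heuristic). Scoring parameters are hypothetical.
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            score[i][j] = max(diag, score[i-1][j] + gap, score[i][j-1] + gap)
    return score[n][m]

# Related but non-identical sequences still score well despite a substitution:
print(needleman_wunsch("GATTACA", "GATTGCA"))  # one mismatch, score 5

It is precisely this tolerance of mismatches and gaps that lets search programs recognize related sequences despite point mutations.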
To pursue the goals that remained after the Human Genome Project closed in 2003, the
National Human Genome Research Institute in the U.S. launched a new project. The so-called
ENCODE project is a collaborative effort to catalogue the functional elements of the human
genome using next-generation DNA-sequencing technologies and genomic tiling arrays,
technologies able to generate large amounts of data automatically, at lower research cost but
with the same quality and viability.
Genome annotation
In the context of genomics, annotation is the process of marking the genes and other biological
features in a DNA sequence. This process needs to be automated because most genomes are
too large to annotate by hand, not to mention the desire to annotate as many genomes as
possible, as the rate of sequencing has ceased to pose a bottleneck. Annotation is made
possible by the fact that genes have recognisable start and stop regions, although the exact
sequence found in these regions can vary between genes.
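To give a flavour of why recognisable start and stop regions make automated annotation tractable, the following is a toy sketch that scans one strand for open reading frames, assuming the standard start codon ATG and stop codons TAA/TAG/TGA; real annotation pipelines are far more sophisticated:

# Naive open-reading-frame finder: scans the forward strand in all three
# frames for ATG ... stop, the kind of "recognisable start and stop regions"
# that make automated annotation possible. Real annotators are far richer.
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i+3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # half-open coordinates
                start = None
    return orfs

print(find_orfs("CCATGAAATTTTGACCC"))  # [(2, 14)]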
The first genome annotation software system was designed in 1995 by Owen White, who was
part of the team at The Institute for Genomic Research that sequenced and analyzed the first
genome of a free-living organism to be decoded, the bacterium Haemophilus influenzae.
Computational evolutionary biology
Informatics has assisted evolutionary biologists by enabling researchers to track and share
information on an increasingly large number of species and organisms. Future work
endeavours to reconstruct the now more complex tree of life.
The area of research within computer science that uses genetic algorithms is sometimes
confused with computational evolutionary biology, but the two areas are not necessarily
related.
Comparative genomics
The core of comparative genome analysis is the establishment of the correspondence between
genes or other genomic features in different organisms. It is these intergenomic maps that
make it possible to trace the evolutionary processes responsible for the divergence of two
genomes. A multitude of evolutionary events acting at various organizational levels shape
genome evolution. At the lowest level, point mutations affect individual nucleotides. At a
higher level, large chromosomal segments undergo duplication, lateral transfer, inversion,
transposition, deletion and insertion.
Many of these studies are based on homology detection and the computation of protein families.
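One common homology-detection heuristic for building such intergenomic maps is the reciprocal best hit: a gene from each genome is paired with a gene from the other when each is the other's highest-scoring match. A minimal sketch, assuming pairwise similarity scores have already been computed by an alignment tool (the gene names and scores are hypothetical):

# Reciprocal best hits from precomputed similarity scores.
# scores[(gene_a, gene_b)] -> similarity; names and values are hypothetical.
def reciprocal_best_hits(scores, genome_a, genome_b):
    best_a = {a: max(genome_b, key=lambda b: scores.get((a, b), 0)) for a in genome_a}
    best_b = {b: max(genome_a, key=lambda a: scores.get((a, b), 0)) for b in genome_b}
    return [(a, b) for a, b in best_a.items() if best_b[b] == a]

scores = {("geneA1", "geneB1"): 95, ("geneA1", "geneB2"): 40,
          ("geneA2", "geneB1"): 30, ("geneA2", "geneB2"): 88}
print(reciprocal_best_hits(scores, ["geneA1", "geneA2"], ["geneB1", "geneB2"]))
# [('geneA1', 'geneB1'), ('geneA2', 'geneB2')]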
Pan genomics
Pan genomics is a concept introduced in 2005 by Tettelin and Medini which eventually took
root in bioinformatics. Pan genome is the complete gene repertoire of a particular taxonomic
group: although initially applied to closely related strains of a species, it can be applied to a
larger context such as genus or phylum. It is divided into two parts: the core genome, the set of
genes common to all the genomes under study, and the dispensable/flexible genome, the set
of genes present in only one or some, but not all, of the genomes under study.
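Computationally, once each genome's gene repertoire is known, the core and flexible genomes reduce to simple set operations, as in this minimal sketch (the strains and gene families are hypothetical):

# Core genome = intersection of gene sets; flexible genome = union minus core.
genomes = {
    "strain1": {"geneA", "geneB", "geneC"},
    "strain2": {"geneA", "geneB", "geneD"},
    "strain3": {"geneA", "geneC", "geneD"},
}
core = set.intersection(*genomes.values())
pan = set.union(*genomes.values())
flexible = pan - core
print(sorted(core))      # ['geneA']
print(sorted(flexible))  # ['geneB', 'geneC', 'geneD']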
Genetics of disease
With the advent of next-generation sequencing we are obtaining enough sequence data to
map the genes of complex diseases such as infertility, breast cancer or Alzheimer's Disease.
Genome-wide association studies are essential to pinpoint the mutations for such complex
diseases. Furthermore, the possibility for genes to be used in prognosis, diagnosis or
treatment is one of the most essential applications. Many studies discuss both the
promising ways to choose the genes to be used and the problems and pitfalls of using genes to
predict disease presence or prognosis.
In cancer, the genomes of affected cells are rearranged in complex or even unpredictable
ways. Massive sequencing efforts are used to identify previously unknown point mutations in a
variety of genes in cancer. Bioinformaticians continue to produce specialized automated
systems to manage the sheer volume of sequence data produced, and they create new
algorithms and software to compare the sequencing results to the growing collection of
human genome sequences and germline polymorphisms. New physical detection technologies
are employed, such as oligonucleotide microarrays to identify chromosomal gains and losses,
and single-nucleotide polymorphism arrays to detect known point mutations. These detection
methods simultaneously measure several hundred thousand sites throughout the genome,
and when used in high-throughput to measure thousands of samples, generate terabytes of
data per experiment. Again the massive amounts and new types of data generate new
opportunities for bioinformaticians. The data is often found to contain considerable variability,
or noise, and thus Hidden Markov model and change-point analysis methods are being
developed to infer real copy number changes.
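To give a flavour of change-point analysis on such noisy signals, this toy sketch picks the single split point that maximizes the difference in mean intensity between the two sides; production copy-number callers use Hidden Markov models or multiple-change-point statistics instead, and the probe values below are hypothetical:

# Toy single change-point detector: choose the split point that maximizes
# the absolute difference between the mean signal on each side.
def best_changepoint(signal):
    best_k, best_gap = None, 0.0
    for k in range(1, len(signal)):
        left, right = signal[:k], signal[k:]
        gap = abs(sum(left) / len(left) - sum(right) / len(right))
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k, best_gap

# Noisy probe intensities with a copy-number gain starting at index 5:
probes = [1.0, 0.9, 1.1, 1.0, 0.95, 2.1, 1.9, 2.0, 2.05, 1.95]
print(best_changepoint(probes))  # split at k = 5 (hypothetical data)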
However, with the breakthroughs that next-generation sequencing technology is providing to
the field of bioinformatics, cancer genomics may change drastically. These new methods and
software allow bioinformaticians to sequence many cancer genomes rapidly and affordably.
This could mean a more flexible process for classifying types of cancer by analysis of
cancer-driven mutations in the genome. Furthermore, tracking individual patients during the
progression of the disease may be possible in the future through the sequencing of cancer samples.
Another type of data that requires novel informatics development is the analysis of lesions
found to be recurrent among many tumors.
Analysis of gene expression
The expression of many genes can be determined by measuring mRNA levels with multiple
techniques including microarrays, expressed cDNA sequence tag sequencing, serial analysis of
gene expression tag sequencing, massively parallel signature sequencing, RNA-Seq, also
known as "Whole Transcriptome Shotgun Sequencing", or various applications of multiplexed
in-situ hybridization. All of these techniques are extremely noise-prone and/or subject to bias
in the biological measurement, and a major research area in computational biology involves
developing statistical tools to separate signal from noise in high-throughput gene expression
studies. Such studies are often used to determine the genes implicated in a disorder: one
might compare microarray data from cancerous epithelial cells to data from non-cancerous
cells to determine the transcripts that are up-regulated and down-regulated in a particular
population of cancer cells.
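A minimal sketch of such a comparison, assuming already-normalized expression values for each transcript in the two cell populations (the genes and values are hypothetical), flags transcripts whose log2 fold change exceeds a chosen threshold; real analyses add statistical testing and multiple-testing correction:

import math

# Flag up-/down-regulated transcripts by log2 fold change between conditions.
def differential(expr_cancer, expr_normal, threshold=1.0):
    calls = {}
    for gene in expr_cancer:
        lfc = math.log2(expr_cancer[gene] / expr_normal[gene])
        if lfc >= threshold:
            calls[gene] = ("up", lfc)
        elif lfc <= -threshold:
            calls[gene] = ("down", lfc)
    return calls

cancer = {"geneX": 820.0, "geneY": 95.0, "geneZ": 410.0}
normal = {"geneX": 100.0, "geneY": 90.0, "geneZ": 1700.0}
print(differential(cancer, normal))
# geneX up (~3.0), geneZ down (~-2.05); geneY unchanged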
Analysis of protein expression
Protein microarrays and high throughput mass spectrometry can provide a snapshot of the
proteins present in a biological sample. Bioinformatics is very much involved in making sense
of protein microarray and HT MS data; the former approach faces problems similar to those of
microarrays targeted at mRNA, while the latter involves matching large amounts of mass data
against predicted masses from protein sequence databases, and the complicated statistical
analysis of samples in which multiple, but incomplete, peptides from each protein are
detected.
Analysis of regulation
Regulation is the complex orchestration of events starting with an extracellular signal such as a
hormone and leading to an increase or decrease in the activity of one or more proteins.
Bioinformatics techniques have been applied to explore various steps in this process. For
example, promoter analysis involves the identification and study of sequence motifs in the
DNA surrounding the coding region of a gene. These motifs influence the extent to which that
region is transcribed into mRNA. Expression data can be used to infer gene regulation: one
might compare microarray data from a wide variety of states of an organism to form
hypotheses about the genes involved in each state. In a single-cell organism, one might
compare stages of the cell cycle, along with various stress conditions. One can then apply
clustering algorithms to that expression data to determine which genes are co-expressed. For
example, the upstream regions of co-expressed genes can be searched for over-represented
regulatory elements. Examples of clustering algorithms applied in gene clustering are k-means
clustering, self-organizing maps, hierarchical clustering, and consensus clustering methods
such as the Bi-CoPaM. The latter was proposed to address various issues specific to
gene-discovery problems, such as consistent co-expression of genes over multiple microarray
datasets.
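As a concrete illustration of clustering co-expressed genes, here is a tiny k-means sketch over expression profiles (rows are genes, columns are samples; all values hypothetical); real pipelines rely on vetted implementations and careful normalization:

import random

# Tiny k-means over gene expression profiles. Co-expressed genes, whose
# profiles rise and fall together, end up in the same cluster.
def kmeans(profiles, k, iters=50, seed=0):
    random.seed(seed)
    centers = random.sample(profiles, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in profiles:
            i = min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        centers = [
            [sum(col) / len(col) for col in zip(*cl)] if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters

profiles = [[1.0, 2.0, 8.0], [1.1, 2.1, 7.8],   # one co-expressed group
            [9.0, 8.5, 1.0], [8.8, 8.7, 1.2]]   # another group
for cluster in kmeans(profiles, k=2):
    print(cluster)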
Structural bioinformatics
Protein structure prediction is another important application of bioinformatics. The amino acid
sequence of a protein, the so-called primary structure, can be easily determined from the
sequence on the gene that codes for it. In the vast majority of cases, this primary structure
uniquely determines a structure in its native environment. Knowledge of this structure is vital
in understanding the function of the protein. Structural information is usually classified as one
of secondary, tertiary and quaternary structure. A viable general solution to such predictions
remains an open problem. Most efforts have so far been directed towards heuristics that work
most of the time.
One of the key ideas in bioinformatics is the notion of homology. In the genomic branch of
bioinformatics, homology is used to predict the function of a gene: if the sequence of gene A,
whose function is known, is homologous to the sequence of gene B, whose function is
unknown, one could infer that B may share A's function. In the structural branch of
bioinformatics, homology is used to determine which parts of a protein are important in
structure formation and interaction with other proteins. In a technique called homology
modeling, this information is used to predict the structure of a protein once the structure of a
homologous protein is known. This currently remains the only way to predict protein
structures reliably.
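The genomic use of homology described above can be sketched as a simple annotation-transfer rule: if an uncharacterized gene's best hit against characterized genes exceeds an identity threshold, its function is tentatively transferred. The sequences, annotations and threshold below are hypothetical, and a naive position-by-position identity stands in for a proper alignment score:

# Transfer a known function to an uncharacterized gene when sequence identity
# to a characterized homolog is high enough. Real pipelines use alignments,
# E-values and curated databases instead of this toy identity measure.
def percent_identity(a, b):
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / max(len(a), len(b))

def transfer_function(query, known, threshold=40.0):
    best = max(known, key=lambda name: percent_identity(query, known[name][0]))
    seq, function = known[best]
    if percent_identity(query, seq) >= threshold:
        return f"putative {function} (homologous to {best})"
    return "unknown"

known = {"geneA": ("MKTAYIAKQR", "oxygen transport")}
print(transfer_function("MKTAYLAKQR", known))  # 90% identity -> transfers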
One example of this is the homology between hemoglobin in humans and hemoglobin in
legumes (leghemoglobin). Both serve the same purpose of transporting oxygen in the organism.
Though both of these proteins have completely different amino acid sequences, their protein
structures are virtually identical, which reflects their near identical purposes.
Other techniques for predicting protein structure include protein threading and de novo
physics-based modeling.
Network and systems biology
Network analysis seeks to understand the relationships within biological networks such as
metabolic or protein-protein interaction networks. Although biological networks can be
constructed from a single type of molecule or entity, network biology often attempts to
integrate many different data types, such as proteins, small molecules, gene expression data,
and others, which are all connected physically, functionally, or both.
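A minimal sketch of such network analysis over a protein-protein interaction network, represented here as an adjacency structure built from a hypothetical edge list, ranks proteins by degree to surface highly connected hubs:

from collections import defaultdict

# Build an undirected protein-protein interaction network from edge pairs and
# rank proteins by degree; high-degree "hubs" are often functionally central.
# The protein names and interactions here are hypothetical examples.
interactions = [("P53", "MDM2"), ("P53", "BRCA1"), ("P53", "ATM"), ("BRCA1", "ATM")]

network = defaultdict(set)
for a, b in interactions:
    network[a].add(b)
    network[b].add(a)

for protein, partners in sorted(network.items(), key=lambda kv: -len(kv[1])):
    print(protein, len(partners), sorted(partners))
# P53 has the highest degree (3) in this toy network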
Systems biology involves the use of computer simulations of cellular subsystems to both
analyze and visualize the complex connections of these cellular processes. Artificial life or
virtual evolution attempts to understand evolutionary processes via the computer simulation
of simple life forms.
Others
Literature analysis
The growth in the amount of published literature makes it virtually impossible to read every
paper, resulting in disjointed sub-fields of research. Literature analysis aims to employ
computational and statistical linguistics to mine this growing library of text resources. For
example:
Protein-protein interaction – identify which proteins interact with which proteins from text
High-throughput image analysis
Computational technologies are also used to automate the processing and analysis of large
amounts of biomedical imagery, for example in morphometrics and in quantifying occlusion
size in real-time imagery from the development of and recovery during arterial injury.
Biodiversity informatics
Biodiversity informatics deals with the collection and analysis of biodiversity data, such as
taxonomic databases, or microbiome data. Examples of such analyses include phylogenetics,
niche modelling, species richness mapping, or species identification tools.
Databases
Databases are essential for bioinformatics research and applications. A huge number of
databases are available, covering almost everything from DNA and protein sequences and
molecular structures to phenotypes and biodiversity. Databases generally fall into one of three types.
Some contain data resulting directly from empirical methods such as gene knockouts. Others
consist of predicted data, and most contain data from both sources. There are meta-databases
that incorporate data compiled from multiple other databases. Some others are specialized,
such as those specific to an organism. These databases vary in their format, means of access,
and whether they are public or not. Generally, most computational data are supported by
wet-lab data as well.
Software and tools
Software tools for bioinformatics range from simple command-line tools, to more complex
graphical programs and standalone web-services available from various bioinformatics
companies or public institutions.
Many free and open-source software tools have existed and continued to grow since the
1980s. The combination of a continued need for new algorithms for the analysis of emerging
types of biological readouts, the potential for innovative in silico experiments, and freely
available open code bases have helped to create opportunities for all research groups to
contribute to both bioinformatics and the range of open-source software available, regardless
of their funding arrangements. The open source tools often act as incubators of ideas, or
community-supported plug-ins in commercial applications. They may also provide de facto
standards and shared object models for assisting with the challenge of bioinformation
integration.
The range of open-source software packages includes titles such as Bioconductor, BioPerl,
Biopython, BioJava, BioJS, BioRuby, Bioclipse, EMBOSS, .NET Bio, Apache Taverna, and UGENE.
To maintain this tradition and create further opportunities, the non-profit Open Bioinformatics
Foundation has supported the annual Bioinformatics Open Source Conference (BOSC) since 2000.
An alternative method to build public bioinformatics databases is to use the MediaWiki engine
with a suitable extension. This system allows the database to be accessed and updated by all
experts in the field.
SOAP- and REST-based interfaces have been developed for a wide variety of bioinformatics
applications allowing an application running on one computer in one part of the world to use
algorithms, data and computing resources on servers in other parts of the world. The main
advantages derive from the fact that end users do not have to deal with software and
database maintenance overheads.
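As a sketch of how such a REST interface is typically consumed, the following uses only Python's standard library to fetch a sequence record; the endpoint URL, path and response field are placeholders, not any real service's API:

import json
from urllib.request import urlopen

# Query a hypothetical REST-based bioinformatics service for a sequence.
# The host, path and response fields below are placeholders; a real service
# (e.g., at the EBI) documents its own URLs and formats.
BASE = "https://bioinfo.example.org/api"   # placeholder host

def fetch_sequence(accession):
    with urlopen(f"{BASE}/sequence/{accession}?format=json") as response:
        record = json.load(response)
    return record["sequence"]              # placeholder field name

# Against a real endpoint one would call, e.g.:
# print(fetch_sequence("ABC123"))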
Basic bioinformatics services are classified by the EBI into three categories: SSS (sequence
search services), MSA (multiple sequence alignment), and BSA (biological sequence analysis).
The availability of these service-oriented bioinformatics resources demonstrates the
applicability of web-based bioinformatics solutions, which range from a collection of
standalone tools with a common data format under a single, standalone or web-based
interface, to integrative, distributed and extensible bioinformatics workflow management
systems. Such systems are designed to:
provide interactive tools for scientists, enabling them to execute their workflows and view
their results in real-time;
simplify the process of sharing and reusing workflows between scientists; and
enable scientists to track the provenance of the workflow execution results and the workflow
creation steps.
Some of the platforms providing this service are Galaxy, Kepler, Taverna, UGENE, and Anduril.
Education platforms
Software platforms designed to teach bioinformatics concepts and methods include Rosalind
and online courses offered through the Swiss Institute of Bioinformatics Training Portal. The
Canadian Bioinformatics Workshops provides videos and slides from training workshops on
their website under a Creative Commons license.
Conferences
There are several large conferences that are concerned with bioinformatics. Some of the most
notable examples are Intelligent Systems for Molecular Biology, European Conference on
Computational Biology, and Research in Computational Molecular Biology.
Further reading
Aluru, Srinivas, ed. Handbook of Computational Molecular Biology. Chapman & Hall/Crc, 2006.
ISBN 1-58488-406-1
Baldi, P and Brunak, S, Bioinformatics: The Machine Learning Approach, 2nd edition. MIT Press,
2001. ISBN 0-262-02506-X
Barnes, M.R. and Gray, I.C., eds., Bioinformatics for Geneticists, first edition. Wiley, 2003. ISBN
0-470-84394-2
Baxevanis, A.D. and Ouellette, B.F.F., eds., Bioinformatics: A Practical Guide to the Analysis of
Genes and Proteins, third edition. Wiley, 2005. ISBN 0-471-47878-4
Baxevanis, A.D., Petsko, G.A., Stein, L.D., and Stormo, G.D., eds., Current Protocols in
Bioinformatics. Wiley, 2007. ISBN 0-471-25093-7
Durbin, R., S. Eddy, A. Krogh and G. Mitchison, Biological sequence analysis. Cambridge
University Press, 1998. ISBN 0-521-62971-3
Kohane, et al. Microarrays for an Integrative Genomics. The MIT Press, 2002. ISBN 0-262-
11271-X
Lund, O. et al. Immunological Bioinformatics. The MIT Press, 2005. ISBN 0-262-12280-4
Pachter, Lior and Sturmfels, Bernd, Algebraic Statistics for Computational Biology. Cambridge
University Press, 2005. ISBN 0-521-85700-7
Pevzner, Pavel A., Computational Molecular Biology: An Algorithmic Approach. The MIT Press,
2000. ISBN 0-262-16197-4
Stevens, Hallam, Life Out of Sequence: A Data-Driven History of Bioinformatics. Chicago: The
University of Chicago Press, 2013. ISBN 9780226080208
Tisdall, James, Beginning Perl for Bioinformatics. O'Reilly, 2001. ISBN 0-596-00080-4