Bioinformatica

Uploaded by

iuha

Bioinformatics is an interdisciplinary field that develops methods and software tools for
understanding biological data. As an interdisciplinary field of science, bioinformatics combines
computer science, statistics, mathematics, and engineering to analyze and interpret biological
data.

Bioinformatics is both an umbrella term for the body of biological studies that use computer
programming as part of their methodology, as well as a reference to specific analysis
"pipelines" that are repeatedly used, particularly in the fields of genetics and genomics.
Common uses of bioinformatics include the identification of candidate genes and nucleotides.
Often, such identification is made with the aim of better understanding the genetic basis of
disease, unique adaptations, desirable properties, or differences between populations. In a
less formal way, bioinformatics also tries to understand the organisational principles within
nucleic acid and protein sequences.

Introduction

Bioinformatics has become an important part of many areas of biology. In experimental
molecular biology, bioinformatics techniques such as image and signal processing allow
extraction of useful results from large amounts of raw data. In the field of genetics and
genomics, it aids in sequencing and annotating genomes and their observed mutations. It plays
a role in the text mining of biological literature and the development of biological and gene
ontologies to organize and query biological data. It also plays a role in the analysis of gene and
protein expression and regulation. Bioinformatics tools aid in the comparison of genetic and
genomic data and more generally in the understanding of evolutionary aspects of molecular
biology. At a more integrative level, it helps analyze and catalogue the biological pathways and
networks that are an important part of systems biology. In structural biology, it aids in the
simulation and modeling of DNA, RNA, and protein structures as well as molecular
interactions.

History

Historically, the term bioinformatics did not mean what it means today. Paulien Hogeweg and
Ben Hesper coined it in 1970 to refer to the study of information processes in biotic systems.
This definition placed bioinformatics as a field parallel to biophysics or biochemistry. Margaret
Dayhoff compiled one of the first protein sequence databases, initially published as books, and
pioneered methods of sequence alignment and molecular evolution. Another early contributor
to bioinformatics was Elvin A. Kabat, who pioneered biological sequence analysis in 1970 with
his comprehensive volumes of antibody sequences released with Tai Te Wu between 1980 and
1991.

Goals

To study how normal cellular activities are altered in different disease states, the biological
data must be combined to form a comprehensive picture of these activities. Therefore, the
field of bioinformatics has evolved such that the most pressing task now involves the analysis
and interpretation of various types of data. This includes nucleotide and amino acid
sequences, protein domains, and protein structures. The actual process of analyzing and
interpreting data is referred to as computational biology. Important sub-disciplines within
bioinformatics and computational biology include:
Development and implementation of computer programs that enable efficient access to, use
and management of, various types of information

Development of new algorithms and statistical measures that assess relationships among
members of large data sets. For example, there are methods to locate a gene within a
sequence, to predict protein structure and/or function, and to cluster protein sequences into
families of related sequences.

The primary goal of bioinformatics is to increase the understanding of biological processes.
What sets it apart from other approaches, however, is its focus on developing and applying
computationally intensive techniques to achieve this goal. Examples include: pattern
recognition, data mining, machine learning algorithms, and visualization. Major research
efforts in the field include sequence alignment, gene finding, genome assembly, drug design,
drug discovery, protein structure alignment, protein structure prediction, prediction of gene
expression and protein–protein interactions, genome-wide association studies, and the
modeling of evolution.

Bioinformatics now entails the creation and advancement of databases, algorithms,
computational and statistical techniques, and theory to solve formal and practical problems
arising from the management and analysis of biological data.

Over the past few decades rapid developments in genomic and other molecular research
technologies and developments in information technologies have combined to produce a
tremendous amount of information related to molecular biology. Bioinformatics is the name
given to these mathematical and computing approaches used to glean understanding of
biological processes.

Common activities in bioinformatics include mapping and analyzing DNA and protein
sequences, aligning DNA and protein sequences to compare them, and creating and viewing
3-D models of protein structures.
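Aligning two sequences to compare them can be illustrated with a minimal sketch of local alignment scoring in the style of the Smith-Waterman algorithm. The scoring values (match, mismatch, gap) are assumptions chosen for illustration only, and the function returns just the best score rather than the alignment itself.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b.

    Uses a linear gap penalty and keeps only two rows of the dynamic
    programming table, since only the score is needed here.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


if __name__ == "__main__":
    # The shared core "ACGT" is found even though the flanks differ.
    print(smith_waterman_score("GGACGTGG", "TTACGTTT"))
```

Production tools typically use substitution matrices and affine gap penalties instead of the fixed values assumed here.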

Relation to other fields

Bioinformatics is a science field that is similar to but distinct from biological computation and
computational biology. Biological computation uses bioengineering and biology to build
biological computers, whereas bioinformatics uses computation to better understand biology.
Bioinformatics and computational biology have similar aims and approaches, but they differ in
scale: bioinformatics organizes and analyzes basic biological data, whereas computational
biology builds theoretical models of biological systems, just as mathematical biology does with
mathematical models.

Analyzing biological data to produce meaningful information involves writing and running
software programs that use algorithms from graph theory, artificial intelligence, soft
computing, data mining, image processing, and computer simulation. The algorithms in turn
depend on theoretical foundations such as discrete mathematics, control theory, system
theory, information theory, and statistics.

Sequence analysis

Since the Phage Φ-X174 was sequenced in 1977, the DNA sequences of thousands of
organisms have been decoded and stored in databases. This sequence information is analyzed
to determine genes that encode proteins, RNA genes, regulatory sequences, structural motifs,
and repetitive sequences. A comparison of genes within a species or between different species
can show similarities between protein functions, or relations between species. With the
growing amount of data, it long ago became impractical to analyze DNA sequences manually.
Today, computer programs such as BLAST are used daily to search sequences from more than
260 000 organisms, containing over 190 billion nucleotides. These programs can compensate
for mutations in the DNA sequence, to identify sequences that are related, but not identical. A
variant of this sequence alignment is used in the sequencing process itself. The so-called
shotgun sequencing technique does not produce entire chromosomes. Instead it generates
the sequences of many thousands of small DNA fragments. The ends of these fragments
overlap and, when aligned properly by a genome assembly program, can be used to
reconstruct the complete genome. Shotgun sequencing yields sequence data quickly, but the
task of assembling the fragments can be quite complicated for larger genomes. For a genome
as large as the human genome, it may take many days of CPU time on large-memory,
multiprocessor computers to assemble the fragments, and the resulting assembly usually
contains numerous gaps that must be filled in later. Shotgun sequencing is the method of
choice for virtually all genomes sequenced today, and genome assembly algorithms are a
critical area of bioinformatics research.
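The idea of reconstructing a sequence from overlapping fragments can be sketched with a toy greedy merge. This is only an illustration under strong assumptions (error-free reads, a single strand, no repeats); real genome assemblers use far more elaborate graph-based methods. The function names and the example reads are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is also a prefix of b.

    Returns 0 if no overlap of at least min_len characters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: the result stays in several contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    fragments = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]  # made-up reads
    print(greedy_assemble(fragments))
```

When no overlaps remain, the function returns several contigs, which mirrors the gaps that the text says must be filled in later.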

Following up on the goals that the Human Genome Project had left unmet when it closed in 2003,
the National Human Genome Research Institute in the U.S. launched a new project. The
so-called ENCODE project is a collaborative data collection of the functional
elements of the human genome that uses next-generation DNA-sequencing technologies and
genomic tiling arrays, technologies able to generate large amounts of data automatically at
lower research cost but with the same quality and viability.

Another aspect of bioinformatics in sequence analysis is annotation. This involves
computational gene finding to search for protein-coding genes, RNA genes, and other
functional sequences within a genome. Not all of the nucleotides within a genome are part of
genes. Within the genomes of higher organisms, large parts of the DNA do not serve any
obvious purpose.

Genome annotation

In the context of genomics, annotation is the process of marking the genes and other biological
features in a DNA sequence. This process needs to be automated because most genomes are
too large to annotate by hand, not to mention the desire to annotate as many genomes as
possible, as the rate of sequencing has ceased to pose a bottleneck. Annotation is made
possible by the fact that genes have recognisable start and stop regions, although the exact
sequence found in these regions can vary between genes.
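The notion of recognisable start and stop regions can be illustrated with a minimal open reading frame (ORF) scan. The sketch assumes the simplest possible signals: an ATG start codon and one of the three standard stop codons, searched on the forward strand only. Actual gene finders rely on much richer statistical models, so this is an illustration rather than a gene-finding method.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2):
    """Return (start, end, sequence) for ORFs on the forward strand.

    start is the 0-based index of the ATG, end is exclusive and includes
    the stop codon. min_codons counts codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, dna[start:pos + 3]))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATGGTAACC"))  # made-up sequence
```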

The first genome annotation software system was designed in 1995 by Owen White, who was
part of the team at The Institute for Genomic Research that sequenced and analyzed the first
genome of a free-living organism to be decoded, the bacterium Haemophilus influenzae.

track and share information on an increasingly large number of species and organisms

Future work endeavours to reconstruct the now more complex tree of life.

The area of research within computer science that uses genetic algorithms is sometimes
confused with computational evolutionary biology, but the two areas are not necessarily
related.

Comparative genomics

The core of comparative genome analysis is the establishment of the correspondence between
genes or other genomic features in different organisms. It is these intergenomic maps that
make it possible to trace the evolutionary processes responsible for the divergence of two
genomes. A multitude of evolutionary events acting at various organizational levels shape
genome evolution. At the lowest level, point mutations affect individual nucleotides. At a
higher level, large chromosomal segments undergo duplication, lateral transfer, inversion,
transposition, deletion and insertion.

Ultimately, whole genomes are involved in processes of hybridization, polyploidization and
endosymbiosis, often leading to rapid speciation. The complexity of genome evolution poses
many exciting challenges to developers of mathematical models and algorithms, who have
recourse to a spectrum of algorithmic, statistical and mathematical techniques, ranging from
exact, heuristic, fixed-parameter and approximation algorithms for problems based on
parsimony models to Markov chain Monte Carlo algorithms for Bayesian analysis of problems
based on probabilistic models.

Many of these studies are based on homology detection and the computation of protein families.

Pan genomics

Pan genomics is a concept introduced in 2005 by Tettelin and Medini which eventually took
root in bioinformatics. The pan genome is the complete gene repertoire of a particular taxonomic
group: although initially applied to closely related strains of a species, it can be applied to a
larger context such as a genus or phylum. It is divided into two parts: the core genome, the set
of genes common to all the genomes under study, and the dispensable (or flexible) genome, the
set of genes present in one or some, but not all, of the genomes under study.
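The split between core and dispensable genes maps directly onto set operations. The following sketch assumes each genome has already been reduced to a set of gene identifiers; the strain and gene names are hypothetical.

```python
def pan_genome(genomes: dict):
    """Return (pan, core, dispensable) gene sets for a group of genomes.

    genomes maps a genome name to the set of gene identifiers it contains.
    """
    gene_sets = list(genomes.values())
    if not gene_sets:
        return set(), set(), set()
    pan = set().union(*gene_sets)            # every gene seen in any genome
    core = set.intersection(*gene_sets)      # genes shared by all genomes
    dispensable = pan - core                 # genes missing from at least one
    return pan, core, dispensable


if __name__ == "__main__":
    strains = {  # hypothetical gene content
        "strain1": {"g1", "g2", "g3"},
        "strain2": {"g1", "g2", "g4"},
        "strain3": {"g1", "g2", "g3", "g5"},
    }
    pan, core, dispensable = pan_genome(strains)
    print(sorted(core), sorted(dispensable))
```

In practice, deciding which genes in different genomes count as "the same" gene is the hard part and is assumed to be solved upstream of this step.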

Genetics of disease

With the advent of next-generation sequencing we are obtaining enough sequence data to
map the genes of complex diseases such as infertility, breast cancer or Alzheimer's Disease.
Genome-wide association studies are essential to pinpoint the mutations for such complex
diseases. Furthermore, the possibility for genes to be used in prognosis, diagnosis or
treatment is one of the most essential applications. Many studies are discussing both the
promising ways to choose the genes to be used and the problems and pitfalls of using genes to
predict disease presence or prognosis.

Analysis of mutations in cancer

In cancer, the genomes of affected cells are rearranged in complex or even unpredictable
ways. Massive sequencing efforts are used to identify previously unknown point mutations in a
variety of genes in cancer. Bioinformaticians continue to produce specialized automated
systems to manage the sheer volume of sequence data produced, and they create new
algorithms and software to compare the sequencing results to the growing collection of
human genome sequences and germline polymorphisms. New physical detection technologies
are employed, such as oligonucleotide microarrays to identify chromosomal gains and losses,
and single-nucleotide polymorphism arrays to detect known point mutations. These detection
methods simultaneously measure several hundred thousand sites throughout the genome,
and when used in high-throughput to measure thousands of samples, generate terabytes of
data per experiment. Again the massive amounts and new types of data generate new
opportunities for bioinformaticians. The data is often found to contain considerable variability,
or noise, and thus Hidden Markov model and change-point analysis methods are being
developed to infer real copy number changes.

However, with the breakthroughs that next-generation sequencing technology is providing
to the field of bioinformatics, cancer genomics may change drastically. These new methods
and software allow bioinformaticians to sequence many cancer genomes rapidly and
affordably. This could mean a more flexible process for classifying types of cancer by analysis of
cancer-driven mutations in the genome. Furthermore, individual tracking of patients during the
progression of the disease may be possible in the future by sequencing cancer samples.

Another type of data that requires novel informatics development is the analysis of lesions
found to be recurrent among many tumors.

Gene and protein expression

Analysis of gene expression

The expression of many genes can be determined by measuring mRNA levels with multiple
techniques including microarrays, expressed cDNA sequence tag sequencing, serial analysis of
gene expression tag sequencing, massively parallel signature sequencing, RNA-Seq, also
known as "Whole Transcriptome Shotgun Sequencing", or various applications of multiplexed
in-situ hybridization. All of these techniques are extremely noise-prone and/or subject to bias
in the biological measurement, and a major research area in computational biology involves
developing statistical tools to separate signal from noise in high-throughput gene expression
studies. Such studies are often used to determine the genes implicated in a disorder: one
might compare microarray data from cancerous epithelial cells to data from non-cancerous
cells to determine the transcripts that are up-regulated and down-regulated in a particular
population of cancer cells.
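The comparison of cancerous and non-cancerous samples described above can be sketched as a log2 fold change per gene. This is a deliberately simple illustration: the pseudocount, the threshold and the gene names are assumptions, and no statistical test is performed, although separating signal from noise in real data requires one.

```python
import math
from statistics import mean


def log2_fold_changes(tumor: dict, normal: dict, pseudocount: float = 1.0) -> dict:
    """Log2 ratio of mean expression (tumor over normal) for shared genes.

    The pseudocount avoids division by zero for unexpressed genes.
    """
    return {
        gene: math.log2((mean(tumor[gene]) + pseudocount)
                        / (mean(normal[gene]) + pseudocount))
        for gene in tumor.keys() & normal.keys()
    }


def classify(lfc: dict, threshold: float = 1.0) -> dict:
    """Label each gene as up-regulated, down-regulated or unchanged."""
    labels = {}
    for gene, value in lfc.items():
        if value >= threshold:
            labels[gene] = "up"
        elif value <= -threshold:
            labels[gene] = "down"
        else:
            labels[gene] = "unchanged"
    return labels


if __name__ == "__main__":
    tumor = {"geneA": [120, 134], "geneB": [6, 8], "geneC": [14, 16]}     # made-up values
    normal = {"geneA": [30, 32], "geneB": [30, 32], "geneC": [14, 16]}
    print(classify(log2_fold_changes(tumor, normal)))
```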

Analysis of protein expression

Protein microarrays and high throughput mass spectrometry can provide a snapshot of the
proteins present in a biological sample. Bioinformatics is very much involved in making sense
of protein microarray and HT MS data; the former approach faces similar problems as with
microarrays targeted at mRNA, the latter involves the problem of matching large amounts of
mass data against predicted masses from protein sequence databases, and the complicated
statistical analysis of samples where multiple, but incomplete peptides from each protein are
detected.
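Matching measured masses against masses predicted from protein sequences can be sketched as follows. The example assumes an in silico tryptic digest (cleavage after K or R, except before P), neutral monoisotopic peptide masses, and a fixed absolute tolerance; it ignores charge states, modifications and missed cleavages. The protein name and sequence are made up.

```python
# Monoisotopic residue masses in daltons (standard values, rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def tryptic_digest(protein: str):
    """Split a protein after K or R, unless the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide: str) -> float:
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def match_masses(observed, proteins: dict, tolerance: float = 0.01):
    """Return (observed_mass, protein_name, peptide) for every match."""
    predicted = [(peptide_mass(p), name, p)
                 for name, seq in proteins.items()
                 for p in tryptic_digest(seq)]
    return [(mass, name, pep)
            for mass in observed
            for pred, name, pep in predicted
            if abs(mass - pred) <= tolerance]


if __name__ == "__main__":
    database = {"protX": "MKWVTFISLLRGAK"}  # hypothetical entry
    print(match_masses([274.164], database))
```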

Analysis of regulation

Regulation is the complex orchestration of events starting with an extracellular signal such as a
hormone and leading to an increase or decrease in the activity of one or more proteins.
Bioinformatics techniques have been applied to explore various steps in this process. For
example, promoter analysis involves the identification and study of sequence motifs in the
DNA surrounding the coding region of a gene. These motifs influence the extent to which that
region is transcribed into mRNA. Expression data can be used to infer gene regulation: one
might compare microarray data from a wide variety of states of an organism to form
hypotheses about the genes involved in each state. In a single-cell organism, one might
compare stages of the cell cycle, along with various stress conditions. One can then apply
clustering algorithms to that expression data to determine which genes are co-expressed. For
example, the upstream regions of co-expressed genes can be searched for over-represented
regulatory elements. Examples of clustering algorithms applied in gene clustering are k-means
clustering, self-organizing maps, hierarchical clustering, and consensus clustering methods
such as the Bi-CoPaM. The latter, Bi-CoPaM, was proposed to address
various issues specific to gene discovery problems such as consistent co-expression of genes
over multiple microarray datasets.
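Searching upstream regions of co-expressed genes for over-represented elements can be sketched by comparing how often each k-mer occurs in the co-expressed set versus a background set. The enrichment ratio with add-one smoothing used here is an assumption made for illustration, not a method taken from the text, and the sequences are made up.

```python
from collections import Counter


def kmer_presence(seqs, k: int) -> Counter:
    """Count, for each k-mer, the number of sequences that contain it."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def enriched_kmers(cluster, background, k: int = 6, min_ratio: float = 2.0):
    """Return (kmer, hits_in_cluster, ratio) sorted by decreasing ratio."""
    fg = kmer_presence(cluster, k)
    bg = kmer_presence(background, k)
    results = []
    for kmer, hits in fg.items():
        fg_frac = hits / len(cluster)
        # Add-one smoothing keeps the ratio finite for unseen k-mers.
        bg_frac = (bg[kmer] + 1) / (len(background) + 1)
        ratio = fg_frac / bg_frac
        if ratio >= min_ratio:
            results.append((kmer, hits, ratio))
    return sorted(results, key=lambda item: (-item[2], item[0]))


if __name__ == "__main__":
    upstream = ["AATGACTCATT", "CCTGACTCAGG", "GGTGACTCACC"]       # co-expressed set
    other = ["AAAAAAAAAAA", "CCCCCCCCCCC", "GGGGGGGGGGG", "TTTTTTTTTTT"]
    print(enriched_kmers(upstream, other, k=7))
```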

Structural bioinformatics

Protein structure prediction is another important application of bioinformatics. The amino acid
sequence of a protein, the so-called primary structure, can be easily determined from the
sequence on the gene that codes for it. In the vast majority of cases, this primary structure
uniquely determines a structure in its native environment. Knowledge of this structure is vital
in understanding the function of the protein. Structural information is usually classified as one
of secondary, tertiary and quaternary structure. A viable general solution to such predictions
remains an open problem. Most efforts have so far been directed towards heuristics that work
most of the time.

One of the key ideas in bioinformatics is the notion of homology. In the genomic branch of
bioinformatics, homology is used to predict the function of a gene: if the sequence of gene A,
whose function is known, is homologous to the sequence of gene B, whose function is
unknown, one could infer that B may share A's function. In the structural branch of
bioinformatics, homology is used to determine which parts of a protein are important in
structure formation and interaction with other proteins. In a technique called homology
modeling, this information is used to predict the structure of a protein once the structure of a
homologous protein is known. This currently remains the only way to predict protein
structures reliably.

One example of this is the homology between hemoglobin in humans and the
hemoglobin in legumes. Both serve the same purpose of transporting oxygen in the organism.
Though both of these proteins have completely different amino acid sequences, their protein
structures are virtually identical, which reflects their near-identical purposes.

Other techniques for predicting protein structure include protein threading and de novo
physics-based modeling.

Network and systems biology

Network analysis seeks to understand the relationships within biological networks such as
metabolic or protein-protein interaction networks. Although biological networks can be
constructed from a single type of molecule or entity, network biology often attempts to
integrate many different data types, such as proteins, small molecules, gene expression data,
and others, which are all connected physically, functionally, or both.
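A biological network built from a single type of entity can be sketched as an undirected graph. The example below assumes interactions are given as pairs of protein identifiers (the names are hypothetical) and computes two basic properties: the degree of each node and the connected components.

```python
from collections import defaultdict, deque


def build_network(interactions) -> dict:
    """Build an undirected adjacency map from (a, b) interaction pairs."""
    graph = defaultdict(set)
    for a, b in interactions:
        if a != b:  # ignore self-interactions in this sketch
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def degrees(graph: dict) -> dict:
    """Number of interaction partners for each node."""
    return {node: len(neighbours) for node, neighbours in graph.items()}


def connected_components(graph: dict):
    """Group nodes that are reachable from one another (breadth-first)."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


if __name__ == "__main__":
    pairs = [("P1", "P2"), ("P2", "P3"), ("P2", "P4"), ("P5", "P6")]
    net = build_network(pairs)
    print(degrees(net))
    print(connected_components(net))
```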

Systems biology involves the use of computer simulations of cellular subsystems to both
analyze and visualize the complex connections of these cellular processes. Artificial life or
virtual evolution attempts to understand evolutionary processes via the computer simulation
of simple life forms.

Molecular interaction networks


Tens of thousands of three-dimensional protein structures have been determined by X-ray
crystallography and protein nuclear magnetic resonance spectroscopy and a central question
in structural bioinformatics is whether it is practical to predict possible protein–protein
interactions only based on these 3D shapes, without performing protein–protein interaction
experiments. A variety of methods have been developed to tackle the protein–protein docking
problem, though it seems that there is still much work to be done in this field.

Other interactions encountered in the field include protein–ligand and protein–peptide
interactions.

Molecular dynamics simulation of the movement of atoms about rotatable bonds is the
fundamental principle behind computational algorithms, termed docking algorithms, for
studying molecular interactions.

Others

Literature analysis

The growth in the volume of published literature makes it virtually impossible to read every
paper, resulting in disjointed sub-fields of research. Literature analysis aims to employ
computational and statistical linguistics to mine this growing library of text resources. For
example:

Abbreviation recognition – identify the long-form and abbreviation of biological terms

Named entity recognition – recognizing biological terms such as gene names

Protein-protein interaction – identify which proteins interact with which proteins from text

The area of research draws from statistics and computational linguistics.
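Abbreviation recognition, the first task listed above, can be sketched with a simple heuristic: find an upper-case short form in parentheses and check whether the initials of the preceding words spell it out. This is an assumption-laden toy rule, far cruder than methods used in practice, and it misses abbreviations that are not plain initialisms.

```python
import re


def find_abbreviations(text: str) -> dict:
    """Map each parenthesised short form to a matching long form, if any."""
    pairs = {}
    for match in re.finditer(r"\(([A-Z]{2,10})\)", text):
        short = match.group(1)
        # Words that appear before the opening parenthesis.
        words = re.findall(r"[A-Za-z]+", text[:match.start()])
        candidate = words[-len(short):]
        initials_match = (
            len(candidate) == len(short)
            and all(word[0].upper() == letter
                    for word, letter in zip(candidate, short))
        )
        if initials_match:
            pairs[short] = " ".join(candidate)
    return pairs


if __name__ == "__main__":
    sentence = ("We applied named entity recognition (NER) and a hidden "
                "Markov model (HMM) to the corpus (see Methods).")
    print(find_abbreviations(sentence))
```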

High-throughput image analysis

Computational technologies are used to accelerate or fully automate the processing,
quantification and analysis of large amounts of high-information-content biomedical imagery.
Modern image analysis systems augment an observer's ability to make measurements from a
large or complex set of images, by improving accuracy, objectivity, or speed. A fully developed
analysis system may completely replace the observer. Although these systems are not unique
to biomedical imagery, biomedical imaging is becoming more important for both diagnostics
and research. Some examples are:

high-throughput and high-fidelity quantification and sub-cellular localization

morphometrics

clinical image analysis and visualization

determining the real-time air-flow patterns in breathing lungs of living animals

quantifying occlusion size in real-time imagery from the development of and recovery during
arterial injury

making behavioral observations from extended video recordings of laboratory animals

infrared measurements for metabolic activity determination

inferring clone overlaps in DNA mapping, e.g. the Sulston score


High-throughput single cell data analysis

Computational techniques are used to analyse high-throughput, low-measurement single cell
data, such as that obtained from flow cytometry. These methods typically involve finding
populations of cells that are relevant to a particular disease state or experimental condition.

Biodiversity informatics

Biodiversity informatics deals with the collection and analysis of biodiversity data, such as
taxonomic databases, or microbiome data. Examples of such analyses include phylogenetics,
niche modelling, species richness mapping, or species identification tools.

Databases

Databases are essential for bioinformatics research and applications. There is a huge number
of available databases covering almost everything from DNA and protein sequences, molecular
structures, to phenotypes and biodiversity. Databases generally fall into one of three types.
Some contain data resulting directly from empirical methods such as gene knockouts. Others
consist of predicted data, and most contain data from both sources. There are meta-databases
that incorporate data compiled from multiple other databases. Some others are specialized,
such as those specific to an organism. These databases vary in their format, way of accession
and whether they are public or not. Some of the most commonly used databases are listed
below.

Used in Motif Finding:

Used in Gene Ontology:

Used in Gene Finding: Hidden Markov Model

Used in finding Protein Structures/Family:

Used for Next Generation Sequencing: FASTQ Format

Used in Gene Expression Analysis:

Used in Network Analysis: Interaction Analysis Databases, Functional Networks

Please keep in mind that this is a quick sampling and that most computational data is generally
supported by wet lab data as well.
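The FASTQ format mentioned in the list above stores each read as four lines: a header starting with "@", the sequence, a separator starting with "+", and a quality string. A minimal reader might look like the sketch below, which assumes unwrapped four-line records and the common Phred+33 quality encoding; files using other conventions would need a different offset.

```python
import io


def read_fastq(handle, offset: int = 33):
    """Yield (read_id, sequence, qualities) from a four-line FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Convert each quality character to its numeric Phred score.
        yield header[1:], seq, [ord(ch) - offset for ch in qual]


if __name__ == "__main__":
    example = io.StringIO("@read1\nACGT\n+\nII!I\n")  # made-up record
    for read_id, seq, quals in read_fastq(example):
        print(read_id, seq, quals)
```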

Software and tools

Software tools for bioinformatics range from simple command-line tools, to more complex
graphical programs and standalone web-services available from various bioinformatics
companies or public institutions.

Open-source bioinformatics software

Many free and open-source software tools have existed and continued to grow since the
1980s. The combination of a continued need for new algorithms for the analysis of emerging
types of biological readouts, the potential for innovative in silico experiments, and freely
available open code bases has helped to create opportunities for all research groups to
contribute to both bioinformatics and the range of open-source software available, regardless
of their funding arrangements. The open-source tools often act as incubators of ideas, or as
community-supported plug-ins in commercial applications. They may also provide de facto
standards and shared object models for assisting with the challenge of bioinformation
integration.

The range of open-source software packages includes titles such as Bioconductor, BioPerl,
Biopython, BioJava, BioJS, BioRuby, Bioclipse, EMBOSS, .NET Bio, Apache Taverna, and UGENE.
To maintain this tradition and create further opportunities, the non-profit Open Bioinformatics
Foundation

An alternative method to build public bioinformatics databases is to use the MediaWiki engine
with the extension. This system allows the database to be accessed and updated by all experts
in the field.

Web services in bioinformatics

SOAP- and REST-based interfaces have been developed for a wide variety of bioinformatics
applications allowing an application running on one computer in one part of the world to use
algorithms, data and computing resources on servers in other parts of the world. The main
advantages derive from the fact that end users do not have to deal with software and
database maintenance overheads.

Basic bioinformatics services are classified by the EBI into three categories: SSS (Sequence
Search Services), MSA (Multiple Sequence Alignment), and BSA (Biological Sequence Analysis).
The availability of these service-oriented bioinformatics resources demonstrates the
applicability of web-based bioinformatics solutions, and range from a collection of standalone
tools with a common data format under a single, standalone or web-based interface, to
integrative, distributed and extensible bioinformatics workflow management systems.

Bioinformatics workflow management systems

A bioinformatics workflow management system is a specialized form of a workflow
management system designed specifically to compose and execute a series of computational
or data manipulation steps, or a workflow, in a bioinformatics application. Such systems are
designed to

provide an easy-to-use environment for individual application scientists themselves to create
their own workflows

provide interactive tools for the scientists enabling them to execute their workflows and view
their results in real-time

simplify the process of sharing and reusing workflows between the scientists

enable scientists to track the provenance of the workflow execution results and the workflow
creation steps.

Some of the platforms giving this service: Galaxy, Kepler, Taverna, UGENE, Anduril.
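The core idea of executing a series of steps while tracking provenance can be sketched in a few lines. This toy runner is not modelled on any of the platforms named above; the step functions, the digest scheme and the record layout are all assumptions made for illustration.

```python
import hashlib


def digest(value) -> str:
    """Short fingerprint of a value, used to link inputs and outputs."""
    return hashlib.sha256(repr(value).encode("utf-8")).hexdigest()[:12]


def run_workflow(steps, data):
    """Apply each (name, function) step in order and log provenance."""
    provenance = []
    for name, func in steps:
        result = func(data)
        provenance.append({
            "step": name,
            "input": digest(data),
            "output": digest(result),
        })
        data = result
    return data, provenance


def clean(seq: str) -> str:
    """Remove whitespace and normalise to upper case."""
    return "".join(seq.split()).upper()


def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def gc_content(seq: str) -> float:
    return (seq.count("G") + seq.count("C")) / len(seq)


STEPS = [
    ("clean", clean),
    ("reverse_complement", reverse_complement),
    ("gc_content", gc_content),
]

if __name__ == "__main__":
    result, log = run_workflow(STEPS, "atgc gcta\n")
    print(result)
    for entry in log:
        print(entry)
```

Because each record stores a fingerprint of its input and output, the output of one step can be matched to the input of the next, which is the minimum needed to trace how a result was produced.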

Education platforms

Software platforms designed to teach bioinformatics concepts and methods include Rosalind
and online courses offered through the Swiss Institute of Bioinformatics Training Portal. The
Canadian Bioinformatics Workshops provides videos and slides from training workshops on
their website under a Creative Commons license.

Conferences

There are several large conferences that are concerned with bioinformatics. Some of the most
notable examples are Intelligent Systems for Molecular Biology, European Conference on
Computational Biology, and Research in Computational Molecular Biology.

