Introduction To Bioinformatics - BCHS 4214
I- Introduction
Bioinformatics is the application of computational tools to the storage, retrieval, analysis,
interpretation and visualization of biological data. It is a branch of science that integrates
computer science, mathematics, statistics, chemistry and engineering to analyze, explore, integrate
and exploit biological data in research and development.
History of Bioinformatics
Bioinformatics emerged as a field in the mid-1990s. Key milestones leading up to it include:
• From 1965 to 1978, Margaret O. Dayhoff established the first database of protein sequences,
published annually as a series of volumes entitled “Atlas of Protein Sequence and Structure”.
• From 1977, DNA sequences began to accumulate slowly in the literature, and it became more common
to predict protein sequences by translating sequenced genes than by sequencing proteins directly.
As a result, the number of uncharacterised proteins began to increase.
• By the early 1980s, there were enough DNA sequences to justify the establishment of nucleotide
sequence databases, among them GenBank, which is now maintained by the National Center for
Biotechnology Information (NCBI), USA. NCBI serves as a primary databank provider.
• The European Molecular Biology Laboratory (EMBL) nucleotide sequence data library, now hosted at
the European Bioinformatics Institute (EBI), was established in 1980 to collect, organize and
distribute nucleotide sequence data and related information.
• In 1984, the National Biomedical Research Foundation (NBRF) established the Protein Information
Resource (PIR).
• In 1986, the DNA Data Bank of Japan (DDBJ) was established by GenomeNet, Japan.
• All these data banks operate in close collaboration and regularly exchange data.
• Management and analysis of the rapidly accumulating sequence data required new computer
software and statistical tools. This attracted scientists from computer science and mathematics to
the fast-emerging field of bioinformatics.
a) Sequence Analysis:
Applications that analyze various types of sequence information and compare similar sequences are
grouped under sequence analysis.
Sequence analysis uses sequencing data to identify genes that encode peptides, as well as
regulatory sequences. Many powerful software tools are available for analyzing the genomes of
various organisms. These tools also detect DNA mutations in an organism and identify related
sequences. In shotgun sequencing, DNA is sequenced as numerous fragments, and special software
detects the overlaps between fragments and assembles them.
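The overlap-and-assemble idea behind shotgun sequencing can be illustrated with a minimal Python sketch. It is only a toy: the function names, the three example fragments and the `min_overlap` threshold are assumptions made for illustration, and real assemblers also handle sequencing errors, reverse complements and repeats.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Return the length of the longest suffix of `left` that equals a prefix of `right`.

    Overlaps shorter than `min_overlap` (assumed >= 1) are ignored and reported as 0.
    """
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def merge_fragments(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Join two fragments on their suffix/prefix overlap, or return None if there is none."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


# Hypothetical fragments, already listed in the order in which they overlap.
fragments = ["ATGCGTAC", "GTACCTGA", "CTGATTAG"]

contig = fragments[0]
for fragment in fragments[1:]:
    merged = merge_fragments(contig, fragment)
    if merged is not None:
        contig = merged

print(contig)
```

Here the fragments are merged left to right because their order is known in advance. An actual assembler has to work out that order itself, typically by comparing all pairs of fragments.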
b) Function Analysis:
These applications analyze the function encoded within sequences and help predict functional
interactions between proteins or genes. Expression analysis of genes is also a major research
topic.
c) Structure Analysis:
For RNA and proteins, structure plays a vital role in how a molecule interacts with other
molecules. This gave rise to a whole new branch, termed structural bioinformatics, which is
devoted to predicting the structures of proteins and RNA and the possible roles of those structures.
d) Prediction of Protein Structure:
It is easy to determine the primary structure of a protein, that is, its amino acid sequence,
because it is encoded in the DNA sequence of the gene. It is much more difficult to determine the
secondary, tertiary or quaternary structure. For this purpose, either experimental methods such as
crystallography or bioinformatics tools are used to determine complex protein structures.
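As a hedged illustration of why the primary structure is the easy part, the sketch below translates a coding DNA sequence into amino acids with the standard genetic code. The `translate` function and the example sequence are assumptions for illustration. The sketch reads a single forward frame only and ignores introns and alternative genetic codes.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence (first reading frame) into one-letter amino acids.

    Translation stops at the first stop codon. Codons containing characters
    other than A, C, G or T are reported as 'X'.
    """
    dna = dna.upper()
    protein = []
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[start:start + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical coding sequence: ATG GCC AAG TAA
print(translate("ATGGCCAAGTAA"))
```

Higher levels of structure cannot be read off the sequence in this way, which is why they need experimental methods or dedicated prediction tools.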
e) Comparative Genomics:
Comparative genomics is the branch of bioinformatics that determines how genomic structure and
function are related between different biological species. For this purpose, intergenomic maps are
constructed, which enable scientists to trace the evolutionary processes that occur in the genomes
of different species. These maps contain information about point mutations as well as about the
duplication of large chromosomal segments.
f) Health and Drug Discovery:
The tools of bioinformatics are also helpful in drug discovery, diagnosis and disease management.
Complete sequencing of human genes has enabled scientists to make medicines and drugs that can
target more than 500 genes. Computational tools and newly identified drug targets have made drug
delivery more specific, because only diseased or mutated cells need to be targeted. It has also
become easier to determine the molecular basis of a disease.
Drug development
At present all drugs on the market target only about 500 proteins. With an improved understanding
of disease mechanisms and using computational tools to identify and validate new drug targets,
more specific medicines that act on the cause, not merely the symptoms, of the disease can be
developed. These highly specific drugs promise to have fewer side effects than many of today's
medicines.
Microbial genome applications
Microorganisms are ubiquitous, that is, they are found everywhere. They have been found surviving
and thriving in extremes of heat, cold, radiation, salt, acidity and pressure. They are present in the
environment, our bodies, the air, food and water. Traditionally, use has been made of a variety of
microbial properties in the baking, brewing and food industries. The arrival of the complete genome
sequences and their potential to provide a greater insight into the microbial world and its capacities
could have broad and far reaching implications for environment, health, energy and industrial
applications. For these reasons, in 1994, the US Department of Energy (DOE) initiated the MGP
(Microbial Genome Project) to sequence genomes of bacteria useful in energy production,
environmental cleanup, industrial processing and toxic waste reduction. By studying the genetic
material of these organisms, scientists can begin to understand these microbes at a very fundamental
level and isolate the genes that give them their unique abilities to survive under extreme conditions.
Antibiotic resistance
Scientists have been examining the genome of Enterococcus faecalis, a leading cause of bacterial
infection among hospital patients. They have discovered a virulence region made up of a number of
antibiotic-resistance genes that may contribute to the bacterium's transformation from a harmless
gut bacterium to a menacing invader. The discovery of the region, known as a pathogenicity island, could
provide useful markers for detecting pathogenic strains and help to establish controls to prevent the
spread of infection in wards.
Enzyme Database
- BRENDA (Braunschweig Enzyme Database) (http://www.brenda.uni-koeln.de/)
- ENZYME, a part of ExPASy (Expert Protein Analysis System, the proteomics server of the Swiss
Institute of Bioinformatics).
Clinical Databases
These generally contain information about humans.
- Human Gene Mutation Database, Cardiff, UK (http://www.hgmd.org): It registers known
mutations in the human genome and the diseases they cause.
- OMIM database (Online Mendelian Inheritance in Man, http://www.ncbi.nlm.nih.gov/Omim). The
OMIM database contains abstracts and texts describing genetic disorders to support genomics
efforts and clinical genetics. It provides gene maps and maps of known disorders in tabular
format. It also supports keyword search.
Bibliographic Databases
They are used for searching for reference articles.
- PubMed, for example, enables users to do keyword searches, provides links to a selection of full
articles, and has text mining capabilities, e.g. it provides links to related articles and GenBank
entries, among others. It contains more than 30 million abstracts of scientific publications.
V- Sequence Alignment
A sequence alignment is a way of arranging the sequences of DNA or protein to identify regions of
similarity that may be a consequence of functional, structural, or evolutionary relationships between
the sequences. Some essential terms are:
Similarity: The extent to which nucleotide or protein sequences are related. It is based upon
identity plus conservation.
Identity: The extent to which two sequences are invariant.
Conservation: Changes at a specific position of an amino acid (or, less commonly, DNA) sequence
that preserve the physicochemical properties of the original residue.
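The relationship "similarity = identity + conservation" can be made concrete with a short Python sketch that scores two sequences that have already been aligned. The residue groups below are a simplified, assumed classification chosen for illustration (real tools usually rely on substitution matrices). The choice to skip gapped columns is also an assumption, since conventions for the denominator differ between programs.

```python
# Assumed, simplified groups of amino acids with similar physicochemical properties.
CONSERVATIVE_GROUPS = ["AVLIM", "FWY", "KRH", "DE", "STNQ"]


def same_group(residue_a: str, residue_b: str) -> bool:
    """True if both residues fall in the same illustrative property group."""
    return any(residue_a in group and residue_b in group for group in CONSERVATIVE_GROUPS)


def identity_and_similarity(aligned_a: str, aligned_b: str) -> tuple:
    """Return (percent identity, percent similarity) over ungapped aligned columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = conserved = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # columns with a gap are skipped (an assumption of this sketch)
        compared += 1
        if a == b:
            identical += 1
        elif same_group(a, b):
            conserved += 1
    if compared == 0:
        return 0.0, 0.0
    identity = 100.0 * identical / compared
    similarity = 100.0 * (identical + conserved) / compared
    return identity, similarity


# Hypothetical aligned peptides: 3 identical columns, 2 conservative changes, 1 other change.
identity, similarity = identity_and_similarity("MKVLDE", "MRILDQ")
print(f"identity = {identity:.1f}%, similarity = {similarity:.1f}%")
```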
There are two types of alignment: pairwise alignment and multiple alignment.
Pairwise sequence alignment methods are used to find the best-matching piecewise (local) or
global alignments of two query sequences. Pairwise alignments can only be used between two
sequences at a time, but they are efficient to calculate and are often used for methods that do not
require extreme precision (such as searching a database for sequences with high similarity to a
query). The three primary methods of producing pairwise alignments are dot-matrix methods,
dynamic programming, and word methods.
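Of the three methods, dynamic programming is the easiest to show in a few lines. The sketch below is a minimal global alignment in the style of the Needleman-Wunsch algorithm. The scoring values (match +1, mismatch -1, gap -1) and the example sequences are assumptions for illustration. Practical tools use substitution matrices and more elaborate gap penalties.

```python
def global_align(seq_a: str, seq_b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Globally align two sequences and return (score, aligned_a, aligned_b)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1

    # score[i][j] holds the best score for aligning seq_a[:i] with seq_b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + pair:
                aligned_a.append(seq_a[i - 1])
                aligned_b.append(seq_b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(seq_a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(seq_b[j - 1])
            j -= 1

    return score[-1][-1], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


# Hypothetical query sequences that differ by one deleted base.
best_score, row_a, row_b = global_align("GATTACA", "GATACA")
print(best_score)
print(row_a)
print(row_b)
```

When several alignments share the best score, the traceback returns only one of them. The table has (m + 1) × (n + 1) cells for sequences of length m and n, so the cost grows with the product of the two lengths.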
Sequence Analyses for relatedness
• Homologs: similar sequences in different organisms derived from a common ancestor sequence.
• Orthologs: homologous sequences in different related species that arose from a common ancestral
gene during speciation. Orthologs are presumed to have similar biological function, e.g. human and
rat myoglobins both transport oxygen in muscle.
• Paralogs: homologous genes within the same organism, e.g. human α and β globins are paralogs.
Paralogs are the result of gene duplication events.
• Xenologs: similar sequences that have arisen out of horizontal transfer events (symbiosis,
viruses, etc.).
Steps in Phylogenetic Analysis
The basic steps in any phylogenetic analysis include:
1. Assemble and align a dataset. The first step is to identify a protein or DNA sequence of interest
and assemble a dataset consisting of other related sequences. DNA sequences of interest can be
retrieved using NCBI BLAST or similar search tools. Once sequences are selected and retrieved, a
multiple sequence alignment is created. This involves arranging a set of sequences in a matrix to
identify regions of homology. There are many websites and software programs, such as
ClustalW, MSA, MAFFT, and T-Coffee, designed to perform multiple sequence alignment on a given
set of molecular data.
2. Build (estimate) phylogenetic trees from sequences using computational methods and stochastic
models. To build phylogenetic trees, statistical methods are applied to determine the tree topology
and calculate the branch lengths that best describe the phylogenetic relationships of the aligned
sequences in a dataset. The most common computational methods applied include distance-matrix
methods and discrete data methods, such as maximum parsimony and maximum likelihood.
There are several software packages, such as PAUP, PAML, and PHYLIP, that apply these most
popular methods.
3. Statistically test and assess the estimated trees. Tree-estimating algorithms generate one or
more optimal trees. This set of possible trees is subjected to a series of statistical tests to evaluate
whether one tree is better than another and whether the proposed phylogeny is reasonable. Common
methods for assessing trees include the bootstrap and jackknife resampling methods, and
analytical methods, such as parsimony, distance, and likelihood.
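Two pieces of the steps above lend themselves to a small Python sketch: the pairwise distances that distance-matrix methods start from (step 2), and the column resampling behind the bootstrap (step 3). The species names, the toy alignment and the function names are hypothetical. The distance used is the simple proportion of differing sites (p-distance), and tree construction itself is not shown.

```python
import random
from typing import Dict, Tuple


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences, ignoring gapped columns."""
    compared = differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differences += 1
    return differences / compared if compared else 0.0


def distance_matrix(alignment: Dict[str, str]) -> Dict[Tuple[str, str], float]:
    """Pairwise p-distances for every unordered pair of sequences in the alignment."""
    names = list(alignment)
    return {
        (first, second): p_distance(alignment[first], alignment[second])
        for index, first in enumerate(names)
        for second in names[index + 1:]
    }


def bootstrap_replicate(alignment: Dict[str, str], rng: random.Random) -> Dict[str, str]:
    """Resample alignment columns with replacement, keeping each column intact across sequences."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[col] for col in columns) for name, seq in alignment.items()}


# Hypothetical toy alignment (all sequences have the same length).
alignment = {
    "human": "ATGCTAGC",
    "chimp": "ATGCTAGT",
    "mouse": "ATGTTCGT",
}

distances = distance_matrix(alignment)
for pair, value in distances.items():
    print(pair, value)

# One bootstrap replicate; a real analysis would build a tree from each of many replicates.
replicate = bootstrap_replicate(alignment, random.Random(0))
print(replicate)
```

In a full bootstrap analysis, the support for a branch is the fraction of replicate trees in which that branch appears.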
Bioinformatics Tools for Phylogenetic Analysis
There are several bioinformatics tools and databases that can be used for phylogenetic analysis.
These include PANTHER, P-POD, Pfam, TreeFam, and the PhyloFacts structural phylogenomic
encyclopedia. Each of these databases uses different algorithms and draws on different sources for
sequence information, and therefore the trees estimated by PANTHER, for example, may differ
significantly from those generated by P-POD or Pfam. As with all bioinformatics tools of this type,
it is important to test different methods, compare the results, and then determine which database
works best (according to consensus results) for studies involving different types of datasets.