Bioinformatics (STH Sir)
The term bioinformatics was coined by Paulien Hogeweg in 1979 for the study of informatic
processes in biotic systems. Since the late 1980s its primary use has been in genomics and
genetics, particularly in those areas of genomics involving large-scale DNA sequencing.
Bioinformatics can be defined as the application of computer technology to the management of
biological information. Bioinformatics is the science of storing, extracting, organizing, analyzing,
interpreting and utilizing information from biological sequences and molecules. It has been
mainly fueled by advances in DNA sequencing and mapping techniques. Over the past few
decades rapid developments in genomic and other molecular research technologies and
developments in information technologies have combined to produce a tremendous amount of
information related to molecular biology. The primary goal of bioinformatics is to increase the
understanding of biological processes.
Put briefly, bioinformatics is the branch of science concerned with information and information
flow in biological systems, especially the use of computational methods in genetics and genomics.
Biological Databases- In the post-genomic era, nucleotide and protein sequences from many
different organisms are available. This has paved the way for determining the secondary and 3-D
structures of proteins as well. This vast amount of information is processed and arranged
systematically in different biological databases. The information in these databases can be used
to derive the common features of a sequence class and to classify an unknown sequence.
Primary Database- This is the collection of data obtained directly from experiments, such as
the sequence of a DNA or protein molecule or the 3-D structure of a protein.
GenBank- This is a public sequence database that can be accessed at
http://www.ncbi.nlm.nih.gov/genbank/. Entries are submitted through a login to the database,
with publication of the new sequence in a scientific journal as a prerequisite.
Each entry in the database has a unique accession number, which remains unchanged. A sample
GenBank entry can be viewed at http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html.
A typical GenBank entry gives the locus name, the length of the
sequence, the type of molecule (DNA/RNA) and the nucleotide sequence of the entry.
Entrez- The Entrez system is used to search all NCBI-associated databases. It is a powerful tool
for performing simple or complex searches by combining keywords with logical operators (AND, OR,
NOT). For example, a search for human protein kinase sequences can use the following
search syntax: Homo sapiens[ORGN] AND protein kinase.
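The same search syntax can also be sent to NCBI programmatically through the E-utilities `esearch` service. The following is a minimal sketch that only builds the request URL and does not contact NCBI; the helper name `build_esearch_url`, the choice of the `protein` database and the `retmax` value are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities esearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(term, db="protein", retmax=20):
    """Return an esearch URL for an Entrez query string (hypothetical helper, no network access)."""
    params = {"db": db, "term": term, "retmax": retmax}
    # urlencode escapes spaces and the square brackets of the field tag.
    return ESEARCH + "?" + urlencode(params)

query = "Homo sapiens[ORGN] AND protein kinase"
url = build_esearch_url(query)
print(url)
```

Opening the printed URL in a browser should return an XML list of matching record identifiers, which is roughly what the Entrez web form does behind the scenes.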
EMBL and DDBJ- EMBL is the nucleotide sequence database held at the European
Bioinformatics Institute, whereas DDBJ is the DNA sequence database held at the Center for
Information Biology, Japan. EMBL can be accessed at http://www.embl.de/ and DDBJ
at http://www.ddbj.nig.ac.jp/. GenBank, EMBL and DDBJ synchronize
their nucleotide sequences every day, so searching for a nucleotide sequence in any one of these
databases is sufficient.
SWISSPROT- This is the collection of annotated protein sequences maintained by the Swiss
Institute of Bioinformatics (SIB). SWISSPROT can be accessed at http://web.expasy.org/groups/swissprot/.
Each protein sequence entry in SwissProt is manually curated and, where required, compared
with the available literature. SwissProt is part of the UniProt database; together with TrEMBL it
forms the UniProt Knowledgebase. The ‘niceprot’ view of a SwissProt entry presents the entry
graphically for better readability and provides hyperlinks to other databases.
NCBI protein database- This is a compilation of the protein sequences present in other databases.
It contains entries from SwissProt, the PIR database, the PDB database and
other known databases.
UniProt- EBI, SIB and Georgetown University together collected protein information in the
form of a centralized catalogue known as the Universal Protein Resource (UniProt). It contains
information about the 3-D structure, expression profile, secondary structure and biochemical
function of each protein. UniProt consists of 3 parts: the UniProt Knowledgebase (UniProtKB),
the UniProt Reference Clusters (UniRef) and the UniProt Archive (UniParc). As noted above,
UniProtKB is a collection drawn from the SwissProt and TrEMBL databases. UniRef is a non-redundant
sequence database that allows searching for similar sequences. UniRef100, UniRef90 and UniRef50
are the three versions of the database; they allow searching for sequences that are 100%, at
least 90% and at least 50% identical to the query sequence, respectively.
Secondary Database- Analysis of the primary data gives rise to secondary
databases. Secondary structures, hydrophobicity plots and domains are found in the various
secondary databases.
Prosite- Prosite is a secondary biological database that contains motifs used to classify an
unknown sequence into a protein family or class of enzyme. It can be accessed at
http://prosite.expasy.org/. The motifs in the database are derived from multiple
sequence alignments. The query sequence is aligned against the multiple sequence alignment to
determine the presence or absence of the motif.
A query sequence can be analyzed using the ScanProsite tool, which can also
search for sequences with a similar pattern in the SwissProt, TrEMBL and PDB databases.
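Motif scanning of this kind can be illustrated by turning a PROSITE-style pattern into a regular expression. The sketch below is a simplified illustration, not the ScanProsite implementation: the function names are hypothetical, the test sequence is made up, and only the basic pattern elements (x, [..], {..}, repeat counts and the < > anchors) are handled.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = []
    for element in pattern.rstrip(".").split("-"):
        # Anchors: '<' pins the N-terminus, '>' pins the C-terminus.
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Split off a repeat count such as x(2,4) or A(3).
        core, _, count = element.partition("(")
        count = "{" + count.rstrip(")") + "}" if count else ""
        if core == "x":
            core = "."                      # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any amino acid except those listed
        # '[ST]' and single letters are already valid regex.
        regex.append(prefix + core + count + suffix)
    return "".join(regex)

def scan(pattern, sequence):
    """Return (1-based start, matched text) for each non-overlapping match."""
    compiled = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in compiled.finditer(sequence)]

# N-glycosylation site pattern (PROSITE entry PS00001) on a made-up sequence.
hits = scan("N-{P}-[ST]-{P}", "MKNGSAANPTLNVTQ")
print(hits)
```

Note that `re.finditer` reports only non-overlapping matches, so a production scanner would need extra care where motif occurrences overlap.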
Pfam: The Pfam database contains profiles of protein sequences and classifies protein
families according to their overall profile. A profile describes the pattern of amino acids in a
protein sequence and gives the probability of finding a given amino acid at each position. Pfam
is based on sequence alignments. A high-quality alignment contains evolutionarily related
sequences and indicates the probability of an amino acid appearing at a particular position.
However, in a few cases an alignment may contain sequences with no evolutionary relationship to
each other, so a critical analysis of results from the Pfam database is necessary before drawing
conclusions.
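The idea of a profile as position-specific amino acid probabilities can be sketched with a simple frequency table. This is a deliberately simplified illustration: Pfam itself uses profile hidden Markov models, and the alignment below is made up.

```python
from collections import Counter

def column_frequencies(alignment):
    """Return, per column, a dict mapping each residue to its relative frequency."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")  # ignore gaps
        total = sum(counts.values())
        profile.append({res: n / total for res, n in counts.items()} if total else {})
    return profile

# Toy alignment (made-up sequences, for illustration only).
alignment = ["MKVLA", "MRVLA", "MKILG", "MKV-A"]
profile = column_frequencies(alignment)
for position, freqs in enumerate(profile, start=1):
    print(position, freqs)
```

Columns where one residue has a frequency close to 1 are strongly conserved; columns with a flatter distribution tolerate substitutions.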
Interpro- SwissProt, TrEMBL, Prosite, Pfam, PRINTS, ProDom, SMART and TIGRFAMs are
integrated into a comprehensive signature database known as InterPro. InterPro
reports the output from the individual databases and allows the user to compare these outputs,
taking into account the algorithm used in each database.
Protein Data Bank (PDB)- This is the collection of experimentally determined 3-D structures
(mostly crystal structures) of biological macromolecules. It is coordinated by a consortium
located in Europe, Japan and the USA. As of August 2013, the database contained 93043 structures,
including proteins, nucleic acids, and protein-nucleic acid or protein-small molecule complexes
(http://www.rcsb.org/pdb/home/home.do). A PDB ID or a keyword can be used to search the
database. The result summarizes all information related to the structure, such
as the crystallization conditions and the reference to the journal article in which the findings
were published.
SCOP- SCOP (Structural Classification of Proteins) rests on the basic idea that proteins with
similar biological functions that are evolutionarily related to each other must have a similar
structure. The database classifies the structure of a known protein into families,
superfamilies and folds. Proteins belong to the same family if their sequence identity is
at least 30% over the total length of the sequence. Proteins with structural or functional
similarity but low sequence identity are grouped into superfamilies, whereas proteins with a
similar arrangement of secondary structures belong to the same fold.
CATH- Similar to SCOP, CATH classifies proteins at 4 levels: Class (C), Architecture (A),
Topology (T), and Homologous superfamily (H). A protein is assigned to a Class according to the
proportion of its secondary structure elements rather than their arrangement. There are 4
classes: helices (α-class), sheets (β-class), helix-sheet (α/β class) and proteins with few
secondary structures. The arrangement of secondary structure elements is used for
classification at the architecture level, and the connections between those elements are used for
classification at the topology level. The homologous superfamily level groups protein structures
that share similar domains.
A protein sequence can be deduced from a DNA sequence using the genetic code. Since the
genetic code is redundant to some degree, several DNA sequences can code for the same protein
sequence. As a consequence, the similarity of two protein-coding DNA sequences may appear
less than that of their translated protein sequence.
The reading frame of a DNA sequence may not always be known, and shifting it by one position
has dramatic effects on the translated amino acid sequence. The similarity at the protein level
can be completely destroyed. Thus, alignments of protein-coding sequences performed at DNA
and amino acid levels do not always give the same results.
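Both effects, the redundancy of the genetic code and the impact of a shifted reading frame, can be checked with a few lines of code. The sketch below uses the standard genetic code; the two DNA sequences are made up and differ only at third codon positions.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna, frame=0):
    """Translate DNA starting at offset `frame` (0, 1 or 2); '*' marks a stop."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)

def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Two made-up sequences that differ only at third codon positions.
dna1 = "ATGCTTTCTGGTAAA"
dna2 = "ATGCTCTCAGGCAAG"

print(translate(dna1), translate(dna2))   # same protein from different DNA
print(identity(dna1, dna2))               # DNA identity is below 1.0
print(translate(dna1, frame=1))           # shifted frame gives an unrelated peptide
```

The two sequences share 11 of 15 nucleotides yet translate to the same peptide, whereas shifting the frame of the first sequence by one base gives a peptide with no resemblance to the original.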
It should be pointed out that a protein sequence can be deduced from a DNA or RNA sequence.
However, from one protein sequence, several possible DNA sequences could be predicted, since
it cannot be known which codons were used to obtain the protein sequence. Thus, when working
with protein sequences, a degree of information is lost that was present in the DNA or RNA
sequence. This is in agreement with the “Central Dogma” of molecular biology.
For a DNA sequence, nucleotides are either the same or they are different. For proteins, however,
there is a third category, as amino acids can be similar though not identical. These are amino
acids that have a similar chemical structure: for instance serine and threonine, which both have
hydroxyl (−OH) groups. Leucine and isoleucine also have similar chemical properties, and
glutamate and aspartate are both acidic. Substituting Asp for Glu in an enzyme requiring an acidic
amino acid in its active site would likely not completely alter the function, although substitution
in the same location with a large aromatic amino acid, such as Tyr, could well destroy the enzyme
activity. Thus it would be appropriate to score Glu as ‘similar’ to Asp, but both as ‘different’ to
Tyr.
Amino acids can be placed in groups that can be considered similar, and taking this into
consideration in alignments produces two scores: an identity score and a similarity score.
However, determining which amino acids are similar is not always as clear as it might seem, as
there are different degrees of similarity, depending on the context. For example, alanine,
isoleucine, leucine, and valine are all aliphatic amino acids, and in many cases these can be
substituted for each other in a globular protein without much difference in the overall shape.
However, their size is different enough to have significant impact if substituted in an active site
of an enzyme. In some cases, it matters only that an amino acid is charged, and whether it is a
positive or negative charge is not important. But in other cases, when ionic bonding stabilizes a
structure, for example, charge is crucial. So depending on which list one consults, amino acids
can sometimes appear as similar, sometimes not. Because of this ambiguity in definitions of
similarity, it is our opinion that more weight should be given to the percentage identity of an
alignment score of two sequences than to the percentage similarity, as well as to the length of
the alignment as a fraction of the query sequence.
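The difference between the two scores can be made concrete with a small calculation on an aligned pair. The grouping of amino acids below is only one possible choice (an assumption for illustration), which is exactly the ambiguity described above; the two sequences are made up.

```python
# One possible grouping of amino acids by side-chain chemistry (an assumption;
# other groupings are equally defensible and would change the similarity score).
GROUPS = [set("AILV"), set("ST"), set("DE"), set("KRH"), set("FWY"), set("NQ")]

def similar(a, b):
    """True if both residues fall in the same group."""
    return any(a in g and b in g for g in GROUPS)

def identity_and_similarity(seq1, seq2):
    """Return (% identity, % similarity) over aligned, gap-free columns."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    identical = sum(a == b for a, b in pairs)
    # Identical residues also count towards similarity.
    alike = sum(a == b or similar(a, b) for a, b in pairs)
    return 100 * identical / len(pairs), 100 * alike / len(pairs)

ident, simil = identity_and_similarity("MSDLKV", "MTELRY")
print(f"identity {ident:.1f}%, similarity {simil:.1f}%")
```

Here two of the six columns are identical and three more are similar under this grouping; moving a single residue between groups would change the similarity score but not the identity score.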
Alignments of sequences are commonly performed using the Basic Local Alignment Search Tool (BLAST).
BLAST can be quite fast, and there are several automated servers available on the web, where
one can paste a sequence in a form and quickly search for similarity to genes or sequences stored
in a database. GenBank, a public database storing DNA and protein sequences, allows one to
specifically search all or particular selections of microbial genomes.
BLASTN is the program used to search a DNA query against a DNA database, whereas BLASTP searches
a protein sequence against a protein database. BLASTN automatically searches for homologies
on either strand present in the database. BLASTX takes a DNA query, translates it in all
three reading frames on both strands, and compares the six resulting translations against a
protein database. BLASTP uses various similarity matrices to determine which amino acids are similar.
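Translating a DNA query in three reading frames on both strands can be sketched as follows. This illustrates the translation step only, not how BLAST is implemented; it uses the standard genetic code and a made-up query, and the frame labels (+1 to +3, -1 to -3) are a naming convention chosen here.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]

def six_frame_translations(dna):
    """Return {frame label: protein} for frames +1..+3 and -1..-3."""
    frames = {}
    for sign, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
            frames[f"{sign}{offset + 1}"] = "".join(CODON_TABLE[c] for c in codons)
    return frames

frames = six_frame_translations("ATGGCCAAGTAA")  # made-up query
for label, protein in frames.items():
    print(label, protein)
```

Each of the six peptides can then be compared against a protein database, which is why a translated search remains useful when the reading frame of the query is unknown.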
The output of BLAST gives a considerable amount of information about the alignment. In addition
to the sequence alignment, with identity and similarity scores, it also produces a bit score and
an expectation value. The bit score is a measure of the statistical significance of the alignment;
the higher the score, the more similar the two sequences. The expectation value (E-value) is also
a statistical measure: it is the number of hits with an equally good or better score that would
be expected to occur by chance. If the number is very low, it is very unlikely the finding
occurred just by chance; so the lower the E-value, the more significant the score is. An E-value
of 10 means that one would expect to have 10 such hits in the searched database by chance, so it
is quite likely that the hit is not significant. An E-value of 10^-58 would make it very unlikely
the alignment happened by chance, so this is a good score. However, the obtained E-value depends
on the length of the match and the size of the database, as well as the content of the searched
database.
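The dependence of the E-value on database size follows from the usual relation between bit score and E-value, E = m × n × 2^(-S'), where S' is the bit score, m the query length and n the total database length. The sketch below applies this relation directly; the lengths are hypothetical, and BLAST itself works with corrected "effective" lengths, so its reported values will differ somewhat.

```python
def expected_hits(bit_score, query_length, database_length):
    """E = m * n * 2**(-S'), with S' the bit score (effective-length corrections ignored)."""
    return query_length * database_length * 2.0 ** (-bit_score)

# Same alignment (bit score 50, query of 300 residues) against two hypothetical databases.
e_small_db = expected_hits(50, 300, 1e6)
e_large_db = expected_hits(50, 300, 1e9)
print(e_small_db, e_large_db)
```

The same alignment is a thousand times less significant in the thousand-fold larger database, and each additional bit of score halves the E-value.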
BLAST is not the only alignment tool (although it is probably the most commonly used). Another
well-established program is FASTA (FAST All), which uses an alternative algorithm to detect
sequence similarities. FASTA is more sensitive than BLAST, and when it was developed more
than 20 years ago it was quite fast. Today, however, the databases have grown so much that this
method can take quite a bit of time, and often BLAST searches are considerably quicker. FASTA
is now less frequently used than BLAST.
The term ‘FASTA’ lives on as a format for sequences that is accepted by many sequence analysis
programs. Instead of just entering your sequence, it is often advantageous to give it an identifier
(a name, number, or description). But this name should not be ‘read’ by the program as part of
the sequence itself. The FASTA format reserves the first line for this, and that line has to start with a
greater-than sign (‘>’). The line finishes with a hard return, so that everything from the second
line onwards is read as a sequence (this can be DNA, RNA, or protein).
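A minimal reader for this format can be sketched as follows. The function name `parse_fasta` and both records are made up for illustration; the sequence may span several lines, which are joined until the next identifier line.

```python
def parse_fasta(text):
    """Return a list of (identifier line, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

example = """>seq1 made-up DNA example
GATTACAWGATTACA
GATTACA
>seq2 made-up protein example
MKVLAAGI
"""
records = parse_fasta(example)
for name, sequence in records:
    print(name, len(sequence))
```

The made-up DNA record deliberately contains the letter W.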
A new letter is introduced here to represent a certain ambiguity: W for A or T. There are single
letter codes for all degrees of uncertainty, although most bioinformatic tools accept GATC only,
plus in some cases N for unknown.
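The single-letter ambiguity codes follow the IUPAC nucleotide nomenclature. As a sketch of how a tool might honour them, the code below expands an ambiguous DNA motif into a regular expression; the motif and the target sequence are made up.

```python
import re

# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def ambiguity_to_regex(sequence):
    """Turn a DNA string with IUPAC ambiguity codes into a regular expression."""
    parts = []
    for code in sequence.upper():
        bases = IUPAC[code]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)

pattern = ambiguity_to_regex("GAWTC")
matches = [m.group() for m in re.finditer(pattern, "CCGAATCGGGATTCAGACTC")]
print(pattern, matches)
```

The W in the motif matches both A and T, so GAATC and GATTC are found while GACTC is not.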
Multiple alignment analysis is used to identify gene similarity and to define how diverse two
genes might be to still consider them similar. This is important, for instance, in designing probes
used in microarray analysis; genes we consider ‘similar’ should be recognized by one probe, and
their hybridization signals should be treated as equal.
Another commonly used method to visualize similarity of the sequence of proteins is to use a tree
plot, as shown in Fig. Notice in this figure that now there are two main clusters: a fairly tight
cluster of γ-Proteobacteria (E. coli and relatives), and a looser set of ‘other organisms,’ which are
taxonomically more diverse. There are several methods for producing a tree plot, and many web
sites offer a service where one can paste in a FASTA file containing multiple sequences, do the
alignment using CLUSTALW, and then have the program draw a tree.
Phylogenetic trees have been around for nearly 150 years; an evolutionary tree is one of the few
illustrations in Darwin’s The Origin of Species. However, the more modern ‘molecular based’
trees have been around only since the 1960s. There are several different types of tree plots. They
can be rooted, with a single ancestral organism implied, or unrooted, with no clear origins. Most
biologists (including Darwin) tend to think of trees as rooted.
Bioinformatics Tools
Following are some of the important tools for bioinformatics
Application of Bioinformatics
Sequence analysis
Sequence analysis is the most basic operation in computational biology. It
consists of finding which parts of biological sequences are alike and which parts differ during
medical analysis and genome mapping processes. Sequence analysis involves subjecting a
DNA or peptide sequence to sequence alignment, sequence database searches, repeated sequence
searches, or other bioinformatics methods on a computer.
Genome annotation
In the context of genomics, annotation is the process of marking the genes and other biological
features in a DNA sequence. The first genome annotation software system was designed in 1995
by Dr. Owen White.
Protein-protein docking
In the last two decades, tens of thousands of protein three-dimensional structures have been
determined by X-ray crystallography and protein nuclear magnetic resonance spectroscopy
(protein NMR). One central question for the biological scientist is whether it is practical to predict
possible protein-protein interactions based only on these 3D shapes, without doing
protein-protein interaction experiments. A variety of methods have been developed to tackle the
protein-protein docking problem, though it seems that there is still much work to be done in this field.
Data Mining
Data mining refers to extracting or “mining” knowledge from large amounts of data. Data Mining
(DM) is the science of finding new interesting patterns and relationships in huge amounts of data.
It is defined as “the process of discovering meaningful new correlations, patterns, and trends by
digging into large amounts of data stored in warehouses”. Data mining is also sometimes called
Knowledge Discovery in Databases (KDD). Data mining is not specific to any industry. It requires
intelligent technologies and the willingness to explore the possibility of hidden knowledge that
resides in the data.
Data Mining approaches seem ideally suited for Bioinformatics, since it is data-rich, but lacks a
comprehensive theory of life’s organization at the molecular level. The extensive databases of
biological information create both challenges and opportunities for development of novel KDD
methods. Mining biological data helps to extract useful knowledge from massive datasets
gathered in biology, and in other related life sciences areas such as medicine and neuroscience.
Challenges
Bioinformatics and data mining are developing as an interdisciplinary science. Data mining
approaches seem ideally suited for bioinformatics, since bioinformatics is data-rich but lacks a
comprehensive theory of life’s organization at the molecular level.
However, data mining in bioinformatics is hampered by many facets of biological databases,
including their size, number and diversity, the lack of a standard ontology to aid querying
them, and the heterogeneous quality and provenance of the information they contain.
Another problem is the range of levels and domains of expertise among potential users,
which can make it difficult for database curators to provide access mechanisms appropriate to all.
The integration of biological databases is also a problem. Data mining and bioinformatics are
fast-growing research areas today. It is important to examine the key research
issues in bioinformatics and to develop new data mining methods for scalable and effective analysis.