
BIOINFORMATICS

The term bioinformatics was coined by Paulien Hogeweg in 1979 for the study of informatic
processes in biotic systems. Since the late 1980s its primary use has been in genomics and
genetics, particularly in those areas of genomics involving large-scale DNA sequencing.
Bioinformatics can be defined as the application of computer technology to the management of
biological information. It is the science of storing, extracting, organizing, analyzing,
interpreting and utilizing information from biological sequences and molecules. It has been
fueled mainly by advances in DNA sequencing and mapping techniques. Over the past few
decades, rapid developments in genomic and other molecular research technologies, combined
with developments in information technology, have produced a tremendous amount of
information related to molecular biology. The primary goal of bioinformatics is to increase the
understanding of biological processes.

It is the branch of science concerned with information and information flow in biological
systems, especially the use of computational methods in genetics and genomics.

Origin of bioinformatics and biological databases


The first protein sequence reported was that of bovine insulin in 1956, consisting of 51 residues.
Nearly a decade later, the first nucleic acid sequence was reported, that of yeast alanine tRNA,
with 77 bases.
In 1965, Dayhoff gathered all the available sequence data to create the first bioinformatic
database (the Atlas of Protein Sequence and Structure).
The Protein Data Bank followed in 1972 with a collection of ten X-ray crystallographic protein
structures. The SWISSPROT protein sequence database began in 1987.

Biological Databases - In the post-genomic era, nucleotide and protein sequences from many
different organisms are available, and this has also paved the way for determining the secondary
and 3-D structures of proteins. This vast amount of information is processed and arranged
systematically in different biological databases. The information present in these databases can
be used to derive common features of a sequence class and to classify an unknown sequence.
Primary Database - This is a collection of data obtained directly from experiments, such as the
sequence of a DNA or protein molecule or the 3-D structure of a protein.

Database of nucleic acid sequences

GenBank - This is a public sequence database that can be accessed at
http://www.ncbi.nlm.nih.gov/genbank/. Entry into GenBank is made by logging into the
database, with the pre-requisite that the new sequence has been published in a scientific journal.
Each entry in the database has a unique accession number, which remains unchanged. A sample
GenBank entry can be viewed at http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html. A
typical GenBank entry contains information about the locus name, the length of the sequence,
the type of molecule (DNA/RNA) and the nucleotide sequence of the entry.

Entrez - The Entrez system is used to search all NCBI-associated databases. It is a powerful tool
for performing simple or complicated searches by combining keywords with logical operators
(AND, OR, NOT). For example, searching for a protein kinase sequence in humans can be done
with the following search syntax: Homo sapiens [ORGN] AND protein kinase.
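
The same kind of search can be performed programmatically. The sketch below assumes Biopython is installed and uses its Bio.Entrez module with a placeholder e-mail address (NCBI requires one) to submit the query above and list a few matching accession IDs.

# Minimal sketch of an Entrez search with Biopython (assumes Biopython is installed).
from Bio import Entrez

Entrez.email = "your.name@example.org"  # placeholder address required by NCBI

# Search the protein database with the syntax described above.
handle = Entrez.esearch(db="protein",
                        term="Homo sapiens[ORGN] AND protein kinase",
                        retmax=5)
record = Entrez.read(handle)
handle.close()

print("Hits found:", record["Count"])
print("First accession IDs:", record["IdList"])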

EMBL and DDBJ - EMBL is the nucleotide sequence database maintained at the European
Bioinformatics Institute, whereas DDBJ is the DNA sequence database maintained at the Center
for Information Biology, Japan. EMBL can be accessed at http://www.embl.de/ whereas DDBJ
can be accessed at http://www.ddbj.nig.ac.jp/. Every day GenBank, EMBL and DDBJ synchronize
their nucleotide sequence data, so searching any one of the three databases is sufficient.

Database of protein sequences

SWISSPROT - It is the collection of annotated protein sequences maintained by the Swiss
Institute of Bioinformatics (SIB). SWISSPROT can be accessed at
http://web.expasy.org/groups/swissprot/. Each protein sequence entry in SWISSPROT is
manually curated and, where required, compared with the available literature. SWISSPROT is
part of the UniProt database and is collectively known as the UniProt Knowledgebase. A 'NiceProt'
view of an entry in the SWISSPROT database is graphically presented for better readability, with
hyperlinks to other databases as well.

NCBI protein database - It is a compilation of the protein sequences present in other databases.
The NCBI database contains entries from SWISSPROT, the PIR database, the PDB and other
known databases.

UniProt - EBI, SIB and Georgetown University together collected protein information in the
form of a centralized catalogue known as the Universal Protein Resource (UniProt). It contains
information about the 3-D structure, expression profile, secondary structure and biochemical
function of proteins. UniProt consists of 3 parts: the UniProt Knowledgebase (UniProtKB), the
UniProt Reference clusters (UniRef) and the UniProt Archive (UniParc). As discussed before,
UniProtKB is a collection drawn from the SWISSPROT and TrEMBL databases. UniRef is a
non-redundant sequence database that allows searching for similar sequences. UniRef100,
UniRef90 and UniRef50 are the three versions of the database, allowing searching for sequences
100%, >90% and >50% identical to the query sequence.
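
A small sketch of retrieving a UniProtKB entry programmatically is shown below; it assumes the current UniProt REST interface and uses the accession P01308 (human insulin) purely as an example.

# Hedged sketch: fetch a UniProtKB entry in FASTA format over HTTP.
import urllib.request

accession = "P01308"   # example accession (human insulin)
url = f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"

with urllib.request.urlopen(url) as response:
    fasta_text = response.read().decode("utf-8")

print(fasta_text.splitlines()[0])   # header line, starting with '>'
print("Sequence length:",
      sum(len(line) for line in fasta_text.splitlines()[1:]))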
Secondary Database - The analysis of primary data gives rise to the development of secondary
databases. Secondary structures, hydrophobicity plots and domains are stored in the various
secondary databases.
Prosite - Prosite is a secondary biological database which contains motifs used to classify an
unknown sequence into a protein family or class of enzyme. It can be accessed at
http://prosite.expasy.org/. The database contains motifs derived from multiple sequence
alignments. The query sequence is aligned against the multiple sequence alignment to determine
the presence or absence of a motif.
A query sequence can be analyzed using the ScanProsite tool. In addition, it allows searching for
sequences with a similar pattern in the SwissProt, TrEMBL and PDB databases.
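
As an illustration of how such a motif can be matched, the sketch below translates the classical N-glycosylation site pattern N-{P}-[ST]-{P} into a regular expression and scans a made-up protein fragment with it.

# Toy sketch: apply a PROSITE-style pattern to a sequence with a regular expression.
import re

# PROSITE syntax N-{P}-[ST]-{P}: N, any residue except P, S or T, any residue except P.
n_glyc_site = re.compile(r"N[^P][ST][^P]")

query = "MKTAYIAKQRNISAAVLNETNRSGLDD"   # hypothetical protein fragment

for match in n_glyc_site.finditer(query):
    # Report 1-based positions, as sequence databases usually do.
    print(f"Putative N-glycosylation site at position {match.start() + 1}: "
          f"{match.group()}")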

Pfam: The Pfam database contains profiles of protein sequences and classifies protein families
according to their overall profile. A profile is a pattern of amino acids in a protein sequence that
gives the probability of observing a given amino acid at each position. Pfam is based on sequence
alignment. A high-quality sequence alignment gives an idea of the probability of an amino acid
appearing at a particular position and contains evolutionarily related sequences. However, in a
few cases a sequence alignment may contain sequences with no evolutionary relationship to each
other. A critical analysis of results from the Pfam database is therefore necessary before drawing
conclusions.
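
The toy sketch below shows the underlying idea of a profile: per-column amino acid frequencies computed from a small, invented alignment. Real Pfam profiles are hidden Markov models built from much larger curated alignments.

# Toy sketch: build per-position amino acid probabilities from a tiny alignment.
from collections import Counter

alignment = [
    "MKVLAG",
    "MKILSG",
    "MRVLAG",
]

profile = []
for column in zip(*alignment):                 # iterate column by column
    counts = Counter(column)
    total = sum(counts.values())
    profile.append({aa: count / total for aa, count in counts.items()})

for position, column_probs in enumerate(profile, start=1):
    print(position, column_probs)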

InterPro - SwissProt, TrEMBL, Prosite, Pfam, PRINTS, ProDom, SMART and TIGRFAMs are
integrated into a comprehensive signature database known as InterPro. The result from InterPro
gives the output from the individual databases and allows the user to compare the outputs while
taking into account the algorithm used by each database.

Molecular structure database

Protein Data Bank (PDB) - It is the collection of experimentally determined structures of
biological macromolecules. It is coordinated by a consortium located in Europe, Japan and the
USA. As of August 2013, the database contained 93,043 structures, including proteins, nucleic
acids, and protein-nucleic acid or protein-small molecule complexes
(http://www.rcsb.org/pdb/home/home.do). A PDB ID or a keyword can be used to search the
database. The result from the database summarizes all information related to the structure, such
as the crystallization conditions, the reference to the journal article where the findings were
published, etc.
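
As an illustration, the sketch below uses Biopython's Bio.PDB module to download and parse an entry (the PDB ID 1TUP is an arbitrary example) and count the residues in each chain; Biopython and an internet connection are assumed.

# Hedged sketch: download and parse a PDB entry with Biopython's Bio.PDB.
from Bio.PDB import PDBList, PDBParser

pdb_id = "1tup"   # example entry

# Download the structure file into the current directory in PDB format.
pdbl = PDBList()
filename = pdbl.retrieve_pdb_file(pdb_id, pdir=".", file_format="pdb")

# Parse it and count the residues per chain.
parser = PDBParser(QUIET=True)
structure = parser.get_structure(pdb_id, filename)

for model in structure:
    for chain in model:
        n_residues = len(list(chain.get_residues()))
        print(f"Chain {chain.id}: {n_residues} residues")
    break   # only the first model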

SCOP - SCOP (Structural Classification of Proteins) utilizes the basic idea that proteins with
similar biological functions that are evolutionarily related to each other must have similar
structures. The database classifies the structure of a known protein into families, superfamilies
and folds. A protein structure belongs to a family if its sequence identity is at least 30% over the
total length of the sequence. Proteins with structural or functional similarity but low sequence
identity are classified into superfamilies, whereas proteins with a similar arrangement of
secondary structure elements belong to the same fold.

CATH - Similar to SCOP, CATH classifies proteins into 4 categories: Class (C), Architecture (A),
Topology (T), and Homologous superfamily (H). A protein is assigned to a Class depending on the
proportion of its secondary structure elements rather than their arrangement. There are 4
classes: mainly helix (α class), mainly sheet (β class), helix-sheet (α/β class) and proteins with few
secondary structures. The arrangement of secondary structure elements in a protein is used for
classification at the Architecture level, and the connectivity of secondary structure elements is
used for classification at the Topology level. The Homologous superfamily level considers the
presence of similar domains in two protein structures for their classification.

Sequence Comparison by Means of Alignments


The basic idea behind a sequence alignment is quite simple. The essence is to align two (or more)
sequences and score the positions that are identical. In order to find a possible function of a new
gene, for example, one can compare the query sequence against those of known genes in a
database, in case a very similar gene with known function has already been described by
someone else.

A protein sequence can be deduced from a DNA sequence using the genetic code. Since the
genetic code is redundant to some degree, several DNA sequences can code for the same protein
sequence. As a consequence, the similarity of two protein-coding DNA sequences may appear
less than that of their translated protein sequences.
The reading frame of a DNA sequence may not always be known, and shifting it by one position
has dramatic effects on the translated amino acid sequence; the similarity at the protein level
can be completely destroyed. Thus, alignments of protein-coding sequences performed at the DNA
and amino acid levels do not always give the same results.
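
The short sketch below, assuming Biopython is available and using a made-up coding fragment, shows how translating the same DNA in frame 0 and after a one-base shift yields completely different amino acid sequences.

# Sketch: the effect of a frameshift on translation (Biopython assumed).
from Bio.Seq import Seq

dna = Seq("ATGGCTGAAACCCTGAAAGAA")   # hypothetical coding fragment

print("Frame 0:", dna.translate())         # MAETLKE
print("Frame 1:", dna[1:-2].translate())   # one-base shift, trimmed to whole codons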

It should be pointed out that a protein sequence can be deduced from a DNA or RNA sequence.
However, from one protein sequence, several possible DNA sequences could be predicted, since
it cannot be known which codons were used to obtain the protein sequence. Thus, when working
with protein sequences, a degree of information is lost that was present in the DNA or RNA
sequence. This is in agreement with the “Central Dogma” of molecular biology.

For a DNA sequence, nucleotides are either the same or they are different. For proteins, however,
there is a third category, as amino acids can be similar though not identical. These are amino
acids that have a similar chemical structure: for instance serine and threonine, which both have
hydroxyl (−OH) groups. Leucine and isoleucine also have similar chemical properties, and
glutamate and aspartate are both acidic. Replacing Glu with Asp in an enzyme requiring an acidic
amino acid in its active site would likely not completely alter the function, although substitution
at the same location with a large aromatic amino acid, such as Tyr, could well destroy the enzyme
activity. Thus it would be appropriate to score Glu as 'similar' to Asp, but both as 'different'
from Tyr.
Amino acids can be placed in groups that can be considered similar, and taking this into
consideration in alignments produces two scores: an identity score and a similarity score.
However, determining which amino acids are similar is not always as clear as it might seem, as
there are different degrees of similarity, depending on the context. For example, alanine,
isoleucine, leucine, and valine are all aliphatic amino acids, and in many cases these can be
substituted for each other in a globular protein without much difference in the overall shape.
However, their size is different enough to have significant impact if substituted in an active site
of an enzyme. In some cases, it matters only that an amino acid is charged, and whether it is a
positive or negative charge is not important. But in other cases, when ionic bonding stabilizes a
structure, for example, charge is crucial. So depending on which list one consults, amino acids
can sometimes appear as similar, sometimes not. Because of this ambiguity in definitions of
similarity, it is our opinion that more weight should be given to the percentage identity of an
alignment score of two sequences than to the percentage similarity, as well as to the length of
the alignment as a fraction of the query sequence.
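
The toy sketch below illustrates the distinction, scoring an already aligned pair of short protein sequences for identity and for similarity using a simplified set of similarity groups (an illustration only, not a real substitution matrix such as BLOSUM62).

# Toy sketch: per-position identity and similarity for an aligned pair of sequences.
SIMILAR_GROUPS = [
    set("ST"),      # hydroxyl-containing
    set("ILVA"),    # aliphatic
    set("DE"),      # acidic
    set("KR"),      # basic
    set("FYW"),     # aromatic
]

def similar(a, b):
    return any(a in group and b in group for group in SIMILAR_GROUPS)

def score_alignment(seq1, seq2):
    assert len(seq1) == len(seq2), "sequences must already be aligned"
    identical = sum(a == b for a, b in zip(seq1, seq2))
    similar_only = sum(a != b and similar(a, b) for a, b in zip(seq1, seq2))
    n = len(seq1)
    return identical / n * 100, (identical + similar_only) / n * 100

identity, similarity = score_alignment("GDSLKEV", "GESLRDI")
print(f"identity {identity:.0f}%, similarity {similarity:.0f}%")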

Pairwise Alignments: BLAST and FASTA

Alignments of sequences are commonly performed using the Basic Local Alignment Search Tool
(BLAST). BLAST can be quite fast, and there are several automated servers available on the web,
where one can paste a sequence into a form and quickly search for similarity to genes or sequences
stored in a database. GenBank, a public database storing DNA and protein sequences, allows one
to specifically search all or particular selections of microbial genomes.

BLASTN is the program used to search a DNA query against a DNA database, whereas BLASTP
searches a protein sequence against a protein database. BLASTN is set up to automatically search
for homologies on either strand present in the database. BLASTX takes a DNA query, translates
it in all three reading frames for both strands, and performs six BLASTP searches in addition to
BLASTN. BLASTP uses various similarity matrices to determine which amino acids are similar.

The output of BLAST gives a considerable amount of information about the alignment. In addition
to the sequence alignment, with identity and similarity scores, it also produces a bit score and
an expectation value. The bit score is a measure of the statistical significance of the alignment;
the higher the score, the more similar the two sequences. The expectation value (E-value) is also
a statistical measure: it is the number of times the hit would be expected to occur by chance. If
the number is very low, it is very unlikely the finding occurred just by chance; so the lower the
E-value, the more significant the score is. An E-value of 10 means that one would expect to find
10 such hits in the searched database by chance, so it is quite likely that the hit is not significant.
An E-value of 10⁻⁵⁸ would make it very unlikely that the alignment happened by chance, so this
is a good score. However, the obtained E-value depends on the length of the match and on the
size and content of the searched database.
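
A hedged sketch of such a search performed programmatically is shown below, using Biopython's NCBIWWW.qblast to submit a BLASTP query to the NCBI server and keeping only hits below an E-value cut-off; the query fragment and cut-off are arbitrary choices for illustration, and the remote search can take some time.

# Hedged sketch: remote BLASTP search with Biopython, filtered by E-value.
from Bio.Blast import NCBIWWW, NCBIXML

query = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # hypothetical protein fragment

result_handle = NCBIWWW.qblast("blastp", "nr", query)
blast_record = NCBIXML.read(result_handle)

E_VALUE_CUTOFF = 1e-5
for alignment in blast_record.alignments:
    for hsp in alignment.hsps:
        if hsp.expect < E_VALUE_CUTOFF:
            print(f"{alignment.title[:60]}  E={hsp.expect:.2e}")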

BLAST is not the only alignment tool (although it is probably the most commonly used). Another
well-established program is FASTA (FAST All), which uses an alternative algorithm to detect
sequence similarities. FASTA is more sensitive than BLAST, and when it was developed more
than 20 years ago it was quite fast. Today, however, the databases have grown so much that this
method can take quite a bit of time, and often BLAST searches are considerably quicker. FASTA
is now less frequently used than BLAST.

The term 'FASTA' lives on as a format for sequences that is accepted by many sequence analysis
programs. Instead of just entering your sequence, it is often advantageous to give it an identifier
(a name, number, or description), but this name should not be 'read' by the program as part of
the sequence itself. The FASTA format reserves the first line for this identifier, and that line has
to start with a greater-than sign ('>'). The line finishes with a hard return, so that everything from
the second line onwards is read as sequence (this can be DNA, RNA, or protein).
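
A minimal sketch of reading this format by hand is shown below; the file name sequences.fasta is a placeholder, and in practice libraries such as Biopython's SeqIO do the same job.

# Minimal sketch: parse a FASTA file into {identifier: sequence}.
def read_fasta(path):
    records = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:]          # identifier/description line
                records[name] = []
            elif name is not None:
                records[name].append(line)
    return {name: "".join(parts) for name, parts in records.items()}

for name, seq in read_fasta("sequences.fasta").items():   # placeholder file name
    print(name, len(seq))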

Multiple Alignments: CLUSTALW


Multiple alignments, in which several sequences are compared to each other, are very informative,
as they can identify regions that are less variable or more variable within a set of genes. For
multiple alignments CLUSTALW is a frequently used program. This program first calculates the
highest similarity for each possible pair of sequences, and then estimates the optimal multiple
alignment for all of them (it is based on the same algorithm for similarity as FASTA). CLUSTALW
is much slower than BLAST and is more suitable for the input of short sequences for which a
degree of similarity has already been established. CLUSTALW is not suitable for searching
databases. A better approach is first to search for hits in GenBank with a query gene, and then to
take a selection of these hits and combine them in a multiple alignment together with the query
sequence. This way one can identify regions of higher or lower degrees of conservation, for
instance to identify a constant region that can be used for PCR primer design, or a variable region
that may be a target for a typing procedure.
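
A minimal sketch of this idea, assuming Biopython is available and that CLUSTALW has written an alignment to a file (the name ihf.aln is a placeholder), reads the alignment and reports which columns are fully conserved, the kind of region one might choose for PCR primer design.

# Sketch: find fully conserved columns in a CLUSTALW alignment (Biopython assumed).
from Bio import AlignIO

alignment = AlignIO.read("ihf.aln", "clustal")   # placeholder file name

conserved_columns = []
for i in range(alignment.get_alignment_length()):
    column = alignment[:, i]                 # all residues at position i
    if len(set(column)) == 1 and "-" not in column:
        conserved_columns.append(i + 1)      # report 1-based positions

print("Fully conserved positions:", conserved_columns)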

A new letter is introduced here to represent a certain ambiguity: W for A or T. There are single-
letter codes for all degrees of uncertainty, although most bioinformatic tools accept only G, A, T
and C, plus in some cases N for unknown.

From Alignments to Phylogenetic Relationships


Multiple alignments can also be made for proteins, and the figure shows an example of several
aligned IHF (Integration Host Factor) proteins. In the figure the most similar sequences are
grouped together, to illustrate more clearly the clusters one can identify. This shows that the
conservation in an alignment hints at how closely the proteins are related to each other.

Multiple alignment analysis is used to identify gene similarity and to define how diverse two
genes might be to still consider them similar. This is important, for instance, in designing probes
used in microarray analysis; genes we consider ‘similar’ should be recognized by one probe, and
their hybridization signals should be treated as equal.
Another commonly used method to visualize the sequence similarity of proteins is a tree plot, as
shown in the figure. Notice in this figure that there are two main clusters: a fairly tight cluster of
γ-Proteobacteria (E. coli and relatives), and a looser set of 'other organisms,' which are
taxonomically more diverse. There are several methods for producing a tree plot, and many web
sites offer a service where one can paste in a FASTA file containing multiple sequences, do the
alignment using CLUSTALW, and then have the program draw a tree.
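
CLUSTALW writes its guide tree to a Newick-format .dnd file alongside the alignment; the brief sketch below (using Biopython's Bio.Phylo, with a placeholder file name) reads such a file and prints a text rendering of the tree.

# Sketch: read a Newick tree and render it as text (Biopython assumed).
from Bio import Phylo

tree = Phylo.read("ihf.dnd", "newick")   # placeholder file name
Phylo.draw_ascii(tree)                   # quick text rendering of the tree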

Phylogenetic trees have been around for nearly 150 years; an evolutionary tree is one of the few
illustrations in Darwin’s The Origin of Species. However, the more modern ‘molecular based’
trees have been around only since the 1960s. There are several different types of tree plots. They
can be rooted, with a single ancestral organism implied, or unrooted, with no clear origins. Most
biologists (including Darwin) tend to think of trees as rooted.
Bioinformatics Tools
Important bioinformatics tools include the databases described above together with search and
alignment programs such as BLAST, FASTA and CLUSTALW.

Application of Bioinformatics
Sequence analysis
Sequence analysis is the most fundamental operation in computational biology. It consists of
finding which parts of biological sequences are alike and which parts differ, for example during
medical analysis and genome mapping. Sequence analysis involves subjecting a DNA or peptide
sequence to sequence alignment, searches against sequence databases, repeated sequence
searches, or other bioinformatics methods on a computer.

Genome annotation
In the context of genomics, annotation is the process of marking the genes and other biological
features in a DNA sequence. The first genome annotation software system was designed in 1995
by Dr. Owen White.

Analysis of gene expression


The expression of many genes can be determined by measuring mRNA levels with various
techniques such as microarrays, expressed sequence tag (EST) sequencing, serial analysis of gene
expression (SAGE) tag sequencing, massively parallel signature sequencing (MPSS), or various
applications of multiplexed in-situ hybridization. All of these techniques are extremely
noise-prone and subject to bias in the biological measurement. A major research area here
involves developing statistical tools to separate signal from noise in high-throughput gene
expression studies.
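
As a concrete illustration of such a statistical tool, the toy sketch below applies a two-sample t-test (using SciPy, an assumption) to invented control and treated expression values for a single gene; real studies test thousands of genes and correct for multiple testing.

# Toy sketch: two-sample t-test on made-up expression replicates.
from scipy import stats

control = [5.1, 4.8, 5.3, 5.0]     # log2 expression, control replicates (invented)
treated = [6.9, 7.2, 6.7, 7.0]     # log2 expression, treated replicates (invented)

t_statistic, p_value = stats.ttest_ind(control, treated)
print(f"t = {t_statistic:.2f}, p = {p_value:.2e}")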

Analysis of protein expression


Gene expression can be measured in many ways, including at the mRNA and protein levels;
however, protein expression is one of the best indicators of actual gene activity, since proteins
are usually the final catalysts of cell activity. Protein microarrays and high-throughput (HT) mass
spectrometry (MS) can provide a snapshot of the proteins present in a biological sample.
Bioinformatics is very much involved in making sense of protein microarray and HT MS data.

Analysis of mutations in cancer


In cancer, the genomes of affected cells are rearranged in complex or even unpredictable ways.
Massive sequencing efforts are used to identify previously unknown point mutations in a variety
of genes in cancer. Bioinformaticians continue to produce specialized automated systems to
manage the sheer volume of sequence data produced, and they create new algorithms and
software to compare the sequencing results to the growing collection of human genome
sequences and germline polymorphisms. New physical detection technologies are employed, such
as oligonucleotide microarrays to identify chromosomal gains and losses and single-nucleotide
polymorphism arrays to detect known point mutations. Another type of data that requires novel
informatics development is the analysis of lesions found to be recurrent among many tumors.

Protein structure prediction


The amino acid sequence of a protein (the so-called primary structure) can be easily determined
from the sequence of the gene that codes for it. In most cases, this primary structure uniquely
determines the structure the protein adopts in its native environment. Knowledge of this structure
is vital in understanding the function of the protein. For lack of better terms, structural
information is usually classified as secondary, tertiary and quaternary structure. Protein structure
prediction is one of the most important problems in drug design and in the design of novel
enzymes. A general solution to such predictions remains an open problem for researchers.
Comparative genomics
Comparative genomics is the study of the relationship of genome structure and function across
different biological species. Gene finding is an important application of comparative genomics,
as is discovery of new, non-coding functional elements of the genome. Comparative genomics
exploits both similarities and differences in the proteins, RNA, and regulatory regions of different
organisms. Computational approaches to genome comparison have recently become a common
research topic in computer science.

Modeling biological systems


Modeling biological systems is a significant task of systems biology and mathematical biology.
Computational systems biology aims to develop and use efficient algorithms, data structures,
visualization and communication tools for the integration of large quantities of biological data
with the goal of computer modeling. It involves the use of computer simulations of biological
systems, like cellular subsystems such as the networks of metabolites and enzymes, signal
transduction pathways and gene regulatory networks to both analyze and visualize the complex
connections of these cellular processes. Artificial life is an attempt to understand evolutionary
processes via the computer simulation of simple life forms.

High-throughput image analysis


Computational technologies are used to accelerate or fully automate the processing,
quantification and analysis of large amounts of high-information-content biomedical images.
Modern image analysis systems augment an observer's ability to make measurements from a
large or complex set of images. A fully developed analysis system may completely replace the
observer. Biomedical imaging is becoming more important for both diagnostics and research.
Some of the examples of research in this area are: clinical image analysis and visualization,
inferring clone overlaps in DNA mapping, Bioimage informatics, etc.

Protein-protein docking
In the last two decades, tens of thousands of protein three-dimensional structures have been
determined by X-ray crystallography and protein nuclear magnetic resonance spectroscopy
(protein NMR). One central question for biological scientists is whether it is practical to predict
possible protein-protein interactions based only on these 3D shapes, without performing protein-
protein interaction experiments. A variety of methods have been developed to tackle the protein-
protein docking problem, though it seems that there is still much work to be done in this field.

Overview of computer-aided drug design


Drug design and discovery is a long process involving identification of a suitable drug target,
screening and selection of an inhibitor, and toxicity and pharmacological analysis of the inhibitor
molecule to make it suitable for therapeutic purposes. The whole process of drug design and
discovery through a traditional trial-and-error approach is lengthy, time-consuming and costly.
With advances in computational hardware and software, most of the drug discovery steps can
now be performed in silico. In a computer-aided drug design approach, a drug target is selected
from a database and its 3-D structure is determined experimentally or, if a homologous structure
is known, a homology model is generated. Once the structure of the enzyme is known, the active
site is mapped by structural comparison with known enzymes. Two approaches can be used to
design an inhibitor molecule against the enzyme: a pharmacophore-based approach, or docking
of candidate inhibitor molecules from different chemical libraries. Top-ranked inhibitor molecules
can be further validated by in-silico toxicity analysis and prediction of pharmacokinetic
parameters. The best molecule can then be tested in wet-lab experiments to validate the
computational results, and a series of clinical trials is needed before therapeutic application is
allowed. Each step of computer-aided drug design can be performed by multiple software
packages with different algorithms.

Data Mining
Data mining refers to extracting or "mining" knowledge from large amounts of data. Data Mining
(DM) is the science of finding new, interesting patterns and relationships in huge amounts of
data. It is defined as "the process of discovering meaningful new correlations, patterns, and trends
by digging into large amounts of data stored in warehouses". Data mining is also sometimes called
Knowledge Discovery in Databases (KDD). Data mining is not specific to any industry. It requires
intelligent technologies and the willingness to explore the possibility of hidden knowledge that
resides in the data.
Data Mining approaches seem ideally suited for Bioinformatics, since it is data-rich, but lacks a
comprehensive theory of life’s organization at the molecular level. The extensive databases of
biological information create both challenges and opportunities for development of novel KDD
methods. Mining biological data helps to extract useful knowledge from massive datasets
gathered in biology, and in other related life sciences areas such as medicine and neuroscience.

Data mining tasks


The two high-level primary goals of data mining, in practice, are prediction and description. The
main tasks well suited for data mining, all of which involve mining meaningful new patterns from
the data, are:
Classification: Learning a function that maps (classifies) a data item into one of several
predefined classes.
Estimation: Given some input data, coming up with a value for some unknown continuous
variable.
Prediction: The same as classification and estimation, except that the records are classified
according to some future behaviour or estimated future value.
Association rules: Determining which things go together; also called dependency modeling.
Clustering: Segmenting a population into a number of subgroups or clusters.
Description and visualization: Representing the data using visualization techniques.
Learning from data falls into two categories: directed (“supervised”) and undirected
(“unsupervised”) learning. The first three tasks – classification, estimation and prediction – are
examples of supervised learning. The next three tasks – association rules, clustering and
description & visualization – are examples of unsupervised learning. In unsupervised learning,
no variable is singled out as the target; the goal is to establish some relationship among all the
variables. Unsupervised learning attempts to find patterns without the use of a particular target
field.
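
As a toy illustration of the two categories, the sketch below (using scikit-learn, an assumption; all numbers are invented) trains a classifier on labelled "expression profiles" (supervised) and then clusters the same samples without labels (unsupervised).

# Toy sketch: supervised classification vs unsupervised clustering (scikit-learn assumed).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# Six samples, three "genes" each; the first three resemble one condition,
# the last three another (all values invented for illustration).
X = np.array([
    [5.0, 1.2, 0.3],
    [5.2, 1.0, 0.4],
    [4.9, 1.1, 0.2],
    [0.4, 4.8, 3.9],
    [0.5, 5.1, 4.2],
    [0.3, 4.9, 4.0],
])
y = ["healthy", "healthy", "healthy", "tumour", "tumour", "tumour"]

# Supervised: learn from labelled samples, then classify a new profile.
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print("Predicted class:", clf.predict([[5.1, 1.1, 0.3]])[0])

# Unsupervised: group the same samples without using the labels at all.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("Cluster assignments:", clusters)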
The development of new data mining and knowledge discovery tools is a subject of active
research. One motivation behind the development of these tools is their potential application in
modern biology.

Application of Data Mining in Bioinformatics


Applications of data mining to bioinformatics include gene finding, protein function domain
detection, function motif detection, protein function inference, disease diagnosis, disease
prognosis, disease treatment optimization, protein and gene interaction network reconstruction,
data cleansing, and protein sub-cellular location prediction.
For example, microarray technologies are used to predict a patient's outcome. On the basis of
patients' genotypic microarray data, their survival time and risk of tumor metastasis or recurrence
can be estimated. Machine learning can also be used for peptide identification through mass
spectrometry. Correlation among fragment ions in a tandem mass spectrum is crucial in reducing
stochastic mismatches for peptide identification by database searching, so an efficient scoring
algorithm that considers this correlative information in a tunable and comprehensive manner is
highly desirable.

Challenges
Bioinformatics and data mining are developing as interdisciplinary sciences. Data mining
approaches seem ideally suited for bioinformatics, since bioinformatics is data-rich but lacks a
comprehensive theory of life's organization at the molecular level.
However, data mining in bioinformatics is hampered by many facets of biological databases,
including their size, number and diversity, the lack of a standard ontology to aid in querying
them, and the heterogeneous quality and provenance of the data they contain. Another problem
is the range of levels of domain expertise among potential users, which makes it difficult for
database curators to provide access mechanisms appropriate to all. The integration of biological
databases is also a problem. Data mining and bioinformatics are fast-growing research areas
today. It is important to examine the important research issues in bioinformatics and to develop
new data mining methods for scalable and effective analysis.
