Database Search, Alignment Viewer and Genomics Analysis Tools: Big Data For Bioinformatics
Abstract: Advances in sequencing technology have resulted in the production of large amounts of omics data in a short time. Traditional bioinformatics tools cannot cope with the rate at which such data is produced, so new tools need to be developed and existing tools need to be improved. New researchers, developers and bioinformaticians have difficulty selecting the appropriate tool for analyzing data or for improving existing tools. This paper presents a comprehensive survey of the available bioinformatics tools, their purposes, the programming languages used to develop them and the data formats they support. It also indicates whether each tool has been enhanced for use on a Big Data platform.

This paper was submitted for review on 14 December 2016.
Muhammad Atif Sarwar is with the Department of Computer Science, COMSATS Institute of Information Technology, Sahiwal, 57000 Pakistan (e-mail:[email protected]).
Abbas Rehman is with the Department of Computer Science, COMSATS Institute of Information Technology, Sahiwal, 57000 Pakistan (e-mail:[email protected]).
Javed Ferzund is with the Department of Computer Science, COMSATS Institute of Information Technology, Sahiwal, 57000 Pakistan (e-mail:[email protected]).

I. INTRODUCTION

With the passage of time, new approaches and technologies have been developed because massive amounts of data are available. Large amounts of data are available in many fields such as Electrical, Mechanical, Electronics, Mathematics, Management Sciences, Computer Science and Bioinformatics. A stack of tools is available in every field to help analyze and store this data.

Recently, the term Big Data has been introduced to denote the huge amounts of data handled in the Computer Science field. Such data needs to be analyzed, stored and managed; examples include the data of Facebook, Yahoo, Twitter and Walmart. Big Data is characterized by five Vs: Veracity (trustworthiness of data), Velocity (speed of data processing), Variety (types of data), Volume (amount of data) and Potential Value (the value that can be derived from data). Potential Value is very important for future planning. Big Data flows involve two types of processing: real time (NewSQL) and batch (analytics based). Performance is the major challenge for Big Data. Many tools and systems are available to manage Big Data, for example Hadoop (a distributed data management system).

At the same time, bioinformatics data is also being produced in large amounts in areas such as genomics, proteomics, RNA, DNA and motif finding. A lot of data is produced for sequence alignment (multiple and pairwise) of RNA, DNA and proteins. Some data is produced for relationships such as protein to protein, gene to disease, disease to gene and gene to protein. Some important data is available for database search. NGS (Next Generation Sequencing) has produced a lot of sequencing data.

In the coming years, this data will keep growing day by day. All of this bioinformatics data needs to be analyzed and stored in a well-organized way. For this purpose, the open source Apache Hadoop system was designed for distributed storage and exploration of large data. It offers fault tolerance, security and efficiency. Hadoop consists of HDFS (Hadoop Distributed File System), MapReduce (a programming paradigm) and many tools built on top of them, such as HBase, Hive, Pig and ZooKeeper.

HDFS is a distributed system that stores data across the nodes of a cluster and makes it available for processing. It consists of a name node and data nodes (a master/slave relationship). HBase provides read/write data access with ACID properties. Hive is a data warehouse system with an HQL (Hive Query Language) interface. It also provides the
317 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 14, No. 12, December 2016
facility of Hive data units such as Partition, Bucket and Table.

Hadoop MapReduce is a system for parallel processing of large data using a simple programming model. It consists of Map and Reduce tasks coordinated by a job tracker and task trackers (a master/slave relationship). Apache also provides the Spark framework for analysis of large data with the help of RDDs (Resilient Distributed Datasets), Transformations, Actions and Caching. It includes many built-in libraries for multiple purposes. Spark has the advantage of better performance than Hadoop MapReduce.

Many bioinformatics tools are designed for processing and analysis of bioinformatics data such as RNA, DNA, genomics and proteomics data. These tools do not scale to such huge amounts of data. To remove this bottleneck, they are being implemented on the Hadoop platform for processing large bioinformatics data. Some tools are implemented in Hadoop for alignment viewing, some for database searching and some for genomic analysis. Most of these bioinformatics tools are implemented in the MapReduce or Spark framework.

Hadoop modules support many languages such as Java, Python and Scala. Bioinformatics tools are implemented in a specific language in the MapReduce or Apache Spark framework. Some tools are implemented in Java, some in Python, some in Scala and some in C++ or C#. Tools implemented in MapReduce mostly use Java. Tools implemented in Spark mostly use Scala, Java and Python.

The choice of data format for storing large bioinformatics data is a very important decision. The data format must be considered whenever bioinformatics tools are implemented in the MapReduce or Spark framework. Some data formats perform well with small datasets but do not scale when the bioinformatics data is large. Different data formats exist for large data storage in database searching and alignment viewing/editing.

The objectives of this survey are:

- To explore all tools in the bioinformatics domain
- To explain the specific implementation platform for these bioinformatics tools
- To recognize the specific implementation language for each tool
- To understand the data formats for storage and analysis of large data in the MapReduce or Spark framework

The rest of this survey is organized as follows: Section 2 describes the related work in this field. Section 3 presents the available bioinformatics tools and their characteristics in terms of category of tools, implementation language, data format support and implementation for Big Data. Section 4 presents a brief discussion and finally the conclusion is presented in Section 5.

II. RELATED WORK

Niemenmaa [1] describes how large data sets are analyzed and processed with distributed computing on computer clusters. Rapidly growing data quantities, for example from DNA sequencing, require new methods of data analysis. Hadoop, a collection of software designed especially for Big Data, is used for data processing; it is built on the Hadoop Distributed File System and the scalable Hadoop MapReduce distributed computing platform, alongside Apache Spark. That work describes how Hadoop MapReduce and Apache Spark are applied to bioinformatics data. Bioinformatics data is growing so fast that storing and manipulating it increasingly depends on technologies such as Hadoop MapReduce, Apache Spark and HDFS. New file formats are therefore being developed to cope better with the needs of modern and future Big Data sets. This work analyses the current state-of-the-art tools in bioinformatics and their implementation on Big Data platforms.

Schatz et al. [2] developed CloudBurst, which is used for genome mapping. CloudBurst provides a parallel short-read mapping method that improves the scalability of reading large sequencing datasets. The CloudBurst team has developed further tools for the biomedical field, for example Crossbow, used for identifying single nucleotide polymorphisms (SNPs) from sequencing data, and Contrail, used for assembling large genomes.

Pandey et al. [3] designed the DistMap toolkit for distributed short-read mapping on a Hadoop cluster. DistMap aims to support a range of mappers in order to cover a wider variety of sequencing applications. The nine supported mappers are SOAP, STAR, GSNAP, BWA, Bowtie, Bowtie2, Bismark, BSMAP and TopHat. DistMap provides an integrated mapping workflow that can be run with a simple command.

OConnor et al. [4] built SeqWare, a query engine based on the Apache HBase database that helps bioinformatics researchers access large-scale whole-genome datasets. The SeqWare team also created an interactive interface that integrates genome browsers and tools. In a prototyping analysis, the 1102GBM and U87MG tumor databases were loaded, and the team compared the HBase back end with Berkeley DB for variant export and data loading capabilities.

Lewis et al. [5] developed Hydra, a scalable proteomic search engine built on the Hadoop distributed computing framework. Hydra provides
a software package for processing large peptide and spectra databases, implementing a distributed computing framework that supports scalable searching of massive quantities of spectrometry data. Hydra divides the proteomic search into two steps: (1) scoring the spectra and retrieving the data and (2) generating a peptide database.

Van der Auwera et al. [6] introduced the Genome Analysis Toolkit (GATK), designed to support large-scale DNA sequence analysis based on the MapReduce programming framework. GATK supports several data formats, including SAM files, binary alignment/map (BAM), dbSNP and HapMap. The toolkit has two kinds of modules, traversals and walkers. The traversal modules read sequencing data into the system and provide the associated references to the data. The walker modules produce analytics outcomes from the data. GATK has been used in the 1000 Genomes Project and The Cancer Genome Atlas.

Langmead et al. [7] introduced Myrna, a cloud-based computing pipeline used to calculate differences in gene expression in very large RNA sequencing datasets. RNA-seq data are sequencing reads derived from mRNA molecules. Myrna supports several functions for RNA-seq analysis, including read alignment, normalization and statistical modeling, in an integrated pipeline. Myrna reports the differential expression of genes in the form of P-values and q-values, together with analytical plots for those genes. The system was tested on the Amazon Elastic Compute Cloud (Amazon EC2) using 1.1 billion RNA-seq reads; the results show that Myrna can process the data in less than two hours, and the cost of the test task was around $66.

Chung et al. [8] proposed CloudDOE, a software tool set that offers a user-friendly interface for deploying a Hadoop cloud, because the Hadoop platform itself is too complex for users without expertise in computer science or related technical skills. The tool is simple enough for non-specialists and uses MapReduce to analyze high-throughput sequencing data. It helps scholars and researchers configure a Hadoop cloud easily in order to study different aspects of bioinformatics data.

Robinson et al. [9] presented SAMQA, a toolkit designed to find errors in genomic data and to check that sequencing data meets the required quality measures. The tool was developed for the National Institutes of Health Cancer Genome Atlas to detect errors automatically and to generate a log file of them. It uses technical tests to identify abnormalities in genomic data such as invalid CIGAR values or sequence alignment/map (SAM) format errors. For the same biological data set of approximately 23 GB, a comparison of SAMQA on a single-core server and on a Hadoop cluster showed that the Hadoop cluster run was about 80 times faster.

Angiuoli et al. [10] designed CloVR, a sequence analysis tool distributed as a virtual machine. It supports both cloud systems and personal desktop systems for processing sequencing data, removing barriers to the analysis of large datasets. The virtual machine incorporates several bioinformatics workflows, including metagenome, whole-genome and 16S rRNA sequencing analysis. To check the portability of CloVR between a local host and a cloud platform, the tool was tested on a local machine (4 CPUs, 8 GB RAM) and on the Amazon EC2 cloud platform (80 CPUs). The results showed that CloVR is portable across both platforms.

Oehmen et al. [11] proposed ScalaBLAST 2.0, a bioinformatics tool for running the five BLAST calculations (blastn, blastp, tblastn, tblastx and blastx) in parallel on genome and protein sequences, on both small and large multiprocessor systems. Compared with ScalaBLAST 1.0 and mpiBLAST, ScalaBLAST 2.0 provides dynamic data partitioning (which gives it fault resilience) and does not require pre-formatting of datasets. It is implemented using the NCBI BLAST C toolkit and depends on an MPI library. The input file is in FASTA format, and the supported output formats are standard pairwise, tabular and tabular with headers.

Michael C. Schatz [12] introduced BlastReduce, a parallel seed-and-extend alignment algorithm (comprising three MapReduce cycles) that takes short reads from DNA NGS data and aligns, or maps, them to a reference genome (a database for a specific species) on the Hadoop MapReduce paradigm, which supports parallel execution. Like BLAST, BlastReduce uses the seed-and-extend technique; unlike BLAST, it uses the Landau-Vishkin algorithm for the extension step. It is implemented in Java with Hadoop and is compatible with cloud computing. BlastReduce is reported to be much faster than BLAST, and its performance results show that it scales to large sets of reads with high speedup.

III. TOOLS FOR BIOINFORMATICS

Many bioinformatics tools are used for analysis of small and large datasets. Every tool performs a specific function. Different tools are used for alignment viewing and editing, database search and genome analysis. These tools require the data to be stored in a specific format for any kind of analysis. They are built using different programming languages, so it is important to know the specific language in order to customize a tool. Programming skills are even more helpful when extending these tools for the Hadoop MapReduce or Apache Spark framework.
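To make the MapReduce pattern referred to throughout this survey more concrete, the following is a minimal single-process Python sketch of counting k-mers with separate map, shuffle and reduce phases. It does not use Hadoop or Spark, and it is not taken from any of the surveyed tools; the function names (map_kmers, shuffle, reduce_counts, count_kmers) and the two sample reads are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_kmers(read: str, k: int) -> Iterator[Tuple[str, int]]:
    """Map phase: emit a (k-mer, 1) pair for every window of length k in one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle phase: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce phase: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def count_kmers(reads: Iterable[str], k: int) -> Dict[str, int]:
    """Run the three phases in sequence on an in-memory list of reads."""
    pairs = (pair for read in reads for pair in map_kmers(read, k))
    return reduce_counts(shuffle(pairs))


# Hypothetical sample reads, used only to exercise the sketch.
reads = ["ACGTAC", "GTACGT"]
counts = count_kmers(reads, 3)
```

In a real Hadoop or Spark job the map and reduce functions would run on different nodes and the shuffle would be performed by the framework; here all three phases run in one process so that the data flow is easy to follow.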
Table 1: Overview of Alignment Viewer/Editor that use MapReduce and Spark Framework
Tool | Language | Data formats | MapReduce | Spark
Genedoc [40] | Python, C++, C [41] | MSF | NO | NO
Geneious | Perl/Python/Java/C++/C [42] | Fasta | NO | NO
Integrated Genome Browser (IGB) [43] | Perl, C/C++ and Java [44] | BAM, Fasta, PSL | NO | NO
IVistMSA [45] | C++ and Java [46] | MSF, ClustalW, Nexus, PIR, Phylip, GDE | NO | NO
Jalview 2 [47] | Java [48] | BLC, PIR, Stockholm, MSF, Clustal, Fasta, PFAM | NO | NO
JEvTrace [49] | Java [50] | FASTA, MSF, Clustal, Phylip, Newick, PDB | NO | NO
JSAV [51] | Javascript [52] | An array of JavaScript objects | NO | NO
Maestro [53] | Java [54] | Clustal, Fasta, PDB | YES [55] | NO
MEGA [56] | C++ [57] | FASTA, Clustal, Nexus, Mega, etc. | NO | NO
MSAReveal.org [58] | Java [59] | Fasta | NO | NO
Multiseq (VMD plugin) [60] | C++ [61] | Fasta, PDB, ALN, Phylip, Nexus | NO | NO
MView [62] | Perl [63] | Clustal, HSSP, Fasta, PIR, MSF, Fasta search, Blast search | NO | NO
PFAAT [64] | Java [65] | Nexus, MSF, Clustal, Fasta, PFAAT | NO | NO
Ralee (Emacs plugin for RNA) [66] | Perl [67] | Stockholm | NO | NO
S2S RNA editor | Java [68] | Fasta, RNAML | NO | NO
B. Database Search

Many tools are available for searching protein and nucleotide sequence databases, and with the advancement in technology new ones continue to be developed. They are implemented in different languages and platforms, including C, C++, C#, Java, Perl, Python, CUDA C++ and PTX assembly language, Android, Objective-C and Ruby. These tools include BLAST, CS-BLAST, CUDASW++, DIAMOND, GGSearch/GLSearch, Genoogle and HMMER. A list of the available tools in this category is presented in Table 2.

The Database Search tools support different formats for querying the stored data. The supported formats include FASTA, bare sequences, identifiers, Genbank, FASTQ, EMBL, CLUSTAL, Stockholm, A2M, A3M, MEGA, GCG/MSF, PIR/NBRF and TREECON. Some of these tools, such as BLAST, DIAMOND, HMMER, HHpred/HHsearch and ScalaBLAST, have been implemented on Hadoop MapReduce or the Apache Spark framework.
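Most of the tools in this category accept FASTA input. As a minimal sketch of what reading that format involves, the following Python function parses FASTA records from an iterable of lines. The function name parse_fasta and the sample records are hypothetical and are not taken from any of the surveyed tools; the sketch assumes plain, uncompressed text and does no validation of the sequence alphabet.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines.

    A record starts with a line beginning with '>' and its sequence may be
    wrapped over several following lines.
    """
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical sample input, used only to exercise the sketch.
example = """>seq1 hypothetical record
ACGT
ACGT
>seq2
TTGA
""".splitlines()

records = list(parse_fasta(example))
```

Because the function is a generator over lines, the same logic can be applied to a file handle without loading the whole file into memory, which matters once sequence files become large.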
Table 2: Overview of Database Search tools that use MapReduce and Spark Framework
Tool | Type | Language | Data formats | MapReduce | Spark
Genoogle | Nucleotide or protein sequence | Java | Fasta | NO | NO
HMMER [87] | Nucleotide or protein sequence | C | Fasta, EMBL, Genbank | YES [88] | NO
HHpred/HHsearch | Protein sequence | C++ | Fasta, A2M, GCG/MSF, A3M, PIR/NBRF, EMBL, MEGA, Clustal | YES [88] | NO
KLAST | Similarity search tool | C++, Java | Fasta | NO | NO
USEARCH | Nucleotide or protein | C/C++ | Fasta | NO | NO
Parasail | Nucleotide or protein | Python or C | Fasta, Fastq | NO | NO
PSI-Search [89] | Protein | Java | PHYLIP, GCG, UniProtKB, PIR | NO | NO
C. Genomics Analysis

Tools in this category are used for analysis of nucleotide or peptide sequences. Different programming languages are used to implement genome analysis tools. Java is used by ACT, FLAK, Mauve, Mulan, Sequero and Shuffle-LAGAN. Perl is used by BLAT, Sim4/SIBsim4 and SLAM. R is used by DECIPHER. FORTRAN is used by GMAP. C++ is used by GMAP, Splign, Mauve, Sequilab, Shuffle-LAGAN and Sim4. C is used by GMAP, PLAST-ncRNA, Sequilab, Shuffle-LAGAN, Sim4 and SLAM. Python is used by MGA, Multiz, PLAST-ncRNA and Sequilab. Ruby is used by SLAM. Table 3 presents the tools available for Genomics Analysis.

Every tool supports specific formats for efficient data storage. The EMBL format is used by ACT and Sim4. The GENBANK format is used by ACT, Mauve, MGA, Mulan and Sim4. The GFF format is used by ACT. The FASTA format is used by ACT, BLAT, DECIPHER, GMAP, Splign, Mauve, Mulan, Multiz, PLAST-ncRNA, Sequilab, Shuffle-LAGAN, Sim4 and SLAM. The Multi-FASTA format is used by Mauve. The bare format is used by Mulan and Multiz. The Fastq format is used by PLAST-ncRNA.

Some of these tools, such as BLAT, GMAP and MGA, are implemented in the Hadoop MapReduce framework for genome analysis. These implementations use a specific format for input data in the Hadoop MapReduce or Apache Spark framework.
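The format-to-tool relationships listed above are easier to query when held in a small lookup table. The following Python sketch records them exactly as stated in the text and inverts the mapping so that the formats supported by a given tool can be looked up. The names FORMAT_TO_TOOLS, formats_by_tool and TOOL_TO_FORMATS are hypothetical, and the table reflects only what this section states, not an independent check of each tool.

```python
from collections import defaultdict
from typing import Dict, List

# Format -> tools, transcribed from the paragraph above.
FORMAT_TO_TOOLS: Dict[str, List[str]] = {
    "EMBL": ["ACT", "Sim4"],
    "GENBANK": ["ACT", "Mauve", "MGA", "Mulan", "Sim4"],
    "GFF": ["ACT"],
    "FASTA": ["ACT", "BLAT", "DECIPHER", "GMAP", "Splign", "Mauve", "Mulan",
              "Multiz", "PLAST-ncRNA", "Sequilab", "Shuffle-LAGAN", "Sim4", "SLAM"],
    "Multi-FASTA": ["Mauve"],
    "Bare": ["Mulan", "Multiz"],
    "FASTQ": ["PLAST-ncRNA"],
}


def formats_by_tool(format_to_tools: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Invert the mapping: tool -> alphabetically sorted list of formats."""
    inverted: Dict[str, List[str]] = defaultdict(list)
    for fmt, tools in format_to_tools.items():
        for tool in tools:
            inverted[tool].append(fmt)
    return {tool: sorted(fmts) for tool, fmts in inverted.items()}


TOOL_TO_FORMATS = formats_by_tool(FORMAT_TO_TOOLS)
```

For example, TOOL_TO_FORMATS["Mauve"] lists the three formats the text attributes to Mauve, which helps when deciding which input conversions a MapReduce or Spark port of a tool would need to handle.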
Table 3: Overview of Genomics Analysis tools that use MapReduce and Spark Framework
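Tool | Type | Language | Data formats | MapReduce | Spark
DECIPHER | Nucleotide | R | Fasta | NO | NO
FLAK [94] | Nucleotide | Java | Fasta | NO | NO
GMAP [95] | Nucleotide | Fortran, C and C++ | Fasta | YES [96] | NO
Splign [97] | Nucleotide | C++ | Fasta | NO | NO
MGA [99] | Nucleotide | Python | Genbank | NO | NO
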
IV. FUTURE OF BIOINFORMATICS TOOLS

Bioinformatics research data is very large, complex in nature and critical for results. Conventional research methods exhibit very high time complexity when analyzing results and very high space complexity when storing such massive datasets, and so require systems with tremendously high processing capabilities. For example, the NCI's The Cancer Genome Atlas (TCGA) dataset alone is 2.5 petabytes, and downloading it would take 23 days even at industry-standard internet speeds. So instead of introducing such highly capable machines, cloud computing has taken their place. To handle the large, complex and distributed data of bioinformatics, it is economical to process datasets across the cloud. Cloud computing is beneficial for big data analytics because it distributes the data load across physically distant machines, improving efficiency, saving money and avoiding single points of failure. Also, traditional bioinformatics data analysis tools based on R, Perl or Python do not meet the requirements for handling such huge datasets. So there is a need to use different tools according to the nature of the datasets involved and the type of queries or results needed to gain insight into the data. The focus has now shifted from data generation to data analysis.

The main goal of using statistics, machine learning algorithms and data mining techniques to identify, compile, analyze and visualize biological data structures is to build new models that may help in epidemic disease analysis, understanding evolution, matching genomics, suggesting medicine, providing health care information and predicting metabolomic processes. The main purpose is to introduce and implement standard protocols for analyzing huge bioinformatics data using modern computer science techniques that will tackle the
complexity in the nature of the data and combine data across the cloud from distributed resources. The need now is to fuse robust, efficient, quantitative, accurate and precise data visualization algorithms into previously implemented tools in the biological and biomedical fields in order to target all four Vs of big data: volume of data, velocity of processing the data, variability of data sources, and veracity of the data quality. This will ultimately help biological scientists, pharmacists, patients, customers and companies.

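The TCGA download figure quoted earlier in this section can be checked with simple arithmetic. The short Python sketch below computes the sustained bandwidth implied by transferring 2.5 petabytes in 23 days. It assumes decimal petabytes (10^15 bytes) and a perfectly sustained transfer with no protocol overhead; both are assumptions made here, not statements from the source, and the function name required_bandwidth_gbps is hypothetical.

```python
def required_bandwidth_gbps(size_bytes: float, days: float) -> float:
    """Return the sustained rate, in gigabits per second, needed to move
    size_bytes within the given number of days."""
    seconds = days * 24 * 60 * 60
    bits = size_bytes * 8
    return bits / seconds / 1e9


# Assumption: 1 petabyte = 10**15 bytes (decimal definition).
tcga_bytes = 2.5e15
bandwidth = required_bandwidth_gbps(tcga_bytes, 23)
print(f"Implied sustained bandwidth: {bandwidth:.2f} Gbit/s")
```

Under these assumptions the implied rate is roughly 10 Gbit/s, which indicates the kind of link the 23-day estimate presupposes and why moving the computation to the data is attractive for datasets of this size.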
V. CONCLUSION

NGS (Next Generation Sequencing) plays an essential role in the bioinformatics field. Many tools exist that are implemented in the Hadoop MapReduce and Apache Spark frameworks. These tools are implemented in specific languages such as C, C++, Python, Scala and Java. The bioinformatics domain consists of DNA, protein, genomics and RNA data. When this data is stored by bioinformatics (BI) tools, specific data formats are used, such as GenBank, EMBL, Fasta, Fastq and Phylip 4. When bioinformatics data is stored in the Hadoop MapReduce or Spark framework, specific data formats are used, such as Avro, sparse vectors, (key, value) pairs and Sequence files.

Some alignment viewer/editor tools are implemented in the Hadoop MapReduce and Spark frameworks. Alignment viewer/editor tools such as DANASTAR and Maestro have been implemented on the Hadoop MapReduce platform. Other alignment viewer/editor tools, such as Ale, AliView, Base by Base, BioEdit, BioNumerics, BoxShade, CLC Viewer, DnaSP, FLAK, Genedoc, Jalview 2, JSAV, JEvTrace, Mega, PFAAT, Seaview, Sequilab, Snipviz, Strap, Tablet, DNApy and UGENE, can be implemented on the Hadoop MapReduce platform. These tools can also be implemented in the Apache Spark framework.

Database search also plays a vital role in bioinformatics. Like alignment viewer/editor tools, some database search tools are implemented in Hadoop MapReduce and Apache Spark. Database tools such as BLAST, DIAMOND, HMMER, HHSearch and ScalaBLAST have been implemented on the Hadoop MapReduce platform, and ScalaBLAST has also been implemented in the Apache Spark framework. Other database tools, such as CS-BLAST, FASTA, GLSearch, Genoogle, KLAST, Parasail, PSI-Search and Sequilab, can be implemented on the Hadoop MapReduce platform. These tools can also be implemented in the Apache Spark framework.

Genome analysis involves genes and microarray data, and gene-to-disease relationships play a significant role in it. Like database search tools, some genome analysis tools are implemented in Hadoop MapReduce and Apache Spark for bioinformatics datasets. Genome analysis tools such as BLAT, GMAP and MGA have been implemented on the Hadoop MapReduce platform. Other genome analysis tools, such as ACT, DECIPHER, FLAK, Splign, Mauve, Mulan, Multiz, Sequilab and SLAM, can be implemented on the Hadoop MapReduce platform. These tools can also be implemented in the Apache Spark framework.

Like multiple sequence alignment tools, pairwise sequence alignment tools can be implemented in Hadoop MapReduce and Apache Spark for bioinformatics datasets. Tools such as ACANA, AlignMe, CUDAlign, DNADot, DOTLET, FEAST, G-PAS, LALIGN, mAlign, MUMmer, Needle, NW, Path, PyMOL, SEQALN, SIM, GAP, NAP, LAP, SSEARCH, Water and YASS can be implemented on the Hadoop MapReduce platform. These tools can also be implemented in the Apache Spark framework.

REFERENCES

[1] M. Niemenmaa, "Analysing sequencing data in Hadoop: The road to interactivity via SQL".
[2] M. C. Schatz, "CloudBurst: highly sensitive read mapping with MapReduce," Oxford Journal, 2009.
[3] C. S. Ram Vinay Pandey, "DistMap: A Toolkit for Distributed Short Read Mapping on a Hadoop Cluster," PLOS ONE, August 2013.
[4] B. M. S. F. N. Brian D OConnor, "SeqWare Query Engine: storing and searching sequence data in the cloud," in Open Source Conference, 2010.
[5] A. C. S. K. H. H. M. R. H. R. L. M. Steven Lewis, "Hydra: a scalable proteomic search engine which utilizes the Hadoop distributed computing framework," in BioMed Central, 5 December 2012.
[6] M. O. C. Geraldine A. Van der Auwera, "From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline," in Current Protocols in Bioinformatics, October 2013.
[7] B. Langmead, K. D. Hansen and J. T. Leek, "Cloud-scale RNA-sequencing differential expression analysis with Myrna," Genome Biol., 2010;11(8):R83, BioMed Central, 11 August 2010.
[8] C.-C. C. Wei-Chun Chung, "CloudDOE: A User-Friendly Tool for Deploying Hadoop Clouds and Analyzing High-Throughput Sequencing Data with MapReduce," PLOS ONE.
[9] S. K. R. B. a. J. B. Thomas Robinson, "SAMQA: error classification and validation of high-throughput sequenced read data," BioMed Central.
[10] M. M. A. G. K. G. M. V. D. R. R. Samuel V Angiuoli, "CloVR: A virtual machine for automated and portable sequence analysis from the desktop using cloud computing," BioMed Central.
[11] C. S. O. a. D. J. Baxter, "ScalaBLAST 2.0: rapid and robust BLAST calculations on multiprocessor systems," Oxford, January 29, 2013.
[12] M. C. Schatz, "BlastReduce: High Performance Short Read Mapping with MapReduce," in Computational Method for Next Generation Sequencing Data Analysis.
[13] [Online]. Available: http://www.red-bean.com/ale/.
[14] U. University, "http://www.ormbunkar.se/aliview/," [Online].
[15] [Online]. Available: http://www.ormbunkar.se/aliview/.
[36] "paradoxus," [Online]. Available: https://paradoxus.wordpress.com/category/life-of-student/.
[37] "LIST OF SEQUENCE ALIGNMENT SOFTWARE," [Online]. Available: http://theinfolist.com/php/SummaryGet.php?FindGo=List%20Of%20Sequence%20Alignment%20Software.
[60] J. E. D. W. a. Z. L.-S. Elijah Roberts, "The Luthey-Schulten group," [Online]. Available: http://www.scs.illinois.edu/schulten/multiseq/.
[61] E. Roberts, "MultiSeq: unifying sequence and structure data for evolutionary analysis," BioMed Central, 16 August 2006.
[62] "Mview," [Online]. Available: https://desmid.github.io/mview/.
[63] [Online]. Available: https://sourceforge.net/projects/bio-mview/postdownload?source=dlp.
[64] "PFAAT," [Online]. Available: http://pfaat.sourceforge.net/.
[65] [Online]. Available: https://sourceforge.net/directory/os:windows/?q=annotation%20tool.
[66] S. Griffiths-Jones, "RALEE: RNA ALignment Editor in Emacs," Oxford Journal, Accepted August 17, 2004.
[67] [Online]. Available: https://raw.githubusercontent.com/samgriffithsjones/ralee/release-0.8/00README.
[68] [Online]. Available: http://www.bioinformatics.org/groups/categories.php?cat_id=2.
[69] [Online]. Available: http://packages.ubuntu.com/precise/seaview.
[70] [Online]. Available: https://github.com/4ment?tab=repositories.
[71] "seqpup," [Online]. Available: http://iubio.bio.indiana.edu/soft/molbio/seqpup/java/.
[72] "sequlator," [Online]. Available: http://sequlator.com/.
[73] [Online]. Available: https://www.google.com.pk/url?sa=t&rct=j&q=&esrc=.
[74] U. o. W. Department of Biochemistry, "SnipViz: a compact and lightweight web site widget for display and dissemination of multiple versions of gene and protein sequences," PubMed.
[75] [Online]. Available: https://github.com/njahn82/dvcs_epmc/blob/master/data/github_parsed.csv.
[76] [Online]. Available: http://www.bioinformatics.org/strap/Scripting.html.
[77] "tablet," [Online]. Available: https://ics.hutton.ac.uk/tablet/.
[78] [Online]. Available: http://www.revolvy.com/main/index.php?s=List%20of%20alignment%20visualization%20software&iem_type=topic.
[79] [Online]. Available: https://github.com/mengqvist/DNApy.
[83] B. S. a. D. L. M. Yongchao Liu, "CUDASW++2.0: enhanced Smith-Waterman protein database search on CUDA-enabled GPUs based on SIMT and virtualized SIMD abstractions," BioMedCentral, 6 April 2010.
[84] "UWSysLab/diamond," [Online]. Available: https://github.com/UWSysLab/diamond.
[85] "FASTA Sequence Comparison at the U. of Virginia," [Online]. Available: http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml.
[86] [Online]. Available: https://en.wikipedia.org/wiki/FASTA_format.
[87] "HMMER: biosequence analysis using profile hidden Markov models," [Online]. Available: http://hmmer.org/.
[88] A. Ragothaman, "Developing eThread Pipeline Using SAGA-Pilot Abstraction for Large-Scale Structural Bioinformatics," BioMed Research International.
[89] M. B. a. L. Aravind, "PSI-BLAST Tutorial," in Comparative Genomics: Volumes 1 and 2.
[90] C. S. Oehmen, "ScalaBLAST 2.0: rapid and robust BLAST calculations on multiprocessor systems," Oxford.
[91] A. T. Seung-Jin Sul, "Parallelizing BLAST and SOM Algorithms with MapReduce-MPI Library," in Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW), 2011 IEEE International Symposium on, 01 September 2011.
[92] "Research on High-performance and Scalable Data Access in Parallel Big Data Computing," [Online]. Available: http://stars.library.ucf.edu/etd/1417/.
[93] K. M. R. M. B. M.-A. R. B. G. B. a. J. P. Tim J. Carver, "ACT: the Artemis comparison tool," Oxford, 2005.
[94] "FLAK (Fuzzy Logic Analysis of k-mers)," [Online]. Available: http://www.flakbio.com/.
[95] T. D. W. a. C. K. Watanabe, "GMAP: a genomic mapping and alignment program for mRNA and EST sequences," Oxford.
[96] N. R. S. J. A. G. Karthik Kambatla, "Relaxed Synchronization and Eager Scheduling in MapReduce," [Online]. Available: https://researchontherocks.wordpress.com/2011/11/03/relaxed-synchronization-and-eager-scheduling-in-mapreduce/.
[97] [Online]. Available: https://www.ncbi.nlm.nih.gov/sutils/splign/splign.cgi.
[98] M. B. B. F. P. N. Darling AC, "Mauve: multiple alignment of conserved genomic sequence with rearrangements," PubMed.
fields. Currently, he is leading the Big Data Analytics Research Group at COMSATS Institute Sahiwal.

Muhammad Atif Sarwar is a Lab Engineer at the Department of Computer Science, COMSATS Institute of Information Technology Sahiwal, Pakistan. He received the BS (CS) degree from COMSATS Institute of Information Technology Sahiwal, Pakistan in 2015. Currently, he is a scholar of MS (CS) session 2015-2017 in COMSATS Institute of Information Technology Sahiwal, Pakistan. His main research interests include Big Data Analytics and Machine Learning. Particularly, he is interested in applications of Big Data in the Bioinformatics field. Currently, he is working with the Big Data Analytics Research Group at COMSATS Institute Sahiwal.