International Journal on Recent and Innovation Trends in Computing and Communication ISSN: 2321-8169
Volume: 3 Issue: 3 1704 - 1708
_______________________________________________________________________________________________
Are NoSQL Data Stores Useful for Bioinformatics Researchers?
A comparative study of storing and querying strategies for proteomics mass-spectrometry data
Borong Shao, Tim OF Conrad
Freie Universität Berlin,
Berlin, Germany
Research Campus MODAL, Zuse Institute Berlin,
Berlin, Germany
email:
[email protected],
[email protected]
Abstract-The big data challenge in bioinformatics is approaching. Data storage and processing, rather than experimental technologies, are becoming the slower and more costly part of research. Biological data are typically large and varied in structure. The ability to store and retrieve the data efficiently is important in bioinformatics research. Traditionally, large datasets are stored either as disk-based flat files or in relational databases. These systems become more complicated to plan, maintain and adjust to big data applications because they follow rigid table schemas and often lack scalability, e.g. for data aggregation. Meanwhile, non-relational databases (NoSQL) have emerged to provide alternative, flexible and more scalable data stores.
In this study, we aim to quantitatively compare the latencies of different data stores when storing and querying proteomics datasets. We show benchmarks for typical relational and non-relational systems in both in-memory and disk-based configurations and compare them to a simple flat-file based approach. We focus on the latencies of storing and querying proteomics mass spectrometry datasets and on the actual space consumption inside the data stores. Experiments are carried out on a local desktop with medium-sized data, which is the typical experimental setting of individual bioinformatics researchers. Results show that there are significant latency differences among the considered data stores (up to 30-fold). In certain use cases, a flat file system can achieve performance comparable to that of the data stores.
Keywords-relational vs. non-relational databases, proteomics data, storing and querying latencies
__________________________________________________*****_________________________________________________
I. INTRODUCTION
Nowadays, advances in high-throughput technologies lead to exponential growth of molecular biological data. Discovering useful information from these data is one of the main endeavors in bioinformatics. To achieve this, large amounts and varieties of biological data such as DNA, protein sequences, microarrays and proteomics data need to be stored, retrieved and analyzed. Although new algorithms and pipelines are developed constantly, the gap between the amount of data produced and the amount of data analyzed is still growing [1, 2]. Biological data qualify as “big data”, which is often characterized by three “V” properties: volume, velocity, and variety. An efficient data store is required to address the big data challenge in bioinformatics [3-5].
There are mainly two types of database systems: traditional relational databases and non-relational (NoSQL) databases. In relational databases, data are stored in a number of cross-referenced tables and queried through relational algebra operations. Relational databases provide ACID properties (atomicity, consistency, isolation, and durability), which guarantee reliable database transactions. At the same time, this limits the scalability of the databases [6, 7]. As stated in Eric Brewer's CAP theorem [8], a system can have only two of the following three properties: consistency, availability, and partition tolerance. For systems that require ACID transactional properties, a relational database is a good option. However, for systems that can relax ACID constraints but must address availability and scalability, NoSQL databases may provide alternative options. NoSQL databases fall into several categories for different types of applications. There are key-value databases such as DynamoDB, column-oriented databases such as HBase and Cassandra, document-based databases such as CouchDB and MongoDB, and graph databases such as AllegroGraph and Neo4j [9]. They generally relax the ACID constraints and provide BASE properties (basically available, soft state, and eventual consistency) instead [10]. The lower level of ACID compliance is traded off for higher availability [6, 11], flexibility [12] and scalability.
Biological data are commonly stored as flat files or in relational databases [13]. Once the data are stored, most of the operations on the data are queries, which serve as the first step of data mining or knowledge discovery [14]. In bioinformatics data analysis, ACID compliance is usually not the critical issue, but efficient data mining is [15]. For example, the mass spectrometry data of patients are generated and stored. These data, usually from megabytes to gigabytes in size, need to be queried over and over again to be used in computations such as biomarker identification and protein identification. Therefore, an ideal data store should have low latencies in storing and querying data while maintaining consistency. NoSQL data stores are useful for the storage and processing of large volumes of data when the structure of the data does not require a relational model [7]. Meanwhile, updates of the data are guaranteed to be propagated to all nodes eventually. It is therefore interesting to investigate whether NoSQL techniques can provide benefits in bioinformatics applications.
There have been a number of qualitative or conceptual studies comparing relational and non-relational databases [9, 10, 16]. They compare databases in terms of data models, query models, consistency models, scalability, maturity, etc. In practice, however, results from quantitative experiments are needed to draw useful conclusions. There have also been a few quantitative studies comparing different databases in biological applications, where some of the data stores are employed in certain use cases. For example, experiments are performed on storing and querying clinical data with an XML-based data store [17]. They conclude that an XML database can
IJRITCC | March 2015, Available @ http://www.ijritcc.org
store clinical data flexibly but has higher query latencies than an MS-SQL database. In [18], Neo4j and PostgreSQL are used to store and query the STRING human protein interaction network. The queries aim at solving graph processing problems in bioinformatics, such as finding the best scoring path between two proteins. The results show that Neo4j can offer great speedups over relational databases. However, depending on the types of queries, a graph database may not be necessary for graph data.
In this study, we aim to compare different data stores for proteomics mass-spectrometry data. This type of data is important because it fosters a better understanding of diseases, biomarker identification and drug development [19, 20]. Since proteomics data do not have graph data structures, graph data stores are not included in the study. We compare the latencies of one relational database, three NoSQL data stores, and the flat file system on storing and querying mass spectrometry (MS) data, as well as their data sizes. Our choice of data stores is based on their popularity, availability, and representativeness. The four data stores are representatives of four main database categories. They also cover both in-memory and disk-based configurations, as listed below:
• Relational database (MySQL, standard disk-based and in-memory configuration)
• Document-oriented database (MongoDB, disk-based)
• Column-oriented database (HBase, disk-based)
• Key-value database (Redis, in-memory)
II. METHODS
We perform benchmark studies on storing and querying MS data using different data stores. This section introduces the employed data stores, the respective data models and the experimental settings.
A. Data Stores and Data Models
Each data store has alternative data models to store MS data. We decide the data model for each data store based on its distinguishing features. For example, MongoDB is document-oriented, so we store each sample file in one document; HBase is column-oriented, thus we store each sample file in one HBase table column. Below we introduce the individual databases and the adopted data models.
1) MySQL: MySQL is the most widely used open-source RDBMS (relational database management system). To store MS samples, we create a table with three columns: sample number, m/z value and intensity value. An index is built on the sample number column to accelerate the search for multiple samples. The MySQL table structure is illustrated in Table I. We use both the InnoDB engine and the MEMORY engine for storing and querying data. Note that for inserting data, we use the bulk load operation to insert one sample file with one statement.
2) MongoDB: MongoDB is a document-oriented data store. It stores data in collections. A MongoDB collection contains documents. A document is composed of field-value pairs. MongoDB can store data flexibly in documents with embedded data models, instead of breaking it into relational table structures. It also supports aggregation operations for complex queries. We store the MS sample files in one collection with one document per sample. The key of the document is the sample number. Within each document, the field-value pairs store the m/z value-intensity value pairs of the corresponding sample file. As MongoDB does not support float values as field names, we store the m/z value, e.g., 1000.02 as 1000_02 instead. The collection structure is illustrated in Table II.
3) HBase: HBase is a column-oriented data store. It stores data in HTables. An HTable has rows and column families. Data within a row are grouped by column families, and data within a column family are identified by column qualifiers. A row key, a column family, a column qualifier and a version number (if present) exactly specify a cell in an HTable. As suggested in the HBase manual, to achieve better performance, the number of column families should be kept low, usually not more than two or three. Thus we define the m/z values as the row keys and the intensity values as the only column family, which has one column for each sample. The HBase table structure is illustrated in Table III.
4) Redis: Redis is an in-memory key-value data store. It differs from a traditional key-value data store in which string keys are associated with string values. In Redis, keys are binary safe, so any binary sequence can be used as a key, from a string to an image file. The values can hold complex data structures such as list, set, sorted set, hash, etc. We use both the Redis hash and string data models to store MS data and compare their performance. The hash uses field-value pairs to store the m/z value-intensity value pairs. The string simply stores all lines of the sample file as one string. The key of a hash or a string is the sample number. The Redis hash and string data models are illustrated in Table IV.

TABLE I. DATA SCHEMA IN MYSQL TABLE FOR STORING MS DATA
Sample number (smallint(6))    M/z value (float)    Intensity value (smallint(6))
1                              1000.02              29
1                              1000.12              21
...                            ...                  ...
N                              9999.68              5

TABLE II. DATA SCHEMA IN MONGODB COLLECTION FOR STORING MS DATA
Sample:1    1000_02:29    1000_12:21    ...    9999_68:1
Sample:2    1000_02:96    1000_12:91    ...    9999_68:10
...         ...           ...           ...    ...
Sample:N    1000_02:35    1000_12:34    ...    9999_68:1

TABLE III. DATA SCHEMA IN HBASE TABLE FOR STORING MS DATA
Row key    ColumnFamily
1000.02    sample:1 29
           sample:2 96
           …
           sample:N 35
1000.12    sample:1 21
           ...
           sample:N 11
...        ...
9999.68    ...
           sample:N 1
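The data models of Tables II to IV can be illustrated with plain Python structures. This is only a sketch of the layouts, not the Java code used in the experiments: the function names are hypothetical, no database client is involved, and we assume that m/z values are written with exactly two decimals, as in the tables.

```python
from typing import Dict, List, Tuple

Peak = Tuple[float, int]  # (m/z value, intensity value)


def mz_to_field(mz: float) -> str:
    """Encode an m/z value as a dot-free field name, e.g. 1000.02 -> '1000_02'."""
    # Assumption: two decimals are enough to identify an m/z value.
    return f"{mz:.2f}".replace(".", "_")


def field_to_mz(field: str) -> float:
    """Decode a field name produced by mz_to_field back into a float."""
    return float(field.replace("_", "."))


def mongo_document(sample: int, peaks: List[Peak]) -> Dict[str, int]:
    """One document per sample: the sample number plus one field per m/z value."""
    doc = {"Sample": sample}
    for mz, intensity in peaks:
        doc[mz_to_field(mz)] = intensity
    return doc


def hbase_cells(sample: int, peaks: List[Peak]) -> Dict[str, Dict[str, int]]:
    """Row key = m/z value; the single column family holds one column per sample."""
    return {f"{mz:.2f}": {f"sample:{sample}": intensity} for mz, intensity in peaks}


def redis_hash(peaks: List[Peak]) -> Dict[str, int]:
    """Hash model: field = m/z value, value = intensity, stored under the sample key."""
    return {f"{mz:.2f}": intensity for mz, intensity in peaks}


def redis_string(peaks: List[Peak]) -> str:
    """String model: all lines of the sample file kept as one string."""
    return "\n".join(f"{mz:.2f} {intensity}" for mz, intensity in peaks)
```

For the first two rows of sample 1, `mongo_document(1, [(1000.02, 29), (1000.12, 21)])` yields one field per m/z value, while `hbase_cells` yields one row per m/z value, which is why the two stores behave differently for range queries and whole-sample retrieval.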

TABLE IV. HASH OR STRING DATA MODEL IN REDIS FOR STORING MS DATA
Sample:1    1000.02 29    1000.12 21    …    9999.68 1
Sample:2    1000.02 96    1000.12 91    …    9999.68 10
...         ...
Sample:N    1000.02 35    1000.12 34    …    9999.68 1

B. Experimental Settings
We use an experimental approach to compare the latencies of MySQL, MongoDB, HBase, Redis, and the flat file system on storing and querying MS data, as well as the data sizes in them. Java programs and JDBC (Java Database Connectivity) are used to access the databases and the flat file system. Below we introduce the data, the three use cases and the measurement.
1) Data: We use raw 1D mass spectrometry data in the form of m/z (mass-to-charge ratio) and intensity value pairs, stored in the widely used mzML format¹. Among other meta information, each MS data sample consists of 42,381 value pairs (about 440KB) in the m/z range from 1000 to 10000 Da, which are encoded as binary strings. To measure the influence of parsing the mzML XML structure and converting the binary encoding, we also perform experiments using only the m/z value and intensity value pairs, stored as numbers in text files. We refer to this format as the .dta format.
2) Use Cases: The goal of this study is to evaluate the suitability of the data stores for bioinformatics researchers. We therefore choose a few common use cases that occur in the everyday routine of working with MS data. We use the following three examples as proxy applications:
a) Storing new data: insert n (n = 50, 250, 500, 1000, 2000, . . . , 7000) MS data samples into each data store.
b) Range query (query1): select all m/z value-intensity value pairs from the available datasets where the m/z values are between 1000.02 and 1500.02 Da. We refer to this query type as query1 in the following sections.
c) Retrieve entire samples (query2): retrieve the entire data samples for 10% of all available samples. We refer to this query as query2 in the rest of this paper.
3) Measurement: All data stores are configured in standalone mode on a Debian Linux² desktop PC equipped with a 4-core Intel Xeon (R) CPU running at 3.3GHz, 7.78GB RAM and a 232.9GB SATA hard-disk drive. For each experiment, the computer is rebooted and only one database system is running. Latencies are measured within the respective experiment implementations. Space consumption of the data inside the database systems is measured by querying the database management system directly³. The MS data files are available on the local hard disk. All results are averaged over multiple runs.
III. RESULTS AND DISCUSSIONS
We measure the latencies of the use cases and the data size for each data store and plot them against the number of sample files (as shown in Fig. 1 to Fig. 4).
The results show significant performance differences among the data stores with respect to latencies and data sizes, as far as our experiments with real proteomics (MS) data are concerned. In-memory data stores (Redis, MySQL with MEMORY engine) are generally faster than disk-based data stores (MongoDB, HBase, and MySQL with InnoDB engine). All storage systems show a linear dependency of latencies and data size on the number of files. Thus, for each data store, we calculate the average latency of storing or querying one MS sample and the average size of one MS sample (as shown in Table V). Each table entry therefore does not reflect the average measurement per MS sample from a single observation, e.g., n = 7000 samples, but the average measurement across all experiments (n = 50, 250, 500, 1000, 2000, ..., 7000). The experimental results provide useful information and also raise new questions, which require a deeper understanding of the technical details of the individual data stores. Below we discuss a few basic observations across all data stores.
Figure 1 Latencies of storing data to the data stores
Figure 2 Latencies of querying data by m/z range (query1)
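The per-sample averaging behind the entries of Table V can be sketched as follows. This is one plausible reading of the description above, under the assumption that each experiment's total latency is first divided by its number of samples n and the resulting per-sample values are then averaged over all experiment sizes; the function names and the example totals are hypothetical and are not measurements from this study.

```python
from statistics import mean
from typing import Dict


def per_sample_latency(total_latency_ms: Dict[int, float]) -> Dict[int, float]:
    """Divide the total latency of each experiment by its number of samples n."""
    return {n: total / n for n, total in total_latency_ms.items()}


def average_across_experiments(total_latency_ms: Dict[int, float]) -> float:
    """Average the per-sample latencies over all experiment sizes."""
    return mean(list(per_sample_latency(total_latency_ms).values()))


# Illustrative totals only: a store whose latency grows linearly with n.
example_totals_ms = {50: 100.0, 250: 500.0, 500: 1000.0}
```

With a perfectly linear store such as `example_totals_ms`, every experiment size gives the same per-sample latency, so the average equals that common value; deviations from linearity would show up as differences between the per-sample values.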
¹ The mzML format was introduced by the HUPO Proteomics Standards Initiative; see [21] for more details.
² Linux kernel version: SMP Linux 3.2.0-4-amd64
³ MySQL: size of the database; MongoDB: size of the collection; HBase: size of the HTable; Redis: full memory footprint.
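The two query use cases can be sketched for the flat-file (.dta) variant. This is a minimal Python illustration rather than the Java implementation used in the experiments. It assumes that each line of a .dta file holds one whitespace-separated m/z value and one integer intensity value, that the range bounds are inclusive, and that query2 simply takes the first 10% of the samples; all function names are hypothetical.

```python
import time
from typing import Callable, List, Tuple

Peak = Tuple[float, int]  # (m/z value, intensity value)


def parse_dta(text: str) -> List[Peak]:
    """Parse the content of one .dta sample into (m/z, intensity) pairs."""
    peaks = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        peaks.append((float(parts[0]), int(parts[1])))
    return peaks


def query1(samples: List[List[Peak]], low: float = 1000.02,
           high: float = 1500.02) -> List[Peak]:
    """Range query: all pairs whose m/z value lies in [low, high]."""
    return [(mz, inten) for peaks in samples
            for mz, inten in peaks if low <= mz <= high]


def query2(samples: List[List[Peak]], fraction: float = 0.10) -> List[List[Peak]]:
    """Retrieve entire samples for a fraction (10% by default) of all samples."""
    count = max(1, round(len(samples) * fraction))
    return samples[:count]


def timed_ms(fn: Callable, *args):
    """Run fn(*args) and return its result with the elapsed time in milliseconds."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0
```

A call such as `timed_ms(query1, samples)` returns both the matching pairs and the latency, which mirrors the statement that latencies are measured within the experiment implementations.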
Figure 3 Latencies of querying entire MS samples (query2)
Figure 4 Space of the data used inside the respective database systems
A. Better memory utilization results in lower latencies
As expected, the two in-memory data stores (Redis and MySQL with MEMORY engine) have lower write and query latencies than the disk-based data stores (MongoDB, HBase, and MySQL with InnoDB engine). It is interesting to observe that MongoDB, although actually disk-based, also has very low latencies. This is because MongoDB uses memory-mapped files (“RAM disk”), which first utilize (all) available memory before using the hard disk. If the available memory is exceeded, the average latencies increase from 6.79ms to 10.9ms per MS sample for query1 and from 2.71ms to 8.2ms per MS sample for query2. In summary, if the data are available in memory (instead of on disk), the access latencies are much lower.
Accessing data from disk and from memory are intrinsically different. Accessing data from a hard disk is typically done through the SATA (serial ATA) interface. This has a theoretical maximum bandwidth of 750 MB per second, which is about 20 times slower than accessing the main memory (at a maximal bandwidth of about 14.9GB per second). In addition, seeking to the correct position of a file on a hard disk takes about four milliseconds (using a standard 7200 RPM disk). Taken together, accessing data from memory is on average 40,000 times faster than accessing data from disk. Meanwhile, effects like operating-system-dependent page caching and hardware-based caching mechanisms for disk reads can reduce the latency of disk reads dramatically. The actual effect of the combination of different strategies can hardly be predicted these days due to the complexity of the components used.
B. Flat file storage can achieve comparable query latencies
Although querying from flat files involves disk reading, flat file storage achieves query latencies comparable to those of the databases. Flat file storage also has the lowest latency on querying MS data by samples (query2). As mentioned above, techniques such as prefetching and hardware-based caching accelerate reading from disk if the reads are sequential. Besides, performing queries in each single sample file avoids the overhead of loading large volumes of data into memory, which can cause page faults and disk swaps if the data do not fit in memory. Recall that we use two MS file formats in the experiments: .mzML and .dta. The results show that querying .dta files has about half the latency of querying .mzML files.
C. Range queries are more expensive
Our experiments show that querying data ranges is much more expensive than reading entire samples. This behavior is well known and occurs because (1) sequential access (reading a full dataset at once) is faster than random access (“reading a bit and then seeking to the entry point”) and (2) databases often implement range queries by first returning all data fulfilling the lower bound and then filtering on the upper bound⁴. This can seriously lengthen the overall query times, and seems to be the case in all tested database systems.
D. The trade-off between ACID compliance and other properties
MySQL and HBase generally have higher write and query latencies than the other data stores. At the same time, they guarantee a higher level of data consistency, which inevitably requires more disk writing. MySQL provides ACID properties. HBase can provide ACID properties within the same row. In comparison, MongoDB does not guarantee ACID properties. It trades off ACID compliance for higher availability, which contributes to better speed. As stated in the literature [1, 10] and confirmed by our experiments, NoSQL data stores relax ACID compliance in favor of other properties, such as availability and horizontal scalability. Based on the experimental results, we conclude that this trade-off has good potential for bioinformatics applications.
E. The suitability of a data store depends on the use case
In our experiments, the combinations of data stores and data models show different performance for every use case. We summarize our experience in Table VI.

TABLE VI. USE CASES IN WHICH THE DATA STORES MAY BE CONSIDERED
If you have …                                                              Recommended data store
Unstructured or flexible data that require complex queries                MongoDB
Large data volume applications                                            HBase
Data that require a relational model and ACID transactional properties    MySQL
Data that do not require complex queries and can fit in the memory        Redis
Data that only require limited operations                                 Flat file

⁴ There are numerous database-system-dependent approaches to optimize this behavior, which we did not consider in this work.
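The disk versus memory comparison in Section A can be retraced with a short calculation. The bandwidths, the seek time and the sample size are taken from the text; the memory access time of about 100 ns is our own assumption, since the text does not state which figure underlies the factor of 40,000, and the constant and function names are hypothetical.

```python
# Figures quoted in the text
SATA_BANDWIDTH_MB_S = 750.0            # theoretical SATA maximum
MEMORY_BANDWIDTH_MB_S = 14.9 * 1000.0  # about 14.9 GB/s, taking 1 GB = 1000 MB
DISK_SEEK_S = 4e-3                     # about four milliseconds (7200 RPM disk)
SAMPLE_SIZE_MB = 0.44                  # one MS sample is about 440 KB

# Assumption (not stated in the text): one main-memory access takes about 100 ns.
ASSUMED_MEMORY_ACCESS_S = 100e-9


def bandwidth_ratio() -> float:
    """How many times higher the memory bandwidth is than the SATA bandwidth."""
    return MEMORY_BANDWIDTH_MB_S / SATA_BANDWIDTH_MB_S


def access_latency_ratio() -> float:
    """Disk seek time relative to the assumed memory access time."""
    return DISK_SEEK_S / ASSUMED_MEMORY_ACCESS_S


def disk_read_time_s(size_mb: float) -> float:
    """One seek plus a sequential transfer at full SATA bandwidth."""
    return DISK_SEEK_S + size_mb / SATA_BANDWIDTH_MB_S


def memory_read_time_s(size_mb: float) -> float:
    """One assumed memory access plus a transfer at full memory bandwidth."""
    return ASSUMED_MEMORY_ACCESS_S + size_mb / MEMORY_BANDWIDTH_MB_S
```

Under these assumptions the bandwidth ratio is close to 20 and the access latency ratio is close to 40,000, in line with the figures quoted above. For a whole 440KB sample read sequentially, `disk_read_time_s(SAMPLE_SIZE_MB)` is dominated by the single seek, which is consistent with the observation that sequential flat-file reads stay competitive.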

TABLE V. THE AVERAGE STORING AND QUERYING LATENCIES OF THE DATA STORES
Measurement per MS sample               MongoDB        HBase     MySQL (InnoDB)    MySQL (MEMORY)    Redis (hash)    Redis (string)    Flat file (dta)    Flat file (mzML)
Write latencies (millisecond)           44.45          322.71    490.55            30.98             38.71           1.69              -                  -
Range query latencies (millisecond)     6.79 (10.9)a   5.60      23.97             10.09             64.80           10.20             15.47              28.38
Sample query latencies (millisecond)    2.71 (8.2)a    33.43     9.59              1.93              6.19            1.01              0.89               2.03
Data sizes (KB)                         1024.0         1285.8    1851.0            1304.0            3325.7          434.66            440.32             669.1
a. The query latencies when the data size exceeds the available memory

IV. CONCLUSION
In bioinformatics, system-level investigations of cellular and molecular interactions involve large amounts of data. An efficient data store is necessary in this process. We use an experimental approach to compare the performance of a relational database (MySQL) and three non-relational data stores (MongoDB, HBase, Redis) with respect to their latencies of storing and querying mass spectrometry data and their data sizes. We also perform the same queries on a flat file system (both dta and mzML formats) for comparison. To the best of our knowledge, this study is the first quantitative comparison among a relational database, NoSQL data stores, and a flat file system in the context of bioinformatics applications, and it can serve as a practical guide for bioinformatics researchers.
The results show that NoSQL data stores with proper data models can achieve lower write and query latencies and smaller database sizes than relational databases. Depending on the use case, a flat file system can achieve query performance comparable to that of the databases. Above all, our study suggests that the suitability of a database needs to be considered based on the application context. In the future, we will extend our study by comparing the performance of the databases in distributed mode, for example HBase with Hadoop and HDFS, MongoDB with sharding, and MySQL Cluster, which are applicable to a medium-sized bioinformatics data center.
ACKNOWLEDGMENT
This work was funded by the German Ministry of Research and Education (BMBF) project Grant 3FO18501 (Forschungscampus MODAL).
REFERENCES
[1] Michael C. Schatz, Ben Langmead, and Steven L. Salzberg. Cloud computing and the DNA data race. Nat Biotech, 28(7):691–693, Jul 2010.
[2] Lin Dai, Xin Gao, Yan Guo, Jingfa Xiao, and Zhang Zhang. Bioinformatics clouds for big data manipulation. Biology Direct, 7(1):43, 2012.
[3] Vivien Marx. Biology: The big challenges of big data. Nature, 498(7453):255–260, Jun 2013.
[4] Casey S. Greene, Jie Tan, Matthew Ung, Jason H. Moore, and Chao Cheng. Big data bioinformatics. Journal of Cellular Physiology, 229(12):1896–1900, 2014.
[5] Neil Savage. Bioinformatics: Big data versus the big C. Nature, 509(7502):S66–S67, May 2014.
[6] B.G. Tudorica and C. Bucur. A comparison between several NoSQL databases with comments and notes. In Roedunet International Conference (RoEduNet), 2011 10th, pages 1–5, June 2011.
[7] A. B. M. Moniruzzaman and Syed Akhter Hossain. NoSQL database: New era of databases for big data analytics - classification, characteristics and comparison. CoRR, abs/1307.0191, 2013.
[8] Seth Gilbert and Nancy Lynch. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News, 33(2):51–59, June 2002.
[9] Clarence JM Tauro, Baswanth Rao Patil, and KR Prashanth. A comparative analysis of different NoSQL databases on data model, query model and replication model. In Proceedings of International Conference on “Emerging Research in Computing, Information, Communication and Applications” ERCICA. Elsevier, 2013.
[10] Ameya Nayak, Anil Poriya, and Dikshay Poojary. Type of NoSQL databases and its comparison with relational databases. International Journal of Applied Information Systems, 5(4):16–19, March 2013.
[11] Rick Cattell. Scalable SQL and NoSQL data stores. SIGMOD Rec., 39(4):12–27, May 2011.
[12] Paolo Atzeni, Francesca Bugiotti, and Luca Rossi. Uniform access to NoSQL systems. Information Systems, 43(0):117–133, 2014.
[13] Marketa J. Zvelebil and Jeremy O. Baum. Understanding bioinformatics. Garland Science, 2008.
[14] François Bry and Peer Kröger. A computational biology database digest: Data, data analysis, and data management. Distrib. Parallel Databases, 13(1):7–42, January 2003.
[15] Yixue Li and Luonan Chen. Big biological data: Challenges and opportunities. Genomics, Proteomics & Bioinformatics, 12(5), October 2014.
[16] Clarence J M Tauro, Aravindh S, and Shreeharsha A. B. Comparative study of the new generation, agile, scalable, high performance NoSQL databases. International Journal of Computer Applications, 48(20):1–4, June 2012.
[17] Ken Ka-Yin Lee, Wai-Choi Tang, and Kup-Sze Choi. Alternatives to relational database: comparison of NoSQL and XML approaches for clinical data storage. Computer Methods and Programs in Biomedicine, 110(1):99–109, April 2013.
[18] Christian Theil Have and Lars Juhl Jensen. Are graph databases ready for bioinformatics? Bioinformatics, 2013.
[19] Sam Hanash. Disease proteomics. Nature, 422:226–232, 2003.
[20] Zhen Xiao, DaRue Prieto, Thomas P. Conrads, Timothy D. Veenstra, and Haleem J. Issaq. Proteomic patterns: their potential for disease diagnosis. Molecular and Cellular Endocrinology, 230(1-2):95–106, 2005.
[21] Eric Deutsch. mzML: A single, unifying data format for mass spectrometer output. PROTEOMICS, 8(14):2776–2777, 2008.
[22] VijaySrinivas Agneeswaran. Big-data theoretical, engineering and analytics perspective. In Big Data Analytics, volume 7678 of Lecture Notes in Computer Science, pages 8–15. Springer Berlin Heidelberg, 2012.
[23] N. Leavitt. Will NoSQL databases live up to their promise? Computer, 43(2):12–14, Feb 2010.