NoSQL Databases MongoDB
1 Introduction
Recent growth in the volume of data has heightened the need for storage: the world's
digital data now exceeds a zettabyte (i.e., 10²¹ bytes), so it is both a challenge and a
necessity to develop a powerful and efficient system that has the capacity to
accommodate it, for example, a system using a very dense storage medium such as
deoxyribonucleic acid (DNA). DNA can encode two bits per nucleotide (nt), or
455 exabytes per gram of single-stranded DNA [1]. Taking into account the fact that
2 Related Work
When an uneven chunk count event occurs (i.e., the chunk difference between the
minimally loaded and the maximally loaded shards is greater than or equal to 8), the
balancer redistributes chunks among shards until the load difference between any two
shards is less than or equal to two [15].
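A minimal sketch of this trigger, assuming hypothetical per-shard chunk counts (only the thresholds 8 and 2 come from [15]):

# Sketch of the chunk-count balancing trigger described above; the shard
# names and chunk counts are hypothetical illustration data.
chunk_counts = {"shard0": 24, "shard1": 9, "shard2": 31}

MIGRATION_TRIGGER = 8   # start balancing when max - min >= 8
MIGRATION_TARGET = 2    # stop balancing when max - min <= 2

def needs_balancing(counts):
    return max(counts.values()) - min(counts.values()) >= MIGRATION_TRIGGER

def rebalance(counts):
    # Move one chunk at a time from the most loaded shard to the least
    # loaded shard until the difference is at most MIGRATION_TARGET.
    while max(counts.values()) - min(counts.values()) > MIGRATION_TARGET:
        src = max(counts, key=counts.get)
        dst = min(counts, key=counts.get)
        counts[src] -= 1
        counts[dst] += 1
    return counts

if needs_balancing(chunk_counts):
    print(rebalance(chunk_counts))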
3.1 Replication
Each node has only three replicas, of which one must be a primary replica and the
others are secondary replicas.
Every node has a replication group G of size n, and a quorum is a subset of the nodes
in a replication group G. Let a view v be a tuple over G that defines important
information for G, and let the view id be denoted as i_v ∈ ℕ [16].
In some cases, replication can also be used to serve more read operations on the
database. To increase the availability of data for distributed applications, the
database can additionally be stored geographically across different data centers. All
members of a replica set hold the same data; within the group, one replica is the
primary and the rest are secondaries [14]. The primary accepts all read/write
operations from clients. If the primary becomes unavailable, one of the secondary
replicas is elected as the new primary; the Paxos algorithm is used for this election [17].
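As an illustration of this read/write split, the following hedged PyMongo sketch connects to a three-member replica set; the host names, the replica-set name rs0, and the database and collection names are hypothetical:

# Hedged PyMongo sketch (hypothetical hosts and replica-set name "rs0").
# All writes are routed to the primary; with readPreference set to
# "secondaryPreferred", reads are served by a secondary when one is
# available, which spreads the read load across the replica set.
from pymongo import MongoClient

client = MongoClient(
    "mongodb://node1.example.com:27017,node2.example.com:27017,"
    "node3.example.com:27017/?replicaSet=rs0",
    readPreference="secondaryPreferred",
)

db = client["inventory"]                          # hypothetical database
db.items.insert_one({"sku": "a1", "qty": 10})     # goes to the primary
print(db.items.find_one({"sku": "a1"}))           # may be read from a secondary

If the primary fails, the driver waits for the replica set to elect a new primary and then resumes sending writes to it.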
3.2 Sharding
When MongoDB needs to store more data than the capacity of a single server (or
shard), it scales the system with the help of horizontal scaling. The principle of
horizontal scaling is to partition data by rows rather than to split data into columns
(as normalization and vertical partitioning do in relational databases) [14]. In
MongoDB, horizontal scaling is done by an automatic sharding architecture that
distributes data across thousands of nodes. Moreover, sharding occurs on a
per-collection basis; it does not consider the database as a whole. MongoDB is
configured in such a way that it automatically detects which collection is growing
faster than the others. That collection becomes the subject of sharding, while the
others may still reside on a single server. The components needed to understand the
architecture of MongoDB sharding are shown in Fig. 1 and described below.
• Shards are the servers that store the data (each runs a mongod process) and ensure
availability and automatic failover; each shard comprises a replica set.
• Config Servers “store the cluster’s metadata,” which includes basic information
about chunks and shards. Chunks are contiguous ranges of data from collections,
ordered by the shard key.
• Routing Services run mongos processes, which perform read and write requests
on behalf of client applications.
Auto-sharding in MongoDB provides some necessary functionality without
requiring large or powerful machines [15] (a brief configuration sketch follows this list):
1. Automatic balancing when changes occur in load and data distribution.
2. Ease of adding new machines without downtime.
3. No single point of failure.
4. Ability to recover from failures automatically.
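To make these components concrete, the hedged sketch below enables auto-sharding for one collection through a mongos router using PyMongo; enableSharding and shardCollection are the standard MongoDB administrative commands, while the host, database (demo), collection (users), and shard key (user_id) are assumptions for illustration:

# Hedged sketch: enable auto-sharding for a single collection via a mongos
# router. Host, database, collection, and shard key are hypothetical.
from pymongo import MongoClient

mongos = MongoClient("mongodb://mongos.example.com:27017")

# Allow the "demo" database to be distributed across shards.
mongos.admin.command("enableSharding", "demo")

# The collection needs an index on the shard key before it can be sharded.
mongos["demo"]["users"].create_index([("user_id", 1)])

# Shard the collection on "user_id"; its chunks (contiguous shard-key
# ranges) are then distributed and rebalanced across shards automatically.
mongos.admin.command("shardCollection", "demo.users", key={"user_id": 1})

Only this collection is partitioned; other collections in demo remain whole on a single shard, matching the per-collection behavior described above.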
In this section, we introduce all the terms used in the rest of this document. Let D be
the set of all data items and let Γ be the set of load types. For each load type t ∈ Γ,
the load l_t : D → ℝ is a function that assigns to a data item its load value of type t.
Every unit of replication U ⊆ D has an associated load value l_t^U = ∑_{v∈U} l_t(v),
and for any node H in the distributed system holding the set of replication units U_H,
the associated value is l_t^H = ∑_{U∈U_H} l_t^U. Every node has a capacity
c_t^H ∈ ℝ for each load type t; thus, the inequality l_t^H < c_t^H must be maintained
as an invariant, as its violation would result in the failure of H. We also calculate the
utilization [18] of a node, u_t^H = l_t^H / c_t^H, for t ∈ Γ. The system as a whole has
a utilization μ_t^S, where S is the set of all nodes in the system, and an average
utilization μ̂_t^S for t ∈ Γ [18], given as

μ_t^S = (∑_{H∈S} l_t^H) / (∑_{H∈S} c_t^H),    μ̂_t^S = (1/|S|) ∑_{H∈S} u_t^H    (i)
In addition to the above, we need a third parameter that represents the cost of moving
a replication unit (or data item) U from one host to another host in S, i.e.,
ρ_U : S × S → ℝ; this parameter depends linearly on l_size^U. If the system is
considered uniform, then for all H, H′, H″ ∈ S and U ∈ H we have
ρ_U(H, H′) = ρ_U(H, H″), so ρ_U can be treated as a constant in ℝ.
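A small numeric sketch of these definitions for a single load type (e.g., data size), with hypothetical per-node loads and capacities:

# Hedged sketch of the definitions above and equation (i): per-node load
# l_t^H, capacity c_t^H, utilization u_t^H, system utilization mu_t^S, and
# the average utilization. Shard names and numbers are hypothetical.
loads = {"shard0": 60.0, "shard1": 35.0, "shard2": 80.0}          # l_t^H
capacities = {"shard0": 100.0, "shard1": 100.0, "shard2": 100.0}  # c_t^H

# u_t^H = l_t^H / c_t^H; each value must stay below 1 to keep the
# invariant l_t^H < c_t^H.
utilization = {h: loads[h] / capacities[h] for h in loads}

# Equation (i): system utilization and average utilization over S.
mu_S = sum(loads.values()) / sum(capacities.values())
mu_hat_S = sum(utilization.values()) / len(utilization)

print(utilization)       # {'shard0': 0.6, 'shard1': 0.35, 'shard2': 0.8}
print(mu_S, mu_hat_S)    # both 0.58(3) here, since all capacities are equal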
Let

L_S^H = {l_t^U | t ∈ Γ, U ∈ H}    (ii)

C^H = {c_t^H | t ∈ Γ}    (iii)

L^H = {l_t^H | t ∈ Γ}    (iv)

L^S = {⟨C^H, L^H, L_S^H⟩ | H ∈ S}    (v)
where L^S is referred to as the system statistics for S. Migration is done by the
balancer, i.e., from MIGRATOR(U, H) to MIGRATOR(U, H′), where U is a unit of
replication and H, H′ are the hosts.
The algorithm has two methods. The first, IsBalance(), returns a Boolean value
depending on whether a particular shard (host) H is balanced or not. The condition of
balance is as follows: if a shard holds more data than its threshold value, it is
considered imbalanced. The threshold value is Const * c_t^H (e.g., Const = 0.7, 0.8,
or 0.9), where c_t^H is the capacity of the shard H. If a shard is found to be
imbalanced, its data needs to migrate. The second method, MIGRATOR(), migrates
data from an imbalanced shard to a balanced shard until only l̂_t^S data remains in
the shard. We first calculate the total data occupied in all shards, l_t^S, and then the
average data occupied per shard, l̂_t^S.
Furthermore, we check whether each shard is balanced, and for each imbalanced
shard H_i we migrate chunks from H_i to a balanced shard H_j until
the condition l_t^{H_j} ≥ l̂_t^S OR l_t^{H_j} ≥ l_t^{H_i} is met; this process is
repeated until all shards become balanced.
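The following hedged sketch mirrors this loop; IsBalance() and MIGRATOR() are modeled as plain functions, and the loads, capacities, chunk size, and Const = 0.8 are hypothetical:

# Hedged sketch of the balancing loop: a shard is imbalanced when its load
# exceeds Const * capacity; its chunks then migrate to the least loaded
# shard until roughly the average load remains. Values are hypothetical.
CONST = 0.8
CHUNK = 1.0  # load moved per migration step

def is_balance(load, capacity):
    # IsBalance(): True if the shard is within its threshold Const * c_t^H.
    return load <= CONST * capacity

def migrator(loads, capacities):
    # MIGRATOR(): for every imbalanced shard, move chunks to the currently
    # least loaded shard until about the average load remains on the source.
    avg = sum(loads.values()) / len(loads)
    for src in list(loads):
        if is_balance(loads[src], capacities[src]):
            continue
        while loads[src] > avg:
            dst = min(loads, key=loads.get)
            if dst == src:
                break
            loads[src] -= CHUNK
            loads[dst] += CHUNK
    return loads

loads = {"shard0": 95.0, "shard1": 40.0, "shard2": 30.0}
capacities = {"shard0": 100.0, "shard1": 100.0, "shard2": 100.0}
print(migrator(loads, capacities))   # converges to roughly 55.0 per shard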
Fig. 2 Experimental evaluation of modified MongoDB and MongoDB based on migration rate
and space utilization: (a) migration rate (chunk migrations vs. maximum number of chunks);
(b) space utilization (space utilization vs. maximum number of chunks)
Large-scale applications and data processing require handling issues such as the
scalability, reliability, and performance of distributed storage systems. One of the
prominent issues among them is handling skewness in the distribution and access of
data items. We have presented an improved load-balancing algorithm for the NoSQL
database MongoDB, which handles the aforementioned issues by providing
automatic load balancing. We have analyzed our algorithm and shown that our
approach is better than similar mechanisms employed in the MongoDB database
[14], in terms of chunk migration and memory utilization for the individual shard.
Our proposed method is designed for MongoDB-based NoSQL database systems. In
future work, we will try to incorporate this approach into other existing flavors of NoSQL.
References
1. Church, George M., Yuan Gao, and Sriram Kosuri, 2012, “Next-generation digital
information storage in DNA.” Science 337.6102: 1628–1628.
2. Dean, Jeffrey, and Sanjay Ghemawat, 2008, “MapReduce: simplified data processing on large
clusters.” Communications of the ACM 51.1: 107–113.
3. Lakshman, Avinash, and Prashant Malik, 2010, “Cassandra: a decentralized structured storage
system.” ACM SIGOPS Operating Systems Review 44.2: 35–40.
4. M. Ali-ud-din, et al., 2014, “Seven V’s of Big Data understanding Big Data to extract value,”
American Society for Engineering Education (ASEE Zone 1), 2014 Zone 1 Conference of the,
Bridgeport, CT, USA.
5. E. Dumbill, 2012, “What is big data?,” O’Reilly Media, Inc., Available: https://beta.oreilly.com/ideas/what-is-big-data.
6. DeCandia, Giuseppe, et al., 2007, “Dynamo: Amazon’s highly available key-value
store.” ACM SIGOPS Operating Systems Review 41.6: 205–220.
7. Cooper, Brian F., et al., 2008, “PNUTS: Yahoo!’s hosted data serving platform.” Proc. of the
VLDB Endowment 1: 1277–1288.
8. Chang, Fay, et al., 2008, “Bigtable: A distributed storage system for structured data.” ACM
Trans. on Computer Systems (TOCS) 26.2: 4.
9. “MongoDB,” MongoDB Inc., 2015, Available: https://en.wikipedia.org/wiki/MongoDB.
10. E. A. Brewer, Towards robust distributed systems. (Invited Talk), Oregon, 2000.
11. Featherston, Dietrich, 2010, “Cassandra: Principles and Application.” Department of
Computer Science, University of Illinois at Urbana-Champaign.
12. Thusoo, Ashish, et al., 2010, “Data warehousing and analytics infrastructure at
Facebook.” Proc. of the 2010 ACM SIGMOD Inter. Conf. on Management of data.
13. Glendenning, Lisa, et al., 2011, “Scalable consistency in Scatter.” Proc. of the Twenty-Third
ACM Symposium on Operating Systems Principles.
14. “MongoDB Documentation,” 25 June 2015. [Online].
15. Liu, Yimeng, Yizhi Wang, and Yi Jin., 2012, “Research on the improvement of MongoDB
Auto-Sharding in cloud environment.” Computer Science & Education (ICCSE), 2012 7th
Inter. Conf. on. IEEE.
16. Gifford, David K, 1979, “Weighted voting for replicated data.” Proc. of the seventh ACM
symposium on Operating systems principles.
17. Lamport, Leslie, 1998, “The part-time parliament.” ACM Transactions on Computer Systems
(TOCS) 16.2: 133–169.
18. Godfrey, Brighten, et al., 2004, “Load balancing in dynamic structured P2P
systems.” INFOCOM 2004. Twenty-third Annual Joint Conf. of the IEEE Computer and
Communications Societies. Vol. 4.