
March 2014

A MongoDB White Paper


MongoDB Architecture Guide
MongoDB 2.6
Table of Contents
INTRODUCTION
How We Build Applications
How We Run Applications
MongoDB Embraces These New Realities Through Key Innovations
MONGODB DATA MODEL
Data as Documents
Dynamic Schema
Schema Design
MONGODB QUERY MODEL
Idiomatic Drivers
Mongo Shell
Query Types
Indexing
Query Optimization
Covered Queries
MONGODB DATA MANAGEMENT
In-Place Updates
Auto-Sharding
Query Router
CONSISTENCY & DURABILITY
Transaction Model
Journaling
Replica Sets
AVAILABILITY
Replication
Elections and Failover
Election Priority
Configurable Write Availability
PERFORMANCE
In-Memory Performance with On-disk Capacity
RESOURCES
Introduction

MongoDB wasn't designed in a lab. We built MongoDB
from our own experiences building large-scale, high-
availability, robust systems. We didn't start from scratch,
we really tried to figure out what was broken, and
tackle that. So the way I think about MongoDB is that
if you take MySQL, and change the data model from
relational to document-based, you get a lot of great
features: embedded docs for speed, manageability, agile
development with dynamic schemas, easier horizontal
scalability because joins aren't as important. There are
a lot of things that work great in relational databases:
indexes, dynamic queries and updates to name a few, and
we haven't changed much there. For example, the way
you design your indexes in MongoDB should be exactly
the way you do it in MySQL or Oracle, you just have the
option of indexing an embedded field.
Eliot Horowitz, CTO and Co-founder
MongoDB is designed for how we build and run
applications with modern development techniques,
programming models, and computing resources.
HOW WE BUILD APPLICATIONS:
New and Complex Data Types. Rich data
structures with dynamic attributes, mixed
structure, text, media, arrays and other complex
types are common in today's applications.
Flexibility. Applications have evolving data
models, because certain attributes are initially
unknown, and because applications evolve
over time to accommodate new features and
requirements.
Modern Programming Languages. Object-
oriented programming languages interact with
data in structures that are dramatically different
from the way data is stored in a relational
database.
Faster Development. Software engineering
teams now embrace short, iterative development
cycles. In these projects, defining the data model
and application functionality is a continuous
process rather than a single event that happens
at the beginning of the project.
HOW WE RUN APPLICATIONS:
New Scalability for Big Data. Operational and
analytical workloads challenge traditional
capabilities on one or more dimensions of scale,
availability, performance and cost effectiveness.
Fast, Real-time Performance. Users expect
consistent, interactive experiences from
applications across many types of interfaces.
New Hardware. The relationship between cost
and performance for compute, storage, network
and main memory resources has changed
dramatically. Application designs can make
different optimizations and trade-offs to take
advantage of these resources.
New Computing Environments. The infrastructure
requirements for applications can easily exceed
the resources of a single computer, and cloud
infrastructure now provides massive, elastic,
cost-effective computing capacity on a metered
cost model.
MONGODB EMBRACES THESE NEW REALITIES
THROUGH KEY INNOVATIONS:
Document Data Model. Data is stored in a
structure that maps to objects in modern
programming languages and is easy for
developers to understand.
Rich Query Model. MongoDB is fit for a wide
variety of applications. It provides rich index and
query support, including secondary, geospatial
and text search indexes, the Aggregation
Framework and native MapReduce.
Idiomatic Drivers. Developers interact with
the database through native libraries that are
integrated with their respective environments
and code repositories, making MongoDB simple
and natural to use.
Horizontal Scalability. As the data volume and
throughput grow, developers can take advantage
of commodity hardware and cloud infrastructure
to increase the capacity of the MongoDB system.
High Availability. Multiple copies of data are
maintained with native replication. Automatic
failover to secondary nodes, racks and data
centers makes it possible to achieve enterprise-
grade uptime without custom code and
complicated tuning.
In-Memory Performance. Data is read and written
to RAM while also persisted to disk for durability,
providing fast performance and eliminating the
need for a separate caching layer.
Flexibility. From the document data model,
to multi-datacenter deployments, to tunable
consistency, to operation-level availability
options, MongoDB provides tremendous
flexibility to the development and operations
teams, and for these reasons it is well suited
to a wide variety of applications across many
industries.
MongoDB Data Model
DATA AS DOCUMENTS
MongoDB stores data as documents in a binary repre-
sentation called BSON (Binary JSON). The BSON
encoding extends the popular JSON (JavaScript Object
Notation) representation to include additional types
such as int, long, and floating point. BSON documents
contain one or more fields, and each field contains a
value of a specific data type, including arrays, binary
data and sub-documents.
Documents that tend to share a similar structure are
organized as collections. It may be helpful to think
of collections as being analogous to a table in a
relational database, documents as similar to rows, and
fields as similar to columns.
FIGURE 1 // Example relational data model for a blogging application (tables for Category, Article, User, Tag, and Comment).
For example, consider the data model for a blogging
application. In a relational database the data model
would comprise multiple tables. To simplify the
example, assume there are tables for Categories, Tags,
Users, Comments and Articles.
In MongoDB the data could be modeled as two collec-
tions, one for users, and the other for articles. In each
blog document there might be multiple comments,
multiple tags, and multiple categories, each expressed
as an embedded array.
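For illustration, a single document in an articles collection might look like the following; the field names and values are hypothetical, shown in the JSON-like form used by the mongo shell.

{
  _id: ObjectId("5306d7b5f6c9f0c5e4a7d310"),        // hypothetical identifier
  name: "Introducing MongoDB",
  slug: "introducing-mongodb",
  publish_date: ISODate("2014-03-01T00:00:00Z"),
  text: "MongoDB stores data as documents...",
  author: "alice",
  tags: ["mongodb", "nosql"],                        // embedded array
  categories: ["databases"],
  comments: [                                        // embedded sub-documents
    { author: "bob", date: ISODate("2014-03-02T08:30:00Z"), text: "Nice overview." }
  ]
}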
MongoDB documents tend to have all data for a
given record in a single document, whereas in a
relational database information for a given record is
usually spread across many tables. In other words,
data in MongoDB tends to be more localized. In most
MongoDB systems, BSON documents also tend to
be closely aligned to the structure of objects in the programming language of the application, which makes it easy for developers to understand how the data used in the application maps to the data that is stored in the database.
DYNAMIC SCHEMA
MongoDB documents can vary in structure. For
example, all documents that describe users might
contain the user id and the last date they logged into
the system, but only some of these documents might
contain the user's identity for one or more third-party applications. Fields can vary from document to document; there is no need to declare the structure of documents to the system, because documents are self-describing. If a new field needs to be added to a document, then the field can be created without affecting all other documents in the system, without updating a central system catalog, and without taking the system offline.
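A minimal shell sketch of this flexibility, using hypothetical user documents: a new field is added to one document without touching the others and without any central catalog change.

db.users.insert({ _id: 1, name: "alice", last_login: new Date() })
db.users.insert({ _id: 2, name: "bob", last_login: new Date() })

// Later, record a third-party identity for one user only;
// other documents are unaffected and no schema change is required.
db.users.update({ _id: 2 }, { $set: { twitter_id: "@bob" } })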
SCHEMA DESIGN
Although MongoDB provides robust schema flexibility,
schema design is still important. Schema designers
should consider a number of topics, including the
types of queries the application will need to perform,
how objects are managed in the application code,
and how documents will change and potentially grow
over time. Schema design is an extensive topic that is
beyond the scope of this document. For more infor-
mation, please see Data Modeling Considerations.
MongoDB Query Model
IDIOMATIC DRIVERS
MongoDB provides native drivers for all popular
programming languages and frameworks to make
development natural. Supported drivers include Java,
.NET, Ruby, PHP, JavaScript, Node.js, Python, Perl, Scala and others. MongoDB drivers are designed to be idiomatic for the given language. One fundamental difference as compared to relational databases is that the MongoDB query model is implemented as methods or functions within the API of a specific programming language, as opposed to a completely separate language like SQL. This, coupled with the affinity between MongoDB's JSON document model and the
data structures used in object-oriented programming,
makes integration with applications simple. For a
complete list of drivers see the MongoDB Drivers page.
MONGO SHELL
The mongo shell is a rich, interactive JavaScript shell
that is included with all MongoDB distributions. Nearly
all commands supported by MongoDB can be issued
through the shell, including administrative opera-
tions. The mongo shell is a popular way to interact
with MongoDB for ad hoc operations. All examples in
the MongoDB Manual are based on the shell. For more
on the mongo shell, see the corresponding page in the
MongoDB Manual.
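A brief, hypothetical shell session; the database, collection and field names are illustrative.

use blog
db.articles.find({ tags: "mongodb" }).limit(2).pretty()   // ad hoc query
db.articles.count()                                       // simple count
db.serverStatus()                                         // an administrative operation issued from the shell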
FIGURE 2 // Example document data model for a blogging application (a single Article document with embedded Comment, Tag, and Category arrays).
QUERY TYPES
MongoDB supports many types of queries; several of them are illustrated in the shell sketch after this list. A query may return a document or a subset of specific fields within the document:
Key-value queries return results based on any
field in the document, often the primary key.
Range queries return results based on values
defined as inequalities (e.g., greater than, less
than or equal to, between).
Geospatial queries return results based on
proximity criteria, intersection and inclusion as
specified by a point, line, circle or polygon.
Text Search queries return results in relevance
order based on text arguments using Boolean
operators (e.g., AND, OR, NOT).
Aggregation Framework queries return
aggregations of values returned by the query
(e.g., count, min, max, average, similar to a SQL
GROUP BY statement).
MapReduce queries execute complex data
processing that is expressed in JavaScript and
executed across data in the database.
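A shell sketch of several of these query types; the collections, fields and index assumptions are hypothetical.

// Key-value: match on the primary key
db.users.find({ _id: 12345 })

// Range: inequality on a field
db.articles.find({ publish_date: { $gte: ISODate("2014-01-01") } })

// Geospatial: nearest documents to a point (assumes a 2dsphere index on location)
db.places.find({ location: { $near: { $geometry: { type: "Point", coordinates: [ -73.99, 40.73 ] } } } })

// Text search: relevance-ordered results (assumes a text index)
db.articles.find({ $text: { $search: "replica sets" } })

// Aggregation Framework: count articles per author, similar to SQL GROUP BY
db.articles.aggregate([ { $group: { _id: "$author", count: { $sum: 1 } } } ])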
INDEXING
As in most database management systems, indexes are a crucial mechanism for optimizing system performance in MongoDB. While indexes will improve the performance of some operations by orders of magnitude, they have associated costs in the form of slower writes, disk usage, and memory usage. MongoDB includes support for many types of indexes on any field in the document; a shell sketch creating several of them follows this list:
Unique Indexes. By specifying an index as unique,
MongoDB will reject inserts of new documents
or the update of a document with an existing
value for the field for which the unique index has been created. By default, indexes are not unique. If a compound index is specified as unique,
the combination of values must be unique.
Compound Indexes. It can be useful to create
compound indexes for queries that specify
multiple predicates. For example, consider an
application that stores data about customers. The
application may need to find customers based on last name, first name, and state of residence. With a compound index on last name, first name, and state of residence, queries could efficiently locate people with all three of these values specified. An additional benefit of a compound index is that any leading field within the index can be used, so fewer indexes on single fields may be necessary: this compound index would also optimize queries looking for customers by last name.
Array Indexes. For fields that contain an array, each array value is stored as a separate index entry. For example, documents that describe recipes might include a field for ingredients. If there is an index on the ingredients field, each ingredient is indexed and queries on the ingredients field can be optimized by this index. There is no special syntax required for creating array indexes: if the field contains an array, it will be indexed as an array index.
TTL Indexes. In some cases data should expire out
of the system automatically. Time to Live (TTL)
indexes allow the user to specify a period of time
after which the data will automatically be deleted
from the database. A common use of TTL indexes
is applications that maintain a rolling window of
history (e.g., most recent 100 days) for user actions
such as clickstreams.
Geospatial Indexes. MongoDB provides geospatial
indexes to optimize queries related to location
within a two dimensional space, such as projection
systems for the earth. These indexes allow
MongoDB to optimize queries for documents
that contain points or a polygon that are closest
to a given point or line; that are within a circle,
rectangle, or polygon; or that intersect with a
circle, rectangle, or polygon.
Sparse Indexes. Sparse indexes only contain entries for documents that contain the specified field. Because MongoDB's document data model allows structure to vary from document to document, it is common for some fields to be present only in a subset of all documents. Sparse indexes allow for smaller, more
efficient indexes when fields are not present in all documents.
Text Search Indexes. MongoDB provides a
specialized index for text search that uses
advanced, language-specific linguistic rules
for stemming, tokenization and stop words.
Queries that use the text search index will return
documents in relevance order. One or more fields
can be included in the text index.
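A shell sketch creating several of these index types with the ensureIndex helper used in MongoDB 2.6; the collection and field names are hypothetical.

db.users.ensureIndex({ email: 1 }, { unique: true })                      // unique index
db.customers.ensureIndex({ last_name: 1, first_name: 1, state: 1 })       // compound index
db.recipes.ensureIndex({ ingredients: 1 })                                // array field, indexed as an array index automatically
db.events.ensureIndex({ created_at: 1 }, { expireAfterSeconds: 8640000 }) // TTL index: expire after ~100 days
db.places.ensureIndex({ location: "2dsphere" })                           // geospatial index
db.users.ensureIndex({ twitter_id: 1 }, { sparse: true })                 // sparse index
db.articles.ensureIndex({ text: "text" })                                 // text search index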
QUERY OPTIMIZATION
MongoDB automatically optimizes queries to make
evaluation as efficient as possible. Evaluation normally
includes selecting data based on predicates, and sorting
data based on the sort criteria provided. The query
optimizer selects the best index to use by periodi-
cally running alternate query plans and selecting the
index with the best response time for each query type.
The results of this empirical test are stored as a cached
query plan and are updated periodically.
COVERED QUERIES
Queries that return results containing only indexed
fields are called covered queries. These results can be
returned without reading from the source documents.
With the appropriate indexes, workloads can be
optimized to use predominantly covered queries.
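A hypothetical covered query against the customers collection, assuming the compound index on last_name, first_name and state described earlier.

// Returns only indexed fields; _id is explicitly excluded so the
// query can be answered from the index without reading documents.
db.customers.find(
  { last_name: "Smith" },
  { _id: 0, last_name: 1, first_name: 1, state: 1 }
)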
MongoDB Data
Management
IN-PLACE UPDATES
MongoDB stores each document on disk as a
contiguous block. It allocates space for a document
at insert time, and performs updates to documents
in-place. By managing data in-place, MongoDB can
perform discrete, field-level updates, thereby reducing
disk I/O and updating only those index entries that
need to be updated. Furthermore, MongoDB can
manage disk space more efficiently than designs
that manage updates using database compaction at
runtime, which requires additional space and imposes
a processing overhead, often yielding unpredictable
performance.
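A minimal sketch of a field-level, in-place update; the document and field names are hypothetical.

// Only the named fields and their index entries are modified;
// the rest of the document is untouched.
db.users.update(
  { _id: 12345 },
  { $set: { last_login: new Date() }, $inc: { login_count: 1 } }
)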
AUTO-SHARDING
MongoDB provides horizontal scale-out for databases
using a technique called sharding, which is trans-
parent to applications. Sharding distributes data across
multiple physical partitions called shards. Sharding
allows MongoDB deployments to address the hardware
limitations of a single server, such as bottlenecks in
RAM or disk I/O, without adding complexity to the
application.
MongoDB supports three types of sharding:
Range-based Sharding. Documents are
partitioned across shards according to the shard
key value. Documents with shard key values
close to one another are likely to be co-located
on the same shard. This approach is well suited
for applications that need to optimize range-
based queries.
Hash-based Sharding. Documents are uniformly
distributed according to an MD5 hash of the
shard key value. Documents with shard key
values close to one another are unlikely to be
co-located on the same shard. This approach
guarantees a uniform distribution of writes
across shards, but is less optimal for range-based
queries.
Tag-aware Sharding. Documents are partitioned according to a user-specified configuration that associates shard key ranges with shards. Users can optimize the physical location of documents for application requirements such as locating data in specific data centers.
MongoDB automatically balances the data in the
cluster as the data grows or the size of the cluster
increases or decreases. For more on sharding see the
Sharding Introduction.
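A shell sketch of enabling the three sharding types from a mongos; the database, collection, shard and tag names are hypothetical.

sh.enableSharding("blog")

// Range-based sharding on the article name
sh.shardCollection("blog.articles", { name: 1 })

// Hash-based sharding for uniform write distribution
sh.shardCollection("blog.users", { _id: "hashed" })

// Tag-aware sharding: pin a shard key range to shards tagged "EAST"
sh.addShardTag("shard0000", "EAST")
sh.addTagRange("blog.articles", { name: MinKey }, { name: "m" }, "EAST")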
FIGURE 3 // Automatic sharding provides horizontal scalability in MongoDB.
QUERY ROUTER
Sharding is transparent to applications; whether there
is one or one hundred shards, the application code
for querying MongoDB is the same. Applications issue
queries to a query router that dispatches the query to
the appropriate shards.
For key-value queries that are based on the shard
key, the query router will dispatch the query to the
shard that manages the document with the requested
key. When using range-based sharding, queries that
specify ranges on the shard key are only dispatched
to shards that contain documents with values within
the range. For queries that don't use the shard key, the
query router will dispatch the query to all shards and
aggregate and sort the results as appropriate. Multiple
query routers can be used with a MongoDB system,
and the appropriate number is determined based on
performance and availability requirements of the
application.
Consistency & Durability
TRANSACTION MODEL
MongoDB is ACID compliant at the document
level. One or more fields may be written in a single operation, including updates to multiple sub-documents and elements of an array. The ACID guarantees provided by MongoDB ensure complete
isolation as a document is updated; any errors cause
the operation to roll back and clients receive a
consistent view of the document.
Developers can use MongoDB's Write Concerns to configure operations to commit to the application only after they have been flushed to the journal file on disk. This is the same model used by many traditional relational databases to provide durability guarantees.

As a distributed system, MongoDB presents additional flexibility in enabling users to achieve their desired durability goals. Each query can specify the appropriate write concern, ranging from unacknowledged to acknowledged only after specific policies have been fulfilled, such as writing to at least two replicas in one data center and one replica in a second data center. See the Configurable Write Availability section of the guide to learn more.
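A sketch of a single atomic operation that updates multiple fields and an embedded array, with a write concern requiring the journal commit; the names are hypothetical.

db.articles.update(
  { _id: 42 },
  {
    $set:  { text: "Updated body..." },
    $push: { comments: { author: "carol", date: new Date(), text: "Thanks!" } }
  },
  { writeConcern: { j: true } }   // acknowledge only after the write is in the journal on disk
)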
Locks and Lock Yielding
MongoDB uses locks to enforce concurrency control
for multiple clients accessing the database. In a
relational database, locks are typically placed on
entire tables or rows. In MongoDB, concurrency control
is much closer to the concept of a latch, which coordi-
nates multi-threaded access to shared data structures
and objects. The lock is lightweight: in a properly designed schema a write will hold it for approximately 10 microseconds, and the lock can be yielded in the event of long-running operations, providing high concurrency.
FIGURE 4 // Sharding is transparent to applications.

FIGURE 5 // A replica set is a fully self-healing shard. Data across replicas can be eventually consistent or strictly consistent.
MongoDB implements a reader/writer lock for each
database, which supports multiple readers and a single
writer. The lock is write-greedy, meaning:
There can be an unlimited number of
simultaneous readers on a database
There can only be one writer at a time on any
collection in any one database
Writers block readers for only as long as it takes to
update a single document in RAM. If a slow-running
operation is predicted (i.e., a document or an index
entry will need to be paged in from disk), then that
operation will yield the write lock. When the operation
yields the lock, then the next queued operation can
proceed.
With a properly designed schema, MongoDB will
saturate disk I/O capacity before locking contention
presents a performance bottleneck. In cases where
locking contention is inhibiting application perfor-
mance, sharding the database enables concurrency to
be scaled across many instances.
For more on locks and lock yielding, see the entry on
Concurrency.
JOURNALING
MongoDB implements write-ahead journaling of
operations to enable fast crash recovery and durability
in the storage engine. Journaling helps prevent
corruption and increases operational resilience. Journal
commits are issued at least as often as every 100ms
by default. In the case of a server crash, journal entries
are recovered automatically.
REPLICA SETS
MongoDB maintains multiple copies of data called
replica sets using native replication. A replica set is a
fully self-healing shard that helps prevent database
downtime. Replica failover is fully automated, eliminating the need for administrators to intervene manually.

FIGURE 6 // Sharding and replica sets: automatic sharding provides horizontal scalability; replica sets help prevent database downtime.
A replica set consists of multiple replicas. At any given
time one member acts as the primary replica and the
other members act as secondary replicas. MongoDB is
consistent by default: reads and writes are issued to a
primary copy of the data. If the primary member fails
for any reason (e.g., overheating, hardware failure or
network partition), one of the secondary members is
automatically elected to primary and begins to process
all writes.
The number of replicas in a MongoDB replica set is configurable, and a larger number of replicas provides increased data durability and protection against database downtime (e.g., in case of multiple machine failures, rack failures, data center failures, or network partitions). Optionally, operations can be configured to write to multiple replicas before returning to the application, thereby providing functionality that is similar to synchronous replication. For more on this topic, see the Configurable Write Availability section later in this guide.
Applications can optionally read from secondary
replicas, where data is eventually consistent by
default. Reads from secondaries can be useful in
scenarios where it is acceptable for data to be slightly
out of date, such as some reporting applications. Appli-
cations can also read from the closest copy of the data
as measured by ping distance when latency is more
important than consistency. For more on reading from
secondaries see the entry on Read Preference.
Replica sets also provide operational flexibility by providing a way to upgrade hardware and software without requiring the database to go offline. This is
an important feature as these types of operations
can account for as much as one third of all downtime
in traditional systems. For instance, if one needs
to perform a hardware upgrade on all members of
the replica set, one can sequentially upgrade each
secondary without impacting the replica set. When all
secondaries have been upgraded, one can temporarily
demote the primary replica to secondary to upgrade
that server. Similarly, other operational tasks, like
adding indexes, can be carried out on replicas without
interfering with uptime. For more on replica sets, see
the entry on Replication.
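A minimal sketch of configuring a three-member replica set from the shell; the replica set name and hostnames are hypothetical.

rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "db1.example.net:27017" },
    { _id: 1, host: "db2.example.net:27017" },
    { _id: 2, host: "db3.example.net:27017" }
  ]
})

rs.status()   // inspect which member is primary and the state of each secondary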
Availability
REPLICATION
Operations that modify a database on the primary are
replicated to the secondaries with a log called the
oplog. The oplog contains an ordered set of idempotent
operations that are replayed on the secondaries. The
size of the oplog is configurable and by default 5% of the available free disk space. For most applications, this size represents many hours of operations and defines the recovery window for a secondary should
this replica go ofine for some period of time and need
to catch up to the primary.
If a secondary is down for a period longer than is
maintained by the oplog, it must be recovered from the
primary using a process called initial synchronization.
During this process all databases and their collections
are copied from the primary or another replica to
the secondary as well as the oplog, then the indexes
are built. Initial synchronization is also performed
when adding a new member to a replica set. For
more information see the page on Replica Set Data
Synchronization.
ELECTIONS AND FAILOVER
Replica sets reduce operational overhead and improve
system availability. If the primary replica for a shard
fails, secondary replicas together determine which
replica should become the new primary in a process
called an election. Once the new primary has been
determined, remaining secondaries are configured to receive updates from the new primary. If the original primary comes back online, it will recognize that it is no longer the primary and will configure itself to
become a secondary. For more on failover elections
see the entry on Replica Set Elections.
ELECTION PRIORITY
MongoDB considers a number of criteria when electing
a new primary, and a configuration called the election priority allows users to influence their deployments
to achieve certain operational goals. Every replica set
member has a priority that determines its eligibility to
become primary. In an election, the replica set elects
an eligible member with the highest priority value
as primary. By default, all members have a priority
of 1 and have an equal chance of becoming primary;
however, it is possible to set priority values that affect
the likelihood of a replica becoming primary.
In some deployments, there may be operational
requirements that can be addressed with election
priorities. For instance, all replicas located in a secondary data center could be configured with a priority so that they would only become primary if the primary data center fails. Similarly, a replica can be configured to act as a backup by setting the priority so that it never becomes primary.
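A sketch of adjusting election priorities for the scenarios above; the member indexes and values are hypothetical.

cfg = rs.conf()
cfg.members[1].priority = 0.5   // secondary data center: preferred only if higher-priority members are unavailable
cfg.members[2].priority = 0     // backup member: never becomes primary
rs.reconfig(cfg)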
CONFIGURABLE WRITE AVAILABILITY
MongoDB allows users to specify write availability
in the system, which is called the write concern. The
default write concern acknowledges writes from the
application, allowing the client to catch network
exceptions and duplicate key errors. Each query
can specify the appropriate write concern, ranging
from unacknowledged to acknowledgement that
writes have been committed to multiple replicas, a
majority of replicas, or all replicas. It is also possible
to configure the write concern so that writes are only acknowledged once specific policies have been fulfilled, such as writing to at least two replicas in one data center and at least one replica in a second data center. For more on configurable availability see the entry on Write Concern.
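A sketch of specifying write concerns per operation in the shell; the "MultipleDC" mode is a hypothetical custom policy that would be defined in the replica set configuration (getLastErrorModes).

// Acknowledge after a majority of replicas have the write
db.articles.insert(
  { name: "Write concern example", publish_date: new Date() },
  { writeConcern: { w: "majority", wtimeout: 5000 } }
)

// Acknowledge according to a custom tag-based policy, e.g. at least
// two replicas in one data center and one replica in another
db.articles.insert(
  { name: "Multi-data-center example" },
  { writeConcern: { w: "MultipleDC", wtimeout: 5000 } }
)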
Performance
IN-MEMORY PERFORMANCE WITH ON-DISK
CAPACITY
MongoDB makes extensive use of RAM to speed up
database operations. Reading data from memory is
measured in nanoseconds, whereas reading data from
spinning disk is measured in milliseconds; reading from
memory is approximately 100,000 times faster than
reading data from disk. In MongoDB, all data is read
and manipulated through memory-mapped files. Data that is not accessed is not loaded into RAM. While it is not required that all data fit in RAM, it should be the goal of the deployment team that indexes and all frequently accessed data fit in RAM. For
example it may be the case that a fraction of the entire
database is most frequently accessed by the appli-
cation, such as data related to recent events or popular
products. If the volume of data that is frequently
accessed exceeds the capacity of a single machine,
MongoDB can scale horizontally across multiple
servers using automatic sharding.
Because MongoDB provides in-memory performance,
for most applications there is no need for a separate
caching layer.
Conclusion
MongoDB provides a powerful, innovative database
platform, architected for how we build and run ap-
plications today. In this guide we have explored the
fundamental concepts and assumptions that underlie
the architecture of MongoDB. Other guides on topics
such as Operations Best Practices can be found at
mongodb.com.
About MongoDB
MongoDB (from humongous) is reinventing data
management and powering big data as the leading
NoSQL database. Designed for how we build and run
applications today, it empowers organizations to be
more agile and scalable. MongoDB enables new types
of applications, better customer experience, faster
time to market and lower costs. It has a thriving global
community with over 7 million downloads, 150,000
online education registrations, 25,000 user group
members and 20,000 MongoDB Days attendees. The
company has more than 1,000 customers, including
many of the world's largest organizations.
Resources
For more information, please visit mongodb.com or
mongodb.org, or contact us at sales@mongodb.com.
MongoDB Enterprise Download: mongodb.com/download
Free Online Training: education.mongodb.com
Webinars and Events: mongodb.com/events
White Papers: mongodb.com/white-papers
Case Studies: mongodb.com/customers
Presentations: mongodb.com/presentations
Documentation: docs.mongodb.org
US 866.237.8815 INTL +1 650 440 4474 info@mongodb.com
Copyright 2014 MongoDB, Inc. All Rights Reserved.
