MongoDB
Document-based databases and MongoDB: documents, JSON & BSON formats, representing relationships, CRUD operations, indexing, aggregation, sharding architecture and replication strategies, consistency, and locking.
MongoDB
• MongoDB is a powerful, flexible, and scalable general-purpose database.
• It combines the ability to scale out with features such as secondary indexes, range queries,
sorting, aggregations, and geospatial indexes.
• Ease of Use
• Easy Scaling
Ease of Use
• A document-oriented database replaces the concept of a “row” with a more flexible model, the
“document.”
• By allowing embedded documents and arrays, the document-oriented approach makes it possible
to represent complex hierarchical relationships with a single record.
• There are also no predefined schemas: a document’s keys and values are not of fixed types or
sizes.
• Without a fixed schema, adding or removing fields as needed becomes easier. Generally, this
makes development faster as developers can quickly iterate. It is also easier to experiment.
Easy Scaling
• Data set sizes for applications are growing at an incredible pace. Increases in available bandwidth
and cheap storage have created an environment where even small-scale applications need to
store more data than many databases were meant to handle.
• A terabyte of data, once an unheard-of amount of information, is now commonplace.
• Scaling a database comes down to the choice between scaling up (getting a bigger machine) or
scaling out (partitioning data across more machines).
• Scaling up is often the path of least resistance, but it has drawbacks: large machines are often
very expensive, and eventually a physical limit is reached where a more powerful machine cannot
be purchased at any cost.
• The alternative is to scale out: to add storage space or increase performance, buy another
commodity server and add it to your cluster. This is both cheaper and more scalable; however, it
is more difficult to administer a thousand machines than it is to care for one.
• MongoDB was designed to scale out.
• Its document-oriented data model makes it easier for it to split up data across multiple servers.
• MongoDB automatically takes care of balancing data and load across a cluster, redistributing
documents automatically and routing user requests to the correct machines.
• This allows developers to focus on programming the application, not scaling it.
MongoDB – Features
• Indexing
• MongoDB supports generic secondary indexes, allowing a variety of fast queries, and
provides unique, compound, geospatial, and full-text indexing capabilities as well.
• Aggregation
• MongoDB supports an “aggregation pipeline” that lets you build complex aggregations
from simple pieces, allowing the database to optimize them.
• Special collection types
• MongoDB supports time-to-live (TTL) collections for data that should expire at a certain time,
such as sessions. It also supports fixed-size (capped) collections, which are useful for holding
recent data, such as logs; both are sketched just after this list.
• File storage
• MongoDB supports an easy-to-use protocol for storing large files and file metadata.
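• As a rough sketch of the two special collection types above (the collection names and the 3600-second expiry are hypothetical choices, not part of the original example):

// TTL index: documents expire 3600 seconds after their "lastActivity" date
> db.sessions.ensureIndex({"lastActivity" : 1}, {"expireAfterSeconds" : 3600})
// Capped (fixed-size) collection: the oldest log entries are aged out automatically
> db.createCollection("logs", {"capped" : true, "size" : 100000})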
Architecture (diagram omitted)
Batch Insert
• If you have a situation where you are inserting multiple documents into a collection, you can make
the insert faster by using batch inserts.
• Batch inserts allow you to pass an array of documents to the database
• > db.foo.batchInsert([{"_id" : 0}, {"_id" : 1}, {"_id" : 2}])
• > db.foo.find()
• { "_id" : 0 } { "_id" : 1 } { "_id" : 2 }
• Batch inserts are only useful if you are inserting multiple documents into a single collection:
• you cannot use batch inserts to insert into multiple collections with a single request.
• If you are importing a batch and a document halfway through the batch fails to be inserted, the
documents up to that document will be inserted and everything after that document will not:
• > db.foo.batchInsert([{"_id" : 0}, {"_id" : 1}, {"_id" : 1}, {"_id" : 2}])
• Only the first two documents will be inserted, as the third will produce an error: you cannot insert
two documents with the same "_id".
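• A note on shell versions: batchInsert comes from older mongo shells; current shells spell the same operation insertMany. A minimal sketch, assuming the same foo collection:

> db.foo.insertMany([{"_id" : 3}, {"_id" : 4}, {"_id" : 5}])
// Passing {"ordered" : false} changes the halt-on-error behavior described above:
// the server keeps inserting the remaining documents past a failed one
> db.foo.insertMany([{"_id" : 6}, {"_id" : 6}, {"_id" : 7}], {"ordered" : false})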
Removing Documents
• Now that there’s data in our database, let’s delete it:
• > db.foo.remove()
• This will remove all of the documents in the foo collection.
• This doesn’t actually remove the collection, and any meta information about it will still exist.
• Suppose we want to remove everyone from the mailing.list collection where the value for "opt-out" is
true:
• > db.mailing.list.remove({"opt-out" : true})
• Once data has been removed, it is gone forever. There is no way to undo the remove or recover
deleted documents.
Remove Speed
• Removing documents is usually a fairly quick operation, but if you want to clear an entire
collection, it is faster to drop it:
• > db.tester.drop()
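• To make the contrast concrete, here is a sketch using the tester collection from above:

// Removes every document one at a time; the empty collection and its indexes remain
> db.tester.remove({})
// Drops the whole collection (including its indexes) in one fast operation
> db.tester.drop()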
Updating Documents
• Once a document is stored in the database, it can be changed using the update method.
• update takes two parameters: a query document, which locates documents to update, and a
modifier document, which describes the changes to make to the documents found.
Document Replacement
• The simplest type of update fully replaces a matching document with a new one.
• This can be useful for a dramatic schema migration. Suppose we have the following user document:
• { "_id" : ObjectId("4b2b9f67a1f631733d917a7a"), ‘
"name" : "joe",
"friends" : 32,
"enemies" : 2 }
• We want to move the "friends" and "enemies" fields to a "relationships" subdocument.
> var joe = db.users.findOne({"name" : "joe"});
> joe.relationships = {"friends" : joe.friends, "enemies" : joe.enemies};
{ "friends" : 32, "enemies" : 2 }
> joe.username = joe.name;
"joe"
> delete joe.friends;
true
> delete joe.enemies;
true
> delete joe.name;
true
> db.users.update({"name" : "joe"}, joe);
> db.users.findOne()
{ "_id" : ObjectId("4b2b9f67a1f631733d917a7a"),
"username" : "joe",
"relationships" : { "friends" : 32, "enemies" : 2 } }
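• One cautionary sketch: if "name" could match more than one document, replacing by it is risky, so it is safer to anchor the replacement to the document's unique "_id" (reusing the joe variable from above):

> db.users.update({"_id" : joe._id}, joe)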
Using Modifiers
• Usually only certain portions of a document need to be updated.
• You can update specific fields in a document using atomic update modifiers.
• Update modifiers are special keys that can be used to specify complex update operations, such as
altering, adding, or removing keys, and even manipulating arrays and embedded documents.
• Each URL and its number of page views is stored in a document that looks like this:
• { "_id" : ObjectId("4b253b067525f35f94b60a31"),
"url" : "www.example.com",
"pageviews" : 52 }
• db.analytics.update({"url" : "www.example.com"}, {"$inc" : {"pageviews" : 1}})
“$set” modifier
• "$set" sets the value of a field.
• If the field does not yet exist, it will be created.
• This can be handy for updating schema or adding user-defined keys.
• db.users.findOne()
{ "_id" : ObjectId("4b253b067525f35f94b60a31"),
"name" : "joe",
"age" : 30,
"sex" : "male",
"location" : "Wisconsin" }
• db.users.update({"_id" : ObjectId("4b253b067525f35f94b60a31")}, {"$set" : {"favorite book" : "War and Peace"}})
Upserts
• An upsert is a special type of update.
• If no document is found that matches the update criteria, a new document will be created by
combining the criteria document and the modifier document.
• If a matching document is found, it will be updated normally.
• Upserts can be handy because they can eliminate the need to “seed” your collection: you can
often have the same code create and update documents.
• Here is an upsert (the third parameter to update specifies that this should be an upsert):
• db.analytics.update({"url" : "/blog"}, {"$inc" : {"pageviews" : 1}}, true)
MongoDB – Indexing
• A database index is similar to a book’s index.
• Instead of looking through the whole book, the database takes a shortcut and just looks at an
ordered list that points to the content, which allows it to query orders of magnitude faster.
• A query that does not use an index is called a table scan (a term inherited from relational
databases), which means that the server has to “look through the whole book” to find a query’s
results.
• This process is basically what you’d do if you were looking for information in a book without an
index: you start at page 1 and read through the whole thing.
• In general, you want to avoid making the server do table scans because it is very slow for large
collections.
• Indexes have their price: every write (insert, update, or delete) will take longer for every index
you add.
• This is because MongoDB has to update all your indexes whenever your data changes, as well as
the document itself.
• Thus, MongoDB limits you to 64 indexes per collection.
• Generally you should not have more than a couple of indexes on any given collection.
• The tricky part becomes figuring out which fields to index.
• To choose which fields to create indexes for, look through your common queries and queries that
need to be fast and try to find a common set of keys from those.
• Using an index whose first key is the sort key is good in that it does not require any giant in-memory sorts.
• However, MongoDB does have to scan the entire index to find all matches.
• Thus, putting the sort key first is generally a good strategy when you’re using a limit, so MongoDB
can stop scanning the index after a couple of matches, as sketched below.
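• A sketch of that strategy (the users collection, its fields, and the values are hypothetical):

// Compound index with the sort key ("age") first
> db.users.ensureIndex({"age" : 1, "username" : 1})
// MongoDB walks the index in age order and can stop after 10 matching documents
> db.users.find({"username" : {"$gte" : "user100"}}).sort({"age" : 1}).limit(10)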
• If your query is only looking for the fields that are included in the index, it does not need to fetch
the document.
• When an index contains all the values requested by the user, it is considered to be covering a
query.
• Whenever practical, use covered indexes in preference to going back to documents.
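• A sketch of a covered query against the hypothetical index above: the projection asks only for indexed fields and explicitly excludes "_id" (which is not in the index), so MongoDB can answer from the index alone:

> db.users.find({"age" : 21}, {"_id" : 0, "age" : 1, "username" : 1})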
Sparse Indexes
• Unique indexes count null as a value, so you cannot have a unique index with more than one
document missing the key.
• However, there are lots of cases where you may want the unique index to be enforced only if the
key exists.
• If you have a field that may or may not exist but must be unique when it does, you can combine
the unique option with the sparse option.
• To create a sparse index, include the sparse option.
• For example, if providing an email address was optional but, if provided, should be unique, we
could do:
• > db.users.ensureIndex({"email" : 1}, {"unique" : true, "sparse" : true})
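• (In current shells, createIndex replaces the deprecated ensureIndex helper.) A sketch of the behavior this buys us, with hypothetical documents:

// Two documents with no "email" key can coexist: the sparse index skips them
> db.users.insert({"username" : "a"})
> db.users.insert({"username" : "b"})
// But a duplicate email is still rejected by the unique constraint
> db.users.insert({"email" : "joe@example.com"})
> db.users.insert({"email" : "joe@example.com"}) // fails with a duplicate key error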
MongoDB – Sharding
• Sharding refers to the process of splitting data up across machines;
• the term partitioning is also sometimes used to describe this concept.
• By putting a subset of data on each machine, it becomes possible to store more data and handle
more load without requiring larger or more powerful machines, just a larger quantity of less-
powerful machines.
• Manual sharding can be done with almost any database software.
• Manual sharding is when an application maintains connections to several different database
servers, each of which is completely independent.
• The application manages storing different data on different servers and querying against the
appropriate server to get data back.
• This approach can work well but becomes difficult to maintain when adding or removing nodes
from the cluster or in the face of changing data distributions or load patterns.
• MongoDB supports autosharding, which tries to both abstract the architecture away from the
application and simplify the administration of such a system.
• MongoDB allows your application to ignore the fact that it isn’t talking to a standalone MongoDB
server, to some extent.
• On the operations side, MongoDB automates balancing data across shards and makes it easier to
add and remove capacity.
• MongoDB’s sharding allows you to create a cluster of many machines (shards) and break up your
collection across them, putting a subset of data on each shard.
• This allows your application to grow beyond the resource limits of a standalone server or replica
set.
• One of the goals of sharding is to make a cluster of 5, 10, or 1,000 machines look like a single
machine to your application.
• To hide these details from the application, we run a routing process called mongos in front of the
shards.
• This router keeps a “table of contents” that tells it which shard contains which data.
• Applications can connect to this router and issue requests normally
• The router, knowing what data is on which shard, is able to forward the requests to the
appropriate shard(s).
• If there are responses to the request, the router collects them, merges them, and sends them
back to the application.
• As far as the application knows, it’s connected to a standalone mongod.
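• A sketch of how this looks administratively when run against a mongos router (the database, collection, and shard key names are hypothetical):

// Allow collections in "mydb" to be distributed across shards
> sh.enableSharding("mydb")
// Split mydb.users across the shards by username
> sh.shardCollection("mydb.users", {"username" : 1})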
• Deciding when to shard is a balancing act.
• You generally do not want to shard too early because it adds operational complexity to your
deployment and forces you to make design decisions that are difficult to change later.
• On the other hand, you do not want to wait too long to shard because it is difficult to shard an
overloaded system without downtime.
• In general, sharding is used to:
o Increase available RAM
o Increase available disk space
o Reduce load on a server
o Read or write data with greater throughput than a single mongod can handle
• Thus, good monitoring is important to decide when sharding will be necessary.
• Carefully measure each of these metrics.
• Generally, a deployment hits one of these bottlenecks much sooner than the others, so figure
out which one your deployment will need to provision for first and make plans well in advance
about when and how you plan to convert your replica set.
• As you add shards, performance should increase roughly linearly per shard up to hundreds of
shards.
• However, you will usually experience a performance drop if you move from a non-sharded system
to just a few shards.
• Due to the overhead of moving data, maintaining metadata, and routing, small numbers of shards
will generally have higher latency and may even have lower throughput than a non-sharded
system.
• Thus, you may want to jump directly to three or more shards.