MEAN 3 L4 Advanced MongoDB With Aggregation
Indexes are data structures that store a collection's data set in a form that is easy to traverse. Indexes help perform the following functions:
● Execute queries and find documents that match the query criteria without a collection scan
● Limit the number of documents a query examines
● Store field values in the order of the value
● Support equality matches and range-based queries
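For example, a single-field index can be created and used as in the minimal sketch below; the customer_info collection and customer_name field are assumptions for illustration:
db.customer_info.createIndex({ customer_name: 1 })   // ascending single-field index
db.customer_info.find({ customer_name: "Alice" })    // an equality match that can use the index instead of a collection scan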
Types of Index
● TTL Indexes: A TTL index is used for TTL collections, which remove data automatically after a specified period of time.
● Unique Indexes: A unique index causes MongoDB to reject all documents that contain a
duplicate value for the indexed field.
● Partial Indexes: A partial index indexes only documents that meet specified filter criteria.
● Case Insensitive Indexes: A case insensitive index disregards the case of the index key values.
● Sparse Indexes: A sparse index does not index documents that do not have the indexed field.
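The sketches below show how some of these index types are created. The collection names, field names, and values are illustrative assumptions; the partial and case insensitive options require MongoDB 3.2 and 3.4 or later, respectively:
db.sessions.createIndex({ createdAt: 1 }, { expireAfterSeconds: 3600 })   // TTL index: documents expire about an hour after their createdAt value
db.orders.createIndex({ customer: 1 }, { partialFilterExpression: { amount: { $gt: 100 } } })   // partial index: only documents matching the filter are indexed
db.users.createIndex({ email: 1 }, { collation: { locale: "en", strength: 2 } })   // case insensitive index via a collation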
Compound Index
A compound index in MongoDB indexes multiple fields in a single index structure; the fields are listed, separated by commas, in the index specification document. MongoDB limits the fields of a compound index to a maximum of 31.
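A minimal sketch of a compound index, assuming a products collection with category and price fields:
db.products.createIndex({ category: 1, price: -1 })   // category ascending, price descending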
Sparse Index
Sparse indexes contain entries only for documents that have the indexed field; they ignore documents that do not contain the indexed field.
To create a sparse index, use the db.collection.createIndex() method and set the sparse option to true.
When a sparse index would return an incomplete result set, MongoDB does not use that index unless it is specified in the hint() method. For example, a sparse index on the field x is not used for the query
{ x: { $exists: false } }
because documents that lack the field x are not present in the index.
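A sketch, assuming a customer_info collection in which only some documents contain a phone field:
db.customer_info.createIndex({ phone: 1 }, { sparse: true })   // documents without phone are not indexed
db.customer_info.find({ phone: { $exists: false } }).hint({ phone: 1 })   // forces the sparse index; returns no documents because the matching documents are absent from the index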
Unique Index
If a document does not have a value for the indexed field, a unique index stores a null value for that document. Because of this unique constraint, MongoDB permits only one document without the indexed field. If more than one document has a missing or valueless indexed field, the index build process fails.
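A sketch, assuming an email field in the customer_info collection:
db.customer_info.createIndex({ email: 1 }, { unique: true })   // rejects documents that duplicate an existing email value
// If more than one document may lack the email field, drop the index above and combine unique with sparse instead:
db.customer_info.createIndex({ email: 1 }, { unique: true, sparse: true })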
Create Compound, Sparse, and Unique Indexes
Duration: 45 min.
Problem Statement:
You are given a project to create compound, sparse, and unique indexes.
Assisted Practice: Guidelines to Demonstrate Indexing
By default, during index creation, operations on a database are blocked and the database becomes unavailable for any read or write operation. The read and write operations on the database are queued until the index build completes.
Use the following command to make MongoDB available even during an index build process:
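The command itself is not reproduced above. In the MongoDB versions this course targets, the background option of createIndex() keeps the database available during the build; a sketch, assuming the customer_name field:
db.customer_info.createIndex({ customer_name: 1 }, { background: true })   // build the index in the background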
If you perform any administrative operations when MongoDB is creating indexes in the
background for a collection, you will receive an error.
The background index build process:
● Uses an incremental approach and is slower than the normal foreground process
● Depends on the size of the index for its speed
● Impacts database performance
To avoid performance issues:
● Use getIndexes() to ensure that your application checks for the required indexes at start-up
● Use the equivalent method for your driver and ensure it terminates an operation if the proper indexes do not exist
● Use separate application code and designated maintenance windows for index builds
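A minimal sketch of such a start-up check in the mongo shell, assuming the application requires an index on customer_name:
var indexes = db.customer_info.getIndexes();
var hasNameIndex = indexes.some(function (idx) { return idx.key.customer_name === 1; });
if (!hasNameIndex) {
    throw new Error("Required index on customer_name does not exist");   // terminate instead of running unindexed queries
}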
Index Creation on Replica Set
Background index build operations on the secondary members of a replica set begin only after the index build completes on the primary. To build large indexes on secondaries, perform the following steps:
1. Restart one secondary at a time in standalone mode.
2. Build the index and wait until the index build completes.
3. Restart the member as a part of the replica set.
4. Allow it to catch up with the other members of the set.
5. Build the index on the next secondary in the same way.
6. When all secondaries have the new index, step down the primary.
7. Restart the former primary as a standalone.
8. Build the index on the former primary.
dropIndex() Method
db.accounts.dropIndex( { "tax-id": 1 } )
db.collection.dropIndexes() Method
To remove all indexes barring the _id index from a collection, use the command:
db.collection.dropIndexes()
Index Modification
1. Drop the index: Execute the query given below to return a document showing the operation status.
2. Recreate the index: Execute the query given below to return a document showing the status of the results.
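The queries are not reproduced above; a sketch, reusing the accounts collection and tax-id index from the dropIndex() example:
db.accounts.dropIndex({ "tax-id": 1 })     // returns a status document such as { "nIndexesWas" : 2, "ok" : 1 }
db.accounts.createIndex({ "tax-id": 1 })   // recreates the index and returns a document describing the result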
Duration: 40 min.
Problem Statement:
Use the db.collection.reIndex() method to rebuild all indexes of a collection. This drops all indexes, including the _id index, and rebuilds all indexes in a single operation.
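A minimal sketch, assuming the customer_info collection:
db.customer_info.reIndex()   // drops and rebuilds all indexes on the collection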
All indexes of a collection and a database can be listed. To get a list of all indexes of a
collection, use the db.collection.getIndexes() or a similar method.
To list all indexes of all collections, use the operation given below in the mongo shell.
db.getCollectionNames().forEach(function(collection) {
    // fetch and print the index definitions of each collection in the current database
    var indexes = db[collection].getIndexes();
    print("Indexes for " + collection + ":");
    printjson(indexes);
});
Retrieval of Index
Duration: 40 min.
Problem Statement:
Aggregations process data sets and return calculated results. They run on the mongod instance to simplify application code and limit resource requirements. The aggregation framework:
● Uses a collection of documents as input and returns results in the form of one or more documents
● Is based on data processing pipelines: documents pass through multi-stage pipelines and are transformed into an aggregated result
● Provides, as its most basic pipeline stage, filters that function like queries
● Groups and sorts documents by a defined field or fields through pipeline operations
● Uses native operations within MongoDB to allow efficient data aggregation and is the preferred method for data aggregation
Aggregation
The aggregation operation given below returns all states with total population greater than 10 million.
The aggregation operation given below returns user names sorted by the month of their joining.
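The two operations are not reproduced above. The sketches below follow the familiar MongoDB examples and assume a zipcodes collection with state and pop fields and a users collection with a joined date field:
db.zipcodes.aggregate([
    { $group: { _id: "$state", totalPop: { $sum: "$pop" } } },
    { $match: { totalPop: { $gte: 10000000 } } }
])   // states with a total population greater than 10 million
db.users.aggregate([
    { $project: { name: 1, month_joined: { $month: "$joined" } } },
    { $sort: { month_joined: 1 } }
])   // user names sorted by the month of joining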
Aggregation operations manipulate data and return a computed result based on the input document
and a specific procedure. Aggregation provides the following semantics for data processing:
Count
This command, along with the two methods count() and cursor.count(), provides access to total counts in the mongo shell. The command given below counts all documents in the customer_info collection.
db.customer_info.count()
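The count() method and cursor.count() also accept query criteria; a sketch, assuming a city field:
db.customer_info.count({ city: "London" })           // count the documents matching a query
db.customer_info.find({ city: "London" }).count()    // equivalent count on a cursor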
Distinct
This operation searches for documents matching a query and returns all unique values for a field in
the matched document. The syntax given below is an example of a distinct operation.
db.customer_info.distinct( "customer_name" )
Aggregation Operations
Duration: 40 min.
Problem Statement:
Group
Group operations accept a set of documents as input. They match the given query, apply the operation, and then return an array of documents with the computed results.
The group operation does not support sharded collections. In addition, the results of the group operation must not exceed 16 megabytes.
The group operation shown below groups documents by the field 'a', where 'a' is less than 3, and sums the count field for each group.
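A sketch of such a group operation, assuming a records collection whose documents contain the fields a and count:
db.records.group({
    key: { a: 1 },                                                    // group by the field a
    cond: { a: { $lt: 3 } },                                          // only documents where a is less than 3
    reduce: function (curr, result) { result.count += curr.count; },  // sum the count field per group
    initial: { count: 0 }
})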
Duration: 30 min.
Problem Statement:
Replication
Replication performs the following functions:
● Stores multiple copies of data across different databases in multiple locations, and thus protects data when the database suffers any loss
● Helps manage data in the event of hardware failures and other kinds of service interruptions
● Stores copies of data in different data centers to increase the locality and availability of data for distributed applications
Master-Slave Replication
Master-slave replication is the oldest mode of replication that MongoDB supports. In earlier versions of MongoDB, master-slave replication was used for failover, backup, and read scaling. In newer versions, it has been replaced by replica sets for most use cases.
A replica set consists of a group of mongod instances that host the same data set. The replica set functions as follows:
● The primary mongod receives all write operations, and the secondary mongod instances replicate the operations from the primary.
● The primary node receives the write operations from clients.
● The primary logs any changes or updates to its data sets in its oplog.
● The secondaries replicate the oplog of the primary and apply all the operations to their data sets.
● When the primary becomes unavailable, the replica set nominates a secondary as the primary.
(Diagram: a client application's driver sends writes and reads to the primary, while the two secondaries replicate from the primary.)
Replica Set in MongoDB
An extra mongod instance can be added to a replica set to act as an arbiter. Following are some characteristics of an arbiter:
● An arbiter is a node that participates only in elections to select the primary node; it does not hold data.
Secondary members in a replica set asynchronously apply operations from the primary. Replica sets can continue to function without some secondary members. As a result, secondary members may not always return the most recent data to clients.
Replica Set Members
A replica set can also have an arbiter. Arbiters do not replicate or store data but play a crucial role in selecting a secondary to take the place of the primary when the primary becomes unavailable. A typical replica set contains a primary, a secondary, and an arbiter.
Priority 0 Replica Set Members
A priority 0 member is a secondary member that cannot become the primary. The characteristics of a priority 0 member are:
● Cannot trigger an election
● Can maintain copies of the data set, accept and serve read operations, and vote to elect a primary
By configuring a priority 0 member, you can prevent a secondary from becoming the primary. In a typical three-member replica set, one data center hosts both the primary and a secondary, and a second data center hosts one priority 0 member.
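A sketch of configuring an existing member as priority 0 (the member index 2 is an assumption):
cfg = rs.conf()
cfg.members[2].priority = 0   // this member can no longer become primary
rs.reconfig(cfg)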
Hidden Replica Set Members
Hidden members of a replica set are invisible to client applications. The characteristics of hidden members are:
● They store a copy of the primary's data.
● They are priority 0 members; they can vote to elect a primary but cannot become the primary.
● They do not serve read operations from client applications.
● They can be used for dedicated functions such as reporting and backup.
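A sketch of configuring a hidden member (the member index is an assumption); a hidden member must also have priority 0:
cfg = rs.conf()
cfg.members[3].priority = 0
cfg.members[3].hidden = true
rs.reconfig(cfg)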
Start a Replica Set
Duration: 30 min.
Problem Statement:
Tag sets allow you to target read operations to selected replica set members. Customized read preferences and write concerns evaluate tag sets as follows:
● Read preferences consider the tag value when selecting a replica set member to read from.
● Write concerns ignore the tag value when selecting a member.
You can specify tag sets with the following read preference modes:
● primaryPreferred
● secondary
● secondaryPreferred
● nearest
Tags are not compatible with the primary mode but are compatible with the nearest mode. When a tag set is combined with the nearest mode, the read preference selects the matching member, primary or secondary, with the lowest network latency.
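A sketch of a tagged read in the mongo shell, assuming the dc tag values configured below; the trailing empty document allows any member as a fallback:
db.customer_info.find().readPref("nearest", [ { "dc": "NYK" }, { } ])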
Tag Set for Replica Set
Tag sets allow customizing write concerns and read preferences in a replica set. MongoDB stores tag
sets in the replica set configuration object.
conf = rs.conf()
conf.members[0].tags = { "dc": "NYK", "rackNYK": "A" }
conf.members[1].tags = { "dc": "NYK", "rackNYK": "A" }
conf.members[2].tags = { "dc": "NYK", "rackNYK": "B" }
conf.members[3].tags = { "dc": "LON", "rackLON": "A" }
conf.members[4].tags = { "dc": "LON", "rackLON": "B" }
conf.settings = { getLastErrorModes: { MultipleDC: { "dc": 2 }, multiRack: { "rackNYK": 2 } } }
rs.reconfig(conf)
Replica Set and Patterns
Replica Set Deployment Strategies
You can use the following deployment strategies for a replica set:
● Add Capacity Ahead of Demand: Add a new member to an existing replica set before new demands arise.
● Distribute Members Geographically: Keep at least one member in an alternate data center as a backup in case of any data loss incident. Set the priorities of these members to zero to prevent them from becoming primary.
● Keep Majority in One Location: When electing the primary, all members must be able to see each other to create a majority. To enable the members to elect the primary, ensure that most of the members are in one location.
● Use Replica Set Tag Sets: Tag sets ensure that all operations are replicated at specific data centers and help route read operations to specific machines.
● Use Journaling: Use journaling to safely write data to disk in case of shutdowns, power failures, and other unexpected failures.
Replica Set Deployment Patterns
The record of operations maintained by the master server is called the operation log or
oplog. Each oplog document denotes a single operation performed on the master server
and contains the following keys:
● ts: An internal value used to track operations. It contains a 4-byte timestamp and a 4-byte incrementing counter.
● op: The type of operation performed, as a 1-byte code; for example, i for an insert.
● ns: The collection name (namespace) where the operation is performed.
● o: The document specifying the operation to perform. For an insert, this is the document to insert.
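An illustrative oplog entry for an insert, with assumed collection and values, might look like this:
{ "ts": Timestamp(1625151234, 1), "op": "i", "ns": "shop.customers", "o": { "_id": 1, "name": "Alice" } }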
Replication State and Local Database
MongoDB maintains a local database called local to keep the information about the replication
state and the list of master and slaves. The content of this database remains local to the master
and slaves.
Slaves store the replication information in the local database. The unique slave identifier gets saved
in the me collection and the list of masters gets saved in sources collection.
To check the replication status, use the function given below when connected to the master:
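The function is not reproduced above; it is most likely the db.printReplicationInfo() helper:
db.printReplicationInfo()   // prints the configured oplog size and the time range of the operations it holds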
The output contains the oplog size and the date ranges of the operations in the oplog. In the given example, the oplog is 10 megabytes in size and can accommodate only about 30 seconds of operations. The log length is a useful metric only for servers that have been operational long enough for the oplog to roll over.
The function given below prints a list of sources for a slave, each displaying information such as how far behind the master it is:
db.printSlaveReplicationInfo()
Check a Replica Set Status
Duration: 30 min.
Problem Statement:
Sharding is the process of distributing data across multiple servers for storage. The
characteristics of sharding are as follows:
● Sharding adds more servers to a database and automatically balances data and load across
various servers.
● Sharding provides additional write capacity by distributing the write load over a number of
mongod instances.
● Sharding splits the data set and distributes it across multiple databases, or shards. Each shard serves as an independent database, and together, the shards make up a single logical database.
Sharded clusters require a proper infrastructure setup, which increases the overall
complexity of the deployment. Therefore, consider deploying sharded clusters only when
your system shows the following characteristics:
● The data set outgrows the storage capacity of a single MongoDB instance.
● The size of the active working set exceeds the capacity of the maximum available RAM.
A shard is a replica set or a single mongod instance that holds the data
subset used in a sharded cluster. Each shard is a replica set that
provides redundancy and high availability for the data it holds.
The characteristics of a shard are as follows:
Choose the appropriate shard key based on the following two factors:
Range-Based Shard Key
In range-based sharding, MongoDB divides data sets into different ranges based on the values of shard
keys. In range-based sharding, documents having close shard key values reside in the same chunk and
shard.
Range-based partitioning supports range queries because for a given range query of a shard key, the
query router can easily find which shards contain those chunks.
Data distribution in range-based partitioning can be uneven, which may negate some benefits of sharding. For example, if the shard key value increases linearly, such as a timestamp, then all requests for a given time range map to the same chunk and shard. In such cases, a small set of shards may receive most of the requests, and the system would fail to scale.
Hash-Based Sharding
For hash-based partitioning, MongoDB first calculates the hash of a field’s value, and then
creates chunks using those hashes. In hash-based partitioning, collections in a cluster are
randomly distributed.
In hash-based partitioning:
● Data is evenly distributed.
● Hashed key values randomly distribute data across chunks and shards.
● Range queries on the shard key are inefficient, because documents with adjacent shard key values are unlikely to reside in the same chunk.
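A sketch of hash-based sharding, assuming a records.users collection sharded on a hashed _id:
sh.shardCollection("records.users", { "_id": "hashed" })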
Impact of Shard Keys on Cluster Operation
Some shard keys can scale write operations: a computed shard key with some randomness allows a cluster to scale write operations. To improve write scaling, MongoDB supports sharding a collection on a hashed index.
Shard keys also affect how queries behave in a sharded cluster, in the following two ways:
● Querying
A mongos instance enables applications to interact with sharded clusters. When mongos
receives queries from client applications, it uses metadata from the config server and
routes queries to the mongod instances. mongos makes querying operational in sharded
environments.
● Query Isolation
Query execution is fast and efficient when mongos can route the query to a single shard, using the shard key and the metadata from the config server. If your query contains the first component (prefix) of a compound shard key, mongos can route the query to a single shard and thus provide good performance.
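A sketch of this routing behavior, assuming an orders collection sharded on the compound key { customer_id: 1, order_date: 1 }:
db.orders.find({ customer_id: 42 })   // contains the shard key prefix, so mongos can target a single shard
db.orders.find({ status: "open" })    // no shard key in the query, so mongos must broadcast to all shards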
Production Cluster Deployment
● Config Servers
Each of the three config servers must be hosted on a separate machine. A single sharded cluster must have exclusive use of its config servers. If you deploy multiple sharded clusters, each cluster must have its own group of config servers.
● Shards
A production cluster must have two or more replica sets or shards.
Deploy a Sharded Cluster
● Step 2: Start each config server by issuing a command using the syntax given below:
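The syntax is not reproduced above; a sketch for the MongoDB versions this course targets, where the data directory is an assumption and 27019 is the conventional config server port:
mongod --configsvr --dbpath /data/configdb --port 27019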
● Step 3: To start a mongos instance, issue a command using the syntax given below:
To start a mongos that connects to config server instances running on the hosts cfg0.example.net, cfg1.example.net, and cfg2.example.net on the default ports, issue the command given below:
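A sketch of that command, assuming the config servers listen on the default config server port 27019:
mongos --configdb cfg0.example.net:27019,cfg1.example.net:27019,cfg2.example.net:27019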
Once the mongos is running, connect to it from a mongo shell and add a shard to the cluster by using the sh.addShard() method:
sh.addShard("mongodb0.example.net:27017")
Create a Shard Cluster and Deploy the Sharded Cluster
Problem Statement:
You must enable sharding for a database before you can shard any of its collections.
To enable sharding, perform the following steps:
● Step 1: From a mongo shell, connect to the mongos instance and issue a command using the
syntax given below.
mongo --host <hostname of machine running mongos> --port <port mongos listens on>
● Step 2: Issue the sh.enableSharding() method and specify the name of the database for which
you want to enable sharding. Use the syntax given below:
sh.enableSharding("<database>")
Optionally, enable sharding for a database using the enableSharding command. For this, use the
syntax given below.
db.runCommand( { enableSharding: <database> } )
Enable Sharding for Collection
● Determine the shard key. The selected shard key impacts the efficiency of sharding.
● If the collection already contains data, create an index on the shard key using the createIndex() method. If the collection is empty, MongoDB creates the index as part of sh.shardCollection().
● To enable sharding for a collection, open the mongo shell and issue the sh.shardCollection() method.
Enable Sharding for Collection
To enable sharding for a collection, replace the string <database>.<collection> with the full namespace of your collection. This string consists of the name of your database, a dot, and the full name of the collection.
The example given below shows sharding a collection on its partition key.
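The example is not reproduced above; a sketch, assuming a records.people collection partitioned on the zipcode field:
sh.shardCollection("records.people", { "zipcode": 1 })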
Shard Balancing
● When the shard distribution in a cluster is uneven, the balancer migrates chunks from one
shard to another to achieve a balance in chunk numbers per shard.
● Chunk migration is a background operation that occurs between two shards, an origin and a
destination.
● The origin shard sends all of the chunk's current documents to the destination shard.
● During the migration, if an error occurs, the balancer aborts the process and leaves the
chunk unchanged in the origin shard.
● Adding a new shard to a cluster may create an imbalance because the new shard has no
chunks.
● Similarly, when a shard is being removed, the balancer migrates all the chunks from the
shard to other shards.
● After all the data is migrated and the metadata is updated, the shard can be safely removed.
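Sketches of the related commands, with assumed shard names and hosts:
sh.addShard("rs1/mongodb3.example.net:27017")   // adding a shard; the balancer migrates chunks to it
db.adminCommand({ removeShard: "shard0001" })   // starts draining the shard; rerun the command to check the remaining chunks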
Shard Balancing
Chunk migrations carry bandwidth and workload overheads, which may impact the database
performance.
Tag Aware Sharding
● In MongoDB, you can create tags for a range of shard keys to associate those ranges with a group of shards.
● The balancer that moves chunks from one shard to another obeys these tagged ranges.
● The balancer moves or keeps a specific subset of the data on a specific set of shards and
ensures that the most relevant data resides on the shard which is geographically closer to
the client/application server.
Add Shard Tags
When connected to a mongos instance, use the sh.addShardTag() method to associate tags with a particular shard. The example given below adds the tag NYC to two shards and adds the tags SFO and NRT to a third shard.
sh.addShardTag("shard0000", "NYC")
sh.addShardTag("shard0001", "NYC")
sh.addShardTag("shard0002", "SFO")
sh.addShardTag("shard0002", "NRT")
To assign a tag to a range of shard keys, connect to the mongos instance and use the sh.addTagRange() method. The following operations assign:
● the NYC tag to two ranges of zip codes in Manhattan and Brooklyn
● the SFO tag to one range of zip codes in San Francisco
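The operations are not reproduced above; sketches that follow the familiar MongoDB zip code example, assuming the records.users namespace used later in this lesson:
sh.addTagRange("records.users", { zipcode: "10001" }, { zipcode: "10281" }, "NYC")   // Manhattan
sh.addTagRange("records.users", { zipcode: "11201" }, { zipcode: "11240" }, "NYC")   // Brooklyn
sh.addTagRange("records.users", { zipcode: "94102" }, { zipcode: "94135" }, "SFO")   // San Francisco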
Remove Shard Tags
Shard tags exist in the shard's document in the shards collection of the config database. To return all shards with a specific tag, use the operations given below:
use config
db.shards.find({ tags: "NYC" })
To return all shard key ranges tagged with NYC, use the sequence of operations given below:
use config
db.tags.find({ tags: "NYC" })
The example given below removes the NYC tag assignment for the range of zip codes within
Manhattan:
use config
db.tags.remove({ _id: { ns: "records.users", min: { zipcode: "10001" }}, tag: "NYC" })
Key Takeaways
Knowledge Check 1
Which among the following is the correct syntax to create a unique index on the field name?
a. db.collection.createUniqueIndex({name:1})
b. db.collection.createUniqueIndex({name:1})
c. db.collection.createIndex({unique:true},{name:1})
d. db.collection.createIndex({name:1},{unique:true})
The correct answer is d. The db.collection.createIndex() method takes the index key pattern as its first argument and an options document, such as { unique: true }, as its second argument.
Knowledge Check 3
Which of the following techniques is used for scaling write operations in MongoDB?
a. Replication
b. Sharding
c. Indexing
d. Splitting
The correct answer is b. Sharding distributes the write load across multiple mongod instances, providing additional write capacity.
Knowledge Check 5
Which of the following methods is used for listing all indexes of a collection?
a. getIndex()
b. listIndex()
c. getIndexes()
d. listIndexes()
The correct answer is c. The db.collection.getIndexes() method returns an array of documents describing the indexes on a collection.
PQR Corp is a leading corporate training provider. PQR Corp has decided to share analysis reports with its clients. These reports will help the clients identify the employees who have completed the training and evaluation exams, their strengths, and the areas where they need improvement. This is going to be a unique selling feature for PQR Corp. The company has a huge amount of data to deal with. It has hired you as an expert and wants your help to solve this problem.