NGT: Unit 2
Documents within the same collection can have different (or the same) sets of fields. This affords you more flexibility when dealing with data.
A MongoDB deployment can have many databases. Each database is a set of
collections. Collections are similar to the concept of tables in SQL; however, they are
schemaless. Each collection can have multiple documents. Think of a document as a
row in SQL.
In an RDBMS system, since the table structures and the data types for each column
are fixed, you can only add data of a particular data type in a column. In MongoDB, a
collection is a collection of documents where data is stored as key-value pairs.
Let’s understand with an example how data is stored in a document. The following
document holds the name and phone numbers of the users:
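A minimal sketch of such a document (the field names and values here are illustrative, not from the source):
{
"Name" : "John Doe",
"Phone" : [ "1234567890", "0987654321" ]
}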
Dynamic schema means that documents within the same collection can have the same
or different sets of fields or structure, and even common fields can store different types
of values across documents. There’s no rigidness in the way data is stored in the
documents of a collection.
Let’s see an example of a Region collection:
{ "R_ID" : "REG001", "Name" : "United States" }
{ "R_ID" :1234, "Name" : "New York" , "Country" : "United States" }
Q.2: Write a note on JSON and BSON
Ans:
JSON
MongoDB is a document-based database. It uses Binary JSON (BSON) for storing its data.
JSON stands for JavaScript Object Notation. It is a standard used for data interchange in today's modern Web (along with XML). The format is human and machine readable. It is not only a great way to exchange data but also a nice way to store data.
All the basic data types (such as strings, numbers, Boolean values, and arrays) are supported by JSON.
At a high level, JSON has two constructs: an object and an array. An object is a collection of name/value pairs, and an array is an ordered list of values. By combining the two, you can build a complete JSON structure.
Documents can be nested (embedded) within a document up to a maximum of 100 levels. This is a very important factor while working with MongoDB.
The following code shows what a JSON document looks like:
{
"_id" : 1,
"name" : { "first" : "John", "last" : "Doe" },
"publications" : [
{
"title" : "First Book",
"year" : 1989,
"publisher" : "publisher1"
},
{
"title" : "Second Book",
"year" : 1999,
"publisher" : "publisher2"
}
]
}
JSON lets you keep all the related pieces of information together in one place, which provides excellent performance. It also enables the updating of a document to be independent. It is schemaless.

Binary JSON (BSON)
MongoDB stores the JSON document in a binary-encoded format, termed BSON. The BSON data model is an extended form of the JSON data model.
MongoDB's implementation of a BSON document is fast, highly traversable, and lightweight. It supports embedding of arrays and objects within other arrays, and also enables MongoDB to reach inside the objects to build indexes and match objects against queried expressions, on both top-level and nested BSON keys. This means that MongoDB gives users the ease of use and flexibility of JSON documents together with the speed and richness of a lightweight binary format.

Q.3: Write a note on the Identifier (_id) and Capped Collections
Ans:
The Identifier (_id)
MongoDB stores data in documents. Documents are made up of key-value pairs. Although a document can be compared to a row in an RDBMS, unlike a row, documents have a flexible schema.
A key, which is nothing but a label, can be roughly compared to a column name in an RDBMS. A key is used for querying data from the documents. Hence, like an RDBMS primary key (used to uniquely identify each row), you need a key that uniquely identifies each document within a collection. This is referred to as _id in MongoDB.
If you have not explicitly specified a value for this key, a unique value is automatically generated and assigned to it by MongoDB. This key value is immutable and can be of any data type except arrays.

Capped Collections
If you want to log activities, cache data, or store high-volume data within an application, and you want to keep the data in the same order it is inserted, MongoDB offers capped collections for doing so. Capped collections are fixed-size circular collections that store data in insertion order, supporting high performance for create, read, and delete operations.
A capped collection is fixed size, high performance, and "auto FIFO age-out": when the allotted space is fully utilized, newly added objects replace the older ones in the same order they were inserted.
New objects can be inserted into a capped collection.
Existing objects can be updated.
But you can't remove an individual object from a capped collection.
To create a capped collection, we use the following command:
>db.createCollection("CappedLogCollection",{capped:true,size:10000,max:1000})
where size is the maximum size of the capped collection in bytes and max specifies the maximum number of documents in the capped collection.
To check whether a collection is capped or not:
>db.cappedLogCollection.isCapped()
To cap an existing collection:
>db.runCommand({"convertToCapped":"posts",size:10000})
where posts is the name of the collection to be capped.
MongoDB uses capped collections for maintaining its replication logs. A capped collection preserves data in insertion order, giving high performance without the need for indexes.

Q.4: Explain Object-Oriented Programming
Ans:
Object-oriented programming enables you to have classes share data and behaviors
using inheritance. It also lets you define functions in the parent class that can be
overridden in the child class and thus will function differently in a different context.
In other words, you can use the same function name to manipulate the child as well as
the parent class, although under the hood the implementations might be different. This
feature is referred to as polymorphism.
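A minimal sketch of polymorphism in plain JavaScript (runnable with Node.js; the Shape and Circle names are made up for this example, not from the source):
// Parent class defines a default behavior
class Shape {
  area() {
    return 0;
  }
}
// Child class overrides the parent's method with its own implementation
class Circle extends Shape {
  constructor(radius) {
    super();
    this.radius = radius;
  }
  area() {
    return Math.PI * this.radius * this.radius;
  }
}
// The same method name behaves differently depending on the object's class
const shapes = [new Shape(), new Circle(2)];
shapes.forEach(function (s) {
  console.log(s.area()); // prints 0, then 12.566...
});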
Relational databases, with their focus on tables with a fixed schema, do not allow us to define a related set of schemas for a table so that we could store any object in our hierarchy in the same table.
The flexibility that MongoDB offers by not enforcing a particular schema for all documents in a collection provides several benefits to the application programmer over an RDBMS solution:
o Better mapping of object-oriented inheritance and polymorphism
o Simpler migrations between schemas with less application downtime
o Better support for semi-structured domain data
For example:
{
_id: 1,
Password: "7f1afdbe",
Firstname: "Derick",
Lastname: "Rethans",
Contacts: [
{
method: "phone",
value: "+447551569555"
}
]
},
{
_id: 2,
Password: "ae9c300e",
Firstname: "Rasmus",
Lastname: "Lerdorf"
}
In the above example we can see that the fields of the two documents are not common and the structure is also different. We can also have fields with the same name but different data types. This flexible schema not only enables you to store related data with different structures together in the same collection, but it also simplifies querying.

Q.5: Explain Schema Evolution
Ans:
When you are working with databases, one of the most important considerations that you need to account for is schema evolution, i.e. the impact of a change in the schema on the running application. The design should be done in a way as to have minimal or no impact on the application, meaning no or minimal downtime and no or very minimal code changes.
Typically, schema evolution happens by executing a migration script that upgrades the database schema from the old version to the new one. If the database is not in production, the script can be a simple drop and recreation of the database.
Although MongoDB offers an update option that can be used to update all the documents' structure within a collection if there's a new addition of a field, imagine the impact of doing this if you have thousands of documents in the collection. It would be very slow and would have a negative impact on the underlying application's performance.
One of the ways of doing this is to include the new structure in the new documents being added to the collection and then gradually migrate the collection in the background while the application is still running. This is one of the many use cases where having a polymorphic schema is advantageous.
For example, say you are working with a Tickets collection where you have documents with ticket details, like so:
// "Ticket1" document (stored in the "Tickets" collection)
{
id: 1,
Priority: "High",
type: "Incident",
text: "Printer not working"
}...........
At some point, the application team decides to introduce a "short description" field in the ticket document structure, so the best alternative is to introduce this new field in the new ticket documents.
Within the application, you embed a piece of code that will handle retrieving both "old style" documents (without a short description field) and "new style" documents (with a short description field).
Gradually the old style documents can be migrated to the new style documents. Once the migration is completed, if required, the code can be updated to remove the piece of code that was embedded to handle the missing field.

Q.6: Explain the basic Query of MongoDB
Ans:
CRUD operations (Create, Read, Update, and Delete) are used to query the database. The mongo shell ships with the standard distribution of MongoDB and provides a full database interface, enabling you to work on the data stored in MongoDB. Once the database services have started, you can fire up the mongo shell and start querying the database.
MongoDB by default listens for any incoming connections on port 27017 of the localhost interface. Now that the database server is started, you can start issuing commands to the server using the mongo shell or any new command prompt.
Let's understand how to use the import/export tools to move data in and out of the MongoDB database.
First, create a file to hold the records of students with the following structure: Name, Gender, Class, Score, Age. Save the file in C:\ as "student.json".
Next, import the data into the MongoDB database as a new collection in order to look at how the import tool works.
Open the command prompt (by running it as administrator) and import the .json file using the following command:
C:\>mongoimport --db <database_name> --collection <collection_name> < <file_path>
For example, issue the following command to import the data from the file student.json into a new collection called student in the database named details:
C:\>mongoimport --db details --collection student < student.json
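The export direction works similarly. As a sketch, the counterpart mongoexport command writes a collection out to a file (the output file name here is illustrative):
C:\>mongoexport --db details --collection student --out student_export.json
This writes the documents of the student collection to student_export.json in JSON format.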
In order to validate whether the collection is created and the data is imported, you connect to the database using the mongo shell. To start the mongo shell, run the command prompt as administrator, issue the command
>mongo
and press Enter.
Now, on the MongoDB shell, check whether the database, collections, and documents exist with the following commands:
show dbs : print a list of all databases on the server
use <db> : switch the current database to <db>; the variable db is set to the current database
show collections : print a list of all collections for the current database
db.<collection_name>.find() : display all the documents present in the collection

Example:
>use details
Switched to db details
>show collections
student
>db.student.find()
{ "_id" : ObjectId("5450af58c770b7161eefd31d"), "Name" : "S1", "Gender" : "M", "Class" : "C1", "Age" : 19 }
.......
{ "_id" : ObjectId("5450af59c770b7161eefd31e"), "Name" : "S2", "Gender" : "M", "Class" : "C2", "Age" : 18 }
At any point, help can be accessed using the help() command.
> help
db.help() : help on db methods
db.mycoll.help() : help on collection methods
sh.help() : sharding helpers
rs.help() : replica set helpers
help admin : administrative help
help connect : connecting to a db help
help keys : key shortcuts
help misc : misc things to know
help mr : mapreduce
show dbs : show database names
show collections : show collections in current database
show users : show users in current database
.............
exit : quit the mongo shell

As shown above, if you need help on any of the methods of db or a collection, you can use db.help() or db.<CollectionName>.help().
For example, if you need help on the db command, execute db.help().
>db.help()
DB methods:
db.addUser(userDocument)
...
db.shutdownServer()
db.stats()
db.version() current version of the server

Q.7: Explain Create and Insert command
Ans:
Create and Insert
The use command
In MongoDB, use DATABASE_NAME is used to create a database. The command will create a new database if it doesn't exist; otherwise it will return the existing database.
>use <database_name>
Example:
>use details
Switched to db details
If you want to check your database list, use the command show dbs.
MongoDB doesn't create a database until data is inserted into it. In order to view the database, we need to insert at least one document.

Create collection implicitly
To create a collection implicitly, we use the following command:
>db.<collection_name>.insert({})
Example:
>db.student.insert({"name":"ankush","age":19})
To view the created collection, we use the following command:
>show collections

Querying a document from a collection (Read)
>db.<collection_name>.find() OR >db.<collection_name>.find().pretty()
Example:
>db.student.find()
The find() command helps to query data from a MongoDB collection, and to display the result in a formatted way, you can use the pretty() method.

Create collection explicitly
To create a collection explicitly, we use the following command:
>db.createCollection("<collection_name>",options)
In the command, the name is the name of the collection to be created. options is a document used to specify the configuration of the collection.
The options can be defined as follows:
o capped (Boolean): If true, enables a capped collection. A capped collection is a fixed-size collection that automatically overwrites its oldest entries when it reaches its maximum size. If you specify true, you need to specify the size parameter also.
o autoIndexId (Boolean): (Optional) If true, automatically creates an index on the _id field. Default value is false.
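Putting the options together, a small sketch (the collection name and size values here are illustrative):
>db.createCollection("log",{capped:true,size:5242880,max:5000})
This creates a capped collection named log that stores at most 5 MB of data or 5,000 documents, whichever limit is reached first; >db.log.isCapped() should then return true.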
If there are multiple matching records and you want to delete only the first record, then set the justOne parameter in the remove() method:
>db.emp.remove({"age":20},1)
The following command will delete all documents:
>db.users.remove({})
Finally, if you want to drop the collection, the following command will drop the collection:
>db.users.drop()

Q.9: Write a note on Query Document
Ans:
A rich query system is provided by MongoDB. Query documents can be passed as a parameter to the find() method to filter documents within a collection.
A query document is specified within opening "{" and closing "}" curly braces. A query document is matched against all of the documents in the collection before returning the result set. Using the find() command without any query document, or with an empty query document such as find({}), returns all the documents within the collection.
A query document can contain selectors and projectors.
o A selector is like a where condition in SQL, or a filter that is used to filter out the results.
o A projector is like the select condition, or the selection list that is used to display the data fields.

Selector
In MongoDB, when you execute the find() method without a projector, it displays all fields of a document. The following command will return all the female users:
>db.users.find({"Gender":"F"})
MongoDB also supports operators that merge different conditions together in order to refine your search on the basis of your requirements.
Let's refine the above query to now look for female users from India. The following command will return the same:
>db.users.find({"Gender":"F", $or: [{"Country":"India"}]})
Next, if you want to find all female users who belong to either India or US, execute the following command:
>db.users.find({"Gender":"F",$or:[{"Country":"India"},{"Country":"US"}]})

Projector
In the above examples, the find() command returns all fields of the documents matching the selector.
Let's add a projector to the query document where, in addition to the selector, you also mention the specific details or fields that need to be displayed.
Suppose you want to display the first name and age of all female employees. In this case, along with the selector, a projector is also used.
Execute the following command to return the desired result set:
>db.users.find({"Gender":"F"}, {"Name":1,"Age":1})

In a huge collection, if you want to return only a few matching documents, the limit() command is used. To display the first 3 documents of users who stay in Mumbai:
>db.emp.find({"addrs":"mumbai"}).limit(3)
To display the next 2 documents, after skipping the first two records, of users whose age is 37:
>db.emp.find({"age":37}).skip(2).limit(2)

Q.10: Explain Conditional Operators
Ans:
Conditional operators enable you to have more control over the data you are trying to extract from the database. An operator compares two expressions and fetches the matching documents from the MongoDB collection.
The operators are as follows:
o $eq : Matches values that are equal to a specified value. Usage: {Key:{$eq:value}}
o $gt : Matches values that are greater than a specified value. Usage: {Key:{$gt:value}}
o $gte : Matches values that are greater than or equal to a specified value. Usage: {Key:{$gte:value}}
o $lt : Matches values that are less than a specified value. Usage: {Key:{$lt:value}}
o $lte : Matches values that are less than or equal to a specified value. Usage: {Key:{$lte:value}}
o $in : Matches any of the values specified in an array. Usage: {Key:{$in:value}}
o $ne : Matches all values that are not equal to a specified value. Usage: {Key:{$ne:value}}

To find students whose Age > 25:
>db.students.find({"Age":{"$gt":25}})
If you change the above example to return students with Age >= 25, then the command is
>db.students.find({"Age":{"$gte":25}})
Let's find all students who belong to either class C1 or C2. The command for the same is
>db.students.find({"Class":{"$in":["C1","C2"]}})
Let's next find students who don't belong to class C1 or C2. The command is
>db.students.find({"Class":{"$nin":["C1","C2"]}})
If you want to find all students whose age is less than or equal to 25 (Age <= 25), execute the following:
>db.students.find({"Age":{"$lte":25}})
{ "_id" : "Biology", "AvgScore" : 90 }
Q.11: Explain MapReduce { "_id" : "C3", "AvgScore" : 90 }
{ "_id" : "Chemistry", "AvgScore" : 90 }
Ans: { "_id" : "C2", "AvgScore" : 93 }
Map-reduce is the data processing paradigm for condensing large volumes of data into { "_id" : "C1", "AvgScore" : 85 }
useful aggregate results. >
For map-reduce operation mongodb provides the mapreduce database command
In this map-reduce operation ,mongoDB applies the map phase each input.The map Q13: Explain Regular Expressions
function emits key-value pairs.
For those keys that have multiple values, MongoDB applies reduce phase ,which collect Ans:
and condenses the aggregate data.
Map reduce function can be used on both structured data and unstructured data Regular expressions are useful in scenarios where you want to find a stringas some
o Map is a javascript function that maps a value with akey and emits a key-value particular pattern. As in SQL we had LIKE clause, InmongoDB it is regular expressions
pair.It divides the big problem into multiple small problems,which can be further for the same.
subdivided into sub-problems
o Reduce is a javascript function that reduces or groups all the documents having In order to understand this, let's take the example of students with different names.
the same key and produces the final output which was the answer to big
problem that you were trying to solve >db.students.insert({Name:"Student1", Age:30, Gender:"M", Class: "Biology", Score:90})
In order to understand how it works,let’s consider the following example where you >db.students.insert({Name:"Student2", Age:30, Gender:"M", Class: "Chemistry", Score:90})
find out the number of male,female and others in emp collection >db.students.insert({Name:"Test1", Age:30, Gender:"M", Class: "Chemistry", Score:90})
The first step is to create map and reduce functions and then you call the mapReduce >db.students.insert({Name:"Test2", Age:30, Gender:"M", Class: "Chemistry", Score:90})
function and pass the necessary arguments. >db.students.insert({Name:"Test3", Age:30, Gender:"M", Class: "Chemistry", Score:90})
>var map=function(){emit(this.Gender,1)}
>var reduce=function(key,value){return Array.sum(value);} Say you want to find all students with names starting with “St” or “Te” and whose class
This will group document emitted by the map function on the key field begins with “Che”.
Put them together using the mapReduce function
>db.emp.mapReduce(map,reduce,{out:”mapreducecount”}) The same can be filtered using regular expressions, like so:
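Since out:"mapreducecount" writes the results into the mapreducecount collection, they can be read back with an ordinary query. As a sketch (each result document carries the grouping key as _id and the computed total as value):
>db.mapreducecount.find()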
Q.12: Explain aggregate functions
Ans:
Aggregate operations process data records and return computed results. Aggregation operations group values from multiple documents together and can perform a variety of operations on the grouped data to return a single result.
An aggregate function groups the records in a collection and can be used to provide the total number (sum), average, minimum, maximum, etc. of the group selected.
The aggregation framework enables you to find an aggregate value without using the MapReduce function. Performance-wise, the aggregation framework is faster than the MapReduce function.
To perform aggregation, the aggregate() function is to be used. The following is the syntax for aggregation:
>db.students.aggregate({$group:{_id:"$Gender", totalStudent: {$sum: 1}}})
{ "_id" : "F", "totalStudent" : 6 }
Similarly, in order to find out the class-wise average score, the following command can be executed:
>db.students.aggregate({$group:{_id:"$Class", AvgScore: {$avg: "$Score"}}})
{ "_id" : "Biology", "AvgScore" : 90 }
{ "_id" : "C3", "AvgScore" : 90 }
{ "_id" : "Chemistry", "AvgScore" : 90 }
{ "_id" : "C2", "AvgScore" : 93 }
{ "_id" : "C1", "AvgScore" : 85 }
>

Q.13: Explain Regular Expressions
Ans:
Regular expressions are useful in scenarios where you want to find strings matching some particular pattern. Just as SQL has the LIKE clause, MongoDB has regular expressions for the same purpose.
In order to understand this, let's take the example of students with different names.
>db.students.insert({Name:"Student1", Age:30, Gender:"M", Class: "Biology", Score:90})
>db.students.insert({Name:"Student2", Age:30, Gender:"M", Class: "Chemistry", Score:90})
>db.students.insert({Name:"Test1", Age:30, Gender:"M", Class: "Chemistry", Score:90})
>db.students.insert({Name:"Test2", Age:30, Gender:"M", Class: "Chemistry", Score:90})
>db.students.insert({Name:"Test3", Age:30, Gender:"M", Class: "Chemistry", Score:90})
Say you want to find all students with names starting with "St" or "Te" and whose class begins with "Che".
The same can be filtered using regular expressions, like so:
>db.students.find({"Name":/^(St|Te)/i, "Class":/^Che/i})
{ "_id" : ObjectId("52f89ecae451bb7a56e59086"), "Name" : "Student2", "Age" : 30, "Gender" : "M", "Class" : "Chemistry", "Score" : 90 }
.........................
{ "_id" : ObjectId("52f89f06e451bb7a56e59089"), "Name" : "Test3", "Age" : 30, "Gender" : "M", "Class" : "Chemistry", "Score" : 90 }
>
In order to understand how the regular expression works, let's take the query "Name":/^(St|Te)/i.
The i at the end indicates that the regex is case insensitive.
^(St|Te) means the Name string must start with either "St" or "Te"; since the expression is not anchored at the end, any characters may follow.
When you put everything together, you are doing a case-insensitive match of names that have either "St" or "Te" at the beginning. The same kind of pattern is used in the regex for the Class.
Q.14: Explain use of cursors
Ans:
When the find() method is used, MongoDB returns the results of the query as a cursor object. In order to display the result, the mongo shell iterates over the returned cursor.
Say you want to return all the users in the US. In order to do so, you create a variable, assign the output of find() to the variable (which is a cursor), and then, using a while loop, you iterate and print the output.
The code snippet is as follows:
> var c = db.users.find({"Country":"US"})
> while(c.hasNext()) printjson(c.next())
{
"_id" : ObjectId("52f4a823958073ea07e15070"),
"FName" : "Test",
"LName" : "User",
"Age" : 30,
"Gender" : "M",
"Country" : "US"
}
{
"_id" : ObjectId("52f4a826958073ea07e15071"),
"Name" : "Test User",
"Age" : 45,
"Gender" : "F",
"Country" : "US"
}
>
The next() function returns the next document. The hasNext() function returns true if a document exists, and printjson() renders the output in JSON format.
The variable to which the cursor object is assigned can also be manipulated as an array. If, instead of looping through the variable, you want to display the document at array index 1, you can run the following command:
> var c = db.users.find({"Country":"US"})
> printjson(c[1])
{
"_id" : ObjectId("52f4a826958073ea07e15071"),
"Name" : "Test User",
....
"Gender" : "F",
"Country" : "US"
}

Q.15: Explain explain() function
Ans:
The explain() function can be used to see what steps the MongoDB database is running while executing a query. Starting from version 3.0, the output format of the function and the parameter that is passed to the function have changed.
It takes an optional parameter called verbose, which determines what the explain output should look like. The following are the verbosity modes: allPlansExecution, executionStats, and queryPlanner. The default verbosity mode is queryPlanner, which means if nothing is specified, it defaults to queryPlanner.
The following code covers the steps executed when filtering on the username field:
>db.users.find({"Name":"Test User"}).explain("allPlansExecution")
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "mydbproc.users",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [ ]
},
"winningPlan" : {
"stage" : "COLLSCAN",
"filter" : {
"$and" : [ ]
},
"direction" : "forward"
},
"rejectedPlans" : [ ]
},
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 20,
"executionTimeMillis" : 0,
"totalKeysExamined" : 0,
"totalDocsExamined" : 20,
"executionStages" : {
"stage" : "COLLSCAN",
"filter" : {
"$and" : [ ]
},
"nReturned" : 20,
"executionTimeMillisEstimate" : 0,
"works" : 22,
"advanced" : 20,
"needTime" : 1,
"needFetch" : 0,
"saveState" : 0,
"restoreState" : 0,
"isEOF" : 1,
"invalidates" : 0,
"direction" : "forward",
"docsExamined" : 20
},
"allPlansExecution" : [ ]
},
"serverInfo" : {
"host" : "ANOC9",
"port" : 27017,
"version" : "3.0.4",
"gitVersion" : "534b5a3f9d10f00cd27737fbcd951032248b5952"
},
"ok" : 1
}
As you can see, the explain() output returns information regarding queryPlanner, executionStats, and serverInfo. As highlighted above, the information the output returns depends on the verbosity mode selected.

Q16: Explain Concept of Indexes
Ans:
Indexes are used to provide high-performance read operations for queries that are used frequently.
By default, whenever a collection is created and documents are added to it, an index is created on the _id field of the document.

Single Key Index
Let's create an index on the Name field of the document. Use ensureIndex() to create the index.
>db.testindx.ensureIndex({"Name":1})
The index creation will take a few minutes depending on the server and the collection size.
Let's run the same query that you ran earlier with explain() to check what steps the database is executing post index creation. Check the nReturned, totalKeysExamined, and executionTimeMillis fields in the output.
>db.testindx.find({"Name":"user101"}).explain("allPlansExecution")
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "mydbproc.testindx",
"indexFilterSet" : false,
"parsedQuery" : {
"Name" : {
"$eq" : "user101"
}
},
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"Name" : 1
},
"indexName" : "Name_1",
"isMultiKey" : false,
"direction" : "forward",
"indexBounds" : {
"Name" : [
"[\"user101\", \"user101\"]"
]
}
}
},
"rejectedPlans" : [ ]
},
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 1,
"executionTimeMillis" : 0,
"totalKeysExamined" : 1,
"totalDocsExamined" : 1,
"executionStages" : {
"stage" : "FETCH",
"nReturned" : 1,
"executionTimeMillisEstimate" : 0,
"works" : 2,
"advanced" : 1,
"needTime" : 0,
"needFetch" : 0,
"saveState" : 0,
"restoreState" : 0,
"isEOF" : 1,
"invalidates" : 0,
"docsExamined" : 1,
"alreadyHasObj" : 0,
"inputStage" : {
"stage" : "IXSCAN",
"nReturned" : 1,
"executionTimeMillisEstimate" : 0,
"works" : 2,
"advanced" : 1,
"needTime" : 0,
"needFetch" : 0,
"saveState" : 0,
"restoreState" : 0,
"isEOF" : 1,
"invalidates" : 0,
"keyPattern" : {
"Name" : 1
},
"indexName" : "Name_1",
"isMultiKey" : false,
"$gt" : 25 >db.testindx.insert({"Name":"usercit"})
} >db.testindx.insert({"Name":"usercit", "Age":30})
},
"nReturned" : 1, However, if you execute
"executionTimeMillisEstimate" : 0,
"works" : 2, >db.testindx.insert({"Name":"usercit", "Age":30})
"advanced" : 1,
"allPlansExecution" : [ it’ll throw an error like E11000 duplicate key error index:
{ mydbpoc.testindx.$Name_1_Age_1
"nReturned" : 1, dup key: { : "usercit", : 30.0 }
"executionTimeMillisEstimate" : 0, You may create the collection and insert the documents first and then create an index
CHAPTER 6 ■ USING MONGODB SHELL on the collection.
75 If you create a unique index on the collection that might have duplicate values in the
"totalKeysExamined" : 1, fields on
"totalDocsExamined" : 1, which the index is being created, the index creation will fail.
"executionStages" : { To cater to this scenario, MongoDB provides a dropDupsoption. The dropDupsoption
............................................................. saves the first document found and remove any subsequent documents with duplicate
"serverInfo" : { values.
"host" : " ANOC9",
"port" : 27017, The following command will create a unique index on the name field and will delete any
"version" : "3.0.4", duplicate
"gitVersion" : "534b5a3f9d10f00cd27737fbcd951032248b5952" documents:
},
"ok" : 1 >db.testindx.ensureIndex({"Name":1},{"unique":true, "dropDups":true})
} >
>
system.indexes
Unique Index
Whenever you create a database, by default a system.indexescollection is created. All
Creating index on a field doesn’t ensure uniqueness, so if an index is created on the of the information about a database’s indexes is stored in the system.indexescollection.
Name field, then two or more documents can have the same names. However, if This is a reserved collection, so you cannot modify its documents or remove documents
uniqueness is one of the constraints that needs to be enabled, the unique property from it. You can manipulate it only through ensureIndexand the dropIndexesdatabase
needs to be set to true when creating the index. commands.
Whenever an index is created, its meta information can be seen in system.indexes.
First, let’s drop the existing indexes.
The followingcommand can be used to fetch all the index information about the mentioned
>db.testindx.dropIndexes() collection:
The following command will create a unique index on the Name field of the testindxcollection: db.collectionName.getIndexes()
>db.testindx.ensureIndex({"Name":1},{"unique":true}) For example, the following command will return all indexes created on the testindxcollection:
Now if you try to insert duplicate names in the collection as shown below, MongoDB returns >db.testindx.getIndexes()
an errorand does not allow insertion of duplicate records:
dropIndex
For example, if you have a unique index on {"name":1, "age":1} ,
>db.testindx.ensureIndex({"Name":1, "Age":1},{"unique":true}) The dropIndexcommand is used to remove the index.
>
The following command will remove the Name field index from the testindxcollection:
then the following inserts will be permissible:
>db.testindx.dropIndex({"Name":1})
{ "nIndexesWas" : 3, "ok" : 1 }
> Q18: MongoDB Document Data Model Approach
reIndex Ans:
When you have performed a number of insertions and deletions on the collection, you As you know, in MongoDB, data is stored in documents . This opens up some new
may have to rebuild the indexes so that the index can be used optimally. The possibilities in schema design. It also complicates our schema design process. In
reIndexcommand is used to rebuild the indexes. MongoDB, the schema design depends on the problem you are trying to solve.
The following command rebuilds all the indexes of a collection. It will first drop the
indexes, including the default index on the id field, and then it will rebuild the indexes. Embedding
db.collectionname.reIndex() you will see if embedding will have a positive impact on the performance. Embedding
can be useful when you want to fetch some set of data and display it on the screen,
The following command rebuilds the indexes of the testindxcollection: such as a page that displays comments associated with the blog; in this case the
comments can be embedded in the Blogs document.
>db.testindx.reIndex() The benefit of this approach is that since MongoDB stores the documents contiguously
{ on disk, all therelated data can be fetched in a single seek.
"nIndexesWas" : 2, Apart from this, since JOINs are not supported and you used referencing in this case,
"msg" : "indexes dropped for collection", the application might do something like the following to fetch the comments data
"nIndexes" : 2, associated with the blog.
.............. Fetch the associated comments _id from the blogs document.
"ok" : 1 Fetch the comments document based on the comments_idfound in the first step.
} If you take this approach, which is referencing, not only does the database have to do
> multiple seeks to find your data, but additional latency is introduced into the lookup
since it now takes two round trips to the database to retrieve your data.
Q17: Designing an Application’s Data Model If the application frequently accesses the comments data along with the blogs, then
almost certainly embedding the comments within the blog documents will have a
Ans: positive impact on the performance.
Another concern that weighs in favor of embedding is the desire for atomicity and
let's understand how to design the data model for an application in mongoDB he MongoDB isolation in writing data. MongoDB is designed without multi-documents transactions.
database provides two options for designing a data model: In MongoDB, the atomicity of the operation is provided only at a single document level
so data that needs to be updated together atomically needs to be placed together in a
the user can either embed related objects within one another single document.
When you update data in your database, you must ensure that your update either
or it can reference each other using ID succeeds or fails entirely, never having a “partial success,” and that no other database
reader ever sees an incomplete write operation.
The Problem with Normal Forms
Referencing
As mentioned, the nice thing about normalization is that it allows for easy updating
without any redundancy You understood embedding is the approach that will provide the best performance in
However, a problem arises when you try to get the data back out . For instance, to find many cases; it also provides data consistency guarantees. However, in some cases, a
all tags and comments associated with posts by a specific user, the relational database more normalized model works better in MongoDB.
programmer uses a JOIN. One reason for having multiple collections and adding references is the increased
By using a JOIN, the database returns all data as per the application screen design, but flexibility it gives when querying the data. Let’s understand this with the blogging
the real problem is what operation the database performs to get that result set. example mentioned above.
Generally, any RDBMS reads from a disk and does a seek, which takes well over 99% You saw how to use embedded schema, which will work very well when displaying all
of the timespent reading a row. When it comes to disk access, random seeks are the the data together on a single page (i.e. the page that displays the blog post followed
enemy. The reason why this is so important in this context is because JOINs typically by all of the associated comments).
require random seeks. The JOIN operation is one of the most expensive operations Now suppose you have a requirement to search for the comments posted by a particular
within a relational database. user. The query(using this embedded schema) would be as follows:
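The page containing the actual query is missing from these notes. As a sketch, assuming the blog posts live in a posts collection with an embedded comments array whose entries have an author field, such a query might look like this:
>db.posts.find({"comments.author":"user2"},{"comments":1})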
At least 8KB of data space is required by each index.
For write operations, an index addition has some negative performance impact. Hence, for collections with heavy writes, indexes might be expensive, because for each insert the keys must be added to all the indexes.
Indexes are beneficial for collections with heavy read operations, such as where the proportion of read-to-write operations is high. Un-indexed read operations are not affected by an index.

Sharding
One of the important factors when designing the application model is whether to partition the data or not. This is implemented using sharding in MongoDB.
Sharding is also referred to as the partitioning of data. In MongoDB, a collection is partitioned with its documents distributed across a cluster of machines, which are referred to as shards. This can have a significant impact on the performance. We will discuss sharding in more detail later.

A Large Number of Collections
The design considerations for having multiple collections vs. storing data in a single collection are the following:
There is no performance penalty in choosing multiple collections for storing data.
Having distinct collections for different types of data can yield performance improvements in high-throughput batch processing applications.
When you are designing models that have a large number of collections, you need to take into consideration the following behaviors:
A certain minimum overhead of a few kilobytes is associated with each collection.
At least 8KB of data space is required by each index, including the _id index.
You know by now that the metadata for each database is stored in the <database>.ns file. Each collection and index has its own entry in the namespace file, so you need to consider the limits on the size of namespace files when deciding to implement a large number of collections.

Growth of the Document
A few updates, such as pushing an element to an array, adding new fields, etc., can lead to an increase in the document size, which can lead to the movement of the document from one slot to another in order to fit the document. This process of document relocation is both resource and time consuming. Although MongoDB provides padding to minimize the relocation occurrences, you may need to handle the document growth manually.

Q20: Explain concept of Core Processes of MongoDB
Ans:
mongod
mongod is the core database server process. It handles data requests, manages data access, and performs background management operations.
When mongod is run without any arguments, it connects to the default data directory.
By default, MongoDB listens for connections from clients on port 27017, and stores data in the /data/db directory (on Windows, under the C:\ drive).
mongod also has an HTTP server which listens on a port 1000 higher than the default port. This basic HTTP server provides administrative information about the database.

mongo
mongo provides an interactive JavaScript interface for the developer to test queries and operations directly on the database, and for the system administrators to manage the database.
This is all done via the command line. When the mongo shell is started, it will connect to the default database called test. This database connection value is assigned to the global variable db.

mongos
mongos is used in "MongoDB Shard". It is a routing service for MongoDB shard configurations that processes queries from the application layer and determines the location of the data in the sharded cluster in order to complete these operations.

Q21: What are the various tools available in MongoDB?
Ans:
mongodump
This utility is used as part of an effective backup strategy.
It creates a binary export of the database contents.
mongodump can read data from either mongod or mongos instances.

mongorestore
The mongorestore program writes data from a binary database dump created by mongodump to a MongoDB instance.
mongorestore can create a new database or add data to an existing database.
mongorestore can write data to either mongod or mongos instances.
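As a usage sketch (the database name and backup folder here are illustrative):
C:\>mongodump --db details --out C:\backup
C:\>mongorestore --db details C:\backup\details
The first command dumps the details database into C:\backup, and the second restores it from that dump folder.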
The master node maintains a capped collection (the oplog) that stores an ordered history of logical writes to the database.
The slaves replicate the data using this oplog collection.
Since the oplog is a capped collection, if the slave's state is far behind the master's state, the slave may become out of sync. In that scenario, the replication will stop and manual intervention will be needed to re-establish the replication.
There are two main reasons behind a slave becoming out of sync:
o The slave shuts down or stops and restarts later. During this time, the oplog may have deleted the log of operations required to be applied on the slave.
o The slave is slow in executing the updates that are available from the master.
This possibility of falling out of sync is the main disadvantage of master/slave replication.

Replica set
Replica sets are basically a type of master-slave replication, but they provide automatic failover.
A replica set has one master, termed the primary, and multiple slaves, termed secondaries in the replica set context.
In a replica set, the primary mongod receives all write operations from clients, and the secondary mongods replicate the operations from the primary, and thus both have the same data set.
A replica set is a cluster of mongods that replicate among one another and ensure automatic failover.
The primary logs any changes or updates to its data sets in its oplog (read as "op log"). The secondaries also replicate the oplog of the primary and apply all the operations to their data sets.
When the primary becomes unavailable, the replica set nominates a secondary as the primary. The primary node is selected through an election mechanism: if the primary goes down, an elected node is chosen as the new primary.
The figure below shows how a two-member replica set failover happens: the primary node goes down and the secondary is promoted as primary; when the original primary comes back up, it acts as a slave and becomes a secondary node.
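A minimal sketch of creating and inspecting a three-member replica set from the mongo shell (the set name rs0 and the host names are illustrative; each mongod must have been started with --replSet rs0):
>rs.initiate({_id:"rs0", members:[
{_id:0, host:"node1:27017"},
{_id:1, host:"node2:27017"},
{_id:2, host:"node3:27017"}
]})
>rs.status()
rs.status() reports each member's state (PRIMARY, SECONDARY, and so on) and is the quickest way to watch a failover happen.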
In the replica set, one mongod will be the primary member and the others will be secondary members.
The primary member is elected by the members of the replica set. All writes are directed to the primary member, whereas the secondary members replicate from the primary asynchronously using the oplog.
The secondaries' data sets reflect the primary's data sets, enabling them to be promoted to primary in case of unavailability of the current primary.
There are two types of members: primary members and secondary members.

Primary member:
A replica set can have only one primary, which is elected by the voting nodes in the replica set. Any node with an associated priority of 1 can be elected as primary. The client directs all write operations to the primary member, which are then later replicated to the secondary members.

Secondary member:
Members in the SECONDARY state replicate the primary's data set and can be configured to accept read operations. Secondaries are eligible to vote in elections, and may be elected to the PRIMARY state if the primary becomes unavailable. In addition to this, a replica set can have other types of secondary members:
Priority 0 members
Hidden members
Delayed members
Arbiters
Non-voting members

Priority 0 Replica Set Members in MongoDB
A priority zero member in a replica set is a secondary member that cannot become the primary. These members can act as normal secondaries, but cannot trigger any election.
The main functions of a priority 0 member are as follows:
Maintains data set copies
Accepts and performs read operations
Votes in elections for the primary node

Hidden members
A hidden member maintains a copy of the primary's data set but is invisible to client applications.
Hidden members must always be priority 0 members and so cannot become primary.
In a replica set, these members can be dedicated to reporting needs or backups.
Hidden members can vote in elections.

Delayed members
Delayed members contain copies of a replica set's data set. However, a delayed member's data set reflects an earlier, or delayed, state of the set.
Delayed members must be priority 0 members; the priority is set to 0 to prevent a delayed member from becoming primary, as it does not hold up-to-date data.
Delayed members should be hidden members, so that applications are always prevented from seeing and querying them.
Delayed members do vote in elections for the primary if members[n].votes is set to 1.
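As a sketch, a secondary can be turned into a priority 0, hidden, or delayed member by editing the replica set configuration (member index 2 is illustrative; slaveDelay is the name this setting carries in the MongoDB versions these notes are based on):
>cfg = rs.conf()
>cfg.members[2].priority = 0
>cfg.members[2].hidden = true
>cfg.members[2].slaveDelay = 3600
>rs.reconfig(cfg)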
Arbiters
Arbiters are members that do not hold a copy of the primary's data and hence cannot become primary.
Replica sets may have arbiters to add a vote in elections for the primary.
Arbiters always have exactly 1 election vote, and thus allow replica sets to have an uneven number of voting members without the overhead of an additional member that replicates data.

Non-voting members
These members hold the primary's data copy, they can accept client read operations, and they can also become the primary, but they cannot vote in an election.
The voting ability of a member can be disabled by setting its votes to 0. By default, every member has one vote.

The replica set cannot process write operations until an election completes successfully. The replica set can continue to serve read queries if such queries are configured to run on secondaries.

Process of Election
In order to get elected, a server needs not just a majority, but a majority of the total votes.
If there are X servers, with each server having 1 vote, then a server can become primary only when it has at least [(X/2) + 1] votes.
If a server gets the required number of votes or more, then it will become primary.
A primary that went down still remains part of the set; when it is back up, it will act as a secondary server until the time it gets a majority of votes again.
If there are just two nodes, acting as master and slave, then the slave will never be promoted to master if the master server goes down.
In case of network partitioning, the master will lose the majority of votes, since it will have only its own one vote, and it will be demoted to slave.
A replica set uses an arbiter to help resolve such conflicts.
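As a sketch, an arbiter is added with the rs.addArb() helper (the host name is illustrative):
>rs.addArb("node4:30000")
The arbiter then participates in elections with exactly one vote but stores no data.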
The oplog is a capped collection: with every new addition of an operation, the oldest operations are automatically moved out. This is done to ensure that it does not grow beyond a pre-set bound, which is the oplog size.
By default, MongoDB uses 5% of the available free disk space for the oplog on Windows.
This is the time of the latest op that was applied on the secondary. The following shell helper can be used to find the latest op in the shell:
rs.debug.getLastOpWritten()
The output returns a field named ts, which depicts the last op time.
If a member starts up and finds the ts entry, it starts by choosing a target to sync from, and it will start syncing as in a normal operation. However, if no entry is found, the node will begin the initial sync process.
The replication status can be checked using:
db.adminCommand({replSetGetStatus:1})
The output field syncingTo is present only on secondary nodes and provides information on the node from which the secondary is syncing.

Failover
All members of a replica set are connected to each other, as shown below. They exchange a heartbeat message amongst each other. A node with a missing heartbeat is considered crashed.
All members of a replica set are connected to each other, and they exchange heartbeat messages amongst themselves. A member with a missing heartbeat is considered to have crashed.

Consistency

In MongoDB, reads can be routed to the secondaries, but writes are always routed to the primary.
If the read requests are routed to the primary node, they will always see the up-to-date changes, which means the read operations are always consistent with the last write operations.
However, if the application has changed the read preference to read from secondaries, there is a probability of the user not seeing the latest changes or seeing previous states. This is because the writes are replicated asynchronously on the secondaries.
This behavior is characterized as eventual consistency, which means that although the secondary's state is not consistent with the primary node's state, it will eventually become consistent over time.
There is no way that reads from the secondary can be guaranteed to be consistent, except by issuing write concerns to ensure that writes succeed on all members before the operation is actually marked successful.

Possible Replication Deployment

The architecture you choose to deploy a replica set affects its capability and capacity.

Odd number of members: This should be maintained in order to ensure that there is no tie when electing a primary. If you have an even number of voting members, deploy an arbiter so that the set has an odd number of voting members.

Replica set fault tolerance is the count of members that can go down while the replica set still has enough members to elect a primary in case of any failure.

The members should be distributed geographically in order to cater to a main data center failure. Members that are kept at a geographically different location than the main data center can have their priority set to 0, so that they cannot be elected as primary and act as standbys only.

When replica set members are distributed across data centers, network partitioning can prevent the data centers from communicating with each other. In order to ensure a majority in the case of network partitioning, keep a majority of the members in one location.

Scaling Reads

The primary purpose of the secondaries is to ensure data availability in case of downtime of the primary node.
They can also be used to perform backup operations or data processing jobs, or to scale out reads.
One of the ways to scale reads is to issue the read queries against the secondary nodes; by doing so, the workload on the primary is reduced.
One important point to consider when using secondaries for scaling read operations is that in MongoDB the replication is asynchronous, which means that if any write or update operation is performed on the primary's data, the secondary data will be momentarily out-of-date.
If the application in question is read-heavy, is accessed over a network, and does not need up-to-date data, the secondaries can be used to scale out reads and provide good read throughput.
Although by default the read requests are routed to the primary node, the requests can be distributed over the secondary nodes by specifying read preferences.
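As a minimal sketch of directing reads to secondaries from the shell (the collection name users and the query shown are assumed for illustration):
>db.getMongo().setReadPref("secondaryPreferred")
>db.users.find({status:"active"})
With this read preference, the find is served by a secondary when one is available and falls back to the primary otherwise.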
Applications that are geographically distributed: In such cases, you can have a replica set that is distributed across geographies. The read preferences should be set to read from the nearest secondary node. This helps in reducing the latency caused when reading over the network, and it improves the read performance.

If the application always requires up-to-date data, it uses the option primaryPreferred, which in normal circumstances will always read from the primary node. However, if the primary is unavailable, as is the case during failover situations, operations read from secondary members.

If you have an application that supports two types of operations, where the first operation is the main workload that involves reading and doing some processing on the data, and the second operation generates reports using the data, you can have the reporting reads directed to the secondaries.

The read preference modes are summarized below:

primaryPreferred: In most situations, operations read from the primary, but if it is unavailable, operations read from secondary members.
secondary: All operations read from the secondary members of the replica set.
nearest: Operations read from the member of the replica set with the least network latency, irrespective of the member's type.

Write Concerns

When the client application interacts with MongoDB, it is generally not aware whether the database is on a standalone deployment or is deployed as a replica set.
However, when dealing with replica sets, the client should be aware of write concern and read concern.
Since a replica set duplicates the data and stores it across multiple nodes, these two concerns give a client application the flexibility to enforce data consistency across nodes while performing read or write operations.
Using a write concern enables the application to get a success or failure response from MongoDB.
When used in a replica set deployment of MongoDB, the write concern sends a confirmation from the server to the application that the write has succeeded on the primary node. However, this can be configured so that the write concern returns success only when the write is replicated to all the nodes maintaining the data.
Note: If the number specified for w is greater than the number of nodes that actually hold the data, the command will keep on waiting until the members are available. In order to avoid this indefinite wait, wtimeout should be used along with w, which ensures that the command waits for the specified time period and times out if the write has not succeeded by then.
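Building on the insert command shown in the next section, a write concern with a timeout might look like the following (the 5000 ms value is an arbitrary illustration):
>db.testprod.insert({i:"test", q: 50, t: "B"}, {writeConcern: {w:2, wtimeout:5000}})
If two members have not acknowledged the write within five seconds, the operation returns an error instead of blocking indefinitely.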
How Writes Happen with Write Concern

In order to ensure that the written data is present on, say, at least two members, issue the following command:
>db.testprod.insert({i:"test", q: 50, t: "B"}, {writeConcern: {w:2}})

Steps

1. The write operation is directed to the primary.
2. The operation is written to the oplog of the primary with ts depicting the time of the operation.
3. A w: 2 is issued, so the write operation needs to be written to one more server before it is marked successful.
4. The secondary queries the primary's oplog for the op, and it applies the op.
5. Next, the secondary sends a request to the primary requesting ops with ts greater than t.
6. At this point, the primary knows that the operation until t has been applied by the secondary, as it is requesting ops with {ts: {$gt: t}}.
7. The writeConcern finds that a write has occurred on both the primary and the secondary, satisfying the w: 2 criteria, and the command returns success.

Ans:
MongoDB uses memory extensively for low-latency database operations. (Low latency describes a computer network that is optimized to process a very high volume of data messages with minimal delay.)
When you compare the speed of reading data from memory to reading data from disk, reading from memory is approximately 100,000 times faster than reading from the disk.
A page fault happens when data that is not in memory is accessed by MongoDB.
If there is free memory available, the OS will directly load the requested page into memory; however, in the absence of free memory, a page in memory is written to the disk and then the requested page is loaded into memory, slowing down the process.
A few operations can accidentally purge a large portion of the working set from memory, leading to an adverse effect on performance.
One example is a query scanning through all documents of a database whose size exceeds the server memory. This leads to loading the documents into memory and moving the working set out to disk.
In MongoDB, scaling is handled by scaling out the data horizontally (i.e., partitioning the data across multiple commodity servers), which is also called sharding (horizontal scaling).
Sharding addresses the challenges of scaling to support large data sets and high throughput by horizontally dividing the datasets across servers, where each server is responsible for handling its part of the data and no one server is burdened.
These servers are also called shards.
Every shard is an independent database. All the shards collectively make up a single logical database.
Sharding reduces the operation count handled by each shard. For example, when data is inserted, only the shards responsible for storing those records need to be accessed.
The processes that need to be handled by each shard reduce as the cluster grows, because the subset of data that the shard holds reduces. This leads to an increase in throughput and capacity horizontally.

When to use Sharding

Although sharding is a compelling and powerful feature, it has significant infrastructure requirements and it increases the complexity of the overall deployment.
Use sharding in the following instances:
• The size of the dataset is huge and it has started challenging the capacity of a single system.
• Since memory is used by MongoDB for quickly fetching data, it becomes important to scale out when the active working set approaches the limits of the available memory.
• If the application is write-intensive, sharding can be used to spread the writes across multiple servers.

Q25: Explain Sharding Components

Ans:
Sharding is enabled in MongoDB via sharded clusters.
The following are the components of a sharded cluster:
• Shards
• mongos
• Config servers

The shard is the component where the actual data is stored. For the sharded cluster, it holds a subset of data and can either be a mongod or a replica set.
All the shards' data combined together forms the complete dataset for the sharded cluster.
Sharding is enabled on a per-collection basis, so there might be collections that are not sharded.
In every sharded cluster there is a primary shard where all the unsharded collections are placed in addition to the sharded collection data.
When deploying a sharded cluster, by default the first shard becomes the primary shard, although this is configurable.

Shards

Shards (upper left in the deployment figure) store the application data. In a sharded cluster, only the mongos routers or system administrators should be connecting directly to the shards.
Like an unsharded deployment, each shard can be a single node for development and testing, but it should be a replica set in production.
mongos routers (center) cache the cluster metadata and use it to route operations to the correct shard or shards.
Config servers (upper right) persistently store metadata about the cluster, including which shard has what subset of the data.

Config Server

Config servers are special mongods that hold the sharded cluster's metadata. This metadata depicts the sharded system's state and organization.
The config server stores data for a single sharded cluster. The config servers should be available for the proper functioning of the cluster.
One config server can lead to a cluster's single point of failure. For production deployment it is recommended to have at least three config servers, so that the cluster keeps functioning even if one config server is not accessible.
A config server stores the data in the config database, which enables routing of the client requests to the respective data. This database should not be updated manually.
MongoDB writes data to the config server only when the data distribution has changed for balancing the cluster.

mongos

The mongos act as the routers. They are responsible for routing the read and write requests from the application to the shards.
An application interacting with a Mongo database need not worry about how the data is stored internally on the shards. For the application it is transparent, because it interacts only with the mongos.
The mongos, in turn, route the reads and writes to the shards.
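To see these components together in a running cluster, the sh.status() helper can be run against a mongos (the exact output shape varies by version; treat this as a sketch):
mongos> sh.status()
It summarizes the configured shards, the balancer state, and which databases and collections are sharded.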
The mongos cache the metadata from the config server so that for every read and write request they do not overburden the config server. However, in the following cases, the data is read from the config server:
• Either an existing mongos has restarted or a new mongos has started for the first time.
• Migration of chunks.

Q26: Explain Data Distribution Process

Ans:
In MongoDB, the data is sharded or distributed at the collection level. The collection is partitioned by the shard key.

Figure: Levels of granularity available in a sharded MongoDB deployment

Shard Key

Any indexed single or compound field that exists within all documents of the collection can be a shard key. You specify that this is the field on the basis of which the documents of the collection need to be distributed.
Internally, MongoDB divides the documents into chunks based on the value of the field and distributes them across the shards.
There are two ways MongoDB enables distribution of the data:
• range-based partitioning and
• hash-based partitioning.

In range-based partitioning, the shard key values are divided into ranges. Say you consider a timestamp field as the shard key. In this way of partitioning, the values are considered as a straight line starting from a Min value to a Max value, where Min is the starting period (say, 01/01/1970) and Max is the end period (say, 12/31/9999). Every document in the collection will have a timestamp value within this range only, and it will represent some point on the line.
Based on the number of shards available, the line will be divided into ranges, and documents will be distributed based on them.
In this scheme of partitioning, documents whose shard key values are nearby are likely to fall on the same shard. This can significantly improve the performance of range queries.
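A short sketch of declaring each kind of shard key from a mongos (the database logs, collections events and clicks, and fields ts and userId are illustrative names):
mongos> sh.enableSharding("logs")
mongos> sh.shardCollection("logs.events", {ts: 1})
mongos> sh.shardCollection("logs.clicks", {userId: "hashed"})
The first collection is range-partitioned on ts; the second is hash-partitioned on userId, as described next.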
Hash-based partitioning

In hash-based partitioning, the data is distributed on the basis of the hash value of the shard field.
If selected, this will lead to a more random distribution compared to range-based partitioning.
It is unlikely that documents with close shard key values will be part of the same chunk. For example, for ranges based on the hash of the id field, there will be a straight line of hash values, which will again be partitioned on the basis of the number of shards. On the basis of the hash values, the documents will lie in one or another of the shards.
This ensures that the data is evenly distributed, but it happens at the cost of efficient range queries.

Chunks

The data is moved between the shards in the form of chunks. The shard key range is further partitioned into subranges, which are also termed chunks.
A chunk consists of a subset of the sharded data. Each chunk has an inclusive lower and exclusive upper range based on the shard key.
For a sharded cluster, 64MB is the default chunk size. In most situations, this is an apt size for chunk splitting and migration.
The mongos routes writes to the appropriate chunk based on the shard key value. MongoDB splits chunks when they grow beyond the configured chunk size. Both inserts and updates can trigger a chunk split.

For example, say you have a blog posts collection which is sharded on the field date:
Shard #1: Beginning of time up to July 2009
Shard #2: August 2009 to December 2009
Shard #3: January 2010 through the end of time

Role of Config Servers in the Above Scenario

Consider a scenario where you start getting insert requests for millions of documents with the date of September 2009. In this case, Shard #2 begins to get overloaded.
The config server steps in once it realizes that Shard #2 is becoming too big. It will split the data on the shard and start migrating it to other shards. After the migration is completed, it sends the updated status to the mongos.
So now Shard #2 has data from August 2009 until September 18, 2009 and Shard #3 contains data from September 19, 2009 until the end of time.
When a new shard is added to the cluster, it is the config server's responsibility to figure out what to do with it. The data may need to be immediately migrated to the new shard, or the new shard may need to be held in reserve for some time.
In summary, the config servers are the brains. Whenever any data is moved around, the config servers let the mongos know about the final configuration so that the mongos can continue doing proper routing.
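The chunk ranges themselves can be inspected directly in the config database from a mongos (the namespace movies.drama reuses the example later in this document; min and max hold the inclusive lower and exclusive upper bounds):
mongos> use config
mongos> db.chunks.find({ns:"movies.drama"}, {min:1, max:1, shard:1, _id:0})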
We need to understand how MongoDB ensures that all the shards are equally loaded.
The addition of new data or modification of existing data, or the addition or removal of servers, can lead to an imbalance in the data distribution. This means either that one shard is overloaded with more chunks while the other shards have fewer chunks, or that the size of one chunk grows significantly greater than that of the other chunks.
MongoDB ensures balance with the following background processes:
• Chunk splitting
• Balancer
Chunk Splitting

Chunk splitting is one of the processes that ensures the chunks are of the specified size.
As you have seen, a shard key is chosen and it is used to identify how the documents will be distributed across the shards.
The documents are further grouped into chunks of 64MB (the default, which is configurable) and are stored in the shards based on the range each shard is hosting.
If the size of a chunk changes due to an insert or update operation and exceeds the default chunk size, then the chunk is split into two smaller chunks by the mongos.

Balancer

Any of the mongos within the cluster can initiate the balancer process. They do so by acquiring a lock on the config database of the config server, as balancing involves migration of chunks from one shard to another, which leads to a change in the metadata and hence a change in the config server database.
The balancer migrates one chunk at a time. It can:
1. Be configured to start the migration only when the migration threshold has been reached. The migration threshold is the difference between the maximum and the minimum number of chunks on the shards.
2. Be scheduled to run in a time period that will not impact the production traffic.
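The balancer can be inspected and controlled from a mongos with the standard sh helpers (a minimal sketch):
mongos> sh.getBalancerState()
mongos> sh.stopBalancer()
mongos> sh.startBalancer()
getBalancerState reports whether the balancer is enabled, and the stop/start helpers disable or re-enable it, for example around a maintenance window.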
Migration process
4. Next, start a new terminal window, connect to the mongos, and enable sharding on the collections (a sketch of these shell commands appears after the chunk listings below).
5. View the running databases connected to the mongos instance running at the port.
6. Get a reference to the database named movies.
7. Enable sharding of the database movies.
8. Shard the collection movies.drama by shard key originality.
9. Shard the collection movies.action by shard key distribution.
10. Shard the collection movies.comedy by shard key collections.

Now, to check how data is distributed across the shards, switch to configdb:
mongos> use config
switched to db config
You can use chunks.find to look at how the chunks are distributed:
mongos> db.chunks.find({ns:"movies.drama"}, {shard:1, _id:0}).sort({shard:1})
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
Similarly, if you fire the chunks.find command against the other collections, you can see how their data is distributed across the shards:
mongos> db.chunks.find({ns:"movies.action"}, {shard:1, _id:0}).sort({shard:1})
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
Monitoring for Sharding

In addition to the normal monitoring and analysis that is done for other MongoDB instances, the sharded cluster requires additional monitoring to ensure that all its operations are functioning appropriately and the data is being distributed effectively among the nodes.

A production sharded cluster deployment should have:
1. At least two mongos instances, but you can have more as per need.
2. Three config servers, each on a separate system.
3. Two or more replica sets serving as shards. The replica sets are distributed across geographies, with read concern set to nearest.
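As a starting point for such monitoring (the invocation is a sketch and the host is a placeholder), mongostat can watch all members discovered through a mongos:
mongostat --host <mongos-host> --discover
The --discover option makes mongostat report statistics for every member of the cluster it can find, not just the node it connects to.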
Scenario 2

One of the mongods of a replica set becomes unavailable in a shard.
Since you used replica sets to provide high availability, there is no data loss.
If a primary node is down, a new primary is chosen, whereas if it is a secondary node, then it is disconnected and the functioning continues normally.
The only difference is that the duplication of the data is reduced, making the system a little weak, so in parallel you should check whether the mongod is recoverable.

Scenario 3

One of the shards becomes unavailable.
In this scenario, the data on that shard will be unavailable, but the other shards will be available, so it will not stop the application.
The application can continue with its read/write operations; however, the partial results must be dealt with within the application.
In parallel, the shard should be recovered as soon as possible.
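When diagnosing Scenario 2, the state of each member can be checked from any surviving node (a minimal sketch using the standard rs.status() helper):
>rs.status().members.forEach(function(m) { print(m.name + " : " + m.stateStr) })
Each member is printed with its state string, such as PRIMARY or SECONDARY, so the failed mongod stands out.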
Scenario 4

Only one config server is available out of three.
In this scenario, the cluster will become read-only: it will not serve any operations that might lead to changes in the cluster structure and thereby to a change of metadata, such as chunk migration or chunk splitting.
The config servers should be replaced as soon as possible, because if all the config servers become unavailable, the cluster becomes inoperable.