0% found this document useful (0 votes)

25 views51 pages

Unit - Iii Bda

This document provides an introduction to MongoDB and MapReduce programming, detailing its structure, terminology, and advantages over traditional RDBMS. It explains key concepts such as databases, collections, and documents, along with data types and querying methods like find() and findOne(). Additionally, it covers the use of logical operators for querying and highlights MongoDB's suitability for various applications.

Uploaded by

subramanipadmavathy16

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

25 views51 pages

Unit - Iii Bda

Uploaded by

subramanipadmavathy16

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

You are on page 1/ 51

UNIT III

INTRODUCTION TO MONGODB AND MAPREDUCE PROGRAMMING

Mongo DB - Terms used in RDBMS and Mongo DB - Data Types - MongoDB Query
LanguageMapReduce: Mapper – Reducer – Combiner – Partitioner – Searching – Sorting –
Compression

INTRODUCTION TO MONGODB AND MAPREDUCE PROGRAMMING

MongoDB is a cross-platform, document-oriented database that provides, high performance,

high availability, and easy scalability. MongoDB works on concept of collection and document.

Database

Database is a physical container for collections. Each database gets its own set of files on the file
system. A single MongoDB server typically has multiple databases.

Collection

Collection is a group of MongoDB documents. It is the equivalent of an RDBMS table. A

collection exists within a single database. Collections do not enforce a schema. Documents
within a collection can have different fields. Typically, all documents in a collection are of
similar or related purpose.

Document

A document is a set of key-value pairs. Documents have dynamic schema. Dynamic schema
means that documents in the same collection do not need to have the same set of fields or
structure, and common fields in a collection's documents may hold different types of data.

The following table shows the relationship of RDBMS terminology with MongoDB.

RDBMS MongoDB

Database Database

Table Collection

Tuple/Row Document
Column Field

Table Join Embedded Documents

Primary Key Primary Key (Default key _id

provided by MongoDB
itself)
Database Server and Client
mysqld/Oracle Mongod
mysql/sqlplus Mongo
Sample Document

Following example shows the document structure of a blog site, which is simply a comma
separated key value pair.

_id: ObjectId(7df78ad8902c)

title: 'MongoDB Overview',

description: 'MongoDB is no sql database',

by: 'tutorials point',

url: 'http://www.tutorialspoint.com',

tags: ['mongodb', 'database', 'NoSQL'],

likes: 100,

comments: [

user:'user1',

message: 'My first comment',

dateCreated: new Date(2011,1,20,2,15),

like: 0
},

user:'user2',

message: 'My second comments',

dateCreated: new Date(2011,1,25,7,45),

like: 5

_id is a 12 bytes hexadecimal number which assures the uniqueness of every document. You can
provide _id while inserting the document. If you don’t provide then MongoDB provides a unique
id for every document. These 12 bytes first 4 bytes for the current timestamp, next 3 bytes for
machine id, next 2 bytes for process id of MongoDB server and remaining 3 bytes are simple
incremental VALUE.

Any relational database has a typical schema design that shows number of tables and the
relationship between these tables. While in MongoDB, there is no concept of relationship.

Advantages of MongoDB over RDBMS

 Schema less − MongoDB is a document database in which one collection holds different
documents. Number of fields, content and size of the document can differ from one
document to another.
 Structure of a single object is clear.
 No complex joins.
 Deep query-ability. MongoDB supports dynamic queries on documents using a
document- based query language that's nearly as powerful as SQL.
 Tuning.
 Ease of scale-out − MongoDB is easy to scale.
 Conversion/mapping of application objects to database objects not needed.
 Uses internal memory for storing the (windowed) working set, enabling faster access of
data.
Why Use MongoDB?
 Document Oriented Storage − Data is stored in the form of JSON style documents.

 Index on any attribute

 Replication and high availability
 Auto-Sharding
 Rich queries
 Fast in-place updates
 Professional support by

MongoDB Where to Use MongoDB?

 Big Data
 Content Management and Delivery
 Mobile and Social Infrastructure
 User Data Management
 Data Hub
MongoDB supports many datatypes. Some of them are −
 String − This is the most commonly used datatype to store the data. String in MongoDB
must be UTF-8 valid.
 Integer − This type is used to store a numerical value. Integer can be 32 bit or 64 bit
depending upon your server.
 Boolean − This type is used to store a boolean (true/ false) value.
 Double − This type is used to store floating point values.
 Min/ Max keys − This type is used to compare a value against the lowest and highest
BSON elements.
 Arrays − This type is used to store arrays or list or multiple values into one key.
 Timestamp − ctimestamp. This can be handy for recording when a document has been
modified or added.
 Object − This datatype is used for embedded documents.
 Null − This type is used to store a Null value.
 Symbol − This datatype is used identically to a string; however, it's generally reserved
for languages that use a specific symbol type.
 Date − This datatype is used to store the current date or time in UNIX time format. You
can specify your own date time by creating object of Date and passing day, month, year
into it.
 Object ID − This datatype is used to store the document’s ID.
 Binary data − This datatype is used to store binary data.
 Code − This datatype is used to store JavaScript code into the document.
 Regular expression − This datatype is used to store regular expression.

The find() Method

To query data from MongoDB collection, you need to use MongoDB's find() method.

Syntax

The basic syntax of find() method is as follows −

>db.COLLECTION_NAME.find()
find() method will display all the documents in a non-structured way.

Example

Assume we have created a collection named mycol as −

> use sampleDB
switched to db sampleDB
> db.createCollection("mycol")
{ "ok" : 1 }
>

And inserted 3 documents in it using the insert() method as shown below −

> db.mycol.insert([
{
title: "MongoDB Overview",
description: "MongoDB is no SQL database",
by: "tutorials point",
url: "http://www.tutorialspoint.com",
tags: ["mongodb", "database", "NoSQL"],
likes: 100
},
{
title: "NoSQL Database",
description: "NoSQL database doesn't have tables",
by: "tutorials point",
url: "http://www.tutorialspoint.com",
tags: ["mongodb", "database", "NoSQL"],
likes: 20,
{

user:"user1",
message: "My first comment",
dateCreated: new Date(2013,11,10,2,35),
like: 0
}
]

Following method retrieves all the documents in the collection −

> db.mycol.find()
{ "_id" : ObjectId("5dd4e2cc0821d3b44607534c"), "title" : "MongoDB Overview", "description"
: "MongoDB is no SQL database", "by" : "tutorials point", "url" :
"http://www.tutorialspoint.com", "tags" : [ "mongodb", "database", "NoSQL" ], "likes" : 100 }
{ "_id" : ObjectId("5dd4e2cc0821d3b44607534d"), "title" : "NoSQL Database", "description" :
"NoSQL database doesn't have tables", "by" : "tutorials point", "url" :
"http://www.tutorialspoint.com", "tags" : [ "mongodb", "database", "NoSQL" ], "likes" : 20,
"comments" : [ { "user" : "user1", "message" : "My first comment", "dateCreated" :
ISODate("2013-12-09T21:05:00Z"), "like" : 0 } ] }
>

The pretty() Method

To display the results in a formatted way, you can use pretty() method.

Syntax

>db.COLLECTION_NAME.find().pretty()

Example

Following example retrieves all the documents from the collection named mycol and arranges
them in an easy-to-read format.
> db.mycol.find().pretty()
{

"_id" : ObjectId("5dd4e2cc0821d3b44607534c"),
"title" : "MongoDB Overview",
"description" : "MongoDB is no SQL database",
"by" : "tutorials point",
"url" : "http://www.tutorialspoint.com",
"tags" : [
"mongodb",
"database",
"NoSQL"
],
}

"_id" : ObjectId("5dd4e2cc0821d3b44607534d"),
"title" : "NoSQL Database",
"description" : "NoSQL database doesn't have tables",
"by" : "tutorials point",
"url" : "http://www.tutorialspoint.com",
"tags" : [
"mongodb",
"database",
"NoSQL"
],
"likes" : 20,
"comments" : [
{
"user" : "user1",
"message" : "My first comment",
"dateCreated" : ISODate("2013-12-09T21:05:00Z"),
"like" : 0

The findOne() method

Apart from the find() method, there is findOne() method, that returns only one document.

Syntax

>db.COLLECTIONNAME.findOne()

Example

Following example retrieves the document with title MongoDB Overview.

> db.mycol.findOne({title: "MongoDB Overview"})
{

"_id" : ObjectId("5dd6542170fb13eec3963bf0"),
"title" : "MongoDB Overview",
"description" : "MongoDB is no SQL database",
"by" : "tutorials point",
"url" : "http://www.tutorialspoint.com",
"tags" : [
"mongodb",
"database",
"NoSQL"
],
}

RDBMS Where Clause Equivalents in MongoDB

To query the document on the basis of some condition, you can use following operations.

Operation Syntax Example RDBMS

Equivalent

Equality {<key>:{$eg;<value>}} db.mycol.find({"by":"tutorials where by =

point"}).pretty() 'tutorials
point'

Less Than {<key>:{$lt:<value>}} db.mycol.find({"likes":{$lt:50}}).pretty() where likes

< 50

Less Than {<key>:{$lte:<value>}} db.mycol.find({"likes":{$lte:50}}).pretty() where likes

Equals <= 50

Greater {<key>:{$gt:<value>}} db.mycol.find({"likes":{$gt:50}}).pretty() where likes

Than > 50

Greater {<key>:{$gte:<value>}} db.mycol.find({"likes":{$gte:50}}).pretty() where likes

Than >= 50
Equals

Not {<key>:{$ne:<value>}} db.mycol.find({"likes":{$ne:50}}).pretty() where likes

Equals != 50

Values in {<key>:{$in:[<value1>, db.mycol.find({"name":{$in:["Raj", Where

an array <value2>,……<valueN>]}} "Ram", "Raghu"]}}).pretty() name
matches
any of the
value in
:["Raj",
"Ram",
"Raghu"]
values not {<key>:{$nin:<value>}} db.mycol.find({"name":{$nin:["Ramu", Where
in an array "Raghav"]}}).pretty() name
values is
not in the
array
:["Ramu",
"Raghav"]
or, doesn’t
exist at all

AND in MongoDB

Syntax

To query documents based on the AND condition, you need to use $and keyword. Following is
the basic syntax of AND −
>db.mycol.find({ $and: [ {<key1>:<value1>}, { <key2>:<value2>} ] })

Example

Following example will show all the tutorials written by 'tutorials point' and whose title is
'MongoDB Overview'.
> db.mycol.find({$and:[{"by":"tutorials point"},{"title": "MongoDB Overview"}]}).pretty()
{

For the above given example, equivalent where clause will be ' where by = 'tutorials point'
AND title = 'MongoDB Overview' '. You can pass any number of key, value pairs in find
clause.

OR in MongoDB

Syntax
To query documents based on the OR condition, you need to use $or keyword. Following is the basic
syntax of OR −
>db.mycol.find(
{

$or: [
{key1: value1}, {key2:value2}
]
}

).pretty()

Example

Following example will show all the tutorials written by 'tutorials point' or whose title is
'MongoDB Overview'.
>db.mycol.find({$or:[{"by":"tutorials point"},{"title": "MongoDB Overview"}]}).pretty()
{
"_id": ObjectId(7df78ad8902c),
"title": "MongoDB Overview",
"description": "MongoDB is no sql database",
"by": "tutorials point",
"url": "http://www.tutorialspoint.com",
"tags": ["mongodb", "database", "NoSQL"],
"likes": "100"
}

Using AND and OR

Together Example

The following example will show the documents that have likes greater than 10 and whose title
is either 'MongoDB Overview' or by is 'tutorials point'. Equivalent SQL where clause is 'where
likes>10 AND (by = 'tutorials point' OR title = 'MongoDB Overview')'
>db.mycol.find({"likes": {$gt:10}, $or: [{"by": "tutorials point"},
{"title": "MongoDB Overview"}]}).pretty()
{
"_id": ObjectId(7df78ad8902c),
"title": "MongoDB Overview",
"description": "MongoDB is no sql database",
"by": "tutorials point",
"url": "http://www.tutorialspoint.com",
"tags": ["mongodb", "database", "NoSQL"],
"likes": "100"
}
NOR in MongoDB Syntax

To query documents based on the NOT condition, you need to use $not keyword. Following is
the basic syntax of NOT −
>db.COLLECTION_NAME.find(
{

$not: [

] {key1: value1}, {key2:value2}

}

Example

Assume we have inserted 3 documents in the collection empDetails as shown below −

db.empDetails.insertMany(
[

{
First_Name: "Radhika",
Last_Name: "Sharma",
Age: "26",
e_mail: "[email protected]",
phone: "9000012345"
},
{
First_Name: "Rachel",
Last_Name: "Christopher",
Age: "27",
e_mail: "[email protected]",
phone: "9000054321"
},
{
First_Name: "Fathima",
Last_Name: "Sheik",
Age: "24",
e_mail: "[email protected]",

Following example will retrieve the document(s) whose first name is not "Radhika" and last
name is not "Christopher"
> db.empDetails.find(
{
$nor:[
40
{"First_Name": "Radhika"},
{"Last_Name": "Christopher"}
]
}
).pretty()
{
"_id" : ObjectId("5dd631f270fb13eec3963bef"),
"First_Name" : "Fathima",
"Last_Name" : "Sheik",
"Age" : "24",
"e_mail" : "[email protected]",
"phone" : "9000054321"
NOT in MongoDB

Syntax

To query documents based on the NOT condition, you need to use $not keyword following is the
basic syntax of NOT −
>db.COLLECTION_NAME.find(
{

$NOT: [

] {key1: value1}, {key2:value2}

}

).pretty()

Example
Following example will retrieve the document(s) whose age is not greater than 25
> db.empDetails.find( { "Age": { $not: { $gt: "25" } } } )
{
"_id" : ObjectId("5dd6636870fb13eec3963bf7"),
"First_Name" : "Fathima",
"Last_Name" : "Sheik",
"Age" : "24",
"e_mail" : "[email protected]",
"phone" : "9000054321"
}
MapReduce:

MapReduce addresses the challenges of distributed programming by providing an abstraction

that isolates the developer from system-level details (e.g., locking of data structures, data
starvation issues in the processing pipeline, etc.). The programming model specifies simple and
well-defined interfaces between a small number of components, and therefore is easy for the
programmer to reason about. MapReduce maintains a separation of what computations are to be
performed and how those computations are actually carried out on a cluster of machines. The
first is under the control of the programmer, while the second is exclusively the responsibility of
the execution framework or “runtime”. The advantage is that the execution framework only
needs to be designed once and verified for correctness—thereafter, as long as the developer
expresses computations in the programming model, code is guaranteed to behave as expected. The
upshot is that the developer is freed from having to worry about system-level details (e.g., no
more debugging race conditions and addressing lock contention) and can instead focus on
algorithm or application design.

ich often has multiple cores). Why is MapReduce important? In practical terms, it provides a
very effective tool for tackling large-data problems. But beyond that, MapReduce is important in
how it has changed the way we organize computations at a massive scale. MapReduce represents
the first widely-adopted step away from the von Neumann model that has served as the
foundation of computer science over the last half plus century. Valiant called this a bridging
model [148], a conceptual bridge between the physical implementation of a machine and the
software that is to be executed on that machine. Until recently, the von Neumann model has
served us well: Hardware designers focused on efficient implementations of the von Neumann
model and didn’t have to think much about the actual software that would run on the machines.
Similarly, the software industry developed software targeted at the model without worrying
about the hardware details. The result was extraordinary growth: chip designers churned out
successive generations of increasingly powerful processors, and software engineers were able to
develop applications in high-level languages that exploited those processors.

MapReduce can be viewed as the first breakthrough in the quest for new abstractions that allow
us to organize computations, not over individual machines, but over entire clusters. As Barroso
puts it, the datacenter is the computer. MapReduce is certainly not the first model of parallel
computation that has been proposed. The most prevalent model in theoretical computer science,
which dates back several decades, is the PRAM. MAPPERS AND REDUCERS Key-value pairs
form the basic data structure in MapReduce. Keys and values may be primitives such as integers,
floating point values, strings, and raw bytes, or they may be arbitrarily complex structures (lists,
tuples, associative arrays, etc.). Programmers typically need to define their own custom data
types, although a number of libraries such as Protocol Buffers,5 Thrift,6 and Avro7 simplify the
task. Part of the design of MapReduce algorithms involves imposing the key-value structure on
arbitrary datasets. For a collection of web pages, keys may be URLs and values may be the
actual HTML content. For a graph, keys may represent node ids and values may contain the
adjacency lists of those nodes (see Chapter 5 for more details). In some algorithms, input keys
are not particularly
meaningful and are simply ignored during processing, while in other cases input keys are used to
uniquely identify a datum (such as a record id). In Chapter 3, we discuss the role of complex
keys and values in the design of various algorithms. In MapReduce, the programmer defines a
mapper and a reducer with the following signatures: map: (k1, v1) → [(k2, v2)] reduce: (k2,
[v2]) → [(k3, v3)] The convention [. . .] is used throughout this book to denote a list. The input
to a MapReduce job starts as data stored on the underlying distributed file system (see Section
2.5). The mapper is applied to every input key-value pair (split across an arbitrary number of
files) to generate an arbitrary number of intermediate key-value pairs. The reducer is applied to
all values associated with the same intermediate key to generate output key-value pairs.8 Implicit
between the map and reduce phases is a distributed “group by” operation on intermediate keys.
Intermediate data arrive at each reducer in order, sorted by the key. However, no ordering
relationship is guaranteed for keys across different reducers. Output key-value pairs from each
reducer are written persistently back onto the distributed file system (whereas intermediate key-
value pairs are transient and not preserved). The output ends up in r files on the distributed file
system, where r is the number of reducers. For the most part, there is no need to consolidate
reducer output, since the r files often serve as input to yet another MapReduce job. Figure 2.2
illustrates this two-stage processing structure. A simple word count algorithm in MapReduce is
shown in Figure 2.3. This algorithm counts the number of occurrences of every word in a text
collection, which may be the first step in, for example, building a unigram language model (i.e.,
probability
MAPREDUCE BASICS

The distribution over words in a collection). Input key-values pairs take the form of (docid, doc)
pairs stored on the distributed file system, where the former is a unique identifier for the
document, and the latter is the text of the document itself. The mapper takes an input key-value
pair, tokenizes the document, and emits an intermediate key-value pair for every word: the word
itself serves as the key, and the integer one serves as the value (denoting that we’ve seen the
word once). The MapReduce execution framework guarantees that all values associated with the
same key are brought together in the reducer. Therefore, in our word count algorithm, we simply
need to sum up all counts (ones) associated with each word. The reducer does exactly this, and
emits final keyvalue pairs with the word as the key, and the count as the value. Final output is
written to the distributed file system, one file per reducer. Words within each file will be sorted
by alphabetical order, and each file will contain roughly the same number of words. The
partitioner, which we discuss later in Section 2.4, controls the assignment of words to reducers.
The output can be examined by the programmer or used as input to another MapReduce
program.

There are some differences between the Hadoop implementation of MapReduce and Google’s
implementation.9 In Hadoop, the reducer is presented with a key and an iterator over all values
associated with the particular key. The values are arbitrarily ordered. Google’s implementation
allows the programmer to specify a secondary sort key for ordering the values (if desired)—in
which case values associated with each key would be presented to the developer’s reduce code in
sorted order. Later in Section 3.4 we discuss how to overcome this limitation in Hadoop to
perform secondary sorting. Another difference: in Google’s implementation the programmer is
not allowed to change the key in the reducer. That is, the reducer output key must be exactly the
same as the reducer input key. In Hadoop, there is no such restriction, and the reducer can emit
an arbitrary number of output key-value pairs (with different keys).

To provide a bit more implementation detail: pseudo-code provided in this book roughly mirrors
how MapReduce programs are written in Hadoop. Mappers and reducers are objects that
implement the Map and Reduce methods, respectively. In Hadoop, a mapper object is initialized
for each map task (associated with a particular sequence of key-value pairs called an input split)
and the Map method is called on each key-value pair by the execution framework. In configuring
a MapReduce job, the programmer provides a hint on the number of map tasks to run, but the
execution framework (see next section) makes the final determination based on the physical
layout of the data (more details in Section 2.5 and Section 2.6). The situation is similar for the
reduce phase: a reducer object is initialized for each reduce task, and the Reduce method is called
once per intermediate key. In contrast with the number of map tasks, the programmer can
precisely specify the number of reduce tasks. We will return to discuss the details of Hadoop job
execution in Section 2.6, which is dependent on an understanding of the distributed file system
(covered in Section 2.5). To reiterate: although the presentation of algorithms in this book
closely mirrors the way they would be implemented in Hadoop, our focus is on algorithm
design and conceptual
understanding—not actual Hadoop programming. For that, we would recommend Tom White’s book
[154]. What are the restrictions on mappers and reducers? Mappers and reducers can express arbitrary
computations over their inputs. However, one must generally be careful about use of external resources
since multiple mappers or reducers may be contending for those resources. For example, it may be
unwise for a mapper to query an external SQL database, since that would introduce a scalability
bottleneck on the number of map tasks that could be run in parallel (since they might all be
simultaneously querying the database).10 In general, mappers can emit an arbitrary number of
intermediate key-value pairs, and they need not be of the same type as the input key-value pairs.
Similarly, reducers can emit an arbitrary number of final key-value pairs, and they can differ in type
from the intermediate key-value pairs. Although not permitted in functional programming, mappers and
reducers can have side effects. This is a powerful and useful feature: for example, preserving state across
multiple inputs is central to the design of many MapReduce algorithms (see Chapter 3). Such algorithms
can be understood as having side effects that only change state that is internal to the mapper or reducer.
While the correctness of such algorithms may be more difficult to guarantee (since the function’s
behavior depends not only on the current input but on previous inputs), most potential synchronization
problems are avoided since internal state is private only to individual mappers and reducers. In other
cases (see Section 4.4 and Section 6.5), it may be useful for mappers or reducers to have external side
effects, such as writing files to the distributed file system. Since many mappers and reducers are run in
parallel, and the distributed file system is a shared global resource, special care must be taken to ensure
that such operations avoid synchronization conflicts. One strategy is to write a temporary file that is
renamed upon successful completion of the mapper or reducer .

In addition to the “canonical” MapReduce processing flow, other variations are also possible.
MapReduce programs can contain no reducers, in which case mapper output is directly written to
disk (one file per mapper). For embarrassingly parallel problems, e.g., parse a large text
collection or independently analyze a large number of images, this would be a common pattern.
The converse—a MapReduce program with no mappers—is not possible, although in some cases
it is useful for the mapper to implement the identity function and simply pass input key-value
pairs to the reducers. This has the effect of sorting and regrouping the input for reduce-side
processing. Similarly, in some cases it is useful for the reducer to implement the identity
function, in which case the program simply sorts and groups mapper output. Finally, running
identity mappers and reducers has the effect of regrouping and resorting the input data (which is
sometimes useful).

Although in the most common case, input to a MapReduce job comes from data stored on the
distributed file system and output is written back to the distributed file system, any other system
that satisfies the proper abstractions can serve as a data source or sink. With Google’s
MapReduce implementation, BigTable [34], a sparse, distributed, persistent multidimensional
sorted map, is frequently used as a source of input and as a store of MapReduce output. HBase is
an open-source BigTable clone and has similar capabilities. Also, Hadoop has been integrated
with existing MPP (massively parallel processing) relational databases, which allows a
programmer to write MapReduce jobs over database rows and dump output into a new database
table. Finally, in some
cases MapReduce jobs may not consume any input at all (e.g., computing π) or may only
consume a small amount of data (e.g., input parameters to many instances of processorintensive
simulations running in parallel).

PARTITIONERS AND COMBINERS

We have thus far presented a simplified view of MapReduce. There are two additional elements
that complete the programming model: partitioners and combiners. Partitioners are responsible
for dividing up the intermediate key space and assigning intermediate key-value pairs to
reducers. In other words, the partitioner specifies the task to which an intermediate key-value
pair must be copied. Within each reducer, keys are processed in sorted order (which is how the
“group by” is implemented). The simplest partitioner involves computing the hash value of the
key and then taking the mod of that value with the number of reducers. This assigns
approximately the same number of keys to each reducer (dependent on the quality of the hash
function). Note, however, that the partitioner only considers the key and ignores the value—
therefore, a roughly-even partitioning of the key space may nevertheless yield large differences
in the number of key-values pairs sent to each reducer (since different keys may have different
numbers of associated values). This imbalance in the amount of data associated with each key is
relatively common in many text processing applications due to the Zipfian distribution of word
occurrences.

Combiners are an optimization in MapReduce that allow for local aggregation before the shuffle
and sort phase. We can motivate the need for combiners by considering the word count algorithm
in Figure 2.3, which emits a key-value pair for each word in the collection. Furthermore, all these
key-value pairs need to be copied across the network, and so the amount of intermediate data will
be larger than the input collection itself. This is clearly inefficient. One solution is to perform
local aggregation on the output of each mapper, i.e., to compute a local count for a word over all
the documents processed by the mapper. With this modification (assuming the maximum amount
of local aggregation possible), the number of intermediate key-value pairs will be at most the
number of unique words in the collection times the number of mappers (and typically far smaller
because each mapper may not encounter every word).

smaller because each mapper may not encounter every word). The combiner in MapReduce
supports such an optimization. One can think of combiners as “mini-reducers” that take place on
the output of the mappers, prior to the shuffle and sort phase. Each combiner operates in isolation
and therefore does not have access to intermediate output from other mappers. The combiner is
provided keys and values associated with each key (the same types as the mapper output keys
and values). Critically, one cannot assume that a combiner will have the opportunity to process
all values associated with the same key. The combiner can emit any number of key-value pairs,
but the keys and values must be of the same type as the mapper output (same as the reducer
input).12 In cases where an operation is both associative and commutative (e.g., addition or
multiplication), reducers can directly serve as combiners. In general, however, reducers and
combiners are not interchangeable.
In many cases, proper use of combiners can spell the difference between an impractical
algorithm and an efficient algorithm. This topic will be discussed in Section 3.1, which focuses
on various techniques for local aggregation. It suffices to say for now that a combiner can
significantly reduce the amount of data that needs to be copied over the network, resulting in
much faster algorithms. The complete MapReduce model is shown in Figure 2.4. Output of the
mappers are processed by the combiners, which perform local aggregation to cut down on the
number of intermediate key- value pairs. The partitioner determines which reducer will be
responsible for processing a particular key, and the execution framework uses this information to
copy the data to the right location during the shuffle and sort phase.13 Therefore, a complete
MapReduce job consists of code for the mapper, reducer, combiner, and partitioner, along with
job configuration parameters. The execution framework handles everything else.

SECONDARY SORTING

MapReduce sorts intermediate key-value pairs by the keys during the shuffle and sort phase,
which is very convenient if computations inside the reducer rely on sort order (e.g., the order
inversion design pattern described in the previous section). However, what if in addition to
sorting by key, we also need to sort by value? Google’s MapReduce implementation
provides built-in
functionality for (optional) secondary sorting, which guarantees that values arrive in sorted
order. Hadoop, unfortunately, does not have this capability built in.

Consider the example of sensor data from a scientific experiment: there are m sensors each
taking readings on continuous basis, where m is potentially a large number. A dump of the sensor
data might look something like the following, where rx after each timestamp represents the
actual sensor readings (unimportant for this discussion, but may be a series of values, one or
more complex records, or even raw bytes of images).

(t1, m1, r80521)

(t1, m2, r14209)

(t1, m3, r76042) ...

(t2, m1, r21823)

(t2, m2, r66508)

(t2, m3, r98347)

Suppose we wish to reconstruct the activity at each individual sensor over time. A MapReduce
program to accomplish this might map over the raw data and emit the sensor id as the
intermediate key, with the rest of each record as the value:

m1 → (t1, r80521)

This would bring all readings from the same sensor together in the reducer. However, since
MapReduce makes no guarantees about the ordering of values associated with the same key, the
sensor readings will not likely be in temporal order. The most obvious solution is to buffer all the
readings in memory and then sort by timestamp before additional processing. However, it should
be apparent by now that any in-memory buffering of data introduces a potential scalability
bottleneck. What if we are working with a high frequency sensor or sensor readings over a long
period of time? What if the sensor readings themselves are large complex objects? This approach
may not scale in these cases—the reducer would run out of memory trying to buffer all values
associated with the same key.

This is a common problem, since in many applications we wish to first group together data one
way (e.g., by sensor id), and then sort within the groupings another way (e.g., by time).
Fortunately, there is a general purpose solution, which we call the “value-to-key conversion”
design pattern. The basic idea is to move part of the value into the intermediate key to form a
composite key, and let the MapReduce execution framework handle the sorting. In the above
example, instead of emitting the sensor id as the key, we would emit the sensor id and the
timestamp as a composite key: (m1, t1) → (r80521)
The sensor reading itself now occupies the value. We must define the intermediate key sort order
to first sort by the sensor id (the left element in the pair) and then by the timestamp (the right
element in the pair). We must also implement a custom partitioner so that all pairs associated
with the same sensor are shuffled to the same reducer. Properly orchestrated, the key-value pairs
will be presented to the reducer in the correct sorted order: (m1, t1) → [(r80521)] (m1, t2) →
[(r21823)] (m1, t3) → [(r146925)] . . .

However, note that sensor readings are now split across multiple keys. The reducer will need to
preserve state and keep track of when readings associated with the current sensor end and the
next sensor begin.9 The basic tradeoff between the two approaches discussed above (buffer and
inmemory sort vs. value-to-key conversion) is where sorting is performed. One can explicitly
implement secondary sorting in the reducer, which is likely to be faster but suffers from a
scalability bottleneck.10 With value-to-key conversion, sorting is offloaded to the MapReduce
execution framework. Note that this approach can be arbitrarily extended to tertiary, quaternary,
etc. sorting. This pattern results in many more keys for the framework to sort, but distributed
sorting is a task that the MapReduce runtime excels at since it lies at the heart of the
programming model.

INDEX COMPRESSION

We return to the question of how postings are actually compressed and stored on disk. This
chapter devotes a substantial amount of space to this topic because index compression is one of
the main differences between a “toy” indexer and one that works on real-world collections.
Otherwise, MapReduce inverted indexing algorithms are pretty straightforward.

Let us consider the canonical case where each posting consists of a document id and the term
frequency. A na¨ıve implementation might represent the first as a 32-bit integer9 and the second
as a 16-bit integer. Thus, a postings list might be encoded as follows: [(5, 2),(7, 3),(12, 1),(49, 1),
(51, 2), . . .]

where each posting is represented by a pair in parentheses. Note that all brackets, parentheses,
and commas are only included to enhance readability; in reality the postings would be
represented as a long stream of integers. This na¨ıve implementation would require six bytes per
posting. Using this scheme, the entire inverted index would be about as large as the collection
itself. Fortunately, we can do significantly better. The first trick is to encode differences between
document ids as opposed to the document ids themselves. Since the postings are sorted by
document ids, the differences (called d-gaps) must be positive integers greater than zero. The
above postings list, represented with d-gaps, would be: [(5, 2),(2, 3),(5, 1),(37, 1),(2, 2)

Of course, we must actually encode the first document id. We haven’t lost any information, since
the original document ids can be easily reconstructed from the d-gaps. However, it’s not obvious
that we’ve reduced the space requirements either, since the largest possible d-gap is one less than
the number of documents in the collection. This is where the second trick comes in, which is to
represent the d-gaps in a way such that it takes less space for smaller numbers. Similarly, we
want to apply the same techniques to compress the term frequencies, since for the most part they
are also small values. But to understand how this is done, we need to take a slight detour into
compression techniques, particularly for coding integers.

Compression, in general, can be characterized as either lossless or lossy: it’s fairly obvious that
loseless compression is required in this context. To start, it is important to understand that all
compression techniques represent a time–space tradeoff. That is, we reduce the amount of space
on disk necessary to store data, but at the cost of extra processor cycles that must be spent coding
and decoding data. Therefore, it is possible that compression reduces size but also slows
processing. However, if the two factors are properly balanced (i.e., decoding speed can keep up
with disk bandwidth), we can achieve the best of both worlds: smaller and faster.

POSTINGS COMPRESSION

Having completed our slight detour into integer compression techniques, we can now return to
the scalable inverted indexing algorithm shown in Figure 4.4 and discuss how postings lists can
be properly compressed. As we can see from the previous section, there is a wide range of
choices that represent different tradeoffs between compression ratio and decoding speed. Actual
performance also depends on characteristics of the collection, which, among other factors,
determine the distribution of d-gaps. B¨uttcher et al. [30] recently compared the performance of
various compression techniques on coding document ids. In terms of the amount of compression
that can be obtained (measured in bits per docid), Golomb and Rice codes performed the best,
followed by γ codes, Simple-9, varInt, and group varInt (the least space efficient). In terms of
raw decoding speed, the order was almost the reverse: group varInt was the fastest, followed by
varInt.14 Simple-9 was substantially slower, and the bit-aligned codes were even slower than
that. Within the bit-aligned codes, Rice codes were the fastest, followed by γ, with Golomb codes
being the slowest (about ten times slower than group varInt).

Let us discuss what modifications are necessary to our inverted indexing algorithm if we were to
adopt Golomb compression for d-gaps and represent term frequencies with γ codes. Note that
this represents a space-efficient encoding, at the cost of slower decoding compared to
alternatives. Whether or not this is actually a worthwhile tradeoff in practice is not important
here: use of Golomb codes serves a pedagogical purpose, to illustrate how one might set
compression parameters.

Coding term frequencies with γ codes is easy since they are parameterless. Compressing d-gaps
with Golomb codes, however, is a bit tricky, since two parameters are required: the size of the
document collection and the number of postings for a particular postings list (i.e., the document
frequency, or df). The first is easy to obtain and can be passed into the reducer as a constant. The
df of a term, however, is not known until all the postings have been processed—and
unfortunately,
the parameter must be known before any posting is coded. At first glance, this seems like a
chicken-and-egg problem. A two-pass solution that involves first buffering the postings (in
memory) would suffer from the memory bottleneck we’ve been trying to avoid in the first place.

To get around this problem, we need to somehow inform the reducer of a term’s df before any
of its postings arrive. This can be solved with the order inversion design pattern introduced in
Section

form ht, ∗i to communicate partial document frequencies. That is, inside the mapper, in addition
3.3 to compute relative frequencies. The solution is to have the mapper emit special keys of the

to emitting intermediate key-value pairs of the following form:

(tuple ht, docidi,tf f)

we also emit special intermediate key-value pairs like this:

(tuple ht, ∗i, df e)

to keep track of document frequencies associated with each term. In practice, we can accomplish
this by applying the in-mapper combining design pattern (see Section 3.1). The mapper holds an
in-memory associative array that keeps track of how many documents a term has been observed

mapper). Once the mapper has processed all input records, special keys of the form ht, ∗i are
in (i.e., the local document frequency of the term for the subset of documents processed by the

emitted with the partial df as the value.

special symbol ∗ precedes all documents (part of the order inversion design pattern). Thus, for
To ensure that these special keys arrive first, we define the sort order of the tuple so that the

each term, the reducer will first encounter the ht, ∗i key, associated with a list of values
representing partial df values originating from each mapper. Summing all these partial
contributions will yield the term’s df, which can then be used to set the Golomb compression
parameter b. This allows the postings to be incrementally compressed as they are encountered in
the reducer—memory bottlenecks are eliminated since we do not need to buffer postings in
memory.

Once again, the order inversion design pattern comes to the rescue. Recall that the pattern is
useful when a reducer needs to access the result of a computation (e.g., an aggregate statistic)
before it encounters the data necessary to produce that computation. For computing relative
frequencies, that bit of information was the marginal. In this case, it’s the document frequency.

PARALLEL BREADTH-FIRST SEARCH

One of the most common and well-studied problems in graph theory is the single-source shortest
path problem, where the task is to find shortest paths from a source node to all other nodes in the
graph (or alternatively, edges can be associated with costs or weights, in which case the task is to
compute lowest-cost or lowest-weight paths). Such problems are a staple in undergraduate
algorithm courses, where students are taught the solution using Dijkstra’s algorithm. However,
this famous algorithm assumes sequential processing—how would we solve this problem in
parallel, and more specifically, with MapReduce?

Dijkstra(G, w, s)

3: for all vertex v ∈ V do

2: d[s] ← 0

4: d[v] ← ∞

6: while Q 6= ∅ do
5: Q ← {V }

8: for all vertex v ∈ u.AdjacencyList do

7: u ← ExtractMin(Q)

9: if d[v] > d[u] + w(u, v) then

10: d[v] ← d[u] + w(u, v)

Figure 5.2: Pseudo-code for Dijkstra’s algorithm, which is based on maintaining a global priority
queue of nodes with priorities equal to their distances from the source node. At each iteration, the
algorithm expands the node with the shortest distance and updates distances to all reachable
nodes. As a refresher and also to serve as a point of comparison, Dijkstra’s algorithm is shown in
Figure 5.2, adapted from Cormen, Leiserson, and Rivest’s classic algorithms textbook [41] (often
simply known as CLR). The input to the algorithm is a directed, connected graph G = (V, E)

source node s. The algorithm begins by first setting distances to all vertices d[v], v ∈ V to ∞,
represented with adjacency lists, w containing edge distances such that w(u, v) ≥ 0, and the

except for the source node, whose distance to itself is zero. The algorithm maintains Q, a global
priority queue of vertices with priorities equal to their distance values d
Dijkstra’s algorithm operates by iteratively selecting the node with the lowest current distance
from the priority queue (initially, this is the source node). At each iteration, the algorithm
“expands” that node by traversing the adjacency list of the selected node to see if any of those
nodes can be reached with a path of a shorter distance. The algorithm terminates when the
priority queue Q is empty, or equivalently, when all nodes have been considered. Note that the
algorithm as presented in Figure 5.2 only computes the shortest distances. The actual paths can
be recovered by storing “backpointers” for every node indicating a fragment of the shortest path.
A sample trace of the algorithm running on a simple graph is shown in Figure 5.3 (example also
adapted from CLR). We start out in (a) with n1 having a distance of zero (since it’s the source)
and all other nodes having a distance of ∞. In the first iteration (a), n1 is selected as the node to
expand (indicated by the thicker border). After the expansion, we see in (b) that n2 and n3 can be
reached at a distance of 10 and 5, respectively. Also, we see in (b) that n3 is the next node
selected for expansion. Nodes we have already considered for expansion are shown in black.
Expanding n3, we see in (c) that the distance to n2 has decreased because we’ve found a shorter
path. The nodes that will be expanded next, in order, are n5, n2, and n4. The algorithm
terminates with the end state shown in (f), where we’ve discovered the shortest distance to all
nodes.
The key to Dijkstra’s algorithm is the priority queue that maintains a globallysorted list of nodes
by current distance. This is not possible in MapReduce, as the programming model does not
provide a mechanism for exchanging global data. Instead, we adopt a brute force approach
known as parallel breadth-first search. First, as a simplification let us assume that all edges have
unit distance (modeling, for example, hyperlinks on the web). This makes the algorithm easier to
understand, but we’ll relax this restriction later.
The intuition behind the algorithm is this: the distance of all nodes connected directly to the
source node is one; the distance of all nodes directly connected to those is two; and so on.
Imagine water rippling away from a rock dropped into a pond— that’s a good image of how
parallel breadth-first search works. However, what if there are multiple paths to the same node?

one of the nodes in M that contains an outgoing edge to n: we need to examine all m ∈ M to find
Suppose we wish to compute the shortest distance to node n. The shortest path must go through

ms, the node with the shortest distance. The shortest distance to n is the distance to ms plus one.

Pseudo-code for the implementation of the parallel breadth-first search algorithm is provided in
Figure 5.4. As with Dijkstra’s algorithm, we assume a connected, directed graph represented as
adjacency lists. Distance to each node is directly stored alongside the adjacency list of that node,
and initialized to ∞ for all nodes except for the source node. In the pseudo-code, we use n to
denote the node id (an integer) and N to denote the node’s corresponding data structure
(adjacency list and current distance). The algorithm works by mapping over all nodes and
emitting a key-value pair for each neighbor on the node’s adjacency list. The key contains the
node id of the neighbor, and the value is the current distance to the node plus one. This says: if
we can reach node n with a distance d, then we must be able to reach all the nodes that are
connected to n with distance d + 1.
After shuffle and sort, reducers will receive keys corresponding to the destination node ids and
distances corresponding to all paths leading to that node. The reducer will select the shortest of
these distances and then update the distance in the node data structure.

h iteration corresponds to a MapReduce job. The first time we run the algorithm, we “discover”
all nodes that are connected to the source. The second iteration, we discover all nodes connected
to those, and so on. Each iteration of the algorithm expands the “search frontier” by one hop,
and, eventually, all nodes will be discovered with their shortest distances (assuming a fully-
connected graph). Before we discuss termination of the algorithm, there is one more detail
required to make the parallel breadth-first search algorithm work. We need to “pass along” the
graph structure from one iteration to the next. This is accomplished by emitting the node data
structure itself, with the node id as a key (Figure 5.4, line 4 in the mapper). In the reducer, we
must distinguish the node data structure from distance values (Figure 5.4, lines 5–6 in the
reducer), and update the minimum distance in the node data structure before emitting it as the
final value. The final output is now ready to serve as input to the next iteration.

So how many iterations are necessary to compute the shortest distance to all nodes? The answer
is the diameter of the graph, or the greatest distance between any pair of nodes. This number is
surprisingly small for many real-world problems: the saying “six degrees of separation” suggests
that everyone on the planet is connected to everyone else by at most six steps (the people a
person knows are one step away, people that they know are two steps away, etc.). If this is
indeed true, then parallel breadthfirst search on the global social network would take at most six
MapReduce iterations.
class Mapper
2: method Map(nid n, node N)
3: d ← N.Distance

5: for all nodeid m ∈ N.AdjacencyList do

4: Emit(nid n, N) . Pass along graph structure

6: Emit(nid m, d + 1) . Emit distances to reachable nodes

1: class Reducer
2: method Reduce(nid m, [d1, d2, . . .])

4: M ← ∅
3: dmin ← ∞

5: for all d ∈ counts [d1, d2, . . .] do

6: if IsNode(d) then
7: M ← d . Recover graph structure
8: else if d < dmin then . Look for shorter distance
9: dmin ← d
10: M.Distance ← dmin . Update shortest distance
11: Emit(nid m, node M)
Figure 5.4: Pseudo-code for parallel breath-first search in MapReduce: the mappers emit
distances to reachable nodes, while the reducers select the minimum of those distances for each
destination node. Each iteration (one MapReduce job) of the algorithm expands the “search
frontier” by one hop.
For more serious academic studies of “small world” phenomena in networks, we refer the reader
to a number of publications [61, 62, 152, 2]. In practical terms, we iterate the algorithm until
there are no more node distances that are ∞. Since the graph is connected, all nodes are
reachable, and since all edge distances are one, all discovered nodes are guaranteed to have the
shortest distances (i.e., there is not a shorter path that goes through a node that hasn’t been
discovered).
The actual checking of the termination condition must occur outside of MapReduce. Typically,
execution of an iterative MapReduce algorithm requires a nonMapReduce “driver” program,
which submits a MapReduce job to iterate the algorithm, checks to see if a termination condition
has been met, and if not, repeats. Hadoop provides a lightweight API for constructs called
“counters”, which, as the name suggests, can be used for counting events that occur during
execution, e.g., number of corrupt records, number of times a certain condition is met, or
anything that the programmer desires. Counters can be defined to count the number of nodes that
have distances of ∞: at the end of the job, the driver program can access the final counter value
and check to see if another iteration is necessary.

Finally, as with Dijkstra’s algorithm in the form presented earlier, the parallel breadth-first
search algorithm only finds the shortest distances, not the actual shortest paths. However, the
path can be straightforwardly recovered. Storing “backpointers” at each node, as with Dijkstra’s
algorithm, will work, but may not be efficient since the graph needs to be traversed again to
reconstruct the path segments. A simpler approach is to emit paths along with distances in the
mapper, so that each node will have its shortest path easily accessible at all times. The additional
space requirements for shuffling these data from mappers to reducers are relatively modest, since
for the most part paths (i.e., sequence of node ids) are relatively short.
Up until now, we have been assuming that all edges are unit distance. Let us relax that restriction
and see what changes are required in the parallel breadth-first search algorithm. The adjacency
lists, which were previously lists of node ids, must now encode the edge distances as well. In line
6 of the mapper code in Figure 5.4, instead of emitting d + 1 as the value, we must now emit d +
w where w is the edge distance. No other changes to the algorithm are required, but the
termination behavior is very different. To illustrate, consider the graph fragment in Figure 5.5,
where s is the source node, and in this iteration, we just “discovered” node r for the very first
time. Assume for the sake of argument that we’ve already discovered the shortest distance to
node p, and that the shortest distance to r so far goes through p. This, however, does not
guarantee that we’ve discovered the shortest distance to r, since there may exist a path going
through q that we haven’t encountered yet (because it lies outside the search frontier).6 However,
as the search frontier expands, we’ll eventually cover q and all other nodes along the path from p
to q to r—which means that with sufficient iterations, we will discover the shortest distance to r.
But how do we know that we’ve found the shortest distance to p? Well, if the shortest path to p
lies within the search frontier, we would have already discovered it. And if it doesn’t, the above
argument applies. Similarly, we can repeat the same argument for all nodes on the path from s to
p. The conclusion is that, with sufficient iterations, we’ll eventually discover all the shortest
distances.
So exactly how many iterations does “eventually” mean? In the worst case, we might need as
many iterations as there are nodes in the graph minus one. In fact, it is not difficult to construct
graphs that will elicit this worse-case behavior: Figure 5.6 provides an example, with n1 as the
source. The parallel breadth-first search algorithm would not discover that the shortest path from
n1 to n6 goes through n3, n4, and n5 until the fifth iteration. Three more iterations are necessary
to cover the rest of the graph. Fortunately, for most real-world graphs, such extreme cases are
rare, and the number of iterations necessary to discover all shortest distances is quite close to the
diameter of the graph, as in the unit edge distance case.
In practical terms, how do we know when to stop iterating in the case of arbitrary edge
distances? The algorithm can terminate when shortest distances at every node no longer change.
Once again, we can use counters to keep track of such events. Every time we encounter a shorter
distance in the reducer, we increment a counter. At the end of each MapReduce iteration, the
driver program reads the counter value and determines if another iteration is necessary.
Compared to Dijkstra’s algorithm on a single processor, parallel breadth-first search in
MapReduce can be characterized as a brute force approach that “wastes” a lot of time performing
computations whose results are discarded. At each iteration, the algorithm attempts to recompute
distances to all nodes, but in reality only useful work is done along the search frontier: inside the
search frontier, the algorithm is simply repeating previous computations.7 Outside the search
frontier, the algorithm hasn’t discovered any paths to nodes there yet, so no meaningful work is
done. Dijkstra’s algorithm, on the other hand, is far more efficient. Every time a node is
explored, we’re guaranteed to have already found the shortest path to it. However, this is made
possible by maintaining a global data structure (a priority queue) that holds nodes sorted by
distance—this is not possible in MapReduce because the programming model does not provide
support for global data that is mutable and accessible by the mappers and reducers. These
inefficiencies represent the cost of parallelization.
The parallel breadth-first search algorithm is instructive in that it represents the prototypical
structure of a large class of graph algorithms in MapReduce. They share in the following
characteristics:
The graph structure is represented with adjacency lists, which is part
of some larger node data structure that may contain additional information
(variables to store intermediate output, features of the nodes). In many cases,
features are attached to edges as well (e.g., edge weights).
The graph structure is represented with adjacency lists, which is part of
some larger node data structure that may contain additional information
(variables to store intermediate output, features of the nodes). In many cases,
features are attached to edges as well (e.g., edge weights).
In addition to computations, the graph itself is also passed from the
mapper to the reducer. In the reducer, the data structure corresponding to each
node is updated and written back to disk.
Graph algorithms in MapReduce are generally iterative, where the
output of the previous iteration serves as input to the next iteration. The
process is controlled by a non-MapReduce driver program that checks for
termination.
For parallel breadth-first search, the mapper computation is the current
distance plus edge distance (emitting distances to neighbors), while the
reducer computation is the Min function (selecting the shortest path). As we
will see in the next section, the MapReduce algorithm for PageRank works in
much the same way
MongoDB Query Language:
MongoDB is a document-based NoSQL database having a flexible schema and does not require
a fixed format to store data. This helps store unstructured data like media files, sensor data and
real-time transactional data. MongoDB stores data as documents into collections. There can be
many collections, and as many documents into each collection, each document having its own set
of fields.

Here is a quick mapping of how data stored in MongoDB vs an SQL database:

SQL database MongoDB

Table Collection
Row Document
Column Field

the detailed mapping chart in the MongoDB manual.

Example:

{
_id: ObjectId("509a8fb2f3f5434bd2f983b2"),

login_id: "samy_123",

user_first_name: "Sam",

user_last_name: "Adams",

age: 23,

location: "Germany"

Note that each document starts and ends with a curly brace ({}). Also, only the _id field is
mandatory by default. The _id field is an unique internal identifier for each document and is
generated by the system while inserting the document, if you do not specify a particular value.
The above is a simple example of details of a user. MongoDB can easily handle more complex
nested documents as well.

Here is another example to store objects:

_id: ObjectId("509a8fb2f3f5434bd2f983b2"),

login_id: "samy_123",

user_first_name: "Sam",

user_last_name: "Adams",

age: 23,

location: {

"lng" : 48.137154,

"lat" : 11.576124

address: {

"city" : "Munich",
"state" : "Bavaria",

"zipcode" : “80335”,

"countrycode" : “GER”,

"country" : "Germany"

Note that ‘address’ and ‘location’ are objects that further have fields to store the required
information.

What is the MongoDB Query Language (MQL)?

To perform CRUD operations on a database, we need to query the database in a way that
the database can understand. MongoDB has its own language called the MQL, to perform basic
and advanced operations on the data. Apart from the basic CRUD operations, MongoDB
provides powerful aggregation functions, and different operators, to filter, sort, group, and
arrange data as per requirements, thereby improving the overall application performance. You
can write the queries from shell or in the UI, via MongoDB Compass or MongoDB Atlas.

Basic CRUD operations

While going through the article, you may choose to parallelly work the queries. For this,
you would either need to install mongo shell, MongoDB Compass, or connect to MongoDB
Atlas UI. We are using mongo shell for all the commands below.

To connect to a MongoDB cluster on Atlas, use the following command from your shell:

mongosh "mongodb+srv://sandbox.qurli.mongodb.net/" --apiVersion 1 --username <username>

o switch to or use a particular database, use the command:

use sample_training
To view the collections present in the database, use:

show collections
Read operation in MongoDB
To view the data present in the collection ‘grades’, use:

db.grades.find()
The find() method without any condition lists all the documents present in the collection.
However, to find a specific record, we can place a condition, like:

db.grades.find({"student_id" : 1})
This lists only those records that have ‘student_id’ field as 1.

To find record using multiple conditions, you can place multiple conditions using comma:

db.grades.find({"student_id" : 1, "class_id" : 265})

Insert operation in MongoDB

To insert a record, you can use the insertOne or insertMany command along with the data
you want to insert. Note that MongoDB does not place any restrictions on the data you can add,
so you have to make sure to add the right data in the right collections. Also, MongoDB also has
an ‘insert’ method, which is now deprecated.

db.grades.insertOne({

_id: 1,

student_id: 17,

scores: [

{ type: 'exam', score: 20 },

{ type: 'quiz', score: 58 },

{ type: 'homework', score: 84 },

{ type: 'homework', score: 94 }

class_id: 205

})
If the insert is successful, you will get a message as:

{ acknowledged: true, insertedId: 1 }

Update operation in MongoDB

Updating documents can be done in three ways in MongoDB:

 update() : updates one or many documents matching the criteria, based on the value set for the
boolean parameter ‘multi’. If multi is false, the query updates only one record.
 updateOne(): updates the first document found with the matching criteria
 updateMany(): updates all the documents that match the specified criteria

Suppose we want to update the exam score to 35, in the document that we inserted above. The
query is as follows:

db.grades.updateOne({_id: 1, "scores.type": 'exam'}, {$set:{'scores.$.score': 35}})

Note that the scores field is an array, so we use the dot operator to access the fields. To update
the value, we need to use the positional operator ($). If the document is updated successfully,
you will get an acknowledgement:

acknowledged: true,

insertedId: null,
matchedCount: 1,

modifiedCount: 1,

upsertedCount: 0

Note the parameter upsertedCount. MongoDB uses an optional parameter known as ‘upsert’, the
value of which can be a boolean (true/false). While writing the update query, if upsert is set to
true, the document is inserted if it doesn’t exist.

For example, let’s try to update a document with _id as 10000 (which doesn’t exist), with upsert
as true:

db.grades.update(

{ student_id: 10000 },

{ $set:

class_id: 180, "scores.type": "exam", "scores.score" : 50

{ upsert: true }

)
Here is the output:

acknowledged: true,

insertedId: ObjectId("64d5233b55eec96b68be4027"),

matchedCount: 0,

modifiedCount: 0,

upsertedCount: 1

The upsertedCount is 1, which means that MongoDB inserted a new record. We can check the
same using our find() command, with the condition below:

db.grades.find({"student_id" : 10000})

You can view the newly created document:

[

_id: ObjectId("64d5233b55eec96b68be4027"),

student_id: 10000,

class_id: 180,

scores: { score: 50, type: 'exam' }

Note that you cannot update the _id field, once created, as the _id field is immutable.

Delete operation in MongoDB

Similar to other operations, MongoDB provides deleteOne() and deleteMany() options.

The simple query to delete the document with _id: 1 is:

db.grades.deleteOne({'_id' : 1})

After successful delete, you will get an acknowledgement:

{ acknowledged: true, deletedCount: 1 }

MongoDB Map-Reduce:
MongoDB Map-Reduce is a data processing model that facilitates operations on large data
sets to produce aggregated results. It uses the mapReduce() function comprising map and
reduce functions to handle complex data transformations.
In this article, We will learn about MongoDB Map-Reduce by understanding various
examples, When to use and How to use Map Reduce in MongoDB in detail.
MongoDB Map-Reduce
 MongoDB Map-Reduce is a data processing programming model that helps to perform
operations on large data sets and produce aggregated results.
 MongoDB provides the mapReduce() function to perform the map-reduce operations.
This function has two main functions, i.e., map function and reduce function.
 The map function is used to group all the data based on the key-value and the reduce
function is used to perform operations on the mapped data.
 So, the data is independently mapped and reduced in different spaces and then combined
in the function and the result will be saved to the specified new collection.
 This mapReduce() function generally operates on large data sets. Using Map Reduce we
can perform aggregation operations such as max and avg on the data using some key and it
is similar to GroupBy in SQL.
Syntax
db.collectionName.mapReduce(... map(),...reduce(),...query{},...output{});
Parameters:
 map() function: It uses the emit() function in which it takes two parameters key and value
key. Here the key is on which we make groups like Group By in MySQL.
 the reduce() function: It is the step in which we perform our aggregate functions like
avg(), and sum().
 query: Here we will pass the query to filter the resultset.
 output: In this, we will specify the collection name where the result will be stored.
Map-Reduce is a powerful feature in MongoDB for aggregating data.
Steps to use Map Reduce in MongoDB
Look at this step-by-step guide to learn how to use MongoDB Map-Reduce. Let’s try to
understand the mapReduce() using the following example:
In this example, we have five records from which we need to take out the maximum marks of
each section and the keys are id, sec, marks.
{"id":1, "sec":A, "marks":80}
{"id":2, "sec":A, "marks":90}
{"id":1, "sec":B, "marks":99}
{"id":1, "sec":B, "marks":95}
{"id":1, "sec":C, "marks":90}
Here we need to find the maximum marks in each section. So, our key by which we will group
documents is the sec key and the value will be marks. Inside the map function, we use
emit(this.sec, this.marks) function, and we will return the sec and marks of each
record(document) from the emit function. This is similar to group By MySQL.
var map = function(){emit(this.sec, this.marks)};
After iterating over each document Emit function will give back the data like this:
{"A":[80, 90]}, {"B":[99, 90]}, {"C":[90] }
and upto this point it is what map() function does. The data given by emit function is grouped
by sec key, Now this data will be input to our reduce function. Reduce function is where actual
aggregation of data takes place. In our example we will pick the Max of each section like for
sec A:[80, 90] = 90 (Max) B:[99, 90] = 99 (max) , C:[90] = 90(max).
var reduce = function(sec,marks){return Array.max(marks);};
Here in reduce() function, we have reduced the records now we will output them into a new
collection.{out :”collectionName”}
db.collectionName.mapReduce(map,reduce,{out :"collectionName"});
In the above query we have already defined the map, reduce. Then for checking we need to
look into the newly created collection we can use the query db.collectionName.find() we get:
{"id":"A", value:90}
{"id":"B", value:99}
{"id":"C", value:90}
Examples of MongoDB Map Reduce
Let’s look at some examples of MongoDB map reduce function.
In this example, we are working with:
Database: geeksforgeeks2
Collection: employee
Documents: Six documents that contains the details of the employees
Output:

Example 1: Find the sum of ranks grouped by ages

Here, we will calculate the sum of rank present inside the particular age group. Now age is our
key on which we will perform group by (like in MySQL) and rank will be the key on which
we will perform sum aggregation.
Query:
var map=function(){ emit(this.age,this.rank)};
var reduce=function(age,rank){ return Array.sum(rank);};
db.employee.mapReduce(map,reduce,{out :"resultCollection1"});
Output:

Explanation:
 Inside map() function, i.e., map() : function map(){ emit(this.age,this.rank);}; we will write
the emit(this.age,this.rank) function. Here this represents the current collection being
iterated and the first key is age using age we will group the result like having age 24 give
the sum of all rank or having age 25 give the sum of all rank and the second argument is
rank on which aggregation will be performed.
 Inside the reduce function, i.e., reduce(): function reduce(key,rank){ return
Array.sum(rank); }; we will perform the aggregation function.
 Now the third parameter will be output where we will define the collection where the result
will be saved, i.e., {out :”resultCollection1″}. Here, out represents the key whose value is
the collection name where the result will be saved.
When to use Map-Reduce in MongoDB?
 In MongoDB, we can use Map-reduce when our aggregation query is slow because data is
present in a large amount and the aggregation query is taking more time to process.
 So using map-reduce we can perform action faster on large datasets than aggregation
query.
 Map-Reduce is useful for performing complex aggregation operations that are difficult or
inefficient to express using the aggregation pipeline.
 It provides more flexibility than the aggregation pipeline which allowing us to use
custom JavaScript functions to map, reduce, and finalize the data processing.

MangoDB Datatype
No ratings yet
MangoDB Datatype
12 pages
Lac Matrix ESP
100% (1)
Lac Matrix ESP
4 pages
DBMS
No ratings yet
DBMS
24 pages
SCREENING - DEVELOPMENTAL DELAY & AUTISM - CCBS DP Infant-Toddler Checklist
100% (2)
SCREENING - DEVELOPMENTAL DELAY & AUTISM - CCBS DP Infant-Toddler Checklist
9 pages
Business Cases and Benefits Management
100% (2)
Business Cases and Benefits Management
66 pages
Yellow Jambhala Cultivation Booklet PDF
No ratings yet
Yellow Jambhala Cultivation Booklet PDF
15 pages
European Pharmacopoeia 6 0 Vol 1 Evropeyskaya Farmakopeya 6 PDF
No ratings yet
European Pharmacopoeia 6 0 Vol 1 Evropeyskaya Farmakopeya 6 PDF
1,129 pages
U5 01 MongoDB
No ratings yet
U5 01 MongoDB
99 pages
Big Data Analytics
No ratings yet
Big Data Analytics
14 pages
Advanced Web Designing
No ratings yet
Advanced Web Designing
96 pages
Database-Technology Removed
No ratings yet
Database-Technology Removed
50 pages
BIS601 Module 5 Textbook
No ratings yet
BIS601 Module 5 Textbook
57 pages
Android Practical File (IT-602)
50% (2)
Android Practical File (IT-602)
59 pages
21 Mongo DB
No ratings yet
21 Mongo DB
104 pages
FSD Unit III
No ratings yet
FSD Unit III
22 pages
Mathematics: Sindh Textbook Board, Jamshoro
No ratings yet
Mathematics: Sindh Textbook Board, Jamshoro
176 pages
Mongodb Tutorial: Database Collection
No ratings yet
Mongodb Tutorial: Database Collection
36 pages
1664473609-Unit 5 - Database Management - MongoDB
No ratings yet
1664473609-Unit 5 - Database Management - MongoDB
23 pages
Questions
No ratings yet
Questions
70 pages
MongoDB Lecture 1
No ratings yet
MongoDB Lecture 1
37 pages
Mongodb
No ratings yet
Mongodb
49 pages
Big Data (Unit 3)
No ratings yet
Big Data (Unit 3)
46 pages
Module 5
No ratings yet
Module 5
32 pages
Lecture - 1 MongoDB
No ratings yet
Lecture - 1 MongoDB
41 pages
Unit 2 - Bda Notes
No ratings yet
Unit 2 - Bda Notes
37 pages
CHAPTER 6 MongoDB
No ratings yet
CHAPTER 6 MongoDB
53 pages
L48 - MongoDB
No ratings yet
L48 - MongoDB
31 pages
Tendernotice 5 PDF
No ratings yet
Tendernotice 5 PDF
148 pages
Mongo DBPlus
No ratings yet
Mongo DBPlus
40 pages
Corporate Acc Unit 4 MCQ
100% (1)
Corporate Acc Unit 4 MCQ
7 pages
Mongodb Tutorial
No ratings yet
Mongodb Tutorial
54 pages
NGT Unit 2 - 230630 - 094118
No ratings yet
NGT Unit 2 - 230630 - 094118
62 pages
Mongo DB
No ratings yet
Mongo DB
77 pages
UNIT-IV MongoDB
No ratings yet
UNIT-IV MongoDB
54 pages
Module - 6
No ratings yet
Module - 6
24 pages
02 - Document-Based and MongoDB
No ratings yet
02 - Document-Based and MongoDB
133 pages
NoSQL 24 Mongo P1
No ratings yet
NoSQL 24 Mongo P1
43 pages
Manual Basico Mongo DB
No ratings yet
Manual Basico Mongo DB
48 pages
Open Source For You - September 2014 in
No ratings yet
Open Source For You - September 2014 in
108 pages
NOSQL Lab Book
No ratings yet
NOSQL Lab Book
33 pages
Bda Unit 4
No ratings yet
Bda Unit 4
13 pages
Litere. Teme Licenta Engleza
No ratings yet
Litere. Teme Licenta Engleza
4 pages
Unit 1
No ratings yet
Unit 1
16 pages
Mongo DB
No ratings yet
Mongo DB
36 pages
Basics of Mongodb-Connectivity
No ratings yet
Basics of Mongodb-Connectivity
26 pages
MongoDB Notes - Madhu
No ratings yet
MongoDB Notes - Madhu
15 pages
Unit 4 (MongoDB)
No ratings yet
Unit 4 (MongoDB)
46 pages
Unit 3 Two Marks
No ratings yet
Unit 3 Two Marks
5 pages
Lecture 9 - MongoDB
No ratings yet
Lecture 9 - MongoDB
8 pages
Tvl11-He-Cookery Q1 M4 W4
No ratings yet
Tvl11-He-Cookery Q1 M4 W4
15 pages
Module 3
No ratings yet
Module 3
15 pages
What Is MongoDB - Introduction, Architecture, Features & Example
No ratings yet
What Is MongoDB - Introduction, Architecture, Features & Example
8 pages
Inventory Recording
No ratings yet
Inventory Recording
12 pages
Mongodb
No ratings yet
Mongodb
19 pages
DBMS MASTER: Become Pro in Database Management System
From Everand
DBMS MASTER: Become Pro in Database Management System
Ummed Singh
No ratings yet
Unit 5 - Nosql DB
No ratings yet
Unit 5 - Nosql DB
13 pages
Module 3 Mongodb
No ratings yet
Module 3 Mongodb
10 pages
Unit 5 - Chapter 2 - Introduction To MongoDB
No ratings yet
Unit 5 - Chapter 2 - Introduction To MongoDB
53 pages
Manual of Hyd Control Unit
No ratings yet
Manual of Hyd Control Unit
13 pages
Note #1 - Substantive Test of Cash
No ratings yet
Note #1 - Substantive Test of Cash
11 pages
Network Linkages
100% (2)
Network Linkages
3 pages
AIESL CAPABILITY (Group A) 1
No ratings yet
AIESL CAPABILITY (Group A) 1
314 pages
Mongo DB
No ratings yet
Mongo DB
20 pages
Mongo Assignment
No ratings yet
Mongo Assignment
4 pages
MongoDB CURD Operation
No ratings yet
MongoDB CURD Operation
16 pages
Nvs Teaching and Non Teaching Jobs 2019
No ratings yet
Nvs Teaching and Non Teaching Jobs 2019
4 pages
Greetings
No ratings yet
Greetings
16 pages
Intro To MongoDB
100% (1)
Intro To MongoDB
13 pages
Mongodb
No ratings yet
Mongodb
9 pages
MongoDB Quick Guide
No ratings yet
MongoDB Quick Guide
61 pages
Mean Stack Technologies Unit-5
No ratings yet
Mean Stack Technologies Unit-5
9 pages
History of Art by HW Janson, Vol 1 4th Ed (Art Ebook)
86% (66)
History of Art by HW Janson, Vol 1 4th Ed (Art Ebook)
444 pages
Mongo DB
No ratings yet
Mongo DB
9 pages
Commonly Chanted Ashram Mantras
No ratings yet
Commonly Chanted Ashram Mantras
3 pages
Mongodb Tutorial
No ratings yet
Mongodb Tutorial
15 pages
Marc h5, 2015: Proprietary and Confidential
No ratings yet
Marc h5, 2015: Proprietary and Confidential
26 pages
Mongodb - Quick Guide Mongodb Overview
No ratings yet
Mongodb - Quick Guide Mongodb Overview
18 pages
Kundalika and Small Rivers Full Report
No ratings yet
Kundalika and Small Rivers Full Report
32 pages
Mongo
No ratings yet
Mongo
7 pages
Mongo DB
No ratings yet
Mongo DB
31 pages
Current State of Fabrication Technologies and Materials Fo 2018 Acta Biomate
No ratings yet
Current State of Fabrication Technologies and Materials Fo 2018 Acta Biomate
30 pages
Database: Rdbms Mongodb
No ratings yet
Database: Rdbms Mongodb
7 pages
Cs Tcom Sincgars RT 1523 VHF Radio Datasheet
No ratings yet
Cs Tcom Sincgars RT 1523 VHF Radio Datasheet
2 pages
Chapter 2 3 Tomp
No ratings yet
Chapter 2 3 Tomp
3 pages
Xi SPL Computer SC Sample Paper
No ratings yet
Xi SPL Computer SC Sample Paper
12 pages
Hope 1 (Reveiwer)
No ratings yet
Hope 1 (Reveiwer)
2 pages
Social Re Serch Foundation Paper
No ratings yet
Social Re Serch Foundation Paper
6 pages
Homework 3 - Fallacies (RevS23)
No ratings yet
Homework 3 - Fallacies (RevS23)
2 pages
Suprativ Datta Atlassian Resume
No ratings yet
Suprativ Datta Atlassian Resume
1 page
Learn MongoDB in 24 Hours
From Everand
Learn MongoDB in 24 Hours
Alex Nordeen
5/5 (2)