Module-3
Syllabus :
I. MongoDB :
• Querying and Indexing: MongoDB supports rich query capabilities, including ad-hoc
queries, indexing, and aggregation pipelines. It has a flexible query language that allows
for complex searches and supports various types of indexes for optimizing query
performance.
• High Availability and Fault Tolerance: MongoDB provides built-in replication,
which allows data to be automatically synchronized across multiple servers, ensuring
data redundancy and high availability. In case of server failures, MongoDB can
seamlessly failover to a replica set member.
• Flexible Data Model: MongoDB's document-oriented data model allows for flexible
and dynamic schemas, accommodating evolving data structures without requiring
downtime or complex migrations. It is well-suited for scenarios where data has varied
attributes or evolving requirements.
• Integration and Ecosystem: MongoDB has a rich ecosystem with support for various
programming languages and frameworks. It provides drivers and connectors for
popular programming languages, making it easy to integrate MongoDB with
applications.
RDBMS vs. MongoDB
Row vs. Document: In an RDBMS, each row in a table represents a single record with a fixed set of columns and data types. In MongoDB, each document in a collection is a JSON-like object with a dynamic schema, allowing for varying fields within the same collection.
Column vs. Field: In an RDBMS, columns represent the individual data attributes or fields of a table. In MongoDB, fields are the equivalent data attributes of a document in a collection.
JOIN: In an RDBMS, a JOIN operation combines rows from two or more tables based on a related column between them. MongoDB does not support JOINs directly since it follows a denormalized data model; instead, data is often embedded within documents to avoid JOIN operations.
• A MongoDB query is a way to get data from the MongoDB database. MongoDB queries make the process of fetching data from the database simple, and they are similar in purpose to SQL queries in an SQL database.
• While performing a query operation, one can also use criteria or conditions to retrieve specific data from the database.
• MongoDB stores data as documents made up of field:value pairs rather than in tabular form.
• It stores data in BSON (Binary JSON) format, which is structured just like JSON. A sample document looks like this:
{
    "_id" : ObjectId("6009585d35cce6b7b8f087f1"),
    "title" : "Math",
    "author" : "Aditya",
    "level" : "basic",
    "length" : 230,
    "example" : 11
}
example
> use sampleDB
switched to db sampleDB
> db.createCollection("mycol")
{ "ok" : 1 }
>
1. Find:
The find() method is used to retrieve data from a collection based on specified criteria.
You can specify the filter conditions as a JSON object in the find() method to retrieve
documents that match the criteria.
Syntax
> db.COLLECTION_NAME.find()
• > db.mycol.find()
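For example, filter conditions go inside find() as a document; assuming the documents carry a "by" field (illustrative, not from the source), the following returns only the matching documents:
> db.mycol.find({"by": "tutorials point"})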
pretty()
To display the results in a formatted way, you can use pretty() method.
Syntax
>db.COLLECTION_NAME.find().pretty()
• > db.mycol.find().pretty()
findOne()
o Apart from the find() method, there is findOne() method, that returns only one
document.
Syntax
o >db.COLLECTION_NAME.findOne()
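For example, assuming a "title" field, the following returns the first matching document:
> db.mycol.findOne({"title": "Math"})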
Equal filter query
o The equality operator($eq) is used to match the documents where the value of the field
is equal to the specified value. In other words, the $eq operator is used to specify the
equality condition.
Syntax:
{ <field>: { $eq: <value> } }
Example :
o >db.article.find({author:{$eq:"devil"}}).pretty()
Comparison Operators:
MongoDB supports various comparison operators such as $eq, $ne, $gt, $lt, $gte, and $lte to
compare values in queries.
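As an illustration, assuming documents with a numeric "length" field (as in the sample document above), comparison operators can be combined in a single filter:
> db.mycol.find({"length": {$gt: 100, $lte: 230}})
This matches documents whose length is greater than 100 and at most 230.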
Logical Operators:
MongoDB provides logical operators like $and, $or, and $not to combine multiple conditions
in a query.
AND in MongoDB
In MongoDB, the AND logical operator ($and) is used to combine multiple conditions in a query to retrieve documents that satisfy all of the specified conditions simultaneously.
Syntax
{ $and: [ { <expression1> }, { <expression2> }, ... ] }
The following example will show all the tutorials written by 'tutorials point' and whose title is 'MongoDB Overview' (assuming the author is stored in a "by" field):
> db.mycol.find({$and: [{"by": "tutorials point"}, {"title": "MongoDB Overview"}]}).pretty()
OR in MongoDB
In MongoDB, the OR logical operator ($or) is used to retrieve documents that satisfy at least one of the specified conditions.
Syntax
{ $or: [ { <expression1> }, { <expression2> }, ... ] }
The following example will show all the tutorials written by 'tutorials point' or whose title is 'MongoDB Overview':
> db.mycol.find({$or: [{"by": "tutorials point"}, {"title": "MongoDB Overview"}]}).pretty()
NOT in MongoDB
In MongoDB, the NOT logical operator ($not) inverts the effect of an operator expression and retrieves documents that do not match it.
Syntax
{ <field>: { $not: { <operator-expression> } } }
The following example will retrieve the document(s) whose age is not greater than 25 (assuming an "age" field):
> db.mycol.find({"age": {$not: {$gt: 25}}})
NOR operator
In MongoDB, the NOR logical operator is used to combine multiple conditions in a query to
retrieve documents that do not satisfy any of the specified conditions.
Syntax
{ $nor: [ { <expression1> }, { <expression2> }, ... ] }
• Now, let's say we want to find products that do not have the name "Product A" and whose price is not less than 20. Assuming a products collection with "name" and "price" fields, we can use the $nor operator as follows:
> db.products.find({$nor: [{"name": "Product A"}, {"price": {$lt: 20}}]})
Projection:
Projection specifies which fields to include or exclude from the query result. It is passed as the second argument to the find() method (in the aggregation pipeline, the equivalent stage is $project), with 1 marking a field for inclusion and 0 for exclusion.
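For example, the following returns only the title of each document, suppressing the default _id field:
> db.mycol.find({}, {"title": 1, "_id": 0})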
Sorting:
The sort() method is used to sort the query result based on one or more fields.
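For example, 1 selects ascending and -1 descending order:
> db.mycol.find().sort({"title": -1})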
Limiting:
The limit() method is used to restrict the number of documents returned by a query.
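For example, to return at most two documents:
> db.mycol.find().limit(2)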
Aggregation:
MongoDB supports the Aggregation Framework, which allows you to perform advanced data
processing and transformation tasks, including grouping, filtering, and computing aggregate
functions.
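As a small sketch (assuming each document stores its author in a "by" field, which is illustrative), the following pipeline counts documents per author:
> db.mycol.aggregate([{$group: {_id: "$by", total: {$sum: 1}}}])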
Insert:
To insert data into a collection, you use the insertOne() or insertMany() method.
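For example:
> db.mycol.insertOne({"title": "MongoDB Overview", "by": "tutorials point"})
> db.mycol.insertMany([{"title": "NoSQL Basics"}, {"title": "Aggregation"}])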
Update:
The updateOne() and updateMany() methods are used to update existing documents in a
collection based on specified criteria.
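For example, $set updates only the listed fields of the first matching document:
> db.mycol.updateOne({"title": "MongoDB Overview"}, {$set: {"level": "advanced"}})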
Delete:
To delete documents from a collection, you can use the deleteOne() or deleteMany() method.
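For example:
> db.mycol.deleteOne({"title": "MongoDB Overview"})
> db.mycol.deleteMany({"level": "basic"})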
The MongoDB query language is designed to be intuitive and easy to use. It provides a wide
range of capabilities for retrieving and manipulating data, making it suitable for various use
cases, including real-time applications, analytics, and big data processing. MongoDB's query
language allows developers to efficiently work with both simple and complex data structures,
providing the flexibility needed for modern data-driven applications.
RDBMS Vs MongoDB

Retrieving Data
  RDBMS:   SELECT column1, column2 FROM table_name WHERE condition;
  MongoDB: db.collection_name.find({ field1: value1, field2: value2 }, { field1: 1, field2: 1 });

Inserting Data
  RDBMS:   INSERT INTO table_name (column1, column2) VALUES (value1, value2);
  MongoDB: db.collection_name.insertOne({ field1: value1, field2: value2 });

Updating Data
  RDBMS:   UPDATE table_name SET column1 = value1, column2 = value2 WHERE condition;
  MongoDB: db.collection_name.updateOne({ field: value }, { $set: { field1: value1, field2: value2 } });

Deleting Data
  RDBMS:   DELETE FROM table_name WHERE condition;
  MongoDB: db.collection_name.deleteOne({ field: value });

Aggregation (Grouping and Aggregating Data)
  RDBMS:   SELECT column1, COUNT(column2) FROM table_name GROUP BY column1;
  MongoDB: db.collection_name.aggregate([
             { $group: { _id: "$field1", count: { $sum: 1 } } },
             { $project: { _id: 0, field1: "$_id", count: 1 } }
           ]);

Sorting
  RDBMS:   SELECT column1, column2 FROM table_name ORDER BY column1 DESC, column2 ASC;
  MongoDB: db.collection_name.find({}).sort({ field1: -1, field2: 1 });

Limiting
  RDBMS:   SELECT column1, column2 FROM table_name LIMIT 10;
  MongoDB: db.collection_name.find({}).limit(10);
MapReduce: Mapper – Reducer – Combiner – Partitioner – Searching – Sorting –
Compression
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
The Mapper class takes the input, tokenizes it, maps it, and sorts it. The output of the Mapper class is used as input by the Reducer class, which in turn searches for matching pairs and reduces them.
MapReduce implements various mathematical algorithms to divide a task into small parts and assign them to multiple systems. In technical terms, the MapReduce algorithm helps in sending the Map and Reduce tasks to appropriate servers in a cluster.
Mapper:
Input: The input data is divided into smaller chunks called input splits.
Mapper Function: The Mapper is responsible for processing these input splits independently.
It applies a user-defined function to each input split to produce a set of intermediate key-
value pairs. The key is typically used for grouping and sorting in the next phase (Shuffle and
sort).
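A minimal Hadoop Mapper sketch for illustration, assuming hypothetical input lines of the form "name,salary" (the class name and record layout are not from the source):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits one <employee name, salary> pair per input line such as "gopal,50000"
public class SalaryMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split(",");
        if (parts.length == 2) { // skip malformed lines
            context.write(new Text(parts[0].trim()),
                    new IntWritable(Integer.parseInt(parts[1].trim())));
        }
    }
}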
Combiner:
Optional: The Combiner is an optional intermediate step that can be used to reduce the volume
of data transferred between the Mapper and Reducer phases. It performs a local aggregation of
the output from Mappers on each node before sending it to the Reducer. This is particularly
useful when the same key appears multiple times in the Mapper output.
The primary purpose of the combiner is to reduce the volume of data that needs to be transferred between the Mapper and the Reducer, improving the overall efficiency of the MapReduce job.
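Because taking a maximum is associative and commutative, the Reducer class itself can usually be reused as the combiner. A sketch of the driver-side wiring, assuming the hypothetical SalaryReducer sketched in the Reducer section below:

job.setCombinerClass(SalaryReducer.class); // local per-node aggregation before the shuffle

Note that Hadoop may run the combiner zero, one, or several times, so the job must produce correct results even if the combiner never runs.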
Partitioner:
Partitioning Logic: The Partitioner decides which Reducer will receive each intermediate key-
value pair from the Mapper. It is responsible for ensuring that all key-value pairs for a given
key are sent to the same Reducer. This is crucial for correct aggregation and computation during
the Reducer phase.
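By default, Hadoop uses HashPartitioner; the following sketch mirrors its standard behavior (a simplified illustration, not the source's code):

import org.apache.hadoop.mapreduce.Partitioner;

// Routes each key to a reducer; equal keys always land on the same reducer
public class SketchPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}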
Reducer:
Grouping and Aggregation: The Reducer takes the intermediate key-value pairs produced by
the Mappers, groups them by key (based on the output of the Partitioner), and then applies a
user-defined Reduce function to each group. The result is typically written to an output file.
Shuffle and Sort: After the Mapper phase, there's a shuffle and sort step where the
MapReduce framework sorts and groups the intermediate key-value pairs before sending
them to the Reducers. This ensures that a Reducer processes all values associated with the same key together.
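A matching Reducer sketch for the hypothetical salary example: all salaries observed for one employee name arrive together, and keeping only the maximum also eliminates duplicate records:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives <name, [salary, salary, ...]> and emits <name, max salary>
public class SalaryReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable v : values) {
            max = Math.max(max, v.get());
        }
        context.write(key, new IntWritable(max));
    }
}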
Commonly implemented MapReduce algorithms include:
• Sorting
• Searching
• Indexing
• TF-IDF
Sorting:
Sorting is one of the basic MapReduce algorithms used to process and analyze data. MapReduce implements a sorting algorithm to automatically sort the output key-value pairs from the mapper by their keys.
Searching:
Searching plays an important role in MapReduce algorithm. It helps in the combiner phase
(optional) and in the Reducer phase. Let us try to understand how Searching works with the
help of an example.
Compression :
Data Compression: To optimize data transfer and storage, MapReduce frameworks often
support data compression. Intermediate data and output data can be compressed to reduce
disk I/O and network bandwidth usage.
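As a sketch of how this looks in Hadoop (the codec choices are illustrative; SnappyCodec needs the native Snappy library, and GzipCodec is a common alternative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate map output to cut shuffle traffic
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);
        Job job = Job.getInstance(conf, "max-salary");
        // Compress the final job output to save disk space
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    }
}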
Example
The following example shows how MapReduce employs the searching algorithm to find the details of the employee who draws the highest salary in a given employee dataset.
Let us assume we have employee data in four different files: A, B, C, and D. Let us also assume there are duplicate employee records in all four files because the employee data was imported repeatedly from all the database tables.
The Map phase processes each input file and provides the employee data in key-value pairs (<k, v> : <employee name, salary>).
The combiner phase (searching technique) will accept the input from the Map phase as key-value pairs of employee name and salary. Using the searching technique, the combiner will check all the employee salaries to find the highest-salaried employee in each file.
Reducer phase − From each file, you will find the highest-salaried employee. To avoid redundancy, check all the <k, v> pairs and eliminate duplicate entries, if any. The same algorithm is used between the four <k, v> pairs coming from the four input files. The final output should be as follows:
<gopal, 50000>
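A driver sketch that wires this example together, using the hypothetical SalaryMapper and SalaryReducer classes sketched earlier (files A, B, C, and D would all sit in the input directory):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxSalaryJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max-salary");
        job.setJarByClass(MaxSalaryJob.class);
        job.setMapperClass(SalaryMapper.class);
        job.setCombinerClass(SalaryReducer.class); // per-file/per-node maxima
        job.setReducerClass(SalaryReducer.class);  // global maxima, duplicates removed
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

As written, this job yields the highest salary recorded for each employee; collapsing the result to the single overall winner (the <gopal, 50000> output above) would take one more step, for example mapping every record to a single constant key.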
Indexing
Normally, indexing is used to point to a particular piece of data and its address. It performs batch indexing on the input files for a particular Mapper. The indexing technique normally used in MapReduce is known as an inverted index. Search engines like Google and Bing use the inverted indexing technique. Let us try to understand how indexing works with the help of a simple example.
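As a small illustration (the documents are invented for this sketch): given T[0] = "it is what it is", T[1] = "what is it", and T[2] = "it is a banana", an inverted index maps each term to the documents that contain it, for example "banana" → {2}, "what" → {0, 1}, and "is" → {0, 1, 2}. In MapReduce terms, the Mapper emits <term, document id> pairs, and the Reducer collects, for each term, the list of documents in which it appears.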