Big Data
REPORT ON BIG DATA
BACHELOR OF ENGINEERING IN COMPUTER SCIENCE AND ENGINEERING
This assignment aims to provide a comprehensive understanding of Big Data Analytics through the
installation and demonstration of MongoDB, a widely-used big data tool. As part of the curriculum for
the 8th semester course Big Data Analytics (21CS71), the focus is on exploring MongoDB’s core
functionalities, architecture, and its role in handling and analyzing large volumes of unstructured and
semi-structured data.
The assignment begins with the successful installation and configuration of MongoDB on a local system,
followed by the creation and manipulation of databases and collections using real-time data samples. Core
data operations such as insertion, querying, updating, and deletion (CRUD) are performed using the
MongoDB shell and GUI tools like MongoDB Compass. This hands-on experience introduces students to
MongoDB’s data model, which uses flexible, JSON-like documents, facilitating scalable and efficient
data management suitable for big data environments.
In addition, the assignment explores MongoDB’s capabilities in distributed data processing and
storage, which are essential components of big data systems. Topics such as replication and sharding
are discussed to illustrate how MongoDB ensures high availability, fault tolerance, and horizontal
scalability. The concept of replica sets and the use of quorums in maintaining consistency across
distributed systems directly support Course Outcomes CO1 and CO2, which relate to understanding data
distribution and consistency in big data architectures.
The assignment also includes an overview of aggregation pipelines, indexing strategies, and
MongoDB’s support for parallel data processing, which collectively enhance performance in big data
analytics tasks. Though MapReduce is not the default processing model in MongoDB, its legacy support
is touched upon to align with CO3, introducing students to the fundamentals of parallel computation
within big data systems.
TABLE OF CONTENTS
1 INTRODUCTION
1.1. What is NoSQL?
1.2. What is MongoDB?
1.3. Problem Statement
2 INSTALLATION OF MONGODB
3 MONGODB OPERATIONS
4 CHALLENGES AND FUTURE OF NOSQL DATABASES
5 CONCLUSION
6 REFERENCES
CHAPTER 1
INTRODUCTION
NoSQL databases are designed to efficiently manage large volumes of data that may not
conform to the rigid structures of traditional relational databases. Unlike relational models that
store data in predefined tables with strict schemas, NoSQL systems provide more flexible data
storage options, allowing for rapid development and scalability.
The term “NoSQL” stands for “Not Only SQL,” indicating that these systems support a variety
of data models, including key-value, document, column-family, and graph formats. These
databases are optimized for high performance, horizontal scaling, and distributed data storage.
They are commonly used in applications where data structures frequently change, such as web
applications, social networks, real-time analytics, and Internet of Things (IoT) systems.
NoSQL databases also emphasize eventual consistency and fault tolerance, making them
suitable for distributed environments and big data use cases.
MongoDB is a document-oriented NoSQL database that stores data in BSON (Binary JSON)
format, allowing for dynamic, schema-less structures. It enables developers to work with
complex, nested data without needing to define rigid schemas in advance. This flexibility
makes MongoDB well-suited for applications with rapidly changing data models.
Key features of MongoDB include support for high availability through replica sets, horizontal
scalability via sharding, and robust query capabilities using both the MongoDB shell and GUI
tools like MongoDB Compass. It also offers powerful aggregation pipelines for data
processing and transformation.
In this assignment, MongoDB was installed and configured on a local system. The
demonstration includes creating databases and collections, performing CRUD (Create, Read,
Update, Delete) operations, and exploring the replication model. Particular attention was given
to MongoDB’s consistency model and the role of quorums in maintaining eventual
consistency across distributed systems.
By working with MongoDB, this assignment provides hands-on experience with a leading
NoSQL technology and reinforces theoretical concepts related to unstructured data
management, distributed databases, and modern application development.
CHAPTER 2
INSTALLATION OF MONGODB
This guide outlines the installation of MongoDB 8.0 Community Edition on supported 64-bit
Windows platforms using the MSI installation wizard. It includes basic configuration steps and
considerations for both service-based and manual operation modes.
• If Installed as a Service
MongoDB starts automatically after installation. Modify configuration via
<install dir>\bin\mongod.cfg if needed, then restart the service through
the Windows Services console.
• If Not Installed as a Service
Create the data directory (C:\data\db) and start MongoDB manually via the
command prompt:
"C:\Program Files\MongoDB\Server\8.0\bin\mongod.exe" --dbpath="C:\data\db"
Install mongosh, add it to your system PATH, and run mongosh.exe to connect to the
MongoDB instance.
Refer to the MongoDB documentation for information on CRUD operations and deployment
connections.
Additional Configuration
• bindIp Setting
By default, MongoDB binds to 127.0.0.1. Modify bindIp in the config file or
use --bind_ip to allow external connections. Ensure proper security measures are
in place before exposing MongoDB to public networks.
• PATH Environment Variable
Add C:\Program Files\MongoDB\Server\8.0\bin and the path to
mongosh to the system PATH to simplify command-line access.
• Upgrades
The .msi installer supports automatic upgrades within the same release series (e.g.,
8.0.1 to 8.0.2). For major version upgrades, reinstall MongoDB.
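The bindIp, port, and data-directory settings above live in the YAML configuration file mentioned earlier (<install dir>\bin\mongod.cfg). A minimal sketch follows; the paths, particularly the log location, are assumptions and should match your install:

```yaml
# Minimal mongod.cfg sketch; paths are illustrative.
storage:
  dbPath: "C:/data/db"           # data directory created during installation
net:
  bindIp: 127.0.0.1              # default; widen only with proper security in place
  port: 27017                    # MongoDB's default port
systemLog:
  destination: file
  path: "C:/data/log/mongod.log" # assumed log location
  logAppend: true
```

After editing the file, restart the MongoDB service from the Windows Services console for the changes to take effect.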
CHAPTER 3
MONGODB OPERATIONS
MongoDB uses documents stored inside collections. Each document is a JSON-like structure
(internally BSON – Binary JSON), allowing flexible, hierarchical data. Below are examples of
how to perform each CRUD operation in MongoDB.
In MongoDB, Create operations refer to inserting new documents into a collection. MongoDB
provides two primary methods: insertOne() for a single document and insertMany() for multiple documents.
Each document must follow the BSON (Binary JSON) structure, which allows for flexibility
in types (arrays, objects, booleans, etc.).
Basic Example:

db.students.insertOne({
  name: "Abhishek",
  age: 22,
  branch: "CSE"
})

A document can also contain arrays, nested objects, booleans, and dates:

db.students.insertOne({
  name: "Sneha",
  age: 21,
  branch: "AI & DS",
  marks: [88, 92, 95],
  address: {
    city: "Hyderabad",
    pincode: 500081
  },
  isActive: true,
  created_at: new Date()
})
Basic Example:

db.students.insertMany([
  { name: "Raj", age: 20, branch: "MECH" },
  { name: "Pooja", age: 22, branch: "ECE" },
  { name: "Vikram", age: 23, branch: "CIVIL" }
])

db.students.insertMany([
  { name: "Ananya", gender: "Female" },
  { name: "Amit", age: 25, department: "IT", isGraduated: false }
])
MongoDB collections are schema-less, which means each document can have different fields
or structures.
MongoDB automatically generates a unique _id for each document. However, you can assign
a custom _id if needed:
db.students.insertOne({
  _id: 101,
  name: "Ravi",
  age: 21
})
Note: Attempting to insert another document with the same _id will result in an error.
By default, if one document insertion fails (e.g., due to duplicate _id), the whole operation
stops. You can override this behavior using the ordered: false option:
db.students.insertMany([
  { _id: 1, name: "A" },
  { _id: 1, name: "B" }, // Duplicate _id, causes an error
  { _id: 2, name: "C" }
], { ordered: false }) // Other valid documents are still inserted
• Always include a created_at timestamp for tracking when the data was inserted.
• Use consistent field naming conventions.
• Validate data at the application level or using MongoDB’s schema validation.
• Avoid inserting large arrays or deeply nested structures unless necessary.
Read operations in MongoDB are used to retrieve documents from a collection. MongoDB
provides several methods for querying data, primarily using the find() and findOne()
methods. These operations support filtering, projections, sorting, pagination, and the use of
various query operators.
The find() method returns a cursor to the documents that match the query. If no query is
specified, all documents are returned.
Example:
db.students.find()
Find students older than 21 using the $gt operator:

db.students.find({ age: { $gt: 21 } })

The findOne() method returns the first matching document instead of a cursor:

db.students.findOne({ name: "Abhishek" })
D. Query Operators
• $eq – Equal to
• $ne – Not equal to
• $gt / $gte – Greater than / greater than or equal to
• $lt / $lte – Less than / less than or equal to
• $in – Matches any value in a given array

Examples:

db.students.find({ branch: { $eq: "CSE" } })
db.students.find({ age: { $in: [20, 21] } })
E. Sorting Results
Sorting can be done using the sort() method. Use 1 for ascending and -1 for descending.
db.students.find().sort({ age: -1 })
Use limit() and skip() for pagination or limiting the number of results.
Limit to 5 documents:

db.students.find().limit(5)

Skip the first 5 documents:

db.students.find().skip(5)

Combine skip() and limit() for pagination (second page of 5):

db.students.find().skip(5).limit(5)
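The skip()/limit() pattern generalizes to page-based access: skip (page - 1) * pageSize documents, then limit to pageSize. A plain-JavaScript sketch of that arithmetic (numbering pages from 1 is an assumption):

```javascript
// skip/limit pagination arithmetic; pages are numbered starting at 1.
function pageWindow(page, pageSize) {
  return { skip: (page - 1) * pageSize, limit: pageSize };
}

// Page 2 with 5 documents per page corresponds to .skip(5).limit(5)
console.log(pageWindow(2, 5)); // { skip: 5, limit: 5 }
```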
G. Counting Documents
You can count how many documents match a query using countDocuments().

Example:

db.students.countDocuments({ branch: "CSE" })
Update operations in MongoDB are used to modify existing documents in a collection. You
can update specific fields, add new fields, or even replace entire documents. MongoDB
provides methods such as updateOne(), updateMany(), and replaceOne().
Example:
db.students.updateOne(
  { name: "Abhishek" },
  { $set: { age: 23 } }
)

You can also use operators like $inc, $unset, $rename, etc.

Increment age by 1:

db.students.updateOne(
  { name: "Abhishek" },
  { $inc: { age: 1 } }
)
Example:
db.students.updateMany(
  { branch: "CSE" },
  { $set: { isEligible: true } }
)
This method replaces the entire document with the new one provided. Useful when you want
to overwrite all fields.
Example:
db.students.replaceOne(
  { name: "Sneha" },
  {
    name: "Sneha",
    age: 22,
    branch: "AI & DS",
    isActive: true
  }
)
Unset a field:
db.students.updateOne(
  { name: "Sneha" },
  { $unset: { isActive: "" } }
)

Rename a field:

db.students.updateOne(
  { name: "Ravi" },
  { $rename: { "branch": "department" } }
)
E. Upsert Option
With upsert: true, MongoDB inserts a new document built from the filter and the update if no
document matches the filter; otherwise it updates the matching document as usual.

Example:

db.students.updateOne(
  { name: "Karthik" },
  { $set: { age: 24, branch: "EEE" } },
  { upsert: true }
)
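To see what upsert changes, here is a plain-JavaScript sketch of the behavior over an in-memory array standing in for the students collection (an illustration, not the MongoDB API):

```javascript
// Sketch of updateOne(filter, { $set: fields }, { upsert: true }) semantics.
const students = [{ name: "Abhishek", age: 22, branch: "CSE" }];

function updateOneWithUpsert(docs, filter, setFields) {
  const doc = docs.find((d) =>
    Object.entries(filter).every(([k, v]) => d[k] === v)
  );
  if (doc) {
    Object.assign(doc, setFields);        // matched: apply the $set fields in place
    return { matchedCount: 1, upsertedCount: 0 };
  }
  docs.push({ ...filter, ...setFields }); // no match: insert filter + $set fields
  return { matchedCount: 0, upsertedCount: 1 };
}

// No student named "Karthik" exists, so a new document is inserted.
const res = updateOneWithUpsert(students, { name: "Karthik" }, { age: 24, branch: "EEE" });
console.log(res); // { matchedCount: 0, upsertedCount: 1 }
```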
Delete operations in MongoDB are used to remove documents from a collection. MongoDB
provides two primary methods for deletion: deleteOne() and deleteMany().

Example:

db.students.deleteOne({ name: "Abhishek" })

This will delete the first document where the name is "Abhishek."

If you want to remove all documents from a collection but not drop the collection itself, use
deleteMany() with an empty query.

Example:

db.students.deleteMany({})
The Aggregation Framework in MongoDB is a powerful tool that allows you to process data
records and return computed results. It is similar to SQL's GROUP BY and JOIN operations.
The aggregation pipeline is a series of stages that process data, transforming it into an
aggregated result.
A. Aggregation Pipeline
The aggregation pipeline is a sequence of stages, each performing a specific operation on the
input data, such as filtering, grouping, or sorting.
Basic Structure:
db.collection.aggregate([
  { stage1 },
  { stage2 },
  { stage3 }
])
Each stage transforms the data and passes it to the next stage.
db.students.aggregate([
  { $match: { branch: "CSE" } }
])

Filters the input to documents where branch is "CSE".

db.students.aggregate([
  { $group: { _id: "$branch", total_students: { $sum: 1 } } }
])

Groups students by branch and calculates the total number of students in each branch.

db.students.aggregate([
  { $sort: { age: -1 } }
])

Sorts students by age in descending order.

db.students.aggregate([
  { $project: { name: 1, age: 1, _id: 0 } }
])

Projects only the name and age fields, excluding the _id field.

db.students.aggregate([
  { $skip: 5 },
  { $limit: 5 }
])

Skips the first five documents and passes the next five down the pipeline, which is useful for pagination.
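To make the $group stage concrete, here is a plain-JavaScript sketch of what { $group: { _id: "$branch", total_students: { $sum: 1 } } } computes over a hypothetical in-memory students array (an illustration, not the MongoDB API):

```javascript
// Hypothetical documents standing in for the students collection.
const students = [
  { name: "Abhishek", branch: "CSE" },
  { name: "Raj", branch: "MECH" },
  { name: "Pooja", branch: "ECE" },
  { name: "Ravi", branch: "CSE" },
];

// Equivalent of { $group: { _id: "$branch", total_students: { $sum: 1 } } }
function groupByBranch(docs) {
  const totals = {};
  for (const doc of docs) {
    totals[doc.branch] = (totals[doc.branch] || 0) + 1; // $sum: 1 counts documents
  }
  // Shape the output like MongoDB's result documents.
  return Object.entries(totals).map(([branch, n]) => ({ _id: branch, total_students: n }));
}

console.log(groupByBranch(students));
```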
MongoDB’s $lookup operator allows you to perform left outer joins between collections.
db.orders.aggregate([
  {
    $lookup: {
      from: "products",
      localField: "product_id",
      foreignField: "_id",
      as: "product_details"
    }
  }
])
This query joins the orders collection with the products collection where product_id in the
orders collection matches the _id in the products collection.
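Conceptually, $lookup performs a left outer join: every order is kept, and orders without a matching product get an empty array. A plain-JavaScript sketch over hypothetical orders and products data:

```javascript
// Hypothetical collections; product_id 99 has no matching product.
const orders = [
  { _id: 1, product_id: 10 },
  { _id: 2, product_id: 99 },
];
const products = [{ _id: 10, name: "Laptop" }];

// Left-outer-join sketch of $lookup: all matching foreign documents are
// collected into the "as" array (empty when nothing matches).
function lookup(local, foreign, localField, foreignField, as) {
  return local.map((doc) => ({
    ...doc,
    [as]: foreign.filter((f) => f[foreignField] === doc[localField]),
  }));
}

const joined = lookup(orders, products, "product_id", "_id", "product_details");
console.log(joined[0].product_details, joined[1].product_details);
```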
db.students.aggregate([
  { $match: { branch: "CSE" } },
  { $count: "total_students_in_cse" }
])

Counts the documents that pass the $match stage and returns the result in a field named total_students_in_cse.
3.3 MapReduce
MapReduce in MongoDB is a powerful tool for performing complex data processing tasks that
require transforming and aggregating large datasets. It is based on the Map and Reduce
operations, which are typically used for parallel processing and can handle large-scale
operations.
A. Overview of MapReduce
• Map Function: The map function processes each document and outputs key-value
pairs.
• Reduce Function: The reduce function groups the results by keys and combines them
into a single output.
Basic Syntax:
db.collection.mapReduce(
  mapFunction,
  reduceFunction,
  { options }
)
Let's assume we have a sales collection, and we want to calculate the total sales per product.
The map function emits a key-value pair for each document, where the key is the product name,
and the value is the sale amount.
The reduce function sums the sale amounts for each product.
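The two functions described above can be sketched as follows. The field names product and amount are assumptions, and emit() is defined locally so the sketch runs as plain JavaScript; in mongosh, the server supplies emit and binds this to the current document.

```javascript
// Hypothetical documents standing in for the sales collection.
const sales = [
  { product: "Laptop", amount: 55000 },
  { product: "Mouse", amount: 500 },
  { product: "Laptop", amount: 60000 },
];

// In MongoDB, emit() is provided by the server; here we collect pairs ourselves.
const emitted = [];
function emit(key, value) {
  emitted.push([key, value]);
}

// Map function: emits (product, amount) for each document.
const mapFunction = function () {
  emit(this.product, this.amount);
};

// Reduce function: sums the sale amounts emitted for one product key.
const reduceFunction = function (key, values) {
  return values.reduce((sum, v) => sum + v, 0);
};

// Simulate the map phase over every document...
for (const doc of sales) mapFunction.call(doc);

// ...then the reduce phase: group emitted values by key, reduce each group.
const grouped = {};
for (const [key, value] of emitted) {
  (grouped[key] = grouped[key] || []).push(value);
}
const totals = Object.fromEntries(
  Object.keys(grouped).map((k) => [k, reduceFunction(k, grouped[k])])
);
console.log(totals); // { Laptop: 115000, Mouse: 500 }
```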
Running MapReduce:

db.sales.mapReduce(
  mapFunction,
  reduceFunction,
  { out: "total_sales_per_product" }
)

This will output the result into a new collection called total_sales_per_product, where each
document will contain the product name and the total sales amount.
C. Output Options
• out: Specifies where to store the output of the MapReduce operation. Options include:
  o A collection (out: "collection_name")
  o Inline results (out: { inline: 1 }), returned directly instead of being written to a collection
  o A merge into an existing collection (out: { merge: "existing_collection" })

db.sales.mapReduce(
  mapFunction,
  reduceFunction,
  { out: { inline: 1 } }
)
D. Limitations and Use Cases of MapReduce
• Performance: MapReduce can be slower than aggregation operations for many use
cases. It requires writing results to disk, which can be inefficient for large datasets.
• Custom Reductions: When you need to define custom logic for combining values,
MapReduce provides flexibility that aggregation may not offer.
• Parallel Processing: For large-scale data processing that can be split into smaller tasks,
MapReduce remains helpful.
3.5 Indexing
A. Types of Indexes
1. Single Field Index: The simplest type of index, created on a single field.
2. Compound Index: An index on multiple fields, which is useful when queries involve
multiple fields.
3. Text Index: Used for text search on string fields.
4. Hashed Index: Primarily used for sharded collections to ensure the distribution of data
across shards.
5. Geospatial Index: Used for queries on location-based data.
6. Wildcard Index: Allows indexing on all fields in a document.
B. Creating an Index
Syntax:
db.students.createIndex({ name: 1 })
This index will allow for faster queries based on the name field in the students collection.
A compound index involves multiple fields. This is useful when you frequently query on
multiple fields together.

db.students.createIndex({ branch: 1, age: -1 })

This index will improve the performance of queries that filter by branch and sort by age in
descending order.
C. Dropping an Index
If an index is no longer needed, it can be dropped to save space and improve performance on
write operations.
Drop a single index by name:

db.students.dropIndex("index_name")

Drop all indexes on the collection (the default _id index is kept):

db.students.dropIndexes()
D. Indexing Strategies
• Use Indexes for Frequently Queried Fields: Index fields that are frequently used in
queries, particularly in find(), sort(), update(), or delete() operations.
• Limit the Number of Indexes: Each index consumes memory, and excessive indexes
can slow down write operations. Keep the number of indexes manageable.
• Use Compound Indexes When Applicable: Compound indexes are particularly useful
when multiple fields are used together in queries. They avoid the need for multiple
single-field indexes.
• Consider Indexing for Sorting: If queries often involve sorting on specific fields,
creating an index on those fields can speed up the sorting process.
• Analyze Query Performance: Use MongoDB's explain() method to understand how
indexes are being used in queries and whether performance improvements are needed.
Consider a query that filters by branch and age:

db.students.find({ branch: "CSE", age: 20 })

Without an index on branch and age, MongoDB will need to scan the entire collection.
Creating a compound index:

db.students.createIndex({ branch: 1, age: 1 })

This index allows MongoDB to quickly locate the documents where branch is "CSE" and age
is 20, improving performance.
CHAPTER 4
CHALLENGES AND FUTURE OF NOSQL DATABASES
While NoSQL databases like MongoDB offer several advantages over traditional relational
databases, such as flexibility in handling unstructured data and scalability, they come with their
own set of challenges:
• Consistency: replication and eventual consistency make strong guarantees harder to provide in distributed deployments.
• Complex querying: some operations that are straightforward in SQL require more effort, and joins are limited.
• Learning curve: developers transitioning from SQL databases must adopt a new data model and query language.
Despite these challenges, the future of NoSQL databases like MongoDB looks promising due
to several emerging trends and advancements:
• Improved multi-document transaction support.
• Tighter integration with cloud-native and serverless architectures.
• Support for time-series data and growing integration with artificial intelligence, machine learning, and big data analytics.
CHAPTER 5
CONCLUSION
In conclusion, NoSQL databases, especially MongoDB, have revolutionized the way data is
stored and processed, offering solutions for modern applications that demand scalability,
flexibility, and high performance. Unlike traditional relational databases, MongoDB’s
document-oriented model allows developers to store data in a more natural, hierarchical format,
making it an ideal choice for handling unstructured and semi-structured data.
MongoDB's strength lies in its ability to scale horizontally, distribute data across multiple
nodes, and provide high availability and fault tolerance. Its flexible schema design allows rapid
iteration and agile development, while its built-in features like the aggregation framework,
MapReduce, and indexing significantly improve query performance and data processing
capabilities.
Despite these advantages, MongoDB is not without its challenges. Issues such as consistency
in distributed systems, complex querying, and the learning curve for developers transitioning
from SQL databases are areas that continue to pose difficulties. However, MongoDB's
continued advancements—particularly in transaction support, improved querying capabilities,
and integration with cloud-native and serverless architectures—are addressing these limitations
and expanding its use cases.
Looking ahead, MongoDB’s integration with emerging technologies like artificial intelligence,
machine learning, and big data analytics will further enhance its utility, enabling businesses to
harness the power of large datasets. The support for graph databases and time-series data,
combined with hybrid models that blend the best of both SQL and NoSQL, ensures that
MongoDB will remain a critical tool in the developer's toolkit.
As the demand for real-time data processing, high availability, and cloud-based applications
continues to rise, MongoDB's role as a leading NoSQL database will only grow. Its growing
ecosystem, improved features, and vast community support make it a powerful choice for
modern applications across industries such as e-commerce, social media, IoT, finance, and
healthcare.
In summary, MongoDB, with its unique features and growing capabilities, provides a robust
solution to the data management challenges of modern applications. Its continuous evolution
ensures that it will remain at the forefront of the NoSQL revolution, empowering developers
and businesses to build scalable, high-performance systems.
CHAPTER 6
REFERENCES
• MongoDB Documentation. (n.d.). MongoDB Manual. MongoDB, Inc. Retrieved from
https://docs.mongodb.com/
• Chodorow, K. (2013). MongoDB: The Definitive Guide. O'Reilly Media.
• Giamas, A. (2017). Mastering MongoDB: The Complete Guide to MongoDB
Development and Administration. Packt Publishing.
• MongoDB Atlas Documentation. (n.d.). MongoDB Atlas: Managed MongoDB in the
Cloud. MongoDB, Inc. Retrieved from https://www.mongodb.com/cloud/atlas
• Rhys, C. (2020). MongoDB in Action. Manning Publications.
• Finkel, H. (2015). Learning MongoDB: A Hands-on Guide to Building Applications
with MongoDB. Packt Publishing.
• O'Reilly Media. (2015). Learning MongoDB. Retrieved from
https://www.oreilly.com/library/view/learning-mongodb/9781785884334/
• Grolinger, K., Hughes, K., & Buckley, K. (2013). Data Management in the Cloud:
Challenges and Opportunities. International Journal of Cloud Computing and
Services Science (IJCCSS), 2(3), 1-18.
• Nunn, M., & Denny, M. (2017). Practical MongoDB: Architecting, Developing, and
Administering MongoDB. Apress.