Unit 3 Chap2

MongoDB Use Cases

USE CASE-1
In this use case, the focus is on storing and retrieving performance monitoring data in MongoDB.
This involves designing a schema that efficiently stores the data collected from a monitoring tool and
performing operations like data insertion with different levels of data safety and performance. Let's
break down the explanation in more detail:
Schema Design and Optimization
The data being collected from the monitoring tool is in CSV format, and it contains parameters like
Host, Timestamp, ParameterName, and Value. These parameters need to be stored in MongoDB in a
structured and optimized format.

Here’s an example of how a line from the log file might look:
Node UUID | IP Address | Node Name | MIB | Time Stamp (ms) | Metric Value
3beb1a8b-040d-4b46-932a | 10.161.1.73 | corp_xyz_sardar | IFU | 1369221223384 | 0.2
Instead of storing this line as a text string, which would require regular expression searches (slow and
inefficient), you can break it down into individual fields and store it in a more structured and
optimized way.
Document Structure Example:
{
  _id: ObjectId(...),
  Host: "corp_xyz_sardar",
  Time: ISODate("2015-07-15T13:55:36Z"),
  ParameterName: "CPU",
  Value: 0.2
}
In this structure:
Host: Represents the host name or IP of the server being monitored.
Time: The timestamp of the log data stored in the ISODate format for efficient querying.
ParameterName: The metric being monitored (e.g., CPU usage, Memory, etc.).
Value: The value of the monitored metric.
Why Use Correct Data Types?
By using the correct data types, you:
Optimize Storage: For instance, using ISODate (8 bytes) instead of storing the date as a string (28
bytes).
Facilitate Efficient Queries: Using ISODate allows efficient querying for date ranges rather than
performing string-based comparisons.
Operations
Once the schema is designed, the next step is to insert and retrieve the data. You can perform insert
operations with different levels of write concerns depending on your application's needs for
performance and data safety.

Insert Operations with Write Concerns


Write concern determines the level of acknowledgment you want from MongoDB after an insert
operation. The higher the write concern, the more guarantees you get about the success of the write,
but this comes at the cost of speed.

Fastest Insertion - No Acknowledgment (w=0)


This command inserts data without waiting for any acknowledgment from the server. It is the fastest
but risks data loss, as you have no confirmation of success.
db.perfpoc.insert({Host: "Host1", GeneratedOn: new ISODate("2015-07-15T12:02Z"),
ParameterName: "CPU", Value: 13.13}, {w: 0})
Use Case: Suitable when you prioritize speed and are okay with potential data loss, like during high-
frequency performance data collection where every data point may not be critical.
Basic Acknowledgment - Write Concern (w=1)
Here, MongoDB will acknowledge that it has received the data and saved it. However, it doesn’t
ensure the data is journaled or replicated yet, meaning some data loss is still possible.
db.perfpoc.insert({Host: "Host1", GeneratedOn: new ISODate("2015-07-15T12:07Z"),
ParameterName: "CPU", Value: 13.23}, {w: 1})
Use Case: Appropriate when you want a confirmation that data is being stored but don't need strong
guarantees of durability.
Increased Safety - Journaling and Replication (j=true, w=2)
This option ensures not only that the data is saved, but also that it is journaled and replicated to at
least one other node in the replica set. This provides higher durability at the expense of some speed.
db.perfpoc.insert({Host: "Host1", GeneratedOn: new ISODate("2015-07-15T12:09Z"),
ParameterName: "CPU", Value: 30.01}, {j: true, w: 2})
Use Case: Use this when you require stronger data safety guarantees, such as in a production
environment where losing data is not an option.
Bulk Insert:
Real-Life Example:
Imagine you are monitoring server performance across a company’s infrastructure with hundreds of
servers. Every second, data is being generated about CPU usage, memory, disk I/O, etc., across all
servers. If you inserted every single data point one by one into your MongoDB database, it would
take much longer, especially when using strict write concerns that ensure data is safely written.
Instead of inserting each performance event individually, you group the data points together and
perform a bulk insert. This method reduces the number of database write operations, improving
performance since the database can handle groups of data in one go.
For example, suppose you collect data from 100 servers every second and group them together in
bulk inserts:
db.perfpoc.insertMany([
  { Host: "Server1", GeneratedOn: ISODate("2024-08-01T12:00:00Z"), ParameterName: "CPU", Value: 55 },
  { Host: "Server2", GeneratedOn: ISODate("2024-08-01T12:00:00Z"), ParameterName: "CPU", Value: 45 },
  { Host: "Server3", GeneratedOn: ISODate("2024-08-01T12:00:00Z"), ParameterName: "Memory", Value: 70 },
  // and so on
]);
This method amortizes the per-operation overhead (network round trips and acknowledgment waits) across many documents, improving insertion efficiency for large amounts of data.
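The batching step itself can be sketched in plain JavaScript; the `chunk` helper below is a hypothetical illustration of grouping collected data points into fixed-size batches, each of which would then go to a single insertMany call.

```javascript
// Split an array of collected data points into batches of `size`,
// so each batch can be handed to one insertMany() call.
function chunk(points, size) {
  const batches = [];
  for (let i = 0; i < points.length; i += size) {
    batches.push(points.slice(i, i + size));
  }
  return batches;
}

// Example: 250 data points batched 100 at a time -> 3 batches.
const points = Array.from({ length: 250 }, (_, i) => ({
  Host: "Server" + (i % 100),
  ParameterName: "CPU",
  Value: Math.random() * 100,
}));
const batches = chunk(points, 100);
console.log(batches.length);    // 3
console.log(batches[2].length); // 50
```

Each batch here would be passed to `db.perfpoc.insertMany(batch)` in the shell or driver of your choice.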

2. Querying Performance Data


After storing the performance data, the key to making use of this data lies in efficient querying.
Query 1: Fetching Data for a Particular Host
You might want to analyze how a specific server, say "Server1", has been performing over time.
Example Query:
db.perfpoc.find({ Host: "Server1" });
This query retrieves all performance data for the host "Server1". However, as the database grows with
thousands of records, querying by just the host might slow down without optimization. Hence,
creating an index on the Host field will speed things up:
db.perfpoc.createIndex({ Host: 1 });
By indexing Host, MongoDB can quickly find and return the relevant records instead of scanning the
entire collection.

Query 2: Fetching Data for a Date Range


If you want to analyze performance data within a specific time frame, such as between July 10, 2024,
and July 20, 2024, the query would look like this:
db.perfpoc.find({
  GeneratedOn: { "$gte": ISODate("2024-07-10"), "$lte": ISODate("2024-07-20") }
});
Again, to optimize this query, you should create an index on the GeneratedOn field:
db.perfpoc.createIndex({ GeneratedOn: 1 });
This ensures that the query can quickly locate the documents within the specified date range without
scanning unnecessary records.

Query 3: Combining Multiple Conditions


Now, suppose you want to analyze how "Server1" performed during a specific period. You can
combine both the host and the date range in a single query:
db.perfpoc.find({
  GeneratedOn: { "$gte": ISODate("2024-07-10"), "$lte": ISODate("2024-07-20") },
  Host: "Server1"
});
Here, using a compound index (an index on multiple fields) can greatly enhance performance.
However, the order of the fields in the index matters.
For instance, if you are more frequently querying by date and then by host, you would create a
compound index like this:
db.perfpoc.createIndex({ GeneratedOn: 1, Host: 1 });
However, if your queries are more likely to focus on the host first, you would reverse the order:
db.perfpoc.createIndex({ Host: 1, GeneratedOn: 1 });
By running explain("allPlansExecution") on your queries, you can observe how MongoDB is using
these indexes, and decide which indexing strategy offers the best performance based on your query
patterns.
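To compare the candidate plans for the combined query, run explain against a live deployment; "allPlansExecution" is a verbosity mode that reports execution statistics for every plan the optimizer considered:

```javascript
// Mongo shell fragment (requires a running deployment with the
// candidate indexes created): compare plans for the combined query.
db.perfpoc.find({
  GeneratedOn: { $gte: ISODate("2024-07-10"), $lte: ISODate("2024-07-20") },
  Host: "Server1"
}).explain("allPlansExecution")
```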

3. Aggregation and Counting: Practical Use Case


Real-Life Scenario:
Suppose you want to analyze how many performance events were logged for each server in a
particular month. This could be useful to assess if certain servers are generating more issues than
others.
You would use MongoDB's aggregation framework to group data by month and count the records:
db.perfpoc.aggregate([
  { $project: { month_joined: { $month: "$GeneratedOn" } } },
  { $group: { _id: { month_joined: "$month_joined" }, count: { $sum: 1 } } },
  { $sort: { "_id.month_joined": 1 } }
]);
This query projects a new field month_joined that extracts the month from the GeneratedOn date
field, groups the documents by this month, and counts how many records belong to each month.
Finally, it sorts the results by month.
Optimization:
Because this pipeline begins with a $project stage and contains no filter, the existing index on GeneratedOn does not cover the grouping itself; to benefit from the index, add a leading $match stage that restricts GeneratedOn to the period of interest.
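As a sanity check of the pipeline's logic, the same month-wise count can be reproduced over an in-memory array in plain JavaScript (the sample data is invented for illustration; this is not the MongoDB API):

```javascript
// In-memory equivalent of: $project month -> $group count -> $sort.
function countByMonth(docs) {
  const counts = {};
  for (const doc of docs) {
    // getUTCMonth() is 0-based; $month is 1-based, so add 1.
    const month = doc.GeneratedOn.getUTCMonth() + 1;
    counts[month] = (counts[month] || 0) + 1;
  }
  return Object.keys(counts)
    .map(Number)
    .sort((a, b) => a - b)
    .map((m) => ({ _id: { month_joined: m }, count: counts[m] }));
}

const sample = [
  { GeneratedOn: new Date("2024-07-10T00:00:00Z") },
  { GeneratedOn: new Date("2024-07-20T00:00:00Z") },
  { GeneratedOn: new Date("2024-08-01T00:00:00Z") },
];
console.log(countByMonth(sample));
// [ { _id: { month_joined: 7 }, count: 2 }, { _id: { month_joined: 8 }, count: 1 } ]
```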

Sharding in MongoDB
Sharding is a critical strategy in MongoDB to manage large datasets and high-throughput operations.
It involves distributing data across multiple servers, or shards, to balance the load and improve
performance. Here's a detailed breakdown of how sharding works and the considerations for choosing
a suitable shard key:

Shard Key Characteristics


A shard key determines how data is distributed across the shards. The ideal shard key should:
Distribute Insertions Evenly: Prevent any single shard from becoming a bottleneck due to high write
operations.
Optimize Query Routing: Ensure that most queries can be directed to a subset of shards, minimizing
the need for broadcasting queries to all shards.
Choosing a Shard Key
Time Field:

Pros: Easy to implement.
Cons: Time-based data can lead to skewed distribution. Writes might be concentrated on a single shard, especially if data is appended chronologically. Reads are also likely to be concentrated on recent data, which may end up on one shard.
Example: If you choose a timestamp field as the shard key, all new records with a recent timestamp will end up on the same shard, creating a potential bottleneck.

Hash-Based Shard Key:

Pros: Balances writes across shards since hash values are distributed randomly.
Cons: Queries must be broadcasted to all shards, reducing query efficiency.
Example: Using a hash of the _id field as the shard key. While this approach distributes writes evenly,
querying specific data becomes inefficient as every shard must be queried.

Field with Even Distribution (e.g., Host):

Pros: Balances writes and queries efficiently if the field is evenly distributed across all documents.
Queries that filter by this field can be routed to specific shards.
Cons: Can lead to imbalanced chunks if one shard accumulates a disproportionate amount of data for
a specific value.
Example: If you use the Host field as the shard key and there is a data concentration for one host, that
shard might become overloaded.

Compound Shard Key:

Pros: Combines benefits of evenly distributed fields with balanced writes. Queries that use the fields
in the compound key can be routed efficiently.
Cons: Requires careful design to ensure optimal performance.
Example: Using {Host: 1, _id: 1} as a compound shard key. This approach distributes data primarily by the Host field, while _id further splits each host's data into balanced chunks. Queries that filter by Host will be directed to the relevant shards, and writes will be spread across them.

Managing Data Growth


Given the continuous growth of performance data, implementing a data retention policy is essential.
Here are some patterns for managing data:

Capped Collections:
Description: A fixed-size collection that overwrites old data when the limit is reached.
Pros: Efficient for high-throughput scenarios.
Cons: Cannot be sharded and does not support TTL for automatic data removal.
Example: Using a capped collection for real-time monitoring data where old data is overwritten.

TTL (Time-to-Live) Collections:


Description: Collections where documents are automatically removed after a specified period using
a TTL index.
Pros: Can be sharded and supports automatic data removal.
Cons: May lead to data fragmentation and is not as efficient as capped collections.
Example: Creating a TTL index on the GeneratedOn field to automatically remove data older than
six months.
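As a mongo shell fragment (run against a live deployment), such a TTL index is declared with createIndex and the expireAfterSeconds option; the six-month window below (15,552,000 seconds = 180 days) is an illustrative value:

```javascript
// Documents expire ~180 days after their GeneratedOn timestamp;
// a background task removes them periodically.
db.perfpoc.createIndex(
  { GeneratedOn: 1 },
  { expireAfterSeconds: 15552000 } // 180 * 24 * 60 * 60
)
```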

Multiple Collections (e.g., Day-wise Collections):


Description: Creating separate collections for each day or time period.
Pros: Simplifies data management and allows for efficient removal of old data.
Cons: Querying requires reading from multiple collections, which can complicate data retrieval.
Example: Creating daily collections like performance_data_2024_01_01,
performance_data_2024_01_02, etc. This allows for efficient deletion of old collections and avoids
fragmentation.
Example Scenario
Consider a performance monitoring system that generates data every second. Using the following
approach:
Shard Key: {Host: 1, _id: 1}.
Data Management: Daily collections to manage data retention efficiently.
Example Queries:
Insert Data: New performance data for different hosts is distributed across shards based on the Host
and _id hash.
Query Data: Fetching performance data for Host1 within a date range is routed to the relevant shards
based on the compound shard key.
Data Removal: At the end of each day, old collections (e.g., performance_data_2024_08_24) are
dropped to manage disk space efficiently.

Use Case 2 – Social Networking

In this section, you will explore how to use MongoDB to store and retrieve the data of a social networking site.
To ensure optimal performance in a social media application where users can view posts quickly and
accurately, the schema and operations must be designed with efficiency in mind. Here’s how you can
implement the various functions for viewing posts, including visibility checks, and indexing to
enhance performance.

Operations for Viewing Posts


1. Fetching Post Details
The function Fetch_Post_Details retrieves posts for a specific user from either the social.posts or
user.wall collection. It allows for filtering posts by month and ensures that results are sorted in reverse
chronological order.

Pseudo Code:
Function Fetch_Post_Details(CollectionName, View_User_ID, Month)
    SET QueryDocument to {"User_id": View_User_ID}
    IF Month IS NOT NULL
        APPEND month filter {"PostMonth": {"$lte": Month}} to QueryDocument
    END IF
    SET O_Cursor = (result set of CollectionName after applying the QueryDocument filter)
    SET Cur = (sort O_Cursor by "PostMonth" in reverse order)
    WHILE records are present in Cur
        PRINT record
    END WHILE
END Function
Example Usage:

Suppose you want to fetch posts for user user123 from the user.wall collection up to and including August 2024. The function would generate a query like this:
{
  "User_id": "user123",
  "PostMonth": { "$lte": "202408" }
}
The query is then sorted in reverse chronological order to show the most recent posts first.
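A minimal in-memory JavaScript sketch of Fetch_Post_Details, assuming documents carry a User_id and a sortable "YYYYMM" PostMonth string (field names taken from the pseudocode; the data is invented):

```javascript
// In-memory sketch of Fetch_Post_Details: filter by user and an
// optional month cutoff, then sort newest month first.
function fetchPostDetails(collection, viewUserId, month) {
  let results = collection.filter((doc) => doc.User_id === viewUserId);
  if (month !== null) {
    results = results.filter((doc) => doc.PostMonth <= month);
  }
  // Reverse chronological: later months first.
  return results.sort((a, b) => b.PostMonth.localeCompare(a.PostMonth));
}

const wall = [
  { User_id: "user123", PostMonth: "202406", text: "June post" },
  { User_id: "user123", PostMonth: "202408", text: "August post" },
  { User_id: "user999", PostMonth: "202408", text: "someone else" },
];
console.log(fetchPostDetails(wall, "user123", "202408").map((d) => d.PostMonth));
// [ '202408', '202406' ]
```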

2. Checking Visibility on Own Wall


When rendering posts on a user’s wall, you need to verify if the post should be visible based on the
user’s circles.
Pseudo Code:
Function Check_VisibleOnOwnWall(user, post)
    WHILE Loop_User IN user.CirclesList
        IF post by = Loop_User
            RETURN true
        END IF
    END WHILE
    RETURN false
END Function
Example Usage:
Assume user user123 is checking a post by user456. If user456 is in one of the circles that user123 is
following, the function returns true.
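The check can be sketched in plain JavaScript; CirclesList and the post's by field are assumed shapes taken from the pseudocode:

```javascript
// Check_VisibleOnOwnWall: the post is visible on the user's own wall
// if its author appears in the list of users the owner follows.
function checkVisibleOnOwnWall(user, post) {
  for (const circleUser of user.CirclesList) {
    if (post.by === circleUser) {
      return true;
    }
  }
  return false; // author not found in any followed circle
}

const user123 = { CirclesList: ["user456", "user789"] };
console.log(checkVisibleOnOwnWall(user123, { by: "user456" })); // true
console.log(checkVisibleOnOwnWall(user123, { by: "user000" })); // false
```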

3. Checking Blocked Users


To ensure that posts by blocked users are not visible, use this function.
Pseudo Code:
Function ReturnBlockedOrNot(user, post)
    IF post by user_id NOT IN user.ListBlockedUserIDs
        RETURN true
    ELSE
        RETURN false
    END IF
END Function
Example Usage:

If user123 has blocked user789, and a post by user789 is being viewed, the function will return false.

4. Permission Checks for Viewing Posts


When a user views another user’s wall, you need to check if the post should be visible based on circles
and privacy settings.
Pseudo Code:
Function visibleposts(user, post)
    IF post circles is public
        RETURN true
    END IF
    IF post circles is public to all followed users AND user follows post's author
        RETURN true
    END IF
    SET listofcircles = the user's circles that contain the post author's user_id
    IF any circle in listofcircles IS IN post's circles
        RETURN true
    END IF
    RETURN false
END Function
Example Usage:

For a post from user456, if the circles are set to public or user123 is following user456, the function
returns true. Otherwise, it checks if the post is within any circles of users whom user123 follows.
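A plain-JavaScript sketch of this permission check; the "public" and "followers" sentinel values and the viewer's following/circles fields are illustrative assumptions, not a fixed schema:

```javascript
// visibleposts: a post is visible if it is public, public to all
// followed users (and the viewer follows the author), or shared with
// a circle in which the viewer has placed the author.
function visiblePosts(user, post) {
  if (post.circles.includes("public")) return true;
  if (post.circles.includes("followers") && user.following.includes(post.by)) {
    return true;
  }
  // Circles the viewer has placed the post's author in.
  const listOfCircles = user.circles
    .filter((c) => c.members.includes(post.by))
    .map((c) => c.name);
  return listOfCircles.some((name) => post.circles.includes(name));
}

const viewer = {
  following: ["user456"],
  circles: [{ name: "friends", members: ["user456"] }],
};
console.log(visiblePosts(viewer, { by: "user456", circles: ["friends"] })); // true
console.log(visiblePosts(viewer, { by: "user456", circles: ["family"] }));  // false
```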

Indexing for Performance


To enhance performance, especially for large datasets, create indexes on the user_id and month fields
in both social.posts and user.wall collections. This improves query efficiency by allowing MongoDB
to quickly locate relevant documents.

Creating Indexes:

In MongoDB, you can create indexes as follows:


db.social.posts.createIndex({ "user_id": 1, "postmonth": 1 })
db.user.wall.createIndex({ "User_id": 1, "PostMonth": 1 })
Benefits of Indexing:

Faster Queries: Indexes help in quickly locating documents without scanning the entire collection.
Efficient Sorting: Sorting queries can be optimized by using indexes, reducing the time required to
order results.
Improved Performance: Overall performance improves as the database engine can use indexes to
fetch and sort data more efficiently.
To efficiently handle a social media application that includes operations for creating and managing
posts and comments, it's crucial to understand how to implement these functionalities and optimize
them for performance. Here's a detailed explanation of the operations for creating comments and
posts, including sharding strategies for scaling.

Creating Comments
When a user adds a comment to a post, you need to update multiple collections to ensure that the
comment appears in all relevant places. Here’s a step-by-step breakdown:

Function: postcomment
Purpose: To create a comment on a post and update the relevant collections to reflect this comment.

Pseudo Code:
Function postcomment(commentedby, commentedonpostid, commenttext)
    SET commentedon to current datetime
    SET month to month of commentedon
    SET comment_document = {
        "by": {id: commentedby["id"], "Name": commentedby["name"]},
        "ts": commentedon,
        "text": commenttext
    }

    // Update user.posts collection
    UPDATE user.posts
        WHERE _id = commentedonpostid
        PUSH comment_document INTO Comments_Doc

    // Update user.wall collection
    UPDATE user.wall
        WHERE User_id = commentedby["id"] AND PostMonth = month
        PUSH comment_document INTO PostDetails.comments

    // Increment the comments_shown in user.wall collection
    UPDATE user.wall
        WHERE User_id = commentedby["id"] AND PostMonth = month
        SET comments_shown = comments_shown + 1

    // Update social.posts collection
    UPDATE social.posts
        WHERE post_id = commentedonpostid
        PUSH comment_document INTO postlists.comments

    // Increment the comments_shown in social.posts collection
    UPDATE social.posts
        WHERE post_id = commentedonpostid
        SET comments_shown = comments_shown + 1
END Function
Example Usage:

A user with ID user123 comments "Great post!" on a post with ID post456. The function will:

Add the comment to user.posts where _id = post456.
Add the comment to user.wall for user123.
Add the comment to social.posts where post_id = post456.
Increment the comments_shown counter in both user.wall and social.posts.
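The flow can be simulated in plain JavaScript with arrays standing in for the user.posts and social.posts collections (the user.wall update is omitted for brevity; field names follow the pseudocode, and the data is invented):

```javascript
// In-memory sketch of postcomment, mirroring the pseudocode's updates
// across collection stand-ins (plain arrays, not MongoDB).
function postComment(db, commentedBy, postId, text) {
  const comment = { by: commentedBy, ts: new Date(), text: text };

  // user.posts: push onto the post's comment list.
  const post = db.userPosts.find((p) => p._id === postId);
  post.Comments_Doc.push(comment);

  // social.posts: push the comment and bump the counter on every
  // timeline entry that references this post.
  for (const entry of db.socialPosts.filter((p) => p.post_id === postId)) {
    entry.comments.push(comment);
    entry.comments_shown += 1;
  }
  return comment;
}

const db = {
  userPosts: [{ _id: "post456", Comments_Doc: [] }],
  socialPosts: [{ post_id: "post456", comments: [], comments_shown: 0 }],
};
postComment(db, { id: "user123", name: "Sam" }, "post456", "Great post!");
console.log(db.userPosts[0].Comments_Doc.length); // 1
console.log(db.socialPosts[0].comments_shown);    // 1
```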
Maintaining Comments
To ensure that no more than a specified number of comments are shown, you need a periodic cleanup
function.

Function: MaintainComments

Pseudo Code:
Function MaintainComments()
    SET MaximumComments = 3

    // Loop through social.posts
    FOR EACH post IN social.posts
        IF post.comments_shown > MaximumComments
            // Remove the oldest comment
            POP the oldest comment from postlists.comments
            // Decrement the comments_shown counter
            UPDATE social.posts
                WHERE _id = post._id
                SET comments_shown = comments_shown - 1
        END IF
    END FOR

    // Loop through user.wall
    FOR EACH post IN user.wall
        IF post.comments_shown > MaximumComments
            // Remove the oldest comment
            POP the oldest comment from PostDetails.comments
            // Decrement the comments_shown counter
            UPDATE user.wall
                WHERE _id = post._id
                SET comments_shown = comments_shown - 1
        END IF
    END FOR
END Function
Example Usage:

If a post has more than 3 comments, the oldest comments will be removed to ensure only the latest 3
are shown.
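A minimal in-memory JavaScript sketch of the trimming logic, assuming each post stores its comments oldest-first alongside a comments_shown counter (sample data invented):

```javascript
// In-memory sketch of MaintainComments: keep only the newest
// `max` comments on each post, dropping the oldest first.
function maintainComments(posts, max) {
  for (const post of posts) {
    while (post.comments_shown > max) {
      post.comments.shift(); // comments are stored oldest-first
      post.comments_shown -= 1;
    }
  }
  return posts;
}

const posts = [
  { comments: ["c1", "c2", "c3", "c4", "c5"], comments_shown: 5 },
  { comments: ["c1", "c2"], comments_shown: 2 },
];
maintainComments(posts, 3);
console.log(posts[0].comments); // [ 'c3', 'c4', 'c5' ]
console.log(posts[1].comments); // [ 'c1', 'c2' ]
```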

Creating New Posts


When a user creates a new post, the operation needs to update multiple collections to reflect this post.

Function: createnewpost

Purpose: To create a new post and update all relevant collections to include this post.

Pseudo Code:
Function createnewpost(createdby, posttype, postdetails, circles)
    SET ts = current timestamp
    SET month = month of ts
    SET post_document = {
        "ts": ts,
        "by": {id: createdby["id"], name: createdby["name"]},
        "circles": circles,
        "type": posttype,
        "details": postdetails
    }

    // Insert post_document into user.posts collection
    INSERT post_document INTO user.posts

    // Append post_document into user.wall collection
    UPDATE user.wall
        WHERE User_id = createdby["id"] AND PostMonth = month
        PUSH post_document INTO PostDetails

    // Get the list of users based on circles and post creator
    SET userlist = getUsersInCircles(circles, createdby["id"])

    // Append post_document to each user's social.posts collection
    FOR EACH user IN userlist
        UPDATE social.posts
            WHERE user_id = user._id
            PUSH post_document INTO postlists
    END FOR
END Function
Example Usage:

If user123 creates a post with posttype = "status", postdetails = "Hello World!", and circles including
circleA, the function will:

Add the post to user.posts.


Update user.wall for user123.
Add the post to the social.posts collections of all users in circleA.
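The fan-out can be sketched in plain JavaScript with arrays standing in for the collections; the followedCircles field is an invented stand-in for the circle-membership lookup that getUsersInCircles would perform:

```javascript
// In-memory sketch of the fan-out in createnewpost: the post is
// written once to the author's data, then appended to the timeline
// of every user subscribed to one of the targeted circles.
function createNewPost(db, createdBy, circles, details) {
  const post = { ts: new Date(), by: createdBy, circles: circles, details: details };
  db.userPosts.push(post);

  // Stand-in for getUsersInCircles: followers whose subscriptions match.
  const audience = db.users.filter((u) =>
    u.followedCircles.some((c) => circles.includes(c))
  );
  for (const user of audience) {
    db.socialPosts.push({ user_id: user.id, post: post });
  }
  return audience.length;
}

const db = {
  userPosts: [],
  socialPosts: [],
  users: [
    { id: "u1", followedCircles: ["circleA"] },
    { id: "u2", followedCircles: ["circleB"] },
  ],
};
console.log(createNewPost(db, "user123", ["circleA"], "Hello World!")); // 1
console.log(db.socialPosts.length); // 1
```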
Sharding Strategy
To handle large volumes of data and ensure scalability, you can shard the collections based on suitable
shard keys:

user.profile: Use user_id as the shard key. Since the data is specific to each user, this key will evenly
distribute the data across shards.

user.wall: Use user_id as the shard key for similar reasons. This ensures that the wall data for each
user is distributed efficiently.

social.posts: Use user_id as the shard key to evenly distribute the social posts related to each user.

user.posts: Use _id as the shard key since each post is uniquely identified by this field. This helps in
distributing posts evenly across shards.

Creating Shards:
// Sharding user.profile collection
sh.shardCollection("yourDatabase.user.profile", { "user_id": 1 })

// Sharding user.wall collection
sh.shardCollection("yourDatabase.user.wall", { "User_id": 1 })

// Sharding social.posts collection
sh.shardCollection("yourDatabase.social.posts", { "user_id": 1 })

// Sharding user.posts collection
sh.shardCollection("yourDatabase.user.posts", { "_id": 1 })
