Unit 3 Chap2
USE CASE-1
In this use case, the focus is on storing and retrieving performance monitoring data in MongoDB.
This involves designing a schema that efficiently stores the data collected from a monitoring tool and
performing operations like data insertion with different levels of data safety and performance. Let's
break down the explanation in more detail:
Schema Design and Optimization
The data being collected from the monitoring tool is in CSV format, and it contains parameters like
Host, Timestamp, ParameterName, and Value. These parameters need to be stored in MongoDB in a
structured and optimized format.
Here’s an example of how a line from the log file might look:
Node UUID | IP Address | Node Name | MIB | Time Stamp (ms) | Metric Value
3beb1a8b-040d-4b46-932a | 10.161.1.73 | corp_xyz_sardar | IFU | 1369221223384 | 0.2
Instead of storing this line as a text string, which would require regular expression searches (slow and
inefficient), you can break it down into individual fields and store it in a more structured and
optimized way.
Document Structure Example:
{
  _id: ObjectId(...),
  Host: "corp_xyz_sardar",
  Time: ISODate("2015-07-15T13:55:36Z"),
  ParameterName: "CPU",
  Value: 0.2
}
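The transformation from a raw log line to this structured document can be sketched in plain Python (no MongoDB driver required; the field names follow the document structure above, and a Python datetime stands in for ISODate):

```python
from datetime import datetime, timezone

def parse_log_line(line):
    """Split a pipe-delimited monitoring record into a structured document."""
    uuid, ip, host, mib, ts_ms, value = [f.strip() for f in line.split("|")]
    return {
        "Host": host,
        # Store the timestamp as a real datetime (a BSON date), not a string
        "Time": datetime.fromtimestamp(int(ts_ms) / 1000, tz=timezone.utc),
        "ParameterName": mib,
        "Value": float(value),
    }

doc = parse_log_line(
    "3beb1a8b-040d-4b46-932a | 10.161.1.73 | corp_xyz_sardar | IFU | 1369221223384 | 0.2"
)
```

Each field now has its natural type, so the document can be inserted directly and queried without regular expressions.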
In this structure:
Host: Represents the host name or IP of the server being monitored.
Time: The timestamp of the log data stored in the ISODate format for efficient querying.
ParameterName: The metric being monitored (e.g., CPU usage, Memory, etc.).
Value: The value of the monitored metric.
Why Use Correct Data Types?
By using the correct data types, you:
Optimize Storage: For instance, using ISODate (8 bytes) instead of storing the date as a string (28
bytes).
Facilitate Efficient Queries: Using ISODate allows efficient querying for date ranges rather than
performing string-based comparisons.
Operations
Once the schema is designed, the next step is to insert and retrieve the data. You can perform insert
operations with different levels of write concerns depending on your application's needs for
performance and data safety.
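The trade-off between performance and safety is expressed as a small write-concern options document. A pure-Python sketch of three illustrative levels (the mode names "fast", "safe", and "durable" are assumptions for this example; with a real driver such as pymongo you would pass these `w` and `j` values to a write concern):

```python
def write_concern(mode):
    """Return write-concern options for a given safety/performance trade-off.

    "fast"    - acknowledge once the primary applies the write in memory
    "safe"    - acknowledge once the write reaches the primary's journal
    "durable" - acknowledge once a majority of replica-set members have it
    """
    options = {
        "fast": {"w": 1, "j": False},
        "safe": {"w": 1, "j": True},
        "durable": {"w": "majority", "j": True},
    }
    return options[mode]
```

Monitoring data that can tolerate the loss of a few samples would use "fast"; configuration or billing data would use "durable".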
Sharding in MongoDB
Sharding is a critical strategy in MongoDB to manage large datasets and high-throughput operations.
It involves distributing data across multiple servers, or shards, to balance the load and improve
performance. Here's a detailed breakdown of how sharding works and the considerations for choosing
a suitable shard key:
Hashed Shard Key:
Pros: Balances writes across shards since hash values are distributed randomly.
Cons: Queries must be broadcast to all shards, reducing query efficiency.
Example: Using a hash of the _id field as the shard key. While this approach distributes writes evenly, querying specific data becomes inefficient because every shard must be queried.
Single-Field (Range-Based) Shard Key:
Pros: Balances writes and queries efficiently if the field is evenly distributed across all documents. Queries that filter by this field can be routed to specific shards.
Cons: Can lead to imbalanced chunks if one shard accumulates a disproportionate amount of data for a specific value.
Example: If you use the Host field as the shard key and data is concentrated on one host, that shard might become overloaded.
Compound Shard Key:
Pros: Combines the benefits of an evenly distributed field with balanced writes. Queries that use the fields in the compound key can be routed efficiently.
Cons: Requires careful design to ensure optimal performance.
Example: Using {Host: 1, _id: 1} as a compound shard key. This approach distributes data based on the Host field and uses _id to spread documents evenly within a host. Queries that filter by Host will be directed to the relevant shards, and writes will be balanced.
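The routing consequence of the key choice can be simulated in a few lines of Python (a toy model: `hash()` stands in for the chunk-range lookup, and four shards are assumed):

```python
NUM_SHARDS = 4

def shard_for(host):
    """Route a document by the leading component of a {Host: 1, _id: 1} key.
    The first key component alone decides the target shard, which is why
    queries that filter by Host can be sent to a single shard."""
    return hash(host) % NUM_SHARDS  # hash() stands in for chunk-range lookup

def shards_for_query(query):
    """Return the set of shards a query must touch."""
    if "Host" in query:
        return {shard_for(query["Host"])}  # targeted query: one shard
    return set(range(NUM_SHARDS))          # scatter-gather: all shards

targeted = shards_for_query({"Host": "corp_xyz_sardar"})
scatter = shards_for_query({"Value": {"$gt": 0.5}})
```

A query with Host in its filter touches one shard; a query without it must be broadcast to all of them.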
Capped Collections:
Description: A fixed-size collection that overwrites old data when the limit is reached.
Pros: Efficient for high-throughput scenarios.
Cons: Cannot be sharded and does not support TTL for automatic data removal.
Example: Using a capped collection for real-time monitoring data where old data is overwritten.
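The overwrite behaviour of a capped collection can be mimicked with a fixed-size buffer (a pure-Python sketch; a real capped collection is created with a size limit in bytes, whereas this toy caps the document count):

```python
from collections import deque

class CappedCollection:
    """Fixed-size buffer: inserting beyond the limit silently drops the
    oldest documents, mimicking a capped collection's overwrite behaviour."""
    def __init__(self, max_documents):
        self._docs = deque(maxlen=max_documents)

    def insert(self, doc):
        self._docs.append(doc)  # oldest entry is evicted automatically

    def find(self):
        return list(self._docs)  # documents stay in insertion order

cap = CappedCollection(max_documents=3)
for i in range(5):
    cap.insert({"reading": i})
```

After five inserts only the three newest readings remain, which is exactly what you want for a rolling window of real-time monitoring data.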
In this section, you will explore how to use MongoDB to store and retrieve data for a social networking site.
To ensure optimal performance in a social media application where users can view posts quickly and
accurately, the schema and operations must be designed with efficiency in mind. Here’s how you can
implement the various functions for viewing posts, including visibility checks, and indexing to
enhance performance.
Pseudo Code:
Function Fetch_Post_Details(CollectionName, View_User_ID, Month)
    SET QueryDocument to {"User_id": View_User_ID}
    IF Month IS NOT NULL
        APPEND Month filter {"Month": {"$lte": Month}} to QueryDocument
    END IF
    SET O_Cursor = (result set of CollectionName after applying the QueryDocument filter)
    SET Cur = (sort O_Cursor by "Month" in reverse order)
    WHILE records are present in Cur
        PRINT record
    END WHILE
END Function
Example Usage:
Suppose you want to fetch posts for user user123 from the user.wall collection up to August 2024. The function would generate a query like this:
{
  "User_id": "user123",
  "Month": {"$lte": "202408"}
}
The result set is then sorted in reverse chronological order to show the most recent posts first.
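The query document that Fetch_Post_Details builds can be sketched in pure Python (dict literals stand in for BSON, and the sort direction -1 is MongoDB's descending order):

```python
def build_wall_query(view_user_id, month=None):
    """Mirror Fetch_Post_Details: filter by user, optionally cap the month,
    and sort newest month first."""
    query = {"User_id": view_user_id}
    if month is not None:
        query["Month"] = {"$lte": month}  # only posts up to this month
    sort_spec = [("Month", -1)]           # reverse chronological order
    return query, sort_spec

query, sort_spec = build_wall_query("user123", "202408")
```

With a real driver, `query` and `sort_spec` would be passed to `find(...).sort(...)`.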
If user123 has blocked user789 and a post by user789 is being viewed, the visibility check returns false. For a post from user456, if the post's circles are set to public or user123 is following user456, the check returns true. Otherwise, it checks whether the post falls within any circles of users whom user123 follows.
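These rules can be sketched as a small predicate (an illustrative simplification: `blocked` and `following` are per-user sets of user ids, and the final circle-of-followed-users rule is reduced here to checking whether the viewer is named in the post's circles):

```python
def can_view(viewer, author, post_circles, following, blocked):
    """Visibility check sketched from the rules above."""
    if author in blocked.get(viewer, set()):
        return False                           # viewer has blocked the author
    if "public" in post_circles:
        return True                            # public posts are visible to all
    if author in following.get(viewer, set()):
        return True                            # viewer follows the author
    # Simplified last step: visible only if the viewer appears in a circle
    return viewer in post_circles

blocked = {"user123": {"user789"}}
following = {"user123": {"user456"}}
```

For example, user123 cannot see a friends-only post by user789, but can see any post by user456.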
Creating Indexes:
Faster Queries: Indexes help in quickly locating documents without scanning the entire collection.
Efficient Sorting: Sorting queries can be optimized by using indexes, reducing the time required to
order results.
Improved Performance: Overall performance improves as the database engine can use indexes to
fetch and sort data more efficiently.
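Why an index speeds up lookups can be shown with a toy single-field index (pure Python: a sorted key list searched with binary search instead of scanning every document; real MongoDB indexes are B-trees, but the access pattern is the same idea):

```python
import bisect

class SimpleIndex:
    """Toy single-field index: keeps (value, position) pairs sorted so
    lookups use binary search instead of a full collection scan."""
    def __init__(self, docs, field):
        self._docs = docs
        self._entries = sorted((d[field], i) for i, d in enumerate(docs))
        self._keys = [k for k, _ in self._entries]

    def find(self, value):
        lo = bisect.bisect_left(self._keys, value)   # O(log n) seek
        hi = bisect.bisect_right(self._keys, value)
        return [self._docs[i] for _, i in self._entries[lo:hi]]

docs = [{"User_id": u, "n": i} for i, u in enumerate(["b", "a", "c", "a"])]
idx = SimpleIndex(docs, "User_id")
```

A lookup touches O(log n) keys rather than all n documents, which is the "faster queries" benefit in practice.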
To efficiently handle a social media application that includes operations for creating and managing
posts and comments, it's crucial to understand how to implement these functionalities and optimize
them for performance. Here's a detailed explanation of the operations for creating comments and
posts, including sharding strategies for scaling.
Creating Comments
When a user adds a comment to a post, you need to update multiple collections to ensure that the
comment appears in all relevant places. Here’s a step-by-step breakdown:
Function: postcomment
Purpose: To create a comment on a post and update the relevant collections to reflect this comment.
Pseudo Code:
Function postcomment(commentedby, commentedonpostid, commenttext)
    SET commentedon to current datetime
    SET month to month of commentedon
    SET comment_document = {
        "by": {"id": commentedby["id"], "Name": commentedby["name"]},
        "ts": commentedon,
        "text": commenttext
    }
    APPEND comment_document to the comments of post commentedonpostid in each
        collection that stores the post (e.g., user.posts, user.wall, social.posts)
END Function
A user with ID user123 comments "Great post!" on a post with ID post456. The function will build the comment document and append it to post456 in every collection where that post appears.
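The comment sub-document from the pseudo code can be assembled like this (a pure-Python sketch; the display name "Alice" is a hypothetical value for the example):

```python
from datetime import datetime, timezone

def build_comment(commentedby, commenttext):
    """Assemble the comment sub-document described in the pseudo code."""
    commentedon = datetime.now(timezone.utc)   # the "ts" field
    return {
        "by": {"id": commentedby["id"], "Name": commentedby["name"]},
        "ts": commentedon,
        "text": commenttext,
    }

# "Alice" is an illustrative display name, not from the use case
comment = build_comment({"id": "user123", "name": "Alice"}, "Great post!")
```

With a real driver, this document would be pushed onto the post's comments array with a `$push` update in each relevant collection.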
Function: MaintainComments
Pseudo Code:
Function MaintainComments()
    SET MaximumComments = 3
    FOR each post document in the collection
        IF the post has more than MaximumComments comments
            REMOVE the oldest comments so only the latest MaximumComments remain
        END IF
    END FOR
END Function
If a post has more than 3 comments, the oldest comments will be removed to ensure only the latest 3
are shown.
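The trimming step can be sketched in a few lines (pure Python; a post is a dict whose "comments" list is in insertion order, oldest first):

```python
MAXIMUM_COMMENTS = 3

def maintain_comments(post):
    """Keep only the newest MAXIMUM_COMMENTS comments on a post,
    dropping the oldest, as MaintainComments describes."""
    if len(post["comments"]) > MAXIMUM_COMMENTS:
        post["comments"] = post["comments"][-MAXIMUM_COMMENTS:]
    return post

post = {"comments": [{"text": f"c{i}"} for i in range(5)]}
maintain_comments(post)
```

In MongoDB itself the same effect is commonly achieved at write time by capping the array as each comment is pushed.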
Function: createnewpost
Purpose: To create a new post and update all relevant collections to include this post.
Pseudo Code:
Function createnewpost(createdby, posttype, postdetails, circles)
    SET ts = current timestamp
    SET month = month of ts
    SET post_document = {
        "ts": ts,
        "by": {"id": createdby["id"], "name": createdby["name"]},
        "circles": circles,
        "type": posttype,
        "details": postdetails
    }
    INSERT post_document into user.posts
    PUSH post_document onto the walls and news feeds of the relevant users
        (e.g., user.wall and social.posts)
END Function
If user123 creates a post with posttype = "status", postdetails = "Hello World!", and circles including circleA, the function will insert the post into user.posts and propagate it to the walls and news feeds of the users in circleA.
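The build-and-fan-out flow can be sketched in pure Python (lists in a dict stand in for the user.posts, user.wall, and social.posts collections; "Alice" is a hypothetical display name):

```python
from datetime import datetime, timezone

def create_new_post(createdby, posttype, postdetails, circles, collections):
    """Build the post document and append it to each in-memory 'collection',
    mimicking the insert-then-propagate steps of createnewpost."""
    post = {
        "ts": datetime.now(timezone.utc),
        "by": {"id": createdby["id"], "name": createdby["name"]},
        "circles": circles,
        "type": posttype,
        "details": postdetails,
    }
    for coll in collections.values():  # fan out to every relevant collection
        coll.append(post)
    return post

collections = {"user.posts": [], "user.wall": [], "social.posts": []}
post = create_new_post({"id": "user123", "name": "Alice"}, "status",
                       "Hello World!", ["circleA"], collections)
```

In a real deployment each append would be an insert or `$push` against the corresponding sharded collection.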
user.profile: Use user_id as the shard key. Since the data is specific to each user, this key will evenly
distribute the data across shards.
user.wall: Use user_id as the shard key for similar reasons. This ensures that the wall data for each
user is distributed efficiently.
social.posts: Use user_id as the shard key to evenly distribute the social posts related to each user.
user.posts: Use _id as the shard key since each post is uniquely identified by this field. This helps in
distributing posts evenly across shards.
Creating Shards:
// Shard each collection on the shard key chosen above
sh.shardCollection("yourDatabase.user.profile", { "user_id": 1 })
sh.shardCollection("yourDatabase.user.wall", { "user_id": 1 })
sh.shardCollection("yourDatabase.social.posts", { "user_id": 1 })
sh.shardCollection("yourDatabase.user.posts", { "_id": 1 })