
NoSQL Database: MongoDB

LECTURE 2 NOTES

MongoDB Data Modeling


● What is Data Modeling?

■ In MongoDB, data modeling refers to the process of designing the structure of your
data to best suit the needs of your application. Unlike traditional relational
databases where data is organized in tables, rows, and columns, MongoDB is a
NoSQL database that stores data in flexible, JSON-like documents.
■ The key challenge in data modeling is balancing the needs of the application, the
performance characteristics of the database engine, and the data retrieval
patterns. When designing data models, always consider the application usage of
the data (i.e. queries, updates, and processing of the data) as well as the inherent
structure of the data itself.

■ Flexible Schema

○ Unlike SQL databases, where you must determine and declare a table's
schema before inserting data, MongoDB's collections, by default, do not
require their documents to have the same schema. That is:

○ The documents in a single collection do not need to have the same set of
fields and the data type for a field can differ across documents within a
collection.

○ To change the structure of the documents in a collection, such as adding
new fields, removing existing fields, or changing the field values to a new
type, update the documents to the new structure.

○ This flexibility facilitates the mapping of documents to an entity or an
object. Each document can match the data fields of the represented entity,
even if the document has substantial variation from other documents in the
collection.

○ In practice, however, the documents in a collection share a similar structure,
and you can enforce document validation rules for a collection during
update and insert operations.
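
○ For illustration, a minimal mongosh sketch of such a validation rule; the
"users" collection and its fields are hypothetical:

// Create a collection whose inserts and updates must supply a string
// "name"; "age", if present, must be a non-negative integer.
db.createCollection("users", {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["name"],
      properties: {
        name: { bsonType: "string" },
        age: { bsonType: "int", minimum: 0 }
      }
    }
  }
})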

■ Document Structure

○ The key decision in designing data models for MongoDB applications
revolves around the structure of documents and how the application
represents relationships between data. MongoDB allows related data to be
embedded within a single document.
■ Embedded Data
○ Embedded documents capture relationships between data by storing
related data in a single document structure. MongoDB documents make it
possible to embed document structures in a field or array within a
document. These denormalized data models allow applications to retrieve
and manipulate related data in a single database operation.

■ Embedded Data Models

○ With MongoDB, you may embed related data in a single structure or document.
These schemas are generally known as ‘denormalized’ models, and take
advantage of MongoDB's rich documents.

○ Embedded data models allow applications to store related pieces of
information in the same database record. As a result, applications may need to
issue fewer queries and updates to complete common operations.

○ In general, use embedded data models when you have ‘contains’ relationships
between entities, or when you have one-to-many relationships in which the
‘many’ or child documents always appear with, or are viewed in the context of,
the ‘one’ or parent documents.

○ In general, embedding provides better performance for read operations, as well
as the ability to request and retrieve related data in a single database
operation. Embedded data models make it possible to update related data in a
single atomic write operation.

○ To access data within embedded documents, use dot notation to ‘reach into’ the
embedded documents. See the documentation on querying arrays and querying
embedded documents for more examples of accessing data in arrays and
embedded documents.
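
○ As a sketch (the "patrons" collection and its fields are illustrative), an
embedded document and a dot-notation query against it:

// A patron document with an embedded address.
db.patrons.insertOne({
  _id: "joe",
  name: "Joe Bookreader",
  address: { street: "123 Fake Street", city: "Faketon", zip: "12345" }
})

// Dot notation 'reaches into' the embedded document.
db.patrons.find({ "address.city": "Faketon" })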

● Embedded Data Model and Document Size Limit

■ Documents in MongoDB must be smaller than the maximum BSON document size (16 megabytes).

■ For bulk binary data, consider GridFS.

■ For many use cases in MongoDB, the denormalized data model is optimal.
● Atomicity of Write Operations

■ Single Document Atomicity

○ In MongoDB, a write operation is atomic on the level of a single document, even
if the operation modifies multiple embedded documents within a single
document.

○ A denormalized data model with embedded data combines all related data in a
single document instead of normalizing across multiple documents and
collections. This data model facilitates atomic operations.
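
○ A sketch of a single atomic write touching several embedded fields (the
"orders" collection and field names are hypothetical):

// The whole modification succeeds or fails as one atomic operation,
// because it targets a single document.
db.orders.updateOne(
  { _id: 1001 },
  {
    $set: { "shipping.status": "shipped", "shipping.carrier": "UPS" },
    $inc: { version: 1 }
  }
)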

■ Multi-Document Transactions

○ When a single write operation (e.g. db.collection.updateMany()) modifies
multiple documents, the modification of each document is atomic, but the
operation as a whole is not atomic.

○ When performing multi-document write operations, whether through a single
write operation or multiple write operations, other operations may interleave.

○ For situations that require atomicity of reads and writes to multiple documents
(in a single or multiple collections), MongoDB supports multi-document
transactions:

➢ In version 4.0, MongoDB supports multi-document transactions on replica
sets.

➢ In version 4.2, MongoDB introduces distributed transactions, which adds
support for multi-document transactions on sharded clusters and
incorporates the existing support for multi-document transactions on
replica sets.

○ Note:

1. In most cases, multi-document transactions incur a greater performance
cost than single-document writes, and the availability of multi-document
transactions should not be a replacement for effective schema design.

2. For many scenarios, the denormalized data model (embedded documents
and arrays) will continue to be optimal for your data and use cases.

3. That is, for many scenarios, modeling your data appropriately will minimize
the need for multi-document transactions.
● Sharding

■ MongoDB uses sharding to provide horizontal scaling. These clusters support
deployments with large data sets and high-throughput operations. Sharding allows
users to partition a collection within a database to distribute the collection's
documents across several mongod instances or shards.

■ To distribute data and application traffic in a sharded collection, MongoDB uses the
shard key. Selecting the proper shard key has significant implications for
performance, and can enable or prevent query isolation and increased write
capacity. While you can change your shard key later, it is important to carefully
consider your shard key choice.
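
■ For illustration, sharding a hypothetical "analytics.events" collection on a
hashed shard key (run against a mongos in a sharded cluster):

sh.enableSharding("analytics")
sh.shardCollection("analytics.events", { deviceId: "hashed" })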

● Indexes

■ Use indexes to improve performance for common queries. Build indexes on fields
that appear often in queries and for all operations that return sorted results.
MongoDB automatically creates a unique index on the _id field.
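
■ A small sketch of creating and using an index (the "products" collection is
hypothetical):

// Single-field ascending index on a commonly queried field.
db.products.createIndex({ category: 1 })

// This query can now use the index instead of scanning the collection.
db.products.find({ category: "books" })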

■ As you create indexes, consider the following behaviors of indexes:

○ Each index requires at least 8 kB of data space.

○ Adding an index has some negative performance impact on write
operations. For collections with a high write-to-read ratio, indexes are
expensive since each insert must also update any indexes.

○ Collections with high read-to-write ratios often benefit from additional
indexes. Indexes do not affect un-indexed read operations.

○ When active, each index consumes disk space and memory. This usage can
be significant and should be tracked for capacity planning, especially for
concerns over working set size.

● Large Number of Collections

■ In certain situations, you might choose to store related information in several
collections rather than in a single collection.

■ Consider a sample collection log that stores log documents for various
environments and applications. The log collection contains documents in the
following form:
{ log: "dev", ts: ..., info: ... }

{ log: "debug", ts: ..., info: ...}

■ If the total number of documents is low, you may group documents into collections
by type. For logs, consider maintaining distinct log collections, such as logs_dev
and logs_debug. The logs_dev collection would contain only the documents related
to the dev environment.
■ Generally, a large number of collections has no significant performance
penalty and can result in very good performance. Distinct collections are very
important for high-throughput batch processing.
■ When using models that have a large number of collections, consider the following
behaviors (a short sketch of the per-environment log collections follows this list):

○ Each collection has a certain minimum overhead of a few kilobytes.

○ Each index, including the index on _id, requires at least 8 kB of data space.

○ For each database, a single namespace file (i.e. <database>.ns) stores all
the metadata for that database, and each index and collection has its own
entry in the namespace file. See the namespace length limits for specific
limitations.
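
■ A sketch of the per-environment log collections described above; the
timestamps and messages are illustrative:

// Each environment writes to its own collection.
db.logs_dev.insertOne({ ts: new Date(), info: "dev build started" })
db.logs_debug.insertOne({ ts: new Date(), info: "cache miss traced" })

// Reads against logs_dev touch only dev documents; no type filter is needed.
db.logs_dev.find().sort({ ts: -1 }).limit(10)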

■ Collection Contains a Large Number of Small Documents

○ You should consider embedding for performance reasons if you have a
collection with a large number of small documents. If you can group these
small documents by some logical relationship and you frequently retrieve
the documents by this grouping, you might consider ‘rolling up’ the small
documents into larger documents that contain an array of embedded
documents.

○ ‘Rolling up’ these small documents into logical groupings means that
queries to retrieve a group of documents involve sequential reads and fewer
random disk accesses. Additionally, ‘rolling up’ documents and moving
common fields to the larger document benefits the index on these fields:
there would be fewer copies of the common fields and fewer associated key
entries in the corresponding index. See Indexes for more information on
indexes.

○ However, if you often only need to retrieve a subset of the documents within
the group, then ‘rolling-up’ the documents may not provide better
performance. Furthermore, if small, separate documents represent the
natural model for the data, you should maintain that model.
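
○ A sketch of ‘rolling up’ per-reading documents into one document per sensor
per hour (collection and field names are hypothetical):

// One document holds an hour of readings as an embedded array.
db.sensor_hourly.insertOne({
  sensorId: "s-42",
  hour: ISODate("2024-01-01T10:00:00Z"),
  readings: [
    { t: ISODate("2024-01-01T10:00:05Z"), temp: 21.4 },
    { t: ISODate("2024-01-01T10:00:10Z"), temp: 21.5 }
  ]
})

// New readings are appended to the embedded array.
db.sensor_hourly.updateOne(
  { sensorId: "s-42", hour: ISODate("2024-01-01T10:00:00Z") },
  { $push: { readings: { t: new Date(), temp: 21.6 } } }
)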

● Storage Optimization for Small Documents

■ Each MongoDB document contains a certain amount of overhead. This overhead is
normally insignificant but becomes significant if all documents are just a few
bytes, as might be the case if the documents in your collection only have one or
two fields.

■ Consider the following suggestions and strategies for optimizing storage utilization
for these collections:

○ Use the _id field explicitly:

➢ MongoDB clients automatically add an _id field to each document and
generate a unique 12-byte ObjectId for the _id field. Furthermore, MongoDB
always indexes the _id field. For smaller documents this may account for a
significant amount of space.

➢ To optimize storage use, users can specify a value for the _id field explicitly
when inserting documents into the collection. This strategy allows
applications to store a value in the _id field that would have occupied space
in another portion of the document.

➢ You can store any value in the _id field, but because this value serves as a
primary key for documents in the collection, it must uniquely identify them.
If the field's value is not unique, then it cannot serve as a primary key as
there would be collisions in the collection.
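
➢ For illustration, storing a naturally unique value (a hypothetical SKU) in
the _id field:

// The SKU doubles as the primary key, saving a separate field.
db.inventory.insertOne({ _id: "SKU-00142", qty: 57 })

// Lookups by SKU use the automatic _id index.
db.inventory.findOne({ _id: "SKU-00142" })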

○ Use shorter field names:

➢ Shortening field names reduces expressiveness and does not provide
considerable benefit for larger documents and where document overhead is
not of significant concern. Shorter field names do not reduce the size of
indexes, because indexes have a predefined structure.

➢ In general, it is not necessary to use short field names.

➢ MongoDB stores all field names in every document. For most documents,
this represents a small fraction of the space used by a document; however,
for small documents the field names may represent a proportionally large
amount of space. Consider a collection of small documents that resemble
the following:

{ last_name: "Smith", best_score: 3.9 }

➢ If you shorten the field named last_name to lname and the field named
best_score to score, as follows, you could save 9 bytes per document.

{ lname: "Smith", score: 3.9 }

○ Embed documents:

➢ In some cases you may want to embed documents in other documents and
save on the per-document overhead. See Collection Contains Large Number
of Small Documents.

● Here are some key aspects of data modeling in MongoDB:
■ Document Structure:

➢ MongoDB stores data in JSON-like documents using a binary representation
called BSON (Binary JSON). These documents can contain nested structures
and arrays, allowing for more complex data models compared to rows and
columns in relational databases.

■ Schema Design:

➢ MongoDB is schema-less, allowing for dynamic and flexible schemas.
However, even though MongoDB allows this flexibility, defining a
well-thought-out schema that fits your application's needs is crucial for
performance and scalability.

■ Embedding vs Referencing:

➢ MongoDB allows you to embed related data within a single document or
reference it across multiple documents. Deciding whether to embed or
reference depends on factors such as data access patterns, data size, and
relationships between entities.
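
➢ A sketch of the referencing side of that choice (collection and field
names are illustrative):

// The order stores the customer's _id instead of embedding the customer.
const customerId = db.customers.insertOne({ name: "Ada" }).insertedId
db.orders.insertOne({ customer_id: customerId, total: 49.99 })

// Resolving the reference takes a second query (or a $lookup stage).
const order = db.orders.findOne({ total: 49.99 })
db.customers.findOne({ _id: order.customer_id })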

■ Indexing:

➢ Creating appropriate indexes is vital for efficient querying in
MongoDB. Indexes help speed up data retrieval operations.
Understanding query patterns and frequently accessed fields is essential
for creating the right indexes.
■ Normalization vs Denormalization:

➢ Unlike traditional relational databases, where normalization is a common
practice to minimize redundancy, MongoDB often uses denormalization to
improve query performance. This involves duplicating data across
documents to avoid costly joins.

■ Scalability:

➢ Designing a data model that supports scalability is crucial in MongoDB.
Considering sharding (horizontal scaling) and replication early in the data
modeling process ensures the model can handle increasing data volumes
and concurrent users.

➢ When modeling data in MongoDB, it's important to consider the specific
requirements of your application, the expected workload, and the types of
queries you'll perform most frequently. This way, you can create a data
model that optimizes performance, scalability, and flexibility.

● Some MongoDB data modeling best practices

1. Understand Your Data and Use Cases:

○ Know Your Application: Understand how your application will use the data,
the read/write patterns, and the types of queries it will execute most
frequently.

○ Analyze Access Patterns: Design the data model based on how data will be
accessed and queried.

2. Design a Schema that Fits Your Use Case:

○ Balance Flexibility and Structure: Leverage MongoDB's flexibility, but
design a schema that suits your application's requirements.

○ Use Schema Validation: Employ schema validation to enforce data
consistency and integrity where necessary.

3. Optimize Document Structure:

○ Embedding vs Referencing: Choose between embedding related data
within a single document or referencing it across multiple documents, based
on query patterns and data relationships.
○ Avoid Deeply Nested Structures: Deeply nested arrays or objects can
affect query performance. Use nesting judiciously.

4. Index Appropriately:

○ Create Indexes: Identify fields frequently used in queries and create
indexes on those fields to speed up query performance.

○ Compound Indexes: For queries that involve multiple fields, consider using
compound indexes.
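
○ For illustration, a compound index supporting a hypothetical query that
filters on status and sorts by order date:

db.orders.createIndex({ status: 1, orderDate: -1 })

// Both the filter and the sort can be satisfied by the index above.
db.orders.find({ status: "shipped" }).sort({ orderDate: -1 })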

5. Normalize or Denormalize Data Appropriately:

○ Balance Normalization and Denormalization: Normalize data to maintain
consistency and reduce redundancy where it makes sense, but consider
denormalization for performance optimization.

○ Use References Judiciously: If referencing data, ensure it's done for logical
relationships and doesn't lead to excessive query loads.

6. Consider Sharding and Replication for Scalability:

○ Sharding: Plan for sharding early if your data is expected to grow
significantly. Distribute data across shards to achieve horizontal scalability.

○ Replication: Implement replication to ensure data redundancy, fault
tolerance, and high availability.

7. Regularly Monitor and Adjust:

○ Monitor Performance: Regularly analyze query performance, index usage,
and overall database performance metrics.

○ Refine Data Model: Modify the data model as needed based on observed
performance and changing application requirements.

8. Utilize Aggregation Framework:

○ Leverage Aggregation: Use MongoDB's powerful Aggregation Framework
for complex data manipulation, analytics, and reporting tasks.
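
○ A small pipeline sketch: total sales per product for a hypothetical "sales"
collection:

db.sales.aggregate([
  { $match: { status: "complete" } },                          // filter first
  { $group: { _id: "$product", total: { $sum: "$amount" } } }, // then group
  { $sort: { total: -1 } }                                     // rank by total
])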

9. Test and Iterate:

○ Prototype and Test: Experiment with different data models and query
patterns. Test the performance of each approach to find the most efficient one.

○ Iterate Based on Feedback: Use insights gained from testing to refine and
improve the data model iteratively.

■ Elaborating on use cases involves understanding how different scenarios or
applications might benefit from MongoDB's data modeling best practices:

1. Content Management Systems (CMS):

○ Use Case: Storing articles, blogs, or multimedia content.

○ Modeling Approach: Use embedding to store comments, likes, or metadata
within the same document as the content, so they can be retrieved together
efficiently.

2. E-Commerce Platforms:

○ Use Case: Managing products, orders, and customer data.

○ Modeling Approach: Use referencing to connect orders to products and
customers, allowing for efficient retrieval of specific order details or
customer information.

3. Internet of Things (IoT) Applications:

○ Use Case: Storing sensor data, device information, and telemetry data.

○ Modeling Approach: Depending on the volume of data, use sharding to
handle the growing number of devices and sensor readings efficiently.

4. Real-Time Analytics and Logging:

○ Use Case: Capturing and analyzing log data, user interactions, or system
events.

○ Modeling Approach: Use denormalization to optimize read performance and
generate real-time analytics without the need for complex joins.

5. Social Media Platforms:

○ Use Case: Managing user profiles, friendships, posts, comments, and likes.
○ Modeling Approach: Use a combination of embedding and referencing to
balance data retrieval efficiency with consistency.

6. Geographic Information Systems (GIS):

○ Use Case: Storing geographical data, maps, and spatial information.

○ Modeling Approach: Utilize MongoDB's geospatial indexes and data types
for efficient querying and analysis of spatial data.
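
○ A sketch of that geospatial support (collection and field names are
illustrative):

// 2dsphere index on a GeoJSON point field.
db.places.createIndex({ location: "2dsphere" })

db.places.insertOne({
  name: "Central Park",
  location: { type: "Point", coordinates: [-73.9665, 40.7812] } // [lng, lat]
})

// Places within 1 km of a point, nearest first.
db.places.find({
  location: {
    $near: {
      $geometry: { type: "Point", coordinates: [-73.97, 40.78] },
      $maxDistance: 1000
    }
  }
})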

7. Messaging and Chat Applications:

○ Use Case: Managing conversations, messages, and user interactions.

○ Modeling Approach: Embed messages within conversation documents for
fast retrieval of entire conversation histories.

8. Event Logging and Monitoring:

○ Use Case: Storing and analyzing system events, errors, and performance
metrics.

○ Modeling Approach: Design a schema that enables efficient querying and
aggregation of event data for monitoring and analysis.

9. Healthcare and Electronic Medical Records (EMR):

○ Use Case: Managing patient records, medical history, and appointments.

○ Modeling Approach: Employ a schema that balances data normalization for
consistency with denormalization for efficient retrieval of patient
information.

10. Gaming Applications:

○ Use Case: Storing user profiles, game data, scores, and achievements.

○ Modeling Approach: Design a schema that allows for quick retrieval of
game-related data, considering performance optimizations for leaderboards
or achievements.
