Lecture Notes 2 - MongoDB Data Modeling
LECTURE 2 NOTES
● What is Data Modeling?
■ In MongoDB, data modeling refers to the process of designing the structure of your
data to best suit the needs of your application. Unlike traditional relational
databases where data is organized in tables, rows, and columns, MongoDB is a
NoSQL database that stores data in flexible, JSON-like documents.
■ The key challenge in data modeling is balancing the needs of the application, the
performance characteristics of the database engine, and the data retrieval
patterns. When designing data models, always consider the application usage of
the data (i.e. queries, updates, and processing of the data) as well as the inherent
structure of the data itself.
■ Flexible Schema
○ Unlike SQL databases, where you must determine and declare a table's
schema before inserting data, MongoDB's collections, by default, do not
require their documents to have the same schema. That is:
○ The documents in a single collection do not need to have the same set of
fields and the data type for a field can differ across documents within a
collection.
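As a quick sketch of the flexible-schema point, here are two documents that could legally live in the same collection (plain Python dicts stand in for BSON documents; the user data is made up for illustration):

```python
# Two documents headed for the same "users" collection: they have
# different field sets, and "joined" has a different type in each.
# (Illustrative data only; not from the lecture.)
user_a = {"_id": 1, "name": "Asha", "joined": "2023-05-01", "tags": ["admin"]}
user_b = {"_id": 2, "name": "Ben", "joined": 2023}  # no "tags", integer "joined"

# MongoDB would accept both in one collection; a relational table would not.
extra_fields = set(user_a) - set(user_b)
print(extra_fields)  # fields only user_a has
```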
■ Document Structure
○ With MongoDB, you may embed related data in a single structure or document.
These schemas are generally known as ‘denormalized’ models, and take
advantage of MongoDB's rich documents.
○ In general, use embedded data models when you have ‘contains’ relationships between entities, or one-to-many relationships in which the ‘many’ (child) documents always appear with, or are viewed in the context of, the ‘one’ (parent) documents.
○ To access data within embedded documents, use dot notation to ‘reach into’ them. See the MongoDB manual pages Query an Array and Query on Embedded/Nested Documents for more examples of accessing data in arrays and embedded documents.
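A small sketch of both points: the embedded one-to-many model and dot-notation access. The dict shapes and the get_path helper are made up for illustration; in real queries the server interprets the dotted path (e.g. "addresses.0.city") for you:

```python
# A 'contains' one-to-many relationship modeled by embedding: the
# addresses only ever appear in the context of their parent user.
user = {
    "_id": 1,
    "name": "Asha",
    "addresses": [
        {"label": "home", "city": "Pune"},
        {"label": "work", "city": "Mumbai"},
    ],
}

def get_path(doc, path):
    # Resolve a MongoDB-style dotted path against nested dicts/lists.
    # Hypothetical helper: real queries pass the dotted path to the server.
    cur = doc
    for part in path.split("."):
        cur = cur[int(part)] if isinstance(cur, list) else cur[part]
    return cur

print(get_path(user, "addresses.0.city"))  # -> Pune
```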
■ Documents in MongoDB must be smaller than the maximum BSON document size (16 MB).
■ For many use cases in MongoDB, the denormalized data model is optimal.
● Atomicity of Write Operations
○ A denormalized data model with embedded data combines all related data in a
single document instead of normalizing across multiple documents and
collections. This data model facilitates atomic operations.
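To see why embedding helps atomicity: in the denormalized model below, the order status and total live in one document, so one write changes both together. The shapes and the apply_update stand-in are illustrative; in MongoDB this would be a single updateOne with $set, which is atomic per document:

```python
# Denormalized order: header fields and line items in one document.
order = {
    "_id": 101,
    "status": "pending",
    "items": [{"sku": "A1", "qty": 2}],
    "total": 20.0,
}

def apply_update(doc, set_fields):
    # Toy stand-in for a single {$set: ...} update: because it touches
    # one document, all the fields change together or not at all.
    doc.update(set_fields)
    return doc

apply_update(order, {"status": "paid", "total": 18.0})  # one atomic write
print(order["status"], order["total"])
```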
■ Multi-Document Transactions
○ For situations that require atomicity of reads and writes to multiple documents
(in a single or multiple collections), MongoDB supports multi-document
transactions:
○ Note: in many scenarios, modeling your data appropriately will minimize the need for multi-document transactions.
● Sharding
■ To distribute data and application traffic in a sharded collection, MongoDB uses the
shard key. Selecting the proper shard key has significant implications for
performance, and can enable or prevent query isolation and increased write
capacity. While you can change your shard key later, it is important to carefully
consider your shard key choice.
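Why shard key choice matters can be sketched with a toy hashed router (MongoDB's real hashed sharding uses its own hash function and chunk ranges; md5 here is only illustrative):

```python
import hashlib

def shard_for(key_value, num_shards=3):
    # Toy hashed-shard-key routing: hash the key value, map it to a shard.
    h = int(hashlib.md5(str(key_value).encode()).hexdigest(), 16)
    return h % num_shards

# A high-cardinality key (user id) can use every shard; a key with only
# two values ("active"/"inactive") can never use more than two.
shards_used = {shard_for(uid) for uid in range(1000)}
print(len(shards_used))           # all 3 shards receive writes
status_shards = {shard_for(s) for s in ["active", "inactive"]}
print(len(status_shards))         # at most 2 shards, however many exist
```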
● Indexes
■ Use indexes to improve performance for common queries. Build indexes on fields
that appear often in queries and for all operations that return sorted results.
MongoDB automatically creates a unique index on the _id field.
○ While in use, each index consumes disk space and memory. This usage can be significant and should be tracked for capacity planning, especially for concerns over working set size.
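The effect of an index can be sketched in memory: instead of scanning every document for env == "dev", a single-field index maps each value straight to the matching _ids (a toy structure; real MongoDB indexes are B-trees maintained by the server, and the document shapes here are invented):

```python
# Documents in a toy collection (illustrative shapes).
docs = [
    {"_id": 1, "env": "dev"},
    {"_id": 2, "env": "prod"},
    {"_id": 3, "env": "dev"},
]

# Build a single-field "index" on env: value -> list of _ids.
index_on_env = {}
for d in docs:
    index_on_env.setdefault(d["env"], []).append(d["_id"])

print(index_on_env["dev"])  # direct lookup, no collection scan
```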
■ Consider a sample collection log that stores log documents for various
environments and applications. The log collection contains documents in the
following form:
{ log: "dev", ts: ..., info: ... }
■ If the total number of documents is low, you may group documents into collections
by type. For logs, consider maintaining distinct log collections, such as logs_dev
and logs_debug. The logs_dev collection would contain only the documents related
to the dev environment.
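The routing for that scheme is trivial in application code; a sketch (the document shape follows the log example above, and the helper name is made up):

```python
def log_collection_name(environment):
    # Map an environment to its dedicated collection, e.g. logs_dev.
    return f"logs_{environment}"

doc = {"log": "dev", "ts": "2024-01-01T00:00:00Z", "info": "started"}
print(log_collection_name(doc["log"]))  # -> logs_dev
```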
■ Generally, having a large number of collections carries no significant performance penalty and can result in very good performance. Distinct collections are very important for high-throughput batch processing.
■ When using models that have a large number of collections, consider the following
behaviors:
○ Each index, including the index on _id, requires at least 8 kB of data space.
○ For each database, a single namespace file (i.e. <database>.ns) stores all meta-data for that database, and each index and collection has its own entry in the namespace file. See the namespace length limits for specific limitations.
○ ‘Rolling up’ these small documents into logical groupings means that
queries to retrieve a group of documents involve sequential reads and fewer
random disk accesses. Additionally, ‘rolling up’ documents and moving
common fields to the larger document benefit the index on these fields.
There would be fewer copies of the common fields and there would be
fewer associated key entries in the corresponding index. See Indexes for
more information on indexes.
○ However, if you often only need to retrieve a subset of the documents within
the group, then ‘rolling-up’ the documents may not provide better
performance. Furthermore, if small, separate documents represent the
natural model for the data, you should maintain that model.
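A sketch of the ‘rolling up’ idea: many per-minute documents become one per-hour document, so reading the hour is one fetch instead of sixty (the shapes and the roll_up helper are assumptions for illustration):

```python
# Sixty small per-minute documents (illustrative shape).
events = [{"minute": m, "count": 1} for m in range(60)]

def roll_up(hour, events):
    # Group an hour of small documents into one larger document.
    return {
        "_id": f"2024-01-01T{hour:02d}",
        "counts": [e["count"] for e in events],
        "total": sum(e["count"] for e in events),
    }

doc = roll_up(13, events)
print(doc["total"])  # 60 events, readable in a single fetch
```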
■ Consider the following suggestions and strategies for optimizing storage utilization
for these collections:
➢ To optimize storage use, users can specify a value for the _id field explicitly
when inserting documents into the collection. This strategy allows
applications to store a value in the _id field that would have occupied space
in another portion of the document.
➢ You can store any value in the _id field, but because this value serves as a
primary key for documents in the collection, it must uniquely identify them.
If the field's value is not unique, then it cannot serve as a primary key as
there would be collisions in the collection.
➢ MongoDB stores all field names in every document. For most documents,
this represents a small fraction of the space used by a document; however,
for small documents the field names may represent a proportionally large
amount of space. Consider a collection of small documents that resemble
the following:
{ last_name: "Smith", best_score: 3.9 }
➢ If you shorten the field named last_name to lname and the field named
best_score to score, as follows, you could save 9 bytes per document.
{ lname: "Smith", score: 3.9 }
○ Embed documents:
➢ In some cases you may want to embed documents in other documents and
save on the per-document overhead. See Collection Contains Large Number
of Small Documents.
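The byte arithmetic behind the renaming strategy checks out directly, and the explicit-_id strategy is a one-liner; a quick sketch (the SKU value is invented for illustration):

```python
# Field names are stored in every BSON document, so each rename saves
# len(old_name) - len(new_name) bytes per document.
renames = {"last_name": "lname", "best_score": "score"}
savings_per_doc = sum(len(old) - len(new) for old, new in renames.items())
print(savings_per_doc)               # 9 bytes per document, as above
print(savings_per_doc * 1_000_000)   # roughly 9 MB across a million docs

# Explicit _id: store a natural unique key once, instead of keeping it
# in another field alongside an auto-generated ObjectId.
product = {"_id": "SKU-00042", "lname": "Smith", "score": 3.9}
```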
■ Best practices for MongoDB data modeling revolve around document structure, schema design, embedding vs referencing, and scalability:
○ Know Your Application: Understand how your application will use the data,
the read/write patterns, and the types of queries it will execute most
frequently.
○ Analyze Access Patterns: Design the data model based on how data will be
accessed and queried.
○ Index Appropriately:
○ Compound Indexes: For queries that involve multiple fields, consider using
compound indexes.
○ Use References Judiciously: If referencing data, ensure it's done for logical
relationships and doesn't lead to excessive query loads.
○ Refine Data Model: Modify the data model as needed based on observed
performance and changing application requirements.
○ Prototype and Test: Experiment with different data models and query
patterns. Test the performance of approaches to find the most efficient one.
○ Iterate Based on Feedback: Use insights gained from testing to refine and
improve the data model iteratively.
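How a compound index serves a multi-field query can be sketched with sorted key tuples and binary search (a toy model; real MongoDB indexes are server-maintained B-trees, and field order in the key matters just as it does here):

```python
import bisect

# Toy compound index on (env, ts): (key tuple, _id) entries kept sorted.
entries = sorted([
    (("dev", 3), 1), (("prod", 1), 2), (("dev", 1), 3), (("dev", 2), 4),
])
keys = [k for k, _ in entries]

# Query: env == "dev" AND 2 <= ts <= 3, answered by two binary searches.
lo = bisect.bisect_left(keys, ("dev", 2))
hi = bisect.bisect_right(keys, ("dev", 3))
matching_ids = [doc_id for _, doc_id in entries[lo:hi]]
print(matching_ids)  # -> [4, 1]
```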
■ Example Use Cases (e.g. e-commerce platforms, IoT, analytics, social networks, logging, gaming):
○ IoT: Storing sensor data, device information, and telemetry data.
○ Analytics and Event Tracking: Capturing and analyzing log data, user interactions, or system events.
○ Social Networks: Managing user profiles, friendships, posts, comments, and likes. Modeling approach: use a combination of embedding and referencing to balance data retrieval efficiency with consistency.
○ Logging and Monitoring: Storing and analyzing system events, errors, and performance metrics.
○ Gaming: Storing user profiles, game data, scores, and achievements.
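The embedding-plus-referencing blend mentioned for social networks can be sketched like this (the document shapes are assumptions; the application-side join in render_post is what an aggregation $lookup or a second query would do for real):

```python
# Comments are embedded (always read with their post); the author is
# referenced by _id (shared by many posts, updated in one place).
users = {"u1": {"_id": "u1", "name": "Asha"}}
post = {
    "_id": "p1",
    "author_id": "u1",                         # reference
    "body": "hello",
    "comments": [{"by": "u2", "text": "hi"}],  # embedded
}

def render_post(post, users):
    # Resolve the author reference in application code (illustrative).
    author = users[post["author_id"]]
    return {"author": author["name"], "body": post["body"],
            "comment_count": len(post["comments"])}

print(render_post(post, users))
```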