
Data modeling with Amazon DocumentDB

Table of Contents

Introduction
The relational model
Adapting to the document model
The DocumentDB API
Documents and the _id field
Inserting documents
Reading documents
Sorting, projecting, and other options
Updating documents
Deleting documents
Aggregation framework
Transactions
Operations conclusion
Data modeling patterns
Schema management in DocumentDB
Managing your schema in your application code
Managing your schema with DocumentDB's JSON Schema validation
Managing relationships in DocumentDB
Managing relationships with embedding
Handling relationships with duplication
Managing relationships with referencing
Indexes in DocumentDB
Compound indexes
Multi-key indexes
Sparse indexes
Advanced tips
Use the aggregation framework wisely
Scaling with DocumentDB
Reduce I/O
Reduce document size
Conclusion

Introduction
For decades, the relational database was the primary choice for storing
data. New developers learned the SQL language and the virtues of
database normalization. To reduce the learning curve of SQL or the tedium
of mapping between paradigms, developers used object-relational mappers
(ORMs) to translate between their live application and their durable
storage. The relational database provided enough flexibility and ubiquity
to be the default choice.

In the early 2000s, the ground shifted. New databases arose, often
grouped under the term ‘NoSQL’. While there were many flavors of NoSQL
databases, the document database was the most popular. Document
databases provided a flexible data model that more closely matched the
objects used in application code. This allowed developers to avoid the
famed impedance mismatch between the relational model and application
objects. Further, document databases provided a flexible schema that
allowed developers to evolve their data model over time without the pain
of database migrations.

In this book, we’ll look at how to model your data in a document database.
We’ll focus on Amazon DocumentDB, a fully managed document database
service that is compatible with MongoDB. We’ll start by looking at the
differences between the relational model and the document model as
well as some key principles for adopting the document model. Then, we’ll
look at the core API operations in DocumentDB. Finally, we’ll look at some
common data modeling patterns for DocumentDB.
The relational model
Before we dive into the details of data modeling in DocumentDB, let’s
take a step back and look at the relational and document models. How are
these two models similar and where do they differ?

In a traditional relational model, each entity in your application is contained
in a separate table. If your application has teams, users, and support tickets,
you would have a teams table, a users table, and a tickets table. Each
table would contain only the data for that entity.

Within each table, you would define a schema for the records contained
in that table. This schema is made up of columns, and each column has
a name and data type. In our users table, we might have a username
column of type string, a name column of type string, an email column
of type string, and a created_at column of type datetime. Each row
in the table would contain the data for a single user, and each column
would contain a single piece of data for that user.

Traditionally, column types would be simple scalar values like strings and
numbers. If you had more complex data, like an array of values, you would
split that data into a separate table and use a reference to that table in
your original table. For example, a team will have multiple users. Rather
than storing the users within the team record, each user record will point to
its corresponding team record via a foreign key.

Thus, the three key characteristics of the relational model are (1) a fixed
schema for each table, (2) a flat data model using scalar values, and (3)
references across entities via foreign keys.

Adapting to the document model

As we'll see, the document model does not have these three main characteristics of a relational database. But while there are significant differences between the document and relational models, you don't need to throw away all of your existing database experience. You're still working with individual records made of attributes, and these records are grouped together within the database. The specifics of the terminology differ. An individual record is called a document in DocumentDB as compared to a row in a relational database. Within a document, individual attributes are called fields in DocumentDB instead of columns like in a relational database. Documents are kept together in collections rather than in tables.

The biggest difference between the relational and document models is in the flexibility of designing your data objects. This flexibility can be a benefit, as we'll see throughout this book, but you should be careful in how you use it. In general, there are two areas where the document model is more flexible than the relational model.

First, the document model provides more schema flexibility than the relational model. In a relational database, you must define a schema for your table before you can insert any data. This schema defines the columns that are available for each row, and the data type of each column. Once you've defined this schema, you can only insert rows that match the schema. If you try to insert a row with a column that doesn't exist in the schema, or a value that doesn't match the data type of the column, the database will reject the insert.

In contrast, you don't need to define a schema for a document database like DocumentDB. Documents in DocumentDB are self-describing, like JSON objects. This allows you to customize the shape of each document to match the needs of your application. This can prove particularly useful in situations with highly flexible data, such as a content management system that includes a variety of different types of content, each with their own set of fields.

Additionally, DocumentDB's schema flexibility reduces the operational pain of evolving your data model over time. In a relational database, you would need to perform table-level DDL operations to add or remove a column. Traditionally, this was a painful process that could require downtime for your application. Even with modern relational databases with online DDL, performing these operations can require significant resources on your database instance.

While this schema flexibility is useful, it does come with a cost. A relational database would do the work to ensure your data was in the format you expected. With a schemaless database like DocumentDB, these guardrails are gone. You'll need to do the work to enforce the schema of your application data. In the data modeling section below, we'll look at different approaches to managing your schema in a schemaless database like DocumentDB.

A second way that DocumentDB is more flexible than a relational database is in the way it selectively encourages denormalization of your application data. In a relational database, you would typically normalize your data model following the principles of Edgar Codd. This often requires splitting your data into multiple tables and tracking references across tables using foreign keys. It also avoids using nested data structures, such as objects or arrays, in your data model. Rather, you would split these nested data structures into their own tables. To combine data from different tables, you would use joins at query time.

Database normalization is a useful technique that helps to avoid data anomalies and provide query flexibility. However, normalization has its downsides as well. It can reduce query-time performance as your join is hopping around multiple tables to assemble your required data. It can also increase the complexity of your application code if you need to perform multiple queries to assemble your data. Finally, it can be difficult to translate the normalized data model into the objects used by your application, which often use nested data structures.

In a document model, you use a more denormalized model when you don't need the benefits of normalization. You may duplicate data across multiple documents, particularly when that data is not changing frequently. This can change an expensive join operation into a simple lookup. Further, you may choose to use nested data structures. This selective denormalization can improve the performance of your queries and simplify your application code.

These two changes - schema flexibility and careful denormalization - make up the fundamental differences between the relational model and the document model.

In the rest of this ebook, we'll look at how to adapt our data modeling techniques to take advantage of these differences. First, we'll start with an overview of the DocumentDB API. Then, we'll cover the three core aspects of proper data modeling in DocumentDB. Finally, we'll close with some advanced tips for using DocumentDB well.

The DocumentDB API
Just as there are differences in terminology and style between
DocumentDB and a traditional relational database, there are also
differences in the API. In this section, we’ll look at core API operations in
DocumentDB to get a feel for how to interact with the database. These
operations will form the building blocks for the data modeling patterns
we’ll look at later.

DocumentDB provides compatibility with the MongoDB API. Thus, you can
use the MongoDB client libraries as well as third-party ORMs and other
tools. In the examples below, we’ll use the Node.js SDK for MongoDB.
However, the concepts are the same regardless of the SDK you use.
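
The examples that follow assume you've already created and connected a client object. A minimal sketch of that setup with the MongoDB Node.js driver might look like the following; the connection string is a placeholder, and a real Amazon DocumentDB cluster will have its own endpoint, credentials, and TLS settings.

const { MongoClient } = require('mongodb');

// Placeholder connection string; replace the host, credentials, and TLS
// options with the values for your own cluster.
const client = new MongoClient('mongodb://user:password@your-cluster-endpoint:27017/?tls=true&retryWrites=false');
await client.connect();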

Documents and the _id field

As noted, a single record in DocumentDB is called a document. This is comparable to a row in a relational database. As we've seen, a document is more similar to a JSON object than a record you're used to in a relational database. A document has no set schema by default and can contain nested data structures.

There is one caveat to DocumentDB's schemaless model: every document must have an _id field. This field is required for every document, and it must uniquely identify a single document in a collection. This _id field is similar to a primary key in a relational database.

You may provide a value for the _id field when inserting a document. This is useful when you have a natural key for your document, such as a username or email address. The _id field is indexed by default, so using a natural key for the _id field can take full advantage of this index.

If you don't provide an _id field for a document, DocumentDB will provide one for you. This will be in the form of an ObjectId, which is a twelve-byte unique identifier. The first four bytes of an ObjectId are a timestamp indicating when the object was created. In addition to providing uniqueness, this can be useful for sorting documents by creation time.
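
For example, using the sort method covered below, a minimal sketch that leans on the ObjectId timestamp to return the newest documents first might look like this:

// Auto-generated ObjectIds begin with a creation timestamp, so sorting by
// _id in descending order roughly orders documents newest to oldest.
const newestFirst = collection.find().sort({ _id: -1 });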

Finally, remember that each document must belong to a collection in DocumentDB. When interacting with documents, you must specify the collection that contains the document. Thus, you often have some initialization code to get a reference to the collection you want to interact with.

For example, the following code gets a reference to the users collection in the main database.

const db = client.db('main');
const collection = db.collection('users');

Then, you can use the collection variable to interact with documents in the users collection.

With this core understanding of documents, let's look at how you can interact with them in DocumentDB.

Inserting documents

The first operation we'll look at is inserting documents. This is the equivalent of an INSERT operation in a relational database. You can insert a single document or multiple documents at a time.

To insert a document into your database, you will use the insertOne method.

const result = await collection.insertOne({
  username: 'johndoe',
  name: 'John Doe',
  email: '[email protected]',
  created_at: new Date(),
});

console.log(result.insertedId); // 65836d8ad4211546b47af455

You can provide a single object to the insertOne method. This will be serialized into a document and inserted into the collection. The insertOne method returns a result object that contains the _id of the inserted document as an insertedId property.

As noted above, you can specify the _id property yourself if you want. However, the write will fail if you specify an _id that already exists in the collection.

const result1 = await collection.insertOne({
  _id: 'johndoe',
  name: 'John Doe',
  email: '[email protected]',
  created_at: new Date(),
});

console.log(result1.insertedId); // johndoe

const result2 = await collection.insertOne({
  _id: 'johndoe',
  name: 'Jonathan Doe',
  email: '[email protected]',
  created_at: new Date(),
}); // Throws error because _id already exists

You can also insert multiple documents at once using the insertMany method.

const result = await collection.insertMany([
  {
    username: 'johndoe',
    name: 'John Doe',
    email: '[email protected]',
    created_at: new Date(),
  },
  {
    username: 'janedoe',
    name: 'Jane Doe',
    email: '[email protected]',
    created_at: new Date(),
  }
]);

console.log(result.insertedIds); // [ 65836d8ad4211546b47af455, 65836d8ad4211546b47af456 ]

When inserting multiple documents at once, the insertedIds property of the result object will be an array of the _id values for each inserted document, in the order they were inserted.

Reading documents

Inserting documents is all well and good, but our data is only really useful if we can read it back out. In this section, we'll look at how to read documents from DocumentDB.

To read documents from DocumentDB, you will use the find method.

const result = collection.find();

for await (const doc of result) {
  console.log(doc);
}

The result of the find method is a cursor object, similar to a cursor in a relational database. This cursor object is an async iterator, which means you can use it in a for await ... of loop to iterate over the documents in the result set. Under the hood, the find method is retrieving batches of records. As you iterate over the cursor, it will fetch the next batch of records as needed.

The example above returns all documents within a collection. This would be like a SELECT * FROM users query in a relational database. However, you often want to filter the documents during a particular query to only return the documents you need. You can do this by passing a filter object to the find method.

const result = collection.find({
  username: 'johndoe'
});

for await (const doc of result) {
  console.log(doc);
}

With this filter object, DocumentDB will only return documents where the username field is equal to johndoe. This would be like a SELECT * FROM users WHERE username = 'johndoe' query in a relational database.

When doing an exact match filter on the username property, it's likely you will receive only a single document back. In situations where you expect a single document, you can use the findOne method instead of the find method.

const result = await collection.findOne({
  username: 'johndoe'
});

console.log(result); // { _id: 65836d8ad4211546b47af455, username: 'johndoe', ... }

Note that the findOne method returns a single document, not a cursor. This means you wouldn't use a for await ... of loop to iterate over the results. Rather, you can simply await the result of the findOne method.

In the examples above, we've used a simple exact match on the username field to locate matching records. However, you can use a wide range of operators to filter your results. Let's quickly look at a few of these operators.

A common requirement is to find matching records with a value greater than or less than a particular value. You can use the $gt and $lt operators to do this.

For example, you can use the $gt operator to find documents where the created_at field is greater than a particular date.

const result = collection.find({
  created_at: { $gt: new Date('2024-01-01') }
});

Notice that the value for the created_at field is an object with a $gt property. This is how you specify operators in DocumentDB. The value of the $gt property is the value you want to compare against. The use of a $ is an indicator that this is an operator, not a field name.

If you want values that are greater than or equal to (or less than or equal to), you can use the $gte and $lte operators.

Another common requirement is to find matching records where a field is in a list of values. For example, we may want to find all support tickets that were created by a particular set of users.

You can use the $in operator to do this:

const result = collection.find({
  username: { $in: ['johndoe', 'janedoe'] }
});

This will return all documents where the username field is equal to either johndoe or janedoe.

As mentioned before, DocumentDB is a schemaless database that allows for nested data structures. You can query those data structures directly using dot notation -- referring to the nested field by the path from the root of the document.

For example, let's extend our users collection to include an address field with nested values:

{
  _id: 65836d8ad4211546b47af455,
  username: 'johndoe',
  name: 'John Doe',
  email: '[email protected]',
  address: {
    street: '123 Broadway',
    city: 'New York',
    state: 'NY',
    zip: '11111'
  }
}

If you want to find all the users that live in a specific zip code, you can use dot notation to query the nested zip field.

const result = collection.find({
  'address.zip': '11111'
});

This will return all documents where the zip field in the address object is equal to 11111.

Finally, while dot notation works well for nested objects, you may want to query for documents that contain a particular value in an array. For example, let's extend our users collection to include a roles field with an array of roles.

{
  _id: 65836d8ad4211546b47af455,
  username: 'johndoe',
  name: 'John Doe',
  email: '[email protected]',
  address: {
    street: '123 Broadway',
    city: 'New York',
    state: 'NY',
    zip: '11111'
  },
  roles: ['admin', 'user']
}

If you want to find all the users that have a particular role, you can use the $elemMatch operator. With the query below, we can find all users with the admin role.

const result = collection.find({
  roles: { $elemMatch: { $eq: 'admin' } }
});

The $elemMatch operator is composable, so you can construct a subquery that is applied to all elements in an array. DocumentDB will return any document where the subquery matches at least one element in the array.

There are a number of other operators available in DocumentDB, including regex matching, JSON schema matching, geospatial queries, and bitwise operations. For a full listing of compatibility with MongoDB APIs, see the DocumentDB documentation. For details on the operators available in DocumentDB, see the MongoDB documentation on query operators.
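
As one small illustration, regex matching uses the $regex operator; the pattern below is just an example.

// Find users whose email ends with a particular domain.
const result = collection.find({
  email: { $regex: 'example\\.org$' }
});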

Sorting, projecting, and other options

When reading documents from DocumentDB, you may want to alter the returned results in some way. This could mean changing the order of a set of documents by including a preferred sort order. Additionally, it could mean limiting the fields returned in the result set to only those you need.

DocumentDB has a chainable API that allows you to specify these options. Let's look at a few of the most common options.

First, it's common to want to sort the results of a query. You can do this using the sort method.

If we wanted to return users in order of the most recently created, we could do this as follows:

const result = collection.find().sort({ created_at: -1 });

The sort method takes an object with the fields you want to sort on. The value of each field is either 1 for ascending order or -1 for descending order. In this case, we want to sort by the created_at field in descending order, so we use -1.

You can sort by multiple fields by including multiple fields in the sort object. The documents will be sorted by the first field, then the second field, and so on.

In the example below, we can find users ordered by state, then by created date to find the most recent users in each state.

const result = collection.find().sort({ state: 1, created_at: -1 });

One of the easiest ways to improve the performance of your queries is to reduce the amount of data you're reading and sending back. If you're retrieving a set of records but have a maximum number of records you want to return, you can use the limit method.

const result = collection.find().limit(10);

This is powerful when combined with the sort method. For example, if you want to find the ten most recently created users, you can use the sort method to sort by created_at in descending order and then use the limit method to limit the results to ten. For top performance, index on your sort field to avoid a full collection scan. More on indexing can be found in the data modeling section below.
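
Putting those two methods together, a minimal sketch of that query might look like this:

// Sort newest first, then cap the result set at ten documents.
const result = collection.find().sort({ created_at: -1 }).limit(10);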

Finally, in addition to reducing the number of records sent back, you can also reduce the size of each record. If you don't need all the fields in a document, you can use the project method to specify which fields to include or exclude.

const result = collection.find().project({ username: 1, name: 1 });

The project method uses a similar syntax to the sort method, but it includes some explicit and implicit behavior. Essentially, you can explicitly specify the fields you want by listing them with a value of 1. This implicitly excludes all other fields. Alternatively, you can explicitly state the fields you don't want by listing them with a value of 0. This implicitly includes all other fields.
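
For example, a small sketch of the exclusion form, returning every field except the (illustrative) address object:

const result = collection.find().project({ address: 0 });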

If you have large, sprawling documents but only need a subset of their fields, using a project operation can significantly reduce the amount of data you're sending back and forth. For large response bodies, this can significantly improve the performance of your application. Check out the indexing section below for more information on how to optimize this further by using covered queries.

Updating documents

In addition to writing new documents to DocumentDB, you'll often need to update existing documents. Perhaps a user wants to change their email address, upgrade their plan, or change their address. In this section, we'll look at how to update documents in DocumentDB.

Like the read operations, DocumentDB contains two core update methods: updateOne and updateMany. Like the read operations, the operation you choose depends on whether you want to update a single document or multiple documents.

Let's start with the updateOne method. This method takes two arguments: a filter object and an update object. The filter object is used to find the document you want to update. The update object is used to specify the changes you want to make to the document.

In the example below, we will update the johndoe user to have a new email address.

const result = await collection.updateOne(
  { username: 'johndoe' },
  { $set: { email: '[email protected]' } }
);

Notice our filter object -- { username: 'johndoe' } -- is the same as the filter object we used to find the document in the section on reading documents. All the query syntax works the same for update operations as it does for read operations.

In the update object, we use the $set operator to specify the changes we want to make to the document. The $set operator takes an object with the fields you want to update. In this case, we want to update the email field to [email protected].

Sometimes you want to remove a field from a document altogether. You can do this using the $unset operator.

const result = await collection.updateOne(
  { username: 'johndoe' },
  { $unset: { email: '' } }
);

The $set and $unset operators will do most of the update work you need. However, there are other operators available as well, including $inc for incrementing a counter or $rename for renaming an existing field. For the full list, see the MongoDB documentation on update operators.

Note that the updateOne method will always modify a single document, regardless of how many documents are matched by your filter object. If multiple documents are matched, only the first one will be modified by the updateOne operation.

Sometimes you want to update multiple documents at once. This may be an application-level update like updating all users that belong to a specific team or organization, or it may be a schema migration where you want to update all documents in a collection.

In these cases, you can use the updateMany method. This method takes the same arguments as the updateOne method, but it will update all documents that match the filter object.

In the example below, we will rename the email field to email_address for all documents in the users collection.

const results = await collection.updateMany(
  {},
  { $rename: { email: 'email_address' } }
);

Notice that we've used an empty filter object. This will match all documents in the collection. This is similar to a SELECT * FROM users query in a relational database.

The updateMany operation can be helpful for changing multiple documents at once, but use caution when using it. Performing a write operation on large numbers of documents can be a resource-intensive operation. If you're updating a large number of documents, you may want to perform the update in batches to avoid overloading your database.
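
One batching approach, sketched below, is to select a page of _id values at a time and update only those documents; the batch size is illustrative.

// Repeatedly grab up to 1,000 documents that still have the old field,
// rename the field on just that batch, and stop when none remain.
while (true) {
  const batch = await collection
    .find({ email: { $exists: true } })
    .project({ _id: 1 })
    .limit(1000)
    .toArray();
  if (batch.length === 0) break;

  await collection.updateMany(
    { _id: { $in: batch.map((doc) => doc._id) } },
    { $rename: { email: 'email_address' } }
  );
}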

A final update operation is the replaceOne method. Remember that a document in DocumentDB is uniquely identified by its _id property. The default update operations will only alter the fields specified by the update object. All other properties on the document will remain unchanged.

Sometimes you want to replace the entire document with a new document. This is where the replaceOne method is useful. You can use it to completely overwrite an existing document with a new document.

Like the updateOne method, the replaceOne method takes two arguments. The first is a filter object that is used to identify the object to replace, and the second object is the new document to replace it with.

In the example below, we will replace the johndoe user with a new document.

const result = await collection.replaceOne(
  { username: 'johndoe' },
  {
    username: 'johndoe',
    name: 'Johnny Doe',
    email: '[email protected]',
  }
);

There are two important things to note with the replaceOne method. First, note that the replacement will occur for only the first matching document for the filter object. If you include a wide filter that matches multiple documents, only one will be updated. Thus, ensure you're specific with your filter object when using the replaceOne method. Ideally you are using the _id field or another unique field to identify the document to replace.

Second, notice that the replaceOne method takes in a full document rather than an update object with the update operators. This is more similar to the insertOne method than the updateOne method.

In each situation where you're using the update methods, consider what type of updates you want to make.

• Are you updating a single document or multiple documents?
• Do you want to update the document in place or replace it with a new document?

The answers to these questions will help you determine which update method to use.

Deleting documents

The final CRUD operation we'll look at is deleting documents. This is the equivalent of a DELETE operation in a relational database. You can delete a single document or multiple documents at a time.

Like the update operations, there are separate operations for deleting a single document and for deleting multiple documents. Let's start with the deleteOne method.

The deleteOne method takes a filter object as its only argument. This filter object is used to identify the document to delete.

const result = await collection.deleteOne({ username: 'johndoe' });

This will delete a single document where the username field is equal to johndoe. Like the updateOne method, it will only delete the first matching document if the filter matches multiple documents.

If you want to bulk delete items, you can use the deleteMany method. This can be good for cascading deletes or bulk cleanup operations.

In the operation below, we will delete all users that have not logged in for the past 90 days.

const result = await collection.deleteMany({
  last_login: { $lt: new Date(Date.now() - 90 * 24 * 60 * 60 * 1000) }
});

Just like the updateMany, think carefully about how many items will be affected before running a deleteMany operation. You don't want to tie up your database with a large delete operation.

Aggregation framework

The core CRUD operations we've looked at so far are the basic building blocks for working with DocumentDB. However, DocumentDB also provides some advanced tools for data querying. One of these tools is the aggregation framework.

The aggregation framework is a pipeline-based operation that allows you to perform a series of operations on a set of documents. Each operation in the pipeline takes the results of the previous operation and performs some transformation on it. The result of the final operation is returned to the caller.

One common example is to emulate JOIN-like functionality in a relational database by combining two documents via a reference.

Imagine you have one collection where every document has a user_id field that points to the _id field of a document in the users collection. You might use the aggregate operation as follows:

const result = await collection.aggregate([
  {
    $match: {
      user_id: '65836d8ad4211546b47af456'
    }
  },
  {
    $lookup: {
      from: 'users',
      localField: 'user_id',
      foreignField: '_id',
      as: 'userInfo'
    }
  }
]);

In the aggregate operation, you provide an array of steps. These steps are performed in sequential order by the database engine. In this case, we have two steps. First, we use the $match operation to filter the documents in the collection to only those that match the provided filter. Then, we use the $lookup operation to join the users collection to our original collection.

Enriching data with a $lookup operation is a common use case for the aggregation framework. However, DocumentDB supports a number of other stage operations as well. Many of these operations compare to different operators in SQL.

We won't run through examples for all of the stages, but some of the more useful ones include (a few of them are combined in the sketch after this list):

• $match: Filters the documents in the collection to only those that match the provided filter (compare to a WHERE clause in SQL);
• $project: Allows you to specify which fields to include or exclude from the result set (compare to a SELECT clause in SQL);
• $group: Allows you to group documents together based on a field or fields (compare to a GROUP BY clause in SQL);
• $unwind: Allows you to explode arrays into individual documents per element in the array to work on those elements individually;
• $bucket: Allows you to group documents together based on a range of values (compare to a GROUP BY clause with a range in SQL);
• $sort: Allows you to sort the documents in the result set (compare to an ORDER BY clause in SQL);
• $limit: Allows you to limit the number of documents returned (compare to a LIMIT clause in SQL);
• $skip: Allows you to skip a number of documents in the result set (compare to an OFFSET clause in SQL).
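
As an illustrative sketch of a few of these stages working together (the collection and field names are assumptions), the pipeline below counts recently created users per state and returns the five largest groups:

const result = await usersCollection.aggregate([
  // Keep only users created since the start of 2024.
  { $match: { created_at: { $gte: new Date('2024-01-01') } } },
  // Count the remaining users per state.
  { $group: { _id: '$address.state', userCount: { $sum: 1 } } },
  // Order the groups from largest to smallest and keep the top five.
  { $sort: { userCount: -1 } },
  { $limit: 5 },
]).toArray();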

See the Advanced Tips section below for more on using the aggregation framework well.

Transactions

Another advanced query feature of DocumentDB is support for transactions. Transactions allow you to perform multiple operations in a single, atomic operation. If one of your steps fails, the entire transaction is rolled back and no changes are made to the database. This is a common feature of relational databases but was missing in early versions of document databases.

There are a few common reasons you may need to use transactions. The simplest is to ensure that multiple operations are performed atomically -- that is, either all of the operations succeed or none of them succeed.

A simple example is in creating a user that will belong to a team. Perhaps we want to keep track of the total members of the team on the team object. To do so, we increment the memberCount field on the team object when we create a new user.

If we need this operation to be atomic, we can use a transaction to ensure that the user is created and the memberCount field is incremented in a single operation. If either operation fails, the entire transaction is rolled back and no changes are made to the database.

You could handle this using the following code:

const session = client.startSession();

try {
  session.startTransaction();

  // Create the user
  const usersCollection = client.db('main').collection('users');
  await usersCollection.insertOne({ ... }, { session });

  // Increment the memberCount field on the team
  const teamsCollection = client.db('main').collection('teams');
  await teamsCollection.updateOne(
    { _id: '65836d8ad4211546b47af456' },
    { $inc: { memberCount: 1 } },
    { session }
  );

  // Commit the transaction
  await session.commitTransaction();
} catch (error) {
  await session.abortTransaction();
  throw error;
} finally {
  await session.endSession();
}

This is complex, so let's walk through it step-by-step.

First, you create a session using the startSession() method on the client. This session will be used to perform the transaction.

You then call the startTransaction() method on the session to start the transaction. Remember that errors could occur during your transaction, so you'll want to wrap it in a try/catch block to handle the error and roll back the transaction.

Then, you would perform the actions you want to take. In this case, we're creating a new user and incrementing the memberCount field on the team.

To complete the transaction, you call the commitTransaction() method on the session to commit the transaction to the database.

Notice in our catch block that we call the abortTransaction() method on the session. This will roll back the transaction, and any changes made during the transaction will be discarded.

Finally, we use a finally block to close the session using the endSession() method.

In every transaction you do, you must ensure you handle the preparation and clean up tasks. Start your session at the beginning, and close it when your work is through. Start a new transaction from your session, and commit or abort it when you're done. If you don't do this, you'll end up with orphaned sessions and transactions that will eventually time out, all while holding locks on your database.

Further, you should try to design your data model to avoid transactions where possible. Transactions require coordination, which can be expensive. Try to lean into the document model and avoid the need for transactions by using the patterns discussed below, including the embedded pattern to keep related data together.

Both the aggregation framework and transaction capabilities are powerful features of DocumentDB that give you some of the same power and flexibility as a traditional relational database. However, be sure to use these features sparingly. See where you can find a simpler, more scalable solution using the patterns discussed in this ebook.

Operations conclusion

In this section, we learned the basic CRUD operations in DocumentDB. These will be the bulk of your operations in DocumentDB, so you should understand them well. Specifically, think about how DocumentDB will match the documents you want to read, update, or delete. Work to ensure this is a fast, efficient operation. You can do this using some of the indexing patterns discussed below.

We also reviewed some of the advanced features of the DocumentDB API, such as the aggregation framework and transactions. These are powerful features that can help you solve complex problems. However, they can also be expensive operations that can slow down your database. Use them sparingly and consider whether there are simpler solutions to your problem.

Data modeling patterns
Now that we know the basics of the DocumentDB API, let’s look at some
common data modeling patterns. The sections below provide tactical
advice for real-life data modeling problems. These patterns are not
exhaustive, but they should provide a good starting point for your data
modeling efforts.

In general, three areas will make the biggest difference for success in
DocumentDB:

• Schema management;
• Modeling relationships;
• Proper indexes.

Focus on getting these three areas right first, then focus on the more
advanced topics at the end of this ebook.

Schema management in DocumentDB

One of the best things you can do for a scalable and efficient data model is to maintain sanity for the shape of your existing documents. We saw above that DocumentDB is a schemaless database by default -- we won't be defining column names and data types for our documents like we would with a relational database. While that flexibility is useful, particularly as your schema evolves over time, you do want to manage some semblance of a schema for your documents.

There are two common patterns for managing your schema in DocumentDB. The first, more traditional, way is to verify your schema yourself in your application code. The second, more modern, way is to use JSON Schema to define a schema for your documents and use DocumentDB's validation feature to enforce that schema.

Let's review each of these patterns in turn.

Managing your schema in your application code

In the early days of NoSQL databases, there was no server-side schema management available. This meant the only pattern available to enforce your schema was via your application code. You would need to verify that your documents matched your expected schema before writing them to the database. To be safe, you'd likely want to verify the schema again when reading the document back from the database.

To manage this schema, you can use a generic schema validation tool like JSON Schema. Alternatively, you can use a language-specific tool such as Zod to define your schema and validate your documents. Some client libraries for DocumentDB, such as Mongoose for Node.js, include schema validation tools as well.

Let's look at an example of using Zod to validate a document before writing it to DocumentDB.

const userSchema = z.object({
  username: z.string(),
  name: z.string(),
  email: z.string().email(),
  created_at: z.date(),
});

async function createUser(userData) {
  // Throws if userData does not match the schema.
  const user = userSchema.parse(userData);
  return collection.insertOne(user);
}

In this example, we've defined a schema for our user documents using Zod. Then, in our createUser function, we parse the user data against the schema. If the user data does not match the schema, Zod will throw an error. Otherwise, we can safely insert the user into DocumentDB.

This pattern works well for managing your schema, but it does have some downsides. First, you must manage the schema validation yourself. This means carefully ensuring that, in every place you write to the database, you are validating the schema. This can be difficult to manage as your application grows.

Additionally, you may be doing update operations on your documents that alter only a portion of your document. For example, you may be updating a user's email address. In this case, you would want to validate that the email address is in the correct format, but you wouldn't want to validate the entire document. This requires careful management of your schema validation and a deep understanding of your schema requirements.

For this reason, you may prefer to use server-side validation of your schema instead. Let's explore that next.

Managing your schema with DocumentDB's JSON Schema validation

A second approach to schema validation is to have DocumentDB validate your documents via its JSON schema validation feature. This allows you to define a schema for your documents and have DocumentDB validate that schema before writing the document to the database.

A schema is defined at the collection level and can be added when creating the collection or added to an existing collection. Let's look at an example of creating a collection with a schema.

const result = await db.createCollection('users', {
  validator: {
    $jsonSchema: {
      bsonType: 'object',
      required: ['username', 'name', 'email', 'created_at'],
      properties: {
        username: {
          bsonType: 'string',
        },
        name: {
          bsonType: 'string',
        },
        email: {
          bsonType: 'string',
          pattern: '^.+@.+$',
        },
        created_at: {
          bsonType: 'date',
        },
      },
    },
  },
  validationLevel: 'strict',
  validationAction: 'error',
});

Now, when you insert a document, DocumentDB will throw an error if the document does not match the schema.
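
For example, with the validator above in place, an insert that omits required fields should be rejected; this is a sketch, and the exact error surfaced may vary.

// Fails validation: 'email' and 'created_at' are required by the schema.
await db.collection('users').insertOne({
  username: 'johndoe',
  name: 'John Doe',
}); // Throws a document validation error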
This schema validation will also apply during updates. In general, this greatly
simplifies the management of your schema. As long as you can describe it correctly
via JSON schema, you'll be able to validate it with DocumentDB.

DocumentDB also allows for updates to your schema as your application evolves.
You can update your schema at any time by using the collMod command. Note
that this will not retroactively change or validate any existing documents -- you will
be responsible for making corresponding changes yourself.

If you have existing documents that will not pass your new schema validation rules
and you don't want to update them, you can reduce the validation level of your
schema. By setting validationLevel to moderate, DocumentDB will only apply
schema validation to new documents and to existing documents that are valid
before the update. If you have prior versions of documents that are invalid, they
will not be validated during updates.
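As a sketch, updating the validator with collMod and relaxing the validation level might look
like the following (the added status field is purely illustrative):

const result = await db.command({
  collMod: 'users',
  validator: {
    $jsonSchema: {
      bsonType: 'object',
      // Hypothetical change: the schema now also requires a status field.
      required: ['username', 'name', 'email', 'created_at', 'status'],
    },
  },
  // Older documents that were already invalid are skipped during update validation.
  validationLevel: 'moderate',
});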
The moderate validation level can ease the pain of updating your schema, but it
does come with a cost. If you have invalid documents in your collection, you may
not be able to trust the data in your collection. You'll need to perform additional,
application-side validation and transformations to handle invalid documents. For
this reason, it's best to keep your schema validation level at strict if possible.

Using server-side validation may seem like we've come full circle. Part of the reason
document databases gained in popularity was their flexible schema, but now we're
back to enforcing it in the database again! Note, however, that DocumentDB still
has a more flexible schema system than a traditional relational database. It natively
allows for nested data types like objects and arrays. Further, updating your schema
is a much easier operation in DocumentDB. It doesn't require locking your table or
consuming costly background resources.

In general, the server-side, JSON-schema-based approach to schema validation will
be easier to manage than the application-side approach. Whichever pattern you
choose, enforcing some schema in your application will make it easier to trust the
data in your DocumentDB database.
Managing relationships in DocumentDB

One of the biggest changes in moving from a relational database to a document
database is how you manage relationships between data. Many people incorrectly
think that 'relational' in a relational database refers to relationships between
records. This is not true. Rather, the name comes from the relation, the formal term
for a table of rows and columns, and has more in common with set theory than
with relationships between records.

Taking this misconception further, some state that a "non-relational" database
like DocumentDB cannot handle relationships between data. This is also not
true. DocumentDB can handle relationships between data. In fact, it's almost
meaningless to consider data without relationships. However, you do need to
handle relationships differently in DocumentDB than you would in a relational
database.

Let's start by reviewing what relationships (not relations!) are and how they're
handled in a relational database. Then, we'll look at how to handle relationships
in DocumentDB.

A relationship describes some connection between two pieces of data. For example,
a user may belong to a team or organization, a support ticket may be assigned to
a user, or a blog post may have comments. These are all examples of relationships
between data.

In a relational database, you commonly model different data entities as separate
tables. In our example relationships above, we would have a users table, a teams
table, a support_tickets table, a blog_posts table, and a comments table. Each
table would have its own set of columns, and each row in the table would represent
a single record.

To indicate relationships between different entities, you could use foreign keys. A
foreign key uses a reference in one table to identify a record in another table. For
example, the support_tickets table may have a user_id column that references
the id column in the users table. This would indicate that the support ticket
belongs to a particular user.

When reading these items back, you would use the JOIN operation to combine data
from multiple tables. For example, imagine that you want to find all the support
tickets for a particular user. You don't know the ID for each support ticket for that
user, but you do have the user's username. You could use a JOIN operation to find
all the support tickets that have a user_id that matches the user's ID.

SELECT * FROM support_tickets
JOIN users ON support_tickets.user_id = users.id
WHERE users.username = 'johndoe';

In this case, we're linking the two tables together using the foreign key relationship.
This allows us to filter the support_tickets table on a field that's only present in
the users table.

DocumentDB does include a JOIN-like operation -- $lookup, via the aggregation
framework discussed above -- but you should avoid it where possible.
Rather, lean in to a document model by using the relationship patterns common in
document databases.

There are three common patterns for modeling relationships in DocumentDB:
embedding, duplicating, and referencing. Let's look at each of these in turn.
Managing relationships with embedding

The first pattern for managing relationships in DocumentDB is to embed the
related data directly in the document. As we've seen throughout this book,
DocumentDB allows you to store nested data structures in your documents. Using
the embedding pattern takes advantage of this feature.

One example where you could use embedding is in selling a book in an online store.
A single book may have a few different formats -- hardcover, softcover, ebook,
and audiobook. Rather than storing these as four separate documents, you could
embed the different formats directly in the book document.

{
  _id: 65836d8ad4211546b47af455,
  title: 'The Hobbit (75th Anniversary Edition)',
  formats: [
    {
      ISBN: '978-0547928227',
      type: 'hardcover',
      price: 19.99,
      pages: 304
    },
    {
      ISBN: '978-0547928210',
      type: 'softcover',
      price: 12.99,
      pages: 410
    },
    {
      ISBN: '978-0547928241',
      type: 'ebook',
      price: 9.99,
    },
    {
      ISBN: '978-0547928234',
      type: 'audiobook',
      price: 19.99,
      lengthInMinutes: 912
    }
  ]
}
Now, in fetching a book, you can retrieve all the formats for the book in a single
query by using an indexed field like the book's title:

const result = await collection.findOne({ title: 'The Hobbit (75th Anniversary Edition)' });

The embedded pattern works best in the following situations:

• You often require the related data when retrieving the parent record. In our
  example above, users will be searching for a particular book and reviewing its
  details. In doing so, we want to display information on all available formats so
  they can choose the one that best fits their needs. By embedding the formats
  directly in the book document, we can retrieve all the data we need with a
  single lookup on the book's title.

• The related data is not retrieved directly. This factor pairs with the previous
  one, but if you are only retrieving the embedded data through the parent item,
  the embedded pattern makes sense. You won't need to index the embedded
  data separately, and you can get the benefits of indexing the parent item to
  retrieve the embedded data.

• The number of related items is limited. With databases in general, we want
  to prevent reading data that we don't actually need. If you embed data that is
  unbounded, your documents will grow in size and result in slower performance.
  In our example above, we have a limited number of formats for a book.
  However, if we were to embed all the reviews for a book in the book document,
  we could end up with a large number of reviews. This would result in a large
  document that would be slow to read from disk. In this case, we would want to
  use one of the other patterns.

These factors will be true in a number of situations with related data. In those
situations, using the embedded pattern is a great way to simplify your application
code by saving your application objects directly to the database. Further, you can
enhance performance by avoiding the need for database joins or multiple queries
to the database.

Handling relationships with duplication

While the embedded model is popular, there are a number of situations where
it doesn't work well. Most commonly, it doesn't work well when the number of
related items is unbounded, as your document size will grow as the number of
related items grows. This larger document size will result in slower performance.

Additionally, the embedded model doesn't work as well with many-to-many
relationships. In that situation, a piece of data will be related to multiple other
pieces of data. For example, a user may belong to multiple teams, or a support
ticket may be assigned to multiple users. In these situations, you can't embed the
related data in a single document.

In situations like these, you can use the duplication pattern to duplicate data
across multiple documents. Like the embedded pattern, this is a violation of the
normalization principles of the relational model. To achieve third normal form, you
should not duplicate data across multiple records. However, when used correctly,
duplication can be a powerful tool for improving performance in DocumentDB.
Think back to the example that opened this section on managing relationships.
We had a SQL query that retrieved all the support tickets for a particular
user as identified by the username. We used a JOIN operation to combine the
support_tickets and users tables to find the matching support tickets.

SELECT * FROM support_tickets
JOIN users ON support_tickets.user_id = users.id
WHERE users.username = 'johndoe';

We had to perform this JOIN operation because the information we had, the
username, was only available in the users table.

With the duplication pattern in DocumentDB, we could avoid this JOIN operation
by duplicating the username in our support ticket documents. This would allow us
to query the support_tickets collection directly to find all the support tickets for
a particular user.

A support ticket document could look as follows:

{
  _id: 65836d8ad4211546b47af455,
  title: 'Support ticket title',
  description: 'Support ticket description',
  text: '....',
  user: {
    _id: 65836d8ad4211546b47af456,
    username: 'johndoe',
    name: 'John Doe',
    email: 'johndoe@example.com'
  }
}

This would allow us to make the following query to find all the support tickets for a
particular user:

const results = collection.find({ 'user.username': 'johndoe' });

We've avoided the JOIN operation and gone directly to the support documents to
find this information. This can be a powerful performance optimization, especially
because DocumentDB allows you to index nested documents.

In addition to using duplicated data to help locate related records, you can also
duplicate data that needs to be displayed for a given record. For example, imagine
our online bookstore wants to show the purchase method used when a customer
reviews an order. Rather than including a pointer to a specific payment method, we
can simply duplicate that data onto the order record itself.
// Example order document
{
  _id: '658c32831bd77b6c38e72615',
  purchaseDate: '2023-12-27T14:19:47.000Z',
  total: 19.99,
  items: [ ... ],
  purchaseMethod: {
    type: 'credit_card',
    cardType: 'Visa',
    lastFour: '1234'
  }
}

In our order document, we have a purchaseMethod field that contains the details
of the payment method used for the order. This allows us to display the payment
method on the order without needing to perform a JOIN operation to retrieve the
payment method. We aren't retrieving the order by the payment method directly,
but we are duplicating it to avoid an additional read.

Like all data modeling patterns, there are tradeoffs to the duplication pattern.
The benefit of normalization's preference for a single, canonical source of truth
is the consistency of the data. If a data record is updated, all other records
that refer to that record will see the benefits of the update when they fetch the
canonical record.

On the other hand, if you duplicate data, you'll have to manage and handle
corresponding updates to all duplicates. Failure to do so can lead to data
inconsistencies and a confusing user experience.

The key factors to look at here are (1) whether the duplicated data is immutable,
and (2) how widely the data is duplicated. Let's explore these factors in the context
of our examples.

For our first example, we duplicate the user record into our support ticket to handle
searching by username. In many applications, the username is immutable. This
makes it much safer to duplicate this data as we won't need to update it in the
future.

If you look closely at our example, we also duplicate the user's name and email
address. These are attributes that are more likely to be mutable and thus require
corresponding updates to the duplicated data. Because support tickets are
essentially unbounded, this could be an expensive operation for our application.
For this reason, we may decide not to duplicate these mutable attributes onto our
support ticket documents.

For our second example, we duplicate the payment method onto the order
document. This seems like data that may be mutable, as a user can change
their payment methods over time. However, think about it from the perspective
of the order -- once an order is placed, the payment method for that order
does not change. Thus, we can safely duplicate the payment method onto the
order document.

The duplication pattern is a powerful pattern and another example of how selective
denormalization can improve performance in DocumentDB. However, be sure to
consider the tradeoffs of this pattern before using it in your application.
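If you do duplicate a mutable attribute, every copy has to be kept in sync. A minimal sketch
of propagating a change, assuming the support ticket documents shown earlier duplicate the
user's name (userId and newName stand in for values from your application):

// Update the canonical user record, then fan the change out to every duplicate.
await db.collection('users').updateOne({ _id: userId }, { $set: { name: newName } });
await db.collection('support_tickets').updateMany(
  { 'user._id': userId },
  { $set: { 'user.name': newName } }
);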
Managing relationships with referencing

The final pattern for managing relationships in DocumentDB is to use references.
This is the most similar to the relational model, as you are using a reference to
identify a related record.

In general, the reference pattern works well when the embedded or duplication
patterns do not fit. If you have a large or unbounded set of data that is mutable, it
may not be a good fit for the embedded or duplication patterns. Further, if you're
likely to fetch a related record directly, outside of the context of its parent record,
the reference pattern may be a good fit. Finally, if you have a many-to-many
relationship, the reference pattern is a good fit.

If you use a reference pattern, there are generally two ways to fetch your data. You
can use DocumentDB to perform the join for you using the $lookup operation, or
you can perform multiple queries to fetch the related data.

Let's look at an example of using the $lookup operation using the aggregation
framework discussed in the previous chapter.

In the support ticket example from the duplication section, we duplicated the
user record onto the support ticket to allow us to search for support tickets by
username. However, we noted that the user's name and email address are mutable
and thus may not be a good fit for duplication.

We can update our support ticket document to look as follows:

{
  _id: 65836d8ad4211546b47af455,
  title: 'Support ticket title',
  description: 'Support ticket description',
  text: '....',
  user: {
    _id: 65836d8ad4211546b47af456,
    username: 'johndoe'
  }
}
We still copy the username and _id reference in the user field, but we no longer
include the more mutable name and email address fields.

If we need the user's name and email address when fetching a support ticket,
we can use the $lookup operation to join the users collection to the
support_tickets collection.

const result = await collection.aggregate([
  {
    $match: {
      'user._id': '65836d8ad4211546b47af456'
    }
  },
  {
    $lookup: {
      from: 'users',
      localField: 'user._id',
      foreignField: '_id',
      as: 'userInfo'
    }
  }
]);

Note that this operation does two steps which are performed sequentially. First, it
matches all the support tickets for a particular user. Then, it uses the $lookup
operation to join the users collection to the support tickets collection. This will
return all the support tickets for a particular user, along with the user's record.

This simplifies our application logic but pushes the compute to our database.
To reduce the compute load on our DocumentDB cluster, we could perform two
queries instead. First, we could query the support_tickets collection to find all
the support tickets for a particular user. Then, we could query the users collection
to find the user's record.

const tickets = collection.find({ 'user._id': '65836d8ad4211546b47af456' });
const user = users.findOne({ _id: '65836d8ad4211546b47af456' });

In this case, we're able to perform both queries in parallel, which will reduce the
overall time to fetch the data. In some situations, you may need to do the queries
sequentially. For example, imagine we fetch a support ticket by its _id and then
want to fetch the user's record. In this case, we would need to wait for the first
query to complete. Then, once we have the user._id value, we can perform the
second query.

While the selective denormalization patterns of embedding and duplication are
often preferred in DocumentDB, the reference pattern can be a good fit in certain
situations. Be sure to consider the tradeoffs of each pattern before deciding which
one to use.
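As a closing note on the two-query approach above: when the lookups are independent, a
minimal sketch of running them in parallel (reusing the example _id value and the collection
handles from that snippet) looks like:

const [ticketDocs, userDoc] = await Promise.all([
  // Both requests are issued at once; total latency is roughly that of the slower query.
  collection.find({ 'user._id': '65836d8ad4211546b47af456' }).toArray(),
  users.findOne({ _id: '65836d8ad4211546b47af456' })
]);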
Indexes in DocumentDB

Good performance in DocumentDB relies on proper data modeling, and data
modeling is heavily influenced by the indexes you create. In this section, we'll see
why indexes are important and how to create them in DocumentDB.
Then, we'll learn about the different types of indexes and how you can use them
to solve your needs.

In previous sections, we retrieved a user by their username. By default, there is not
an index on the username field. This means that DocumentDB must scan the entire
collection to find the matching document. This results in a ton of data being read
from disk and eventually discarded.

Let's do some quick math to see this in action. Let's assume that each document
in our users collection is 1KB in size. If we have 1 million users in our collection,
that means we have 1GB of data in our collection. If we want to find a user by their
username, DocumentDB must read all 1GB of data from disk to find the matching
document. Once the matching document is found, we'll throw away the other
999,999 documents that we read from disk. That's a lot of wasted work!

Further, 1GB is a pretty small collection. For a large application, you might have
collections that surpass hundreds of GBs or even terabytes of data. In those
cases, you can see how inefficient it would be to scan the entire collection to find
a single document.

To avoid this, we can create an index on the username field. This will allow
DocumentDB to find the matching document without scanning the entire
collection. The index will not only be much smaller than the full dataset, as only
the username field is indexed, but it will also be sorted by the username field. This
allows DocumentDB to use a binary search to find the matching document, which is
much faster than scanning the entire collection.

Let's create an index on the username field.

await collection.createIndex({ username: 1 });

In specifying the index, you must provide an object with the field you want to index
as the key. The value of the key is the sort order of the index (1 for ascending, -1 for
descending). For a single field index like this one, the sort order doesn't matter too
much as there is only one field to sort, and the index can be used for queries in
either direction. However, for compound indexes discussed below, the sort order
can be important.

You can use single-field indexes on any field in your document, including nested
fields. Like when querying nested fields, you use the dot notation syntax to specify
the nested field.

For example, if we wanted to index an embedded zip code field, you could do it
as follows:

await collection.createIndex({ 'address.zip': 1 });
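Whichever field you index, one way to confirm a query is actually using the index is to inspect
its plan; a rough sketch with the driver's explain() helper:

// Look for an index scan (rather than a full collection scan) in the returned plan.
const plan = await collection.find({ username: 'johndoe' }).explain();
console.log(JSON.stringify(plan, null, 2));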
We'll look at additional index types below, but first let's think about the process
when creating an index. When creating an index in DocumentDB, the default
setting is to create the index in the foreground. This means DocumentDB will
obtain a lock on the collection and prevent all write operations to the collection
while the index is being built.

If you have an existing collection with a large amount of data that's serving live
traffic, this will result in downtime for your users. Instead of doing a foreground
index build, you can do a background build by passing the background: true
option to the createIndex method.

await collection.createIndex({ username: 1 }, { background: true });

A background index build will take longer but will avoid downtime for your users.
This is the recommended approach for building indexes on existing collections.

In addition to the simple single-field indexes, DocumentDB also supports more
advanced indexing methods. Let's look at each of those and see when you might
want to use them.

Compound indexes

The second most common type of index is a compound index. With a compound
index, you're indexing multiple fields in a single index. This is good for situations
where you're querying on multiple fields at the same time.

In the users example we've been using, imagine you want to find all adult users in a
particular zip code. To do this, you would do an equality query on the address.zip
field and a range query on the birthdate field, as follows:

const results = collection.find({
  'address.zip': '12345',
  'birthdate': { $lte: new Date(Date.now() - 18 * 365 * 24 * 60 * 60 * 1000) }
});

If you had a single field index on the address.zip field, it would be able to
find all the users in a particular zip code. However, it would then need to scan
all the users in that zip code to find the ones that are adults. This could be
a slow operation.

Instead, you can create a compound index on the address.zip and birthdate
fields. This will allow DocumentDB to find all the users in a particular zip code and
then find the ones that are adults. This will be much faster than scanning all the
users in the zip code.

To create a compound index, you pass an object with the fields you want to index
as the keys. Like with a single-field index, the value of each key is the sort order of
the index.
await collection.createIndex({ 'address.zip': 1, 'birthdate': 1 });

In this case, we're creating a compound index on the address.zip and birthdate
fields. The index will be sorted first by the address.zip field and then by the
birthdate field.

Proper configuration of your compound indexes can greatly reduce your query
times, but subtle mistakes can prevent the index from being used effectively. There
are two tricks to make your compound indexes go from good to great.

First, understand that a properly configured compound index can support multiple
query patterns. For example, you could query on the address.zip field alone, or
you could query on the address.zip and birthdate fields together. However, the
index cannot serve a query on the birthdate field alone. Thus, think carefully
about your actual query patterns when creating your compound index. This is
covered further in the Advanced Tips section.

A second tip for your compound indexes is to try to provide covered queries if
possible. Like most databases, DocumentDB stores full records separately from
the index. This means that, when a query needs a field that's not in the index,
DocumentDB must fetch the full record from disk. When applied to a large number
of records, this can greatly increase the latency of your operation.

To avoid this, you can try to provide a covered query. A covered query is one where
all the fields you need are in the index. This allows DocumentDB to retrieve the
data directly from the index without needing to fetch the full record from disk.

To achieve a covered query, you'll need to provide a projection that includes
only the fields you need. In our example above, if we only need the zip code and
birthdate of users that match our query, we could do the following:

const results = collection.find({
  'address.zip': '12345',
  'birthdate': { $lte: new Date(Date.now() - 18 * 365 * 24 * 60 * 60 * 1000) }
})
.project({ 'address.zip': 1, 'birthdate': 1, _id: 0 });

Because we're only asking for fields that are in the compound index (note that the
projection also excludes _id, which is not part of the index), DocumentDB can
handle this query without needing to fetch the full record from disk.

The example here is a simple one, but the harder problem is when you need
additional fields from the document that aren't required in your query filter. For
example, perhaps we also need the username field in this query. We could add this
field at the end of our compound index, even though we don't use it for querying,
in order to achieve a covered query. However, this results in a larger index and more
work for DocumentDB to maintain the index.

Additionally, due to DocumentDB internals, the efficacy of a covered index may
drop with a write-heavy workload.

In aiming for a covered query, think carefully about these tradeoffs. An additional
field or two in the index may not be a big deal, but if you're adding a large number
of fields to the index, you may want to reconsider your approach.
Multi-key indexes

In multiple places above, we've seen that DocumentDB allows you to store arrays
in your documents. This is a powerful feature that allows you to store nested data
structures in your documents. However, it does present a challenge when indexing
your data -- how can you query on an array field?

Fortunately, DocumentDB allows you to query array fields by creating a multi-key
index. A multi-key index is an index on an array field that creates an index entry
for each element in the array. This allows you to query on the array field and have
DocumentDB return all the documents that match the query.

Think back to our bookstore example. We have a book document that uses the
embedding pattern to store the related formats for a book.

{
  _id: 65836d8ad4211546b47af455,
  title: 'The Hobbit (75th Anniversary Edition)',
  formats: [
    {
      ISBN: '978-0547928227',
      type: 'hardcover',
      price: 19.99,
      pages: 304
    },
    {
      ISBN: '978-0547928210',
      type: 'softcover',
      price: 12.99,
      pages: 410
    },
    {
      ISBN: '978-0547928241',
      type: 'ebook',
      price: 9.99,
    },
    {
      ISBN: '978-0547928234',
      type: 'audiobook',
      price: 19.99,
      lengthInMinutes: 912
    }
  ]
}
Note that each format has an ISBN field that is a unique identifier for that format
of the book. It's common that we'll want to find a book by the ISBN of one of its
formats.

To do this, we can create a multi-key index on the formats.ISBN field.

await collection.createIndex({ 'formats.ISBN': 1 });

Now, we can efficiently query for a book by the ISBN of one of its formats.

const results = collection.find({ 'formats.ISBN': '978-0547928227' });

These multi-key indexes are very powerful in DocumentDB as they allow us to use
the embedded pattern to keep related data together while still allowing us to query
on the embedded data directly.

Sparse indexes

When creating your index, you can pass additional properties to configure the
index. One of these properties is the sparse property. This property allows you to
create a sparse index, which is an index that only includes documents that have the
indexed field.

Using a sparse index can greatly improve your performance and reduce the size of
your index in the right circumstances. You may even want to alter the structure of
your documents to take advantage of a sparse index.

Let's think of our support tickets example from above. Imagine that we often want
to find all the support tickets for a user, sorted by the date they were created. In
fetching these, we only want to show open support tickets.

Over many years, our database will gather lots of support tickets. However, most
of these support tickets will be irrelevant to our users as they're for old issues that
have been resolved.

You may think to create an index like the following:

await collection.createIndex({ 'status': 1, 'user._id': 1, 'created_at': -1 });

In this, we're creating a compound index on the status, user._id, and created_at
fields. While this would work, it's also bloating the size of our index to include
all the closed tickets that we don't care about. This will result in a larger index and
slower performance.

A second solution is to delete the closed support tickets, but often this won't fit
with your business requirements. You may need to keep historical data for archival
or reporting purposes.

Instead, we can alter our document slightly. For open tickets only, we can add an
open_created_at field that is a duplicate of our created_at field. Then, we can
create a sparse index on the user._id and open_created_at fields.
await collection.createIndex({ 'user._id': 1, 'open_created_at': -1 }, { sparse: true });

Notice that we passed { sparse: true } as the second argument to the
createIndex() method. This tells DocumentDB to create a sparse index, which
will only include documents that have values for all fields in our index. Because
only open tickets will have the open_created_at field, only open tickets will be
included in the index.

To utilize a sparse index, you must pass the { $exists: true } operator on the
indexed field to tell the DocumentDB query engine that you only want documents
where the field exists.

const results = collection.find({ 'open_created_at': { $exists: true } });

Ideally your sparse index will use actual fields on your documents rather than
fields that are constructed solely for the index. However, don't be afraid to use this
pattern if it fits your use case. It can greatly improve your performance and reduce
the size of your index.
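Putting the pieces together, a sketch of the query this sparse index is designed to serve --
open tickets for one user, newest first (userId stands in for a value from your application):

const openTickets = await collection
  .find({ 'user._id': userId, open_created_at: { $exists: true } })
  .sort({ open_created_at: -1 })
  .toArray();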
Advanced tips
Once you have figured out the core aspects of schema management,
relationship modeling, and index optimization in DocumentDB, you can get
pretty far. But as you start to push DocumentDB further, you might need
more advanced optimizations.

This section includes advanced tips for improving your DocumentDB
performance as you scale. We'll look at how to use the aggregation
framework well and how to think about scaling your DocumentDB cluster.
Additionally, we’ll look at tips to reduce your I/O consumption and your
overall document size. These tips will take you to the next level with
DocumentDB.
Use the aggregation framework wisely

In the section on the DocumentDB API, we saw that the aggregation framework is
a tool for complex operations in DocumentDB. This can be both a blessing and a
curse. On one hand, you can use the aggregation framework to perform advanced,
multi-step operations that would be difficult to manage with a bunch of ad-hoc
queries. On the other hand, you can burn through a lot of CPU and I/O with an
unoptimized query.

Remember that the aggregation framework allows you to specify multiple steps to
be performed sequentially by DocumentDB. Sometimes, DocumentDB may be able
to condense multiple steps together in order to reduce processing by the engine.
However, you shouldn't count on that and should work to optimize the overall
query as much as possible.

In general, you should try to reduce the amount of data being passed between
steps. Efficient database usage is all about filtering down to the relevant data as
much as possible, and this is particularly critical in the aggregation framework.
Think about ways to cull your dataset earlier in the process.

One way to do this is to use the $match operator early in your query. The $match
operator is similar to a SQL WHERE clause, and it will filter out records that are
unnecessary. The earlier you can do that -- ideally by using an index -- the better.

In addition to $match, the $project operator is a good way to reduce your data
size. If you have documents that are a few KB in size but you only need a few small
properties for your aggregation query, use the $project operator to select just
those properties early on. This will reduce CPU usage for later stages. Likewise,
the $group and $bucket operators help to reduce batches of records into a smaller
summary.

Finally, once you have reduced the data set, then apply the $sort and $limit
operators on the remaining records. This will greatly reduce the time to sort and,
ideally, avoid spilling to disk to perform the sort operation.

Focus on this core principle -- reduce your dataset as early as possible -- in order to
keep your aggregation queries performant and efficient.
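A hypothetical pipeline over the support tickets collection that follows this ordering (the
stages and field names are illustrative, not taken from the earlier examples):

const topUsers = await collection.aggregate([
  // 1. Filter early, ideally on an indexed field.
  { $match: { status: 'open' } },
  // 2. Carry forward only the fields the later stages need.
  { $project: { 'user._id': 1 } },
  // 3. Collapse the records into a small summary.
  { $group: { _id: '$user._id', openTickets: { $sum: 1 } } },
  // 4. Sort and limit only the reduced result set.
  { $sort: { openTickets: -1 } },
  { $limit: 10 }
]).toArray();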
Scaling with DocumentDB

As your application and DocumentDB usage grows, you may find yourself needing
to scale your DocumentDB cluster. At first, this can be as simple as increasing your
DocumentDB instance size. This is an operation that the DocumentDB service will
manage for you with minimal downtime.

As your needs continue to grow, you may consider horizontal scaling by increasing
the number of instances in your DocumentDB cluster. Here, it is beneficial to know
more about DocumentDB's underlying architecture.

In many modern databases, a common remedy to scaling is to horizontally scale by
sharding your data. In sharding, you split your entire dataset into shards, each of
which will hold a subset of the entire dataset. This sharding is typically based on a
highly used attribute in your application, such as a userID or tenantID, that allows
requests to be routed to a single shard for processing.

With these horizontally scalable databases, you might shard your database for
three reasons:

• Increasing write throughput, as you can spread writes across more instances in
  the cluster;
• Increasing read throughput, as you can spread reads across more instances; or
• Increasing storage capacity, as each instance won't need to hold the entire
  dataset.

DocumentDB does provide sharding via Elastic Clusters, but you won't need to
jump to sharding as quickly as you will with other database systems. In fact, many
workloads that are sharded on other databases will find they can remove sharding
altogether by migrating to DocumentDB, as it has a more scalable architecture that
separates compute and storage, allowing storage to grow elastically.

Let's run through each of the reasons for sharding above and see how you can
handle these with DocumentDB.

First, you will rarely need to shard your DocumentDB cluster to account for
additional storage capacity. DocumentDB's storage capacity grows automatically
as you use it, up to a maximum capacity of 128 TiB. This is enough for the vast
majority of use cases. That said, if you do require additional storage, an elastic
cluster can scale its storage to a maximum of 4 PiB.

Second, sharding is usually not the correct approach to increase your read
throughput with DocumentDB. DocumentDB allows you to create up to 15 read
replica instances in your DocumentDB cluster. These instances only handle reads
and are an easy way to provide additional read capacity to your application.
Because DocumentDB uses a shared underlying storage volume, read replicas can
be created quickly, regardless of the size of your dataset. Further, there's no impact
on the primary instance in your cluster.

You should note that replication to read replicas is asynchronous, and thus reads
from a replica may be slightly out of date with your primary instance. For most
applications, this eventual consistency is fine. If you require stronger consistency
guarantees on reads, you can direct certain reads to the primary instance. Further,
DocumentDB allows you to monitor replication lag via the DBInstanceReplicaLag
metric in CloudWatch.

Finally, if you need to increase the write throughput of your DocumentDB cluster
and you've decided against increasing the instance size of your primary instance,
you can use DocumentDB elastic clusters to shard your data.
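As a rough sketch of what that looks like (run from the mongo shell against an elastic cluster;
the database, collection, and field names are illustrative):

// A hashed shard key spreads documents evenly across shards, while queries that
// include an exact match on the key can still be routed to a single shard.
sh.shardCollection('app.support_tickets', { 'userId': 'hashed' });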
In sharding your database by using elastic clusters, you are adding an additional tier
to your database infrastructure. DocumentDB will add a request router layer that
will be the primary point to handle your request. After the request router parses
the request, it will forward the request to the relevant shard(s) to process the
request and return the result to your client. Note that, because of this additional
request router tier, there may be a slight increase in overall latency when moving to
an elastic cluster.

In creating your elastic cluster, you'll need to choose which collections are sharded.
For those that are unsharded, the entire collection will be located on a single shard
in your elastic cluster. For collections that are sharded, they will be split across the
instances according to a shard key that you specify.

Choosing a shard key for your collection is very important to see the benefits of
elastic clusters. You'll want to choose a shard key that is evenly distributed across
your dataset so that the data ends up being well-distributed across your shards.
If you use an unbalanced shard key, you won't get the full benefit of splitting up
your data.

Additionally, you want to use a shard key that correlates with your access patterns.
Ideally you are using a shard key that contains an exact match in all or most of your
database operations. This way, your operations can be directed to a single shard to
handle the request rather than doing a scatter-gather operation across all of the
shards. Again, this will allow you to take full advantage of the decision to shard.

Finally, elastic clusters can also be helpful in the rare case where you need to
increase the number of connections to your DocumentDB cluster. While a single
DocumentDB instance maxes out at 30,000 open connections, an elastic cluster
supports up to 300,000 open connections.

DocumentDB provides a number of mechanisms to scale your database to
meet your usage. In general, try to avoid sharding your database with elastic
clusters where you can, due to the extra latency and planning work that it requires.
In the event that you do need to scale via sharding, DocumentDB provides a
straightforward mechanism via elastic clusters.

Reduce I/O

A second advanced tip is to reduce your I/O consumption in DocumentDB. In
DocumentDB, I/O is cost. It is cost not only literally, in the sense that you are
charged directly for I/O consumption, but also in the sense that I/O reduces your
performance by consuming scarce resources.

To understand how to reduce I/O consumption, let's first review some details about
DocumentDB's underlying architecture. Then, we'll look at some tips for optimizing
your I/O consumption.

Under the hood, DocumentDB is using a multi-version concurrency control (MVCC)
architecture. This is common to many database systems and can assist with
handling concurrent operations without the use of locks. An MVCC architecture may
have multiple versions of a particular document that existed at different times in
the database lifecycle.

To understand DocumentDB's effects on I/O, you need to know both what happens
during individual write operations (inserts, updates, and deletes) and what happens
during the garbage collection process.

A caveat up front -- this is an advanced topic that goes deep into DocumentDB
architecture. As you learn about these MVCC internals, do not confuse this with
whether it affects the correctness of your results. You'll read about multiple versions
of your document and the impact on indexes, but the query engine understands
how to handle these versions to return the proper result.
Every insert of a document will result in not only writing the full document to the
heap but also updating every index where applicable for that document. On the
other hand, deleting a document will only mark the time when the document was
deleted. It won't delete the document from storage (yet!), and it won't update the
indexes to remove the entry for the deleted document.

An update operation is a combination of an insert and a delete operation. It
will create a new version of the document on the heap and update all indexes
accordingly. Additionally, it will mark the time when the old version of the
document was deleted.

You might be thinking that this leaves a lot of old data lying around, and you'd be
correct. This is where garbage collection comes in. DocumentDB has automated
thresholds where it will run a garbage collection process to remove any document
versions that are not visible to any current queries and to clean out expired entries
from indexes. This garbage collection process consumes I/O in your DocumentDB
cluster.

An easy way to synthesize this knowledge is to remember that an insert consumes
more I/O during the actual operation, as it has to update the indexes as well. A
delete consumes less I/O during the operation but adds more I/O later during the
garbage collection process. An update combines the two since it is essentially an
insert plus a delete operation.

With that background in place, let's consider how to reduce our I/O consumption.
The first way to do this is to reduce the number of indexes you have. An insert
updates each index for that document (and a document could even have multiple
entries in an index, such as with a multi-key index). Reducing the number of indexes
will reduce your I/O consumption accordingly.

The best way to do this is to discover and remove both redundant and unused
indexes. Many unoptimized databases have redundant indexes for handling
different queries.

Going back to our compound index example, imagine you create an index that
looks as follows:

await collection.createIndex({ 'address.zip': 1, 'birthdate': 1 });

This would allow you to handle multiple types of queries:

1. An exact match query on just the address.zip field;
2. A range query on the address.zip field;
3. An exact match query on the address.zip field plus an exact match on
   the birthdate field;
4. An exact match query on the address.zip field plus a range query on
   the birthdate field.

With compound indexes, recall that you don't need to filter on all values to use the
index. Compound indexes are evaluated from left to right up to and including the
first range query.

Thus, an index like the following on just address.zip would be redundant:

await collection.createIndex({ 'address.zip': 1 });
Queries on just the address.zip field can be served by the compound index we
created previously.

The DocumentDB documentation provides some advice on locating unused indexes.
Some quick analysis here can save you significant I/O in your DocumentDB cluster.
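Once you've confirmed an index is redundant or unused, dropping it is a one-line operation.
A sketch, assuming the default generated name (field_direction) for the single-field index
above:

// Removing the redundant index saves write I/O on every insert and update.
await collection.dropIndex('address.zip_1');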
A second, more difficult, tip for reducing I/O is to consider splitting up a single
document into multiple documents in certain situations. This is particularly useful
when you can break a document into a mutable portion and an immutable portion.
When the mutable portion changes, you will have smaller write operations and
correspondingly less I/O usage. Further, any immutable portions that are indexed
will not be changed when updating the mutable portion.

Reduce document size

Another data modeling optimization that relies on DocumentDB internals is to
reduce your document size. In most circumstances, smaller documents will result
in better performance. In certain circumstances, this performance impact can be
much larger.

Consider why smaller documents can be better. Smaller documents mean less I/O
consumption and more documents fitting in memory, skipping the need for I/O
altogether. Further, it's less CPU consumption as you're reading over these
documents.

There are a couple of patterns for reducing your document size. The first, and
easiest, is to compress your documents, particularly when they contain highly
compressible field names or field values that are not queried or updated directly.
DocumentDB does offer compression at the document level.

A second option is to reduce the size of your keys in DocumentDB. Remember
that DocumentDB documents are self-describing, so each document carries its
entire schema within it. While descriptive key names like "username", "updatedAt",
and "isAPayingMember" may be helpful for reading, they do expand the size of your
document. Consider abbreviating these names to shorter values.

As with all of these advanced tips, consider the tips on reducing document size
carefully. Make sure that the added complexity is worth the benefits on document
size that will result.
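Purely as an illustration of the key-abbreviation tradeoff (using the hypothetical field names
mentioned above):

// Descriptive keys -- readable, but every document carries the full field names:
// { username: 'johndoe', updatedAt: '2023-12-27', isAPayingMember: true }
//
// Abbreviated keys -- smaller documents, but the mapping must now live in your application code:
// { un: 'johndoe', ua: '2023-12-27', pm: true }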
Conclusion
In this ebook, we learned how to model data in DocumentDB. We started
off by learning the basics of DocumentDB, including how the document
model differs from a relational model. Then, we learned about the
DocumentDB API and how to use it to interact with our database. Next,
we saw some of the key data modeling patterns including how to manage
your schema, how to handle relationships, and how to use indexes well.
Finally, we looked at some advanced tips that covered proper use of the
aggregation framework, scaling your database, and low-level optimizations
of indexes and documents.

DocumentDB is a powerful database that can be used for a wide variety
of applications. By understanding the basics of DocumentDB and how to
model your data, you can build powerful applications that scale to meet
your needs.
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
