Module 2 Architecture
#Documents
MongoDB stores data records as BSON documents. BSON is a binary representation of JSON documents,
though it contains more data types than JSON. For the BSON spec, see bsonspec.org. See also BSON Types.
Figure: a MongoDB document.
#Databases
In MongoDB, databases hold one or more collections of documents. To select a database to use, in
mongosh, issue the use <db> statement, as in the following example:
use myDB
Create a Database
If a database does not exist, MongoDB creates the database when you first store data for that database. As
such, you can switch to a non-existent database and perform the following operation in mongosh:
use myNewDB
db.myNewCollection1.insertOne( { x: 1 } )
The insertOne() operation creates both the database myNewDB and the collection myNewCollection1 if
they do not already exist. Be sure that both the database and collection names follow MongoDB Naming
Restrictions.
#Collections
MongoDB stores documents in collections. Collections are analogous to tables in relational databases.
Create a Collection
If a collection does not exist, MongoDB creates the collection when you first store data for that collection.
db.myNewCollection2.insertOne( { x: 1 } )
db.myNewCollection3.createIndex( { y: 1 } )
Both the insertOne() and the createIndex() operations create their respective collection if they do not already
exist. Be sure that the collection name follows MongoDB Naming Restrictions.
#What is JSON?
JSON, or JavaScript Object Notation, is a human-readable data interchange format, specified in the early
2000s. Even though JSON is based on a subset of the JavaScript programming language standard, it’s
completely language-independent.
JSON objects are associative containers, wherein a string key is mapped to a value (which can be a number,
string, boolean, array, an empty value — null, or even another object). Almost any programming language
has an implementation for this abstract data structure — objects in JavaScript, dictionaries in Python, hash
tables in Java and C#, associative arrays in C++, and so on. JSON objects are easy for humans to understand
and for machines to parse and generate:
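For example, here is a small illustrative object (the field names and values are made up for this sketch):
{
  "name": "Ada",
  "age": 36,
  "languages": [ "English", "French" ],
  "address": { "city": "London" },
  "active": true,
  "nickname": null
}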
#What is BSON?
BSON stands for “Binary JSON,” and that’s exactly what it was invented to be. BSON’s binary structure
encodes type and length information, which allows it to be traversed much more quickly compared to JSON.
BSON adds some non-JSON-native data types, such as dates and binary data, giving MongoDB support for
values that plain JSON cannot represent.
The following is an example JSON object and its corresponding BSON representation.
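Consider the canonical example from bsonspec.org, the document {"hello": "world"}. Its BSON encoding is
this byte sequence (annotations added for clarity):
\x16\x00\x00\x00    // total document size: 22 bytes (int32, little-endian)
\x02                // element type 0x02: UTF-8 string
hello\x00           // element name "hello", null-terminated
\x06\x00\x00\x00    // string length: 6 bytes, including the trailing null byte
world\x00           // string value "world", null-terminated
\x00                // end-of-document marker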
#Storage Engines
In MongoDB, storage engines are responsible for how data is stored, accessed, and managed on disk.
MongoDB has evolved over the years with different storage engines, but currently the primary and default
storage engine is WiredTiger. Here's an overview:
WiredTiger
Concurrency Control: Supports fine-grained concurrency control at the document level. This means
multiple operations can access and modify different documents concurrently.
Compression: Includes built-in support for data compression, which reduces storage requirements
and can improve read and write performance.
Logging: Uses a write-ahead log (journal) for crash recovery and to ensure data durability in case
of system failures.
Suitability: WiredTiger has been the default storage engine since MongoDB 3.2. It is designed to
handle a wide range of workloads efficiently, including large-scale deployments and applications
with high concurrency requirements.
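To see which storage engine a running mongod is using, you can check serverStatus from mongosh (a quick
sketch; the output noted below assumes a default deployment):
// Returns the storage engine section of serverStatus.
// On MongoDB 3.2+ with default settings, the "name" field is "wiredTiger".
db.serverStatus().storageEngine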
MMAPv1 (Memory-Mapped Architecture Version 1)
Concurrency Control: Uses a global reader-writer lock at the database level. This means that while
multiple read operations can occur simultaneously, write operations may block other operations.
Architecture: Relies on memory-mapped files where data files are directly mapped into virtual
memory. This architecture can be less efficient for large datasets or write-heavy workloads.
Suitability: MMAPv1 was the default storage engine in MongoDB prior to version 3.2. It was
deprecated in MongoDB 4.0, removed in MongoDB 4.2, and is no longer available for new deployments.
It was suitable for simpler applications with less demanding concurrency requirements.
#Journaling
Journaling in MongoDB refers to the process by which changes to the database are recorded in a journal or
log before they are applied to the actual data files. This mechanism ensures durability and consistency in
case of unexpected shutdowns or failures.
Here’s how journaling works in MongoDB:
1. Purpose of Journaling: MongoDB uses journaling to ensure that write operations are durable even in the
event of a crash. When a write operation is performed, it is first written to the journal files on disk. Once the
write operation is confirmed in the journal, it is then applied to the database data files.
2. Journal Files: MongoDB maintains a set of journal files in the journal subdirectory of its data
directory (named `WiredTigerLog.<sequence>` under WiredTiger, or `j._<sequence>` under the legacy
MMAPv1 engine). These files store write operations in a format that allows MongoDB to quickly recover any
uncommitted changes in case of a crash.
3. Journaling Process:
- Write Operation: When a write operation (insert, update, delete) is performed, MongoDB writes the
operation to the journal file on disk.
- Confirmation: Once the write is confirmed in the journal file, MongoDB applies the operation to the data
files.
- Commit: Once the operation is applied to the data files, MongoDB marks the write operation as
committed in the journal.
4. Recovery: In case of a crash or unexpected shutdown, MongoDB can replay the journal files to recover
any write operations that were not yet applied to the data files. This ensures that the database can recover to
a consistent state without losing data.
5. Configuration: Journaling has been enabled by default since MongoDB 2.0 (on 64-bit builds). You can
configure the journaling options in the MongoDB configuration file (`mongod.conf`). The main
configuration options related to journaling are `storage.journal.enabled`, which enables or disables
journaling (from MongoDB 4.0 onward it cannot be disabled for replica set members that use WiredTiger),
and `storage.journal.commitIntervalMs`, which specifies how often MongoDB syncs journal files to disk. A
sample configuration appears after this list.
6. Performance Considerations: While journaling ensures data durability, it does have a slight
performance overhead due to the additional disk writes. MongoDB provides options to tune journaling
performance based on your application's requirements.
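As referenced in item 5 above, here is a minimal mongod.conf sketch (YAML format; the path and interval
shown are illustrative assumptions, not recommendations):
# mongod.conf -- illustrative values only
storage:
  dbPath: /var/lib/mongodb      # assumed data directory
  journal:
    commitIntervalMs: 100       # sync the journal to disk every 100 ms (the default)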
#Write Path
In MongoDB, the write path refers to the sequence of steps and processes involved when a write operation
(insert, update, delete) is performed on the database. Understanding the write path is crucial for optimizing
MongoDB performance and ensuring data consistency and durability. Here's a detailed overview of the write
path in MongoDB:
Write Path Steps:
1. Client Application:
- The write operation originates from a client application that interacts with the MongoDB server using
one of the MongoDB drivers or through the MongoDB shell.
2. Query Parsing and Optimization:
- When MongoDB receives a write operation, it parses the query to understand the intent of the operation
and optimizes it if possible. This involves checking indexes and determining the most efficient way to
perform the operation.
3. Document Validation:
- If document validation rules are defined (available since MongoDB 3.2; JSON schema validation since
3.6), MongoDB validates the incoming document against these rules to ensure data integrity and
consistency (see the sketch after this list).
4. Write Operation Execution:
- Once validated, MongoDB executes the write operation. Depending on the type of operation (insert,
update, delete), different actions are taken:
- Insert: For an insert, MongoDB generates an _id if the document does not supply one, checks
unique-index constraints (such as the _id index), and then writes the document to the collection.
- Update: For updates, MongoDB may update existing documents or upsert (update or insert if not exists)
based on the query criteria.
- Delete: Deletes documents matching the specified query criteria.
5. Journaling (if enabled):
- If journaling is enabled (which is the default in MongoDB), the write operation is recorded in the journal
files before it is applied to the data files on disk. This ensures durability and recoverability in case of a crash.
6. Data File Write (on Disk):
- After journaling (if enabled), MongoDB writes the updated data to the data files (e.g., WiredTiger
storage engine files) on disk. This step ensures that the data is persistent and can be retrieved even after
server restarts.
7. Memory (RAM) Updates:
- MongoDB also updates its in-memory data structures (e.g., indexes, working set) to reflect the changes
made to the data files. This improves read performance by maintaining frequently accessed data in memory.
8. Response to Client:
- Once the write operation is successfully completed and committed to disk (and optionally to journal),
MongoDB sends an acknowledgment (ACK) back to the client application, confirming the operation's
success.
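As referenced in step 3, the following is a minimal sketch of defining validation rules with $jsonSchema
(the collection name and rules are hypothetical):
// Create a collection that rejects documents lacking a string "name" field.
// The collection "users" and the schema below are illustrative only.
db.createCollection( "users", {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: [ "name" ],
      properties: {
        name: { bsonType: "string", description: "must be a string and is required" }
      }
    }
  }
} )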
Optimization and Considerations:
- Write Concern: MongoDB provides options to specify the level of acknowledgment required from the
server for write operations (e.g., acknowledgment from a single server, a majority of replica set
members, or all members); see the example after this list.
- Indexes: Properly designed indexes can significantly improve the performance of write operations by
reducing the time it takes to locate and modify documents.
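For example, a write concern can be attached to a single operation in mongosh (the collection and
document here are hypothetical):
// Require acknowledgment from a majority of replica set members, and
// require the write to be journaled (j: true), with a 5-second timeout.
db.orders.insertOne(
  { item: "book", qty: 1 },
  { writeConcern: { w: "majority", j: true, wtimeout: 5000 } }
)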
#GridFS
GridFS is a specification for storing and retrieving files that exceed the BSON document size limit of
16 MB. When you want to keep your files and metadata automatically synced and deployed across a number of
systems and facilities, you can use GridFS. When using geographically distributed replica sets, MongoDB
can distribute files and their metadata automatically to a number of mongod instances and facilities.
Do not use GridFS if you need to update the content of the entire file atomically. As an alternative you can
store multiple versions of each file and specify the current version of the file in the metadata. You can
update the metadata field that indicates "latest" status in an atomic update after uploading the new version of
the file, and later remove previous versions if needed.
Furthermore, if your files are all smaller than the 16 MB BSON document size limit, consider storing each
file in a single document instead of using GridFS. You may use the BinData data type to store the binary
data. See your driver's documentation for details on using BinData.
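A minimal mongosh sketch of storing a small file inline with BinData (the collection name and payload
are hypothetical):
// Store a small binary payload directly in a single document.
// BinData(0, <base64>) is the generic binary subtype; the payload below is
// simply "hello world" base64-encoded, standing in for real file contents.
db.smallFiles.insertOne( {
  filename: "hello.txt",
  contentType: "text/plain",
  data: BinData(0, "aGVsbG8gd29ybGQ=")
} )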
Use GridFS
To store and retrieve files using GridFS, use either of the following:
A MongoDB driver. See the drivers documentation for information on using GridFS with your driver.
The mongofiles command-line tool. See the mongofiles reference for documentation.
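For instance, a brief mongofiles sketch (the database and file names are hypothetical):
# Upload report.pdf into the default fs bucket of the "records" database,
# then list the files stored there.
mongofiles --db=records put report.pdf
mongofiles --db=records list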
GridFS Collections
GridFS stores files in two collections:
chunks stores the binary chunks. For details, see The chunks Collection.
files stores the file's metadata. For details, see The files Collection.
GridFS places the collections in a common bucket by prefixing each with the bucket name. By default,
GridFS uses two collections with a bucket named fs:
fs.files
fs.chunks
You can choose a different bucket name, as well as create multiple buckets in a single database. The full
collection name, which includes the bucket name, is subject to the namespace length limit.
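For example, with the Node.js driver you can open a custom-named bucket (a sketch; the database and
bucket names here are hypothetical):
// Node.js driver sketch: a bucket named "photos" stores its data
// in the photos.files and photos.chunks collections.
const { MongoClient, GridFSBucket } = require("mongodb");

const client = new MongoClient("mongodb://localhost:27017");
await client.connect();
const db = client.db("myDB");
const bucket = new GridFSBucket(db, { bucketName: "photos" });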
The chunks Collection
Each document in the chunks [1] collection represents a distinct chunk of a file as represented in GridFS.
Documents in this collection have the following form:
{
"_id" : <ObjectId>,
"files_id" : <ObjectId>,
"n" : <num>,
"data" : <binary>
}
The chunks Index
GridFS uses a unique, compound index on the chunks collection using the files_id and n fields, which
allows for the efficient retrieval of chunks. If this index does not exist, you can issue the following
operation to create it using mongosh:
db.fs.chunks.createIndex( { files_id: 1, n: 1 }, { unique: true } );
The files Index
GridFS uses an index on the files collection using the filename and uploadDate fields. This index allows for
efficient retrieval of files, as shown in this example:
db.fs.files.find( { filename: myFileName } ).sort( { uploadDate: 1 } )
Drivers that conform to the GridFS specification will automatically ensure that this index exists before read
and write operations. See the relevant driver documentation for the specific behavior of your GridFS
application.
If this index does not exist, you can issue the following operation to create it using mongosh:
db.fs.files.createIndex( { filename: 1, uploadDate: 1 } );
[1] The use of the term chunks in the context of GridFS is not related to the use of the term chunks
in the context of sharding.
Sharding GridFS
There are two collections to consider with GridFS: files and chunks.
chunks Collection
To shard the chunks collection, use either { files_id : 1, n : 1 } or { files_id : 1 } as the shard key index.
files_id is an ObjectId and changes monotonically.
For MongoDB drivers that do not run filemd5 to verify successful upload (for example, MongoDB drivers
that support MongoDB 4.0 or greater), you can use Hashed Sharding for the chunks collection.
If the MongoDB driver runs filemd5, you cannot use Hashed Sharding. For details, see SERVER-9888.
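A minimal sketch of sharding the chunks collection (run from mongosh connected to a mongos; the database
name is hypothetical, and sharding must already be enabled for that database):
// Shard the GridFS chunks collection on { files_id: 1, n: 1 }.
sh.shardCollection( "myDB.fs.chunks", { files_id: 1, n: 1 } )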
files Collection
The files collection is small and only contains metadata. None of the required keys for GridFS lend
themselves to an even distribution in a sharded environment. Leaving files unsharded allows all the file
metadata documents to live on the primary shard.
If you must shard the files collection, use the _id field, possibly in combination with an application field.