Simplr Solutions - Field
Participants:
● Ofer Beit Halachmi, Simplr Solutions
● Srinivas Jaligama, Simplr Solutions
● Santhosh S Kashyap, Consulting Engineer, MongoDB Inc.
This document summarizes the discussions and recommendations from a 3-day remote
consultation with the “Simplr Solutions” team on 17th, 18th and 19th April 2023.
Each recommendation is assigned a level from 1 to 3. The levels correspond to the following
priorities:
● Level 1: Not implementing this recommendation incurs the risk of data loss, system unavailability, or other significant problems that may cause outages in the future.
● Level 2: These recommendations are less severe than level 1, but typically represent significant problems that could affect a production system. Consider these issues promptly.
● Level 3: While this suggestion may improve some aspects of the application, it is not critical and may reflect a much larger change in the application code or MongoDB deployment. Consider these modifications as part of the next revision of the application.
This document contains comments and recommendations for Simplr’s simplr-prod Atlas cluster. The engagement focused primarily on this cluster, covering the removal of unused and redundant indexes, backups, online archives, and integration strategies with Snowflake.
2 Background
2.1 Application
The company offers a fully managed service that connects a chatbot, human agents, and an AI-powered platform to deliver better and more cost-efficient CX than legacy BPOs. With Simplr, clients are engaging with customers in ways that drive more revenue. In doing so, they’re fundamentally transforming CX programs into strategic imperatives for their companies.
2.2 Environment
The application is developed in Node.js using the Typegoose driver. The following are the details of the MongoDB deployment that this engagement focused on:
Cluster Name | Deployment Type | Region | Cluster Tier | MongoDB Version
3 Recommendations
3.1 Query Optimization [Priority 1]
During the consultation, we analyzed one week of logs using the Keyhole analyser to find slow queries. We identified a few queries that can be optimized and discussed the relevant query optimizations.
{
"t":{
"$date":"2023-04-11T01:40:26.850+00:00"
},
"s":"I",
"c":"COMMAND",
"id":51803,
"ctx":"conn30900",
"msg":"Slow query",
"attr":{
"type":"command",
"ns":"simplr.agendaJobs-prod",
"command":{
"findAndModify":"agendaJobs-prod",
"query":{
"$or":[
{
"name":"conversationCloseTimer",
"lockedAt":null,
"nextRunAt":{
"$lte":{
"$date":"2023-04-11T01:40:26.698Z"
}
},
"disabled":{
"$ne":true
}
},
{
"name":"conversationCloseTimer",
"lockedAt":{
"$exists":false
},
"nextRunAt":{
"$lte":{
"$date":"2023-04-11T01:40:26.698Z"
"protocol":"op_query",
"durationMillis":5140
}
}
We saw that this query is executed whenever Agenda starts a job. We also dived into the Agenda code and found that the query can be optimized by removing one of the $or conditions, since in the current database lockedAt always exists. This requires a change to the Agenda code (v2.0.0, which the team is using). Due to time constraints we were not able to implement and test this, so the team must test thoroughly for possible issues before implementing it. We also noted that this logic is greatly improved in newer versions of Agenda, and the team can also evaluate upgrading the package.
{
"t":{
"$date":"2023-04-14T19:13:36.140+00:00"
},
"s":"I",
"c":"COMMAND",
"id":51803,
"ctx":"conn557",
"msg":"Slow query",
"attr":{
"type":"command",
"ns":"simplr.Message",
"command":{
"aggregate":"Message",
"pipeline":[
{
"$match":{
"_p_conversationPtr":{
"$exists":true
},
"direction":"INBOUND",
"_updated_at":{
"$gte":{
"$date":"2023-04-13T00:00:00.000Z"
},
"$lte":{
"$date":"2023-04-14T00:00:00.000Z"
}
}
}
},
{
"$project":{
"messageBody":1,
"_p_conversationPtr":1,
"conversationId":{
"$substr":[
"$_p_conversationPtr",
13,
-1
]
},
"_id":1
}
},
{
"$lookup":{
"from":"Conversation",
"let":{
"conversationId":"$conversationId"
},
...
}
},
...
],
"lsid":{
"id":{
"$uuid":"6285c6f4-6552-410c-b173-bbfd738489cd"
}
},
"$clusterTime":{
"clusterTime":{
"$timestamp":{
"t":1681499575,
"i":63
}
},
"signature":{
"hash":{
"$binary":{
"base64":"EmroYSf1BTt8RGzHXJQQV+PcAKs=",
"subType":"0"
}
},
"keyId":7179330532990255267
}
}
},
...
}
}
The above query can be improved by adding an optimal index for the $match condition using the ESR rule.
Possible index:
direction: 1, _updated_at: 1, _p_conversationPtr: 1
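A minimal sketch of creating this index in mongosh (assuming the simplr.Message collection from the log above; the team should confirm the benefit with an explain plan before and after):
db.Message.createIndex({ direction: 1, _updated_at: 1, _p_conversationPtr: 1 })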
We also observed that the team places $skip and $limit after the $lookup, which causes a large number of documents to be joined. Since the looked-up documents are not used for sorting, matching or unwinding, the $skip and $limit can be moved before the $lookup, which drastically reduces the number of documents going into the $lookup stage.
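A sketch of the reordered pipeline (the collection, fields and $lookup correlate with the log above; the $skip/$limit values and the "as" name are illustrative):
db.Message.aggregate([
  { $match: { _p_conversationPtr: { $exists: true }, direction: "INBOUND",
              _updated_at: { $gte: ISODate("2023-04-13T00:00:00.000Z"),
                             $lte: ISODate("2023-04-14T00:00:00.000Z") } } },
  { $project: { messageBody: 1, _p_conversationPtr: 1,
                conversationId: { $substr: [ "$_p_conversationPtr", 13, -1 ] }, _id: 1 } },
  // Paginate before the join so only the current page of documents reaches $lookup
  { $skip: 0 },      // illustrative value
  { $limit: 100 },   // illustrative value
  { $lookup: { from: "Conversation", let: { conversationId: "$conversationId" },
               pipeline: [ /* original $lookup sub-pipeline */ ], as: "conversation" } }
])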
Query 3
{
"t":{
"$date":"2023-04-14T20:08:42.333+00:00"
},
"s":"I",
"c":"COMMAND",
"id":51803,
"ctx":"conn361",
"msg":"Slow query",
"attr":{
"type":"command",
"ns":"simplr.EmailQueue",
"command":{
"find":"EmailQueue",
"filter":{
"status":{
"$in":[
"SENDING",
"FAILED"
]
},
...
},
"projection":{
},
"limit":100,
"returnKey":false,
"showRecordId":false,
"planSummary":"IXSCAN { status: 1 }",
"keysExamined":246070,
"docsExamined":246069,
"durationMillis":58457
}
}
The above query can be improved by adding an optimal index for the filter condition using the ESR rule.
Possible index:
status: 1, lastUpdate: 1, processed: 1
We can also avoid a negation query by changing processed: { $ne: true } to processed: { $eq: false }. For more details please refer to Prefer Equality or $in vs Negators.
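A sketch of the suggested index and rewritten filter (assuming the status, lastUpdate and processed fields on simplr.EmailQueue, and that processed is always stored as an explicit boolean so that { $eq: false } returns the same documents as { $ne: true }):
db.EmailQueue.createIndex({ status: 1, lastUpdate: 1, processed: 1 })
db.EmailQueue.find({
  status: { $in: [ "SENDING", "FAILED" ] },
  processed: { $eq: false }   // previously processed: { $ne: true }
}).limit(100)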
During the consultation, we observed a few queries that use a $ne check, so we discussed the possibility of using an equality or $in condition instead.
The inequality operator $ne is an expensive operation because it is not very selective and often matches a large portion of the index. As a result, the performance of a $ne query with an index is in many cases no better than that of the same query without an index.
It is recommended to modify these queries to use the $in operator. Note that MongoDB runs $in as an equality match until the number of combinations of all the $in items reaches 200, after which it treats the $in as a range. When the number of $in combinations exceeds 200, this can result in an in-memory sort stage. This needs to be kept in mind when designing the index for the query using the ESR rule.
In the aggregation above, we observed a few conditions where $or is used, so we discussed $or optimization and how indexes work in the case of $or. For MongoDB to use indexes to evaluate an $or expression, all of the clauses in the $or expression must be supported by indexes. Otherwise, MongoDB will perform a collection scan.
When using indexes with $or queries, each clause of an $or can use its own index. Consider the following indexes:
db.inventory.createIndex( { quantity: 1 } )
db.inventory.createIndex( { price: 1 } )
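To illustrate, the following query (an example in the style of the MongoDB documentation, not taken from the Simplr workload) can then use both indexes, one per $or clause:
db.inventory.find( { $or: [ { quantity: { $lt: 20 } }, { price: 10 } ] } )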
In MongoDB, indexes aid in the efficient execution of queries. However, each index you create has a negative impact on write performance and requires some disk space. Adding unnecessary indexes to a collection results in a bloated collection and slow writes. Consider whether each query performed by your application justifies the creation of an index. Remove unused indexes, either because the field is not used to query the database or because the index is redundant. Below is the list of indexes with zero usage.
Please note that the details in this list are as of April 17 and will change over time; the team should run the script again to re-check before considering removal of any indexes.
During the discussion, the team inquired about the safe deletion of an index from the database. We talked about the following steps (sketched below).
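A minimal sketch of that sequence in mongosh (assuming MongoDB 4.4 or later; the collection and index name are illustrative):
// Hide the candidate index so the planner ignores it while it is still maintained
db.Message.hideIndex("direction_1")
// Monitor query performance and $indexStats for a representative period, then either
db.Message.unhideIndex("direction_1")   // roll back instantly if queries regress
// or drop the index once the team is confident it is unused
db.Message.dropIndex("direction_1")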
During the consultation, we saw that the team was using Typegoose's populate method extensively.
It is to be noted that populate is not a MongoDB operation but a Typegoose method derived from the Mongoose driver. Populate fetches the referenced documents and replaces the references with them in the result. When the requested data is not already cached, this can lead to a large number of additional calls to the MongoDB database, drastically reducing query performance. We can reduce and optimize this by replacing populate with a $lookup, combined with lookup optimization, as sketched below.
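As an illustration only (the collection names, join key and filter are hypothetical, and combining localField/foreignField with a pipeline in $lookup assumes MongoDB 5.0+), a populate call on a reference field can usually be replaced with a single aggregation that joins and trims the referenced documents in one round trip:
db.Conversation.aggregate([
  { $match: { status: "OPEN" } },   // hypothetical filter
  { $lookup: {
      from: "Message",
      localField: "_id",
      foreignField: "conversationId",
      pipeline: [ { $project: { messageBody: 1, direction: 1 } } ],   // fetch only the needed fields
      as: "messages"
  } }
])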
We can optimize this better by precomputing the reference fields using the precompute
pattern. Here we can create a new collection (to preserve the previous schema) or one of
the current collections and pre-compute the referenced fields with the required data for the
api and frontend. Here we also discussed the use of change streams and atlas triggers to
keep the data in sync by computing when a data point is added or modified from which the
data was computed.
During the consultation, we saw a few collections that are very large, such as the conversations, messages, history and events collections. The team mentioned that some of these collections are moved directly to Snowflake and are not accessed through MongoDB, but that the data still needs to be available in MongoDB; these collections can therefore be moved to Online Archive. We also saw a large number of old and unused conversations and messages that can be archived. Hence we discussed ways to archive the data.
During the consultation, the team mentioned that they wanted alternatives to Fivetran for integration into Snowflake. We discussed two possible approaches.
The first is to use the Kafka source connector to fetch data from MongoDB and Kafka's Snowflake sink connector to push that data into Snowflake.
The second is to push data to Snowflake using the Snowflake APIs: after the initial sync is complete, the team can use change streams to track changes to the data points that need to be pushed to Snowflake and use the Snowflake APIs to push them, as sketched below.
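A minimal Node.js sketch of the second approach (pushToSnowflake is a hypothetical helper wrapping the Snowflake ingestion APIs; error handling, batching and resume-token persistence are omitted):
const { MongoClient } = require("mongodb");

async function run() {
  const client = await MongoClient.connect(process.env.MONGODB_URI);
  const coll = client.db("simplr").collection("Message");   // illustrative collection

  // Watch for changes after the initial bulk sync has completed
  const changeStream = coll.watch([], { fullDocument: "updateLookup" });
  for await (const change of changeStream) {
    // Hypothetical helper that calls the Snowflake APIs
    await pushToSnowflake(change.fullDocument);
  }
}

run().catch(console.error);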
● A compound index can be utilized to satisfy multiple queries. For example, if there is
a compound index like { a:1, b:1, c:1 }, this can be utilized to satisfy all the
following queries -
○ db.coll.find({ a:3 })
○ db.coll.find({ a:3, b:5 })
○ db.coll.find({ a:3, b:5, c:8 })
● The order of the index keys is important for a compound index. In general, the following rule of thumb can be used for the key order in compound indexes: first use keys on fields on which there is an equality match in the query (these are usually the most “selective” part of the query), then the sort keys, and then the range query keys. We call this the ESR (Equality-Sort-Range) rule; see the example after this list.
● Remove indexes that are not used because every index creates overhead for write
operations. The $indexStats command can be used for getting the statistics about
the usage of the indexes or index usage can be checked using MongoDB Compass.
Atlas UI can also be used to check the index usage.
● For the fastest processing, ensure that your indexes fit entirely in RAM so that the
system can avoid reading the index from the disk. For more information, refer to
Ensure Indexes Fit in RAM.
● When possible use covered queries. A covered query is a query that can be satisfied
entirely using an index and does not have to examine any documents (it will not show
a FETCH stage in the winning plan of explain-results).
● Use indexes to sort query results. For more information please refer to the link -Use
Indexes to Sort Query Results.
● You can make MongoDB use an index with hint() if for some reason the optimal index
is not used. Use caution while using .hint() because changes to the database’s query
optimizer may be negated by forcing an index selection with hints.
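As a sketch of the ESR example referred to above (the orders collection and its fields are illustrative, not taken from the Simplr schema), a query with an equality match on status, a sort on created_at and a range on qty is best served by an index ordered Equality, then Sort, then Range:
db.orders.find({ status: "SHIPPED", qty: { $gt: 10 } }).sort({ created_at: 1 })
// Equality (status), Sort (created_at), Range (qty)
db.orders.createIndex({ status: 1, created_at: 1, qty: 1 })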
It’s common to find indexes that are optimized for querying and others for sorting.
Compound indexes should be structured to optimize for both whenever possible. A
compound index can benefit both a query and a sort when created properly.
Order matters in a compound index. Indexes are, by definition, a sorted structure. All levels
of a tree are sorted in ascending or descending order. Some queries and sorts can use an
index while others cannot, all depending on the query structure and sort options. It may help
to visualize an index as a generic tree structure to help see why.
All of the example queries above leverage the fact that every “level” of the tree is sorted. It’s
important to note that name and color are both sorted under their respective tree. For
example, apple and orange are sorted correctly even though cucumber comes before
orange in a basic sort because they do not share a common parent.
The same index can be used for the following queries, but less efficiently:
● db.food.find({ type: "fruit" }).sort({ color: 1 })
● db.food.find({ name: "apple" }).sort({ type: 1 })
Both examples above can use the index to satisfy either the equality or the sort but not both.
Looking at the first example, equality on fruit will eliminate half of the tree structure (i.e.
vegetable), but an in-memory sort of color is still required. Orange comes before yellow
when sorting only by color, but those colors don’t contain a common parent so they are not
sorted for this example. An in-memory sort is now required to sort by color.
The second example has an equality on name and sort on type. The index is already sorted
on type so it can be used for that portion, but the entire tree must be traversed to eliminate
all entries with the name apple.
For more information on indexing performance, please refer to the following blog
Performance Best Practices: Indexing
While creating indexes, the ESR rule is the rule of thumb that should be kept in mind and is also recommended in the performance best practices.
Equality fields should always form the prefix of the index to ensure selectivity.
We can get insights about query/aggregation performance using explain plans. We can run
an explain plan on a query using .explain() on a query/aggregation. By looking at the
execution plan, the following can be determined:
● Access paths being used for fetching the data
● Various stages that go into execution
● How data is being sorted
● What indexes are being used and which fields the indexes are being used for
Using the explain method, an explainable object can be constructed to get the
allPlansExecution as below.
db.collection.explain("allPlansExecution").<query/aggregation>
"executionStats" : {
"executionSuccess" : <boolean>,
"nReturned" : <int>,
"executionTimeMillis" : <int>,
Atlas Data Federation is a multi-tenant on-demand query processing engine that allows
users to quickly query their archived data stored in Amazon S3 buckets using the standard
MongoDB Query Language (MQL). Atlas Data Federation supports multiple formats for data
stored in AWS S3 buckets viz. JSON, BSON, CSV, TSV, Avro, ORC, and Parquet.
Atlas Data Federation also supports “Federated Queries” with Atlas clusters as data sources
in addition to data stored in S3 buckets. This means that you can combine both your live
cluster data and historical data in your S3 buckets in virtual databases and collections on
Atlas Data Federation and can query on these virtual databases/collections using MQL
seamlessly.
For additional information on Atlas Data Federation, please refer to the links below-
● Creating Indexes
● Monitoring Data with Atlas monitoring tools
● Creating a Data Federation with S3 buckets from more than one AWS account
● Assigning Atlas temporary users permission to query a Data Federation
● Querying documents larger than 16MB
● Adding IP address associated with your Data Federation to your Atlas project
whitelist
● Assigning Atlas read only access to your AWS account with AWS security groups
● Returning more than 100 collections for wildcard collections
$out takes documents returned by the aggregation pipeline and writes them to a specified
collection. The $out operator must be the last stage in the aggregation pipeline. In Atlas Data
Federation, $out can be used to write to S3 buckets with read and write permissions or to an
Atlas cluster namespace.
{
"$out": {
"s3": {
"bucket": "<bucket-name>",
"region": "<aws-region>",
"filename": "<file-name>",
"format": {
"name": "json|json.gz|bson|bson.gz",
"maxFileSize": "<file-size>"
}
}
}
}
Atlas Online Archive is a new feature that provides the capability to move historical
documents to MongoDB-managed S3 buckets automatically.
With Atlas Online Archive, the team can define a simple rule based on a date field and a number of days for archiving data off a cluster, pick the specific fields that are queried most frequently, and then sit back. Atlas will automatically move data off your cluster and into a more cost-effective storage layer (MongoDB-managed AWS S3 buckets) that can still be queried with a connection string that combines cluster and archive data, powered by Atlas Data Lake.
Online Archive is a good fit for many different use cases, including
The process of configuring “Atlas Online Archive” with screenshots of every step and how to
connect to it, is very well documented in the following MongoDB’s blog - Online Archive: A
New Paradigm for Data Tiering on MongoDB Atlas, also refer to Archive Cluster Data.
Note that although you can create up to 50 online archives per cluster, only up to 20 can be active per cluster at a time.
After configuring the online archive, you are provided with three possible ways to connect. The standard connection string for your cluster remains the same; you are simply given two additional connection strings, one for connecting to your archive only and one for querying the archive and the cluster together.
How is the performance when we connect to both the cluster and the Online Archive?
The queries will be propagated in parallel to the underlying S3 data of the Online Archive as well as to the cluster. The cluster is likely to respond quicker (depending on the characteristics of the data, presence of indexes, etc.), since it is managed by MongoDB, while S3 is generally slower. The bottleneck can be S3, but Online Archive performance can be improved by defining efficient partitioning when creating the archive.
4.3 Mongolyser
Using the Mongolyser tool, you can detect, diagnose and anticipate bottlenecks, issues and red flags in your MongoDB deployment. The information that the tool examines includes in-depth log analysis, query health and efficiency analysis, index analysis and much more. This tool only runs administrative commands and does not contain any DDL or DML commands. It is also worth mentioning that no analytics data leaves the system running the tool.
We helped the team download and install the latest release from here. This tool is still evolving and will provide new and deeper insights in the near future.
Note 2: Certain admin commands being used inside mongolyser’s analysis engine, under
specific conditions, can cause performance impact. Hence it is recommended to run these
analyses in Least User Activity (LUA) hours.
The team wanted to monitor the application query patterns for the index usage and the
MongoDB cluster for query targeting to devise a strategy for index and query optimizations.
Hence, the team was advised to use Keyhole to analyze the system logs while leveraging
Atlas metrics/Atlas profiler to gauge the health of the cluster periodically. Using Keyhole with
Maobi helps in gaining actionable insights from the log files generated by MongoDB and can
help you scan your MongoDB cluster effectively. The information that Keyhole examines includes MongoDB configurations, cluster statistics, cache usage visibility, database schema, indexes, and index usage. It also identifies whether your performance issues are related to insufficient hardware resources (such as physical RAM, CPU, and disk IOPS) and/or slow queries without proper indexes.
Please refer to the following blog posts to understand how to best use keyhole:
● Survey Your Mongo Land
● Peek at your MongoDB Clusters like a Pro with Keyhole: Part 1
● Peek at your MongoDB Clusters like a Pro with Keyhole: Part 2
● Peek at your MongoDB Clusters like a Pro with Keyhole: Part 3
Please use the installation steps mentioned here to download and install Keyhole and Maobi for monitoring the cluster using Keyhole.
Following command can be used to run the analysis on the log file downloaded from Atlas:
keyhole -loginfo mongodb.log.gz
Following command can be used to generate cluster information for the Atlas cluster by
providing the Atlas connection string:
keyhole -allinfo "mongodb+srv://<username>:<password>@cluster.mongodb.net"
Please note: The keyhole tool is not officially supported by MongoDB. Keyhole reports for
log analysis can be run offline and the visualization of the report in Maobi requires network
calls for loading CSS or HTML related files.
MongoDB Atlas provides a fully managed backup service for use with MongoDB
deployments. There are different backup strategies which you can use to backup your
database:
● Cloud Backups: These use the cloud provider’s native snapshot capabilities (incremental snapshots) to take a volume snapshot of your database. Snapshots can be configured as dictated by your backup policy.
● Continuous Cloud Backups: These backups also record the oplog for a configured window, which can be replayed after a snapshot has been restored to a cluster, allowing Point-in-Time restores with up to one-minute granularity. Please note that this feature increases the monthly cost of your cluster.
During the engagement, it was observed that continuous backup was not enabled on the production cluster; only hourly snapshots were enabled, with a frequency of 6 hours. The team can consider the following points to set up a backup strategy based on their requirements:
Continuous Backup can be enabled from Cluster’s configuration screen on MongoDB Atlas.
During the consultation, the team mentioned that one of the main hurdles to converting their cluster to a multi-region cluster is that a restore into a similar multi-region cluster goes through a full download restore process, which can take a long time. Hence we discussed restoring the backup to a single-region cluster, which can be done using a direct-attach restore, and then converting that cluster to a multi-region cluster; this greatly reduces the downtime when a cluster needs to be restored.
Please note: the team must test this strategy to determine the exact time taken for the entire process and any potential downtime.
Change streams were introduced in MongoDB version 3.6 to give applications the ability to listen for changes happening in the database in real time, using a simple API, and they have come a long way since then in terms of robustness. Change streams are very robust because they provide “resumability” capabilities and retry logic to handle loss of connectivity. Key capabilities include:
1. Targeted changes
Changes can be filtered to provide relevant and targeted changes to listening
applications. As an example, filters can be on operation type or fields within the
document.
2. Resumability
Resumability was top of mind when building change streams to ensure that
applications can see every change in a collection. Each change stream response
includes a resume token. In cases where the connection between the application and
the database is temporarily lost, the application can send the last resume token it
received and change streams will pick up right where the application left off. In cases
of transient network errors or elections, the driver will automatically make an attempt
to reestablish a connection using its cached copy of the most recent resume token.
However, to resume after application failure, the application needs to persist the
resume token, as drivers do not maintain state over application restarts.
3. Total ordering
MongoDB 3.6 and above has a global logical clock that enables the server to order
all changes across a sharded cluster. Applications will always receive changes in the
order they were applied to the database.
4. Durability
Change streams only include majority-committed changes. This means that every
change seen by listening applications is durable in failure scenarios such as a new
primary being elected.
5. Security
Change streams are secure – users are only able to create change streams on
collections to which they have been granted read access.
6. Ease of use
Change streams are familiar – the API syntax takes advantage of the established
MongoDB drivers and query language, and are independent of the underlying oplog
format.
For better understanding of a use case and relevant code examples, please refer to this and
this.
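A minimal Node.js sketch of persisting the resume token so that a restarted application can continue where it stopped (assumes a connected MongoClient named client; the resumeTokens collection, its schema and handleChange are illustrative):
async function watchWithResume(client) {
  const tokenColl = client.db("simplr").collection("resumeTokens");
  const saved = await tokenColl.findOne({ _id: "Message" });

  const changeStream = client.db("simplr").collection("Message")
    .watch([], saved ? { resumeAfter: saved.token } : {});

  for await (const change of changeStream) {
    await handleChange(change);   // hypothetical application logic
    // Persist the token only after the event has been processed successfully
    await tokenColl.updateOne({ _id: "Message" },
      { $set: { token: change._id } }, { upsert: true });
  }
}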
Database triggers allow you to execute server-side logic whenever a document is added,
updated, or removed in a linked cluster. Use database triggers to implement complex data
interactions, including updating information in one document when a related document
changes or interacting with an external service when a DML event occurs.
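A minimal sketch of an Atlas (App Services) database trigger function reacting to a change event (the linked data source name "mongodb-atlas", the ConversationSummary collection and the fields being kept in sync are assumptions to adapt):
exports = async function (changeEvent) {
  // fullDocument is available when "Full Document" is enabled on the trigger
  const doc = changeEvent.fullDocument;
  const coll = context.services
    .get("mongodb-atlas")                     // linked data source name (assumption)
    .db("simplr")
    .collection("ConversationSummary");       // hypothetical precomputed collection

  // Keep a precomputed/denormalized document in sync with the source change
  await coll.updateOne(
    { _id: doc.conversationId },
    { $set: { lastMessageAt: doc._updated_at } },
    { upsert: true }
  );
};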
Scheduled Triggers
Scheduled triggers allow you to execute server-side logic on a regular schedule that you
define using CRON expressions. Use scheduled triggers to do work that happens on a
periodic basis, such as updating a document every minute, generating a nightly report, or
sending an automated weekly email newsletter.
A database trigger can become suspended when its underlying change stream cannot continue, for example when:
● Realm receives an invalidate event from the change stream, for example dropDatabase or renameCollection. Invalidate events close the change stream cursor and prevent it from resuming.
● The resume point/token which the trigger needs to use is no longer in the oplog.
● A network error resulted in a communication failure and invalidation of the underlying
change stream.
● An authentication error where the Atlas database user used by the Realm trigger is
no longer valid, for example, if the Realm App is imported with --strategy=replace
instead of --strategy=merge.
Typically, restarting the trigger establishes a new change stream against the watched
collection. If you restart the trigger with a resume token, Realm attempts to resume the
trigger’s underlying change stream at the event immediately following the last change event
it processed. If successful, the trigger processes any events that occurred while it was
suspended.
However, it is possible that the suspended trigger cannot be restarted with a resume token if
the resume token is no longer in the oplog by the time the trigger attempts to resume (for
example, due to a small oplog window). The solution is to restart the trigger without the
resume token. If you do not use a resume token, the trigger listens for new events, but will
not fire for any events that occurred while it was suspended. Ensure that your oplog size is
sufficient (typically a few times more than the peak value from the Oplog GB / Hour graph in the Atlas metrics).
LIMITATIONS
Like all services, Realm triggers also have certain limits which need to be kept in mind for optimal performance. Below are the limits applied to Realm Triggers/Functions:
5.2 Training
MongoDB offers a comprehensive set of instructor-led training courses covering all aspects
of building and running applications with MongoDB. Instructor-led training is the fastest and
best way to learn MongoDB in depth. Both public and private training classes are available -
for more information or to enroll in classes, please see Instructor-Led Training.