[Incremental Reprocessing] [WIP] Proof-of-concept incremental reprocessing for MongoDB #468
base: mongo-concurrent-streaming
Builds on #450 and supersedes #454 and #461. The latter two PRs have been merged into this one; too much of their logic has since changed for it to be useful to review them separately, but they can still be viewed for a history of the changes.
Incremental Reprocessing
This implements incremental reprocessing of sync rules and sync streams. For the overall design, see the proposal in #349.
Right now, this implements:

- When deploying a new sync rules version, only buckets for new definitions and streams are reprocessed; existing bucket data is reused for unchanged ones.
- The same applies for end-users re-syncing: only buckets for new definitions and streams will be re-synced.
This significantly reduces the overhead of deploying many sync rule and sync stream updates.
Matching is currently by definition name. This means updating existing definitions and streams requires changing the name. Support for in-place updates will be added soon.
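As a rough sketch of what matching by name implies (illustrative only; the names below are not the actual implementation), only definitions whose names did not appear in the previous version would be reprocessed:

```ts
// Illustrative sketch: match bucket definitions by name between the previous
// and the new sync rules version. Only names that did not exist before are
// reprocessed; a definition whose body changed but whose name stayed the same
// is treated as unchanged, which is why in-place updates currently require a rename.
interface BucketDefinition {
  name: string;
}

function definitionsToReprocess(
  previous: BucketDefinition[],
  next: BucketDefinition[]
): BucketDefinition[] {
  const existingNames = new Set(previous.map((d) => d.name));
  return next.filter((d) => !existingNames.has(d.name));
}
```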
Right now, this only supports MongoDB for the source and storage databases. Support for others will be added before merging (the builds pass, but those implementations are not complete).
Internal Restructuring
This restructures storage (MongoDB storage only) to decouple the storage of different components from a specific sync rules version.
Bucket names are now also decoupled from the definition name, and wherever we read or write bucket data, the relevant BucketDataSource has to be passed in. This unfortunately complicates tests a bit, since they can no longer rely on static bucket names.
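A minimal sketch of the kind of shape this implies (hypothetical names; the actual interfaces in the PR differ):

```ts
// Hypothetical shapes only: bucket data is read and written against an
// explicit BucketDataSource rather than a name derived from the definition.
interface BucketDataSource {
  // Numeric id allocated for this definition, decoupled from its name.
  definitionId: number;
}

interface OplogEntry {
  op_id: string;
  data: string | null;
}

interface BucketDataStorage {
  saveBucketData(
    source: BucketDataSource,
    bucket: string,
    entries: OplogEntry[]
  ): Promise<void>;

  getBucketData(
    source: BucketDataSource,
    bucket: string,
    afterOpId: string
  ): Promise<OplogEntry[]>;
}
```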
The bucket ids are generated when a sync rules version is first loaded. They are effectively just an incrementing sequence, starting from sync_rules_version << 17. This will change in the future.

Note that apart from some replication state tracking, SourceTable is immutable. When a new sync rules version adds definitions using the same source collections, it still creates a new SourceTable. This does add some overhead - we're now keeping multiple copies of each source row in CurrentData. However, this simplifies the implementation, and ensures that we don't affect existing BucketData when re-replicating that table for the new sync rules version.
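A minimal sketch of that id scheme, assuming a simple per-version counter (illustrative only, not the actual implementation):

```ts
// Each sync rules version gets its own id range: ids for version v start at
// v << 17, leaving room for 2^17 bucket definitions per version.
function firstBucketId(syncRulesVersion: number): number {
  return syncRulesVersion << 17;
}

// Hypothetical allocator: an incrementing sequence, seeded per version.
function createBucketIdAllocator(syncRulesVersion: number): () => number {
  let next = firstBucketId(syncRulesVersion);
  return () => next++;
}
```

For example, version 3 would allocate ids 393216, 393217, and so on.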
In its current form, the PR implements these changes without much regard for the old storage structure. We'll have to implement proper backwards compatibility and/or a migration strategy to handle existing instances.
ChangeStream Multiplexing
Instead of creating a separate "replication job" for each sync rules version, we now create one job for all sync rules versions: specifically, the "active" version and the "reprocessing" version. This allows us to open a single change stream that is shared between the versions, rather than a separate change stream for each.
Whenever a sync rules version is added, or a previous one is deactivated, we re-create the job.
This uses the data restructuring described above - the persisted data is not scoped to a single sync rules version anymore.
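As a rough sketch of the multiplexing (hypothetical names and shapes, not the actual implementation), a single change stream can be fanned out to the active and reprocessing versions:

```ts
import { MongoClient, ChangeStreamDocument } from 'mongodb';

// Hypothetical per-version processor shape.
interface SyncRulesVersionProcessor {
  handleChange(change: ChangeStreamDocument): Promise<void>;
}

// One shared change stream, fanned out to every running sync rules version
// (typically the "active" and the "reprocessing" version).
async function replicateAll(
  client: MongoClient,
  versions: SyncRulesVersionProcessor[]
): Promise<void> {
  const stream = client.watch([], { fullDocument: 'updateLookup' });
  try {
    for await (const change of stream) {
      // Each version applies its own definitions to the same change event.
      for (const version of versions) {
        await version.handleChange(change);
      }
    }
  } finally {
    await stream.close();
  }
}
```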
One caveat with this approach is that we need to hold the lock for both sync rules versions at the same time. If another process has locked one version, the current process cannot lock just the other one. This should generally not be an issue when multiple processes run the same version of the code, but a different lock management approach may be better in the future.
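A small sketch of the all-or-nothing locking this implies (acquireLock and its shape are assumptions, not the actual lock implementation):

```ts
interface Lock {
  release(): Promise<void>;
}

// Assumed helper: returns null if another process already holds the lock.
declare function acquireLock(syncRulesVersion: number): Promise<Lock | null>;

// Either lock every version the shared replication job needs, or none of them.
async function lockAllVersions(versions: number[]): Promise<Lock[] | null> {
  const held: Lock[] = [];
  for (const version of versions) {
    const lock = await acquireLock(version);
    if (lock == null) {
      // Another process holds one of the versions, so the shared job cannot
      // run here: release everything we acquired and back off.
      for (const l of held) {
        await l.release();
      }
      return null;
    }
    held.push(lock);
  }
  return held;
}
```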
Remaining work