
Conversation

@rkistner commented Jan 15, 2026

Builds on #450, supersedes #454 and #461. The latter two PRs have been merged into this one - too much of the logic there has changed for it to be useful to review them separately. They can still be viewed for a history of the changes.

Incremental Reprocessing

This implements incremental reprocessing of sync rules and sync streams. For the overall design, see the proposal in #349.

Right now, this implements:

  1. Only tables used in new bucket definitions and sync streams are re-replicated.
  2. Removing bucket definitions and sync streams requires no re-replication.

The same applies to end-users re-syncing: only buckets for new definitions and streams will be re-synced.

This significantly reduces the overhead of many sync rule and sync stream updates.

Matching is currently by definition name. This means updating existing definitions and streams requires changing the name. Support for in-place updates will be added soon.
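
To illustrate, matching by name amounts to something like the following sketch (SyncDefinitions and newDefinitionNames are hypothetical names, not code from this PR):

```ts
// Sketch of name-based matching between two sync rules versions.
// Definitions whose names appear in both versions keep their existing
// buckets; only names new to the next version need re-replication.
interface SyncDefinitions {
  // Map from bucket definition / sync stream name to its parsed definition.
  definitions: Map<string, unknown>;
}

function newDefinitionNames(previous: SyncDefinitions, next: SyncDefinitions): string[] {
  return [...next.definitions.keys()].filter((name) => !previous.definitions.has(name));
}
```

Under this scheme an in-place edit to a definition is invisible, which is why changing the name is currently required for an update to take effect.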

Right now, this only supports MongoDB for both the source and storage databases. Support for others will be added before merging. (The builds pass, but the implementation is not complete.)

Internal Restructuring

This restructures storage (MongoDB storage only) to decouple the storage of the different parts from sync rules versions (a rough sketch follows the list below):

  1. BucketState and BucketData are now scoped to a specific sync rule definition (BucketDataSource), rather than to a sync rules version. A BucketDataSource may be shared between different sync rules versions; the mapping is stored on the sync rules version.
  2. BucketParameterDocument (parameter lookups) is no longer scoped to a specific sync rules version - it is instead scoped to the parameter lookup index creator id [PENDING] and the SourceTable.
  3. SourceTable is no longer scoped to a specific sync rules version. Instead, we track the relevant BucketDataSource and ParameterIndexLookupCreator ids on it.
  4. CurrentData is no longer scoped to a specific sync rules version - it is now only scoped to a SourceTable.
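
As a rough sketch of the new scoping, with hypothetical field names (the actual document schemas in this PR differ):

```ts
// Hypothetical shapes illustrating the decoupling described above.
interface BucketDataSource {
  id: number; // may be shared between sync rules versions that reuse the definition
  definitionName: string;
}

interface SourceTableDoc {
  id: string;
  // No sync rules version here; instead we track which data sources and
  // parameter lookup index creators use this table.
  bucketDataSourceIds: number[];
  parameterIndexLookupCreatorIds: number[];
}

interface CurrentDataDoc {
  sourceTableId: string; // scoped to the SourceTable only
  rowId: string;
  data: Uint8Array;
}

interface SyncRulesVersionDoc {
  version: number;
  // The version stores the mapping to its (possibly shared) data sources.
  bucketDataSourceIds: number[];
}
```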

Bucket names are now also decoupled from the definition name, and wherever we read or write bucket data, we now have to pass in the relevant BucketDataSource. This unfortunately complicates tests a bit, since they can no longer rely on static bucket names.

The bucket ids are generated when a sync rules version is first loaded. They are effectively just allocated from an incrementing sequence, starting from sync_rules_version << 17. This will change in the future.
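
A minimal sketch of that id scheme (a hypothetical helper, and as noted, the scheme itself will change):

```ts
// Ids are allocated from a simple incrementing sequence, seeded at
// sync_rules_version << 17 when the version is first loaded.
function createBucketIdAllocator(syncRulesVersion: number): () => number {
  let next = syncRulesVersion << 17;
  return () => next++;
}

const nextId = createBucketIdAllocator(3);
nextId(); // 393216 (3 << 17)
nextId(); // 393217
```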

Note that apart from some replication state tracking, SourceTable is immutable. When a new sync rules version adds definitions using the same source collections, it still creates a new SourceTable. This does add some overhead - we're now keeping multiple copies of each source row in CurrentData. However, this simplifies the implementation, and ensures that we don't affect existing BucketData when re-replicating that table for the new sync rules version.

In the current form, the PR implements changes without much regard for the old storage structure. We'll have to implement proper backwards-compatibility and/or a migration strategy to handle existing instances.

ChangeStream Multiplexing

For the replication job, instead of creating a new "replication job" for each sync rules version, we now create a single job covering all sync rules versions - specifically, the "active" version and the "reprocessing" version. This allows us to open a single change stream that is shared between the versions, instead of a separate change stream for each.

Whenever a sync rules version is added, or a previous one is deactivated, we re-create the job.

This uses the data restructuring described above - the persisted data is not scoped to a single sync rules version anymore.
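
A minimal sketch of the multiplexing idea, assuming a hypothetical applyChange() per version (this is not the actual replication code):

```ts
import { MongoClient, ChangeStreamDocument } from 'mongodb';

// One change stream feeds both the "active" and "reprocessing" versions,
// instead of one stream per sync rules version.
interface ReplicatingVersion {
  applyChange(event: ChangeStreamDocument): Promise<void>;
}

async function replicateAll(client: MongoClient, versions: ReplicatingVersion[]): Promise<void> {
  const stream = client.watch([], { fullDocument: 'updateLookup' });
  for await (const event of stream) {
    for (const version of versions) {
      await version.applyChange(event);
    }
  }
}
```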

One caveat with this approach is that we need to hold the locks for both individual sync rules versions at the same time. If another process has locked one version, the current process cannot lock only the other one. This should generally not be an issue when multiple processes run the same version of the code, but a different lock management approach may be better in the future.
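
For illustration, the constraint behaves like an all-or-nothing lock acquisition (tryLock and unlock are hypothetical helpers):

```ts
// If any version is already locked by another process, release whatever
// we acquired and take none of the locks.
declare function tryLock(version: number): Promise<boolean>;
declare function unlock(version: number): Promise<void>;

async function lockAllVersions(versions: number[]): Promise<boolean> {
  const acquired: number[] = [];
  for (const version of versions) {
    if (await tryLock(version)) {
      acquired.push(version);
      continue;
    }
    // Another process holds this lock: back out completely.
    for (const held of acquired.reverse()) {
      await unlock(held);
    }
    return false;
  }
  return true;
}
```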

Remaining work

  • Do fine-grained comparison of bucket definitions and sync streams, rather than relying on unique names
  • Implement the same changes on Postgres storage
  • Figure out migration strategy
  • At the very least, get other source databases functional again (actual incremental reprocessing for them may come later)
  • Fix the compact task (may need significant restructuring)
  • Fix all tests
  • Make sure the logic for which resumeToken to use is sound
  • Re-implement check for out-of-order messages in the change stream
  • Re-implement logic to restart replication when the change stream is invalid
  • Re-implement custom write checkpoints
  • Avoid releasing locks twice in some cases - can we get a better lock code architecture that has built-in guarantees?
  • Clean up the replication job code - the logic for MongoDB is now significantly diverging from the other source databases (until we can implement the same for them)
  • Add tests for incremental reprocessing
  • Implement some performance improvements in processing rows (use Maps to avoid scanning through all definitions for every row)
  • Make sure data is properly cleared when deactivating sync rule versions

changeset-bot commented Jan 15, 2026

⚠️ No Changeset found

Latest commit: 13d2a31

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types
