[Incremental Reprocessing] [WIP] Proof-of-concept incremental reprocessing for MongoDB #468
base: mongo-concurrent-streaming
Builds on #450 and supersedes #454 and #461. The latter two PRs have been merged into this one; too much of their logic has since changed for it to be useful to review them separately, but they can still be viewed for a history of the changes.
Incremental Reprocessing
This implements incremental reprocessing of sync rules and sync streams. For the overall design, see the proposal in #349.
Right now, this implements:

- When deploying a new sync rules version, only buckets for new definitions and streams are reprocessed; existing bucket data is reused for unchanged ones.
- The same applies for end-users re-syncing: only buckets for new definitions and streams will be re-synced.
This significantly reduces the overhead of deploying many sync rule and sync stream updates.
Matching is currently by definition name. This means updating existing definitions and streams requires changing the name. Support for in-place updates will be added soon.
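As a rough sketch of what matching by name implies (illustrative only; the names below are not the actual implementation), only definitions whose names did not appear in the previous version would be reprocessed:

```ts
// Illustrative sketch: match bucket definitions by name between the previous
// and the new sync rules version. Only names that did not exist before are
// reprocessed; a definition whose body changed but whose name stayed the same
// is treated as unchanged, which is why in-place updates currently require a rename.
interface BucketDefinition {
  name: string;
}

function definitionsToReprocess(
  previous: BucketDefinition[],
  next: BucketDefinition[]
): BucketDefinition[] {
  const existingNames = new Set(previous.map((d) => d.name));
  return next.filter((d) => !existingNames.has(d.name));
}
```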
Right now, this only supports MongoDB for the source and storage databases. Support for others will be added before merging (the builds pass, but those implementations are not complete).
Internal Restructuring
This restructures storage (MongoDB storage only) to decouple the storage of different components from a specific sync rules version.
Bucket names are now also decoupled from the definition name, and wherever we read or write bucket data, the relevant BucketDataSource has to be passed in. This unfortunately complicates tests a bit, since they can no longer rely on static bucket names.
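A minimal sketch of the kind of shape this implies (hypothetical names; the actual interfaces in the PR differ):

```ts
// Hypothetical shapes only: bucket data is read and written against an
// explicit BucketDataSource rather than a name derived from the definition.
interface BucketDataSource {
  // Numeric id allocated for this definition, decoupled from its name.
  definitionId: number;
}

interface OplogEntry {
  op_id: string;
  data: string | null;
}

interface BucketDataStorage {
  saveBucketData(
    source: BucketDataSource,
    bucket: string,
    entries: OplogEntry[]
  ): Promise<void>;

  getBucketData(
    source: BucketDataSource,
    bucket: string,
    afterOpId: string
  ): Promise<OplogEntry[]>;
}
```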
The bucket ids are generated when a sync rules version is first loaded. They are effectively just an incrementing sequence, starting from sync_rules_version << 17. This will change in the future.

Note that apart from some replication state tracking, SourceTable is immutable. When a new sync rules version adds definitions using the same source collections, it still creates a new SourceTable. This does add some overhead - we're now keeping multiple copies of each source row in CurrentData. However, this simplifies the implementation, and ensures that we don't affect existing BucketData when re-replicating that table for the new sync rules version.
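A minimal sketch of that id scheme, assuming a simple per-version counter (illustrative only, not the actual implementation):

```ts
// Each sync rules version gets its own id range: ids for version v start at
// v << 17, leaving room for 2^17 bucket definitions per version.
function firstBucketId(syncRulesVersion: number): number {
  return syncRulesVersion << 17;
}

// Hypothetical allocator: an incrementing sequence, seeded per version.
function createBucketIdAllocator(syncRulesVersion: number): () => number {
  let next = firstBucketId(syncRulesVersion);
  return () => next++;
}
```

For example, version 3 would allocate ids 393216, 393217, and so on.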
In its current form, the PR implements these changes without much regard for the old storage structure. We'll have to implement proper backwards compatibility and/or a migration strategy to handle existing instances.
ChangeStream Multiplexing
Instead of creating a separate "replication job" for each sync rules version, we now create one job for all sync rules versions: specifically, the "active" version and the "reprocessing" version. This allows us to open a single change stream that is shared between the versions, rather than a separate change stream for each.
Whenever a sync rules version is added, or a previous one is deactivated, we re-create the job.
This uses the data restructuring described above - the persisted data is not scoped to a single sync rules version anymore.
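As a rough sketch of the multiplexing (hypothetical names and shapes, not the actual implementation), a single change stream can be fanned out to the active and reprocessing versions:

```ts
import { MongoClient, ChangeStreamDocument } from 'mongodb';

// Hypothetical per-version processor shape.
interface SyncRulesVersionProcessor {
  handleChange(change: ChangeStreamDocument): Promise<void>;
}

// One shared change stream, fanned out to every running sync rules version
// (typically the "active" and the "reprocessing" version).
async function replicateAll(
  client: MongoClient,
  versions: SyncRulesVersionProcessor[]
): Promise<void> {
  const stream = client.watch([], { fullDocument: 'updateLookup' });
  try {
    for await (const change of stream) {
      // Each version applies its own definitions to the same change event.
      for (const version of versions) {
        await version.handleChange(change);
      }
    }
  } finally {
    await stream.close();
  }
}
```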
One caveat with this approach is that we need to hold the lock for both sync rules versions at the same time. If another process has locked one version, the current process cannot lock just the other one. This should generally not be an issue when multiple processes run the same version of the code, but a different lock management approach may be better in the future.
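A small sketch of the all-or-nothing locking this implies (acquireLock and its shape are assumptions, not the actual lock implementation):

```ts
interface Lock {
  release(): Promise<void>;
}

// Assumed helper: returns null if another process already holds the lock.
declare function acquireLock(syncRulesVersion: number): Promise<Lock | null>;

// Either lock every version the shared replication job needs, or none of them.
async function lockAllVersions(versions: number[]): Promise<Lock[] | null> {
  const held: Lock[] = [];
  for (const version of versions) {
    const lock = await acquireLock(version);
    if (lock == null) {
      // Another process holds one of the versions, so the shared job cannot
      // run here: release everything we acquired and back off.
      for (const l of held) {
        await l.release();
      }
      return null;
    }
    held.push(lock);
  }
  return held;
}
```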
Remaining work