Spark: Fix row lineage inheritance for distributed planning #13061
Conversation
{
  "testhadoop",
  SparkCatalog.class.getName(),
  ImmutableMap.of("type", "hadoop"),
  FileFormat.PARQUET,
  false,
  WRITE_DISTRIBUTION_MODE_HASH,
  true,
  null,
  DISTRIBUTED,
  3
},
I'll see how much more time this adds, but at least in the interim I feel it's worth having this, as it's what caught the issue.
What we can probably do once the vectorized reader change is in is remove the parquet + local test above, since the vectorized reader already tests local. Then we'll still have coverage of both local and distributed without the multiple parquet local cases we have right now.
Technically you could also override parameters() in TestRowLevelOperationsWithLineage so that the test matrix is only increased for those tests and not across all tests.
Good point. I like that better because now we can explicitly see what we're testing against in this class and what we're not testing due to lack of support. Went ahead and did this override.
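As a rough illustration of the override being discussed (not the actual diff in this PR; the annotation style and parameter order are assumed to follow the parent row-level operations test class, and the values mirror the diff above), the subclass pins the shared catalog settings and only adds the planning-mode dimension:

```java
// Hypothetical sketch: override parameters() in TestRowLevelOperationsWithLineage so that
// only these tests pick up the extra planning-mode dimension.
@Parameters
public static Object[][] parameters() {
  return new Object[][] {
    {"testhadoop", SparkCatalog.class.getName(), ImmutableMap.of("type", "hadoop"),
        FileFormat.PARQUET, false, WRITE_DISTRIBUTION_MODE_HASH, true, null, LOCAL, 3},
    {"testhadoop", SparkCatalog.class.getName(), ImmutableMap.of("type", "hadoop"),
        FileFormat.PARQUET, false, WRITE_DISTRIBUTION_MODE_HASH, true, null, DISTRIBUTED, 3}
  };
}
```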
private Long addedSnapshotId = null;
private Integer content = null;
private Long sequenceNumber = null;
private Long firstRowId = null;
Ok, this may not be quite right; I forgot about delete manifests...
Why would this not work for delete manifests?
It should (it'd just be null for delete manifests). I confused myself when debugging an issue with failing tests, but those tests are unrelated to whether it's a delete manifest or not. Check out my comment below on why I removed the getter in my latest update.
bean.setAddedSnapshotId(manifest.snapshotId());
bean.setContent(manifest.content().id());
bean.setSequenceNumber(manifest.sequenceNumber());
bean.setFirstRowId(manifest.firstRowId());
Do we need a proper getter for it?
On my latest push I removed the getter because the Spark actions that read the paths in manifest files as a DataFrame (e.g. orphan files/expire snapshots) also use ManifestFileBean, and when reading the manifest DataFrame it was failing to find firstRowId (the existence of the getter means every record read by these actions needs to have this field, and if it doesn't, it fails during analysis).
The getter isn't needed for the distributed planning case since ManifestFileBean implements ManifestFile and the firstRowId() API gets used. It's also not required for the Spark actions, which just need the minimal file info.
But it's a bit odd that this one particular field won't have a getter; let me think if there's a cleaner way. Having two different manifest file bean structures, where one is even more minimal, seems a bit messy, at least just for this case.
At the very least, if we go with this approach I should add an inline comment explaining why there's no getter for the field.
@RussellSpitzer Yeah, so that firstRowId() implementation is required to satisfy the ManifestFile interface, but if you check out some of the other fields, for instance the partition spec ID (https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/ManifestFileBean.java#L69), we have an additional getPartitionSpecId.
This is a bean class that gets used when reading records in Spark actions (https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/BaseSparkAction.java#L167), so we need to indicate to the Encoder up above which fields we expect to be there. The way this is indicated is by having the explicit "getFoo" style API, since I think under the hood the encoder is using some sort of reflection + name search based on get* to find these fields.
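To make that encoder behavior concrete, here is a small self-contained sketch using plain Spark APIs (hypothetical class and field names, not the Iceberg code): Encoders.bean derives its schema only from getter-backed properties, so a setter-only field stays writable in Java but never becomes an expected column.

```java
import java.io.Serializable;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;

public class BeanSchemaDemo {
  // Illustrative stand-in for a bean like ManifestFileBean.
  public static class ManifestInfoBean implements Serializable {
    private String path = null;
    private Long firstRowId = null;

    // A get/set pair makes "path" part of the encoder schema.
    public String getPath() { return path; }
    public void setPath(String path) { this.path = path; }

    // Setter only: the field can still be populated in Java code,
    // but the bean encoder does not include it in the expected schema.
    public void setFirstRowId(Long firstRowId) { this.firstRowId = firstRowId; }
  }

  public static void main(String[] args) {
    Encoder<ManifestInfoBean> encoder = Encoders.bean(ManifestInfoBean.class);
    // Prints a struct containing only the getter-backed fields (here: path).
    System.out.println(encoder.schema().treeString());
  }
}
```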
And we can't have the getter if it's optional?
Just feels odd that it can't be null :)
Yeah, I agree; even if the field doesn't exist, things should just work by the field being null...
Ok @RussellSpitzer, I looked into it a bit more here. The issue I pointed out earlier regarding https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/BaseSparkAction.java#L156 is simply that when reading the manifests table into a DataFrame, we cannot deserialize the records from the manifests metadata table into the ManifestFileBean structure, since the expectation is that the field exists in the source records (the value can be null, but the column must exist).
As a result, I think the main thing to decide is whether we want to continue relying on this ManifestFileBean in the base Spark procedure I linked, or compose a slightly slimmed-down structure with a more minimal field set and all the getters/setters.
My take is that we should just continue re-using it, and it's fine if some fields don't have explicit getters, since that means the field isn't always required from the Encoder's perspective.
Additionally, we'll need to add first_row_id to the manifests metadata table anyway. At that point, we could add back the getter, because then we can project that field to read into this structure. But that still doesn't seem quite right, because many of the procedures don't even need to project that field anyway.
TLDR, I think there are 3 options (I'm preferring 1):
1.) Leave things as is: specific fields like first_row_id won't have explicit getters, because when deserializing into the bean via the encoder, a getter creates the expectation that the record must actually have that field, which isn't always true (see the sketch below).
2.) Add a custom slimmed-down version of ManifestFileBean for the Spark actions case, or the inverse: introduce a bean which composes the existing bean plus a first_row_id for the distributed planning case.
3.) Add back a getter when working on the manifests metadata table support for first_row_id and change the projection in the Spark procedure. This seems like a backwards way of addressing the issue imo.
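For option 1's rationale, here is a minimal, self-contained illustration of the failure mode being discussed (plain Spark APIs, hypothetical bean and column names, not the Iceberg code): when every field of the bean is getter-backed, converting a DataFrame that lacks one of those columns to the bean type fails during analysis, even though a null value would have been acceptable.

```java
import java.io.Serializable;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MissingColumnDemo {
  // Hypothetical bean: both properties have getters, so the encoder expects both columns.
  public static class FileBean implements Serializable {
    private String path;
    private Long firstRowId;
    public String getPath() { return path; }
    public void setPath(String path) { this.path = path; }
    public Long getFirstRowId() { return firstRowId; }
    public void setFirstRowId(Long firstRowId) { this.firstRowId = firstRowId; }
  }

  public static void main(String[] args) {
    SparkSession spark =
        SparkSession.builder().master("local[1]").appName("missing-column-demo").getOrCreate();

    // Stand-in for reading a metadata table that does not (yet) expose first_row_id.
    Dataset<Row> source = spark.sql("SELECT 'file:/tmp/m1.avro' AS path");

    try {
      // Expected to throw during analysis: the bean encoder needs a firstRowId column,
      // but the source only has "path".
      Dataset<FileBean> beans = source.as(Encoders.bean(FileBean.class));
      beans.show();
    } catch (Exception e) {
      System.out.println("Analysis failed as expected: " + e.getMessage());
    } finally {
      spark.stop();
    }
  }
}
```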
I understand using the bean for serialization in distributed planning. What's the reason for using ManifestFileBean in BaseSparkAction? That seems to be the only place using ManifestFileBean.ENCODER.
BTW, this PR seems good as it is. The above discussion on bean usage in BaseSparkAction can be tackled separately.
I'm not sure I follow this entire thread, but it seems reasonable to me to add first_row_id to the manifests table so that the bean works?
// ToDo: Remove these as row lineage inheritance gets implemented in the other readers
assumeThat(fileFormat).isEqualTo(FileFormat.PARQUET);
Don't need these assumptions since the parameters are being explicitly overridden now.
WRITE_DISTRIBUTION_MODE_HASH,
true,
null,
LOCAL,
Do we need this many parameters if we are only alternating planningMode?
Fair point, I mostly just went and used these since this class inherits from the existing DML test class, but I think we can maybe slim down some of the parameters in this class by making some values constant?
There are ones we'll need to test against, including file format and vectorized, because those validate that the readers are plumbing inheritance correctly.
Probably doesn't add value to test row lineage against a single branch, but we should have a test which does writes to main and additional writes to another branch to make sure ID assignment/sequence number is still correct.
Let me see if I can slim these down as part of this change
@RussellSpitzer I looked into it a bit more, and I think slimming things down here would be quite a bit of work that I'm not sure is actually beneficial. I think for this class we know we want to vary the following parameters for testing:
1.) Planning mode
2.) File format
3.) Vectorized
4.) Distribution mode (this is mostly to make sure that regardless of plan, we don't end up in situations where we drop the lineage fields, and that the rewritten logical rules just work)
The part that we should be able to leave constant is all the catalog stuff and branching, but that's deep down in the inheritance path of this test. And I think we do want to keep the inheritance from the existing row-level operations tests, not only for the parameters but also because there are some shared helper methods for table setup, view creation for the merge, etc.
All in all, I'm also not super concerned about test times blowing up here (this test suite takes around 30 seconds at the moment).
rdblue left a comment
Looks good to me. Thanks @amogh-jahagirdar!
Thanks for reviewing @RussellSpitzer @stevenzwu @rdblue @aokolnychyi. We will still need to address adding first_row_id to the manifests metadata table; I'll go ahead and merge so that we can first ensure the row lineage metadata is correct, and take the manifest metadata table fix in a follow on.
Spark: Fix row lineage inheritance for distributed planning
For Spark distributed planning we use a ManifestFileBean implementation of ManifestFile, which is serializable and encodes the minimal set of manifest fields required during distributed planning. This was missing firstRowId, and as a result null values would be propagated for the inherited firstRowId. This change fixes the issue by adding the firstRowId field to the bean, so it is set correctly and therefore inherited correctly during Spark distributed planning.
I discovered this when going through the DML row lineage tests and noticed we weren't exercising a distributed planning case; after enabling one, I debugged the failure. I added another test parameter set for distributed planning.
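Roughly, the shape of the fix described above looks like the following sketch (illustrative, not the exact diff): the bean gains a firstRowId field with a setter and a ManifestFile-style firstRowId() accessor, but deliberately no getFirstRowId() getter, so the Spark bean encoder used by the actions does not start requiring the column.

```java
// Sketch of the relevant part of ManifestFileBean (the surrounding class, other interface
// methods, and remaining fields are elided; names follow the diff above).
public class ManifestFileBean implements ManifestFile, Serializable {
  private Long firstRowId = null;
  // ... existing fields, getters, and setters ...

  public void setFirstRowId(Long firstRowId) {
    this.firstRowId = firstRowId;
  }

  // Intentionally no getFirstRowId(): the bean encoder derives its expected columns from
  // getters, and the Spark actions that read manifests into this bean don't provide the field.
  @Override
  public Long firstRowId() {
    return firstRowId;
  }
}
```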