Spark 4.0: Row Lineage support #13310
force-pushed from 7dcb5a2 to be78ef5
force-pushed from 54145af to 3a6d7fa
stevenzwu left a comment
some early comments
...sions/src/test/java/org/apache/iceberg/spark/extensions/SparkRowLevelOperationsTestBase.java
public void beforeEach() {
  assumeThat(formatVersion).isGreaterThanOrEqualTo(3);
  // ToDo: Remove these as row lineage inheritance gets implemented in the other readers
  assumeThat(fileFormat).isEqualTo(FileFormat.PARQUET);
maybe worth overriding parameters() in TestRowLevelOperationsWithLineage and defining a smaller test matrix, wdyt?
Yup, agreed! I need to rebase and incorporate the latest test changes I made, which define a smaller test matrix (and will also remove the changes I made to SparkRowLevelOperationsTestBase).
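For illustration, the "smaller test matrix" idea can be sketched in plain Java. This is only a rough model, not the actual Iceberg test code: `LineageTestMatrix`, `fullMatrix()`, and `lineageMatrix()` are hypothetical names, the matrix columns are assumed to be `{formatVersion, fileFormat}`, and file formats are plain strings rather than `FileFormat` values. The point is that the lineage tests declare only the combinations they can run, instead of skipping the rest with `assumeThat` in `beforeEach()`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a parameterized test base; not the Iceberg test classes.
final class LineageTestMatrix {

  // Assumed "full" matrix of {formatVersion, fileFormat} combinations.
  static List<Object[]> fullMatrix() {
    List<Object[]> rows = new ArrayList<>();
    for (int formatVersion : new int[] {2, 3}) {
      for (String fileFormat : new String[] {"PARQUET", "ORC", "AVRO"}) {
        rows.add(new Object[] {formatVersion, fileFormat});
      }
    }
    return rows;
  }

  // Reduced matrix: keep only the rows the lineage tests can run today,
  // mirroring the two assumeThat checks (format version >= 3, Parquet only).
  static List<Object[]> lineageMatrix() {
    List<Object[]> rows = new ArrayList<>();
    for (Object[] row : fullMatrix()) {
      int formatVersion = (Integer) row[0];
      String fileFormat = (String) row[1];
      if (formatVersion >= 3 && "PARQUET".equals(fileFormat)) {
        rows.add(row);
      }
    }
    return rows;
  }

  public static void main(String[] args) {
    for (Object[] row : lineageMatrix()) {
      System.out.println(row[0] + " / " + row[1]);
    }
  }
}
```

With the matrix narrowed up front, skipped combinations no longer show up as ignored tests, and the filter is the single place to relax once other readers support row lineage inheritance.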
public NamedReference[] requiredMetadataAttributes() {
  NamedReference specId = Expressions.column(MetadataColumns.SPEC_ID.name());
  NamedReference partition = Expressions.column(MetadataColumns.PARTITION_COLUMN_NAME);
  if (TableUtil.supportsRowLineage(table)) {
nit: I'm fine either way, but I think it would be good to align how this is done here (stores named references in an array) with SparkCopyOnWriteOperation (which stores named references in a list).
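One way the two styles could be aligned is to collect the references in a list and convert to an array once at the end. The sketch below is a simplified model under stated assumptions: plain strings stand in for `NamedReference`, the class `MetadataAttributeNames` is hypothetical, and the column names are my assumption of what the `MetadataColumns` constants resolve to, so check them against the real class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model: strings stand in for Spark's NamedReference.
final class MetadataAttributeNames {
  // Assumed metadata column names; verify against MetadataColumns.
  static final String SPEC_ID = "_spec_id";
  static final String PARTITION = "_partition";
  static final String ROW_ID = "_row_id";
  static final String LAST_UPDATED_SEQUENCE_NUMBER = "_last_updated_sequence_number";

  // Build the attributes in a list (as the copy-on-write path reportedly does),
  // then convert to an array to satisfy an array-returning signature.
  static String[] requiredMetadataAttributes(boolean supportsRowLineage) {
    List<String> attributes = new ArrayList<>();
    attributes.add(SPEC_ID);
    attributes.add(PARTITION);
    if (supportsRowLineage) {
      attributes.add(ROW_ID);
      attributes.add(LAST_UPDATED_SEQUENCE_NUMBER);
    }
    return attributes.toArray(new String[0]);
  }

  public static void main(String[] args) {
    System.out.println(String.join(", ", requiredMetadataAttributes(true)));
  }
}
```

The list form avoids duplicating the base attributes in two array literals, which is the main readability gain; either representation works functionally.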
    .writeProperties(writeProperties)
    .build();

Function<InternalRow, InternalRow> extractRowLineage =
nit: maybe rowLineageExtractor or something along those lines? I only mention this because extractRowLineage sounds like a boolean flag
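To make the naming point concrete, here is a small sketch of a function-typed variable named as a noun. It is not the PR's implementation: `Object[]` stands in for `InternalRow`, and the metadata row layout and position constants are hypothetical.

```java
import java.util.function.Function;

// Hypothetical model: Object[] stands in for Spark's InternalRow.
final class RowLineageExtractorSketch {
  // Assumed metadata row layout: {specId, partition, rowId, lastUpdatedSequenceNumber}.
  static final int ROW_ID_POS = 2;
  static final int LAST_UPDATED_SEQ_POS = 3;

  // A noun-style name (rowLineageExtractor) signals "this is a function object",
  // whereas extractRowLineage could be mistaken for a boolean flag.
  static Function<Object[], Object[]> rowLineageExtractor() {
    return metadataRow ->
        new Object[] {metadataRow[ROW_ID_POS], metadataRow[LAST_UPDATED_SEQ_POS]};
  }

  public static void main(String[] args) {
    Function<Object[], Object[]> rowLineageExtractor = rowLineageExtractor();
    Object[] lineage = rowLineageExtractor.apply(new Object[] {0, "p", 42L, 7L});
    System.out.println(lineage[0] + ", " + lineage[1]);
  }
}
```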
force-pushed from b5c5dd6 to 0d51fb5
force-pushed from dad59a1 to 31cfce2
…stently for surfacing metadata columns, and include test refactorings that were done in 3.4/3.5
…isting metadata row
force-pushed from 31cfce2 to 2eb84ca
…mMetadata to RowLineageExtractor
…ow lineage decoration
This change integrates Spark 4.0 with Iceberg v3's row lineage feature. The approach uses the conditional nullification mechanism introduced in Spark 4.0 instead of the custom rules that we implemented for Spark 3.5.
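As a rough illustration of what nullification means for the lineage fields, the sketch below models the behavior as I understand it from the Iceberg v3 row lineage rules: a modified row keeps its row ID, while its last-updated sequence number is written as null so that it can be inherited from the new commit. This is an assumption-laden model, not Spark's or Iceberg's API; `RowLineageSemanticsSketch` and `lineageForWrite` are hypothetical names.

```java
// Hypothetical model of lineage values written for a row; not a Spark or Iceberg API.
final class RowLineageSemanticsSketch {

  // Returns {rowId, lastUpdatedSequenceNumber} to write for a row.
  // Assumption: a null sequence number means "inherit from the commit that writes this row".
  static Long[] lineageForWrite(Long rowId, Long lastUpdatedSeq, boolean rowModified) {
    if (rowModified) {
      // Updated row: keep its identity, drop the stale sequence number.
      return new Long[] {rowId, null};
    }
    // Row carried over unchanged (for example, rewritten by copy-on-write): keep both values.
    return new Long[] {rowId, lastUpdatedSeq};
  }

  public static void main(String[] args) {
    Long[] updated = lineageForWrite(5L, 9L, true);
    Long[] carried = lineageForWrite(5L, 9L, false);
    System.out.println(updated[0] + ", " + updated[1]);
    System.out.println(carried[0] + ", " + carried[1]);
  }
}
```

In Spark 4.0 the engine is expected to apply this kind of rule itself based on what the metadata columns declare, which is why connector-side custom rules would no longer be needed.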