Core: Calling rewrite_position_delete_files fails on tables with more than 1k columns #10020
Conversation
- class AssignFreshIds extends TypeUtil.CustomOrderSchemaVisitor<Type> {
-   private final Schema visitingSchema;
+ class AssignFreshIds extends BaseAssignIds {
The change here is just a refactor to a base class, for logic reuse.
While we could use this class as-is, it can only reassign all ids wholesale; in our case it seems cleaner to reassign ids only when they collide.
Testing further, this does not work correctly if the position delete file actually has the 'row' value populated. The problem is that the position delete file is a Parquet file with the 'row' field ids in its metadata, so the ParquetReader then populates those columns with null if the ids have been reassigned.
Redid the approach; I now reassign the partition field ids instead. I fixed all the problems with that approach, though it requires a bit of finesse. The broad picture:
szehon-ho left a comment
Leaving some explanations.
 * @param partitionType original table's partition type
 * @return partition type with reassigned field ids
 */
public static Types.StructType partitionType(Schema tableSchema, Types.StructType partitionType) {
These and the following are just helper methods to reassign partition field ids to prevent collisions with the schema's field ids.
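A minimal sketch of the idea, not the PR's actual helper (the class name and exact logic are illustrative): walk the partition struct and move any field id that collides with a schema field id to the next unused id.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.iceberg.Schema;
import org.apache.iceberg.types.TypeUtil;
import org.apache.iceberg.types.Types;

class PartitionTypeReassigner {
  static Types.StructType reassignedPartitionType(Schema tableSchema, Types.StructType partitionType) {
    // every field id already used by the table schema (including nested fields)
    Set<Integer> usedIds = new HashSet<>(TypeUtil.indexById(tableSchema.asStruct()).keySet());
    List<Types.NestedField> fields = new ArrayList<>();
    int nextId = 1;
    for (Types.NestedField field : partitionType.fields()) {
      int id = field.fieldId();
      if (usedIds.contains(id)) {
        // colliding partition field id: move it to the next id the schema does not use
        while (usedIds.contains(nextId)) {
          nextId++;
        }
        id = nextId;
      }
      usedIds.add(id);
      fields.add(Types.NestedField.optional(id, field.name(), field.type()));
    }
    return Types.StructType.of(fields);
  }
}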
  // prepare transformed partition specs and caches
- Map<Integer, PartitionSpec> transformedSpecs = transformSpecs(tableSchema(), table().specs());
+ Map<Integer, PartitionSpec> transformedSpecs =
We use the transformed specs to evaluate the filters, so we need to bind against the new (reassigned) field ids; hence the field id map is passed here so the transformed specs carry them.
+ // Read manifests (use original table's partition ids to de-serialize partition values)
  CloseableIterable<ManifestEntry<DeleteFile>> deleteFileEntries =
-     ManifestFiles.readDeleteManifest(manifest, table().io(), transformedSpecs)
+     ManifestFiles.readDeleteManifest(manifest, table().io(), table().specs())
This part is a bit tricky. We need to read manifests using the original partition field ids (that is how they are stored in the manifest file), so we go back to using the original table's partition specs, which preserve the original field ids.
But then we can no longer use the schema of that spec to bind the user-provided partition filter, because the filter needs to bind against the reassigned ids of the metadata table schema. So we split the row filter out of ManifestReader and evaluate it separately outside, using transformedSpecs (which have the reassigned partition field ids).
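A hedged sketch of that split, assuming the code lives in the org.apache.iceberg package (ManifestEntry is package-private there); names like filterDeleteEntries, originalSpecs, and caseSensitive are illustrative, not the PR's actual code:

package org.apache.iceberg;

import java.io.IOException;
import java.util.Map;

import org.apache.iceberg.expressions.Evaluator;
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Projections;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.io.FileIO;

class DeleteManifestFilterSketch {
  static void filterDeleteEntries(
      ManifestFile manifest,
      FileIO io,
      Map<Integer, PartitionSpec> originalSpecs,
      Map<Integer, PartitionSpec> transformedSpecs,
      Expression rowFilter,
      boolean caseSensitive)
      throws IOException {
    // project the row filter onto the transformed spec (reassigned partition field ids)
    PartitionSpec transformedSpec = transformedSpecs.get(manifest.partitionSpecId());
    Expression partitionFilter =
        Projections.inclusive(transformedSpec, caseSensitive).project(rowFilter);
    Evaluator partitionEvaluator = new Evaluator(transformedSpec.partitionType(), partitionFilter);

    // read with the original specs: that is how partition values are stored in the manifest
    try (CloseableIterable<ManifestEntry<DeleteFile>> entries =
        ManifestFiles.readDeleteManifest(manifest, io, originalSpecs).entries()) {
      for (ManifestEntry<DeleteFile> entry : entries) {
        if (partitionEvaluator.eval(entry.file().partition())) {
          // keep this delete file for the rewrite
        }
      }
    }
  }
}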
  private StructLike coercePartition(PositionDeletesScanTask task, StructType partitionType) {
-   return PartitionUtil.coercePartition(partitionType, task.spec(), task.partition());
+   Types.StructType dedupType = PositionDeletesTable.partitionType(table.schema(), partitionType);
This was using the original partition type to coerce the partition (for grouping during the rewrite), but we actually need the final partition type (with reassigned field ids), because PositionDeletesScanTask now carries partition data that matches the reassigned field ids.
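A reconstruction of the adjusted helper based only on the diff above (the PR's exact code may differ); it coerces the task's partition tuple into the de-duplicated partition type so grouping keys line up with the task's partition data:

private StructLike coercePartition(PositionDeletesScanTask task, Types.StructType partitionType) {
  // use the metadata table's partition type (reassigned field ids), not the original one
  Types.StructType dedupType = PositionDeletesTable.partitionType(table.schema(), partitionType);
  return PartitionUtil.coercePartition(dedupType, task.spec(), task.partition());
}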
Because this problem may affect more than just rewrite_position_deletes (for example, any table scan that selects the _partition metadata column), I rewrote the patch to make the logic more generic and added support at the Iceberg API level, so it can later be used beyond PositionDeletesTable. Key changes:
This change also fixes tests and RewritePositionDeleteFiles to always get the schema/spec from the position_delete_files metadata table (which has the reassigned field ids) instead of the original table, which has the potentially colliding ones.
@rdblue @RussellSpitzer may be interested; can you take a look?
@szehon-ho but this implementation is rather a hack, a workaround of the original design. Why don't you think that bumping the table version and switching to a bigger constant for the partition spec id offset (or going to negative ids just to remove the clash) would be cleaner? That way we'd have a clear path forward, marking the old spec for deprecation in future releases and keeping a stable mainline.
Hi @bk-mz, we discussed this a bit in the last Iceberg community sync. The motivation of this PR is to fix the position_deletes metadata table and to have tools to fix any other place the collision comes up (which should be limited to metadata columns or metadata tables). IMO the other fix you mentioned is either
Thanks @RussellSpitzer, addressed initial comments.
| old: "method void org.apache.iceberg.PositionDeletesTable.PositionDeletesBatchScan::<init>(org.apache.iceberg.Table,\ | ||
| \ org.apache.iceberg.Schema, org.apache.iceberg.TableScanContext)" | ||
| justification: "Removing deprecated code" | ||
| "1.5.0": |
Looks like adding a new ctor as suggested changes the serialVersionUID; is that ok? @RussellSpitzer
+1. Yep, this would only be a concern if we were worried about folks using different Iceberg versions on the client and server, which shouldn't be the case.
So I've reviewed this PR, built it, and tested it on our QA table that has close to 30k columns, and it works perfectly :) I would really be interested in seeing this merged so I don't have to maintain a custom version of the Iceberg lib. One small note: would it be possible to remove old schema definitions from the metadata file? With frequent schema evolution and a large number of columns, the metadata file grows very quickly. I'm wondering whether, during rewrite_manifests, it would make sense to iterate over the used schemas and keep only those? Maybe in another PR?
AtomicInteger nextId = new AtomicInteger();

Type res =
    TypeUtil.assignIds(
I think this may have some ordering issues... I'm not sure if this is possible, but say I see transformId = 1000, and then after that I see columnId = 1000. Won't I still have a problem?
I'm not entirely sure I get your case. But we have two sets, conceptually: idsToReassign and usedIds.
We go through all the fields, and if we find an id to reassign, we just pick the next value that is not in usedIds.
Are you asking what happens if idToReassign = usedIds = (1000)? I think it's ok; it will just pick the next one, 1001 (even though it didn't have to).
// Calculate used ids (for de-conflict)
Set<Integer> currentlyUsedIds =
    Collections.unmodifiableSet(TypeUtil.indexById(Types.StructType.of(columns)).keySet());
Set<Integer> usedIds =
@RussellSpitzer I found another issue. In the case where a column had been removed, the deleted column's ids were not in 'usedIds'. This led to cases where an id was reassigned to a deleted column's id, and for old files containing that column I saw bad behavior during pruning. This is the fix.
For this to work, I changed the Schema API to take a callback that reassigns ids directly. I think it actually simplifies the API a bit, as it removes the concept of 'metadataIds', which was probably confusing anyway.
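A hedged sketch of the "reassign via callback" idea, purely illustrative (the class name and exact shape are assumptions, not the PR's API): the caller supplies a function that maps a colliding field id to the next id not used anywhere, including the ids of deleted columns.

import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntUnaryOperator;

class NextFreeIdAssigner implements IntUnaryOperator {
  private final Set<Integer> usedIds;
  private final AtomicInteger nextId = new AtomicInteger();

  NextFreeIdAssigner(Set<Integer> usedIds) {
    // usedIds should include current columns, partition fields, and deleted columns' ids,
    // so a reassigned id never reuses any of them
    this.usedIds = new HashSet<>(usedIds);
  }

  @Override
  public int applyAsInt(int collidingId) {
    // advance until we find an id nobody uses, then reserve it
    int candidate = nextId.incrementAndGet();
    while (usedIds.contains(candidate)) {
      candidate = nextId.incrementAndGet();
    }
    usedIds.add(candidate);
    return candidate;
  }
}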
Yep this also fixes the situation I was worried about.
RussellSpitzer left a comment
I think we are set here. A couple of reminders:
- We still need to resolve this for SELECT *, _partition queries (ok for a followup).
- I think we should have a straight-up test of the position delete metadata table, so we have a test that is invoked without a Spark change and that specifically exercises the re-id'ing. I know the current case does, but mostly by accident, since 0 is the default, right?
Thanks, added a test.
Thanks @RussellSpitzer for helping get this over the finish line! A PR to fix _partition metadata column collisions will come subsequently.
#10547 is attempting to fix the read of the _partition metadata column.
Fixes: #9923
The position_deletes metadata table (used by rewrite_position_deletes) has both a 'partition' field and a 'row' field (for the optional 'row' column of position deletes, https://iceberg.apache.org/spec/#position-delete-files, which is the table schema as a struct). Partition field ids start at 1000, while the 'row' struct reuses the table schema's field ids, so if the table has more than 1000 columns the field ids collide.
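A small standalone demo of the collision (not part of the PR), assuming the default partition field id assignment starting at 1000:

import java.util.ArrayList;
import java.util.List;

import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

public class FieldIdCollisionDemo {
  public static void main(String[] args) {
    // a table schema with 1500 columns: schema field ids 1..1500
    List<Types.NestedField> columns = new ArrayList<>();
    for (int i = 1; i <= 1500; i++) {
      columns.add(Types.NestedField.optional(i, "col_" + i, Types.StringType.get()));
    }
    Schema schema = new Schema(columns);

    // the first partition field is assigned field id 1000, which already belongs to col_1000
    PartitionSpec spec = PartitionSpec.builderFor(schema).identity("col_1").build();
    System.out.println(spec.fields().get(0).fieldId()); // prints 1000
  }
}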