Conversation

@szehon-ho (Member) commented Mar 22, 2024

Fixes: #9923

The position_deletes metadata table (used by rewrite_position_deletes) has both a 'partition' field and a 'row' field (for the optional 'row' column of position deletes, https://iceberg.apache.org/spec/#position-delete-files, which is the table schema as a struct). If the table has over 1000 columns, then the field ids will collide.
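To illustrate the collision, here is a minimal sketch (not Iceberg code; it only assumes, per the spec's convention, that column ids are assigned sequentially from 1 while partition field ids are assigned from a base of 1000):

```java
import java.util.HashSet;
import java.util.Set;

public class IdCollisionSketch {
    // Partition field ids conventionally start at 1000 in Iceberg.
    static final int PARTITION_DATA_ID_START = 1000;

    // Returns ids that appear both as schema column ids (1..columnCount)
    // and as partition field ids (1000, 1001, ...).
    static Set<Integer> collidingIds(int columnCount, int partitionFieldCount) {
        Set<Integer> schemaIds = new HashSet<>();
        for (int id = 1; id <= columnCount; id++) {
            schemaIds.add(id);
        }
        Set<Integer> collisions = new HashSet<>();
        for (int i = 0; i < partitionFieldCount; i++) {
            int partitionFieldId = PARTITION_DATA_ID_START + i;
            if (schemaIds.contains(partitionFieldId)) {
                collisions.add(partitionFieldId);
            }
        }
        return collisions;
    }
}
```

A table with 999 columns and one partition field has no overlap, but once column ids reach 1000 they start colliding with the partition field ids.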


- class AssignFreshIds extends TypeUtil.CustomOrderSchemaVisitor<Type> {
-   private final Schema visitingSchema;
+ class AssignFreshIds extends BaseAssignIds {
@szehon-ho (Member, Author) Mar 22, 2024

The change here just refactors this into a base class, for logic re-use.

While we could use this class as-is, it seems only able to completely re-assign all ids; in our case it seems cleaner to re-assign ids only when they collide.

@szehon-ho szehon-ho closed this Mar 22, 2024
@szehon-ho szehon-ho reopened this Mar 22, 2024
@szehon-ho (Member, Author)

Actually, testing further, this does not work correctly if the position delete file actually has the 'row' value populated. The problem is that the position delete file is a Parquet file with the 'row' field ids in its metadata, so the ParquetReader populates those columns with null if the ids have been reassigned.

@github-actions github-actions bot removed the API label Mar 26, 2024
@szehon-ho (Member, Author) commented Mar 26, 2024

Redid the approach; I now re-assign partition field ids.

I fixed all the problems of that approach; it requires a bit of finesse. The broad picture:

  • When we need to read manifests, we need to use the original specs, because deserializing a manifest entry requires the original partition field ids that are stored in it.
  • When we need to bind filters (like for predicate evaluation), we need the new spec with reassigned ids, because the filter needs to bind against the updated schema.
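The second point can be sketched roughly as an id rewrite before binding (illustrative only; Iceberg's real Expression API binds by name against a schema, not through an explicit id map like this hypothetical one):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class FilterRebindSketch {
    // Toy predicate referencing a field by id; illustrative only.
    static class Predicate {
        final int fieldId;
        final String op;
        final Object value;
        Predicate(int fieldId, String op, Object value) {
            this.fieldId = fieldId;
            this.op = op;
            this.value = value;
        }
    }

    // Rewrite each predicate's field id from the original partition field id
    // to the reassigned one, so it can bind against the updated schema.
    static List<Predicate> rebind(List<Predicate> filters, Map<Integer, Integer> oldToNewId) {
        List<Predicate> rebound = new ArrayList<>();
        for (Predicate p : filters) {
            rebound.add(new Predicate(oldToNewId.getOrDefault(p.fieldId, p.fieldId), p.op, p.value));
        }
        return rebound;
    }
}
```

With a mapping like {1000 -> 2000}, a predicate on original field 1000 rebinds to the reassigned id 2000, while predicates on unaffected ids pass through unchanged.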

@szehon-ho (Member, Author) left a comment

Leaving some explanations.

* @param partitionType original table's partition type
* @return partition type with reassigned field ids
*/
public static Types.StructType partitionType(Schema tableSchema, Types.StructType partitionType) {
@szehon-ho (Member, Author)

These and the following are just helper methods to reassign partition field ids, to prevent collision with the schema's field ids.


// prepare transformed partition specs and caches
Map<Integer, PartitionSpec> transformedSpecs = transformSpecs(tableSchema(), table().specs());
Map<Integer, PartitionSpec> transformedSpecs =
@szehon-ho (Member, Author) Mar 26, 2024

We use the transformed specs to evaluate the filters, so we need to bind against the new (reassigned) field ids; hence we pass the field id map here so the transformed specs have them.

// Read manifests (use original table's partition ids to de-serialize partition values)
CloseableIterable<ManifestEntry<DeleteFile>> deleteFileEntries =
ManifestFiles.readDeleteManifest(manifest, table().io(), transformedSpecs)
ManifestFiles.readDeleteManifest(manifest, table().io(), table().specs())
@szehon-ho (Member, Author) Mar 26, 2024

This part is a bit tricky. We need to read manifests using the original partition field ids (as that is how they are stored in the manifest file), so we go back to using the original table's partition specs, which preserve the original field ids.

But now we can no longer use the schema of that spec to bind the user-provided partition filter, because the filter needs to bind against the reassigned ids of the metadata table schema. So here we split the row-filter out of ManifestReader and evaluate it separately outside. We do that below using transformedSpecs (which have the reassigned partition field ids).
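The split described here can be sketched generically (a toy stand-in, not the actual ManifestReader API): read the entries unfiltered, then apply the separately-bound filter outside the reader.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class ExternalFilterSketch {
    // Read everything from the "reader" (here just a list), then evaluate the
    // row filter outside it, so the filter can be bound against different ids
    // than the ones used for deserialization.
    static <T> List<T> readThenFilter(List<T> entries, Predicate<T> rowFilter) {
        List<T> result = new ArrayList<>();
        for (T entry : entries) {
            if (rowFilter.test(entry)) {
                result.add(entry);
            }
        }
        return result;
    }
}
```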


private StructLike coercePartition(PositionDeletesScanTask task, StructType partitionType) {
return PartitionUtil.coercePartition(partitionType, task.spec(), task.partition());
Types.StructType dedupType = PositionDeletesTable.partitionType(table.schema(), partitionType);
@szehon-ho (Member, Author) Mar 26, 2024

This was using the original partition type to coerce partition (for grouping during the rewrite), but we actually need the final partition type (with reassigned field ids). This is because PositionDeletesScanTask now has partition data that matches the reassigned field ids.

@szehon-ho szehon-ho force-pushed the postition_deletes_id_collision branch from d2db7f9 to 79cdd7d Compare March 26, 2024 23:47
@szehon-ho szehon-ho force-pushed the postition_deletes_id_collision branch from c19734b to 7e7a6d5 Compare April 11, 2024 21:58
@szehon-ho (Member, Author) commented Apr 11, 2024

Because this problem may affect more than just rewrite_position_deletes (for example, any table that selects the _partition metadata column), I rewrote the patch to make the logic more generic and add support at the Iceberg API level, so it can later be used beyond just PositionDeletesTable.

Key changes:

  • Schema adds a new concept, 'metadataFields' (name suggestions welcome). This allows tables to construct schemas with field ids that 'conflict', as long as the conflicting fields are marked as metadata fields; those field ids are automatically re-assigned to prevent the conflict. There are also helper methods to get the map of reassigned field ids for those cases.
  • PartitionSpec has a method 'originalPartitionType' that returns the partition type with the original field ids.
  • ManifestReader uses the partition spec's originalPartitionType (because partition values are serialized with those ids in the manifest file). It is the only place in the code so far that needs this; the rest of the code uses the partition type with re-assigned field ids.

This change also fixes tests and RewritePositionDeleteFiles to always get the schema/spec from the position_deletes metadata table (which has the re-assigned field ids) instead of the original table, which has the potentially colliding ones.

@szehon-ho (Member, Author)

@rdblue @RussellSpitzer may be interested, can you take a look?

@bk-mz commented Apr 12, 2024

@szehon-ho but this implementation is rather a hack, a workaround around the original design.

Why don't you think that just advancing the table version and switching to a bigger constant for the partition spec offset (or going to negative ids, just to remove the clash) would be cleaner for the implementation?

That way we'd have a clear path forward, marking the old spec for deprecation in future releases and keeping a stable mainline with the actual code.

@szehon-ho (Member, Author) commented Apr 17, 2024

Hi @bk-mz, we discussed this a bit in the last Iceberg community sync. The motivation of this PR is to fix the position_deletes metadata table and to provide tools to fix any other place this comes up (which should be limited to metadata columns or metadata tables). IMO the other fix you mentioned is either:

  1. intrusive, as previously we did not force users to reserve id ranges for partition fields vs regular fields (it is possible today in the spec for a field id to be negative), or
  2. a temporary fix (what happens if a user creates 10k columns?)

@szehon-ho (Member, Author)

Thanks @RussellSpitzer , addressed initial comments

@szehon-ho szehon-ho force-pushed the postition_deletes_id_collision branch from 44159a4 to 552ae60 Compare May 8, 2024 01:36
old: "method void org.apache.iceberg.PositionDeletesTable.PositionDeletesBatchScan::<init>(org.apache.iceberg.Table,\
\ org.apache.iceberg.Schema, org.apache.iceberg.TableScanContext)"
justification: "Removing deprecated code"
"1.5.0":
@szehon-ho (Member, Author)

Looks like adding a new ctor method as suggested changes the serialVersionUID, is that ok? @RussellSpitzer

(Member)

+1 Yep, this would only be a concern if we were worried about folks using different Iceberg versions on client and server, this shouldn't be the case.

@jgprogramming

So I've reviewed this PR, built it, and tested it on our QA table that has close to 30k columns, and it is working perfectly :) I would really be interested in seeing this merged so I don't have to maintain a custom version of the Iceberg lib.

One small note: would it be possible to remove old schema definitions from the metadata file? With frequent schema evolution and a large number of columns, the metadata file grows very quickly. I'm wondering if, during rewrite_manifest, it would make sense to iterate over the used schemas and keep only those? Maybe in another PR?

AtomicInteger nextId = new AtomicInteger();

Type res =
TypeUtil.assignIds(
(Member)

I think this may have some ordering issues... I'm not sure if this is possible, but say I see

transformId = 1000

and then after that I see

columnId = 1000

Won't I still have a problem?

@szehon-ho (Member, Author) Jun 11, 2024

I'm not entirely sure I get your use case. But conceptually we have two lists: idsToReassign and usedIds.

We go through all the fields, and if we find an id to reassign, we just find the next value that is not in usedIds.

Are you asking what happens if idsToReassign = usedIds = (1000)? I think it's ok, it will just pick the next one, 1001 (even though it didn't have to).
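A minimal sketch of that "next value not in usedIds" step (a hypothetical helper, not the PR's actual code):

```java
import java.util.HashSet;
import java.util.Set;

public class ReassignIdSketch {
    // Find the next id at or above the candidate that is not already in use,
    // and record it as used so later reassignments do not pick it again.
    static int reassign(int candidate, Set<Integer> usedIds) {
        int id = candidate;
        while (usedIds.contains(id)) {
            id++;
        }
        usedIds.add(id);
        return id;
    }
}
```

With usedIds = {1000}, reassigning candidate 1000 yields 1001, matching the example above; a candidate that is free is kept as-is.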

@szehon-ho szehon-ho force-pushed the postition_deletes_id_collision branch from 0ebca1e to 02d46c5 Compare June 11, 2024 23:46
// Calculate used ids (for de-conflict)
Set<Integer> currentlyUsedIds =
Collections.unmodifiableSet(TypeUtil.indexById(Types.StructType.of(columns)).keySet());
Set<Integer> usedIds =
@szehon-ho (Member, Author) Jun 11, 2024

@RussellSpitzer I found another issue. In some cases where I removed a column, the deleted column's id was not in 'usedIds'. This led to cases where an id was re-assigned to a deleted column's id, and for old files containing that column I saw bad behavior when it came to pruning. This is the fix.

For this to work, I massaged the Schema API to take a callback directly for reassigning ids. I think it actually simplifies the API a bit, as it removes the concept of 'metadataIds', which was probably confusing anyway.

(Member)

Yep this also fixes the situation I was worried about.

@RussellSpitzer (Member) left a comment

I think we are set here. Remaining items to remember:

  • We still need to resolve this for SELECT *, _partition queries (ok for a followup).
  • I think we should have a straight-up test of the position delete metadata table, so we have a test that is invoked without a Spark change; one that specifically exercises the re-id'ing. I know the current case does, but mostly by accident, since 0 is the default, right?

@szehon-ho (Member, Author)

Thanks, added test

@szehon-ho szehon-ho merged commit b6c949c into apache:main Jun 12, 2024
@szehon-ho (Member, Author)

Thanks @RussellSpitzer for helping get this through the finish line!

PR to fix _partition metadata column collisions to come subsequently

@szehon-ho szehon-ho deleted the postition_deletes_id_collision branch June 12, 2024 23:52
@dramaticlly (Contributor)

> Thanks @RussellSpitzer for helping get this through the finish line!
>
> PR to fix _partition metadata column collisions to come subsequently

#10547 is attempting to fix the read of _partition for table over 1k columns

jasonf20 pushed a commit to jasonf20/iceberg that referenced this pull request Aug 4, 2024
szehon-ho pushed a commit to szehon-ho/iceberg that referenced this pull request Sep 16, 2024
sasankpagolu pushed a commit to sasankpagolu/iceberg that referenced this pull request Oct 27, 2024
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024

Successfully merging this pull request may close these issues.

Calling rewrite_position_delete_files fails on tables with more than 1k columns
