@szehon-ho
Member

Spark queries on metadata tables often check row count estimates repeatedly during the planning phase, for example when picking a join strategy.

By default, each check involves an extra call to read the manifest file header, because the default logic in BaseContentScanTask is as follows:

  static long estimateRowsCount(long length, ContentFile<?> file) {
    long[] splitOffsets = splitOffsets(file);
    long splitOffset = splitOffsets != null ? splitOffsets[0] : 0L;
    double scannedFileFraction = ((double) length) / (file.fileSizeInBytes() - splitOffset);
    return (long) (scannedFileFraction * file.recordCount());
  }
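
To make the arithmetic concrete, here is a minimal sketch of the same proportional estimate using plain long values in place of a ContentFile. The class name RowEstimateSketch and the flattened parameters are assumptions made for illustration; they are not part of the codebase.

```java
// Illustrative only: mirrors the arithmetic of the snippet above with plain
// longs. The class name and parameter list are hypothetical.
final class RowEstimateSketch {
  private RowEstimateSketch() {}

  static long estimateRowsCount(
      long length, long fileSizeInBytes, long firstSplitOffset, long recordCount) {
    // Fraction of the readable part of the file covered by this task.
    double scannedFileFraction = ((double) length) / (fileSizeInBytes - firstSplitOffset);
    // Scale the file's total record count by that fraction.
    return (long) (scannedFileFraction * recordCount);
  }

  public static void main(String[] args) {
    // Half of a 1,024-byte file that holds 1,000 records.
    System.out.println(estimateRowsCount(512L, 1024L, 0L, 1000L)); // prints 500
    // The whole file.
    System.out.println(estimateRowsCount(1024L, 1024L, 0L, 1000L)); // prints 1000
  }
}
```

The estimate depends on both the file size and the record count, so any input that is not already at hand has to be fetched before the formula can be evaluated.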

However, this information is already available in the metadata, so we can return it directly without reading the file header.

This can save many calls on the Spark driver during planning (one call per manifest file, made serially, so potentially many thousands).

This is useful because many Spark maintenance procedures query these tables (expireSnapshots, removeOrphanFiles).

@github-actions github-actions bot added the core label Jul 23, 2024
@szehon-ho szehon-ho force-pushed the files_table_estimate branch from 2306f7c to 71d23c1 Compare July 23, 2024 23:31
@szehon-ho
Member Author

cc @amogh-jahagirdar @RussellSpitzer does it make sense? Thanks

Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

@szehon-ho This was a great find, and the improvement makes sense to me! I'll wait before merging in case @RussellSpitzer has any comments.

Member

@RussellSpitzer RussellSpitzer left a comment

Looks good to me! Great find

@amogh-jahagirdar amogh-jahagirdar merged commit 24fb734 into apache:main Jul 24, 2024
@lurnagao-dahua
Contributor

lurnagao-dahua commented Jul 26, 2024

Hi, may I ask why we are not using manifest.addedRowsCount() instead of manifest.addedFilesCount()?
Is it the same for deleted and existing?

@szehon-ho
Member Author

This is for a metadata table, not a data table. Hence, for a manifest-read task, each manifest's row count is the number of files it references (added + deleted + existing).
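
As a rough sketch of that rule, assuming the three per-manifest file counts are available as plain integers: the ManifestCounts and ManifestRowEstimate classes below are hypothetical stand-ins written for this example, not the classes changed in this PR.

```java
// Hypothetical stand-in for the per-manifest file counts kept in table
// metadata. Field names are illustrative, not taken from the Iceberg API.
final class ManifestCounts {
  final int addedFiles;
  final int existingFiles;
  final int deletedFiles;

  ManifestCounts(int addedFiles, int existingFiles, int deletedFiles) {
    this.addedFiles = addedFiles;
    this.existingFiles = existingFiles;
    this.deletedFiles = deletedFiles;
  }
}

final class ManifestRowEstimate {
  private ManifestRowEstimate() {}

  // Each file entry in a manifest becomes one row when the manifest is read
  // for a metadata table, so the estimate is the sum of the three counts.
  static long estimatedRowsCount(ManifestCounts counts) {
    // Widen to long before adding so large counts cannot overflow int.
    return (long) counts.addedFiles + counts.existingFiles + counts.deletedFiles;
  }

  public static void main(String[] args) {
    ManifestCounts counts = new ManifestCounts(3, 5, 2);
    System.out.println(estimatedRowsCount(counts)); // prints 10
  }
}
```

Because the counts are file counts rather than data row counts, the result describes how many rows the metadata table scan produces for that manifest, not how many records the referenced data files contain.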

@lurnagao-dahua
Contributor

> This is for a metadata table, not a data table. Hence, for a manifest-read task, each manifest's row count is the number of files it references (added + deleted + existing).

Got it, thank you for your answer

jasonf20 pushed a commit to jasonf20/iceberg that referenced this pull request Aug 4, 2024
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024