Core: Add estimateRowCount for Files and Entries Metadata Tables #10759

szehon-ho · 2024-07-23T21:16:40Z

Spark queries on metadata tables often check row count estimates repeatedly during the planning phase, for example when picking a join strategy.

By default, this involves an extra call to read the manifest file header, because the default logic in BaseContentScanTask is as follows:

  static long estimateRowsCount(long length, ContentFile<?> file) {
    long[] splitOffsets = splitOffsets(file);
    long splitOffset = splitOffsets != null ? splitOffsets[0] : 0L;
    double scannedFileFraction = ((double) length) / (file.fileSizeInBytes() - splitOffset);
    return (long) (scannedFileFraction * file.recordCount());
  }

However, we actually have this information in the metadata and can just return it directly without reading the file header.

This can save many calls on the Spark driver during planning: one call per manifest file, made serially, so potentially many thousands.

This is useful because many Spark maintenance procedures query these tables (expireSnapshots, removeOrphanFiles).
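
To make the arithmetic of the default estimate concrete, here is a minimal standalone sketch of the same formula applied to plain numbers. The class name and the flattened parameter list are assumptions made for illustration; the snippet quoted in this description takes a ContentFile instead, and reading that file's size and record count is the step that costs an extra call for metadata tables.

```java
public class DefaultRowEstimateSketch {

  // Same arithmetic as the default logic quoted in the description, but with
  // the file's properties passed in as plain numbers (an assumption made so
  // that the sketch runs without Iceberg on the classpath).
  static long estimateRowsCount(
      long length, long fileSizeInBytes, long firstSplitOffset, long recordCount) {
    // Fraction of the file covered by this task, measured from the first split offset.
    double scannedFileFraction = ((double) length) / (fileSizeInBytes - firstSplitOffset);
    // Scale the file's record count by that fraction and truncate to a long.
    return (long) (scannedFileFraction * recordCount);
  }

  public static void main(String[] args) {
    // Task covers the whole 1000-byte file holding 200 records.
    System.out.println(estimateRowsCount(1000L, 1000L, 0L, 200L)); // prints 200
    // Task covers half of the same file.
    System.out.println(estimateRowsCount(500L, 1000L, 0L, 200L)); // prints 100
  }
}
```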

core/src/test/java/org/apache/iceberg/TestMetadataTableScans.java

szehon-ho · 2024-07-24T04:43:53Z

cc @amogh-jahagirdar @RussellSpitzer does it make sense? Thanks

amogh-jahagirdar

@szehon-ho This was a great find, and this improvement makes sense to me! I'll wait before merging in case @RussellSpitzer has any comments.

core/src/test/java/org/apache/iceberg/TestMetadataTableScans.java

RussellSpitzer

Looks good to me! Great find

lurnagao-dahua · 2024-07-26T09:13:15Z

Hi, may I ask why we are not using manifest.addedRowsCount() instead of manifest.addedFilesCount()?
Is it the same for deleted and existing?

szehon-ho · 2024-07-26T15:57:27Z

This is for a metadata table, not a data table. Hence, for a manifest-read task, each manifest's row count is the number of files it references (added + deleted + existing).
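
The sum described in this reply can be sketched as follows. This is an illustration under stated assumptions, not the code merged in this PR: ManifestCounts is a hypothetical stand-in for the per-manifest file counts, and treating a missing count as zero is an assumption of this sketch.

```java
public class ManifestRowCountSketch {

  // Hypothetical holder for the file counts recorded for one manifest.
  // It is NOT the Iceberg ManifestFile interface; it only mirrors the three
  // counts named in the discussion. Boxed Integers allow a count to be absent.
  static final class ManifestCounts {
    final Integer addedFiles;
    final Integer existingFiles;
    final Integer deletedFiles;

    ManifestCounts(Integer addedFiles, Integer existingFiles, Integer deletedFiles) {
      this.addedFiles = addedFiles;
      this.existingFiles = existingFiles;
      this.deletedFiles = deletedFiles;
    }
  }

  // Each file referenced by the manifest becomes one row of the metadata table,
  // so the estimate is added + existing + deleted, with no file read required.
  static long estimatedRowsCount(ManifestCounts manifest) {
    return orZero(manifest.addedFiles)
        + orZero(manifest.existingFiles)
        + orZero(manifest.deletedFiles);
  }

  // Assumption of this sketch: an absent count contributes zero rows.
  private static long orZero(Integer count) {
    return count == null ? 0L : count.longValue();
  }

  public static void main(String[] args) {
    // 3 added + 5 existing + 2 deleted files.
    System.out.println(estimatedRowsCount(new ManifestCounts(3, 5, 2))); // prints 10
  }
}
```

Widening each count to long before adding keeps the sum from overflowing int for very large manifests.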

lurnagao-dahua · 2024-07-27T01:59:07Z

> This is for a metadata table and not data table. Hence for manifest-read task, each manifest's row count is the number of files it references (added + deleted + existing)

Got it, thank you for your answer

Core: Add fileSizeEstimates for Files and Entries Metadata Tables

9ecd6dd

github-actions bot added the core label Jul 23, 2024

amogh-jahagirdar reviewed Jul 23, 2024

core/src/test/java/org/apache/iceberg/TestMetadataTableScans.java (outdated, resolved)

Fix wrong version in test

71d23c1

szehon-ho force-pushed the files_table_estimate branch from 2306f7c to 71d23c1 Compare July 23, 2024 23:31

amogh-jahagirdar approved these changes Jul 24, 2024

core/src/test/java/org/apache/iceberg/TestMetadataTableScans.java (outdated, resolved)

RussellSpitzer approved these changes Jul 24, 2024

Update test

4b0673e

amogh-jahagirdar merged commit 24fb734 into apache:main Jul 24, 2024

jasonf20 pushed a commit to jasonf20/iceberg that referenced this pull request Aug 4, 2024

Core: Implement estimateRowCount for Files and Entries Metadata Tables (apache#10759)

2da108f

zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024

Core: Implement estimateRowCount for Files and Entries Metadata Tables (apache#10759)

969fea2
