[Improvement] ParallelIterable#hasNext submit reading ManifestFile task slowly with DataTableScan#planTasks #3741

@yohengyang

Description

This issue mainly aims to speed up reading the Iterator<FileScanTask> returned by planTasks.

Background

We use Trino to query Iceberg tables, but the same query is sometimes very slow in the scheduling stage.
Using Arthas, we finally traced the slowness to fileScanTasks.hasNext() in io.trino.plugin.iceberg.IcebergSplitSource#getNextBatch.
This method traverses the Iterator<FileScanTask> to build IcebergSplit objects.
image

fileScanTasks is generated by DataTableScan.planTasks. When ParallelIterable.hasNext() is called, it should submit Runnable tasks to the worker pool in parallel, but calling tasks.next() to obtain each task reads the manifest first.
When HDFS performs badly, this step takes too long and stalls task submission.
So it would be better to move the manifest-reading step into the Runnable rather than doing it in Iterables.transform() in ManifestGroup#entries.
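
To illustrate the difference, here is a minimal sketch using only the JDK. It is not Iceberg's actual code: `openManifest`, `openBeforeSubmit`, `openInsideTask` and `newWorkerPool` are hypothetical names, and `openManifest` merely stands in for the blocking ManifestFiles.read() call by reporting which thread performed the "read".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class DeferredManifestOpen {
    // Hypothetical stand-in for the blocking manifest read (assumption, not an Iceberg API).
    // Returns the name of the thread that performed the "read".
    static String openManifest(String path) {
        return Thread.currentThread().getName();
    }

    // Hypothetical helper: a fixed pool whose threads have a recognizable name prefix.
    static ExecutorService newWorkerPool(int size) {
        return Executors.newFixedThreadPool(size, r -> new Thread(r, "manifest-worker"));
    }

    // Pattern described in this issue: the manifest is opened while the task is
    // being built, so the read happens on the submitting thread, one file at a time.
    static List<String> openBeforeSubmit(List<String> paths, ExecutorService pool) throws Exception {
        List<Future<String>> futures = new ArrayList<>();
        for (String path : paths) {
            String openedBy = openManifest(path); // blocks the submitter
            Callable<String> task = () -> openedBy;
            futures.add(pool.submit(task));
        }
        return collect(futures);
    }

    // Proposed pattern: the read is moved inside the task, so submission returns
    // immediately and the reads run on the pool threads.
    static List<String> openInsideTask(List<String> paths, ExecutorService pool) throws Exception {
        List<Future<String>> futures = new ArrayList<>();
        for (String path : paths) {
            Callable<String> task = () -> openManifest(path); // runs on a worker
            futures.add(pool.submit(task));
        }
        return collect(futures);
    }

    // Waits for every task and gathers the recorded thread names in submission order.
    private static List<String> collect(List<Future<String>> futures) throws Exception {
        List<String> results = new ArrayList<>();
        for (Future<String> future : futures) {
            results.add(future.get());
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = newWorkerPool(4);
        List<String> paths = List.of("m1.avro", "m2.avro", "m3.avro");
        System.out.println("opened before submit: " + openBeforeSubmit(paths, pool));
        System.out.println("opened inside task:   " + openInsideTask(paths, pool));
        pool.shutdown();
    }
}
```

In the first variant every read is attributed to the calling thread, so a slow file system delays each submission in turn. In the second, the reads are attributed to the pool threads and can overlap.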

image

Profiling confirms that ManifestFiles.read() and ManifestReader.entries() read the Avro files from HDFS.
image

Resolution

I'm not familiar with Iceberg. I have provided a rewrite idea, which may not be right. I hope everyone can help improve it. Thanks, all!
#3742

Performance Test

Each test runs the same SQL query 100 times (the query reads 47 Avro files) and measures the Trino scheduling time.

image
