
parquet::column::reader::GenericColumnReader::skip_records still decompresses most data #6454

Open
samuelcolvin opened this issue Sep 25, 2024 · 8 comments
Labels: enhancement (Any new improvement worthy of a entry in the changelog)

@samuelcolvin (Contributor)

Describe the bug

I noticed this while investigating apache/datafusion#7845 (comment).

The suggestion from @jayzhan211 and @alamb was that setting datafusion.execution.parquet.pushdown_filters to true should improve the performance of queries like this, but it seems to make them slower.

I think the reason is that data is being decompressed twice (or data is being decompressed that shouldn't be); here's a screenshot from samply running on this code:

[screenshot: samply flamegraph of the reproducer]

(You can view this flamegraph properly here)

You can see that there are two blocks of decompression work; the second one is associated with parquet::column::reader::GenericColumnReader::skip_records and happens after the first decompression chunk, once running the query has completed.

In particular you can see that there's a read_new_page() call in parquet::column::reader::GenericColumnReader::skip_records (line 335) that's taking a lot of time:

[screenshot: expanded flamegraph frame showing read_new_page() inside skip_records]

My question is: could this second round of decompression be avoided?

To Reproduce

Clone https://github.com/samuelcolvin/batson-perf, comment out one of the modes, compile with profiling enabled (cargo build --profile profiling), then run under samply (samply record ./target/profiling/batson-perf).

Expected behavior

I would expect setting datafusion.execution.parquet.pushdown_filters to true to be faster; I think the reason it's not is that the data is being decompressed twice.
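For context, a minimal sketch of how that flag can be enabled programmatically in DataFusion (API names such as SessionConfig::set_bool and SessionContext::new_with_config are from recent DataFusion versions and may differ; the table name, file path, and predicate below are placeholders, not the batson-perf reproducer):

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Push filters down into the Parquet scan so they can drive late materialization.
    let config = SessionConfig::new()
        .set_bool("datafusion.execution.parquet.pushdown_filters", true);
    let ctx = SessionContext::new_with_config(config);

    // Placeholder table and predicate, standing in for the actual reproducer.
    ctx.register_parquet("t", "data.parquet", ParquetReadOptions::default())
        .await?;
    ctx.sql("SELECT count(*) FROM t WHERE some_text_column LIKE '%needle%'")
        .await?
        .show()
        .await?;
    Ok(())
}
```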

Additional context

apache/datafusion#7845 (comment)

@tustvold (Contributor)

Have you enabled the page index?

@tustvold tustvold added enhancement Any new improvement worthy of a entry in the changelog and removed bug labels Sep 25, 2024
@etseidl (Contributor)

etseidl commented Sep 25, 2024

> Have you enabled the page index?

Indeed. Or enabled V2 page headers? The issue seems to be that when skipping rows (skip_records defines a record as rep_level == 0, so a row), the number of rows per page is not known in advance, so to figure out the number of levels to skip, the repetition levels need to be decoded for every page. For V1 pages, unfortunately, the level information is compressed along with the page data, so the entire page needs decompressing to calculate the number of rows. If either V2 page headers or the page index were enabled, then the number of rows per page would be known without having to do any decompression, so entire pages could be skipped with very little effort (the continue at L330 above).
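To make those two mitigations concrete, here is a minimal sketch using the parquet crate's public APIs (exact names may vary slightly by version): asking the reader to load the page index, and writing files with V2 data page headers so each page records its own row count:

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};
use parquet::file::properties::{WriterProperties, WriterVersion};

fn read_with_page_index(path: &str) -> Result<(), Box<dyn std::error::Error>> {
    // Ask the reader to load the page index (offset/column index), so the number
    // of rows per page is known up front and whole pages can be skipped without
    // decompressing them.
    let options = ArrowReaderOptions::new().with_page_index(true);
    let reader =
        ParquetRecordBatchReaderBuilder::try_new_with_options(File::open(path)?, options)?
            .build()?;
    for batch in reader {
        let _batch = batch?;
    }
    Ok(())
}

// When writing, V2 data page headers carry the per-page row count, so
// skip_records can skip whole pages even without a page index.
fn v2_writer_props() -> WriterProperties {
    WriterProperties::builder()
        .set_writer_version(WriterVersion::PARQUET_2_0)
        .build()
}
```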

I don't think pages are decompressed twice... it's just a result of the two paths through ParquetRecordBatchReader::next (call skip_records until enough records have been skipped, then switch over to read_records).

@alamb (Contributor)

alamb commented Sep 25, 2024

I think the documentation on https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.RowFilter.html is also instructive. Even if all the decoding is working properly, I think the arrow reader may well decode certain pages twice. It is one of my theories about why pushing filters down doesn't always make things faster, but I have not had time to look into it in more detail.

@tustvold (Contributor)

See also #5523

Although I suspect in this case the issue is a lack of page index information for whatever reason

@tustvold (Contributor)

tustvold commented Oct 9, 2024

I've taken a look at the reproducer linked from apache/datafusion#7845 (comment) and I'm not sure that predicate pushdown is going to be helpful here.

The query is SELECT count(*) where json_contains(...), which implies:

  • Page index pushdown won't be effective as json_contains cannot be evaluated against the page index statistics
  • Late materialization, i.e. RowFilter, won't be advantageous as there is nothing to late materialize

What follows are some avenues to explore

Potentially Incorrect DF ProjectionMask

However, there is something fishy here. Given the query doesn't actually request any columns, the final ProjectionMask should be empty. As we have a single predicate, we would therefore expect it to never perform record skipping at all - it would evaluate the predicate to compute the RowSelection, and then just use EmptyArrayReader to materialize this.

The trace, however, suggests DF is requesting columns in the final projection. I wonder if DF requests filter columns in the projection mask even when the filter that needs them has been pushed down? This is probably something that could/should be fixed.
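For illustration, a sketch of what the "empty final projection plus pushed-down predicate" shape looks like at the parquet crate level. The ProjectionMask/ArrowPredicateFn/RowFilter types are the crate's public arrow_reader interfaces, but the leaf column index, the contains-"needle" closure, and the file path are placeholders standing in for the real json_contains predicate:

```rust
use std::fs::File;

use arrow::array::{Array, BooleanArray, StringArray};
use parquet::arrow::arrow_reader::{ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter};
use parquet::arrow::ProjectionMask;

fn count_matching(path: &str) -> Result<usize, Box<dyn std::error::Error>> {
    let builder = ParquetRecordBatchReaderBuilder::try_new(File::open(path)?)?;

    // The predicate only needs the (placeholder) leaf column 0; the final
    // projection selects no columns at all, which is what a bare count(*)
    // with a fully pushed-down filter should look like.
    let predicate_mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);
    let final_mask = ProjectionMask::leaves(builder.parquet_schema(), []);

    let predicate = ArrowPredicateFn::new(predicate_mask, |batch| {
        // Placeholder for the real json_contains(...) predicate.
        let col = batch
            .column(0)
            .as_any()
            .downcast_ref::<StringArray>()
            .expect("expected a string column");
        Ok(BooleanArray::from_iter((0..col.len()).map(|i| {
            Some(!col.is_null(i) && col.value(i).contains("needle"))
        })))
    });

    let reader = builder
        .with_projection(final_mask)
        .with_row_filter(RowFilter::new(vec![Box::new(predicate)]))
        .build()?;

    // With an empty projection the resulting batches carry only a row count,
    // so no column data should need materializing after the predicate runs.
    let mut rows = 0;
    for batch in reader {
        rows += batch?.num_rows();
    }
    Ok(rows)
}
```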

Adaptive Predicate Pushdown

Currently all RowFilter and RowSelection provided are used to perform late materialization. However, the only scenario where it actually makes sense to do this is:

  • The predicate is highly selective and the results are clustered
  • The other columns in the projection are comparatively more expensive to materialize than those required to evaluate the predicate

The question then becomes what makes this judgement; currently I believe DF pushes down everything that it can. #5523 proposes adding some logic to the parquet reader to accept the pushed-down predicate but choose not to use it for late materialization. This would have the effect of making pushing down a predicate no worse than not pushing it down, but it would not improve the best-case performance.
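As a toy illustration of the kind of judgement involved (entirely hypothetical; nothing like this exists in either crate), a decision helper might only opt into late materialization when the two conditions above plausibly hold:

```rust
/// Hypothetical per-predicate estimates an adaptive reader might compute.
/// This only illustrates the judgement described above.
pub struct PushdownEstimate {
    /// Estimated fraction of rows the predicate keeps (0.0..=1.0).
    pub selectivity: f64,
    /// Rough cost of decoding the columns the predicate reads.
    pub predicate_cost: f64,
    /// Rough cost of decoding the remaining projected columns.
    pub projection_cost: f64,
}

/// Toy rule of thumb: only use the pushed-down predicate for late
/// materialization when it is selective and the other columns are
/// comparatively expensive to decode. (Clustering of the surviving rows,
/// the other condition above, is not modelled here.)
pub fn use_late_materialization(est: &PushdownEstimate) -> bool {
    est.selectivity < 0.1 && est.projection_cost > 2.0 * est.predicate_cost
}
```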

Cache Decompressed Pages

Currently, when evaluating a predicate, the decompressed pages are not kept around, even if those columns are to be used again in another predicate or in the final projection. Keeping them around would have the advantage of saving CPU cycles at the cost of potentially significant additional memory usage, especially if the predicate is very selective.
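Purely to illustrate the trade-off (nothing like this exists in the parquet crate today; the type and its keying are hypothetical), a decompressed-page cache could be as simple as a map keyed by column and page offset:

```rust
use std::collections::HashMap;

use bytes::Bytes;

/// Hypothetical cache of decompressed page data, keyed by (column index,
/// page byte offset). It only sketches the trade-off discussed above:
/// fewer decompression passes in exchange for holding pages in memory.
#[derive(Default)]
pub struct DecompressedPageCache {
    pages: HashMap<(usize, u64), Bytes>,
}

impl DecompressedPageCache {
    /// Return the cached page, decompressing and retaining it on a miss.
    pub fn get_or_insert_with(
        &mut self,
        column: usize,
        offset: u64,
        decompress: impl FnOnce() -> Bytes,
    ) -> Bytes {
        self.pages
            .entry((column, offset))
            .or_insert_with(decompress)
            .clone() // Bytes clones are cheap reference-counted handles
    }

    /// The memory cost is the sum of all retained pages, which is what makes
    /// this risky for large row groups.
    pub fn memory_used(&self) -> usize {
        self.pages.values().map(|b| b.len()).sum()
    }
}
```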

Cache Filtered Columns

A potentially better option would be to retain the filtered columns for later use; however, aside from being quite complex to implement, this still runs the risk of blowing the memory budget.

Lazy Predicate Evaluation

The problem with both of the above caching strategies is that predicates are completely evaluated up-front. Whilst this makes the code much simpler, especially in async contexts, it has the major drawback that any caching strategy has to potentially retain a huge amount of decoded data. If instead we evaluated the filters incrementally, we would be able to yield batches as we went.

The one subtlety concerns object stores, where interleaving I/O in the same way is likely to seriously hurt performance. We may need to no longer perform I/O pushdown based on RowFilter, and only use RowSelection to drive this.

This would also improve the performance of queries with limits that can't be fully pushed down. I will try to find some time to work on this over the next few days.

@XiangpengHao (Contributor)

> Cache Decompressed Pages
> Currently, when evaluating a predicate, the decompressed pages are not kept around, even if those columns are to be used again in another predicate or in the final projection. Keeping them around would have the advantage of saving CPU cycles at the cost of potentially significant additional memory usage, especially if the predicate is very selective.

FWIW, I quantified the memory usage vs query time with ClickBench query 21:

The query:

SELECT "SearchPhrase", MIN("URL"), COUNT(*) AS c FROM hits WHERE "URL" LIKE '%google%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10;

"Cache Arrow" essentially means caching the decompressed (and decoded) data, thus avoiding decoding the Parquet data twice.

[chart: query time vs. memory usage]

As @tustvold predicted, we get roughly 4x better performance for roughly 4x more memory usage.

@XiangpengHao (Contributor)

> Adaptive Predicate Pushdown

👍 on this.

I happened to notice that each intersect/union of RowSelection takes more than 1ms with Q21 (100k selectors). I'm trying to see if I can make the row selection faster, and I'm also investigating just using a boolean array to represent the selection.
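For context, a small sketch of the operation being measured and of the boolean-mask representation under consideration, using the crate's public RowSelection/RowSelector API (the row counts here are made up):

```rust
use arrow::array::BooleanArray;
use parquet::arrow::arrow_reader::{RowSelection, RowSelector};

fn main() {
    // Two selections over the same 1000 rows, as two predicates might produce.
    let a = RowSelection::from(vec![RowSelector::select(100), RowSelector::skip(900)]);
    let b = RowSelection::from(vec![RowSelector::skip(50), RowSelector::select(950)]);

    // The combining step that shows up in the Q21 profile when a selection is
    // made up of ~100k selectors.
    let combined = a.intersection(&b);

    // The alternative representation being considered: one bit per row.
    let bits: Vec<bool> = combined
        .iter()
        .flat_map(|s| std::iter::repeat(!s.skip).take(s.row_count))
        .collect();
    let mask = BooleanArray::from(bits);
    assert_eq!(mask.true_count(), 50); // rows 50..100 survive both selections
}
```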

@alamb (Contributor)

alamb commented Oct 22, 2024

> The trace, however, suggests DF is requesting columns in the final projection. I wonder if DF requests filter columns in the projection mask even when the filter that needs them has been pushed down? This is probably something that could/should be fixed.

I agree this should be fixed. I will try to file a ticket to investigate.
