
Conversation

@uncleGen
Contributor

In my scenario, there are hundreds of thousands of data files. If iceberg.scan.plan-in-worker-pool is enabled, OOM exceptions happen continually. The root cause is that ParallelIterator uses an unbounded ConcurrentLinkedQueue, which can consume a great deal of memory.

@github-actions github-actions bot added the core label Apr 20, 2022
-      } catch (IOException e) {
-        throw new RuntimeIOException(e, "Failed to close iterable");
+      } catch (IOException | InterruptedException e) {
+        throw new RuntimeException("Failed to close iterable", e);
Contributor


Can we add a test case that goes through this path? I'm not sure if throwing when an InterruptedException occurs is necessarily the correct behavior.

Contributor Author

@uncleGen uncleGen Apr 22, 2022


InterruptedException is thrown by BlockingQueue.put(). I use put() rather than add() because put() waits, if necessary, for space to become available.
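
For anyone following along, here is a minimal, self-contained illustration (not Iceberg code) of that difference on a bounded queue:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class QueueSemanticsDemo {
  public static void main(String[] args) throws InterruptedException {
    // Bounded queue with capacity 2, just for demonstration.
    BlockingQueue<String> queue = new ArrayBlockingQueue<>(2);
    queue.put("task-1");              // succeeds immediately
    queue.put("task-2");              // succeeds immediately
    // queue.put("task-3");           // would block until a consumer frees a slot
    // queue.add("task-3");           // would throw IllegalStateException: Queue full
    System.out.println(queue.take()); // consuming frees space for a blocked producer
  }
}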

Contributor


BTW, I think we shouldn't call Thread.sleep in hasNext if we use a blocking queue. In our tests it's bad for performance when the queue size is small.
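
To illustrate the alternative, a hedged sketch (not the actual ParallelIterator code; the names here are hypothetical) of waiting on the queue with a timed poll() instead of sleeping:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

class TimedPollSketch {
  // Wait for the next element without a busy Thread.sleep() loop.
  static <T> T waitForNext(BlockingQueue<T> queue, BooleanSupplier producersFinished)
      throws InterruptedException {
    while (true) {
      // poll() parks the caller until an element arrives or the timeout expires.
      T next = queue.poll(100, TimeUnit.MILLISECONDS);
      if (next != null) {
        return next;
      }
      if (producersFinished.getAsBoolean() && queue.isEmpty()) {
        return null; // producers are done and nothing is left to consume
      }
    }
  }
}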

Contributor


+1 to not using Thread.sleep() anywhere if at all possible. It's bad for system performance and generally causes instability at any scale. @lirui-apache's results confirm what I've expected / known.

@uncleGen
Contributor Author

uncleGen commented Apr 22, 2022

Let me be clear. In my scenario, my machine has 96 cores. When iceberg.scan.plan-in-worker-pool is enabled, iceberg.worker.num-threads defaults to the number of available processors, i.e. 96. So the queue's input rate is far greater than its output rate, and then OOM occurs.
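
For reference, the default can be overridden through the iceberg.worker.num-threads system property; since the shared worker pool is created once, the property has to be set before any scan planning happens (the value below is illustrative):

public class WorkerPoolConfig {
  public static void main(String[] args) {
    // Equivalent JVM flag: -Diceberg.worker.num-threads=16
    // Must be set before Iceberg creates the shared worker pool.
    System.setProperty("iceberg.worker.num-threads", "16"); // illustrative value
  }
}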

* Sets the size of the queue, which is used to avoid consuming too much memory.
*/
public static final String SCAN_SHARED_QUEUE_SIZE = "iceberg.scan.shared-queue-size";
public static final int SCAN_SHARED_QUEUE_SIZE_DEFAULT = 1000;
Contributor


This seems really small if we're trying to avoid running out of memory. I think it is likely that this will cause a performance bottleneck.

Contributor Author


When using the parallel iterator, BaseFileScanTasks are put into the queue in parallel, while we consume them serially. So it should not cause a performance bottleneck. We can make this size larger, like 10000, at the cost of more memory.

Contributor


The worker pool is already used for a number of things though and is presently uncapped.

Going from unbounded to bounded is itself a potential source of a bottleneck. I don't know what the right value is, but if we're considering this feature, I would say the size itself is the best configuration to use for disabling it.

There could be other scans taking place in general, and file scan tasks are unfortunately somewhat variable in size and can be combined, etc. So I don't disagree with using the larger value (especially if it's configurable).

That said, if we add this feature, we should add a property to enable or disable it.

We could use 0, or any non-positive value, to disable it, which is how the caching catalog's TTL is controlled.

Though catalogs generally have a boolean property to enable / disable whether caching is used at all. So either iceberg.scan.enable-blocking-queue or iceberg.scan.use-shared-queue, or just setting the value to 0 or -1 to disable it entirely.

@rdblue
Contributor

rdblue commented Apr 24, 2022

@uncleGen, what is the average file size in the table, the average number of files per manifest, and how many files is your scan producing? Is there anything consuming from the concurrent queue? This seems odd to me since the parallel iterator is self-limiting. When nothing is consuming from the queue, new tasks won't be submitted.

@uncleGen
Contributor Author

uncleGen commented Apr 25, 2022

@rdblue ParallelIterator uses ConcurrentLinkedQueue, which is an unbounded queue. Sorry, could you please clarify why the parallel iterator is self-limiting? FYI, the average file size in the table is ~1MB, and there are hundreds of thousands of data files in my scan. Following is a JVM heap snapshot:

[JVM heap snapshot screenshots]

@uncleGen
Contributor Author

ping @rdblue

@kbendick
Contributor

Is there a reason that the files have to be 1MB? That’s very very small. You should also consider using Avro for storage at that size. But I think if you compacted files you probably wouldn’t have such issues during scan planning.

For ingest, sometimes we can't avoid files that small, I know. But scanning that many small files is counterintuitive for most cases. Ideally, files that small (possibly coming from a 3rd party or a very heavily sharded system) would then be ingested into a table that is more ideally tailored to your query needs.

Do you run table maintenance actions on your table @uncleGen?

@uncleGen
Contributor Author

uncleGen commented May 16, 2022

@kbendick

Is there a reason that the files have to be 1MB?

There is a streaming ingest job; it creates many small files.

But I think if you compacted files you probably wouldn’t have such issues during scan planning.
Do you run table maintenance actions on your table?

I run the rewrite-files action once a day. Ideally, reducing the interval may bypass this issue, but doing so requires carefully tuning the JVM heap configuration for the existing small files.

@szehon-ho
Member

szehon-ho commented May 24, 2022

Took a little look through the code,

ParallelIterable.hasNext() calls checkTasks(), which then calls submitNextTask(), but only up to the taskFutures size (the configured thread pool size).

submitNextTask() is the one that calls tasks.next(), which adds things to the queue. I guess that's what @rdblue means by it being self-limiting.

The problem may be that each tasks.next() dumps the entire nested Iterable into the queue.

So in the real-life planning case, it seems the level of parallelism is the manifest. New manifests may be blocked from being processed if nobody is consuming, but a single manifest (or a few) with a lot of entries may still cause memory pressure as it dumps all of its entries into the queue. @lirui-apache @uncleGen is that the case? Probably the fix is to rewriteManifests so that each manifest has a more even number of entries?
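
For readers less familiar with the code, a heavily simplified sketch of the producer side just described (not the actual Iceberg source; names and structure are abbreviated) shows why the limit is per manifest rather than per entry:

import java.util.Iterator;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelIterableSketch<T> {
  private final Queue<T> queue = new ConcurrentLinkedQueue<>(); // unbounded
  private final ExecutorService pool = Executors.newFixedThreadPool(4);
  private final Future<?>[] taskFutures = new Future<?>[4];     // one slot per worker

  // Called from hasNext(): only refill slots whose previous task has finished,
  // so at most taskFutures.length manifests are in flight at once.
  void checkTasks(Iterator<Iterable<T>> manifests) {
    for (int i = 0; i < taskFutures.length; i++) {
      if ((taskFutures[i] == null || taskFutures[i].isDone()) && manifests.hasNext()) {
        Iterable<T> manifest = manifests.next();
        taskFutures[i] = pool.submit(() -> {
          for (T entry : manifest) {
            queue.add(entry); // a whole manifest's entries land in the queue at once
          }
        });
      }
    }
  }
}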

I guess a blocking queue may be useful and gives us another knob to limit memory, but then it's not truly parallel by manifest, i.e. some manifests may be blocked for a long time before they can submit their entries to the queue. @rdblue let me know if you have any thoughts or if I may have misread the code.

@rdblue
Contributor

rdblue commented May 25, 2022

@szehon-ho, that sounds correct to me. I'm reluctant to block the queue because that introduces another bottleneck in cases where you probably don't want one.

@lirui-apache
Contributor

Hey @szehon-ho, your understanding of the issue is correct. We did some tests iterating all manifest entries and computing aggregated stats for each partition. We tried various queue sizes ranging from 5 to 10000. In our tests the consumer is pretty fast, and even the smallest queue doesn't affect the e2e latency of the job. The result might be different in other use cases where the consumer is not fast enough, but my hunch is that such job latency is bounded by the consumer anyway.
One problem I can think of is when we plan files for multiple tables concurrently: if one of the consumers is slow, it might block all the threads in the thread pool and prevent other jobs from making progress. We're investigating how to limit the resources used by each job.

In a production env, we do have a background service to rewrite manifests periodically. But such optimization is asynchronous, which means that if users query the table before the rewrite is done, it can still cause OOM.

@rice668

rice668 commented May 25, 2022

Is there anything worse than OOM in a production env if we do not use a blocking queue? We may be able to use another solution to limit the flow, rather than a blocking queue that may cause another bottleneck.

@szehon-ho
Member

szehon-ho commented May 25, 2022

@rdblue @lirui-apache thanks for confirming. So I think using rewriteManifests and setting commit.manifest.target-size-byte to a reasonable size (so you have an even number of entries per manifest file), along with the system property iceberg.worker.num-threads to control how many manifests are read at once, may help. I wonder if that will solve the memory problem?

So overall, I would double-check how many entries you have per manifest, because it looks like it's parallelizing at the level of manifests.
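
For concreteness, a hedged sketch of that tuning, assuming the Spark actions API; the property value and thread count are illustrative, and the exact property name should be double-checked against TableProperties:

import org.apache.iceberg.Table;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

class ManifestTuningSketch {
  static void tune(SparkSession spark, Table table) {
    // Aim for smaller, more evenly sized manifests (value is illustrative).
    table.updateProperties()
        .set("commit.manifest.target-size-bytes", String.valueOf(8 * 1024 * 1024))
        .commit();

    // Rewrite existing manifests so the new target size takes effect.
    SparkActions.get(spark)
        .rewriteManifests(table)
        .execute();

    // To limit how many manifests are read in parallel during planning,
    // set the system property before the worker pool is created:
    //   -Diceberg.worker.num-threads=8
  }
}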

@rdblue
Contributor

rdblue commented May 25, 2022

Good idea setting the number of threads. It could easily be that you're processing too many manifests in parallel because of the number of threads.

@lirui-apache
Contributor

Let me clarify our use case. We have an iceberg table partitioned by date, and we run an ETL job every day to sync data from an upstream hive table into this iceberg table. The ETL job basically just runs an INSERT INTO with SparkSQL which adds a new partition to the iceberg table. So we end up having a manifest for each partition, and each partition has lots of data files, i.e. ranging roughly from 50k to 130k.
Then we have a trino cluster where users submit ad-hoc queries. This cluster is multi-tenant and not just meant for the iceberg table mentioned above.
The problem we faced was that querying this huge iceberg table can easily make the trino coordinator unstable or even crash with OOM.

I'll check whether commit.manifest.target-size-byte can mitigate our case. But IIUC, the iceberg worker pool is static and shared among all jobs. So we probably won't want to change the pool size.

@lirui-apache
Contributor

@szehon-ho Do you have an example in mind where blocking queue can hurt performance? IMHO, if a blocking queue blocks the producers, it usually means the consumer is not fast enough, in which case the bottleneck is the consumer, rather than the queue itself.

@rdblue
Contributor

rdblue commented May 26, 2022

@lirui-apache, another option here is to add the ability to set parallelism on a per-table basis. Basically to prevent the parallel iterator from submitting so many tasks at once. That may help your situation.

That said, I still think the main problem is that your partitions have 50-130k files. Have you tried compacting those?

@lirui-apache
Contributor

@rdblue Thanks for the advice. Actually it's not a small-file problem. Each partition has over 200 billion records. We do have optimizations to make sure each query only scans a small portion. But it cannot help if we hit OOM at the planning phase.

I also noticed the latest code supports planning files with separate pools. So with separate pools + limited manifest entries + limited pool size, I think we can bring the memory usage under control. Although personally I still prefer the blocking queue solution, which seems easier to achieve and more reliable.

@rdblue
Contributor

rdblue commented May 26, 2022

@lirui-apache, the files should be added to the queue after being filtered. Won't you need to hold all these files in memory at some point anyway? Or are you avoiding that somehow?

@szehon-ho
Member

szehon-ho commented May 26, 2022

Yeah, good idea adding plan parallelism per table if it's not there already (it might be in Flink but not Spark).

Yeah, also curious: in Spark it seems the FileScanTask iterator ends up being collected into a concrete array, though I guess it's possible in custom code to consume the FileScanTasks directly in a streaming fashion.

@kbendick
Contributor

kbendick commented May 26, 2022

@kbendick

Is there a reason that the files have to be 1MB?

There is a streaming ingest job; it creates many small files.

But I think if you compacted files you probably wouldn’t have such issues during scan planning.
Do you run table maintenance actions on your table?

I run the rewrite-files action once a day. Ideally, reducing the interval may bypass this issue, but doing so requires carefully tuning the JVM heap configuration for the existing small files.

Even for streaming ingest, there are ways to mitigate the small-files problem. For example, if you're ingesting with Flink, you can increase the time between commits. Or you can play with write.distribution.mode to try to send data to the correct writer before generating the files, which might reduce the overall number of files you have.

But if files are only 1MB, even if there's a large amount of data per partition, that is IMHO by definition going to run into many of the common issues encompassed by the "small files problem".

@kbendick
Contributor

Also, for compacting small files (as well as avoiding small files in ingest), I just wrote a somewhat in-depth summary in this issue that might be relevant to your streaming ingest (and data file rewrite operations): #4875

@lirui-apache
Contributor

@rdblue We implemented a file-level index for iceberg and use the index to determine whether a file satisfies the query predicate. Although the index file is much smaller than the data file, we wouldn't want to load them in the trino coordinator, so the filtering happens on worker nodes. The trino coordinator schedules the file scan tasks in a streaming fashion, so it doesn't have to hold all of them in memory (it does do that during query optimization, though, which we are trying to avoid).

@openinx
Member

openinx commented Jun 2, 2022

Actually, I'd prefer to give my +1 to the bounded queue solution. Because:

  1. If there is an existing table that already includes too many manifests (and some of them have many manifest entries), then the other approaches just won't work (such as merging metadata so that manifests are an ideal size, tuning the thread count, etc.). We can do nothing in a real production environment unless we increase the heap size of the spark driver or trino coordinator. But what if we are not allowed to restart the spark driver & trino coordinator because of the other query jobs they are serving?

  2. Does the blocking queue approach introduce any substantial performance bottleneck? If we think the default blocking queue size is a bit small, then we can increase the default to 10000. I think most cases won't be affected by the default blocking queue size, unless we have an extremely large table with very many manifest entries. But in that case it seems easy to hit OOM if we don't have any limit on the queue size.

@hililiwei
Contributor

Agree with @openinx; also +1 to the bounded queue solution. I was in the same predicament.

@uncleGen
Contributor Author

uncleGen commented Jun 7, 2022

IMHO, if a blocking queue blocks the producers, it usually means the consumer is not fast enough, in which case the bottleneck is the consumer, rather than the queue itself.

Does the blocking queue approach introduce any substantial performance bottleneck ? If we think the default blocking queue size is a bit small, then we can increase this default blocking queue size to 10000.

IIUC, a blocking queue should not introduce any bottleneck. Besides, if a blocking queue can resolve the OOM issue, it will be better and easier than requiring additional tuning work from developers.

@kbendick
Contributor

kbendick commented Jun 7, 2022

I do see the points you raise, and I have admittedly used a BlockingQueue in similar situations in large-scale streaming ETL in the past (thinking of manifests in the sort of "envelope" sense, i.e. manifest lists and even just the overall snapshot change set). A size of about 10k per worker served me well (though I would defer to others, as in that case I was streaming kinesis shards, which aren't entirely unlike a manifest list given their potentially variable size and the need to rate limit).

Can we introduce a configuration parameter for the blocking queue size? If we give the parameter a negative value, ideally the user would not use a queue but would keep the current behavior (similar to how CachingCatalog's cache expiration interval in milliseconds disables the cache when negative). This way the user can avoid the queue or keep the old behavior, while users interested in trying the BlockingQueue approach can do so (it has served me well, particularly in streaming scenarios where the pipeline cannot stop).
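
For concreteness, a minimal sketch of that idea, reusing the iceberg.scan.shared-queue-size property from the diff above (this is illustrative, not the actual PR code): a non-positive size keeps today's unbounded behavior, a positive size switches to a bounded blocking queue.

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedBlockingQueue;

class ScanQueueFactory {
  static final String SHARED_QUEUE_SIZE = "iceberg.scan.shared-queue-size";

  static <T> Queue<T> newScanQueue() {
    int size = Integer.getInteger(SHARED_QUEUE_SIZE, -1);
    if (size > 0) {
      return new LinkedBlockingQueue<>(size); // bounded: producers block when full
    }
    return new ConcurrentLinkedQueue<>();     // non-positive: current unbounded behavior
  }
}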

This is also not unlike the streamResults parameter for some Spark driver-side operations, or akin to whether or not the worker thread pool is used, in my opinion. So if we configure it in terms of the queue size, that's the trade-off users can most easily tune.

Would this be achievable?

@lirui-apache
Contributor

Can we introduce a configuration parameter for the blocking queue size? If we give the parameter a negative value, ideally the user would not use a queue but would keep the current behavior

+1, this is exactly how we implemented internally.

@kbendick
Contributor

Let me be clear. In my scenario, my machine has 96 cores. When iceberg.scan.plan-in-worker-pool is enabled, iceberg.worker.num-threads defaults to the number of available processors, i.e. 96. So the queue's input rate is far greater than its output rate, and then OOM occurs.

Forgive me if I missed this as I'm just coming back to it, but what happens if you lower the thread count used by iceberg.scan.plan-in-worker-pool via iceberg.worker.num-threads?

I've heard of other issues with machines with very large numbers of cores.

Maybe, possibly in addition to this, we might consider a configurable max limit and default it to something sane? The other report was from a Spark driver, IIRC.

}

public static int getInt(String systemProperty, int defaultValue) {
Preconditions.checkNotNull(systemProperty, "System property name should not be null");
Contributor

@kbendick kbendick Jun 23, 2022


General nit / FYI - we almost always tend to prefer plain English with a structure similar to "general type of problem / problem phrase [possible solution only if not clear from stack trace]: offending value".

We put the bad value at the end, after a colon, to give us a more continuous search string for logs (instead of mixing it into the sentence).

So for this situation, the suggested message would be format("Invalid value for system property %s: null", systemProperty);

However, for this we can be more specific, as null here almost always means missing / not set (and anybody who set it to null will see that). So I'd suggest format("Invalid value for system property %s: null", systemProperty); The final ": null" is debatable, but I'd personally put it just to cover any cases where it might be explicitly null and to match many other preconditions.

Yours is pretty good, but this tends to be our standard. I think it's most helpful to think of it in the context of the NPE or IllegalArgumentException and the full stack trace. What's invariably helpful is having a common-ish way of writing these, with a long enough, specific enough message to be searchable, with less and less specific search queries as the phrase is shortened.

Hope that helps for the long term! It really is just a nit, but if you're working on Iceberg often enough it might help you as a contributor and a user.

Suggested change
Preconditions.checkNotNull(systemProperty, "System property name should not be null");
Preconditions.checkNotNull(systemProperty, String.format("Invalid value for system property %s: null", systemProperty));

Contributor

@kbendick kbendick left a comment


I think the better solution is possibly to cap the worker pool size on these 96-core (or otherwise very high core count) machines. I've heard of other issues, and maybe at a certain point it's best to use 3/4 of the number of cores as the thread ceiling, which might resolve general issues with input and output speed without having to add the overhead of bounded queues in multiple places for a relatively uncommon configuration - most often I'm using more, smaller machines. But I have worked with very, very large machines before and I understand it's a special kind of situation, as performance variances can cause problems.

But I do think we should consider something, though switching the queue out only when requested seems reasonable for now, as these very high core count machines (128, 96, etc.) are generally not the norm. They do benefit from restricting work sometimes, as differences in performance become more apparent at scale.

@kbendick
Contributor

Also @uncleGen,

You might be interested in these two PRs, which might help your use case (increasing the output speed of the queue): - #4911 (merged in master and will be released in 0.14.0)

@lirui-apache
Contributor

I also want to mention that if close and hasNext in ParallelIterator are called concurrently, there might be thread-safety issues, e.g. new tasks submitted during close can be leaked in the pool. I think it's safer to make the producer thread periodically check the closed flag, rather than block on the queue forever.
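
A hedged sketch of that safer producer loop (illustrative only, not the actual patch): use offer() with a timeout so the producer periodically rechecks the closed flag instead of blocking indefinitely in put().

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class ProducerSketch {
  // Enqueue an element, but give up if the iterator is closed while we wait.
  static <T> boolean enqueue(BlockingQueue<T> queue, T element, AtomicBoolean closed)
      throws InterruptedException {
    while (!closed.get()) {
      // offer() waits up to the timeout for space, then lets us recheck 'closed'.
      if (queue.offer(element, 100, TimeUnit.MILLISECONDS)) {
        return true;
      }
    }
    return false; // closed while waiting; the caller should stop producing
  }
}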

@rdblue
Contributor

rdblue commented Jun 24, 2022

@lirui-apache, do you want to open a PR for that fix?

@lirui-apache
Contributor

@rdblue The issue I mentioned is only critical with a blocking queue, because the producer thread can block forever, and if the pool is full of such threads, no new tasks can run. If we have reached consensus to use a blocking queue, I can submit a PR for it.
