@frankliee
Contributor

Fixes a bug that could leak Hive client connections.

By default, a Caffeine cache does not always invoke the removal listener promptly after an entry expires, so an extra scheduler is required to trigger it.

See: google/guava#3295.
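
The behavior described above can be illustrated with a minimal, JDK-only sketch. It is not Caffeine's implementation: `LazyExpiringCache`, its injectable clock, and `scheduleCleanUp` are hypothetical names introduced only for illustration. The sketch assumes a cache that expires entries only as a side effect of reads and writes, so an idle cache never notifies its removal listener unless something calls `cleanUp()` on a timer.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.LongSupplier;

// Hypothetical stand-in for an expire-after-access cache. NOT Caffeine's implementation.
class LazyExpiringCache<K, V> {
  private static final class Entry<V> {
    final V value;
    volatile long lastAccessNanos;

    Entry(V value, long now) {
      this.value = value;
      this.lastAccessNanos = now;
    }
  }

  private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
  private final long ttlNanos;
  private final LongSupplier clock; // injectable so expiry can be exercised without sleeping
  private final Consumer<V> removalListener;

  LazyExpiringCache(long ttlNanos, LongSupplier clock, Consumer<V> removalListener) {
    this.ttlNanos = ttlNanos;
    this.clock = clock;
    this.removalListener = removalListener;
  }

  // Writes trigger maintenance as a side effect.
  void put(K key, V value) {
    cleanUp();
    entries.put(key, new Entry<>(value, clock.getAsLong()));
  }

  // Reads trigger maintenance as a side effect and refresh the access time.
  V getIfPresent(K key) {
    cleanUp();
    Entry<V> entry = entries.get(key);
    if (entry == null) {
      return null;
    }
    entry.lastAccessNanos = clock.getAsLong();
    return entry.value;
  }

  // Removes expired entries and notifies the listener for each one.
  synchronized void cleanUp() {
    long now = clock.getAsLong();
    Iterator<Entry<V>> it = entries.values().iterator();
    while (it.hasNext()) {
      Entry<V> entry = it.next();
      if (now - entry.lastAccessNanos >= ttlNanos) {
        it.remove();
        removalListener.accept(entry.value);
      }
    }
  }

  // Opt-in prompt cleanup: without this, an idle cache never calls the listener.
  ScheduledExecutorService scheduleCleanUp(long periodMillis) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleWithFixedDelay(
        this::cleanUp, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    return scheduler;
  }
}
```

In this sketch the caller owns the returned executor and should shut it down when the cache is discarded, since its worker thread is not a daemon thread.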

return poolSize;
}

public boolean isClosed() {
Contributor

Is package-private enough? For the unit test, we could mark it as @VisibleForTesting.

Contributor Author

ClientPoolImpl and the unit test are in different packages, so package-private is not enough.
But I will add @VisibleForTesting.

cc @chenjunjiedada

Contributor Author

@VisibleForTesting combined with public does not pass Iceberg's style check, so I use plain public.

Since boolean isClosed() is a read-only method, making it public will not introduce risks.

Besides, int poolSize() in this class is also public.

Caffeine.newBuilder()
    .expireAfterAccess(evictionInterval, TimeUnit.MILLISECONDS)
    .removalListener((ignored, value, cause) -> ((HiveClientPool) value).close())
    .scheduler(Scheduler.forScheduledExecutorService(Executors.newScheduledThreadPool(2)))
Collaborator

Nit: Iceberg has a thread pool util ThreadPools.newScheduledPool that you can use.

Contributor Author

OK
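
For readers unfamiliar with shared pool utilities, here is a JDK-only sketch of what such a helper commonly provides: named daemon threads, so a background cleanup thread is identifiable in thread dumps and does not keep the JVM alive. `ScheduledPools` and `newDaemonScheduledPool` are hypothetical names, and this is an assumption about the general pattern, not a copy of Iceberg's `ThreadPools.newScheduledPool`; check the Iceberg source for its actual signature and behavior.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper; illustrates the general pattern only.
final class ScheduledPools {
  private ScheduledPools() {}

  static ScheduledExecutorService newDaemonScheduledPool(String namePrefix, int poolSize) {
    AtomicInteger counter = new AtomicInteger();
    ThreadFactory factory =
        runnable -> {
          // Name threads "<prefix>-<n>" and mark them as daemon threads.
          Thread thread = new Thread(runnable, namePrefix + "-" + counter.getAndIncrement());
          thread.setDaemon(true);
          return thread;
        };
    return new ScheduledThreadPoolExecutor(poolSize, factory);
  }
}
```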

return poolSize;
}

public boolean isClosed() {
Collaborator

Do we need this to be public?

Contributor Author

@frankliee frankliee Apr 10, 2023

See the above comment.

Collaborator

@chenjunjiedada chenjunjiedada left a comment

LGTM

@chenjunjiedada chenjunjiedada requested review from nastra and pvary April 13, 2023 03:40
@nastra
Contributor

nastra commented Apr 13, 2023

Caffeine cache will not always call removelistener by default, so an extra scheduler is required to invoke removelistener.

As I understand it, Caffeine performs the cleanup whenever the cache is accessed or modified. Below is an excerpt from the Javadoc:

If expireAfter, expireAfterWrite, or expireAfterAccess is requested then entries may be evicted on each cache modification, on occasional cache accesses, or on calls to Cache.cleanUp. A scheduler(Scheduler) may be specified to provide prompt removal of expired entries rather than waiting until activity triggers the periodic maintenance. Expired entries may be counted by Cache.estimatedSize(), but will never be visible to read or write operations.

That means the removal listener will eventually be called. This can be seen by modifying the test to:

    Awaitility.await()
        .atMost(10, TimeUnit.SECONDS)
        .untilAsserted(
            () -> {
              Assert.assertTrue(clientPool1.isClosed());
              Assert.assertTrue(clientPool2.isClosed());
            });

The question for this PR should rather be how critical it is that the removal is performed immediately vs waiting until activity triggers the periodic maintenance.
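
The "wait until the assertion holds" pattern used in the snippet above boils down to polling a condition until a deadline. A minimal JDK-only sketch of that idea follows; `Poll.awaitTrue` is a hypothetical helper written for illustration, not part of Awaitility or Iceberg.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical polling helper; a simplified stand-in for an await-until-true check.
final class Poll {
  private Poll() {}

  // Returns true as soon as the condition holds, or false once the timeout elapses.
  static boolean awaitTrue(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      if (System.nanoTime() - deadline >= 0) {
        return false;
      }
      try {
        Thread.sleep(pollMillis);
      } catch (InterruptedException e) {
        // Restore the interrupt flag and give up waiting.
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }
}
```

A test written this way passes whenever the pools are closed at some point within the timeout, which is why it cannot distinguish prompt removal from removal deferred until later maintenance.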

@frankliee
Contributor Author

I understand your concerns. This PR chooses to trigger removal promptly for the following two reasons.

  1. Caffeine's default strategy postpones removal as long as possible to reduce overhead. In Iceberg, removal will be triggered every 5 minutes by default, so the overhead is acceptable.

  2. Iceberg has a configuration property called "client.pool.cache.eviction-interval-ms" that is meant to control the eviction time, but the uncertainty of Caffeine's default strategy makes the actual eviction time hard to predict. This PR makes the property meaningful.
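
As a rough illustration of the second point, the sketch below resolves the eviction interval from a property map. The property key and the 5-minute default come from the discussion above; the class name `EvictionConfig`, the method `evictionIntervalMs`, and the rejection of non-positive values are assumptions made for this example and may differ from Iceberg's actual code.

```java
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Hypothetical helper; names and validation are illustrative assumptions.
final class EvictionConfig {
  static final String EVICTION_INTERVAL_MS = "client.pool.cache.eviction-interval-ms";
  static final long EVICTION_INTERVAL_MS_DEFAULT = TimeUnit.MINUTES.toMillis(5);

  private EvictionConfig() {}

  // Returns the configured interval in milliseconds, or the 5-minute default if unset.
  static long evictionIntervalMs(Map<String, String> properties) {
    String raw = properties.get(EVICTION_INTERVAL_MS);
    if (raw == null) {
      return EVICTION_INTERVAL_MS_DEFAULT;
    }
    long value = Long.parseLong(raw.trim());
    if (value <= 0) {
      // Assumption: a non-positive interval is treated as a configuration error.
      throw new IllegalArgumentException(EVICTION_INTERVAL_MS + " must be positive: " + raw);
    }
    return value;
  }
}
```

With a scheduler attached to the cache, a value resolved this way bounds how long an idle client pool stays open, rather than merely marking when it becomes eligible for removal.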

@nastra
Contributor

nastra commented Apr 13, 2023

I think it would be good here to get @pvary's opinion as I'm not very familiar with Hive and how critical it is in this particular case that we perform removal immediately

@pvary
Contributor

pvary commented Apr 17, 2023

I have seen cases with Flink where the number of open HMS clients kept increasing. I was suspicious of the Caffeine cleanup, but was not able to pin down the issue. That said, I am not sure the issue was the same.

@frankliee: did you have a concrete issue which needs fixing? What was it?

Thanks, Peter

@frankliee
Contributor Author

In another project, I found that a Guava cache can leak connections when nothing is written to the cache for a long time.
So Iceberg has a similar risk.

private synchronized void init() {
  if (clientPoolCache == null) {
    // Since Caffeine does not ensure that removalListener will be invoked after expiration,
    // we use a scheduler with 2 threads to clean up expired clients.
Contributor

Why is this size 2? Why not 1?

Contributor

I agree that this should be 1 rather than 2.

Contributor Author

OK

Contributor

Could you please update the comment?

Contributor Author

@frankliee frankliee Apr 26, 2023

Thanks for your correction. I have updated the comment. @pvary

.getIfPresent(CachedClientPool.extractKey(null, hiveConf)));

// The client has been really closed.
Assert.assertTrue(clientPool1.isClosed());
Contributor

Just for the record: Did we test that this test fails without the fix?

Contributor

@nastra nastra Apr 18, 2023

I checked this when reviewing, and the test fails on master with these 2 checks added. However, the test passes with some wait time:

    Awaitility.await()
        .atMost(10, TimeUnit.SECONDS)
        .untilAsserted(
            () -> {
              Assert.assertTrue(clientPool1.isClosed());
              Assert.assertTrue(clientPool2.isClosed());
            });

because cache removal will eventually be executed (just not immediately).

Contributor Author

Caffeine's documentation describes its complicated cleanup behavior and offers the scheduler as a way to get prompt cleanup.

By default, Caffeine does not perform cleanup and evict values "automatically" or instantly after a value expires. Instead, it performs small amounts of maintenance work after write operations or occasionally after read operations if writes are rare.
(see: https://github.com/ben-manes/caffeine/wiki/Cleanup)

The unpredictability of removal could increase the pressure on HMS when there are many applications (including Flink and Spark).

@chenjunjiedada
Collaborator

@nastra @pvary Any more opinions on this? Is it ready to go?

@pvary pvary merged commit bbacaf4 into apache:master Apr 26, 2023
@pvary
Contributor

pvary commented Apr 26, 2023

Thanks @chenjunjiedada, @nastra, and @ConeyLiu for the review, and @frankliee for the PR.

manisin pushed a commit to Snowflake-Labs/iceberg that referenced this pull request May 9, 2023