Hive: Clean up expired metastore clients #7310
Conversation
    return poolSize;
  }

  public boolean isClosed() {
Would package-private be enough? For the UT, we could mark it as @VisibleForTesting.
ClientPoolImpl and the UT are in different packages, so package-private is not enough.
But I will add @VisibleForTesting.
@VisibleForTesting combined with public cannot pass Iceberg's style check, so I use public.
Since boolean isClosed() is a read-only function, making it public does not introduce any risk.
Besides, int poolSize() in this class is also public.
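For context, a minimal sketch of how such a read-only accessor could look on the pool class; everything outside the two quoted diff lines (the class name, the closed flag, and close()) is an assumption, not the PR's actual code.

// Sketch only, assuming a volatile "closed" flag tracks whether the pool was shut down.
abstract class ClientPoolSketch {
  private final int poolSize;
  private volatile boolean closed = false;

  ClientPoolSketch(int poolSize) {
    this.poolSize = poolSize;
  }

  public int poolSize() {
    return poolSize;
  }

  // Kept public because @VisibleForTesting + public does not pass Iceberg's style check;
  // as a read-only accessor it exposes no mutable state.
  public boolean isClosed() {
    return closed;
  }

  void close() {
    closed = true;
  }
}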
      Caffeine.newBuilder()
          .expireAfterAccess(evictionInterval, TimeUnit.MILLISECONDS)
          .removalListener((ignored, value, cause) -> ((HiveClientPool) value).close())
          .scheduler(Scheduler.forScheduledExecutorService(Executors.newScheduledThreadPool(2)))
Nit: Iceberg has a thread pool utility, ThreadPools.newScheduledPool, that you can use here.
OK
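A hedged sketch of the suggested swap; only the scheduler argument changes, the pool name string is made up, and ThreadPools.newScheduledPool(name, size) is assumed to be the utility meant here (the thread count itself is discussed further down):

// Instead of Executors.newScheduledThreadPool(2), reuse Iceberg's ThreadPools utility:
.scheduler(
    Scheduler.forScheduledExecutorService(
        ThreadPools.newScheduledPool("hive-client-pool-cleaner", 2)))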
    return poolSize;
  }

  public boolean isClosed() {
Do we need this to be public?
See the above comment.
chenjunjiedada left a comment
LGTM
From how I understand it, Caffeine performs the cleanup whenever the cache is being accessed or modified; the Javadoc describes this maintenance behavior.
That means the removal listener will eventually be called, which can be seen by modifying the test to wait (see the Awaitility snippet below). The question for this PR should rather be how critical it is that the removal is performed immediately vs. waiting until activity triggers the periodic maintenance.
I understand your concerns; this PR chooses to perform the removal immediately for the two reasons discussed below.
I think it would be good here to get @pvary's opinion, as I'm not very familiar with Hive and how critical it is in this particular case that we perform removal immediately.
I have seen cases with Flink where the number of open HMS clients kept increasing. I was suspicious about the Caffeine cleanup, but was not able to pin down the issue. That said, I am not sure the issue was the same. @frankliee: did you have a concrete issue which needs fixing? What was it? Thanks, Peter
In another project, I found that the Guava cache can leak connections when the cache is not written to ("set") for a long time.
  private synchronized void init() {
    if (clientPoolCache == null) {
      // Since Caffeine does not ensure that removalListener will be involved after expiration
      // We use a scheduler with 2 threads to clean up expired clients.
Why is the size 2? Why not 1?
I agree that this should rather be 1 instead of 2.
OK
Could you please update the comment?
Thanks for your correction. I have updated the comment. @pvary
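Putting this thread together, a sketch of how the resolved init() could look, with a single scheduler thread and a reworded comment; the pool name and exact wording are assumptions, not the merged code:

  private synchronized void init() {
    if (clientPoolCache == null) {
      // Caffeine does not guarantee that the removal listener runs promptly after expiration,
      // so attach a scheduler backed by a single-threaded pool to clean up expired clients.
      clientPoolCache =
          Caffeine.newBuilder()
              .expireAfterAccess(evictionInterval, TimeUnit.MILLISECONDS)
              .removalListener((ignored, value, cause) -> ((HiveClientPool) value).close())
              .scheduler(
                  Scheduler.forScheduledExecutorService(
                      ThreadPools.newScheduledPool("hive-client-pool-cleaner", 1)))
              .build();
    }
  }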
            .getIfPresent(CachedClientPool.extractKey(null, hiveConf)));

    // The client has been really closed.
    Assert.assertTrue(clientPool1.isClosed());
Just for the record: Did we test that this test fails without the fix?
I checked this when reviewing, and the test fails on master with these two checks added. However, the test passes with some wait time:
Awaitility.await()
.atMost(10, TimeUnit.SECONDS)
.untilAsserted(
() -> {
Assert.assertTrue(clientPool1.isClosed());
Assert.assertTrue(clientPool2.isClosed());
});
because cache removal will eventually be executed (just not immediately).
Caffeine's documentation describes its complicated cleanup behavior and introduces the scheduler option to prompt cleanup:
By default, Caffeine does not perform cleanup and evict values "automatically" or instantly after a value expires. Instead, it performs small amounts of maintenance work after write operations or occasionally after read operations if writes are rare.
(see: https://github.com/ben-manes/caffeine/wiki/Cleanup)
The unpredictability of removal could increase the pressure on HMS when there are many applications (including Flink and Spark).
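To make the timing difference concrete, here is a small standalone sketch (not part of the PR) contrasting the default lazy cleanup with scheduler-driven cleanup; note that Scheduler.systemScheduler() is a no-op on Java 8, which is one reason a ScheduledExecutorService-backed scheduler can be preferable:

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.Scheduler;
import java.util.concurrent.TimeUnit;

public class CaffeineCleanupDemo {
  public static void main(String[] args) throws InterruptedException {
    // Default behavior: expired entries are only evicted during later cache activity.
    Cache<String, String> lazy =
        Caffeine.newBuilder()
            .expireAfterAccess(100, TimeUnit.MILLISECONDS)
            .removalListener((k, v, cause) -> System.out.println("lazy removal: " + k + " (" + cause + ")"))
            .build();

    // With a scheduler: eviction (and the removal listener) runs close to the expiry time.
    Cache<String, String> prompt =
        Caffeine.newBuilder()
            .expireAfterAccess(100, TimeUnit.MILLISECONDS)
            .removalListener((k, v, cause) -> System.out.println("prompt removal: " + k + " (" + cause + ")"))
            .scheduler(Scheduler.systemScheduler())
            .build();

    lazy.put("a", "1");
    prompt.put("a", "1");

    Thread.sleep(1000);
    // By now "prompt removal" has typically fired; the lazy cache still holds its expired entry
    // until some activity triggers maintenance, e.g. an explicit cleanUp() or another access.
    lazy.cleanUp();
    Thread.sleep(200); // give the asynchronously invoked removal listener a chance to print
  }
}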
Thanks to @chenjunjiedada, @nastra, and @ConeyLiu for the review, and @frankliee for the PR.
Fixed a bug that could leak Hive client connections.
By default, the Caffeine cache does not always call the removal listener promptly, so an extra scheduler is required to invoke it.
See: google/guava#3295.