PYTHON-5212 - Do not hold Topology lock while resetting pool #2301

NoahStapp · 2025-04-21T20:46:15Z

No description provided.

ShaneHarvey

Is there any regression test we can add for this? Perhaps one that mocks a slow close and ensures other operations proceed unblocked?
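For context, a minimal sketch of the behavior such a regression test would look for. It is not the driver's code: `reset_pool`, `on_change`, `other_operation`, and `measure` are hypothetical stand-ins, and a plain `asyncio.Lock` plays the role of the Topology lock. A slow close performed while holding the lock stalls an unrelated operation, whereas the same close performed after releasing the lock does not.

```python
import asyncio
import time


async def reset_pool(close_delay: float) -> None:
    # Hypothetical stand-in for a pool reset whose connection closes are slow.
    await asyncio.sleep(close_delay)


async def on_change(lock: asyncio.Lock, close_delay: float, hold_lock: bool) -> None:
    if hold_lock:
        # Old behavior (assumed): the slow reset runs while the lock is held.
        async with lock:
            await reset_pool(close_delay)
    else:
        # New behavior (assumed): only state updates happen under the lock.
        async with lock:
            pass
        await reset_pool(close_delay)


async def other_operation(lock: asyncio.Lock) -> float:
    # Any operation that needs the same lock; returns how long it waited.
    start = time.monotonic()
    async with lock:
        pass
    return time.monotonic() - start


async def measure(hold_lock: bool, close_delay: float = 0.2) -> float:
    lock = asyncio.Lock()
    change = asyncio.create_task(on_change(lock, close_delay, hold_lock))
    await asyncio.sleep(0.01)  # let on_change start first
    waited = await other_operation(lock)
    await change
    return waited


blocked = asyncio.run(measure(hold_lock=True))
unblocked = asyncio.run(measure(hold_lock=False))
print(f"lock held during reset: waited {blocked:.3f}s")
print(f"reset outside the lock: waited {unblocked:.3f}s")
```

The first measurement should come out close to `close_delay`, the second close to zero.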

ShaneHarvey · 2025-04-21T21:13:53Z

pymongo/asynchronous/pool.py

-            for conn in sockets:
-                await conn.close_conn(ConnectionClosedReason.POOL_CLOSED)
+            if not _IS_SYNC:
+                await asyncio.gather(


We probably want to use return_exceptions=True here to ensure all tasks complete.
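A small sketch of why `return_exceptions=True` matters here. `FakeConn` and `close_all` are hypothetical names used only for illustration. By default, `asyncio.gather` propagates the first exception to the caller right away; the remaining awaitables are not cancelled, but the caller stops waiting for them. With `return_exceptions=True` the call returns only after every close has finished, and failures come back as values in the result list.

```python
import asyncio


class FakeConn:
    """Hypothetical connection whose close may fail."""

    def __init__(self, name: str, fail: bool = False) -> None:
        self.name = name
        self.fail = fail
        self.closed = False

    async def close_conn(self) -> None:
        await asyncio.sleep(0.01)
        if self.fail:
            raise OSError(f"{self.name}: close failed")
        self.closed = True


async def close_all(conns):
    # Close everything concurrently and wait for all closes, even if some raise.
    results = await asyncio.gather(
        *(conn.close_conn() for conn in conns),
        return_exceptions=True,
    )
    # Exceptions are returned in place of results instead of being raised.
    return [r for r in results if isinstance(r, Exception)]


conns = [FakeConn("a"), FakeConn("b", fail=True), FakeConn("c")]
errors = asyncio.run(close_all(conns))
print(errors)
print([c.closed for c in conns])
```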

ShaneHarvey · 2025-04-21T21:14:37Z

pymongo/asynchronous/pool.py

-            for conn in close_conns:
-                await conn.close_conn(ConnectionClosedReason.IDLE)
+            if not _IS_SYNC:
+                await asyncio.gather(


ShaneHarvey · 2025-04-21T21:18:39Z

pymongo/asynchronous/topology.py

@@ -557,6 +551,11 @@ async def on_change(
            # that didn't include this server.
            if self._opened and self._description.has_server(server_description.address):
                await self._process_change(server_description, reset_pool, interrupt_connections)
+        # Clear the pool from a failed heartbeat, done outside the lock to avoid blocking on connection close.
+        if self._opened and self._description.has_server(server_description.address) and reset_pool:
+            server = self._servers.get(server_description.address)


The has_server -> _servers.get pattern is not safe to do here (https://en.wikipedia.org/wiki/Time-of-check_to_time-of-use).

We also don't need to check for closed/opened because pool.reset is safe to call even after close().

Instead we can do:

    if reset_pool:
        server = self._servers.get(server_description.address)
        if server:
            await server.pool.reset(interrupt_connections=interrupt_connections)
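To illustrate the time-of-check to time-of-use hazard in isolation, here is a sketch with plain Python containers. `Server`, `racy_reset`, and `safe_reset` are hypothetical, and the `interleave` callback simulates another task removing the server between the membership check and the lookup. The safe variant relies on a single `get` followed by a `None` check, as suggested above.

```python
from typing import Callable, Dict, Set


class Server:
    """Hypothetical stand-in for a server object that owns a pool."""

    def __init__(self) -> None:
        self.resets = 0

    def reset(self) -> None:
        self.resets += 1


def racy_reset(known: Set[str], servers: Dict[str, Server], address: str,
               interleave: Callable[[], None]) -> None:
    if address in known:               # time of check
        interleave()                   # another task may run here
        server = servers.get(address)  # time of use: may now be None
        server.reset()                 # AttributeError if the server was removed


def safe_reset(servers: Dict[str, Server], address: str,
               interleave: Callable[[], None]) -> bool:
    interleave()
    server = servers.get(address)      # single lookup, no separate check
    if server is not None:
        server.reset()
        return True
    return False


address = "host1:27017"
known = {address}
servers = {address: Server()}


def remove_server() -> None:
    servers.pop(address, None)


try:
    racy_reset(known, servers, address, remove_server)
    racy_failed = False
except AttributeError:
    racy_failed = True

safe_result = safe_reset(servers, address, remove_server)
print(racy_failed, safe_result)
```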

NoahStapp · 2025-04-22T13:00:30Z

Is there any regression test we can add for this? Perhaps one that mocks a slow close and ensures other operations proceed unblocked?

That was my thought as well, working on adding such a test today.

ShaneHarvey · 2025-04-23T17:37:28Z

test/asynchronous/test_discovery_and_monitoring.py

+                elapsed = time.monotonic() - start_time
+                latencies.append(elapsed)
+                if elapsed >= close_delay:
+                    break


If elapsed is always < close_delay, this loop will never exit?

Could we replace this with an exit condition like this?:

    latencies = []
    should_exit = []

    async def run_task():
        ...
        if should_exit:
            break
        ...

    # Wait until all idle connections are closed to simulate real-world conditions
    await listener.async_wait_for_event(monitoring.ConnectionClosedEvent, 10)
    should_exit.append(True)
    await task.join()

The classic list-as-mutable-state technique!
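A self-contained sketch of the list-as-flag technique. It assumes a plain `asyncio` task in place of the test suite's task helper, so it awaits the task directly rather than calling `task.join()`, and `asyncio.sleep` stands in for the real operations. Appending to a list mutates an existing object, so the inner coroutine sees the change without a `nonlocal` declaration.

```python
import asyncio


async def main():
    latencies = []
    should_exit = []  # an empty list is falsy; appending anything makes it truthy

    async def run_task():
        while True:
            await asyncio.sleep(0.001)  # stand-in for one find operation
            latencies.append(0.001)
            if should_exit:
                break

    task = asyncio.create_task(run_task())
    await asyncio.sleep(0.05)  # stand-in for waiting on the close events
    should_exit.append(True)   # signal the loop to stop
    await task                 # wait for the loop to observe the flag and exit
    return latencies, task.done()


latencies, finished = asyncio.run(main())
print(len(latencies), finished)
```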

ShaneHarvey · 2025-04-23T17:40:44Z

test/asynchronous/test_discovery_and_monitoring.py

+            # Wait until all idle connections are closed to simulate real-world conditions
+            await listener.async_wait_for_event(monitoring.ConnectionClosedEvent, 10)
+            # No operation latency should not significantly exceed close_delay
+            self.assertLessEqual(max(latencies), close_delay * 2.0)


I worry this test will be flaky. A single op can easily take >100 ms in our CI depending on the host (mac/windows). Could you increase the close_delay to 0.1 and increase this line to close_delay * 5? Since there are 10 connections to close, * 5 should still catch the regression right?

I would expect close_delay * 5 to catch regressions consistently, yeah.
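The arithmetic behind that expectation, as a rough sketch. It assumes the regressed behavior closes the 10 connections one after another while an operation waits for all of them, which is a worst-case reading rather than a measured figure.

```python
close_delay = 0.1      # seconds per mocked connection close
num_connections = 10   # minPoolSize in the test

threshold = close_delay * 5                   # proposed assertion bound
serial_block = close_delay * num_connections  # assumed worst-case stall

print(f"threshold={threshold:.1f}s, assumed regression stall={serial_block:.1f}s")
```

Under this assumption the stall is twice the bound, so the assertion still fails on a regression while leaving room for slow CI hosts.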

ShaneHarvey · 2025-04-23T17:42:16Z

test/asynchronous/test_discovery_and_monitoring.py

+            minPoolSize=10,
+        )
+        server = await (await client._get_topology()).select_server(
+            readable_server_selector, _Op.TEST


readable_server_selector -> writeable_server_selector. The test is using primary read preference so we should wait for 10 connections to the primary node.

Because only the primary is writeable? Makes sense.

ShaneHarvey · 2025-04-23T18:36:36Z

test/test_client.py

@@ -1864,6 +1864,7 @@ def test_direct_connection(self):
            MongoClient(["host1", "host2"], directConnection=True)

    @unittest.skipIf("PyPy" in sys.version, "PYTHON-2927 fails often on PyPy")
+    @skipIf(os.environ.get("DEBUG_LOG"), "Enabling debug logs breaks this test")


Was this intentionally added?

Whoops this was for testing purposes. Intended to be done in a separate ticket.

ShaneHarvey · 2025-04-23T18:37:23Z

test/asynchronous/test_discovery_and_monitoring.py

+                await listener.async_wait_for_event(monitoring.ServerHeartbeatFailedEvent, 1)
+            # Wait until all idle connections are closed to simulate real-world conditions
+            await listener.async_wait_for_event(monitoring.ConnectionClosedEvent, 10)
+            should_exit.append(True)


The test should join the task here after should_exit.append(True) otherwise we may miss a latency. Also one more thing:

    # Wait until all idle connections are closed to simulate real-world conditions
    await listener.async_wait_for_event(monitoring.ConnectionClosedEvent, 10)
    # Wait for one more find to complete, then shutdown the task.
    n = len(latencies)
    await async_wait_until(lambda: len(latencies) >= n + 1, "run one more find")
    should_exit.append(True)
    await task.join()

I see, that way we ensure that the operations are still working after the pool reset completes.
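A runnable sketch of the "wait for one more operation, then shut down" sequence. The `wait_until` helper here is a hypothetical polling function written for this example, not the test suite's `async_wait_until`, and `asyncio.sleep` stands in for both the find operations and the event wait.

```python
import asyncio


async def wait_until(predicate, description, timeout=5.0, interval=0.005):
    # Hypothetical helper: poll until predicate() is true or the timeout expires.
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while not predicate():
        if loop.time() > deadline:
            raise AssertionError(f"timed out waiting to {description}")
        await asyncio.sleep(interval)


async def main():
    latencies = []
    should_exit = []

    async def run_task():
        while not should_exit:
            await asyncio.sleep(0.002)  # stand-in for one find operation
            latencies.append(0.002)

    task = asyncio.create_task(run_task())
    await asyncio.sleep(0.02)  # stand-in for waiting on the close events

    # Require at least one more completed operation before stopping the task.
    n = len(latencies)
    await wait_until(lambda: len(latencies) >= n + 1, "run one more find")
    should_exit.append(True)
    await task
    return n, len(latencies)


before, after = asyncio.run(main())
print(before, after)
```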

PYTHON-5212 - Do not hold Topology lock while resetting pool

216407e

NoahStapp requested review from ShaneHarvey and sleepyStick April 21, 2025 20:46

Use correct close_conn

4f0c8e4

ShaneHarvey requested changes Apr 21, 2025

NoahStapp added 2 commits April 22, 2025 10:09

Address review

b7232db

Add test

f5166be

NoahStapp requested a review from ShaneHarvey April 23, 2025 13:41

NoahStapp added 4 commits April 23, 2025 10:43

Removed unused import

1b34973

Increase test threshold to account for variance

c612c58

Merge branch 'master' into PYTHON-5212

6a0457f

Disable test_continuous_network_errors with debug logs

c5bf1ab

ShaneHarvey requested changes Apr 23, 2025

Address review

10406de

NoahStapp requested a review from ShaneHarvey April 23, 2025 18:11

ShaneHarvey reviewed Apr 23, 2025

Address review

37be166

NoahStapp requested a review from ShaneHarvey April 23, 2025 18:45

ShaneHarvey approved these changes Apr 23, 2025

NoahStapp merged commit 09897b6 into mongodb:master Apr 23, 2025
30 of 32 checks passed

blink1073 pushed a commit to blink1073/mongo-python-driver that referenced this pull request Apr 23, 2025

PYTHON-5212 - Do not hold Topology lock while resetting pool (mongodb…

03a0d32

…#2301) (cherry picked from commit 09897b6)
