fix(gateway): drain in-flight runs before closing checkpointer on shutdown (#3381)

mirror of https://github.com/bytedance/deer-flow.git synced 2026-06-10 09:25:57 +00:00

* fix(gateway): drain in-flight runs before closing checkpointer on shutdown

Chat runs execute in fire-and-forget background asyncio tasks that write
checkpoints through a shared checkpointer. On shutdown, langgraph_runtime's
AsyncExitStack tore down the checkpointer's postgres connection pool while
those run tasks were still mid-graph. langgraph's
AsyncPregelLoop._checkpointer_put_after_previous then ran its
`finally: await checkpointer.aput(...)` against the closed pool, raising
psycopg_pool.PoolClosed. Because that put runs in a langgraph-internal task
(not on run_agent's call stack), run_agent's try/except cannot catch it and it
surfaces as "unhandled exception during asyncio.run() shutdown".
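
The race can be reproduced in miniature. The following is a minimal sketch, not the gateway code: `FakeCheckpointer`, `PoolClosed`, `run` and `demo` are hypothetical stand-ins, and for simplicity the final write happens in the run task itself rather than in a langgraph-internal task. It only shows that the order of "drain" and "close pool" decides whether the final write succeeds.

```python
import asyncio


class PoolClosed(Exception):
    """Hypothetical stand-in for psycopg_pool.PoolClosed."""


class FakeCheckpointer:
    """Hypothetical checkpointer whose backing pool can be closed."""

    def __init__(self) -> None:
        self.closed = False
        self.writes: list[str] = []

    async def aput(self, item: str) -> None:
        if self.closed:
            raise PoolClosed("the pool is already closed")
        self.writes.append(item)


async def run(checkpointer: FakeCheckpointer) -> None:
    try:
        await asyncio.sleep(3600)  # stands in for a run that is mid-graph
    finally:
        # Final checkpoint write, issued even when the run is cancelled.
        await checkpointer.aput("final")


async def demo(drain_first: bool) -> str:
    checkpointer = FakeCheckpointer()
    task = asyncio.create_task(run(checkpointer))
    await asyncio.sleep(0)  # let the run start
    if drain_first:
        # Fixed ordering: cancel and await the run while the pool is open.
        task.cancel()
        await asyncio.gather(task, return_exceptions=True)
    checkpointer.closed = True  # teardown closes the pool
    if not drain_first:
        # Buggy ordering: the run is only cancelled after the pool is gone.
        task.cancel()
    results = await asyncio.gather(task, return_exceptions=True)
    return type(results[0]).__name__


if __name__ == "__main__":
    print(asyncio.run(demo(drain_first=True)))
    print(asyncio.run(demo(drain_first=False)))
```

With `drain_first=True` the task ends as a plain cancellation; with `drain_first=False` the `finally` write hits the closed pool and the task ends with `PoolClosed` instead.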

Add RunManager.shutdown() to cancel and bounded-await all in-flight runs, and
call it from langgraph_runtime BEFORE the AsyncExitStack closes the
checkpointer, so the final checkpoint write lands while the pool is still open.
The drain is bounded by a timeout so a stuck run cannot hang worker shutdown,
and is shielded so a second shutdown signal cannot abandon it mid-drain and
reopen the race.
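
A bounded, shielded drain of this kind might look roughly like the sketch below. It assumes the in-flight runs are available as a set of asyncio tasks; `drain`, `shutdown`, `run`, `stuck` and `demo` are illustrative names, not the actual `RunManager` API.

```python
import asyncio


async def drain(tasks: set[asyncio.Task], timeout: float) -> set[asyncio.Task]:
    """Cancel every task, wait at most `timeout` seconds, return stragglers."""
    if not tasks:
        return set()
    for task in tasks:
        task.cancel()
    _done, pending = await asyncio.wait(tasks, timeout=timeout)
    return pending


async def shutdown(tasks: set[asyncio.Task], timeout: float = 5.0) -> set[asyncio.Task]:
    drain_task = asyncio.create_task(drain(tasks, timeout))
    # shield(): cancelling the caller of shutdown() does not cancel drain_task,
    # so a second shutdown signal cannot abandon the drain halfway.
    return await asyncio.shield(drain_task)


async def run(name: str, log: list[str]) -> None:
    try:
        await asyncio.sleep(3600)
    finally:
        log.append(f"{name}: final checkpoint")


async def stuck(release: asyncio.Event) -> None:
    try:
        await asyncio.sleep(3600)
    except asyncio.CancelledError:
        await release.wait()  # simulates cleanup that outlives the timeout


async def demo() -> tuple[list[str], int]:
    log: list[str] = []
    release = asyncio.Event()
    tasks = {
        asyncio.create_task(run("a", log)),
        asyncio.create_task(stuck(release)),
    }
    await asyncio.sleep(0)  # let both tasks start
    pending = await shutdown(tasks, timeout=0.05)
    release.set()  # unblock the straggler so the demo can exit cleanly
    await asyncio.gather(*pending)
    return log, len(pending)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The well-behaved run writes its final checkpoint inside the drain window, while the stuck one is reported as still pending after the timeout instead of blocking shutdown.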

Closes #3373

* fix(gateway): address review — preserve completed-run status, bound drain persistence

Addresses Copilot review on #3381:

- RunManager.shutdown(): decide run status AFTER the drain. Under the lock it
  now only requests cancellation; after asyncio.wait it marks/persists
  `interrupted` only for runs still pending or ended cancelled. A run that
  completes (e.g. `success`) during the drain window keeps its real terminal
  status instead of being unconditionally overwritten.
- Bound the trailing status persistence within the timeout budget
  (deadline = loop.time()+timeout; gather wrapped in asyncio.wait_for) so a slow
  store backing off under DB pressure cannot push shutdown past the deadline.
- deps: use asyncio.create_task instead of asyncio.ensure_future.
- tests: wait deterministically for the run to be in-flight (poll the first
  checkpoint) instead of a fixed sleep; init shutdown_calls explicitly in the
  recovery test double; add regression test asserting a run completing during
  the drain keeps its status (in memory and in the store).
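
The first two points above can be sketched as follows, under simplifying assumptions: runs are tracked as a `run_id -> task` mapping, statuses live in plain dicts, and `persist_status` is a hypothetical store write. In this demo the successful run has already finished by the time cancellation is requested, which is the simplest case of a run that must keep its real status.

```python
import asyncio


async def persist_status(store: dict[str, str], run_id: str, status: str) -> bool:
    """Hypothetical store write; a real one would hit the database."""
    store[run_id] = status
    return True


async def shutdown(
    tasks: dict[str, asyncio.Task],
    statuses: dict[str, str],
    store: dict[str, str],
    timeout: float = 5.0,
) -> None:
    if not tasks:
        return
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout  # one budget for drain plus persistence
    for task in tasks.values():
        task.cancel()  # request cancellation only; no status is written yet
    await asyncio.wait(list(tasks.values()), timeout=timeout)

    # Decide status AFTER the drain: only runs that are still pending or that
    # ended cancelled become "interrupted".
    interrupted = [
        run_id for run_id, task in tasks.items()
        if not task.done() or task.cancelled()
    ]
    for run_id in interrupted:
        statuses[run_id] = "interrupted"
    if not interrupted:
        return

    remaining = max(deadline - loop.time(), 0.0)
    try:
        await asyncio.wait_for(
            asyncio.gather(
                *(persist_status(store, run_id, "interrupted") for run_id in interrupted)
            ),
            timeout=remaining,
        )
    except asyncio.TimeoutError:
        pass  # budget exhausted; give up rather than hold shutdown open


async def demo() -> tuple[dict[str, str], dict[str, str]]:
    statuses = {"done": "running", "slow": "running"}
    store: dict[str, str] = {}

    async def finishes(run_id: str) -> None:
        statuses[run_id] = "success"
        await persist_status(store, run_id, "success")

    async def slow(run_id: str) -> None:
        await asyncio.sleep(3600)

    tasks = {
        "done": asyncio.create_task(finishes("done")),
        "slow": asyncio.create_task(slow("slow")),
    }
    await asyncio.sleep(0)  # "done" completes here, "slow" is now in flight
    await shutdown(tasks, statuses, store, timeout=1.0)
    return statuses, store


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because the status decision looks at each task's final state rather than at the snapshot taken before cancelling, the completed run keeps `success` both in memory and in the store.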

* fix(gateway): address maintainer review — surface failed drain persists, clarify timeout constant

Addresses @WillemJiang review on #3381:

- shutdown(): inspect the gather result of the trailing interrupted-status
  persistence. _persist_status is best-effort (it catches + logs its own
  failure with exc_info and returns False, so it never raises out of the
  gather), but the aggregate result was never checked — a partial failure had
  no shutdown-level visibility. Now any escaped Exception is logged, and any
  False (a persist that did not confirm) is logged with the run_id. Added
  regression test test_shutdown_surfaces_failed_interrupted_persist.
- deps: clarify the _RUN_DRAIN_TIMEOUT_SECONDS comment — state the actual value
  of _SHUTDOWN_HOOK_TIMEOUT_SECONDS (5.0s) and that both count toward the
  lifespan shutdown window. Kept as two separate constants (independent teardown
  steps that may diverge) rather than one shared "must match" value.
- Verified no other test fake needs the shutdown stub: _FakeRunManager in
  test_worker_langfuse_metadata.py is a run_agent() argument (worker path),
  never injected into langgraph_runtime, so it never receives shutdown().
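
Inspecting the aggregate result could look something like this. It is a sketch only: `persist_status` and `persist_interrupted` are hypothetical, the `failing` set simulates a store outage, and the use of `return_exceptions=True` is an assumption about how an escaped exception would be collected.

```python
import asyncio
import logging

logger = logging.getLogger("shutdown_sketch")


async def persist_status(run_id: str, failing: set[str]) -> bool:
    """Hypothetical best-effort persist: logs its own failure, returns False."""
    try:
        if run_id in failing:
            raise ConnectionError("store unavailable")
        return True
    except Exception:
        logger.warning("persist failed for run %s", run_id, exc_info=True)
        return False


async def persist_interrupted(
    run_ids: list[str], failing: set[str], timeout: float
) -> list[str]:
    """Persist `interrupted` for each run; return run_ids that were not confirmed."""
    results = await asyncio.wait_for(
        asyncio.gather(
            *(persist_status(run_id, failing) for run_id in run_ids),
            return_exceptions=True,  # assumption: collect instead of raising
        ),
        timeout=timeout,
    )
    unconfirmed: list[str] = []
    for run_id, result in zip(run_ids, results):
        if isinstance(result, Exception):
            logger.error("persist raised for run %s: %r", run_id, result)
            unconfirmed.append(run_id)
        elif result is False:
            logger.warning("interrupted status not confirmed for run %s", run_id)
            unconfirmed.append(run_id)
    return unconfirmed


if __name__ == "__main__":
    print(asyncio.run(persist_interrupted(["r1", "r2", "r3"], {"r2"}, timeout=1.0)))
```

The per-run helper still swallows its own error, but the caller now reports which run_ids were left unconfirmed at shutdown level.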

Xinmin Zeng, 2026-06-07 11:24:30 +08:00 (committed by GitHub)
parent 9a5de8d6a5
commit 268fdd6968

4 changed files with 497 additions and 0 deletions

									
backend/tests/test_gateway_run_recovery.py (+6)
@@ -32,6 +32,7 @@ class _FakeRunManager:
         self.store = store
         self.reconcile_calls: list[dict] = []
         self.list_by_thread_calls: list[dict] = []
+        self.shutdown_calls: int = 0
         _FakeRunManager.instances.append(self)

     async def reconcile_orphaned_inflight_runs(self, *, error: str, before: str | None = None):
@@ -42,6 +43,11 @@ class _FakeRunManager:
         self.list_by_thread_calls.append({"thread_id": thread_id, "user_id": user_id, "limit": limit})
         return self.latest_by_thread.get(thread_id, self.recovered_runs[:limit])
+
+    async def shutdown(self, *, timeout: float = 5.0) -> None:
+        # No in-flight tasks in these startup-recovery tests; langgraph_runtime
+        # drains the manager on teardown, so the double must accept the call.
+        self.shutdown_calls += 1


 class _FakeThreadStore:
     def __init__(self) -> None: