fix(gateway): drain in-flight runs before closing checkpointer on shutdown (#3381)

* fix(gateway): drain in-flight runs before closing checkpointer on shutdown

  Chat runs execute in fire-and-forget background asyncio tasks that write checkpoints through a shared checkpointer. On shutdown, langgraph_runtime's AsyncExitStack tore down the checkpointer's postgres connection pool while those run tasks were still mid-graph. langgraph's AsyncPregelLoop._checkpointer_put_after_previous then ran its `finally: await checkpointer.aput(...)` against the closed pool, raising psycopg_pool.PoolClosed. Because that put runs in a langgraph-internal task (not on run_agent's call stack), run_agent's try/except cannot catch it and it surfaces as "unhandled exception during asyncio.run() shutdown".

  Add RunManager.shutdown() to cancel and bounded-await all in-flight runs, and call it from langgraph_runtime BEFORE the AsyncExitStack closes the checkpointer, so the final checkpoint write lands while the pool is still open. The drain is bounded by a timeout so a stuck run cannot hang worker shutdown, and is shielded so a second shutdown signal cannot abandon it mid-drain and reopen the race.

  Closes #3373

* fix(gateway): address review — preserve completed-run status, bound drain persistence

  Addresses Copilot review on #3381:

  - RunManager.shutdown(): decide run status AFTER the drain. Under the lock it now only requests cancellation; after asyncio.wait it marks/persists `interrupted` only for runs still pending or ended cancelled. A run that completes (e.g. `success`) during the drain window keeps its real terminal status instead of being unconditionally overwritten.
  - Bound the trailing status persistence within the timeout budget (deadline = loop.time()+timeout; gather wrapped in asyncio.wait_for) so a slow store backing off under DB pressure cannot push shutdown past the deadline.
  - deps: use asyncio.create_task instead of asyncio.ensure_future.
  - tests: wait deterministically for the run to be in-flight (poll the first checkpoint) instead of a fixed sleep; init shutdown_calls explicitly in the recovery test double; add regression test asserting a run completing during the drain keeps its status (in memory and in the store).

* fix(gateway): address maintainer review — surface failed drain persists, clarify timeout constant

  Addresses @WillemJiang review on #3381:

  - shutdown(): inspect the gather result of the trailing interrupted-status persistence. _persist_status is best-effort (it catches + logs its own failure with exc_info and returns False, so it never raises out of the gather), but the aggregate result was never checked — a partial failure had no shutdown-level visibility. Now any escaped Exception is logged, and any False (a persist that did not confirm) is logged with the run_id. Added regression test test_shutdown_surfaces_failed_interrupted_persist.
  - deps: clarify the _RUN_DRAIN_TIMEOUT_SECONDS comment — state the actual value of _SHUTDOWN_HOOK_TIMEOUT_SECONDS (5.0s) and that both count toward the lifespan shutdown window. Kept as two separate constants (independent teardown steps that may diverge) rather than one shared "must match" value.
  - Verified no other test fake needs the shutdown stub: _FakeRunManager in test_worker_langfuse_metadata.py is a run_agent() argument (worker path), never injected into langgraph_runtime, so it never receives shutdown().
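
The ordering described above (cancel, bounded wait, then close the shared resource) can be sketched in isolation. This is a minimal illustration under stated assumptions, not the gateway's implementation: `FakePool`, `run`, `drain`, and `main` are hypothetical names, and `FakePool` only imitates a pool that rejects writes once closed.

```python
import asyncio


class FakePool:
    """Hypothetical stand-in for a connection pool (not psycopg_pool)."""

    def __init__(self) -> None:
        self.closed = False
        self.writes: list[str] = []

    async def put(self, item: str) -> None:
        # A closed pool rejects writes, loosely imitating PoolClosed.
        if self.closed:
            raise RuntimeError("pool closed")
        self.writes.append(item)

    def close(self) -> None:
        self.closed = True


async def run(pool: FakePool, name: str, started: asyncio.Event) -> None:
    """Simulated run: works until cancelled, then flushes a final write."""
    try:
        started.set()
        await asyncio.sleep(3600)
    finally:
        # Stands in for the final checkpoint write that must land before close.
        await pool.put(f"{name}:final")


async def drain(tasks: list[asyncio.Task], timeout: float) -> set[asyncio.Task]:
    """Cancel every task, then wait at most ``timeout`` for them to settle."""
    if not tasks:
        return set()
    for task in tasks:
        task.cancel()
    _, pending = await asyncio.wait(tasks, timeout=timeout)
    return pending


async def main() -> tuple[list[str], int]:
    pool = FakePool()
    started = asyncio.Event()
    task = asyncio.create_task(run(pool, "run-1", started))
    await started.wait()  # make sure the run is in flight before shutting down
    # Drain first, close second. The shield keeps a cancellation of the caller
    # from abandoning the drain halfway through.
    pending = await asyncio.shield(drain([task], timeout=1.0))
    pool.close()
    return pool.writes, len(pending)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Swapping the last two steps in `main` (closing the pool before draining) makes the `finally` write hit the closed pool inside the run task, which is the shape of the race this commit removes.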
2026-06-10 09:25:57 +00:00 · 2026-06-07 11:24:30 +08:00
parent 9a5de8d6a5
commit 268fdd6968
4 changed files with 497 additions and 0 deletions
@@ -645,6 +645,98 @@ class RunManager:
            self._runs.pop(run_id, None)
        logger.debug("Run record %s cleaned up", run_id)

+    async def shutdown(self, *, timeout: float = 5.0) -> None:
+        """Cancel and bounded-await all in-flight runs on process shutdown.
+
+        Chat runs execute in fire-and-forget background ``asyncio`` tasks that
+        write checkpoints through a shared checkpointer. On shutdown the
+        checkpointer's resources (e.g. the postgres connection pool owned by the
+        gateway's ``AsyncExitStack``) are torn down; if a run task is still
+        mid-graph at that point, langgraph's
+        ``AsyncPregelLoop._checkpointer_put_after_previous`` runs its
+        ``finally: await checkpointer.aput(...)`` against the closed pool. Because
+        that put runs in a langgraph-internal task (not on ``run_agent``'s call
+        stack), the resulting ``psycopg_pool.PoolClosed`` is not catchable by the
+        worker and surfaces as an unhandled exception during ``asyncio.run()``
+        shutdown (bytedance/deer-flow issue #3373).
+
+        Draining in-flight runs *before* the checkpointer is closed lets each
+        run that settles within ``timeout`` flush its final checkpoint while
+        resources are still open. Only runs that do **not** settle on their own
+        are marked ``interrupted`` — a run that completes (e.g. ``success``)
+        during the drain keeps its real terminal status instead of being
+        blanket-overwritten. The whole drain, including the trailing status
+        persistence, is bounded by ``timeout`` so a run stuck in cleanup (or a
+        slow store under DB pressure) cannot hang worker shutdown — the
+        precondition for the signal-reentrancy deadlock guarded by
+        ``app.gateway.app._SHUTDOWN_HOOK_TIMEOUT_SECONDS``. Runs still active
+        after ``timeout`` are logged and may still race teardown.
+        """
+        loop = asyncio.get_running_loop()
+        deadline = loop.time() + timeout
+
+        async with self._lock:
+            inflight = [record for record in self._runs.values() if record.status in (RunStatus.pending, RunStatus.running) and record.task is not None and not record.task.done()]
+            for record in inflight:
+                record.abort_action = "interrupt"
+                record.abort_event.set()
+                record.task.cancel()  # type: ignore[union-attr]  # filtered above
+                # Status is decided AFTER the drain (below), not here: a run that
+                # completes on its own during the drain must keep its real status.
+
+        if not inflight:
+            return
+
+        tasks = [record.task for record in inflight]
+        _, pending = await asyncio.wait(tasks, timeout=timeout)
+
+        # Only mark/persist ``interrupted`` for runs that did not settle on their
+        # own (still pending after the timeout, or ended cancelled). A run that
+        # finished normally during the drain keeps the status it set for itself.
+        to_persist: list[RunRecord] = []
+        async with self._lock:
+            for record in inflight:
+                task = record.task
+                if task not in pending and not task.cancelled():
+                    # Completed on its own — retrieve any surfaced exception so it
+                    # is not reported as "never retrieved", and keep its status.
+                    task.exception()  # type: ignore[union-attr]  # done & not cancelled
+                    continue
+                if record.status in (RunStatus.pending, RunStatus.running):
+                    record.status = RunStatus.interrupted
+                    record.updated_at = _now_iso()
+                to_persist.append(record)
+
+        # Bound the trailing status persistence within the remaining budget so a
+        # slow store (``_call_store_with_retry`` can back off under DB pressure)
+        # cannot push shutdown past ``timeout``.
+        if to_persist:
+            remaining = deadline - loop.time()
+            if remaining <= 0:
+                logger.warning("Run drain budget exhausted before persisting %d interrupted run(s) on shutdown", len(to_persist))
+            else:
+                try:
+                    results = await asyncio.wait_for(
+                        asyncio.gather(*(self._persist_status(record, RunStatus.interrupted) for record in to_persist), return_exceptions=True),
+                        timeout=remaining,
+                    )
+                except TimeoutError:
+                    logger.warning("Run drain status persistence exceeded the %.1fs budget; %d record(s) may not be persisted", timeout, len(to_persist))
+                else:
+                    # ``_persist_status`` is best-effort: it catches and logs its
+                    # own failures, returning ``False``. Inspect the aggregate so a
+                    # partial failure is surfaced at shutdown level (with the
+                    # run_id) instead of being silently swallowed by the gather.
+                    for record, result in zip(to_persist, results):
+                        if isinstance(result, Exception):
+                            logger.warning("Unexpected error persisting interrupted status for run %s during shutdown: %r", record.run_id, result)
+                        elif result is False:
+                            logger.warning("Could not persist interrupted status for run %s during shutdown", record.run_id)
+
+        if pending:
+            logger.warning("Run drain exceeded %.1fs on shutdown; %d run task(s) still active and may race checkpointer teardown", timeout, len(pending))
+        logger.info("Drained %d in-flight run(s) on shutdown (%d settled within %.1fs)", len(inflight), len(inflight) - len(pending), timeout)
+

 class ConflictError(Exception):
    """Raised when multitask_strategy=reject and thread has inflight runs."""