mirror of
https://github.com/bytedance/deer-flow.git
synced 2026-06-10 09:25:57 +00:00
fix(gateway): drain in-flight runs before closing checkpointer on shutdown (#3381)
* fix(gateway): drain in-flight runs before closing checkpointer on shutdown

Chat runs execute in fire-and-forget background asyncio tasks that write
checkpoints through a shared checkpointer. On shutdown, langgraph_runtime's
AsyncExitStack tore down the checkpointer's postgres connection pool while
those run tasks were still mid-graph. langgraph's
AsyncPregelLoop._checkpointer_put_after_previous then ran its
`finally: await checkpointer.aput(...)` against the closed pool, raising
psycopg_pool.PoolClosed. Because that put runs in a langgraph-internal task
(not on run_agent's call stack), run_agent's try/except cannot catch it and
it surfaces as "unhandled exception during asyncio.run() shutdown".

Add RunManager.shutdown() to cancel and bounded-await all in-flight runs, and
call it from langgraph_runtime BEFORE the AsyncExitStack closes the
checkpointer, so the final checkpoint write lands while the pool is still
open. The drain is bounded by a timeout so a stuck run cannot hang worker
shutdown, and is shielded so a second shutdown signal cannot abandon it
mid-drain and reopen the race.

Closes #3373

* fix(gateway): address review: preserve completed-run status, bound drain persistence

Addresses Copilot review on #3381:

- RunManager.shutdown(): decide run status AFTER the drain. Under the lock it
  now only requests cancellation; after asyncio.wait it marks/persists
  `interrupted` only for runs still pending or ended cancelled. A run that
  completes (e.g. `success`) during the drain window keeps its real terminal
  status instead of being unconditionally overwritten.
- Bound the trailing status persistence within the timeout budget
  (deadline = loop.time()+timeout; gather wrapped in asyncio.wait_for) so a
  slow store backing off under DB pressure cannot push shutdown past the
  deadline.
- deps: use asyncio.create_task instead of asyncio.ensure_future.
- tests: wait deterministically for the run to be in-flight (poll the first
  checkpoint) instead of a fixed sleep; init shutdown_calls explicitly in the
  recovery test double; add regression test asserting a run completing during
  the drain keeps its status (in memory and in the store).

* fix(gateway): address maintainer review: surface failed drain persists, clarify timeout constant

Addresses @WillemJiang review on #3381:

- shutdown(): inspect the gather result of the trailing interrupted-status
  persistence. _persist_status is best-effort (it catches + logs its own
  failure with exc_info and returns False, so it never raises out of the
  gather), but the aggregate result was never checked, so a partial failure
  had no shutdown-level visibility. Now any escaped Exception is logged, and
  any False (a persist that did not confirm) is logged with the run_id. Added
  regression test test_shutdown_surfaces_failed_interrupted_persist.
- deps: clarify the _RUN_DRAIN_TIMEOUT_SECONDS comment: state the actual value
  of _SHUTDOWN_HOOK_TIMEOUT_SECONDS (5.0s) and that both count toward the
  lifespan shutdown window. Kept as two separate constants (independent
  teardown steps that may diverge) rather than one shared "must match" value.
- Verified no other test fake needs the shutdown stub: _FakeRunManager in
  test_worker_langfuse_metadata.py is a run_agent() argument (worker path),
  never injected into langgraph_runtime, so it never receives shutdown().
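
The ordering described in the commit message (drain first, close the pool second) can be illustrated with a minimal, self-contained sketch. It is not the gateway's code: `FakePool`, `run`, and `drain` are hypothetical names, and a plain `RuntimeError` stands in for `psycopg_pool.PoolClosed`.

```python
import asyncio


class FakePool:
    """Hypothetical stand-in for the checkpointer's connection pool."""

    def __init__(self) -> None:
        self.closed = False
        self.writes: list[str] = []

    async def put(self, item: str) -> None:
        if self.closed:
            # Stands in for psycopg_pool.PoolClosed in this sketch.
            raise RuntimeError("pool closed")
        self.writes.append(item)

    async def close(self) -> None:
        self.closed = True


async def run(pool: FakePool, started: asyncio.Event) -> None:
    """Hypothetical run: a long step whose final write happens in `finally`."""
    try:
        started.set()
        await asyncio.sleep(3600)
    finally:
        # Final checkpoint write; only succeeds while the pool is open.
        await pool.put("final-checkpoint")


async def drain(tasks: list[asyncio.Task], timeout: float) -> set[asyncio.Task]:
    """Cancel the tasks and wait up to `timeout`; return those still pending."""
    for task in tasks:
        task.cancel()
    if not tasks:
        return set()  # asyncio.wait() rejects an empty collection
    _, pending = await asyncio.wait(tasks, timeout=timeout)
    return pending


async def main() -> tuple[list[str], int]:
    pool = FakePool()
    started = asyncio.Event()
    task = asyncio.create_task(run(pool, started))
    await started.wait()  # make sure the run is in flight
    pending = await drain([task], timeout=1.0)  # drain first ...
    await pool.close()  # ... then close the pool
    return pool.writes, len(pending)


writes, still_pending = asyncio.run(main())
```

Swapping the last two steps of `main()` (closing the pool before draining) would make the `finally` write hit the closed pool, which is the race this commit removes.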
@@ -645,6 +645,98 @@ class RunManager:
        self._runs.pop(run_id, None)
        logger.debug("Run record %s cleaned up", run_id)

    async def shutdown(self, *, timeout: float = 5.0) -> None:
        """Cancel and bounded-await all in-flight runs on process shutdown.

        Chat runs execute in fire-and-forget background ``asyncio`` tasks that
        write checkpoints through a shared checkpointer. On shutdown the
        checkpointer's resources (e.g. the postgres connection pool owned by the
        gateway's ``AsyncExitStack``) are torn down; if a run task is still
        mid-graph at that point, langgraph's
        ``AsyncPregelLoop._checkpointer_put_after_previous`` runs its
        ``finally: await checkpointer.aput(...)`` against the closed pool. Because
        that put runs in a langgraph-internal task (not on ``run_agent``'s call
        stack), the resulting ``psycopg_pool.PoolClosed`` is not catchable by the
        worker and surfaces as an unhandled exception during ``asyncio.run()``
        shutdown (bytedance/deer-flow issue #3373).

        Draining in-flight runs *before* the checkpointer is closed lets each
        run that settles within ``timeout`` flush its final checkpoint while
        resources are still open. Only runs that do **not** settle on their own
        are marked ``interrupted``; a run that completes (e.g. ``success``)
        during the drain keeps its real terminal status instead of being
        blanket-overwritten. The whole drain, including the trailing status
        persistence, is bounded by ``timeout`` so a run stuck in cleanup (or a
        slow store under DB pressure) cannot hang worker shutdown, which is the
        precondition for the signal-reentrancy deadlock guarded by
        ``app.gateway.app._SHUTDOWN_HOOK_TIMEOUT_SECONDS``. Runs still active
        after ``timeout`` are logged and may still race teardown.
        """
        loop = asyncio.get_running_loop()
        deadline = loop.time() + timeout

        async with self._lock:
            inflight = [record for record in self._runs.values() if record.status in (RunStatus.pending, RunStatus.running) and record.task is not None and not record.task.done()]
            for record in inflight:
                record.abort_action = "interrupt"
                record.abort_event.set()
                record.task.cancel()  # type: ignore[union-attr]  # filtered above
                # Status is decided AFTER the drain (below), not here: a run that
                # completes on its own during the drain must keep its real status.

        if not inflight:
            return

        tasks = [record.task for record in inflight]
        _, pending = await asyncio.wait(tasks, timeout=timeout)

        # Only mark/persist ``interrupted`` for runs that did not settle on their
        # own (still pending after the timeout, or ended cancelled). A run that
        # finished normally during the drain keeps the status it set for itself.
        to_persist: list[RunRecord] = []
        async with self._lock:
            for record in inflight:
                task = record.task
                if task not in pending and not task.cancelled():
                    # Completed on its own: retrieve any surfaced exception so it
                    # is not reported as "never retrieved", and keep its status.
                    task.exception()  # type: ignore[union-attr]  # done & not cancelled
                    continue
                if record.status in (RunStatus.pending, RunStatus.running):
                    record.status = RunStatus.interrupted
                    record.updated_at = _now_iso()
                    to_persist.append(record)

        # Bound the trailing status persistence within the remaining budget so a
        # slow store (``_call_store_with_retry`` can back off under DB pressure)
        # cannot push shutdown past ``timeout``.
        if to_persist:
            remaining = deadline - loop.time()
            if remaining <= 0:
                logger.warning("Run drain budget exhausted before persisting %d interrupted run(s) on shutdown", len(to_persist))
            else:
                try:
                    results = await asyncio.wait_for(
                        asyncio.gather(*(self._persist_status(record, RunStatus.interrupted) for record in to_persist), return_exceptions=True),
                        timeout=remaining,
                    )
                except TimeoutError:
                    logger.warning("Run drain status persistence exceeded the %.1fs budget; %d record(s) may not be persisted", timeout, len(to_persist))
                else:
                    # ``_persist_status`` is best-effort: it catches and logs its
                    # own failures, returning ``False``. Inspect the aggregate so a
                    # partial failure is surfaced at shutdown level (with the
                    # run_id) instead of being silently swallowed by the gather.
                    for record, result in zip(to_persist, results):
                        if isinstance(result, Exception):
                            logger.warning("Unexpected error persisting interrupted status for run %s during shutdown: %r", record.run_id, result)
                        elif result is False:
                            logger.warning("Could not persist interrupted status for run %s during shutdown", record.run_id)

        if pending:
            logger.warning("Run drain exceeded %.1fs on shutdown; %d run task(s) still active and may race checkpointer teardown", timeout, len(pending))
        logger.info("Drained %d in-flight run(s) on shutdown (%d settled within %.1fs)", len(inflight), len(inflight) - len(pending), timeout)


class ConflictError(Exception):
    """Raised when multitask_strategy=reject and thread has inflight runs."""
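
The "decide status after the drain" rule in `shutdown()` can be tried in isolation with a small sketch. Everything here is hypothetical and simplified: a plain dict stands in for the run records, and the two coroutines only simulate a run that finishes on its own versus one that ends cancelled.

```python
import asyncio

# Hypothetical run registry: run_id -> status string.
status = {"a": "running", "b": "running"}


async def finishes(run_id: str) -> None:
    """Simulates a run that settles normally even though it was cancelled."""
    try:
        await asyncio.sleep(3600)
    except asyncio.CancelledError:
        # Swallows the cancel and records its own terminal status.
        status[run_id] = "success"


async def ends_cancelled(run_id: str) -> None:
    """Simulates a run where the cancellation propagates."""
    await asyncio.sleep(3600)


async def main() -> None:
    tasks = {
        "a": asyncio.create_task(finishes("a")),
        "b": asyncio.create_task(ends_cancelled("b")),
    }
    await asyncio.sleep(0)  # let both tasks start before cancelling
    for task in tasks.values():
        task.cancel()
    _, pending = await asyncio.wait(tasks.values(), timeout=1.0)

    for run_id, task in tasks.items():
        if task not in pending and not task.cancelled():
            task.exception()  # retrieve any exception so it is not left unobserved
            continue  # settled on its own: keep the status it set
        if status[run_id] in ("pending", "running"):
            status[run_id] = "interrupted"


asyncio.run(main())
```

Marking every in-flight run `interrupted` before the wait would overwrite the `success` that run `a` reports for itself, which is the behavior the review fix avoids.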
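
The shared deadline used for the trailing persistence (one budget for both the drain and the status writes) can be sketched as follows. `persist` and `persist_within_budget` are hypothetical helpers written for this illustration; the real `_persist_status` is only assumed to behave as the diff comments describe (best-effort, returning `False` on failure).

```python
import asyncio


async def persist(run_id: str, delay: float, ok: bool) -> bool:
    """Hypothetical best-effort persist: returns False instead of raising."""
    await asyncio.sleep(delay)
    return ok


async def persist_within_budget(jobs: list[tuple[str, float, bool]], deadline: float) -> list[str]:
    """Return the run ids whose persist was not confirmed before `deadline`."""
    loop = asyncio.get_running_loop()
    run_ids = [run_id for run_id, _, _ in jobs]
    remaining = deadline - loop.time()
    if remaining <= 0:
        return run_ids  # budget already spent by the drain
    try:
        results = await asyncio.wait_for(
            asyncio.gather(*(persist(*job) for job in jobs), return_exceptions=True),
            timeout=remaining,
        )
    except asyncio.TimeoutError:
        return run_ids  # nothing confirmed in time
    # Anything other than True (False or an escaped exception) is unconfirmed.
    return [run_id for run_id, result in zip(run_ids, results) if result is not True]


async def main() -> tuple[list[str], list[str], list[str]]:
    loop = asyncio.get_running_loop()
    fast = await persist_within_budget([("r1", 0.0, True), ("r2", 0.0, False)], loop.time() + 1.0)
    slow = await persist_within_budget([("r3", 5.0, True)], loop.time() + 0.05)
    spent = await persist_within_budget([("r4", 0.0, True)], loop.time() - 1.0)
    return fast, slow, spent


fast, slow, spent = asyncio.run(main())
```

Computing `remaining` from an absolute deadline, rather than passing the full `timeout` a second time, is what keeps the total shutdown time within a single budget.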