fix(subagents): add cooperative cancellation for subagent threads (#1873)

* fix(subagents): add cooperative cancellation for subagent threads

Subagent tasks run inside ThreadPoolExecutor threads, each with its own
event loop (asyncio.run). When a user clicks stop, RunManager cancels
the parent asyncio.Task, but Future.cancel() cannot terminate a running
thread, and an asyncio.Event is not thread-safe, so it cannot signal a
loop running in another thread. As a result, subagent threads keep
executing (writing files, calling LLMs) even after the user explicitly
stops the run.

Fix: add a threading.Event (cancel_event) to SubagentResult and check
it cooperatively in _aexecute()'s astream iteration loop. On cancel,
request_cancel_background_task() sets the event, and the thread exits
at the next iteration boundary.
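
A minimal sketch of the pattern described above, not the project's actual code. `fake_astream`, `consume`, `run_in_worker`, and `demo` are hypothetical names, and the `on_chunk` handshake exists only to make the demo deterministic:

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor


async def fake_astream(n):
    """Hypothetical stand-in for agent.astream(): yields n chunks."""
    for i in range(n):
        await asyncio.sleep(0)  # give the loop a chance to run, like real I/O
        yield i


async def consume(cancel_event, seen, on_chunk=None):
    """Consume the stream, checking cancel_event at each iteration boundary."""
    if cancel_event.is_set():  # pre-check before streaming starts
        return "cancelled"
    async for chunk in fake_astream(5):
        if cancel_event.is_set():  # cooperative check, once per chunk
            return "cancelled"
        seen.append(chunk)
        if on_chunk is not None:
            on_chunk(chunk)
    return "completed"


def run_in_worker(cancel_event, seen, on_chunk=None):
    """Runs in a pool thread with its own event loop (asyncio.run)."""
    return asyncio.run(consume(cancel_event, seen, on_chunk))


def demo():
    cancel_event = threading.Event()  # thread-safe, unlike asyncio.Event
    first_chunk = threading.Event()
    resume = threading.Event()
    seen = []

    def on_chunk(chunk):
        # Demo-only handshake: pause the worker until the parent has cancelled.
        first_chunk.set()
        resume.wait(timeout=5)

    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(run_in_worker, cancel_event, seen, on_chunk)
        first_chunk.wait(timeout=5)
        cancel_event.set()  # the role request_cancel_background_task() plays
        resume.set()
        return future.result(timeout=5), seen
```

Here the worker processes the first chunk, the parent sets the event, and the worker returns at the next chunk instead of draining the stream. Work done inside a single iteration is not interrupted.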

Changes:
- executor.py: Add cancel_event field to SubagentResult, check it in
  _aexecute loop, set it on timeout, add request_cancel_background_task
- task_tool.py: Call request_cancel_background_task on CancelledError

* fix(subagents): guard cancel status and add pre-check before astream

- Only overwrite status to FAILED when still RUNNING, preserving
  TIMED_OUT set by the scheduler thread.
- Add cancel_event pre-check before entering the astream loop so
  cancellation is detected immediately when already signalled.

* fix(subagents): guard status updates with lock to prevent race condition

Wrap the check-and-set on result.status in _aexecute with
_background_tasks_lock so the timeout handler in execute_async
cannot interleave between the read and write.
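
The guarded check-and-set can be sketched roughly as follows. `Status`, `Result`, `finish`, and `race` are hypothetical names, and `_lock` stands in for `_background_tasks_lock`:

```python
import threading
from enum import Enum


class Status(Enum):
    RUNNING = "running"
    CANCELLED = "cancelled"
    TIMED_OUT = "timed_out"


class Result:
    def __init__(self):
        self.status = Status.RUNNING


_lock = threading.Lock()  # stands in for _background_tasks_lock


def finish(result, new_status):
    """Move result to a terminal status only if it is still RUNNING.

    The read and the write happen under one lock acquisition, so a second
    writer cannot slip in between them.
    """
    with _lock:
        if result.status == Status.RUNNING:
            result.status = new_status
            return True
        return False


def race(n_threads=8):
    """Let several threads compete to set the terminal status."""
    result = Result()
    wins = []

    def worker(status):
        if finish(result, status):
            wins.append(status)

    threads = [
        threading.Thread(
            target=worker,
            args=(Status.CANCELLED if i % 2 else Status.TIMED_OUT,),
        )
        for i in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result, wins
```

Whichever thread acquires the lock first decides the terminal status, and every later attempt is a no-op.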

* fix(subagents): add dedicated CANCELLED status for user cancellation

Introduce SubagentStatus.CANCELLED to distinguish user-initiated
cancellation from actual execution failures. Update _aexecute,
task_tool polling, cleanup terminal-status sets, and test fixtures.
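
As a rough illustration of why a separate status helps, the sketch below mirrors the enum members visible in the diff. `TERMINAL_STATUSES` and `poll_message` are hypothetical helpers, not names from the codebase:

```python
from enum import Enum


class SubagentStatus(Enum):
    # Only the members visible in the diff are reproduced here.
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"
    TIMED_OUT = "timed_out"


# Hypothetical constant: statuses after which a task can be cleaned up.
TERMINAL_STATUSES = frozenset(
    {
        SubagentStatus.COMPLETED,
        SubagentStatus.FAILED,
        SubagentStatus.CANCELLED,
        SubagentStatus.TIMED_OUT,
    }
)


def poll_message(status, error=None):
    """Hypothetical helper: map a polled status to the text returned to the caller."""
    if status == SubagentStatus.CANCELLED:
        return "Task cancelled by user."
    if status == SubagentStatus.FAILED:
        return f"Task failed. Error: {error}"
    return None  # other statuses are omitted in this sketch
```

With a dedicated member, the polling code can report a user stop without wording it as an error.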

* test(subagents): add cancellation tests and fix timeout regression test

- Add dedicated TestCooperativeCancellation test class with 6 tests:
  - Pre-set cancel_event prevents astream from starting
  - Mid-stream cancel_event returns CANCELLED immediately
  - request_cancel_background_task() sets cancel_event correctly
  - request_cancel on nonexistent task is a no-op
  - Real execute_async timeout does not overwrite CANCELLED (deterministic
    threading.Event sync, no wall-clock sleeps)
  - cleanup_background_task removes CANCELLED tasks

- Add task_tool cancellation coverage:
  - test_cancellation_calls_request_cancel: assert CancelledError path
    calls request_cancel_background_task(task_id)
  - test_task_tool_returns_cancelled_message: assert CANCELLED polling
    branch emits task_cancelled event and returns expected message

- Fix pre-existing test infrastructure issue: add deerflow.sandbox.security
  to _MOCKED_MODULE_NAMES (fixes ModuleNotFoundError for all executor tests)

- Add RUNNING guard to timeout handler in executor.py to prevent
  TIMED_OUT from overwriting CANCELLED status

- Add cooperative cancellation granularity comment documenting that
  cancellation is only detected at astream iteration boundaries

---------

Co-authored-by: lulusiyuyu <lulusiyuyu@users.noreply.github.com>
lulusiyuyu
2026-04-07 11:12:25 +08:00
committed by GitHub
parent 7643a46fca
commit f0dd8cb0d2
4 changed files with 397 additions and 7 deletions
@@ -6,7 +6,7 @@ import threading
 import uuid
 from concurrent.futures import Future, ThreadPoolExecutor
 from concurrent.futures import TimeoutError as FuturesTimeoutError
-from dataclasses import dataclass
+from dataclasses import dataclass, field
 from datetime import datetime
 from enum import Enum
 from typing import Any
@@ -30,6 +30,7 @@ class SubagentStatus(Enum):
     RUNNING = "running"
     COMPLETED = "completed"
     FAILED = "failed"
+    CANCELLED = "cancelled"
     TIMED_OUT = "timed_out"
@@ -56,6 +57,7 @@ class SubagentResult:
     started_at: datetime | None = None
     completed_at: datetime | None = None
     ai_messages: list[dict[str, Any]] | None = None
+    cancel_event: threading.Event = field(default_factory=threading.Event, repr=False)

     def __post_init__(self):
         """Initialize mutable defaults."""
@@ -241,7 +243,31 @@ class SubagentExecutor:
             # Use stream instead of invoke to get real-time updates
             # This allows us to collect AI messages as they are generated
             final_state = None
+
+            # Pre-check: bail out immediately if already cancelled before streaming starts
+            if result.cancel_event.is_set():
+                logger.info(f"[trace={self.trace_id}] Subagent {self.config.name} cancelled before streaming")
+                with _background_tasks_lock:
+                    if result.status == SubagentStatus.RUNNING:
+                        result.status = SubagentStatus.CANCELLED
+                        result.error = "Cancelled by user"
+                        result.completed_at = datetime.now()
+                return result
+
             async for chunk in agent.astream(state, config=run_config, context=context, stream_mode="values"):  # type: ignore[arg-type]
+                # Cooperative cancellation: check if parent requested stop.
+                # Note: cancellation is only detected at astream iteration boundaries,
+                # so long-running tool calls within a single iteration will not be
+                # interrupted until the next chunk is yielded.
+                if result.cancel_event.is_set():
+                    logger.info(f"[trace={self.trace_id}] Subagent {self.config.name} cancelled by parent")
+                    with _background_tasks_lock:
+                        if result.status == SubagentStatus.RUNNING:
+                            result.status = SubagentStatus.CANCELLED
+                            result.error = "Cancelled by user"
+                            result.completed_at = datetime.now()
+                    return result
+
                 final_state = chunk

                 # Extract AI messages from the current state
@@ -437,10 +463,12 @@ class SubagentExecutor:
                 except FuturesTimeoutError:
                     logger.error(f"[trace={self.trace_id}] Subagent {self.config.name} execution timed out after {self.config.timeout_seconds}s")
                     with _background_tasks_lock:
-                        _background_tasks[task_id].status = SubagentStatus.TIMED_OUT
-                        _background_tasks[task_id].error = f"Execution timed out after {self.config.timeout_seconds} seconds"
-                        _background_tasks[task_id].completed_at = datetime.now()
-                    # Cancel the future (best effort - may not stop the actual execution)
+                        if _background_tasks[task_id].status == SubagentStatus.RUNNING:
+                            _background_tasks[task_id].status = SubagentStatus.TIMED_OUT
+                            _background_tasks[task_id].error = f"Execution timed out after {self.config.timeout_seconds} seconds"
+                            _background_tasks[task_id].completed_at = datetime.now()
+                    # Signal cooperative cancellation and cancel the future
+                    result_holder.cancel_event.set()
                     execution_future.cancel()
             except Exception as e:
                 logger.exception(f"[trace={self.trace_id}] Subagent {self.config.name} async execution failed")
@@ -456,6 +484,24 @@ class SubagentExecutor:
 MAX_CONCURRENT_SUBAGENTS = 3


+def request_cancel_background_task(task_id: str) -> None:
+    """Signal a running background task to stop.
+
+    Sets the cancel_event on the task, which is checked cooperatively
+    by ``_aexecute`` during ``agent.astream()`` iteration. This allows
+    subagent threads — which cannot be force-killed via ``Future.cancel()``
+    — to stop at the next iteration boundary.
+
+    Args:
+        task_id: The task ID to cancel.
+    """
+    with _background_tasks_lock:
+        result = _background_tasks.get(task_id)
+        if result is not None:
+            result.cancel_event.set()
+            logger.info("Requested cancellation for background task %s", task_id)
+
+
 def get_background_task_result(task_id: str) -> SubagentResult | None:
     """Get the result of a background task.
@@ -503,6 +549,7 @@ def cleanup_background_task(task_id: str) -> None:
         is_terminal_status = result.status in {
             SubagentStatus.COMPLETED,
             SubagentStatus.FAILED,
+            SubagentStatus.CANCELLED,
             SubagentStatus.TIMED_OUT,
         }
         if is_terminal_status or result.completed_at is not None:
@@ -14,7 +14,7 @@ from deerflow.agents.lead_agent.prompt import get_skills_prompt_section
 from deerflow.agents.thread_state import ThreadState
 from deerflow.sandbox.security import LOCAL_BASH_SUBAGENT_DISABLED_MESSAGE, is_host_bash_allowed
 from deerflow.subagents import SubagentExecutor, get_available_subagent_names, get_subagent_config
-from deerflow.subagents.executor import SubagentStatus, cleanup_background_task, get_background_task_result
+from deerflow.subagents.executor import SubagentStatus, cleanup_background_task, get_background_task_result, request_cancel_background_task

 logger = logging.getLogger(__name__)
@@ -182,6 +182,11 @@ async def task_tool(
                 logger.error(f"[trace={trace_id}] Task {task_id} failed: {result.error}")
                 cleanup_background_task(task_id)
                 return f"Task failed. Error: {result.error}"
+            elif result.status == SubagentStatus.CANCELLED:
+                writer({"type": "task_cancelled", "task_id": task_id, "error": result.error})
+                logger.info(f"[trace={trace_id}] Task {task_id} cancelled: {result.error}")
+                cleanup_background_task(task_id)
+                return "Task cancelled by user."
             elif result.status == SubagentStatus.TIMED_OUT:
                 writer({"type": "task_timed_out", "task_id": task_id, "error": result.error})
                 logger.warning(f"[trace={trace_id}] Task {task_id} timed out: {result.error}")
@@ -204,6 +209,11 @@ async def task_tool(
                 writer({"type": "task_timed_out", "task_id": task_id})
                 return f"Task polling timed out after {timeout_minutes} minutes. This may indicate the background task is stuck. Status: {result.status.value}"
     except asyncio.CancelledError:
+        # Signal the background subagent thread to stop cooperatively.
+        # Without this, the thread (running in ThreadPoolExecutor with its
+        # own event loop via asyncio.run) would continue executing even
+        # after the parent task is cancelled.
+        request_cancel_background_task(task_id)

         async def cleanup_when_done() -> None:
             max_cleanup_polls = max_poll_count
@@ -214,7 +224,7 @@ async def task_tool(
                 if result is None:
                     return

-                if result.status in {SubagentStatus.COMPLETED, SubagentStatus.FAILED, SubagentStatus.TIMED_OUT} or getattr(result, "completed_at", None) is not None:
+                if result.status in {SubagentStatus.COMPLETED, SubagentStatus.FAILED, SubagentStatus.CANCELLED, SubagentStatus.TIMED_OUT} or getattr(result, "completed_at", None) is not None:
                     cleanup_background_task(task_id)
                     return