mirror of
https://github.com/bytedance/deer-flow.git
synced 2026-05-22 07:56:48 +00:00
fix(subagents): add cooperative cancellation for subagent threads (#1873)
* fix(subagents): add cooperative cancellation for subagent threads
Subagent tasks run inside ThreadPoolExecutor threads with their own
event loop (asyncio.run). When a user clicks stop, RunManager cancels
the parent asyncio.Task, but Future.cancel() cannot terminate a running
thread and asyncio.Event does not propagate across event loops. This
causes subagent threads to keep executing (writing files, calling LLMs)
even after the user explicitly stops the run.
Fix: add a threading.Event (cancel_event) to SubagentResult and check
it cooperatively in _aexecute()'s astream iteration loop. On cancel,
request_cancel_background_task() sets the event, and the thread exits
at the next iteration boundary.
Changes:
- executor.py: Add cancel_event field to SubagentResult, check it in
_aexecute loop, set it on timeout, add request_cancel_background_task
- task_tool.py: Call request_cancel_background_task on CancelledError
* fix(subagents): guard cancel status and add pre-check before astream
- Only overwrite status to FAILED when still RUNNING, preserving
TIMED_OUT set by the scheduler thread.
- Add cancel_event pre-check before entering the astream loop so
cancellation is detected immediately when already signalled.
* fix(subagents): guard status updates with lock to prevent race condition
Wrap the check-and-set on result.status in _aexecute with
_background_tasks_lock so the timeout handler in execute_async
cannot interleave between the read and write.
* fix(subagents): add dedicated CANCELLED status for user cancellation
Introduce SubagentStatus.CANCELLED to distinguish user-initiated
cancellation from actual execution failures. Update _aexecute,
task_tool polling, cleanup terminal-status sets, and test fixtures.
* test(subagents): add cancellation tests and fix timeout regression test
- Add dedicated TestCooperativeCancellation test class with 6 tests:
- Pre-set cancel_event prevents astream from starting
- Mid-stream cancel_event returns CANCELLED immediately
- request_cancel_background_task() sets cancel_event correctly
- request_cancel on nonexistent task is a no-op
- Real execute_async timeout does not overwrite CANCELLED (deterministic
threading.Event sync, no wall-clock sleeps)
- cleanup_background_task removes CANCELLED tasks
- Add task_tool cancellation coverage:
- test_cancellation_calls_request_cancel: assert CancelledError path
calls request_cancel_background_task(task_id)
- test_task_tool_returns_cancelled_message: assert CANCELLED polling
branch emits task_cancelled event and returns expected message
- Fix pre-existing test infrastructure issue: add deerflow.sandbox.security
to _MOCKED_MODULE_NAMES (fixes ModuleNotFoundError for all executor tests)
- Add RUNNING guard to timeout handler in executor.py to prevent
TIMED_OUT from overwriting CANCELLED status
- Add cooperative cancellation granularity comment documenting that
cancellation is only detected at astream iteration boundaries
---------
Co-authored-by: lulusiyuyu <lulusiyuyu@users.noreply.github.com>
This commit is contained in:
@@ -14,7 +14,7 @@ from deerflow.agents.lead_agent.prompt import get_skills_prompt_section
|
||||
from deerflow.agents.thread_state import ThreadState
|
||||
from deerflow.sandbox.security import LOCAL_BASH_SUBAGENT_DISABLED_MESSAGE, is_host_bash_allowed
|
||||
from deerflow.subagents import SubagentExecutor, get_available_subagent_names, get_subagent_config
|
||||
from deerflow.subagents.executor import SubagentStatus, cleanup_background_task, get_background_task_result
|
||||
from deerflow.subagents.executor import SubagentStatus, cleanup_background_task, get_background_task_result, request_cancel_background_task
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
@@ -182,6 +182,11 @@ async def task_tool(
|
||||
logger.error(f"[trace={trace_id}] Task {task_id} failed: {result.error}")
|
||||
cleanup_background_task(task_id)
|
||||
return f"Task failed. Error: {result.error}"
|
||||
elif result.status == SubagentStatus.CANCELLED:
|
||||
writer({"type": "task_cancelled", "task_id": task_id, "error": result.error})
|
||||
logger.info(f"[trace={trace_id}] Task {task_id} cancelled: {result.error}")
|
||||
cleanup_background_task(task_id)
|
||||
return "Task cancelled by user."
|
||||
elif result.status == SubagentStatus.TIMED_OUT:
|
||||
writer({"type": "task_timed_out", "task_id": task_id, "error": result.error})
|
||||
logger.warning(f"[trace={trace_id}] Task {task_id} timed out: {result.error}")
|
||||
@@ -204,6 +209,11 @@ async def task_tool(
|
||||
writer({"type": "task_timed_out", "task_id": task_id})
|
||||
return f"Task polling timed out after {timeout_minutes} minutes. This may indicate the background task is stuck. Status: {result.status.value}"
|
||||
except asyncio.CancelledError:
|
||||
# Signal the background subagent thread to stop cooperatively.
|
||||
# Without this, the thread (running in ThreadPoolExecutor with its
|
||||
# own event loop via asyncio.run) would continue executing even
|
||||
# after the parent task is cancelled.
|
||||
request_cancel_background_task(task_id)
|
||||
|
||||
async def cleanup_when_done() -> None:
|
||||
max_cleanup_polls = max_poll_count
|
||||
@@ -214,7 +224,7 @@ async def task_tool(
|
||||
if result is None:
|
||||
return
|
||||
|
||||
if result.status in {SubagentStatus.COMPLETED, SubagentStatus.FAILED, SubagentStatus.TIMED_OUT} or getattr(result, "completed_at", None) is not None:
|
||||
if result.status in {SubagentStatus.COMPLETED, SubagentStatus.FAILED, SubagentStatus.CANCELLED, SubagentStatus.TIMED_OUT} or getattr(result, "completed_at", None) is not None:
|
||||
cleanup_background_task(task_id)
|
||||
return
|
||||
|
||||
|
||||
Reference in New Issue
Block a user