fix(subagent): structured subagent_status field over text parsing (#3146) (#3154)

* fix(subagent): structured subagent_status field over text parsing

Closes #3146.

## Why

The frontend used to derive subtask card state by string-matching the
leading text of the `task` tool's result. That contract surface was
fragile — `#3107` BUG-007 and the `#3131` review both surfaced cases
where new backend wording (`Task cancelled by user.`,
`Task polling timed out after N minutes`, `ToolErrorHandlingMiddleware`
exception wrappers) silently broke the card lifecycle. The frontend
fallback kept growing more prefixes; any future rewording would break
it again.

## Design

1. **Backend → frontend contract**: `ToolMessage.additional_kwargs`
   carries `subagent_status` (one of `completed | failed | cancelled |
   timed_out | polling_timed_out`) and an optional `subagent_error`
   blob. The frontend prefers it over parsing `content`.

2. **Centralised stamping, not 8 sprinkled stamps**: rather than have
   each of `task_tool.py`'s 5 normal-return + 3 pre-execution `Error:`
   paths remember to set `additional_kwargs`, `ToolErrorHandlingMiddleware`
   stamps the field after every task-tool call. Adding a new return
   path in `task_tool.py` cannot now skip the stamp.

3. **Cross-language contract fixture**: the prefix→status mapping is
   the one piece both sides must agree on. The shared fixture at
   `contracts/subagent_status_contract.json` lists every backend return
   string, the expected status, and what the error substring should
   contain. Backend test (`backend/tests/test_subagent_status_contract.py`)
   and frontend test (`frontend/tests/unit/core/tasks/subtask-result.test.ts`)
   both load that fixture and assert the same cases. A wording drift on
   either side fails the matching language's test.

4. **Round-trip serialisation pinned**: the round-trip test asserts
   `ToolMessage.model_dump_json()` → `model_validate_json()` preserves
   `additional_kwargs.subagent_status`. Catches the case where a future
   LangChain or Pydantic upgrade silently strips unknown kwargs.

5. **Frontend status collapse documented**: the backend has five status
   values, the frontend card has three (`completed | failed |
   in_progress`). `cancelled` / `timed_out` / `polling_timed_out` all
   collapse to `failed` with the original status preserved in `error`.
   `parseSubtaskResult` returns `in_progress` for unknown values so a
   backend that ships a new enum variant before the frontend upgrades
   degrades to the legacy prefix fallback instead of getting pinned.

## Changes

Backend:
- `deerflow.subagents.status_contract` — new module exporting
  `SUBAGENT_STATUS_KEY`, `SUBAGENT_ERROR_KEY`,
  `SUBAGENT_STATUS_VALUES`, `extract_subagent_status(content)`, and
  `make_subagent_additional_kwargs(status, error)`.
- `ToolErrorHandlingMiddleware`: new `_stamp_task_subagent_status`
  helper centralises the stamp; `wrap_tool_call` / `awrap_tool_call`
  stamp on the success path; `_build_error_message` stamps on the
  wrapper path (carrying `ExcClass: detail` into `subagent_error`).
  Non-task tools are untouched.
- New tests: `test_subagent_status_contract.py` (19 cases from the
  shared fixture + status-enum / blank-error / unknown-status
  rejection) and `test_tool_error_handling_subagent_stamp.py`
  (middleware integration: terminal-content stamps, non-terminal
  doesn't, non-task tools untouched, async path mirrors sync,
  existing additional_kwargs survive, JSON round-trip preserved).

Frontend:
- `parseSubtaskResult(text, additionalKwargs?)` — prefers the
  structured stamp; falls back to the legacy prefix matcher for
  historical threads / unknown future status values.
- `STRUCTURED_STATUS_TO_SUBTASK` documents the five→three collapse.
- `message-list.tsx` passes `message.additional_kwargs` through.
- `subtask-result.test.ts` adds a structured-status block + a
  fixture-driven contract block; legacy prefix tests stay green for
  the fallback path.

Contract:
- `contracts/subagent_status_contract.json` — single source of truth
  both languages load. Whitespace variants, varied N for polling
  timeouts, the 3 pre-execution `Error:` returns task_tool produces,
  and the middleware wrapper shape are all in there.

## Test plan
- `make lint` clean (backend + frontend).
- `pytest tests/test_subagent_status_contract.py
   tests/test_tool_error_handling_subagent_stamp.py` → 37 passed.
- `pnpm test --run` → 103 passed (was 76, +27 new).

## Migration / fallback retirement

The text-prefix fallback stays in place until backend telemetry shows
the frontend never hits it for newly produced messages. At that point
a follow-up PR can drop the prefix branches and keep only the
structured-status branch.

Refs: bytedance/deer-flow#3138 (split summary), #3107 (origin), #3131
(prior prefix-only fix), #3146 (this issue).

* fix(subtask): back-fill result/error from text when structured status present

Three follow-ups on the PR #3154 review:

1. `readStructuredStatus` no longer short-circuits the prefix parse.
   The backend currently stamps only the `subagent_status` enum value;
   the human-facing `result` body and wrapped-error message still live
   in `ToolMessage.content`. Dropping the text parse meant successful
   tasks rendered empty completed pills and wrapped failures lost their
   diagnostic. Now both shapes get composed: structured status wins,
   `result`/`error` come from text when both sides agree, and a lying
   success body under a `failed` stamp is dropped instead of leaking.

2. Replace the ESM-incompatible `__dirname` fixture lookup in
   subtask-result.test.ts with `fileURLToPath(new URL(..., import.meta.url))`.
   The frontend package is `"type": "module"`, so the previous path
   would have thrown at runtime if anything ever changed under the
   contract directory.

3. Drop the `$schema` reference from contracts/subagent_status_contract.json
   pointing at a file that doesn't exist in the tree.

Three new tests cover the structured + text composition: completed
back-fills the success body, failed back-fills the wrapper text, and
unrecognised content under a `failed` stamp stays empty rather than
echoing noise.
This commit is contained in:
Xinmin Zeng
2026-06-07 22:49:55 +08:00
committed by GitHub
parent d8b728f7cb
commit 8d2e55a05f
8 changed files with 780 additions and 17 deletions
@@ -12,10 +12,45 @@ from langgraph.prebuilt.tool_node import ToolCallRequest
from langgraph.types import Command
from deerflow.config.app_config import AppConfig
from deerflow.subagents.status_contract import (
extract_subagent_status,
make_subagent_additional_kwargs,
)
logger = logging.getLogger(__name__)
_MISSING_TOOL_CALL_ID = "missing_tool_call_id"
_TASK_TOOL_NAME = "task"
def _stamp_task_subagent_status(message: ToolMessage, *, tool_name: str, error: str | None = None) -> ToolMessage:
"""Centralised stamping of ``additional_kwargs.subagent_status``.
Bytedance/deer-flow issue #3146: the frontend now reads the subagent
status from a structured field instead of parsing the leading text of
the task tool's return string. That contract is enforced here, in the
one place every task tool result flows through, rather than at the 5
normal-return + 3 ``Error:`` pre-execution branches inside
``task_tool.py``. Centralisation prevents the "added a new return
path, forgot the stamp" drift mode.
For non-``task`` tools this is a no-op so other tools' additional_kwargs
conventions are untouched.
"""
if tool_name != _TASK_TOOL_NAME:
return message
content = message.content if isinstance(message.content, str) else ""
status = extract_subagent_status(content)
if status is None:
# Non-terminal streaming chunks or unrecognised shapes leave the
# field unset so the frontend can keep the card on its in-progress
# placeholder until a real terminal frame arrives.
return message
stamp = make_subagent_additional_kwargs(status, error=error)
existing = dict(message.additional_kwargs or {})
existing.update(stamp)
message.additional_kwargs = existing
return message
class ToolErrorHandlingMiddleware(AgentMiddleware[AgentState]):
@@ -29,12 +64,31 @@ class ToolErrorHandlingMiddleware(AgentMiddleware[AgentState]):
detail = detail[:497] + "..."
content = f"Error: Tool '{tool_name}' failed with {exc.__class__.__name__}: {detail}. Continue with available context, or choose an alternative tool."
return ToolMessage(
message = ToolMessage(
content=content,
tool_call_id=tool_call_id,
name=tool_name,
status="error",
)
# Stamp the structured subagent status on the wrapper too: the
# frontend would otherwise have to fall back to prefix-matching
# ``Error: Tool 'task' failed ...`` on the wire. The ``subagent_error``
# carries the same ``ExcClass: detail`` shape the wrapper string
# uses so debugging artifacts stay aligned.
structured_error = f"{exc.__class__.__name__}: {detail}"
return _stamp_task_subagent_status(message, tool_name=tool_name, error=structured_error)
@staticmethod
def _maybe_stamp(result: ToolMessage | Command, request: ToolCallRequest) -> ToolMessage | Command:
"""Apply the subagent stamp to successful task tool returns.
``Command`` results bypass the stamp — they encode LangGraph
control flow rather than user-facing tool output.
"""
if not isinstance(result, ToolMessage):
return result
tool_name = str(request.tool_call.get("name") or "")
return _stamp_task_subagent_status(result, tool_name=tool_name)
@override
def wrap_tool_call(
@@ -43,13 +97,14 @@ class ToolErrorHandlingMiddleware(AgentMiddleware[AgentState]):
handler: Callable[[ToolCallRequest], ToolMessage | Command],
) -> ToolMessage | Command:
try:
return handler(request)
result = handler(request)
except GraphBubbleUp:
# Preserve LangGraph control-flow signals (interrupt/pause/resume).
raise
except Exception as exc:
logger.exception("Tool execution failed (sync): name=%s id=%s", request.tool_call.get("name"), request.tool_call.get("id"))
return self._build_error_message(request, exc)
return self._maybe_stamp(result, request)
@override
async def awrap_tool_call(
@@ -58,13 +113,14 @@ class ToolErrorHandlingMiddleware(AgentMiddleware[AgentState]):
handler: Callable[[ToolCallRequest], Awaitable[ToolMessage | Command]],
) -> ToolMessage | Command:
try:
return await handler(request)
result = await handler(request)
except GraphBubbleUp:
# Preserve LangGraph control-flow signals (interrupt/pause/resume).
raise
except Exception as exc:
logger.exception("Tool execution failed (async): name=%s id=%s", request.tool_call.get("name"), request.tool_call.get("id"))
return self._build_error_message(request, exc)
return self._maybe_stamp(result, request)
def _build_runtime_middlewares(
@@ -0,0 +1,102 @@
"""Backend↔frontend contract for the structured subagent status.
Bytedance/deer-flow issue #3146: the frontend used to derive the
subtask card state by string-matching the leading text of the
``task`` tool's result. That contract was fragile — any rewording on
the backend silently broke the card lifecycle, and the issue history
of #3107 BUG-007 / #3131 review showed it repeatedly.
This module replaces the text-shaped contract with a small structured
one carried inside ``ToolMessage.additional_kwargs``:
- ``subagent_status``: one of ``SUBAGENT_STATUS_VALUES``.
- ``subagent_error`` (optional): the human-readable error blob the
backend recorded.
The mapping from "task tool result text" to status is the one piece
the backend stamper (``ToolErrorHandlingMiddleware``) and the
frontend fallback parser must agree on. The shared fixture at
``contracts/subagent_status_contract.json`` is the single source of
truth — both sides' tests load it and assert behaviour.
"""
from __future__ import annotations
from typing import Literal
SUBAGENT_STATUS_KEY = "subagent_status"
SUBAGENT_ERROR_KEY = "subagent_error"
SubagentStatusValue = Literal[
"completed",
"failed",
"cancelled",
"timed_out",
"polling_timed_out",
]
#: Enumeration of every value ``subagent_status`` may take. Mirrors the
#: ``valid_status_values`` array in the shared fixture; the contract test
#: pins them against each other.
SUBAGENT_STATUS_VALUES: tuple[SubagentStatusValue, ...] = (
"completed",
"failed",
"cancelled",
"timed_out",
"polling_timed_out",
)
# Prefix table — ordered most-specific-first because some prefixes are
# substrings of others ("Task timed out" vs "Task polling timed out", "Task
# failed" vs "Task failed. Error: ..."). The "Task " prefixes come from
# ``task_tool.py``'s 5 normal-return strings; the bare ``Error:`` prefix
# catches both the 3 ``Error:`` pre-execution returns and the wrapper
# produced by ``ToolErrorHandlingMiddleware`` for any task tool exception.
_PREFIX_TO_STATUS: tuple[tuple[str, SubagentStatusValue], ...] = (
("Task Succeeded. Result:", "completed"),
("Task polling timed out", "polling_timed_out"),
("Task timed out", "timed_out"),
("Task cancelled by user", "cancelled"),
("Task failed.", "failed"),
("Error", "failed"),
)
def extract_subagent_status(content: str) -> SubagentStatusValue | None:
"""Infer the structured status for a ``task`` tool result string.
Returns ``None`` when the content does not match any known terminal
prefix. Non-terminal streaming chunks fall into this branch by
design — the middleware then leaves ``subagent_status`` unset so
the frontend keeps the card on its in-progress placeholder until
the real terminal frame arrives.
"""
trimmed = content.strip()
for prefix, status in _PREFIX_TO_STATUS:
if trimmed.startswith(prefix):
return status
return None
def make_subagent_additional_kwargs(
status: SubagentStatusValue,
*,
error: str | None = None,
) -> dict[str, str]:
"""Build the ``additional_kwargs`` payload the middleware stamps.
Drops the error field when blank so the JSON wire format never carries
a misleading empty ``subagent_error: ""``.
Raises:
ValueError: when ``status`` is not in :data:`SUBAGENT_STATUS_VALUES`.
We do not accept arbitrary strings: a typo would silently leak
through to the frontend and degrade to the legacy prefix
fallback rather than failing loudly.
"""
if status not in SUBAGENT_STATUS_VALUES:
raise ValueError(f"invalid subagent status {status!r}; expected one of {SUBAGENT_STATUS_VALUES}")
payload: dict[str, str] = {SUBAGENT_STATUS_KEY: status}
if error and error.strip():
payload[SUBAGENT_ERROR_KEY] = error.strip()
return payload
@@ -0,0 +1,78 @@
"""Contract tests for ``deerflow.subagents.status_contract``.
Bytedance/deer-flow issue #3146: the backend stamps
``ToolMessage.additional_kwargs.subagent_status`` so the frontend can read
the subagent state from a structured field instead of parsing the result
text. The mapping from "task tool result text" to status is shared with the
frontend through the cross-language fixture file
``contracts/subagent_status_contract.json``.
These tests pin the backend implementation against that fixture so any
edit on either side surfaces immediately as a test failure.
"""
from __future__ import annotations
import json
from pathlib import Path
import pytest
from deerflow.subagents.status_contract import (
SUBAGENT_ERROR_KEY,
SUBAGENT_STATUS_KEY,
SUBAGENT_STATUS_VALUES,
extract_subagent_status,
make_subagent_additional_kwargs,
)
_REPO_ROOT = Path(__file__).resolve().parents[2]
_CONTRACT_PATH = _REPO_ROOT / "contracts" / "subagent_status_contract.json"
def _load_contract() -> dict:
return json.loads(_CONTRACT_PATH.read_text(encoding="utf-8"))
def test_contract_file_exists():
assert _CONTRACT_PATH.is_file(), f"missing shared fixture: {_CONTRACT_PATH}"
def test_status_values_match_contract():
"""Backend status enum stays aligned with the contract document."""
contract = _load_contract()
assert set(SUBAGENT_STATUS_VALUES) == set(contract["valid_status_values"])
@pytest.mark.parametrize("case", _load_contract()["cases"], ids=lambda c: c["name"])
def test_extract_subagent_status_matches_contract(case):
"""Every fixture case maps through ``extract_subagent_status`` to the
expected status — covers task_tool's 5 normal returns, the 3
pre-execution ``Error:`` returns, the middleware-wrapped exception
case, whitespace handling, and the streaming chunk that must stay
unrecognised.
"""
status = extract_subagent_status(case["content"])
assert status == case["expected_status"], f"case {case['name']!r}: expected {case['expected_status']!r}, got {status!r}"
def test_make_subagent_additional_kwargs_includes_status():
kwargs = make_subagent_additional_kwargs("completed")
assert kwargs == {SUBAGENT_STATUS_KEY: "completed"}
def test_make_subagent_additional_kwargs_includes_error_when_present():
kwargs = make_subagent_additional_kwargs("failed", error="boom")
assert kwargs == {SUBAGENT_STATUS_KEY: "failed", SUBAGENT_ERROR_KEY: "boom"}
def test_make_subagent_additional_kwargs_omits_blank_error():
"""Empty / whitespace error must not leak as ``subagent_error: ""``."""
assert make_subagent_additional_kwargs("failed", error="") == {SUBAGENT_STATUS_KEY: "failed"}
assert make_subagent_additional_kwargs("failed", error=" ") == {SUBAGENT_STATUS_KEY: "failed"}
assert make_subagent_additional_kwargs("failed", error=None) == {SUBAGENT_STATUS_KEY: "failed"}
def test_make_subagent_additional_kwargs_rejects_unknown_status():
with pytest.raises(ValueError, match="invalid subagent status"):
make_subagent_additional_kwargs("garbage") # type: ignore[arg-type]
@@ -0,0 +1,151 @@
"""Regression tests for ToolErrorHandlingMiddleware's subagent status stamp.
Bytedance/deer-flow issue #3146: rather than stamp
``ToolMessage.additional_kwargs.subagent_status`` from each of
task_tool.py's 5 normal returns + 3 pre-execution Error: returns (which
would be 8 separate places to drift over time), the middleware that
already wraps every tool call does the stamping in one place. These
tests pin that centralisation.
For non-``task`` tools the middleware must not touch additional_kwargs
— other tools have their own conventions and we do not want to leak a
``subagent_status`` field onto them.
"""
from __future__ import annotations
import asyncio
import json
from pathlib import Path
import pytest
from langchain_core.messages import ToolMessage
from deerflow.agents.middlewares.tool_error_handling_middleware import (
ToolErrorHandlingMiddleware,
)
from deerflow.subagents.status_contract import (
SUBAGENT_ERROR_KEY,
SUBAGENT_STATUS_KEY,
)
_CONTRACT_PATH = Path(__file__).resolve().parents[2] / "contracts" / "subagent_status_contract.json"
def _load_terminal_cases() -> list[dict]:
"""Load only the cases that should produce a terminal status stamp."""
data = json.loads(_CONTRACT_PATH.read_text(encoding="utf-8"))
return [c for c in data["cases"] if c["expected_status"] is not None]
class _FakeRequest:
"""Stand-in for ``ToolCallRequest`` used by the middleware."""
def __init__(self, tool_name: str, tool_call_id: str = "call-1") -> None:
self.tool_call = {"name": tool_name, "id": tool_call_id}
@pytest.mark.parametrize("case", _load_terminal_cases(), ids=lambda c: c["name"])
def test_stamps_subagent_status_on_successful_task_return(case):
"""Every terminal task tool result string stamps the matching status."""
middleware = ToolErrorHandlingMiddleware()
request = _FakeRequest("task")
def handler(_req):
return ToolMessage(content=case["content"], tool_call_id="call-1", name="task")
result = middleware.wrap_tool_call(request, handler)
assert isinstance(result, ToolMessage)
assert result.additional_kwargs.get(SUBAGENT_STATUS_KEY) == case["expected_status"]
def test_does_not_stamp_unknown_streaming_chunk():
"""Non-terminal content leaves additional_kwargs alone."""
middleware = ToolErrorHandlingMiddleware()
request = _FakeRequest("task")
def handler(_req):
return ToolMessage(content="Investigating ...", tool_call_id="call-1", name="task")
result = middleware.wrap_tool_call(request, handler)
assert SUBAGENT_STATUS_KEY not in (result.additional_kwargs or {})
def test_does_not_stamp_non_task_tool():
"""A non-task tool returning a string that happens to start with
``Error:`` must not pick up a subagent stamp."""
middleware = ToolErrorHandlingMiddleware()
request = _FakeRequest("bash")
def handler(_req):
return ToolMessage(content="Error: command not found", tool_call_id="call-1", name="bash")
result = middleware.wrap_tool_call(request, handler)
assert SUBAGENT_STATUS_KEY not in (result.additional_kwargs or {})
def test_stamps_failed_when_task_tool_raises():
"""The exception path goes through ``_build_error_message`` which is
the only place ToolErrorHandlingMiddleware ever emits a brand-new
ToolMessage. It must stamp ``failed`` for task too, since the wrapper
text starts with ``Error:``.
"""
middleware = ToolErrorHandlingMiddleware()
request = _FakeRequest("task")
def handler(_req):
raise RuntimeError("blew up during execution")
result = middleware.wrap_tool_call(request, handler)
assert isinstance(result, ToolMessage)
assert result.additional_kwargs.get(SUBAGENT_STATUS_KEY) == "failed"
assert "RuntimeError" in result.additional_kwargs.get(SUBAGENT_ERROR_KEY, "")
def test_async_wrap_also_stamps():
"""The async wrap path must behave identically."""
middleware = ToolErrorHandlingMiddleware()
request = _FakeRequest("task")
async def handler(_req):
return ToolMessage(content="Task Succeeded. Result: ok", tool_call_id="call-1", name="task")
result = asyncio.run(middleware.awrap_tool_call(request, handler))
assert result.additional_kwargs.get(SUBAGENT_STATUS_KEY) == "completed"
def test_preserves_existing_additional_kwargs():
"""The stamper must not clobber unrelated fields the tool already set."""
middleware = ToolErrorHandlingMiddleware()
request = _FakeRequest("task")
def handler(_req):
return ToolMessage(
content="Task Succeeded. Result: ok",
tool_call_id="call-1",
name="task",
additional_kwargs={"existing_field": "must_survive"},
)
result = middleware.wrap_tool_call(request, handler)
assert result.additional_kwargs.get("existing_field") == "must_survive"
assert result.additional_kwargs.get(SUBAGENT_STATUS_KEY) == "completed"
def test_additional_kwargs_round_trip_via_json():
"""Pydantic dump → JSON → restore must keep the stamp intact.
``ToolMessage`` is what LangGraph serialises into the checkpoint and
what the frontend deserialises off the stream. If a future Pydantic /
LangChain upgrade silently strips unknown ``additional_kwargs`` we
want that to fail loudly here rather than in the wild.
"""
msg = ToolMessage(
content="Task Succeeded. Result: ok",
tool_call_id="call-1",
name="task",
additional_kwargs={SUBAGENT_STATUS_KEY: "completed", SUBAGENT_ERROR_KEY: ""},
)
serialised = msg.model_dump_json()
restored = ToolMessage.model_validate_json(serialised)
assert restored.additional_kwargs.get(SUBAGENT_STATUS_KEY) == "completed"