fix(subagent): structured subagent_status field over text parsing (#3146) (#3154)

* fix(subagent): structured subagent_status field over text parsing Closes #3146. ## Why The frontend used to derive subtask card state by string-matching the leading text of the `task` tool's result. That contract surface was fragile — `#3107` BUG-007 and the `#3131` review both surfaced cases where new backend wording (`Task cancelled by user.`, `Task polling timed out after N minutes`, `ToolErrorHandlingMiddleware` exception wrappers) silently broke the card lifecycle. The frontend fallback kept growing more prefixes; any future rewording would break it again. ## Design 1. **Backend → frontend contract**: `ToolMessage.additional_kwargs` carries `subagent_status` (one of `completed | failed | cancelled | timed_out | polling_timed_out`) and an optional `subagent_error` blob. The frontend prefers it over parsing `content`. 2. **Centralised stamping, not 8 sprinkled stamps**: rather than have each of `task_tool.py`'s 5 normal-return + 3 pre-execution `Error:` paths remember to set `additional_kwargs`, `ToolErrorHandlingMiddleware` stamps the field after every task-tool call. Adding a new return path in `task_tool.py` cannot now skip the stamp. 3. **Cross-language contract fixture**: the prefix→status mapping is the one piece both sides must agree on. The shared fixture at `contracts/subagent_status_contract.json` lists every backend return string, the expected status, and what the error substring should contain. Backend test (`backend/tests/test_subagent_status_contract.py`) and frontend test (`frontend/tests/unit/core/tasks/subtask-result.test.ts`) both load that fixture and assert the same cases. A wording drift on either side fails the matching language's test. 4. **Round-trip serialisation pinned**: the round-trip test asserts `ToolMessage.model_dump_json()` → `model_validate_json()` preserves `additional_kwargs.subagent_status`. Catches the case where a future LangChain or Pydantic upgrade silently strips unknown kwargs. 5. **Frontend status collapse documented**: the backend has five status values, the frontend card has three (`completed | failed | in_progress`). `cancelled` / `timed_out` / `polling_timed_out` all collapse to `failed` with the original status preserved in `error`. `parseSubtaskResult` returns `in_progress` for unknown values so a backend that ships a new enum variant before the frontend upgrades degrades to the legacy prefix fallback instead of getting pinned. ## Changes Backend: - `deerflow.subagents.status_contract` — new module exporting `SUBAGENT_STATUS_KEY`, `SUBAGENT_ERROR_KEY`, `SUBAGENT_STATUS_VALUES`, `extract_subagent_status(content)`, and `make_subagent_additional_kwargs(status, error)`. - `ToolErrorHandlingMiddleware`: new `_stamp_task_subagent_status` helper centralises the stamp; `wrap_tool_call` / `awrap_tool_call` stamp on the success path; `_build_error_message` stamps on the wrapper path (carrying `ExcClass: detail` into `subagent_error`). Non-task tools are untouched. - New tests: `test_subagent_status_contract.py` (19 cases from the shared fixture + status-enum / blank-error / unknown-status rejection) and `test_tool_error_handling_subagent_stamp.py` (middleware integration: terminal-content stamps, non-terminal doesn't, non-task tools untouched, async path mirrors sync, existing additional_kwargs survive, JSON round-trip preserved). Frontend: - `parseSubtaskResult(text, additionalKwargs?)` — prefers the structured stamp; falls back to the legacy prefix matcher for historical threads / unknown future status values. - `STRUCTURED_STATUS_TO_SUBTASK` documents the five→three collapse. - `message-list.tsx` passes `message.additional_kwargs` through. - `subtask-result.test.ts` adds a structured-status block + a fixture-driven contract block; legacy prefix tests stay green for the fallback path. Contract: - `contracts/subagent_status_contract.json` — single source of truth both languages load. Whitespace variants, varied N for polling timeouts, the 3 pre-execution `Error:` returns task_tool produces, and the middleware wrapper shape are all in there. ## Test plan - `make lint` clean (backend + frontend). - `pytest tests/test_subagent_status_contract.py tests/test_tool_error_handling_subagent_stamp.py` → 37 passed. - `pnpm test --run` → 103 passed (was 76, +27 new). ## Migration / fallback retirement The text-prefix fallback stays in place until backend telemetry shows the frontend never hits it for newly produced messages. At that point a follow-up PR can drop the prefix branches and keep only the structured-status branch. Refs: bytedance/deer-flow#3138 (split summary), #3107 (origin), #3131 (prior prefix-only fix), #3146 (this issue). * fix(subtask): back-fill result/error from text when structured status present Three follow-ups on the PR #3154 review: 1. `readStructuredStatus` no longer short-circuits the prefix parse. The backend currently stamps only the `subagent_status` enum value; the human-facing `result` body and wrapped-error message still live in `ToolMessage.content`. Dropping the text parse meant successful tasks rendered empty completed pills and wrapped failures lost their diagnostic. Now both shapes get composed: structured status wins, `result`/`error` come from text when both sides agree, and a lying success body under a `failed` stamp is dropped instead of leaking. 2. Replace the ESM-incompatible `__dirname` fixture lookup in subtask-result.test.ts with `fileURLToPath(new URL(..., import.meta.url))`. The frontend package is `"type": "module"`, so the previous path would have thrown at runtime if anything ever changed under the contract directory. 3. Drop the `$schema` reference from contracts/subagent_status_contract.json pointing at a file that doesn't exist in the tree. Three new tests cover the structured + text composition: completed back-fills the success body, failed back-fills the wrapper text, and unrecognised content under a `failed` stamp stays empty rather than echoing noise.
2026-06-10 17:35:57 +00:00 · 2026-06-07 22:49:55 +08:00
parent d8b728f7cb
commit 8d2e55a05f
8 changed files with 780 additions and 17 deletions
@@ -12,10 +12,45 @@ from langgraph.prebuilt.tool_node import ToolCallRequest
 from langgraph.types import Command

 from deerflow.config.app_config import AppConfig
+from deerflow.subagents.status_contract import (
+    extract_subagent_status,
+    make_subagent_additional_kwargs,
+)

 logger = logging.getLogger(__name__)

 _MISSING_TOOL_CALL_ID = "missing_tool_call_id"
+_TASK_TOOL_NAME = "task"
+
+
+def _stamp_task_subagent_status(message: ToolMessage, *, tool_name: str, error: str | None = None) -> ToolMessage:
+    """Centralised stamping of ``additional_kwargs.subagent_status``.
+
+    Bytedance/deer-flow issue #3146: the frontend now reads the subagent
+    status from a structured field instead of parsing the leading text of
+    the task tool's return string. That contract is enforced here, in the
+    one place every task tool result flows through, rather than at the 5
+    normal-return + 3 ``Error:`` pre-execution branches inside
+    ``task_tool.py``. Centralisation prevents the "added a new return
+    path, forgot the stamp" drift mode.
+
+    For non-``task`` tools this is a no-op so other tools' additional_kwargs
+    conventions are untouched.
+    """
+    if tool_name != _TASK_TOOL_NAME:
+        return message
+    content = message.content if isinstance(message.content, str) else ""
+    status = extract_subagent_status(content)
+    if status is None:
+        # Non-terminal streaming chunks or unrecognised shapes leave the
+        # field unset so the frontend can keep the card on its in-progress
+        # placeholder until a real terminal frame arrives.
+        return message
+    stamp = make_subagent_additional_kwargs(status, error=error)
+    existing = dict(message.additional_kwargs or {})
+    existing.update(stamp)
+    message.additional_kwargs = existing
+    return message


 class ToolErrorHandlingMiddleware(AgentMiddleware[AgentState]):
@@ -29,12 +64,31 @@ class ToolErrorHandlingMiddleware(AgentMiddleware[AgentState]):
            detail = detail[:497] + "..."

        content = f"Error: Tool '{tool_name}' failed with {exc.__class__.__name__}: {detail}. Continue with available context, or choose an alternative tool."
-        return ToolMessage(
+        message = ToolMessage(
            content=content,
            tool_call_id=tool_call_id,
            name=tool_name,
            status="error",
        )
+        # Stamp the structured subagent status on the wrapper too: the
+        # frontend would otherwise have to fall back to prefix-matching
+        # ``Error: Tool 'task' failed ...`` on the wire. The ``subagent_error``
+        # carries the same ``ExcClass: detail`` shape the wrapper string
+        # uses so debugging artifacts stay aligned.
+        structured_error = f"{exc.__class__.__name__}: {detail}"
+        return _stamp_task_subagent_status(message, tool_name=tool_name, error=structured_error)
+
+    @staticmethod
+    def _maybe_stamp(result: ToolMessage | Command, request: ToolCallRequest) -> ToolMessage | Command:
+        """Apply the subagent stamp to successful task tool returns.
+
+        ``Command`` results bypass the stamp — they encode LangGraph
+        control flow rather than user-facing tool output.
+        """
+        if not isinstance(result, ToolMessage):
+            return result
+        tool_name = str(request.tool_call.get("name") or "")
+        return _stamp_task_subagent_status(result, tool_name=tool_name)

    @override
    def wrap_tool_call(
@@ -43,13 +97,14 @@ class ToolErrorHandlingMiddleware(AgentMiddleware[AgentState]):
        handler: Callable[[ToolCallRequest], ToolMessage | Command],
    ) -> ToolMessage | Command:
        try:
-            return handler(request)
+            result = handler(request)
        except GraphBubbleUp:
            # Preserve LangGraph control-flow signals (interrupt/pause/resume).
            raise
        except Exception as exc:
            logger.exception("Tool execution failed (sync): name=%s id=%s", request.tool_call.get("name"), request.tool_call.get("id"))
            return self._build_error_message(request, exc)
+        return self._maybe_stamp(result, request)

    @override
    async def awrap_tool_call(
@@ -58,13 +113,14 @@ class ToolErrorHandlingMiddleware(AgentMiddleware[AgentState]):
        handler: Callable[[ToolCallRequest], Awaitable[ToolMessage | Command]],
    ) -> ToolMessage | Command:
        try:
-            return await handler(request)
+            result = await handler(request)
        except GraphBubbleUp:
            # Preserve LangGraph control-flow signals (interrupt/pause/resume).
            raise
        except Exception as exc:
            logger.exception("Tool execution failed (async): name=%s id=%s", request.tool_call.get("name"), request.tool_call.get("id"))
            return self._build_error_message(request, exc)
+        return self._maybe_stamp(result, request)


 def _build_runtime_middlewares(
@@ -0,0 +1,102 @@
+"""Backend↔frontend contract for the structured subagent status.
+
+Bytedance/deer-flow issue #3146: the frontend used to derive the
+subtask card state by string-matching the leading text of the
+``task`` tool's result. That contract was fragile — any rewording on
+the backend silently broke the card lifecycle, and the issue history
+of #3107 BUG-007 / #3131 review showed it repeatedly.
+
+This module replaces the text-shaped contract with a small structured
+one carried inside ``ToolMessage.additional_kwargs``:
+
+- ``subagent_status``: one of ``SUBAGENT_STATUS_VALUES``.
+- ``subagent_error`` (optional): the human-readable error blob the
+  backend recorded.
+
+The mapping from "task tool result text" to status is the one piece
+the backend stamper (``ToolErrorHandlingMiddleware``) and the
+frontend fallback parser must agree on. The shared fixture at
+``contracts/subagent_status_contract.json`` is the single source of
+truth — both sides' tests load it and assert behaviour.
+"""
+
+from __future__ import annotations
+
+from typing import Literal
+
+SUBAGENT_STATUS_KEY = "subagent_status"
+SUBAGENT_ERROR_KEY = "subagent_error"
+
+SubagentStatusValue = Literal[
+    "completed",
+    "failed",
+    "cancelled",
+    "timed_out",
+    "polling_timed_out",
+]
+
+#: Enumeration of every value ``subagent_status`` may take. Mirrors the
+#: ``valid_status_values`` array in the shared fixture; the contract test
+#: pins them against each other.
+SUBAGENT_STATUS_VALUES: tuple[SubagentStatusValue, ...] = (
+    "completed",
+    "failed",
+    "cancelled",
+    "timed_out",
+    "polling_timed_out",
+)
+
+# Prefix table — ordered most-specific-first because some prefixes are
+# substrings of others ("Task timed out" vs "Task polling timed out", "Task
+# failed" vs "Task failed. Error: ..."). The "Task " prefixes come from
+# ``task_tool.py``'s 5 normal-return strings; the bare ``Error:`` prefix
+# catches both the 3 ``Error:`` pre-execution returns and the wrapper
+# produced by ``ToolErrorHandlingMiddleware`` for any task tool exception.
+_PREFIX_TO_STATUS: tuple[tuple[str, SubagentStatusValue], ...] = (
+    ("Task Succeeded. Result:", "completed"),
+    ("Task polling timed out", "polling_timed_out"),
+    ("Task timed out", "timed_out"),
+    ("Task cancelled by user", "cancelled"),
+    ("Task failed.", "failed"),
+    ("Error", "failed"),
+)
+
+
+def extract_subagent_status(content: str) -> SubagentStatusValue | None:
+    """Infer the structured status for a ``task`` tool result string.
+
+    Returns ``None`` when the content does not match any known terminal
+    prefix. Non-terminal streaming chunks fall into this branch by
+    design — the middleware then leaves ``subagent_status`` unset so
+    the frontend keeps the card on its in-progress placeholder until
+    the real terminal frame arrives.
+    """
+    trimmed = content.strip()
+    for prefix, status in _PREFIX_TO_STATUS:
+        if trimmed.startswith(prefix):
+            return status
+    return None
+
+
+def make_subagent_additional_kwargs(
+    status: SubagentStatusValue,
+    *,
+    error: str | None = None,
+) -> dict[str, str]:
+    """Build the ``additional_kwargs`` payload the middleware stamps.
+
+    Drops the error field when blank so the JSON wire format never carries
+    a misleading empty ``subagent_error: ""``.
+
+    Raises:
+        ValueError: when ``status`` is not in :data:`SUBAGENT_STATUS_VALUES`.
+            We do not accept arbitrary strings: a typo would silently leak
+            through to the frontend and degrade to the legacy prefix
+            fallback rather than failing loudly.
+    """
+    if status not in SUBAGENT_STATUS_VALUES:
+        raise ValueError(f"invalid subagent status {status!r}; expected one of {SUBAGENT_STATUS_VALUES}")
+    payload: dict[str, str] = {SUBAGENT_STATUS_KEY: status}
+    if error and error.strip():
+        payload[SUBAGENT_ERROR_KEY] = error.strip()
+    return payload
@@ -0,0 +1,78 @@
+"""Contract tests for ``deerflow.subagents.status_contract``.
+
+Bytedance/deer-flow issue #3146: the backend stamps
+``ToolMessage.additional_kwargs.subagent_status`` so the frontend can read
+the subagent state from a structured field instead of parsing the result
+text. The mapping from "task tool result text" to status is shared with the
+frontend through the cross-language fixture file
+``contracts/subagent_status_contract.json``.
+
+These tests pin the backend implementation against that fixture so any
+edit on either side surfaces immediately as a test failure.
+"""
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+import pytest
+
+from deerflow.subagents.status_contract import (
+    SUBAGENT_ERROR_KEY,
+    SUBAGENT_STATUS_KEY,
+    SUBAGENT_STATUS_VALUES,
+    extract_subagent_status,
+    make_subagent_additional_kwargs,
+)
+
+_REPO_ROOT = Path(__file__).resolve().parents[2]
+_CONTRACT_PATH = _REPO_ROOT / "contracts" / "subagent_status_contract.json"
+
+
+def _load_contract() -> dict:
+    return json.loads(_CONTRACT_PATH.read_text(encoding="utf-8"))
+
+
+def test_contract_file_exists():
+    assert _CONTRACT_PATH.is_file(), f"missing shared fixture: {_CONTRACT_PATH}"
+
+
+def test_status_values_match_contract():
+    """Backend status enum stays aligned with the contract document."""
+    contract = _load_contract()
+    assert set(SUBAGENT_STATUS_VALUES) == set(contract["valid_status_values"])
+
+
+@pytest.mark.parametrize("case", _load_contract()["cases"], ids=lambda c: c["name"])
+def test_extract_subagent_status_matches_contract(case):
+    """Every fixture case maps through ``extract_subagent_status`` to the
+    expected status — covers task_tool's 5 normal returns, the 3
+    pre-execution ``Error:`` returns, the middleware-wrapped exception
+    case, whitespace handling, and the streaming chunk that must stay
+    unrecognised.
+    """
+    status = extract_subagent_status(case["content"])
+    assert status == case["expected_status"], f"case {case['name']!r}: expected {case['expected_status']!r}, got {status!r}"
+
+
+def test_make_subagent_additional_kwargs_includes_status():
+    kwargs = make_subagent_additional_kwargs("completed")
+    assert kwargs == {SUBAGENT_STATUS_KEY: "completed"}
+
+
+def test_make_subagent_additional_kwargs_includes_error_when_present():
+    kwargs = make_subagent_additional_kwargs("failed", error="boom")
+    assert kwargs == {SUBAGENT_STATUS_KEY: "failed", SUBAGENT_ERROR_KEY: "boom"}
+
+
+def test_make_subagent_additional_kwargs_omits_blank_error():
+    """Empty / whitespace error must not leak as ``subagent_error: ""``."""
+    assert make_subagent_additional_kwargs("failed", error="") == {SUBAGENT_STATUS_KEY: "failed"}
+    assert make_subagent_additional_kwargs("failed", error="   ") == {SUBAGENT_STATUS_KEY: "failed"}
+    assert make_subagent_additional_kwargs("failed", error=None) == {SUBAGENT_STATUS_KEY: "failed"}
+
+
+def test_make_subagent_additional_kwargs_rejects_unknown_status():
+    with pytest.raises(ValueError, match="invalid subagent status"):
+        make_subagent_additional_kwargs("garbage")  # type: ignore[arg-type]
@@ -0,0 +1,151 @@
+"""Regression tests for ToolErrorHandlingMiddleware's subagent status stamp.
+
+Bytedance/deer-flow issue #3146: rather than stamp
+``ToolMessage.additional_kwargs.subagent_status`` from each of
+task_tool.py's 5 normal returns + 3 pre-execution Error: returns (which
+would be 8 separate places to drift over time), the middleware that
+already wraps every tool call does the stamping in one place. These
+tests pin that centralisation.
+
+For non-``task`` tools the middleware must not touch additional_kwargs
+— other tools have their own conventions and we do not want to leak a
+``subagent_status`` field onto them.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import json
+from pathlib import Path
+
+import pytest
+from langchain_core.messages import ToolMessage
+
+from deerflow.agents.middlewares.tool_error_handling_middleware import (
+    ToolErrorHandlingMiddleware,
+)
+from deerflow.subagents.status_contract import (
+    SUBAGENT_ERROR_KEY,
+    SUBAGENT_STATUS_KEY,
+)
+
+_CONTRACT_PATH = Path(__file__).resolve().parents[2] / "contracts" / "subagent_status_contract.json"
+
+
+def _load_terminal_cases() -> list[dict]:
+    """Load only the cases that should produce a terminal status stamp."""
+    data = json.loads(_CONTRACT_PATH.read_text(encoding="utf-8"))
+    return [c for c in data["cases"] if c["expected_status"] is not None]
+
+
+class _FakeRequest:
+    """Stand-in for ``ToolCallRequest`` used by the middleware."""
+
+    def __init__(self, tool_name: str, tool_call_id: str = "call-1") -> None:
+        self.tool_call = {"name": tool_name, "id": tool_call_id}
+
+
+@pytest.mark.parametrize("case", _load_terminal_cases(), ids=lambda c: c["name"])
+def test_stamps_subagent_status_on_successful_task_return(case):
+    """Every terminal task tool result string stamps the matching status."""
+    middleware = ToolErrorHandlingMiddleware()
+    request = _FakeRequest("task")
+
+    def handler(_req):
+        return ToolMessage(content=case["content"], tool_call_id="call-1", name="task")
+
+    result = middleware.wrap_tool_call(request, handler)
+    assert isinstance(result, ToolMessage)
+    assert result.additional_kwargs.get(SUBAGENT_STATUS_KEY) == case["expected_status"]
+
+
+def test_does_not_stamp_unknown_streaming_chunk():
+    """Non-terminal content leaves additional_kwargs alone."""
+    middleware = ToolErrorHandlingMiddleware()
+    request = _FakeRequest("task")
+
+    def handler(_req):
+        return ToolMessage(content="Investigating ...", tool_call_id="call-1", name="task")
+
+    result = middleware.wrap_tool_call(request, handler)
+    assert SUBAGENT_STATUS_KEY not in (result.additional_kwargs or {})
+
+
+def test_does_not_stamp_non_task_tool():
+    """A non-task tool returning a string that happens to start with
+    ``Error:`` must not pick up a subagent stamp."""
+    middleware = ToolErrorHandlingMiddleware()
+    request = _FakeRequest("bash")
+
+    def handler(_req):
+        return ToolMessage(content="Error: command not found", tool_call_id="call-1", name="bash")
+
+    result = middleware.wrap_tool_call(request, handler)
+    assert SUBAGENT_STATUS_KEY not in (result.additional_kwargs or {})
+
+
+def test_stamps_failed_when_task_tool_raises():
+    """The exception path goes through ``_build_error_message`` which is
+    the only place ToolErrorHandlingMiddleware ever emits a brand-new
+    ToolMessage. It must stamp ``failed`` for task too, since the wrapper
+    text starts with ``Error:``.
+    """
+    middleware = ToolErrorHandlingMiddleware()
+    request = _FakeRequest("task")
+
+    def handler(_req):
+        raise RuntimeError("blew up during execution")
+
+    result = middleware.wrap_tool_call(request, handler)
+    assert isinstance(result, ToolMessage)
+    assert result.additional_kwargs.get(SUBAGENT_STATUS_KEY) == "failed"
+    assert "RuntimeError" in result.additional_kwargs.get(SUBAGENT_ERROR_KEY, "")
+
+
+def test_async_wrap_also_stamps():
+    """The async wrap path must behave identically."""
+    middleware = ToolErrorHandlingMiddleware()
+    request = _FakeRequest("task")
+
+    async def handler(_req):
+        return ToolMessage(content="Task Succeeded. Result: ok", tool_call_id="call-1", name="task")
+
+    result = asyncio.run(middleware.awrap_tool_call(request, handler))
+    assert result.additional_kwargs.get(SUBAGENT_STATUS_KEY) == "completed"
+
+
+def test_preserves_existing_additional_kwargs():
+    """The stamper must not clobber unrelated fields the tool already set."""
+    middleware = ToolErrorHandlingMiddleware()
+    request = _FakeRequest("task")
+
+    def handler(_req):
+        return ToolMessage(
+            content="Task Succeeded. Result: ok",
+            tool_call_id="call-1",
+            name="task",
+            additional_kwargs={"existing_field": "must_survive"},
+        )
+
+    result = middleware.wrap_tool_call(request, handler)
+    assert result.additional_kwargs.get("existing_field") == "must_survive"
+    assert result.additional_kwargs.get(SUBAGENT_STATUS_KEY) == "completed"
+
+
+def test_additional_kwargs_round_trip_via_json():
+    """Pydantic dump → JSON → restore must keep the stamp intact.
+
+    ``ToolMessage`` is what LangGraph serialises into the checkpoint and
+    what the frontend deserialises off the stream. If a future Pydantic /
+    LangChain upgrade silently strips unknown ``additional_kwargs`` we
+    want that to fail loudly here rather than in the wild.
+    """
+    msg = ToolMessage(
+        content="Task Succeeded. Result: ok",
+        tool_call_id="call-1",
+        name="task",
+        additional_kwargs={SUBAGENT_STATUS_KEY: "completed", SUBAGENT_ERROR_KEY: ""},
+    )
+    serialised = msg.model_dump_json()
+    restored = ToolMessage.model_validate_json(serialised)
+    assert restored.additional_kwargs.get(SUBAGENT_STATUS_KEY) == "completed"