fix(subagent): structured subagent_status field over text parsing (#3146) (#3154)

* fix(subagent): structured subagent_status field over text parsing

Closes #3146.

## Why

The frontend used to derive subtask card state by string-matching the
leading text of the `task` tool's result. That contract surface was
fragile — `#3107` BUG-007 and the `#3131` review both surfaced cases
where new backend wording (`Task cancelled by user.`,
`Task polling timed out after N minutes`, `ToolErrorHandlingMiddleware`
exception wrappers) silently broke the card lifecycle. The frontend
fallback kept growing more prefixes; any future rewording would break
it again.

## Design

1. **Backend → frontend contract**: `ToolMessage.additional_kwargs`
   carries `subagent_status` (one of `completed | failed | cancelled |
   timed_out | polling_timed_out`) and an optional `subagent_error`
   blob. The frontend prefers it over parsing `content`.

2. **Centralised stamping, not 8 sprinkled stamps**: rather than have
   each of `task_tool.py`'s 5 normal-return + 3 pre-execution `Error:`
   paths remember to set `additional_kwargs`, `ToolErrorHandlingMiddleware`
   stamps the field after every task-tool call. Adding a new return
   path in `task_tool.py` cannot now skip the stamp.

3. **Cross-language contract fixture**: the prefix→status mapping is
   the one piece both sides must agree on. The shared fixture at
   `contracts/subagent_status_contract.json` lists every backend return
   string, the expected status, and what the error substring should
   contain. Backend test (`backend/tests/test_subagent_status_contract.py`)
   and frontend test (`frontend/tests/unit/core/tasks/subtask-result.test.ts`)
   both load that fixture and assert the same cases. A wording drift on
   either side fails the matching language's test.

4. **Round-trip serialisation pinned**: the round-trip test asserts
   `ToolMessage.model_dump_json()` → `model_validate_json()` preserves
   `additional_kwargs.subagent_status`. Catches the case where a future
   LangChain or Pydantic upgrade silently strips unknown kwargs.

5. **Frontend status collapse documented**: the backend has five status
   values, the frontend card has three (`completed | failed |
   in_progress`). `cancelled` / `timed_out` / `polling_timed_out` all
   collapse to `failed` with the original status preserved in `error`.
   `parseSubtaskResult` returns `in_progress` for unknown values so a
   backend that ships a new enum variant before the frontend upgrades
   degrades to the legacy prefix fallback instead of getting pinned.

## Changes

Backend:
- `deerflow.subagents.status_contract` — new module exporting
  `SUBAGENT_STATUS_KEY`, `SUBAGENT_ERROR_KEY`,
  `SUBAGENT_STATUS_VALUES`, `extract_subagent_status(content)`, and
  `make_subagent_additional_kwargs(status, error)`.
- `ToolErrorHandlingMiddleware`: new `_stamp_task_subagent_status`
  helper centralises the stamp; `wrap_tool_call` / `awrap_tool_call`
  stamp on the success path; `_build_error_message` stamps on the
  wrapper path (carrying `ExcClass: detail` into `subagent_error`).
  Non-task tools are untouched.
- New tests: `test_subagent_status_contract.py` (19 cases from the
  shared fixture + status-enum / blank-error / unknown-status
  rejection) and `test_tool_error_handling_subagent_stamp.py`
  (middleware integration: terminal-content stamps, non-terminal
  doesn't, non-task tools untouched, async path mirrors sync,
  existing additional_kwargs survive, JSON round-trip preserved).

Frontend:
- `parseSubtaskResult(text, additionalKwargs?)` — prefers the
  structured stamp; falls back to the legacy prefix matcher for
  historical threads / unknown future status values.
- `STRUCTURED_STATUS_TO_SUBTASK` documents the five→three collapse.
- `message-list.tsx` passes `message.additional_kwargs` through.
- `subtask-result.test.ts` adds a structured-status block + a
  fixture-driven contract block; legacy prefix tests stay green for
  the fallback path.

Contract:
- `contracts/subagent_status_contract.json` — single source of truth
  both languages load. Whitespace variants, varied N for polling
  timeouts, the 3 pre-execution `Error:` returns task_tool produces,
  and the middleware wrapper shape are all in there.

## Test plan
- `make lint` clean (backend + frontend).
- `pytest tests/test_subagent_status_contract.py
   tests/test_tool_error_handling_subagent_stamp.py` → 37 passed.
- `pnpm test --run` → 103 passed (was 76, +27 new).

## Migration / fallback retirement

The text-prefix fallback stays in place until backend telemetry shows
the frontend never hits it for newly produced messages. At that point
a follow-up PR can drop the prefix branches and keep only the
structured-status branch.

Refs: bytedance/deer-flow#3138 (split summary), #3107 (origin), #3131
(prior prefix-only fix), #3146 (this issue).

* fix(subtask): back-fill result/error from text when structured status present

Three follow-ups on the PR #3154 review:

1. `readStructuredStatus` no longer short-circuits the prefix parse.
   The backend currently stamps only the `subagent_status` enum value;
   the human-facing `result` body and wrapped-error message still live
   in `ToolMessage.content`. Dropping the text parse meant successful
   tasks rendered empty completed pills and wrapped failures lost their
   diagnostic. Now both shapes get composed: structured status wins,
   `result`/`error` come from text when both sides agree, and a lying
   success body under a `failed` stamp is dropped instead of leaking.

2. Replace the ESM-incompatible `__dirname` fixture lookup in
   subtask-result.test.ts with `fileURLToPath(new URL(..., import.meta.url))`.
   The frontend package is `"type": "module"`, so the previous path
   would have thrown at runtime if anything ever changed under the
   contract directory.

3. Drop the `$schema` reference from contracts/subagent_status_contract.json
   pointing at a file that doesn't exist in the tree.

Three new tests cover the structured + text composition: completed
back-fills the success body, failed back-fills the wrapper text, and
unrecognised content under a `failed` stamp stays empty rather than
echoing noise.
This commit is contained in:
Xinmin Zeng
2026-06-07 22:49:55 +08:00
committed by GitHub
parent d8b728f7cb
commit 8d2e55a05f
8 changed files with 780 additions and 17 deletions
@@ -1,6 +1,37 @@
import { readFileSync } from "node:fs";
import { fileURLToPath } from "node:url";
import { describe, expect, it } from "vitest";
import { parseSubtaskResult } from "@/core/tasks/subtask-result";
import {
SUBAGENT_ERROR_KEY,
SUBAGENT_STATUS_KEY,
parseSubtaskResult,
} from "@/core/tasks/subtask-result";
interface ContractCase {
name: string;
content: string;
expected_status: string | null;
expected_error_contains: string | null;
}
interface ContractFile {
valid_status_values: string[];
cases: ContractCase[];
}
// The frontend package is ESM (`"type": "module"`), so `__dirname` is not
// defined. Resolve the cross-language fixture relative to this module URL.
const CONTRACT_PATH = fileURLToPath(
new URL(
"../../../../../contracts/subagent_status_contract.json",
import.meta.url,
),
);
const CONTRACT: ContractFile = JSON.parse(
readFileSync(CONTRACT_PATH, "utf-8"),
) as ContractFile;
describe("parseSubtaskResult", () => {
it("recognises the standard success prefix", () => {
@@ -110,3 +141,149 @@ describe("parseSubtaskResult", () => {
expect(parsed.result).toBe("ok");
});
});
/**
* Structured-status path (bytedance/deer-flow#3146).
*
* The backend stamps `ToolMessage.additional_kwargs.subagent_status`
* directly. The frontend should prefer that over reverse-engineering it
* from the content string.
*/
describe("parseSubtaskResult — structured additional_kwargs (preferred path)", () => {
it("uses additional_kwargs.subagent_status when present", () => {
const parsed = parseSubtaskResult("Task Succeeded. Result: foo", {
[SUBAGENT_STATUS_KEY]: "completed",
});
expect(parsed.status).toBe("completed");
});
it("collapses cancelled / timed_out / polling_timed_out to failed for the card UI", () => {
for (const backendStatus of [
"cancelled",
"timed_out",
"polling_timed_out",
]) {
const parsed = parseSubtaskResult("anything at all", {
[SUBAGENT_STATUS_KEY]: backendStatus,
});
expect(parsed.status).toBe("failed");
}
});
it("uses subagent_error when supplied", () => {
const parsed = parseSubtaskResult("ignored content", {
[SUBAGENT_STATUS_KEY]: "failed",
[SUBAGENT_ERROR_KEY]: "boom from backend",
});
expect(parsed.status).toBe("failed");
expect(parsed.error).toBe("boom from backend");
});
it("ignores empty / non-string subagent_error", () => {
const parsed = parseSubtaskResult("ignored content", {
[SUBAGENT_STATUS_KEY]: "failed",
[SUBAGENT_ERROR_KEY]: "",
});
expect(parsed.status).toBe("failed");
expect(parsed.error).toBeUndefined();
});
it("falls back to prefix parsing when the structured status is missing", () => {
const parsed = parseSubtaskResult("Task Succeeded. Result: foo", {
// No subagent_status here — backend versions that pre-date the
// middleware stamping commit still need to render.
other_field: "irrelevant",
});
expect(parsed.status).toBe("completed");
expect(parsed.result).toBe("foo");
});
it("falls back to prefix parsing when the structured status is an unknown future value", () => {
const parsed = parseSubtaskResult("Task Succeeded. Result: foo", {
[SUBAGENT_STATUS_KEY]: "renamed_in_v3",
});
// Falls back to prefix and still finds the success path.
expect(parsed.status).toBe("completed");
});
it("structured status overrides legacy text — opposite content", () => {
// Defence: if backend sends `failed` structured but the content
// accidentally starts with "Task Succeeded.", we must trust the
// structured field. The structured field is the source of truth.
const parsed = parseSubtaskResult("Task Succeeded. Result: this is a lie", {
[SUBAGENT_STATUS_KEY]: "failed",
});
expect(parsed.status).toBe("failed");
// The misleading success body must be dropped — `result` is reserved
// for the completed pill, and the suspicious text isn't replayed as
// an error either.
expect(parsed.result).toBeUndefined();
expect(parsed.error).toBeUndefined();
});
it("back-fills `result` from the success-prefixed content when structured says completed", () => {
// The backend currently stamps `subagent_status: completed` but the
// success body still lives in `content`. Without back-fill the card
// would render an empty completed pill (regression flagged in PR #3154
// Copilot review).
const parsed = parseSubtaskResult(
"Task Succeeded. Result: investigated and produced a 3-page report",
{ [SUBAGENT_STATUS_KEY]: "completed" },
);
expect(parsed.status).toBe("completed");
expect(parsed.result).toBe("investigated and produced a 3-page report");
});
it("back-fills `error` from a wrapped-error body when structured says failed and no subagent_error", () => {
// Same regression on the failure side: the wrapper text is the only
// place the diagnostic message exists when the backend stamps the
// enum but not `subagent_error`.
const parsed = parseSubtaskResult(
"Error: Tool 'task' failed with TypeError: boom",
{ [SUBAGENT_STATUS_KEY]: "failed" },
);
expect(parsed.status).toBe("failed");
expect(parsed.error).toContain("TypeError: boom");
});
it("leaves `error` undefined when structured says failed with no error and unrecognised text", () => {
// Don't dump arbitrary content into the error field — better to render
// an empty `failed` pill than to surface noise.
const parsed = parseSubtaskResult("partial streaming chunk", {
[SUBAGENT_STATUS_KEY]: "failed",
});
expect(parsed.status).toBe("failed");
expect(parsed.error).toBeUndefined();
});
});
/**
* Cross-language contract test (bytedance/deer-flow#3146).
*
* Loads the shared fixture at ``contracts/subagent_status_contract.json``
* and runs every case through the legacy prefix parser. The matching
* backend test (`backend/tests/test_subagent_status_contract.py`) runs
* the same cases through ``extract_subagent_status``. Any drift between
* the two implementations surfaces here.
*
* Status-collapse expectations:
* - `completed` → `completed`
* - `failed` → `failed`
* - `cancelled` / `timed_out` / `polling_timed_out` → `failed`
* (the frontend card has three pill states, not five)
* - `null` → `in_progress`
*/
describe("parseSubtaskResult — shared contract fixture", () => {
const expectedCardStatus = (backendStatus: string | null): string => {
if (backendStatus === null) return "in_progress";
if (backendStatus === "completed") return "completed";
return "failed";
};
for (const c of CONTRACT.cases) {
it(`legacy prefix parser matches contract: ${c.name}`, () => {
const parsed = parseSubtaskResult(c.content);
expect(parsed.status).toBe(expectedCardStatus(c.expected_status));
});
}
});