fix(stability): resolve P0 blockers from v2.0-m1-rc1 stability audit (#3107) (#3131)

* fix(task-tool): unwrap callback manager when locating usage recorder

`config["callbacks"]` may arrive as a `BaseCallbackManager` (e.g. the `AsyncCallbackManager` LangChain hands to async tool runs), not just a plain list. The previous `for cb in callbacks` loop raised `TypeError: 'AsyncCallbackManager' object is not iterable`, which `ToolErrorHandlingMiddleware` then converted into a failed `task` ToolMessage even though the subagent had completed internally. Ultra mode lost subagent results and the lead agent fell back to redoing the work.

Unwrap `BaseCallbackManager.handlers` before searching for the recorder.

Refs: bytedance/deer-flow#3107 (BUG-002)

* fix(frontend): treat any task tool error as a terminal subtask failure

The subtask card status machine matched only three English prefixes (`Task Succeeded. Result:`, `Task failed.`, `Task timed out`). Anything else fell through to `in_progress`, so a `task` tool error wrapped by `ToolErrorHandlingMiddleware` (`Error: Tool 'task' failed ...`) left the card spinning forever even after the run had ended.

Extract the prefix logic into `parseSubtaskResult` and recognise any leading `Error:` token as a terminal failure. The extracted function is unit-tested against the legacy prefixes plus the `AsyncCallbackManager` regression captured in the upstream issue.

Refs: bytedance/deer-flow#3107 (BUG-007)

* fix(frontend): exclude hidden, reasoning, and tool payloads from chat export

`formatThreadAsMarkdown` / `formatThreadAsJSON` iterated raw messages without running the UI-level `isHiddenFromUIMessage` filter. Exported transcripts therefore included `hide_from_ui` system reminders, memory injections, provider `reasoning_content`, tool calls, and tool result messages, all of which are intentionally hidden in the chat view.

Filter the export to the user-visible transcript by default and gate reasoning / tool calls / tool messages / hidden messages behind explicit `ExportOptions` flags so a future debug export can opt back in without forking the formatter.

Refs: bytedance/deer-flow#3107 (BUG-006)

* fix(gateway): route get_config through get_app_config for mtime hot reload

`get_config(request)` returned the `app.state.config` snapshot captured at startup. The worker / lead-agent path then threaded that frozen `AppConfig` through `RunContext` and `agent_factory`, so per-run fields edited in `config.yaml` (notably `max_tokens`) were ignored until the gateway process was restarted, even though `get_app_config()` already does mtime-based reload at the bottom layer.

Route the request dependency through `get_app_config()` directly. Runtime `ContextVar` overrides (`push_current_app_config`) and test-injected singletons (`set_app_config`) keep working; `app.state.config` is now only read at startup for one-shot bootstrap (logging level, IM channels, `langgraph_runtime` engines).

`tests/test_gateway_deps_config.py` encoded the old snapshot contract and is removed; `tests/test_gateway_config_freshness.py` replaces it with mtime, ContextVar, and `set_app_config` coverage. `test_skills_custom_router.py` and `test_uploads_router.py` now inject test configs via FastAPI `dependency_overrides[get_config]` instead of mutating `app.state.config`.

Document the hot-reload boundary in `backend/CLAUDE.md` so reviewers know which fields are picked up on the next request vs. which still require a restart (`database`, `checkpointer`, `run_events`, `stream_bridge`, `sandbox.use`, `log_level`, `channels.*`).

Refs: bytedance/deer-flow#3107 (BUG-001)

* fix(gateway): broaden get_config 503 to any config-load failure

Address review feedback on the previous commit:

1. Narrow exception catch removed. The old contract returned 503 whenever `app.state.config is None`.
   The first cut only mapped `FileNotFoundError`, leaving `PermissionError`, YAML parse errors, and pydantic `ValidationError` to bubble up as 500. At the request boundary we treat any inability to materialise the config as "configuration not available" (503) and log the original exception so the operator still has the stack.
2. Removed the unused `request: Request` parameter and the matching `# noqa: ARG001`. FastAPI's `Depends()` does not require the dependency to accept `Request`; the only call site uses the no-arg form.
3. `backend/CLAUDE.md` boundary now lists the *reason* each field is restart-required (engine binding, singleton caching, one-shot `apply_logging_level`, etc.), not just the field name, so reviewers do not have to reverse-engineer the boundary themselves.

Tests parametrise four exception classes (`FileNotFoundError`, `PermissionError`, `ValueError`, `RuntimeError`) and assert 503 for each.

Refs: bytedance/deer-flow#3107 (BUG-001)

* fix(task-tool): defend _find_usage_recorder against non-list callbacks

Address review feedback. The previous commit handled the two common shapes LangChain hands to async tool runs (a plain `list[BaseCallbackHandler]` and a `BaseCallbackManager` subclass) but iterated any other shape directly, which would still raise `TypeError` if e.g. a single handler instance leaked through without a list wrapper.

Treat any non-list, non-manager `config["callbacks"]` value as "no recorder" rather than crash. Docstring now lists all four shapes explicitly.

New tests cover the single-handler-object case, `runtime is None`, `callbacks is None`, and `runtime.config` being a non-dict, all of which are required to be silent no-ops.

Refs: bytedance/deer-flow#3107 (BUG-002)

* fix(frontend): drop dead identity ternary and add opt-in export tests

Address review feedback on the previous export commit:

1. Removed the no-op `typeof msg.content === "string" ? msg.content : msg.content` expression in `formatThreadAsJSON`.
   Both branches returned the same value; the message content now flows through unchanged whether it is a string or the rich `MessageContent[]` shape (LangChain JSON-serialises the array structure correctly already).
2. Expanded the JSDoc on `ExportOptions` to make it clearer that the four flags are not currently wired to any UI control; callers wanting a debug export must build the options object explicitly. The default behaviour continues to match the explicit prescription in bytedance/deer-flow#3107 BUG-006.
3. Added opt-in coverage. The previous tests only exercised the `options = {}` default path; the new cases verify each flag flips the corresponding payload back into the export so a future debug-export surface does not silently break the contract.

Refs: bytedance/deer-flow#3107 (BUG-006)

* fix(frontend): export subtask prefix constants and document fallback intent

Address review feedback on the previous BUG-007 commit:

1. `SUCCESS_PREFIX`, `FAILURE_PREFIX`, `TIMEOUT_PREFIX`, and the `ERROR_WRAPPER_PATTERN` regex are now exported. The JSDoc explicitly pins them as part of the backend↔frontend contract defined in `task_tool.py` and `tool_error_handling_middleware.py`, so any future structured-status migration (e.g. backend writing `additional_kwargs.subagent_status` instead of leading text) can reference these from one canonical place rather than redefine them.
2. The `in_progress` fallback now carries a docstring explaining the deliberate choice: LangChain only ever emits a `ToolMessage` once the tool itself has returned, so unrecognised content means the contract has drifted and "still running" is the right operator signal (eagerly marking it terminal-failed would mask the drift).

No behaviour change; this is documentation and an API export.

Refs: bytedance/deer-flow#3107 (BUG-007)

* fix(gateway): drop app.state.config snapshot and freeze run_events_config

Address @ShenAC-SAC's BUG-001 review on #3131.

The previous cut still stored an `AppConfig` snapshot on `app.state.config` for startup bootstrap. Two follow-on hazards from that:

1. Future code touching the gateway lifespan could accidentally start reading `app.state.config` again, silently regressing the request hot path back to a stale snapshot.
2. `get_run_context()` paired a freshly-reloaded `AppConfig` with the startup-bound `event_store` and a *live* `run_events_config` field, so an operator who edited `run_events.backend` mid-flight would have produced a run context whose `event_store` and `run_events_config` referred to different backends.

Clean approach (aligned with the direction in PR #3128):

- `lifespan()` keeps a local `startup_config` variable and passes it explicitly into `langgraph_runtime(app, startup_config)` and into `start_channel_service`. No `app.state.config` attribute is set at any point.
- `langgraph_runtime` now accepts `startup_config` as a required parameter, removing the `getattr(app.state, "config", None)` lookup and the "config not initialised" runtime error.
- The matching `run_events_config` is frozen onto `app.state` next to `run_event_store` so `get_run_context` reads the two from the same startup-time source. `app_config` continues to be resolved live via `get_app_config()`.
- `backend/CLAUDE.md` boundary explanation updated to spell out the `startup_config` / `get_app_config()` split.

New regression test `test_run_context_app_config_reflects_yaml_edit` exercises the worker-feeding path: it asserts that `ctx.app_config` follows a mid-flight `config.yaml` edit while `ctx.run_events_config` stays frozen to the startup snapshot the event store was built from.

Refs: bytedance/deer-flow#3107 (BUG-001), bytedance/deer-flow#3131 review

* fix(frontend): parse Task cancelled and polling timed out as terminal

Address @ShenAC-SAC's BUG-007 review on #3131.

`task_tool.py` actually emits five terminal strings:

- `Task Succeeded. Result: …`
- `Task failed. …`
- `Task timed out. …`
- `Task cancelled by user.` ← previously matched none
- `Task polling timed out after N minutes …` ← previously matched none

The previous cut handled three; the last two fell through to the "unknown content" branch and pushed the subtask card back to `in_progress` even though the backend had already reached a terminal state. Add explicit matches plus regression tests for both. The `in_progress` fallback is now reserved for genuinely unrecognised output (i.e. contract drift), as documented.

Refs: bytedance/deer-flow#3107 (BUG-007), bytedance/deer-flow#3131 review

* fix(frontend): sanitize JSON export content via the Markdown content path

Address @ShenAC-SAC's BUG-006 review and the Copilot inline comment on #3131.

The previous cut filtered hidden/tool messages out of the JSON export but still serialised `msg.content` verbatim, so:

- inline `<think>…</think>` wrappers stayed in the exported `content` even with `includeReasoning: false`,
- content-array thinking blocks leaked the `thinking` field,
- `<uploaded_files>…</uploaded_files>` markers leaked the workspace paths a user uploaded files to.

JSON now goes through the same sanitiser the Markdown path uses (`extractContentFromMessage` + `stripUploadedFilesTag`). Reasoning and tool_calls remain gated behind their `ExportOptions` flags. AI / human rows that sanitise to empty content with no opted-in reasoning or tool calls are dropped so the JSON matches the Markdown path's `continue` on empty assistant fragments.

New regression tests cover the three leak shapes the reviewer called out plus the empty-content-drop case.

Refs: bytedance/deer-flow#3107 (BUG-006), bytedance/deer-flow#3131 review

* test(gateway): align lifespan stub with langgraph_runtime two-arg signature

Codex round-3 review of c0bc7a06 flagged this: changing `langgraph_runtime` to require `startup_config` as a second positional argument broke the one-arg stub `_noop_langgraph_runtime(_app)` in `test_gateway_lifespan_shutdown.py`, which is patched into `app.gateway.app.langgraph_runtime` by the lifespan shutdown bounded-timeout regression. Lifespan would then call the stub with two args and raise `TypeError` before the bounded-shutdown assertion ran.

Update the stub to match the new signature. The shutdown test itself is unaffected; it only cares about the channel `stop_channel_service` hang path.

Refs: bytedance/deer-flow#3107 (BUG-001), bytedance/deer-flow#3131 review

* fix(frontend): strip every known backend marker in export, not just uploads

Codex round-3 review of 258ca800 and the matching maintainer feedback on PR #3131 made the same point: the JSON export now ran the Markdown-side sanitiser, but that sanitiser only stripped `<uploaded_files>`. The full set of payloads middleware embeds inside message `content` is larger:

- `<uploaded_files>`: `UploadsMiddleware`
- `<system-reminder>`: `DynamicContextMiddleware`
- `<memory>`: `DynamicContextMiddleware` (nested inside system-reminder)
- `<current_date>`: `DynamicContextMiddleware`

The primary protection is still `isHiddenFromUIMessage`: the `<system-reminder>` HumanMessage is marked `hide_from_ui: true` and never reaches the formatter. This commit adds the second line of defence so a regression that drops the `hide_from_ui` flag, or any future middleware that injects the same tag vocabulary into a visible HumanMessage, cannot leak the payload into the export file.

Concrete changes:

- New `INTERNAL_MARKER_TAGS` constant + `stripInternalMarkers(content)` helper in `core/messages/utils.ts`.
  The constant doubles as documentation for the backend↔frontend contract.
- `formatMessageContent` in `export.ts` now calls `stripInternalMarkers` instead of `stripUploadedFilesTag`. UI render paths (`message-list-item.tsx`) keep using the narrower function so a user legitimately typing `<memory>` in a meta-discussion is preserved.
- The "drop empty rows" guard in `buildJSONMessage` switched from `=== undefined` comparisons to falsy `!` checks. Codex spotted the asymmetry: when `extractReasoningContentFromMessage` returned the empty string (which it legitimately can), the JSON path emitted `{reasoning: ""}` while the Markdown path's `!reasoning` `continue` correctly dropped the row.

New regression tests cover:

- the defence-in-depth strip with a `<system-reminder><memory><current_date>` payload deliberately *not* marked `hide_from_ui`;
- tool-message sanitization under `includeToolMessages: true`;
- the mixed-content-array case (`thinking + text + image_url`);
- the opted-in empty-reasoning drop.

Live verification on a real Ultra-mode thread that uploaded a PDF (`曾鑫民-薪资交易流水.pdf`): backend state's first HumanMessage carries the `<uploaded_files>` block (with `/mnt/user-data/uploads/...` paths) as part of a content-array. The Markdown and JSON export blobs both come back free of `<uploaded_files>`, `<system-reminder>`, `<current_date>`, `tool_calls`, and reasoning, while preserving the user's `这是什么？` ("What is this?") prompt and the assistant's visible answer.

Refs: bytedance/deer-flow#3107 (BUG-006), bytedance/deer-flow#3131 review

* test(frontend): cover trim, varied N, and pre-execution Error: prefixes

Codex round-3 review of 50e2c257 flagged three coverage gaps in the subtask-status parser:

1. `Task cancelled by user.` and `Task polling timed out` previously had no whitespace-trim coverage; the original trim test only exercised the success prefix. Streaming chunks can arrive with leading/trailing newlines; the regex needed an explicit assertion.
2.
   The polling-timeout case was tested only at one `N` (15 minutes). The backend interpolates the live `timeout_seconds // 60` value, so the matcher must hold for any positive integer. Now we run the case for 1, 5, and 60 minutes.
3. `task_tool.py` also emits three `Error:` strings for pre-execution failures: unknown subagent type, host-bash disabled, and "task disappeared from background tasks". They are intentionally handled by `ERROR_WRAPPER_PATTERN` rather than dedicated prefixes (the wrapper already produces the right terminal-failed shape) but had no test coverage proving that wiring. Codex was right that a refactor splitting one of them off into its own prefix would silently break things.

The JSDoc on the constants block now spells the three pre-execution errors out so the relationship between `task_tool.py` returns and the prefix vocabulary is explicit.

No production code change beyond the docstring; this commit is pure coverage hardening for the contract that already exists.

Refs: bytedance/deer-flow#3107 (BUG-007), bytedance/deer-flow#3131 review
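
Illustration only: the BUG-002 fix itself lives in the Python backend and is not part of the diff shown here. The shape-normalisation idea described for `_find_usage_recorder` (plain list, manager wrapping a `handlers` list, single handler, nothing) can be sketched as a TypeScript analogue. `Handler`, `HandlerManager`, `findUsageRecorder`, and the `"usage-recorder"` name are hypothetical stand-ins, not LangChain or DeerFlow APIs.

```typescript
// Hypothetical stand-ins for the Python-side types; none of these are real
// LangChain or DeerFlow APIs.
interface Handler {
  name: string;
}

interface HandlerManager {
  handlers: Handler[];
}

type Callbacks = Handler[] | HandlerManager | Handler | null | undefined;

function isManager(value: unknown): value is HandlerManager {
  return (
    typeof value === "object" &&
    value !== null &&
    Array.isArray((value as { handlers?: unknown }).handlers)
  );
}

// Normalise to a list before searching. Anything that is neither a list nor
// a manager is treated as "no recorder" instead of being iterated.
function findUsageRecorder(callbacks: Callbacks): Handler | undefined {
  const handlers = Array.isArray(callbacks)
    ? callbacks
    : isManager(callbacks)
      ? callbacks.handlers
      : [];
  return handlers.find((h) => h.name === "usage-recorder");
}

const recorder: Handler = { name: "usage-recorder" };
console.log(findUsageRecorder([recorder])?.name); // "usage-recorder"
console.log(findUsageRecorder({ handlers: [recorder] })?.name); // "usage-recorder"
console.log(findUsageRecorder(recorder)); // undefined (bare handler, no wrapper)
console.log(findUsageRecorder(null)); // undefined
```

The bare-handler and `null` cases return `undefined` rather than throwing, which mirrors the "silent no-op" requirement stated in the commit message.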
2026-05-22 07:56:48 +00:00 · 2026-05-21 21:18:10 +08:00
parent 4cb2a22400
commit e93f658472
16 changed files with 1060 additions and 107 deletions
@@ -27,6 +27,7 @@ import {
 import { useRehypeSplitWordsIntoSpans } from "@/core/rehype";
 import type { Subtask } from "@/core/tasks";
 import { useUpdateSubtask } from "@/core/tasks/context";
+import { parseSubtaskResult } from "@/core/tasks/subtask-result";
 import type { AgentThreadState } from "@/core/threads";
 import { cn } from "@/lib/utils";

@@ -359,33 +360,10 @@ export function MessageList({
              } else if (message.type === "tool") {
                const taskId = message.tool_call_id;
                if (taskId) {
-                  const result = extractTextFromMessage(message);
-                  if (result.startsWith("Task Succeeded. Result:")) {
-                    updateSubtask({
-                      id: taskId,
-                      status: "completed",
-                      result: result
-                        .split("Task Succeeded. Result:")[1]
-                        ?.trim(),
-                    });
-                  } else if (result.startsWith("Task failed.")) {
-                    updateSubtask({
-                      id: taskId,
-                      status: "failed",
-                      error: result.split("Task failed.")[1]?.trim(),
-                    });
-                  } else if (result.startsWith("Task timed out")) {
-                    updateSubtask({
-                      id: taskId,
-                      status: "failed",
-                      error: result,
-                    });
-                  } else {
-                    updateSubtask({
-                      id: taskId,
-                      status: "in_progress",
-                    });
-                  }
+                  const parsed = parseSubtaskResult(
+                    extractTextFromMessage(message),
+                  );
+                  updateSubtask({ id: taskId, ...parsed });
                }
              }
            }
@@ -397,6 +397,50 @@ export function stripUploadedFilesTag(content: string): string {
    .trim();
 }

+/**
+ * Tag names that backend middlewares wrap around internal payloads before
+ * letting them ride along inside LangGraph message ``content``.
+ *
+ * These markers are *not* user copy — they come from:
+ *
+ * - ``UploadsMiddleware`` → ``<uploaded_files>``
+ * - ``DynamicContextMiddleware`` → ``<system-reminder>`` (carrying
+ *   ``<memory>`` / ``<current_date>`` inside)
+ * - ``TodoListMiddleware`` / ``LoopDetectionMiddleware`` style reminders
+ *   live in ``hide_from_ui`` HumanMessages, but their inner payload uses
+ *   the same tag vocabulary.
+ *
+ * The primary export filter is {@link isHiddenFromUIMessage}. This list is
+ * the defence-in-depth strip for any message that — by middleware bug,
+ * provider quirk, or merge-conflict regression — slips through without
+ * its ``hide_from_ui`` flag set.
+ */
+export const INTERNAL_MARKER_TAGS = [
+  "uploaded_files",
+  "system-reminder",
+  "memory",
+  "current_date",
+] as const;
+
+const INTERNAL_MARKER_RE = new RegExp(
+  `<(${INTERNAL_MARKER_TAGS.join("|")})>[\\s\\S]*?</\\1>`,
+  "g",
+);
+
+/**
+ * Strip every known backend-injected marker from message content.
+ *
+ * Intended for the chat export path where a marker leaking through is a
+ * privacy regression. UI render paths should keep using
+ * {@link stripUploadedFilesTag} — they receive ``hide_from_ui`` messages
+ * via a separate filter and the narrower function avoids stripping content
+ * a user might legitimately type into a meta-discussion (e.g. asking the
+ * model about its own ``<memory>`` system).
+ */
+export function stripInternalMarkers(content: string): string {
+  return content.replace(INTERNAL_MARKER_RE, "").trim();
+}
+
 export function parseUploadedFiles(content: string): FileInMessage[] {
  // Match <uploaded_files>...</uploaded_files> tag
  const uploadedFilesRegex = /<uploaded_files>([\s\S]*?)<\/uploaded_files>/;
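
A short worked example of how the backreference in `INTERNAL_MARKER_RE` behaves. The constant, regex, and function are copied from the hunk above so the snippet runs on its own; the three input strings are made-up samples, not captured traffic.

```typescript
// Copied from the hunk above.
const INTERNAL_MARKER_TAGS = [
  "uploaded_files",
  "system-reminder",
  "memory",
  "current_date",
] as const;

const INTERNAL_MARKER_RE = new RegExp(
  `<(${INTERNAL_MARKER_TAGS.join("|")})>[\\s\\S]*?</\\1>`,
  "g",
);

function stripInternalMarkers(content: string): string {
  return content.replace(INTERNAL_MARKER_RE, "").trim();
}

// Hypothetical inputs.
const nested =
  "<system-reminder><memory>prefers tea</memory>" +
  "<current_date>2026-05-21</current_date></system-reminder>\nWhat is this?";
const upload =
  "<uploaded_files>\n- /mnt/user-data/uploads/report.pdf\n</uploaded_files>\n\nSummarise it";
const unclosed = "How does the <memory> tag work?";

// `\1` forces the closing tag to match the opening one, so the outer
// <system-reminder> pair is removed together with everything nested in it.
console.log(stripInternalMarkers(nested)); // "What is this?"
console.log(stripInternalMarkers(upload)); // "Summarise it"
// An opening tag with no matching close is left untouched.
console.log(stripInternalMarkers(unclosed)); // "How does the <memory> tag work?"
```

Note that a user message containing a complete `<memory>…</memory>` pair would be stripped on this path, which is why the JSDoc keeps the UI render path on the narrower `stripUploadedFilesTag`.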
@@ -0,0 +1,88 @@
+import type { Subtask } from "./types";
+
+export type SubtaskStatus = Subtask["status"];
+
+export interface SubtaskResultUpdate {
+  status: SubtaskStatus;
+  result?: string;
+  error?: string;
+}
+
+/**
+ * Prefix strings the backend `task` tool writes into its result `content`.
+ *
+ * These values are not user-facing copy — they are part of the
+ * backend↔frontend contract defined in
+ * `backend/packages/harness/deerflow/tools/builtins/task_tool.py` (returned
+ * from the tool body) and in
+ * `backend/packages/harness/deerflow/agents/middlewares/tool_error_handling_middleware.py`
+ * (wrapper for tool exceptions). Any change here must be paired with the
+ * matching backend change. Exported so a future structured-status migration
+ * can reference the same values from one place.
+ *
+ * `task_tool.py` also emits three `Error:` strings for pre-execution failures
+ * — unknown subagent type, host-bash disabled, and "task disappeared from
+ * background tasks". They are handled by {@link ERROR_WRAPPER_PATTERN}
+ * rather than dedicated prefixes because the wrapper already produces
+ * exactly the right `terminal failed` shape.
+ */
+export const SUCCESS_PREFIX = "Task Succeeded. Result:";
+export const FAILURE_PREFIX = "Task failed.";
+export const TIMEOUT_PREFIX = "Task timed out";
+export const CANCELLED_PREFIX = "Task cancelled by user.";
+export const POLLING_TIMEOUT_PREFIX = "Task polling timed out";
+export const ERROR_WRAPPER_PATTERN = /^Error\b/i;
+
+/**
+ * Map a `task` tool result string to a {@link SubtaskStatus}.
+ *
+ * Bytedance/deer-flow issue #3107 BUG-007: parent-visible task tool errors do
+ * not always start with one of the three legacy prefixes (e.g. when
+ * `ToolErrorHandlingMiddleware` wraps an exception as
+ * `Error: Tool 'task' failed ...`). Treat any leading `Error:` token as a
+ * terminal failure so subtask cards stop being stuck on "in_progress".
+ *
+ * Returning `in_progress` is the **deliberate** fallback for content that
+ * matches none of the known prefixes. LangChain only ever emits a
+ * `ToolMessage` once the tool itself has returned (success or wrapped
+ * exception), so an unknown shape means "the contract changed underneath us"
+ * — surfacing it as still-running prompts the operator to investigate, where
+ * eagerly marking it terminal-failed would mask the drift.
+ */
+export function parseSubtaskResult(text: string): SubtaskResultUpdate {
+  const trimmed = text.trim();
+
+  if (trimmed.startsWith(SUCCESS_PREFIX)) {
+    return {
+      status: "completed",
+      result: trimmed.slice(SUCCESS_PREFIX.length).trim(),
+    };
+  }
+
+  if (trimmed.startsWith(FAILURE_PREFIX)) {
+    return {
+      status: "failed",
+      error: trimmed.slice(FAILURE_PREFIX.length).trim(),
+    };
+  }
+
+  if (trimmed.startsWith(TIMEOUT_PREFIX)) {
+    return { status: "failed", error: trimmed };
+  }
+
+  if (trimmed.startsWith(CANCELLED_PREFIX)) {
+    return { status: "failed", error: trimmed };
+  }
+
+  if (trimmed.startsWith(POLLING_TIMEOUT_PREFIX)) {
+    return { status: "failed", error: trimmed };
+  }
+
+  // ToolErrorHandlingMiddleware-style wrapper, or any other terminal error
+  // signal the backend forwards to the lead agent.
+  if (ERROR_WRAPPER_PATTERN.test(trimmed)) {
+    return { status: "failed", error: trimmed };
+  }
+
+  return { status: "in_progress" };
+}
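
One detail that may be worth checking when reviewing this file: the JSDoc speaks of a leading `Error:` token, while `ERROR_WRAPPER_PATTERN` as written only requires the word `Error` at the start of the string, in any letter case, followed by a word boundary. The snippet below copies the pattern from the diff and probes it with made-up sample strings.

```typescript
// Copied from the diff above.
const ERROR_WRAPPER_PATTERN = /^Error\b/i;

// Hypothetical sample strings.
const samples: Array<[string, boolean]> = [
  ["Error: Tool 'task' failed with TypeError", true],
  ["error: lower-case also matches because of the i flag", true],
  ["Error handling finished", true], // no colon, still matches
  ["Errors: 3", false], // "Errors" has no word boundary after "Error"
  ["Unexpected Error: not at the start", false], // pattern is anchored with ^
];

const results = samples.map(([text]) => ERROR_WRAPPER_PATTERN.test(text));

for (const [i, [text, expected]] of samples.entries()) {
  console.log(results[i], expected, JSON.stringify(text));
}
```

Whether the colon-less match is desirable depends on what the backend can emit; a stricter variant such as `/^Error:/` would be an alternative if only the wrapper format should count as terminal.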
@@ -5,16 +5,53 @@ import {
  extractReasoningContentFromMessage,
  hasContent,
  hasToolCalls,
-  stripUploadedFilesTag,
+  isHiddenFromUIMessage,
+  stripInternalMarkers,
 } from "../messages/utils";

 import type { AgentThread } from "./types";
 import { titleOfThread } from "./utils";

+/**
+ * Optional debug switches for advanced exports.
+ *
+ * Bytedance/deer-flow issue #3107 BUG-006 explicitly prescribes that the
+ * default export includes only the user-visible transcript and excludes
+ * thinking/reasoning content, tool calls, tool results, hidden messages,
+ * memory injection, and `<system-reminder>` payloads. These options let a
+ * future "debug export" surface re-include any of those categories without
+ * forking the formatter. They are not currently wired to any UI control —
+ * callers that want them must construct the options object explicitly.
+ */
+export interface ExportOptions {
+  includeReasoning?: boolean;
+  includeToolCalls?: boolean;
+  includeToolMessages?: boolean;
+  includeHidden?: boolean;
+}
+
+function visibleMessages(
+  messages: Message[],
+  options: ExportOptions,
+): Message[] {
+  return messages.filter((message) => {
+    if (!options.includeHidden && isHiddenFromUIMessage(message)) {
+      return false;
+    }
+    if (!options.includeToolMessages && message.type === "tool") {
+      return false;
+    }
+    return true;
+  });
+}
+
 function formatMessageContent(message: Message): string {
  const text = extractContentFromMessage(message);
  if (!text) return "";
-  return stripUploadedFilesTag(text);
+  // Defence-in-depth: even if a middleware-injected marker slipped through
+  // the `hide_from_ui` filter, scrub every known internal tag before the
+  // content lands in a user-visible export file.
+  return stripInternalMarkers(text);
 }

 function formatToolCalls(message: Message): string {
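
A minimal sketch of how the two message-level flags change what `visibleMessages` lets through. `SketchMessage` and its `hidden` field are hypothetical simplifications: the real code calls `isHiddenFromUIMessage` on LangGraph `Message` objects, and the sample thread is invented.

```typescript
// Simplified stand-in for the real Message type; `hidden` replaces the
// real isHiddenFromUIMessage() check for the purpose of this sketch.
interface SketchMessage {
  type: "human" | "ai" | "tool";
  content: string;
  hidden?: boolean;
}

interface SketchExportOptions {
  includeToolMessages?: boolean;
  includeHidden?: boolean;
}

function visibleMessages(
  messages: SketchMessage[],
  options: SketchExportOptions = {},
): SketchMessage[] {
  return messages.filter((message) => {
    if (!options.includeHidden && message.hidden) return false;
    if (!options.includeToolMessages && message.type === "tool") return false;
    return true;
  });
}

// Hypothetical thread: one hidden reminder, one user prompt, one tool
// result, one assistant answer.
const thread: SketchMessage[] = [
  { type: "human", content: "<system-reminder>...</system-reminder>", hidden: true },
  { type: "human", content: "What is this?" },
  { type: "tool", content: "Task Succeeded. Result: ok" },
  { type: "ai", content: "It is a PDF." },
];

console.log(visibleMessages(thread).length); // 2
console.log(visibleMessages(thread, { includeToolMessages: true }).length); // 3
console.log(
  visibleMessages(thread, { includeHidden: true, includeToolMessages: true }).length,
); // 4
```

With no options the export keeps only the user prompt and the assistant answer; each flag opts one category back in independently.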
@@ -26,6 +63,7 @@ function formatToolCalls(message: Message): string {
 export function formatThreadAsMarkdown(
  thread: AgentThread,
  messages: Message[],
+  options: ExportOptions = {},
 ): string {
  const title = titleOfThread(thread);
  const createdAt = thread.created_at
@@ -41,16 +79,20 @@ export function formatThreadAsMarkdown(
    "",
  ];

-  for (const message of messages) {
+  for (const message of visibleMessages(messages, options)) {
    if (message.type === "human") {
      const content = formatMessageContent(message);
      if (content) {
        lines.push(`## 🧑 User`, "", content, "", "---", "");
      }
    } else if (message.type === "ai") {
-      const reasoning = extractReasoningContentFromMessage(message);
+      const reasoning = options.includeReasoning
+        ? extractReasoningContentFromMessage(message)
+        : undefined;
      const content = formatMessageContent(message);
-      const toolCalls = formatToolCalls(message);
+      const toolCalls = options.includeToolCalls
+        ? formatToolCalls(message)
+        : "";

      if (!content && !toolCalls && !reasoning) continue;

@@ -83,23 +125,65 @@ export function formatThreadAsMarkdown(
  return lines.join("\n").trimEnd() + "\n";
 }

+interface JSONExportMessage {
+  type: Message["type"];
+  id: string | undefined;
+  content: string;
+  reasoning?: string;
+  tool_calls?: unknown;
+}
+
+function buildJSONMessage(
+  msg: Message,
+  options: ExportOptions,
+): JSONExportMessage | null {
+  // Run the same sanitiser the Markdown path uses so the JSON `content`
+  // field never carries inline `<think>...</think>` wrappers, content-array
+  // thinking blocks, `<uploaded_files>` markers, or other internal payloads.
+  const content = formatMessageContent(msg);
+  const reasoning =
+    options.includeReasoning && msg.type === "ai"
+      ? (extractReasoningContentFromMessage(msg) ?? undefined)
+      : undefined;
+  const toolCalls =
+    options.includeToolCalls &&
+    msg.type === "ai" &&
+    "tool_calls" in msg &&
+    msg.tool_calls?.length
+      ? msg.tool_calls
+      : undefined;
+
+  // Drop rows with no exportable payload (empty content + no opted-in
+  // reasoning / tool_calls). Uses falsy semantics so `reasoning: ""` (the
+  // empty string ``extractReasoningContentFromMessage`` can hand back) is
+  // treated the same way Markdown's `!reasoning` continue does — otherwise
+  // an opted-in but empty reasoning field would leak as `{reasoning: ""}`.
+  if (!content && !reasoning && !toolCalls) {
+    return null;
+  }
+
+  return {
+    type: msg.type,
+    id: msg.id,
+    content,
+    ...(reasoning !== undefined ? { reasoning } : {}),
+    ...(toolCalls !== undefined ? { tool_calls: toolCalls } : {}),
+  };
+}
+
 export function formatThreadAsJSON(
  thread: AgentThread,
  messages: Message[],
+  options: ExportOptions = {},
 ): string {
  const exportData = {
    title: titleOfThread(thread),
    thread_id: thread.thread_id,
    created_at: thread.created_at,
    exported_at: new Date().toISOString(),
-    messages: messages.map((msg) => ({
-      type: msg.type,
-      id: msg.id,
-      content: typeof msg.content === "string" ? msg.content : msg.content,
-      ...(msg.type === "ai" && msg.tool_calls?.length
-        ? { tool_calls: msg.tool_calls }
-        : {}),
-    })),
+    messages: visibleMessages(messages, options)
+      .map((msg) => buildJSONMessage(msg, options))
+      .filter((m): m is JSONExportMessage => m !== null),
  };
  return JSON.stringify(exportData, null, 2);
 }
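
The empty-row guard in `buildJSONMessage` relies on falsy checks rather than comparisons against `undefined`. The sketch below contrasts the two on an opted-in but empty reasoning string. `Row`, `shouldDropStrict`, and `shouldDropFalsy` are hypothetical names, and the strict variant is an assumed reconstruction of the earlier guard, which is not shown in this diff.

```typescript
interface Row {
  content: string;
  reasoning?: string;
}

// Assumed shape of the earlier guard: only `undefined` counts as "missing".
function shouldDropStrict(row: Row): boolean {
  return row.content === "" && row.reasoning === undefined;
}

// Falsy semantics, as in the diff: "" counts as "missing" too.
function shouldDropFalsy(row: Row): boolean {
  return !row.content && !row.reasoning;
}

// Reasoning was opted in, but the extractor handed back an empty string.
const emptyReasoning: Row = { content: "", reasoning: "" };
console.log(shouldDropStrict(emptyReasoning)); // false: row survives with reasoning ""
console.log(shouldDropFalsy(emptyReasoning)); // true: row is dropped

// A row with real content is kept under both guards.
const realAnswer: Row = { content: "Done.", reasoning: "" };
console.log(shouldDropStrict(realAnswer)); // false
console.log(shouldDropFalsy(realAnswer)); // false
```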
@@ -0,0 +1,112 @@
+import { describe, expect, it } from "vitest";
+
+import { parseSubtaskResult } from "@/core/tasks/subtask-result";
+
+describe("parseSubtaskResult", () => {
+  it("recognises the standard success prefix", () => {
+    const parsed = parseSubtaskResult(
+      "Task Succeeded. Result: investigated and produced a 3-page report",
+    );
+    expect(parsed.status).toBe("completed");
+    expect(parsed.result).toBe("investigated and produced a 3-page report");
+  });
+
+  it("recognises the standard failure prefix", () => {
+    const parsed = parseSubtaskResult(
+      "Task failed. underlying tool raised RuntimeError",
+    );
+    expect(parsed.status).toBe("failed");
+    expect(parsed.error).toBe("underlying tool raised RuntimeError");
+  });
+
+  it("recognises the standard timeout prefix", () => {
+    const parsed = parseSubtaskResult("Task timed out after 900s");
+    expect(parsed.status).toBe("failed");
+    expect(parsed.error).toBe("Task timed out after 900s");
+  });
+
+  it("recognises the cancelled-by-user prefix", () => {
+    // bytedance/deer-flow#3131 review: this is one of the five terminal
+    // strings task_tool.py actually emits — the previous cut treated it as
+    // unrecognised content and pushed the card back to in_progress.
+    const parsed = parseSubtaskResult("Task cancelled by user.");
+    expect(parsed.status).toBe("failed");
+    expect(parsed.error).toBe("Task cancelled by user.");
+  });
+
+  it("recognises the polling-timed-out prefix", () => {
+    // Emitted by task_tool when the background polling loop runs out of
+    // budget waiting for the subagent to reach a terminal state.
+    const parsed = parseSubtaskResult(
+      "Task polling timed out after 15 minutes. This may indicate the background task is stuck. Status: RUNNING",
+    );
+    expect(parsed.status).toBe("failed");
+    expect(parsed.error).toContain("polling timed out");
+  });
+
+  it("recognises polling-timed-out with different durations", () => {
+    // `task_tool` emits `Task polling timed out after {N} minutes` where N
+    // varies with the configured subagent timeout. Guard against the regex
+    // accidentally being pinned to a specific number.
+    for (const n of [1, 5, 60]) {
+      const parsed = parseSubtaskResult(
+        `Task polling timed out after ${n} minutes. Status: RUNNING`,
+      );
+      expect(parsed.status).toBe("failed");
+    }
+  });
+
+  it("trims whitespace around cancelled and polling-timed-out prefixes", () => {
+    // Streaming chunks sometimes arrive with leading/trailing newlines.
+    expect(parseSubtaskResult("  Task cancelled by user.  \n").status).toBe(
+      "failed",
+    );
+    expect(
+      parseSubtaskResult("\n\nTask polling timed out after 3 minutes").status,
+    ).toBe("failed");
+  });
+
+  it("recognises task_tool pre-execution Error: returns via the wrapper", () => {
+    // `task_tool.py` returns three `Error:` strings for unknown subagent
+    // type, host-bash disabled, and "task disappeared". They share the
+    // ERROR_WRAPPER_PATTERN, not a dedicated prefix, so this guards
+    // against a refactor splitting them off.
+    for (const text of [
+      "Error: Unknown subagent type 'foo'. Available: bash, general-purpose",
+      "Error: Host bash subagent is disabled by configuration",
+      "Error: Task 1234 disappeared from background tasks",
+    ]) {
+      expect(parseSubtaskResult(text).status).toBe("failed");
+    }
+  });
+
+  it("treats middleware-wrapped tool errors as terminal failures", () => {
+    // bytedance/deer-flow issue #3107 BUG-007: the parent-visible ToolMessage
+    // produced by ToolErrorHandlingMiddleware never matches the three legacy
+    // prefixes, so subtask cards stay stuck on "in_progress".
+    const parsed = parseSubtaskResult(
+      "Error: Tool 'task' failed with TypeError: 'AsyncCallbackManager' object is not iterable. Continue with available context, or choose an alternative tool.",
+    );
+    expect(parsed.status).toBe("failed");
+    expect(parsed.error).toContain("AsyncCallbackManager");
+  });
+
+  it("treats any other Error: prefix as a terminal failure", () => {
+    const parsed = parseSubtaskResult("Error: subagent worker pool exhausted");
+    expect(parsed.status).toBe("failed");
+  });
+
+  it("keeps unrecognised non-error output as in_progress", () => {
+    // Streaming partial chunks should not flip the card to terminal early.
+    const parsed = parseSubtaskResult("Investigating ...");
+    expect(parsed.status).toBe("in_progress");
+    expect(parsed.error).toBeUndefined();
+    expect(parsed.result).toBeUndefined();
+  });
+
+  it("trims surrounding whitespace before matching prefixes", () => {
+    const parsed = parseSubtaskResult("   Task Succeeded. Result: ok   ");
+    expect(parsed.status).toBe("completed");
+    expect(parsed.result).toBe("ok");
+  });
+});
@@ -0,0 +1,317 @@
+import type { Message } from "@langchain/langgraph-sdk";
+import { describe, expect, it } from "vitest";
+
+import {
+  formatThreadAsJSON,
+  formatThreadAsMarkdown,
+} from "@/core/threads/export";
+import type { AgentThread } from "@/core/threads/types";
+
+// bytedance/deer-flow issue #3107 BUG-006: the chat export path bypasses the
+// UI-level hidden-message filter and emits reasoning content, tool calls, and
+// any other "internal" payload as if it were part of the user transcript.
+
+function makeThread(): AgentThread {
+  return {
+    thread_id: "thread-1",
+    created_at: "2026-05-21T00:00:00Z",
+    updated_at: "2026-05-21T00:00:00Z",
+    metadata: { title: "Demo thread" },
+    status: "idle",
+    values: { messages: [] },
+  } as unknown as AgentThread;
+}
+
+function human(content: string, extra: Partial<Message> = {}): Message {
+  return {
+    id: `h-${content}`,
+    type: "human",
+    content,
+    ...extra,
+  } as Message;
+}
+
+function ai(
+  content: string,
+  extra: Partial<Message> & { tool_calls?: unknown } = {},
+): Message {
+  return {
+    id: `a-${content}`,
+    type: "ai",
+    content,
+    ...extra,
+  } as Message;
+}
+
+function toolMsg(content: string): Message {
+  return {
+    id: `t-${content}`,
+    type: "tool",
+    content,
+    name: "task",
+    tool_call_id: "call-1",
+  } as unknown as Message;
+}
+
+describe("formatThreadAsMarkdown", () => {
+  it("includes plain user and assistant text", () => {
+    const md = formatThreadAsMarkdown(makeThread(), [
+      human("hello"),
+      ai("hi there"),
+    ]);
+    expect(md).toContain("hello");
+    expect(md).toContain("hi there");
+  });
+
+  it("drops messages marked hide_from_ui", () => {
+    const hidden = human("internal system reminder", {
+      additional_kwargs: { hide_from_ui: true },
+    } as Partial<Message>);
+    const md = formatThreadAsMarkdown(makeThread(), [
+      hidden,
+      ai("public answer"),
+    ]);
+    expect(md).not.toContain("internal system reminder");
+    expect(md).toContain("public answer");
+  });
+
+  it("does not emit reasoning_content by default", () => {
+    const message = ai("final answer", {
+      additional_kwargs: {
+        reasoning_content: "secret chain of thought",
+      },
+    } as Partial<Message>);
+    const md = formatThreadAsMarkdown(makeThread(), [message]);
+    expect(md).not.toContain("secret chain of thought");
+    expect(md).not.toContain("Thinking");
+  });
+
+  it("does not emit tool calls by default", () => {
+    const message = ai("ok", {
+      tool_calls: [{ id: "1", name: "task", args: { description: "do work" } }],
+    } as Partial<Message>);
+    const md = formatThreadAsMarkdown(makeThread(), [message]);
+    expect(md).not.toContain("**Tool:**");
+    expect(md).not.toContain("`task`");
+  });
+
+  it("drops tool result messages", () => {
+    const md = formatThreadAsMarkdown(makeThread(), [
+      ai("delegating"),
+      toolMsg("Task Succeeded. Result: confidential"),
+    ]);
+    expect(md).not.toContain("confidential");
+  });
+});
+
+describe("formatThreadAsMarkdown opt-in flags", () => {
+  it("emits reasoning when includeReasoning is true", () => {
+    const message = ai("final answer", {
+      additional_kwargs: {
+        reasoning_content: "step-by-step chain of thought",
+      },
+    } as Partial<Message>);
+    const md = formatThreadAsMarkdown(makeThread(), [message], {
+      includeReasoning: true,
+    });
+    expect(md).toContain("step-by-step chain of thought");
+    expect(md).toContain("Thinking");
+  });
+
+  it("emits tool call rows when includeToolCalls is true", () => {
+    const message = ai("ok", {
+      tool_calls: [{ id: "1", name: "task", args: { description: "do work" } }],
+    } as Partial<Message>);
+    const md = formatThreadAsMarkdown(makeThread(), [message], {
+      includeToolCalls: true,
+    });
+    expect(md).toContain("**Tool:**");
+    expect(md).toContain("`task`");
+  });
+
+  it("keeps hidden messages when includeHidden is true", () => {
+    const hidden = human("internal reminder", {
+      additional_kwargs: { hide_from_ui: true },
+    } as Partial<Message>);
+    const md = formatThreadAsMarkdown(makeThread(), [hidden], {
+      includeHidden: true,
+    });
+    expect(md).toContain("internal reminder");
+  });
+});
+
+describe("formatThreadAsJSON opt-in flags", () => {
+  it("emits tool_calls field when includeToolCalls is true", () => {
+    const message = ai("ok", {
+      tool_calls: [{ id: "1", name: "task", args: { description: "x" } }],
+    } as Partial<Message>);
+    const raw = formatThreadAsJSON(makeThread(), [message], {
+      includeToolCalls: true,
+    });
+    expect(raw).toContain("tool_calls");
+    expect(raw).toContain('"task"');
+  });
+
+  it("keeps tool messages when includeToolMessages is true", () => {
+    const raw = formatThreadAsJSON(
+      makeThread(),
+      [toolMsg("Task Succeeded. Result: keep me")],
+      { includeToolMessages: true },
+    );
+    const parsed = JSON.parse(raw) as { messages: { type: string }[] };
+    expect(parsed.messages.some((m) => m.type === "tool")).toBe(true);
+    expect(raw).toContain("keep me");
+  });
+});
+
+describe("formatThreadAsJSON", () => {
+  it("strips hidden messages, tool messages, reasoning, and tool calls", () => {
+    const messages = [
+      human("hello"),
+      human("secret reminder", {
+        additional_kwargs: { hide_from_ui: true },
+      } as Partial<Message>),
+      ai("answer", {
+        additional_kwargs: {
+          reasoning_content: "secret reasoning",
+        },
+        tool_calls: [{ id: "1", name: "task", args: {} }],
+      } as Partial<Message>),
+      toolMsg("internal trace"),
+    ];
+    const raw = formatThreadAsJSON(makeThread(), messages);
+    const parsed = JSON.parse(raw) as {
+      messages: { type: string; tool_calls?: unknown[] }[];
+    };
+
+    expect(parsed.messages).toHaveLength(2);
+    expect(parsed.messages.every((m) => m.type !== "tool")).toBe(true);
+    expect(raw).not.toContain("secret reminder");
+    expect(raw).not.toContain("secret reasoning");
+    expect(raw).not.toContain("internal trace");
+    expect(raw).not.toContain("tool_calls");
+  });
+
+  it("strips inline <think>...</think> wrappers from content", () => {
+    // bytedance/deer-flow#3131 review: JSON export must run the same
+    // sanitiser the Markdown path uses so inline reasoning never leaks
+    // even when `includeReasoning` is left at its default false.
+    const message = ai("<think>internal monologue</think>visible answer", {
+      id: "ai-1",
+    } as Partial<Message>);
+    const raw = formatThreadAsJSON(makeThread(), [message]);
+    expect(raw).not.toContain("internal monologue");
+    expect(raw).not.toContain("<think>");
+    expect(raw).toContain("visible answer");
+  });
+
+  it("strips content-array thinking blocks from content", () => {
+    const message = ai("placeholder", {
+      id: "ai-2",
+      content: [
+        { type: "thinking", thinking: "hidden reasoning step" },
+        { type: "text", text: "final visible text" },
+      ],
+    } as unknown as Partial<Message>);
+    const raw = formatThreadAsJSON(makeThread(), [message]);
+    expect(raw).not.toContain("hidden reasoning step");
+    expect(raw).toContain("final visible text");
+  });
+
+  it("strips <uploaded_files> markers from content", () => {
+    const message = human(
+      "real prompt\n<uploaded_files>\n/mnt/user-data/uploads/secret.pdf\n</uploaded_files>",
+      { id: "h-clean" } as Partial<Message>,
+    );
+    const raw = formatThreadAsJSON(makeThread(), [message]);
+    expect(raw).not.toContain("<uploaded_files>");
+    expect(raw).not.toContain("secret.pdf");
+    expect(raw).toContain("real prompt");
+  });
+
+  it("drops AI messages that sanitise to empty content", () => {
+    // Pure-reasoning AI fragments (no visible text, no tool calls) should
+    // not survive as `{content: ""}` rows in the export.
+    const message = ai("<think>only thinking, no answer</think>", {
+      id: "ai-3",
+    } as Partial<Message>);
+    const raw = formatThreadAsJSON(makeThread(), [message]);
+    const parsed = JSON.parse(raw) as { messages: unknown[] };
+    expect(parsed.messages).toHaveLength(0);
+  });
+
+  it("strips <system-reminder>/<memory>/<current_date> as defence in depth", () => {
+    // Primary protection is `isHiddenFromUIMessage` filtering the whole
+    // hidden HumanMessage. If a regression strips the `hide_from_ui` flag
+    // (or the marker leaks into an otherwise-visible message), the
+    // sanitiser must still scrub the payload before export.
+    const leaky = human("real user text", {
+      id: "leak-1",
+      content:
+        "<system-reminder>\n<memory>secret fact A</memory>\n<current_date>2026-01-01, Tuesday</current_date>\n</system-reminder>\nreal user text",
+      // Deliberately *not* setting hide_from_ui to model the regression
+      // case the defence-in-depth strip is guarding against.
+    } as unknown as Partial<Message>);
+    const raw = formatThreadAsJSON(makeThread(), [leaky]);
+    expect(raw).not.toContain("<system-reminder>");
+    expect(raw).not.toContain("<memory>");
+    expect(raw).not.toContain("<current_date>");
+    expect(raw).not.toContain("secret fact A");
+    expect(raw).toContain("real user text");
+  });
+
+  it("sanitises tool message content when includeToolMessages is true", () => {
+    const message = {
+      id: "t-leak",
+      type: "tool",
+      content:
+        "Task Succeeded. Result: payload\n<uploaded_files>\n/mnt/user-data/uploads/secret.pdf\n</uploaded_files>",
+      name: "task",
+      tool_call_id: "call-leak",
+    } as unknown as Message;
+
+    const raw = formatThreadAsJSON(makeThread(), [message], {
+      includeToolMessages: true,
+    });
+    expect(raw).toContain("Task Succeeded");
+    expect(raw).not.toContain("<uploaded_files>");
+    expect(raw).not.toContain("secret.pdf");
+  });
+
+  it("preserves text and image_url parts in mixed content arrays", () => {
+    // `extractContentFromMessage` keeps `text` and `image_url` parts and
+    // drops `thinking` parts. The JSON export must agree with that
+    // contract.
+    const message = ai("placeholder", {
+      id: "ai-mixed",
+      content: [
+        { type: "thinking", thinking: "internal reasoning" },
+        { type: "text", text: "user-visible answer" },
+        {
+          type: "image_url",
+          image_url: { url: "https://example.invalid/cat.png" },
+        },
+      ],
+    } as unknown as Partial<Message>);
+    const raw = formatThreadAsJSON(makeThread(), [message]);
+    expect(raw).toContain("user-visible answer");
+    expect(raw).toContain("https://example.invalid/cat.png");
+    expect(raw).not.toContain("internal reasoning");
+  });
+
+  it("drops opted-in empty reasoning rather than emit reasoning: ''", () => {
+    // `extractReasoningContentFromMessage` can legitimately hand back ""
+    // for an AI message that has no reasoning content. The export must
+    // mirror the Markdown path, which skips the message when `!reasoning`,
+    // and drop the row instead of leaking `{reasoning: ""}`.
+    const message = ai("", {
+      id: "ai-empty-reasoning",
+      additional_kwargs: { reasoning_content: "" },
+    } as Partial<Message>);
+    const raw = formatThreadAsJSON(makeThread(), [message], {
+      includeReasoning: true,
+    });
+    const parsed = JSON.parse(raw) as { messages: unknown[] };
+    expect(parsed.messages).toHaveLength(0);
+  });
+});