fix(backend): stream DeerFlowClient AI text as token deltas (#1969) (#1974)

* fix(backend): stream DeerFlowClient AI text as token deltas (#1969)

DeerFlowClient.stream() subscribed to LangGraph stream_mode=["values",
"custom"] which only delivers full-state snapshots at graph-node
boundaries, so AI replies were dumped as a single messages-tuple event
per node instead of streaming token-by-token. `client.stream("hello")`
looked identical to `client.chat("hello")` — the bug reported in #1969.

Subscribe to "messages" mode as well, forward AIMessageChunk deltas as
messages-tuple events with delta semantics (consumers accumulate by id),
and dedup the values-snapshot path so it does not re-synthesize AI
text that was already streamed. Introduce a per-id usage_metadata
counter so the final AIMessage in the values snapshot and the final
"messages" chunk — which carry the same cumulative usage — are not
double-counted.

chat() now accumulates per-id deltas and returns the last message's
full accumulated text. Non-streaming mock sources (single event per id)
are a degenerate case of the same logic, keeping existing callers and
tests backward compatible.

Verified end-to-end against a real LLM: a 15-number count emits 35
messages-tuple events with BPE subword boundaries clearly visible
("eleven" -> "ele" / "ven", "twelve" -> "tw" / "elve"), 476ms across
the window, end-event usage matches the values-snapshot usage exactly
(not doubled). tests/test_client_live.py::TestLiveStreaming passes.

New unit tests:
- test_messages_mode_emits_token_deltas: 3 AIMessageChunks produce 3
  delta events with correct content/id/usage, values-snapshot does not
  duplicate, usage counted once.
- test_chat_accumulates_streamed_deltas: chat() rebuilds full text
  from deltas.
- test_messages_mode_tool_message: ToolMessage delivered via messages
  mode is not duplicated by the values-snapshot synthesis path.

The stream() docstring now documents why this client does not reuse
Gateway's run_agent() / StreamBridge pipeline (sync vs async, raw
LangChain objects vs serialized dicts, single caller vs HTTP fan-out).

Fixes #1969

* refactor(backend): simplify DeerFlowClient streaming helpers (#1969)

Post-review cleanup for the token-level streaming fix. No behavior
change for correct inputs; one efficiency regression fixed.

Fix: chat() O(n²) accumulator
-----------------------------
`chat()` accumulated per-id text via `buffers[id] = buffers.get(id,"") + delta`,
which is O(n) per concat → O(n²) total over a streamed response. At
~2 KB cumulative text this becomes user-visible; at 50 KB / 5000 chunks
it costs roughly 100-300 ms of pure copying. Switched to
`dict[str, list[str]]` + `"".join()` once at return.

Cleanup
-------
- Extract `_serialize_tool_calls`, `_ai_text_event`, `_ai_tool_calls_event`,
  and `_tool_message_event` static helpers. The messages-mode and
  values-mode branches previously repeated four inline dict literals each;
  they now call the same builders.
- `StreamEvent.type` is now typed as `Literal["values", "messages-tuple",
  "custom", "end"]` via a `StreamEventType` alias. Makes the closed set
  explicit and catches typos at type-check time.
- Direct attribute access on `AIMessage`/`AIMessageChunk`: `.usage_metadata`,
  `.tool_calls`, `.id` all have default values on the base class, so the
  `getattr(..., None)` fallbacks were dead code. Removed from the hot
  path.
- `_account_usage` parameter type loosened to `Any` so that LangChain's
  `UsageMetadata` TypedDict is accepted under strict type checking.
- Trimmed narrating comments on `seen_ids` / `streamed_ids` / the
  values-synthesis skip block; kept the non-obvious ones that document
  the cross-mode dedup invariant.

Net diff: -15 lines. All 132 unit tests + harness boundary test still
pass; ruff check and ruff format pass.

* docs(backend): add STREAMING.md design note (#1969)

Dedicated design document for the token-level streaming architecture,
prompted by the bug investigation in #1969.

Contents:
- Why two parallel streaming paths exist (Gateway HTTP/async vs
  DeerFlowClient sync/in-process) and why they cannot be merged.
- LangGraph's three-layer mode naming (Graph "messages" vs Platform
  SDK "messages-tuple" vs HTTP SSE) and why a shared string constant
  would be harmful.
- Gateway path: run_agent + StreamBridge + sse_consumer with a
  sequence diagram.
- DeerFlowClient path: sync generator + direct yield, delta semantics,
  chat() accumulator.
- Why the three id sets (seen_ids / streamed_ids / counted_usage_ids)
  each carry an independent invariant and cannot be collapsed.
- End-to-end sequence for a real conversation turn.
- Lessons from #1969: why mock-based tests missed the bug, why
  BPE subword boundaries in live output are the strongest
  correctness signal, and the regression test that locks it in.
- Source code location index.

Also:
- Link from backend/CLAUDE.md Embedded Client section.
- Link from backend/docs/README.md under Feature Documentation.

* test(backend): add refactor regression guards for stream() (#1969)

Three new tests in TestStream that lock the contract introduced by
PR #1974 so any future refactor (sync->async migration, sharing a
core with Gateway's run_agent, dedup strategy change) cannot
silently change behavior.

- test_dedup_requires_messages_before_values_invariant: canary that
  documents the order-dependence of cross-mode dedup. streamed_ids
  is populated only by the messages branch, so values-before-messages
  for the same id produces duplicate AI text events. Real LangGraph
  never inverts this order, but a refactor that does (or that makes
  dedup idempotent) must update this test deliberately.

- test_messages_mode_golden_event_sequence: locks the *exact* event
  sequence (4 events: 2 messages-tuple deltas, 1 values snapshot, 1
  end) for a canonical streaming turn. List equality gives a clear
  diff on any drift in order, type, or payload shape.

- test_chat_accumulates_in_linear_time: perf canary for the O(n^2)
  fix in commit 1f11ba10. 10,000 single-char chunks must accumulate
  in under 1s; the threshold is wide enough to pass on slow CI but
  tight enough to fail if buffer = buffer + delta is restored.

All three tests pass alongside the existing 12 TestStream tests
(15/15). ruff check + ruff format clean.

* docs(backend): clarify stream() docstring on JSON serialization (#1969)

Replace the misleading "raw LangChain objects (AIMessage,
usage_metadata as dataclasses), not dicts" claim in the
"Why not reuse Gateway's run_agent?" section. The implementation
already yields plain Python dicts (StreamEvent.data is dict, and
usage_metadata is a TypedDict), so the original wording suggested
a richer return type than the API actually delivers.

The corrected wording focuses on what is actually true and
relevant: this client skips the JSON/SSE serialization layer that
Gateway adds for HTTP wire transmission, and yields stream event
payloads directly as Python data structures.

Addresses Copilot review feedback on PR #1974.

* test(backend): document none-id messages dedup limitation (#1969)

Add test_none_id_chunks_produce_duplicates_known_limitation to
TestStream that explicitly documents and asserts the current
behavior when an LLM provider emits AIMessageChunk with id=None
(vLLM, certain custom backends).

The cross-mode dedup machinery cannot record a None id in
streamed_ids (guarded by ``if msg_id:``), so the values snapshot's
reassembled AIMessage with a real id falls through and synthesizes
a duplicate AI text event. The test asserts len == 2 and locks
this as a known limitation rather than silently letting future
contributors hit it without context.

Why this is documented rather than fixed:
* Falling back to ``metadata.get("id")`` does not help — LangGraph's
  messages-mode metadata never carries the message id.
* Synthesizing ``f"_synth_{id(msg_chunk)}"`` only helps if the
  values snapshot uses the same fallback, which it does not.
* A real fix requires provider cooperation (always emit chunk ids)
  or content-based dedup (false-positive risk), neither of which
  belongs in this PR.

If a real fix lands, replace this test with a positive assertion
that dedup works for None-id chunks.

Addresses Copilot review feedback on PR #1974 (client.py:515).

* fix(frontend): UI polish - fix CSS typo, dark mode border, and hardcoded colors (#1942)

- Fix `font-norma` typo to `font-normal` in message-list subtask count
- Fix dark mode `--border` using reddish hue (22.216) instead of neutral
- Replace hardcoded `rgb(184,184,192)` in hero with `text-muted-foreground`
- Replace hardcoded `bg-[#a3a1a1]` in streaming indicator with `bg-muted-foreground`
- Add missing `font-sans` to welcome description `<pre>` for consistency
- Make case-study-section padding responsive (`px-4 md:px-20`)

Closes #1940

* docs: clarify deployment sizing guidance (#1963)

* fix(frontend): prevent stale 'new' thread ID from triggering 422 history requests (#1960)

After history.replaceState updates the URL from /chats/new to
/chats/{UUID}, Next.js useParams does not update because replaceState
bypasses the router. The useEffect in useThreadChat would then set
threadIdFromPath ('new') as the threadId, causing the LangGraph SDK
to call POST /threads/new/history which returns HTTP 422 (Invalid
thread ID: must be a UUID).

This fix adds a guard to skip the threadId update when
threadIdFromPath is the literal string 'new', preserving the
already-correct UUID that was set when the thread was created.

* fix(frontend): avoid using route new as thread id (#1967)

Co-authored-by: luoxiao6645 <luoxiao6645@gmail.com>

* Fix(subagent): Event loop conflict in SubagentExecutor.execute() (#1965)

* Fix event loop conflict in SubagentExecutor.execute()

When SubagentExecutor.execute() is called from within an already-running
event loop (e.g., when the parent agent uses async/await), calling
asyncio.run() creates a new event loop that conflicts with asyncio
primitives (like httpx.AsyncClient) that were created in and bound to
the parent loop.

This fix detects if we're already in a running event loop, and if so,
runs the subagent in a separate thread with its own isolated event loop
to avoid conflicts.

Fixes: sub-task cards not appearing in Ultra mode when using async parent agents

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(subagent): harden isolated event loop execution

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor(backend): remove dead getattr in _tool_message_event

---------

Co-authored-by: greatmengqi <chenmengqi.0376@bytedance.com>
Co-authored-by: Xinmin Zeng <135568692+fancyboi999@users.noreply.github.com>
Co-authored-by: 13ernkastel <LennonCMJ@live.com>
Co-authored-by: siwuai <458372151@qq.com>
Co-authored-by: 肖 <168966994+luoxiao6645@users.noreply.github.com>
Co-authored-by: luoxiao6645 <luoxiao6645@gmail.com>
Co-authored-by: Saber <11769524+hawkli-1994@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
This commit is contained in:
greatmengqi
2026-04-10 18:16:38 +08:00
committed by GitHub
parent 654354c624
commit b1aabe88b8
5 changed files with 917 additions and 56 deletions
+189 -49
View File
@@ -25,7 +25,7 @@ import uuid
from collections.abc import Generator, Sequence
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
from typing import Any, Literal
from langchain.agents import create_agent
from langchain.agents.middleware import AgentMiddleware
@@ -55,6 +55,9 @@ from deerflow.uploads.manager import (
logger = logging.getLogger(__name__)
StreamEventType = Literal["values", "messages-tuple", "custom", "end"]
@dataclass
class StreamEvent:
"""A single event from the streaming agent response.
@@ -69,7 +72,7 @@ class StreamEvent:
data: Event payload. Contents vary by type.
"""
type: str
type: StreamEventType
data: dict[str, Any] = field(default_factory=dict)
@@ -254,13 +257,53 @@ class DeerFlowClient:
return get_available_tools(model_name=model_name, subagent_enabled=subagent_enabled)
@staticmethod
def _serialize_tool_calls(tool_calls) -> list[dict]:
"""Reshape LangChain tool_calls into the wire format used in events."""
return [{"name": tc["name"], "args": tc["args"], "id": tc.get("id")} for tc in tool_calls]
@staticmethod
def _ai_text_event(msg_id: str | None, text: str, usage: dict | None) -> "StreamEvent":
"""Build a ``messages-tuple`` AI text event, attaching usage when present."""
data: dict[str, Any] = {"type": "ai", "content": text, "id": msg_id}
if usage:
data["usage_metadata"] = usage
return StreamEvent(type="messages-tuple", data=data)
@staticmethod
def _ai_tool_calls_event(msg_id: str | None, tool_calls) -> "StreamEvent":
"""Build a ``messages-tuple`` AI tool-calls event."""
return StreamEvent(
type="messages-tuple",
data={
"type": "ai",
"content": "",
"id": msg_id,
"tool_calls": DeerFlowClient._serialize_tool_calls(tool_calls),
},
)
@staticmethod
def _tool_message_event(msg: ToolMessage) -> "StreamEvent":
"""Build a ``messages-tuple`` tool-result event from a ToolMessage."""
return StreamEvent(
type="messages-tuple",
data={
"type": "tool",
"content": DeerFlowClient._extract_text(msg.content),
"name": msg.name,
"tool_call_id": msg.tool_call_id,
"id": msg.id,
},
)
@staticmethod
def _serialize_message(msg) -> dict:
"""Serialize a LangChain message to a plain dict for values events."""
if isinstance(msg, AIMessage):
d: dict[str, Any] = {"type": "ai", "content": msg.content, "id": getattr(msg, "id", None)}
if msg.tool_calls:
d["tool_calls"] = [{"name": tc["name"], "args": tc["args"], "id": tc.get("id")} for tc in msg.tool_calls]
d["tool_calls"] = DeerFlowClient._serialize_tool_calls(msg.tool_calls)
if getattr(msg, "usage_metadata", None):
d["usage_metadata"] = msg.usage_metadata
return d
@@ -438,6 +481,53 @@ class DeerFlowClient:
consumers can switch between HTTP streaming and embedded mode
without changing their event-handling logic.
Token-level streaming
~~~~~~~~~~~~~~~~~~~~~
This method subscribes to LangGraph's ``messages`` stream mode, so
``messages-tuple`` events for AI text are emitted as **deltas** as
the model generates tokens, not as one cumulative dump at node
completion. Each delta carries a stable ``id`` — consumers that
want the full text must accumulate ``content`` per ``id``.
``chat()`` already does this for you.
Tool calls and tool results are still emitted once per logical
message. ``values`` events continue to carry full state snapshots
after each graph node finishes; AI text already delivered via the
``messages`` stream is **not** re-synthesized from the snapshot to
avoid duplicate deliveries.
Why not reuse Gateway's ``run_agent``?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Gateway (``runtime/runs/worker.py``) has a complete streaming
pipeline: ``run_agent`` → ``StreamBridge`` → ``sse_consumer``. It
looks like this client duplicates that work, but the two paths
serve different audiences and **cannot** share execution:
* ``run_agent`` is ``async def`` and uses ``agent.astream()``;
this method is a sync generator using ``agent.stream()`` so
callers can write ``for event in client.stream(...)`` without
touching asyncio. Bridging the two would require spinning up
an event loop + thread per call.
* Gateway events are JSON-serialized by ``serialize()`` for SSE
wire transmission. This client yields in-process stream event
payloads directly as Python data structures (``StreamEvent``
with ``data`` as a plain ``dict``), without the extra
JSON/SSE serialization layer used for HTTP delivery.
* ``StreamBridge`` is an asyncio-queue decoupling producers from
consumers across an HTTP boundary (``Last-Event-ID`` replay,
heartbeats, multi-subscriber fan-out). A single in-process
caller with a direct iterator needs none of that.
So ``DeerFlowClient.stream()`` is a parallel, sync, in-process
consumer of the same ``create_agent()`` factory — not a wrapper
around Gateway. The two paths **should** stay in sync on which
LangGraph stream modes they subscribe to; that invariant is
enforced by ``tests/test_client.py::test_messages_mode_emits_token_deltas``
rather than by a shared constant, because the three layers
(Graph, Platform SDK, HTTP) each use their own naming
(``messages`` vs ``messages-tuple``) and cannot literally share
a string.
Args:
message: User message text.
thread_id: Thread ID for conversation context. Auto-generated if None.
@@ -448,8 +538,8 @@ class DeerFlowClient:
StreamEvent with one of:
- type="values" data={"title": str|None, "messages": [...], "artifacts": [...]}
- type="custom" data={...}
- type="messages-tuple" data={"type": "ai", "content": str, "id": str}
- type="messages-tuple" data={"type": "ai", "content": str, "id": str, "usage_metadata": {...}}
- type="messages-tuple" data={"type": "ai", "content": <delta>, "id": str}
- type="messages-tuple" data={"type": "ai", "content": <delta>, "id": str, "usage_metadata": {...}}
- type="messages-tuple" data={"type": "ai", "content": "", "id": str, "tool_calls": [...]}
- type="messages-tuple" data={"type": "tool", "content": str, "name": str, "tool_call_id": str, "id": str}
- type="end" data={"usage": {"input_tokens": int, "output_tokens": int, "total_tokens": int}}
@@ -466,13 +556,47 @@ class DeerFlowClient:
context["agent_name"] = self._agent_name
seen_ids: set[str] = set()
# Cross-mode handoff: ids already streamed via LangGraph ``messages``
# mode so the ``values`` path skips re-synthesis of the same message.
streamed_ids: set[str] = set()
# The same message id carries identical cumulative ``usage_metadata``
# in both the final ``messages`` chunk and the values snapshot —
# count it only on whichever arrives first.
counted_usage_ids: set[str] = set()
cumulative_usage: dict[str, int] = {"input_tokens": 0, "output_tokens": 0, "total_tokens": 0}
def _account_usage(msg_id: str | None, usage: Any) -> dict | None:
"""Add *usage* to cumulative totals if this id has not been counted.
``usage`` is a ``langchain_core.messages.UsageMetadata`` TypedDict
or ``None``; typed as ``Any`` because TypedDicts are not
structurally assignable to plain ``dict`` under strict type
checking. Returns the normalized usage dict (for attaching
to an event) when we accepted it, otherwise ``None``.
"""
if not usage:
return None
if msg_id and msg_id in counted_usage_ids:
return None
if msg_id:
counted_usage_ids.add(msg_id)
input_tokens = usage.get("input_tokens", 0) or 0
output_tokens = usage.get("output_tokens", 0) or 0
total_tokens = usage.get("total_tokens", 0) or 0
cumulative_usage["input_tokens"] += input_tokens
cumulative_usage["output_tokens"] += output_tokens
cumulative_usage["total_tokens"] += total_tokens
return {
"input_tokens": input_tokens,
"output_tokens": output_tokens,
"total_tokens": total_tokens,
}
for item in self._agent.stream(
state,
config=config,
context=context,
stream_mode=["values", "custom"],
stream_mode=["values", "messages", "custom"],
):
if isinstance(item, tuple) and len(item) == 2:
mode, chunk = item
@@ -484,6 +608,36 @@ class DeerFlowClient:
yield StreamEvent(type="custom", data=chunk)
continue
if mode == "messages":
# LangGraph ``messages`` mode emits ``(message_chunk, metadata)``.
if isinstance(chunk, tuple) and len(chunk) == 2:
msg_chunk, _metadata = chunk
else:
msg_chunk = chunk
msg_id = getattr(msg_chunk, "id", None)
if isinstance(msg_chunk, AIMessage):
text = self._extract_text(msg_chunk.content)
counted_usage = _account_usage(msg_id, msg_chunk.usage_metadata)
if text:
if msg_id:
streamed_ids.add(msg_id)
yield self._ai_text_event(msg_id, text, counted_usage)
if msg_chunk.tool_calls:
if msg_id:
streamed_ids.add(msg_id)
yield self._ai_tool_calls_event(msg_id, msg_chunk.tool_calls)
elif isinstance(msg_chunk, ToolMessage):
if msg_id:
streamed_ids.add(msg_id)
yield self._tool_message_event(msg_chunk)
continue
# mode == "values"
messages = chunk.get("messages", [])
for msg in messages:
@@ -493,47 +647,25 @@ class DeerFlowClient:
if msg_id:
seen_ids.add(msg_id)
# Already streamed via ``messages`` mode; only (defensively)
# capture usage here and skip re-synthesizing the event.
if msg_id and msg_id in streamed_ids:
if isinstance(msg, AIMessage):
_account_usage(msg_id, getattr(msg, "usage_metadata", None))
continue
if isinstance(msg, AIMessage):
# Track token usage from AI messages
usage = getattr(msg, "usage_metadata", None)
if usage:
cumulative_usage["input_tokens"] += usage.get("input_tokens", 0) or 0
cumulative_usage["output_tokens"] += usage.get("output_tokens", 0) or 0
cumulative_usage["total_tokens"] += usage.get("total_tokens", 0) or 0
counted_usage = _account_usage(msg_id, msg.usage_metadata)
if msg.tool_calls:
yield StreamEvent(
type="messages-tuple",
data={
"type": "ai",
"content": "",
"id": msg_id,
"tool_calls": [{"name": tc["name"], "args": tc["args"], "id": tc.get("id")} for tc in msg.tool_calls],
},
)
yield self._ai_tool_calls_event(msg_id, msg.tool_calls)
text = self._extract_text(msg.content)
if text:
event_data: dict[str, Any] = {"type": "ai", "content": text, "id": msg_id}
if usage:
event_data["usage_metadata"] = {
"input_tokens": usage.get("input_tokens", 0) or 0,
"output_tokens": usage.get("output_tokens", 0) or 0,
"total_tokens": usage.get("total_tokens", 0) or 0,
}
yield StreamEvent(type="messages-tuple", data=event_data)
yield self._ai_text_event(msg_id, text, counted_usage)
elif isinstance(msg, ToolMessage):
yield StreamEvent(
type="messages-tuple",
data={
"type": "tool",
"content": self._extract_text(msg.content),
"name": getattr(msg, "name", None),
"tool_call_id": getattr(msg, "tool_call_id", None),
"id": msg_id,
},
)
yield self._tool_message_event(msg)
# Emit a values event for each state snapshot
yield StreamEvent(
@@ -550,10 +682,12 @@ class DeerFlowClient:
def chat(self, message: str, *, thread_id: str | None = None, **kwargs) -> str:
"""Send a message and return the final text response.
Convenience wrapper around :meth:`stream` that returns only the
**last** AI text from ``messages-tuple`` events. If the agent emits
multiple text segments in one turn, intermediate segments are
discarded. Use :meth:`stream` directly to capture all events.
Convenience wrapper around :meth:`stream` that accumulates delta
``messages-tuple`` events per ``id`` and returns the text of the
**last** AI message to complete. Intermediate AI messages (e.g.
planner drafts) are discarded — only the final id's accumulated
text is returned. Use :meth:`stream` directly if you need every
delta as it arrives.
Args:
message: User message text.
@@ -561,15 +695,21 @@ class DeerFlowClient:
**kwargs: Override client defaults (same as stream()).
Returns:
The last AI message text, or empty string if no response.
The accumulated text of the last AI message, or empty string
if no AI text was produced.
"""
last_text = ""
# Per-id delta lists joined once at the end — avoids the O(n²) cost
# of repeated ``str + str`` on a growing buffer for long responses.
chunks: dict[str, list[str]] = {}
last_id: str = ""
for event in self.stream(message, thread_id=thread_id, **kwargs):
if event.type == "messages-tuple" and event.data.get("type") == "ai":
content = event.data.get("content", "")
if content:
last_text = content
return last_text
msg_id = event.data.get("id") or ""
delta = event.data.get("content", "")
if delta:
chunks.setdefault(msg_id, []).append(delta)
last_id = msg_id
return "".join(chunks.get(last_id, ()))
# ------------------------------------------------------------------
# Public API — configuration queries