fix(#3189): prevent write_file streaming timeout on long reports (#3195)

* fix(#3189): prevent write_file streaming timeout on long reports

Adds a layered defense against StreamChunkTimeoutError caused by oversized
single-shot write_file tool calls:

- factory: default stream_chunk_timeout to 240s for OpenAI-compatible
  clients (overridable via ModelConfig.stream_chunk_timeout in config.yaml)
- sandbox/tools: server-side 80 KB length guard on non-append write_file
  calls (configurable via DEERFLOW_WRITE_FILE_MAX_BYTES env var, 0
  disables); rejects oversized payloads with a structured error pointing
  the model at str_replace or append=True
- middleware: classify StreamChunkTimeoutError as transient but cap
  retries at 1 via per-exception _RETRY_BUDGET_OVERRIDES (same-payload
  retry on a chunk-gap timeout buffers the same way upstream; full
  3-attempt loop would stack 6-12 min of dead air)
- middleware: surface an actionable user-facing message for stream-drop
  exceptions instead of leaking the raw langchain stack
- prompts: add a routing-style File Editing Workflow hint to both
  lead_agent and general_purpose subagent prompts, pointing the model at
  str_replace for incremental edits (mirrors Claude Code's Edit / Codex's
  apply_patch)
- tests: behavioural coverage for size guard, retry budget override,
  stream-drop user message, factory default injection

Refs #3189

* fix(#3189): drop stream_chunk_timeout for non-OpenAI providers

Address CR feedback on PR #3195:

- factory: pop `stream_chunk_timeout` from kwargs for any model_use_path
  other than `langchain_openai:ChatOpenAI` instead of returning early.
  `ModelConfig.stream_chunk_timeout` is part of the shared schema, so a
  user-supplied value on a non-OpenAI provider would otherwise be
  forwarded to its constructor and raise
  `TypeError: unexpected keyword argument`.
- factory: rewrite docstring to describe the actual `exclude_none=True`
  behaviour (explicit null is excluded and falls back to the default)
  instead of the misleading "None falling out via exclude_none=True keeps
  its value".
- tests: add regression coverage asserting the kwarg is stripped before
  reaching a non-OpenAI provider's constructor.

Refs: bytedance#3189

* fix(#3189): restrict stream-drop user copy to StreamChunkTimeoutError only

Per CR on #3195: narrow _STREAM_DROP_EXCEPTIONS to StreamChunkTimeoutError.
Generic httpx RemoteProtocolError / ReadError fall back to the standard
'temporarily unavailable' copy, since they routinely fire on transient
network blips where the 'split the output' guidance is misleading.
Retry/backoff classification is unchanged: both remain
transient/retriable. Tests updated to reflect new copy, plus a symmetric
regression test for ReadError.

---------

Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
2026-06-10 17:35:57 +00:00 · 2026-06-07 17:47:11 +08:00
parent 268fdd6968
commit 88e36d9686
10 changed files with 677 additions and 4 deletions
@@ -542,6 +542,14 @@ combined with a FastAPI gateway for REST API access [citation:FastAPI](https://f
 {subagent_reminder}- Skill First: Always load the relevant skill before starting **complex** tasks.
 - Progressive Loading: Load resources incrementally as referenced in skills
 - Output Files: Final deliverables must be in `/mnt/user-data/outputs`
+- File Editing Workflow: When revising an existing file, prefer
+  `str_replace` over `write_file` — it sends only the diff and avoids
+  re-emitting the whole file (mirrors Claude Code's Edit and Codex's
+  apply_patch). When writing long new content from scratch, split it
+  into sections: the first `write_file` call creates the file, then use
+  `write_file` with append=True to extend it section by section. This
+  keeps each tool call small and avoids mid-stream chunk-gap timeouts
+  on oversized single-shot writes. (See issue #3189.)  
 - Clarity: Be direct and helpful, avoid unnecessary meta-commentary
 - Including Images and Mermaid: Images and Mermaid diagrams are always welcomed in the Markdown format, and you're encouraged to use `![Image Description](image_path)\n\n` or "```mermaid" to display images in response or Markdown files
 - Multi-task: Better utilize parallel tool calling to call multiple tools at one time for better performance
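
The create-then-append workflow described in the prompt hint can be sketched as a small planning helper. This is an illustrative sketch, not code from the PR: `split_utf8` and `plan_write_calls` are hypothetical names, and the returned dicts only mirror the `path`, `content` and `append` arguments of `write_file` (the real tool also takes a `description`).

```python
def split_utf8(content: str, max_bytes: int) -> list[str]:
    """Split content into pieces of at most max_bytes UTF-8 bytes each.

    Breaks at line boundaries where possible; a single line larger than
    max_bytes is cut between characters, never inside a multi-byte character.
    """
    if max_bytes < 4:
        raise ValueError("max_bytes must be at least 4 (largest UTF-8 character)")
    pieces: list[str] = []
    current, current_size = "", 0
    for line in content.splitlines(keepends=True):
        line_size = len(line.encode("utf-8"))
        # Flush the running piece if this line would push it over the cap.
        if current and current_size + line_size > max_bytes:
            pieces.append(current)
            current, current_size = "", 0
        if line_size <= max_bytes:
            current += line
            current_size += line_size
            continue
        # Oversized line: accumulate it character by character.
        for ch in line:
            ch_size = len(ch.encode("utf-8"))
            if current_size + ch_size > max_bytes:
                pieces.append(current)
                current, current_size = "", 0
            current += ch
            current_size += ch_size
    if current:
        pieces.append(current)
    return pieces


def plan_write_calls(path: str, content: str, max_bytes: int = 80 * 1024) -> list[dict]:
    """Return write_file-style arguments: the first call creates, the rest append."""
    pieces = split_utf8(content, max_bytes) or [""]  # an empty file still needs one call
    return [
        {"path": path, "content": piece, "append": index > 0}
        for index, piece in enumerate(pieces)
    ]
```

Sizes are measured in UTF-8 bytes rather than characters because that is what the server-side guard counts. The guard itself only applies to the first, non-append call; keeping the appended sections small as well is what keeps each tool call short.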
@@ -62,6 +62,41 @@ _AUTH_PATTERNS = (
    "未授权",
 )

+# Per-exception retry budget overrides.
+#
+# Some transient errors are retriable in principle but expensive to retry at
+# the default budget. StreamChunkTimeoutError in particular fires after the
+# upstream provider has already stalled for `stream_chunk_timeout` seconds
+# (typically 120-240s); a full 3-attempt loop can therefore stack 6-12 minutes
+# of dead air before surfacing the failure to the user. We keep exactly one
+# retry (cheap reconnect that catches genuine transient TCP blips) and then
+# fail fast — the same buffered payload is overwhelmingly likely to fail
+# again at the upstream provider for the same reason.
+#
+# Keys are exception class *names* (not classes) so we don't introduce
+# import-time coupling on optional dependencies like langchain-openai. The
+# value is the absolute max attempt count, NOT additional retries — so a
+# value of 2 means "1 first attempt + 1 retry" (the CR-requested
+# "keep one retry" behavior).
+_RETRY_BUDGET_OVERRIDES: dict[str, int] = {
+    "StreamChunkTimeoutError": 2,
+}
+
+# Exception class names that indicate the upstream stream-chunk watchdog
+# fired because the model stalled mid-flight. These deserve a more specific
+# user-facing message than the generic "temporarily unavailable" copy,
+# because the typical root cause is a long tool-call serialization stalling
+# the upstream stream — and the most actionable advice we can give the user
+# is "ask for a shorter / split output" rather than "wait and retry".
+# Generic connection drops (httpx RemoteProtocolError / ReadError) are
+# intentionally excluded: they routinely fire on transient network blips
+# with normal payloads, where the "split the work" guidance is misleading.
+_STREAM_DROP_EXCEPTIONS: frozenset[str] = frozenset(
+    {
+        "StreamChunkTimeoutError",
+    }
+)
+

 class LLMErrorHandlingMiddleware(AgentMiddleware[AgentState]):
    """Retry transient LLM errors and surface graceful assistant messages."""
@@ -83,6 +118,18 @@ class LLMErrorHandlingMiddleware(AgentMiddleware[AgentState]):
        self._circuit_state = "closed"
        self._circuit_probe_in_flight = False

+    def _max_attempts_for(self, exc: BaseException) -> int:
+        """Return the effective max attempt count for this exception.
+
+        Falls back to `self.retry_max_attempts` unless the exception class name
+        appears in the per-exception override table. An override can only
+        lower the budget: the result never exceeds `self.retry_max_attempts`.
+        """
+        override = _RETRY_BUDGET_OVERRIDES.get(type(exc).__name__)
+        if override is None:
+            return self.retry_max_attempts
+
+        return min(override, self.retry_max_attempts)
+
    def _check_circuit(self) -> bool:
        """Returns True if circuit is OPEN (fast fail), False otherwise."""
        with self._circuit_lock:
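
The effect of the override table on a retry loop can be sketched in isolation. This is a simplified illustration, not the middleware itself: `call_with_retries` is a hypothetical helper that treats every exception as retriable and skips the backoff delay, and the local `StreamChunkTimeoutError` class is a stand-in for the langchain-openai exception (only its class name matters for the lookup).

```python
_RETRY_BUDGET_OVERRIDES: dict[str, int] = {"StreamChunkTimeoutError": 2}


class StreamChunkTimeoutError(Exception):
    """Stand-in for the langchain-openai exception; matched by class name only."""


def max_attempts_for(exc: BaseException, default_max_attempts: int) -> int:
    """Absolute attempt cap for exc; an override can lower the default, never raise it."""
    override = _RETRY_BUDGET_OVERRIDES.get(type(exc).__name__)
    if override is None:
        return default_max_attempts
    return min(override, default_max_attempts)


def call_with_retries(func, default_max_attempts: int = 3):
    """Call func until it succeeds or the per-exception attempt cap is reached."""
    attempt = 1
    while True:
        try:
            return func()
        except Exception as exc:
            # The cap is re-evaluated per exception, so a late
            # StreamChunkTimeoutError still shortens the remaining budget.
            if attempt >= max_attempts_for(exc, default_max_attempts):
                raise
            attempt += 1
```

With the default budget of 3, a function that always raises `StreamChunkTimeoutError` is called twice (one attempt plus one retry), while other exceptions still get all three attempts.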
@@ -153,6 +200,7 @@ class LLMErrorHandlingMiddleware(AgentMiddleware[AgentState]):
            "InternalServerError",
            "ReadError",  # httpx.ReadError: connection dropped mid-stream
            "RemoteProtocolError",  # httpx: server closed connection unexpectedly
+            "StreamChunkTimeoutError",  # langchain-openai: chunk gap exceeded stream_chunk_timeout
        }:
            return True, "transient"
        if status_code in _RETRIABLE_STATUS_CODES:
@@ -202,6 +250,20 @@ class LLMErrorHandlingMiddleware(AgentMiddleware[AgentState]):
        if reason == "auth":
            return "The configured LLM provider rejected the request because authentication or access is invalid. Please check the provider credentials and try again."
        if reason in {"busy", "transient"}:
+            # A chunk-gap timeout (StreamChunkTimeoutError) almost always
+            # points at a single oversized tool-call payload: the model spent
+            # so long serializing JSON arguments that the upstream provider
+            # buffered and the stream gap exceeded `stream_chunk_timeout`.
+            # Surfacing this distinct cause lets the user split or shorten
+            # their next request instead of helplessly retrying the same
+            # prompt. Generic connection drops (ReadError,
+            # RemoteProtocolError) keep the standard copy below.
+            if type(exc).__name__ in _STREAM_DROP_EXCEPTIONS:
+                return (
+                    "The model's streaming response was interrupted before it could "
+                    "finish. This usually happens when a single response or tool call "
+                    "is very large — please ask the assistant to split the work into "
+                    "smaller steps, or shorten the requested output, and try again."
+                )
            return "The configured LLM provider is temporarily unavailable after multiple retries. Please wait a moment and continue the conversation."
        return f"LLM request failed: {detail}"

@@ -259,7 +321,8 @@ class LLMErrorHandlingMiddleware(AgentMiddleware[AgentState]):
                raise
            except Exception as exc:
                retriable, reason = self._classify_error(exc)
-                if retriable and attempt < self.retry_max_attempts:
+                max_attempts = self._max_attempts_for(exc)
+                if retriable and attempt < max_attempts:
                    wait_ms = self._build_retry_delay_ms(attempt, exc)
                    logger.warning(
                        "Transient LLM error on attempt %d/%d; retrying in %dms: %s",
@@ -310,7 +373,8 @@ class LLMErrorHandlingMiddleware(AgentMiddleware[AgentState]):
                raise
            except Exception as exc:
                retriable, reason = self._classify_error(exc)
-                if retriable and attempt < self.retry_max_attempts:
+                max_attempts = self._max_attempts_for(exc)
+                if retriable and attempt < max_attempts:
                    wait_ms = self._build_retry_delay_ms(attempt, exc)
                    logger.warning(
                        "Transient LLM error on attempt %d/%d; retrying in %dms: %s",
@@ -32,6 +32,16 @@ class ModelConfig(BaseModel):
        description="Extra settings to be passed to the model when thinking is disabled",
    )
    supports_vision: bool = Field(default_factory=lambda: False, description="Whether the model supports vision/image inputs")
+    stream_chunk_timeout: float | None = Field(
+        default=None,
+        description=(
+            "Maximum seconds to wait between successive streaming chunks before "
+            "langchain-openai raises StreamChunkTimeoutError. None means use the "
+            "factory default (240s for OpenAI-compatible clients). Tune higher for "
+            "reasoning models with long thinking pauses; lower for latency-sensitive "
+            "interactive endpoints. Has no effect on non-OpenAI-compatible providers."
+        ),
+    )
    thinking: dict | None = Field(
        default_factory=lambda: None,
        description=(
@@ -47,6 +47,38 @@ def _enable_stream_usage_by_default(model_use_path: str, model_settings_from_con
        model_settings_from_config["stream_usage"] = True


+# Default chunk-gap budget for OpenAI-compatible streaming responses.
+#
+# langchain-openai raises ``StreamChunkTimeoutError`` after this many seconds
+# without receiving a chunk. Its own default is 60s, which is too aggressive for
+# reasoning models (DeepSeek-R1, Doubao-thinking, GPT-5) whose first chunk can
+# legitimately take 90-150s. We default to 240s so the streaming layer rarely
+# trips on long thinking pauses; the LLMErrorHandlingMiddleware still retries
+# once (max attempts = 2) if a real stall happens. Users can override per-model
+# in config.yaml.
+_DEFAULT_STREAM_CHUNK_TIMEOUT_SECONDS: float = 240.0
+
+
+def _apply_stream_chunk_timeout_default(model_use_path: str, model_settings_from_config: dict) -> None:
+    """Inject a generous ``stream_chunk_timeout`` for OpenAI-compatible clients.
+
+    The ``stream_chunk_timeout`` kwarg is specific to ``langchain_openai:ChatOpenAI``
+    and is rejected by other providers' constructors as an unexpected keyword
+    argument. Behaviour:
+
+    * OpenAI-compatible path: an explicit value in ``config.yaml`` is preserved.
+      An explicit ``null`` is dropped upstream by ``model_dump(exclude_none=True)``
+      and therefore treated as "unset", so the default is injected.
+    * Non-OpenAI path: drop the key so it is never forwarded to an incompatible
+      constructor (which would raise ``TypeError: unexpected keyword argument``).
+    """
+    if model_use_path != "langchain_openai:ChatOpenAI":
+        model_settings_from_config.pop("stream_chunk_timeout", None)
+        return
+    if "stream_chunk_timeout" in model_settings_from_config:
+        return
+    model_settings_from_config["stream_chunk_timeout"] = _DEFAULT_STREAM_CHUNK_TIMEOUT_SECONDS
+
+
 def create_chat_model(name: str | None = None, thinking_enabled: bool = False, *, app_config: AppConfig | None = None, attach_tracing: bool = True, **kwargs) -> BaseChatModel:
    """Create a chat model instance from the config.

@@ -128,6 +160,7 @@ def create_chat_model(name: str | None = None, thinking_enabled: bool = False, *
        model_settings_from_config.pop("reasoning_effort", None)

    _enable_stream_usage_by_default(model_config.use, model_settings_from_config)
+    _apply_stream_chunk_timeout_default(model_config.use, model_settings_from_config)

    # For Codex Responses API models: map thinking mode to reasoning_effort
    from deerflow.models.openai_codex_provider import CodexChatModel
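
The three cases from the `_apply_stream_chunk_timeout_default` docstring can be exercised with a standalone copy of the logic. This is a sketch under assumptions: `simulate` is a hypothetical helper that imitates `model_dump(exclude_none=True)` with a dict comprehension, and `langchain_anthropic:ChatAnthropic` is used only as an example of a non-OpenAI path.

```python
DEFAULT_STREAM_CHUNK_TIMEOUT_SECONDS = 240.0


def apply_stream_chunk_timeout_default(use_path: str, settings: dict) -> None:
    """Standalone copy of the helper's behaviour, mutating settings in place."""
    if use_path != "langchain_openai:ChatOpenAI":
        # Never forward the kwarg to a constructor that would reject it.
        settings.pop("stream_chunk_timeout", None)
        return
    # Keep an explicit value; otherwise inject the default.
    settings.setdefault("stream_chunk_timeout", DEFAULT_STREAM_CHUNK_TIMEOUT_SECONDS)


def simulate(use_path: str, raw_config: dict) -> dict:
    """Drop None values (as exclude_none=True would), then apply the default."""
    settings = {key: value for key, value in raw_config.items() if value is not None}
    apply_stream_chunk_timeout_default(use_path, settings)
    return settings


openai_path = "langchain_openai:ChatOpenAI"
explicit = simulate(openai_path, {"stream_chunk_timeout": 90.0})
explicit_null = simulate(openai_path, {"stream_chunk_timeout": None})
other_provider = simulate("langchain_anthropic:ChatAnthropic", {"stream_chunk_timeout": 90.0})
```

Here `explicit` keeps 90.0, `explicit_null` ends up with the 240.0 default because the null never reaches the helper, and `other_provider` has the key removed entirely.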
@@ -1,4 +1,5 @@
 import asyncio
+import os
 import posixpath
 import re
 import shlex
@@ -43,6 +44,16 @@ _MAX_GLOB_MAX_RESULTS = 1000
 _DEFAULT_GREP_MAX_RESULTS = 100
 _MAX_GREP_MAX_RESULTS = 500
 _DEFAULT_WRITE_FILE_ERROR_MAX_CHARS = 2000
+
+# Maximum bytes accepted in a single non-append write_file call (issue #3189).
+# Oversized single-shot writes correlate with LLM streaming chunk-gap timeouts
+# because the tool-call JSON payload (which the model must emit as one
+# continuous stream) grows past the safe window. 80 KB ≈ 20K tokens, a
+# comfortable headroom under the factory-default 240s stream_chunk_timeout.
+# Deployments can override via env var DEERFLOW_WRITE_FILE_MAX_BYTES; set to
+# 0 (or negative) to disable the guard entirely.
+_WRITE_FILE_CONTENT_MAX_BYTES = 80 * 1024
+_WRITE_FILE_MAX_BYTES_ENV = "DEERFLOW_WRITE_FILE_MAX_BYTES"
 _LOCAL_BASH_CWD_COMMANDS = {"cd", "pushd"}
 _LOCAL_BASH_COMMAND_WRAPPERS = {"command", "builtin"}
 _LOCAL_BASH_COMMAND_PREFIX_KEYWORDS = {"!", "{", "case", "do", "elif", "else", "for", "if", "select", "then", "time", "until", "while"}
@@ -1671,6 +1682,23 @@ async def _read_file_tool_async(
 read_file_tool.coroutine = _read_file_tool_async


+def _effective_write_file_max_bytes() -> int:
+    """Return the active size cap for non-append write_file calls.
+
+    Reads ``DEERFLOW_WRITE_FILE_MAX_BYTES`` at call time (not import time)
+    so tests and runtime tweaks take effect without restart. Falls back to
+    the default on missing/malformed values. A non-positive value disables
+    the guard.
+    """
+    raw = os.environ.get(_WRITE_FILE_MAX_BYTES_ENV)
+    if raw is None:
+        return _WRITE_FILE_CONTENT_MAX_BYTES
+    try:
+        return int(raw)
+    except ValueError:
+        return _WRITE_FILE_CONTENT_MAX_BYTES
+
+
@tool("write_file", parse_docstring=True)
 def write_file_tool(
    runtime: Runtime,
@@ -1679,14 +1707,47 @@ def write_file_tool(
    content: str,
    append: bool = False,
 ) -> str:
-    """Write text content to a file. By default this overwrites the target file; set append to true to add content to the end without replacing existing content.
+    """Write text content to a file. By default this overwrites the target file; set append=True to add content to the end without replacing existing content.
+
+    SIZE POLICY (issue #3189):
+    A single non-append write_file call must not exceed 80 KB of UTF-8 content.
+    Oversized single-shot writes correlate with LLM streaming chunk-gap
+    timeouts because the tool-call JSON payload — which the model must emit as
+    one continuous stream — grows past the safe window. For larger documents,
+    use ONE of these strategies (write_file rejects oversized payloads with an
+    actionable error):
+
+      1. INCREMENTAL EDIT (preferred for revisions): after the initial write,
+         use `str_replace` to surgically update sections. This is the same
+         pattern Claude Code's Write+Edit and OpenAI Codex's apply_patch use,
+         and keeps each tool call's payload small.
+      2. APPEND-IN-CHUNKS (for new long-form content): split the document into
+         sections, each well under 80 KB. First call uses append=False to
+         create the file; subsequent calls use append=True. The 80 KB cap does
+         NOT apply to append=True calls.
+
+    Operators can override the cap via env var `DEERFLOW_WRITE_FILE_MAX_BYTES`
+    (0 disables the guard entirely). Raising it risks streaming timeouts.

    Args:
        description: Explain why you are writing to this file in short words. ALWAYS PROVIDE THIS PARAMETER FIRST.
        path: The **absolute** path to the file to write to. ALWAYS PROVIDE THIS PARAMETER SECOND.
        content: The content to write to the file. ALWAYS PROVIDE THIS PARAMETER THIRD.
-        append: Whether to append content to the end of the file instead of overwriting it. Defaults to false.
+        append: Whether to append content to the end of the file instead of overwriting it. Defaults to False.
    """
+    if not append:
+        max_bytes = _effective_write_file_max_bytes()
+        if max_bytes > 0:
+            content_bytes = len(content.encode("utf-8"))
+            if content_bytes > max_bytes:
+                return (
+                    f"Error: write_file content ({content_bytes} bytes) exceeds the "
+                    f"{max_bytes}-byte single-call limit. Split the content into smaller "
+                    "pieces: either (a) write the first section now, then use `str_replace` "
+                    "for further edits, or (b) call write_file again with append=True "
+                    "carrying the next section. See SIZE POLICY in the tool docstring "
+                    "or issue #3189 for the rationale."
+                )
    try:
        requested_path = path
        sandbox = ensure_sandbox_initialized(runtime)
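
The guard's decision logic can be sketched as a pure function, which makes the byte-versus-character distinction and the environment override easy to check. This is an illustrative reduction, not the tool itself: `effective_max_bytes` and `exceeds_cap` are hypothetical names, and the `environ` parameter is an assumption added so the sketch can be driven by a plain dict instead of the process environment.

```python
import os
from collections.abc import Mapping

DEFAULT_MAX_BYTES = 80 * 1024
ENV_VAR = "DEERFLOW_WRITE_FILE_MAX_BYTES"


def effective_max_bytes(environ: Mapping[str, str] = os.environ) -> int:
    """Read the cap at call time; fall back to the default if unset or malformed."""
    raw = environ.get(ENV_VAR)
    if raw is None:
        return DEFAULT_MAX_BYTES
    try:
        return int(raw)
    except ValueError:
        return DEFAULT_MAX_BYTES


def exceeds_cap(content: str, append: bool, environ: Mapping[str, str] = os.environ) -> bool:
    """True when a write would be rejected: non-append, guard enabled, too many bytes."""
    if append:
        return False  # the cap does not apply to append=True calls
    max_bytes = effective_max_bytes(environ)
    if max_bytes <= 0:
        return False  # a non-positive value disables the guard
    return len(content.encode("utf-8")) > max_bytes


ascii_report = "a" * 30_000  # 30,000 bytes
cjk_report = "字" * 30_000   # 90,000 bytes: each character is 3 bytes in UTF-8
```

The two sample strings have the same character count, but only `cjk_report` crosses the default 81,920-byte cap, which is why the guard measures the encoded length rather than `len(content)`.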
@@ -24,6 +24,17 @@ Do NOT use for simple, single-step operations.""",
 - Do NOT ask for clarification - work with the information provided
 </guidelines>

+<file_editing_workflow>
+When revising an existing file, prefer `str_replace` over `write_file` —
+it sends only the diff and avoids re-emitting the whole file (mirrors
+Claude Code's Edit and Codex's apply_patch). When writing long new
+content from scratch, split it into sections: the first `write_file`
+call creates the file, then use `write_file` with append=True to extend
+it section by section. This keeps each tool call small and avoids
+mid-stream chunk-gap timeouts on oversized single-shot writes.
+(See issue #3189.)
+</file_editing_workflow>
+
 <output_format>
 When you complete the task, provide:
 1. A brief summary of what was accomplished