The default `make up` started the Gateway with `--workers 4`, but run state
(RunManager and the stream bridge) is held in-process and nginx uses no sticky
sessions. With the default config, same-run requests scatter across workers that
each keep their own run state, breaking run cancellation (409), SSE reconnect
(hangs on heartbeats), multitask de-duplication, and IM channels (duplicate
replies). The shared cross-worker stream bridge does not exist yet.
Default GATEWAY_WORKERS to 1 so the out-of-the-box deployment is correct,
document the single-worker boundary in the README, and add a regression test
pinning the default while keeping it overridable. This is a stop-gap, not a
multi-worker implementation; the full fix (shared run state + stream bridge) is
tracked in #3191.
Refs #3239, #3260
In Docker production deployments, LocalSandboxProvider runs inside the
deer-flow-gateway container, so any `sandbox.mounts[].host_path` from
config.yaml is resolved against the gateway container's filesystem — not
the host machine. When the path isn't also bind-mounted into the gateway
service, the mount was silently dropped with only a WARNING log, leaving
agents reading an empty directory in production while the same config
worked under `make dev`.
Escalate the missing-host_path branch to logger.error with explicit
guidance about Docker bind mounts and docker-compose, so the failure is
hard to miss in default log configurations. Skip behaviour is preserved
to avoid breaking existing deployments.
Also clarify the misleading `VolumeMountConfig.host_path` field
description so it documents reality for both providers:
- LocalSandboxProvider checks host_path from inside the gateway process
(host in `make dev`, container in `make up`).
- AioSandboxProvider (DooD) passes host_path straight to `docker -v`
for the sandbox container, where the host Docker daemon resolves it
from the host machine's perspective.
config.example.yaml's `sandbox.mounts` comment gets a Note: block
pointing operators at the docker-compose bind-mount requirement so the
Docker-mode gotcha is discoverable from the canonical template.
Adds a regression test that:
- confirms missing host_path is still skipped (no behaviour break);
- asserts an ERROR record is emitted referencing the offending paths;
- asserts the message contains actionable Docker/gateway/docker-compose
keywords so future refactors can't quietly downgrade it.
Refs: https://github.com/bytedance/deer-flow/issues/3244
* fix: Solving the problem of "make dev" failing to start in Windows environment
* fix: revert the change to the startup_config and fix the lint errors
* fix: Address Copilot review feedback
- Validate wait-for-port input and avoid PowerShell port interpolation
- Require Python 3 in serve.sh launcher detection
- Keep Windows event loop policy setup in sitecustomize only
- Clarify sitecustomize process-wide backend behavior
* fix(agents): offload blocking filesystem IO in delete_agent off the event loop
delete_agent is an async route handler but resolved the agent directory (Paths.base_dir -> Path.resolve), probed it (Path.exists), and removed it (shutil.rmtree) directly on the event loop, blocking it for the duration of every delete. Surfaced by 'make detect-blocking-io'.
Move the resolve/exists/rmtree sequence into a sync helper run via asyncio.to_thread, mapping its outcome back to the existing 404/409/500 responses (behavior unchanged). Adds a tests/blocking_io/ regression anchor under the strict Blockbuster gate, mirroring test_skills_load.py (#1917).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(agents): offload blocking filesystem IO in create_agent_endpoint too
Like delete_agent, the async create_agent_endpoint resolved and created the agent directory and wrote config.yaml + SOUL.md (with rmtree cleanup on failure) directly on the event loop. Move the whole create-or-409 sequence into a sync helper run via asyncio.to_thread; behavior is unchanged (201 / 409 / 500). Extends the blocking_io regression anchor to cover create as well as delete and renames it to test_agents_router.py.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Apply suggestions from code review
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: ly-wang19 <ly-wang19@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* add caller identity in replay e2e
* make format
* fix(replay-e2e): stabilize title caller replay
* fix(replay-e2e): use captured caller without run manager
---------
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
Add PatchedChatStepFun adapter for StepFun reasoning models (step-3.7-flash,
step-3.5-flash). Captures reasoning from both streaming and non-streaming
responses and replays it on historical assistant messages for multi-turn
tool-call conversations.
- New: PatchedChatStepFun adapter with streaming/non-streaming reasoning capture
- Support both reasoning and reasoning_content field names
- 17 unit tests covering all response paths
- Updated: config.example.yaml with StepFun configuration example
Copying config.example.yaml to config.yaml and starting DeerFlow crashed with `pydantic ValidationError: models — Input should be a valid list [input_value=None]`, because the example ships every entry under `models:` commented out, so PyYAML parses the key as null. Reported in #1444.
Add a field_validator(mode="before") on AppConfig that coerces null models/tools/tool_groups to [] (matching their default_factory=list), and emit an actionable warning from from_file when no models are configured (pointing to config.example.yaml / make setup). Adds regression tests.
Closes#1444
Co-authored-by: ly-wang19 <ly-wang19@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
* fix(dev): create backend/sandbox before uvicorn reload-exclude (#3459)
#3426 switched the dev gateway's --reload-exclude patterns to absolute
paths. uvicorn only excludes an absolute path directly when it already
exists as a directory; otherwise it globs the pattern, and Python 3.12's
pathlib raises NotImplementedError("Non-relative patterns are unsupported")
for an absolute glob pattern. serve.sh mkdir'd the .deer-flow excludes but
not backend/sandbox, so `make dev` crashed on startup on a fresh checkout
under Python 3.12 (#3454). docker/dev-entrypoint.sh had the same latent gap.
Create backend/sandbox in both launchers so every absolute exclude stays on
uvicorn's is_dir() short-circuit. Add a regression test that pins the uvicorn
mechanism (crash on missing dir, safe once created) and enforces that every
absolute --reload-exclude is mkdir'd before launch.
Closes#3459
* test(dev): harden reload-exclude invariant parser against false pass/negatives
The launcher invariant test parsed shell with a "mkdir -p" line filter and a
substring membership check. Two latent gaps (sub-threshold for this fix, but
this code guards a user-facing startup path, so close them):
- A `\`-continued multi-line `mkdir` would drop arguments on continuation
lines, silently weakening coverage.
- Substring membership could false-pass when an exclude is a path-prefix of a
different created dir (e.g. `/app/backend/sandbox` "found" inside
`/app/backend/sandbox-other`).
Fold line-continuations, drop comments, and shlex-tokenize each `mkdir`
argument list into an exact set (quotes stripped, `$VAR` literal); assert exact
set membership. Same shlex handling for `--reload-exclude` values. Verified the
parser still flags the pre-fix missing `backend/sandbox` (RED preserved) and no
longer false-passes on a path-prefix.
* fix(dev): gitignore backend/sandbox runtime dir + pin mkdir-before-launch
Address two review findings on the #3459 fix:
- backend/sandbox was described as "gitignored runtime state" but no ignore
rule actually matched it. Add an anchored `/sandbox/` to backend/.gitignore
(anchored so it does NOT shadow the source package
backend/packages/harness/deerflow/sandbox/) so sandbox artifacts created at
runtime can't pollute the working tree or be committed by accident. New test
asserts content under backend/sandbox is ignored, making the claim verifiable.
- The launcher invariant test only proved the sandbox mkdir exists somewhere,
not that it runs before uvicorn starts. Add an order test (sandbox mkdir line
must precede the `uv run uvicorn` launch) so a future edit can't move the
mkdir below the launch and silently reintroduce the crash.
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* test(dev): fix reload-exclude parser to handle serve.sh's quoted flag bundle
The previous autofix tokenized each whole line with shlex, but serve.sh packs
every flag into a single double-quoted `GATEWAY_EXTRA_FLAGS="..."` assignment.
shlex collapses that into one token, so no `--reload-exclude` flag is found and
`test_launcher_precreates_every_absolute_reload_exclude[scripts/serve.sh]`
failed CI with "expected at least one absolute reload-exclude".
Parse `--reload-exclude` with a regex that matches a balanced single/double
quoted group or a bare token, so the assignment's surrounding `"` is never
swallowed into the value. This recovers all three serve.sh excludes (the prior
regex also silently dropped the last `$BACKEND_RUNTIME_HOME` because the
adjacent closing quote broke shlex) while still covering dev-entrypoint.sh and
the space-separated `--reload-exclude <value>` form.
---------
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
`client.py` imported the private `_build_middlewares` from `agent.py` across a
module boundary and called it as public API. Because the `_` name signals
"module-private, no external callers", any future rename or signature change
silently breaks the embedded `DeerFlowClient` path — and the test suite even
monkeypatched `deerflow.client._build_middlewares`, baking the leak in.
`DeerFlowClient` is a lead-agent variant that genuinely needs the lead agent's
full middleware composition, so make the dependency honest: promote the helper
to a documented public entry point `build_middlewares` and update every in-repo
caller. Found during #3341 review; #3341 already removed one such leak
(`_assemble_deferred` -> public `assemble_deferred_tools`) and left this one out
of scope on purpose.
- agent.py: rename def + both internal call sites; expand the docstring into a
public-entry-point contract and document the previously-undocumented
model_name / app_config / deferred_setup params
- client.py: import + call site now use the public name (removes the last
cross-module private import)
- scripts/tool-error-degradation-detection.sh: update its import + call site
- tests (5 files): update monkeypatch/patch targets and direct calls
- docs (backend/CLAUDE.md, plan_mode_usage.md, middlewares.mdx): sync the live
references that describe the symbol as current API
Pure mechanical rename, no behavior change. Historical design docs (rfc,
superpowers spec) intentionally keep the old name as point-in-time records.
Closes#3431
* fix(ci): consolidate PR/issue labeling into one triage.yml; fix reviewing crash & label thrash
- Replace pr-labeler + pr-triage + issue-triage with a single triage.yml; drop actions/labeler.
Its sync-labels removed labels outside its config (clobbered size/risk/needs-validation and
could clobber maintainer labels). Area is now computed in-script and reconciled only within
owned namespaces (area:/size//risk:/needs-validation); first-time/reviewing are add-only.
- reviewing: gate on author_association in {OWNER,MEMBER,COLLABORATOR} + user.type==='User'
instead of getCollaboratorPermissionLevel, which 404'd on bot reviewers ('Copilot is not a
user') and crashed the job. Excludes all review bots with no denylist and no API call.
- Read live state (listFiles + listLabelsOnIssue) not the stale event payload, so rapid
synchronize events converge instead of thrashing. Size churn excludes lockfiles/snapshots.
* fix(ci): read labels live via paginate in reviewing & issue-triage jobs
Address review feedback on #3455:
- reviewing: listLabelsOnIssue now paginates (per_page:100) instead of the
default 30, matching pr-labels, so a 'reviewing' label is never missed on
PRs with many labels.
- issue-triage: read live labels via the API instead of the event payload,
consistent with the live-state reads documented in the header.
* fix(web_fetch): support proxy for Jina reader in restricted networks
The web_fetch tool built a bare httpx.AsyncClient() with no proxy
awareness, so users behind a corporate proxy / in Docker / WSL could
not reach https://r.jina.ai and web_fetch timed out.
- Add optional `proxy` / `trust_env` params to JinaClient.crawl and
wire them from the `web_fetch` tool config (with type coercion for
YAML string values).
- Pass internal service hostnames through NO_PROXY in both compose
files so proxy env inherited via env_file does not break in-cluster
calls (gateway/provisioner/etc).
- Load proxy vars from .env into the shell in scripts/docker.sh so the
NO_PROXY interpolation can merge user-provided values on `make` path.
- Document proxy/trust_env options in config.example.yaml.
Closes#3418
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* feat(subagents): extend deferred MCP tool loading to subagents (#3341)
Subagents now reuse the lead agent's deferred-tool path: when
tool_search.enabled, MCP tool schemas are withheld from the model and
surfaced by name in <available-deferred-tools>, fetched on demand via the
generated tool_search helper. DeferredToolFilterMiddleware deterministically
rewrites request.tools to hide the deferred schemas (the prompt section is
discovery only, not enforcement).
Consolidates the assembly into deerflow.tools.builtins.tool_search, now the
single home for both assemble_deferred_tools (centralized fail-closed guard,
replacing the lead-only private _assemble_deferred) and the relocated
get_deferred_tools_prompt_section. Shared by every build path: lead agent,
embedded client, and subagent executor.
tool_search is appended after the subagent's name-level tool policy and is
treated as infrastructure: its catalog is built from the already
policy-filtered list, so it can never surface a tool the policy denied.
Follow-up to #3370. Fixes#3341.
* test(subagents): assert the real middleware builder emits a working deferred filter (#3341)
The existing recipe test hand-constructs DeferredToolFilterMiddleware, so it
cannot catch a regression in how build_subagent_runtime_middlewares (the call
executor._create_agent actually makes) wires the deferred setup into the
filter. Add a test that sources the filter from the real builder given a real
setup and runs it through a graph: a wrong catalog hash would silently stop
promotion, a dropped filter would stop hiding — both now caught.
Running the full real middleware stack is intentionally avoided (the other
runtime middlewares need sandbox/thread infra to execute, which would make the
test flaky); their attachment + ordering before Safety stays locked in
test_tool_error_handling_middleware.py.
* test(subagents): keep executor tests config-free in CI
* chore: trigger ci
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* async
* add test
* test(threads): assert aput preserves endpoint-assigned checkpoint id
Confirm the update_thread_state fix is real, not a no-op: all supported
savers (InMemorySaver, AsyncSqliteSaver, AsyncPostgresSaver) persist and
echo checkpoint["id"] verbatim rather than minting their own. Add
assertions that each POST /state response's checkpoint_id round-tripped
into persisted history and kept its uuid6 time-ordering through aput,
and document the verified contract in the router.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* docs(spec): MiniMax integration for generation skills + new music skill
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* docs(plan): MiniMax generation providers implementation plan
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(skills): add importlib loader + FakeResp for skill tests
* test(skills): register loaded module in sys.modules; raise requests.HTTPError in FakeResp
* feat(image-generation): add MiniMax provider with env auto-detect
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor(image-generation): guard unknown provider, derive ref MIME, strengthen tests
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(video-generation): add MiniMax provider with async poll/download
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor(video-generation): surface base_resp errors while polling; add timeout test
* feat(podcast-generation): add MiniMax t2a_v2 provider with env auto-detect
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor(podcast-generation): restore TTS credential guard; add volcengine + voice tests
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat(music-generation): new MiniMax music skill via skill-creator
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(music-generation): treat empty lyrics as absent; test no-audio-data path
* refactor(skills): add request timeouts to MiniMax network calls
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Potential fix for pull request finding 'Explicit returns mixed with implicit (fall through) returns'
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
* fix(models): strip inconsistent user-message names for MiniMax chat
DeerFlow middlewares tag user messages with provenance names (user-input, summary, loop_warning); langchain serializes them into the OpenAI-compatible payload and MiniMax rejects mismatched user-message names with "user name must be consistent (2013)". PatchedChatMiniMax now drops the per-message name from user-role messages. Point the config.example MiniMax models at PatchedChatMiniMax so they also get reasoning_content mapping.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(image-generation): MiniMax sends JSON prompt field, guard 1500-char limit
MiniMax image-01 takes one text string capped at 1500 chars, but the skill was sending the whole structured JSON. The MiniMax provider now extracts the JSON `prompt` field (relying on prompt_optimizer to expand it) and fails fast with a clear error before calling the API when that field exceeds 1500 chars. Authoring stays provider-agnostic; Gemini still receives the full JSON.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(podcast-generation): per-provider TTS concurrency and retry/backoff
Each TTS provider owns its concurrency internally — MiniMax runs single-threaded to reduce rate-limit failures, Volcengine keeps 4 workers — with automatic retry and backoff on transient HTTP and base_resp errors. No caller-facing concurrency knob.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(skills): address Copilot review comments on generation skills
- video: add raise_for_status + timeout to the Gemini download/POST/poll calls so non-2xx responses surface as clear HTTP errors instead of JSON/KeyError or hangs
- video: check the task Fail status before the generic base_resp check so the failure keeps its task_id context
- video/image: create the output file parent directory before writing (matching music-generation) so nested output paths do not raise FileNotFoundError
- music: require a non-empty prompt and fail fast with ValueError instead of sending an empty prompt to the API
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(scripts): reclaim dev ports across worktrees in make stop/dev
All deer-flow worktrees (main checkout + linked worktrees) hardcode the same dev ports (8001/3000/2026), so a service started from any worktree must be reclaimable from another. stop_all now resolves the set of worktree roots (DEERFLOW_ROOTS) and treats a process as deer-flow-owned when its open files live under any of them. It also force-kills survivors on 2026 alongside 8001/3000, fixing `make dev` aborting on the nginx port preflight when a prior nginx lingered on 2026.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(view-image): hide the injected image-context message from the UI
ViewImageMiddleware injects a HumanMessage (text + base64 images) so the vision model can see viewed images, but it was the only internal injector that set neither hide_from_ui nor a hidden name, so it leaked into the chat UI (and IM channels) as a user bubble reading "Here are the images you've viewed:". Mark it with additional_kwargs={"hide_from_ui": True}, matching todo/dynamic_context injections, which the frontend isHiddenFromUIMessage and the channel sender already honor. The model still receives the full content.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(minimax): mark M2.7 models as text-only (no vision)
MiniMax M2.7 / M2.7-highspeed do not support vision; only M3 does. The
provider config asserted vision support for M2.7 in four places.
- config.example.yaml: 4 M2.7 entries -> supports_vision: false
- backend/docs/CONFIGURATION.md: M2.7 + highspeed -> supports_vision: false
- wizard: add LLMProvider.model_vision_overrides + extra_config_for() so
selecting an M2.7 model writes supports_vision: false while M3 (default)
keeps vision; wire it through setup_wizard.py
- tests: M2.7-highspeed fixture -> supports_vision=False; add
test_minimax_vision_is_per_model
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
The search input, filter tabs, and four action buttons were crammed into
a single horizontal row, which squeezed the search box into an unusable
sliver and truncated the "Summaries" filter tab to "Summarie".
Split the toolbar into two rows: search + filter tabs on the first,
actions on the second. The search input now keeps a usable min width,
filter tabs use whitespace-nowrap so they never truncate, and the
destructive "Clear all memory" button is pushed to the far right
(ml-auto) to separate it from the constructive actions.
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(replay-e2e): match by conversation, not the living system prompt
The model-replay match key hashed the full input including the lead-agent
system prompt. That prompt is edited frequently (e.g. #3195 added a "File
Editing Workflow" section), so the committed fixture went stale the moment
the prompt changed on main — turning the Layer-2 render gate RED on every
unrelated PR (#3430, #3432, ...). This was a self-inflicted false positive.
Root-cause fix:
- replay_provider._canonical_messages now EXCLUDES the system message from
the hash. The conversation (human/ai/tool) is the stable contract that
identifies a recorded turn; the system prompt is an internal detail not
part of the front-back contract under test. (Mirrors how open-design keys
its mock picker on the user prompt, not the system internals.) Proven
robust: injecting a prompt edit no longer causes a replay miss.
- Layer-1 golden was BLIND to replay misses: the gateway swallows a miss
into an assistant error message, so the shape-only golden stayed green on
a stale fixture. It now inspects replay_provider.replay_misses() and fails
loud. (Layer-2 already fails on a miss.)
- Re-recorded write_read_file.ultra fixture + regenerated golden under the
new conversation-only hash.
- Layer-2 render spec: assert the in-graph auto-title (deterministic); the
follow-up suggestion is fired async and depends on a clean JSON model
output, so assert it only when the fixture captured one — never gate on
its absence (recording flakiness must not block CI).
- docs: REPLAY_E2E.md updated.
Verified: Layer-1 golden green (no miss), Layer-2 both specs green,
CI=true make test 4033 passed / 0 failed, frontend pnpm check clean.
* test(replay-e2e): restore suggestions coverage with a reliable capture
Addresses review feedback (the suggestion path was dropped from Layer-2):
- record spec now waits for the `/suggestions` response before checking
capture stability, so the recorded fixture reliably includes the
frontend-fired suggestions turn (previously the stability window could
return before suggestions fired, yielding a fixture without it).
- Re-recorded write_read_file.ultra: 5 turns (write_file, auto-title,
read_file, answer, suggestions). Golden unchanged — suggestions is a
separate /suggestions call, not part of the /runs/stream SSE sequence.
- Layer-2 spec: restore the hard `EXPECTED_SUGGESTION` assertion. With the
record spec now waiting for /suggestions, a fixture missing the suggestion
turn means a broken recording and must fail loud, not pass silently.
Verified: Layer-1 golden green (no miss), Layer-2 both specs green
(auto-title + suggestion render), frontend pnpm check clean.
* ci: re-trigger (flaky Docker Hub image pull in sandbox e2e, unrelated)
backend-unit-tests failed only in test_sandbox_orphan_reconciliation_e2e.py
with 'docker pull busybox:latest ... context deadline exceeded' — a CI-runner
network flake reaching Docker Hub, not related to this docs/tests-only change.
Empty commit to re-run CI.
---------
Co-authored-by: DanielWalnut <45447813+hetaoBackend@users.noreply.github.com>
Reasoning models such as MiniMax-M3 inline their chain-of-thought into the
message content as <think>...</think> (reasoning_split defaults to false)
instead of a separate reasoning_content field. The follow-up-suggestions
endpoint extracted the JSON array via find('[') / rfind(']'), which silently
broke whenever the reasoning text contained '[' or ']' — or when long thinking
hit max_tokens and truncated before the array was emitted — returning empty
suggestions.
- Add _strip_think_blocks() and apply it before JSON extraction; it removes
complete <think>...</think> blocks (case-insensitive) and drops an unclosed
<think> left by max_tokens truncation.
- Document the MiniMax thinking toggle in config.example.yaml
(when_thinking_enabled: adaptive / when_thinking_disabled: disabled) so
thinking_enabled=False actually disables reasoning on M3; note that M2.x
models always think and rely on the defensive strip above.
- Tests cover complete/unclosed think blocks, brackets-inside-think, think +
code-fence, and an end-to-end suggestions case reproducing the empty-result
bug.
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(e2e): record/replay front-back contract verification
Guards the front-back contract with a deterministic, key-free record/replay
harness (mirrors open-design's golden-trace approach):
- ReplayChatModel (tests/replay_provider.py): replays recorded LLM turns by a
normalized hash of the model input. Strips <system-reminder>/date/uuid/tmp-path
so one fixture replays across days and from both the browser and direct-POST
paths; a miss raises loudly (no silent divergence).
- Recording is record-through-browser (scripts/record_gateway.py +
build_fixture_from_jsonl.py + frontend/tests/e2e-record): a real run is driven
through the real frontend so captured inputs match exactly what the browser
sends; fixtures contain no API key.
- Layer 1 — backend golden (tests/test_replay_golden.py): replay through the real
gateway, assert the SSE event sequence == committed golden.
- Layer 2 — full-stack render (frontend/tests/e2e-real-backend): real Next.js +
real gateway (replay model) + Chromium; assert the replayed auto-title and
follow-up suggestions render. DOM assertions are the gate; visual regression is
a local dev gate (CI uploads the render as an artifact).
- CI (.github/workflows/replay-e2e.yml): both layers, triggered on EITHER side of
the contract (frontend/** or backend gateway/harness/fixtures).
* test(e2e): multi-run render-order cross-stack scenario (#3352)
Guards the dangerous front-back class where a backend ordering change
silently breaks a frontend assumption while both sides' unit tests stay
green. Reproduces issue #3352: backend list_by_thread returns runs
newest-first (#2932) and the frontend prepended per-run pages, inverting
chronological order once the checkpoint no longer held the older messages.
- tests/seed_runs_router.py: test-only seeder, mounted on the replay
gateway only when DEERFLOW_ENABLE_TEST_SEED=1 (never in the production
app). Seeds a thread with >=2 runs + per-run message events and no
checkpoint -- the #3352 precondition -- so the frontend per-run reload
path is the sole source of truth and the prepend inversion is observable.
- frontend/tests/e2e-real-backend/multi-run-order.spec.ts: drives the real
frontend against the real gateway, asserts the first run renders above
the second. Reverting the #3354 fix turns it red.
- replay-e2e.yml: trigger on the new replay test-infra paths.
- docs: REPLAY_E2E.md cross-stack scenario section.
* test(e2e): address Copilot review on the replay harness
- Fix stale recorder references (scripts/record_traces.py ->
scripts/record_gateway.py + scripts/build_fixture_from_jsonl.py) in
replay_provider.py, test_replay_golden.py, _replay_fixture.py.
- MODE_CONTEXT['ultra']: thinking_enabled False -> True, mirroring the
frontend's `context.mode !== 'flash'` (hooks.ts). It did not affect the
hashed input (Layer 1 golden still green), but the table now matches the
real frontend context it claims to mirror.
- replay_provider.py docstring: stop claiming memory is recorded-enabled;
the replay config disables memory/summarization for determinism (title
stays, as an in-graph deterministic call).
- record_gateway.py / run_replay_gateway.py: override DEER_FLOW_HOME instead
of setdefault, so an outer value can't leak into the hermetic harness.
- record_gateway.py: clear error when DEERFLOW_RECORD_OUT is unset (was a
bare KeyError).
- playwright.record.config.ts: forward OPENAI_*/DEERFLOW_RECORD_OUT only when
set, so the gateway raises a clear 'missing env' error instead of getting ''.
* test(e2e): address Copilot review round 2
- seed_runs_router.py: constrain SeedMessage.role to Literal['human','ai']
so a bad value is a clean 422 at the boundary instead of a 500
(KeyError on _EVENT_TYPE).
- record-write-read-file.spec.ts: waitForCaptureStable now throws on
timeout instead of returning the last count, so a truncated/partial
recording can't pass silently.
- real-backend-render.spec.ts: guard the suggestions JSON.parse; a
bracket-prefixed non-JSON turn falls back to '' so the existing
not.toBe('') assertion fails clearly instead of a generic parse throw.
* fix(middleware): externalize oversized tool output into sandbox for non-mounted sandboxes
ToolOutputBudgetMiddleware persisted oversized tool results to the host
filesystem and returned a /mnt/user-data/outputs virtual path. For sandboxes
that do not use thread-data mounts (e.g. remote AIO sandbox), that virtual
path does not exist inside the sandbox, so the model's read_file tool could
not read it back and reported 'file not found'.
Branch on SandboxProvider.uses_thread_data_mounts:
- Mounted sandboxes (local Docker, AIO + LocalContainerBackend) keep the
original host-disk path; the host outputs dir is bind-mounted to the same
virtual path inside the sandbox, so behavior is unchanged.
- Non-mounted (remote) sandboxes externalize into the sandbox itself via
execute_command('mkdir -p ...') + write_file + 'test -s' validation. The
validation step is required because AIO sandbox execute_command returns
'Error: ...' as a string on failure instead of raising, so a silent mkdir
failure would otherwise leak through.
Any failure (rejected subdir, mkdir/write/validate error) falls back to the
existing inline head+tail truncation, so an unreadable path is never returned
to the model.
The sandbox resolver reads the sandbox_id that SandboxMiddleware already
writes into runtime.state['sandbox']; it never calls provider.acquire(),
keeping the tool-call hot path free of blocking I/O. Tools that do not use a
sandbox (web_search, MCP, ...) resolve to None and fall through to inline
truncation, which is the safe behavior for them.
Fixes#3416
* fix(middleware): address Copilot review feedback on sandbox externalization
- Make get_sandbox_provider() lookup best-effort in _budget_content: only
query when outputs_path or sandbox is available, and fall back to inline
truncation if provider initialization raises rather than propagating
the error. A resolved sandbox instance is sufficient on its own to take
the non-mounted externalization branch.
- Strict-match the sandbox post-write validation echo
(check.strip() == 'OK') to avoid false positives if execute_command
ever surfaces unrelated stdout/stderr containing 'OK' as a substring.
Refs: #3417
* test: fix flaky tests relying on /nonexistent/... path under container root
Two tests in this module (test_returns_none_on_invalid_path and
test_fallback_when_disk_write_fails) used paths like
'/nonexistent/impossible/path' to trigger _externalize's OSError
fallback. These paths are creatable when the test process runs as root
inside the CI container: os.makedirs(..., exist_ok=True) successfully
creates the entire chain under /, so the OSError branch is never hit
and the tests fail. Reproducible on main independently of this PR.
Switch to '/dev/null/cannot-mkdir-here'. /dev/null is a character
device on both Linux and macOS, so os.makedirs always fails with
NotADirectoryError regardless of privileges, reliably exercising the
OSError fallback.
* fix(tool-output-budget): only consult sandbox provider when a sandbox is resolved
The previous revision called get_sandbox_provider() whenever externalization
was triggered, including on the legacy host-disk path. Environments without
a configured sandbox -- in particular CI runners without a config.yaml --
would raise FileNotFoundError there, get caught, and silently fall back to
inline truncation. That defeated the host-disk externalization path that
predates this PR and was the root cause of the regressing legacy tests.
Restructure the branching so the provider is only consulted when a sandbox
has actually been resolved for the current tool call:
- sandbox resolved + provider.uses_thread_data_mounts: host-disk write
(bind-mounted into the sandbox, equivalent to a sandbox-side write).
- sandbox resolved + non-mounted provider: sandbox write (#3416).
- no sandbox + outputs_path: host-disk write
(legacy / non-sandbox tools, no provider call at all).
- otherwise: inline fallback.
No test changes; the legacy externalization tests are provider-agnostic by
construction and now pass without monkeypatching.
Refs: #3416
* test(tool-output-budget): assert legacy path does not call sandbox provider
Lock in the contract introduced by d6e2d25b: when no sandbox is resolved
for a tool call, _budget_content must externalize to the host outputs
directory without consulting get_sandbox_provider(). Regressing this would
re-break legacy / non-sandbox tools in environments without a configured
sandbox (e.g. CI without config.yaml), which is the failure mode #3416's
fix avoids.
The test injects a get_sandbox_provider that raises on call, so any
future refactor that moves the provider lookup out of the sandbox-only
branch will fail loudly.
Refs: #3416
* fix(middleware): offload memory injection off event loop to prevent tiktoken blocking (#3402)
DynamicContextMiddleware.abefore_agent() called _inject() synchronously
on the asyncio event loop. The first time memory is injected (second
request), _inject() → format_memory_for_injection() → _count_tokens()
→ tiktoken.get_encoding("cl100k_base") needs to download the BPE data
from openaipublic.blob.core.windows.net. In network-restricted
environments this download blocks until the OS TCP timeout (~26 min),
starving ALL concurrent handlers including /api/v1/auth/me.
Fix:
- abefore_agent now uses asyncio.to_thread(self._inject, state) so
file I/O and tiktoken never block the event loop.
- Extract _get_tiktoken_encoding() with a module-level cache so
tiktoken.get_encoding() is called at most once per encoding name.
- Add warm_tiktoken_cache() startup helper; gateway lifespan pre-warms
the cache via asyncio.to_thread so the first request never triggers a
cold download.
- _count_tokens falls back to len(text) // 4 on any encoding failure.
Tests:
- tests/test_tiktoken_cache_and_count_tokens.py (12 tests): cache
hit/miss, fallback paths, warm-up helper.
- tests/blocking_io/test_dynamic_context_middleware.py (2 tests):
Blockbuster gate verifies abefore_agent does not block the event
loop; async/sync parity check.
Fixes#3402
* Apply suggestions from code review
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* fix the lint error
* fix(memory): use future annotations to avoid NameError when tiktoken is absent
Add `from __future__ import annotations` to prompt.py so that
tiktoken.Encoding type hints are never evaluated at runtime. Without
this, environments where tiktoken is not installed could raise NameError
on the module-level cache and function return annotations.
Addresses Copilot review comment on PR #3411.
* fix(middleware): bound abefore_agent injection with timeout to prevent hung requests
Wrap the asyncio.to_thread(self._inject) offload in asyncio.wait_for()
with a 5-second cap. If the startup warm-up failed silently (e.g.
network blip during deploy), a cold tiktoken BPE download on the first
request can block until the OS TCP timeout (~26 min). The bounded
timeout ensures the request degrades gracefully (no memory/date context
for that turn) rather than hanging.
Adds test_abefore_agent_returns_none_on_timeout to the blocking-IO
regression anchors.
Addresses review feedback from xg-gh-25 on PR #3411.
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* fix(frontend): truncate overflowing text in agent cards
Long custom agent names, descriptions, skills and tool-group labels
overflowed the agent card and broke its layout (issue #3389). The title
already had `truncate`, but it never took effect: an ancestor flex
container lacked `min-w-0`, so the flex item refused to shrink below its
content width.
- Restore the truncation chain by adding `min-w-0` to the title's flex
ancestors so `truncate` can finally take effect.
- Cap and ellipsize model / skill / tool-group badges via a small
`TruncatedBadge` (`block max-w-full truncate`).
- Reveal the full value on hover, but only when the text is actually
clipped (`TruncatedTooltip`, width + height detection), so names,
descriptions and labels stay readable without popping redundant
tooltips on short cards.
* fix(frontend): wrap unbreakable strings in agent card tooltips
A long token with no break opportunity (no spaces or hyphens) could still
overflow the tooltip horizontally. Add `break-words` next to the existing
`text-wrap` so such strings wrap instead of overflowing.
Addresses Copilot review feedback on tooltip wrapping robustness.
* fix(frontend): show agent card tooltips instantly
Drop the explicit `delayDuration` so card tooltips fall back to the
provider's default 0ms delay. Instant feedback is better UX for revealing
text that is already clipped, per maintainer review.
---------
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>