mirror of
https://github.com/bytedance/deer-flow.git
synced 2026-06-10 17:35:57 +00:00
6a94b58ad196d60bc12502e1045b3a6b350c9edd
2253 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
6a94b58ad1 | Fix safe user id digest algorithm | ||
|
|
d06643d8a2 | Align IM connections with local channels | ||
|
|
92c185b90d | Support local IM channel connections | ||
|
|
9effa7be6d | Merge remote-tracking branch 'origin/main' into codex/im-channel-connections | ||
|
|
582bfda6f8 | Harden dev service daemon startup | ||
|
|
05ae4467ae |
fix(docker): default Gateway to a single worker to prevent multi-worker breakage (#3475)
The default `make up` started the Gateway with `--workers 4`, but run state (RunManager and the stream bridge) is held in-process and nginx uses no sticky sessions. With the default config, same-run requests scatter across workers that each keep their own run state, breaking run cancellation (409), SSE reconnect (hangs on heartbeats), multitask de-duplication, and IM channels (duplicate replies). The shared cross-worker stream bridge does not exist yet. Default GATEWAY_WORKERS to 1 so the out-of-the-box deployment is correct, document the single-worker boundary in the README, and add a regression test pinning the default while keeping it overridable. This is a stop-gap, not a multi-worker implementation; the full fix (shared run state + stream bridge) is tracked in #3191. Refs #3239, #3260 |
||
|
|
b66152c514 | Use async channel connect flow | ||
|
|
78fbc0abdb | Fix dev startup and channel connect popup | ||
|
|
ec5ed185cd |
Merge remote-tracking branch 'origin/main' into codex/im-channel-connections
# Conflicts: # backend/app/channels/discord.py # backend/app/channels/manager.py # backend/app/channels/slack.py # backend/app/channels/telegram.py |
||
|
|
dbe3a3bb0d | Add user-owned IM channel connections | ||
|
|
2b795265e7 |
fix: align auth-disabled mode and mock history loading (#3471)
* fix: align auth-disabled mode and mock history loading * fix: address auth-disabled review feedback * test: cover auth-disabled backend contract * style: format frontend tests * fix: address follow-up review comments |
||
|
|
a57d05fe0a | fix runtime journal run lifecycle events (#3470) | ||
|
|
ae9e8bc0bf |
fix(sandbox): make missing sandbox.mounts host_path a loud ERROR (#3244) (#3250)
In Docker production deployments, LocalSandboxProvider runs inside the
deer-flow-gateway container, so any `sandbox.mounts[].host_path` from
config.yaml is resolved against the gateway container's filesystem — not
the host machine. When the path isn't also bind-mounted into the gateway
service, the mount was silently dropped with only a WARNING log, leaving
agents reading an empty directory in production while the same config
worked under `make dev`.
Escalate the missing-host_path branch to logger.error with explicit
guidance about Docker bind mounts and docker-compose, so the failure is
hard to miss in default log configurations. Skip behaviour is preserved
to avoid breaking existing deployments.
Also clarify the misleading `VolumeMountConfig.host_path` field
description so it documents reality for both providers:
- LocalSandboxProvider checks host_path from inside the gateway process
(host in `make dev`, container in `make up`).
- AioSandboxProvider (DooD) passes host_path straight to `docker -v`
for the sandbox container, where the host Docker daemon resolves it
from the host machine's perspective.
config.example.yaml's `sandbox.mounts` comment gets a Note: block
pointing operators at the docker-compose bind-mount requirement so the
Docker-mode gotcha is discoverable from the canonical template.
Adds a regression test that:
- confirms missing host_path is still skipped (no behaviour break);
- asserts an ERROR record is emitted referencing the offending paths;
- asserts the message contains actionable Docker/gateway/docker-compose
keywords so future refactors can't quietly downgrade it.
Refs: https://github.com/bytedance/deer-flow/issues/3244
|
||
|
|
16391e35ab |
fix(skills): harden slash skill activation across chat channels (#3466)
* support slash skill activation * format slash skill activation * Preserve slash skill activation with uploads * Address slash skill review feedback * Address slash skill follow-up review * Fix lazy slash skill storage resolution * Keep slash skill activation out of system prompt * Address slash skill review issues * fix: harden slash skill command handling * feat(frontend): add slash skill autocomplete * fix: address slash skill review feedback * fix: preserve slash skill text for IM uploadsv2.0-m1-rc3 |
||
|
|
18bbb82f07 |
Fix 'make dev' failure in Windows environment (#3236)
* fix: Solving the problem of "make dev" failing to start in Windows environment * fix: revert the change to the startup_config and fix the lint errors * fix: Address Copilot review feedback - Validate wait-for-port input and avoid PowerShell port interpolation - Require Python 3 in serve.sh launcher detection - Keep Windows event loop policy setup in sitecustomize only - Clarify sitecustomize process-wide backend behavior |
||
|
|
b62c5a7b5b |
fix(agents): offload blocking filesystem IO in the custom-agent router off the event loop (#3457)
* fix(agents): offload blocking filesystem IO in delete_agent off the event loop delete_agent is an async route handler but resolved the agent directory (Paths.base_dir -> Path.resolve), probed it (Path.exists), and removed it (shutil.rmtree) directly on the event loop, blocking it for the duration of every delete. Surfaced by 'make detect-blocking-io'. Move the resolve/exists/rmtree sequence into a sync helper run via asyncio.to_thread, mapping its outcome back to the existing 404/409/500 responses (behavior unchanged). Adds a tests/blocking_io/ regression anchor under the strict Blockbuster gate, mirroring test_skills_load.py (#1917). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(agents): offload blocking filesystem IO in create_agent_endpoint too Like delete_agent, the async create_agent_endpoint resolved and created the agent directory and wrote config.yaml + SOUL.md (with rmtree cleanup on failure) directly on the event loop. Move the whole create-or-409 sequence into a sync helper run via asyncio.to_thread; behavior is unchanged (201 / 409 / 500). Extends the blocking_io regression anchor to cover create as well as delete and renames it to test_agents_router.py. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Apply suggestions from code review Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: ly-wang19 <ly-wang19@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: Willem Jiang <willem.jiang@gmail.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> |
||
|
|
5b81588b87 |
fix(frontend): fallback Streamdown clipboard copy (#3397)
* fix(frontend): fallback streamdown clipboard copy * fix(frontend): address clipboard fallback review * fix(frontend): normalize clipboard fallback rejection * fix(frontend): harden clipboard fallback install * fix(frontend): clarify clipboard fallback errors * fix(frontend): cover clipboard fallback edge cases * fix(frontend): tighten clipboard fallback cleanup * fix(frontend): reduce clipboard fallback copy window * fix(frontend): guard clipboard item fallback install * fix(frontend): clean up clipboard fallback on selection errors * Address clipboard fallback review feedback * fix(frontend): guard clipboard fallback install during SSR |
||
|
|
63ce88f874 |
fix(replay-e2e): key fixtures by caller and conversation (#3453)
* add caller identity in replay e2e * make format * fix(replay-e2e): stabilize title caller replay * fix(replay-e2e): use captured caller without run manager --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com> |
||
|
|
37337b77f9 |
feat(models): add StepFun reasoning model adapter (#3461)
Add PatchedChatStepFun adapter for StepFun reasoning models (step-3.7-flash, step-3.5-flash). Captures reasoning from both streaming and non-streaming responses and replays it on historical assistant messages for multi-turn tool-call conversations. - New: PatchedChatStepFun adapter with streaming/non-streaming reasoning capture - Support both reasoning and reasoning_content field names - 17 unit tests covering all response paths - Updated: config.example.yaml with StepFun configuration example |
||
|
|
8db16bb3d8 |
fix(config): coerce null config.yaml list sections to empty list (#3434)
Copying config.example.yaml to config.yaml and starting DeerFlow crashed with `pydantic ValidationError: models — Input should be a valid list [input_value=None]`, because the example ships every entry under `models:` commented out, so PyYAML parses the key as null. Reported in #1444. Add a field_validator(mode="before") on AppConfig that coerces null models/tools/tool_groups to [] (matching their default_factory=list), and emit an actionable warning from from_file when no models are configured (pointing to config.example.yaml / make setup). Adds regression tests. Closes #1444 Co-authored-by: ly-wang19 <ly-wang19@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: Willem Jiang <willem.jiang@gmail.com> |
||
|
|
93e3281cbf |
fix(dev): create backend/sandbox before uvicorn reload-exclude (#3459) (#3460)
* fix(dev): create backend/sandbox before uvicorn reload-exclude (#3459) #3426 switched the dev gateway's --reload-exclude patterns to absolute paths. uvicorn only excludes an absolute path directly when it already exists as a directory; otherwise it globs the pattern, and Python 3.12's pathlib raises NotImplementedError("Non-relative patterns are unsupported") for an absolute glob pattern. serve.sh mkdir'd the .deer-flow excludes but not backend/sandbox, so `make dev` crashed on startup on a fresh checkout under Python 3.12 (#3454). docker/dev-entrypoint.sh had the same latent gap. Create backend/sandbox in both launchers so every absolute exclude stays on uvicorn's is_dir() short-circuit. Add a regression test that pins the uvicorn mechanism (crash on missing dir, safe once created) and enforces that every absolute --reload-exclude is mkdir'd before launch. Closes #3459 * test(dev): harden reload-exclude invariant parser against false pass/negatives The launcher invariant test parsed shell with a "mkdir -p" line filter and a substring membership check. Two latent gaps (sub-threshold for this fix, but this code guards a user-facing startup path, so close them): - A `\`-continued multi-line `mkdir` would drop arguments on continuation lines, silently weakening coverage. - Substring membership could false-pass when an exclude is a path-prefix of a different created dir (e.g. `/app/backend/sandbox` "found" inside `/app/backend/sandbox-other`). Fold line-continuations, drop comments, and shlex-tokenize each `mkdir` argument list into an exact set (quotes stripped, `$VAR` literal); assert exact set membership. Same shlex handling for `--reload-exclude` values. Verified the parser still flags the pre-fix missing `backend/sandbox` (RED preserved) and no longer false-passes on a path-prefix. * fix(dev): gitignore backend/sandbox runtime dir + pin mkdir-before-launch Address two review findings on the #3459 fix: - backend/sandbox was described as "gitignored runtime state" but no ignore rule actually matched it. Add an anchored `/sandbox/` to backend/.gitignore (anchored so it does NOT shadow the source package backend/packages/harness/deerflow/sandbox/) so sandbox artifacts created at runtime can't pollute the working tree or be committed by accident. New test asserts content under backend/sandbox is ignored, making the claim verifiable. - The launcher invariant test only proved the sandbox mkdir exists somewhere, not that it runs before uvicorn starts. Add an order test (sandbox mkdir line must precede the `uv run uvicorn` launch) so a future edit can't move the mkdir below the launch and silently reintroduce the crash. * Potential fix for pull request finding Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> * test(dev): fix reload-exclude parser to handle serve.sh's quoted flag bundle The previous autofix tokenized each whole line with shlex, but serve.sh packs every flag into a single double-quoted `GATEWAY_EXTRA_FLAGS="..."` assignment. shlex collapses that into one token, so no `--reload-exclude` flag is found and `test_launcher_precreates_every_absolute_reload_exclude[scripts/serve.sh]` failed CI with "expected at least one absolute reload-exclude". Parse `--reload-exclude` with a regex that matches a balanced single/double quoted group or a bare token, so the assignment's surrounding `"` is never swallowed into the value. This recovers all three serve.sh excludes (the prior regex also silently dropped the last `$BACKEND_RUNTIME_HOME` because the adjacent closing quote broke shlex) while still covering dev-entrypoint.sh and the space-separated `--reload-exclude <value>` form. --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> |
||
|
|
0fb18e368c |
refactor(lead-agent): make build_middlewares public to drop the last cross-module private import (#3458)
`client.py` imported the private `_build_middlewares` from `agent.py` across a module boundary and called it as public API. Because the `_` name signals "module-private, no external callers", any future rename or signature change silently breaks the embedded `DeerFlowClient` path — and the test suite even monkeypatched `deerflow.client._build_middlewares`, baking the leak in. `DeerFlowClient` is a lead-agent variant that genuinely needs the lead agent's full middleware composition, so make the dependency honest: promote the helper to a documented public entry point `build_middlewares` and update every in-repo caller. Found during #3341 review; #3341 already removed one such leak (`_assemble_deferred` -> public `assemble_deferred_tools`) and left this one out of scope on purpose. - agent.py: rename def + both internal call sites; expand the docstring into a public-entry-point contract and document the previously-undocumented model_name / app_config / deferred_setup params - client.py: import + call site now use the public name (removes the last cross-module private import) - scripts/tool-error-degradation-detection.sh: update its import + call site - tests (5 files): update monkeypatch/patch targets and direct calls - docs (backend/CLAUDE.md, plan_mode_usage.md, middlewares.mdx): sync the live references that describe the symbol as current API Pure mechanical rename, no behavior change. Historical design docs (rfc, superpowers spec) intentionally keep the old name as point-in-time records. Closes #3431 |
||
|
|
90e23bfd09 |
fix(ci): consolidate PR/issue labeling and fix reviewing-job crash + label thrash (#3455)
* fix(ci): consolidate PR/issue labeling into one triage.yml; fix reviewing crash & label thrash
- Replace pr-labeler + pr-triage + issue-triage with a single triage.yml; drop actions/labeler.
Its sync-labels removed labels outside its config (clobbered size/risk/needs-validation and
could clobber maintainer labels). Area is now computed in-script and reconciled only within
owned namespaces (area:/size//risk:/needs-validation); first-time/reviewing are add-only.
- reviewing: gate on author_association in {OWNER,MEMBER,COLLABORATOR} + user.type==='User'
instead of getCollaboratorPermissionLevel, which 404'd on bot reviewers ('Copilot is not a
user') and crashed the job. Excludes all review bots with no denylist and no API call.
- Read live state (listFiles + listLabelsOnIssue) not the stale event payload, so rapid
synchronize events converge instead of thrashing. Size churn excludes lockfiles/snapshots.
* fix(ci): read labels live via paginate in reviewing & issue-triage jobs
Address review feedback on #3455:
- reviewing: listLabelsOnIssue now paginates (per_page:100) instead of the
default 30, matching pr-labels, so a 'reviewing' label is never missed on
PRs with many labels.
- issue-triage: read live labels via the API instead of the event payload,
consistent with the live-state reads documented in the header.
|
||
|
|
f92a26d56f |
fix(web_fetch): support proxy for Jina reader in restricted networks (#3418) (#3430)
* fix(web_fetch): support proxy for Jina reader in restricted networks The web_fetch tool built a bare httpx.AsyncClient() with no proxy awareness, so users behind a corporate proxy / in Docker / WSL could not reach https://r.jina.ai and web_fetch timed out. - Add optional `proxy` / `trust_env` params to JinaClient.crawl and wire them from the `web_fetch` tool config (with type coercion for YAML string values). - Pass internal service hostnames through NO_PROXY in both compose files so proxy env inherited via env_file does not break in-cluster calls (gateway/provisioner/etc). - Load proxy vars from .env into the shell in scripts/docker.sh so the NO_PROXY interpolation can merge user-provided values on `make` path. - Document proxy/trust_env options in config.example.yaml. Closes #3418 * Potential fix for pull request finding Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> |
||
|
|
3b6dd0a4e3 |
feat(subagents): extend deferred MCP tool loading to subagents (#3432)
* feat(subagents): extend deferred MCP tool loading to subagents (#3341) Subagents now reuse the lead agent's deferred-tool path: when tool_search.enabled, MCP tool schemas are withheld from the model and surfaced by name in <available-deferred-tools>, fetched on demand via the generated tool_search helper. DeferredToolFilterMiddleware deterministically rewrites request.tools to hide the deferred schemas (the prompt section is discovery only, not enforcement). Consolidates the assembly into deerflow.tools.builtins.tool_search, now the single home for both assemble_deferred_tools (centralized fail-closed guard, replacing the lead-only private _assemble_deferred) and the relocated get_deferred_tools_prompt_section. Shared by every build path: lead agent, embedded client, and subagent executor. tool_search is appended after the subagent's name-level tool policy and is treated as infrastructure: its catalog is built from the already policy-filtered list, so it can never surface a tool the policy denied. Follow-up to #3370. Fixes #3341. * test(subagents): assert the real middleware builder emits a working deferred filter (#3341) The existing recipe test hand-constructs DeferredToolFilterMiddleware, so it cannot catch a regression in how build_subagent_runtime_middlewares (the call executor._create_agent actually makes) wires the deferred setup into the filter. Add a test that sources the filter from the real builder given a real setup and runs it through a graph: a wrong catalog hash would silently stop promotion, a dropped filter would stop hiding — both now caught. Running the full real middleware stack is intentionally avoided (the other runtime middlewares need sandbox/thread infra to execute, which would make the test flaky); their attachment + ordering before Safety stays locked in test_tool_error_handling_middleware.py. * test(subagents): keep executor tests config-free in CI * chore: trigger ci * Potential fix for pull request finding Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>v2.0-m1-rc2 |
||
|
|
3c2b60aaae |
fix(threads): assign new checkpoint ID in update_thread_state (#2391)
* async * add test * test(threads): assert aput preserves endpoint-assigned checkpoint id Confirm the update_thread_state fix is real, not a no-op: all supported savers (InMemorySaver, AsyncSqliteSaver, AsyncPostgresSaver) persist and echo checkpoint["id"] verbatim rather than minting their own. Add assertions that each POST /state response's checkpoint_id round-tripped into persisted history and kept its uuid6 time-ordering through aput, and document the verified contract in the router. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |
||
|
|
67ad6e232f | fix(dev): exclude runtime state from gateway reload (#3426) | ||
|
|
cd5bedaa74 |
feat: MiniMax provider for image/video/podcast skills + new music-generation skill (#3437)
* docs(spec): MiniMax integration for generation skills + new music skill Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(plan): MiniMax generation providers implementation plan Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(skills): add importlib loader + FakeResp for skill tests * test(skills): register loaded module in sys.modules; raise requests.HTTPError in FakeResp * feat(image-generation): add MiniMax provider with env auto-detect Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * refactor(image-generation): guard unknown provider, derive ref MIME, strengthen tests Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(video-generation): add MiniMax provider with async poll/download Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * refactor(video-generation): surface base_resp errors while polling; add timeout test * feat(podcast-generation): add MiniMax t2a_v2 provider with env auto-detect Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * refactor(podcast-generation): restore TTS credential guard; add volcengine + voice tests Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat(music-generation): new MiniMax music skill via skill-creator Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(music-generation): treat empty lyrics as absent; test no-audio-data path * refactor(skills): add request timeouts to MiniMax network calls Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Potential fix for pull request finding 'Explicit returns mixed with implicit (fall through) returns' Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com> * fix(models): strip inconsistent user-message names for MiniMax chat DeerFlow middlewares tag user messages with provenance names (user-input, summary, loop_warning); langchain serializes them into the OpenAI-compatible payload and MiniMax rejects mismatched user-message names with "user name must be consistent (2013)". PatchedChatMiniMax now drops the per-message name from user-role messages. Point the config.example MiniMax models at PatchedChatMiniMax so they also get reasoning_content mapping. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(image-generation): MiniMax sends JSON prompt field, guard 1500-char limit MiniMax image-01 takes one text string capped at 1500 chars, but the skill was sending the whole structured JSON. The MiniMax provider now extracts the JSON `prompt` field (relying on prompt_optimizer to expand it) and fails fast with a clear error before calling the API when that field exceeds 1500 chars. Authoring stays provider-agnostic; Gemini still receives the full JSON. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(podcast-generation): per-provider TTS concurrency and retry/backoff Each TTS provider owns its concurrency internally — MiniMax runs single-threaded to reduce rate-limit failures, Volcengine keeps 4 workers — with automatic retry and backoff on transient HTTP and base_resp errors. No caller-facing concurrency knob. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(skills): address Copilot review comments on generation skills - video: add raise_for_status + timeout to the Gemini download/POST/poll calls so non-2xx responses surface as clear HTTP errors instead of JSON/KeyError or hangs - video: check the task Fail status before the generic base_resp check so the failure keeps its task_id context - video/image: create the output file parent directory before writing (matching music-generation) so nested output paths do not raise FileNotFoundError - music: require a non-empty prompt and fail fast with ValueError instead of sending an empty prompt to the API Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(scripts): reclaim dev ports across worktrees in make stop/dev All deer-flow worktrees (main checkout + linked worktrees) hardcode the same dev ports (8001/3000/2026), so a service started from any worktree must be reclaimable from another. stop_all now resolves the set of worktree roots (DEERFLOW_ROOTS) and treats a process as deer-flow-owned when its open files live under any of them. It also force-kills survivors on 2026 alongside 8001/3000, fixing `make dev` aborting on the nginx port preflight when a prior nginx lingered on 2026. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(view-image): hide the injected image-context message from the UI ViewImageMiddleware injects a HumanMessage (text + base64 images) so the vision model can see viewed images, but it was the only internal injector that set neither hide_from_ui nor a hidden name, so it leaked into the chat UI (and IM channels) as a user bubble reading "Here are the images you've viewed:". Mark it with additional_kwargs={"hide_from_ui": True}, matching todo/dynamic_context injections, which the frontend isHiddenFromUIMessage and the channel sender already honor. The model still receives the full content. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(minimax): mark M2.7 models as text-only (no vision) MiniMax M2.7 / M2.7-highspeed do not support vision; only M3 does. The provider config asserted vision support for M2.7 in four places. - config.example.yaml: 4 M2.7 entries -> supports_vision: false - backend/docs/CONFIGURATION.md: M2.7 + highspeed -> supports_vision: false - wizard: add LLMProvider.model_vision_overrides + extra_config_for() so selecting an M2.7 model writes supports_vision: false while M3 (default) keeps vision; wire it through setup_wizard.py - tests: M2.7-highspeed fixture -> supports_vision=False; add test_minimax_vision_is_per_model Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: Willem Jiang <willem.jiang@gmail.com> Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com> |
||
|
|
1651d1f1f5 |
fix(frontend): restructure Memory settings toolbar into two rows (#3433)
The search input, filter tabs, and four action buttons were crammed into a single horizontal row, which squeezed the search box into an unusable sliver and truncated the "Summaries" filter tab to "Summarie". Split the toolbar into two rows: search + filter tabs on the first, actions on the second. The search input now keeps a usable min width, filter tabs use whitespace-nowrap so they never truncate, and the destructive "Clear all memory" button is pushed to the far right (ml-auto) to separate it from the constructive actions. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |
||
|
|
799bef6d9d |
fix(replay-e2e): match by conversation, not the living system prompt (#3436)
* fix(replay-e2e): match by conversation, not the living system prompt The model-replay match key hashed the full input including the lead-agent system prompt. That prompt is edited frequently (e.g. #3195 added a "File Editing Workflow" section), so the committed fixture went stale the moment the prompt changed on main — turning the Layer-2 render gate RED on every unrelated PR (#3430, #3432, ...). This was a self-inflicted false positive. Root-cause fix: - replay_provider._canonical_messages now EXCLUDES the system message from the hash. The conversation (human/ai/tool) is the stable contract that identifies a recorded turn; the system prompt is an internal detail not part of the front-back contract under test. (Mirrors how open-design keys its mock picker on the user prompt, not the system internals.) Proven robust: injecting a prompt edit no longer causes a replay miss. - Layer-1 golden was BLIND to replay misses: the gateway swallows a miss into an assistant error message, so the shape-only golden stayed green on a stale fixture. It now inspects replay_provider.replay_misses() and fails loud. (Layer-2 already fails on a miss.) - Re-recorded write_read_file.ultra fixture + regenerated golden under the new conversation-only hash. - Layer-2 render spec: assert the in-graph auto-title (deterministic); the follow-up suggestion is fired async and depends on a clean JSON model output, so assert it only when the fixture captured one — never gate on its absence (recording flakiness must not block CI). - docs: REPLAY_E2E.md updated. Verified: Layer-1 golden green (no miss), Layer-2 both specs green, CI=true make test 4033 passed / 0 failed, frontend pnpm check clean. * test(replay-e2e): restore suggestions coverage with a reliable capture Addresses review feedback (the suggestion path was dropped from Layer-2): - record spec now waits for the `/suggestions` response before checking capture stability, so the recorded fixture reliably includes the frontend-fired suggestions turn (previously the stability window could return before suggestions fired, yielding a fixture without it). - Re-recorded write_read_file.ultra: 5 turns (write_file, auto-title, read_file, answer, suggestions). Golden unchanged — suggestions is a separate /suggestions call, not part of the /runs/stream SSE sequence. - Layer-2 spec: restore the hard `EXPECTED_SUGGESTION` assertion. With the record spec now waiting for /suggestions, a fixture missing the suggestion turn means a broken recording and must fail loud, not pass silently. Verified: Layer-1 golden green (no miss), Layer-2 both specs green (auto-title + suggestion render), frontend pnpm check clean. * ci: re-trigger (flaky Docker Hub image pull in sandbox e2e, unrelated) backend-unit-tests failed only in test_sandbox_orphan_reconciliation_e2e.py with 'docker pull busybox:latest ... context deadline exceeded' — a CI-runner network flake reaching Docker Hub, not related to this docs/tests-only change. Empty commit to re-run CI. --------- Co-authored-by: DanielWalnut <45447813+hetaoBackend@users.noreply.github.com> |
||
|
|
3b105d1e5f |
fix(suggestions): strip inline <think> reasoning before parsing follow-up questions (#3435)
Reasoning models such as MiniMax-M3 inline their chain-of-thought into the
message content as <think>...</think> (reasoning_split defaults to false)
instead of a separate reasoning_content field. The follow-up-suggestions
endpoint extracted the JSON array via find('[') / rfind(']'), which silently
broke whenever the reasoning text contained '[' or ']' — or when long thinking
hit max_tokens and truncated before the array was emitted — returning empty
suggestions.
- Add _strip_think_blocks() and apply it before JSON extraction; it removes
complete <think>...</think> blocks (case-insensitive) and drops an unclosed
<think> left by max_tokens truncation.
- Document the MiniMax thinking toggle in config.example.yaml
(when_thinking_enabled: adaptive / when_thinking_disabled: disabled) so
thinking_enabled=False actually disables reasoning on M3; note that M2.x
models always think and rely on the defensive strip above.
- Tests cover complete/unclosed think blocks, brackets-inside-think, think +
code-fence, and an end-to-end suggestions case reproducing the empty-result
bug.
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
||
|
|
88759015e4 |
test(e2e): deterministic record/replay front-back contract verification (#3365)
* test(e2e): record/replay front-back contract verification Guards the front-back contract with a deterministic, key-free record/replay harness (mirrors open-design's golden-trace approach): - ReplayChatModel (tests/replay_provider.py): replays recorded LLM turns by a normalized hash of the model input. Strips <system-reminder>/date/uuid/tmp-path so one fixture replays across days and from both the browser and direct-POST paths; a miss raises loudly (no silent divergence). - Recording is record-through-browser (scripts/record_gateway.py + build_fixture_from_jsonl.py + frontend/tests/e2e-record): a real run is driven through the real frontend so captured inputs match exactly what the browser sends; fixtures contain no API key. - Layer 1 — backend golden (tests/test_replay_golden.py): replay through the real gateway, assert the SSE event sequence == committed golden. - Layer 2 — full-stack render (frontend/tests/e2e-real-backend): real Next.js + real gateway (replay model) + Chromium; assert the replayed auto-title and follow-up suggestions render. DOM assertions are the gate; visual regression is a local dev gate (CI uploads the render as an artifact). - CI (.github/workflows/replay-e2e.yml): both layers, triggered on EITHER side of the contract (frontend/** or backend gateway/harness/fixtures). * test(e2e): multi-run render-order cross-stack scenario (#3352) Guards the dangerous front-back class where a backend ordering change silently breaks a frontend assumption while both sides' unit tests stay green. Reproduces issue #3352: backend list_by_thread returns runs newest-first (#2932) and the frontend prepended per-run pages, inverting chronological order once the checkpoint no longer held the older messages. - tests/seed_runs_router.py: test-only seeder, mounted on the replay gateway only when DEERFLOW_ENABLE_TEST_SEED=1 (never in the production app). Seeds a thread with >=2 runs + per-run message events and no checkpoint -- the #3352 precondition -- so the frontend per-run reload path is the sole source of truth and the prepend inversion is observable. - frontend/tests/e2e-real-backend/multi-run-order.spec.ts: drives the real frontend against the real gateway, asserts the first run renders above the second. Reverting the #3354 fix turns it red. - replay-e2e.yml: trigger on the new replay test-infra paths. - docs: REPLAY_E2E.md cross-stack scenario section. * test(e2e): address Copilot review on the replay harness - Fix stale recorder references (scripts/record_traces.py -> scripts/record_gateway.py + scripts/build_fixture_from_jsonl.py) in replay_provider.py, test_replay_golden.py, _replay_fixture.py. - MODE_CONTEXT['ultra']: thinking_enabled False -> True, mirroring the frontend's `context.mode !== 'flash'` (hooks.ts). It did not affect the hashed input (Layer 1 golden still green), but the table now matches the real frontend context it claims to mirror. - replay_provider.py docstring: stop claiming memory is recorded-enabled; the replay config disables memory/summarization for determinism (title stays, as an in-graph deterministic call). - record_gateway.py / run_replay_gateway.py: override DEER_FLOW_HOME instead of setdefault, so an outer value can't leak into the hermetic harness. - record_gateway.py: clear error when DEERFLOW_RECORD_OUT is unset (was a bare KeyError). - playwright.record.config.ts: forward OPENAI_*/DEERFLOW_RECORD_OUT only when set, so the gateway raises a clear 'missing env' error instead of getting ''. * test(e2e): address Copilot review round 2 - seed_runs_router.py: constrain SeedMessage.role to Literal['human','ai'] so a bad value is a clean 422 at the boundary instead of a 500 (KeyError on _EVENT_TYPE). - record-write-read-file.spec.ts: waitForCaptureStable now throws on timeout instead of returning the last count, so a truncated/partial recording can't pass silently. - real-backend-render.spec.ts: guard the suggestions JSON.parse; a bracket-prefixed non-JSON turn falls back to '' so the existing not.toBe('') assertion fails clearly instead of a generic parse throw. |
||
|
|
64d923b0fd |
fix(middleware): externalize oversized tool output into sandbox for non-mounted sandboxes (#3417)
* fix(middleware): externalize oversized tool output into sandbox for non-mounted sandboxes
ToolOutputBudgetMiddleware persisted oversized tool results to the host
filesystem and returned a /mnt/user-data/outputs virtual path. For sandboxes
that do not use thread-data mounts (e.g. remote AIO sandbox), that virtual
path does not exist inside the sandbox, so the model's read_file tool could
not read it back and reported 'file not found'.
Branch on SandboxProvider.uses_thread_data_mounts:
- Mounted sandboxes (local Docker, AIO + LocalContainerBackend) keep the
original host-disk path; the host outputs dir is bind-mounted to the same
virtual path inside the sandbox, so behavior is unchanged.
- Non-mounted (remote) sandboxes externalize into the sandbox itself via
execute_command('mkdir -p ...') + write_file + 'test -s' validation. The
validation step is required because AIO sandbox execute_command returns
'Error: ...' as a string on failure instead of raising, so a silent mkdir
failure would otherwise leak through.
Any failure (rejected subdir, mkdir/write/validate error) falls back to the
existing inline head+tail truncation, so an unreadable path is never returned
to the model.
The sandbox resolver reads the sandbox_id that SandboxMiddleware already
writes into runtime.state['sandbox']; it never calls provider.acquire(),
keeping the tool-call hot path free of blocking I/O. Tools that do not use a
sandbox (web_search, MCP, ...) resolve to None and fall through to inline
truncation, which is the safe behavior for them.
Fixes #3416
* fix(middleware): address Copilot review feedback on sandbox externalization
- Make get_sandbox_provider() lookup best-effort in _budget_content: only
query when outputs_path or sandbox is available, and fall back to inline
truncation if provider initialization raises rather than propagating
the error. A resolved sandbox instance is sufficient on its own to take
the non-mounted externalization branch.
- Strict-match the sandbox post-write validation echo
(check.strip() == 'OK') to avoid false positives if execute_command
ever surfaces unrelated stdout/stderr containing 'OK' as a substring.
Refs: #3417
* test: fix flaky tests relying on /nonexistent/... path under container root
Two tests in this module (test_returns_none_on_invalid_path and
test_fallback_when_disk_write_fails) used paths like
'/nonexistent/impossible/path' to trigger _externalize's OSError
fallback. These paths are creatable when the test process runs as root
inside the CI container: os.makedirs(..., exist_ok=True) successfully
creates the entire chain under /, so the OSError branch is never hit
and the tests fail. Reproducible on main independently of this PR.
Switch to '/dev/null/cannot-mkdir-here'. /dev/null is a character
device on both Linux and macOS, so os.makedirs always fails with
NotADirectoryError regardless of privileges, reliably exercising the
OSError fallback.
* fix(tool-output-budget): only consult sandbox provider when a sandbox is resolved
The previous revision called get_sandbox_provider() whenever externalization
was triggered, including on the legacy host-disk path. Environments without
a configured sandbox -- in particular CI runners without a config.yaml --
would raise FileNotFoundError there, get caught, and silently fall back to
inline truncation. That defeated the host-disk externalization path that
predates this PR and was the root cause of the regressing legacy tests.
Restructure the branching so the provider is only consulted when a sandbox
has actually been resolved for the current tool call:
- sandbox resolved + provider.uses_thread_data_mounts: host-disk write
(bind-mounted into the sandbox, equivalent to a sandbox-side write).
- sandbox resolved + non-mounted provider: sandbox write (#3416).
- no sandbox + outputs_path: host-disk write
(legacy / non-sandbox tools, no provider call at all).
- otherwise: inline fallback.
No test changes; the legacy externalization tests are provider-agnostic by
construction and now pass without monkeypatching.
Refs: #3416
* test(tool-output-budget): assert legacy path does not call sandbox provider
Lock in the contract introduced by
|
||
|
|
519200728a |
fix(middleware): offload memory injection off event loop to prevent tiktoken blocking (#3402) (#3411)
* fix(middleware): offload memory injection off event loop to prevent tiktoken blocking (#3402) DynamicContextMiddleware.abefore_agent() called _inject() synchronously on the asyncio event loop. The first time memory is injected (second request), _inject() → format_memory_for_injection() → _count_tokens() → tiktoken.get_encoding("cl100k_base") needs to download the BPE data from openaipublic.blob.core.windows.net. In network-restricted environments this download blocks until the OS TCP timeout (~26 min), starving ALL concurrent handlers including /api/v1/auth/me. Fix: - abefore_agent now uses asyncio.to_thread(self._inject, state) so file I/O and tiktoken never block the event loop. - Extract _get_tiktoken_encoding() with a module-level cache so tiktoken.get_encoding() is called at most once per encoding name. - Add warm_tiktoken_cache() startup helper; gateway lifespan pre-warms the cache via asyncio.to_thread so the first request never triggers a cold download. - _count_tokens falls back to len(text) // 4 on any encoding failure. Tests: - tests/test_tiktoken_cache_and_count_tokens.py (12 tests): cache hit/miss, fallback paths, warm-up helper. - tests/blocking_io/test_dynamic_context_middleware.py (2 tests): Blockbuster gate verifies abefore_agent does not block the event loop; async/sync parity check. Fixes #3402 * Apply suggestions from code review Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> * fix the lint error * fix(memory): use future annotations to avoid NameError when tiktoken is absent Add `from __future__ import annotations` to prompt.py so that tiktoken.Encoding type hints are never evaluated at runtime. Without this, environments where tiktoken is not installed could raise NameError on the module-level cache and function return annotations. Addresses Copilot review comment on PR #3411. * fix(middleware): bound abefore_agent injection with timeout to prevent hung requests Wrap the asyncio.to_thread(self._inject) offload in asyncio.wait_for() with a 5-second cap. If the startup warm-up failed silently (e.g. network blip during deploy), a cold tiktoken BPE download on the first request can block until the OS TCP timeout (~26 min). The bounded timeout ensures the request degrades gracefully (no memory/date context for that turn) rather than hanging. Adds test_abefore_agent_returns_none_on_timeout to the blocking-IO regression anchors. Addresses review feedback from xg-gh-25 on PR #3411. --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> |
||
|
|
40a371b88c |
fix(security): harden MCP config endpoint (#3425)
* fix(security): harden MCP config endpoint * Potential fix for pull request finding Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: greatmengqi <chenmengqi.0376@bytedance.com> Co-authored-by: Willem Jiang <willem.jiang@gmail.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> |
||
|
|
f725a963d5 |
fix(runtime): protect sync singleton init and reset (#3413)
* fix(runtime): protect sync singleton init/reset with threading.Lock * fix(runtime): serialize sync singleton init and reset * make format * test(runtime): assert store reset creates new singleton * Apply suggestions from code review Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> * fix(runtime): load config outside singleton locks * fix(runtime): share checkpointer config loading helper --------- Co-authored-by: GODDiao <diaoshengjia@gmail.com> Co-authored-by: Willem Jiang <willem.jiang@gmail.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> |
||
|
|
3b4c9ff733 | fix(setup): refresh LLM provider wizard defaults (#3421) | ||
|
|
10c1d9f417 | fix(search): fix DDGS Wikipedia region handling (#3423) | ||
|
|
7679f21edf |
fix(frontend): truncate overflowing text in agent cards (#3391)
* fix(frontend): truncate overflowing text in agent cards Long custom agent names, descriptions, skills and tool-group labels overflowed the agent card and broke its layout (issue #3389). The title already had `truncate`, but it never took effect: an ancestor flex container lacked `min-w-0`, so the flex item refused to shrink below its content width. - Restore the truncation chain by adding `min-w-0` to the title's flex ancestors so `truncate` can finally take effect. - Cap and ellipsize model / skill / tool-group badges via a small `TruncatedBadge` (`block max-w-full truncate`). - Reveal the full value on hover, but only when the text is actually clipped (`TruncatedTooltip`, width + height detection), so names, descriptions and labels stay readable without popping redundant tooltips on short cards. * fix(frontend): wrap unbreakable strings in agent card tooltips A long token with no break opportunity (no spaces or hyphens) could still overflow the tooltip horizontally. Add `break-words` next to the existing `text-wrap` so such strings wrap instead of overflowing. Addresses Copilot review feedback on tooltip wrapping robustness. * fix(frontend): show agent card tooltips instantly Drop the explicit `delayDuration` so card tooltips fall back to the provider's default 0ms delay. Instant feedback is better UX for revealing text that is already clipped, per maintainer review. --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com> |
||
|
|
8d2e55a05f |
fix(subagent): structured subagent_status field over text parsing (#3146) (#3154)
* fix(subagent): structured subagent_status field over text parsing Closes #3146. ## Why The frontend used to derive subtask card state by string-matching the leading text of the `task` tool's result. That contract surface was fragile — `#3107` BUG-007 and the `#3131` review both surfaced cases where new backend wording (`Task cancelled by user.`, `Task polling timed out after N minutes`, `ToolErrorHandlingMiddleware` exception wrappers) silently broke the card lifecycle. The frontend fallback kept growing more prefixes; any future rewording would break it again. ## Design 1. **Backend → frontend contract**: `ToolMessage.additional_kwargs` carries `subagent_status` (one of `completed | failed | cancelled | timed_out | polling_timed_out`) and an optional `subagent_error` blob. The frontend prefers it over parsing `content`. 2. **Centralised stamping, not 8 sprinkled stamps**: rather than have each of `task_tool.py`'s 5 normal-return + 3 pre-execution `Error:` paths remember to set `additional_kwargs`, `ToolErrorHandlingMiddleware` stamps the field after every task-tool call. Adding a new return path in `task_tool.py` cannot now skip the stamp. 3. **Cross-language contract fixture**: the prefix→status mapping is the one piece both sides must agree on. The shared fixture at `contracts/subagent_status_contract.json` lists every backend return string, the expected status, and what the error substring should contain. Backend test (`backend/tests/test_subagent_status_contract.py`) and frontend test (`frontend/tests/unit/core/tasks/subtask-result.test.ts`) both load that fixture and assert the same cases. A wording drift on either side fails the matching language's test. 4. **Round-trip serialisation pinned**: the round-trip test asserts `ToolMessage.model_dump_json()` → `model_validate_json()` preserves `additional_kwargs.subagent_status`. Catches the case where a future LangChain or Pydantic upgrade silently strips unknown kwargs. 5. **Frontend status collapse documented**: the backend has five status values, the frontend card has three (`completed | failed | in_progress`). `cancelled` / `timed_out` / `polling_timed_out` all collapse to `failed` with the original status preserved in `error`. `parseSubtaskResult` returns `in_progress` for unknown values so a backend that ships a new enum variant before the frontend upgrades degrades to the legacy prefix fallback instead of getting pinned. ## Changes Backend: - `deerflow.subagents.status_contract` — new module exporting `SUBAGENT_STATUS_KEY`, `SUBAGENT_ERROR_KEY`, `SUBAGENT_STATUS_VALUES`, `extract_subagent_status(content)`, and `make_subagent_additional_kwargs(status, error)`. - `ToolErrorHandlingMiddleware`: new `_stamp_task_subagent_status` helper centralises the stamp; `wrap_tool_call` / `awrap_tool_call` stamp on the success path; `_build_error_message` stamps on the wrapper path (carrying `ExcClass: detail` into `subagent_error`). Non-task tools are untouched. - New tests: `test_subagent_status_contract.py` (19 cases from the shared fixture + status-enum / blank-error / unknown-status rejection) and `test_tool_error_handling_subagent_stamp.py` (middleware integration: terminal-content stamps, non-terminal doesn't, non-task tools untouched, async path mirrors sync, existing additional_kwargs survive, JSON round-trip preserved). Frontend: - `parseSubtaskResult(text, additionalKwargs?)` — prefers the structured stamp; falls back to the legacy prefix matcher for historical threads / unknown future status values. - `STRUCTURED_STATUS_TO_SUBTASK` documents the five→three collapse. - `message-list.tsx` passes `message.additional_kwargs` through. - `subtask-result.test.ts` adds a structured-status block + a fixture-driven contract block; legacy prefix tests stay green for the fallback path. Contract: - `contracts/subagent_status_contract.json` — single source of truth both languages load. Whitespace variants, varied N for polling timeouts, the 3 pre-execution `Error:` returns task_tool produces, and the middleware wrapper shape are all in there. ## Test plan - `make lint` clean (backend + frontend). - `pytest tests/test_subagent_status_contract.py tests/test_tool_error_handling_subagent_stamp.py` → 37 passed. - `pnpm test --run` → 103 passed (was 76, +27 new). ## Migration / fallback retirement The text-prefix fallback stays in place until backend telemetry shows the frontend never hits it for newly produced messages. At that point a follow-up PR can drop the prefix branches and keep only the structured-status branch. Refs: bytedance/deer-flow#3138 (split summary), #3107 (origin), #3131 (prior prefix-only fix), #3146 (this issue). * fix(subtask): back-fill result/error from text when structured status present Three follow-ups on the PR #3154 review: 1. `readStructuredStatus` no longer short-circuits the prefix parse. The backend currently stamps only the `subagent_status` enum value; the human-facing `result` body and wrapped-error message still live in `ToolMessage.content`. Dropping the text parse meant successful tasks rendered empty completed pills and wrapped failures lost their diagnostic. Now both shapes get composed: structured status wins, `result`/`error` come from text when both sides agree, and a lying success body under a `failed` stamp is dropped instead of leaking. 2. Replace the ESM-incompatible `__dirname` fixture lookup in subtask-result.test.ts with `fileURLToPath(new URL(..., import.meta.url))`. The frontend package is `"type": "module"`, so the previous path would have thrown at runtime if anything ever changed under the contract directory. 3. Drop the `$schema` reference from contracts/subagent_status_contract.json pointing at a file that doesn't exist in the tree. Three new tests cover the structured + text composition: completed back-fills the success body, failed back-fills the wrapper text, and unrecognised content under a `failed` stamp stays empty rather than echoing noise. |
||
|
|
d8b728f7cb |
fix(mcp): close stdio sessions on their owning loop to avoid cross-task cancel-scope error (#3379) (#3392)
* fix(mcp): close stdio sessions on their owning loop to avoid cross-task cancel-scope error (#3379) Adopt an owner-task lifecycle for pooled MCP ClientSessions so each session is entered, initialized, and exited within a single asyncio task on its owning event loop. This eliminates the anyio "Attempted to exit cancel scope in a different task than it was entered in" RuntimeError that surfaced when stdio MCP tools were used via the sync tool wrapper (which spins up and tears down event loops across tasks). Also harden the pool lifecycle: - track in-flight session creation per (server, scope) to dedupe concurrent get_session() calls for the same key - make close_scope/close_server/close_all/close_all_sync cover both established entries and in-flight creations so sessions cannot be resurrected or leaked after close - handle cross-loop preemption of an in-flight creation by cancelling the stale owner task instead of only signalling it - define close_all_sync() semantics for a running loop on the current thread (signal-only, async completion) and route reset_mcp_tools_cache through a deterministic async close in that case * fix(mcp): avoid reset deadlock on running loop cache reset * fix(mcp): address session pool review feedback |
||
|
|
befe334f10 |
fix(config): make the reload boundary discoverable from code (#3144) (#3153)
* fix(config): make the reload boundary discoverable from code, not just docs Closes #3144. The hot-reload contract — per-run fields are resolved through `get_app_config()` on every request, infrastructure fields snapshot at gateway startup — landed in `backend/CLAUDE.md` as part of #3131. A maintainer reading `get_config()` or an `AppConfig` field still had to context-switch to that document to know which fields require a process restart, and there was no enforcement that the prose list stayed in sync with the code. This commit moves the boundary to a machine-readable single source of truth and surfaces it where the code lives: - New `deerflow.config.reload_boundary` module owns the registry of restart-required fields (`STARTUP_ONLY_FIELDS`) and a tiny helper API (`is_startup_only_field`, `iter_startup_only_field_paths`, `format_field_description`). The standardised `"startup-only:"` prefix is exported as `STARTUP_ONLY_PREFIX` so future scanners / lint hooks / doc generators can pivot off it without re-parsing prose. - `AppConfig`'s `database`, `checkpointer`, `run_events`, `stream_bridge`, `sandbox`, and `log_level` fields now build their `Field(description=...)` from `format_field_description(...)`. The same text shows up in IDE hover (Pydantic v2 exposes `description` via `model_fields[...]`). - `channels` is restart-required too but lives outside the AppConfig Pydantic schema (the config section is consumed directly by `start_channel_service`). The registry owns it so the boundary is not split between two places. - `get_config()` docstring points to the registry instead of leaving the reader to find `CLAUDE.md`. The `CLAUDE.md` table collapses to a one-liner pointing back at `reload_boundary.py` so the boundary has one canonical location, not two. Drift coverage in `tests/test_reload_boundary.py`: - Every registered field has a non-trivial reason. - Iterator / membership helpers stay in sync with the dict. - Every registry entry that maps to an `AppConfig` field also carries the `"startup-only:"` prefix in the schema (catches "forgot to update the schema"). - Reverse drift: any AppConfig field whose description starts with the prefix must be registered (catches "marked restart-required in the schema but forgot the registry"). - The runtime introspection that IDE hover depends on (`AppConfig.model_fields["database"].description`) is pinned, so a future Pydantic upgrade or schema swap that breaks the hover surface shows up as a test failure rather than a silent regression. Refs: bytedance/deer-flow#3138 (split summary), #3107 (origin), #3131 (prior boundary fix in prose form). * fix(config): preserve field doc and correct log_level reload reason Two follow-ups on the PR #3153 review: 1. The `log_level` STARTUP_ONLY_FIELDS reason previously claimed `apply_logging_level()` mutates the root logger level. It does not: only the `deerflow` / `app` logger levels are set, and root handler thresholds are conditionally lowered so messages from those loggers can propagate. Reword to match the actual behavior so operators reading IDE hover get accurate restart guidance. 2. `format_field_description(field_path)` was the sole `Field(description=)` for every restart-required field, which silently overwrote the original human-facing documentation — most visibly the `log_level` field that used to list debug/info/warning/error and clarify that third-party libraries are not affected. Extend the helper with a keyword-only `field_doc` parameter that composes the startup-only marker with the original prose so IDE hover documents both *why* the field is restart-required and *what* it actually accepts. Updated all six restart-required AppConfig fields (`log_level`, `database`, `sandbox`, `run_events`, `checkpointer`, `stream_bridge`) to pass their original descriptions through the helper. Tests: two new cases in `test_reload_boundary.py` pin (a) the helper composition and (b) every AppConfig restart-required field still surfaces a recognisable substring of its original documentation. --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com> |
||
|
|
d133b1119a |
fix(summarization): tag summary LLM calls nostream to stop phantom stream messages (#2503) (#3378)
* fix(summarization): tag summary LLM calls nostream to stop phantom stream messages (#2503) The SummarizationMiddleware runs its summary LLM call inside a before_model hook. Without a nostream tag the summary tokens were captured by LangGraph's messages-tuple stream callback and broadcast to the frontend as a phantom AI message. Generate a dedicated summary model copy tagged with "nostream" (merged on top of any existing tags such as "middleware:summarize" so RunJournal attribution is preserved) and override _create_summary / _acreate_summary to invoke it directly. This avoids temporarily swapping the shared self.model, which would otherwise leak the RunnableBinding across concurrent runs and break parent logic that inspects the raw model (profile / _get_ls_params). Add regression tests covering nostream tagging, concurrent-run isolation, raw model preservation, and existing-tag merge. * fix(summarization): address nostream review feedback |
||
|
|
88e36d9686 |
fix(#3189): prevent write_file streaming timeout on long reports (#3195)
* fix(#3189): prevent write_file streaming timeout on long reports Adds a layered defense against StreamChunkTimeoutError caused by oversized single-shot write_file tool calls: - factory: default stream_chunk_timeout to 240s for OpenAI-compatible clients (overridable via ModelConfig.stream_chunk_timeout in config.yaml) - sandbox/tools: server-side 80 KB length guard on non-append write_file calls (configurable via DEERFLOW_WRITE_FILE_MAX_BYTES env var, 0 disables); rejects oversized payloads with a structured error pointing the model at str_replace or append=True - middleware: classify StreamChunkTimeoutError as transient but cap retries at 1 via per-exception _RETRY_BUDGET_OVERRIDES (same-payload retry on a chunk-gap timeout buffers the same way upstream; full 3-attempt loop would stack 6-12 min of dead air) - middleware: surface an actionable user-facing message for stream-drop exceptions instead of leaking the raw langchain stack - prompts: add a routing-style File Editing Workflow hint to both lead_agent and general_purpose subagent prompts, pointing the model at str_replace for incremental edits (mirrors Claude Code's Edit / Codex's apply_patch) - tests: behavioural coverage for size guard, retry budget override, stream-drop user message, factory default injection Refs #3189 * fix(#3189): drop stream_chunk_timeout for non-OpenAI providers Address CR feedback on PR #3195: - factory: pop `stream_chunk_timeout` from kwargs for any model_use_path other than `langchain_openai:ChatOpenAI` instead of returning early. `ModelConfig.stream_chunk_timeout` is part of the shared schema, so a user-supplied value on a non-OpenAI provider would otherwise be forwarded to its constructor and raise `TypeError: unexpected keyword argument`. - factory: rewrite docstring to describe the actual `exclude_none=True` behaviour (explicit null is excluded and falls back to the default) instead of the misleading "None falling out via exclude_none=True keeps its value". - tests: add regression coverage asserting the kwarg is stripped before reaching a non-OpenAI provider's constructor. Refs: bytedance#3189 * fix(#3189): restrict stream-drop user copy to StreamChunkTimeoutError only Per CR on #3195: narrow _STREAM_DROP_EXCEPTIONS to StreamChunkTimeoutError. Generic httpx RemoteProtocolError / ReadError fall back to the standard 'temporarily unavailable' copy, since they routinely fire on transient network blips where the 'split the output' guidance is misleading. Retry/backoff classification is unchanged — both remain transient/retriable. Tests updated to reflect new copy, plus a symmetric regression test for ReadError. --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com> |
||
|
|
268fdd6968 |
fix(gateway): drain in-flight runs before closing checkpointer on shutdown (#3381)
* fix(gateway): drain in-flight runs before closing checkpointer on shutdown Chat runs execute in fire-and-forget background asyncio tasks that write checkpoints through a shared checkpointer. On shutdown, langgraph_runtime's AsyncExitStack tore down the checkpointer's postgres connection pool while those run tasks were still mid-graph. langgraph's AsyncPregelLoop._checkpointer_put_after_previous then ran its `finally: await checkpointer.aput(...)` against the closed pool, raising psycopg_pool.PoolClosed. Because that put runs in a langgraph-internal task (not on run_agent's call stack), run_agent's try/except cannot catch it and it surfaces as "unhandled exception during asyncio.run() shutdown". Add RunManager.shutdown() to cancel and bounded-await all in-flight runs, and call it from langgraph_runtime BEFORE the AsyncExitStack closes the checkpointer, so the final checkpoint write lands while the pool is still open. The drain is bounded by a timeout so a stuck run cannot hang worker shutdown, and is shielded so a second shutdown signal cannot abandon it mid-drain and reopen the race. Closes #3373 * fix(gateway): address review — preserve completed-run status, bound drain persistence Addresses Copilot review on #3381: - RunManager.shutdown(): decide run status AFTER the drain. Under the lock it now only requests cancellation; after asyncio.wait it marks/persists `interrupted` only for runs still pending or ended cancelled. A run that completes (e.g. `success`) during the drain window keeps its real terminal status instead of being unconditionally overwritten. - Bound the trailing status persistence within the timeout budget (deadline = loop.time()+timeout; gather wrapped in asyncio.wait_for) so a slow store backing off under DB pressure cannot push shutdown past the deadline. - deps: use asyncio.create_task instead of asyncio.ensure_future. - tests: wait deterministically for the run to be in-flight (poll the first checkpoint) instead of a fixed sleep; init shutdown_calls explicitly in the recovery test double; add regression test asserting a run completing during the drain keeps its status (in memory and in the store). * fix(gateway): address maintainer review — surface failed drain persists, clarify timeout constant Addresses @WillemJiang review on #3381: - shutdown(): inspect the gather result of the trailing interrupted-status persistence. _persist_status is best-effort (it catches + logs its own failure with exc_info and returns False, so it never raises out of the gather), but the aggregate result was never checked — a partial failure had no shutdown-level visibility. Now any escaped Exception is logged, and any False (a persist that did not confirm) is logged with the run_id. Added regression test test_shutdown_surfaces_failed_interrupted_persist. - deps: clarify the _RUN_DRAIN_TIMEOUT_SECONDS comment — state the actual value of _SHUTDOWN_HOOK_TIMEOUT_SECONDS (5.0s) and that both count toward the lifespan shutdown window. Kept as two separate constants (independent teardown steps that may diverge) rather than one shared "must match" value. - Verified no other test fake needs the shutdown stub: _FakeRunManager in test_worker_langfuse_metadata.py is a run_agent() argument (worker path), never injected into langgraph_runtime, so it never receives shutdown(). |
||
|
|
9a5de8d6a5 |
fix(ux): remove Backspace shortcut for deleting prompt attachments (#3410)
* Remove backspace attachment deletion * Fix the lint error --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com> |
||
|
|
1aac408dd0 | fix upload file size contract (#3408) | ||
|
|
dd8f9bf5f0 | chore: add AI assistance disclosure to PR template and CONTRIBUTING (#3398) | ||
|
|
2bbc7879fa |
refactor(tool-search): consolidate MCP metadata tag and harden deferred-tool setup (#3370)
Follow-up to #3342 (deferred MCP tool loading). Maintainability cleanup plus hardening of malformed/empty tool_search queries; no change to the deferral mechanism or search ranking. - Add deerflow/tools/mcp_metadata.py as the single source of truth for the "deerflow_mcp" tag (MCP_TOOL_METADATA_KEY + tag_mcp_tool + public is_mcp_tool). Removes the duplicated magic string and the private, cross-module _is_mcp_tool import. - tool_search.search: never raise on model-generated input. Extract _compile_catalog_regex (shared compile-with-literal-fallback); return empty for empty/whitespace queries and a bare "+" instead of matching everything or raising IndexError. - DeferredToolSetup: document the empty-vs-populated invariant. - build_deferred_tool_setup: comment the two distinct empty-return branches. - _assemble_deferred: add return type, rename local to deferred_setup, build the final list with an explicit append. - Tests: use tag_mcp_tool instead of per-file tag helpers; cover empty and bare-"+" queries. |
||
|
|
28b1da2172 |
fix(agents): harden update_agent null-like args (#3237)
* fix(agents): harden update_agent null-like args * docs: mention undefined null-like update args --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com> |