The moderation model's response was silently falling through to a
conservative block when LLMs wrapped structured output in markdown
code fences, added prose around the JSON, returned case-variant
decisions (e.g. "Allow"), or included nested braces in the reason
field. The greedy `\{.*\}` regex also over-matched on nested braces.
- Rewrite _extract_json_object() with markdown fence stripping and
brace-balanced string-aware extraction
- Normalize decision field to lowercase for case-insensitive matching
- Distinguish "model unavailable" from "unparseable output" in fallback
- Strengthen system prompt to explicitly forbid code fences and prose
- Add 15 tests covering all reported scenarios
Fixes#2985
* fix(sandbox): uphold /mnt/user-data contract at Sandbox API boundary (#2873)
LocalSandboxProvider used a process-wide singleton with no /mnt/user-data
mapping, forcing every caller to translate virtual paths via tools.py
before invoking the public Sandbox API. AIO already exposes /mnt/user-data
natively (per-thread bind mounts), so the same code path behaved
differently across implementations — and direct callers like
uploads.py:282 / feishu.py:389 only worked thanks to the
`uses_thread_data_mounts` workaround flag.
Switch the provider to a dual-track cache: keep the `"local"` singleton
for legacy acquire(None) callers (backward-compat for existing tests and
scripts), and create a per-thread LocalSandbox with id `"local:{tid}"`
for acquire(thread_id). Each per-thread instance carries PathMapping
entries for /mnt/user-data, its three subdirs, and /mnt/acp-workspace,
mirroring how AioSandboxProvider mounts those paths into its container.
is_local_sandbox() now recognises both id formats. `_agent_written_paths`
becomes per-thread (it was a process-wide set that leaked across
threads — a latent isolation bug also fixed by this change).
Verified via TDD: a new contract test suite hits the public Sandbox API
directly (write/read/list/exec/glob/grep/update + per-thread isolation +
lifecycle). 3212 backend tests still pass, ruff is clean.
* fix(sandbox): address Copilot review on #2881
Three follow-ups from Copilot's review of the LocalSandboxProvider refactor:
1. Synchronisation: ``acquire`` / ``get`` / ``reset`` mutated the cache without
any lock, so concurrent acquire of the same ``thread_id`` could create two
``LocalSandbox`` instances and lose one's ``_agent_written_paths`` state.
Add a provider-wide ``threading.Lock`` (matching ``AioSandboxProvider``) and
build per-thread mappings outside the lock to avoid holding it during the
``ensure_thread_dirs`` filesystem touch.
2. Memory bound: ``_thread_sandboxes`` grew monotonically. Replace the plain
dict with an ``OrderedDict`` LRU capped at
``DEFAULT_MAX_CACHED_THREAD_SANDBOXES`` (256, configurable per provider
instance). ``get`` promotes touched threads to the MRU end so an active
thread isn't evicted under load. Eviction is graceful: the next ``acquire``
rebuilds a fresh sandbox; only ``_agent_written_paths`` (reverse-resolve
hint) is lost.
3. Docs: update ``CLAUDE.md`` to reflect the new per-thread architecture, the
LRU cap, and that ``is_local_sandbox`` recognises both id formats.
New regression tests:
- Concurrent ``acquire("alpha")`` from 8 threads yields a single instance
(slow-init injection forces the race window wide open).
- Concurrent ``acquire`` of distinct thread_ids yields distinct instances.
- The cache evicts the least-recently-used thread once the cap is exceeded.
- ``get`` promotes recency so a polled thread survives a later acquire-storm.
* fix(auth): persist auto-generated JWT secret to survive restarts
When AUTH_JWT_SECRET is not set, the auto-generated secret is now
written to .deer-flow/.jwt_secret (mode 0600) and reused on subsequent
starts. This prevents session invalidation on every restart while still
allowing explicit AUTH_JWT_SECRET in .env to take precedence.
* Apply suggestions from code review
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
* Apply suggestions from code review
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* fix the lint errors of backend
---------
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* feat(channels): enhance Discord with mention-only mode, thread routing, and typing indicators
Add mention_only config to only respond when bot is mentioned, with
allowed_channels override. Add thread_mode for Hermes-style auto-thread
creation. Add periodic typing indicators while bot is processing.
* fix(discord): include allowed_channels in mention_only skip condition (line 274)
* docs: fix Discord config example to match boolean thread_mode implementation
* style: format with ruff
* fix(discord): apply Copilot review fixes and resolve lint errors
- Remove unused Optional import
- Fix thread_ts type hints to str | None
- Fix has_mention logic for None values
- Implement thread_mode fallback to channel replies on thread creation failure
- Fix thread_mode docstring alignment
- Fix allowed_channels comment formatting in config.example.yaml
* fix(discord): reset context for orphaned threads in mention_only mode
When a message arrives in a thread not tracked by _active_threads,
clear thread_id and typing_target so the message falls through to
the standard channel handling pipeline, which creates a fresh thread
instead of incorrectly routing to the stale thread.
* fix(discord): create new thread on @ when channel has existing tracked thread
When mention_only is enabled and a user @-s the bot in a channel
that already has a tracked thread, create a new thread instead of
incorrectly routing to the old one.
* fix(discord): allow no-@ thread replies while skipping no-@ channel messages
The skip block for no-@ messages was too aggressive — it blocked
continuation replies within tracked threads AND incorrectly routed
no-@ channel messages to the existing thread.
Now:
- Thread message, no @ → routed to existing tracked thread
- Channel message, no @ → skipped
- Channel message, with @ → creates new thread
* feat(discord): add checkmark reaction to acknowledge received messages
* Move discord.py to optional dependency and auto-detect from config.yaml
- Add discord extra to [project.optional-dependencies] in pyproject.toml
- Update detect_uv_extras.py to map channels.discord.enabled: true -> --extra discord
- Set UV_EXTRAS=discord in docker-compose-dev.yaml gateway env
* fix(discord): persist thread-channel mappings to store for recovery after restart
Discord's _active_threads dict was purely in-memory, so all channel-to-thread
mappings were lost on server restart. This fix bridges ChannelStore into
DiscordChannel:
- Save thread mappings to store.json after every thread creation
- Restore active threads from store on DiscordChannel startup
- Pass channel_store to all channels via service.py config injection
Store keys follow the pattern: discord:<channel_id>:<thread_id>
* fix(discord): address Copilot review — fix types, typing targets, cross-thread safety, and config comments
* fix(tests): add multitask_strategy param to mock for clarification follow-up test
* fix(tests): explicitly set model_name=None for title middleware test isolation
* fix(discord): use trigger_typing() instead of typing() for typing indicators
discord.py 2.x TextChannel.typing() and Thread.typing() are async context
managers, not one-shot coroutines. Use trigger_typing() for periodic
typing indicator pings.
* fix(discord): cancel typing tasks on channel shutdown
Prevents 'Task was destroyed but it is pending' warnings when the
Discord client stops while typing indicator loops are still running.
* fix(scripts): detect nested YAML config for discord extra
section_value() only matched top-level YAML sections. Added
nested_section_value() that handles two-level nesting (e.g.,
channels.discord.enabled), so auto-detection of the discord
extra works when config uses the standard nested format.
* fix(docker): remove hard-coded UV_EXTRAS=discord from dev compose
Relies on auto-detection via detect_uv_extras.py instead of forcing
discord.py install even when channels.discord.enabled is false.
Matches production docker-compose.yaml behavior (UV_EXTRAS:-).
* refactor(nginx): move proxy_buffering/proxy_cache to server level
DRY cleanup — these directives were repeated in 14 location blocks.
Set at server level once, reducing duplication and risk of drift.
* fix(discord): use dedicated JSON file for thread persistence
Replace ChannelStore usage for Discord thread-ID persistence with a
dedicated discord_threads.json file. ChannelStore is designed to map
IM conversations to DeerFlow thread IDs — using it to persist Discord
thread IDs was semantically wrong and confusing.
Changes:
- _save_thread() now reads/writes a simple {channel_id: thread_id} JSON dict
- _load_active_threads() reads directly from the JSON file
- File path derived from ChannelStore directory (when available) or
defaults to ~/.deer-flow/channels/discord_threads.json
- Removed unused ChannelStore import
* fix(discord): address WillemJiang's code review comments on PR #2842
1. Remove semantically incorrect message_in_thread variable. At this code
point (after the Thread case is handled above), we're guaranteed to be in
a channel, not a thread. Always apply mention_only check here.
2. Add _active_thread_ids reverse-lookup set for O(1) thread ID membership
checks instead of O(n) scan of _active_threads.values(). Keep the set
in sync with _active_threads in _load_active_threads() and _save_thread().
3. Add _thread_store_lock (threading.Lock) to protect _active_threads and
the JSON file from concurrent access between the Discord loop thread
(_run_client) and the main thread (_load_active_threads, _save_thread).
* fix(middleware): Prevent todo completion reminder IMMessage leak (#2892)
* make format
* fix(middleware): Clear stale todo reminder counts (#2892)
* add size guard for _completion_reminder_counts and add a integration test
* feat: real-time subagent token usage display in header and per-turn
Backend:
- Persist subagent token usage to AIMessage.usage_metadata via
TokenUsageMiddleware, so accumulateUsage() naturally includes
subagent tokens without frontend state management
- Cache subagent usage by tool_call_id in task_tool, write back
to the dispatching AIMessage on next model response
- Emit subagent token usage on all terminal task events
(task_completed, task_failed, task_cancelled, task_timed_out)
- Report subagent usage to parent RunJournal for API totals
- Search backward from ToolMessage to find dispatching AIMessage
for correct multi-tool-call attribution
Frontend:
- Remove subagentUsage state, custom event handling, and prop
threading — subagent tokens are now embedded in message metadata
- Simplify selectHeaderTokenUsage (no subagentUsage parameter)
- Per-turn inline badges show turn-specific usage via message
accumulation
- Remove isLoading guard from MessageTokenUsageList for dynamic
updates during streaming
* fix: prevent header token double counting from baseline reset race
onFinish, onError, and thread-switch useEffect all reset
pendingUsageBaselineMessageIdsRef to an empty Set. If
thread.isLoading is still true on the next render, all messages
pass the getMessagesAfterBaseline filter and their tokens are
added to backendUsage (which already includes them), causing
the header to display up to 2× the actual token count.
Capture current message IDs instead of using an empty Set so
that getMessagesAfterBaseline correctly returns no pending
messages even if thread.isLoading lags behind the stream end.
* fix: write back subagent tokens for all concurrent task tool calls
TokenUsageMiddleware only processed messages[-2], so when a
single model response dispatched multiple task tool calls only
the last ToolMessage had its cached subagent usage written back
to the dispatch AIMessage.usage_metadata. Earlier tasks' usage
stayed in _subagent_usage_cache indefinitely (leak) and never
appeared in the per-turn inline token display.
Walk backward through all consecutive ToolMessages before the
new AIMessage, and accumulate updates targeting the same
dispatch message into one state update so overlapping writes
don't clobber each other.
* fix: clean up subagent usage cache entry on task cancellation
When a task_tool invocation is cancelled via CancelledError, any
cached subagent usage entry leaked because the TokenUsageMiddleware
writeback path never fires after cancellation. Pop the cache entry
before re-raising to prevent unbounded growth of the module-level
_subagent_usage_cache dict.
* fix: address token usage review feedback
* fix: handle missing config for subagent usage cache
---------
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
* fix(tools): preserve tool_search promotions across re-entrant get_available_tools
Closes#2884.
``get_available_tools`` used to unconditionally call
``reset_deferred_registry()`` and rebuild a fresh ``DeferredToolRegistry``
on every invocation. That works for the first call of a request (the
ContextVar starts at its default of ``None``), but any RE-ENTRANT call
during the same async context — e.g. ``task_tool`` building a subagent's
toolset, or a custom middleware that rebuilds tools mid-run — wiped any
``tool_search`` promotions the parent agent had already made. The
``DeferredToolFilterMiddleware`` would then re-hide those tools from the
next model call, leaving the agent able to see a tool's name (via the
prior ``tool_search`` result that's still in conversation history) but
unable to invoke it.
Fix: when the ContextVar already holds a registry, reuse it instead of
rebuilding. Fresh requests still get a fresh registry because each new
graph run starts in a new asyncio task with the ContextVar at ``None``.
## Verification
- Unit-level reproduction (``test_get_available_tools_resets_registry_wiping_promotion``):
promote a tool in the registry, call ``get_available_tools`` again, assert
the promotion is preserved. Fails on main, passes on this branch.
- Graph-execution reproduction (two tests): drive a real
``langchain.agents.create_agent`` graph with the real
``DeferredToolFilterMiddleware`` through two model turns, including one
that issues a re-entrant ``get_available_tools`` call to simulate the
task_tool subagent path.
- Real-LLM end-to-end (``test_deferred_tool_promotion_real_llm.py``,
opt-in via ``ONEAPI_E2E=1``): drives the same flow against a real
OpenAI-compatible model (verified on GPT-5.4-mini through the one-api
gateway), watches the model call the promoted ``fake_calculator``
through the deferred-filter middleware, and asserts the right arithmetic
result. Passes against the fixed branch.
- Companion update to ``test_tool_deduplication.py``: dropped the
``@patch("deerflow.tools.tools.reset_deferred_registry")`` decorators
because the symbol is no longer imported there.
- Test fixtures in the new files patch ``deerflow.tools.tools.get_app_config``
with a minimal ``model_construct``-ed ``AppConfig`` instead of calling
the real loader, so they never trigger ``_apply_singleton_configs`` and
never leak ``_memory_config``/``_title_config``/… mutations into the
rest of the suite.
Full backend suite: 3208 passed / 14 skipped / 0 failed. ruff check + format clean.
* fix(tools): address Copilot review on #2885
- tools.py: rewrite the reuse-path comment to spell out (a) why we don't
reconcile the registry against the current ``mcp_tools`` snapshot — the
MCP cache doesn't refresh mid-graph-run, the lead agent's ``ToolNode``
is already bound to the previous tool set anyway, and ``promote()``
drops the entry so a naive re-sync misclassifies promotions as new
tools — and (b) why the log uses ``max(0, …)`` to avoid negative
counts when the cache shrinks between snapshots.
- Replace direct ``ts_mod._registry_var.set(None)`` in test fixtures with
the public ``reset_deferred_registry()`` helper so tests don't couple
to module internals.
- Correct the docstring path in ``test_deferred_tool_registry_promotion.py``
to match the actual monkeypatch target (``deerflow.mcp.cache.get_cached_mcp_tools``).
- Rename
``test_get_available_tools_resets_registry_wiping_promotion`` to
``test_get_available_tools_preserves_promotions_across_reentrant_calls``
so the test name describes the contract being asserted, not the bug it
originally reproduced.
Full backend suite: 3208 passed / 14 skipped. Real-LLM e2e: 1 passed.
* perf(harness): push thread metadata filters into SQL
Replace Python-side metadata filtering (5x overfetch + in-memory match)
with database-side json_extract predicates so LIMIT/OFFSET pagination
is exact regardless of match density.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* fix(harness): add dialect-aware JsonMatch compiler for type-safe metadata SQL filters
Replace SQLAlchemy JSON index/comparator APIs with a custom JsonMatch
ColumnElement that compiles to json_type/json_extract on SQLite and
jsonb_typeof/->>/-> on PostgreSQL. Tighten key validation regex to
single-segment identifiers, handle None/bool/numeric value types with
json_type-based discrimination, and strengthen test coverage for edge
cases and discriminability.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* fix(harness): address Copilot review comments on JSON metadata filters
- Use json_typeof instead of jsonb_typeof in PostgreSQL compiler; the
metadata_json column is JSON not JSONB so jsonb_typeof would error at
runtime on any PostgreSQL backend
- Align _is_safe_json_key with json_match's _KEY_CHARSET_RE so keys
containing hyphens or leading digits are not silently skipped
- Add thread_id as secondary ORDER BY in search() to make pagination
deterministic when updated_at values collide; remove asyncio.sleep
from the pagination regression test
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
* fix(harness): address remaining review comments on metadata SQL filters
- Remove _is_safe_json_key() and reuse json_match ValueError to avoid
validator drift (Copilot #3217603895, #3217411616)
- Raise ValueError when all metadata keys are rejected so callers never
get silent unfiltered results (WillemJiang)
- Fix integer precision: split int/float branches, bind int as Integer()
with INTEGER/BIGINT CAST instead of float() coercion (Copilot #3217603972)
- Fix jsonb_typeof -> json_typeof on JSON column (Copilot #3217411579)
- Replace manual _cleanup() calls with async yield fixture so teardown
always runs (Copilot #3217604019)
- Remove asyncio.sleep(0.01) pagination ordering; use thread_id secondary
sort instead (Copilot #3217411636)
- Add type annotations to _bind/_build_clause/_compile_* and remove EOL
comments from _Dialect fields (coding.mdc)
- Expand test coverage: boolean/null/mixed-type/large-int precision,
partial unsafe-key skip with caplog assertion
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(harness): address third-round Copilot review comments on JsonMatch
- Reject unsupported value types (list, dict, ...) in JsonMatch.__init__
with TypeError so inherit_cache=True never receives an unhashable value
and callers get an explicit error instead of silent str() coercion
(Copilot #3217933201)
- Upgrade int bindparam from Integer() to BigInteger() to align with
BIGINT CAST and avoid overflow on large integers (Copilot #3217933252)
- Catch TypeError alongside ValueError in search() so non-string metadata
keys are warned and skipped rather than raising unexpectedly
(Copilot #3217933300)
- Add three tests: json_match rejects unsupported value types, search()
warns and raises on non-string key, search() warns and raises on
unsupported value type
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(harness): address fourth-round Copilot review comments on JsonMatch
- Add CASE WHEN guard for PostgreSQL integer matching: json_typeof returns
'number' for both ints and floats; wrap CAST in CASE with regex guard
'^-?[0-9]+$' so float rows never trigger CAST error (Copilot #3218413860)
- Validate isinstance(key, str) before regex match in JsonMatch.__init__
so non-string keys raise ValueError consistently instead of TypeError
from re.match (Copilot #3218413900)
- Include exception message in metadata filter skip warning so callers
can distinguish invalid key from unsupported value type (Copilot #3218413924)
- Update tests: assert CASE WHEN guard in PG int compilation, cover
non-string key ValueError in test_json_match_rejects_unsafe_key
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(harness): align ThreadMetaStore.search() signature with sql.py implementation
Use `dict[str, Any]` for `metadata` and `list[dict[str, Any]]` as return
type in base class and MemoryThreadMetaStore to resolve an LSP signature
mismatch; also correct a test docstring that cited the wrong exception type.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(harness): surface InvalidMetadataFilterError as HTTP 400 in search endpoint
Replace bare ValueError with a domain-specific InvalidMetadataFilterError
(subclass of ValueError) so the Gateway handler can catch it and return
HTTP 400 instead of letting it bubble up as a 500.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* fix(harness): sanitize metadata keys in log output to prevent log injection
Use ascii() instead of %r to escape control characters in client-supplied
metadata keys before logging, preventing multiline/forged log entries.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* fix(harness): validate metadata filters at API boundary and dedupe key/value rules
- Add Pydantic ``field_validator`` on ``ThreadSearchRequest.metadata`` so
unsafe keys / unsupported value types are rejected with HTTP 422 from
both SQL and memory backends (closes Copilot review 3218830849).
- Export ``validate_metadata_filter_key`` / ``validate_metadata_filter_value``
(and ``ALLOWED_FILTER_VALUE_TYPES``) from ``json_compat`` and have
``JsonMatch.__init__`` reuse them — the Gateway-side validator and the
SQL-side ``JsonMatch`` constructor now share one admission rule and
cannot drift.
- Format ``InvalidMetadataFilterError`` rejected-keys list as a
comma-separated plain string instead of a Python list repr so the
surfaced HTTP 400 detail is readable (closes Copilot review 3218830899).
- Update router tests to cover both 422 boundary paths plus the 400
defense-in-depth path when a backend still raises the error.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(harness): harden JsonMatch compile-time key validation against __init__ bypass
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
* fix: address review feedback on metadata filter SQL push-down
- Add signed 64-bit range check to validate_metadata_filter_value; give
out-of-range ints a distinct TypeError message.
- Replace assert guards in _compile_sqlite/_compile_pg with explicit
if/raise so they survive python -O optimisation.
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4 <noreply@anthropic.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(agents): make update_agent honor runtime.context user_id like setup_agent
PR #2784 hardened setup_agent to prefer runtime.context["user_id"] (set by
inject_authenticated_user_context from the auth-validated request) over the
contextvar, so an agent created during the bootstrap flow always lands under
users/<auth_uid>/agents/<name>. update_agent was left calling
get_effective_user_id() unconditionally — the same class of bug that produced
issues #2782 / #2862 still applies whenever the contextvar is not available
on the executing task (background work, future cross-process drivers,
checkpoint resume on a different task). In that regime update_agent silently
routes writes to users/default/agents/<name>, corrupting the shared default
bucket and losing the user's edit.
Extract the resolution policy into a shared resolve_runtime_user_id helper
on deerflow.runtime.user_context and route both setup_agent and update_agent
through it so the two halves of the lifecycle stay in lockstep.
Add load-bearing end-to-end tests that drive a real langchain.agents
create_agent graph with a fake LLM, exercising the full pipeline:
HTTP wire format
-> app.gateway.services.start_run config-assembly
-> deerflow.runtime.runs.worker._build_runtime_context
-> langchain.agents create_agent graph
-> ToolNode dispatch (sync + async + sub-graph + ContextThreadPoolExecutor)
-> setup_agent / update_agent
The negative-control tests intentionally land in users/default/ to prove the
positive tests are actually load-bearing rather than vacuously passing.
The new test_update_agent_e2e_user_isolation suite included a test that
failed against main and now passes after this fix.
* style: ruff format on new e2e tests
* test(e2e): real-server HTTP test driving setup_agent through the full ASGI stack
Adds tests/test_setup_agent_http_e2e_real_server.py — a single load-bearing
test that drives the entire FastAPI gateway through starlette.testclient.
TestClient with no mocks above the LLM:
- lifespan boots (config, sqlite engine, LangGraph runtime, channels)
- POST /api/v1/auth/register (real password hash, real sqlite write,
issues access_token + csrf_token cookies)
- POST /api/threads (real thread_meta + checkpoint creation)
- POST /api/threads/{id}/runs/stream with the exact wire shape the React
frontend sends (assistant_id + input + config + context with
agent_name/is_bootstrap)
- AuthMiddleware -> CSRFMiddleware -> require_permission ->
start_run -> inject_authenticated_user_context ->
asyncio.create_task(run_agent) -> worker._build_runtime_context ->
Runtime injection -> ToolNode dispatch -> real setup_agent
- Asserts SOUL.md is under users/<authenticated_uid>/agents/<name>/
and NOT under users/default/agents/<name>/.
DEER_FLOW_HOME and the sqlite path are redirected into tmp_path so the test
never touches the real .deer-flow directory or developer database. The only
patch above the LLM boundary is replacing create_chat_model with a fake that
emits a single setup_agent tool_call.
This is the "真实验证" answer: it reproduces what curl-against-uvicorn would
do, minus the network socket layer.
* test: address Copilot review on user-isolation e2e tests
- Drop "currently expected to FAIL" wording from update_agent e2e docstring
and header (Copilot review): the fix is in this PR, the test pins the
corrected behaviour rather than driving a future change.
- Rephrase the assertion failure messages from "BUG:" to "REGRESSION:" to
match the test's role on the fixed branch.
- Bound _drain_stream with a wall-clock timeout, a max-bytes cap, and an
early break on the "event: end" SSE frame (Copilot review). Stops the
test from hanging on a stuck run or runaway heartbeat loop.
- Replace the misleading "patch both module aliases" comment with an
explanation of why patching lead_agent.agent.create_chat_model is the
only correct target (Copilot review): lead_agent rebinds the symbol
into its own namespace at import time, so patching deerflow.models is
too late.
* test(refactor): address WillemJiang review on user-isolation e2e tests
- Extract the duplicated FakeToolCallingModel (and a
build_single_tool_call_model helper) into tests/_agent_e2e_helpers.py.
All three e2e files now import from the shared module instead of
redefining the shim locally.
- Convert the manual p.start() / p.stop() try/finally blocks in
test_update_agent_e2e_user_isolation.py to contextlib.ExitStack so
patch lifecycle is Pythonic and exception-safe.
- Lift the isolated_app fixture's private-attribute resets into a
named _reset_process_singletons helper with a comment block
explaining why each singleton has to be invalidated for true e2e
isolation, and why raising=False is intentional. Makes the
fragility visible and the intent self-documenting rather than
leaving the resets inline as opaque monkeypatch calls.
Net change: -59 lines (143 -> 84) across the three test files, with
every assertion intact. Full suite remains 69 passed / lint clean.
* test(e2e): make real-server test self-supply its config
CI's actions/checkout only ships config.example.yaml (the real config.yaml
is gitignored), so the production config-discovery search
(./config.yaml -> ../config.yaml -> $DEER_FLOW_CONFIG_PATH) finds nothing
and the test fails at lifespan boot with FileNotFoundError. The dev-machine
run passed only because a local config.yaml happened to exist.
Write a minimal AppConfig-valid yaml into tmp_path and pin
DEER_FLOW_CONFIG_PATH to it. The yaml carries just what the schema requires
(a single fake-test-model entry, LocalSandboxProvider, sqlite database).
The LLM never gets instantiated because the test patches create_chat_model
on the lead agent module, so the api_key/base_url stay placeholders.
Verified by hiding the local config.yaml to mirror the CI checkout — the
test now passes in both environments.
* docs: document auth design and user isolation
* docs: align auth docs with current storage and reset behavior
---------
Co-authored-by: greatmengqi <chenmengqi.0376@bytedance.com>
* feat(run): propagate model_name from gateway request context to persistence layer
Pass model_name through the full run creation pipeline — from
RunCreateRequest.context in the gateway, through RunManager, to the
RunStore interface and SQL persistence. This enables client-specified
model selection to be recorded per-run in the database.
* feat(run): add model allowlist validation and effective model name capture
- Validate model_name against allowlist in gateway services.py using
get_app_config().get_model_config()
- Truncate model_name to 128 chars to match DB column constraint
- In worker.py, capture effective model name from agent.metadata after
agent creation and persist if resolved differently than requested
* feat(run): add defense-in-depth model_name normalization and round-trip persistence tests
- Add _normalize_model_name() to RunRepository for whitespace stripping
and 128-char truncation before DB writes.
- Add round-trip unit tests for model_name creation and default None
in test_run_manager.py.
* fix(run): coerce non-string model_name values before strip/truncate in _normalize_model_name
* fix(gateway): add runtime type guard for model_name coercion in gateway services
Add isinstance check and str() coercion before calling .strip() to prevent
AttributeError when non-string types (int, None, etc.) flow through the
gateway. Paired with SQL integration test for end-to-end model_name
persistence across gateway → langgraph → persistence layer.
* fix(run): drop Alembic migration for model_name (no-op) and expose public update method on RunManager
- Drop a1b2c3d4e5f6 migration: model_name already exists in RunRow schema
and is auto-created via Base.metadata.create_all() at startup
- Add update_model_name() public method to RunManager to replace the private
_persist_to_store call in worker.py, preserving internal locking/persistence
* fix(nginx): defer cors to gateway allowlist
Remove proxy-level wildcard CORS handling so browser origins are controlled by the Gateway allowlist and stay aligned with CSRF origin checks.
* docs: document gateway cors allowlist
Clarify that same-origin nginx access needs no CORS headers while split-origin or port-forwarded browser clients must opt in with GATEWAY_CORS_ORIGINS.
* docs(gateway): record cors source of truth
Document that Gateway CORSMiddleware and CSRFMiddleware share GATEWAY_CORS_ORIGINS as the split-origin source of truth.
* fix(gateway): align cors origin normalization
* docs: clarify gateway langgraph routing
* docs(gateway): update runtime routing note
* fix(subagents): consolidate system_prompt and skills into single SystemMessage
Some LLM APIs (vLLM, Xinference, Chinese LLM providers) reject multiple
system messages with \”System message must be at the beginning.\” The
subagent executor was sending separate SystemMessages for the configured
system_prompt and each loaded skill, which caused failures when calling
task tool with sub-agents.
Merge system_prompt and all skill content into one SystemMessage in the
initial state, and pass system_prompt=None to create_agent() so the
factory doesn't prepend a second one.
Fixes#2693
* fix(subagents): update SubagentConfig.system_prompt to str | None and add astream regression test
Agent-Logs-Url: https://github.com/bytedance/deer-flow/sessions/2ee03a26-e19b-4106-abc5-c76a2906383b
Co-authored-by: WillemJiang <219644+WillemJiang@users.noreply.github.com>
* fixed the lint error
* fix the lint error in the backend
* fix the unit test error of test_subagent_executor
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
* Fix local sandbox singleton reset on provider lifecycle
* Fix local sandbox singleton reset on provider reset
---------
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
* fix: make tool argument behavior discoverable
The write_file tool already supported append=false by default with append=true for end-of-file writes, but the parsed docstring did not describe append in the model-facing schema. This records the overwrite default and append path in the tool description, adds resilient schema regression coverage, and keeps backend sandbox docs aligned.
The regression now also checks that every public parameter in the existing tool schema test matrix has a description. Enabling docstring parsing on setup_agent and update_agent fills the two existing gaps with their existing Args docs instead of duplicating descriptions elsewhere.
Constraint: Issue #2831 asks for a small docstring/schema discoverability fix without changing runtime file-writing behavior
Rejected: Changing write_file defaults | would alter existing overwrite semantics and broaden the fix beyond schema discoverability
Rejected: Exact phrase assertions | too brittle for future docstring rewording while testing the same behavior
Confidence: high
Scope-risk: narrow
Directive: Keep model-facing tool parameters documented through parsed docstrings or equivalent schema descriptions
Tested: cd backend && uv run pytest tests/test_setup_agent_tool.py tests/test_update_agent_tool.py tests/test_tool_args_schema_no_pydantic_warning.py tests/test_sandbox_tools_security.py::test_str_replace_and_append_on_same_path_should_preserve_both_updates -q
Tested: cd backend && uv run ruff check packages/harness/deerflow/sandbox/tools.py packages/harness/deerflow/tools/builtins/setup_agent_tool.py packages/harness/deerflow/tools/builtins/update_agent_tool.py tests/test_tool_args_schema_no_pydantic_warning.py
Not-tested: Full backend test suite
Co-authored-by: OmX <omx@oh-my-codex.dev>
* Fix the lint error
---------
Co-authored-by: OmX <omx@oh-my-codex.dev>
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
* fix: bucket subagent token usage into RunRow.subagent_tokens
Add caller-bucketed token tracking to RunJournal so subagent and
middleware LLM calls are written to the correct RunRow columns instead
of all falling into lead_agent_tokens (default 0).
- RunJournal: accumulate _lead_agent_tokens / _subagent_tokens /
_middleware_tokens in on_llm_end, deduped by langchain run_id.
Add record_external_llm_usage_records() for external sources
(respects track_token_usage flag). Return caller buckets from
get_completion_data().
- SubagentTokenCollector: new lightweight callback handler that
collects LLM usage within subagent execution.
- SubagentExecutor: wire collector into subagent run_config and sync
records to SubagentResult on every chunk (timeout/cancel safe).
- SubagentResult: add token_usage_records and usage_reported fields.
- task_tool: report subagent usage to parent RunJournal on every
terminal status (COMPLETED/FAILED/CANCELLED/TIMED_OUT), including
the CancelledError path, guarded against double-reporting.
No DB migration needed — RunRow columns already exist.
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* fix: address token usage review feedback
* Address review follow-ups
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
`make dev` ran `uv sync` unconditionally on every restart, wiping any
optional extras the user had installed manually with
`uv sync --all-packages --extra postgres`. The Docker image-build path
already solved this via the `UV_EXTRAS` build-arg in backend/Dockerfile;
the local serve.sh path and the docker-compose-dev startup command
were the remaining outliers.
`scripts/serve.sh` now resolves extras before `uv sync`:
1. honors `UV_EXTRAS` (parity with backend/Dockerfile and
docker/docker-compose.yaml — no new convention introduced);
2. falls back to parsing config.yaml — `database.backend: postgres`
or legacy `checkpointer.type: postgres` auto-pins
`--extra postgres`, so the common case needs zero extra config.
3. detector stderr is no longer suppressed, so whitelist warnings or
crashes surface to the dev terminal (review feedback).
Detection lives in `scripts/detect_uv_extras.py` (stdlib-only — has to
run before the venv exists). Extra names are validated against
`^[A-Za-z][A-Za-z0-9_-]*$` so a stray shell metacharacter in `.env`
cannot reach `uv sync` downstream (defense in depth).
`docker/docker-compose-dev.yaml`'s startup command is now extracted to
`docker/dev-entrypoint.sh` (review feedback — the inline command had
grown to a ~350-char one-liner). The script:
- parses comma/whitespace-separated UV_EXTRAS, applying the same
`^[A-Za-z][A-Za-z0-9_-]*$` whitelist as the local detector;
- emits one `--extra X` flag per token, so `UV_EXTRAS=postgres,ollama`
works in Docker dev too (harmonized with local — review feedback);
- calls `uv sync --all-packages` (PR #2584) so workspace member
extras (deerflow-harness's postgres extra) are installed;
- keeps the existing self-heal `(uv sync || (recreate venv && retry))`
branch;
- exposes `--print-extras` for dry-run testing.
The compose file mounts the script read-only at runtime, so script
edits take effect on `make docker-restart` without an image rebuild.
The `--no-sync` alternative (a separate suggestion in the issue thread)
was considered but rejected for dev paths because it would drop the
self-heal branch and the auto-pickup of new pyproject deps. `--no-sync`
is already in use for the production CMD (`backend/Dockerfile:101`)
where it's appropriate.
Updates the asyncpg-missing error message to include the
`--all-packages` flag (matching #2584) plus the persistent install flow,
and expands `config.example.yaml` so all three install paths
(local / docker dev / docker image build) are documented with their
multi-extra capabilities.
Tests:
- `tests/test_detect_uv_extras.py` (21 tests) — local-path env parsing,
YAML edge cases, env-vs-config precedence, whitelist rejection of
shell metacharacters.
- `tests/test_dev_entrypoint.py` (15 tests) — docker-path validation
via `--print-extras`, multi-extra parsing, metacharacter abort.
- `tests/test_persistence_scaffold.py` (22 tests, unchanged) — passes
with the merged `--all-packages --extra postgres` error message.
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
Surface artifacts produced via the present_files tool in the CLI debug
REPL so headless clients without a frontend (VS Code launch configs,
etc.) can locate output files. Each turn prints newly added artifacts
plus their resolved host path. Works for any source that goes through
present_files — ACP agents, subagents, or sandbox writes.
Co-authored-by: Claude Opus 4 <noreply@anthropic.com>