Commit Graph

2308 Commits

Author SHA1 Message Date
Nan Gao 8c0830aea1 fix(channels): add operational guardrails (#3584)
* fix(channels): add operational guardrails

* make format

* fix(channels): converge with #3582 to avoid merge-order conflicts

Drop this PR's DingTalk INFO-log redaction and hand it to #3582, which
already restructures that handler and will redact the same log there. This
PR no longer touches dingtalk.py, so the two PRs can merge to main in any
order without a conflict.

For WeChat, drop the contested thread_ts priority reorder (review #3) and
keep only what inbound dedupe needs: a server-stable message_id in the
inbound metadata (message_id/msg_id, no client_id per review #6). This is a
single added line inside the metadata dict, a region #3582 never touches, so
it auto-merges regardless of order.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(channels): address three correctness review findings

1. Connect-code cap was racy (willem #1): _create_state ran delete-expired,
   count, and insert as three separate transactions, so concurrent connect
   POSTs from one owner could each see count < cap and all insert past it. Add
   ChannelConnectionRepository.create_oauth_state_within_cap which does
   delete+count+insert in a single transaction serialized per (owner,
   provider) — Postgres via pg_advisory_xact_lock, SQLite via the write lock
   the leading DELETE takes — and have the router use it.

2. Inbound dedupe key fell back to "" workspace (willem #3): two workspaces
   delivering without team/guild/aibotid would collapse to the same key and
   dedupe each other's messages. _inbound_dedupe_key now fails closed
   (returns None) when no workspace identifier is present.

3. Dedupe key was recorded on receipt and never released on failure
   (ShenAC #1): a transient error (DB blip, Gateway 503) left the key in place
   for the full TTL, so a provider redelivery of the same message_id — exactly
   the retry dedupe should absorb — was silently dropped. _handle_message now
   releases the key in the unexpected-exception branch so redelivery can
   recover, while keeping record-on-receipt so retries during handling are
   still deduped.

Tests: repo cap enforcement incl. concurrent-issuance non-leak; dedupe
fail-closed; dedupe key release-on-failure redelivery recovery.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(channels): address cleanup/efficiency and test review findings

Efficiency / cleanup:
- Dedupe key set drops client-generated ids (client_msg_id, client_id);
  keep only server-stable event_id/message_id/msg_id, which a provider's own
  redelivery preserves (ShenAC #6). Every provider already emits message_id.
- TTL/overflow pruning of _recent_inbound_events is now O(k): switch to an
  OrderedDict and popitem(last=False) from the front instead of scanning all
  4096 entries on every inbound (willem #4).
- Log "received inbound" only after the dedupe check so a provider retrying N
  times no longer logs N accepts; document that manager dedupe covers the
  agent run/final answer, not provider ack side-effects (willem #5, ShenAC #2).
- Slack drops the redundant `team_id or event.get("team")` fallback the caller
  already resolved (willem #6).
- create_oauth_state_within_cap prunes only this owner/provider's expired codes
  instead of a global DELETE on every connect POST; global cleanup still runs
  on consume_oauth_state (willem #7).

Tests:
- Dedupe test uses tmp_path instead of a leaked mkdtemp, uses distinct objects
  per publish, and adds a negative control: a different message_id is still
  processed, catching over-dedupe regressions (willem #8, ShenAC #4).
- Slack HTTP-mode rejection test supplies app_token so the missing-token early
  return can't mask the guard, giving the state assertions teeth (ShenAC #3).
- count_oauth_states test pins that the active row survives, not just the count
  (ShenAC #5).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* make format

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 10:09:46 +08:00
Xinmin Zeng 97dd9ecf73 fix(sandbox): stop flagging string-literal path fragments as unsafe absolute paths (#3623)
* fix(sandbox): stop flagging string-literal path fragments as unsafe paths

The host-bash absolute-path guard scans the raw command string, so /segment
sequences inside string literals, f-strings, and templates were treated as
absolute path arguments and rejected — e.g. python -c "print(f'/端口{port}')"
or a REST template /devices/{id}/port. Whether a fragment tripped the guard
depended on the character right before the slash (a word char suppressed the
match), so the same literal could pass or fail unpredictably, pushing the model
into retry loops that bloat context and wall-clock time.

Exempt matches carrying non-ASCII characters or format braces: real host paths
a command would open contain neither, so these are text, not paths. The guard
is best-effort (not a security boundary), and plain ASCII host paths like
/etc/passwd — including ones written inside a code string such as
open('/etc/passwd') — stay rejected.

* fix(sandbox): only exempt identifier-template braces, not bash brace expansion

The literal-fragment exemption exempted any path fragment containing { or },
which let bash brace expansion (cat /etc/{passwd,shadow}) and ${VAR} expansion
reconstitute real host paths past validate_local_bash_command_paths. Tighten
the brace branch to only exempt fragments where every {...} block is a single
identifier-like placeholder (/devices/{id}/port, f-string /{port}); reject
${VAR} shell-variable expansion. Add parametrized regression tests for the
brace-expansion and shell-var bypasses.
2026-06-18 09:27:05 +08:00
Ryker_Feng 0bbbbc06f4 feat(community): add Serper Google Images provider for image_search (#3575)
* feat(community): add Serper Google Images provider for image_search

Add a Serper-backed `image_search` tool alongside the existing Serper
`web_search` provider, so users with a SERPER_API_KEY can pull Google
Images results as reference images for downstream image generation.

- Share request/response handling between web_search and image_search
  via `_serper_post` / `_response_items`, with bounded `max_results`
  (capped at 10) and query normalization.
- Add a best-effort SSRF guard (`_safe_public_url`) that rejects
  non-http(s), localhost and private/non-global IP image URLs; filtered
  entries are dropped and never consume the result limit.
- doctor: flag literal `api_key` values in config as a warning and steer
  users toward `.env` + `$SERPER_API_KEY`.
- Docs/config: document the Serper image_search provider and SERPER_API_KEY,
  and discourage committing literal keys to config.yaml.
- Tests: cover the provider end-to-end (100% line coverage on tools.py)
  and the doctor literal-key warning path.

* fix(community): block obfuscated IPv4 literals in Serper image SSRF guard

The image_search SSRF guard only rejected dotted-decimal IP literals; encoded
forms such as decimal (http://2130706433/), hex (0x7f000001) and octal
(0177.0.0.1) raised ValueError in ip_address() and were allowed through, even
though many HTTP clients resolve them to private addresses like 127.0.0.1.

Add _decode_ipv4() to permissively decode these inet_aton-style encodings and
apply the same is_global check; hostnames that do not decode to an IP (e.g.
cafe.com) are still treated as hosts and left to fetch-time re-validation.

Addresses PR review feedback. Tests cover decimal/hex/octal loopback and
private encodings plus non-IP edge cases; tools.py stays at 100% line coverage.

* test(community): cover IPv4-mapped IPv6 URL filtering

* fix(community): address Serper image search review feedback

- Block trailing-dot hostname SSRF bypass (localhost./127.0.0.1.) in
  _safe_public_url by stripping the FQDN root label before checks.
- Keep a filtered image/thumbnail URL empty instead of collapsing onto
  its counterpart, preserving the high-res/preview contract.
- Evaluate the SSRF guard once per field rather than twice.
- Treat a null-typed organic/images field as "no results" rather than a
  malformed payload.
- doctor.py: when a config $VAR is unset, fall through to the default env
  var before reporting it as not set.
2026-06-18 07:36:35 +08:00
Huixin615 ec16b6650d fix(channel): force reload config on channel restart (#3619)
* fix: force reload config on channel restart

* fix: detect config content changes for reload
2026-06-17 22:57:46 +08:00
Xinmin Zeng 6a4a30fa2b fix(sandbox): return actionable hint when read_file hits a binary file (#3624)
read_file decodes with UTF-8. Binary uploads (.xlsx, images, ...) raise
UnicodeDecodeError in the sandbox layer. UnicodeDecodeError is a ValueError
subclass, not an OSError, so it bypassed the typed handlers and fell through
to the generic except, surfacing a vague "Unexpected error reading file"
message. The model could not tell the file was binary, so it retried
read_file instead of switching to bash + pandas/openpyxl, burning LLM
round-trips and bloating context with repeated failures.

Add a dedicated UnicodeDecodeError handler that tells the model the file is
binary and to use bash with a suitable library (or view_image for images).
2026-06-17 21:11:44 +08:00
Nan Gao e732a741bf fix(channels): centralize shared channel retry helpers (#3583) 2026-06-17 15:44:40 +08:00
Zhipeng Zheng c81ab268fb fix(serialization): stop stripping __interrupt__ from channel values (#3595) (#3605) 2026-06-17 15:29:22 +08:00
heart-scalpel a72af8ea37 feat(subagents): attribute subagent spans to parent thread's Langfuse session (#3611)
The subagent execution path did not call inject_langfuse_metadata(...)
and built its model with attach_tracing=True, so subagent LLM/tool
spans landed in Langfuse as isolated top-level traces carrying fresh
session ids and the default user. They were findable in the unfiltered
trace list but did not group under the parent thread's session card,
and Langfuse cost attribution for subagent traffic did not line up
with the parent conversation — even though DeerFlow's internal token
accounting (SubagentTokenCollector) was already correct.

Extend the lead-agent tracing wiring to the subagent path so a single
subagent run produces one trace that shares the parent thread's
session_id and user_id, with a subagent:<name> trace name:

- subagents/executor.py: append build_tracing_callbacks() output to
  run_config["callbacks"] (preserving SubagentTokenCollector) and
  call inject_langfuse_metadata(...) with thread_id, user_id, and
  the normalized subagent:<name> trace name. Build the model with
  attach_tracing=False so model-level tracing does not double-count
  with the graph-root callbacks — the same pairing the lead agent
  uses.
- tools/builtins/task_tool.py: resolve user_id via
  resolve_runtime_user_id(runtime) at the parent tool layer (before
  the background thread starts) and thread it through
  SubagentExecutor.__init__, because the _current_user contextvar
  is not guaranteed to survive the _execution_pool boundary.

Trace topology is unchanged: subagent traces remain separate top-level
traces in the same session, not nested as child spans under the lead
trace (Plan B follow-up).

Tests: tests/test_subagent_executor.py::TestSubagentTracingWiring
covers the callback append, the session/user/trace-name injection,
the disabled-langfuse no-op, the DEFAULT_USER_ID fallback, the
empty-name trace-name fallback, and the env-tag emission. Existing
test_create_agent_threads_explicit_app_config_to_model_and_middlewares
now also asserts attach_tracing=False.

Docs: CLAUDE.md Tracing System section documents subagents/executor.py
as a third injection point alongside worker.py and client.py.
2026-06-17 14:36:09 +08:00
dependabot[bot] 6b2716e75b chore(deps): bump starlette from 1.0.1 to 1.3.1 in /backend (#3622)
Bumps [starlette](https://github.com/Kludex/starlette) from 1.0.1 to 1.3.1.
- [Release notes](https://github.com/Kludex/starlette/releases)
- [Changelog](https://github.com/Kludex/starlette/blob/main/docs/release-notes.md)
- [Commits](https://github.com/Kludex/starlette/compare/1.0.1...1.3.1)

---
updated-dependencies:
- dependency-name: starlette
  dependency-version: 1.3.1
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-17 13:15:05 +08:00
dependabot[bot] 55577eb36f chore(deps): bump aiohttp from 3.14.0 to 3.14.1 in /backend (#3621)
---
updated-dependencies:
- dependency-name: aiohttp
  dependency-version: 3.14.1
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-17 13:14:06 +08:00
dependabot[bot] fda8209ee3 chore(deps): bump cryptography from 46.0.7 to 48.0.1 in /backend (#3620)
Bumps [cryptography](https://github.com/pyca/cryptography) from 46.0.7 to 48.0.1.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pyca/cryptography/compare/46.0.7...48.0.1)

---
updated-dependencies:
- dependency-name: cryptography
  dependency-version: 48.0.1
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-17 13:13:26 +08:00
Zheng Feng 5851f8250e fix(sandbox): make setup-sandbox.sh script executable (#3618) 2026-06-17 12:32:52 +08:00
stphtt f212da9f89 fix(sandbox): create shell session before retrying on a fresh id (#3577)
* fix(sandbox): create shell session before retrying on a fresh id

The AIO sandbox recovery path generated a UUID and passed it straight to
exec_command(id=...). The sandbox image only auto-creates a session when
exec_command is called with *no* id; an exec carrying an unknown id returns
HTTP 404 "Session not found". So every ErrorObservation recovery itself
404'd, turning a transient session lapse into an unrecoverable tool error
that looped the run up to the LangGraph recursion limit.

Explicitly create_session(id=fresh_id) before targeting that id on retry.
create_session is idempotent (returns the existing session if the id already
exists), so this is safe under the serializing lock.

Updated the regression test to assert the retry targets exactly the
created session id rather than a fabricated, uncreated one.

* fix(sandbox): release the one-shot recovery session after retry

The fresh session created on the ErrorObservation recovery path is used for
exactly one command -- the next execute_command runs with no id and returns
to the default session. Under persistent session corruption every command
would create another session that is never reused or released, accumulating
sessions on the container.

Release it best-effort with cleanup_session() in a finally, swallowing any
cleanup error so it never masks a successful retry.

Addresses review feedback on #3577.

---------

Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
2026-06-17 10:21:27 +08:00
dependabot[bot] c361fa3e49 chore(deps): bump python-multipart from 0.0.27 to 0.0.31 in /backend (#3613)
Bumps [python-multipart](https://github.com/Kludex/python-multipart) from 0.0.27 to 0.0.31.
- [Release notes](https://github.com/Kludex/python-multipart/releases)
- [Changelog](https://github.com/Kludex/python-multipart/blob/main/CHANGELOG.md)
- [Commits](https://github.com/Kludex/python-multipart/compare/0.0.27...0.0.31)

---
updated-dependencies:
- dependency-name: python-multipart
  dependency-version: 0.0.31
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
2026-06-17 08:43:19 +08:00
dependabot[bot] 5a9db48ae1 chore(deps): bump pyjwt from 2.12.1 to 2.13.0 in /backend (#3612)
Bumps [pyjwt](https://github.com/jpadilla/pyjwt) from 2.12.1 to 2.13.0.
- [Release notes](https://github.com/jpadilla/pyjwt/releases)
- [Changelog](https://github.com/jpadilla/pyjwt/blob/master/CHANGELOG.rst)
- [Commits](https://github.com/jpadilla/pyjwt/compare/2.12.1...2.13.0)

---
updated-dependencies:
- dependency-name: pyjwt
  dependency-version: 2.13.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-17 08:20:41 +08:00
Nan Gao 926406e0d6 fix(channels): make runtime provider state authoritative (#3580)
* fix(channels): make runtime provider state authoritative

* make format

* fix(channels): close runtime provider config races and status gaps

Address review findings on the runtime-provider-state change:

- configure/disconnect now re-read the live app.state.channels_config
  after the worker await and mutate only the affected provider key in
  place, so a concurrent mutation for a different provider is no longer
  clobbered by a stale pre-await snapshot.
- disconnect revokes DB connection rows before committing the store and
  cache, so a repo failure cannot leave the store/cache "disconnected"
  while the DB keeps "connected" rows a later re-configure would
  silently reactivate.
- _provider_response preserves non-connected statuses (e.g. revoked)
  when the provider is unavailable, only masking a stale "connected"
  row as not_connected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 07:45:46 +08:00
Huixin615 43dba448ad fix(channel): unsubscribe channel listeners by equality (#3608) 2026-06-17 00:12:10 +08:00
AochenShen99 65fab1d4d4 feat(skill): strengthen maintainer orchestrator review workflow (#3606)
* feat(skill): strengthen maintainer orchestrator review workflow

Add seven enhancements to the deerflow-maintainer-orchestrator skill and
mirror them in docs/agents/maintainer-sop.md:

- Posting gate as confidence x severity, with a maintainer-only notes
  channel for sub-threshold observations. Clarifies that "no high-confidence
  findings" spans P0/P1/P2, not just P0.
- Batch handling: cluster by relatedness, then synthesize cross-PR overlap,
  merge-order/conflict surface, and composition risk.
- Competing PR comparison anchored to the issue's acceptance criteria, with a
  maintainer-only ranking and a constructive per-PR public surface.
- Existing comments suppress duplicate posting, not analysis: review fully and
  post only the net-new delta, with an idempotency guard for re-runs.
- Green-CI discipline: checks are signal not verdict; read the changed code
  path regardless of a green rollup.
- Fork PR head resolution via pull/<n>/head and a pre-post head-SHA recheck.
- Competing-PR detection in artifact resolution; output gains Already
  covered / Maintainer notes / Batch synthesis fields.

* docs(skill): rewrite maintainer SOP as design rationale, not a skill restatement

* docs(skill): rename maintainer SOP doc to maintainer-orchestrator-design

* feat(skill): allow controlled fan-out when a related cluster exceeds one context

---------

Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
2026-06-16 23:33:09 +08:00
Huixin615 1896722e66 fix: add MCP tools cache reset endpoint (#3602)
* fix: add MCP tools cache reset endpoint

* docs: clarify MCP cache reset scope
2026-06-16 23:20:20 +08:00
Nan Gao 0966131b31 fix(channels): require bound identity for user-owned IM messages (#3578)
* fix(channels): require bound identity for user-owned IM messages

* make format

* docs: document bound identity channel config

* refactor: reuse channel connection config

* refactor _requires_bound_identity()

* refactor from_app_config()

* make format

* fix: reject unbound channel chats before semaphore

* security enhancement

* make format

* fix: enforce bound-identity admission at command entry point

The bound-identity gate only ran for non-command messages in
_handle_message() and as a fallback inside _handle_chat(). Commands had
no equivalent boundary, so an unbound platform user could send /new and
reach _create_thread() directly, creating an unowned Gateway thread and
empty checkpoint. Info commands (/status, /models, /memory) likewise
leaked Gateway state to unbound users.

Add the same _requires_bound_identity() check at the top of
_handle_command(), rejecting via _reject_unbound_channel_message() before
any thread creation or Gateway query. The gate is a no-op in legacy
open-bot mode (require_bound_identity=False) and auth-disabled mode.
Provider-level binding flows (/connect, /start) are consumed by the
provider adapter before reaching the manager, so they are unaffected.

Tests:
- unbound auth-enabled /new is rejected before threads.create
- bound auth-enabled /new still creates the thread

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(channels): carry workspace fallback decision on inbound messages

* fix(channels): recheck bound identity by normalized workspace

* fix(channels): avoid duplicate bound identity checks

* fix(channels): preserve verified routing for bound identity rejects

* fix(channels): clarify bound identity upgrade failures

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
2026-06-16 23:04:39 +08:00
DanielWalnut 05be7ea688 fix(subagents): raise general-purpose max_turns to 150 and default timeout to 30min (#3610)
* fix(subagents): raise general-purpose max_turns to 150 and default timeout to 30min

Deep-research subtasks failed out of the box with GraphRecursionError (Recursion limit of 100 reached): the built-in general-purpose subagent caps at max_turns=100. Raise it to 150 and bump the default subagent timeout from 900s (15min) to 1800s (30min) so the extra turns have time to run instead of shifting the failure to a timeout. The lead agent recursion_limit (100) is unchanged; the failures are subagent-only.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(subagents): clarify lead recursion_limit is independent of subagent max_turns

Add comments at both lead recursion_limit=100 sites (gateway services + channel manager) explaining the lead's LangGraph super-step budget is separate from subagent depth, so the two 100s are not conflated. Comment-only, no behavior change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(subagents): clarify built-in vs custom timeout scope; pin bash max_turns in test

Review follow-ups: (1) clarify SubagentConfig docstring + global timeout field/comment that the 1800 default applies to built-in subagents (custom agents keep their own timeout_seconds); (2) pin bash.max_turns==60 in the defaults regression test so the config.example.yaml doc cannot drift; (3) rename test_default_timeout_preserved_when_no_config -> test_explicit_global_timeout_propagates_to_general_purpose since it intentionally exercises an explicit non-default 900. No runtime behavior change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 19:55:04 +08:00
Chetan Sharma d2cc991d55 make ai follow-up suggestions optional (#3591) 2026-06-15 17:59:25 +08:00
Eilen Shin 25fbd25b05 fix(frontend): cap deeply nested list indentation to prevent render crash (#3393) (#3570)
* fix(frontend): cap deeply nested list indentation to prevent render crash

Deeply nested lists make marked's recursive list tokenizer overflow the
call stack during Streamdown's lexing useMemo, throwing an uncaught
"RangeError: Maximum call stack size exceeded" that replaces the chat
route with an error page (issue #3393); on larger stacks the same input
exhausts the heap, which the render error boundary cannot catch.

Mirror the existing capBlockquoteNesting guard with capListNesting, which
clamps leading whitespace to 200 columns (~100 nesting levels) only when
pathologically deep indentation is present, leaving normal content and
fenced code untouched. Wire both through capMarkdownNesting.

* fix(frontend): satisfy prettier format check in preprocess test

* fix(frontend): exempt indented code from list-indent cap (PR #3570 review)

* fix(frontend): keep capping all deep indentation outside fenced code

Revert the indented-code exemption from the PR #3570 review nit. Taken
literally the suggested guard (insideFence || INDENTED_CODE_RE.test(line))
no-ops capListNesting, because INDENTED_CODE_RE matches every line with
4+ leading spaces — i.e. exactly the deep-indent lines the cap targets.
A context-aware exemption (only treat 4+-space lines as code after a
blank line) instead reopens the crash: blank-separated deeply nested list
items get exempted and still blow up marked (verified: OOM at depth ~1.5k).

Unlike blockquotes (markers take <=3 leading spaces, so deep-quote lines
never look like indented code), list vs. indented-code indentation is
ambiguous line-by-line, so any exemption is exploitable. Keep capping all
deep indentation outside fenced code; the only cost is mild corruption of
a >200-column indented-code line, which never occurs in real content and
is strictly preferable to a render crash. Add a regression test locking
the blank-line case.
v2.0.0-rc0
2026-06-14 22:19:54 +08:00
zgenu 34e126ee4b fix(frontend): reset active chat after deletion (#3519) 2026-06-14 22:06:19 +08:00
Willem Jiang ec520e6427 fix(makefile):fix the per-commit hooks installation (#3569)
Install pre-commit as a stable uv tool. Avoid `uv run --with pre-commit`: that runs
  from a throwaway temp env whose Python gets baked into .git/hooks/pre-commit and is
  gone by the next commit, leaving the hook broken. A tool install bakes a permanent path.
2026-06-14 11:56:38 +08:00
Willem Jiang 0fb2a75bfb fix(doc):update the document for the docker configuration 2026-06-14 11:35:01 +08:00
Xinmin Zeng 5d61718c80 fix(security): mount host Docker socket only in aio (DooD) sandbox mode (#3517)
* fix(security): mount host Docker socket only in aio (DooD) sandbox mode

The default Compose stack mounted /var/run/docker.sock read-write into the
root gateway container in every sandbox mode, including the default `local`
mode that never uses it -- an unnecessary host-escape surface (DooD =
root-equivalent host control). deploy.sh already gated the socket *check* on
sandbox_mode != local, but the Compose files mounted it unconditionally.

Move the socket mount to an opt-in docker/docker-compose.dood.yaml overlay
that deploy.sh / docker.sh append only when detect_sandbox_mode() returns
`aio`. Default (local) and provisioner/Kubernetes modes no longer expose the
host daemon. Tighten the socket existence check from != local to == aio.
Document the DooD threat model in SECURITY.md.

Reported by @greatmengqi.

* refactor(docker): address review on socket-hardening PR

- docker.sh: use absolute path for the dood overlay (match deploy.sh, drop cwd dependency)
- deploy.sh: drop now-dead DEER_FLOW_DOCKER_SOCKET exports in down/build paths
- docker-compose.yaml: fix stale header comment to point at the overlay

Addresses codex + reviewer feedback on #3517.

---------

Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
2026-06-14 11:03:50 +08:00
Xinmin Zeng 474c89bac2 fix(security): do not bind-mount host CLI auth dirs by default (#3521)
* fix(security): do not bind-mount host CLI auth dirs by default

The Compose stack bind-mounted the entire ~/.claude and ~/.codex dirs
(read-only) into the root gateway container in every configuration -- exposing
not just credentials but full conversation history, per-project session data,
and global CLI config. The default OpenAI-compatible model providers and the
local sandbox never use them.

Move the mounts to an opt-in docker/docker-compose.cli-auth.yaml overlay.
Document env-token paths (CLAUDE_CODE_OAUTH_TOKEN, CODEX_AUTH_PATH) in
.env.example -- the Gateway credential loader reads env first, so most setups
need no mount at all. Document the exposure and per-mode options in SECURITY.md.

Reported by @greatmengqi.

* docs: clarify ACP adapter auth and add Claude single-file credential option

- ACP adapters authenticate independently (many take an env API key like
  ANTHROPIC_API_KEY and need no mount); the cli-auth overlay is only for
  adapters that read the full CLI config dir. Avoids steering users toward
  mounting the whole dir for ACP when env auth usually suffices.
- Add CLAUDE_CODE_CREDENTIALS_PATH (single .credentials.json) as a Claude
  one-file option, matching codex CODEX_AUTH_PATH and the README.

* docs: cite claude-code-acp env auth and CLAUDE_CONFIG_DIR in ACP guidance

Replace the generic 'some adapters' wording with the verified behavior of
the common claude-code-acp adapter (env ANTHROPIC_API_KEY startup + CLAUDE_CONFIG_DIR),
so the 'no ~/.claude mount needed for ACP' guidance is backed by a concrete adapter.
2026-06-14 10:50:05 +08:00
Huixin615 f43aa78107 fix(agents): sync agent_name across context/configurable and reject empty soul (#3549) (#3553)
* fix(agents): sync agent_name across context/configurable and reject empty soul (#3549)

Two independent issues caused custom agent creation to silently fail:

1. build_run_config only wrote agent_name into one container (configurable
   or context), so setup_agent — which reads ToolRuntime.context exclusively
   since LangGraph >=1.1.9 — saw agent_name=None and wrote SOUL.md to the
   global base_dir instead of users/{user_id}/agents/{name}/. Mirror the
   dual-write pattern already used by merge_run_context_overrides and
   naming.py so both containers always carry the same value.

2. setup_agent persisted whatever soul string it received, including empty
   or whitespace-only content, and still reported success. The frontend
   then surfaced an unusable agent and the global default SOUL.md could be
   silently overwritten with empty content. Reject empty soul before any
   filesystem operation so the model can retry.

Tests:
- test_gateway_services.py: dual-write regressions for both configurable
  and context entry paths, explicit-agent-name precedence on both sides,
  and a shape-parity test against merge_run_context_overrides.
- test_setup_agent_tool.py: empty/whitespace soul rejection, plus
  no-overwrite guarantees for existing global and per-agent SOUL.md.

* Update services.py
2026-06-14 10:40:16 +08:00
heart-scalpel 47e9570d86 fix(subagent): isolate subagent from parent run checkpointer (#3559)
Subagent _create_agent() now passes checkpointer=False to prevent
inheriting the parent run's synchronous checkpointer via copy_context(),
which would cause NotImplementedError when aget_tuple() is called on
the async path. Subagents are one-shot delegations that never resume,
so persistence is unnecessary.
2026-06-14 10:30:45 +08:00
hataa 1783da42f4 fix(channels): close Discord file handle after upload (#3561)
send_file opened the attachment with a bare open() and never closed it,
leaking a file descriptor on every Discord file delivery. The handle was
also leaked on the failure path: when target.send raised, the except
branch logged and returned without closing fp. The "# noqa: SIM115"
suppressed the lint warning instead of fixing it.

Wrap open() in a with statement that stays open for the full upload —
the discord client reads fp while target.send runs on _discord_loop, and
once that future resolves the bytes are consumed, so closing here is
safe. This closes the handle on both the success and exception paths and
matches how telegram and feishu already handle their file uploads.

Adds regression tests asserting the handle is closed after send_file on
both the success and failure paths.

Refs #3544
2026-06-13 23:27:17 +08:00
AochenShen99 d23eac227f feat(skill): add maintainer issue and PR workflow (#3554)
* feat(skill): add maintainer orchestrator workflow

* feat(skill): refine maintainer comment behavior

* fix(skill): match PR review opener count

* fix(skill): align maintainer skill path convention
2026-06-13 22:56:33 +08:00
idefav 554017a89f docs: document custom AIO sandbox images (#3548)
* docs: document custom AIO sandbox images

* docs: clarify sandbox image dependency example
2026-06-13 22:50:51 +08:00
Ryker_Feng 6e839342a7 feat(community): add Brave Search web search tool (#3528)
* feat(community): add Brave Search web search tool

Add a community web_search provider backed by the official Brave Search
API (https://api.search.brave.com/res/v1/web/search). API key is read
from the tool config (inline api_key) or the BRAVE_SEARCH_API_KEY env
var. Output schema (title/url/content) matches existing search tools.
No new dependencies (uses the existing httpx). Also wires up the setup
wizard, doctor health check, config example, and EN/ZH docs.

* refactor(community): drop redundant [:count] slice in Brave search

The Brave API already caps results via the `count` request param, so
client-side slicing was redundant. Tests now simulate the API honoring
`count` instead of relying on the slice. Addresses PR review nit.

* style(tests): apply ruff format to test_doctor.py

Collapse multiline write_text calls onto single lines to satisfy the
CI ruff formatter (lint-backend was failing on format --check).
2026-06-13 22:47:35 +08:00
liuchuan01 8955b3222a fix(sandbox): merge idempotent sandbox state updates (#3518)
* fix(sandbox): merge idempotent sandbox state updates

* fix(sandbox): merge idempotent sandbox state updates
2026-06-13 22:40:48 +08:00
Ryker_Feng c91dacc8e2 fix(channels): surface WeCom WebSocket connection failures (#2000) (#3526)
* fix(channels): surface WeCom WebSocket connection failures (#2000)

The WeCom channel started the SDK connection with a fire-and-forget
asyncio task and never inspected its result, so connection failures
(e.g. the gateway WebSocket handshake to wss://openws.work.weixin.qq.com
failing) were silently swallowed: the channel still logged "started",
SDK error/disconnected events went unobserved, and the connect task
produced "Task exception was never retrieved" noise.

Monitor the connect task with a done-callback that logs a clear,
actionable error (and stays silent on cancellation), and subscribe to
the SDK's error/disconnected events so failures become visible in
DeerFlow's own logs.

* style(channels): apply ruff format to wecom.py

Collapse the multiline log message onto a single line to satisfy the
CI ruff formatter (lint-backend was failing on format --check).

* fix(channels): log WeCom disconnect reason when SDK provides one

Address review feedback: _on_ws_disconnected now includes the first
event arg (e.g. reason/context) in the warning instead of discarding
it, so disconnect causes are visible in logs.
2026-06-13 22:34:00 +08:00
Airene Fang cad6e89a19 fix(scripts):start with make start-daemon,can not stop next-server with make stop (#3498)
* fix(scripts):start with make start-daemon,can not stop next-server with make stop

* fix(scripts):start with make start-daemon,can not stop next-server with make stop
2026-06-13 09:16:08 +08:00
hataa 094296440f fix(history): strip base64 image data from REST endpoint responses (#3535)
ViewImageMiddleware persists full base64 image payloads in hide_from_ui
human messages inside checkpoints. All REST endpoints that returned
serialize_channel_values(channel_values) sent these multi-megabyte
payloads to the frontend, freezing the UI on threads with images.

Add strip_data_url_image_blocks() to remove data:-scheme image_url
content blocks from hide_from_ui messages, and
serialize_channel_values_for_api() as a convenience wrapper used by all
six affected call sites across threads, runs, and thread_runs routers.
SSE streaming is unaffected (still uses serialize_channel_values).

Fixes #3496
2026-06-13 08:58:19 +08:00
DanielWalnut 839fa99237 feat(telegram): stream agent replies by editing the placeholder message in place (#3534)
* docs(spec): telegram streaming output design

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* docs(plan): telegram streaming implementation plan

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(telegram): report streaming support for telegram channel

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* test(channels): use slack as the non-streaming sample channel in manager tests

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(telegram): register running-reply placeholder as stream target

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* test(telegram): pin last_edit_at sentinel in placeholder registration test

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* refactor(telegram): extract _send_new_message from send()

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(telegram): edit streamed message in place for non-final updates

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(telegram): finalize streamed message with overflow splitting

When is_final=True arrives and stream state exists, pop the state, edit
the streamed placeholder with the final text, split overflow into follow-up
send_message calls, update _last_bot_message, and clear stream state.
Falls back to _send_new_message when no stream state is registered.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* test(telegram): exercise the not-modified handler in final edit path

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* docs: telegram channel now streams replies via message editing

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(telegram): harden final-delivery path with guarded retry and chunk retries

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(channels): accept runtime 'messages' SSE event for streaming text accumulation

The embedded runtime (matching LangGraph Platform semantics) emits SSE
event name 'messages' for the requested 'messages-tuple' stream mode,
so the manager never accumulated token deltas and streaming channels
only updated from end-of-step 'values' snapshots — on Telegram this
looked like 'Working on it...' followed by the full answer in one block.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(telegram): widen stream-edit throttle to 3s in group chats

Telegram caps bots at 20 messages/minute per group, stricter than the
1 msg/s per-chat guideline. Groups have negative chat ids, so pick the
interval by sign.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(telegram): address review findings — thread fallback messages, bound stream registry, share stream-event constants

- Fallback/new stream messages now carry reply_to_message_id parsed from
  thread_ts so they stay nested under the user's message (finding 1)
- STREAM_MODES / MESSAGE_STREAM_EVENTS constants link the requested
  stream modes to the SSE event names they arrive under (finding 2)
- _register_stream_message bounds the in-flight registry at 256 entries,
  evicting oldest, guarding against leaks when a final never arrives (finding 4)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 08:38:28 +08:00
dependabot[bot] 3475f7cdad chore(deps): bump starlette from 1.0.0 to 1.0.1 in /backend (#3546)
Bumps [starlette](https://github.com/Kludex/starlette) from 1.0.0 to 1.0.1.
- [Release notes](https://github.com/Kludex/starlette/releases)
- [Changelog](https://github.com/Kludex/starlette/blob/main/docs/release-notes.md)
- [Commits](https://github.com/Kludex/starlette/compare/1.0.0...1.0.1)

---
updated-dependencies:
- dependency-name: starlette
  dependency-version: 1.0.1
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-13 08:14:16 +08:00
dependabot[bot] 83bc2fb1ae chore(deps): bump aiohttp from 3.13.5 to 3.14.0 in /backend (#3545)
---
updated-dependencies:
- dependency-name: aiohttp
  dependency-version: 3.14.0
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-13 08:13:49 +08:00
Huixin615 a17d2ff8f8 fix(mcp): surface admin-required state on settings tools page (#3527) (#3533)
GET /api/mcp/config returns 403 for non-admin users, but the previous
client returned the error body as MCPConfig, causing MCPServerList to
crash with 'Cannot convert undefined or null to object' on
Object.entries(config.mcp_servers).

- api.ts: introduce MCPConfigRequestError; loadMCPConfig and
  updateMCPConfig now throw it (carrying status + isAdminRequired)
  instead of letting non-2xx bodies leak through as parsed config
- tool-settings-page.tsx: render a friendly 'admin privileges required'
  empty state when the React Query error is an admin-required
  MCPConfigRequestError; keep MCPServerList resilient with
  Object.entries(servers ?? {}) and an empty-state for no servers
- i18n: add settings.tools.adminRequired and settings.tools.empty in
  en-US, zh-CN and the Translations type
- tests: cover 403 / 5xx / instanceof / detail-fallback for both
  loadMCPConfig and updateMCPConfig in tests/unit/core/mcp/api.test.ts

Refs: #3527
2026-06-13 07:36:57 +08:00
ly-wang19 420a886e1d fix(channels): offload blocking filesystem IO in inbound file ingestion (#3529)
_ingest_inbound_files ensured the thread uploads dir (mkdir), enumerated it
(iterdir/is_file) to de-duplicate names, and wrote each downloaded attachment
(write_upload_file_no_symlink) directly on the event loop. Offload the directory
prep and every per-file write via asyncio.to_thread; the genuinely async network
read (file_reader) stays on the loop. Externally observable behavior is unchanged.

Found via `make detect-blocking-io` (HIGH: iterdir on an async path).

Add tests/blocking_io/test_channels_ingest.py anchor, verified red->green under
the strict Blockbuster gate.

Co-authored-by: ly-wang19 <ly-wang19@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 06:38:54 +08:00
ly-wang19 579e416459 perf(runtime): index messages in MemoryRunEventStore to avoid O(n) scans (#3531)
list_messages re-scanned every event in the thread on each call (category
filter + seq filter) — O(total events) per paginated request on the default
run-events backend. Maintain a messages-only, seq-sorted projection of _events
(shared dict refs, no copies) and locate the seq window with bisect:
list_messages drops to O(log m + page) and count_messages to O(1). The index is
kept in lockstep at every mutation site (put / put_batch via _put_one,
delete_by_run, delete_by_thread).

Externally observable behavior is unchanged — the full RunEventStore contract
suite passes across memory/db/jsonl.

Add a test covering pagination over non-contiguous message seqs (messages
interleaved with trace events), including in-gap and exact-boundary cursors.

Co-authored-by: ly-wang19 <ly-wang19@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-12 22:58:30 +08:00
AochenShen99 c002596ab4 chore(todo): remove unused completion reminder counter (#3530) 2026-06-12 22:48:47 +08:00
AochenShen99 a838546a2b chore(blocking-io): fail-loud repo-root resolution and shared detector CLI shim (#3512)
* chore(blocking-io): fail-loud repo-root resolution and shared detector CLI shim

The three detectors resolved REPO_ROOT with depth-indexed
Path(__file__).resolve().parents[4]. If a detector file ever moves to a
different directory depth, scan roots resolve under the wrong directory
and the detector reports zero findings with no error — a silent-zero
failure shape for a detection tool.

- Add support/detectors/repo_root.py: resolve the repo root by walking
  upward to the .git marker (checked with exists() so git worktrees,
  where .git is a file, also resolve), raising RuntimeError when no
  marker is found. All three detectors use it at import time, so a
  relocated detector fails loudly instead of scanning an empty tree.
- Extract scripts/_detector_cli.py from the three character-identical
  CLI shims; the sys.path computation lives in one place and raises
  when backend/tests cannot be found.
- tests/test_detector_repo_root.py pins: resolution from an unmarked
  location raises instead of returning an empty scan; all three
  detectors share the resolved root; each CLI shim delegates to its
  detector.

Testing: backend `make test` (4278 passed); smoke-ran
`make detect-blocking-io`, `make detect-thread-boundaries`, and
`scripts/scan_changed_blocking_io.py --base upstream/main`.

Closes #3510 (review follow-up to #3503).

* chore(blocking-io): declare detector modules import-only, drop script-mode residue

Adversarial review caught that blocking_io_static.py and
thread_boundaries.py kept shebangs and __main__ blocks but can no longer
run as plain scripts: the new `from support.detectors.repo_root import`
executes before anything puts backend/tests on sys.path, so direct
invocation dies with ModuleNotFoundError before argparse.

Direct execution was never a documented entry point (Makefile targets,
the scripts/ shims, the blocking-io-guard skill, and tests all go
through the support.detectors package), so converge on import-only
instead of re-adding per-module bootstrap: drop the shebangs and the now
unreachable __main__ blocks (plus the `import sys` they kept alive) and
state the supported entry points in each module docstring. The shim
delegation tests in test_detector_repo_root.py pin the supported CLI
paths.

Testing: backend `make test` (4278 passed); `make detect-blocking-io`
and `make detect-thread-boundaries` smoke-ran.
2026-06-12 17:16:01 +08:00
zengxi bbce6c0ac0 docs(config): add SearXNG and Browserless configuration examples (#3513)
* docs(config): add SearXNG and Browserless configuration examples

Add commented-out configuration examples for the SearXNG web search
and Browserless web fetch tools introduced in PR #3451.

- SearXNG: self-hosted metasearch engine (base_url, max_results)
- Browserless: headless Chrome renderer (base_url, token, timeout_s,
  wait_for_event, wait_for_selector, reject_resource_types, etc.)

Also bump config_version to 13 since the tool schema has new options.

* fix(config): align defaults with code and remove unconfigured keys

- SearXNG default port: 8088 (matches searxng/tools.py fallback)
- Browserless default port: 3032 (matches browserless/tools.py fallback)
- Remove wait_for_selector_timeout_ms, reject_resource_types,
  reject_request_pattern from example (not yet read from config)
- Note Docker service ports differ from code defaults
2026-06-12 16:50:32 +08:00
ly-wang19 0d3bfe0a76 perf(runtime): index runs by thread_id to avoid O(n) scans in RunManager (#3499)
* perf(runtime): index runs by thread_id to avoid O(n) scans in RunManager

RunManager.list_by_thread, create_or_reject (inflight check), and has_inflight each filtered every in-memory run by thread_id — an O(total in-memory runs) scan that grows with overall gateway traffic rather than the queried thread's depth.

Add a thread_id -> run_ids secondary index (an insertion-ordered dict used as an ordered set) maintained in lockstep with _runs under the existing lock at every add/remove site (create, create_or_reject, both rollbacks, cleanup). The three per-thread queries now run in O(runs-in-thread); insertion order is preserved so list_by_thread keeps stable tie-breaking. Behavior unchanged. Adds 6 regression tests; full RunManager suite 146 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(runtime): cover create_or_reject rollback + clarify thread-index guard docstrings

Address review on #3499 (fancyboi999):
- Reword _thread_records_locked docstring: lockstep under self._lock is the
  correctness guarantee; self._runs.get is one-directional defense-in-depth
  (drops stale ids, cannot recover index-missing ids), not reconciliation.
- Add test_failed_create_or_reject_unindexes_run covering the create_or_reject
  rollback/unindex mutation site (the last untested mutation path).
- Fix _FailingPutRunStore docstring ("initial put" -> "every put").

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: ly-wang19 <ly-wang19@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-12 16:48:47 +08:00
Xinmin Zeng 503eeac788 fix(frontend): render user messages as plain text and cap blockquote nesting (#3502)
* fix(frontend): render user messages as plain text and cap blockquote nesting

User messages are typed or pasted plain text, not authored Markdown, but
they were rendered through the full Streamdown pipeline. Pasted source
files got fragmented (indented chunks become code blocks, paragraphs
collapse and lose indentation), "$...$" spans were KaTeX-ified, and a
message with thousands of nested ">" markers overflowed the call stack
in marked's recursive blockquote lexer, permanently crashing the thread
on every load.

Render human message content verbatim with pre-wrap instead, and cap
blockquote nesting at 100 levels at the Streamdown chokepoint so model
output cannot trigger the same recursion either.

Closes #3500

* fix(frontend): absorb marked lexer crashes with a render fallback boundary

Review found two gaps in the nesting cap: marked's list and blockquote
tokenizers are mutually recursive, so a list marker in front of the
quote chain ("- > > > ...") bypassed the blockquote-only regex and
still overflowed the stack; and the line-based rewrite was fence-blind,
silently truncating literal ">" runs inside code blocks.

Add an error boundary around Streamdown that renders the raw content as
plain pre-wrap text when rendering throws (retrying on the next content
change), keep the cap as a fast path for the dominant pure-">" case,
and make it skip fenced and indented code lines.
2026-06-12 16:15:40 +08:00
DanielWalnut aa015462a7 feat(im): Add user-owned IM channel connections (#3487)
* Add user-owned IM channel connections

* Fix dev startup and channel connect popup

* Use async channel connect flow

* Harden dev service daemon startup

* Support local IM channel connections

* Align IM connections with local channels

* Fix safe user id digest algorithm

* Address Copilot IM channel feedback

* Address IM channel review comments

* Support all integrated IM channel connections

* Format additional channel connection tests

* Keep unavailable channel connect buttons clickable

* Fix IM channel provider icons

* Add runtime setup for enabled IM channels

* Guard global shortcut key handling

* Keep configured IM channels editable

* Avoid password autofill for channel secrets

* Make channel threads visible to connection owners

* Persist IM runtime config locally

* Allow disconnecting runtime IM channels

* Route no-auth channel sessions to local user

* Use default user for auth-disabled local mode

* Show IM channel source on threads

* Prefill IM channel runtime config

* Reflect IM channel runtime health

* Ignore Feishu message read events

* Ignore Feishu non-content message events

* Let setup wizard enable IM channels

* Fix frontend formatting after merge

* Stabilize backend tests without local config

* Isolate channel runtime config tests

* Address channel connection review comments

* Use sha256 user buckets with legacy migration

* Ensure runtime IM channels are ready after restart

* Persist disconnected IM channel state

* Address channel connection review comments

* Address channel connection review findings

Frontend connect flow:
- Open the runtime-config dialog only when a provider still needs
  credentials; configured providers go straight to the connect flow, so
  the binding-code/deep-link path is reachable from the UI again.
- After saving credentials, continue into the connect flow when a user
  binding is still required (multi-user mode) instead of stopping at a
  "Connected" toast.
- Extract shared provider-state helpers to core/channels/provider-state
  and add unit + e2e coverage for the direct-connect and
  configure-then-connect paths.

Provider status semantics:
- Report connection_status from the user's newest connection row;
  with no binding it is not_connected, except in auth-disabled local
  mode where a configured running channel is effectively connected.

Concurrency and event-loop correctness:
- Offload ChannelRuntimeConfigStore construction and writes, channel
  service construction, and Slack connection replies to threads; add a
  tests/blocking_io/ anchor for the runtime-config handlers.
- Consume binding codes with a conditional UPDATE so a code can only be
  used once under concurrent workers; retry upsert_connection as an
  update when a concurrent insert wins the unique constraint.
- Serialize ensure_channel_ready per channel so concurrent provider
  polls cannot double-start a channel worker.

Config and migration hardening:
- Stop mutating the get_app_config()-cached Telegram provider config;
  the runtime store now owns the UI-entered bot username.
- Register channel_connections in STARTUP_ONLY_FIELDS with the
  standardized startup-only Field description.
- Match the legacy unsafe-id bucket by recomputing its exact SHA-1 name
  so another user's same-prefix bucket can never be migrated.
- Remove the unused Telegram process_webhook_update path and document
  src/core/channels in the frontend docs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* Address PR review comments on authz scoping and channel runtime

Security (review feedback from ShenAC-SAC):
- Scope internal-token callers to the connection owner carried in
  X-DeerFlow-Owner-User-Id instead of bypassing owner checks outright,
  in both require_permission(owner_check=True) and the stateless run
  endpoints. Internal callers keep access to their own and
  shared/legacy threads, and may claim a default-owned channel thread
  for its real owner, but a leaked internal token no longer grants
  cross-user thread access.
- Require admin privileges for POST/DELETE /api/channels/{provider}/
  runtime-config: runtime credentials and channel workers are
  instance-wide shared state (same model as the MCP config API).
  Read-only provider listing stays available to all users.

Performance (review feedback from willem-bd):
- Skip the redundant thread channel-metadata PATCH after the first
  successful backfill per thread.
- Reuse the per-connection Slack WebClient until its token changes
  instead of constructing one per outbound message.
- Reconcile channel readiness for all providers concurrently in
  GET /api/channels/providers.

Also resolve the code-quality unused-import flag in the blocking-io
anchor by pre-importing the channel service via importlib.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* Fix prettier formatting in provider-state test

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* Reconcile UI runtime channel config with config reload on restart

Main now reloads a channel's config.yaml entry on restart_channel()
(#3514, issue #3497). Adapt the user-owned connection flow to coexist:

- configure_channel() restarts with reload_config=False — the caller
  just supplied the authoritative config (browser-entered credentials
  that are never written to config.yaml), so a file reload must not
  clobber it with the stale on-disk entry.
- _load_channel_config() re-applies the UI runtime-store overlay used
  at startup, so an operator-triggered restart keeps browser-entered
  credentials for channels without a config.yaml entry and does not
  resurrect a channel disconnected from the UI.
- Offload the reload's disk IO (config.yaml + runtime store) with
  asyncio.to_thread, matching the blocking-IO policy on this branch.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 15:24:58 +08:00