* fix(frontend): keep workspace interactive when SSR auth probe cannot reach gateway (#3493)
When the SSR auth probe at /api/v1/auth/me times out or fails, the
workspace layout used to render a static fallback page without
AuthProvider or QueryClientProvider, making logout and every other
interaction non-functional until the gateway recovered.
Render the normal WorkspaceContent in 'gateway_unavailable' mode
instead, surfacing a polite offline banner that re-probes the gateway
in the background and hides itself the moment refreshUser() returns
an authenticated user. The probe is reentrancy-guarded so a slow
gateway cannot pile up parallel /auth/me requests.
Closes#3493
* fix(workspace): silent probe in offline banner to avoid /login redirect during gateway recovery (#3493)
The banner previously delegated retry probes to AuthProvider.refreshUser(),
which treats any 401 from /api/v1/auth/me as 'session expired' and
force-redirects to /login. During gateway recovery, the first few requests
may transiently return 401 before the gateway is fully ready, which would
incorrectly kick the user out — defeating the purpose of the offline banner.
Now the banner silently fetches /api/v1/auth/me itself and only delegates
to refreshUser() on 200 OK. Non-200 responses (401 / 5xx / network) are
swallowed and retried on the next interval tick, ensuring the user stays
logged in across short gateway outages.
Verified in Docker:
- docker pause deer-flow-gateway → banner appears, page interactive
- docker unpause deer-flow-gateway → banner auto-disappears within 10s,
user remains on /workspace/chats/new with full session restored
- All 117 unit tests pass
* fix(workspace): fix banner polling leak and persistent 401 handling (#3493)
- Stop polling immediately after user recovery: add user to effect dependencies, cleanup interval when user !== null
- Handle persistent 401: trigger login redirect after 3 consecutive unauthorized responses
- Extract decision logic to pure helper, add 8 unit tests covering all critical paths
* fix(workspace): address CR feedback on gateway offline recovery (#3493)
- gateway-offline-banner-helpers: decrement (not reset) auth-failure
streak on transient outcomes so a flapping gateway (401 alternating
with 5xx) still converges on session-expired
- gateway-offline-banner: reuse probe response body to apply user
directly via new AuthProvider.applyUser, halving the recovery burst
against an already-struggling gateway
- gateway-offline-banner: extract classifyProbe into helpers for unit
testability; log probe failures via console.warn instead of swallowing
- gateway-offline-fallback: new shared component used by both workspace
and (auth) layouts so auth pages recover the same way the workspace
does, fixing the lockup where bare static HTML had no AuthProvider
- AuthProvider.logout: fall back to hard navigation when the gateway
logout fetch fails, matching legacy form-POST behaviour and avoiding
stale client state during outage
- tests: extend gateway-offline-banner-helpers.test with flapping
convergence and classifyProbe branch coverage (19 cases total)
* Fix stale AIO sandbox cache reuse
* Address AIO sandbox review feedback
* Distinguish sandbox health check failures
* Keep local discovery recoverable when the runtime check fails
LocalContainerBackend.discover() shares _is_container_running, which now
raises on transient daemon errors instead of returning False. Discovery has
no exception handling in _discover_or_create_with_lock(_async), so a brief
Docker hiccup turned a recoverable "could not verify, create instead" into a
hard acquire failure. Catch the check failure inside discover() and return
None so an unverifiable container is simply not adopted, restoring the
pre-change fall-through while keeping raise-on-unknown semantics protecting
the destroy path.
Reported by fancy-agent on PR #3494.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* Narrow the not-found match in container inspect error handling
A bare "not found" substring also matches transient failures like "command
not found" or "context not found", which would misclassify a check error as
"container definitely gone" and bypass the raise-on-unknown contract. Keep
Docker's specific "No such object"/"No such container" phrases, and only
trust a generic "not found" (Apple Container) when the message names the
inspected container or refers to a container/object.
Reported by WillemJiang on PR #3494.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---------
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
* fix(sandbox): persist lazily-acquired sandbox state via Command
ensure_sandbox_initialized mutates runtime.state in place, which is local
to the current tool invocation and is not picked up by LangGraph's channel
reducer. Subsequent graph steps and downstream consumers (such as
ToolOutputBudgetMiddleware and the sub-agent task_tool) therefore cannot
observe the sandbox id from state.
Wrap tool calls in SandboxMiddleware (wrap_tool_call / awrap_tool_call) to
detect fresh lazy initialization by diffing runtime.state before and after
the handler, and emit a proper state update via Command(update=...):
- ToolMessage results are wrapped into Command(update={sandbox, messages})
- Command results with a dict update are merged on the sandbox key while
preserving messages / goto / graph / resume
- Command results with non-dict updates are left untouched to avoid silent
data loss on unknown update shapes
Tests:
- 7 new unit tests cover lazy-init emit, passthrough, dict-update merge,
non-dict-update passthrough (sync and async)
- Refresh replay golden write_read_file.ultra.events.json: SSE 'values'
events now correctly carry the 'sandbox' key in their keys list, which
is the direct evidence that the fix is effective
Closes#3463
* refactor(sandbox): use dataclasses.replace to preserve Command fields
Address Copilot review on #3464: replace manual field-copy with
dataclasses.replace so any current or future Command fields are
preserved automatically when merging sandbox_update.
Also add a regression test that constructs a Command with non-None
graph/goto/resume to lock this behavior in.
* fix(frontend): paginate workspace chat list beyond 50 threads (#3482)
The sidebar 'Recent chats' and /workspace/chats list were hard-capped
at the first 50 threads returned by threads.search. Replace the
single-shot useThreads() consumers with useInfiniteThreads() and add
an IntersectionObserver sentinel to each list so further pages are
fetched on demand.
In search mode on the chats page, the sentinel is replaced by an
explicit 'Load more' button to prevent the observer from draining the
entire backend list while the filtered view stays empty.
- Add useInfiniteThreads + page-size constant and pure cache helpers
(map/filterInfiniteThreadsCache, getInfiniteThreadsNextPageParam)
- Mirror rename / delete / stream-finish updates into the new
infinite cache so optimistic UI stays consistent
- Extend the e2e mock to honour limit/offset slicing
- Unit tests for the cache helpers and pagination boundary
- Playwright e2e covering chats page + sidebar load-more, and the
search-mode guard against runaway auto-pagination
- Add en/zh i18n entries for the search-mode load-more button
Fixes#3482
* docs(frontend): clarify infinite-threads offset semantics and test post-delete invariant
- Add docstring to getInfiniteThreadsNextPageParam explaining that TanStack
Query freezes the returned offset into pageParams once, so optimistic cache
mutations that shrink page lengths (filterInfiniteThreadsCache on delete)
cannot retroactively move the offset backwards. Delete/rename paths
reconcile against the backend via invalidateQueries in onSettled.
- Add unit test covering the post-delete invariant.
- Fix misleading comment in thread-list-infinite-scroll.spec.ts: the
thread-search mock does not sort by updated_at; it returns the array in
the order provided.
Addresses Copilot CR comments on #3485.
* fix(frontend): mirror onCreated upsert into infinite cache; add sidebar Load-older button
Address review feedback on #3485:
- New upsertThreadInInfiniteCache helper; useThreadStream onCreated now
upserts into both the legacy ['threads','search'] cache and the new
infinite cache, so a freshly created thread appears in the sidebar
immediately during streaming instead of only after the run finishes
and onSettled invalidates the query. Restores parity with main.
- Sidebar Recent Chats now exposes a visible 'Load older chats' button
alongside the IntersectionObserver sentinel, so keyboard-only users
and environments where IO is unavailable can still reach older
conversations.
- Add zh-CN / en-US / types entry for chats.loadOlderChats.
- Cover the new helper with 3 unit tests (no-op on uninitialised cache,
prepend new thread to first page, merge with existing entry without
duplication).
When memory is enabled, the first conversation with a legacy shared agent
creates a per-user agent directory containing only memory.json (no
config.yaml). On the second turn, resolve_agent_dir() returned this
incomplete directory, causing load_agent_config() to fail with
"Agent config not found".
Require config.yaml to exist alongside the directory for both the
per-user and legacy paths, so that memory-only directories fall
through correctly. This aligns resolve_agent_dir with the existing
config.yaml check in list_custom_agents.
Refs: https://github.com/bytedance/deer-flow/issues/3390
* feat(memory): add memory.token_counting config to avoid tiktoken network dependency (#3429)
Add a `memory.token_counting` option (`tiktoken` | `char`) so deployments in
network-restricted environments can opt out of tiktoken entirely. In `char`
mode the memory-injection budget uses a network-free character-based estimate
and never triggers the BPE download from openaipublic.blob.core.windows.net,
which could otherwise block for tens of minutes (see #3402).
Also harden the default `tiktoken` path:
- cache an in-flight LOADING sentinel so concurrent callers fall back
immediately instead of spawning more blocking get_encoding threads when the
first load is still running (e.g. under the 5s startup warm-up timeout);
- cache failures with a timestamp and retry after a cooldown so a transient
network outage self-heals back to accurate counting without a restart;
- skip startup warm-up entirely in char mode.
The new config is surfaced via the memory config API and config.example.yaml
(config_version bumped). Default remains `tiktoken`, so existing deployments
are unaffected.
* fix(memory): use CJK-aware char token estimate and address review feedback
- Replace the flat len(text)//4 fallback with a CJK-aware estimate so
Chinese/Japanese/Korean memory content does not over-fill the injection budget
- Document the internal tiktoken retry cooldown and char-mode escape hatch
- Sync CLAUDE.md / config.example.yaml / MEMORY_IMPROVEMENTS.md wording
- Fix MemoryConfigResponse mocks/assertions and add CJK estimate tests
POST /api/runs/stream and /api/runs/wait accept thread_id in the request
body but performed no owner authorization, letting any authenticated user
start runs on -- and read /wait checkpoint channel_values from -- another
user's thread (cross-user IDOR, #3472).
The @require_permission(owner_check=True) decorator resolves ownership from
the thread_id *path* param, so it cannot cover these body-param endpoints.
Enforce ownership inside start_run() before create_or_reject via
ThreadMetaStore.check_access: missing rows (auto-created temp threads) and
NULL-owner rows stay accessible, while a thread owned by another user
returns 404 (matching thread_runs.py). The internal system role (IM
channels acting for platform users) is exempt.
Closes#3472
The default `make up` started the Gateway with `--workers 4`, but run state
(RunManager and the stream bridge) is held in-process and nginx uses no sticky
sessions. With the default config, same-run requests scatter across workers that
each keep their own run state, breaking run cancellation (409), SSE reconnect
(hangs on heartbeats), multitask de-duplication, and IM channels (duplicate
replies). The shared cross-worker stream bridge does not exist yet.
Default GATEWAY_WORKERS to 1 so the out-of-the-box deployment is correct,
document the single-worker boundary in the README, and add a regression test
pinning the default while keeping it overridable. This is a stop-gap, not a
multi-worker implementation; the full fix (shared run state + stream bridge) is
tracked in #3191.
Refs #3239, #3260