fix(agents): make update_agent honor runtime.context user_id like setup_agent (#2867)

* fix(agents): make update_agent honor runtime.context user_id like setup_agent

PR #2784 hardened setup_agent to prefer runtime.context["user_id"] (set by inject_authenticated_user_context from the auth-validated request) over the contextvar, so an agent created during the bootstrap flow always lands under users/<auth_uid>/agents/<name>.

update_agent was left calling get_effective_user_id() unconditionally. The same class of bug that produced issues #2782 / #2862 still applies whenever the contextvar is not available on the executing task (background work, future cross-process drivers, checkpoint resume on a different task). In that regime update_agent silently routes writes to users/default/agents/<name>, corrupting the shared default bucket and losing the user's edit.

Extract the resolution policy into a shared resolve_runtime_user_id helper on deerflow.runtime.user_context and route both setup_agent and update_agent through it so the two halves of the lifecycle stay in lockstep.

Add load-bearing end-to-end tests that drive a real langchain.agents create_agent graph with a fake LLM, exercising the full pipeline:

  HTTP wire format
    -> app.gateway.services.start_run config-assembly
    -> deerflow.runtime.runs.worker._build_runtime_context
    -> langchain.agents create_agent graph
    -> ToolNode dispatch (sync + async + sub-graph + ContextThreadPoolExecutor)
    -> setup_agent / update_agent

The negative-control tests intentionally land in users/default/ to prove the positive tests are actually load-bearing rather than vacuously passing. The new test_update_agent_e2e_user_isolation suite includes a test that failed against main and passes with this fix.

* style: ruff format on new e2e tests

* test(e2e): real-server HTTP test driving setup_agent through the full ASGI stack

Adds tests/test_setup_agent_http_e2e_real_server.py, a single load-bearing test that drives the entire FastAPI gateway through starlette.testclient.TestClient with no mocks above the LLM:

- lifespan boots (config, sqlite engine, LangGraph runtime, channels)
- POST /api/v1/auth/register (real password hash, real sqlite write, issues access_token + csrf_token cookies)
- POST /api/threads (real thread_meta + checkpoint creation)
- POST /api/threads/{id}/runs/stream with the exact wire shape the React frontend sends (assistant_id + input + config + context with agent_name/is_bootstrap)
- AuthMiddleware -> CSRFMiddleware -> require_permission -> start_run -> inject_authenticated_user_context -> asyncio.create_task(run_agent) -> worker._build_runtime_context -> Runtime injection -> ToolNode dispatch -> real setup_agent
- Asserts SOUL.md is under users/<authenticated_uid>/agents/<name>/ and NOT under users/default/agents/<name>/.

DEER_FLOW_HOME and the sqlite path are redirected into tmp_path so the test never touches the real .deer-flow directory or developer database. The only patch above the LLM boundary is replacing create_chat_model with a fake that emits a single setup_agent tool_call.

This is the "real verification" answer: it reproduces what curl-against-uvicorn would do, minus the network socket layer.

* test: address Copilot review on user-isolation e2e tests

- Drop "currently expected to FAIL" wording from update_agent e2e docstring and header (Copilot review): the fix is in this PR, the test pins the corrected behaviour rather than driving a future change.
- Rephrase the assertion failure messages from "BUG:" to "REGRESSION:" to match the test's role on the fixed branch.
- Bound _drain_stream with a wall-clock timeout, a max-bytes cap, and an early break on the "event: end" SSE frame (Copilot review). Stops the test from hanging on a stuck run or runaway heartbeat loop.
- Replace the misleading "patch both module aliases" comment with an explanation of why patching lead_agent.agent.create_chat_model is the only correct target (Copilot review): lead_agent rebinds the symbol into its own namespace at import time, so patching deerflow.models is too late.

* test(refactor): address WillemJiang review on user-isolation e2e tests

- Extract the duplicated FakeToolCallingModel (and a build_single_tool_call_model helper) into tests/_agent_e2e_helpers.py. All three e2e files now import from the shared module instead of redefining the shim locally.
- Convert the manual p.start() / p.stop() try/finally blocks in test_update_agent_e2e_user_isolation.py to contextlib.ExitStack so patch lifecycle is Pythonic and exception-safe.
- Lift the isolated_app fixture's private-attribute resets into a named _reset_process_singletons helper with a comment block explaining why each singleton has to be invalidated for true e2e isolation, and why raising=False is intentional. Makes the fragility visible and the intent self-documenting rather than leaving the resets inline as opaque monkeypatch calls.

Net change: -59 lines (143 -> 84) across the three test files, with every assertion intact. Full suite remains 69 passed / lint clean.

* test(e2e): make real-server test self-supply its config

CI's actions/checkout only ships config.example.yaml (the real config.yaml is gitignored), so the production config-discovery search (./config.yaml -> ../config.yaml -> $DEER_FLOW_CONFIG_PATH) finds nothing and the test fails at lifespan boot with FileNotFoundError. The dev-machine run passed only because a local config.yaml happened to exist.

Write a minimal AppConfig-valid yaml into tmp_path and pin DEER_FLOW_CONFIG_PATH to it. The yaml carries just what the schema requires (a single fake-test-model entry, LocalSandboxProvider, sqlite database). The LLM never gets instantiated because the test patches create_chat_model on the lead agent module, so the api_key/base_url stay placeholders.

Verified by hiding the local config.yaml to mirror the CI checkout; the test now passes in both environments.
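
The three bounds on _drain_stream from the Copilot review item can be sketched roughly as follows. This is an illustration only, not the test's actual helper: the name drain_stream_bounded, its parameters, and the default limits are hypothetical, and the stream is modelled as a plain iterable of byte chunks rather than a real HTTP response.

```python
import time
from typing import Iterable


def drain_stream_bounded(
    chunks: Iterable[bytes],
    *,
    timeout_s: float = 30.0,       # hypothetical default wall-clock budget
    max_bytes: int = 1_000_000,    # hypothetical default size cap
) -> bytes:
    """Collect SSE bytes until an end frame, a size cap, or a deadline is hit."""
    deadline = time.monotonic() + timeout_s
    buf = bytearray()
    for chunk in chunks:
        buf.extend(chunk)
        # Early break: the run finished and announced it with an end frame.
        if b"event: end" in buf:
            break
        # Size cap: guards against a runaway heartbeat loop.
        if len(buf) >= max_bytes:
            break
        # Wall-clock cap: checked between chunks only, so it bounds a slow
        # stream but cannot interrupt an iterator that blocks forever.
        if time.monotonic() >= deadline:
            break
    return bytes(buf)
```

Because the deadline is only checked when a chunk arrives, a sketch like this still relies on the underlying client to keep yielding (for example through heartbeats) or to enforce its own read timeout.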
2026-05-21 23:46:50 +00:00 · 2026-05-12 23:18:54 +08:00
parent 506be8bffd
commit 68d8caec1f
7 changed files with 1114 additions and 13 deletions
@@ -109,6 +109,34 @@ def get_effective_user_id() -> str:
    return str(user.id)


+def resolve_runtime_user_id(runtime: object | None) -> str:
+    """Single source of truth for a tool/middleware's effective user_id.
+
+    Resolution order (most authoritative first):
+      1. ``runtime.context["user_id"]`` — set by ``inject_authenticated_user_context``
+         in the gateway from the auth-validated ``request.state.user``. This is
+         the only source that survives boundaries where the contextvar may have
+         been lost (background tasks scheduled outside the request task,
+         worker pools that don't copy_context, future cross-process drivers).
+      2. The ``_current_user`` ContextVar — set by the auth middleware at
+         request entry. Reliable for in-task work; copied by ``asyncio``
+         child tasks and by ``ContextThreadPoolExecutor``.
+      3. ``DEFAULT_USER_ID`` — last-resort fallback so unauthenticated
+         CLI / migration / test paths keep working without raising.
+
+    Tools that persist user-scoped state (custom agents, memory, uploads)
+    MUST call this instead of ``get_effective_user_id()`` directly so they
+    benefit from the runtime.context channel that ``setup_agent`` already
+    relies on.
+    """
+    context = getattr(runtime, "context", None)
+    if isinstance(context, dict):
+        ctx_user_id = context.get("user_id")
+        if ctx_user_id:
+            return str(ctx_user_id)
+    return get_effective_user_id()
+
+
 # ---------------------------------------------------------------------------
 # Sentinel-based user_id resolution
 # ---------------------------------------------------------------------------
@@ -7,19 +7,12 @@ from langgraph.types import Command

 from deerflow.config.agents_config import validate_agent_name
 from deerflow.config.paths import get_paths
-from deerflow.runtime.user_context import get_effective_user_id
+from deerflow.runtime.user_context import resolve_runtime_user_id
 from deerflow.tools.types import Runtime

 logger = logging.getLogger(__name__)


-def _get_runtime_user_id(runtime: Runtime) -> str:
-    context_user_id = runtime.context.get("user_id") if runtime.context else None
-    if context_user_id:
-        return str(context_user_id)
-    return get_effective_user_id()
-
-
 @tool(parse_docstring=True)
 def setup_agent(
    soul: str,
@@ -45,7 +38,7 @@ def setup_agent(
        if agent_name:
            # Custom agents are persisted under the current user's bucket so
            # different users do not see each other's agents.
-            user_id = _get_runtime_user_id(runtime)
+            user_id = resolve_runtime_user_id(runtime)
            agent_dir = paths.user_agent_dir(user_id, agent_name)
        else:
            # Default agent (no agent_name): SOUL.md lives at the global base dir.
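
The isolation property this hunk protects (SOUL.md under the authenticated user's bucket and not under users/default) can be illustrated with a small filesystem sketch. The functions user_agent_dir and write_soul below are hypothetical stand-ins, not the project's paths API; only the users/<uid>/agents/<name> layout is taken from the commit message.

```python
import tempfile
from pathlib import Path

DEFAULT_USER_ID = "default"  # assumption: name of the shared fallback bucket


def user_agent_dir(base_dir: Path, user_id: str, agent_name: str) -> Path:
    """Hypothetical stand-in for ``paths.user_agent_dir``."""
    return base_dir / "users" / user_id / "agents" / agent_name


def write_soul(base_dir: Path, user_id: str, agent_name: str, soul: str) -> Path:
    """Persist SOUL.md under the given user's bucket and return its path."""
    agent_dir = user_agent_dir(base_dir, user_id, agent_name)
    agent_dir.mkdir(parents=True, exist_ok=True)
    soul_path = agent_dir / "SOUL.md"
    soul_path.write_text(soul, encoding="utf-8")
    return soul_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        written = write_soul(base, "alice-uid", "researcher", "# Soul\n")
        leaked = user_agent_dir(base, DEFAULT_USER_ID, "researcher") / "SOUL.md"
        # The same shape of assertion the e2e tests make: present for the
        # authenticated user, absent from the shared default bucket.
        assert written.exists()
        assert not leaked.exists()
```

If the resolved user_id had silently fallen back to the default, the two assertions would flip, which is the regression the negative-control tests are described as guarding against.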
@@ -27,7 +27,7 @@ from langgraph.types import Command
 from deerflow.config.agents_config import load_agent_config, validate_agent_name
 from deerflow.config.app_config import get_app_config
 from deerflow.config.paths import get_paths
-from deerflow.runtime.user_context import get_effective_user_id
+from deerflow.runtime.user_context import resolve_runtime_user_id
 from deerflow.tools.types import Runtime

 logger = logging.getLogger(__name__)
@@ -118,9 +118,13 @@ def update_agent(
        return _err("update_agent is only available inside a custom agent's chat. There is no agent_name in the current runtime context, so there is nothing to update. If you are inside the bootstrap flow, use setup_agent instead.")

    # Resolve the active user so that updates only affect this user's agent.
-    # ``get_effective_user_id`` returns DEFAULT_USER_ID when no auth context
-    # is set (matching how memory and thread storage behave).
-    user_id = get_effective_user_id()
+    # ``resolve_runtime_user_id`` prefers ``runtime.context["user_id"]`` (set by
+    # the gateway from the auth-validated request) and falls back to the
+    # contextvar, then DEFAULT_USER_ID. This matches setup_agent so a user
+    # creating an agent and later refining it always touches the same files,
+    # even if the contextvar gets lost across an async/thread boundary
+    # (issue #2782 / #2862 class of bugs).
+    user_id = resolve_runtime_user_id(runtime)

    # Reject an unknown ``model`` *before* touching the filesystem. Otherwise
    # ``_resolve_model_name`` silently falls back to the default at runtime