# Design: Eliminate Global Mutable State in Configuration System

> Implements [#1811](https://github.com/bytedance/deer-flow/issues/1811) · Tracked in [#2151](https://github.com/bytedance/deer-flow/issues/2151)
>
> **Phase 1 (shipped):** [PR #2271](https://github.com/bytedance/deer-flow/pull/2271) — frozen config tree, purify `from_file()`, 3-tier `AppConfig.current()` lifecycle, `DeerFlowContext` for agent execution path.
>
> **Phase 2 (proposed):** eliminate the remaining implicit-state surface (`_global` / `_override` / `current()`) via pure explicit parameter passing. See §8.

## Problem

`deerflow/config/` had three structural issues:

1. **Dual source of truth** — each sub-config existed both as an `AppConfig` field and a module-level global (e.g. `_memory_config`). Consumers didn't know which to trust.
2. **Side-effect coupling** — `AppConfig.from_file()` silently mutated 8 sub-module globals via `load_*_from_dict()` calls.
3. **Incomplete isolation** — `ContextVar` only scoped `AppConfig`, not the 8 sub-config globals.

## Design Principle

**Config is a value object, not live shared state.** Constructed once, immutable, no reload. New config = new object + rebuild agent.

## Solution

### 1. Frozen AppConfig (full tree)

All config models set `frozen=True`, including `DatabaseConfig` and `RunEventsConfig` (added late in review). No mutation after construction.

```python
class MemoryConfig(BaseModel):
    model_config = ConfigDict(frozen=True)

class AppConfig(BaseModel):
    model_config = ConfigDict(extra="allow", frozen=True)
    memory: MemoryConfig
    title: TitleConfig
    ...
```

Changes use copy-on-write: `config.model_copy(update={...})`.

### 2. Pure `from_file()`

`AppConfig.from_file()` is a pure function — returns a frozen object, no side effects. All 8 `load_*_from_dict()` calls and their imports were removed.

### 3. Deleted sub-module globals

Every sub-config module's global state was deleted:

| Deleted | Files |
|---------|-------|
| `_memory_config`, `get_memory_config()`, `set_memory_config()`, `load_memory_config_from_dict()` | `memory_config.py` |
| `_title_config`, `get_title_config()`, `set_title_config()`, `load_title_config_from_dict()` | `title_config.py` |
| Same pattern | `summarization_config.py`, `subagents_config.py`, `guardrails_config.py`, `tool_search_config.py`, `checkpointer_config.py`, `stream_bridge_config.py`, `acp_config.py` |
| `_extensions_config`, `reload_extensions_config()`, `reset_extensions_config()`, `set_extensions_config()` | `extensions_config.py` |
| `reload_app_config()`, `reset_app_config()`, `set_app_config()`, mtime detection, `push/pop_current_app_config()` | `app_config.py` |

Consumers migrated from `get_memory_config()` → `AppConfig.current().memory` (~100 call-sites).

### 4. Lifecycle: 3-tier `AppConfig.current()`

The original plan called for a single `ContextVar` with hard-fail on uninitialized access. The shipped lifecycle is a **3-tier fallback** attached to `AppConfig` itself (no separate `context.py` module). The divergence is explained in §7.

```python
# app_config.py
class AppConfig(BaseModel):
    ...

    # Process-global singleton. Atomic pointer swap under the GIL,
    # so no lock is needed for current read/write patterns.
    _global: ClassVar[AppConfig | None] = None

    # Per-context override (tests, multi-client scenarios).
    _override: ClassVar[ContextVar[AppConfig]] = ContextVar("deerflow_app_config_override")

    @classmethod
    def init(cls, config: AppConfig) -> None:
        """Set the process-global. Visible to all subsequent async tasks."""
        cls._global = config

    @classmethod
    def set_override(cls, config: AppConfig) -> Token[AppConfig]:
        """Per-context override. Returns Token for reset_override()."""
        return cls._override.set(config)

    @classmethod
    def reset_override(cls, token: Token[AppConfig]) -> None:
        cls._override.reset(token)

    @classmethod
    def current(cls) -> AppConfig:
        """Priority: per-context override > process-global > auto-load from file."""
        try:
            return cls._override.get()
        except LookupError:
            pass
        if cls._global is not None:
            return cls._global
        logger.warning(
            "AppConfig.current() called before init(); auto-loading from file. "
            "Call AppConfig.init() at process startup to surface config errors early."
        )
        config = cls.from_file()
        cls._global = config
        return config
```

**Why three tiers and not one:**

- **Process-global** is required because `ContextVar` doesn't propagate config updates across async request boundaries. Gateway receives a `PUT /mcp/config` on one request, reloads config, and the next request — in a fresh async context — must see the new value. A plain class variable (`_global`) does this; a `ContextVar` does not.
- **Per-context override** is retained for test isolation and multi-client scenarios. A test can scope its config without mutating the process singleton. `reset_override()` restores the previous state deterministically via `Token`.
- **Auto-load fallback** is a backward-compatibility escape hatch with a warning. Call sites that skipped explicit `init()` (legacy or test) still work, but the warning surfaces the miss.

### 5. Per-invocation context: `DeerFlowContext`

Lives in `deerflow/config/deer_flow_context.py` (not `context.py` as originally planned — the name was reserved to avoid implying a lifecycle module).

```python
@dataclass(frozen=True)
class DeerFlowContext:
    """Typed, immutable, per-invocation context injected via LangGraph Runtime."""
    app_config: AppConfig
    thread_id: str
    agent_name: str | None = None
```

**Fields:**

| Field | Type | Source | Mutability |
|-------|------|--------|-----------|
| `app_config` | `AppConfig` | `AppConfig.current()` at run start | Immutable per-run |
| `thread_id` | `str` | Caller-provided | Immutable per-run |
| `agent_name` | `str \| None` | Caller-provided (bootstrap only) | Immutable per-run |

**Not in context:** `sandbox_id` is mutable runtime state (lazy-acquired mid-execution). It flows through `ThreadState.sandbox` (state channel), not context. All 3 `runtime.context["sandbox_id"] = ...` writes in `sandbox/tools.py` were removed; `SandboxMiddleware.after_agent` reads from `state["sandbox"]` only.

**Construction per entry point:**

```python
# Gateway runtime (worker.py) — primary path
deer_flow_context = DeerFlowContext(
    app_config=AppConfig.current(),
    thread_id=thread_id,
)
agent.astream(input, config=config, context=deer_flow_context)

# DeerFlowClient (client.py)
AppConfig.init(AppConfig.from_file(config_path))
context = DeerFlowContext(app_config=AppConfig.current(), thread_id=thread_id)
agent.stream(input, config=config, context=context)

# LangGraph Server — legacy path, context=None or dict, fallback via resolve_context()
```

### 6. Access pattern by caller type

The shipped code stratifies callers by what `runtime.context` type they see, and tightened middleware access over time:

| Caller type | Access pattern | Examples |
|-------------|---------------|----------|
| Typed middleware (declares `Runtime[DeerFlowContext]`) | `runtime.context.app_config.xxx` — direct field access, no wrapper | `memory_middleware`, `title_middleware`, `thread_data_middleware`, `uploads_middleware`, `loop_detection_middleware` |
| Tools that may see legacy dict context | `resolve_context(runtime).xxx` | `sandbox/tools.py` (bash-guard gate, sandbox config), `task_tool.py` (bash subagent gate) |
| Tools with typed runtime | `runtime.context.xxx` directly | `present_file_tool.py`, `setup_agent_tool.py`, `skill_manage_tool.py` |
| Non-agent paths (Gateway routers, CLI, factories) | `AppConfig.current().xxx` | `app/gateway/routers/*`, `reset_admin.py`, `models/factory.py` |

**Middleware hardening** (late commit `a934a822`): the original plan had middlewares call `resolve_context(runtime)` everywhere. In practice, once the middleware signature was typed as `Runtime[DeerFlowContext]`, the wrapper became defensive noise. The commit removed:
- `try/except` wrappers around `resolve_context(...)` in middlewares and sandbox tools
- Optional `title_config=None` fallback on every `_build_title_prompt` / `_format_for_title_model` helper; they now take `TitleConfig` as a **required parameter**
- Ad-hoc `get_config()` fallback chains in `memory_middleware`

Dropping the swallowed-exception layer means config-resolution bugs surface as errors instead of silently degrading — aligning with let-it-crash.

`resolve_context()` itself still exists and handles three cases:

```python
def resolve_context(runtime: Any) -> DeerFlowContext:
    ctx = getattr(runtime, "context", None)
    if isinstance(ctx, DeerFlowContext):
        return ctx                        # typed path (Gateway, Client)
    if isinstance(ctx, dict):
        return DeerFlowContext(           # legacy dict path (with warning if empty thread_id)
            app_config=AppConfig.current(),
            thread_id=ctx.get("thread_id", ""),
            agent_name=ctx.get("agent_name"),
        )
    # Final fallback: LangGraph configurable (e.g. LangGraph Server)
    cfg = get_config().get("configurable", {})
    return DeerFlowContext(
        app_config=AppConfig.current(),
        thread_id=cfg.get("thread_id", ""),
        agent_name=cfg.get("agent_name"),
    )
```

### 7. Divergence from original plan

Two material divergences from the original design, both driven by implementation feedback:

**7.1 Lifecycle: `ContextVar` → process-global + `ContextVar` override**

*Original:* single `ContextVar` in a new `context.py` module. `get_app_config()` raises `ConfigNotInitializedError` if unset.

*Shipped:* process-global `AppConfig._global` (primary) + `ContextVar` override (scoped) + auto-load with warning (fallback).

*Why:* a `ContextVar` set by Gateway startup is not visible to subsequent requests that spawn fresh async contexts. `PUT /mcp/config` must update config such that the next incoming request sees the new value in *its* async task — this requires process-wide state. ContextVar is retained for test isolation (`reset_override()` works cleanly per test via `Token`) and for per-client scoping if ever needed.

The `ConfigNotInitializedError` was replaced by a warning + auto-load. The hard error caught more legitimate bugs but also broke call sites that historically worked without explicit init (internal scripts, test fixtures during import-time). The warning preserves the signal without breaking backward compatibility; `backend/tests/conftest.py` now has an autouse fixture that sets `_global` to a minimal `AppConfig` so tests never hit auto-load.

**7.2 Module name: `context.py` → lifecycle on `AppConfig`, `deer_flow_context.py` for the invocation context**

*Original:* lifecycle and `DeerFlowContext` both in `deerflow/config/context.py`.

*Shipped:* lifecycle is classmethods on `AppConfig` itself (`init`, `current`, `set_override`, `reset_override`). `DeerFlowContext` and `resolve_context()` live in `deerflow/config/deer_flow_context.py`.

*Why:* the lifecycle operates on `AppConfig` directly — putting it on the class removes one level of module coupling. The per-invocation context is conceptually separate (it's agent-execution plumbing, not config lifecycle) so it got its own file with a distinguishing name.

**7.3 Client lifecycle: `init() + set_override()` → `init()` only**

*Original (never finalized):* `DeerFlowClient.__init__` called both `init()` (process-global) and `set_override()` so two clients with different configs wouldn't clobber each other.

*Shipped:* `init()` only.

*Why (commit `a934a822`):* `set_override()` leaked overrides across test boundaries because the `ContextVar` wasn't reset between client instances. Single-client is the common case, and tests use the autouse fixture for isolation. Multi-client scoping can be added back with explicit `set_override()` if the need arises.

## What doesn't change

- `config.yaml` schema
- `extensions_config.json` loading
- External API behavior (Gateway, DeerFlowClient)

## Migration scope (Phase 1, actual)

- ~100 call-sites: `get_*_config()` → `AppConfig.current().xxx`
- 6 runtime-path migrations: middlewares + sandbox tools read from `runtime.context` or `resolve_context()`
- 3 deleted sandbox_id writes in `sandbox/tools.py`
- ~100 test locations updated; `conftest.py` autouse fixture added
- New tests: `test_config_frozen.py`, `test_deer_flow_context.py`, `test_app_config_reload.py`
- Gateway update flow: `reload_*` → `AppConfig.init(AppConfig.from_file())`
- Dependency: langgraph `Runtime` / `ToolRuntime` (already available at target version)

## 8. Phase 2: pure explicit parameter passing

Phase 1 shipped a working 3-tier `AppConfig.current()` lifecycle. The remaining implicit-state surface is:

- `AppConfig._global: ClassVar` — process-level singleton
- `AppConfig._override: ClassVar[ContextVar]` — per-context override
- `AppConfig.current()` — fallback-chain reader with auto-load warning

Phase 2 proposes removing all three. `AppConfig` reduces to a pure Pydantic value object with `from_file()` as its only factory. All consumers receive `AppConfig` as an explicit parameter, either through a typed constructor, a function signature, or LangGraph `Runtime[DeerFlowContext]`.

### 8.1 Motivation

Phase 1 addressed the **data side** of the problem: config is now a frozen ADT, sub-module globals deleted, `from_file()` pure. The **access side** still relies on implicit ambient lookup:

```python
# Today (Phase 1 shipped):
def _get_memory_prompt() -> str:
    config = AppConfig.current().memory  # implicit global lookup
    ...

# Target (Phase 2):
def _get_memory_prompt(config: MemoryConfig) -> str:  # explicit dependency
    ...
```

Three concrete benefits:

| Benefit | What it buys |
|---------|-------------|
| Referential transparency | A function's result depends only on its inputs. Testing becomes parameter substitution, no `patch.object(AppConfig, "current")` chains |
| Dependency visibility | A function signature declares what config it needs. No "this deep helper secretly reads `.memory`" surprises |
| True multi-config isolation | Two `DeerFlowClient` instances with different configs can run in the same process without any ambient shared state to contend over |

The cost (Phase 1 wouldn't have made this smaller): ~97 production call sites + ~91 test mock sites need touching, plus signature changes for helpers that now accept `config` as a parameter.

### 8.2 Non-agent call paths and their target APIs

Phase 1 got the agent-execution path right (`runtime.context.app_config.xxx`). The unsolved paths split into four categories:

**FastAPI Gateway** → `Depends(get_config)`

```python
# app/gateway/app.py — at startup
app.state.config = AppConfig.from_file()

# app/gateway/deps.py
def get_config(request: Request) -> AppConfig:
    return request.app.state.config

# app/gateway/routers/models.py
@router.get("/models")
def list_models(config: AppConfig = Depends(get_config)):
    ...

# app/gateway/routers/mcp.py — config reload replaces AppConfig.init()
@router.put("/config")
def update_mcp(..., request: Request):
    ...
    request.app.state.config = AppConfig.from_file()
```

`app.state.config` is a FastAPI-owned attribute on the app object, not a module-level global. Scoped to the app's lifetime, only written at startup and config-reload.

**`DeerFlowClient`** → constructor-captured config

```python
class DeerFlowClient:
    def __init__(self, config_path: str | None = None, config: AppConfig | None = None):
        self._config = config or AppConfig.from_file(config_path)

    def chat(self, message: str, thread_id: str) -> str:
        context = DeerFlowContext(app_config=self._config, thread_id=thread_id)
        ...
```

Multiple `DeerFlowClient` instances are now first-class — each owns its config, nothing shared.

**Agent construction (`make_lead_agent`, `_build_middlewares`, prompt helpers)** → threaded through

```python
def make_lead_agent(config: RunnableConfig, app_config: AppConfig):
    middlewares = _build_middlewares(app_config, runtime_config=config)
    ...

def _build_middlewares(app_config: AppConfig, runtime_config: RunnableConfig):
    if app_config.token_usage.enabled:
        middlewares.append(TokenUsageMiddleware())
    ...
```

Every helper that reads config is now on a function-signature chain from `make_lead_agent`.

**Background threads (memory debounce Timer, queue consumers)** → closure-captured

```python
def MemoryQueue.add(self, conversation, user_id, config: MemoryConfig):
    # capture config at enqueue time
    def _flush():
        self._updater.update(conversation, user_id, config)
    self._timer = Timer(config.debounce_seconds, _flush)
    self._timer.start()
```

The captured config lives in the closure, not in a contextvar the thread can't see.

### 8.3 Target `AppConfig` shape

```python
class AppConfig(BaseModel):
    model_config = ConfigDict(extra="allow", frozen=True)

    log_level: str = "info"
    memory: MemoryConfig = Field(default_factory=MemoryConfig)
    ...  # same fields as Phase 1

    @classmethod
    def from_file(cls, config_path: str | None = None) -> Self:
        """Pure factory. Reads file, returns frozen object. No side effects."""
        ...

    @classmethod
    def resolve_config_path(cls, config_path: str | None = None) -> Path:
        """Unchanged from Phase 1."""
        ...

    def get_model_config(self, name: str) -> ModelConfig | None:
        """Unchanged."""
        ...

    # Removed:
    # - _global: ClassVar
    # - _override: ClassVar[ContextVar]
    # - init(), set_override(), reset_override(), current()
```

### 8.4 `DeerFlowContext` and `resolve_context()` after Phase 2

`DeerFlowContext` is unchanged — it's already Phase 2-compliant.

`resolve_context()` simplifies: the "fall back to `AppConfig.current()`" branch goes away. The dict-context legacy path either constructs `DeerFlowContext` with an explicitly-passed `AppConfig` (fed by caller) or is deleted if no dict-context callers remain.

```python
def resolve_context(runtime: Any) -> DeerFlowContext:
    ctx = getattr(runtime, "context", None)
    if isinstance(ctx, DeerFlowContext):
        return ctx
    raise RuntimeError(
        "runtime.context is not a DeerFlowContext. All callers must construct "
        "and inject one explicitly; there is no global fallback."
    )
```

Let-it-crash: if Phase 2 is done correctly, every caller constructs a typed context. If one doesn't, fail loudly.

### 8.5 Trade-off acknowledgment

The three cases where ambient lookup is genuinely tempting (and why we reject them):

| Tempting case | Why ambient looks easier | Why we still reject it |
|---------------|-------------------------|------------------------|
| Deep helper in `memory/storage.py` needs `memory.storage_path` | Just threaded through 4 call layers | That's exactly the dependency chain you want visible. It's either there or it's hiding |
| Community tool factory reading API keys from config | "Each tool factory doesn't want to take config" | Each tool factory literally needs the config. Passing it is the honest signature |
| Test that wants to "override just one field globally" | `patch.object(AppConfig, "current")` is one line | Tests constructing their own `AppConfig` is one fixture — and that fixture becomes infrastructure for all future tests |

The rejection is consistent: **an explicit parameter is strictly more honest than an implicit global lookup**, in every case.

### 8.6 Scope

- ~97 production call sites: `AppConfig.current()` → parameter
- ~91 test mock sites: `patch.object(AppConfig, "current")` / `AppConfig._global = ...` → fixture injection
- ~30 FastAPI endpoints gain `config: AppConfig = Depends(get_config)`
- ~15 factory / helper functions gain `config: AppConfig` parameter
- Delete from `app_config.py`: `_global`, `_override`, `init`, `current`, `set_override`, `reset_override`
- Simplify `resolve_context()`: remove `AppConfig.current()` fallback

Implementation plan: see [2026-04-12-config-refactor-plan.md §Phase 2](./2026-04-12-config-refactor-plan.md#phase-2-pure-explicit-parameter-passing).