Files
deer-flow/backend/packages/harness/deerflow/runtime/README.md
T
rayhpeng 9d0a42c1fb refactor(runtime): restructure runs module with new execution architecture
Major refactoring of deerflow/runtime/:
- runs/callbacks/ - new callback system (builder, events, title, tokens)
- runs/internal/ - execution internals (executor, supervisor, stream_logic, registry)
- runs/internal/execution/ - execution artifacts and events handling
- runs/facade.py - high-level run facade
- runs/observer.py - run observation protocol
- runs/types.py - type definitions
- runs/store/ - simplified store interfaces (create, delete, query, event)

Refactor stream_bridge/:
- Replace old providers with contract.py and exceptions.py
- Remove async_provider.py, base.py, memory.py

Add documentation:
- README.md and README_zh.md for runtime module

Remove deprecated:
- manager.py moved to internal/
- worker.py, schemas.py
- user_context.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-04-22 11:28:01 +08:00

595 lines
22 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# deerflow.runtime Design Overview
This document describes the current implementation of `backend/packages/harness/deerflow/runtime`, including its overall design, boundary model, the collaboration between `runs` and `stream_bridge`, how it interacts with external infrastructure and the `app` layer, and how `actor_context` is dynamically injected to provide user isolation.
## 1. Overall Role
`deerflow.runtime` is the runtime kernel layer of DeerFlow.
It sits below agents / tools / middlewares and above app / gateway / infra. Its purpose is to define runtime semantics and boundary contracts, without directly owning web endpoints, ORM models, or concrete infrastructure implementations.
Its public surface is re-exported from [`__init__.py`](/Users/rayhpeng/workspace/open-source/deer-flow/backend/packages/harness/deerflow/runtime/__init__.py) and currently exposes four main capability areas:
1. `runs`
- Run domain types, execution facade, lifecycle observers, and store protocols
2. `stream_bridge`
- Stream event bridge contract and public stream types
3. `actor_context`
- Request/task-scoped actor context and user-isolation bridge
4. `serialization`
- Runtime serialization helpers for LangChain / LangGraph data and outward-facing events
Structurally, the current package looks like:
```text
runtime
├─ runs
│ ├─ facade / types / observer / store
│ ├─ internal/*
│ └─ callbacks/*
├─ stream_bridge
│ ├─ contract
│ └─ exceptions
├─ actor_context
└─ serialization / converters
```
## 2. Overall Design and Constraint Model
### 2.1 Design Goal
The core goal of `runtime` is to decouple runtime control-plane semantics from infrastructure implementations.
It only cares about:
1. What a run is and how run state changes over time
2. What lifecycle events and stream events are produced during execution
3. Which capabilities must be injected from the outside, such as checkpointer, event store, stream bridge, and durable stores
4. Who the current actor is, and how lower layers can use that for isolation
It deliberately does not care about:
1. Whether events are stored in memory, Redis, or another transport
2. How run / thread / feedback data is persisted
3. HTTP / SSE / FastAPI details
4. How the auth plugin resolves the request user
### 2.2 Boundary Rules
The current package has a fairly clear boundary model:
1. `runs` owns execution orchestration, not ORM or SQL writes
2. `stream_bridge` defines stream semantics, not app-level bridge construction
3. `actor_context` defines runtime context, not auth-plugin behavior
4. Durable data enters only through boundary protocols:
- `RunCreateStore`
- `RunQueryStore`
- `RunDeleteStore`
- `RunEventStore`
5. Lifecycle side effects enter only through `RunObserver`
6. User isolation is not implemented ad hoc in each module; it is propagated through actor context
In one sentence:
`runtime` defines semantics and contracts; `app.infra` provides implementations.
## 3. runs Subsystem Design
### 3.1 Purpose
`runtime/runs` is the run orchestration domain. It is responsible for:
1. Defining run domain objects and status transitions
2. Organizing create / stream / wait / join / cancel / delete behavior
3. Maintaining the in-process runtime control plane
4. Emitting stream events and lifecycle events during execution
5. Collecting trace, token, title, and message data through callbacks
### 3.2 Core Objects
See [`runs/types.py`](/Users/rayhpeng/workspace/open-source/deer-flow/backend/packages/harness/deerflow/runtime/runs/types.py).
The most important types are:
1. `RunSpec`
- Built by the app-side input layer
- The real execution input
2. `RunRecord`
- The runtime record managed by `RunRegistry`
3. `RunStatus`
- `pending`, `starting`, `running`, `success`, `error`, `interrupted`, `timeout`
4. `RunScope`
- Distinguishes stateful vs stateless execution and temporary thread behavior
### 3.3 Current Constraints
The current implementation explicitly limits some parts of the problem space:
1. `multitask_strategy` currently supports only `reject` and `interrupt` on the main path
2. `enqueue`, `after_seconds`, and batch execution are not on the current primary path
3. `RunRegistry` is an in-process state source, not a durable source of truth
4. External queries may use durable stores, but the live control plane still centers on the in-memory registry
### 3.4 Facade and Internal Components
`RunsFacade` in [`runs/facade.py`](/Users/rayhpeng/workspace/open-source/deer-flow/backend/packages/harness/deerflow/runtime/runs/facade.py) provides the unified API:
1. `create_background`
2. `create_and_stream`
3. `create_and_wait`
4. `join_stream`
5. `join_wait`
6. `cancel`
7. `get_run`
8. `list_runs`
9. `delete_run`
Internally it composes:
1. `RunRegistry`
2. `ExecutionPlanner`
3. `RunSupervisor`
4. `RunStreamService`
5. `RunWaitService`
6. `RunCreateStore` / `RunQueryStore` / `RunDeleteStore`
7. `RunObserver`
So `RunsFacade` is the public entry point, while execution and state transitions are distributed across smaller components.
## 4. stream_bridge Design and Implementation
### 4.1 Why stream_bridge Is a Separate Abstraction
`StreamBridge` is defined in [`stream_bridge/contract.py`](/Users/rayhpeng/workspace/open-source/deer-flow/backend/packages/harness/deerflow/runtime/stream_bridge/contract.py).
It exists because run execution needs an event channel that is:
1. Subscribable
2. Replayable
3. Terminal-state aware
4. Resume-capable
That behavior must not be hard-coupled to HTTP SSE, in-memory queues, or Redis-specific details.
So:
1. harness defines stream semantics
2. the app layer owns backend selection and implementation
### 4.2 Contract Contents
The abstract `StreamBridge` currently exposes:
1. `publish(run_id, event, data)`
2. `publish_end(run_id)`
3. `publish_terminal(run_id, kind, data)`
4. `subscribe(run_id, last_event_id, heartbeat_interval)`
5. `cleanup(run_id, delay=0)`
6. `cancel(run_id)`
7. `mark_awaiting_input(run_id)`
8. `start()`
9. `close()`
Public types include:
1. `StreamEvent`
2. `StreamStatus`
3. `ResumeResult`
4. `HEARTBEAT_SENTINEL`
5. `END_SENTINEL`
6. `CANCELLED_SENTINEL`
### 4.3 Semantic Boundary
The contract explicitly distinguishes:
1. `end` / `cancel` / `error`
- Real business-level terminal events for a run
2. `close()`
- Bridge-level shutdown
- Not equivalent to run cancellation
### 4.4 Current Implementation Style
The concrete implementation currently used is the app-layer [`MemoryStreamBridge`](/Users/rayhpeng/workspace/open-source/deer-flow/backend/app/infra/stream_bridge/adapters/memory.py).
Its design is effectively “one in-memory event log per run”:
1. `_RunStream` stores the event list, offset mapping, status, subscriber count, and awaiting-input state
2. `publish()` generates increasing event IDs and appends to the per-run log
3. `subscribe()` supports replay, heartbeat, resume, and terminal exit
4. `cleanup_loop()` handles:
- old streams
- active streams with no publish activity
- orphan terminal streams
- TTL expiration
5. `mark_awaiting_input()` extends timeout behavior for HITL flows
The Redis implementation is still only a placeholder in [`RedisStreamBridge`](/Users/rayhpeng/workspace/open-source/deer-flow/backend/app/infra/stream_bridge/adapters/redis.py).
### 4.5 Call Chain
The stream bridge participates in the execution chain like this:
```text
RunsFacade
-> RunStreamService
-> StreamBridge
-> app route converts events to SSE
```
More concretely:
1. `_RunExecution._start()` publishes `metadata`
2. `_RunExecution._stream()` converts agent `astream()` output into bridge events
3. `_RunExecution._finish_success()` / `_finish_failed()` / `_finish_aborted()` publish terminal events
4. `RunWaitService` waits by subscribing for `values`, `error`, or terminal events
5. The app route layer converts those events into outward-facing SSE
### 4.6 Future Extensions
Likely future directions include:
1. A real Redis bridge for cross-process / multi-instance streaming
2. Stronger Last-Event-ID gap recovery behavior
3. Richer HITL state handling
4. Cross-node run coordination and more explicit dead-letter strategies
## 5. External Communication and Store Read/Write Boundaries
### 5.1 Two Main Outward Boundaries
`runtime` does not send HTTP requests directly and does not write ORM models directly, but it communicates outward through two main boundaries:
1. `StreamBridge`
- For outward-facing stream events
2. `store` / `observer`
- For durable data and lifecycle side effects
### 5.2 Store Boundary Protocols
Under [`runs/store`](/Users/rayhpeng/workspace/open-source/deer-flow/backend/packages/harness/deerflow/runtime/runs/store), the harness layer defines:
1. `RunCreateStore`
2. `RunQueryStore`
3. `RunDeleteStore`
4. `RunEventStore`
These are not harness-internal persistence implementations. They are app-facing contracts declared by the runtime.
### 5.3 How the app Layer Supplies Store Implementations
The app layer currently provides:
1. [`AppRunCreateStore`](/Users/rayhpeng/workspace/open-source/deer-flow/backend/app/gateway/services/runs/store/create_store.py)
2. [`AppRunQueryStore`](/Users/rayhpeng/workspace/open-source/deer-flow/backend/app/gateway/services/runs/store/query_store.py)
3. [`AppRunDeleteStore`](/Users/rayhpeng/workspace/open-source/deer-flow/backend/app/gateway/services/runs/store/delete_store.py)
4. [`AppRunEventStore`](/Users/rayhpeng/workspace/open-source/deer-flow/backend/app/infra/storage/run_events.py)
5. [`JsonlRunEventStore`](/Users/rayhpeng/workspace/open-source/deer-flow/backend/app/infra/run_events/jsonl_store.py)
The shared pattern is:
1. harness depends only on protocols
2. the app layer owns session lifecycle, commit behavior, access control, and backend choice
3. durable data eventually lands in `store.repositories.*` or JSONL files
### 5.4 How Run Lifecycle Data Leaves the Runtime
The single-run executor [`_RunExecution`](/Users/rayhpeng/workspace/open-source/deer-flow/backend/packages/harness/deerflow/runtime/runs/internal/execution/executor.py) does not write to the database directly.
It exports data through three paths:
1. bridge events
- Streamed outward to subscribers
2. callback -> `RunEventStore`
- Execution trace / message / tool / custom events are persisted in batches
3. lifecycle event -> `RunObserver`
- Run started, completed, failed, cancelled, and thread-status updates are emitted for app observers
### 5.5 `RunEventStore` Backends
The app-side factory [`app/infra/run_events/factory.py`](/Users/rayhpeng/workspace/open-source/deer-flow/backend/app/infra/run_events/factory.py) currently selects:
1. `run_events.backend == "db"`
- `AppRunEventStore`
2. `run_events.backend == "jsonl"`
- `JsonlRunEventStore`
So the runtime does not care whether events end up in a database or in files. It only requires the event-store protocol.
## 6. Run Lifecycle Data, Callbacks, Write-Back, and Query Flow
### 6.1 Main Single-Run Flow
The main `_RunExecution.run()` flow is:
1. `_start()`
2. `_prepare()`
3. `_stream()`
4. `_finish_after_stream()`
5. `finally`
- `_emit_final_thread_status()`
- `callbacks.flush()`
- `bridge.cleanup(run_id)`
### 6.2 What the Start Phase Records
`_start()`:
1. sets run status to `running`
2. emits `RUN_STARTED`
3. extracts the first human message and emits `HUMAN_MESSAGE`
4. captures the pre-run checkpoint ID
5. publishes a `metadata` stream event
### 6.3 What the Callbacks Collect
Callbacks live under [`runs/callbacks`](/Users/rayhpeng/workspace/open-source/deer-flow/backend/packages/harness/deerflow/runtime/runs/callbacks).
The main ones are:
1. `RunEventCallback`
- Records `run_start`, `run_end`, `llm_request`, `llm_response`, `tool_start`, `tool_end`, `tool_result`, `custom_event`, and more
- Flushes batches into `RunEventStore`
2. `RunTokenCallback`
- Aggregates token usage, LLM call counts, lead/subagent/middleware token split, message counts, first human message, and last AI message
3. `RunTitleCallback`
- Extracts thread title from title middleware output or custom events
### 6.4 How completion_data Is Produced
`RunTokenCallback.completion_data()` yields `RunCompletionData`, including:
1. `total_input_tokens`
2. `total_output_tokens`
3. `total_tokens`
4. `llm_call_count`
5. `lead_agent_tokens`
6. `subagent_tokens`
7. `middleware_tokens`
8. `message_count`
9. `last_ai_message`
10. `first_human_message`
The executor includes this data in lifecycle payloads on success, failure, and cancellation.
### 6.5 How the app Layer Writes Lifecycle Results Back
The executor emits `RunLifecycleEvent` objects through [`RunEventEmitter`](/Users/rayhpeng/workspace/open-source/deer-flow/backend/packages/harness/deerflow/runtime/runs/internal/execution/events.py).
The app-layer [`StorageRunObserver`](/Users/rayhpeng/workspace/open-source/deer-flow/backend/app/infra/storage/runs.py) then persists durable state:
1. `RUN_STARTED`
- Marks the run as `running`
2. `RUN_COMPLETED`
- Writes completion data
- Syncs thread title if present
3. `RUN_FAILED`
- Writes error and completion data
4. `RUN_CANCELLED`
- Writes `interrupted` state and completion data
5. `THREAD_STATUS_UPDATED`
- Syncs thread status
### 6.6 Query Paths
`RunsFacade.get_run()` and `list_runs()` have two paths:
1. If a `RunQueryStore` is injected, durable state is used first
2. Otherwise, the facade falls back to `RunRegistry`
So:
1. the in-memory registry is the control plane
2. the durable store is the preferred query surface
## 7. How actor_context Is Dynamically Injected for User Isolation
### 7.1 Design Goal
`actor_context` is defined in [`actor_context.py`](/Users/rayhpeng/workspace/open-source/deer-flow/backend/packages/harness/deerflow/runtime/actor_context.py).
Its purpose is to let the runtime and lower-level infrastructure modules depend on a stable notion of “who the current actor is” without importing the auth plugin, FastAPI request objects, or a specific user model.
### 7.2 Current Implementation
The current implementation is a request/task-scoped context built on top of `ContextVar`:
1. `ActorContext`
- Currently carries only `user_id`
2. `_current_actor`
- A `ContextVar[ActorContext | None]`
3. `bind_actor_context(actor)`
- Binds the current actor
4. `reset_actor_context(token)`
- Restores the previous context
5. `get_actor_context()`
- Returns the current actor
6. `get_effective_user_id()`
- Returns the current user ID or `DEFAULT_USER_ID`
7. `resolve_user_id(value=AUTO | explicit | None)`
- Resolves repository/storage-facing user IDs consistently
### 7.3 How the app Layer Injects It Dynamically
Dynamic injection currently happens at the app/auth boundary.
For HTTP request flows:
1. [`app.plugins.auth.security.middleware`](/Users/rayhpeng/workspace/open-source/deer-flow/backend/app/plugins/auth/security/middleware.py)
- Builds `ActorContext(user_id=...)` from the authenticated request user
- Binds and resets runtime actor context around request handling
2. [`app.plugins.auth.security.actor_context`](/Users/rayhpeng/workspace/open-source/deer-flow/backend/app/plugins/auth/security/actor_context.py)
- Provides `bind_request_actor_context(request)` and `bind_user_actor_context(user_id)`
- Allows routes and non-HTTP entry points to bind runtime actor context explicitly
For non-HTTP / external channel flows:
1. [`app/channels/manager.py`](/Users/rayhpeng/workspace/open-source/deer-flow/backend/app/channels/manager.py)
2. [`app/channels/feishu.py`](/Users/rayhpeng/workspace/open-source/deer-flow/backend/app/channels/feishu.py)
Those entry points also wrap execution with `bind_user_actor_context(user_id)` before they enter runtime-facing code. This matters because:
1. the runtime does not need to distinguish HTTP from Feishu or other channels
2. any entry point that can resolve a user ID can inject the same isolation semantics
3. the same runtime/store/path/memory code can stay protocol-agnostic
So the runtime itself does not know what a request is, and it does not know the auth plugins user model. It only knows whether an `ActorContext` is currently bound in the `ContextVar`.
### 7.4 Propagation Semantics After Injection
In practice, “dynamic injection” here does not mean manually threading `user_id` through every function signature. The app boundary binds the actor into a `ContextVar`, and runtime-facing code reads it only where isolation is actually needed.
The current semantics are:
1. an entry boundary calls `bind_actor_context(...)`
2. the async call chain created inside that context sees the same actor view
3. the boundary restores the previous value with `reset_actor_context(token)` when the request/task exits
That gives two practical outcomes:
1. most runtime interfaces do not need to carry `user_id` as an explicit parameter through every layer
2. boundaries that do need durable isolation or path isolation can still read explicitly via `resolve_user_id()` or `get_effective_user_id()`
### 7.5 How User Isolation Actually Works
User isolation is implemented through “dynamic injection + boundary-specific reads”.
The main paths are:
1. path / uploads / sandbox / memory
- Use `get_effective_user_id()` to derive per-user directories and resource scopes
2. app storage adapters
- Use `resolve_user_id(AUTO)` in `RunStoreAdapter`, `ThreadMetaStorage`, and related boundaries
3. run event store
- `AppRunEventStore` reads `get_actor_context()` and decides whether the current actor may see a thread
So user isolation is not centralized in a single middleware and then forgotten. Instead:
1. the app boundary dynamically binds the actor into runtime context
2. runtime and lower layers read that context when they need isolation input
3. each boundary applies the user ID according to its own responsibility
### 7.6 Why This Approach Works Well
The current design has several practical strengths:
1. The runtime does not depend on a specific auth implementation
2. HTTP and non-HTTP entry points can reuse the same isolation mechanism
3. The same user ID propagates naturally into paths, memory, store access, and event visibility
4. Where stronger enforcement is needed, `AUTO` + `resolve_user_id()` can require a bound actor context
### 7.7 Future Extensions
`ActorContext` already contains explicit future-extension hints. The current pattern can be extended without changing the architecture:
1. `tenant_id`
- For multi-tenant isolation
2. `subject_id`
- For a more stable identity key
3. `scopes`
- For finer-grained authorization
4. `auth_source`
- To track the source channel or auth mechanism
The recommended extension model is to preserve the current shape:
1. The app/auth boundary binds a richer `ActorContext`
2. The runtime depends only on abstract context fields, never on request/user objects
3. Lower layers read only the fields they actually need
4. Store / path / sandbox / stream / memory boundaries can gradually become tenant-aware or scope-aware
More concretely, stronger isolation can be added incrementally at the boundaries:
1. store boundaries
- add `tenant_id` filtering in `RunStoreAdapter`, `ThreadMetaStorage`, and feedback/event stores
2. path and sandbox boundaries
- shard directories by `tenant_id/user_id` instead of `user_id` alone
3. event-visibility boundaries
- layer `scopes` or `subject_id` checks into run-event and thread queries
4. external-channel boundaries
- populate `auth_source` so API, channel, and internal-job traffic can be distinguished
That keeps the runtime dependent on the abstract “current actor context” concept, not on FastAPI request objects or a specific auth implementation.
## 8. Interaction with the app Layer
### 8.1 How the app Layer Wires the Runtime
The app composition root for runs is [`app/gateway/services/runs/facade_factory.py`](/Users/rayhpeng/workspace/open-source/deer-flow/backend/app/gateway/services/runs/facade_factory.py).
It assembles:
1. `RunRegistry`
2. `ExecutionPlanner`
3. `RunSupervisor`
4. `RunStreamService`
5. `RunWaitService`
6. `RunsRuntime`
- `bridge`
- `checkpointer`
- `store`
- `event_store`
- `agent_factory_resolver`
7. `StorageRunObserver`
8. `AppRunCreateStore`
9. `AppRunQueryStore`
10. `AppRunDeleteStore`
### 8.2 How app.state Provides Infrastructure
In [`app/gateway/registrar.py`](/Users/rayhpeng/workspace/open-source/deer-flow/backend/app/gateway/registrar.py):
1. `init_persistence()` creates:
- `persistence`
- `checkpointer`
- `run_store`
- `thread_meta_storage`
- `run_event_store`
2. `init_runtime()` creates:
- `stream_bridge`
Those objects are then attached to `app.state` for dependency injection and facade construction.
### 8.3 The app Boundary for `stream_bridge`
Concrete stream bridge construction now belongs entirely to the app layer:
1. harness exports only the `StreamBridge` contract
2. [`app.infra.stream_bridge.build_stream_bridge`](/Users/rayhpeng/workspace/open-source/deer-flow/backend/app/infra/stream_bridge/factory.py) constructs the actual implementation
That is a very explicit boundary:
1. harness defines runtime semantics and interfaces
2. app selects and constructs infrastructure
## 9. Summary
The most accurate one-line summary of `deerflow.runtime` today is:
It is a runtime kernel built around run orchestration, a stream bridge as the streaming boundary, actor context as the dynamic isolation bridge, and store / observer protocols as the durable and side-effect boundaries.
More concretely:
1. `runs` owns orchestration and lifecycle progression
2. `stream_bridge` owns stream semantics
3. `actor_context` owns runtime-scoped user context and isolation bridging
4. `serialization` / `converters` own outward event and message formatting
5. the app layer owns real persistence, stream infrastructure, and auth-driven context injection
The main strengths of this structure are:
1. Runtime semantics are decoupled from infrastructure implementations
2. Request identity is decoupled from runtime logic
3. HTTP, CLI, and channel-worker entry points can reuse the same runtime boundaries
4. The system can grow toward multi-tenancy, cross-process stream bridges, and richer durable backends without changing the core model
The current limitations are also clear:
1. `RunRegistry` is still an in-process control plane
2. The Redis bridge is not implemented yet
3. Some multitask strategies and batch capabilities are still outside the main path
4. `ActorContext` currently carries only `user_id`, not richer fields such as tenant, scopes, or auth source
So the best way to understand the current code is not as a final platform, but as a runtime kernel with clear semantics and extension boundaries.