Files
deer-flow/backend
Nan Gao 68ba4198b8 fix(channels): make channel connect flow deterministic (#3582)
* fix(channels): make channel connect flow deterministic

* make format

* fix(channels): apply connect-code before allowed_users on telegram and wechat

The bind-bootstrap reorder shipped for slack/dingtalk only. Telegram and
WeChat still gate _check_user/allowed_users before connect-code dispatch, so
a newly allowlisted-but-unbound user is silently rejected when binding via the
browser deep-link / connect-code flow — the same deadlock the PR fixes.

- telegram: consume the /start deep-link token before the allowed_users gate.
- wechat: handle the /connect code before the allowed_users gate, and defer
  inbound file extraction + context-token tracking past the gate so blocked
  senders no longer trigger CDN downloads or token bookkeeping.

Adds regression tests for both adapters mirroring the slack/dingtalk coverage.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(channels): enforce single-active-owner invariant at the DB layer

_revoke_other_active_owners did a SELECT-then-UPDATE in app code with no row
lock or constraint covering active rows. Under READ COMMITTED, two concurrent
connect-code consumes for the same (provider, external_account_id, workspace_id)
from different owners could each observe "no other active owner" and both commit
a connected row, leaving find_connection_by_external_identity nondeterministic.

- Add a partial unique index on (provider, external_account_id, workspace_id)
  WHERE status != 'revoked' (portable to SQLite >= 3.8.0 and PostgreSQL) so the
  database guarantees at most one non-revoked row per external identity.
- Reorder upsert_connection to revoke other owners' active rows before the new
  connected row is flushed (so the index is satisfied at commit), wrapped in a
  bounded rollback-and-retry loop. A losing concurrent writer now retries
  against the now-visible state instead of committing a duplicate.

Adds DB-constraint, revoked-slot-reuse, and concurrent-upsert regression tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(channels): harden connect-status polling primitive

pollChannelConnectionUntilResolved was a free-floating recursive setTimeout
started from onSuccess with no cancellation, no per-provider dedup, a redundant
second endpoint per tick, and an unbounded loop on a non-finite expires_in.

- Extract a framework-agnostic, cancellable poller (connect-poll.ts) that polls
  only listChannelConnections() and invalidates the providers query once when the
  bind resolves, instead of fetching both endpoints every tick.
- Guard expires_in with a finite check + default window so undefined/NaN can no
  longer produce a poll loop that runs until the page closes.
- Track one active poll handle per provider in useConnectChannelProvider via a
  ref Map: a new connect cancels the prior poll for that provider, and a useEffect
  cleanup cancels all polls on unmount.

Adds unit tests for resolve-and-stop, cancellation, and non-finite-expiry.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(channels): stop leaking blocked-sender content in DingTalk INFO log; document bind semantics

Moving the allowed_users gate past _extract_text meant the parsed-message INFO
log (text=%r, first 100 chars) fired for senders that allowed_users would have
rejected, defeating the filter's noise/privacy role. Move that log to after the
allowed_users gate so blocked senders' message text never reaches INFO logs.

Also document the two operator-relevant semantic changes in backend/CLAUDE.md:
connect-code dispatch runs before allowed_users (so allowed_users is no longer a
bind-time defense; the model relies on code confidentiality + 600s TTL + one-time
consumption), and the single-active-owner-per-external-identity transfer semantics
now backed by the partial unique index.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(channels): note connect-code-vs-allowlist and ownership transfer in operator guide

Mirror the backend/CLAUDE.md notes in the operator-facing IM_CHANNEL_CONNECTIONS.md:
connect codes are consumed before allowed_users (so a not-yet-allowlisted user can
still complete a first bind, and allowed_users is not a bind-time defense), and an
external identity has at most one active owner with last-bind-wins transfer enforced
at the DB layer.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* refactor(channels): lift connect-code dispatch into Channel base class

Each adapter duplicated the ordering-sensitive boilerplate of extracting a
/connect code and guarding on the connection repo before its allowed_users gate.
The duplication is what let telegram/wechat drift and keep the gate ahead of the
bind. Centralize it:

- Move `_connection_repo` onto Channel.__init__ (removing 7 duplicate assignments).
- Add Channel._pending_connect_code(text), which guards on the repo and extracts
  the code, documenting that adapters MUST consult it before authorization so a
  browser-initiated bind can bootstrap a not-yet-authorized identity.
- Route slack, discord, feishu, dingtalk, wechat, and wecom through the helper.
  This also fixes a latent inconsistency where slack dispatched a bind even when
  no connection repo was configured.

Pure refactor — the full channel suite stays green; adds a direct unit test for
the base helper's contract.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* make format

* fix(channels): redact DingTalk parsed-message INFO log content

Log text_len instead of the first 100 chars of message text, so message
content never reaches INFO logs (the after-gate move already keeps blocked
senders out entirely). This takes over the redaction from #3584 so only this
PR touches dingtalk.py, letting the two PRs merge in any order conflict-free.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 10:15:31 +08:00
..
2026-01-14 09:57:52 +08:00

DeerFlow Backend

DeerFlow is a LangGraph-based AI super agent with sandbox execution, persistent memory, and extensible tool integration. The backend enables AI agents to execute code, browse the web, manage files, delegate tasks to subagents, and retain context across conversations - all in isolated, per-thread environments.


Architecture

                        ┌──────────────────────────────────────┐
                        │          Nginx (Port 2026)           │
                        │      Unified reverse proxy           │
                        └───────┬──────────────────┬───────────┘
                                │
            /api/langgraph/*    │    /api/* (other)
            rewritten to /api/* │
                                ▼
               ┌────────────────────────────────────────┐
               │        Gateway API (8001)              │
               │        FastAPI REST + agent runtime    │
               │                                        │
               │ Models, MCP, Skills, Memory, Uploads,  │
               │ Artifacts, Threads, Runs, Streaming    │
               │                                        │
               │ ┌────────────────────────────────────┐ │
               │ │ Lead Agent                         │ │
               │ │ Middleware Chain, Tools, Subagents │ │
               │ └────────────────────────────────────┘ │
               └────────────────────────────────────────┘

Request Routing (via Nginx):

  • /api/langgraph/* → Gateway LangGraph-compatible API - agent interactions, threads, streaming
  • /api/* (other) → Gateway API - models, MCP, skills, memory, artifacts, uploads, thread-local cleanup
  • / (non-API) → Frontend - Next.js web interface

Core Components

Lead Agent

The single LangGraph agent (lead_agent) is the runtime entry point, created via make_lead_agent(config). It combines:

  • Dynamic model selection with thinking and vision support
  • Middleware chain for cross-cutting concerns (9 middlewares)
  • Tool system with sandbox, MCP, community, and built-in tools
  • Subagent delegation for parallel task execution
  • System prompt with skills injection, memory context, and working directory guidance

Middleware Chain

Middlewares execute in strict order, each handling a specific concern:

# Middleware Purpose
1 ThreadDataMiddleware Creates per-thread isolated directories (workspace, uploads, outputs)
2 UploadsMiddleware Injects newly uploaded files into conversation context
3 SandboxMiddleware Acquires sandbox environment for code execution
4 SummarizationMiddleware Reduces context when approaching token limits (optional)
5 TodoListMiddleware Tracks multi-step tasks in plan mode (optional)
6 TitleMiddleware Auto-generates conversation titles after first exchange
7 MemoryMiddleware Queues conversations for async memory extraction
8 ViewImageMiddleware Injects image data for vision-capable models (conditional)
9 ClarificationMiddleware Intercepts clarification requests and interrupts execution (must be last)

Sandbox System

Per-thread isolated execution with virtual path translation:

  • Abstract interface: execute_command, read_file, write_file, list_dir
  • Providers: LocalSandboxProvider (filesystem) and AioSandboxProvider (Docker, in community/). Async runtime paths use async sandbox lifecycle hooks so startup, readiness polling, and release do not block the event loop. AioSandboxProvider validates active-cache and warm-pool containers during acquire/reuse, dropping definitively dead entries so a thread can provision a fresh sandbox after an unexpected container exit while keeping get() as an in-memory lookup. Backend health-check failures are treated as unknown, not dead, and a container that cannot be verified during discovery is simply not adopted (acquire falls through to create instead of failing).
  • Virtual paths: /mnt/user-data/{workspace,uploads,outputs} → thread-specific physical directories
  • Skills path: /mnt/skillsdeer-flow/skills/ directory
  • Skills loading: Recursively discovers nested SKILL.md files under skills/{public,custom} and preserves nested container paths
  • File-write safety: str_replace serializes read-modify-write per (sandbox.id, path) so isolated sandboxes keep concurrency even when virtual paths match
  • Tools: bash, ls, read_file, write_file, str_replace (write_file overwrites by default and exposes append for end-of-file writes; bash is disabled by default when using LocalSandboxProvider; use AioSandboxProvider for isolated shell access)

Subagent System

Async task delegation with concurrent execution:

  • Built-in agents: general-purpose (full toolset) and bash (command specialist, exposed only when shell access is available)
  • Concurrency: Max 3 subagents per turn, 15-minute timeout
  • Execution: Background thread pools with status tracking and SSE events
  • Flow: Agent calls task() tool → executor runs subagent in background → polls for completion → returns result

Memory System

LLM-powered persistent context retention across conversations:

  • Automatic extraction: Analyzes conversations for user context, facts, and preferences
  • Structured storage: User context (work, personal, top-of-mind), history, and confidence-scored facts
  • Debounced updates: Batches updates to minimize LLM calls (configurable wait time)
  • System prompt injection: Top facts + context injected into agent prompts
  • Storage: JSON file with mtime-based cache invalidation

Tool Ecosystem

Category Tools
Sandbox bash, ls, read_file, write_file, str_replace
Built-in present_files, ask_clarification, view_image, task (subagent)
Community Tavily (web search), Jina AI (web fetch), Firecrawl (scraping), DuckDuckGo (image search)
MCP Any Model Context Protocol server (stdio, SSE, HTTP transports)
Skills Domain-specific workflows injected via system prompt

Gateway API

FastAPI application providing REST endpoints for frontend integration:

Route Purpose
GET /api/models List available LLM models
GET/PUT /api/mcp/config Manage MCP server configurations
POST /api/mcp/cache/reset Reset cached MCP tools so they reload on next use
GET/PUT /api/skills List and manage skills
POST /api/skills/install Install skill from .skill archive
GET /api/memory Retrieve memory data
POST /api/memory/reload Force memory reload
GET /api/memory/config Memory configuration
GET /api/memory/status Combined config + data
POST /api/threads/{id}/uploads Upload files (auto-converts PDF/PPT/Excel/Word to Markdown, rejects directory paths, auto-renames duplicate filenames in one request)
GET /api/threads/{id}/uploads/list List uploaded files
DELETE /api/threads/{id} Delete DeerFlow-managed local thread data after LangGraph thread deletion; unexpected failures are logged server-side and return a generic 500 detail
GET /api/threads/{id}/artifacts/{path} Serve generated artifacts

IM Channels

The IM bridge supports Feishu, Slack, and Telegram. Slack and Telegram still use the final runs.wait() response path, while Feishu now streams through runs.stream(["messages-tuple", "values"]) and updates a single in-thread card in place.

For Feishu card updates, DeerFlow stores the running card's message_id per inbound message and patches that same card until the run finishes, preserving the existing OK / DONE reaction flow.


Quick Start

Prerequisites

  • Python 3.12+
  • uv package manager
  • API keys for your chosen LLM provider

Installation

cd deer-flow

# Copy configuration files
cp config.example.yaml config.yaml

# Install backend dependencies
cd backend
make install

Configuration

Edit config.yaml in the project root:

models:
  - name: gpt-4o
    display_name: GPT-4o
    use: langchain_openai:ChatOpenAI
    model: gpt-4o
    api_key: $OPENAI_API_KEY
    supports_thinking: false
    supports_vision: true

  - name: gpt-5-responses
    display_name: GPT-5 (Responses API)
    use: langchain_openai:ChatOpenAI
    model: gpt-5
    api_key: $OPENAI_API_KEY
    use_responses_api: true
    output_version: responses/v1
    supports_vision: true

Set your API keys:

export OPENAI_API_KEY="your-api-key-here"

Running

Full Application (from project root):

make dev  # Starts Gateway + Frontend + Nginx

Access at: http://localhost:2026

Backend Only (from backend directory):

# Gateway API + embedded agent runtime
make dev

Direct access: Gateway at http://localhost:8001


Project Structure

backend/
├── src/
│   ├── agents/                  # Agent system
│   │   ├── lead_agent/         # Main agent (factory, prompts)
│   │   ├── middlewares/        # 9 middleware components
│   │   ├── memory/             # Memory extraction & storage
│   │   └── thread_state.py    # ThreadState schema
│   ├── gateway/                # FastAPI Gateway API
│   │   ├── app.py             # Application setup
│   │   └── routers/           # 6 route modules
│   ├── sandbox/                # Sandbox execution
│   │   ├── local/             # Local filesystem provider
│   │   ├── sandbox.py         # Abstract interface
│   │   ├── tools.py           # bash, ls, read/write/str_replace
│   │   └── middleware.py      # Sandbox lifecycle
│   ├── subagents/              # Subagent delegation
│   │   ├── builtins/          # general-purpose, bash agents
│   │   ├── executor.py        # Background execution engine
│   │   └── registry.py        # Agent registry
│   ├── tools/builtins/         # Built-in tools
│   ├── mcp/                    # MCP protocol integration
│   ├── models/                 # Model factory
│   ├── skills/                 # Skill discovery & loading
│   ├── config/                 # Configuration system
│   ├── community/              # Community tools & providers
│   ├── reflection/             # Dynamic module loading
│   └── utils/                  # Utilities
├── docs/                       # Documentation
├── tests/                      # Test suite
├── langgraph.json              # LangGraph graph registry for tooling/Studio compatibility
├── pyproject.toml              # Python dependencies
├── Makefile                    # Development commands
└── Dockerfile                  # Container build

langgraph.json is not the default service entrypoint. The scripts and Docker deployments run the Gateway embedded runtime; the file is kept for LangGraph tooling, Studio, or direct LangGraph Server compatibility.


Configuration

Main Configuration (config.yaml)

Place in project root. Config values starting with $ resolve as environment variables.

Key sections:

  • models - LLM configurations with class paths, API keys, thinking/vision flags
  • tools - Tool definitions with module paths and groups
  • tool_groups - Logical tool groupings
  • sandbox - Execution environment provider
  • skills - Skills directory paths
  • title - Auto-title generation settings
  • summarization - Context summarization settings
  • subagents - Subagent system (enabled/disabled)
  • memory - Memory system settings (enabled, storage, debounce, facts limits)

Provider note:

  • models[*].use references provider classes by module path (for example langchain_openai:ChatOpenAI).
  • If a provider module is missing, DeerFlow now returns an actionable error with install guidance (for example uv add langchain-google-genai).

Extensions Configuration (extensions_config.json)

MCP servers and skill states in a single file:

{
  "mcpServers": {
    "github": {
      "enabled": true,
      "type": "stdio",
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": {"GITHUB_TOKEN": "$GITHUB_TOKEN"}
    },
    "secure-http": {
      "enabled": true,
      "type": "http",
      "url": "https://api.example.com/mcp",
      "oauth": {
        "enabled": true,
        "token_url": "https://auth.example.com/oauth/token",
        "grant_type": "client_credentials",
        "client_id": "$MCP_OAUTH_CLIENT_ID",
        "client_secret": "$MCP_OAUTH_CLIENT_SECRET"
      }
    }
  },
  "skills": {
    "pdf-processing": {"enabled": true}
  }
}

Environment Variables

  • DEER_FLOW_CONFIG_PATH - Override config.yaml location
  • DEER_FLOW_EXTENSIONS_CONFIG_PATH - Override extensions_config.json location
  • Model API keys: OPENAI_API_KEY, ANTHROPIC_API_KEY, DEEPSEEK_API_KEY, etc.
  • Tool API keys: TAVILY_API_KEY, GITHUB_TOKEN, etc.

LangSmith Tracing

DeerFlow has built-in LangSmith integration for observability. When enabled, all LLM calls, agent runs, tool executions, and middleware processing are traced and visible in the LangSmith dashboard.

Setup:

  1. Sign up at smith.langchain.com and create a project.
  2. Add the following to your .env file in the project root:
LANGSMITH_TRACING=true
LANGSMITH_ENDPOINT=https://api.smith.langchain.com
LANGSMITH_API_KEY=lsv2_pt_xxxxxxxxxxxxxxxx
LANGSMITH_PROJECT=xxx

Legacy variables: The LANGCHAIN_TRACING_V2, LANGCHAIN_API_KEY, LANGCHAIN_PROJECT, and LANGCHAIN_ENDPOINT variables are also supported for backward compatibility. LANGSMITH_* variables take precedence when both are set.

Langfuse Tracing

DeerFlow also supports Langfuse observability for LangChain-compatible runs.

Add the following to your .env file:

LANGFUSE_TRACING=true
LANGFUSE_PUBLIC_KEY=pk-lf-xxxxxxxxxxxxxxxx
LANGFUSE_SECRET_KEY=sk-lf-xxxxxxxxxxxxxxxx
LANGFUSE_BASE_URL=https://cloud.langfuse.com

If you are using a self-hosted Langfuse deployment, set LANGFUSE_BASE_URL to your Langfuse host.

Dual Provider Behavior

If both LangSmith and Langfuse are enabled, DeerFlow initializes and attaches both callbacks so the same run data is reported to both systems.

If a provider is explicitly enabled but required credentials are missing, or the provider callback cannot be initialized, DeerFlow raises an error when tracing is initialized during model creation instead of silently disabling tracing.

Docker: In docker-compose.yaml, tracing is disabled by default (LANGSMITH_TRACING=false). Set LANGSMITH_TRACING=true and/or LANGFUSE_TRACING=true in your .env, together with the required credentials, to enable tracing in containerized deployments.


Development

Commands

make install    # Install dependencies
make dev        # Run Gateway API + embedded agent runtime (port 8001)
make gateway    # Run Gateway API without reload (port 8001)
make lint       # Run linter (ruff)
make format     # Format code (ruff)
make detect-blocking-io  # Inventory blocking IO that may block the backend event loop

Code Style

  • Linter/Formatter: ruff
  • Line length: 240 characters
  • Python: 3.12+ with type hints
  • Quotes: Double quotes
  • Indentation: 4 spaces

Testing

uv run pytest

make detect-blocking-io statically scans backend business code for blocking IO that may run on the backend event loop and is not test-coverage-bound. It prints a concise summary for human review and writes complete JSON findings to .deer-flow/blocking-io-findings.json at the repository root (regardless of whether the target is invoked from the repo root or from backend/). JSON findings include both broad IO category and review-oriented fields such as priority, location, blocking_call, event_loop_exposure, reason, and code. priority is a deterministic review ordering from the operation type, not proof of a bug. Bare-name same-file calls are resolved by function name, so duplicate helper names in one file can conservatively over-report async reachability.


Technology Stack

  • LangGraph (1.0.6+) - Agent framework and multi-agent orchestration
  • LangChain (1.2.3+) - LLM abstractions and tool system
  • FastAPI (0.115.0+) - Gateway REST API
  • langchain-mcp-adapters - Model Context Protocol support
  • agent-sandbox - Sandboxed code execution
  • markitdown - Multi-format document conversion
  • tavily-python / firecrawl-py - Web search and scraping

Documentation


License

See the LICENSE file in the project root.

Contributing

See CONTRIBUTING.md for contribution guidelines.