Files
deer-flow/backend/docs/BLOCKING_IO_DETECTION.md
T
AochenShen99 dc2ababf00 feat(skill): add blocking-io-guard — SOP skill for blocking-IO triage and runtime anchors (#3503)
* feat(blocking-io): add changed-lines blocking-IO scanner (L1)

* feat(blocking-io): add scan-changed CLI wrapper

* feat(skill): add blocking-io-guard developer SOP skill

* docs(blocking-io): point contributors at the blocking-io-guard skill

* style(blocking-io): apply ruff format to scanner and tests

* docs(backend): document changed-lines blocking-IO scanner in CLAUDE.md

* feat(skill): add post-fix re-scan check and PR batching policy

* refactor(skill): fix SOP step ordering, align template with repo conventions

- Move re-scan into an explicit 'apply the fix' step (was wedged after
  anchor generation while telling you to go back before the anchor)
- Renumber steps 0-6; drop undefined 'L1' jargon
- Mode A: document that the diff is <base>...HEAD (commit first)
- Mode B: prefer make detect-blocking-io + findings JSON file
- anchor template: module-level pytestmark per tests/blocking_io convention
- CLAUDE.md: fix 'git diff --base' phrasing

* fix(skill): catch findings introduced without touching the blocking line

Review follow-up: changed-line intersection alone misses the case where a
new async caller exposes an old sync helper — the static finding sits on
the untouched blocking line, so Mode A returned empty and the SOP stopped
on a false 'no blocking-IO surface'.

Selection is now a union over the changed files:
- findings on added lines of git diff <base>...HEAD (kept: a second
  identical symbol in an already-flagged function collides on the stable
  key and only this selection sees it);
- findings new versus the merge base, matched by (path, function,
  symbol) — never line numbers.

Base sources are materialized via git show <merge-base>:<path>; files
absent at base count every head finding as new. SKILL.md now states the
residual same-file-only blind spot (cross-file async callers) instead of
treating an empty list as proof of zero exposure, and only requires
reading sop-skeleton.md when generalizing to another detector domain.

* docs(skill): examples teach test-writing, the teeth check defines the rule

All examples in the references/template are filesystem-flavored; make
explicit that they are instances, not the SOP's boundary — the same rules
apply to every detector category (FILE_IO, HTTP, SUBPROCESS, SLEEP) and
acceptance is always red/green teeth, never similarity to an example.
Neutralize the template's arrange comment accordingly.

* fix(blocking-io): harden changed-lines scanner per review

- Dedup the union selection by the stable key (path, function, symbol)
  instead of dict identity, so a future selector returning copied dicts
  cannot silently empty the result.
- parse_changed_lines now handles any unified diff: context lines advance
  the new-file counter, \-markers and deletions do not, and the counter
  resets at each +++ header. Previously correct only for --unified=0.
- Add blocking_io_static.scan_source (in-memory scan); base-version
  comparison no longer round-trips through temp files.
- Empty Mode A report now prints the same-file-only reachability caveat
  at the point of use instead of relying on the SOP text alone.

* docs(skill): bound best-effort cleanup when the offload sits in finally

Lesson from the #3505 review: the SOP routinely drives 'offload the
cleanup branch' transformations, and an awaited cleanup in finally can
mask or stall the primary exception. One sentence in Step 2 closes that
gap at the point where the fix is written.
2026-06-12 10:20:38 +08:00

160 lines
5.0 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Blocking IO detection usage and maintenance
This document describes how to use and maintain DeerFlow backend blocking-IO
detection for async event-loop safety.
The goal is narrow: find and prevent synchronous IO from blocking backend
async event-loop paths. Static and runtime detection are complementary, but
they have different jobs.
## Static detector
The static detector is the discovery tool. It scans backend source code and
reports candidate blocking-IO call sites that may need human review.
Run it from the repository root:
```bash
make detect-blocking-io
```
Or from `backend/`:
```bash
make detect-blocking-io
```
The report is written to:
```text
.deer-flow/blocking-io-findings.json
```
Use this output for review and triage. A static finding is a candidate, not
proof that production blocks the event loop at runtime. The current static
rules are intentionally broad; prefer triaging existing output before adding
new static rules.
Add a static rule only when review finds a recurring high-risk blocking
pattern that is invisible to the current detector.
## Runtime detector
The runtime detector is the CI regression guard. It uses Blockbuster to fail a
focused test when code under `app.*` or `deerflow.*` performs blocking IO on
the asyncio event-loop thread.
Run it from `backend/`:
```bash
make test-blocking-io
```
The runtime gate starts from confirmed production bugs and protects those
paths from regressing. It does not prove that the entire backend is free of
blocking IO; it only covers the production paths exercised by
`backend/tests/blocking_io/`.
## Maintenance workflow
Use the static detector to find candidates, then use review to decide which
async production paths are worth protecting in CI.
The normal workflow is:
1. Run the static detector to find backend blocking-IO candidates.
2. Use human review to pick high-risk production async paths.
3. Add or update a focused runtime anchor in `backend/tests/blocking_io/`.
4. Let CI prevent that path from regressing.
Contributors changing backend async code can run the `blocking-io-guard` skill
(`.agent/skills/blocking-io-guard/`) to execute steps 13 for their own diff: it
scans the change for blocking-IO candidates, drafts or extends a runtime anchor,
and verifies the anchor fails when the blocking IO regresses.
Runtime detection has two maintenance paths.
### Add a runtime rule
Add a runtime rule when Blockbuster's default rules do not cover a generic
blocking primitive used by production code.
Rules belong in:
```text
backend/tests/support/detectors/blocking_io_runtime.py
```
Add them to `_PROJECT_BLOCKING_RULES`, not directly inside individual tests.
Keeping rules centralized makes it clear which extra primitives DeerFlow
expects Blockbuster to catch.
Example shape:
```python
import subprocess
from blockbuster import BlockBusterFunction
_PROJECT_BLOCKING_RULES = (
(
"subprocess.Popen.__init__",
BlockBusterFunction(
subprocess.Popen,
"__init__",
scanned_modules=["app", "deerflow"],
),
),
)
```
Do not add a runtime rule just because a business path is not tested. A rule
only expands what Blockbuster can intercept after code runs.
### Add a runtime anchor
Add a runtime anchor when a high-risk async production path should be protected
by CI but no existing `backend/tests/blocking_io/` test executes it.
Anchors belong in:
```text
backend/tests/blocking_io/
```
A good anchor should:
- Call the real production async entry point.
- Avoid bypassing the blocking surface with test-only `asyncio.to_thread`
wrappers.
- Use real local filesystem inputs when the bug shape is filesystem IO.
- Mock only the external dependency boundary, such as a network service or
third-party saver class.
- Fail if a future change moves the blocking operation back onto the event
loop.
Avoid testing only the low-level helper unless that helper is the production
async entry point. The runtime gate is most useful when it protects the caller
that production actually executes.
## Current runtime coverage
The runtime anchors protect confirmed blocking-IO bug shapes:
- SQLite checkpointer setup, including path resolution and parent-directory
creation.
- Subagent skill metadata loading through `SubagentExecutor._load_skills()`.
- `JsonlRunEventStore` async API (`put` / `list_*` / `delete_*`): the JSONL
run-event backend offloads its synchronous file IO via `asyncio.to_thread`
(fix #3084); this anchor drives the real async API under the gate so any
blocking IO reintroduced on the loop fails, not only removal of one
`to_thread` call.
- `UploadsMiddleware.before_agent` uploads-directory scan: a sync-only middleware
hook runs on the event loop under async graph execution, so the scan is
offloaded via `abefore_agent` + `run_in_executor`.
- Gate health checks: Blockbuster catches unoffloaded calls, opt-out works, and
patches are restored after exceptions.
As static detection and review identify more high-risk async paths, add new
runtime anchors incrementally.