feat(skill): add blocking-io-guard — SOP skill for blocking-IO triage and runtime anchors (#3503)

* feat(blocking-io): add changed-lines blocking-IO scanner (L1)

* feat(blocking-io): add scan-changed CLI wrapper

* feat(skill): add blocking-io-guard developer SOP skill

* docs(blocking-io): point contributors at the blocking-io-guard skill

* style(blocking-io): apply ruff format to scanner and tests

* docs(backend): document changed-lines blocking-IO scanner in CLAUDE.md

* feat(skill): add post-fix re-scan check and PR batching policy

* refactor(skill): fix SOP step ordering, align template with repo conventions

- Move re-scan into an explicit 'apply the fix' step (was wedged after
  anchor generation while telling you to go back before the anchor)
- Renumber steps 0-6; drop undefined 'L1' jargon
- Mode A: document that the diff is <base>...HEAD (commit first)
- Mode B: prefer make detect-blocking-io + findings JSON file
- anchor template: module-level pytestmark per tests/blocking_io convention
- CLAUDE.md: fix 'git diff --base' phrasing

* fix(skill): catch findings introduced without touching the blocking line

Review follow-up: changed-line intersection alone misses the case where a
new async caller exposes an old sync helper — the static finding sits on
the untouched blocking line, so Mode A returned empty and the SOP stopped
on a false 'no blocking-IO surface'.

Selection is now a union over the changed files:
- findings on added lines of git diff <base>...HEAD (kept: a second
  identical symbol in an already-flagged function collides on the stable
  key and only this selection sees it);
- findings new versus the merge base, matched by (path, function,
  symbol) — never line numbers.

Base sources are materialized via git show <merge-base>:<path>; files
absent at base count every head finding as new. SKILL.md now states the
residual same-file-only blind spot (cross-file async callers) instead of
treating an empty list as proof of zero exposure, and only requires
reading sop-skeleton.md when generalizing to another detector domain.

* docs(skill): examples teach test-writing, the teeth check defines the rule

All examples in the references/template are filesystem-flavored; make
explicit that they are instances, not the SOP's boundary — the same rules
apply to every detector category (FILE_IO, HTTP, SUBPROCESS, SLEEP) and
acceptance is always red/green teeth, never similarity to an example.
Neutralize the template's arrange comment accordingly.

* fix(blocking-io): harden changed-lines scanner per review

- Dedup the union selection by the stable key (path, function, symbol)
  instead of dict identity, so a future selector returning copied dicts
  cannot silently empty the result.
- parse_changed_lines now handles any unified diff: context lines advance
  the new-file counter, \-markers and deletions do not, and the counter
  resets at each +++ header. Previously correct only for --unified=0.
- Add blocking_io_static.scan_source (in-memory scan); base-version
  comparison no longer round-trips through temp files.
- Empty Mode A report now prints the same-file-only reachability caveat
  at the point of use instead of relying on the SOP text alone.

* docs(skill): bound best-effort cleanup when the offload sits in finally

Lesson from the #3505 review: the SOP routinely drives 'offload the
cleanup branch' transformations, and an awaited cleanup in finally can
mask or stall the primary exception. One sentence in Step 2 closes that
gap at the point where the fix is written.
This commit is contained in:
AochenShen99
2026-06-12 10:20:38 +08:00
committed by GitHub
parent 330a2ff8c5
commit dc2ababf00
10 changed files with 703 additions and 4 deletions
@@ -0,0 +1,213 @@
"""Intersect a git diff with static blocking-IO findings.
Wraps the static detector (`blocking_io_static`) to answer a narrower question:
which blocking-IO candidates does THIS change introduce? A candidate qualifies
when its blocking line is on an added line of the diff, or when the finding is
new versus the merge base — the latter catches exposure created without
touching the blocking line itself (a new async caller making an old sync
helper async-reachable). Used by the `blocking-io-guard` skill as the
deterministic scope step.
"""
from __future__ import annotations
import argparse
import json
import re
import subprocess
import sys
from collections import defaultdict
from collections.abc import Sequence
from pathlib import Path
from support.detectors import blocking_io_static as static
REPO_ROOT = Path(__file__).resolve().parents[4]
SCAN_ROOTS = (
"backend/app",
"backend/packages/harness/deerflow",
"backend/scripts",
)
_HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")
def parse_changed_lines(diff_text: str) -> dict[str, set[int]]:
"""Map repo-relative path -> set of added line numbers in the new file.
Accepts any unified diff (with or without `--unified=0`): context lines
advance the new-file counter, deletions (`-`) and `\\ No newline` markers
do not. Records only added lines (`+`, not the `+++` header), numbered
from each hunk's new-file start line; deleted files (`+++ /dev/null`) are
skipped.
"""
changed: dict[str, set[int]] = defaultdict(set)
current_path: str | None = None
next_line = 0
for raw in diff_text.splitlines():
if raw.startswith("+++ "):
target = raw[4:].strip()
if target == "/dev/null":
current_path = None
else:
current_path = target[2:] if target.startswith("b/") else target
next_line = 0
continue
match = _HUNK_RE.match(raw)
if match:
next_line = int(match.group(1))
continue
if not current_path:
continue
if raw.startswith("+"):
changed[current_path].add(next_line)
next_line += 1
elif raw.startswith(" ") or raw == "":
next_line += 1
return dict(changed)
def changed_python_lines(base: str, repo_root: Path = REPO_ROOT) -> dict[str, set[int]]:
"""Diff `base...HEAD` over scan roots and return added .py lines."""
cmd = [
"git",
"-C",
str(repo_root),
"diff",
"--unified=0",
"--no-color",
f"{base}...HEAD",
"--",
*SCAN_ROOTS,
]
diff_text = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
return {path: lines for path, lines in parse_changed_lines(diff_text).items() if path.endswith(".py")}
def select_findings_on_changed_lines(
findings: Sequence[dict[str, object]],
changed_lines: dict[str, set[int]],
) -> list[dict[str, object]]:
"""Keep findings whose (path, line) falls on a changed line."""
selected: list[dict[str, object]] = []
for finding in findings:
location = finding["location"] # type: ignore[index]
path = location["path"] # type: ignore[index]
line = location["line"] # type: ignore[index]
if line in changed_lines.get(path, set()):
selected.append(finding)
return selected
def base_python_contents(base: str, paths: Sequence[str], repo_root: Path = REPO_ROOT) -> dict[str, str]:
"""Return each path's content at the merge base of `base` and HEAD.
Files absent at the merge base (newly added) are omitted, so every head
finding in them counts as new.
"""
merge_base = subprocess.run(
["git", "-C", str(repo_root), "merge-base", base, "HEAD"],
capture_output=True,
text=True,
check=True,
).stdout.strip()
contents: dict[str, str] = {}
for path in paths:
shown = subprocess.run(
["git", "-C", str(repo_root), "show", f"{merge_base}:{path}"],
capture_output=True,
text=True,
)
if shown.returncode == 0:
contents[path] = shown.stdout
return contents
def scan_python_contents(contents: dict[str, str]) -> list[dict[str, object]]:
"""Run the static detector over in-memory sources (repo-relative path -> code)."""
findings: list[dict[str, object]] = []
for rel_path in sorted(contents):
findings.extend(finding.to_dict() for finding in static.scan_source(contents[rel_path], rel_path))
return findings
def _stable_key(finding: dict[str, object]) -> tuple[str, str, str]:
location = finding["location"] # type: ignore[index]
call = finding["blocking_call"] # type: ignore[index]
return (location["path"], location["function"], call["symbol"]) # type: ignore[index]
def select_findings_new_vs_base(
head_findings: Sequence[dict[str, object]],
base_findings: Sequence[dict[str, object]],
) -> list[dict[str, object]]:
"""Keep head findings whose stable key (path, function, symbol) is absent at base.
Line numbers shift between revisions, so matching is by stable key only.
A second identical symbol added inside a function that already had a
finding collides on the key and is NOT reported here — that case is
covered by the changed-line selection instead.
"""
base_keys = {_stable_key(finding) for finding in base_findings}
return [finding for finding in head_findings if _stable_key(finding) not in base_keys]
def find_changed_blocking_io(base: str, repo_root: Path = REPO_ROOT) -> list[dict[str, object]]:
"""Return static findings this change introduces or touches.
Union over the changed files of:
- findings whose blocking line is on an added line of the diff;
- findings new versus the merge base (a new async caller can expose an
untouched sync helper — the blocking line itself is not in the diff).
"""
changed_lines = changed_python_lines(base, repo_root)
if not changed_lines:
return []
files = [repo_root / path for path in changed_lines]
head_findings = [finding.to_dict() for finding in static.scan_paths(files, repo_root=repo_root)]
on_changed_lines = select_findings_on_changed_lines(head_findings, changed_lines)
base_findings = scan_python_contents(base_python_contents(base, sorted(changed_lines), repo_root))
new_vs_base = select_findings_new_vs_base(head_findings, base_findings)
selected_keys = {_stable_key(finding) for finding in (*on_changed_lines, *new_vs_base)}
return [finding for finding in head_findings if _stable_key(finding) in selected_keys]
def format_report(findings: Sequence[dict[str, object]], base: str) -> str:
if not findings:
return (
f"No blocking-IO candidates introduced by this change (base: {base}).\n"
"Note: async reachability is resolved within each file only. If this change\n"
"adds an async call into a sync helper defined in another file, check that\n"
"helper manually (codegraph or git grep) before relying on this empty result."
)
lines = [
f"Blocking-IO candidates introduced/touched by this change (base: {base}): {len(findings)}",
"",
]
order = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}
for finding in sorted(findings, key=lambda f: order.get(str(f["priority"]), 9)):
location = finding["location"] # type: ignore[index]
call = finding["blocking_call"] # type: ignore[index]
lines.append(f"{finding['priority']} {call['category']}/{call['operation']} {location['path']}:{location['line']} in {location['function']} exposure={finding['event_loop_exposure']}")
lines.append(f" symbol: {call['symbol']}")
if finding.get("code"):
lines.append(f" code: {finding['code']}")
return "\n".join(lines)
def main(argv: Sequence[str] | None = None) -> int:
parser = argparse.ArgumentParser(description="List blocking-IO candidates this change introduces: findings on added lines plus findings new versus the merge base (diff against --base).")
parser.add_argument("--base", default="origin/main", help="Base ref to diff against (default: origin/main).")
parser.add_argument("--format", choices=("text", "json"), default="text", help="Output format.")
args = parser.parse_args(argv)
findings = find_changed_blocking_io(args.base)
if args.format == "json":
print(json.dumps(findings, indent=2))
else:
print(format_report(findings, args.base))
return 0
if __name__ == "__main__":
sys.exit(main())
@@ -717,12 +717,11 @@ def _finalize_findings(visitor: BlockingIOStaticVisitor) -> list[BlockingIOStati
return findings
def scan_file(path: Path, *, repo_root: Path = REPO_ROOT) -> list[BlockingIOStaticFinding]:
source = path.read_text(encoding="utf-8")
def scan_source(source: str, relative_path: str) -> list[BlockingIOStaticFinding]:
"""Scan one in-memory Python source; `relative_path` is reported verbatim in findings."""
source_lines = source.splitlines()
relative_path = relative_to_repo(path, repo_root)
try:
tree = ast.parse(source, filename=str(path))
tree = ast.parse(source, filename=relative_path)
except SyntaxError as exc:
line = exc.lineno or 0
code = _source_snippet(source_lines, line)
@@ -746,6 +745,10 @@ def scan_file(path: Path, *, repo_root: Path = REPO_ROOT) -> list[BlockingIOStat
return sorted(_finalize_findings(visitor), key=lambda finding: (finding.path, finding.line, finding.column, finding.category))
def scan_file(path: Path, *, repo_root: Path = REPO_ROOT) -> list[BlockingIOStaticFinding]:
return scan_source(path.read_text(encoding="utf-8"), relative_to_repo(path, repo_root))
def is_ignored_path(path: Path) -> bool:
return any(part in IGNORED_DIR_NAMES for part in path.parts)