Merge branch 'main' into 2.0.0-release

2026-06-17 04:56:04 +00:00 · 2026-06-17 00:12:48 +08:00
parent c7c5600aba 43dba448ad
commit e2edd262d8
11 changed files with 297 additions and 194 deletions
@@ -1,6 +1,6 @@
 ---
 name: deerflow-maintainer-orchestrator
-description: "Use when a DeerFlow maintainer needs comment-only GitHub issue or PR handling: resolve issue/PR scopes with gh, analyze issues, post or draft issue comments, perform PR review comments, give fix strategy, risk classification, and validation guidance. Intended for maintainers and trusted local agents, not general contributors."
+description: "Use when a DeerFlow maintainer needs comment-only GitHub issue or PR handling: resolve issue/PR scopes with gh, analyze issues, post or draft issue comments, perform PR review comments, review PR or issue batches, compare competing PRs that target the same issue, give fix strategy, risk classification, and validation guidance. Intended for maintainers and trusted local agents, not general contributors."
 ---
 # DeerFlow Maintainer Orchestrator
@@ -33,21 +33,35 @@ Use GitHub tooling to resolve artifact type and scope. Do not ask the maintainer
 8. For "recent/latest" wording without a count, use a small default recent slice. For "recent hours" wording without a number, use six hours. Do not ask.
 9. Use `gh api` when `gh issue/pr view/list` lacks required fields such as timeline events, review threads, or precise search filters.
 10. Use GitHub search only as a fallback for natural-language filters that cannot be represented by view/list/API calls. Do not use web search for artifact routing unless GitHub tooling is unavailable.
-11. If no artifact type, number, URL, count, time window, or searchable GitHub scope can be resolved, stop with a compact "scope unresolved" report. Do not ask a follow-up question.
+11. When an issue has more than one candidate resolving PR, gather them all before reviewing: the issue's linked/Development PRs, closing keywords (`Closes/Fixes #<issue>`) found via `gh api` timeline cross-reference events, and PRs that mention the issue. Route them into Competing PR Comparison.
 12. If no artifact type, number, URL, count, time window, or searchable GitHub scope can be resolved, stop with a compact "scope unresolved" report. Do not ask a follow-up question.
 Use concise repo-local references such as `#123` and `PR #123` in maintainer reports and comments. Include full GitHub URLs only for posted comment/review links returned by GitHub or when the maintainer supplied an explicit URL.
 ## Existing Coverage and Re-Runs
 Existing comments suppress duplicate **posting**, not **analysis**. Always analyze the artifact in full, then post only the net-new delta over what is already covered.
 1. Read existing maintainer/trusted-agent comments and reviews as prior coverage.
 2. Analyze the artifact fully regardless of what already exists. A prior comment may be partial — catching A while missing B.
 3. Keep only net-new, high-confidence items not already materially covered.
 4. Non-empty delta: post one comment that explicitly builds on the prior coverage (for example `Adding to @reviewer's review:`) and states only the new items. Do not restate covered points.
 5. Empty delta: post nothing public; report `Already covered` to the maintainer with the existing comment/review URL.
 6. Idempotency: treat your own earlier skill-authored comments as already-covered. On a re-run, never stack a second comment that repeats an earlier one — post only genuinely new delta, or nothing.
 RFC issues are the one hard skip: no analysis and no post unless the maintainer overrides.
 ## Issue Flow
 Use Issue Flow for GitHub issues, bug reports, feature requests, support questions, and issue batches.
-Start every issue with a cheap duplicate-opinion precheck:
+Start every issue with a cheap precheck:
 1. Fetch issue metadata, labels, author, body, and existing comments.
 2. If labels, title, or body mark the issue as RFC (`rfc`, `[RFC]`, `RFC:`, or `Request for Comments`), classify it as `rfc-no-comment`, skip deep analysis, and do not post anything public unless the maintainer explicitly overrides the RFC skip for that item.
-3. If an existing maintainer or trusted-agent issue comment already gives a materially equivalent diagnosis, modification suggestion, information request, or blocking decision, skip deep analysis and do not post anything public for that issue.
+3. Existing maintainer or trusted-agent comments are prior coverage, not an automatic skip. Analyze fully and post only the net-new delta (see Existing Coverage and Re-Runs).
 4. Treat ordinary reporter replies, thanks, unrelated discussion, or incomplete guesses as non-blocking.
-5. Report skipped issues to the maintainer only as compact identifiers plus the skipped reason or existing comment URL when available.
+5. Report already-covered or skipped issues to the maintainer only as compact identifiers plus the reason or existing comment URL when available.
 For non-skipped issues:
@@ -87,7 +101,7 @@ Validation:
   - Add `Missing info:` only when the issue cannot be diagnosed without more evidence; ask for the smallest useful data.
   - Put relevant files/components inside `Evidence:` or `Recommended solution:` bullets instead of separate metadata fields.
   - Every posted issue comment should contain concrete modification guidance and validation guidance unless the only useful response is `Missing info:`.
-5. Immediately before posting, refresh comments and skip if an equivalent maintainer or trusted-agent comment appeared during analysis.
+5. Immediately before posting, refresh comments; fold any equivalent comment that appeared during analysis into prior coverage and post only the remaining delta.
 6. Post one issue comment when posting is authorized; otherwise return the same text as `Reply draft`.
 Do not expose private reasoning, credentials, internal-only context, or unsupported promises. Do not say a fix was made unless a separate coding workflow actually changed code.
@@ -96,12 +110,13 @@ Do not expose private reasoning, credentials, internal-only context, or unsuppor
 Use PR Review Flow for GitHub pull requests and PR batches.
-Start every PR with a cheap duplicate-review precheck:
+Start every PR with a cheap precheck:
 1. Fetch PR metadata, changed file list, checks summary, existing PR reviews, existing PR comments, and review threads when available.
-2. If an existing maintainer or trusted-agent review already gives materially equivalent findings or a blocking decision, skip deep review and do not post anything public for that PR.
+2. Existing maintainer or trusted-agent reviews are prior coverage, not an automatic skip. Review fully and post only the net-new delta (see Existing Coverage and Re-Runs).
-3. Treat author replies, thanks, unrelated discussion, or incomplete guesses as non-blocking.
+3. Read `statusCheckRollup` as signal, not verdict. Failing required checks are themselves a reportable finding (build failure = P0; failing tests or lint = P1/P2 by impact). Green checks lower risk but never excuse reading the actual changed code path — confirm suspect logic by reading the source, not by trusting green CI. Tests passing does not prove the changed branch is exercised.
-4. Report skipped PRs to the maintainer only as compact identifiers plus the existing review/comment URL when available.
+4. Treat author replies, thanks, unrelated discussion, or incomplete guesses as non-blocking.
 5. Report already-covered or clean PRs to the maintainer only, with the existing review/comment URL when available.
 ### Diff Base Rule
@@ -112,6 +127,8 @@ Before reviewing a local PR branch or local diff, fetch the base repository's ta
 - Prefer GitHub PR base metadata for the target branch. For non-PR local diffs, use the base repository default branch. If metadata is unavailable, default to `main` only after fetching the base remote.
 - Refresh the comparison ref explicitly, for example `git fetch <base-remote> +refs/heads/<base-branch>:refs/remotes/<base-remote>/<base-branch>`, then inspect `BASE=$(git merge-base HEAD <base-remote>/<base-branch>)` and `git diff "$BASE"...HEAD`.
 - If using `FETCH_HEAD` from a single-branch fetch instead, diff against that verified `FETCH_HEAD` immediately and do not later substitute a possibly stale remote-tracking ref.
 - Resolve the PR head explicitly. For fork PRs whose head branch is not on the base repo, fetch the PR ref: `git fetch <base-remote> pull/<n>/head:pr-<n>`. The fork's own branch ref and `gh api .../contents?ref=<fork-branch>` will 404 against the base repo. Record the head SHA you reviewed.
 - Re-check the head SHA immediately before posting. If the PR head moved during analysis, re-review the new diff or abort — never post a review against a diff the PR no longer has.
 - For uncommitted local changes, review committed branch changes against the fresh base first, then include working-tree changes separately.
 - If the base remote or base branch cannot be established, use the GitHub PR files/diff as the source of truth. If neither local nor GitHub diff can be read, return a compact failure report and do not post a review.
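
The head-SHA re-check in the rule above can be sketched as follows. This is a minimal illustration, not part of the skill: the helper names `current_head_sha` and `safe_to_post` are hypothetical, and it assumes the `gh` CLI is installed, authenticated, and run inside the repository.

```python
import json
import subprocess


def current_head_sha(pr_number: int) -> str:
    """Ask GitHub for the PR's current head commit (assumes `gh` is on PATH)."""
    result = subprocess.run(
        ["gh", "pr", "view", str(pr_number), "--json", "headRefOid"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)["headRefOid"]


def safe_to_post(reviewed_sha: str, head_sha_now: str) -> bool:
    """True only when the PR head is still the exact commit that was reviewed."""
    return bool(reviewed_sha) and reviewed_sha == head_sha_now
```

A caller would record the SHA when the diff is fetched, then call `safe_to_post(recorded, current_head_sha(n))` immediately before posting and re-review or abort on `False`.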
@@ -122,8 +139,8 @@ Before posting a PR review comment:
 3. Prioritize correctness, safety, maintainability, production risk, compatibility, and missing critical tests over style.
 4. Report concrete architecture, security, public API, default-behavior, and compatibility problems as findings when the diff causes or exposes them.
 5. Check changed behavior, edge cases, error paths, state mutation, transactions, locks, cache invalidation, cleanup, security boundaries, missing tests, performance/reliability, and API compatibility.
-6. Immediately before posting, refresh reviews/comments and skip if an equivalent maintainer or trusted-agent review appeared during analysis.
+6. Immediately before posting, refresh reviews/comments and fold any equivalent review that appeared during analysis into prior coverage; post only the remaining delta.
-7. If there are high-confidence findings, post a PR review comment using the PR language. If there are no high-confidence findings, do not post a public PR review/comment; report `No high-confidence review findings.` to the maintainer in the run result.
+7. Apply the Posting Gate. If the gate yields public findings, post one PR review comment in the PR language. Otherwise post nothing public and report the result (`No high-confidence review findings.` or `Already covered`) plus any sub-threshold items as `Maintainer notes`.
 For public PR reviews with findings, start with one short opener that fits the review context and matches the finding count. Use singular wording only for exactly one finding, for example `Thanks @author. I found one issue that should be addressed before this is ready.` Use plural wording for multiple findings, for example `Thanks @author. I found a few issues that should be addressed before this is ready.` Omit the mention for bots or when it adds noise.
@@ -145,8 +162,41 @@ Severity guide:
 - `P1`: likely production bug, serious regression, broken compatibility, or high-risk security/architecture issue.
 - `P2`: correctness, maintainability, or test concern with lower risk.
 ### Posting Gate
 Posting depends on BOTH confidence (is the problem real?) and severity (how bad if real). They are independent axes — "no high-confidence findings" means none across P0/P1/P2, not merely "no P0".
 - Post publicly only items that are high-confidence AND at least P2.
 - For a public P2, additionally require that the diff itself introduces or worsens the issue. Do not raise a public P2 for pre-existing behavior the diff only touches, or for a change that is a net improvement over the prior state.
 - A high-confidence P0/P1 is always worth posting. A low-confidence P1 is not — omit it, or route it to `Maintainer notes` framed as a hypothesis to verify.
 - Sub-threshold but real observations (net-improvement nits, bounded or low-risk concerns, pre-existing issues, low-confidence hypotheses) go to the `Maintainer notes` channel in the run result, never to a public comment.
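
The gate above can be read as a small routing function. The sketch below is illustrative only: the `Finding` type, its field names, and the `route` helper are assumptions made for this example and do not exist in the skill. It sends low-confidence items to maintainer notes, although the rule also allows omitting them.

```python
from dataclasses import dataclass

PUBLIC_SEVERITIES = {"P0", "P1", "P2"}


@dataclass(frozen=True)
class Finding:
    title: str
    severity: str             # "P0", "P1", "P2"; anything else is sub-P2
    high_confidence: bool     # is the problem real?
    introduced_by_diff: bool  # does the diff introduce or worsen it?


def route(finding: Finding) -> str:
    """Return "public" or "maintainer_notes" for a single finding."""
    if not finding.high_confidence:
        return "maintainer_notes"   # hypothesis to verify, never public
    if finding.severity not in PUBLIC_SEVERITIES:
        return "maintainer_notes"   # below P2
    if finding.severity == "P2" and not finding.introduced_by_diff:
        return "maintainer_notes"   # pre-existing or net-improvement P2
    return "public"
```

Confidence is checked before severity, so a low-confidence P1 never outranks a high-confidence P2.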
 Do not produce compliments, summaries, or general advice. For sensitive security issues, describe impact and remediation without exploit instructions.
 ## Batch Handling
 When the scope has multiple artifacts, cluster before reviewing and synthesize after.
 Cluster by relatedness, not by type. Group artifacts that share files, interfaces, or the same issue/feature into one cluster; same-type artifacts that touch disjoint files are independent.
 - Related cluster: review in ONE shared context so cross-artifact reasoning is possible; parallel agents cannot see each other's findings. If the cluster cannot fit one context, fan out per sub-group and reconcile in the synthesis pass. Never split a related cluster without that re-aggregation.
 - Independent clusters: may run in parallel. Offloading a large or independent batch to one subagent per cluster keeps the main context clean — consider it for big batches, prefer offering it to the maintainer over silently spawning, and do not spawn for two or three related items or when the cold-start cost is not earned.
 After per-artifact review, run one synthesis pass over the whole batch and report it to the maintainer (decision-support, not a public comment):
 - Overlapping files and merge-order/conflict surface — which PRs touch the same files and will conflict pairwise.
 - Duplicate or competing solutions to the same problem.
 - Composition risk — changes each safe alone but interacting (for example, two PRs editing the same module or table).
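
Clustering by relatedness can be approximated by grouping artifacts whose changed-file sets overlap, directly or transitively. The sketch below is a rough illustration under that assumption. It covers only the shared-files signal, not shared interfaces or a shared issue, and the function name and input shape are hypothetical.

```python
from collections import defaultdict


def cluster_by_shared_files(artifacts: dict[str, set[str]]) -> list[list[str]]:
    """Group artifact ids that share at least one changed file (union-find)."""
    parent = {name: name for name in artifacts}

    def find(name: str) -> str:
        # Walk to the cluster root, compressing the path along the way.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    first_owner: dict[str, str] = {}
    for name, files in artifacts.items():
        for path in files:
            if path in first_owner:
                parent[find(name)] = find(first_owner[path])
            else:
                first_owner[path] = name

    groups: dict[str, list[str]] = defaultdict(list)
    for name in artifacts:
        groups[find(name)].append(name)
    return sorted(sorted(group) for group in groups.values())


clusters = cluster_by_shared_files(
    {
        "PR #1": {"a.py", "b.py"},
        "PR #2": {"b.py"},
        "PR #3": {"c.py"},
    }
)
```

Here `PR #1` and `PR #2` land in one related cluster because both touch `b.py`, while `PR #3` stays independent and could run in parallel.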
 ## Competing PR Comparison
 When several PRs target the same issue, compare them instead of reviewing each in isolation.
 1. Pull the issue's acceptance criteria (reported problem and expected behavior); that is the rubric anchor.
 2. Score each PR on: does it actually resolve the issue's ask; correctness and edge/error-path coverage; test quality; blast radius and compatibility; maintainability. Use the same DeerFlow Review Heuristics and Posting Gate as a single review.
 3. Report a maintainer-facing comparison — strongest PR and why, what each is missing — in the run result.
 4. Keep the public surface constructive and per-PR: post each PR's own gate-passing findings normally. Do not publicly rank PRs against each other or tell an author their PR is worse than a competitor's; winner selection stays in the maintainer report.
 ## No-Question Policy
 Do not ask the maintainer routine clarification questions. The skill should save maintainer time by turning scope into comments through a fixed workflow.
@@ -195,7 +245,9 @@ For Issue Flow:
 Run result:
 Posted:
 Skipped:
 Already covered:
 Failed:
 Maintainer notes:
 Per issue:
  Issue:
  Surface:
@@ -213,7 +265,9 @@ Run result:
 Reviewed:
 Skipped:
 Clean:
 Already covered:
 Failed:
 Maintainer notes:
 Per PR:
  PR:
  Public review:
@@ -231,7 +285,9 @@ For batches, prefer a compact maintainer-facing table after the headline counts:
 | #123 | posted | comment URL | short reason |
 | PR #456 | reviewed | review URL | P1: finding title |
 | PR #789 | clean | none | No high-confidence review findings. |
-| #321 | skipped | none | existing maintainer comment |
+| #321 | already covered | none | existing maintainer comment |
 ```
 For multi-artifact batches, follow the table with a `Batch synthesis` block (overlapping files, merge-order/conflict surface, duplicate or competing solutions, composition risk) and, when issues had competing PRs, a `Competing PR comparison` block. Both are maintainer-only.
 Omit empty categories, no-op fields, routine command output, and raw logs. Report meaningful changes, evidence, and options.
@@ -113,6 +113,7 @@ FastAPI application providing REST endpoints for frontend integration:
 |-------|---------|
 | `GET /api/models` | List available LLM models |
 | `GET/PUT /api/mcp/config` | Manage MCP server configurations |
 | `POST /api/mcp/cache/reset` | Reset cached MCP tools so they reload on next use |
 | `GET/PUT /api/skills` | List and manage skills |
 | `POST /api/skills/install` | Install skill from `.skill` archive |
 | `GET /api/memory` | Retrieve memory data |
@@ -172,7 +172,7 @@ class MessageBus:
    def unsubscribe_outbound(self, callback: OutboundCallback) -> None:
        """Remove a previously registered outbound callback."""
-        self._outbound_listeners = [cb for cb in self._outbound_listeners if cb is not callback]
+        self._outbound_listeners = [cb for cb in self._outbound_listeners if cb != callback]
    async def publish_outbound(self, msg: OutboundMessage) -> None:
        """Dispatch an outbound message to all registered listeners."""
@@ -8,6 +8,7 @@ from fastapi import APIRouter, HTTPException, Request, status
 from pydantic import BaseModel, Field
 from deerflow.config.extensions_config import ExtensionsConfig, get_extensions_config, reload_extensions_config
 from deerflow.mcp.cache import reset_mcp_tools_cache
 logger = logging.getLogger(__name__)
 router = APIRouter(prefix="/api", tags=["mcp"])
@@ -69,6 +70,13 @@ class McpConfigUpdateRequest(BaseModel):
    )
 class McpCacheResetResponse(BaseModel):
    """Response model for resetting the MCP tools cache."""
    success: bool = Field(description="Whether the MCP tools cache was reset")
    message: str = Field(description="Human-readable reset status")
 _MASKED_VALUE = "***"
@@ -269,6 +277,27 @@ async def get_mcp_configuration(request: Request) -> McpConfigResponse:
    return McpConfigResponse(mcp_servers=servers)
@router.post(
    "/mcp/cache/reset",
    response_model=McpCacheResetResponse,
    summary="Reset MCP Tools Cache",
    description=("Reset cached MCP tools and pooled sessions process-wide so tools are reloaded on next use. This affects all threads and users in the current Gateway process."),
 )
 async def reset_mcp_tools_cache_endpoint(request: Request) -> McpCacheResetResponse:
    """Reset cached MCP tools and persistent sessions process-wide.
    The next agent run or tool lookup will reload tools from the configured MCP
    servers. This affects all threads and users in the current Gateway process,
    and avoids relying on extensions_config.json mtime changes.
    """
    await _require_admin_user(request)
    reset_mcp_tools_cache()
    return McpCacheResetResponse(
        success=True,
        message="MCP tools cache reset. Tools will reload on next use.",
    )
@router.put(
    "/mcp/config",
    response_model=McpConfigResponse,
@@ -363,6 +392,7 @@ async def update_mcp_configuration(request: Request, body: McpConfigUpdateReques
        # agent runtime lives in Gateway, so this keeps API reads and tool
        # execution aligned after extensions_config.json changes.
        reloaded_config = reload_extensions_config()
        reset_mcp_tools_cache()
        servers = {name: _mask_server_config(McpServerConfigResponse(**server.model_dump())) for name, server in reloaded_config.mcp_servers.items()}
        return McpConfigResponse(mcp_servers=servers)
@@ -299,6 +299,26 @@ deployment needs additional trusted launchers.
 }
 ```
 #### Reset MCP Tools Cache
 Clear cached MCP tools and persistent MCP sessions process-wide. This affects
 all threads and users in the current Gateway process. Tools are loaded again
 from configured MCP servers on the next agent run or tool lookup.
 ```http
 POST /api/mcp/cache/reset
 ```
 Requires an authenticated admin session.
 **Response:**
 ```json
 {
  "success": true,
  "message": "MCP tools cache reset. Tools will reload on next use."
 }
 ```
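
A client call to this endpoint might look like the sketch below. It uses only the Python standard library. The base URL, the cookie value, and the helper names are assumptions for illustration; the documentation above states only that an authenticated admin session is required.

```python
import json
import urllib.request


def build_reset_request(base_url: str, cookie_header: str) -> urllib.request.Request:
    """Build the POST request; cookie_header carries the admin session (assumed)."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/mcp/cache/reset",
        method="POST",
        headers={"Cookie": cookie_header, "Accept": "application/json"},
    )


def reset_mcp_cache(base_url: str, cookie_header: str, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON body; raises HTTPError on 401/403."""
    request = build_reset_request(base_url, cookie_header)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Because the reset is process-wide, a caller would normally invoke it once after changing MCP servers out of band rather than on every request.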
 ### Skills
 #### List Skills
@@ -427,17 +427,17 @@ SKILL.md Format:
 ### Configuration Reload
 ```
-1. Client updates MCP config
+1. Client updates MCP config or requests a cache reset
   PUT /api/mcp/config
   POST /api/mcp/cache/reset
-2. Gateway writes extensions_config.json
+2. Gateway updates runtime state
-   - Updates mcpServers section
+   - PUT writes extensions_config.json and reloads configuration
-   - File mtime changes
+   - Both endpoints reset the MCP tools cache and persistent sessions
-3. MCP Manager detects change
+3. MCP Manager reloads on next use
-   - get_cached_mcp_tools() checks mtime
+   - get_cached_mcp_tools() lazily reinitializes MCP tools
-   - If changed: reinitializes MCP client
+   - Loads current server configurations and tool lists
   - Loads updated server configurations
 4. Next agent run uses new tools
 ```
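
The reset-then-lazy-reload behaviour described in the flow above can be sketched with a small cache object. This is not DeerFlow's implementation: `LazyToolsCache` and its loader are hypothetical stand-ins for `get_cached_mcp_tools()` and `reset_mcp_tools_cache()`, and the sketch ignores locking and pooled sessions.

```python
from typing import Callable, Optional


class LazyToolsCache:
    """Load tools on first use and keep them until reset() is called."""

    def __init__(self, loader: Callable[[], list[str]]) -> None:
        self._loader = loader
        self._tools: Optional[list[str]] = None
        self.load_count = 0

    def get(self) -> list[str]:
        # Reinitialize only when nothing is cached (first use or after reset).
        if self._tools is None:
            self._tools = self._loader()
            self.load_count += 1
        return self._tools

    def reset(self) -> None:
        # Drop cached tools; the next get() reloads from current configuration.
        self._tools = None


servers = ["github"]
cache = LazyToolsCache(lambda: list(servers))
before = cache.get()
servers.append("filesystem")   # configuration changes out of band
stale = cache.get()            # still the cached list
cache.reset()
after = cache.get()            # reloaded on next use
```

The cache never inspects a file mtime: only an explicit reset invalidates it, which matches the shift from mtime detection to explicit resets in the flow above.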
@@ -33,6 +33,7 @@ def test_public_paths(path: str):
    [
        "/api/models",
        "/api/mcp/config",
        "/api/mcp/cache/reset",
        "/api/memory",
        "/api/skills",
        "/api/threads/123",
@@ -149,6 +150,10 @@ def _make_app():
    async def mcp_put():
        return {"ok": True}
    @app.post("/api/mcp/cache/reset")
    async def mcp_cache_reset():
        return {"ok": True}
    @app.delete("/api/threads/abc")
    async def thread_delete():
        return {"ok": True}
@@ -360,6 +365,11 @@ def test_protected_post_no_cookie_returns_401(client):
    assert res.status_code == 401
 def test_mcp_cache_reset_post_no_cookie_returns_401(client):
    res = client.post("/api/mcp/cache/reset")
    assert res.status_code == 401
 def test_protected_post_with_internal_auth_header_passes():
    from app.gateway.internal_auth import create_internal_auth_headers
@@ -148,6 +148,27 @@ class TestMessageBus:
        _run(go())
    def test_unsubscribe_outbound_removes_fresh_bound_method_reference(self):
        bus = MessageBus()
        received = []
        class Handler:
            async def callback(self, msg):
                received.append((self, msg))
        handler = Handler()
        other_handler = Handler()
        async def go():
            bus.subscribe_outbound(handler.callback)
            bus.subscribe_outbound(other_handler.callback)
            bus.unsubscribe_outbound(handler.callback)
            out = OutboundMessage(channel_name="test", chat_id="c1", thread_id="t1", text="reply")
            await bus.publish_outbound(out)
            assert received == [(other_handler, out)]
        _run(go())
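
The test above exercises the `is not` to `!=` change in `unsubscribe_outbound`. A minimal standalone illustration of why identity fails for bound methods follows. The `Handler` class here is a throwaway example, and the identity result reflects CPython behaviour, where each attribute access on an instance builds a new bound-method object.

```python
class Handler:
    def callback(self, msg):
        return msg


handler = Handler()
first = handler.callback    # bound method created on this access
second = handler.callback   # a second, distinct bound-method object

is_same_object = first is second            # identity: different objects
is_equal = first == second                  # equality: same instance and function
other_equal = first == Handler().callback   # different instance, so not equal

listeners = [first]
kept_with_is = [cb for cb in listeners if cb is not second]   # nothing removed
kept_with_ne = [cb for cb in listeners if cb != second]       # listener removed
```

With the identity filter, a caller who unsubscribes using a fresh `handler.callback` reference silently leaves the original listener registered. The equality filter removes it while still keeping callbacks bound to other instances.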
    def test_outbound_error_does_not_crash(self):
        bus = MessageBus()
@@ -12,6 +12,7 @@ from types import SimpleNamespace
 import pytest
 from fastapi import HTTPException
 from app.gateway.routers import mcp as mcp_router
 from app.gateway.routers.mcp import (
    _MCP_STDIO_COMMAND_ALLOWLIST_ENV,
    McpConfigUpdateRequest,
@@ -21,6 +22,8 @@ from app.gateway.routers.mcp import (
    _merge_preserving_secrets,
    _require_admin_user,
    _validate_mcp_update_request,
    reset_mcp_tools_cache_endpoint,
    update_mcp_configuration,
 )
 # ---------------------------------------------------------------------------
@@ -339,6 +342,71 @@ async def test_mcp_config_requires_admin_user():
    assert exc_info.value.status_code == 403
@pytest.mark.asyncio
 async def test_reset_mcp_tools_cache_endpoint_requires_admin_user(monkeypatch):
    called = False
    def fake_reset_mcp_tools_cache():
        nonlocal called
        called = True
    monkeypatch.setattr(mcp_router, "reset_mcp_tools_cache", fake_reset_mcp_tools_cache)
    response = await reset_mcp_tools_cache_endpoint(_request_with_role("admin"))
    assert called is True
    assert response.success is True
    assert "next use" in response.message
    with pytest.raises(HTTPException) as exc_info:
        await reset_mcp_tools_cache_endpoint(_request_with_role("user"))
    assert exc_info.value.status_code == 403
@pytest.mark.asyncio
 async def test_update_mcp_configuration_resets_tools_cache(monkeypatch, tmp_path):
    reset_calls = 0
    config_path = tmp_path / "extensions_config.json"
    config_path.write_text('{"mcpServers": {}, "skills": {}}', encoding="utf-8")
    current_config = SimpleNamespace(skills={}, mcp_servers={})
    reloaded_config = SimpleNamespace(
        mcp_servers={
            "github": McpServerConfigResponse(
                type="stdio",
                command="npx",
                args=["-y", "@modelcontextprotocol/server-github"],
            )
        }
    )
    def fake_reset_mcp_tools_cache():
        nonlocal reset_calls
        reset_calls += 1
    monkeypatch.setattr(mcp_router.ExtensionsConfig, "resolve_config_path", lambda: config_path)
    monkeypatch.setattr(mcp_router, "get_extensions_config", lambda: current_config)
    monkeypatch.setattr(mcp_router, "reload_extensions_config", lambda: reloaded_config)
    monkeypatch.setattr(mcp_router, "reset_mcp_tools_cache", fake_reset_mcp_tools_cache)
    response = await update_mcp_configuration(
        _request_with_role("admin"),
        McpConfigUpdateRequest(
            mcp_servers={
                "github": McpServerConfigResponse(
                    type="stdio",
                    command="npx",
                    args=["-y", "@modelcontextprotocol/server-github"],
                )
            }
        ),
    )
    assert reset_calls == 1
    assert list(response.mcp_servers) == ["github"]
 def test_validate_mcp_update_allows_default_npx_stdio_command(monkeypatch):
    monkeypatch.delenv(_MCP_STDIO_COMMAND_ALLOWLIST_ENV, raising=False)
    request = McpConfigUpdateRequest(
@@ -0,0 +1,69 @@
 # DeerFlow Maintainer Orchestrator — design notes
 This document explains the *thinking* behind the `deerflow-maintainer-orchestrator` skill: what it is for, the boundaries that make it safe to run, and the principles that shape how it reviews. It is written for DeerFlow maintainers who run the skill, and for anyone in the community who wants to understand — or adapt — the pattern of delegating issue and PR triage to an agent.
 It is **not** a rule reference. The exact resolution commands, comment templates, severity definitions, and validation matrix live in the skill itself, which is the canonical executable contract:
 > `.agent/skills/deerflow-maintainer-orchestrator/SKILL.md`
 When the two disagree, the skill wins and this document should be updated to match.
 ## What problem it solves
 Triage is repetitive and easy to defer. A maintainer has to open each issue or PR, reconstruct context, judge severity, and write a comment that actually helps the author move forward. The skill turns a *bounded scope* — some issue or PR numbers, a count, or a time window — into evidence-backed comments through a fixed workflow, without turning routine judgment back into questions for the maintainer and without handing the analysis back for them to finish.
 The goal is leverage, not autonomy. The maintainer still owns every decision that matters; the skill does the legwork and makes a concrete, defensible recommendation inside each comment.
 ## The safety model: comment-only
 The most important property of this skill is the surface it is *not* allowed to touch. It operates entirely on the **comment plane**: resolve scope, read evidence, post or draft issue comments and PR review comments. It does not write code, manage branches, close or label artifacts, or cut releases.
 This is a deliberate trust boundary. A comment is the lowest-risk, most reversible action an agent can take on a repository — a wrong comment costs a correction, while a wrong merge, force-push, or release costs far more. Keeping the agent on the comment plane is what makes it safe to run over a batch of real PRs without pre-auditing every step for irreversible damage.
 ## When it posts: a deliberately high bar
 Public review noise erodes trust faster than the occasional missed nit, so posting is conservative and gated on two **independent** axes:
 - **Confidence** — is the problem real?
 - **Severity** — P0/P1/P2: how bad if it is real?
 A finding reaches the **public** surface only when it is high-confidence *and* at least P2. "No high-confidence findings" means none across P0/P1/P2 — not merely "no P0." A public P2 carries one extra guard: the diff under review must itself introduce or worsen the issue, so the skill does not lecture an author about pre-existing behavior their change merely touches, or about a change that is already a net improvement.
 Everything real but below that bar — net-improvement nits, bounded risks, low-confidence hypotheses, pre-existing issues — goes to a **maintainer-only notes channel** in the run result, never to a public comment. The maintainer still sees the signal; the author's thread stays clean.
 ## How it treats existing coverage
 Existing comments suppress duplicate *posting*, not *analysis*. The skill always analyzes the artifact in full, because a prior review may have caught one problem and missed another. It then posts only the net-new delta, explicitly building on what is already there, and stays silent when there is nothing to add. Re-running is safe by design: the skill treats its own earlier comments as covered and never stacks a second comment that repeats the first.
 ## Principles that shape the review
 A handful of ideas do most of the work:
 - **Evidence over a green check.** CI status is a signal, not a verdict. A green rollup never excuses reading the changed code path, and a failing required check is itself a finding. Tests passing does not prove the changed branch is exercised.
 - **Review the right diff.** A finding is only as trustworthy as the diff it is computed against. The skill compares against a freshly fetched base rather than a stale local branch, resolves fork PR heads explicitly, records the reviewed head SHA, and re-checks it before posting — because a review against a diff the PR no longer has is worse than no review.
 - **Reason about batches, not just artifacts.** Related PRs are clustered and reviewed in one context, then a synthesis pass reports cross-PR interactions: overlapping files, merge-order and conflict surface, and composition risk where each change is safe alone but unsafe together. Reviewing related PRs in isolation is how you patch one and break another.
 - **Compare competing PRs fairly.** When several PRs target the same issue, they are scored against the *issue's acceptance criteria* as the rubric. The ranking is for the maintainer; the public surface stays per-PR and constructive, and never tells one author their PR is worse than a competitor's.
 ## What it deliberately does not do
 Scope discipline is part of the design, not an omission:
 - It stays on the comment plane (above) — no code, branch, or release actions.
 - It keeps its review heuristics focused and delegates detection that other tools already own. Blocking-IO on the event loop, for example, is covered by the CI blocking-IO gate and a dedicated `blocking-io-guard` skill, so it is intentionally left out of this skill's heuristics rather than duplicated. Separation of concerns keeps each tool sharp.
 - It keeps private reasoning, credentials, and security-exploit detail out of public comments; sensitive issues are described by impact and remediation only.
 ## How a maintainer runs it
 Give it a scope: issue or PR numbers, a URL, a count, or a time window. It resolves the artifacts with GitHub tooling and returns posted comment and review URLs, clean results, already-covered notes, maintainer-only notes, a batch synthesis, or — when you ask for analysis only — drafts to review before anything is posted. It does not ask routine clarifying questions; it stops and reports only when scope genuinely cannot be resolved, access fails, the request leaves comment-only scope, or posting would require non-public context.
 Output language follows the artifact: Chinese issues and PRs get Chinese comments, English gets English.
 ## Adapting this pattern
 If you are building something similar for your own project, three choices carry most of the value and transfer cleanly:
 1. **Keep the agent on a reversible surface** (comments) until you trust it; reversibility is what lets you run it unattended.
 2. **Gate public output on confidence and severity together**, with a private channel for everything below the bar — a reviewer that posts everything it notices is quickly muted.
 3. **Make the agent prove it reviewed the current diff** before it speaks.
 The rest — surfaces, severity labels, validation commands, output formats — is project-specific and belongs in the skill, not in a document like this one.
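As a rough illustration of the second choice, the gate can be sketched as a function that requires both conditions before anything goes public. The `Finding` fields, the numeric severity scale, and the default thresholds are all assumptions made for this sketch, not values taken from the skill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    title: str
    severity_rank: int  # hypothetical scale: 0 is most severe
    confidence: float   # 0.0 to 1.0, however the agent estimates it


def route(finding: Finding, min_confidence: float = 0.8, max_rank: int = 1) -> str:
    """Return "public" only when both gates pass; otherwise "private"."""
    confident = finding.confidence >= min_confidence
    severe_enough = finding.severity_rank <= max_rank
    # Anything that fails either gate goes to the maintainer-only channel.
    return "public" if confident and severe_enough else "private"
```

Requiring both conditions means a severe but speculative finding still reaches the maintainer privately instead of being dropped or posted.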
@@ -1,172 +0,0 @@
 # DeerFlow Maintainer Orchestrator SOP
 This SOP defines how DeerFlow maintainers should use the repository-local `deerflow-maintainer-orchestrator` skill for comment-only GitHub issue handling and PR review.
 The goal is practical automation: the maintainer provides an issue or PR scope, and the agent resolves the artifacts with GitHub tools, analyzes DeerFlow context, and posts or drafts useful comments. The skill should not turn routine judgment into maintainer questions or offload technical analysis back to the maintainer.
 The local skill lives at `.agent/skills/deerflow-maintainer-orchestrator/SKILL.md`.
 ## Scope
 - **Issue Flow** analyzes GitHub issues and posts or drafts issue comments.
 - **PR Review Flow** reviews GitHub pull request diffs and posts or drafts PR review comments.
 - The skill is a comment-plane workflow. It does not implement code changes, manage branches, close artifacts, publish releases, or perform non-comment maintainer actions.
 ## Comment Authorization
 When the maintainer asks to process, handle, comment on, or review a bounded set of issues or PRs, the skill may post one public issue comment per selected non-skipped issue and one PR review comment per selected PR with high-confidence findings.
 If a PR has no high-confidence findings, the skill should not post a public review/comment. It should report that clean result to the maintainer only.
 When the maintainer explicitly asks for analysis only, the skill should return comment-ready drafts without posting.
 The maintainer's normal interaction should be: provide scope; receive posted comment URLs, PR review URLs, clean results, skipped items, failures, or drafts.
 The skill should not announce its own name, mode, or "no code edited" status in normal output. Those are process details, not maintainer signal.
 ## Language
 The output language should match the issue or PR language unless the maintainer asks otherwise. Chinese issues/PRs get Chinese analysis and comments; English issues/PRs get English analysis and comments. Logs, stack traces, and code snippets do not determine the response language.
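One way to approximate this rule is to strip code before counting characters. The sketch below is a heuristic under stated assumptions: the `response_language` name, the 0.2 ratio, and the two-language result are illustrative, and it only ignores fenced or inline code, so logs pasted as plain text would still count as prose.

```python
import re

# Build the fence marker programmatically so this example stays easy to embed.
FENCE_MARK = "`" * 3
FENCED = re.compile(re.escape(FENCE_MARK) + r".*?" + re.escape(FENCE_MARK), re.DOTALL)
INLINE_CODE = re.compile(r"`[^`\n]*`")
CJK = re.compile(r"[\u4e00-\u9fff]")


def response_language(title: str, body: str, min_cjk_ratio: float = 0.2) -> str:
    """Return "zh" or "en" based on prose only, ignoring fenced and inline code."""
    prose = FENCED.sub(" ", f"{title}\n{body}")
    prose = INLINE_CODE.sub(" ", prose)
    letters = [ch for ch in prose if ch.isalpha()]
    if not letters:
        return "en"  # assumed fallback when no prose remains
    cjk = sum(1 for ch in letters if CJK.match(ch))
    return "zh" if cjk / len(letters) >= min_cjk_ratio else "en"
```

A maintainer override, as the rule above allows, would simply bypass this function.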
 ## Artifact Resolution
 The skill should resolve issue/PR scope through GitHub tools before considering any clarification.
 1. Default repository: `bytedance/deer-flow`, unless a URL or explicit repo says otherwise.
 2. URLs route directly: `/issues/<number>` uses Issue Flow; `/pull/<number>` uses PR Review Flow.
 3. Typed numbers use typed commands:
   - Issue: `gh issue view <number> --repo <repo> --json number,title,url,state,body,labels,author,comments`
   - PR: `gh pr view <number> --repo <repo> --json number,title,url,state,body,author,files,comments,reviews,statusCheckRollup,baseRefName,headRefName`
 4. Normalize multiple explicit references such as `#123`, `# 123`, and bare `123` into a number list, preserving order and de-duplicating exact repeats.
 5. Untyped numbers are resolved by trying `gh pr view <number>` first, then `gh issue view <number>`.
 6. Issue batches use `gh issue list`; PR batches use `gh pr list`. Do not use a mixed issue endpoint as the source for both queues, because GitHub's REST issues endpoint also returns pull requests.
 7. Respect the maintainer's requested count or time window. There is no hard five-item cap.
 8. If the scope is broad and underspecified, choose a practical recent slice, state the slice used, prioritize the newest and highest-risk items, and report the unprocessed remainder.
 9. Use `gh api` when view/list commands lack fields such as review threads or precise filters.
 10. Use GitHub search only as a fallback for natural-language filters that cannot be represented by view/list/API calls.
 11. If no artifact scope can be resolved through URLs, numbers, `gh`, API, or search fallback, return a compact failure report instead of asking a question.
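Steps 4 and 5 above can be sketched as follows. The helper names (`normalize_refs`, `resolve_type`, `gh_succeeds`) are hypothetical, and the check runner is injectable so the routing logic can be exercised without calling GitHub; the real skill may resolve artifacts differently.

```python
import re
import subprocess
from typing import Callable

# Matches "#123", "# 123", and bare "123".
REF = re.compile(r"#\s*(\d+)|\b(\d+)\b")


def normalize_refs(text: str) -> list[int]:
    """Return referenced numbers in first-seen order without repeats."""
    seen: set[int] = set()
    numbers: list[int] = []
    for match in REF.finditer(text):
        number = int(match.group(1) or match.group(2))
        if number not in seen:
            seen.add(number)
            numbers.append(number)
    return numbers


def gh_succeeds(args: list[str]) -> bool:
    """Run a gh command and report whether it exited with status 0."""
    return subprocess.run(["gh", *args], capture_output=True).returncode == 0


def resolve_type(
    number: int,
    repo: str,
    succeeds: Callable[[list[str]], bool] = gh_succeeds,
) -> str:
    """Try the PR view first, then the issue view, as step 5 describes."""
    for kind in ("pr", "issue"):
        if succeeds([kind, "view", str(number), "--repo", repo, "--json", "number"]):
            return kind
    return "unresolved"  # caller emits the compact failure report
```

For example, `normalize_refs("#123, # 123 and 45")` yields `[123, 45]`.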
 Maintainer reports and comments can use concise repo-local references such as `#123` and `PR #123`. Include full GitHub URLs only for posted comment/review links returned by GitHub or when the maintainer supplied an explicit URL.
 ## Issue Flow
 For each issue, first perform a cheap precheck: read issue metadata, labels, author, body, and existing comments. If labels, title, or body mark the issue as RFC (`rfc`, `[RFC]`, `RFC:`, or `Request for Comments`), classify it as `rfc-no-comment`, skip deep analysis, and do not post anything public unless the maintainer explicitly overrides the RFC skip for that item. If a maintainer or trusted agent already posted an equivalent diagnosis, modification suggestion, information request, or blocking decision, skip deep analysis and do not post anything public for that issue.
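The RFC part of this precheck could look roughly like the sketch below. The `is_rfc` name is hypothetical, and case-insensitive matching anywhere in the title or body is an assumption; a stricter version might only inspect the title prefix to avoid matching issues that merely cite an RFC.

```python
import re

# Markers named in the precheck: [RFC], RFC:, Request for Comments.
RFC_TEXT = re.compile(r"\[RFC\]|\bRFC:|Request for Comments", re.IGNORECASE)


def is_rfc(title: str, body: str, labels: list[str]) -> bool:
    """Return True when a label, the title, or the body marks the issue as RFC."""
    if any(label.strip().lower() == "rfc" for label in labels):
        return True
    return bool(RFC_TEXT.search(title) or RFC_TEXT.search(body))
```

An issue for which this returns `True` would be classified `rfc-no-comment` unless the maintainer overrides the skip.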
 If the precheck does not skip the issue, gather the issue body, comments, screenshots, logs, reproduction details, linked artifacts, and relevant DeerFlow code/docs.
 The public issue comment should start naturally, then move quickly into execution guidance. Prefer a short opener like `Thanks @author. <specific context sentence>.` when the issue is reporter-authored and the mention reads naturally. Omit the mention for bots, maintainer-authored tracking issues, or cases where it would add noise.
 Do not include internal analysis labels or generic assessment openers such as "This is actionable", "I would treat this as", `ready-to-fix`, surface labels, or risk labels. Use the smallest stable template that fits:
 ```text
 Thanks @author. <one specific sentence that frames the fix, investigation, or missing evidence.>
 Recommended solution:
 - ...
 Validation:
 - ...
 ```
 Add optional sections only when they add signal:
 - `Evidence:` for concrete code, logs, reproduction details, or proof.
 - `Risk:` for specific architecture, security, public API, default behavior, or compatibility impact.
 - `Missing info:` when the issue cannot be diagnosed without more evidence.
 Put relevant files/components inside `Evidence:` or `Recommended solution:` bullets. Every posted issue comment should contain concrete modification guidance and validation guidance unless the only useful response is `Missing info:`.
 Architecture and security concerns should be explained in the comment when they are relevant. They are not reasons to ask the maintainer what to do. Avoid private reasoning, credentials, internal-only context, exploit instructions, and unsupported promises.
 Immediately before posting, refresh comments and skip if an equivalent maintainer or trusted-agent comment appeared during analysis.
 ## PR Review Flow
 For each PR, first perform a cheap duplicate-review precheck: read PR metadata, changed file list, checks summary, existing PR reviews, existing comments, and review threads when available. If a maintainer or trusted agent already posted equivalent findings or a blocking decision, skip deep review and do not post another review comment.
 Before local diff review, establish the base from the base repository, not from local `main`. Prefer GitHub PR base metadata for PR target branches; for non-PR local diffs, use the base repository default branch. Fetch that branch with a command that updates the remote-tracking ref, such as `git fetch <base-remote> +refs/heads/<base-branch>:refs/remotes/<base-remote>/<base-branch>`, or use the verified `FETCH_HEAD` immediately. In fork checkouts this is usually `upstream/main`; in direct upstream checkouts this is usually `origin/main`. Use a merge-base or three-dot diff from the fetched base. If local base resolution fails, use the GitHub PR files/diff as source of truth.
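The fetch and diff range described above can be assembled as shown in this sketch. The function names are hypothetical, and preferring `upstream` whenever that remote exists is an assumption that mirrors the "usually" wording rather than a guaranteed rule.

```python
def pick_base_remote(remotes: list[str]) -> str:
    """Prefer "upstream" (typical for fork checkouts), else "origin"."""
    return "upstream" if "upstream" in remotes else "origin"


def base_fetch_command(base_remote: str, base_branch: str) -> list[str]:
    """Build a git fetch that force-updates the remote-tracking ref."""
    refspec = f"+refs/heads/{base_branch}:refs/remotes/{base_remote}/{base_branch}"
    return ["git", "fetch", base_remote, refspec]


def three_dot_range(base_remote: str, base_branch: str, head: str = "HEAD") -> str:
    """Return a three-dot range, which diffs head against the merge base."""
    return f"{base_remote}/{base_branch}...{head}"
```

The returned argument list can be passed to `subprocess.run`, and the range string to `git diff`.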
 Review only the current diff and changed files. Do not comment on unrelated pre-existing code unless the diff makes it newly risky. Do not report low-confidence guesses.
 Prioritize correctness, safety, maintainability, production risk, compatibility, and missing critical tests. Architecture, security, public API, default-behavior, and compatibility problems should be reported as findings when the diff causes or exposes them.
 For public PR reviews with findings, start with one short opener that fits the review context and matches the finding count. Use singular wording only for exactly one finding, for example `Thanks @author. I found one issue that should be addressed before this is ready.` Use plural wording for multiple findings, for example `Thanks @author. I found a few issues that should be addressed before this is ready.` Omit the mention for bots or when it adds noise.
 Use this finding format:
 ```text
 [P0/P1/P2] Title
 - Location: file and line/range
 - Problem: what can go wrong
 - Evidence: why the diff causes it
 - Suggested fix: concrete minimal fix
 - Test: what test should cover it
 ```
 Severity:
 - `P0`: causes outage, data loss, security breach, or build failure.
 - `P1`: likely production bug, serious regression, broken compatibility, or high-risk security/architecture issue.
 - `P2`: correctness, maintainability, or test concern with lower risk.
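The opener rule and finding format above can be rendered mechanically. This is a minimal sketch: the `Finding` dataclass and function names are hypothetical, and only the labels and wording come from the text above.

```python
from dataclasses import dataclass

SEVERITIES = ("P0", "P1", "P2")


@dataclass(frozen=True)
class Finding:
    severity: str
    title: str
    location: str
    problem: str
    evidence: str
    suggested_fix: str
    test: str


def render_finding(finding: Finding) -> str:
    """Render one finding in the documented bullet format."""
    if finding.severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {finding.severity}")
    return "\n".join([
        f"[{finding.severity}] {finding.title}",
        f"- Location: {finding.location}",
        f"- Problem: {finding.problem}",
        f"- Evidence: {finding.evidence}",
        f"- Suggested fix: {finding.suggested_fix}",
        f"- Test: {finding.test}",
    ])


def opener(author: str, count: int, mention: bool = True) -> str:
    """Use singular wording only for exactly one finding."""
    if count < 1:
        raise ValueError("clean reviews are not posted, so no opener is needed")
    noun = "one issue" if count == 1 else "a few issues"
    prefix = f"Thanks @{author}. " if mention else ""
    return f"{prefix}I found {noun} that should be addressed before this is ready."
```

Passing `mention=False` covers bots and cases where the mention adds noise.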
 If there are no high-confidence findings, do not post a public PR review/comment. Report `No high-confidence review findings.` to the maintainer in the run result.
 Immediately before posting, refresh reviews/comments and skip if an equivalent maintainer or trusted-agent review appeared during analysis.
 ## No-Question Policy
 The skill should not ask routine clarification questions. It should use the workflow to resolve scope and produce comments.
 Stop without asking only when:
 - no issue/PR scope can be resolved through URLs, numbers, `gh` view/list, `gh api`, or GitHub search fallback;
 - GitHub authentication, repository access, or comment posting fails;
 - the requested action is outside comment-only scope;
 - posting would require private credentials, private security details, or non-public context.
 In these cases, return a compact failure report with the attempted command path and the smallest next action. Do not phrase it as a question unless the maintainer explicitly asks to be prompted.
 ## DeerFlow Heuristics
 Treat these as high-signal areas for issue comments and PR findings:
 - `backend/packages/harness/deerflow/` must not import `app.*`.
 - App may depend on harness; harness must stay publishable and app-agnostic.
 - Frontend thread/message behavior and Gateway/LangGraph-compatible SSE are contract surfaces.
 - Sandbox permissions, bash/file-write tools, skill installation, and remote execution are security-sensitive.
 - Default model/provider behavior, config migration, persistence schema, public API/SSE, and LangGraph thread/run lifecycle are compatibility-sensitive.
 - Runtime docs should track user-facing or developer-facing behavior changes.
 - Security-sensitive comments should provide proof and remediation, not vague assertions.
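The first heuristic lends itself to a mechanical check. The sketch below scans a module's source for absolute imports of `app`; it is an illustration only, with a hypothetical `app_imports` helper, and is not the repository's own boundary test.

```python
import ast


def app_imports(source: str) -> list[str]:
    """Return absolute imports of `app` or `app.*` found in the source."""
    found: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 skips relative imports such as `from . import app`.
            names = [node.module]
        else:
            continue
        found.extend(n for n in names if n == "app" or n.startswith("app."))
    return found
```

Running it over each changed file under `backend/packages/harness/deerflow/` would flag a new violation; note that it does not catch dynamic imports such as `importlib.import_module("app")`.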
 ## Validation Guidance
 | Surface | Suggested evidence |
 | --- | --- |
 | Backend API / harness / agents / MCP / runtime skills | `cd backend && make lint && make test` |
 | Blocking IO or async IO risk | `cd backend && make test-blocking-io` or a focused regression test |
 | Harness/app boundary | `cd backend && uv run pytest tests/test_harness_boundary.py` |
 | Frontend UI/core | `cd frontend && pnpm format && pnpm lint && pnpm typecheck && BETTER_AUTH_SECRET=local-dev-secret pnpm build && make test` |
 | Front/back thread or SSE contract | backend replay golden and full-stack replay render where feasible |
 | Frontend user workflow | Playwright E2E or browser proof with screenshot/DOM assertion |
 | Docker/sandbox/provisioner | focused backend tests plus Docker/provisioner smoke when feasible |
 | Docs-only | targeted markdown review |
 ## Output
 For Issue Flow, report posted, skipped, failed, and per-issue comment status. For analysis-only requests, report drafted comments instead of posted comments.
 For PR Review Flow, report reviewed, skipped, clean, failed, and per-PR review status. `Clean` means no high-confidence findings and no public comment posted.
 For batches, prefer a compact maintainer-facing table:
 ```text
 | Artifact | Status | Public action | Notes |
 | --- | --- | --- | --- |
 | #123 | posted | comment URL | short reason |
 | PR #456 | reviewed | review URL | P1: finding title |
 | PR #789 | clean | none | No high-confidence review findings. |
 | #321 | skipped | none | existing maintainer comment |
 ```
 Omit empty categories, no-op fields, routine command output, and raw logs. Report meaningful changes, evidence, and options.
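The batch table above is simple enough to generate from per-artifact results. In this sketch the row tuple layout and the `render_batch_table` name are assumptions; an empty public action is rendered as `none`, matching the example rows.

```python
from typing import Iterable

HEADER = ("Artifact", "Status", "Public action", "Notes")


def _row(cells: Iterable[str]) -> str:
    return "| " + " | ".join(cells) + " |"


def render_batch_table(rows: list[tuple[str, str, str, str]]) -> str:
    """Render (artifact, status, public action, notes) rows as a Markdown table."""
    lines = [_row(HEADER), _row(["---"] * len(HEADER))]
    for artifact, status, action, notes in rows:
        cells = [artifact, status, action or "none", notes]
        # Escape pipes so free-text notes cannot break the table layout.
        lines.append(_row(cell.replace("|", "\\|") for cell in cells))
    return "\n".join(lines)
```

Callers would build one tuple per processed artifact and leave out anything that falls under the omission rule above.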