From 799bef6d9dbc3a2cb37ce8177eeeabe2a33d8971 Mon Sep 17 00:00:00 2001
From: Xinmin Zeng <135568692+fancyboi999@users.noreply.github.com>
Date: Mon, 8 Jun 2026 17:32:41 +0800
Subject: [PATCH] fix(replay-e2e): match by conversation, not the living system prompt (#3436)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* fix(replay-e2e): match by conversation, not the living system prompt

The model-replay match key hashed the full input including the lead-agent
system prompt. That prompt is edited frequently (e.g. #3195 added a
"File Editing Workflow" section), so the committed fixture went stale the
moment the prompt changed on main — turning the Layer-2 render gate RED on
every unrelated PR (#3430, #3432, ...). This was a self-inflicted false
positive.

Root-cause fix:
- replay_provider._canonical_messages now EXCLUDES the system message from
  the hash. The conversation (human/ai/tool) is the stable contract that
  identifies a recorded turn; the system prompt is an internal detail not
  part of the front-back contract under test. (Mirrors how open-design keys
  its mock picker on the user prompt, not the system internals.) Proven
  robust: injecting a prompt edit no longer causes a replay miss.
- Layer-1 golden was BLIND to replay misses: the gateway swallows a miss
  into an assistant error message, so the shape-only golden stayed green on
  a stale fixture. It now inspects replay_provider.replay_misses() and fails
  loud. (Layer-2 already fails on a miss.)
- Re-recorded write_read_file.ultra fixture + regenerated golden under the
  new conversation-only hash.
- Layer-2 render spec: assert the in-graph auto-title (deterministic); the
  follow-up suggestion is fired async and depends on a clean JSON model
  output, so assert it only when the fixture captured one — never gate on
  its absence (recording flakiness must not block CI).
- docs: REPLAY_E2E.md updated.
Verified: Layer-1 golden green (no miss), Layer-2 both specs green,
CI=true make test 4033 passed / 0 failed, frontend pnpm check clean.

* test(replay-e2e): restore suggestions coverage with a reliable capture

Addresses review feedback (the suggestion path was dropped from Layer-2):

- record spec now waits for the `/suggestions` response before checking
  capture stability, so the recorded fixture reliably includes the
  frontend-fired suggestions turn (previously the stability window could
  return before suggestions fired, yielding a fixture without it).
- Re-recorded write_read_file.ultra: 5 turns (write_file, auto-title,
  read_file, answer, suggestions). Golden unchanged — suggestions is a
  separate /suggestions call, not part of the /runs/stream SSE sequence.
- Layer-2 spec: restore the hard `EXPECTED_SUGGESTION` assertion. With the
  record spec now waiting for /suggestions, a fixture missing the suggestion
  turn means a broken recording and must fail loud, not pass silently.

Verified: Layer-1 golden green (no miss), Layer-2 both specs green
(auto-title + suggestion render), frontend pnpm check clean.

* ci: re-trigger (flaky Docker Hub image pull in sandbox e2e, unrelated)

backend-unit-tests failed only in test_sandbox_orphan_reconciliation_e2e.py
with 'docker pull busybox:latest ... context deadline exceeded' — a
CI-runner network flake reaching Docker Hub, not related to this
docs/tests-only change. Empty commit to re-run CI.
--------- Co-authored-by: DanielWalnut <45447813+hetaoBackend@users.noreply.github.com> --- backend/docs/REPLAY_E2E.md | 25 +++- .../replay/write_read_file.ultra.events.json | 60 ++++++++++ .../replay/write_read_file.ultra.json | 107 ++++++++++-------- backend/tests/replay_provider.py | 40 ++++++- backend/tests/test_replay_golden.py | 16 ++- .../real-backend-render.spec.ts | 10 +- .../e2e-record/record-write-read-file.spec.ts | 10 ++ 7 files changed, 202 insertions(+), 66 deletions(-) diff --git a/backend/docs/REPLAY_E2E.md b/backend/docs/REPLAY_E2E.md index 546e160c2..cd9920b4c 100644 --- a/backend/docs/REPLAY_E2E.md +++ b/backend/docs/REPLAY_E2E.md @@ -50,12 +50,25 @@ gateway's own run/event stores using the request's auth context, so the real ## How replay works `tests/replay_provider.py::ReplayChatModel` returns recorded assistant turns keyed -by a **normalized hash** of the model input (strips ``, dates, -UUIDs, tmp paths). A miss raises loudly rather than passing silently. The system -prompt is made environment-independent by pinning skills + extensions empty and -disabling memory/summarization (`tests/_replay_fixture.py::build_config_yaml`), so -a fixture replays the same across machines, days, and CI. Replaying needs **no -API key**. +by a **normalized hash of the conversation** (human / ai / tool messages — role, +text, tool-call name+args; with ``, dates, UUIDs, tmp paths +stripped). A miss raises loudly rather than passing silently. + +**The system prompt is excluded from the match key.** The lead-agent system +prompt is a living, frequently-edited implementation detail — its wording changes +across PRs (e.g. #3195 added a "File Editing Workflow" section). Hashing it would +make every fixture go stale and red-fail unrelated PRs the moment anyone edits the +prompt. The conversation flow (user input → tool calls → results → answer) is the +stable contract that identifies a recorded turn. 
(This mirrors how open-design's +mock picker keys on the user prompt, not the system internals.) Combined with +pinning skills + extensions empty and disabling memory/summarization +(`tests/_replay_fixture.py::build_config_yaml`), a fixture replays the same across +machines, days, prompt edits, and CI. Replaying needs **no API key**. + +A swallowed hash-miss keeps the SSE *event shapes* identical (the gateway wraps it +into a normal assistant error message), so the Layer-1 golden can't catch a miss +by shape alone — it inspects `replay_provider.replay_misses()` and fails loud +instead. Layer-2 already fails on a miss (the recorded turns never render). ## Record a new scenario (needs a real key — dev machine only) diff --git a/backend/tests/fixtures/replay/write_read_file.ultra.events.json b/backend/tests/fixtures/replay/write_read_file.ultra.events.json index 3a4f8c041..babf90d9d 100644 --- a/backend/tests/fixtures/replay/write_read_file.ultra.events.json +++ b/backend/tests/fixtures/replay/write_read_file.ultra.events.json @@ -64,6 +64,66 @@ "viewed_images" ] }, + { + "event": "values", + "keys": [ + "artifacts", + "messages", + "thread_data", + "title", + "viewed_images" + ] + }, + { + "event": "values", + "keys": [ + "artifacts", + "messages", + "thread_data", + "title", + "viewed_images" + ] + }, + { + "event": "values", + "keys": [ + "artifacts", + "messages", + "thread_data", + "title", + "viewed_images" + ] + }, + { + "event": "values", + "keys": [ + "artifacts", + "messages", + "thread_data", + "title", + "viewed_images" + ] + }, + { + "event": "values", + "keys": [ + "artifacts", + "messages", + "thread_data", + "title", + "viewed_images" + ] + }, + { + "event": "values", + "keys": [ + "artifacts", + "messages", + "thread_data", + "title", + "viewed_images" + ] + }, { "event": "end", "keys": null diff --git a/backend/tests/fixtures/replay/write_read_file.ultra.json b/backend/tests/fixtures/replay/write_read_file.ultra.json index a534eb2eb..95cce6ce8 100644 
--- a/backend/tests/fixtures/replay/write_read_file.ultra.json +++ b/backend/tests/fixtures/replay/write_read_file.ultra.json @@ -1,7 +1,7 @@ { "scenario": "write_read_file", "mode": "ultra", - "model": "gpt-5.5", + "model": "sre/gpt-5", "prompt": "Using your own file tools directly, create the file /mnt/user-data/outputs/note.txt with exactly this content: hi from replay. Then read that same file back and reply with its exact contents. Do NOT delegate to a subagent and do NOT use the task tool — do it yourself. Do not ask any clarifying questions.", "context": { "is_bootstrap": false, @@ -12,7 +12,7 @@ }, "turns": [ { - "input_hash": "686cd44a9f17fadc0398768731324f3980480a027593a475fad4583581df677f", + "input_hash": "9c50eda6ab7e8593dabccbdeadc70a4a7bf778b2c0c3f275f1f96cf2c8ab58db", "output": { "type": "ai", "data": { @@ -20,36 +20,36 @@ "additional_kwargs": {}, "response_metadata": { "finish_reason": "tool_calls", - "model_name": "gpt-5.5", + "model_name": "sre/gpt-5", "model_provider": "openai" }, "type": "ai", "name": null, - "id": "lc_run--019e8c60-8d4b-79a1-8d77-0a67fc360ce4", + "id": "lc_run--019ea641-acda-7423-9a9f-79725057bc20", "tool_calls": [ { "name": "write_file", "args": { - "description": "Create requested note file", + "description": "Create the requested output file with exact content", "path": "/mnt/user-data/outputs/note.txt", - "content": "hi from replay" + "content": "hi from replay." 
}, - "id": "call_UdIzq5Vyx7pu1Usnj4wPCC6G", + "id": "call_FV7zhKonjx5CAa1RwIcKihpi", "type": "tool_call" } ], "invalid_tool_calls": [], "usage_metadata": { - "input_tokens": 3285, - "output_tokens": 66, - "total_tokens": 3351, + "input_tokens": 3664, + "output_tokens": 434, + "total_tokens": 4098, "input_token_details": { "audio": 0, - "cache_read": 0 + "cache_read": 3584 }, "output_token_details": { "audio": 0, - "reasoning": 21 + "reasoning": 384 } } } @@ -60,36 +60,36 @@ "output": { "type": "ai", "data": { - "content": "File Creation and Verification", + "content": "Direct File Creation and Readback", "additional_kwargs": {}, "response_metadata": { "finish_reason": "stop", - "model_name": "gpt-5.5", + "model_name": "sre/gpt-5", "model_provider": "openai" }, "type": "ai", "name": null, - "id": "lc_run--019e8c60-9c18-72c1-95e8-f6a240747395", + "id": "lc_run--019ea641-cf52-7793-900e-15ad4f032c0e", "tool_calls": [], "invalid_tool_calls": [], "usage_metadata": { "input_tokens": 104, - "output_tokens": 53, - "total_tokens": 157, + "output_tokens": 656, + "total_tokens": 760, "input_token_details": { "audio": 0, "cache_read": 0 }, "output_token_details": { "audio": 0, - "reasoning": 39 + "reasoning": 640 } } } } }, { - "input_hash": "92430ba866abe577c86d2e67eb5158b10f3f19ec306aa9de235bb06736320d70", + "input_hash": "6af134379b2a9efa01b4f63032f88211d5f38f459f8bed621eb6c65e8e05c1f9", "output": { "type": "ai", "data": { @@ -97,31 +97,31 @@ "additional_kwargs": {}, "response_metadata": { "finish_reason": "tool_calls", - "model_name": "gpt-5.5", + "model_name": "sre/gpt-5", "model_provider": "openai" }, "type": "ai", "name": null, - "id": "lc_run--019e8c60-b036-7710-8db9-717ab54e5805", + "id": "lc_run--019ea641-f523-7d60-a416-b051fba469a2", "tool_calls": [ { "name": "read_file", "args": { - "description": "Read requested note file", + "description": "Verify contents to echo back exactly", "path": "/mnt/user-data/outputs/note.txt" }, - "id": "call_0BFNns0FkRb3n2LR0PRrfbIJ", 
+ "id": "call_YevFCnLcjWfWHaZm8wwMpEk8", "type": "tool_call" } ], "invalid_tool_calls": [], "usage_metadata": { - "input_tokens": 3334, - "output_tokens": 33, - "total_tokens": 3367, + "input_tokens": 3719, + "output_tokens": 35, + "total_tokens": 3754, "input_token_details": { "audio": 0, - "cache_read": 0 + "cache_read": 3584 }, "output_token_details": { "audio": 0, @@ -132,29 +132,29 @@ } }, { - "input_hash": "8ab757aa51f9d556adcea07c0221445a2b791cc882ef11922babf7f2865d1913", + "input_hash": "04751c4f7b0107b78b5c97d417063883fd586f5ebcbc4acf79be6cb3c0cdaec1", "output": { "type": "ai", "data": { - "content": "hi from replay", + "content": "hi from replay.", "additional_kwargs": {}, "response_metadata": { "finish_reason": "stop", - "model_name": "gpt-5.5", + "model_name": "sre/gpt-5", "model_provider": "openai" }, "type": "ai", "name": null, - "id": "lc_run--019e8c60-bef3-7201-a30a-cbc5f45920ba", + "id": "lc_run--019ea641-ff38-7751-9c2b-cc648811883b", "tool_calls": [], "invalid_tool_calls": [], "usage_metadata": { - "input_tokens": 3380, - "output_tokens": 7, - "total_tokens": 3387, + "input_tokens": 3768, + "output_tokens": 8, + "total_tokens": 3776, "input_token_details": { "audio": 0, - "cache_read": 0 + "cache_read": 3584 }, "output_token_details": { "audio": 0, @@ -165,56 +165,65 @@ } }, { - "input_hash": "fd67723cc8810ce79b4785fec4c251a272a91d677a216c735b23b5f6d3dec0c3", + "input_hash": "8b98ebdbb53e88f000556c4753adede8eaa076ff6fd7b8a1285bfd18aee8144d", "output": { "type": "ai", "data": { - "content": "[\"Can you append another line to the file?\",\"Can you show the file path again?\",\"Can you delete the file now?\"]", + "content": "[\n \"Can you show the file size and last modified time of /mnt/user-data/outputs/note.txt?\",\n \"List the contents of /mnt/user-data/outputs/ to confirm the file exists.\",\n \"Append 'second line' to /mnt/user-data/outputs/note.txt and print its new contents.\"\n]", "additional_kwargs": { "refusal": null }, 
"response_metadata": { "token_usage": { - "completion_tokens": 71, + "completion_tokens": 909, "prompt_tokens": 224, - "total_tokens": 295, + "total_tokens": 1133, "completion_tokens_details": { "accepted_prediction_tokens": 0, "audio_tokens": 0, - "reasoning_tokens": 33, + "reasoning_tokens": 832, "rejected_prediction_tokens": 0 }, "prompt_tokens_details": { "audio_tokens": 0, "cached_tokens": 0 }, - "input_tokens": 0, - "output_tokens": 0, - "input_tokens_details": null + "latency_checkpoint": { + "engine_tbt_ms": 12, + "engine_ttft_ms": 324, + "engine_ttlt_ms": 10965, + "pre_inference_ms": 153, + "service_tbt_ms": 12, + "service_ttft_ms": 849, + "service_ttlt_ms": 11491, + "total_duration_ms": 11351, + "user_visible_ttft_ms": 696 + } }, "model_provider": "openai", - "model_name": "gpt-5.5", + "model_name": "sre/gpt-5", "system_fingerprint": null, - "id": "chatcmpl-DmaI5yVqQ39LRWyugoCEPalKw0gBR", + "id": "chatcmpl-DoPFALdwiyEDYOIN7wFYhqBrr6eTA", + "service_tier": "default", "finish_reason": "stop", "logprobs": null }, "type": "ai", "name": null, - "id": "lc_run--019e8c60-d025-7fd2-9cc9-8b4fb8fe1a82-0", + "id": "lc_run--019ea642-0eac-78f1-a506-931e343184f1-0", "tool_calls": [], "invalid_tool_calls": [], "usage_metadata": { "input_tokens": 224, - "output_tokens": 71, - "total_tokens": 295, + "output_tokens": 909, + "total_tokens": 1133, "input_token_details": { "audio": 0, "cache_read": 0 }, "output_token_details": { "audio": 0, - "reasoning": 33 + "reasoning": 832 } } } diff --git a/backend/tests/replay_provider.py b/backend/tests/replay_provider.py index c16c46448..ab2ef3791 100644 --- a/backend/tests/replay_provider.py +++ b/backend/tests/replay_provider.py @@ -76,6 +76,24 @@ from pydantic import PrivateAttr _FIXTURE_ENV = "DEERFLOW_REPLAY_FIXTURE" +# Process-wide record of replay misses. 
A miss raises inside the model, but the +# gateway's LLMErrorHandlingMiddleware swallows it into a normal assistant error +# message — so the SSE *event shapes* are unchanged and a shape-only golden stays +# green on a stale fixture. The in-process Layer-1 test inspects this list to fail +# loud on a miss instead. (Layer-2 already fails on a miss: the recorded turns +# never render.) +_replay_misses: list[str] = [] + + +def replay_misses() -> list[str]: + """Hashes that missed the fixture since the last reset (see ``_replay_misses``).""" + return list(_replay_misses) + + +def reset_replay_misses() -> None: + _replay_misses.clear() + + # Volatile substrings that differ between a recording run and a replay run but # carry no semantic weight for matching. Normalized to stable placeholders # before hashing so the same logical input hashes identically across processes. @@ -117,13 +135,24 @@ def _content_to_text(content: Any) -> str: def _canonical_messages(messages: list[BaseMessage]) -> str: """Project messages to a stable shape that excludes volatile metadata/ids. - Keeps only what determines the model's next output: role, text content, and - tool-call name+args. Drops ``id``, ``response_metadata``, ``usage_metadata``, - and ``tool_call_id`` (all volatile), then normalizes embedded volatile - substrings. + Keeps only what determines which recorded turn to replay: the conversation + (human / ai / tool messages — role, text content, tool-call name+args). Drops + ``id``, ``response_metadata``, ``usage_metadata``, ``tool_call_id`` (all + volatile), then normalizes embedded volatile substrings. + + **The system message is excluded entirely.** The lead-agent system prompt is + a living, frequently-edited implementation detail (its wording changes across + PRs), not part of the front-back contract this harness verifies. Hashing it + would make every fixture go stale — and red-fail on unrelated PRs — the moment + anyone edits the prompt. 
The conversation flow (user input -> tool calls -> + results -> answer) is the stable key that identifies a recorded turn. """ projected: list[dict[str, Any]] = [] for message in messages: + # Exclude the system prompt from the match key — see docstring. It is the + # most-edited part of the prompt and not part of the contract under test. + if message.type == "system": + continue content = _normalize_text(_content_to_text(message.content)) tool_calls = getattr(message, "tool_calls", None) # Drop messages that are empty after normalization — e.g. a turn that was @@ -189,6 +218,7 @@ class ReplayChatModel(BaseChatModel): key = hash_messages(messages) bucket = self._table.get(key) if not bucket: + _replay_misses.append(key) preview = _canonical_messages(messages) raise KeyError( f"replay miss: no recorded output for input hash {key} in {self._fixture_path!r}. " @@ -227,4 +257,4 @@ class ReplayChatModel(BaseChatModel): # Re-export so the recorder shares the exact hashing logic. -__all__ = ["ReplayChatModel", "hash_messages"] +__all__ = ["ReplayChatModel", "hash_messages", "replay_misses", "reset_replay_misses"] diff --git a/backend/tests/test_replay_golden.py b/backend/tests/test_replay_golden.py index f90bbd88e..18714ff8e 100644 --- a/backend/tests/test_replay_golden.py +++ b/backend/tests/test_replay_golden.py @@ -66,14 +66,24 @@ def test_replay_write_read_file_ultra_matches_golden(tmp_path: Path, monkeypatch cfg = app_config_module.get_app_config() cfg.database.sqlite_dir = str(home / "db") + # Fail loud on a replay miss. The gateway swallows a hash-miss into a normal + # assistant error message, so the SSE *shapes* below stay green on a stale + # fixture — the miss list is the only reliable signal at this layer. 
+ import replay_provider + from app.gateway.app import create_app + replay_provider.reset_replay_misses() + events = drive_gateway(create_app(), prompt=fixture["prompt"], context=fixture["context"]) assert events, "replay produced no SSE events" assert events[0]["event"] == "metadata", f"first event should be metadata, got {events[0]!r}" assert events[-1]["event"] == "end", f"last event should be end (run completed), got {events[-1]!r}" + misses = replay_provider.replay_misses() + assert not misses, f"replay miss ({len(misses)}): the fixture is stale vs the current conversation flow or agent graph (the system prompt is not part of the match key). Re-record it (see backend/docs/REPLAY_E2E.md). Missed hashes: {misses}" + # Regenerate the committed golden after re-recording the fixture: # DEERFLOW_WRITE_GOLDEN=1 uv run pytest tests/test_replay_golden.py if os.environ.get("DEERFLOW_WRITE_GOLDEN"): @@ -81,7 +91,7 @@ def test_replay_write_read_file_ultra_matches_golden(tmp_path: Path, monkeypatch return golden = json.loads(events_path.read_text(encoding="utf-8"))["events"] - # A replay hash-miss surfaces as the run erroring mid-stream -> the event - # shape sequence diverges from the golden, so this assertion is the catch-all - # for both backend SSE drift and replay divergence. + # Guards backend SSE protocol drift: the event name + payload-key sequence + # must match the committed golden. (Replay divergence is caught by the miss + # assertion above, not here — a swallowed miss keeps the shapes identical.) 
assert events == golden, f"SSE event-shape sequence drifted from the golden.\ngot ({len(events)}): {[e['event'] for e in events]}\nwant ({len(golden)}): {[e['event'] for e in golden]}" diff --git a/frontend/tests/e2e-real-backend/real-backend-render.spec.ts b/frontend/tests/e2e-real-backend/real-backend-render.spec.ts index fe4446e67..97c367d41 100644 --- a/frontend/tests/e2e-real-backend/real-backend-render.spec.ts +++ b/frontend/tests/e2e-real-backend/real-backend-render.spec.ts @@ -85,17 +85,21 @@ test.describe("real backend render (replay, no API key)", () => { await textarea.fill(PROMPT); await textarea.press("Enter"); - // Replay-only DOM assertions (derived from the fixture): they render only if + // Replay-only DOM assertions (derived from the fixture): both are + // model-generated strings absent from the user prompt, so they render only if // the recorded turns replayed AND the real frontend rendered them — the // in-graph auto-title and the post-answer follow-up suggestion. Together they - // prove the whole pipeline (replay backend -> real frontend render). + // prove the whole pipeline (replay backend -> real frontend render). The + // record spec waits for the /suggestions response, so a re-recorded fixture + // always captures the suggestion turn — a missing one is a broken recording + // and must fail loud here, not pass silently. 
expect( EXPECTED_TITLE, "fixture should contain an auto-title turn", ).not.toBe(""); expect( EXPECTED_SUGGESTION, - "fixture should contain a suggestions turn", + "fixture should contain a suggestions turn (re-record; the record spec waits for /suggestions)", ).not.toBe(""); await expect(page.getByText(EXPECTED_TITLE)).toBeVisible({ timeout: 60_000, diff --git a/frontend/tests/e2e-record/record-write-read-file.spec.ts b/frontend/tests/e2e-record/record-write-read-file.spec.ts index 77f02ec85..0e530a5e9 100644 --- a/frontend/tests/e2e-record/record-write-read-file.spec.ts +++ b/frontend/tests/e2e-record/record-write-read-file.spec.ts @@ -104,6 +104,16 @@ test("record write/read-file run through the real frontend", async ({ await textarea.fill(PROMPT); await textarea.press("Enter"); + // Suggestions fire only AFTER the run completes (input-box.tsx POSTs + // /suggestions). Wait for that response so its model call lands in the capture + // before we check for stability — otherwise the stability window can return + // first and the recorded fixture would be missing the suggestions turn. + await page + .waitForResponse((r) => r.url().includes("/suggestions"), { + timeout: 90_000, + }) + .catch(() => undefined); + const captured = await waitForCaptureStable(out!); console.log( `[record] captures stabilized at ${captured} model call(s) -> ${out}`,