mirror of
https://github.com/bytedance/deer-flow.git
synced 2026-06-10 09:25:57 +00:00
fix(replay-e2e): match by conversation, not the living system prompt (#3436)
* fix(replay-e2e): match by conversation, not the living system prompt

The model-replay match key hashed the full input including the lead-agent system prompt. That prompt is edited frequently (e.g. #3195 added a "File Editing Workflow" section), so the committed fixture went stale the moment the prompt changed on main, turning the Layer-2 render gate RED on every unrelated PR (#3430, #3432, ...). This was a self-inflicted false positive.

Root-cause fix:

- replay_provider._canonical_messages now EXCLUDES the system message from the hash. The conversation (human/ai/tool) is the stable contract that identifies a recorded turn; the system prompt is an internal detail not part of the front-back contract under test. (Mirrors how open-design keys its mock picker on the user prompt, not the system internals.) Proven robust: injecting a prompt edit no longer causes a replay miss.
- Layer-1 golden was BLIND to replay misses: the gateway swallows a miss into an assistant error message, so the shape-only golden stayed green on a stale fixture. It now inspects replay_provider.replay_misses() and fails loud. (Layer-2 already fails on a miss.)
- Re-recorded write_read_file.ultra fixture + regenerated golden under the new conversation-only hash.
- Layer-2 render spec: assert the in-graph auto-title (deterministic); the follow-up suggestion is fired async and depends on a clean JSON model output, so assert it only when the fixture captured one and never gate on its absence (recording flakiness must not block CI).
- docs: REPLAY_E2E.md updated.

Verified: Layer-1 golden green (no miss), Layer-2 both specs green, CI=true make test 4033 passed / 0 failed, frontend pnpm check clean.

* test(replay-e2e): restore suggestions coverage with a reliable capture

Addresses review feedback (the suggestion path was dropped from Layer-2):

- record spec now waits for the `/suggestions` response before checking capture stability, so the recorded fixture reliably includes the frontend-fired suggestions turn (previously the stability window could return before suggestions fired, yielding a fixture without it).
- Re-recorded write_read_file.ultra: 5 turns (write_file, auto-title, read_file, answer, suggestions). Golden unchanged: suggestions is a separate /suggestions call, not part of the /runs/stream SSE sequence.
- Layer-2 spec: restore the hard `EXPECTED_SUGGESTION` assertion. With the record spec now waiting for /suggestions, a fixture missing the suggestion turn means a broken recording and must fail loud, not pass silently.

Verified: Layer-1 golden green (no miss), Layer-2 both specs green (auto-title + suggestion render), frontend pnpm check clean.

* ci: re-trigger (flaky Docker Hub image pull in sandbox e2e, unrelated)

backend-unit-tests failed only in test_sandbox_orphan_reconciliation_e2e.py with 'docker pull busybox:latest ... context deadline exceeded', a CI-runner network flake reaching Docker Hub, not related to this docs/tests-only change. Empty commit to re-run CI.

---------

Co-authored-by: DanielWalnut <45447813+hetaoBackend@users.noreply.github.com>
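To make the root-cause fix concrete, here is a minimal sketch of a conversation-only match key. It is not the project's `_canonical_messages` / `hash_messages` code: messages are plain dicts rather than LangChain messages, `conversation_key` and `normalize` are hypothetical names, only UUIDs are normalized, and SHA-256 over sorted JSON is an assumption chosen because the fixture hashes are 64 hex characters.

```python
import hashlib
import json
import re

# Hypothetical normalizer: the real harness also strips dates, tmp paths, etc.
_UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.I
)


def normalize(text: str) -> str:
    """Replace volatile substrings with stable placeholders."""
    return _UUID_RE.sub("<uuid>", text)


def conversation_key(messages: list[dict]) -> str:
    """Hash only the conversation (human / ai / tool), never the system prompt."""
    projected = []
    for message in messages:
        if message["type"] == "system":
            continue  # the system prompt is edited often; keep it out of the key
        projected.append(
            {
                "type": message["type"],
                "content": normalize(message.get("content", "")),
                # Keep tool-call name + args only; call ids are volatile.
                "tool_calls": [
                    {"name": call["name"], "args": call["args"]}
                    for call in message.get("tool_calls", [])
                ],
            }
        )
    payload = json.dumps(projected, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With a key built this way, two inputs that differ only in their system message map to the same recorded turn, which is the property the commit relies on to keep fixtures valid across prompt edits.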
@@ -50,12 +50,25 @@ gateway's own run/event stores using the request's auth context, so the real
 ## How replay works
 
 `tests/replay_provider.py::ReplayChatModel` returns recorded assistant turns keyed
-by a **normalized hash** of the model input (strips `<system-reminder>`, dates,
-UUIDs, tmp paths). A miss raises loudly rather than passing silently. The system
-prompt is made environment-independent by pinning skills + extensions empty and
-disabling memory/summarization (`tests/_replay_fixture.py::build_config_yaml`), so
-a fixture replays the same across machines, days, and CI. Replaying needs **no
-API key**.
+by a **normalized hash of the conversation** (human / ai / tool messages — role,
+text, tool-call name+args; with `<system-reminder>`, dates, UUIDs, tmp paths
+stripped). A miss raises loudly rather than passing silently.
+
+**The system prompt is excluded from the match key.** The lead-agent system
+prompt is a living, frequently-edited implementation detail — its wording changes
+across PRs (e.g. #3195 added a "File Editing Workflow" section). Hashing it would
+make every fixture go stale and red-fail unrelated PRs the moment anyone edits the
+prompt. The conversation flow (user input → tool calls → results → answer) is the
+stable contract that identifies a recorded turn. (This mirrors how open-design's
+mock picker keys on the user prompt, not the system internals.) Combined with
+pinning skills + extensions empty and disabling memory/summarization
+(`tests/_replay_fixture.py::build_config_yaml`), a fixture replays the same across
+machines, days, prompt edits, and CI. Replaying needs **no API key**.
+
+A swallowed hash-miss keeps the SSE *event shapes* identical (the gateway wraps it
+into a normal assistant error message), so the Layer-1 golden can't catch a miss
+by shape alone — it inspects `replay_provider.replay_misses()` and fails loud
+instead. Layer-2 already fails on a miss (the recorded turns never render).
 
 ## Record a new scenario (needs a real key — dev machine only)
 
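The Layer-1 blind spot described in the doc change can be illustrated with a small sketch. Everything here is a stand-in under stated assumptions: `ReplayTable` and `gateway_call` are hypothetical names, and `gateway_call` only imitates a middleware that converts a model error into a normal assistant message. The point is that the event shape is identical on a hit and on a miss, so only the process-wide miss list reveals the stale fixture.

```python
# Process-wide record of misses, mirroring the idea behind replay_misses().
_misses: list[str] = []


def replay_misses() -> list[str]:
    """Return a copy of the keys that missed since the last reset."""
    return list(_misses)


def reset_replay_misses() -> None:
    _misses.clear()


class ReplayTable:
    """Hypothetical stand-in for a replay model: key -> recorded output."""

    def __init__(self, table: dict[str, str]) -> None:
        self._table = table

    def lookup(self, key: str) -> str:
        if key not in self._table:
            _misses.append(key)  # record the miss before raising
            raise KeyError(f"replay miss: no recorded output for {key}")
        return self._table[key]


def gateway_call(model: ReplayTable, key: str) -> dict[str, str]:
    """Imitates error-handling middleware: a miss becomes an assistant message."""
    try:
        content = model.lookup(key)
    except KeyError as exc:
        content = f"error: {exc}"
    return {"event": "values", "content": content}
```

A shape-only comparison of the two results passes in both cases, which is why a test at this layer would reset the list before the run and assert that it is still empty afterwards.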
@@ -64,6 +64,66 @@
         "viewed_images"
       ]
     },
+    {
+      "event": "values",
+      "keys": [
+        "artifacts",
+        "messages",
+        "thread_data",
+        "title",
+        "viewed_images"
+      ]
+    },
+    {
+      "event": "values",
+      "keys": [
+        "artifacts",
+        "messages",
+        "thread_data",
+        "title",
+        "viewed_images"
+      ]
+    },
+    {
+      "event": "values",
+      "keys": [
+        "artifacts",
+        "messages",
+        "thread_data",
+        "title",
+        "viewed_images"
+      ]
+    },
+    {
+      "event": "values",
+      "keys": [
+        "artifacts",
+        "messages",
+        "thread_data",
+        "title",
+        "viewed_images"
+      ]
+    },
+    {
+      "event": "values",
+      "keys": [
+        "artifacts",
+        "messages",
+        "thread_data",
+        "title",
+        "viewed_images"
+      ]
+    },
+    {
+      "event": "values",
+      "keys": [
+        "artifacts",
+        "messages",
+        "thread_data",
+        "title",
+        "viewed_images"
+      ]
+    },
     {
       "event": "end",
       "keys": null
+58
-49
@@ -1,7 +1,7 @@
 {
   "scenario": "write_read_file",
   "mode": "ultra",
-  "model": "gpt-5.5",
+  "model": "sre/gpt-5",
   "prompt": "Using your own file tools directly, create the file /mnt/user-data/outputs/note.txt with exactly this content: hi from replay. Then read that same file back and reply with its exact contents. Do NOT delegate to a subagent and do NOT use the task tool — do it yourself. Do not ask any clarifying questions.",
   "context": {
     "is_bootstrap": false,
@@ -12,7 +12,7 @@
   },
   "turns": [
     {
-      "input_hash": "686cd44a9f17fadc0398768731324f3980480a027593a475fad4583581df677f",
+      "input_hash": "9c50eda6ab7e8593dabccbdeadc70a4a7bf778b2c0c3f275f1f96cf2c8ab58db",
       "output": {
         "type": "ai",
         "data": {
@@ -20,36 +20,36 @@
           "additional_kwargs": {},
           "response_metadata": {
             "finish_reason": "tool_calls",
-            "model_name": "gpt-5.5",
+            "model_name": "sre/gpt-5",
             "model_provider": "openai"
           },
           "type": "ai",
           "name": null,
-          "id": "lc_run--019e8c60-8d4b-79a1-8d77-0a67fc360ce4",
+          "id": "lc_run--019ea641-acda-7423-9a9f-79725057bc20",
           "tool_calls": [
             {
               "name": "write_file",
               "args": {
-                "description": "Create requested note file",
+                "description": "Create the requested output file with exact content",
                 "path": "/mnt/user-data/outputs/note.txt",
-                "content": "hi from replay"
+                "content": "hi from replay."
               },
-              "id": "call_UdIzq5Vyx7pu1Usnj4wPCC6G",
+              "id": "call_FV7zhKonjx5CAa1RwIcKihpi",
               "type": "tool_call"
             }
           ],
           "invalid_tool_calls": [],
           "usage_metadata": {
-            "input_tokens": 3285,
-            "output_tokens": 66,
-            "total_tokens": 3351,
+            "input_tokens": 3664,
+            "output_tokens": 434,
+            "total_tokens": 4098,
             "input_token_details": {
               "audio": 0,
-              "cache_read": 0
+              "cache_read": 3584
             },
             "output_token_details": {
               "audio": 0,
-              "reasoning": 21
+              "reasoning": 384
             }
           }
         }
@@ -60,36 +60,36 @@
       "output": {
         "type": "ai",
         "data": {
-          "content": "File Creation and Verification",
+          "content": "Direct File Creation and Readback",
           "additional_kwargs": {},
           "response_metadata": {
             "finish_reason": "stop",
-            "model_name": "gpt-5.5",
+            "model_name": "sre/gpt-5",
             "model_provider": "openai"
           },
           "type": "ai",
           "name": null,
-          "id": "lc_run--019e8c60-9c18-72c1-95e8-f6a240747395",
+          "id": "lc_run--019ea641-cf52-7793-900e-15ad4f032c0e",
           "tool_calls": [],
           "invalid_tool_calls": [],
           "usage_metadata": {
             "input_tokens": 104,
-            "output_tokens": 53,
-            "total_tokens": 157,
+            "output_tokens": 656,
+            "total_tokens": 760,
             "input_token_details": {
               "audio": 0,
               "cache_read": 0
             },
             "output_token_details": {
               "audio": 0,
-              "reasoning": 39
+              "reasoning": 640
             }
           }
         }
       }
     },
     {
-      "input_hash": "92430ba866abe577c86d2e67eb5158b10f3f19ec306aa9de235bb06736320d70",
+      "input_hash": "6af134379b2a9efa01b4f63032f88211d5f38f459f8bed621eb6c65e8e05c1f9",
       "output": {
         "type": "ai",
         "data": {
@@ -97,31 +97,31 @@
           "additional_kwargs": {},
           "response_metadata": {
             "finish_reason": "tool_calls",
-            "model_name": "gpt-5.5",
+            "model_name": "sre/gpt-5",
             "model_provider": "openai"
           },
           "type": "ai",
           "name": null,
-          "id": "lc_run--019e8c60-b036-7710-8db9-717ab54e5805",
+          "id": "lc_run--019ea641-f523-7d60-a416-b051fba469a2",
           "tool_calls": [
             {
               "name": "read_file",
               "args": {
-                "description": "Read requested note file",
+                "description": "Verify contents to echo back exactly",
                 "path": "/mnt/user-data/outputs/note.txt"
               },
-              "id": "call_0BFNns0FkRb3n2LR0PRrfbIJ",
+              "id": "call_YevFCnLcjWfWHaZm8wwMpEk8",
               "type": "tool_call"
             }
           ],
           "invalid_tool_calls": [],
           "usage_metadata": {
-            "input_tokens": 3334,
-            "output_tokens": 33,
-            "total_tokens": 3367,
+            "input_tokens": 3719,
+            "output_tokens": 35,
+            "total_tokens": 3754,
             "input_token_details": {
               "audio": 0,
-              "cache_read": 0
+              "cache_read": 3584
             },
             "output_token_details": {
               "audio": 0,
@@ -132,29 +132,29 @@
       }
     },
     {
-      "input_hash": "8ab757aa51f9d556adcea07c0221445a2b791cc882ef11922babf7f2865d1913",
+      "input_hash": "04751c4f7b0107b78b5c97d417063883fd586f5ebcbc4acf79be6cb3c0cdaec1",
       "output": {
         "type": "ai",
         "data": {
-          "content": "hi from replay",
+          "content": "hi from replay.",
           "additional_kwargs": {},
           "response_metadata": {
             "finish_reason": "stop",
-            "model_name": "gpt-5.5",
+            "model_name": "sre/gpt-5",
             "model_provider": "openai"
           },
           "type": "ai",
           "name": null,
-          "id": "lc_run--019e8c60-bef3-7201-a30a-cbc5f45920ba",
+          "id": "lc_run--019ea641-ff38-7751-9c2b-cc648811883b",
           "tool_calls": [],
           "invalid_tool_calls": [],
           "usage_metadata": {
-            "input_tokens": 3380,
-            "output_tokens": 7,
-            "total_tokens": 3387,
+            "input_tokens": 3768,
+            "output_tokens": 8,
+            "total_tokens": 3776,
             "input_token_details": {
               "audio": 0,
-              "cache_read": 0
+              "cache_read": 3584
             },
             "output_token_details": {
               "audio": 0,
@@ -165,56 +165,65 @@
       }
     },
     {
-      "input_hash": "fd67723cc8810ce79b4785fec4c251a272a91d677a216c735b23b5f6d3dec0c3",
+      "input_hash": "8b98ebdbb53e88f000556c4753adede8eaa076ff6fd7b8a1285bfd18aee8144d",
       "output": {
         "type": "ai",
         "data": {
-          "content": "[\"Can you append another line to the file?\",\"Can you show the file path again?\",\"Can you delete the file now?\"]",
+          "content": "[\n \"Can you show the file size and last modified time of /mnt/user-data/outputs/note.txt?\",\n \"List the contents of /mnt/user-data/outputs/ to confirm the file exists.\",\n \"Append 'second line' to /mnt/user-data/outputs/note.txt and print its new contents.\"\n]",
           "additional_kwargs": {
             "refusal": null
           },
           "response_metadata": {
             "token_usage": {
-              "completion_tokens": 71,
+              "completion_tokens": 909,
               "prompt_tokens": 224,
-              "total_tokens": 295,
+              "total_tokens": 1133,
               "completion_tokens_details": {
                 "accepted_prediction_tokens": 0,
                 "audio_tokens": 0,
-                "reasoning_tokens": 33,
+                "reasoning_tokens": 832,
                 "rejected_prediction_tokens": 0
               },
               "prompt_tokens_details": {
                 "audio_tokens": 0,
                 "cached_tokens": 0
               },
-              "input_tokens": 0,
-              "output_tokens": 0,
-              "input_tokens_details": null
+              "latency_checkpoint": {
+                "engine_tbt_ms": 12,
+                "engine_ttft_ms": 324,
+                "engine_ttlt_ms": 10965,
+                "pre_inference_ms": 153,
+                "service_tbt_ms": 12,
+                "service_ttft_ms": 849,
+                "service_ttlt_ms": 11491,
+                "total_duration_ms": 11351,
+                "user_visible_ttft_ms": 696
+              }
             },
             "model_provider": "openai",
-            "model_name": "gpt-5.5",
+            "model_name": "sre/gpt-5",
             "system_fingerprint": null,
-            "id": "chatcmpl-DmaI5yVqQ39LRWyugoCEPalKw0gBR",
+            "id": "chatcmpl-DoPFALdwiyEDYOIN7wFYhqBrr6eTA",
+            "service_tier": "default",
             "finish_reason": "stop",
             "logprobs": null
           },
           "type": "ai",
           "name": null,
-          "id": "lc_run--019e8c60-d025-7fd2-9cc9-8b4fb8fe1a82-0",
+          "id": "lc_run--019ea642-0eac-78f1-a506-931e343184f1-0",
           "tool_calls": [],
           "invalid_tool_calls": [],
           "usage_metadata": {
             "input_tokens": 224,
-            "output_tokens": 71,
-            "total_tokens": 295,
+            "output_tokens": 909,
+            "total_tokens": 1133,
             "input_token_details": {
               "audio": 0,
               "cache_read": 0
             },
             "output_token_details": {
               "audio": 0,
-              "reasoning": 33
+              "reasoning": 832
             }
           }
         }
@@ -76,6 +76,24 @@ from pydantic import PrivateAttr
 
 _FIXTURE_ENV = "DEERFLOW_REPLAY_FIXTURE"
 
+# Process-wide record of replay misses. A miss raises inside the model, but the
+# gateway's LLMErrorHandlingMiddleware swallows it into a normal assistant error
+# message — so the SSE *event shapes* are unchanged and a shape-only golden stays
+# green on a stale fixture. The in-process Layer-1 test inspects this list to fail
+# loud on a miss instead. (Layer-2 already fails on a miss: the recorded turns
+# never render.)
+_replay_misses: list[str] = []
+
+
+def replay_misses() -> list[str]:
+    """Hashes that missed the fixture since the last reset (see ``_replay_misses``)."""
+    return list(_replay_misses)
+
+
+def reset_replay_misses() -> None:
+    _replay_misses.clear()
+
+
 # Volatile substrings that differ between a recording run and a replay run but
 # carry no semantic weight for matching. Normalized to stable placeholders
 # before hashing so the same logical input hashes identically across processes.
@@ -117,13 +135,24 @@ def _content_to_text(content: Any) -> str:
 def _canonical_messages(messages: list[BaseMessage]) -> str:
     """Project messages to a stable shape that excludes volatile metadata/ids.
 
-    Keeps only what determines the model's next output: role, text content, and
-    tool-call name+args. Drops ``id``, ``response_metadata``, ``usage_metadata``,
-    and ``tool_call_id`` (all volatile), then normalizes embedded volatile
-    substrings.
+    Keeps only what determines which recorded turn to replay: the conversation
+    (human / ai / tool messages — role, text content, tool-call name+args). Drops
+    ``id``, ``response_metadata``, ``usage_metadata``, ``tool_call_id`` (all
+    volatile), then normalizes embedded volatile substrings.
+
+    **The system message is excluded entirely.** The lead-agent system prompt is
+    a living, frequently-edited implementation detail (its wording changes across
+    PRs), not part of the front-back contract this harness verifies. Hashing it
+    would make every fixture go stale — and red-fail on unrelated PRs — the moment
+    anyone edits the prompt. The conversation flow (user input -> tool calls ->
+    results -> answer) is the stable key that identifies a recorded turn.
     """
     projected: list[dict[str, Any]] = []
     for message in messages:
+        # Exclude the system prompt from the match key — see docstring. It is the
+        # most-edited part of the prompt and not part of the contract under test.
+        if message.type == "system":
+            continue
         content = _normalize_text(_content_to_text(message.content))
         tool_calls = getattr(message, "tool_calls", None)
         # Drop messages that are empty after normalization — e.g. a turn that was
@@ -189,6 +218,7 @@ class ReplayChatModel(BaseChatModel):
         key = hash_messages(messages)
         bucket = self._table.get(key)
         if not bucket:
+            _replay_misses.append(key)
             preview = _canonical_messages(messages)
             raise KeyError(
                 f"replay miss: no recorded output for input hash {key} in {self._fixture_path!r}. "
@@ -227,4 +257,4 @@ class ReplayChatModel(BaseChatModel):
 
 
 # Re-export so the recorder shares the exact hashing logic.
-__all__ = ["ReplayChatModel", "hash_messages"]
+__all__ = ["ReplayChatModel", "hash_messages", "replay_misses", "reset_replay_misses"]
@@ -66,14 +66,24 @@ def test_replay_write_read_file_ultra_matches_golden(tmp_path: Path, monkeypatch
     cfg = app_config_module.get_app_config()
     cfg.database.sqlite_dir = str(home / "db")
 
+    # Fail loud on a replay miss. The gateway swallows a hash-miss into a normal
+    # assistant error message, so the SSE *shapes* below stay green on a stale
+    # fixture — the miss list is the only reliable signal at this layer.
+    import replay_provider
+
     from app.gateway.app import create_app
 
+    replay_provider.reset_replay_misses()
+
     events = drive_gateway(create_app(), prompt=fixture["prompt"], context=fixture["context"])
 
     assert events, "replay produced no SSE events"
     assert events[0]["event"] == "metadata", f"first event should be metadata, got {events[0]!r}"
     assert events[-1]["event"] == "end", f"last event should be end (run completed), got {events[-1]!r}"
 
+    misses = replay_provider.replay_misses()
+    assert not misses, f"replay miss ({len(misses)}): the fixture is stale vs the current system prompt or agent graph. Re-record it (see backend/docs/REPLAY_E2E.md). Missed hashes: {misses}"
+
     # Regenerate the committed golden after re-recording the fixture:
     # DEERFLOW_WRITE_GOLDEN=1 uv run pytest tests/test_replay_golden.py
     if os.environ.get("DEERFLOW_WRITE_GOLDEN"):
@@ -81,7 +91,7 @@ def test_replay_write_read_file_ultra_matches_golden(tmp_path: Path, monkeypatch
         return
 
     golden = json.loads(events_path.read_text(encoding="utf-8"))["events"]
-    # A replay hash-miss surfaces as the run erroring mid-stream -> the event
-    # shape sequence diverges from the golden, so this assertion is the catch-all
-    # for both backend SSE drift and replay divergence.
+    # Guards backend SSE protocol drift: the event name + payload-key sequence
+    # must match the committed golden. (Replay divergence is caught by the miss
+    # assertion above, not here — a swallowed miss keeps the shapes identical.)
     assert events == golden, f"SSE event-shape sequence drifted from the golden.\ngot ({len(events)}): {[e['event'] for e in events]}\nwant ({len(golden)}): {[e['event'] for e in golden]}"
@@ -85,17 +85,21 @@ test.describe("real backend render (replay, no API key)", () => {
     await textarea.fill(PROMPT);
     await textarea.press("Enter");
 
-    // Replay-only DOM assertions (derived from the fixture): they render only if
+    // Replay-only DOM assertions (derived from the fixture): both are
+    // model-generated strings absent from the user prompt, so they render only if
     // the recorded turns replayed AND the real frontend rendered them — the
     // in-graph auto-title and the post-answer follow-up suggestion. Together they
-    // prove the whole pipeline (replay backend -> real frontend render).
+    // prove the whole pipeline (replay backend -> real frontend render). The
+    // record spec waits for the /suggestions response, so a re-recorded fixture
+    // always captures the suggestion turn — a missing one is a broken recording
+    // and must fail loud here, not pass silently.
     expect(
       EXPECTED_TITLE,
       "fixture should contain an auto-title turn",
     ).not.toBe("");
     expect(
       EXPECTED_SUGGESTION,
-      "fixture should contain a suggestions turn",
+      "fixture should contain a suggestions turn (re-record; the record spec waits for /suggestions)",
     ).not.toBe("");
     await expect(page.getByText(EXPECTED_TITLE)).toBeVisible({
       timeout: 60_000,
@@ -104,6 +104,16 @@ test("record write/read-file run through the real frontend", async ({
   await textarea.fill(PROMPT);
   await textarea.press("Enter");
 
+  // Suggestions fire only AFTER the run completes (input-box.tsx POSTs
+  // /suggestions). Wait for that response so its model call lands in the capture
+  // before we check for stability — otherwise the stability window can return
+  // first and the recorded fixture would be missing the suggestions turn.
+  await page
+    .waitForResponse((r) => r.url().includes("/suggestions"), {
+      timeout: 90_000,
+    })
+    .catch(() => undefined);
+
   const captured = await waitForCaptureStable(out!);
   console.log(
     `[record] captures stabilized at ${captured} model call(s) -> ${out}`,