fix(replay-e2e): key fixtures by caller and conversation (#3453)

* add caller identity in replay e2e

* make format

* fix(replay-e2e): stabilize title caller replay

* fix(replay-e2e): use captured caller without run manager

---------

Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
This commit is contained in:
Nan Gao
2026-06-09 15:58:31 +02:00
committed by GitHub
parent 37337b77f9
commit 63ce88f874
7 changed files with 308 additions and 34 deletions
+10 -6
View File
@@ -50,18 +50,22 @@ gateway's own run/event stores using the request's auth context, so the real
## How replay works
`tests/replay_provider.py::ReplayChatModel` returns recorded assistant turns keyed
by a **normalized hash of the conversation** (human / ai / tool messages — role,
text, tool-call name+args; with `<system-reminder>`, dates, UUIDs, tmp paths
stripped). A miss raises loudly rather than passing silently.
by a **normalized hash of the model caller + conversation**. The conversation is
human / ai / tool messages — role, text, tool-call name+args; with
`<system-reminder>`, dates, UUIDs, tmp paths stripped. The caller is the stable
source of the model call (`lead_agent`, `middleware:title`, `suggest_agent`,
`subagent:*`, etc.). A miss raises loudly rather than passing silently.
**The system prompt is excluded from the match key.** The lead-agent system
prompt is a living, frequently-edited implementation detail — its wording changes
across PRs (e.g. #3195 added a "File Editing Workflow" section). Hashing it would
make every fixture go stale and red-fail unrelated PRs the moment anyone edits the
prompt. The conversation flow (user input → tool calls → results → answer) is the
stable contract that identifies a recorded turn. (This mirrors how open-design's
mock picker keys on the user prompt, not the system internals.) Combined with
pinning skills + extensions empty and disabling memory/summarization
stable contract that identifies a recorded turn. The caller still stays in the
key so two different model users with identical conversation text do not compete
for the same replay bucket. (This mirrors how open-design's mock picker keys on
the user prompt, not the system internals.) Combined with pinning skills +
extensions empty and disabling memory/summarization
(`tests/_replay_fixture.py::build_config_yaml`), a fixture replays the same across
machines, days, prompt edits, and CI. Replaying needs **no API key**.