deer-flow/frontend/tests/e2e-real-backend/multi-run-order.spec.ts
Xinmin Zeng 88759015e4 test(e2e): deterministic record/replay front-back contract verification (#3365)
* test(e2e): record/replay front-back contract verification

Guards the front-back contract with a deterministic, key-free record/replay
harness (mirrors open-design's golden-trace approach):

- ReplayChatModel (tests/replay_provider.py): replays recorded LLM turns by a
  normalized hash of the model input. Strips <system-reminder>/date/uuid/tmp-path
  so one fixture replays across days and from both the browser and direct-POST
  paths; a miss raises loudly (no silent divergence).
- Recording is record-through-browser (scripts/record_gateway.py +
  build_fixture_from_jsonl.py + frontend/tests/e2e-record): a real run is driven
  through the real frontend so captured inputs match exactly what the browser
  sends; fixtures contain no API key.
- Layer 1 — backend golden (tests/test_replay_golden.py): replay through the real
  gateway, assert the SSE event sequence == committed golden.
- Layer 2 — full-stack render (frontend/tests/e2e-real-backend): real Next.js +
  real gateway (replay model) + Chromium; assert the replayed auto-title and
  follow-up suggestions render. DOM assertions are the gate; visual regression is
  a local dev gate (CI uploads the render as an artifact).
- CI (.github/workflows/replay-e2e.yml): both layers, triggered on EITHER side of
  the contract (frontend/** or backend gateway/harness/fixtures).
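
The normalized-hash replay described in the first bullet can be sketched as follows. This is an illustration only, not the code in tests/replay_provider.py: the names `normalizeInput`, `fnv1a`, and `ReplayStore` are hypothetical, the regexes are guesses at what "strips <system-reminder>/date/uuid/tmp-path" might look like, and a small FNV-1a-style hash stands in for whatever hash the harness actually uses.

```typescript
// Hypothetical sketch of replay-by-normalized-hash; not the harness's real code.

/** Remove volatile fragments so one recording matches across days and call paths. */
function normalizeInput(text: string): string {
  return text
    .replace(/<system-reminder>[\s\S]*?<\/system-reminder>/g, "")
    .replace(/\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b/gi, "<uuid>")
    .replace(/\b\d{4}-\d{2}-\d{2}\b/g, "<date>")
    .replace(/\/tmp\/[^\s"']+/g, "<tmp>")
    .replace(/\s+/g, " ")
    .trim();
}

/** Stand-in 32-bit FNV-1a-style hash over UTF-16 code units (assumption: any stable hash works). */
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

class ReplayStore {
  private readonly turns = new Map<string, string>();

  /** Store a recorded model output under the hash of its normalized input. */
  record(input: string, output: string): void {
    this.turns.set(fnv1a(normalizeInput(input)), output);
  }

  /** Return the recorded output, or throw so a miss can never diverge silently. */
  replay(input: string): string {
    const key = fnv1a(normalizeInput(input));
    const hit = this.turns.get(key);
    if (hit === undefined) {
      throw new Error(`replay miss: no recorded turn for key ${key}`);
    }
    return hit;
  }
}
```

Throwing on a miss, rather than returning a default, is what keeps a drifted prompt from producing a plausible but unrecorded answer.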

* test(e2e): multi-run render-order cross-stack scenario (#3352)

Guards the dangerous front-back class where a backend ordering change
silently breaks a frontend assumption while both sides' unit tests stay
green. Reproduces issue #3352: backend list_by_thread returns runs
newest-first (#2932) and the frontend prepended per-run pages, inverting
chronological order once the checkpoint no longer held the older messages.

- tests/seed_runs_router.py: test-only seeder, mounted on the replay
  gateway only when DEERFLOW_ENABLE_TEST_SEED=1 (never in the production
  app). Seeds a thread with >=2 runs + per-run message events and no
  checkpoint -- the #3352 precondition -- so the frontend per-run reload
  path is the sole source of truth and the prepend inversion is observable.
- frontend/tests/e2e-real-backend/multi-run-order.spec.ts: drives the real
  frontend against the real gateway, asserts the first run renders above
  the second. Reverting the #3354 fix turns it red.
- replay-e2e.yml: trigger on the new replay test-infra paths.
- docs: REPLAY_E2E.md cross-stack scenario section.
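
A simplified model of the inversion, assuming each run page is just a list of message strings. `RunPage`, `rebuildByPrepending`, and `rebuildChronologically` are hypothetical names for illustration; this is not the code in hooks.ts, and the second function is only one way to get chronological order, not necessarily what #3354 changed.

```typescript
// Hypothetical model of the #3352 ordering bug; not the real frontend code.
type RunPage = { runId: string; messages: string[] };

// Backend order as described above: newest run first.
const newestFirst: RunPage[] = [
  { runId: "r2", messages: ["OMEGA", "OMEGA reply"] },
  { runId: "r1", messages: ["ALPHA", "ALPHA reply"] },
];

/** Buggy shape: walk the list from the end (oldest first) and PREPEND each page. */
function rebuildByPrepending(runs: RunPage[]): string[] {
  let history: string[] = [];
  for (const run of [...runs].reverse()) {
    history = [...run.messages, ...history]; // the newer page lands in front
  }
  return history;
}

/** Chronological shape: walk the list from the end (oldest first) and APPEND. */
function rebuildChronologically(runs: RunPage[]): string[] {
  const history: string[] = [];
  for (const run of [...runs].reverse()) {
    history.push(...run.messages);
  }
  return history;
}

console.log(rebuildByPrepending(newestFirst)); // OMEGA pair comes before ALPHA pair
console.log(rebuildChronologically(newestFirst)); // ALPHA pair comes before OMEGA pair
```

Either half looks reasonable in isolation, which is why a mock that hardcodes the backend order cannot catch the mismatch.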

* test(e2e): address Copilot review on the replay harness

- Fix stale recorder references (scripts/record_traces.py ->
  scripts/record_gateway.py + scripts/build_fixture_from_jsonl.py) in
  replay_provider.py, test_replay_golden.py, _replay_fixture.py.
- MODE_CONTEXT['ultra']: thinking_enabled False -> True, mirroring the
  frontend's `context.mode !== 'flash'` (hooks.ts). It did not affect the
  hashed input (Layer 1 golden still green), but the table now matches the
  real frontend context it claims to mirror.
- replay_provider.py docstring: stop claiming memory is recorded-enabled;
  the replay config disables memory/summarization for determinism (title
  stays, as an in-graph deterministic call).
- record_gateway.py / run_replay_gateway.py: override DEER_FLOW_HOME instead
  of setdefault, so an outer value can't leak into the hermetic harness.
- record_gateway.py: clear error when DEERFLOW_RECORD_OUT is unset (was a
  bare KeyError).
- playwright.record.config.ts: forward OPENAI_*/DEERFLOW_RECORD_OUT only when
  set, so the gateway raises a clear 'missing env' error instead of getting ''.
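
The "forward only when set" behaviour from the last bullet can be sketched like this. `forwardIfSet` is a hypothetical helper, not the code in playwright.record.config.ts, and the variable names in the usage are examples only. Treating an empty string as unset is an assumption made here.

```typescript
// Hypothetical helper: copy only the variables that are actually set.
function forwardIfSet(
  names: string[],
  source: Record<string, string | undefined>,
): Record<string, string> {
  const forwarded: Record<string, string> = {};
  for (const name of names) {
    const value = source[name];
    // Assumption: '' counts as unset, so the child process sees a missing
    // variable and can raise its own clear error instead of receiving ''.
    if (value !== undefined && value !== "") {
      forwarded[name] = value;
    }
  }
  return forwarded;
}

// Example with an in-memory environment (in a real config this would be process.env).
const exampleEnv = { OPENAI_API_KEY: "example-key", DEERFLOW_RECORD_OUT: undefined };
console.log(forwardIfSet(["OPENAI_API_KEY", "DEERFLOW_RECORD_OUT"], exampleEnv));
```

Omitting the key entirely, instead of defaulting to `?? ""`, is what lets the downstream check distinguish "missing" from "present but empty".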

* test(e2e): address Copilot review round 2

- seed_runs_router.py: constrain SeedMessage.role to Literal['human','ai']
  so a bad value is a clean 422 at the boundary instead of a 500
  (KeyError on _EVENT_TYPE).
- record-write-read-file.spec.ts: waitForCaptureStable now throws on
  timeout instead of returning the last count, so a truncated/partial
  recording can't pass silently.
- real-backend-render.spec.ts: guard the suggestions JSON.parse; a
  bracket-prefixed non-JSON turn falls back to '' so the existing
  not.toBe('') assertion fails clearly instead of a generic parse throw.
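
A minimal sketch of the guarded parse from the last bullet, assuming the suggestions turn is a JSON array of strings and the test only needs the first one. `firstSuggestionOrEmpty` is a hypothetical name; the real spec may extract the value differently.

```typescript
// Hypothetical guard: never let a malformed turn surface as a generic parse throw.
function firstSuggestionOrEmpty(raw: string): string {
  const trimmed = raw.trim();
  if (!trimmed.startsWith("[")) {
    return "";
  }
  try {
    const parsed: unknown = JSON.parse(trimmed);
    if (Array.isArray(parsed) && typeof parsed[0] === "string") {
      return parsed[0];
    }
    return "";
  } catch {
    // Bracket-prefixed but not valid JSON: fall back to '' so the caller's
    // existing not.toBe('') assertion reports the failure.
    return "";
  }
}

console.log(firstSuggestionOrEmpty('["Summarize this", "Explain more"]'));
console.log(firstSuggestionOrEmpty("[not json at all"));
```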
2026-06-08 12:35:03 +08:00


import { expect, test } from "@playwright/test";

/**
 * Layer 2 (cross-stack contract): reproduces upstream issue #3352. After the
 * checkpoint no longer holds the older messages (post context-compression), the
 * frontend rebuilds thread history from the per-run endpoints, and the order it
 * rebuilds them in must stay chronological.
 *
 * The dangerous class this guards: a BACKEND change to run ordering silently
 * breaks a FRONTEND assumption. Backend `list_by_thread` returns runs
 * NEWEST-FIRST (PR #2932); the pre-#3354 frontend iterated runs from the end and
 * PREPENDED each loaded page (`core/threads/hooks.ts`), which inverts order. A
 * backend-only ordering test was green the whole time #3352 was live, and the
 * frontend regression unit test hardcodes "backend returns newest-first" in a
 * mock, so only a real frontend against a real backend catches the desync.
 *
 * This drives the REAL frontend against a REAL gateway with two seeded runs and
 * NO checkpoint (the seeder forces the per-run reload path to be the sole source
 * of truth), then asserts the first run's message renders ABOVE the second's.
 * No model, no recording, no API key: the runs are seeded via a test-only
 * endpoint mounted only on the replay gateway.
 */
const APP = "http://localhost:3000";

// Distinctive markers so getByText can't collide with UI chrome.
const ALPHA = "ALPHA-FIRST-QUESTION-7f3a2c";
const OMEGA = "OMEGA-SECOND-QUESTION-9b21d4";

test.describe("multi-run thread renders chronologically (replay, no API key)", () => {
  test("first run renders above second run after history rebuild (#3352)", async ({
    page,
    context,
  }) => {
    const uniq = `${Date.now()}-${Math.floor(Math.random() * 1e6)}`;
    const threadId = `e2e-multi-run-${uniq}`;
    const email = `e2e-${uniq}@example.com`;

    // Register through the frontend origin (same-origin proxy) so the auth
    // cookies are stored for localhost and forwarded to the gateway via the
    // next.config rewrite, never cross-origin from the browser.
    const reg = await context.request.post(`${APP}/api/v1/auth/register`, {
      data: { email, password: "very-strong-password-123" },
    });
    expect(reg.status(), await reg.text()).toBe(201);

    const cookies = await context.cookies();
    const csrf = cookies.find((c) => c.name === "csrf_token")?.value;
    expect(csrf, "register must set csrf_token cookie").toBeTruthy();

    // Seed two runs in one thread: run-1 (ALPHA) older, run-2 (OMEGA) newer, so
    // the real backend's list_by_thread returns them newest-first. No checkpoint
    // is seeded; that is the #3352 precondition.
    const seed = await context.request.post(`${APP}/api/test-only/seed-runs`, {
      headers: { "X-CSRF-Token": csrf! },
      data: {
        thread_id: threadId,
        runs: [
          {
            run_id: `${threadId}-r1`,
            created_at: "2026-01-01T00:00:00+00:00",
            messages: [
              { role: "human", content: ALPHA, id: `${threadId}-a-h` },
              { role: "ai", content: "ALPHA reply", id: `${threadId}-a-a` },
            ],
          },
          {
            run_id: `${threadId}-r2`,
            created_at: "2026-01-01T00:01:00+00:00",
            messages: [
              { role: "human", content: OMEGA, id: `${threadId}-o-h` },
              { role: "ai", content: "OMEGA reply", id: `${threadId}-o-a` },
            ],
          },
        ],
      },
    });
    expect(seed.status(), await seed.text()).toBe(200);

    // Load the thread fresh; this triggers useThreadHistory's per-run reload path.
    await page.goto(`/workspace/chats/${threadId}`);
    const alpha = page.getByText(ALPHA, { exact: false });
    const omega = page.getByText(OMEGA, { exact: false });
    await expect(alpha).toBeVisible({ timeout: 60_000 });
    await expect(omega).toBeVisible({ timeout: 30_000 });

    // Each marker renders exactly once (guards against accidental duplicate matches).
    expect(await alpha.count(), "ALPHA should render exactly once").toBe(1);
    expect(await omega.count(), "OMEGA should render exactly once").toBe(1);

    // The contract: ALPHA (first run) must render ABOVE OMEGA (second run). With
    // the #3352 bug the per-run rebuild inverts this and OMEGA renders first.
    const alphaBox = await alpha.first().boundingBox();
    const omegaBox = await omega.first().boundingBox();
    expect(alphaBox, "ALPHA must have a layout box").toBeTruthy();
    expect(omegaBox, "OMEGA must have a layout box").toBeTruthy();
    expect(
      alphaBox!.y,
      `chronological order broken: ALPHA(first run) rendered at y=${alphaBox!.y}, OMEGA(second run) at y=${omegaBox!.y}; backend list_by_thread ordering and frontend history rebuild are out of sync (#3352)`,
    ).toBeLessThan(omegaBox!.y);
  });
});