mirror of
https://github.com/bytedance/deer-flow.git
synced 2026-06-10 09:25:57 +00:00
88759015e4
* test(e2e): record/replay front-back contract verification

  Guards the front-back contract with a deterministic, key-free record/replay
  harness (mirrors open-design's golden-trace approach):

  - ReplayChatModel (tests/replay_provider.py): replays recorded LLM turns by a
    normalized hash of the model input. Strips <system-reminder>/date/uuid/tmp-path
    so one fixture replays across days and from both the browser and direct-POST
    paths; a miss raises loudly (no silent divergence).
  - Recording is record-through-browser (scripts/record_gateway.py +
    build_fixture_from_jsonl.py + frontend/tests/e2e-record): a real run is driven
    through the real frontend so captured inputs match exactly what the browser
    sends; fixtures contain no API key.
  - Layer 1 — backend golden (tests/test_replay_golden.py): replay through the
    real gateway, assert the SSE event sequence == committed golden.
  - Layer 2 — full-stack render (frontend/tests/e2e-real-backend): real Next.js +
    real gateway (replay model) + Chromium; assert the replayed auto-title and
    follow-up suggestions render. DOM assertions are the gate; visual regression
    is a local dev gate (CI uploads the render as an artifact).
  - CI (.github/workflows/replay-e2e.yml): both layers, triggered on EITHER side
    of the contract (frontend/** or backend gateway/harness/fixtures).

* test(e2e): multi-run render-order cross-stack scenario (#3352)

  Guards the dangerous front-back class where a backend ordering change silently
  breaks a frontend assumption while both sides' unit tests stay green.
  Reproduces issue #3352: backend list_by_thread returns runs newest-first
  (#2932) and the frontend prepended per-run pages, inverting chronological
  order once the checkpoint no longer held the older messages.

  - tests/seed_runs_router.py: test-only seeder, mounted on the replay gateway
    only when DEERFLOW_ENABLE_TEST_SEED=1 (never in the production app). Seeds a
    thread with >=2 runs + per-run message events and no checkpoint -- the #3352
    precondition -- so the frontend per-run reload path is the sole source of
    truth and the prepend inversion is observable.
  - frontend/tests/e2e-real-backend/multi-run-order.spec.ts: drives the real
    frontend against the real gateway, asserts the first run renders above the
    second. Reverting the #3354 fix turns it red.
  - replay-e2e.yml: trigger on the new replay test-infra paths.
  - docs: REPLAY_E2E.md cross-stack scenario section.

* test(e2e): address Copilot review on the replay harness

  - Fix stale recorder references (scripts/record_traces.py ->
    scripts/record_gateway.py + scripts/build_fixture_from_jsonl.py) in
    replay_provider.py, test_replay_golden.py, _replay_fixture.py.
  - MODE_CONTEXT['ultra']: thinking_enabled False -> True, mirroring the
    frontend's `context.mode !== 'flash'` (hooks.ts). It did not affect the
    hashed input (Layer 1 golden still green), but the table now matches the
    real frontend context it claims to mirror.
  - replay_provider.py docstring: stop claiming memory is recorded-enabled; the
    replay config disables memory/summarization for determinism (title stays, as
    an in-graph deterministic call).
  - record_gateway.py / run_replay_gateway.py: override DEER_FLOW_HOME instead of
    setdefault, so an outer value can't leak into the hermetic harness.
  - record_gateway.py: clear error when DEERFLOW_RECORD_OUT is unset (was a bare
    KeyError).
  - playwright.record.config.ts: forward OPENAI_*/DEERFLOW_RECORD_OUT only when
    set, so the gateway raises a clear 'missing env' error instead of getting ''.

* test(e2e): address Copilot review round 2

  - seed_runs_router.py: constrain SeedMessage.role to Literal['human','ai'] so a
    bad value is a clean 422 at the boundary instead of a 500 (KeyError on
    _EVENT_TYPE).
  - record-write-read-file.spec.ts: waitForCaptureStable now throws on timeout
    instead of returning the last count, so a truncated/partial recording can't
    pass silently.
  - real-backend-render.spec.ts: guard the suggestions JSON.parse; a
    bracket-prefixed non-JSON turn falls back to '' so the existing
    not.toBe('') assertion fails clearly instead of a generic parse throw.
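The "replay by a normalized hash of the model input" idea from the first commit can be illustrated with a small TypeScript sketch. This is not the code in tests/replay_provider.py (which is Python and is not shown here): the `normalize` patterns, the `ReplayStore` class, and the FNV-1a hash are all hypothetical stand-ins chosen to keep the example self-contained. A real harness would more likely use a cryptographic hash.

```typescript
// Hypothetical normalizer: strips the volatile parts the commit message lists
// (<system-reminder> blocks, UUIDs, dates, tmp paths). The exact patterns used
// by the real harness are an assumption here.
function normalize(input: string): string {
  return input
    .replace(/<system-reminder>[\s\S]*?<\/system-reminder>/g, "")
    .replace(/\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b/gi, "<uuid>")
    .replace(/\b\d{4}-\d{2}-\d{2}\b/g, "<date>")
    .replace(/\/tmp\/[^\s"']+/g, "<tmp>")
    .replace(/\s+/g, " ")
    .trim();
}

// FNV-1a (32-bit), used only so the sketch needs no imports.
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, "0");
}

// Hypothetical store: maps hash(normalized input) -> recorded model output.
class ReplayStore {
  private turns = new Map<string, string>();

  record(input: string, output: string): void {
    this.turns.set(fnv1a(normalize(input)), output);
  }

  replay(input: string): string {
    const key = fnv1a(normalize(input));
    const hit = this.turns.get(key);
    if (hit === undefined) {
      // A miss raises loudly instead of silently diverging.
      throw new Error(`replay miss for input hash ${key}`);
    }
    return hit;
  }
}

const store = new ReplayStore();
store.record(
  "Today is 2026-06-10. <system-reminder>be brief</system-reminder> Write /tmp/run-a1/out.txt",
  "recorded reply",
);
// A later run with a different date and tmp path still hits the same fixture.
const reply = store.replay("Today is 2027-01-02. Write /tmp/run-zz/out.txt");
```

Normalizing before hashing is what lets one fixture replay across days; the thrown error on a miss is what keeps an unexpected prompt change from passing unnoticed.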
102 lines
4.7 KiB
TypeScript
import { expect, test } from "@playwright/test";

/**
 * Layer 2 (cross-stack contract): reproduces upstream issue #3352 — after the
 * checkpoint no longer holds the older messages (post context-compression), the
 * frontend rebuilds thread history from the per-run endpoints, and the order it
 * rebuilds them in must stay chronological.
 *
 * The dangerous class this guards: a BACKEND change to run ordering silently
 * breaks a FRONTEND assumption. Backend `list_by_thread` returns runs
 * NEWEST-FIRST (PR #2932); the pre-#3354 frontend iterated runs from the end and
 * PREPENDED each loaded page (`core/threads/hooks.ts`), which inverts order. A
 * backend-only ordering test was green the whole time #3352 was live, and the
 * frontend regression unit test hardcodes "backend returns newest-first" in a
 * mock — so only a real frontend against a real backend catches the desync.
 *
 * This drives the REAL frontend against a REAL gateway with two seeded runs and
 * NO checkpoint (the seeder forces the per-run reload path to be the sole source
 * of truth), then asserts the first run's message renders ABOVE the second's.
 * No model, no recording, no API key — the runs are seeded via a test-only
 * endpoint mounted only on the replay gateway.
 */
const APP = "http://localhost:3000";

// Distinctive markers so getByText can't collide with UI chrome.
const ALPHA = "ALPHA-FIRST-QUESTION-7f3a2c";
const OMEGA = "OMEGA-SECOND-QUESTION-9b21d4";

test.describe("multi-run thread renders chronologically (replay, no API key)", () => {
  test("first run renders above second run after history rebuild (#3352)", async ({
    page,
    context,
  }) => {
    const uniq = `${Date.now()}-${Math.floor(Math.random() * 1e6)}`;
    const threadId = `e2e-multi-run-${uniq}`;
    const email = `e2e-${uniq}@example.com`;

    // Register through the frontend origin (same-origin proxy) so the auth
    // cookies are stored for localhost and forwarded to the gateway via the
    // next.config rewrite — never cross-origin from the browser.
    const reg = await context.request.post(`${APP}/api/v1/auth/register`, {
      data: { email, password: "very-strong-password-123" },
    });
    expect(reg.status(), await reg.text()).toBe(201);

    const cookies = await context.cookies();
    const csrf = cookies.find((c) => c.name === "csrf_token")?.value;
    expect(csrf, "register must set csrf_token cookie").toBeTruthy();

    // Seed two runs in one thread: run-1 (ALPHA) older, run-2 (OMEGA) newer, so
    // the real backend's list_by_thread returns them newest-first. No checkpoint
    // is seeded — that is the #3352 precondition.
    const seed = await context.request.post(`${APP}/api/test-only/seed-runs`, {
      headers: { "X-CSRF-Token": csrf! },
      data: {
        thread_id: threadId,
        runs: [
          {
            run_id: `${threadId}-r1`,
            created_at: "2026-01-01T00:00:00+00:00",
            messages: [
              { role: "human", content: ALPHA, id: `${threadId}-a-h` },
              { role: "ai", content: "ALPHA reply", id: `${threadId}-a-a` },
            ],
          },
          {
            run_id: `${threadId}-r2`,
            created_at: "2026-01-01T00:01:00+00:00",
            messages: [
              { role: "human", content: OMEGA, id: `${threadId}-o-h` },
              { role: "ai", content: "OMEGA reply", id: `${threadId}-o-a` },
            ],
          },
        ],
      },
    });
    expect(seed.status(), await seed.text()).toBe(200);

    // Load the thread fresh — triggers useThreadHistory's per-run reload path.
    await page.goto(`/workspace/chats/${threadId}`);

    const alpha = page.getByText(ALPHA, { exact: false });
    const omega = page.getByText(OMEGA, { exact: false });
    await expect(alpha).toBeVisible({ timeout: 60_000 });
    await expect(omega).toBeVisible({ timeout: 30_000 });
    // Each marker renders exactly once (guards against accidental duplicate matches).
    expect(await alpha.count(), "ALPHA should render exactly once").toBe(1);
    expect(await omega.count(), "OMEGA should render exactly once").toBe(1);

    // The contract: ALPHA (first run) must render ABOVE OMEGA (second run). With
    // the #3352 bug the per-run rebuild inverts this and OMEGA renders first.
    const alphaBox = await alpha.first().boundingBox();
    const omegaBox = await omega.first().boundingBox();
    expect(alphaBox, "ALPHA must have a layout box").toBeTruthy();
    expect(omegaBox, "OMEGA must have a layout box").toBeTruthy();
    expect(
      alphaBox!.y,
      `chronological order broken: ALPHA(first run) rendered at y=${alphaBox!.y}, OMEGA(second run) at y=${omegaBox!.y} — backend list_by_thread ordering and frontend history rebuild are out of sync (#3352)`,
    ).toBeLessThan(omegaBox!.y);
  });
});
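
The inversion this spec guards against can be reproduced without a browser. The sketch below is a minimal model of the behaviour described in the header comment (newest-first run list, walked from the end, each page prepended). It is not the code in `core/threads/hooks.ts`; the `Run` type and both function names are hypothetical, and `rebuildAppending` is just one way to restore chronological order, not necessarily what the #3354 fix does.

```typescript
// Hypothetical shape of one run's loaded page of messages.
type Run = { runId: string; messages: string[] };

// What the backend is described as returning: runs NEWEST-FIRST.
const newestFirst: Run[] = [
  { runId: "r2", messages: ["OMEGA", "OMEGA reply"] },
  { runId: "r1", messages: ["ALPHA", "ALPHA reply"] },
];

// Buggy shape: iterate from the end (oldest run first) and PREPEND each page.
// The last page prepended is the newest run, so it ends up on top.
function rebuildPrepending(runs: Run[]): string[] {
  let history: string[] = [];
  for (const run of [...runs].reverse()) {
    history = [...run.messages, ...history];
  }
  return history;
}

// One possible correction: same iteration order, but APPEND each page.
function rebuildAppending(runs: Run[]): string[] {
  const history: string[] = [];
  for (const run of [...runs].reverse()) {
    history.push(...run.messages);
  }
  return history;
}

console.log(rebuildPrepending(newestFirst));
// [ 'OMEGA', 'OMEGA reply', 'ALPHA', 'ALPHA reply' ]  (inverted)
console.log(rebuildAppending(newestFirst));
// [ 'ALPHA', 'ALPHA reply', 'OMEGA', 'OMEGA reply' ]  (chronological)
```

Both functions are individually reasonable; the bug only exists in combination with the backend's ordering, which is why the spec asserts on rendered position against a real gateway rather than on either side in isolation.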