fix(persistence): stream hang when run_events.backend=db

DbRunEventStore._user_id_from_context() returned user.id without
coercing it to str. User.id is a Pydantic UUID, and aiosqlite cannot
bind a raw UUID object to a VARCHAR column, so the INSERT for the
initial human_message event silently rolled back and raised out of the
worker task. Because that put() sat outside the worker's try block, the
finally-clause that publishes end-of-stream never ran and the SSE
stream hung forever.

jsonl mode was unaffected because json.dumps(default=str) coerces UUID
objects transparently.

Fixes:
- db.py: coerce user.id to str at the context-read boundary (matches
  what resolve_user_id already does for the other repositories)
- worker.py: move RunJournal init + human_message put inside the try
  block so any failure flows through the finally/publish_end path
  instead of hanging the subscriber

Defense-in-depth:
- engine.py: add PRAGMA busy_timeout=5000 so checkpointer and event
  store wait for each other on the shared deerflow.db file instead of
  failing immediately under write-lock contention
- journal.py: skip fire-and-forget _flush_sync when a previous flush
  task is still in flight, to avoid piling up concurrent put_batch
  writes on the same SQLAlchemy engine during streaming; flush() now
  waits for pending tasks before draining the buffer
- database_config.py: doc-only update clarifying WAL + busy_timeout
  keep the unified deerflow.db safe for both workloads

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-25 17:36:00 +00:00 · 2026-04-11 11:16:22 +08:00
parent 20f64bbf4f
commit 10cc651578
5 changed files with 76 additions and 44 deletions
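
The hang described in the commit message can be reproduced in miniature. The sketch below is not the project's worker.py; `END`, `failing_put`, `worker_before`, `worker_after`, and `subscriber_sees_end` are hypothetical names, and an `asyncio.Queue` stands in for the SSE stream. It only illustrates why an awaited call placed before the `try` skips the `finally` that publishes end-of-stream, and why moving it inside the `try` restores the signal.

```python
import asyncio

END = object()  # hypothetical end-of-stream sentinel


async def failing_put() -> None:
    # Stands in for the initial event-store write that raised.
    raise RuntimeError("simulated INSERT failure")


async def worker_before(queue: asyncio.Queue) -> None:
    # The failing call sits outside the try block, so the exception
    # leaves the coroutine before the finally clause is ever entered.
    await failing_put()
    try:
        await queue.put("chunk")
    finally:
        await queue.put(END)


async def worker_after(queue: asyncio.Queue) -> None:
    # The same call inside the try block: the exception still propagates,
    # but the finally clause publishes END first.
    try:
        await failing_put()
        await queue.put("chunk")
    finally:
        await queue.put(END)


async def subscriber_sees_end(worker) -> bool:
    queue: asyncio.Queue = asyncio.Queue()
    task = asyncio.create_task(worker(queue))
    # Collect the worker's exception instead of letting it propagate here.
    await asyncio.gather(task, return_exceptions=True)
    try:
        # A short timeout stands in for a subscriber that would wait forever.
        item = await asyncio.wait_for(queue.get(), timeout=0.1)
    except asyncio.TimeoutError:
        return False
    return item is END


def demo() -> tuple[bool, bool]:
    before = asyncio.run(subscriber_sees_end(worker_before))
    after = asyncio.run(subscriber_sees_end(worker_after))
    return before, after
```

In this sketch `demo()` reports whether the subscriber received the sentinel for each variant; only the variant with the failing call inside the `try` delivers it.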
@@ -50,6 +50,7 @@ class RunJournal(BaseCallbackHandler):

        # Write buffer
        self._buffer: list[dict] = []
+        self._pending_flush_tasks: set[asyncio.Task[None]] = set()

        # Token accumulators
        self._total_input_tokens = 0
@@ -381,6 +382,10 @@ class RunJournal(BaseCallbackHandler):
        """
        if not self._buffer:
            return
+        # Skip if a flush is already in flight — avoids concurrent writes
+        # to the same SQLite file from multiple fire-and-forget tasks.
+        if self._pending_flush_tasks:
+            return
        try:
            loop = asyncio.get_running_loop()
        except RuntimeError:
@@ -389,6 +394,7 @@ class RunJournal(BaseCallbackHandler):
        batch = self._buffer.copy()
        self._buffer.clear()
        task = loop.create_task(self._flush_async(batch))
+        self._pending_flush_tasks.add(task)
        task.add_done_callback(self._on_flush_done)

    async def _flush_async(self, batch: list[dict]) -> None:
@@ -404,8 +410,8 @@ class RunJournal(BaseCallbackHandler):
            # Return failed events to buffer for retry on next flush
            self._buffer = batch + self._buffer

-    @staticmethod
-    def _on_flush_done(task: asyncio.Task) -> None:
+    def _on_flush_done(self, task: asyncio.Task) -> None:
+        self._pending_flush_tasks.discard(task)
        if task.cancelled():
            return
        exc = task.exception()
@@ -450,10 +456,17 @@ class RunJournal(BaseCallbackHandler):

    async def flush(self) -> None:
        """Force flush remaining buffer. Called in worker's finally block."""
-        if self._buffer:
-            batch = self._buffer.copy()
-            self._buffer.clear()
-            await self._store.put_batch(batch)
+        if self._pending_flush_tasks:
+            await asyncio.gather(*tuple(self._pending_flush_tasks), return_exceptions=True)
+
+        while self._buffer:
+            batch = self._buffer[: self._flush_threshold]
+            del self._buffer[: self._flush_threshold]
+            try:
+                await self._store.put_batch(batch)
+            except Exception:
+                self._buffer = batch + self._buffer
+                raise

    def get_completion_data(self) -> dict:
        """Return accumulated token and message data for run completion."""