fix(middleware): offload memory injection off event loop to prevent tiktoken blocking (#3402) (#3411)

* fix(middleware): offload memory injection off event loop to prevent tiktoken blocking (#3402)

  DynamicContextMiddleware.abefore_agent() called _inject() synchronously
  on the asyncio event loop.  The first time memory is injected (second
  request), _inject() → format_memory_for_injection() → _count_tokens()
  → tiktoken.get_encoding("cl100k_base") needs to download the BPE data
  from openaipublic.blob.core.windows.net.  In network-restricted
  environments this download blocks until the OS TCP timeout (~26 min),
  starving ALL concurrent handlers including /api/v1/auth/me.

  Fix:
  - abefore_agent now uses asyncio.to_thread(self._inject, state) so
    file I/O and tiktoken never block the event loop.
  - Extract _get_tiktoken_encoding() with a module-level cache so
    tiktoken.get_encoding() is called at most once per encoding name.
  - Add warm_tiktoken_cache() startup helper; gateway lifespan pre-warms
    the cache via asyncio.to_thread so the first request never triggers a
    cold download.
  - _count_tokens falls back to len(text) // 4 on any encoding failure.

  Tests:
  - tests/test_tiktoken_cache_and_count_tokens.py (12 tests): cache
    hit/miss, fallback paths, warm-up helper.
  - tests/blocking_io/test_dynamic_context_middleware.py (2 tests):
    Blockbuster gate verifies abefore_agent does not block the event
    loop; async/sync parity check.

  Fixes #3402

* Apply suggestions from code review

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* fix the lint error

* fix(memory): use future annotations to avoid NameError when tiktoken is absent

Add `from __future__ import annotations` to prompt.py so that
tiktoken.Encoding type hints are never evaluated at runtime.  Without
this, environments where tiktoken is not installed could raise NameError
on the module-level cache and function return annotations.

Addresses Copilot review comment on PR #3411.

* fix(middleware): bound abefore_agent injection with timeout to prevent hung requests

Wrap the asyncio.to_thread(self._inject) offload in asyncio.wait_for()
with a 5-second cap.  If the startup warm-up failed silently (e.g.
network blip during deploy), a cold tiktoken BPE download on the first
request can block until the OS TCP timeout (~26 min).  The bounded
timeout ensures the request degrades gracefully (no memory/date context
for that turn) rather than hanging.

Adds test_abefore_agent_returns_none_on_timeout to the blocking-IO
regression anchors.

Addresses review feedback from xg-gh-25 on PR #3411.

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
This commit is contained in:
Willem Jiang
2026-06-08 12:21:55 +08:00
committed by GitHub
parent 40a371b88c
commit 519200728a
5 changed files with 372 additions and 3 deletions
@@ -28,6 +28,7 @@ Date-update format:
from __future__ import annotations
import asyncio
import logging
import re
import uuid
@@ -43,6 +44,12 @@ if TYPE_CHECKING:
logger = logging.getLogger(__name__)
# Upper bound (seconds) for a single _inject() offload. If the warm-up at
# gateway startup failed silently, the first request may still hit a cold
# tiktoken BPE download that blocks until the OS TCP timeout (~26 min).
# This cap ensures the request degrades gracefully instead of hanging.
_INJECT_TIMEOUT_SECONDS = 5.0
_DATE_RE = re.compile(r"<current_date>([^<]+)</current_date>")
_DYNAMIC_CONTEXT_REMINDER_KEY = "dynamic_context_reminder"
_SUMMARY_MESSAGE_NAME = "summary"
@@ -201,4 +208,25 @@ class DynamicContextMiddleware(AgentMiddleware):
@override
async def abefore_agent(self, state, runtime: Runtime) -> dict | None:
return self._inject(state)
# _inject() performs synchronous file I/O (memory JSON loading) and
# potentially blocking network calls (tiktoken encoding download on
# first use). Offload to a thread so the event loop is never blocked
# — a blocking call here starves all concurrent HTTP handlers (auth,
# SSE heartbeats, etc.). See issue #3402.
#
# Bounded timeout: if startup warm-up failed silently (e.g. network
# blip during deploy), the first request's cold tiktoken download can
# block for tens of minutes (OS TCP timeout). Time-box injection so
# the request degrades gracefully (no memory context) rather than
# hanging.
try:
return await asyncio.wait_for(
asyncio.to_thread(self._inject, state),
timeout=_INJECT_TIMEOUT_SECONDS,
)
except TimeoutError:
logger.warning(
"DynamicContextMiddleware: injection timed out (%.1fs); skipping memory/date injection for this turn",
_INJECT_TIMEOUT_SECONDS,
)
return None