* feat(memory): add memory.token_counting config to avoid tiktoken network dependency (#3429)
Add a `memory.token_counting` option (`tiktoken` | `char`) so deployments in
network-restricted environments can opt out of tiktoken entirely. In `char`
mode the memory-injection budget uses a network-free character-based estimate
and never triggers the BPE download from openaipublic.blob.core.windows.net,
which could otherwise block for tens of minutes (see #3402).
Also harden the default `tiktoken` path:
- cache an in-flight LOADING sentinel so concurrent callers fall back
immediately instead of spawning more blocking get_encoding threads when the
first load is still running (e.g. under the 5s startup warm-up timeout);
- cache failures with a timestamp and retry after a cooldown so a transient
network outage self-heals back to accurate counting without a restart;
- skip startup warm-up entirely in char mode.
The new config is surfaced via the memory config API and config.example.yaml
(config_version bumped). Default remains `tiktoken`, so existing deployments
are unaffected.
* fix(memory): use CJK-aware char token estimate and address review feedback
- Replace the flat len(text)//4 fallback with a CJK-aware estimate so
Chinese/Japanese/Korean memory content does not over-fill the injection budget
- Document the internal tiktoken retry cooldown and char-mode escape hatch
- Sync CLAUDE.md / config.example.yaml / MEMORY_IMPROVEMENTS.md wording
- Fix MemoryConfigResponse mocks/assertions and add CJK estimate tests
* fix(middleware): offload memory injection off event loop to prevent tiktoken blocking (#3402)
DynamicContextMiddleware.abefore_agent() called _inject() synchronously
on the asyncio event loop. The first time memory is injected (second
request), _inject() → format_memory_for_injection() → _count_tokens()
→ tiktoken.get_encoding("cl100k_base") needs to download the BPE data
from openaipublic.blob.core.windows.net. In network-restricted
environments this download blocks until the OS TCP timeout (~26 min),
starving ALL concurrent handlers including /api/v1/auth/me.
Fix:
- abefore_agent now uses asyncio.to_thread(self._inject, state) so
file I/O and tiktoken never block the event loop.
- Extract _get_tiktoken_encoding() with a module-level cache so
tiktoken.get_encoding() is called at most once per encoding name.
- Add warm_tiktoken_cache() startup helper; gateway lifespan pre-warms
the cache via asyncio.to_thread so the first request never triggers a
cold download.
- _count_tokens falls back to len(text) // 4 on any encoding failure.
Tests:
- tests/test_tiktoken_cache_and_count_tokens.py (12 tests): cache
hit/miss, fallback paths, warm-up helper.
- tests/blocking_io/test_dynamic_context_middleware.py (2 tests):
Blockbuster gate verifies abefore_agent does not block the event
loop; async/sync parity check.
Fixes#3402
* Apply suggestions from code review
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* fix the lint error
* fix(memory): use future annotations to avoid NameError when tiktoken is absent
Add `from __future__ import annotations` to prompt.py so that
tiktoken.Encoding type hints are never evaluated at runtime. Without
this, environments where tiktoken is not installed could raise NameError
on the module-level cache and function return annotations.
Addresses Copilot review comment on PR #3411.
* fix(middleware): bound abefore_agent injection with timeout to prevent hung requests
Wrap the asyncio.to_thread(self._inject) offload in asyncio.wait_for()
with a 5-second cap. If the startup warm-up failed silently (e.g.
network blip during deploy), a cold tiktoken BPE download on the first
request can block until the OS TCP timeout (~26 min). The bounded
timeout ensures the request degrades gracefully (no memory/date context
for that turn) rather than hanging.
Adds test_abefore_agent_returns_none_on_timeout to the blocking-IO
regression anchors.
Addresses review feedback from xg-gh-25 on PR #3411.
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>