Two tests in this module (test_returns_none_on_invalid_path and
test_fallback_when_disk_write_fails) used paths like
'/nonexistent/impossible/path' to trigger _externalize's OSError
fallback. These paths are creatable when the test process runs as root
inside the CI container: os.makedirs(..., exist_ok=True) successfully
creates the entire chain under /, so the OSError branch is never hit
and the tests fail. Reproducible on main independently of this PR.
Switch to '/dev/null/cannot-mkdir-here'. /dev/null is a character
device on both Linux and macOS, so os.makedirs always fails with
NotADirectoryError regardless of privileges, reliably exercising the
OSError fallback.
ToolOutputBudgetMiddleware persisted oversized tool results to the host
filesystem and returned a /mnt/user-data/outputs virtual path. For sandboxes
that do not use thread-data mounts (e.g. remote AIO sandbox), that virtual
path does not exist inside the sandbox, so the model's read_file tool could
not read it back and reported 'file not found'.
Branch on SandboxProvider.uses_thread_data_mounts:
- Mounted sandboxes (local Docker, AIO + LocalContainerBackend) keep the
original host-disk path; the host outputs dir is bind-mounted to the same
virtual path inside the sandbox, so behavior is unchanged.
- Non-mounted (remote) sandboxes externalize into the sandbox itself via
execute_command('mkdir -p ...') + write_file + 'test -s' validation. The
validation step is required because AIO sandbox execute_command returns
'Error: ...' as a string on failure instead of raising, so a silent mkdir
failure would otherwise leak through.
Any failure (rejected subdir, mkdir/write/validate error) falls back to the
existing inline head+tail truncation, so an unreadable path is never returned
to the model.
The sandbox resolver reads the sandbox_id that SandboxMiddleware already
writes into runtime.state['sandbox']; it never calls provider.acquire(),
keeping the tool-call hot path free of blocking I/O. Tools that do not use a
sandbox (web_search, MCP, ...) resolve to None and fall through to inline
truncation, which is the safe behavior for them.
Fixes#3416
* feat(agent): add ToolOutputBudgetMiddleware for oversized tool output protection
Closes#3289. Adds a unified middleware that enforces per-result budgets
on ALL tool outputs (MCP, sandbox, community, custom), preventing
oversized external tool results from blowing the model context window.
Design informed by claude-code (persistToolResult), hermes-agent
(tool_result_storage), and pi (OutputAccumulator) — the three most
mature implementations in production coding-agent frameworks.
Key features:
- Disk externalization: oversized outputs written to thread-local
.tool-results/ directory, replaced with compact preview + file
reference. Model can read full output via read_file with offset/limit.
- Fallback truncation: head+tail truncation when disk is unavailable
(no thread_data, write failure), ensuring the context is always
protected.
- read_file exemption: prevents persist-read-persist infinite loops
(independently discovered by claude-code, hermes-agent, and pi).
- Per-tool threshold overrides via config.
- Line-boundary-aware truncation (no partial lines in previews).
- Multimodal content passthrough (images/structured blocks skip budget).
- Historical ToolMessage patching in wrap_model_call for checkpoint
recovery scenarios.
Related: #3222 (design RFC), #1844 (comprehensive context management),
#3137 (write_file args compaction), #1677 (sandbox tool truncation).
* test: add MCP content_and_artifact format coverage
Add 5 tests for MCP tool output format (list of content blocks):
- text content blocks are extracted and budgeted
- multiple text blocks are joined and budgeted
- image content blocks are skipped (multimodal passthrough)
- mixed text+image blocks are skipped
- small text blocks pass through unchanged
Total test count: 59 (was 54).
* fix(agent): address Codex review findings for ToolOutputBudgetMiddleware
Three issues identified by Codex code review, all fixed:
1. `enabled` config field was unused — middleware now checks
`config.enabled` and skips all processing when disabled.
2. `_build_fallback` could exceed `fallback_max_chars` — the marker
text itself (~139 chars) was not deducted from the budget. Now
pre-computes marker overhead and falls back to hard slice when
max_chars is smaller than the marker.
3. Sync file I/O in async path — `awrap_tool_call` now delegates
`_patch_result` to `asyncio.to_thread` to avoid blocking the
event loop during disk writes.
Tests updated to use realistic fallback_max_chars values (500+)
that can accommodate the marker overhead, plus two new tests:
- `test_result_never_exceeds_max_chars` (parametric across sizes)
- `test_very_small_max_chars_does_not_crash`
* fix(agent): address Copilot review — path traversal, async perf, shared config
1. Path traversal defense: sanitize tool_name via _sanitize_tool_name()
(strips separators, .., absolute paths), validate storage_subdir is
relative, and verify resolved filepath stays inside storage_dir.
2. Async hot-path optimization: add _needs_budget() cheap check before
asyncio.to_thread offload — small outputs (99% of calls) skip the
thread overhead entirely.
3. Replace shared module-level _DEFAULT_CONFIG with _default_config()
factory to prevent cross-instance mutation of mutable fields.
12 new tests: TestSanitizeToolName (5), TestExternalizePathTraversal (3),
TestNeedsBudget (4).
* fix(agent): correct preview hint to match read_file actual API
read_file uses start_line/end_line (1-indexed line numbers), not
offset/limit. The previous wording was copied from hermes-agent
which has a different read_file interface.
* perf(agent): hoist hot-path imports, add model-call pre-scan (review #3303)
Address maintainer review feedback:
1. Hoist inline imports to module level — `import asyncio` (was in
awrap_tool_call hot path) and `from dataclasses import replace`
(was in _patch_result) now live at module top.
2. Add a cheap pre-scan to _patch_model_messages so the historical
message list is not rebuilt on every model call when nothing is
oversized (the common case once results are budgeted at tool-call
time). Also adds the same _needs_budget gate to the sync
wrap_tool_call for symmetry with awrap_tool_call.
The pre-scan is refactored into per-tool-aware helpers
(_effective_trigger / _tool_message_over_budget) that mirror the exact
trigger conditions in _budget_content — including tool_overrides — so
the fast-path can never produce a false negative (silently skipping
budgeting for a tool with a low per-tool threshold).
7 new regression tests lock the per-tool-override-through-pre-scan path
and the model-call early return.
---------
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>