fix: cap prompt caching breakpoints at 4 to prevent API 400 errors (#2449)

* fix: cap prompt caching breakpoints at 4 to prevent API 400 errors (fixes #2448)

The previous _apply_prompt_caching() attached cache_control to every text
block in the system prompt, every content block in the last N messages, and
the last tool definition. In multi-turn conversations with structured content
blocks this easily exceeded the 4-breakpoint hard limit enforced by both the
Anthropic API and AWS Bedrock, producing a 400 Bad Request (or a silent
"No generations found in stream" when streaming).

Fix: collect all candidate blocks in document order, then apply cache_control
only to the last MAX_CACHE_BREAKPOINTS (4) of them. Later breakpoints cover a
larger prefix and therefore yield better cache hit rates, making this the
optimal placement strategy as well as the safe one.

Adds 13 unit tests covering the budget cap, edge cases, and correct
last-candidate placement.

* docs: clarify _apply_prompt_caching docstring includes tool definitions

Per Copilot review: the implementation also caches the last tool definition
(see the candidates list at lines 202-205), so the docstring summary should
explicitly mention tools alongside system and recent messages.

* Fix the lint error

* style: fix ruff format check for test_claude_provider_prompt_caching.py

Add the missing blank line before the 'Edge cases' section comment so
that ruff format --check passes in CI.

---------

Co-authored-by: octo-patch <octo-patch@github.com>
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
This commit is contained in:
Octopus
2026-04-25 19:40:06 +08:00
committed by GitHub
parent f394c0d8c8
commit 1f59e945af
2 changed files with 281 additions and 21 deletions
@@ -190,23 +190,33 @@ class ClaudeChatModel(ChatAnthropic):
)
def _apply_prompt_caching(self, payload: dict) -> None:
"""Apply ephemeral cache_control to system and recent messages."""
# Cache system messages
"""Apply ephemeral cache_control to system, recent messages, and last tool definition.
Uses a budget of MAX_CACHE_BREAKPOINTS (4) breakpoints — the hard limit
enforced by both the Anthropic API and AWS Bedrock. Breakpoints are
placed on the *last* eligible blocks because later breakpoints cover a
larger prefix and yield better cache hit rates.
"""
MAX_CACHE_BREAKPOINTS = 4
# Collect candidate blocks in document order:
# 1. system text blocks
# 2. content blocks of the last prompt_cache_size messages
# 3. the last tool definition
candidates: list[dict] = []
# 1. System blocks
system = payload.get("system")
if system and isinstance(system, list):
for block in system:
if isinstance(block, dict) and block.get("type") == "text":
block["cache_control"] = {"type": "ephemeral"}
candidates.append(block)
elif system and isinstance(system, str):
payload["system"] = [
{
"type": "text",
"text": system,
"cache_control": {"type": "ephemeral"},
}
]
new_block: dict = {"type": "text", "text": system}
payload["system"] = [new_block]
candidates.append(new_block)
# Cache recent messages
# 2. Recent message blocks
messages = payload.get("messages", [])
cache_start = max(0, len(messages) - self.prompt_cache_size)
for i in range(cache_start, len(messages)):
@@ -217,20 +227,21 @@ class ClaudeChatModel(ChatAnthropic):
if isinstance(content, list):
for block in content:
if isinstance(block, dict):
block["cache_control"] = {"type": "ephemeral"}
candidates.append(block)
elif isinstance(content, str) and content:
msg["content"] = [
{
"type": "text",
"text": content,
"cache_control": {"type": "ephemeral"},
}
]
new_block = {"type": "text", "text": content}
msg["content"] = [new_block]
candidates.append(new_block)
# Cache the last tool definition
# 3. Last tool definition
tools = payload.get("tools", [])
if tools and isinstance(tools[-1], dict):
tools[-1]["cache_control"] = {"type": "ephemeral"}
candidates.append(tools[-1])
# Apply cache_control only to the last MAX_CACHE_BREAKPOINTS candidates
# to stay within the API limit.
for block in candidates[-MAX_CACHE_BREAKPOINTS:]:
block["cache_control"] = {"type": "ephemeral"}
def _apply_thinking_budget(self, payload: dict) -> None:
"""Auto-allocate thinking budget (80% of max_tokens)."""