fix: cap prompt caching breakpoints at 4 to prevent API 400 errors (#2449)

* fix: cap prompt caching breakpoints at 4 to prevent API 400 errors (fixes #2448) The previous _apply_prompt_caching() attached cache_control to every text block in the system prompt, every content block in the last N messages, and the last tool definition. In multi-turn conversations with structured content blocks this easily exceeded the 4-breakpoint hard limit enforced by both the Anthropic API and AWS Bedrock, producing a 400 Bad Request (or a silent "No generations found in stream" when streaming). Fix: collect all candidate blocks in document order, then apply cache_control only to the last MAX_CACHE_BREAKPOINTS (4) of them. Later breakpoints cover a larger prefix and therefore yield better cache hit rates, making this the optimal placement strategy as well as the safe one. Adds 13 unit tests covering the budget cap, edge cases, and correct last-candidate placement. * docs: clarify _apply_prompt_caching docstring includes tool definitions Per Copilot review: the implementation also caches the last tool definition (see the candidates list at lines 202-205), so the docstring summary should explicitly mention tools alongside system and recent messages. * Fix the lint error * style: fix ruff format check for test_claude_provider_prompt_caching.py Add the missing blank line before the 'Edge cases' section comment so that ruff format --check passes in CI. --------- Co-authored-by: octo-patch <octo-patch@github.com> Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
2026-05-22 16:06:50 +00:00 · 2026-04-25 19:40:06 +08:00
parent f394c0d8c8
commit 1f59e945af
2 changed files with 281 additions and 21 deletions
@@ -190,23 +190,33 @@ class ClaudeChatModel(ChatAnthropic):
            )

    def _apply_prompt_caching(self, payload: dict) -> None:
-        """Apply ephemeral cache_control to system and recent messages."""
-        # Cache system messages
+        """Apply ephemeral cache_control to system, recent messages, and last tool definition.
+
+        Uses a budget of MAX_CACHE_BREAKPOINTS (4) breakpoints — the hard limit
+        enforced by both the Anthropic API and AWS Bedrock.  Breakpoints are
+        placed on the *last* eligible blocks because later breakpoints cover a
+        larger prefix and yield better cache hit rates.
+        """
+        MAX_CACHE_BREAKPOINTS = 4
+
+        # Collect candidate blocks in document order:
+        #   1. system text blocks
+        #   2. content blocks of the last prompt_cache_size messages
+        #   3. the last tool definition
+        candidates: list[dict] = []
+
+        # 1. System blocks
        system = payload.get("system")
        if system and isinstance(system, list):
            for block in system:
                if isinstance(block, dict) and block.get("type") == "text":
-                    block["cache_control"] = {"type": "ephemeral"}
+                    candidates.append(block)
        elif system and isinstance(system, str):
-            payload["system"] = [
-                {
-                    "type": "text",
-                    "text": system,
-                    "cache_control": {"type": "ephemeral"},
-                }
-            ]
+            new_block: dict = {"type": "text", "text": system}
+            payload["system"] = [new_block]
+            candidates.append(new_block)

-        # Cache recent messages
+        # 2. Recent message blocks
        messages = payload.get("messages", [])
        cache_start = max(0, len(messages) - self.prompt_cache_size)
        for i in range(cache_start, len(messages)):
@@ -217,20 +227,21 @@ class ClaudeChatModel(ChatAnthropic):
            if isinstance(content, list):
                for block in content:
                    if isinstance(block, dict):
-                        block["cache_control"] = {"type": "ephemeral"}
+                        candidates.append(block)
            elif isinstance(content, str) and content:
-                msg["content"] = [
-                    {
-                        "type": "text",
-                        "text": content,
-                        "cache_control": {"type": "ephemeral"},
-                    }
-                ]
+                new_block = {"type": "text", "text": content}
+                msg["content"] = [new_block]
+                candidates.append(new_block)

-        # Cache the last tool definition
+        # 3. Last tool definition
        tools = payload.get("tools", [])
        if tools and isinstance(tools[-1], dict):
-            tools[-1]["cache_control"] = {"type": "ephemeral"}
+            candidates.append(tools[-1])
+
+        # Apply cache_control only to the last MAX_CACHE_BREAKPOINTS candidates
+        # to stay within the API limit.
+        for block in candidates[-MAX_CACHE_BREAKPOINTS:]:
+            block["cache_control"] = {"type": "ephemeral"}

    def _apply_thinking_budget(self, payload: dict) -> None:
        """Auto-allocate thinking budget (80% of max_tokens)."""