Files
deer-flow/backend/packages/harness/deerflow/persistence/json_compat.py
T
He Wang e9deb6c2f2 perf(harness): push thread metadata filters into SQL (#2865)
* perf(harness): push thread metadata filters into SQL

Replace Python-side metadata filtering (5x overfetch + in-memory match)
with database-side json_extract predicates so LIMIT/OFFSET pagination
is exact regardless of match density.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

* fix(harness): add dialect-aware JsonMatch compiler for type-safe metadata SQL filters

Replace SQLAlchemy JSON index/comparator APIs with a custom JsonMatch
ColumnElement that compiles to json_type/json_extract on SQLite and
jsonb_typeof/->>/-> on PostgreSQL. Tighten key validation regex to
single-segment identifiers, handle None/bool/numeric value types with
json_type-based discrimination, and strengthen test coverage for edge
cases and discriminability.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

* fix(harness): address Copilot review comments on JSON metadata filters

- Use json_typeof instead of jsonb_typeof in PostgreSQL compiler; the
  metadata_json column is JSON not JSONB so jsonb_typeof would error at
  runtime on any PostgreSQL backend
- Align _is_safe_json_key with json_match's _KEY_CHARSET_RE so keys
  containing hyphens or leading digits are not silently skipped
- Add thread_id as secondary ORDER BY in search() to make pagination
  deterministic when updated_at values collide; remove asyncio.sleep
  from the pagination regression test

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* fix(harness): address remaining review comments on metadata SQL filters

- Remove _is_safe_json_key() and reuse json_match ValueError to avoid
  validator drift (Copilot #3217603895, #3217411616)
- Raise ValueError when all metadata keys are rejected so callers never
  get silent unfiltered results (WillemJiang)
- Fix integer precision: split int/float branches, bind int as Integer()
  with INTEGER/BIGINT CAST instead of float() coercion (Copilot #3217603972)
- Fix jsonb_typeof -> json_typeof on JSON column (Copilot #3217411579)
- Replace manual _cleanup() calls with async yield fixture so teardown
  always runs (Copilot #3217604019)
- Remove asyncio.sleep(0.01) pagination ordering; use thread_id secondary
  sort instead (Copilot #3217411636)
- Add type annotations to _bind/_build_clause/_compile_* and remove EOL
  comments from _Dialect fields (coding.mdc)
- Expand test coverage: boolean/null/mixed-type/large-int precision,
  partial unsafe-key skip with caplog assertion

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(harness): address third-round Copilot review comments on JsonMatch

- Reject unsupported value types (list, dict, ...) in JsonMatch.__init__
  with TypeError so inherit_cache=True never receives an unhashable value
  and callers get an explicit error instead of silent str() coercion
  (Copilot #3217933201)
- Upgrade int bindparam from Integer() to BigInteger() to align with
  BIGINT CAST and avoid overflow on large integers (Copilot #3217933252)
- Catch TypeError alongside ValueError in search() so non-string metadata
  keys are warned and skipped rather than raising unexpectedly
  (Copilot #3217933300)
- Add three tests: json_match rejects unsupported value types, search()
  warns and raises on non-string key, search() warns and raises on
  unsupported value type

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(harness): address fourth-round Copilot review comments on JsonMatch

- Add CASE WHEN guard for PostgreSQL integer matching: json_typeof returns
  'number' for both ints and floats; wrap CAST in CASE with regex guard
  '^-?[0-9]+$' so float rows never trigger CAST error (Copilot #3218413860)
- Validate isinstance(key, str) before regex match in JsonMatch.__init__
  so non-string keys raise ValueError consistently instead of TypeError
  from re.match (Copilot #3218413900)
- Include exception message in metadata filter skip warning so callers
  can distinguish invalid key from unsupported value type (Copilot #3218413924)
- Update tests: assert CASE WHEN guard in PG int compilation, cover
  non-string key ValueError in test_json_match_rejects_unsafe_key

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(harness): align ThreadMetaStore.search() signature with sql.py implementation

Use `dict[str, Any]` for `metadata` and `list[dict[str, Any]]` as return
type in base class and MemoryThreadMetaStore to resolve an LSP signature
mismatch; also correct a test docstring that cited the wrong exception type.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(harness): surface InvalidMetadataFilterError as HTTP 400 in search endpoint

Replace bare ValueError with a domain-specific InvalidMetadataFilterError
(subclass of ValueError) so the Gateway handler can catch it and return
HTTP 400 instead of letting it bubble up as a 500.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

* fix(harness): sanitize metadata keys in log output to prevent log injection

Use ascii() instead of %r to escape control characters in client-supplied
metadata keys before logging, preventing multiline/forged log entries.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* fix(harness): validate metadata filters at API boundary and dedupe key/value rules

- Add Pydantic ``field_validator`` on ``ThreadSearchRequest.metadata`` so
  unsafe keys / unsupported value types are rejected with HTTP 422 from
  both SQL and memory backends (closes Copilot review 3218830849).
- Export ``validate_metadata_filter_key`` / ``validate_metadata_filter_value``
  (and ``ALLOWED_FILTER_VALUE_TYPES``) from ``json_compat`` and have
  ``JsonMatch.__init__`` reuse them — the Gateway-side validator and the
  SQL-side ``JsonMatch`` constructor now share one admission rule and
  cannot drift.
- Format ``InvalidMetadataFilterError`` rejected-keys list as a
  comma-separated plain string instead of a Python list repr so the
  surfaced HTTP 400 detail is readable (closes Copilot review 3218830899).
- Update router tests to cover both 422 boundary paths plus the 400
  defense-in-depth path when a backend still raises the error.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(harness): harden JsonMatch compile-time key validation against __init__ bypass

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* fix: address review feedback on metadata filter SQL push-down

- Add signed 64-bit range check to validate_metadata_filter_value; give
  out-of-range ints a distinct TypeError message.

- Replace assert guards in _compile_sqlite/_compile_pg with explicit
  if/raise so they survive python -O optimisation.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4 <noreply@anthropic.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-12 23:21:22 +08:00

196 lines
7.7 KiB
Python

"""Dialect-aware JSON value matching for SQLAlchemy (SQLite + PostgreSQL)."""
from __future__ import annotations
import re
from dataclasses import dataclass
from typing import Any
from sqlalchemy import BigInteger, Float, String, bindparam
from sqlalchemy.ext.compiler import compiles
from sqlalchemy.sql.compiler import SQLCompiler
from sqlalchemy.sql.expression import ColumnElement
from sqlalchemy.sql.visitors import InternalTraversal
from sqlalchemy.types import Boolean, TypeEngine
# Key is interpolated into compiled SQL; restrict charset to prevent injection.
_KEY_CHARSET_RE = re.compile(r"^[A-Za-z0-9_\-]+$")
# Allowed value types for metadata filter values (same set accepted by JsonMatch).
ALLOWED_FILTER_VALUE_TYPES: tuple[type, ...] = (type(None), bool, int, float, str)
# SQLite raises an overflow when binding values outside signed 64-bit range;
# PostgreSQL overflows during BIGINT cast. Reject at validation time instead.
_INT64_MIN = -(2**63)
_INT64_MAX = 2**63 - 1
def validate_metadata_filter_key(key: object) -> bool:
"""Return True if *key* is safe for use as a JSON metadata filter key.
A key is "safe" when it is a string matching ``[A-Za-z0-9_-]+``. The
charset is restricted because the key is interpolated into the
compiled SQL path expression (``$."<key>"`` / ``->`` literal), so any
laxer pattern would open a SQL/JSONPath injection surface.
"""
return isinstance(key, str) and bool(_KEY_CHARSET_RE.match(key))
def validate_metadata_filter_value(value: object) -> bool:
"""Return True if *value* is an allowed type for a JSON metadata filter.
Matches the set of types ``_build_clause`` knows how to compile into
a dialect-portable predicate. Anything else (list/dict/bytes/...) is
intentionally rejected rather than silently coerced via ``str()`` —
silent coercion would (a) produce wrong matches and (b) break
SQLAlchemy's ``inherit_cache`` invariant when ``value`` is unhashable.
Integer values are additionally restricted to the signed 64-bit range
``[-2**63, 2**63 - 1]``: SQLite overflows when binding larger values
and PostgreSQL overflows during the ``BIGINT`` cast.
"""
if not isinstance(value, ALLOWED_FILTER_VALUE_TYPES):
return False
if isinstance(value, int) and not isinstance(value, bool):
if not (_INT64_MIN <= value <= _INT64_MAX):
return False
return True
class JsonMatch(ColumnElement):
"""Dialect-portable ``column[key] == value`` for JSON columns.
Compiles to ``json_type``/``json_extract`` on SQLite and
``json_typeof``/``->>`` on PostgreSQL, with type-safe comparison
that distinguishes bool vs int and NULL vs missing key.
*key* must be a single literal key matching ``[A-Za-z0-9_-]+``.
*value* must be one of: ``None``, ``bool``, ``int`` (signed 64-bit), ``float``, ``str``.
"""
inherit_cache = True
type = Boolean()
_is_implicitly_boolean = True
_traverse_internals = [
("column", InternalTraversal.dp_clauseelement),
("key", InternalTraversal.dp_string),
("value", InternalTraversal.dp_plain_obj),
]
def __init__(self, column: ColumnElement, key: str, value: object) -> None:
if not validate_metadata_filter_key(key):
raise ValueError(f"JsonMatch key must match {_KEY_CHARSET_RE.pattern!r}; got: {key!r}")
if not validate_metadata_filter_value(value):
if isinstance(value, int) and not isinstance(value, bool):
raise TypeError(f"JsonMatch int value out of signed 64-bit range [-2**63, 2**63-1]: {value!r}")
raise TypeError(f"JsonMatch value must be None, bool, int, float, or str; got: {type(value).__name__!r}")
self.column = column
self.key = key
self.value = value
super().__init__()
@dataclass(frozen=True)
class _Dialect:
"""Per-dialect names used when emitting JSON type/value comparisons."""
null_type: str
num_types: tuple[str, ...]
num_cast: str
int_types: tuple[str, ...]
int_cast: str
# None for SQLite where json_type already returns 'integer'/'real';
# regex literal for PostgreSQL where json_typeof returns 'number' for
# both ints and floats, so an extra guard prevents CAST errors on floats.
int_guard: str | None
string_type: str
bool_type: str | None
_SQLITE = _Dialect(
null_type="null",
num_types=("integer", "real"),
num_cast="REAL",
int_types=("integer",),
int_cast="INTEGER",
int_guard=None,
string_type="text",
bool_type=None,
)
_PG = _Dialect(
null_type="null",
num_types=("number",),
num_cast="DOUBLE PRECISION",
int_types=("number",),
int_cast="BIGINT",
int_guard="'^-?[0-9]+$'",
string_type="string",
bool_type="boolean",
)
def _bind(compiler: SQLCompiler, value: object, sa_type: TypeEngine[Any], **kw: Any) -> str:
param = bindparam(None, value, type_=sa_type)
return compiler.process(param, **kw)
def _type_check(typeof: str, types: tuple[str, ...]) -> str:
if len(types) == 1:
return f"{typeof} = '{types[0]}'"
quoted = ", ".join(f"'{t}'" for t in types)
return f"{typeof} IN ({quoted})"
def _build_clause(compiler: SQLCompiler, typeof: str, extract: str, value: object, dialect: _Dialect, **kw: Any) -> str:
if value is None:
return f"{typeof} = '{dialect.null_type}'"
if isinstance(value, bool):
# bool check must precede int check — bool is a subclass of int in Python
bool_str = "true" if value else "false"
if dialect.bool_type is None:
return f"{typeof} = '{bool_str}'"
return f"({typeof} = '{dialect.bool_type}' AND {extract} = '{bool_str}')"
if isinstance(value, int):
bp = _bind(compiler, value, BigInteger(), **kw)
if dialect.int_guard:
# CASE prevents CAST error when json_typeof = 'number' also matches floats
return f"(CASE WHEN {_type_check(typeof, dialect.int_types)} AND {extract} ~ {dialect.int_guard} THEN CAST({extract} AS {dialect.int_cast}) END = {bp})"
return f"({_type_check(typeof, dialect.int_types)} AND CAST({extract} AS {dialect.int_cast}) = {bp})"
if isinstance(value, float):
bp = _bind(compiler, value, Float(), **kw)
return f"({_type_check(typeof, dialect.num_types)} AND CAST({extract} AS {dialect.num_cast}) = {bp})"
bp = _bind(compiler, str(value), String(), **kw)
return f"({typeof} = '{dialect.string_type}' AND {extract} = {bp})"
@compiles(JsonMatch, "sqlite")
def _compile_sqlite(element: JsonMatch, compiler: SQLCompiler, **kw: Any) -> str:
if not validate_metadata_filter_key(element.key):
raise ValueError(f"Key escaped validation: {element.key!r}")
col = compiler.process(element.column, **kw)
path = f'$."{element.key}"'
typeof = f"json_type({col}, '{path}')"
extract = f"json_extract({col}, '{path}')"
return _build_clause(compiler, typeof, extract, element.value, _SQLITE, **kw)
@compiles(JsonMatch, "postgresql")
def _compile_pg(element: JsonMatch, compiler: SQLCompiler, **kw: Any) -> str:
if not validate_metadata_filter_key(element.key):
raise ValueError(f"Key escaped validation: {element.key!r}")
col = compiler.process(element.column, **kw)
typeof = f"json_typeof({col} -> '{element.key}')"
extract = f"({col} ->> '{element.key}')"
return _build_clause(compiler, typeof, extract, element.value, _PG, **kw)
@compiles(JsonMatch)
def _compile_default(element: JsonMatch, compiler: SQLCompiler, **kw: Any) -> str:
raise NotImplementedError(f"JsonMatch supports only sqlite and postgresql; got dialect: {compiler.dialect.name}")
def json_match(column: ColumnElement, key: str, value: object) -> JsonMatch:
return JsonMatch(column, key, value)