fix(channels): add operational guardrails (#3584)

* fix(channels): add operational guardrails * make format * fix(channels): converge with #3582 to avoid merge-order conflicts Drop this PR's DingTalk INFO-log redaction and hand it to #3582, which already restructures that handler and will redact the same log there. This PR no longer touches dingtalk.py, so the two PRs can merge to main in any order without a conflict. For WeChat, drop the contested thread_ts priority reorder (review #3) and keep only what inbound dedupe needs: a server-stable message_id in the inbound metadata (message_id/msg_id, no client_id per review #6). This is a single added line inside the metadata dict, a region #3582 never touches, so it auto-merges regardless of order. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(channels): address three correctness review findings 1. Connect-code cap was racy (willem #1): _create_state ran delete-expired, count, and insert as three separate transactions, so concurrent connect POSTs from one owner could each see count < cap and all insert past it. Add ChannelConnectionRepository.create_oauth_state_within_cap which does delete+count+insert in a single transaction serialized per (owner, provider) — Postgres via pg_advisory_xact_lock, SQLite via the write lock the leading DELETE takes — and have the router use it. 2. Inbound dedupe key fell back to "" workspace (willem #3): two workspaces delivering without team/guild/aibotid would collapse to the same key and dedupe each other's messages. _inbound_dedupe_key now fails closed (returns None) when no workspace identifier is present. 3. Dedupe key was recorded on receipt and never released on failure (ShenAC #1): a transient error (DB blip, Gateway 503) left the key in place for the full TTL, so a provider redelivery of the same message_id — exactly the retry dedupe should absorb — was silently dropped. _handle_message now releases the key in the unexpected-exception branch so redelivery can recover, while keeping record-on-receipt so retries during handling are still deduped. Tests: repo cap enforcement incl. concurrent-issuance non-leak; dedupe fail-closed; dedupe key release-on-failure redelivery recovery. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(channels): address cleanup/efficiency and test review findings Efficiency / cleanup: - Dedupe key set drops client-generated ids (client_msg_id, client_id); keep only server-stable event_id/message_id/msg_id, which a provider's own redelivery preserves (ShenAC #6). Every provider already emits message_id. - TTL/overflow pruning of _recent_inbound_events is now O(k): switch to an OrderedDict and popitem(last=False) from the front instead of scanning all 4096 entries on every inbound (willem #4). - Log "received inbound" only after the dedupe check so a provider retrying N times no longer logs N accepts; document that manager dedupe covers the agent run/final answer, not provider ack side-effects (willem #5, ShenAC #2). - Slack drops the redundant `team_id or event.get("team")` fallback the caller already resolved (willem #6). - create_oauth_state_within_cap prunes only this owner/provider's expired codes instead of a global DELETE on every connect POST; global cleanup still runs on consume_oauth_state (willem #7). Tests: - Dedupe test uses tmp_path instead of a leaked mkdtemp, uses distinct objects per publish, and adds a negative control: a different message_id is still processed, catching over-dedupe regressions (willem #8, ShenAC #4). - Slack HTTP-mode rejection test supplies app_token so the missing-token early return can't mask the guard, giving the state assertions teeth (ShenAC #3). - count_oauth_states test pins that the active row survives, not just the count (ShenAC #5). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * make format --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 21:55:59 +00:00 · 2026-06-18 04:09:46 +02:00
parent 97dd9ecf73
commit 8c0830aea1
12 changed files with 468 additions and 51 deletions
@@ -11,7 +11,7 @@ from datetime import UTC, datetime
 from typing import Any

 from cryptography.fernet import Fernet, InvalidToken
-from sqlalchemy import delete, func, select, update
+from sqlalchemy import delete, func, select, text, update
 from sqlalchemy.exc import IntegrityError
 from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker

@@ -279,16 +279,128 @@ class ChannelConnectionRepository:
            session.add(row)
            await session.commit()

-    async def count_oauth_states(self, *, owner_user_id: str, provider: str) -> int:
+    async def create_oauth_state_within_cap(
+        self,
+        *,
+        owner_user_id: str,
+        provider: str,
+        state: str,
+        expires_at: datetime,
+        max_pending: int,
+        now: datetime | None = None,
+        code_verifier: str | None = None,
+        nonce_hash: str | None = None,
+        redirect_after: str | None = None,
+        requested_scopes: list[str] | None = None,
+        metadata: dict[str, Any] | None = None,
+    ) -> bool:
+        """Atomically enforce the per-(owner, provider) pending cap, then insert.
+
+        delete-expired + count + insert run in a single transaction serialized
+        per (owner, provider), so concurrent connect requests cannot each
+        observe ``count < max_pending`` and all insert (which would leak past
+        the cap). PostgreSQL takes a transaction-scoped advisory lock; SQLite
+        serializes writers through the write lock the leading DELETE acquires.
+
+        Returns ``True`` when the row was inserted, ``False`` when the cap is
+        already reached.
+        """
+        current_time = now or datetime.now(UTC)
        async with self.session_factory() as session:
-            result = await session.execute(
+            await self._serialize_oauth_owner_scope(session, owner_user_id, provider)
+            # Prune only this owner/provider's expired codes (the ones that affect
+            # this cap), not every user's — avoids a global DELETE on each connect
+            # POST. Issuing this write first also takes the SQLite database write
+            # lock so the count below cannot race a concurrent inserter between
+            # count and commit. Stale codes for other owners are pruned globally
+            # by consume_oauth_state / delete_expired_oauth_states.
+            await session.execute(
+                delete(ChannelOAuthStateRow).where(
+                    ChannelOAuthStateRow.owner_user_id == owner_user_id,
+                    ChannelOAuthStateRow.provider == provider,
+                    ChannelOAuthStateRow.expires_at < current_time,
+                )
+            )
+            pending = await session.execute(
                select(func.count())
                .select_from(ChannelOAuthStateRow)
                .where(
                    ChannelOAuthStateRow.owner_user_id == owner_user_id,
                    ChannelOAuthStateRow.provider == provider,
+                    ChannelOAuthStateRow.consumed_at.is_(None),
+                    ChannelOAuthStateRow.expires_at >= current_time,
                )
            )
+            if int(pending.scalar_one()) >= max_pending:
+                await session.rollback()
+                return False
+            session.add(
+                ChannelOAuthStateRow(
+                    state_hash=self.hash_state(state),
+                    owner_user_id=owner_user_id,
+                    provider=provider,
+                    code_verifier_encrypted=self._encrypt_optional_secret(code_verifier),
+                    nonce_hash=nonce_hash,
+                    redirect_after=redirect_after,
+                    requested_scopes_json=list(requested_scopes or []),
+                    metadata_json=dict(metadata or {}),
+                    expires_at=expires_at,
+                )
+            )
+            await session.commit()
+            return True
+
+    async def _serialize_oauth_owner_scope(self, session: AsyncSession, owner_user_id: str, provider: str) -> None:
+        """Serialize concurrent pending-cap transactions for one (owner, provider).
+
+        On PostgreSQL this takes a transaction-scoped advisory lock so concurrent
+        issuers run their count+insert one at a time. On SQLite the leading
+        DELETE in the caller's transaction already acquires the database write
+        lock, which serializes writers, so no extra lock is required.
+        """
+        try:
+            dialect = session.bind.dialect.name if session.bind is not None else ""
+        except Exception:
+            dialect = ""
+        if dialect == "postgresql":
+            await session.execute(text("SELECT pg_advisory_xact_lock(:lock_key)"), {"lock_key": self._oauth_scope_lock_key(owner_user_id, provider)})
+
+    @staticmethod
+    def _oauth_scope_lock_key(owner_user_id: str, provider: str) -> int:
+        digest = hashlib.sha256(f"{owner_user_id}\x00{provider}".encode()).digest()
+        # 63-bit non-negative key for pg_advisory_xact_lock(bigint).
+        return int.from_bytes(digest[:8], "big") & 0x7FFFFFFFFFFFFFFF
+
+    async def delete_expired_oauth_states(self, *, now: datetime | None = None) -> int:
+        current_time = now or datetime.now(UTC)
+        async with self.session_factory() as session:
+            result = await session.execute(delete(ChannelOAuthStateRow).where(ChannelOAuthStateRow.expires_at < current_time))
+            await session.commit()
+            return int(result.rowcount or 0)
+
+    async def count_oauth_states(
+        self,
+        *,
+        owner_user_id: str,
+        provider: str,
+        active_only: bool = False,
+        now: datetime | None = None,
+    ) -> int:
+        current_time = now or datetime.now(UTC)
+        conditions = [
+            ChannelOAuthStateRow.owner_user_id == owner_user_id,
+            ChannelOAuthStateRow.provider == provider,
+        ]
+        if active_only:
+            conditions.extend(
+                [
+                    ChannelOAuthStateRow.consumed_at.is_(None),
+                    ChannelOAuthStateRow.expires_at >= current_time,
+                ]
+            )
+
+        async with self.session_factory() as session:
+            result = await session.execute(select(func.count()).select_from(ChannelOAuthStateRow).where(*conditions))
            return int(result.scalar_one())

    async def consume_oauth_state(