fix(channels): add operational guardrails (#3584)

* fix(channels): add operational guardrails

* make format

* fix(channels): converge with #3582 to avoid merge-order conflicts

Drop this PR's DingTalk INFO-log redaction and hand it to #3582, which already restructures that handler and will redact the same log there. This PR no longer touches dingtalk.py, so the two PRs can merge to main in any order without a conflict.

For WeChat, drop the contested thread_ts priority reorder (review #3) and keep only what inbound dedupe needs: a server-stable message_id in the inbound metadata (message_id/msg_id, no client_id per review #6). This is a single added line inside the metadata dict, a region #3582 never touches, so it auto-merges regardless of order.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(channels): address three correctness review findings

1. Connect-code cap was racy (willem #1): _create_state ran delete-expired, count, and insert as three separate transactions, so concurrent connect POSTs from one owner could each see count < cap and all insert past it. Add ChannelConnectionRepository.create_oauth_state_within_cap, which does delete+count+insert in a single transaction serialized per (owner, provider), using pg_advisory_xact_lock on Postgres and the write lock the leading DELETE takes on SQLite, and have the router use it.

2. Inbound dedupe key fell back to "" workspace (willem #3): two workspaces delivering without team/guild/aibotid would collapse to the same key and dedupe each other's messages. _inbound_dedupe_key now fails closed (returns None) when no workspace identifier is present.

3. Dedupe key was recorded on receipt and never released on failure (ShenAC #1): a transient error (DB blip, Gateway 503) left the key in place for the full TTL, so a provider redelivery of the same message_id (exactly the retry dedupe should absorb) was silently dropped. _handle_message now releases the key in the unexpected-exception branch so redelivery can recover, while keeping record-on-receipt so retries during handling are still deduped.

Tests: repo cap enforcement incl. concurrent-issuance non-leak; dedupe fail-closed; dedupe key release-on-failure redelivery recovery.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(channels): address cleanup/efficiency and test review findings

Efficiency / cleanup:
- Dedupe key set drops client-generated ids (client_msg_id, client_id); keep only server-stable event_id/message_id/msg_id, which a provider's own redelivery preserves (ShenAC #6). Every provider already emits message_id.
- TTL/overflow pruning of _recent_inbound_events is now O(k): switch to an OrderedDict and popitem(last=False) from the front instead of scanning all 4096 entries on every inbound (willem #4).
- Log "received inbound" only after the dedupe check so a provider retrying N times no longer logs N accepts; document that manager dedupe covers the agent run/final answer, not provider ack side-effects (willem #5, ShenAC #2).
- Slack drops the redundant `team_id or event.get("team")` fallback the caller already resolved (willem #6).
- create_oauth_state_within_cap prunes only this owner/provider's expired codes instead of a global DELETE on every connect POST; global cleanup still runs on consume_oauth_state (willem #7).

Tests:
- Dedupe test uses tmp_path instead of a leaked mkdtemp, uses distinct objects per publish, and adds a negative control: a different message_id is still processed, catching over-dedupe regressions (willem #8, ShenAC #4).
- Slack HTTP-mode rejection test supplies app_token so the missing-token early return can't mask the guard, giving the state assertions teeth (ShenAC #3).
- count_oauth_states test pins that the active row survives, not just the count (ShenAC #5).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* make format

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
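
The inbound dedupe behaviour described in the message above (server-stable ids only, fail closed without a workspace, front-of-queue pruning, release on failure) can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the names `inbound_dedupe_key`, `RecentInboundEvents`, and `handle`, the metadata keys `team_id`/`guild_id`/`aibotid`, and the 300-second TTL are all assumptions made for the example.

```python
import time
from collections import OrderedDict
from typing import Callable, Mapping, Optional, Tuple

Key = Tuple[str, str, str]  # (provider, workspace, message id)

# Only server-stable ids; client-generated ids are deliberately ignored.
STABLE_ID_FIELDS = ("event_id", "message_id", "msg_id")
# Hypothetical metadata keys carrying the workspace identifier.
WORKSPACE_FIELDS = ("team_id", "guild_id", "aibotid")


def inbound_dedupe_key(provider: str, metadata: Mapping[str, object]) -> Optional[Key]:
    """Build a dedupe key, or return None when one cannot be built safely."""
    workspace = next((metadata[f] for f in WORKSPACE_FIELDS if metadata.get(f)), None)
    message_id = next((metadata[f] for f in STABLE_ID_FIELDS if metadata.get(f)), None)
    if workspace is None or message_id is None:
        # Fail closed: never fall back to an empty workspace, which would
        # let two workspaces share one key.
        return None
    return (provider, str(workspace), str(message_id))


class RecentInboundEvents:
    """Keys seen recently, oldest first, with TTL and size bounds."""

    def __init__(self, ttl_seconds: float = 300.0, max_entries: int = 4096,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._max_entries = max_entries
        self._clock = clock
        self._seen = OrderedDict()  # key -> time recorded

    def record(self, key: Key) -> bool:
        """Return True if the key is new, False if it is a duplicate."""
        now = self._clock()
        # Insertion order is age order and the TTL is constant, so expired
        # entries form a prefix: pop from the front and stop at the first
        # live entry instead of scanning the whole map.
        while self._seen:
            oldest = next(iter(self._seen))
            if now - self._seen[oldest] < self._ttl:
                break
            self._seen.popitem(last=False)
        if key in self._seen:
            return False
        self._seen[key] = now
        while len(self._seen) > self._max_entries:
            self._seen.popitem(last=False)  # overflow: evict the oldest
        return True

    def release(self, key: Key) -> None:
        """Forget a key so that a provider redelivery is processed again."""
        self._seen.pop(key, None)


def handle(events: RecentInboundEvents, key: Key, process: Callable[[], None]) -> bool:
    """Record on receipt; release only if processing fails unexpectedly."""
    if not events.record(key):
        return False  # duplicate delivery, skipped
    try:
        process()
    except Exception:
        events.release(key)
        raise
    return True
```

Recording before processing keeps retries that arrive mid-handling deduped, while the release in the exception branch is what lets a later redelivery of the same id through.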
2026-06-18 21:55:59 +00:00 · 2026-06-18 04:09:46 +02:00
parent 97dd9ecf73
commit 8c0830aea1
12 changed files with 468 additions and 51 deletions
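
The tests in the hunk below exercise `create_oauth_state_within_cap`. As a rough illustration of the delete+count+insert-in-one-transaction idea from the commit message, here is a sketch using the standard-library `sqlite3` module. It is not the repository's implementation: the table `oauth_states`, its columns, and the function name are hypothetical, and it takes the write lock up front with `BEGIN IMMEDIATE` rather than relying on the leading DELETE as the commit describes.

```python
import sqlite3

# Hypothetical schema; timestamps are plain epoch seconds.
SCHEMA = """
CREATE TABLE oauth_states (
    state_hash TEXT PRIMARY KEY,
    owner      TEXT NOT NULL,
    provider   TEXT NOT NULL,
    expires_at REAL NOT NULL
)
"""


def create_state_within_cap(conn: sqlite3.Connection, *, owner: str, provider: str,
                            state_hash: str, expires_at: float,
                            max_pending: int, now: float) -> bool:
    """Insert a pending state unless the owner already holds max_pending.

    Expects a connection opened with isolation_level=None so that the
    transaction boundaries below are the only ones in play.
    """
    # Take the write lock first, so no other writer can interleave
    # between the count and the insert.
    conn.execute("BEGIN IMMEDIATE")
    try:
        # Prune only this owner/provider's expired rows.
        conn.execute(
            "DELETE FROM oauth_states"
            " WHERE owner = ? AND provider = ? AND expires_at <= ?",
            (owner, provider, now),
        )
        (pending,) = conn.execute(
            "SELECT COUNT(*) FROM oauth_states WHERE owner = ? AND provider = ?",
            (owner, provider),
        ).fetchone()
        inserted = pending < max_pending
        if inserted:
            conn.execute(
                "INSERT INTO oauth_states (state_hash, owner, provider, expires_at)"
                " VALUES (?, ?, ?, ?)",
                (state_hash, owner, provider, expires_at),
            )
        conn.execute("COMMIT")
        return inserted
    except BaseException:
        conn.execute("ROLLBACK")
        raise
```

A connection for this sketch would be opened as `sqlite3.connect(path, isolation_level=None)`. On Postgres, the commit message names `pg_advisory_xact_lock` as the per-(owner, provider) serialization mechanism instead.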
@@ -246,6 +246,77 @@ class TestChannelConnectionRepository:
            states = (await session.execute(select(ChannelOAuthStateRow))).scalars().all()
        assert [state.state_hash for state in states] == [repo.hash_state("active-state")]

+    @pytest.mark.anyio
+    async def test_count_oauth_states_active_only_and_delete_expired(self, repo):
+        now = datetime.now(UTC)
+        await repo.create_oauth_state(
+            owner_user_id="alice",
+            provider="slack",
+            state="expired-state",
+            expires_at=now - timedelta(minutes=1),
+        )
+        await repo.create_oauth_state(
+            owner_user_id="alice",
+            provider="slack",
+            state="active-state",
+            expires_at=now + timedelta(minutes=5),
+        )
+
+        assert await repo.count_oauth_states(owner_user_id="alice", provider="slack", active_only=True, now=now) == 1
+        assert await repo.delete_expired_oauth_states(now=now) == 1
+        assert await repo.count_oauth_states(owner_user_id="alice", provider="slack") == 1
+        # Pin that the surviving row is the active one (an inverted expiry
+        # predicate would delete the active row, still return 1, and pass above).
+        async with repo.session_factory() as session:
+            survivors = (await session.execute(select(ChannelOAuthStateRow))).scalars().all()
+        assert [row.state_hash for row in survivors] == [repo.hash_state("active-state")]
+
+    @pytest.mark.anyio
+    async def test_create_oauth_state_within_cap_enforces_pending_cap(self, repo):
+        now = datetime.now(UTC)
+        expires = now + timedelta(minutes=5)
+
+        for i in range(3):
+            inserted = await repo.create_oauth_state_within_cap(owner_user_id="alice", provider="slack", state=f"code-{i}", expires_at=expires, max_pending=3, now=now)
+            assert inserted is True
+
+        # Cap reached: the next issuance is rejected and nothing is inserted.
+        assert await repo.create_oauth_state_within_cap(owner_user_id="alice", provider="slack", state="code-over", expires_at=expires, max_pending=3, now=now) is False
+        assert await repo.count_oauth_states(owner_user_id="alice", provider="slack", active_only=True, now=now) == 3
+
+        # The cap is per owner: a different owner is unaffected.
+        assert await repo.create_oauth_state_within_cap(owner_user_id="bob", provider="slack", state="bob-1", expires_at=expires, max_pending=3, now=now) is True
+
+    @pytest.mark.anyio
+    async def test_create_oauth_state_within_cap_ignores_expired_rows(self, repo):
+        now = datetime.now(UTC)
+        # Three already-expired rows must not count against the cap.
+        for i in range(3):
+            await repo.create_oauth_state(owner_user_id="alice", provider="slack", state=f"old-{i}", expires_at=now - timedelta(minutes=1))
+
+        inserted = await repo.create_oauth_state_within_cap(owner_user_id="alice", provider="slack", state="fresh", expires_at=now + timedelta(minutes=5), max_pending=3, now=now)
+        assert inserted is True
+        assert await repo.count_oauth_states(owner_user_id="alice", provider="slack", active_only=True, now=now) == 1
+
+    @pytest.mark.anyio
+    async def test_create_oauth_state_within_cap_does_not_leak_under_concurrency(self, repo):
+        """Concurrent issuance for one owner cannot push past the cap (willem #1)."""
+        import anyio
+
+        now = datetime.now(UTC)
+        expires = now + timedelta(minutes=5)
+        results: list[bool] = []
+
+        async def issue(state: str) -> None:
+            results.append(await repo.create_oauth_state_within_cap(owner_user_id="alice", provider="slack", state=state, expires_at=expires, max_pending=3, now=now))
+
+        async with anyio.create_task_group() as tg:
+            for i in range(8):
+                tg.start_soon(issue, f"code-{i}")
+
+        assert sum(1 for ok in results if ok) == 3
+        assert await repo.count_oauth_states(owner_user_id="alice", provider="slack", active_only=True, now=now) == 3
+
    @pytest.mark.anyio
    async def test_consume_oauth_state_is_one_time_even_under_concurrent_consumers(self, repo):
        import anyio