fix(channels): add operational guardrails (#3584)

* fix(channels): add operational guardrails

* make format

* fix(channels): converge with #3582 to avoid merge-order conflicts

Drop this PR's DingTalk INFO-log redaction and hand it to #3582, which
already restructures that handler and will redact the same log there. This
PR no longer touches dingtalk.py, so the two PRs can merge to main in any
order without a conflict.

For WeChat, drop the contested thread_ts priority reorder (review #3) and
keep only what inbound dedupe needs: a server-stable message_id in the
inbound metadata (message_id/msg_id, no client_id per review #6). This is a
single added line inside the metadata dict, a region #3582 never touches, so
it auto-merges regardless of order.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(channels): address three correctness review findings

1. Connect-code cap was racy (willem #1): _create_state ran delete-expired,
   count, and insert as three separate transactions, so concurrent connect
   POSTs from one owner could each see count < cap and all insert past it. Add
   ChannelConnectionRepository.create_oauth_state_within_cap which does
   delete+count+insert in a single transaction serialized per (owner,
   provider) — Postgres via pg_advisory_xact_lock, SQLite via the write lock
   the leading DELETE takes — and have the router use it.

2. Inbound dedupe key fell back to "" workspace (willem #3): two workspaces
   delivering without team/guild/aibotid would collapse to the same key and
   dedupe each other's messages. _inbound_dedupe_key now fails closed
   (returns None) when no workspace identifier is present.

3. Dedupe key was recorded on receipt and never released on failure
   (ShenAC #1): a transient error (DB blip, Gateway 503) left the key in place
   for the full TTL, so a provider redelivery of the same message_id — exactly
   the retry dedupe should absorb — was silently dropped. _handle_message now
   releases the key in the unexpected-exception branch so redelivery can
   recover, while keeping record-on-receipt so retries during handling are
   still deduped.
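
A minimal sketch of the single-transaction cap check from item 1, using the standard-library `sqlite3` module rather than the repository's real session layer. The table, its columns, and the helpers `open_db` and `create_state_within_cap` are illustrative assumptions, not the project's code. The sketch takes the write lock up front with `BEGIN IMMEDIATE`, which differs from relying on the leading DELETE as described above; the Postgres advisory-lock path is not shown.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def open_db() -> sqlite3.Connection:
    # isolation_level=None leaves transaction control to the explicit
    # BEGIN/COMMIT/ROLLBACK statements below.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE oauth_states ("
        "state TEXT PRIMARY KEY, owner TEXT, provider TEXT, expires_at REAL)"
    )
    return conn


def create_state_within_cap(conn, owner, provider, state, expires_at, max_pending, now) -> bool:
    """Insert a pending state only while owner/provider is below the cap."""
    # Hypothetical helper: delete, count, and insert share one transaction,
    # so no other writer can slip in between the count and the insert.
    conn.execute("BEGIN IMMEDIATE")
    try:
        # Prune only this owner/provider's expired rows.
        conn.execute(
            "DELETE FROM oauth_states WHERE owner = ? AND provider = ? AND expires_at <= ?",
            (owner, provider, now.timestamp()),
        )
        (pending,) = conn.execute(
            "SELECT COUNT(*) FROM oauth_states WHERE owner = ? AND provider = ?",
            (owner, provider),
        ).fetchone()
        inserted = pending < max_pending
        if inserted:
            conn.execute(
                "INSERT INTO oauth_states (state, owner, provider, expires_at) VALUES (?, ?, ?, ?)",
                (state, owner, provider, expires_at.timestamp()),
            )
        conn.execute("COMMIT")
        return inserted
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

With `max_pending=3`, the fourth call for the same owner and provider returns `False` and inserts nothing, while a different owner is still accepted.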

Tests: repo cap enforcement incl. concurrent-issuance non-leak; dedupe
fail-closed; dedupe key release-on-failure redelivery recovery.
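
The dedupe behaviour from items 2 and 3 can be sketched as follows. This is a synchronous simplification under stated assumptions: `inbound_dedupe_key` and `Deduper` are hypothetical names, the `guild_id` metadata key is a guess at how the guild identifier is carried, and the real handler is asynchronous. The key shape `(channel, workspace, chat_id, message_id)` follows the tests in this PR.

```python
from typing import Callable, Optional


def inbound_dedupe_key(channel: str, chat_id: str, metadata: dict) -> Optional[tuple]:
    # "guild_id" is an assumed key name; team_id and aibotid appear in the tests.
    workspace = metadata.get("team_id") or metadata.get("guild_id") or metadata.get("aibotid")
    message_id = metadata.get("event_id") or metadata.get("message_id") or metadata.get("msg_id")
    if not workspace or not message_id:
        # Fail closed: without a workspace or a stable id, skip dedupe rather
        # than let two workspaces collapse onto the same key.
        return None
    return (channel, str(workspace), chat_id, str(message_id))


class Deduper:
    def __init__(self) -> None:
        self._seen: set[tuple] = set()

    def handle(self, key: Optional[tuple], process: Callable[[], None]) -> str:
        if key is not None:
            if key in self._seen:
                return "duplicate"
            # Record on receipt so retries arriving during handling are absorbed.
            self._seen.add(key)
        try:
            process()
        except Exception:
            # Release on unexpected failure so a provider redelivery can recover.
            if key is not None:
                self._seen.discard(key)
            return "failed"
        return "processed"
```

A first delivery that raises returns `"failed"` and frees the key, so the redelivery is processed; only a delivery after a success is reported as `"duplicate"`.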

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(channels): address cleanup/efficiency and test review findings

Efficiency / cleanup:
- Dedupe key set drops client-generated ids (client_msg_id, client_id);
  keep only server-stable event_id/message_id/msg_id, which a provider's own
  redelivery preserves (ShenAC #6). Every provider already emits message_id.
- TTL/overflow pruning of _recent_inbound_events is now O(k): switch to an
  OrderedDict and popitem(last=False) from the front instead of scanning all
  4096 entries on every inbound (willem #4).
- Log "received inbound" only after the dedupe check so a provider retrying N
  times no longer logs N accepts; document that manager dedupe covers the
  agent run/final answer, not provider ack side-effects (willem #5, ShenAC #2).
- Slack drops the redundant `team_id or event.get("team")` fallback the caller
  already resolved (willem #6).
- create_oauth_state_within_cap prunes only this owner/provider's expired codes
  instead of a global DELETE on every connect POST; global cleanup still runs
  on consume_oauth_state (willem #7).
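
For illustration, the front-of-queue pruning could look like the sketch below. `RecentEvents` and its method names are hypothetical, and the sketch assumes timestamps are passed in non-decreasing order so that insertion order equals age order; only `OrderedDict` and `popitem(last=False)` come from the description above.

```python
from collections import OrderedDict


class RecentEvents:
    """Bounded, TTL-limited record of recently seen dedupe keys."""

    def __init__(self, ttl_seconds: float, max_entries: int) -> None:
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self._events: "OrderedDict[tuple, float]" = OrderedDict()  # oldest first

    def seen_or_record(self, key: tuple, now: float) -> bool:
        # Expired entries are always at the front, so pruning touches only
        # the k entries that actually expire instead of the whole mapping.
        while self._events:
            oldest_key = next(iter(self._events))
            if now - self._events[oldest_key] < self.ttl:
                break
            self._events.popitem(last=False)
        if key in self._events:
            return True
        self._events[key] = now
        # Overflow: evict from the front until the size bound holds again.
        while len(self._events) > self.max_entries:
            self._events.popitem(last=False)
        return False

    def release(self, key: tuple) -> None:
        self._events.pop(key, None)

    def __len__(self) -> int:
        return len(self._events)
```

Each call does work proportional to the number of entries it evicts, which is where the O(k) bound comes from.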

Tests:
- Dedupe test uses tmp_path instead of a leaked mkdtemp, uses distinct objects
  per publish, and adds a negative control: a different message_id is still
  processed, catching over-dedupe regressions (willem #8, ShenAC #4).
- Slack HTTP-mode rejection test supplies app_token so the missing-token early
  return can't mask the guard, giving the state assertions teeth (ShenAC #3).
- count_oauth_states test pins that the active row survives, not just the count
  (ShenAC #5).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* make format

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Nan Gao
2026-06-18 04:09:46 +02:00
committed by GitHub
parent 97dd9ecf73
commit 8c0830aea1
12 changed files with 468 additions and 51 deletions
@@ -246,6 +246,77 @@ class TestChannelConnectionRepository:
states = (await session.execute(select(ChannelOAuthStateRow))).scalars().all()
assert [state.state_hash for state in states] == [repo.hash_state("active-state")]
@pytest.mark.anyio
async def test_count_oauth_states_active_only_and_delete_expired(self, repo):
now = datetime.now(UTC)
await repo.create_oauth_state(
owner_user_id="alice",
provider="slack",
state="expired-state",
expires_at=now - timedelta(minutes=1),
)
await repo.create_oauth_state(
owner_user_id="alice",
provider="slack",
state="active-state",
expires_at=now + timedelta(minutes=5),
)
assert await repo.count_oauth_states(owner_user_id="alice", provider="slack", active_only=True, now=now) == 1
assert await repo.delete_expired_oauth_states(now=now) == 1
assert await repo.count_oauth_states(owner_user_id="alice", provider="slack") == 1
# Pin that the surviving row is the active one (an inverted expiry
# predicate would delete the active row, still return 1, and pass above).
async with repo.session_factory() as session:
survivors = (await session.execute(select(ChannelOAuthStateRow))).scalars().all()
assert [row.state_hash for row in survivors] == [repo.hash_state("active-state")]
@pytest.mark.anyio
async def test_create_oauth_state_within_cap_enforces_pending_cap(self, repo):
now = datetime.now(UTC)
expires = now + timedelta(minutes=5)
for i in range(3):
inserted = await repo.create_oauth_state_within_cap(owner_user_id="alice", provider="slack", state=f"code-{i}", expires_at=expires, max_pending=3, now=now)
assert inserted is True
# Cap reached: the next issuance is rejected and nothing is inserted.
assert await repo.create_oauth_state_within_cap(owner_user_id="alice", provider="slack", state="code-over", expires_at=expires, max_pending=3, now=now) is False
assert await repo.count_oauth_states(owner_user_id="alice", provider="slack", active_only=True, now=now) == 3
# The cap is per owner: a different owner is unaffected by alice's pending codes.
assert await repo.create_oauth_state_within_cap(owner_user_id="bob", provider="slack", state="bob-1", expires_at=expires, max_pending=3, now=now) is True
@pytest.mark.anyio
async def test_create_oauth_state_within_cap_ignores_expired_rows(self, repo):
now = datetime.now(UTC)
# Three already-expired rows must not count against the cap.
for i in range(3):
await repo.create_oauth_state(owner_user_id="alice", provider="slack", state=f"old-{i}", expires_at=now - timedelta(minutes=1))
inserted = await repo.create_oauth_state_within_cap(owner_user_id="alice", provider="slack", state="fresh", expires_at=now + timedelta(minutes=5), max_pending=3, now=now)
assert inserted is True
assert await repo.count_oauth_states(owner_user_id="alice", provider="slack", active_only=True, now=now) == 1
@pytest.mark.anyio
async def test_create_oauth_state_within_cap_does_not_leak_under_concurrency(self, repo):
"""Concurrent issuance for one owner cannot push past the cap (willem #1)."""
import anyio
now = datetime.now(UTC)
expires = now + timedelta(minutes=5)
results: list[bool] = []
async def issue(state: str) -> None:
results.append(await repo.create_oauth_state_within_cap(owner_user_id="alice", provider="slack", state=state, expires_at=expires, max_pending=3, now=now))
async with anyio.create_task_group() as tg:
for i in range(8):
tg.start_soon(issue, f"code-{i}")
assert sum(1 for ok in results if ok) == 3
assert await repo.count_oauth_states(owner_user_id="alice", provider="slack", active_only=True, now=now) == 3
@pytest.mark.anyio
async def test_consume_oauth_state_is_one_time_even_under_concurrent_consumers(self, repo):
import anyio
@@ -504,6 +504,27 @@ def test_connect_slack_returns_binding_command_and_persists_state(tmp_path):
anyio.run(repo.close)
def test_connect_binding_code_caps_pending_states_per_provider(tmp_path):
import anyio
repo = anyio.run(_make_repo, tmp_path)
app = _make_app(_enabled_connections_config(), repo, _channels_config())
with TestClient(app) as client:
responses = [client.post("/api/channels/slack/connect") for _ in range(6)]
assert [response.status_code for response in responses[:5]] == [200, 200, 200, 200, 200]
assert responses[5].status_code == 429
assert "Too many pending channel connection codes" in responses[5].json()["detail"]
async def count_states():
return await repo.count_oauth_states(owner_user_id=str(_user().id), provider="slack")
assert anyio.run(count_states) == 5
anyio.run(repo.close)
def test_connect_discord_returns_binding_command_and_persists_state(tmp_path):
import anyio
+121 -1
@@ -800,6 +800,126 @@ class TestChannelManager:
_run(go())
def test_dispatch_loop_dedupes_stable_provider_message_id(self, tmp_path):
from app.channels.manager import ChannelManager
async def go():
bus = MessageBus()
store = ChannelStore(path=tmp_path / "store.json")
manager = ChannelManager(bus=bus, store=store)
manager._client = _make_mock_langgraph_client()
outbound_received: list[OutboundMessage] = []
async def capture_outbound(msg: OutboundMessage) -> None:
outbound_received.append(msg)
bus.subscribe_outbound(capture_outbound)
await manager.start()
def _slack_inbound(message_id: str) -> InboundMessage:
# Distinct objects per publish, like a real provider redelivery.
return InboundMessage(
channel_name="slack",
chat_id="C123",
user_id="U123",
text="sensitive prompt",
topic_id="1710000000.000100",
metadata={"team_id": "T123", "message_id": message_id},
)
# Same stable message_id delivered twice -> processed once.
await bus.publish_inbound(_slack_inbound("1710000000.000200"))
await bus.publish_inbound(_slack_inbound("1710000000.000200"))
await _wait_for(lambda: manager._client.runs.wait.call_count == 1 and len(outbound_received) == 1)
await asyncio.sleep(0.05)
assert manager._client.threads.create.call_count == 1
assert manager._client.runs.wait.call_count == 1
assert len(outbound_received) == 1
# Negative control: a *different* message_id must still be processed,
# so an over-dedupe regression (dropping distinct messages) is caught.
await bus.publish_inbound(_slack_inbound("1710000000.000999"))
await _wait_for(lambda: manager._client.runs.wait.call_count == 2 and len(outbound_received) == 2)
await asyncio.sleep(0.05)
await manager.stop()
assert manager._client.runs.wait.call_count == 2
assert len(outbound_received) == 2
_run(go())
def test_inbound_dedupe_key_fails_closed_without_workspace(self):
"""Without a workspace identifier, skip dedupe instead of collapsing workspaces (willem #3)."""
from app.channels.manager import ChannelManager
with_workspace = InboundMessage(
channel_name="slack",
chat_id="C1",
user_id="U1",
text="x",
metadata={"team_id": "T1", "message_id": "m1"},
)
assert ChannelManager._inbound_dedupe_key(with_workspace) == ("slack", "T1", "C1", "m1")
without_workspace = InboundMessage(
channel_name="slack",
chat_id="C1",
user_id="U1",
text="x",
metadata={"message_id": "m1"},
)
assert ChannelManager._inbound_dedupe_key(without_workspace) is None
def test_dispatch_loop_releases_dedupe_key_when_handling_fails(self, tmp_path):
"""A transient handling failure must not black-hole a provider redelivery (ShenAC #1)."""
from app.channels.manager import ChannelManager
async def go():
bus = MessageBus()
store = ChannelStore(path=tmp_path / "store.json")
manager = ChannelManager(bus=bus, store=store)
client = _make_mock_langgraph_client()
attempts = {"n": 0}
async def flaky_wait(*args, **kwargs):
attempts["n"] += 1
if attempts["n"] == 1:
raise RuntimeError("transient gateway 503")
return {"messages": [{"type": "human", "content": "hi"}, {"type": "ai", "content": "recovered"}]}
client.runs.wait = AsyncMock(side_effect=flaky_wait)
manager._client = client
outbound_received: list[OutboundMessage] = []
async def capture_outbound(msg: OutboundMessage) -> None:
outbound_received.append(msg)
bus.subscribe_outbound(capture_outbound)
await manager.start()
inbound = InboundMessage(
channel_name="slack",
chat_id="C123",
user_id="U123",
text="hello",
metadata={"team_id": "T123", "message_id": "m-1"},
)
# First delivery fails transiently; the dedupe key must be released.
await bus.publish_inbound(inbound)
await _wait_for(lambda: attempts["n"] == 1 and len(outbound_received) >= 1)
# Provider redelivers the same message_id: it must be reprocessed, not dropped.
await bus.publish_inbound(inbound)
await _wait_for(lambda: attempts["n"] == 2)
await asyncio.sleep(0.05)
await manager.stop()
assert attempts["n"] == 2
_run(go())
def test_handle_chat_outbound_preserves_inbound_metadata(self):
"""DingTalk (and similar) need inbound metadata on outbound sends (e.g. sender_staff_id)."""
from app.channels.manager import ChannelManager
@@ -3752,7 +3872,7 @@ class TestWeComChannel:
assert inbound.thread_ts == "msg-1"
assert inbound.topic_id == "user-1"
assert inbound.files == files
-assert inbound.metadata == {"aibotid": "bot-1", "chattype": "single"}
+assert inbound.metadata == {"aibotid": "bot-1", "chattype": "single", "message_id": "msg-1"}
assert channel._ws_frames["msg-1"] is frame
assert channel._ws_stream_ids["msg-1"] == "stream-1"
+11 -28
@@ -98,24 +98,13 @@ def test_slack_send_uses_connection_bot_token_when_connection_id_is_present():
anyio.run(go)
-def test_slack_http_events_mode_initializes_operator_web_client(monkeypatch):
+def test_slack_http_events_mode_is_rejected(monkeypatch, caplog):
import anyio
from app.channels.slack import SlackChannel
-class FakeWebClient:
-def __init__(self, token: str) -> None:
-self.token = token
-self.messages: list[dict] = []
-def auth_test(self):
-return {"user_id": "B-http"}
-def chat_postMessage(self, **kwargs):
-self.messages.append(kwargs)
slack_sdk = ModuleType("slack_sdk")
-slack_sdk.WebClient = FakeWebClient
+slack_sdk.WebClient = object
socket_mode = ModuleType("slack_sdk.socket_mode")
socket_mode.SocketModeClient = object
response = ModuleType("slack_sdk.socket_mode.response")
@@ -129,26 +118,20 @@ def test_slack_http_events_mode_initializes_operator_web_client(monkeypatch):
bus=MessageBus(),
config={
"bot_token": "xoxb-operator",
# Provide app_token too so the missing-token early return cannot
# fire before the HTTP-mode guard — otherwise the state assertions
# below would hold even if the guard were deleted.
"app_token": "xapp-token",
"event_delivery": "http",
"connection_repo": MagicMock(),
},
)
-await channel.start()
-assert channel._running is True
-assert channel._web_client is not None
-assert channel._web_client.token == "xoxb-operator"
-assert channel._bot_user_id == "B-http"
+with caplog.at_level("ERROR", logger="app.channels.slack"):
+    await channel.start()
-await channel._post_connection_reply("C123", "Slack connected to DeerFlow.", "1710000000.000100")
-assert channel._web_client.messages == [
-{
-"channel": "C123",
-"text": "Slack connected to DeerFlow.",
-"thread_ts": "1710000000.000100",
-}
-]
await channel.stop()
+assert channel._running is False
+assert channel._web_client is None
+assert "Slack HTTP Events mode is not supported" in caplog.text
anyio.run(go)