fix(channels): make channel connect flow deterministic (#3582)

* fix(channels): make channel connect flow deterministic

* make format

* fix(channels): apply connect-code before allowed_users on telegram and wechat

The bind-bootstrap reorder shipped for slack/dingtalk only. Telegram and WeChat still gate _check_user/allowed_users before connect-code dispatch, so a newly allowlisted-but-unbound user is silently rejected when binding via the browser deep-link / connect-code flow, which is the same deadlock the PR fixes.

- telegram: consume the /start deep-link token before the allowed_users gate.
- wechat: handle the /connect code before the allowed_users gate, and defer inbound file extraction + context-token tracking past the gate so blocked senders no longer trigger CDN downloads or token bookkeeping.

Adds regression tests for both adapters mirroring the slack/dingtalk coverage.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(channels): enforce single-active-owner invariant at the DB layer

_revoke_other_active_owners did a SELECT-then-UPDATE in app code with no row lock or constraint covering active rows. Under READ COMMITTED, two concurrent connect-code consumes for the same (provider, external_account_id, workspace_id) from different owners could each observe "no other active owner" and both commit a connected row, leaving find_connection_by_external_identity nondeterministic.

- Add a partial unique index on (provider, external_account_id, workspace_id) WHERE status != 'revoked' (portable to SQLite >= 3.8.0 and PostgreSQL) so the database guarantees at most one non-revoked row per external identity.
- Reorder upsert_connection to revoke other owners' active rows before the new connected row is flushed (so the index is satisfied at commit), wrapped in a bounded rollback-and-retry loop. A losing concurrent writer now retries against the now-visible state instead of committing a duplicate.

Adds DB-constraint, revoked-slot-reuse, and concurrent-upsert regression tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(channels): harden connect-status polling primitive

pollChannelConnectionUntilResolved was a free-floating recursive setTimeout started from onSuccess with no cancellation, no per-provider dedup, a redundant second endpoint per tick, and an unbounded loop on a non-finite expires_in.

- Extract a framework-agnostic, cancellable poller (connect-poll.ts) that polls only listChannelConnections() and invalidates the providers query once when the bind resolves, instead of fetching both endpoints every tick.
- Guard expires_in with a finite check + default window so undefined/NaN can no longer produce a poll loop that runs until the page closes.
- Track one active poll handle per provider in useConnectChannelProvider via a ref Map: a new connect cancels the prior poll for that provider, and a useEffect cleanup cancels all polls on unmount.

Adds unit tests for resolve-and-stop, cancellation, and non-finite-expiry.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(channels): stop leaking blocked-sender content in DingTalk INFO log; document bind semantics

Moving the allowed_users gate past _extract_text meant the parsed-message INFO log (text=%r, first 100 chars) fired for senders that allowed_users would have rejected, defeating the filter's noise/privacy role. Move that log to after the allowed_users gate so blocked senders' message text never reaches INFO logs.

Also document the two operator-relevant semantic changes in backend/CLAUDE.md: connect-code dispatch runs before allowed_users (so allowed_users is no longer a bind-time defense; the model relies on code confidentiality + 600s TTL + one-time consumption), and the single-active-owner-per-external-identity transfer semantics now backed by the partial unique index.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(channels): note connect-code-vs-allowlist and ownership transfer in operator guide

Mirror the backend/CLAUDE.md notes in the operator-facing IM_CHANNEL_CONNECTIONS.md: connect codes are consumed before allowed_users (so a not-yet-allowlisted user can still complete a first bind, and allowed_users is not a bind-time defense), and an external identity has at most one active owner with last-bind-wins transfer enforced at the DB layer.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* refactor(channels): lift connect-code dispatch into Channel base class

Each adapter duplicated the ordering-sensitive boilerplate of extracting a /connect code and guarding on the connection repo before its allowed_users gate. The duplication is what let telegram/wechat drift and keep the gate ahead of the bind. Centralize it:

- Move `_connection_repo` onto Channel.__init__ (removing 7 duplicate assignments).
- Add Channel._pending_connect_code(text), which guards on the repo and extracts the code, documenting that adapters MUST consult it before authorization so a browser-initiated bind can bootstrap a not-yet-authorized identity.
- Route slack, discord, feishu, dingtalk, wechat, and wecom through the helper. This also fixes a latent inconsistency where slack dispatched a bind even when no connection repo was configured.

Pure refactor: the full channel suite stays green; adds a direct unit test for the base helper's contract.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* make format

* fix(channels): redact DingTalk parsed-message INFO log content

Log text_len instead of the first 100 chars of message text, so message content never reaches INFO logs (the after-gate move already keeps blocked senders out entirely). This takes over the redaction from #3584 so only this PR touches dingtalk.py, letting the two PRs merge in any order conflict-free.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 13:46:02 +00:00 · 2026-06-18 04:15:31 +02:00
parent 8c0830aea1
commit 68ba4198b8
21 changed files with 695 additions and 80 deletions
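
The ordering rule described in the commit message (consult the connect code before the allowed_users gate, and only when a connection repo is configured) can be sketched roughly as below. This is an illustration only, not the adapters' actual code: `SketchChannel`, `handle`, the `/connect <code>` regex, and the string return values are all hypothetical stand-ins.

```python
import re

# Hypothetical pattern: the real adapters may parse the connect command differently.
_CONNECT_RE = re.compile(r"^/connect\s+(\S+)$")


class SketchChannel:
    """Illustrative stand-in for the Channel base class; not the real implementation."""

    def __init__(self, connection_repo=None, allowed_users=None):
        self._connection_repo = connection_repo
        self._allowed_users = set(allowed_users or ())

    def _pending_connect_code(self, text):
        # Guard on the repo first: without one, no bind can be dispatched.
        if self._connection_repo is None:
            return None
        match = _CONNECT_RE.match(text.strip())
        return match.group(1) if match else None

    def handle(self, sender, text):
        # 1. Connect-code dispatch runs before authorization, so a sender who
        #    is not yet allowlisted can still complete a first bind.
        code = self._pending_connect_code(text)
        if code is not None:
            return f"bind:{code}"
        # 2. Only then apply the allowed_users gate to ordinary messages.
        if self._allowed_users and sender not in self._allowed_users:
            return "blocked"
        return "chat"


channel = SketchChannel(connection_repo=object(), allowed_users={"alice"})
no_repo = SketchChannel(connection_repo=None, allowed_users={"alice"})
```

In this sketch a sender outside allowed_users can still bind with a valid-looking code, which is why the commit message stresses that allowed_users is no longer a bind-time defense.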
@@ -4,7 +4,7 @@ from __future__ import annotations

 from datetime import UTC, datetime

-from sqlalchemy import JSON, DateTime, ForeignKey, Index, Integer, String, Text, UniqueConstraint
+from sqlalchemy import JSON, DateTime, ForeignKey, Index, Integer, String, Text, UniqueConstraint, text
 from sqlalchemy.orm import Mapped, mapped_column

 from deerflow.persistence.base import Base
@@ -46,6 +46,20 @@ class ChannelConnectionRow(Base):
            name="uq_channel_connection_owner_provider_identity",
        ),
        Index("idx_channel_connections_event_lookup", "provider", "workspace_id", "bot_user_id"),
+        # Enforce the single-active-owner invariant at the database layer: at most
+        # one non-revoked row may exist per external identity. This makes ownership
+        # transfer race-safe (concurrent connects from different owners can no
+        # longer both commit a connected row). Partial unique indexes are
+        # supported by both SQLite (>= 3.8.0) and PostgreSQL.
+        Index(
+            "uq_channel_connection_active_identity",
+            "provider",
+            "external_account_id",
+            "workspace_id",
+            unique=True,
+            sqlite_where=text("status != 'revoked'"),
+            postgresql_where=text("status != 'revoked'"),
+        ),
    )
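
To see what the partial unique index buys, here is a small standalone sketch using the standard-library `sqlite3` module. The table is a simplified assumption (column set and literal values are illustrative); only the index name, its columns, and the `WHERE status != 'revoked'` predicate come from the hunk above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, assumed table shape: just the columns the index touches.
conn.execute(
    """
    CREATE TABLE channel_connections (
        id INTEGER PRIMARY KEY,
        owner_user_id TEXT NOT NULL,
        provider TEXT NOT NULL,
        external_account_id TEXT NOT NULL,
        workspace_id TEXT NOT NULL,
        status TEXT NOT NULL
    )
    """
)
# Same columns and predicate as uq_channel_connection_active_identity.
conn.execute(
    """
    CREATE UNIQUE INDEX uq_channel_connection_active_identity
    ON channel_connections (provider, external_account_id, workspace_id)
    WHERE status != 'revoked'
    """
)


def insert(owner, status):
    conn.execute(
        "INSERT INTO channel_connections "
        "(owner_user_id, provider, external_account_id, workspace_id, status) "
        "VALUES (?, 'slack', 'U123', 'T1', ?)",
        (owner, status),
    )


insert("alice", "connected")

# A second non-revoked row for the same external identity is rejected.
try:
    insert("bob", "connected")
    second_active_allowed = True
except sqlite3.IntegrityError:
    second_active_allowed = False

# Revoked rows fall outside the index, so the slot can be reused.
conn.execute(
    "UPDATE channel_connections SET status = 'revoked' WHERE owner_user_id = 'alice'"
)
insert("bob", "connected")

active_owners = [
    row[0]
    for row in conn.execute(
        "SELECT owner_user_id FROM channel_connections WHERE status != 'revoked'"
    )
]
```

Because revoked rows are excluded by the predicate, any number of historical rows can coexist with the single active one.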


@@ -25,6 +25,12 @@ from deerflow.utils.time import coerce_iso

 logger = logging.getLogger(__name__)

+# Bounded retries for upsert_connection when a concurrent writer commits a
+# conflicting row first (same owner identity, or the same active external
+# identity guarded by the partial unique index). Each retry re-reads the
+# now-visible state, so a small bound converges under realistic contention.
+_UPSERT_MAX_ATTEMPTS = 3
+

 class ChannelCredentialCipher:
    """Encrypts provider credentials before they are persisted."""
@@ -128,36 +134,62 @@ class ChannelConnectionRepository:
            row.capabilities_json = dict(capabilities or {})
            row.metadata_json = dict(metadata or {})

+        async def _revoke_other_active_owners(session: AsyncSession) -> None:
+            if status != "connected":
+                return
+            with session.no_autoflush:
+                result = await session.execute(
+                    select(ChannelConnectionRow.id).where(
+                        ChannelConnectionRow.provider == provider,
+                        ChannelConnectionRow.external_account_id == external_account_id_value,
+                        ChannelConnectionRow.workspace_id == workspace_id_value,
+                        ChannelConnectionRow.owner_user_id != owner_user_id,
+                        ChannelConnectionRow.status != "revoked",
+                    )
+                )
+            transferred_ids = [row_id for row_id in result.scalars()]
+            if not transferred_ids:
+                return
+            await session.execute(update(ChannelConnectionRow).where(ChannelConnectionRow.id.in_(transferred_ids)).values(status="revoked"))
+            await session.execute(delete(ChannelCredentialRow).where(ChannelCredentialRow.connection_id.in_(transferred_ids)))
+
        stmt = select(ChannelConnectionRow).where(
            ChannelConnectionRow.owner_user_id == owner_user_id,
            ChannelConnectionRow.provider == provider,
            ChannelConnectionRow.external_account_id == external_account_id_value,
            ChannelConnectionRow.workspace_id == workspace_id_value,
        )
-        async with self.session_factory() as session:
-            row = (await session.execute(stmt)).scalar_one_or_none()
-            if row is None:
-                row = ChannelConnectionRow(
-                    id=self._new_id(),
-                    owner_user_id=owner_user_id,
-                    provider=provider,
-                    external_account_id=external_account_id_value,
-                    workspace_id=workspace_id_value,
-                )
-                session.add(row)

-            _apply(row)
-            try:
-                await session.commit()
-            except IntegrityError:
-                # A concurrent writer inserted the same identity first; retry as
-                # an update of that row.
-                await session.rollback()
-                row = (await session.execute(stmt)).scalar_one()
-                _apply(row)
-                await session.commit()
-            await session.refresh(row)
-            return self._connection_to_dict(row)
+        async with self.session_factory() as session:
+            last_error: IntegrityError | None = None
+            for _ in range(_UPSERT_MAX_ATTEMPTS):
+                try:
+                    row = (await session.execute(stmt)).scalar_one_or_none()
+                    # Revoke any other owner's active row for this external identity
+                    # *before* our connected row is flushed, so the partial unique
+                    # index on active identities is satisfied at commit time.
+                    await _revoke_other_active_owners(session)
+                    if row is None:
+                        row = ChannelConnectionRow(
+                            id=self._new_id(),
+                            owner_user_id=owner_user_id,
+                            provider=provider,
+                            external_account_id=external_account_id_value,
+                            workspace_id=workspace_id_value,
+                        )
+                        session.add(row)
+                    _apply(row)
+                    await session.commit()
+                    await session.refresh(row)
+                    return self._connection_to_dict(row)
+                except IntegrityError as exc:
+                    # A concurrent writer committed a conflicting row first (this
+                    # owner's identity, or the same active external identity). Roll
+                    # back and retry: the next pass re-reads the now-visible state,
+                    # revokes the newly-committed owner, and writes our row.
+                    last_error = exc
+                    await session.rollback()
+            raise last_error  # type: ignore[misc]  # loop runs at least once
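
The revoke-then-write retry loop above can be exercised in miniature with `sqlite3` and two connections to a temporary database file. This is a synchronous sketch under simplifying assumptions: the `conn` table, `connect_identity`, and the `before_write` hook (which lets a rival commit between the read and the write) are hypothetical, and the real code uses SQLAlchemy's async session.

```python
import os
import sqlite3
import tempfile

MAX_ATTEMPTS = 3  # same bound as _UPSERT_MAX_ATTEMPTS in the diff

SCHEMA = """
CREATE TABLE conn (
    owner TEXT NOT NULL,
    identity TEXT NOT NULL,
    status TEXT NOT NULL
);
CREATE UNIQUE INDEX uq_active ON conn (identity) WHERE status != 'revoked';
"""


def connect_identity(db, owner, identity, before_write=None):
    """Revoke other active owners, then insert; retry on a unique-index conflict."""
    attempts = 0
    last_error = None
    for _ in range(MAX_ATTEMPTS):
        attempts += 1
        try:
            others = db.execute(
                "SELECT rowid FROM conn "
                "WHERE identity = ? AND owner != ? AND status != 'revoked'",
                (identity, owner),
            ).fetchall()
            if before_write is not None:
                before_write()  # test hook: a rival commits after our read
                before_write = None
            for (rowid,) in others:
                db.execute("UPDATE conn SET status = 'revoked' WHERE rowid = ?", (rowid,))
            db.execute("INSERT INTO conn VALUES (?, ?, 'connected')", (owner, identity))
            db.commit()
            return attempts
        except sqlite3.IntegrityError as exc:
            # The rival's row is now visible: roll back and re-read.
            last_error = exc
            db.rollback()
    raise last_error


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.db")
    bob_db = sqlite3.connect(path)
    carol_db = sqlite3.connect(path)
    bob_db.executescript(SCHEMA)

    def rival_commits():
        carol_db.execute("INSERT INTO conn VALUES ('carol', 'U123', 'connected')")
        carol_db.commit()

    attempts = connect_identity(bob_db, "bob", "U123", before_write=rival_commits)
    active = bob_db.execute("SELECT owner FROM conn WHERE status != 'revoked'").fetchall()
    bob_db.close()
    carol_db.close()
```

The first pass loses to the rival's committed row and rolls back; the second pass sees that row, revokes it, and commits, so exactly one active owner remains.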

    async def list_connections(self, owner_user_id: str) -> list[dict[str, Any]]:
        async with self.session_factory() as session: