fix(channels): make channel connect flow deterministic (#3582)

* fix(channels): make channel connect flow deterministic

* make format

* fix(channels): apply connect-code before allowed_users on telegram and wechat

The bind-bootstrap reorder shipped for slack/dingtalk only. Telegram and
WeChat still run the _check_user/allowed_users gate before connect-code
dispatch, so a newly allowlisted-but-unbound user is silently rejected when
binding via the browser deep-link / connect-code flow. This is the same
deadlock the PR fixes.

- telegram: consume the /start deep-link token before the allowed_users gate.
- wechat: handle the /connect code before the allowed_users gate, and defer
  inbound file extraction + context-token tracking past the gate so blocked
  senders no longer trigger CDN downloads or token bookkeeping.
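
The ordering described above can be sketched roughly as follows. This is a
TypeScript illustration only (the adapters themselves are backend code that is
not shown here); `handleInbound`, `InboundMessage`, `HandlerDeps`, and the
outcome strings are hypothetical names, and the allowlist is assumed to be a
plain set of sender ids.

```typescript
// Hypothetical shapes for illustration; not the project's real adapter API.
interface InboundMessage {
  senderId: string;
  text: string;
  fileUrls: string[];
}

interface HandlerDeps {
  allowedUsers: Set<string>;
  // Returns the connect code carried by the text, or null if there is none.
  pendingConnectCode: (text: string) => string | null;
  // Consumes a connect code for the sender; true when the bind succeeded.
  consumeConnectCode: (code: string, senderId: string) => boolean;
  // Expensive side effect that should only run for senders past the gate.
  downloadFiles: (urls: string[]) => void;
}

type Outcome = "bound" | "bind_failed" | "rejected" | "dispatched";

function handleInbound(msg: InboundMessage, deps: HandlerDeps): Outcome {
  // 1. Connect-code dispatch runs first so an unbound sender can bootstrap.
  const code = deps.pendingConnectCode(msg.text);
  if (code !== null) {
    return deps.consumeConnectCode(code, msg.senderId) ? "bound" : "bind_failed";
  }
  // 2. Only then is the allowed_users gate applied.
  if (!deps.allowedUsers.has(msg.senderId)) {
    return "rejected";
  }
  // 3. File extraction is deferred past the gate, so blocked senders
  //    never trigger a download.
  deps.downloadFiles(msg.fileUrls);
  return "dispatched";
}
```

With this shape, a sender who is not in `allowedUsers` can still complete a
bind, while an ordinary message from the same sender is rejected before any
download happens.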

Adds regression tests for both adapters mirroring the slack/dingtalk coverage.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(channels): enforce single-active-owner invariant at the DB layer

_revoke_other_active_owners did a SELECT-then-UPDATE in app code with no row
lock or constraint covering active rows. Under READ COMMITTED, two concurrent
connect-code consumes for the same (provider, external_account_id, workspace_id)
from different owners could each observe "no other active owner" and both commit
a connected row, leaving find_connection_by_external_identity nondeterministic.

- Add a partial unique index on (provider, external_account_id, workspace_id)
  WHERE status != 'revoked' (portable to SQLite >= 3.8.0 and PostgreSQL) so the
  database guarantees at most one non-revoked row per external identity.
- Reorder upsert_connection to revoke other owners' active rows before the new
  connected row is flushed (so the index is satisfied at commit), wrapped in a
  bounded rollback-and-retry loop. A losing concurrent writer now retries
  against the now-visible state instead of committing a duplicate.
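
The revoke-then-insert order and the bounded retry can be illustrated with an
in-memory stand-in. This is a sketch under stated assumptions, not the
repository code: `ConnectionTable`, `bindWithRetry`, and the `beforeInsert`
hook (used only to simulate a concurrent writer) are hypothetical, the
"index" is an array scan, and same-owner re-binds are left out.

```typescript
type Status = "connected" | "revoked";

interface Row {
  ownerId: string;
  provider: string;
  externalAccountId: string;
  workspaceId: string;
  status: Status;
}

const identityKey = (r: Row): string =>
  [r.provider, r.externalAccountId, r.workspaceId].join("|");

class ConnectionTable {
  readonly rows: Row[] = [];

  // Mimics the partial unique index: refuses a second non-revoked row for
  // the same identity. Returns false instead of raising, for brevity.
  insert(row: Row): boolean {
    const active = this.rows.find(
      (r) => r.status !== "revoked" && identityKey(r) === identityKey(row),
    );
    if (active !== undefined && row.status !== "revoked") return false;
    this.rows.push({ ...row });
    return true;
  }

  // Marks every non-revoked row held by a different owner as revoked.
  revokeOtherOwners(row: Row): void {
    for (const r of this.rows) {
      if (
        r.status !== "revoked" &&
        r.ownerId !== row.ownerId &&
        identityKey(r) === identityKey(row)
      ) {
        r.status = "revoked";
      }
    }
  }

  activeOwners(row: Row): string[] {
    return this.rows
      .filter((r) => r.status !== "revoked" && identityKey(r) === identityKey(row))
      .map((r) => r.ownerId);
  }
}

// Revokes first, then inserts; a losing writer retries against fresh state.
// Returns the attempt number that succeeded.
function bindWithRetry(
  table: ConnectionTable,
  row: Row,
  maxAttempts = 3,
  beforeInsert: (attempt: number) => void = () => {},
): number {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    table.revokeOtherOwners(row);
    beforeInsert(attempt); // hypothetical hook: lets a test inject a racing write
    if (table.insert(row)) return attempt;
  }
  throw new Error(`gave up after ${maxAttempts} attempts`);
}
```

If a competing row appears between the revoke and the insert, the insert is
refused, and the next attempt revokes the competitor before inserting again,
so only one non-revoked row remains.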

Adds DB-constraint, revoked-slot-reuse, and concurrent-upsert regression tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(channels): harden connect-status polling primitive

pollChannelConnectionUntilResolved was a free-floating recursive setTimeout
started from onSuccess with no cancellation, no per-provider dedup, a redundant
second endpoint per tick, and an unbounded loop on a non-finite expires_in.
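
The non-finite expiry problem comes down to how comparisons with NaN behave.
A minimal sketch of a guard, assuming a hypothetical `resolveDeadlineMs`
helper and a 600-second default window (the commit does not state the real
default, and the positivity check is an extra assumption):

```typescript
// Hypothetical default window; the actual value is not given in the commit.
const DEFAULT_EXPIRY_SECONDS = 600;

// Returns an absolute deadline in milliseconds, falling back to the default
// window when the input is missing, NaN, infinite, or not positive.
function resolveDeadlineMs(
  expiresInSeconds: unknown,
  nowMs: number,
  fallbackSeconds: number = DEFAULT_EXPIRY_SECONDS,
): number {
  const seconds =
    typeof expiresInSeconds === "number" &&
    Number.isFinite(expiresInSeconds) &&
    expiresInSeconds > 0
      ? expiresInSeconds
      : fallbackSeconds;
  return nowMs + seconds * 1000;
}

// Without a guard the deadline is NaN, and "now >= deadline" is never true.
const unguardedDeadline = 0 + Number.NaN * 1000;
console.log(1e12 >= unguardedDeadline); // false
console.log(1e12 >= resolveDeadlineMs(Number.NaN, 0)); // true
```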

- Extract a framework-agnostic, cancellable poller (connect-poll.ts) that polls
  only listChannelConnections() and invalidates the providers query once when the
  bind resolves, instead of fetching both endpoints every tick.
- Guard expires_in with a finite check + default window so undefined/NaN can no
  longer produce a poll loop that runs until the page closes.
- Track one active poll handle per provider in useConnectChannelProvider via a
  ref Map: a new connect cancels the prior poll for that provider, and a useEffect
  cleanup cancels all polls on unmount.
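
The one-handle-per-provider bookkeeping can be sketched independently of
React. `PollRegistry` and `PollHandle` are hypothetical names for this
illustration; in the hook, the Map would live in a ref and `cancelAll` would
run from the useEffect cleanup.

```typescript
interface PollHandle {
  cancel(): void;
}

class PollRegistry {
  private readonly active = new Map<string, PollHandle>();

  // Starts a poll for `provider`, cancelling any poll already running for it.
  start(provider: string, startPoll: () => PollHandle): void {
    this.active.get(provider)?.cancel();
    this.active.set(provider, startPoll());
  }

  // Cancels every tracked poll, e.g. when the owning component unmounts.
  cancelAll(): void {
    this.active.forEach((handle) => handle.cancel());
    this.active.clear();
  }

  get size(): number {
    return this.active.size;
  }
}
```

Keying the Map by provider means a second connect attempt for the same
provider replaces the earlier poll instead of running alongside it.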

Adds unit tests for resolve-and-stop, cancellation, and non-finite-expiry.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(channels): stop leaking blocked-sender content in DingTalk INFO log; document bind semantics

Moving the allowed_users gate past _extract_text meant the parsed-message INFO
log (text=%r, first 100 chars) fired for senders that allowed_users would have
rejected, defeating the filter's noise/privacy role. Move that log to after the
allowed_users gate so blocked senders' message text never reaches INFO logs.

Also document the two operator-relevant semantic changes in backend/CLAUDE.md:
connect-code dispatch runs before allowed_users (so allowed_users is no longer a
bind-time defense; the model relies on code confidentiality + 600s TTL + one-time
consumption), and the single-active-owner-per-external-identity transfer semantics
now backed by the partial unique index.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(channels): note connect-code-vs-allowlist and ownership transfer in operator guide

Mirror the backend/CLAUDE.md notes in the operator-facing IM_CHANNEL_CONNECTIONS.md:
connect codes are consumed before allowed_users (so a not-yet-allowlisted user can
still complete a first bind, and allowed_users is not a bind-time defense), and an
external identity has at most one active owner with last-bind-wins transfer enforced
at the DB layer.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* refactor(channels): lift connect-code dispatch into Channel base class

Each adapter duplicated the ordering-sensitive boilerplate of extracting a
/connect code and guarding on the connection repo before its allowed_users gate.
The duplication is what let telegram/wechat drift and keep the gate ahead of the
bind. Centralize it:

- Move `_connection_repo` onto Channel.__init__ (removing 7 duplicate assignments).
- Add Channel._pending_connect_code(text), which guards on the repo and extracts
  the code, documenting that adapters MUST consult it before authorization so a
  browser-initiated bind can bootstrap a not-yet-authorized identity.
- Route slack, discord, feishu, dingtalk, wechat, and wecom through the helper.
  This also fixes a latent inconsistency where slack dispatched a bind even when
  no connection repo was configured.
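
The helper's contract can be sketched as below. This is a TypeScript
approximation of a backend base class, so every name here is hypothetical
(`ConnectionRepo`, `pendingConnectCode`, `ExampleChannel`), and the
`/connect <code>` text format is an assumed syntax rather than the real one.

```typescript
// Hypothetical stand-ins; the real base class is backend code not shown here.
interface ConnectionRepo {
  consume(code: string, externalId: string): boolean;
}

abstract class Channel {
  protected readonly connectionRepo: ConnectionRepo | null;

  constructor(connectionRepo: ConnectionRepo | null = null) {
    this.connectionRepo = connectionRepo;
  }

  // Returns the connect code in `text`, or null when no repo is configured
  // or the text is not a connect command. Adapters are expected to call
  // this BEFORE any authorization check.
  protected pendingConnectCode(text: string): string | null {
    if (this.connectionRepo === null) return null;
    const match = /^\/connect\s+(\S+)$/.exec(text.trim()); // assumed syntax
    return match ? (match[1] ?? null) : null;
  }
}

class ExampleChannel extends Channel {
  private readonly allowedUsers: Set<string>;

  constructor(repo: ConnectionRepo | null, allowedUsers: Set<string>) {
    super(repo);
    this.allowedUsers = allowedUsers;
  }

  handle(senderId: string, text: string): string {
    const code = this.pendingConnectCode(text);
    if (code !== null && this.connectionRepo !== null) {
      return this.connectionRepo.consume(code, senderId) ? "bound" : "bind_failed";
    }
    if (!this.allowedUsers.has(senderId)) return "rejected";
    return "dispatched";
  }
}
```

Because the repo guard lives in the helper, an adapter with no connection
repo never dispatches a bind and falls through to its normal authorization
path.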

Otherwise a pure refactor: the full channel suite stays green. Adds a direct
unit test for the base helper's contract.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* make format

* fix(channels): redact DingTalk parsed-message INFO log content

Log text_len instead of the first 100 chars of message text, so message
content never reaches INFO logs (the after-gate move already keeps blocked
senders out entirely). This takes over the redaction from #3584 so only this
PR touches dingtalk.py, letting the two PRs merge in any order conflict-free.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Author: Nan Gao
Date: 2026-06-18 04:15:31 +02:00
Committed by GitHub
Parent: 8c0830aea1
Commit: 68ba4198b8
21 changed files with 695 additions and 80 deletions
@@ -0,0 +1,101 @@
import { afterEach, beforeEach, describe, expect, test, vi } from "vitest";

import { startConnectionPoll } from "@/core/channels/connect-poll";
import type { ChannelConnection } from "@/core/channels/types";

function connection(provider: string, status: string): ChannelConnection {
  return {
    id: `${provider}-1`,
    provider,
    status,
    scopes: [],
    metadata: {},
  };
}

beforeEach(() => {
  vi.useFakeTimers();
});

afterEach(() => {
  vi.useRealTimers();
});

describe("startConnectionPoll", () => {
  test("polls connections until the provider is connected, then resolves once", async () => {
    const responses: ChannelConnection[][] = [
      [connection("telegram", "pending")],
      [connection("telegram", "connected")],
    ];
    const fetchConnections = vi.fn(async () => responses.shift() ?? []);
    const onConnected = vi.fn();

    startConnectionPoll({
      provider: "telegram",
      expiresInSeconds: 600,
      fetchConnections,
      onConnected,
      intervalMs: 1000,
    });

    await vi.advanceTimersByTimeAsync(1000);
    expect(fetchConnections).toHaveBeenCalledTimes(1);
    expect(onConnected).not.toHaveBeenCalled();

    await vi.advanceTimersByTimeAsync(1000);
    expect(fetchConnections).toHaveBeenCalledTimes(2);
    expect(onConnected).toHaveBeenCalledTimes(1);

    // No further polling after the connection resolves.
    await vi.advanceTimersByTimeAsync(5000);
    expect(fetchConnections).toHaveBeenCalledTimes(2);
  });

  test("cancel() stops scheduled polling and fires no further fetches", async () => {
    const fetchConnections = vi.fn(async () => [
      connection("telegram", "pending"),
    ]);

    const handle = startConnectionPoll({
      provider: "telegram",
      expiresInSeconds: 600,
      fetchConnections,
      onConnected: vi.fn(),
      intervalMs: 1000,
    });

    await vi.advanceTimersByTimeAsync(1000);
    expect(fetchConnections).toHaveBeenCalledTimes(1);

    handle.cancel();
    await vi.advanceTimersByTimeAsync(10000);
    expect(fetchConnections).toHaveBeenCalledTimes(1);
  });

  test("a non-finite expires_in falls back to a finite deadline and terminates", async () => {
    const fetchConnections = vi.fn(async () => [
      connection("telegram", "pending"),
    ]);
    let nowValue = 0;

    startConnectionPoll({
      provider: "telegram",
      expiresInSeconds: Number.NaN,
      fetchConnections,
      onConnected: vi.fn(),
      intervalMs: 1000,
      now: () => nowValue,
    });

    nowValue = 1;
    await vi.advanceTimersByTimeAsync(1000);
    expect(fetchConnections).toHaveBeenCalledTimes(1);

    // Jump past the fallback expiry window: the loop must stop instead of
    // running forever (Date.now() >= NaN would otherwise never be true).
    nowValue = 10_000_000;
    await vi.advanceTimersByTimeAsync(1000);
    expect(fetchConnections).toHaveBeenCalledTimes(2);

    await vi.advanceTimersByTimeAsync(10000);
    expect(fetchConnections).toHaveBeenCalledTimes(2);
  });
});