Phase 4 — Multi-Tenant Auth & Audit

Effort: L

Goal

Every tool call, webhook event, and admin action is attributable to a client_id, gated by scope + grant checks, rate-limited, and audited. This is the security-critical phase — defence in depth from key parse through tool dispatch through DB write.

Deliverables

Migrations

  • drizzle/0003_multitenant.sql:
    • Backfills messages.client_id = local-owner for existing rows; sets NOT NULL.
    • Adds client_id to contacts and media_objects (Phase 6 will use the latter).
  • drizzle/0004_audit_ratelimit.sql:
    • Creates audit_log (bigserial, append-only enforced at role level — see DB role section).
    • Creates rate_limit_buckets.
    • Creates two Postgres roles: wa_app (INSERT/SELECT on audit_log, full DML on others) and wa_audit_archiver (DELETE on audit_log only). App connects as wa_app; the archive cron uses wa_audit_archiver.

Admin CLI

Run via docker compose exec app pnpm admin <subcommand> in prod; or pnpm admin ... in dev.

  • scripts/admin/create-client.ts: admin clients create --name <slug> --display-name "..." [--owner]. Refuses to create a second client with is_owner set.
  • scripts/admin/list-clients.ts: admin clients list.
  • scripts/admin/disable-client.ts: admin clients disable <id> / admin clients enable <id>.
  • scripts/admin/mint-key.ts: admin keys mint --client <id> --label "..." --scopes "tools:send_message,numbers:<phone_id>" [--expires 90d] [--rpm 60] [--daily 250]:
    • Generates a wamcp_live_... key (140 bits entropy).
    • Computes HMAC-SHA256(pepper, token), stores the hash + 12-char prefix.
    • Validates scopes against the clients.is_owner flag (rejects wildcard scopes for non-owners).
    • Prints the full token to stderr with a one-line warning that this is the only time it will be shown. Stdout gets the key id + prefix for piping.
  • scripts/admin/list-keys.ts, scripts/admin/revoke-key.ts, scripts/admin/rotate-key.ts (default --grace 7d).
  • scripts/admin/add-grant.ts: admin grants add --client <id> --phone <phone_id> --tools "send_message,get_messages" [--daily-cap 500].
  • scripts/admin/list-grants.ts, scripts/admin/revoke-grant.ts.
  • scripts/admin/show-audit.ts: admin audit --client <id> --since "24h" for ad-hoc forensics.
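
A minimal sketch of the minting step behind admin keys mint, assuming Node's built-in crypto module. The token layout wamcp_live_<prefix>_<secret>, the 32-symbol alphabet, and the names mintKey and randomChars are illustrative assumptions; the plan itself only fixes the wamcp_live_ lead, 140 bits of entropy, HMAC-SHA256(pepper, token), and a 12-character stored prefix.

```typescript
import { createHmac, randomInt } from "node:crypto";

// Assumed alphabet: 32 symbols, so each character carries 5 bits.
const ALPHABET = "abcdefghijklmnopqrstuvwxyz234567";

/** Draws `length` characters uniformly from ALPHABET using the CSPRNG. */
function randomChars(length: number): string {
  let out = "";
  for (let i = 0; i < length; i++) {
    out += ALPHABET.charAt(randomInt(ALPHABET.length));
  }
  return out;
}

interface MintedKey {
  token: string;  // shown once to the operator, never stored
  prefix: string; // stored in clear for the indexed lookup
  hash: string;   // hex HMAC-SHA256(pepper, token), stored
}

/** Hypothetical helper: builds a token and the values that would be persisted. */
function mintKey(pepper: Buffer): MintedKey {
  const prefix = randomChars(12);
  const secret = randomChars(28); // 28 chars * 5 bits = 140 bits
  const token = `wamcp_live_${prefix}_${secret}`;
  const hash = createHmac("sha256", pepper).update(token).digest("hex");
  return { token, prefix, hash };
}
```

Only prefix and hash would reach the database under this sketch; the token exists in memory long enough to be printed to stderr.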

Auth modules

  • src/auth/api-key.ts:
    • parseBearer(header) → { env, prefix, secret } or throws.
    • lookupKey(prefix) → DB row or null (single indexed query).
    • verifyKeyHash(secret, hash, pepper) → constant-time HMAC compare.
    • loadClientContext(keyRow) → { clientId, apiKeyId, scopes, rpmLimit, dailyMsgLimit, allowedPhoneNumberIds } (allowed phones come from joining client_phone_grants).
    • Express middleware composing the above. Fails closed on any error → 401 + auth_failed audit row.
  • src/auth/context.ts: AsyncLocalStorage<AuthContext>; helpers getAuth(), runWithAuth().
  • src/auth/scopes.ts:
    • requireScope(toolName): throws ScopeDeniedError if neither tools:<name> nor tools:* is present. Owner check: a wildcard scope is rejected at request time when clients.is_owner = false.
    • requireGrant(clientId, phoneNumberId, toolName) — looks up client_phone_grants, throws GrantDeniedError on miss / revoked / tool not in allowed_tools.
  • src/auth/rate-limit.ts:
    • enforceRpm(apiKeyId, limit): sliding window, weighted across the current and previous minute, via atomic upserts in rate_limit_buckets. Returns { remaining, resetAtEpoch } on success; throws RateLimitedError with retryAfterSeconds once the limit is exhausted.
    • enforceDailyCap(clientId, phoneNumberId, cap): sums the last 24 hourly buckets and throws once the cap is exhausted. Called inside the send-message Inngest function (post-dequeue).
  • All four errors carry httpStatus (401 / 403 / 403 / 429) and a JSON-RPC error code so the MCP layer can translate uniformly.
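
The weighted sliding window used by enforceRpm can be sketched as a pure function. This is the common "sliding window counter" approximation, shown under the assumption that buckets are aligned to wall-clock minutes; the names BucketCounts, weightedCount, and checkRpm are hypothetical, and the real implementation reads its counts from rate_limit_buckets.

```typescript
const WINDOW_MS = 60_000;

interface BucketCounts {
  previous: number; // requests recorded in the previous minute bucket
  current: number;  // requests recorded so far in the current minute bucket
}

/**
 * Estimates requests in the trailing 60 s: the previous bucket is weighted by
 * the share of it that still overlaps the trailing window.
 */
function weightedCount(counts: BucketCounts, nowMs: number): number {
  const elapsed = (nowMs % WINDOW_MS) / WINDOW_MS; // fraction of current minute elapsed
  return counts.previous * (1 - elapsed) + counts.current;
}

/** Decides whether one more request fits under `limit`. */
function checkRpm(
  counts: BucketCounts,
  nowMs: number,
  limit: number,
): { allowed: boolean; remaining: number } {
  const used = weightedCount(counts, nowMs);
  const allowed = used + 1 <= limit;
  const remaining = Math.max(0, Math.floor(limit - used - (allowed ? 1 : 0)));
  return { allowed, remaining };
}
```

For example, with previous = 4, current = 2, and 30 s into the minute, the estimate is 4 × 0.5 + 2 = 4, so one more request fits under a limit of 5 and none remain after it.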

Audit logger

  • src/audit/logger.ts:
    • audit(action, { toolName?, phoneNumberId?, wamid?, payloadHash?, errorCode?, latencyMs?, metadata? }).
    • Pulls clientId, apiKeyId, requestId, ip, userAgent from AsyncLocalStorage.
    • Single INSERT into audit_log. Never UPDATE/DELETE (DB role enforces).
    • Writes asynchronously via a small per-process queue with periodic flush (every 500ms or 100 rows). On process shutdown, drains the queue.
    • Failure to write audit (DB down) is logged at error level and the request continues — the audit gap is itself an alert signal.
  • payloadHash = SHA-256 of the canonicalised JSON input. Canonicalisation uses JCS (RFC 8785) via the canonicalize npm package — deterministic key ordering and number formatting, no ambiguity. Never store plaintext bodies in audit_log — bodies live only in messages.body.
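
A rough illustration of why canonicalisation matters for payloadHash. To stay self-contained, this sketch swaps the canonicalize package for a small sorted-key serialiser (canonicalJson, a hypothetical stand-in that is not a full RFC 8785 implementation) and hashes the result with Node's crypto module.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

/**
 * Stand-in for the canonicalize package: serialises with object keys sorted
 * and no insignificant whitespace. Not a complete JCS implementation.
 */
function canonicalJson(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(canonicalJson).join(",")}]`;
  const keys = Object.keys(value).sort(); // default sort compares UTF-16 code units
  const members = keys.map((k) => `${JSON.stringify(k)}:${canonicalJson(value[k] as Json)}`);
  return `{${members.join(",")}}`;
}

/** SHA-256 over the canonical form, hex-encoded. */
function payloadHash(input: Json): string {
  return createHash("sha256").update(canonicalJson(input), "utf8").digest("hex");
}
```

Two inputs that differ only in key order, such as { to: "x", body: "hi" } and { body: "hi", to: "x" }, serialise identically and therefore produce the same hash, which is what makes the audit row comparable across retries.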

Webhook → tenant resolution

  • src/webhook/meta.ts:
    • Look up phone_numbers by wa_phone_number_id.
    • For each client_phone_grants row joining that number, derive a derivedEventId namespaced on client_id so a number serving multiple clients produces one Inngest event per client. (v1 still has one client per number, but the code path is fan-out-ready.)
  • process-message (Phase 3) gets clientId in its event data and uses it everywhere downstream.
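
The fan-out step can be sketched as a pure function from grant rows to events. The shapes PhoneGrant and TenantEvent, the event name, the derivedEventId format `<wamid>:<clientId>`, and the choice to skip revoked grants are all assumptions made for illustration; the plan only requires that the id be namespaced on client_id.

```typescript
interface PhoneGrant {
  clientId: string;
  phoneNumberId: string;
  revokedAt: Date | null;
}

interface TenantEvent {
  id: string;   // derivedEventId, used for de-duplication
  name: string; // hypothetical event name
  data: { clientId: string; phoneNumberId: string; wamid: string };
}

/** Produces one event per distinct client holding a live grant on the number. */
function fanOut(wamid: string, phoneNumberId: string, grants: PhoneGrant[]): TenantEvent[] {
  const clientIds = new Set<string>();
  for (const grant of grants) {
    if (grant.phoneNumberId === phoneNumberId && grant.revokedAt === null) {
      clientIds.add(grant.clientId);
    }
  }
  return Array.from(clientIds).map((clientId) => ({
    id: `${wamid}:${clientId}`,
    name: "whatsapp/message.received",
    data: { clientId, phoneNumberId, wamid },
  }));
}
```

With one client per number, as in v1, this returns a single event; adding a second grant on the same number would yield a second event with a distinct id and no other code change.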

MCP transport plumbing for auth

  • Even though Streamable HTTP lands in Phase 5, the dispatcher (the bit between transport and tool registry) is refactored here:
    • Every tool invocation runs through runWithAuth(ctx, async () => ...) with ctx either from the auth middleware (HTTP) or the synthetic owner context (stdio).
    • Every tool handler is wrapped at registry-load time with wrapToolHandler(handler, scope) which: parses input → requireScope → requireGrant (if input includes phoneNumberId) → calls handler → emits audit row → returns result.
    • Rate-limit / scope / grant errors are translated to MCP JSON-RPC errors with code -32001 (forbidden) or -32004 (rate limited).
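
The dispatcher flow above might look roughly like the following, assuming Node's AsyncLocalStorage. The AuthContext fields, the injected audit callback (a third parameter added here only to keep the sketch self-contained), and the tool_failed action are hypothetical; the grant check and input parsing are omitted for brevity.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

interface AuthContext {
  clientId: string;
  apiKeyId: string | null; // null for the synthetic stdio owner context
  scopes: string[];
  isOwner: boolean;
}

const authStorage = new AsyncLocalStorage<AuthContext>();

function runWithAuth<T>(ctx: AuthContext, fn: () => Promise<T>): Promise<T> {
  return authStorage.run(ctx, fn);
}

function getAuth(): AuthContext {
  const ctx = authStorage.getStore();
  if (!ctx) throw new Error("no auth context"); // fail closed
  return ctx;
}

class ScopeDeniedError extends Error {
  readonly httpStatus = 403;
  readonly jsonRpcCode = -32001;
}

/** Exact scope always counts; the wildcard only counts for owner clients. */
function requireScope(toolName: string): void {
  const { scopes, isOwner } = getAuth();
  const exact = scopes.includes(`tools:${toolName}`);
  const wildcard = isOwner && scopes.includes("tools:*");
  if (!exact && !wildcard) throw new ScopeDeniedError(`scope_denied: ${toolName}`);
}

type ToolHandler<I, O> = (input: I) => Promise<O>;

function wrapToolHandler<I, O>(
  handler: ToolHandler<I, O>,
  scope: string,
  audit: (action: string) => void,
): ToolHandler<I, O> {
  return async (input: I) => {
    try {
      requireScope(scope);
      const result = await handler(input);
      audit("tool_called");
      return result;
    } catch (err) {
      audit(err instanceof ScopeDeniedError ? "scope_denied" : "tool_failed");
      throw err;
    }
  };
}
```

Because the context travels through AsyncLocalStorage, the handler and the scope check never take the caller's identity as an argument, so a handler cannot accidentally act on behalf of a different client.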

Retention (built now, used later)

  • src/inngest/functions/archive-audit.ts — daily cron. Selects audit_log rows > 365d, writes them to /var/lib/whatsapp-mcp/audit-archive/<yyyy-mm>.jsonl.zst, then DELETE via wa_audit_archiver role. (The archive directory check is skipped in test runs; configured per env.)
  • src/inngest/functions/prune-messages.ts — daily cron. Nullifies body and raw on messages rows > 90d (configurable per client later in a clients.retention_days column — for now uses a global config).
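
The selection and file-naming part of archive-audit can be sketched without touching the database. The AuditRow shape and the helper names are hypothetical, and the sketch assumes each row is filed under the UTC month of its own ts; compression and the DELETE step are left out.

```typescript
interface AuditRow {
  id: number;
  ts: Date;
  action: string;
}

const DAY_MS = 24 * 60 * 60 * 1000;
const ARCHIVE_DIR = "/var/lib/whatsapp-mcp/audit-archive";

/** Rows strictly older than this instant are eligible for archival. */
function cutoffDate(now: Date, retentionDays: number): Date {
  return new Date(now.getTime() - retentionDays * DAY_MS);
}

/** Maps a row timestamp to its <yyyy-mm>.jsonl.zst archive file (UTC month). */
function archivePathFor(ts: Date): string {
  const yyyy = ts.getUTCFullYear();
  const mm = String(ts.getUTCMonth() + 1).padStart(2, "0");
  return `${ARCHIVE_DIR}/${yyyy}-${mm}.jsonl.zst`;
}

/** Groups eligible rows into JSONL lines per archive file. */
function planArchive(rows: AuditRow[], now: Date, retentionDays = 365): Map<string, string[]> {
  const cutoff = cutoffDate(now, retentionDays).getTime();
  const plan = new Map<string, string[]>();
  for (const row of rows) {
    if (row.ts.getTime() >= cutoff) continue; // still inside the retention window
    const path = archivePathFor(row.ts);
    const lines = plan.get(path) ?? [];
    lines.push(JSON.stringify(row));
    plan.set(path, lines);
  }
  return plan;
}
```

Writing the archive before deleting, in that order, means a crash between the two steps leaves duplicate data rather than a gap.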

Docs (extended)

  • docs/architecture/auth.md — full auth pipeline, key format, scope model, two-layer authz, rotation flow, stdio short-circuit.
  • docs/architecture/audit.md — what we log, what we never log, retention policy, archival flow, investigation queries.
  • docs/architecture/rate-limiting.md — RPM sliding window math, daily cap math, where each check runs, 429 contract.
  • docs/architecture/database.md — extended with audit_log, rate_limit_buckets, DB role split.
  • docs/components/admin-cli.md — every admin subcommand with usage examples.
  • docs/operations/client-onboarding.md — promoted from docs/plan/ops/ stub; create client → mint key → grant numbers → hand over → revoke flow.
  • docs/api/errors.md — extended with auth / scope / grant / rate-limit error codes.

Critical files

Tests

Unit

  • tests/unit/auth/api-key-parse.test.ts: wamcp_live_... accepted; Basic ... rejected; malformed format rejected; missing scheme rejected.
  • tests/unit/auth/key-hash.test.ts — HMAC matches; pepper mismatch fails; timing-safe compare (constant-time over equal-length inputs).
  • tests/unit/auth/scopes.test.ts — exact match; wildcard for owner; wildcard rejected for non-owner; missing scope throws.
  • tests/unit/auth/rate-limit-math.test.ts — sliding window calculation across minute boundaries.

Integration (testcontainers Postgres)

  • tests/integration/admin/keys.test.ts — mint → list → use → revoke → use-after-revoke → 401; rotate → both work in grace window → old revoked at expiry.
  • tests/integration/admin/grants.test.ts — add → use → revoke → use → grant_denied.
  • tests/integration/auth/middleware.test.ts:
    • Valid bearer → tool succeeds, tool_called + result-action audit rows.
    • Wrong key → 401, auth_failed audit row.
    • Revoked key → 401.
    • Expired key → 401.
    • Disabled client → 401.
  • tests/integration/auth/scope.test.ts:
    • Key with tools:send_message but not tools:get_messages → get_messages call returns scope_denied error, scope_denied audit row.
  • tests/integration/auth/grant.test.ts:
    • Two phones, grant for only one → call with the wrong phoneNumberId → grant_denied.
  • tests/integration/auth/rate-limit.test.ts:
    • Key with rpm=5 → 6th call in a minute returns RateLimitedError (HTTP 429 / JSON-RPC -32004).
    • Daily cap 3 → 4th send fails with daily-cap error inside the Inngest function; the message row goes status='failed'.
    • Sliding window: 5 calls at 0s and 5 calls at 30s → the 11th at 31s succeeds because the first window has rolled off; verify weighted math.
  • tests/integration/auth/stdio-owner.test.ts — stdio mode produces audit rows with api_key_id = null and metadata.transport = 'stdio'.
  • tests/integration/cross-tenant/isolation.test.ts:
    • Create clients A and B; grant A and B different numbers.
    • With A’s key: get_messages on B’s number → grant_denied.
    • With A’s key: send_message to a number not granted → grant_denied.
    • Direct DB query via the scoped helper with A’s clientId cannot read B’s rows.
    • This test file is the canonical isolation regression — runs on every PR.
  • tests/integration/audit/log.test.ts:
    • Every tool call produces exactly one row with the right action, payload_hash, latency_ms.
    • audit_log rows for the same request share a request_id.
    • Attempting UPDATE audit_log via the wa_app role fails (DB-level check).

Coverage

  • src/auth/ ≥ 95 %.
  • src/audit/ ≥ 95 %.
  • src/db/scoped.ts ≥ 95 %.
  • Phase total ≥ 80 %.

Code documentation

  • TSDoc on every exported symbol in src/auth/ and src/audit/. @remarks mandatory on every auth check covering: failure mode (fail-closed), what gets audited, what gets returned to the client, and the security invariant being enforced.
  • File-level headers on every new file.
  • docs/architecture/{auth,audit,rate-limiting,database}.md written/extended.
  • docs/components/admin-cli.md complete.
  • docs/operations/client-onboarding.md complete.
  • docs/api/errors.md extended.
  • docs/reference/ regenerated.

Acceptance

  1. Multi-client isolation demo — create test-client, mint a key with only tools:send_message,numbers:<phone_id>, grant just send_message on that phone:
    • With the key: send_message works.
    • With the key: get_messages returns scope_denied.
    • With the key: send_message to a different phoneNumberId returns grant_denied.
  2. Rate limit demo — key with rpm=5: 6 calls in 30 s → 6th returns 429 with Retry-After.
  3. Daily cap demo — grant with daily_message_cap=3: 4th send in 24h → message row status='failed' with daily-cap error; first 3 succeed.
  4. Audit completeness: SELECT action, count(*) FROM audit_log WHERE client_id = '<test>' AND ts > now() - '1 hour'::interval GROUP BY 1 shows every action including scope_denied, grant_denied, rate_limited, tool_called, send_attempt, send_success/send_failed, key_used.
  5. Append-only enforcement: psql as wa_app running UPDATE audit_log SET ts = now() fails with a permission error.
  6. Cross-tenant isolation suite: tests/integration/cross-tenant/isolation.test.ts green.
  7. pnpm test:ci green; coverage gates met.

Notes

  • The pepper at /run/secrets/api_key_pepper is 32 random bytes, base64-encoded. Pepper rotation procedure documented in ops/incident-runbook.md: re-hash all live keys with the new pepper inside a single transaction; the previous pepper is accepted for one rotation window (kept at /run/secrets/api_key_pepper.previous).
  • The admin CLI is the only management interface. There is no admin HTTP endpoint. Phase 8 introduces an owner-only portal if needed.
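
During a pepper rotation window, verification has to try both peppers. A minimal sketch, assuming Node's crypto module and hex-encoded stored hashes; verifyWithRotation and its return value are hypothetical, and reading the two secret files is left to the caller.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

/** Constant-time check of HMAC-SHA256(pepper, token) against a stored hex hash. */
function verifyKeyHash(token: string, storedHashHex: string, pepper: Buffer): boolean {
  const expected = createHmac("sha256", pepper).update(token).digest();
  const stored = Buffer.from(storedHashHex, "hex");
  // timingSafeEqual throws on length mismatch, so guard first.
  return stored.length === expected.length && timingSafeEqual(stored, expected);
}

/**
 * Tries the current pepper, then the previous one if supplied. Returns which
 * pepper matched so a caller could re-hash rows that still use the old one.
 */
function verifyWithRotation(
  token: string,
  storedHashHex: string,
  current: Buffer,
  previous?: Buffer,
): "current" | "previous" | null {
  if (verifyKeyHash(token, storedHashHex, current)) return "current";
  if (previous && verifyKeyHash(token, storedHashHex, previous)) return "previous";
  return null;
}
```

Once the rotation window closes, dropping the previous argument makes every hash computed with the old pepper fail verification.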

Definition of Done

Migrations

  • drizzle/0003_multitenant.sql: client_id backfilled on messages, contacts; not-null where required.
  • drizzle/0004_audit_ratelimit.sql: audit_log, rate_limit_buckets, plus DB roles wa_app + wa_audit_archiver with correct grants.

Admin CLI

  • clients create / list / disable / enable working.
  • keys mint rejects wildcard scopes for non-owners; prints full token to stderr only.
  • keys list / revoke / rotate working.
  • grants add / list / revoke working.
  • audit --client … --since … working.

Auth modules

  • src/auth/api-key.ts — parse, lookup, HMAC verify, attach context.
  • src/auth/context.ts — AsyncLocalStorage helpers.
  • src/auth/scopes.ts: requireScope, requireGrant, owner-wildcard check at request time.
  • src/auth/rate-limit.ts — RPM sliding window + daily cap; throws structured errors.
  • All four errors carry httpStatus + JSON-RPC error code.

Audit logger

  • src/audit/logger.ts — write-coalesced flush; payload_hash via canonicalize (RFC 8785) + SHA-256.
  • wa_app role has no UPDATE/DELETE on audit_log (verified via integration test).
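
The write-coalesced flush could be structured along these lines. AuditQueue, its Sink callback, and the constructor defaults are hypothetical; the sink stands in for the single batched INSERT into audit_log, and a failed write is logged without propagating to the request.

```typescript
type AuditRecord = Record<string, unknown>;
type Sink = (rows: AuditRecord[]) => Promise<void>;

class AuditQueue {
  private buffer: AuditRecord[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;
  private inFlight: Promise<void> = Promise.resolve();
  private readonly sink: Sink;
  private readonly maxRows: number;
  private readonly maxDelayMs: number;

  constructor(sink: Sink, maxRows = 100, maxDelayMs = 500) {
    this.sink = sink;
    this.maxRows = maxRows;
    this.maxDelayMs = maxDelayMs;
  }

  /** Buffers a row; flushes at once when full, otherwise arms the flush timer. */
  enqueue(row: AuditRecord): void {
    this.buffer.push(row);
    if (this.buffer.length >= this.maxRows) {
      void this.flush();
    } else if (this.timer === null) {
      this.timer = setTimeout(() => void this.flush(), this.maxDelayMs);
    }
  }

  /** Hands the buffered rows to the sink; writes are chained so they stay ordered. */
  flush(): Promise<void> {
    if (this.timer !== null) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    if (this.buffer.length > 0) {
      const batch = this.buffer;
      this.buffer = [];
      this.inFlight = this.inFlight
        .then(() => this.sink(batch))
        .catch((err) => console.error("audit write failed", err)); // request continues
    }
    return this.inFlight;
  }

  /** For shutdown hooks: resolves once every queued row has been attempted. */
  drain(): Promise<void> {
    return this.flush();
  }
}
```

A shutdown handler would await drain() before closing the database pool, so rows buffered in the last 500 ms are still attempted.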

Webhook + dispatcher

  • Webhook resolves phone_number_id and fans out one event per granted client.
  • MCP dispatcher wraps every tool with wrapToolHandler(handler, scope).
  • Errors translated to JSON-RPC -32001 (forbidden) or -32004 (rate limited).

Retention

  • archive-audit cron writes JSONL.zst then deletes via wa_audit_archiver.
  • prune-messages cron nullifies body/raw on rows > 90d.

Tests

  • tests/unit/auth/api-key-parse.test.ts passes.
  • tests/unit/auth/key-hash.test.ts passes (timing-safe verified).
  • tests/unit/auth/scopes.test.ts passes.
  • tests/unit/auth/rate-limit-math.test.ts passes (cross-boundary sliding window).
  • tests/integration/admin/keys.test.ts (mint → use → revoke → rotate-grace) passes.
  • tests/integration/admin/grants.test.ts passes.
  • tests/integration/auth/middleware.test.ts (valid/wrong/revoked/expired/disabled) passes.
  • tests/integration/auth/scope.test.ts passes.
  • tests/integration/auth/grant.test.ts passes.
  • tests/integration/auth/rate-limit.test.ts (rpm + daily + sliding window) passes.
  • tests/integration/auth/stdio-owner.test.ts passes.
  • tests/integration/cross-tenant/isolation.test.ts green (the canonical regression).
  • tests/integration/audit/log.test.ts (every action + append-only role check) passes.
  • Coverage: src/auth/ ≥ 95%; src/audit/ ≥ 95%; src/db/scoped.ts ≥ 95%; phase total ≥ 80%.

Documentation

  • docs/architecture/auth.md written.
  • docs/architecture/audit.md written.
  • docs/architecture/rate-limiting.md written.
  • docs/architecture/database.md extended (audit_log, rate_limit_buckets, role split).
  • docs/components/admin-cli.md complete.
  • docs/operations/client-onboarding.md promoted from stub.
  • docs/api/errors.md extended.
  • TSDoc @remarks on every src/auth/ and src/audit/ export.
  • docs/reference/ regenerated cleanly.

Acceptance verified

  • Multi-client isolation demo: test-client scoped key can send_message but get_messages returns scope_denied; wrong number returns grant_denied.
  • RPM demo: 6th call in a minute with rpm=5 → 429 + Retry-After.
  • Daily cap demo: 4th send with daily_message_cap=3 → row status='failed' with daily-cap error.
  • audit_log query shows every action variety for the test client.
  • psql as wa_app running UPDATE audit_log fails with permission error.

Phase signoff

  • Phase 4 complete. README.md status table updated to ✅.