Phase 4 — Multi-Tenant Auth & Audit

Effort: L

Goal

Every tool call, webhook event, and admin action is attributable to a client_id, gated by scope + grant checks, rate-limited, and audited. This is the security-critical phase — defence in depth from key parse through tool dispatch through DB write.

Deliverables

Migrations

drizzle/0003_multitenant.sql:
- Backfills messages.client_id = local-owner for existing rows; sets NOT NULL.
- Adds client_id to contacts and media_objects (Phase 6 will use the latter).
drizzle/0004_audit_ratelimit.sql:
- Creates audit_log (bigserial, append-only enforced at role level — see DB role section).
- Creates rate_limit_buckets.
- Creates two Postgres roles: wa_app (INSERT/SELECT on audit_log, full DML on others) and wa_audit_archiver (DELETE on audit_log only). App connects as wa_app; the archive cron uses wa_audit_archiver.

Admin CLI

Run via docker compose exec app pnpm admin <subcommand> in prod; or pnpm admin ... in dev.

scripts/admin/create-client.ts — admin clients create --name <slug> --display-name "..." [--owner]. Refuses to create a second is_owner.
scripts/admin/list-clients.ts — admin clients list.
scripts/admin/disable-client.ts — admin clients disable <id> / admin clients enable <id>.
scripts/admin/mint-key.ts — admin keys mint --client <id> --label "..." --scopes "tools:send_message,numbers:<phone_id>" [--expires 90d] [--rpm 60] [--daily 250]:
- Generates a wamcp_live_... key (140 bits entropy).
- Computes HMAC-SHA256(pepper, token), stores the hash + 12-char prefix.
- Validates scopes against the clients.is_owner flag (rejects wildcard scopes for non-owners).
- Prints the full token to stderr with a one-line warning that this is the only time it will be shown. Stdout gets the key id + prefix for piping.
scripts/admin/list-keys.ts, scripts/admin/revoke-key.ts, scripts/admin/rotate-key.ts (default --grace 7d).
scripts/admin/add-grant.ts — admin grants add --client <id> --phone <phone_id> --tools "send_message,get_messages" [--daily-cap 500].
scripts/admin/list-grants.ts, scripts/admin/revoke-grant.ts.
scripts/admin/show-audit.ts — admin audit --client <id> --since "24h" for ad-hoc forensics.
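
As a rough illustration of the key-minting steps listed under mint-key.ts, the sketch below generates a token, derives the stored prefix, and computes the peppered HMAC. The token layout wamcp_<env>_<prefix>_<secret>, the base32 alphabet, and the names mintKey and randomSymbols are assumptions made for the example; only the wamcp_live_ stem, the 12-char prefix, the 140-bit entropy target and HMAC-SHA256(pepper, token) come from the plan.

```typescript
import { createHmac, randomBytes } from "node:crypto";

// Assumption: a 32-symbol alphabet, so each symbol carries exactly 5 bits.
const ALPHABET = "0123456789abcdefghjkmnpqrstvwxyz";

interface MintedKey {
  token: string;  // printed once to the operator, never stored
  prefix: string; // stored in plaintext for the indexed lookup
  hash: string;   // hex HMAC-SHA256(pepper, token), stored
}

// Masking a uniformly random byte to its low 5 bits keeps the symbol uniform.
function randomSymbols(n: number): string {
  const bytes = randomBytes(n);
  let out = "";
  for (let i = 0; i < n; i++) out += ALPHABET[bytes.readUInt8(i) & 31];
  return out;
}

function mintKey(pepper: Buffer, env: "live" | "test" = "live"): MintedKey {
  const prefix = randomSymbols(12); // public lookup handle (hypothetical layout)
  const secret = randomSymbols(28); // 28 symbols x 5 bits = 140 bits
  const token = `wamcp_${env}_${prefix}_${secret}`;
  const hash = createHmac("sha256", pepper).update(token).digest("hex");
  return { token, prefix, hash };
}
```

Keeping the prefix separate from the secret segment means the plaintext column reveals nothing about the 140 secret bits; if the real format takes the prefix from the secret itself, the effective entropy against a database leak is lower.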

Auth modules

src/auth/api-key.ts:
- parseBearer(header) → { env, prefix, secret } or throws.
- lookupKey(prefix) → DB row or null (single indexed query).
- verifyKeyHash(secret, hash, pepper) → constant-time HMAC compare.
- loadClientContext(keyRow) → { clientId, apiKeyId, scopes, rpmLimit, dailyMsgLimit, allowedPhoneNumberIds } (allowed phones come from joining client_phone_grants).
- Express middleware composing the above. Fails closed on any error → 401 + auth_failed audit row.
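
A minimal sketch of the parse and verify steps above, under stated assumptions: the header layout Bearer wamcp_<env>_<prefix>_<secret>, the test environment, and the AuthFailedError class are hypothetical, and the returned secret is taken to be the full token because that is what mint-key hashes. Whatever string is hashed at mint time must be the same string verified here.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

interface ParsedKey {
  env: "live" | "test";
  prefix: string;
  secret: string; // assumption: the full token, i.e. the HMAC input used at mint time
}

// Hypothetical error class; the plan only says the middleware answers 401.
class AuthFailedError extends Error {
  readonly httpStatus = 401;
}

// Assumed layout: "Bearer wamcp_<env>_<12-char prefix>_<secret>".
const KEY_RE = /^Bearer (wamcp_(live|test)_([0-9a-z]{12})_([0-9a-z]{20,}))$/;

function parseBearer(header: string | undefined): ParsedKey {
  const m = header ? KEY_RE.exec(header) : null;
  if (!m) throw new AuthFailedError("malformed or missing bearer token");
  return { env: m[2] as "live" | "test", prefix: m[3], secret: m[1] };
}

function verifyKeyHash(secret: string, storedHashHex: string, pepper: Buffer): boolean {
  const expected = createHmac("sha256", pepper).update(secret).digest();
  const stored = Buffer.from(storedHashHex, "hex");
  // timingSafeEqual throws on unequal lengths, so reject those up front.
  if (stored.length !== expected.length) return false;
  return timingSafeEqual(expected, stored);
}
```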
src/auth/context.ts — AsyncLocalStorage<AuthContext>; helpers getAuth(), runWithAuth().
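
The context helpers could look roughly like this. The AuthContext fields shown are a subset chosen for illustration, and throwing when no context is present is an assumption in keeping with the fail-closed rule.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Illustrative subset of the context fields; the real shape is defined by loadClientContext.
interface AuthContext {
  clientId: string;
  apiKeyId: string | null; // null for the synthetic stdio owner context
  scopes: string[];
  requestId: string;
}

const storage = new AsyncLocalStorage<AuthContext>();

// Runs fn with ctx visible to every sync and async call made inside it.
function runWithAuth<T>(ctx: AuthContext, fn: () => T): T {
  return storage.run(ctx, fn);
}

// Fails closed: code running outside runWithAuth() gets an error, not a default identity.
function getAuth(): AuthContext {
  const ctx = storage.getStore();
  if (!ctx) throw new Error("getAuth() called outside runWithAuth()");
  return ctx;
}
```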
src/auth/scopes.ts:
- requireScope(toolName) — throws ScopeDeniedError if tools:<name> and tools:* both absent. Owner check: wildcards rejected at request time if clients.is_owner = false.
- requireGrant(clientId, phoneNumberId, toolName) — looks up client_phone_grants, throws GrantDeniedError on miss / revoked / tool not in allowed_tools.
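
The two checks can be sketched as below. To stay self-contained, the example passes the scope context and an in-memory grant list as explicit arguments, whereas the planned functions read them from the auth context and the client_phone_grants table; the ScopeContext and Grant shapes are assumptions.

```typescript
class ScopeDeniedError extends Error {
  readonly httpStatus = 403;
}

class GrantDeniedError extends Error {
  readonly httpStatus = 403;
}

// Hypothetical shapes standing in for the auth context and a client_phone_grants row.
interface ScopeContext {
  scopes: string[];
  isOwner: boolean;
}

interface Grant {
  clientId: string;
  phoneNumberId: string;
  allowedTools: string[];
  revokedAt: Date | null;
}

function requireScope(ctx: ScopeContext, toolName: string): void {
  if (ctx.scopes.includes(`tools:${toolName}`)) return;
  // The wildcard is honoured only for the owner, even if a non-owner key somehow carries it.
  if (ctx.isOwner && ctx.scopes.includes("tools:*")) return;
  throw new ScopeDeniedError(`missing scope tools:${toolName}`);
}

function requireGrant(grants: Grant[], clientId: string, phoneNumberId: string, toolName: string): void {
  const grant = grants.find(
    (g) => g.clientId === clientId && g.phoneNumberId === phoneNumberId && g.revokedAt === null,
  );
  if (!grant) throw new GrantDeniedError("no active grant for this number");
  if (!grant.allowedTools.includes(toolName)) {
    throw new GrantDeniedError(`tool ${toolName} not in allowed_tools`);
  }
}
```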
src/auth/rate-limit.ts:
- enforceRpm(apiKeyId, limit) — sliding window weighted across the current + previous minute via atomic upserts in rate_limit_buckets. Returns { remaining, resetAtEpoch } on success; throws RateLimitedError with retryAfterSeconds when the limit is exceeded.
- enforceDailyCap(clientId, phoneNumberId, cap) — sums the last 24 hourly buckets; throws when the cap is reached. Called inside the send-message Inngest function (post-dequeue).
The four auth errors (authentication failure, ScopeDeniedError, GrantDeniedError, RateLimitedError) carry httpStatus (401 / 403 / 403 / 429 respectively) and a JSON-RPC error code so the MCP layer can translate uniformly.
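
The arithmetic behind the sliding-window and daily-cap checks can be sketched as pure functions, leaving out the atomic upserts. The names checkRpm and last24hTotal, the minute-aligned buckets, and the choice to report the next minute boundary as the retry time are assumptions for the example, not the planned implementation.

```typescript
const MINUTE_MS = 60_000;
const HOUR_MS = 3_600_000;

class RateLimitedError extends Error {
  readonly httpStatus = 429;
  retryAfterSeconds: number;
  constructor(retryAfterSeconds: number) {
    super("rate limited");
    this.retryAfterSeconds = retryAfterSeconds;
  }
}

// prevCount / currCount are the request counts in the previous and current
// minute buckets, before the request being checked is counted.
function checkRpm(prevCount: number, currCount: number, limit: number, nowMs: number) {
  const intoMinute = nowMs % MINUTE_MS;
  // The previous bucket is weighted by the share of it still inside the sliding window.
  const estimate = currCount + prevCount * (1 - intoMinute / MINUTE_MS);
  const nextBoundaryMs = nowMs - intoMinute + MINUTE_MS;
  if (estimate + 1 > limit) {
    // Conservative hint: by the next boundary the current bucket starts empty again.
    throw new RateLimitedError(Math.ceil((nextBoundaryMs - nowMs) / 1000));
  }
  return {
    remaining: Math.floor(limit - estimate - 1),
    resetAtEpoch: Math.floor(nextBoundaryMs / 1000),
  };
}

// hourly maps an hour index (epoch ms / HOUR_MS, floored) to a send count.
function last24hTotal(hourly: Map<number, number>, nowMs: number): number {
  const currentHour = Math.floor(nowMs / HOUR_MS);
  let total = 0;
  for (let h = currentHour - 23; h <= currentHour; h++) total += hourly.get(h) ?? 0;
  return total;
}
```

For example, with limit 10 and ten requests in the previous minute, a request half-way through the current minute sees an estimate of 5 and is allowed, while the same request at the very start of the minute sees 10 and is rejected.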

Audit logger

src/audit/logger.ts:
- audit(action, { toolName?, phoneNumberId?, wamid?, payloadHash?, errorCode?, latencyMs?, metadata? }).
- Pulls clientId, apiKeyId, requestId, ip, userAgent from AsyncLocalStorage.
- Single INSERT into audit_log. Never UPDATE/DELETE (DB role enforces).
- Writes asynchronously via a small per-process queue with periodic flush (every 500ms or 100 rows). On process shutdown, drains the queue.
- Failure to write audit (DB down) is logged at error level and the request continues — the audit gap is itself an alert signal.
payloadHash = SHA-256 of the canonicalised JSON input. Canonicalisation uses JCS (RFC 8785) via the canonicalize npm package — deterministic key ordering and number formatting, no ambiguity. Never store plaintext bodies in audit_log — bodies live only in messages.body.
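
One way the coalesced asynchronous write path described above could be structured is sketched here. The AuditQueue class, the AuditRow fields, and the injected sink function (standing in for the single batched INSERT) are hypothetical; the 500 ms and 100-row thresholds and the log-and-continue failure handling follow the bullets above.

```typescript
interface AuditRow {
  action: string;
  clientId: string;
  ts: number;
}

// Stand-in for the batched INSERT into audit_log.
type Sink = (rows: AuditRow[]) => Promise<void>;

class AuditQueue {
  private rows: AuditRow[] = [];
  private timer: ReturnType<typeof setInterval>;
  private sink: Sink;
  private maxRows: number;

  constructor(sink: Sink, flushMs = 500, maxRows = 100) {
    this.sink = sink;
    this.maxRows = maxRows;
    this.timer = setInterval(() => { void this.flush(); }, flushMs);
  }

  // Never blocks the request path; flushes early once the batch is full.
  enqueue(row: AuditRow): void {
    this.rows.push(row);
    if (this.rows.length >= this.maxRows) void this.flush();
  }

  async flush(): Promise<void> {
    if (this.rows.length === 0) return;
    const batch = this.rows;
    this.rows = []; // swap first so rows enqueued during the write go to the next batch
    try {
      await this.sink(batch);
    } catch (err) {
      // Per the plan: log at error level and let requests continue.
      console.error("audit flush failed", err);
    }
  }

  // Called on process shutdown to drain whatever is still queued.
  async shutdown(): Promise<void> {
    clearInterval(this.timer);
    await this.flush();
  }
}
```

In this sketch a failed batch is dropped after logging; whether to retry it is left open by the plan.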

Webhook → tenant resolution

src/webhook/meta.ts:
- Look up phone_numbers by wa_phone_number_id.
- For each client_phone_grants row joining that number, derive a derivedEventId namespaced on client_id so a number serving multiple clients produces one Inngest event per client. (v1 still has one client per number, but the code path is fan-out-ready.)
process-message (Phase 3) gets clientId in its event data and uses it everywhere downstream.
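
The fan-out can be pictured as follows. The event name, the TenantEvent shape, and the clientId:wamid format for the derived id are assumptions; the plan only requires that the id be namespaced on client_id so that each client gets its own event.

```typescript
// Hypothetical shapes for a client_phone_grants row and an outgoing event.
interface PhoneGrant {
  clientId: string;
  phoneNumberId: string;
  revokedAt: Date | null;
}

interface TenantEvent {
  id: string; // derived event id, unique per (client, message)
  name: string;
  data: { clientId: string; phoneNumberId: string; wamid: string };
}

function fanOutEvents(wamid: string, phoneNumberId: string, grants: PhoneGrant[]): TenantEvent[] {
  return grants
    .filter((g) => g.phoneNumberId === phoneNumberId && g.revokedAt === null)
    .map((g) => ({
      id: `${g.clientId}:${wamid}`,
      name: "whatsapp/message.received", // assumed event name
      data: { clientId: g.clientId, phoneNumberId, wamid },
    }));
}
```

Skipping revoked grants is also an assumption of the sketch. With one client per number, as in v1, the function returns a single event.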

MCP transport plumbing for auth

Even though Streamable HTTP lands in Phase 5, the dispatcher (the bit between transport and tool registry) is refactored here:
- Every tool invocation runs through runWithAuth(ctx, async () => ...) with ctx either from the auth middleware (HTTP) or the synthetic owner context (stdio).
- Every tool handler is wrapped at registry-load time with wrapToolHandler(handler, scope) which: parses input → requireScope → requireGrant (if input includes phoneNumberId) → calls handler → emits audit row → returns result.
- Rate-limit / scope / grant errors are translated to MCP JSON-RPC errors with code -32001 (forbidden) or -32004 (rate limited).
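
A sketch of the wrapper and the error translation, under stated assumptions: the Guards interface is a hypothetical seam that lets the example run without the real auth and audit modules (the planned signature is wrapToolHandler(handler, scope)), DeniedError stands in for the scope, grant and rate-limit error classes, the tool_error action is invented, and input parsing is omitted. The fallback code -32603 is the standard JSON-RPC internal error.

```typescript
type Handler<I, O> = (input: I) => Promise<O>;

// Hypothetical stand-in for ScopeDeniedError / GrantDeniedError / RateLimitedError.
class DeniedError extends Error {
  auditAction: string;
  rpcCode: number;
  constructor(auditAction: string, rpcCode: number) {
    super(auditAction);
    this.auditAction = auditAction;
    this.rpcCode = rpcCode;
  }
}

interface Guards {
  requireScope(toolName: string): void;
  requireGrant(phoneNumberId: string, toolName: string): void;
  audit(action: string, fields: { toolName: string; latencyMs: number; errorCode?: number }): void;
}

function toJsonRpcError(err: unknown): { code: number; message: string } {
  if (err instanceof DeniedError) return { code: err.rpcCode, message: err.message };
  return { code: -32603, message: "internal error" };
}

function wrapToolHandler<I extends { phoneNumberId?: string }, O>(
  handler: Handler<I, O>,
  toolName: string,
  guards: Guards,
): Handler<I, O> {
  return async (input: I): Promise<O> => {
    const started = Date.now();
    try {
      guards.requireScope(toolName);
      if (input.phoneNumberId !== undefined) guards.requireGrant(input.phoneNumberId, toolName);
      const result = await handler(input);
      guards.audit("tool_called", { toolName, latencyMs: Date.now() - started });
      return result;
    } catch (err) {
      // Denials are audited under their own action; anything else as a generic failure.
      const action = err instanceof DeniedError ? err.auditAction : "tool_error";
      guards.audit(action, { toolName, latencyMs: Date.now() - started, errorCode: toJsonRpcError(err).code });
      throw err;
    }
  };
}
```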

Retention (built now, used later)

src/inngest/functions/archive-audit.ts — daily cron. Selects audit_log rows older than 365 days, writes them to /var/lib/whatsapp-mcp/audit-archive/<yyyy-mm>.jsonl.zst, then DELETEs them via the wa_audit_archiver role. (The archive directory is configured per environment; the check that it exists is skipped in test runs.)
src/inngest/functions/prune-messages.ts — daily cron. Nullifies body and raw on messages rows older than 90 days (later configurable per client via a clients.retention_days column; for now a global config value).

Docs (extended)

docs/architecture/auth.md — full auth pipeline, key format, scope model, two-layer authz, rotation flow, stdio short-circuit.
docs/architecture/audit.md — what we log, what we never log, retention policy, archival flow, investigation queries.
docs/architecture/rate-limiting.md — RPM sliding window math, daily cap math, where each check runs, 429 contract.
docs/architecture/database.md — extended with audit_log, rate_limit_buckets, DB role split.
docs/components/admin-cli.md — every admin subcommand with usage examples.
docs/operations/client-onboarding.md — promoted from docs/plan/ops/ stub; create client → mint key → grant numbers → hand over → revoke flow.
docs/api/errors.md — extended with auth / scope / grant / rate-limit error codes.

Critical files

src/auth/{api-key,context,scopes,rate-limit}.ts
src/audit/logger.ts
src/db/scoped.ts — the only module allowed to query messages / contacts / media_objects; mandatory clientId first argument
src/server/mcp.ts — dispatcher wraps every tool with wrapToolHandler
scripts/admin/*.ts
drizzle/0003_multitenant.sql, drizzle/0004_audit_ratelimit.sql

Tests

Unit

tests/unit/auth/api-key-parse.test.ts — wamcp_live_... accepted; Basic ... rejected; malformed format rejected; missing scheme rejected.
tests/unit/auth/key-hash.test.ts — HMAC matches; pepper mismatch fails; timing-safe compare (constant-time over equal-length inputs).
tests/unit/auth/scopes.test.ts — exact match; wildcard for owner; wildcard rejected for non-owner; missing scope throws.
tests/unit/auth/rate-limit-math.test.ts — sliding window calculation across minute boundaries.

Integration (testcontainers Postgres)

tests/integration/admin/keys.test.ts — mint → list → use → revoke → use-after-revoke → 401; rotate → both work in grace window → old revoked at expiry.
tests/integration/admin/grants.test.ts — add → use → revoke → use → grant_denied.
tests/integration/auth/middleware.test.ts:
- Valid bearer → tool succeeds, tool_called + result-action audit rows.
- Wrong key → 401, auth_failed audit row.
- Revoked key → 401.
- Expired key → 401.
- Disabled client → 401.
tests/integration/auth/scope.test.ts:
- Key with tools:send_message but not tools:get_messages → get_messages call returns scope_denied error, scope_denied audit row.
tests/integration/auth/grant.test.ts:
- Two phones, grant for only one → call with the wrong phoneNumberId → grant_denied.
tests/integration/auth/rate-limit.test.ts:
- Key with rpm=5 → 6th call in a minute returns RateLimitedError (HTTP 429 / JSON-RPC -32004).
- Daily cap 3 → 4th send fails with daily-cap error inside the Inngest function; the message row goes status='failed'.
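- Sliding window (rpm=10, clock pinned so that 0s is a minute boundary): 5 calls at 0s and 5 calls at 30s → an 11th at 31s is rejected because both batches are still in the current window; a call at 90s succeeds because the previous minute is then weighted at 0.5 (estimated count 5); verify weighted math.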
tests/integration/auth/stdio-owner.test.ts — stdio mode produces audit rows with api_key_id = null and metadata.transport = 'stdio'.
tests/integration/cross-tenant/isolation.test.ts:
- Create clients A and B; grant A and B different numbers.
- With A’s key: get_messages on B’s number → grant_denied.
- With A’s key: send_message to a number not granted → grant_denied.
- Direct DB query via the scoped helper with A’s clientId cannot read B’s rows.
- This test file is the canonical isolation regression — runs on every PR.
tests/integration/audit/log.test.ts:
- Every tool call produces exactly one row per audited action (e.g. tool_called plus the result action), each with the right action, payload_hash, latency_ms.
- audit_log rows for the same request share a request_id.
- Attempting UPDATE audit_log via the wa_app role fails (DB-level check).

Coverage

src/auth/ ≥ 95 %.
src/audit/ ≥ 95 %.
src/db/scoped.ts ≥ 95 %.
Phase total ≥ 80 %.

Code documentation

TSDoc on every exported symbol in src/auth/ and src/audit/. @remarks mandatory on every auth check covering: failure mode (fail-closed), what gets audited, what gets returned to the client, and the security invariant being enforced.
File-level headers on every new file.
docs/architecture/{auth,audit,rate-limiting,database}.md written/extended.
docs/components/admin-cli.md complete.
docs/operations/client-onboarding.md complete.
docs/api/errors.md extended.
docs/reference/ regenerated.

Acceptance

Multi-client isolation demo — create test-client, mint a key with only tools:send_message,numbers:<phone_id>, grant just send_message on that phone:
- With the key: send_message works.
- With the key: get_messages returns scope_denied.
- With the key: send_message to a different phoneNumberId returns grant_denied.
Rate limit demo — key with rpm=5: 6 calls in 30 s → 6th returns 429 with Retry-After.
Daily cap demo — grant with daily_message_cap=3: 4th send in 24h → message row status='failed' with daily-cap error; first 3 succeed.
Audit completeness — SELECT action, count(*) FROM audit_log WHERE client_id = '<test>' AND ts > now() - '1 hour'::interval GROUP BY 1 shows every action including scope_denied, grant_denied, rate_limited, tool_called, send_attempt, send_success/send_failed, key_used.
Append-only enforcement — psql as wa_app running UPDATE audit_log SET ts = now() fails with a permission error.
Cross-tenant isolation suite — tests/integration/cross-tenant/isolation.test.ts green.
pnpm test:ci green; coverage gates met.

Notes

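The pepper at /run/secrets/api_key_pepper is 32 random bytes, base64-encoded. Pepper rotation procedure documented in ops/incident-runbook.md: because only HMACs are stored and plaintext tokens are never kept, live keys cannot be re-hashed in bulk. Each key is instead re-hashed with the new pepper on its first successful verification under the previous pepper, which is accepted for one rotation window (kept at /run/secrets/api_key_pepper.previous); keys not used within that window must be re-minted.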
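
Accepting the previous pepper during a rotation window might look like the sketch below. The function name verifyWithRotation and the Verdict shape are hypothetical, and returning a fresh hash for the caller to persist is an assumption: since the stored value is an HMAC, it can only be recomputed while the presented token is in hand.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

type Verdict = { ok: false } | { ok: true; rehash: string | null };

function hmacHex(pepper: Buffer, token: string): string {
  return createHmac("sha256", pepper).update(token).digest("hex");
}

function matches(pepper: Buffer, token: string, storedHashHex: string): boolean {
  const expected = createHmac("sha256", pepper).update(token).digest();
  const stored = Buffer.from(storedHashHex, "hex");
  return stored.length === expected.length && timingSafeEqual(expected, stored);
}

// current: contents of api_key_pepper; previous: api_key_pepper.previous, or null outside a rotation window.
function verifyWithRotation(
  token: string,
  storedHashHex: string,
  current: Buffer,
  previous: Buffer | null,
): Verdict {
  if (matches(current, token, storedHashHex)) return { ok: true, rehash: null };
  if (previous !== null && matches(previous, token, storedHashHex)) {
    // Valid under the old pepper: hand back the hash under the new one so the caller can store it.
    return { ok: true, rehash: hmacHex(current, token) };
  }
  return { ok: false };
}
```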
The admin CLI is the only management interface. There is no admin HTTP endpoint. Phase 8 introduces an owner-only portal if needed.

Definition of Done

Migrations

drizzle/0003_multitenant.sql — client_id backfilled on messages and added to contacts and media_objects; NOT NULL where required.
drizzle/0004_audit_ratelimit.sql — audit_log, rate_limit_buckets, plus DB roles wa_app + wa_audit_archiver with correct grants.

Admin CLI

clients create / list / disable / enable working.
keys mint rejects wildcard scopes for non-owners; prints full token to stderr only.
keys list / revoke / rotate working.
grants add / list / revoke working.
audit --client … --since … working.

Auth modules

src/auth/api-key.ts — parse, lookup, HMAC verify, attach context.
src/auth/context.ts — AsyncLocalStorage helpers.
src/auth/scopes.ts — requireScope, requireGrant, owner-wildcard check at request time.
src/auth/rate-limit.ts — RPM sliding window + daily cap; throws structured errors.
All four errors carry httpStatus + JSON-RPC error code.

Audit logger

src/audit/logger.ts — write-coalesced flush; payload_hash via canonicalize (RFC 8785) + SHA-256.
wa_app role has no UPDATE/DELETE on audit_log (verified via integration test).

Webhook + dispatcher

Webhook resolves phone_number_id and fans out one event per granted client.
MCP dispatcher wraps every tool with wrapToolHandler(handler, scope).
Errors translated to JSON-RPC -32001 (forbidden) or -32004 (rate limited).

Retention

archive-audit cron writes JSONL.zst then deletes via wa_audit_archiver.
prune-messages cron nullifies body/raw on rows > 90d.

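Tests

Unit suites green: api-key-parse, key-hash, scopes, rate-limit-math.
Integration suites green: admin keys + grants, auth middleware / scope / grant / rate-limit / stdio-owner, cross-tenant isolation, audit log.
Coverage gates met: src/auth/, src/audit/, src/db/scoped.ts ≥ 95 %; phase total ≥ 80 %.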

Documentation

docs/architecture/auth.md written.
docs/architecture/audit.md written.
docs/architecture/rate-limiting.md written.
docs/architecture/database.md extended (audit_log, rate_limit_buckets, role split).
docs/components/admin-cli.md complete.
docs/operations/client-onboarding.md promoted from stub.
docs/api/errors.md extended.
TSDoc @remarks on every src/auth/ and src/audit/ export.
docs/reference/ regenerated cleanly.

Acceptance verified

Multi-client isolation demo: test-client scoped key can send_message but get_messages returns scope_denied; wrong number returns grant_denied.
RPM demo: 6th call in a minute with rpm=5 → 429 + Retry-After.
Daily cap demo: 4th send with daily_message_cap=3 → row status='failed' with daily-cap error.
audit_log query shows every action variety for the test client.
psql as wa_app running UPDATE audit_log fails with permission error.

Phase signoff

Phase 4 complete. README.md status table updated to ✅.