Phase 4 — Multi-Tenant Auth & Audit
Effort: L
Goal
Every tool call, webhook event, and admin action is attributable to a client_id, gated by scope + grant checks, rate-limited, and audited. This is the security-critical phase — defence in depth from key parse through tool dispatch through DB write.
Deliverables
Migrations
- drizzle/0003_multitenant.sql:
  - Backfills messages.client_id = local-owner for existing rows; sets NOT NULL.
  - Adds client_id to contacts and media_objects (Phase 6 will use the latter).
- drizzle/0004_audit_ratelimit.sql:
  - Creates audit_log (bigserial, append-only enforced at role level; see DB role section).
  - Creates rate_limit_buckets.
  - Creates two Postgres roles: wa_app (INSERT/SELECT on audit_log, full DML on others) and wa_audit_archiver (DELETE on audit_log only). App connects as wa_app; the archive cron uses wa_audit_archiver.
Admin CLI
Run via docker compose exec app pnpm admin <subcommand> in prod; or pnpm admin ... in dev.
- scripts/admin/create-client.ts: admin clients create --name <slug> --display-name "..." [--owner]. Refuses to create a second is_owner client.
- scripts/admin/list-clients.ts: admin clients list.
- scripts/admin/disable-client.ts / admin clients enable <id>.
- scripts/admin/mint-key.ts: admin keys mint --client <id> --label "..." --scopes "tools:send_message,numbers:<phone_id>" [--expires 90d] [--rpm 60] [--daily 250]:
  - Generates a wamcp_live_... key (140 bits entropy).
  - Computes HMAC-SHA256(pepper, token), stores the hash + 12-char prefix.
  - Validates scopes against the clients.is_owner flag (rejects wildcard scopes for non-owners).
  - Prints the full token to stderr with a one-line warning that this is the only time it will be shown. Stdout gets the key id + prefix for piping.
- scripts/admin/list-keys.ts, scripts/admin/revoke-key.ts, scripts/admin/rotate-key.ts (default --grace 7d).
- scripts/admin/add-grant.ts: admin grants add --client <id> --phone <phone_id> --tools "send_message,get_messages" [--daily-cap 500].
- scripts/admin/list-grants.ts, scripts/admin/revoke-grant.ts.
- scripts/admin/show-audit.ts: admin audit --client <id> --since "24h" for ad-hoc forensics.
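The key-minting steps above can be sketched roughly as follows. This is a minimal illustration, not the actual mint-key.ts: the base32 alphabet, the 28-character length (28 x 5 bits = 140 bits), the choice to take the stored prefix from the random part, and the names mintKey and MintedKey are all assumptions made for the example.

```typescript
import { createHmac, randomInt } from "node:crypto";

// Assumption: a 32-symbol alphabet, so every character carries exactly 5 bits.
const ALPHABET = "abcdefghijklmnopqrstuvwxyz234567";
const RANDOM_CHARS = 28; // 28 chars x 5 bits = 140 bits of entropy
const PREFIX_CHARS = 12; // stored in clear so lookup can use an index

interface MintedKey {
  token: string; // full key, shown to the operator exactly once
  prefix: string; // persisted next to the hash
  hash: string; // hex HMAC-SHA256(pepper, token); the only secret-derived value stored
}

function mintKey(pepper: Buffer): MintedKey {
  let random = "";
  for (let i = 0; i < RANDOM_CHARS; i++) {
    // randomInt draws from the CSPRNG and is unbiased over [0, 32).
    random += ALPHABET.charAt(randomInt(ALPHABET.length));
  }
  const token = `wamcp_live_${random}`;
  const hash = createHmac("sha256", pepper).update(token).digest("hex");
  return { token, prefix: random.slice(0, PREFIX_CHARS), hash };
}
```

In a CLI built on this sketch, token would go to stderr and the key id plus prefix to stdout, as described above.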
Auth modules
- src/auth/api-key.ts:
  - parseBearer(header) → { env, prefix, secret } or throws.
  - lookupKey(prefix) → DB row or null (single indexed query).
  - verifyKeyHash(secret, hash, pepper) → constant-time HMAC compare.
  - loadClientContext(keyRow) → { clientId, apiKeyId, scopes, rpmLimit, dailyMsgLimit, allowedPhoneNumberIds } (allowed phones come from joining client_phone_grants).
  - Express middleware composing the above. Fails closed on any error → 401 + auth_failed audit row.
- src/auth/context.ts: AsyncLocalStorage<AuthContext>; helpers getAuth(), runWithAuth().
- src/auth/scopes.ts:
  - requireScope(toolName): throws ScopeDeniedError if tools:<name> and tools:* are both absent. Owner check: wildcards are rejected at request time if clients.is_owner = false.
  - requireGrant(clientId, phoneNumberId, toolName): looks up client_phone_grants, throws GrantDeniedError on miss / revoked / tool not in allowed_tools.
- src/auth/rate-limit.ts:
  - enforceRpm(apiKeyId, limit): sliding window weighted across the current + previous minute via atomic upserts in rate_limit_buckets. Returns { remaining, resetAtEpoch } on success, throws RateLimitedError with retryAfterSeconds when the limit is exceeded.
  - enforceDailyCap(clientId, phoneNumberId, cap): sums the last 24 hourly buckets; throws when the cap is reached. Called inside the send-message Inngest function (post-dequeue).
- All four errors (auth failure, scope denied, grant denied, rate limited) carry httpStatus (401 / 403 / 403 / 429) and a JSON-RPC error code so the MCP layer can translate uniformly.
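The weighted sliding-window and daily-cap arithmetic can be illustrated with two pure functions. This is a sketch under assumptions: the names estimateRpm and dailyCapReached are hypothetical, the counts are taken as already read from rate_limit_buckets, and the "reject when the weighted estimate reaches the limit" rule is one common reading of the approach rather than the confirmed implementation.

```typescript
const MINUTE_MS = 60_000;

interface RpmEstimate {
  allowed: boolean;
  weighted: number; // estimated requests in the trailing 60 s
}

// prevCount / currCount: requests already recorded in the previous and the
// current minute bucket (before counting the request being checked).
function estimateRpm(prevCount: number, currCount: number, nowMs: number, limit: number): RpmEstimate {
  // Fraction of the current minute that has elapsed, in [0, 1).
  const elapsedFraction = (nowMs % MINUTE_MS) / MINUTE_MS;
  // The previous bucket contributes only the part still inside the window.
  const weighted = prevCount * (1 - elapsedFraction) + currCount;
  return { allowed: weighted < limit, weighted };
}

// hourlyCounts: per-hour send counts, oldest first; only the last 24 matter.
function dailyCapReached(hourlyCounts: readonly number[], cap: number): boolean {
  const total = hourlyCounts.slice(-24).reduce((sum, n) => sum + n, 0);
  return total >= cap;
}
```

For example, with a limit of 5, a previous bucket of 5 and an empty current bucket halfway through the minute gives a weighted estimate of 2.5, so the call is allowed; 5 calls already in the current bucket give 5, so the 6th is rejected.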
Audit logger
- src/audit/logger.ts: audit(action, { toolName?, phoneNumberId?, wamid?, payloadHash?, errorCode?, latencyMs?, metadata? }).
  - Pulls clientId, apiKeyId, requestId, ip, userAgent from AsyncLocalStorage.
  - Single INSERT into audit_log. Never UPDATE/DELETE (DB role enforces).
  - Writes asynchronously via a small per-process queue with periodic flush (every 500ms or 100 rows). On process shutdown, drains the queue.
  - Failure to write audit (DB down) is logged at error level and the request continues; the audit gap is itself an alert signal.
  - payloadHash = SHA-256 of the canonicalised JSON input. Canonicalisation uses JCS (RFC 8785) via the canonicalize npm package: deterministic key ordering and number formatting, no ambiguity. Never store plaintext bodies in audit_log; bodies live only in messages.body.
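To show why canonicalisation matters for payloadHash, here is a small self-contained sketch. It deliberately does not use the canonicalize package named above; canonicalJson is a simplified stand-in (sorted keys, JSON.stringify for primitives) written only for illustration, and both function names are hypothetical.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Simplified stand-in for an RFC 8785 canonicaliser: object keys are emitted
// in sorted order and primitives use JSON.stringify's formatting.
function canonicalJson(value: Json): string {
  if (value === null || typeof value !== "object") {
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    return `[${value.map((item) => canonicalJson(item)).join(",")}]`;
  }
  const members = Object.keys(value)
    .sort() // default sort compares UTF-16 code units
    .map((key) => `${JSON.stringify(key)}:${canonicalJson(value[key] as Json)}`);
  return `{${members.join(",")}}`;
}

// Hex SHA-256 of the canonical form; the plaintext input is never stored.
function payloadHash(input: Json): string {
  return createHash("sha256").update(canonicalJson(input), "utf8").digest("hex");
}
```

Two inputs that differ only in key order, such as { to: "x", body: "hi" } and { body: "hi", to: "x" }, produce the same hash, which is what makes the audit value comparable across requests.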
Webhook → tenant resolution
- src/webhook/meta.ts:
  - Look up phone_numbers by wa_phone_number_id.
  - For each client_phone_grants row joining that number, derive a derivedEventId namespaced on client_id so a number serving multiple clients produces one Inngest event per client. (v1 still has one client per number, but the code path is fan-out-ready.)
- process-message (Phase 3) gets clientId in its event data and uses it everywhere downstream.
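The fan-out described above might look something like the sketch below. The "<clientId>:<wamid>" id format, the event name, the revoked flag on the grant, and the fanOutEvents helper are assumptions for illustration; only the idea of one event per granted client, each carrying clientId, comes from the plan.

```typescript
interface PhoneGrant {
  clientId: string;
  revoked: boolean; // assumption: revoked grants receive no events
}

interface InboundEvent {
  id: string; // derivedEventId, namespaced on the client
  name: string;
  data: { clientId: string; phoneNumberId: string; wamid: string };
}

// Builds one event per active grant on the receiving number.
function fanOutEvents(phoneNumberId: string, wamid: string, grants: readonly PhoneGrant[]): InboundEvent[] {
  return grants
    .filter((grant) => !grant.revoked)
    .map((grant) => ({
      id: `${grant.clientId}:${wamid}`, // hypothetical format
      name: "whatsapp/message.received", // hypothetical event name
      data: { clientId: grant.clientId, phoneNumberId, wamid },
    }));
}
```

Namespacing the id on the client keeps the events for two tenants distinct even though they originate from the same webhook delivery.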
MCP transport plumbing for auth
- Even though Streamable HTTP lands in Phase 5, the dispatcher (the bit between transport and tool registry) is refactored here:
  - Every tool invocation runs through runWithAuth(ctx, async () => ...) with ctx either from the auth middleware (HTTP) or the synthetic owner context (stdio).
  - Every tool handler is wrapped at registry-load time with wrapToolHandler(handler, scope) which: parses input → requireScope → requireGrant (if input includes phoneNumberId) → calls handler → emits audit row → returns result.
  - Rate-limit / scope / grant errors are translated to MCP JSON-RPC errors with code -32001 (forbidden) or -32004 (rate limited).
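A reduced sketch of the dispatcher plumbing follows, covering only the auth context, the scope check, and error translation; input parsing, requireGrant and audit emission are omitted. The AuthContext fields, the AuthzError base class, and the fallback to the standard JSON-RPC internal-error code -32603 are assumptions of this example.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

interface AuthContext { clientId: string; apiKeyId: string | null; isOwner: boolean; scopes: string[] }

class AuthzError extends Error {
  readonly httpStatus: number;
  readonly rpcCode: number;
  constructor(message: string, httpStatus: number, rpcCode: number) {
    super(message);
    this.httpStatus = httpStatus;
    this.rpcCode = rpcCode;
  }
}
class ScopeDeniedError extends AuthzError {
  constructor(scope: string) { super(`missing scope ${scope}`, 403, -32001); }
}
class RateLimitedError extends AuthzError {
  readonly retryAfterSeconds: number;
  constructor(retryAfterSeconds: number) {
    super("rate limited", 429, -32004);
    this.retryAfterSeconds = retryAfterSeconds;
  }
}

const authStore = new AsyncLocalStorage<AuthContext>();

function runWithAuth<T>(ctx: AuthContext, fn: () => T): T {
  return authStore.run(ctx, fn);
}

function getAuth(): AuthContext {
  const ctx = authStore.getStore();
  if (!ctx) throw new Error("no auth context"); // fail closed outside runWithAuth
  return ctx;
}

function requireScope(toolName: string): void {
  const { scopes, isOwner } = getAuth();
  const exact = scopes.includes(`tools:${toolName}`);
  const wildcard = isOwner && scopes.includes("tools:*"); // wildcard honoured for owners only
  if (!exact && !wildcard) throw new ScopeDeniedError(`tools:${toolName}`);
}

function toJsonRpcError(err: unknown): { code: number; message: string } {
  if (err instanceof AuthzError) return { code: err.rpcCode, message: err.message };
  return { code: -32603, message: "internal error" }; // do not leak unexpected details
}

function wrapToolHandler<I, O>(handler: (input: I) => Promise<O>, scope: string) {
  return async (input: I): Promise<O | { error: { code: number; message: string } }> => {
    try {
      requireScope(scope);
      return await handler(input);
    } catch (err) {
      return { error: toJsonRpcError(err) };
    }
  };
}
```

Because the context lives in AsyncLocalStorage, requireScope needs no explicit ctx argument, so a handler cannot be invoked with the wrong tenant's context by mistake.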
Retention (built now, used later)
- src/inngest/functions/archive-audit.ts: daily cron. Selects audit_log rows > 365d, writes them to /var/lib/whatsapp-mcp/audit-archive/<yyyy-mm>.jsonl.zst, then DELETE via wa_audit_archiver role. (The archive directory check is skipped in test runs; configured per env.)
- src/inngest/functions/prune-messages.ts: daily cron. Nullifies body and raw on messages rows > 90d (configurable per client later in a clients.retention_days column; for now uses a global config).
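The cutoff and file-naming arithmetic behind the archive cron could be sketched as below. The helper names and the choice of UTC for the <yyyy-mm> bucket are assumptions; compression and the DELETE step are out of scope here.

```typescript
const ARCHIVE_DIR = "/var/lib/whatsapp-mcp/audit-archive";
const DAY_MS = 86_400_000;

// Rows with ts strictly before this instant are eligible for archival.
function archiveCutoff(now: Date, retentionDays = 365): Date {
  return new Date(now.getTime() - retentionDays * DAY_MS);
}

// Maps a row timestamp to its monthly archive file (UTC month, an assumption).
function archivePath(ts: Date): string {
  const yyyy = ts.getUTCFullYear();
  const mm = String(ts.getUTCMonth() + 1).padStart(2, "0");
  return `${ARCHIVE_DIR}/${yyyy}-${mm}.jsonl.zst`;
}

// Groups eligible rows by destination file so each file is written once.
function groupByArchiveFile<T extends { ts: Date }>(rows: readonly T[], cutoff: Date): Map<string, T[]> {
  const groups = new Map<string, T[]>();
  for (const row of rows) {
    if (row.ts.getTime() >= cutoff.getTime()) continue; // still inside retention
    const path = archivePath(row.ts);
    const bucket = groups.get(path) ?? [];
    bucket.push(row);
    groups.set(path, bucket);
  }
  return groups;
}
```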
Docs (extended)
- docs/architecture/auth.md: full auth pipeline, key format, scope model, two-layer authz, rotation flow, stdio short-circuit.
- docs/architecture/audit.md: what we log, what we never log, retention policy, archival flow, investigation queries.
- docs/architecture/rate-limiting.md: RPM sliding window math, daily cap math, where each check runs, 429 contract.
- docs/architecture/database.md: extended with audit_log, rate_limit_buckets, DB role split.
- docs/components/admin-cli.md: every admin subcommand with usage examples.
- docs/operations/client-onboarding.md: promoted from docs/plan/ops/ stub; create client → mint key → grant numbers → hand over → revoke flow.
- docs/api/errors.md: extended with auth / scope / grant / rate-limit error codes.
Critical files
- src/auth/{api-key,context,scopes,rate-limit}.ts
- src/audit/logger.ts
- src/db/scoped.ts: the only module allowed to query messages/contacts/media_objects; mandatory clientId first argument
- src/server/mcp.ts: dispatcher wraps every tool with wrapToolHandler
- scripts/admin/*.ts
- drizzle/0003_multitenant.sql, drizzle/0004_audit_ratelimit.sql
Tests
Unit
- tests/unit/auth/api-key-parse.test.ts: wamcp_live_... accepted; Basic ... rejected; malformed format rejected; missing scheme rejected.
- tests/unit/auth/key-hash.test.ts: HMAC matches; pepper mismatch fails; timing-safe compare (constant-time over equal-length inputs).
- tests/unit/auth/scopes.test.ts: exact match; wildcard for owner; wildcard rejected for non-owner; missing scope throws.
- tests/unit/auth/rate-limit-math.test.ts: sliding window calculation across minute boundaries.
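The behaviour these unit tests target can be illustrated with a compact sketch of parsing and hash verification. The token layout in BEARER_RE (a "live" or "test" env, 28 base32 characters, prefix taken from the first 12), and the decision to HMAC the full token, are assumptions; the plan only fixes the wamcp_live_ shape, the 12-char prefix, and the constant-time compare.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

interface ParsedKey { env: string; prefix: string; secret: string }

// Assumed layout: "Bearer wamcp_<env>_<12-char prefix><16 more chars>".
const BEARER_RE = /^Bearer (wamcp_(live|test)_([a-z2-7]{12})[a-z2-7]{16})$/;

function parseBearer(header: string | undefined): ParsedKey {
  const match = header ? BEARER_RE.exec(header) : null;
  if (!match) throw new Error("malformed authorization header"); // fail closed
  return { secret: match[1] as string, env: match[2] as string, prefix: match[3] as string };
}

function verifyKeyHash(secret: string, storedHashHex: string, pepper: Buffer): boolean {
  const computed = createHmac("sha256", pepper).update(secret).digest();
  const stored = Buffer.from(storedHashHex, "hex");
  // timingSafeEqual throws on unequal lengths, so compare lengths first;
  // the byte comparison itself is constant-time over equal-length inputs.
  return stored.length === computed.length && timingSafeEqual(stored, computed);
}
```

Rejecting on length before calling timingSafeEqual leaks nothing useful, since a valid stored hash always has the fixed SHA-256 length.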
Integration (testcontainers Postgres)
- tests/integration/admin/keys.test.ts: mint → list → use → revoke → use-after-revoke → 401; rotate → both work in grace window → old revoked at expiry.
- tests/integration/admin/grants.test.ts: add → use → revoke → use → grant_denied.
- tests/integration/auth/middleware.test.ts:
  - Valid bearer → tool succeeds, tool_called + result-action audit rows.
  - Wrong key → 401, auth_failed audit row.
  - Revoked key → 401.
  - Expired key → 401.
  - Disabled client → 401.
- tests/integration/auth/scope.test.ts:
  - Key with tools:send_message but not tools:get_messages → get_messages call returns scope_denied error, scope_denied audit row.
- tests/integration/auth/grant.test.ts:
  - Two phones, grant for only one → call with the wrong phoneNumberId → grant_denied.
- tests/integration/auth/rate-limit.test.ts:
  - Key with rpm=5 → 6th call in a minute returns RateLimitedError (HTTP 429 / JSON-RPC -32004).
  - Daily cap 3 → 4th send fails with daily-cap error inside the Inngest function; the message row goes status='failed'.
  - Sliding window: 5 calls at 0s and 5 calls at 30s → the 11th at 31s succeeds because the first window has rolled off; verify weighted math.
- tests/integration/auth/stdio-owner.test.ts: stdio mode produces audit rows with api_key_id = null and metadata.transport = 'stdio'.
- tests/integration/cross-tenant/isolation.test.ts:
  - Create clients A and B; grant A and B different numbers.
  - With A’s key: get_messages on B’s number → grant_denied.
  - With A’s key: send_message to a number not granted → grant_denied.
  - Direct DB query via the scoped helper with A’s clientId cannot read B’s rows.
  - This test file is the canonical isolation regression; it runs on every PR.
- tests/integration/audit/log.test.ts:
  - Every tool call produces exactly one row with the right action, payload_hash, latency_ms.
  - audit_log rows for the same request share a request_id.
  - Attempting UPDATE audit_log via the wa_app role fails (DB-level check).
Coverage
- src/auth/ ≥ 95%.
- src/audit/ ≥ 95%.
- src/db/scoped.ts ≥ 95%.
- Phase total ≥ 80%.
Code documentation
- TSDoc on every exported symbol in src/auth/ and src/audit/. @remarks mandatory on every auth check covering: failure mode (fail-closed), what gets audited, what gets returned to the client, and the security invariant being enforced.
- File-level headers on every new file.
- docs/architecture/{auth,audit,rate-limiting,database}.md written/extended.
- docs/components/admin-cli.md complete.
- docs/operations/client-onboarding.md complete.
- docs/api/errors.md extended.
- docs/reference/ regenerated.
Acceptance
- Multi-client isolation demo: create test-client, mint a key with only tools:send_message,numbers:<phone_id>, grant just send_message on that phone:
  - With the key: send_message works.
  - With the key: get_messages returns scope_denied.
  - With the key: send_message to a different phoneNumberId returns grant_denied.
- Rate limit demo: key with rpm=5: 6 calls in 30 s → 6th returns 429 with Retry-After.
- Daily cap demo: grant with daily_message_cap=3: 4th send in 24h → message row status='failed' with daily-cap error; first 3 succeed.
- Audit completeness: SELECT action, count(*) FROM audit_log WHERE client_id = '<test>' AND ts > now() - '1 hour'::interval GROUP BY 1 shows every action including scope_denied, grant_denied, rate_limited, tool_called, send_attempt, send_success / send_failed, key_used.
- Append-only enforcement: psql as wa_app running UPDATE audit_log SET ts = now() fails with a permission error.
- Cross-tenant isolation suite: tests/integration/cross-tenant/isolation.test.ts green.
- pnpm test:ci green; coverage gates met.
Notes
- The pepper at /run/secrets/api_key_pepper is 32 random bytes, base64-encoded. Pepper rotation procedure documented in ops/incident-runbook.md: re-hash all live keys with the new pepper inside a single transaction; the previous pepper is accepted for one rotation window (kept at /run/secrets/api_key_pepper.previous).
- The admin CLI is the only management interface. There is no admin HTTP endpoint. Phase 8 introduces an owner-only portal if needed.
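Accepting the previous pepper during a rotation window could be sketched as follows. verifyWithRotation and its return type are hypothetical; the sketch reports which pepper matched so that a caller could, for instance, re-hash the key under the current pepper when only the previous one verified.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

type PepperMatch = "current" | "previous" | null;

function hmacMatches(token: string, storedHash: Buffer, pepper: Buffer): boolean {
  const computed = createHmac("sha256", pepper).update(token).digest();
  return computed.length === storedHash.length && timingSafeEqual(computed, storedHash);
}

// Tries the current pepper first, then the previous one if a rotation window
// is open (previous is undefined once the window has closed).
function verifyWithRotation(token: string, storedHashHex: string, current: Buffer, previous?: Buffer): PepperMatch {
  const stored = Buffer.from(storedHashHex, "hex");
  if (hmacMatches(token, stored, current)) return "current";
  if (previous && hmacMatches(token, stored, previous)) return "previous";
  return null;
}
```

Under the file layout in the note above, each pepper would be decoded with Buffer.from(text.trim(), "base64") after reading the corresponding secret file.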
Definition of Done
Migrations
- drizzle/0003_multitenant.sql: client_id backfilled on messages, contacts; not-null where required.
- drizzle/0004_audit_ratelimit.sql: audit_log, rate_limit_buckets, plus DB roles wa_app + wa_audit_archiver with correct grants.
Admin CLI
- clients create / list / disable / enable working.
- keys mint rejects wildcard scopes for non-owners; prints full token to stderr only.
- keys list / revoke / rotate working.
- grants add / list / revoke working.
- audit --client … --since … working.
Auth modules
- src/auth/api-key.ts: parse, lookup, HMAC verify, attach context.
- src/auth/context.ts: AsyncLocalStorage helpers.
- src/auth/scopes.ts: requireScope, requireGrant, owner-wildcard check at request time.
- src/auth/rate-limit.ts: RPM sliding window + daily cap; throws structured errors.
- All four errors carry httpStatus + JSON-RPC error code.
Audit logger
- src/audit/logger.ts: write-coalesced flush; payload_hash via canonicalize (RFC 8785) + SHA-256.
- wa_app role has no UPDATE/DELETE on audit_log (verified via integration test).
Webhook + dispatcher
- Webhook resolves phone_number_id and fans out one event per granted client.
- MCP dispatcher wraps every tool with wrapToolHandler(handler, scope).
- Errors translated to JSON-RPC -32001 (forbidden) or -32004 (rate limited).
Retention
- archive-audit cron writes JSONL.zst then deletes via wa_audit_archiver.
- prune-messages cron nullifies body / raw on rows > 90d.
Tests
- tests/unit/auth/api-key-parse.test.ts passes.
- tests/unit/auth/key-hash.test.ts passes (timing-safe verified).
- tests/unit/auth/scopes.test.ts passes.
- tests/unit/auth/rate-limit-math.test.ts passes (cross-boundary sliding window).
- tests/integration/admin/keys.test.ts (mint → use → revoke → rotate-grace) passes.
- tests/integration/admin/grants.test.ts passes.
- tests/integration/auth/middleware.test.ts (valid/wrong/revoked/expired/disabled) passes.
- tests/integration/auth/scope.test.ts passes.
- tests/integration/auth/grant.test.ts passes.
- tests/integration/auth/rate-limit.test.ts (rpm + daily + sliding window) passes.
- tests/integration/auth/stdio-owner.test.ts passes.
- tests/integration/cross-tenant/isolation.test.ts green (the canonical regression).
- tests/integration/audit/log.test.ts (every action + append-only role check) passes.
- Coverage: src/auth/ ≥ 95%; src/audit/ ≥ 95%; src/db/scoped.ts ≥ 95%; phase total ≥ 80%.
Documentation
- docs/architecture/auth.md written.
- docs/architecture/audit.md written.
- docs/architecture/rate-limiting.md written.
- docs/architecture/database.md extended (audit_log, rate_limit_buckets, role split).
- docs/components/admin-cli.md complete.
- docs/operations/client-onboarding.md promoted from stub.
- docs/api/errors.md extended.
- TSDoc @remarks on every src/auth/ and src/audit/ export.
- docs/reference/ regenerated cleanly.
Acceptance verified
- Multi-client isolation demo: test-client scoped key can send_message but get_messages returns scope_denied; wrong number returns grant_denied.
- RPM demo: 6th call in a minute with rpm=5 → 429 + Retry-After.
- Daily cap demo: 4th send with daily_message_cap=3 → row status='failed' with daily-cap error.
- audit_log query shows every action variety for the test client.
- psql as wa_app running UPDATE audit_log fails with permission error.
Phase signoff
- Phase 4 complete. README.md status table updated to ✅.