Architecture — Cross-Cutting Design

This document describes the parts of the system that span every phase. Read this before any phase file. Phase files describe what gets built; this file describes what gets honoured.

1. Database schema (the spine)

Unless noted otherwise, tables use id uuid PRIMARY KEY DEFAULT gen_random_uuid() (built into Postgres 13+; provided by pgcrypto on older versions) and timestamptz for time columns. FK actions: ON DELETE RESTRICT by default; explicit exceptions noted.

Core tenancy tables

clients
  id            uuid PK
  name          text unique               -- kebab-case, e.g. "internal-projectX"
  display_name  text
  is_owner      bool                      -- exactly one row may have this true
  disabled_at   timestamptz null          -- gates all auth when set
  metadata      jsonb
  created_at    timestamptz default now()
  -- partial unique (is_owner) WHERE is_owner enforces single-owner

api_keys
  id               uuid PK
  client_id        uuid FK clients(id) ON DELETE CASCADE
  label            text                   -- "claude-desktop-laptop"
  prefix           text unique            -- 12 chars, indexed lookup column
  hash             bytea                  -- 32B HMAC-SHA256(pepper, full token)
  scopes           jsonb                  -- array of scope strings, see §3
  rpm_limit        int not null default 60
  daily_msg_limit  int not null default 250
  expires_at       timestamptz null       -- hard cut-off
  revoked_at       timestamptz null       -- soft delete
  rotated_from_id  uuid null              -- self-FK; rotation lineage
  last_used_at     timestamptz null       -- updated async (batched)
  last_used_ip     inet null
  created_at       timestamptz default now()
  -- indexes: unique(prefix), partial (client_id) WHERE revoked_at IS NULL,
  --          partial (expires_at) WHERE expires_at IS NOT NULL

phone_numbers
  id                        uuid PK
  wa_phone_number_id        text unique not null   -- Meta's numeric id
  wa_business_account_id    text not null
  display_number            text                   -- E.164 with +, humans only
  label                     text
  access_token_secret_ref   text                   -- "secrets://wa_access_token_<id>"
  webhook_verify_token_ref  text                   -- per-number to limit blast radius
  disabled_at               timestamptz null
  created_at                timestamptz default now()
  -- NOTE: no app_secret column. App Secret is per Meta App, not per number.
  -- All numbers in this WABA share the App Secret read from WA_APP_SECRET
  -- (secrets://wa_app_secret). If we ever split into multiple Meta Apps,
  -- add a nullable app_secret_ref column then.

client_phone_grants          -- M:N: what each client may do with which number
  id                 uuid PK
  client_id          uuid FK clients(id)
  phone_number_id    uuid FK phone_numbers(id)
  allowed_tools      jsonb not null       -- e.g. ["send_message","get_messages"]
  daily_message_cap  int null             -- override of api_keys.daily_msg_limit
  revoked_at         timestamptz null
  created_at         timestamptz default now()
  -- unique (client_id, phone_number_id) WHERE revoked_at IS NULL

Conversation tables

contacts
  id               uuid PK
  phone_number_id  uuid FK phone_numbers(id)
  wa_id            text not null          -- E.164 with no leading +
  profile_name     text                   -- from Meta contacts[].profile.name
  display_name     text                   -- local override
  metadata         jsonb
  first_seen_at    timestamptz default now()
  last_seen_at     timestamptz default now()
  -- unique (phone_number_id, wa_id)

messages
  id               uuid PK
  phone_number_id  uuid FK phone_numbers(id)
  client_id        uuid FK clients(id)    -- nullable for inbound (number-owned)
  direction        text                   -- 'inbound' | 'outbound'
  wa_message_id    text unique            -- Meta's wamid.xxx
  contact_id       uuid FK contacts(id)
  message_type     text                   -- text|template|image|document|audio|video|sticker|reaction|interactive
  body             text                   -- plaintext for text/caption
  template_name    text null
  media_object_id  uuid FK media_objects(id) null
  status           text                   -- queued|sent|delivered|read|failed (out); received|read (in)
  error_code       int null
  reply_to_wamid   text null
  payload          jsonb                  -- normalised body (e.g. interactive selectedId)
  raw              jsonb                  -- original Meta payload (inbound), sanitised
  ts               timestamptz            -- Meta event time
  created_at       timestamptz default now()
  -- indexes: unique(wa_message_id),
  --          (phone_number_id, ts DESC),
  --          (client_id, ts DESC) WHERE direction='outbound',
  --          (phone_number_id, contact_id, ts DESC) WHERE direction='inbound'

media_objects
  id               uuid PK
  phone_number_id  uuid FK phone_numbers(id)
  direction        text                   -- inbound|outbound
  wa_media_id      text                   -- Meta media id (inbound)
  mime_type        text
  sha256           bytea
  size_bytes       bigint
  storage_path     text                   -- relative under MEDIA_ROOT, never absolute
  downloaded_at    timestamptz null       -- null for outbound
  created_at       timestamptz default now()

Operational tables

audit_log                    -- append-only; app role has INSERT,SELECT only
  id               bigserial PK
  ts               timestamptz default now()
  client_id        uuid FK clients(id) ON DELETE SET NULL null
  api_key_id       uuid FK api_keys(id) ON DELETE SET NULL null
  request_id       text                   -- propagated through Inngest
  ip               inet null
  user_agent       text null
  action           text                   -- see action vocabulary below
  tool_name        text null
  phone_number_id  uuid FK phone_numbers(id) ON DELETE SET NULL null
  wa_message_id    text null
  payload_hash     bytea null             -- SHA-256 of canonicalised input
  error_code       text null
  latency_ms       int null
  metadata         jsonb null
  -- indexes: (client_id, ts DESC), (api_key_id, ts DESC), (action, ts DESC),
  --          (wa_message_id) WHERE wa_message_id IS NOT NULL, BRIN on ts

rate_limit_buckets           -- per-window counters (see §7)
  id            uuid PK
  key           text                      -- 'rpm:<api_key_id>:<minute>' or 'daily:<client_id>:<phone>:<hour>'
  window_start  timestamptz
  count         int not null default 0
  updated_at    timestamptz default now()
  -- unique (key, window_start); partial index on stale rows for GC

inngest_idempotency          -- dedupe Meta webhook retries
  event_id       text PK                  -- wamid for inbound, our request-id for outbound
  first_seen_at  timestamptz default now()
  outcome        text                     -- processed|duplicate|failed

action vocabulary for audit_log

key_minted, key_rotated, key_revoked, key_used, auth_failed, scope_denied, grant_denied, rate_limited, tool_called, send_attempt, send_success, send_failed, webhook_received, webhook_invalid_signature, webhook_duplicate, media_uploaded, media_downloaded, grant_added, grant_revoked, client_disabled.

Row-level isolation strategy

App-enforced WHERE clauses, not Postgres RLS. Justification: we issue one DB role for the app; RLS shines when each tenant gets its own role, which we don’t do. A single missed WHERE client_id = $1 would be a tenancy bug, so we centralise all queries on tenant tables in src/db/scoped.ts, where every function takes clientId as a mandatory first argument. An ESLint rule forbids raw pool.query against tenant tables outside that module. Cross-tenant isolation tests in CI assert that client B cannot read or write client A’s data via any tool, scope, or grant path. The schema stays RLS-ready in case we change our mind.
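As a rough illustration of the scoped-query idea, the sketch below shows one function of the kind src/db/scoped.ts might contain. The `Queryable` interface, `assertClientId`, and `listOutboundMessages` are hypothetical names introduced here; the real module's API is not specified in this document. The argument order follows the document's description (clientId first), with the database handle passed second.

```typescript
// Hypothetical minimal driver interface so the sketch has no dependency on a specific client.
interface Queryable {
  query(text: string, params: unknown[]): Promise<{ rows: Record<string, unknown>[] }>;
}

const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Fail closed: refuse to build any tenant query without a well-formed client id.
function assertClientId(clientId: string): void {
  if (!UUID_RE.test(clientId)) {
    throw new Error("scoped query called without a valid clientId");
  }
}

// Every scoped function takes clientId first and always binds it as $1.
async function listOutboundMessages(
  clientId: string,
  db: Queryable,
  phoneNumberId: string,
  limit = 50,
): Promise<Record<string, unknown>[]> {
  assertClientId(clientId);
  const { rows } = await db.query(
    `SELECT id, wa_message_id, status, ts
       FROM messages
      WHERE client_id = $1
        AND phone_number_id = $2
        AND direction = 'outbound'
      ORDER BY ts DESC
      LIMIT $3`,
    [clientId, phoneNumberId, limit],
  );
  return rows;
}
```

Because the tenant predicate is written once per function rather than at each call site, the lint rule only has to police where raw queries may appear, not what they contain.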

2. Auth pipeline (request order)

Nginx (TLS, strips client X-Forwarded-* / X-Real-Ip, adds X-Request-Id)
  → Express trust proxy = 1
  → request-id middleware (CLS / AsyncLocalStorage propagation into Inngest)
  → body parser (express.raw on /webhook/*, express.json on /mcp/*)
  → authMiddleware
      parse Bearer wamcp_live_...
      lookup api_keys by 12-char prefix
      HMAC-SHA256(pepper, token) constant-time compare against hash
      attach req.auth = { clientId, apiKeyId, scopes, rpmLimit, dailyMsgLimit }
      fail-closed: any error here → 401
  → client-enabled check (clients.disabled_at IS NULL)
  → per-key RPM rate-limit check (Postgres sliding window)
  → MCP dispatcher (validates JSON-RPC envelope)
  → tool layer:
      parse input via tool's zod schema
      requireScope(toolName)                            -- key has tools:<name> or tools:*
      requireGrant(clientId, phoneNumberId, toolName)   -- client_phone_grants row exists,
                                                        -- not revoked, contains toolName
      (outbound only) per-client daily-cap check
      run tool body
  → response post-processor (audit log emit + rate-limit bucket increment)

Stdio mode skips Express entirely. Bootstrap synthesises req.auth = { clientId: ownerClientId, apiKeyId: null, scopes: ['*'], rpmLimit: RL_OWNER_RPM, dailyMsgLimit: RL_OWNER_DAILY }. Grants check still runs — the owner cannot accidentally use an un-granted number. Audit rows have api_key_id = NULL, metadata.transport = 'stdio'.

3. API keys, scopes, and grants

Key format

wamcp_<env>_<28 chars Crockford base32> where <env> ∈ {live, test}.

  • 140 bits of entropy (28 characters × 5 bits each).
  • The wamcp_ prefix is registered with GitHub Secret Scanning so leaks get caught.
  • The 12-char DB lookup prefix is wamcp_<env>_<first 4 of secret>, indexed.
  • Stored as hash = HMAC_SHA256(pepper, full_token). HMAC, not Argon2id, because the key already carries enough entropy that slow hashing buys nothing and would dominate request latency. The pepper at /run/secrets/api_key_pepper means a stolen DB dump alone can’t replay against the auth endpoint.
  • Shown once at mint, printed to stderr (not stdout, so it doesn’t end up piped into logs).
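The key format and hashing scheme above can be sketched as follows. This is a minimal illustration using Node's built-in crypto module, not the project's actual code: the function names (`mintToken`, `lookupPrefix`, `hashToken`, `verifyToken`) are hypothetical, and the pepper is passed in as a Buffer rather than read from /run/secrets/api_key_pepper.

```typescript
import { createHmac, randomInt, timingSafeEqual } from "node:crypto";

// Crockford base32 alphabet: digits plus letters, excluding I, L, O and U.
const CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ";
const TOKEN_RE = /^wamcp_(live|test)_([0-9A-HJKMNP-TV-Z]{28})$/;

type KeyEnv = "live" | "test";

// 28 symbols of 5 bits each gives the 140 bits of entropy stated above.
function mintToken(env: KeyEnv): string {
  let secret = "";
  for (let i = 0; i < 28; i++) {
    secret += CROCKFORD.charAt(randomInt(CROCKFORD.length));
  }
  return `wamcp_${env}_${secret}`;
}

// Lookup prefix as described above: wamcp_<env>_<first 4 of secret>.
function lookupPrefix(token: string): string | null {
  const m = TOKEN_RE.exec(token);
  if (m === null) return null;
  const env = m[1] ?? "";
  const secret = m[2] ?? "";
  return `wamcp_${env}_${secret.slice(0, 4)}`;
}

// hash = HMAC-SHA256(pepper, full token); 32 bytes, stored as bytea.
function hashToken(pepper: Buffer, token: string): Buffer {
  return createHmac("sha256", pepper).update(token).digest();
}

// Constant-time comparison against the stored hash; malformed tokens fail closed.
function verifyToken(pepper: Buffer, token: string, storedHash: Buffer): boolean {
  if (!TOKEN_RE.test(token)) return false;
  const candidate = hashToken(pepper, token);
  return candidate.length === storedHash.length && timingSafeEqual(candidate, storedHash);
}
```

At mint time the caller would store `lookupPrefix(token)` and `hashToken(pepper, token)` and write the full token to stderr once; at request time only the prefix lookup and `verifyToken` are needed.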

Scopes (stored as JSONB string array)

| Scope | Meaning |
|---|---|
| tools:send_message | may invoke send_message |
| tools:send_template | may invoke send_template |
| tools:send_media | may invoke send_media |
| tools:send_interactive_buttons | may invoke send_interactive_buttons |
| tools:send_interactive_list | may invoke send_interactive_list |
| tools:get_messages | may read inbound |
| tools:get_media_url | may resolve a signed media URL |
| tools:list_chats | may list conversations |
| tools:get_contact | may resolve contacts |
| tools:mark_read | may send read receipts |
| tools:* | all tool scopes; owner-only |
| numbers:<phone_number_id> | may operate against this WABA number |
| numbers:* | any granted number; owner-only |
| media:read / media:write | inbound download / outbound upload |
| admin:* | management plane; owner-only, never minted to third parties |

Two-layer authorisation

A request must satisfy both:

  1. api_keys.scopes contains (tools:<tool> OR tools:*) AND (numbers:<phone_number_id> OR numbers:*).
  2. client_phone_grants row exists for (client_id, phone_number_id), not revoked, with <tool> in allowed_tools.

Why two layers: rotating a key shouldn’t re-grant numbers; revoking a number from a client should affect every key they hold instantly without touching key rows. Wildcards are checked at mint and at request time (defence in depth).
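The two checks can be sketched as a pure function over a key's scopes and the client's grant rows. The `Grant` shape and the `authorise` and `keyAllows` names are illustrative assumptions; the return values reuse the scope_denied and grant_denied actions from the audit vocabulary. Treating a bare `*` scope as a wildcard is an assumption based on the stdio bootstrap described in §2.

```typescript
interface Grant {
  clientId: string;
  phoneNumberId: string;
  allowedTools: string[];
  revokedAt: Date | null;
}

type Denial = "scope_denied" | "grant_denied";

// Layer 1: the key's scopes must cover both the tool and the number.
function keyAllows(scopes: readonly string[], tool: string, phoneNumberId: string): boolean {
  const set = new Set(scopes);
  if (set.has("*")) return true; // assumption: synthesised stdio owner scope
  const toolOk = set.has(`tools:${tool}`) || set.has("tools:*");
  const numberOk = set.has(`numbers:${phoneNumberId}`) || set.has("numbers:*");
  return toolOk && numberOk;
}

// Layer 2: a live grant row must list the tool for this (client, number) pair.
function grantAllows(
  grants: readonly Grant[],
  clientId: string,
  phoneNumberId: string,
  tool: string,
): boolean {
  return grants.some(
    (g) =>
      g.clientId === clientId &&
      g.phoneNumberId === phoneNumberId &&
      g.revokedAt === null &&
      g.allowedTools.some((t) => t === tool),
  );
}

// Returns null when allowed, otherwise the audit action to record. Default is deny.
function authorise(
  scopes: readonly string[],
  grants: readonly Grant[],
  clientId: string,
  phoneNumberId: string,
  tool: string,
): Denial | null {
  if (!keyAllows(scopes, tool, phoneNumberId)) return "scope_denied";
  if (!grantAllows(grants, clientId, phoneNumberId, tool)) return "grant_denied";
  return null;
}
```

Note that the grant check runs even for wildcard scopes, which matches the statement in §2 that the owner cannot use an un-granted number.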

Rotation flow (dual-accept window)

  1. admin keys rotate <key-id> mints a new key (full token shown once), inserts a row with rotated_from_id = <old-id>, copies client_id and scopes.
  2. Old key remains valid; both work for a configurable window (default 7 days; expires_at set on the old row).
  3. Sweeper sets revoked_at = now() at expiry; admin can revoke early.
  4. Audit log records key_rotated, key_revoked, and key_used_after_rotation_warning (when the old key is used within the grace window).
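A minimal sketch of the dual-accept window, expressed as a pure function over key rows. `ApiKeyRow`, `planRotation`, and `isUsable` are hypothetical names. Two choices below are assumptions not stated in the flow above: the new key's own expires_at is left null, and an old key that already had an earlier hard cut-off keeps it rather than being extended.

```typescript
interface ApiKeyRow {
  id: string;
  clientId: string;
  scopes: string[];
  expiresAt: Date | null;
  revokedAt: Date | null;
  rotatedFromId: string | null;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Step 1 and 2: derive the new row and stamp the grace-window expiry on the old row.
function planRotation(
  oldKey: ApiKeyRow,
  newId: string,
  now: Date,
  graceDays = 7,
): { newKey: ApiKeyRow; oldKey: ApiKeyRow } {
  if (oldKey.revokedAt !== null) throw new Error("cannot rotate a revoked key");
  const graceEnd = new Date(now.getTime() + graceDays * DAY_MS);
  // Assumption: never extend a key past a cut-off it already had.
  const keepExisting = oldKey.expiresAt !== null && oldKey.expiresAt.getTime() < graceEnd.getTime();
  const expiresAt = keepExisting ? oldKey.expiresAt : graceEnd;
  return {
    newKey: {
      id: newId,
      clientId: oldKey.clientId,
      scopes: [...oldKey.scopes],
      expiresAt: null, // assumption: expiry policy for new keys is set elsewhere
      revokedAt: null,
      rotatedFromId: oldKey.id,
    },
    oldKey: { ...oldKey, expiresAt },
  };
}

// A key is accepted while it is neither revoked nor past its hard cut-off.
function isUsable(key: ApiKeyRow, now: Date): boolean {
  if (key.revokedAt !== null) return false;
  return key.expiresAt === null || now.getTime() < key.expiresAt.getTime();
}
```

During the window both rows pass `isUsable`; once the sweeper (or the expiry itself) takes effect, only the new row does.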

4. Inngest

Inngest Cloud is the orchestrator. Functions execute inside the Node app at POST /api/inngest. Nginx proxies this path with rate limits but no app-layer auth — the Inngest SDK verifies INNGEST_SIGNING_KEY on every incoming request.

Events

| Event | Producer | Consumer | Notes |
|---|---|---|---|
| wa/webhook.received | webhook handler (after signature check) | process-message | Idempotency key = derived event id |
| wa/message.send.requested | send_message tool | send-message | Concurrency key = phone_number_id |
| wa/message.send.completed | send-message function | optional step.waitForEvent for wait=true callers | |
| wa/media.download.requested | process-message (when message has media) | download-media | Streams to disk |
| wa/media.upload.requested | send_media tool | upload-media | For localPath source |
| mcp/client.notify | process-message after persist | in-process handler → SSE push | Fan-out per granted client |
| wa/status.update.received | webhook handler (status updates) | status-updater | Marks delivered/read/failed on messages |
| cron/audit.archive | Inngest cron, daily | archive-audit | Rolls audit_log rows >365d to JSONL.zst |
| cron/messages.retention | Inngest cron, daily | prune-messages | Nulls body/raw on rows >90d |
| cron/media.retention | Inngest cron, daily | prune-media | Deletes files on disk + nulls storage_path >90d |
| cron/rate-limit.gc | Inngest cron, every 10m | gc-rate-limits | Drops rate_limit_buckets >25h |

Concurrency + rate limit interaction

  • RPM is enforced before enqueue at the MCP entrypoint (Postgres sliding window). The caller sees an immediate 429 instead of accepting work we’ll just reject.
  • Daily cap is enforced inside the send-message Inngest function (after dequeue) so retries don’t double-count.
  • Concurrency key phone_number_id on send-message means we never hammer a single Meta number across clients.

5. Notification routing (multi-client)

Map<clientId, Set<McpSession>> in process memory. On connect (post-auth), register the session; on SSE close/error/idle (5 min), deregister.

When mcp/client.notify fires, the in-process handler looks up live sessions for clientId and sends MCP notifications/resources/updated with uri = wamcp://numbers/<phone_number_id>/messages. If no session is open, drop the notification — the message is durable in messages, so the client backfills via get_messages with a since cursor on next connect.

This is correct-by-construction across crashes, restarts, deploys, and offline clients. Notifications are best-effort hints; the DB is the truth.
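The routing described above can be sketched as a small in-memory registry. `McpSession` is reduced here to a hypothetical interface with a single `send` method, and `SessionRegistry` is an illustrative name; idle-timeout handling is left out.

```typescript
// Hypothetical minimal view of a live MCP session: something we can push a notification to.
interface McpSession {
  id: string;
  send(method: string, params: Record<string, unknown>): void;
}

class SessionRegistry {
  private readonly byClient = new Map<string, Set<McpSession>>();

  // Called after auth succeeds on connect.
  register(clientId: string, session: McpSession): void {
    const sessions = this.byClient.get(clientId) ?? new Set<McpSession>();
    sessions.add(session);
    this.byClient.set(clientId, sessions);
  }

  // Called on SSE close, error, or idle timeout.
  deregister(clientId: string, session: McpSession): void {
    const sessions = this.byClient.get(clientId);
    if (!sessions) return;
    sessions.delete(session);
    if (sessions.size === 0) this.byClient.delete(clientId);
  }

  // Best-effort fan-out; returns how many sessions were notified.
  // With no open session the notification is simply dropped.
  notify(clientId: string, phoneNumberId: string): number {
    const sessions = this.byClient.get(clientId);
    if (!sessions) return 0;
    let delivered = 0;
    sessions.forEach((session) => {
      try {
        session.send("notifications/resources/updated", {
          uri: `wamcp://numbers/${phoneNumberId}/messages`,
        });
        delivered++;
      } catch {
        // A failed push only costs a hint; the client backfills via get_messages.
        this.deregister(clientId, session);
      }
    });
    return delivered;
  }
}
```

Nothing in the registry needs to survive a restart, which is why losing it on deploy is harmless: the messages table remains the source of truth.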

For a future multi-instance world, the in-process Map is replaced with Postgres LISTEN/NOTIFY or Redis pub-sub. v1 is single-instance.

6. Webhook security

  • express.raw({ type: 'application/json', limit: '5mb' }) on /webhook/meta only (not global express.json) so the raw body Buffer is preserved.
  • HMAC-SHA256(WA_APP_SECRET, rawBody) compared to X-Hub-Signature-256 via crypto.timingSafeEqual. Mismatch → 404 (don’t leak the endpoint to scanners) and webhook_invalid_signature audit row.
  • GET verify-token handshake per phone number (column phone_numbers.webhook_verify_token_ref); leaking one token does not enable webhook hijack across the fleet.
  • IP allowlist is skipped — Meta does not publish stable source IPs. Signature is cryptographically sufficient.
  • Idempotency on inbound: Inngest idempotencyKey = wamid and DB INSERT INTO inngest_idempotency ... ON CONFLICT DO NOTHING. Meta retries aggressively on 5xx; this catches both Inngest retries and Meta retries.
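The signature check in the second bullet can be sketched with Node's crypto module. `verifyMetaSignature` is a hypothetical helper name; the header value is assumed to have the form `sha256=<hex digest>`, which is how Meta sends X-Hub-Signature-256.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Returns true only when the header carries a valid HMAC-SHA256 of the raw body.
// Any malformed or missing input yields false (fail closed).
function verifyMetaSignature(
  appSecret: string,
  rawBody: Buffer,
  signatureHeader: string | undefined,
): boolean {
  const scheme = "sha256=";
  if (!signatureHeader || !signatureHeader.startsWith(scheme)) return false;
  // Invalid hex decodes to a shorter buffer, which the length check below rejects.
  const provided = Buffer.from(signatureHeader.slice(scheme.length), "hex");
  const expected = createHmac("sha256", appSecret).update(rawBody).digest();
  // timingSafeEqual throws on unequal lengths, so compare lengths first.
  return provided.length === expected.length && timingSafeEqual(provided, expected);
}
```

The function must receive the exact bytes Meta sent, which is why the route uses express.raw rather than a JSON parser: re-serialising a parsed body would change the digest.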

7. Rate limiting details

RPM (per-API-key, sliding window)

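Bucket key rpm:<api_key_id>:<minute_epoch>. Atomic upsert into rate_limit_buckets. The sliding window is approximated from two fixed buckets: the estimate is the current-minute count plus the previous-minute count weighted by the fraction of the previous minute that still falls inside the trailing 60 seconds. Default 60 req/min for non-owner keys, RL_OWNER_RPM (default 600) for the owner.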
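A minimal sketch of the weighted-sum estimate, assuming the two bucket counts have already been read from rate_limit_buckets. The function names are illustrative, and the comparison used to decide whether a request is allowed (strictly below the limit) is an assumption.

```typescript
const WINDOW_MS = 60_000;

// Estimate requests in the trailing 60 s from the previous and current minute buckets.
function slidingWindowCount(prevCount: number, currCount: number, nowMs: number): number {
  const elapsedMs = nowMs % WINDOW_MS; // how far we are into the current minute
  const prevWeight = (WINDOW_MS - elapsedMs) / WINDOW_MS; // share of the previous minute still in view
  return currCount + prevCount * prevWeight;
}

// Assumption: a request is admitted while the estimate is strictly below the limit.
function allowRequest(prevCount: number, currCount: number, nowMs: number, limit: number): boolean {
  return slidingWindowCount(prevCount, currCount, nowMs) < limit;
}
```

For example, 15 seconds into a minute with 40 requests in the previous bucket and 20 in the current one, the estimate is 20 + 40 × 0.75 = 50, so a key with the default limit of 60 is still admitted.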

Daily message cap (per-client per-number, rolling 24h)

Bucket key daily:<client_id>:<phone_number_id>:<hour_epoch>. Sum the last 24 hourly buckets at send time. Default 250 messages/day for non-owner per number; overridable via client_phone_grants.daily_message_cap. Applies only to outbound message tools.
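The rolling 24-hour sum can be sketched as below, assuming the hourly counts for one (client, number) pair have been loaded into a map keyed by hour epoch. `hourEpoch`, `rolling24hCount`, and `effectiveCap` are hypothetical names.

```typescript
const HOUR_MS = 3_600_000;

// Hour epoch used in the bucket key: whole hours since the Unix epoch.
function hourEpoch(nowMs: number): number {
  return Math.floor(nowMs / HOUR_MS);
}

// Sum the current hour and the 23 hours before it.
function rolling24hCount(buckets: ReadonlyMap<number, number>, nowMs: number): number {
  const current = hourEpoch(nowMs);
  let total = 0;
  for (let h = current - 23; h <= current; h++) {
    total += buckets.get(h) ?? 0;
  }
  return total;
}

// A per-grant cap, when present, overrides the key's daily_msg_limit.
function effectiveCap(keyDailyLimit: number, grantDailyCap: number | null): number {
  return grantDailyCap ?? keyDailyLimit;
}

function mayAllowSend(
  buckets: ReadonlyMap<number, number>,
  nowMs: number,
  keyDailyLimit: number,
  grantDailyCap: number | null,
): boolean {
  return rolling24hCount(buckets, nowMs) < effectiveCap(keyDailyLimit, grantDailyCap);
}
```

Hourly buckets keep the window rolling at one-hour granularity while bounding the read to at most 24 rows per check.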

429 shape

HTTP layer (Streamable HTTP):

HTTP/1.1 429 Too Many Requests
Retry-After: 17
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1715000060

MCP protocol layer: JSON-RPC error code -32004 with data: { retryAfterSeconds, scope }. Surfaced as a protocol error, not a tool error, so the model backs off instead of retrying with different args.
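Both shapes can be produced by small builders, sketched below. The function names and the error message text are illustrative, and the `scope` values ("rpm", "daily") are assumptions; only the code -32004 and the data field names come from the description above.

```typescript
interface RateLimitRpcError {
  jsonrpc: "2.0";
  id: string | number | null;
  error: {
    code: -32004;
    message: string;
    data: { retryAfterSeconds: number; scope: string };
  };
}

// JSON-RPC protocol-level error, so the caller backs off rather than retrying with new args.
function rateLimitRpcError(
  id: string | number | null,
  retryAfterSeconds: number,
  scope: string, // assumption: e.g. "rpm" or "daily"
): RateLimitRpcError {
  return {
    jsonrpc: "2.0",
    id,
    error: {
      code: -32004,
      message: "Rate limit exceeded",
      data: { retryAfterSeconds, scope },
    },
  };
}

// HTTP headers for the 429 response; times are Unix seconds.
function rateLimitHeaders(
  limit: number,
  remaining: number,
  resetEpochSeconds: number,
  nowEpochSeconds: number,
): Record<string, string> {
  return {
    "Retry-After": String(Math.max(0, resetEpochSeconds - nowEpochSeconds)),
    "X-RateLimit-Limit": String(limit),
    "X-RateLimit-Remaining": String(Math.max(0, remaining)),
    "X-RateLimit-Reset": String(resetEpochSeconds),
  };
}
```

With a reset time of 1715000060 and a current time of 1715000043, `rateLimitHeaders(60, 0, 1715000060, 1715000043)` yields the header values in the HTTP example above.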

8. Secret management

Phases 1–6: .env

  • File at /opt/whatsapp-mcp/.env, mode 0600, owned by service user, never committed.
  • Loaded only by src/config/env.ts (zod-validated). process.env access elsewhere is a lint error.
  • .env.example is committed.

Phase 7 hardening: SOPS + age

  • Encrypted secrets.enc.yaml in repo.
  • Age private key at /etc/whatsapp-mcp/age.key (mode 0400, root).
  • whatsapp-mcp-secrets.service (systemd) decrypts to tmpfs /run/whatsapp-mcp/secrets/ at boot (mode 0750, never disk).
  • Compose bind-mounts read-only as /run/secrets. Each secret is its own file.
  • Node reads files at startup (env-var leakage via /proc/<pid>/environ is avoided).
  • phone_numbers.*_ref columns hold secrets://name strings; a resolver maps them to file paths. A later migration to Vault/SSM then becomes a one-line change in the resolver.
  • Back up the age key offline (printed paper in a safe, hardware token); the encrypted file is safe to back up alongside the repo.
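The secrets:// indirection mentioned in the list above can be sketched as a single resolver function. `resolveSecretPath` is a hypothetical name, and the set of characters allowed in a secret name is an assumption chosen to rule out path separators.

```typescript
import * as path from "node:path";

// Directory where the decrypted secrets are mounted inside the container.
const SECRETS_DIR = "/run/secrets";

// Assumption: secret names are limited to letters, digits, underscore and hyphen.
const SECRET_REF_RE = /^secrets:\/\/([A-Za-z0-9_-]+)$/;

// Map "secrets://<name>" to the file that holds that secret; reject anything else.
function resolveSecretPath(ref: string, secretsDir: string = SECRETS_DIR): string {
  const m = SECRET_REF_RE.exec(ref);
  if (m === null) {
    throw new Error("malformed secret reference");
  }
  const name = m[1] ?? "";
  return path.posix.join(secretsDir, name);
}
```

Callers never see file paths, only references, so swapping the backing store means changing this one function and nothing that stores or passes a `*_ref` value.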

9. Canonical env vars

# Meta
WA_GRAPH_API_VERSION=v23.0
WA_APP_SECRET=                  # X-Hub-Signature-256, per Meta App (shared by all numbers in the App)
WA_WEBHOOK_VERIFY_TOKEN=        # v1 single-number convenience; per-number tokens via phone_numbers.webhook_verify_token_ref later
WA_DEFAULT_PHONE_NUMBER_ID=     # v1 single number convenience
WA_DEFAULT_WABA_ID=
WA_DEFAULT_ACCESS_TOKEN=

# Server
APP_PUBLIC_URL=https://wa.<yourdomain>
APP_HTTP_PORT=3000
APP_BIND=127.0.0.1
MCP_TRANSPORT=http              # 'http' | 'stdio'
NODE_ENV=production
LOG_LEVEL=info

# Postgres
DATABASE_URL=postgres://wa:***@postgres:5432/wa_mcp

# Inngest Cloud
INNGEST_EVENT_KEY=
INNGEST_SIGNING_KEY=
INNGEST_SERVE_PATH=/api/inngest

# Media
MEDIA_ROOT=/var/lib/whatsapp-mcp/media
MEDIA_SIGNING_SECRET=           # HMAC for signed download URLs
MEDIA_URL_TTL_SECONDS=300

# Auth
API_KEY_PEPPER=                 # 32 random bytes, base64-encoded
LOCAL_OWNER_CLIENT_ID=          # uuid of the is_owner=true row

# Rate limits (defaults; overridable per-key/grant in DB)
RL_DEFAULT_RPM=60
RL_DEFAULT_DAILY_MSGS=250
RL_OWNER_RPM=600
RL_OWNER_DAILY=10000

10. Threat model (brief)

| Threat | Mitigation |
|---|---|
| Leaked API key (committed to git, pasted in chat) | wamcp_ prefix in GitHub Secret Scanning; default 90d expires_at; last_used_ip shift triggers audit alert; revocation effective on next request (no caching) |
| Compromised Ubuntu host | Per-WABA tokens (one compromise doesn’t reach others); audit log rsynced hourly to a separate host so a wiped local DB still leaves a trail; Meta tokens scoped to minimum WABA permissions |
| Malicious client tries to send from a number they weren’t granted | Two-layer key+grant check; grants are the authoritative truth, revoking a grant is instant; scope_denied / grant_denied audit rows trigger alerts on N denials/min |
| Replayed Meta webhook | Inngest idempotencyKey = wamid + DB inngest_idempotency ON CONFLICT DO NOTHING; processing is exactly-once even under aggressive Meta retries |
| Leaked Meta access token | System User tokens scoped to one WABA, rotatable from Meta dashboard; phone_numbers.access_token_secret_ref indirection means rotation = update one file + restart container; all attempts audited |
| Accidental media URL exposure | Nginx never serves /var/lib/whatsapp-mcp/media; downloads go only through the media:read-scoped MCP tool which streams after auth + scope + grant check; signed URLs additionally bind tenancy; storage_path is relative so a path-traversal bug can’t escape the media root |
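The last mitigation relies on storage_path always being resolved under the media root. A minimal sketch of such a resolver follows; `resolveMediaPath` is a hypothetical helper, and the explicit containment check is an extra guard assumed here rather than something the document specifies.

```typescript
import * as path from "node:path";

// Resolve a relative storage_path under the media root, refusing anything that escapes it.
function resolveMediaPath(mediaRoot: string, storagePath: string): string {
  if (storagePath.length === 0 || path.posix.isAbsolute(storagePath)) {
    throw new Error("storage_path must be a non-empty relative path");
  }
  const root = path.posix.resolve(mediaRoot);
  const full = path.posix.resolve(root, storagePath);
  // After normalisation the result must still sit strictly inside the root.
  if (!full.startsWith(root + "/")) {
    throw new Error("storage_path escapes the media root");
  }
  return full;
}
```

For example, `resolveMediaPath("/var/lib/whatsapp-mcp/media", "ab/cd.jpg")` returns `/var/lib/whatsapp-mcp/media/ab/cd.jpg`, while a value such as `../secrets` is rejected.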

Principles applied throughout

  • Fail closed. Auth, scope, grant, rate-limit checks all default-deny on error.
  • Single auth path. No back-door admin endpoints. Admin is CLI-only, run inside the container.
  • Tokens live in memory only as long as they must. Read at startup, not from env vars; rotations restart the container.
  • Audit log is append-only at the DB. Postgres role for the app has INSERT, SELECT on audit_log only. Retention archival uses a separate role.