Architecture — Cross-Cutting Design
This document describes the parts of the system that span every phase. Read this before any phase file. Phase files describe what gets built; this file describes what gets honoured.
1. Database schema (the spine)
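Unless noted otherwise, tables use id uuid PRIMARY KEY DEFAULT gen_random_uuid() (built into Postgres 13+; provided by pgcrypto on older versions) and timestamptz for time columns. Exceptions: audit_log uses a bigserial key and inngest_idempotency is keyed by event_id. FK actions: ON DELETE RESTRICT by default; explicit exceptions are noted inline.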
Core tenancy tables
clients
id uuid PK
name text unique -- kebab-case, e.g. "internal-projectX"
display_name text
is_owner bool -- exactly one row may have this true
disabled_at timestamptz null -- gates all auth when set
metadata jsonb
created_at timestamptz default now()
-- partial unique (is_owner) WHERE is_owner enforces single-owner
api_keys
id uuid PK
client_id uuid FK clients(id) ON DELETE CASCADE
label text -- "claude-desktop-laptop"
prefix text unique -- 12 chars, indexed lookup column
hash bytea -- 32B HMAC-SHA256(pepper, full token)
scopes jsonb -- array of scope strings, see §3
rpm_limit int not null default 60
daily_msg_limit int not null default 250
expires_at timestamptz null -- hard cut-off
revoked_at timestamptz null -- soft delete
rotated_from_id uuid null -- self-FK; rotation lineage
last_used_at timestamptz null -- updated async (batched)
last_used_ip inet null
created_at timestamptz default now()
-- indexes: unique(prefix), partial (client_id) WHERE revoked_at IS NULL,
-- partial (expires_at) WHERE expires_at IS NOT NULL
phone_numbers
id uuid PK
wa_phone_number_id text unique not null -- Meta's numeric id
wa_business_account_id text not null
display_number text -- E.164 with +, humans only
label text
access_token_secret_ref text -- "secrets://wa_access_token_<id>"
webhook_verify_token_ref text -- per-number to limit blast radius
disabled_at timestamptz null
created_at timestamptz default now()
-- NOTE: no app_secret column. App Secret is per Meta App, not per number.
-- All numbers in this WABA share the App Secret read from WA_APP_SECRET
-- (secrets://wa_app_secret). If we ever split into multiple Meta Apps,
-- add a nullable app_secret_ref column then.
client_phone_grants -- M:N — what each client may do with which number
id uuid PK
client_id uuid FK clients(id)
phone_number_id uuid FK phone_numbers(id)
allowed_tools jsonb not null -- e.g. ["send_message","get_messages"]
daily_message_cap int null -- override of api_keys.daily_msg_limit
revoked_at timestamptz null
created_at timestamptz default now()
-- unique (client_id, phone_number_id) WHERE revoked_at IS NULL
Conversation tables
contacts
id uuid PK
phone_number_id uuid FK phone_numbers(id)
wa_id text not null -- E.164 with no leading +
profile_name text -- from Meta contacts[].profile.name
display_name text -- local override
metadata jsonb
first_seen_at timestamptz default now()
last_seen_at timestamptz default now()
-- unique (phone_number_id, wa_id)
messages
id uuid PK
phone_number_id uuid FK phone_numbers(id)
client_id uuid FK clients(id) -- nullable for inbound (number-owned)
direction text -- 'inbound' | 'outbound'
wa_message_id text unique -- Meta's wamid.xxx
contact_id uuid FK contacts(id)
message_type text -- text|template|image|document|audio|video|sticker|reaction|interactive
body text -- plaintext for text/caption
template_name text null
media_object_id uuid FK media_objects(id) null
status text -- queued|sent|delivered|read|failed (out); received|read (in)
error_code int null
reply_to_wamid text null
payload jsonb -- normalised body (e.g. interactive selectedId)
raw jsonb -- original Meta payload (inbound), sanitised
ts timestamptz -- Meta event time
created_at timestamptz default now()
-- indexes: unique(wa_message_id),
-- (phone_number_id, ts DESC),
-- (client_id, ts DESC) WHERE direction='outbound',
-- (phone_number_id, contact_id, ts DESC) WHERE direction='inbound'
media_objects
id uuid PK
phone_number_id uuid FK phone_numbers(id)
direction text -- inbound|outbound
wa_media_id text -- Meta media id (inbound)
mime_type text
sha256 bytea
size_bytes bigint
storage_path text -- relative under MEDIA_ROOT, never absolute
downloaded_at timestamptz null -- null for outbound
created_at timestamptz default now()
Operational tables
audit_log -- append-only; app role has INSERT,SELECT only
id bigserial PK
ts timestamptz default now()
client_id uuid FK clients(id) ON DELETE SET NULL null
api_key_id uuid FK api_keys(id) ON DELETE SET NULL null
request_id text -- propagated through Inngest
ip inet null
user_agent text null
action text -- see action vocabulary below
tool_name text null
phone_number_id uuid FK phone_numbers(id) ON DELETE SET NULL null
wa_message_id text null
payload_hash bytea null -- SHA-256 of canonicalised input
error_code text null
latency_ms int null
metadata jsonb null
-- indexes: (client_id, ts DESC), (api_key_id, ts DESC), (action, ts DESC),
-- (wa_message_id) WHERE wa_message_id IS NOT NULL, BRIN on ts
rate_limit_buckets -- per-window counters (windowing rules in §7)
id uuid PK
key text -- 'rpm:<api_key_id>:<minute>' or 'daily:<client_id>:<phone>:<hour>'
window_start timestamptz
count int not null default 0
updated_at timestamptz default now()
-- unique (key, window_start); partial index on stale rows for GC
inngest_idempotency -- dedupe Meta webhook retries
event_id text PK -- wamid for inbound, our request-id for outbound
first_seen_at timestamptz default now()
outcome text -- processed|duplicate|failed
Action vocabulary for audit_log
key_minted, key_rotated, key_revoked, key_used, auth_failed, scope_denied, grant_denied, rate_limited, tool_called, send_attempt, send_success, send_failed, webhook_received, webhook_invalid_signature, webhook_duplicate, media_uploaded, media_downloaded, grant_added, grant_revoked, client_disabled.
Row-level isolation strategy
App-enforced WHERE clauses, not Postgres RLS. Justification: we issue one DB role for the app; RLS shines when each tenant gets its own role, which we don’t do. A single missed WHERE client_id = $1 would be a tenancy bug, so we centralise all queries on tenant tables in src/db/scoped.ts which takes clientId as a mandatory first argument. ESLint rule forbids raw pool.query against tenant tables outside that module. Cross-tenant isolation tests in CI assert client B cannot read/write client A’s data via any tool, scope, or grant path. The schema stays RLS-ready in case we change our mind.
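A minimal sketch of what a scoped query helper in the spirit of src/db/scoped.ts could look like. The Queryable interface, the makeScoped factory, and the TENANT_TABLES list are hypothetical names introduced for illustration; only the "clientId is the mandatory first argument" rule comes from the text above.

```typescript
// Hypothetical minimal DB interface; a real pool exposes a compatible query method.
interface Queryable {
  query(text: string, values: unknown[]): Promise<{ rows: unknown[] }>;
}

// Assumed subset of the tables that carry a client_id column.
const TENANT_TABLES = ["api_keys", "client_phone_grants", "messages"] as const;
type TenantTable = (typeof TENANT_TABLES)[number];

function makeScoped(db: Queryable) {
  return {
    // clientId comes first and is not optional, so a caller cannot omit the tenant.
    // extraWhere must be a static SQL fragment written by us, never user input.
    async select(
      clientId: string,
      table: TenantTable,
      extraWhere: string = "TRUE",
      extraValues: unknown[] = [],
    ): Promise<unknown[]> {
      if (!clientId) throw new Error("clientId is required"); // fail closed
      if (!TENANT_TABLES.includes(table)) throw new Error("not a tenant table");
      // $1 is always the tenant; caller-supplied placeholders start at $2.
      const sql = `SELECT * FROM ${table} WHERE client_id = $1 AND (${extraWhere})`;
      const result = await db.query(sql, [clientId, ...extraValues]);
      return result.rows;
    },
  };
}
```

Because every read goes through select, the tenant predicate is written once, and the cross-tenant CI tests only need to exercise this one module's callers.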
2. Auth pipeline (request order)
Nginx (TLS, strips client X-Forwarded-* / X-Real-Ip, adds X-Request-Id)
→ Express trust proxy = 1
→ request-id middleware (CLS / AsyncLocalStorage propagation into Inngest)
→ body parser (express.raw on /webhook/*, express.json on /mcp/*)
→ authMiddleware
  parse Bearer wamcp_live_...
  lookup api_keys by 12-char prefix
  HMAC-SHA256(pepper, token) constant-time compare against hash
  attach req.auth = { clientId, apiKeyId, scopes, rpmLimit, dailyMsgLimit }
  fail-closed: any error here → 401
→ client-enabled check (clients.disabled_at IS NULL)
→ per-key RPM rate-limit check (Postgres sliding window)
→ MCP dispatcher (validates JSON-RPC envelope)
→ tool layer:
  parse input via tool's zod schema
  requireScope(toolName) -- key has tools:<name> or tools:*
  requireGrant(clientId, phoneNumberId, toolName)
    -- client_phone_grants row exists,
    -- not revoked, contains toolName
  (outbound only) per-client daily-cap check
  run tool body
→ response post-processor (audit log emit + rate-limit bucket increment)
Stdio mode skips Express entirely. Bootstrap synthesises req.auth = { clientId: ownerClientId, apiKeyId: null, scopes: ['*'], rpmLimit: RL_OWNER_RPM, dailyMsgLimit: RL_OWNER_DAILY }. The grants check still runs: the owner cannot accidentally use an un-granted number. Audit rows have api_key_id = NULL, metadata.transport = 'stdio'.
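The authMiddleware step of the pipeline can be sketched as a pure function, which keeps the fail-closed behaviour easy to test. This is an illustration under assumptions, not the actual middleware: ApiKeyRow, AuthContext, authenticate, and the synchronous findByPrefix callback are hypothetical (a real lookup would be an async DB query), and the lookup prefix is derived from the §3 formula rather than from a hard-coded length.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

interface ApiKeyRow {
  id: string;
  clientId: string;
  hash: Buffer; // HMAC-SHA256(pepper, full token)
  scopes: string[];
  rpmLimit: number;
  dailyMsgLimit: number;
  expiresAt: Date | null;
  revokedAt: Date | null;
}

interface AuthContext {
  clientId: string;
  apiKeyId: string;
  scopes: string[];
  rpmLimit: number;
  dailyMsgLimit: number;
}

// Crockford base32 excludes I, L, O and U.
const TOKEN_RE = /^wamcp_(live|test)_[0-9A-HJKMNP-TV-Z]{28}$/;

// Returns null for every failure so the caller can answer 401 uniformly.
function authenticate(
  authorization: string | undefined,
  pepper: Buffer,
  findByPrefix: (prefix: string) => ApiKeyRow | undefined,
  now: Date = new Date(),
): AuthContext | null {
  try {
    if (!authorization || !authorization.startsWith("Bearer ")) return null;
    const token = authorization.slice("Bearer ".length);
    if (!TOKEN_RE.test(token)) return null;
    // Lookup prefix per §3: "wamcp_<env>_" plus the first 4 secret characters.
    const prefix = token.slice(0, token.lastIndexOf("_") + 1 + 4);
    const row = findByPrefix(prefix);
    if (!row) return null;
    const candidate = createHmac("sha256", pepper).update(token).digest();
    // timingSafeEqual throws on unequal lengths, so check that first.
    if (candidate.length !== row.hash.length) return null;
    if (!timingSafeEqual(candidate, row.hash)) return null;
    if (row.revokedAt !== null) return null;
    if (row.expiresAt !== null && row.expiresAt.getTime() <= now.getTime()) return null;
    return {
      clientId: row.clientId,
      apiKeyId: row.id,
      scopes: row.scopes,
      rpmLimit: row.rpmLimit,
      dailyMsgLimit: row.dailyMsgLimit,
    };
  } catch {
    return null; // fail closed: any unexpected error is treated as unauthenticated
  }
}
```

The client-enabled and RPM checks stay outside this function, matching the order shown in the pipeline.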
3. API keys, scopes, and grants
Key format
wamcp_<env>_<28 chars Crockford base32> where <env> ∈ {live, test}.
- 140 bits of entropy.
- The wamcp_ prefix is registered with GitHub Secret Scanning so leaks get caught.
- The 12-char DB lookup prefix is wamcp_<env>_<first 4 of secret>, indexed.
- Stored as hash = HMAC_SHA256(pepper, full_token). HMAC, not Argon2id, because the key already carries enough entropy that slow hashing buys nothing and would dominate request latency. The pepper at /run/secrets/api_key_pepper means a stolen DB dump alone can’t replay against the auth endpoint.
- Shown once at mint, printed to stderr (not stdout, so it doesn’t end up piped into logs).
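A hedged sketch of minting a key in the format above. mintKey and randomSecret are hypothetical helper names. The prefix follows the wamcp_<env>_<first 4 of secret> formula; with the four-letter env tags that string is 15 characters long, so the sketch deliberately does not hard-code the 12-character figure.

```typescript
import { createHmac, randomBytes } from "node:crypto";

const CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"; // 32 symbols, 5 bits each

function randomSecret(length: number): string {
  const bytes = randomBytes(length);
  let out = "";
  for (let i = 0; i < length; i++) {
    // 256 is a multiple of 32, so keeping the low 5 bits of a byte is unbiased.
    out += CROCKFORD.charAt(bytes.readUInt8(i) & 31);
  }
  return out;
}

function mintKey(env: "live" | "test", pepper: Buffer) {
  const secret = randomSecret(28); // 28 symbols * 5 bits = 140 bits
  const token = `wamcp_${env}_${secret}`;
  const prefix = `wamcp_${env}_${secret.slice(0, 4)}`; // indexed lookup column
  const hash = createHmac("sha256", pepper).update(token).digest(); // 32 bytes
  return { token, prefix, hash };
}
```

Only prefix and hash would be persisted; the token itself would be written once with console.error so it lands on stderr.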
Scopes (stored as JSONB string array)
| Scope | Meaning |
|---|---|
| tools:send_message | may invoke send_message |
| tools:send_template | may invoke send_template |
| tools:send_media | may invoke send_media |
| tools:send_interactive_buttons | may invoke send_interactive_buttons |
| tools:send_interactive_list | may invoke send_interactive_list |
| tools:get_messages | may read inbound |
| tools:get_media_url | may resolve a signed media URL |
| tools:list_chats | may list conversations |
| tools:get_contact | may resolve contacts |
| tools:mark_read | may send read receipts |
| tools:* | all tool scopes, owner-only |
| numbers:<phone_number_id> | may operate against this WABA number |
| numbers:* | any granted number, owner-only |
| media:read / media:write | inbound download / outbound upload |
| admin:* | management plane, owner-only, never minted to third parties |
Two-layer authorisation
A request must satisfy both:
- api_keys.scopes contains tools:<tool> AND (numbers:<phone_number_id> OR numbers:*).
- A client_phone_grants row exists for (client_id, phone_number_id), not revoked, with <tool> in allowed_tools.
Why two layers: rotating a key shouldn’t re-grant numbers; revoking a number from a client should affect every key they hold instantly without touching key rows. Wildcards are checked at mint and at request time (defence in depth).
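The two layers can be sketched as one decision function. The Grant shape and the authorise name are hypothetical; the decision labels reuse the scope_denied and grant_denied actions from the audit vocabulary. The sketch also honours tools:*, as the pipeline in §2 does.

```typescript
interface Grant {
  clientId: string;
  phoneNumberId: string;
  allowedTools: string[];
  revokedAt: Date | null;
}

type Decision = "allowed" | "scope_denied" | "grant_denied";

function authorise(
  scopes: string[],
  grants: Grant[],
  clientId: string,
  phoneNumberId: string,
  tool: string,
): Decision {
  // Layer 1: the key itself must carry the tool scope and a number scope.
  const hasTool = scopes.includes(`tools:${tool}`) || scopes.includes("tools:*");
  const hasNumber =
    scopes.includes(`numbers:${phoneNumberId}`) || scopes.includes("numbers:*");
  if (!hasTool || !hasNumber) return "scope_denied";

  // Layer 2: the client must hold a live grant on that number listing the tool.
  const grant = grants.find(
    (g) =>
      g.clientId === clientId &&
      g.phoneNumberId === phoneNumberId &&
      g.revokedAt === null,
  );
  if (!grant || !grant.allowedTools.includes(tool)) return "grant_denied";
  return "allowed";
}
```

Note that numbers:* on its own never grants access: the second layer still requires a live grant row, which is why revoking a grant takes effect for every key at once.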
Rotation flow (dual-accept window)
- admin keys rotate <key-id> mints a new key (full token shown once), inserts a row with rotated_from_id = <old-id>, copies client_id and scopes.
- Old key remains valid; both work for a configurable window (default 7 days; expires_at set on the old row).
- Sweeper sets revoked_at = now() at expiry; admin can revoke early.
- Audit log records key_rotated, key_revoked, and key_used_after_rotation_warning (when the old key is used within the grace window).
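The dual-accept window can be illustrated with plain data, leaving out token minting and persistence. KeyRow, rotate, isAccepted, and sweep are hypothetical names, and the new key's own expires_at is left null here as a simplifying assumption.

```typescript
interface KeyRow {
  id: string;
  clientId: string;
  scopes: string[];
  expiresAt: Date | null;
  revokedAt: Date | null;
  rotatedFromId: string | null;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Returns the updated old row and the new row; neither input is mutated.
function rotate(old: KeyRow, newId: string, now: Date, graceDays: number = 7) {
  const graceEnd = new Date(now.getTime() + graceDays * DAY_MS);
  // Never extend an old key that was already due to expire sooner.
  const expiresAt =
    old.expiresAt !== null && old.expiresAt.getTime() < graceEnd.getTime()
      ? old.expiresAt
      : graceEnd;
  const next: KeyRow = {
    id: newId,
    clientId: old.clientId,
    scopes: [...old.scopes],
    expiresAt: null,
    revokedAt: null,
    rotatedFromId: old.id,
  };
  return { old: { ...old, expiresAt }, next };
}

function isAccepted(key: KeyRow, now: Date): boolean {
  if (key.revokedAt !== null) return false;
  return key.expiresAt === null || now.getTime() < key.expiresAt.getTime();
}

// Sweeper: stamps revoked_at on keys whose expiry has passed.
function sweep(keys: KeyRow[], now: Date): KeyRow[] {
  return keys.map((k) =>
    k.revokedAt === null && k.expiresAt !== null && k.expiresAt.getTime() <= now.getTime()
      ? { ...k, revokedAt: now }
      : k,
  );
}
```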
4. Inngest
Inngest Cloud is the orchestrator. Functions execute inside the Node app at POST /api/inngest. Nginx proxies this path with rate limits but no app-layer auth — the Inngest SDK verifies INNGEST_SIGNING_KEY on every incoming request.
Events
| Event | Producer | Consumer | Notes |
|---|---|---|---|
| wa/webhook.received | webhook handler (after signature check) | process-message | Idempotency key = derived event id |
| wa/message.send.requested | send_message tool | send-message | Concurrency key = phone_number_id |
| wa/message.send.completed | send-message function | optional step.waitForEvent for wait=true callers | |
| wa/media.download.requested | process-message (when message has media) | download-media | Streams to disk |
| wa/media.upload.requested | send_media tool | upload-media | For localPath source |
| mcp/client.notify | process-message after persist | in-process handler → SSE push | Fan-out per granted client |
| wa/status.update.received | webhook handler (status updates) | status-updater | Marks delivered/read/failed on messages |
| cron/audit.archive | Inngest cron, daily | archive-audit | Rolls audit_log rows >365d to JSONL.zst |
| cron/messages.retention | Inngest cron, daily | prune-messages | Nulls body/raw on rows >90d |
| cron/media.retention | Inngest cron, daily | prune-media | Deletes files on disk + nulls storage_path >90d |
| cron/rate-limit.gc | Inngest cron, every 10m | gc-rate-limits | Drops rate_limit_buckets >25h |
Concurrency + rate limit interaction
- RPM is enforced before enqueue at the MCP entrypoint (Postgres sliding window). The caller sees an immediate 429 instead of accepting work we’ll just reject.
- Daily cap is enforced inside the send-message Inngest function (after dequeue) so retries don’t double-count.
- Concurrency key phone_number_id on send-message means we never hammer a single Meta number across clients.
5. Notification routing (multi-client)
Map<clientId, Set<McpSession>> in process memory. On connect (post-auth), register the session; on SSE close/error/idle (5 min), deregister.
When mcp/client.notify fires, the in-process handler looks up live sessions for clientId and sends MCP notifications/resources/updated with uri = wamcp://numbers/<phone_number_id>/messages. If no session is open, drop the notification — the message is durable in messages, so the client backfills via get_messages with a since cursor on next connect.
This is correct-by-construction across crashes, restarts, deploys, and offline clients. Notifications are best-effort hints; the DB is the truth.
For a future multi-instance world, the in-process Map is replaced with Postgres LISTEN/NOTIFY or Redis pub-sub. v1 is single-instance.
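A minimal in-process registry matching the description above might look like the following. SessionRegistry and the McpSession.send shape are assumptions made for the sketch; the method name notifications/resources/updated and the wamcp:// URI come from the text.

```typescript
// Hypothetical session handle; the real one would wrap an SSE stream.
interface McpSession {
  send(message: { method: string; params: { uri: string } }): void;
}

class SessionRegistry {
  private readonly sessions = new Map<string, Set<McpSession>>();

  register(clientId: string, session: McpSession): void {
    const set = this.sessions.get(clientId) ?? new Set<McpSession>();
    set.add(session);
    this.sessions.set(clientId, set);
  }

  deregister(clientId: string, session: McpSession): void {
    const set = this.sessions.get(clientId);
    if (!set) return;
    set.delete(session);
    if (set.size === 0) this.sessions.delete(clientId); // avoid leaking empty sets
  }

  // Returns how many sessions were notified; 0 means the hint was dropped.
  notifyMessages(clientId: string, phoneNumberId: string): number {
    const live = this.sessions.get(clientId);
    if (!live) return 0;
    const message = {
      method: "notifications/resources/updated",
      params: { uri: `wamcp://numbers/${phoneNumberId}/messages` },
    };
    let delivered = 0;
    live.forEach((session) => {
      try {
        session.send(message);
        delivered++;
      } catch {
        this.deregister(clientId, session); // a broken stream counts as closed
      }
    });
    return delivered;
  }
}
```

A dropped notification needs no retry queue, since the client recovers by calling get_messages with its since cursor.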
6. Webhook security
- express.raw({ type: 'application/json', limit: '5mb' }) on /webhook/meta only (not global express.json) so the raw body Buffer is preserved.
- HMAC-SHA256(WA_APP_SECRET, rawBody) compared to X-Hub-Signature-256 via crypto.timingSafeEqual. Mismatch → 404 (don’t leak the endpoint to scanners) and webhook_invalid_signature audit row.
- GET verify-token handshake per phone number (column phone_numbers.webhook_verify_token_ref); leaking one token does not enable webhook hijack across the fleet.
- IP allowlist is skipped because Meta does not publish stable source IPs. Signature is cryptographically sufficient.
- Idempotency on inbound: Inngest idempotencyKey = wamid and DB INSERT INTO inngest_idempotency ... ON CONFLICT DO NOTHING. Meta retries aggressively on 5xx; this catches both Inngest retries and Meta retries.
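The signature check from the list above, sketched as a standalone function. verifyMetaSignature is a hypothetical name; the sketch assumes the header carries the usual sha256=<hex digest> form and that rawBody is the untouched Buffer produced by express.raw.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

function verifyMetaSignature(
  rawBody: Buffer,
  signatureHeader: string | undefined,
  appSecret: string,
): boolean {
  if (!signatureHeader || !signatureHeader.startsWith("sha256=")) return false;
  // Malformed hex decodes to a shorter buffer and fails the length check below.
  const received = Buffer.from(signatureHeader.slice("sha256=".length), "hex");
  const expected = createHmac("sha256", appSecret).update(rawBody).digest();
  // timingSafeEqual throws on unequal lengths, so compare lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

A false result maps to the 404 plus webhook_invalid_signature audit row described above. Parsing the JSON should only happen after this returns true, because re-serialising the body would change the bytes being signed.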
7. Rate limiting details
RPM (per-API-key, sliding window)
Bucket key rpm:<api_key_id>:<minute_epoch>. Atomic upsert into rate_limit_buckets. The sliding window is approximated as the current-minute count plus the previous-minute count weighted by the fraction of that minute still inside the trailing 60 seconds. Default 60 req/min non-owner, RL_OWNER_RPM (default 600) for owner.
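The weighting can be made concrete with a small pure function. estimateWindowCount and isWithinRpm are hypothetical names; the two counts would come from the current and previous minute rows in rate_limit_buckets.

```typescript
const WINDOW_MS = 60000;

// Estimated number of requests in the trailing 60 seconds.
function estimateWindowCount(
  previousCount: number,
  currentCount: number,
  nowMs: number,
): number {
  const elapsedMs = nowMs % WINDOW_MS; // how far into the current minute we are
  // The previous minute overlaps the trailing window by the remaining fraction.
  const previousWeight = (WINDOW_MS - elapsedMs) / WINDOW_MS;
  return currentCount + previousCount * previousWeight;
}

function isWithinRpm(
  previousCount: number,
  currentCount: number,
  nowMs: number,
  limit: number,
): boolean {
  return estimateWindowCount(previousCount, currentCount, nowMs) < limit;
}
```

For example, 15 seconds into a minute with 40 requests in the previous bucket and 30 in the current one, the estimate is 30 + 40 × 0.75 = 60, so a key limited to 60 req/min is rejected.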
Daily message cap (per-client per-number, rolling 24h)
Bucket key daily:<client_id>:<phone_number_id>:<hour_epoch>. Sum the last 24 hourly buckets at send time. Default 250 messages/day for non-owner per number; overridable via client_phone_grants.daily_message_cap. Applies only to outbound message tools.
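A sketch of the rolling 24-hour sum, using an in-memory Map in place of rate_limit_buckets. messagesInLast24h, effectiveCap, and canSend are hypothetical names; the bucket key format and the override rule follow the paragraph above.

```typescript
const HOUR_MS = 3600000;

// Sums the current hourly bucket and the 23 before it.
function messagesInLast24h(
  buckets: Map<string, number>,
  clientId: string,
  phoneNumberId: string,
  nowMs: number,
): number {
  const currentHour = Math.floor(nowMs / HOUR_MS);
  let total = 0;
  for (let i = 0; i < 24; i++) {
    const key = `daily:${clientId}:${phoneNumberId}:${currentHour - i}`;
    total += buckets.get(key) ?? 0;
  }
  return total;
}

// The grant-level cap, when set, overrides the key-level default.
function effectiveCap(grantCap: number | null, keyDailyLimit: number): number {
  return grantCap ?? keyDailyLimit;
}

function canSend(
  buckets: Map<string, number>,
  clientId: string,
  phoneNumberId: string,
  nowMs: number,
  grantCap: number | null,
  keyDailyLimit: number,
): boolean {
  const used = messagesInLast24h(buckets, clientId, phoneNumberId, nowMs);
  return used < effectiveCap(grantCap, keyDailyLimit);
}
```

With hourly buckets the window is accurate to within one hour, which is also why the GC job can safely drop rows older than 25 hours.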
429 shape
HTTP layer (Streamable HTTP):
HTTP/1.1 429 Too Many Requests
Retry-After: 17
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1715000060
MCP protocol layer: JSON-RPC error code -32004 with data: { retryAfterSeconds, scope }. Surfaced as a protocol error, not a tool error, so the model backs off instead of retrying with different args.
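Both layers of the 429 can be produced from the same inputs. In this sketch, buildRateLimited, the "rpm" | "daily" values for scope, and the error message string are assumptions; the header names and the -32004 code come from the text.

```typescript
type LimitScope = "rpm" | "daily"; // assumed values for data.scope

function buildRateLimited(
  requestId: string | number | null,
  limit: number,
  resetEpochSec: number,
  nowEpochSec: number,
  scope: LimitScope,
) {
  const retryAfterSeconds = Math.max(0, resetEpochSec - nowEpochSec);
  return {
    status: 429,
    headers: {
      "Retry-After": String(retryAfterSeconds),
      "X-RateLimit-Limit": String(limit),
      "X-RateLimit-Remaining": "0",
      "X-RateLimit-Reset": String(resetEpochSec),
    },
    // JSON-RPC 2.0 error envelope; -32004 sits in the server-defined error range.
    body: {
      jsonrpc: "2.0",
      id: requestId,
      error: {
        code: -32004,
        message: "Rate limit exceeded",
        data: { retryAfterSeconds, scope },
      },
    },
  };
}
```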
8. Secret management
Phases 1–6: .env
- File at /opt/whatsapp-mcp/.env, mode 0600, owned by service user, never committed.
- Loaded only by src/config/env.ts (zod-validated). process.env access elsewhere is a lint error.
- .env.example is committed.
Phase 7 hardening: SOPS + age
- Encrypted secrets.enc.yaml in repo.
- Age private key at /etc/whatsapp-mcp/age.key (mode 0400, root).
- whatsapp-mcp-secrets.service (systemd) decrypts to tmpfs /run/whatsapp-mcp/secrets/ at boot (mode 0750, never disk).
- Compose bind-mounts read-only as /run/secrets. Each secret is its own file.
- Node reads files at startup (env-var leakage via /proc/<pid>/environ is avoided).
- phone_numbers.*_ref columns hold secrets://name strings; a resolver maps to file paths. Migration to Vault/SSM is later a one-line change in the resolver.
- Back up the age key offline (printed paper in a safe, hardware token); the encrypted file is safe to back up alongside the repo.
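The resolver mentioned in the list above could be as small as the following. resolveSecretPath is a hypothetical name, and the allowed character set for secret names is an assumption chosen to cover names such as wa_app_secret and wa_access_token_<id>.

```typescript
import { posix } from "node:path";

const SECRET_REF_RE = /^secrets:\/\/([A-Za-z0-9_-]+)$/;

// Maps "secrets://name" to a file under the mounted secrets directory.
function resolveSecretPath(ref: string, root: string = "/run/secrets"): string {
  const match = SECRET_REF_RE.exec(ref);
  // Rejecting dots and slashes means a ref can never point outside root.
  if (!match) throw new Error("invalid secret reference");
  return posix.join(root, match[1]);
}
```

Reading the file (for example with fs.readFileSync at startup) stays with the caller, so moving to Vault or SSM would only replace this mapping.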
9. Canonical env vars
# Meta
WA_GRAPH_API_VERSION=v23.0
WA_APP_SECRET= # X-Hub-Signature-256, per Meta App (shared by all numbers in the App)
WA_WEBHOOK_VERIFY_TOKEN= # v1 single-number convenience; per-number tokens via phone_numbers.webhook_verify_token_ref later
WA_DEFAULT_PHONE_NUMBER_ID= # v1 single number convenience
WA_DEFAULT_WABA_ID=
WA_DEFAULT_ACCESS_TOKEN=
# Server
APP_PUBLIC_URL=https://wa.<yourdomain>
APP_HTTP_PORT=3000
APP_BIND=127.0.0.1
MCP_TRANSPORT=http # 'http' | 'stdio'
NODE_ENV=production
LOG_LEVEL=info
# Postgres
DATABASE_URL=postgres://wa:***@postgres:5432/wa_mcp
# Inngest Cloud
INNGEST_EVENT_KEY=
INNGEST_SIGNING_KEY=
INNGEST_SERVE_PATH=/api/inngest
# Media
MEDIA_ROOT=/var/lib/whatsapp-mcp/media
MEDIA_SIGNING_SECRET= # HMAC for signed download URLs
MEDIA_URL_TTL_SECONDS=300
# Auth
API_KEY_PEPPER= # 32 random bytes, base64-encoded
LOCAL_OWNER_CLIENT_ID= # uuid of the is_owner=true row
# Rate limits (defaults; overridable per-key/grant in DB)
RL_DEFAULT_RPM=60
RL_DEFAULT_DAILY_MSGS=250
RL_OWNER_RPM=600
RL_OWNER_DAILY=10000
10. Threat model (brief)
| Threat | Mitigation |
|---|---|
| Leaked API key (committed to git, pasted in chat) | wamcp_ prefix in GitHub Secret Scanning; default 90d expires_at; last_used_ip shift triggers audit alert; revocation effective on next request (no caching) |
| Compromised Ubuntu host | Per-WABA tokens (one compromise doesn’t reach others); audit log rsynced hourly to a separate host so a wiped local DB still leaves a trail; Meta tokens scoped to minimum WABA permissions |
| Malicious client tries to send from a number they weren’t granted | Two-layer key+grant check; grants are the authoritative truth, revoking a grant is instant; scope_denied / grant_denied audit rows trigger alerts on N denials/min |
| Replayed Meta webhook | Inngest idempotencyKey = wamid + DB inngest_idempotency ON CONFLICT DO NOTHING; processing is exactly-once even under aggressive Meta retries |
| Leaked Meta access token | System User tokens scoped to one WABA, rotatable from Meta dashboard; phone_numbers.access_token_secret_ref indirection means rotation = update one file + restart container; all attempts audited |
| Accidental media URL exposure | Nginx never serves /var/lib/whatsapp-mcp/media; downloads go only through the media:read-scoped MCP tool which streams after auth + scope + grant check; signed URLs additionally bind tenancy; storage_path is relative so a path-traversal bug can’t escape the media root |
Principles applied throughout
- Fail closed. Auth, scope, grant, rate-limit checks all default-deny on error.
- Single auth path. No back-door admin endpoints. Admin is CLI-only, run inside the container.
- Tokens live in memory only as long as they must. Read at startup, not from env vars; rotations restart the container.
- Audit log is append-only at the DB. Postgres role for the app has INSERT, SELECT on audit_log only. Retention archival uses a separate role.