Architecture — Cross-Cutting Design

This document describes the parts of the system that span every phase. Read this before any phase file. Phase files describe what gets built; this file describes what gets honoured.

1. Database schema (the spine)

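All tables use id uuid PRIMARY KEY DEFAULT gen_random_uuid() and timestamptz for time columns, unless a table lists a different key (audit_log uses bigserial, inngest_idempotency uses a text key). gen_random_uuid() is built into Postgres 13 and later; older versions need the pgcrypto extension. FK actions: ON DELETE RESTRICT by default; explicit exceptions noted.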

Core tenancy tables


clients
  id              uuid PK
  name            text unique               -- kebab-case, e.g. "internal-projectX"
  display_name    text
  is_owner        bool                      -- exactly one row may have this true
  disabled_at     timestamptz null          -- gates all auth when set
  metadata        jsonb
  created_at      timestamptz default now()
  -- partial unique (is_owner) WHERE is_owner enforces single-owner

api_keys
  id                 uuid PK
  client_id          uuid FK clients(id) ON DELETE CASCADE
  label              text                    -- "claude-desktop-laptop"
  prefix             text unique             -- 15 chars (wamcp_<env>_ + first 4 of secret), indexed lookup column
  hash               bytea                   -- 32B HMAC-SHA256(pepper, full token)
  scopes             jsonb                   -- array of scope strings, see §3
  rpm_limit          int not null default 60
  daily_msg_limit    int not null default 250
  expires_at         timestamptz null        -- hard cut-off
  revoked_at         timestamptz null        -- soft delete
  rotated_from_id    uuid null               -- self-FK; rotation lineage
  last_used_at       timestamptz null        -- updated async (batched)
  last_used_ip       inet null
  created_at         timestamptz default now()
  -- indexes: unique(prefix), partial (client_id) WHERE revoked_at IS NULL,
  --          partial (expires_at) WHERE expires_at IS NOT NULL

phone_numbers
  id                          uuid PK
  wa_phone_number_id          text unique not null      -- Meta's numeric id
  wa_business_account_id      text not null
  display_number              text                      -- E.164 with +, humans only
  label                       text
  access_token_secret_ref     text                      -- "secrets://wa_access_token_<id>"
  webhook_verify_token_ref    text                      -- per-number to limit blast radius
  disabled_at                 timestamptz null
  created_at                  timestamptz default now()
  -- NOTE: no app_secret column. App Secret is per Meta App, not per number.
  -- All numbers in this WABA share the App Secret read from WA_APP_SECRET
  -- (secrets://wa_app_secret). If we ever split into multiple Meta Apps,
  -- add a nullable app_secret_ref column then.

client_phone_grants     -- M:N — what each client may do with which number
  id                  uuid PK
  client_id           uuid FK clients(id)
  phone_number_id     uuid FK phone_numbers(id)
  allowed_tools       jsonb not null            -- e.g. ["send_message","get_messages"]
  daily_message_cap   int null                  -- override of api_keys.daily_msg_limit
  revoked_at          timestamptz null
  created_at          timestamptz default now()
  -- unique (client_id, phone_number_id) WHERE revoked_at IS NULL

Conversation tables


contacts
  id                 uuid PK
  phone_number_id    uuid FK phone_numbers(id)
  wa_id              text not null              -- E.164 with no leading +
  profile_name       text                       -- from Meta contacts[].profile.name
  display_name       text                       -- local override
  metadata           jsonb
  first_seen_at      timestamptz default now()
  last_seen_at       timestamptz default now()
  -- unique (phone_number_id, wa_id)

messages
  id                 uuid PK
  phone_number_id    uuid FK phone_numbers(id)
  client_id          uuid FK clients(id)        -- nullable for inbound (number-owned)
  direction          text                       -- 'inbound' | 'outbound'
  wa_message_id      text unique                -- Meta's wamid.xxx
  contact_id         uuid FK contacts(id)
  message_type       text                       -- text|template|image|document|audio|video|sticker|reaction|interactive
  body               text                       -- plaintext for text/caption
  template_name      text null
  media_object_id    uuid FK media_objects(id) null
  status             text                       -- queued|sent|delivered|read|failed (out); received|read (in)
  error_code         int null
  reply_to_wamid     text null
  payload            jsonb                      -- normalised body (e.g. interactive selectedId)
  raw                jsonb                      -- original Meta payload (inbound), sanitised
  ts                 timestamptz                -- Meta event time
  created_at         timestamptz default now()
  -- indexes: unique(wa_message_id),
  --          (phone_number_id, ts DESC),
  --          (client_id, ts DESC) WHERE direction='outbound',
  --          (phone_number_id, contact_id, ts DESC) WHERE direction='inbound'

media_objects
  id                 uuid PK
  phone_number_id    uuid FK phone_numbers(id)
  direction          text                       -- inbound|outbound
  wa_media_id        text                       -- Meta media id (inbound)
  mime_type          text
  sha256             bytea
  size_bytes         bigint
  storage_path       text                       -- relative under MEDIA_ROOT, never absolute
  downloaded_at      timestamptz null           -- null for outbound
  created_at         timestamptz default now()

Operational tables


audit_log                  -- append-only; app role has INSERT,SELECT only
  id                 bigserial PK
  ts                 timestamptz default now()
  client_id          uuid FK clients(id) ON DELETE SET NULL null
  api_key_id         uuid FK api_keys(id) ON DELETE SET NULL null
  request_id         text                       -- propagated through Inngest
  ip                 inet null
  user_agent         text null
  action             text                       -- see action vocabulary below
  tool_name          text null
  phone_number_id    uuid FK phone_numbers(id) ON DELETE SET NULL null
  wa_message_id      text null
  payload_hash       bytea null                 -- SHA-256 of canonicalised input
  error_code         text null
  latency_ms         int null
  metadata           jsonb null
  -- indexes: (client_id, ts DESC), (api_key_id, ts DESC), (action, ts DESC),
  --          (wa_message_id) WHERE wa_message_id IS NOT NULL, BRIN on ts

rate_limit_buckets         -- per-window counters, combined into sliding windows (see §7)
  id              uuid PK
  key             text                          -- 'rpm:<api_key_id>:<minute>' or 'daily:<client_id>:<phone>:<hour>'
  window_start    timestamptz
  count           int not null default 0
  updated_at      timestamptz default now()
  -- unique (key, window_start); partial index on stale rows for GC

inngest_idempotency        -- dedupe Meta webhook retries
  event_id        text PK                       -- wamid for inbound, our request-id for outbound
  first_seen_at   timestamptz default now()
  outcome         text                          -- processed|duplicate|failed

`action` vocabulary for audit_log

key_minted, key_rotated, key_revoked, key_used, key_used_after_rotation_warning, auth_failed, scope_denied, grant_denied, rate_limited, tool_called, send_attempt, send_success, send_failed, webhook_received, webhook_invalid_signature, webhook_duplicate, media_uploaded, media_downloaded, grant_added, grant_revoked, client_disabled.

Row-level isolation strategy

App-enforced WHERE clauses, not Postgres RLS. Justification: we issue one DB role for the app; RLS shines when each tenant gets its own role, which we don’t do. A single missed WHERE client_id = $1 would be a tenancy bug, so we centralise all queries on tenant tables in src/db/scoped.ts, where every function takes clientId as a mandatory first argument. An ESLint rule forbids raw pool.query against tenant tables outside that module. Cross-tenant isolation tests in CI assert that client B cannot read or write client A’s data via any tool, scope, or grant path. The schema stays RLS-ready in case we change our mind.
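
A minimal sketch of what one function in src/db/scoped.ts could look like. The function name, the selected columns, and the Queryable interface (a stand-in for the pg pool) are assumptions for illustration; only the "clientId first, always bound as $1" convention comes from the text above.

```typescript
// Minimal stand-in for a pg Pool; only the query method is modelled here.
interface Queryable {
  query(text: string, params: unknown[]): Promise<{ rows: unknown[] }>;
}

// Hypothetical scoped query: clientId is the mandatory first data argument
// and is always bound as $1, so the tenant predicate cannot be forgotten.
function listOutboundMessages(
  db: Queryable,
  clientId: string,
  phoneNumberId: string,
  limit = 50,
): Promise<{ rows: unknown[] }> {
  // Fail closed: an empty clientId must never turn into an unscoped query.
  if (!clientId) throw new Error("clientId is required");
  return db.query(
    `SELECT id, wa_message_id, status, ts
       FROM messages
      WHERE client_id = $1
        AND phone_number_id = $2
        AND direction = 'outbound'
      ORDER BY ts DESC
      LIMIT $3`,
    [clientId, phoneNumberId, limit],
  );
}
```

Callers outside the module never write SQL against tenant tables; they call functions like this one and cannot omit the tenant id.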

2. Auth pipeline (request order)


Nginx (TLS, strips client X-Forwarded-* / X-Real-Ip, adds X-Request-Id)
  → Express trust proxy = 1
  → request-id middleware (CLS / AsyncLocalStorage propagation into Inngest)
  → body parser (express.raw on /webhook/*, express.json on /mcp/*)
  → authMiddleware
       parse Bearer wamcp_live_...
       lookup api_keys by 15-char prefix
       HMAC-SHA256(pepper, token) constant-time compare against hash
       attach req.auth = { clientId, apiKeyId, scopes, rpmLimit, dailyMsgLimit }
       fail-closed: any error here → 401
  → client-enabled check (clients.disabled_at IS NULL)
  → per-key RPM rate-limit check (Postgres sliding window)
  → MCP dispatcher (validates JSON-RPC envelope)
  → tool layer:
       parse input via tool's zod schema
       requireScope(toolName)              -- key has tools:<name> or tools:*
       requireGrant(clientId, phoneNumberId, toolName)
                                           -- client_phone_grants row exists,
                                           --   not revoked, contains toolName
       (outbound only) per-client daily-cap check
       run tool body
  → response post-processor (audit log emit + rate-limit bucket increment)

Stdio mode skips Express entirely. Bootstrap synthesises req.auth = { clientId: ownerClientId, apiKeyId: null, scopes: ['*'], rpmLimit: RL_OWNER_RPM, dailyMsgLimit: RL_OWNER_DAILY }. The grants check still runs, so the owner cannot accidentally use an un-granted number. Audit rows have api_key_id = NULL and metadata.transport = 'stdio'.

3. API keys, scopes, and grants

Key format

wamcp_<env>_<28 chars Crockford base32> where <env> ∈ {live, test}.

140 bits of entropy.
The wamcp_ prefix is registered with GitHub Secret Scanning so leaks get caught.
The DB lookup prefix is wamcp_<env>_<first 4 of secret>, indexed. That is 15 characters: 11 for wamcp_<env>_ (both env values are 4 letters) plus 4 from the secret.
Stored as hash = HMAC_SHA256(pepper, full_token). HMAC, not Argon2id, because the key already carries enough entropy that slow hashing buys nothing and would dominate request latency. The pepper at /run/secrets/api_key_pepper means a stolen DB dump alone can’t replay against the auth endpoint.
Shown once at mint, printed to stderr (not stdout, so it doesn’t end up piped into logs).
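
The key format and hashing rules above can be sketched with Node's built-in crypto module. The function names (mintToken, lookupPrefix, hashToken, verifyToken) are hypothetical, and passing the pepper in as a Buffer is an assumption about how the secret file is loaded.

```typescript
import { createHmac, randomBytes, timingSafeEqual } from "node:crypto";

// Crockford base32 alphabet: 32 symbols, omitting I, L, O and U.
const CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ";
const TOKEN_RE = /^wamcp_(live|test)_[0-9A-HJKMNP-TV-Z]{28}$/;

// 28 symbols x 5 bits = 140 bits. Masking each random byte to 5 bits is
// unbiased because 256 is a multiple of 32.
function mintToken(env: "live" | "test"): string {
  const secret = Array.from(randomBytes(28), (b) => CROCKFORD.charAt(b & 31)).join("");
  return `wamcp_${env}_${secret}`;
}

// Lookup prefix = wamcp_<env>_ plus the first 4 characters of the secret.
function lookupPrefix(token: string): string {
  if (!TOKEN_RE.test(token)) throw new Error("malformed token");
  const secretStart = token.lastIndexOf("_") + 1; // the secret contains no "_"
  return token.slice(0, secretStart + 4);
}

// Stored hash: HMAC-SHA256 keyed with the pepper, over the full token.
function hashToken(pepper: Buffer, token: string): Buffer {
  return createHmac("sha256", pepper).update(token, "utf8").digest();
}

// Constant-time comparison; the length check avoids timingSafeEqual throwing.
function verifyToken(pepper: Buffer, token: string, storedHash: Buffer): boolean {
  const candidate = hashToken(pepper, token);
  return candidate.length === storedHash.length && timingSafeEqual(candidate, storedHash);
}
```

At request time the middleware would use lookupPrefix to find the row and verifyToken to compare against the stored hash column.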

Scopes (stored as JSONB string array)

Scope	Meaning
`tools:send_message`	may invoke `send_message`
`tools:send_template`	may invoke `send_template`
`tools:send_media`	may invoke `send_media`
`tools:send_interactive_buttons`	may invoke `send_interactive_buttons`
`tools:send_interactive_list`	may invoke `send_interactive_list`
`tools:get_messages`	may read inbound
`tools:get_media_url`	may resolve a signed media URL
`tools:list_chats`	may list conversations
`tools:get_contact`	may resolve contacts
`tools:mark_read`	may send read receipts
`tools:*`	all tool scopes — owner-only
`numbers:<phone_number_id>`	may operate against this WABA number
`numbers:*`	any granted number — owner-only
`media:read` / `media:write`	inbound download / outbound upload
`admin:*`	management plane — owner-only, never minted to third parties

Two-layer authorisation

A request must satisfy both:

api_keys.scopes contains (tools:<tool> OR tools:*) AND (numbers:<phone_number_id> OR numbers:*).
client_phone_grants row exists for (client_id, phone_number_id), not revoked, with <tool> in allowed_tools.

Why two layers: rotating a key shouldn’t re-grant numbers; revoking a number from a client should affect every key they hold instantly without touching key rows. Wildcards are checked at mint and at request time (defence in depth).
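
A sketch of the two-layer check as a pure function, assuming grants have already been loaded for the client. The type names and the authorise function are hypothetical; the denial reasons reuse the scope_denied and grant_denied audit actions.

```typescript
interface AuthContext {
  clientId: string;
  scopes: string[];
}

interface Grant {
  clientId: string;
  phoneNumberId: string;
  allowedTools: string[];
  revokedAt: Date | null;
}

type Decision = { ok: true } | { ok: false; reason: "scope_denied" | "grant_denied" };

// A scope matches either exactly or through the wildcard of the same family.
function hasScope(scopes: string[], family: "tools" | "numbers", value: string): boolean {
  return scopes.includes(`${family}:${value}`) || scopes.includes(`${family}:*`);
}

function authorise(
  auth: AuthContext,
  grants: Grant[],
  tool: string,
  phoneNumberId: string,
): Decision {
  // Layer 1: the key itself must carry both the tool and the number scope.
  if (!hasScope(auth.scopes, "tools", tool) || !hasScope(auth.scopes, "numbers", phoneNumberId)) {
    return { ok: false, reason: "scope_denied" };
  }
  // Layer 2: the client must hold a live grant on this number listing the tool.
  const grant = grants.find(
    (g) =>
      g.clientId === auth.clientId &&
      g.phoneNumberId === phoneNumberId &&
      g.revokedAt === null,
  );
  if (!grant || !grant.allowedTools.includes(tool)) {
    return { ok: false, reason: "grant_denied" };
  }
  return { ok: true };
}
```

Note that a wildcard scope never bypasses layer 2: a key with tools:* and numbers:* is still denied when no live grant exists.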

Rotation flow (dual-accept window)

admin keys rotate <key-id> mints a new key (full token shown once), inserts a row with rotated_from_id = <old-id>, copies client_id and scopes.
Old key remains valid; both work for a configurable window (default 7 days; expires_at set on the old row).
Sweeper sets revoked_at = now() at expiry; admin can revoke early.
Audit log records key_rotated, key_revoked, and key_used_after_rotation_warning (when the old key is used within the grace window).
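
The rotation steps above can be sketched as pure row transformations. The ApiKeyRow shape and function names are hypothetical. Two choices here are assumptions rather than something the text specifies: the new key starts with no expiry, and an old key that was already due to expire sooner is never extended by the grace window.

```typescript
interface ApiKeyRow {
  id: string;
  clientId: string;
  scopes: string[];
  expiresAt: Date | null;
  revokedAt: Date | null;
  rotatedFromId: string | null;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Returns the updated old row and the new row; persisting both is the caller's job.
function rotateKey(
  old: ApiKeyRow,
  newId: string,
  now: Date,
  graceDays = 7,
): { oldRow: ApiKeyRow; newRow: ApiKeyRow } {
  const graceEnd = new Date(now.getTime() + graceDays * DAY_MS);
  // Assumption: keep an earlier existing expiry instead of extending it.
  const keepExisting = old.expiresAt !== null && old.expiresAt.getTime() < graceEnd.getTime();
  return {
    oldRow: { ...old, expiresAt: keepExisting ? old.expiresAt : graceEnd },
    newRow: {
      id: newId,
      clientId: old.clientId, // copied from the old key
      scopes: [...old.scopes], // copied from the old key
      expiresAt: null, // assumption: no expiry on the new key in this sketch
      revokedAt: null,
      rotatedFromId: old.id, // rotation lineage
    },
  };
}

// A key is usable until it is revoked or its hard cut-off has passed.
function isUsable(key: ApiKeyRow, now: Date): boolean {
  if (key.revokedAt !== null && key.revokedAt.getTime() <= now.getTime()) return false;
  if (key.expiresAt !== null && key.expiresAt.getTime() <= now.getTime()) return false;
  return true;
}
```

During the grace window isUsable is true for both rows, which is the dual-accept behaviour; once the old row's expires_at passes, only the new key authenticates.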

4. Inngest

Inngest Cloud is the orchestrator. Functions execute inside the Node app at POST /api/inngest. Nginx proxies this path with rate limits but no app-layer auth; the Inngest SDK verifies the signature of every incoming request using INNGEST_SIGNING_KEY.

Events

Event	Producer	Consumer	Notes
`wa/webhook.received`	webhook handler (after signature check)	`process-message`	Idempotency key = derived event id
`wa/message.send.requested`	`send_message` tool	`send-message`	Concurrency key = `phone_number_id`
`wa/message.send.completed`	`send-message` function	optional `step.waitForEvent`	For `wait=true` callers
`wa/media.download.requested`	`process-message` (when message has media)	`download-media`	Streams to disk
`wa/media.upload.requested`	`send_media` tool	`upload-media`	For `localPath` source
`mcp/client.notify`	`process-message` after persist	in-process handler → SSE push	Fan-out per granted client
`wa/status.update.received`	webhook handler (status updates)	`status-updater`	Marks delivered/read/failed on messages
`cron/audit.archive`	Inngest cron, daily	`archive-audit`	Rolls `audit_log` rows >365d to JSONL.zst
`cron/messages.retention`	Inngest cron, daily	`prune-messages`	Nulls body/raw on rows >90d
`cron/media.retention`	Inngest cron, daily	`prune-media`	Deletes files on disk + nulls `storage_path` >90d
`cron/rate-limit.gc`	Inngest cron, every 10m	`gc-rate-limits`	Drops `rate_limit_buckets` >25h

Concurrency + rate limit interaction

RPM is enforced before enqueue at the MCP entrypoint (Postgres sliding window), so an over-limit caller gets an immediate 429 instead of having work accepted now and rejected later.
Daily cap is enforced inside the send-message Inngest function (after dequeue) so retries don’t double-count.
Concurrency key phone_number_id on send-message means we never hammer a single Meta number across clients.

5. Notification routing (multi-client)

Map<clientId, Set<McpSession>> in process memory. On connect (post-auth), register the session; on SSE close/error/idle (5 min), deregister.

When mcp/client.notify fires, the in-process handler looks up live sessions for clientId and sends MCP notifications/resources/updated with uri = wamcp://numbers/<phone_number_id>/messages. If no session is open, drop the notification — the message is durable in messages, so the client backfills via get_messages with a since cursor on next connect.

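Because no delivery path depends on a notification arriving, this holds across crashes, restarts, deploys, and offline clients: a missed notification only delays the client until its next get_messages backfill. Notifications are best-effort hints; the DB is the truth.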

For a future multi-instance world, the in-process Map is replaced with Postgres LISTEN/NOTIFY or Redis pub-sub. v1 is single-instance.
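
A minimal in-process registry along the lines described in this section. The McpSession interface is a hypothetical stand-in for the real SDK session object, and the idle-timeout wiring is left out.

```typescript
// Hypothetical session shape; the real MCP session type will differ.
interface McpSession {
  id: string;
  notify(method: string, params: { uri: string }): void;
}

class SessionRegistry {
  private readonly byClient = new Map<string, Set<McpSession>>();

  // Called on connect, after auth has succeeded.
  register(clientId: string, session: McpSession): void {
    let sessions = this.byClient.get(clientId);
    if (!sessions) {
      sessions = new Set();
      this.byClient.set(clientId, sessions);
    }
    sessions.add(session);
  }

  // Called on SSE close, error, or idle timeout.
  deregister(clientId: string, session: McpSession): void {
    const sessions = this.byClient.get(clientId);
    if (!sessions) return;
    sessions.delete(session);
    if (sessions.size === 0) this.byClient.delete(clientId);
  }

  // Returns how many sessions were notified; 0 means the hint was dropped.
  notifyMessages(clientId: string, phoneNumberId: string): number {
    const sessions = this.byClient.get(clientId);
    if (!sessions) return 0;
    const uri = `wamcp://numbers/${phoneNumberId}/messages`;
    let sent = 0;
    sessions.forEach((session) => {
      try {
        session.notify("notifications/resources/updated", { uri });
        sent += 1;
      } catch {
        // Best-effort: a failed push is ignored; the client backfills later.
      }
    });
    return sent;
  }
}
```

Returning 0 when no session is open is the "drop the notification" branch; nothing is queued, because the message row is already durable.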

6. Webhook security

express.raw({ type: 'application/json', limit: '5mb' }) on /webhook/meta only (not global express.json) so the raw body Buffer is preserved.
HMAC-SHA256(WA_APP_SECRET, rawBody) compared to the hex digest in X-Hub-Signature-256 (sent as sha256=<hex>) via crypto.timingSafeEqual. Mismatch → 404 (don’t leak the endpoint to scanners) and a webhook_invalid_signature audit row.
GET verify-token handshake per phone number (column phone_numbers.webhook_verify_token_ref); leaking one token does not enable webhook hijack across the fleet.
IP allowlist is skipped — Meta does not publish stable source IPs. Signature is cryptographically sufficient.
Idempotency on inbound: Inngest idempotencyKey = wamid and DB INSERT INTO inngest_idempotency ... ON CONFLICT DO NOTHING. Meta retries aggressively on 5xx; this catches both Inngest retries and Meta retries.
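
The signature check can be sketched with Node's crypto module alone. The function name is hypothetical; it takes the raw body Buffer that express.raw preserves and the header value as received.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const SIGNATURE_PREFIX = "sha256=";

// Returns true only when the header carries a valid HMAC-SHA256 of the raw body.
function isValidMetaSignature(
  appSecret: string,
  rawBody: Buffer,
  signatureHeader: string | undefined,
): boolean {
  // Fail closed on a missing or malformed header.
  if (!signatureHeader || !signatureHeader.startsWith(SIGNATURE_PREFIX)) return false;
  const received = Buffer.from(signatureHeader.slice(SIGNATURE_PREFIX.length), "hex");
  const expected = createHmac("sha256", appSecret).update(rawBody).digest();
  // timingSafeEqual throws on unequal lengths, so compare lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

The comparison has to run over the exact bytes Meta sent; re-serialising a parsed JSON body can change whitespace or key order and break the digest, which is why the raw Buffer matters.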

7. Rate limiting details

RPM (per-API-key, sliding window)

Bucket key rpm:<api_key_id>:<minute_epoch>. Atomic upsert into rate_limit_buckets. The sliding window is approximated from two fixed buckets: the current minute’s count plus the previous minute’s count scaled by the fraction of that minute still inside the trailing 60 seconds, i.e. (60 − elapsed_seconds) / 60. Default 60 req/min non-owner, RL_OWNER_RPM (default 600) for owner.
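
A worked sketch of the weighted sum, assuming the two bucket counts have already been read from rate_limit_buckets. Function names are hypothetical, and treating "estimate below the limit" as the allow condition is an assumption.

```typescript
const WINDOW_SECONDS = 60;

// Estimated requests in the trailing 60 s, from two fixed one-minute buckets.
function estimateRequests(prevCount: number, currCount: number, nowMs: number): number {
  // Seconds elapsed since the start of the current minute bucket.
  const elapsedSeconds = (nowMs / 1000) % WINDOW_SECONDS;
  // Share of the previous minute that still overlaps the trailing window.
  const prevWeight = (WINDOW_SECONDS - elapsedSeconds) / WINDOW_SECONDS;
  return currCount + prevCount * prevWeight;
}

// Assumption: a request is allowed while the estimate is below the limit.
function isWithinRpm(prevCount: number, currCount: number, nowMs: number, limit = 60): boolean {
  return estimateRequests(prevCount, currCount, nowMs) < limit;
}
```

For example, 15 seconds into a minute the previous bucket carries a weight of 0.75, so a previous count of 60 and a current count of 0 gives an estimate of 45 and the request passes, while a current count of 30 gives 75 and the request is rejected.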

Daily message cap (per-client per-number, rolling 24h)

Bucket key daily:<client_id>:<phone_number_id>:<hour_epoch>. Sum the last 24 hourly buckets at send time. Default 250 messages/day for non-owner per number; overridable via client_phone_grants.daily_message_cap. Applies only to outbound message tools.
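
The rolling 24-hour sum can be sketched over an in-memory map standing in for rate_limit_buckets. The helper names are hypothetical; the key layout and the grant override follow the text above.

```typescript
const HOUR_MS = 60 * 60 * 1000;

function hourEpoch(nowMs: number): number {
  return Math.floor(nowMs / HOUR_MS);
}

function dailyBucketKey(clientId: string, phoneNumberId: string, hour: number): string {
  return `daily:${clientId}:${phoneNumberId}:${hour}`;
}

// Sums the current hour bucket and the 23 before it.
function rollingDailyCount(
  buckets: Map<string, number>,
  clientId: string,
  phoneNumberId: string,
  nowMs: number,
): number {
  const current = hourEpoch(nowMs);
  let total = 0;
  for (let hour = current - 23; hour <= current; hour += 1) {
    total += buckets.get(dailyBucketKey(clientId, phoneNumberId, hour)) ?? 0;
  }
  return total;
}

// The grant's daily_message_cap, when set, overrides the key's daily_msg_limit.
function effectiveDailyCap(keyDailyLimit: number, grantCap: number | null): number {
  return grantCap ?? keyDailyLimit;
}
```

A send would be allowed while rollingDailyCount stays below effectiveDailyCap. Buckets older than 24 hours never enter the sum, which is what lets the GC job drop them safely.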

429 shape

HTTP layer (Streamable HTTP):


HTTP/1.1 429 Too Many Requests
Retry-After: 17
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1715000060

MCP protocol layer: JSON-RPC error code -32004 with data: { retryAfterSeconds, scope }. Surfaced as a protocol error, not a tool error, so the model backs off instead of retrying with different args.
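
Both shapes can be produced from the same limiter state. This is a sketch: the function names, the error message text, and the example scope value "rpm" are assumptions; the header names and error code come from the section above.

```typescript
interface RateLimitState {
  limit: number;
  remaining: number;
  resetEpochSeconds: number;
}

// Headers for the Streamable HTTP layer.
function rateLimitHeaders(state: RateLimitState, nowEpochSeconds: number): Record<string, string> {
  return {
    "Retry-After": String(Math.max(0, state.resetEpochSeconds - nowEpochSeconds)),
    "X-RateLimit-Limit": String(state.limit),
    "X-RateLimit-Remaining": String(state.remaining),
    "X-RateLimit-Reset": String(state.resetEpochSeconds),
  };
}

// JSON-RPC error for the MCP protocol layer (a protocol error, not a tool error).
function rateLimitRpcError(id: string | number | null, retryAfterSeconds: number, scope: string) {
  return {
    jsonrpc: "2.0" as const,
    id,
    error: {
      code: -32004,
      message: "Rate limit exceeded", // wording is an assumption
      data: { retryAfterSeconds, scope },
    },
  };
}
```

With a reset time of 1715000060 and a current time of 1715000043, rateLimitHeaders yields Retry-After: 17, matching the HTTP example above.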

8. Secret management

Phases 1–6: `.env`

File at /opt/whatsapp-mcp/.env, mode 0600, owned by service user, never committed.
Loaded only by src/config/env.ts (zod-validated). process.env access elsewhere is a lint error.
.env.example is committed.

Phase 7 hardening: SOPS + age

Encrypted secrets.enc.yaml in repo.
Age private key at /etc/whatsapp-mcp/age.key (mode 0400, root).
whatsapp-mcp-secrets.service (systemd) decrypts to tmpfs /run/whatsapp-mcp/secrets/ at boot (mode 0750, never disk).
Compose bind-mounts read-only as /run/secrets. Each secret is its own file.
Node reads files at startup (env-var leakage via /proc/<pid>/environ is avoided).
phone_numbers.*_ref columns hold secrets://name strings; a resolver maps them to file paths. Migrating to Vault/SSM later is then a one-line change in the resolver.
Back up the age key offline (printed paper in a safe, hardware token); the encrypted file is safe to back up alongside the repo.
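
A sketch of the secrets:// resolver for the file-backed case. The function names and the allowed character set for secret names are assumptions; restricting the name to a fixed character set keeps a bad ref from escaping the secrets directory.

```typescript
import { readFileSync } from "node:fs";
import { posix } from "node:path";

const SECRET_SCHEME = "secrets://";
// Assumption: secret names are limited to letters, digits, "_" and "-".
const SECRET_REF = /^secrets:\/\/[A-Za-z0-9_-]+$/;

// Maps "secrets://name" to "<root>/name"; rejects anything else.
function resolveSecretPath(ref: string, root = "/run/secrets"): string {
  if (!SECRET_REF.test(ref)) throw new Error(`invalid secret ref: ${ref}`);
  return posix.join(root, ref.slice(SECRET_SCHEME.length));
}

// Reads the secret once; intended to be called at startup only.
function readSecret(ref: string, root = "/run/secrets"): string {
  return readFileSync(resolveSecretPath(ref, root), "utf8").trim();
}
```

Swapping the backing store later would mean replacing the body of readSecret while the *_ref columns keep their secrets://name values.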

9. Canonical env vars


# Meta
WA_GRAPH_API_VERSION=v23.0
WA_APP_SECRET=                 # X-Hub-Signature-256, per Meta App (shared by all numbers in the App)
WA_WEBHOOK_VERIFY_TOKEN=       # v1 single-number convenience; per-number tokens via phone_numbers.webhook_verify_token_ref later
WA_DEFAULT_PHONE_NUMBER_ID=    # v1 single number convenience
WA_DEFAULT_WABA_ID=
WA_DEFAULT_ACCESS_TOKEN=

# Server
APP_PUBLIC_URL=https://wa.<yourdomain>
APP_HTTP_PORT=3000
APP_BIND=127.0.0.1
MCP_TRANSPORT=http             # 'http' | 'stdio'
NODE_ENV=production
LOG_LEVEL=info

# Postgres
DATABASE_URL=postgres://wa:***@postgres:5432/wa_mcp

# Inngest Cloud
INNGEST_EVENT_KEY=
INNGEST_SIGNING_KEY=
INNGEST_SERVE_PATH=/api/inngest

# Media
MEDIA_ROOT=/var/lib/whatsapp-mcp/media
MEDIA_SIGNING_SECRET=          # HMAC for signed download URLs
MEDIA_URL_TTL_SECONDS=300

# Auth
API_KEY_PEPPER=                # 32 random bytes, base64-encoded
LOCAL_OWNER_CLIENT_ID=         # uuid of the is_owner=true row

# Rate limits (defaults; overridable per-key/grant in DB)
RL_DEFAULT_RPM=60
RL_DEFAULT_DAILY_MSGS=250
RL_OWNER_RPM=600
RL_OWNER_DAILY=10000

10. Threat model (brief)

Threat	Mitigation
Leaked API key (committed to git, pasted in chat)	`wamcp_` prefix in GitHub Secret Scanning; default 90d `expires_at`; `last_used_ip` shift triggers audit alert; revocation effective on next request (no caching)
Compromised Ubuntu host	Per-WABA tokens (one compromise doesn’t reach others); audit log rsynced hourly to a separate host so a wiped local DB still leaves a trail; Meta tokens scoped to minimum WABA permissions
Malicious client tries to send from a number they weren’t granted	Two-layer key+grant check; grants are the authoritative truth, revoking a grant is instant; `scope_denied` / `grant_denied` audit rows trigger alerts on N denials/min
Replayed Meta webhook	Inngest `idempotencyKey = wamid` + DB `inngest_idempotency` `ON CONFLICT DO NOTHING`; duplicate deliveries are detected and dropped, so each `wamid` is processed effectively once even under aggressive Meta retries
Leaked Meta access token	System User tokens scoped to one WABA, rotatable from Meta dashboard; `phone_numbers.access_token_secret_ref` indirection means rotation = update one file + restart container; all attempts audited
Accidental media URL exposure	Nginx never serves `/var/lib/whatsapp-mcp/media`; downloads go only through the `media:read`-scoped MCP tool which streams after auth + scope + grant check; signed URLs additionally bind tenancy; `storage_path` is relative so a path-traversal bug can’t escape the media root

Principles applied throughout

Fail closed. Auth, scope, grant, rate-limit checks all default-deny on error.
Single auth path. No back-door admin endpoints. Admin is CLI-only, run inside the container.
Tokens live in memory only as long as they must. From Phase 7 they are read from secret files at startup, not from env vars; rotating one means restarting the container.
Audit log is append-only at the DB. Postgres role for the app has INSERT, SELECT on audit_log only. Retention archival uses a separate role.