
Phase 7 — Production Hardening & Deploy

Effort: L

Goal

The stack boots cleanly on the Ubuntu host under Docker Compose, fronted by Nginx with Let’s Encrypt certificates, with secrets managed via SOPS + age, backups in place, and runbooks written. The whole stack survives a reboot, and certbot renewals are automated.

Deliverables

Docker

  • Dockerfile:
    • Multi-stage: builder (node:20-bookworm-slim) → runner (gcr.io/distroless/nodejs20-debian12:nonroot or node:20-bookworm-slim with a non-root user).
    • Pinned base image with @sha256:... digest.
    • WORKDIR /app; copy only dist/, node_modules/ (production-only), package.json, drizzle/.
    • USER nonroot (uid 65532) on the distroless runner, or USER node (uid 1000) on node:20-bookworm-slim.
    • EXPOSE 3000.
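    • HEALTHCHECK via Node rather than wget, because the distroless image ships no wget or shell and the slim image does not include wget by default: HEALTHCHECK CMD ["node","-e","fetch('http://127.0.0.1:3000/health').then(r=>process.exit(r.ok?0:1)).catch(()=>process.exit(1))"]. On distroless, use the absolute path /nodejs/bin/node.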
  • .dockerignore — excludes tests/, coverage/, docs/, .env, node_modules/.

Compose

  • docker-compose.prod.yml:
    services:
      app:
        image: ghcr.io/<org>/whatsapp-mcp:<sha>
        restart: unless-stopped
        read_only: true
        tmpfs:
          - /tmp
        ports: ["127.0.0.1:3000:3000"]
        env_file: /opt/whatsapp-mcp/.env
        volumes:
          - /run/whatsapp-mcp/secrets:/run/secrets:ro
          - /var/lib/whatsapp-mcp/media:/var/lib/whatsapp-mcp/media
        depends_on:
          postgres:
            condition: service_healthy
        cap_drop: [ALL]
      postgres:
        image: postgres:16-alpine
        restart: unless-stopped
        environment:
          POSTGRES_USER: wa
          POSTGRES_DB: wa_mcp
          POSTGRES_PASSWORD_FILE: /run/secrets/postgres_password
        volumes:
          - pg_data:/var/lib/postgresql/data
          - /run/whatsapp-mcp/secrets:/run/secrets:ro
        healthcheck:
          test: ["CMD-SHELL", "pg_isready -U wa -d wa_mcp"]
        # no public port
      nginx:
        image: nginx:1.27-alpine
        restart: unless-stopped
        ports: ["80:80", "443:443"]
        volumes:
          - ./ops/nginx:/etc/nginx/conf.d:ro
          - /etc/letsencrypt:/etc/letsencrypt:ro
          - /var/www/certbot:/var/www/certbot:ro
        depends_on: [app]
    volumes:
      pg_data:
  • /opt/whatsapp-mcp/.env carries only non-secret pointers + secret refs (secrets://...). Real secrets land in /run/whatsapp-mcp/secrets/.

Nginx

  • ops/nginx/sites/wa.conf:
    • 80 → 301 → 443.
    • 443 server wa.<yourdomain>:
      • TLS via /etc/letsencrypt/live/wa.<yourdomain>/{fullchain,privkey}.pem.
      • HSTS max-age=31536000; includeSubDomains.
      • Strips any client X-Forwarded-* and X-Real-Ip headers before forwarding.
      • Adds X-Request-Id (generates if absent), X-Real-Ip = $remote_addr, X-Forwarded-For = $remote_addr, X-Forwarded-Proto = https.
    • Locations:
      • /webhook/meta: proxy_pass http://app:3000. Limit client_max_body_size to 5m. Rate-limit zone meta_webhook: 30 r/s, burst 50.
      • /api/inngest: proxy_pass. Rate-limit zone inngest: 30 r/s, burst 50.
      • /mcp: SSE-safe settings (proxy_buffering off; proxy_cache off; proxy_http_version 1.1; proxy_set_header Connection ""; proxy_read_timeout 1h; proxy_send_timeout 1h;). Rate-limit zone mcp_per_ip: 60 r/m, burst 20.
      • /media/: proxy_pass http://app:3000/media/;. Rate-limit zone media: 30 r/s, burst 50.
      • /health: proxy_pass. Rate-limit zone health: 6 r/m (so monitors don’t drown us).
    • ACME challenge: location ^~ /.well-known/acme-challenge/ { root /var/www/certbot; }.

certbot

  • Initial cert issuance: standalone or webroot mode, documented in docs/operations/host-bootstrap.md.
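  • Renewal: host cron (or systemd timer) runs certbot renew --webroot -w /var/www/certbot --deploy-hook "docker compose -f /opt/whatsapp-mcp/docker-compose.prod.yml exec -T nginx nginx -s reload" twice daily. The -T flag disables TTY allocation, which exec otherwise attempts and which fails when no terminal is attached, as under cron.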

Secrets — SOPS + age migration

  • Install sops and age on the host.
  • ops/secrets/secrets.enc.yaml committed (encrypted), containing every value previously in .env that is sensitive:
    • wa_app_secret — single value, per Meta App, used for all numbers in the App
    • wa_webhook_verify_token — v1 single value; per-number wa_webhook_verify_token_<phone_id> entries when Phase 8 brings multi-number
    • wa_default_access_token — v1; per-number wa_access_token_<phone_id> entries when Phase 8 brings multi-number
    • api_key_pepper
    • postgres_password
    • inngest_event_key
    • inngest_signing_key
    • media_signing_secret
  • ops/sops/.sops.yaml — recipients list (the age public keys allowed to decrypt).
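  • /etc/whatsapp-mcp/age.key: age private key on the host, mode 0400, owned by root. At mode 0400 only root can read it, which is sufficient because the decrypt unit below runs as root. The whatsapp-mcp-secrets group applies to the decrypted files under /run/whatsapp-mcp/secrets, not to the key.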
  • ops/systemd/whatsapp-mcp-secrets.service:
    [Unit]
    Description=Decrypt secrets to tmpfs
    Before=whatsapp-mcp.service

    [Service]
    Type=oneshot
    RemainAfterExit=yes
    # sops does not look in /etc by default; point it at the host key
    Environment=SOPS_AGE_KEY_FILE=/etc/whatsapp-mcp/age.key
    ExecStartPre=/bin/mkdir -p /run/whatsapp-mcp/secrets
    ExecStartPre=/bin/mount -t tmpfs -o mode=0750,uid=root,gid=whatsapp-mcp-secrets tmpfs /run/whatsapp-mcp/secrets
    # systemd does not interpret "|", so the pipeline runs inside a shell
    ExecStart=/bin/sh -c '/usr/local/bin/sops -d /opt/whatsapp-mcp/ops/secrets/secrets.enc.yaml | /usr/local/bin/explode-secrets /run/whatsapp-mcp/secrets'
    ExecStop=/bin/umount /run/whatsapp-mcp/secrets
  • A tiny explode-secrets helper script (POSIX shell or Node) reads the decrypted YAML on stdin and writes each top-level key to its own file, mode 0640, in the directory given as its argument.
  • src/config/env.ts extended to resolve secrets://name references → reads /run/secrets/<name> synchronously at startup. The phone_numbers.access_token_secret_ref resolver uses the same mechanism.
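
The explode-secrets step can be sketched in TypeScript as follows. This is a minimal illustration rather than the project's helper: the names parseFlatYaml and explodeSecrets are hypothetical, and the parser assumes the decrypted file is a flat mapping of `key: value` scalars on single lines (no nesting, no multi-line values, no escape processing inside quotes). A real helper would more likely use a YAML library.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

/**
 * Parses a flat YAML mapping of scalar values, one `key: value` per line.
 * ASSUMPTION: secrets.enc.yaml decrypts to this simple shape.
 */
export function parseFlatYaml(text: string): { [key: string]: string } {
  const out: { [key: string]: string } = {};
  const lines = text.split(/\r?\n/);
  for (let i = 0; i < lines.length; i++) {
    const line = (lines[i] ?? "").trimEnd();
    if (line === "" || line === "---" || line.startsWith("#")) continue;
    const m = /^([A-Za-z0-9_]+):\s*(.*)$/.exec(line);
    // Report only the line number: the line itself may contain a secret.
    if (!m) throw new Error(`unsupported YAML at line ${i + 1}`);
    const key = m[1] ?? "";
    let value = m[2] ?? "";
    // Strip one layer of matching quotes; escapes are NOT interpreted.
    if (/^(".*"|'.*')$/.test(value)) value = value.slice(1, -1);
    out[key] = value;
  }
  return out;
}

/** Writes each top-level key to <dir>/<key> with mode 0640. */
export function explodeSecrets(yamlText: string, dir: string): string[] {
  const secrets = parseFlatYaml(yamlText);
  const written: string[] = [];
  for (const name of Object.keys(secrets)) {
    const file = path.join(dir, name);
    fs.writeFileSync(file, secrets[name] ?? "", { mode: 0o640 });
    // The mode option is filtered by umask and ignored for existing files,
    // so set the final mode explicitly.
    fs.chmodSync(file, 0o640);
    written.push(file);
  }
  return written;
}
```

A command-line wrapper around this would read the decrypted document with `fs.readFileSync(0, "utf8")` (file descriptor 0 is stdin) and pass `process.argv[2]` as the target directory, matching how the unit above pipes sops output into the helper.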

systemd

  • ops/systemd/whatsapp-mcp.service:
    [Unit]
    Description=WhatsApp MCP Server
    After=docker.service network-online.target whatsapp-mcp-secrets.service
    Requires=docker.service whatsapp-mcp-secrets.service

    [Service]
    Type=simple
    WorkingDirectory=/opt/whatsapp-mcp
    ExecStart=/usr/bin/docker compose -f docker-compose.prod.yml up
    ExecStop=/usr/bin/docker compose -f docker-compose.prod.yml down
    # --retry-connrefused: plain --retry does not retry a refused connection
    ExecStartPost=/usr/bin/curl --silent --fail --max-time 30 --retry 10 --retry-delay 3 --retry-connrefused https://wa.<yourdomain>/health
    Restart=always
    RestartSec=10

    [Install]
    WantedBy=multi-user.target

Backups

  • ops/backups/backup-db.sh — nightly via host cron:
    docker compose exec -T postgres pg_dump -U wa wa_mcp \
      | gzip -9 > /var/lib/whatsapp-mcp/backups/db-$(date +%F).sql.gz
  • ops/backups/backup-media.sh — nightly rsync of /var/lib/whatsapp-mcp/media/ to the same backups dir as a hardlinked snapshot (--link-dest).
  • ops/backups/sync-offsite.sh: rclone push of /var/lib/whatsapp-mcp/backups/ to a remote (Backblaze B2 or S3). Encrypted at rest via rclone’s crypt backend or age-encrypted before upload.
  • ops/backups/sync-audit.sh — hourly cron: rsync of /var/lib/whatsapp-mcp/audit-archive/ to an off-host log server. Tamper-resistant: a compromised local host can’t erase the trail.
  • Retention policy: 30 daily, 12 monthly. Implemented via find cleanup in backup-db.sh.
  • ops/backups/restore-db.sh — restore from a chosen .sql.gz into a staging compose for verification.
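
The "30 daily, 12 monthly" rule can be read in more than one way, so a small sketch of one interpretation may help when reviewing the find-based cleanup. It is written in TypeScript purely to pin down the semantics and is not the backup-db.sh implementation. Assumptions: dumps are named db-YYYY-MM-DD.sql.gz as in backup-db.sh, "30 daily" means the 30 newest dumps, and "12 monthly" means the earliest dump from each of the 12 most recent months that have one. planRetention is a hypothetical name.

```typescript
// Matches the dump naming used by backup-db.sh: db-YYYY-MM-DD.sql.gz
const DUMP_NAME = /^db-\d{4}-\d{2}-\d{2}\.sql\.gz$/;

export interface RetentionPlan {
  keep: string[];   // newest first
  remove: string[]; // newest first
}

export function planRetention(
  fileNames: string[],
  keepDaily = 30,
  keepMonthly = 12,
): RetentionPlan {
  // ISO dates sort lexicographically, so a plain sort orders dumps by date.
  // Files that do not look like dumps are ignored entirely.
  const newestFirst = fileNames.filter((n) => DUMP_NAME.test(n)).sort().reverse();

  // Daily tier: the N newest dumps.
  const keep = new Set<string>(newestFirst.slice(0, keepDaily));

  // Monthly tier: walking newest to oldest, the last write per month
  // is that month's earliest dump.
  const earliestOfMonth = new Map<string, string>();
  for (const name of newestFirst) {
    earliestOfMonth.set(name.slice(3, 10), name); // key is "YYYY-MM"
  }
  const recentMonths = Array.from(earliestOfMonth.keys())
    .sort()
    .reverse()
    .slice(0, keepMonthly);
  for (const month of recentMonths) {
    const name = earliestOfMonth.get(month);
    if (name !== undefined) keep.add(name);
  }

  return {
    keep: newestFirst.filter((n) => keep.has(n)),
    remove: newestFirst.filter((n) => !keep.has(n)),
  };
}
```

Whatever form the cleanup takes in the shell script, the same inputs should produce the same keep and remove sets, which makes a function like this a convenient reference when verifying retention.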

Log rotation

  • Docker daemon.json:
    {
      "log-driver": "json-file",
      "log-opts": { "max-size": "50m", "max-file": "10" }
    }

Health & smoke

  • GET /health (Phase 5) is the systemd ExecStartPost smoke.
  • A pnpm smoke script: hits /health, runs a ping MCP call over HTTP with an admin key.

Docs (extended)

  • docs/operations/phone-number-onboarding.md — promoted from stub. Add phone_numbers row → register in Meta → link via client_phone_grants.
  • docs/operations/incident-runbook.md — promoted: Meta token rotation; DB restore from pg_dump; certbot failure; key pepper rotation; full host re-bootstrap.
  • docs/operations/upgrade.md — promoted: Graph API version bump procedure.
  • docs/operations/backups.md — promoted: backup + restore scripts, retention, off-host sync verification.
  • docs/architecture/auth.md — extended with the SOPS + age migration.
  • docs/components/inngest-runner.md — extended with the Nginx proxy contract.

Tests

Unit

  • tests/unit/config/secrets-resolver.test.ts: secrets://foo reads /run/secrets/foo; missing file fails fast; non-secrets:// strings pass through unchanged.
  • tests/unit/config/env-with-secrets.test.ts — full env parse with secret refs.
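
The contract those unit tests describe can be sketched as below. This is an illustration under stated assumptions, not the code in src/config/env.ts: resolveSecretRef is a hypothetical name, the secrets directory is a parameter so tests can point it at a temp dir, one trailing newline is stripped from the file content, and secret names are restricted to letters, digits, and underscores to rule out path traversal.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

const SECRET_PREFIX = "secrets://";

/**
 * Resolves a `secrets://name` reference by reading <secretsDir>/<name>
 * synchronously. Any other string is returned unchanged.
 */
export function resolveSecretRef(value: string, secretsDir = "/run/secrets"): string {
  if (!value.startsWith(SECRET_PREFIX)) return value; // plain config value

  const name = value.slice(SECRET_PREFIX.length);
  // ASSUMPTION: names are simple identifiers, e.g. postgres_password.
  if (!/^[A-Za-z0-9_]+$/.test(name)) {
    throw new Error(`invalid secret reference: ${value}`);
  }

  let content: string;
  try {
    content = fs.readFileSync(path.join(secretsDir, name), "utf8");
  } catch {
    // Fail fast at startup; never include file content in the message.
    throw new Error(`secret "${name}" could not be read from ${secretsDir}`);
  }
  // ASSUMPTION: a single trailing newline is an artifact, not part of the secret.
  return content.replace(/\r?\n$/, "");
}
```

Reading synchronously keeps config parsing a single blocking step at startup, so the process either starts with every secret resolved or exits before serving traffic.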

Integration / system

  • tests/integration/health/health.test.ts (Phase 5) — still applies.
  • tests/integration/smoke.test.ts — boots app against testcontainers Postgres, GET /health 200, ping over HTTP with a minted key works.

Manual / smoke (post-deploy)

A scripted but human-run checklist in docs/operations/post-deploy-smoke.md:

  1. systemctl status whatsapp-mcp is active.
  2. curl https://wa.<yourdomain>/health returns 200 from a third-party network.
  3. openssl s_client -connect wa.<yourdomain>:443 -servername wa.<yourdomain> shows the Let’s Encrypt cert with > 30 days remaining.
  4. POST a synthetic Meta webhook (with valid signature using WA_APP_SECRET) → 200 in < 200 ms.
  5. From a remote client with an admin key: connect to /mcp, list tools, call ping.
  6. From the host: trigger a manual backup-db.sh and verify the gzip file exists with non-zero size.
  7. restore-db.sh against the latest gzip in a staging compose → row counts match the prod DB at the time of the dump (allowing for rows written to prod after the dump was taken).
  8. SOPS smoke: sops -d ops/secrets/secrets.enc.yaml on the host using the on-host age key succeeds; same command on a different host fails.
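
For step 4, the synthetic webhook needs a signature the app will accept. Meta signs webhook payloads with HMAC-SHA256 over the raw request body, keyed with the app secret, and sends the result in the X-Hub-Signature-256 header as sha256=<hex>. A minimal sketch of producing and checking that value follows; the function names are hypothetical and the app's own verifier may be structured differently.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

/** Builds the X-Hub-Signature-256 header value for a raw webhook body. */
export function signWebhookBody(rawBody: string, appSecret: string): string {
  const digest = createHmac("sha256", appSecret).update(rawBody, "utf8").digest("hex");
  return `sha256=${digest}`;
}

/** Compares a received header against the expected value in constant time. */
export function isValidSignature(rawBody: string, appSecret: string, header: string): boolean {
  const expected = Buffer.from(signWebhookBody(rawBody, appSecret));
  const received = Buffer.from(header);
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

The signature must be computed over the exact bytes that are sent: re-serializing the JSON between signing and posting changes the body and invalidates the header.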

Chaos checks

  • Stop the Postgres container while the app is running → /health returns 503 within 5 s; systemctl restart whatsapp-mcp recovers.
  • Kill the app container → Docker restarts it; recovery within 15 s.
  • Reboot the host → systemd brings the stack back automatically; /health reachable within 60 s of boot completing.

Coverage

  • Coverage gate continues from Phase 6; new code in src/config/ ≥ 90%.

Code documentation

  • TSDoc on the secrets resolver with @remarks on the synchronous-read contract and the migration story.
  • docs/operations/*.md for every runbook stub promoted.
  • docs/architecture/auth.md extended with SOPS + age.
  • docs/components/*.md updated to reflect the production wiring.
  • docs/reference/ regenerated.

Acceptance

  1. systemctl restart whatsapp-mcp brings the whole stack back, including a fresh certbot renew if one was due.
  2. curl https://wa.<yourdomain>/health returns 200 from the open internet.
  3. A simulated DB restore from the previous day’s pg_dump succeeds in a staging compose with row counts matching.
  4. The age key is backed up offline (printed paper in a safe, or in a hardware token). A bare new Ubuntu host can be bootstrapped to identical state in under 10 minutes given: the repo, the encrypted secrets, the age key, and the host bootstrap runbook.
  5. All Phase 2–6 acceptance checks pass on the production host with no regressions.
  6. Manual smoke checklist (above) green.
  7. The off-host audit log sync target shows fresh files in /audit-archive/ within the hour.

Notes

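  • Nginx never serves /media from disk: the only /media/ location is the proxy_pass to the app listed under Nginx above, with no root or alias. If you find a location that serves media files directly in PR review, reject the PR, because media is auth-gated.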
  • Don’t put the Meta webhook on a separate hostname. Use the same wa.<yourdomain> so a single cert serves everything and there’s one operational surface.
  • HSTS preload is optional — defer until you’re sure you’ll never need to serve non-TLS on this hostname.
  • Postgres has no public port. Backup access is via docker compose exec.

Definition of Done

Docker

  • Dockerfile multi-stage with pinned @sha256 base + non-root user + HEALTHCHECK.
  • .dockerignore excludes tests/coverage/docs/.env/node_modules.
  • Image built + pushed to ghcr.io from CI on main.

Compose

  • docker-compose.prod.yml with app (read-only + tmpfs, cap_drop ALL, bound 127.0.0.1), postgres (no public port), nginx.
  • /opt/whatsapp-mcp/.env mode 0600 with non-secret pointers + secrets:// refs.
  • Secret bind-mount /run/whatsapp-mcp/secrets:/run/secrets:ro on app + postgres.

Nginx

  • ops/nginx/sites/wa.conf written.
  • 80 → 301 → 443.
  • HSTS, header stripping (X-Forwarded-* / X-Real-Ip).
  • /mcp SSE-safe settings (buffering off, http1.1, long timeouts).
  • Per-zone rate limits (webhook / inngest / mcp / media / health).
  • ACME challenge location.

certbot

  • Initial cert issued for wa.<yourdomain>.
  • Renewal cron + deploy-hook reloading Nginx in container.
  • openssl s_client shows valid cert > 30 days.

Secrets — SOPS + age

  • age-keygen produced key; private key at /etc/whatsapp-mcp/age.key mode 0400.
  • Public key in ops/sops/.sops.yaml.
  • ops/secrets/secrets.enc.yaml encrypted with every required secret entry.
  • whatsapp-mcp-secrets.service decrypts to tmpfs at boot.
  • explode-secrets helper writes per-file secrets in /run/whatsapp-mcp/secrets/.
  • src/config/env.ts resolves secrets://name via the file resolver.
  • Age key backed up offline (paper/hardware token) — verified by re-bootstrap drill.

systemd

  • whatsapp-mcp.service enabled; ExecStartPost /health smoke succeeds.
  • Restart=always; survives intentional kill -9 of the app container.

Backups

  • backup-db.sh nightly cron writing /var/lib/whatsapp-mcp/backups/db-YYYY-MM-DD.sql.gz.
  • backup-media.sh nightly rsync with --link-dest.
  • sync-offsite.sh pushing to B2/S3 (encrypted at rest).
  • sync-audit.sh hourly cron pushing audit archive to off-host log target.
  • restore-db.sh tested against a staging compose.
  • Retention: 30 daily / 12 monthly cleanup verified.

Log rotation

  • Docker daemon.json max-size=50m, max-file=10.

Health + smoke

  • GET /health returns 200 from public internet.
  • pnpm smoke script green from host.

Tests

  • tests/unit/config/secrets-resolver.test.ts passes.
  • tests/unit/config/env-with-secrets.test.ts passes.
  • tests/integration/smoke.test.ts passes against testcontainers.
  • Chaos checks: Postgres stop → 503; app kill → auto-restart; host reboot → stack returns < 60s.

Documentation

  • docs/operations/phone-number-onboarding.md promoted from stub.
  • docs/operations/incident-runbook.md promoted.
  • docs/operations/upgrade.md promoted.
  • docs/operations/backups.md promoted.
  • docs/operations/post-deploy-smoke.md written.
  • docs/architecture/auth.md extended (SOPS + age).
  • docs/components/inngest-runner.md extended (Nginx proxy).
  • docs/reference/ regenerated cleanly.

Acceptance verified

  • systemctl restart whatsapp-mcp brings whole stack back including cert renew if due.
  • curl https://wa.<yourdomain>/health returns 200 from open internet.
  • DB restore drill in staging compose succeeds with row counts matching.
  • Bare-host re-bootstrap drill: repo + encrypted secrets + age key → identical state in < 10 min.
  • All Phase 2–6 acceptance checks still pass on production host.
  • Manual smoke checklist green.
  • Off-host audit sync target shows fresh files within the hour.

Phase signoff

  • Phase 7 complete — v1 SHIPPED. README.md status table updated to ✅.