Phase 7 — Production Hardening & Deploy
Effort: L
Goal
The service boots cleanly on the Ubuntu host under Docker Compose, fronted by Nginx + Let’s Encrypt, with secrets managed via SOPS + age, backups in place, and runbooks written. The whole stack survives a reboot, and certbot renewals are automated.
Deliverables
Docker
- Dockerfile:
  - Multi-stage: builder (node:20-bookworm-slim) → runner (gcr.io/distroless/nodejs20-debian12:nonroot, or node:20-bookworm-slim with a non-root user).
  - Pinned base image with @sha256:... digest.
  - WORKDIR /app; copy only dist/, node_modules/ (production-only), package.json, drizzle/.
  - Non-root USER: nonroot (uid 65532) on distroless, or node (uid 1000) on node:20-bookworm-slim.
  - EXPOSE 3000.
  - HEALTHCHECK CMD ["wget","--quiet","--tries=1","-O","/dev/null","http://127.0.0.1:3000/health"]. Distroless images ship no shell and no wget, so on that runner the probe has to be a Node one-liner instead; confirm that wget exists in the slim image before relying on it.
- .dockerignore: excludes tests/, coverage/, docs/, .env, node_modules/.
Compose
- docker-compose.prod.yml:

      services:
        app:
          image: ghcr.io/<org>/whatsapp-mcp:<sha>
          restart: unless-stopped
          read_only: true
          tmpfs:
            - /tmp
          ports: ["127.0.0.1:3000:3000"]
          env_file: /opt/whatsapp-mcp/.env
          volumes:
            - /run/whatsapp-mcp/secrets:/run/secrets:ro
            - /var/lib/whatsapp-mcp/media:/var/lib/whatsapp-mcp/media
          depends_on:
            postgres:
              condition: service_healthy
          cap_drop: [ALL]
        postgres:
          image: postgres:16-alpine
          restart: unless-stopped
          environment:
            POSTGRES_USER: wa
            POSTGRES_DB: wa_mcp
            POSTGRES_PASSWORD_FILE: /run/secrets/postgres_password
          volumes:
            - pg_data:/var/lib/postgresql/data
            - /run/whatsapp-mcp/secrets:/run/secrets:ro
          healthcheck:
            test: ["CMD-SHELL", "pg_isready -U wa -d wa_mcp"]
          # no public port
        nginx:
          image: nginx:1.27-alpine
          restart: unless-stopped
          ports: ["80:80", "443:443"]
          volumes:
            # the stock nginx.conf includes only conf.d/*.conf, so a top-level
            # .conf in ops/nginx must include sites/*.conf
            - ./ops/nginx:/etc/nginx/conf.d:ro
            - /etc/letsencrypt:/etc/letsencrypt:ro
            - /var/www/certbot:/var/www/certbot:ro
          depends_on: [app]
      volumes:
        pg_data:

- /opt/whatsapp-mcp/.env carries only non-secret pointers + secret refs (secrets://...). Real secrets land in /run/whatsapp-mcp/secrets/.
Nginx
- ops/nginx/sites/wa.conf:
  - 80 → 301 → 443.
  - 443 server wa.<yourdomain>:
    - TLS via /etc/letsencrypt/live/wa.<yourdomain>/{fullchain,privkey}.pem.
    - HSTS max-age=31536000; includeSubDomains.
    - Strips any client X-Forwarded-* and X-Real-Ip headers before forwarding.
    - Adds X-Request-Id (generates if absent), X-Real-Ip = $remote_addr, X-Forwarded-For = $remote_addr, X-Forwarded-Proto = https.
  - Locations:
    - /webhook/meta: proxy_pass http://app:3000. Limit client_max_body_size 5m. Rate-limit zone meta_webhook 30 r/s burst 50.
    - /api/inngest: proxy_pass. Rate-limit zone inngest 30 r/s burst 50.
    - /mcp, with SSE-safe settings: proxy_buffering off; proxy_cache off; proxy_http_version 1.1; proxy_set_header Connection ""; proxy_read_timeout 1h; proxy_send_timeout 1h;. Rate-limit zone mcp_per_ip 60 r/m burst 20.
    - /media/: proxy_pass http://app:3000/media/;. Rate-limit zone media 30 r/s burst 50.
    - /health: proxy_pass. Rate-limit zone health 6 r/m (so monitors don’t drown us).
  - ACME challenge: location ^~ /.well-known/acme-challenge/ { root /var/www/certbot; }.
certbot
- Initial cert issuance: standalone or webroot mode, documented in docs/operations/host-bootstrap.md.
- Renewal: host cron (or systemd timer) runs certbot renew --webroot -w /var/www/certbot --deploy-hook "docker compose -f /opt/whatsapp-mcp/docker-compose.prod.yml exec -T nginx nginx -s reload" twice daily. The -T flag disables TTY allocation, which cron and systemd timers do not provide.
Secrets — SOPS + age migration
- Install sops and age on the host.
- ops/secrets/secrets.enc.yaml committed (encrypted), containing every value previously in .env that is sensitive:
  - wa_app_secret: single value, per Meta App, used for all numbers in the App
  - wa_webhook_verify_token: v1 single value; per-number wa_webhook_verify_token_<phone_id> entries when Phase 8 brings multi-number
  - wa_default_access_token: v1; per-number wa_access_token_<phone_id> entries when Phase 8 brings multi-number
  - api_key_pepper
  - postgres_password
  - inngest_event_key
  - inngest_signing_key
  - media_signing_secret
- ops/sops/.sops.yaml: recipients list (the age public keys allowed to decrypt).
- /etc/whatsapp-mcp/age.key: age private key on the host, mode 0400, owned by root. With mode 0400 only root can read the key, which is sufficient because whatsapp-mcp-secrets.service runs as root; the whatsapp-mcp-secrets group governs the decrypted files in /run/whatsapp-mcp/secrets, not the key itself.
- ops/systemd/whatsapp-mcp-secrets.service:

      [Unit]
      Description=Decrypt secrets to tmpfs
      Before=whatsapp-mcp.service

      [Service]
      Type=oneshot
      RemainAfterExit=yes
      # sops reads the age private key from this path
      Environment=SOPS_AGE_KEY_FILE=/etc/whatsapp-mcp/age.key
      ExecStartPre=/bin/mkdir -p /run/whatsapp-mcp/secrets
      ExecStartPre=/bin/mount -t tmpfs -o mode=0750,uid=root,gid=whatsapp-mcp-secrets tmpfs /run/whatsapp-mcp/secrets
      # systemd does not interpret pipes, so the pipeline runs under sh -c
      ExecStart=/bin/sh -c '/usr/local/bin/sops -d /opt/whatsapp-mcp/ops/secrets/secrets.enc.yaml | /usr/local/bin/explode-secrets /run/whatsapp-mcp/secrets'
      ExecStop=/bin/umount /run/whatsapp-mcp/secrets

- A tiny explode-secrets helper script (POSIX shell or Node) writes each top-level key in the decrypted YAML to its own file, mode 0640.
- src/config/env.ts extended to resolve secrets://name references → reads /run/secrets/<name> synchronously at startup. The phone_numbers.access_token_secret_ref resolver uses the same mechanism.
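The explode-secrets step could be sketched in Node roughly as follows. This is a minimal illustration, not the project's helper: the names parseFlatYaml, explodeSecrets and WriteSecret are hypothetical, and the hand-rolled parser assumes the decrypted YAML is a flat mapping of single-line scalar values (a real helper might prefer a YAML library).

```typescript
import { chmodSync, mkdirSync, writeFileSync } from "node:fs";
import { dirname, posix } from "node:path";

/** Writes one secret file; injectable so the logic can be tested without touching disk. */
export type WriteSecret = (path: string, content: string, mode: number) => void;

const writeToDisk: WriteSecret = (path, content, mode) => {
  mkdirSync(dirname(path), { recursive: true });
  writeFileSync(path, content, { mode });
  // The mode option is subject to umask and is not applied to a file that already exists.
  chmodSync(path, mode);
};

/**
 * Parse a flat YAML mapping. Assumption: top-level `key: value` scalars only,
 * one per line, optionally wrapped in matching quotes. Anything else is rejected.
 */
export function parseFlatYaml(text: string): Record<string, string> {
  const secrets: Record<string, string> = {};
  const lines = text.split(/\r?\n/);
  for (let i = 0; i < lines.length; i++) {
    const line = (lines[i] ?? "").replace(/\s+$/, "");
    if (line === "" || line === "---" || /^\s*#/.test(line)) continue;
    const match = /^([A-Za-z0-9_]+):\s*(.+)$/.exec(line);
    // Report the line number only: echoing the line could leak a secret into logs.
    if (!match) throw new Error(`unsupported YAML on line ${i + 1}`);
    const key = match[1] ?? "";
    const raw = match[2] ?? "";
    const quoted = /^(["'])(.*)\1$/.exec(raw);
    secrets[key] = quoted ? (quoted[2] ?? "") : raw;
  }
  return secrets;
}

/** Write each top-level key to `<dir>/<key>` with mode 0640; returns the key names written. */
export function explodeSecrets(
  yamlText: string,
  dir: string,
  write: WriteSecret = writeToDisk,
): string[] {
  const secrets = parseFlatYaml(yamlText);
  const keys = Object.keys(secrets).sort();
  for (const key of keys) write(posix.join(dir, key), secrets[key] ?? "", 0o640);
  return keys;
}
```

In the unit above, the decrypted YAML would arrive on stdin and the target directory as the first argument; the injectable writer exists only so the behavior can be exercised without a tmpfs.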
systemd
- ops/systemd/whatsapp-mcp.service:

      [Unit]
      Description=WhatsApp MCP Server
      After=docker.service network-online.target whatsapp-mcp-secrets.service
      Requires=docker.service whatsapp-mcp-secrets.service

      [Service]
      Type=simple
      WorkingDirectory=/opt/whatsapp-mcp
      ExecStart=/usr/bin/docker compose -f docker-compose.prod.yml up
      ExecStop=/usr/bin/docker compose -f docker-compose.prod.yml down
      # --retry-connrefused keeps retrying while Nginx and the app are still coming up
      ExecStartPost=/usr/bin/curl --silent --fail --max-time 30 --retry 10 --retry-delay 3 --retry-connrefused https://wa.<yourdomain>/health
      Restart=always
      RestartSec=10

      [Install]
      WantedBy=multi-user.target
Backups
- ops/backups/backup-db.sh: nightly via host cron. The compose file is passed explicitly because cron does not run from the project directory:

      docker compose -f /opt/whatsapp-mcp/docker-compose.prod.yml exec -T postgres pg_dump -U wa wa_mcp \
        | gzip -9 > /var/lib/whatsapp-mcp/backups/db-$(date +%F).sql.gz

- ops/backups/backup-media.sh: nightly rsync of /var/lib/whatsapp-mcp/media/ to the same backups dir as a hardlinked snapshot (--link-dest).
- ops/backups/sync-offsite.sh: rclone push of /var/lib/whatsapp-mcp/backups/ to a remote (Backblaze B2 or S3). Encrypted at rest via rclone’s crypt backend or age-encrypted before upload.
- ops/backups/sync-audit.sh: hourly cron running rsync of /var/lib/whatsapp-mcp/audit-archive/ to an off-host log server. Tamper-resistant as long as the host's credentials on the log server cannot delete or overwrite existing files: a compromised local host then can’t erase the trail.
- Retention policy: 30 daily, 12 monthly. Implemented via find cleanup in backup-db.sh.
- ops/backups/restore-db.sh: restore from a chosen .sql.gz into a staging compose for verification.
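The 30 daily / 12 monthly policy can be made concrete with a small selection function. The plan implements cleanup with find inside backup-db.sh; the TypeScript below is only an illustration of the selection logic, and it assumes (the plan does not say) that the "monthly" copy is the oldest dump available in each month. planRetention and RetentionPlan are hypothetical names.

```typescript
// Matches the naming used by backup-db.sh: db-YYYY-MM-DD.sql.gz
const BACKUP_NAME = /^db-\d{4}-\d{2}-\d{2}\.sql\.gz$/;

export interface RetentionPlan {
  keep: string[];
  remove: string[];
}

export function planRetention(
  fileNames: string[],
  dailyCount = 30,
  monthlyCount = 12,
): RetentionPlan {
  // ISO dates sort lexicographically, so a string sort is a date sort. Newest first.
  const dumps = fileNames.filter((name) => BACKUP_NAME.test(name)).sort().reverse();

  // Daily tier: the newest `dailyCount` dumps.
  const keep: Record<string, true> = {};
  for (const name of dumps.slice(0, dailyCount)) keep[name] = true;

  // Monthly tier (assumption): walking newest to oldest, the last write per
  // month is that month's oldest dump.
  const oldestInMonth: Record<string, string> = {};
  for (const name of dumps) oldestInMonth[name.slice(3, 10)] = name; // key "YYYY-MM"
  const months = Object.keys(oldestInMonth).sort().reverse().slice(0, monthlyCount);
  for (const month of months) {
    const name = oldestInMonth[month];
    if (name !== undefined) keep[name] = true;
  }

  // Files that do not match the naming pattern are never touched.
  return {
    keep: dumps.filter((name) => keep[name] === true),
    remove: dumps.filter((name) => keep[name] !== true),
  };
}
```

Returning a plan instead of deleting directly makes the policy easy to dry-run before any file is removed.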
Log rotation
- Docker daemon.json: { "log-driver": "json-file", "log-opts": { "max-size": "50m", "max-file": "10" } }
Health & smoke
- GET /health (Phase 5) is the systemd ExecStartPost smoke.
- A pnpm smoke script: hits /health, runs a ping MCP call over HTTP with an admin key.
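A rough shape for the smoke script, under loud assumptions: the plan does not specify how the ping call is framed, so this sketch assumes /mcp accepts a JSON-RPC 2.0 "ping" request over POST with a bearer admin key. The names runSmoke and HttpFn are hypothetical, and the HTTP function is injected so the check can run against the global fetch or a test double.

```typescript
/** Minimal HTTP shape; the global fetch satisfies it, and so does a test double. */
export type HttpFn = (
  url: string,
  init: { method: string; headers: Record<string, string>; body?: string },
) => Promise<{ status: number; text(): Promise<string> }>;

/** Runs both checks and returns a list of failure messages (empty means green). */
export async function runSmoke(
  baseUrl: string,
  adminKey: string,
  http: HttpFn,
): Promise<string[]> {
  const failures: string[] = [];

  const health = await http(`${baseUrl}/health`, { method: "GET", headers: {} });
  if (health.status !== 200) failures.push(`/health returned ${health.status}`);

  // Assumption: the server answers a bare JSON-RPC "ping" without a prior
  // session handshake. Adjust to the real transport contract.
  const ping = await http(`${baseUrl}/mcp`, {
    method: "POST",
    headers: {
      "content-type": "application/json",
      accept: "application/json, text/event-stream",
      authorization: `Bearer ${adminKey}`,
    },
    body: JSON.stringify({ jsonrpc: "2.0", id: 1, method: "ping" }),
  });
  if (ping.status !== 200) failures.push(`/mcp ping returned ${ping.status}`);

  return failures;
}
```

A script entry point could call runSmoke with the deployed base URL, the admin key from the environment, and the global fetch, then exit non-zero when the returned list is not empty.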
Docs (extended)
- docs/operations/phone-number-onboarding.md: promoted from stub. Add phone_numbers row → register in Meta → link via client_phone_grants.
- docs/operations/incident-runbook.md: promoted. Covers Meta token rotation; DB restore from pg_dump; certbot failure; key pepper rotation; full host re-bootstrap.
- docs/operations/upgrade.md: promoted. Covers the Graph API version bump procedure.
- docs/operations/backups.md: promoted. Covers backup + restore scripts, retention, off-host sync verification.
- docs/architecture/auth.md: extended with the SOPS + age migration.
- docs/components/inngest-runner.md: extended with the Nginx proxy contract.
Critical files
- Dockerfile
- docker-compose.prod.yml
- ops/nginx/sites/wa.conf
- ops/systemd/whatsapp-mcp.service
- ops/systemd/whatsapp-mcp-secrets.service
- ops/sops/.sops.yaml
- ops/secrets/secrets.enc.yaml (encrypted)
- ops/backups/{backup-db,backup-media,sync-offsite,sync-audit,restore-db}.sh
- src/config/env.ts: extended with secrets:// resolver
Tests
Unit
- tests/unit/config/secrets-resolver.test.ts: secrets://foo reads /run/secrets/foo; missing file fails fast; non-secrets:// strings pass through unchanged.
- tests/unit/config/env-with-secrets.test.ts: full env parse with secret refs.
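The three behaviors these unit tests pin down (resolve, fail fast, pass through) suggest a resolver along the following lines. This is a sketch under assumptions, not the contents of src/config/env.ts: the names resolveSecretRef, resolveEnv and ReadFile are hypothetical, trimming one trailing newline is an assumption, and the injectable reader is there only so tests need no real /run/secrets.

```typescript
import { readFileSync } from "node:fs";
import { posix } from "node:path";

const SECRET_PREFIX = "secrets://";

/** Synchronous file reader; injectable so unit tests can avoid the real filesystem. */
export type ReadFile = (path: string) => string;
const readFromDisk: ReadFile = (path) => readFileSync(path, "utf8");

/**
 * `secrets://name` is replaced by the contents of `<secretsDir>/<name>`;
 * any other string is returned unchanged. The read is synchronous so the
 * configuration is complete before the server starts listening.
 */
export function resolveSecretRef(
  value: string,
  secretsDir = "/run/secrets",
  readFile: ReadFile = readFromDisk,
): string {
  if (value.indexOf(SECRET_PREFIX) !== 0) return value;
  const name = value.slice(SECRET_PREFIX.length);
  // Only plain file names are accepted, so a reference cannot escape secretsDir.
  if (!/^[A-Za-z0-9_.-]+$/.test(name) || name === "." || name === "..") {
    throw new Error(`invalid secret reference: ${value}`);
  }
  let content: string;
  try {
    content = readFile(posix.join(secretsDir, name));
  } catch {
    // Fail fast with the name only; the secret value is never part of the message.
    throw new Error(`secret "${name}" could not be read from ${secretsDir}`);
  }
  // Assumption: a single trailing newline is an editor artifact, not part of the secret.
  return content.replace(/\r?\n$/, "");
}

/** Resolve every value of an env-like record. */
export function resolveEnv(
  env: Record<string, string>,
  secretsDir = "/run/secrets",
  readFile: ReadFile = readFromDisk,
): Record<string, string> {
  const out: Record<string, string> = {};
  for (const key of Object.keys(env)) {
    const value = env[key];
    if (value !== undefined) out[key] = resolveSecretRef(value, secretsDir, readFile);
  }
  return out;
}
```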
Integration / system
- tests/integration/health/health.test.ts (Phase 5): still applies.
- tests/integration/smoke.test.ts: boots the app against testcontainers Postgres; GET /health returns 200; ping over HTTP with a minted key works.
Manual / smoke (post-deploy)
A scripted but human-run checklist in docs/operations/post-deploy-smoke.md:
- systemctl status whatsapp-mcp is active.
- curl https://wa.<yourdomain>/health returns 200 from a third-party network.
- openssl s_client -connect wa.<yourdomain>:443 -servername wa.<yourdomain> shows the Let’s Encrypt cert with > 30 days remaining.
- POST a synthetic Meta webhook (with valid signature using WA_APP_SECRET) → 200 in < 200 ms.
- From a remote client with an admin key: connect to /mcp, list tools, call ping.
- From the host: trigger a manual backup-db.sh and verify the gzip file exists with non-zero size.
- restore-db.sh against the latest gzip in a staging compose → row counts match the prod DB at the time of dump (allow drift).
- SOPS smoke: sops -d ops/secrets/secrets.enc.yaml on the host using the on-host age key succeeds; same command on a different host fails.
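For the synthetic webhook step, the signature can be produced as below. Meta signs webhook bodies with HMAC-SHA256 keyed by the app secret and sends the result in the X-Hub-Signature-256 header as "sha256=" followed by the hex digest. The function names are hypothetical, and the payload in the usage note is a placeholder rather than a full Meta event.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

/** Header value for X-Hub-Signature-256: "sha256=" + hex HMAC-SHA256 of the raw body. */
export function signMetaPayload(rawBody: string, appSecret: string): string {
  const digest = createHmac("sha256", appSecret).update(rawBody, "utf8").digest("hex");
  return "sha256=" + digest;
}

/** Constant-time comparison of a received header against the expected signature. */
export function verifyMetaSignature(
  rawBody: string,
  header: string,
  appSecret: string,
): boolean {
  const expected = Buffer.from(signMetaPayload(rawBody, appSecret));
  const received = Buffer.from(header);
  // timingSafeEqual throws on length mismatch, so lengths are compared first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

For the checklist, sign the exact bytes that will be sent, for example the string produced by JSON.stringify({ object: "whatsapp_business_account", entry: [] }), and POST that same string to /webhook/meta with the header set. Re-serializing the body after signing would invalidate the signature.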
Chaos checks
- Stop the Postgres container while the app is running → /health returns 503 within 5 s; systemctl restart whatsapp-mcp recovers.
- Kill the app container → Docker restarts it; recovery within 15 s.
- Reboot the host → systemd brings the stack back automatically; /health reachable within 60 s of boot completing.
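The "503 within 5 s" check only holds if the health probe bounds its database ping instead of waiting on a hung connection. One way to get that is to race the ping against a deadline. This is a sketch, not the Phase 5 handler: checkHealth, HealthResult and the 2 s default are hypothetical, and pingDb stands in for whatever cheap query the app uses.

```typescript
export interface HealthResult {
  status: 200 | 503;
  reason?: string;
}

/** Returns 200 when the database ping settles in time, 503 on failure or timeout. */
export async function checkHealth(
  pingDb: () => Promise<void>,
  timeoutMs = 2000,
): Promise<HealthResult> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("database ping timed out")), timeoutMs);
  });
  try {
    // Whichever settles first wins; a hung connection loses to the deadline.
    await Promise.race([pingDb(), deadline]);
    return { status: 200 };
  } catch (err) {
    const reason = err instanceof Error ? err.message : "unknown error";
    return { status: 503, reason };
  } finally {
    // Clear the timer so a fast ping does not leave a pending rejection behind.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

The reason string is meant for logs; whether it belongs in the public response body is a separate decision.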
Coverage
- Coverage gate continues from Phase 6; new code in src/config/ ≥ 90%.
Code documentation
- TSDoc on the secrets resolver with @remarks on the synchronous-read contract and the migration story.
- docs/operations/*.md for every runbook stub promoted.
- docs/architecture/auth.md extended with SOPS + age.
- docs/components/*.md updated to reflect the production wiring.
- docs/reference/ regenerated.
Acceptance
- systemctl restart whatsapp-mcp brings the whole stack back, including a fresh certbot renew if one was due.
- curl https://wa.<yourdomain>/health returns 200 from the open internet.
- A simulated DB restore from the previous day’s pg_dump succeeds in a staging compose with row counts matching.
- The age key is backed up offline (printed paper in a safe, or in a hardware token). A bare new Ubuntu host can be bootstrapped to identical state in under 10 minutes given: the repo, the encrypted secrets, the age key, and the host bootstrap runbook.
- All Phase 2–6 acceptance checks pass on the production host with no regressions.
- Manual smoke checklist (above) green.
- The off-host audit log sync target shows fresh files in /audit-archive/ within the hour.
Notes
- Nginx never serves media itself: no location /media block with root or alias pointing at the media directory. If you find one in PR review, reject the PR. Media is auth-gated, so the only allowed /media/ location is the proxy_pass to the app.
- Don’t put the Meta webhook on a separate hostname. Use the same wa.<yourdomain> so a single cert serves everything and there’s one operational surface.
- HSTS preload is optional: defer until you’re sure you’ll never need to serve non-TLS on this hostname.
- Postgres has no public port. Backup access is via docker compose exec.
Definition of Done
Docker
- Dockerfile multi-stage with pinned @sha256 base + non-root user + HEALTHCHECK.
- .dockerignore excludes tests/coverage/docs/.env/node_modules.
- Image built + pushed to ghcr.io from CI on main.
Compose
- docker-compose.prod.yml with app (read-only + tmpfs, cap_drop ALL, bound 127.0.0.1), postgres (no public port), nginx.
- /opt/whatsapp-mcp/.env mode 0600 with non-secret pointers + secrets:// refs.
- Secret bind-mount /run/whatsapp-mcp/secrets:/run/secrets:ro on app + postgres.
Nginx
- ops/nginx/sites/wa.conf written.
- 80 → 301 → 443.
- HSTS, header stripping (X-Forwarded-* / X-Real-Ip).
- /mcp SSE-safe settings (buffering off, http1.1, long timeouts).
- Per-zone rate limits (webhook / inngest / mcp / media / health).
- ACME challenge location.
certbot
- Initial cert issued for wa.<yourdomain>.
- Renewal cron + deploy-hook reloading Nginx in container.
- openssl s_client shows valid cert > 30 days.
Secrets — SOPS + age
- age-keygen produced key; private key at /etc/whatsapp-mcp/age.key mode 0400.
- Public key in ops/sops/.sops.yaml.
- ops/secrets/secrets.enc.yaml encrypted with every required secret entry.
- whatsapp-mcp-secrets.service decrypts to tmpfs at boot.
- explode-secrets helper writes per-file secrets in /run/whatsapp-mcp/secrets/.
- src/config/env.ts resolves secrets://name via the file resolver.
- Age key backed up offline (paper/hardware token), verified by re-bootstrap drill.
systemd
- whatsapp-mcp.service enabled; ExecStartPost /health smoke succeeds.
- Restart=always; survives intentional kill -9 of the app container.
Backups
- backup-db.sh nightly cron writing /var/lib/whatsapp-mcp/backups/db-YYYY-MM-DD.sql.gz.
- backup-media.sh nightly rsync with --link-dest.
- sync-offsite.sh pushing to B2/S3 (encrypted at rest).
- sync-audit.sh hourly cron pushing audit archive to off-host log target.
- restore-db.sh tested against a staging compose.
- Retention: 30 daily / 12 monthly cleanup verified.
Log rotation
- Docker daemon.json max-size=50m, max-file=10.
Health + smoke
- GET /health returns 200 from public internet.
- pnpm smoke script green from host.
Tests
- tests/unit/config/secrets-resolver.test.ts passes.
- tests/unit/config/env-with-secrets.test.ts passes.
- tests/integration/smoke.test.ts passes against testcontainers.
- Chaos checks: Postgres stop → 503; app kill → auto-restart; host reboot → stack returns < 60s.
Documentation
- docs/operations/phone-number-onboarding.md promoted from stub.
- docs/operations/incident-runbook.md promoted.
- docs/operations/upgrade.md promoted.
- docs/operations/backups.md promoted.
- docs/operations/post-deploy-smoke.md written.
- docs/architecture/auth.md extended (SOPS + age).
- docs/components/inngest-runner.md extended (Nginx proxy).
- docs/reference/ regenerated cleanly.
Acceptance verified
- systemctl restart whatsapp-mcp brings whole stack back including cert renew if due.
- curl https://wa.<yourdomain>/health returns 200 from open internet.
- DB restore drill in staging compose succeeds with row counts matching.
- Bare-host re-bootstrap drill: repo + encrypted secrets + age key → identical state in < 10 min.
- All Phase 2–6 acceptance checks still pass on production host.
- Manual smoke checklist green.
- Off-host audit sync target shows fresh files within the hour.
Phase signoff
- Phase 7 complete — v1 SHIPPED. README.md status table updated to ✅.