Phase 5 — Hardening & Compliance
Detailed execution doc for Phase 5 of the MCP Marketing Tool Architecture Plan. Builds on Phases 1, 2, 3, and 4. This is the final phase before declaring v1 production-ready.
Estimated effort: 1–2 weeks for one engineer (longer if you choose a third-party pen test). Follow-up: post-launch operational maintenance — not a “Phase 6.”
Goal
Take the application from “secure on the inside” to “production-ready on the outside.” Phase 1–4 built every component to be safe; Phase 5 makes the surrounding environment hardened, observable, tested under attack, and documented. The output is a system you can hand to a SOC 2 auditor or a security-conscious customer with a Data Processing Agreement and pass review on.
If Phase 5 is done correctly: the public surface area passes a clean OWASP ZAP / nuclei scan, every metric is dashboard-visible, an Inngest outage pages someone within 10 minutes, every secret has a documented rotation procedure, and the GDPR / SOC 2 evidence binder is one folder you can email to a procurement team.
Definition of Done (high-level — full checklist in §11)
- Public surface (`/mcp`, `/auth/*`, `/api/inngest`, `/admin/*`) reachable only via TLS 1.3 with full security headers.
- UFW blocks every port except 22 (SSH, IP-restricted), 80 (Let’s Encrypt only), 443.
- systemd unit runs as a non-root user with the full hardening flag set.
- `/api/inngest` accepts requests only from Inngest Cloud’s published egress IPs and `/admin/*` only from operator IPs; the OAuth callbacks (Google / Meta / TikTok) rely on `state` + PKCE rather than source-IP checks (see §C).
- `audit_log` row deletion is only possible via a `SECURITY DEFINER` stored procedure callable by the archive job role; the broad `GRANT DELETE` from Phase 4 is revoked.
- `scripts/rotate-kek.ts` rewraps every tenant DEK under a new KEK with no service downtime.
- Prometheus scrapes `/metrics`; Grafana dashboards visualise auth failures, rate-limit trips, sync health, cache hit rate, p95 latency.
- An external uptime monitor pages on `/admin/health/inngest` 503 or `/metrics` scrape failure.
- An automated security-scan job runs in CI (nuclei + OWASP ZAP baseline) and fails the build on a high-severity finding.
- ROPA, key-management, incident-response, and access-control documents committed to `docs/compliance/`.
- Nightly Postgres backup runs off-host; weekly automated restore drill passes on staging; “we lost the database” runbook executed end-to-end.
Workstream order & dependency graph
A. nginx TLS hardening ─┬──▶ C. OAuth callback IP allow-list
│
B. UFW + systemd ───────┼──▶ H. Pen-test (depends on full public surface)
│
D. Audit log DELETE gate ─── (independent, safe to land anytime)
E. KEK rotation tooling ─── (independent — needs Phase 2 envelope encryption)
F. Prometheus + Grafana ─┬─▶ G. External alerting
H. Pen-test ─────────────┘
I. Compliance docs ─── (parallel to all of the above)
J. Tests run alongside everything
K. Postgres backup ─── (independent — needs off-host bucket provisioned first)
L. Log retention ───── (independent — pure systemd config)
The critical path is A → C → H. D, E, F, I, K, L can land in any order.
Workstream A — nginx TLS hardening
A1. Full server config
File: infra/nginx/deneva-mcp.conf
# Rate limit zones — must be defined at http{} level. Move to /etc/nginx/conf.d/zones.conf.
# limit_req_zone $binary_remote_addr zone=mcp_global:10m rate=60r/m;
# limit_req_zone $binary_remote_addr zone=mcp_auth:10m rate=10r/m;
# limit_req_zone $binary_remote_addr zone=mcp_inngest:10m rate=600r/m;
server {
listen 443 ssl;
listen [::]:443 ssl;
http2 on;
server_name your-domain.com;
ssl_certificate /etc/letsencrypt/live/your-domain.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/your-domain.com/privkey.pem;
ssl_protocols TLSv1.3;
# With only TLS 1.3 enabled, suite selection is governed by OpenSSL's TLS 1.3 defaults;
# ssl_ciphers does not apply to TLS 1.3, so the line below documents the expected suites.
ssl_ciphers TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256:TLS_AES_128_GCM_SHA256;
ssl_prefer_server_ciphers off;
ssl_session_timeout 1d;
ssl_session_cache shared:MozSSL:50m;
ssl_session_tickets off;
# OCSP stapling
ssl_stapling on;
ssl_stapling_verify on;
ssl_trusted_certificate /etc/letsencrypt/live/your-domain.com/chain.pem;
resolver 1.1.1.1 8.8.8.8 valid=300s;
# Security headers
add_header Strict-Transport-Security "max-age=63072000; includeSubDomains; preload" always;
add_header X-Frame-Options DENY always;
add_header X-Content-Type-Options nosniff always;
add_header Referrer-Policy no-referrer always;
add_header Permissions-Policy "geolocation=(), microphone=(), camera=(), interest-cohort=()" always;
add_header Content-Security-Policy "default-src 'none'; frame-ancestors 'none'" always;
server_tokens off;
client_max_body_size 64k;
client_body_timeout 10s;
client_header_timeout 10s;
# /mcp — tenant-authenticated MCP traffic
location = /mcp {
limit_req zone=mcp_global burst=30 nodelay;
proxy_pass http://127.0.0.1:3001;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Request-Id $request_id;
proxy_hide_header X-Powered-By;
proxy_read_timeout 60s;
}
# /auth/*/start — tenant-authenticated, strict per-IP rate limit
location ~ ^/auth/[^/]+/(start|accounts(/select)?)$ {
limit_req zone=mcp_auth burst=5 nodelay;
proxy_pass http://127.0.0.1:3001;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}
# /auth/*/callback — IP-allow-listed per platform (see §C)
location ~ ^/auth/google/callback$ { include snippets/allowlist-google.conf; proxy_pass http://127.0.0.1:3001; }
location ~ ^/auth/meta/callback$ { include snippets/allowlist-meta.conf; proxy_pass http://127.0.0.1:3001; }
location ~ ^/auth/tiktok/callback$ { include snippets/allowlist-tiktok.conf; proxy_pass http://127.0.0.1:3001; }
# /api/inngest — only Inngest Cloud egress IPs
location = /api/inngest {
include snippets/allowlist-inngest.conf;
limit_req zone=mcp_inngest burst=200 nodelay;
proxy_pass http://127.0.0.1:3001;
proxy_request_buffering off; # streaming payloads
}
# /admin/* — additional IP allow-list (operator office IPs)
location /admin/ {
include snippets/allowlist-admin.conf;
limit_req zone=mcp_auth burst=5 nodelay;
proxy_pass http://127.0.0.1:3001;
}
# /tenant/connections — same auth as /mcp, no special rate limit override
location = /tenant/connections {
limit_req zone=mcp_global burst=10 nodelay;
proxy_pass http://127.0.0.1:3001;
}
# Block everything else with a connection-close (444 returns no response)
location / { return 444; }
}
# HTTP → HTTPS redirect, but allow Let's Encrypt's HTTP-01 challenge.
server {
listen 80;
listen [::]:80;
server_name your-domain.com;
location /.well-known/acme-challenge/ { root /var/www/letsencrypt; }
location / { return 301 https://$host$request_uri; }
}
A2. Cert renewal cron
# /etc/cron.d/letsencrypt
0 3 * * * root certbot renew --webroot -w /var/www/letsencrypt --quiet --post-hook "systemctl reload nginx"
A3. Acceptance
- `testssl.sh https://your-domain.com` reports A+ overall, no TLS 1.2 fallback.
- `curl -I https://your-domain.com/mcp` shows every required `add_header` and no `Server: nginx/...` version.
- `curl https://your-domain.com/anything-else` returns nothing (444).
- `curl http://your-domain.com/anything` redirects 301 to HTTPS.
Workstream B — UFW + systemd hardening
B1. UFW
ufw default deny incoming
ufw default allow outgoing
# SSH from operator office IPs only (replace with real CIDRs)
ufw allow from 203.0.113.0/24 to any port 22 proto tcp
ufw allow from 198.51.100.10/32 to any port 22 proto tcp
# HTTP for ACME challenges + HTTPS public
ufw allow 80/tcp
ufw allow 443/tcp
ufw enable
ufw status verbose
# postgres + node port (5432, 3001) NOT in the rules — bound to 127.0.0.1 already, but
# UFW belt-and-braces ensures any future binding mistake stays internal.
Architecture-doc divergence. The architecture doc’s `ecosystem.config.js` example deploys via PM2 with `instances: 2, exec_mode: 'cluster'`. Phase 5 deploys a single Node process via systemd directly. Reasons: (a) the systemd hardening flag set in §B2 is more comprehensive than PM2 can provide, (b) Phase 1’s IP-block map and Phase 4’s heartbeat counters are in-process — clustering would require moving them out (see Phase 1 §E4 note). If the load profile later requires horizontal scaling, the migration path is: move in-process state to Redis, then run multiple systemd instances behind nginx upstream. (A sketch of such a Redis-backed counter follows below.)
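If that migration ever happens, the in-process auth-failure counter is the piece that moves first. A minimal sketch of a Redis-backed version, assuming ioredis and the 10-failures-per-hour threshold (names are illustrative, not the Phase 1 API):
// security/auth-failures-redis.ts (illustrative): future replacement for the in-process IP-block map
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL ?? 'redis://127.0.0.1:6379');

/** Record one auth failure for an IP; returns true when the caller should engage the block. */
export async function recordAuthFailure(ip: string): Promise<boolean> {
  const key = `auth-failures:${ip}`;
  const count = await redis.incr(key);
  if (count === 1) await redis.expire(key, 60 * 60); // start the 1-hour window on the first failure
  return count > 10;
}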
B2. systemd unit (full hardening flag set)
File: infra/systemd/deneva-mcp.service
[Unit]
Description=MCP Marketing Server
After=network.target postgresql.service
Wants=postgresql.service
[Service]
Type=simple
User=deneva-mcp
Group=deneva-mcp
WorkingDirectory=/opt/deneva-mcp
ExecStart=/usr/bin/node dist/index.js
Restart=on-failure
RestartSec=5s
Environment=NODE_ENV=production
Environment=PORT=3001
Environment=SYSTEMD_UNIT=deneva-mcp.service
# Encrypted credentials — see §systemd-creds bootstrap below
LoadCredentialEncrypted=CREDENTIAL_KEK:/etc/deneva-mcp/creds/CREDENTIAL_KEK.cred
LoadCredentialEncrypted=API_KEY_HMAC_SECRET:/etc/deneva-mcp/creds/API_KEY_HMAC_SECRET.cred
LoadCredentialEncrypted=DB_PASSWORD:/etc/deneva-mcp/creds/DB_PASSWORD.cred
LoadCredentialEncrypted=GOOGLE_CLIENT_SECRET:/etc/deneva-mcp/creds/GOOGLE_CLIENT_SECRET.cred
LoadCredentialEncrypted=GOOGLE_DEVELOPER_TOKEN:/etc/deneva-mcp/creds/GOOGLE_DEVELOPER_TOKEN.cred
LoadCredentialEncrypted=META_APP_SECRET:/etc/deneva-mcp/creds/META_APP_SECRET.cred
LoadCredentialEncrypted=TIKTOK_APP_SECRET:/etc/deneva-mcp/creds/TIKTOK_APP_SECRET.cred
LoadCredentialEncrypted=INNGEST_SIGNING_KEY:/etc/deneva-mcp/creds/INNGEST_SIGNING_KEY.cred
LoadCredentialEncrypted=INNGEST_EVENT_KEY:/etc/deneva-mcp/creds/INNGEST_EVENT_KEY.cred
LoadCredentialEncrypted=ADMIN_TOKEN:/etc/deneva-mcp/creds/ADMIN_TOKEN.cred
# Sandboxing — full set
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
PrivateDevices=true
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectKernelLogs=true
ProtectControlGroups=true
ProtectClock=true
ProtectHostname=true
ProtectProc=invisible
ProcSubset=pid
RestrictNamespaces=true
RestrictRealtime=true
RestrictSUIDSGID=true
LockPersonality=true
# MemoryDenyWriteExecute stays off: V8's JIT needs writable-then-executable pages and Node fails under it.
MemoryDenyWriteExecute=false
SystemCallArchitectures=native
SystemCallFilter=@system-service
SystemCallFilter=~@privileged @resources @debug @mount @module @reboot @swap
CapabilityBoundingSet=
AmbientCapabilities=
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
ReadWritePaths=/var/log/deneva-mcp
UMask=0027
[Install]
WantedBy=multi-user.target
B3. systemd-creds bootstrap
# Run once per secret, as root, on the prod host
echo -n 'PASTE_SECRET_HERE' \
| sudo systemd-creds encrypt --name=CREDENTIAL_KEK - /etc/deneva-mcp/creds/CREDENTIAL_KEK.cred
sudo chmod 0600 /etc/deneva-mcp/creds/*.cred
Deviation from Phase 1: Phase 1 used `LoadCredential` (plaintext on disk); Phase 5 upgrades to `LoadCredentialEncrypted` (encrypted under the host TPM / system key). This is the production-grade form.
B4. Replace Phase 5 placeholder admin token
Phase 1 §D3 used API_KEY_HMAC_SECRET as a stand-in admin token. Phase 5 introduces a dedicated secret and removes the alias.
// security/admin-auth.ts
import { timingSafeEqual } from 'node:crypto';
import { loadSecret } from './secrets.loader.js';
const adminToken = (await loadSecret('ADMIN_TOKEN' as never)).toString('utf8');
export function verifyAdminToken(presented: string | undefined): boolean {
if (!presented) return false;
const a = Buffer.from(presented), b = Buffer.from(adminToken);
if (a.length !== b.length) return false;
return timingSafeEqual(a, b);
}
Replace every `req.headers['x-admin-token'] !== adminToken` (and the `req.headers[ADMIN_HEADER_NAME] !== adminToken` variant) with `!verifyAdminToken(req.headers['x-admin-token'] as string)` across admin-routes.ts. There are six sites: Phase 1 §D3 (rotation endpoint, uses the `ADMIN_HEADER_NAME` constant), Phase 2 §F2 / §G2, Phase 3 §G2, Phase 4 §F2 / §G3. A sketch of one swapped call site follows the greps. Verify completeness with two greps (the second catches the §D3 site, which uses the constant rather than the literal):
grep -rn "x-admin-token.*!==" src/ # must return zero matches
grep -rn "ADMIN_HEADER_NAME.*!==" src/ # must return zero matchesB5. Acceptance
B5. Acceptance
- `systemd-analyze security deneva-mcp.service` returns an exposure level of `0.x SAFE` or “MEDIUM” (anything below 5).
- `nmap -p 1-65535 your-domain.com` shows only 22, 80, 443 open from the public internet.
- `ps -o user= -p $(pidof node)` returns `deneva-mcp`, never `root`.
Workstream C — OAuth callback IP allow-list
C1. Vendor source-IP lists
Each platform publishes (or you’ve measured) the source IPs from which their OAuth servers redirect users back. However: OAuth redirects originate from the user’s browser, not from the platform — IP allow-listing the callback would block legitimate users.
Correct interpretation: allow-list the /api/inngest endpoint to Inngest Cloud’s egress IPs (server-to-server). For OAuth callbacks, the right defence is state + PKCE + signed-cookie session, not IP. Phase 1 / 2 already enforce these.
So the actual Phase 5 work here is:
- `/api/inngest` IP allow-list for Inngest Cloud egress.
- `/admin/*` IP allow-list for operator IPs.
- `/auth/*/callback` keeps no IP restriction; the security is `state` + PKCE.
C2. nginx allowlist snippets
# /etc/nginx/snippets/allowlist-inngest.conf
# Pulled from Inngest's published egress IP list — refresh quarterly.
allow 35.193.0.0/16; # placeholder — replace with actual list
allow 35.197.0.0/16;
deny all;
# /etc/nginx/snippets/allowlist-admin.conf
allow 203.0.113.0/24; # operator office
allow 198.51.100.10/32; # ops VPN
deny all;
C3. Update procedure
A documented quarterly task in docs/compliance/key-management.md (§I): check Inngest’s published IP ranges, update allowlist-inngest.conf, reload nginx, verify a test event still arrives.
C4. Acceptance
- `curl -X POST https://your-domain.com/api/inngest` from a non-allow-listed IP returns 403.
- A correctly-signed POST from an allow-listed IP routes through.
- `curl https://your-domain.com/admin/metrics/sync` from a non-allow-listed IP returns 403 before hitting Fastify (independent of the admin token).
Workstream D — Audit log DELETE gate
Phase 4 §E4 broadly granted DELETE ON audit_log TO mcp_app so the archive CTE could run. Phase 5 narrows this to a single SECURITY DEFINER stored procedure.
D1. Stored procedure
CREATE OR REPLACE FUNCTION archive_old_audit_rows(p_cutoff timestamptz)
RETURNS integer
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = public, pg_temp
AS $$
DECLARE
moved_count integer;
BEGIN
WITH moved AS (
DELETE FROM audit_log
WHERE created_at < p_cutoff
RETURNING *
)
INSERT INTO audit_log_archive
SELECT * FROM moved;
GET DIAGNOSTICS moved_count = ROW_COUNT;
RETURN moved_count;
END;
$$;
REVOKE EXECUTE ON FUNCTION archive_old_audit_rows(timestamptz) FROM PUBLIC;
GRANT EXECUTE ON FUNCTION archive_old_audit_rows(timestamptz) TO mcp_app;
-- Take back the broad DELETE GRANT from Phase 4
REVOKE DELETE ON audit_log FROM mcp_app;
The function is owned by mcp_admin (the migration role); SECURITY DEFINER means it runs with mcp_admin’s privileges when called by mcp_app. The function body is the only path through which audit rows can leave audit_log.
D2. Update the archive job
File: src/sync/functions.ts (replace the Phase 4 §E3 inline CTE)
export const archiveAuditLog = inngest.createFunction(
{ id: 'gdpr-archive-audit', retries: 1 },
{ cron: '30 2 * * *' },
async ({ step }) => {
const cutoff = new Date(Date.now() - 365 * 24 * 60 * 60 * 1000);
const moved = await step.run('archive-via-proc', async () => {
const r = await db.execute(sql`SELECT archive_old_audit_rows(${cutoff}) as moved`);
return Number((r.rows[0] as { moved: number }).moved);
});
await writeAuditEvent('gdpr.archive_audit', 'success', { moved });
},
);
D3. Acceptance
- `mcp_app` running a raw `DELETE FROM audit_log WHERE id = ...` fails with permission denied (exercised by the test sketch below).
- `mcp_app` running `SELECT archive_old_audit_rows(now() - interval '13 months')` succeeds and returns the moved count.
- Total row count across `audit_log` + `audit_log_archive` is preserved.
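A sketch of the matching integration test (listed in §J1), assuming vitest and node-postgres with a connection string for the `mcp_app` role (the env-var name is illustrative):
// tests/integration/audit-delete-gate.test.ts (sketch)
import { describe, expect, it } from 'vitest';
import { Pool } from 'pg';

const appPool = new Pool({ connectionString: process.env.MCP_APP_DATABASE_URL });

describe('audit_log DELETE gate', () => {
  it('rejects a raw DELETE issued as mcp_app', async () => {
    await expect(
      appPool.query(`DELETE FROM audit_log WHERE created_at < now() - interval '100 years'`),
    ).rejects.toThrow(/permission denied/i);
  });

  it('allows the SECURITY DEFINER procedure and returns a count', async () => {
    const r = await appPool.query(`SELECT archive_old_audit_rows(now() - interval '13 months') AS moved`);
    expect(Number(r.rows[0].moved)).toBeGreaterThanOrEqual(0);
  });
});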
Workstream E — KEK rotation
E1. Rotation script
File: scripts/rotate-kek.ts
/**
* KEK rotation: rewrap every tenant DEK from KEK v_old to KEK v_new.
*
* Usage:
* OLD_KEK_PATH=/run/credentials/.../CREDENTIAL_KEK_V1 \
* NEW_KEK_PATH=/run/credentials/.../CREDENTIAL_KEK_V2 \
* tsx scripts/rotate-kek.ts
*
* Pre-checks (manual): both KEK files exist, both are 32 bytes, DB up.
* Post-checks: every row's kek_version increments, decryptToken still works for a sampled tenant.
*/
import { readFile } from 'node:fs/promises';
import { createCipheriv, createDecipheriv, randomBytes } from 'node:crypto';
import { eq, lt } from 'drizzle-orm';
import { db } from '../src/db/index.js';
import { tenantDeks } from '../src/db/schema.js';
const oldKek = await readFile(process.env.OLD_KEK_PATH!);
const newKek = await readFile(process.env.NEW_KEK_PATH!);
if (oldKek.length !== 32 || newKek.length !== 32) throw new Error('Both KEKs must be 32 bytes');
const NEW_VERSION = Number(process.env.NEW_KEK_VERSION ?? '2');
const BATCH = 100;
let total = 0;
while (true) {
const rows = await db.select().from(tenantDeks)
.where(lt(tenantDeks.kekVersion, NEW_VERSION))
.limit(BATCH);
if (rows.length === 0) break;
for (const row of rows) {
// 1. Decrypt the DEK under the OLD KEK.
const oldDecipher = createDecipheriv('aes-256-gcm', oldKek, row.dekIv);
oldDecipher.setAuthTag(row.dekTag);
const dek = Buffer.concat([oldDecipher.update(row.dekEnc), oldDecipher.final()]);
// 2. Re-encrypt under the NEW KEK with a fresh IV.
const iv = randomBytes(12);
const newCipher = createCipheriv('aes-256-gcm', newKek, iv);
const dekEnc = Buffer.concat([newCipher.update(dek), newCipher.final()]);
const dekTag = newCipher.getAuthTag();
// 3. Update in a single statement.
await db.update(tenantDeks).set({
dekEnc, dekIv: iv, dekTag,
kekVersion: NEW_VERSION,
rotatedAt: new Date(),
}).where(eq(tenantDeks.tenantId, row.tenantId));
total += 1;
}
console.log(`Rotated ${total} so far...`);
}
console.log(`KEK rotation complete: ${total} tenants migrated to v${NEW_VERSION}`);
Crucial property: the plaintext DEK does not change — only the KEK that wraps it. This means tenant data encrypted with the DEK (tokens) does NOT need to be re-encrypted, only the DEK row itself. Zero downtime on user data.
E2. Coordinated deployment
For zero downtime the running server must support reading both KEK versions during the rotation window. Update secrets.loader.ts:
async function loadKekForVersion(version: number): Promise<Buffer> {
return loadSecret(`CREDENTIAL_KEK_V${version}` as never);
}
// credentials.service.ts
async function getOrCreateDek(tenantId: string): Promise<Buffer> {
const [row] = await db.select().from(tenantDeks).where(eq(tenantDeks.tenantId, tenantId));
if (row) {
const kek = await loadKekForVersion(row.kekVersion);
const decipher = createDecipheriv('aes-256-gcm', kek, row.dekIv);
decipher.setAuthTag(row.dekTag);
return Buffer.concat([decipher.update(row.dekEnc), decipher.final()]);
}
// New tenant → use the highest version available.
const newest = Number(process.env.CREDENTIAL_KEK_NEWEST_VERSION ?? '1');
// ... encrypt under newest, persist with kekVersion=newest
}
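The elided branch can be completed roughly as follows (a sketch: column names mirror those used by `scripts/rotate-kek.ts`; the Phase 2 schema is authoritative):
// Inside getOrCreateDek, in place of the "encrypt under newest" elision above (sketch).
const newestKek = await loadKekForVersion(newest);
const dek = randomBytes(32);                         // fresh per-tenant DEK
const iv = randomBytes(12);
const cipher = createCipheriv('aes-256-gcm', newestKek, iv);
const dekEnc = Buffer.concat([cipher.update(dek), cipher.final()]);
await db.insert(tenantDeks).values({
  tenantId,
  dekEnc,
  dekIv: iv,
  dekTag: cipher.getAuthTag(),
  kekVersion: newest,
});
return dek;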
E3. Runbook
File: docs/compliance/runbooks/kek-rotation.md
Documented step-by-step: pre-checks, deploy multi-version-capable code, generate new KEK, encrypt-with-systemd-creds + deploy, run the script, verify a sampled tenant decrypt, retire old KEK after a 30-day grace period.
E4. Acceptance
- After rotation, every `tenant_deks` row has `kekVersion = NEW_VERSION` and a recent `rotatedAt`.
- `decryptToken` still round-trips for a sampled tenant before, during, and after rotation.
- Service stays up the entire time (use the Phase 4 heartbeat to verify).
- An attempted decrypt with the OLD KEK on a rewrapped row fails with auth-tag-invalid (sanity check that the rewrap actually happened).
Workstream F — Prometheus + Grafana
F1. /metrics endpoint
// src/observability/metrics.ts
import { register, Counter, Histogram, Gauge, collectDefaultMetrics } from 'prom-client';
collectDefaultMetrics();
export const httpRequests = new Counter({
name: 'mcp_http_requests_total',
help: 'HTTP requests by route, method, status',
labelNames: ['route', 'method', 'status'],
});
export const httpDuration = new Histogram({
name: 'mcp_http_duration_ms',
help: 'Request duration in ms',
labelNames: ['route', 'method'],
buckets: [10, 25, 50, 100, 250, 500, 1000, 2500, 5000],
});
export const cacheHits = new Counter({ name: 'mcp_cache_hits_total', help: 'Cache hits', labelNames: ['platform', 'report_type'] });
export const cacheMisses = new Counter({ name: 'mcp_cache_misses_total', help: 'Cache misses', labelNames: ['platform', 'report_type'] });
export const authFailures = new Counter({ name: 'mcp_auth_failures_total', help: 'Auth failures', labelNames: ['reason'] });
export const rateLimitHits = new Counter({ name: 'mcp_rate_limit_total', help: 'Rate limits triggered', labelNames: ['scope'] });
export const syncStatus = new Counter({ name: 'mcp_sync_total', help: 'Sync results', labelNames: ['platform', 'report_type', 'status'] });
export const tokenRefreshes = new Counter({ name: 'mcp_token_refresh_total', help: 'Token refreshes', labelNames: ['platform', 'outcome'] });
export const heartbeatAgeSeconds = new Gauge({ name: 'mcp_inngest_heartbeat_age_seconds', help: 'Seconds since last Inngest heartbeat' });
export const metricsRegistry = register;
F2. Mount + populate
// src/index.ts (after auth/rate-limit setup)
app.get('/metrics', async (req, reply) => {
// /metrics is allow-listed at nginx (§F3) — this is defence-in-depth:
// even if nginx is misconfigured or removed, only loopback callers (i.e. nginx → 127.0.0.1)
// can scrape. Public-IP requests get 404 so the endpoint is invisible to scanners.
if (req.ip !== '127.0.0.1' && req.ip !== '::1') {
return reply.code(404).send();
}
reply.header('content-type', metricsRegistry.contentType);
return metricsRegistry.metrics();
});
app.addHook('onResponse', async (req, reply) => {
const route = req.routeOptions?.url ?? req.url;
httpRequests.inc({ route, method: req.method, status: reply.statusCode });
httpDuration.observe({ route, method: req.method }, reply.elapsedTime);
});
Fastify sees `req.ip` as the immediate TCP peer, which is nginx on `127.0.0.1` since the proxy connects via loopback (`proxy_pass http://127.0.0.1:3001`). The real client IP lives in `X-Forwarded-For` — but for this guard we want the immediate peer, because the threat model is “what if someone bypasses nginx and hits Fastify directly.” Do NOT enable Fastify’s `trustProxy` for this route or the loopback check becomes spoofable.
Replace the in-memory cache counters from Phase 2 §H with `cacheHits.inc({...})` / `cacheMisses.inc({...})`. Same swap for the Phase 4 heartbeat: the `/admin/health/inngest` endpoint can also update `heartbeatAgeSeconds`.
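A sketch of both swaps (the cache lookup and the heartbeat timestamp are stand-ins for the Phase 2 / Phase 4 names, which may differ):
// src/observability/metrics-wiring.ts (illustrative): how the counters and gauge get fed
import { cacheHits, cacheMisses, heartbeatAgeSeconds } from './metrics.js';

type CacheLookup = (platform: string, reportType: string) => Promise<unknown | null>;

export async function readThroughCache(
  platform: string,
  reportType: string,
  lookup: CacheLookup,            // stand-in for the Phase 2 §H cache read
  load: () => Promise<unknown>,   // platform API call on a miss
): Promise<unknown> {
  const cached = await lookup(platform, reportType);
  if (cached !== null) {
    cacheHits.inc({ platform, report_type: reportType });
    return cached;
  }
  cacheMisses.inc({ platform, report_type: reportType });
  return load();
}

// Called from the /admin/health/inngest handler with the Phase 4 heartbeat timestamp (ms epoch).
export function updateHeartbeatGauge(lastHeartbeatAtMs: number): void {
  heartbeatAgeSeconds.set((Date.now() - lastHeartbeatAtMs) / 1000);
}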
F3. nginx allow-list /metrics
location = /metrics {
include snippets/allowlist-prometheus.conf; # only the Prometheus host
proxy_pass http://127.0.0.1:3001;
}
F4. Prometheus + Grafana
Provision via docker-compose on the same host or a sibling instance:
File: infra/observability/docker-compose.yml
services:
prometheus:
image: prom/prometheus:latest
# network_mode: host so Prometheus can scrape Fastify on 127.0.0.1:3001
# without crossing a Docker bridge — Fastify's loopback bind stays intact.
network_mode: host
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
- prom_data:/prometheus
# No `ports:` mapping under host networking. Prometheus's own listener is
# bound to loopback via the CLI args below so 9090 is not publicly exposed.
command:
- --config.file=/etc/prometheus/prometheus.yml
- --web.listen-address=127.0.0.1:9090
grafana:
image: grafana/grafana-oss:latest
# Host networking here too: Grafana's datasource URL can then be http://127.0.0.1:9090,
# which is where Prometheus listens. A bridge-networked container could not reach that
# loopback-only listener. Grafana's own UI stays loopback-only via GF_SERVER_HTTP_ADDR.
network_mode: host
volumes: [ grafana_data:/var/lib/grafana, ./dashboards:/var/lib/grafana/dashboards:ro ]
environment: { GF_SECURITY_ADMIN_PASSWORD_FILE: /run/secrets/grafana_admin, GF_SERVER_HTTP_ADDR: "127.0.0.1" }
secrets: [ grafana_admin ]
volumes: { prom_data: {}, grafana_data: {} }
secrets:
grafana_admin: { file: ./grafana_admin_password }
File: infra/observability/prometheus.yml
global: { scrape_interval: 30s }
scrape_configs:
- job_name: deneva-mcp
# Prometheus runs with network_mode: host (see docker-compose above), so
# loopback reaches Fastify directly. Fastify stays bound to 127.0.0.1 only.
static_configs: [ { targets: ['127.0.0.1:3001'] } ]
metrics_path: /metrics
F5. Dashboards (committed JSON)
Files: infra/observability/dashboards/{auth.json,sync.json,cache.json,latency.json}
Each dashboard pre-built so Grafana auto-loads on first run. Required panels:
- auth.json: auth failures by reason, IP-block engagements over time, rate-limit triggers.
- sync.json: sync success rate per platform, p95 sync duration, current unhealthy tenants.
- cache.json: cache hit rate per (platform, report_type), miss rate trend.
- latency.json: p50/p95/p99 request duration per route.
F6. Acceptance
- `curl http://127.0.0.1:3001/metrics` returns valid Prometheus exposition format.
- From the Prometheus container: `docker exec <prometheus-container> wget -qO- http://127.0.0.1:3001/metrics` returns the exposition body (proves host networking + Fastify’s loopback bind both work).
- Prometheus scrapes successfully (check Targets page → `deneva-mcp` UP).
- Grafana auto-loads dashboards; data populates within one scrape interval.
- Hit `/mcp` → counter increments visible in Prometheus within 30s.
Workstream G — External alerting
G1. Webhook integration
The simplest cross-vendor option: use an external uptime monitor (UptimeRobot / Better Stack / Pingdom — all have free tiers). Configure two checks:
- `GET https://your-domain.com/admin/health/inngest` with the admin token in headers — page on non-200. (Add the monitor’s source IPs to `allowlist-admin.conf` or the check will 403.)
- Prometheus self-check: a secondary monitor scrapes Prometheus’s own `/-/healthy` and pages on 4xx/5xx. (Prometheus listens on loopback only, so this check either runs host-local or `/-/healthy` is proxied through nginx behind its own allow-list.)
Routing the page is the monitor’s responsibility (most integrate with PagerDuty / OpsGenie / Slack natively). No code change needed; this is a configuration deliverable.
G2. Optional: in-app webhook for critical events
For events that aren’t covered by polling — e.g., 10 auth failures from one IP in 5 minutes triggering an IP block — emit a webhook to a configured destination.
// src/observability/alerts.ts
const WEBHOOK_URL = process.env.ALERTS_WEBHOOK_URL;
export async function sendAlert(event: {
severity: 'info' | 'warn' | 'critical';
title: string;
detail: Record<string, unknown>;
}): Promise<void> {
if (!WEBHOOK_URL) return;
void fetch(WEBHOOK_URL, {
method: 'POST',
headers: { 'content-type': 'application/json' },
body: JSON.stringify({ ...event, source: 'deneva-mcp', timestamp: new Date().toISOString() }),
}).catch(() => { /* never throw from alert path */ });
}
Wire from:
- IP-block engagement (Phase 1 §E4) — `severity: 'warn'` (see the sketch after this list).
- Sync exhaustion → unhealthy (Phase 4 §C2) — `severity: 'warn'`.
- Audit-log archive failure — `severity: 'critical'`.
- KEK rotation failure mid-run — `severity: 'critical'`.
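A sketch of the first site, the IP-block path calling `sendAlert` (function and field names are illustrative; the Phase 1 §E4 code is authoritative):
// security/ip-block.ts (illustrative): fire-and-forget alert when a block engages
import { sendAlert } from '../observability/alerts.js';

export function onIpBlocked(ip: string, failureCount: number, windowMinutes: number): void {
  void sendAlert({
    severity: 'warn',
    title: 'IP blocked after repeated auth failures',
    detail: { ip, failureCount, windowMinutes },
  });
}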
G3. Acceptance
- Trigger an IP block in staging — webhook delivered, payload contains the IP.
- Stop the Inngest dev server for 11 minutes — uptime monitor pages within its check interval.
- Document the on-call rotation in `docs/compliance/incident-response.md` (§I).
Workstream H — Penetration test (DIY)
H1. Automated tooling in CI
File: .github/workflows/security-scan.yml
name: Security Scan
on:
schedule: [ { cron: '0 6 * * 1' } ] # weekly Mon 06:00 UTC
workflow_dispatch:
jobs:
zap-baseline:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: ZAP baseline scan
uses: zaproxy/action-baseline@v0.12.0
with:
target: 'https://staging.your-domain.com'
rules_file_name: '.zap/rules.tsv'
fail_action: true
nuclei:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: projectdiscovery/nuclei-action@main
with:
target: 'https://staging.your-domain.com'
templates: 'cves,exposed-panels,misconfiguration,vulnerabilities'
npm-audit:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with: { node-version: '22' }
- run: npm ci && npm run audit
`.zap/rules.tsv` lets you suppress known false-positives with a justification (committed as evidence).
H2. Manual checklist
File: docs/compliance/pentest-checklist.md
Categories with concrete tests:
Authentication & session
- API key brute force: 1000 random keys → all 401, IP blocked after threshold (engages `auth.blocked_ip`).
- API key timing attack: measure response time for valid-format-invalid-value vs a valid key. The statistical t-test should be inconclusive (see the probe sketch after this list).
- Admin endpoint without `x-admin-token` → 401; the nginx `/admin` IP-allowlist → 403 from a non-allowed IP.
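A sketch of the timing probe (script name, key format, and thresholds are assumptions; adapt to the Phase 1 API-key shape):
// scripts/timing-probe.ts (hypothetical): run with tsx against staging only.
// Point TARGET at the Fastify port from the staging host (http://127.0.0.1:3001/mcp)
// so nginx rate limits don't skew the sample with 429s.
const TARGET = process.env.TARGET_URL ?? 'http://127.0.0.1:3001/mcp';
const VALID_KEY = process.env.VALID_API_KEY ?? '';
const WRONG_KEY = 'mcp_' + 'a'.repeat(43); // well-formed but invalid; adjust to the real key format
const N = 200;

async function sample(key: string): Promise<number[]> {
  const times: number[] = [];
  for (let i = 0; i < N; i++) {
    const t0 = process.hrtime.bigint();
    await fetch(TARGET, {
      method: 'POST',
      headers: { 'x-api-key': key, 'content-type': 'application/json' },
      body: '{}',
    });
    times.push(Number(process.hrtime.bigint() - t0) / 1e6); // milliseconds
  }
  return times;
}

function stats(xs: number[]): { mean: number; variance: number; n: number } {
  const mean = xs.reduce((a, b) => a + b, 0) / xs.length;
  const variance = xs.reduce((a, b) => a + (b - mean) ** 2, 0) / (xs.length - 1);
  return { mean, variance, n: xs.length };
}

const wrong = stats(await sample(WRONG_KEY));
const valid = stats(await sample(VALID_KEY));
// Welch t statistic; values near 0 mean the two timing distributions are indistinguishable.
const t = (wrong.mean - valid.mean) / Math.sqrt(wrong.variance / wrong.n + valid.variance / valid.n);
console.log({ wrongMeanMs: wrong.mean.toFixed(2), validMeanMs: valid.mean.toFixed(2), welchT: t.toFixed(2) });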
OAuth
- State replay: complete OAuth, capture `state`, replay → second use rejected.
- State CSRF: hit `/auth/google/callback` with a `state` minted for tenant B from a session of tenant A → rejected.
- Open redirect: `/auth/google/start?redirect_uri=https://evil.com` → ignored; the callback URI is config-pinned.
- PKCE downgrade: omit `code_verifier` on the Google callback → token exchange fails.
- Scope tampering: complete OAuth with reduced scopes → `ScopeMissingError` + audit row.
Injection / input
- SQL injection in every query parameter we accept (especially `accountId` in `/auth/:platform/accounts/select`).
- JSON-body fuzzing on `/mcp` tool calls — every closed enum rejects unknowns.
- Header injection: `X-Forwarded-For: \r\nSet-Cookie: ...` → nginx strips CR/LF.
Rate limiting / abuse
- 200 req/s from one IP → 429s engage; audit rows present.
- Distributed (10 IPs × 50 req/s) — global Fastify limit triggers; per-tenant limit triggers separately.
- Auth failure flood (11+ in 1h from one IP) → IP block engaged; webhook fired.
Transport / TLS
- `nmap --script ssl-enum-ciphers` → only TLS 1.3 advertised.
- HSTS preload check via https://hstspreload.org/.
- Certificate transparency search for the domain — only the expected certs appear.
Secrets / config
- Process memory dump (in staging only): grep for KEK / DEK / API key plaintext — only in known short-lived buffers.
- Logs / Pino output: grep for `Bearer`, `ya29`, `EAA` (Meta token prefix) → no hits.
- Env-var dump: `cat /proc/$(pidof node)/environ | tr '\0' '\n'` → no secret values.
Inngest
- Unsigned POST to `/api/inngest` → 401.
- Replayed signed request (same body, old timestamp) → rejected (the Inngest SDK enforces a timestamp window).
- Forged signature → 401.
MCP-specific
- Tool registry enumeration: confirm only the registered tools are callable; an attacker calling `__internal__` fails.
- Cross-tenant cache leak: tenant A’s request never returns tenant B’s cache row even with crafted JSON IDs (RLS catches it).
H3. Findings ledger
docs/compliance/pentest-findings.md — table of finding → severity → status (open / mitigated / accepted-with-rationale) → date. Fixed and accepted findings stay in the ledger as evidence; this is what a SOC 2 auditor wants.
H4. Acceptance
- ZAP baseline scan job is green; the `.zap/rules.tsv` file documents every suppressed alert.
- The nuclei job reports zero high or critical findings; medium findings are tracked in the ledger.
- Manual checklist run end-to-end with results captured in the ledger.
Workstream I — Compliance documentation
I1. Folder layout
docs/compliance/
├── ROPA.md # Records of Processing Activities (GDPR Art. 30)
├── access-control-policy.md # who has access to what; review cadence
├── incident-response.md # on-call, escalation, customer notification SLAs
├── key-management.md # KEK / DEK / API key lifecycle, rotation cadence, disposal
├── data-retention.md # what we keep, how long, where, how it's destroyed
├── pentest-checklist.md # H2
├── pentest-findings.md # H3
├── runbooks/
│ ├── kek-rotation.md # E3
│ ├── dek-rotation-per-tenant.md # uses Phase 2 §G
│ ├── tenant-erasure.md # uses Phase 3 §G
│ ├── database-restore.md # K6 — "we lost the database"
│ └── inngest-incident.md # what to do when /admin/health/inngest goes 503
└── soc2-evidence-binder.md       # index pointing at every artifact above
I2. ROPA template (excerpt)
File: docs/compliance/ROPA.md
# Record of Processing Activities
| Field | Value |
|---|---|
| Controller / Processor | Processor (acting on behalf of customer-controllers) |
| Purpose of processing | Aggregating and serving advertising performance metrics |
| Categories of data subjects | None directly. Aggregated ad-platform performance metrics processed at the campaign level. |
| Categories of personal data | None as defined by GDPR Art. 4(1). Tenant identifiers (UUIDs), API key hashes, OAuth tokens (encrypted at rest). |
| Recipients | Tenant-authenticated MCP clients only. No third-party data sharing. |
| Cross-border transfers | Tenants outside the EU are processed in the same EU region. (Update if multi-region added.) |
| Retention | metric_cache: 90 days. sync_log: 30 days. audit_log: 12 months active + archive. |
| Security measures | TLS 1.3, AES-256-GCM envelope encryption per tenant, RLS, HMAC-keyed API keys, full audit logging. |
| Lawful basis | Contract with customer-controller (DPA). |
I3. Access-control policy (excerpt)
# Access Control Policy
## Roles
- **deneva-mcp (service)** — runs the application. Read/write `mcp_app` PG role. Cannot read `audit_log_archive` historical detail. No SSH access.
- **operator** — sysadmin. SSH from allow-listed IPs only; sudo audited. May rotate KEK, run DEK rotation per tenant, view all `/admin/*` endpoints.
- **auditor** — read-only. PG role with SELECT on audit_log + audit_log_archive only. No application secrets.
## Review cadence
- API key list reviewed quarterly: keys with `last_used_at` older than 90 days flagged for revocation.
- SSH allowlist reviewed monthly.
- Operator list reviewed on every team change.
- Failed-login bursts reviewed within 24h of detection (audit alerts §G2).
I4. Incident-response (excerpt)
# Incident Response
## Severity definitions
- **Sev-1**: customer data exposed, service down >15min, key material compromised.
- **Sev-2**: degraded service (sync failures across all tenants, auth flaky).
- **Sev-3**: single-tenant degradation, non-customer-facing alerts.
## Response timeline
| Severity | Acknowledge | Customer notification | Post-mortem |
|---|---|---|---|
| Sev-1 | 15 min | 2 hours | 5 business days |
| Sev-2 | 1 hour | 24 hours (if customer-affecting) | 10 business days |
| Sev-3 | Next business day | Not required | At discretion |
## Runbooks
- KEK / DEK compromise → docs/compliance/runbooks/kek-rotation.md
- Inngest dead → docs/compliance/runbooks/inngest-incident.md
- Tenant data deletion request → docs/compliance/runbooks/tenant-erasure.md
- Database loss (full or partial) → docs/compliance/runbooks/database-restore.md
I5. Acceptance
- All files in §I1 exist and are non-empty.
- The SOC 2 evidence binder index links to every committed artifact.
- A peer review on the docs (in PR review form) signs off — the index is the deliverable, not just the individual files.
Workstream J — Tests
J1. New test files
| File | Covers |
|---|---|
| `tests/integration/security-headers.test.ts` | A — verifies every required header on every public route (example sketch below) |
| `tests/integration/admin-token.test.ts` | B4 — replaces every Phase 2/3/4 admin-token check with the new helper |
| `tests/integration/audit-delete-gate.test.ts` | D — DELETE blocked, stored proc allowed |
| `tests/integration/kek-rotation.test.ts` | E — multi-version decrypt round-trips |
| `tests/integration/metrics-endpoint.test.ts` | F — Prometheus exposition shape |
| `tests/integration/alerts-webhook.test.ts` | G — webhook called on the configured triggers |
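For example, the first file might look like this (a sketch assuming vitest and undici; swap in the project's actual runner and HTTP client):
// tests/integration/security-headers.test.ts (sketch)
import { describe, expect, it } from 'vitest';
import { request } from 'undici';

const BASE = process.env.STAGING_BASE_URL ?? 'https://staging.your-domain.com';
const ROUTES = ['/mcp', '/auth/google/start', '/tenant/connections'];
const REQUIRED_HEADERS = [
  'strict-transport-security',
  'x-frame-options',
  'x-content-type-options',
  'referrer-policy',
  'permissions-policy',
  'content-security-policy',
];

describe('security headers (Workstream A)', () => {
  for (const route of ROUTES) {
    it(`${route} carries every required header and hides the server version`, async () => {
      // `always` on add_header means the headers must be present even on 401/404 responses.
      const res = await request(`${BASE}${route}`);
      for (const name of REQUIRED_HEADERS) expect(res.headers[name], name).toBeDefined();
      expect(String(res.headers['server'] ?? '')).not.toMatch(/nginx\/\d/);
    });
  }
});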
J2. Staging environment
The ZAP / nuclei jobs need a running target. Phase 5 introduces a staging environment matching production:
- Same systemd + nginx config.
- Same Postgres setup (dedicated DB).
- Real Google / Meta / TikTok test apps with sandboxed credentials.
- Restored from a snapshot before each weekly scan run (so attack residue doesn’t accumulate).
J3. CI orchestration
push to main → existing CI (Phases 1–4) → deploy to staging → security-scan workflow runs → if green, manual promote to prod
Workstream K — Postgres backup & restore (low-effort v1)
Scope deliberately small. This is the “we can survive losing the database” version. It uses nightly logical dumps only — no WAL archiving, no PITR, no replicas. RPO is one day. A v2 with continuous archiving and PITR is listed in §12.
K1. Targets
| Metric | v1 (this phase) | v2 (post-launch, §12) |
|---|---|---|
| RPO (data loss window) | ≤ 24h | ≤ 5 min |
| RTO (time to restored service) | ≤ 4h | ≤ 1h |
| Mechanism | nightly pg_dump -Fc, encrypted, off-host | WAL archiving + base backups, PITR via pgBackRest or similar |
| Restore-test cadence | weekly automated drill on staging | continuous (replica is the test) |
The v1 numbers are honest — a tenant whose campaign-sync ran an hour before the host died will lose that hour of metric_cache and audit_log rows. Document this in the DPA / customer comms.
K2. Backup script
File: infra/backup/pg-backup.sh
#!/usr/bin/env bash
# Nightly Postgres backup: pg_dump -> age encrypt -> rclone upload.
# Runs as the `postgres` system user via the systemd unit in K3.
# Exits non-zero on any step failure so systemd flags it and §G2 alerts fire.
set -euo pipefail
DB_NAME="${DB_NAME:-deneva_mcp}"
LOCAL_DIR="${LOCAL_DIR:-/var/backups/deneva-mcp}"
RECIPIENTS="${RECIPIENTS:-/etc/deneva-mcp/backup/age-recipients.txt}"
RCLONE_REMOTE="${RCLONE_REMOTE:-mcp-backups}" # rclone config name
RCLONE_BUCKET="${RCLONE_BUCKET:-deneva-mcp-backups}" # bucket / container
LOCAL_RETENTION_DAYS="${LOCAL_RETENTION_DAYS:-7}"
ts="$(date -u +%Y%m%dT%H%M%SZ)"
out="$LOCAL_DIR/${DB_NAME}-${ts}.dump.age"
mkdir -p "$LOCAL_DIR"
# pg_dump -> age. Stream end-to-end so the plaintext dump never lands on disk.
pg_dump --format=custom --compress=6 --no-owner --no-privileges "$DB_NAME" \
| age --encrypt --recipients-file "$RECIPIENTS" --output "$out"
# Off-host upload. rclone has built-in retry; we still fail loudly on a non-zero exit.
rclone copyto --s3-no-check-bucket "$out" "${RCLONE_REMOTE}:${RCLONE_BUCKET}/$(basename "$out")"
# Local retention: delete encrypted dumps older than N days.
find "$LOCAL_DIR" -maxdepth 1 -type f -name "${DB_NAME}-*.dump.age" -mtime +"$LOCAL_RETENTION_DAYS" -delete
# Tiny success marker — restore-drill (K5) reads this to confirm a fresh dump exists.
date -u +%s > "$LOCAL_DIR/last-success"
Off-host retention is enforced by the bucket’s lifecycle policy (e.g. S3 / B2 / R2 lifecycle rule: keep 30 daily + 12 monthly + delete the rest), not by the script. One source of truth for retention.
K3. systemd timer
File: infra/systemd/mcp-backup.service
[Unit]
Description=Nightly Postgres backup for deneva_mcp
After=postgresql.service network-online.target
Wants=postgresql.service network-online.target
[Service]
Type=oneshot
User=postgres
Group=postgres
EnvironmentFile=/etc/deneva-mcp/backup/env
ExecStart=/opt/deneva-mcp/infra/backup/pg-backup.sh
# Hardening — the backup user only needs to read PG and write the local dir + reach the network.
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
ReadWritePaths=/var/backups/deneva-mcp
ProtectKernelTunables=true
ProtectKernelModules=true
RestrictNamespaces=true
LockPersonality=true
MemoryDenyWriteExecute=true
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
File: infra/systemd/mcp-backup.timer
[Unit]
Description=Run mcp-backup nightly
[Timer]
OnCalendar=*-*-* 02:15:00 UTC
RandomizedDelaySec=10m
# If the host was down at 02:15, run on next boot. (systemd does not allow trailing comments on the value line.)
Persistent=true
Unit=mcp-backup.service
[Install]
WantedBy=timers.target
`/etc/deneva-mcp/backup/env` holds the `RCLONE_*` vars and is mode 0640, owned by postgres:postgres. The rclone remote credentials live in `~postgres/.config/rclone/rclone.conf` (mode 0600), provisioned once during host setup.
K4. Encryption key handling
- Generate the keypair on the operator workstation: `age-keygen -o backup-key.txt`. The public line (`age1...`) goes into `/etc/deneva-mcp/backup/age-recipients.txt` on the prod host. The full file (with the secret line) never touches the prod host.
- Store `backup-key.txt` in two places: (a) the operator-team password manager (1Password / Bitwarden), (b) a printed copy in a physical safe. Loss of the secret = loss of every backup.
- Rotation: generate a new keypair yearly. New backups encrypt to both recipients during the overlap window (`age-recipients.txt` accepts multiple lines); after 30 days, drop the old recipient. Old encrypted dumps remain restorable with the old key — keep the old keys filed by year.
K5. Restore drill (automated)
A second systemd timer, on the staging host (so it doesn’t impact prod or share a failure domain):
File: infra/backup/pg-restore-drill.sh
#!/usr/bin/env bash
# Weekly: pull latest backup -> decrypt -> restore into a scratch DB -> sanity-check -> drop.
# Exits non-zero on any failure so the §G2 webhook fires `severity: 'critical'`.
set -euo pipefail
RCLONE_REMOTE="${RCLONE_REMOTE:-mcp-backups}"
RCLONE_BUCKET="${RCLONE_BUCKET:-deneva-mcp-backups}"
SCRATCH_DB="${SCRATCH_DB:-mcp_restore_drill}"
KEY_FILE="${KEY_FILE:-/etc/deneva-mcp/backup/age-key.txt}" # staging-only: secret key present here
WORK_DIR="$(mktemp -d)"
trap 'rm -rf "$WORK_DIR"' EXIT
latest="$(rclone lsf "${RCLONE_REMOTE}:${RCLONE_BUCKET}" --files-only \
| grep -E '\.dump\.age$' | sort | tail -n1)"
[ -n "$latest" ] || { echo "No backups found"; exit 1; }
rclone copyto "${RCLONE_REMOTE}:${RCLONE_BUCKET}/$latest" "$WORK_DIR/$latest"
age --decrypt --identity "$KEY_FILE" --output "$WORK_DIR/dump" "$WORK_DIR/$latest"
dropdb --if-exists "$SCRATCH_DB"
createdb "$SCRATCH_DB"
pg_restore --dbname="$SCRATCH_DB" --no-owner --no-privileges --jobs=2 "$WORK_DIR/dump"
# Smoke checks. Add a row to each as new tables are introduced.
psql -d "$SCRATCH_DB" -v ON_ERROR_STOP=1 <<'SQL'
SELECT 1/CASE WHEN COUNT(*) > 0 THEN 1 ELSE 0 END FROM tenants;
SELECT 1/CASE WHEN COUNT(*) > 0 THEN 1 ELSE 0 END FROM tenant_deks;
SELECT MAX(created_at) > now() - interval '48 hours' AS recent FROM audit_log
\gset
\if :recent
\else
\echo 'audit_log latest row >48h old — backup may be stale'
SELECT 1/0 AS stale_backup; -- \q takes no exit code; force a non-zero exit via ON_ERROR_STOP instead
\endif
SQL
dropdb "$SCRATCH_DB"
echo "Restore drill OK: $latest"Wired to a mcp-restore-drill.timer running OnCalendar=Sun 04:00:00 UTC. Failure path: systemd OnFailure= → calls a tiny one-liner that POSTs to ALERTS_WEBHOOK_URL with severity=critical.
The staging host is the only place the secret age key lives outside the operator vault. This is a deliberate trade-off: the drill needs to actually decrypt, and human-driven restoration in an outage is too slow if the key has to be fetched from a vault first. Compensating control: staging is on the same UFW + systemd hardening footprint as prod (§B), and the key file is mode
`0400`, owned by `postgres`.
K6. Runbook — “we lost the database”
File: docs/compliance/runbooks/database-restore.md
# Database restore — full loss
## Trigger
Postgres data unrecoverable on prod host: disk failure, accidental DROP, ransomware,
or "the host is gone." Sev-1.
## Pre-flight
- [ ] Acknowledge the incident; start a comms doc; notify on-call channel.
- [ ] Stop traffic: `sudo systemctl stop deneva-mcp` (stopping the app keeps `audit_log` writes
  out of a half-restored DB).
- [ ] Confirm DB is actually unrecoverable (check `pg_isready`, disk, recent dumps on host).
## Restore
1. Provision a clean Postgres 16 instance (same host once disk is replaced, or a new host).
Run `infra/provision.sh` to recreate roles, then create the empty `deneva_mcp` DB.
2. Fetch the age secret key from the operator vault. Place it on the restoration host as
`/root/restore-key.txt`, mode 0400. Delete after restore completes.
3. List available dumps:
`rclone lsf mcp-backups:deneva-mcp-backups | grep dump.age | sort | tail -5`
4. Pull the freshest one:
`rclone copyto mcp-backups:deneva-mcp-backups/<file> /tmp/<file>`
5. Decrypt:
`age --decrypt --identity /root/restore-key.txt --output /tmp/dump /tmp/<file>`
6. Restore (run as `postgres`):
`pg_restore --dbname=deneva_mcp --no-owner --no-privileges --jobs=4 /tmp/dump`
7. Confirm migration version matches the deployed app:
`psql -d deneva_mcp -c "SELECT * FROM drizzle.__drizzle_migrations ORDER BY id DESC LIMIT 1;"`
If the app code is ahead, run `npm run db:migrate` before restarting.
8. Smoke checks (same queries as §K5):
- `SELECT count(*) FROM tenants;` > 0
- `SELECT count(*) FROM tenant_deks;` matches tenant count
- Sample one tenant: `decryptToken` round-trips (the KEK is unchanged — DEKs decrypt fine).
9. Restart the app: `sudo systemctl start deneva-mcp`. Confirm `/admin/health/inngest`
reports 200 and a fresh heartbeat.
10. Wipe restore artefacts: `shred -u /tmp/<file> /tmp/dump /root/restore-key.txt`.
## Post-restore
- [ ] Write a Sev-1 audit-log entry: `system.recovery {dump_used, restored_at, lag_seconds}`.
`lag_seconds` = `now() - dump_timestamp`; this is the actual data loss for this incident.
- [ ] Notify customers per §I4 timeline. The customer-facing message must state the data-loss
window honestly (rows created in the last `lag_seconds` are gone).
- [ ] Rotate the API_KEY_HMAC_SECRET if there's any chance the lost-host's disk can be read by
a third party (i.e., the loss was theft, not media failure). Forces all tenants to
re-issue keys, but the alternative is hoping the disk wasn't recoverable.
- [ ] Post-mortem within 5 business days. Include: actual RTO vs target, root cause, why the
monitor didn't catch it earlier, action items.
## What this runbook does NOT recover
- Data written between the last successful dump and the incident. Use the audit log on the
restored DB to reconstruct what's missing if a customer asks ("you ran sync at 18:00, our
dump was 02:15, those 16h of metric_cache are gone").
- v2 with WAL archiving (§12) reduces this gap to minutes.
K7. Acceptance
- `mcp-backup.timer` is enabled; `systemctl list-timers` shows the next run.
- After one nightly run: an `*.dump.age` file exists locally and in the off-host bucket; `last-success` is updated.
- `mcp-restore-drill.timer` succeeds end-to-end on staging once; failure path tested by feeding it a corrupt dump and confirming the §G2 webhook fires `severity: 'critical'`.
- The runbook in §K6 has been executed end-to-end on staging by an operator who did not write it. Time-to-green captured in the runbook header as the measured RTO.
- Bucket lifecycle policy (30 daily + 12 monthly) verified in the cloud console; old objects expire as expected.
Workstream L — Log retention (journald)
Why this is here. Phase 1 §G4 wires Pino to write structured JSON to stdout, and the systemd unit captures stdout into journald automatically. No app-level change is needed for “shipping” — journald is the log store. What’s missing for production is retention bounds so the disk doesn’t fill, and a documented way to query.
L1. journald config
File: /etc/systemd/journald.conf.d/deneva-mcp.conf (drop-in, not the main journald.conf)
[Journal]
# Write to /var/log/journal and survive reboots. (Comments must sit on their own lines in systemd configs.)
Storage=persistent
# Cap total journal disk usage.
SystemMaxUse=2G
# Rotate when a single file hits this size.
SystemMaxFileSize=200M
# Discard entries older than 30 days.
MaxRetentionSec=30day
# We don't run syslog.
ForwardToSyslog=no
Compress=yes
Apply:
sudo systemctl restart systemd-journald
journalctl --disk-usage   # confirm cap is respected
L2. PII redaction confirmation
Pino’s Phase 1 §G4 setup has two layers: the redact paths strip x-api-key and x-admin-token from request headers, and the err serializer (scrubTokens) walks log payloads and redacts token-shaped strings (Bearer …, ya29.…, EAA[A-Z]…) plus token-named fields (access_token, refresh_token, client_secret, authorization, api_key, x-admin-token). Phase 5 adds an evidence check:
# Should produce zero hits — sanity check before declaring Phase 5 done.
journalctl -u deneva-mcp --since "7 days ago" | grep -E '("x-api-key"|Bearer |ya29\.|EAA[A-Z])'The Phase 1 §G4 serializer should make this grep produce zero matches; this check is a regression-detector, not the primary defence. If anything does match, fix it at the source — extend TOKEN_KEY_RE or the regex list in scrubTokens rather than papering over with the redact paths. The audit-log PII strip from Phase 1 §F1 covers the database side; this check covers the log-stream side.
L3. Querying
Documented one-liners in docs/compliance/runbooks/inngest-incident.md (and used by ops):
# Last 100 errors
journalctl -u deneva-mcp -p err -n 100 --no-pager
# Tail live, JSON-formatted
journalctl -u deneva-mcp -f -o json-pretty
# All requests for a given request-id (Pino's genReqId from Phase 1 §G4)
journalctl -u deneva-mcp --grep "\"requestId\":\"$REQ_ID\""
# Compliance evidence: every audit-related event in a date window
journalctl -u deneva-mcp --since "2024-01-01" --until "2024-02-01" --grep 'audit'L4. When to upgrade
A single Ubuntu host is fine until any of these become true:
- Multiple hosts (need to query all of them at once).
- Auditor asks for a tamper-evident log store separate from the host.
- Disk pressure forces `MaxRetentionSec` below the 30-day floor.
When that happens, the cheap next step is Grafana Cloud Loki free tier (50 GB ingest/month, 30-day retention as of writing) using promtail to read journald and ship to Loki. That migration is a config-only change — no app code touches it.
L5. Acceptance
- `/etc/systemd/journald.conf.d/deneva-mcp.conf` exists; `journalctl --disk-usage` reports a value ≤ `SystemMaxUse`.
- The grep check in §L2 returns zero matches over a 7-day window.
- The query one-liners in §L3 are documented and referenced from `incident-response.md`.
§11 — Definition of Done (full checklist)
A. nginx TLS hardening
- testssl.sh A+ rating; only TLS 1.3.
- All Phase 1 security headers present on every public route.
- Server tokens disabled (`server_tokens off`); `client_max_body_size` enforced.
- Per-route rate-limit zones configured.
B. UFW + systemd
- UFW allows only 22 (IP-allowlist), 80, 443.
- `systemd-analyze security` ≤ 5.0 (“MEDIUM” or better).
- All hardening flags from §B2 active.
- systemd-creds replaces plaintext `LoadCredential` for every secret.
- Dedicated `ADMIN_TOKEN` secret; placeholder removed from Phase 1 §D3.
C. /api/inngest + /admin IP allow-lists
- `/api/inngest` allows only Inngest egress IPs.
- `/admin/*` allows only operator IPs.
- Refresh procedure documented in `docs/compliance/key-management.md`.
D. Audit log DELETE gate
- `archive_old_audit_rows` SECURITY DEFINER function created.
- `mcp_app` cannot raw-DELETE from `audit_log`; it can call the procedure.
- Phase 4 archive job uses the procedure.
E. KEK rotation
- `scripts/rotate-kek.ts` rewraps every DEK; idempotent on re-run.
- Application supports reading any `kekVersion` during the rotation window.
- Runbook `docs/compliance/runbooks/kek-rotation.md` exists.
F. Prometheus + Grafana
- `/metrics` returns Prometheus exposition format, behind the nginx allow-list.
- Counters / histograms cover auth, rate limit, cache, sync, token refresh, latency, heartbeat age.
- Four committed dashboards auto-load in Grafana.
- In-memory counters from Phase 2/4 retired in favor of Prometheus.
G. External alerting
- Uptime monitor configured for `/admin/health/inngest` and the Prometheus self-check.
- `sendAlert` webhook wired into the four trigger sites.
- On-call rotation documented.
H. Pen-test
- Weekly ZAP + nuclei job in CI; green or with documented suppressions.
- Manual checklist completed once with all results in the findings ledger.
- No open high or critical findings.
I. Compliance documentation
- All files in §I1 exist and reviewed.
- ROPA accurate as of launch date.
- SOC 2 evidence binder index linked from `README.md`.
J. Tests
- All new test files pass.
- Staging environment reproducibly snapshot-restored.
K. Postgres backup & restore
- `mcp-backup.timer` enabled; fires nightly and produces an `*.dump.age` in the off-host bucket.
- Local 7-day retention enforced; bucket lifecycle policy (30 daily + 12 monthly) verified.
- `mcp-restore-drill.timer` succeeds on staging weekly; failure path tested.
- Runbook `docs/compliance/runbooks/database-restore.md` executed end-to-end by an operator who did not write it; measured RTO recorded.
- age keypair committed nowhere; private key in operator vault + physical safe.
L. Log retention
- `/etc/systemd/journald.conf.d/deneva-mcp.conf` deployed; `SystemMaxUse` and `MaxRetentionSec` enforced.
- PII grep check in §L2 returns zero matches across a 7-day window.
- Query one-liners committed to `docs/compliance/runbooks/inngest-incident.md`.
§12 — Out of scope (post-launch)
| Item | Owner / cadence |
|---|---|
| Third-party penetration test | Schedule once revenue / customer signals justify; budget ~$15-30k |
| Multi-region failover (active-active or warm standby) | When first customer asks; significant infra rework |
| HashiCorp Vault migration (replacing systemd-creds) | When operator team grows beyond 1-2 people; trade-off: higher ops burden for centralized rotation |
| ISO 27001 certification | Typically follows successful SOC 2 Type II |
| Bug bounty program | After at least 6 months of production stability |
| Customer-managed encryption keys (BYOK) | Enterprise feature; revisit when first enterprise prospect asks |
| Centralised log aggregation (Grafana Cloud Loki + promtail, or similar) | When journald-on-one-host stops fitting; trigger conditions in §L4 |
§13 — Manual smoke test (production launch checklist)
# 0. All Phase 1-4 work merged. Staging running and green for >7 days.
# 1. Provision the Linux server (one-shot, idempotent ansible / shell)
sudo bash infra/provision.sh
# → installs node 22, postgres 16, nginx, ufw; creates deneva-mcp user; deploys systemd unit
# 2. Encrypt all secrets via systemd-creds
for s in CREDENTIAL_KEK API_KEY_HMAC_SECRET DB_PASSWORD GOOGLE_CLIENT_SECRET \
GOOGLE_DEVELOPER_TOKEN META_APP_SECRET TIKTOK_APP_SECRET \
INNGEST_SIGNING_KEY INNGEST_EVENT_KEY ADMIN_TOKEN; do
read -s -p "$s: " val; echo
echo -n "$val" | sudo systemd-creds encrypt --name=$s - /etc/deneva-mcp/creds/$s.cred
done
# 3. Start service, confirm
sudo systemctl enable --now deneva-mcp
systemctl status deneva-mcp
sudo systemd-analyze security deneva-mcp
# → exposure level ≤ 5.0
# 4. TLS smoke test
testssl.sh https://your-domain.com | grep -E '(Overall Grade|TLS 1.3|TLS 1.2)'
# → TLS 1.3 only, A+
# 5. Header smoke test
curl -sI https://your-domain.com/mcp | grep -iE '(strict-transport|content-security|x-frame|x-content-type|referrer-policy|permissions-policy)'
# → six headers present
# 6. UFW
sudo ufw status verbose | grep -E '(22|80|443|deny)'
# 7. /metrics reachable from Prometheus host, blocked elsewhere
# (run on the Prometheus host itself — its API listener is bound to 127.0.0.1:9090)
curl -s http://127.0.0.1:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job=="deneva-mcp") | .health'
# → "up"
# 8. Trigger every alerting path in staging:
# - lock out an IP via 11 auth failures → webhook + audit entry
# - kill inngest for 11 min → /admin/health/inngest 503 → uptime monitor pages
# - simulate KEK rotation script run on a sandbox tenant → success in <30s
# 9. Run the manual pen-test checklist top to bottom, capture results in ledger
# 10. Run the GDPR erasure end-to-end on a test tenant; verify all rows gone
# 11. SOC 2 evidence: confirm every link in soc2-evidence-binder.md resolves
# 12. Backup smoke test: confirm last night's dump landed in the off-host bucket
rclone lsf mcp-backups:deneva-mcp-backups | grep dump.age | sort | tail -3
# → most recent timestamp is from today
# Confirm weekly restore-drill timer is scheduled
systemctl list-timers mcp-restore-drill.timer
# → next trigger date shown; last successful run within 7 days
# 13. Announce launch
If every step produces the expected outcome, Phase 5 — and the v1 product — is shipped.