Phase 5 — Hardening & Compliance
Detailed execution doc for Phase 5 of the MCP Marketing Tool Architecture Plan. Builds on Phases 1, 2, 3, and 4. This is the final phase before declaring v1 production-ready.
Estimated effort: 1–2 weeks for one engineer (longer if you choose a third-party pen test). Follow-up: post-launch operational maintenance — not a “Phase 6.”
Goal
Take the application from “secure on the inside” to “production-ready on the outside.” Phase 1–4 built every component to be safe; Phase 5 makes the surrounding environment hardened, observable, tested under attack, and documented. The output is a system you can hand to a SOC 2 auditor or a security-conscious customer with a Data Processing Agreement and pass review on.
If Phase 5 is done correctly: the public surface area passes a clean OWASP ZAP / nuclei scan, every metric is dashboard-visible, an Inngest outage pages someone within 10 minutes, every secret has a documented rotation procedure, and the GDPR / SOC 2 evidence binder is one folder you can email to a procurement team.
Definition of Done (high-level — full checklist in §11)
- Public surface (`/mcp`, `/auth/*`, `/api/inngest`, `/admin/*`) reachable only via TLS 1.3 with full security headers.
- UFW blocks every port except 22 (SSH, IP-restricted), 80 (Let’s Encrypt only), 443.
- systemd unit runs as a non-root user with the full hardening flag set.
- `/api/inngest` accepts requests only from Inngest Cloud’s published egress IPs and `/admin/*` only from operator IPs; the OAuth callbacks (Google / Meta / TikTok) rely on `state` + PKCE rather than source-IP checks (see §C).
- `audit_log` row deletion is only possible via a `SECURITY DEFINER` stored procedure callable by the archive job role; the broad `GRANT DELETE` from Phase 4 is revoked.
- `scripts/rotate-kek.ts` rewraps every tenant DEK under a new KEK with no service downtime.
- Prometheus scrapes `/metrics`; Grafana dashboards visualise auth failures, rate-limit trips, sync health, cache hit rate, p95 latency.
- An external uptime monitor pages on `/admin/health/inngest` 503 or `/metrics` scrape failure.
- An automated security-scan job runs in CI (nuclei + OWASP ZAP baseline) and fails the build on a high-severity finding.
- ROPA, key-management, incident-response, and access-control documents committed to `docs/compliance/`.
- Nightly Postgres backup runs off-host; weekly automated restore drill passes on staging; “we lost the database” runbook executed end-to-end.
Workstream order & dependency graph
A. nginx TLS hardening ─┬──▶ C. OAuth callback IP allow-list
│
B. UFW + systemd ───────┼──▶ H. Pen-test (depends on full public surface)
│
D. Audit log DELETE gate ─── (independent, safe to land anytime)
E. KEK rotation tooling ─── (independent — needs Phase 2 envelope encryption)
F. Prometheus + Grafana ─┬─▶ G. External alerting
H. Pen-test ─────────────┘
I. Compliance docs ─── (parallel to all of the above)
J. Tests run alongside everything
K. Postgres backup ─── (independent — needs off-host bucket provisioned first)
L. Log retention ───── (independent — pure systemd config)
The critical path is A → C → H. D, E, F, I, K, L can land in any order.
Workstream A — nginx TLS hardening
A1. Full server config
File: infra/nginx/deneva-mcp.conf
# Rate limit zones — must be defined at http{} level. Move to /etc/nginx/conf.d/zones.conf.
# limit_req_zone $binary_remote_addr zone=mcp_global:10m rate=60r/m;
# limit_req_zone $binary_remote_addr zone=mcp_auth:10m rate=10r/m;
# limit_req_zone $binary_remote_addr zone=mcp_inngest:10m rate=600r/m;
server {
listen 443 ssl;
listen [::]:443 ssl;
http2 on;
server_name your-domain.com;
ssl_certificate /etc/letsencrypt/live/your-domain.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/your-domain.com/privkey.pem;
ssl_protocols TLSv1.3;
# With only TLS 1.3 enabled, suite selection is governed by OpenSSL's TLS 1.3 defaults;
# ssl_ciphers does not apply to TLS 1.3, so the line below documents the expected suites.
ssl_ciphers TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256:TLS_AES_128_GCM_SHA256;
ssl_prefer_server_ciphers off;
ssl_session_timeout 1d;
ssl_session_cache shared:MozSSL:50m;
ssl_session_tickets off;
# OCSP stapling
ssl_stapling on;
ssl_stapling_verify on;
ssl_trusted_certificate /etc/letsencrypt/live/your-domain.com/chain.pem;
resolver 1.1.1.1 8.8.8.8 valid=300s;
# Security headers
add_header Strict-Transport-Security "max-age=63072000; includeSubDomains; preload" always;
add_header X-Frame-Options DENY always;
add_header X-Content-Type-Options nosniff always;
add_header Referrer-Policy no-referrer always;
add_header Permissions-Policy "geolocation=(), microphone=(), camera=(), interest-cohort=()" always;
add_header Content-Security-Policy "default-src 'none'; frame-ancestors 'none'" always;
server_tokens off;
client_max_body_size 64k;
client_body_timeout 10s;
client_header_timeout 10s;
# /mcp — tenant-authenticated MCP traffic
location = /mcp {
limit_req zone=mcp_global burst=30 nodelay;
proxy_pass http://127.0.0.1:3001;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Request-Id $request_id;
proxy_hide_header X-Powered-By;
proxy_read_timeout 60s;
}
# /auth/*/start — tenant-authenticated, strict per-IP rate limit
location ~ ^/auth/[^/]+/(start|accounts(/select)?)$ {
limit_req zone=mcp_auth burst=5 nodelay;
proxy_pass http://127.0.0.1:3001;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}
# /auth/*/callback — IP-allow-listed per platform (see §C)
location ~ ^/auth/google/callback$ { include snippets/allowlist-google.conf; proxy_pass http://127.0.0.1:3001; }
location ~ ^/auth/meta/callback$ { include snippets/allowlist-meta.conf; proxy_pass http://127.0.0.1:3001; }
location ~ ^/auth/tiktok/callback$ { include snippets/allowlist-tiktok.conf; proxy_pass http://127.0.0.1:3001; }
# /api/inngest — only Inngest Cloud egress IPs
location = /api/inngest {
include snippets/allowlist-inngest.conf;
limit_req zone=mcp_inngest burst=200 nodelay;
proxy_pass http://127.0.0.1:3001;
proxy_request_buffering off; # streaming payloads
}
# /admin/* — additional IP allow-list (operator office IPs)
location /admin/ {
include snippets/allowlist-admin.conf;
limit_req zone=mcp_auth burst=5 nodelay;
proxy_pass http://127.0.0.1:3001;
}
# /tenant/connections — same auth as /mcp, no special rate limit override
location = /tenant/connections {
limit_req zone=mcp_global burst=10 nodelay;
proxy_pass http://127.0.0.1:3001;
}
# Block everything else with a connection-close (444 returns no response)
location / { return 444; }
}
# HTTP → HTTPS redirect, but allow Let's Encrypt's HTTP-01 challenge.
server {
listen 80;
listen [::]:80;
server_name your-domain.com;
location /.well-known/acme-challenge/ { root /var/www/letsencrypt; }
location / { return 301 https://$host$request_uri; }
}
A2. Cert renewal cron
# /etc/cron.d/letsencrypt
0 3 * * * root certbot renew --webroot -w /var/www/letsencrypt --quiet --post-hook "systemctl reload nginx"
A3. Acceptance
- `testssl.sh https://your-domain.com` reports A+ overall, no TLS 1.2 fallback.
- `curl -I https://your-domain.com/mcp` shows every required `add_header` and no `Server: nginx/...` version.
- `curl https://your-domain.com/anything-else` returns nothing (444).
- `curl http://your-domain.com/anything` redirects 301 to HTTPS.
Workstream B — UFW + systemd hardening
B1. UFW
ufw default deny incoming
ufw default allow outgoing
# SSH from operator office IPs only (replace with real CIDRs)
ufw allow from 203.0.113.0/24 to any port 22 proto tcp
ufw allow from 198.51.100.10/32 to any port 22 proto tcp
# HTTP for ACME challenges + HTTPS public
ufw allow 80/tcp
ufw allow 443/tcp
ufw enable
ufw status verbose
# postgres + node port (5432, 3001) NOT in the rules — bound to 127.0.0.1 already, but
# UFW belt-and-braces ensures any future binding mistake stays internal.
Architecture-doc divergence. The architecture doc’s `ecosystem.config.js` example deploys via PM2 with `instances: 2, exec_mode: 'cluster'`. Phase 5 deploys a single Node process via systemd directly. Reasons: (a) the systemd hardening flag set in §B2 is more comprehensive than PM2 can provide, (b) Phase 1’s IP-block map and Phase 4’s heartbeat counters are in-process — clustering would require moving them out (see Phase 1 §E4 note). If the load profile later requires horizontal scaling, the migration path is: move in-process state to Redis, then run multiple systemd instances behind nginx upstream. (A sketch of such a Redis-backed counter follows below.)
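If that migration ever happens, the in-process auth-failure counter is the piece that moves first. A minimal sketch of a Redis-backed version, assuming ioredis and the 10-failures-per-hour threshold (names are illustrative, not the Phase 1 API):
// security/auth-failures-redis.ts (illustrative): future replacement for the in-process IP-block map
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL ?? 'redis://127.0.0.1:6379');

/** Record one auth failure for an IP; returns true when the caller should engage the block. */
export async function recordAuthFailure(ip: string): Promise<boolean> {
  const key = `auth-failures:${ip}`;
  const count = await redis.incr(key);
  if (count === 1) await redis.expire(key, 60 * 60); // start the 1-hour window on the first failure
  return count > 10;
}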
B2. systemd unit (full hardening flag set)
File: infra/systemd/deneva-mcp.service
[Unit]
Description=MCP Marketing Server
After=network.target postgresql.service
Wants=postgresql.service
[Service]
Type=simple
User=deneva-mcp
Group=deneva-mcp
WorkingDirectory=/opt/deneva-mcp
ExecStart=/usr/bin/node dist/index.js
Restart=on-failure
RestartSec=5s
Environment=NODE_ENV=production
Environment=PORT=3001
Environment=SYSTEMD_UNIT=deneva-mcp.service
# Encrypted credentials — see §systemd-creds bootstrap below
LoadCredentialEncrypted=CREDENTIAL_KEK:/etc/deneva-mcp/creds/CREDENTIAL_KEK.cred
LoadCredentialEncrypted=API_KEY_HMAC_SECRET:/etc/deneva-mcp/creds/API_KEY_HMAC_SECRET.cred
LoadCredentialEncrypted=DB_PASSWORD:/etc/deneva-mcp/creds/DB_PASSWORD.cred
LoadCredentialEncrypted=GOOGLE_CLIENT_SECRET:/etc/deneva-mcp/creds/GOOGLE_CLIENT_SECRET.cred
LoadCredentialEncrypted=GOOGLE_DEVELOPER_TOKEN:/etc/deneva-mcp/creds/GOOGLE_DEVELOPER_TOKEN.cred
LoadCredentialEncrypted=META_APP_SECRET:/etc/deneva-mcp/creds/META_APP_SECRET.cred
LoadCredentialEncrypted=TIKTOK_APP_SECRET:/etc/deneva-mcp/creds/TIKTOK_APP_SECRET.cred
LoadCredentialEncrypted=INNGEST_SIGNING_KEY:/etc/deneva-mcp/creds/INNGEST_SIGNING_KEY.cred
LoadCredentialEncrypted=INNGEST_EVENT_KEY:/etc/deneva-mcp/creds/INNGEST_EVENT_KEY.cred
LoadCredentialEncrypted=ADMIN_TOKEN:/etc/deneva-mcp/creds/ADMIN_TOKEN.cred
# Sandboxing — full set
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
PrivateDevices=true
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectKernelLogs=true
ProtectControlGroups=true
ProtectClock=true
ProtectHostname=true
ProtectProc=invisible
ProcSubset=pid
RestrictNamespaces=true
RestrictRealtime=true
RestrictSUIDSGID=true
LockPersonality=true
# MemoryDenyWriteExecute stays off: V8's JIT needs writable-then-executable pages and Node fails under it.
MemoryDenyWriteExecute=false
SystemCallArchitectures=native
SystemCallFilter=@system-service
SystemCallFilter=~@privileged @resources @debug @mount @module @reboot @swap
CapabilityBoundingSet=
AmbientCapabilities=
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
ReadWritePaths=/var/log/deneva-mcp
UMask=0027
[Install]
WantedBy=multi-user.target
B3. systemd-creds bootstrap
# Run once per secret, as root, on the prod host
echo -n 'PASTE_SECRET_HERE' \
| sudo systemd-creds encrypt --name=CREDENTIAL_KEK - /etc/deneva-mcp/creds/CREDENTIAL_KEK.cred
sudo chmod 0600 /etc/deneva-mcp/creds/*.cred
Deviation from Phase 1: Phase 1 used `LoadCredential` (plaintext on disk); Phase 5 upgrades to `LoadCredentialEncrypted` (encrypted under the host TPM / system key). This is the production-grade form.
B4. Replace Phase 5 placeholder admin token
Phase 1 §D3 used API_KEY_HMAC_SECRET as a stand-in admin token. Phase 5 introduces a dedicated secret and removes the alias.
// security/admin-auth.ts
import { timingSafeEqual } from 'node:crypto';
import { loadSecret } from './secrets.loader.js';
const adminToken = (await loadSecret('ADMIN_TOKEN' as never)).toString('utf8');
export function verifyAdminToken(presented: string | undefined): boolean {
if (!presented) return false;
const a = Buffer.from(presented), b = Buffer.from(adminToken);
if (a.length !== b.length) return false;
return timingSafeEqual(a, b);
}
Replace every `req.headers['x-admin-token'] !== adminToken` (and the `req.headers[ADMIN_HEADER_NAME] !== adminToken` variant) with `!verifyAdminToken(req.headers['x-admin-token'] as string)` across admin-routes.ts. There are six sites: Phase 1 §D3 (rotation endpoint, uses the `ADMIN_HEADER_NAME` constant), Phase 2 §F2 / §G2, Phase 3 §G2, Phase 4 §F2 / §G3. A sketch of one swapped call site follows the greps. Verify completeness with two greps (the second catches the §D3 site, which uses the constant rather than the literal):
grep -rn "x-admin-token.*!==" src/ # must return zero matches
grep -rn "ADMIN_HEADER_NAME.*!==" src/ # must return zero matchesB5. Acceptance
B5. Acceptance
- `systemd-analyze security deneva-mcp.service` returns an exposure level of `0.x SAFE` or “MEDIUM” (anything below 5).
- `nmap -p 1-65535 your-domain.com` shows only 22, 80, 443 open from the public internet.
- `ps -o user= -p $(pidof node)` returns `deneva-mcp`, never `root`.
Workstream C — OAuth callback IP allow-list
C1. Vendor source-IP lists
Each platform publishes (or you’ve measured) the source IPs from which their OAuth servers redirect users back. However: OAuth redirects originate from the user’s browser, not from the platform — IP allow-listing the callback would block legitimate users.
Correct interpretation: allow-list the /api/inngest endpoint to Inngest Cloud’s egress IPs (server-to-server). For OAuth callbacks, the right defence is state + PKCE + signed-cookie session, not IP. Phase 1 / 2 already enforce these.
So the actual Phase 5 work here is:
- `/api/inngest` IP allow-list for Inngest Cloud egress.
- `/admin/*` IP allow-list for operator IPs.
- `/auth/*/callback` keeps no IP restriction; the security is `state` + PKCE.
C2. nginx allowlist snippets
# /etc/nginx/snippets/allowlist-inngest.conf
# Pulled from Inngest's published egress IP list — refresh quarterly.
allow 35.193.0.0/16; # placeholder — replace with actual list
allow 35.197.0.0/16;
deny all;
# /etc/nginx/snippets/allowlist-admin.conf
allow 203.0.113.0/24; # operator office
allow 198.51.100.10/32; # ops VPN
deny all;
C3. Update procedure
A documented quarterly task in docs/compliance/key-management.md (§I): check Inngest’s published IP ranges, update allowlist-inngest.conf, reload nginx, verify a test event still arrives.
C4. Acceptance
- `curl -X POST https://your-domain.com/api/inngest` from a non-allow-listed IP returns 403.
- A correctly-signed POST from an allow-listed IP routes through.
- `curl https://your-domain.com/admin/metrics/sync` from a non-allow-listed IP returns 403 before hitting Fastify (independent of the admin token).
Workstream D — Audit log DELETE gate
Phase 4 §E4 broadly granted DELETE ON audit_log TO mcp_app so the archive CTE could run. Phase 5 narrows this to a single SECURITY DEFINER stored procedure.
D1. Stored procedure
CREATE OR REPLACE FUNCTION archive_old_audit_rows(p_cutoff timestamptz)
RETURNS integer
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = public, pg_temp
AS $$
DECLARE
moved_count integer;
BEGIN
WITH moved AS (
DELETE FROM audit_log
WHERE created_at < p_cutoff
RETURNING *
)
INSERT INTO audit_log_archive
SELECT * FROM moved;
GET DIAGNOSTICS moved_count = ROW_COUNT;
RETURN moved_count;
END;
$$;
REVOKE EXECUTE ON FUNCTION archive_old_audit_rows(timestamptz) FROM PUBLIC;
GRANT EXECUTE ON FUNCTION archive_old_audit_rows(timestamptz) TO mcp_app;
-- Take back the broad DELETE GRANT from Phase 4
REVOKE DELETE ON audit_log FROM mcp_app;
The function is owned by mcp_admin (the migration role); SECURITY DEFINER means it runs with mcp_admin’s privileges when called by mcp_app. The function body is the only path through which audit rows can leave audit_log.
D2. Update the archive job
File: src/sync/functions.ts (replace the Phase 4 §E3 inline CTE)
export const archiveAuditLog = inngest.createFunction(
{ id: 'gdpr-archive-audit', retries: 1 },
{ cron: '30 2 * * *' },
async ({ step }) => {
const cutoff = new Date(Date.now() - 365 * 24 * 60 * 60 * 1000);
const moved = await step.run('archive-via-proc', async () => {
const r = await db.execute(sql`SELECT archive_old_audit_rows(${cutoff}) as moved`);
return Number((r.rows[0] as { moved: number }).moved);
});
await writeAuditEvent('gdpr.archive_audit', 'success', { moved });
},
);
D3. Acceptance
- `mcp_app` running a raw `DELETE FROM audit_log WHERE id = ...` fails with permission denied (exercised by the test sketch below).
- `mcp_app` running `SELECT archive_old_audit_rows(now() - interval '13 months')` succeeds and returns the moved count.
- Total row count across `audit_log` + `audit_log_archive` is preserved.
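A sketch of the matching integration test (listed in §J1), assuming vitest and node-postgres with a connection string for the `mcp_app` role (the env-var name is illustrative):
// tests/integration/audit-delete-gate.test.ts (sketch)
import { describe, expect, it } from 'vitest';
import { Pool } from 'pg';

const appPool = new Pool({ connectionString: process.env.MCP_APP_DATABASE_URL });

describe('audit_log DELETE gate', () => {
  it('rejects a raw DELETE issued as mcp_app', async () => {
    await expect(
      appPool.query(`DELETE FROM audit_log WHERE created_at < now() - interval '100 years'`),
    ).rejects.toThrow(/permission denied/i);
  });

  it('allows the SECURITY DEFINER procedure and returns a count', async () => {
    const r = await appPool.query(`SELECT archive_old_audit_rows(now() - interval '13 months') AS moved`);
    expect(Number(r.rows[0].moved)).toBeGreaterThanOrEqual(0);
  });
});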
Workstream E — KEK rotation
E1. Rotation script
File: scripts/rotate-kek.ts
/**
* KEK rotation: rewrap every tenant DEK from KEK v_old to KEK v_new.
*
* Usage:
* OLD_KEK_PATH=/run/credentials/.../CREDENTIAL_KEK_V1 \
* NEW_KEK_PATH=/run/credentials/.../CREDENTIAL_KEK_V2 \
* tsx scripts/rotate-kek.ts
*
* Pre-checks (manual): both KEK files exist, both are 32 bytes, DB up.
* Post-checks: every row's kek_version increments, decryptToken still works for a sampled tenant.
*/
import { readFile } from 'node:fs/promises';
import { createCipheriv, createDecipheriv, randomBytes } from 'node:crypto';
import { eq, lt } from 'drizzle-orm';
import { db } from '../src/db/index.js';
import { tenantDeks } from '../src/db/schema.js';
const oldKek = await readFile(process.env.OLD_KEK_PATH!);
const newKek = await readFile(process.env.NEW_KEK_PATH!);
if (oldKek.length !== 32 || newKek.length !== 32) throw new Error('Both KEKs must be 32 bytes');
const NEW_VERSION = Number(process.env.NEW_KEK_VERSION ?? '2');
const BATCH = 100;
let total = 0;
while (true) {
const rows = await db.select().from(tenantDeks)
.where(lt(tenantDeks.kekVersion, NEW_VERSION))
.limit(BATCH);
if (rows.length === 0) break;
for (const row of rows) {
// 1. Decrypt the DEK under the OLD KEK.
const oldDecipher = createDecipheriv('aes-256-gcm', oldKek, row.dekIv);
oldDecipher.setAuthTag(row.dekTag);
const dek = Buffer.concat([oldDecipher.update(row.dekEnc), oldDecipher.final()]);
// 2. Re-encrypt under the NEW KEK with a fresh IV.
const iv = randomBytes(12);
const newCipher = createCipheriv('aes-256-gcm', newKek, iv);
const dekEnc = Buffer.concat([newCipher.update(dek), newCipher.final()]);
const dekTag = newCipher.getAuthTag();
// 3. Update in a single statement.
await db.update(tenantDeks).set({
dekEnc, dekIv: iv, dekTag,
kekVersion: NEW_VERSION,
rotatedAt: new Date(),
}).where(eq(tenantDeks.tenantId, row.tenantId));
total += 1;
}
console.log(`Rotated ${total} so far...`);
}
console.log(`KEK rotation complete: ${total} tenants migrated to v${NEW_VERSION}`);
Crucial property: the plaintext DEK does not change — only the KEK that wraps it. This means tenant data encrypted with the DEK (tokens) does NOT need to be re-encrypted, only the DEK row itself. Zero downtime on user data.
E2. Coordinated deployment
For zero downtime the running server must support reading both KEK versions during the rotation window. Update secrets.loader.ts:
async function loadKekForVersion(version: number): Promise<Buffer> {
return loadSecret(`CREDENTIAL_KEK_V${version}` as never);
}
// credentials.service.ts
async function getOrCreateDek(tenantId: string): Promise<Buffer> {
const [row] = await db.select().from(tenantDeks).where(eq(tenantDeks.tenantId, tenantId));
if (row) {
const kek = await loadKekForVersion(row.kekVersion);
const decipher = createDecipheriv('aes-256-gcm', kek, row.dekIv);
decipher.setAuthTag(row.dekTag);
return Buffer.concat([decipher.update(row.dekEnc), decipher.final()]);
}
// New tenant → use the highest version available.
const newest = Number(process.env.CREDENTIAL_KEK_NEWEST_VERSION ?? '1');
// ... encrypt under newest, persist with kekVersion=newest
}
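The elided branch can be completed roughly as follows (a sketch: column names mirror those used by `scripts/rotate-kek.ts`; the Phase 2 schema is authoritative):
// Inside getOrCreateDek, in place of the "encrypt under newest" elision above (sketch).
const newestKek = await loadKekForVersion(newest);
const dek = randomBytes(32);                         // fresh per-tenant DEK
const iv = randomBytes(12);
const cipher = createCipheriv('aes-256-gcm', newestKek, iv);
const dekEnc = Buffer.concat([cipher.update(dek), cipher.final()]);
await db.insert(tenantDeks).values({
  tenantId,
  dekEnc,
  dekIv: iv,
  dekTag: cipher.getAuthTag(),
  kekVersion: newest,
});
return dek;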
E3. Runbook
File: docs/compliance/runbooks/kek-rotation.md
Documented step-by-step: pre-checks, deploy multi-version-capable code, generate new KEK, encrypt-with-systemd-creds + deploy, run the script, verify a sampled tenant decrypt, retire old KEK after a 30-day grace period.
E4. Acceptance
- After rotation, every `tenant_deks` row has `kekVersion = NEW_VERSION` and a recent `rotatedAt`.
- `decryptToken` still round-trips for a sampled tenant before, during, and after rotation.
- Service stays up the entire time (use the Phase 4 heartbeat to verify).
- An attempted decrypt with the OLD KEK on a rewrapped row fails with auth-tag-invalid (sanity check that the rewrap actually happened).
Workstream F — Prometheus + Grafana
F1. /metrics endpoint
// src/observability/metrics.ts
import { register, Counter, Histogram, Gauge, collectDefaultMetrics } from 'prom-client';
collectDefaultMetrics();
export const httpRequests = new Counter({
name: 'mcp_http_requests_total',
help: 'HTTP requests by route, method, status',
labelNames: ['route', 'method', 'status'],
});
export const httpDuration = new Histogram({
name: 'mcp_http_duration_ms',
help: 'Request duration in ms',
labelNames: ['route', 'method'],
buckets: [10, 25, 50, 100, 250, 500, 1000, 2500, 5000],
});
export const cacheHits = new Counter({ name: 'mcp_cache_hits_total', help: 'Cache hits', labelNames: ['platform', 'report_type'] });
export const cacheMisses = new Counter({ name: 'mcp_cache_misses_total', help: 'Cache misses', labelNames: ['platform', 'report_type'] });
export const authFailures = new Counter({ name: 'mcp_auth_failures_total', help: 'Auth failures', labelNames: ['reason'] });
export const rateLimitHits = new Counter({ name: 'mcp_rate_limit_total', help: 'Rate limits triggered', labelNames: ['scope'] });
export const syncStatus = new Counter({ name: 'mcp_sync_total', help: 'Sync results', labelNames: ['platform', 'report_type', 'status'] });
export const tokenRefreshes = new Counter({ name: 'mcp_token_refresh_total', help: 'Token refreshes', labelNames: ['platform', 'outcome'] });
export const heartbeatAgeSeconds = new Gauge({ name: 'mcp_inngest_heartbeat_age_seconds', help: 'Seconds since last Inngest heartbeat' });
export const metricsRegistry = register;
F2. Mount + populate
// src/index.ts (after auth/rate-limit setup)
app.get('/metrics', async (req, reply) => {
// /metrics is allow-listed at nginx (§F3) — this is defence-in-depth:
// even if nginx is misconfigured or removed, only loopback callers (i.e. nginx → 127.0.0.1)
// can scrape. Public-IP requests get 404 so the endpoint is invisible to scanners.
if (req.ip !== '127.0.0.1' && req.ip !== '::1') {
return reply.code(404).send();
}
reply.header('content-type', metricsRegistry.contentType);
return metricsRegistry.metrics();
});
app.addHook('onResponse', async (req, reply) => {
const route = req.routeOptions?.url ?? req.url;
httpRequests.inc({ route, method: req.method, status: reply.statusCode });
httpDuration.observe({ route, method: req.method }, reply.elapsedTime);
});
Fastify sees `req.ip` as the immediate TCP peer, which is nginx on `127.0.0.1` since the proxy connects via loopback (`proxy_pass http://127.0.0.1:3001`). The real client IP lives in `X-Forwarded-For` — but for this guard we want the immediate peer, because the threat model is “what if someone bypasses nginx and hits Fastify directly.” Do NOT enable Fastify’s `trustProxy` for this route or the loopback check becomes spoofable.
Replace the in-memory cache counters from Phase 2 §H with `cacheHits.inc({...})` / `cacheMisses.inc({...})`. Same swap for the Phase 4 heartbeat: the `/admin/health/inngest` endpoint can also update `heartbeatAgeSeconds`.
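A sketch of both swaps (the cache lookup and the heartbeat timestamp are stand-ins for the Phase 2 / Phase 4 names, which may differ):
// src/observability/metrics-wiring.ts (illustrative): how the counters and gauge get fed
import { cacheHits, cacheMisses, heartbeatAgeSeconds } from './metrics.js';

type CacheLookup = (platform: string, reportType: string) => Promise<unknown | null>;

export async function readThroughCache(
  platform: string,
  reportType: string,
  lookup: CacheLookup,            // stand-in for the Phase 2 §H cache read
  load: () => Promise<unknown>,   // platform API call on a miss
): Promise<unknown> {
  const cached = await lookup(platform, reportType);
  if (cached !== null) {
    cacheHits.inc({ platform, report_type: reportType });
    return cached;
  }
  cacheMisses.inc({ platform, report_type: reportType });
  return load();
}

// Called from the /admin/health/inngest handler with the Phase 4 heartbeat timestamp (ms epoch).
export function updateHeartbeatGauge(lastHeartbeatAtMs: number): void {
  heartbeatAgeSeconds.set((Date.now() - lastHeartbeatAtMs) / 1000);
}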
F3. nginx allow-list /metrics
location = /metrics {
include snippets/allowlist-prometheus.conf; # only the Prometheus host
proxy_pass http://127.0.0.1:3001;
}
F4. Prometheus + Grafana
Provision via docker-compose on the same host or a sibling instance:
File: infra/observability/docker-compose.yml
services:
prometheus:
image: prom/prometheus:latest
# network_mode: host so Prometheus can scrape Fastify on 127.0.0.1:3001
# without crossing a Docker bridge — Fastify's loopback bind stays intact.
network_mode: host
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
- prom_data:/prometheus
# No `ports:` mapping under host networking. Prometheus's own listener is
# bound to loopback via the CLI args below so 9090 is not publicly exposed.
command:
- --config.file=/etc/prometheus/prometheus.yml
- --web.listen-address=127.0.0.1:9090
grafana:
image: grafana/grafana-oss:latest
# Host networking here too: Grafana's datasource URL can then be http://127.0.0.1:9090,
# which is where Prometheus listens. A bridge-networked container could not reach that
# loopback-only listener. Grafana's own UI stays loopback-only via GF_SERVER_HTTP_ADDR.
network_mode: host
volumes: [ grafana_data:/var/lib/grafana, ./dashboards:/var/lib/grafana/dashboards:ro ]
environment: { GF_SECURITY_ADMIN_PASSWORD_FILE: /run/secrets/grafana_admin, GF_SERVER_HTTP_ADDR: "127.0.0.1" }
secrets: [ grafana_admin ]
volumes: { prom_data: {}, grafana_data: {} }
secrets:
grafana_admin: { file: ./grafana_admin_password }
File: infra/observability/prometheus.yml
global: { scrape_interval: 30s }
scrape_configs:
- job_name: deneva-mcp
# Prometheus runs with network_mode: host (see docker-compose above), so
# loopback reaches Fastify directly. Fastify stays bound to 127.0.0.1 only.
static_configs: [ { targets: ['127.0.0.1:3001'] } ]
metrics_path: /metrics
F5. Dashboards (committed JSON)
Files: infra/observability/dashboards/{auth.json,sync.json,cache.json,latency.json}
Each dashboard pre-built so Grafana auto-loads on first run. Required panels:
- auth.json: auth failures by reason, IP-block engagements over time, rate-limit triggers.
- sync.json: sync success rate per platform, p95 sync duration, current unhealthy tenants.
- cache.json: cache hit rate per (platform, report_type), miss rate trend.
- latency.json: p50/p95/p99 request duration per route.
F6. Acceptance
- `curl http://127.0.0.1:3001/metrics` returns valid Prometheus exposition format.
- From the Prometheus container: `docker exec <prometheus-container> wget -qO- http://127.0.0.1:3001/metrics` returns the exposition body (proves host networking + Fastify’s loopback bind both work).
- Prometheus scrapes successfully (check Targets page → `deneva-mcp` UP).
- Grafana auto-loads dashboards; data populates within one scrape interval.
- Hit `/mcp` → counter increments visible in Prometheus within 30s.
Workstream G — External alerting
G1. Webhook integration
The simplest cross-vendor option: use an external uptime monitor (UptimeRobot / Better Stack / Pingdom — all have free tiers). Configure two checks:
- `GET https://your-domain.com/admin/health/inngest` with the admin token in headers — page on non-200. (Add the monitor’s source IPs to `allowlist-admin.conf` or the check will 403.)
- Prometheus self-check: a secondary monitor scrapes Prometheus’s own `/-/healthy` and pages on 4xx/5xx. (Prometheus listens on loopback only, so this check either runs host-local or `/-/healthy` is proxied through nginx behind its own allow-list.)
Routing the page is the monitor’s responsibility (most integrate with PagerDuty / OpsGenie / Slack natively). No code change needed; this is a configuration deliverable.
G2. Optional: in-app webhook for critical events
For events that aren’t covered by polling — e.g., 10 auth failures from one IP in 5 minutes triggering an IP block — emit a webhook to a configured destination.
// src/observability/alerts.ts
const WEBHOOK_URL = process.env.ALERTS_WEBHOOK_URL;
export async function sendAlert(event: {
severity: 'info' | 'warn' | 'critical';
title: string;
detail: Record<string, unknown>;
}): Promise<void> {
if (!WEBHOOK_URL) return;
void fetch(WEBHOOK_URL, {
method: 'POST',
headers: { 'content-type': 'application/json' },
body: JSON.stringify({ ...event, source: 'deneva-mcp', timestamp: new Date().toISOString() }),
}).catch(() => { /* never throw from alert path */ });
}
Wire from:
- IP-block engagement (Phase 1 §E4) — `severity: 'warn'` (see the sketch after this list).
- Sync exhaustion → unhealthy (Phase 4 §C2) — `severity: 'warn'`.
- Audit-log archive failure — `severity: 'critical'`.
- KEK rotation failure mid-run — `severity: 'critical'`.
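A sketch of the first site, the IP-block path calling `sendAlert` (function and field names are illustrative; the Phase 1 §E4 code is authoritative):
// security/ip-block.ts (illustrative): fire-and-forget alert when a block engages
import { sendAlert } from '../observability/alerts.js';

export function onIpBlocked(ip: string, failureCount: number, windowMinutes: number): void {
  void sendAlert({
    severity: 'warn',
    title: 'IP blocked after repeated auth failures',
    detail: { ip, failureCount, windowMinutes },
  });
}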
G3. Acceptance
- Trigger an IP block in staging — webhook delivered, payload contains the IP.
- Stop the Inngest dev server for 11 minutes — uptime monitor pages within its check interval.
- Document the on-call rotation in `docs/compliance/incident-response.md` (§I).
Workstream H — Penetration test (DIY)
H1. Automated tooling in CI
File: .github/workflows/security-scan.yml
name: Security Scan
on:
schedule: [ { cron: '0 6 * * 1' } ] # weekly Mon 06:00 UTC
workflow_dispatch:
jobs:
zap-baseline:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: ZAP baseline scan
uses: zaproxy/action-baseline@v0.12.0
with:
target: 'https://staging.your-domain.com'
rules_file_name: '.zap/rules.tsv'
fail_action: true
nuclei:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: projectdiscovery/nuclei-action@main
with:
target: 'https://staging.your-domain.com'
templates: 'cves,exposed-panels,misconfiguration,vulnerabilities'
npm-audit:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with: { node-version: '22' }
- run: npm ci && npm run audit
`.zap/rules.tsv` lets you suppress known false-positives with a justification (committed as evidence).
H2. Manual checklist
File: docs/compliance/pentest-checklist.md
Categories with concrete tests:
Authentication & session
- API key brute force: 1000 random keys → all 401, IP blocked after threshold (engages `auth.blocked_ip`).
- API key timing attack: measure response time for valid-format-invalid-value vs a valid key. The statistical t-test should be inconclusive (see the probe sketch after this list).
- Admin endpoint without `x-admin-token` → 401; the nginx `/admin` IP-allowlist → 403 from a non-allowed IP.
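A sketch of the timing probe (script name, key format, and thresholds are assumptions; adapt to the Phase 1 API-key shape):
// scripts/timing-probe.ts (hypothetical): run with tsx against staging only.
// Point TARGET at the Fastify port from the staging host (http://127.0.0.1:3001/mcp)
// so nginx rate limits don't skew the sample with 429s.
const TARGET = process.env.TARGET_URL ?? 'http://127.0.0.1:3001/mcp';
const VALID_KEY = process.env.VALID_API_KEY ?? '';
const WRONG_KEY = 'mcp_' + 'a'.repeat(43); // well-formed but invalid; adjust to the real key format
const N = 200;

async function sample(key: string): Promise<number[]> {
  const times: number[] = [];
  for (let i = 0; i < N; i++) {
    const t0 = process.hrtime.bigint();
    await fetch(TARGET, {
      method: 'POST',
      headers: { 'x-api-key': key, 'content-type': 'application/json' },
      body: '{}',
    });
    times.push(Number(process.hrtime.bigint() - t0) / 1e6); // milliseconds
  }
  return times;
}

function stats(xs: number[]): { mean: number; variance: number; n: number } {
  const mean = xs.reduce((a, b) => a + b, 0) / xs.length;
  const variance = xs.reduce((a, b) => a + (b - mean) ** 2, 0) / (xs.length - 1);
  return { mean, variance, n: xs.length };
}

const wrong = stats(await sample(WRONG_KEY));
const valid = stats(await sample(VALID_KEY));
// Welch t statistic; values near 0 mean the two timing distributions are indistinguishable.
const t = (wrong.mean - valid.mean) / Math.sqrt(wrong.variance / wrong.n + valid.variance / valid.n);
console.log({ wrongMeanMs: wrong.mean.toFixed(2), validMeanMs: valid.mean.toFixed(2), welchT: t.toFixed(2) });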
OAuth
- State replay: complete OAuth, capture `state`, replay → second use rejected.
- State CSRF: hit `/auth/google/callback` with a `state` minted for tenant B from a session of tenant A → rejected.
- Open redirect: `/auth/google/start?redirect_uri=https://evil.com` → ignored; the callback URI is config-pinned.
- PKCE downgrade: omit `code_verifier` on the Google callback → token exchange fails.
- Scope tampering: complete OAuth with reduced scopes → `ScopeMissingError` + audit row.
Injection / input
- SQL injection in every query parameter we accept (especially `accountId` in `/auth/:platform/accounts/select`).
- JSON-body fuzzing on `/mcp` tool calls — every closed enum rejects unknowns.
- Header injection: `X-Forwarded-For: \r\nSet-Cookie: ...` → nginx strips CR/LF.
Rate limiting / abuse
- 200 req/s from one IP → 429s engage; audit rows present.
- Distributed (10 IPs × 50 req/s) — global Fastify limit triggers; per-tenant limit triggers separately.
- Auth failure flood (11+ in 1h from one IP) → IP block engaged; webhook fired.
Transport / TLS
- `nmap --script ssl-enum-ciphers` → only TLS 1.3 advertised.
- HSTS preload check via https://hstspreload.org/.
- Certificate transparency search for the domain — only the expected certs appear.
Secrets / config
- Process memory dump (in staging only): grep for KEK / DEK / API key plaintext — only in known short-lived buffers.
- Logs / Pino output: grep for `Bearer`, `ya29`, `EAA` (Meta token prefix) → no hits.
- Env-var dump: `cat /proc/$(pidof node)/environ | tr '\0' '\n'` → no secret values.
Inngest
- Unsigned POST to `/api/inngest` → 401.
- Replayed signed request (same body, old timestamp) → rejected (the Inngest SDK enforces a timestamp window).
- Forged signature → 401.
MCP-specific
- Tool registry enumeration: confirm only the registered tools are callable; an attacker calling `__internal__` fails.
- Cross-tenant cache leak: tenant A’s request never returns tenant B’s cache row even with crafted JSON IDs (RLS catches it).
H3. Findings ledger
docs/compliance/pentest-findings.md — table of finding → severity → status (open / mitigated / accepted-with-rationale) → date. Fixed and accepted findings stay in the ledger as evidence; this is what a SOC 2 auditor wants.
H4. Acceptance
- ZAP baseline scan job is green; the `.zap/rules.tsv` file documents every suppressed alert.
- The nuclei job reports zero high or critical findings; medium findings are tracked in the ledger.
- Manual checklist run end-to-end with results captured in the ledger.
Workstream I — Compliance documentation
I1. Folder layout
docs/compliance/
├── ROPA.md # Records of Processing Activities (GDPR Art. 30)
├── access-control-policy.md # who has access to what; review cadence
├── incident-response.md # on-call, escalation, customer notification SLAs
├── key-management.md # KEK / DEK / API key lifecycle, rotation cadence, disposal
├── data-retention.md # what we keep, how long, where, how it's destroyed
├── pentest-checklist.md # H2
├── pentest-findings.md # H3
├── runbooks/
│ ├── kek-rotation.md # E3
│ ├── dek-rotation-per-tenant.md # uses Phase 2 §G
│ ├── tenant-erasure.md # uses Phase 3 §G
│ ├── database-restore.md # K6 — "we lost the database"
│ └── inngest-incident.md # what to do when /admin/health/inngest goes 503
└── soc2-evidence-binder.md       # index pointing at every artifact above
I2. ROPA template (excerpt)
File: docs/compliance/ROPA.md
# Record of Processing Activities
| Field | Value |
|---|---|
| Controller / Processor | Processor (acting on behalf of customer-controllers) |
| Purpose of processing | Aggregating and serving advertising performance metrics |
| Categories of data subjects | None directly. Aggregated ad-platform performance metrics processed at the campaign level. |
| Categories of personal data | None as defined by GDPR Art. 4(1). Tenant identifiers (UUIDs), API key hashes, OAuth tokens (encrypted at rest). |
| Recipients | Tenant-authenticated MCP clients only. No third-party data sharing. |
| Cross-border transfers | Tenants outside the EU are processed in the same EU region. (Update if multi-region added.) |
| Retention | metric_cache: 90 days. sync_log: 30 days. audit_log: 12 months active + archive. |
| Security measures | TLS 1.3, AES-256-GCM envelope encryption per tenant, RLS, HMAC-keyed API keys, full audit logging. |
| Lawful basis | Contract with customer-controller (DPA). |
I3. Access-control policy (excerpt)
# Access Control Policy
## Roles
- **deneva-mcp (service)** — runs the application. Read/write `mcp_app` PG role. Cannot read `audit_log_archive` historical detail. No SSH access.
- **operator** — sysadmin. SSH from allow-listed IPs only; sudo audited. May rotate KEK, run DEK rotation per tenant, view all `/admin/*` endpoints.
- **auditor** — read-only. PG role with SELECT on audit_log + audit_log_archive only. No application secrets.
## Review cadence
- API key list reviewed quarterly: keys with `last_used_at` older than 90 days flagged for revocation.
- SSH allowlist reviewed monthly.
- Operator list reviewed on every team change.
- Failed-login bursts reviewed within 24h of detection (audit alerts §G2).
I4. Incident-response (excerpt)
# Incident Response
## Severity definitions
- **Sev-1**: customer data exposed, service down >15min, key material compromised.
- **Sev-2**: degraded service (sync failures across all tenants, auth flaky).
- **Sev-3**: single-tenant degradation, non-customer-facing alerts.
## Response timeline
| Severity | Acknowledge | Customer notification | Post-mortem |
|---|---|---|---|
| Sev-1 | 15 min | 2 hours | 5 business days |
| Sev-2 | 1 hour | 24 hours (if customer-affecting) | 10 business days |
| Sev-3 | Next business day | Not required | At discretion |
## Runbooks
- KEK / DEK compromise → docs/compliance/runbooks/kek-rotation.md
- Inngest dead → docs/compliance/runbooks/inngest-incident.md
- Tenant data deletion request → docs/compliance/runbooks/tenant-erasure.md
- Database loss (full or partial) → docs/compliance/runbooks/database-restore.md
I5. Acceptance
- All files in §I1 exist and are non-empty.
- The SOC 2 evidence binder index links to every committed artifact.
- A peer review on the docs (in PR review form) signs off — the index is the deliverable, not just the individual files.
Workstream J — Tests
J1. New test files
| File | Covers |
|---|---|
| `tests/integration/security-headers.test.ts` | A — verifies every required header on every public route (example sketch below) |
| `tests/integration/admin-token.test.ts` | B4 — replaces every Phase 2/3/4 admin-token check with the new helper |
| `tests/integration/audit-delete-gate.test.ts` | D — DELETE blocked, stored proc allowed |
| `tests/integration/kek-rotation.test.ts` | E — multi-version decrypt round-trips |
| `tests/integration/metrics-endpoint.test.ts` | F — Prometheus exposition shape |
| `tests/integration/alerts-webhook.test.ts` | G — webhook called on the configured triggers |
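For example, the first file might look like this (a sketch assuming vitest and undici; swap in the project's actual runner and HTTP client):
// tests/integration/security-headers.test.ts (sketch)
import { describe, expect, it } from 'vitest';
import { request } from 'undici';

const BASE = process.env.STAGING_BASE_URL ?? 'https://staging.your-domain.com';
const ROUTES = ['/mcp', '/auth/google/start', '/tenant/connections'];
const REQUIRED_HEADERS = [
  'strict-transport-security',
  'x-frame-options',
  'x-content-type-options',
  'referrer-policy',
  'permissions-policy',
  'content-security-policy',
];

describe('security headers (Workstream A)', () => {
  for (const route of ROUTES) {
    it(`${route} carries every required header and hides the server version`, async () => {
      // `always` on add_header means the headers must be present even on 401/404 responses.
      const res = await request(`${BASE}${route}`);
      for (const name of REQUIRED_HEADERS) expect(res.headers[name], name).toBeDefined();
      expect(String(res.headers['server'] ?? '')).not.toMatch(/nginx\/\d/);
    });
  }
});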
J2. Staging environment
The ZAP / nuclei jobs need a running target. Phase 5 introduces a staging environment matching production:
- Same systemd + nginx config.
- Same Postgres setup (dedicated DB).
- Real Google / Meta / TikTok test apps with sandboxed credentials.
- Restored from a snapshot before each weekly scan run (so attack residue doesn’t accumulate).
J3. CI orchestration
push to main → existing CI (Phases 1–4) → deploy to staging → security-scan workflow runs → if green, manual promote to prod
Workstream K — Postgres backup & restore (low-effort v1)
Scope deliberately small. This is the “we can survive losing the database” version. It uses nightly logical dumps only — no WAL archiving, no PITR, no replicas. RPO is one day. A v2 with continuous archiving and PITR is listed in §12.
K1. Targets
| Metric | v1 (this phase) | v2 (post-launch, §12) |
|---|---|---|
| RPO (data loss window) | ≤ 24h | ≤ 5 min |
| RTO (time to restored service) | ≤ 4h | ≤ 1h |
| Mechanism | nightly pg_dump -Fc, encrypted, off-host | WAL archiving + base backups, PITR via pgBackRest or similar |
| Restore-test cadence | weekly automated drill on staging | continuous (replica is the test) |
The v1 numbers are honest — a tenant whose campaign-sync ran an hour before the host died will lose that hour of metric_cache and audit_log rows. Document this in the DPA / customer comms.
K2. Backup script
File: infra/backup/pg-backup.sh
#!/usr/bin/env bash
# Nightly Postgres backup: pg_dump -> age encrypt -> rclone upload.
# Runs as the `postgres` system user via the systemd unit in K3.
# Exits non-zero on any step failure so systemd flags it and §G2 alerts fire.
set -euo pipefail
DB_NAME="${DB_NAME:-deneva_mcp}"
LOCAL_DIR="${LOCAL_DIR:-/var/backups/deneva-mcp}"
RECIPIENTS="${RECIPIENTS:-/etc/deneva-mcp/backup/age-recipients.txt}"
RCLONE_REMOTE="${RCLONE_REMOTE:-mcp-backups}" # rclone config name
RCLONE_BUCKET="${RCLONE_BUCKET:-deneva-mcp-backups}" # bucket / container
LOCAL_RETENTION_DAYS="${LOCAL_RETENTION_DAYS:-7}"
ts="$(date -u +%Y%m%dT%H%M%SZ)"
out="$LOCAL_DIR/${DB_NAME}-${ts}.dump.age"
mkdir -p "$LOCAL_DIR"
# pg_dump -> age. Stream end-to-end so the plaintext dump never lands on disk.
pg_dump --format=custom --compress=6 --no-owner --no-privileges "$DB_NAME" \
| age --encrypt --recipients-file "$RECIPIENTS" --output "$out"
# Off-host upload. rclone has built-in retry; we still fail loudly on a non-zero exit.
rclone copyto --s3-no-check-bucket "$out" "${RCLONE_REMOTE}:${RCLONE_BUCKET}/$(basename "$out")"
# Local retention: delete encrypted dumps older than N days.
find "$LOCAL_DIR" -maxdepth 1 -type f -name "${DB_NAME}-*.dump.age" -mtime +"$LOCAL_RETENTION_DAYS" -delete
# Tiny success marker — restore-drill (K5) reads this to confirm a fresh dump exists.
date -u +%s > "$LOCAL_DIR/last-success"
Off-host retention is enforced by the bucket’s lifecycle policy (e.g. S3 / B2 / R2 lifecycle rule: keep 30 daily + 12 monthly + delete the rest), not by the script. One source of truth for retention.
K3. systemd timer
File: infra/systemd/mcp-backup.service
[Unit]
Description=Nightly Postgres backup for deneva_mcp
After=postgresql.service network-online.target
Wants=postgresql.service network-online.target
[Service]
Type=oneshot
User=postgres
Group=postgres
EnvironmentFile=/etc/deneva-mcp/backup/env
ExecStart=/opt/deneva-mcp/infra/backup/pg-backup.sh
# Hardening — the backup user only needs to read PG and write the local dir + reach the network.
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
ReadWritePaths=/var/backups/deneva-mcp
ProtectKernelTunables=true
ProtectKernelModules=true
RestrictNamespaces=true
LockPersonality=true
MemoryDenyWriteExecute=true
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
File: infra/systemd/mcp-backup.timer
[Unit]
Description=Run mcp-backup nightly
[Timer]
OnCalendar=*-*-* 02:15:00 UTC
RandomizedDelaySec=10m
# If the host was down at 02:15, run on next boot. (systemd does not allow trailing comments on the value line.)
Persistent=true
Unit=mcp-backup.service
[Install]
WantedBy=timers.target
`/etc/deneva-mcp/backup/env` holds the `RCLONE_*` vars and is mode 0640, owned by postgres:postgres. The rclone remote credentials live in `~postgres/.config/rclone/rclone.conf` (mode 0600), provisioned once during host setup.
K4. Encryption key handling
- Generate the keypair on the operator workstation: `age-keygen -o backup-key.txt`. The public line (`age1...`) goes into `/etc/deneva-mcp/backup/age-recipients.txt` on the prod host. The full file (with the secret line) never touches the prod host.
- Store `backup-key.txt` in two places: (a) the operator-team password manager (1Password / Bitwarden), (b) a printed copy in a physical safe. Loss of the secret = loss of every backup.
- Rotation: generate a new keypair yearly. New backups encrypt to both recipients during the overlap window (`age-recipients.txt` accepts multiple lines); after 30 days, drop the old recipient. Old encrypted dumps remain restorable with the old key — keep the old keys filed by year.
K5. Restore drill (automated)
A second systemd timer, on the staging host (so it doesn’t impact prod or share a failure domain):
File: infra/backup/pg-restore-drill.sh
#!/usr/bin/env bash
# Weekly: pull latest backup -> decrypt -> restore into a scratch DB -> sanity-check -> drop.
# Exits non-zero on any failure so the §G2 webhook fires `severity: 'critical'`.
set -euo pipefail
RCLONE_REMOTE="${RCLONE_REMOTE:-mcp-backups}"
RCLONE_BUCKET="${RCLONE_BUCKET:-deneva-mcp-backups}"
SCRATCH_DB="${SCRATCH_DB:-mcp_restore_drill}"
KEY_FILE="${KEY_FILE:-/etc/deneva-mcp/backup/age-key.txt}" # staging-only: secret key present here
WORK_DIR="$(mktemp -d)"
trap 'rm -rf "$WORK_DIR"' EXIT
latest="$(rclone lsf "${RCLONE_REMOTE}:${RCLONE_BUCKET}" --files-only \
| grep -E '\.dump\.age$' | sort | tail -n1)"
[ -n "$latest" ] || { echo "No backups found"; exit 1; }
rclone copyto "${RCLONE_REMOTE}:${RCLONE_BUCKET}/$latest" "$WORK_DIR/$latest"
age --decrypt --identity "$KEY_FILE" --output "$WORK_DIR/dump" "$WORK_DIR/$latest"
dropdb --if-exists "$SCRATCH_DB"
createdb "$SCRATCH_DB"
pg_restore --dbname="$SCRATCH_DB" --no-owner --no-privileges --jobs=2 "$WORK_DIR/dump"
# Smoke checks. Add a row to each as new tables are introduced.
psql -d "$SCRATCH_DB" -v ON_ERROR_STOP=1 <<'SQL'
SELECT 1/CASE WHEN COUNT(*) > 0 THEN 1 ELSE 0 END FROM tenants;
SELECT 1/CASE WHEN COUNT(*) > 0 THEN 1 ELSE 0 END FROM tenant_deks;
SELECT MAX(created_at) > now() - interval '48 hours' AS recent FROM audit_log
\gset
\if :recent
\else
\echo 'audit_log latest row >48h old — backup may be stale'
SELECT 1/0 AS stale_backup; -- \q takes no exit code; force a non-zero exit via ON_ERROR_STOP instead
\endif
SQL
dropdb "$SCRATCH_DB"
echo "Restore drill OK: $latest"Wired to a mcp-restore-drill.timer running OnCalendar=Sun 04:00:00 UTC. Failure path: systemd OnFailure= → calls a tiny one-liner that POSTs to ALERTS_WEBHOOK_URL with severity=critical.
The staging host is the only place the secret age key lives outside the operator vault. This is a deliberate trade-off: the drill needs to actually decrypt, and human-driven restoration in an outage is too slow if the key has to be fetched from a vault first. Compensating control: staging is on the same UFW + systemd hardening footprint as prod (§B), and the key file is mode
`0400`, owned by `postgres`.
K6. Runbook — “we lost the database”
File: docs/compliance/runbooks/database-restore.md
# Database restore — full loss
## Trigger
Postgres data unrecoverable on prod host: disk failure, accidental DROP, ransomware,
or "the host is gone." Sev-1.
## Pre-flight
- [ ] Acknowledge the incident; start a comms doc; notify on-call channel.
- [ ] Stop traffic: `sudo systemctl stop deneva-mcp` (stopping the app keeps `audit_log` writes
  out of a half-restored DB).
- [ ] Confirm DB is actually unrecoverable (check `pg_isready`, disk, recent dumps on host).
## Restore
1. Provision a clean Postgres 16 instance (same host once disk is replaced, or a new host).
Run `infra/provision.sh` to recreate roles, then create the empty `deneva_mcp` DB.
2. Fetch the age secret key from the operator vault. Place it on the restoration host as
`/root/restore-key.txt`, mode 0400. Delete after restore completes.
3. List available dumps:
`rclone lsf mcp-backups:deneva-mcp-backups | grep dump.age | sort | tail -5`
4. Pull the freshest one:
`rclone copyto mcp-backups:deneva-mcp-backups/<file> /tmp/<file>`
5. Decrypt:
`age --decrypt --identity /root/restore-key.txt --output /tmp/dump /tmp/<file>`
6. Restore (run as `postgres`):
`pg_restore --dbname=deneva_mcp --no-owner --no-privileges --jobs=4 /tmp/dump`
7. Confirm migration version matches the deployed app:
`psql -d deneva_mcp -c "SELECT * FROM drizzle.__drizzle_migrations ORDER BY id DESC LIMIT 1;"`
If the app code is ahead, run `npm run db:migrate` before restarting.
8. Smoke checks (same queries as §K5):
- `SELECT count(*) FROM tenants;` > 0
- `SELECT count(*) FROM tenant_deks;` matches tenant count
- Sample one tenant: `decryptToken` round-trips (the KEK is unchanged — DEKs decrypt fine).
9. Restart the app: `sudo systemctl start deneva-mcp`. Confirm `/admin/health/inngest`
reports 200 and a fresh heartbeat.
10. Wipe restore artefacts: `shred -u /tmp/<file> /tmp/dump /root/restore-key.txt`.
## Post-restore
- [ ] Write a Sev-1 audit-log entry: `system.recovery {dump_used, restored_at, lag_seconds}`.
`lag_seconds` = `now() - dump_timestamp`; this is the actual data loss for this incident.
- [ ] Notify customers per §I4 timeline. The customer-facing message must state the data-loss
window honestly (rows created in the last `lag_seconds` are gone).
- [ ] Rotate the API_KEY_HMAC_SECRET if there's any chance the lost-host's disk can be read by
a third party (i.e., the loss was theft, not media failure). Forces all tenants to
re-issue keys, but the alternative is hoping the disk wasn't recoverable.
- [ ] Post-mortem within 5 business days. Include: actual RTO vs target, root cause, why the
monitor didn't catch it earlier, action items.
## What this runbook does NOT recover
- Data written between the last successful dump and the incident. Use the audit log on the
restored DB to reconstruct what's missing if a customer asks ("you ran sync at 18:00, our
dump was 02:15, those 16h of metric_cache are gone").
- v2 with WAL archiving (§12) reduces this gap to minutes.
K7. Acceptance
- `mcp-backup.timer` is enabled; `systemctl list-timers` shows the next run.
- After one nightly run: an `*.dump.age` file exists locally and in the off-host bucket; `last-success` is updated.
- `mcp-restore-drill.timer` succeeds end-to-end on staging once; failure path tested by feeding it a corrupt dump and confirming the §G2 webhook fires `severity: 'critical'`.
- The runbook in §K6 has been executed end-to-end on staging by an operator who did not write it. Time-to-green captured in the runbook header as the measured RTO.
- Bucket lifecycle policy (30 daily + 12 monthly) verified in the cloud console; old objects expire as expected.
Workstream L — Log retention (journald)
Why this is here. Phase 1 §G4 wires Pino to write structured JSON to stdout, and the systemd unit captures stdout into journald automatically. No app-level change is needed for “shipping” — journald is the log store. What’s missing for production is retention bounds so the disk doesn’t fill, and a documented way to query.
L1. journald config
File: /etc/systemd/journald.conf.d/deneva-mcp.conf (drop-in, not the main journald.conf)
[Journal]
# Write to /var/log/journal and survive reboots. (Comments must sit on their own lines in systemd configs.)
Storage=persistent
# Cap total journal disk usage.
SystemMaxUse=2G
# Rotate when a single file hits this size.
SystemMaxFileSize=200M
# Discard entries older than 30 days.
MaxRetentionSec=30day
# We don't run syslog.
ForwardToSyslog=no
Compress=yes
Apply:
sudo systemctl restart systemd-journald
journalctl --disk-usage   # confirm cap is respected
L2. PII redaction confirmation
Pino’s Phase 1 §G4 setup has two layers: the redact paths strip x-api-key and x-admin-token from request headers, and the err serializer (scrubTokens) walks log payloads and redacts token-shaped strings (Bearer …, ya29.…, EAA[A-Z]…) plus token-named fields (access_token, refresh_token, client_secret, authorization, api_key, x-admin-token). Phase 5 adds an evidence check:
# Should produce zero hits — sanity check before declaring Phase 5 done.
journalctl -u deneva-mcp --since "7 days ago" | grep -E '("x-api-key"|Bearer |ya29\.|EAA[A-Z])'The Phase 1 §G4 serializer should make this grep produce zero matches; this check is a regression-detector, not the primary defence. If anything does match, fix it at the source — extend TOKEN_KEY_RE or the regex list in scrubTokens rather than papering over with the redact paths. The audit-log PII strip from Phase 1 §F1 covers the database side; this check covers the log-stream side.
L3. Querying
Documented one-liners in docs/compliance/runbooks/inngest-incident.md (and used by ops):
# Last 100 errors
journalctl -u deneva-mcp -p err -n 100 --no-pager
# Tail live, JSON-formatted
journalctl -u deneva-mcp -f -o json-pretty
# All requests for a given request-id (Pino's genReqId from Phase 1 §G4)
journalctl -u deneva-mcp --grep "\"requestId\":\"$REQ_ID\""
# Compliance evidence: every audit-related event in a date window
journalctl -u deneva-mcp --since "2024-01-01" --until "2024-02-01" --grep 'audit'L4. When to upgrade
A single Ubuntu host is fine until any of these become true:
- Multiple hosts (need to query all of them at once).
- Auditor asks for a tamper-evident log store separate from the host.
- Disk pressure forces `MaxRetentionSec` below the 30-day floor.
When that happens, the cheap next step is Grafana Cloud Loki free tier (50 GB ingest/month, 30-day retention as of writing) using promtail to read journald and ship to Loki. That migration is a config-only change — no app code touches it.
L5. Acceptance
- `/etc/systemd/journald.conf.d/deneva-mcp.conf` exists; `journalctl --disk-usage` reports a value ≤ `SystemMaxUse`.
- The grep check in §L2 returns zero matches over a 7-day window.
- The query one-liners in §L3 are documented and referenced from `incident-response.md`.
§11 — Definition of Done (full checklist)
A. nginx TLS hardening
- testssl.sh A+ rating; only TLS 1.3.
- All Phase 1 security headers present on every public route.
- Server tokens disabled (`server_tokens off`); `client_max_body_size` enforced.
- Per-route rate-limit zones configured.
B. UFW + systemd
- UFW allows only 22 (IP-allowlist), 80, 443.
- `systemd-analyze security` ≤ 5.0 (“MEDIUM” or better).
- All hardening flags from §B2 active.
- systemd-creds replaces plaintext `LoadCredential` for every secret.
- Dedicated `ADMIN_TOKEN` secret; placeholder removed from Phase 1 §D3.
C. /api/inngest + /admin IP allow-lists
- `/api/inngest` allows only Inngest egress IPs.
- `/admin/*` allows only operator IPs.
- Refresh procedure documented in `docs/compliance/key-management.md`.
D. Audit log DELETE gate
- `archive_old_audit_rows` SECURITY DEFINER function created.
- `mcp_app` cannot raw-DELETE from `audit_log`; it can call the procedure.
- Phase 4 archive job uses the procedure.
E. KEK rotation
- `scripts/rotate-kek.ts` rewraps every DEK; idempotent on re-run.
- Application supports reading any `kekVersion` during the rotation window.
- Runbook `docs/compliance/runbooks/kek-rotation.md` exists.
F. Prometheus + Grafana
- `/metrics` returns Prometheus exposition format, behind the nginx allow-list.
- Counters / histograms cover auth, rate limit, cache, sync, token refresh, latency, heartbeat age.
- Four committed dashboards auto-load in Grafana.
- In-memory counters from Phase 2/4 retired in favor of Prometheus.
G. External alerting
- Uptime monitor configured for `/admin/health/inngest` and the Prometheus self-check.
- `sendAlert` webhook wired into the four trigger sites.
- On-call rotation documented.
H. Pen-test
- Weekly ZAP + nuclei job in CI; green or with documented suppressions.
- Manual checklist completed once with all results in the findings ledger.
- No open high or critical findings.
I. Compliance documentation
- All files in §I1 exist and reviewed.
- ROPA accurate as of launch date.
- SOC 2 evidence binder index linked from `README.md`.
J. Tests
- All new test files pass.
- Staging environment reproducibly snapshot-restored.
K. Postgres backup & restore
- `mcp-backup.timer` enabled; fires nightly and produces an `*.dump.age` in the off-host bucket.
- Local 7-day retention enforced; bucket lifecycle policy (30 daily + 12 monthly) verified.
- `mcp-restore-drill.timer` succeeds on staging weekly; failure path tested.
- Runbook `docs/compliance/runbooks/database-restore.md` executed end-to-end by an operator who did not write it; measured RTO recorded.
- age keypair committed nowhere; private key in operator vault + physical safe.
L. Log retention
- `/etc/systemd/journald.conf.d/deneva-mcp.conf` deployed; `SystemMaxUse` and `MaxRetentionSec` enforced.
- PII grep check in §L2 returns zero matches across a 7-day window.
- Query one-liners committed to `docs/compliance/runbooks/inngest-incident.md`.
§12 — Out of scope (post-launch)
| Item | Owner / cadence |
|---|---|
| Third-party penetration test | Schedule once revenue / customer signals justify; budget ~$15-30k |
| Multi-region failover (active-active or warm standby) | When first customer asks; significant infra rework |
| HashiCorp Vault migration (replacing systemd-creds) | When operator team grows beyond 1-2 people; trade-off: higher ops burden for centralized rotation |
| ISO 27001 certification | Typically follows successful SOC 2 Type II |
| Bug bounty program | After at least 6 months of production stability |
| Customer-managed encryption keys (BYOK) | Enterprise feature; revisit when first enterprise prospect asks |
| Centralised log aggregation (Grafana Cloud Loki + promtail, or similar) | When journald-on-one-host stops fitting; trigger conditions in §L4 |
§13 — Manual smoke test (production launch checklist)
# 0. All Phase 1-4 work merged. Staging running and green for >7 days.
# 1. Provision the Linux server (one-shot, idempotent ansible / shell)
sudo bash infra/provision.sh
# → installs node 22, postgres 16, nginx, ufw; creates deneva-mcp user; deploys systemd unit
# 2. Encrypt all secrets via systemd-creds
for s in CREDENTIAL_KEK API_KEY_HMAC_SECRET DB_PASSWORD GOOGLE_CLIENT_SECRET \
GOOGLE_DEVELOPER_TOKEN META_APP_SECRET TIKTOK_APP_SECRET \
INNGEST_SIGNING_KEY INNGEST_EVENT_KEY ADMIN_TOKEN; do
read -s -p "$s: " val; echo
echo -n "$val" | sudo systemd-creds encrypt --name=$s - /etc/deneva-mcp/creds/$s.cred
done
# 3. Start service, confirm
sudo systemctl enable --now deneva-mcp
systemctl status deneva-mcp
sudo systemd-analyze security deneva-mcp
# → exposure level ≤ 5.0
# 4. TLS smoke test
testssl.sh https://your-domain.com | grep -E '(Overall Grade|TLS 1.3|TLS 1.2)'
# → TLS 1.3 only, A+
# 5. Header smoke test
curl -sI https://your-domain.com/mcp | grep -iE '(strict-transport|content-security|x-frame|x-content-type|referrer-policy|permissions-policy)'
# → six headers present
# 6. UFW
sudo ufw status verbose | grep -E '(22|80|443|deny)'
# 7. /metrics reachable from Prometheus host, blocked elsewhere
# (run on the Prometheus host itself — its API listener is bound to 127.0.0.1:9090)
curl -s http://127.0.0.1:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job=="deneva-mcp") | .health'
# → "up"
# 8. Trigger every alerting path in staging:
# - lock out an IP via 11 auth failures → webhook + audit entry
# - kill inngest for 11 min → /admin/health/inngest 503 → uptime monitor pages
# - simulate KEK rotation script run on a sandbox tenant → success in <30s
# 9. Run the manual pen-test checklist top to bottom, capture results in ledger
# 10. Run the GDPR erasure end-to-end on a test tenant; verify all rows gone
# 11. SOC 2 evidence: confirm every link in soc2-evidence-binder.md resolves
# 12. Backup smoke test: confirm last night's dump landed in the off-host bucket
rclone lsf mcp-backups:deneva-mcp-backups | grep dump.age | sort | tail -3
# → most recent timestamp is from today
# Confirm weekly restore-drill timer is scheduled
systemctl list-timers mcp-restore-drill.timer
# → next trigger date shown; last successful run within 7 days
# 13. Announce launch
If every step produces the expected outcome, Phase 5 — and the v1 product — is shipped.