Operations & Reliability · pre-launch readiness

Read the operability.

What it takes to run this in production and hand it to an on-call rotation (SLOs, runbooks, status page, support macros), written for an engineering/operations director doing an operability assessment and for the team that will receive the pager. The honest state today: these operational sign-offs are pre-launch and all unsigned. That's not a gap this page papers over; naming it is the handoff.

Sign-off status

0 of 7 signed off

6 GA-blocking items are still outstanding — on-call, both runbooks, SLO definitions, the status page, and support macros are all gated on a human sign-off, not on more code. 7 of 7 have never been signed off even once (last_attested: null).

Total: 7

GA-blocking outstanding: 6

Passed: 0

Never signed off: 7
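
The four headline numbers follow mechanically from the per-item records. The sketch below shows one way to derive them, assuming a hypothetical record shape: only `last_attested`, `slo-definitions`, and `on-call-rotation` appear on this page, and every other field name and slug is illustrative rather than the repo's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SignOff:
    slug: str                     # hypothetical identifier for the item
    gate: str                     # "ga-blocking" or "soft-gating"
    status: str                   # "pending" or "passed"
    last_attested: Optional[str]  # None mirrors `last_attested: null`


def summarize(items):
    """Derive the headline counts from the individual sign-off records."""
    return {
        "total": len(items),
        "passed": sum(1 for i in items if i.status == "passed"),
        "ga_blocking_outstanding": sum(
            1 for i in items if i.gate == "ga-blocking" and i.status != "passed"
        ),
        "never_signed_off": sum(1 for i in items if i.last_attested is None),
    }


# The seven items listed on this page; slugs other than the two named
# in the text are made up for illustration.
ITEMS = [
    SignOff("status-page", "ga-blocking", "pending", None),
    SignOff("on-call-rotation", "ga-blocking", "pending", None),
    SignOff("runbook-database-recovery", "ga-blocking", "pending", None),
    SignOff("runbook-charge-failures", "ga-blocking", "pending", None),
    SignOff("slo-definitions", "ga-blocking", "pending", None),
    SignOff("support-macros", "ga-blocking", "pending", None),
    SignOff("incident-review-template", "soft-gating", "pending", None),
]

summary = summarize(ITEMS)
print(summary)
```

Keeping "passed" and "never signed off" as separate counts matters once re-attestation starts: an item can have been signed off once and later fall back to pending.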

How to read this

A pending item below isn't a hidden defect — it's the handoff signal for the receiving ops team. Each item names exactly what's needed (a schedule, a tested restore, an agreed SLO number) and why it matters for a subscription-billing system specifically. Work through them before accepting real merchant traffic, in roughly the order they appear (GA-blocking first).

Operational sign-offs

External status page — configured, monitored, and merchant-accessible
ga-blocking · pending
A public, merchant-accessible status page with automated uptime monitoring against the /health endpoint.
phase: pre-launch · never signed off
Related: #1269 · #1282
On-call rotation defined — schedule, alerts, and escalation paths
ga-blocking · pending
Schedule, alert routing (charge-failure / cron / API-error-rate thresholds), and escalation path — an undetected cron failure can miss a whole billing cycle's charges across every merchant.
phase: pre-launch · never signed off
Related: #1269 · #1282
Runbook — database backup verification and recovery procedure
ga-blocking · pending
Documented and staging-tested Cloudflare D1 backup/restore with stated RTO and RPO — subscription billing data is financial data.
phase: pre-launch · never signed off
Related: #1269 · #1282
Runbook — subscription charge failures and dunning escalation
ga-blocking · pending
An on-call playbook for charge-failure spikes and dunning escalation, with diagnostic queries per scenario, aimed at cutting time-to-mitigate below 15 minutes.
phase: pre-launch · never signed off
Related: #1269 · #1282
SLO definitions — uptime, charge success rate, and API latency
ga-blocking · pending
Stakeholder-agreed objectives for uptime, charge-success rate, cron reliability, and p99 latency, plus an error-budget policy — agreed, not engineering-set.
phase: pre-launch · never signed off
Related: #1269 · #1282
Support macros and playbooks — common merchant support scenarios
ga-blocking · pending
At least ten support playbooks for common merchant scenarios (double-charge, non-renewal, refund) to cut time-to-resolution and prevent chargebacks.
phase: pre-launch · never signed off
Related: #1269 · #1282
Incident review template — first post-mortem completed
soft-gating · pending
A blameless post-mortem template plus one completed review (real or practice) — soft-gating; a leading indicator of operational maturity, not a hard quality gate.
phase: launch · never signed off
Related: #1269
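
The on-call item above names an undetected cron failure as the risk that can miss a whole billing cycle. One common way to catch that is a heartbeat staleness check: alert when the last successful run is older than the expected interval plus a grace period. The sketch below is illustrative only. The interval, the grace period, and the function name are assumptions, and the real thresholds belong to the pending on-call and SLO sign-offs.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical values for illustration; nothing on this page fixes them.
EXPECTED_INTERVAL = timedelta(hours=1)
GRACE = timedelta(minutes=10)


def cron_is_stale(last_success, now,
                  expected_interval=EXPECTED_INTERVAL, grace=GRACE):
    """Return True when the last successful run is older than interval + grace."""
    return now - last_success > expected_interval + grace


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
fresh = cron_is_stale(now - timedelta(minutes=30), now)  # ran 30 minutes ago
stale = cron_is_stale(now - timedelta(hours=3), now)     # ran 3 hours ago
print(fresh, stale)
```

A check like this alerts on the absence of a success signal rather than the presence of an error, which is what makes a silently dead cron visible.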

Built infrastructure (evidence, not sign-off)

Two pieces of runtime infrastructure already exist in the repo; the attestations above are what turn them into operational readiness. Their existence is not the same claim as the sign-offs above: the code being present doesn't mean the SLOs it would measure against are agreed, or that anyone is on-call to act on it.

Observability hooks

tools/observability/ exports a per-epic pre-tagged logger, metrics, and costTracker. It is sink-agnostic: Phase 1 runs on Cloudflare Workers Analytics, with Phase 2 cutting over to GCP Monitoring per ADR-0054. It surfaces Epic DoD Gate 10 and Operations DoD Gate 2.

Honest bound: the hooks exist in code and emit epic-tagged data today. The SLO targets they'd be measured against are the slo-definitions item above (pending), and alert routing / on-call is not yet configured (the on-call-rotation item above). Emitting data is not the same claim as being monitored-with-alerting.

Load-test smoke suite

k6 scripts run via GitHub Actions (load-test-smoke-all.yml and the per-epic workflows alongside it) execute a post-deploy smoke suite of 17 HTTP scenarios at 1–2 virtual users × 30 seconds against the live Worker, confirming that no endpoint returns a 5xx under light load and that requests exercise the real D1 path.

Honest bound: this is a smoke gate, not decision-grade scale or throughput sizing. It carries the same caveat /engineering names for infra cost estimates: design estimates, not load-tested numbers. Real ceilings require a load test at production-representative concurrency, and that test hasn't been run.

Glossary

SLO (service level objective): A target for a measurable behavior — e.g. 99.9% API uptime per month — that stakeholders agree defines "the system is working."
Error budget: The amount of SLO violation allowed before it's treated as an incident — e.g. 43 minutes of downtime per month at 99.9%. An error-budget policy defines what happens (a feature-work freeze, an escalation) when the budget is exhausted.
RTO / RPO: Recovery Time Objective (how long a restore is allowed to take) and Recovery Point Objective (how much data loss, measured in time, is acceptable) for a database recovery procedure.
Dunning: The retry-and-escalation process that runs after a subscription charge fails, before the subscription is cancelled.
On-call rotation: A schedule assigning who is responsible for responding to production alerts at any given time, with a defined escalation path if the primary can't resolve the issue.
GA-blocking: A gating level for an attestation that must reach passed before general availability, as distinct from marketplace-blocking (a submission gate) or soft-gating (a maturity signal, not a hard requirement).
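
The error-budget figure in the glossary is plain arithmetic: the allowed failure fraction multiplied by the length of the window. A minimal sketch, assuming a 30-day month and using illustrative function names:

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime in the window for a given uptime target."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Budget left after observed downtime; negative means it is exhausted."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


budget = error_budget_minutes(0.999)
print(f"{budget:.1f}")  # 43.2 minutes for 99.9% over 30 days
```

A negative result from `budget_remaining` is the condition an error-budget policy would act on, for example with a feature-work freeze or an escalation.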

Open items carries the full actionable list across every category, not just operations. For the architecture and code-quality read, covered by this page's sibling, see Engineering. Runbooks and SLO definitions, once written, land under docs/runbooks/ and docs/attestations/operations/ respectively.
