Operations & Reliability · pre-launch readiness
Read the operability.
What it takes to run this in production and hand it to an on-call rotation — SLOs, runbooks, status page, support macros — read for an engineering/operations director doing an operability assessment, and for the team that will receive the pager. The honest state today: these operational sign-offs are pre-launch and largely unsigned. That's not a gap this page papers over; naming it is the handoff.
Sign-off status
0 of 7 signed off
6 GA-blocking items are still outstanding — on-call, both runbooks, SLO definitions, the status page, and support macros are all gated on a human sign-off, not on more code. 7 of 7 have never been signed off even once (last_attested: null).
Total: 7
GA-blocking outstanding: 6
Passed: 0
Never signed off: 7
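As a rough illustration of how the four figures above relate to the item list, the sketch below tallies them from seven records. Only `last_attested`, `slo-definitions`, and `on-call-rotation` appear on this page; the other slugs, the field names, and the `SignOff` type are assumptions made for the example, not the repo's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SignOff:
    # Hypothetical record shape; only last_attested is named on this page.
    slug: str
    gating: str                   # "ga-blocking" or "soft-gating"
    status: str                   # "pending" or "passed"
    last_attested: Optional[str]  # None mirrors last_attested: null


# Seven items as listed under "Operational sign-offs"; most slugs are illustrative.
ITEMS = [
    SignOff("status-page", "ga-blocking", "pending", None),
    SignOff("on-call-rotation", "ga-blocking", "pending", None),
    SignOff("runbook-db-recovery", "ga-blocking", "pending", None),
    SignOff("runbook-charge-failures", "ga-blocking", "pending", None),
    SignOff("slo-definitions", "ga-blocking", "pending", None),
    SignOff("support-macros", "ga-blocking", "pending", None),
    SignOff("incident-review", "soft-gating", "pending", None),
]


def summarize(items):
    """Count the four figures shown in the sign-off summary."""
    return {
        "total": len(items),
        "ga_blocking_outstanding": sum(
            1 for i in items if i.gating == "ga-blocking" and i.status != "passed"
        ),
        "passed": sum(1 for i in items if i.status == "passed"),
        "never_signed_off": sum(1 for i in items if i.last_attested is None),
    }


print(summarize(ITEMS))
```

The point of the split is that "outstanding" and "never signed off" are separate counts: an item that was attested once and later reverted to pending would count toward the first but not the second.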
How to read this
A pending item below isn't a hidden defect — it's the handoff signal for the receiving ops team. Each item names exactly what's needed (a schedule, a tested restore, an agreed SLO number) and why it matters for a subscription-billing system specifically. Work through them before accepting real merchant traffic, in roughly the order they appear (GA-blocking first).
Operational sign-offs
External status page: configured, monitored, and merchant-accessible
ga-blocking · pending
A public, merchant-accessible status page with automated uptime monitoring against the /health endpoint.
phase: pre-launch · never signed off

On-call rotation defined: schedule, alerts, and escalation paths
ga-blocking · pending
Schedule, alert routing (charge-failure / cron / API-error-rate thresholds), and escalation path. An undetected cron failure can miss a whole billing cycle's charges across every merchant.
phase: pre-launch · never signed off

Runbook: database backup verification and recovery procedure
ga-blocking · pending
Documented and staging-tested Cloudflare D1 backup/restore with stated RTO and RPO. Subscription billing data is financial data.
phase: pre-launch · never signed off

Runbook: subscription charge failures and dunning escalation
ga-blocking · pending
An on-call playbook for charge-failure spikes and dunning escalation, with diagnostic queries per scenario, aimed at cutting time-to-mitigate below 15 minutes.
phase: pre-launch · never signed off

SLO definitions: uptime, charge success rate, and API latency
ga-blocking · pending
Stakeholder-agreed objectives for uptime, charge-success rate, cron reliability, and p99 latency, plus an error-budget policy. Agreed, not engineering-set.
phase: pre-launch · never signed off

Support macros and playbooks: common merchant support scenarios
ga-blocking · pending
At least ten support playbooks for common merchant scenarios (double-charge, non-renewal, refund) to cut time-to-resolution and prevent chargebacks.
phase: pre-launch · never signed off

Incident review template: first post-mortem completed
soft-gating · pending
A blameless post-mortem template plus one completed review (real or practice). Soft-gating: a leading indicator of operational maturity, not a hard quality gate.
phase: launch · never signed off
Related: #1269
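The on-call item names an undetected cron failure as the risk. One common way to catch a job that has gone silent is a staleness check on its last successful run. The sketch below is a minimal illustration of that idea, not the project's alerting design: the function name, the hourly interval, and the five-minute grace period are all assumptions.

```python
from datetime import datetime, timedelta, timezone


def cron_is_stale(last_success, now, expected_interval, grace=timedelta(minutes=5)):
    """Return True when the billing cron has been silent for too long.

    last_success is None if the job has never reported a successful run.
    The interval and grace values are placeholders, not agreed thresholds.
    """
    if last_success is None:
        return True
    return now - last_success > expected_interval + grace


# Fixed timestamps keep the example deterministic.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
hourly = timedelta(hours=1)

print(cron_is_stale(now - timedelta(minutes=50), now, hourly))  # recent run
print(cron_is_stale(now - timedelta(hours=3), now, hourly))     # overdue run
print(cron_is_stale(None, now, hourly))                         # never ran
```

A check like this only helps once its result is routed to someone, which is what the pending on-call sign-off covers.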
Built infrastructure (evidence, not sign-off)
Two pieces of runtime infrastructure that the attestations above turn into operational readiness already exist in the repo. Their existence is not the same claim as the sign-offs above — the code being present doesn't mean the SLOs it would measure against are agreed, or that anyone is on-call to act on it.
Observability hooks
tools/observability/ exports a per-epic pre-tagged logger, metrics, and costTracker. It is sink-agnostic: Phase 1 on Cloudflare Workers Analytics, Phase 2 cutting over to GCP Monitoring per ADR-0054. It surfaces Epic DoD Gate 10 and Operations DoD Gate 2.
Honest bound: the hooks exist in code and emit epic-tagged data today. The SLO targets they'd be measured against are the slo-definitions item above (pending), and alert routing / on-call is not yet configured (the on-call-rotation item above). Emitting data is not the same claim as being monitored-with-alerting.
Load-test smoke suite
k6 scripts run via GitHub Actions (load-test-smoke-all.yml and the per-epic workflows alongside it) execute a post-deploy smoke suite of 17 HTTP scenarios at 1–2 virtual users × 30 seconds against the live Worker, confirming that every endpoint returns a non-5xx response under light load and exercises the real D1 path.
Honest bound: this is a smoke gate, not decision-grade scale or throughput sizing — the same caveat /engineering names for infra cost estimates: design estimates, not load-tested numbers. Real ceilings require a real load test, at production-representative concurrency, that hasn't been run.
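To make the pass criterion concrete: read literally, the gate described above only requires that no scenario answers with a 5xx. The sketch below expresses that check in Python for illustration (the real suite is k6 scripts). The function names and sample status codes are assumptions, not taken from the repo.

```python
def smoke_gate_passes(statuses):
    """True when at least one scenario ran and none answered with a 5xx."""
    return len(statuses) > 0 and all(status < 500 for status in statuses)


def non_success(statuses):
    """Statuses outside 2xx, worth a look even when the gate passes."""
    return [status for status in statuses if not 200 <= status < 300]


# Hypothetical results for a handful of scenarios.
healthy = [200, 200, 201, 204]
with_client_error = [200, 404, 200]
with_server_error = [200, 503, 200]

print(smoke_gate_passes(healthy))
print(smoke_gate_passes(with_client_error), non_success(with_client_error))
print(smoke_gate_passes(with_server_error))
```

Under this reading a 404 still passes the gate, which is one more reason to treat it as a liveness signal rather than a correctness or capacity check.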
Glossary
- SLO (service level objective)
- A target for a measurable behavior — e.g. 99.9% API uptime per month — that stakeholders agree defines "the system is working."
- Error budget
- The amount of SLO violation allowed before it's treated as an incident — e.g. 43 minutes of downtime per month at 99.9%. An error-budget policy defines what happens (a feature-work freeze, an escalation) when the budget is exhausted.
- RTO / RPO
- Recovery Time Objective (how long a restore is allowed to take) and Recovery Point Objective (how much data loss, measured in time, is acceptable) for a database recovery procedure.
- Dunning
- The retry-and-escalation process that runs after a subscription charge fails, before the subscription is cancelled.
- On-call rotation
- A schedule assigning who is responsible for responding to production alerts at any given time, with a defined escalation path if the primary can't resolve the issue.
- GA-blocking
- An attestation gating level whose item must reach passed before general availability, as distinct from marketplace-blocking (a submission gate) or soft-gating (a maturity signal, not a hard requirement).
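The error-budget figure in the glossary follows directly from the SLO percentage and the length of the window. The sketch below works through that arithmetic, assuming a 30-day month; the function names are illustrative, and 99.9% is the glossary's example rather than an agreed target.

```python
def error_budget_minutes(slo_percent, days_in_window=30):
    """Minutes of allowed downtime in the window for a given uptime SLO."""
    window_minutes = days_in_window * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent, downtime_minutes, days_in_window=30):
    """Budget left after observed downtime; zero or below means exhausted."""
    return error_budget_minutes(slo_percent, days_in_window) - downtime_minutes


budget = error_budget_minutes(99.9)
print(round(budget, 1))                        # roughly 43 minutes for a 30-day month
print(round(budget_remaining(99.9, 30.0), 1))  # budget left after 30 minutes of downtime
print(budget_remaining(99.9, 50.0) <= 0)       # True once the budget is exhausted
```

A 30-day window contains 43,200 minutes, and 0.1% of that is 43.2, which the glossary rounds to 43. Tightening the target to 99.99% shrinks the budget to about 4.3 minutes, which is why the number has to be agreed with stakeholders rather than set by engineering alone.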
Related
Open items carries the full actionable list across every category, not just operations. For the architecture and code-quality read this page's sibling covers, see Engineering. Runbooks and SLO definitions, once written, land under docs/runbooks/ and docs/attestations/operations/ respectively.