← All epicsBRD.md §9 · lines 5184–5367

Read-only per-epic slice. The canonical source of truth is BRD.md — stories are addressed by US-ID, not by this page's line numbers.

Epic 13 — Reconciliation & observability (derived view)

Read-only per-epic slice of BRD.md §9, lines 5184–5367. The canonical source of truth is BRD.md — edit there, never here. The stable address for a story is its US-ID (US-13.x), not a line number. Regenerates on every dev → main sync via derive-state-on-main.

Stories (5): US-13.1, US-13.2, US-13.3, US-13.4, US-13.5

Generated: 2026-07-01T17:48:39.076Z · as-of commit: b083f095

Epic 13 — Reconciliation & observability

Prototype: Event Timeline · Drift Sweep · Replay Tool · Structured Logs

Value: Every charge, order, and state change is verifiable and replayable.

US-13.1: Event log per subscription

Prototype: Event Timeline

Phase: MVP · Persona: Support / Ops

As Support / Ops, I want to see a complete event timeline for any subscription, so that I can diagnose issues without log-diving.

Acceptance criteria:

Given I open a subscription, When the timeline renders, Then it lists every lifecycle event (created, charged, paused, PM-updated, etc.) with timestamp, actor, and payload snippet.
Given I filter by event type or actor, When I apply the filter, Then the timeline updates.

UI states.

surface: "Admin (React/BigDesign) — the 'Activity' timeline Panel in apps/admin/src/pages/subscriptions/SubscriptionAdminDetail.tsx (line 718-772), fed by getSubscriptionEvents (apps/admin/src/lib/api-client.ts) which returns SubscriptionEventRow[] (api-client.ts line 452). Persona: Support / Ops."
idle:
  render: "An event list: each row renders an actor-kind color dot (actorIcon), a friendly event-type label (friendlyEventLabel, line 738), a timestamp (formatDateTime of created_at, line 746), an optional operator id (actorUser), a truncated payload snippet (truncatePayload, line 739, max 80 chars), and an actor-kind Badge (actorKindLabel, line 765)."
  primary_action: "Read the timeline — no row action. AC2 filtering (by event type or actor) is the north-star action and is not yet rendered (see gaps)."
loading:
  render: "While events is null and no error has surfaced, the Panel shows a single 'loading activity' Text (activity.loading, line 730) — no skeleton rows."
error:
  surfaced_at: "Inline, at the top of the Activity Panel: a Message(type='error') with the activity error header (activity.errorHeader, line 723) renders when getSubscriptionEvents rejects (eventsError, line 719-727)."
  recovery: "No inline retry control — the rep reloads the page to re-issue getSubscriptionEvents; the rest of the detail page stays usable because the timeline error is scoped to its own Panel."
empty:
  render: "When events is a non-null empty array the Panel shows the 'no activity' empty copy (activity.empty, line 734); a real subscription always carries at least a created event so this is rare."
edge_status:
  - status: "unknown / newly-added event type"
    affordance: "friendlyEventLabel falls back to a generic label plus the raw type in parens (line 161) rather than printing an internal type name or rendering blank — the row stays readable."
  - status: "empty or null payload"
    affordance: "truncatePayload returns null for '{}' / 'null' / empty (line 168) so the snippet segment is omitted; the row still shows type + actor + timestamp."
  - status: "filter active (north-star, AC2) — by event type or actor"
    affordance: "North-star: event-type and actor Selects narrow the timeline, and a 'no events match — clear filter' state offers a clear-filter control; today no filter UI is rendered so the timeline is always unfiltered (see gaps)."
inputs: []
disabled_focus:
  keyboard: "Read-only timeline — no focusable controls today; rows are static Text/Small/Badge/code nodes in DOM source order (line 741-768). North-star adds filter Selects, which must be real BigDesign Selects reachable in tab order with arrow-key option selection."
gaps: "AC2 ('Given I filter by event type or actor, When I apply the filter, Then the timeline updates') is unbuilt: the Activity Panel renders all events unfiltered with no filter controls. actor_kind is enumerable (the 5-value union at api-client.ts:460 — merchant_user / subscriber / system / webhook_bc / webhook_processor) so the actor filter must be a Select; event_type is open-ended (friendlyEventLabel falls back to unknown) so its filter should be a Select populated from the events actually present. AC1 (type + actor + timestamp + payload snippet) is fully built."

US-13.2: Daily reconciliation sweep

Prototype: Drift Sweep

Phase: MVP · Priority: P0 · Effort: L · Persona: System

As the System, I want to detect state drift between our DB, BC, and the processor daily, so that corrections happen before subscribers notice.

Acceptance criteria:

Given the reconciliation cron runs, When it executes, Then it samples:
- Charges with bc_order_id set but no matching BC order
- BC orders tagged as subscription-origin but no matching charge
- Processor-settled transactions with no matching charge
- Charges stuck in processing > 1 hour
Given any drift is found, When the sweep completes, Then a report populates the exception queue with proposed remediation.

Data contract.

Cron job daily at 02:00 UTC (lowest traffic window)
Queries:
- A: charges WHERE status='succeeded' AND bc_order_id IS NULL (settled but no order)
- B: charges WHERE status='processing' AND attempted_at < now() - 1h (stuck)
- C: BC orders with sub metadata where our charges.bc_order_id doesn't match
- D: Processor transactions via adapter listing (last 7d) where no charge matches
Each drift detection creates an exception_queue row with type, entity_refs, proposed_remediation

Success metrics.

Functional (target): reconciliation detects ≥ 99% of simulated drift scenarios (test suite)
Operational (target): sweep completes within 30 min for a store with 100K subs
Product (target): % of exceptions auto-resolved (remediation applied without merchant action) ≥ 50%

Dependencies.

Exception queue UI (US-21.3)

Non-functional.

Read-only on BC (no writes during sweep; writes happen only on merchant-approved remediation)

Risks / open questions.

Reconciliation with processor historical data can hit rate limits on stores with large volume. Paginate and throttle; acceptable to span multiple runs for initial backfill.

US-13.3: Replay a failed workflow

Prototype: Replay Tool

Phase: MVP · Priority: P1 · Effort: M · Persona: Support / Ops / Developer

As Support / Ops, I want to re-execute a failed charge workflow with the same idempotency key, so that I can recover from transient bugs without duplicate charging.

Acceptance criteria:

Given a charge is stuck in failed due to a known infra issue, When I click "Replay" in the admin, Then the workflow re-runs using the same idempotency key and does not double-charge.

UX notes.

Surface: support admin tool at /stores/[storeHash]/support/charge/{id}/replay
Pre-conditions shown: current state, last error, what replay will attempt
Post-conditions: updated state + events

Data contract.

Our API (impl ships dual endpoints): POST /api/v1/admin/subscriptions/:id/force-retry (charge retry on subscription) and POST /api/v1/admin/charges/:id/retry-materialization (BC order materialization replay for missing-order edge case)
Workflow: force-retry re-queues the pending charge; retry-materialization re-runs the BC order creation only
Adapter behavior: idempotency key recognized, processor returns cached result if it succeeded previously

Success metrics.

Functional: replay never double-charges
Operational (target): support resolution time on stuck charges ≤ 5 min

Non-functional.

Every replay is audited with Support user ID

UI states.

surface: "Admin (React/BigDesign) — replay/retry affordances on apps/admin/src/pages/subscriptions/SubscriptionAdminDetail.tsx: an 'Actions' Panel 'Force retry' Button (actions.forceRetry, line 561 -> handleForceRetry, line 305 -> POST /api/v1/admin/subscriptions/:id/force-retry) and a per-row 'Retry' Button on each failed charge in the history Panel (charge.retryBtn, line 705 -> handleChargeRetry, line 336 -> retryCharge, apps/admin/src/api-client/subscriptions.ts -> POST /api/v1/admin/subscriptions/:id/charges/:cid/retry). Both re-queue the charge; the processor recognises the idempotency key so a previously-succeeded charge is not double-charged. Persona: Support / Ops / Developer."
idle:
  render: "The 'Actions' Panel shows a 'Force retry' Button enabled only when at least one charge has failed (failedExists, line 558). Each failed charge in the history Panel shows a red Error icon, a 'Failed' Badge, and an inline 'Retry' Button (line 694-707); succeeded charges show a success Badge and no retry control."
  primary_action: "'Force retry' re-queues the subscription's pending charge; a per-charge 'Retry' re-queues that specific failed charge — both reusing the same idempotency key."
loading:
  render: "Per-charge retry: the pressed Button shows BigDesign isLoading while retryingChargeId equals that charge id, and every Retry Button plus actions disable to block concurrent replays (retryingChargeId, line 701-702). Force-retry: the Button disables via actionPending while in flight (line 558) but shows no spinner label (see gaps)."
error:
  surfaced_at: "A page-level Message(type='error') at the top of the detail (actionError, line 403-412), dismissible via onClose, carrying the raw 'HTTP <status>: <body>' thrown by postOverride / retryCharge."
  recovery: "On failure actionPending / retryingChargeId reset in finally (line 347), re-enabling the controls so the rep re-presses 'Force retry' or 'Retry'; the shared idempotency key makes a re-press safe against double-charge."
empty:
  render: "When no charge has failed (failedExists false, line 375) the 'Force retry' Button is disabled and the history Panel renders only succeeded rows with no Retry controls — there is nothing to replay, which is the correct at-rest state, not a blank pane."
edge_status:
  - status: "no failed charges"
    affordance: "'Force retry' is disabled (line 558) and no per-charge Retry Buttons render; the success-only history communicates there is nothing to replay."
  - status: "subscription cancelled"
    affordance: "Per-charge Retry Buttons disable (isCancelled, line 702) so a terminal subscription cannot be replayed; the rep must reactivate before retrying."
  - status: "retry already in flight"
    affordance: "All Retry Buttons and force-retry disable while retryingChargeId is set (line 702) so the rep cannot fire concurrent replays of the same charge."
  - status: "charge succeeded but BC order never materialized (missing-order edge)"
    affordance: "North-star: a per-charge 'Replay order materialization' control should call POST /api/v1/admin/charges/:id/retry-materialization (the endpoint already exists, apps/api/src/worker.ts) to re-run only BC order creation; today no UI button exposes that arm (see gaps)."
inputs: []
disabled_focus:
  keyboard: "'Force retry' and each per-charge 'Retry' are real BigDesign <Button>s reachable in tab order and Enter/Space-activatable; native disabled removes them from tab order while in flight or when not applicable. The page-level error Message is dismissible via a real close control."
gaps: "The single AC ('click Replay -> workflow re-runs with the same idempotency key, no double-charge') is satisfied by force-retry + per-charge retry. Divergences from the data contract / UX note remain: (1) the data contract's second arm POST /api/v1/admin/charges/:id/retry-materialization exists in the API (worker.ts) but no UI affordance reaches it; (2) the BRD UX note targets a standalone route /stores/[storeHash]/support/charge/{id}/replay, but the feature is embedded inline in the subscription detail; (3) 'Force retry' shows no in-flight spinner label (only disables), unlike the per-charge Retry."

US-13.4: Structured logs with correlation IDs

Prototype: Structured Logs

Phase: MVP · Persona: Developer

As a Developer debugging a production incident, I want every log line tied to a subscription + charge + request correlation ID, so that I can trace end-to-end flow.

Acceptance criteria:

Given any logged event, When I query by charge_id, Then I see every log line across all services (scheduler, executor, adapter, BC API call, processor call, webhook) for that charge.

US-13.5: Metrics dashboard for platform SRE

Phase: P2 · Persona: Developer

As a Developer / SRE, I want platform health metrics (charge success rate, p50/p95 latency, retry rates, webhook lag), so that I can operate the service at SLA.

Acceptance criteria:

Given I open the internal metrics dashboard, When it renders, Then I see per-store and aggregate: charge outcomes, latency distributions, queue depth, dunning recovery rates.

← Back to all epics