Epic 13 — Reconciliation & observability (derived view)
Read-only per-epic slice of BRD.md §9, lines 5184–5367. The canonical source of truth is BRD.md: edit there, never here. The stable address for a story is its US-ID (US-13.x), not a line number. Regenerates on every dev → main sync via derive-state-on-main.
- Stories (5): US-13.1, US-13.2, US-13.3, US-13.4, US-13.5
- Generated: 2026-07-01T17:48:39.076Z · as-of commit: b083f095
Epic 13 — Reconciliation & observability
<!-- traceability:start:BRD:Epic-13 --><!-- traceability:end:BRD:Epic-13 -->Prototype: Event Timeline · Drift Sweep · Replay Tool · Structured Logs
Value: Every charge, order, and state change is verifiable and replayable.
US-13.1: Event log per subscription
<!-- traceability:start:US-13.1 --><!-- traceability:end:US-13.1 -->Prototype: Event Timeline
Phase: MVP · Persona: Support / Ops
As Support / Ops, I want to see a complete event timeline for any subscription, so that I can diagnose issues without log-diving.
Acceptance criteria:
- Given I open a subscription, When the timeline renders, Then it lists every lifecycle event (created, charged, paused, PM-updated, etc.) with timestamp, actor, and payload snippet.
- Given I filter by event type or actor, When I apply the filter, Then the timeline updates.
UI states.
<!-- ui-states US-13.1 -->surface: "Admin (React/BigDesign) — the 'Activity' timeline Panel in apps/admin/src/pages/subscriptions/SubscriptionAdminDetail.tsx (line 718-772), fed by getSubscriptionEvents (apps/admin/src/lib/api-client.ts) which returns SubscriptionEventRow[] (api-client.ts line 452). Persona: Support / Ops."
idle:
render: "An event list: each row renders an actor-kind color dot (actorIcon), a friendly event-type label (friendlyEventLabel, line 738), a timestamp (formatDateTime of created_at, line 746), an optional operator id (actorUser), a truncated payload snippet (truncatePayload, line 739, max 80 chars), and an actor-kind Badge (actorKindLabel, line 765)."
primary_action: "Read the timeline — no row action. AC2 filtering (by event type or actor) is the north-star action and is not yet rendered (see gaps)."
loading:
render: "While events is null and no error has surfaced, the Panel shows a single 'loading activity' Text (activity.loading, line 730) — no skeleton rows."
error:
surfaced_at: "Inline, at the top of the Activity Panel: a Message(type='error') with the activity error header (activity.errorHeader, line 723) renders when getSubscriptionEvents rejects (eventsError, line 719-727)."
recovery: "No inline retry control — the rep reloads the page to re-issue getSubscriptionEvents; the rest of the detail page stays usable because the timeline error is scoped to its own Panel."
empty:
render: "When events is a non-null empty array the Panel shows the 'no activity' empty copy (activity.empty, line 734); a real subscription always carries at least a created event so this is rare."
edge_status:
- status: "unknown / newly-added event type"
affordance: "friendlyEventLabel falls back to a generic label plus the raw type in parens (line 161) rather than printing an internal type name or rendering blank — the row stays readable."
- status: "empty or null payload"
affordance: "truncatePayload returns null for '{}' / 'null' / empty (line 168) so the snippet segment is omitted; the row still shows type + actor + timestamp."
- status: "filter active (north-star, AC2) — by event type or actor"
affordance: "North-star: event-type and actor Selects narrow the timeline, and a 'no events match — clear filter' state offers a clear-filter control; today no filter UI is rendered so the timeline is always unfiltered (see gaps)."
inputs: []
disabled_focus:
keyboard: "Read-only timeline — no focusable controls today; rows are static Text/Small/Badge/code nodes in DOM source order (line 741-768). North-star adds filter Selects, which must be real BigDesign Selects reachable in tab order with arrow-key option selection."
gaps: "AC2 ('Given I filter by event type or actor, When I apply the filter, Then the timeline updates') is unbuilt: the Activity Panel renders all events unfiltered with no filter controls. actor_kind is enumerable (the 5-value union at api-client.ts:460 — merchant_user / subscriber / system / webhook_bc / webhook_processor) so the actor filter must be a Select; event_type is open-ended (friendlyEventLabel falls back to unknown) so its filter should be a Select populated from the events actually present. AC1 (type + actor + timestamp + payload snippet) is fully built."
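A minimal sketch of the AC2 filter logic described in the gaps note above, not the shipped implementation. The five actor kinds come from the note. The EventRow shape, the TimelineFilter type, and the function names eventTypeOptions and filterEvents are hypothetical, and the real SubscriptionEventRow fields may differ.

```typescript
// Actor kinds as listed in the gaps note (the 5-value union).
type ActorKind =
  | "merchant_user"
  | "subscriber"
  | "system"
  | "webhook_bc"
  | "webhook_processor";

// Hypothetical subset of SubscriptionEventRow; real field names may differ.
interface EventRow {
  event_type: string;
  actor_kind: ActorKind;
  created_at: string;
}

// Hypothetical filter state; an unset field means "no filter on that axis".
interface TimelineFilter {
  eventType?: string;
  actorKind?: ActorKind;
}

// Options for the event-type Select: only the types present in the loaded events,
// because event_type is open-ended and cannot be enumerated up front.
function eventTypeOptions(events: EventRow[]): string[] {
  return Array.from(new Set(events.map((e) => e.event_type))).sort();
}

// Apply both filters; an unset filter matches every row.
function filterEvents(events: EventRow[], filter: TimelineFilter): EventRow[] {
  return events.filter(
    (e) =>
      (filter.eventType === undefined || e.event_type === filter.eventType) &&
      (filter.actorKind === undefined || e.actor_kind === filter.actorKind)
  );
}

// Made-up sample data for illustration only.
const sampleEvents: EventRow[] = [
  { event_type: "created", actor_kind: "subscriber", created_at: "2026-01-01T00:00:00Z" },
  { event_type: "charged", actor_kind: "system", created_at: "2026-02-01T00:00:00Z" },
  { event_type: "paused", actor_kind: "merchant_user", created_at: "2026-02-10T00:00:00Z" },
  { event_type: "charged", actor_kind: "system", created_at: "2026-03-01T00:00:00Z" },
];

const chargedOnly = filterEvents(sampleEvents, { eventType: "charged" });
const noMatch = filterEvents(sampleEvents, { actorKind: "webhook_bc" });
```

An empty result such as noMatch is the case where the north-star "no events match, clear filter" state would render.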
US-13.2: Daily reconciliation sweep
<!-- traceability:start:US-13.2 --><!-- traceability:end:US-13.2 -->Prototype: Drift Sweep
Phase: MVP · Priority: P0 · Effort: L · Persona: System
As the System, I want to detect state drift between our DB, BC, and the processor daily, so that corrections happen before subscribers notice.
Acceptance criteria:
- Given the reconciliation cron runs, When it executes, Then it samples:
  - Charges with bc_order_id set but no matching BC order
  - BC orders tagged as subscription-origin but no matching charge
  - Processor-settled transactions with no matching charge
  - Charges stuck in processing > 1 hour
- Given any drift is found, When the sweep completes, Then a report populates the exception queue with proposed remediation.
Data contract.
- Cron job daily at 02:00 UTC (lowest traffic window)
- Queries:
  - A: charges WHERE status='succeeded' AND bc_order_id IS NULL (settled but no order)
  - B: charges WHERE status='processing' AND attempted_at < now() - 1h (stuck)
  - C: BC orders with sub metadata where our charges.bc_order_id doesn't match
  - D: Processor transactions via adapter listing (last 7d) where no charge matches
- Each drift detection creates an exception_queue row with type, entity_refs, proposed_remediation
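A minimal in-memory sketch of queries A and B from the data contract above, showing how each hit could become an exception_queue row. Only the column names quoted in the contract come from the source. The Charge and ExceptionRow shapes, the type strings, the remediation strings, and the function name detectChargeDrift are assumptions for illustration. Queries C and D need BC and processor listings and are left out.

```typescript
// Hypothetical charge shape; only status, bc_order_id and attempted_at are named in the contract.
interface Charge {
  id: string;
  status: "pending" | "processing" | "succeeded" | "failed";
  bc_order_id: string | null;
  attempted_at: Date;
}

// Mirrors the exception_queue columns named in the contract.
interface ExceptionRow {
  type: string;
  entity_refs: string[];
  proposed_remediation: string;
}

const ONE_HOUR_MS = 60 * 60 * 1000;

function detectChargeDrift(charges: Charge[], now: Date): ExceptionRow[] {
  const rows: ExceptionRow[] = [];
  for (const c of charges) {
    // Query A: settled charge with no BC order recorded.
    if (c.status === "succeeded" && c.bc_order_id === null) {
      rows.push({
        type: "settled_no_order", // assumed label
        entity_refs: [`charge:${c.id}`],
        proposed_remediation: "retry_materialization", // assumed label
      });
    }
    // Query B: charge still processing more than one hour after the attempt.
    if (c.status === "processing" && now.getTime() - c.attempted_at.getTime() > ONE_HOUR_MS) {
      rows.push({
        type: "stuck_processing", // assumed label
        entity_refs: [`charge:${c.id}`],
        proposed_remediation: "check_processor_then_retry", // assumed label
      });
    }
  }
  return rows;
}

// Made-up sample run at the 02:00 UTC sweep time.
const sweepTime = new Date("2026-07-01T02:00:00Z");
const driftRows = detectChargeDrift(
  [
    { id: "c1", status: "succeeded", bc_order_id: null, attempted_at: new Date("2026-06-30T10:00:00Z") },
    { id: "c2", status: "succeeded", bc_order_id: "bc-100", attempted_at: new Date("2026-06-30T11:00:00Z") },
    { id: "c3", status: "processing", bc_order_id: null, attempted_at: new Date("2026-07-01T00:00:00Z") },
    { id: "c4", status: "processing", bc_order_id: null, attempted_at: new Date("2026-07-01T01:50:00Z") },
  ],
  sweepTime
);
```

The function only reads its inputs, which matches the non-functional requirement that the sweep itself performs no writes on BC.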
Success metrics.
- Functional (target): reconciliation detects ≥ 99% of simulated drift scenarios (test suite)
- Operational (target): sweep completes within 30 min for a store with 100K subs
- Product (target): % of exceptions auto-resolved (remediation applied without merchant action) ≥ 50%
Dependencies.
- Exception queue UI (US-21.3)
Non-functional.
- Read-only on BC (no writes during sweep; writes happen only on merchant-approved remediation)
Risks / open questions.
- Reconciliation with processor historical data can hit rate limits on stores with large volume. Paginate and throttle; acceptable to span multiple runs for initial backfill.
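One way to read "paginate and throttle; acceptable to span multiple runs" is a per-run page budget with a resumable cursor. The sketch below assumes that reading. The Page and RunResult shapes, the sweepWithBudget function, and the fakeListing source are all hypothetical, and a real adapter call would be asynchronous and would likely also wait between requests.

```typescript
// Hypothetical page shape returned by a processor adapter listing.
interface Page<T> {
  items: T[];
  nextCursor: string | null; // null means the listing is exhausted
}

interface RunResult<T> {
  items: T[];
  resumeCursor: string | null; // where the next run should start
  done: boolean; // true once the listing has been fully consumed
  pagesFetched: number;
}

// Fetch at most maxPages pages starting from startCursor, then stop and report where to resume.
function sweepWithBudget<T>(
  fetchPage: (cursor: string | null) => Page<T>,
  startCursor: string | null,
  maxPages: number
): RunResult<T> {
  const items: T[] = [];
  let cursor = startCursor;
  let pagesFetched = 0;
  let done = false;
  while (pagesFetched < maxPages) {
    const page = fetchPage(cursor);
    pagesFetched += 1;
    items.push(...page.items);
    cursor = page.nextCursor;
    if (cursor === null) {
      done = true;
      break;
    }
  }
  return { items, resumeCursor: cursor, done, pagesFetched };
}

// Fake listing for illustration: 5 pages of 2 transaction numbers each.
function fakeListing(cursor: string | null): Page<number> {
  const pageIndex = cursor === null ? 0 : Number(cursor);
  const items = [pageIndex * 2, pageIndex * 2 + 1];
  const nextCursor = pageIndex + 1 < 5 ? String(pageIndex + 1) : null;
  return { items, nextCursor };
}

// Two consecutive runs with a budget of 3 pages each cover the whole listing.
const firstRun = sweepWithBudget(fakeListing, null, 3);
const secondRun = sweepWithBudget(fakeListing, firstRun.resumeCursor, 3);
```

Persisting resumeCursor between runs is what would let the initial backfill span several nightly sweeps without re-reading pages.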
US-13.3: Replay a failed workflow
<!-- traceability:start:US-13.3 --><!-- traceability:end:US-13.3 -->Prototype: Replay Tool
Phase: MVP · Priority: P1 · Effort: M · Persona: Support / Ops / Developer
As Support / Ops, I want to re-execute a failed charge workflow with the same idempotency key, so that I can recover from transient bugs without duplicate charging.
Acceptance criteria:
- Given a charge is stuck in failed due to a known infra issue, When I click "Replay" in the admin, Then the workflow re-runs using the same idempotency key and does not double-charge.
UX notes.
- Surface: support admin tool at /stores/[storeHash]/support/charge/{id}/replay
- Pre-conditions shown: current state, last error, what replay will attempt
- Post-conditions: updated state + events
Data contract.
- Our API (impl ships dual endpoints): POST /api/v1/admin/subscriptions/:id/force-retry (charge retry on subscription) and POST /api/v1/admin/charges/:id/retry-materialization (BC order materialization replay for missing-order edge case)
- Workflow: force-retry re-queues the pending charge; retry-materialization re-runs the BC order creation only
- Adapter behavior: idempotency key recognized, processor returns cached result if it succeeded previously
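The adapter behavior above can be illustrated with a toy in-memory processor. This is a sketch under stated assumptions, not the real adapter: the FakeProcessorAdapter class, its charge method, the simulateInfraFailure flag, and the key format are all hypothetical. It shows only the property the contract relies on, namely that a replay with the same idempotency key returns the cached result instead of capturing funds again.

```typescript
interface ChargeResult {
  transactionId: string;
  status: "succeeded" | "failed";
}

// Hypothetical stand-in for a payment processor that honors idempotency keys.
class FakeProcessorAdapter {
  private readonly settled = new Map<string, ChargeResult>();
  public captures = 0; // how many times funds were actually captured
  public capturedCents = 0;

  charge(idempotencyKey: string, amountCents: number, simulateInfraFailure = false): ChargeResult {
    // A key that already succeeded returns the cached result; no second capture.
    const cached = this.settled.get(idempotencyKey);
    if (cached !== undefined) {
      return cached;
    }
    // A transient failure is not cached, so a later replay can still succeed.
    if (simulateInfraFailure) {
      return { transactionId: "", status: "failed" };
    }
    this.captures += 1;
    this.capturedCents += amountCents;
    const result: ChargeResult = { transactionId: `txn-${this.captures}`, status: "succeeded" };
    this.settled.set(idempotencyKey, result);
    return result;
  }
}

const adapter = new FakeProcessorAdapter();
const key = "charge-42-attempt"; // made-up key; the same key is reused for every replay

const firstAttempt = adapter.charge(key, 1999, true); // fails on a simulated infra issue
const replay = adapter.charge(key, 1999); // replay succeeds and captures once
const accidentalSecondReplay = adapter.charge(key, 1999); // served from cache
```

After the three calls, the customer has been charged exactly once, which is the "replay never double-charges" success metric in miniature.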
Success metrics.
- Functional: replay never double-charges
- Operational (target): support resolution time on stuck charges ≤ 5 min
Non-functional.
- Every replay is audited with Support user ID
UI states.
<!-- ui-states US-13.3 -->surface: "Admin (React/BigDesign) — replay/retry affordances on apps/admin/src/pages/subscriptions/SubscriptionAdminDetail.tsx: an 'Actions' Panel 'Force retry' Button (actions.forceRetry, line 561 -> handleForceRetry, line 305 -> POST /api/v1/admin/subscriptions/:id/force-retry) and a per-row 'Retry' Button on each failed charge in the history Panel (charge.retryBtn, line 705 -> handleChargeRetry, line 336 -> retryCharge, apps/admin/src/api-client/subscriptions.ts -> POST /api/v1/admin/subscriptions/:id/charges/:cid/retry). Both re-queue the charge; the processor recognises the idempotency key so a previously-succeeded charge is not double-charged. Persona: Support / Ops / Developer."
idle:
render: "The 'Actions' Panel shows a 'Force retry' Button enabled only when at least one charge has failed (failedExists, line 558). Each failed charge in the history Panel shows a red Error icon, a 'Failed' Badge, and an inline 'Retry' Button (line 694-707); succeeded charges show a success Badge and no retry control."
primary_action: "'Force retry' re-queues the subscription's pending charge; a per-charge 'Retry' re-queues that specific failed charge — both reusing the same idempotency key."
loading:
render: "Per-charge retry: the pressed Button shows BigDesign isLoading while retryingChargeId equals that charge id, and every Retry Button plus actions disable to block concurrent replays (retryingChargeId, line 701-702). Force-retry: the Button disables via actionPending while in flight (line 558) but shows no spinner label (see gaps)."
error:
surfaced_at: "A page-level Message(type='error') at the top of the detail (actionError, line 403-412), dismissible via onClose, carrying the raw 'HTTP <status>: <body>' thrown by postOverride / retryCharge."
recovery: "On failure actionPending / retryingChargeId reset in finally (line 347), re-enabling the controls so the rep re-presses 'Force retry' or 'Retry'; the shared idempotency key makes a re-press safe against double-charge."
empty:
render: "When no charge has failed (failedExists false, line 375) the 'Force retry' Button is disabled and the history Panel renders only succeeded rows with no Retry controls — there is nothing to replay, which is the correct at-rest state, not a blank pane."
edge_status:
- status: "no failed charges"
affordance: "'Force retry' is disabled (line 558) and no per-charge Retry Buttons render; the success-only history communicates there is nothing to replay."
- status: "subscription cancelled"
affordance: "Per-charge Retry Buttons disable (isCancelled, line 702) so a terminal subscription cannot be replayed; the rep must reactivate before retrying."
- status: "retry already in flight"
affordance: "All Retry Buttons and force-retry disable while retryingChargeId is set (line 702) so the rep cannot fire concurrent replays of the same charge."
- status: "charge succeeded but BC order never materialized (missing-order edge)"
affordance: "North-star: a per-charge 'Replay order materialization' control should call POST /api/v1/admin/charges/:id/retry-materialization (the endpoint already exists, apps/api/src/worker.ts) to re-run only BC order creation; today no UI button exposes that arm (see gaps)."
inputs: []
disabled_focus:
keyboard: "'Force retry' and each per-charge 'Retry' are real BigDesign <Button>s reachable in tab order and Enter/Space-activatable; native disabled removes them from tab order while in flight or when not applicable. The page-level error Message is dismissible via a real close control."
gaps: "The single AC ('click Replay -> workflow re-runs with the same idempotency key, no double-charge') is satisfied by force-retry + per-charge retry. Divergences from the data contract / UX note remain: (1) the data contract's second arm POST /api/v1/admin/charges/:id/retry-materialization exists in the API (worker.ts) but no UI affordance reaches it; (2) the BRD UX note targets a standalone route /stores/[storeHash]/support/charge/{id}/replay, but the feature is embedded inline in the subscription detail; (3) 'Force retry' shows no in-flight spinner label (only disables), unlike the per-charge Retry."
US-13.4: Structured logs with correlation IDs
<!-- traceability:start:US-13.4 --><!-- traceability:end:US-13.4 -->Prototype: Structured Logs
Phase: MVP · Persona: Developer
As a Developer debugging a production incident, I want every log line tied to a subscription + charge + request correlation ID, so that I can trace end-to-end flow.
Acceptance criteria:
- Given any logged event, When I query by charge_id, Then I see every log line across all services (scheduler, executor, adapter, BC API call, processor call, webhook) for that charge.
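A minimal sketch of how correlation IDs could be bound to a logger so that a query by charge_id returns lines from every service. The story names the three IDs (subscription, charge, request); the field names, the LogLine shape, the in-memory sink, and the functions createLogger and linesForCharge are assumptions for illustration.

```typescript
// Correlation fields attached to every log line (field names are assumed).
interface Correlation {
  subscription_id: string;
  charge_id: string;
  request_id: string;
}

interface LogLine extends Correlation {
  ts: string;
  service: string;
  level: "info" | "error";
  msg: string;
}

// In-memory sink standing in for a real log pipeline.
const sink: LogLine[] = [];

// Returns a log function with the service name and correlation IDs already bound,
// so call sites cannot forget to attach them.
function createLogger(service: string, correlation: Correlation) {
  return (level: "info" | "error", msg: string): void => {
    sink.push({ ts: new Date().toISOString(), service, level, msg, ...correlation });
  };
}

// The query from the acceptance criterion: every line for one charge, across services.
function linesForCharge(lines: LogLine[], chargeId: string): LogLine[] {
  return lines.filter((l) => l.charge_id === chargeId);
}

// Made-up IDs for illustration.
const ctx: Correlation = { subscription_id: "sub-1", charge_id: "chg-9", request_id: "req-abc" };
createLogger("scheduler", ctx)("info", "charge scheduled");
createLogger("executor", ctx)("info", "charge started");
createLogger("adapter", ctx)("error", "processor timeout");
createLogger("executor", { subscription_id: "sub-2", charge_id: "chg-10", request_id: "req-def" })(
  "info",
  "charge started"
);

const trace = linesForCharge(sink, "chg-9");
```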
US-13.5: Metrics dashboard for platform SRE
Phase: P2 · Persona: Developer
As a Developer / SRE, I want platform health metrics (charge success rate, p50/p95 latency, retry rates, webhook lag), so that I can operate the service at SLA.
Acceptance criteria:
- Given I open the internal metrics dashboard, When it renders, Then I see per-store and aggregate: charge outcomes, latency distributions, queue depth, dunning recovery rates.
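A minimal sketch of two of the metrics named in this story, charge success rate and p50/p95 latency. The nearest-rank percentile method is an assumption (the BRD does not specify one), and the function names and sample data are made up for illustration.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) {
    throw new Error("percentile of an empty sample is undefined");
  }
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1] as number;
}

// Share of charge attempts that succeeded, in the range 0..1.
function successRate(outcomes: Array<"succeeded" | "failed">): number {
  if (outcomes.length === 0) {
    return 0;
  }
  const ok = outcomes.filter((o) => o === "succeeded").length;
  return ok / outcomes.length;
}

// Made-up latency samples in milliseconds and charge outcomes.
const latenciesMs = [120, 80, 200, 95, 150, 110, 300, 90, 130, 105];
const outcomes: Array<"succeeded" | "failed"> = [
  "succeeded", "succeeded", "failed", "succeeded", "succeeded",
  "succeeded", "failed", "succeeded", "succeeded", "succeeded",
];

const p50 = percentile(latenciesMs, 50); // 110
const p95 = percentile(latenciesMs, 95); // 300
const rate = successRate(outcomes); // 0.8
```

Running the same functions over samples grouped by store and over the full set would give the per-store and aggregate views the acceptance criterion asks for.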