Dunning
Generated from a canonical source
This page is a read-only projection of docs/handoff-corpus/dunning.md.
Edit the canonical file, then run npm --prefix tools/project-knowledge-derive run derive.
What dunning is for
The invariant you must not break: never mutate charges.idempotency_key on
a retry. The retry counter lives in charges.retry_attempt, never in the
key. A per-attempt suffix breaks next-cycle scheduling (the scheduler
reparses the cycle anchor out of the key), gateway-side dedupe (every gateway
keys on it), and the SQL safety net — simultaneously and silently.
(ADR-0011 §1, §5.)
Dunning decides, after a recurring charge fails, whether to retry, how often, what to tell the customer, and what happens to the subscription when retries run out. The capability breaks into seven reader-facing features, each anchored to its build story for traceability:
- Merchant-tunable retry policy — configure intervals and exhaustion behavior (US-11.1)
- Automatic retry of failed charges — transient failures self-heal per policy (US-11.2)
- Decline triage — retry soft declines, stop on hard ones (US-11.3)
- Card-network auto-updates, before dunning — expired/replaced cards refreshed via Visa VAU / Mastercard ABU (US-11.4)
- Instant recovery when a customer fixes their card — payment-method update retries the charge immediately (US-11.5)
- Step-by-step subscriber notifications — a clear prompt to act at each dunning stage (US-11.6)
- Daily merchant dunning digest — at-risk subscribers surfaced for intervention (US-11.7)
The three decisions that carry the most weight, all recorded in ADR-0011:
- The idempotency key stays identical across every retry: no per-attempt suffix; the scheduler reparses the cycle anchor from it, gateways dedupe on it, and a UNIQUE partial index backstops it (ADR-0011 §1–§3).
- Only soft declines retry: hard declines are terminal until the subscriber acts; classification is data (decline-classifier.ts code sets), not per-adapter branching (ADR-0011 §4–§5).
- Two coexisting retry mechanisms by design: the tunable policy engine (primary) and a cron sweep (safety net, with its own 3-attempt ceiling and exceptions-table handoff); the different ceilings and side effects are intentional (ADR-0011 §5; both file headers).
Canonical-framing attestation (operator-ratified 2026-07-02).
services/dunning-retry.ts —
the per-store dunning_policies state machine — is the sole canonical retry
engine.
cron/charge-retry-sweep.ts
is its documented safety-net backstop, rescuing charges the primary path
failed to reschedule — not a competing or residual implementation. Both
mechanisms' file headers state this division of labor explicitly, and the
coexistence contract is ratified in ADR-0011 §5. Traced: the two paths differ
in shape (policy path advances N configurable stages and can transition the
subscription; the sweep hardcodes a 3-attempt ceiling and writes an
exceptions table entry, never a subscription transition) — confirmed in
code at dunning-retry.ts and charge-retry-sweep.ts:188.
How it actually works
The retry state machine lives in
services/dunning-retry.ts.
A per-store dunning_policies table replaced the original hardcoded
MAX_CHARGE_RETRY_ATTEMPTS cap. getOrSeedDunningPolicy seeds each store a
default 5-stage curve (12h / 12h / 24h / 48h / 72h) on first access; each
stage carries a delay, an email template key, an on_exhaustion action
(cancel | pause | ignore), and a grace_period_days
(dunning-retry.ts:32-38).
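As a rough sketch of the stage shape described above, the following builds a 5-stage curve from the documented delays and sums them. The field name `delay_hours` follows the usage later on this page; the type names, the constant name, and the email template keys are hypothetical and not taken from dunning-retry.ts.

```typescript
type OnExhaustion = "cancel" | "pause" | "ignore";

interface DunningStage {
  delay_hours: number;
  email_template_key: string;
  on_exhaustion: OnExhaustion;
  grace_period_days: number;
}

// Delays as documented on this page: 12h / 12h / 24h / 48h / 72h.
const SKETCH_DELAYS_HOURS = [12, 12, 24, 48, 72];

const SKETCH_DEFAULT_STAGES: DunningStage[] = SKETCH_DELAYS_HOURS.map(
  (delay, i): DunningStage => ({
    delay_hours: delay,
    email_template_key: `dunning_stage_${i + 1}`, // ASSUMPTION: illustrative key names
    on_exhaustion: "cancel",
    grace_period_days: 0,
  }),
);

// Each delay is measured from the previous failure, so the running sum gives
// the elapsed hours since the first failure if every retry fails on schedule.
function cumulativeOffsetsHours(stages: DunningStage[]): number[] {
  let total = 0;
  return stages.map((s) => (total += s.delay_hours));
}

const offsets = cumulativeOffsetsHours(SKETCH_DEFAULT_STAGES);
```

With these delays the offsets are 12, 24, 48, 96, and 168 hours, so the final scheduled retry lands about seven days after the first failure, assuming each attempt runs at its scheduled time.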
scheduler.ts::processCharge calls classifyDecline
(services/decline-classifier.ts)
on every adapter failure. Classification is a static code-set lookup, not
per-adapter logic: HARD_DECLINE_CODES (e.g. stolen_card,
fraudulent, refer_to_card_issuer) are terminal; SOFT_DECLINE_CODES
(e.g. insufficient_funds, expired_card, network_timeout) retry; an
unrecognized code defaults to soft, "never silently drop a customer"
(decline-classifier.ts:78). A hard decline moves the subscription straight
to past_due and stops (scheduler.ts:1081-1098); a soft or unknown decline
calls scheduleChargeRetry then applyDunningPolicy
(scheduler.ts:1100-1107).
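The lookup described above can be sketched as follows. The two sets contain only the example codes this page names, not the full contents of decline-classifier.ts, and the function name and return shape are hypothetical.

```typescript
type DeclineKind = "hard" | "soft";

interface DeclineClassification {
  kind: DeclineKind;
  recognized: boolean; // false when the code is in neither set
}

// Only the example codes listed on this page; the real sets are larger.
const HARD_CODES_SKETCH: ReadonlySet<string> = new Set([
  "stolen_card",
  "fraudulent",
  "refer_to_card_issuer",
]);
const SOFT_CODES_SKETCH: ReadonlySet<string> = new Set([
  "insufficient_funds",
  "expired_card",
  "network_timeout",
]);

function classifyDeclineSketch(code: string | null | undefined): DeclineClassification {
  if (code != null && HARD_CODES_SKETCH.has(code)) {
    return { kind: "hard", recognized: true };
  }
  if (code != null && SOFT_CODES_SKETCH.has(code)) {
    return { kind: "soft", recognized: true };
  }
  // Unknown or missing codes default to soft so the charge is still retried.
  return { kind: "soft", recognized: false };
}
```

Keeping the classification in data means a new gateway code is handled by adding a set entry rather than a branch in an adapter, which matches the "classification is data" decision stated earlier.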
Inside applyDunningPolicy, charge.retry_attempt indexes directly into the
stage array (it was already bumped by markChargeAttempted before this
call):
- Stage exists → reschedule the charge to status='pending' at now + stage.delay_hours, set next_retry_at for the UI, log charge.retry_scheduled, and enqueue email.requested with the stage's template key.
- Stages exhausted → mark the charge failed_permanently, log charge.failed_permanently, enqueue the dunning_final email, then apply the last stage's on_exhaustion policy: cancel or pause flips subscription status (or, if grace_period_days > 0, only extends current_period_end and leaves status untouched); ignore leaves the subscription active for manual merchant handling.
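The two branches above can be sketched as a pure decision function. This is not the code in dunning-retry.ts: the type names, the `decideDunningStep` function, and the `subscriptionEffect` labels are hypothetical, and the real function also writes to the database and the queue.

```typescript
type ExhaustionAction = "cancel" | "pause" | "ignore";

interface StageSketch {
  delay_hours: number;
  email_template_key: string;
  on_exhaustion: ExhaustionAction;
  grace_period_days: number;
}

type SubscriptionEffect = "cancel" | "pause" | "extend_period" | "none";

type PolicyDecision =
  | { kind: "retry"; scheduledAt: Date; emailTemplateKey: string }
  | { kind: "exhausted"; emailTemplateKey: "dunning_final"; subscriptionEffect: SubscriptionEffect };

const HOUR_MS = 60 * 60 * 1000;

function decideDunningStep(stages: StageSketch[], retryAttempt: number, now: Date): PolicyDecision {
  // retry_attempt indexes directly into the stage array.
  const stage = stages[retryAttempt];
  if (stage !== undefined) {
    return {
      kind: "retry",
      scheduledAt: new Date(now.getTime() + stage.delay_hours * HOUR_MS),
      emailTemplateKey: stage.email_template_key,
    };
  }
  // Stages exhausted: the last stage's on_exhaustion decides the outcome.
  const last = stages[stages.length - 1];
  if (last === undefined) {
    return { kind: "exhausted", emailTemplateKey: "dunning_final", subscriptionEffect: "none" };
  }
  const action = last.on_exhaustion;
  if (action === "ignore") {
    return { kind: "exhausted", emailTemplateKey: "dunning_final", subscriptionEffect: "none" };
  }
  // A positive grace period only extends current_period_end; status is untouched.
  const effect: SubscriptionEffect = last.grace_period_days > 0 ? "extend_period" : action;
  return { kind: "exhausted", emailTemplateKey: "dunning_final", subscriptionEffect: effect };
}
```

Returning a decision value rather than performing the writes keeps the branching testable without a database; the actual handler performs the side effects inline.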
Read the failure path as: a charge fails → classify the decline → a hard
decline moves the subscription to past_due and stops; a soft/unknown
decline reschedules per the policy stage, or ratchets to
failed_permanently + applies on_exhaustion when stages exhaust.
sequenceDiagram
autonumber
participant Proc as scheduler.ts::processCharge
participant Adapter as adapters/bc-payments.ts::charge()
participant DB as D1
participant Dunning as services/dunning-retry.ts::applyDunningPolicy
participant Queue as EVENTS_QUEUE
Proc->>Adapter: adapter.charge(ctx)
Adapter-->>Proc: throws (HTTP 4xx/5xx from BigPay, or BcPaymentsRateLimitError on 429)
alt HTTP 429 (BcPaymentsRateLimitError)
Proc->>DB: UPDATE charges SET status='pending', scheduled_at=now+2h+jitter (scheduler.ts:1007-1023)
Proc->>DB: logEvent('charge.rate_limit_reschedule')
else terminal (401/402/403 → TerminalAdapterError, isTerminal())
Proc->>DB: markChargeAttempted(status:'failed', failure_code:'terminal_error')
Proc->>DB: logEvent('charge.failed', {terminal:true})
else soft/hard decline (classifyDecline)
Proc->>DB: markChargeAttempted(status:'failed', failure_code:'adapter_threw')
Proc->>DB: logEvent('charge.failed', {decline_classification})
alt hard decline
Proc->>DB: updateSubscriptionStatus(subscription.id, 'past_due')
Proc->>DB: logEvent('subscription.past_due')
else soft/unknown decline
Proc->>DB: scheduleChargeRetry(charge.id, retry_attempt+1, now)
Proc->>Dunning: applyDunningPolicy(repo, freshCharge, subscription, failure_code, now, EVENTS_QUEUE)
Dunning->>DB: getOrSeedDunningPolicy(store_hash) → stages[retry_attempt]
alt stage exists
Dunning->>DB: UPDATE charges SET status='pending', scheduled_at=next_retry_at
Dunning->>DB: logEvent('charge.retry_scheduled')
Dunning->>Queue: EVENTS_QUEUE.send({type:'email.requested', template_key: stage.email_template_key})
else stages exhausted
Dunning->>DB: markChargeFailedPermanently(charge.id)
Dunning->>DB: logEvent('charge.failed_permanently')
Dunning->>Queue: EVENTS_QUEUE.send({template_key:'dunning_final'})
Dunning->>DB: apply on_exhaustion policy (default 'cancel')
end
end
end
Diagram provenance. Transcluded verbatim from § 1 "Renewal — end-to-end charge sequence" → "Failure branch" of the canonical, code-sourced docs/architecture/sequence-diagrams.md (derives_from pins scheduler.ts, dunning-retry.ts, epic-11-dunning.scenario.ts, among others). Its frontmatter carries sign_off: pending, meaning it is accurate to the code but not yet human-attested, so read it as the current mechanism, not a ratified contract. In the handoff pipeline this is a build-time include of that one source, never a hand-copied fork.
The subscriber-initiated reset path (US-11.5) is a second, smaller sequence in the same source document, § 3 "Dunning — retry scheduling + subscriber-initiated reset":
sequenceDiagram
autonumber
participant Sub as Subscriber (portal)
participant PMRoute as routes/portal/payment-method.ts::handlePortalUpdatePaymentMethod
participant Reset as resetDunningOnPmUpdate
participant DB as D1
Sub->>PMRoute: PUT /api/v1/portal/subscriptions/:id/payment-method
PMRoute->>DB: verify portal-session JWT + subscription ownership
PMRoute->>DB: UPDATE subscriptions SET payment_method_id = new PM
PMRoute->>Reset: resetDunningOnPmUpdate(repo, subscription) — only when subscription.status='past_due'
Reset->>DB: UPDATE charges SET retry_attempt=0, status='pending', scheduled_at=now, next_retry_at=NULL WHERE subscription_id AND status IN dunning-in-progress states
Reset->>DB: UPDATE subscriptions SET status='active'
Reset->>DB: logEvent('subscription.dunning_reset')
Note over Reset,DB: The re-armed charge is picked up by the NEXT cron tick's findDueCharges scan — no synchronous charge call happens in this handler.
Diagram provenance. Same source and sign_off: pending state as above; § 3 of sequence-diagrams.md. Traced against routes/portal/payment-method.ts:196-244: resetDunningOnPmUpdate guards on sub.status !== 'past_due' and only resets charges with status IN ('failed','pending','processing') AND retry_attempt > 0.
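The guard and the reset filter traced above can be sketched over in-memory rows. The type names, the `resetDunningSketch` function, and the status values other than those quoted on this page ('succeeded', 'cancelled') are assumptions; the real handler issues SQL updates against D1.

```typescript
type ChargeStatusSketch = "pending" | "processing" | "failed" | "failed_permanently" | "succeeded";
type SubStatusSketch = "active" | "past_due" | "paused" | "cancelled";

interface ChargeSketch {
  id: string;
  status: ChargeStatusSketch;
  retry_attempt: number;
  scheduled_at: string;
  next_retry_at: string | null;
}

interface SubscriptionSketch {
  id: string;
  status: SubStatusSketch;
}

// Statuses the reset treats as "dunning in progress", per the traced filter.
const DUNNING_IN_PROGRESS: ReadonlyArray<ChargeStatusSketch> = ["failed", "pending", "processing"];

function resetDunningSketch(
  sub: SubscriptionSketch,
  charges: ChargeSketch[],
  nowIso: string,
): { sub: SubscriptionSketch; charges: ChargeSketch[]; didReset: boolean } {
  // Guard: only a past_due subscription is reset.
  if (sub.status !== "past_due") return { sub, charges, didReset: false };

  const rearmed = charges.map((c): ChargeSketch =>
    DUNNING_IN_PROGRESS.indexOf(c.status) >= 0 && c.retry_attempt > 0
      ? { ...c, retry_attempt: 0, status: "pending", scheduled_at: nowIso, next_retry_at: null }
      : c,
  );
  // No charge call happens here; the re-armed row waits for the next cron tick.
  return { sub: { ...sub, status: "active" }, charges: rearmed, didReset: true };
}
```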
Supporting pieces:
- What actually fires retries: the cron sweep, cron/charge-retry-sweep.ts. The policy table alone schedules a next_retry_at; a tick has to run for anything to happen.
- Where merchants set policy: routes/admin/dunning/retry-rules.ts (US-11.1), which exposes GET / PUT (full-replace) / POST .../reset over dunning_policies.
- The merchant digest: services/dunning-digest.ts (US-11.7), a pure read that defines "in dunning" as subscriptions.status='past_due' with a surviving charges row at status='failed' AND retry_attempt > 0, and anchors days_in_dunning on the earliest charge.failed / charge.retry_scheduled event (not charges.attempted_at, which is overwritten on every retry).
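The digest's anchoring rule can be sketched as below. The event row shape, the `charge_id` join, and the use of whole-day flooring are assumptions for illustration; the page only states that days_in_dunning is anchored on the earliest charge.failed / charge.retry_scheduled event.

```typescript
// Hypothetical audit-event shape.
interface AuditEventSketch {
  charge_id: string;
  type: string;
  created_at: string; // ISO-8601 timestamp
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Returns whole days since the earliest qualifying event, or null if none exists.
function daysInDunningSketch(events: AuditEventSketch[], chargeId: string, now: Date): number | null {
  const anchors = events
    .filter(
      (e) =>
        e.charge_id === chargeId &&
        (e.type === "charge.failed" || e.type === "charge.retry_scheduled"),
    )
    .map((e) => Date.parse(e.created_at));
  if (anchors.length === 0) return null;
  const earliest = Math.min(...anchors);
  return Math.floor((now.getTime() - earliest) / DAY_MS);
}
```

Anchoring on the event log rather than on charges.attempted_at matters because the latter is overwritten on every retry and would make a long-running dunning case look new.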
Security note. Retry internals (failure_code, retry_attempt, stage_index) are written only to internal audit payloads and are never forwarded as subscriber-email variables. An allowlist in the email template engine enforces this (GH #1329, cited in dunning-retry.ts:23-31).
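An allowlist of this kind might look like the sketch below. The allowed variable names and the `toTemplateVars` helper are hypothetical; the page does not list the real allowlist or describe the template engine's API.

```typescript
// ASSUMPTION: illustrative allowlist entries, not the real ones.
const ALLOWED_TEMPLATE_VARS: ReadonlySet<string> = new Set([
  "customer_first_name",
  "update_payment_url",
  "next_retry_date",
]);

// Copies only allowlisted keys; anything else (including retry internals
// such as failure_code, retry_attempt, stage_index) is dropped.
function toTemplateVars(payload: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const key of Object.keys(payload)) {
    if (ALLOWED_TEMPLATE_VARS.has(key)) out[key] = payload[key];
  }
  return out;
}

const templateVars = toTemplateVars({
  customer_first_name: "Ada",
  failure_code: "insufficient_funds",
  retry_attempt: 2,
  stage_index: 1,
});
```

An allowlist fails closed: a newly added internal field stays out of subscriber emails unless someone explicitly permits it.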
Where intent and reality diverge
The derived coverage matrix
(_coverage-matrix.json) reports
all 7 of Epic-11's stories at g4_status: pass (US-11.1 through
US-11.7) — each has a behavioral scenario exercising the real handler. That
is true, and it is not the whole truth. Five typed deltas:
- Verified-but-incomplete (US-11.5): the server resets retry_attempt and reschedules on payment-method update (payment-method.ts:224, G4-verified), but the UI gives the subscriber no signal that a retry was enqueued (BRD's own US-11.5 gaps: note); also, the reset fires only from past_due (hard-decline state), not during in-flight soft-decline dunning (payment-method.ts:205).
- Named-deferred: US-11.6's ">24h email gate" is not built (emails fire on every stage; traced in applyDunningPolicy, which sends email.requested unconditionally whenever a stage has a template key; the epic-11 scenario's own spec-reconciliation note says so), and the P2 pre-retry SMS has zero implementation; EU SCA (requires_action wake) is deferred pending PI-5062 (ADR-0011 §6).
- Built-but-untrodden: the pause / ignore / grace-period exhaustion branches are real code paths, but every default stage ships on_exhaustion:'cancel' + grace_period_days:0 (DEFAULT_DUNNING_STAGES, dunning-retry.ts:32-38). The grace-period branch is also a dead end: when grace_period_days > 0, applyDunningPolicy only extends current_period_end and leaves status unchanged (dunning-retry.ts:200-204), and the one place that consumes current_period_end to auto-resume a subscription, listResumableSubscriptions (db.ts:3822-3835), filters on status='paused', a status the grace branch never sets. Nothing applies the terminal status after the grace window elapses.
- Contract-verified, not live-verified: G4 scenarios run against real D1 (applySchema, real CHECK constraints) but mock the email queue and processor round-trip; gateway-side dedupe is asserted only by key equality, and live email delivery is G5-only (scenario headers).
- Built-but-untrodden: the sweep mints a per-attempt idempotency key for null-key charges (charge-retry-sweep.ts:100, fallback `${subscription.id}:${nowIso}` inside the ctx.idempotencyKey construction), technically violating the stability invariant above for keyless charges; it is unconfirmed whether null-key charges can reach the sweep in practice.
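The grace-period dead end listed above can be reproduced in a small in-memory sketch. Both functions are hypothetical stand-ins: the page confirms only that the grace branch extends current_period_end without changing status, and that listResumableSubscriptions filters on status='paused'. The period-end comparison and the starting status 'active' are assumptions.

```typescript
type SubStatus = "active" | "past_due" | "paused" | "cancelled";

interface SubSketch {
  id: string;
  status: SubStatus;
  current_period_end: string; // ISO-8601 timestamp
}

const GRACE_DAY_MS = 24 * 60 * 60 * 1000;

// Stand-in for the grace branch: extend the period, leave status untouched.
function applyGraceSketch(sub: SubSketch, graceDays: number): SubSketch {
  const extended = new Date(Date.parse(sub.current_period_end) + graceDays * GRACE_DAY_MS);
  return { ...sub, current_period_end: extended.toISOString() };
}

// Stand-in for the resumable query: only 'paused' rows whose period has ended.
function listResumableSketch(subs: SubSketch[], now: Date): SubSketch[] {
  return subs.filter(
    (s) => s.status === "paused" && Date.parse(s.current_period_end) <= now.getTime(),
  );
}

const inGrace = applyGraceSketch(
  { id: "sub_1", status: "active", current_period_end: "2026-07-10T00:00:00.000Z" },
  3,
);
// Long after the grace window, the subscription is still not picked up.
const resumable = listResumableSketch([inGrace], new Date("2026-08-01T00:00:00.000Z"));
```

Here `resumable` is empty because the status filter never matches a row the grace branch produced, which is the gap the delta describes.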
How to operate & extend
- Change the default retry curve: DEFAULT_DUNNING_STAGES in services/dunning-retry.ts (12h / 12h / 24h / 48h / 72h, 5 stages). Per-store overrides live in the dunning_policies table; merchants edit them via PUT /api/admin/dunning/policy (full replace) or reset to defaults via POST /api/admin/dunning/policy/reset (routes/admin/dunning/retry-rules.ts).
- Retries not firing? Start at cron/charge-retry-sweep.ts: it is the safety net, not the primary trigger; the primary path is applyDunningPolicy writing scheduled_at, picked up by the next runScheduledTick.
- The invariant you must not break: the stable idempotency key (decision 1 above). A per-attempt suffix breaks next-cycle scheduling and gateway dedupe at once, silently.
- Extension seams: on_exhaustion actions (new terminal behaviors; add a case in applyDunningPolicy's exhaustion branch), pre-charge hooks (charges/pre-charge-hook-registry.ts), and per-stage email template keys (subscriber-facing content is merchant-controlled in Resend; internal fields are allowlist-blocked from ever reaching a template, GH #1329).
Confidence notes
- The ADR-0011 §5 backoff curve does not match either shipped curve. ADR-0011 §5 names the curve as "exponential, cumulative anchors at 1m / 5m / 15m / 1h / 6h / 24h." What actually shipped is two different curves, neither matching that text: dunning-retry.ts's DEFAULT_DUNNING_STAGES (12h / 12h / 24h / 48h / 72h, 5 stages; the code comment calls this "per ADR-0011 recommendation," which it isn't verbatim) and db.ts's RETRY_BACKOFF_MS (1h / 4h / 24h, 3 attempts, consumed by charge-retry-sweep.ts and matching that file's own header comment). Input-B's "two coexisting mechanisms by design" framing still holds (I traced both curves directly in code), but the specific numbers in ADR-0011 §5 are historical intent, superseded by what was actually built, and the decision record doesn't record that divergence. Filing that gap is out of scope for this page; it is noted here so a recipient doesn't cite ADR-0011 §5 for the literal curve values.
- listResumableSubscriptions as the "no grace consumer" evidence. I traced this myself (db.ts:3822-3835) rather than relying on Input-B's "repo-wide grep" citation alone: it's the one query in the codebase that reads current_period_end to resume a subscription, and its status='paused' filter cannot match a subscription the grace branch left untouched. I could not rule out some other code path acting on the extended current_period_end outside a grep-visible SQL match; the delta as typed ("built-but-untrodden," not "dead code") already accounts for that residual uncertainty.