Dunning

Generated from a canonical source

This page is a read-only projection of docs/handoff-corpus/dunning.md. Edit the canonical file, then run npm --prefix tools/project-knowledge-derive run derive.

What dunning is for

The invariant you must not break: never mutate charges.idempotency_key on a retry. The retry counter lives in charges.retry_attempt, never in the key. A per-attempt suffix breaks next-cycle scheduling (the scheduler reparses the cycle anchor out of the key), gateway-side dedupe (every gateway keys on it), and the SQL safety net — simultaneously and silently. (ADR-0011 §1, §5.)

Dunning decides, after a recurring charge fails, whether to retry, how often, what to tell the customer, and what happens to the subscription when retries run out. The capability breaks into seven reader-facing features, each anchored to its build story for traceability:

Merchant-tunable retry policy — configure intervals and exhaustion behavior (US-11.1)
Automatic retry of failed charges — transient failures self-heal per policy (US-11.2)
Decline triage — retry soft declines, stop on hard ones (US-11.3)
Card-network auto-updates, before dunning — expired/replaced cards refreshed via Visa VAU / Mastercard ABU (US-11.4)
Instant recovery when a customer fixes their card — payment-method update retries the charge immediately (US-11.5)
Step-by-step subscriber notifications — a clear prompt to act at each dunning stage (US-11.6)
Daily merchant dunning digest — at-risk subscribers surfaced for intervention (US-11.7)

The three decisions that carry the most weight, all recorded in ADR-0011:

The idempotency key stays identical across every retry — no per-attempt suffix; the scheduler reparses the cycle anchor from it, gateways dedupe on it, and a UNIQUE partial index backstops it (ADR-0011 §1–§3).
Only soft declines retry — hard declines are terminal until the subscriber acts; classification is data (decline-classifier.ts code sets), not per-adapter branching (ADR-0011 §4–§5).
Two coexisting retry mechanisms by design — the tunable policy engine (primary) and a cron sweep (safety net, its own 3-attempt ceiling and exceptions-table handoff); different ceilings and side effects are intentional (ADR-0011 §5; both file headers).
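As a minimal sketch of the first decision, the snippet below contrasts a retry that only bumps retry_attempt with one that appends a per-attempt suffix. The key layout ("<subscriptionId>:<cycleAnchorIso>") and every function name here are assumptions made for illustration; this page does not show the real key format or the scheduler's parser.

```typescript
// Hypothetical key layout: "<subscriptionId>:<cycleAnchorIso>".
// The real format is not shown on this page; only the stability rule is.
interface ChargeRow {
  id: string;
  idempotency_key: string;
  retry_attempt: number;
}

const ISO_ANCHOR = /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d{3})?Z$/;

function buildIdempotencyKey(subscriptionId: string, cycleAnchorIso: string): string {
  return `${subscriptionId}:${cycleAnchorIso}`;
}

// Recover the cycle anchor the way a scheduler might:
// everything after the first ":" must be a clean ISO timestamp.
function parseCycleAnchor(key: string): string {
  const sep = key.indexOf(":");
  const anchor = sep < 0 ? "" : key.slice(sep + 1);
  if (!ISO_ANCHOR.test(anchor)) {
    throw new Error(`cannot parse cycle anchor from key: ${key}`);
  }
  return anchor;
}

// Correct retry: bump the counter column, leave the key alone.
function recordRetry(charge: ChargeRow): ChargeRow {
  return { ...charge, retry_attempt: charge.retry_attempt + 1 };
}

// Anti-pattern: a per-attempt suffix changes the key, so the anchor no longer
// parses and a gateway would see a brand-new request instead of a duplicate.
function recordRetryWithSuffix(charge: ChargeRow): ChargeRow {
  const attempt = charge.retry_attempt + 1;
  return {
    ...charge,
    retry_attempt: attempt,
    idempotency_key: `${charge.idempotency_key}:retry-${attempt}`,
  };
}
```

Under these assumptions, parseCycleAnchor succeeds on the key returned by recordRetry and throws on the key returned by recordRetryWithSuffix, which is the scheduling half of the failure described above.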

Canonical-framing attestation (operator-ratified 2026-07-02). services/dunning-retry.ts — the per-store dunning_policies state machine — is the sole canonical retry engine. cron/charge-retry-sweep.ts is its documented safety-net backstop, rescuing charges the primary path failed to reschedule — not a competing or residual implementation. Both mechanisms' file headers state this division of labor explicitly, and the coexistence contract is ratified in ADR-0011 §5. Traced: the two paths differ in shape (policy path advances N configurable stages and can transition the subscription; the sweep hardcodes a 3-attempt ceiling and writes an exceptions table entry, never a subscription transition) — confirmed in code at dunning-retry.ts and charge-retry-sweep.ts:188.

How it actually works

The retry state machine lives in services/dunning-retry.ts. A per-store dunning_policies table replaced the original hardcoded MAX_CHARGE_RETRY_ATTEMPTS cap. On first access, getOrSeedDunningPolicy seeds each store with a default 5-stage curve (12h / 12h / 24h / 48h / 72h); each stage carries a delay, an email template key, an on_exhaustion action (cancel | pause | ignore), and a grace_period_days (dunning-retry.ts:32-38).
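To make the stage shape concrete, here is a rough sketch of what a seeded default policy could look like and how the per-stage delays add up. The field names delay_hours, email_template_key, on_exhaustion and grace_period_days follow this page; the template key values and the helper cumulativeHours are placeholders invented for the example, and the real DEFAULT_DUNNING_STAGES may be laid out differently.

```typescript
type OnExhaustion = "cancel" | "pause" | "ignore";

interface DunningStage {
  delay_hours: number;
  email_template_key: string;
  on_exhaustion: OnExhaustion;
  grace_period_days: number;
}

// Delays of the default 5-stage curve as described on this page.
const DEFAULT_DELAYS_HOURS: readonly number[] = [12, 12, 24, 48, 72];

// Every default stage ships on_exhaustion 'cancel' and grace_period_days 0.
// The template key names are placeholders, not the real keys.
const defaultStagesSketch: DunningStage[] = DEFAULT_DELAYS_HOURS.map(
  (delay_hours, i): DunningStage => ({
    delay_hours,
    email_template_key: `dunning_stage_${i + 1}`,
    on_exhaustion: "cancel",
    grace_period_days: 0,
  }),
);

// Hours after the first failure at which each retry would fire,
// assuming every retry runs exactly on schedule.
function cumulativeHours(stages: readonly DunningStage[]): number[] {
  let total = 0;
  return stages.map((stage) => {
    total += stage.delay_hours;
    return total;
  });
}

// cumulativeHours(defaultStagesSketch) yields [12, 24, 48, 96, 168]
```

Read this way, the default curve spans 168 hours (7 days) between the first failure and the last scheduled retry, ignoring tick latency.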

scheduler.ts::processCharge calls classifyDecline (services/decline-classifier.ts) on every adapter failure. Classification is a static code-set lookup, not per-adapter logic: HARD_DECLINE_CODES (e.g. stolen_card, fraudulent, refer_to_card_issuer) are terminal; SOFT_DECLINE_CODES (e.g. insufficient_funds, expired_card, network_timeout) retry; an unrecognized code defaults to soft, "never silently drop a customer" (decline-classifier.ts:78). A hard decline moves the subscription straight to past_due and stops (scheduler.ts:1081-1098); a soft or unknown decline calls scheduleChargeRetry then applyDunningPolicy (scheduler.ts:1100-1107).
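The "classification is data" point can be sketched as a pair of sets and a lookup. The sets below contain only the example codes named on this page (the real ones in decline-classifier.ts are presumably larger), and classifyDeclineSketch with its recognized flag is a hypothetical stand-in, not the real classifyDecline signature.

```typescript
type DeclineClass = "hard" | "soft";

interface DeclineVerdict {
  classification: DeclineClass;
  recognized: boolean; // false when the code is in neither set
}

// Example codes taken from this page only.
const HARD_DECLINE_CODES: ReadonlySet<string> = new Set([
  "stolen_card",
  "fraudulent",
  "refer_to_card_issuer",
]);

const SOFT_DECLINE_CODES: ReadonlySet<string> = new Set([
  "insufficient_funds",
  "expired_card",
  "network_timeout",
]);

// Static lookup with no per-adapter branching. Anything that is not a known
// hard decline is treated as soft, so an unrecognized code still retries.
function classifyDeclineSketch(code: string | null): DeclineVerdict {
  if (code !== null && HARD_DECLINE_CODES.has(code)) {
    return { classification: "hard", recognized: true };
  }
  const recognized = code !== null && SOFT_DECLINE_CODES.has(code);
  return { classification: "soft", recognized };
}
```

Because the default falls through to soft, adding a new gateway code is a data change to one of the sets rather than a new branch in an adapter.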

Inside applyDunningPolicy, charge.retry_attempt indexes directly into the stage array (it was already bumped by markChargeAttempted before this call):

Stage exists → reschedule the charge to status='pending' at now + stage.delay_hours, set next_retry_at for the UI, log charge.retry_scheduled, and enqueue email.requested with the stage's template key.
Stages exhausted → mark the charge failed_permanently, log charge.failed_permanently, enqueue the dunning_final email, then apply the last stage's on_exhaustion policy: cancel or pause flips subscription status (or, if grace_period_days > 0, only extends current_period_end and leaves status untouched); ignore leaves the subscription active for manual merchant handling.
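The two branches above can be sketched as a pure planning function. This is an illustration under assumptions, not the real applyDunningPolicy: it takes the "retry_attempt indexes directly into the stage array" statement literally, models the database writes and queue sends as a returned plan, and uses invented names (planDunningStep, DunningPlan, SubscriptionEffect).

```typescript
type OnExhaustion = "cancel" | "pause" | "ignore";

interface DunningStage {
  delay_hours: number;
  email_template_key: string;
  on_exhaustion: OnExhaustion;
  grace_period_days: number;
}

type SubscriptionEffect = "cancel" | "pause" | "extend_period_only" | "leave_active";

type DunningPlan =
  | { kind: "retry"; scheduledAt: Date; emailTemplateKey: string }
  | { kind: "exhausted"; emailTemplateKey: "dunning_final"; subscriptionEffect: SubscriptionEffect };

const HOUR_MS = 60 * 60 * 1000;

function planDunningStep(
  stages: readonly DunningStage[],
  retryAttempt: number,
  now: Date,
): DunningPlan {
  const stage = stages[retryAttempt];
  if (stage !== undefined) {
    // Stage exists: reschedule at now + delay and request the stage's email.
    return {
      kind: "retry",
      scheduledAt: new Date(now.getTime() + stage.delay_hours * HOUR_MS),
      emailTemplateKey: stage.email_template_key,
    };
  }
  // Stages exhausted: the LAST stage's on_exhaustion decides the outcome.
  const last = stages[stages.length - 1];
  let effect: SubscriptionEffect = "leave_active"; // 'ignore', or no stages at all (assumption)
  if (last !== undefined && last.on_exhaustion !== "ignore") {
    // With a grace period, only current_period_end moves; status is untouched.
    effect = last.grace_period_days > 0 ? "extend_period_only" : last.on_exhaustion;
  }
  return { kind: "exhausted", emailTemplateKey: "dunning_final", subscriptionEffect: effect };
}
```

Keeping the decision separate from its side effects is a choice made for this sketch; it makes the stage-exists and stages-exhausted outcomes easy to compare without a database.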

Read the failure path as: a charge fails → classify the decline → a hard decline moves the subscription to past_due and stops; a soft/unknown decline reschedules per the policy stage, or ratchets to failed_permanently + applies on_exhaustion when stages exhaust.

sequenceDiagram
    autonumber
    participant Proc as scheduler.ts::processCharge
    participant Adapter as adapters/bc-payments.ts::charge()
    participant DB as D1
    participant Dunning as services/dunning-retry.ts::applyDunningPolicy
    participant Queue as EVENTS_QUEUE

    Proc->>Adapter: adapter.charge(ctx)
    Adapter-->>Proc: throws (HTTP 4xx/5xx from BigPay, or BcPaymentsRateLimitError on 429)
    alt HTTP 429 (BcPaymentsRateLimitError)
        Proc->>DB: UPDATE charges SET status='pending', scheduled_at=now+2h+jitter (scheduler.ts:1007-1023)
        Proc->>DB: logEvent('charge.rate_limit_reschedule')
    else terminal (401/402/403 → TerminalAdapterError, isTerminal())
        Proc->>DB: markChargeAttempted(status:'failed', failure_code:'terminal_error')
        Proc->>DB: logEvent('charge.failed', {terminal:true})
    else soft/hard decline (classifyDecline)
        Proc->>DB: markChargeAttempted(status:'failed', failure_code:'adapter_threw')
        Proc->>DB: logEvent('charge.failed', {decline_classification})
        alt hard decline
            Proc->>DB: updateSubscriptionStatus(subscription.id, 'past_due')
            Proc->>DB: logEvent('subscription.past_due')
        else soft/unknown decline
            Proc->>DB: scheduleChargeRetry(charge.id, retry_attempt+1, now)
            Proc->>Dunning: applyDunningPolicy(repo, freshCharge, subscription, failure_code, now, EVENTS_QUEUE)
            Dunning->>DB: getOrSeedDunningPolicy(store_hash) → stages[retry_attempt]
            alt stage exists
                Dunning->>DB: UPDATE charges SET status='pending', scheduled_at=next_retry_at
                Dunning->>DB: logEvent('charge.retry_scheduled')
                Dunning->>Queue: EVENTS_QUEUE.send({type:'email.requested', template_key: stage.email_template_key})
            else stages exhausted
                Dunning->>DB: markChargeFailedPermanently(charge.id)
                Dunning->>DB: logEvent('charge.failed_permanently')
                Dunning->>Queue: EVENTS_QUEUE.send({template_key:'dunning_final'})
                Dunning->>DB: apply on_exhaustion policy (default 'cancel')
            end
        end
    end

Diagram provenance. Transcluded verbatim from § 1 "Renewal — end-to-end charge sequence" → "Failure branch" of the canonical, code-sourced docs/architecture/sequence-diagrams.md (derives_from pins scheduler.ts, dunning-retry.ts, epic-11-dunning.scenario.ts, among others). Its frontmatter carries sign_off: pending — accurate to the code, not yet human-attested, so read it as the current mechanism, not a ratified contract. In the handoff pipeline this is a build-time include of that one source, never a hand-copied fork.
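The outer three-way branch of the diagram above (rate limit, terminal, decline classification) can be sketched as a small router. Keying off a raw HTTP status is a simplification for this example; the diagram indicates the real code distinguishes BcPaymentsRateLimitError and TerminalAdapterError. The jitter bound MAX_JITTER_MS and the name routeAdapterFailure are assumptions, since the page gives only "now+2h+jitter".

```typescript
type FailureRoute =
  | { kind: "rate_limit_reschedule"; scheduledAt: Date }
  | { kind: "terminal" }
  | { kind: "classify_decline" };

const TWO_HOURS_MS = 2 * 60 * 60 * 1000;
const MAX_JITTER_MS = 5 * 60 * 1000; // assumed bound; the real jitter range is not stated

function routeAdapterFailure(
  httpStatus: number,
  now: Date,
  random: () => number = Math.random, // injectable so the jitter is testable
): FailureRoute {
  if (httpStatus === 429) {
    // Rate limited: put the charge back to pending, two hours out plus jitter.
    const jitter = Math.floor(random() * MAX_JITTER_MS);
    return {
      kind: "rate_limit_reschedule",
      scheduledAt: new Date(now.getTime() + TWO_HOURS_MS + jitter),
    };
  }
  if (httpStatus === 401 || httpStatus === 402 || httpStatus === 403) {
    // Terminal adapter error: recorded as failed, no dunning stage applied.
    return { kind: "terminal" };
  }
  // Everything else goes through hard/soft decline classification.
  return { kind: "classify_decline" };
}
```

The jitter spreads rescheduled charges so that a burst of rate-limited charges does not all come due on the same tick.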

The subscriber-initiated reset path (US-11.5) is a second, smaller sequence in the same source document, § 3 "Dunning — retry scheduling + subscriber-initiated reset":

sequenceDiagram
    autonumber
    participant Sub as Subscriber (portal)
    participant PMRoute as routes/portal/payment-method.ts::handlePortalUpdatePaymentMethod
    participant Reset as resetDunningOnPmUpdate
    participant DB as D1

    Sub->>PMRoute: PUT /api/v1/portal/subscriptions/:id/payment-method
    PMRoute->>DB: verify portal-session JWT + subscription ownership
    PMRoute->>DB: UPDATE subscriptions SET payment_method_id = new PM
    PMRoute->>Reset: resetDunningOnPmUpdate(repo, subscription) — only when subscription.status='past_due'
    Reset->>DB: UPDATE charges SET retry_attempt=0, status='pending', scheduled_at=now, next_retry_at=NULL WHERE subscription_id AND status IN dunning-in-progress states
    Reset->>DB: UPDATE subscriptions SET status='active'
    Reset->>DB: logEvent('subscription.dunning_reset')
    Note over Reset,DB: The re-armed charge is picked up by the NEXT cron tick's findDueCharges scan — no synchronous charge call happens in this handler.

Diagram provenance. Same source and sign_off: pending state as above; § 3 of sequence-diagrams.md. Traced against routes/portal/payment-method.ts:196-244 — resetDunningOnPmUpdate guards on sub.status !== 'past_due' and only resets charges with status IN ('failed','pending','processing') AND retry_attempt > 0.
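The guard and filter traced above can be sketched over in-memory rows. This is a hypothetical model, not the real resetDunningOnPmUpdate: the row shapes are trimmed to the columns named in the diagram, and resetDunningSketch returns new objects where the real handler issues UPDATE statements.

```typescript
interface SubscriptionRow {
  id: string;
  status: string;
}

interface ChargeRow {
  id: string;
  subscription_id: string;
  status: string;
  retry_attempt: number;
  scheduled_at: string;
  next_retry_at: string | null;
}

interface ResetResult {
  subscription: SubscriptionRow;
  charges: ChargeRow[];
  resetCount: number;
}

const RESETTABLE_STATUSES: ReadonlySet<string> = new Set(["failed", "pending", "processing"]);

function resetDunningSketch(
  subscription: SubscriptionRow,
  charges: readonly ChargeRow[],
  nowIso: string,
): ResetResult {
  // Guard: only a past_due subscription is reset.
  if (subscription.status !== "past_due") {
    return { subscription, charges: [...charges], resetCount: 0 };
  }
  let resetCount = 0;
  const next = charges.map((charge): ChargeRow => {
    const eligible =
      charge.subscription_id === subscription.id &&
      RESETTABLE_STATUSES.has(charge.status) &&
      charge.retry_attempt > 0;
    if (!eligible) return charge;
    resetCount += 1;
    // Re-arm the charge; the next cron tick picks it up, nothing is charged here.
    return { ...charge, retry_attempt: 0, status: "pending", scheduled_at: nowIso, next_retry_at: null };
  });
  return { subscription: { ...subscription, status: "active" }, charges: next, resetCount };
}
```

Note that a subscription still in soft-decline dunning (status other than past_due) passes through unchanged, which matches the guard described in the provenance note.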

Supporting pieces:

What actually fires retries: a cron tick. The policy table alone only schedules a next_retry_at; a tick has to run for anything to happen. The safety-net sweep lives in cron/charge-retry-sweep.ts.
Where merchants set policy: routes/admin/dunning/retry-rules.ts (US-11.1), exposing GET, PUT (full replace), and POST .../reset over dunning_policies.
The merchant digest: services/dunning-digest.ts (US-11.7) — a pure read that defines "in dunning" as subscriptions.status='past_due' with a surviving charges row at status='failed' AND retry_attempt > 0, and anchors days_in_dunning on the earliest charge.failed / charge.retry_scheduled event (not charges.attempted_at, which is overwritten on every retry).
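The digest's "in dunning" definition and its days_in_dunning anchor can be sketched as two pure functions. The row and event shapes (including created_at), the function names, and the floor rounding are assumptions for illustration; the real query in services/dunning-digest.ts is not reproduced on this page.

```typescript
interface SubscriptionRow {
  id: string;
  status: string;
}

interface ChargeRow {
  subscription_id: string;
  status: string;
  retry_attempt: number;
}

interface AuditEvent {
  subscription_id: string;
  type: string;
  created_at: string; // ISO timestamp; field name is hypothetical
}

const DAY_MS = 24 * 60 * 60 * 1000;

// "In dunning": past_due subscription with a surviving failed charge that has retried.
function isInDunning(sub: SubscriptionRow, charges: readonly ChargeRow[]): boolean {
  if (sub.status !== "past_due") return false;
  return charges.some(
    (c) => c.subscription_id === sub.id && c.status === "failed" && c.retry_attempt > 0,
  );
}

// Anchor on the EARLIEST charge.failed / charge.retry_scheduled event, because
// charges.attempted_at is overwritten on every retry and would understate the age.
function daysInDunning(
  events: readonly AuditEvent[],
  subscriptionId: string,
  now: Date,
): number | null {
  const times = events
    .filter(
      (e) =>
        e.subscription_id === subscriptionId &&
        (e.type === "charge.failed" || e.type === "charge.retry_scheduled"),
    )
    .map((e) => Date.parse(e.created_at));
  if (times.length === 0) return null;
  const earliest = Math.min(...times);
  return Math.floor((now.getTime() - earliest) / DAY_MS); // whole days (assumed rounding)
}
```

Anchoring on the event log rather than a mutable column keeps the day count monotonic across retries.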

Security note. Retry internals (failure_code, retry_attempt, stage_index) are written only to internal audit payloads — never forwarded as subscriber-email variables. An allowlist in the email template engine enforces this (GH #1329, cited in dunning-retry.ts:23-31).
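An allowlist of this kind can be sketched in a few lines. The variable names in SUBSCRIBER_EMAIL_ALLOWLIST and the function toSubscriberVariables are hypothetical; the real allowlist lives in the email template engine and its contents are not listed on this page.

```typescript
// Hypothetical allowlist: only these keys may reach a subscriber-facing template.
const SUBSCRIBER_EMAIL_ALLOWLIST: ReadonlySet<string> = new Set([
  "customer_first_name",
  "product_name",
  "update_payment_url",
]);

// Copy only allowlisted keys. Anything else, including retry internals such as
// failure_code, retry_attempt and stage_index, is dropped by default.
function toSubscriberVariables(
  payload: Record<string, unknown>,
  allowlist: ReadonlySet<string> = SUBSCRIBER_EMAIL_ALLOWLIST,
): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const key of Object.keys(payload)) {
    if (allowlist.has(key)) {
      out[key] = payload[key];
    }
  }
  return out;
}
```

An allowlist fails closed: a new internal field added to the audit payload stays out of emails unless someone deliberately lists it, which a denylist would not guarantee.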

Where intent and reality diverge

The derived coverage matrix (_coverage-matrix.json) reports all 7 of Epic-11's stories at g4_status: pass (US-11.1 through US-11.7) — each has a behavioral scenario exercising the real handler. That is true, and it is not the whole truth. Five typed deltas:

Verified-but-incomplete — US-11.5: the server resets retry_attempt and reschedules on payment-method update (payment-method.ts:224, G4-verified) but the UI gives the subscriber no signal a retry was enqueued (BRD's own US-11.5 gaps: note); also the reset fires only from past_due (hard-decline state), not during in-flight soft-decline dunning (payment-method.ts:205).
Named-deferred — US-11.6's ">24h email gate" is not built (emails fire on every stage — traced in applyDunningPolicy, which sends email.requested unconditionally whenever a stage has a template key; the epic-11 scenario's own spec-reconciliation note says so) and the P2 pre-retry SMS has zero implementation; EU SCA (requires_action wake) is deferred pending PI-5062 (ADR-0011 §6).
Built-but-untrodden — the pause / ignore / grace-period exhaustion branches are real code paths but every default stage ships on_exhaustion:'cancel' + grace_period_days:0 (DEFAULT_DUNNING_STAGES, dunning-retry.ts:32-38); and the grace-period branch is a dead end — when grace_period_days > 0, applyDunningPolicy only extends current_period_end and leaves status unchanged (dunning-retry.ts:200-204), and the one place that consumes current_period_end to auto-resume a subscription, listResumableSubscriptions (db.ts:3822-3835), filters on status='paused' — a status the grace branch never sets. Nothing applies the terminal status after the grace window elapses.
Contract-verified, not live-verified — G4 scenarios run against real D1 (applySchema, real CHECK constraints) but mock the email queue and processor round-trip; gateway-side dedupe is asserted only by key equality, and live email delivery is G5-only (scenario headers).
Built-but-untrodden — the sweep mints a per-attempt idempotency key for null-key charges (charge-retry-sweep.ts:100, fallback `${subscription.id}:${nowIso}` inside the ctx.idempotencyKey construction), technically violating the stability invariant above for keyless charges; unconfirmed whether null-key charges can reach the sweep in practice.
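The grace-period dead end in the third delta can be reproduced in miniature. This is a model under assumptions, not the real code: applyGraceBranchSketch stands in for the grace path of applyDunningPolicy, listResumableSketch for listResumableSubscriptions, the starting status 'active' is assumed, and so are the choice to extend from the existing current_period_end and the "period end has passed" comparison.

```typescript
interface SubscriptionRow {
  id: string;
  status: string;
  current_period_end: string; // ISO timestamp
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Grace branch: extend current_period_end, leave status untouched.
function applyGraceBranchSketch(sub: SubscriptionRow, graceDays: number): SubscriptionRow {
  const extended = new Date(Date.parse(sub.current_period_end) + graceDays * DAY_MS);
  return { ...sub, current_period_end: extended.toISOString() };
}

// The one consumer of current_period_end described on this page filters on
// status = 'paused'; the date comparison is an assumption.
function listResumableSketch(subs: readonly SubscriptionRow[], now: Date): SubscriptionRow[] {
  return subs.filter(
    (s) => s.status === "paused" && Date.parse(s.current_period_end) <= now.getTime(),
  );
}

const inGrace = applyGraceBranchSketch(
  { id: "sub_1", status: "active", current_period_end: "2026-07-01T00:00:00.000Z" },
  3,
);

// Long after the grace window, the subscription is still not selected,
// because the grace branch never set status to 'paused'.
const picked = listResumableSketch([inGrace], new Date("2027-01-01T00:00:00.000Z"));
```

In this model picked is always empty for a subscription that went through the grace branch, however far now advances, which is the mismatch the delta describes.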

How to operate & extend

Change the default retry curve: DEFAULT_DUNNING_STAGES in services/dunning-retry.ts (12h / 12h / 24h / 48h / 72h, 5 stages). Per-store overrides live in the dunning_policies table; merchants edit them via PUT /api/admin/dunning/policy (full replace) or reset to defaults via POST /api/admin/dunning/policy/reset (routes/admin/dunning/retry-rules.ts).
Retries not firing? Start at cron/charge-retry-sweep.ts, but keep in mind that it is the safety net, not the primary trigger: the primary path is applyDunningPolicy writing scheduled_at, which the next runScheduledTick picks up.
The invariant you must not break: the stable idempotency key (decision 1 above). A per-attempt suffix breaks next-cycle scheduling and gateway dedupe at once, silently.
Extension seams: on_exhaustion actions (new terminal behaviors — add a case in applyDunningPolicy's exhaustion branch), pre-charge hooks (charges/pre-charge-hook-registry.ts), and per-stage email template keys (subscriber-facing content is merchant-controlled in Resend; internal fields are allowlist-blocked from ever reaching a template, GH #1329).

Confidence notes

The ADR-0011 §5 backoff curve does not match either shipped curve. ADR-0011 §5 names the curve as "exponential, cumulative anchors at 1m / 5m / 15m / 1h / 6h / 24h." What actually shipped is two different curves, neither matching that text: dunning-retry.ts's DEFAULT_DUNNING_STAGES (12h / 12h / 24h / 48h / 72h, 5 stages — the code comment calls this "per ADR-0011 recommendation," which it isn't verbatim) and db.ts's RETRY_BACKOFF_MS (1h / 4h / 24h, 3 attempts, consumed by charge-retry-sweep.ts and matching that file's own header comment). Input-B's "two coexisting mechanisms by design" framing still holds — I traced both curves directly in code — but the specific numbers in ADR-0011 §5 are historical intent, superseded by what was actually built, and the decision record doesn't record that divergence. Filing that gap is out of scope for this page; noting it here so a recipient doesn't cite ADR-0011 §5 for the literal curve values.
listResumableSubscriptions as the "no grace consumer" evidence. I traced this myself (db.ts:3822-3835) rather than relying on Input-B's "repo-wide grep" citation alone — it's the one query in the codebase that reads current_period_end to resume a subscription, and its status='paused' filter cannot match a subscription the grace branch left untouched. I could not rule out some other code path acting on the extended current_period_end outside a grep-visible SQL match; the delta as typed ("built-but-untrodden," not "dead code") already accounts for that residual uncertainty.