Webhook Retry Logic: Exponential Backoff, Idempotency, and Dead Letter Queues
A webhook retry policy is not a loop. It is a promise about what happens when reality interrupts your product.
Servers deploy. Networks flap. Customer endpoints return 500. Rate limits appear. TLS handshakes fail. A receiver processes the request and then times out before sending a response. The sender cannot know which side effect happened unless the system is designed for uncertainty.
That is why good webhook delivery has two halves:
- The sender retries transient failures with enough visibility to debug them.
- The receiver treats duplicate deliveries as normal, not exceptional.
If you only implement one side, you get a system that works during demos and becomes mysterious during incidents.
Retry policy starts with failure classification
The first question is not "how many retries?" It is "what kind of failure happened?"
| Result | Retry? | Why | What to show in logs |
|---|---|---|---|
| Timeout | Usually yes | The receiver may be slow or briefly unavailable | Timeout class, duration, attempt number |
| Connection error | Yes | Network or deployment interruption | Error code, host, attempt number |
| 429 rate limited | Yes, with backoff | The receiver is asking for time | Status code, retry delay |
| 500 to 599 | Yes | Server-side transient failure is plausible | Status code, response snippet |
| 408 or 409 | Sometimes | Context matters; often temporary | Status code and response body |
| 400 to 499 | Usually no | The request is likely invalid or unauthorized | Status code and clear terminal failure |
| Unsafe destination | No | Security policy should stop dispatch | Stable error code, not internal network details |
Blindly retrying every non-200 response is noisy. Never retrying is brittle. The policy should express what the system believes about each class of failure.
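As a sketch, that belief can be encoded in one small function for HTTP statuses; timeouts and connection errors are classified separately in the dispatcher shown later. The names below are illustrative, not from any particular library:

type RetryDecision = "retry" | "retry-with-backoff" | "terminal";

// Illustrative mapping from a non-2xx status to a retry decision,
// mirroring the table above. Tune the 408/409 handling per product.
function classifyStatus(status: number): RetryDecision {
  if (status === 429) return "retry-with-backoff"; // receiver asked for time
  if (status >= 500 && status <= 599) return "retry"; // plausibly transient
  if (status === 408 || status === 409) return "retry"; // often temporary
  return "terminal"; // remaining 4xx: likely invalid or unauthorized
}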
A practical exponential backoff schedule
Exponential backoff gives failing systems room to recover. Jitter prevents every failed job from retrying at the same second.
A simple policy can look like this:
| Attempt | Delay before this attempt | Purpose |
|---|---|---|
| 1 | None (runs at the scheduled time) | Normal execution |
| 2 | 1 minute | Catch deploys and short network blips |
| 3 | 5 minutes | Let temporary incidents settle |
| 4 | 30 minutes | Avoid hammering an unhealthy endpoint |
| 5 | 2 hours | Give customer systems time to recover |
| 6 | 12 hours | Final long-tail recovery attempt |
The exact numbers matter less than the rule: retries should slow down as confidence drops.
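Here is one way to encode the schedule above, with a small amount of jitter so that jobs failing together do not retry together. The constant values and the computeBackoff name are assumptions, not a canonical implementation:

// Delay before the next attempt, indexed by the attempt that just failed.
// Mirrors the schedule above: 1m, 5m, 30m, 2h, 12h.
const RETRY_DELAYS_MS = [60_000, 300_000, 1_800_000, 7_200_000, 43_200_000];

// Adds up to 10% random jitter so simultaneous failures spread out.
// `attempt` is 1-based: computeBackoff(1) is the delay before attempt 2.
function computeBackoff(attempt: number): number {
  const index = Math.min(attempt, RETRY_DELAYS_MS.length) - 1;
  const base = RETRY_DELAYS_MS[index];
  return base + base * 0.1 * Math.random();
}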
For customer-facing SaaS, the dashboard must explain the schedule. A hidden retry loop creates support tickets. A visible retry timeline creates trust.
Idempotency is not optional
Retries can create duplicate deliveries. That is not a bug. It is a consequence of distributed systems.
Imagine this sequence:
- You send a webhook to a customer endpoint.
- Their handler writes to their database successfully.
- Their server crashes before returning a 2xx response.
- Your system sees a timeout and retries.
- Their handler receives the same event again.
From the sender side, retrying was correct. From the receiver side, processing the event twice would be wrong.
That is why every important webhook should include a stable event ID or idempotency key. Receivers should store it before running irreversible side effects.
export async function handleWebhook(req: Request) {
  const event = await req.json();

  // Claim the idempotency key atomically. `db` is a Prisma-style client;
  // skipDuplicates makes the insert a no-op when the key already exists.
  const inserted = await db.processedWebhook.createMany({
    data: {
      idempotencyKey: event.idempotencyKey,
      receivedAt: new Date(),
    },
    skipDuplicates: true,
  });

  // Zero rows inserted means this event was already processed:
  // acknowledge it without re-running the side effect.
  if (inserted.count === 0) {
    return Response.json({ ok: true, duplicate: true });
  }

  await runBusinessSideEffect(event);
  return Response.json({ ok: true });
}
The exact database call will vary, but the shape is stable: claim the event key atomically, then perform the side effect.
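On the sending side, the matching obligation is an ID that stays stable across retries. A sketch, assuming a standard fetch runtime; the X-Event-Id header name and the job fields are illustrative:

// Every retry of the same job reuses the same event ID so the receiver
// can deduplicate. Field names on the job are illustrative.
async function sendOnce(job: ScheduledWebhookJob): Promise<Response> {
  return fetch(job.url, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "X-Event-Id": job.eventId, // stable across all attempts of this job
    },
    body: JSON.stringify(job.payload),
    redirect: "error",                   // never follow redirects blindly
    signal: AbortSignal.timeout(15_000), // bound each attempt
  });
}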
Dead letters are a support workflow
A dead letter queue is often described as a storage place for failed jobs. That is too mechanical.
For product teams, the dead-letter state is a support workflow. It answers:
- Which deliveries failed permanently?
- Which endpoint failed?
- What was the last status code or error?
- How many attempts happened?
- Can the customer fix the endpoint and retry manually?
- Should the system stop retrying because the request is invalid?
If a failed webhook is only visible in logs, the system has no customer support surface.
You do not always need a literal queue named "dead letter." You do need an inspectable terminal state for exhausted deliveries.
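A sketch of what that terminal state can carry, with illustrative names; any shape works as long as support can query it:

// "failed" is the dead-letter state in this illustrative model.
type DeliveryStatus = "scheduled" | "processing" | "delivered" | "failed";

// Enough to answer the questions above without log access.
interface DeadLetteredDelivery {
  jobId: string;
  endpointUrl: string;
  attemptCount: number;
  lastStatusCode: number | null; // null for timeouts and connection errors
  lastErrorClass: string | null;
  failedAt: Date;
  retryable: boolean; // can the customer fix the endpoint and replay?
}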
What every attempt should record
A delivery attempt should be small, safe, and useful.
At minimum, store:
- Job ID and organization ID.
- Attempt number.
- Started and finished timestamps.
- Status code when a response exists.
- Latency.
- Error class for connection failures and timeouts.
- Truncated response body.
- Whether another retry is scheduled.
- Next run time when applicable.
Avoid storing unlimited response bodies. Avoid leaking secrets from request headers. Avoid exposing internal IP decisions. Logs should help support and engineering without becoming a second copy of customer data.
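As a sketch, that minimum translates into a record roughly like this; the field names are illustrative:

interface DeliveryAttempt {
  jobId: string;
  organizationId: string;
  attemptNumber: number;
  startedAt: Date;
  finishedAt: Date;
  statusCode: number | null;   // null when no response arrived
  latencyMs: number;
  errorClass: string | null;   // "timeout", "connection_refused", ...
  responseBodySnippet: string; // truncated before persisting
  willRetry: boolean;
  nextRunAt: Date | null;
}

// Truncate aggressively; support rarely needs more than the first kilobyte.
const MAX_SNIPPET_LENGTH = 1_000;
const toSnippet = (body: string) => body.slice(0, MAX_SNIPPET_LENGTH);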
The retry logic skeleton
A simplified dispatcher looks like this:
async function deliver(job: ScheduledWebhookJob) {
  const attempt = job.attemptCount + 1;

  try {
    // Never follow redirects for user-provided destinations; see the
    // security notes below.
    const response = await sendHttpRequest(job, {
      timeoutMs: 15_000,
      maxRedirects: 0,
    });

    // Any 2xx counts as delivered.
    if (response.status >= 200 && response.status < 300) {
      await markDelivered(job.id, attempt, response);
      return;
    }

    // Retryable status (429, 5xx, ...) with attempts remaining: back off.
    if (shouldRetryStatus(response.status) && attempt < job.maxAttempts) {
      await scheduleRetry(job.id, attempt, computeBackoff(attempt));
      return;
    }

    // Terminal failure: permanent 4xx or retry budget exhausted.
    await markFailed(job.id, attempt, response);
  } catch (error) {
    // Timeouts and connection errors are retried; everything else is terminal.
    if (isTransientNetworkError(error) && attempt < job.maxAttempts) {
      await scheduleRetry(job.id, attempt, computeBackoff(attempt));
      return;
    }
    await markFailed(job.id, attempt, error);
  }
}
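The two predicates the skeleton leans on can start as simple allowlists. These are starting-point assumptions to tune, not canonical definitions:

// Retry on request timeout, rate limiting, and server-side errors.
function shouldRetryStatus(status: number): boolean {
  return status === 408 || status === 429 || (status >= 500 && status <= 599);
}

// Treat timeouts and common connection failures as transient. The error
// shapes assume a Node-style runtime; adjust for your HTTP client.
function isTransientNetworkError(error: unknown): boolean {
  if (error instanceof Error && error.name === "TimeoutError") return true;
  const code =
    typeof error === "object" && error !== null
      ? (error as { code?: string }).code
      : undefined;
  return code === "ECONNRESET" || code === "ECONNREFUSED" ||
    code === "ETIMEDOUT" || code === "EAI_AGAIN";
}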
This is only the delivery path. A real system still needs atomic job claiming, safe URL validation, payload limits, dashboard pagination, quota enforcement, and billing-aware plan limits.
That is the part teams underestimate.
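Atomic claiming is a good example of that underestimated part. One sketch, assuming a Prisma-style client and illustrative model and status names, uses a compare-and-set update so exactly one concurrent worker wins each job:

// updateMany returns a count, so two workers racing on the same job
// cannot both move it from "scheduled" to "processing".
async function claimJob(jobId: string): Promise<boolean> {
  const result = await db.webhookJob.updateMany({
    where: { id: jobId, status: "scheduled" },
    data: { status: "processing", claimedAt: new Date() },
  });
  return result.count === 1;
}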
Common retry mistakes
The first mistake is retrying without idempotency. If the receiver can charge a card, send an email, create a record, or trigger an automation, duplicates must be expected.
The second mistake is retrying too fast. A loop that retries every second turns a customer outage into a customer incident.
The third mistake is treating redirects as harmless. For user-provided webhook destinations, redirects can become a security problem if they are followed blindly.
The fourth mistake is storing too much. Full payloads, response bodies, and headers can become sensitive. Store what support needs, truncate aggressively, and keep security boundaries clear.
The fifth mistake is hiding the retry timeline. If a customer cannot see what will happen next, they will assume the platform is broken.
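For the redirect and destination mistakes specifically, a first-pass guard can reject obviously unsafe URLs before a job is ever scheduled. This is a simplified sketch; a real SSRF defense must also resolve DNS and validate the final IP at request time, because a public hostname can point at an internal address:

// Simplified destination check: HTTPS only, no obvious private hosts.
function isSafeDestination(rawUrl: string): boolean {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return false; // not a parseable URL
  }
  if (url.protocol !== "https:") return false;
  const host = url.hostname;
  if (host === "localhost" || host.endsWith(".internal")) return false;
  // Reject literal private and link-local IPv4 ranges.
  if (/^(10\.|127\.|169\.254\.|192\.168\.|172\.(1[6-9]|2\d|3[01])\.)/.test(host)) {
    return false;
  }
  return true;
}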
Build it yourself or use a webhook scheduler?
Build retry logic yourself if the webhook delivery path is central to your platform, your team already owns queue operations, and you need custom semantics that a focused scheduler cannot provide.
Use a managed scheduler when the core requirement is scheduled HTTPS delivery with retries, delivery logs, and a dashboard. That is especially true for solo founders and small SaaS teams where the opportunity cost of building infrastructure is high.
One managed option is Webhook Scheduler. Disclosure: we built Webhook Scheduler for exactly this use case. It handles scheduled HTTPS webhooks with retry attempts and delivery visibility; it is not a general background-compute platform.
If that matches the problem you are solving, the most useful place to inspect it is the API reference. If your retries need arbitrary internal compute, a queue or workflow engine is probably the better tool.
A production checklist
Before calling your webhook retry system production-ready, check the following:
- Jobs are claimed atomically before dispatch.
- Recent processing jobs cannot be double-dispatched by concurrent workers.
- Stale processing jobs can be recovered safely.
- Retry delays are visible and deterministic enough to explain.
- Receivers get an event ID or idempotency key.
- Permanent 4xx errors do not retry forever.
- Response bodies and headers are truncated or sanitized.
- Unsafe destinations are rejected before scheduling and before dispatch.
- Redirects are blocked or revalidated safely.
- Failed jobs remain inspectable after retry exhaustion.
- Customers can understand what happened without opening an engineering ticket.
The operating principle
Webhook retry logic is not about being optimistic. It is about accepting that delivery can be uncertain and making every state visible.
A good retry system does not promise that every endpoint will be healthy. It promises that failure will be bounded, recorded, retried when appropriate, and understandable when it finally stops.
That is the difference between a background script and product infrastructure.
Related reading: Webhook scheduling topic hub, How to schedule webhooks without cron jobs, and The hidden cost of building your own job queue.