Webhook Retry Logic: Exponential Backoff, Idempotency, and Dead Letter Queues
A webhook retry policy is not a loop. It is a promise about what happens when reality interrupts your product.
Servers deploy. Networks flap. Customer endpoints return 500. Rate limits appear. TLS handshakes fail. A receiver processes the request and then times out before sending a response. The sender cannot know which side effect happened unless the system is designed for uncertainty.
That is why good webhook delivery has two halves:
- The sender retries transient failures with enough visibility to debug them.
- The receiver treats duplicate deliveries as normal, not exceptional.
If you only implement one side, you get a system that works during demos and becomes mysterious during incidents.
Retry policy starts with failure classification
The first question is not "how many retries?" It is "what kind of failure happened?"
| Result | Retry? | Why | What to show in logs |
|---|---|---|---|
| Timeout | Usually yes | The receiver may be slow or briefly unavailable | Timeout class, duration, attempt number |
| Connection error | Yes | Network or deployment interruption | Error code, host, attempt number |
| 429 rate limited | Yes, with backoff | The receiver is asking for time | Status code, retry delay |
| 500 to 599 | Yes | Server-side transient failure is plausible | Status code, response snippet |
| 408 or 409 | Sometimes | Context matters; often temporary | Status code and response body |
| 400 to 499 | Usually no | The request is likely invalid or unauthorized | Status code and clear terminal failure |
| Unsafe destination | No | Security policy should stop dispatch | Stable error code, not internal network details |
Blindly retrying every non-200 response is noisy. Never retrying is brittle. The policy should express what the system believes about each class of failure.
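As a sketch, that belief can be encoded in one small function for HTTP statuses; timeouts and connection errors are classified separately in the dispatcher shown later. The names below are illustrative, not from any particular library:

type RetryDecision = "retry" | "retry-with-backoff" | "terminal";

// Illustrative mapping from a non-2xx status to a retry decision,
// mirroring the table above. Tune the 408/409 handling per product.
function classifyStatus(status: number): RetryDecision {
  if (status === 429) return "retry-with-backoff"; // receiver asked for time
  if (status >= 500 && status <= 599) return "retry"; // plausibly transient
  if (status === 408 || status === 409) return "retry"; // often temporary
  return "terminal"; // remaining 4xx: likely invalid or unauthorized
}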
A practical exponential backoff schedule
Exponential backoff gives failing systems room to recover. Jitter prevents every failed job from retrying at the same second.
A simple policy can look like this:
| Attempt | Delay before this attempt | Purpose |
|---|---|---|
| 1 | None (runs at the scheduled time) | Normal execution |
| 2 | 1 minute | Catch deploys and short network blips |
| 3 | 5 minutes | Let temporary incidents settle |
| 4 | 30 minutes | Avoid hammering an unhealthy endpoint |
| 5 | 2 hours | Give customer systems time to recover |
| 6 | 12 hours | Final long-tail recovery attempt |
The exact numbers matter less than the rule: retries should slow down as confidence drops.
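Here is one way to encode the schedule above, with a small amount of jitter so that jobs failing together do not retry together. The constant values and the computeBackoff name are assumptions, not a canonical implementation:

// Delay before the next attempt, indexed by the attempt that just failed.
// Mirrors the schedule above: 1m, 5m, 30m, 2h, 12h.
const RETRY_DELAYS_MS = [60_000, 300_000, 1_800_000, 7_200_000, 43_200_000];

// Adds up to 10% random jitter so simultaneous failures spread out.
// `attempt` is 1-based: computeBackoff(1) is the delay before attempt 2.
function computeBackoff(attempt: number): number {
  const index = Math.min(attempt, RETRY_DELAYS_MS.length) - 1;
  const base = RETRY_DELAYS_MS[index];
  return base + base * 0.1 * Math.random();
}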
For customer-facing SaaS, the dashboard must explain the schedule. A hidden retry loop creates support tickets. A visible retry timeline creates trust.
Idempotency is not optional
Retries can create duplicate deliveries. That is not a bug. It is a consequence of distributed systems.
Imagine this sequence:
- You send a webhook to a customer endpoint.
- Their handler writes to their database successfully.
- Their server crashes before returning a 2xx response.
- Your system sees a timeout and retries.
- Their handler receives the same event again.
From the sender side, retrying was correct. From the receiver side, processing the event twice would be wrong.
That is why every important webhook should include a stable event ID or idempotency key. Receivers should store it before running irreversible side effects.
export async function handleWebhook(req: Request) {
  const event = await req.json();

  // Claim the idempotency key atomically. `db` is a Prisma-style client;
  // skipDuplicates makes the insert a no-op when the key already exists.
  const inserted = await db.processedWebhook.createMany({
    data: {
      idempotencyKey: event.idempotencyKey,
      receivedAt: new Date(),
    },
    skipDuplicates: true,
  });

  // Zero rows inserted means this event was already processed:
  // acknowledge it without re-running the side effect.
  if (inserted.count === 0) {
    return Response.json({ ok: true, duplicate: true });
  }

  await runBusinessSideEffect(event);
  return Response.json({ ok: true });
}
The exact database call will vary, but the shape is stable: claim the event key atomically, then perform the side effect.
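On the sending side, the matching obligation is an ID that stays stable across retries. A sketch, assuming a standard fetch runtime; the X-Event-Id header name and the job fields are illustrative:

// Every retry of the same job reuses the same event ID so the receiver
// can deduplicate. Field names on the job are illustrative.
async function sendOnce(job: ScheduledWebhookJob): Promise<Response> {
  return fetch(job.url, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "X-Event-Id": job.eventId, // stable across all attempts of this job
    },
    body: JSON.stringify(job.payload),
    redirect: "error",                   // never follow redirects blindly
    signal: AbortSignal.timeout(15_000), // bound each attempt
  });
}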
Dead letters are a support workflow
A dead letter queue is often described as a storage place for failed jobs. That is too mechanical.
For product teams, the dead-letter state is a support workflow. It answers:
- Which deliveries failed permanently?
- Which endpoint failed?
- What was the last status code or error?
- How many attempts happened?
- Can the customer fix the endpoint and retry manually?
- Should the system stop retrying because the request is invalid?
If a failed webhook is only visible in logs, the system has no customer support surface.
You do not always need a literal queue named "dead letter." You do need an inspectable terminal state for exhausted deliveries.
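A sketch of what that terminal state can carry, with illustrative names; any shape works as long as support can query it:

// "failed" is the dead-letter state in this illustrative model.
type DeliveryStatus = "scheduled" | "processing" | "delivered" | "failed";

// Enough to answer the questions above without log access.
interface DeadLetteredDelivery {
  jobId: string;
  endpointUrl: string;
  attemptCount: number;
  lastStatusCode: number | null; // null for timeouts and connection errors
  lastErrorClass: string | null;
  failedAt: Date;
  retryable: boolean; // can the customer fix the endpoint and replay?
}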
What every attempt should record
A delivery attempt should be small, safe, and useful.
At minimum, store:
- Job ID and organization ID.
- Attempt number.
- Started and finished timestamps.
- Status code when a response exists.
- Latency.
- Error class for connection failures and timeouts.
- Truncated response body.
- Whether another retry is scheduled.
- Next run time when applicable.
Avoid storing unlimited response bodies. Avoid leaking secrets from request headers. Avoid exposing internal IP decisions. Logs should help support and engineering without becoming a second copy of customer data.
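As a sketch, that minimum translates into a record roughly like this; the field names are illustrative:

interface DeliveryAttempt {
  jobId: string;
  organizationId: string;
  attemptNumber: number;
  startedAt: Date;
  finishedAt: Date;
  statusCode: number | null;   // null when no response arrived
  latencyMs: number;
  errorClass: string | null;   // "timeout", "connection_refused", ...
  responseBodySnippet: string; // truncated before persisting
  willRetry: boolean;
  nextRunAt: Date | null;
}

// Truncate aggressively; support rarely needs more than the first kilobyte.
const MAX_SNIPPET_LENGTH = 1_000;
const toSnippet = (body: string) => body.slice(0, MAX_SNIPPET_LENGTH);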
The retry logic skeleton
A simplified dispatcher looks like this:
async function deliver(job: ScheduledWebhookJob) {
  const attempt = job.attemptCount + 1;

  try {
    // Never follow redirects for user-provided destinations; see the
    // security notes below.
    const response = await sendHttpRequest(job, {
      timeoutMs: 15_000,
      maxRedirects: 0,
    });

    // Any 2xx counts as delivered.
    if (response.status >= 200 && response.status < 300) {
      await markDelivered(job.id, attempt, response);
      return;
    }

    // Retryable status (429, 5xx, ...) with attempts remaining: back off.
    if (shouldRetryStatus(response.status) && attempt < job.maxAttempts) {
      await scheduleRetry(job.id, attempt, computeBackoff(attempt));
      return;
    }

    // Terminal failure: permanent 4xx or retry budget exhausted.
    await markFailed(job.id, attempt, response);
  } catch (error) {
    // Timeouts and connection errors are retried; everything else is terminal.
    if (isTransientNetworkError(error) && attempt < job.maxAttempts) {
      await scheduleRetry(job.id, attempt, computeBackoff(attempt));
      return;
    }
    await markFailed(job.id, attempt, error);
  }
}
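The two predicates the skeleton leans on can start as simple allowlists. These are starting-point assumptions to tune, not canonical definitions:

// Retry on request timeout, rate limiting, and server-side errors.
function shouldRetryStatus(status: number): boolean {
  return status === 408 || status === 429 || (status >= 500 && status <= 599);
}

// Treat timeouts and common connection failures as transient. The error
// shapes assume a Node-style runtime; adjust for your HTTP client.
function isTransientNetworkError(error: unknown): boolean {
  if (error instanceof Error && error.name === "TimeoutError") return true;
  const code =
    typeof error === "object" && error !== null
      ? (error as { code?: string }).code
      : undefined;
  return code === "ECONNRESET" || code === "ECONNREFUSED" ||
    code === "ETIMEDOUT" || code === "EAI_AGAIN";
}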
This is only the delivery path. A real system still needs atomic job claiming, safe URL validation, payload limits, dashboard pagination, quota enforcement, and billing-aware plan limits.
That is the part teams underestimate.
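Atomic claiming is a good example of that underestimated part. One sketch, assuming a Prisma-style client and illustrative model and status names, uses a compare-and-set update so exactly one concurrent worker wins each job:

// updateMany returns a count, so two workers racing on the same job
// cannot both move it from "scheduled" to "processing".
async function claimJob(jobId: string): Promise<boolean> {
  const result = await db.webhookJob.updateMany({
    where: { id: jobId, status: "scheduled" },
    data: { status: "processing", claimedAt: new Date() },
  });
  return result.count === 1;
}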
Common retry mistakes
The first mistake is retrying without idempotency. If the receiver can charge a card, send an email, create a record, or trigger an automation, duplicates must be expected.
The second mistake is retrying too fast. A loop that retries every second turns a customer outage into a customer incident.
The third mistake is treating redirects as harmless. For user-provided webhook destinations, redirects can become a security problem if they are followed blindly.
The fourth mistake is storing too much. Full payloads, response bodies, and headers can become sensitive. Store what support needs, truncate aggressively, and keep security boundaries clear.
The fifth mistake is hiding the retry timeline. If a customer cannot see what will happen next, they will assume the platform is broken.
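For the redirect and destination mistakes specifically, a first-pass guard can reject obviously unsafe URLs before a job is ever scheduled. This is a simplified sketch; a real SSRF defense must also resolve DNS and validate the final IP at request time, because a public hostname can point at an internal address:

// Simplified destination check: HTTPS only, no obvious private hosts.
function isSafeDestination(rawUrl: string): boolean {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return false; // not a parseable URL
  }
  if (url.protocol !== "https:") return false;
  const host = url.hostname;
  if (host === "localhost" || host.endsWith(".internal")) return false;
  // Reject literal private and link-local IPv4 ranges.
  if (/^(10\.|127\.|169\.254\.|192\.168\.|172\.(1[6-9]|2\d|3[01])\.)/.test(host)) {
    return false;
  }
  return true;
}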
Build it yourself or use a webhook scheduler?
Build retry logic yourself if the webhook delivery path is central to your platform, your team already owns queue operations, and you need custom semantics that a focused scheduler cannot provide.
Use a managed scheduler when the core requirement is scheduled HTTPS delivery with retries, delivery logs, and a dashboard. That is especially true for solo founders and small SaaS teams where the opportunity cost of building infrastructure is high.
One managed option is Webhook Scheduler. Disclosure: we built Webhook Scheduler for exactly this use case. It handles scheduled HTTPS webhooks with retry attempts and delivery visibility; it is not a general background-compute platform.
If that matches the problem you are solving, the most useful place to inspect it is the API reference. If your retries need arbitrary internal compute, a queue or workflow engine is probably the better tool.
A production checklist
Before calling your webhook retry system production-ready, check the following:
- Jobs are claimed atomically before dispatch.
- Recent processing jobs cannot be double-dispatched by concurrent workers.
- Stale processing jobs can be recovered safely.
- Retry delays are visible and deterministic enough to explain.
- Receivers get an event ID or idempotency key.
- Permanent 4xx errors do not retry forever.
- Response bodies and headers are truncated or sanitized.
- Unsafe destinations are rejected before scheduling and before dispatch.
- Redirects are blocked or revalidated safely.
- Failed jobs remain inspectable after retry exhaustion.
- Customers can understand what happened without opening an engineering ticket.
The operating principle
Webhook retry logic is not about being optimistic. It is about accepting that delivery can be uncertain and making every state visible.
A good retry system does not promise that every endpoint will be healthy. It promises that failure will be bounded, recorded, retried when appropriate, and understandable when it finally stops.
That is the difference between a background script and product infrastructure.
Related reading: Webhook scheduling topic hub, How to schedule webhooks without cron jobs, and The hidden cost of building your own job queue.