AllClearStack

The 'Exactly-Once' Delivery Lie: How Event Queues Mutate Data

Every distributed system vendor eventually attempts to sell the impossible. They offer a 'guaranteed' exactly-once delivery semantic as if the laws of physics and network theory were merely suggestions. We have been conditioned to believe that if we pay enough for a managed Kafka cluster or an SQS FIFO queue, our data will arrive exactly once at its destination without fail or duplication. This is a dangerous falsehood that ignores the fundamental mechanics of distributed computing.

Infrastructure providers use the term 'exactly-once' to describe a very specific, narrow set of conditions under which their internal state remains consistent. It does not mean your business logic will only execute once. When we build systems under the assumption of singleton delivery, we stop building defensive code. This absence of defensive architecture leads to the hardest bugs to track down: silent data mutation caused by overlapping retries.

Modern event-driven architecture is sold as a way to decouple systems, but it often just couples your data integrity to the reliability of an external vendor's ACK logic. If the network fails between the processing of a message and the acknowledgment of that message, the broker will redeliver. Your system has now processed the same event twice, and if your logic isn't strictly idempotent, your database is now in an undefined state. We are trading the simplicity of localized state for the complexity of a distributed lie.

The Physics of Networking Forbids Singleton Guarantees

The Two Generals' Problem is not a puzzle to be solved; it is a boundary of reality. Two entities communicating over an unreliable link can never reach a state of absolute consensus on the status of a message. In the context of a message queue, the producer sends a message, and the broker must acknowledge it. If the acknowledgment is lost due to a network blip, the producer has no way of knowing if the broker received the data or if the connection died before delivery.

Producers are programmed to retry on failure to ensure data is not lost. This transition from 'might have arrived' to 'definitely arrived' necessitates a retry, which immediately moves the system into at-least-once delivery territory. Even if the broker handles deduplication, the same logic applies to the consumer side. A consumer pulls a message, processes it, and then attempts to acknowledge it back to the broker. If the consumer crashes after processing but before the ACK, the broker will eventually time out and hand that message to another worker.
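The consumer-side failure described above can be reproduced in a few lines. This is a minimal sketch, not a model of any real broker: `InMemoryBroker`, `expire_visibility`, and `run_consumer` are hypothetical names invented for illustration, and the timeout is triggered by hand rather than by a clock.

```python
from collections import deque


class InMemoryBroker:
    """Toy at-least-once broker: a message is only forgotten once it is ACKed."""

    def __init__(self, messages):
        self.pending = deque(messages)
        self.in_flight = {}

    def receive(self):
        # Hand out the next message and remember it until an ACK arrives.
        msg_id, body = self.pending.popleft()
        self.in_flight[msg_id] = body
        return msg_id, body

    def ack(self, msg_id):
        self.in_flight.pop(msg_id)

    def expire_visibility(self):
        # Stand-in for a delivery timeout: un-ACKed messages go back on the queue.
        for msg_id, body in self.in_flight.items():
            self.pending.append((msg_id, body))
        self.in_flight.clear()


def run_consumer(broker, handled, crash_before_ack):
    msg_id, body = broker.receive()
    handled.append(body)      # the side effect happens here
    if crash_before_ack:
        return                # simulated crash: the ACK is never sent
    broker.ack(msg_id)


broker = InMemoryBroker([("m-1", "charge $10")])
handled = []
run_consumer(broker, handled, crash_before_ack=True)   # worker A dies after processing
broker.expire_visibility()                             # broker times out the delivery
run_consumer(broker, handled, crash_before_ack=False)  # worker B receives the same message
```

After the second run the queue is empty and the broker considers the work done, yet `handled` contains the same charge twice. Nothing in the broker's view of the world distinguishes this from a clean run.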

Redelivery is a feature, not a bug, yet we treat it as a failure of the infrastructure. The belief that infrastructure can abstract away the possibility of double-processing is a massive architectural risk. When senior engineers design systems around 'exactly-once' promises, they are delegating their data integrity to a network layer that cannot, by definition, guarantee it. The cost of this trust is paid in reconciled ledgers and ghost records that appear hours or days after an incident.

Vendor Labels Are Specialized Transactional Shorthand

When Apache Kafka or AWS SQS FIFO claims 'exactly-once' semantics, it is using a highly specific definition that does not align with common developer expectations. In Kafka's case, exactly-once is achieved through idempotent producers and atomic writes across partitions. This means that if a producer retries a send because the acknowledgment was lost, the broker recognizes the retry and writes the record only once; two separate sends of the same payload are still two records. It also ensures that a set of writes to multiple partitions succeeds or fails as a single unit.
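To make the scope of producer idempotence concrete, here is a toy model of sequence-number deduplication. It is a sketch under simplifying assumptions, not Kafka's implementation: `ToyPartitionLog` and its fields are hypothetical, and the real broker's bookkeeping is more involved. The point is only that deduplication keys on the producer's retry metadata, not on the payload.

```python
class ToyPartitionLog:
    """Toy log that drops producer retries by (producer_id, sequence number)."""

    def __init__(self):
        self.records = []
        self.last_seq = {}  # producer_id -> highest sequence number written

    def append(self, producer_id, seq, payload):
        if seq <= self.last_seq.get(producer_id, -1):
            return False    # a retry of something already written: drop it
        self.last_seq[producer_id] = seq
        self.records.append(payload)
        return True


partition = ToyPartitionLog()
first = partition.append("p-1", 0, "debit 10")   # original send
retry = partition.append("p-1", 0, "debit 10")   # network retry, same sequence number
again = partition.append("p-1", 1, "debit 10")   # application sends the same payload again
```

The retry is dropped, but the second application-level send is written, because from the log's point of view it is a new record. Whether that second "debit 10" is a legitimate event or an upstream bug is something only the application can decide.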

This is a massive achievement in distributed engineering, but it stops at the broker's edge. It provides no protection for the side effects your consumer triggers. If your consumer reads a message and calls a third-party API like Stripe or Twilio, Kafka has no control over that interaction. The atomicity of the broker's internal state does not extend to the external world. Developers frequently confuse the internal consistency of the queue with the external consistency of the entire system.

Managed queues typically implement exactly-once within a strictly defined 'deduplication window.' For SQS FIFO, this is a five-minute deduplication interval keyed on a message deduplication ID. If a network partition lasts for six minutes and the producer retries the send after the interval has expired, the queue accepts the retry as a brand-new message and the 'exactly-once' guarantee vanishes. We are building systems on top of these windows, hoping our failures are always brief and our latency is always low. This is not engineering; it is gambling with production data.
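The window behaviour is easy to see in a toy model. The sketch below is an assumption-laden illustration, not how SQS is implemented: `DedupWindow` is a hypothetical class, time is passed in explicitly as seconds, and the window is measured from the first accepted send.

```python
class DedupWindow:
    """Toy deduplication window keyed on a deduplication ID."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.first_seen = {}  # dedup_id -> time the ID was last accepted

    def accept(self, dedup_id, now):
        first = self.first_seen.get(dedup_id)
        if first is not None and now - first < self.window_seconds:
            return False      # inside the window: treated as a duplicate
        self.first_seen[dedup_id] = now
        return True           # outside the window: treated as a new message


window = DedupWindow(window_seconds=300)
results = [
    window.accept("order-42", now=0),    # original send
    window.accept("order-42", now=120),  # retry after 2 minutes
    window.accept("order-42", now=360),  # retry after 6 minutes
]
```

The two-minute retry is suppressed and the six-minute retry is accepted as a fresh message. The guarantee is a function of how long your outage lasted, which is not a property you control.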

The Ghost ACK Is the Architect of Your Database Corruption

Consider a standard webhook ingest flow where an event triggers an account balance update. The consumer receives the event, calculates the new balance, updates the database, and then sends an ACK to the queue. If the process is interrupted between the database commit and the ACK, the queue will redeliver the message. The second time the message is processed, the balance is updated again. This is silent mutation, where the system appears to be functioning normally but the underlying data is factually incorrect.
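A stripped-down version of that flow shows why nothing alerts. This is a deliberately naive sketch: the in-memory `balances` dict stands in for the database, and `apply_credit`, `audit_log`, and the event fields are hypothetical names.

```python
balances = {"acct-1": 100}
audit_log = []


def apply_credit(event):
    # Non-idempotent handler: a relative update with no memory of event_id.
    balances[event["account"]] += event["amount"]
    audit_log.append(("ok", event["event_id"]))


event = {"event_id": "evt-7", "account": "acct-1", "amount": 25}
apply_credit(event)  # first delivery: the write commits, then the ACK is lost
apply_credit(event)  # redelivery: another valid write, another success
```

Both calls succeed, no exception is raised, and the log contains two healthy-looking entries. The balance is 150 instead of 125, and the only evidence is that the same `event_id` appears twice.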

These errors do not show up in traditional error monitoring because they are technically successful operations. The database performed a valid write, the consumer sent a valid ACK on the second try, and the queue marked the task as complete. Without strict idempotency keys or versioned state checks, these mutations accumulate like plaque in your system. By the time the discrepancy is noticed, the origin of the error is buried under thousands of subsequent valid transactions.

Over-engineering the queue to avoid these scenarios often makes the problem worse. Engineers will add complex coordination layers, distributed locks, and two-phase commits to ensure that a message is only handled once. These layers introduce new failure modes and significantly increase latency. The complexity required to maintain the illusion of singleton delivery is often higher than the complexity required to simply handle duplicates at the application layer. We are building brittle scaffolds to support a lie rather than designing for the inevitable truth of duplication.

Infrastructure Can Never Solve Application-Level Semantic Duplication

There is a difference between a transport-level duplicate and a semantic duplicate. A transport-level duplicate is when the same bytes are delivered twice due to a network retry. A semantic duplicate is when a user clicks a 'submit' button twice, or an upstream service experiences a bug and generates two different events for the same logical action. A message broker's 'exactly-once' feature can only help with the first case, and even then, only under specific conditions.

True safety requires the application to be aware of its own history. This is why idempotency keys are mandatory, not optional. Every incoming event must carry a unique identifier that is checked against a persistent store before any side effects are triggered. If the system has seen the ID before, it should return the previous result without re-executing the logic. This moves the responsibility of consistency from the infrastructure to the application, where it belongs.
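The "seen it before, return the previous result" rule can be sketched as follows. This is a minimal illustration under stated assumptions: `IdempotentExecutor` is a hypothetical class, the dict stands in for a persistent store, and the sketch is not safe under concurrency. In a real system the lookup and the write would need to be atomic.

```python
class IdempotentExecutor:
    """Runs an action at most once per idempotency key and replays the stored result."""

    def __init__(self):
        self.results = {}     # stand-in for a persistent store
        self.executions = 0

    def handle(self, idempotency_key, action):
        if idempotency_key in self.results:
            return self.results[idempotency_key]  # duplicate: replay the result
        result = action()                         # first sighting: run the side effect
        self.executions += 1
        self.results[idempotency_key] = result
        return result


executor = IdempotentExecutor()


def send_receipt():
    return {"status": "sent"}


first = executor.handle("evt-7", send_receipt)
second = executor.handle("evt-7", send_receipt)
```

The caller gets the same answer both times, but the side effect runs once. Returning the stored result rather than an error matters: the sender of a retry usually cannot tell whether its first attempt landed, so a duplicate should look like a success.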

We must stop treating the message queue as a source of truth for execution. The database, with its ACID properties and unique constraints, is the only place where 'exactly-once' can be truly enforced. A unique constraint on a transaction ID column is worth more than ten thousand lines of complex distributed consensus logic. By leaning on the database's ability to reject duplicates, we create a system that is robust regardless of how many times the network decides to replay a message.
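Here is what leaning on the database looks like in practice, using SQLite from the Python standard library. The table and column names (`processed_events`, `accounts`) and the function `apply_credit_once` are assumptions made for illustration; the technique is simply recording the event ID and applying the change in one transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL);
    CREATE TABLE processed_events (event_id TEXT PRIMARY KEY);
    INSERT INTO accounts VALUES ('acct-1', 100);
""")


def apply_credit_once(conn, event_id, account_id, amount):
    """Apply a credit only if event_id has never been recorded. Returns True if applied."""
    try:
        with conn:  # one transaction: both statements commit together or not at all
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)", (event_id,)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, account_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # primary key rejected the duplicate; nothing was changed


applied_first = apply_credit_once(conn, "evt-7", "acct-1", 25)
applied_again = apply_credit_once(conn, "evt-7", "acct-1", 25)
balance = conn.execute("SELECT balance FROM accounts WHERE id = 'acct-1'").fetchone()[0]
```

Because the insert and the update share a transaction, a crash between them leaves neither in place, and a redelivery after a lost ACK hits the primary key and is rejected. The handler can then ACK the duplicate with a clear conscience.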

Relational Constraints Are Cheaper Than Complex Message Orchestration

There is a trend toward moving business logic into the orchestration layer, using event-driven 'sagas' to manage complex flows. While this provides visibility, it often abstracts away the safety of the relational database. Each step in a saga is triggered by an event, and each step is subject to the same redelivery risks. If your saga engine doesn't have built-in, verifiable idempotency, a single failed ACK in the middle of a ten-step process can result in a catastrophic mess of partial states.

Instead of trying to prevent duplicates at the transport level, we should focus on making duplicates harmless. This means using upserts instead of inserts and absolute overwrites instead of relative adjustments: setting a value to X twice is harmless, while adding Y twice is not. If you are updating a user's status, use a state machine that only allows transitions from 'Pending' to 'Active.' If a duplicate 'Activate' message arrives and the user is already 'Active,' the operation should simply do nothing and return success. This is a localized, simple check that provides absolute safety.
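The 'Pending' to 'Active' guard can be expressed as a single conditional update. A minimal sketch using SQLite, assuming a hypothetical `users` table and `activate` function:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO users VALUES ('u-1', 'Pending')")
conn.commit()


def activate(conn, user_id):
    """Move a user from Pending to Active. Safe to call any number of times."""
    with conn:
        cur = conn.execute(
            "UPDATE users SET status = 'Active' WHERE id = ? AND status = 'Pending'",
            (user_id,),
        )
    # rowcount is 1 only for the call that actually performed the transition.
    return cur.rowcount == 1


transitions = [activate(conn, "u-1") for _ in range(3)]
status = conn.execute("SELECT status FROM users WHERE id = 'u-1'").fetchone()[0]
```

Only the first call changes a row; the duplicates match nothing and complete without error. The return value is useful for gating follow-on side effects (for example, sending a welcome email only when the transition actually happened), while the caller can still treat every delivery as a success.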

  • Use database unique constraints to block duplicate event IDs at the point of ingestion.
  • Implement idempotency layers at every service boundary, especially for webhooks.
  • Prefer state-based transitions over additive operations (e.g., set to X, don't add Y).
  • Treat every message broker as an 'at-least-once' system, regardless of what the marketing says.

Architects who prioritize localized state checks over distributed delivery guarantees build more resilient systems. They recognize that the network is an adversary and that 'guaranteed delivery' is a promise that can only be kept by the receiver, not the sender. When we stop trying to over-engineer the pipe, we can spend more time making the endpoints smart enough to handle the reality of a messy, duplicated world.

Engineering Leaders Must Audit for Idempotency Failure Modes

If you are leading an engineering organization, your first task should be to audit your webhook ingest endpoints and high-value event consumers. Ask your team a single question: 'What happens if this message is processed three times?' If the answer involves a manual reconciliation or an incorrect account balance, your architecture is flawed. It does not matter if your message broker is configured for exactly-once; you are still vulnerable to failures in the application-to-broker handshake.
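The "processed three times" question can be turned into an automated check. The helper below is a sketch, not a prescribed tool: `survives_replay` and both example handlers are hypothetical, and it assumes a handler can be run against a copy of its state in a test.

```python
import copy


def survives_replay(handler, initial_state, message, times=3):
    """Return True if handling the message `times` times equals handling it once."""
    once = copy.deepcopy(initial_state)
    handler(once, message)

    replayed = copy.deepcopy(initial_state)
    for _ in range(times):
        handler(replayed, message)

    return once == replayed


def additive_handler(state, msg):
    state["balance"] += msg["amount"]      # no duplicate check


def keyed_handler(state, msg):
    if msg["event_id"] in state["seen"]:
        return                             # duplicate: do nothing
    state["seen"].add(msg["event_id"])
    state["balance"] += msg["amount"]


state = {"balance": 100, "seen": set()}
msg = {"event_id": "evt-7", "amount": 25}
additive_ok = survives_replay(additive_handler, state, msg)
keyed_ok = survives_replay(keyed_handler, state, msg)
```

The additive handler fails the check and the keyed handler passes. Running a test of this shape against every high-value consumer gives the audit a concrete, repeatable answer instead of an opinion.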

The goal of a senior engineer is to design systems that are correct by construction. Relying on the 'exactly-once' setting of a cloud service is an example of 'correct by configuration,' which is a fragile state. Configuration can be changed by a junior engineer or bypassed during a platform migration. In-code idempotency and database-level constraints are persistent and enforceable. They represent a commitment to data integrity that survives infrastructure changes.

We must reject the cynicism of assuming everything will fail while simultaneously rejecting the optimism that the infrastructure will save us. The middle ground is a design pattern where the application assumes the worst of the network. This mindset shift reduces the cognitive load on developers. They no longer have to wonder if a duplicate is possible; they simply write code that doesn't care if a duplicate occurs. This is the only way to build distributed SaaS products that maintain integrity at scale.

Stop paying the complexity tax for message brokers that promise to solve the Two Generals' Problem. They haven't solved it; they have simply hidden the failure cases in the fine print of their documentation. Move your logic back to the database, enforce your constraints at the edge, and treat every event as a potential duplicate. Only then will your system be truly safe from the silent mutation of the exactly-once lie.
