"exactly once" is not possible, but you can do "at least once" and make the processing of the message idempotent or "de-duplicated". This is true of many messaging systems.
> Exactly-once Semantics are Possible: Here’s How Kafka Does it
> Now, I know what some of you are thinking. Exactly-once delivery is impossible, it comes at too high a price to put it to practical use, or that I’m getting all this entirely wrong! You’re not alone in thinking that.
blah blah blah ... of course it's at-least-once, with the "idempotent producer" so only one resulting message is published back to another kafka stream. Big surprise.
Now many people think "kafka has exactly-once delivery, that's what I want, I don't want to have to deal with this at-least-once stuff" when really it's the same thing. Others have been doing idempotent operations of various kinds for years, and the user still has to figure out how to do their desired thing (which might not be sending one more kafka message) in a mostly idempotent way.
> Is this Magical Pixie Dust I can sprinkle on my application?
> No, not quite. Exactly-once processing is an end-to-end guarantee and the application has to be designed to not violate the property as well. If you are using the consumer API, this means ensuring that you commit changes to your application state concordant with your offsets as described here.
I think that is a pretty clear statement that end-to-end exactly once semantics doesn't come for free. It states that there needs to be additional application logic to achieve this and also specifies what must be done.
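The "commit changes to your application state concordant with your offsets" idea can be made concrete with a small sketch. This is a hypothetical illustration (table layout, names, and the in-memory SQLite store are my assumptions, not Kafka's consumer API): the state change and the offset record land in the same database transaction, so a redelivered message can be detected and skipped.

```python
import sqlite3

# Hypothetical sketch: keep application state (a counter) and the last
# processed offset in one database row, updated in one transaction, so a
# crash between "update state" and "commit offset" cannot double-count.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE progress (id INTEGER PRIMARY KEY, counter INTEGER, last_offset INTEGER)"
)
db.execute("INSERT INTO progress VALUES (1, 0, -1)")
db.commit()

def process(offset: int) -> None:
    """Apply one message's effect and record its offset atomically."""
    (last,) = db.execute("SELECT last_offset FROM progress WHERE id = 1").fetchone()
    if offset <= last:
        return  # already processed: a redelivery, so skip it
    db.execute(
        "UPDATE progress SET counter = counter + 1, last_offset = ? WHERE id = 1",
        (offset,),
    )
    db.commit()  # state change and offset commit land together

# Offset 1 is delivered twice; the second delivery is a no-op.
for off in [0, 1, 1, 2]:
    process(off)

counter, last = db.execute(
    "SELECT counter, last_offset FROM progress WHERE id = 1"
).fetchone()
```

The same shape works with any transactional store; the point is only that the offset and the state must move together.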
Right. But that's at the end of a separate article, while the post which this HN discussion is about throws around the words "exactly once" a lot more casually. The argument is over the use of the words "exactly once". They should just refer to the feature as "transactions" or "idempotent producer".
How do you handle idempotence for at least once delivery? Say I have an event counter of some sort. How would you track whether or not an event was handled in a way where you wouldn't increment the counter multiple times if the event was delivered multiple times?
kafka doesn't give you a counter like that either, I don't think.
What you could do though, is keep a "set" data structure in a replicated redis or some other database, where each member of the set is the id of the message which incremented that count. Duplicates are naturally ignored when adding to the set. The count is the size of the set.
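As a runnable sketch of that set-based deduplicating counter: a real deployment would use a replicated redis (SADD/SCARD), but a plain Python set stands in here so the idea is self-contained. The message ids are made up.

```python
# Sketch of the set-based deduplicating counter: duplicates are ignored
# because adding an existing member to a set is a no-op, and the count is
# just the set's size. A replicated redis SADD/SCARD plays this role in a
# real system; a plain set stands in here.
seen_ids = set()

def record(message_id: str) -> int:
    """Add a message id; duplicates are ignored. Returns the current count."""
    seen_ids.add(message_id)   # SADD in redis terms
    return len(seen_ids)       # SCARD: the count is the set's size

for mid in ["a1", "b2", "a1", "c3"]:   # "a1" is redelivered once
    count = record(mid)
```

Despite four deliveries, the count ends at three.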
The message might also have a "time-generated" embedded in it. You could group by hour, and after some number of hours assume that no more messages will be re-delivered from that long ago, and consolidate that set down to a single number.
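A sketch of that hour-bucketed consolidation, with all names and the window length being my assumptions: ids are deduplicated per hour-bucket, and buckets older than the redelivery window collapse down to a plain integer, dropping the per-id memory.

```python
from collections import defaultdict

# Hedged sketch: per-hour sets of message ids, consolidated to a single
# number once the hour is older than the assumed redelivery window. After
# consolidation we assume no message that old arrives, and drop it if one does.
REDELIVERY_WINDOW_HOURS = 6

buckets: dict = defaultdict(set)   # hour -> ids still being deduplicated
consolidated: dict = {}            # hour -> final count, set discarded

def record(message_id: str, hour: int) -> None:
    if hour in consolidated:
        return  # past the window: assume it's a stale redelivery, drop it
    buckets[hour].add(message_id)

def consolidate(current_hour: int) -> None:
    """Collapse sets older than the window down to a single number."""
    old = [h for h in buckets if h <= current_hour - REDELIVERY_WINDOW_HOURS]
    for hour in old:
        consolidated[hour] = len(buckets.pop(hour))

def total() -> int:
    return sum(len(s) for s in buckets.values()) + sum(consolidated.values())

record("m1", hour=0)
record("m2", hour=0)
record("m1", hour=0)          # duplicate within the window: ignored
consolidate(current_hour=10)  # hour 0 collapses to the number 2
record("m3", hour=10)         # a fresh hour still deduplicates by id
```

The memory cost per old hour drops from one set entry per message to one integer.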
Or maybe that's too much overhead because there are millions of messages per hour. Maybe in that case you can afford the occasional double-count and don't need all this. Trade-offs will be unique for each situation, and I don't think kafka will completely take care of this kind of thing for you.
Kafka does give you a counter; it's in Neha's blog post. Kafka's idempotent producer registers itself with the brokers with a unique producer id and includes a sequence number with each message it sends. The brokers simply keep track of producer id + the highest sequence number they've seen, and return an ack. If the ack is lost, the producer retries, and the broker knows to deduplicate the message and re-send the ack. In the event counter example this would ensure that each event increments the counter exactly once, even in the face of multiple failures.
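That broker-side bookkeeping can be sketched in a few lines. To be clear, this is an illustration of the idea, not Kafka's actual implementation; all class and method names are invented.

```python
# Illustrative sketch of the broker-side dedup described above: per producer
# id, remember the highest sequence number already appended. A retried send
# with an old sequence number is acked again but not re-appended to the log.
class Broker:
    def __init__(self):
        self.highest_seq = {}   # producer_id -> highest sequence appended
        self.log = []           # the partition's message log

    def append(self, producer_id: str, seq: int, payload: str) -> str:
        if seq <= self.highest_seq.get(producer_id, -1):
            return "ack"        # duplicate retry: deduplicate, just re-ack
        self.highest_seq[producer_id] = seq
        self.log.append(payload)
        return "ack"

broker = Broker()
broker.append("p1", 0, "event-A")
broker.append("p1", 1, "event-B")
broker.append("p1", 1, "event-B")   # the ack was lost, so the producer retried
```

The retry reaches the log zero extra times, which is what makes the downstream counter safe.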
Relying on another distributed system to ensure the first distributed system isn't duplicating messages sounds like a headache I wouldn't wish on anyone. (I am a Confluent employee).
OK, so you're right, it seems "brokers" are little databases (maybe just for counters?). In this case the broker acts as the separate de-duplication system I described. I'm much more familiar with a system that does not provide order guarantees (and as a result doesn't need "partitioning" or "re-partitioning" for multiple consumers). But with Kafka, where order is guaranteed, a simpler mechanism is possible: keep the count and the message sequence number together, sometimes update both the count and the seq, sometimes just the seq, and only update if the new seq is prev+1. And this is built into the broker.
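That "keep the count and the seq together" mechanism fits in a few lines. This is a sketch of the idea as stated, with invented names; it relies entirely on the ordering guarantee, since anything other than prev+1 is treated as a duplicate.

```python
# Sketch: one record holds both the count and the last-applied sequence
# number, and an update is applied only when the incoming seq is exactly
# prev + 1. Ordering is assumed guaranteed, so any other seq is a duplicate.
state = {"count": 0, "seq": -1}

def handle(seq: int, increments: bool) -> None:
    if seq != state["seq"] + 1:
        return                   # redelivery of something already applied
    if increments:
        state["count"] += 1      # sometimes update both the count and the seq
    state["seq"] = seq           # sometimes just the seq advances

handle(0, increments=True)
handle(1, increments=False)
handle(1, increments=False)   # redelivery: seq is not prev+1, ignored
handle(2, increments=True)
```

Because both fields live in one record and are updated together, there is no window where the count moved but the seq didn't.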
But you still need to understand how this works to do "kafka transactions" and you still need some other scheme to get effects/actions outside of kafka. (And you'll probably get people doing dumb stuff saying "I was told I get exactly-once delivery")
https://www.confluent.io/blog/exactly-once-semantics-are-pos...