"exactly once" is not possible, but you can do "at least once" and make the processing of the message idempotent or "de-duplicated". This is true of many messaging systems.
> Exactly-once Semantics are Possible: Here’s How Kafka Does it
> Now, I know what some of you are thinking. Exactly-once delivery is impossible, it comes at too high a price to put it to practical use, or that I’m getting all this entirely wrong! You’re not alone in thinking that.
blah blah blah ... of course it's at-least-once, with the "idempotent producer" so only one resulting message is published back to another kafka stream. Big surprise.
Now many people think "kafka has exactly-once delivery, that's what I want, I don't want to have to deal with this at-least-once stuff" when really it's the same thing. Others have been doing idempotent operations of various kinds for years, and the user still has to figure out how to do their desired thing (which might not be sending one more kafka message) in a mostly idempotent way.
> Is this Magical Pixie Dust I can sprinkle on my application?
> No, not quite. Exactly-once processing is an end-to-end guarantee and the application has to be designed to not violate the property as well. If you are using the consumer API, this means ensuring that you commit changes to your application state concordant with your offsets as described here.
I think that is a pretty clear statement that end-to-end exactly once semantics doesn't come for free. It states that there needs to be additional application logic to achieve this and also specifies what must be done.
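The "commit changes to your application state concordant with your offsets" idea can be made concrete with a small sketch. This is a hypothetical illustration (table layout, names, and the in-memory SQLite store are my assumptions, not Kafka's consumer API): the state change and the offset record land in the same database transaction, so a redelivered message can be detected and skipped.

```python
import sqlite3

# Hypothetical sketch: keep application state (a counter) and the last
# processed offset in one database row, updated in one transaction, so a
# crash between "update state" and "commit offset" cannot double-count.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE progress (id INTEGER PRIMARY KEY, counter INTEGER, last_offset INTEGER)"
)
db.execute("INSERT INTO progress VALUES (1, 0, -1)")
db.commit()

def process(offset: int) -> None:
    """Apply one message's effect and record its offset atomically."""
    (last,) = db.execute("SELECT last_offset FROM progress WHERE id = 1").fetchone()
    if offset <= last:
        return  # already processed: a redelivery, so skip it
    db.execute(
        "UPDATE progress SET counter = counter + 1, last_offset = ? WHERE id = 1",
        (offset,),
    )
    db.commit()  # state change and offset commit land together

# Offset 1 is delivered twice; the second delivery is a no-op.
for off in [0, 1, 1, 2]:
    process(off)

counter, last = db.execute(
    "SELECT counter, last_offset FROM progress WHERE id = 1"
).fetchone()
```

The same shape works with any transactional store; the point is only that the offset and the state must move together.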
Right. But that's at the end of a separate article, while the post which this HN discussion is about throws around the words "exactly once" a lot more casually. The argument is over the use of the words "exactly once". They should just refer to the feature as "transactions" or "idempotent producer".
How do you handle idempotence for at least once delivery? Say I have an event counter of some sort. How would you track whether or not an event was handled in a way where you wouldn't increment the counter multiple times if the event was delivered multiple times?
kafka doesn't give you a counter like that either, I don't think.
What you could do though, is keep a "set" data structure in a replicated redis or some other database, where each member of the set is the id of the message which incremented that count. Duplicates are naturally ignored when adding to the set. The count is the size of the set.
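As a runnable sketch of that set-based deduplicating counter: a real deployment would use a replicated redis (SADD/SCARD), but a plain Python set stands in here so the idea is self-contained. The message ids are made up.

```python
# Sketch of the set-based deduplicating counter: duplicates are ignored
# because adding an existing member to a set is a no-op, and the count is
# just the set's size. A replicated redis SADD/SCARD plays this role in a
# real system; a plain set stands in here.
seen_ids = set()

def record(message_id: str) -> int:
    """Add a message id; duplicates are ignored. Returns the current count."""
    seen_ids.add(message_id)   # SADD in redis terms
    return len(seen_ids)       # SCARD: the count is the set's size

for mid in ["a1", "b2", "a1", "c3"]:   # "a1" is redelivered once
    count = record(mid)
```

Despite four deliveries, the count ends at three.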
The message might also have a "time-generated" embedded in it. You could group by hour, and after some number of hours assume that no more messages will be re-delivered from that long ago, and consolidate that set down to a single number.
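A sketch of that hour-bucketed consolidation, with all names and the window length being my assumptions: ids are deduplicated per hour-bucket, and buckets older than the redelivery window collapse down to a plain integer, dropping the per-id memory.

```python
from collections import defaultdict

# Hedged sketch: per-hour sets of message ids, consolidated to a single
# number once the hour is older than the assumed redelivery window. After
# consolidation we assume no message that old arrives, and drop it if one does.
REDELIVERY_WINDOW_HOURS = 6

buckets: dict = defaultdict(set)   # hour -> ids still being deduplicated
consolidated: dict = {}            # hour -> final count, set discarded

def record(message_id: str, hour: int) -> None:
    if hour in consolidated:
        return  # past the window: assume it's a stale redelivery, drop it
    buckets[hour].add(message_id)

def consolidate(current_hour: int) -> None:
    """Collapse sets older than the window down to a single number."""
    old = [h for h in buckets if h <= current_hour - REDELIVERY_WINDOW_HOURS]
    for hour in old:
        consolidated[hour] = len(buckets.pop(hour))

def total() -> int:
    return sum(len(s) for s in buckets.values()) + sum(consolidated.values())

record("m1", hour=0)
record("m2", hour=0)
record("m1", hour=0)          # duplicate within the window: ignored
consolidate(current_hour=10)  # hour 0 collapses to the number 2
record("m3", hour=10)         # a fresh hour still deduplicates by id
```

The memory cost per old hour drops from one set entry per message to one integer.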
Or maybe that's too much overhead because there are millions of messages per hour. Maybe in that case you can afford the occasional double-count and don't need all this. Trade-offs will be unique for each situation, and I don't think kafka will completely take care of this kind of thing for you.
Kafka does give you a counter; it's in Neha's blog post. Kafka's idempotent producer registers itself with the brokers with a unique producer id and includes a sequence number with each message it sends. The brokers simply keep track of producer id + the highest sequence number they've seen, and return an ack. If the ack is lost, the producer retries, and the broker knows to deduplicate the message and re-send the ack. In the event counter example this would ensure that each event increments the counter exactly once, even in the face of multiple failures.
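That broker-side bookkeeping can be sketched in a few lines. To be clear, this is an illustration of the idea, not Kafka's actual implementation; all class and method names are invented.

```python
# Illustrative sketch of the broker-side dedup described above: per producer
# id, remember the highest sequence number already appended. A retried send
# with an old sequence number is acked again but not re-appended to the log.
class Broker:
    def __init__(self):
        self.highest_seq = {}   # producer_id -> highest sequence appended
        self.log = []           # the partition's message log

    def append(self, producer_id: str, seq: int, payload: str) -> str:
        if seq <= self.highest_seq.get(producer_id, -1):
            return "ack"        # duplicate retry: deduplicate, just re-ack
        self.highest_seq[producer_id] = seq
        self.log.append(payload)
        return "ack"

broker = Broker()
broker.append("p1", 0, "event-A")
broker.append("p1", 1, "event-B")
broker.append("p1", 1, "event-B")   # the ack was lost, so the producer retried
```

The retry reaches the log zero extra times, which is what makes the downstream counter safe.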
Relying on another distributed system to ensure the first distributed system isn't duplicating messages sounds like a headache I wouldn't wish on anyone. (I am a Confluent employee).
OK, so you're right, it seems "brokers" are little databases (maybe just for counters?). In this case the broker acts as the separate de-duplication system I described. I'm much more familiar with a system that does not provide order guarantees (and as a result doesn't need "partitioning" or "re-partitioning" for multiple consumers). But with Kafka, where order is guaranteed, a simpler mechanism is possible: keep the count and the message sequence number together, sometimes update both the count and the seq, sometimes just the seq, and only update if the new seq is prev+1. And this is built into the broker.
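That "keep the count and the seq together" mechanism fits in a few lines. This is a sketch of the idea as stated, with invented names; it relies entirely on the ordering guarantee, since anything other than prev+1 is treated as a duplicate.

```python
# Sketch: one record holds both the count and the last-applied sequence
# number, and an update is applied only when the incoming seq is exactly
# prev + 1. Ordering is assumed guaranteed, so any other seq is a duplicate.
state = {"count": 0, "seq": -1}

def handle(seq: int, increments: bool) -> None:
    if seq != state["seq"] + 1:
        return                   # redelivery of something already applied
    if increments:
        state["count"] += 1      # sometimes update both the count and the seq
    state["seq"] = seq           # sometimes just the seq advances

handle(0, increments=True)
handle(1, increments=False)
handle(1, increments=False)   # redelivery: seq is not prev+1, ignored
handle(2, increments=True)
```

Because both fields live in one record and are updated together, there is no window where the count moved but the seq didn't.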
But you still need to understand how this works to do "kafka transactions" and you still need some other scheme to get effects/actions outside of kafka. (And you'll probably get people doing dumb stuff saying "I was told I get exactly-once delivery")
https://www.confluent.io/blog/exactly-once-semantics-are-pos...