Well I don’t get it. The OP never mentions that the reason there’s a 2PC is beca...

magicalhippo · on Jan 26, 2019

I don't see how the scheme works without the workers sending acks back.

First an ack to confirm the transaction request was received and durably logged (in case it needs to replay it when recovering from hardware issues), otherwise the client has no way of knowing it has to resend the job to the crashed worker. Once the client gets the first ack, it is certain the worker will move to the next phase.

I presume transactions would have a unique transaction id, so that the workers can easily identify duplicate transaction requests in case the client resubmitted due to timeout, but the worker was just slow to send the first ack.

Second an ack that confirms that the worker has applied the transaction, or a nack in case any constraints were violated.

The key point is that the algorithm guarantees that the workers will be unified in their second response. If the client gets one nack back from a worker, it can be certain it will only receive nacks from the remaining workers, and that the whole transaction was aborted.

The data fields in the transaction request has to be versioned, so that the remote reads are consistent across the workers. This also makes it easy for the client to regenerate a transaction request should it need to, I suppose.

So from the client's POV they generate the transaction request, send it to the relevant workers, with a retry loop in case of timeouts until first ack is received. Once all workers have ack'ed it is assured all workers will unanimously either apply or reject the transaction, so it waits for the answer from the first worker.

Once the client receives the transaction-applied ack/nack from the first worker, the client is thus free to continue working under the assumption the transaction is either applied or rejected, respectively, and issue further transaction requests.

At least that's my understanding of how it would work. Not my field, and it's late, so possibly I'm all wrong.

angry_octet · on Jan 26, 2019

It sounds like it commits in the foreground, but it can fail due to contention -- in which case nothing happens. So the client might have to retry indefinitely, absent some extra complexity for gaining exclusive access (a lock). It would still have to have a system for retiring transactions, at which point the client knows it went through. You would still have to wait for all the shards to respond.

At least, that's what I think they are describing, it's all a bit hand wavey and suspicious.