> Logging filesystems are beautifully elegant. With a checksum, we can easily detect power loss and fall back to the previous state by ignoring failed appends.
An edge case when designing a log-structured file system is that a corrupt checksum in a log entry could actually mean one of three things: power loss while writing the last transaction, a misdirected write or bitrot, or a misdirected read.
If you simply read the log up until the corrupt checksum, assuming this can only mean one thing, then you might fall too far back and lose data that has already been acknowledged to the user as synced. This could in turn break the guarantees required for implementing things like Paxos, and ultimately wreak havoc with a distributed system.
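A toy sketch of that naive recovery policy (the names and framing here are mine, not from any particular filesystem): replay entries until the first bad checksum and discard everything after it. Note how a bit flip in a middle entry silently drops acknowledged data written after it.

```python
import zlib

def append(log, payload: bytes):
    # Each entry carries its payload plus a CRC32 over that payload.
    log.append((payload, zlib.crc32(payload)))

def naive_replay(log):
    # Accept entries until the first checksum mismatch, then stop.
    # This cannot tell a torn final append from corruption of an older
    # entry, so mid-log bitrot discards every entry written after it.
    recovered = []
    for payload, crc in log:
        if zlib.crc32(payload) != crc:
            break
        recovered.append(payload)
    return recovered
```

Corrupting the second of three entries here makes recovery return only the first, even though the third was written (and possibly acknowledged) intact.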
To detect power loss, you need more than just a checksum. You still need the checksum, but you also need a way to "snap" the new log tail onto the old log tail in a two-phase commit. So you write out all your data, then you write out your new log tail, then you sync; then you snap your new log tail on using two atomic sector writes, one of which needs to be synced.
The number of syncs is the same as before; you haven't actually added any additional syncs. But this way you ensure that your log references data which exists, and you can now tell the difference between a power failure and a corrupt log entry.
A power failure is recoverable (roll back), but a corrupt log entry is unrecoverable (assuming no log entry replicas) and means the file system is unmountable.
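Here is a toy model of that two-phase commit (my own sketch; the atomic sector writes and syncs are reduced to comments, and a commit pointer stands in for the "snap" described above). The key property is that recovery can now distinguish an interrupted append from real corruption:

```python
import zlib

class Log:
    def __init__(self):
        self.entries = []   # (payload, crc) records as laid out on disk
        self.committed = 0  # commit pointer, updated by one atomic write

    def append_and_commit(self, payloads):
        # Phase 1: write the new entries and sync. They are not yet
        # reachable, so a crash here just leaves ignorable garbage.
        for p in payloads:
            self.entries.append((p, zlib.crc32(p)))
        # (sync here)
        # Phase 2: "snap" the new tail on with a single atomic pointer
        # update, then sync. A crash before this point rolls back cleanly.
        self.committed = len(self.entries)
        # (sync here)

    def recover(self):
        # Entries past the commit pointer are an interrupted append:
        # roll back. A bad checksum *inside* the committed region is
        # real corruption, not power loss, and is fatal.
        for p, crc in self.entries[:self.committed]:
            if zlib.crc32(p) != crc:
                raise IOError("committed entry corrupt: unmountable")
        return [p for p, _ in self.entries[:self.committed]]
```

With this split, entries written but never snapped on are simply ignored at mount, while a checksum failure before the commit pointer is correctly reported as corruption rather than silently rolled back.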
I'm not sure I'm following this snap operation. What is a "log tail" in this context? Does the snap operation still work if we are copying over entries of the log lazily and we don't have a definite end-of-log?
You are right, though: a checksum used this way provides no protection against bit errors or misdirected writes/reads. But if you assume no bit errors, a checksum can provide power-loss protection as long as there's a fallback.
But doesn't this just move the problem somewhere else? Yes. In this case it moves error detection onto the block device. But it turns out performing error detection/correction at the block device level is simpler and more effective. Most NAND flash components even have built-in ECC hardware for this specific purpose.
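One way to picture that division of labor (a sketch of the idea, not of any particular controller): if the block layer verifies a per-block checksum or ECC on every read, silent bitrot can never reach the log layer, so the log's checksum is only ever tripped by a torn append. The `SPARE` out-of-band bytes below mimic the spare area where NAND keeps its ECC:

```python
import zlib

SPARE = 4  # out-of-band bytes per block, like the spare area on NAND

def block_write(dev, lba, data: bytes):
    # Store the data together with an out-of-band CRC, the way NAND
    # controllers keep ECC alongside each page.
    dev[lba] = data + zlib.crc32(data).to_bytes(SPARE, "little")

def block_read(dev, lba):
    raw = dev[lba]
    data, crc = raw[:-SPARE], int.from_bytes(raw[-SPARE:], "little")
    if zlib.crc32(data) != crc:
        # Real ECC hardware would try to correct the error first; the
        # point here is just that corrupt data is never returned silently.
        raise IOError(f"uncorrectable error in block {lba}")
    return data
```

Upper layers reading through this interface see either good data or a hard error, never silent corruption, which is exactly what lets the log layer interpret its own checksum failures as power loss.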
This is not my field of expertise, so forgive me if I'm mistaken, but isn't 'Ohad Rodeh B-trees'[1] a simpler and more elegant solution than a journal/WAL?
As far as I know it's already used in Linux's Btrfs and in LMDB, and I wonder, if they were designing this from scratch, why they didn't go for it in the first place. Familiarity, perhaps?
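For what it's worth, the appeal of the Rodeh-style copy-on-write approach can be sketched in a few lines (a toy binary tree standing in for a real B-tree, my own illustration): an update copies only the path from leaf to root, and the single atomic swap of the root pointer is the commit, so there is no separate journal to replay.

```python
class Node:
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value, self.left, self.right = key, value, left, right

def cow_insert(node, key, value):
    # Never mutate an existing node: return a fresh copy of the
    # search path, sharing every untouched subtree with the old tree.
    if node is None:
        return Node(key, value)
    if key < node.key:
        return Node(node.key, node.value,
                    cow_insert(node.left, key, value), node.right)
    if key > node.key:
        return Node(node.key, node.value,
                    node.left, cow_insert(node.right, key, value))
    return Node(key, value, node.left, node.right)

# The "commit" is a single atomic root-pointer swap: readers see either
# the old root or the new one, never a half-applied update.
root = cow_insert(None, 2, "b")
new_root = cow_insert(root, 1, "a")  # old root remains fully intact
```

A crash at any point leaves the old root valid, which is why a COW tree gets crash consistency without a WAL, at the cost of rewriting the path to the root on every update.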
By the way, I have code that deals with the SQLite Btree directly, and reading your comment I now understand why there's a need for a two-phase commit, as expressed in:
> An edge case when designing a log-structured file system is that a corrupt checksum in a log entry could actually mean one of three things: power loss while writing the last transaction, a misdirected write or bitrot, or a misdirected read.
>
> If you simply read the log up until the corrupt checksum, assuming this can only mean one thing, then you might fall too far back and lose data that has already been acknowledged to the user as synced. This could in turn break the guarantees required for implementing things like Paxos, and ultimately wreak havoc with a distributed system.
>
> To detect power loss, you need more than just a checksum. You still need the checksum, but you also need a way to "snap" the new log tail onto the old log tail in a two-phase commit. So you write out all your data, then you write out your new log tail, then you sync; then you snap your new log tail on using two atomic sector writes, one of which needs to be synced.
>
> The number of syncs is the same as before; you haven't actually added any additional syncs. But this way you ensure that your log references data which exists, and you can now tell the difference between a power failure and a corrupt log entry.
>
> A power failure is recoverable (roll back), but a corrupt log entry is unrecoverable (assuming no log entry replicas) and means the file system is unmountable.