Btrfs does report most of this, tracked per device and stored in the device tree; see `btrfs device stats`. The counters can be reset (`-z`). They are cumulative, so if you have one defect and the affected file is read three times, the respective counter increments by three, assuming a persistent error.
The man page includes definitions of the 5 kinds of errors tracked.
The article mentions structural errors; this sounds like the detection of an inconsistency. These aren't counted on btrfs, but they are logged. Any time the read-time or write-time tree checker finds a problem, the filesystem goes read-only to prevent (further) confusion from ending up on disk. These are exceptionally rare; I've never seen one on any of my filesystems, though I have seen it catch things like bitflips due to bad RAM, i.e. the checksum was computed correctly over already-corrupted (meta)data.
Actually, I have seen a write-time tree checker error on one of my file systems. I even reported the bug :) So it also has the benefit of ratting out btrfs itself when it's the cause. The file system did go read-only, corruption didn't make it to disk, and it was fine following a reboot (it was the root file system so I couldn't unmount then mount, and this kind of confusion can't be cleared with a remount).
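A toy sketch of why checksums can't catch this class of error (Python, hypothetical block store, not btrfs code): the checksum is computed over whatever bytes are in RAM at write time, so data corrupted *before* checksumming verifies perfectly on read-back.

```python
import hashlib

def write_block(store, key, data):
    # Checksum is computed at write time -- over whatever bytes we were
    # handed, corrupted or not.
    store[key] = (hashlib.sha256(data).digest(), data)

def read_block(store, key):
    csum, data = store[key]
    # The only thing a checksum asserts: we read back the bytes written.
    assert hashlib.sha256(data).digest() == csum, "media corruption"
    return data

store = {}
good = b"\x01\x02\x03\x04"
flipped = bytes([good[0] ^ 0x10]) + good[1:]   # bit flipped in bad RAM pre-write
write_block(store, "leaf0", flipped)
assert read_block(store, "leaf0") == flipped   # checksum passes; data still wrong
```

Only a validity check on the *contents* (a tree checker, or a fsck) can notice that the bytes, while faithfully stored, are nonsense.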
It's surprising how many people I've found who can't conceive of what this post calls a "structural error" in modern file systems such as ZFS. They have this idea that checksums and scrubbing make ZFS invulnerable to corruption, and thus that a fsck makes no sense for ZFS. That probably rings true for many because ZFS in fact has no fsck, but that doesn't make this kind of corruption any less real.
scrub is exactly the "zfs fsck" that people claim they pine for. It is part of the file system's design philosophy that every administrative action happens on online pools, which even includes correcting errors in the file system. Making it so you don't have to take your system offline is a feature, not a bug.
I think the difficulty in these conversations is that when people say they want a "fsck for ZFS", they often don't mean "periodic sanity checks for ZFS"; they mean something like "xfs_repair -d* for ZFS" or "extundelete for ZFS" - that is, a tool for salvaging data from beyond-the-pale mangling that leaves your pool unable to be imported.
* - not a perfect analogy, but I'm hard-pressed to think of good off-the-shelf tools for what they're looking for. I guess reiserfsck's infamously side-effect-laden --rebuild-tree would probably be closest...
(I am acquainted with zdb -r and import -T; neither helps you if there isn't enough consistent metadata to get enough of a pool structure in memory to 'import', though one could still conceivably salvage some data in that case.)
The issue is, I think, people are expressing a desire for a tool that can still salvage data even when you've gone through all the (128/N) -T options and found them all unhelpful.
Unfortunately ZFS scrubs are not as complete as fsck on a regular filesystem. ZFS scrubs only verify that checksums are intact. They don't verify that filesystem level metadata is correct (although they do verify ZFS structural metadata as part of walking everything, which isn't the same thing). For example, a ZFS scrub will not detect that a filesystem inode has certain sorts of crazy or invalid contents, or damaged ACLs. It doesn't even necessarily verify that the filesystem directory structure is correct and intact.
(The tl;dr is that a fsck on an ordinary filesystem has to walk the directory tree to find everything. However, ZFS maintains a separate list of active inodes and a scrub can just walk over them and check the checksums of all of their data blocks. It doesn't have to, for example, read a directory's contents to find further files to scrub.)
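To make the difference in coverage concrete, here is a toy model (Python, hypothetical names, not ZFS code): the "scrub" checksums every inode's data without ever interpreting directory contents, while a fsck-style pass walks the namespace and checks referential integrity. A dangling directory entry passes the first and fails the second.

```python
import hashlib

# Toy model: a "scrub" walks the inode list and verifies data checksums;
# a fsck-style pass walks the directory tree and checks references instead.
inodes = {}       # ino -> (checksum, data)
directory = {}    # name -> ino

def put(ino, data):
    inodes[ino] = (hashlib.sha256(data).digest(), data)

put(1, b"root dir payload")
put(2, b"file payload")
directory["a.conf"] = 2
directory["b.log"] = 99   # dangling entry: inode 99 was never created

def scrub():
    # Checksum every inode's data -- the scrub's whole job in this model.
    return all(hashlib.sha256(d).digest() == c for c, d in inodes.values())

def fsck_dirs():
    # Directory entries must point at inodes that actually exist.
    return [name for name, ino in directory.items() if ino not in inodes]

assert scrub()                    # every checksum is intact...
assert fsck_dirs() == ["b.log"]   # ...yet the namespace is broken
```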
No, it is not. A ZFS scrub only checks that the checksums are valid. A scrub will not check in detail for what this post calls "structural error" (it only finds errors encountered during its normal walk of the structures, which is not very detailed; a detailed file system check is so resource-intensive that some fsck implementations have a "low memory" mode just to avoid OOM).
The checksum/mirroring mechanisms cannot fix a structural error when the filesystem is doing something and finds an inconsistency.
ZFS chose to not have a fsck out of pure arrogance, not because scrub is a proper substitute. ZFS developers believed that corruption bugs produced by code can be fixed by providing Bug Free Code (tm). That, and the fact that errors due to media corruption will be fixed with checksums and mirroring, made them believe that they could make fsck a thing of the past. Other modern file systems mimicking ZFS are developing a fsck, despite having scrub-like functionality.
...as I said, and your reply proves again:
> It's surprising how many people I've found who can't conceive of what this post calls a "structural error" in modern file systems such as ZFS
People _really_ want to believe ZFS has some kind of magic.
Your post, though, doesn't provide a single example of a "structural error" that could occur in ZFS and wouldn't be fixed by the checksums and redundant data.
What's your point here? That ZFS can't correct logic bugs in its own implementation that would lead to structural errors? (What system could?)
I don't have experience with ZFS, but in btrfs you can get "corrupt leaf" on a FS that passes scrub. This probably happens either because of a bug, or because the structures got corrupted (e.g. bad RAM) as they were constructed, before the checksum (of the already-wrong data) was calculated. This is expected behavior in a sense: the checksums only assert that we read from the device what was previously written to it, but they have absolutely no relation to constraints like "the blocks form a valid B-tree".
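A minimal sketch of that last point (Python toy, invariant chosen for illustration): a leaf whose keys were scrambled in memory before the write still checksums perfectly, but a tree-checker-style invariant ("keys within a leaf are sorted") flags it as corrupt.

```python
import hashlib

def checksum_ok(blob, csum):
    return hashlib.sha256(blob).digest() == csum

def leaf_keys_ordered(keys):
    # One tree-checker-style invariant: keys within a leaf must be sorted.
    return all(a < b for a, b in zip(keys, keys[1:]))

keys = [10, 42, 17]                   # scrambled in memory before the write
blob = repr(keys).encode()
csum = hashlib.sha256(blob).digest()  # checksum of the already-bad leaf

assert checksum_ok(blob, csum)        # scrub-level check: passes
assert not leaf_keys_ordered(keys)    # structural check: "corrupt leaf"
```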
Another example, which I have here right now, is a directory where `ls` shows the following:
    # l pg_stat_tmp/
    ls: cannot access 'pg_stat_tmp/global.stat': No such file or directory
    [...]
    -????????? ? ? ? ? ? db_0.stat
The btrfsck says stuff like "parent transid verify failed" and "ERROR: child eb corrupted". scrub finishes without errors.
> That ZFS can't correct logic bugs in it's own implementation that would lead to structural errors?
Again, I can't speak for ZFS, but the problem with btrfs is that in ext4, for example, you have a fsck that will fix such errors (sometimes losing the affected files). In btrfs, the fsck is mostly "beta, do not use", and it can't fix this - you just move the data elsewhere, create the FS from scratch, move the data back, and hope it won't happen again.
One of the more insidious types of errors is phantom writes, which are thankfully rarely seen in current storage hardware. They can cause data loss where everything otherwise looks correct -- I/O, integrity, and structure -- because what you read back may be a valid but old version of what was written to storage.
This type of error can be detected with sufficiently robust integrity checking, e.g. some type of durable Merkle tree, but ensuring that integrity checking can reliably detect phantom writes has a relatively high performance cost, so many storage systems just assume it won't happen, given the low prevalence and high cost. FWIW, I think this is the correct tradeoff for many storage systems: it is not a highly probable source of data loss in practice, and some types of replication architectures make it relatively straightforward to detect after the fact, even if not immediately.
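A minimal sketch of the Merkle-tree detection (Python toy; the device model and names are made up): after a phantom write, the stale block is still self-consistent with its embedded checksum, but a hash stored in the *parent* block no longer matches it.

```python
import hashlib

def h(b):
    return hashlib.sha256(b).digest()

# Device state: block -> (embedded checksum, data). A phantom write means the
# device acks an update but silently keeps the old, self-consistent pair.
device = {"blk": (h(b"v1"), b"v1")}

def phantom_write(block, data):
    pass  # acked, but never reaches the medium

phantom_write("blk", b"v2")
parent_hash = h(b"v2")   # the parent block *did* record the new child's hash

csum, data = device["blk"]
assert h(data) == csum           # self-checksum of the stale block: valid
assert h(data) != parent_hash    # Merkle-style parent hash exposes the loss
```

This is why the checksum has to live above the block it covers: a checksum stored next to the data travels with the stale version and stays valid.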
Note that many storage devices themselves track errors and provide a well-defined interface for reading them: SMART. This is all the more critical for devices with moving parts.
IIRC they do distinguish between command/interface errors and medium errors, which is somewhat analogous to the filesystem I/O and integrity errors discussed.
Good article. This article, and it looks like almost the entire blog it's a part of, are about computer engineering and operation detached from "the business". So much about computers is written from a business point of view, where anything that doesn't pay for itself in the next quarter is considered worthless. This blog doesn't take "business concerns" into account at all.
It's a fair point, but couldn't you derive this information from the kernel log? The different types of errors would look very different in the log. For the last type ("structural", caused by errors in the code) you'd likely even find different subclasses in the log, since various places in the kernel code raise those errors.
https://www.man7.org/linux/man-pages/man8/btrfs-device.8.htm...