Ext4: fix data corruption caused by unwritten and delayed extents (mail-archive.com)
89 points by alrs on May 20, 2015 | hide | past | favorite | 54 comments


This is probably not as serious as the title implies, and apparently not related to the recent reports of ext4 corruption. Quoting Theodore Ts'o [0]:

>So it's pretty hard to hit this bug by accident

>It requires the combination of (a) writing to a portion of a file that was not previously allocated using buffered I/O, (b) an fallocate of a region of the file which is a superset of region written in (a) before it has chance to be written to disk, (c) waiting for the file data in (a) to be written out to disk (either via fsync or via the writeback daemons), and then (d) before the extent status cache gets pushed out of memory, another random write to a portion of the file covered by (a) -- in which case that specific portion of (a) could be replaced by all zeros.

[0] http://thread.gmane.org/gmane.linux.kernel/1956583
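The (a)-(d) sequence Ts'o describes can be sketched as a syscall trace. This is only an illustration of the ordering, not a reliable reproducer: actually hitting the race also depends on writeback timing and the extent status cache, and on a patched kernel the data simply survives. File name and offsets are arbitrary.

```python
import os

path = "testfile"
fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o644)

# (a) buffered write into a previously unallocated region of the file
os.pwrite(fd, b"A" * 4096, 16384)

# (b) fallocate a region that is a superset of (a), before writeback
os.posix_fallocate(fd, 0, 65536)

# (c) force the data from (a) out to disk
os.fsync(fd)

# (d) another write into a portion of the region covered by (a);
# on an affected kernel the rest of (a) could come back as zeros
os.pwrite(fd, b"B" * 512, 16384)
os.fsync(fd)

# On a healthy kernel, the untouched part of (a) is still intact.
data = os.pread(fd, 4096, 16384)
os.close(fd)
os.unlink(path)
```

On a fixed kernel `data` is 512 bytes of "B" followed by 3584 bytes of "A"; the bug would replace the trailing "A"s with zeros.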


Reminds me of corruption issues that came from HDD controllers reshuffling writes so that the log and the actual data got desynced. You'd only really notice if you got a power failure and hit the drive with an fsck afterwards. The fix was to toggle on write barriers.


My guess is that it is something related to the RAID handling in the block layer: I had corruption on a FAT32 filesystem (the /boot/efi partition) after upgrading grub-efi on my laptop. It sits on md-raid0 (Intel fake RAID).

I have also had multiple corruptions of files written to an ext4 partition, but the /boot/efi corruption rules out an ext4 bug.

EDIT: It was with the 4.0.2 kernel on Debian.


All of my production systems have used XFS for the last 5-10 years. I never liked the idea that we have five filesystems on Linux and you have to know which is better for your use case, while other Unix-like systems offer maybe two options, one of which is supported and recommended (like ZFS on Solaris). I think ext4 does not offer much over ext3 for most users, but given its history (lots of data-loss scenarios) I would not want to risk starting to use it in production. On top of that, the recent move by CentOS to use XFS as the default FS just makes me think I made the right decision 10 years ago to use XFS for almost everything I do.

Ted Ts'o about ext4:

> P.S. It's bugs like these which is why I'm always amused by people who think that just because a file system is safely being used by their developers, that it's safe to throw production workloads on them.


XFS doesn't have an impressive track record when it comes to data loss, either. Come on, people: every filesystem with a large number of users has had its screw-ups.


True. I am not, however, aware of the XFS developers ever deliberately changing the behaviour of an older version of XFS to make it less safe to match the data corruption problems in a newer version.


Well, I am not sure about that. As anecdotal evidence, most systems engineers I know use XFS as the default on their systems. This might be coincidence, but I think it is a sign that XFS has had fewer "screw-ups" than ext3/4. A representative survey would help prove it.


Well, if we're going by anecdotal evidence, I've had a big loss when an XFS-on-RAID system had a power failure about 4 years ago, and recently I know someone who's lost hundreds of gigs to XFS as well, with no power loss that he's aware of.

So there.


Well, I don't know about reliability*, but there are many other reasons to choose XFS:

1) Support for large volumes: 8EB vs 16TB (Ext4)

2) Support for large file sizes: 1EB vs 16TB (Ext4)

3) Support for a large number of subfolders: Ext3 has a 32k subfolder limit

4) Online defrag support (also available in Ext4)

For us, we had to use XFS for MySQL multi-tenant applications.

* XFS had corruption issues back in the day with specific kernels (2.6.9 is the one I remember).


Ext4 does support filesystems larger than 16TB.

  df -h | grep md127 ; mount | grep md127
  /dev/md127       26T   14T   12T  55% /home/data
  /dev/md127 on /home/data type ext4 (rw,relatime,stripe=1792,data=ordered)

Ext3's limit was 32k, but ext4's subfolder limit is 65k. I have run into this one before. The real solution is to do directory hashing when you get this crazy with tons of subfolders.
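The directory-hashing workaround mentioned above can be as simple as fanning files out into subdirectories keyed by a hash prefix. A minimal sketch (the two-level layout, MD5, and prefix widths are arbitrary choices, not a standard scheme):

```python
import hashlib
import os

def hashed_path(base, name):
    """Spread files across base/xx/yy/ subdirectories using the first
    two bytes of an MD5 of the name, so no single directory ever
    approaches the ext3/ext4 subdirectory link limits."""
    h = hashlib.md5(name.encode()).hexdigest()
    d = os.path.join(base, h[:2], h[2:4])
    os.makedirs(d, exist_ok=True)
    return os.path.join(d, name)
```

With this layout, 256 * 256 = 65536 leaf directories share the load, so millions of files stay comfortably below any per-directory limit.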


I think there is a difference between what is supported and what is possible.

https://access.redhat.com/solutions/1532


Red Hat has tested EXT4 on partitions larger than 16TB. From your linked note:

"EXT4 [on RHEL7] [Certified max FS size:] 50TB [Theoretical max FS size:] (1EB)"

Moreover, Red Hat's lack of enterprise support for a configuration doesn't mean that that configuration is a bad idea, or is doomed to failure. :)


This is exactly what my comment was about. :)


Oh, I see. You were remarking on edgan's use of the phrase "Ext4 does support".

So, firstly, both Red Hat and ext4 support FSs larger than 16TB. So, yeah.

Secondly, I mean, Ted Ts'o works for Google, rather than Red Hat, so it's not like Red Hat has any special insight into the inner workings of ext4. RHEL's lack of "support" for ext4 FSs larger than a certain size wouldn't make me wary of using FSs larger than that size.

Ts'o & co. seem to be pretty careful, so I would expect that the worst that would happen would be performance issues.


I am wondering what Google uses nowadays.

Regarding Ts'o & co., I am not so optimistic. The way they communicate on the mailing list is more like: "Look, we wrote this, it runs fine on my laptop, it might work for other workloads too." There is no extensive test coverage run against ext4 (at least none I am aware of). This is not the first data-loss bug in the ext4 codebase. A few years from now, when enough users have started to use it and these serious bugs have been shaken out of the code, I might consider POCing it again.

http://www.phoronix.com/scan.php?page=news_item&px=MTIxNDQ


I'm somewhat certain that -despite the name- xfstests [0] is the canonical filesystem test suite these days. (At least, I've heard the btrfs devs concerned about passing its tests and adding new tests in it to check corner cases in btrfs. I also remember hearing the ext4 devs mention xfstest.)

I read that mailing list tone as something more like "This software is offered without warranty and might shave your dog and weld your toilet seat down.". I mean, folks on the LKML talk in that manner about their patches, too.

Also, even XFS has had severe data loss bugs in the distant past [1] and much more recently. [2]

[0] http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/cmds/xfstests.gi...

[1] https://bugs.launchpad.net/ubuntu/+source/linux-source-2.6.1... (Note in particular the last comment that describes this as "not a bug, but a feature!".)

[2] http://thread.gmane.org/gmane.comp.file-systems.xfs.general/... (fixed in RHEL in 2013(!))


Thank you. #1 might be the biggest reason that XFS is used over ext4, I guess; 16TB is not an unusual size for a decent RAID array nowadays. I would like to understand why CentOS (and I guess Red Hat) went for XFS, though.


I don't know how or why ext4 took off the way it did. ext2/3fs' dominance was somewhat understandable; it was the dark ages of filesystem architecture (akin to the dark ages of cryptography), and no one used math or science to definitively say one filesystem was better than another.

But in the age of ZFS, filesystems like XFS and JFS have proven themselves worthy to remain in the ring as non-versioning, simple filesystems. Their resilience (now, despite any known issues in the past) is beyond reproach, and their speed and performance (comparing apples to apples, not journaled to non-journaled) are where they should be.

Why does any OS/distro use extfs by default for new installations? I can understand ancient machines upgrading from ext3 to ext4 to avoid migrating data to a new filesystem/mountpoint, but for new installations?


Two notable reasons I know of.

First, people like boring in filesystems, and ext4 was seen as a natural and less risky successor to ext3, which most people ran. And I think that's a reasonable perception.

Second, ext4 went out of its way to solve some of the system-level issues that tended to cause problems, before it went into production. For instance, the ext4 developers introduced hacks to deal with software that handles write atomicity incorrectly, making it much less likely that you'd end up with a zero-length file if you cut power or crashed at the wrong time. XFS and JFS had similar issues that actually hit users, breaking invalid-but-widespread assumptions about filesystem semantics. ext4 at first considered having similar semantics, but instead worked around how many programs actually wrote files, which made it safer in practice.
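The invalid-but-widespread assumption in question was that a plain write-then-rename replaces a file atomically. The robust version of that pattern, which the ext4 workarounds effectively paper over for programs that skip it, looks roughly like this (a generic sketch of the technique, not ext4-specific code):

```python
import os

def atomic_write(path, data):
    """Replace path with data so a crash leaves either the old
    contents or the new contents, never a zero-length file."""
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)            # the step many programs skipped
    finally:
        os.close(fd)
    os.rename(tmp, path)        # atomic replacement on POSIX
    # fsync the directory so the rename itself is durable
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

Programs that renamed without the fsync could, on a crash, end up with the rename committed before the data blocks, which is exactly the zero-length-file symptom described above.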

(Also, "in the age of ZFS"? ZFS is by no means the dominant filesystem, or even an obvious contender for that title; it has a few popular features, but it's hamstrung by its choice of license.)


Also, you could upgrade ext3 to ext4 in-place. That was a powerful feature for a distro that had ext3 as the default, or for companies that had a lot of data on ext3 volumes.


Because it just works (critical data corruption bugs aside! But every FS has had those...)

For most uses, no-one will ever spot the difference between modern(-ish) underlying filesystems. Ext3/4 also has a very decent set of maintenance/recovery tools and is well supported by most distributions. That is plenty good enough for almost everybody.


> Their resilience (now, despite any known issues in the past) is beyond reproach

How is resilience characterized for filesystems which are not widely adopted and thus exposed to diverse usage scenarios?


It seems likely that btrfs will become default for most distributions within a few years.

Nothing compares to having decades of experience with an on-disk format, a fsck tool, etc. Ext3/4 has had the most users, therefore it will be the most trusted (bugs like this one notwithstanding). When you're choosing the distro default, you have to prefer reliability to performance/features.

Hm, I guess I just gave an argument for why btrfs won't become the default for another 5-10 years. We'll see!


That has seemed likely for the last few years; btrfs' adoption has been much delayed.


Because it seems to me btrfs is counting its chickens before they hatch. Making predictions about very complex software before it is 100% feature-complete is pure folly. Experienced programmers know that when a project is 80% complete from a feature standpoint, that doesn't mean only 20% of the work is left to do. In fact, usually substantially more than 20% of the work remains. (I don't know if btrfs is at the 80% mark, but I've heard there are a lot of things left to implement; it's just an illustration.)


As you understand it, what things have yet to be implemented?


Compression support needs to be rewritten; RAID5/6 support is extremely new and likely still quite buggy; quota support was just rewritten; and in general an unexpected powerdown will dump you in a recovery shell, because replay-from-journal has thousands of error cases that aren't properly handled yet (this may take many more years to fix because it's so complicated). And what do you know, that isn't really acceptable for servers.

There are long-standing bugs related to ENOSPC, and new ones still regularly pop up, often related to the (fixed) separation of metadata vs. data allocation. In general, disk-space accounting is awkward and integrates horribly with existing tools (df/du), especially when snapshots are involved.

Snapshot/subtree handling and deciding what gets automatically mounted in parent/child relations is pretty weird and the whole subsystem probably needs a rewrite.

Online dedup.

They probably also want to fix the whole extent design that causes random-access files to take up to N*(N-1)/2 disk space, where N is the original file size.


So, I'm asking questions because I'm curious and very probably ignorant. I'm not trying to pick a fight, make points, or get in a dick-waving contest. I also don't know how closely you follow btrfs development. I skim the lists from time to time, so if I'm telling you stuff that you already know, or if my memory is not quite correct, I apologize in advance. Also, if you notice this comment after the comment submission window closes, you can reach me at $MY_HN_USERNAME@gmail

What's wrong with how compression works? From a user's perspective, you either set a mount option, set a bit on a file attribute, or explicitly call for compression with btrfs defrag. [0] If you combine the latter two operations into a single checkbox, this is exactly how NTFS handles things. What am I missing here? Also, I can't agree that modern[1] btrfs handles unexpected power cuts poorly. This just hasn't been my experience. I can, however, agree that log replay isn't complete and still needs work.

I can't speak to the ENOSPC bugs, I haven't run into any in a very, very long time. Some time after 3.14, btrfs got a pool of space called the "Global Reserve" [2] which was intended to address ENOSPC issues. Somewhere around that time, btrfs also grew a better btrfs-specific df function invoked with "btrfs filesystem usage". I don't make extensive use of snapshots, but the numbers I get out of btrfs fi usage almost exactly match the numbers I get from plain old df.

I'm unaware of snapshot automounting. Can you help me understand what this is? (A brief Google search wasn't enlightening.) I agree that taking away the ability to have subvolumes outside the FS tree if you didn't set things up just right at FS-creation time is pretty bullshit. That certainly needs reworking. Everything else in my limited experience with subvol management seems okay to me. What seems strange to you?

Btrfs doesn't yet have built-in online dedup, but are you aware of duperemove? [3]

Can you give me a recipe to create an N*(N-1)/2-sized random-access file? I use btrfs for some very small random-access databases and some large pretty-much-append-only databases and haven't run into this behavior.

I have a couple of things to say about your other comment:

People generally don't write to a mailing list to say what a good time they're having with your software.

It's pretty shit that log replay isn't worked out yet.

I *certainly* don't intend to stop making backups. OTOH, I've been "getting lucky" for five years straight while running a FS that many people regard as the most untrustworthy thing in the world on a drive that many people regard as entirely unreliable and untrustworthy. :)

Cheers!

[0] Conceptually, this makes a lot of sense, 'cause all defrag does is re-write the file. NTFS re-writes files when you request that they be compressed, too.

[1] "Modern" means btrfs from the past two years or so.

[2] Much like ext*'s reserved-for-the-superuser space.

[3] https://github.com/markfasheh/duperemove


btrfs also hasn't even reached alpha level stability yet. An unexpected powerdown often means "restore from backups" at this point.


I've been using btrfs in my laptop on my OCZ Vertex LE [0] since that drive was shipping with the data-eating v1.0 firmware. [1] I'm also running it on my desktop, and in a force-compress multi-device firewire-attached configuration for a multi-TB Postgres tablespace. It has been many, many years since I've run into a data-loss issue of any kind, despite unexpected power-downs or other failures.

Now... if you're using Kubuntu 15.04 [3] with btrfs as your root volume, know that system lockups that require a hard reset may well do something to the log that makes mount hang. If your "task is hung" backtrace looks something like the one in the wiki page, then btrfs-zero-log [4] is the thing you need to do.

[0] The drive is still going strong and error-free, too! :D

[1] This drive is what taught me to ALWAYS upgrade the firmware in your SSDs, ASAP. :P

[3] Christ, what a nightmare this version is! :(

[4] https://btrfs.wiki.kernel.org/index.php/Btrfs-zero-log


You can follow the btrfs mailinglists to see the stream of people coming in that hit a case where btrfs can't recover by itself. I've hit one myself that required a btrfs-zero-log "fix", but I'm not running Ubuntu.

You have been extremely lucky. Keep making backups, though.


ext4 is actually one of the fastest, if not the fastest, filesystems for most common hardware and workloads, and it has few workloads where it does badly.

Source: most benchmarks involving Linux filesystems


Politics, that's why.

But XFS and JFS have had their share of corruption bugs too. I don't believe any were long-standing bugs, though.

Wikipedia mentions that JFS has a misfeature where writes can be delayed so long (indefinitely) that a power outage can cause major data loss.

Didn't SuSE ship with JFS as the default out of the box for a very short period of time?


Who is using JFS? And in general why pick that over XFS for instance?



I am running Debian with kernel 3.2.65 and ext4. How worried do I need to be about this?

I assume/hope that if this was a common occurrence it would have been found much earlier...

Update: Having reread the report, it appears that this may only be a problem in the new 4.0 kernel?

Update2: Nope, not just 4.0 kernels (see below).


> Update: Having reread the report, it appears that this may only be a problem in the new 4.0 kernel?

No, the reporter uses a 4.0 kernel, which is why the bug report lists that version. I've looked at the commit message for the fix, but it doesn't mention any regression, so this bug may have existed for a long time.


An entire extent could be lost (or zeroed?), which means serious corruption to your file(s) at the time of writing, if the cards fall in the right order.

I see no indication that this is a 4.0-specific issue. This code has existed for a long time. They fixed it in "stable kernels 4.0.3 as well as much older stable kernels."

edit: major corruption is possible, not just slight, because an extent can be 128 MiB on a filesystem with a 4 KiB block size.


"At this point, there is no way to get rid of the delayed extents, because there are no delayed buffers to write out. So when we write into said unwritten extent we will convert it to written, but it still remains delayed."

Ok, so I think what's happening here is

  1) Write data
  2) A busy filesystem or something makes the write of the extent delayed
  3) An update to that data attempts to get written
  4) The delayed extent gets replaced with new data, but is then marked as written here instead of when it hits the platters
  5) Now data loss has happened because there are no delayed buffers
I might be wrong with this breakdown. It's hard to tell. It seems that the management of delayed extents is not as robust as it should be.

http://www.spinics.net/lists/linux-ext4/msg47782.html

edit: this type of problem wouldn't affect a COW filesystem because when you overwrite existing data you aren't actually overwriting it, you're writing new data and updating the map of where to find the entire file contents.
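A toy model of why copy-on-write sidesteps this class of bug: an overwrite never touches the old blocks; it writes new blocks and then atomically repoints the map, so a half-finished update can simply be discarded. This is a deliberately simplified sketch, not how btrfs/ZFS are actually structured:

```python
class CowFile:
    """Toy copy-on-write block map: an overwrite allocates a fresh
    physical block and updates the logical map; the old block stays
    intact until the new map is committed."""
    def __init__(self):
        self.blocks = {}    # physical storage: block id -> bytes
        self.map = {}       # logical block number -> physical id
        self.next_id = 0

    def write(self, logical, data):
        pid = self.next_id          # always a brand-new physical block
        self.next_id += 1
        self.blocks[pid] = data
        old = self.map.get(logical)
        self.map[logical] = pid     # "commit" = atomic map update
        return old                  # old block is still readable

    def read(self, logical):
        return self.blocks[self.map[logical]]

f = CowFile()
f.write(0, b"v1")
old = f.write(0, b"v2")   # overwrite: v1's block is untouched
```

If a crash happens before the map update "commits", the reader still sees the old block; in-place filesystems like ext4 have no such fallback, which is why they need careful extent-state bookkeeping instead.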


As far as I can see, the "critical" label here is coming from a Debian user submitting a Debian bug report, not from the Debian kernel team or upstream kernel maintainers.

I think HN submissions should have some standards for stories that create the appearance of "there is a critical problem with software X that lots of people use, and everyone should panic and upgrade" announcements. Having the authors of software X involved in the announcement seems like a good place to start.

(I'm not saying that Josh is mistaken about having been hit by this bug, or was involved in the decision to submit this to HN.)


> HN submissions should have some standards for stories that will create the appearance

HN has such a standard: the guidelines ask for titles not to be misleading or linkbait. (Not an opinion on the current bug.)

https://news.ycombinator.com/newsguidelines.html


Ah, okay. I think the title is misleading and should be changed.


Is it (and the url) ok now?


> (I'm not saying that Josh is mistaken about having been hit by this bug, or was involved in the decision to submit this to HN.)

I was not involved, and it's sounding like this is a different bug than the one I got hit by. This bug still qualifies as "critical" in a Debian bug-reporting severity sense ("causes serious data loss"). I don't know what the original title was though.


Ah, this story originally linked to https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=785672 and shared its title ("Critical ext4 data corruption bug").


You're definitely right. "critical" in this context means it has a critical severity as a Debian bug, it doesn't mean everybody must stop what they're doing right now and patch their kernel.

If I interpreted the commit message correctly, this bug is not a new thing at all; it has been there for years but has only been noticed lately, which means we're no better off applying the fix now than waiting until distribution packages ship it.


Data loss is not critical? Wow, remind us not to trust your firm with our data.


Every filesystem contains edge case bugs that lose data. The answer to whether everyone reading this story should go upgrade their kernel right now to avoid this one depends on factors like how probable it is that users will hit this one, and how risky the fix is.

The project maintainers are in the best position to make that call.


Delayed extents are probably a normal thing that happens in daily I/O operations. The possibility of them not actually hitting the platters permanently is a scary scenario.

"A single extent in ext4 can map up to 128 MiB of contiguous space with a 4 KiB block size."
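The 128 MiB figure follows from the on-disk extent format: the extent length field effectively allows at most 2^15 = 32768 blocks per extent, so with 4 KiB blocks:

```python
block_size = 4096            # 4 KiB blocks
max_blocks = 2 ** 15         # effective max blocks per ext4 extent
max_extent = block_size * max_blocks
print(max_extent // (1024 * 1024), "MiB")   # -> 128 MiB
```

So a single corrupted extent can, in the worst case, take out 128 MiB of contiguous file data.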


The whole bug sounds similar to a problem we had on XFS back in the 2.6.16 timeframe. Certain operations were delayed for hours, meaning it was possible for renames/writes not to hit the platter if the machine experienced power loss instead of a clean shutdown, even if syncs were performed.


I think JFS still has that problem, if Wikipedia is accurate.



Ok, we changed the url to that from https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=785672. Thanks.


I would say that is critical.

ouch



