ZFS won’t save you: fancy filesystem fanatics need to get a clue about bit rot (and RAID-5)

Posted on March 7, 2017; updated January 1, 2020, by Jody Bruchon

UPDATE 3 (2020-01-01): I wrote this to someone on Reddit in a discussion about the ZFS/XFS/RAID-5 issue, and it does a good job of explaining why this article exists and why it’s presented in an argumentative tone. Please read it before you read the article below. Thanks, and have a wonderful 2020!

There really is no stopping zealots. Anyone who reads my article all the way through and takes the text at face value (rather than taking liberties with interpretation, as the abundant comments underneath it demonstrate) can see that I’m not actually dumping on ZFS nor saying that RAID-5 is the One True Way(tm). It really boils down to: ZFS is over-hyped, people who recommend it tend to omit the info that makes its protection capabilities practically useful, XFS is better for several use cases, RAID-5 is a good choice for a lot of lower-end people who don’t need fast rebuilds but is also not for everyone.

I strongly advocate for people using what fits their specific needs, and two years ago, there was a strong ZFS fanatical element on r/DataHoarder that was aggressively pushing ZFS as a data integrity panacea that all people should use, but leaving out critical things like RAID-Z being required for automatic repair capabilities. At the same time, I had read so many “DON’T USE RAID-5, IT SHOULD BE BANNED!” articles that I was tired of both of these camps.

The fact is that we have no useful figures on the prevalence of bit rot and there are a ton of built-in hardware safeguards against it already in place that so many fellow nerds typically don’t know about. Most people who experience bit rot will never know that that’s what happened, and if the rot is in “empty” space then no one will ever know it happened at all. There’s not some sort of central rot reporting authority, either. Backblaze’s disk failure reports are the closest thing we have to actual data on the subject. No one has enough information on bit rot to be “right.” In the absence of information, the human mind runs wild to fill in the blanks, and I think that’s where a good portion of this technology zealotry comes from.

UPDATE 2: Some fine folks on the openmediavault (OMV) forums disagreed with me and I penned a response which includes a reference to a scientific paper that backs several of my claims. Go check it out if you’re really bored. You know you want to! After all, who doesn’t love watching a nice trash fire on the internet?

UPDATE: Someone thought it’d be funny to submit this to Hacker News. It looks like I made some ZFS fans pretty unhappy. I’ll address some of the retorts posted on HN that didn’t consist of name-calling and personal attacks at the end of this article. And sorry, “OpenZFSonLinux,” I didn’t “delete the article after you rebuked what it said” as you so proudly posted; what I did was lock the post to private viewing while I added my responses, a process that doesn’t happen quickly when 33 of them exist. It’s good to know you’re stalking my posts though. It’s also interesting that you appear to have created a Hacker News user account solely for the purpose of said gloating. If this post has hurt your feelings that badly then you’re probably the kind of person it was written for.

It should also be noted that this is an indirect response to advice seen handed out on Reddit, Stack Overflow, and similar sites. For the grasping-at-straws-to-discredit-me HN nerds that can’t help but harp on the fact that “ZFS doesn’t use CRCs [therefore the author of this post is incompetent],” would you please feel free to tell that to all the people that say “CRC” when discussing ZFS? Language is made to communicate things and if I said “fletcher4” or “SHA256” they may not know what I’m talking about and think I’m the one who is clueless. Damned if you do, damned if you don’t.

tl;dr: Hard drives already do this, the risks of loss are astronomically low, ZFS is useless for many common data loss scenarios, start backing your data up you lazy bastards, and RAID-5 is not as bad as you think.

Bit rot just doesn’t work that way.

I am absolutely sick and tired of people in forums hailing ZFS (and sometimes btrfs which shares similar “advanced” features) as some sort of magical way to make all your data inconveniences go away. If you were to read the ravings of ZFS fanboys, you’d come away thinking that the only thing ZFS won’t do is install kitchen cabinets for you and that RAID-Z is the Holy Grail of ways to organize files on a pile of spinning rust platters.

In reality, the way that ZFS is spoken of by the common Unix-like OS user shows a gross lack of understanding of how things really work under the hood. It’s like the “knowledge” that you’re supposed to discharge a battery as completely as possible before charging it again. That advice hasn’t gone away even though it was only accurate for old Ni-Cd battery chemistry and will destroy your laptop or cell phone lithium-ion cells far faster than if you had just left them on the charger all the time. Bad knowledge that has spread widely tends to have a very hard time dying. This post shall serve as all of the nails AND the coffin for the ZFS and btrfs feature-worshiping nonsense we see today.

Side note: in case you don’t already know, “bit rot” is the phenomenon where data on a storage medium gets damaged because of that medium “breaking down” over time naturally. Remember those old floppies you used to store your photos on and how you’d get read errors on a lot of them ten years later? That’s sort of like how bit rot works, except bit rot is a lot scarier because it supposedly goes undetected, silently destroying your data and you don’t ever find out until it’s too late and even your backups are corrupted.

“ZFS has CRCs for data integrity”

A certain category of people are terrified of the techno-bogeyman named “bit rot.” These people think that a movie file not playing back or a picture getting mangled is caused by data on hard drives “rotting” over time without any warning. The magical remedy they use to combat this today is the holy CRC, or “cyclic redundancy check.” It’s a certain family of hash algorithms that produce a magic number that will always be the same if the data used to generate it is the same every time.
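To make the “magic number” idea concrete, here is a minimal sketch using the CRC-32 from Python’s standard zlib module. It is only a stand-in for illustration: ZFS itself defaults to a different checksum (fletcher4), and the block contents and the helper names crc32_of and flip_bit are made up for this example.

```python
import zlib

def crc32_of(data: bytes) -> int:
    """Return the CRC-32 of data as an unsigned 32-bit integer."""
    return zlib.crc32(data) & 0xFFFFFFFF

def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    damaged = bytearray(data)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)

# Hypothetical stand-in for one stored block of file data.
block = b"holiday photo bytes " * 64
stored_crc = crc32_of(block)  # recorded when the block is written

# Reading back identical data always produces the same number.
clean_read_matches = crc32_of(block) == stored_crc

# A single flipped bit produces a different number, so the damage is noticed.
rotted_read_matches = crc32_of(flip_bit(block, 1234)) == stored_crc

print("clean read matches: ", clean_read_matches)
print("rotted read matches:", rotted_read_matches)
```

Note what the check does and does not provide: it says the block changed, but nothing in the number says what the original bytes were.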

This is, by far, the number one pain in the ass statement out of the classic ZFS fanboy’s mouth and is the basis for most of the assertions that ZFS “protects your data” or “guards against bit rot” or other similar claims. While it is true that keeping a hash of a chunk of data will tell you if that data is damaged or not, the filesystem CRCs are an unnecessary and redundant waste of space and their usefulness is greatly over-exaggerated by hordes of ZFS fanatics.

Hard drives already do it better

Enter error-correcting codes (ECC). You might recognize that term because it’s also the specification for a type of RAM module that has extra bits for error checking and correction. What the CRC Jesus clan don’t seem to realize is that all hard drives since the IDE interface became popular in the 1990s have ECC built into their design and every single bit of information stored on the drive is both protected by it and transparently rescued by it once in a while.

Hard drives (as well as solid-state drives) use an error-correcting code to protect against small numbers of bit flips by both detecting and correcting them. If too many bits flip or the flips happen in a very specific way, the ECC in hard drives will either detect an uncorrectable error and indicate this to the computer or the ECC will be thwarted and “rotten” data will successfully be passed back to the computer as if it were legitimate. The latter scenario is the only bit rot that can happen on the physical medium and pass unnoticed, but what did it take to get there? One bit flip will easily be detected and corrected, so we’re talking about a scenario where multiple bit flips happen in close proximity and in such a manner that the result is still mathematically valid.
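A toy example shows both halves of that paragraph: a single flip being repaired, and a specific multi-bit pattern slipping through as “valid” data. The sketch below uses the textbook Hamming(7,4) code, which is an assumption chosen for brevity. Real drives use far stronger codes over whole sectors (Reed-Solomon or LDPC, for instance) that also flag most uncorrectable patterns instead of miscorrecting them, so the failure shown here is much easier to trigger in this toy than on an actual disk.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits as a 7-bit codeword laid out [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(codeword):
    """Correct at most one flipped bit, then return (data_bits, syndrome)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of the suspected bad bit
    if syndrome:
        c[syndrome - 1] ^= 1         # "repair" the bit the code points at
    return [c[2], c[4], c[5], c[6]], syndrome

data = [1, 0, 1, 1]
codeword = hamming74_encode(data)

# One flipped bit: the decoder finds it and hands back the original data.
one_flip = list(codeword)
one_flip[4] ^= 1
recovered_one, _ = hamming74_decode(one_flip)

# Two flipped bits: the decoder "fixes" the wrong bit and returns bad data
# that looks perfectly valid. This is the silent case described above.
two_flips = list(codeword)
two_flips[0] ^= 1
two_flips[1] ^= 1
recovered_two, _ = hamming74_decode(two_flips)

print("one flip recovered correctly: ", recovered_one == data)
print("two flips recovered correctly:", recovered_two == data)
```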

While it is a possible scenario, it is also very unlikely. A drive that has this many bit errors in close proximity is likely to be failing and the S.M.A.R.T. status should indicate a higher reallocated sectors count or even worse when this sort of failure is going on. If you’re monitoring your drive’s S.M.A.R.T. status (as you should be) and it starts deteriorating, replace the drive!

Flipping off your CRCs

Note that in most of these bit-flip scenarios, the drive transparently fixes everything and the computer never hears a peep about it. ZFS CRCs won’t change anything if the drive can recover from the error. If the drive can’t recover and sends back the dreaded uncorrectable error (UNC) for the requested sector(s), the drive’s error detection has already done the job that the ZFS CRCs are supposed to do; namely, the damage was detected and reported.

What about the very unlikely scenario where several bits flip in a specific way that thwarts the hard drive’s ECC? This is the only scenario where the hard drive would lose data silently, therefore it’s also the only bit rot scenario that ZFS CRCs can help with. ZFS with CRC checking will detect the damage despite the drive failing to do so and the damage can be handled by the OS appropriately…but what has this gained us? Unless you’re using specific kinds of RAID with ZFS or have an external backup you can restore from, it won’t save your data, it’ll just tell you that the data has been damaged and you’re out of luck.

Hardware failure will kill your data

If your drive’s on-board controller hardware, your data cable, your power supply, your chipset with your hard drive interface inside, your RAM’s physical slot connection, or any other piece of the hardware chain that goes from the physical platters to the CPU has some sort of problem, your data will be damaged. It should be noted that SATA drive interfaces use IEEE 802.3 CRCs so the transmission from the drive CPU to the host system’s drive controller is protected from transmission errors. Using ECC RAM only helps with errors in the RAM itself, but data can become corrupted while being shuffled around in other circuits and the damaged values stored in ECC RAM will be “correct” as far as the ECC RAM is concerned.

The magic CRCs I keep making fun of will help with these failures a little more because the hard drive’s ECC no longer protects the data once the data is outside of a CRC/ECC capable intermediate storage location. This is the only remotely likely scenario that I can think of which would make ZFS CRCs beneficial.

…but again: how likely is this sort of hardware failure to happen without the state of something else in the machine being trashed and crashing something? What are the chances of your chipset scrambling the data only while the other millions of transistors and capacitors on the die remain in a functional and valid working state? As far as I’m concerned, not very likely.

Data loss due to user error, software bugs, kernel crashes, or power supply issues usually won’t be caught by ZFS CRCs at all. Snapshots may help, but they depend on the damage being caught before the snapshot of the good data is removed. If you save something and come back six months later and find it’s damaged, your snapshots might just contain a few months with the damaged file and the good copy was lost a long time ago. ZFS might help you a little, but it’s still no magic bullet.

Nothing replaces backups

By now, you’re probably realizing something about the data CRC gimmick: it doesn’t hold much value for data integrity and it’s only useful for detecting damage, not correcting it and recovering good data. You should always back up any data that is important to you. You should always keep it on a separate physical medium that is ideally not attached to the computer on a regular basis.

Back up your data. I don’t care about your choice of filesystem or what magic software you write that will check your data for integrity. Do backups regularly and make sure the backups actually work.

In all of my systems, I use the far less exciting XFS on Linux with metadata CRCs (once they were added to XFS) on top of a software RAID-5 array. I also keep external backups of all systems updated on a weekly basis. I run S.M.A.R.T. long tests on all drives monthly (including the backups) and about once a year I will test my backups against my data with a tool like rsync that has a checksum-based matching option to see if something has “rotted” over time.
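For the yearly “has anything rotted?” comparison, rsync’s --checksum option combined with --dry-run is one way to do it. The same idea can be sketched in a few lines of Python. This is an illustration rather than the exact procedure described above; the function names and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
import os

def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def compare_trees(source_root, backup_root):
    """Return (mismatched, missing) relative paths when checking a backup.

    mismatched: present in both trees but with different contents
    missing:    present under source_root but absent under backup_root
    """
    mismatched, missing = [], []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for name in filenames:
            source_path = os.path.join(dirpath, name)
            relative = os.path.relpath(source_path, source_root)
            backup_path = os.path.join(backup_root, relative)
            if not os.path.isfile(backup_path):
                missing.append(relative)
            elif file_sha256(source_path) != file_sha256(backup_path):
                mismatched.append(relative)
    return sorted(mismatched), sorted(missing)
```

Calling compare_trees with the live data directory and the backup mount point (whatever those paths are on a given system) returns the two lists. A mismatch only says the two copies differ. Files edited since the last backup will show up too, so each entry still needs a human to decide which copy is the good one.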

All of my data loss tends to come from poorly typed ‘rm’ commands. I have yet to encounter a failure mode that I could not bounce back from in the past 10 years. ZFS and btrfs are complex filesystems with a few good things going for them, but XFS is simple, stable, and all of the concerning data loss bugs were ironed out a long time ago. It scales well and it performs better all-around than any other filesystem I’ve ever tested. I see no reason to move to ZFS and I strongly question the benefit of catching a highly unlikely set of bit damage scenarios in exchange for the performance hit and increased management complexity that these advanced features will cost me…and if I’m going to turn those features off, why switch in the first place?

Bonus: RAID-5 is not dead, stop saying it is

A related category of blind zealot is the RAID zealot, often following in the footsteps of the ZFS zealot or even occupying the same meat-suit. They loudly scream about the benefits of RAID-6, RAID-10, and fancier RAID configurations. They scorn RAID-5 for having terrible rebuild times and hype up the fact that “if a second drive dies while rebuilding, you lose everything!” They point at 10TB hard drives and do back-of-the-napkin equations and tell you about how dangerous and stupid it is to use RAID-5 and how their system that gives you less space on more drives is so much better.

Stop it, fanboys. You’re dead wrong and you’re showing your ignorance of good basic system administration practices.

I will concede that your fundamental points are mostly correct. Yes, RAID-5 can potentially have a longer rebuild time than multi-stripe redundant formats like RAID-6. Yes, losing a second drive after one fails or during a rebuild will lose everything on the array. Yes, a 32TB RAID-5 with five 8TB drives will take a long time to rebuild (about 50 hours at 180 MB/sec). No, this isn’t acceptable in an enterprise server environment. Yes, the infamous RAID-5 write hole (where a stripe and its parity aren’t both updated before a crash or power failure and the data is damaged as a result) is a problem, though a very rare one to encounter in the real world. How do I, the smug techno-weenie advocating for dead old stupid RAID-5, counter these obviously correct points?

Longer rebuild time? This is only true if you’re using the drives for something other than rebuilding while it’s rebuilding. What you really mean is that rebuilding slows down less when you interrupt it with other work if you’re using RAID levels with more redundancy. No RAID exists that doesn’t slow down when rebuilding. If you don’t use it much during the rebuild, it’ll go a lot faster. No surprise there!
Losing a second drive? This is possible but statistically very unlikely. However, let’s assume you ordered a bunch of bad Seagates from the same lot number and you really do have a second failure during rebuild. So what? You should be backing up the data to an external backup, in which case this failure does not matter. RAID-6 doesn’t mean you can skip the backups. Are you really not backing up your array? What’s wrong with you?
RAID-5 in the enterprise? Yeah, that’s pretty much dead because of the rebuild process slowdown being worse. An enterprise might have 28 drives in a RAID-10 because it’s faster in all respects. Most of us aren’t an enterprise and can’t afford 28 drives in the first place. It’s important to distinguish between the guy building a storage server for a rack in a huge datacenter and the guy building a home server for video editing work (which happens to be my most demanding use case).
The RAID-5 “write hole?” Use an uninterruptible power supply (UPS). You should be doing this on any machine with important data on it anyway! Assuming you don’t use a UPS, Linux as of kernel version 4.4 has added journaling features for RAID arrays in an effort to close the RAID-5 write hole problem.
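The rebuild-time figure quoted earlier is simple division, and it is worth seeing where it comes from. The sketch below assumes decimal units (1TB = 10^12 bytes, 1MB = 10^6 bytes) and a sustained rate with no other load on the array. The “about 50 hours” number corresponds to streaming the full 32TB at 180 MB/sec. The second line is an added assumption, not a claim from the text above: if a rebuild is limited by writing the single 8TB replacement drive while the surviving drives are read in parallel, the idle-array time would be closer to a quarter of that. Which case applies depends on the RAID implementation and the hardware.

```python
def hours_to_transfer(terabytes, mb_per_sec):
    """Hours needed to move `terabytes` (decimal TB) at a sustained MB/sec rate."""
    total_bytes = terabytes * 10**12
    seconds = total_bytes / (mb_per_sec * 10**6)
    return seconds / 3600

whole_array_hours = hours_to_transfer(32, 180)  # all 32TB of usable space
single_drive_hours = hours_to_transfer(8, 180)  # one 8TB replacement drive

print(f"32TB at 180 MB/sec: {whole_array_hours:.1f} hours")
print(f" 8TB at 180 MB/sec: {single_drive_hours:.1f} hours")
```

Either way, the point in the first item of the list stands: anything else hitting the drives during the rebuild stretches these numbers out.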

A home or small business user is better off with RAID-5 if they’re also doing backups like everyone should anyway. With a 7200 RPM 3TB drive (the best $/GB ratio in 7200 RPM drives as of this writing) costing around $95 each shipped, I can only afford so many drives. I know that I need at least three for a RAID-5 and I need double as many because I need to back that RAID-5 up, ideally to another machine with another identically sized RAID-5 inside. That’s a minimum of six drives for $570 to get two 6TB RAID-5 arrays, one main and one backup. I can buy a nice laptop or even build a great budget gaming desktop for that price, but for these storage servers I haven’t even bought the other components yet. To get 6TB in a RAID-6 or RAID-10 configuration, I’ll need four drives instead of three for each array, adding $190 to the initial storage drive costs. I’d rather spend that money on the other parts and in the rare instance that I must rebuild the array I can use the backup server to read from to reduce my rebuild time impact. I’m not worried about a few extra hours of rebuild.
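The drive-count arithmetic in that paragraph can be checked with a short sketch. The $95 price and 3TB size come from the paragraph above. The drives_needed helper and its minimum-drive rules (three for RAID-5, four for RAID-6 and RAID-10) are a simplification written for this example and ignore things like hot spares.

```python
DRIVE_TB = 3      # capacity per drive, from the text above
DRIVE_PRICE = 95  # dollars per drive shipped, from the text above

def drives_needed(level, usable_tb, drive_tb=DRIVE_TB):
    """Smallest drive count that gives at least usable_tb at the given RAID level."""
    data_drives = -(-usable_tb // drive_tb)  # ceiling division
    if level == "raid5":
        return max(3, data_drives + 1)       # one drive's worth of parity
    if level == "raid6":
        return max(4, data_drives + 2)       # two drives' worth of parity
    if level == "raid10":
        return max(4, data_drives * 2)       # every data drive is mirrored
    raise ValueError(f"unknown RAID level: {level}")

def cost_for_main_and_backup(level, usable_tb):
    """Price of two identical arrays: the main one and its backup."""
    return 2 * drives_needed(level, usable_tb) * DRIVE_PRICE

raid5_cost = cost_for_main_and_backup("raid5", 6)
raid6_cost = cost_for_main_and_backup("raid6", 6)
raid10_cost = cost_for_main_and_backup("raid10", 6)

print("RAID-5  main + backup:", raid5_cost)
print("RAID-6  main + backup:", raid6_cost, "extra:", raid6_cost - raid5_cost)
print("RAID-10 main + backup:", raid10_cost, "extra:", raid10_cost - raid5_cost)
```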

Not everyone has thousands of dollars to allocate to their storage arrays or the same priorities. All system architecture decisions are trade-offs and some people are better served with RAID-5. I am happy to say, however, that if you’re so adamant that I shouldn’t use RAID-5 and should upgrade to your RAID levels, I will be happy to take your advice on one condition.

Buy me the drives with your own money and no strings attached. I will humbly and graciously accept your gift and thank you for your contribution to my technical evolution.

If you can add to the conversation, please feel free to comment. I want to hear your thoughts. Comments are moderated but I try to approve them quickly.

Update to address Hacker News respondents

First off, it seems that several Hacker News comments either didn’t read what I wrote, missed a few things, or read more into it than what I really said. I want to respond to some of the common themes that emerged in a general fashion rather than individually.

I am well aware that ZFS doesn’t exactly use “CRCs” but that’s how a lot of people refer to the error-checking data in ZFS colloquially so that’s the language I adopted; you pointing out that it’s XYZ algorithm or “technically not a CRC” doesn’t address anything that I said…it’s just mental masturbation to make yourself feel superior and it contributes nothing to the discussion.

I was repeatedly scolded for saying that the ZFS checksum feature is useless despite never saying that. I acknowledge that it does serve a purpose and use cases exist. My position is that I believe ZFS checksums constitute a lot of additional computational effort to protect against a few very unlikely hardware errors once the built-in error checking and correction in most modern hardware is removed from the overall picture. I used the word “many” in my “ZFS is useless for many common data loss scenarios” statement for a reason. This glossing over of important details is the reason I refer to such people as ZFS “zealots” or “fanboys.” Rather than taking the time to understand my position fully before responding, they quickly scanned the post for ways to demonstrate my clear ignorance of the magic of ZFS to the world and jumped all over the first thing that stood out.

kabdib related an anecdote where the RAM on a hard drive’s circuit board was flipping data bits in the cache portion, and the system involved used an integrity check similar to ZFS, which is how the damage was detected. The last line sums up the main point: “Just asserting “CRCs are useless” is putting a lot of trust on stuff that has real-world failure modes.” Remember that I didn’t assert that CRCs are useless; I specifically outlined where the ZFS checksum feature cannot be any more helpful than existing hardware integrity checks which is not the same thing. I question how common it is for hard drive RAM to flip only the bits in a data buffer/cache area without corrupting other parts of RAM that would cause the drive’s built-in software to fail. I’m willing to bet that there aren’t any statistics out there on such a thing. It’s good that a ZFS-like construct caught your hardware issue, but your obscure hard drive failure anecdote does not necessarily extrapolate out to cover billions of hard drives. Still, if you’re making an embedded device like a video game system and you can afford to add that layer of paranoia to it, I don’t think that’s a bad thing. Remember that the purpose of my post is to address those who blindly advocate ZFS as if it’s the blood of Computer Jesus and magically solves the problems of data integrity and bit rot.

rgbrenner offered indirect anecdotal evidence, repetitions of the lie that I asserted “CRCs are useless,” and then made a ridiculous attempt at insulting me: “If this guy wrote a filesystem (something that he pretends to have enough experience to critique), it would be an unreliable unusable piece of crap.” Well then, “rgbrenner,” all I can say is that if you are so damned smart and have proof of this “unreliable and unusable” state that it’s in, file a bug against the filesystem I wrote and use on a daily basis for actual work so it can be fixed, and feel free to keep the condescending know-it-all attitude to yourself when you do so.

AstralStorm made a good point that I’ve also been trying to make: if your data sits in RAM that’s not used by ZFS, perhaps while it is being edited in a program, it can be damaged there and ZFS will have no idea that it happened.

wyoung2 contributed a lot of information that was well-written and helpful. I don’t think I need to add anything to it, but it deserves some recognition since it’s a shining chunk of gold in this particular comment septic tank.

X86BSD said that “Consumer hardware is notriously busted. Even most of the enterprise hardware isn’t flawless. Firmware bugs etc. .” I disagree. In my experience the vast majority of hardware works as expected. Even most of the computers with every CPU regulator capacitor leaking electrolyte pass extended memory testing and CPU burn-in tests. Hard drives fail a lot more than other hardware does, sure, but even then the ECC does what it’s supposed to do and detects the error and reports it instead of handing over the broken data that failed the error check. I’d like some hard stats rather than anecdotes but I’m not even sure if they exist due to the huge diversity of failure scenarios that can come about.

asveikau recalls the hard drive random bit flipping problem hitting him as well. I don’t think that this anecdote has value because it’s a hard drive hardware failure. Sure, ZFS can catch it, but let’s remember that any filesystem would catch it because the filesystem metadata blocks will be read back with corruption too. XFS has optional metadata CRCs and those would catch this kind of disk failure so I don’t think ZFS can be considered much better for this failure scenario.

wyoung2 made another lengthy comment that requires me to add some details: I generally work only in the context of Linux md RAID (the raid5 driver specifically) so yes, there is a way to scrub the entire array: ‘echo check > /sys/block/md0/md/sync_action’. Also, if a Linux md RAID encounters a read error on a physical disk, the data is pulled from the remaining disk(s) and written back to the bad block, forcing the drive to either rewrite the data successfully or reallocate the sector which has the same effect; it no longer dumps a whole drive from the RAID on the basis of a single read error unless the attempts to do a “repair write” fail also. I can’t really comment on the anecdotal hardware problems discussed; I personally would not tolerate hardware that is faulty as described and would go well out of my way to fix the problem or replace the whole machine if no end was in sight. (I suppose this is a good time to mention that power supply issues and problems with power regulation can corrupt data…)
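The scrub command quoted above can also be driven from a script. The sketch below does the same write to the md sync_action attribute and reads the mismatch_cnt attribute that md exposes next to it. The function names and the sysfs_root parameter (there so the code can be pointed at a fake directory for testing) are assumptions made for this example, and on a real system the write needs root.

```python
from pathlib import Path

def md_attr(md_name, attr, sysfs_root="/sys/block"):
    """Path to an md attribute, e.g. /sys/block/md0/md/sync_action."""
    return Path(sysfs_root) / md_name / "md" / attr

def start_check(md_name, sysfs_root="/sys/block"):
    """Equivalent of: echo check > /sys/block/<md_name>/md/sync_action"""
    md_attr(md_name, "sync_action", sysfs_root).write_text("check\n")

def current_action(md_name, sysfs_root="/sys/block"):
    """What the array is doing right now, e.g. 'idle' or 'check'."""
    return md_attr(md_name, "sync_action", sysfs_root).read_text().strip()

def mismatch_count(md_name, sysfs_root="/sys/block"):
    """Mismatch counter reported by md after a check pass."""
    return int(md_attr(md_name, "mismatch_cnt", sysfs_root).read_text().strip())
```

A monthly cron job could call start_check("md0"), wait until current_action("md0") returns to idle, and then send an alert if mismatch_count("md0") is not zero.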

Yet another wyoung2 comment points out one big advantage ZFS has: if you use RAID that ZFS is aware of, ZFS checksums allow ZFS to know what block is actually bad when you check the array integrity. I actually mentioned this in my original post when I referenced RAID that ZFS pairs with. If you use a proper ZFS RAID setup then ZFS checksums become useful for data integrity; my focus was on the fact that without this ZFS-specific RAID setup the “ZFS protects your data” bullet-point is false. ZFS by itself can only tell you about corruption and it’s a dangerous thing to make people think the protection offered by a ZFS RAID setup is offered by ZFS by itself.
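The difference between “detect” and “repair” fits in a few lines. This is a conceptual sketch only, not how ZFS is actually implemented: it models redundancy as plain extra copies (RAID-Z reconstructs from parity instead), uses SHA-256 because it is one of the checksums ZFS can be configured with, and every name in it is made up for the example.

```python
import hashlib

def checksum(block: bytes) -> bytes:
    """Checksum kept separately from the block it describes."""
    return hashlib.sha256(block).digest()

def read_with_repair(copies, expected_checksum):
    """Return (good_block, repaired_indexes) or raise if no copy verifies.

    copies is a mutable list holding every stored copy of one block.
    Any copy that fails the checksum is overwritten with a copy that passes.
    """
    good = next((c for c in copies if checksum(c) == expected_checksum), None)
    if good is None:
        raise IOError("every copy fails its checksum: detected, not repairable")
    repaired = []
    for index, block in enumerate(copies):
        if checksum(block) != expected_checksum:
            copies[index] = good
            repaired.append(index)
    return good, repaired

original = b"important data block"
stored_checksum = checksum(original)
rotted = b"important dbta block"  # same block with damage in it

# Checksums on a single disk: the damage is noticed and that is all.
try:
    read_with_repair([rotted], stored_checksum)
    single_disk_result = "ok"
except IOError:
    single_disk_result = "detected, not repairable"

# Checksums plus redundancy: the checksum identifies which copy is bad,
# so the good copy can be returned and the bad one rewritten.
mirror = [rotted, original]
good_block, repaired_indexes = read_with_repair(mirror, stored_checksum)

print("single disk:", single_disk_result)
print("mirror repaired copies:", repaired_indexes)
```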

At this point I can only assume that rgbrenner just enjoys being a dick. And that, in contrast, AstralStorm understood what I was trying to say to at least some extent.

DiabloD3 quoted me on “RAID is not a replacement for backups” and then mentions ZFS external backup commands. Hey, uh, you realize that the RAID part was basically a separate post, right? In fact, there is not a single mention of ZFS in the RAID section of the post other than as a topic transition mechanism in the first paragraph. I included the RAID part because the ZFS religion and the RAID-over-5-only religion have the same “smell.”

I’ll have to finish this part later. It takes a lot of time to respond to criticism. Stay tuned for more. I have to stop so I can unlock the post and keep OpenZFSonLinux from eating off his own hands with anticipation. As a cliff-hanger, check this out…I enjoyed the stupidity of X86BSD’s second comment about me endangering my customers’ data [implicitly because I don’t use ZFS] so much that I changed my blog to integrate it and embrace how horrible of a person I am for not making my customers use ZFS with checksums on their single-disk Windows machines. If my destiny is to be “highly unethical” then I might as well embrace it.

72 thoughts on “ZFS won’t save you: fancy filesystem fanatics need to get a clue about bit rot (and RAID-5)”

David Hajes says:

June 2, 2017 at 1:08 pm

I hear you chief. It makes me sick too. Too many eExperts.

Right now I’m building file server and wherever I go…I hear NAS, ZFS, FreeNAS, BTRFS.

I seek stability and reliability. I really love my photos taken around the world with so many stories and “suffering”

So far I wrote paranoid script for copying data and I can check all data integrity with checksums.

I’m also paranoid (perhaps it is the teaching on extreme sports I do over 30years)

So far I found nothing intelligent…I think I use standard ext4 on Debian. Keep checksum created right from the beginning of file. If anything get corrupted it starts all on flash cards in camera and I cannot do anything about it. Run scripts that check file system.

What do you suggest as optimal intelligent solution for secure file storage/archive. I won’t run file server 24/7. I just want to satisfy my paranoia and sleep better 😀

1. admin says:
  
  July 12, 2017 at 6:58 pm
  
  There is no substitute for a good backup. Store your files on two different kinds of media at two different physical locations and sync those media as often as possible. If something goes sour on the server, you’ll have backups of most of it. Since nothing can guarantee data will never bit rot, it’s best to skip things like filesystem checksums and use redundancy instead. For the paranoia you can use a tool like md5deep to make a list of (and verify) data checksums periodically, but I’d only bother doing such a thing very infrequently (maybe every half-year) because it takes a ton of time and if you’re rsync-ing for backup it’s not going to transfer a rotted file unless the source file’s change time is also different.
  
Esteban says:

August 8, 2017 at 12:45 am

Cheers, finally some logical points regarding ZFS, bit rot, raid, and backing up your data.

1. Purple says:
  
  September 14, 2022 at 3:30 pm
  
  he didn’t make a single logical point. He just moaned and said it wouldn’t happen. He literally said CRCs are useless and a waste of space… in spite of him previously acknowledging that bit rot happens (if he hasn’t experienced it, lol) and then says it’s useless even though a CRC and RAID-Z would allow for recovering the file. Having a backup doesn’t care for or mitigate this. RAID-Z does. But I’m a “zealot” lol ok chief. Emotional argument to a feature designed to solve a real problem. Saying it doesn’t happen is incredibly ignorant. I guess the folks who wrote ZFS did it for no reason, and companies with much more of a horse in the race than you’ll ever have are using it for gits and shiggles.
  This guy is a moron.
  
  1. Jody Bruchon says:
    
    November 4, 2022 at 2:35 pm
    
    Oh look, a dumb shit appears. 1/10 troll harder.
    
Josh says:

August 8, 2017 at 12:55 am

I’ve been in threads about the maths behind RAID5 failures. If they were taken at face value, I’d be sitting on six nines likelihood of seeing a URE take an array offline for any given year – but I’ve never seen it happen. People suggest I have to be lying when I say how many RAID5 arrays I have in production without a failure. It’s absurd.

Real Storage Admin says:

August 8, 2017 at 8:28 pm

I don’t know much about btrfs so I’ll stick to ZFS related comments. ZFS does not use CRC, by default it uses fletcher4 checksum. Fletcher’s checksum is made to approach CRC properties without the computational overhead usually associated with CRC.

Without a checksum, there is no way to tell if the data you read back is different from what you wrote down. As you said corruption can happen for a variety of reason – due to bugs or HW failure anywhere in the storage stack. Just like other filesystems not all types of corruption will be caught even by ZFS, especially on the write to disk side. However, ZFS will catch bit rot and a host of other corruptions, while non-checksumming filesystems will just pass the corrupted data back to the application. Hard drives don’t do it better, they have no idea if they’ve bit rotted over time and there are many other components that may and do corrupt data, it’s not as rare as you think. The longer you hold data and the more data you have the higher the chance you will see corruption at some point.

I want to do my best to avoid corrupting data and then giving it back to my users so I would like to know if my data has been corrupted (not to mention I’d like it to self-heal as well which is what ZFS will do if there is a good copy available). If you care about your data use a checksumming filesystem period. Ideally, a checksumming filesystem that doesn’t keep the checksum next to the data. A typical checksum is less than 0.14 Kb while a block that it’s protecting is 128 Kb by default. I’ll take that 0.1% “waste of space” to detect corruption all day, any day. Now let’s remember ZFS can also do in-line compression which will easily save you 3-50% of storage space (depending on the data you’re storing) and calling a checksum a “waste of space” is even more laughable.

I do want to say that I wholeheartedly agree with “Nothing replaces backups” no matter what filesystem you’re using. Backing up between two OpenZFS pools machines in different physical location is super easy using zfs snapshot-ting and send/receive functionality.

[Admin edit: I got mad when senpai didn’t notice me]

1. admin says:
  
  August 8, 2017 at 9:31 pm
  
  It does not matter what algorithm is used for the CRC/checksum/hash. In all cases it is a smaller number generated from data that (if taken as one string of bits) constitutes a massively larger number, and it takes time to compute and storage to keep around. The question is this: is it worth the extra storage and the extra computation times for every single I/O operation performed on the filesystem? I say it isn’t.
  
  Hard drives DO in fact know if something has bit rotted, assuming the rot isn’t so severe that it extends beyond the error detection capabilities of the on-disk ECC. Whenever a drive reports an “uncorrectable error” it’s actually reporting an on-disk ECC error that was severe enough that the data couldn’t be corrected. In my opinion, on-disk checksums (CRCs, hashes, whatever term is preferred) are targeting a few types of very rare hardware failures (they must mangle data despite all hardware error checking mechanisms AND must not cause any other damage that crashes the program or machine which would process or write that data out to disk) and do so at significant expense (a check must be done for every piece of data that is read from disk). Even ZFS checksums are not foolproof; for example, if data is damaged in RAM or even in a CPU register before being sent to ZFS, the damaged data will still be treated as valid by ZFS because it has no way to know anything is wrong.
  
  As discussed in my post, ZFS checksums are useless without a working backup of the data to pull from, preferably a ZFS-specific RAID configuration that enables real-time “self-healing” as you’ve mentioned. Without some sort of redundancy…well, what are you going to do? You know it’s damaged but you have no way to fix it.
  
  You seem to take particular issue with my assertion that checksums are a waste of space. Granted, they’re relatively small compared to file data, however the space issue pales in comparison to the processing time and additional I/O for storing and retrieving those checksums. If the checksums aren’t beside the data then that 128K read will incur at least one 4K read to fetch the checksum which is not nearby, resulting in a disk performance hit. Enough read operations with checksum checking at once and streaming read speeds approach the speed of fully random I/O a lot faster than they would otherwise. It also takes CPU time to calculate a hash value over a 128K block; while some are faster than others, all take CPU time and large enough block sizes will repeatedly blow away CPU D-cache lines during the checksum work, reducing overall system performance. Since many ZFS users seem to pair it with FreeNAS and relatively small, weak systems like NAS enclosures, the implications of all this extra CPU hammering should be obvious. Of course, a Core i7 machine with 16GB of DDR4 RAM might do it so fast that it doesn’t matter as much, but being able to buy a bigger box to minimize the impact of lower efficiency does not change the fact that such a drop exists.
  
  In computing, we have to choose a set of compromises since rarely does any given solution satisfy speed, precision, reliability, etc. all at the same time. In my opinion, ZFS data checksums are not worth the added cost, particularly since the remaining problem surface area is very small, and errors within it are unlikely to ever happen, once the error checking coverage of hard drive ECC, RAM and on-CPU ECC if applicable, and various bus-level transceiver error detection methods is subtracted. The beauty of computing is that you are free to make a different trade-off in favor of bit rot paranoia if it makes you sleep better at night. What’s right for me may not be right for you. I do not consider the very tiny risk of highly specific and unlikely corruption circumstances that can be detected to be worth covering ESPECIALLY since the same cosmic rays that can bit-flip the data in a detectable place could just as easily flip it in an undetectable place, but I’m not in your situation and making your choices.
  
  tl;dr: one of us is less risk-averse, and that’s okay.
  
  1. Bran says:
    
    November 25, 2020 at 12:17 pm
    
    Now I’m confused by the two conflicting replies from real Storage Admin and Admin. ZFS or ext4 for my new RAID6 on a QNAP 8-bay? Should I go ZFS?
    
    1. Jody Bruchon says:
      
      November 25, 2020 at 12:58 pm
      
      If you use ZFS, you have to use RAID-Z, otherwise you will get none of the advertised protection that ZFS offers other than detecting degraded data. If you’re using RAID-6, you should use XFS. Don’t use Ext4. Be sure to format the XFS filesystem with the proper flags:
      
      mkfs.xfs -m crc=1,finobt=1 -l size=64m /dev/device
      
      You may also need to use -d su=X,sw=X to align the filesystem to the stripe sizes, but xfsprogs-3.1.0 and up try to do this for you automatically.
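
      For the alignment flags, here is a rough sketch of how su and sw might be derived for a parity array. The helper name is hypothetical, and the 8-disk RAID-6 with a 64 KiB chunk is an assumed example; the chunk size has to match what the array actually uses (mdadm --detail reports it).

```shell
# Hypothetical helper: print mkfs.xfs stripe alignment options.
# su = RAID chunk size, sw = number of data-bearing disks
# (total disks minus 1 for RAID-5, minus 2 for RAID-6).
xfs_align_opts() {
    local level=$1 ndisks=$2 chunk_kib=$3 parity
    case "$level" in
        5) parity=1 ;;
        6) parity=2 ;;
        *) echo "unsupported RAID level: $level" >&2; return 1 ;;
    esac
    echo "-d su=${chunk_kib}k,sw=$(( ndisks - parity ))"
}

# Assumed example: 8 disks in RAID-6 with a 64 KiB chunk.
xfs_align_opts 6 8 64
```

      The printed options could be added to the mkfs.xfs command above if the automatic detection does not pick up the geometry.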
      
      1. Purple says:
        
        September 14, 2022 at 3:32 pm
        
        “other than detecting degraded data” – that’s a pretty big deal. If you have a backup, and are running rolling backups as you should, then if you can’t detect corruption, you’ll be backing up corrupt data. How is this hard?
      2. Jody Bruchon says:
        
        November 4, 2022 at 2:34 pm
        
        It’s not a big deal if the data isn’t degrading, and hard drives already detect data degradation on-the-fly with every read.
Mikael Persson says:

September 24, 2017 at 3:11 pm

Perhaps this is long since dead, but I wanted to give an example where “bitrot” is quite common. Plenty of laptops still have 2.5″ mechanical HDDs. If the drive is spinning and you pick up the laptop, it is quite likely to cause a few kilobytes of sequential broken data. Switch to ZFS, set copies=2, and the errors which the drive could notice, but not fix, are no longer a problem. Drive abuse to be sure, but quite common nonetheless.

1. admin says:
  
  September 25, 2017 at 12:34 am
  
  It’s pretty hard to cause the damage you’re talking about, but the damage to the disk surface will be caught by the on-disk error correcting code if this happens. It is extremely unlikely that physical damage to the platter surface will cause data damage that can fool the ECC.
  
  1. Ryt says:
    
    January 1, 2018 at 10:43 am
    
    I don’t like the idea that you simply assume the hard drive hardware ALWAYS catches these errors, and I think that is your major flaw in all of your arguments. I worked in aerospace, where you are not allowed to assume any hardware is flawless at any point in time, therefore redundant checks are always needed. Imagine a plane with only one alarm system on the hardware to tell you if something is wrong… that’s a jet I don’t want to be on, just in case something goes wrong and isn’t detected in time to stop further problems.
    
    The CPU overhead of ZFS is almost nothing with modern hardware that usually sits 80% idle anyway. 1% of storage for some extra security literally cost me less than $10 of hard drive space on my 7 drives, but the benefits of time savings using “zfs send” to save backups and replicate data are far more valuable. Time for me is way more important than saving a few bucks worth of storage.
    
    If there were more integrity features that cost less than 10% of anything on the system, I’d enable them, which is why I have an automated script that computes the sha256 of each and every file every couple of months and checks it against previous known values to let me know if there is a problem.
    
    It’s not rocket science, more checks are just better when it comes to the value of your data… more than one backup too.
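
    The periodic sha256 check described above could look roughly like the following. This is only a sketch, not the actual script: the function names are hypothetical, it assumes GNU coreutils and findutils, and the manifest path should be absolute and live outside the directory being hashed.

```shell
# Hypothetical helpers; assume GNU sha256sum, find, sort and xargs.

# Record a SHA-256 for every regular file under $1 into manifest file $2.
make_manifest() {
    local dir=$1 manifest=$2
    ( cd "$dir" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum ) > "$manifest"
}

# Re-hash the files under $1 and compare them against manifest $2.
# Prints only the files that fail; returns non-zero if any differ or are missing.
verify_manifest() {
    local dir=$1 manifest=$2
    ( cd "$dir" && sha256sum --quiet -c "$manifest" )
}
```

    Note that a check like this flags every changed file, so files that were edited on purpose need their manifest entries refreshed before the next run.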
    
    1. akismetuser530961444 says:
      
      January 1, 2018 at 11:17 am
      
      boy Albert was right…human stupidity is infinite.
      
      I consider myself extremely paranoid but not dumb. I must confess I almost fell for the ZFS/btrfs propaganda.
      
      My setup cost me less than an average ZFS server, even with a 900VA UPS. It is the most secure and redundant system I have ever built…still with 70% idle system resources.
      
      I can write even more paranoid system checks to utilise more system resources.
      
      The only weaknesses are fire and HW failures that nobody can avoid…that’s why you still have at least two backups/mirrors in two different places.
      
      Laziness and stupidity have caused more failures than the theoretical fears that a whole industry is based on.
      
    2. admin says:
      
      January 1, 2018 at 11:26 am
      
      I made no such assumption. Where did I ever say that the hard drive hardware ALWAYS catches these errors? You need to re-read what I’ve written and fully understand what the point of my post was. It is extremely unlikely that your hard drive will send bad data to your motherboard, but it is always possible.
      
      The ultimate stinger is that no matter how clever ZFS is, once the data is in RAM (even ECC RAM) it can be damaged due to a variety of hardware issues. Anything from the right series of bit flips to a CPU defect to a cosmic ray hitting a bus trace can mangle your data with no way to detect the damage.
      
      At some point the law of diminishing returns kicks in too hard. For most people ZFS in the setup required for the oft-touted integrity boost is impractical, and my greater point is that ZFS can’t just be deployed and voila, integrity! RAID-Z is mandatory. ZFS without RAID-Z adds little value over traditional RAID with external backups. Advocates of ZFS do not always or even frequently point this out while telling newbies to use ZFS and this bothers me.
      
      If you take the time to fully know and understand what you’re doing and you ultimately choose to deploy ZFS, that’s totally fine. I’m not saying your choice is not valid or that you should not have made it. I’m simply challenging the practical value of what ZFS brings to the table for people that are slapping together a machine to store stuff. Most ZFS talk is by ZFS zealots that loudly scream its virtues; I offer counter-points that are sorely needed for a user to fully understand what they’re getting into if they go the ZFS route and why it may not be as much of a miraculous data-saving black box as it is touted to be, especially without RAID-Z.
      
UrQuan3 says:

October 5, 2017 at 6:06 pm

And these are basically my thoughts on ZFS ever since I started hearing about it. Now, I’m not sure I would talk about ZFS independently of RAID-Z. The two are basically always paired. I will give them that the array expandability could be nice, but I have never seen a detailed speed test, and we highly value speed.

That brings me to what I actually wanted to comment on. We will never use RAID-5 ever again. It’s not because of anything you mention, but rather because the write speed is atrocious. After some problems, we found that our 5-drive array averaged about 5MB/s on write operations, compared to a single drive averaging around 45MB/s. We tracked down the problem to something inherent in RAID-5. The data and parity are saved in different locations on each disk. This means that each write requires the head to write, then seek to another location on the drive, and write again. The reason random I/O is slow is all the head seeking, and RAID-5 forces this for every write. For $100 more we moved to a RAID-10 with slightly less space and 200MB/s writes. This is, of course, for mechanical drives. SSDs are not nearly as heavily affected, but are still slowed by random writes.

Now, it does seem like to get the most out of modern drives and SMART, something would need to periodically force a read of every used bit on the drive to prevent bad bits from building up undetected. A full backup would do this, but they take forever. zpool scrub would do this. Does a full drive rsync do this as well? It’s much faster than a full backup.

1. admin says:
  
  October 5, 2017 at 6:49 pm
  
  I disagree that ZFS + RAID-Z are usually paired. The entire reason for my article is that people constantly sing the praises of ZFS without making it clear that RAID-Z (specifically, as opposed to ZFS on md/LVM RAID) is mandatory for many of the touted integrity features, specifically the magical self-healing that is such a huge draw. I feel that it is dangerous to advocate the features of ZFS without also explaining the requirements for those features to work, yet that’s what you see going on in most “what filesystem for my NAS/server?” threads: “ZFS, it magically stops bit rot and fixes damage! [But I’m not going to tell you about RAID-Z or emphasize good backups, nor about how detecting bit rot is useless without a non-broken backup copy!]”
  
  Your RAID-5 issue might be the same one I discovered if you’re using the md raid5 driver: very large stripe sizes cause massive write speed degradation and the default Linux md raid5 stripe cache size is too small. You’ll often see raid5 how-to guides say to use larger stripes for faster throughput but they are written by people that don’t understand that RAID-5 must be updated for an entire stripe at a time; it’s a form of write amplification just like SSDs, so even just writing one 4K sector requires reading not only a stripe width worth of sectors (minus the one being updated) from every disk excluding the parity disk but also writing one stripe width of parity in addition to the modified sector. For sequential workloads this tends to be of little consequence but for random writes it is simply a disaster. That’s why Linux caches up the stripe updates and tries to write them out more optimally, but the stripe cache is usually too small. It maxes out at 32768. Try using a 64k stripe width and setting the stripe cache size for all md raid5 arrays to 32768 after booting; you’ll probably notice a big difference in performance.
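
  A minimal sketch of the stripe cache tuning described above. The function names are hypothetical. The memory estimate uses the formula given in the kernel's md documentation as far as I can tell (page size times number of disks times stripe_cache_size), so treat it as an assumption and check it against the docs for your kernel; the 4096-byte page size is also assumed.

```shell
# Hypothetical helper: approximate RAM used by the md stripe cache, in MiB.
# Args: number of disks, stripe_cache_size, optional page size in bytes.
stripe_cache_mib() {
    local ndisks=$1 entries=$2 page_bytes=${3:-4096}
    echo $(( page_bytes * ndisks * entries / 1024 / 1024 ))
}

# Hypothetical helper: apply a stripe cache size to every md array that
# exposes the knob (only parity RAID levels do). Needs root; not called here.
set_stripe_cache() {
    local entries=${1:-32768} f
    for f in /sys/block/md*/md/stripe_cache_size; do
        [ -w "$f" ] && echo "$entries" > "$f"
    done
    return 0
}

# Estimated cost of the maximum setting on a 5-disk array.
stripe_cache_mib 5 32768
```

  Under those assumptions the maximum setting costs about 640 MiB of RAM on a 5-disk array, which is worth knowing before applying it on a small NAS box. The setting does not survive a reboot, hence "after booting".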
  
  RAID-10 has some issues of its own. I tried out md raid10 (far2) and found the overall performance to be quite poor relative to RAID-5. Of course, I didn’t try any sort of tuning knobs so I may not have given it a fair shake; however, I find that a well-tuned RAID-5 with a properly formatted and aligned XFSv5 filesystem performs well enough to easily handle dumping lossless compressed video data to it in real time while still serving up random small reads without issue, so it’s good enough for my situation. I can understand others choosing a different path though, and that’s what is so wonderful about the Linux ecosystem in general: everyone has options and can pick the one that suits them.
  
  A full-drive rsync will force reading of file data and most of the filesystem metadata but if you really want to force a full disk or array read from end-to-end, there’s an elegant and absurdly simple solution (though it’ll surely starve other tasks trying to perform I/O):
```
cat /dev/md0 > /dev/null
```
  Or if you have the wonderful amazing glorious pv utility and want a progress indicator:
```
pv -pterab /dev/md0 > /dev/null
```
2. David Hajes says:
  
  October 6, 2017 at 9:12 am
  
  6TB RAID-10 XFS array scrub takes about 8 hours on my file server.
  
  read/write over SMB: a tragedy (most likely Apple vs Linux vs Win)
  
  NFS: 100+ MB/s over gigabit LAN
  
  otherwise max. speed 250 MB/s of RAID-1
  
Benjamin Bennett says:

October 21, 2017 at 6:28 pm

Bit rot is a problem now. It isn’t 1995, and you are just incorrect.

Read the studies on hard drives and what the ACTUAL hard drive manufacturers say.

The entire reason for the extra checksum and checking/correcting on every read is the sheer size of hard drives now.

No, hard drive ECC/CRC will not save you. Statistically, for every 12TB of data read there will be a silent data read error, and that is what the manufacturers say, not some ZFS zealots.
The read error rate hasn’t changed much since 1995, and hardly anyone in 1995 would have been reading 12TB.
You can buy a single 12TB hard drive now; the problem is you cannot read all 12TB without an error.
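
The 12TB figure appears to come from the commonly quoted datasheet spec of one unrecoverable read error per 10^14 bits read. That rate is assumed here rather than taken from any particular drive, and the function names are made up. Datasheets usually state it as an upper bound, and it describes errors the drive reports, which is not the same thing as silent corruption. A sketch of the arithmetic:

```shell
# Hypothetical helpers. The exponent argument is the datasheet figure:
# one unrecoverable read error per 10^e bits read (10^14 is an assumption).

# Terabytes (10^12 bytes) read per expected error.
tb_per_error() {
    awk -v e="$1" 'BEGIN { printf "%.1f\n", 10 ^ e / 8 / 1e12 }'
}

# Probability of reading tb terabytes with no error at all, treating bit
# errors as independent events occurring at exactly the datasheet rate.
p_clean_read() {
    awk -v tb="$1" -v e="$2" 'BEGIN { printf "%.3f\n", exp(-(tb * 1e12 * 8) / 10 ^ e) }'
}

tb_per_error 14
p_clean_read 12 14
```

Under this simple model, one error is expected per 12.5 TB read, yet a full 12 TB read would still complete cleanly a bit more than a third of the time, so even the worst-case spec does not mean a 12 TB read always fails.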

Finally, basically all OSes are going down the same route that ZFS did of checking checksums of data on the fly: Linux has btrfs (use zfsonlinux), Mac OS X has the new APFS, and Microsoft has ReFS.

Read more here https://web.archive.org/web/20090228135946/http://www.sun.com/bigadmin/content/submitted/data_rot.jsp

1. admin says:
  
  October 21, 2017 at 7:37 pm
  
  You are objectively wrong and I can prove it any night of the week. I have a 12TB RAID-5 array sitting eight feet from me. If your “can’t read 12TB without an error” assertion is true for a single drive then five drives should be five times worse off, yet I’ve run a weekly data scrub on the array since I built it and there has not been a single parity mismatch. Even if the drive had a set of bit flips that happened to pass by ECC, the RAID-5 parity check would almost certainly still fail. For the parity check to pass despite the bit flips they’d have to be extremely specific and possibly span multiple disks in that specific manner.
  
  You also cite an article that cites studies from nearly a decade ago. Storage technology has changed a lot since 2008. The article is ultimately a marketing article, not a technical article. It’s written by a Sun “evangelist” which is a stupid name for “obnoxious marketing guy.”
  
  ReFS is being disabled as a new FS option in Windows 10 Pro SKUs soon, APFS is slow and has a lot of growing pains, btrfs is wonky in all sorts of ways and not trustworthy…what’s your point with all that other stuff? None of those are ZFS and none of those are seeing mass adoption.
  
  How do you explain my 12TB RAID-5 scrub consistently passing? Am I just super lucky and somehow blessed by God himself to the point that I never experience these data errors or is your assertion based on grossly outdated knowledge and the bit rot panic hype pushed by ZFS fanboys?
  
  1. ktry says:
    
    May 11, 2019 at 9:59 am
    
    You might look at https://raid.wiki.kernel.org/index.php/Scrubbing_the_drives. It implies that parity is only compared against the data if the repair command is issued; check simply reads all blocks on the volume.
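
    For reference, Linux md exposes scrubbing through sysfs. As far as I understand the kernel's md documentation, writing check to sync_action reads every stripe, compares data against parity, and only records disagreements in mismatch_cnt, while repair additionally rewrites the redundancy to make the stripe consistent. Treat that reading as an assumption and confirm it against the documentation for your kernel. The function names and the optional sysfs-root argument are hypothetical.

```shell
# Hypothetical helper: start a consistency check on an md array.
# Needs root and a real array; it is defined here but not called.
md_start_check() {
    local md=${1:-md0}
    echo check > "/sys/block/$md/md/sync_action"
}

# Hypothetical helper: print the mismatch counter left by the last
# check/repair pass. $2 overrides the sysfs root, which allows dry runs
# against a fake directory tree.
md_mismatch_cnt() {
    local md=${1:-md0} root=${2:-/sys/block}
    cat "$root/$md/md/mismatch_cnt"
}
```

    A non-zero counter after a check pass says that some stripe is inconsistent, but not which block in it is wrong.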
    
    1. Jody Bruchon says:
      
      May 11, 2019 at 10:27 am
      
      You’re right. The parity in RAID was never intended to be used as an error detection and correction mechanism, only as a way to store a redundant copy of the data in a stripe that can be recovered if a disk is lost. The big problem with running parity repair is that parity repair assumes that a parity inconsistency should be fixed by recalculating the parity, but if it’s actually bit rot of the data in the stripe that caused the parity to become mismatched then overwriting the parity equals overwriting the correct data with the bit rot data. It would be nice if there was a way to check the parity without modifying it. Unfortunately, as there is no way to know if the original data or the parity data is the corrupted block, there is also no way to know which repair is correct at the RAID array stripe level. It’s the same exact issue as ZFS without RAID-Z: you have ZFS checksums that can detect an integrity error but you have no way to fix it, so the data is still lost and backups are your only salvation.
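
      The ambiguity described above can be shown with a toy XOR parity calculation. This is an illustrative sketch with one-byte "blocks" and arbitrary made-up values, not how md lays out data on disk.

```shell
# XOR parity over any number of integer "blocks" (one byte each here).
parity() {
    local p=0 b
    for b in "$@"; do p=$(( p ^ b )); done
    echo "$p"
}

# Four arbitrary data bytes and their stored parity.
d0=$((0x41)); d1=$((0x5a)); d2=$((0x13)); d3=$((0xc7))
p=$(parity "$d0" "$d1" "$d2" "$d3")

# Case 1: one bit rots in a data block. Case 2: the same bit rots in parity.
rotted_d1=$(( d1 ^ 0x08 ))
rotted_p=$(( p ^ 0x08 ))

# The "syndrome" is stored parity XOR recomputed parity; non-zero = mismatch.
syndrome_data=$(( p ^ $(parity "$d0" "$rotted_d1" "$d2" "$d3") ))
syndrome_parity=$(( rotted_p ^ $(parity "$d0" "$d1" "$d2" "$d3") ))

echo "data rot syndrome:   $syndrome_data"
echo "parity rot syndrome: $syndrome_parity"
```

      Both cases produce the same non-zero syndrome, so a single parity block can reveal that a stripe is inconsistent but cannot say whether the data or the parity is the damaged part.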
      
Mohak says:

November 5, 2017 at 7:32 pm

You do make some very interesting points. I certainly agree that one would be foolish to use ZFS solely on the basis of a few anecdotes. What I am curious about is whether there have been large scale studies on bit rot and their results. Unless we have such data, we can’t make an informed decision about the best fs suited to our needs.

1. admin says:
  
  November 5, 2017 at 7:45 pm
  
  See, that’s the problem: there are statistics on bit rot out there but they’re accepted without question, passed around, and as with all technological statistics they become outdated. A hard drive (say 80GB) from the early 2000s might be statistically guaranteed to have data loss after 10TB of reads, but that’s irrelevant to a 3TB drive today which uses completely different magnetic storage and retrieval methods. Of course, if one were to (incorrectly) quote the 10TB figure for the 3TB drive, that means the drive can only be read three times before it is guaranteed to lose data…but while that figure may be one of many passed around during a ZFS bit rot paranoia pow-wow, it is not applicable to the modern 3TB drive for multiple reasons. One of the other reasons is that uncorrectable read errors in HDDs and bit rot are two different things: one is a set of bit flips that fails drive ECC checking while the other is a set of bit flips that either fools the ECC method used or that happen in hardware beyond the drive read hardware.
  
  In my personal experience I have seen many incidents of damaged data due to hardware issues such as bad capacitors or power supplies or a power failure, but I have not ever been bitten by bit rot that I am aware of (and if I have been, it clearly didn’t matter since it has not affected me.)
  
dmkst says:

November 28, 2017 at 11:53 pm

Thanks for this post. You’ve got some great points here. It is indeed fair that the need for filesystems like ZFS is mostly mitigated by technology that has been built in to hard drives for years and years now.

It’d be great if we had some more complete and up-to-date statistics, but we don’t. While it’s definitely not reasonable to assume error rates of old hard drives apply to new drives like some of your critics have, that doesn’t mean new drives don’t have error rates. I just wonder what those error rates are. I note that you haven’t had any undetected issues that you know of with your disk arrays, and I can’t say I’ve come across any with mine (I only use ZFS on one of my arrays). I don’t want to come across any, though, which is why I make use of ZFS/ReFS in some circumstances.

I disagree with you on two points, though. First and foremost IMO it is absolutely reasonable to assume that any ZFS deployment will involve mirroring/striping/RAIDZ. Any complaints you have regarding ZFS zealots which only apply when ZFS is used on single disks are almost redundant IMO. I’d bet money that almost nobody* uses ZFS on single disks (*relative to total ZFS users). Just go ahead and google ZFS guides, all the ones I just found assume more than one hard drive and that the reader already knows about RAID, and most of them cover why you probably shouldn’t bother with ZFS on a single disk. Nobody I know would try to use ZFS on a single disk. I think it’s reasonable to assume that at this point in time (perhaps not in the future, if somebody creates a click-to-magically-ZFS-all-the-things for Windows/Mac then it will be different), someone who is interested in deploying ZFS (who didn’t hear about it by stumbling across it on BuzzFeed or whatever) is already using multiple disks in RAID arrays for their critical data.

Second, regarding RAID5. Whilst it is true that anybody who cares about their data should have working, regular backups, this does not negate the availability feature of RAID. Using RAID5 on large disks will mean rebuilds take ages as you acknowledged. But having a backup doesn’t make a failed rebuild OK. The downtime might be acceptable depending on the installation, but you seemed to dismiss the implied downtime outright, or perhaps I’m misinterpreting you.

At the end of the day though we’re talking about problems that are very small and may not even matter. It puts the ZFS zealotry in perspective. And the Anti-RAID5 brigade too, though RAID5 still isn’t great.

1. admin says:
  
  November 29, 2017 at 6:59 am
  
  “It is absolutely reasonable to assume that any ZFS deployment will involve mirroring/striping/RAIDZ” – no, it is not. It is reasonable to assume that someone who takes the time to fully understand what is required for the ZFS auto-healing magic to work will usually choose to deploy it properly. The problem is that ZFS advocates are all over the place in forums, particularly forums (I’m thinking of big tech discussion sites like Reddit, Tom’s Hardware, Ars Technica) relating to data storage, and they often say “use ZFS, it does [insert list here] that others don’t!” The caveat that ZFS detecting bit rot is useless without a way to recover the data from the rot (backups or RAID-Z) rarely comes up. It’s good that many guides will go over this, but I think you’ve made three bad assumptions: one, that other people who are finding out about or attempting to deploy ZFS are technically competent and take the time required to understand something technical before trusting it with their data; two, that people wanting to deploy ZFS will find guides that steer them into making the correct choices about how to do it; and three, that people who find guides that suggest RAID-Z will actually do it. The people in the third group are probably beyond help, but the other two may screw up through pure naivete.
  
  With RAID-5, I was trying to say that the rebuild time may not be a big deal to some installations like mine (I can afford to wait for a 12TB array to rebuild) and having proper backups makes the very unlikely failure of a second disk during rebuilding a moot point. If I had a 28-disk array instead of a 5-disk array, I would probably want something else with faster rebuilds. RAID-5 has a space economy advantage that no other non-RAID-0 array formats (obviously ignoring RAID-2/3/4) offer, so it has its place. When you’re dumping 3TB 7200RPM drives into an array at $90 a pop plus assembling external 3TB backup drives for the same array, it’s nice to minimize the total cost by using RAID-5 instead of RAID-6 or RAID-10. The point was that the “RAID-5 is dead” hype is not necessarily accurate.
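
  The space economy argument can be put into numbers with a small sketch. The function names are hypothetical; the five 3TB drives at $90 each are the figures from this comment, and the RAID-10 line assumes two copies of every block on a layout that tolerates an odd disk count.

```shell
# Hypothetical helper: usable capacity in TB for n equal disks.
# Args: RAID level (5, 6 or 10), number of disks, size of each disk in TB.
usable_tb() {
    awk -v level="$1" -v n="$2" -v size="$3" 'BEGIN {
        if (level == 5)       u = (n - 1) * size
        else if (level == 6)  u = (n - 2) * size
        else if (level == 10) u = n * size / 2
        else exit 1
        printf "%g\n", u
    }'
}

# Hypothetical helper: purchase cost per usable TB.
# Args: RAID level, number of disks, disk size in TB, price per disk.
cost_per_usable_tb() {
    awk -v total="$(( $2 * $4 ))" -v u="$(usable_tb "$1" "$2" "$3")" \
        'BEGIN { printf "%.2f\n", total / u }'
}

for level in 5 6 10; do
    echo "RAID-$level: $(usable_tb "$level" 5 3) TB usable, \$$(cost_per_usable_tb "$level" 5 3 90) per usable TB"
done
```

  For this array that works out to 12 TB usable on RAID-5 versus 9 TB on RAID-6 and 7.5 TB on RAID-10, which is the cost gap being described.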
  
  Absolutely agreed on the last point. Arguing over very unlikely problems is the way of the nerd! 😉
  
kop says:

December 27, 2017 at 5:10 am

Let’s start off by stating that I am by no means an expert; novice would more accurately describe me. As such, I would like to hear some answers from the other side of the ZFS fanatics.
Yes, there is no replacement for backups, but some data is not important enough to be backed up (with the cost that comes with it). Instead, redundancy may be enough for some data in the home user scenario. In that respect I’d like to have data storage that is as safe as possible with redundancy.
Is it not a large advantage of the ZFS RAID forms that they work on the block level, so that in a RAID-5 case a URE or other kind of error during rebuild will kill all your data, while it will only corrupt some of your data using RAID-Z1?
Additionally, is it not an advantage that ZFS is a complete solution? It includes snapshots, checksumming/scrubbing, compression, deduplication and RAID configuration (maybe more that I don’t know of). I would imagine that this is an advantage for a novice end user, who doesn’t have to read up on and install/configure multiple tools and hope that they will work together. The same goes for general compatibility, where it has fewer dependencies than the mix of multiple tools, so that updates are less likely to break things (in a way that is probably easily fixable, but again not for me).

Bozhin says:

December 28, 2017 at 5:00 am

First of all, nice article, I enjoyed reading it.
As kop noted before me, ZFS is not just about the bit rot fuss. Once you properly shape your physical layer, which in my opinion is much less flexible than Linux software raid, it gives you flexibility on a very different level. True, you can’t reshape RAID-Z to MIRROR/RAID-Z2/3 (or vice versa), but once you have your pool(s) you can easily make logical filesystems on top of that which you can expand/shrink at will, create block devices and other filesystems on top of these block devices (like XFS/ext3/ext4/etc), change various filesystem options on the fly – like compression, record size, atime (without the need to remount)… and the list goes on and on.
Granted, ZFS is expensive on resources, but it gives back some of its toll, for example in cheap snapshots that can be used with send/receive to transfer logical filesystems over the network (for example via ssh), and later transfer only what differs between the last snapshot and the current filesystem state. That’s a very neat feature in my opinion, very useful if you need to transfer a VM block device from one server to another with minimal downtime, for example.
I agree that the hardware problems that ZFS claims to protect us from are indeed very rare and when there are proper backups that does not even matter. But the combination of physical management, volume groups and logical volumes and the interaction between these components greatly simplifies storage administration tasks. And it’s not without a reason they made the filesystem aware of what’s going on in the physical layer and vice versa.
Of course these benefits come at a price. People often neglect the hardware requirements to properly run a filesystem as complex as ZFS. Lack of ECC RAM may not only cause it to fail to detect problems (like bit rot) but may even cause these problems, like any other filesystem by the way. And it’s not just the RAM that matters.
Compared to BTRFS it is very mature and stable, even under Linux. I once made the mistake of trying btrfs with semi-production backup data, and after that I would probably not even dare to touch it again for the next few years at least. That’s not the case with ZFS; I already use it extensively and feel very happy about it.
All that said I still feel great love for XFS, I think this filesystem is greatly underrated compared to some more famous choices like ext4. I use it for all kinds of needs and I think it’s just the greatest for general purpose use from these “simpler” filesystems out there.
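
The snapshot-based send/receive workflow mentioned above can be sketched as a helper that only prints the commands instead of running them. The function name is hypothetical, and every dataset, snapshot, and host name below is a placeholder.

```shell
# Hypothetical helper: print (not run) the commands for one incremental
# ZFS replication step from a previous snapshot to a new one.
# Args: dataset, previous snapshot, new snapshot, remote host, target dataset.
zfs_incremental_plan() {
    local dataset=$1 prev=$2 new=$3 host=$4 target=$5
    echo "zfs snapshot ${dataset}@${new}"
    echo "zfs send -i ${dataset}@${prev} ${dataset}@${new} | ssh ${host} zfs receive ${target}"
}

# Placeholder names throughout.
zfs_incremental_plan tank/vm monday tuesday backuphost backup/vm
```

Printing the plan first makes it easy to review the snapshot names before anything touches the pools.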

1. admin says:
  
  December 28, 2017 at 8:09 am
  
  Excellent points all around. I would like to point out that LVM on Linux provides a lot of the storage pool functionality that ZFS does, so people have plenty of choices. My biggest concern with ZFS fanaticism lies with the danger to newbies and the questionable need for the added complexity given the apparent rarity of the bit rot problem. I don’t have anything against ZFS itself since it does have purposes that it apparently serves quite well, I just think that fanaticism must be tempered with cold hard reality and it’s dangerous to explain bit rot and ZFS without explaining the “proper setup” required to make that actually work as expected. I think the average user is hugely more likely to have an “I don’t back up” problem than any “I lost data to bit rot!” problems and ZFS without proper hardware and configuration can make that situation even worse.
  
  What little experience I’ve had with btrfs makes me prefer to avoid it like the plague.
  
  The only thing I don’t like about XFS is that there can still be rare hiccups that truncate extents to zero-length. It has happened to me roughly twice in the past six years, but I have snapshot-based rsync backups 😉 so I catch the problem and restore from backups.
  
  1. Bozhin says:
    
    December 28, 2017 at 11:17 am
    
    I totally agree, it’s nice to have the possibility to choose from such diverse software projects. LVM is a wonderful piece of software, long proven so far. I just tend to prefer ZFS over it lately.
    It’s worth noting that Ceph RBDs also have some logical volume capabilities similar to thinly provisioned LVM, but they are useful in different use cases.
    I guess I’ve been lucky enough to have never encountered any kind of truncate problems with XFS. Or if I’ve ever encountered problems like these, I’ve not noticed them or I’ve ignored them due to other bigger problems with the same setup that have caused me to resort to backups.
    
  2. ktry says:
    
    May 11, 2019 at 10:35 am
    
    How does one prevent a latent corruption from propagating through the entire backup history before it is discovered?
    
    While LVM is a great utility, its snapshot feature is not zero cost. I’ve used LVM snapshots on ext4 filesystems for years to create consistent backups using rsync or rdiff-backup. Just the presence of a single LVM snapshot impacts the performance of the ext4 filesystem until it is removed. During my initial LVM testing, each additional concurrent snapshot resulted in a greater loss of filesystem performance.
    
    I’ve since replaced LVM+ext4+hardware-RAID6 with zfs_raidz3+iSCSI. Snapshots are created every 15 minutes, hourly, daily, weekly, and monthly with various retention periods. This enables me to easily recover from human error. I haven’t detected a performance hit with literally hundreds of concurrent snapshots. I use zfs send/receive to replicate ZFS filesystems to two other remote pools for backups, and periodically to offline backups.
    
    1. Jody Bruchon says:
      
      May 11, 2019 at 12:31 pm
      
      I don’t use snapshots at all, or LVM for that matter. I use mdadm RAID-5 and rsync to large external drives and that’s all. One of the nice things about rsync is that if a file changes exclusively due to bit rot, the timestamp doesn’t change, so the corrupted file never gets updated on the backup. You’d have to touch the inode to trigger a backup, and chances are fairly good that you’ll notice a problem with the data if you actually use the data, especially for something like a photo or video file where a single instance of corruption will have visible effects across large portions of the image due to the way the compression operates. If I were paranoid, I could use --ignore-existing to copy new files, then use -vn to get a list of existing files that would be copied, then use -vnc to get a list of files that differ based on checksum instead of mtime, then cat list1 list2 | sort | uniq -u to get a list of files that appear to be corrupted on one of the filesystems. This would obviously take a great deal of time and still doesn’t answer the question of how to tell which file is the non-corrupt one, but it’s nice to know that standard Linux tools exist to do such things.
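
      As a rough illustration of that paranoid check, here is a minimal Python sketch rather than the rsync pipeline itself. It assumes the backup is a plain directory tree mirroring the source, and the function names are made up for this example. It reports files whose size and mtime match on both sides but whose contents differ, which is the signature of silent corruption on one of the two copies.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_silent_mismatches(source_root, backup_root):
    """List relative paths whose size and mtime match but whose content differs."""
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, source_root)
            dst = os.path.join(backup_root, rel)
            if not os.path.isfile(dst):
                continue  # new file: a normal backup run would copy it
            s, d = os.stat(src), os.stat(dst)
            if s.st_size != d.st_size or int(s.st_mtime) != int(d.st_mtime):
                continue  # legitimately changed: a normal backup run would update it
            if file_digest(src) != file_digest(dst):
                suspects.append(rel)  # same metadata, different bytes
    return sorted(suspects)
```

      Like the rsync approach, this reads every file on both sides in full and cannot say which copy is the good one; it only narrows down where to look.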
      
      As I said in the original post, it’s all a question of risk management and risk tolerance, and that’s subjective. Some people need the real-time paranoia of a system like you have described, but it isn’t appropriate for everyone.
      
Stefan says:

January 12, 2018 at 4:16 pm

Thanks for enlightening me in this topic. I will most likely continue using ext4 or xfs for my data partition on my new server. But I will still use BTRFS on my / for transactional updates on openSUSE, which is pretty neat.

I do have one question on RAID-5 vs RAID-10. At work we have a server with RAID-5 HDDs for storage and RAID-10 SSDs for high-performance I/O when compiling LibreOffice (which is quite demanding). On my personal home server I want to use RAID-10 for its (slight) speed benefits. What I don’t understand from your explanation is why I need four disks to back up my four-disk RAID-10. I would simply pull out one disk of each node (unmounting, not really unplugging) and replace it with a backup disk, and RAID-10 should start writing the missing data to the two new disks. This should be enough to back up the whole system. In addition to that I will buy a third backup disk which is the size of the whole RAID (4TB) and save an image of each backup disk on it for a redundant long-term backup which will be physically stored at a different place.

1. admin says:
  
  January 12, 2018 at 9:22 pm
  
  I’m going to guess that you misread something in my long post or that I explained poorly and didn’t realize I had done a bad job. Degrading your RAID array to back it up is a very bad idea. Don’t ever ever ever do that. If nothing else it has to write whole disks at a time which can be enormously wasteful.
  
  If you are using RAID-10, you’d only need two disks to back it up, not four, and I’d recommend that they be external drives that are individually functional (no RAIDing externals together). If your RAID is 4TB in logical capacity, one 4TB external is enough to back it up, and two if it’s extremely important and you’d like to keep one off-site and cycle them out. Use rsync to update backups. If using a Linux filesystem with support for preallocation (most have it), use rsync with --preallocate to minimize fragmentation. If you have the extra capacity, use rsync with -b and --backup-dir=[dir] to make backup copies of files that are changed. With a little shell script magic, you can use a numbered directory and rotate/delete those backups for a full snapshot-style backup system.
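
  The numbered-directory rotation could be sketched as follows, in Python rather than shell. The `changed.N` naming scheme, the `keep` count, and the function name are assumptions invented for this example, not part of any existing script.

```python
import os
import shutil


def rotate_backup_dirs(base, keep=7, prefix="changed"):
    """Shift prefix.0 -> prefix.1 -> ... and return a fresh, empty prefix.0.

    At most `keep` directories survive; the oldest one is deleted first.
    """
    oldest = os.path.join(base, f"{prefix}.{keep - 1}")
    if os.path.isdir(oldest):
        shutil.rmtree(oldest)  # drop the generation that falls off the end
    # Rename from the highest index down so nothing is overwritten.
    for i in range(keep - 2, -1, -1):
        current = os.path.join(base, f"{prefix}.{i}")
        if os.path.isdir(current):
            os.rename(current, os.path.join(base, f"{prefix}.{i + 1}"))
    newest = os.path.join(base, f"{prefix}.0")
    os.makedirs(newest)
    return newest
```

  The returned path is what would be handed to rsync as the --backup-dir value for that run, so each run’s replaced files land in their own generation.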
  
  While I have your attention, I want to mention that I tried RAID-10, specifically with a far layout, and found that it performed significantly worse on sequential read/write operations than the RAID-5 I was about to use before trying it out. I did not dig too deeply to find out why. It may be faster for random I/O but most of what I do often involves long sequential operations that tend to have only small amounts of competing I/O (mostly user data backups prior to OS reinstalls) so optimizing fully random I/O at a strong penalty to sequential I/O was simply not acceptable to me. This was at least four or five years ago so things may have changed since then. Benchmark common use cases with real-world data before you trust that you’ll actually see a speed boost. Just because a synthetic benchmark or a know-it-all nerd on the internet says it’s better doesn’t mean it actually is. That applies to my own advice as well; I make no claims of perfection.
  
  If your machines have enough RAM, have you considered compiling LibreOffice in a tmpfs instead? It would probably be a lot faster than those RAID-10 SSDs, if no other reason than the removal of all of the I/O layers in between.
  
Andreas says:

February 15, 2018 at 2:18 pm

Interesting post.

I’m a sysadmin at a small business btw, so my needs probably differ from those of most people.

ZFS checksumming has been, to us, extremely helpful in a few cases where someone has failed to jam a hard drive all the way into its bay. This happens often enough (for god knows what reason) that I wouldn’t consider it really unusual. Probably because the bay locks weaken a little with time and bend slightly even when snapping shut. When that happens, the disk still thinks it’s passing its internal consistency checks, but the data that arrives over the connector is garbled as a result of the poor connection. ZFS picks up on that and makes the disk retry its reads, resulting in (eventually) correct data but incredibly shitty performance.

Even without that, there are features in ZFS that I wouldn’t want to be without (but then again, I work with this stuff and my needs tend to differ in volume from that of most home users).

Incremental CoW snapshots mean we back up our 200+ containers incrementally in about 15 minutes instead of several hours, as used to be the case, and then the incremental backups on the datacenter-local backup target go cross-datacenter to our second datacenter in about the same time (the first sync still takes a long time though, but that’s handled in the background so no one really cares).

Pretty much instant snapshotting/cloning of filesystems means setting up test containers of running production systems is done without the previously mandatory “go-get-a-cuppa” as well, and restoring a full backup from the latest snapshot (which we take every 30 minutes because they’re so damn cheap) takes all of 6 seconds from crashed-and-dead to up-and-running. I actually just timed it. Granted, this means that your applications have to be set up to never store any state on disk, or the app server will be a bit lost when it first starts, but in our case we use a database cluster for all of that anyway, so *meh* on that requirement.

We do use the ZFS raid as well. It’s nifty, takes barely any setup and means we don’t have to care about keeping track of driver bugs and whatnot from various hardware vendors’ raid cards. This is especially handy since we buy most of our servers on eBay for $1000 or less, so they’re inevitably refurbished and 3-4 years old, but otoh we can afford to get about 4 times as many of them, so our containers run in parallel over multiple physical servers.

We’ve taken one particular effort to keep snapshots fast and efficient though, and that’s to make sure most logs go off-server instead of being saved locally. Only in the case of network failures are things stashed on the local disk. Because of that, most snapshots are actually 0 bytes, making them virtually noops.

I hadn’t even considered checksums part of the ZFS featureset, but since they provide around 95% of the write performance of ext4 (in our case anyway, but we have plenty of CPU to spare on all servers), it’s worth sacrificing a little to get all the other good stuff.

So yea, if you’re running a home system and just need disk redundancy, regular md setups will probably do the job (although software raid was slow as hell last time I looked, and hardware raid cards were pretty expensive if you wanted good ones), but for those of us running a bucketload of lxc containers and just want migrations, backups and other “normal” things to run smoothly, ZFS is a godsend. Make sure to turn off deduplication though. Even with our physical machines running nearly identical machines in all its containers, it’s simply not worth it.

So yea, if you’re running a small business, or like to fiddle around with application servers, ZFS is a really good fit. If you just want to not lose data, you can probably just get an external USB drive and auto-backup when you plug it in, or a home NAS and sync your stuff there when you join your home network. Both would be cheaper and easier to set up for a single machine.

1. admin says:
  
  February 15, 2018 at 3:19 pm
  
  Thanks for all that info and feedback. It’s nice to hear from someone who has legitimate uses of ZFS and looks at the technology without letting emotional investment get in the way of its evaluation. I run an extremely small business with “small iron” servers and my setups are all md-raid5 with manually rotated rsync-snapshotted external backup drives. No virtualization, large-volume web traffic, containers, etc. In the end they are almost 100% used as network attached storage. Since I deal almost exclusively with home and small business customers all day long I never see anyone that would truly benefit from the features of ZFS and btrfs.
  
  My article was written from my perspective as someone who sees ZFS inappropriately advocated to data hoarding hobbyists by other hobbyists. Joe Schmoe trying to cheaply build a “big” home server as a fun toy is more likely to do something horribly wrong with ZFS than benefit from it. Being a one-man crew means I often opt for solutions that are as simple as possible so that when I inevitably have to unwind a problem I have fewer potential contributing factors to worry about. I can only imagine the kind of trouble that a user toying with advanced filesystem snapshot features could get into if they did something “clever” to “free some space” or “boost the performance.” More fancy knobs open more grand possibilities to add to the UNIX-Haters Handbook.
  
  It sounds like your use cases are definitely the sort of thing that ZFS was built for. It puts things in a whole new perspective that the little guy who thinks 10TB is spacious would rarely find out about on their own. Thanks again!
  
  1. Robert says:
    
    March 7, 2018 at 3:02 pm
    
    To echo Andreas’ comment a bit, if you look back at the public commentary from the initial developers at Sun, that’s what it was marketed as; a replacement for hardware RAID controllers that all came with their own price tags and quirks. Although I can’t seem to find a link (not surprising since Sun itself has changed hands) they specifically called out EMC in reference to the “inexpensive” part of the acronym and asked “how did we get here?”
    
    I think that’s a fair point, and Sun delivered a worthy alternative addressing that criticism. The key features being: the ability to deploy arrays on any-ole hardware, make those arrays transportable to other hardware, and while they were at it, make administration happen in the background transparent to the users, other than performance issues during repair operations and what not.
    
    And hey, all of that works. For those of us working with consumer hardware and consumer budgets for small businesses and nonprofits, ZFS is great for that hardware/software portability. In fact I’d go so far as to say it’s beyond ‘great’ and falls in the realm of ‘necessary’. In this day and age the typical lifecycle of consumer computer hardware isn’t very long.
    
MrPete says:

August 16, 2018 at 1:49 pm

Thanks for this. I’m an old hand in this biz (you can still find my 1986 USENET post on RLL codes 🙂 )

There’s a new failure mode that your article doesn’t anticipate. I just got burned by this so it is quite fresh for me… at this point I don’t yet have a solution that makes me happy.

You wrote: “A drive that has this many bit errors in close proximity is likely to be failing and the S.M.A.R.T. status should indicate a higher reallocated sectors count or even worse when this sort of failure is going on. If you’re monitoring your drive’s S.M.A.R.T. status (as you should be) and it starts deteriorating, replace the drive!”

Here’s the symptom I observed:
– High quality SSD used as boot drive. It’s less than a year old, not a lot of writes. No errors.
– Suddenly, block 0 (the Most Important Block on many drives) is unreadable and things are going downhill. Over 8000 SMART log errors. It fails both short and long tests.
– Of course, my initial thought is the #$@^ drive has gone bad.
– Since ALL that’s bad at this point is the GPT and partition table info, I make an attempt to rebuild… just in case. (Yes, on the original drive, after making a ddrescue copy for good measure. Yes I have backups…)
– Surprisingly, the rebuild not only works… the drive no longer has any bad blocks
– In fact, now SMART thinks the drive is more or less perfect!

Being me, I have been digging in on this, and learned:
– SSDs don’t fail as often as HDDs… However, they lose data much more often than HDDs
– It is not just write degradation, not just bitrot when the drive is powered down.
– On an SSD, areas that have only been READ for quite some time can eventually become weaker, degrade, and become unreadable. ***And so far I have found no indication that the drive firmware addresses this***

Several Implications:
– SSD’s would benefit from occasional rewrite-in-place of data that’s never been touched
– The fact that an SSD has serious SMART errors is NOT necessarily a good reason to replace the drive!

Seems to me this goes beyond ZFS: modern filesystems probably ought to monitor SMART data. Internal ECC error rates ought to be monitored over time, and static drive data rewritten as needed.
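
The “monitor over time” idea could look something like the Python sketch below. It assumes the SMART counters have already been collected into one dictionary per polling run (oldest first); the attribute names follow the labels smartctl commonly prints for ATA drives, but vendors differ, and the sample values in the usage note are invented.

```python
def growing_attributes(samples, watched=("Reallocated_Sector_Ct", "Current_Pending_Sector")):
    """Return {attribute: (first, last)} for watched counters that rose over time.

    `samples` is a list of dicts mapping attribute name -> raw value,
    ordered from the oldest polling run to the newest.
    """
    grown = {}
    if len(samples) < 2:
        return grown  # a trend needs at least two readings
    first, last = samples[0], samples[-1]
    for name in watched:
        if name in first and name in last and last[name] > first[name]:
            grown[name] = (first[name], last[name])
    return grown
```

For example, `growing_attributes([{"Reallocated_Sector_Ct": 0}, {"Reallocated_Sector_Ct": 8}])` flags the reallocated sector count as having risen from 0 to 8. A counter that spikes and then clears again, as in the story above, would need the intermediate samples inspected as well, which this sketch does not do.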

1. admin says:
  
  August 22, 2018 at 10:11 pm
  
  SSDs definitely throw a lot of wrenches into the works. They have a habit of failing suddenly and completely. This is one of the reasons that my mantra about always having working external backups is crucial, now more than ever. Thanks for all of the information!
  
FlyboyShill says:

December 5, 2018 at 2:33 pm

I have had real-world experience of the hard drive’s CRC not detecting errors; luckily I had a backup. Let me tell you the story…
I had the misfortune of my young son accidentally knocking a USB hard drive off the top of a workstation which was running Ubuntu Server, all on ext4. The drive just had photos on it, which were rarely written to, and it certainly wasn’t being written to at the time, although I think it was spinning. Anyway, the drive hit the floor and I feared the worst. I picked it up gently, positioned it back and began looking for damage. Lo and behold, S.M.A.R.T. began complaining, and some files were failing to be read. OK, I thought, I have a backup, but the backup was in the cloud (CrashPlan); I know, I thought, I’ll copy the files over to my internal drive and just use the cloud backup for the files that weren’t copied, to cut down on time. Big mistake! I just happened to check one of the supposedly good JPEG files and to my horror it was corrupt (it opened fine, but was clearly a corrupted picture!!!); so neither the drive nor the filesystem (ext4 doesn’t checksum data, so that is no surprise) told me this file was actually corrupt! Curious, I decided to check some other files that were supposedly fine… many were corrupt. I then panicked, thinking maybe my backup was corrupt. No, the backed-up files were fine, so I downloaded the lot from the backup.

And beware if you think you don’t need ECC RAM: today at work we experienced https://blogs.oracle.com/linux/attack-of-the-cosmic-rays-v2, and validated that it was the same problem, on a build machine.

I now run a small/cheap Xeon server with ECC RAM + BTRFS, and sleep soundly at night 😉

1. admin says:
  
  December 12, 2018 at 11:13 pm
  
  Thanks for sharing your story. At the end of the day, it’s all about risk tolerance weighed against other factors. Sounds like your risk tolerance is lower than mine 😉
  
Swooper says:

April 12, 2019 at 11:49 pm

You failed to mention one undocumented feature of ZFS that really does pay off: the drop in tech support costs with panicky customers who see an article about ZFS and bit rot and immediately ring up to make sure you’ve given them a highly-resilient file system that offers protection against everything from random errors to venereal disease.

You’ve got two choices in these cases:

1) you can sit down with them for several hours and try to explain all the points you’ve made in this post using a calm voice and two-syllable words and then go back to the shop and sweat for a couple of weeks while they go out and get quotes from other vendors promising “more advanced” technology anyway, or

2) you can just set them up on RAID-Z to begin with and then spend two minutes assuring them that their data will survive a zombie apocalypse (because that’s what the Z in ZFS stands for) thanks to the miracle of checksumming.

Most pros find the overhead associated with option 2 far less of a time waster than option 1. I don’t know why Sun never included that feature in the manual.

Spongebob says:

September 5, 2019 at 4:34 am

ZFS rocks, and yes, I am totally a fan 🙂 since the day I discovered it and installed it. My main reason isn’t even bit rot; it’s everything else, I would say: pools dividing storage among sub-filesystems, quick snapshots that can be taken live, per-filesystem copies settings, and other features like that. One of my main ones is that you can take the whole bunch of drives, just reinstall them in another box and restart your array (something you cannot do with RAID where the config is stored on the controller). You can add mirrors or RAID-Z vdevs to your pools and thus grow available space with no downtime relatively easily.

1. Jody Bruchon says:
  
  September 8, 2019 at 9:26 pm
  
  It’s great that ZFS works well for you. Everyone should use what is best for their situation. My main point is that a lot of the information kicking around the internet about ZFS is misleading or lacking some critical points, i.e. RAID-Z being a requirement for ZFS automatic self-healing, arguably one of the most severe omissions in most ZFS evangelism, because what good is detecting bit rot if the rotten data is permanently lost anyway?
  
Ivan Volosyuk says:

December 7, 2019 at 8:50 pm

I pretty much agree with the points about ECC on HDDs. It seems like CRC, together with 128-bit addressing, is overkill for my needs. But I did my own measurements of RAID performance: ZFS (CRC) vs RAID5 (journal or bitmap). It looks like write performance on my system on RAID5/ext4 vs RAID-Z/zfs is largely in favour of RAID-Z. My explanation is that checksumming allows much more freedom for write optimizations. HDD write caching can be used with ZFS to its full potential. There is no need to keep strong data consistency between HDDs. A drive which lags behind can catch up without a problem. You can even unplug it for a while. On RAID5/ext4, to write file metadata you first write it to the journal, which implies a write to the RAID5 journal or bitmap, followed by another actual write of the metadata, again prefixed by a write to the RAID5 journal/bitmap. If RAID5 uses a bitmap (the default), this is followed by two more writes to the RAID5 bitmap. Those are extra HDD seeks which are not required for ZFS, which uses checksums to replace both journaling features.

1. Jody Bruchon says:
  
  December 11, 2019 at 1:25 am
  
  CRCs/checksums are not a replacement for journaling. Journaling records the intent of a transaction before that transaction occurs so that the transaction can be undone in case of an error. CRCs/checksums only “sign” a piece of data; they do not provide any information beyond “this is definitely corrupt.” I don’t think they’re being used in place of journaling. You didn’t explain how you ran your speed tests, what kind of RAID-5 you used (md vs. LVM2 vs. hardware), what chunk size you used, what data and tool you used to benchmark, and so on. Also, ext4 is a pretty lousy filesystem. XFS is superior to all other filesystem options on Linux and consistently outperforms ext4 in almost all general benchmarks. XFS also has RAID stripe alignment formatting options that can make a noticeable difference in RAID-5/6/10 performance if used correctly. Another lesser-known but very important md RAID-5 tunable is the RAID-5 stripe cache size (in /sys/block/md0/md/stripe_cache_size) which is way too low by default and should be set to the maximum of 32768 for best write performance (which caches parity stripe updates rather than immediately writing them out).
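
  One thing worth knowing before raising that tunable is what it costs in RAM. The kernel’s md documentation gives the memory consumed as page size times number of disks times stripe_cache_size; the small Python sketch below just evaluates that formula. The 4 KiB page size and the 4-disk array are assumptions for the example.

```python
def stripe_cache_bytes(stripe_cache_size, nr_disks, page_size=4096):
    """Memory used by the md stripe cache: page_size * nr_disks * stripe_cache_size."""
    return page_size * nr_disks * stripe_cache_size


MIB = 1024 * 1024

# A 4-disk array at the kernel default (256) versus the maximum (32768).
default_mib = stripe_cache_bytes(256, nr_disks=4) / MIB
maximum_mib = stripe_cache_bytes(32768, nr_disks=4) / MIB

print(f"default: {default_mib:.0f} MiB, maximum: {maximum_mib:.0f} MiB")
```

  On those assumptions the maximum setting pins half a gigabyte of RAM for a 4-disk array, which is worth weighing on a small machine.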
  
  I routinely work with data sets involving a small number (5-50) of huge (1+ GiB) files in largely flat directory trees, and other sets involving a huge number (20,000-500,000) of small (0-1 MiB) files within deep and wide directory trees. I test jdupes against these data sets from time to time, including run time benchmarks. I don’t compare to ZFS, but I don’t really need to; I only need to compare against the numbers produced by sequential read tests at the start and end of a drive, since e.g. a 4-drive array cannot possibly read faster than (max seq read * 4), and likely has a maximum calculated for only 3 drives. If I get at all close to that speed for mostly sequential operations, I’m doing quite well and don’t need to question my choice of filesystem/RAID subsystem combo.
  
  I just ran some cursory read tests. My encrypted 4 x 8TB md RAID-5 with XFS (and a 64 KiB stripe) reads a single raw disk partition at 174 MiB/s, the raw md device at 512 MiB/s, and 10 video files (28.7 GiB total, broken into quite a few extents each) from the array (uncached) at an average of 313 MiB/s. I don’t know (or care) exactly where those files are located on the disks, but they read out at 61% of the array’s maximum read speed at the start of the array. Since the end of a disk (and thus, by extension, the array) tends to read out at about half of the read speed seen at the beginning of the disk, 61% of the absolute maximum seems like pretty great overall performance to me. I did run a simple write test, but the caches affect the reported speeds, so I don’t feel like it was useful.
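
  For anyone who wants to redo that arithmetic with their own numbers, it amounts to the following Python sketch. The three throughput figures are the ones quoted above; the variable names are just for illustration.

```python
# Figures quoted above, in MiB/s.
single_disk_read = 174.0   # one raw disk partition
raw_md_read = 512.0        # raw md device
file_read = 313.0          # average over the 10 video files

total_disks = 4
data_disks = total_disks - 1  # RAID-5 spends one disk's worth of each stripe on parity

ceiling_all_disks = single_disk_read * total_disks    # bound if every disk streamed data
ceiling_data_disks = single_disk_read * data_disks    # bound counting only 3 data disks
fraction_of_raw_md = file_read / raw_md_read          # file reads relative to the raw array

print(f"ceiling with {total_disks} disks: {ceiling_all_disks:.0f} MiB/s")
print(f"ceiling with {data_disks} data disks: {ceiling_data_disks:.0f} MiB/s")
print(f"file reads reach {fraction_of_raw_md:.0%} of the raw md read speed")
```

  The measured 512 MiB/s for the raw md device sits just under the 3-disk ceiling of 522 MiB/s, which is consistent with the “maximum calculated for only 3 drives” expectation.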
  
Michaela Lemonjoy says:

May 23, 2020 at 11:17 am

Wish I had found this page sooner.
Been running a RAID-5 setup with offline weekly backup for almost 10 years now. No significant issues so far, and the file collection grows each year. Only lost a few pictures when Windoze decided to “fix” my degraded raid….*sigh*

So, I ran into these ZFS knights a while ago, and they immediately started moaning about “RAID-5 is dead” “you must have ZFS and ECC otherwise your computer will rot and die”. And I had basically the same arguments as you have written above, could not agree with you more. But none of these fanboys would listen, every discussion was met with “ZFS is the one and only god”. So I gave up and went along with my RAID-5… until now, thanks!

1. Jody Bruchon says:
  
  May 24, 2020 at 4:20 pm
  
  I’m happy to have helped!
  
Elixd says:

August 9, 2020 at 2:10 pm

Hi! I found your article while trying to research a problem. I have gotten several checksum errors on ZFS without getting any I/O errors or S.M.A.R.T. errors. I cannot find an explanation for this.

1) I cannot attribute it to silent bit rot, because the error count was above 500 (too many for rarely occurring silent rot).
2) There were several power failures beforehand, but as far as I understand, that could lead to the loss of data that had not finished being written to the disk, not to checksum errors. So power loss could not be the reason.

Or am I wrong somewhere? What am I missing? Any ideas?

1. Jody Bruchon says:
  
  August 9, 2020 at 2:21 pm
  
  Power failures can cause that sort of damage, especially if lots of writes were queued at the time of the power loss. I don’t know how granular the checksum is (are they per-block or does one checksum cover multiple blocks?) but I’d guess that if a bunch of data writes landed and then the checksum writes were lost in the power failure OR vice-versa, you’d have lots of mismatches show up despite no hardware failure.
  
  Another issue that has been proposed in other comments and on Hacker News is that of memory corruption in the hard drive cache RAM. Such corruption would trigger no I/O or SMART errors but would mangle data. There’s no good way to test for this other than to notice a trend of data corruption on the drive and stop trusting it.
  
  1. Elixd says:
    
    August 14, 2020 at 5:04 am
    
    Right. The power failure explanation looks like a good fit in this situation, given that it happened not long before the checksum errors were found.
    
    I also found a report of the same thing online:
    “one single event when I deviated from ZFS manual: I had ZFS on single disk (iSCSI SAN LUN mapped on host) inside KVM Guest and after initial data copy I forgot to change Cache mode from WriteBack to WriteThrough. Pool (5TB) was readable but had 20k+ errors reported.”
    This is similar to my situation (single disk ZFS) which I used for a block storage of a not mission critical VM disk in my “home lab”.
    
    But this is not supposed to happen, according to my current understanding: ZFS is supposedly a smart file system that performs new writes in empty space, and commits the changes/new writes only after it has properly finalized the write (meaning it wrote both data and metadata, including checksums).
    
    Here is a quote from superuser.com claiming that data corruption in such circumstances is practically impossible:
    “new data is written elsewhere on the disk, then the filesystem metadata structures are updated to point to the new data, and only then is the old data’s block freed for reuse by the filesystem. In this way, a sudden power loss will leave the old copy of the data in place if the new data updates are not 100% committed to persistent storage. You won’t have half the block replaced or anything like that, causing data corruption.”
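
    The sequence in that quote can be modeled with a toy Python sketch. This is not how ZFS is implemented (the class name, the CRC32 checksum, and the single root pointer are simplifications chosen for the example); it only shows why a crash before the pointer update leaves the old data readable.

```python
import zlib


class ToyCowStore:
    """Toy copy-on-write store: blocks are never overwritten in place."""

    def __init__(self):
        self.blocks = {}     # block id -> bytes
        self.root = None     # (block id, checksum) of the live data, or None
        self._next_id = 0

    def write(self, data, crash_before_commit=False):
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = data               # step 1: new data goes to free space
        if crash_before_commit:
            return False                           # power lost: root still points at old block
        old = self.root
        self.root = (block_id, zlib.crc32(data))   # step 2: one pointer update commits it
        if old is not None:
            del self.blocks[old[0]]                # step 3: only now is the old block freed
        return True

    def read(self):
        block_id, checksum = self.root
        data = self.blocks[block_id]
        if zlib.crc32(data) != checksum:
            raise IOError("checksum mismatch")
        return data
```

    The guarantee depends on step 1 really being on stable storage before step 2 lands. A cache layer that acknowledges writes early or reorders them, such as the WriteBack mode mentioned in the report above, removes that ordering, and then a power loss can leave a committed pointer referring to data that never made it to disk.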
    
    I guess what my question boils down to is: “Am I wrong to assume that ZFS would not allow data corruption on power loss, only possible data loss (given no hardware failure)?”
    
    There are two reasons I would want to know the answer:
    1) if I know that the issue is not with ZFS, then I can be sure I have some hardware problem
    2) just to understand how fault tolerant ZFS is, and of course, it would be interesting to know about other modern transactional file systems, such as XFS for example
    
    1. Andre Ross says:
      
      December 9, 2020 at 3:23 pm
      
      I know this is an old post, but I was looking and you said:
      
      “Pool (5TB) was readable but had 20k+ errors reported.”
      This is similar to my situation (single disk ZFS) which I used for a block storage of a not mission critical VM disk in my “home lab”.
      
      If ZFS is on a VM and pointing to a real hard disk that another OS has access to, then you might be sharing read/write access between two OSes, the host OS and the VM OS, and they might be stepping on each other.
      
      When you set up ZFS with two OSes reading/writing data on the same controller you will see such corruption; it will not trigger any I/O or SMART errors but will mangle data, as the controllers and the hard drives are just writing what they are told, but your data is now trashed. (Yes, I did this dumb thing.)
      
      Not sure if this was your problem but if using ZFS in a VM I do a controller pass-thru to the VM so that no other OS has access to any drives on that controller.
      
Andre Ross says:

December 9, 2020 at 2:06 pm

I just ran across your post on ZFS won’t save you
Thank you!!!!!

I like ZFS it’s great and I love it.
But it’s a tool and like all tools you have to know what it can and can’t do.
Short story long if you have a really great hammer a lot of things start looking like nails

I’ve been running a home RAID array since the 1990s, being a pack rat of data.
Note I backup the important stuff and store off site (old school).

I started with hardware RAID, software RAID, Linux ext3, ext4, mdadm, tried btrfs,
now I run ZFS on 16 4TB drives using RAID-Z3.

I have some old hard drives (5-6 years old) in the array, hence RAID-Z3; remember, data pack rat.

a. ext4 journaling file system (you won’t lose your data because the file system is journaled)
b. (Mirror, RAID5, RAID6, RAID10) (hardware, software RAID) (you won’t lose your data this time)
c. (ZFS) (We got you now, your data is safe this time, we cross our fingers and pinky swear)

I’ve had to recover data for my work and home server a number of times.
And I will say that without the home server, I might not have believed I would end up using what I learned as a data pack rat to save production data on the job.

Real world and your data.

1. We don’t make that raid controller any more.
2. Data not matching after 4 years ??? how did that happen and why do we need a UPS we never had one before.
3. How could more than 3 hard drives go bad, they are only 8-9 years old, no errors yesterday.
4. Server flooded.
5. Someone dropped hard drives during maintenance.
6. The production system is raid10 why would we need to back it up the data is safe just run a restore command.
7. How can the data be missing a delete command should not do that much damage.
8. I know we had a fire why would that affect the data.
9. We run 24/7/365 we don’t have time for a backup window that’s why the board voted on running raid60.
10. Power lost no ups we have ZFS. (if some transactions rolls back and some don’t like bank balance rolls back after customer withdraws all their money no data loss right? they can just withdraw the money again when the system comes back up to make the balance zero again data ok right.)

Everyone talks about bit rot protection, but what I look at is data loss and how I recover data the right way.

Once again
Thank You!!!!!

Ed Wildgoose says:

December 10, 2020 at 1:18 pm

Hi, I think the “RAID5 is dead” might be misdescribed?

The concern that I’ve experienced is that when I pull a disk and replace it, then ANY single bit read error which occurs during the rebuild will kick out the whole array (at least under linux MD).

So I lost a 5-disk RAID5 array because I hadn’t noticed that one disk seemed to be suffering reallocations or some read error, and I pulled a different disk to replace it (I think it was suffering many read errors? I don’t remember). The array failed during the rebuild and at the time I didn’t have the experience to figure out a way past that.

I’ve hit this a number of times (or come close) over the years. I’ve got a media server which has a bunch of disks in (5-8), and over the years I would say that my failure pattern tends to be that all is well, smartctl reports no problems. Then at some point I tend to see 1-3 drives start to error. And here is the problem….

Now of course the A+ grade linux admin will at this point kick off a raid scrub, do some badblocks magic, force any pending reallocations to happen, check that the drives aren’t set to wait too long before erroring, will have a hotswap caddy for the drive to avoid powering down and will with 100% accuracy pull the correct drive from the array…

The rest of the world will make some trivial error at this point and the array is toast….

My feeling is that for a small extra cost, RAID6 actually gives you the space you need to deal with the single disk failure. If a trivial error occurs during the rebuild, then the extra disk buys you the breathing space to recover.
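To illustrate why a second problem during a single-parity rebuild is fatal for the affected stripe, here is a minimal sketch of XOR parity in Python. It is a toy model, not how Linux md is implemented: the function names (`xor_blocks`, `rebuild_missing`), the four-byte blocks, and the use of `None` for an unreadable block are all assumptions made for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(stripe, missing):
    """Rebuild the block at index `missing` from every other block in the stripe.

    `stripe` holds the data blocks plus one parity block. A block that
    cannot be read is represented by None (a convention for this sketch).
    """
    survivors = [blk for i, blk in enumerate(stripe) if i != missing]
    if any(blk is None for blk in survivors):
        raise IOError("second unreadable block in stripe: single parity cannot rebuild it")
    return xor_blocks(survivors)


# Four data blocks plus one parity block, as in a 5 disk RAID-5 stripe.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
stripe = data + [xor_blocks(data)]

# Disk 1 is pulled: its block can be recomputed from the other four.
degraded = list(stripe)
degraded[1] = None
rebuilt = rebuild_missing(degraded, missing=1)

# A read error on disk 3 during the rebuild leaves nothing to compute from.
degraded[3] = None
try:
    rebuild_missing(degraded, missing=1)
    second_failure_recoverable = True
except IOError:
    second_failure_recoverable = False
```

With a second, independent parity block (the RAID-6 approach), the same stripe would still be solvable with two blocks missing, which is the breathing space described above.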

I try to buy my disks from different sources and try to get as many different batches as I can. However, I do regularly seem to find failures on my 8 disk array involve more than one disk starting to show reallocations and signs of distress. For this reason I tend to also only use 99% of the disk size so that I can buy just a couple of whatever is on the market when I replace part of the array, without getting hit with problems that the new disk is like 8 sectors smaller than the outgoing one…

My use case also doesn’t lend itself to backups for this volume of stuff. I have precious stuff on this array, which I write to about 4 other destinations. However, I also have a lot of “large files” which I could live without, but which would be inconvenient to re-install. That is, the cost of restoring from the backups is high (think ripping a boatload of CDs or DVDs). For this use case I feel that a large RAID6 array suits my needs best, vs a RAID5 with failover to backups.

Not everyone’s needs are the same, but I think I would give a strong recommendation on RAID6 vs RAID5 to most people using 5+ disk arrays… My (limited) experience is that effective double disk failures are quite common. Not because you get two disks failing so much as that the process of swapping out a suspect disk is exactly the moment that linux decides to look really, really closely at the rest of your array, and if it finds ANY issue it will kick out the whole array… So any single slow read, read error (or pulling the wrong disk) and you are toast… Probably some of this is my failure for not scrubbing often enough or not noticing disks are failing until too late, but for whatever reason, finding single sector errors on more than one disk is quite correlated on my “mature” arrays (and linux feels that any single read error is a “dead disk”).

No opinion on ZFS. I have looked vaguely at “dm-integrity”. I don’t see it mentioned very often though?

1. Jody Bruchon says:
  
  December 10, 2020 at 7:44 pm
  
  The info on a single error kicking a whole drive out of a Linux md RAID 5/6/10 array is incorrect. When Linux md runs into a read error, it attempts to rewrite the errored block and re-read it. If the rewrite fails multiple times, THEN it will kick the drive out of the array as failed, and rightly so. There are also ways to force-reassemble a RAID array, so a drive getting kicked out of the array doesn’t necessarily cause data loss as long as the array is stopped soon enough. Scrubs try to catch degradation and will trigger the aforementioned rewrite if issues are found. I have not had a drive get kicked out of my arrays in a very, very long time. One of my drives in my 4x1TB RAID-5 is not in the best shape (it has reported almost 1000 uncorrectable read errors over the past 7-8 years) but I run a scrub every few months and about 1-2 times a year it fixes a single failed block and keeps on chugging without intervention.
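The read-error handling described above can be sketched as a small Python model. This is only an illustration of the described behavior (reconstruct, rewrite, re-read, and give up after repeated failures), not the kernel's actual code: `ToyDisk`, `read_with_repair`, and the retry limit `MAX_REWRITE_ATTEMPTS` are hypothetical names and values invented for the example.

```python
class ToyDisk:
    """A pretend disk: block number -> bytes, plus a set of unreadable blocks."""

    def __init__(self, blocks, bad=(), rewritable=True):
        self.blocks = dict(blocks)
        self.bad = set(bad)
        # If True, rewriting a bad block clears the error (for example
        # because the drive reallocates the sector on write).
        self.rewritable = rewritable

    def read(self, n):
        if n in self.bad:
            raise IOError(f"read error on block {n}")
        return self.blocks[n]

    def write(self, n, data):
        self.blocks[n] = data
        if self.rewritable:
            self.bad.discard(n)


MAX_REWRITE_ATTEMPTS = 3  # hypothetical limit chosen for this sketch


def read_with_repair(disk, n, reconstruct):
    """Read block n; on error, rewrite it from redundancy and re-read.

    `reconstruct(n)` stands in for rebuilding the block from the other
    array members. Returns (data, status), where status is "ok",
    "repaired", or "kicked" (the disk should be failed out of the array).
    """
    try:
        return disk.read(n), "ok"
    except IOError:
        pass
    good = reconstruct(n)
    for _ in range(MAX_REWRITE_ATTEMPTS):
        disk.write(n, good)
        try:
            disk.read(n)
            return good, "repaired"
        except IOError:
            continue
    return good, "kicked"


redundancy = {7: b"payload"}
flaky = ToyDisk({7: b"garbage"}, bad={7})                    # one weak sector
dying = ToyDisk({7: b"garbage"}, bad={7}, rewritable=False)  # rewrite never sticks

flaky_result = read_with_repair(flaky, 7, redundancy.__getitem__)
dying_result = read_with_repair(dying, 7, redundancy.__getitem__)
```

In this model the flaky disk ends up repaired and stays in the array, while the disk whose rewrites never succeed is the only one marked for removal.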
  
Erp says:

March 5, 2021 at 11:02 am

CRC (or any other hashing system) is like a check engine light: it’s there to tell you that a problem exists, not to fix the problem.

1. Jody Bruchon says:
  
  March 10, 2021 at 1:54 pm
  
  Many tout ZFS as “self-healing.” CRCs on data are useless without a backup. It’s not like a check engine light which indicates a problem that could cause more problems if you ignore it. Once a CRC detects an integrity violation, the damage is already done.
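A short Python sketch of that point: a checksum flags the damage, and recovery still has to come from a second copy. Here `zlib.crc32` stands in for whatever checksum a filesystem actually uses, and the `store` and `verify` helpers and the sample payload are made up for the example.

```python
import zlib


def store(payload: bytes):
    """Return the payload together with its CRC-32, as a filesystem might record it."""
    return payload, zlib.crc32(payload)


def verify(payload: bytes, crc: int) -> bool:
    """True if the payload still matches the checksum recorded at write time."""
    return zlib.crc32(payload) == crc


original = b"family photos, tax records, etc."
stored, crc = store(original)

# Simulate bit rot: flip a single bit in the stored copy.
damaged = bytearray(stored)
damaged[3] ^= 0x01
rotted = bytes(damaged)

# The checksum reports that something is wrong...
detected = not verify(rotted, crc)

# ...but the good data has to come from somewhere else, such as a backup.
backup = original
recovered = rotted if verify(rotted, crc) else backup
```

Without the `backup` line there is nothing for the last step to fall back on, which is the situation of a checksumming filesystem on a single disk with no redundancy.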
  
Bill Quin says:

July 17, 2021 at 9:12 pm

“…but leaving out critical things like RAID-Z being required for automatic repair capabilities”

I don’t think it’s accurate to say that ZFS’s self-healing capability requires RAID-Z.

Even as far back as 2005, self-healing was demonstrated using a zpool comprised of a mirror vdev (not RAID-Z). This self-healing is triggered by, you guessed it: a mismatch of CRC (checksum) on a record of data.

https://blogs.oracle.com/timc/demonstrating-zfs-self-healing

This would also be caught and triggered by a routine “scrub” of the pool.

1. Jody Bruchon says:
  
  July 26, 2021 at 4:18 pm
  
  Sure, if you have a copy of the data somewhere, you can recover it. A mirror device is effectively a ZFS RAID-1, so you’d have a second copy to pull from. Technically, it’s not RAID-Z, but it’s practically not much different from RAID.
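As a rough sketch of what “self-healing” on a mirror amounts to, the Python below verifies each copy against a stored checksum and overwrites a bad copy with a good one. It is not ZFS code: `read_self_healing`, the SHA-256 checksum choice, and the list-of-copies layout are assumptions for the example.

```python
import hashlib


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def read_self_healing(copies, expected):
    """Return (good_data, repaired_count), rewriting bad copies in place.

    `copies` is a list holding each mirror member's version of one record
    and `expected` is the checksum recorded when the record was written.
    Raises IOError if no copy matches: damage detected, nothing to heal from.
    """
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        raise IOError("all copies fail checksum: detected but not repairable")
    repaired = 0
    for i, c in enumerate(copies):
        if checksum(c) != expected:
            copies[i] = good  # overwrite the bad copy with the good one
            repaired += 1
    return good, repaired


record = b"important record"
expected = checksum(record)

# Two-way mirror where the first copy has rotted.
mirror = [b"important recorD", record]
data, repaired = read_self_healing(mirror, expected)

# Single disk: the same damage is detected, but there is no second copy.
single = [b"important recorD"]
try:
    read_self_healing(single, expected)
    single_disk_healed = True
except IOError:
    single_disk_healed = False
```

The repair step only works because the mirror supplies a second copy; the checksum just decides which copy to trust.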
  
AngryAdmin says:

August 5, 2021 at 11:03 am

“If you use ZFS, you have to use RAID-Z, otherwise you will get none of the advertised protection that ZFS offers other than detecting degraded data.”

Which ones are missing in your book?

1. Jody Bruchon says:
  
  August 5, 2021 at 11:49 am
  
  Which ones of what? What’s missing from ZFS without RAID-Z? Self-healing, the one big thing that people want ZFS for. The big magic pill that magically makes bit rot no longer exist.
  
Pingback: ZFS is not magic - it needs backup and RAID (Restore it All Podcast #125) - Backup Central
David Studeman says:

May 8, 2022 at 2:28 am

I use ZFS in a Raid-Z configuration for storage, and in the few cases where I am limited to two disks, I have used a ZFS mirror. I still use MD because so many motherboards come with two M.2 slots, and rather than just put swap on one of them and leave blank space on the other, I just MD it and format it as swap (yes, hopefully the system never needs or uses swap). In such cases I mirror the OS itself. BTRFS in Raid 1 mode is pretty solid as of now; raid5/6 on BTRFS is not there yet, but lately work on it has been more aggressive.

If I am stuck with a single drive for the OS, I just use XFS. I have always loved XFS, as it can be formatted in many ways to suit the size of files you pass to and from it, and it’s solid. The idea of ZFS or BTRFS on a single drive defeats the purpose most use it for in the first place, and I wouldn’t even consider it for such use. I have used F2FS in some situations on SSDs as well.

The one thing that I do like about BTRFS when in raid 5 or 6 is that one can keep adding drives to it and expand it, re-balance it and things like that. In its current state, raid 5 or 6 cannot be used in production, as there are still other dark corners besides the write hole. It can be converted on the fly from single to raid 1 or 0, even 5 or 6 if you have enough disks, but don’t do raid 5 or 6 unless you are just experimenting and will put no data on it that you can’t afford to lose. In the case of doing a clean install with Debian, I just format one disk as BTRFS and then add an identical partition on a second drive and convert it to raid 1. If one is using the entire disks for one mirrored array, BTRFS will use the entire disk without adding a partition first, but if one is using UEFI, you are going to need a FAT32 partition somewhere, as with any filesystem. Even the old MBR only works on a single disk to start. Something like a DOM works great as a boot device.

While I use ZFS Raid Z, as of yet it cannot grow like BTRFS does, but it is coming. The disadvantage is that you need all your disks up front for Raid Z and with BTRFS you don’t, but Raid 5/6 on BTRFS is just not safe at this time. I still have an old Cobalt XTR on which I ran 5 IDE drives using MD and XFS. I have not used it since 2016. It actually ran pretty solid. I had been upgrading patches for the 2.6 kernel until the kernel got to the point where the drivers would need to be re-written and the old hardware was just too outdated to bother. The Cobalt Raqs did not use a BIOS, so the kernel had to be in an EXT3 partition and in either a bz or gz form, basically an ELF binary, but I digress.

To sum up, I like ZFS for my needs because of how it is used, but only on Raid Z storage arrays. An MD array formatted with XFS is not bad at all, and I have had great success doing this as well. There is also the convenience of having raid built into a filesystem, but one has to consider everything and use the right tool for the job they want it to do. The biggest disadvantage of ZFS on Linux is that on occasion I have had to wait before updating kernels until the dkms module will work on that kernel, whereas BTRFS gets better with every kernel, is native, and needs no compiling tools. I reiterate that BTRFS in mirrored Raid 1 mode is very solid. I really want BTRFS to succeed with raid 5/6 in the long run. There is still no free lunch on any filesystem.

David Studeman says:

May 8, 2022 at 2:36 am

With the Cobalt XTR in my previous comment, I used four drives in MD raid, there is no 5th hot swap bay. It used a Rocket Raid chip. The Cobalt hardware was strange for no really good reason as far as not using a bios. We already proved that this did not prevent anyone from using a non Cobalt OS. Sun bought them and then killed it.

Jung says:

November 4, 2022 at 5:01 am

Hi Jody – Wondering if you know how often hard drive ECC process is triggered, and what the specific trigger points are. I can’t believe how I’ve spent the last hour trawling wikipedia and the internet, and I still haven’t found a clear answer on what the precise trigger points are! It’s almost like the arcane knowledge has been lost to time!

1. Jody Bruchon says:
  
  November 4, 2022 at 2:34 pm
  
  It is constant. Every time a read is performed the read includes forward error correcting codes and the data is checked for integrity. There is never a time that ECC isn’t happening in a hard drive. That’s how the drive knows that a read failed so it can report an uncorrectable error to the operating system.
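To give a feel for what a forward error correcting code does on every read, here is a toy single-error-correcting, double-error-detecting Hamming code in Python. Real drives use far stronger codes over whole sectors (Reed-Solomon or LDPC families); this 4-bit example, and the names `secded_encode` and `secded_decode`, are purely illustrative.

```python
def secded_encode(nibble):
    """Encode 4 data bits into an 8-bit codeword.

    Index 0 is an overall parity bit; indexes 1..7 form a Hamming(7,4)
    codeword with parity bits at positions 1, 2 and 4.
    """
    d1, d2, d3, d4 = nibble
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    word = [p1, p2, d1, p4, d2, d3, d4]
    overall = 0
    for bit in word:
        overall ^= bit
    return [overall] + word


def secded_decode(code):
    """Return (data_bits, status): status is "ok", "corrected" or "uncorrectable"."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of set bits; 0 for a valid word
    overall = 0
    for bit in code:
        overall ^= bit  # 0 when an even number of bits (including none) flipped
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd parity: treat as a single flipped bit. The syndrome is its
        # position (0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even parity with a non-zero syndrome: two bits flipped.
        return None, "uncorrectable"
    return [code[3], code[5], code[6], code[7]], status


word = secded_encode([1, 0, 1, 1])
clean = secded_decode(word)

one_flip = list(word)
one_flip[5] ^= 1
single = secded_decode(one_flip)

two_flips = list(word)
two_flips[2] ^= 1
two_flips[6] ^= 1
double = secded_decode(two_flips)
```

The "uncorrectable" case is the toy equivalent of the drive reporting an uncorrectable read error to the operating system.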
  
Function says:

February 16, 2023 at 3:46 pm

What a read. I’m a new home-server owner and still kinda new to Linux and definitely paranoid about data loss.
After about 2 years i’ve finally acquired the skills and did the R&D to set up an automatic 3-2-1 backup of my important data through borg-backup/borgmatic and a remote storage box.
Now that i have backups, i was thinking about bit-rot, other filesystems, and how i can somehow gain more control over the integrity of my data.
This post definitely helped me understand the different perspectives more, but i’m still lost and don’t know what i’m supposed to do. Right now my home-server is just an old laptop with a bunch of different drives connected to it, no RAID or anything, all of them are ext4 or still NTFS (yeah i know, bad).
I also don’t need RAID at the moment. I don’t need any better performance and i’ll happily live with this “fragmented” storage instead of pooling the drives together.
Do you have any advice for my situation? Do i even need to worry? Just monitor S.M.A.R.T. and keep calm?
Thanks and regards, Function

1. Jody Bruchon says:
  
  February 16, 2023 at 10:15 pm
  
  It really depends on the level of paranoia you have. Bit rot is rare. The whole point of error-correcting codes on hard drives is to catch single-bit errors and repair them (or at least detect double-bit errors and warn that they happened), so the drive itself is already guarding against in-place bit rot to some extent, assuming you run regular “scrubs” of your array (if you rarely read the data, you may not detect problems with that data until it’s too late.) For me, I’m okay with the possibility that an extremely rare hardware failure or pair of cosmic ray bit-flips might take out some data. I have backups! I run array scrubs to verify the integrity of the data across the entire array and I run filesystem checks to verify the metadata. If I had some high-value data, I’d want to have data-level error handling like RAID-Z or at least periodic hash checks with tools like md5deep, but I’m not that paranoid.
  
  Decide on your level of risk tolerance and then make your choice. ZFS with RAID-Z has its merits, but I don’t find the failures it helps with to be common enough to fear…but that’s just me and my data.
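For anyone who wants the “periodic hash checks” option without extra tools, the idea can be sketched in a few lines of Python. This is not md5deep, just a minimal stand-in: `build_manifest` and `verify_manifest` are hypothetical helpers, SHA-256 is an arbitrary choice, and the demo runs in a throwaway temporary directory.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files do not fill memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root, manifest):
    """Return the sorted relative paths that are missing or no longer match."""
    root = Path(root)
    bad = []
    for rel, digest in manifest.items():
        p = root / rel
        if not p.is_file() or file_digest(p) != digest:
            bad.append(rel)
    return sorted(bad)


# Demo in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_bytes(b"holiday photos")
    (root / "b.txt").write_bytes(b"tax records")

    manifest = build_manifest(root)
    before = verify_manifest(root, manifest)

    # Simulate silent corruption of one file.
    (root / "a.txt").write_bytes(b"holiday phot0s")
    after = verify_manifest(root, manifest)
```

A mismatch only says the file changed since the manifest was built. It cannot tell corruption from a legitimate edit, and the fix is still to restore the file from a backup.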
  
Phibert Francis Marsh says:

December 4, 2023 at 2:25 am

I run ZFS on Linux and I can tell you from experience that ZFS appears to have caught data errors from failing drives in my RAIDZ2 array where SMART disk utilities said all was OK. I saw this in my external USB ZFS backup drives as well. At least in the case of failing drives and/or controllers, ZFS appears to have caught errors that SMART utilities missed.

1. Jody Bruchon says:
  
  December 4, 2023 at 8:11 am
  
  Thanks for relating your anecdote. For some people ZFS is worth it; for others it’s not. I just happen to be one of the ones that it’s not for.
  