Is it safe to say that Backblaze essentially has an O(log n) algorithm for labor due to drive installation and maintenance, so up-front costs and opportunity costs due to capacity weigh heavier in the equation?
The rest of us don’t have that, so a single disk loss can ruin a whole Saturday. Which is why we appreciate that you guys post the numbers as a public service/goodwill generator.
> algorithm for labor due to drive installation and maintenance ... the rest of us don't have that so a single disk loss can ruin Saturday
TOTALLY true. We staff our datacenters with our own datacenter technicians (Backblaze employees) 7 days a week. When they arrive in the morning, the first thing they do is replace any drives that failed during the night. The last thing they do before going home is replace the drives that failed during the day, so the fleet is "whole".
Backblaze currently runs at 17 + 3. 17 data drives with 3 calculated parity drives, so we can lose ANY THREE drives out of a "tome" of 20 drives. Each of the 20 drives in one tome is in a different rack in the datacenter. You can read a little more about that in this blog post: https://www.backblaze.com/blog/vault-cloud-storage-architect...
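The "any 3 of 20" property of a 17 + 3 scheme can be illustrated with a toy erasure code. This sketch uses polynomial interpolation over a prime field and is purely illustrative — it is NOT Backblaze's implementation (production systems use Reed-Solomon codes over Galois fields like GF(2^8)), and all names here are mine:

```python
# Toy k+m erasure code: treat the 17 data symbols as evaluations of a
# degree-16 polynomial at x = 0..16, and the 3 parity symbols as the same
# polynomial evaluated at x = 17, 18, 19. Any 17 of the 20 points uniquely
# determine the polynomial, so ANY three shards can be lost.
P = 65537  # prime modulus; symbols must be < P

def lagrange_eval(points, x):
    """Evaluate the unique polynomial through `points` at x (mod P)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, P - 2, P) is the modular inverse of den (Fermat)
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

def encode(data, m):
    """Return data shards plus m parity shards."""
    k = len(data)
    pts = list(zip(range(k), data))
    parity = [lagrange_eval(pts, x) for x in range(k, k + m)]
    return data + parity

def reconstruct(shards, k):
    """Recover the k data symbols from any k surviving (index, value) shards."""
    pts = shards[:k]
    return [lagrange_eval(pts, x) for x in range(k)]

data = [7, 42, 255] * 5 + [1, 2]   # 17 data symbols
coded = encode(data, 3)            # 20 shards total, like one tome
# Lose ANY three shards, e.g. the drives at indices 0, 5, and 19:
survivors = [(i, v) for i, v in enumerate(coded) if i not in (0, 5, 19)]
assert reconstruct(survivors, 17) == data
```

The same recovery works for any choice of three lost indices, which is why a tome keeps serving reads with up to three drives down.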
So if 1 drive fails at night in one 20-drive tome we don't wake anybody up, and it's business as usual. That's totally normal, and the drive is replaced at around 8am.

However, if 2 drives fail in one tome, pagers start going off and employees wake up and start driving toward the datacenter to replace the drives. With 2 drives down we ALSO automatically stop writing new data to that particular tome (but customers can still read files from that tome), because we have noticed that less drive activity can lower failure rates.

In the VERY unusual situation that 3 drives are down in one tome, every single tech ops person, datacenter tech, and engineer at Backblaze is awake and working on THAT problem until the tome comes back from the brink. We do NOT like being in that position. In that situation we turn off all "cleanup jobs" on that vault to lighten the load. The cleanup jobs are the things that are running around deleting files that customers no longer need, like when they age out due to lifecycle rules, etc.
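The escalation ladder described above can be sketched as a small policy function. This is a hypothetical illustration — the function name, state fields, and structure are mine, not Backblaze's actual monitoring code:

```python
# Hypothetical sketch of the escalation policy for one 20-drive tome:
# 1 failure is routine, 2 pages on-call and stops writes, 3 is all hands
# and pauses cleanup jobs.
def tome_policy(failed_drives: int) -> dict:
    """Map the number of failed drives in a tome to operational actions."""
    state = {
        "accept_writes": True,       # 0-1 failures: business as usual
        "page_oncall": False,
        "all_hands": False,
        "run_cleanup_jobs": True,
    }
    if failed_drives >= 2:           # page techs, stop new writes to the tome
        state["accept_writes"] = False
        state["page_oncall"] = True
    if failed_drives >= 3:           # everyone is awake; pause cleanup jobs
        state["all_hands"] = True
        state["run_cleanup_jobs"] = False
    return state

print(tome_policy(1))  # reads and writes continue; drive swapped at ~8am
print(tome_policy(3))  # writes stopped, all hands paged, cleanup paused
```

Note that reads stay available at every level — the 17 + 3 coding is what makes that safe.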
The only exceptions to our datacenters having dedicated staff working 7 days a week are when a particular datacenter is small or just coming online. In that case we lean on "remote hands" to replace drives on weekends. That's more expensive per drive, but it isn't worth employing datacenter technicians who would just be hanging out all day Saturday and Sunday bored out of their minds - instead we just pay the bill for remote hands.
Is it actually required to have employees wake up and replace that specific failed drive to restore a tome's full redundancy? I would expect an automatic process - disable and remove the failed drive completely from the tome, add SOME free reserve drive at SOME rack in the datacenter to the tome, and start populating it immediately. The originally failed drive can then be replaced afterwards without hurry.
> I would expect an automatic process - disable and remove the failed drive completely from the tome, add SOME free reserve drive at SOME rack in the datacenter to the tome
We have spec'ed this EXACT project a couple of different ways but haven't built it yet. In one design, there would be a bank of spare blank drives plugged into a computer or pod somewhere - let's pretend that is on a workbench in the datacenter (it wouldn't be, but it helps the description). The code would auto-select a spare drive from the workbench and begin the rebuild in the middle of the night. At whatever time the datacenter technician arrives back in the datacenter, they can move the drive from the workbench and insert it in place of the failed drive in the vault. It doesn't even matter if it is halfway through the rebuild; the software as written today already handles that just fine. It would just continue from where the rebuild left off when it was pulled from the workbench.
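The "continue from where the rebuild left off" behavior amounts to checkpointing rebuild progress per block. A minimal sketch, assuming a simple block-index checkpoint (structure and names are hypothetical, not Backblaze's code):

```python
# Resumable rebuild sketch: progress is checkpointed per block, so a drive
# pulled mid-rebuild and reinserted elsewhere just picks up where it left off.
def rebuild(target, peer_blocks, progress=0):
    """Fill `target` from `progress` onward; return the new checkpoint."""
    for i in range(progress, len(peer_blocks)):
        # Stand-in for reconstructing block i from the other 19 tome members.
        target[i] = peer_blocks[i]
        progress = i + 1           # checkpoint after each completed block
    return progress

blocks = list(range(8))            # pretend the healthy tome holds 8 blocks

# Scenario: rebuild starts on the workbench, drive is pulled at block 4...
drive = [None] * 8
checkpoint = rebuild(drive, blocks[:4] + [None] * 4, progress=0)  # partial data
drive = blocks[:4] + [None] * 4    # drive now holds the first 4 blocks

# ...then the technician inserts it into the vault and the rebuild resumes.
rebuild(drive, blocks, progress=4)
assert drive == blocks
```

The real system would also have to persist the checkpoint somewhere that survives the physical move, but the control flow is the same idea.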
The reason we don't really want to leave the drive in a random location across the datacenter in the long run is that we prefer to keep the network chatter that goes on INSIDE of one vault on one network switch (although a few vaults span two switches so as not to waste ports on the switches). Pods inside the vault rebuild the drive by talking with the OTHER 19 members of the vault, and we don't want that "network chatter" expanding across more and more switches. I'm certain the first few months of that would be FINE, maybe even years. But we just don't want to worry about some random choke point getting created that throttles the rebuilds in some corner case without us in control of it. Each pod has a 10 Gbit/sec network port, and (I think) a 40 Gbit/sec uplink to the next level of switches upstream. The switches have some amount of spare capacity in every direction, but my gut feeling is that in some corner case the capacity could choke out. We try to save money by provisioning everything to barely what it needs, without much spare overhead.
Plus, the time to walk a drive across the datacenter isn't that big of a deal. Some of the rebuilds of the largest drives can take a couple of days ANYWAY. Saving those 5 extra minutes wouldn't add much durability.
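The arithmetic behind "5 extra minutes isn't adding much" is straightforward, taking the couple-of-days rebuild figure above at face value:

```python
# Back-of-the-envelope: how much of a multi-day rebuild window is a 5-minute
# walk across the datacenter?
rebuild_minutes = 2 * 24 * 60    # "a couple days" for the largest drives
walk_minutes = 5
fraction = walk_minutes / rebuild_minutes
print(fraction)                  # under 0.2% of the total rebuild window
```

So automating away the walk shaves well under one percent off the exposure window during a rebuild.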
> because we have noticed that less drive activity can lower failure rates
It's a rite of passage to experience a second drive failure during RAID rebuild/ZFS resilvering.
I got to experience this when I built a Synology box using a mix of drives I had lying around and new ones I ordered.
One of the old drives ate itself and I had to start over. Then I did the math on how long rebuilding onto the last drive was going to take and realized that, since the array was only 5% full, it would be faster to kill the array and start over a third time. Plus less wear and tear on the drives.