Is it actually required to have employees wake up and replace that specific failed drive to restore a tome's full capacity? I would expect an automatic process - disable and remove the failed drive completely from the tome, add SOME free reserve drive from SOME rack in the datacenter to the tome, and start populating it immediately. The originally failed drive can then be replaced afterwards, without any hurry.
> I would expect an automatic process - disable and remove failed drive completely from a tome, add SOME free reserve drive at SOME rack in the datacenter to the tome
We have spec'ed this EXACT project a couple different ways but haven't built it yet. In one design, there would be a bank of spare blank drives plugged into a computer or pod somewhere - let's pretend that is on a workbench in the datacenter (it wouldn't be, but it helps the description). The code would auto-select a spare drive from the workbench and begin the rebuild in the middle of the night. Whenever the datacenter technician arrives, they can move the drive from the workbench and insert it in place of the failed drive in the vault. It doesn't even matter if it is halfway through the rebuild - the software as written today already handles that just fine. It would simply continue from wherever the rebuild left off when the drive was pulled from the workbench.
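To make the "resumes from wherever it left off" part concrete, here is a toy sketch of that behavior. The names (`Rebuild`, blocks, budgets) are mine, not Backblaze's - the only point is that progress is tracked by offset, so a physical move of the drive just pauses and resumes the same job:

```python
# Hypothetical sketch of a resumable rebuild; names are illustrative only.

class Rebuild:
    """Tracks rebuild progress so it survives the drive being moved."""

    def __init__(self, total_blocks):
        self.total_blocks = total_blocks
        self.next_block = 0  # first block not yet reconstructed

    def run(self, budget):
        """Rebuild up to `budget` blocks, then stop (e.g. drive pulled)."""
        end = min(self.next_block + budget, self.total_blocks)
        for block in range(self.next_block, end):
            pass  # here: reconstruct this block from the other vault members
        self.next_block = end
        return self.done()

    def done(self):
        return self.next_block == self.total_blocks


# Rebuild starts on the "workbench" spare overnight...
job = Rebuild(total_blocks=1000)
job.run(budget=400)   # technician pulls the drive partway through
# ...and continues from the same offset once re-inserted in the vault.
job.run(budget=600)
assert job.done()
```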
The reason we don't really want to leave the drive in a random location across the datacenter in the long run is that we prefer to keep the network chatter that goes on INSIDE of one vault confined to one network switch (although a few vaults span two switches so we don't waste ports on the switches). Pods inside the vault rebuild the drive by talking with the OTHER 19 members of the vault, and we don't want that "network chatter" expanding across more and more switches. I'm certain the first few months of that would be FINE, maybe even years. But we just don't want to worry about some random choke point getting created that throttles the rebuilds in some corner case we're not in control of.

Each pod has a 10 Gbit/sec network port, and (I think) a 40 Gbit/sec uplink to the next level of switches upstream. The switches have some amount of spare capacity in every direction, but my gut feeling is that in some corner case the capacity could choke out. We save money by provisioning everything close to what it actually needs, without much spare overhead.
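A rough back-of-envelope check of that uplink worry, using the port numbers from the comment above (the per-peer read rate during a rebuild is my assumption, purely for illustration):

```python
# Back-of-envelope: why cross-switch rebuilds could pinch the uplink.
# Port/uplink figures are from the comment; per-peer rate is assumed.
PORT_GBPS = 10    # each pod's network port
UPLINK_GBPS = 40  # switch uplink to the next tier (per the comment)

peers = 19            # the OTHER members of the 20-pod vault
per_peer_gbps = 2.0   # ASSUMED read rate per peer during a rebuild

# Rebuild target on the SAME switch: traffic stays local, uplink untouched.
local_uplink_load = 0.0

# Rebuild target on ANOTHER switch: every peer's stream crosses the uplink.
remote_uplink_load = peers * per_peer_gbps  # 38.0 Gbit/s

print(f"{remote_uplink_load / UPLINK_GBPS:.0%} of the uplink")  # 95% of the uplink
```

Even at a modest assumed 2 Gbit/sec per peer, one out-of-vault rebuild nearly fills the whole 40 Gbit/sec uplink - which is the kind of corner-case choke point the comment is worried about.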
Plus, the time to walk a drive across the datacenter isn't that big of a deal. Some of the rebuilds of the largest drives can take a couple of days ANYWAY. Saving those 5 extra minutes wouldn't add much durability.
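The arithmetic behind that last point is quick to check (taking "a couple days" as roughly 48 hours):

```python
# How much of a multi-day rebuild window does a 5-minute walk consume?
rebuild_hours = 48   # "a couple days" for the largest drives
walk_minutes = 5     # carrying the drive across the datacenter

fraction = walk_minutes / (rebuild_hours * 60)
print(f"{fraction:.2%}")  # 0.17%
```

A fifth of a percent of the rebuild window is noise next to the days the rebuild takes regardless.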