*The Danger team did build a great system for its time and by the time Microsoft...

rodgerd · on Sept 1, 2015

> I've heard good things about Oracle's RAC, but it's understandably intolerant of your screwing up its disks (SAN mis/re-configuring) when you aren't properly maintaining backups

There are a number of problems with RAC, some of which come from people using it wrong, and some of which are inherent to RAC. "Using it wrong" covers things like people not understanding that it runs on shared storage, so it provides compute node resilience, not storage resilience. They should probably spend on some Data Guard (or equivalent) unless they want to be the DBA equivalent of the server admin who thinks you don't need backups because you've got RAID.

The built-in problems come from the fact that Oracle ASM doesn't check[1] the signatures on disks/LUNs presented to it. So if the SAN admin, I don't know, manages to somehow reverse the mappings for one LUN of 30 between the stress RAC and the dev RAC, Oracle will not refuse to start and say "that ASM disk has the stress signature on it". Instead, Oracle will overwrite the stress LUN with dev data for a while, then go to read it, then discover it doesn't have the on-disk structure it expects, then crash with a SEGV or other entertaining but unhelpful error. But only after it's irretrievably corrupted the ASM group, of course.

[1] as of 10g, the last time I hit this problem.
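
To make the missing check concrete, here is a minimal Python sketch of the kind of guard described above: read a label from every disk and refuse to mount if any of them names a different group. Everything in it is hypothetical and for illustration only. The 32-byte header, the `Disk` class, and the function names are assumptions, not the real ASM on-disk format or any Oracle API.

```python
from dataclasses import dataclass

HEADER_SIZE = 32  # hypothetical fixed-size label area at the start of each disk


@dataclass
class Disk:
    path: str
    data: bytes  # in-memory stand-in for the raw contents of the LUN


def write_label(group: str) -> bytes:
    """Build a hypothetical header: the group name padded with NUL bytes."""
    raw = group.encode("ascii")
    if len(raw) > HEADER_SIZE:
        raise ValueError("group name too long for header")
    return raw.ljust(HEADER_SIZE, b"\0")


def read_label(disk: Disk) -> str:
    """Read the group name back from the hypothetical header."""
    return disk.data[:HEADER_SIZE].rstrip(b"\0").decode("ascii")


def foreign_disks(disks, expected_group):
    """Return the paths of disks whose label names a different group."""
    return [d.path for d in disks if read_label(d) != expected_group]


def mount_group(disks, expected_group):
    """Refuse to mount (and therefore to write) if any disk belongs elsewhere."""
    bad = foreign_disks(disks, expected_group)
    if bad:
        raise RuntimeError(
            f"refusing to mount {expected_group}: foreign disks {bad}"
        )
    return True
```

With a "DEV" group that has accidentally been handed one "STRESS" LUN, `mount_group` raises before anything is written, which is the behaviour the swapped-mapping scenario needed.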

bro-stick · on Sept 1, 2015

Yup. I managed some dataguard (not RAC) instances on AWS for Palm pre HP. Thankfully we had DR plans and snapshots to cover our asses.

Edit: Fixed HA techs.

mattzito · on Sept 1, 2015

How did you do RAC on AWS without shared storage?

bro-stick · on Sept 1, 2015

Thanks, fixed in edit. I was mistaken, it was dataguard and better snapshots using archive log mode.

mattzito · on Sept 2, 2015

I wasn't totally skeptical - we did RAC on AWS in tests back in the day using a third node as an iSCSI target, but it was a) sketchy as hell, b) not at all redundant, c) not something I thought Palm would go for.

bro-stick · on Sept 2, 2015

It might work on something like OEL with ZFSonLinux using zfs send/recv. Larger implementations might want to investigate drbd or something like OCFS2, GPFS, AFS or Lustre (none of which probably plays well with cloud environments). Maybe Gluster, but with trepidation. (It was an AWS consulting shop with banking / military chops, who could sell ice to enterprise eskimos.)