As we've gotten more devices, computers have gotten more powerful. Server software has gotten better. HTTP now has keep-alive and multiplexing. Encryption and networking are offloaded to hardware, and we have far more cores.
I would guess a single server can handle thousands of times as many users as it could years ago. HAProxy, Netty, Nginx, and others can each serve over a million (simple) HTTP requests per second. That's more requests than Google.com gets.
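Part of that throughput win is HTTP/1.1 keep-alive: one TCP connection carries many requests instead of paying connection setup each time. A minimal sketch using only Python's standard library, with an invented local server (port and handler names are illustrative, not from this thread):

```python
# Sketch: HTTP/1.1 keep-alive lets one TCP connection serve many
# requests. We start a throwaway local http.server and reuse a
# single http.client connection for three requests in a row.
import http.client
import http.server
import threading

class Handler(http.server.BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # HTTP/1.1 keeps connections open by default

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        # Content-Length is required so the client knows where the
        # response ends and the connection can be reused.
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
bodies = []
for _ in range(3):  # three requests over the same TCP connection
    conn.request("GET", "/")
    bodies.append(conn.getresponse().read())
conn.close()
server.shutdown()
```

Without keep-alive each of those requests would need its own TCP (and, with TLS, crypto) handshake, which is exactly the overhead modern stacks avoid.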
Most, as in: I've been watching over 100 AWS VMs and maybe 30 on Azure for years, and they die or crash far more often than the VMs hosted here, at the colo, or on our old bare-metal machines. It's anecdotal, but it seems like AWS doesn't really care about warning you before shutting off your machine. Azure is slightly better but still goes down regularly.
I know everyone says "it's okay! Just make your servers fault tolerant!" That works great for load balancers and frontends, but not at all for SQL databases. ACID-compliant transactions require a single source of truth, and a true multi-master SQL database is impossible. Failover, yes, but you always risk losing data in the switchover unless you use two-phase commit, which actually makes your multi-master database slower than a single system. In practice the failover almost always causes some data loss and log conflicts you have to diddle with later. And God help you if the replica falls more than a couple of seconds behind.
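To make the two-phase-commit cost concrete, here is a toy coordinator simulation (class and method names are invented for illustration; this is not a real replication protocol). The point is structural: every participant must acknowledge PREPARE before anyone commits, so each transaction pays an extra round of messaging, and a single unhealthy node forces an abort everywhere:

```python
# Toy two-phase commit: phase 1 collects votes, phase 2 commits or
# aborts on every node. One slow or dead participant stalls or
# aborts the whole transaction, which is the latency/availability
# cost mentioned above.
class Participant:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.state = "idle"

    def prepare(self, txn):
        # Phase 1: durably log the transaction, then vote yes/no.
        if not self.healthy:
            return False
        self.state = "prepared"
        return True

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants, txn):
    votes = [p.prepare(txn) for p in participants]  # extra round trip
    if all(votes):
        for p in participants:  # Phase 2: everyone commits
            p.commit()
        return "committed"
    for p in participants:      # any "no" vote aborts everywhere
        p.abort()
    return "aborted"

nodes_ok = [Participant("a"), Participant("b")]
result_ok = two_phase_commit(nodes_ok, "txn-1")

nodes_bad = [Participant("a"), Participant("b", healthy=False)]
result_bad = two_phase_commit(nodes_bad, "txn-2")
```

A single-node database does the durable write once and answers; here every commit waits on the slowest participant's prepare, which is why the multi-master setup ends up slower than one box.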
Anyways, for SQL databases system reliability is as essential as ever, and it's a lot easier to hit high SLA numbers when you control the hardware and the power switch. The closest you can get to the Holy Grail is running KVM VMs locally and doing live migrations when hardware starts to fail, but even that won't keep your database running if something really bad happens.
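For reference, the live migration mentioned above is a one-liner with libvirt's `virsh`. The guest name `db1` and the destination host are placeholders, and this sketch assumes shared storage for the disks (otherwise you'd need `--copy-storage-all`):

```shell
# Move a running KVM guest to another host without shutting it down.
# --live keeps the guest running during the copy; --persistent makes
# the guest's definition survive on the destination.
virsh migrate --live --persistent db1 qemu+ssh://dest-host/system
```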
Thanks for the info, very useful. I didn't realise how unreliable AWS is for database servers. You're correct: random unannounced shutdowns of DB servers are just not acceptable for critical data. I will be launching a new business based on Postgres soon, and the thought of this is terrifying. I'm not keen on RDS-type services or colo, so this is an unexpected problem I need to overcome. Do you know whether a VPS provider such as DigitalOcean or CloudSigma would be more reliable?
I would say colocation is the most reliable. I'm sure dedicated VPS is better, but most providers still reserve the right to pull the plug for hardware replacements. Colo isn't terribly expensive if you buy used equipment.
Really consider how important 100% uptime is, though. Google and S3 have gone down multiple times without killing the internet or losing a ton of customers. Plenty of large SaaS providers still use maintenance windows. Heck, GitHub went down today. Not sure if you use ADP, but that goes down for a couple of days a week!
I know it's not the popular thing to do, but you can get much better relative reliability by running a single database per tenant and a limited number of tenants per VM.
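The payoff of that layout is a bounded blast radius: when one VM dies, only the tenants on it go down. A minimal sketch of the placement rule (the cap of 8 tenants per VM and all names are made-up numbers for illustration):

```python
# One database per tenant, at most MAX_TENANTS_PER_VM tenants per VM.
# If a VM dies, you lose a small, known set of tenants instead of all
# of them.
MAX_TENANTS_PER_VM = 8

def place_tenants(tenants, max_per_vm=MAX_TENANTS_PER_VM):
    """Greedily pack tenants onto VMs, capped per VM."""
    vms = []
    for tenant in tenants:
        if not vms or len(vms[-1]) >= max_per_vm:
            vms.append([])  # start a new VM once the last one is full
        vms[-1].append(tenant)
    return vms

vms = place_tenants([f"tenant{i}" for i in range(20)])
# Worst case: the fullest VM dying takes out this many tenants.
blast_radius = max(len(v) for v in vms)
```

With 20 tenants and a cap of 8 you get three VMs, and no single failure touches more than 8 tenants; a single shared database would take all 20 down at once.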