Based on my extensive arguing-on-the-internet experience, I can 100% say that you caused this outage. That caused mayhem and made a lot of people less happy, well done.
I believe Google (G Suite) only claims 99.9% - sounds small, is actually a huge difference. They typically achieve a better % but that is all they "guarantee".
"Service Credit" means: (a) 10% of the total invoice charges for the affected month if the Monthly Uptime Percentage for any calendar month is between 99.0% and 99.9%; or (b) 25% of the total invoice charges for the affected month if the Monthly Uptime Percentage for any calendar month is between 99.0% and 95.0%; or (c) 50% of the total invoice charges for the affected month if the Monthly Uptime Percentage for any calendar month is less than 95.0%.
For reference: over the course of a year, if you have an uptime of 99%, that means you are down for 3 days 15 hours per year. If you have 99.9% uptime, that means the downtime is roughly 9 hours per year. If you have 99.99% up, then the down is ~ 53 minutes per year. 99.999% then gives you ~5 minutes of down per year. 99.9999% will then give you ~31 seconds of downtime per year.
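For anyone who wants to redo this arithmetic, here's a quick sketch (plain Python, nothing service-specific):

```python
# Sanity-check the downtime figures above.
SECONDS_PER_YEAR = 365 * 24 * 3600  # non-leap year

def downtime_seconds(availability_pct):
    """Seconds of downtime per year implied by a given uptime percentage."""
    return (1 - availability_pct / 100) * SECONDS_PER_YEAR

for nines in (99.0, 99.9, 99.99, 99.999, 99.9999):
    s = downtime_seconds(nines)
    print(f"{nines}% -> {s / 3600:.2f} h ({s:.0f} s) per year")
```

This reproduces the figures above: ~87.6 hours at 99%, ~8.8 hours at 99.9%, ~53 minutes at 99.99%, and so on.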
One thing to keep in mind: for each extra 9 of uptime, add about two zeros to the budget, and bump the timeline up one time unit and double the count. So if 99% costs you $100 and a day of timeline on the project, then 99.9% will cost you $10,000 and two weeks of timeline, 99.99% will cost you $1,000,000 and 4 months of timeline, 99.999% will cost you $100,000,000 and 8 years of timeline, etc.
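That rule of thumb can be written down directly; the dollar and timeline figures are of course illustrative, not measured:

```python
# Hypothetical rule of thumb from the comment above: per extra nine,
# the budget gains ~two zeros and the timeline moves up one unit and doubles.
UNITS = ["day", "week", "month", "year"]  # each nine bumps the unit up one

def cost_and_timeline(extra_nines, base_cost=100):
    cost = base_cost * 100 ** extra_nines
    count = 2 ** extra_nines  # 1 day -> 2 weeks -> 4 months -> 8 years
    unit = UNITS[min(extra_nines, len(UNITS) - 1)]
    return cost, f"{count} {unit}{'s' if count > 1 else ''}"

for n, pct in enumerate(["99%", "99.9%", "99.99%", "99.999%"]):
    cost, timeline = cost_and_timeline(n)
    print(f"{pct}: ${cost:,} and {timeline}")
```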
This is interesting! Is this a rule of thumb / common sense in the infrastructure world, or is it something for which there are studies? If there are studies, do you have some links I could read further?
> for each 9 of uptime you have, add about two Zeros to the budget
It's more like going from the stone age to the bronze age. You need expertise and architecture suitable for achieving high availability, these are your budget increases, they are not linear at all. But the infrastructure itself doesn't necessarily get more expensive, on the contrary, new architecture could allow you to use the cheapest stuff available on the market.
Much of Google's initial competitive advantage was a sharded architecture and understanding how to run a reliable service across unreliable nodes - mean time to detection and repair matter more than the reliability of any one component, though better components can let you scale further - there are interesting multiplicative effects.
So even if it's down for the whole month I still have to pay (and probably can't even get out of it; even without some sort of longer contract the lock in is strong). Those are some big corporation terms all right.
I understand why you say git is centralized, but I don’t think it’s accurate. I can work on git while github is offline, I just can’t share that work with others through github, but I can easily upload a repo and we can share it through another service.
I wish I'd saved a link to this, but the last time around when centralization was the big push, someone pointed out that the uptime numbers we care about in a professional setting relate to all of our tools, not just individual ones.
Which is to say that if I can get 75% of my work done without one of my tools then the impact to the business of an hours long outage is relatively small. Even if the outages are frequent, we can learn to adapt (real example: oversubscribed Atlassian products crashing on Friday afternoon due to resource exhaustion - move work to Thursdays or Mondays).
But if the entire system goes down due to shared hardware or simultaneous software upgrades, then you can end up with an office full of people who can't get a lick of work done. Heterogeneity wins out in this case, and 9's can be measured in a way that doesn't reflect the actual business impacts accurately.
Those of us who run our own mail servers: what is your MTBF? That is, ignoring all other differences, from the standpoint of pure availability, how do you compare to the popular centralized email services?
I had a serious fault two years ago, when several forest fires knocked out electric power for the two lines feeding the datacenter. Electricity failed three or four times in short periods, and for some reason this wreaked havoc on the UPS-generator combo, and power went out. We were down for six hours and mad at the co-location company.
Other than that, thirty minutes downtime 14 months ago when the public-facing redundant routers mis-detected a failover, failed STONITH and entered split-brain status.
I wouldn't commit to better than 99.9% with our infrastructure (single datacenter with remote data backups) but we exceed the metric most years (99.99% in all but one of the last ten years).
I've been running my own mail and other services since 2005. We had one severe outage in 2008, caused by a bug in reiserfs. In 2013 I migrated to a different server, with minimal downtime but some work involved. Last year, one of the disks in the raid1 was giving smart errors, but it turned out it was hot swappable.
I can't recall the last unplanned outage since the 2008 one.
No outages except when I've migrated postfix/dovecot from old to new servers, and misconfigured. Otherwise uptime has been equal to server provider uptime (Rackspace, then AWS).
I'd be happy to move away from Gmail. Unfortunately, I happen to hate spam, so that's not really an option.
That was what kept me on GMail for so long. But about a month ago I moved several accounts to FastMail (at the urging of others on HN), and have been pleasantly surprised by the results.
FM seems to have a lot fewer false positives, and the amount of spam that gets through seems only marginally more than with GMail.
FM offers a 30-day free trial, which even includes using your own domain. That I found surprising. Usually trials are so restricted that you can't really get a sense of what you're getting into.
Did not have the same results regarding spam on Fastmail. I still use them for a mostly private email address that friends have, but it never did a great job with a highly public email address.
Do you know of an email provider that doesn't work this way? Asking because I'm currently using gmail for a little project and make use of aliases and catch alls and then make and remove additional accounts as I need those accounts to be able to send mail in addition to receiving it.
I'm not really interested in managing everything that I perceive goes into running my own email server (IP reputation management and having to keep a box online 24/7).
https://www.migadu.com lets you use unlimited domains and addresses and has regex-based catchalls. The only difference between their plans is the amount of outgoing email you can send. I've been using them for a couple of domains for nearly a year with no issues.
I fail to see your point here, we're looking at the situation of someone stuck on Gmail. Nobody is suggesting that people should give up their homebrew domain system in favor of fastmail.
My Gmail account gets more spam in a week than I've gotten in the two years I've had FastMail. And Gmail's spam filter is surprisingly easy to defeat, considering. For months, putting "- -" at the start of the subject line would trick Gmail and put it in your normal inbox. (I just looked, and it looks like they've fixed it, but back in December it was bad.)
This myth that nobody but Gmail can handle spam needs to die. It was true in 2006, but it is not true in 2019.
I've been using Zoho since 2013 and have never had issues with spam. It's $25 p.a. and offers all the bells and whistles I need (multiple domains and unlimited email aliases, DKIM, SPF, 2FA + application passwords).
Sorry, but another +1 for FM. I've been with them for a couple years now (after 10 years on Gmail) and haven't noticed any increase in spam. And I'm not shy about publishing my email address on the web.
Well Microsoft had their annual mail outage the other day... I guess they could have coordinated a bit better to get them done and out of the way at the same time.
Microsoft has had at least 2 major Azure outages that affected their SSO product. My system was unreachable by anyone but admins. Our systems engineer couldn’t do anything but wait on Microsoft to fix the issue on their end.
We spent ~4 hours troubleshooting an Office365 deployment before realizing the SSO outage was the cause.
The simultaneous fury and relief when we successfully logged users on the next day having made no functional changes was a watershed moment in our self-hosted services commitment.
Given how many services it affected and the mention of 404 errors, I would suspect a GFE bug or bad configuration that started rolling out worldwide (hence the geographically diverse, but not 100% spread of the issue). A decade ago, Tuesdays were a special day for GFEs, but that hasn't been the case for years now. Perhaps it's just the Tuesday curse that persists. :-)
My money is either on that or the static content service.
> The browser connects to the HTTP server on this IP. This server (named the Google Frontend, or GFE) is a reverse proxy that terminates the TCP connection (2). The GFE looks up which service is required (web search, maps
Historically, Google terminated all security at that front-door layer - hopefully that's changed by now. But interestingly, a cert issue down the pipe could easily be the root cause.
Yup. And that's by design. Stability is asymptotic and increasingly expensive to reach toward 100%. There's a very intentional "this is good enough" point.
"The error budget stems from the observation that 100% is the wrong reliability target for basically everything (pacemakers and anti-lock brakes being notable exceptions). In general, for any software service or system, 100% is not the right reliability target because no user can tell the difference between a system being 100% available and 99.999% available. There are many other systems in the path between user and service (their laptop, their home WiFi, their ISP, the power grid…) and those systems collectively are far less than 99.999% available. Thus, the marginal difference between 99.999% and 100% gets lost in the noise of other unavailability, and the user receives no benefit from the enormous effort required to add that last 0.001% of availability.

If 100% is the wrong reliability target for a system, what, then, is the right reliability target for the system? This actually isn't a technical question at all—it's a product question."
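The quote's "other systems in the path" point is easy to see numerically: if components fail independently, end-to-end availability is roughly the product of the individual availabilities. The numbers below are made up purely for illustration:

```python
# Illustrative (made-up) availabilities for everything between user and service.
from math import prod  # Python 3.8+

path = {
    "laptop": 0.999,
    "home wifi": 0.995,
    "ISP": 0.999,
    "power grid": 0.9995,
}

# Compare a five-nines service with a hypothetical 100% one:
for service in (0.99999, 1.0):
    end_to_end = prod(path.values()) * service
    print(f"service at {service:.5%} -> end to end {end_to_end:.3%}")
```

With either service, the user sees roughly 99.25% end to end; the last 0.001% of service availability vanishes into the noise, which is exactly the quote's argument.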
It’s not hard. It’s actually impossible. I agree, just use >.
If I worked for YC as an HN mod, I would literally spend a bit of time every day reviewing as many monospace-using comments as I could and editing them to use > instead.
Or give us actual syntax for blockquotes? (> at the beginning of a line, markdown style, would be great…) Which I feel like gets used all the time in the discourse here, and for good reasons, too.
Then you would have that bit of time back, and the rest of us could stop scrolling back and forth when someone code-blocks a blockquote.
Why not use italics instead? How does the ">" help? You can put it at the beginning of the paragraph, but lines get re-flowed so you can't put it at the beginning of lines.
But seriously HN should just allow basic quotes. And, while they are at it, increase the size of voting arrows to make it possible to reliably hit them on mobile. I get it’s somewhat nice to have a site that does not constantly redesign everything, but pretending you lost the password for the server is just overdoing it in the opposite direction.
We don't tolerate houses collapsing out of nowhere, brakes failing over the course of normal usage and planes falling out of the sky during routine flights.
But for some reason, we HAVE TO tolerate software crapping itself once a year?
I don't accept this logic. This is just a sign of how sloppy the industry has become.
This is the reason your phone becomes obsolete after 2 years, whereas your car can continue to run after multiple decades of abuse.
First: We’re not talking about “out of nowhere” or during “routine” operation. Doing better than 99.99% uptime implies robustness to even extreme, unusual situations.
Second: Air travel could be much, much cheaper if it didn’t have to be nearly 100% reliable. This would be the right trade-off to make in almost any application that doesn’t almost guarantee deaths when it fails.
We actually do tolerate it.
Plenty of critical parts in your car are designed to not be 100% available even in all expected cases.
For example, plenty of higher-end cars in California come with summer tires that can't be used in cold weather/ice.
Even the brakes you are talking about must be replaced every X miles (depending on how new the car is, this may be between 10k and 50k miles)
Houses are definitely not designed to be 100% available. This is in fact why they fail due to fire or earthquake or other events. The design point is not instant failure, but it's also not "100% available".
I think this is a false equivalency. If we're talking about "service unavailability", planes break all the time. Houses have to be vacated because of flooding, fire, insect infestation. Brakes do fail. Just like with software, we accept a certain level of risk in exchange for cost/convenience efficiencies (e.g. we don't want our planes to fall out of the sky, but we're okay with getting stranded in Phoenix for 24 hours because of a busted landing gear).
Also, brakes contribute to service unavailability. Brake pads need to be replaced on average every 50k miles, which takes the average driver 4 years. And let's say the average length of time your car is at the mechanic's to fix brakes is 3 days. That's 3 days of unavailability every 4 years just for brake pad replacements, or 99.8% availability (two nines!), just because of brake pad repairs. Add in all the other required car maintenance, and depending on the reliability of the vehicle, and you might be down into one nine territory.
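The two-nines arithmetic above checks out (assuming the hypothetical 3 shop days per 4 years):

```python
# Availability of a car given 3 days in the shop per 4 years (assumed figures).
days_down = 3
days_total = 4 * 365
availability = 1 - days_down / days_total
print(f"{availability:.2%}")  # roughly 99.8%
```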
Gmail going down is like your car being in the shop. It's not equivalent to a plane crashing; the equivalent there would be the entire contents and history of your Gmail account being unrecoverably deleted, with no backups of your own. Of course, I'd still much rather have that happen a hundred times than be in one fatal plane crash.
PS.
99.978% availability translates to a downtime of ~2 hours/year total. Not bad! But it's when things break that we realize how performant and reliable they actually are.
I'm consistently amazed at how good Google and Facebook are at staying up. They're two services where I don't think I've really experienced a broad outage. Of course, with Facebook's data designs there's sometimes quirkiness as a result, but it's rarely completely off for me.
Google, I think I've only really noticed it offline once in the past 10 years or so. Not complaining at all.
Consider that airplanes are relatively self-contained systems, whereas most of the systems we deal with in networking cross many different independent boundaries, each of which can independently fail for any number of reasons. There's more parties involved in regular operations of distributed software than in maintaining airplanes.
The e-mail equivalent of your house falling down is data loss. This is unavailability, which is more analogous to losing your keys and not being able to get in for 30 minutes.
Not how it works. If your service has that 99.999% availability and you get a single unavailability event in a year, that's already 5 minutes of downtime completely independent from all the other events that users may or may not experience; there is almost no chance of overlap between them. Users definitely notice that. Worse, events are going to be even less frequent than that and all the more noticeable, and on top of that you are going to underestimate actual unavailability by at least an order of magnitude.
So, can we get to the level where unavailability is actually an unnoticeable noise? Yes, but definitely not the way Google does things. I'd generalize that Google is absolutely not the place to look for ideas on reliability.
I think you have a misconception about what actual reliability is for most products and services. 99.999% is a solid service; 99.99999% is a hard-to-achieve target for enterprise software.
To say Google is not the place to look for reliability is a pretty comical statement.
As you get beyond 5 nines, environmental factors begin to dominate (if the service is networked: network unavailability, if the service is onsite, power outages and weather) any reliability inherent in the service.
I don't have a misconception. Can you name a single internet service that has an actual five nines availability? That's definitely not google search nor gmail.
I never claimed Google has 5 9s. Your claim that 99.999% for Gmail means Google isn't a place we should go to for reliability advice is comical at best, ignorant at worst.
I couldn't make that claim, because Gmail is very far from 99.999%. I merely pointed out that the SRE quote is wrong; that's the quality of reliability advice they give. I generalized because I've seen plenty of bad reliability advice coming from Google, and have even been burned by some of it in the past.
If you are into reliability you really shouldn't take Google seriously.
What specifically are you disagreeing with in the quote? It does not say that Google is five nines, nor does it say whether or not five nines is desirable.
The argument is basically the following:
(1) Consider the claim that getting to 100% reliability is important.
(2) If the claim were true, then it would be important to go from 99.999999999999999% availability to 100%.
(3) That is obviously not important, so the claim must be false.
(4) Since we have shown that the best target value is not 100%, it must be some number x, with x < 100%.
Well, I would say that the cost of that complexity is exponential.
In the aerospace industry, to get that last percent of a percent, two completely independent implementations of everything are used. Then, to get another decimal, you add two more implementations and a consensus algorithm. Then of course you add static/unit/API/integration/stress/fuzz test suites for each implementation. Then you test the tests. Then you have a human run each test as the "second implementation" of the CI system. And so on, and so on. Each new "9" of availability costs several times more in human resources alone.
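A toy sketch of the N-version idea described above (not real avionics code, just the structure of running independent implementations and voting):

```python
# Toy N-version voting: run independent implementations of the same spec
# and require a majority to agree. All implementations here are hypothetical.
from collections import Counter

def vote(implementations, *args):
    """Return the majority result across independent implementations."""
    results = Counter(impl(*args) for impl in implementations)
    answer, count = results.most_common(1)[0]
    if count <= len(implementations) // 2:
        raise RuntimeError("no majority: implementations disagree")
    return answer

# Three independently written versions of "double the input":
impls = [lambda x: x * 2, lambda x: x + x, lambda x: x << 1]
print(vote(impls, 21))  # 42
```

The cost pattern in the comment follows directly: every extra vote means another full implementation, its own test suites, and tests for those tests.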
Then take into account the "productivity loss" of all those processes, and you need yet more people to progress as fast. Adding more people to a project has diminishing returns. After a while you could spend the entire GDP of the world and you won't be able to add another availability decimal point.
Well, the "good enough" point is reached when the cost of improving the reliability of the service is more than people would pay for that improved reliability.
It's not ready for release yet (in a week or two maybe), but I'm working off the idea of a pay-per-use email service. Instead of a flat per month fee, you're charged for resources used. It looks like it'll end up being about an order of magnitude cheaper than other paid email services so far. Hopefully from there it'll be a good base to try out other improvements to email too.
Semi-related, but can anyone suggest a good open source email client? I'm working on several computers with macOS/Linux, and Gmail is fast only on my deep learning rig, so I'm looking for something to replace it.
Strange. I can't remember that last time it crashed on me, and I have it running almost all the time on my work and personal machines (both Win10, talking to gmail and fastmail respectively via imap).
A while back Mozilla announced that they were migrating development to an independent team. I don't know much about that, but my impression as a user is that quality hasn't been affected and it might even be getting more love. I was happy to donate some money to the project last time they did a fundraiser.
I recently encountered a bug where the global search stopped working (Ctrl+K). I have tried most of the suggested remedies and have it semi-working, i.e. partial results. This has somewhat stolen the thunder from an otherwise flawless client for my use case.
I also switched to mutt a few years ago, as part of a switch away from gmail (which I no longer use). HTML email is actually a much smaller problem than I initially expected (and in the rare cases it's a problem you can always pipe it to lynx).
I just switched over to neomutt a while back, after years of Thunderbird and evolution before that. I don't know why I didn't switch earlier. It's beautiful.
I've been using Thunderbird for several years... and frankly, it's not bad. I want to love and support it more, but it does have a few - I think minor - issues. Nevertheless, my opinion is that it is good enough, and the client best suited for general users. I'll be honest though: lately I've been teetering on the brink of trying out cool-kids clients like mutt/neomutt, etc. But I haven't yet, due to too many other projects. And then... I read a recent blog post about some decent-sounding 2019 plans for Thunderbird; see https://blog.mozilla.org/thunderbird/2019/01/thunderbird-in-... So I'm actually inspired to stick with Thunderbird and see how/whether it improves. Naturally, the best approach is to try things out on your own and judge for yourself.
Could you list the specifications (roughly) such as amount of RAM, processor (cores/speed/architecture or era) on your "deep learning" rig and on your "several computers"? I'm curious what it takes for the new gmail web interface to run smoothly without being frustrating.
I have correspondence with various Chinese and Cyrillic alphabet symbols, and their closed-source string decoder mysteriously fails on "complex" symbols, and I cannot debug it! Otherwise it is an excellent client.
I went through this process a few weeks ago when I was switching to Linux as my primary machine. There simply are no great open source options right now for email clients. Below are general summaries of my notes.
Also, worth noting, most of these clients cannot handle Gmail's style of label handling.
If you feel like dealing with terminal based and text only email, I'm sure Mutt is fine. I'm not sure this is realistic for most people.
Evolution is a decent client with very minimal, basic features. It does allow handling of multiple email accounts but it didn't do a very good job of it. There are no plugins. It was often slow to perform actions like archiving or deleting (as if it did them in real time rather than in the background), which meant I couldn't quickly go through and process a list of emails. I had to pause between each action.
Geary was somewhat like Evolution but even more minimal. It has almost zero configuration options and no plugins. It's a very good looking email client, so if you have simple requirements and like the aesthetic it might be a good option.
Thunderbird was the most promising. A rich plugin ecosystem, better handling of multiple accounts. But it was buggy, a memory hog (6 GB RAM and 25% of my CPU when idling), and it would crash on me. There seems to be some history behind it that I didn't get into, but many plugins were not compatible with my version of Thunderbird. For example, I couldn't use the calendar because my version was too new. If critical plugins are going to be disabled any time I update the app, well, that's not worth dealing with to me.
Claws Mail was okay, even if dated looking. It had no contacts or calendar support. There were a few plugins but nothing like what Thunderbird offered. It had a few bugs, like when I switched folders it would scroll to random messages, and overall was too limited for my usage.
Mailspring is the best open source option I discovered. It's somewhat slow at times and can be buggy, but overall performed better for me than Thunderbird. It has some advanced features built in, handles Gmail labels perfectly, and is very clean looking. The biggest downside is that 1) it requires a Mailspring account (though it doesn't send your emails through Mailspring) and 2) is based off a monthly subscription (with a free option). In other words, you're generally dependent on the continuation of the Mailspring service for ongoing operation of your email client.
If open source is not a strict requirement and you are on Windows, Mailbird is the best email client I've ever used. It nails almost everything out of the box, is lightning fast, and has always been stable. If Mailspring doesn't work out for me I'm likely going to switch to using Mailbird in Wine.
Even if you think "outages globally" and "global outage" are obviously two different things (maybe, but I don't think it's obvious), this seems like a fairly pedantic distinction.
Someone's in a really pedantic mood today. The real life use of the expression contradicts your personal definition (try googling it and tell me what you come up with). It is widely used in reporting and accepted by everyone who's not in the mood to argue personal definitions.
Then again... Schrödinger's outage: until everyone checks, some people are not affected. Hence, never global. Right?
But honestly now, are you upset because of the semantics or because of the flatearther joke? ;)
> Eschew flamebait. Don't introduce flamewar topics unless you have something genuinely new to say. Avoid unrelated controversies and generic tangents. [0]
> Hint: it's you
My comment may or may not be wrong but it was still on the topic of the article. I'll let you judge your own.
The fact that your top comment got flagged is a good indication. If you had stuck to just the topic there would be nothing to criticize. Unfortunately I can't look up your exact phrasing anymore, but it struck me as quite rude and, well, flamebait-y. Hence, this thread.