1000 nodes and beyond: updates to Kubernetes performance and scalability (kubernetes.io)
240 points by boulos on March 28, 2016 | hide | past | favorite | 58 comments


We are running Kubernetes in production at FarmLogs and LOVE it. We're a very small team with a ton of operational work to do in other facets of the company as we prep for the season to begin, but once we've got some free time there will be an in-depth blog post describing our migration and roll-out. We've also built some really neat tooling that we would like to share with the world.

Upgrading to 1.2 has been incredible. Deployments are faster and pods get scheduled almost instantly now. Our Master nodes are down to about 1/4th of what they were normally doing in terms of CPU usage.

We're really excited to be ridin' on kubelets!


I would be very interested to hear your process!

Our team is looking at using it, but we haven't found a great way to do automated deployments with our current build system (Bamboo). The best we've come up with is a series of bash scripts as the deployment step, but I'm not fully comfortable with how that would handle failed deployments yet. Basically, we need a way to handle automated deployments, and see the status of our currently deployed systems / promote environments.

If anyone is using Kubernetes in production, I'd love to hear what your deployment process looks like.
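For illustration, a deploy step of the kind described above might be sketched like this, assuming a recent kubectl configured for the target cluster (the app name and manifest path are hypothetical):

```shell
#!/usr/bin/env bash
# Hypothetical CI deploy step: apply the manifest, then wait on the
# rollout and undo it on failure. Names (my-app, deploy/) are examples.
set -euo pipefail

kubectl apply -f deploy/my-app.yaml

# Blocks until the new pods are ready; exits non-zero on a bad rollout.
if ! kubectl rollout status deployment/my-app; then
  echo "Deploy failed, rolling back" >&2
  kubectl rollout undo deployment/my-app
  exit 1
fi
```

The non-zero exit from `rollout status` is what lets a CI system like Bamboo mark the deployment step as failed.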


For GitLab CI we just released deploy-to-Kubernetes support: https://about.gitlab.com/2016/03/22/gitlab-8-6-released/


We're looking at Spinnaker.io + Kubernetes right now for just this reason. K8s support was just added.


Disclosure: I work at Google on Kubernetes.

Please feel free to email me (aronchick (at) google) if you'd like to discuss this further (either P or GP post). We've seen a lot of this, and would love to help you out!


Awesome, thanks for the offer! I'll reach out in the next day or two! :)


It's exciting to see that Kubernetes is ready for basically any scale. You're more likely to run out of quota (on your cloud provider, particularly IPs) or some other resource (on-prem) before you can't schedule a container quickly enough.

Disclaimer: I work on Compute Engine and chat with the Kubernetes folks a lot.


That resource being money. I'm deploying a fairly simple app on GKE and things get out of hand quickly due to confusing pricing. Or maybe I just don't know where to look.


Disclaimer: I work at Google on Kubernetes.

Can you say more? Did you just spin up too many nodes?


I would say there's an impedance mismatch between GKE pricing and unclear requirements: how many resources, in what structure, you will need.

I was looking at the Kubernetes tutorials and couldn't even begin to figure out how much it would cost to run them. (Well, I didn't try too hard; it wasn't that important.)


You probably meant to disclose that information, not disclaim it. I suppose this is one of those cases like "literally" where persistent misuse will cause the word to be its own opposite, but I keep fighting the annoying fight anyway.


In this case, I meant both (so I chose Disclaimer). The full combo is: I work on Compute Engine (Disclosure!), but I don't actually work on Kubernetes (Disclaimer!) though I do hang out with them (both?).


Fair enough! People certainly seem to do a lot of disclaiming of their credentials these days so I guess everyone is getting the picture anyway.


It's hard to understand why so many cloud service providers and software stacks are still v4-only considering the operational and development cost of v4+NAT (complexity, management, scaling limitations etc). For most systems it would be enough for the front load balancers to speak v4.


Funny, I see it exactly the other way around. I want NAT+Firewall to have at least decent perimeter security in a private LAN, the 10/8 subnet is large enough to do anything I can ever imagine to be doing and IPv4 is so much easier to grok. For most systems it would be enough for the front load balancers to speak v6.


Wanting a firewall is good, you can have that on v4 or v6. However, after you add 3 simple firewall rules, NAT provides no additional security.

   Drop state=INVALID
   Allow state=ESTABLISHED,RELATED
   Drop all
You're now just as secure as if you had a typical NAT setup, but without the decades of kludges that is NAT.
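As a concrete sketch, those three rules translate roughly to the following iptables setup for forwarded traffic (a minimal illustration, not a complete ruleset):

```shell
# Minimal stateful firewall, roughly matching the three rules above.
# Drop packets in no valid connection state (malformed, spoofed, etc.).
iptables -A FORWARD -m conntrack --ctstate INVALID -j DROP
# Allow replies and related traffic for connections initiated from inside.
iptables -A FORWARD -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
# Default-deny everything else.
iptables -P FORWARD DROP
```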


Then you just have a system that has no security advantages but is harder to reason about, due to the additional level of indirection of overloaded addressing (ambiguous addresses, management of forwarding rules, etc.) versus normal firewalling without NAT. That equates to some loss of security at the system level, because you can only effectively secure systems you understand.


I just want to give a public shout out to Wojtek on this blog post. It shows scalability for an actual scenario at levels that most users won't need (10M req/s!). Beyond that, there is a clear methodology with lots of hard data, along with a list of the work it took to get there. Very good post!

Disclaimer: I co-founded Kubernetes and help to coordinate the k8s Scalability SIG, although I'm no longer at Google. I didn't see this before it was published, though.


1.2 has a lot of really nice additions such as infrastructure containers, the new config map API, service draining for node replacements, and many more.

Unfortunately I would be running it on AWS and HA still hasn't been worked out and manual setup is a bear.


1.2 includes multi-zone support, so your nodes can be in multiple AZs. This means that a failure of a single zone shouldn't interrupt your apps: http://kubernetes.io/docs/admin/multiple-zones/

What is not yet in 1.2, but is planned for 1.3, is HA Master - so that failure of the zone which contains your master won't interrupt the control plane. (i.e. you will be able to update your apps even as zones are failing).
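For example, once nodes are spread across zones, you can check which zone each node landed in via its labels (label name as used by the 1.2 multi-zone support):

```shell
# Show each node together with its zone label
# (failure-domain.beta.kubernetes.io/zone in Kubernetes 1.2).
kubectl get nodes -L failure-domain.beta.kubernetes.io/zone
```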


Ah, nice! That wasn't super clear to me but now that you mentioned it, perhaps it should have been.


Not your fault - I was a little slow on getting the docs written up!


Oh, cool. That's what I was looking for in the docs for the last 2 days.


The community is working on what we like to call "self-hosted" Kubernetes. This will help reduce the complexity of installation on all platforms. You can see more about it from my KubeCon keynote: https://youtu.be/A49xXiKZNTQ?t=6m The target is to have this all upstream in the next (v1.3) release.

Slides here: https://speakerdeck.com/philips/pushing-kubernetes-forward?s...


FYI, the SlideShare link in that video's description is truncated. I think it's supposed to go here: http://www.slideshare.net/kubecon/kubecon-eu-2016-keynote-pu...



Cool, thanks!


Awesome, will have a look. We run VPC per-environment, and support launching environments ad-hoc. So we would need to launch a Kubernetes cluster along with the environments. Anything to reduce the switching costs is very welcome :)


kube-aws is a tool that we built at CoreOS to make installation of kubernetes on AWS easier. We just made a new release (v0.5.1) and would love feedback on that. It is what we use in production here at CoreOS. https://github.com/coreos/coreos-kubernetes/releases


kube-aws is great! When do you expect to update to Kubernetes 1.2?


Running it on AWS is supported, and the team is making strides to make it even better! (Complain loudly via GitHub issues where you find problems).

Cluster Federation (a form of HA) is coming in 1.3.


Curious to know why they are choosing to go with protobuf for intracluster communications as opposed to zero copy protocols like capn proto or flatbuffers.

No doubt protobufs are much more battle-tested in Google-scale environments, but are there any other clear benefits if the goal is to reduce CPU time spent encoding/decoding messages?

Especially in SOA deployments where many small services need to communicate with one another, I would think that the ability to quickly read any field from a message and pass it on (without first having to decode the entire message) would be a very desirable trait.


Protobuf is the Google standard, used by basically every single server at Google for the last 15 years. They have built an internal ecosystem of tools around the format. For a Google project to use something different would be weird and would face lots of internal push-back, for good reasons.

Even though FlatBuffers is technically from Google, it's from a sub-team of Android working on tools aimed at Android games. The idea was that you'd store your assets in this format. IIRC the initial release didn't do bounds checking so was totally vulnerable to malicious input (but it wasn't intended for such use cases anyhow). I doubt it is widely used on Google's servers.

Cap'n Proto is not from Google and there's simply no way they'd choose to use it. To be fair, its support for languages other than C++ remains weak, largely because Sandstorm.io doesn't currently have the resources to build it out.

FWIW the ability to read a single field from a message is less important in networking situations because sending/receiving the message is already O(n) and the messages are small-ish, so parsing in O(n) is not a huge deal. Random-access parsing really shines when the input is a massive file on disk.

(I'm the author of Cap'n Proto and also of Protobuf v2 (the first version Google open sourced).)
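To make the O(n) point concrete: protobuf's wire format is a flat sequence of tagged fields with varint-encoded numbers and lengths, so a reader has to walk the bytes sequentially to reach any given field. A minimal sketch of the varint scheme itself (the encoding is the real wire format; the helper names are mine):

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative int as a protobuf-style varint:
    7 bits per byte, high bit set on every byte except the last."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data: bytes, pos: int = 0):
    """Decode one varint starting at pos; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):  # high bit clear: last byte
            return result, pos
        shift += 7

# 300 encodes to two bytes: 0xAC 0x02
assert encode_varint(300) == b"\xac\x02"
assert decode_varint(encode_varint(300)) == (300, 2)
```

Because every field's position depends on the lengths of the fields before it, there is no random access, which is exactly the trade-off zero-copy formats like Cap'n Proto and FlatBuffers are designed around.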



For those wondering what is being used for load generation in the demo: https://github.com/tsenart/vegeta
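For anyone curious, a basic vegeta invocation looks like this (the target URL, rate, and duration here are placeholders):

```shell
# Attack a target at 1000 req/s for 10s and print a latency report.
echo "GET http://localhost:8080/" | vegeta attack -rate=1000 -duration=10s | vegeta report
```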


+1. Thanks tsenart for the awesome load generator!


The frame at 2:37 shows an avg response time of 1.75 ms at 10M QPS. Which API call was measured? I'm looking at the bar charts under "Metrics from Kubernetes 1.2" and the latencies graphed there appear to be different/higher.


"Metrics from Kubernetes 1.2" is discussing the latency of requests to the Kubernetes API (for managing what's in the cluster).

The latency referred to in the demo is the latency of requests from the loadbots to the nginx containers running in the cluster.


That's the nginx response time. You can see that when he scales up the loadbots but not the backends and says that the "tail latency has gotten quite high" (about 1min in).


Correct. In addition, the source code used to run the demo is available on github at https://github.com/kubernetes/contrib/tree/master/scale-demo


Thanks, all clear now.


Does anyone have a good experience to share with a hosted Kubernetes provider outside of GCE and Tectonic? I am primarily comparing using Kubernetes to alternatives such as Rancher or Nomad.


Disclaimer: I work at Google on Kubernetes.

Do you mean GKE?


My 'GCE' reference was to Google Container Engine with Kubernetes as the cluster manager, yes.


Funny story -- Google Container Engine was the obvious name for that product but the TLA for it (GCE) conflicted with Google Compute Engine. We broke the tie by deciding the TLA for Container Engine would be GKE. The 'K' is a nod toward the Kubernetes underpinnings.

Google Compute Engine itself was difficult to name. There were those who were pushing for Google Compute Cluster, but I vetoed it, as the TLA would have been GCC or GC2. Both would have been awful.

Naming is hard.


Why not go with Alphabet Cloud, or ABC?


Hahaha this is awesome! Unfortunately, Alphabet wasn't a thing back then.


Maybe abbreviations could be more flexible, without G always having to be the first letter.


It gets a lot harder to name things world wide if you don't start with "Google". gmail in Germany was a lesson: http://techcrunch.com/2012/04/14/google-finally-gets-right-t...


Is gmail an acronym?


Rancher now supports Kubernetes and Docker Swarm if you are not already aware.


And still no way to make a simple 2-node cluster in 2 different availability zones. What if one AZ fails completely? Happens quite often.

I tried to read HA documentation on Kubernetes and it all starts with warnings like "this is fairly advanced stuff, requiring intimate knowledge of Kubernetes inner workings", and going on with pages and pages of setup process.

Basic HA is not "fairly advanced stuff"; it is a commonplace requirement in any production environment. Why do I need a 1000-node cluster if all 1000 nodes are in the same AZ, which can have an outage at any time?


Update: docs just arrived — http://kubernetes.io/docs/admin/multiple-zones/ . That's better :)


Conceptually speaking, having two nodes is not high availability, it is failover / fault tolerance. High availability is generally N + 2 where N is >= 1.

This is a better explanation than I would write on this:

https://www.quora.com/What-is-the-difference-between-a-highl...


You are right, of course. Still, I don't understand why it is so low priority in container orchestration platforms. And how it is even possible to live without it in production.


It's not low on the priority list at all. These are the same people who worked on borg (I'm a contributor, but didn't work on borg); they get stateful applications and understand that it needs to be done RIGHT. No second chances. Nailing this for 1.0 or 1.1 would have consumed a significant portion of the team, but rest assured it will work, soon.


Update2: persistent volumes are still allocated only in the same AZ as the master container. Hence, no HA databases (only manual volume provisioning is possible).

I still wonder what is the primary use case for Kubernetes (or Docker Swarm, which has similar issues) if high availability is so low on the priority list.


Wasn't there an article recently about how 99.9% availability measurements can still hide lots of bad stuff with high-volume services?

I seem to remember the article noting that 99.9995% was more useful.



