Hello, good morning, everyone.
So it’s my pleasure to be here and talk a little bit about how we think about scale at Google.
As Alexis mentioned, I run the product team for Kubernetes in Google Cloud.
Running services at Google scale
So, you know, there comes a time in the life of any successful company when it needs to scale to meet its users' demand.
This could be, you know, a hit startup from day one, or it could be later when your product really catches on.
You have to meet that exponential curve.
But actually, many applications are not architected to scale, at least not to scale efficiently, in a cost-effective way, or to scale quickly, both up and down with demand.
So, you know, if you have to bolt on scale later, this is very difficult to do.
It’s sort of like trying to change the tires on an accelerating train.
There are many examples of startups that have failed to do this. There are also large companies that are well known to have gone through large infrastructure refreshes while also going through exponential growth.
And not everyone makes it out alive.
Some challenges Google faces
Fortunately, we’ve solved some of these problems at Google over the past 20 years.
And so, you know, arguably, we’ve learned a few tips and tricks that allow us to kind of scale at scale.
As an engineer at Google, you can write a service, and you can scale from zero users to a billion users without really having to rewrite anything and without having to architect the underlying infrastructure pieces or change them.
That’s pretty remarkable.
And if you think about how this happens, this is not automatic.
This is actually a lot of trial and error and a lot of things that we’ve done on the infrastructure side over the past 20 years to make this possible.
And it’s also what underlies Google Cloud and enables this capability for others.
So what’s the secret?
Well, there are actually two secrets internally at Google, and one of them is Borg, which is our planetary scale cluster orchestration system.
And the second one is, as Alexis was saying, the methods and the people behind Site Reliability Engineering (SRE).
I’m gonna talk a little bit about both of them.
And actually, the unique thing about Google Cloud is that it sort of puts the power of both Borg and SRE into the hands of everyone, individual developers at large.
So, with services like BigQuery and Spanner, and then Google Kubernetes Engine (this is the service that I run), you get that capability.
It’s like getting on the express train from the get-go, so you don’t have to worry about when you’re going to scale.
I’ll show you a couple of examples as well.
A look behind the scenes at Google
So, how do we do that?
And I think the one thing that we’ve learned, most importantly, is to run everything in containers, run everything as microservices.
So we actually launch four billion containers per week, and it’s constantly increasing.
And why do we do that?
Why do we use containers?
There are many other mechanisms.
But over the past 20 years, what we’ve learned is that containers are actually the most effective way to scale up and scale down cost-efficiently and also quickly.
So you can react to the demand.
GKE: why it exists and how it works
Then, next comes the whole concept of orchestrating containers.
Containers by themselves aren’t enough in a global system.
This is where Borg plays the critical role that it does internally at Google.
The analog of Borg, of course, is Kubernetes, which we’ve contributed as an open-source project, and then GKE, the piece that we run in the Cloud, which is managed Kubernetes.
GKE is actually much more than Kubernetes itself.
It bakes in ease of use, so you can provision Kubernetes very easily and don’t have to manage it.
It also provides advanced auto-scaling.
This is one of the most loved capabilities, four-way auto-scaling that scales your pods and services, scales your clusters, and also right-sizes both the pods and the clusters.
So, it’s quite complex but also easy to use.
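As a rough illustration of the pod-scaling part of this, the open-source Kubernetes Horizontal Pod Autoscaler documents its core rule as desired replicas = ceil(current replicas × current metric / target metric). The sketch below applies only that rule. The function name, the 0.1 tolerance default, and the sample numbers are illustrative assumptions, and this is not a description of how GKE implements its four-way auto-scaling.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Replica count suggested by the documented HPA ratio rule (sketch)."""
    if current_replicas <= 0 or target_metric <= 0:
        raise ValueError("replica count and target metric must be positive")
    ratio = current_metric / target_metric
    # Leave the replica count alone when the metric is close to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


if __name__ == "__main__":
    # Hypothetical numbers: 4 pods at 90% CPU against a 60% target.
    print(desired_replicas(4, 90.0, 60.0))
    # Hypothetical numbers: 10 pods at 20% CPU against a 60% target.
    print(desired_replicas(10, 20.0, 60.0))
```

The same ratio drives both directions, so one rule covers scaling up under load and scaling down when demand drops.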
It also has a very lightweight container operating system, which is also very secure, and that allows you to start and stop containers quickly.
And then, of course, the secret sauce, which is, it’s managed by a set of SREs.
So, it’s become a very popular product.
Eighty-plus percent of our top customers use GKE.
And some of the feedback is that it’s like magic, particularly for scaling in a cost-effective way.
And it’s used everywhere.
It’s used in banks, it’s used in retail, it’s even used for pizza delivery.
It’s becoming very ubiquitous. So, we’ve been investing in how do we scale Kubernetes further.
I’m gonna talk a little bit about scalability challenges in Kubernetes itself.
We started down this path, you know, from the beginning, but this was one of the major milestones.
About three years ago, we had Pokemon Go, which just became an overnight hit, much larger than the company that was producing the game expected, 50X larger demand than they expected.
And you know, every weekend, they would have a new rollout in a new geography, and it would exceed what they had expected.
So we learned, on GKE, how to scale to meet the needs of Pokemon Go, and created for them several capabilities that are now table stakes. We also started investing in how to run large clusters across multiple zones so that they have high availability.
And how do we run a large cluster across multiple zones efficiently?
Also, how can we autoscale more quickly so that the entire cluster auto-scales up and down as needed?
Since then, we’ve increased the scale in Kubernetes to 5,000 nodes, and we’ve been working on how to bin pack a large 5,000-node cluster with multiple teams and multiple workloads for maximum efficiency.
And then lastly, you want to have multiple of these 5,000 node clusters.
There are customers that have 40,000 nodes.
How do you run multiple clusters together efficiently?
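To make the bin-packing question mentioned above concrete, here is a minimal sketch of the textbook first-fit-decreasing heuristic, packing pods onto equally sized nodes by CPU request. It is not the Kubernetes scheduler's algorithm, and the pod names, CPU figures, and function name are hypothetical.

```python
from typing import Dict, List


def first_fit_decreasing(pod_cpu: Dict[str, float],
                         node_cpu: float) -> List[List[str]]:
    """Assign pods to nodes using the first-fit-decreasing heuristic."""
    nodes: List[List[str]] = []  # pod names placed on each node
    free: List[float] = []       # remaining CPU on each node
    # Place the largest pods first; small pods then fill the gaps.
    ordered = sorted(pod_cpu.items(), key=lambda kv: kv[1], reverse=True)
    for name, cpu in ordered:
        if cpu > node_cpu:
            raise ValueError(f"{name} does not fit on any node")
        for i, remaining in enumerate(free):
            if cpu <= remaining:
                nodes[i].append(name)
                free[i] -= cpu
                break
        else:
            # No existing node has room, so open a new one.
            nodes.append([name])
            free.append(node_cpu - cpu)
    return nodes


if __name__ == "__main__":
    pods = {"web": 2.0, "api": 1.5, "batch": 3.0, "cache": 0.5, "cron": 1.0}
    for index, placed in enumerate(first_fit_decreasing(pods, node_cpu=4.0)):
        print(index, placed)
```

With these sample numbers, 8 CPUs of requests fit on two 4-CPU nodes with nothing left over, which is the kind of efficiency the bin-packing work is after.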
So some of the work, actually this is still ongoing, has led us to find that scale in Kubernetes is multi-dimensional.
Scalability is multi-dimensional
So I talked about 5,000 node clusters.
Actually, as you scale up to 5,000 node clusters, you find that the number of pods that you can have becomes more restricted because of the networking restrictions, the amount of IP address space you have.
Also, the number of services that you have is related to how many pods and how many nodes you have.
So it’s kind of a scalability envelope that you have to pay attention to.
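A small worked example of how IP address space shapes that envelope, under the assumption that the cluster has one IPv4 range for pods and every node receives a fixed-size slice of it. The CIDR values and function names below are hypothetical, chosen only to show the trade-off.

```python
import ipaddress


def max_nodes(pod_cidr: str, per_node_prefix: int) -> int:
    """How many per-node pod ranges fit inside the cluster-wide pod range."""
    network = ipaddress.ip_network(pod_cidr)
    if per_node_prefix < network.prefixlen:
        raise ValueError("per-node range is larger than the cluster range")
    return 2 ** (per_node_prefix - network.prefixlen)


def addresses_per_node(per_node_prefix: int) -> int:
    """Pod addresses available in one IPv4 per-node range."""
    return 2 ** (32 - per_node_prefix)


if __name__ == "__main__":
    pod_range = "10.0.0.0/14"  # hypothetical cluster-wide pod range
    for prefix in (24, 26):
        print(prefix, max_nodes(pod_range, prefix), addresses_per_node(prefix))
```

With a fixed pod range, the product of nodes and addresses per node is constant, so allowing more nodes leaves fewer pod addresses on each node. That coupling is one axis of the envelope.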
So, there’s ongoing work that we’re doing on improving the scalability envelope.
We’re also working on, of course, decoupling these axes from each other.
So you decouple how many nodes you can have, from how many pods you can have, from how many services you can have, and then individually scale those up.
And this is a lot of the work that’s in progress on the GKE team.
So we talked a little bit about Borg and cluster scaling.
The second element that I was talking about is SRE.
And so I just want to touch on that because Kubernetes has actually become quite popular.
Many people are using it, potentially also in this audience, but the SRE techniques, while we’ve written several books on it, they’re a lot less well-adopted.
And I think a lot of the tooling that Datadog is building and that’s getting to be more widely available is essential to the SRE function, and it’s critical for running services at large scale.
Without observability, there is no automation
So more than anything, SRE is about observability.
You have to measure everything, and you have to automate as much of it as you can.
Automate the toil away, ideally in a closed loop, setting policies to accomplish the desired SLOs.
Internally at Google, we make sure that as developers write applications, our tooling exposes metrics at multiple levels, you know, at the container level, at the service level, at the application level, metrics about how the application is performing.
Also about what is accessing the application?
What are the traffic patterns internally?
And then we have mechanisms for remediating and setting SLOs, and then ultimately closed-loop policies.
Microservices at eBay: a case study
So in terms of a real-world example leaving Google, you know, eBay is a company that has been using GKE for many years.
And, you know, it became so popular, they had many of their teams developing all of their new applications on GKE.
Ultimately, they found that they had hundreds of microservices running, and their platform team really could not get a handle on which new services were being created or which ones were communicating with which other services.
And so this led to tremendous operational complexity.
Now, of course, every developer team has the freedom to develop in whichever application they want, and so they’re using different types of monitoring tools and different capabilities.
The way eBay solved this problem is by adopting Istio.
Istio is built on an L7 proxy.
It creates a service mesh, and I’m gonna show you an example of that in a minute.
But what they were able to do with Istio, without troubling their developers and without having the developers rewrite their applications, was get a bird’s eye view of all of the services in their environment.
And this really helped them because they could discover new services as they were coming along, and they could see which services were having trouble, and they could also secure the services at scale.
So I think they said something like it greatly reduced their operational complexity, and they were finally able to uniformly monitor and control their systems.
You know, earlier this year, we announced Anthos.
Anthos is a platform that integrates and combines the capabilities of GKE with Istio in a single package, and it’s something that you can run on-prem, and you can run it, of course, in the cloud.
And so it brings the best of those two things together so that you can actually enable SRE tooling along with container orchestration.
So I’m gonna try and show you a quick demo of what that looks like.
See if we can switch over.
So this is an application, an online shopping application, and, you know, this is just the front page.
But if we look at this application, it’s running in Google Cloud, actually on GKE, and we’ve enabled Istio so we can create a service mesh.
So this is what the service mesh looks like.
Here, you can see all of the services that make up that online shopping application, you know, the ads service, the cart service, the checkout service, and so forth.
And we’re collecting all of these metrics.
And Istio can be bolted on to an existing set of microservices, so there’s no change required to the microservice, but you start getting these error rates and latency information, and, you know, what’s happening there.
You also start, you know, knowing whether there is an SLO policy set.
And if you’ve set an SLO policy, you can see whether it’s within budget or not.
So, you get kind of that view.
The other view is a topology view, which is kind of a nice bird’s eye view to see which services are connected to which other services.
Like, for example, in the case of eBay, this was very critical so they could get a handle on their entire environment.
So we see here, this is the front end.
And, if we scroll down, you know, this is the workload that’s backing the front-end, so it’s a set of deployments.
And then this is the checkout service, and we can, kind of, hopefully, zoom into the checkout service a little bit.
And then, as you zoom into it, you will be able to see all the other services that are connected to the checkout service.
So, you can see what the dependencies are.
So, in this case, the payment service, the product catalog.
You don’t get that topology view from the list.
But from the topology, you can see what’s connected to what and, of course, drill down into each of the metrics for those services.
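The difference between the list and the topology can be sketched with a small graph: given observed caller-to-callee pairs, build an adjacency map and walk it to find everything a service depends on. This is only an illustration and says nothing about how Istio or Anthos compute the view. The checkout edges come from the demo, the frontend edges are assumed, and the function names are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def build_topology(calls: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Turn observed (caller, callee) pairs into an adjacency map."""
    topology: Dict[str, Set[str]] = {}
    for caller, callee in calls:
        topology.setdefault(caller, set()).add(callee)
        # Make sure services that call nothing still appear in the map.
        topology.setdefault(callee, set())
    return topology


def transitive_dependencies(topology: Dict[str, Set[str]],
                            service: str) -> Set[str]:
    """Every service reachable from `service` by following calls."""
    seen: Set[str] = set()
    stack = [service]
    while stack:
        for dependency in topology.get(stack.pop(), ()):
            if dependency not in seen:
                seen.add(dependency)
                stack.append(dependency)
    return seen


if __name__ == "__main__":
    observed = [
        ("checkout", "payment"),         # shown in the demo
        ("checkout", "productcatalog"),  # shown in the demo
        ("frontend", "checkout"),        # assumed for illustration
        ("frontend", "cart"),            # assumed for illustration
        ("frontend", "ads"),             # assumed for illustration
    ]
    topology = build_topology(observed)
    print(sorted(topology["checkout"]))
    print(sorted(transitive_dependencies(topology, "frontend")))
```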
And then if you go to the service dashboard, let’s say for the catalog checkout service, actually this is the dashboard for the checkout service, you can see, you know, if it’s out of budget.
And if there are recommendations for how to fix it, the tool will often have, you know, automation built into it as well, which allows you to change certain things to improve the health of the service.
So this is really critical tooling for the SRE function.
We can go back to the slides, please.
So with Anthos, you know, it bundles in the service mesh along with GKE, so you have the enablement for developers to run at scale and also for SREs to be able to secure, manage, and control your traffic across your entire environment.
So that’s the tooling, the tool kit.
But the most important step perhaps is the cultural change, and the people change to run like an SRE organization.
And this is really where the book is helpful.
But just to summarize, there are a few key mindsets that are SRE mindsets.
One of them is to accept failure as normal, and this is something that we practice regularly internally through the process of blameless postmortems.
So things go wrong all the time.
We do blameless postmortems where, you know, there’s a lot of psychological safety, and that’s what allows us to figure out what truly is the root cause of the problem and to put in place fixes that prevent that problem from occurring again.
The second tenet is really to reduce organizational silos.
So, for example, in the eBay example, you had all of these services that were being created by different teams.
And, you have to do that to be able to run effectively.
But reducing organizational silos really comes down to setting SLOs and error budgets between teams and between services so they have that communication of what they can expect.
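A minimal sketch of the error-budget arithmetic behind that kind of agreement, assuming a simple request-based availability SLO. The function name and the sample numbers are illustrative and not taken from any Google tool.

```python
def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical window: 99.9% SLO, one million requests, 250 failures.
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"{remaining:.0%} of the error budget remains")
    print("within budget" if remaining >= 0 else "out of budget")
```

A shared number like this gives two teams a concrete basis for deciding whether to keep shipping changes or to stop and work on reliability.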
And then lastly, I think this one is particularly important for Cloud as well, is to implement change gradually.
That really comes down to implementing your rollouts as canaries, you know, gradual rollouts, and always building in the capability to roll back.
These sound like simple things to do, but they actually require years of practice and enforcement, particularly in large organizations, and particularly when you’re running services at scale.
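The shape of a gradual rollout with a built-in rollback can be sketched as a loop over traffic percentages that stops at the first unhealthy step. Here `error_rate_at` is a hypothetical stand-in for a real monitoring query, and the step sizes and threshold are assumptions chosen for illustration.

```python
from typing import Callable, List, Sequence


def gradual_rollout(steps: Sequence[int],
                    error_rate_at: Callable[[int], float],
                    max_error_rate: float) -> List[str]:
    """Shift traffic step by step; roll back on the first unhealthy step."""
    log: List[str] = []
    for percent in steps:
        log.append(f"shift {percent}% of traffic to the new version")
        rate = error_rate_at(percent)
        if rate > max_error_rate:
            # Stop immediately and send all traffic back to the old version.
            log.append(f"error rate {rate:.3f} exceeds {max_error_rate:.3f}; "
                       "roll back to 0%")
            return log
    log.append("rollout complete")
    return log


if __name__ == "__main__":
    # Hypothetical release that only misbehaves once it takes half the traffic.
    def flaky(percent: int) -> float:
        return 0.05 if percent >= 50 else 0.001

    for line in gradual_rollout([1, 10, 50, 100], flaky, max_error_rate=0.01):
        print(line)
```

Because the rollback path is part of the same loop, it gets exercised by design instead of being improvised during an incident.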
So in summary, architect for scale from the beginning.
It’s very difficult to bolt on later on.
Use tooling for observability and automation, so something like Anthos, and then really adopt the SRE mindsets and invest in culture change amongst your people.
With that, hopefully, you can scale like a champ, and, of course, don’t forget to make it as easy and simple-looking as Google Search.