
Adventures in Running a Private Cloud (Comcast)


Published: July 17, 2019

So if this were an epic, I would say that it starts in medias res, but since it's not really an epic, I'll start at the beginning.

Back in 2016, I had been working at Comcast for maybe four or five months as a contractor, doing things that contractors do: in this case, building Puppet manifests and modules.

Would you like to build a cloud?

At some point my boss comes up to me and wants to know if I wanna build a cloud. And this is not something I was expecting, but I found it to be positively intriguing.

So a coworker and I got started on the project, and it took about six months of trial and error and proofs of concept and a batch of bad hardware.

But in the end we built a cloud, and it's built on those blue boxes, the hypervisors, which run the VMs, plus a control plane and all of the storage.

And it was fine and good. We've since run iterations of private clouds in a bunch of places around the country.

What are the differences between the public and private clouds?

But I want to start by asking what constitutes a private cloud, since it's not just having your own hardware. Well, I'll explain.

So I think the biggest difference between the public and private cloud is a sense of control, and this control can come in several forms.

The first one is that you can control your own hardware and infrastructure.

And in fact, that’s one of the things that Comcast does. We have 4 data centers with 11 OpenStack regions, which you can think of as availability zones.

And we have the hardware, we’ve got the networking gear. It’s all ours to do with as we please, and that’s pretty fantastic.

But you can still have a private cloud even if you don’t have your own hardware.

If you don't own the hardware but you do own the software, and you've set up your own OpenStack or VMware region, that is also a private cloud in some sense of the term.

But even if you don't control the hardware or software, you can still control the policies.

For example, if you can set policies that are closest to your corporate policies, then perhaps that is, in some way, having a private cloud.

But I think the most important aspect of the private cloud is the ability to control who has access to your cloud.

For example, when you go to a public cloud provider, you may have control over who has access to your VMs or to your block storage, but anyone who pays money to the cloud provider can have access to the underlying infrastructure.

Whereas with the private cloud, I believe the most important tenet is being able to say: these are the customers who have access, and for everyone else, there's the public cloud.

And at Comcast, all of our customers are internal.

There are various departments within the organization and one of the benefits of that is we get to work very closely with them and understand what they do.

Why even build your own cloud to begin with?

So this is perhaps a vaguely tenable definition of a private cloud, but I want to ask: why would you build a private cloud in the first place?

It doesn’t seem entirely obvious, especially in this day and age, but one reason is cost.

Again, one of the great things about our cloud is that since we own the hardware and we provide support for the people who use the VMs, private cloud is still pretty cheap compared to public cloud, even for very large, very specialized instances.

Cost is, you know, comparatively negligible.

Also, from the customer's perspective, because it's cheaper than public cloud, you don't have this issue where you've accidentally set up an orchestration situation where traffic spikes, you accidentally spawn thousands of instances, and now you've amassed a hundred-thousand- or million-dollar AWS bill, which never happens, of course…

Another reason is that it’s easier to respond to customer needs. One of the things that we use the private cloud for is residential email.

So when customers, by which I mean people who pay money to Comcast for services, check their email using their Comcast account, well, that’s all private cloud on the backend.

We can also provide a bunch of computers to build out clusters for small- or large-scale projects. If people need a strange combination of CPU and RAM, we can do that after some effort.

But a personal reason for running a cloud is it’s now our computer. We have a stake and we have a responsibility to maintain the infrastructure.

All of a sudden, the problems that a public cloud provider deals with, the ones we could safely ignore, actually become our problems: a disk dies, and it's our problem.

And one of the side effects of this is we’re able to develop policies and procedures for making things better.

The size and scope of the Comcast private cloud

So how big is the Comcast private cloud?

Just wanna throw out a few numbers about how large it is—or at least the part that I deal with.

And it started small.

In 2013, about three years before I started, there were about 1,500 VMs. That was small enough that you could know all the customers by name. But over time (this was as of a couple of weeks ago), we've grown to 37,000 VMs running in our regions, and that could very well be an undercount.

And that's 37,000 across 11 regions, including a couple of regions that were just built and so aren't yet filled up. Each region has anywhere from 1 to 200 tenants, or customers.

And this is what Datadog sees when it grabs information: this is our physical plant. There are about 2,500 hypervisors here, plus the control planes, plus storage, plus miscellaneous servers.

Scalability

It’s a lot to take in at any given time, which leads us to perhaps the single biggest challenge.

It's what DJ Khaled would call a major key, though maybe he should call it a major challenge. The major challenge is scalability.

How do you deal with thousands of machines, not just thousands of VMs, but thousands of pieces of hardware, in a distributed environment where you need to make phone calls to data center employees and so forth?

So when we think of scalability, there are many different axes, one of which is hardware support.

How do you deal with…I mean, obviously hardware failure is a fact of life.

There are 2,500 hypervisors; if each one has 8 drives, all of a sudden you've got 20,000 drives. And if you've got 20,000 drives, at any one time at least one of them is going to fail. So how do you replace hardware, and how do you make failures faster to recover from?
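To put that failure math in concrete terms, here's a back-of-the-envelope sketch; the 2% annualized failure rate is an assumed, illustrative figure, not a Comcast number:

```python
# Rough failure arithmetic for a fleet of hypervisors.
hypervisors = 2_500
drives_per_hypervisor = 8
total_drives = hypervisors * drives_per_hypervisor            # 20,000 drives

assumed_afr = 0.02                                             # assumed 2% of drives fail per year
expected_failures_per_year = total_drives * assumed_afr       # ~400 drives
expected_failures_per_day = expected_failures_per_year / 365  # ~1.1 drives per day

print(total_drives, round(expected_failures_per_day, 2))
```

Even with a modest failure rate, at this scale a drive is failing somewhere essentially every day, which is why replacement and recovery have to be routine rather than exceptional.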

How do you make the user experience seamless so that if a hypervisor does fail, our customers shouldn’t care?

And sometimes they do care, it turns out. Sometimes they will ask us, "What hypervisor is this machine on? Why isn't it on some other hypervisor?"

To which I must tell them: it's on this hypervisor, but ideally you shouldn't notice.

I showed you this before, but this is what one of the regions looks like. There are a lot of failure points here, not just the drives; these are all disk arrays. There are network switches, and even though they're redundant, things fail.

And this is just a single region; multiply it by 11 regions and that's the current footprint.

And what we've been working on is fixing that: closing the loop on these failure points so that we can know what's going to happen (ideally before it happens), fix things, and move users off of hypervisors before anything terrible happens.

And sometimes even that’s too late.

Capacity

Another axis of scaling is capacity. People want a lot of capacity.

Every time we build out a new region, there’s a lot of demand to get on the region. If there’s a region that provides better CPU or I/O performance, people clamor for that.

So how do we build out more capacity?

We’re lucky in that we can buy hardware and build a new region…theoretically. That turns out to be really expensive and it takes about 6 to 12 months to build out a new region. So by the time we build this out, we’re basically building for yesterday.

One thing that we are looking at is how do we make better use of the hardware that we have? How do we find gaps in hardware usage? What’s being underutilized? How do we get more performance out of the same hardware?

And this leads us to a bunch of interesting optimization problems that I hope in the near future will become things that are actionable and used in our regions.

But we also need to scale how we interact with our customers. We have hundreds of customers, but there’s only a handful of us.

We get calls for everything from “Where is my VM located?” to “Who owns my VM?” to “How do I create a new project?”

Comcast’s custom tools

So we have a group, which I'm part of, that builds tools to help customers do things themselves.

And one of the tools that we use is custom capacity views.

The way OpenStack is designed for our use case, a customer can overstep the bounds of certain kinds of quota.

So we need an enforcement system, and we need people to be able to ask, "How much am I using?" without having to call us with the question.

And when they want to set up a new project, we want them to be able to say, “Well, okay, the region I want to install it in doesn’t have quite enough storage or it doesn’t have quite enough free CPUs. So I think I’m going to go somewhere else.”

So we've been building out this project to allow both finer-grained quota analysis and quota enforcement.

And one of the effects of that is we can fulfill their requests faster and with less back and forth about, “Oh, this region is full,” “This other region is full,” “If you want to use M-family in this region, you can’t.”
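As a rough illustration of the kind of headroom check those capacity views surface, here's a minimal sketch; the data structure, region names, and numbers are hypothetical, not the actual tool:

```python
from dataclasses import dataclass

@dataclass
class RegionCapacity:
    """A simplified view of one region's compute capacity."""
    name: str
    vcpus_total: int
    vcpus_used: int
    ram_gb_total: int
    ram_gb_used: int

def fits(region: RegionCapacity, vcpus_needed: int, ram_gb_needed: int) -> bool:
    """Return True if the requested project fits in the region's free space."""
    return (region.vcpus_total - region.vcpus_used >= vcpus_needed
            and region.ram_gb_total - region.ram_gb_used >= ram_gb_needed)

regions = [
    RegionCapacity("region-a", 40_000, 39_800, 160_000, 152_000),  # nearly full
    RegionCapacity("region-b", 40_000, 21_000, 160_000, 90_000),   # plenty of room
]

# A customer planning a 512-vCPU / 2 TB-RAM project can see at a glance
# which regions have room, instead of calling the cloud team to ask.
print([r.name for r in regions if fits(r, 512, 2048)])   # ['region-b']
```

The point is that the customer can answer "does this region have enough free CPUs and storage for my project?" themselves, before they ever file a request.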

Comcast’s unique needs

Another way we need to scale is in terms of software.

A number of our customers are coming from environments where they toss their software onto bare metal or VMs. Everything is fine, they use a monorepo, and everything is good.

But when they move to the cloud, new architectural strategies are frequently required, because they can't depend on their hypervisor; they can't depend on their VM staying up all the time.

And so our group helps them rethink approaches and build out new strategies and architectures so that their applications are more robust and more cloud-friendly.

And so we're not just Ops, and we're not just Dev; we're also customer-facing technical support and architectural support.

And this introduces even more complexity because we now need engineers who can work in multiple dimensions.

Another thing that we have to scale with a private cloud is monitoring. We have tons of machines, and each machine is running a variety of services. There are containers, and there are 30,000 VMs.

And if you look at OpenStack (you don't really have to read this diagram), there are a lot of services here. How do you monitor them all?

And this is just the control plane level. So this is a couple of machines in each region, but then on every hypervisor you have some subset of these services as well.

So we use Datadog for our OpenStack monitoring, and one technique we use is running frequent tests on every service.

And so this gives us a perspective on how long it should take for customers to do fairly standard operations.

So the Nova performance check is: build a VM, put an image on the VM, start it from a local disk, do some operations, and tear it down. How long does that take?

And so we can detect anomalies in each region. The Swift performance check is similar; that's for the S3-style object storage.
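Here's a minimal sketch of what such a synthetic Nova check might look like, assuming openstacksdk and a local Datadog Agent running DogStatsD; the cloud name, image, flavor, and network are placeholders, and the real checks are more involved than this:

```python
import time

import openstack
from datadog import statsd  # sends metrics to the local DogStatsD agent

conn = openstack.connect(cloud="mycloud")  # hypothetical clouds.yaml entry

start = time.monotonic()
server = conn.compute.create_server(
    name="synthetic-nova-check",
    image_id=conn.image.find_image("base-image").id,           # placeholder image
    flavor_id=conn.compute.find_flavor("m1.small").id,         # placeholder flavor
    networks=[{"uuid": conn.network.find_network("net0").id}], # placeholder network
)
server = conn.compute.wait_for_server(server)  # block until the VM is ACTIVE
elapsed = time.monotonic() - start

# Report the end-to-end build time per region so anomalies stand out.
statsd.gauge("openstack.synthetic.nova_build_seconds", elapsed)

# Tear the instance down so the check leaves nothing behind.
conn.compute.delete_server(server)
```

Running something like this on a schedule in every region gives a customer's-eye view of how long common operations actually take.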

So we get a sense of how our customers are experiencing common tasks, but we also need to know what’s happening on a hypervisor level.

We need to know if there are problems in OpenStack, with Nova, which is for compute, or with Neutron, which is for networking, and we need to know what's causing them.

Originally we were using an OpenStack monitoring solution that involved every compute node, or hypervisor, sending requests to the control plane to get information basically about itself, then sending that information to Datadog along with system-level information like memory, disk, and network usage. The control plane was also sending its own statistics to Datadog.

But now you have 200 or so compute nodes in a region hitting the control plane.

And it took a while: a monitoring run would take about an hour and a half, and sometimes, because the control plane was overloaded, there would never actually be a successful monitoring run.

So we had very reduced visibility. In fact, we got to the point with the compute nodes hitting the control plane API that we would frequently DDoS our own cloud.

So every couple of minutes, everything would just stop; the control plane would collapse. We would turn off monitoring, and things would go okay again.

We would try to turn it back on a bit. We tried exponential backoffs, we tried increasing the intervals between requests on every hypervisor, and all it did was draw out the DDoSing.

It made it go from fast and painful to excruciatingly slow and painful.

And of course our customers were not happy. And this is me when someone gets my phone number and calls me in the middle of the night asking why they can’t build VMs.

Needless to say, this is a problem that had to get solved fast.

Working with Datadog to create a solution

So we worked with Datadog and developed a solution: by using improved caching techniques, taking advantage of the wealth of data we could get from the API, and restructuring requests so that we could get more from each one, we reduced the number of queries being sent to the control plane to basically single digits.
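The actual fix lives inside the Datadog OpenStack integration, but the caching idea itself is simple; here's an illustrative sketch of a TTL cache that keeps repeated checks from re-querying the control plane:

```python
import time

_cache = {}  # key -> (timestamp, cached value)

def cached_api_call(key, fetch, ttl_seconds=300):
    """Return a cached control-plane response if it is still fresh, otherwise
    fetch it once and reuse it, so per-hypervisor checks stop hammering the API."""
    now = time.monotonic()
    hit = _cache.get(key)
    if hit and now - hit[0] < ttl_seconds:
        return hit[1]
    value = fetch()                # a single request to the control plane
    _cache[key] = (now, value)
    return value
```

Combined with batching more data into each request, this is what shrinks hundreds of per-node queries down to a handful per run.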

And so everything that comes from the compute nodes to Datadog is just system-level metrics.

And the result: first of all, no more DDoSing. I no longer have to choose between having monitoring and having a functional cloud.

But we also reduced the time for a successful monitoring run to 12 seconds, and there are further savings that could be had by changing the request system and whatnot. But going from somewhere between an hour and a half and infinity down to 12 seconds is a pretty sizable time savings.

And of course, I was pretty happy, my customers were happy, and the level-one and level-two teams were also happy because they were no longer being inundated with complaints.

So that's a scalable monitoring solution: even though it took a while to build out, and we tried a bunch of different approaches, we finally found a better solution.

Scaling teams and setting SLOs, SLIs, and SLAs

Another thing to scale up is our team. We have to provide services that the public cloud providers also provide. But we also need to maintain expectations for our customers.

Everyone wants 120% reliability: they don't want hypervisors to go down, they want all of their API requests to return instantly, and it's just not possible.

So one way of scaling up our team and becoming more responsive is by actually building out this loop, the flow between Service Level Indicators and Service Level Objectives and Service Level Agreements.

How many of you were at the workshop yesterday?

And now Datadog has tools for this flow.

So you figure out what things your customers care about, and you figure out what metrics encompass what they care about; those are your Service Level Indicators.

And then you build out Service Level Objectives, which define what an acceptable level is. And from there, your Service Level Agreements, which are what (with a little bit of fudge factor) we can safely promise.

So, for example, one thing that our customers really care about is being able to build VMs. They get very angry when they try to build a VM in OpenStack and it fails. They don’t even really care about the reason.

So one metric is the number of VM builds, by which I mean VM builds that succeed, over the number of VM build attempts, which is the number of successes plus the number of failures.

And we can turn that into a Service Level Objective of 99.X%; I don't know how many nines we want to offer yet.

So that's a build success rate over every 30 days, and for an SLA we can promise a VM build success rate of 99.X plus epsilon, where epsilon is some infinitesimally small number.
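As a worked example of that arithmetic (the counts and the 99.9% target are made-up illustrative numbers, since the talk leaves "99.X" open):

```python
# Thirty days of hypothetical VM build attempts in one region.
successful_builds = 14_970
failed_builds = 30
attempts = successful_builds + failed_builds

sli = successful_builds / attempts   # VM build success rate over the window
slo = 0.999                          # e.g. a "three nines" objective

print(f"SLI = {sli:.4%}")            # 99.8000%
print("SLO met" if sli >= slo else "SLO missed, error budget spent")
```

The same numbers that drive the SLO also give you the conversation with the customer: here is the agreed rate, and here is where the failed builds sit relative to it.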

And if our customers have these expectations in place, and we hold ourselves to these expectations and, you know, work at maintaining our promises, then this helps the customers know that we can provide the services they need.

And this helps us say, “Well, we’ve agreed that this is the VM build success rate, and the VM builds that have failed are well within this 30-day rate, and sorry, we’re doing what we can.”

Scaling communication

And finally, the other thing that we need to scale is communication. We have over a thousand users, a thousand application owners.

And so how do we communicate with our customers?

And this is different than the SLAs because this is “How do we actually interact with our customers?”

So we have an OpenStack Slack channel, which has over a thousand application owners, it’s very, very busy.

And we have two types of user-facing documentation.

We have…well, first we have bots in the OpenStack Slack channel so that people can query, “Where is my hypervisor?” or “Who owns this particular VM?”

As people leave Comcast or move to different departments, all of a sudden the ownership of a VM is up in the air.

And now that there’s bots that handle that, it makes things a little bit easier.
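Here's a minimal sketch of the kind of lookup such a bot performs behind the scenes, assuming openstacksdk and admin credentials; the cloud entry and VM name are placeholders, and the Slack plumbing itself is omitted:

```python
import openstack

conn = openstack.connect(cloud="mycloud")              # hypothetical admin clouds.yaml entry

server = conn.compute.find_server("customer-vm-01")    # placeholder VM name from the Slack query
server = conn.compute.get_server(server.id)            # fetch full details

# hypervisor_hostname maps to OS-EXT-SRV-ATTR:hypervisor_hostname and is
# only populated when the caller has admin rights.
print(f"{server.name} is on hypervisor {server.hypervisor_hostname}")
```

Wrapping a lookup like this in a bot means the answer to "where is my VM?" arrives in seconds without anyone on the cloud team being paged.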

There's user-facing documentation. So when you join the cloud and start a project, you can get documentation on, well, how do I start a project?

How do I use OpenStack from the interface? How do I use it from the command line? What are the settings I need to know when I build out a VM?

Finally, there’s SRE-facing documentation, which in the interest of openness, is available to all users.

If you want to build out a cloud, what’s the toolchain required? What credentials do you need? How does our patching scheme work?

The idea is if we have this documentation available, and more importantly, if people read the documentation, we can all understand this is how upgrade processes work, this is how moving VMs from one hypervisor to another works.

And that reduces the number of surprises that the customers have and can call us about.

Key takeaways from building a private cloud

So finally, I want to get to a couple of encapsulated lessons I’ve learned. Tips for building a private cloud: unless your use case requires it, don’t.

And I’m not saying this because I want to completely dissuade people and just be a cloud operator myself.

If your use case requires it, go ahead and build it. If you’ve got the infrastructure, go ahead and build it.

But in most cases, you really don’t have to build a private cloud. The public cloud is at this point really, really good.

The second lesson is listen to your customers, they have needs and wants.

One, this will inform whether you should build a private cloud. And two, this will help you justify what kinds of hardware, what kinds of networking and storage you’re going to need. Three, find a few core services to offer.

I will say, in all fairness, that in the private cloud situation there are some shortcomings. For example, functions-as-a-service isn't quite where we would like it to be.

There's container support in the cloud offerings, but the idea is that when you're doing private cloud, you are not a public cloud provider. So you find your core competencies, you work those really well, and you can make your customers happy by delivering on those core competencies.

And last, have fun. It’s a privilege to be able to build out a cloud to rebuke the idea that the cloud is just someone else’s computer, and it’s a real learning experience.

I’ve learned more about distributed systems and about virtualization and networking hardware than I probably would have in any other situation.

So if you get a chance to do it, go ahead. In fact, if you just wanna do it at home, you can build out a VM and install OpenStack on it. And it's fun to learn about.

So I thank you for listening and I am willing to take any questions as long as they’re not comments masked as questions.