Scaling AI Infrastructure at OpenAI | Datadog

Scaling AI Infrastructure at OpenAI

Published: July 17, 2019

Background: about OpenAI

To give a little bit of context, OpenAI is a research organization based in San Francisco.

Our mission is formally covered in our charter (in this very large amount of words): to ensure that artificial general intelligence, by which we mean highly autonomous systems that outperform humans at most economically valuable work, benefits all of humanity.

And that’s a mouthful.

It’s got some phrases in there like “artificial intelligence” and “all of humanity” which might evoke science fiction images like terminators wandering around or the whole world turning into a big pile of paper clips.

But the reality is that there’s a lot of near-term work being done on artificial intelligence that’s increasingly valuable.

The challenges of AI research

I was trying to find an example of the speed of progress on some aspects of artificial intelligence, like how research into convolutional neural networks like AlexNet and GoogLeNet has made a lot of image recognition tasks feasible.

These are some different results on a dataset called ImageNet, which is pretty popular training data.

And as always, there’s an XKCD comic for everything.

This is an XKCD comic, I won’t read it all out to you.

It’s basically commenting on how difficult it is to do image recognition.

The punchline is it can be hard to explain the difference between the easy and the virtually impossible.

This comic was posted in 2014.

Anybody think that we can’t make a bird-checking image recognition system today?

Another recent project that has been announced from a couple of different organizations is research into neural activation atlases.

These help explain more about how different models come to specific decisions, helping to demystify the mythology around artificial neural networks, that they’re a black box where you can’t tell what decisions they’ve made.

And it also helps to find out what biases might be inherent in either your training data or in what the model has learned.

Supporting AI research with infrastructure

So, how does the infrastructure team at OpenAI help out with this research?

I don’t have a Ph.D. in math.

Well, consider what a normal research stack might look like.

Think about what an application stack would look like if you’re developing a web application.

At the top, you’d have some research code, a training algorithm, PPO, lots of different things.

That’s gonna be built on top of some pretty common frameworks that are available in the public: TensorFlow, PyTorch.

Those training frameworks, in turn, are gonna be run on an infrastructure that’s managed through experiment frameworks.

We have our own in-house frameworks, Rapid and Rcall, but there’s also things like Ray, Kubeflow.

Those frameworks, in turn, need to bring along related libraries like CUDA and XLA.

And of course, we’re bringing all of this code together.

Hopefully, we’re encapsulating it in some form of container.

That doesn’t always happen.

And for those of you who have to deal with training on GPUs, you know the joy that is Nvidia-Docker.

I think I heard at least one laugh.

Those Docker containers have to run somewhere, so we’ve got to do fleet management and some orchestration at the host level with Chef and Terraform.

That all has to run somewhere because the cloud is just other people’s computers.

So, whether it’s Kubernetes or you’re running bare metal on Azure or Google or what have you, you need to be able to do fleet control plane management, and then you have to call back out because occasionally, you’re gonna use an external service like Datadog and other related tools.

And if you look at that stack, well, the infrastructure team basically helps and debugs and manages almost all of it.

This is, I think, the point in time where I’m supposed to mention we’re hiring.

But in general, the infrastructure team influences a lot of the toolchain that the researchers and the developers use.

If you wanna try to think about it in terms of application development, it’s very similar.

How to train your AI

For some of the patterns I mentioned, it might be helpful to understand a little bit about how AI training works.

Let’s very briefly cover the basics of artificial neural networks.

Okay, maybe not.

First, my parents told me never to do three things in public; math is one of them.

So, let’s try and see.

We can go maybe a little bit simpler.

One really terrible and mostly wrong way to think about ML training might be like this.

Think about a simple problem.

One plus x is equal to three, solve for x.

x is two.

Pretty basic.

And when you think about that right in terms of a function, it still looks pretty similar.

f of x is one plus x.

So, if x is two, f of x is going to be equal to three.

x is just an input and the output of f of x is just a result.

So, you start to think about it in terms of, “If I put in something, I’m gonna get out something.”

Now, in this case, if all you had were those inputs and those outputs, after a few of those input-output pairs, you might be able to start to take a decent guess at what that f of x is.

And that’s a really overly simplistic way of thinking about how to build AI training tools or model training tools.

You’re building tools and training loops to come up with something that, given a certain amount of input, returns approximately the correct answer.

Essentially, you’re solving for the function, not the inputs or the outputs.
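As a toy illustration of “solving for the function,” here’s a minimal sketch in plain Python (no ML framework, all numbers made up): guess a parameter, measure the error on the known input-output pairs, and nudge the parameter to shrink it.

```python
# Toy "training loop": recover f(x) = b + x from input-output pairs
# by fitting the unknown constant b with gradient descent.

pairs = [(2, 3), (5, 6), (10, 11)]  # (x, f(x)) samples of f(x) = 1 + x

b = 0.0    # initial guess for the unknown constant
lr = 0.1   # learning rate

for step in range(100):
    for x, y in pairs:
        pred = b + x
        error = pred - y
        b -= lr * error   # the gradient of error^2 w.r.t. b is proportional to error

print(round(b, 3))  # converges to 1.0
```

Given enough input-output pairs, the loop recovers the hidden constant, which is the whole idea writ very small.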

My apologies to everyone in the room who is in any way, shape, or form related to machine learning, whose field I have massively glossed over…eggs and tomatoes later on.

No problem.

Supervised learning

Now, for one more gross simplification, let’s talk about different kinds of training.

So, supervised learning is probably maybe the most familiar.

It’s certainly the one that’s been around maybe the longest in popular discussion.

It’s where we know, for a given training input, roughly what the output should be.

So, that’s a cat, that’s a cat, that’s a hot dog, that’s a cat.

And the labels are, more often than not, not quite absolute:

“This is only cat, there’s no hot dog in this.”

So, you start to get probabilities.

Unfortunately, supervised learning requires a ton of labeling and high-quality data, and there can be bias in those labels, or the model can fail to pick up on the right feature.

Sometimes it can figure out, “Oh, cats are any object that has a wood floor in the background.”
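A minimal sketch of that input-to-probabilities shape, using a toy nearest-neighbor vote (the features and labels are invented; real systems train a network instead):

```python
# Minimal sketch of supervised learning: labeled examples in,
# a probability over labels out. Features and labels are made up.

from collections import Counter

# (feature vector, label) pairs: the "labeled training data"
train = [((0.9, 0.1), "cat"), ((0.8, 0.2), "cat"),
         ((0.1, 0.9), "hot dog"), ((0.2, 0.8), "hot dog")]

def predict(x, k=3):
    """k-nearest-neighbor vote, returned as label probabilities."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = sorted(train, key=lambda ex: dist(ex[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return {label: n / k for label, n in votes.items()}

print(predict((0.85, 0.15)))  # mostly "cat"
```

The output isn’t a hard label but a probability per label, which is exactly the “that’s probably a cat” behavior described above.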

Unsupervised learning

A second kind of learning is unsupervised learning.

Imagine you had a set of photos of stock art people and you wanted to sort those into per person groups.

Unsupervised learning lets you do this, but you won’t necessarily know what the labels are.

That doesn’t mean, however, that you couldn’t pick up the labels after you’ve done that training run and assign names to them.

Probably most folks have a photos app on their phone that does something like this.
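A toy version of that grouping, assuming each photo has already been reduced to a single number (a stand-in for an embedding), might look like this one-dimensional k-means sketch:

```python
# Minimal sketch of unsupervised learning: group unlabeled points
# into clusters without ever being told what the groups "mean".
# (1-D k-means; real photo grouping would cluster face embeddings.)

points = [1.0, 1.2, 0.9, 8.0, 8.3, 7.9]  # e.g. embeddings of two people
k = 2
centers = [points[0], points[3]]  # naive initialization

for _ in range(10):
    # assign each point to its nearest center
    clusters = [[] for _ in range(k)]
    for p in points:
        i = min(range(k), key=lambda c: abs(p - centers[c]))
        clusters[i].append(p)
    # move each center to the mean of its cluster
    centers = [sum(c) / len(c) for c in clusters]

print(sorted(round(c, 2) for c in centers))  # two tight groups emerge
```

The algorithm never learns who the people are; it only discovers that there are two groups, and you can attach names afterward.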

Reinforcement learning

The third kind of learning that we’re gonna talk about today is reinforcement learning.

This is where we hook up that input-output pair from the training process we talked about as part of a feedback loop.

In this case, in the feedback loop, you’ll have an environment of some kind, a simulator; it will output a state of the world, and that state might have a reward associated with it; that gets passed into your model, your training system, which then comes back with an output, in this case an action to affect the environment.
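That loop can be sketched in a few lines; the environment and policy here are trivial stand-ins, not any real training system:

```python
# Sketch of the reinforcement learning loop described above:
# the environment emits (state, reward); the agent returns an action.
# The "environment" here is a trivial number line; everything is made up.

def env_step(state, action):
    state += action
    reward = 1.0 if state == 5 else 0.0  # goal: reach position 5
    return state, reward

def agent(state):
    return 1 if state < 5 else -1  # hand-coded "policy" for the sketch

state, total_reward = 0, 0.0
for _ in range(10):
    action = agent(state)                 # model output: an action
    state, reward = env_step(state, action)  # environment output: state + reward
    total_reward += reward

print(state, total_reward)
```

A real agent would replace the hand-coded policy with a trained model that learns to maximize that accumulated reward.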

One of the more notable results for this is DeepMind solving a large number of Atari games with a single agent.

It’s mesmerizing.

An interesting thing about things like reinforcement learning systems is they also have an interesting habit of not necessarily doing what you want.

This gets into the idea that rewards are hard to specify.

For example, some reinforcement agents, as soon as they figure out, “Uh-oh. I’m gonna lose this game, I’m not gonna be able to win, and I’m not gonna be able to get my high score.”

They’ll try to find a way to crash the environment because “I still have a little reward left.

I wanna win as much as I can.

And if I die right now, I still have some reward."

That seems bad. That seems like, “Okay, you just have to fix the bugs in your environment.”

But a different question is: what if the environment and the goal you’re specifying aren’t necessarily the right reward?

I’m sorry I don’t have an image for it, but there is one particular video game where the game designers would let you basically collect barrels as you were driving this boat down a river.

And the object of the game is to drive the boat down the river and get to the next level.

But by collecting the barrels, you could basically just sit there and spin the boat for all eternity, collecting more and more reward instead of moving toward what the actual goal was.

So, you get environment fuzzing for free.

How neural networks (and other AI models) are developed

The development of a neural network and different models of training vary wildly between projects, frameworks, datasets, researchers, and phases of the moon, so I’m not gonna be able to give you a one-size-fits-all version.

The abbreviated version, though, might be that a researcher comes up with that little block of code at the top of the stack (the training loop). They get some signs of life using tools like TensorBoard to evaluate behavior: is it actually making progress?

Is it finding out how to move towards its goal?

For our researchers, they often have one- or eight-GPU pods sitting on a remote cluster, ready to go at any time, since their laptops don’t have Nvidia V100 GPUs sitting in them.


Eventually, though, they’ll get their models to a point where they either converge to a steady state or diverge (overfitting, instability, test/train split, we’re not gonna talk about any of that today).


Overall though, when a training run is going, the way you should think about this (from an infrastructure perspective) is that each training process is maintaining some amount of state.

Maybe it’s a collection of the input-output pairs, maybe it’s a batch.

But that state is on that training process and it’s actually really expensive to just recreate it all the time.

So, when it comes to managing these instances or pods, you have to think about them more like database machines or stateful servers and less like stateless application instances.
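A rough sketch of what that stateful-ish pattern implies, with an invented file name and a trivial stand-in for a training step: checkpoint periodically, and resume from the last checkpoint on restart rather than recomputing all that state.

```python
# Sketch of why training processes are "stateful-ish": periodically
# checkpoint the expensive-to-recreate state so a restart resumes
# instead of recomputing from scratch. The file name is made up.

import json, os, tempfile

CKPT = os.path.join(tempfile.gettempdir(), "toy_experiment.ckpt")

def load_checkpoint():
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)
    return {"step": 0, "weights": 0.0}  # fresh start

def save_checkpoint(state):
    with open(CKPT, "w") as f:
        json.dump(state, f)

state = load_checkpoint()  # a restart picks up where it left off
while state["step"] < 100:
    state["weights"] += 0.01          # stand-in for one training step
    state["step"] += 1
    if state["step"] % 10 == 0:       # checkpoint every 10 steps,
        save_checkpoint(state)        # not every step: I/O is costly

print(state["step"])
```

The operational consequence is exactly the database-like treatment described above: you drain and restore these processes carefully instead of killing and respawning them freely.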

So, testing for neural networks can also be a challenge, just like it is for application development in other environments.

This is a recent internal example, hopefully, names sanitized to protect the innocent.

As part of a really good refactor of a particular code base, someone added this helper function.

However, the helper function should have reversed the sign of the value it was putting out.

And what that meant was that the agent that was now training trained to minimize something rather than to maximize it.

And so it trained basically as far away from the goal as fast as possible instead of towards the goal as fast as possible.

Regression testing could have caught it, but the regression testing wasn’t run before the change went out.
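A hypothetical reconstruction of that class of bug (the actual code base and helper aren’t shown here), along with the kind of cheap regression test that would have caught it:

```python
# Hypothetical reconstruction of the sign-flip bug described above:
# a refactored helper returns a loss when the caller expected a
# reward (or vice versa), so the agent optimizes in the wrong direction.

def score_to_loss(score):
    # BUGGY refactor: forgot to flip the sign, so "minimize the loss"
    # now means "minimize the score" instead of maximizing it.
    return score          # should be: return -score

def score_to_loss_fixed(score):
    return -score

def regression_test(fn):
    # Invariant: a better score must always mean a lower loss.
    return fn(10.0) < fn(1.0)

print(regression_test(score_to_loss))        # False: the bug is caught
print(regression_test(score_to_loss_fixed))  # True
```

A one-line invariant check like this is cheap; the expensive part, as the story shows, is remembering to actually run it before the change ships.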

Sound familiar?


With all the background out of the way, let’s go over a few different examples of training projects we’ve run at OpenAI and what we learned when we scaled that infrastructure.

AI in action: DOTA 2

Let’s start with Defense of the Ancients.

DOTA 2 is a really popular online MOBA.

It has a lot of properties that make for interesting research.

There’s a very simple reward signal, win or lose, but there’s a lot of things that happen during the game that you can judge such as number of kills, amount of gold collected.

There are also many different gaming modes such as one versus one or five versus five.

So, as we wanna scale up to harder and more difficult versions of the problem, we can.

And it has partially obscured state.

So, not all of the map is visible to all of the participants in the game.

This is in contrast to games like chess or Go, where all participants have visibility into all of the game state and all of the moves that have been made.

It has a long planning horizon.

DOTA 2 games can go 45 minutes or 60 minutes or longer.

So, actions taken very early in the game need to account for a multitude of possible future outcomes.

This makes it really interesting as a platform to test how far out AI systems can plan into the future given that the size of the possible number of outcomes is far too large to try to just memorize into a single set of outputs.

This project uses a reinforcement learning loop, like we talked about with the game being the environment and the observations being about 20,000 different numbers that come off of the game at each fourth frame, and then the agent gets that observation, makes a decision, and sends it back as a series of actions.

When the DOTA project first started several years ago, Kubernetes wasn’t quite as performant at large scale, but that’s where the project started out, running and launching all the pods.

A few of the different knobs we had to tune in Kubernetes and related systems as we scaled up to 2,500 or so nodes: decreasing etcd and API server polling, since there were some defaults in some monitoring agents that caused a very bad time for about 5 to 10 minutes or longer.

This showed up as disk latency on our etcd servers.

Separating out the events store from the main etcd state store, some of which is now pretty common Kubernetes maintenance.

Discovering that there were per-host DNS query limits on one of the platform providers we worked with meant that basically we had to start applying anti-affinity rules to make sure we were never asking for too many DNS records from the same IP address.
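For reference, this is roughly the shape of such an anti-affinity rule in a Kubernetes pod spec, shown here as the Python-dict equivalent of the YAML; the label names and values are made up:

```python
# Rough shape of a Kubernetes pod anti-affinity rule that spreads
# pods of one workload across hosts (label names/values are made up).
# This is the Python-dict equivalent of the YAML pod spec fragment.

anti_affinity = {
    "podAntiAffinity": {
        "requiredDuringSchedulingIgnoredDuringExecution": [{
            "labelSelector": {
                "matchLabels": {"experiment": "dota-rollout"}
            },
            # spread across distinct hosts, capping per-host DNS volume
            "topologyKey": "kubernetes.io/hostname",
        }]
    }
}

print("topologyKey" in
      anti_affinity["podAntiAffinity"]
      ["requiredDuringSchedulingIgnoredDuringExecution"][0])
```

With this in place, the scheduler refuses to co-locate two pods carrying the same label on one host, so no single source IP makes too many DNS queries.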

If anybody here knows about flannel?


Moving on.

Rapid: OpenAI’s in-house infrastructure tool

We’re happy to see that Kubernetes has come a long way since then.

We have several production clusters now that scale much larger.

And now, I think Kubernetes’ official scaling target is 5,000 hosts.

In the end though, because Kubernetes wasn’t quite scaling the way we needed it to, OpenAI developed a framework called Rapid.

Rapid abstracts away the platform APIs and treats virtual machines as a pod-ish single unit of work spread across a large fleet, again, kind of like Kubernetes pods would be.

An experiment is prepared and launched in isolation from any other experiments.

There is no shared data store where all the experiments are writing the results back into.

They are as isolated as possible.

This works really, really well with how researchers wanna think about the system under test.

They don’t wanna say, “Oh, I can’t do my research project because your service is down because somebody else’s experiment is too busy DDoSing it.”

That researcher wants to focus on just that experiment.

In this configuration with Rapid, there are basically three piles of work, if you will.

There’s rollout workers, optimizers, and evaluation workers.

The rollout workers basically are the ones running the game engine, they’re sitting there and grabbing those observations, they’re sending those samples then to the optimizer, which in this case, is the thing that’s training up the process that’s going to submit outputs.

Those parameters, those model outputs are then shared both back to the rollout workers so that you can close that training loop as well as to evaluation workers to basically determine, “Hey, is this new version of this agent any better or any worse than previous ones?”

In the end, to push all of this through, each of these optimizers uses NCCL, which is a particular communications library, to do all-reduce.

And a lot of this is basically taking giant matrices and multiplying them together as fast as possible.

If you want to improve AI training, the fastest way is to figure out how to multiply two matrices together even faster.
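Conceptually, the all-reduce step looks like this pure-Python stand-in for what NCCL does across GPUs: every worker contributes its gradient, and every worker ends up with the combined result.

```python
# Pure-Python sketch of what an all-reduce does conceptually:
# each optimizer worker contributes its gradient vector, and every
# worker receives the elementwise sum (typically then averaged).
# NCCL does this across GPUs with highly optimized communication.

def all_reduce(per_worker_grads):
    """Each inner list is one worker's gradient vector."""
    summed = [sum(vals) for vals in zip(*per_worker_grads)]
    return [summed[:] for _ in per_worker_grads]  # every worker gets a copy

grads = [
    [1, 2],   # worker 0's gradient
    [3, 4],   # worker 1's gradient
    [5, 6],   # worker 2's gradient
]
reduced = all_reduce(grads)
print(reduced[0])  # [9, 12] on every worker
```

The real work, as noted above, is making the underlying matrix math and this communication step as fast as the hardware allows.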

Many things broke along the way.

One of the things was, in particular, we ran into some concurrency and scaling limits with Redis.

My teammate has a really great talk about that, you can check it out if I start to get boring.

I’ll also post these slides online later if you wanna get the URL.

The benefits of abstraction

Because we had that framework abstraction, another thing became very easy.

Partway through the project, we had the opportunity to migrate from one cloud provider to another to have access to different hardware and different capacity.

And so we were able to do so scaling up from 60,000 CPU cores to 128,000 CPU cores with minimal impact to the researchers.

Over time, as the experiments did scale up in size though, we started to encounter new problems.

One particular platform has maintenance events on hosts that have GPUs attached.

They say it happens once every four weeks.

They also give themselves an out by saying that it might happen more often.

It happens more often.

As I said before, training is stateful-ish.

The second that you detect that a host is unhealthy, you don’t necessarily wanna evacuate, because you’re gonna start to churn that host.

Constantly churning the experiment every time we saw a maintenance event occur basically meant that sometimes the model would make no forward progress at all. There would be so many maintenance events that it was basically constantly stuck restarting.

We solved this by having the experiment instead check the host it’s about to start on and, at least for this particular platform, guarantee there’s no scheduled maintenance event within a 60-minute window. If the host is inside that window, don’t even launch; just assume the host is preemptively not going to work out.
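The check itself is simple; here’s a sketch with a hypothetical maintenance-schedule lookup (the real platform API call isn’t shown):

```python
# Sketch of the "don't launch onto a soon-to-be-maintained host" check.
# How you learn the next maintenance time is platform-specific and
# hypothetical here; only the windowing logic is the point.

from datetime import datetime, timedelta, timezone

MAINTENANCE_WINDOW = timedelta(minutes=60)

def safe_to_launch(next_maintenance, now=None):
    """Refuse a host whose next scheduled maintenance event falls
    inside the next 60 minutes; 'no event scheduled' counts as safe."""
    if next_maintenance is None:
        return True
    now = now or datetime.now(timezone.utc)
    return next_maintenance - now > MAINTENANCE_WINDOW

now = datetime(2019, 7, 17, 12, 0, tzinfo=timezone.utc)
print(safe_to_launch(now + timedelta(minutes=30), now))  # False: skip host
print(safe_to_launch(now + timedelta(hours=3), now))     # True
print(safe_to_launch(None, now))                         # True
```

Filtering hosts up front like this trades a little launch latency for not restarting a stateful experiment mid-run.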

DOTA 2 results

After all the scaling work was done and a lot of dedication by the Dota team, this bot initially was able to defeat the world champion in 1v1 Dota.

Dota community members then learned new strategies and meta, which is basically a different way of playing the game, and eventually, different ways to play their favorite game.

As time went on and as the experiments scaled up to larger and larger numbers like you saw previously, the agents started to defeat increasingly skilled five versus five teams.

And earlier this year it became the first AI to beat the world champions at an esports game, having won two back-to-back games versus the world champion Dota team, OG.

Large-scale training with GPT-2

Moving on.

Another example of large-scale training at OpenAI is GPT-2.

GPT-2 is a transformer-based generatively pre-trained model.

It’s a little bit more complicated than just a simple supervised or unsupervised model but the gist is pretty similar.

One example of a language model is like auto-completion.

Given an input string, what’s the next most likely word?
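As a toy illustration of that contract, here’s a bigram count model over a made-up corpus; GPT-2 is vastly more sophisticated, but the input/output shape, string in, likely next word out, is similar:

```python
# Minimal sketch of "given an input string, what's the next most
# likely word": a bigram count model over a tiny invented corpus.

from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

# count which word follows which
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def next_word(word):
    followers = bigrams[word]
    return followers.most_common(1)[0][0] if followers else None

print(next_word("the"))  # "cat": the most frequent follower of "the"
```

Scale the corpus up to a large slice of the web and the model up from bigram counts to a transformer, and you get the kind of output shown below.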

I have never in my life meant to type “ducking.”

By the way, I still don’t understand this on Siri.

If you train a model, a language model on a large enough corpus, how good could it get?

Well, it turns out pretty good.

That’s a lot of text.

I’m not gonna read all of that text, but this is from one sample output from the GPT-2 language model.

The italic text is the prompt and the rest of the output is from the model.

I won’t read all of it out loud, but you’ll notice even towards the end, it’s a lot more coherent talking about noticing that the valley had what appeared to be a natural fountain surrounded by two peaks of rock and silver snow.

It’s not completely coherent though.

You’ll notice that in this discovery of talking unicorns, they were four-horned.

So, not entirely sure.

This project uses a different in-house framework created by another one of our researchers called Rcall.

Rcall: OpenAI’s in-house, lightweight abstraction framework

This framework has both Kubernetes and Google cloud back ends.

This let researchers test and debug with additional GPU capacity in our growing Kubernetes clusters, then migrate to bare VMs and other hardware as needed.

Sort of an in-between, but a lighter-weight abstraction than Rapid, which basically deals in whole VM sizes.

You can see here it’s as simple as basically saying, “Hey, this is the function I wanna run and I wanna distribute it, launch it on that cluster.”
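Rcall’s actual API isn’t reproduced here, but the shape of such a launcher, as a purely hypothetical sketch where every name is invented, is roughly:

```python
# Hypothetical sketch of a "run this function on that cluster" API.
# Rcall's real interface is not shown in this talk; every name,
# parameter, and backend below is invented for illustration.

def launch(fn, backend="local", replicas=1, **resources):
    """Run `fn` once per replica on the chosen backend."""
    if backend == "local":
        return [fn(rank) for rank in range(replicas)]
    # A real backend would serialize fn, build a pod or VM spec from
    # `resources` (e.g. gpus=8), and submit it to the cluster.
    raise NotImplementedError(backend)

def train(rank):
    return f"worker {rank} training"

print(launch(train, replicas=2, gpus=8))
```

The value of the abstraction is that the researcher writes a plain function and the framework decides where it runs: locally, on Kubernetes, or on raw VMs.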

At the end of GPT-2’s training, as we learned and saw some of the things it was capable of, a new problem emerged.


Initially, when a researcher was asked, “Hey, do we know who has access? We knew that there were internal controls, but do we know that we have accurate enough internal controls?” their answer was less than confidence-inspiring.

We needed to be able to provide some level of isolation and auditing capabilities.

Something that’s rather difficult when everyone effectively has root access where they’re running their code.

This is a snippet from the generator framework for one of the pod descriptions, and what I love here is that it says this is only used when the Nvidia driver is used.

We use the Nvidia driver everywhere because everything has a GPU.

So, over time, we’ve improved its capabilities and are now able to offer slightly more isolated training.

Again, something that’s not necessarily the first thing you would think of in a fairly open and academic environment.

Because researchers were using this framework, we were able to basically add in the additional isolation without impacting their existing research.

Large-scale training (MIDI data) with MuseNet

One last example of training infrastructure and projects that use it, let’s talk about MuseNet.

MuseNet was built on a similar foundation as GPT-2, so the same transformer architecture, sparse transformer architecture.

In this case, the input data instead of being images or text, it’s music, specifically MIDI data.

There’s lots and lots of MIDI data available out on the web.

We also happen to have a piano at the office with MIDI output, and for a little bit, we were collecting as much human input for training as we could, before we all realized that we weren’t gonna have enough data.


This time, though, not only did the researcher want to show the results, they wanted to let people try the model out themselves.

In this case, there’s a web user interface that lets you pick…say I want a piece of music in the style of Chopin and why don’t you start it off with Lady Gaga’s “Poker Face.”

I wholeheartedly recommend you go listen to these music samples.

They are about two minutes long so I didn’t wanna just stop everything here and listen to them, but go on the web and you’ll still find it because it’s still up.

Because that researcher had developed on a framework that let them use Kubernetes without having to worry about all of its abstractions, they were able to reuse a lot of their existing model training code, cutting it down and putting it into a public-facing web app on Google Kubernetes Engine.

I think that served around 40,000 samples, 40,000 pieces of music the first day that it was launched.

Lessons learned

One ongoing challenge, though, with these frameworks and with researchers using lots of different infrastructure and tools (Datadog, Kubernetes, “I have to log into this control panel,” “what about that piece of data?”) has been: how quickly can we answer the question, “What happened?”

Obviously, if the job fails and the researcher sees that it failed in their job’s output logs, they know what’s going on, but what about all the other cases?

Maintenance events or other things need to be understandable by researchers.

We log many different things from many different aspects of our infrastructure into, as it turns out, Datadog logs.

And recently one of the infrastructure team managers wrote up a tool that lets researchers with a single command say, “Hey, this is my experiment name, tell me what happened.

Why did this pod die?"

And in this particular case, it looks like the host it was running on got hit by a maintenance event and was then removed and evicted.
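A hypothetical sketch of what such a “tell me what happened” helper boils down to (the log fields and values here are invented): filter structured infrastructure logs down to one experiment’s lifecycle events.

```python
# Hypothetical sketch of an "experiment name in, lifecycle events out"
# helper over structured infrastructure logs. In practice the records
# would come from a log query, not an in-memory list, and every field
# and value here is invented for illustration.

logs = [
    {"experiment": "dota-v5", "pod": "rollout-17", "event": "started"},
    {"experiment": "gpt2-lg", "pod": "opt-3", "event": "started"},
    {"experiment": "dota-v5", "pod": "rollout-17",
     "event": "evicted", "reason": "host maintenance"},
]

def what_happened(experiment):
    """One-command answer to 'why did my pod die?'"""
    return [
        f"{rec['pod']}: {rec['event']}"
        + (f" ({rec['reason']})" if "reason" in rec else "")
        for rec in logs
        if rec["experiment"] == experiment
    ]

for line in what_happened("dota-v5"):
    print(line)
```

The point isn’t the filtering, which is trivial; it’s that researchers get one command instead of five dashboards.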

Three different research projects, all using different amounts of compute.

This is another graph from a recent publication OpenAI did exploring where the boundaries of efficient use of computing are.

A lot of the dimensions are gonna differ for every organization: budget, deadlines, the capacity of hardware as Nvidia, Graphcore, and other manufacturers launch newer hardware, but the answer is always the same.

Everybody is trying to find that sweet spot, that overlap, where they can more efficiently turn compute into answers.

And this plot made last year talks a little bit more about that.

This is the amount of compute needed for each of these at-the-time state-of-the-art research results.

You’ll note, by the way, the scale is logarithmic on this graph.

I’ll let you form your own opinions on whether that’s good or bad or what the implications are for the budgets that some folks might need in the future.

To wrap up, this is what the infrastructure team at OpenAI is gonna focus on next.

Making it easier for researchers to scale up model training and share common experiment code.

One of the things I forgot to mention was that the researcher working on MuseNet was able to reuse some of the previous code for GPT-2 because they were on a framework and able to get things that were copyable enough or close enough.

Focusing on Kubernetes as a primary platform: we’ve been picking and choosing Azure, Google Cloud, different platforms.

At this point, Kubernetes has kind of proven it’s able to scale a little bit further so we’re gonna see how far we can take that.

Specifically, Kubernetes says that they can handle 5,000 nodes in a single cluster, we’re gonna test that out.

Is there any of you here from Google’s Kubernetes team?

In conclusion…

To recap some of the patterns we touched on today: find common infrastructure abstractions that researchers and engineers can agree on.

Specifically, just get something where everybody’s mental model is, “Okay, it’s a pod.” It’s just a single container of work, whether it’s a VM or something else.

Decouple experiment dependencies where possible.

And this feels painful because, wait, why am I running 18 different copies of Redis?

Because I’m running 18 different experiments.

That might feel painful, but it actually works out far, far better for the researchers.

Standards and practices used in other infrastructure areas don’t necessarily apply in AI training.

Again, the cost of GPUs kind of inverts some of the logic you might be used to: “Oh no, the fleet. Isn’t this member unhealthy? I wanna churn it out as fast as possible.”

It’s like, “Wait, no.

We actually wanna hold off and let that model get to the next checkpoint."