Transforming Hulu's Application Platform | Datadog

Transforming Hulu's Application Platform

Published: July 17, 2019

David: Hello, everybody. Can you guys hear me alright in the back?

Excellent. Cool.

Back in 2012…

So to start things off, 2012 was a great year. 2012 was the year of this video, the one where YouTube expanded their 32-bit hit counter.

Also in 2012, we got a first glimpse of that chin, the Thanos chin. You guys might remember a couple of references in media that take you back to 2012.

For me, 2012 was the year that I joined Hulu.

At Hulu, we had a Ruby on Rails site, and we liked to say that we were in the top five Rails sites on the whole internet. We were also working with new JavaScript frameworks, so this was Backbone.js, an imperative method of manipulating the DOM, and we were rewriting our whole site in Backbone.js.

So those are a few of the things that were happening in 2012.

A few early factors in engineering at Hulu, that I’d like to cover.

Our tagline was “Hulu is a place where builders can build.” Oftentimes that showed up in build-versus-buy decisions.

We built a lot of things. We racked servers into data centers, and we built applications from the servers up.

And in fact, the developers were often handed direct access to the servers. So an operating system would be provisioned and the developer would be handed an operating system image on a server or bare metal in the data center or a virtual machine.

Python at Hulu

Another key factor in Hulu that I liked was when I arrived, there was a lot of Python. I love Python and I had never seen so much Python in my life.

So you might be wondering here, I said, top five Rails sites on the internet and now I just said I’d never seen as much Python in my life.

What did I mean?

At that time, Rails was often becoming a front-end for a set of back-ends. So essentially a service-oriented architecture; you’d say microservices today, we just knew it as service-oriented architecture.

So a lot of times Rails would start and it would call into various different back-ends or you would load some JavaScript, and the JavaScript would pull directly to the back-end.

So this was a service-oriented architecture. We had hundreds of services running in that back-end, so the first service that got hit may fan out to additional services.

Downtime and microservices

With multiple services, you know, this is the classic microservice problem where the services can go down, there’s more chances for a service to go down and that can be distracting for developers and that could happen for many reasons.

We had a lot of scale-up at the time, so it could be more traffic coming in, just an organic failure, a deployment failure, or even server or rack failures that would take down a whole set of machines.

So if the machines were all provisioned to the same rack, boop. The other thing with a service-oriented architecture: a lot of features were developed as new feature, new service.

And we also saw a lot of freedom and independence for developers. They would often own a single service; the ratio was one developer to one service, or one developer to multiple services.

So that put a lot of onus on the developers to create a new service whenever they needed to spawn a new feature, and the key factor in the delay for feature creation was actually the creation of a new service, which could also include the time to provision that new service.

So I joined Hulu and I joined a team. We were starting to form a DevOps team, which you guys are familiar with. And this was our mission statement as a team.

So we were looking to provide developers with the tools to efficiently scale their applications and their teams, so we looked to speed up both team creation and service creation.

Hulu service architecture, circa 2012

So I’m gonna run you through what a service looks like, or looked like at the time.

This is your typical service here, you might notice I don’t have databases, I don’t have queues, I don’t have any of that.

My focus on this presentation will be just the application layer and how the application layer is deployed, and how it’s been transformed over time at Hulu.

So this is your typical load balancer going to servers, there’s an HTTP server, your NGINX, Apache, whatever, you have your application code running, and then the logs.

The logs, they go on disk. That’s where they write out by default and the application code should send some metric data out.

We can have a metric store, keep track of timeseries data.

You might notice on the right, I have the laptop, that’s the developers deploying application code or going to the server and pulling down logs.

So probably this might even be familiar for a basic setup, maybe you built a service this way. It’s a simple setup and it can get things running and it got a lot of things running for Hulu.

So this was often a very typical architecture at Hulu, so we wanted to try to speed things up. And so we talked about what the common problems were in getting these services deployed.

The basic problems were setting up and configuring the HTTP server, getting it hooked into the application code; load balancing: how do you configure the load balancing for the server; and the logging, the logging all on disk needs to be rotated, backed up or archived, whatever, you got to do something with it.

Donki: a solution

And we noticed that this was a very common problem that every developer or every team had to solve on their own.

There are a lot of different tools for that, configuration management, all these aspects. We wanted to try a different solve and change the abstraction layer.

So I present to you Donki, I’m going to talk about that particular solve that we did way back in 2012.

I got a little pixelated.

So I’m gonna talk about from a developer perspective what this looked like. So from a developer perspective, you went in, you type the name of your application, you picked the cluster you wanted, and at that point, you could create and go.

That seems like there’s a lot of missing pieces, right?

So in terms of the missing pieces, we looked at what the defaults were across all of the hundreds of applications and we found that there was a really common set of things that people used. Even with a diversity of a lot of different developers, people tended to do the same thing.

So we started going through and seeing how we could crosscut those applications and find the common things.

So we made some requirements, there were some variations and we made some requirements. The first requirement was that the application had to have a health check and we specified what URL that health check had to run on.

It had to have a WSGI module, and this was for Python. So that’s the entry point for Python; it would be Rack for Ruby, or you might have just your JAR for Java, or the servlet.

And then the third one was logs to standard out. You might know 12-factor apps, so this is kind of a reference to that; we picked a few things, not all.

The things that were in common.
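To make those three requirements concrete, here is a minimal sketch of a Python application that would satisfy them. The health check path `/healthcheck`, the logger name, and the response bodies are assumptions for illustration; the talk does not say which URL or names Donki actually required.

```python
import json
import logging
import sys

# Requirement 3: write logs to standard out and let the platform collect them.
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
log = logging.getLogger("example-app")  # hypothetical logger name

HEALTH_PATH = "/healthcheck"  # hypothetical; the real required URL is not given


def application(environ, start_response):
    """Requirement 2: a WSGI entry point that an application server can import."""
    path = environ.get("PATH_INFO", "/")
    if path == HEALTH_PATH:
        # Requirement 1: a health check served at a known URL.
        content_type = "application/json"
        body = json.dumps({"status": "ok"}).encode("utf-8")
    else:
        log.info("handling request for %s", path)
        content_type = "text/plain"
        body = b"hello\n"
    start_response(
        "200 OK",
        [("Content-Type", content_type), ("Content-Length", str(len(body)))],
    )
    return [body]
```

To try it locally, the standard library's `wsgiref.simple_server.make_server("", 8000, application)` can serve this callable without any extra dependencies.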

For the deployment path, well, a lot of deployments were either using Fab, or Capistrano, or different deployment systems. These all featured one line deployments.

So we wanted to provide a one-line deployment and we wanted to hook up the code repository. What we wanted to do was change the abstraction so that it was just based off of the code repository, and that would do the deployment. So we did a Git push abstraction.

This is not anything new in terms of this might look very similar to what you’ve seen in Heroku, or you might think of Google App Engine, and you’ve seen similar things, all these things existed at the time, so we picked this particular interface.
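One common way to build a Git push interface like this is a server-side `post-receive` hook, which Git runs after a push and feeds one `<old> <new> <ref>` line per updated ref on standard input. The sketch below only shows the parsing and the "should this push deploy?" decision. Whether Donki used a hook at all, and the deploy branch name, are assumptions; the talk does not describe the mechanism.

```python
from typing import Iterable, List, NamedTuple

DEPLOY_REF = "refs/heads/master"  # hypothetical deploy branch
ZERO_SHA = "0" * 40  # Git reports an all-zero SHA-1 when a ref is deleted


class Push(NamedTuple):
    old_sha: str
    new_sha: str
    ref: str


def parse_pushes(lines: Iterable[str]) -> List[Push]:
    """Parse post-receive input: one '<old> <new> <ref>' line per updated ref."""
    pushes = []
    for line in lines:
        parts = line.split()
        if len(parts) == 3:
            pushes.append(Push(*parts))
    return pushes


def revisions_to_deploy(pushes: Iterable[Push]) -> List[str]:
    """Return the new commit of each push to the deploy branch, skipping deletions."""
    return [
        push.new_sha
        for push in pushes
        if push.ref == DEPLOY_REF and push.new_sha != ZERO_SHA
    ]
```

In an actual hook script, the input would come from `sys.stdin`, as in `revisions_to_deploy(parse_pushes(sys.stdin))`, and each returned commit would be handed to whatever builds and deploys the application.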

Once those steps were run, so if an application was created and it fulfilled those health check, entry point, and log standards, all this infrastructure would get set up for the application.

So we would go through, we’d set up a load balancer, NGINX, uWSGI. Today, it’s a very popular Python application server.

The worker would run. Hopefully, you guys can see the diagram on the slides on the side. And then it would run the specific application.

Logs were a big problem.

Most of the architecture is simply a control architecture. The logs kind of were in the flow of an application, right? When an application request occurs, the application will write logs. So that’s actually part of the flow of production. So logs were probably the biggest challenge inside of this architecture.

In terms of this, we were lucky, we had some good people at the time who could go in and do a selection of the various logging tools. Like I mentioned, we were very much a build shop, so we looked primarily to open-source to solve this problem.

And what we’re looking to find is something that would scale out for our immediate needs and for a little bit longer-term needs.

So we looked for reliability first, so a store-and-forward architecture. We figured that if anybody lost logs, that would probably be a big problem.

If you don’t have your logs at any point in time, it’s very hard to get towards the bottom of an issue.
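The store-and-forward idea can be sketched in a few lines: keep every log line locally until the downstream aggregator confirms it, and retry later if the aggregator is unavailable. This is an illustration of the general pattern only, not of the tool Hulu ran. The class name, the `send` callback, and the in-memory buffer are assumptions; a production forwarder would spool to disk so that lines survive a restart.

```python
from collections import deque
from typing import Callable, Deque


class StoreAndForward:
    """Buffer log lines locally and forward them once the downstream accepts them."""

    def __init__(self, send: Callable[[str], bool], max_buffered: int = 10000):
        self._send = send  # hypothetical callback; returns True when a line is accepted
        self._buffer: Deque[str] = deque()
        self._max_buffered = max_buffered

    def write(self, line: str) -> None:
        if len(self._buffer) >= self._max_buffered:
            # Fail loudly rather than drop log lines silently.
            raise OverflowError("local log buffer is full")
        self._buffer.append(line)
        self.flush()

    def flush(self) -> int:
        """Forward buffered lines in order; stop at the first failure. Returns lines sent."""
        sent = 0
        while self._buffer:
            if not self._send(self._buffer[0]):
                break  # downstream unavailable: keep the line and retry on a later flush
            self._buffer.popleft()
            sent += 1
        return sent

    @property
    def pending(self) -> int:
        return len(self._buffer)
```

The important property is that a line is removed from the local buffer only after the downstream has accepted it, so an outage of the aggregator delays logs instead of losing them.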

So the second challenge was…well, the third problem was a tradeoff. In order to take that stability, would we be able to show the logs and be able to search and do some nice analytics?

We chose to shift the logs off and get them into a central aggregation. If the logs were on individual servers, just aggregating them together was a win.

So we worked to get incremental wins.

So that’s the basic design.

For this talk, I’m talking through the journey here, so there’s a lot of details that can go in there and happy to explain it, but there’s also a lot of other talks that go into detail here.

Evaluating the success of this solution

I’d like to talk about how it went for us.

Did this solution work? What was the adoption like? All those challenges.

So to start off, we dogfooded it: we ran our own infrastructure applications on it, our DNS portal management, our VM provisioning, all these little things that we used on our team.

And then the question is: we’re trying to provide things for developers, how successful was this in adoption? Were people interested in such a solution? Was the abstraction acceptable?

So we only got some new services, actually. So only a few people came on, new services, small little services, internal portals, small stuff.

So we went out and we started saddling up applications.

So we went in, and the team used the existing knowledge of some people who had been around from even earlier days at Hulu. They understood the applications and they knew what it would take to move them from one abstraction to another.

So they went in, pushed another branch and/or deployed it directly on the system.

So you have the systems running in parallel, and then we’d ask the developers: would you be okay with us managing the servers, or do you want to keep your servers?

Some people said no, some people said yes.

But this increased the adoption significantly. So rather than asking for any work in getting to a new platform, we went through the application code and started hosting it.

That was one way. We also gathered all the different requirements for the things this system wouldn’t do.

The second process was an alternative instance. So during firefighting or an incident, a service might go down.

There’s a classical question, is this an application issue? Is this a network issue? Is this a problem with application host?

It’s a hard question to answer.

One thing we would do is, after the incident was resolved, we would stand up the application and see if we could reproduce the issue inside of Donki, to see if the application tuning (worker processes, buffering, file and connection handles, all those things) would make a difference if our settings were good.

Or maybe it was just a network issue or a database issue.

Another alternative is, if we had already saddled up the application, we could flip to it as an alternative instance of the application and show that it was running.

The other instance was over and we could ask, would you like to cut more traffic or would you like to actually cut traffic over?

It’s risky, but on the other hand, the other instance’s down right now.

So those were the two main methods reaching out to developers or during an incident looking at seeing if we could resolve problems.

Clustering before the advent of containers

So I’d like to talk about clustering.

2012 was the days before Docker, the days before Docker orchestration.

Clustering was an interesting problem.

So the first thing we did in terms of clustering is we didn’t specify what servers people get deployed to, we just had the tuning knobs be how many processes do you want or how many instances and how many threads do you want per process?

So those were good abstractions here and then we would figure out how to deploy, what servers to deploy to, all of these aspects.

So we were deploying lots of applications directly onto servers together. So we were potentially deploying, you know, as we get up to volume 10, 50, 100, different applications directly on the server.

So how would we do this without…how would we handle the isolation?

The first challenge was file system isolation. Well, first, we just told developers actually, we had a trusted system inside of a trusted environment.

You can ask people to do things and see if they would be happy to do them for you. So we asked them not to write to disk, not to use a local disk cache, do these aspects.

Just stating that this wasn’t handled meant that a lot of people didn’t do high-volume writes. There were obviously writes to disk, but they didn’t necessarily exhaust the IOPS on the server.

Application memory management

The other aspect was file directories and file permissions.

Just Linux, file directories, sub-directories, and permissions to keep things bounded.

In terms of application memory, how would we make sure that no application had this challenge where it would go through and run all of the memory through?

Well, most of the application servers (and definitely the ones for Python) would support two parameters: a graceful parameter, where if you exceeded the limit you would be gracefully reloaded after you finished your request, and a forceful parameter, where if the limit is exceeded, kill -9 the process.

So that worked primitively, but effectively for application memory management.

The interesting thing is that this did not reserve memory on the server, so we had to ensure that there was enough space (headroom) in case multiple applications grew to their limit.
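The two-limit policy and the headroom question can be written down as a small sketch. The function names, parameter names, and megabyte units are assumptions for illustration; real application servers expose this behavior through their own configuration options rather than through code like this.

```python
from enum import Enum
from typing import Iterable


class Action(Enum):
    NONE = "none"
    GRACEFUL_RELOAD = "graceful_reload"  # reload after the current request finishes
    FORCE_KILL = "force_kill"  # SIGKILL, the equivalent of kill -9


def memory_action(rss_mb: int, soft_limit_mb: int, hard_limit_mb: int) -> Action:
    """Decide what to do with a worker given its resident memory and the two limits."""
    if rss_mb > hard_limit_mb:
        return Action.FORCE_KILL
    if rss_mb > soft_limit_mb:
        return Action.GRACEFUL_RELOAD
    return Action.NONE


def worst_case_headroom_mb(server_mb: int, hard_limits_mb: Iterable[int]) -> int:
    """Memory left on a server if every worker grew to its hard limit at once.

    Limits are enforced per worker but nothing is reserved, so a negative
    result means the server is overcommitted in the worst case.
    """
    return server_mb - sum(hard_limits_mb)
```

With made-up numbers, `worst_case_headroom_mb(16000, [400] * 30)` leaves 4000 MB, while fifty such workers on the same server would be 4000 MB over in the worst case. That gap is the risk being traded for density.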

CPU, we just let everybody have the CPU that they had with one quirk: Python, of course, has a GIL, a Global Interpreter Lock, which means that any single process of Python can consume up to one CPU.

So if all processes were fully spinning at 100%, we could have a problem. But it would take a crazy aspect for all processes on the server to go through to 100%.

So it’s playing with risks here.

What about system packages?

System packages, this is the next problem.

Everybody might need unique system packages. So in terms of system packages, we just installed the list of common system packages and moved on.

There seems to be a set of core system packages that are required to compile almost everything inside of Python or most of your common use cases.

There are always exceptions. In that case, often a custom package is required and needs to be built.

The strengths and weaknesses of Linux

So these things worked for us. The reason why they worked, I believe, is that Linux is essentially a multi-user, multi-process system.

It was designed from the beginning to host multiple people, multiple processes together in a friendly environment.

Well, depends on your perspective of the new Linux.

So that worked for us for Python.

We took this example and we took it for Python. We took it for Ruby, since uWSGI also supported Ruby, so we did it there, and then we added Java.

So this span takes us from 2012 to 2015. All with the same abstraction, git pushes to do the deployments, simple application setup, and then different aspects.

For Java, we had to modify it a bit. We used Upstart and we used cgroups, because in Java a single process can take more than one CPU and use a lot of memory. So we used cgroups, the same underlying primitive.

So we had a few challenges with this setup. The problem was the production environment got more complex, it was hard to reproduce the production environment locally.

We could run through configuration management on a VM and spawn up an environment, but that was quite a complex environment for people to debug. It wasn’t quite as debug friendly.

Language and asset precompilation: the Git push abstraction kind of broke when the code that’s pushed is not the code that runs.

So if there was asset precompilation, we had to do that separately. We added a Jenkins job, moved that asset around post-deploy, and did it on the back-end.

And then artifact promotion. Artifact promotion was a problem we couldn’t even solve with this model: the artifact that ran was hard to promote from a staging environment to a production environment.

There’s definitely downsides.

The switch to Docker

That brought us to this timeline here, where we were working on Donki on Docker. So we took Docker, Docker was running in the environment, and we kept the same abstraction which had different upsides and downsides.

The upside was that all the applications easily ported over, the downside was that we weren’t really exposing that new primitive.

So you could always use Docker images somewhere else in production on other servers, but it wasn’t really supported on this platform. Instead, what the platform would do would be to build a Docker image, pick some set of machines, start the containers on the machines.

So this was before we were using a Docker orchestrator.

Some of the Docker orchestration systems were already out at that time, but we were playing a bit of catch up here where we were starting to use the Docker images.

The benefits of using Docker were we were able to solve some of the problems that I mentioned. So we were able to add lots of languages, that was easy. We just added a new config.

We could spawn out a new type of language support using a similar abstraction. The images could be downloaded, you know, maybe they wouldn’t be built locally, but they could be downloaded so you could run what was in production.

And it was way safer for us to potentially roll changes. We can now roll changes incrementally when the application deployed.

Orchestration before orchestrators

Then we get into the problem here though: what’s our next problem? Our next problem is obviously orchestration.

So at the time, what we were doing was we would schedule the applications directly at their deployment time.

We would inspect the fleet, look at the different parameters available in the cluster, and just deploy optimistically in terms of: is this machine really heavily loaded?

If it’s over a certain bar, let’s move it over to a different server. Let’s offload it off the server. So it’s very simple orchestration.
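That deploy-time placement can be sketched as a filter plus a sort: drop any machine over the bar, then take the least-loaded ones. The 0.8 threshold, the single load number per host, and the one-instance-per-host rule are assumptions made to keep the example small; the talk does not say which parameters were actually inspected.

```python
from typing import Dict, List

LOAD_THRESHOLD = 0.8  # hypothetical "bar" above which a machine is skipped


def pick_hosts(
    load_by_host: Dict[str, float],
    instances: int,
    threshold: float = LOAD_THRESHOLD,
) -> List[str]:
    """Choose the least-loaded hosts under the threshold, one instance per host."""
    eligible = sorted(
        (load, host) for host, load in load_by_host.items() if load < threshold
    )
    if len(eligible) < instances:
        raise RuntimeError("not enough hosts under the load threshold")
    return [host for _, host in eligible[:instances]]
```

Because the decision is made only once, at deploy time, a placement that looked fine can go stale as neighbors grow, which is why the amount of headroom to leave matters so much in this model.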

It actually worked for us in almost all cases, the interesting problem is how much headroom do you leave?

So we were getting huge gains in efficiency, because the containers themselves, or the processes, weren’t reserving memory. Instead, they were getting allocated what they actually used, so we could stack many things.

But on the other hand, it didn’t offer that safety. So there was a challenge there in terms of how much do you over provision?

So in terms of other challenges, the other thing we faced was how would we drain the nodes?

For instance, if something goes wrong with a node, or we want to upgrade a node or make a change, it would be a long process if we only rescheduled things at deploy time. We would either have to move things off the server individually, application by application, ourselves, or we would need to wait.

Just mark the node as unavailable and let it be rescheduled gradually.

Why use Mesos Aurora?

This takes us to our next phase, so around 2016. So in 2016, we saw Docker orchestration, we went into Mesos.

This made our system actually even simpler.

First question you may ask: why not Kubernetes? I figured this would be a common question. It all boiled down to timing.

So we often don’t get to pick exactly when we do a system upgrade and system change, so at this particular time, Kubernetes 1.0 had just been released.

The published maximum cluster size was less than what we were running, and we didn’t feel comfortable taking the risk.

We saw tons of development activity on it, and it was a really agonizing choice. We also saw other teams interested in Mesos and some of our big data teams were interested in scaling out Mesos even larger.

So we had to put all these equations into bits and make a decision. So we chose to use Mesos, and we also chose to use Aurora for the scheduler.

So then our process was now a little bit simplified, we would build an image, we would generate a configuration, and then we would let Aurora handle the deployment. So this is a little bit simpler.

Now, we’re not doing the scheduling, now we’re not doing some of these aspects, the isolation is now in Docker, so slowly things evolve. So with each evolution, the challenges of the first one that you fix present new challenges in terms of load balancer integration, for instance.

Now the load balancer is rescheduling processes whenever there’s a downtime or a health check failure, so entire fleets of applications could start to get rescheduled.

Interestingly, load balancer management APIs are challenging. Sometimes they support great volume; a lot of the time, they have lower throughput and rate limiting on your load balancer management commands.

That’s true in AWS, that’s true in data center. So we had the issue, in fact, in both environments. So we were now running at this time, both in AWS and in data center.

Donki on ECS

Which takes us to our next transformation.

So quickly following that up, we looked at ECS. We saw a lot of commonality between Mesos Aurora and ECS in terms of their mechanisms, and I think it was one fairly sized pull request that just added support for multiple orchestration systems.

Once you support one orchestration system, if there’s a similar modeled orchestration system, the transition is easier.

So at this point, we were building the Docker image, we were generating ECS config, and handing off to ECS. Very similar to what we were doing in Aurora, except now, the load balancer commands were abstracted from us, and autoscaling was a lot easier, and the cluster spin up, very importantly, was much easier.
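The reason a second orchestrator can arrive in one pull request is that the platform already has a single description of each application and only the last step, rendering a backend-specific config, differs. A minimal sketch of that shape follows. The class names and every config key are hypothetical and deliberately simplified; they are not the real Aurora or ECS schemas.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class AppSpec:
    """What the platform knows about an application, independent of the backend."""
    name: str
    image: str
    instances: int
    memory_mb: int


class Orchestrator(ABC):
    @abstractmethod
    def render_config(self, app: AppSpec) -> Dict[str, object]:
        """Turn the platform's app description into backend-specific config."""


class AuroraBackend(Orchestrator):
    def render_config(self, app: AppSpec) -> Dict[str, object]:
        # Illustrative keys only, not the real Aurora job schema.
        return {"backend": "aurora", "job": app.name, "image": app.image,
                "instances": app.instances, "ram_mb": app.memory_mb}


class EcsBackend(Orchestrator):
    def render_config(self, app: AppSpec) -> Dict[str, object]:
        # Illustrative keys only, not the real ECS task definition schema.
        return {"backend": "ecs", "service": app.name, "image": app.image,
                "desired_count": app.instances, "memory_mb": app.memory_mb}


BACKENDS: Dict[str, Orchestrator] = {"aurora": AuroraBackend(), "ecs": EcsBackend()}


def deploy_config(app: AppSpec, backend_name: str) -> Dict[str, object]:
    """Build step is shared; only the rendered config depends on the backend."""
    return BACKENDS[backend_name].render_config(app)
```

Adding another backend in this sketch means adding one class and one entry in `BACKENDS`, while the developer-facing interface stays the same.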

So for autoscaling, we were able to add knobs for autoscaling our own clusters, and then add the ability for applications to autoscale within the clusters as well.

Also, a big advantage was that we could now go in and do the setup required for a new cluster, so if some team wanted a specific cluster with a specific, different instance class, it was much easier.

We could handle that and spin it up through the addition of config.

Implementing a new log solution

So in terms of another problem, I mentioned way back there that we were using Facebook Scribe, so we were using raw logs only.

So the logs would store and forward, and all of this worked, but it was starting to grow into the limits of the system.

Amazing scale was available through Scribe, but it was also retired. You can find Scribe now in Facebook’s open-source archive.

So we tried a few different solutions for logs. We looked at ELK, that was very challenging for us to scale to the same order of magnitude.

We looked at Fluentd/S3/Athena, but that didn’t quite provide the visibility, searchability, analytics, some aspects that we were looking for inside of our logs.

So this is where we transitioned. We looked at various different solutions out in the industry, and we ran trials with different vendors, actually routing the whole volume of the logs over to them inside of our trials, and we selected Datadog for our logs.

Evaluating the effectiveness of Donki today

So this took you through various different stages of transformation.

So how an application engine can run all of these different applications and move it from one system to another assuming an even abstraction. So this abstraction was maintained all the way through.

So the question here that I’d like to talk about is, how did that work? What was the adoption like further?

So this particular slide shows you year on the bottom, so running from 2015, where we were getting to over the thousand server mark.

So that initial adoption took us towards applications, around 1,000 applications and this took us all the way through to around almost 14,000 different applications created over the lifetime of the system.

So that’s a lot of people clicking through.

This comes with application ID to time that it was created, so this is just showing you application ID over time, which proxies to the number of applications created by developers.

So 14,000 applications created in some way, some failed, some passed, a lot passed. So one way to measure is just pure how many applications were created by the platform?

The second question would be: How many requests? So as requests come into Hulu, what requests get handled by this platform versus another platform?

This is the data for yesterday at approximately 7:00 p.m. So I pulled this off the dashboard that we have, and so that’s 97% of the requests at around 1.5 million requests per second.

So that’s being handled by a number of different clusters and a number of different applications are running on this.

So if you were at the keynote, Karthik mentioned, low key service, and that low key service is here, this little pie chart. So it’s just one piece of the pie.

The other interesting thing is, I mentioned I love Python, things transform at Hulu. And the question was, you know, as we scale bigger, what kind of language would work better?

And really, what do people like? What do people want to choose?

We give a lot of freedom.

So actually, the traffic has changed: a platform that was originally designed first for Python and Ruby is now handling 75% of its traffic in Java.

A lot of it is through NGINX as well.

But request rate is a hard thing to measure with. Does it really mean an application is doing more work if it gets more requests than another?

That’s hard to say.

So another thing we could look at is the number of running applications.

As you can see, 14,000 applications were created. Some of them can be dev, some of them can be staging, and originally those were mostly in our first data center. That data center doesn’t exist anymore, so a lot of applications migrated.

So today (this slide is probably about a week old), about 3,000 applications are running, and this is the distribution by application count.

So you can see that the balance, there’s still a lot of things that are running in Python, and they’ve been running for probably a long time.

I think the first application that’s still running had an ID of 80, so created in a data center that stayed and it has an ID of 80.

For that we run a lot of different containers.

You know, these containers can be supported by ECS today, they could be supported by Mesos Aurora, or they could be running a version for security lockdown that is just straight Docker.

So as you migrate things, it’s hard to completely retire the old platform, so we slowly retire these platforms and have moved these phases that we develop on.

So 83,000 containers, this number is based off of the application settings, so I didn’t detect the raw container size.

The other one is clusters. So originally we’d have, you know, one to two clusters per data center.

Now we have many more clusters with ECS, though still just 33 clusters. It’s still a pretty small size for around 3,000 applications.

So I’d like to cover some additional metrics.

I guess, I like pulling metrics on these systems or looking up dashboards. This is the dashboard created by the team here.

This shows our mean production lead time. This is trying to measure the time from when a developer commits to when it reaches production: we look at the history of what was deployed to production, and then see what the authoring time of each commit was.

So this is one way we try to measure what’s the velocity. So we allow any team to go through and drop down and select. I think it’s a great way of trying to measure things and we would continue using this.
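As a sketch of that metric: for every commit that reached production, take the deployment time minus the commit's authoring time, then average. The function name and the input shape (pairs of authored and deployed timestamps) are assumptions; how the team's dashboard actually gathers these pairs is not described in the talk.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def mean_lead_time(deploys: Iterable[Tuple[datetime, datetime]]) -> timedelta:
    """Mean of (deployed_at - authored_at) over commits that reached production."""
    deltas = [deployed_at - authored_at for authored_at, deployed_at in deploys]
    if not deltas:
        raise ValueError("no deployments to measure")
    return sum(deltas, timedelta()) / len(deltas)
```

For example, one commit deployed two hours after it was authored and another deployed four hours after gives a mean lead time of three hours.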

Another measure is how many deployments?

So this is aggregated over seven days, so about 800 production deployments going around every seven days. And then the production deployment times, so this is another challenge, as Donki has some slow deploys.

The 95th percentile deployment time is a full 30 minutes. The mean might be five minutes, and that’s still really slow.
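A mean of about five minutes and a 95th percentile of 30 minutes can both be true when most deploys are quick and a few are very slow. The sketch below uses the nearest-rank definition of a percentile; the sample durations are made up purely to show the effect and are not Hulu's data.

```python
import math
from statistics import mean
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical deploy durations in minutes: 18 quick deploys and 2 slow ones.
deploy_minutes = [3] * 18 + [30] * 2

mean_minutes = mean(deploy_minutes)
p95_minutes = percentile(deploy_minutes, 95)
```

With this made-up sample, the mean lands under six minutes while the 95th percentile is 30, which is why the tail is worth tracking separately from the average.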

So an even better measure is, how do developers feel about this system?

Like, do you want to use git deploys and a repository, or would you like more flexibility in your system, put whatever you do into Docker, and have the full flexibility for your infrastructure?

The answer is, you know, in terms of how happy are you with this… not so happy.

So trying to get a gauge and a heartbeat for developers is one thing, also, you can try to take surveys and find out, “Ooh, maybe it’s time for a refresh.”

These are the top three challenges that we found developers face, trying to aggregate them all together. First, the custom interface that we have: you walk in and it’s some custom tool, when you want to use the open-source tools, the industry tools. Second, artifact promotion.

We never really figured it out, because it’s so hard to support artifact promotion when using a Git push flow.

The other aspect is that continuous deployment via Jenkins is hard.

Continuous deployment, continuous integration, great.

So this takes us to where we are today.

So today, we’ve been working towards Spinnaker and Kubernetes, so a general flow where the application and the infrastructure config lives in repositories, separate the pipelines so that the infrastructure config is in one repo, and the application configuration in another.

So you’d have your Dockerfile in one and you have YAML everywhere, inside of your config.

So the question would be, how do you transition from one to the next?

And all of the previous transitions, we’ve talked about a change in the interface or an underlying change and the clusters would still run and new applications would be transitioned or things wouldn’t be deployed.

In this case, we’re actually changing the interface. So in order to handle this, Donki has one last purpose and this one last purpose is to create pull requests.

And the pull requests run into the application repo and the config repo.

So Donki knows the way the systems are hosted now, and we can provide a transparent port-over for the applications as they are, if you want to use what is there and make an easy transition. Or, you know, there’s always the option to start from scratch: start with your own Dockerfile, start with your own config.

So if you might remember, we started with the intro for Thanos’ chin, and you remember what Thanos did at the end.

Well, close to the end, he rested.

So, for Donki, it’s the same thing. Donki has done a lot of work for us, and after a lot of work, sometimes it’s time to rest.

So I’d like to thank you for your time.