
Application First, Not Infrastructure First

Published: April 16, 2019

So what I wanna talk about today is, I’m gonna take a step back from the last couple of conversations, and I wanna talk about something new that we’ve been playing around with internally, that we’re starting to talk to people about, and open up a conversation about.

The evolution of infrastructure primitives

Something called application first, and on my title slide, it said application first and not infrastructure first.

I’m gonna explain what I’m talking about.

So a long time ago (so in tech years, it’s probably, like, six), in a galaxy that was exactly this one, right…

…we cared about thing, and I’ll explain what thing is, but we cared about it at the server level.

So we cared about things like resources, we cared about the network, we cared about whether it had power or not.

If you worked in a data center, we cared about how it could talk to other things, we cared about what was running on it, but ultimately, the level at which we cared about things (like monitoring, and logging, and observability), we cared about it at the server level.

Networking, I use emojis for everything.

At some point, the copyright police are gonna come after me on many levels, but today is not that day.

So I use some nice emojis for things, but we cared about all of these things.

We cared about more things other than just these, but these were some of the big ones, right?

Because this is a monitoring and observability summit, right?

So we cared about thing.

And how we think about our infrastructure primitives though is always changing, right?

How many of you still think about the physical server all the time?


A couple, but not a lot, right?

How many of you think about things at the Docker container level?

It’s more.

How many of you think about it at just the function level?

Five of us are living in the future.

So how we think about our primitives is always changing, right?

So we moved from servers, we moved to containers, maybe rather than instances, so in AWS parlance, maybe it’s EC2 instances, but we’re always changing and we’re going to different levels.

And for us, if I’m thinking in the future, we’re thinking only in terms of functions.

So how can I run just the code that I care about at a business level and not care about anything else?

So we change the level at which we’re thinking about these things all the time.

Copyright police, growth.

Adjusting to new technologies

Growth is good, but as our primitives change, the things that we care about don’t change; the level at which we care about them changes.

So maybe five years ago?

Five years ago, I worked at a startup, and I used things like Datadog, and other monitoring and tracing tools.

And we were really early in the transition from running everything as just applications on EC2 to running things as applications in Docker containers.

And did I get to stop caring about things like logging, and monitoring, and observability when I changed from EC2 instances to running on ECS?

Trick question, I absolutely did not.

I was on call for it, so I would have been very sad.

It’d maybe be a very quiet on call if I had no logging.

So actually in…maybe in hindsight, I just could have gone back and just never implemented any logging and I would have had blissfully quiet evenings.

But in real life, I’m a responsible person so I had to care about that.

So as my primitives changed that I was working with underneath me, I still had to continue caring about these things.

But also, I needed to be more granular and more specific, because it didn’t matter as much how my instances spoke to each other, but it mattered a lot how my containers spoke to each other.

And it mattered a lot how my applications spoke to each other.

So all those things that we care about, they’re still constant.

So this whole list, I had to still care about it.

And I was just caring about it at a different level, right?

And early on, there were a lot of tricky bits to this because the tools that I used to use to measure how my instances spoke to each other, and what my instances could talk to—those did not quite work the same way for containers that they did for instances.

So there was a learning and growing period for us at a tools level and also at the infrastructure level.

So we’re always gonna care about these things.

As infrastructure evolves, what should we care about?

So if we take a step back, we’re always gonna need to care about seeing what our applications are doing, we need to see how they’re behaving, so those things that wake you up when you’re on call, we need to be able to control how they talk to each other, and we need to be able to control what resources they can access.

And in my opinion, there was kind of a curve like this, right? That at first I cared about it at the instance level, and then I got smaller and smaller, right?

So I went from containers, maybe I went to functions, maybe I used a combination of all of them.

But then, in my opinion, we’re kind of coming out the other side on that a little bit, right?

What if I don’t care about what kind of compute primitive I’m using at all?

What if I only care about my application?

So I don’t care whether it’s a Lambda function, or it’s running on EC2, or it’s a monolith, or, I don’t know, whatever was before monoliths, I don’t remember.

Someone knows.

I have to care about all those things, but I’m coming out the other side, right?

Where maybe I shouldn’t care, maybe I can be compute agnostic.

But we’re still gonna always need to care about those pieces.

And I actually would posit that I don’t want to care about the primitives.

I’ve done EC2, I’ve done containers, I’ve done functions, I don’t actually wanna care about it.

And everyone thinks that this is a trick question for me, because I work for Amazon and there are always trick questions.

But everyone wants to know, do I use containers?

Do I use serverless?

Do I use something else?

And I have news for you.

It does not matter.

You use whatever works for you.

And I think ideally, the tools that we’re building and the tools that we’re using, should not care.

So not only should you use the right tool for what works for you, you shouldn’t have to care what your primitives are.

Even if that’s a mix.

Controversially, I’ve now used the Lambda logo and the Docker logo, because these things can coexist in real production life.

And we have this conversation a lot, because people are very opinionated about their tools and their infrastructure, because they wanna know what’s best.

And there is no best answer.

It can be VMs, it can be Firecracker, micro VMs if you want to, it can be EC2 instances, you can run your own data center, you can use Docker containers, those are all fine.

And I think that in reality, in a healthy production environment, you’ll probably end up with a combination thereof.

So you have some people that I think are a little bit more fanatical than others, right?

That they say, “I wanna use entirely serverless.”

And I’m happy for them.

But those are the same people, I think, in a lot of cases that are saying, “What if I could keep a Lambda function warm all the time?”

I’m like, “So a web server?”

They’re like, “No, but serverless. A Lambda function that is always warm, as a web server.”

I don’t know who needs to hear that, but if you’re out there, come and find me.

I’m gonna be sitting at one of those tables.

You use the right tool for the job.

And in real life, you use a combination of the right tools for the job.

Application first: A framework for the evolution of infrastructure

So how do you do that? So say that we’re living in Abby’s beautiful planet where no one cares about the types of compute primitives that they’re using, but I hypothetically still care about things like monitoring, and logging, and observability, and tracing.

You have to move things back out to the application level.

And that means that as an infrastructure provider, we need to think about how can we let you do that?

How can we let you talk across all the different kinds of compute without necessarily caring what they are, right?

So at re:Invent last year… what year is this? It’s 2019, so at re:Invent 2018, we announced something called App Mesh, which is basically that.

So how can you do application-level communications across AWS?

We think about App Mesh as a Layer 7 network.

This is not an interview, so I’m not going to ask everyone to name all of the network layers.

But a lot of our investment and a lot of how we’re kind of thinking about these things moving forward is, “How can we do things like application networking?”

So how can I run an overlay type network that does not care about the kinds of compute that I’m using?

Maybe I’m running my things in a VPC, but some of them are in ECS, some of them are in EKS, some of them are Kubernetes on EC2, some of them are just running on EC2.

Because ultimately, what I care about is: can I see what my service is doing?

Can I see how it’s behaving?

Can I see what it’s talking to?

Can I see what resources it has access to?

It’s like an AWS-level service mesh rather than a container-level service mesh, which is how I think we’ve been approaching that as an industry leading up to that.

So same kind of deal as what I would…I don’t know if I can say a regular service mesh, because what is a regular service mesh?

I feel like we’ve only really had service meshes in production for, like, two years?

Year and a half?

Eighteen months?

So what we’re going for, though, is those four things that I care about, and how can I do that at kind of the application level, rather than the container level.

So I think that the biggest difference here between an App Mesh-style networking construct and a service mesh is that a service mesh, historically those two years, is a feature of the orchestrator, right?

So your orchestrator is just building services, though, and I don’t think it should matter what orchestration system you’re using.

I don’t think it should matter whether you’re doing it on ECS or EC2, or EKS.

If you’re using an orchestrator or you’re building services, you need to register them into a network that knows about the services itself.

And then you can manage things like communication and isolation, and how you manage things like policies.

How App Mesh works: A high-level overview

So App Mesh goes on top of the services, not at the orchestrator level.

Doesn’t matter what orchestrator you’re using, you could use a mix of them if you wanted to.

It’s built on top of Envoy, which is an open source proxy.

We run Envoys as sidecar containers to everything that you’re running, so sidecar to your EC2 application, or your EKS application, or your ECS application.
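To make the sidecar idea concrete, here’s a minimal sketch of what that pairing could look like in an ECS task definition, expressed as a Python dict. Everything here (the image names, the resource ARN, the environment variable) is illustrative rather than copied from real App Mesh documentation.

```python
# A minimal sketch of the sidecar pattern described above: one ECS task
# definition fragment with the application container plus an Envoy proxy
# container. All names, images, and ARNs here are hypothetical.
task_definition = {
    "family": "service-a",
    "containerDefinitions": [
        {
            # The application itself; App Mesh doesn't care what's inside.
            "name": "app",
            "image": "my-registry/service-a:latest",
            "portMappings": [{"containerPort": 8080}],
        },
        {
            # The Envoy proxy runs as a sidecar next to the app container,
            # pointed at this service's virtual node in the mesh.
            "name": "envoy",
            "image": "public.ecr.example/appmesh-envoy:some-tag",
            "environment": [
                {
                    "name": "APPMESH_RESOURCE_ARN",
                    "value": "mesh/my-mesh/virtualNode/service-a",
                }
            ],
        },
    ],
}

container_names = [c["name"] for c in task_definition["containerDefinitions"]]
print(container_names)  # ['app', 'envoy']
```

The point of the shape is that the app container and the proxy always travel together as one deployable unit, whatever the orchestrator.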

As we grow, so how we’re thinking about this, not just now, but six months, a year, two years, three years down the line is: what else can be a part of that mesh?

What if I want to use an ELB for ingress?

What if I wanna use API gateway?

What if I wanna use Lambda functions and have my Lambda functions be part of the application service mesh?

Then it’s not just your Envoy, it’s not just your containers; Envoy is just the part that’s running the proxy, and it shouldn’t matter what you’re actually doing.

So you just define the mesh, you define how services talk to each other, you set policies.

When you create your mesh, it is closed by default.

So nothing can talk to each other without you explicitly saying it.

So what you’re doing is you’re registering everything as part of the same kind of overlay, but then you’re opting in to communications.

You’re saying, I want A to talk to B instead of by default, everything can talk to everything.

I know everyone in here has a security group that has… I know you all have it.

So instead, we’re letting you opt in.

So nothing talks to each other.

You opt into saying, I want service A to talk to service B.
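That opt-in can be sketched as a route declaration. The shape below loosely follows App Mesh’s route spec (a virtual router in front of service B, with a weighted target pointing at service B’s virtual node), but the mesh, router, and service names are made up.

```python
# Hypothetical sketch: explicitly opting traffic in to reach service B.
# Until a route like this exists, the mesh is closed by default and
# nothing talks to anything.
route = {
    "meshName": "my-mesh",
    "virtualRouterName": "service-b-router",
    "routeName": "allow-a-to-b",
    "spec": {
        "httpRoute": {
            # Match all HTTP requests arriving at service B's router...
            "match": {"prefix": "/"},
            # ...and send them to service B's virtual node.
            "action": {
                "weightedTargets": [
                    {"virtualNode": "service-b", "weight": 1}
                ]
            },
        }
    },
}

targets = route["spec"]["httpRoute"]["action"]["weightedTargets"]
print([t["virtualNode"] for t in targets])  # ['service-b']
```

With boto3, a payload shaped like this would go to the App Mesh client’s `create_route` call; treat the exact field names as an assumption to verify against the API reference.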

You can tell that I’ve had my fingers on this diagram because emojis made it in; there were not previously emojis in this graph.

If you are more of a picture person, here’s what this ends up looking like.

Configuration is passed down through the mesh.

You can use Cloud Map or Route 53 for things like service discovery, then it doesn’t matter what’s running underneath here.

So all of your services are registered, you handle ingress however you wanna handle it, you pass your configuration down, and then you can use your observability tools to get data out of that.
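Service discovery is the piece that makes the “doesn’t matter what’s underneath” part work: a virtual node can point at a Cloud Map service rather than a specific kind of compute. Here’s a rough sketch, with a hypothetical namespace and service name.

```python
# Hypothetical virtual-node sketch: the mesh resolves "service-b" through
# a Cloud Map namespace, so whatever compute registers itself there
# (ECS tasks, EKS pods, plain EC2 instances) is reachable the same way.
virtual_node = {
    "meshName": "my-mesh",
    "virtualNodeName": "service-b",
    "spec": {
        "listeners": [{"portMapping": {"port": 8080, "protocol": "http"}}],
        "serviceDiscovery": {
            "awsCloudMap": {
                "namespaceName": "example.local",
                "serviceName": "service-b",
            }
        },
    },
}

discovery = virtual_node["spec"]["serviceDiscovery"]
print(list(discovery))  # ['awsCloudMap']
```

Swapping the `awsCloudMap` block for a DNS hostname is the Route 53-style alternative; either way, callers only ever see the virtual node.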

So I think this kind of starts by, like, a common need, right?

So service A needs to use service B and you have two different service teams.

So the Amazon way of doing this, which I’m not going to repeat at length and ad nauseam, [is] something called two-pizza teams.

So every team works independently and you own your service all the way up the stack.

So you are responsible for developing it, and building it, and deploying it to production, and going on call for it.

So what that means is that we end up with a lot of teams that work very separately and I think that a lot of people have a fairly similar kind of setup for this, right?

You all work separately, your services are maybe both deployed in AWS, maybe you’re using different tools.

It’s Amazon so maybe my service A is Java, hypothetically, maybe service B is Go because I’m cool, maybe service C is Rust, because I’m even cooler.

Maybe some of them are running containers, and maybe some of them are not.

But I have things that are consistent across all of those.

Maybe I wanna use the same logging format, maybe I wanna send everything to CloudWatch.

Maybe I want them all to use Datadog and send my metrics there so that I can get paged for it when my Java service does something naughty.

So it ends up looking like this, right?

So App Mesh is the control plane, my proxy is run next to my applications regardless of what’s running inside the application itself.

And then I can use that to handle things like observability, and tracing, and a common logging format.

So I can handle some things at the mesh level that apply to all my services, and then I care less about my service and whether I’ve instrumented it or whether I’ve implemented tracing for the language that I need.

I can use them kind of all separately, but also together.

Here’s a hypothetically better look at what those constructs end up looking like.

I’m not a good diagram person, so you get some good with the bad.

So here’s what this ends up looking like.

I think a little bit more simply right?

So I have a load balancer, it handles things like ingress. Service A needs to talk to service B.

And then from the mesh side, it’s basically virtual nodes that are then handling everything through the mesh.

So it doesn’t change a lot from how I interact with the services themselves or how I do things like ingress or logging, but I’m controlling all of those through the mesh itself.

This is not any better, but it shows you where the proxies are; it’s a graphical representation of what a proxy looks like.

Everyone is welcome.

I make great diagrams.

I’m not really sure what happened on this because I said that I was gonna look at my slides this morning, and then I did, but I think I made it like two-thirds of the way through, and then I was like, “This seems good.”

It was not, so let’s move on to the next slide.

A behind-the-scenes look at App Mesh

So here’s what the proxy looks like in greater detail.

My traffic goes through the proxy, it just runs as a sidecar application.

I control what comes in and out, but ultimately it doesn’t matter whether it’s a task, or a pod, or just a regular application.

If you’re not familiar with Envoy, it’s an open source project.

Lots of people are building on top of it; it started at Lyft with Matt Klein, and it’s now a graduated project in the CNCF.

But it’s open source, you can play with it, you can run Envoy itself. We’re using it underneath App Mesh, so it’s the Envoy proxies that are managed through the mesh that are running next to all of your running applications, but built on Envoy.

So App Mesh itself is what handles configuring those Envoy proxies, so it starts them up, it connects them to the mesh, handles things like the logging, and observability, and tracing.

Observability itself, right?

So I think there are good and there’s bad, right, that come with…Ooh, spicy.

I remember that sound.

I did delete PagerDuty though because I have a different app now, but it was an exciting day.

I love them, but also I didn’t want the app anymore because it was preventing me from sleeping.

Microservices, it was fun.

So I think there’s some good and there’s bad, right, that comes along with the whole distributed systems and microservices thing, right?

Which is that not only do I care about what the individual services and applications are doing, I also care about how they talk to each other.

And I care not just about whether they can talk to each other and access the resources they need, but also that they’re not accessing anything more than the resources they need.

So what that means for distributed systems and microservices is that I care about logging, so what are my applications doing?

I care about monitoring, so how are they performing?

Are they healthy or not?

And I care about tracing.

So who and what are they talking to?

How long are those calls taking, right?

Is it really slow for service A to talk to service B, but it’s really fast for service B to talk to service C?

I care about that kind of thing, because I have more, I guess in networking terms, I have more hops.

I have more things that I have to jump through in order to get to my final result.

Monitoring App Mesh

Then you end up with convoluted diagrams that look like this.

There’s a color key here, but I got a little distracted, so it may or may not be accurate.

But I can handle these things through a mesh, I can handle them through App Mesh, right?

So for logging, that’s things like CloudWatch Logs or HTTP access logging. And I care about metrics.

So things like StatsD or Prometheus, which is like PromQL, I guess.

And then I care about things like tracing.

With App Mesh, since our traffic is routed through our Envoy proxy containers, we get an easy way to monitor because they can all go through the mesh.

So tools like Datadog can integrate directly with App Mesh for monitoring and observability goodness.

You can integrate, I think there are directions on the Datadog site, but ultimately it looks a lot like integrating just with Envoy, because it’s just Envoy proxies managed through the mesh.
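Since the data plane is plain Envoy, pointing a monitoring agent at it looks like configuring any Envoy integration. As an illustration only (the stats URL, admin port, and field names are assumptions to confirm against Datadog’s docs), a Datadog-style Envoy check might mirror something like:

```python
# Sketch of a Datadog Agent Envoy check config, expressed as the Python
# equivalent of a conf.d/envoy.d/conf.yaml file. The admin port (9901)
# and field names are assumptions, not verified values.
envoy_check = {
    "init_config": {},
    "instances": [
        {
            # Each App Mesh sidecar exposes stats on Envoy's admin
            # endpoint; the Agent scrapes it like a standalone Envoy.
            "stats_url": "http://localhost:9901/stats",
            "tags": ["mesh:my-mesh", "virtual_node:service-a"],
        }
    ],
}

print(envoy_check["instances"][0]["stats_url"])
```

The tags are the useful part in practice: they let you slice mesh-wide metrics back down to a single virtual node.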

So that gets you things like service activity, so like request counts or response codes, and gets you Envoy activity.

So are my proxies healthy?

Are my health checks passing? And application activity, so HTTP filters plus tracing.

So how can I see how my services are interacting between each other?
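Tracing across those hops is handled at the proxy too. As a sketch (the environment variable name is based on my recollection of the App Mesh Envoy image; treat it as an assumption), enabling X-Ray tracing on the sidecar could look like:

```python
# Hypothetical Envoy sidecar container definition with X-Ray tracing
# turned on, so each A -> B -> C hop gets its own timed segment.
# The env var names and image are assumptions, not verified values.
envoy_container = {
    "name": "envoy",
    "image": "public.ecr.example/appmesh-envoy:some-tag",
    "environment": [
        {"name": "APPMESH_RESOURCE_ARN",
         "value": "mesh/my-mesh/virtualNode/service-a"},
        # Ask the managed Envoy to emit X-Ray traces for every request
        # it proxies, so per-hop latency can be compared.
        {"name": "ENABLE_ENVOY_XRAY_TRACING", "value": "1"},
    ],
}

env_names = [e["name"] for e in envoy_container["environment"]]
print(env_names)
```

Because the proxy does the tracing, the application code doesn’t have to be instrumented per language to get the per-hop timings.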

If you’re looking for some help getting started, App Mesh is on GitHub, not App Mesh itself, but there’s a bunch of examples on there.

A roadmap of things that are coming up.

But this is also on GitHub, so our new thing is publishing public roadmaps for some of our services.

So for all the container services plus App Mesh, they have public roadmaps on GitHub so you can see what we’re working on and what we’re thinking about in public without having to go search for it yourself or just kind of hope that we’re doing it.

Transparency, yay.


That is it.

Thank you for joining me.

I will be hanging out.

I think it’s a break now, but I will be hanging out at one of those table things after this.

So thank you.