A Journey to Automated Infrastructure (eVestment)


Published: July 17, 2019

Curtis: So you are ready to take a journey?

Audience: Yes.

Curtis: Well, it’s fun. We’ve been on a journey for a while now. Automated infrastructure can be extremely empowering to your engineers and it can also deliver value to your clients faster.

And so, before we start, I’m gonna talk about who we are a little bit.

Introduction

And so, eVestment is a Nasdaq company.

They’ve owned us for about two years now.

Why we looked appealing to Nasdaq is because we’re a massive financial database that brings together consultants, investors, and managers.

And it allows them to make data-driven decisions using our analytics, market intelligence, and all the other features we build on top of that data set.

And so to kind of put that in perspective, if I was to add up all of our clients’ AUM, which is assets under management, it would be about $68 trillion.

And so, it’s a little bit more than my personal portfolio or maybe yours.

And so with that, we need to constantly deliver them value, modernize our application and our stack, and continue to deliver that value.

And so, to put a little context around that, this is our stack.

Migrating from monolith to microservices

So we started with the monolith and we moved to a microservices architecture.

And so with that move, we needed to do a little bit of retooling.

And so with the UI layer, we went from Ext JS which is a huge JavaScript framework, and we wanted to move to a more modular-based framework.

And so with that, we audited the top three: React, Angular and Vue.js.

And we ended up landing on Vue.

We just felt it pulled a lot of the good things from each, and without all the baggage.

And so with that, the really important thing there is we were able to build components with Vue that you can mount in any framework.

And so now, it’s huge because we don’t have to go rewrite our entire UI layer from Ext over to Vue.

We can just mount any new component into Ext when we need to.

What powers the front end is our API layer. We’re a .NET shop, so C# is our main language.

And the main difference here between the old and the new was we swapped from .NET Framework to .NET Core, and we did this for a lot of reasons.

The main one that you’ll kinda get today is because .NET Core can run on any OS.

What we do with our APIs now is we package them in Docker containers running on a Linux image using Alpine, which is one of the thinner flavors of Linux.

And all of that is deployed and orchestrated in AWS ECS, so Elastic Container Service.

eVestment’s infrastructure, then and now

And so with that, that kinda takes us to our next layer which is infrastructure, and we were in a massive data center before that running a bunch of VMs.

And then we made the move to AWS.

And at that point, we really only had a few data sources at our fingertips: a managed Elasticsearch cluster, which we’ve been on since V1, and Microsoft SQL Server.

And so, with the move to AWS, we now have RDS or DynamoDB or DocumentDB or any of those things that AWS offers us.

And so, how do we get those resources out into all of our environments?

And so, we started down the path of Chef with EC2s, and like I said, we decided to go Docker.

And so, we changed over there.

We also implemented infrastructure as code with Terraform and you’re gonna hear a lot about that today.

And then our CI pipeline was TeamCity, which was getting expensive: every agent cost more money.

It was harder to scale.

So we made the swap over to Jenkins, and we’re running those on EC2s, and we’re autoscaling them.

So the more code that the devs are merging, it will autoscale up and then back down.

And so, a couple of tools that we use in the stack are obviously the Visual Studio 2019 IDE, but we’ve now almost gotten to where we only use it in the API layer.

And for all the other layers, we’re using Visual Studio Code. And if you haven’t used that, you don’t have to be a .NET shop to use it.

It actually runs anywhere, so Apple, Linux.

It’s really a nice tool that Microsoft did a good job with.

It has extensions.

So we’ve got Vue extensions, Docker, Terraform, you can even run your Terminal or PowerShell in there.

So it’s actually really nice if you wanna check it out.

Migrating to AWS

And so, I wanna look at the migration to AWS, because that’s really what kicked off our journey down the path of automated infrastructure.

Like I said before, we went from this massive data center, or managed solution, to the public cloud, and we chose AWS as our provider.

And on this move, we knew we needed, not wanted, but we needed two things.

The first thing is 100% infrastructure as code.

So we didn’t want to do a 5-year plan or a 10-year plan where you do 10% and kind of move through the stages.

We needed 100% infrastructure as code, and for that we used Terraform.

So why did we choose Terraform over say like, CloudFormation, which is AWS’s version of infrastructure as code?

Well, we didn’t wanna be coupled to our cloud provider and we wanted to be able to put any provider into Terraform.

And this becomes important later.

And so, our second thing that we needed was we knew we needed 100% immutable infrastructure as well.

We were tired of doing security patches, and if problems arise, having to deal with the box.

We just wanted to kill and spin up a new one.

And so, our truly amazing team was able to accomplish this in just six months.

And so, on the day that we went to roll it to production, which you can see this picture, it was 6 a.m. on a Saturday.

And with the single click of a button, we were able to roll 17 years’ worth of code and infrastructure flawlessly.

And so, we actually had zero production issues that day.

So it was pretty amazing and kind of shows the power of that automated infrastructure, but it comes with some challenges.

And so, we’re gonna talk about six of those challenges today.

DevOps bottleneck

So I’m gonna start with DevOps as a bottleneck, which kinda doesn’t make much sense because typically DevOps is a concept.

With eVestment, we actually have a team called DevOps.

And so with that, we have all these engineering teams that own the UI and the API.

And DevOps owns the infrastructure, the deployment, and all of the other things that kind of fall under that DevOps bucket.

So as you can see, this causes a massive bottleneck where all these teams are submitting tickets to DevOps.

It also is a huge time waste because now we have to go have priority meetings to figure out which team is the highest priority and what the DevOps team should work on first.

And so, how do we solve this?

Well, as I mentioned before, we now have Terraform for infrastructure as code.

And so, Terraform just becomes another tool on the belt of your full stack engineers.

And so with that, now we’re able to scale, and those teams are able to deliver value to our clients faster, and in their priority order, not everyone else’s.

Also, it empowers our engineering teams, because you have to think about this, now you have any level engineer writing infrastructure and owning their UI, API, and all the infrastructure that comes along with it.

So that’s how we kinda spread DevOps as a culture at eVestment.

Project LAMMA

And so, the next challenge that we ran into, we like to call Project LAMMA, which you may have noticed is spelled incorrectly.

And so, this is not just an engineering typo, normally it would be.

We actually spell LAMMA with two Ms and one L because it stands for logging, alerting, monitors, metrics, and APM, application performance monitoring.

And we had a solution for this, but it was kind of piecemealed and all over the place.

So we had ELMAH, and if you haven’t heard of that, it’s an open source ASP.NET unhandled-exception logger, which actually ended up becoming the logger for our entire system, and it’s persisted to SQL.

At least that’s where we were persisting it.

And we had StatusCake for our uptime APIs, and we had Datadog for our monitors and metrics.

And we had New Relic for APM.

And all of this was alerting through emails.

And so, we knew we wanted a one-stop shop for all of LAMMA.

And so, we compared a lot of different solutions and we landed on Datadog.

And then for alerting, we swapped from emails over to Slack channels.

And so now that we have the tool, we need to figure out who manages it, who owns it.

And so, it turns out Datadog has a Terraform provider, and so now, our individual engineering teams can own that full slice vertical and have a definition of done for their microservices.

And they can actually code all of LAMMA alongside of their infrastructure, which means it’s now code-reviewed and tested.

And so another thing that we did, we like to stay on immutable, so we made Datadog immutable.

All of the upper environments in Datadog are read only, and the only way to get code out there, or to get any of the resources for Datadog out there, is to use Terraform and deploy it.

And so, let’s take a look at what a provider looks like in Terraform.

And so, here’s an AWS provider and a Datadog provider.

It’s quite simple to get up and running in a few lines of code.

Once I have that provider, I then have access to any resource within it.

So in this example, I’ve got a Synthetics API call which is actually what replaced StatusCake and it’s just a simple health check with a few assertions, but you can see how easy it is to deploy this.

But one thing you may notice is I keep saying the word “resources.”

But the code says “module” and that brings us to our next challenge which was modularity.

And now, I invite Steve up here to kinda talk you through that one.

Modularity

Steve: All right.

Thanks, Curtis.

So as Curtis mentioned, my name is Steve Mastrorocco.

I’m one of the architects at eVestment, and there, my primary role is to help evangelize and be an advocate for the adoption of some of the DevOps practices we’ve been talking about.

So one of the first problems we run into once we make infrastructure code is the same problem we run into with any code base.

We find ourselves repeating the same code over and over.

So we can take advantage of some of the things Terraform offers us in the form of modules.

So here’s a very simple example of what that looks like.

It’s just a simple module that creates an AWS tag schema for the developer.

So this is powerful because number one, we use tags like a lot of you probably do for cost and billing.

We also use it for attribution and we use it to populate tags within Datadog.

This way, the developer doesn’t have to remember, “What are all the things I need to tag all my resources with?”

They simply call this module.

They provide three things about the repo, the name of it, what the name of the service is.

Some of this is eVestment specific, but we hand them back a map of the current tag schema.

They then just apply that to all their resources without having to think about, “What are all the things I need to do?”
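As a rough illustration of the idea (written in Python rather than as the Terraform module shown in the talk), a tag-schema helper might look like the sketch below. The three inputs (`repo`, `service`, `team`), the tag keys, and the sample values are all hypothetical; the talk only says that three inputs go in and a map of the current tag schema comes back.

```python
from typing import Dict


def build_tags(repo: str, service: str, team: str, workspace: str) -> Dict[str, str]:
    """Return the tag map every resource should carry.

    The input names and tag keys are assumptions for illustration,
    not eVestment's actual schema.
    """
    for name, value in (("repo", repo), ("service", service), ("team", team)):
        if not value.strip():
            raise ValueError(f"{name} must not be empty")
    return {
        "repo": repo,
        "service": service,
        "team": team,
        # Reusing the workspace as the environment tag is an assumption.
        "env": workspace,
        "terraform_workspace": workspace,
    }


tags = build_tags("billing-api", "billing", "payments", "feature-123")
for key, value in sorted(tags.items()):
    print(f"{key}={value}")
```

The point of centralizing this is that a schema change happens in one place, and every consumer picks it up on the next deploy.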

And the second thing we found ourselves typing over and over again is the pipeline.

So if you’ve used Terraform…

And just by show of hands, how many of you all in the room are using Terraform in production today?

Great.

So a lot of you guys are probably very familiar with this, may even be ahead of where we’re at.

So the plan and apply Terraform can obviously be repetitive and it’s also a little bit nuanced.

If you’re using remote states, there’s a lot of arguments you may have to pass in about where the S3…in our case, S3 bucket is, what’s the DynamoDB table we’re using for locking, what’s the workspace I should be in.

Instead, the developer could just consume a Jenkins shared library we’ve written and provide four things.

The first is: what’s the path to their Terraform configuration? What’s the workspace or environment they’re deploying to? Who’s allowed to approve it? And what’s the Slack channel we should send approvals to?
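To make the boilerplate concrete, here is a minimal Python sketch of what such a wrapper could hide. The talk's implementation is a Jenkins shared library, not Python, and the `BACKEND` values, the `DeployRequest` type, and the function name are hypothetical. The Terraform flags themselves (`-backend-config`, `-input=false`, `-out`) are standard CLI options.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical remote-state settings the library would know about,
# so developers never have to pass them.
BACKEND: Dict[str, str] = {
    "bucket": "example-terraform-state",
    "dynamodb_table": "example-terraform-locks",
    "region": "us-east-1",
}


@dataclass
class DeployRequest:
    """The four inputs described in the talk."""
    terraform_path: str   # directory holding the Terraform configuration
    workspace: str        # workspace / environment to deploy to
    approvers: List[str]  # who may approve (used by the approval step)
    slack_channel: str    # where approval requests are sent


def terraform_commands(request: DeployRequest, plan_file: str = "tfplan") -> List[List[str]]:
    """Build the commands to run, in order, from request.terraform_path."""
    backend_flags = [f"-backend-config={key}={value}" for key, value in BACKEND.items()]
    return [
        ["terraform", "init", "-input=false", *backend_flags],
        ["terraform", "workspace", "select", request.workspace],
        ["terraform", "plan", "-input=false", f"-out={plan_file}"],
        ["terraform", "apply", "-input=false", plan_file],
    ]


request = DeployRequest("infrastructure", "testing", ["steve"], "#deploy-approvals")
for command in terraform_commands(request):
    print(" ".join(command))
```

Applying the saved plan file, rather than planning again at apply time, means that what gets approved is what gets deployed.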

So I’ll touch on that later, but this is important because the developer may want to inject a human gate before they go to production or any environment if they’re not yet comfortable with the Terraform pipeline, and this allows them to do so.

So that one is pretty straightforward and gets our developers up and running quickly with a bunch of functionality and boilerplate stuff they don’t have to worry about.

Local development

But the second one is kind of a unique problem to Terraform, and that’s the local development experience.

So with a normal code base, you sort of have isolation, right?

Just by virtue of being on one laptop versus another laptop, and developer A and developer B don’t really have to worry about colliding with each other.

But with Terraform, that’s a little bit different.

When you’re using a remote state like S3, the default behavior of Terraform is to drop you into the default workspace, and this leads to a situation where developer A and developer B are gonna clobber each other, potentially deleting each other’s resources and so forth.

So Terraform provides a very easy mechanism to get out of this which is Terraform workspaces, but it presents a new challenge.

Now, every time the developer starts working, they have to remember, “I’ve got to make a workspace.

What’s my workspace name I made?

I need to not make a workspace that collides with another workspace that developer B made.”

So we just mask this all behind a simple CLI wrapper that we wrote in PowerShell.

So it’s PowerShell Core which makes it cross platform.

We have developers on Mac and Windows, and of course, our pipeline is in Linux.

And they kinda get this for free.

So you’ll see an example usage of this.

They just simply call it, provide the directory that their Terraform lives in, and they automatically, down here, start getting a workspace for free that matches their Git branch name.

So this is a really nice way because Git branches by definition are unique, and we find that when developer A and developer B are working on the same microservice, they’re typically not working in our case on the same ticket or card.

So they don’t really collide with each other.

So it kinda creates a situation where they don’t have to think about this.

They just get this for free and run this CLI wrapper.
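The core of that wrapper can be sketched in a few lines. This is Python for illustration, not the PowerShell Core tool from the talk, and the sanitizing rule is an assumption: branch names often contain characters such as `/` that are awkward in workspace names, so the sketch replaces anything outside a conservative character set.

```python
import re
import subprocess


def current_branch(repo_dir: str = ".") -> str:
    """Ask Git for the name of the checked-out branch."""
    result = subprocess.run(
        ["git", "rev-parse", "--abbrev-ref", "HEAD"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def workspace_name(branch: str) -> str:
    """Turn a branch name into a workspace-friendly name.

    Lowercases the branch and collapses every run of characters outside
    [a-z0-9_-] into a single hyphen (an assumed convention).
    """
    name = re.sub(r"[^a-z0-9_-]+", "-", branch.lower()).strip("-")
    if not name:
        raise ValueError(f"cannot derive a workspace name from {branch!r}")
    return name


print(workspace_name("feature/ABC-123_add-tags"))
```

A wrapper would then call `workspace_name(current_branch())` and select or create that workspace before running any Terraform command.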

This is just demoing it for you here.

I can show that I’m in a workspace that matches my Git branch name and down here, I can see I consumed the AWS tag module.

I automatically for free got tags that align to my Datadog environment, my Terraform workspace.

And this makes it really easy for a developer to just search in AWS for anything that’s matching their branch name and find their resources quickly.

Infrastructure testability

So the next problem we’re gonna talk about here isn’t really a problem.

It’s more of an opportunity that came out of doing infrastructure as code.

So like any code base, we would write tests to make sure any assertions that we have about our code aren’t breaking every time we make changes.

Now that our infrastructure is code, we can just do the same thing.

So here’s a very simple example of a module being consumed.

Just a Route 53 module being used in two different ways.

I’m making a public and private record across our split-horizon DNS and I’m just making a private record for maybe our private APIs that shouldn’t be exposed to the world.

This is an example of those tests or assertions.

So we used the Kitchen-Terraform plugin for Test Kitchen.

So Test Kitchen, if you’re not familiar with it, was born out of the Chef community and it follows a very simple run.

It’s a test harness that goes create, converge, verify, and destroy.

So in this case, the create and converge phase are loosely aligned to a Terraform plan and a Terraform apply.

The verify step would traditionally run a test framework called InSpec.

When we started this journey, the InSpec AWS resources weren’t that mature and they were lacking in some areas.

They’re very good now, but because of that we went with a community open source solution called awspec.

If you’ve seen InSpec, it looks exactly like InSpec and if you haven’t seen InSpec, it looks exactly like any other unit testing framework you may be familiar with.

We make simple assertions, like describe a Route 53 hosted zone.

It should exist, it should have this name, it should have these records with these properties.

So once I’ve done that, the developer in the pipeline simply gets immediate feedback about any change they’ve made infrastructure-wise, and whether it broke any assertions that should still be true.

Excuse me.

This is just an example of one passing test.

You can see where this is going.
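The shape of those assertions can be illustrated without any AWS access. The sketch below is plain Python, not awspec, and the record layout and sample data are hypothetical; in the real verify step the records would come from the infrastructure Terraform just created.

```python
from typing import Dict, List


def check_record(records: List[Dict], name: str, record_type: str, value: str) -> List[str]:
    """Return failure messages; an empty list means every assertion held."""
    matches = [r for r in records if r["name"] == name and r["type"] == record_type]
    if not matches:
        return [f"{record_type} record {name} does not exist"]
    if value not in matches[0]["values"]:
        return [f"{record_type} record {name} does not point at {value}"]
    return []


# Hypothetical records standing in for what the hosted zone contains.
zone_records = [
    {"name": "api.example.internal.", "type": "CNAME", "values": ["lb.example.internal."]},
]

failures = check_record(zone_records, "api.example.internal.", "CNAME", "lb.example.internal.")
print("PASS" if not failures else failures)
```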

Development pipeline

So now that the developer can get up and running very quickly using modules and not having to think about boilerplate, they’ve got an easy local development work experience without having to think about, “How do I isolate myself?”

They can get immediate feedback from their pipeline to know that the infrastructure change is working and tested.

Now, they’re ready to deploy.

So we need a deployment pipeline.

So this is where the Jenkins shared library comes back into play.

I’m gonna go through what our Jenkins pipeline looks like and kind of talk about it a little bit more.

So the first thing we do of course is we run the Jenkins shared library in the Terraform deployment.

This is actually also our application deployment because we’re using containers and ECS as our scheduler.

Part of the Terraform is, of course, the task definition and the service.

And we just take advantage of Terraform’s create_before_destroy to kinda manage creating the new resource before destroying the old one.

Once we’re done with that, the infrastructure is up and running, and ECS takes over the deployment doing the rolling upgrade that it does waiting for ALB health checks to pass before destroying the old one.

So here, we just call the AWS CLI, using a wait command it has called services-stable.

Once that’s done, we are confident that our API is up and running and passing our ALB health checks.
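For reference, the AWS CLI command for this step is `aws ecs wait services-stable`, which polls until the named services reach a steady state and exits non-zero if they do not. A minimal Python sketch of a pipeline step that calls it is below; the cluster and service names are placeholders, and the function names are hypothetical.

```python
import subprocess
from typing import List


def wait_for_service_args(cluster: str, service: str) -> List[str]:
    """Build the AWS CLI call that blocks until the ECS service is stable."""
    return [
        "aws", "ecs", "wait", "services-stable",
        "--cluster", cluster,
        "--services", service,
    ]


def wait_for_service(cluster: str, service: str) -> None:
    """Run the wait; raises CalledProcessError if the service never settles."""
    subprocess.run(wait_for_service_args(cluster, service), check=True)


print(" ".join(wait_for_service_args("example-cluster", "example-api")))
```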

We’re ready to deploy the front end.

So Curtis mentioned we’re using Vue and Vue CLI under the hood is just using Webpack.

So the result of this is just all our chunked-out JavaScript files, which is very nice, and we just host them in S3.

We throw CloudFront in front of that as a CDN and we’re good to go.

Lastly, we deploy the Datadog dashboard that the team may use to monitor their four golden signals or whatever they’re wanting to look at.

So this might seem a little bit contradictory to what Curtis was mentioning before, which is that all of our Datadog stuff is in Terraform, so why wasn’t this deployed in step one?

Well, the teams found that dashboards specifically in the Datadog provider are very difficult to rationalize and write.

You’re basically designing a UI in the declarative language of HCL.

And it wasn’t that great of an experience.

So what the teams found was a little bit better was this: in our lower environments, they have write access.

So they simply log into the lower environments, Dev in this case, create a dashboard, drag and drop how they want it to look, and export that dashboard as JSON.

That JSON just then lives alongside their Terraform code, their Vue code, and their API code, and then of course, is deployed through the Datadog API in the pipeline.

So we’re still achieving the same goal which is, it’s still code, it still goes through a pull request process, and it’s still deployed in a repeatable way that gives us confidence that it’s working.
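A pipeline step along these lines could be sketched as follows. The talk does not say which Datadog endpoint the team called, so using the v1 dashboard endpoint here is an assumption, as are the function name and sample payload. The sketch only builds the request; sending it needs real API and application keys.

```python
import json
import urllib.request

# Assumed endpoint: Datadog's v1 dashboard API.
DASHBOARD_URL = "https://api.datadoghq.com/api/v1/dashboard"


def build_dashboard_request(dashboard_json: str, api_key: str, app_key: str) -> urllib.request.Request:
    """Build a POST that creates a dashboard from exported JSON."""
    payload = json.loads(dashboard_json)  # fail early if the file is not valid JSON
    return urllib.request.Request(
        DASHBOARD_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "DD-API-KEY": api_key,
            "DD-APPLICATION-KEY": app_key,
        },
        method="POST",
    )


exported = '{"title": "billing-api golden signals", "widgets": [], "layout_type": "ordered"}'
req = build_dashboard_request(exported, "example-api-key", "example-app-key")
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req)
```

Since a plain create call would make a new dashboard on every deploy, a real pipeline would likely look up and update an existing dashboard instead.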

Shared library and the human gate

So that’s our pipeline.

I’m just gonna dive into one more thing real quick which is our shared library and the human gate I mentioned.

So when teams first come on or in perpetuity, they may want a human gate before Terraform deployments happen in production.

It sounds like a lot of you guys are using Terraform and you probably know it has a lot of nuances and sometimes plans don’t look like you expect them to.

So we create tfvars files per environment.

You can see here I’m looking at the testing one, and if you can’t read it in the back, this little comment says, “Skip human approval.”

So when our shared library is going through the deployment, it simply looks for this in the tfvars file and if it finds it, it goes directly from the plan to the apply phase without any kind of human intervention.

But if it does…

If it’s omitted, and many times teams do that in production, they end up getting a Slack notification that looks like this.

So it points them at, what’s a link to the Jenkins plan that I can review? What environment or workspace is it for, and who’s allowed to approve or reject it?

All of that was coming back from the shared library.

They pass that in when they call it.
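The gate logic itself is small. Below is a Python sketch of the decision, assuming the marker is a `#` comment reading "skip human approval" on its own line. The exact marker text and matching rules in the real shared library are not given in the talk, and the message format is made up for illustration.

```python
import re
from typing import List

# Assumed marker: a comment line such as "# skip human approval".
SKIP_MARKER = re.compile(r"^\s*#\s*skip human approval\s*$", re.IGNORECASE | re.MULTILINE)


def needs_human_approval(tfvars_text: str) -> bool:
    """True unless the environment's tfvars file opts out of the gate."""
    return SKIP_MARKER.search(tfvars_text) is None


def approval_message(plan_url: str, workspace: str, approvers: List[str]) -> str:
    """Text for the chat notification asking someone to review the plan."""
    people = ", ".join(approvers)
    return (
        f"Terraform plan for '{workspace}' is waiting for approval: "
        f"{plan_url} (approvers: {people})"
    )


testing_vars = '# skip human approval\nregion = "us-east-1"\n'
production_vars = 'region = "us-east-1"\n'

for name, text in (("testing", testing_vars), ("production", production_vars)):
    if needs_human_approval(text):
        print(approval_message("https://jenkins.example/job/42", name, ["curtis", "steve"]))
    else:
        print(f"{name}: applying plan without a human gate")
```

Making the gate opt-out rather than opt-in means a new environment is gated by default, which matches the cautious posture described above.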

So all of these have enabled our teams to start delivering value really fast.

We’re probably not at some of the scale you guys are at and we’re definitely not at Google scale.

So to give you guys kind of an idea of where we’re at, we have roughly 80 engineers currently that are writing code every day.

And since January 1st, so about the last 7 months, those teams have been able to spin up roughly 60 microservices and they’ve done over 1,800 infrastructure deployments using this Terraform pipeline.

And lastly, because they’re not having to think about how to do all that anymore, they’ve been able to make over 11,000 merges to master in their application code repos.

And that’s really valuable for us because every merge here is a deployment for us.

So the business is getting the value they want faster and faster while we still have confidence that everything is being deployed and monitored.

So we’ve done all this and still achieve nearly a four-nines uptime right now which is exceeding our business’s SLOs and things like that.
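To put "four nines" in concrete terms, the allowed downtime is the fraction 10^-4 of the period. The short calculation below works that out per year, assuming a 365-day year. It illustrates the target only and is not a measurement of eVestment's actual uptime.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def downtime_budget_minutes(nines: int) -> float:
    """Minutes of downtime per year allowed at an availability of N nines."""
    return MINUTES_PER_YEAR * 10 ** -nines


for nines in (3, 4):
    print(f"{nines} nines: about {downtime_budget_minutes(nines):.1f} minutes per year")
```

At four nines that comes to roughly 52.6 minutes of downtime per year.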

So I’ll close it out with this: our contact information.

Feel free to reach out to us with any questions after this, and I think if there’s time, Curtis and I are open to taking questions.

Thanks.