
Subverting the Monolith with Principles and Tooling (Betterment)


Published: July 17, 2019

Can everybody hear me? Yes. Awesome. Thanks for coming.

I know most of y’all are here for the photos of my parents’ goats. I hope I don’t disappoint.

Today, I'd like to tell you all the story of how my team created an entirely new platform for our engineers at Betterment with support for all of our legacy apps, and I'd like to share how our principled tooling enabled that endeavor.

The problem

So we’ve been offered this vision of the future and it’s pretty compelling.

It sounds great when you look at the cornucopia of tools our various cloud services offer us to help us achieve all of these goals.

The future seems accessible and easy to tackle. And it would be if you started fresh with a clear set of principles and a clean slate.

But most of us don’t really find our day-to-day that fluid. Our plate is not that clean. Often, it looks more like this.

We’re comfortable in our bed of hay even if we get stuck by pointy pieces of dried grass sometimes. It’s reliable enough.

So how do you go from here, from this comfortable barn nap to there, a rocket ship going to some magical space farm?

How do you get to the future we keep being promised exists?

How do you benefit from all those fancy tools AWS, or Google Cloud, or Azure offer us?

How do you choose what to implement and when? Because you can’t just declare the past dead and start over.

That depends.

Where are we coming from?

What’s here?

Before you can commit to a path forward to any kind of timeline of when you want to get there from here, you’ve got to sort out what’s in front of you.

You've heard of that age-old adage, cattle versus pets.

At Betterment in the beginning, say, a half decade ago or more, we had a monolith of pets: cute little baby goats with different names. This is Eugenia, by the way.

But for the sake of clarity, I’m going to leave Eugenia and that terrific metaphor aside and move forward.

Looking at the beginning, we were actually in a pretty good spot. This is a spot I’m certain will resonate with a lot of you.

We’re using the cloud.

We’ve been using AWS for quite some time, but we’re not at the 12-factor stage.

Our apps aren’t all portable or environment agnostic.

Dependencies are difficult to manage and if you don’t know what a 12-factor app is, don’t worry, I’ll explain it more in a few minutes.

And though we’re using the cloud and automating deployments for most apps, for some, we’re still updating the same hosts. They aren’t all ephemeral.

There are a couple of old legacy apps that live on five-year-old boxes, and we're crossing our fingers hoping AWS doesn't retire those hosts.

The CI pipeline isn’t totally optimized either.

We haven’t figured out things like parallel job execution or isolated runs, and it isn’t always reliable.

Deploys take a long time.

We’re shipping as often as we can which ends up being a couple times a day because of the bottlenecks in the CI pipeline. We could ship more frequently if it weren’t for those.

The tooling and support for new apps is slim; it takes at least a week of one SRE's time to set up the proper infrastructure for one.

When big traffic comes all of a sudden, our scaling response is a few clicks of a button in the console.

We’ve iterated on externalizing config. We’re at the point where apps can inject their config via environment variables, but not where the platform handles that injection.

That's the difference between encrypted files in the repo and using a cloud provider's service like Parameter Store or Secrets Manager.
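
To make that concrete, here's a minimal sketch of the two worlds (the parameter path is hypothetical, not ours): in the 12-factor ideal, the app only reads from its environment and the platform worries about how values get there; in the pre-platform world, the app reaches into something like Parameter Store itself.

    require "aws-sdk-ssm"

    def database_url
      # 12-factor: prefer whatever the platform injected into the environment
      return ENV["DATABASE_URL"] if ENV["DATABASE_URL"]

      # Pre-platform world: the app fetches its own secrets from Parameter Store
      # (the "/myapp/..." path is purely illustrative)
      Aws::SSM::Client.new
        .get_parameter(name: "/myapp/database_url", with_decryption: true)
        .parameter.value
    end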

What’s 12-factor you ask?

There’s a lot on here, I know. If you know what I’m talking about, great, but if you don’t, here’s a super quick rundown.

A 12-factor app is stored in Version Control and relies on explicitly declared dependencies in code. It fetches its config from the environment and talks to internal and external services over an API. It's built in CI and released to be deployed in CD. It's stateless and ephemeral, and it scales horizontally. It's environment agnostic, exports its state to another service for observability (like Datadog), and lives on infrastructure that supports accessible console access.

That’s a lot.

Where do we want to go to? How do we get there?

So what is there, that place we wanna get to?

It’s all of those things I just mentioned and then some. And sure, you’re saying this sounds a lot like Heroku. But it’s all those things running on Kubernetes which is the new hotness.

The world we want to be in is a world of tested, automated goats. I mean, cattle in code, right? This is my mom, by the way.

So this is a high-level view of what we need to do.

In order to achieve this, we need to rebuild our CI/CD pipeline, change the consumer contract for our engineers, and make this whole platform as easy and simple to use, and to be onboarded to, as possible.

Now that we know what we want, how do we get there? When do we want it?

Principles take time.

We’re a robo-advisor in a highly regulated industry. We can’t be reckless and ship without thinking about the state of the world.

How do we want it to happen?

Highly opinionated tools (at least at Betterment) take time to develop. You can’t solve all the problems at once either. You only solve one at a time.

Why don’t apps already follow this pattern? Why aren’t they following the right conventions?

Infrastructure as code has been possible since 2006 really and I know I’m gonna ruffle some feathers for those people who’ve been managing infrastructure in Fortran for the past 30 years. But that’s a different debate.

The availability of AWS and Rails actually created a lot of these scaling problems that precipitated the need to automate deployments and infrastructure management.

So what about building web apps made it so easy to avoid being 12-factor?

We didn't have a perfect, 12-factor world in software for the same reasons everyone else didn't: because people were busy writing the products they needed to survive and get funding.

You can’t optimize ahead of time—and I don’t think they needed to.

How do you know when to scale?

So I went to a Meetup at Facebook once, more than half a decade ago, and it was a talk about how they scaled up to serve billions more people and refactored their entire frontend without anyone noticing.

And I remember meek little me raised my hand and said, "How do you know when to prepare for scale? How do you know what to do? What does that look like?"

They paused, they shrugged, and eventually responded, "You can't know until you need it. You don't know until you're in the thick of things."

So we scaled up when we needed it and we did it gradually.

We did it by solving one problem at a time, in well-tested code, that could be easily and quickly distributed to all of our engineers at once.

We did it by giving our engineers access to our platforms as soon as we could.

We did it by involving our engineers. We did it by solving the problems they had, not the problems we thought they had.

So our principles got us to where we are today, right? What do I mean by that?

I mean, that 12-factor app in all its glory.

And I know you’re probably scoffing. “Yes, we know, Sophia, what Heroku is. We know what we should be doing when we build software.”

But it’s not always that simple.

Sometimes there isn’t the extra impetus to force our hand. Kubernetes was the impetus for us at Betterment. Now with a clear goal in mind, we had to sort out how to go about migrating an older system.

There are a number of ways to implement these principles and patterns that we know we should be using.

You could shoehorn them into your legacy world, but that doesn't really stick.

It usually results in more technical debt and complications than anything else.

Which problem do you start with when you’re solving one at a time and you’ve only got a few SREs? What problem do you solve first when your end goal is onboarding a decade-old company onto Kubernetes?

We decided to start with CI. Starting with CI made sense because it’s the entry point to a part of continuous deployment that is fairly easily siphoned off.

But before we could rebuild our pipeline, we needed to redefine how our users would interact with our system which means we actually start with the contract.

The broader, more compelling story here is how we're thinking about providing something generic enough that various teams with varying needs can get usable value from it as soon as possible.

This is one of the main driving forces of our platform and of my team. There are multiple projects that enabled us to solve these problems. I’m gonna talk about a couple of them.

The state of Betterment’s systems

Let’s take a look at the state of the world. This is what our pipeline looked like at the start.

An engineer pushes code, which sends a webhook to Jenkins, which then queues up one of many jobs that execute a series of untested Bash scripts.

Again, this sounds terrible when I say it like that, but it’s not that bad.

All of our Jenkins config is stored in GitHub, so it's not as if a change wasn't recorded in Version Control. We could still go back to our previous world if we made a mistake.

Then Jenkins runs these tests, keeping track of the state of things, with more Bash code and more requests back to GitHub.

Eventually, someone merges their code into master. It starts all over again, but with different jobs. Eventually, maybe deploying to production.

The problem is that parent job isn’t atomic. So if one of those child jobs fails, it looks like everything failed.

Then there’s the problem of resource contention in our Jenkins boxes because we hadn’t taken the time yet to optimize that aspect of the pipeline.

This led to a decrease in the reliability of the whole workflow.

So we chose CI. CI shouldn’t be a blocker to development and iteration. It’s supposed to enable faster movement by building confidence.

So we wanted to solve this problem with resource contention, flakiness, and also general sadness in the org first, because sad engineers make for sad SREs.

The problems we're solving here aren't as massive as "we don't automate deployments at all." They're problems that we've all run into and we've all gotten used to, because they're problems you can just deal with when you have other priorities.

At least you can deal with it until you just can’t anymore.

Starting with CircleCI

So we chose a thing because we knew we needed to optimize this workflow first and we knew that CI pipelines weren’t our primary business venture at Betterment.

So we decided to use CircleCI.

What’s the big mantra for SREs? Infrastructure as code.

Why did we choose Circle? Configuration as code.

The value of code here is automation and recoverability. Circle is perfect for us because it relies on configuration as code, it’s a well-used CI provider in the industry, it’s not the status quo and it’s opinionated as hell (which we love).

We can start fresh and we could start in a world that’s just as principled as ours.

Before I go into how we wrote our tooling, let me ask the crowd. As engineers, whether you be SRE or product or whatever, what’s your go-to language of choice when you write your tools? Ruby, Elixir, Golang, Python?

You’re all wrong. It’s YAML!

So what did we do? We wrote YAML because Circle is configured by YAML. We wrote YAML with Ruby.
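
As a taste of what that means, here's a minimal sketch (the job contents are an illustrative stand-in, not Coach's real output): build plain Ruby hashes, let the standard YAML library serialize them, and write the result where Circle expects it.

    require "fileutils"
    require "yaml"

    # An illustrative stand-in for generated CI config: a hash in, YAML out
    circle_config = {
      "version" => 2,
      "jobs" => {
        "build" => {
          "docker" => [{ "image" => "ruby:2.6" }],
          "steps"  => ["checkout", { "run" => "bundle exec rspec" }],
        },
      },
    }

    FileUtils.mkdir_p(".circleci")
    File.write(".circleci/config.yml", YAML.dump(circle_config))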

Coach: Betterment’s in-house tool

What does our tool do?

The tool is called Coach, by the way.

In an org with tens of teams writing in only a handful of languages, we believe the only way forward is to enforce a set of conventions with a predetermined contract. This tool does that by automating configuration generation, configuration as code, configuration generated by code.

This will set us up for success down the line.

If we let each Ruby or Java team decide how their CI pipelines will work, wiring up our system to support any combination of attributes, then we have that same problem as before: a bunch of baby goats with different names.

We knew we wanted this tool to externalize configuration for apps and we knew this tool needed explicit enforceable contracts for consumers to follow in order for it to work effectively.

This was the planning for the future bit.

If we get them to follow our conventions now for CI, then we can rely on that contract with us for the future when we venture onwards through our CD rebuild.

So if we have a contract for every project type, then we can begin to automate the configuration generation needed to enforce that contract. And when you think about it, once you automate config for one provider, then you have the patterns laid down to automate it for any other provider.

Suddenly, automation leads to scalability.

They work in tandem.

We could standardize the interface, which increases the predictability of CI and CD run times and of code coverage.

It reduces risk and lets us onboard new apps which increases developer productivity and increases SRE confidence in the engineers developing those new apps.

And I know I’m really hammering these principles in, but they need to be hammered in, in order to stick to them.

We began with consumer contracts for Ruby apps first because they're fairly conventional to begin with.

Remember, Rails caused all these scaling problems that prompted infrastructure as code way back when. At least, they made automating these solutions a bit easier with their whole “convention over configuration” mantra.

Let's take a look at Ruby's contract. Every Ruby app will run unit tests with RSpec. Every Ruby app will run RuboCop. Every Ruby app will run integration tests and store screenshots of errors in CI. Every Ruby app will build its code in the same way and store that code, those artifacts, in a conventional location in S3.
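
One way to picture that contract, with hypothetical class and method names, is a project type that declares the jobs every Ruby app must run; Coach can then generate CI config from a structure like this.

    # Hypothetical shape of the Ruby contract; the jobs mirror the list above
    class RubyProjectType
      def jobs
        [
          { name: "test",        command: "bundle exec rspec" },
          { name: "lint",        command: "bundle exec rubocop" },
          { name: "integration", command: "bundle exec rspec spec/features",
            artifacts: "tmp/screenshots" },
          { name: "build",       command: "bin/build",
            artifacts: "build/" },  # shipped to a conventional location in S3
        ]
      end
    end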

So to enforce that contract, we wrote a CLI, that tool I mentioned before (Coach), and we wrote it in Ruby (Betterment is a big Ruby shop.)

What does the CLI do? How does it work?

It thinks of consumers in terms of project types. What’s your run time? Are you a library or an app?

If you’re deployable, you’re an app. Apps are simple things with predictable functionality. Apps are easy concepts to manage in your head and they’re easy concepts to manage in code.

Libraries are consumed by apps and we expect the apps to know how to manage that contract.

Our CLI knows that multiple projects of different project types can live in a repository. And it knows that sometimes there's just one project in a repo. It knows that projects use a primary language with a specified version for that language. It knows that a project will have to run tests and linters, and that the code will need to be compiled, zipped up, and built into a Docker container, because that's what Kubernetes expects.

The CLI also knows, since we told it, that the agent running all these things will be external to itself.

The CLI can’t schedule and run these processes itself. It’s just a CLI and in the time of writing it, it knows that CircleCI will be that agent, at least for now.

This is our Coach’s mascot and don’t worry, it gets seasonal outfits.

Onto the CI rebuild

So now that we’ve got the consumer contract down, let’s revisit how it helped us complete our CI rebuild.

Within CI, we know that we have a contract we need to enforce: build, test, and lint code in conventional ways.

But the DSL that contract requires is Circle's, which means that Coach will ultimately need to automate YAML generation in the format that Circle expects.

So how did we translate one DSL into another?

Well, we have objects in our code base for workflows, jobs, job steps, and of course, projects. We have concepts of ordering and of a dependency graph of jobs. All of these mirror the components found in Circle.

I know what you’re saying. “How can you build something so tightly coupled to a specific vendor? Can’t that be dangerous? How is that future proof?”

Well, it's a balancing act, right? Because the reality is that all of us use vendors for a lot of stuff. At Betterment, we're not a SaaS shop, we're a robo-advisor. We'll do what we do best and let the other folks do what they do best. And what we do best is writing opinionated, well-tested code.

Though we had one vendor in our sights at the beginning, in an effort to prevent lock-in, one of our primary goals was ultimately to be vendor agnostic.

Let’s dive into some code.

This is a high-level glance at how we structure the CLI code.

I’m gonna dive deeper into the CI config part of it for now. Our CI config handles managing all of those objects I mentioned before. It creates a workflow for a repo, which is composed of a series of jobs for each project in a repo, and those jobs handle things like building, testing, and linting code. The jobs are then composed of more granular steps.
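
A rough sketch of that hierarchy (the names, structure, and commands here are illustrative): a repo maps to a workflow, each project contributes jobs, and each job is a list of steps.

    # Illustrative object model: workflow -> jobs -> steps
    Workflow = Struct.new(:name, :jobs)
    Job      = Struct.new(:name, :steps)
    Step     = Struct.new(:name, :command)

    workflow = Workflow.new(
      "retail_app",
      [
        Job.new("lint",  [Step.new("rubocop", "bundle exec rubocop")]),
        Job.new("test",  [Step.new("rspec",   "bundle exec rspec")]),
        Job.new("build", [Step.new("package", "bin/build")]),
      ]
    )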

This is a pretty conventional way of handling task orchestration and other third-party services follow this pattern as well.

Let’s take a closer look.

Look at all the steps. Steps are the lowest-level construct in Circle; they are the actions performed within a single job. They're the lowest-level construct in Coach too, but what task orchestrator doesn't have a low-level construct for a single task?

Nearly every CI service has this concept.

So mirroring what we saw in Circle didn't feel like we were getting into the vendor lock-in we were so trying to avoid.

This is an example of one of those steps in code.

They all pretty much follow this convention, and this is what I mean by “we're writing YAML with Ruby.”

With each class wrapping a single task, we can easily validate and test each component. If anything needs to change, that change set can be isolated to a single file. Each of these project types handles knowledge about what kind of jobs and steps it should support.
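
Since the slide isn't reproduced here, here is roughly what "a class wrapping a single task" might look like; the class name is made up, and to_circle_ci_dsl is our reading of the method convention mentioned later in the talk.

    # Illustrative step class: one task, its validation, and its rendered output
    class RubocopStep
      def validate!
        # Example validation: the project must actually depend on RuboCop
        raise "RuboCop is not in the Gemfile" unless File.read("Gemfile").include?("rubocop")
      end

      def to_circle_ci_dsl
        { "run" => { "name" => "Lint Ruby", "command" => "bundle exec rubocop" } }
      end
    end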

Inside Betterment’s Ruby app

So let’s take a look at the Ruby app’s construction.

It seems pretty simple, almost too simple.

What’s Java doing?

Well, Java apps are a horse of a different color or a goat of a different breed, or something. Our contract with Java apps required finessing, because it would take a lot more work to onboard legacy apps that defy convention. Java apps didn’t have a single framework they abided by and so convention didn’t outshine configuration.

Ruby apps (all of ours use Rails) have a lot of built-in convention. They made our lives easier from the beginning, which is why we chose to onboard them first.

Onboarding Java apps is really when we started to benefit from the value of an enforceable contract.

We could just tell our legacy Java apps, “Please quack like a duck, but I don’t care what you’re doing while you’re quacking.”

That’s ultimately what convention over configuration was for us. In reality, that means our Java apps respond to specific Gradle tasks to build, test and lint their code, which our platform expects them to respond to.

But within each task definition, they can be a little more flexible. This was to handle the fact that we have different apps that run on different versions of Java and use different frameworks.

But when we got our Java apps to quack, the contract itself turned out to be almost the same as that of Ruby apps.

If you think about it, both of these, Java and Ruby apps, they’re just executing the fairly conventional pattern of CI. You test, lint, and build your code.
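
Sketching the Java contract in the same hypothetical shape as the Ruby one shows how similar they end up; the exact Gradle task names here are illustrative.

    # Task names are illustrative; the point is the platform asks Java apps
    # for the same three verbs it asks of Ruby apps
    class JavaProjectType
      def jobs
        [
          { name: "test",  command: "./gradlew test" },
          { name: "lint",  command: "./gradlew lint" },
          { name: "build", command: "./gradlew assemble" },
        ]
      end
    end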

The implementation details, of course, contained the more interesting bits. So let’s look at the Ruby app’s lint job.

I know, this is a lot of code.

The steps we add that don't have a conditional on them (the ones with the big arrow) expose the basic set of opinions our platform upholds. If you glance real hard with a magnifying glass, you can see that we require Ruby apps to bundle, which is how they handle their external dependencies; to run Sopsorific steps, which validate the use of our in-house secrets management tool; to run RuboCop, which is a linter for Ruby; and then to store artifacts in CI.

Then we leave some additional features up to the consumer.

If you want to lint your style sheets, we support that. If you want to run Brakeman, we support that. These conditionals were hard fought. They weren’t quick decisions to ship something.
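
A condensed sketch of that lint job, with the option names invented for illustration: the platform's opinions are unconditional, and the consumer opt-ins sit behind conditionals.

    # The unconditional steps are the platform's opinions for every Ruby app
    def lint_steps(project)
      steps = [:bundle_install, :sopsorific_check, :rubocop, :store_artifacts]

      # Opt-in features; the predicate names on `project` are invented here
      steps << :lint_stylesheets if project.lint_stylesheets?
      steps << :brakeman         if project.run_brakeman?
      steps
    end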

Automating workflows

So how does a single job go from here to one of these?

This is probably 6,000 lines of YAML. We’re not gonna write all of that by hand.

So what do we do? We put all that code through a series of hand-wavy config generation steps, but this is a lot to read, so let's go through it line by line.

This class splits jobs into workflows: jobs like that list of build, test, and lint jobs that I showed you before.

This allows us to split up jobs into things like allow failure workflows and normal workflows, which gives us the flexibility to test things like different versions of Rails or Ruby, or different versions of Java.

Then we slip in some custom tasks, like handling changed-project awareness: if one project in a repo hasn't been changed in the PR, don't run that project's test suite. And then we include progress tracking and notifications before, during, and after the workflow completes.
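
Changed-project awareness can be as simple as something like this sketch (not necessarily Coach's exact implementation): diff the branch against master and keep only the projects whose directories were touched.

    # Sketch: only run suites for projects whose files changed in this branch
    def changed_projects(projects)
      changed_paths = `git diff --name-only origin/master...HEAD`.split("\n")
      projects.select do |project|
        changed_paths.any? { |path| path.start_with?("#{project.directory}/") }
      end
    end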

The juicy stuff happens right before we create the Circle config file. At this point, the config variable there on the far right is an array of concatenated objects that all respond to one method, to CircleCI DSL.

We take this config and pass it off to our visitor object, which then handles the CircleCI-ness of it all: actually taking our workflows, jobs, and steps and putting them in the right place, the way Circle expects.

This isolates a lot of the vendor-specific knowledge of our objects in one place.

Most CI vendors have concepts similar to Circle's, whether it be pipelines of stages or pipelines of jobs in subsequent stages. If we need to rip it out or swap it out to use another vendor, it won't be that painful. So we've got our dependency graphs set up, we've got all the CircleCI code in one big string, and then all we do is write it to a file.
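
The gist of that last step, heavily simplified and with illustrative names: every object responds to the same method, a vendor-specific visitor assembles the results, and swapping vendors means swapping the visitor.

    class CircleCiVisitor
      def visit(objects)
        # Each object already knows how to render itself; the visitor only
        # knows how to stitch the pieces together the way Circle expects
        objects.map(&:to_circle_ci_dsl).join("\n")
      end
    end

    # `config` is the array of workflow, job, and step objects described above
    File.write(".circleci/config.yml", CircleCiVisitor.new.visit(config))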

It would be easy to imagine supporting any backend in the future, and a successful migration to a different provider would be as easy as generating different files.

We built Coach with this kind of future in mind. We know vendor lock-in can be a pain, and flexibility and portability of our dependencies were integral to the development of this tool. So we had this list at the beginning of where we were at and what problems we were trying to solve.

Let’s take a look at it and see what Coach and our CI rebuild has helped us with so far.

Using Circle helped us a lot, helped us solve a lot of these problems right off the bat just as a more reliable orchestrator.

It helped us solve a lot of the high latency and flaky box problems we were seeing from Jenkins, but in addition to that, Coach helped us and helped our engineers make their apps follow the conventions of a 12-factor app because of the contracts we laid out and then enforced.

We haven’t tackled CD yet since some apps still rely on their hosts more than they should.

Our deploys are already faster because our CI pipeline has been so heavily optimized. Our engineers are more confident in CI because it’s actually more reliable, and we’ve already reduced onboarding costs significantly.

You don't need to add a Jenkins job and then add some more untested Bash scripts to run your tests. You can just add a new app to our platform with a single CLI command. We're still manually scaling, though. CI and the 12-factor migration helped externalize a lot of configuration, but we're not totally done yet.

Life after Coach

Coach was immediately interesting at the beginning because it made bootstrapping apps real fast. This was just one benefit from building out a super principled tool.

It took us a few months of development because well-tested, well-designed code takes time, but because of our principles and our engineer-first approach, we made it so you can run a single command and an app is ready to go.

But beyond just onboarding, our engineers could update their app's config, or even a whole repo's config, with just a single command, and they could do it themselves.

When we roll out new functionality or improvements, it’s simple to pull them in. Engineers can do their work and write their code, and not worry.

So how does this help us achieve our next set of goals, rebuilding CD?

Because this talk isn't just to tell you we reinvented the wheel like the rest of y'all and here's our hot take on how we wrote YAML.

These tools were built with the future in mind because while supporting legacy apps in their conventions, we gradually introduced more and more opinions into the platform and gradually asked our engineers and ourselves to follow our newly established contracts.

We changed the way our engineers think about developing code and we changed the way that we think about developing our platform.

Let’s talk about how we managed to gradually migrate all of these baby goat legacy apps that had varying dependencies over to the new world.

Do you remember that space farm I talked about at the beginning? We’re almost there, but not everything could just be supported on Day One on a new platform.

Like I said before, you can’t shoehorn in a solution. We have all these legacy apps and some of them have unconventional dependencies.

How do you support unconventional dependencies without necessarily polluting a new code base, without making it support something we definitely don’t wanna support long term?

And we asked ourselves, “What’s the best format for a dependency that lives outside of the code itself?”

An app. I’m sure you might argue it’s a process, but when you think about how to package and deliver it to be consumed by other apps or by people, then the concept of deploying a process as an app becomes a more salient metaphor.

Some additional considerations…

What we decided to do was treat external dependencies that didn't easily fit into the platform quite yet (remember, we're not shoehorning anything in here) functionally as apps. If it quacks like an app, it's an app.

When everything is an app with its own application resource definition, you can build it, you can test it, you can deploy it. It can be managed in code, and it can be managed during this migration period where the platform doesn't quite know how to support every one-off snowflake dependency.

This is pretty similar to our pattern of establishing a contract for CI.

When you ask a Java app to run a Gradle task like prepare-test-db, it can do that, but that task can encompass any manner of sins. We don't need to know what it does.

Just like with CI, anything that can quack like an app can be a project type and could be supported on our platform.

We built a tool that supports any number of project types. The separation of concerns is semantically rich.

It ultimately gave us a lot of flexibility when we needed to support what might not conventionally be considered an app as we migrated from old to new.

A couple examples that we need to support in the evolution of our platform: database migrations for a shared database, and database monitoring as a deployable unit.

What makes denoting them as apps so effective?

It allows us to automate their existence in the world which is one step better than manually clicking in the console or running a one-off script.

It allows everything to be thoroughly tested before it goes out into the wild. It reduces risk and centralizes concerns.

So I have mentioned before how our principles shaped the development of our own tooling, and this is most obvious when you consider how our CI tooling, the first aspect of our platform that we built out, informed our CD tooling.

This is when we took the concept of “An app is an app” further. This is when our platform gets more solidified, when we take what we’ve learned in CI and grow out our CD pipeline.

So now that we have these hardened constructs of project types, we can design our deployment pipeline with a solid foundation underneath.

Rudder is our other CLI, written in (you guessed it) Ruby, that takes these notions of apps and decides how to deploy them to Kubernetes. And that's what we've been building towards: Kubernetes for everybody.

Our CI pipeline builds our apps into a format that Kubernetes knows how to deal with: Docker images. That's what Coach was building towards all along: a world of 12-factor apps to be deployed on Kubernetes. And because Coach knows so much about the apps, its configuration has become the guiding light for Rudder.

Infrastructure as code: Terraform and Helm

So what language or tool do you use to keep infrastructure in code?

That’s a trick question. It uses all of this.

I’m just not gonna tell you you’re wrong here, but I’m only interested in Terraform and Helm here. Ansible has its own place, but we only use it for image building right now.

Within Rudder, we wrote Terraform with Ruby. We managed Helm charts with Ruby. This allowed us to test what kind of Terraform resources would be created and under what circumstances. This allowed us to assert against Kubernetes’ resources when they were created, kind of like smoke tests.
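
"Terraform with Ruby" can be pictured like this (the bucket resource is purely illustrative): Terraform also accepts JSON, so you can build resources as plain Ruby hashes, assert against those hashes in tests, and write out a .tf.json file.

    require "json"

    # Terraform resources expressed as data; easy to assert against in a spec
    resources = {
      "resource" => {
        "aws_s3_bucket" => {
          "build_artifacts" => { "bucket" => "myorg-build-artifacts" },  # illustrative
        },
      },
    }

    File.write("main.tf.json", JSON.pretty_generate(resources))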

I know what you’re saying: you can write all this tooling in any other language besides Ruby, but do you see a trend? Keeping our tooling in a language that our engineers use has the added benefit of making the barrier to entry much lower.

If you have a project type you need to support, you can make that case and build and support it yourself (with a little help from an SRE).

Let’s look at some Rudder code. This is the entry point to Rudder. It relies explicitly on the existence of the Coach input file: the Coach config definition.

Here, Rudder asks an app about its Coach config, which also provides a deploy context. It takes that and then knows how to deploy the app based on its project type.

We can then translate these project types into Helm charts fairly easily, and in our deploy process, we use Ruby to pass in the correct values, gleaned from the Coach config and the deploy context, and install those Helm charts.
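
A rough sketch of that flow, with hypothetical file and chart names: read the Coach config, pick a chart for the project type, and hand the values off to Helm.

    require "yaml"

    # ".coach.yml" and the chart paths are invented names for this sketch
    coach_config = YAML.load_file(".coach.yml")
    chart = coach_config["project_type"] == "ruby_app" ? "charts/web-app" : "charts/generic"

    system("helm", "upgrade", "--install", coach_config["name"], chart,
           "--values", "deploy/values-#{ENV.fetch('DEPLOY_ENV', 'staging')}.yaml")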

And there you have it: well-tested, automated app deployments thriving already on Kubernetes.

The evolution of Betterment’s infrastructure platform

So how does this evolve?

When you have things that already exist, databases for example, or a bunch of snowflake dependencies, you can’t just create and manage those resources right away on the platform.

And so, considering dependencies as apps is how we moved forward. A Ruby app could be deployed. Its database monitoring could be managed as a separate deployment. Its migrations could be executed independently as yet another deployment.

But as our tooling evolved, we had to reconsider our path forward.

What was so great about building out this part of the platform was how these project types and Helm charts, and the flow of communication between them, showed us that perhaps not everything is an app.

It allowed us to see that we can hoist these migratory features that were previously apps and promote them into dependent resources entirely managed by the platform.

Where once you had a bucket of interdependent apps you were deploying, you can now take that externally managed resource and reconceptualize it as an internalized resource, or attachment, managed by the platform.

Because long term, it turns out that not everything is an app.

Binding one app to another component in a meaningful way when an app isn’t a useful standalone thing is a necessary requirement for our platform.

An app is an app during migration, but ultimately it needs to be managed by us as a supportive resource to the primary code that requires it.

Part of the reason for promoting these one-off apps to be attachable resources was to make our platform easier for our engineers to use.

We know what kind of dependencies to manage because they’re declared explicitly in the Coach config. So that support becomes implicit and automated.

This is also one reason for relying on an app’s Coach config file as a singular, configurable entry point to our platform. We want to make onboarding simple and easy, to make the barrier to entry as low as possible, to make it as obvious as possible what kind of world you’ll get based on what you declare in this file.

And when you have a new project type that we don't support yet, onboarding is pretty simple: you define a contract and enforce it by adding support in the Coach and Rudder code bases. And once that's done, onboarding is as simple as running one command.

An engineer can create a new app and put it on our platform by running just two subsequent CLI commands, and it works.

Distributing at scale

How do we distribute this at scale? Scale is the reason we’re here.

The evolution didn’t happen overnight. We got our engineers hooked and so, we had to come up with a reliable way to release any new hotness.

For CI, we built a lot of our functionality into the images used by CircleCI. So all we need to do is build and tag new Docker containers with our Packer-based image building system.

For local development and configuration updates, we use Homebrew to release new versions of Coach. Once our engineers saw how delightful our platform was, they were happy to oblige and onboard themselves.

Ease of distribution means ease of adoption, which means we can start making assumptions about how people write code at the company, which means we can continue building on our platform knowing that we have consumers abiding by the conventions and contracts we established at the beginning.

When you make it easy to onboard new apps and you make it easy to build and deploy your apps, then it makes taking apart that monorepo a little simpler. Monoliths are easy to manage at scale when they’re much smaller.

If the business domains no longer necessitate sharing space in Version Control, then they shouldn’t do that. Sharing space increases risk and decreases velocity.

If you have dozens of folks contributing to one repo, but all they have in common is a shared database, you need to rethink your org structure and your code structure.

So did we subvert it? I know you all are gripping your chairs with anxiety. The answer is yes.

All of our tooling allowed us to do just that.

We were able to dismantle several repositories and isolate their components into smaller repos managed by the teams that actually needed to manage them.

We did it and we’ve made our team faster. Even as we are building things incrementally, we are focused on delivering value to ourselves and to the rest of the engineering team.

Where is Betterment now?

All right. So let’s take a look at where we are now.

Our checklist is all in bold which generally means we accomplished everything we had our sights on.

The biggest win was onboarding new apps to the platform in seconds and reducing time to deploy, with CI as quick as 10 to 25 minutes, and all of this includes our trading system as well.

That’s pretty impressive. Remember, this is the pipeline at the start. This is what we have now.

User-triggered configuration generation and onboarding via Coach.

GitHub webhooks trigger predictable, fast, parallelized workloads in CI. Coach Web handles successes or failures from CI, Slacks folks, and then updates their PRs. No more scrappy Bash scripts.

All of these interactions are documented in code and tested.

When CI completes, Coach knows how to trigger a deploy. And guess what we've done with Jenkins? We're still using Jenkins, but now it just does a lot less: no more Bash.

We've automated Jenkinsfile pipeline configuration generation. Those repo-level files include Coach notifications and Rudder CLI executions.

This is how we deploy our apps to Kubernetes. We release a new version of Rudder, and because Jenkins has Rudder available to it, it just has access to any new features right away.

Some takeaways…

So what did all of this teach us? What did it teach me? The goats wanna know.

We didn’t go in thinking an app would be the ideal solution. The development of our tooling revealed to us that this is the simplest way forward even if we knew it wasn’t going to be permanent.

Automation proved to be our friend in every respect throughout this experience. Folks go on and say all the time, “You have to calculate ROI when you're automating something. Does it take longer to write the code than it does to do the task?”

But I think it’s a bunch of baloney.

Even if a single task takes only a little bit of time, those single tasks accumulate over time. Deliver value as soon as you can. Remind yourselves and your colleagues why you're building what you're building.

Make folks excited for the next thing. You gotta build that hype.

I know that if your end goal is clear and your road is paved with principles, you can keep iterating until you find the right solution for your problem.

It’s like the future is already here.

And though we’re here in the future, we at Betterment will continue looking forward, building support for new languages on our platform.

We will continue building out our tooling because for now the work’s not over, but we do have a clear path forward which I think is the next best thing.

Thank you.