How Small Changes Can Make Huge Waves (Braze) | Datadog


Published: July 17, 2019


Thanks, everybody.

So today I’m gonna talk about how small changes can make huge waves, or really more accurately how you can use small changes to make huge waves.

We’re going to talk about deploying new infrastructure, deploying new features, and doing it in a really controlled manner using really small changes at a massive scale.


So, one thing you have to know about me first is that I love performance.

I love high-performance cars.

I drink a ton of coffee and track my time religiously, and I have to know how my applications perform.

And this graph up here is just an example of that, something that I find really interesting: this is from when we upgraded our API from Ruby 2.2 to Ruby 2.5.

So, the graph we’re looking at here shows a really small change, and the visible change isn’t too large, but we’re looking at the memory usage of one of these processes on Ruby 2.2 (this chunk) and on Ruby 2.5 (that chunk over there).

It turned out that making that change is saving us something like $60,000 per month or more in server costs, which basically allows us to scale so much better because we’re using just a little bit less memory.

And those are the sorts of changes that I get really excited about.

What is Braze?

So, like I said, I’m Zach McCormick.

I’m an engineering manager at Braze on the messaging and automation team.

We work on parts of our product related to campaigns, canvas, segmentation, things of that nature.

So, what is that product?

What does Braze do?

So, I’ll get out of the way here.

Braze empowers you to humanize your brand, and that helps you build customer relationships at scale.

And that means we want to get away from the world of “Hello, first name, last name” weekly email blasts that you’ve been used to since the year 2000, and toward a deeper customer relationship that’s based on data and on how you personally interact with these brands.

We do that by providing our customers with the infrastructure in order to send scheduled, triggered, and API-driven messages across multiple channels such as push, in-app messages, content cards, email, and more.

You’ve probably gotten a message that was powered by Braze before.

You might have gotten a push notification from Seamless or an in-app message from iHeartRadio talking about maybe your favorite band or your favorite radio station, or you might have gotten an email from Lyft with receipt information about your ride.

Things like that are all sent by Braze.

And it’s very exciting to work on this sort of thing, because it requires a huge investment in scalability to be able to do this at a massive scale, and to do it quickly.

How Braze works

So, how does it all work?

On the right, you can see a screenshot of our dashboard.

So, that’s what our customers, the marketers for these brands, use to craft their campaigns and canvases, create their segments, create news feeds, and more.

They integrate with our SDK, and they use our REST API in order to provide us with that information about how you’re interacting with their brands.

So, the SDK in their application may be sending things such as session starts, purchases, or custom events, and they may be sending transactional messages using our REST API.

This data allows us to do real-time segmentation in order to send personalized messages.

But the dashboard here, that’s just the tip of the iceberg really.

Let’s see what it looks like underneath the iceberg.

So, this is a really big diagram of what that pipeline looks like underneath the iceberg, on the bottom half.

First, data goes in at the top.

So, again, that’s data coming from your mobile application.

That’s data coming from your website, from your external systems, and it goes through this pipeline where we go through data ingestion, classification, orchestration, personalization, and finally to action where we may be sending out webhooks to your own internal service, or we may be sending out emails, content cards, SMS messages, push messages back to your users.

We then take all of the interaction information that happens with that.

So, that could be email opens, it could be clicking on a button on a push, and we feed it back into the top.

And that’s useful for remarketing, for tracking conversions, and understanding how your relationship is working with those customers.

We also have a product called Currents that allows you to sort of plug into Braze at the top and the bottom.

And you may be feeding additional information into the top of the funnel, or you may be pulling information out of the bottom of the funnel in order to plug it into one of the people in our partner ecosystem.

Behind the scenes…

So, all those different pieces, how is it powered?

So, it’s powered by a job-queuing system or a worker-queue model, and that gives us resiliency and scale.

And basically, we create those jobs from (a majority of the time) API calls.

So an API call comes in, it gets validated, so we make sure that it’s something that we should be getting or something we should be processing, and then it gets turned into a job and put on the top of a job queue.

We run 8 billion of those jobs every single day.

And the job queue, it’s not really just one big job queue, we actually use quite a complicated system that we call dynamic queuing in order to even the load, prevent starvation between customers, and all kinds of other things.

But for now, just imagine that it’s a single queue or a few queues that thousands of workers are constantly pulling jobs off of and executing.
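As a rough sketch of that worker-queue model, here is a minimal Ruby version: jobs go onto a queue, and a pool of worker threads constantly pulls them off and executes them. All the names here, and the stand-in “job” that just doubles a number, are illustrative, not Braze’s actual code.

```ruby
queue   = Queue.new
results = Queue.new

# Enqueue some jobs (in production these come from validated API calls).
10.times { |i| queue << i }

# A small pool of workers pulls jobs off the queue until it's empty.
workers = 4.times.map do
  Thread.new do
    while (job = (queue.pop(true) rescue nil))
      results << job * 2 # stand-in for real job execution
    end
  end
end
workers.each(&:join)
```

Scaling up or down is then just a matter of running more or fewer workers against the same queue.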

The benefits of the job queuing model

So, why do we use this model?

We use it because it’s both resilient and scalable.

So, in terms of resiliency, we use a Ruby, asynchronous job-processing framework called Sidekiq.

Sidekiq gives us a whole host of functionality, but it gives us two things specifically that I’d like to highlight, and that is reliability queues and retry sets, two things that are extremely helpful when you want to make sure that every job gets run on time and doesn’t get lost through the cracks.

So, reliability queues are a pattern that, if you’re not already using it, you should definitely look up and implement at your own organization: if a process fails, or you have a hardware failure (the dreaded email you get from Amazon that says this instance is being terminated), and a worker had pulled some work off of one of those queues, that work doesn’t necessarily get lost.

You have a sentinel process that goes and checks what work the workers have taken, and any lost work gets put back into the pool of work to work on.
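A minimal sketch of that reliability-queue pattern, simulated here with plain Ruby arrays (in production this is typically built on an atomic move between Redis lists, so a job is never in limbo between the pending queue and a worker’s in-progress list):

```ruby
pending     = ["job-1", "job-2", "job-3"]
in_progress = []

# A worker atomically moves a job to its in-progress list before running it.
job = pending.shift
in_progress << job

# ...the worker crashes here without finishing the job...

# A sentinel process later scans the in-progress lists of dead workers
# and pushes any orphaned work back onto the pending queue.
pending.unshift(*in_progress)
in_progress.clear
```

After the sentinel runs, "job-1" is back at the front of the queue and nothing was lost.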

It also gives us retry sets.

So, retry sets are another high-availability sort of feature that Sidekiq gives us where if a job fails for some other reason, more of an application reason, such as a database timeout, like you didn’t get information back in time, or you have some kind of an application error, it’ll go ahead and put that at the bottom of the set so that that work can get retried at a later time, meaning that you can respond to a blip in database connectivity or something like that.
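The retry-set behavior can be sketched the same way: a failed job gets a future retry timestamp, and a poller re-enqueues it once that time passes. Sidekiq keeps these in a Redis sorted set scored by retry time; the backoff formula below is only a rough approximation of its default, and the job names are made up.

```ruby
now = Time.now.to_i
retry_set = [] # [retry_at, job] pairs, kept sorted by retry_at

fail_job = lambda do |job, attempt|
  backoff = (attempt**4) + 15 # rough exponential backoff per attempt
  retry_set << [now + backoff, job]
  retry_set.sort_by!(&:first)
end

fail_job.call("send-email-42", 1) # e.g. a database timeout on attempt 1
fail_job.call("segment-7", 0)     # first failure, short backoff

# A poller periodically moves due jobs back onto the work queue.
due, retry_set = retry_set.partition { |at, _| at <= now + 15 }
```

Here "segment-7" is due for retry 15 seconds later, while "send-email-42" waits a bit longer, so a blip in database connectivity gets absorbed instead of losing work.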

But aside from resiliency, the job-queuing model is also very scalable, and that allows us to deal with uneven workloads really easily.

If you’ve ever worked in the marketing automation space, you know that marketing automation is full of uneven workloads.

We joke around about like the 12:00 blast, the 3:00 blast, the 5:00 blast, because a lot of people sort of time their campaigns around particular times of the day.

And using the job queuing model, we’re able to scale up and down to both save costs in periods of low traffic and still crush the queue as fast as we can.

It also allows us, when combined with our queuing model, to do per-customer scaling, which is really important: customers expect that their messages are going to send as quickly as they’re used to all the time, so you need to be able to scale up and down on a per-customer level so that customers aren’t fighting with each other for infrastructure.

So, I keep saying that this is massive, but let’s look at some numbers.

How big is it really?

Customer numbers and statistics

So, we have about 11 billion user profiles.

And you might say that’s more than there are people on earth, I think, or at least last time I checked.

And that’s true.

But we have 11 billion user profiles because every user profile represents a distinct user for each one of our customers.

So, if you were to take out your phone and look at all the apps you use, I promise that several of them use Braze to send messages, and you have a user profile for each one of those applications.

With those 11 billion user profiles, we’re running 8 billion Sidekiq jobs per day, like I mentioned earlier.

Those are for things like segmentation, sending messages, analytics events, and processing data, and again, those run 8 billion times every day, so we’re doing tens of thousands of these jobs per second.

Moreover, we process over 6 billion API calls per day.

The majority of that comes from our SDK API, so that’s processing data that’s coming in from devices.

So, again, people clicking on things, people opening messages, people interacting with those applications.

Combined on top of that, you’ve got transactional messages and things that our users want to schedule from their own internal systems.

So, 6 billion of those per day.

And among all of that, that ends up with us doing somewhere around 350,000 Mongo I/O operations per second, which is just an absolutely tremendous amount of I/O.

It comes out to like 30 billion operations per day.

So, there’s really a lot of scale here, and so I think it’s really important to think about how to do things in a tiny way, so that we can make any changes to this system in a very controlled way. Because if you’re down for even seconds, all of a sudden you’ve got tens of thousands of jobs, tens of thousands of API calls, and hundreds of thousands of Mongo queries that aren’t happening, or are failing, or are doing the wrong thing.

Scaling your infrastructure and systems

So, today we’re gonna talk about deploying at scale, and we’re gonna talk about two different things.

First, we’re gonna talk about deploying new features or improvements to existing features, and we’re going to talk about frequency capping, which is one of my favorite features that we do.

We’ll talk about what it is, how it worked, and how it works now, as well as the deployment strategies we used to release a recent performance improvement.

We’ll also talk about deploying new infrastructure for a feature called Content Cards.

We’ll talk about why we needed a new database: our primary data store is Mongo, but in this case, we chose Postgres.

So, we’ll talk about why we chose Postgres, and then we’ll explain how it got rolled out.

What did we do to make that all work?

Feature flags/flippers

But the first thing I want to talk about because I think it’s so crucial to this talk is feature flippers.

You might know them as feature flags, same idea.

You just basically want some kind of an on-off switch so that you can turn things on and off at the press of a button, the flip of a switch, the loud hammering of me on my ENTER key.

And that allows us to turn features off and on for specific companies.

We can turn it off and on for specific clusters.

We can turn it off and on for specific regions, and we use it for everything.

We use it for new features, we use it for improvements to features, we use it for performance enhancements, literally everything.

So, here’s a screenshot.

I really wish I offset this a little bit.

Here’s a screenshot of our feature flipper internal tool.

And I think a really crucial part of this tool, especially around performance improvements, is that we’ve got a “set the percentage” feature flipper, which lets us take a feature or a performance improvement and say, “I want this to run 10% of the time or 20% of the time, and use the old behavior the rest of the time.”

And this percentage rollout is really important for performance testing new things because testing in staging or testing in some kind of a load testing environment can be a great start, but it’s never gonna be the same as production traffic.

And so, this allows us to do it again in a very small way and in a very safe way.
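One common way to implement that kind of percentage rollout is to hash a stable key into a bucket from 0 to 99, so a given company stays consistently in or out of the rollout as the percentage grows. Here’s a hedged Ruby sketch; the function and feature names are illustrative, not Braze’s actual implementation.

```ruby
require "zlib"

# Returns true for roughly `percentage` percent of companies.
# Hashing on feature + company id keeps each company's bucket stable,
# so a company enabled at 10% stays enabled as the rollout grows.
def feature_enabled?(feature, company_id, percentage)
  bucket = Zlib.crc32("#{feature}:#{company_id}") % 100
  bucket < percentage
end

feature_enabled?("fc_aggregation_v2", "company-123", 0)   # always false
feature_enabled?("fc_aggregation_v2", "company-123", 100) # always true
```

Flipping the percentage up or down is then just a config change, with no deploy needed.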

So, an example of something that we rolled out with that way is the “UserCampaignInteractionData” collection.

It’s a mouthful.

But basically, it’s a separate document that’s a roll-up of every single user profile.

So, we have 11 billion user profiles now; well, we probably don’t have 11 billion of these documents quite yet, but we’re getting there.

And it actually copies some of the same data that’s written on a user profile, but in a different way using arrays instead of sub-documents, which we’ll look at in a minute.

And we slowly rolled this out starting at 1% for a few companies, 2%, 10% until we were finally at 100% for all companies to make sure that there were no hiccups, no kinks in how we were doing this.

And it was really important that we did this because we worked with our DBAs (ObjectRocket) to basically make sure that there was nothing on fire while we were doing this, right?

If you’ve ever worked with something at a high scale like this, you’ll know that you’ve got connection issues you can run into.

You can run into hot shards where you’re making all of your writes to one particular shard, tons and tons of issues.

So, it was a great feature to have this percentage rollout available for this.

Frequency capping

So, this collection is actually used by frequency capping, so we’ll get back to it in a minute.

But what is frequency capping?

It’s the first feature we’re gonna talk about that we released a performance improvement for.

Basically, it’s a global setting that allows marketers to set a message limit for a particular time window for a particular channel, and it basically fixes this problem, right?

So, this is like…you’re familiar with this, right?

Where you just have like tons of notifications, and I bet most people just sort of swipe all these to the right and get rid of them and hope that if it’s important, it’ll come back another day.

Well, this is something that marketers want to avoid because they obviously want you to engage with those notifications, and they also don’t want to send you too many.

So, they use frequency capping in order to limit the number of messages that you get in a certain time window, so messages per hour or messages per day, per week, etc.

It’s actually really I/O- and bandwidth-intensive to process frequency capping rules throughout our messaging pipeline, though, but customers get a lot of value from it.

So, we keep promoting it, because they’ll often see huge differences in their conversions and in the results that they’re looking for when they use frequency capping.

I thought it was a really good thing to optimize, because this is a view of basically one of the jobs in our pipeline that does frequency capping, and you can see that it’s counted by jobs, not by messages.

So, bear with me on the number, but about 25% of our messages are sent with at least one frequency capping rule governing them.

That comes out to something like 350 million messages per day.

It turns out this takes thousands of hours of CPU time to process, so that’s something where we might be able to make a good optimization here.

So, last thing about the feature before we jump into the nuts and bolts, this is what it looks like to our marketers when they set them up.

So, again, it looks really simple, right?

We’ve got send no more than five campaigns to a user every week, no more than one push notification every two days, and no more than two emails every week.

Really easy to set up, but really complex under the hood.

Braze’s messaging pipeline and architecture

How does this fit into that pipeline?

So, here is my super fun, very high-quality diagram of how our messaging pipeline works.

Some details are removed, but you’ll get the gist of it.

It all starts with audience segmentation.

So, audience segmentation is something that we leverage MongoDB’s sort of massive parallelism for, hence the multiple little boxes there.

And we run all the segments and filtering queries so that we get the audience that the customer is looking for for a certain campaign or for a certain canvas.

Those users get passed into this big business logic step, that’s the big green box there, and that big business logic step runs lots of different things.

That could be distribution for multivariate testing.

That could be volume limits and rate limits for a campaign.

It could be variant selection or channel selection, tons and tons of stuff.

But this is where frequency capping occurs, right? Smack dab in the middle.

And out of all the different things that happen in here, frequency capping is actually one of the least performant things in this whole process.

And even though we use the only operator in Mongo to pull back certain fields and try to minimize the amount of data we pull back, this part of the job is really a performance killer.

And unfortunately, we have to do it twice.

So, we have features such as optimal-time notification or local-time send, where we’ll send messages at the user’s local time.

So, I might get a push at 4:00, and then if you’re on the West Coast, you’ll get a push at 4:00 Pacific Time.

And so, that means there’s a time gap between when some of this stuff runs, so we go ahead and check that frequency cap again a second time.

So, we have two places in our pipeline where we do this job, so again, it seems like a fantastic place to optimize.

Once we’ve done that, it’s pretty simple after that, we send the message, and then we write back analytics.

So, user got the message, add one more to the counter for how many we sent for this campaign, etc.

But we’ll dive more into the frequency capping part of this.

So, the original design.

The original design uses our users collection.


So, this is an example of a user document or a user profile.

You can see it’s got things like first name or last name or email, the standard stuff.

We’ve got some custom attributes, Twitter handle, the fact that I really love coffee.

And we have this sub-document that I was talking about earlier called “campaign summaries”.

So, campaign summaries is, as you might guess, a summary of every campaign that this user has received, with timestamps for events like the first time they received it or the last time they received it, the last times they’ve interacted with it, and counters for how many times they’ve received it.

And this is the information that we were using to do frequency capping.

We were having to pull all of these summaries from our database into memory in order to apply business logic to it.

Optimizing by user group

So, knowing that that’s what we’ve got, we’ve got to pull those user profiles down.

Let’s take a look at the algorithm.

So, first after that massively parallel segmentation step, in each one of those business logic steps, we’re going to pull in all of our eligible users for that step, and we’re gonna do a MongoDB query on that users collection to pull all of their profiles in.

Now, every document has a 16MB cap.

Most of our user documents aren’t this big, but we do have some that end up quite large, so you could really be pulling in a tremendous amount of data in this step.

So, you’ve got to wait for the database to get it and then you’ve got to wait to actually receive that data.

So, it’s not necessarily all that fast.

And if you look through our Git history, you’ll notice that over time we’ve slowly been changing that batch size, from 100 users at a time to 50 users at a time to 10 users at a time, and to me that’s a really good code smell that you should probably rethink how you’re implementing this feature.

So, that was one of the other signals that let us know we should try this again.

But once we have those user profiles, we go over every rule, we remove all the ineligible campaigns.

Those are often like receipts or transactional messages like “Your food has arrived” or “Your ride is here.”

You don’t want to frequency cap those.

And we count the number of campaigns we’re left with, and then we check the rules, and we say, “Does this violate the frequency capping rules or not?”

And if it violates the frequency capping rules, we remove that user from the set of users that will move on to the next step.

So, yeah, the remaining users move on.
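Putting those steps together, the old algorithm can be sketched roughly like this in Ruby. The field names, rule structure, and sample data are all made up for illustration; the real implementation works on full Mongo user documents.

```ruby
Rule = Struct.new(:channel, :max, :window_seconds)

# Does this user's receipt history violate any frequency capping rule?
def violates_caps?(profile, rules, transactional_ids, now: Time.now)
  rules.any? do |rule|
    received = profile[:campaign_summaries].count do |campaign_id, summary|
      next false if transactional_ids.include?(campaign_id)    # skip receipts etc.
      next false if rule.channel && summary[:channel] != rule.channel
      now - summary[:last_received_at] <= rule.window_seconds  # inside the window?
    end
    received >= rule.max
  end
end

now = Time.now
profile = {
  campaign_summaries: {
    "c1" => { channel: :push,  last_received_at: now - 3600 },
    "c2" => { channel: :push,  last_received_at: now - 7200 },
    "c3" => { channel: :email, last_received_at: now - 60 },
  },
}
rules = [Rule.new(:push, 2, 86_400)] # at most 2 pushes per day

violates_caps?(profile, rules, [], now: now) # => true, so this user is removed
```

The logic itself is simple; the cost is in pulling whole profiles like this one into memory, in batches, just to run it.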

Resource limitations and scalability

So, the UI for this is pretty simple, and really the process isn’t terribly difficult, but it’s high-bandwidth, and it could be pretty slow.

And the more advanced a customer you are, the more rules you tend to have, too.

So, it just makes it more and more taxing on our system.

So, some problems with this, like I said, user profiles can be huge.

We do them in batches of 100 or 200 or 500, and it’s just a lot of network I/O, and it’s actually a ton of RAM usage too, so it’s not super-fast.

And we already use pretty big VMs.

So, we’ve got, you know, gigabytes and gigabytes of RAM, but we still see issues with this.

In fact, sometimes you’ll see things like jobs that’ll take 30 or more seconds to run, which if you remove the frequency capping part would be done in, you know, 200 milliseconds.

So, again, obviously something that we can work on, and we can fix.

That flame graph there shows a trace actually of one of those jobs, and you can see the little purple bit right there.

Those are the Mongo queries that are only being used for frequency capping.

So, the goal is basically how do we get rid of that?

A little code example down there.

Moreover, we’ve got the Ruby runtime to deal with, right?

So, I said we use Ruby, we use Sidekiq.

Ruby is not exactly known for being the most memory-efficient language you can use, and it turns out that when we do this process, we end up using gigabytes of RAM.

We have to ask the OS for tons of RAM to keep all that stuff in memory.

And we end up having things like this, which is a nice little garbage collector trick: we nil out some variables, and we tell the garbage collector, “Hey, there’s garbage. You should probably go collect that and reclaim that memory.”

And again, that’s a really good sign that you should probably rethink the design.
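For reference, the garbage-collector nudge being described looks roughly like this:

```ruby
# Allocate a big batch (stand-in for a batch of user profiles).
user_profiles = Array.new(100_000) { { data: "x" * 100 } }

# ...frequency capping work happens here...

user_profiles = nil # drop the only reference, making the batch collectible
GC.start            # ask Ruby to reclaim the memory now, not later
```

It works, but needing to hand-hold the garbage collector like this is the smell the talk is pointing at.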

So, that’s what we did.

We redesigned it using the aggregation pipeline in MongoDB.

This is a massive change to how this feature operated, and we had to figure out how to roll it out slowly.

So, we’ll go into that here in a minute.

But the goals for this were less network I/O and less RAM usage.

Those are the two primary goals, right?

We knew the network I/O was the bottleneck (it was very slow), and the RAM usage was causing out-of-memory errors.

So, you can see we’ve got some lovely out of memory errors in our server logs.

These are what I call soft OOMs, or out-of-memory errors.

These are the ones where our Monit watchdog processes were looking at our API server and saying, “Huh. You’re using too much memory. You might as well just kind of shut down, and we’ll just restart another one of you here and hope that that fixes the problem.”

And that’s not super fun.

It’s not a great pattern to deal with, but that’s the world we live in.

But these were also occasionally causing hard OOMs, right?

The ones where it actually crashes your process.

Ruby tried to do a malloc() and grab some memory from the OS.

The OS said, “Tough luck,” and it crashes.

So, always want to avoid those.

Moreover, I mean, we really just want to make it faster, right?

And we tried some micro-optimizations.

We tried things like more aggressively using the only operator to pull back only the fields we knew we needed, and it made things a little faster, but nowhere close to the results we really wanted.

So, we thought we should offload this work to the database.

Let’s see how much of this work we can actually make the database do, that way we don’t have to pull all this data back and do it ourselves.

So, when we looked at using the aggregation pipeline, we noticed that this wasn’t gonna work.

Our campaign summaries use a hash, or they use a sub-document.

They don’t use an array.

And the aggregation pipeline, as you may guess by the name aggregation, works really well with arrays of data and with lots of documents, and in this case, we have lots of keys inside of this hash.

And the resulting queries would have been gigantic.

So, we needed to use a new collection, and that’s where the collection we talked about earlier comes in.

I mentioned how we used the percentage-based feature flipper to roll this “UserCampaignInteractionData” document out across all of our clusters, and it turned out that it was really useful for the frequency capping use case as well as the originally intended use case.

Each one of these documents has an array of received messages in it, and it contains a timestamp, the name of the campaign as well as a dispatch ID, which are basically the three things that we need in order to do frequency capping.

We’re able to use the $elemMatch operator, a MongoDB operator that matches items in an array, in order to pull back an absolutely minimal amount of data, just enough to do this job.

So, here’s another kind of view of that document, right?

So, we have one of these fields for every channel.

So, it makes it really easy to do those multi-channel and single-channel frequency capping rules, right?

If we’re only looking at push, we can pull from just these two fields, the Android push and iOS push received.

If we’re looking at all campaigns, then we just pull all of them.

Regardless, even if we have to pull all of them for a certain rule, it’s much, much less data than pulling a full user profile.

So, side-by-side, you can see we still use the user profiles for segmentation and for all kinds of other purposes, but now we can use this much prettier, much nicer, well-structured document for this new use case.

So, how does that change our algorithm?

So, our new frequency capping starts with a match stage, where we match all the users in our current batch, and then we do a projection using the $filter operator to limit the time window and to exclude the current dispatch for multichannel sends.

So, you don’t want to frequency-cap a push if it has an email that was supposed to go with it; you want to make sure those both go out. And we exclude transactional messages again.

So, those are your receipts, “Your ride is here” type of notifications.

And an example of the results is on the right.

So, it turns out that is a ton smaller than pulling down those user profiles, but I think we can still make it smaller.

So, I can’t really see it, it’s behind me, so maybe you can on the screens, but we do a second projection, and we only bring back the dispatch IDs, since we already know the time is correct based on our query.

We don’t need to pull that back.

And this is like way smaller than a user profile.

So, it turns out we, like, almost eliminated our network bandwidth for this step.
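As a hedged sketch of what such a pipeline might look like, here it is expressed as the Ruby data structure you would hand to Mongo’s aggregate(). The field names (ios_push_received and so on), IDs, and the transactional campaign list are all illustrative, not the real schema.

```ruby
require "time"

window_start = Time.parse("2019-07-01T00:00:00Z")

pipeline = [
  # Stage 1: match just the users in the current batch.
  { "$match" => { "user_id" => { "$in" => %w[u1 u2 u3] } } },

  # Stage 2: $filter each channel's received array down to the time window,
  # excluding the current dispatch and transactional campaigns.
  { "$project" => {
      "ios_push_received" => {
        "$filter" => {
          "input" => "$ios_push_received",
          "as"    => "m",
          "cond"  => { "$and" => [
            { "$gte" => ["$$m.time", window_start] },
            { "$ne"  => ["$$m.dispatch_id", "current-dispatch"] },
            { "$not" => { "$in" => ["$$m.campaign_id", %w[receipt-campaign]] } },
          ] },
        },
      },
  } },

  # Stage 3: project down to just the dispatch IDs; the timestamps are
  # already known to be in range from stage 2.
  { "$project" => { "ios_push_received" => "$ios_push_received.dispatch_id" } },
]
```

The counting against each rule then happens in Ruby on an array of dispatch IDs, instead of on full user profiles.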


So, talking about deploying it, the fun part.

So, changing from a regular query to using the aggregation framework, that might be tricky.

It might be terribly underperforming.

We’d never actually used the aggregation framework in such a high-frequency way before.

We actually used it for statistics on like a per-campaign basis for the dashboard we looked at earlier, and we used it for all kinds of reporting and other things.

But we weren’t sure about it in this sort of high-frequency environment, so we need to be very careful.

We started at very low numbers, 1%, then 10%, and we only started on opt-in beta customers, because it actually changed the behavior ever so slightly, so we wanted to make sure we could do this with the minimal amount of interruption.

Over a month, we slowly rolled it out to 100% for all of those customers, and we wanted to make sure at this point that it was not only performant, but behaviorally correct as well.

So, we did a ton of testing, and we were looking at graphs, checking on all those things.

And once the behavior and the performance were confirmed, we rolled it out 100% across the board.

So, it turns out it all worked.

So, let’s take a look at some numbers.

So, here’s a graph that I took from Datadog, from the Trace Analytics tool on the maximum duration of one of the jobs that uses frequency capping.

And we can see on the top we’ve got V1, the blue line, V2, the purple line at the bottom, and in the worst case for the old way, we were hitting like about four minutes in this job.

So, when you’re trying to send messages out quickly, four minutes is not the kind of timeline you like to see.

The new worst case, and it’s still the worst case, so not too bad, is still under 60 seconds.

So, we’re doing good in terms of the worst-case scenario.

I think the more interesting part here though is the median duration.

So, taking a look at the median duration, again, we have V1 on the top at about 100 milliseconds, and we have the new version at the bottom at about 75 milliseconds.

That’s a 25% savings.

That’s pretty nice.

And I think it’s really important to take a look at things like this.

We were looking at graphs like this the entire time, when we had it rolled out at 1%, 10%, 20%. It’s important that you understand your performance when you’re deploying at scale; that’s where tools like this come in handy, and you really can’t do it without them.

Content Cards

So, we’ve gone over frequency capping and deploying a new feature using feature flippers and how we did that.

Let’s talk about deploying new infrastructure and talk about our feature called Content Cards.

So, what are Content Cards?

Content Cards are a highly targeted event stream delivered on a per-user basis, and the feature supports things like pinning and card expiration, plus analytics on individual cards: whether or not a user has scrolled past one, clicked on it, or interacted with it in some way.

It supports coordination with push messages, etc.

And our customers use it for all kinds of things, as you can see: we’ve got coupons, we’ve got onboarding flows, we have alerts and notifications, tons of use cases. It’s a really exciting feature.

But you have to ask, why did we need a new database for this?

Why couldn’t we use Mongo like we had used before?

Well, it turns out that it’s all because of write behavior.

So, each user, like I said, has this array of cards, and you’re delivering these cards on, like, a per-user basis.

And we need to be able to push lots of new cards at once when we dispatch a campaign.

And it turns out that writing a million cards to a million different arrays is a little difficult, especially if you’re using a document database, and we’ll see why.

This is my very technical diagram of how documents are laid out on disk in MongoDB.

If we were gonna write, let’s say, an array of Content Cards on every user, the write behavior would look something like this, right?

We’ve got this array somewhere inside of the space we have allocated for this particular document, and we append the card to that array.

So, we have to find every user and then append the card.

So, what about the case of user B?

So, user B there, they don’t have any more space to write that card.

So, what that means is, during this million-card write that we're about to do, we have to stop the world and take user B, whose document takes up, let's say, 128 kilobytes of space.

Well, we have to go find 256 kilobytes of space somewhere else on the disk, copy all of user B over there, and then now we can insert that card in the array.

And it turns out that even if you have to do that 1% of the time or less, that behavior is not something you want in a production system.

It could be very slow.
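To make that cost concrete, here's a toy Ruby sketch. It is not MongoDB internals, just an illustration of the idea: documents live in fixed-size slots, and appending past a slot's capacity forces a full copy into a bigger slot.

```ruby
# Toy model of documents in fixed-size slots on disk. Appending a card
# past a slot's capacity forces a "relocation": double the allocation and
# copy the whole document over, which is the stop-the-world path above.
class ToyDocumentStore
  Slot = Struct.new(:capacity, :cards, :relocations)

  def initialize
    @slots = {}
  end

  def create_user(id, capacity: 4)
    @slots[id] = Slot.new(capacity, [], 0)
  end

  def append_card(id, card)
    slot = @slots[id]
    if slot.cards.size >= slot.capacity
      slot.capacity *= 2      # find a bigger region of "disk"...
      slot.relocations += 1   # ...and pay for copying the document there
    end
    slot.cards << card
  end

  def relocations(id)
    @slots[id].relocations
  end
end

store = ToyDocumentStore.new
store.create_user("user_b", capacity: 4)
6.times { |i| store.append_card("user_b", "card_#{i}") }
puts store.relocations("user_b")  # => 1 (the fifth card forced a move)
```

Even one relocation per hundred users, multiplied across a million-card send, is a lot of copying at unpredictable moments.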

So, we took a look at Postgres, and we were able to do this basically same operation, but at a much higher scale using Postgres.

Essentially what you’re doing instead, I’ll switch spots again here, instead what you’re doing instead is writing to a table.

You’re doing basically an insert on append operation, and you’re writing all the card and its metadata or whatever you need for that and the user ID to a table.

So, it turns out that’s just a simple insert, and a B-tree update for the user ID index, which is, it turns out really predictable and really scalable.

It’s something that works, you know, if you’re sending 1,000 cards, if you’re sending a million cards or if you’re sending 10 million cards, it turns out that it’s pretty performant.

And how did we determine this was the case?

Well, we did it from small changes and testing, and looking at it, and testing something else, and looking at it again, and we did that with Datadog.

So, here’s a chart of API requests for Content Cards.

We’re constantly looking at this sort of data so that we can understand scaling and usage.

We can understand, let’s see, graph latency.

We can understand how fast things are working, so we can look at things like the average time, those nice small numbers at the bottom up to, you know, P95 or P99 times at the top.

And we can understand the answers to questions like, how does scale affect our speed, both in the average case and in the worst case?

That’s weird.

Not important information there, I guess.

And it also lets us figure out, you know, what numbers are dangerous numbers.

You know, when we do a 10-million-card send, is that when we start to see the latency skyrocket? Is it 20 million? Is it 30 million?

It lets us know, you know, all kinds of information like that which helps us build our product.

It helps us build guardrails in.
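One hypothetical guardrail informed by those dashboards, sketched in Ruby: once you know the send size where latency starts to climb, chunk larger sends so no single burst crosses it. The threshold and function names here are illustrative, not Braze's actual numbers or code.

```ruby
# Illustrative threshold: the send size beyond which the dashboards
# showed latency climbing. Not Braze's real number.
MAX_CARDS_PER_WRITE_BURST = 1_000_000

# Split an oversized send into bursts that each stay under the threshold,
# so every chunk can be dispatched as its own job.
def plan_card_send(user_ids, max_per_burst: MAX_CARDS_PER_WRITE_BURST)
  user_ids.each_slice(max_per_burst).to_a
end

batches = plan_card_send((1..2_500_000).to_a)
puts batches.size  # => 3 bursts instead of one 2.5M-card spike
```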

So, tools like Datadog are really useful for this, and Datadog specifically has been extremely useful, because we have views like this where we can look at the query that we're actually running.

We can see it in the context of the job itself, and we can identify the slowly performing queries.

We can identify places where we should be using indexes, where we should be making schema changes, and we can deploy those changes, and we can immediately see what happened.

We can immediately see how it impacted our system.

Moreover, going back to the feature flipper slide, if we use feature flippers and we tag everything correctly, we can actually look at how these changes affect performance, but only roll it out to 1%, or 2%, or 10% of our total workload for this feature.
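A minimal sketch of that kind of percentage-based feature flipper, with made-up names rather than Braze's actual implementation: hashing a stable identifier into a fixed bucket keeps each company's on/off decision consistent as the rollout percentage grows.

```ruby
require "zlib"

# Minimal percentage-based feature flipper. Hashing a stable identifier
# into a 0..99 bucket means a company enabled at 10% stays enabled when
# you widen the rollout to 20%, so the cohort grows without churning.
class FeatureFlipper
  def initialize
    @rollout = Hash.new(0)  # feature => rollout percentage, 0..100
  end

  def set_rollout(feature, percent)
    @rollout[feature] = percent
  end

  def enabled?(feature, stable_id)
    Zlib.crc32("#{feature}:#{stable_id}") % 100 < @rollout[feature]
  end
end

flipper = FeatureFlipper.new
flipper.set_rollout(:content_cards_pg, 10)
flipper.enabled?(:content_cards_pg, "company_42")  # same answer every call
```

Because the decision is deterministic, you can also tag your traces with the flag's state and compare the 10% cohort's performance against the other 90% directly in your metrics.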

Working with Datadog’s tracing library

Zach: One thing that was fun about this project was that I got to work a little bit on dd-trace-rb, Datadog's tracing library.

And it was a lot of fun to work on it because it turns out that they do support Active Record in Ruby.

So, if you have Active Record, you already have your SQL traces, but if you use the PG gem directly, you don’t quite have those.

And I actually reached out to some of the folks at Datadog and told them about my problem, and they basically pointed me to a bunch of other contributions and integrations that people had made, and it turns out that writing the integration to support the PG gem took about 4 hours and 200 lines of code.

So, I have to say, hats off to them, kudos to them.

It's a very flexible tracing library when you can go from not having any of this to getting all the detailed information we just saw with 200 lines of code and 4 hours of work.

So, that was huge.
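The real integration lives in dd-trace-rb, but the basic shape of an instrumentation like this is roughly the following: prepend a module that wraps query execution in a span. The tracer and connection classes below are tiny stand-ins so the sketch runs on its own; they are not Datadog's or the PG gem's actual APIs.

```ruby
# Stand-in tracer that records finished spans; the real integration would
# hand the span to Datadog's tracer instead.
class ToyTracer
  Span = Struct.new(:name, :resource)

  def self.spans
    @spans ||= []
  end

  def self.trace(name, resource:)
    result = yield
    spans << Span.new(name, resource)
    result
  end
end

# Stand-in for PG::Connection.
class ToyPGConnection
  def exec(sql)
    "rows for: #{sql}"
  end
end

# The integration itself: intercept exec and record a span around each
# query, passing the SQL through as the span's resource.
module PGTracing
  def exec(sql)
    ToyTracer.trace("pg.exec", resource: sql) { super }
  end
end

ToyPGConnection.prepend(PGTracing)

ToyPGConnection.new.exec("SELECT 1")
puts ToyTracer.spans.last.resource  # => SELECT 1
```

Using `Module#prepend` means the wrapper runs transparently: callers keep calling `exec` and get their rows back, and every query shows up as a span with the SQL attached.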

So, all that stuff aside, let’s look at the deployment of this feature as well.

So, like I said, we use feature flippers for this, but in this particular case we can’t use them on a percentage point basis, unfortunately.

If a marketer wanted to send 10 million cards, they don’t want just 1 million of them to randomly go to some random pile of users.

That’s not what you want here.

So, instead, we asked a pool of customers of various sizes, so some small, some medium, some very large, to participate in the beta of this feature with us.

And by rolling it out to our smallest beta customers first, we were able to try it out, look at the performance, see how it performed in real-world scenarios, go back, make some changes, test those again with feature flippers and look at what the changes did to performance, then roll it out to the medium-sized folks, iterate the same way, and then roll it out to larger folks.

And now today it’s generally available as a feature.

So, we offer it to all different kinds of people.

So, a really cool sort of deployment strategy, being able to do that sort of thing iteratively the same way that you would develop a product, right?

And that’s exactly what we did.

So, we looked at a new feature, and we looked at new infrastructure.

Let's talk about the lessons learned from all of this.

So, the first lesson is to roll features out slowly.

It’s so important to roll features out slowly when you’re operating really at any scale, but in particular at a massive scale.

You want to be able to support rollback, especially like instant click-the-button type rollback.

You don't want to have to redeploy new code, and that's where feature flippers have come in very handy.

You can't be too careful: if you make a big change and you just flip that giant Frankenstein switch, you may end up totally hosing your infrastructure, and that's not something you want to do, especially if you thought the change was a performance improvement.

The second takeaway is feature flippers.

And I think those are definitely important for any deployment and any rollout plan that you could possibly have.

Being able to turn things off and on at the flip of a switch can save your environment, your hosts, your whole infrastructure; it can prevent customers from using a faulty or buggy feature; and it can let you test out these performance improvements and confirm that they actually are improvements.

Number three is that without instrumentation, you’re blind.

If you don't have instrumentation at every layer of your application, so at the application layer, the host layer, the network layer, and the database layer, you're really driving blind.

You need to be able to see how all of those pieces interplay and how all of those pieces interact in order to really reliably and confidently deploy features at scale.

And my last lesson learned is that you should work with your metrics providers.

That was a great lesson I took away from this: they were very helpful.

I loved working with the Datadog team on building out some of the functionality for the dd-trace-rb library that we were able to use and leverage to roll out this feature really confidently.

They were very helpful to the success of that project.

So, my final takeaway from this is to remember, just like a surfer, you don’t start on the 40-foot wave from day one, you start a little wave at a time and build your way up, and finally, you can catch those huge waves.

So, always remember to start small.

Thank you very much for coming and listening.

We're hiring, I have to say, so hit me up afterwards.

Thank you.