Building Place Exchange | Datadog

Building Place Exchange


Published: July 17, 2019

Good afternoon, everyone.

Thank you so much for being here. I appreciate it.

Let’s begin with a quick show of hands.

This is the LinkNYC, and I’m just wondering how many of you guys have seen this on your way here today?

All right, sweet.

Actually, I was expecting like everyone, but I’ll take 60-ish percent.

Well, congratulations for the folks who have seen it.

You have experienced what we call an out-of-home advertisement.

So this is dystopia.

But put another way, it’s a really good example of what out-of-home means.

Out-of-home is arguably the oldest form of advertisement that exists.

It goes back to the simplest type of advertisement that you can think of, which is to go ahead and make a sign, stand in the corner where you know people will walk around, and hope that they will read it.

What is programmatic advertising?

Now, in today’s world, anything that exists outside of, say, your mobile application or your smart connected TV, counts as an out-of-home advertisement.

And at Place Exchange, what we’re doing is we’re bringing the world of programmatic advertisements to the out-of-home space.

So this is an example of an actual programmatic ad that showed up on a billboard on the side of…I don’t really drive, so I don’t know what the interstates mean…but it’s on the side of a big highway in Philadelphia.

Now, the way this works is through programmatic advertising.

What that means is, when you go to NYTimes.com or any sort of large, blog-ish kind of site, it’s that ad that pops up in between the content itself.

And the way it works is that it is transacted, purchased, and displayed programmatically; there isn’t a sales team involved, per se.

And so, as you can see here, what we’re doing is bringing that paradigm to street-side ads.

And arguably the biggest driver as to why we wanna do this is that it will allow us to bring useful content to these screens.

And these useful pieces of content can be paid for by the programmatic advertisements, which allows for more interesting and, ideally, useful pieces of information to show up as we commute into work or drive down the street, etc.

Programmatic advertising in action

And so actually, as it stands today, LinkNYC does this already.

If you’re from NYC and have walked down the street and seen the current train status, that’s an example of how a useful piece of content can be displayed on an out-of-home ad.

The one caveat is that it has an actual cost, which you can’t really cover without having some sort of advertisement to pay for it.

And that’s what Place Exchange is bringing to the table.

Now, the way we’re doing this is, if you look here, these are a few of the screen formats that we support. It could be a billboard. It could be a whole plethora of things.

And here are some examples of how a programmatic ad can be taken.

Programmatic ads are meant for the mobile web, but they can be scaled and wrapped around useful content like the current temperature: “It’s 89 degrees.”

Actually, I don’t think this is for today.

But it feels that way, right?

And so what it’ll do is it’ll actually go ahead and take the useful content, wrap it around some sort of programmatic advertisement, and then justify the cost of being able to display this.

Place Exchange: a backend system for programmatic advertising

So the biggest problem… the way we do this is, we take some sort of screen that exists.

We take characteristics of that screen.

And our platform will translate it to the equivalents that exist in the programmatic world.

So if you can imagine, before you display an ad on The Times, there might be a bunch of characteristics you might want to take into account.

And in the same manner, before you display an ad on a billboard, there’s a bunch of characteristics that you might want to also take into account.

So what we do at Place Exchange is perform that translation, so that mobile buyers, people who want to buy advertisements for The Times, etc., can also buy on our screens. Or, these days, omnichannel has become a thing.

So when we say omnichannel, they could also potentially buy it for connected TVs and things like that.

So now, the major problem is this.

Has anyone ever seen this XKCD?

It’s one of my favorites.

So basically, the idea is that there are competing standards.

And so then someone goes, “Let’s go make our own.”

And now you have N plus one competing standards.

And so when we first started Place Exchange, we focused on delivering programmatic advertisements to our Link inventory. And we have a whole bunch of those all around NYC.

So that was actually kind of easy, right, because we know how our Links work.

We know what characteristics, what capabilities they provide.

And so it wasn’t too hard to take that and transform that into the mobile advertising world.

Scaling to multiple publishers and inventory types

The problem started when we tried to scale from just one publisher, if you will, to multiple publishers.

So for instance, if you look here, these are a few examples of additional inventory types that we currently support at Place Exchange today.

So for instance, if you look there, I don’t think you guys have ever seen these, but Vengos are these mobile vending machines. They’re pretty neat. They seem to sell a lot of earphones, but they also have a whole bunch of other stuff.

And right there is a programmatic ad that was delivered through Place Exchange.

If you look here, this is a screen in some airport, I’m not sure which, that again may or may not always have content, and that can now be leveraged to display ads itself.

Now, the problem here, though, is that there are a whole bunch of different screens. They’re managed by a whole bunch of different people, publishers who started at a whole bunch of different points in time.

And because we are dealing with disparate hardware, what we end up with is this problem of varying screen context.

So I’m gonna use this really neat trick that was just shown to me. Look at that. How cool is this, right?

So I wanted to highlight two pieces really.

So these are all valid.

And this is a little bit…I wanna say 80% of these screen types are currently supported by our platform.

But to be clear, here is a train station deployment, as we call it, versus an in-building elevator deployment.

Now the thing they have in common is both of them have the ability to request an ad and some sort of interesting content from our platform.

The hard thing is that that’s the only thing that they have in common.

It’s really hard, especially in a deployment like this, where you might have a giant screen in the lobby.

You might have smaller screens within the elevators themselves.

And they may or may not have the same connectivity characteristics, right?

They might be wired, they might be on LTE, maybe not, etc.

The same goes for the train station.

Moreover, the duration also is really not normalized at all, because here people are probably in the elevator for a short period of time, whereas, as the MTA can show us, trains come few and far between, right?

So that’s one problem, right?

Here’s one example of fragmentation in the system that we have to contend with.

Real-world variables to consider

Moving forward, I kind of jammed two things in here.

The one piece is that, if you direct your attention to these guys, it’s also important to note that a screen might not be completely available to us to display the ad.

So the content space that we have allocated might actually be very different from the actual space available.

And beyond that, not all of our screens are static, as in something that you can see from far away.

A billboard would fit that bill.

However, something like a touchscreen device will not: there, you might actually have to stop your ad when someone starts to interact with it, and so on, so forth.

So here, again, we are contending with additional fragmentation.

Now, how do we solve this?

And our solution was kind of like, “Hey, look. You know what? Let’s let the people who wanna integrate with us solve this problem for us.”

So what we did was try to isolate complexity.

We shifted this act of entity management as far to the client side as possible. What that means is that our publisher partners have a really good sense of how their screens work.

They have a really good sense of the data pieces that we require: whether they have them off the bat, can pull them from their screens, or need to provide them through some other means.

Why Place Exchange went API-first

So what we did is we developed an API-first approach that would define flexible abstractions and taxonomies to normalize most of these inventory attributes.

So that regardless of which type of deployment comes in, to us, it always kind of looks like that, a PX model, which is something that we have defined and that we maintain the specification for, which goes back to the standards piece.

Now, the idea here is that…just remember, this is just the beginning of the entire system.

So this has to actually be translated, once again, to the OpenRTB spec.

OpenRTB stands for the Open Real-Time Bidding spec.

And what that is, effectively, is what the mobile providers of advertisements want to look at when we attempt to get an ad from them.

So what we’re doing here is first we create this normalization process. Then, because we know what these things are, we can translate it to the open RTB model. And we use that to actually make our API calls to retrieve bids, which then we can send back to these guys to render and do all sorts of additional processing.
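To make that flow a little more concrete, here’s a minimal sketch in Python of what a normalization-then-translation step could look like. The PxScreen fields and the OpenRTB mapping below are illustrative assumptions, not the actual PX model or Place Exchange’s spec mapping.

```python
from dataclasses import dataclass

# Hypothetical, simplified "PX model" for a normalized screen. The real PX
# specification is maintained by Place Exchange; these fields are assumptions.
@dataclass
class PxScreen:
    screen_id: str
    width_px: int
    height_px: int
    venue_type: str        # e.g. "elevator", "train_station", "billboard"
    spot_duration_s: int   # how long a creative can hold the screen
    lat: float
    lon: float

def to_openrtb_bid_request(screen: PxScreen, request_id: str) -> dict:
    """Translate the normalized screen into an OpenRTB-shaped bid request."""
    return {
        "id": request_id,
        "imp": [{
            "id": "1",
            "banner": {"w": screen.width_px, "h": screen.height_px},
            # Out-of-home specifics (like dwell time) tucked into an ext block.
            "ext": {"spot_duration_s": screen.spot_duration_s},
        }],
        "device": {"geo": {"lat": screen.lat, "lon": screen.lon}},
        "site": {"ext": {"venue_type": screen.venue_type}},
    }
```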

Now, having said that and having solved the problem of taking multivariate IoT clients and normalizing them, I wanna focus now on how the rest of our architecture works.

So here, we have what I call a logical diagram. What that means is that this diagram showcases the guts of what we call the ad request flow, if you will.

So this architecture, it’s quite important. So the ad request is basically the entry point that says, “Hey, please give me an advertisement because I want to show something.”

And so basically, 100% of our traffic starts here, right?

And a few key things to call out here: once you’ve done the act of taking all of the data pieces and normalizing them, this becomes really easy. And then this stuff becomes super interesting.

And it’s worth noting that the entirety of this architecture is, one, built on AWS infrastructure and, two, entirely “serverless”.

And what we mean by the term serverless here is that we try to rely on managed services as much as possible. We prefer solutions that have a pay-as-you-go model.

And most importantly, the whole architecture here is a series of Lambda functions, each triggered off of event inputs like API Gateway invocations, producing data for Kinesis streams, or consuming data from other streams.

And it kind of does that over and over again until we get to the downstream, which is basically a bunch of information that we can store and structure and so forth.

An example of the programmatic advertising process

Now, I wanna show you guys another representation of this, which is a little bit more granular, right?

So this is sort of a high level. Hey, an ad request comes in.

We go ahead and we try to understand what that means. We send up a bid request, which will give us a bunch of bids. We perform an auction.

This creative approval is basically a way for us to validate that the creatives that come in are street safe, which is a really big deal, especially in the out-of-home markets.

We actually have a human take a look at each creative and either approve it or not.

And then once all that stuff is good, it goes back and it shows up as an ad.

Now, to show this again in a slightly different manner: this is the same diagram, but now through the infrastructure bits. In particular, what I’m gonna do is walk through each one and slowly point out a couple of interesting pieces and caveats of this architecture.

So here’s the ad request. A bunch of stuff happens.

We get to the end, which is sort of our data sink, where we have a bunch of information.

So let’s begin with our…with this piece right here, which is the data collection API endpoint.

And so when we say data, we really mean the entity information about the screen itself: the slot height, the width, how long it can stay, things like that. That’s what we really mean by the data portion.

So that happens via an API Gateway invocation, which will trigger the first of a bunch of Lambda functions.

Now, this Lambda will interact with a bunch of other APIs. And it does a few other things, which we’re not really gonna go into yet.

But the important thing is that once it’s done, once it collects all the bids, it performs the auction.

It performs the validation on the bid responses to ensure quality of the creatives and so forth. It will drop a log record to a Kinesis stream.
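As a rough sketch, assuming a Python Lambda behind API Gateway, that handler might look something like the following; the stream name, payload shape, and one-line auction are simplified placeholders rather than the production code.

```python
import json
import boto3

kinesis = boto3.client("kinesis")
LOG_STREAM = "ad-request-log"  # hypothetical stream name

def fetch_bids(ad_request: dict) -> list:
    """Stub: in reality this fans out OpenRTB bid requests to demand partners."""
    return []

def handler(event, context):
    # API Gateway invocation: a screen is asking for an ad.
    ad_request = json.loads(event["body"])
    bids = fetch_bids(ad_request)
    # Simplified auction: highest price wins.
    winner = max(bids, key=lambda b: b["price"]) if bids else None

    # Drop a log record onto the Kinesis stream for downstream consumers.
    kinesis.put_record(
        StreamName=LOG_STREAM,
        Data=json.dumps({"request": ad_request, "winner": winner}),
        PartitionKey=str(ad_request.get("screen_id", "unknown")),
    )
    return {"statusCode": 200, "body": json.dumps({"ad": winner})}
```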

Now, that Kinesis stream, well, he’s got a whole lot of stuff that goes on in his life and we’re not there yet.

But the one thing to point out is that what we actually do is point a bunch of consumer Lambdas at it as a means to perform additional work on the data that we logged.

So in this example, this goes back to what we had here, where this creative approval process actually reads off of the Kinesis stream itself.

So every time new ads come in, we use that Lambda to say, “Hey, grab it, and submit it somewhere for the approval process.”
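A consumer Lambda reading that stream might look roughly like this sketch; submit_for_approval is a hypothetical stand-in for whatever queue or service actually feeds the human review step.

```python
import base64
import json

def submit_for_approval(creative_url: str) -> None:
    # Stand-in: queue the creative somewhere a human reviewer will see it.
    print(f"queueing {creative_url} for street-safety review")

def handler(event, context):
    # Kinesis event source: records arrive base64-encoded.
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        winner = payload.get("winner") or {}
        if winner.get("creative_url"):
            submit_for_approval(winner["creative_url"])
```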

Now, as we move along this process, once the data has been logged to a Kinesis stream, the next step is here, this blue box at this corner.

And what we do is take our log records, and we actually have Firehoses, two of them actually: one that encodes all this content and drops it in Parquet format into an S3 bucket, and one more that does the same, but in JSON format.

And we have Glue. Glue is a managed extract-transform-load service, which we use to define the schema for our data formats.

Athena will then take a look at the Glue schema, and we can use that to run pay-as-you-go SQL queries for the generation of local reports or Tableau reports, etc.
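For instance, kicking off one of those pay-as-you-go queries could look like the sketch below; the database, table, and results-bucket names are placeholders, not the real ones.

```python
import boto3

athena = boto3.client("athena")

def run_daily_report(day: str) -> str:
    """Start an Athena query over the Glue-defined log table; returns the execution ID."""
    response = athena.start_query_execution(
        QueryString=f"""
            SELECT screen_id, count(*) AS auctions
            FROM ad_request_logs          -- Glue table over the Parquet bucket
            WHERE dt = '{day}'
            GROUP BY screen_id
        """,
        QueryExecutionContext={"Database": "px_logs"},                      # placeholder
        ResultConfiguration={"OutputLocation": "s3://px-athena-results/"},  # placeholder
    )
    return response["QueryExecutionId"]
```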

And that’s kind of how the data flow of the entire system works.

The other piece to talk about is our friend down here.

And so once our data comes in, we perform the… we drop the log, and we do all that stuff.

We also want to be able to keep track of observability, so logs, metrics, etc. And so the way we do it is we actually have a CloudWatch subscription filter, which basically feeds a Lambda.

And what this Lambda will do is it will read through the log-level statements, perform a little bit of processing, and pass it along to a multitude of logging services.
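The shape of that subscription-filter Lambda is roughly the following sketch: CloudWatch Logs delivers each batch gzipped and base64-encoded, and the fan-out to the actual logging services is left as a stand-in.

```python
import base64
import gzip
import json

def forward_to_logging_services(lines: list) -> None:
    # Stand-in for shipping the processed lines to Datadog and any other providers.
    for line in lines:
        print("forwarding:", line.strip())

def handler(event, context):
    # CloudWatch Logs subscription payloads arrive base64-encoded and gzipped.
    compressed = base64.b64decode(event["awslogs"]["data"])
    batch = json.loads(gzip.decompress(compressed))
    lines = [log_event["message"] for log_event in batch["logEvents"]]
    forward_to_logging_services(lines)
```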

We have two right now. We might have five in the future. You know, I’m not really sure.

But that’s kind of the process for this piece. And that concludes, at least, how the ad request system architecture works.

And so, a quick recap. The key technologies that we are using here are API Gateway to handle on-demand web requests.

Kinesis streams for asynchronous data processing, and basically being able to log all the stuff that the system is ingesting.

Glue as an ETL service.

CloudWatch Logs for efficient metrics and logging management.

And, of course, Lambdas to stitch together all the stuff that is occurring at each one of these infrastructure points.

How Place Exchange uses observability

Now, the next piece to talk about here is how we integrate observability across our application stack.

And so we do leverage Datadog in order to record all of our metrics, including the stuff that comes out of the box for system performance and so forth.

But also the business-level pieces that are useful in order to understand, at a higher level, hey, is this thing doing what we expect it to?

So I’m gonna spend some time going through each one of these.

And first, I do wanna talk about how we perform client-side monitoring in particular because a lot of these screens are publisher-based, right?

And so they run a whole variety of client-based systems and platforms.

So what we’ve done is we’ve built a small JS-based library that wraps around Datadog’s HTTP API.

And what that does is it allows publishers, should they want to, to pass this information. And at least one of our publishers does use this to pass along information like, “Hey, did the creative run? Did we make an ad request?” and so forth.

Now, going into the other aspect, how we perform observability on the backend.

So as mentioned, Datadog is really nice because it does pass along stuff like the basic AWS metrics that are available but through DDSL.

So here’s some examples of what happened last week.

So we did get throttled a couple of times, so I think that’s a good thing. It just means that we are at scale, which is great.

This is a really interesting metric. It basically tells us if the Lambda consumer is falling behind the availability of the log-level events in the K stream or not. And so it looks like we fell behind a little bit, but not too bad.

Business- and client-side metrics

Now, the more interesting piece are the business-level metrics.

So I kinda wanna talk about a couple of them just to give some context as to how we use these metrics at Place Exchange.

So far as that goes, we have bid filter logic. So basically, what that means is, even if a DSP provides us a bid, we might choose not to display it for a multitude of reasons. And so the way that works is we have an ENUM-based set of reasons that we encode.

And every time we hit one of those, we just fire off one metric. And this way, we can keep track, and we can say, “Oh, hey, it looks like a lot of these creatives are being rejected. What’s going on? Let’s go take a look.”
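In spirit, that looks something like this sketch; the reason names and the emit_metric helper are made up for illustration, with the counter ultimately landing in Datadog via the metrics pipeline described earlier.

```python
from enum import Enum

class BidFilterReason(Enum):
    # Illustrative reasons only; the real ENUM is Place Exchange's own.
    CREATIVE_NOT_APPROVED = "creative_not_approved"
    WRONG_SIZE = "wrong_size"
    BELOW_FLOOR = "below_floor"

def emit_metric(name: str, tags=None) -> None:
    # Hypothetical helper: a real implementation would increment a Datadog counter.
    print(f"METRIC {name} 1 {tags or []}")

def reject_bid(bid: dict, reason: BidFilterReason) -> None:
    # Fire one metric per filtered bid, tagged with the reason, so spikes are visible.
    emit_metric("px.bids.filtered", tags=[f"reason:{reason.value}"])
```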

That being said, the number of auctions recorded is a good way to say, “Hey, how well is the system performing?” The number of auctions will tell us how much throughput the system is able to process.

So again, spikes are really interesting, and troughs are scary, but also interesting.

Now, with that being said, what’s really interesting here is that the Kinesis stream batch size metrics are things that we had to come up with, because we actually ran into a lot of Kinesis constraints as we started to scale a couple of months ago.

And what that led to was like, “Oh, hey, we actually don’t have a lot of good observability in how long it takes for us to read one batch from Kinesis on average,” or, “Hey, like, we should take a look at how long it takes to write a batch to Kinesis because that could inform us if we should increase the number of shards or something to that effect.”

And then on the client side, the two things that are super important, or interesting, are acknowledgments that the creative did show up, and also how many ad requests were made, because we can use those two numbers to come up with a decent ratio to understand how well the ads are actually being delivered.

The value of dashboards

Now, with all of this data, what we usually do is we have all these metrics, and we will publish dashboards.

Dashboards are awesome because what they do is create this concept of what we call an “information furnace.”

And an information furnace is basically a system that will radiate information across the board versus having someone go ahead and try to find it—because it’s hard to know what kind of information you even wanna know.

So what we do is we take these guys, and we post them all over…well, not all over, but in key places where the sales team, the product team, and the engineering team happen to walk by.

And what we found is that it actually creates a lot of useful back and forth like, “Hey, what does this mean,” or, “This spike seems interesting. What’s going on there?”

And as you can see here, we can conglomerate the client-side stuff in addition to the backend, application-stack-layer stuff.

And so that sort of concludes how the observability part works, right?

This is how we can flow information and knowledge of the overall, “Hey, is it working or not?” characteristics of our architecture.

Obstacles Place Exchange encountered along the way

So what I’m gonna do next is I’m gonna talk about some fun constraints that we ran up against as we built out and scaled this architecture.

And the one caveat to kind of call out now is that a lot of these constraints you can go on StackOverflow and find. So I’m gonna stick to the ones that were really interesting and that we weren’t able to easily find solutions to.

And quite frankly, the solutions we came up with are, like, trade-offs. We have this little dance we do, and it really means, “Oh, well, there isn’t a really good solution here, but based on what’s really important, and what we can sort of concede, we’ll go with this approach.”

So the first one has to do with collecting the observability metrics from the Lambda functions.

And in particular, typically, the way this works is you have some sort of StatsD daemon. We would use DogStatsD, right, for Datadog.

And typically, the way this works is we send a bunch of observation stuff, and based on the hostname, it can get deduped and sort of processed.

The issue, though, is when it comes to Lambdas, there’s no such concept. There’s no equivalent of a hostname for a Lambda invocation, because it kind of scales based on what AWS wants to do, and based on your load, and so on and so forth.

The problem is that this will result in metrics from different Lambda invocations potentially overriding each other, especially as you scale up and you have multiple invocations.

And the other thing is that, you know, a Lambda will live at most for 15 minutes, and that’s like a new thing, right? I think that happened maybe in March.

So before that, it was an even shorter time, like 10 minutes or something to that effect. And so that is problematic.

Beyond that, what we were doing prior was we were actually performing a manual flush. So once our Lambda invocation was done, that’s when we would take the collected metrics and push it through some sort of HTTP API.

And so that resulted in a whole bunch of problems where our metrics weren’t actually matching what we expected.

So we went through a bunch of loops and this was like a pretty long process. But what we eventually ended up on was…in the nick of time, Datadog actually introduced a feature called the distribution metrics API.

And what that did was it actually allowed us to decide on this particular architecture, which lets us write our own CloudWatch subscription filter, process those logs (as I mentioned), and then pass them along to the distribution metrics API versus having to rely on StatsD or what have you.

And so, you know, the main pros of that is just that, one, there’s no need to make those HTTP calls, again, at the end of the Lambda invocation, which is really nice.

It is an asynchronous process. It operates on batches, so we can either have it in the subscription filter Lambda, or, for our consumers of the K streams, we can just pass it along, which is nice.
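As a sketch of what that forwarding step could look like, assuming the v1 distribution_points endpoint and an API key in the environment (the exact payload shape may differ from what ships in production):

```python
import os
import time
import requests

DD_URL = "https://api.datadoghq.com/api/v1/distribution_points"  # assumed endpoint

def send_distribution(metric: str, values: list, tags=None) -> None:
    """Push one batch of values as a distribution metric via Datadog's HTTP API."""
    payload = {
        "series": [{
            "metric": metric,
            "points": [[int(time.time()), values]],  # one timestamp, many samples
            "tags": tags or [],
        }]
    }
    requests.post(
        DD_URL,
        json=payload,
        headers={"DD-API-KEY": os.environ["DD_API_KEY"]},
        timeout=5,
    )
```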

But if we do make use of the CloudWatch subscription filter, we’re only allowed one subscription filter per log group. And this might change in the future. It’s an AWS-specific constraint.

So hopefully, in the future, it will maybe be like five, or something we can ask to increase.

But that means that if we have multiple logging services, you have one Lambda to approach all of them, which is a little gross, hence the dance.

So with that being said, another interesting constraint that we ran up against was dependency size limits. So a typical Lambda dependency artifact will allow you 50 megs of space zipped.

And as soon as your Lambdas become more than just toys, that grows in a lot of interesting ways. And so the question was, well, how do we get around that?

How do we make that work? And there’s no one solution.

There’s a couple of things we could do, and I kind of wanna talk about a few.

The first is: don’t package your AWS modules. For example, boto, if you’re using Python, comes free as part of the Lambda environment, so there’s no reason to include it. So that helps a bit, but not too much.

The other thing you can do is use the /tmp folder that all Lambdas come with per invocation, which actually gives you 512 megs of space, which is nice.

So if you can get your deployment package size to under 50 megs, you could theoretically take that, uncompress it as often as you have to, stick it into the /tmp folder, and point your dependency path at it.

And that does give you 10 times the amount of space which is nice.
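Here’s a minimal sketch of that trick, assuming the heavy dependencies ship as a deps.zip alongside the function code:

```python
import os
import sys
import zipfile

DEPS_ZIP = os.path.join(os.path.dirname(__file__), "deps.zip")  # assumed location
DEPS_DIR = "/tmp/deps"

# Unpack large dependencies into the 512 MB /tmp space once per cold start,
# then put that directory on the import path.
if DEPS_DIR not in sys.path:
    if not os.path.isdir(DEPS_DIR):
        with zipfile.ZipFile(DEPS_ZIP) as zf:
            zf.extractall(DEPS_DIR)
    sys.path.insert(0, DEPS_DIR)

# Heavy imports can now resolve from /tmp/deps.
```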

However, it will eat up runtime during this thing that we call a cold start.

So every time a Lambda invocation starts up, what happens is it has to initialize all the stuff from your dependencies and your code itself.

And so if you have to unzip and move and point as part of your start time, that will take up real time, which can potentially, especially at load, lead to some measurable latency problems.

So you might have to do this, you might have no choice, but this is a constraint you’ll want to, or need to, contend with.

The other thing, which is really specific to Python Lambdas, that I wanted to call out is a clever trick, really.

But basically, here’s a small script, and what this really does is it acknowledges that in order to run Python, all you need is your .pyc files.

So what you can do is just walk through and get rid of all the other stuff: the docs, the .py files, the tests. And that, for us, was what made things stabilize for the time being.
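A hedged reconstruction of that kind of trimming script might look like the following; the directory names and file extensions it drops are guesses at what such a script would target.

```python
import compileall
import os
import shutil

def slim_package_dir(root: str) -> None:
    """Byte-compile a vendored package tree, then drop sources, tests, and docs."""
    # legacy=True writes foo.pyc next to foo.py so imports still work without sources.
    compileall.compile_dir(root, quiet=1, legacy=True)
    for dirpath, dirnames, filenames in os.walk(root, topdown=True):
        # Delete test and docs directories outright and skip descending into them.
        for d in list(dirnames):
            if d in ("tests", "test", "docs"):
                shutil.rmtree(os.path.join(dirpath, d))
                dirnames.remove(d)
        for name in filenames:
            if name.endswith((".py", ".md", ".rst")):
                os.remove(os.path.join(dirpath, name))
```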

We got a lot of benefit in that regard, in that the size shrunk a lot, and so that’s the solution that we currently have in production.

Now, with that being said, the next sort of fun constraint in this story here is caching.

Now, simple caching techniques, like naive caching techniques, don’t really work very well when it comes to Lambda invocations.

There’s a few reasons for that.

So why don’t we consider the CacheControl Python module, which has various subclasses that will allow you to cache according to a bunch of different constraints.

So the simplest is, of course, in-memory caching, for which you can use the DictCache class. It’s in-memory, which is nice, and it’s simple.

But just remember that under load, multiple Lambda instances are initialized, right?

So if there’s a ton of load on your API Gateway, AWS might spin up like 100 or 1,000 Lambda environments that each run your code individually, which means that if you rely on this caching mechanism, the same thing might potentially get cached 1,000 times.

And remember, especially for like API Gateway, those don’t live that long, right? I mean, again, long in the world of Lambda is 15 minutes, but like, those still don’t live that long.

And so what we found out was, things got really expensive super quickly, especially for API calls that we were paying for.

So the clear way to fix this is okay, well, why don’t we go ahead and use a persistent caching layer. So, of course, Redis, Memcached, something like that comes to mind.

And that is nice, and in any other use case, that would be it. We’re done, awesome. But the problem is that, typically, Memcached, Redis, etc., usually live in a VPC, and we had a VPC.

And as it turns out, connecting to a VPC from a Lambda instance, while possible, is not an easy task in terms of performance. In particular, the way it works is you need an elastic network interface to connect to the VPC from the Lambda invocation.

But the act of creating and connecting to one per invocation could end up taking way more time than you might be able to afford.

Now, I should say that there is a solution in the works that AWS announced last December.

However, from our experience, it didn’t seem very robust yet.

And for that reason, it was something that we just could not work with. And so what we ended up doing was we said, “Okay, let’s try to find potentially other ways to solve the same problem.”

So: other DBs that we could use. We took a look at two in particular.

We ended up going with DynamoDB, mainly because it supports a time-to-live garbage collection operation right out of the box, whereas if we used Elasticsearch, we’d have to spin up another Lambda to perform a cron job to walk through and clear things out, which is just more complexity that we didn’t want to incur.
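A minimal sketch of that DynamoDB-backed cache, with the table and attribute names as assumptions and the table’s TTL setting pointed at the expires_at attribute:

```python
import json
import time
import boto3

table = boto3.resource("dynamodb").Table("px-cache")  # hypothetical table name

def cache_put(key: str, value: dict, ttl_seconds: int = 300) -> None:
    table.put_item(Item={
        "cache_key": key,
        "payload": json.dumps(value),
        # The table's TTL feature is configured to garbage-collect on this attribute.
        "expires_at": int(time.time()) + ttl_seconds,
    })

def cache_get(key: str):
    item = table.get_item(Key={"cache_key": key}).get("Item")
    # TTL deletion is lazy, so double-check expiry on read.
    if item and item["expires_at"] > time.time():
        return json.loads(item["payload"])
    return None
```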

And so this approach works out pretty well. But I wanted to show some caveats.

This is just a diagram of how cache works, honestly.

But the major caveats to call out here are, one, when it comes to garbage collection, Dynamo supports one global TTL per table, which means that you might have to have multiple tables, one per cache type. For us, we just decided, like, hey, here are the three things that we really need to cache, and we focused on those.

But if you have a whole bunch, that might be problematic, or might require additional software, or code, or logic to encode.

That’s one.

The other thing is… and this is really specific to us. But when it comes to how we provision these pieces, we prefer to use Ansible.

And DynamoDB’s pay-as-you-go billing feature just got released recently, and as a result, it’s not supported in any official Ansible module yet.

The same might be true for any other type of infrastructure-as-code tool that you might use.

So we actually had to roll our own, which again, is something that was like, “Okay, in a few days we’ll be fine, or in a few weeks or a few months.”

But like right now, it’s kind of like, “Hey, let’s acknowledge this, detect that, and let’s move on,” which is kind of the theme when it comes to a lot of these pieces.

Nested schema evolution

And so the final thing I wanted us to talk about is nested schema evolution.

And so let’s kind of dig deep into this and talk about what the heck that really means.

So what we like to do is we like to overeagerly log data when it comes to a lot of our processing mainly because we’re about, you know, a year and a half old, and so it just helps to have more logs in order to debug or troubleshoot and things like that, right?

So it’s really nice to be able to do that.

And we wanted to use Apache Parquet as our format for the log records, mainly because it is quite efficient for logging; space-efficient, rather.

And the way Athena works is you do pay per terabyte of data scanned. And so anywhere we can get some compression is a win, because it directly translates into dollars saved and will allow us to continue to be overeager until, you know, we can’t be any more for practical reasons.

The main problem though is, as stated, we are starting off. We’re only a year and a half in and so our data models do evolve.

And while not super frequently, it does occur on a cadence of like every month or every six weeks.

The problem, though, and this is a known problem, is that PrestoDB, which is what Athena uses in the background, does not support nested schema evolution in Parquet, which means that once a model has been updated, historical data is no longer available.

And so this PR actually has been closed. It’s been open for a while. It was closed a couple of months ago.

But the way the cadence works is Athena hasn’t pulled it in yet because they have their own schedule for this. And so right now we’re kind of like waiting.

And in the meantime, what we’ve decided to do is, while acknowledging that this is problematic, to log both formats, both Parquet and JSON, with the idea that if our Parquet queries do fail, we can fall back to the more expensive JSON equivalent, but ideally, that won’t happen as often.
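In code, that interim fallback amounts to something like this sketch, where the table names are placeholders and run_athena_query wraps the kind of Athena call shown earlier:

```python
def query_logs(run_athena_query, sql_for_table):
    """Try the Parquet-backed table first; fall back to the pricier JSON copy on failure.

    run_athena_query: callable that executes a SQL string and returns rows.
    sql_for_table: callable that renders the same query against a given table name.
    """
    try:
        return run_athena_query(sql_for_table("ad_request_logs_parquet"))
    except Exception:
        # e.g. a nested-schema-evolution failure on the Parquet data
        return run_athena_query(sql_for_table("ad_request_logs_json"))
```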

And then once the change does get pushed into Athena itself, we can sort of close out this piece and we’ll be left with the format that we prefer.

But, again, this is one of those, “This is the best we can do at the current moment.”

And so with that being said, that sort of concludes the really interesting stuff that we had to scratch…like most of these stories, I guess, it took a couple of weeks or a couple of sprint cycles to get agreement on.

And they’ll probably evolve as time goes on.

Key takeaways

But there are a few key takeaways that I wanted to talk through. I’ve got about two minutes left, it looks like.

You know, the first thing is…was it worth it?

So if I had the option to re-architect this thing, or I was starting to build something like this again, would I use these serverless frameworks or this architecture style?

And I think the answer is yes, absolutely, mainly because, while these constraints are definitely interesting, the cost savings are real: our average monthly spend on prod is actually not as bad as we had thought it would be.

And the pay-as-you go model has been really great.

I think the other thing is that when it comes to serverless, and this could be a whole other talk, but the way you test and the way you organize your code involves different paradigms.

And I personally think that those paradigms are good because they force better coding practices. And so as a result, I feel like our code is more stable.

And we lean heavily on testing, because we don’t have that “Oh, it works on my machine” type of view, because there is no “my machine,” right? It’s always on AWS.

The other thing is I should point out that not everything should be done in serverless. And it’s really important to acknowledge what items need to be serverful, if you will, versus serverless, right?

And lots of our architecture that we haven’t talked about here actually does rely on your typical Django apps and so on and so forth, because it just makes sense to.

And the final thing that I wanted to talk about is when it comes to observability, like how do you know what to observe?

The process that we took is we kind of went with like, “Hey, here’s these large key metrics that we think are important. Let’s throw them up somewhere and let’s keep them up there for a while.

Let’s walk around and as people start to ask questions, let’s come back and let’s try to find more specific things or as problems appear, let’s go ahead and fix it there."

And wow, that was perfect timing because I am done.

Thank you, everyone. And I wanted to end with a picture of my cat who wants some coffee as I do. Thank you.