Accelerating Incident Response With Real-Time Business Data at Wayfair | Datadog

Accelerating incident response with real-time business data at Wayfair

Published: July 12, 2018

The customer impact of engineering problems

Hey, everyone. How’s it going today?

Have you ever found yourself trying to determine the exact impact to your business while also fixing an engineering problem?

And how many times have you found yourselves checking error logs while also determining how many customers are actually impacted by the problem?

Have you ever found yourself being pushed for answers from the business while also answering to engineering managers?

So I found myself in that position very, very frequently, and I decided to do something about it.

So today, we’re going to talk about how we solve this problem at Wayfair, and we’ll discuss some principles that you can take home with you to establish this as well.

So, a little bit about me and the company.

As you guys know, I’m Nick Vecellio.

I’m a staff engineer over in the Wayfair NOC.

I’m responsible for building monitoring and analytics systems that provide real-time intelligence back to engineering and the business.

For those of you who may not know about Wayfair, we are one of the largest online destinations for home furnishings and decor, so, couches, wall art, that kind of thing.

We currently have around 10 million products from 10,000 suppliers, as well as 12 million active customers, and about 9,700 full-time employees, of which 1,500 are engineers.

So, as you can see, it’s kind of a lot of business to keep track of.

So let’s get going.

So, what are we going to be talking about today?

So, we’re going to start with my four pillars of real-time analytics.

We’ll talk about a first case study, which is building an analytics platform in a weekend.

The second case study, which would be monitoring checkout in real time.

The third case study is going to be how to query all of the things.

And then we’ll wrap it up.

Four principles of real-time analytics

So, first of all, let’s talk about the four pillars.

I was thinking about a couple of ways to kind of establish this talk and how to come up with these four pillars.

And I realized that there’s a basis to these four pillars to start with.

It comes back to the same point, which is going to be data availability.

To make business decisions quickly, data needs to be available, as close to real-time as possible, and in an easily accessible location.

We also need to stick to several principles that will help us create and maintain value within our real-time platforms.

The principles are: keep it simple.

We want to keep these systems as simple as possible so that we know when something breaks and why.

We want to keep it stupid.

We should let users do the work and let the pipeline handle just the aggregation and the indexing.

We want to keep the system fast because, if a system is not fast, how can it possibly be in real time?

And we want to keep it very easy to iterate on. We should be able to turn around the request as soon as possible.

Keep it simple

To run a real-time data analysis system and have it be useful at the same time, we need to keep things as simple as possible.

So we start with the first pillar here, for two major reasons: when it breaks, we need to know why, and when it’s broken, we need to be able to fix it quickly.

Let’s take a look at the diagram to the left.

This is a diagram of an old alerting system that we had built in house at one point, which we’ll talk about later.

Even this is a simplified diagram, but this system had over 15 distinct single points of failure, considering it was all tied together with some glue made out of PHP.

Troubleshooting the system when there was an issue was a multi-person effort, spanned across multiple teams, and it took forever to try and figure out exactly why it was broken.

So let’s compare this to this system.

We’ll talk about this shortly as well. It relies on four systems, or five if you really count what’s feeding it. Three of them are built by the same company, and they fit together really well, almost like puzzle pieces.

The fourth has a native connector into the rest, and it’s very well supported at Wayfair.

When part of the system breaks, it’s extremely easy to check where.

First of all, can you log into Kibana?

When did the data stop flowing in?

We can check the Logstash and Elasticsearch error logs, and we can check the Kafka error logs.

And that’s our troubleshooting for the system.

That’s all we need to do.

And each point is obvious.

For example, if Elasticsearch is down, Kibana won’t let you in.

If Kibana is down, we’ll get an NGINX error.

If Kafka is down, I can kind of almost guarantee that someone’s working on it at some point.

Keep it stupid

The next pillar is that we need to keep our systems stupid, to allow them to run quickly and provide value.

There’s a few main reasons why we’re going to keep things stupid.

We need to ensure that the inserts don’t fail at any point.

As we talked about earlier, our data needs to be available.

And if data starts failing to insert, it’s no longer available.

So to do this with Elasticsearch and Logstash, we think about upserts.

So if a piece of data is not present, it will be inserted.

And if a piece of data is present, it will be updated.

If there’s a hard failure at any point in the way, we’ll swap the document ID and we’ll insert the information anyway.
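As a rough illustration of that upsert-with-fallback idea, here is a minimal sketch. It uses an in-memory dictionary as a stand-in for the index, so the class and function names are hypothetical and this is not Wayfair’s actual Logstash or Elasticsearch configuration.

```python
import uuid


class InMemoryIndex:
    """Toy stand-in for a document store keyed by document ID."""

    def __init__(self):
        self.docs = {}

    def upsert(self, doc_id, fields):
        # Insert the document if the ID is new, otherwise merge the fields in.
        existing = self.docs.get(doc_id)
        if existing is None:
            self.docs[doc_id] = dict(fields)
        else:
            existing.update(fields)


def index_event(index, doc_id, fields):
    """Upsert an event; on a hard failure, store it under a fresh ID instead."""
    try:
        index.upsert(doc_id, fields)
        return doc_id
    except Exception:
        # Assumed fallback: swap the ID so the data still lands somewhere.
        fallback_id = f"{doc_id}-{uuid.uuid4().hex}"
        index.docs[fallback_id] = dict(fields)
        return fallback_id


index = InMemoryIndex()
index_event(index, "txn-1", {"page": "basket"})
index_event(index, "txn-1", {"cart_created": True})
```

After both calls, the single `txn-1` document carries both fields. The point of the fallback branch is that a failed merge costs you a tidy document, not the data itself.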

Another point is to perform analysis after ingestion.

If you’re trying to ingest and aggregate data in the pipeline, you’re bound to cause some sort of failure, and you’re going to go back to having a slow system.

Keep it fast

Keeping a system fast is of the utmost importance for a real-time system for two main reasons, one that I already mentioned.

If it’s not fast, how can it be in real time?

This data is meant to help resolve incidents and make business decisions, and it needs to be very quickly available.

To address the first point, the idea of a real-time system is that we know it’s happening right now.

We cannot afford a 20-minute delay in data when it comes to our checkout funnel, which is the basis of our business. We would really like to take your money.

On a similar note, let’s say someone notices an issue, and they have some database query that will give insight into exactly what the problem is, but it’s going to take 30 minutes to run.

Is that still valuable?

Can we really make a real business decision off of this, or is the query data invalid by the time it returns?

Keep it easy

The last of our four pillars is to keep it easy, and we hit this in a few points.

This platform needs to be easy to iterate on. Wayfair deploys hundreds of commits every hour, on the hour, 10 hours a day, and there’s several hundred commits in each of those deployments.

This doesn’t factor in our ad hoc deployments or deployments to smaller subsystems.

Since we change things very quickly, we need to be able to deploy new data streams rapidly to serve the needs of the engineers and our business.

We receive these requests all the time, and keeping turnaround time low is very important for us.

If it takes more than a day or two to populate new data, then some of the value there is already lost, in my opinion.

We cannot provide immediate value when turnaround time takes a few weeks.

So now that we’ve covered our four pillars, let’s jump into our first case study.

Building an analytics platform in one weekend

So, first of all, we’ll talk about the first time that I built a data analysis system over the course of a weekend.

So, a few years back, over lunch, a friend of mine that worked in the search technologies team and I were discussing how awesome it would be if we could determine how a customer’s search traffic pattern influences buying, and how that works across different geographic regions.

We had this thought for two reasons.

One, it sounded like a fun challenge, and we’re engineers, so we wanted to do it.

Two, the search tech team had relied on some macro-infested Excel workbook to track the performance of their searches, and it was slow, and it was horrible.

Coincidentally, at the time of this discussion, there was a hackathon at Wayfair that weekend, so game on.

We set out to create a system that was as real-time as possible, easy to use, and really, really informative.

Based on some past experience, we decided that we were going to go with a combination of the Elastic stack, Kafka, Python, MS SQL, and Hadoop.

Each of these technologies had a distinct role. Elasticsearch was used to store and aggregate the data, Logstash was used to consume the data, Kafka was used to ship the data, Python and MS SQL were used to pull data, and Hadoop was where a lot of the data already lived.

Now, since this was a hackathon, we didn’t have the time or the manpower to create all of this as a streaming platform.

But we did architect this proof of concept in a way that would be scalable in the future, should we choose to move forward with the project in production, which we did.

We also hit our first challenge right away.

Order and search data have absolutely no hard relation to each other.

This was the main reason we decided to go with Elasticsearch for the analytics.

Since you can’t make relations with this data, we needed to go with aggregations on terms in the data and applied filters to make this all work.

So, take a look at these pieces of data that we have here.

The only real linkable pieces of data we have between these two sets are customer ID, store name, IP address, SKU, and city.

But this is all we really needed to show exactly what we were looking for.

Since SKU was common between the document types, we could use this as a basis for our aggregation.

Let’s say we have these two documents, one for order and one for search.

Let’s assume that we search in Elasticsearch for “abc123.”

This will retrieve both of these documents, and now it’s just a matter of slicing it properly.

At first glance, we can see that the search_success field is true, meaning that the user clicked on a result from their keyword search.

We can now assume that this is, in fact, a barbecue grill.

We can also see that there was another SKU purchased in the order.

If we were to look that up, it’s a spatula.

Now, wait a second, did we just accidentally build a recommendation engine? We kind of did.

With just this information alone, we can see what other people are actually buying with different items.

So, instead of someone buying a wicker chair and us presenting more wicker chairs as recommendations a week later, maybe we can look at what people ordered next to that chair.
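To make the “what else did people buy with this SKU” idea concrete, here is a minimal sketch in plain Python. The documents, field names, and SKUs are made up for illustration; in the talk this slicing is done with Elasticsearch aggregations, not application code.

```python
from collections import Counter

# Hypothetical documents; field names and values are illustrative only.
documents = [
    {"type": "search", "keyword": "barbecue grills", "sku": "abc123", "search_success": True},
    {"type": "order", "order_id": 1, "skus": ["abc123", "xyz789"]},
    {"type": "order", "order_id": 2, "skus": ["abc123", "xyz789", "def456"]},
    {"type": "order", "order_id": 3, "skus": ["def456"]},
]


def co_purchased(docs, sku):
    """Count the other SKUs that appear in orders containing `sku`."""
    counts = Counter()
    for doc in docs:
        if doc["type"] == "order" and sku in doc["skus"]:
            counts.update(s for s in doc["skus"] if s != sku)
    return counts


top = co_purchased(documents, "abc123").most_common()
```

Here `top` ranks `xyz789` first because it shows up in both orders that contain `abc123`. That ranking is the seed of the accidental recommendation engine.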

In practice, our array of products that came back from the “barbecue grills” search term provides us with patio furniture, utensils, and other barbecue-related items.

Look past “barbecue.”

What can we also use?

We can also use this information to determine a lot.

For example, what were the most common search terms used on-site, and were they successful?

Where are people searching for and ordering specific items from?

What items are selling the most?

What store had the best search performance?

And did search successes just tank on a website, indicating that there might be a problem with our search engine?

Ultimately, after this project was complete, we built it into production and gave it back to the category managers, and we got to tell them that they didn’t have to use Excel anymore.

And they were very, very happy.

Real-time analytics and incident response

How does this help us respond to incidents?

Realistically, customer search behavior isn’t going to help us respond to infrastructure incidents.

But this was a very large step forward for us in terms of real-time analytics at Wayfair.

By building the system, we’ve laid the groundwork for future streaming applications, as well as new and novel uses for Elasticsearch.

Prior to this project, we had really only used Elasticsearch for log storage and visualization, not for really novel uses like this.

There’s also a massive amount of business value here.

The possibilities for what this project can be, and has been, used for are pretty endless.

And to top it all off, it helped develop our four principles of successful business analytic systems, as we talked about first.

So, case solved.

To the point of the four pillars, it was very simple.

We provided some crafted dashboards that only required a SKU or search term to actually be entered for analysis.

Instead of waiting for a database query to start, and staring at rows and columns that return, we have a rapid return and visually pleasing display of the same information.

The platform was also stupid.

The hackathon stuff was actually pretty complicated when we built it, to be honest.

But the framework we built was capable of being stupid once the other parts were in place, after we worked on it in production.

It was fast.

In production, the data is live.

And since it’s backed by Elasticsearch, it’s fast.

And it’s not in rows and columns.

It’s very easy to iterate on as well.

There are numerous ways to improve or enrich this data set. It could be as simple as adding a field to the tracking code that sends this information to our cluster, or we could add enrichments to this data through Logstash.

We could also parse out this data set into a smaller index and manipulate it further.

With all of that, you could easily take a business request, turn it around very quickly, and provide something very valuable to the business.

That brings us to our next project, running analytics in real time on our checkout funnel.

Real-time analytics from the checkout funnel

Now, while it’s definitely important to be able to understand what our customers are looking for, this doesn’t directly impact the Wayfair bottom line as much as understanding the health of our checkout funnel.

So, when you’re trying to monitor something, there are dozens, if not hundreds, if not thousands of ways to monitor things from the infrastructure side.

For example, if you look at this dashboard here, we can monitor NGINX request rates, HAProxy pool sizes, cache hits versus database hits, response times, CDN delivery rates, and so on and so forth.

But does the business side really care about what our backend servers are doing?

No, they care about what our customers are doing.

Our main focus on this particular project was the conversion rate.

However, calculating conversion rate in real-time is extremely difficult because of the nature of Wayfair, and how people utilize the site.

So we typically consider ourselves a shopping site, whereas we consider places like Amazon a buying site.

The challenge of tracking customer behavior

Let’s explain this a little more.

Think about the last time you went on to Amazon to make a purchase.

You’ve kind of already decided what you want and what price you’re willing to pay for it.

Let’s take an example from the other day for me.

I needed a new pair of headphones for work pretty badly, and I was looking to spend about $50 or so.

I went into Amazon, I searched for headphones, I clicked a few filters, I made sure it was Prime eligible, and then I hit “One-click checkout,” and I was done.

That is a conversion immediately.

This, of course, is an oversimplification and generalization.

But for the sake of example, we have a very simple conversion.

I logged on to their site, I picked an item, and I bought it.

Now, let’s talk about how people at Wayfair typically shop.

And I want you guys to think about the last time you bought a large piece of furniture, let’s say a couch, for example.

Here’s what we typically expect of a customer.

So a person hops on to the Wayfair app on their phone on the way to work on a Monday morning, and they add four couches to their cart.

And on their computer at lunch break, they add two more things, and they remove one.

The couches spend a few days in their cart.

The customer then discusses these options later with their partner, who picks up their own phone, opens the Android app, and logs in to check them out.

A couple of days later, they finally decide on which couch they want to buy, and they become a converted customer.

So, tracking this behavior is extremely complicated.

And how can we possibly do this in real time if it took two weeks for these people to buy something?

Applying the principles

To answer this question, we need to take a step back first and look at the four principles that I’ve been talking on and on about.

How do we make this simple?

How do we simplify this behavior into something?

And there are really a few ways we can do this, since trying to munge events together that haven’t happened yet is—well, it’s impossible.

Let’s consider the steps through checkout in the first place.

You’ll hit your basket page.

Then you’ll choose your shipping options, you’ll select your payment method, you’ll confirm that everything looks good, and then you hit the receipt page that says, “Thank you for your order.”

The simplest thing we can do here is to measure the start of the funnel versus the end of the funnel.

So let’s start with broad strokes.

“Basket page hits divided by receipt page hits” gives us a very, very rough view of how checkout is working.

Do we maintain a constant and predictable rate throughout the day?

Does this constant and predictable rate match yesterday?

Does it match the week before?

We can watch current traffic versus historical trends in real time to determine this health, but let’s take a step further.

A hit to the basket page doesn’t necessarily indicate that someone is getting ready to go through the checkout funnel, or that it’s even from an active session that was started today.

We stream events for when a cart is created.

So if we start looking at the number of carts created versus orders placed, we actually get a reasonable guesstimate of our conversion rate.

Now, in terms of keeping things simple, we’re not really looking for a very deep analysis of customer behavior.

This is what our data science team is there for.

But, as I mentioned, does our number right now match the number from yesterday?

If it doesn’t, we can take that as an indication that we need to investigate something.
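The carts-versus-orders comparison can be sketched in a few lines. The counts and the 25% tolerance below are invented for illustration, and the function names are hypothetical. This is only a sketch of the “does today match yesterday” check, not the actual dashboard logic.

```python
def conversion_rate(carts_created, orders_placed):
    """Rough conversion estimate: orders placed per cart created."""
    if carts_created == 0:
        return 0.0
    return orders_placed / carts_created


def needs_investigation(today, yesterday, tolerance=0.25):
    """Flag when today's rate deviates from yesterday's by more than `tolerance` (relative)."""
    if yesterday == 0:
        return today != 0
    return abs(today - yesterday) / yesterday > tolerance


# Made-up counts for the same time window on two consecutive days.
yesterday = conversion_rate(carts_created=2000, orders_placed=100)
today = conversion_rate(carts_created=2100, orders_placed=42)
flag = needs_investigation(today, yesterday)
```

With these numbers the rate falls from roughly 5% to roughly 2%, so `flag` is true. The check says nothing about why the rate moved, only that someone should look.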

Something else that we keep track of is hits and error rates to individual payment gateways, since a very critical part of our checkout process involves multiple third parties.

By monitoring error rates, hits, and response times from third-party payment processors, we’re rapidly able to detect an issue, report it to the vendor, fix it, and move on with our lives.

So how do we make this stupid?

There’s a few ways that we’ve kept this application fairly stupid.

The way this works is all based off of the transaction ID of the request, and the subsequent requests.

We have an internal tracking system that we use to generate this data that we call Scribe.

Scribe produces a base-level event for each page request.

And for the purposes of this discussion, we filter down these requests to anything that has a controller or page type or otherwise that matches checkout.

We take the transaction ID as the primary key of a document, if you’re used to working with relational databases.

Since we have the base request and its transaction ID, we can then use the Logstash update and upsert functionality.

Utilizing this method, it doesn’t matter if there are delays in the pipeline, or if a “create cart” action makes it to Elasticsearch before the “request” action does.

As long as the IDs match, all of the documents will merge at some particular point in time.
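Here is a small sketch of why arrival order stops mattering once everything is keyed on the transaction ID. The event shapes are hypothetical, not Scribe’s real schema, and a dictionary stands in for the index.

```python
from itertools import permutations


def merge_events(events):
    """Fold partial events into one document per transaction ID (upsert-style)."""
    docs = {}
    for event in events:
        doc = docs.setdefault(event["transaction_id"], {})
        doc.update(event["fields"])
    return docs


# Hypothetical events; the field names are illustrative only.
events = [
    {"transaction_id": "t-42", "fields": {"page_type": "checkout"}},
    {"transaction_id": "t-42", "fields": {"cart_created": True}},
    {"transaction_id": "t-42", "fields": {"order_placed": True}},
]

# Every arrival order produces the same merged document,
# because the partial events write to disjoint fields.
results = [merge_events(order) for order in permutations(events)]
all_equal = all(r == results[0] for r in results)
```

The guarantee only holds while the partial events touch different fields. If two events wrote the same field, the last one to arrive would win.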

To keep it completely stupid, nothing will ever fail to insert into Elasticsearch, as I mentioned a bit earlier.

And if it fails, it’s actually not completely critical that these documents merge.

It’s very possible to use all of this data for our purposes with the document types as separate documents rather than munged events.

So how do we keep this fast if we have data streams that are pumping thousands of events or pieces of data per second that we then need to aggregate?

There’s a couple ways we can do this.

So, what we found is that Elasticsearch is just natively quick.

We’ve had some practice with it over the years, and we’ve kind of figured out best practices over this time and how to scale it out at Wayfair.

Because of this, we use a smart distribution of the data.

We take the best practices in terms of shard and replica count, and machine sizing, JVM heap stats, and all that fun stuff.

And we have a smart distribution of the data.

We need to have a good balance of both to be able to power these aggregations in terms of shards and replicas, as well as power the indexing to be able to show it in real-time.

Following the same line of thought around the smart distribution, we use daily indices and a well-defined schema.

Daily indices allow the size of the data set to remain manageable, and have a good distribution of indexing speed and search speed.
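As a sketch of the daily-index idea, the helpers below build one index name per day and list the indices a query over the last few days would touch. The `checkout` prefix and the `YYYY.MM.DD` suffix are assumptions for illustration; the talk does not give the real index names.

```python
from datetime import date, timedelta


def daily_index(prefix, day):
    """Name of the index that holds one day's documents."""
    return f"{prefix}-{day:%Y.%m.%d}"


def indices_for_range(prefix, end_day, days):
    """Index names covering the `days` days up to and including `end_day`."""
    return [daily_index(prefix, end_day - timedelta(days=n)) for n in range(days)]


names = indices_for_range("checkout", date(2018, 7, 12), 3)
```

A “today versus yesterday” comparison then only has to search two small indices instead of one ever-growing one.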

Defined schemas make sure there are no surprises as well.

We define this at the Logstash layer and at the Elasticsearch layer.

So if someone decides to add 500 fields to a tracking event tomorrow, we won’t all of a sudden be stuck in segment merging hell, and have a bad time.
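One way to picture a defined schema is a field whitelist that silently drops anything undeclared. The sketch below is only that picture. The field list is hypothetical, and the real enforcement described here lives in Logstash and Elasticsearch rather than in Python.

```python
# Hypothetical schema; the real field list is not given in the talk.
ALLOWED_FIELDS = {"transaction_id": str, "page_type": str, "load_time_ms": int}


def apply_schema(event):
    """Keep only declared fields whose values have the declared type."""
    return {
        name: value
        for name, value in event.items()
        if name in ALLOWED_FIELDS and isinstance(value, ALLOWED_FIELDS[name])
    }


event = {"transaction_id": "t-42", "page_type": "checkout", "load_time_ms": 310}
event.update({f"surprise_{i}": i for i in range(500)})  # someone adds 500 fields
clean = apply_schema(event)
```

The 500 surprise fields never reach the index, so the mapping stays the size it was designed to be.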

So how do we keep this easy? We’ve kept development on this platform as simple as we possibly can.

From the Scribe side, our tracking team turns around new requests very, very quickly.

And for information that is not already streaming into Kafka, it only takes a few minutes for them to get that set up and get the data flowing.

On the Logstash side, we’re able to develop configurations as an ERB template in Puppet, and apply it to a dummy server.

We have reviewers within our own team who can review the code, and then we can very rapidly deploy this to production.

Since Logstash can hot reload, we have no downtime on deploy, and data continues to flow.

A typical development cycle goes like this.

First, we get a request for information, we determine if that information is already collected, and if it’s already streaming.

And if not, we make that happen, as I just mentioned.

If it’s already collected and already streaming, we make some quick code changes in Puppet and we deploy them.

Several minutes later, we have data.

Using the data

So what do we do with all this data?

There’s many, many uses. We can immediately see drops in traffic through our checkout funnel. We can identify issues with payment processors. We can understand where something is going wrong. We can understand how load time for the page actually affects conversion. We can know what people are buying.

And something that’s bitten us before: do we have a renegade promo code that’s gone viral?

So how do we actually respond to this data?

This data has a future.

It’s going to be used for machine learning in the future, which is going to further enhance this data set.

By the way, we’re hiring.

This data will be enriched with infrastructure data as well. Was a bad customer experience the result of one bar of 3G cell service in Montana?

Or was a server running really hot?

So it’s also only the first data set that we’re generating here.

This is more of a proof of concept for future generations of a streaming platform like this.

So how do we actually use this data, now that I’ve talked about it a whole lot?

So a few weeks ago, we received reports of users not being able to add to their cart, which for an e-commerce website is a very, very bad thing.

When something like this happens, everyone panics, we test from different browsers, we test from different computers, data centers, devices, you name it.

Instead, with this data, we’re able to immediately see and visualize “Add to cart” actions from different platforms.

It was immediately apparent that only our mobile app was affected, and only for certain versions of it.

So instead of everyone attempting to diagnose this problem in their own way, we’ve cut a lot of investigation out of our time to resolution, and we immediately understand the scope and who’s going to be responsible for triage and fixing.

How about this? The site is on fire, right?

It’s time to panic. Our JavaScript load time is on the order of minutes.

Now, many times we take a look at load time as a massive aggregate of everything happening on-site.

This includes mobile, desktop, et cetera, many, many variables.

Using averages can really throw off data sets when you’re combining them in this fashion.

Was everything on fire?

Or was there one really bad actor?

We were able to filter down to this data and ultimately find the outlier, as you can see on the slide.

It turns out, as I spoiled a second ago, it was one guy in the middle of nowhere who continued to drop and regain his connection, causing very strange activity on the load time.

Instead of a long investigation, we were able to just say, “Yeah, well, it’s one guy, we’re still taking money from other people.”

And we moved on with our lives.
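The effect of one bad actor on an average is easy to reproduce. The samples below are invented, and the “10 times the overall median” cutoff is an arbitrary assumption. This is a sketch of the reasoning, not the query behind the slide.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical load-time samples in seconds, keyed by client.
samples = [
    ("client-a", 1.2), ("client-b", 0.9), ("client-c", 1.1),
    ("client-d", 1.0), ("client-e", 240.0), ("client-e", 310.0),
]

load_times = [seconds for _, seconds in samples]
overall_mean = mean(load_times)      # dragged up by one client
overall_median = median(load_times)  # much closer to the typical experience

by_client = defaultdict(list)
for client, seconds in samples:
    by_client[client].append(seconds)

# Clients whose own median is far above the overall median.
outliers = [c for c, values in by_client.items() if median(values) > 10 * overall_median]
```

The mean lands above 90 seconds while the median stays near one second, and the per-client breakdown points at `client-e` alone. That gap is the cue to filter before panicking.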

How can we be proactive with this data, though?

How can we actually use this to increase our incident response rate past having someone stare at one of these dashboards all day?

We’ll talk about the third case study now, “Query all of the things.”

Query all of the things

So, I can kind of see people over here.

By a show of hands, do you guys have two or more data sets or data sources that you’re looking to alert off of at any given point in time?

All right.

So, keep them up if you have found this really difficult to do without spending considerable amounts of time and money.

All right.

So, let’s come back to data availability for a moment.

The business had a really strong desire to extract metrics from engineering and make decisions.

So the data existed, but it wasn’t readily accessible by the business because we don’t let them have it.

Over the years, we’ve done some shopping to try and solve this, and try and find a way where we can provide this data back to the business without actually exposing our internal systems.

The problem is that most tools are proprietary to a given data source.

How do we give the people what they want? And it’s not with proprietary software.

Proprietary software kind of goes against our four principles that we’ve been discussing this whole time.

Four tools to alert with is not simple.

Teaching four tools is not fast.

Maintaining four tools is not easy.

Paying for four tools is also not an easy sell.

A couple of years ago, an engineering team took it on themselves to do this, and created a solution that works for everything.

We developed a tool called Alertserve, which you can see part of the UI for on the right.

Alertserve was a big step forward for us.

Suddenly, we were able to get alerts for user-defined issues from any given data source that we had at Wayfair.

This premise was very quickly snapped up by the business side.

Now PMs and engineers could sit and discuss potential pitfalls and preplan for prerelease, and have their alerts or queries ready to go on launch day.

Since we could query all sorts of data sources, we could now share anything with the business through quick emails.

For example, we could share error counts to specific pages, or controllers, or otherwise. We could show them issues present within our syslogs. We can show them page load times and report back.

However, this was not without its faults.

Let’s take a look back to something I said earlier.

Keep it simple, keep it fast, keep it stupid, and enable rapid iteration by keeping it easy.

So, Alertserve was not simple. It was reliant on a number of technologies.

And the way it kind of worked is that a user would hit our PHP and JavaScript frontend, and configure their query. It was then written to MySQL.

A Jenkins job then pulled the query configurations, and sent off each query to RabbitMQ.

Celery would then run the query and report the metrics back to Zabbix.

Zabbix would then evaluate the data versus the other data points, and decide whether it was time to alert.

As you can see, this was a really overcomplicated system with many points of failure.

When I was going through this architecture, I realized that this was 15-plus single points of failure.

So any given failure across our infrastructure would completely shut down our alerting system, and that’s no good.
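A quick back-of-the-envelope calculation shows why the number of links in the chain matters. The 99.9% per-component availability is an assumed figure, not one from the talk, and the sketch also assumes components fail independently.

```python
def chain_availability(component_availability, components):
    """Availability of a chain where every component must be up at once."""
    return component_availability ** components


# Assumed 99.9% availability per component, independent failures.
fifteen_links = chain_availability(0.999, 15)
four_links = chain_availability(0.999, 4)
```

Under those assumptions a 15-link chain is up about 98.5% of the time, while a 4-link chain is up about 99.6% of the time. Every extra dependency quietly spends some of the alerting system’s uptime.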

The other problem was that this code base spanned multiple languages across multiple repositories, and it checked in at thousands of lines of code that were not located in any central place.

This was me.

So it wasn’t fast.

In fact, during our research into the code base, we found that it was the exact opposite.

MS SQL and MySQL queries were not set to timeout by default, and the user could enter an arbitrary timeout value.

Something that we found that’s a little gory is that the highest timeout that some user had configured was 36 weeks.

There were no request timeouts to other data sources as well.

So if a data source was down or struggling, an individual request could hang for 30, 60, 90 seconds, or however long the actual application’s timeout was.

Another problem was that all of these queries ran in serial.

Now, after seeing this, you might be thinking, “How did anything ever run with a 36-week timeout if things were set to run in serial?”

And I have good news: that query failed with a syntax error.

So Alertserve was not stupid.

In fact, it did a little too much.

It allowed for multiple evaluation thresholds against multiple queries for the same alert.

And anyone who has ever had to deal with multi-case alerting, or trigger dependencies, or anything of the sort, knows what the problem is here.

Alerts, generally speaking, should be broadly scoped and pick up problems that are new and novel.

By filtering down to multiple specific thresholds, we found ourselves firing alerts for exact cases.

This, of course, leaves room for failure in picking up said new and novel problems.

It also makes it very difficult to troubleshoot when there’s a problem in the alert configuration.

Now, I know some of you have seen this tweet before.

So, Alertserve was not easy to iterate on.

Because of the number of dependencies, any iteration would require a developer to know and understand the complex interplay between a number of systems.

And due to the complexity, it didn’t really ever run in dev properly, so proper testing was an absolute nightmare.

And that’s why this tweet is relevant.

It also took a number of attempts to migrate this application to a new data center, since it was not data center agnostic.

Let’s step away from the gore for a second.

We made a really solid attempt at solving a business problem that we really needed to solve.

But we didn’t first approach it in a way that was fast or scalable.

So how could we continue to provide intelligence to the business if our platform is a whole lot like this bridge that I would not walk across?

So we built something called Retriever.

After a considerable amount of thought and discussion and debate, we decided to start from the ground up, and we built Retriever.

We addressed our four points of success very thoroughly.

The frontend of the tool was written in our PHP and JavaScript admin tool framework, meaning that anyone who had ever written an admin tool at Wayfair could very easily iterate on this code.

The backend code was written in Python 3.6, and checks in at the time of me speaking at 2,300 lines of code.

Now this includes tests and .gitignore files, Docker files, documentation, and all that fun stuff.

The functional portion of the code is actually only 1,300 lines, and that includes comments.

So realistically, our code base that powers the backend of the service is now between 700 and 800 lines of code.

We wrote this to be as dependency-free as possible: no more Rabbit, no more Celery, no more MySQL configuration databases.

None of that.

Currently, Jenkins runs the Python code, which queries cache or MS SQL for configuration files.

It runs the queries out to those individual data sources, and we’re done.

To keep it stupid, Retriever does not perform any kind of evaluation on the data that’s returned.

All it does is check that the data is in fact returned.

When it fails, it fails with a very explicit error message.

Retriever then sends the query result back to Datadog, so we can use the robust alerting engine instead of making on-the-fly comparisons during a job.
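As a rough illustration of this "keep it stupid" flow (not Retriever's actual code), a minimal sketch might look like the following. The names `run_check`, `EmptyResultError`, `run_query`, and `send_metric` are hypothetical; the data source and the metrics client are passed in as plain callables so that no particular database driver or Datadog client API is assumed.

```python
from typing import Callable, Iterable, List, Tuple


class EmptyResultError(RuntimeError):
    """Raised when a query returns no data at all."""


def run_check(
    name: str,
    run_query: Callable[[], Iterable[float]],
    send_metric: Callable[[str, float], None],
) -> int:
    """Run one query, verify that data came back, and forward it unchanged."""
    values = list(run_query())
    if not values:
        # Fail loudly and explicitly; no thresholds or comparisons happen here.
        raise EmptyResultError(f"query '{name}' returned no rows")
    for value in values:
        # Evaluation is left to the alerting engine that receives the metric.
        send_metric(name, value)
    return len(values)


# Usage with in-memory stand-ins for a data source and a metrics client.
sent: List[Tuple[str, float]] = []
count = run_check(
    "checkout.orders",
    lambda: [42.0],
    lambda metric, value: sent.append((metric, value)),
)
```

The only decision the job makes is whether data came back; everything else is deferred to the system that receives the metric.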

So how do we make it fast?

First things first, away with serial, no more 36-week-long queries.

We started using threading within the Python process.

And while any Python developers in the room might know that the GIL can be a very limiting factor in many cases for threading, threading works best when you’re waiting on I/O.

And guess what we’re doing. It’s exactly what we’re doing.
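To make the I/O point concrete, here is a small sketch using the standard library's `concurrent.futures.ThreadPoolExecutor`. It is an assumption about how such a fan-out could be written, not the code from the talk; `run_all` and `fake_query` are hypothetical names, and `time.sleep` stands in for a blocking database or HTTP call.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_all(queries: Dict[str, Callable[[], float]], max_workers: int = 8) -> Dict[str, float]:
    """Run every query in its own thread and collect the results by name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(fn) for name, fn in queries.items()}
        return {name: future.result() for name, future in futures.items()}


def fake_query(value: float, delay: float = 0.2) -> Callable[[], float]:
    """Build a stand-in query that sleeps to simulate waiting on I/O."""
    def query() -> float:
        time.sleep(delay)  # a sleeping thread releases the GIL, like a blocking network call
        return value
    return query


queries = {f"q{i}": fake_query(float(i)) for i in range(5)}
start = time.monotonic()
results = run_all(queries)
elapsed = time.monotonic() - start
```

Run serially, the five stand-in queries would take about one second in total; because the threads spend their time waiting rather than computing, they overlap and the whole batch finishes in roughly the time of a single query.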

We also added query timeouts. We added query timeouts everywhere.

We enforced a hard cutoff of one minute for SQL queries, and five seconds for a given web request.
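One way to sketch a hard cutoff like this in Python is to wait on a future with a timeout. The one-minute and five-second values come from the talk; everything else here (`run_with_cutoff`, `QueryTimeout`, the constant names) is hypothetical. Note that this sketch only stops waiting for the result. It does not cancel the underlying query, so in practice the timeout should also be set on the client itself, for example through the `timeout` argument that `requests` accepts.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, TypeVar

T = TypeVar("T")

SQL_TIMEOUT_SECONDS = 60.0  # hard cutoff for SQL queries, as described in the talk
WEB_TIMEOUT_SECONDS = 5.0   # hard cutoff for a single web request


class QueryTimeout(RuntimeError):
    """Raised when a call does not finish within its cutoff."""


def run_with_cutoff(fn: Callable[[], T], timeout: float, name: str = "query") -> T:
    """Run fn in a worker thread and stop waiting once the cutoff passes."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        raise QueryTimeout(f"{name} exceeded the {timeout:g}s cutoff") from None
    finally:
        # Do not block on the worker thread; it is abandoned, not cancelled.
        pool.shutdown(wait=False)


# Usage: a fast call returns normally, a slow one raises QueryTimeout.
fast_result = run_with_cutoff(lambda: 7, timeout=WEB_TIMEOUT_SECONDS, name="fast")
```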

Each data source also has its own job, so we have further isolation and parallelization.

We can also disable certain jobs by tracking query times if we find that someone has written a really bad or less-than-ideal query, so we don’t actually harm production.
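The talk does not say how slow queries are detected, so the following is only a sketch of one possible approach: keep a short window of recent run times per job and skip the job once its average exceeds a limit. The class name `QueryTimeTracker`, the 30-second limit, and the window size are all assumptions made for illustration.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class QueryTimeTracker:
    """Track recent run times per job and flag jobs that are consistently slow."""

    def __init__(self, limit_seconds: float = 30.0, window: int = 5) -> None:
        self.limit_seconds = limit_seconds
        self.window = window
        # Each job keeps only its most recent `window` durations.
        self._durations: Dict[str, Deque[float]] = defaultdict(
            lambda: deque(maxlen=window)
        )

    def record(self, job: str, seconds: float) -> None:
        """Store how long the latest run of a job took."""
        self._durations[job].append(seconds)

    def is_disabled(self, job: str) -> bool:
        """Return True once a full window of runs averages above the limit."""
        recent = self._durations.get(job)
        if not recent or len(recent) < self.window:
            return False  # not enough history to judge
        return sum(recent) / len(recent) > self.limit_seconds


tracker = QueryTimeTracker(limit_seconds=30.0, window=3)
for seconds in (40.0, 45.0, 50.0):
    tracker.record("slow_report", seconds)
for seconds in (1.0, 2.0, 1.5):
    tracker.record("orders", seconds)
```

A scheduler could consult `is_disabled` before each run and skip the job, which keeps a single badly written query from repeatedly loading a production data source.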

How do we make it easy?

As I mentioned a few minutes ago, we’re only looking at about 1,000 lines of code, or maybe less.

We also decided, as good Python developers, that not writing our own modules is the way to go.

So instead of something proprietary, if someone has worked with pymssql or requests in the past, they’re already up to speed on our code base.

Retriever actually runs in dev.

It’s running in dev right now, and it runs in dev 24 hours a day.

It’s very nice.

Retriever is also data center agnostic.

When we fail over data centers, which we fairly routinely do as tests, it just works.

It’s great, data keeps flowing, and everyone’s happy, including me.

So after all that, what does this actually give us?

And how does this really fill a business need?

Well, we now have a simple interface that allows us to query an arbitrary data source and send off the results to Datadog.

This allows us to use the alerting engine, but also allows us to keep track of all this data for a long period of time.

So now, instead of just sending an email, we can track trends over time, watch for anomalies, and tie it in with other infrastructure metrics.

We’ve also been able to make same-day changes to add new features, fix bugs, or make tweaks here and there.

I ran into a bug yesterday sitting in the conference, and I pushed a fix to production in several minutes, and everything was fixed, and everyone was happy.

This tool allows a user to retrieve their data within a common interface, pull it into a common platform that we’re all familiar with, and work with it in a common tool that we’re also all familiar with.

So this quite explicitly fits our four pillars of simple, fast, stupid, and easy.

Also, the checkout funnel metrics I mentioned earlier are a data source. It’s very helpful.

Understanding your business in real time

So let’s recap.

Remember your foundation.

Real-time data systems must be built on data availability.

Remember your four principles.

You need to keep some things simple.

Don’t overcomplicate analysis on ingestion. Leave that up to the users.

You need to keep it stupid.

Ensure that data is available even if the expected insert fails.

You need to keep it fast.

Real-time analysis needs to be fast.

Otherwise, it’s not real time.

We need to keep it easy.

You should be able to rapidly iterate on this platform and provide further value as quickly as you can.

We need to keep analysis simple.

As I mentioned earlier, with the conversion rate problem, it’s very difficult to kind of generate that data set.

But if we keep things simple, and look at trends instead of maybe exact numbers, we can have a very relevant and useful data platform.

This helps us know that something is wrong, and not necessarily what. But determining what can be done outside of the main visuals: once we receive these alerts, we can dive into Datadog and find out whether it was an infrastructure problem, a platform problem, or something else.

And trends are your friend.
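As a small worked example of looking at a trend rather than an exact number, the sketch below compares the latest value of a metric with the average of its recent history. The function name `relative_change` and the sample figures are made up for illustration, and in a setup like the one described here this kind of comparison would live in the alerting layer rather than in the collector.

```python
from statistics import mean
from typing import Sequence


def relative_change(history: Sequence[float], current: float) -> float:
    """Return the fractional change of `current` versus the mean of `history`."""
    if not history:
        raise ValueError("history must contain at least one point")
    baseline = mean(history)
    if baseline == 0:
        raise ValueError("baseline is zero; relative change is undefined")
    return (current - baseline) / baseline


# Hypothetical orders-per-minute samples, followed by a sudden dip.
orders_per_minute = [100.0, 110.0, 90.0, 100.0]
drop = relative_change(orders_per_minute, 60.0)  # baseline is 100, so this is -0.4
```

A 40 percent drop against the recent baseline says that something is wrong without requiring anyone to agree on what the exact number should have been.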

So what’s the bottom line here?

Well, as we all know, the business can make life difficult for engineers like us.

But if we pay a little attention to the metrics that the business side uses to run the business, we can use them for our own purposes.

Utilizing business metrics can allow us to easily scope and understand the impact of reported problems, and can point us in the right direction to begin triage.

Will NGINX logs show that people can’t add to their cart?

I’ll give you a hint.


By utilizing these metrics in real time, we can not only know what is breaking, but we can give the business a powerful platform that helps them to understand what our business is doing in real time.

So remember your four pillars, and you too can build a powerful platform like this.

Thank you.