Monitoring Caviar's migration from monolith to microservices
Published: September 28, 2017
Good morning. My name is Walter King. I'm an engineer on the platform team at Caviar. My team is about three people, and we support a team of about 25 engineers building the Caviar application.
What is Caviar?
So, what is Caviar? Caviar is a food ordering platform. What that means is: you're hungry, you order some food, and we send the order to a restaurant. They make the food, and then we send a courier (we have our own delivery fleet) to deliver it to your house. We have over 3,000 restaurants across 20 different regions. Caviar was acquired by Square in 2014, and we're currently located with them.
The Caviar application
To start diving into the technical details, this is roughly what we looked like about a year ago. We had three different applications, each facing a different group of users: a diner application, which was mobile and web; the merchants, who get an iPad that receives all their orders; and the couriers, whose phone application shows them where to go to pick up food and where to go to deliver it.
It's a typical monolithic Rails app. We called it the Delivery Rails app, because delivery was really our first product, and there are some other small support apps alongside it. Everything runs inside Docker on a single ECS cluster in AWS. This is the application we're trying to split up and move to microservices.
Before we did the split, we wanted to make sure we had a solid base, so we took an audit of all the monitoring tools we had: Datadog for metrics and APM, Bugsnag for exception tracking and counting, and Kibana for logs. Here is an example of a trace for our monolithic Rails application after we went through and solidified the instrumentation.
A couple of things I want to point out. First, all those yellow bars come with the default instrumentation from the agent; they're all SQL queries. That's great, but when we first did this, a flat line of yellow bars was pretty much all we saw. So we went through and manually instrumented everything in our application to get more detail. That way, when we saw a Postgres query, we could say, "Oh, it's this section of code that was calling it, and that's the section of code we should go and improve."
The other interesting thing here is that those red bars are all exceptions: exceptions we captured and handled. From the top-level metrics point of view, these requests are 200s; they look successful. But there are errors deep down inside the system, and we wanted to dig into those, so we needed to correlate this trace with our other systems.
Like I said, we use Kibana and Bugsnag. The trace ID we pulled from the URL is also searchable inside Bugsnag, and every log line we output includes that trace ID. Between these three systems, we can jump around using the trace ID as the common identifier that ties everything together.
Great, so we have this monolithic Rails application. When we think about how we organized our team, we had 15 or 20 engineers at the time, and it was pretty much one big team, just like we had one big app. At this point we were growing, and we needed to figure out a way to split that up. What we decided was to split based on our three different applications and have teams focused on those particular audiences, where an audience is just a group of users. This was really to give clear ownership over feature development. We were having a lot of trouble where you'd build a feature, move on to another one, and then it was unclear who was responsible for fixing any bugs. So on the org side, we split up into three teams, and we wanted to split up the application similarly.
All we did was take the application and deploy it three times, with our load balancer routing to the three different deployments based on the URLs. It's the same exact code deployed three different times. The main benefit we were looking for was to split up our ops along the same lines as our org structure, so that when we got a page, that page went to the team that caused it: if the diner app was causing pages, the diners team was the one that went and fixed it. Similarly for downtime events: the merchant app could go down for 10 minutes, and while it was bad that merchants weren't getting new orders, diners were still able to place orders, and we could catch up a little later. On the APM side, we can see the different applications separated entirely. Aside from deployments, these were completely separate applications, from paging to APM to monitoring metrics.
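The routing itself can be as simple as path-based rules at the load balancer. A hypothetical sketch of the idea in nginx-style config (the upstream names and paths are illustrative assumptions; Caviar ran behind an AWS load balancer):

```
# Same code, deployed three times; the load balancer picks the deployment
# by URL, so each audience's traffic, pages, and downtime are isolated.
upstream diners    { server diners-app:3000; }
upstream merchants { server merchants-app:3000; }
upstream couriers  { server couriers-app:3000; }

server {
  listen 80;
  location /api/diners/    { proxy_pass http://diners; }
  location /api/merchants/ { proxy_pass http://merchants; }
  location /api/couriers/  { proxy_pass http://couriers; }
}
```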
So we split it up into the different applications. If we dig into one of those, we can see that only the API calls relevant to that particular team show up in its APM, because they're the only ones actually getting called. We got a kind of pseudo-microservices setup here: since only about 10% of the application was being called by a given team, those were the endpoints they could focus on.
Great, so our monolith application is split up. What next? We actually already had one little service in the system, the ML service. This service was primarily responsible for calculating ETAs: when you log into the webpage and see a list of restaurants, it may tell you how long it's going to take each particular restaurant to deliver food to you. This endpoint represented about half of the volume in our system, because we were calling it once for each individual restaurant. And it was a relatively slow request, so it was something we really wanted to dig into, because it was a common refrain that "the ML service is slow."
The big problem, though, was that it was still intertwined with the rest of the diner application. So we split that out too, into an application that served just a single endpoint. But we still couldn't really correlate the two: we could see the average response time for the delivery application and the average for the ML service, but they were called with different parameters, so it was unclear what was going on.
So we went to add cross-service tracing. We were using Excon for our HTTP communication, and Excon, unfortunately, is not instrumented by default in the APM. So we instrumented it ourselves, which was pretty simple to do: Excon has its own middleware system, so I pretty much took the code from the net/http instrumentation and copied it into an Excon middleware. We did make one major change, though: instead of reporting "excon" as the service name, we report the domain name. It was really important to us to see the different APIs split out, with averages for the health of each downstream system; when everything is just net/http, we found it was all conflated together.
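The shape of that middleware looks roughly like this. This is a hedged sketch, not Caviar's actual code: a real version would subclass `Excon::Middleware::Base` and report to Datadog's tracer from the ddtrace gem, but here a stub tracer stands in so the pattern runs standalone. The key decision from the talk is visible on the `trace` call: the span's service name is the request's host, not a generic "excon".

```ruby
# Stub tracer mimicking a block-based tracing API; in a real setup this
# would be the Datadog tracer from the ddtrace gem.
class StubTracer
  Span = Struct.new(:name, :service, :tags)

  attr_reader :spans

  def initialize
    @spans = []
  end

  def trace(name, service:)
    span = Span.new(name, service, {})
    @spans << span
    yield span
  end
end

TRACER = StubTracer.new

# Excon middlewares receive a `datum` hash describing the request and pass
# it down the stack. Reporting the *host* as the service name makes each
# downstream API show up separately in APM instead of being lumped together.
class TracingMiddleware
  def initialize(stack)
    @stack = stack
  end

  def request_call(datum)
    TRACER.trace('excon.request', service: datum[:host]) do |span|
      span.tags['http.method'] = datum[:method].to_s.upcase
      span.tags['http.path']   = datum[:path]
      @stack.request_call(datum)
    end
  end
end

# Simulate one request flowing through the middleware to a terminal handler.
terminal = Object.new
def terminal.request_call(datum)
  datum.merge(status: 200)
end

response = TracingMiddleware.new(terminal).request_call(
  host: 'ml-service.internal', method: :get, path: '/eta'
)
```

In a real Excon setup, the middleware is registered by appending it to the connection's middleware list.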
So what does this look like? In this chart we have the trace. All the dark green is the delivery monolith, and the light green is the remote service's trace, so we can see all the Postgres calls that the remote app was making. But it turns out this only accounted for about a third of any given request. Once we had the information showing that the ML service wasn't what was slow (it was actually the delivery app's calls around it), we decided not to pursue improving the performance of that downstream application.
At this point, it was about time for us to start adding new microservices to our system. The first one we added was fulfillments. The responsibility of the fulfillments service was command and control of the different orders: after an order was placed, it got passed to the fulfillments service, which made sure that the merchants and the couriers got the information they needed to deliver that order. When we built it, we built in all the same monitoring tools that we had built into the delivery application. We wanted to do a slow, city-by-city release, because we were worried: this was the core part of our service, assigning the deliveries, and we wanted to make sure it worked. So we would release into our smaller cities, come in the next day, and say, "Great, the site didn't go down last night, let's add a new city."
That worked great for a while, until we added one of our two or three bigger cities, and this is what happened. This was about an hour of downtime. What happened was that the response time on the service had been slowly creeping up, to the point where it started hitting timeouts in the caller code. Once those timeouts started hitting, the caller started retrying, and that cascaded into bad things happening. Luckily, it was feature flagged, so we were able to turn the feature flag off, fix the data, and then start everything back up again. But it was an hour of downtime, and that was not what we were hoping for.
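The feature flag that saved us here was essentially a per-city gate on the calling side. A hypothetical sketch of that rollout pattern (the flag store, city names, and routing helper are all illustrative, not Caviar's actual code):

```ruby
# Hypothetical per-city rollout gate: the new fulfillments service only
# receives traffic for cities that have been explicitly enabled, and the
# whole rollout can be shut off instantly if things go wrong.
class CityRollout
  def initialize
    @enabled = {}
  end

  def enable(city)
    @enabled[city] = true
  end

  def disable_all!
    @enabled.clear
  end

  def use_fulfillments_service?(city)
    @enabled.fetch(city, false)
  end
end

# Decide, per order, whether to take the new microservice path or the
# old in-monolith path.
def route_order(rollout, city)
  if rollout.use_fulfillments_service?(city)
    :fulfillments_service
  else
    :legacy_monolith
  end
end

rollout = CityRollout.new
rollout.enable('sacramento') # start with a smaller city
rollout.enable('seattle')    # ...then a bigger one
```

The value of the gate is the kill switch: `disable_all!` routes everything back through the monolith immediately, which is what turning the flag off did during the outage.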
So at this point, we rolled everything back and asked, "Could we have done better? Should we have been able to predict this?" When we looked at the metrics, what we saw was that our response times were slowly creeping up at the P99 level, but the averages were not. So we went through and added a whole bunch of new monitors with tight SLAs at the P99 level to really catch the outliers, and once we caught them, we went through and fixed them.
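For reference, a P99 alert of that shape, expressed in Datadog monitor-query syntax, might look roughly like this (the service name, time window, and metric name are illustrative assumptions, not the actual monitor):

```
avg(last_5m):p99:trace.rack.request{service:fulfillments} > 3
```

The point is that the query aggregates the 99th-percentile latency rather than the average, so slow outliers trip the alert even while the mean stays flat.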
This is a monitor we had a couple of weeks ago: a three-second alert on the P99, which is still pretty high, but we generally weren't hitting it too often. That night during the dinner rush, we noticed we started crossing the threshold. Even though this was affecting less than a percent of users, we wanted to jump right in and fix it immediately, and really focus on performance from the start.
Through investigating, we were able to find the actual endpoint that was causing all the slowness. On the right side is the response time graph; the yellow line shows how it increased over that day. It's kind of hard to see, but the other two lines are relatively flat. So again, most users weren't experiencing any impact.
Without this alert, we probably would have ignored this and not noticed it until it became an issue where half the people were affected, which would have been way worse. It turned out that none of the data this particular API call was reading was used anywhere, so on the caller side, we just turned it off. You can see when we did that: on the left, the request count dropped precipitously, and likewise the response time did. We found that most of our performance issues could be fixed by requesting less data, reading less data from the database, and generally not doing work we didn't really need to do.
So this is what a request looks like today. For any of our new endpoints, we're generally starting in the Rails app, making a whole bunch of network calls through Excon, bundling the responses together, and then returning that to the frontend. There isn't as much detail in a lot of those downstream services as we added to the delivery app, but generally they're doing less work, because they're more microservices-oriented, so we almost get function-level detail just from the API calls we're making. And now that we have all this APM instrumentation, any time we make a change, we try to tie it back to a user experience, measure exactly what the performance impact of any fix we push out is, and make sure we're working on the right thing.
So, if you have any questions, my email is email@example.com. We’re, of course, hiring, so there’s a link there to the jobs page.
Cool. Well, all right. Thank you.