This is a guest post from Liran Haimovitch, co-founder and CTO of Rookout, the rapid debugging company.
At Rookout, we aim to make debugging as fast and frictionless as possible for developers. So naturally we pay close attention to the availability and performance of our own SaaS application, monitoring everything we can to ensure that we’re delivering a great experience to our growing customer base.
Datadog is our main monitoring system at Rookout, and we’ve even developed an integration to send real-time debugging data to Datadog for monitoring and analytics. In this post, I’ll discuss four aspects of the Datadog platform that have been especially valuable to us.
As a modern SaaS company with little classic infrastructure, we have no servers to manage (we run on Google Kubernetes Engine) and no databases to administer (we use Google Cloud SQL). So naturally, what we care about first and foremost is our software. From this perspective, Datadog's easy-to-install APM is simply fantastic: you enable it with just a few lines of code and get an informative overview of each service's traffic, latency, and errors.
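For a Python service, for example, enabling the tracer really can be a couple of lines. This is a minimal sketch (the `ddtrace` library also offers a `ddtrace-run` launcher that requires no code changes at all); the import is guarded so the snippet degrades gracefully where the client isn't installed:

```python
# Minimal sketch: enable Datadog APM auto-instrumentation in a Python service.
# Requires the ddtrace package (pip install ddtrace); guarded here so the
# snippet is harmless to run in environments without it.
try:
    from ddtrace import patch_all

    patch_all()  # auto-instruments supported libraries (Flask, requests, redis, ...)
    apm_enabled = True
except ImportError:
    apm_enabled = False
```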
Whenever someone tells me something went wrong in one of our environments, that APM overview is the first place I look. Requests per second, latency, and error rate are the three key metrics to watch, since at least one of them goes awry in almost every incident we see.
As soon as I integrated Datadog with our infrastructure, I started getting visibility into our SaaS platform, but I had yet to set up alerts.
When installing monitoring on your infrastructure for the first time, chances are you don't yet know what good or bad performance looks like. The good news is that this problem has a rather simple solution: every time something goes wrong, we hold a postmortem, and frankly, so should you! One of the questions we ask ourselves in every postmortem is: how can we detect this better next time? More often than not, we find the answer is a Datadog monitor.
Here are some examples of the monitors we use:
- Make sure Redis slaves are connected, to ensure failover will happen
- Make sure Redis has enough memory
- Make sure container count doesn’t increase too much
- Make sure node count doesn’t increase too much
- Traefik Ingress:
  - Monitor various client/server error codes
  - Monitor latency
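As an illustration, a monitor like the Redis-memory one above can be defined through the Datadog API. This is a sketch, not our exact configuration: the query, threshold, and notification handle are assumptions, built on the `redis.mem.used` and `redis.mem.maxmemory` metrics from Datadog's Redis integration.

```python
import os

# Sketch of a "Redis has enough memory" monitor definition. The query,
# threshold, and @-handle are illustrative assumptions, not our real config.
monitor = {
    "type": "metric alert",
    "name": "Redis memory usage too high",
    # Alert when Redis uses more than 90% of its configured maxmemory
    # on average over the last 5 minutes.
    "query": (
        "avg(last_5m):avg:redis.mem.used{*} "
        "/ avg:redis.mem.maxmemory{*} > 0.9"
    ),
    "message": "Redis is close to its memory limit. @slack-ops",
    "options": {"notify_no_data": True, "no_data_timeframe": 10},
}

# Creating it requires API credentials; guarded so the sketch is safe
# to run without them.
if os.environ.get("DD_API_KEY") and os.environ.get("DD_APP_KEY"):
    from datadog import initialize, api

    initialize()
    api.Monitor.create(**monitor)
```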
While APM gave us impressive coverage out of the box, we wanted to go further and answer some open questions about our app's performance and success rate. For instance, the Rookout service uses a heavy-duty serialization framework. Originally, this was one of our most useful pieces of code: it enabled us to build new REST endpoints in a matter of minutes.
Today, with request rates going through the roof, it can sometimes become a performance bottleneck. So we added custom tracing spans to it, including all the metadata we might need to investigate those issues. Here's a (simplified) snippet of how we did it:
```python
# Assumes a marshmallow-style Schema base class and Datadog's ddtrace tracer.
from ddtrace import tracer
from marshmallow import Schema


class MonitoredSchema(Schema):
    def dump(self, obj, **kwargs):
        # Wrap serialization in a span tagged with the concrete schema class
        with tracer.start_span("dump", tracer.current_span(),
                               "schema", self.__class__.__name__):
            return super(MonitoredSchema, self).dump(obj, **kwargs)

    def load(self, data, **kwargs):
        # Wrap deserialization the same way
        with tracer.start_span("load", tracer.current_span(),
                               "schema", self.__class__.__name__):
            return super(MonitoredSchema, self).load(data, **kwargs)
```
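The wrapping pattern itself doesn't depend on any particular tracer. Here's a self-contained sketch with a hypothetical stand-in tracer (not the Datadog client) showing how every subclass inherits the instrumentation for free:

```python
import time
from contextlib import contextmanager


class SimpleTracer:
    """Hypothetical stand-in tracer that records (operation, resource, duration)."""

    def __init__(self):
        self.spans = []

    @contextmanager
    def start_span(self, operation, resource):
        start = time.monotonic()
        try:
            yield
        finally:
            self.spans.append((operation, resource, time.monotonic() - start))


tracer = SimpleTracer()


class Schema:
    """Placeholder for the serialization framework's base class."""

    def dump(self, obj, **kwargs):
        return dict(obj)

    def load(self, data, **kwargs):
        return dict(data)


class MonitoredSchema(Schema):
    def dump(self, obj, **kwargs):
        with tracer.start_span("dump", self.__class__.__name__):
            return super().dump(obj, **kwargs)

    def load(self, data, **kwargs):
        with tracer.start_span("load", self.__class__.__name__):
            return super().load(data, **kwargs)


class UserSchema(MonitoredSchema):
    pass


result = UserSchema().dump({"id": 1})
print(tracer.spans[0][:2])  # ('dump', 'UserSchema')
```

Because the span is tagged with `self.__class__.__name__`, each schema subclass shows up under its own name, which is exactly what you need to spot which serializer has become the bottleneck.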
Soon after you start running containers in production, you'll have an "Aha!" moment when you suddenly stop caring as much about the containers themselves. You'll no longer worry about how many there are, where they run, or how they perform, because most of the time it just works. (We use Google Kubernetes Engine, but ECS provides a similar experience.)
Still, there will be times when you have to dive deeper. For me, those moments can be triggered by:
- Service disruptions in the underlying networking and compute infrastructure
- Kubernetes resource allocations: set them too high and you waste resources, set them too low and your availability will suffer
At these moments, you need to know exactly how many containers you’re running; how they are deployed across nodes and zones; and how many resources they are using.
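To make that concrete, here's a toy sketch (hypothetical data, not the Datadog API) of the kind of per-node tally of requested resources that answers those questions:

```python
def summarize_requests(pods):
    """Tally CPU (millicores) and memory (MiB) requests per node from pod specs."""
    totals = {}
    for pod in pods:
        node = pod["node"]
        cpu, mem = totals.get(node, (0, 0))
        for container in pod["containers"]:
            cpu += container["cpu_m"]
            mem += container["mem_mi"]
        totals[node] = (cpu, mem)
    return totals


# Hypothetical cluster state: three pods spread across two nodes.
pods = [
    {"node": "node-a", "containers": [{"cpu_m": 250, "mem_mi": 256}]},
    {"node": "node-a", "containers": [{"cpu_m": 100, "mem_mi": 128}]},
    {"node": "node-b", "containers": [{"cpu_m": 500, "mem_mi": 512}]},
]

summary = summarize_requests(pods)
print(summary)  # {'node-a': (350, 384), 'node-b': (500, 512)}
```

This is the tedious bookkeeping that a good containers view does for you continuously, across every node and zone.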
Surprisingly enough, the Kubernetes ecosystem has yet to build good tools for providing visibility into your running containers. This is why I especially appreciate the Datadog Containers page. It provides a solid UX and allows you to answer the above questions as well as many others.
Datadog became one of our fundamental monitoring tools at a very early stage here at Rookout; in fact, I wish we had implemented it even earlier. Today, thanks in part to the four features outlined above, Datadog is an integral part of our engineering toolkit, enabling us to provide a cutting-edge SaaS product to our customers.
Rookout lets dev teams add non-breaking breakpoints to their live code (running in dev, staging, and production) to get any data they need instantaneously, without restarting, redeploying, or adding more code.