Advances in Log Management | Datadog
Published: April 16, 2019

I’m Steve Lechner.

I’m a product manager on our log management team.

And there are a couple of very exciting developments in our product that I’d like to introduce you to today with a story.

Kevin: An SRE’s tale

This is Kevin.

Kevin is a new SRE in a large organization with many developer teams.

He walks into the office one day, and then he sees a Slack message in his channel that tells him, “Hey, there’s a lot of 500 responses coming from one of your services that you own. And, well, here’s a link, why don’t you click on it to investigate further.”

So he clicks on it and it brings him to this view of his log explorer, which shows him the count of all the status 500 logs across all of his many services for all of his many developer teams.

And clearly, one of them has some sort of spike in 500s happening right now, so he clicks on it and then he sees that option to view the logs.

He clicks through that.

It brings him to ultimately an actual example of a 500 response that’s happening right now for one of his services.

And that particular 500 log isn’t really telling him all that he needs to know in order to investigate, so he clicks on the service and then that See Trace in APM option, which brings him to this view.

It’s the actual trace of the request that ultimately generated his 500 response.

So, that flame graph there is showing, for every function that contributes to the whole request path, the time taken. You’ll notice there are three of them or so that have this little red icon, and those are the ones that have some sort of error happening.

And when he clicks on the error tab at the very bottom, he can actually find the traceback that is contributing to the fact that this is resulting in a 500.

Now, from this view, he can identify, “Is this a customer-facing issue,” and he can prioritize just how urgent it is.

And, from the code that he sees erroring out, he can identify what team he needs to loop into the investigation.

Breaking down Kevin’s troubleshooting workflow

The point is that this investigative flow Kevin was able to accomplish here was very smooth. There are actually two tricks he employed that made that smooth investigative flow possible, and those are the two tricks that I want to focus on and unpack today.

The first of those steps that he took was this one here.

From a log event, an interesting log event, he was able to jump to the request that had generated that log event, that See Trace in APM.

Now, how was this possible?

It’s really because Kevin’s organization is taking advantage of a new feature that Datadog offers for deep linking their log events with their traces.

Now, before Kevin’s organization used Datadog, they actually tried to deep link their log events with their traces and correlate them together.

They were able to do that at a high level, based on what host the log came from and what service the request was happening on, but actually linking their logs with the actual request that generated them was something very difficult for them to do.

It involved instrumenting all of their code to pass some sort of identifier string into their logs, and then coming up with a protocol that would work across services as the request jumped from service to service.

So, that way, every service could catch that request ID and pass it into their logs as well.

And getting all of their teams to comply with that protocol was a mess, just a lot of effort.
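In Python terms, that do-it-yourself protocol might have looked something like the sketch below. This is only an illustration of the burden being described; the header name, logger name, and helper function are hypothetical, not anything from the talk:

```python
import logging
import uuid
from contextvars import ContextVar

# Hypothetical correlation ID, shared across services via an agreed-on header.
request_id: ContextVar[str] = ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Stamps every log record with the current request's ID."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id.get()
        return True

def handle_incoming(headers: dict) -> str:
    # Reuse the upstream service's ID if present, otherwise mint a new one,
    # so the ID survives as the request hops from service to service.
    rid = headers.get("X-Request-ID") or str(uuid.uuid4())
    request_id.set(rid)
    return rid

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(request_id)s %(message)s"))
logger.addHandler(handler)
logger.addFilter(RequestIdFilter())
```

Every service in the request path would need the same filter, the same header name, and the same discipline, which is exactly the cross-team coordination the speaker calls a lot of effort.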

Deep-linking log events and traces with Datadog

Fortunately, they were able to take advantage of a new feature that Datadog now offers, where Datadog, with its distributed tracing libraries, can actually inject the trace ID and span IDs into the logs that their code generates.

So what was the setup process for them? Instead of actually having to instrument any of their code, it was the same setup process that they had for enabling APM to begin with.

These folks are Java users, so that was installing the Java client and instrumenting their application with the auto-instrumentation method of the Java JAR file.

The only thing they had to change to get this log injection to happen was to add a new environment variable, DD_LOGS_INJECTION, sort of a switch that they flipped on.
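For a Java service, flipping that switch might look roughly like this. The agent path and service name below are placeholders; `DD_LOGS_INJECTION` is the actual environment variable the Datadog tracer reads:

```shell
# Same launch command used to enable APM; the only new piece is the
# log-injection switch. Paths and names here are illustrative.
export DD_LOGS_INJECTION=true

java -javaagent:/opt/datadog/dd-java-agent.jar \
     -Ddd.service=checkout-service \
     -jar checkout-service.jar
```

With injection on, the tracer places `dd.trace_id` and `dd.span_id` into the logging context, so a log pattern that references them (for example, `%X{dd.trace_id}` in Logback) surfaces them on every line.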

What was the benefit that came about as a result of this? Well, we already went through the one flow of being able to go from an interesting log event to the trace of the request that actually generated it, but Kevin could have done a couple of other things.

He could have, from that log event, actually just queried across all of the other logs that share the same trace ID and seen the full story of all of the logs that were generated by the same request.

Regardless of whether they came from, you know, this file or that file, maybe that container, wherever they came from, they were all part of the same request story.

From there, potentially, he could have pivoted, from maybe the URL path of what was being queried on, to the browser logs, to see some of the frontend context around the backend 500 that he’s investigating.

Or if he were taking an investigative path from, say, the APM side originally, once he gets down to a trace, he can actually jump into the full context of this request. Not just the spans, how much time each one took, and the success of each span, each function: he can actually jump over to that Logs tab to get that full story of all of the logs that were generated by this request, there in the same view.

And so any variable that might be getting logged from the code, he actually has access to that in the same view as the entire flame graph.

That was the first of the two tricks that Kevin was able to employ in order to have that smooth investigative process.

Centralized log processing

The second trick that he used was this one right here, where he was able to go from a monitor alert to a query of all of the 500s, a count of all the 500s across his entire organization, regardless of what service generated them or what team owned that code.

And there again, remember Kevin’s situation, he’s a new SRE at a large organization that has dozens of teams.

Now, for that kind of organization, this sort of cross-team aggregated query would actually have been very difficult to accomplish.

Why? Because there would have been three problems standing in the way.

The first is that there are many sources of HTTP logs, and so any number of the teams at his larger organization could have been using Nginx or IIS or Apache or any of the other ones.

These all have their own syntaxes for how they log their HTTP logs, and so you can’t really expect to query all of them the same way and end up with the right HTTP status code results, for example.

Another problem that his organization would have been facing is that each team will actually customize its logging habits based on its own custom needs, or its history of needs, and so, even if they were all using Nginx, they’d still run into the same problem.

And a third problem that they’d run into is the fact that each team will actually have their own naming conventions for the same stuff.

I mean, one team might call it HTTP code, another one, status code, another response code, payload code.

Kevin would have run into this and said, “Well, shoot, which one do I query on?”

And so it would have been much harder to get that aggregated view across all of the teams.

He would have to know which team uses which naming convention in order to actually get to an understanding of the health of his HTTP responses.

Now, fortunately, Kevin’s organization is able to take advantage of the centralized processing pipelines of Datadog in order to solve these problems.

Now, from the very beginning of the logs product, they were very excited to see that they were able to just send their logs to Datadog and, from within that centralized UI, define how each log should get parsed out into rich JSON content rather than just a simple string.

And especially, they were happy to see that a lot of the logs they were generating, especially the HTTP ones, came from technologies that had integrations, and so the pipelines were already set up for them, which they could clone and then easily tailor.

And as you’ll see here, they can actually test out the pipelines before they send logs, and make sure that their logs are getting parsed out and processed and enriched the right way.

And they can iterate on these and fine-tune them to fit their needs in a matter of minutes, rather than having to deploy those changes across the infrastructure that does the parsing, or having to loop in other teams to make those deployments happen.

It’s much easier.
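To make that concrete, here is a rough Python sketch of the kind of extraction a pipeline’s parsing does. In Datadog this happens server-side with grok processors configured in the UI; the regex and field names below are just an illustration built around one plausible Nginx-style log format:

```python
import re

# Roughly the Nginx "combined" access-log shape; a pipeline's grok processor
# does this kind of extraction server-side -- this regex only illustrates it.
NGINX_COMBINED = re.compile(
    r'(?P<client_ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status_code>\d{3}) (?P<bytes>\d+)'
)

def parse_access_log(line: str) -> dict:
    """Turn a raw access-log string into rich, queryable attributes."""
    m = NGINX_COMBINED.match(line)
    if m is None:
        return {"message": line}          # leave unparsable lines as-is
    attrs = m.groupdict()
    attrs["status_code"] = int(attrs["status_code"])
    attrs["bytes"] = int(attrs["bytes"])
    return attrs
```

Once the line is attributes instead of a plain string, a count of all status 500 logs becomes an ordinary query on the status-code attribute.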

Nested pipelines—and how they prevent conflicts between teams

Now, the problem that Kevin’s very large organization, with its many developer teams, ran into is that, very quickly, they ended up having over 100 of these processing pipelines. And especially as new teams continued to clone the Nginx pipeline and further tailor it time and again, it became more and more ambiguous: “When I want to modify how my logs are being parsed out and processed and enriched, which pipeline do I fix and change? And if I do make a change, which other team’s logs will I impact as a result?”

Fortunately, they were able to start using this new feature that we’ve released at Datadog of nested pipelines.

They were able to actually nest their pipelines into a higher-level hierarchy, if you will.

So, that way, instead of just having one pipeline for every source or type of log event, they were able to set up a pipeline per team, and within each team’s pipeline, they were able to create a sub-pipeline for each type or source of log event.

As a result, it’s much easier now for them to effectively manage the processing of their logs, without the risk of one team accidentally changing how another team processes its logs.

And this is just a quick view of what it looks like for Kevin to actually nest one of these pipelines. It’s very easy: he just opens up the higher-level pipeline, and he can drag and drop another pipeline into it from the right-hand side.

Voila.

Standard attributes: Bringing consistency to org-wide standards

Now, the other feature that Kevin’s organization is employing in order to solve their larger organization’s processing pipeline problems is this Standard Attributes mapping, which is also a new feature.

Now, Kevin doesn’t know it because he is a new SRE, but before he came to his organization, they went through great pains to establish a standard naming convention for all of their log attributes.

Why? Because they really wanted to remove any redundancy across their log attributes and, along with that, any ambiguity of, “Well, which HTTP status code attribute should I be querying on?”

Now, what did they do about this?

They organized a cross-team committee, and they all met together and decided, “These are the standard attributes that our organization needs and these are the names we should give them.”

And they planned engineering work across all of their teams to refactor how they log, so that they could be in compliance with their standard naming convention.

Now, unfortunately, about half of those teams delayed that work for one reason or another, and as new developers continue to join the organization, they continued logging the way that they were familiar with.

And those new attributes that they introduced were not compliant with the standard naming convention, and so the whole thing sort of fell apart.

Fortunately, with this feature, Kevin’s organization was able to configure their standard naming conventions for all of their logs within their Datadog account, and then, for any noncompliant attribute, they were able to just tell it what it ought to map to.

And so, with that, regardless of which name for HTTP status code somebody is logging with, it all ends up just being the same attribute rather than four or five.

And so they removed all of the redundancy in their facet panels, and they removed the ambiguity of, “Which one do I query on?”

What’s really interesting about this though is that they were able to enforce the standard naming convention without any developer teams having to change how they log.

Everybody just continues doing what they were doing before, and the log management tool just takes all of the nonstandard attributes and maps them over to what they ought to have been called.

And when a new developer comes in and introduces yet another name for HTTP status code, whatever that might be, it’s very easy to solve: you just open up the standard attribute and add that new word at the end of the synonym list for that attribute, if you will.

So it’s very easily solved.
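Conceptually, the mapping behaves like the small sketch below. In Datadog this is configuration rather than code, and the synonym lists here are invented for illustration:

```python
# A conceptual sketch of standard-attribute remapping. The standard name and
# its synonyms are made-up examples, not an official schema.
STANDARD_ATTRIBUTES = {
    "http.status_code": ["http_code", "status_code", "response_code", "payload_code"],
}

# Invert the table once: nonstandard alias -> standard name.
ALIASES = {
    alias: standard
    for standard, aliases in STANDARD_ATTRIBUTES.items()
    for alias in aliases
}

def remap(attrs: dict) -> dict:
    """Fold every nonstandard attribute name into its standard one."""
    return {ALIASES.get(key, key): value for key, value in attrs.items()}
```

Handling a new developer’s novel name is then just one more entry appended to the synonym list, mirroring the step the speaker describes.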

Conclusion

So, with the centralized processing, the nested pipelines, and the standard attributes, Kevin’s organization is able to accomplish this: the ability to very easily query across all of the 500 response codes across all of their services, regardless of what team actually generated that log and what their logging habits were.

And so Kevin’s able to very smoothly investigate whenever any one service has a spike or an increase in their 500 responses.

He’s even able to set up a monitor that makes that aggregated query for him on a regular basis and tells him when there’s an increase beyond a set threshold.