Modern, high-scale applications can generate hundreds of millions of logs per day. Each log provides point-in-time insights into the state of the services and systems that emitted it. But logs are not created in isolation. Each log event represents a small, sequential step in a larger story, such as a user request, database restart process, or CI/CD pipeline. As applications become more complex, being able to view your logs within this larger context during root cause analysis is critical to understanding how an issue affects a complex, distributed system.
In practice, grouping your logs in this way often requires a lot of contextual knowledge across the complicated mesh of services that make up your application. Manually piecing together relevant logs from each of these services is difficult and time-consuming. To solve this, we’ve developed Log Transaction Queries, which simplify the process of aggregating logs according to shared attributes, so you can expose the context and calculate key metrics for particular processes or user journeys.
Log Transaction Queries offer a unique method of log aggregation that provides valuable insight for a variety of use cases, such as e-commerce data, web user activity, financial transactions, authentication sessions, and CI/CD pipelines. With a Log Transaction Query, you can group together log events from across your stack according to shared attributes. For example, using payment ID or order ID as a common attribute, you could aggregate logs from an e-commerce app into transactions and quickly answer key questions like: how many orders were successfully initiated in the past 12 hours? How many threw errors, and on which step(s) of the process? How many took the user more than 10 minutes to complete? And so on.
Alternatively, let’s imagine you’re managing a stock-trading app. You can ensure you’re meeting the strict SLAs required in the financial services industry by tracking individual clients’ logs to understand end-to-end latency. By creating log transactions grouped by customer ID, you can isolate all of the relevant log events from a particular customer’s transaction history. Using logs emitted when the customer’s request invokes various in-house and partner services or APIs, you can easily compare the execution times of each step and spot performance bottlenecks.
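To illustrate the kind of comparison described above, here is a minimal Python sketch that takes the logs for one customer's request and ranks each step by the time elapsed until the next log. It is not Datadog's implementation. The `step` and `timestamp` fields, the step names, and the sample events are hypothetical, and treating the gap between consecutive logs as a step's execution time is a simplifying assumption.

```python
from datetime import datetime


def step_durations(events):
    """Return (step, seconds until the next log) pairs, slowest first.

    Assumes each log marks the start of a step, so the gap to the next
    log approximates that step's execution time.
    """
    ordered = sorted(events, key=lambda e: e["timestamp"])
    gaps = [
        (cur["step"], (nxt["timestamp"] - cur["timestamp"]).total_seconds())
        for cur, nxt in zip(ordered, ordered[1:])
    ]
    return sorted(gaps, key=lambda pair: pair[1], reverse=True)


# Hypothetical logs emitted while one customer's trade request is processed.
events = [
    {"timestamp": datetime(2021, 1, 1, 12, 0, 0, 0), "step": "order_received"},
    {"timestamp": datetime(2021, 1, 1, 12, 0, 0, 50_000), "step": "risk_check"},
    {"timestamp": datetime(2021, 1, 1, 12, 0, 0, 120_000), "step": "partner_api_call"},
    {"timestamp": datetime(2021, 1, 1, 12, 0, 0, 900_000), "step": "order_confirmed"},
]

slowest = step_durations(events)
for step, seconds in slowest:
    print(f"{step}: {seconds * 1000:.0f} ms")
```

In this made-up data, the call to the partner API accounts for most of the end-to-end latency, which is the kind of bottleneck a transaction view makes easy to spot.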
Transaction Queries are formed using a primary identifier and, optionally, a set of operators. Datadog then creates transactions by grouping together all the logs that share the same primary identifier value. Because transactions are generated at query time, Datadog also calculates performance indicators like duration and max severity automatically. Any operators you add to the query will surface more detail about each transaction, such as the count of unique values for a particular facet (users, for example) or the p95 of a useful metric (such as page load time).
The resulting table provides the queried values for each transaction grouping, as well as out-of-the-box calculations of the count of log events, duration of the full transaction (the time elapsed between the earliest and latest log events), and the max severity, which indicates whether transactions contain errors. By adding these metrics to your query, you can surface transactions that show you where problems, such as high severity errors or latency, are occurring.
In the above example, we’re grouping logs from our e-commerce application by merchant ID and counting unique sessions, customers, and locations. We can use this query to look for merchants on our platform experiencing anomalously long payment processing times, and immediately understand the scope of the issue (i.e., the number of customers or locations affected). We can filter the list further, using tags to specify, for example, what host, service, AWS role, or availability zone to focus our exploration on.
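To make the grouping logic concrete, here is a minimal Python sketch of the same idea outside of Datadog: group logs by merchant ID, then compute the log count, the duration between the earliest and latest events, the max severity, and unique counts for a few facets. This illustrates the concept only and is not how Datadog implements transactions. The field names (`merchant_id`, `session_id`, `customer_id`, `location`, `status`), the severity ranking, and the sample logs are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime

# Assumed severity ordering, for illustration only.
SEVERITY_RANK = {"debug": 0, "info": 1, "warn": 2, "error": 3}


def build_transactions(logs, primary_id, unique_facets=()):
    """Group logs by a shared attribute and summarize each group."""
    groups = defaultdict(list)
    for log in logs:
        if primary_id in log:  # skip logs that lack the identifier
            groups[log[primary_id]].append(log)

    transactions = {}
    for key, events in groups.items():
        timestamps = [e["timestamp"] for e in events]
        summary = {
            "count": len(events),
            # Time elapsed between the earliest and latest log events.
            "duration_s": (max(timestamps) - min(timestamps)).total_seconds(),
            "max_severity": max(
                (e["status"] for e in events), key=SEVERITY_RANK.__getitem__
            ),
        }
        # Count distinct values for each requested facet.
        for facet in unique_facets:
            summary[f"unique_{facet}"] = len({e[facet] for e in events if facet in e})
        transactions[key] = summary
    return transactions


# Hypothetical logs from an e-commerce application.
logs = [
    {"timestamp": datetime(2021, 1, 1, 12, 0, 0), "merchant_id": "m-1",
     "status": "info", "session_id": "s-1", "customer_id": "c-1", "location": "NYC"},
    {"timestamp": datetime(2021, 1, 1, 12, 0, 30), "merchant_id": "m-1",
     "status": "error", "session_id": "s-2", "customer_id": "c-2", "location": "NYC"},
    {"timestamp": datetime(2021, 1, 1, 12, 2, 0), "merchant_id": "m-1",
     "status": "info", "session_id": "s-2", "customer_id": "c-2", "location": "Boston"},
    {"timestamp": datetime(2021, 1, 1, 12, 5, 0), "merchant_id": "m-2",
     "status": "info", "session_id": "s-3", "customer_id": "c-1", "location": "NYC"},
]

transactions = build_transactions(
    logs, "merchant_id", unique_facets=["session_id", "customer_id", "location"]
)
for merchant, summary in transactions.items():
    print(merchant, summary)
```

Sorting the resulting summaries by duration or filtering on max severity would then surface the merchants worth investigating, which mirrors how you would use the transaction table.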
If we notice an anomaly that we want to investigate, we can view the summary details of the relevant transaction and inspect each included log event. The graph visualizes a count of relevant log events over time so we can spot moments of unusual activity, such as a high volume of errors. In the example above, we can see a registry of activity on a merchant for the specified time span, with the customer ID and location included along with the content of the log. Once we’ve identified a log event containing something interesting (or problematic), we can focus on it to view relevant event attributes or pivot to related metrics and traces to diagnose the source of any code errors or sources of latency.
Using the “View in Context” button under each log side panel, we can re-run our query to show only transactions on the same host or container as our target. This way, we get a full picture of the transaction timeline for the host in question and can validate whether any performance issues we initially observed are also showing up in other transactions on that host. By navigating quickly and easily between scopes, we can use our logs to form a clear picture of events on each facet of our infrastructure, generate actionable performance insights, and troubleshoot issues.
Logs provide invaluable visibility into your applications and context around problems. Datadog’s Log Transaction Queries feature helps you cut through the noise of your environment’s logs by pulling together relevant logs from sources across your stack to give you deep insights into the health and performance of individual requests and processes. Get started using transactions in Datadog’s Log Explorer today. Or, if you’re brand new to Datadog, sign up for a 14-day free trial.