Transform and Enrich Your Logs With Datadog Observability Pipelines | Datadog

Transform and enrich your logs with Datadog Observability Pipelines

Author Candace Shamieh
Author Pratik Parekh

Last updated: 8月 12, 2024

Today’s distributed IT infrastructure consists of many services, systems, and applications, each generating logs in different formats. These logs contain layers of important information used for data analytics, security monitoring, and application debugging. However, extracting valuable insights from raw logs is complex, requiring teams to first transform the logs into a well-known format for easier search and analysis. Teams may also need to add custom context to the logs to simplify debugging and improve data quality as they are ingested by downstream applications.

Datadog Observability Pipelines already provides users with the ability to aggregate their logs behind a single pane of glass, process them, and route them to assigned destinations. Now Observability Pipelines’ out-of-the-box processors allow you to convert logs into a standardized format, enrich logs for better consumption by your downstream applications, embed data provenance into your logs for simplified debugging, and add precise geographic location for more efficient analysis.

In this post, we’ll discuss how Observability Pipelines can:

Easily convert logs into a machine-readable, structured format

Logs are produced from a variety of sources, including network appliances, messaging queues, and transaction processes. Each source emits logs in a unique format that contains relevant, source-specific information. As a result, teams must tokenize the logs so they can adhere to data governance and compliance standards before routing them to downstream applications like log management tools, debuggers, or SIEM systems. This parsing process is tedious, resource-intensive, and difficult to scale. Writing custom parsing rules for dozens or hundreds of log types is error-prone, making you susceptible to data loss.

Observability Pipelines introduces the Grok Parser, a processor that allows you to choose and automatically apply preconfigured parsing rules from the Datadog Pipeline Library to your logs. Adding the Grok Parser to Observability Pipelines enables you to process and transform over 150 different log types while en route to their destination.

View  of a pipeline dual shipping logs from an HTTP client to Elasticsearch and Datadog. The pipeline processes logs with a Grok Parser.

For example, let’s say you have an unstructured log message that displays a long string of numbers and words from an application. The log message includes an IP address, browser information, API calls, and more, but your downstream applications are unable to properly ingest and analyze the message because of its default format. Instead of manually sorting, defining, and categorizing each portion of the log, the Grok Parser uses the source attribute within the log message to automatically extract fields like IP address, timestamp, request, bytes, OS, and browser. You’re then able to use the tokenized logs to quickly search for relevant information and derive helpful insights as you conduct analysis or troubleshooting steps.

The following code provides an example of how a raw Docker log looks before and after using our Grok Parser:

Raw Log Input
192.0.2.17 - - [06/Jan/2017:16:16:37 +0000] "GET /datadoghq/company?test=var1%30Pl HTTP/1.1" 200 612 "http://www.test-url.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36" "-"
Parsed Output
{
  "http": {
    "referer": "http://www.test-url.com/",
    "status_code": 200,
    "method": "GET",
    "useragent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
    "version": "1.1",
    "url": "/datadoghq/company?test=var1%30Pl"
  },
  "network": {
    "bytes_written": 612,
    "client": {
      "ip": "192.0.2.17"
    }
  },
  "timestamp": 1483719397000
}

Add contextual information to enrich your log data

Raw logs usually contain specific, detailed information, such as customer IDs, SKUs, or IP addresses. While valuable, this information can be difficult to search for and understand in its original form. Enriching your logs with contextual data allows them to be searchable in a human-readable format and enhances usability for downstream analysis.

Observability Pipelines’ Enrichment Table processor allows you to create custom tables that enrich data as it routes logs to the appropriate destination, avoiding the complex, time-consuming task of transforming diverse log formats after ingestion. Using a CSV file, the processor adds contextual information to each log message.

View  of a pipeline dual shipping logs from an HTTP client to Elasticsearch and Datadog. The pipeline utilizes the Enrichment Table processor.

As an example, let’s say you’re the regional director of operations for a food delivery service and one of your managers wants to create a report that shows each delivery driver’s zone. They attempt to find the zoned areas by searching for employee names but cannot find the information because your raw logs contain only universally unique identifiers, or UUIDs. With the Enrichment Table processor, you can insert employee names into the logs and associate them with the UUID.

The following code illustrates a log before and after enrichment:

CSV Enrichment File
empID, Name, Delivery Zone, State
1990, John Smith, Dallas-Fort Worth Metroplex, TX
3998, Jane Doe, Metro Atlanta, GA
8267, Michael Johnson, Nashville Metropolitan Area, TN
Enriched Output
{
  "employee_details": {
    "emp_id": 1990,
    "Name": "John Smith",
    "Delivery Zone": "Dallas-Fort Worth Metroplex",
    "State": "TX"
  },
}
{
  "employee_details": {
    "emp_id": 3998,
    "Name": "Jane Doe",
    "Delivery Zone": "Metro Atlanta",
    "State": "GA"
  },
}
{
  "employee_details": {
    "emp_id": 8267,
    "Name": "Michael Johnson",
    "Delivery Zone": "Nashville Metropolitan Area",
    "State": "TN"
  },
}

Integrate data provenance into logs for faster debugging

Most large, distributed systems make use of autoscaling, where resources are spun up or torn down dynamically as demand fluctuates. When debugging an issue or tracking a possible security threat, identifying affected hosts is critical to better understand the scale of the issue or vector of the attack. When aggregating logs from various sources, users often need to run multiple queries to gather enough context before taking action.

With the Add Hostname processor, you can automatically insert the hostname directly within the log message. The Add Hostname processor automatically embeds system information in a structured format, making it easily searchable and helping you pinpoint a root cause if an issue occurs.

Enhance security and improve data analysis with the GeoIP Parser

Nested within the Enrichment Table processor, the GeoIP Parser adds context from the GeoIP database directly within the log message. This processor enables you to pinpoint an IP address’s geographical location, improving security monitoring and data analytics. With the GeoIP Parser, you can visualize geographical locations in your reports and add maps to custom notebooks or as widgets on a Datadog dashboard.

View  of a world map in a Datadog notebook that was created by the GeoIP Parser.

To demonstrate how the GeoIP Parser works with the other new Observability Pipelines processors to enhance security, let’s say you have a production server farm with multiple servers generating logs. The Grok Parser structures your logs with the IP address, browser information, request, and timestamp. When a critical service fails, the Add Hostname processor enables you to easily correlate logs with the specific server via hostname, allowing for faster troubleshooting and targeted fixes. In the case of a suspected attack on your environment, the GeoIP Parser can help you pinpoint the geographic location of a malicious host sooner, while the Enrichment Table processor attaches a CSV file that contains a list of known malicious IP addresses.

Building upon the food delivery service example, let’s say the IT team receives an alert notifying them of a suspicious login attempt in the route-planning application used by delivery drivers. The login attempt was made from Metro Atlanta, so IT can’t immediately determine whether an actual delivery driver made the attempt. Using the Add Hostname processor to insert hostnames into logs, the IT team can pinpoint the host and, subsequently, the device used for the login attempt. Taking the investigation further, IT can leverage the GeoIP Parser to query the GeoIP database and connect delivery driver device locations with IP addresses and timestamps. This additional context helps the IT team identify the exact locations of their Metro Atlanta delivery drivers at the time of the suspicious login attempt and whether the activity is typical.

The following code shows how the output of a log will look after processing:

Parsed and Enriched Output
{
  "employee_details": {
    "emp_id": 1990,
    "Name": "John Smith",
    "Delivery Zone": "Dallas-Fort Worth Metroplex",
    "State": "TX"
  },
  "hostname": "john-smith-1-device",
  "geoip": {
    "country": "United States",
    "region": "Texas",
    "city": "Fort Worth",
    "latitude": 32.7767,
    "longitude": -96.7970
   },
   "timestamp": "2024-07-15T12:34:56Z", 
   "action": "login attempt", 
   "ip": "203.0.113.25" },
{
  "employee_details": {
    "emp_id": 3998,
    "Name": "Jane Doe",
    "Delivery Zone": "Metro Atlanta",
    "State": "GA"
  },
  "hostname": "jane-doe-1-device",
  "geoip": {
    "country": "United States",
    "region": "Georgia",
    "city": "Alpharetta",
    "latitude": 34.0522,
    "longitude": -118.2437
  }, 
  "timestamp": "2024-07-15T12:34:56Z", 
  "action": "login attempt", 
  "ip": "198.51.100.23" }, 
{
  "employee_details": {
    "emp_id": 8267,
    "Name": "Michael Johnson",
    "Delivery Zone": "Nashville Metropolitan Area",
    "State": "TN"
  },
  "hostname": "michael-johnson-1-device",
      "geoip": {
        "country": "United States",
        "region": "Tennessee",
        "city": "Franklin",
        "latitude": 35.9251,
        "longitude": -86.8689
  }, 
  "timestamp": "2024-07-15T12:34:56Z", 
  "action": "login attempt", 
  "ip": "192.0.2.18" 
  } 

Get started today

Datadog Observability Pipelines enables you to collect, transform, enrich, and route your infrastructure logs. The Grok Parser equips you with out-of-the-box parsing rules, allowing you to transform logs on the wire and enhance your search and analysis capabilities for efficient downstream consumption. Enrich logs with custom context with our Enrichment Table processor, and insert the local hostname with the Add Hostname processor for simplified debugging. Lastly, leverage our GeoIP Parser to add geographic information to log messages so that you can identify and respond to incidents quickly and create custom maps for your reports.

For more information, visit our documentation. If you’re new to Datadog, you can sign up for a 14-day .