vLLM Observability & Monitoring | Datadog

vLLM Observability & Monitoring

Gain comprehensive visibility into the performance and resource usage of your LLM workloads.

A unified monitoring platform provides full visibility into the health and performance of each layer of your environment at a glance. Datadog allows you to customize this insight to your stack by collecting and correlating data from more than 1,000 vendor-backed technologies, all in a single pane of glass. Easily monitor your underlying infrastructure, supporting services, and applications alongside security data in one centralized monitoring platform.

 

Ensure Fast, Reliable Responses to Prompts

  • Visualize critical performance metrics like end-to-end request latency, token generation throughput, and time to first token (TTFT) with an intuitive out-of-the-box (OOTB) dashboard
  • Identify and resolve infrastructure issues or resource constraints to ensure your LLM application remains fast and reliable, even under heavy load
  • Adjust resource allocation to meet demand and keep your LLMs performing at their best with end-to-end visibility
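
For readers unfamiliar with the metrics named above, the following minimal sketch shows how time to first token, end-to-end latency, and generation throughput relate to each other for a single request. It is an illustration only: the `RequestTiming` type, its field names, and the sample timestamps are hypothetical and are not part of vLLM or Datadog.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTiming:
    """Hypothetical record of one request's timestamps, in seconds."""
    sent_at: float            # when the prompt was submitted
    token_times: List[float]  # arrival time of each generated token


def time_to_first_token(t: RequestTiming) -> float:
    # TTFT: delay between submitting the prompt and receiving the first token.
    return t.token_times[0] - t.sent_at


def end_to_end_latency(t: RequestTiming) -> float:
    # Total time from submitting the prompt to receiving the last token.
    return t.token_times[-1] - t.sent_at


def generation_throughput(t: RequestTiming) -> float:
    # Generated tokens per second, averaged over the whole request.
    return len(t.token_times) / end_to_end_latency(t)


# Made-up sample data: five tokens arriving over two seconds.
timing = RequestTiming(sent_at=10.0, token_times=[10.5, 10.75, 11.0, 11.25, 12.0])
ttft = time_to_first_token(timing)
latency = end_to_end_latency(timing)
throughput = generation_throughput(timing)
```

A dashboard typically aggregates values like these across many requests (for example as percentiles) rather than reporting them per request.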

Optimize Resource Usage and Reduce Cloud Costs

  • Prevent over-provisioning by monitoring key LLM serving metrics like GPU/CPU utilization and cache usage
  • Reduce idle cloud spend while ensuring LLM workloads maintain high performance by tracking real-time resource consumption
  • Balance performance and cost-efficiency by rightsizing infrastructure and avoiding unnecessary scaling events

Detect and Address Critical Issues Before They Impact Production

  • Detect issues early by proactively monitoring key LLM application performance metrics with preconfigured Recommended Monitors
  • Prevent delays or interruptions by tracking metrics like queue size, preemptions, and requests waiting in real time
  • Resolve potential problems before they impact performance with actionable alerts on predefined thresholds
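
To make the idea of alerting on predefined thresholds concrete, here is a rough sketch that reads Prometheus-style metric text and reports which metrics exceed a limit. Everything in it is an assumption for illustration: the metric names follow the pattern vLLM has used for its exported metrics but vary by version, the sample values and thresholds are made up, and `parse_metrics` and `breached` are hypothetical helpers. It does not show how Datadog's Recommended Monitors are implemented or configured.

```python
from typing import Dict, List

# Made-up sample in Prometheus text exposition format.
# Metric names are assumptions and may differ across vLLM versions.
SAMPLE_METRICS = """\
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="example-model"} 12.0
vllm:num_requests_running{model_name="example-model"} 4.0
vllm:num_preemptions_total{model_name="example-model"} 3.0
"""

# Hypothetical alert thresholds; real values depend on the workload.
THRESHOLDS: Dict[str, float] = {
    "vllm:num_requests_waiting": 10.0,
    "vllm:num_preemptions_total": 5.0,
}


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse sample lines into {metric_name: value}.

    Simplified: ignores labels and assumes no trailing timestamps.
    """
    values: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        values[name] = float(value)
    return values


def breached(values: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return the sorted names of metrics whose value exceeds its threshold."""
    return sorted(
        name for name, limit in thresholds.items()
        if values.get(name, 0.0) > limit
    )


alerts = breached(parse_metrics(SAMPLE_METRICS), THRESHOLDS)
```

With the sample data above, only the waiting-requests metric exceeds its limit, so it is the only name in `alerts`.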

Next-generation ML Monitoring

Monitor your entire machine learning stack with Datadog.

AWS Trainium & Inferentia

Monitor and optimize deep learning workloads running on AWS AI chips.

OpenAI

Monitor token consumption, API performance, and more.

NVIDIA DCGM Exporter

Gather metrics from NVIDIA’s discrete GPUs, essential to parallel computing.

Thousands of Customers Love & Trust the Datadog Platform

ML Monitoring Resources

Learn how Datadog can help you monitor your entire AI stack.

Datadog AI Monitoring Starter Kit