Unified vSphere monitoring with Datadog

Managing VMware environments sometimes feels like herding cats. VMware vSphere environments are in a constant state of flux: new applications get rolled out and compete for resources with existing ones, while virtual machines bounce between hosts in search of better performance.

Adequate VM performance depends on the behavior of a number of systems, such as storage arrays, ESXi host hardware, and all the applications running on the VM. These systems are seemingly independent, yet they are all connected to vSphere. Without unified monitoring of vSphere and the rest of these systems, tracking down the root cause of performance issues is an uphill battle.

To help you understand your infrastructure, from the ESXi host all the way to the application, we’re pleased to announce our new integration for VMware vSphere. Integrating Datadog with vSphere allows you and your whole team to see the bigger picture.

vSphere performance overview

The problem with siloed monitoring

Many VM performance problems are caused by external systems that are, for all practical purposes, invisible to you if you rely solely on vCenter.

For instance, an upgrade to SQL Server 12.0 may increase the volume of database writes during replication, consuming the VM's allocated memory until demand quickly exceeds the allotment. This causes the underlying physical memory pages to fill up and affects the performance of other VMs on the same ESXi host. vSphere metrics like mem.swapped.average and mem.vmmemctl.average will show that an issue is occurring, but pinpointing the cause will require a great deal of investigation.

To make matters worse, the upgrade itself would have been performed by DBAs using a database management tool that doesn’t connect to vCenter. You would not be able to identify the root cause of the issue without an extensive review of the database’s configuration changes.

With Datadog, getting to the same conclusion is not only possible but easy, and it takes a matter of minutes, as we’ll see below.

Auto-discovery across VM and app layers

The Datadog integration places an Agent on the vCenter server and collects vSphere performance metrics in real time, as well as configuration events like vMotions and resource configuration changes.

The Agent gathers events and metrics provided by vCenter and tags them based on VMware clusters and VM configuration, so that you can use the exact same cluster and VM names in Datadog. The data is continuously sent through a secure connection to Datadog, where it is processed and normalized to a common timescale, along with performance data from over 80 other commonly available tools, applications, and cloud-based services.
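As a rough illustration of what normalizing to a common timescale can mean, the sketch below averages samples from two sources that report at different intervals into shared one-minute buckets. The function name, bucket width, and sample values are assumptions made up for this example; this is not a description of Datadog's actual pipeline.

```python
from collections import defaultdict
from statistics import mean


def normalize_to_buckets(samples, bucket_seconds=60):
    """Average (timestamp_seconds, value) samples into fixed-width time buckets.

    Hypothetical helper for illustration only.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        # Round each timestamp down to the start of its bucket.
        bucket_start = int(ts // bucket_seconds) * bucket_seconds
        buckets[bucket_start].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


# Made-up samples: one source reports every 20 s, the other every 15 s.
vsphere_swapped = [(0, 0.0), (20, 0.0), (40, 0.0),
                   (60, 10.0), (80, 20.0), (100, 30.0)]
sql_writes = [(0, 100.0), (15, 100.0), (30, 100.0), (45, 100.0),
              (60, 400.0), (75, 400.0), (90, 400.0), (105, 400.0)]

swapped_by_minute = normalize_to_buckets(vsphere_swapped)
writes_by_minute = normalize_to_buckets(sql_writes)

# Both series now share the same bucket keys and can be compared point by point.
for start in swapped_by_minute:
    print(start, swapped_by_minute[start], writes_by_minute[start])
```

Once both series are keyed by the same bucket start times, values from different systems can be plotted or compared directly, regardless of how often each source originally reported.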

Once all of this data is streaming to Datadog, you can see the big picture: all your performance data in one place.

vSphere monitoring - correlate vSphere and app metrics

To quickly resolve the database-caused issue from our example above, the VMware (or database) team, suspecting a problem stemming from a SQL Server upgrade, could put together a dashboard like the one below in minutes. Here, SQL Server read and write metrics are combined with vSphere memory metrics, normalized to the same timescale, and shown in context. If multiple VMs working together as a service are affected, they can be monitored and graphed as an aggregate thanks to Datadog’s tagging.

The dashboard below shows that swapping and ballooning in the VM begin at the same time as the increase in SQL Server writes. Without Datadog, you would see only memory ballooning, with no clear idea of the cause. With Datadog, you can save time and look directly into what happened with the SQL Server instance at 3:13 PM.

vSphere and SQL Server data in one place
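The visual comparison above, where two metrics start rising at the same moment, can also be sketched in code. The example below finds the first point at which each series climbs well above its early baseline and checks whether the two onsets coincide. The helper name, thresholds, and per-minute values are hypothetical and chosen only to mirror the 3:13 PM example; real data would need more careful baselining.

```python
def first_rise(series, baseline_points=3, factor=2.0, min_delta=1.0):
    """Return the first timestamp whose value clearly exceeds the baseline.

    series is a list of (timestamp, value) pairs sorted by timestamp.
    The baseline is the mean of the first `baseline_points` values.
    Returns None if the series never rises above the threshold.
    Hypothetical helper for illustration only.
    """
    baseline_values = [value for _, value in series[:baseline_points]]
    baseline = sum(baseline_values) / len(baseline_values)
    # min_delta keeps an all-zero baseline from triggering on tiny values.
    threshold = max(baseline * factor, baseline + min_delta)
    for timestamp, value in series[baseline_points:]:
        if value > threshold:
            return timestamp
    return None


# Made-up per-minute values around the time of the incident.
sql_writes = [("15:10", 100), ("15:11", 105), ("15:12", 98),
              ("15:13", 420), ("15:14", 450)]
vm_ballooned_kb = [("15:10", 0), ("15:11", 0), ("15:12", 0),
                   ("15:13", 2048), ("15:14", 8192)]

writes_onset = first_rise(sql_writes)
balloon_onset = first_rise(vm_ballooned_kb)

if writes_onset is not None and writes_onset == balloon_onset:
    print(f"Both metrics start rising at {writes_onset}")
```

A matching onset does not prove causation, but it narrows the investigation to what changed in the database at that moment.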

These SQL Server metrics are only one example of the many systems that can be monitored together. Datadog integrates with over 80 different applications and services and supports standards like SNMP.

See how events in other systems impact VMs

In addition to comparing performance metrics between systems, Datadog lets you correlate events from other systems with VMware metrics. To get a better idea of what exactly changed in the SQL Server application, you can zoom into the time frame around 3:13 PM and see what specific events occurred with the database. The dashboard below shows time series data for the vSphere metrics and overlays database events, so you can correlate the SQL Server events that occurred in that time period with vSphere performance.

vSphere monitoring lets you correlate its performance to application events

And there’s the “smoking gun.” The replication lag, a consequence of the SQL Server upgrade, shows up at the moment that swapping and ballooning begin to take off.
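Conceptually, overlaying events on a metric graph comes down to selecting the events that fall inside a time window around the anomaly. The sketch below shows that filtering step under stated assumptions: the event records, their titles, the calendar date, and the five-minute window are all invented for illustration and do not reflect Datadog's event format.

```python
from datetime import datetime, timedelta


def events_near(events, anomaly_time, window=timedelta(minutes=5)):
    """Return events whose timestamp is within +/- window of anomaly_time.

    Hypothetical helper for illustration only.
    """
    return [e for e in events if abs(e["time"] - anomaly_time) <= window]


# Made-up events; the date is an arbitrary placeholder.
events = [
    {"time": datetime(2024, 1, 1, 2, 0), "title": "Nightly backup finished"},
    {"time": datetime(2024, 1, 1, 15, 10), "title": "SQL Server upgrade applied"},
    {"time": datetime(2024, 1, 1, 15, 13), "title": "Replication lag increasing"},
    {"time": datetime(2024, 1, 1, 16, 30), "title": "Index rebuild started"},
]

# The moment swapping and ballooning begin in the example.
anomaly_time = datetime(2024, 1, 1, 15, 13)

nearby = events_near(events, anomaly_time)
for event in nearby:
    print(event["time"].strftime("%H:%M"), event["title"])
```

Widening or narrowing the window trades recall for noise: a larger window catches slower-acting changes but also pulls in unrelated events.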

If you would like to see the impact of other systems in your VMware environment and have this information accessible across teams, Datadog is available for a . You will begin to receive vSphere metrics and events immediately after installing the Datadog Agent on the vCenter server.

