What is OpenStack?
OpenStack is an open-source cloud-computing software platform. It is primarily deployed as infrastructure-as-a-service and can be likened to a version of Amazon Web Services that can be hosted anywhere. Originally developed as a joint project between Rackspace and NASA, OpenStack is about five years old and has a large number of high-profile corporate supporters, including Google, Hewlett-Packard, Comcast, IBM, and Intel.
The core of the OpenStack project lies in the Compute module, known as Nova. Nova is responsible for the provisioning and management of virtual machines. It features full support for KVM and QEMU out of the box, with partial support for other hypervisors including VMware, Xen, and Hyper-V.
OpenStack overview dashboard
Here are some of the things you’ll want to see in any OpenStack dashboard. If you’re a Datadog user, your OpenStack metrics will automatically populate an out-of-the-box dashboard in your Datadog account called “OpenStack - Overview” like in the screenshot below. If you’re not a current user, you can still follow along and craft your own dashboard with these useful metrics.
Nova metrics can be logically grouped into four categories: hypervisor metrics give a clear view of the work performed by your hypervisors; Nova server metrics give you a window into your virtual machine instances; tenant metrics provide detailed information about user resource usage (including quotas); and message queue metrics give you performance details about the underlying message-passing pipeline that Nova uses to coordinate work.
Here’s a widget-by-widget breakdown of the graphs and query values in this dashboard.
Nova, Neutron, and Keystone counters
These counters display the number of running Nova, Neutron, and Keystone API endpoints. Because your number of physical hosts should change infrequently, you can expect these numbers to be static. A change in one of these counters points to an API endpoint that is down, which signals trouble in your deployment.
The hypervisor counter reports the number of hypervisors that are up and running. This counter can also be said to reflect the number of Nova nodes running, as each Nova node has one hypervisor. Unexpected changes to this metric point to problems with your Nova cluster.
Nova server metrics
Compute nodes generally constitute the majority of nodes in an OpenStack deployment. The Nova server metrics group provides information on individual instances running on compute nodes.
HDD read rate by instance
This timeseries graph reports the average rate of read requests per second per instance. Spikes in this metric indicate that a virtual machine may have low RAM, causing it to thrash the disk with constant memory paging.
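A per-instance read rate like this one is typically derived from a cumulative count of read requests sampled at two points in time. The following is a minimal sketch of that arithmetic, not the integration's actual implementation; the `read_rate` helper and the sample values are hypothetical.

```python
def read_rate(prev_count, prev_time, curr_count, curr_time):
    """Return read requests per second between two samples of a cumulative counter."""
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing times")
    delta = curr_count - prev_count
    if delta < 0:
        # The counter went backwards (for example, the instance rebooted),
        # so treat the current count as the number of reads since the reset.
        delta = curr_count
    return delta / elapsed


# Hypothetical samples taken 10 seconds apart for one instance.
rate = read_rate(prev_count=12_000, prev_time=100.0, curr_count=12_450, curr_time=110.0)
print(f"{rate:.1f} reads/s")
```

Comparing this rate against the instance's usual baseline, rather than against a fixed number, is generally the more useful way to spot the paging-induced spikes described above.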
Hypervisor metrics
The hypervisor initiates and oversees the operation of virtual machines. Failure of this critical piece of software will cause tenants to experience issues provisioning and performing other operations on their virtual machines, so monitoring the hypervisor is crucial.
Top memory RSS
This toplist displays the current resident set size (RSS) of the nova-compute daemon (VM instance manager), grouped by host aggregate. Although this metric should fluctuate under normal conditions, any dramatic changes should be investigated.
Hypervisor load map
Used vs free disk space
This timeseries graph reports the amount of disk space (in gigabytes) currently available for allocation, aggregated by physical host. It is plotted against the amount of disk space in use. Maintaining ample disk space is critical, because the hypervisor will be unable to spawn new virtual machines if there isn’t enough available space.
Current workload by hypervisor
This bar graph tracks hypervisor operations: Build, Snapshot, Migrate, and Resize.
Change in running VMs
This change graph tracks changes in the number of instances running on each host. Depending on your use case, unexpected changes to this metric should be investigated.
VCPUs used vs available
This timeseries graph plots the number of virtual CPUs in use against the maximum number available. Remember, OpenStack allows you to overcommit RAM and CPU resources. This means you can increase the number of resources available to your instances, at the cost of performance.
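To make the overcommit trade-off concrete, here is a rough sketch of how schedulable vCPU capacity relates to physical cores. It assumes capacity is simply physical cores multiplied by an allocation ratio (Nova exposes this as the `cpu_allocation_ratio` setting); the `vcpu_headroom` helper and the numbers used are hypothetical.

```python
def vcpu_headroom(physical_cores, vcpus_used, cpu_allocation_ratio):
    """Summarize vCPU usage on one host under a given overcommit ratio."""
    capacity = int(physical_cores * cpu_allocation_ratio)
    return {
        "capacity": capacity,                       # vCPUs the host may hand out
        "free": capacity - vcpus_used,              # vCPUs still schedulable
        "overcommitted": vcpus_used > physical_cores,  # more vCPUs than real cores
    }


# Hypothetical host: 16 physical cores, 40 vCPUs allocated, 4x overcommit.
print(vcpu_headroom(physical_cores=16, vcpus_used=40, cpu_allocation_ratio=4.0))
```

In this example the host still has room for more instances, but it is already running more vCPUs than it has physical cores, which is the point at which instances can start competing for CPU time.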
RabbitMQ metrics
RabbitMQ serves as both a synchronous and an asynchronous communications channel for Nova. Failure of this component will disrupt operations across your deployment. Monitoring RabbitMQ is essential if you want the full picture of your OpenStack environment.
Queue memory
This timeseries graph plots the memory usage of RabbitMQ, broken down by queue. Although not often an issue, a significant spike in queue memory could point to a large backlog of unreceived (“ready”) messages, or worse.
Consumer utilization
This timeseries graph reports on the utilization of each queue, represented as a percentage. Ideally, this metric will be 100 percent for each queue, meaning consumers get messages as quickly as they are published. This metric is only available in RabbitMQ 3.3 and later.
Consumers by queue
This toplist represents the current number of consumers per message queue. Your number of consumers should usually be non-zero for a given queue. Zero consumers means that producers are sending out messages into the void. Depending on your RabbitMQ configuration, those messages could be lost forever.
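If you are building your own checks rather than using the dashboard, the two consumer-related conditions above can be sketched in a few lines. This example assumes you already have per-queue statistics as dictionaries shaped like the output of the RabbitMQ management plugin's queue listing; the field names `consumers` and `consumer_utilisation`, the `flag_queues` helper, the 90 percent threshold, and the sample data are all assumptions for illustration.

```python
def flag_queues(queues, min_utilisation=0.9):
    """Return (queue name, reason) pairs for queues that deserve a closer look."""
    findings = []
    for q in queues:
        name = q["name"]
        if q.get("consumers", 0) == 0:
            # Nobody is listening: published messages pile up or are dropped.
            findings.append((name, "no consumers"))
            continue
        utilisation = q.get("consumer_utilisation")
        if utilisation is not None and utilisation < min_utilisation:
            findings.append((name, f"consumer utilisation {utilisation:.0%}"))
    return findings


# Hypothetical queue statistics.
sample = [
    {"name": "compute", "consumers": 2, "consumer_utilisation": 1.0},
    {"name": "scheduler", "consumers": 1, "consumer_utilisation": 0.42},
    {"name": "notifications.info", "consumers": 0, "messages_ready": 1800},
]
for name, reason in flag_queues(sample):
    print(f"{name}: {reason}")
```

Checking for zero consumers first is a deliberate choice here, since a utilization figure is not meaningful for a queue that has nobody consuming from it.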
Tenant metrics
Tenant metrics are primarily focused on resource usage. Remember, tenants are just groups of users. In OpenStack, each tenant is allotted a specific amount of resources, subject to a quota. Monitoring these metrics allows you to fully exploit the available resources and can help inform requests for quota increases should the need arise.
Floating IPs used vs max
This timeseries graph plots the number of floating IPs used by the tenant against the maximum number of floating IPs allowed.
RAM used vs max
This timeseries graph plots the amount of RAM used by the tenant against the maximum amount of RAM allowed.
Cores used vs max
This timeseries graph plots the current number of cores in use against the maximum number of cores allocated.
Instances used vs max
This timeseries graph plots the current number of instances running against the maximum number of instances allowed. Remember, if a tenant is nearing its instance limit, it can always resize an existing instance to a larger one, if other resource quotas permit.
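Each of the four tenant graphs above boils down to a used-versus-maximum ratio. As a minimal sketch, the following computes those ratios from a dictionary of tenant limits and lists the resources that are close to their quota. The key names are modeled on the absolute limits that Nova reports per tenant and should be treated as an assumption, as should the helper names, the 80 percent threshold, and the sample values.

```python
# Map a friendly resource name to its (used, max) keys in the limits data.
RESOURCES = {
    "instances": ("totalInstancesUsed", "maxTotalInstances"),
    "cores": ("totalCoresUsed", "maxTotalCores"),
    "ram_mb": ("totalRAMUsed", "maxTotalRAMSize"),
    "floating_ips": ("totalFloatingIpsUsed", "maxTotalFloatingIps"),
}


def quota_usage(limits):
    """Return the fraction of each quota in use, or None when there is no usable limit."""
    usage = {}
    for resource, (used_key, max_key) in RESOURCES.items():
        maximum = limits[max_key]
        # A non-positive maximum is treated here as "no limit to compare against".
        usage[resource] = None if maximum <= 0 else limits[used_key] / maximum
    return usage


def near_limit(limits, threshold=0.8):
    """Return the sorted names of resources at or above the usage threshold."""
    return sorted(
        resource
        for resource, fraction in quota_usage(limits).items()
        if fraction is not None and fraction >= threshold
    )


# Hypothetical limits for one tenant.
tenant_limits = {
    "totalInstancesUsed": 9, "maxTotalInstances": 10,
    "totalCoresUsed": 20, "maxTotalCores": 40,
    "totalRAMUsed": 46080, "maxTotalRAMSize": 51200,
    "totalFloatingIpsUsed": 2, "maxTotalFloatingIps": -1,
}
print(near_limit(tenant_limits))
```

A list like this is a reasonable starting point for deciding which quota increase to request first.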
We’ve walked you through a number of metrics which are good indicators of your cloud’s performance and health. If you’d like to see this dashboard for your OpenStack metrics, you can try Datadog for free for 14 days. This dashboard will be populated with your metrics immediately after you enable the OpenStack integration.