The Need: a Less “Cloudy” Cloud
To maintain its foothold in Latin America, MercadoLibre continuously seeks to expand and improve its environment. According to Architecture Manager Darío Simonassi, “Each application is handled by a team that has total responsibility for its development and operation. In total, we have some 600 developers who are constantly creating new applications and enhancing existing ones.”
A few years ago, virtually all of the development teams were experiencing at least some operational issues. MercadoLibre’s engineering department believed that these problems were a normal part of having a hybrid cloud environment. Darío, however, began to believe that these issues could be resolved if the teams had better visibility into the underlying infrastructure. MercadoLibre had patched together a monitoring framework using various open source systems. This arrangement was proving difficult to integrate into a common system for comparing and correlating metrics across the many applications and infrastructure components that MercadoLibre was using.
“The constant changes being made by separate teams in a shared hybrid cloud environment proved to be too dynamic for these basic monitoring tools to handle,” Darío recalls. So he set out to find a tool that was purpose-built for monitoring multiple applications in a dynamic hybrid cloud infrastructure.
“The extent of the capabilities we were missing became really obvious when I saw a demonstration of Datadog at OSCON [Open Source Convention], so I knew this was exactly what we needed.”
Dramatically Improved Troubleshooting Speed and Accuracy
“Datadog has made it possible for us to find and fix problems quickly, and to truly understand the underlying causes to help avoid similar problems in the future,” Darío explains. “Although there is much more that Datadog does for us, the sophisticated troubleshooting capability is the greatest value we get from it.” Darío particularly appreciates what he calls Datadog’s “multidimensional” analysis of possible root causes across physical and virtual network, server, and storage resources. “Support for multidimensional metrics is, in my opinion, what makes Datadog unique in the industry.”
Darío cited MercadoLibre’s payment service as an example of the complex interactions that can be difficult to troubleshoot. “If the payments drop, we can quickly determine if it’s a problem with Visa or American Express, or if it’s a problem with our applications or infrastructure. This is because Datadog monitors all of the different components involved.”
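The kind of multidimensional comparison described above can be sketched in a few lines of Python. This is an illustration only, not MercadoLibre's or Datadog's implementation: the tag names (card_network, host), the sample counts, and the 50% drop threshold are all assumptions made for the example. It shows how the same payment counts, sliced along different dimensions, point to where a drop originates.

```python
from collections import defaultdict

def count_by(points, dimension):
    """Sum metric values grouped by one tag dimension."""
    totals = defaultdict(int)
    for point in points:
        totals[point["tags"].get(dimension, "unknown")] += point["value"]
    return dict(totals)

def find_drops(baseline, current, threshold=0.5):
    """Return tag values whose total fell below threshold * baseline."""
    return sorted(
        key for key, expected in baseline.items()
        if current.get(key, 0) < expected * threshold
    )

def point(value, card_network, host):
    """Build one hypothetical data point carrying two tag dimensions."""
    return {"value": value, "tags": {"card_network": card_network, "host": host}}

# Hypothetical approved-payment counts for a normal interval and the current one.
baseline = [point(100, "visa", "a"), point(100, "visa", "b"),
            point(40, "amex", "a"), point(40, "amex", "b")]
current = [point(98, "visa", "a"), point(101, "visa", "b"),
           point(5, "amex", "a"), point(4, "amex", "b")]

# Slice the same data along each dimension and compare against the baseline.
drops_by_network = find_drops(count_by(baseline, "card_network"),
                              count_by(current, "card_network"))
drops_by_host = find_drops(count_by(baseline, "host"),
                           count_by(current, "host"))
```

In this made-up data, the drop shows up along the card-network dimension and not along the host dimension, which suggests an issue with one card network and not with the servers handling the payments.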
Providing Monitoring That Scales Alongside Cloud Infrastructure
Based on the initial improvement in troubleshooting capabilities, the development teams began using Datadog to monitor all of the applications running on all of their AWS instances and on-premises servers. Setting up Datadog involved configuring both standard out-of-the-box integrations and customized data feeds. Together, these data sources provide complete visibility in real time across the entire hybrid cloud infrastructure.
Darío is impressed by how Datadog has been able to handle the breadth and depth of the data being gathered: “We monitor an enormous number of data points, and Datadog has been able to keep up with the collection and correlation of these multidimensional metrics without any issues. So, we are confident that the system will be able to scale along with our infrastructure.”
Providing Development Teams Visibility Across Infrastructure Components
Most of the development teams now use Datadog to collect custom metrics from the applications they are working on, either through the open-source tool StatsD or via the API. With 600 applications running on 15,000 VMs, the potential for interactions between application components and underlying infrastructure is enormous. This is why Darío values the visibility Datadog provides into how applications from one development team are affecting those of other teams.
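As a rough illustration of what a custom metric sent over StatsD looks like, the sketch below builds a plain-text datagram and sends it over UDP. It assumes a StatsD-compatible agent listening on the conventional port 8125 on the local machine, and it uses the "|#key:value" tag suffix that Datadog's StatsD extension accepts. The metric name, tag names, and helper functions are hypothetical and are not taken from MercadoLibre's setup.

```python
import socket

def format_metric(name, value, metric_type="c", tags=None):
    """Build a StatsD datagram such as 'name:1|c|#key:value'.

    metric_type follows StatsD conventions: 'c' counter, 'g' gauge, 'ms' timer.
    The '|#' tag section is the Datadog extension to the base format.
    """
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        # Sort tags so the output is deterministic.
        datagram += "|#" + ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return datagram

def send_metric(datagram, host="127.0.0.1", port=8125):
    """Send one datagram over UDP (fire-and-forget, no delivery guarantee)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))

# Hypothetical counter: one approved payment, tagged by card network and team.
example = format_metric(
    "payments.approved", 1, "c",
    tags={"card_network": "visa", "team": "payments"},
)
# send_metric(example)  # uncomment when an agent is listening locally
```

Because StatsD uses UDP, the application does not wait for a reply, so instrumenting a code path this way adds very little overhead. In practice a team would more likely use an existing StatsD client library than hand-built datagrams.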
“Before we began to use Datadog, most teams had no idea how changes in their applications might affect others,” says Darío. “Now all of the teams have the insight they need to test and troubleshoot operational and performance issues before and during production rollouts. This has also helped the teams work better together, and has made each team more accountable.”
Helps Maximize Performance Across the Entire Hybrid Cloud
Continuous, infrastructure-wide monitoring gives MercadoLibre the ability to optimize resource utilization and maximize overall performance. The teams get immediate notice of any application or VM experiencing a problem, and any adverse impact from changes made to configurations is also known almost immediately. Fine-tuning performance has now become a proactive process, according to Darío: “We now use Datadog to help regularly redistribute the total load to rebalance all available resources.”
The investigation into a performance problem often reveals that existing resources are simply overloaded, and that more are needed. “Because we are able to get peak performance from most of our infrastructure at the VM, host, cluster, and datacenter levels, it is common for a performance problem to indicate the need to add capacity,” Darío adds.