Achieving Operational Visibility for All Development Teams | Datadog
CASE STUDY

Achieving Operational Visibility for All Development Teams

Learn how Datadog bridged the gap between application and infrastructure teams to enable cross-team collaboration

About MercadoLibre

MercadoLibre is the largest online marketplace in Latin America, offering a wide range of digital solutions for e-commerce, payments, and advertising. The company was established in 1999 in Argentina and now operates in 12 additional countries.


Key Results

600 apps on 15,000 VMs

Datadog's monitoring solutions scaled along with MercadoLibre's dynamic and distributed infrastructure.

Improved collaboration

With Datadog, all of MercadoLibre's development teams were able to analyze their applications' performance in the same place.


Challenge

MercadoLibre needed better visibility into their distributed applications and dynamic hybrid cloud infrastructure. They had patched together a monitoring framework using various open source tools, but these disparate solutions made it difficult and time-consuming for them to correlate telemetry data from across their stack.


Why Datadog?

Datadog's turn-key integrations enabled MercadoLibre to automatically correlate metrics from across their dynamic and distributed system, which allowed their development teams to focus their energy on building and releasing new features in order to maximize performance.


The Need: a Less “Cloudy” Cloud

In order to maintain their foothold in Latin America, MercadoLibre continuously seeks to expand and improve its environment. According to Architecture Manager Darío Simonassi, “Each application is handled by a team that has total responsibility for its development and operation. In total, we have some 600 developers who are constantly creating new applications and enhancing existing ones.”

A few years ago, virtually all of the development teams were experiencing at least some operational issues. MercadoLibre’s engineering department believed that these problems were a normal part of having a hybrid cloud environment. Darío, however, began to believe that these issues could be resolved if the teams had better visibility into their distributed applications and the dynamic hybrid cloud infrastructure that ran on-prem and on AWS. Better visibility into their AWS environments was especially crucial because the teams relied on core AWS services such as Amazon CloudFront, Amazon EC2, Amazon DynamoDB, Amazon Elastic Block Store, and Elastic Load Balancing.

MercadoLibre had patched together a monitoring framework using various open source systems. This arrangement was proving difficult to integrate into a common system for comparing and correlating metrics across the many applications and infrastructure components that MercadoLibre was using.

“The constant changes being made by separate teams in a shared hybrid cloud environment proved to be too dynamic for these basic monitoring tools to handle,” Darío recalls. So he set out to find a tool that was purpose-built for monitoring multiple applications in a dynamic hybrid cloud infrastructure and that could integrate seamlessly with AWS.

“The extent of the capabilities we were missing became really obvious when I saw a demonstration of Datadog at OSCON [Open Source Convention], so I knew this was exactly what we needed.”

Dramatically Improved Troubleshooting Speed and Accuracy

“Datadog has made it possible for us to find and fix problems quickly, and to truly understand the underlying causes to help avoid similar problems in the future,” Darío explains. “Although there is much more that Datadog does for us, the sophisticated troubleshooting capability is the greatest value we get from it.” Darío particularly appreciates what he calls Datadog’s “multidimensional” analysis of possible root causes across both physical and virtual network, server, and storage resources. “Support for multidimensional metrics is, in my opinion, what makes Datadog unique in the industry.”

Datadog’s turn-key integrations, particularly those with AWS, enabled MercadoLibre to automatically correlate metrics from across their dynamic and distributed system, which allowed their development teams to focus their energy on building and releasing new features in order to maximize performance.

Darío cited MercadoLibre’s payment service as an example of the complex interactions that can be difficult to troubleshoot. “If the payments drop, we can quickly determine if it’s a problem with Visa or American Express, or if it’s a problem with our applications or infrastructure. This is because Datadog monitors all of the different components involved.”

Providing Monitoring That Scales Alongside Cloud Infrastructure

Based on the initial improvement in troubleshooting capabilities, the development teams began using Datadog to monitor all of the applications running on all of their AWS instances and on-premises servers. Setting up Datadog involved configuring both standard out-of-the-box integrations and customized data feeds. Together, these data sources, which include physical servers and AWS services like Amazon EC2 and AWS Lambda, provide complete visibility in real time across the entire hybrid cloud infrastructure.

Darío is impressed by how Datadog has been able to handle the breadth and depth of the on-prem and AWS data being gathered: “We monitor an enormous number of data points, and Datadog has been able to keep up with the collection and correlation of these multidimensional metrics without any issues. So, we are confident that the system will be able to scale along with our infrastructure.”

Providing Development Teams Visibility Across Infrastructure Components

Most of the development teams are now taking advantage of the ways Datadog enables developers to collect custom metrics from the applications they’re working on using the open-source tool StatsD or via API. With 600 applications running on 15,000 VMs, the potential for interactions between application components and underlying infrastructure is enormous. This is why Darío values the visibility Datadog provides into how applications from one development team are affecting those of other teams.
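As a rough illustration of what sending a custom metric over StatsD can look like, here is a minimal Python sketch using only the standard library. It is not MercadoLibre's code: the metric names, tag keys, and helper functions are hypothetical. It assumes a StatsD-compatible listener (such as the Datadog Agent) on the default UDP port 8125, and it uses the "|#" tag suffix from Datadog's DogStatsD extension to the StatsD format.

```python
import socket


def format_metric(name, value, metric_type="c", tags=None):
    """Build one StatsD datagram: name:value|type, plus optional DogStatsD tags.

    metric_type is a StatsD type code, e.g. "c" (counter), "g" (gauge), "ms" (timer).
    tags is an optional dict; keys are sorted so the output is deterministic.
    """
    line = f"{name}:{value}|{metric_type}"
    if tags:
        line += "|#" + ",".join(f"{key}:{val}" for key, val in sorted(tags.items()))
    return line


def send_metric(line, host="127.0.0.1", port=8125):
    """Send one datagram over UDP. Fire-and-forget: no reply is expected."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))


if __name__ == "__main__":
    # Hypothetical counter for a payments application, tagged by card network
    # and owning team so it can be sliced along either dimension.
    payload = format_metric(
        "payments.processed", 1, "c", {"card_network": "visa", "team": "payments"}
    )
    send_metric(payload)
```

Because each tag adds a dimension to the metric, a team could compare the same counter across card networks or owning teams, which is the kind of multidimensional view the case study describes. In practice a maintained StatsD client library would usually replace the hand-rolled socket code.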

“Before we began to use Datadog, most teams had no idea how changes in their applications might affect others,” says Darío. “Now all of the teams have the insight they need to test and troubleshoot operational and performance issues before and during production rollouts. This has also helped the teams work better together, and has made each team more accountable.”

Helps Maximize Performance Across the Entire Hybrid Cloud

Continuous, infrastructure-wide monitoring gives MercadoLibre the ability to optimize resource utilization and maximize overall performance. The teams get immediate notice of any application or VM experiencing a problem, and any adverse impact from changes made to configurations is also known almost immediately. Fine-tuning performance has now become a proactive process, according to Darío: “We now use Datadog to help regularly redistribute the total load to rebalance all available resources.”

The investigation into a performance problem often reveals that existing resources are simply overloaded, and that more are needed. “Because we are able to get peak performance from most of our infrastructure at the VM, host, cluster, and datacenter levels, it is common for a performance problem to indicate the need to add capacity,” Darío adds.

Resources

BLOG: Generate metrics from your logs to view historical trends and track SLOs

BLOG: Install updated Datadog integrations as they become available

BLOG: Introducing Metrics from Logs and Log Rehydration™

BLOG: New in Datadog: Add context to request traces with host metrics