SulAmérica unifies observability across 2,000 services to deliver reliable health care at scale

Bringing clarity to a complex, multi-cloud health platform

SulAmérica’s mission is to protect and care for people’s health and well-being. For the millions of clients, beneficiaries, brokers, and partners who rely on its platform, that means delivering reliable, increasingly digital healthcare services.

The company operates a multi-cloud environment spanning Google Cloud, Amazon Web Services, and Microsoft Azure, alongside an on-premises data center and mainframe. Across that infrastructure, more than 2,000 services work together to support the health journeys of SulAmérica’s customers.

As the platform grew in complexity, the engineering team relied on multiple observability tools that did not integrate. Engineers had to switch between systems and manually correlate metrics and traces to understand what was happening—slowing incident response and making it harder to meet service level agreements. “The tool we were using at the time was not meeting all of our needs,” says Marcos Paulo Cruz, IT Infrastructure Manager. “We had issues visualizing the end-to-end experience of our customers—particularly with API monitoring in our Salesforce environment.”

SulAmérica needed a unified observability platform that could correlate data across its hybrid environment, provide end-to-end visibility into the customer experience, and help teams get ahead of incidents before they affected users.

A single platform for unified observability

After evaluating several vendors, SulAmérica selected Datadog in January 2025 and completed its migration by March. The decision came down to ease of adoption, well-organized modules, and the ability for engineers to investigate issues, including those in their OpenTelemetry pipelines, without switching tools. “Datadog stood out and addressed our pain points,” says Cruz.

The team’s most-used product is Application Performance Monitoring (APM), which provides visibility into service performance across latency, error rates, and application failures. “APM is our workhorse,” says Cruz. “It’s where we identify the root cause of application failures through tracing.”

Infrastructure Monitoring complements this by providing visibility across both legacy systems and cloud infrastructure, with monitors surfacing CPU, memory, and disk issues early.

SulAmérica also uses Container Monitoring with Bits AI Kubernetes Remediation to identify and fix issues in problematic workloads and clusters. An automated workflow runs every Saturday at 6:00 AM to clean up workloads stuck in image pull backoff, preventing failed applications from consuming unnecessary resources. Bits AI Kubernetes Remediation also helps the team with root cause analysis and is the foundation for experiments with Agent Builder (currently in preview) to identify outdated software versions in Kubernetes clusters.
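The Saturday 6:00 AM cadence can be illustrated with a short sketch. This is not Datadog's workflow configuration, which the source does not show; the function name `next_weekly_run` and the use of naive local time (no time zone) are assumptions made only to show how a fixed weekly slot is computed.

```python
from datetime import datetime, timedelta

SATURDAY = 5  # datetime.weekday(): Monday is 0, Saturday is 5


def next_weekly_run(now: datetime, weekday: int = SATURDAY, hour: int = 6) -> datetime:
    """Return the first weekly run time strictly after `now` (hypothetical helper)."""
    # Start from today's slot at the target hour.
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    # Move forward to the target weekday (0 days if today is already that day).
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    # If that slot is not in the future, use next week's slot instead.
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


# Example: from a Wednesday at noon, the next run is the following Saturday at 06:00.
print(next_weekly_run(datetime(2025, 3, 5, 12, 0)))
```

In standard cron syntax, the same weekly slot is usually written as `0 6 * * 6`.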

The value of unified observability became clear during a recent API gateway incident. Another team had been investigating the issue for four hours without identifying the cause. When Nathan de Mesquita dos Santos, Site Reliability Engineer, joined the incident, he used Datadog’s AI-assisted tools to correlate traces and quickly identified the root cause—a changed API rate limit. “When we joined the incident, Nathan used AI and reached the same conclusion the team was already converging on—independently, and much faster,” says Cruz.

A second example highlights how improved visibility is changing development practices. The Seguro Viagem travel insurance application depended on a legacy Salesforce service that functioned as a black box—engineers could detect latency but could not identify its source. Using APM and OpenTelemetry, the team instrumented the application to surface per-service latency and is now building a new version with better visibility from the start.
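To illustrate why a per-service breakdown matters, the sketch below groups span durations by service so the slowest dependency stands out. The service names and durations are hypothetical, and summing span durations is a simplification, since real traces contain nested and overlapping spans. It is not the team's instrumentation, which the source does not describe in detail.

```python
from collections import defaultdict

# Hypothetical spans from one request: (service name, duration in milliseconds).
spans = [
    ("travel-insurance-web", 40.0),
    ("quote-api", 85.0),
    ("legacy-salesforce-service", 910.0),
    ("quote-api", 15.0),
]


def latency_by_service(spans):
    """Sum span durations per service and return them slowest first."""
    totals = defaultdict(float)
    for service, duration_ms in spans:
        totals[service] += duration_ms
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


breakdown = latency_by_service(spans)
for service, total_ms in breakdown:
    print(f"{service}: {total_ms:.0f} ms")
```

With only an end-to-end timing, the request would simply look slow. The grouped view shows which service accounts for most of the time.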

Fewer incidents, lower costs, and a more confident engineering team

Consolidating onto a single platform transformed how SulAmérica’s engineering team operates. “Previously we had many tools, and context switching was inefficient,” says Cruz. “We couldn’t correlate metrics across tools—you had to look at one, then another, and manually piece things together to find an answer. Datadog changed that.”

With a unified view across infrastructure, applications, and services, the team shifted from reactive to proactive operations. Instead of escalating incidents into war rooms before having enough context, engineers can now detect signals earlier and investigate issues before they impact customers.

The number of incidents has dropped significantly, and war rooms are far less frequent. Kubernetes operations have improved as well. Using Datadog’s Bits AI Kubernetes Remediation, SulAmérica reduced the number of workloads with image pull errors from hundreds to just four, improving cluster hygiene almost entirely through automated weekly workflows. The team also used real usage data to right-size workload allocation, cutting allocated capacity by 40% (from 400 GB to 240 GB) and reducing memory usage by 40% (from 250 MB to 150 MB). Overall Google Kubernetes Engine (GKE) infrastructure costs dropped by 9%, with no ongoing manual effort required to sustain those gains.
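As a quick check on the right-sizing figures, the percentage reduction implied by each before-and-after pair can be computed directly. The helper name `percent_reduction` is illustrative and not part of any tool mentioned here.

```python
def percent_reduction(before: float, after: float) -> float:
    """Return the reduction from `before` to `after` as a percentage of `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100 * (before - after) / before


allocated_capacity = percent_reduction(400, 240)  # GB allocated, before and after
memory_usage = percent_reduction(250, 150)        # MB used, before and after

print(f"Allocated capacity reduction: {allocated_capacity:.0f}%")
print(f"Memory usage reduction: {memory_usage:.0f}%")
```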

Looking ahead: automating cost and performance insights

SulAmérica is now taking a measured approach to AI-driven automation, prioritizing clean, reliable signals before layering automation on top. This approach is already underway through experiments with Agent Builder in its Kubernetes environment.

In the future, the team plans to expand automation to analyze metrics, generate summaries of anomalies, and surface insights directly in collaboration tools.

The long-term goal is to calculate a cloud-cost-per-transaction metric by combining observability and business data. Linking infrastructure spend directly to user activity will provide clearer insight into operational costs and support more data-driven decision-making across the organization.
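One simple way to read this metric is total cloud spend over a period divided by the number of business transactions in the same period. The sketch below assumes that definition. The function name and the sample figures are hypothetical and do not come from SulAmérica.

```python
def cost_per_transaction(cloud_cost: float, transactions: int) -> float:
    """Return cloud spend per business transaction for one period."""
    if transactions <= 0:
        raise ValueError("transactions must be positive")
    return cloud_cost / transactions


# Hypothetical month: 5,000 currency units of spend, 2,000,000 transactions.
monthly_cost = 5000.0
monthly_transactions = 2_000_000

unit_cost = cost_per_transaction(monthly_cost, monthly_transactions)
print(f"Cost per transaction: {unit_cost:.4f}")
```

Tracking this value over time, rather than total spend alone, separates cost growth driven by higher usage from cost growth driven by inefficiency.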