From observability to experience: building customer-centric digital services
Itaú Unibanco operates in 18 countries and serves more than 70 million customers with a broad portfolio of products and services. With over a century of history and decades of accumulated technological legacy, the bank operates in a highly complex environment involving thousands of applications and more than 90,000 employees—17,000 of whom are dedicated to technology.
To sustain its evolution and scale efficiently, Itaú is undergoing a major modernization of its technology platform, aiming to fully migrate its infrastructure to the cloud by 2028.
This journey is transforming how the bank develops, operates, and observes its systems, creating more resilient, integrated foundations and preparing them for continuous evolution. In a rapidly digitizing financial services landscape, this modernization is essential. Today, 97% of customer interactions occur through digital channels, requiring high availability, real-time responsiveness, and continuous reliability.
In this context, modernizing the observability platform has become strategic to address challenges of scale, speed, and availability—enabling efficient operations and supporting the delivery of customer-centric digital services.
Opportunity: managing scale, speed, and multiple signals
As Itaú modernized its platform, complexity increased across all layers: the bank now operates thousands of services across multiple cloud providers, hybrid environments, and on-premises systems. This led to a significant increase in telemetry data volume.
Previously, logs and tracing were spread across different tools, requiring additional effort to correlate signals during incident analysis. The growing volume of logs and alerts created challenges in filtering relevant information, especially in critical systems.
“Maintaining high availability at our scale requires full visibility. It’s essential to have a platform that continuously helps us understand system behavior, customer impact, and risks in real time,” says Thiago Morais.
To ensure operational reliability at scale, Itaú consolidated its monitoring tools into a single platform, replacing fragmented systems with a unified approach capable of handling speed, volume, and system criticality. Alerts are now efficiently routed to responsible teams, ensuring secure and fast operation of essential systems—improving customer experience and supporting operational excellence.
Why Datadog: centralized visibility, AI-driven analysis, and platform modernization support
Adopting Datadog as an integrated observability platform gave the bank a unified view of its infrastructure, in which applications, logs, and alerts are tied to user experience. The platform integrates with Amazon Web Services (AWS), the bank’s primary cloud provider, so teams gained immediate visibility into services like Amazon EC2, AWS Lambda, and managed databases, accelerating setup and eliminating monitoring gaps in production environments.
In partnership with Datadog, Itaú also created a centralized team to define standards for data ingestion, identification, and tagging—improving consistency, alert quality, and cost predictability.
Another key differentiator was AI-driven analysis through features like Datadog Watchdog and Bits AI. These capabilities help engineers move quickly from detecting an incident to understanding it by automatically highlighting the most relevant signals.
“Standardization allows teams to move fast without losing alignment. Combined with Datadog’s AI capabilities, it helps reduce guesswork and shorten investigation time,” says Morais.
By unifying observability with Datadog, Itaú achieved measurable results:
- Eliminated 13 tools from its observability stack
- Detected anomalies up to three hours earlier using Watchdog
- Reduced issue resolution time by 35%
- Reduced incident rate by 40%
- Increased visibility into frontend errors by 70%
Enhancing observability with operational efficiency
Operating real-time banking services requires fast investigation and clear alerting, especially in complex, distributed environments. Itaú addressed this by adopting an integrated operational view—connecting infrastructure, applications, logs, and user experience signals.
With Datadog, teams reduced noise and accelerated root cause analysis, clearly linking customer-reported issues to backend performance in critical services. This enables faster and more effective responses to customer-impacting incidents.
At Itaú’s scale—handling real-time payments and high-availability digital channels—comprehensive log collection is essential for compliance, threat detection, and incident response. However, modernization drove log volumes up to 8 petabytes per month, making cost control critical.
Using Datadog Log Management, teams correlate logs, metrics, and traces, improving context sharing and speeding up incident investigation. Flex Logs helps control ingestion costs while maintaining necessary retention for high-demand use cases.
Additionally, Itaú implemented Observability Pipelines to optimize log flow. These pipelines apply intelligent sampling, preserve critical logs, remove duplicate WARN and ERROR events, and filter low-value logs like routine health checks. This improves alert quality, protects sensitive data, and enables efficient large-scale log management.
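The filtering steps described above can be illustrated with a minimal Python sketch. This is not Itaú's configuration and not the Observability Pipelines API: the class name, field names (`level`, `path`, `service`, `trace_id`), health-check paths, sample rate, and deduplication window are all hypothetical, chosen only to show the order of decisions such a pipeline might make.

```python
import hashlib
from dataclasses import dataclass, field

# Hypothetical paths treated as routine health checks (assumption for illustration).
HEALTH_CHECK_PATHS = {"/health", "/healthz", "/ping"}


@dataclass
class LogPipeline:
    """Illustrative log filter: drop health checks, dedupe WARN/ERROR, sample the rest."""

    sample_rate: float = 0.1        # fraction of low-severity logs kept (assumed value)
    dedupe_window_s: float = 60.0   # identical WARN/ERROR events within this window are dropped
    _last_seen: dict = field(default_factory=dict, init=False, repr=False)

    def keep(self, event: dict, now: float) -> bool:
        """Return True if the event should be forwarded downstream."""
        level = str(event.get("level", "INFO")).upper()
        service = event.get("service", "")
        message = event.get("message", "")

        # 1. Filter low-value logs such as routine health checks.
        if event.get("path") in HEALTH_CHECK_PATHS:
            return False

        # 2. Always preserve critical logs.
        if level in ("CRITICAL", "FATAL"):
            return True

        # 3. Remove duplicate WARN and ERROR events seen within the window.
        if level in ("WARN", "ERROR"):
            key = (service, level, message)
            last = self._last_seen.get(key)
            if last is not None and now - last < self.dedupe_window_s:
                return False
            self._last_seen[key] = now
            return True

        # 4. Sample remaining logs deterministically by trace ID, so that
        #    all logs sharing a trace ID are kept or dropped together.
        trace_id = str(event.get("trace_id", message))
        digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
        return bucket < self.sample_rate
```

A caller would pass each event with a timestamp, for example `pipeline.keep(event, now=time.time())`. Hashing the trace ID rather than drawing a random number is a design choice in this sketch: it keeps sampling reproducible across processes, at the cost of needing a stable ID on every event.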
“At our scale, logging with Datadog is a strategic decision—not just a technical one,” says Morais.
Key outcomes include:
- 40% reduction in daily log volume after adopting Observability Pipelines
- 13.6% reduction in log-related costs
Connecting frontend experience to backend performance
Delivering high-availability digital experiences at scale requires understanding how customer experience connects to backend performance. Itaú now prioritizes incidents based on customer impact, ensuring fast and effective responses.
To achieve this, Itaú uses Datadog APM and Real User Monitoring (RUM) to link frontend behavior with backend execution. APM provides end-to-end tracing across services, helping identify latency, errors, and dependencies. RUM tracks user interactions in critical journeys such as login and payments.
With RUM Without Limits, teams capture all sessions and control indexing without code changes—focusing on specific users, errors, or campaigns while managing costs.
“With frontend metrics, we can quickly identify issues and support backend teams to act fast, avoiding customer impact,” says Morais.
Measured benefits include:
- Reduced load times in applications like Itaú Shop
- 50% reduction in RUM consumption with RUM Without Limits
Building the future with Itaú Unibanco
Itaú’s modernization efforts are transforming how the bank develops and manages its technology, enabling faster delivery, greater resilience, and continuous service availability.
By adopting Datadog as a unified observability platform, Itaú turns large volumes of telemetry data into clear, actionable insights across thousands of services—driving more efficient and proactive operations.
As Itaú progresses toward a fully cloud-native architecture, Datadog serves as its foundation for observability at scale. This continuous visibility allows Itaú to operate with reliability and predictability, creating room for innovation and delivering seamless, customer-centric digital experiences, even as digital services continue to grow.
“Datadog gives us the observability we need to scale securely and maintain customer trust,” concludes Morais.