Self-hosted stack creates reliability challenges
Rakuten Viber is one of the world’s most popular and trusted super-apps, on a mission to make every connection meaningful—from personal chats and secure dating to managing financial services and business relationships. As a global leader in communication, Rakuten Viber connects hundreds of millions of users through messaging, calls, and communities, processing tens of billions of interactions daily. This massive scale is supported by a robust multi-cloud infrastructure, utilizing thousands of Amazon EC2 instances and advanced services across Amazon Web Services, Microsoft Azure, and Google Cloud to ensure a seamless, secure, and reliable experience for users worldwide.
Viber’s engineering priorities center on reliability, scalability, and cost efficiency. This supports its goal of delivering a seamless, secure experience to users while maintaining sustainable and efficient infrastructure.
As Viber’s scale continued to grow, log volume grew with it, and the company’s self-hosted solution could no longer keep up. Total cost of ownership (TCO) also increased. These issues began directly impacting end users. In some instances, the team could not resolve critical incidents in a timely manner, which affected both customer success and development productivity.
The situation reached a critical point in 2024 when an issue made log data unavailable for an entire day. When the subject-matter expert (SME) who managed the company’s self-hosted stack left the organization later that year and TCO grew even higher, the Viber team knew it was time to make a change. “He was the dam holding back the flood of issues,” says Sergey Korolev, Director of DevOps.
Viber needed to transition from its open source stack to a managed observability platform that could maintain cost efficiency while supporting its scale. “Viber is a communication platform, and excels at that, not an observability vendor,” says Korolev. “We’d like to focus on what really matters for our users.”
Modernizing observability with Datadog
To address its critical observability challenges, Viber turned to Datadog Flex Logs and Datadog Observability Pipelines. After evaluating multiple vendors, Viber chose Datadog for its proven reliability and best-of-breed observability tooling, including metrics and monitoring capabilities that could unify its entire observability stack.
Viber adopted Flex Logs as its primary logging solution because the product is designed to handle massive data volumes cost-effectively. “We use Flex Logs exclusively, as indexing the full volumes of our logs is way too expensive,” explains Korolev. “This provides us the ability to ingest and index logs with no performance issues observed.”
Viber’s migration was remarkably smooth thanks to Observability Pipelines, which enabled dual-shipping of logs to multiple destinations and accelerated the entire process. The team completed onboarding in less than a month, seamlessly transitioning from a complex open source stack to Flex Logs without any major rework or architectural changes.
By reusing existing agents, the team avoided disruption and minimized risk. Despite initial concerns about losing functionality, the migration revealed no feature gaps or performance issues. Developers quickly adapted to Datadog’s intuitive UI, and the solution was rapidly replicated across environments.
As Korolev notes, “Datadog Flex Logs performs very well. Especially considering the log volumes and other solutions we’ve tried, it is really fast.”
In addition to Flex Logs, Observability Pipelines is now a cornerstone of Viber’s cost optimization strategy, helping the team aggregate, process, and enrich log events before ingestion. Observability Pipelines filters out at least twice as much data as is indexed, so only valuable logs reach Datadog, while all events are automatically forwarded to Amazon S3 for long-term archiving and analytics.
“Observability Pipelines helped us take control of our logging costs from day one,” says Korolev. “It's incredibly easy to use—we routed only the logs we needed to Datadog and sent everything else straight to Amazon S3 for archival.”
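The routing pattern described above (archive every event, forward only the valuable ones for indexing) can be sketched in a few lines of Python. This is a conceptual illustration only, not Observability Pipelines configuration or any Datadog API. The `is_valuable` rule, the `level` field, and the function names are hypothetical, and the two lists stand in for the Amazon S3 archive and the Datadog destination.

```python
from typing import Callable, Iterable

LogEvent = dict


def is_valuable(event: LogEvent) -> bool:
    """Hypothetical filter rule: keep only warnings and errors."""
    return event.get("level") in {"WARN", "ERROR"}


def route(
    events: Iterable[LogEvent],
    keep: Callable[[LogEvent], bool] = is_valuable,
) -> tuple[list[LogEvent], list[LogEvent]]:
    """Split a stream into (archived, forwarded).

    Every event goes to the archive list (standing in for Amazon S3);
    only events that pass `keep` go to the forwarded list (standing in
    for the indexing destination).
    """
    archived: list[LogEvent] = []
    forwarded: list[LogEvent] = []
    for event in events:
        archived.append(event)
        if keep(event):
            forwarded.append(event)
    return archived, forwarded


if __name__ == "__main__":
    sample = [
        {"service": "chat", "level": "DEBUG"},
        {"service": "chat", "level": "ERROR"},
        {"service": "calls", "level": "INFO"},
    ]
    archived, forwarded = route(sample)
    print(len(archived), "archived;", len(forwarded), "forwarded")
```

In this sketch the filter decision is a plain predicate, so the rule for what counts as a valuable log can change without touching the archiving path.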
The team created dashboards to monitor events from each service, enabling proactive cost management. When a recent code change doubled the log volume from a particular service, they were able to quickly throttle events through Observability Pipelines while maintaining visibility into system health. Today, teams across the organization can access the insights they need: backend engineering relies on metrics, monitors, and dashboards to track service health; DevOps teams use Datadog for incident detection and response; product teams benefit from visibility into performance trends that impact user experience; and customer success uses data to understand user activities and remediate issues.
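The per-service throttling mentioned above can be illustrated with a simple fixed-window limiter. This is a minimal sketch under stated assumptions, not how Observability Pipelines implements throttling. The class name, the window size, and the per-window limit are all hypothetical, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
from collections import defaultdict


class ServiceThrottle:
    """Hypothetical fixed-window throttle keyed by service name."""

    def __init__(self, max_events_per_window: int, window_seconds: float = 1.0):
        self.max_events = max_events_per_window
        self.window = window_seconds
        self._window_start: dict[str, float] = {}
        self._count: dict[str, int] = defaultdict(int)

    def allow(self, service: str, timestamp: float) -> bool:
        """Return True if the event fits in the service's current window."""
        start = self._window_start.get(service)
        # Open a new window for this service if none exists or the old one expired.
        if start is None or timestamp - start >= self.window:
            self._window_start[service] = timestamp
            self._count[service] = 0
        if self._count[service] < self.max_events:
            self._count[service] += 1
            return True
        return False


if __name__ == "__main__":
    throttle = ServiceThrottle(max_events_per_window=2, window_seconds=1.0)
    for ts in (0.0, 0.1, 0.2, 1.0):
        print(ts, throttle.allow("noisy-service", ts))
```

Because each service has its own counter, a sudden spike from one service is capped without dropping events from the others.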
Transforming observability at scale
Today, Viber uses Datadog Log Management to capture logs from all its services, including backend, video and audio systems, and more. The implementation of Flex Logs together with Observability Pipelines provides a fast, reliable logging system that supports the company’s scale while maintaining consistent and reasonable costs.
By ingesting logs and indexing only the relevant ones, Viber reduced log data by 80%, cutting redundant logs from 25 TB to 5 TB per day. Viber’s engineers are able to track and tackle issues faster than ever before, which has delivered significant reductions in both mean time to resolution (MTTR) and mean time to detection (MTTD). More importantly, the shift from managing its complex self-hosted stack to Datadog has narrowed the gap between the development and operations teams, making it easier to resolve issues collaboratively.
Looking ahead, Viber is exploring additional capabilities, including deeper log analysis using Datadog Notebooks for more complex querying needs and advanced on-demand research and structuring. The company is also looking forward to incorporating Datadog Sensitive Data Scanner to simplify compliance by automatically scrubbing sensitive data. “Providing services to hundreds of millions of users and generating tens of billions of events requires eliminating clutter and noise from our observability systems as much as possible,” explains Korolev. “Datadog Flex Logs and Observability Pipelines let us cost-effectively collect, filter, and retain massive log volumes, helping to reduce noise and costs while ensuring reliable visibility across all services.”