Picsart delivers real-time AI creativity at global scale with Datadog | Datadog

Case Study

About Picsart

Picsart is a global creative platform with hundreds of millions of users, providing AI-powered tools to create, edit, and generate visual and audio content for a worldwide community.

Industry: Software Development / AI
Employees: 500+
Headquarters: Miami, Florida
“Datadog allows us to understand our AI systems end-to-end, so we can maintain performance, reduce risk, and deliver better user experiences at scale.”
Jor Khachatryan Senior Manager, SRE and Incident Management Picsart

Why Datadog?

  • Brings metrics, logs, and traces together in one platform
  • Enables real-time visibility across distributed AI systems
  • Provides end-to-end tracing from user request to infrastructure
  • Supports high-cardinality data across large-scale workloads
  • Helps teams define and monitor SLOs based on real usage
  • Allows fast correlation of signals for quicker troubleshooting

Challenge

As AI became central to its product, Picsart needed deeper visibility into complex, real-time workloads to maintain low latency, ensure reliability, and troubleshoot issues across distributed systems.

Key results

↓40–50% MTTR

Faster detection and root cause identification

↓40–50% debugging time

Reduced investigation complexity across teams

Improved AI latency performance

Optimized inference and reduced bottlenecks

Scaling AI creativity for hundreds of millions of users

Picsart is building a platform where anyone can create. By combining intuitive design tools with powerful AI capabilities, the company enables hundreds of millions of users to generate, edit, and transform content regardless of skill level. AI is at the center of that experience, powering everything from image generation to new interactive environments like AI Playground and Picsart Flows, where users can explore more than 125 models and creative workflows.

As these capabilities expand, so do user expectations. “Users expect real-time responses, even though behind the scenes these are complex, CPU- and GPU-intensive workloads,” says Jor Khachatryan, Senior Manager, SRE and Incident Management at Picsart.

Delivering fast, consistent results across a global user base requires operating highly distributed systems where performance, latency, and reliability are tightly coupled to user experience. As AI became core to the product, maintaining that experience at scale became increasingly challenging.

Limited visibility into complex, distributed AI systems

As traffic and AI workloads grew, Picsart needed to better understand how its systems behaved end to end. Requests span multiple layers, from user interactions to backend services, model inference, and infrastructure. Without clear visibility across this full path, identifying bottlenecks and diagnosing issues became more difficult. “We needed to understand request flows from the user through our services to the models and infrastructure,” Khachatryan explains. “Traditional monitoring was not enough for that level of complexity.”

At the same time, the team needed to monitor latency, success rates, and system behavior in real time while supporting high-cardinality data across distributed environments. Without deeper observability, troubleshooting slowed and consistent performance across regions was harder to maintain.

This gap began to impact both reliability and development speed, making it clear that a more comprehensive approach to observability was required.

Unifying observability to operate AI systems in real time

Picsart adopted Datadog to bring together metrics, logs, and traces into a unified platform. This allowed teams to correlate signals in real time and gain full visibility into system behavior across every layer of the stack.

With Datadog APM, the team achieved near-complete visibility into backend services and AI pipelines. This made it possible to define and track service level objectives for availability and latency based on real production data. Real User Monitoring added another critical dimension by providing direct insight into how users experience the platform across regions, devices, and network conditions. “The key value comes from connecting these layers,” says Khachatryan. “We can trace issues end to end, from a user session all the way through services down to infrastructure.”
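To make the idea of a service level objective based on production data concrete, here is a minimal sketch in plain Python. It is not Picsart's implementation and does not use any Datadog API. The `Request` type, the 500 ms latency target, the 95% objective, and the sample window are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # end-to-end latency observed for one request
    ok: bool           # True if the request succeeded


def slo_report(requests, latency_target_ms, objective):
    """Return (good_fraction, budget_remaining) for a window of requests.

    A request is "good" if it succeeded and met the latency target.
    budget_remaining is the share of the error budget still unspent;
    it goes negative once the objective is violated.
    """
    total = len(requests)
    if total == 0:
        raise ValueError("no requests in window")
    if not 0 < objective < 1:
        raise ValueError("objective must be strictly between 0 and 1")
    good = sum(1 for r in requests if r.ok and r.latency_ms <= latency_target_ms)
    allowed_bad = (1 - objective) * total  # bad requests the objective tolerates
    return good / total, 1 - (total - good) / allowed_bad


# Hypothetical window: 98 fast successes, 1 slow success, 1 failure.
window = [Request(120.0, True)] * 98 + [Request(900.0, True), Request(80.0, False)]
good_fraction, budget_remaining = slo_report(window, latency_target_ms=500, objective=0.95)
print(f"good={good_fraction:.2%} budget_remaining={budget_remaining:.0%}")
```

Counting a slow success as "bad" folds availability and latency into one objective. Tracking them as two separate objectives is an equally reasonable choice.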

This unified view also improved collaboration. SRE teams lead reliability, incident response, and system visibility, while AI and engineering teams use the same platform to monitor model performance, investigate latency, and debug issues. With everyone working from the same data, teams stay aligned around system performance and user experience.

Reducing MTTR and improving performance across AI workloads

With full observability in place, Picsart significantly improved its ability to detect, investigate, and resolve issues. Mean time to resolution (MTTR) decreased by approximately 40–50%, and debugging time was reduced by a similar margin. “The biggest shift was how quickly we can go from detection to root cause,” Khachatryan says. “Instead of navigating multiple tools, we now correlate traces, metrics, and logs in a single platform.”

This capability is especially important in AI systems, where performance issues often stem from multiple factors across different layers. In one case, latency began increasing in an AI-powered feature, but the source was not immediately clear. Using Datadog APM, the team identified that the issue affected only a subset of traffic and traced it to increased inference time in a specific model version. At the same time, infrastructure metrics showed GPU saturation in certain regions, while logs revealed queuing delays in the inference pipeline. By correlating these signals, the team uncovered a combination of factors contributing to the issue. They rebalanced traffic and optimized workloads, reducing latency and preventing broader user impact.
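The first step in that investigation, narrowing a latency increase to a subset of traffic, can be sketched as a group-by over tagged request durations. This is an illustration in plain Python, not the team's actual workflow or a Datadog query. The field names `model_version`, `region`, and `duration_ms`, the 1.5x threshold, and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import median


def slow_groups(spans, keys, factor=1.5):
    """Return {group: median_ms} for groups much slower than overall traffic.

    spans  - list of dicts with a 'duration_ms' field plus tag fields
    keys   - tag names to group by, e.g. ("model_version", "region")
    factor - a group is flagged when its median exceeds factor * overall median
    """
    overall = median(s["duration_ms"] for s in spans)
    grouped = defaultdict(list)
    for s in spans:
        grouped[tuple(s[k] for k in keys)].append(s["duration_ms"])
    medians = {group: median(values) for group, values in grouped.items()}
    return {g: m for g, m in medians.items() if m > factor * overall}


def make(version, region, durations):
    # Helper for building hypothetical sample spans.
    return [{"model_version": version, "region": region, "duration_ms": d}
            for d in durations]


spans = (make("v1", "us", [100, 110, 105]) + make("v1", "eu", [100, 95, 105])
         + make("v2", "us", [300, 320, 310]) + make("v2", "eu", [100, 105, 98]))
slow = slow_groups(spans, keys=("model_version", "region"))
print(slow)
```

Grouping by more than one tag at once is what makes high-cardinality data useful here: neither model version nor region alone isolates the slow traffic as sharply as the combination does.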

Datadog also plays a critical role during feature rollouts. With real-time visibility, teams can monitor both system performance and user impact as changes are deployed. In one instance, Real User Monitoring revealed degraded session performance and increased frontend errors for specific regions and device types.

By linking this data with backend traces, the team identified a latency issue introduced by a recent deployment. Because they could trace the issue from user sessions down to service-level behavior, they quickly isolated the root cause and rolled back the change before it affected a larger portion of users.

Scaling AI innovation with confidence and control

As Picsart continues to expand its AI platform, observability remains a foundational capability for managing complexity and maintaining performance. “Datadog is a key enabler for operating AI systems at scale,” says Khachatryan. “It allows us to maintain reliability while increasing delivery speed.”

With real-time visibility into system behavior and user experience, teams can move faster without increasing operational risk. Engineers spend less time troubleshooting and more time building new capabilities, while SRE teams maintain control over system reliability.

Looking ahead, Picsart is continuing to invest in more advanced AI experiences, including real-time generation, personalization, and interactive workflows. As these systems grow more complex, the ability to understand and operate them effectively becomes even more critical. “For AI-first companies, observability should be treated as a core system, not an add-on,” Khachatryan says. “It is essential for scaling reliably.”

With Datadog, Picsart has built the visibility and control needed to support continuous innovation while delivering fast, reliable creative experiences to users around the world.
