Unifying Kubernetes Monitoring and Optimizing Incident Response with AI

Past Challenges in the T’order System

T’order, South Korea’s leading table-ordering platform, has transformed the consumer dining experience with its contactless digital ordering and payment system. As the company scaled, it faced major challenges in infrastructure operations: it was managing 10 Kubernetes clusters and 300 workloads, with added complexity from integrating with numerous POS providers.

The operations team’s biggest concern was the amount of time and cost required for incident response. The existing model was reactive, responding only after issues occurred. Non-developers struggled to understand logs and metrics, and both developers and the operations team found it difficult to quickly determine whether the root cause was within the service or the infrastructure. As a result, response times slowed and discussions around responsibility and cause became prolonged.

When incidents occurred, teams had to manually sift through logs in ELK stacks (Elasticsearch, Logstash, Kibana, plus Beats) that were built separately for each service. With this approach, 70% of incident response time was spent searching logs, 20% was spent on communication and ownership alignment, and only 10% was spent actually resolving the issue. Because the same teams handled both development and operations, lack of visibility became a constant obstacle. Engineers had to jump between multiple tools to locate problem areas, and as the number of services grew, visibility dropped sharply, overall health took longer to assess, and downtime rose with it.

T’order is used by both end users (B2C) and store owners (B2B). Even a few minutes of downtime in the ordering service can significantly impact store revenue and operations. This made stability and reliability a top operational priority.

From Reactive Response to Proactive Operations

As the service architecture expanded, T’order concluded it needed a new operating model beyond the legacy approach due to clear limits in incident response and operational efficiency. The first priority was centralizing fragmented logging and monitoring systems to reduce the time spent investigating issues.

Another key task was establishing a clear identification and tracing system across hundreds of distributed services. As services multiplied, it became difficult to know which team owned which service and which environment or cluster it was running in, which directly slowed incident analysis. To address this, T’order needed a consistent metadata tagging scheme that included unique service IDs, teams, environments, categories, cluster names, and more as a shared standard across operations.
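A tagging standard like this is easiest to keep consistent when it is checked mechanically. The following is a minimal Python sketch of such a check, not T’order’s actual implementation. The tag key names (`service_id`, `team`, `env`, `category`, `cluster_name`) and both function names are hypothetical, chosen only to mirror the fields listed above. The `key:value` rendering follows the tag format Datadog uses.

```python
# Hypothetical required keys, mirroring the fields named in the text.
REQUIRED_TAG_KEYS = ("service_id", "team", "env", "category", "cluster_name")


def missing_tags(tags: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or blank, in standard order."""
    return [key for key in REQUIRED_TAG_KEYS if not tags.get(key, "").strip()]


def to_tag_strings(tags: dict[str, str]) -> list[str]:
    """Render tags as sorted 'key:value' strings.

    Raises ValueError if any required key is missing, so an untagged
    service is rejected before it is deployed.
    """
    missing = missing_tags(tags)
    if missing:
        raise ValueError(f"missing required tags: {', '.join(missing)}")
    return sorted(f"{key}:{value}" for key, value in tags.items())
```

Running a check of this kind at the service design stage or in CI means a service cannot ship without an owning team and environment attached, which is what makes later tag-based search reliable.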

In addition, the company needed a proactive system that could detect anomalies early and respond quickly. The “search logs after the fact” approach became increasingly ineffective at scale. T’order required active monitoring that could immediately notify relevant teams at the moment of an incident and quickly distinguish between infrastructure issues and application issues.

Finally, T’order needed a structure that reduced recurring, unnecessary communication overhead during incident response and clearly separated ownership and roles. Summaries, severity assessment, and resolution guidance needed to be automated so development teams and operations teams could focus on their core responsibilities—building product features and optimizing operations. To achieve this, the company prioritized refining alerts, sending only what was necessary, and embedding remediation guidance directly into messages.

Outcomes Powered by Datadog

To solve these challenges, T’order evaluated various open-source monitoring tools and ultimately chose Datadog because it provided the clearest visualization of service architecture and incident factors. Previously, teams had to jump between three or more tools to review metrics and logs. Datadog was the only solution that could unify these observability signals into a single view.

Datadog also offered easy installation and setup for Kubernetes environments using Helm charts, enabling smooth integration as T’order migrated legacy systems to Amazon Web Services (AWS), including Amazon Elastic Kubernetes Service (EKS). To control cluster characteristics, T’order used AppSet, and standardized its metadata tagging system from the service design stage to fully leverage Datadog’s tag-based search. Tags such as service ID, team, environment, service name, category, cluster name, and region were defined in advance—allowing teams to quickly identify where a service was running and which team owned it in just a few clicks.

One of Datadog’s core strengths is its ability to unify observability into a single pane of glass. T’order brought together APM traces, error logs, infrastructure metrics like CPU and memory, request volume, latency, HTTP 4xx/5xx rates, and HPA status into one dashboard.

Understanding service-to-service dependencies also improved significantly. Using Datadog’s Service Map, the team could visually see which services were communicating, where traffic was concentrating, and where failures were occurring. For a company like T’order with many external integrations, this relationship map became a key tool for dramatically accelerating incident investigation.

Datadog became more than a monitoring tool. It formed the basis for a shift in T’order’s operational model. With APM, logs, and infrastructure metrics accessible in one place, the team could immediately determine whether the issue was infrastructure-related or internal to the service. This created a clear ownership model with strong role separation: developers focused on development, while the DevOps team focused on operational optimization.

“There was no other platform that visualized as well as Datadog and helped us identify incident factors so clearly.”

T’order extended Datadog’s capabilities by integrating it with an AI-powered automation layer built on Model Context Protocol (MCP) that enables proactive incident response. When an anomaly alert is generated in Datadog, it is sent via Amazon Simple Notification Service (SNS) to the MCP module, where AI immediately analyzes the related logs and traces to identify the root cause and propose resolution steps. Based on this analysis, the system automatically drafts an incident report, creates a Jira ticket, and sends a summarized notification to Slack. As a result, developers receive immediate AI-driven guidance when issues occur, reducing response initiation time to within 2 to 3 minutes and improving overall incident resolution speed by 53%.
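The flow described here (alert in, analysis, report, ticket, notification) can be sketched in Python as a small orchestration function. This is an illustrative outline under stated assumptions, not T’order’s code. The `Alert` and `Analysis` types, `build_report`, and `handle_alert` are hypothetical names, and the AI analysis, Jira, and Slack steps are passed in as plain callables rather than real client calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Alert:
    """Hypothetical shape of an alert after it arrives from the queue."""
    title: str
    service: str
    team: str


@dataclass
class Analysis:
    """Hypothetical result of the AI analysis step."""
    root_cause: str
    resolution_steps: list[str]


def build_report(alert: Alert, analysis: Analysis) -> str:
    """Draft a plain-text incident report from the alert and its analysis."""
    lines = [
        f"Incident: {alert.title}",
        f"Service: {alert.service} (team: {alert.team})",
        f"Probable root cause: {analysis.root_cause}",
        "Proposed resolution steps:",
    ]
    lines += [
        f"  {number}. {step}"
        for number, step in enumerate(analysis.resolution_steps, start=1)
    ]
    return "\n".join(lines)


def handle_alert(
    alert: Alert,
    analyze: Callable[[Alert], Analysis],      # stands in for the AI step
    create_ticket: Callable[[str, str], str],  # (title, body) -> ticket key
    notify: Callable[[str], None],             # stands in for the chat message
) -> str:
    """Run analysis, file a ticket with the report, send a summary, return the key."""
    analysis = analyze(alert)
    report = build_report(alert, analysis)
    ticket_key = create_ticket(alert.title, report)
    notify(f"[{ticket_key}] {alert.title}: {analysis.root_cause}")
    return ticket_key
```

Injecting the three external steps as callables keeps the orchestration testable with fakes and makes it straightforward to swap one MCP server or ticketing backend for another.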

Future Plans

T’order plans to further advance its Datadog observability system and MCP automated response framework to strengthen company-wide operational efficiency and stability. So far, the system has primarily focused on collecting and analyzing APM data and logs, but the company plans to expand its use of Datadog by broadening alert scenarios and anomaly detection models. The goal is to detect risk factors across the service more quickly and accurately.

T’order also plans to migrate the monitoring metrics for managed services into Datadog, further centralizing and unifying its monitoring environment. In addition, the company intends to advance its system by transitioning from its existing open-source MCP server to Datadog’s officially supported MCP module, while strengthening its internal AI-powered exploration capabilities on the website.

T’order’s ultimate goal is to build a highly productive operating environment by delivering only truly essential alerts, complete with analysis and actionable guidance, at the right time and in the most effective way.