Prioritize and Promote Service Observability Best Practices With Service Scorecards | Datadog

Authors: Kruthi Vuppala, Brooke Chen

Published: May 9, 2023

The Datadog Service Catalog consolidates knowledge of your organization’s services and shows you information about their performance, reliability, and ownership in a central location. The Service Catalog now includes Service Scorecards, which inform service owners, SREs, and other stakeholders throughout your organization of any gaps in observability or deviations from reliability best practices.

Datadog automatically evaluates each service against pass-fail rules in three categories: Production Readiness, Ownership and Documentation, and Observability Best Practices. Service Scorecards summarize observability and give teams a starting point to learn about and prioritize improvements to their services.

In this post, we’ll look at some of the ways in which Service Scorecards can help your team align with best practices, plan work to improve your services’ observability, and communicate and collaborate with stakeholders and other teams.

The Service Catalog shows the percentage of rules each service has passed.

Ensure that teams have adopted best practices

The first scorecard category, Production Readiness, includes rules to help ensure that an SRE team’s processes have positioned them well to operate their services. The category includes a rule that evaluates whether a team has defined SLOs to track their services’ performance and another to check that they’ve created monitors so that they can quickly react to potential issues. Another rule checks that the Service Catalog record designates an on-call responder to enable collaboration between teams as they troubleshoot incidents that involve more than one service. The final rule in the category tests whether the team has deployed an updated version of the service within the last three months to make sure that they’re following agile best practices.
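As a rough illustration of how pass-fail rules like these could be modeled, the following Python sketch evaluates a hypothetical service record against four checks that mirror the Production Readiness rules described above. The `ServiceRecord` fields, the rule labels, and the 90-day approximation of "three months" are assumptions made for this example; this is not Datadog's implementation or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional


@dataclass
class ServiceRecord:
    """Hypothetical, simplified view of a service; not a Datadog schema."""
    name: str
    slo_count: int = 0
    monitor_count: int = 0
    on_call: Optional[str] = None
    last_deploy: Optional[datetime] = None


def recently_deployed(svc: ServiceRecord, now: datetime) -> bool:
    # Assumption: "three months" is approximated as 90 days.
    return svc.last_deploy is not None and now - svc.last_deploy <= timedelta(days=90)


def production_readiness(svc: ServiceRecord, now: datetime) -> Dict[str, bool]:
    """Return a pass/fail result per rule (rule labels are illustrative)."""
    return {
        "SLOs Defined": svc.slo_count > 0,
        "Monitors Defined": svc.monitor_count > 0,
        "On-call Defined": bool(svc.on_call),
        "Recent Deployment": recently_deployed(svc, now),
    }


def percent_passed(results: Dict[str, bool]) -> float:
    """Percentage of rules that the service passes."""
    return 100.0 * sum(results.values()) / len(results)


# Example: a service with SLOs and an on-call responder, but no monitors
# and no deployment in the last 90 days.
now = datetime(2023, 5, 9)
checkout = ServiceRecord(
    name="checkout",
    slo_count=2,
    on_call="checkout-oncall",
    last_deploy=datetime(2023, 1, 2),
)
results = production_readiness(checkout, now)
score = percent_passed(results)
print(results, score)
```

Keeping each rule as a simple boolean makes the per-service percentage a matter of counting passes, which matches the pass-fail framing of the scorecards.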

The Observability Best Practices category helps teams track their adoption of Datadog monitoring capabilities. One rule in this category evaluates whether a team can correlate APM data with logs, which speeds up troubleshooting. Another checks whether the team uses Deployment Tracking to correlate performance issues with code deployments.

The screenshot below illustrates how engineering managers can track their teams’ scores by filtering the Scorecards view to show only relevant services. The “% of Services Passing” column shows the percentage of services in the shopist applications that are passing each Observability Best Practices rule. This can help managers ensure that all teams are consistently using Datadog features to get maximum visibility into their services.

The Observability Best Practices scorecard is filtered to show scores for services from the shopist applications. Two rules are listed, and only 13 percent of services are passing each rule.
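To make the "% of Services Passing" figure concrete, here is a minimal Python sketch that aggregates per-service pass-fail results into a per-rule percentage and lists the services failing a given rule. The service names, rule labels, and sample results are hypothetical, and the functions are illustrative helpers rather than part of any Datadog API.

```python
from typing import Dict, List


def percent_services_passing(
    results_by_service: Dict[str, Dict[str, bool]]
) -> Dict[str, float]:
    """For each rule, compute the percentage of services that pass it."""
    totals: Dict[str, int] = {}
    passes: Dict[str, int] = {}
    for rule_results in results_by_service.values():
        for rule, passed in rule_results.items():
            totals[rule] = totals.get(rule, 0) + 1
            passes[rule] = passes.get(rule, 0) + int(passed)
    return {rule: 100.0 * passes[rule] / totals[rule] for rule in totals}


def services_failing(
    results_by_service: Dict[str, Dict[str, bool]], rule: str
) -> List[str]:
    """Return the sorted names of services that do not pass the given rule."""
    return sorted(
        name
        for name, rule_results in results_by_service.items()
        if not rule_results.get(rule, False)
    )


# Hypothetical results for four services in one application.
sample = {
    "web-store": {"Logs correlated": True, "Deployment tracking": False},
    "checkout": {"Logs correlated": False, "Deployment tracking": False},
    "payments": {"Logs correlated": False, "Deployment tracking": True},
    "ads": {"Logs correlated": False, "Deployment tracking": False},
}
summary = percent_services_passing(sample)
failing = services_failing(sample, "Logs correlated")
print(summary)
print(failing)
```

A manager could use the per-rule percentages to spot the weakest practice across a group of services, and the failing list to see which teams to follow up with.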

Identify and prioritize improvements

Service Scorecards give teams an up-to-date assessment of their services, which they can use to identify, prioritize, and plan improvements. The screenshot below shows that only seven percent of services in the shopist applications have passed the Production Readiness category’s SLOs Defined rule.

The Production Readiness scorecard shows four rules. Seven percent of services are passing the On-call Defined rule, and no services are passing the other rules.

By clicking the scorecard, the service owner can see detailed information about any failed rules, plus relevant documentation to help their team make revisions to align with the conditions of the rule. In the screenshot below, the detail panel describes how SLOs help teams set clear targets for their service and ensure consistent performance. The service owner can click the “Get Started” button to immediately begin defining SLOs to adhere to the rule and improve their score.

The Production Readiness scorecard's detail panel shows that two teams are failing the SLOs Defined rule, and includes a description of SLOs and a link to get started creating them.

Datadog automatically updates your scores once a day, so your scorecards always highlight the areas where your team should focus to gain visibility and maximize the service’s performance, reliability, and availability. As teams make progress mitigating the observability gaps in their services, service owners can use Service Scorecards to share that progress with stakeholders across the organization.

Communicate with stakeholders

The Service Catalog helps service owners build trust with stakeholders by giving them insight into the health and performance of their services. Service Scorecards extend this transparency by letting stakeholders identify actionable areas of improvement for these services, such as observability gaps or failure to adhere to standard team processes. This information can help organizations recognize risks and prioritize the work necessary to remediate issues.

For example, an engineering manager can track the Service Scorecards of teams in their area to verify that they have created SLOs and designated on-call incident responders for their services. Based on that information, that manager may direct teams with failing scores to prioritize remediation over feature development until they have improved their services to comply with those rules.

Rules in the Ownership and Documentation category evaluate whether a team's code repositories, documentation, runbooks, and dashboards are easily accessible to other teams whose services depend on theirs. By verifying that this information is available, these rules (together with the Production Readiness rule that checks for a designated on-call responder) help teams hold themselves accountable to other teams. The screenshot below shows that the team that operates the shopist-checkout application has not provided documentation for their services. This communicates to stakeholders a risk that can hamper collaboration and highlights an improvement that the team can make.

The Ownership and Documentation scorecard shows four rules, and the Contacts Defined rule has a score of zero percent.
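A check of this kind can also report what is missing, not just whether a rule passes. The short Python sketch below takes a hypothetical list of links attached to a service and returns the resource types that have no usable URL. The link structure, the `REQUIRED_LINK_TYPES` tuple, and the example URLs are assumptions for illustration and are not taken from a Datadog schema.

```python
from typing import Dict, List

# Assumption: the link types this sketch looks for. They mirror the resources
# named in the text and are not taken from a Datadog schema.
REQUIRED_LINK_TYPES = ("repo", "doc", "runbook", "dashboard")


def missing_links(links: List[Dict[str, str]]) -> List[str]:
    """Return the required link types that have no non-empty URL."""
    present = {link.get("type") for link in links if link.get("url")}
    return [t for t in REQUIRED_LINK_TYPES if t not in present]


# Hypothetical metadata for one service.
checkout_links = [
    {"type": "repo", "url": "https://example.com/shopist/checkout"},
    {"type": "dashboard", "url": "https://example.com/dashboards/checkout"},
    {"type": "doc", "url": ""},  # an empty URL counts as missing
]
gaps = missing_links(checkout_links)
print(gaps)  # ['doc', 'runbook']
```

Returning the list of gaps rather than a single boolean gives the owning team a direct list of what to add.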

The organization can track scorecards in these categories to understand how efficiently teams can troubleshoot and what remediations they can make to improve their incident response and minimize mean time to resolution (MTTR).

Maintain visibility and track improvement with Service Scorecards

By allowing service owners, engineers, managers, and other stakeholders in your organization to continually track service observability, Service Scorecards help teams improve the performance, reliability, and availability of their services.

Service Scorecards are now in private beta, and you can contact support to request to have them enabled for your organization. See the documentation to learn more about Service Scorecards and the Service Catalog. And if you're not yet using Datadog, you can start right away with a free trial.