Moving from a monolith to microservices lets you simplify code deployments, improve the reliability of your applications, and give teams autonomy to work independently in their preferred languages and tooling. But adopting a microservices architecture can bring increased complexity that leads to gaps in your team members’ knowledge about how your services work, what dependencies they have, and which teams own them.
The Datadog Service Catalog helps you consolidate knowledge of your organization’s services and streamline communication between SREs, application developers, and service owners. The catalog is automatically populated for APM customers and takes only seconds to import services from other Datadog telemetry. Taken together, this information helps you simplify service governance and improve the discoverability and reliability of your applications.
In this post, we’ll demonstrate how the Service Catalog helps you improve how you organize and communicate information about your services, manage incidents that impact services and their downstream dependencies, and confidently deploy code updates to your services.
When domain expertise about applications is distributed across teams, it can be hard to locate the people and information you need to resolve issues. And if deep knowledge of an application is held by a single tenured engineer, teams may have difficulty getting the answers they need to operate and troubleshoot their services. By providing a central repository of what your organization knows about its services, the Service Catalog helps engineering teams coordinate high-priority activities like incident management as well as the day-to-day ownership and operation of their services.
The Service Catalog provides key data—such as documentation, deployment history, and code libraries—that can help bridge the knowledge gap between new engineers and their experienced teammates, allowing them to gain an understanding of their services and start contributing right away. The Service Catalog also helps you plan and implement enhancements and bug fixes by making it easy for independent teams to grasp the dependencies among services without having to rely on subject matter experts, spreadsheets, or unnecessary meetings.
The Service Catalog automatically includes applications that you’ve instrumented for APM or Universal Service Monitoring, as well as RUM applications and services that run on tagged infrastructure components. Without adding any data manually, you’ll see the dependency map (shown in the screenshot above) plus performance data such as requests, errors, and latency. You’ll also see information about the service’s reliability—including SLOs, incidents, and deployment history.
Even when a service doesn’t emit any Datadog telemetry that would let the Service Catalog track it automatically, you can register it manually via the Datadog API. You can use the same API to enrich a service’s definition with owner contact information and links to resources such as code repositories and documentation. Alternatively, you can manage your services’ definitions through our source code integration, which lets you see and edit your services’ YAML files within Datadog.
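As a rough illustration of what enriching a definition might involve, the sketch below assembles a service definition as a Python dictionary and serializes it to JSON. The field names (`schema-version`, `dd-service`, `team`, `contacts`, `links`, `repos`) are an assumption modeled on the v2 service definition schema, and the service name, email address, and URLs are hypothetical. Check the Service Catalog docs for the schema version your account uses before submitting anything to the API.

```python
import json


def build_service_definition(name, team, contact_email, repo_url, runbook_url):
    """Assemble a service definition payload.

    Field names are an assumption modeled on the v2 service definition
    schema; confirm them against the Service Catalog docs.
    """
    return {
        "schema-version": "v2",
        "dd-service": name,
        "team": team,
        "contacts": [
            {"type": "email", "name": f"{team} on-call", "contact": contact_email},
        ],
        "links": [
            {"name": "Runbook", "type": "runbook", "url": runbook_url},
        ],
        "repos": [
            {"name": "Source code", "provider": "github", "url": repo_url},
        ],
    }


# Hypothetical service, contact, and URLs, for illustration only.
definition = build_service_definition(
    name="checkout-api",
    team="payments",
    contact_email="payments-oncall@example.com",
    repo_url="https://example.com/repos/checkout-api",
    runbook_url="https://example.com/runbooks/checkout-api",
)

# The JSON body you would submit when registering the service.
payload = json.dumps(definition, indent=2)
print(payload)
```

The sketch deliberately stops short of sending a request, so you can review the payload (or keep the equivalent YAML file in version control next to the service’s code) before registering it.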
The Service Catalog lets on-call engineers look up contact information for other services, including PagerDuty schedules, so they can quickly reach the right people during a live incident. It also gives them critical and current information about each service: performance metrics, dependencies, SLOs, underlying infrastructure, documentation, and runbooks. New engineers who are on call may need to quickly learn information that hasn’t been documented, and even experienced incident responders need resources that allow them to confidently investigate and fix problems. The Service Catalog helps on-call engineers reduce incident response time by providing authoritative information, such as code repositories and relevant libraries, within their unified observability platform.
A failing dependency is sometimes the key to understanding an incident, and the Service Catalog’s dependency map makes it easy to see whether a service’s dependencies are in an alert state. You can drill down to get more information about each service—including performance metrics, error rates, and SLO status—as well as recent activity like deployments and incidents. All this information helps you fully understand the state of the services you depend on. By providing at-a-glance status and performance information about your organization’s services—as well as service owner contact information and PagerDuty on-call schedules—the dependency map brings deep context to your troubleshooting and incident management processes.
Successful teams deploy code updates frequently, fixing bugs to maximize reliability and performance and adding features to improve customer satisfaction. But even if you release service updates carefully, each deployment presents a risk to the performance and availability of that service and all the services that depend on it. The Service Catalog makes it easy to see which services call yours so you know the potential blast radius of a faulty deployment.
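To make the idea of a blast radius concrete, here is a minimal sketch that computes it from a dependency map. The `calls` dictionary and the service names are hypothetical, and this is not how the Service Catalog computes anything internally. It only illustrates the underlying reasoning: invert the "who calls whom" graph and walk upstream from the service you are about to deploy.

```python
from collections import deque


def blast_radius(calls, service):
    """Return every service that directly or transitively calls `service`."""
    # Invert the "caller -> callees" map into "callee -> callers".
    callers = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    # Breadth-first walk upstream from the service being deployed.
    affected = set()
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller not in affected and caller != service:
                affected.add(caller)
                queue.append(caller)
    return affected


# Hypothetical dependency map: each key calls the services in its list.
calls = {
    "web-frontend": ["checkout-api", "search-api"],
    "checkout-api": ["payments-db", "inventory-api"],
    "search-api": ["inventory-api"],
    "inventory-api": ["inventory-db"],
}

print(sorted(blast_radius(calls, "inventory-api")))
# → ['checkout-api', 'search-api', 'web-frontend']
```

In this example, a faulty deployment of `inventory-api` could affect both of its direct callers and, through them, `web-frontend`, even though `web-frontend` never calls `inventory-api` directly.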
To help teams collaborate to ensure smooth deployments, you can add service owner contact information to each Service Catalog entry—including a link to the relevant team’s Slack channel. The Service Catalog makes it easy for team members to reach out for ad-hoc communication before or during a deployment. It also shows them performance and reliability information about services, giving all parties a single source of truth to boost collaboration and help troubleshoot any unexpected behavior.
The Service Catalog provides information that’s critical to stakeholders throughout the organization—not just service owners. Although teams outside of engineering may not need the same information as the engineers who operate services and resolve incidents, the Service Catalog lets them independently gather key data to answer important questions.
For example, during an active incident, the Service Catalog helps you ensure that stakeholders throughout the organization—such as support teams and technical account managers—have easy access to the key information they need to manage communication. The Service Catalog can also help you coordinate game days and other reliability exercises by providing contact information for the owners of all relevant services and endpoints.
The Service Catalog brings a high-level view into your organization’s services and can be a vital tool for engineering leadership to recognize observability blind spots. By centralizing all of your service knowledge, the Service Catalog helps engineering managers spot services that aren’t instrumented for tracing or profiling, that don’t generate sufficient logs, or that have been orphaned without being properly deprecated. Engineering managers can also use data from the Service Catalog to check that reliability practices are applied consistently across services, including SLO status, deployment frequency, PagerDuty on-call coverage, correlation between logs and traces, and integration between your APM and RUM data.
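The kind of coverage check described above can be sketched as a simple audit over a list of service records. The record shape here (`name`, `team`, `oncall`, `tracing_enabled`) is hypothetical and is not a Datadog export format. The point is only to show how centralized service metadata makes gaps easy to find.

```python
def find_gaps(services, required=("team", "oncall", "tracing_enabled")):
    """Map each service name to the required fields that are missing or falsy."""
    gaps = {}
    for svc in services:
        missing = [field for field in required if not svc.get(field)]
        if missing:
            gaps[svc["name"]] = missing
    return gaps


# Hypothetical catalog entries, for illustration only.
services = [
    {"name": "checkout-api", "team": "payments",
     "oncall": "payments-primary", "tracing_enabled": True},
    {"name": "legacy-batch", "team": None,
     "oncall": None, "tracing_enabled": False},
    {"name": "search-api", "team": "discovery",
     "oncall": None, "tracing_enabled": True},
]

for name, missing in find_gaps(services).items():
    print(f"{name}: missing {', '.join(missing)}")
```

A service with no owner, no on-call coverage, and no tracing (like `legacy-batch` in this example) is a likely candidate for either adoption by a team or formal deprecation.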
The Service Catalog is a powerful way to collect knowledge of your services within the Datadog platform, helping you eliminate knowledge gaps even as your environment scales to hundreds of services. The Service Catalog is generally available and free to use for all Datadog customers. You can define your services to add ownership and resource information, and if you’re using APM or Universal Service Monitoring, your catalog will be populated automatically and service dependencies will be mapped out of the box. See the Service Catalog docs to create and enrich your own Service Catalog, and if you’re not already using Datadog, you can start today with a 14-day free trial.