2023-03-08 Incident: A Deep Dive into Our Incident Response
This post sketches out our incident response process, where it succeeded and where it stumbled on March 8, and what we learned along the way.
A deep dive into what happened at the platform level during the outage of March 8, 2023.
Learn how we developed a new scheduling algorithm for data fetching and rendering and how we built it for use across our suite of Datadog products.
A closer look at storage routing in Husky, Datadog's third-generation event storage system.
We’ve recently improved the raw performance of the Datadog Agent, leading to 20% less CPU use on Agents flooded with custom metrics.
Learn about Datadog's repeatable design elements, which we've documented in DRUIDS, our design style guide.
Husky is an unbundled, distributed, schemaless, vectorized column store. Here's how we built it—and why.
Employees at modern software companies rely on many third-party software tools to do their jobs. Learn how Datadog's IT team expanded Clarity to automate monitoring these accounts for inactivity and optimize how much we spend on them.
The story of a seemingly simple issue that led us into the hidden complexities of gRPC, DNS, and Kubernetes.
See Datadog's proof of concept exploit for breaking out from unprivileged containers using the Dirty Pipe vulnerability.
How several patches and fixes in Go 1.18 bring improved profiling accuracy.
How the Datadog DesignOps team uses Datadog to understand our users and make well-informed design decisions.