Dash 2018: Inspiring talks, new features, and a great community
The very first Dash conference is a wrap! If you were in attendance, we thank you for joining us in NYC for our inaugural conference. If you weren’t there, we hope you’ll join us next year. Here’s a recap of some of the highlights from the two-day event.
Datadog CEO and co-founder Olivier Pomel kicked off the Dash keynotes with some Datadog history. Olivier recounted how at a previous company, he had run the development team, while Datadog co-founder and CTO Alexis Lê-Quôc ran the operations team. They found themselves asking why, despite their best efforts, the two teams couldn’t always see eye to eye—which became the starting point for Datadog. “We started Datadog to bring dev and ops together, to give them a single view of their systems, and to give them a common language so that they could actually work properly together,” Olivier said. “Another way to say it is, we started Datadog to break down silos. And that’s still our mission today.”
The Datadog product teams announced several new features at Dash, often joined onstage by customers such as Square, Zendesk, and Airbnb, who were beta users or key partners in developing them:
- Watchdog: an auto-detection engine that uses sophisticated machine learning algorithms to identify real problems and potential root causes in your applications—without any setup or configuration
- Trace Search & Analytics: an interface for drilling down into performance data and trace events using high-cardinality attributes—so you can find traces from any endpoint, availability zone, user ID, product SKU, and so on
- Logging without Limits: a set of features that enable you to cost-effectively collect, inspect, and archive all your logs with dynamic indexing, global live tail, and archiving to cloud storage
Observability concerns = business concerns
James Turnbull, CTO at Empatico and a prolific author and open source contributor, also started his talk with a bit of recent history. “Back in the day, IT wasn’t necessarily a critical component in a business,” James said. Nowadays, IT is not only a revenue center—it’s a competitive differentiator. “Being better at product—being better at IT, being better at shipping product, having higher availability and better performance—actually matters from a competition standpoint,” James argued. To that end, you should approach observability as a business concern—ensuring good service for customers by monitoring your systems from end to end, using multiple data sources such as metrics, logs, traces, and data from configuration management tools. “Tools give you data that tells you information,” James said. “Data with context actually gives you answers.”
Lessons learned in the most unforgiving environment
To close out the keynote presentations, former NASA flight director Paul Hill shared insights from how NASA runs reviews of all operations—not just incidents, but successful flights and simulated missions as well. “At the end of this, every time we do it, no matter what level of training it is or what we’re preparing for, we pull the team together and we debrief,” Paul explained. “This is where we start our lessons-learned activities.”
Regardless of the type of event, the NASA team poses the following questions: What went well? What didn’t go well, and why not? Where did we get lucky? What still needs work or improvement? “Throughout all of it, the focus is on the team being better the next time we show up,” Paul said. He noted that the debrief culture at NASA requires full transparency and an “all-in” approach: everyone who participated in the event, made a decision, or recommended a decision is required to be at the review, and everyone in the larger organization is invited to attend and participate as well.
In the performance track, DraftKings CTO Travis Dunn (video) explained how his company used circuit breakers to prevent non-critical systems and minor features from causing outages as it moved to a complex, microservices-based architecture. Brian Lucas of Optimizely (video) explained how leading tech companies are using experimentation built into CD to efficiently test releases before making them generally available. Phil Calçado of Meetup.com explored how the challenge of debugging applications has changed in a highly distributed world, drawing on war stories amassed over the previous 10 years. Finally, Tiffany Low and Willie Yao of Airbnb (video) shared the story of how Airbnb failed to migrate to microservices on their first attempt: at the time, they didn’t have an urgent need, so the costs outweighed the benefits. As they scaled their developer team, however, deployments slowed and the migration became necessary. The lessons from their first attempt helped them successfully migrate to microservices when it mattered most.
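For readers unfamiliar with the circuit breaker pattern mentioned in the first talk, here is a minimal Python sketch of the general idea. It is not DraftKings’ implementation; the names `CircuitBreaker`, `CircuitOpenError`, and `with_fallback`, along with the thresholds, are hypothetical and chosen only for illustration. After a run of consecutive failures, the breaker stops calling the dependency for a cool-down period so that a failing non-critical feature degrades to a fallback instead of dragging down the caller.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    """Illustrative breaker: opens after `max_failures` consecutive failures."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock              # injectable so the sketch is testable
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: fail fast without touching the dependency.
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def with_fallback(breaker, func, fallback):
    """Return func() through the breaker, or `fallback` if it fails or is skipped."""
    try:
        return breaker.call(func)
    except Exception:
        return fallback
```

A caller might wrap a non-critical feature, such as a recommendations widget, with `with_fallback(breaker, fetch_recommendations, [])` so that the page still renders when that service is down. Here `fetch_recommendations` is likewise a hypothetical function.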
In the scalability breakout group, we heard from Aaron Brady of Shopify (video), who spoke about the unique challenges in building and scaling Shopify, their heavy use of MySQL and database sharding, and how they worked to simplify customer onboarding. Johan Mjönes from EA DICE (video) described how his organization manages AAA game launches, which require quickly scaling from zero to millions of users. Nick Vecellio of Wayfair (video) described how the company started making use of real-time BI data to accelerate incident response. Anatoly Mikhaylov and Daniel Rieder from Zendesk (video) showed how they used Datadog APM to discover sources of massive infrastructure costs that would otherwise be hidden in error messages. And Rob Desjarlais from Liberty Mutual (video) covered lower-level tools that can be used to monitor busy systems, and showed how default networking memory configurations can come back to bite you.
In the third breakout group, the teams track, Segment’s Calvin French-Owen (video) mapped out patterns for effective teamwork during outages. Stacy Gorelik of Flatiron Health (video) recounted her experience and lessons learned building out a platform organization at Flatiron (and made incredible use of Legos for storytelling). Kristina Bennett and Liz Fong-Jones of Google (video) walked through how you can use service level objectives and error budgets to balance the competing concerns of high availability and rapid feature development. Intercom’s Brian Scanlan (video) explained how Intercom transitioned their on-call rotation to a virtual team of volunteers in an effort to improve the quality of on-call as well as the quality of life for engineers. Finally, James Burns and Bruce Wong from Stitch Fix (video) spoke about implementing the famed Chaos Monkey, as well as best practices for chaos engineering generally.
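To make the error budget idea from the service level objectives talk concrete, here is a small Python sketch of the usual arithmetic. It is not taken from the talk; the function name `error_budget`, the dictionary keys, and the sample numbers are assumptions for illustration. An SLO target of 99.9% over a window implies that up to 0.1% of requests may fail, and the unspent part of that allowance is what a team can "spend" on risky releases.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Summarize an error budget for one measurement window.

    slo_target is the fraction of requests that must succeed, e.g. 0.999.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # Failures the SLO tolerates in this window.
    allowed_failures = (1 - slo_target) * total_requests
    remaining = allowed_failures - failed_requests

    return {
        "allowed_failures": allowed_failures,
        "remaining": remaining,
        "budget_spent": failed_requests / allowed_failures,
        # One simple policy (an assumption): keep shipping features only
        # while some budget is left; otherwise prioritize reliability work.
        "can_release": remaining > 0,
    }


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 250 failures, 99.9% SLO.
    print(error_budget(0.999, 1_000_000, 250))
```

Under this policy, a team that has used up its budget pauses feature launches until reliability recovers, which is one way to balance availability against rapid feature development.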
See you in 2019!
We were thrilled to hear your stories at Dash, to share in your knowledge, and to get to know the members of this incredible community a bit better. We’re already making plans for next year—we hope you’ll join us for an even bigger and better Dash in 2019!