Supporting Platform Reliability and Team Growth
As HashiCorp prepared for Terraform Enterprise’s public release in 2017, they began to focus on growing the team that develops and supports the product’s platform. But as the organization evolved from a core group of generalist engineers to one with more specialized teams, HashiCorp found it difficult to share tribal knowledge about their system and its interdependencies with new team members. “Ironically, in a company whose philosophy is all around a healthy devops culture, we were seeing patterns that tend to emerge when you have limited communication across differential expertise,” says HashiCorp Director of Infrastructure Paul Hinze. “We were starting to feel an application team split that none of us liked, and we started to self-diagnose.” Ultimately, Hinze says, the team concluded that “this is in some sense a tooling problem—this is a visibility problem.”
Poor Interface, Poor User Adoption, Poor Visibility
The lack of visibility was attributed to the poor usability of the self-hosted monitoring tools that HashiCorp was using at the time, which left engineers ill-equipped to effectively troubleshoot issues or get real-time feedback on new product features. “We essentially had only one or two people interested in learning Graphite well enough to get data out of it,” says Hinze. The limited access to real-time monitoring and alerting hindered the team’s responses to issues, causing unnecessary delays in problem diagnosis and resolution. Without the ability to track and compare current and historical states, troubleshooting became a reactive, time-consuming, and tedious task. Hinze recalls investigating one problem by manually comparing screenshots of logs on his desktop. “There were often times when we could have solved problems in a more efficient way,” he says. “These metrics were just not close at hand as a problem-solving tool—and that’s what we wanted to change.”
For HashiCorp to grow their team, monetize their offering, and support their enterprise customers, they needed organization-wide visibility into their platform and underlying systems.
“Our biggest concern as our team grew was time to diagnose issues.”
Director of Infrastructure, HashiCorp
Accessible Monitoring Provides a Clear Understanding of HashiCorp’s Platform
Datadog provided HashiCorp with the visibility they needed to maintain application and system health, and offered a user-friendly platform that made these insights accessible across their organization. For Matt McQuillan, a HashiCorp SRE, the change was palpable: “It’s the difference of going to this weird IP address with an older interface and figuring it out for yourself, versus Datadog, which is more intuitive to use and easier to get to.” Now, instead of operational visibility being limited to one or two monitoring experts, dozens of team members have ready access to the data they need to rapidly troubleshoot performance problems or test new features. With 600+ built-in integrations, connections to HashiCorp’s Terraform and Nomad products, and the ability to pull performance data directly from their application, Datadog provides HashiCorp with an easy-to-understand, cohesive view of their internal and customer-facing systems.
“We liked Datadog’s ease of use and wide industry adoption, but the integrated APM features really helped seal the deal for us.”
Site Reliability Engineer, HashiCorp
A Single Interface for Troubleshooting
Datadog’s intuitive platform provided HashiCorp with a rich out-of-the-box toolset for identifying and addressing issues. HashiCorp’s entire team had immediate access to dashboards for Amazon ECS, RDS, and other critical infrastructure components based on Datadog’s vendor-supported AWS integrations. And with just a few lines of code, HashiCorp was able to instrument their Ruby on Rails application for Datadog’s application performance monitoring (APM). With APM, HashiCorp can trace individual requests and gather detailed performance metrics from their application, then seamlessly pivot between code- and system-level views to resolve issues. The addition of log management on the Datadog platform further enriches the problem-solving insights available to HashiCorp in a single user interface. “Having application logs in Datadog allows for a total view of application health,” McQuillan says, enabling his teams to even more rapidly remediate issues. “It’s about diagnosing the root cause rather than just blindly scaling up,” Hinze adds. “Datadog has given us the ability to quickly isolate the problem path.”
Lower Latency, Faster Page Load, Better Customer Experience
HashiCorp has clear service level objectives (SLOs) designed to help them maintain the performance of their platform and health of their business. These SLOs are based on specific throughput and latency metrics, which are aggregated in Datadog to indicate the quality of service they are providing to customers. “Datadog has provided a great mechanism to pull in these stats and display them in a friendly way for our organization,” Hinze says. Datadog dashboards and alerts allow HashiCorp to maintain fast response times in their application, ensuring that Terraform Enterprise users are able to update and provision their infrastructure without delay. Should any slowdowns arise, log files and application traces from the relevant time frames can be automatically retrieved and correlated to quickly identify and resolve the issue.
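As a rough illustration of how a latency-based SLO can be derived from request data, the sketch below computes the share of requests that succeeded within a latency threshold and compares it with a target. This is not HashiCorp's actual SLO definition and does not use Datadog's API; the `Request` type, the function names, the 500 ms threshold, and the sample values are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Request:
    """One observed request (hypothetical shape, for illustration only)."""
    latency_ms: float
    ok: bool


def latency_sli(requests: Sequence[Request], threshold_ms: float) -> Optional[float]:
    """Fraction of requests that succeeded within threshold_ms.

    Returns None when there are no requests, since the ratio is undefined.
    """
    if not requests:
        return None
    good = sum(1 for r in requests if r.ok and r.latency_ms <= threshold_ms)
    return good / len(requests)


def meets_slo(requests: Sequence[Request], threshold_ms: float, target: float) -> bool:
    """True when the measured indicator is at or above the target ratio."""
    sli = latency_sli(requests, threshold_ms)
    return sli is not None and sli >= target


def throughput_rps(requests: Sequence[Request], window_seconds: float) -> float:
    """Average requests per second over the observation window."""
    return len(requests) / window_seconds


# Hypothetical sample: two fast successes, one slow success, one failure.
sample = [
    Request(latency_ms=120, ok=True),
    Request(latency_ms=80, ok=True),
    Request(latency_ms=900, ok=True),
    Request(latency_ms=50, ok=False),
]

print(latency_sli(sample, threshold_ms=500))
print(meets_slo(sample, threshold_ms=500, target=0.99))
print(throughput_rps(sample, window_seconds=2))
```

In this sketch a failed request counts against the indicator even when it returns quickly, which is one common convention; a team could equally track errors and latency as separate objectives.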
Improving Communication and Collaboration During Active Incidents
When it comes to addressing a live incident, Datadog makes it easy for HashiCorp’s remote workforce to collaborate using real-time data. Using Datadog’s built-in integrations with Slack and other communication tools, engineers can share graph snapshots and links to dashboards or request traces. This seamless collaboration allows teammates to investigate issues together and improve MTTR. “The handoff of context is huge,” McQuillan says. “It’s the ability to pass a link to an application developer, and know they can see exactly what I’m seeing.” Engineers can then use Datadog APM to delve into traces from specific request types or database queries, as well as the associated log entries, to diagnose and remediate the root cause of any errors or latency.
“APM’s been a real game changer for us in terms of troubleshooting.”
Director of Infrastructure, HashiCorp
Accelerating Innovation with a Single Platform for Monitoring
Datadog provided immediate value to HashiCorp as a robust, easy-to-use platform for troubleshooting, but this is just the foundation for future product development and collaboration between HashiCorp team members. “Datadog has been instrumental in bringing the engineering and operations teams together to view and discuss the health of our application,” McQuillan says. “It has already helped us diagnose application issues, and we look forward to helping our application team use it as feedback to their code changes in the future.” By combining insights from metrics, traces, and logs, Datadog is now the central monitoring platform on which HashiCorp will manage and grow Terraform Enterprise.