CASE STUDY

Improving Staff Productivity by Providing Developers with a Workflow-Oriented Operational Monitoring System

Learn how SimpleReach correlates metrics and events in Datadog to get actionable insights into the cause and effect of changes

About SimpleReach

Acquired by Nativo in 2019, SimpleReach is a data platform that enables businesses to measure and optimize content distribution by tracking engagement, social activity, and other key metrics. SimpleReach uses predictive analytics to project the popularity of content across different platforms, enabling marketers to develop more successful content strategies.


Key Results

8 billion

Number of daily interactions that SimpleReach tracked with Datadog, allowing them to get actionable insights for improving operations.

15 minutes

SimpleReach’s team got Datadog up and running in a matter of minutes.

300

Number of AWS instances SimpleReach was able to monitor with Datadog, while still easily identifying problems with individual machines.


Challenge

As SimpleReach’s platform grew, teams began spending more time tracking and comparing performance metrics during infrastructural updates. Their existing open source monitoring tools created a disconnect between development and operations teams, making it difficult to assess the performance implications of frequent changes in the production environment.


Why Datadog?

Datadog Infrastructure Monitoring enabled SimpleReach’s teams to get a unified view of real-time changes across their operating environment. Additionally, because events and metrics were automatically correlated in Datadog, engineers could get meaningful context for troubleshooting. With Datadog’s flexible dashboards and alerts, SimpleReach was able to simplify code deployment processes and improve DevOps productivity.


Successful content strategies require actionable insights, and providing those insights has made SimpleReach the standard in content measurement and distribution. The company’s solution gives any organization the means to measure and optimize content distribution by offering real-time visibility and detailed historical reporting into how content performs across a wide range of metrics, such as reach, engagement and social activity.

With insight into which content drives conversions, SimpleReach programmatically amplifies the right content to targeted audiences across channels such as Facebook, Twitter, LinkedIn, Outbrain, Nativo, StumbleUpon and TripleLift. Currently tracking and processing more than 8 billion content interactions in real time daily, the company works with leading publishers such as The New York Times, Forbes, The Huffington Post, and Fortune 500 companies including Intel and SAP, as well as startups and mid-sized marketers.

The SimpleReach platform measures the value of content by using predictive analytics to calculate a holistic score that predicts the popularity of articles and other content, including syndicated and sponsored content. The metrics and algorithms used enable the system to deliver 95 percent accuracy over a 60-to-90-minute window.

The Need: Incorporating Ops Insights Into a Developer’s Workflow

SimpleReach’s platform runs within an Amazon Web Services (AWS) environment. Using this cloud infrastructure service has made it possible for SimpleReach to operate effectively with a small team relative to the size of the actual infrastructure. Just nine people were able to grow the platform to support 240 servers handling 8 billion interactions per day.

Eric Lubow, who was CTO of SimpleReach at the time, noticed that his team was spending an increasing amount of time tracking and comparing performance metrics when updates occurred. This analysis was necessary because symptoms of performance issues would appear as soon as the staff added or removed servers or made other infrastructure changes. The team needed to assess the performance implications (positive or negative) before continuing. When the environment grew by over an order of magnitude, from dozens to hundreds of servers, Lubow knew that the team’s original monitoring tools and processes would have to change.

“This is how easy it should be for developers to gain meaningful insight into how changes in software can impact operations. It’s also one of the easiest set-ups I’ve ever done, as I was able to get the entire Datadog system up and running in only about 15 minutes.”

Eric Lubow
CTO, SimpleReach

The underlying problem was a familiar one: a disconnect between development and operations. “The developers didn’t realize how the changes they were making were affecting the production environment,” Lubow recalls. “Some of the impacts were significant, and the need for frequent changes in both application and system software was making the situation untenable.”

Nagios and other open source tools had proven too rigid and incomplete for the team’s needs, which motivated Lubow to evaluate several commercial infrastructure monitoring solutions. Unfortunately, none of them fulfilled the organization’s needs either. “All of these solutions monitored the operating environment as intended, but not one of them gave developers the actionable insights and process-oriented tools they needed to improve their workflow,” noted Lubow.

Lubow started to believe that he might be forced to build a custom monitoring tool so that the entire team would have the information they needed to streamline and accelerate development of SimpleReach’s scalable systems. But Lubow also knew that doing so would take time and effort away from expanding and enhancing the SimpleReach platform and scaling the infrastructure as the customer base quickly grew.

When Lubow discovered Datadog, he knew he had found exactly what he wanted in a monitoring tool. “This is how easy it should be for developers to gain meaningful insight into how changes in software can impact operations,” says Lubow. “It’s also one of the easiest set-ups I’ve ever done, as I was able to get the entire Datadog system up and running in only about 15 minutes.”

Insight Into How Development Changes Are Impacting Operations

Because SimpleReach uses a home-grown code deployment system, it was necessary to configure Datadog to monitor all of the changes being made. Datadog’s flexibility and ease of use substantially simplified the effort required to translate the code deployment processes into the custom data streams needed for effective monitoring.

One of the Datadog capabilities Lubow values most is the way it anticipates how the operating environment is likely to change during the development phase of a project: “This insight places the onus of operations partially onto the development team, which is exactly the way it should be. The development team now routinely uses the Datadog API to capture pertinent events and custom metrics, and this has enabled the kind of workflow-oriented organizational change we needed to improve DevOps productivity.”
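As a rough illustration of what capturing events and custom metrics through the Datadog API can look like from a deployment script, the following Python sketch builds the request bodies and posts them over HTTP. It is not SimpleReach’s code. It assumes the v1 HTTP endpoints (`events` and `series`) on the default US site and the `DD-API-KEY` header, and the service, host, and metric names are hypothetical, so check the current API reference before relying on any of it.

```python
import json
import time
import urllib.request

# Assumption: default US site and v1 API; other Datadog sites use other hosts.
DATADOG_API = "https://api.datadoghq.com/api/v1"


def build_deploy_event(service, version, host):
    """Build the JSON body for a deployment event (all names are hypothetical)."""
    return {
        "title": f"Deployed {service} {version}",
        "text": f"{service} {version} deployed to {host}",
        "tags": [f"service:{service}", f"version:{version}", f"host:{host}"],
        "alert_type": "info",
    }


def build_gauge(metric, value, tags, timestamp=None):
    """Build the JSON body for a single custom gauge point."""
    ts = int(timestamp if timestamp is not None else time.time())
    return {
        "series": [
            {"metric": metric, "points": [[ts, value]], "type": "gauge", "tags": tags}
        ]
    }


def post(path, body, api_key):
    """POST a JSON body to the API and return the HTTP status code."""
    request = urllib.request.Request(
        f"{DATADOG_API}/{path}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "DD-API-KEY": api_key},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

A deployment script could then call `post("events", build_deploy_event("api", "1.2.3", "web-01"), api_key)` at the end of each release, and `post("series", build_gauge("deploy.duration", 42.0, ["service:api"]), api_key)` to record a custom metric alongside it.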

Correlating Seemingly Disparate Events and Metrics to Reveal the Effects of Changes

As SimpleReach’s environment scales and becomes more complex, the rate of change in the environment continuously accelerates. There are new versions of software and new or updated applications being deployed multiple times per day. Nearly every intended change seems to precipitate one or more unanticipated and often undesirable changes somewhere else. This is why Lubow also values the way that Datadog correlates different events and metrics to provide actionable insight: “In effect, Datadog correlates the cause and effect of changes so that we can see what’s happening at a high level very quickly.”

The correlation of seemingly disparate data points provides essential context and precision across different architectural elements and timeframes, which has dramatically enhanced the development team’s troubleshooting effectiveness. “Finding the root cause of a problem used to take hours and in some cases days. But with Datadog, more often than not, we can now pinpoint the cause or causes in minutes,” adds Lubow.
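To make the idea of correlating events with metrics concrete, here is a minimal Python sketch that ranks change events by how much a metric shifted around each one. It is an illustration under simple assumptions (a fixed time window and a plain difference of means), not a description of how Datadog performs correlation, and the function and event names are hypothetical.

```python
from statistics import mean


def shift_after_event(points, event_ts, window):
    """Return mean(after) - mean(before) for samples within `window` seconds.

    `points` is a list of (timestamp, value) pairs. Returns None when either
    side of the event has no samples.
    """
    before = [v for t, v in points if event_ts - window <= t < event_ts]
    after = [v for t, v in points if event_ts <= t < event_ts + window]
    if not before or not after:
        return None
    return mean(after) - mean(before)


def rank_events(points, events, window=300):
    """Sort (name, timestamp) events by the size of the metric shift around them."""
    scored = []
    for name, ts in events:
        shift = shift_after_event(points, ts, window)
        if shift is not None:
            scored.append((name, shift))
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)
```

Given a latency series and a list of deployments, the event at the top of the ranking is the first candidate to inspect. A large shift suggests a cause but does not prove one.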

Identifying Individual Instances Experiencing Problems

Lubow and others have observed that, after deploying about 250 – 300 AWS instances, a tipping point seems to occur where at least one instance will experience a problem even with a seemingly minor change to the operating environment. According to Lubow, “We’ve now reached that point, and Datadog has made it very easy for us to find and fix any machine that is behaving abnormally. This saves us a lot of time and effort, and also helps maximize our resource utilization on a machine and personnel level.”
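One common way to single out a machine that is behaving abnormally within a large fleet is to compare each host against the fleet median. The Python sketch below uses the median absolute deviation (MAD) for that purpose. It is a generic technique shown for illustration, not Datadog’s detection logic, and the instance names and threshold are hypothetical.

```python
from statistics import median


def abnormal_hosts(values_by_host, threshold=3.0):
    """Return hosts whose value is more than `threshold` MADs from the fleet median."""
    if not values_by_host:
        return []
    values = list(values_by_host.values())
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # At least half the fleet reports the same value; flag anything that differs.
        return sorted(h for h, v in values_by_host.items() if v != center)
    return sorted(
        h for h, v in values_by_host.items() if abs(v - center) / mad > threshold
    )
```

For example, with CPU readings of 41, 39, 40, 42, and 97 percent across five instances, only the instance reporting 97 percent is flagged.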

Dramatically Improves Both Developer and Operator Productivity

Improving productivity is important in any organization, but is perhaps even more so at SimpleReach. “Things can become pretty hectic here whenever something major happens somewhere in the world,” noted Lubow, recalling how traffic grew by four times after the Boston Marathon bombing and spiked by a factor of six when Robin Williams passed away.

This aspect of the company’s business is why Lubow has taken advantage of another Datadog feature by setting an alert for any uptick in traffic. He also now looks at Datadog dashboards regularly throughout the day to monitor status, and plans to put them up on the IT department’s “big screen” so that the entire staff can benefit from the insight afforded by Datadog.
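The logic behind an alert on a traffic uptick can be sketched in a few lines of Python: compare each new sample with a rolling baseline and flag it when it exceeds that baseline by a chosen factor. This is a simplified stand-in for illustration only. It does not reproduce how Datadog alerts are configured, and the class name, window, and factor are assumptions.

```python
from collections import deque


class UptickDetector:
    """Flag samples that exceed the rolling baseline by a given factor."""

    def __init__(self, window=12, factor=2.0):
        self.history = deque(maxlen=window)  # most recent samples only
        self.factor = factor

    def observe(self, value):
        """Record a sample; return True if it is an uptick versus the prior baseline."""
        baseline = sum(self.history) / len(self.history) if self.history else None
        self.history.append(value)
        return baseline is not None and baseline > 0 and value > self.factor * baseline
```

With `factor=2.0`, a jump to four times normal traffic is flagged on the first sample that shows it. Because spike samples then enter the baseline, a sustained surge stops being flagged once the window fills with elevated values.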

“Finding the root cause of a problem used to take hours and in some cases days. But with Datadog, more often than not, we can pinpoint cause or causes in minutes.”

Eric Lubow
CTO, SimpleReach
