AccuWeather reduces data incident response time by 80% with Datadog Data Observability | Datadog

Case Study

About AccuWeather

AccuWeather provides localized weather forecasts with proven Superior Accuracy™, severe weather warnings, and MinuteCast® precipitation projections to over 1.5 billion people daily via apps, web, radio, and television.

Meteorology, Data analytics
500+ Employees
Pennsylvania
Databricks
“Our normal incident response time was around an hour to 90 minutes. Now it's reduced to just a couple of minutes.”
Travis Teague Data Operations Manager AccuWeather

Why Datadog?

  • Unified visibility across serverless and classic Databricks compute
  • Correlated and aggregated monitors that reduce alert noise
  • Visibility into job performance and Spark execution metrics
  • Automated alert routing to incident management systems

Challenge

AccuWeather’s existing monitoring system generated numerous unactionable alerts and couldn’t differentiate between transient issues and critical problems, leading to alert fatigue and incident response times averaging 90 minutes.

Key results

↓50%

Reduction in unactionable alerts

80% faster incident response

From 90 minutes to just a couple of minutes

Automated routing

Alerts create tickets and notify appropriate teams

Noisy alerts slow data incident response time

AccuWeather is the most accurate and most used source of weather forecasts and warnings in the world. Billions of people rely on AccuWeather’s forecasts with proven Superior Accuracy™ across its consumer platforms, and AccuWeather For Business serves more than half of Fortune 500 companies and thousands of other companies around the world.

AccuWeather processes massive amounts of weather data in near real time. Data is central to the business: the company uses 30 years of historical weather data for customer analysis, real-time weather observations for live streaming apps, and forecasts created by blending multiple weather models with expert meteorologist input. “There are hundreds of parameters that go into forecasting the weather, and when it comes to forecast models, you have hundreds of those that are all weighted differently,” says Travis Teague, Data Operations Manager at AccuWeather.

As AccuWeather grew its data operations, alert fatigue became a challenge. Its existing monitoring system could not tell the difference between temporary issues and problems that needed immediate attention. Jobs that run frequently might occasionally fail due to temporary cloud issues, triggering alerts even when the failures fall within acceptable limits.

Without the ability to apply weather-specific rules to monitoring, the team received many alerts that didn’t need action while potentially missing critical issues. Response times averaged around 90 minutes, often because the team would wait to see if the issues would resolve on their own.

AccuWeather needed a monitoring solution that could apply weather-specific rules to its data pipelines, reduce noisy alerts, and help the team find and fix real issues faster.

Applying dataset-specific business logic to reduce alert noise

AccuWeather uses Databricks to process its massive weather datasets, running Lakeflow Jobs, Databricks’ built-in orchestration and distributed data processing product, to manage complex dependencies across approximately 4,500 weekly jobs. To monitor these critical pipelines, the company deployed Datadog’s Data Observability: Jobs Monitoring for unified visibility into its Databricks jobs and workflows.

The team built an observability strategy using correlated and aggregated monitors that apply weather-specific business logic. This approach lets them set up alerts that match how weather data actually works, reducing noise while enabling them to catch critical issues right away.

For jobs that run continuously to collect weather observations, the team configured monitors to alert only when a job fails several times in a row, indicating an actual problem rather than a temporary cloud outage. “You don’t necessarily need to be alerted every time that fails,” explains Teague. “But if it fails a certain number of times, that’s when we know we actually need to do something.”
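The consecutive-failure rule described above can be illustrated with a minimal Python sketch. This is not Datadog's monitor configuration or AccuWeather's actual setup; the function names and the threshold of three failures are assumptions chosen only to show the idea of ignoring isolated failures.

```python
from typing import Sequence


def trailing_failures(run_results: Sequence[bool]) -> int:
    """Count how many of the most recent runs failed back to back.

    run_results is ordered oldest to newest; True means the run succeeded.
    """
    count = 0
    for succeeded in reversed(run_results):
        if succeeded:
            break
        count += 1
    return count


def should_alert(run_results: Sequence[bool], threshold: int = 3) -> bool:
    """Alert only when the latest `threshold` runs all failed.

    The default threshold of 3 is a hypothetical value for illustration.
    """
    return trailing_failures(run_results) >= threshold
```

With this rule, a history such as success, failure, success, failure stays quiet, while three failures in a row at the end of the history triggers an alert.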

Similarly, AccuWeather processes multiple weather forecast models, with certain hours being more critical than others. The team set up monitoring to alert only when multiple jobs all fail during the same critical run. “If every model fails for that particular model hour, we need to know about it,” says Teague. “But if only one job fails, it’s okay because we’re covered by successful runs of our other models.”
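A rough sketch of this aggregated rule, again in plain Python rather than any Datadog API: runs are grouped by model hour, and an hour is flagged only when it is marked critical and every model's job failed. The tuple layout, the model names, and the set of critical hours are all hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, Set, Tuple

# (model name, model hour, succeeded) -- a hypothetical record layout.
Run = Tuple[str, int, bool]


def failed_model_hours(runs: Iterable[Run], critical_hours: Set[int]) -> List[int]:
    """Return the critical model hours for which every job failed."""
    results_by_hour = defaultdict(list)
    for _model, hour, succeeded in runs:
        results_by_hour[hour].append(succeeded)
    return sorted(
        hour
        for hour, results in results_by_hour.items()
        if hour in critical_hours and not any(results)
    )


runs = [
    ("model_a", 0, False), ("model_b", 0, False),   # every model failed
    ("model_a", 12, False), ("model_b", 12, True),  # covered by model_b
    ("model_a", 6, False), ("model_b", 6, False),   # all failed, not critical
]
print(failed_model_hours(runs, critical_hours={0, 12}))
```

Only hour 0 is reported here: hour 12 has one successful model, and hour 6 is outside the assumed set of critical hours.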

AccuWeather integrated alerts directly into its incident response platform, which automatically creates tickets and routes them to the appropriate teams when issues arise. The team includes links to relevant dashboards in alerts, providing immediate context for faster troubleshooting. In addition to incident routing, they use Datadog’s insights into compute use, cluster performance, and Spark execution metrics to optimize their Databricks environment and control costs.
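As a loose illustration of routing an alert to a team with a dashboard link attached, the sketch below builds a ticket from a lookup table. The source does not name the incident response platform or describe its API, so the `Ticket` type, the routing table, the team names, and the example URL are all assumptions.

```python
from dataclasses import dataclass

# Hypothetical mapping from pipeline name to owning team.
ROUTES = {
    "observations": "data-ops",
    "forecast-models": "forecast-eng",
}
DEFAULT_TEAM = "data-ops"  # assumed fallback when no route matches


@dataclass
class Ticket:
    team: str
    title: str
    body: str


def build_ticket(pipeline: str, message: str, dashboard_url: str) -> Ticket:
    """Pick the owning team and embed the dashboard link for context."""
    team = ROUTES.get(pipeline, DEFAULT_TEAM)
    return Ticket(
        team=team,
        title=f"[{pipeline}] {message}",
        body=f"{message}\nDashboard: {dashboard_url}",
    )


ticket = build_ticket(
    "forecast-models",
    "All models failed for model hour 12",
    "https://example.com/dashboards/forecast-models",  # placeholder URL
)
print(ticket.team)
```

Keeping the route lookup separate from the ticket body means a new pipeline only needs a new table entry, and unknown pipelines still reach a default team instead of being dropped.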

Intelligent alerting reduces false alarms and accelerates response

AccuWeather has improved its monitoring efficiency and incident response. By using correlated and aggregated alerting monitors, the team cut unactionable alerts by over 50%, reducing alert fatigue and enabling engineers to focus on real issues. “Our normal incident response time was around an hour to 90 minutes,” says Teague. “Now it’s reduced to just a couple of minutes.”

This improvement stems from the team’s ability to apply rules that distinguish between temporary issues and actual problems. Before, the team would often wait to see if issues resolved on their own, leading to slow responses. With recommended and out-of-the-box monitors, the team can quickly configure alerts and set custom thresholds tailored to their business needs. “We can apply our business and domain logic, so if we get an alert, we know there’s actually something wrong versus it just being a job that runs every five minutes that we don’t care if it fails a couple of times,” explains Teague. “Instead of having to wait and letting it fail several times, we know when there is actually a problem.”

When data pipeline issues do arise, Datadog Data Observability detects critical problems and helps the team pinpoint the root cause. Alerts are automatically routed to the right teams with links to relevant dashboards, enabling faster investigation and resolution. Datadog also helps AccuWeather rightsize its Databricks environment by surfacing idle compute, cluster utilization, and Spark execution metrics, so the team can use resources more efficiently.

With reliable monitoring in place, AccuWeather can confidently deliver the best real-time weather forecasts and warnings with proven Superior Accuracy™ that people and companies trust and depend on every day to save lives, better protect property, and make the best weather-impacted decisions.
