Nexthink scales digital employee experience monitoring and improves incident management with Datadog

Observability gaps threaten service quality during cloud transition

Nexthink is the leader in digital employee experience management, helping IT teams optimize workplace productivity at scale. Through real-time data collection, analysis, and automated remediation, Nexthink helps organizations understand, diagnose, and resolve issues impacting employee productivity and satisfaction. Operating globally, the company runs over 360 microservices while monitoring more than 20 million endpoints (and growing).

Nexthink’s recent transition to the cloud introduced observability challenges that threatened its ability to maintain service quality at scale. Their previous observability solution created a fragmented monitoring environment with scattered tools and information silos. The solution was also expensive and slow.

These limitations impacted Nexthink’s incident management processes. Without centralized visibility, incidents became chaotic events. Teams relied on disjointed tools and static documentation, making it hard to track ownership and share knowledge. Developers struggled with alert fatigue and miscommunication, which slowed incident resolution. For a company responsible for monitoring millions of endpoints globally, observability gaps posed a serious risk to both platform reliability and customer satisfaction.

Unified observability platform transforms monitoring and incident management

After evaluating several options, Nexthink chose Datadog as its observability platform. During their proof of concept (POC), Nexthink’s IT teams were immediately impressed by Datadog’s dynamic capabilities compared with their previous solution. Datadog released multiple features during the POC period, demonstrating a roadmap and delivery pace that exceeded expectations. “The speed of innovation is huge,” says Mirko Piaggesi, Director of Cloud Observability and Cloud Governance at Nexthink.

One of the most significant improvements was gaining better control over data flow, particularly with logs. Nexthink can now archive logs directly to Amazon S3 without ingesting or indexing them. This means they’re not paying to ingest and index every log they produce. “That made a huge difference,” says Pascal Gandilhon, head of observability at Nexthink. “The result is more efficient for everyone, and we have what we want at a lower cost.”

The team also implemented Metrics without Limits™ to cost-effectively monitor all their critical Custom Metrics at scale. They’re currently tracking 9.9k metrics across their environment. “Our observability relies heavily on Custom Metrics, especially for Dashboards and Monitors,” says Piaggesi.

To optimize costs, they use 141 integrations to get out-of-the-box metrics at no additional cost and supplement those with Custom Metrics only when necessary. This approach has transformed their workflow. They can now adjust tags and identify which metrics are being used and which aren’t, giving them visibility into usage patterns they previously lacked. They can safely shut down unused metrics because they have concrete data on what’s actually queried. They can even distinguish metrics that appear in dashboards but are never viewed from those that are actively used. This smooth workflow eliminates the need for back-and-forth between teams.
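The distinction between actively used metrics, metrics that sit on dashboards nobody opens, and metrics nothing references can be sketched in a few lines of Python. This is only an illustration of the reasoning, not Nexthink’s tooling or a Datadog API: the `MetricUsage` record, its fields, the 30-day window, and the metric names are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricUsage:
    """Hypothetical usage record for one custom metric."""
    name: str
    dashboard_refs: int       # dashboards whose widgets query the metric
    dashboard_views_30d: int  # views of those dashboards in the last 30 days
    monitor_refs: int         # monitors that query the metric


def classify(usage: MetricUsage) -> str:
    """Label a metric by how it is actually used."""
    if usage.monitor_refs > 0 or usage.dashboard_views_30d > 0:
        return "active"
    if usage.dashboard_refs > 0:
        # Appears on a dashboard, but nobody has opened that dashboard.
        return "referenced_but_unviewed"
    return "unused"


def removal_candidates(usages: list[MetricUsage]) -> list[str]:
    """Names of metrics that are not actively used, sorted for stable output."""
    return sorted(u.name for u in usages if classify(u) != "active")


if __name__ == "__main__":
    sample = [
        MetricUsage("app.requests.latency", 3, 120, 2),  # hypothetical names
        MetricUsage("app.cache.evictions", 1, 0, 0),
        MetricUsage("app.legacy.counter", 0, 0, 0),
    ]
    for usage in sample:
        print(usage.name, "->", classify(usage))
    print("candidates:", removal_candidates(sample))
```

Treating any monitor reference as active is a deliberately cautious choice in this sketch, since a monitor can depend on a metric that no one ever looks at on a dashboard.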

Nexthink also recently deployed Datadog’s Internal Developer Portal (IDP) to replace its outdated static documentation. The IDP acts as a central, dynamic view of its system architecture. It provides immediate access to application dependencies, speeding up debugging and analysis. By consolidating metrics and logs in one place, the IDP reduces confusion and accelerates collaboration. It also helps track service ownership and component details, making incident handling more efficient. “Everything is integrated, and it saves us a lot of time and energy,” says Abdelrhman Hamouda, SRE lead at Nexthink.

The team also implemented Datadog’s On-Call feature to manage paging according to their existing team structure. Everything in production now has an entry in Software Catalog with a designated owner. This allows SRE operations teams to quickly identify the right owner when something breaks and page the appropriate team. On-call management is now seamlessly integrated into their workflow, fostering greater ownership and reducing random alerts. This approach has improved both adoption and trust in their incident management processes. “It’s a very intuitive way of doing things. It makes things make sense,” adds Gandilhon.
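The ownership lookup described here, from a failing service to the team that should be paged, can be illustrated with a small Python sketch. It does not call Datadog; the catalog is a plain dictionary, and the service names, team names, and the `sre-operations` fallback are hypothetical values chosen for the example.

```python
from typing import Mapping, Optional

# Hypothetical catalog: service name -> metadata with a designated owner.
CATALOG: dict[str, dict[str, Optional[str]]] = {
    "collector-gateway": {"owner": "ingestion-team"},
    "remediation-engine": {"owner": "automation-team"},
    "legacy-exporter": {"owner": None},  # entry exists, owner missing
}


def team_to_page(
    service: str,
    catalog: Mapping[str, Mapping[str, Optional[str]]],
    fallback: str = "sre-operations",
) -> str:
    """Return the team to page for a service.

    Falls back to a default team when the service has no catalog entry
    or the entry has no owner, so an alert never goes unrouted.
    """
    entry = catalog.get(service)
    if entry is None:
        return fallback
    owner = entry.get("owner")
    return owner if owner else fallback


def services_without_owner(
    catalog: Mapping[str, Mapping[str, Optional[str]]],
) -> list[str]:
    """List catalog entries that still lack a designated owner."""
    return sorted(name for name, meta in catalog.items() if not meta.get("owner"))


if __name__ == "__main__":
    for svc in ("collector-gateway", "legacy-exporter", "unknown-service"):
        print(svc, "->", team_to_page(svc, CATALOG))
    print("missing owners:", services_without_owner(CATALOG))
```

A check like `services_without_owner` is one way to keep the rule that everything in production has a designated owner from drifting over time.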

Enabling growth while reducing operational complexity

Today, Nexthink has transformed its incident management capabilities and resolved critical scalability issues that previously caused system crashes. “We are scaling up with multiple instances, which leads to happier customers,” says Hamouda.

The company now operates a unified platform that serves operations, developer, and support teams. This consolidated approach has reduced cognitive load and improved DevOps alignment across the organization. When incidents occur, observability is no longer a pain point. Teams don’t have to wait for dashboards to load as they did with their previous solution, because their alerts come with rich context directly from the source.

One of the most significant improvements is Nexthink’s ability to measure production readiness and communicate it effectively to teams and leadership. They now use Service Level Objectives, measuring them automatically across entities within the platform. This capability helps them balance innovation with reliability while maintaining better control over the data they ingest.
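As a rough illustration of how a Service Level Objective turns into a number that can be reported to teams and leadership, the sketch below computes the remaining error budget for a request-based SLO. The function name, the 99.9% target, and the request counts are assumptions for the example; the case study does not say how Nexthink defines its SLOs.

```python
def error_budget_remaining(target: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent for a request-based SLO.

    target: objective as a fraction, e.g. 0.999 for 99.9%.
    good:   number of events that met the objective.
    total:  number of events observed in the window.

    Returns 1.0 when nothing has been spent, 0.0 when the budget is exactly
    used up, and a negative value when the SLO has been breached.
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if good < 0 or total < 0 or good > total:
        raise ValueError("require 0 <= good <= total")
    if total == 0:
        return 1.0  # no traffic, so no budget has been consumed

    allowed_bad = (1.0 - target) * total  # failures the SLO tolerates
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 500 of them failed.
    remaining = error_budget_remaining(0.999, good=999_500, total=1_000_000)
    print(f"error budget remaining: {remaining:.1%}")
```

A team could use a value like this as one production-readiness signal, for example by slowing feature rollouts when the remaining budget approaches zero, which is the balance between innovation and reliability described above.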

The platform’s reliability has been crucial for Nexthink’s lean operations model. “We are very few people, but we support a large number of developers and teams, and we process a huge amount of data,” explains Hamouda. “If we didn’t have a reliable tool, it would be a nightmare.”

“Having all the incident management at the same place with observability and On-Call is golden for us.”

Meanwhile, Datadog’s rapid innovation cycle provides additional value. “Datadog is releasing many nice features that hit the cost point,” adds Hamouda. “That helps management see the value of the investment and continue investing in Datadog.”

Today, Nexthink’s SRE and Observability teams are focusing on creating a smooth experience for all users within the organization. The Observability team provides and configures Datadog in a standardized way across teams while managing platform maintenance and costs. Engineering teams use the platform to monitor their services and production environments.

“We are using Datadog to raise the maturity level of all the teams progressively.”

Looking ahead, Nexthink plans to automate IDP file maintenance within its CI/CD pipelines and is also transitioning its security tooling to Datadog. Ultimately, the company has successfully enabled its engineering teams with the right data and tools to monitor services effectively. “For a similar budget, we are now processing significantly more data, in a more reliable platform, and with many more features compared to our previous solution,” concludes Gandilhon.