Monitoring Financial Data Mesh on AWS using Datadog
March 11, 2025
Introduction
As the financial services industry evolves, data-driven decisions are essential to stay competitive. To achieve this, organizations build data meshes to gain flexibility and autonomy for data domains. By integrating data from sources like historical market data, transaction data, investment insights, and third-party datasets, financial organizations can build innovative solutions for their customers. But how do you integrate robust monitoring to ensure data reliability, security, and compliance across distributed environments?
This post is an architectural overview of how Datadog’s observability platform can be integrated with a data mesh on AWS to gain visibility into distributed environments.
Explanation of the architecture
- 1. Producer/Reference Data accounts are AWS accounts that collect, transform, and store financial data from various sources, such as transactional databases, trading systems, payment processors, or risk models. Services such as AWS Glue, Amazon EMR, and Amazon S3 are used in this process, and they can be monitored using Datadog to ensure reliability, security, cost control, and performance optimization.
- a. AWS Glue: Monitor ETL job successes and failures, track job execution times, track CPU and memory metrics, and optimize Glue worker utilization.
- b. Amazon EMR: Monitor EMR cluster performance and resource utilization by tracking CPU, memory, and disk usage. Monitor execution times for Spark, Hive, and Presto jobs, and detect underutilized clusters to optimize autoscaling. EMR pipelines are often numerous and complex, with interdependent steps. Monitoring each step helps identify and resolve issues quickly, which keeps processing on schedule and prevents incorrect analytics and business decisions.
- c. Amazon S3: Monitor object versioning, data completeness, and failed data uploads to upstream data pipelines.
- 2. Data Catalog accounts are AWS accounts that store dataset metadata in AWS Glue Data Catalog. They provide a unified view of all datasets available across the mesh, reducing data silos. These accounts are also responsible for sharing resources through the AWS Glue Data Catalog, so that other accounts, such as Producer and Consumer accounts, can discover datasets without duplicating them. Telemetry such as events and logs is sent to Datadog either through the AWS integration[5] or from CloudWatch Log Groups through the Datadog Forwarder or Kinesis Data Streams[4], where it can be monitored for unexpected behavior. Monitors and alerts can be configured to detect issues such as unusual query patterns, data latency, or deviations in access patterns that indicate security concerns. Additionally, monitoring the centralized AWS Glue Data Catalog for potential downtime or data quality issues can help domain administrators resolve issues before they reach Consumer workloads.
- 3. Consumer accounts are AWS accounts that run analytics and data science workloads. These workloads access data from the Data Catalog and Producer accounts through federated access. AWS service and workload telemetry is sent to CloudWatch and CloudWatch Log Groups, where it can be polled by Datadog’s AWS integration[5] or forwarded to Datadog using the Datadog Forwarder or Kinesis Data Streams[4].
- 4. Logs from CloudWatch Log Groups are then forwarded to Datadog Log Management using the Datadog Forwarder or Kinesis Data Streams. If your application generates a large volume of logs and you prefer a managed service with minimal maintenance, Amazon Data Firehose is a suitable choice. If low-latency log delivery and the ability to customize log processing are priorities, the Datadog Forwarder, which runs as an AWS Lambda function, is a more appropriate choice.
- 5. Datadog’s AWS integration connects Datadog to your AWS accounts using the AssumeRole functionality of AWS Security Token Service. This gives Datadog read-only access to pull monitoring data from AWS, providing a unified view of your applications and infrastructure.
- 6. When Datadog is integrated with AWS through its native integration solution[5], it enables comprehensive monitoring of your AWS services. In this architecture, the Datadog integration collects metrics, logs, and events from Amazon Athena, Amazon Redshift, AWS Glue, Amazon S3, and AWS IAM across many AWS accounts to provide domain-specific visibility and access monitoring. Your applications can also be instrumented with Datadog APM to monitor cross-domain data flows in real time. For analytics workloads, Datadog can provide metrics on resource utilization, query performance, and workload efficiency, helping ensure that Consumers such as data scientists and analysts can access data without undue latency. Additionally, using IAM events and IAM Access Analyzer, Datadog Cloud Security Management can help ensure secure access to data and compliance with security standards for each Consumer.
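To make the Glue signals in item 1a more concrete, the following is a minimal sketch of how job success/failure rates and execution times could be derived from job run records. It assumes records shaped like the `JobRuns` entries returned by the AWS Glue `GetJobRuns` API (`JobName`, `JobRunState`, `ExecutionTime` in seconds). The function name `summarize_job_runs`, the choice of which states count as failures, and the sample job names are illustrative assumptions, not part of the architecture described above.

```python
from collections import defaultdict
from typing import Iterable

# Assumption: these Glue run states are treated as failures for alerting.
FAILED_STATES = {"FAILED", "TIMEOUT", "ERROR"}


def summarize_job_runs(runs: Iterable[dict]) -> dict:
    """Aggregate run count, failure rate, and mean execution time per job."""
    totals = defaultdict(lambda: {"runs": 0, "failed": 0, "seconds": 0})
    for run in runs:
        entry = totals[run["JobName"]]
        entry["runs"] += 1
        # ExecutionTime may be missing for runs that never started.
        entry["seconds"] += run.get("ExecutionTime", 0)
        if run["JobRunState"] in FAILED_STATES:
            entry["failed"] += 1
    return {
        name: {
            "runs": t["runs"],
            "failure_rate": t["failed"] / t["runs"],
            "avg_seconds": t["seconds"] / t["runs"],
        }
        for name, t in totals.items()
    }


if __name__ == "__main__":
    # Hypothetical sample data standing in for a GetJobRuns response.
    sample_runs = [
        {"JobName": "trades_etl", "JobRunState": "SUCCEEDED", "ExecutionTime": 120},
        {"JobName": "trades_etl", "JobRunState": "FAILED", "ExecutionTime": 60},
        {"JobName": "positions_etl", "JobRunState": "SUCCEEDED", "ExecutionTime": 300},
    ]
    for job, stats in summarize_job_runs(sample_runs).items():
        print(job, stats)
```

A per-job failure rate and average duration like these are the kind of values a monitor threshold could be set against, whichever mechanism is used to collect them.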
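The AssumeRole-based access in item 5 can be illustrated with a short sketch of an IAM trust policy that lets an external account assume a read-only role. The function name `build_trust_policy`, the placeholder account ID `111122223333`, and the external ID value are assumptions for illustration only. The real values come from the Datadog AWS integration setup, and the `sts:ExternalId` condition reflects the common AWS pattern for third-party access rather than anything stated in this post.

```python
import json


def build_trust_policy(trusted_account_id: str, external_id: str) -> dict:
    """Build a trust policy allowing one external account to assume a role.

    Both arguments are placeholders supplied by the caller; nothing here
    is specific to a real Datadog or AWS account.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The external account that is allowed to call sts:AssumeRole.
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
                # Require a shared external ID to guard against confused-deputy misuse.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical values; substitute the ones shown during integration setup.
    policy = build_trust_policy("111122223333", "example-external-id")
    print(json.dumps(policy, indent=2))
```

The trust policy only controls who may assume the role. The read-only scope mentioned in item 5 would be defined separately by the permissions policy attached to that role in each Producer, Data Catalog, and Consumer account.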
Authors
Lowell Abraham, Sr. Product Solutions Architect
References
Inspiration and reference documents or existing solutions: